【Python-MP4文体提取】

Python-MP4文体提取

■ pip 和 setuptools工具
■ OpenCV和Tesseract
■ Tesseract OCR V5.0安装教程（Windows）
- ■ 1. 运行程序出现如下问题：我们需要安装Tesseract OCR
- ■ 2. 下载Tesseract-OCR
- ■ 3. 安装Tesseract-OCR
- ■ 4. 添加到环境变量的系统变量（PATH）去
- ■ 5. 增加一个TESSDATA_PREFIX变量名，
- ■ 6. 打开终端，输入：tesseract -v，可以看到版本信息
- ■ 7. 在pytesseract库下的pytesseract.py文件中找到tesseract_cmd = 'tesseract'，修改成 tesseract_cmd =r'C:\Program Files\Tesseract-OCR\tesseract.exe'
- ■ 8. 再去运行程序
■ 运行代码
■ 运行代码2-openAI生成的

■ pip 和 setuptools工具

先对 pip 和 setuptools工具更新最新版本

python.exe -m pip install --upgrade pip setuptools   
或
pip install --upgrade pip setuptools

■ OpenCV和Tesseract

使用以下命令安装OpenCV和Tesseract:

pip install opencv-python
pip install tesseract

■ Tesseract OCR V5.0安装教程（Windows）

■ 1. 运行程序出现如下问题：我们需要安装Tesseract OCR

在这里插入图片描述

■ 2. 下载Tesseract-OCR

官方网站：https://github.com/tesseract-ocr/tesseract
官方文档：https://github.com/tesseract-ocr/tessdoc
语言包地址：https://github.com/tesseract-ocr/tessdata
下载地址：https://digi.bib.uni-mannheim.de/tesseract/

下载地址
在这里插入图片描述

■ 3. 安装Tesseract-OCR

在这里插入图片描述

■ 4. 添加到环境变量的系统变量（PATH）去

在这里插入图片描述

■ 5. 增加一个TESSDATA_PREFIX变量名，

增加一个TESSDATA_PREFIX变量名，变量值还是我的安装路径C:\Program Files\Tesseract-OCR\tessdata这是将语言字库文件夹添加到变量中;
在这里插入图片描述

■ 6. 打开终端，输入：tesseract -v，可以看到版本信息

在这里插入图片描述

■ 7. 在pytesseract库下的pytesseract.py文件中找到tesseract_cmd = ‘tesseract’，修改成 tesseract_cmd =r’C:\Program Files\Tesseract-OCR\tesseract.exe’

在这里插入图片描述
我的路径
D:\software\Python\Python312\Lib\site-packages\pytesseract

■ 8. 再去运行程序

在这里插入图片描述
结果提取的数据不符合预期要求

■ 运行代码

import cv2
import pytesseract
import re# 读取视频文件
def extract_frames(video_path):cap = cv2.VideoCapture(video_path)frames = []while cap.isOpened():ret, frame = cap.read()if not ret:breakframes.append(frame)cap.release()return frames# 文字识别
def recognize_text(image):text = pytesseract.image_to_string(image)return text# 降噪声或错误的字符
def process_text(text):processed_text = re.sub(r'\s+|[^\w\s]', '', text)return processed_text#video_path = 'E:\PythonProject\tcipy\1111111111.mp4'
video_path = '1111111111.mp4'frames = extract_frames(video_path)for frame in frames:text = recognize_text(frame)processed_text = process_text(text)print(processed_text)print('OK')

■ 运行代码2-openAI生成的

import cv2
import pytesseract
from pytesseract import Output
import pandas as pd# 打开MP4文件
cap = cv2.VideoCapture('video.mp4')# 检查文件是否成功打开
if not cap.isOpened():print("Error: Could not open video file.")exit()# 使用Tesseract进行OCR
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'# 初始化表格
data = {'Frame Number': [], 'Text': []}# 读取视频并解析文字内容
frame_number = 0
while True:# 从文件中读取一帧ret, frame = cap.read()# 检查是否成功读取帧if not ret:print("Error: Could not read frame.")break# 进行文字识别d = pytesseract.image_to_data(frame, output_type=Output.DICT)# 提取识别到的文字for i in range(len(d['text'])):text = d['text'][i].strip()if text:data['Frame Number'].append(frame_number)data['Text'].append(text)frame_number += 1# 释放资源
cap.release()# 创建DataFrame
df = pd.DataFrame(data)# 输出DataFrame到CSV文件
df.to_csv('text_from_video.csv', index=False)