Update
Project Name: pdf2tx (P6)
Date: 5 Oct 2024
Function: translate PDF files in the browser
Code: https://blog.csdn.net/davenian/article/details/142723144
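Upgrade
Project Name: pdf2tx-mm (P8)
Date: 7 Oct 2024
Added multithreading and per-page OCR to improve performance and speed
Uses Google Translator and the Azure API as translation engines
Uses NLTK sentence tokenization for several Western languages to improve translation quality
Displays: processing time, translation engine, OCR language
Docker Folder: /app/pdf2tx-mm
Code: https://blog.csdn.net/davenian/article/details/142750333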
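A minimal sketch of per-page parallel OCR, assuming pages can be recognized independently of each other. The names `ocr_pages_parallel` and `ocr_fn` are hypothetical; in the app, `ocr_fn` would wrap a call such as `pytesseract.image_to_string`. `executor.map` returns results in input order, so the page order is preserved.

```python
from concurrent.futures import ThreadPoolExecutor


def ocr_pages_parallel(pages, ocr_fn, max_workers=4):
    """Run ocr_fn on every page in parallel and join the text in page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        texts = list(executor.map(ocr_fn, pages))
    return '\n'.join(texts)


if __name__ == '__main__':
    # Stand-in OCR function so the sketch runs without Tesseract installed
    fake_pages = ['page-1', 'page-2', 'page-3']
    print(ocr_pages_parallel(fake_pages, lambda page: page.upper()))
```

Threads are a reasonable fit here because Tesseract runs as an external process, so the Python side mostly waits.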
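Upgrade (P8.1)
Date: 8 Oct 2024
Uses jieba for sentence segmentation of Chinese text
Uses the janome library to segment Japanese text, which improves translation accuracy
The program checks whether the PDF contains extractable text or only images, and calls OCR only for images; langdetect identifies the language of the extracted text
Translation runs in parallel (ThreadPoolExecutor)
Traditional Chinese added as a target language
The translated text can be downloaded
Dropped ZhipuAI as a translation engine because it kept triggering sensitive-word detection during testing
The progress calculation now takes the page count into account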
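The page-aware progress calculation can be sketched as follows, assuming OCR fills the first half of the bar and translation the second half, as in the code listed further down. `overall_progress` is a hypothetical helper, not a function from the project.

```python
def overall_progress(pages_done, total_pages, chunks_done, total_chunks):
    """Return a 0-100 percentage: OCR fills 0-50, translation fills 50-100."""
    # A text PDF needs no OCR, so its OCR half counts as already complete
    ocr_part = 50 * pages_done // total_pages if total_pages else 50
    tx_part = 50 * chunks_done // total_chunks if total_chunks else 0
    return ocr_part + tx_part


if __name__ == '__main__':
    print(overall_progress(1, 4, 0, 10))   # one of four pages recognized
    print(overall_progress(4, 4, 5, 10))   # OCR done, half the chunks translated
```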
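Known issues:
When tested with a trilingual PDF (Chinese, Japanese, English), the Google translation requests made by the code succeed on the first run, but a second run (even with a different output language) may raise the following error 1 to 5 times: "Request exception can happen due to an api connection error. Please check your connection and try again". I tried lowering max_length from 5000 in steps of 100, but even 4500 was not reliably stable, so the code now picks a random length: "# Maximum chunk length depends on the engine if engine == 'google': max_length = random.randint(4200, 4700) else: max_length = 5000". With this change, translation succeeded in at least 5 test runs. I am now looking at adding MyMemoryTranslator to the code, as mentioned in RequestError · Issue #239 · nidhaloff/deep-translator · GitHub. (Added 9 Oct 2024, 7:19 pm.)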
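One possible shape for a fallback engine is sketched below. It is an assumption about how a second translator could be wired in, not the project's code: `translate_with_fallback` is a hypothetical helper, and the translators are passed as plain callables so that, for example, the `translate` method of a `GoogleTranslator` and of a second engine could be tried in turn.

```python
import logging
import time


def translate_with_fallback(chunk, translators, retries=2, delay=0.0):
    """Try each translator in order; return the first successful translation.

    translators is a list of (name, callable) pairs. If every attempt fails,
    the original chunk is returned unchanged.
    """
    for name, translate in translators:
        for attempt in range(1, retries + 1):
            try:
                return translate(chunk)
            except Exception as exc:
                logging.warning("%s failed (attempt %d/%d): %s",
                                name, attempt, retries, exc)
                time.sleep(delay * attempt)  # simple linear backoff
    return chunk


if __name__ == '__main__':
    def always_fails(text):
        raise RuntimeError("api connection error")

    # str.upper stands in for a working second engine
    print(translate_with_fallback("hello", [("primary", always_fails),
                                            ("backup", str.upper)]))
```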
Code
1. app.py
import os
import uuid
import logging
import configparser
from flask import Flask, render_template, request, redirect, url_for, Response
from threading import Thread, Lock
from werkzeug.utils import secure_filename
from pdf2image import convert_from_path
import pytesseract
from deep_translator import GoogleTranslator, MicrosoftTranslator
from concurrent.futures import ThreadPoolExecutor
from collections import defaultdict
import time  # used to measure the processing time
from datetime import timedelta  # formats the processing time as H:MM:SS on the result page
import nltk
#try:
#    nltk.data.find('tokenizers/punkt','tokenizers/punkt_tank')
#except LookupError:
#    nltk.download('punkt','punkt_tank', quiet=True)  # nltk.download('punkt', quiet=True)  # Already installed with: python -m nltk.downloader all
# It still failed at runtime, though! The unstructured library had to be installed as well, which the introduction never mentions.
from functools import lru_cache
from pdfminer.high_level import extract_text as pdf_extract_text
from pdfminer.pdfparser import PDFSyntaxError
from langdetect import detect
import jieba
from janome.tokenizer import Tokenizer
import random

# Supported languages (detected language code -> NLTK language name)
language_mapping = {
    'en': 'english',
    'fr': 'french',
    'de': 'german',
    'es': 'spanish',
    'it': 'italian',
    'ja': 'japanese',
    'ko': 'korean',
    'ru': 'russian',
    'zh-cn': 'chinese',
    'zh-tw': 'chinese',
    'zh': 'chinese',
    'pt': 'portuguese',
    'ar': 'arabic',
    'hi': 'hindi',
    # Add other languages here
}

# OCR language code mapping
ocr_language_mapping = {
    'en': 'eng',
    'fr': 'fra',
    'de': 'deu',
    'es': 'spa',
    'it': 'ita',
    'ja': 'jpn',
    'ko': 'kor',
    'ru': 'rus',
    'zh-cn': 'chi_sim',
    'zh-tw': 'chi_tra',
    # Add more languages if needed
}

# Microsoft Translator language code mapping
microsoft_language_mapping = {
    'en': 'en',
    'fr': 'fr',
    'de': 'de',
    'es': 'es',
    'it': 'it',
    'ja': 'ja',
    'ko': 'ko',
    'ru': 'ru',
    'zh-cn': 'zh-hans',
    'zh-tw': 'zh-hant',
    'pt': 'pt',
    'ar': 'ar',
    'hi': 'hi',
    # Add more languages if needed
}

# Google Translator language code mapping
google_language_mapping = {
    'en': 'en',
    'fr': 'fr',
    'de': 'de',
    'es': 'es',
    'it': 'it',
    'ja': 'ja',
    'ko': 'ko',
    'ru': 'ru',
    'zh-cn': 'zh-CN',  # Simplified Chinese code supported by Google
    'zh-tw': 'zh-TW',  # Traditional Chinese code supported by Google
    'zh': 'zh-CN',     # default to Simplified Chinese
    'pt': 'pt',
    'ar': 'ar',
    'hi': 'hi',
    # Add more languages if needed
}

# Initialize the Flask app
app = Flask(__name__)
app.config['ALLOWED_EXTENSIONS'] = {'pdf'}
app.config['UPLOAD_FOLDER'] = 'uploads'
app.config['MAX_CONTENT_LENGTH'] = 50 * 1024 * 1024  # 50 MB

# Make sure the upload folder exists
os.makedirs(app.config['UPLOAD_FOLDER'], exist_ok=True)

# Global state
progress = defaultdict(int)
results = {}
progress_lock = Lock()

# Logging format
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Read the configuration file
config = configparser.ConfigParser()
config_file = 'config.ini'

if not os.path.exists(config_file):
    raise FileNotFoundError(f"Configuration file {config_file} not found; make sure it exists and contains the required settings.")

config.read(config_file)

try:
    # Microsoft Azure requires a key. Azure issues two so they can be rotated; one is enough.
    AZURE_API_KEY = config.get('translator', 'azure_api_key')
    # The region is required too. Copied from Azure: "This is the location (or region) of your resource.
    # You may need to use this field when making calls to this API."
    AZURE_REGION = config.get('translator', 'azure_region')
    # Other API keys, e.g. Yandex, can be added here
    # YANDEX_API_KEY = config.get('translator', 'yandex_api_key')
except (configparser.NoSectionError, configparser.NoOptionError):
    raise ValueError("A required option is missing from the configuration file.")

# Check whether the file type is allowed
def allowed_file(filename):
    # Compare the extension case-insensitively so that .PDF is accepted too
    return '.' in filename and filename.rsplit('.', 1)[1].lower() in app.config['ALLOWED_EXTENSIONS']


# OCR function for a given language
def ocr_image(image, lang='eng'):
    try:
        text = pytesseract.image_to_string(image, lang=lang)
    except Exception as e:
        logging.error(f"OCR failed: {e}")
        text = ''
    return text


def chinese_sentence_split(text):
    # Tokenize with jieba and close a sentence at each terminator
    # (fullwidth and ASCII forms of the punctuation are both accepted)
    sentences = []
    current_sentence = []
    for word in jieba.cut(text):
        current_sentence.append(word)
        if word in ['。', '!', '?', ';', '!', '?', ';']:
            sentence = ''.join(current_sentence).strip()
            if sentence:
                sentences.append(sentence)
            current_sentence = []
    if current_sentence:
        sentence = ''.join(current_sentence).strip()
        if sentence:
            sentences.append(sentence)
    return sentences


def japanese_sentence_split(text):
    # Tokenize with Janome and split on sentence-ending punctuation
    tokenizer = Tokenizer()
    tokens = tokenizer.tokenize(text, wakati=True)
    sentences = []
    current_sentence = []
    for token in tokens:
        current_sentence.append(token)
        if token in ['。', '!', '?', '!', '?']:
            sentence = ''.join(current_sentence).strip()
            if sentence:
                sentences.append(sentence)
            current_sentence = []
    if current_sentence:
        sentence = ''.join(current_sentence).strip()
        if sentence:
            sentences.append(sentence)
    return sentences


# Translate text: sentence splitting, chunking, parallel requests, progress updates, and retries
def translate_text(text, engine, progress_callback=None, text_lang='en', target_language='en'):
    logging.info(f"Translation engine: {engine}")

    # Sentence splitting
    nltk_lang = language_mapping.get(text_lang, 'english')
    if nltk_lang in ['english', 'french', 'german', 'spanish', 'italian', 'russian']:
        try:
            sentences = nltk.sent_tokenize(text, language=nltk_lang)
        except Exception as e:
            logging.error(f"NLTK sentence splitting failed, falling back to line splitting: {e}")
            sentences = text.split('\n')
    elif nltk_lang == 'chinese':
        sentences = chinese_sentence_split(text)
    elif nltk_lang == 'japanese':
        sentences = japanese_sentence_split(text)
    else:
        sentences = text.split('\n')

    # Maximum chunk length depends on the engine
    if engine == 'google':
        max_length = random.randint(4200, 4700)
    else:
        max_length = 5000

    # Make sure target_language is set
    if not target_language:
        logging.error("Target language not set, using default 'en'")
        target_language = 'en'

    # Initialize the translator
    if engine == 'google':
        target_language = google_language_mapping.get(target_language, 'en')
        translator = GoogleTranslator(source='auto', target=target_language)
        logging.info(f"Translator initialized, Google target language: {target_language}")
    elif engine == 'microsoft':
        # Map the detected source language and the requested target language to Azure codes
        source_language = microsoft_language_mapping.get(text_lang, 'en')
        target_language = microsoft_language_mapping.get(target_language, 'en')
        logging.info(f"Translator initialized, Azure source language: {source_language}, target language: {target_language}")
        translator = MicrosoftTranslator(
            source=source_language,
            target=target_language,
            api_key=AZURE_API_KEY,
            region=AZURE_REGION
        )
    else:
        # Fail early instead of calling translate() on None in every worker
        raise ValueError(f"Unsupported translation engine: {engine}")

    # Pack sentences into chunks no longer than max_length
    chunks = []
    current_chunk = ''
    for sentence in sentences:
        if len(current_chunk) + len(sentence) + 1 <= max_length:
            current_chunk += sentence + ' '
        else:
            if current_chunk.strip():  # skip the empty chunk an oversized first sentence would create
                chunks.append(current_chunk.strip())
            current_chunk = sentence + ' '
    if current_chunk.strip():
        chunks.append(current_chunk.strip())

    translated_chunks = [''] * len(chunks)
    total_chunks = len(chunks)
    completed_chunks = 0
    count_lock = Lock()  # protects completed_chunks across worker threads

    # Translate a single chunk, with retries
    def translate_chunk(index, chunk):
        nonlocal completed_chunks
        max_retries = 3
        for attempt in range(max_retries):
            try:
                translated_chunk = translator.translate(chunk)
                translated_chunks[index] = translated_chunk
                break  # success, stop retrying
            except Exception as e:
                logging.error(f"Translating chunk {index} failed, attempt {attempt + 1}: {e}")
                if attempt == max_retries - 1:
                    translated_chunks[index] = chunk  # last retry failed, keep the original text
        with count_lock:
            completed_chunks += 1
            done = completed_chunks
        if progress_callback:
            progress_callback(int(100 * done / total_chunks))

    # Translate the chunks in parallel with a thread pool
    with ThreadPoolExecutor(max_workers=5) as executor:
        for idx, chunk in enumerate(chunks):
            executor.submit(translate_chunk, idx, chunk)

    # Reassemble the translated text
    translated_text = ' '.join(translated_chunks)
    return translated_text.strip()


# Background processing function
# logging.info reports the translation engine and the processing time.
# start_time is recorded when the task begins.
# end_time is recorded when the task ends, and elapsed_time is computed from the two.
# elapsed_time is stored in the results dict so the result page can display it.
# Added a PDF check: if the file is not image-only, OCR is skipped. 9 Oct 2024, 12:30 am
def process_file(task_id, filepath, engine, ocr_language, target_language):
    try:
        start_time = time.time()  # record the start time
        logging.info(f"Task {task_id}: processing file {filepath}, OCR language {ocr_language}, engine {engine}, target language {target_language}")
        with progress_lock:
            progress[task_id] = 0

        # Try to extract text directly
        extracted_text = ''
        try:
            extracted_text = pdf_extract_text(filepath)
            if extracted_text.strip():
                logging.info(f"Task {task_id}: text extracted, no OCR needed")
                with progress_lock:
                    progress[task_id] = 50  # text extraction done, progress is 50%
                # Detect the language of the extracted text
                try:
                    detected_language = detect(extracted_text)
                    logging.info(f"Detected text language: {detected_language}")
                    if detected_language not in language_mapping:
                        logging.warning(f"Detected language '{detected_language}' is not supported, using default 'en'")
                        detected_language = 'en'
                except Exception as e:
                    logging.error(f"Language detection failed, using default 'en'. Error: {e}")
                    detected_language = 'en'
            else:
                logging.info(f"Task {task_id}: extracted text is empty, falling back to OCR")
                raise ValueError("Empty text extracted")
        except Exception as e:
            # Direct extraction failed, so use OCR
            logging.info(f"Task {task_id}: could not extract text directly, using OCR. Reason: {e}")
            # Convert the PDF to images
            images = convert_from_path(filepath)
            total_pages = len(images)
            total_steps = total_pages
            extracted_text = ''
            for i, image in enumerate(images):
                text = ocr_image(image, lang=ocr_language_mapping.get(ocr_language, 'eng'))
                extracted_text += text + '\n'
                with progress_lock:
                    progress[task_id] = int(100 * (i + 1) / total_steps * 0.5)  # OCR is 50% of the progress
            with progress_lock:
                progress[task_id] = 50  # OCR done, progress is 50%
            # Detect the language of the OCR output
            try:
                detected_language = detect(extracted_text)
                logging.info(f"Detected text language: {detected_language}")
                if detected_language not in language_mapping:
                    logging.warning(f"Detected language '{detected_language}' is not supported, using default 'en'")
                    detected_language = 'en'
            except Exception as e:
                logging.error(f"Language detection failed, using default 'en'. Error: {e}")
                detected_language = 'en'

        # Translation is the second 50% of the progress. Cap at 99 so the
        # browser is not redirected before the result has been stored.
        def progress_callback(p):
            with progress_lock:
                progress[task_id] = min(99, 50 + int(p * 0.5))

        # Pass the detected language on to translate_text
        translated_text = translate_text(extracted_text, engine, progress_callback, detected_language, target_language)

        # Compute the processing time in seconds
        end_time = time.time()
        elapsed_time = end_time - start_time

        # Store the result before reporting 100%
        result = {
            'original': extracted_text,
            'translated': translated_text,
            'elapsed_time': elapsed_time,   # processing time
            'engine': engine,               # translation engine
            'ocr_language': ocr_language,   # OCR language
            'target_language': target_language
        }
        results[task_id] = result
        with progress_lock:
            progress[task_id] = 100
        logging.info(f"Task {task_id}: done in {elapsed_time:.2f} seconds")
    except Exception as e:
        logging.error(f"Processing failed: {e}")
        with progress_lock:
            progress[task_id] = -1
    finally:
        # Always delete the uploaded file, even after an exception
        if os.path.exists(filepath):
            os.remove(filepath)
            logging.info(f"Task {task_id}: file deleted")


# File upload route
@app.route('/', methods=['GET', 'POST'])
def upload_file():
    if request.method == 'POST':
        # Check that the request contains a file
        if 'file' not in request.files:
            return 'No file part in the request', 400
        file = request.files['file']
        if file.filename == '':
            return 'No file selected', 400
        if file and allowed_file(file.filename):
            # Save the file under a safe, unique name
            filename = secure_filename(f"{uuid.uuid4().hex}_{file.filename}")
            filepath = os.path.join(app.config['UPLOAD_FOLDER'], filename)
            file.save(filepath)
            # Read the chosen engine and languages, with defaults
            engine = request.form.get('engine', 'google')
            ocr_language = request.form.get('ocr_language', 'en')
            target_language = request.form.get('target_language', 'zh-cn')
            # Create a unique task ID
            task_id = str(uuid.uuid4())
            progress[task_id] = 0
            # Start the background worker thread
            thread = Thread(target=process_file, args=(task_id, filepath, engine, ocr_language, target_language))
            thread.start()
            # Redirect to the progress page
            return redirect(url_for('processing', task_id=task_id))
        else:
            return 'File type not allowed', 400
    return render_template('upload.html')


# Processing page route
@app.route('/processing/<task_id>')
def processing(task_id):
    return render_template('processing.html', task_id=task_id)


# Progress stream route (server-sent events)
@app.route('/progress/<task_id>')
def progress_status(task_id):
    def generate():
        while True:
            with progress_lock:
                status = progress.get(task_id, 0)
            yield f"data: {status}\n\n"
            if status >= 100 or status == -1:
                break
            time.sleep(0.5)  # avoid a busy loop between updates
    return Response(generate(), mimetype='text/event-stream')


# Result page route
@app.route('/result/<task_id>')
def result(task_id):
    result_data = results.get(task_id)
    if not result_data:
        return 'Result not found', 404
    # Get the processing time
    elapsed_time = result_data.get('elapsed_time', 0)
    # Format the processing time as H:MM:SS
    elapsed_time_str = str(timedelta(seconds=int(elapsed_time)))
    return render_template(
        'result.html',
        original=result_data['original'],
        translated=result_data['translated'],
        elapsed_time=elapsed_time_str,
        engine=result_data['engine'],
        ocr_language=result_data['ocr_language']
    )


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=9006, debug=True)
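The chunking step in translate_text packs sentences greedily up to max_length. A standalone sketch of that idea follows, with one extension that is my assumption rather than part of app.py: a sentence longer than max_length is hard-split so that no chunk can exceed the limit. `pack_sentences` is a hypothetical name.

```python
def pack_sentences(sentences, max_length):
    """Group sentences into chunks of at most max_length characters."""
    chunks = []
    current = ''
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue
        # Hard-split a sentence that could never fit in a single chunk
        while len(sentence) > max_length:
            if current:
                chunks.append(current)
                current = ''
            chunks.append(sentence[:max_length])
            sentence = sentence[max_length:]
        candidate = f'{current} {sentence}' if current else sentence
        if len(candidate) <= max_length:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


if __name__ == '__main__':
    print(pack_sentences(['aaa', 'bbb', 'cc'], 7))
    print(pack_sentences(['abcdefghij'], 4))
```

Keeping this logic in its own function makes it easy to test with small limits before pointing it at a real translation API.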
2. upload.html
<!-- templates/upload.html -->
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>PDF Translator</title>
</head>
<body>
    <h1>Upload a PDF file to translate</h1>
    <form action="{{ url_for('upload_file') }}" method="post" enctype="multipart/form-data">
        <div>
            <label for="file">PDF file:</label>
            <input type="file" id="file" name="file" accept=".pdf" required>
        </div>
        <div>
            <label for="ocr_language">OCR language:</label>
            <select id="ocr_language" name="ocr_language">
                <option value="en">English</option>
                <option value="fr">French</option>
                <option value="de">German</option>
                <option value="es">Spanish</option>
                <option value="it">Italian</option>
                <option value="ja">Japanese</option>
                <option value="ko">Korean</option>
                <option value="ru">Russian</option>
                <option value="zh-cn">Simplified Chinese</option>
                <option value="zh-tw">Traditional Chinese</option>
                <!-- Add more languages here if needed -->
            </select>
        </div>
        <div>
            <label for="engine">Translation engine:</label>
            <select id="engine" name="engine">
                <option value="google">Google Translate</option>
                <option value="microsoft">Microsoft Translator</option>
                <!-- Add other translation engines here -->
            </select>
            <label for="target_language">Target language:</label>
            <select id="target_language" name="target_language">
                <option value="zh-cn">Simplified Chinese</option>
                <option value="zh-tw">Traditional Chinese (Taiwan)</option>
                <!-- Other language options -->
            </select>
        </div>
        <div>
            <button type="submit">Start translation</button>
        </div>
    </form>
</body>
</html>
3. processing.html
<!-- templates/processing.html -->
<!doctype html>
<html>
<head>
    <title>Processing...</title>
    <style>
        #progress-bar {
            width: 50%;
            background-color: #f3f3f3;
            margin: 20px 0;
        }
        #progress-bar-fill {
            height: 30px;
            width: 0%;
            background-color: #4caf50;
            text-align: center;
            line-height: 30px;
            color: white;
        }
    </style>
</head>
<body>
    <h1>Your file is being processed, please wait...</h1>
    <div id="progress-bar">
        <div id="progress-bar-fill">0%</div>
    </div>
    <script>
        var taskId = "{{ task_id }}";
        var progressBarFill = document.getElementById('progress-bar-fill');
        var eventSource = new EventSource('/progress/' + taskId);
        eventSource.onmessage = function(event) {
            var progress = event.data;
            if (progress == '-1') {
                // alert('Processing failed, please try again.');
                eventSource.close();
                window.location.href = '/';
            } else {
                progressBarFill.style.width = progress + '%';
                progressBarFill.innerText = progress + '%';
                if (progress >= 100) {
                    eventSource.close();
                    window.location.href = '/result/' + taskId;
                }
            }
        };
    </script>
</body>
</html>
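The page above consumes a server-sent event stream from /progress/<task_id>. A possible server-side generator is sketched below, assuming the status is an integer percentage with -1 meaning failure. `sse_progress_events`, `read_status`, and the injectable `sleep` parameter are hypothetical names chosen for the sketch; the generator emits an event only when the value changes and pauses between polls.

```python
import time


def sse_progress_events(read_status, poll_interval=0.5, sleep=time.sleep):
    """Yield SSE-formatted progress events until the task finishes or fails.

    read_status is a callable returning the current percentage, or -1 on failure.
    An event is emitted only when the value changes.
    """
    last = None
    while True:
        status = read_status()
        if status != last:
            yield f"data: {status}\n\n"
            last = status
        if status >= 100 or status == -1:
            break
        sleep(poll_interval)


if __name__ == '__main__':
    samples = iter([0, 0, 50, 50, 100])
    for event in sse_progress_events(lambda: next(samples), sleep=lambda s: None):
        print(repr(event))
```

In a Flask route, the generator could be wrapped in `Response(..., mimetype='text/event-stream')`, as the progress route in app.py already does.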
4. result.html
<!-- templates/result.html -->
<!doctype html>
<html>
<head>
    <title>Translation Result</title>
    <style>
        .container {
            display: flex;
        }
        .content {
            width: 50%;
            padding: 20px;
            box-sizing: border-box;
            overflow-y: scroll;
            height: 80vh; /* leave room for the processing time */
        }
        .original {
            background-color: #f9f9f9;
        }
        .translated {
            background-color: #eef9f1;
        }
        pre {
            white-space: pre-wrap;
            word-wrap: break-word;
        }
    </style>
</head>
<body>
    <h1>Translation Result</h1>
    <p>Processing time: {{ elapsed_time }}</p> <!-- processing time -->
    <p>Translation engine: {{ engine|capitalize }}</p> <!-- the capitalize filter uppercases the first letter -->
    <p>OCR language: {{ ocr_language }}</p> <!-- OCR language -->
    <!-- CHANGE: added a button to download the translation -->
    <button onclick="downloadTranslatedText()">Download translation</button>
    <button onclick="window.location.href='/'">Back to home</button>
    <div class="container">
        <div class="content original">
            <h2>Original</h2>
            <pre>{{ original }}</pre>
        </div>
        <div class="content translated">
            <h2>Translation</h2>
            <pre>{{ translated }}</pre>
        </div>
    </div>
    <script>
        function downloadTranslatedText() {
            var element = document.createElement('a');
            // tojson emits a safely escaped JavaScript string, so backticks,
            // quotes, and HTML entities in the text cannot break the script
            var text = {{ translated|tojson }};
            var file = new Blob([text], {type: 'text/plain'});
            element.href = URL.createObjectURL(file);
            element.download = 'translated.txt';
            document.body.appendChild(element);
            element.click();
            document.body.removeChild(element);
        }
    </script>
</body>
</html>
5. config.ini
[translator]
azure_api_key = 5abb1ab..
azure_region = south..
mymemorytranslator_key = 4ba808c..
email_address = dave3.nian@gmail.com
openai_api_key = sk-proj..9KrfsMyI30Am3..
#yandex_api_key = YOUR_YANDEX_API_KEY
zhipu_api_key = 23358bf...
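Of the keys above, the app.py listed here reads only azure_api_key and azure_region. A minimal sketch of loading the optional keys without failing when they are absent, using the `fallback` argument of `configparser`; `load_translator_keys` and the returned dictionary layout are hypothetical.

```python
import configparser


def load_translator_keys(config_text):
    """Parse the [translator] section; optional keys default to None."""
    config = configparser.ConfigParser()
    config.read_string(config_text)
    section = 'translator'
    return {
        # Required: raises NoSectionError / NoOptionError when missing
        'azure_api_key': config.get(section, 'azure_api_key'),
        'azure_region': config.get(section, 'azure_region'),
        # Optional: None when the option is not present
        'mymemory_key': config.get(section, 'mymemorytranslator_key', fallback=None),
        'email_address': config.get(section, 'email_address', fallback=None),
    }


if __name__ == '__main__':
    sample = "[translator]\nazure_api_key = KEY\nazure_region = REGION\n"
    print(load_translator_keys(sample))
```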
6. Dockerfile
# Use the official Python 3.12.3 slim image as the base
FROM python:3.12.3-slim

# Environment variables
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

# Working directory. Since P8, project files live at /app/<project name> in the container
WORKDIR /app/pdf2tx-mm

# Copy the application code into the container. Since P8, project files live at /app/<project name>
COPY . /app/pdf2tx-mm

# Upgrade pip
RUN pip install --upgrade pip

# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    tesseract-ocr \
    libtesseract-dev \
    poppler-utils \
    libglib2.0-0 \
    && rm -rf /var/lib/apt/lists/*

# Tesseract language packs for Chinese and Japanese
RUN apt-get update && apt-get install -y --no-install-recommends \
    tesseract-ocr-chi-sim \
    tesseract-ocr-chi-tra \
    tesseract-ocr-jpn \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Download NLTK data
RUN python -m nltk.downloader all

# Copy the rest of the application code
COPY . /app/

# Expose the port the application runs on
EXPOSE 9006

# Host and port for Flask
ENV FLASK_RUN_HOST=0.0.0.0
ENV FLASK_RUN_PORT=9006

# Run the application
CMD ["python", "app.py"]
7. requirements.txt
Flask
pdf2image
pytesseract
deep_translator
nltk
pdfminer.six
langdetect
jieba
janome
werkzeug
gunicorn
Docker deployment:
docker build -t pdf2tx-mm.8.1 .
docker run -d -p 9006:9006 --name pdf2tx-mm.8.1_container pdf2tx-mm.8.1
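Note: the first command builds an image named pdf2tx-mm.8.1.
The second command creates a container named pdf2tx-mm.8.1_container from the pdf2tx-mm.8.1 image.
Demo
This PDF is in three languages, and the translation is readable.
To run on Windows, see the P8 installation guide.
For a Linux Docker deployment, the commands in this article can be used directly.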