"Fun Hands-On Projects with AI Large Models", Episode 6: Building a Personal News Radio with Large Models and RSS Aggregation

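Abstract

This article shows how to combine AI large models with RSS aggregation to build a feature-rich personal news radio system. We use Python and PyQt5 to build a desktop application that fetches news from multiple RSS feeds, uses a large model to clean up the content and generate tags, and reads the news aloud through text-to-speech. The project combines web scraping, natural language processing, database storage, and speech synthesis, which makes it a practical and fun example of an AI application.
Full code repository: https://github.com/wyg5208/rss_news_boke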

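Core Concepts and Key Points

1. Project Architecture Overview

Our RSS news radio system consists of the following core modules:

RSS fetching module: retrieves the latest content from the configured news feeds
Content extraction module: uses several strategies to extract the article body from HTML pages
LLM refinement module: uses a local large model served by Ollama to refine content and strip ads
Tag generation module: has the large model analyze the news content and generate category tags
Data storage module: saves news and tag information in a SQLite database
Voice broadcast module: converts news into speech and reads it aloud
Scheduled task module: runs fetching and broadcasting on a schedule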
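
To make the data flow between these modules concrete, here is a minimal sketch of one news item passing through extraction, refinement, tagging, and storage. The `NewsItem` class, the `process_item` function, and the stand-in stage callables are all hypothetical names used for illustration; the project's actual classes are shown in the sections that follow.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NewsItem:
    """Hypothetical container for one news item as it moves through the stages."""
    title: str
    link: str
    raw_content: str = ""
    content: str = ""
    tags: List[str] = field(default_factory=list)


def process_item(item, extract, refine, tag, store):
    """Run one item through the stages in the order the modules are listed."""
    item.raw_content = extract(item.link)                # content extraction
    item.content = refine(item.title, item.raw_content)  # LLM refinement
    item.tags = tag(item.title, item.content)            # tag generation
    store(item)                                          # data storage
    return item


# Stand-in stages (assumptions) so the flow runs without network access or a model
saved = []
item = process_item(
    NewsItem(title="Example headline", link="https://example.com/a"),
    extract=lambda link: f"raw text fetched from {link}",
    refine=lambda title, raw: raw.replace("raw ", ""),
    tag=lambda title, content: ["example"],
    store=saved.append,
)
print(item.content, item.tags)
```

Passing the stages in as callables is only a device to keep the sketch runnable; it also hints at why each module can be swapped or disabled independently.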

2. Environment Setup and Dependency Installation

First, we need to install the required packages:

# requirements.txt
beautifulsoup4==4.13.3
feedparser==6.0.10
requests==2.31.0
selenium==4.15.2
webdriver-manager==4.0.1
lxml==4.9.3
ollama==0.1.5
PyQt5==5.15.9
schedule==1.2.1
pyttsx3==2.90

Install the dependencies:

pip install -r requirements.txt
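
After installation, a quick check can confirm that every package is importable. The sketch below is one possible way to do this with the standard library; the `REQUIRED_MODULES` list and the `missing_modules` helper are illustrative names, and the list uses import names, which differ from the package names in one case (`beautifulsoup4` is imported as `bs4`).

```python
from importlib.util import find_spec

# Import names for the packages in requirements.txt (beautifulsoup4 -> bs4)
REQUIRED_MODULES = [
    "bs4", "feedparser", "requests", "selenium", "webdriver_manager",
    "lxml", "ollama", "PyQt5", "schedule", "pyttsx3",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
print("Missing:", ", ".join(missing) if missing else "none")
```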

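In addition, make sure Ollama is installed and the GLM4 model has been pulled:

# Install Ollama (follow the official documentation)
# Start the Ollama service
ollama serve
# Pull the GLM4 model
ollama pull glm4:latest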

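3. Refining Article Text with the Large Model

The large model plays a key role in processing news content. Traditional HTML parsing often cannot reliably separate the article body from ads, navigation, and other unrelated content. With the large model, we refine the extracted content as follows: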

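import ollama


def refine_content_with_llm(title, description, raw_content):
    """Use the large model to refine the article body, removing ads and unrelated text."""
    try:
        # Very short content is returned unchanged
        if not raw_content or len(raw_content) < 200:
            return raw_content

        # Truncate overly long input to a manageable length
        content_for_processing = raw_content[:8000]

        # Build the prompt
        prompt = f"""Please extract the news body from the web page content below, removing ads, navigation bars, copyright notices, and other non-core content.

Title: {title}
Description: {description}
Raw content:
{content_for_processing}

Process it according to these rules:
1. Keep only body text related to the news topic
2. Remove all ads, recommended reading, social media links, and other unrelated content
3. Remove site navigation, footer copyright information, and the like
4. Preserve the paragraph structure of the original
5. If no clear body can be found, return the most likely main content
6. Return the processed plain text directly, without any extra explanation

Processed body:
"""
        # Call the large model to process the content
        response = ollama.chat(
            model='glm4:latest',
            messages=[{'role': 'user', 'content': prompt}],
        )
        refined_content = response['message']['content']

        # If the result looks unreasonable, fall back to the raw content
        if (not refined_content or len(refined_content) < 100
                or len(refined_content) > len(raw_content) * 1.5):
            print("LLM extraction result looks unreasonable; falling back to the raw content")
            return raw_content

        return refined_content
    except Exception as e:
        print(f"Failed to refine content with the large model: {e}")
        return raw_content  # fall back to the raw content on error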

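The fallback rule in `refine_content_with_llm` is easier to test when it is pulled out into a pure function. The sketch below does that; the function name `is_reasonable_refinement` and its keyword parameters are illustrative, while the thresholds (100 characters and 1.5 times the raw length) are the ones used in the function above.

```python
def is_reasonable_refinement(raw, refined, min_len=100, max_ratio=1.5):
    """Return True if the model output is plausibly a cleaned-up version of raw."""
    if not refined:
        return False  # empty reply
    if len(refined) < min_len:
        return False  # too short: the model probably dropped the article
    if len(refined) > len(raw) * max_ratio:
        return False  # much longer than the input: the model probably added text
    return True


raw = "x" * 1000
print(is_reasonable_refinement(raw, "y" * 600))   # True
print(is_reasonable_refinement(raw, "y" * 50))    # False
print(is_reasonable_refinement(raw, "y" * 1600))  # False
```

A length check like this is only a heuristic: it catches empty or runaway replies, but it cannot tell whether the model kept the right paragraphs.
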
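4. Tag Generation with the Large Model

Beyond content refinement, we also use the large model to generate tags, which helps users categorize and filter news: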

import json
import re

import ollama


# Method of the tag generator class (shown without its class)
def generate_tags(self, title, description, content):
    try:
        # Fetch all tags currently in the tag library
        library_tags = self.get_library_tags()

        # Build a prompt that asks the model to pick existing tags and suggest new ones
        prompt = f"""Analyze the news item below. Choose at most 5 relevant tags from the tag library and, if needed, suggest at most 2 new tags.

Title: {title}
Description: {description}
Content summary: {content[:500]}...

Existing tags in the tag library:
{', '.join(library_tags)}

Return the result in this JSON format:
{{
    "existing_tags": ["existing tag 1", "existing tag 2", ...],
    "new_tags": ["new tag 1", "new tag 2"]
}}
"""
        response = ollama.chat(
            model='glm4:latest',
            messages=[{'role': 'user', 'content': prompt}],
        )
        result = response['message']['content']

        # Extract the JSON payload (it may be wrapped in a markdown code block)
        json_match = re.search(r'```json\s*(.*?)\s*```', result, re.DOTALL)
        if json_match:
            result = json_match.group(1)
        else:
            # Otherwise look for a bare JSON object in the reply
            json_match = re.search(r'({.*})', result, re.DOTALL)
            if json_match:
                result = json_match.group(1)

        try:
            tags_data = json.loads(result)

            # Tags chosen from the library
            existing_tags = tags_data.get("existing_tags", [])

            # New tags are added to the library
            new_tags = tags_data.get("new_tags", [])
            for tag in new_tags:
                self.add_to_library(tag)

            # Merge all tags
            all_tags = existing_tags + new_tags

            # Update tag usage counts
            self.update_tag_frequency(all_tags)

            return all_tags
        except json.JSONDecodeError:
            # Fall back to keyword-based tag matching
            return self.match_tags_from_library(title + " " + description)
    except Exception as e:
        print(f"Failed to generate tags: {e}")
        # Fall back to simple tag matching
        return self.match_tags_from_library(title + " " + description)
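
The JSON-extraction step in `generate_tags` can be isolated into a helper that is testable without a model. The sketch below is one way to write it, assuming the reply is either a fenced block or free text containing a single JSON object; `extract_json_object` and the sample replies are hypothetical. Unlike the original regex, it also accepts a fence without the `json` language marker.

```python
import json
import re

FENCE = "`" * 3  # a markdown code fence
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)


def extract_json_object(reply):
    """Return the first JSON object found in an LLM reply, or None."""
    fenced = FENCED_BLOCK.search(reply)
    candidate = fenced.group(1) if fenced else reply
    braces = re.search(r"\{.*\}", candidate, re.DOTALL)
    if braces is None:
        return None
    try:
        data = json.loads(braces.group(0))
    except json.JSONDecodeError:
        return None
    # Only a JSON object (dict) is useful to the tag generator
    return data if isinstance(data, dict) else None


wrapped = "Here you go:\n" + FENCE + 'json\n{"existing_tags": ["AI"], "new_tags": []}\n' + FENCE
bare = 'Result: {"existing_tags": [], "new_tags": ["RSS"]}'
print(extract_json_object(wrapped))
print(extract_json_object(bare))
print(extract_json_object("no JSON here"))
```

Returning `None` instead of raising lets the caller decide when to fall back to keyword matching.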

5. Voice Broadcast

Turning text into speech is one of the project's core features. It is implemented with the pyttsx3 library:

import time

import pyttsx3


class NewsBroadcaster:
    def __init__(self):
        """Initialize the speech engine."""
        self.engine = pyttsx3.init()
        # Set the default speaking rate and volume
        self.engine.setProperty('rate', 150)    # speaking rate
        self.engine.setProperty('volume', 0.8)  # volume
        # Try to select a Chinese voice
        voices = self.engine.getProperty('voices')
        for voice in voices:
            if "chinese" in voice.id.lower() or "zh" in voice.id.lower():
                self.engine.setProperty('voice', voice.id)
                break

    def broadcast(self, text):
        """Speak a piece of text."""
        self.engine.say(text)
        self.engine.runAndWait()

    def broadcast_news(self, news_items):
        """Read out a list of news items."""
        if not news_items:
            self.broadcast("No news available to broadcast")
            return

        self.broadcast("Starting today's news broadcast")
        time.sleep(1)

        for i, news in enumerate(news_items):
            # Title
            self.broadcast(f"News item {i+1}")
            self.broadcast(f"Title: {news['title']}")
            # Source
            self.broadcast(f"Source: {news['source']}")
            # Summary
            if news['description']:
                self.broadcast("News summary:")
                self.broadcast(news['description'])
            # Pause between items
            time.sleep(1)

        self.broadcast("News broadcast finished")
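
Because `broadcast_news` mixes text assembly with calls to the speech engine, it is hard to check without audio hardware. One option, sketched below, is to build the spoken script as a list of strings first and hand each string to `broadcast` afterwards. The function `build_broadcast_script` is a hypothetical helper, the wording of the phrases is illustrative, and the dictionary keys (`title`, `source`, `description`) are the ones used in the class above.

```python
def build_broadcast_script(news_items):
    """Return the sentences to speak, in order, for a list of news dicts."""
    if not news_items:
        return ["No news available to broadcast"]

    lines = ["Starting today's news broadcast"]
    for i, news in enumerate(news_items, start=1):
        lines.append(f"News item {i}")
        lines.append(f"Title: {news['title']}")
        lines.append(f"Source: {news['source']}")
        # The summary is optional, as in broadcast_news
        if news.get("description"):
            lines.append("News summary:")
            lines.append(news["description"])
    lines.append("News broadcast finished")
    return lines


script = build_broadcast_script([
    {"title": "Example headline", "source": "Example Feed", "description": ""},
])
for line in script:
    print(line)
```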

6. Scheduled Task Management

Scheduled fetching and broadcasting are implemented with the schedule library:

import threading
import time
from datetime import datetime

import schedule


class ScheduleManager:
    _instance = None
    _running = False
    _thread = None
    _fetch_tasks = {}      # fetch jobs, keyed by task ID
    _broadcast_tasks = {}  # broadcast jobs, keyed by task ID

    @classmethod
    def get_instance(cls):
        if cls._instance is None:
            cls._instance = ScheduleManager()
        return cls._instance

    def start(self):
        """Start the scheduler thread."""
        if not self._running:
            self._running = True
            self._thread = threading.Thread(target=self._run_scheduler, daemon=True)
            self._thread.start()

    def _run_scheduler(self):
        """Run pending jobs, checking once per second."""
        while self._running:
            schedule.run_pending()
            time.sleep(1)

    # Example that includes a monthly task
    def add_broadcast_task(self, task_id, schedule_type, value, time_value, app_instance, count=5):
        """Add a scheduled news broadcast task."""
        # Remove any existing task with the same ID first
        # (remove_broadcast_task is not shown in this excerpt)
        self.remove_broadcast_task(task_id)

        # Create the new job; schedule_type carries the UI values
        # "每天" (daily), "每周" (weekly), or "每月" (monthly)
        job = None
        if schedule_type == "每天":
            job = schedule.every().day.at(time_value).do(app_instance.start_news_broadcast, count)
        elif schedule_type == "每周":
            days = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]
            day_attr = getattr(schedule.every(), days[value-1])
            job = day_attr.at(time_value).do(app_instance.start_news_broadcast, count)
        elif schedule_type == "每月":
            # Monthly task: run a daily job that checks the date
            def monthly_broadcast_job():
                # Run only on the configured day of the month
                if datetime.now().day == value:
                    app_instance.start_news_broadcast(count)

            job = schedule.every().day.at(time_value).do(monthly_broadcast_job)

        if job:
            self._broadcast_tasks[task_id] = job
            return True
        return False
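
In the weekly branch, `days[value-1]` silently maps a value of 0 to Sunday because of Python's negative indexing, so it can be worth validating the inputs before creating a job. The sketch below shows one possible pair of checks; `weekday_attr` and `is_valid_time` are hypothetical helpers, and the time check covers only the 24-hour `HH:MM` form, which is an assumption about what the UI passes in.

```python
import re

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]


def weekday_attr(value):
    """Map 1..7 (Monday..Sunday) to the attribute name used on schedule.every()."""
    if not isinstance(value, int) or not 1 <= value <= 7:
        raise ValueError(f"weekday must be an integer from 1 to 7, got {value!r}")
    return WEEKDAYS[value - 1]


def is_valid_time(time_value):
    """Accept 24-hour HH:MM strings such as '07:30'."""
    return re.fullmatch(r"([01]\d|2[0-3]):[0-5]\d", time_value) is not None


print(weekday_attr(1), weekday_attr(7))
print(is_valid_time("07:30"), is_valid_time("24:00"))
```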

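Challenges and Technical Solutions

1. Multi-Layer Content Extraction Strategy

A major challenge is extracting usable news content from many different kinds of websites. We use a multi-layer extraction strategy that combines traditional scraping with the large model: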

import requests
from bs4 import BeautifulSoup


def extract_content(url, use_selenium=False):
    # Layer 1: try Selenium first (handles dynamically rendered content)
    if use_selenium:
        try:
            return RSSParser.extract_with_selenium(url)
        except Exception as e:
            print(f"Selenium extraction failed: {url}, error: {e}")
            # Fall through to the regular extraction

    # Layer 2: requests + BeautifulSoup
    try:
        # Send a User-Agent header that mimics the Chrome browser
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...',
            # other request headers...
        }
        response = requests.get(url, headers=headers, timeout=15)
        response.raise_for_status()

        # Detect and fix the page encoding
        if response.encoding == 'ISO-8859-1':
            response.encoding = response.apparent_encoding

        soup = BeautifulSoup(response.text, 'html.parser')

        # Remove scripts, styles, and other non-content elements
        for element in soup(['script', 'style', 'nav', 'header', 'footer', 'aside', 'form', 'iframe']):
            element.extract()

        # Try several extraction strategies
        content = ""

        # Strategy 1: look for common article containers
        article_containers = soup.find_all(
            ['article', 'main', 'div'],
            class_=lambda x: x and any(
                term in str(x).lower()
                for term in ['article', 'content', 'post', 'entry', 'main', 'text', 'body']
            ),
        )
        if article_containers:
            # Use the largest container
            largest_container = max(article_containers, key=lambda x: len(str(x)))
            content = largest_container.get_text(separator='\n', strip=True)

        # Strategy 2: collect long paragraphs
        if not content or len(content) < 200:
            paragraphs = soup.find_all('p')
            # Keep long paragraphs (likely body text)
            long_paragraphs = [p.get_text(strip=True) for p in paragraphs if len(p.get_text(strip=True)) > 60]
            if long_paragraphs:
                content = '\n\n'.join(long_paragraphs)

        # Layer 3: if the methods above all fail, the main routine then tries the large model
        return content
    except Exception as e:
        print(f"Content extraction failed: {url}, error: {e}")
        # If regular extraction failed and Selenium has not been tried yet, try it now
        if not use_selenium:
            try:
                return RSSParser.extract_with_selenium(url)
            except Exception as selenium_error:
                print(f"Selenium fallback extraction also failed: {url}, error: {selenium_error}")
        return f"Content extraction failed: {e}"
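
The long-paragraph strategy can be tried out in isolation. The sketch below reproduces the idea with the standard library's `html.parser` instead of BeautifulSoup, purely so that it runs without third-party packages; `ParagraphCollector` and `long_paragraphs` are hypothetical names, and the 60-character threshold is the one used in `extract_content`.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of each <p> element, ignoring <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None   # text pieces of the <p> currently open, if any
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def _flush(self):
        if self._buffer is not None:
            text = "".join(self._buffer).strip()
            if text:
                self.paragraphs.append(text)
            self._buffer = None

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "p":
            self._flush()  # an unclosed <p> ends where the next one starts
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._buffer is not None and not self._skip_depth:
            self._buffer.append(data)


def long_paragraphs(html, min_len=60):
    """Return paragraph texts longer than min_len characters."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    return [p for p in collector.paragraphs if len(p) > min_len]


page = "<html><body><p>Short.</p><p>" + "A" * 80 + "</p><script>var x = 1;</script></body></html>"
print(len(long_paragraphs(page)))
```

The length filter is a cheap way to drop captions, bylines, and link lists, at the cost of also dropping genuinely short paragraphs.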

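2. WebDriver Resource Management

A common problem when using Selenium is that browser resources are not released properly. We address this with a singleton that owns a single WebDriver instance and exposes an explicit method for closing it: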

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager


class WebDriverManager:
    _instance = None
    _driver = None

    @classmethod
    def get_instance(cls):
        if cls._instance is None:
            cls._instance = WebDriverManager()
        return cls._instance

    def get_driver(self):
        """Get or create the WebDriver instance."""
        if self._driver is None:
            try:
                service = Service(ChromeDriverManager().install())
                options = webdriver.ChromeOptions()
                options.add_argument("--headless")  # headless mode
                options.add_argument("--disable-gpu")
                options.add_argument("--disable-extensions")
                options.add_argument("--disable-dev-shm-usage")
                options.add_argument("--no-sandbox")
                self._driver = webdriver.Chrome(service=service, options=options)
                print("WebDriver initialized")
            except Exception as e:
                print(f"Failed to initialize WebDriver: {e}")
                raise
        return self._driver

    def close_driver(self):
        """Close the WebDriver and release its resources."""
        if self._driver:
            try:
                self._driver.quit()
            except Exception as e:
                print(f"Error while closing WebDriver: {e}")
            finally:
                self._driver = None
                print("WebDriver resources released")
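
The singleton only helps if `close_driver` is actually called, including when a page load raises. One way to guarantee that, sketched below, is to wrap the manager in a context manager. `managed_driver` is a hypothetical helper, and `FakeManager` is a stand-in with the same two methods so the example runs without a browser.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(manager):
    """Yield the manager's driver and always release it afterwards."""
    try:
        yield manager.get_driver()
    finally:
        manager.close_driver()


class FakeManager:
    """Stand-in for WebDriverManager (assumption: same get_driver/close_driver API)."""

    def __init__(self):
        self.open = False

    def get_driver(self):
        self.open = True
        return "driver"

    def close_driver(self):
        self.open = False


manager = FakeManager()
try:
    with managed_driver(manager) as driver:
        raise RuntimeError("page load failed")
except RuntimeError:
    pass
print(manager.open)  # False
```

With the real class, the call would look like `with managed_driver(WebDriverManager.get_instance()) as driver:`. Closing after every use gives up driver reuse, so whether this fits depends on how often pages are fetched.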

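3. Monthly Task Implementation

The schedule library has no monthly interval, so we implement monthly tasks by scheduling a daily job that performs its action only when the current day of the month matches the configured day: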

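# Monthly task implementation
# (value, time_value, and app_instance come from the enclosing method)
def monthly_job():
    # Run only on the configured day of the month
    if datetime.now().day == value:
        app_instance.start_fetch_scheduled()

job = schedule.every().day.at(time_value).do(monthly_job)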
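
One consequence of comparing `datetime.now().day` with a fixed value is that a task set for the 29th, 30th, or 31st never fires in months that are shorter. If that matters, a possible variant is to clamp the target to the last day of the current month, as sketched below. This is an extension under that assumption, not the project's behavior, and `should_run_today` is a hypothetical name.

```python
import calendar
from datetime import date


def should_run_today(target_day, today=None):
    """True on the target day, or on the month's last day when the month is shorter."""
    today = today or date.today()
    last_day = calendar.monthrange(today.year, today.month)[1]
    return today.day == min(target_day, last_day)


print(should_run_today(31, date(2024, 2, 29)))  # True
print(should_run_today(31, date(2024, 2, 28)))  # False
print(should_run_today(15, date(2024, 3, 15)))  # True
```

Accepting `today` as a parameter keeps the check testable without patching the system clock.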

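4. Link De-duplication

To avoid processing the same news item twice, we check each link against the database before extracting its content: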

import sqlite3


def link_exists(self, link):
    """Check whether a link already exists in the database."""
    conn = sqlite3.connect(self.db_path)
    cursor = conn.cursor()
    try:
        cursor.execute("SELECT id FROM news WHERE link = ?", (link,))
        result = cursor.fetchone()
        return result is not None
    except Exception as e:
        print(f"Failed to check whether the link exists: {e}")
        return False
    finally:
        conn.close()

# Used in the fetch thread, inside the loop over feed entries
if self.db.link_exists(link):
    self.update_signal.emit(f"Skipped (already in database): {title}")
    continue
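
The check above prevents wasted extraction work, but it does not by itself stop a duplicate row if two inserts race. A complementary safeguard is a `UNIQUE` constraint on the link column combined with `INSERT OR IGNORE`. The sketch below shows this against an in-memory database; the table layout is an assumption that keeps only the `id` and `link` columns seen in the query above plus a `title`, and `add_if_new` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE news (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        title TEXT,
        link TEXT UNIQUE
    )
""")


def add_if_new(conn, title, link):
    """Insert the row unless the link is already stored; return True if inserted."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO news (title, link) VALUES (?, ?)", (title, link)
    )
    conn.commit()
    return cur.rowcount == 1


print(add_if_new(conn, "First", "https://example.com/a"))      # True
print(add_if_new(conn, "Duplicate", "https://example.com/a"))  # False
count = conn.execute("SELECT COUNT(*) FROM news").fetchone()[0]
print(count)  # 1
```

The `UNIQUE` constraint also creates an index on `link`, so the lookup in `link_exists` no longer needs a full table scan.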

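Complete Code Walkthrough

The following end-to-end flow shows how to fetch news from RSS feeds, refine the content, generate tags, and broadcast. The fetch thread's run method covers everything from fetching through saving to the database: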

import time


# run method of the fetch thread
def run(self):
    total = len(self.rss_urls)
    total_processed = 0
    total_new = 0

    for i, url in enumerate(self.rss_urls):
        try:
            self.update_signal.emit(f"Processing ({i+1}/{total}): {url}")
            feed = self.parser.get_feed(url)
            if not feed:
                continue

            source = feed.feed.title if hasattr(feed.feed, 'title') else url
            processed = 0
            new_added = 0

            for entry in feed.entries[:100]:  # process at most 100 items per feed
                title = entry.title if hasattr(entry, 'title') else "Untitled"
                link = entry.link if hasattr(entry, 'link') else ""
                description = entry.description if hasattr(entry, 'description') else ""
                pub_date = entry.published if hasattr(entry, 'published') else ""

                if not link:
                    continue

                processed += 1
                total_processed += 1

                # Skip links that already exist in the database
                if self.db.link_exists(link):
                    self.update_signal.emit(f"Skipped (already in database): {title}")
                    continue

                self.update_signal.emit(f"Extracting: {title}")

                # Extract the article body, using Selenium if that option is enabled
                raw_content = self.parser.extract_content(link, self.use_selenium)

                # Refine the content with the large model if that option is enabled
                content = raw_content
                if self.use_llm:
                    self.update_signal.emit(f"Refining article text with the large model: {title}")
                    content = self.parser.refine_content_with_llm(title, description, raw_content)

                # Generate tags
                self.update_signal.emit(f"Generating tags: {title}")
                tags = self.tag_generator.generate_tags(title, description, content)

                # Save to the database
                if self.db.add_news(title, link, description, content, source, pub_date, tags):
                    new_added += 1
                    total_new += 1

                time.sleep(1)  # avoid sending requests too quickly

            self.update_signal.emit(f"Finished {url}: processed {processed} items, added {new_added}")
            self.news_added_signal.emit(processed, new_added)
        except Exception as e:
            self.update_signal.emit(f"Error processing RSS feed: {url}, error: {e}")

    self.update_signal.emit(f"Fetch complete: processed {total_processed} items in total, added {total_new}")
    self.finished_signal.emit()
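
The four `hasattr` checks at the top of the inner loop can be condensed with `getattr` and a default value. The sketch below shows the idea on a stand-in object; `entry_fields` is a hypothetical helper, the default title string is illustrative, and `SimpleNamespace` replaces a real feed entry so the example runs without feedparser.

```python
from types import SimpleNamespace


def entry_fields(entry):
    """Read the fields used by the fetch loop, falling back to defaults when absent."""
    return {
        "title": getattr(entry, "title", "Untitled"),
        "link": getattr(entry, "link", ""),
        "description": getattr(entry, "description", ""),
        "pub_date": getattr(entry, "published", ""),
    }


# Stand-in for a feed entry that lacks a description and a publication date
entry = SimpleNamespace(title="Example headline", link="https://example.com/a")
fields = entry_fields(entry)
print(fields["title"], repr(fields["description"]))
```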

News broadcast example:

import threading


def start_news_broadcast(self, count=5):
    """Start a news broadcast."""
    self.log_message(f"Starting news broadcast with the {count} latest items")

    # Fetch the latest news
    news_list = self.db.get_latest_news(count)
    if not news_list:
        self.log_message("No news available to broadcast")
        return

    # Broadcast in a separate thread so the UI does not freeze
    broadcast_thread = threading.Thread(
        target=self.broadcaster.broadcast_news,
        args=(news_list,),
        daemon=True,
    )
    broadcast_thread.start()

    # Log what is being broadcast
    for i, news in enumerate(news_list):
        self.log_message(f"Broadcasting item {i+1}: {news['title']} - {news['source']}")
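
Since each call to `start_news_broadcast` starts a new thread, a scheduled broadcast and a manually triggered one could in principle overlap on the same speech engine. If that turns out to be a problem, one possible guard is a non-blocking lock that rejects a second broadcast while the first is still running. The sketch below is an assumption about how this could be added, with `BroadcastGuard` as a hypothetical class; a `threading.Event` stands in for a long-running broadcast.

```python
import threading


class BroadcastGuard:
    """Allow only one broadcast thread at a time."""

    def __init__(self):
        self._lock = threading.Lock()

    def start(self, target, *args):
        """Run target in a daemon thread; return None if one is already running."""
        if not self._lock.acquire(blocking=False):
            return None

        def runner():
            try:
                target(*args)
            finally:
                self._lock.release()  # a Lock may be released from another thread

        thread = threading.Thread(target=runner, daemon=True)
        thread.start()
        return thread


gate = threading.Event()          # keeps the first "broadcast" running until set
guard = BroadcastGuard()
first = guard.start(gate.wait)
second = guard.start(gate.wait)   # rejected while the first is still running
gate.set()
first.join()
third = guard.start(lambda: None)  # accepted again after the first finished
third.join()
print(first is not None, second is None, third is not None)
```

In the application, the target would be `self.broadcaster.broadcast_news` with `news_list` as its argument.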