"Fun Hands-On Projects with AI Large Models", Episode 6: Building a Personal News Radio with Large Models and RSS Aggregation

news 2025/7/15 4:31:38

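Abstract

This article shows how to combine a large language model with RSS aggregation to build a personal news radio. We use Python and PyQt5 to build a desktop application that fetches news from multiple RSS feeds, uses a large model to clean up the content and generate tags, and reads the news aloud through text-to-speech. The project brings together web scraping, natural language processing, database storage, and speech synthesis, which makes it a practical and fun example of an AI application.
Full code repository: https://github.com/wyg5208/rss_news_boke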

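Core Concepts and Key Points

1. Project Architecture Overview

The RSS news radio consists of the following core modules:

RSS fetching module: retrieves the latest content from the configured news feeds
Content extraction module: uses several strategies to extract the article body from HTML pages
Large-model refinement module: uses a local model served by Ollama to refine content and strip ads
Tag generation module: analyzes article content with the large model and generates category tags
Data storage module: stores news and tag information in a SQLite database
Voice broadcast module: converts news text to speech and reads it aloud
Scheduling module: runs fetching and broadcasting on a timer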

2. Environment Setup and Dependency Installation

First, install the required packages:

# requirements.txt
beautifulsoup4==4.13.3
feedparser==6.0.10
requests==2.31.0
selenium==4.15.2
webdriver-manager==4.0.1
lxml==4.9.3
ollama==0.1.5
PyQt5==5.15.9
schedule==1.2.1
pyttsx3==2.90

Install the dependencies:

pip install -r requirements.txt

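In addition, make sure Ollama is installed and the GLM4 model has been pulled:

# Install Ollama (follow the official documentation)
# Start the Ollama service
ollama serve
# Pull the GLM4 model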
ollama pull glm4:latest
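
Before launching the application, it can be useful to confirm that the Python dependencies are importable. The following is a minimal sketch using only the standard library; the helper name missing_modules and the REQUIRED_MODULES list are illustrative and not part of the original project. Note that import names differ from PyPI package names in a few cases.

```python
from importlib.util import find_spec

# Import names for the packages in requirements.txt. Some differ from the
# PyPI names: beautifulsoup4 is imported as bs4, webdriver-manager as
# webdriver_manager.
REQUIRED_MODULES = [
    "bs4", "feedparser", "requests", "selenium", "webdriver_manager",
    "lxml", "ollama", "PyQt5", "schedule", "pyttsx3",
]


def missing_modules(names):
    """Return the names from the list that cannot be found on the import path."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

This only checks that the modules can be located; it does not verify versions or that the Ollama service is running.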

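3. Refining Article Text with a Large Model

The large model plays a key role in processing news content. Traditional HTML parsing often cannot reliably separate the article body from ads, navigation, and other irrelevant content, so we pass the extracted text through the model as a cleanup step: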

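import ollama


def refine_content_with_llm(title, description, raw_content):
    """Use the large model to refine the article body, removing ads and irrelevant text."""
    try:
        # Empty or very short text is not worth a model call
        if not raw_content or len(raw_content) < 200:
            return raw_content

        # Truncate overly long input before sending it to the model
        content_for_processing = raw_content[:8000]

        # Build the prompt: ask the model to keep only the article body and drop ads, navigation, and copyright notices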
        prompt = f"""请帮我提取以下网页内容中的新闻正文，清除广告、导航栏、版权声明等非核心内容。

标题: {title}
描述: {description}
原始内容:
{content_for_processing}

请按以下规则处理:
1. 只保留与新闻主题相关的正文内容
2. 移除所有广告、推荐阅读、社交媒体链接等无关内容
3. 移除网站导航、页脚版权信息等
4. 保持原文的段落结构
5. 如果找不到明确的正文，返回最有可能的主要内容
6. 直接返回处理后的纯文本，不要添加额外说明

处理后的正文:
"""
        
        # Call the model to process the content
        response = ollama.chat(model='glm4:latest', messages=[
            {
                'role': 'user',
                'content': prompt
            }
        ])
        
        refined_content = response['message']['content']
        
        # Fall back to the raw content if the result looks implausible (too short, or much longer than the input)
        if not refined_content or len(refined_content) < 100 or len(refined_content) > len(raw_content) * 1.5:
            print("大模型内容提取结果不合理，回退到原始内容处理")
            return raw_content

        return refined_content

    except Exception as e:
        print(f"使用大模型精炼内容失败: {str(e)}")
        return raw_content  # Fall back to the raw content on error
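
The fallback rule at the end of the function can be isolated into a small pure function, which makes it easy to test without calling the model. This is a sketch under stated assumptions: the helper names is_plausible_refinement and choose_content are hypothetical, the thresholds mirror the ones in the function above, and the strip() call is an addition that the original does not perform.

```python
def is_plausible_refinement(raw, refined, min_chars=100, max_growth=1.5):
    """Return True when the model output looks like a cleaned-up version of raw."""
    if not refined:
        return False
    refined = refined.strip()
    if len(refined) < min_chars:
        return False
    # A cleanup step should shrink the text, or grow it only slightly.
    return len(refined) <= len(raw) * max_growth


def choose_content(raw, refined):
    """Prefer the refined text, falling back to the raw extraction."""
    if is_plausible_refinement(raw, refined):
        return refined.strip()
    return raw
```

Keeping the check separate also makes it straightforward to tune the thresholds per feed if some sources produce very short articles.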

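4. Tag Generation with a Large Model

Besides refining content, we also use the large model to generate tags, which helps users categorize and filter news: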

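import json
import re

import ollama


def generate_tags(self, title, description, content):
    try:
        # Load every tag currently in the tag library
        library_tags = self.get_library_tags()

        # Build the prompt: choose up to 5 tags from the library and suggest up to 2 new ones, returned as JSON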
        prompt = f"""请分析以下新闻内容，从标签库中选择最多5个相关标签，并在需要时建议最多2个新标签。
        
        标题: {title}
        描述: {description}
        内容概要: {content[:500]}...
        
        标签库中的现有标签:
        {', '.join(library_tags)}
        
        请按以下JSON格式返回：
        {{
            "existing_tags": ["已有标签1", "已有标签2", ...],
            "new_tags": ["新标签1", "新标签2"]
        }}
        """
        
        response = ollama.chat(model='glm4:latest', messages=[
            {
                'role': 'user',
                'content': prompt
            }
        ])
        
        result = response['message']['content']
        
        # Extract the JSON payload (the model may wrap it in a markdown code block)
        json_match = re.search(r'```json\s*(.*?)\s*```', result, re.DOTALL)
        if json_match:
            result = json_match.group(1)
        else:
            # Otherwise take the outermost {...} span from the reply
            json_match = re.search(r'({.*})', result, re.DOTALL)
            if json_match:
                result = json_match.group(1)
        
        try:
            tags_data = json.loads(result)

            # Tags chosen from the existing library
            existing_tags = tags_data.get("existing_tags", [])

            # Newly suggested tags are added to the library
            new_tags = tags_data.get("new_tags", [])
            for tag in new_tags:
                self.add_to_library(tag)

            # Merge all tags
            all_tags = existing_tags + new_tags

            # Update tag usage frequency
            self.update_tag_frequency(all_tags)

            return all_tags

        except json.JSONDecodeError:
            # Fall back to keyword-based tag matching
            return self.match_tags_from_library(title + " " + description)

    except Exception as e:
        print(f"生成标签失败: {str(e)}")
        # Fall back to simple tag matching
        return self.match_tags_from_library(title + " " + description)
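
Model replies do not always follow the requested JSON format exactly, so the parsing step benefits from some defensive handling. The sketch below is one possible hardening, not the project's implementation: parse_tag_response and _clean are hypothetical helpers that accept a fenced or bare JSON object, ignore values of the wrong type, drop duplicates, and enforce the limits of 5 existing and 2 new tags stated in the prompt.

```python
import json
import re

FENCE = "`" * 3  # three backticks, i.e. a markdown code fence
FENCED_JSON = re.compile(FENCE + r"(?:json)?\s*(\{.*?\})\s*" + FENCE, re.DOTALL)


def _clean(value, limit):
    """Keep unique, non-empty strings in their original order, capped at limit."""
    if not isinstance(value, list):
        return []
    tags = []
    for item in value:
        if isinstance(item, str):
            tag = item.strip()
            if tag and tag not in tags:
                tags.append(tag)
    return tags[:limit]


def parse_tag_response(text, max_existing=5, max_new=2):
    """Extract the tag object from a model reply; return empty lists on failure."""
    empty = {"existing_tags": [], "new_tags": []}
    match = FENCED_JSON.search(text)
    if match:
        candidate = match.group(1)
    else:
        # No code fence: take the outermost {...} span, if any.
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end < start:
            return empty
        candidate = text[start:end + 1]
    try:
        data = json.loads(candidate)
    except json.JSONDecodeError:
        return empty
    if not isinstance(data, dict):
        return empty
    return {
        "existing_tags": _clean(data.get("existing_tags"), max_existing),
        "new_tags": _clean(data.get("new_tags"), max_new),
    }
```

A caller could treat two empty lists as the signal to fall back to keyword-based matching.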

5. Voice Broadcast

Turning text into speech is one of the project's core features. It is implemented with the pyttsx3 library:

import time

import pyttsx3


class NewsBroadcaster:
    def __init__(self):
        """Initialize the speech engine."""
        self.engine = pyttsx3.init()
        # Set the default speaking rate and volume
        self.engine.setProperty('rate', 150)  # Speaking rate
        self.engine.setProperty('volume', 0.8)  # Volume

        # Try to select a Chinese voice if one is installed
        voices = self.engine.getProperty('voices')
        for voice in voices:
            if "chinese" in voice.id.lower() or "zh" in voice.id.lower():
                self.engine.setProperty('voice', voice.id)
                break
    
    def broadcast(self, text):
        """Speak a piece of text."""
        self.engine.say(text)
        self.engine.runAndWait()

    def broadcast_news(self, news_items):
        """Read out a list of news items."""
        if not news_items:
            self.broadcast("没有找到可播报的新闻")
            return

        self.broadcast("开始播报今日新闻")
        time.sleep(1)

        for i, news in enumerate(news_items):
            # Announce the item number and title
            self.broadcast(f"第{i+1}条新闻")
            self.broadcast(f"标题: {news['title']}")

            # Announce the source
            self.broadcast(f"来源: {news['source']}")

            # Read the summary, if there is one
            if news['description']:
                self.broadcast("新闻摘要:")
                self.broadcast(news['description'])

            # Pause between items
            time.sleep(1)

        self.broadcast("新闻播报完毕")
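
Separating what is said from how it is spoken makes the broadcast logic testable without audio hardware. The sketch below is an illustration rather than the project's code: build_script and speak_all are hypothetical helpers, news items are assumed to be dict-like with title, source, and description keys, and the engine argument is assumed to expose the say and runAndWait methods used above.

```python
def build_script(news_items):
    """Return the utterances for a bulletin, in speaking order."""
    if not news_items:
        return ["没有找到可播报的新闻"]

    lines = ["开始播报今日新闻"]
    for i, news in enumerate(news_items, start=1):
        lines.append(f"第{i}条新闻")
        lines.append(f"标题: {news['title']}")
        lines.append(f"来源: {news['source']}")
        # The summary is optional, so skip it when empty or missing
        description = news.get("description")
        if description:
            lines.append("新闻摘要:")
            lines.append(description)
    lines.append("新闻播报完毕")
    return lines


def speak_all(engine, lines):
    """Speak each line with an engine that provides say() and runAndWait()."""
    for line in lines:
        engine.say(line)
        engine.runAndWait()
```

With this split, the script can be logged or previewed in the UI before anything is spoken.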

6. Scheduled Task Management

Scheduled fetching and broadcasting are implemented with the schedule library:

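import threading
import time
from datetime import datetime

import schedule


class ScheduleManager:
    _instance = None
    _running = False
    _thread = None
    _fetch_tasks = {}  # Fetch tasks, keyed by task ID
    _broadcast_tasks = {}  # Broadcast tasks, keyed by task ID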
    
    @classmethod
    def get_instance(cls):
        if cls._instance is None:
            cls._instance = ScheduleManager()
        return cls._instance
    
    def start(self):
        """Start the scheduler thread."""
        if not self._running:
            self._running = True
            self._thread = threading.Thread(target=self._run_scheduler, daemon=True)
            self._thread.start()

    def _run_scheduler(self):
        """Run pending jobs once per second until stopped."""
        while self._running:
            schedule.run_pending()
            time.sleep(1)
            
    def add_broadcast_task(self, task_id, schedule_type, value, time_value, app_instance, count=5):
        """Add a scheduled news-broadcast task."""
        # Remove any existing task with the same ID first
        self.remove_broadcast_task(task_id)

        # Create the new task
        job = None
        if schedule_type == "每天":
            job = schedule.every().day.at(time_value).do(app_instance.start_news_broadcast, count)
        elif schedule_type == "每周":
            # value is the weekday number, 1 (Monday) through 7 (Sunday);
            # reject anything else so that 0 does not silently map to Sunday
            if not 1 <= value <= 7:
                return False
            days = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]
            day_attr = getattr(schedule.every(), days[value-1])
            job = day_attr.at(time_value).do(app_instance.start_news_broadcast, count)
        elif schedule_type == "每月":
            # schedule has no monthly interval, so run a daily job that checks the date
            def monthly_broadcast_job():
                # Only run on the chosen day of the month
                if datetime.now().day == value:
                    app_instance.start_news_broadcast(count)
            job = schedule.every().day.at(time_value).do(monthly_broadcast_job)

        if job:
            self._broadcast_tasks[task_id] = job
            return True
        return False
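
Scheduling arguments typically come from user input, so it can help to validate them before a job is created. The following sketch is a hypothetical helper (validate_schedule is not part of the original project). It assumes the three schedule types used above and a time string in 24-hour HH:MM form.

```python
import re

SCHEDULE_TYPES = ("每天", "每周", "每月")  # daily, weekly, monthly
TIME_PATTERN = re.compile(r"([01]\d|2[0-3]):[0-5]\d")


def validate_schedule(schedule_type, value, time_value):
    """Raise ValueError if the arguments cannot describe a valid task."""
    if schedule_type not in SCHEDULE_TYPES:
        raise ValueError(f"unknown schedule type: {schedule_type!r}")
    if not isinstance(time_value, str) or not TIME_PATTERN.fullmatch(time_value):
        raise ValueError(f"time must be HH:MM, got {time_value!r}")
    # Weekly tasks take a weekday number, 1 (Monday) through 7 (Sunday)
    if schedule_type == "每周" and value not in range(1, 8):
        raise ValueError(f"weekday must be 1-7, got {value!r}")
    # Monthly tasks take a day of the month
    if schedule_type == "每月" and value not in range(1, 32):
        raise ValueError(f"day of month must be 1-31, got {value!r}")
```

Calling this at the top of the add-task method would turn malformed input into a clear error instead of a missed or misplaced job.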

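Challenges and Technical Solutions

1. Multi-Layer Content Extraction Strategy

A major challenge is extracting usable article text from many different websites. We use a multi-layer extraction strategy that combines traditional scraping with the large model: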

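import requests
from bs4 import BeautifulSoup


def extract_content(url, use_selenium=False):
    # Layer 1: try Selenium first when requested (handles dynamically rendered pages)
    if use_selenium:
        try:
            return RSSParser.extract_with_selenium(url)
        except Exception as e:
            print(f"Selenium提取失败: {url}, 错误: {str(e)}")
            # Fall through to the plain HTTP extraction below

    # Layer 2: extract with requests + BeautifulSoup
    try:
        # Send a User-Agent header that mimics the Chrome browser
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...',
            # Other request headers...
        }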
        
        response = requests.get(url, headers=headers, timeout=15)
        response.raise_for_status()
        
        # requests falls back to ISO-8859-1 when the headers give no charset; use the detected encoding instead
        if response.encoding == 'ISO-8859-1':
            response.encoding = response.apparent_encoding

        soup = BeautifulSoup(response.text, 'html.parser')

        # Remove scripts, styles, and other non-content elements
        for element in soup(['script', 'style', 'nav', 'header', 'footer', 'aside', 'form', 'iframe']):
            element.extract()

        # Try several strategies to extract the content
        content = ""

        # Strategy 1: look for common article containers
        article_containers = soup.find_all(['article', 'main', 'div'], class_=lambda x: x and any(term in str(x).lower() for term in ['article', 'content', 'post', 'entry', 'main', 'text', 'body']))

        if article_containers:
            # Use the largest container
            largest_container = max(article_containers, key=lambda x: len(str(x)))
            content = largest_container.get_text(separator='\n', strip=True)

        # Strategy 2: fall back to collecting long paragraphs
        if not content or len(content) < 200:
            paragraphs = soup.find_all('p')
            # Keep long paragraphs, which are likely to be body text
            long_paragraphs = [p.get_text(strip=True) for p in paragraphs if len(p.get_text(strip=True)) > 60]
            if long_paragraphs:
                content = '\n\n'.join(long_paragraphs)

        # Layer 3: the caller may then refine this result with the large model
        return content
        
    except Exception as e:
        print(f"提取内容失败: {url}, 错误: {str(e)}")
        # If plain extraction failed and Selenium has not been tried yet, try it now
        if not use_selenium:
            try:
                return RSSParser.extract_with_selenium(url)
            except Exception as selenium_error:
                print(f"Selenium备选提取也失败: {url}, 错误: {str(selenium_error)}")

        return f"内容提取失败: {str(e)}"
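
The long-paragraph fallback (strategy 2) can be illustrated in isolation. The sketch below deliberately swaps BeautifulSoup for the standard library's html.parser so that it runs without extra dependencies; ParagraphCollector and long_paragraphs are hypothetical names, and the 60-character threshold and skipped tag list are taken from the function above.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of <p> elements that are not inside skipped containers."""

    SKIP = {"script", "style", "nav", "header", "footer", "aside", "form", "iframe"}

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._skip_depth = 0   # how many skipped containers we are inside
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "p" and self._skip_depth == 0:
            self._flush()  # an unclosed previous <p> ends here
            self._in_p = True

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._in_p and self._skip_depth == 0:
            self._buf.append(data)

    def _flush(self):
        text = "".join(self._buf).strip()
        if text:
            self.paragraphs.append(text)
        self._buf = []
        self._in_p = False


def long_paragraphs(html, min_chars=60):
    """Join paragraphs longer than min_chars, which are likely body text."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    collector._flush()  # pick up a trailing unclosed <p>
    return "\n\n".join(p for p in collector.paragraphs if len(p) > min_chars)
```

Short paragraphs such as bylines, captions, and link lists fall below the threshold and are dropped, which is the same heuristic the BeautifulSoup version relies on.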

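2. WebDriver Resource Management

A common problem when using Selenium is that browser processes are not released properly. We address this with a singleton manager that reuses a single driver instance and provides an explicit close_driver method: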

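from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager


class WebDriverManager:
    _instance = None
    _driver = None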
    
    @classmethod
    def get_instance(cls):
        if cls._instance is None:
            cls._instance = WebDriverManager()
        return cls._instance
    
    def get_driver(self):
        """Return the shared WebDriver instance, creating it on first use."""
        if self._driver is None:
            try:
                service = Service(ChromeDriverManager().install())
                options = webdriver.ChromeOptions()
                options.add_argument("--headless")  # Headless mode
                options.add_argument("--disable-gpu")
                options.add_argument("--disable-extensions")
                options.add_argument("--disable-dev-shm-usage")
                options.add_argument("--no-sandbox")

                self._driver = webdriver.Chrome(service=service, options=options)
                print("已初始化WebDriver")
            except Exception as e:
                print(f"初始化WebDriver失败: {e}")
                raise

        return self._driver

    def close_driver(self):
        """Quit the WebDriver and release its resources."""
        if self._driver:
            try:
                self._driver.quit()
            except Exception as e:
                print(f"关闭WebDriver出错: {e}")
            finally:
                self._driver = None
                print("已释放WebDriver资源")
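
A shared driver is only released when its close method is actually called, including on error paths. One common way to guarantee that, shown here as a sketch rather than the project's approach, is a context manager. The name managed_driver is hypothetical, and factory stands for any zero-argument callable that returns an object with a quit() method, such as a function that builds a Selenium driver.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Create a driver, hand it to the with-block, and always quit it afterwards."""
    driver = factory()
    try:
        yield driver
    finally:
        # Runs on normal exit and when the with-block raises
        driver.quit()
```

Usage would look like `with managed_driver(make_driver) as driver: ...`, where make_driver is whatever function constructs the browser. For a long-lived singleton, registering the close method with atexit is another option.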

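3. Monthly Task Implementation

The schedule library has no built-in monthly interval, so we implement monthly tasks with a daily job that checks the current date and only acts on the configured day of the month:

# Monthly task (value, time_value, and app_instance are the arguments of the enclosing add-task method)
def monthly_job():
    # Only run on the chosen day of the month
    if datetime.now().day == value:
        app_instance.start_fetch_scheduled()
job = schedule.every().day.at(time_value).do(monthly_job)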
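
One edge case with the day-check approach: if the chosen day is 29, 30, or 31, shorter months never match and the task is skipped that month. A possible refinement, offered as a sketch and not taken from the project, is to clamp the target day to the last day of the current month. The function name is_monthly_run_day is hypothetical.

```python
import calendar
from datetime import date


def is_monthly_run_day(today, day_of_month):
    """Return True if a monthly task set for day_of_month should run today.

    Days beyond the end of the month are clamped to the month's last day,
    so a task set for the 31st still runs in February.
    """
    last_day = calendar.monthrange(today.year, today.month)[1]
    return today.day == min(day_of_month, last_day)
```

Inside the daily job, the condition would become `is_monthly_run_day(date.today(), value)`.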

4. Link Deduplication

To avoid processing the same article twice, we check each link against the database before extracting its content:

import sqlite3


def link_exists(self, link):
    """Check whether the link already exists in the database."""
    conn = sqlite3.connect(self.db_path)
    cursor = conn.cursor()

    try:
        cursor.execute("SELECT id FROM news WHERE link = ?", (link,))
        result = cursor.fetchone()
        return result is not None
    except Exception as e:
        print(f"检查链接存在性失败: {e}")
        return False
    finally:
        conn.close()

# Usage inside the fetch thread's loop
if self.db.link_exists(link):
    self.update_signal.emit(f"已跳过(数据库中已存在): {title}")
    continue
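
The lookup above can be backed up at the database level so that duplicates are rejected even if two fetches race or the check fails. The following sketch assumes a simplified news table; the actual schema in the repository is not shown in this article, so the column set and the helper names init_db and add_news_once are illustrative. It relies on SQLite's UNIQUE constraint together with INSERT OR IGNORE.

```python
import sqlite3


def init_db(conn):
    """Create a minimal news table whose link column must be unique."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS news (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            title TEXT NOT NULL,
            link TEXT NOT NULL UNIQUE
        )
        """
    )
    conn.commit()


def add_news_once(conn, title, link):
    """Insert the row unless the link already exists; return True if inserted."""
    cursor = conn.execute(
        "INSERT OR IGNORE INTO news (title, link) VALUES (?, ?)",
        (title, link),
    )
    conn.commit()
    # rowcount is 1 for a new row and 0 when the insert was ignored
    return cursor.rowcount == 1
```

The UNIQUE constraint also creates an index on link, so the existence check does not need to scan the whole table.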

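Putting It All Together

The following end-to-end example shows the fetch thread's main loop: it reads each RSS feed, skips known links, extracts and refines the content, generates tags, and saves the result. The broadcast step is shown afterwards: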

def run(self):
    total = len(self.rss_urls)
    total_processed = 0
    total_new = 0
    
    for i, url in enumerate(self.rss_urls):
        try:
            self.update_signal.emit(f"正在处理 ({i+1}/{total}): {url}")
            feed = self.parser.get_feed(url)
            
            if not feed:
                continue
                
            source = feed.feed.title if hasattr(feed.feed, 'title') else url
            
            processed = 0
            new_added = 0
            
            for entry in feed.entries[:100]:  # Process at most 100 items per feed
                title = entry.title if hasattr(entry, 'title') else "无标题"
                link = entry.link if hasattr(entry, 'link') else ""
                description = entry.description if hasattr(entry, 'description') else ""
                pub_date = entry.published if hasattr(entry, 'published') else ""

                if not link:
                    continue

                processed += 1
                total_processed += 1

                # Skip links that are already in the database
                if self.db.link_exists(link):
                    self.update_signal.emit(f"已跳过(数据库中已存在): {title}")
                    continue

                self.update_signal.emit(f"正在提取: {title}")

                # Extract the article body, using Selenium if that option is enabled
                raw_content = self.parser.extract_content(link, self.use_selenium)

                # Refine the content with the large model if that option is enabled
                content = raw_content
                if self.use_llm:
                    self.update_signal.emit(f"正在使用大模型优化正文: {title}")
                    content = self.parser.refine_content_with_llm(title, description, raw_content)

                # Generate tags
                self.update_signal.emit(f"正在生成标签: {title}")
                tags = self.tag_generator.generate_tags(title, description, content)

                # Save to the database
                if self.db.add_news(title, link, description, content, source, pub_date, tags):
                    new_added += 1
                    total_new += 1

                time.sleep(1)  # Avoid sending requests too quickly
            
            self.update_signal.emit(f"完成 {url}: 处理 {processed} 条新闻，新增 {new_added} 条")
            self.news_added_signal.emit(processed, new_added)
                
        except Exception as e:
            self.update_signal.emit(f"处理RSS出错: {url}, 错误: {e}")
    
    self.update_signal.emit(f"抓取完成: 共处理 {total_processed} 条新闻，新增 {total_new} 条")
    self.finished_signal.emit()

News broadcast example:

def start_news_broadcast(self, count=5):
    """Start a news broadcast."""
    self.log_message(f"开始新闻播报，播报{count}条最新新闻")

    # Fetch the latest news items
    news_list = self.db.get_latest_news(count)

    if not news_list:
        self.log_message("没有找到可播报的新闻")
        return

    # Broadcast on a separate thread so the UI stays responsive
    broadcast_thread = threading.Thread(
        target=self.broadcaster.broadcast_news,
        args=(news_list,),
        daemon=True
    )
    broadcast_thread.start()

    # Log what is being broadcast
    for i, news in enumerate(news_list):
        self.log_message(f"播报第{i+1}条: {news['title']} - {news['source']}")