
Extract "factual sentences (facts)" from a collection of news articles and save them to a new file

Sample code:

import os
import re
import json
import nltk
from tqdm import tqdm
from transformers import pipeline

nltk.download('punkt')
from nltk.tokenize import sent_tokenize

# Classify each sentence as Fact or Opinion using
# lighteternal/fact-or-opinion-xlmr-el
fact_opinion_classifier = pipeline("text-classification", model="lighteternal/fact-or-opinion-xlmr-el")

def wr_dict(filename, dic):
    # Append one record to a JSON array on disk, creating the file if needed.
    if not os.path.isfile(filename):
        data = [dic]
        with open(filename, 'w') as f:
            json.dump(data, f)
    else:
        with open(filename, 'r') as f:
            data = json.load(f)
        data.append(dic)
        with open(filename, 'w') as f:
            json.dump(data, f)

def rm_file(file_path):
    # Remove a stale output file so reruns start clean.
    if os.path.exists(file_path):
        os.remove(file_path)

with open('datasource/news_filter_dup.json', 'r') as file:
    data = json.load(file)

save_path = 'datasource/news_filter_fact.json'
print(len(data))
print(data[0].keys())
rm_file(save_path)

for d in tqdm(data):
    fact_list = []
    body = d['body']
    paragraphs = re.split(r'\n{2,}', body)
    for text in paragraphs:
        sentences = sent_tokenize(text)
        for sentence in sentences:
            try:
                sentence_result = fact_opinion_classifier(sentence)[0]
                # LABEL_1 means the sentence is a fact
                if sentence_result["label"] == "LABEL_1":
                    fact_list.append(sentence)
            except Exception as e:
                print(f"Skipped sentence ({e}): {sentence}")
    d['fact_list'] = fact_list
    wr_dict(save_path, d)

💡 What the code does:

Split each news body into sentences, classify each sentence with the model lighteternal/fact-or-opinion-xlmr-el, keep the ones labeled "Fact", and write the results to a new JSON file.
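The control flow above can be exercised without downloading the model by swapping in a stub classifier. This is a minimal sketch: `stub_classifier`, `extract_facts`, and the digit-based "fact" rule are hypothetical stand-ins (the real script uses the Hugging Face pipeline and `nltk.sent_tokenize`):

```python
import re

# Hypothetical stub standing in for the Hugging Face pipeline, so the
# control flow can be tested without downloading the model.
def stub_classifier(sentence):
    # Pretend sentences containing a digit are facts (LABEL_1).
    label = "LABEL_1" if re.search(r"\d", sentence) else "LABEL_0"
    return [{"label": label, "score": 1.0}]

def extract_facts(body, classifier):
    """Split a body into paragraphs/sentences and keep LABEL_1 ones."""
    fact_list = []
    for paragraph in re.split(r"\n{2,}", body):
        # Naive sentence split on sentence-ending punctuation; the real
        # script uses nltk.sent_tokenize instead.
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph.strip()):
            if sentence and classifier(sentence)[0]["label"] == "LABEL_1":
                fact_list.append(sentence)
    return fact_list

body = "NASA announced a discovery.\n\nThe planet is 1.3 times the size of Earth."
print(extract_facts(body, stub_classifier))
# ['The planet is 1.3 times the size of Earth.']
```

Passing the classifier in as an argument also makes the real pipeline easy to plug in later.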


🔍 Step-by-step walkthrough

📦 1. Import dependencies & initialize the model

import os, re, json, nltk
from tqdm import tqdm
from transformers import pipeline
nltk.download('punkt')
from nltk.tokenize import sent_tokenize
  • nltk.download('punkt'): downloads the Punkt sentence tokenizer used for splitting English text into sentences
  • pipeline("text-classification", model=...): loads the Fact/Opinion classification model
fact_opinion_classifier = pipeline("text-classification", model="lighteternal/fact-or-opinion-xlmr-el")

🧱 2. Utility functions

def wr_dict(filename,dic):
    ...
  • Appends each processed news item to the JSON output file, one record at a time
def rm_file(file_path):
    ...
  • If the output file already exists, deletes it first so stale content doesn't mix in
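The append semantics of wr_dict can be checked in isolation with a temporary file; this sketch reimplements the function standalone:

```python
import json
import os
import tempfile

def wr_dict(filename, dic):
    # Create the file with a one-element list, or read-append-rewrite.
    if not os.path.isfile(filename):
        data = [dic]
    else:
        with open(filename, 'r') as f:
            data = json.load(f)
        data.append(dic)
    with open(filename, 'w') as f:
        json.dump(data, f)

path = os.path.join(tempfile.mkdtemp(), 'out.json')
wr_dict(path, {'title': 'a'})
wr_dict(path, {'title': 'b'})
with open(path) as f:
    print(json.load(f))  # [{'title': 'a'}, {'title': 'b'}]
```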

📂 3. Load the input data (news bodies)

with open('datasource/news_filter_dup.json', 'r') as file:
    data = json.load(file)
  • Loads the deduplicated news file

🧹 4. Main processing loop (extract factual sentences)

for d in tqdm(data):  # iterate over each news item
    fact_list = []
    body = d['body']  # the article body
  • First split the body into paragraphs on blank lines (\n\n)
  • Then split each paragraph into sentences with sent_tokenize
    paragraphs = re.split(r'\n{2,}', body)
    for text in paragraphs:
        sentences = sent_tokenize(text)
  • Classify each sentence with the model (it returns a label)
        for sentence in sentences:
            try:
                sentence_result = fact_opinion_classifier(sentence)[0]
                if sentence_result["label"] == "LABEL_1":
                    fact_list.append(sentence)
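The paragraph-splitting regex treats any run of two or more newlines as one paragraph break; a quick standalone check:

```python
import re

# \n{2,} matches two OR MORE consecutive newlines, so a triple newline
# still produces exactly one paragraph break, not an empty paragraph.
body = "Para one.\n\nPara two.\n\n\nPara three."
paragraphs = re.split(r'\n{2,}', body)
print(paragraphs)  # ['Para one.', 'Para two.', 'Para three.']
```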

🔵 LABEL_1 means Fact
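The label check can be pulled into a tiny helper (the name `is_fact` and the `min_score` threshold are hypothetical additions, useful if you later want to filter on the classifier's confidence):

```python
def is_fact(result, min_score=0.0):
    # result is one dict from the pipeline, shaped like
    # {'label': 'LABEL_1', 'score': 0.98}
    return result["label"] == "LABEL_1" and result["score"] >= min_score

print(is_fact({"label": "LABEL_1", "score": 0.98}))  # True
print(is_fact({"label": "LABEL_0", "score": 0.99}))  # False
```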


📝 5. Save the results

    d['fact_list'] = fact_list
    wr_dict(save_path,d)
  • Attaches the extracted factual sentences to the original record
  • Writes the record to the output path: datasource/news_filter_fact.json
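Note that wr_dict re-reads and rewrites the entire JSON array on every call, which gets quadratically slower as the file grows. A common alternative is appending one JSON object per line (JSON Lines); this is a sketch under the assumption that downstream code can read JSONL rather than a single array:

```python
import json
import os
import tempfile

def wr_jsonl(filename, dic):
    # Append one record per line; no need to re-read the file each time.
    with open(filename, 'a') as f:
        f.write(json.dumps(dic) + '\n')

path = os.path.join(tempfile.mkdtemp(), 'out.jsonl')
wr_jsonl(path, {'title': 'a', 'fact_list': []})
wr_jsonl(path, {'title': 'b', 'fact_list': ['x']})
with open(path) as f:
    records = [json.loads(line) for line in f]
print(len(records))  # 2
```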

✅ Worked example

One record from the input JSON file (news_filter_dup.json):

{
  "title": "NASA discovers new exoplanet",
  "body": "NASA announced a new discovery today.\n\nThey found a planet that is 1.3 times the size of Earth. It orbits a star 300 light-years away.\n\nScientists say it might have liquid water."
}

Model classification process:

The sentences after splitting:

  1. NASA announced a new discovery today. ⟶ Opinion
  2. They found a planet that is 1.3 times the size of Earth. ⟶ ✅ Fact
  3. It orbits a star 300 light-years away. ⟶ ✅ Fact
  4. Scientists say it might have liquid water. ⟶ Opinion

The corresponding record in the output JSON file (news_filter_fact.json):

{
  "title": "NASA discovers new exoplanet",
  "body": "NASA announced a new discovery today.\n\nThey found a planet that is 1.3 times the size of Earth. It orbits a star 300 light-years away.\n\nScientists say it might have liquid water.",
  "fact_list": [
    "They found a planet that is 1.3 times the size of Earth.",
    "It orbits a star 300 light-years away."
  ]
}

🧠 Summary

Module and its role:

  • re + sent_tokenize: split the body into paragraphs and sentences
  • pipeline(...xlmr-el): classify each sentence as Fact or Opinion
  • LABEL_1: the Fact label; only these sentences are kept
  • fact_list: collects all factual sentences for one article
  • wr_dict: appends each news item, with its facts, to the output file
