
Filtering news items whose body exceeds 1024 tokens out of raw news data and saving them to a new file

news 2025/10/9 7:30:58

Example code:

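import os
import json
import nltk
from tqdm import tqdm

# Note: nltk.word_tokenize needs the Punkt tokenizer data, e.g. nltk.download('punkt')


def wr_dict(filename, dic):
    """Append one record to the JSON list stored in filename."""
    if not os.path.isfile(filename):
        # First record: create the file with a one-element list
        data = []
        data.append(dic)
        with open(filename, 'w') as f:
            json.dump(data, f)
    else:
        # Later records: load the list, append, and rewrite the whole file
        with open(filename, 'r') as f:
            data = json.load(f)
            data.append(dic)
        with open(filename, 'w') as f:
            json.dump(data, f)


def rm_file(file_path):
    """Delete file_path if it exists."""
    if os.path.exists(file_path):
        os.remove(file_path)


def count_tokens(text):
    """Return the number of tokens NLTK finds in text."""
    tokens = nltk.word_tokenize(text)
    return len(tokens)


with open('datasource/source/news_data.json', 'r', encoding='utf-8') as file:
    data = json.load(file)

save_file = 'datasource/news_filter_token.json'
rm_file(save_file)  # clear output left by an earlier run
count = 0
for d in tqdm(data):
    body = d['body']
    if count_tokens(body) > 1024:
        wr_dict(save_file, d)
        count += 1

print(f'final count = {count}')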

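✅ What the script does: it selects the news items from the raw data whose body has more than 1024 tokens and saves them to a new file.

🧠 In brief

Input file: news_data.json (contains multiple news items)
Output file: news_filter_token.json
Filter rule: a news item is kept only if its body contains more than 1024 tokens

🧩 Step-by-step walkthrough

1. Import the modules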

import os
import json
import nltk
from tqdm import tqdm

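os: file operations (checking whether a file exists, deleting it)
json: reading and writing JSON files
nltk: natural language processing library, used here for tokenization
tqdm: displays a progress bar while the loop runs

2. Define `wr_dict()`: write one record to a JSON file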

def wr_dict(filename, dic):
    if not os.path.isfile(filename):  # the file does not exist yet
        data = []
        data.append(dic)
        with open(filename, 'w') as f:
            json.dump(data, f)
    else:                             # the file exists: load it, then append
        with open(filename, 'r') as f:
            data = json.load(f)
            data.append(dic)
        with open(filename, 'w') as f:
            json.dump(data, f)

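This function appends news records to the JSON file one at a time instead of saving everything in a single write. Each call reloads the whole file and writes it back, so the cost of a call grows with the number of records already saved.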
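
If all matching records fit in memory, the repeated read-and-rewrite can be avoided by collecting them in a list and writing once at the end. The following is only a minimal sketch of that idea; the helper name `save_records` and the demo data are hypothetical and not part of the original script.

```python
import json
import os
import tempfile


def save_records(filename, records):
    """Write the whole list of records to a JSON file in one pass (hypothetical helper)."""
    with open(filename, 'w', encoding='utf-8') as f:
        # ensure_ascii=False keeps non-ASCII text readable in the output file
        json.dump(records, f, ensure_ascii=False)


# Demo with made-up records, written to a temporary directory
demo_records = [{'title': 'a', 'body': 'x'}, {'title': 'b', 'body': 'y'}]
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.json')
save_records(demo_path, demo_records)

with open(demo_path, 'r', encoding='utf-8') as f:
    print(json.load(f) == demo_records)  # True
```

Writing once also means an existing output file is simply overwritten, so no separate cleanup step is needed with this variant.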

3. Define `rm_file()`: delete the file if it exists

def rm_file(file_path):
    if os.path.exists(file_path):
        os.remove(file_path)

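Call rm_file(save_file) once before the filtering loop to clear output left by an earlier run. Otherwise wr_dict() appends to the existing file and a second run duplicates the records.

4. Define `count_tokens()`: count the tokens in a piece of text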

def count_tokens(text):
    tokens = nltk.word_tokenize(text)  # tokenize with nltk
    return len(tokens)                 # return the number of tokens

For example (note that the punctuation mark counts as a token too):

text = "I love Python programming!"
tokens = nltk.word_tokenize(text)
print(tokens)  # ['I', 'love', 'Python', 'programming', '!']
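
`nltk.word_tokenize` relies on the Punkt tokenizer data (`punkt`, or `punkt_tab` in recent NLTK releases), which is downloaded separately; without it the call raises `LookupError`. The sketch below shows one way to keep a script running in that case. The regex fallback is an assumption made for illustration: it is not NLTK's algorithm, and its counts can differ from NLTK's (for example on contractions), so it is only suitable for a rough estimate.

```python
import re


def count_tokens(text):
    """Count tokens with NLTK when it is usable, else with a rough regex split."""
    try:
        import nltk
        return len(nltk.word_tokenize(text))
    except (ImportError, LookupError):
        # Fallback (illustrative assumption, not NLTK's rules): each run of
        # word characters and each single punctuation mark is one token.
        return len(re.findall(r"\w+|[^\w\s]", text))


print(count_tokens("I love Python programming!"))  # 5
```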

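5. Read the raw news data

with open('datasource/source/news_data.json', 'r', encoding='utf-8') as file:
    data = json.load(file)

Each entry in this JSON file is expected to look like this:

{
  "title": "Some news headline",
  "body": "Body text body text body text......"
}

6. Filter the news items that meet the condition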

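save_file = 'datasource/news_filter_token.json'
rm_file(save_file)  # clear output left by an earlier run
count = 0
for d in tqdm(data):  # iterate over all news items
    body = d['body']
    if count_tokens(body) > 1024:  # body has more than 1024 tokens
        wr_dict(save_file, d)      # save this news item
        count = count + 1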

📌 `tqdm(data)` displays a progress bar, so you can see how many news items have been processed so far.
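
The filter rule can also be written as a small function that is easy to test on its own. This is a sketch under stated assumptions: the name `filter_long_articles`, its parameters, and the whitespace-based stand-in tokenizer are hypothetical; in the real script you would pass the NLTK-based `count_tokens` as `count_fn`. Unlike `d['body']`, it skips records without a text body instead of raising `KeyError`.

```python
def whitespace_count(text):
    """Stand-in tokenizer for the demo: counts whitespace-separated words."""
    return len(text.split())


def filter_long_articles(records, min_tokens=1024, count_fn=whitespace_count):
    """Return the records whose 'body' has strictly more than min_tokens tokens."""
    kept = []
    for record in records:
        body = record.get('body')
        if not isinstance(body, str):
            continue  # skip records with a missing or non-text body
        if count_fn(body) > min_tokens:
            kept.append(record)
    return kept


sample = [
    {'title': 'short', 'body': 'word ' * 10},
    {'title': 'boundary', 'body': 'word ' * 1024},  # exactly 1024: not kept
    {'title': 'long', 'body': 'word ' * 1025},
    {'title': 'no body'},
]
print([r['title'] for r in filter_long_articles(sample)])  # ['long']
```

The boundary record shows that the comparison is strict: a body with exactly 1024 tokens is dropped, matching the `> 1024` test in the loop above.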

7. Print the number of news items that were kept

print(f'final count = {count}')

🧪 Example input and output

Input file: `news_data.json`

[
  {
    "title": "Short news",
    "body": "This is a very short news item."
  },
  {
    "title": "Long news",
    "body": "(assume this body has more than 1024 tokens)"
  }
]

Output file: `news_filter_token.json`

[
  {
    "title": "Long news",
    "body": "(body with more than 1024 tokens)"
  }
]

✅ Summary

Item	Description
Input	`news_data.json`
Filter rule	token count of `body` > 1024
Output	`news_filter_token.json`
Core functions	`nltk.word_tokenize()` for tokenization + `tqdm` for the progress bar