
Fetching the news content for each URL with the newspaper library and saving the extracted article text to a file

news 2025/10/9 7:30:58

Example code:

from newspaper import Article
from newspaper import Config
import json
from tqdm import tqdm
import os
import requests

with open('datasource/api/news_api.json', 'r') as file:
    data = json.load(file)

print(len(data))
save_path = 'datasource/source/news_data.json'
def wr_dict(filename, dic):
    # Append one record to the JSON array stored in `filename`.
    if os.path.isfile(filename):
        with open(filename, 'r') as f:
            data = json.load(f)
    else:
        # First record: start a new list.
        data = []
    data.append(dic)
    with open(filename, 'w') as f:
        json.dump(data, f)
            
def rm_file(file_path):
    if os.path.exists(file_path):
        os.remove(file_path)
# rm_file(save_path)

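# Load the records saved by earlier runs; start empty on the first run.
have = []
if os.path.isfile(save_path):
    with open(save_path, 'r') as file:
        have = json.load(file)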

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.headers = {'Cookie': "cookie1=xxx;cookie2=zzzz"}
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

RETRY_ATTEMPTS = 1
count = 0
def parse_article(url):
    # Download and parse one article; return its text, or None on failure.
    for attempt in range(RETRY_ATTEMPTS):
        try:
            article = Article(url, config=config)
            article.download()
            article.parse()
            return article.text
        except Exception:
            # Try again if attempts remain.
            continue
    return None


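# URLs already saved by earlier runs. Skipping by URL rather than by index
# stays correct even when some downloads failed and were never saved.
seen = {item['url'] for item in have}
for d in tqdm(data):
    url = d['url']
    if url in seen:
        continue
    maintext = parse_article(url.strip())
    if maintext is None:
        continue
    d['body'] = maintext
    wr_dict(save_path, d)
    seen.add(url)
    count = count + 1
print(count + len(have))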

This code takes a dataset of news URLs, fetches the article behind each URL, and saves the extracted body text to a file.

1. Import the required libraries

from newspaper import Article
from newspaper import Config
import json
from tqdm import tqdm
import os
import requests

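newspaper: extracts article content from news sites. Article fetches and parses a single article; Config holds the request headers and other settings.
json: reads and writes JSON data.
tqdm: displays a progress bar.
os: checks whether files exist and removes them.
requests: an HTTP client. The script never calls it directly (newspaper issues its own requests), so this import can be removed.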

2. Load the news URL data

with open('datasource/api/news_api.json', 'r') as file:
    data = json.load(file)

print(len(data))

Read the news URL data from datasource/api/news_api.json into the data variable.
Print the length of data to show how many news URLs it contains.

3. Define `wr_dict`, the function that writes to the JSON file

def wr_dict(filename, dic):
    # Append one record to the JSON array stored in `filename`.
    if os.path.isfile(filename):
        with open(filename, 'r') as f:
            data = json.load(f)
    else:
        # First record: start a new list.
        data = []
    data.append(dic)
    with open(filename, 'w') as f:
        json.dump(data, f)

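wr_dict appends a news record (dic) to the given JSON file. If the file does not exist, it creates the file and writes the record as a one-element list. If the file already exists, it reads the current contents, appends the new record, and writes everything back.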
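
Rewriting the whole file for every article means each save gets slower as the file grows. One possible alternative, sketched here as an assumption and not as the original author's method, is to append one JSON object per line (the JSON Lines convention). The helper names append_record and load_records and the .jsonl file name are hypothetical.

```python
import json
import os
import tempfile


def append_record(filename, record):
    """Append one record as a single JSON line (hypothetical helper)."""
    with open(filename, 'a', encoding='utf-8') as f:
        f.write(json.dumps(record, ensure_ascii=False) + '\n')


def load_records(filename):
    """Read all records back; return an empty list if the file is missing."""
    if not os.path.isfile(filename):
        return []
    with open(filename, 'r', encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]


# Small demonstration in a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), 'news_data.jsonl')
append_record(demo_path, {'url': 'https://example.com/a', 'body': 'first'})
append_record(demo_path, {'url': 'https://example.com/b', 'body': 'second'})
print(len(load_records(demo_path)))
```

Adopting this format would also mean reading the saved file with load_records instead of a single json.load call.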

4. Define `rm_file`, the function that deletes a file

def rm_file(file_path):
    if os.path.exists(file_path):
        os.remove(file_path)

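rm_file deletes the file at the given path if it exists. The full script leaves the call rm_file(save_path) commented out; enabling it discards the saved results so that the run starts from scratch.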

5. Load the news data saved so far

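have = []
if os.path.isfile(save_path):
    with open(save_path, 'r') as file:
        have = json.load(file)

Read the news records already stored in save_path into the have variable, so that a rerun can skip work that is already done instead of downloading and saving the same articles again. On the first run the file does not exist yet, so have stays an empty list.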

6. Configure the request headers and the retry count

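USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
config = Config()
config.headers = {'Cookie': "cookie1=xxx;cookie2=zzzz"}
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

RETRY_ATTEMPTS = 1
count = 0

Set the User-Agent so that requests look like they come from a browser, and set the request timeout to 10 seconds.
config.headers attaches a Cookie header to mimic a logged-in user, which avoids failures on sites that reject requests without a cookie. The cookie values shown are placeholders.
RETRY_ATTEMPTS is the number of download attempts per URL, and count tracks how many articles the current run has saved.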

7. Define the article parsing function `parse_article`

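def parse_article(url):
    # Download and parse one article; return its text, or None on failure.
    for attempt in range(RETRY_ATTEMPTS):
        try:
            article = Article(url, config=config)
            article.download()
            article.parse()
            return article.text
        except Exception:
            # Try again if attempts remain.
            continue
    return None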

parse_article downloads and parses the news article at the given URL. On success it returns the article body text. On failure (for example, a network problem or an invalid URL) it returns None.
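
Retrying tends to help most when there is a pause between attempts. The following is a minimal sketch of a generic retry helper with exponential backoff, not the original author's implementation. The name fetch_with_retry, its parameters, and the flaky_fetch stand-in are all hypothetical; in the real script the fetch callable would wrap the Article download and parse calls.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fetch(url) up to `attempts` times, doubling the pause after each failure.

    Returns the fetched value, or None if every attempt raised an exception.
    """
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception:
            # Pause only if another attempt will follow.
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    return None


# Stand-in for a download that fails twice and then succeeds.
calls = {'n': 0}


def flaky_fetch(url):
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('temporary failure')
    return 'article text for ' + url


print(fetch_with_retry(flaky_fetch, 'https://example.com/a', base_delay=0.01))
```

Passing the sleep function as a parameter keeps the helper easy to test without real delays.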

8. Process each news URL, then download and save the article body

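# URLs already saved by earlier runs. Skipping by URL rather than by index
# stays correct even when some downloads failed and were never saved.
seen = {item['url'] for item in have}
for d in tqdm(data):
    url = d['url']
    if url in seen:
        continue
    maintext = parse_article(url.strip())
    if maintext is None:
        continue
    d['body'] = maintext
    wr_dict(save_path, d)
    seen.add(url)
    count = count + 1
print(count + len(have))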