
R Web Scraping in Practice: How to Scrape Paginated Links and Save Them in Bulk

news 2025/7/8 16:00:44

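1. Introduction

Web scraping is an important skill in data collection and analysis. R is best known for statistical analysis and data visualization, but it is also well equipped for scraping. This article shows how to use R to scrape links from paginated web pages and save the data in bulk to a local file (such as CSV or TXT). The approach applies to news aggregation, e-commerce data collection, academic research, and similar scenarios.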

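2. Preparation

Before starting, make sure the following R packages are installed:

**rvest**: HTML parsing and data extraction
**httr**: HTTP requests (handles GET/POST requests)
**dplyr**: data cleaning and wrangling
**stringr**: string processing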
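
As a quick check before running anything else, the sketch below reports which of the four packages are not yet installed. The variable names `required_pkgs` and `missing_pkgs` are only illustrative, and the install call is left commented out so the snippet never touches the network on its own.

```r
# Packages used throughout this article
required_pkgs <- c("rvest", "httr", "dplyr", "stringr")

# requireNamespace() returns TRUE when a package can be loaded
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install from CRAN
} else {
  message("All required packages are installed.")
}
```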

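3. Analyzing the Target

Suppose we want to scrape a news site (for example, the sample site **https://example-news.com**) whose pagination is structured as follows:

First page: **https://example-news.com/page/1**
Second page: **https://example-news.com/page/2**
…
Page N: **https://example-news.com/page/N**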
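
Because the page number is the only part of the URL that changes, the full list of page URLs can be generated up front. This is a minimal sketch; `n_pages` is an assumed value chosen only for illustration.

```r
base_url <- "https://example-news.com/page/"
n_pages <- 3  # assumed page count, for illustration only

# paste0() is vectorized, so one call builds every page URL;
# seq_len() also behaves correctly when n_pages is 0
page_urls <- paste0(base_url, seq_len(n_pages))
print(page_urls)
```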

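Our task is to:

Scrape the news titles and links from every page
Store them in a local CSV file

4. Implementation Steps

4.1 Fetching the Links on a Single Page

First, we write a function **scrape_page()** that scrapes the news titles and links from a single page: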

library(rvest)
library(httr)
library(dplyr)
library(stringr)

scrape_page <- function(page_url) {
  # Send the HTTP request
  response <- GET(page_url, user_agent("Mozilla/5.0"))
  if (status_code(response) != 200) {
    stop("Failed to fetch the page")
  }

  # Parse the HTML
  html_content <- read_html(response)

  # Extract news titles and links (assumes each title is an <a> inside an <h2>)
  titles <- html_content %>%
    html_nodes("h2 a") %>%
    html_text(trim = TRUE)
  links <- html_content %>%
    html_nodes("h2 a") %>%
    html_attr("href")

  # Return a data frame
  data.frame(Title = titles, URL = links, stringsAsFactors = FALSE)
}
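
To see what the `h2 a` selector picks up without making a network request, the same extraction can be tried on an inline HTML string. The markup below is invented for illustration and is not taken from any real site. The sketch also resolves relative `href` values with `xml2::url_absolute()`, a step that `scrape_page()` does not perform and that only matters if the target site uses relative links.

```r
library(rvest)

# Hypothetical markup standing in for one page of the news site
sample_html <- '
<html><body>
  <h2><a href="/news/1"> First headline </a></h2>
  <h2><a href="https://example-news.com/news/2">Second headline</a></h2>
  <p><a href="/about">About us</a></p>
</body></html>'

page_url <- "https://example-news.com/page/1"
doc <- read_html(sample_html)

# Only <a> elements nested inside an <h2> match the selector
nodes <- html_nodes(doc, "h2 a")
titles <- html_text(nodes, trim = TRUE)
links <- html_attr(nodes, "href")

# Turn relative hrefs into absolute URLs, using the page URL as the base
links <- xml2::url_absolute(links, page_url)

sample_news <- data.frame(Title = titles, URL = links, stringsAsFactors = FALSE)
print(sample_news)
```

The "About us" link is skipped because it is not inside an `<h2>`, which is the behavior the selector is meant to give.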

4.2 Scraping Multiple Pages

Because the site is paginated, we need to loop over multiple pages. Here we scrape the first 5 pages as an example:

base_url <- "https://example-news.com/page/"
max_pages <- 5
all_news <- data.frame()

for (page in 1:max_pages) {
  page_url <- paste0(base_url, page)
  cat("Scraping:", page_url, "\n")

  tryCatch({
    page_data <- scrape_page(page_url)
    all_news <- bind_rows(all_news, page_data)
  }, error = function(e) {
    cat("Error scraping page", page, ":", e$message, "\n")
  })

  # Pause between requests to avoid getting the IP banned
  Sys.sleep(2)
}

# Inspect the scraped data
head(all_news)
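
The loop's error handling can be exercised offline by passing the scraper in as an argument. The sketch below is one possible variant, not the article's own code: `scrape_pages()` and `fake_scraper()` are hypothetical helpers, and base R `rbind()` stands in for `bind_rows()` so the example runs without extra packages. The fake scraper fails on page 2 to show that one bad page does not abort the run.

```r
# Hypothetical helper: apply scrape_fn to each URL, skipping pages that fail
scrape_pages <- function(page_urls, scrape_fn, delay = 0) {
  results <- lapply(page_urls, function(u) {
    out <- tryCatch(
      scrape_fn(u),
      error = function(e) {
        message("Error scraping ", u, ": ", conditionMessage(e))
        NULL  # mark the failed page so it can be dropped below
      }
    )
    Sys.sleep(delay)  # pause between requests
    out
  })
  results <- Filter(Negate(is.null), results)
  if (length(results) == 0) {
    return(data.frame(Title = character(), URL = character(), stringsAsFactors = FALSE))
  }
  do.call(rbind, results)
}

# Stand-in for scrape_page() that needs no network and fails on page 2
fake_scraper <- function(page_url) {
  if (grepl("/2$", page_url)) stop("HTTP 500")
  data.frame(Title = paste("News from", page_url), URL = page_url,
             stringsAsFactors = FALSE)
}

demo_urls <- paste0("https://example-news.com/page/", 1:3)
demo_news <- scrape_pages(demo_urls, fake_scraper)
print(demo_news)
```

With the real scraper, the call would be `scrape_pages(page_urls, scrape_page, delay = 2)`.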

4.3 Deduplication

Because some sites show the same news item on more than one page, we need to remove duplicates:

all_news <- all_news %>%
  distinct(URL, .keep_all = TRUE)
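
The effect of deduplicating on `URL` is easiest to see on a small, made-up data frame. The sketch below uses base R `duplicated()`, which keeps the first occurrence of each URL just as `distinct(URL, .keep_all = TRUE)` does. The rows in `toy_news` are invented for illustration.

```r
# Invented rows: the third entry repeats the first URL with a different title
toy_news <- data.frame(
  Title = c("Story A", "Story B", "Story A (repeated on page 2)"),
  URL = c("https://example-news.com/news/a",
          "https://example-news.com/news/b",
          "https://example-news.com/news/a"),
  stringsAsFactors = FALSE
)

# duplicated() flags every occurrence after the first, so negating it
# keeps one row per URL along with that row's other columns
deduped <- toy_news[!duplicated(toy_news$URL), , drop = FALSE]
print(deduped)
```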

4.4 Saving to a CSV File

Finally, save the data locally:

write.csv(all_news, "news_links.csv", row.names = FALSE)
cat("Data saved to 'news_links.csv'")
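
Reading the file back is a cheap way to confirm that the save worked. The sketch below writes a made-up data frame to a temporary directory as CSV, writes the URLs alone to a TXT file (the other output format mentioned in the introduction), and reloads both. The sample rows and the use of `tempdir()` are assumptions for illustration. Setting `fileEncoding = "UTF-8"` is optional, but it can help when titles contain non-ASCII characters.

```r
sample_news <- data.frame(
  Title = c("Story A", "Story B"),
  URL = c("https://example-news.com/news/a", "https://example-news.com/news/b"),
  stringsAsFactors = FALSE
)

csv_path <- file.path(tempdir(), "news_links.csv")
txt_path <- file.path(tempdir(), "news_links.txt")

# CSV keeps both columns; the TXT file holds one URL per line
write.csv(sample_news, csv_path, row.names = FALSE, fileEncoding = "UTF-8")
writeLines(sample_news$URL, txt_path)

# Read both files back to confirm nothing was lost
reloaded <- read.csv(csv_path, stringsAsFactors = FALSE, fileEncoding = "UTF-8")
reloaded_urls <- readLines(txt_path)
cat("Rows saved:", nrow(reloaded), "\n")
```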

5. Complete Code

library(rvest)
library(httr)
library(dplyr)
library(stringr)

# Proxy configuration
proxyHost <- "www.16yun.cn"
proxyPort <- "5445"
proxyUser <- "16QMSOML"
proxyPass <- "280651"

# Single-page scraping function (with proxy support added)
scrape_page <- function(page_url) {
  # Set up the proxy
  proxy_config <- use_proxy(
    url = proxyHost,
    port = as.integer(proxyPort),
    username = proxyUser,
    password = proxyPass
  )

  # Send the HTTP request (through the proxy)
  response <- GET(page_url, user_agent("Mozilla/5.0"), proxy_config)
  if (status_code(response) != 200) {
    stop(paste("Failed to fetch the page. Status code:", status_code(response)))
  }

  # Parse the HTML
  html_content <- read_html(response)

  # Extract news titles and links (assumes each title is an <a> inside an <h2>)
  titles <- html_content %>%
    html_nodes("h2 a") %>%
    html_text(trim = TRUE)
  links <- html_content %>%
    html_nodes("h2 a") %>%
    html_attr("href")

  # Return a data frame
  data.frame(Title = titles, URL = links, stringsAsFactors = FALSE)
}

# Scrape multiple pages
base_url <- "https://example-news.com/page/"
max_pages <- 5
all_news <- data.frame()

for (page in 1:max_pages) {
  page_url <- paste0(base_url, page)
  cat("Scraping:", page_url, "\n")

  tryCatch({
    page_data <- scrape_page(page_url)
    all_news <- bind_rows(all_news, page_data)
  }, error = function(e) {
    cat("Error scraping page", page, ":", e$message, "\n")
  })

  # Random delay of 1 to 3 seconds to avoid getting banned
  Sys.sleep(sample(1:3, 1))
}

# Remove duplicates
all_news <- all_news %>%
  distinct(URL, .keep_all = TRUE)

# Save to a CSV file
write.csv(all_news, "news_links.csv", row.names = FALSE)
cat("Data saved to 'news_links.csv'")
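
The script above keeps the proxy host and credentials in the source file. One possible alternative, sketched below, is to read them from environment variables so they stay out of version control. The variable names (`SCRAPER_PROXY_HOST` and so on) and the helper `read_proxy_settings()` are hypothetical and not part of the original script. `httr::use_proxy()` is only called when a host is actually configured.

```r
# Hypothetical helper: collect proxy settings from environment variables
read_proxy_settings <- function() {
  list(
    host = Sys.getenv("SCRAPER_PROXY_HOST", unset = ""),
    port = as.integer(Sys.getenv("SCRAPER_PROXY_PORT", unset = "0")),
    user = Sys.getenv("SCRAPER_PROXY_USER", unset = ""),
    pass = Sys.getenv("SCRAPER_PROXY_PASS", unset = "")
  )
}

settings <- read_proxy_settings()
proxy_config <- NULL

if (nzchar(settings$host) && requireNamespace("httr", quietly = TRUE)) {
  # Pass NULL instead of empty strings so no credentials are sent
  proxy_config <- httr::use_proxy(
    url = settings$host,
    port = settings$port,
    username = if (nzchar(settings$user)) settings$user else NULL,
    password = if (nzchar(settings$pass)) settings$pass else NULL
  )
} else {
  message("No proxy configured; requests would go out directly.")
}
```

Inside `scrape_page()`, `proxy_config` could then be passed to `GET()` when it is not `NULL`.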