
Web Scraping Frameworks and Libraries

news 2025/10/15 18:37:32

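Scraping frameworks and libraries are the core tools for collecting data from the web; they help developers extract structured data from web pages efficiently.

Requests: sends HTTP requests.

BeautifulSoup: parses HTML and XML.

Scrapy: a full-featured crawling framework suited to large-scale crawls.

Selenium: drives a real browser to handle JavaScript-rendered pages.

PyQuery: an HTML parsing library with a jQuery-like API.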

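I. Common Scraping Libraries (Flexible and Lightweight)

1. Requests

Features: an HTTP library for sending GET/POST requests and managing cookies and sessions.

Use case: fetching simple pages, usually paired with a parsing library such as BeautifulSoup.

Example:

import requests

# The timeout keeps a stalled server from hanging the script
response = requests.get("https://example.com", timeout=10)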

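2. BeautifulSoup

Features: an HTML/XML parsing library that supports several parsers (such as lxml and html.parser).

Use case: parsing static pages and extracting tag contents.

Example:

from bs4 import BeautifulSoup

html_content = "<html><body><h1>Hello</h1></body></html>"  # sample input
soup = BeautifulSoup(html_content, "lxml")  # the "lxml" parser requires the lxml package
title = soup.find("h1").text  # text of the first <h1>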

3. lxml

Features: a high-performance XML/HTML parsing library with XPath support.

Use case: processing large volumes of structured markup quickly.
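To make the XPath-style lookup concrete, here is a minimal sketch. It assumes a small, well-formed sample document (the `SAMPLE` markup and the `extract_links` helper are invented for illustration) and falls back to the standard library's ElementTree when lxml is not installed, since both accept this simple path syntax. Full XPath expressions are available in lxml through its `xpath()` method, which this sketch does not use.

```python
# Prefer lxml when it is installed; otherwise fall back to the standard
# library's ElementTree, which accepts the same simple path expressions.
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Hypothetical, well-formed sample markup (not taken from a real site).
SAMPLE = """
<ul>
  <li class="item"><a href="/a">Alpha</a></li>
  <li class="item"><a href="/b">Beta</a></li>
  <li class="ad"><a href="/x">Sponsored</a></li>
</ul>
"""


def extract_links(markup):
    """Return (text, href) pairs for links inside <li class="item">."""
    root = etree.fromstring(markup.strip())
    links = root.findall(".//li[@class='item']/a")
    return [(a.text, a.get("href")) for a in links]


print(extract_links(SAMPLE))
```

The predicate `[@class='item']` filters out the sponsored entry, so only the two regular links are returned.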

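4. Selenium

Features: a browser automation tool that can simulate user actions such as clicking and scrolling.

Use case: dynamically rendered pages (for example, content loaded by JavaScript).

Drawback: heavy on resources and comparatively slow.

5. Pyppeteer

Features: controls headless Chromium from Python; modeled on Puppeteer (Node.js).

Use case: complex dynamic pages; supports asynchronous operation.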

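II. Common Scraping Frameworks (Structured and Extensible)

1. Scrapy

Features:

A complete crawling framework with built-in request scheduling, item pipelines, middlewares, and more.
Asynchronous request handling, well suited to large-scale crawls.

Use case: complex projects such as e-commerce product crawlers and news aggregators.

Core components:

Spiders (define the crawl logic)
Items (containers for structured data)
Pipelines (clean and store data)
Middlewares (hook into request/response processing)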
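As a rough illustration of the Pipelines component, the sketch below shows one possible item pipeline. The class name `DedupPipeline` and the `url` field are assumptions made for this example, and items are assumed to be plain dicts. Scrapy calls `process_item(item, spider)` on each enabled pipeline; the fallback `DropItem` class exists only so the sketch runs without Scrapy installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in so the sketch runs without Scrapy installed.
    class DropItem(Exception):
        pass


class DedupPipeline:
    """Hypothetical pipeline: trims text fields and drops repeated URLs."""

    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        # Trim stray whitespace from every string field
        for key, value in list(item.items()):
            if isinstance(value, str):
                item[key] = value.strip()
        url = item.get("url")
        if url in self.seen_urls:
            raise DropItem(f"duplicate item: {url}")
        self.seen_urls.add(url)
        return item  # returning the item hands it to the next pipeline


pipeline = DedupPipeline()
print(pipeline.process_item({"url": " https://example.com/a ", "title": " A "}, spider=None))
```

In a real project a class like this would be enabled through the `ITEM_PIPELINES` setting, under whatever module path the project uses.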

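2. PySpider

Features:

Distributed architecture with a web UI for managing tasks.
Real-time monitoring of crawler status.

Use case: projects that need distributed crawling or a visual interface.

3. Playwright

Features:

Automates multiple browsers (Chromium, Firefox, WebKit).
Handles dynamic content and can produce screenshots or PDFs.

Use case: pages with complex interaction, such as login and verification flows.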

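III. Tools for Dealing with Anti-Scraping Measures

1. Proxy IP pools

Tools: scrapy-rotating-proxies for Scrapy; with Requests (or requests-html, which builds on it), pass a `proxies` mapping with each request.

Purpose: avoid having a single IP address banned.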
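One simple way to use a proxy pool is round-robin rotation. The sketch below assumes a hard-coded list of placeholder addresses (`PROXY_POOL` and `next_proxies` are invented names, and the IPs come from a documentation-only range); it builds the `{"http": ..., "https": ...}` mapping that Requests accepts through its `proxies` argument, without sending any traffic.

```python
from itertools import cycle

# Hypothetical proxy addresses (documentation-only IP range); replace with real ones.
PROXY_POOL = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]

_rotation = cycle(PROXY_POOL)


def next_proxies():
    """Return the next proxy in round-robin order as a requests-style mapping."""
    proxy = next(_rotation)
    return {"http": proxy, "https": proxy}


# With Requests this would be used as (not executed here):
#   requests.get(url, proxies=next_proxies(), timeout=10)
for _ in range(4):
    print(next_proxies()["http"])
```

A production pool would usually also track failures and retire dead proxies, which is what dedicated middleware such as scrapy-rotating-proxies is meant to handle.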

2. Random User-Agent

Library: fake-useragent

Purpose: imitate different browsers and devices.
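The idea can be sketched without any third-party package by picking from a fixed list. The `USER_AGENTS` strings and the `random_headers` helper below are illustrative assumptions; a library such as fake-useragent maintains a much larger, regularly updated list.

```python
import random

# A small hand-written sample; these strings are illustrative and will age.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]


def random_headers():
    """Build request headers with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}


# With Requests this would be used as (not executed here):
#   requests.get(url, headers=random_headers(), timeout=10)
print(random_headers()["User-Agent"])
```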

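3. CAPTCHA recognition

Tools: Tesseract OCR (image recognition) and third-party APIs (such as CAPTCHA-solving services).

4. Request rate control

Method: add delays (time.sleep) or use Scrapy's DOWNLOAD_DELAY setting.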
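A minimal sketch of the `time.sleep` approach follows. The helper name `polite_sleep`, the delay values, and the URLs are assumptions chosen for illustration; the random jitter keeps requests from arriving at perfectly regular intervals.

```python
import random
import time


def polite_sleep(base=1.0, jitter=0.5):
    """Sleep for `base` seconds plus a random extra of up to `jitter` seconds."""
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay


# Hypothetical URLs; the fetch itself is left as a comment.
urls = ["https://example.com/page/1", "https://example.com/page/2"]
for url in urls:
    # response = requests.get(url, timeout=10)
    waited = polite_sleep(base=0.1, jitter=0.1)
    print(f"{url}: waited {waited:.2f}s")
```

In a Scrapy project the equivalent is a single line in the settings file, for example `DOWNLOAD_DELAY = 1.0`.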

IV. Data Processing and Storage

1. Data cleaning

Tools: Pandas (for tabular data) and regular expressions (the re module).
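As a small illustration of regex-based cleaning with the `re` module, the sketch below normalizes whitespace and pulls a number out of a price string. The helper names and sample inputs are invented for this example.

```python
import re


def clean_text(raw):
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", raw).strip()


def parse_price(raw):
    """Extract the first number from a string such as '¥1,299.00'."""
    match = re.search(r"\d+(?:\.\d+)?", raw.replace(",", ""))
    return float(match.group()) if match else None


print(clean_text("  Wireless \n  Mouse\t(Black) "))
print(parse_price("¥1,299.00"))
```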

2. Storage options

Databases: MySQL, MongoDB, Redis.

Files: CSV, JSON, Excel.

Cloud services: AWS S3, Google Cloud Storage.
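For the file-based options, the standard library is enough. The following sketch writes the same records to CSV and JSON; the sample `rows`, the file names, and the use of a temporary directory are assumptions made so that the example is self-contained.

```python
import csv
import json
import tempfile
from pathlib import Path

# Hypothetical scraped records.
rows = [
    {"name": "Keyboard", "price": 49.9},
    {"name": "Mouse", "price": 19.5},
]


def save_csv(records, path):
    """Write records to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "price"])
        writer.writeheader()
        writer.writerows(records)


def save_json(records, path):
    """Write records to a JSON file, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


out_dir = Path(tempfile.mkdtemp())  # temporary folder so the sketch leaves no clutter
save_csv(rows, out_dir / "products.csv")
save_json(rows, out_dir / "products.json")
print(sorted(p.name for p in out_dir.iterdir()))
```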

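V. How to Choose

Simple tasks: Requests + BeautifulSoup/lxml.

Dynamic pages: Selenium, Playwright, or Pyppeteer.

Large projects: Scrapy (highly extensible) or PySpider (distributed).

Strict anti-scraping measures: combine proxies, User-Agent rotation, and request rate control.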

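VI. Things to Keep in Mind

1. Legality: honor the target site's `robots.txt` and avoid infringing on privacy or copyright.

2. Ethics: limit your request rate so that you do not put undue load on the server.

3. Error handling: add a retry mechanism (for example, the `retrying` library) to cope with network instability.

4. Set request headers: imitate a browser to reduce the chance of being blocked.

import requests

url = "https://example.com"
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers, timeout=10)

5. Handling anti-scraping measures: use proxy IPs, random delays, CAPTCHA recognition, and so on.

6. Data storage: use a database (such as MySQL or MongoDB) or files (JSON, CSV).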
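Items 1 and 3 above can be sketched with the standard library alone. The example below assumes an inline `robots.txt` string and a simulated flaky function, both invented for illustration, and uses a hand-written `retry` helper with exponential backoff as a stand-in for a library such as `retrying`.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed offline for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def retry(func, attempts=3, base_delay=0.5):
    """Call func(); after each failure wait base_delay * 2**n, then try again."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** attempt)


parser = build_parser(ROBOTS_TXT)
print(parser.can_fetch("MyCrawler", "https://example.com/articles/1"))
print(parser.can_fetch("MyCrawler", "https://example.com/private/data"))

# Simulated flaky call: fails twice, then succeeds.
calls = {"n": 0}


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


print(retry(flaky_fetch, attempts=3, base_delay=0.01))
```

Against a live site, the parser would normally be pointed at the real file with `set_url()` and `read()` instead of an inline string, and the retry helper would catch only network-related exceptions rather than every `Exception`.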

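VII. Usage and Worked Examples for Each Tool and Framework

1. Requests + BeautifulSoup/lxml

Features:

Requests: sends HTTP requests and retrieves page content.
BeautifulSoup: parses HTML/XML with a simple API.
lxml: a high-performance parsing library with XPath support.

Basic usage:

import requests
from bs4 import BeautifulSoup

url = "https://example.com"
response = requests.get(url, timeout=10)
response.raise_for_status()  # fail early on 4xx/5xx responses
soup = BeautifulSoup(response.text, "lxml")  # use the lxml parser
title = soup.find("h1").text  # text of the first <h1>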

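Worked example: scraping news headlines

import requests
from bs4 import BeautifulSoup

url = "https://news.ycombinator.com"
response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")
# Select links that are direct children of elements with class "titleline"
titles = [a.text for a in soup.select(".titleline > a")]
print(titles)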

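2. Selenium

Features:

Simulates browser actions and handles dynamically loaded content (such as JavaScript).
Supports Chrome, Firefox, and other browsers.

Basic usage:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com")
element = driver.find_element(By.TAG_NAME, "h1")
print(element.text)
driver.quit()

Worked example: logging in automatically and scraping data

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("https://login.example.com")
driver.find_element(By.ID, "username").send_keys("user")
driver.find_element(By.ID, "password").send_keys("pass")
driver.find_element(By.ID, "submit").click()
# After logging in, wait up to 10 seconds for the data element, then read it
wait = WebDriverWait(driver, 10)
data = wait.until(EC.presence_of_element_located((By.CLASS_NAME, "data"))).text
driver.quit()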

3. Pyppeteer (no longer recommended; use Playwright instead)

Features:

An asynchronous library that controls headless Chromium.
Similar to Puppeteer (Node.js), but no longer maintained.

Basic usage:

import asyncio
from pyppeteer import launch

async def main():
    browser = await launch()  # starts headless Chromium
    page = await browser.newPage()
    await page.goto("https://example.com")
    title = await page.title()
    print(title)
    await browser.close()

asyncio.run(main())

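4. Playwright

Features:

Supports multiple browsers (Chromium, Firefox, WebKit).
Supports asynchronous operation (a synchronous API is also available), with better performance and more active maintenance than Pyppeteer.

Basic usage:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())
    browser.close()

Worked example: scraping dynamically rendered content

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://spa.example.com")
    page.wait_for_selector(".dynamic-content")  # block until the element is rendered
    content = page.locator(".dynamic-content").first.text_content()
    print(content)
    browser.close()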

5. Scrapy

Basic usage:

1. Create a project:

scrapy startproject myproject

2. Write a spider:

import scrapy

class MySpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # Yield one item containing the text of the first <h1>
        yield {"title": response.css("h1::text").get()}

3. Run it:

scrapy crawl example -o output.json

Worked example: scraping e-commerce product information

import scrapy

class ProductSpider(scrapy.Spider):
    name = "product"
    start_urls = ["https://shop.example.com"]

    def parse(self, response):
        # Yield one item per product card on the page
        for product in response.css(".product-item"):
            yield {
                "name": product.css("h2::text").get(),
                "price": product.css(".price::text").get(),
            }
        # Follow the pagination link, if there is one, and parse it the same way
        next_page = response.css("a.next-page::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)

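6. PySpider

Features:

A distributed crawling framework with a built-in web UI.
Well suited to real-time monitoring and scheduling.

Basic usage:

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    @every(minutes=24 * 60)  # run on_start once a day
    def on_start(self):
        self.crawl("https://example.com", callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)  # treat a fetched page as fresh for 10 days
    def index_page(self, response):
        return {"title": response.doc("h1").text()}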

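Tool Comparison and Selection

Tool	Use case	Strengths	Weaknesses
Requests	Simple static pages	Lightweight, easy to use	Cannot handle dynamic content
Selenium	Dynamically rendered pages (low request volume)	Full browser automation	Slow, resource-hungry
Playwright	Dynamically rendered pages (higher performance)	Multi-browser support, async API	Slightly steeper learning curve
Scrapy	Large-scale data collection	Complete framework, highly extensible	Complex configuration
PySpider	Distributed crawling and real-time monitoring	Web UI, distributed support	Declining community activity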