A Complete Guide to Python Web Scraping Tools, with Practical Examples
I. Core Tool Comparison Table
| Category | Tool | Key Strengths | Typical Use Cases | Learning Curve |
| --- | --- | --- | --- | --- |
| HTTP requests | Requests | Simple, intuitive API (synchronous) | Scraping static pages | ★☆☆☆☆ |
| HTTP requests | aiohttp | High-performance async I/O | Large-scale, high-concurrency scraping | ★★★☆☆ |
| HTTP requests | PyCurl | C-based libcurl bindings, maximum raw performance | High-frequency trading data capture | ★★★★☆ |
| HTML parsing | BeautifulSoup | Flexible API, supports multiple parsers | Parsing complex or messy HTML | ★☆☆☆☆ |
| HTML parsing | lxml | Built on libxml2, best-in-class performance | Parsing very large documents | ★★☆☆☆ |
| HTML parsing | PyQuery | jQuery-style syntax | Quick start for front-end developers | ★☆☆☆☆ |
| Dynamic rendering | Selenium | Full browser environment | Pages with complex JavaScript interaction | ★★★☆☆ |
| Dynamic rendering | Playwright | Multi-browser support, modern API | Cross-browser automation and scraping | ★★★☆☆ |
| Dynamic rendering | Scrapy-Splash | Rendering service integrated with Scrapy | Handling dynamic pages inside a Scrapy project | ★★★★☆ |
| Crawling frameworks | Scrapy | Full-featured crawling framework | Large-scale structured data collection | ★★★★☆ |
| Crawling frameworks | Crawlera | Hosted smart-proxy service from the Scrapy ecosystem | Enterprise-grade anti-bot scenarios | ★★★☆☆ |
| Browser automation | selenium-wire | Request/response interception | Capturing API traffic | ★★★☆☆ |
| Browser automation | MechanicalSoup | Simplified form handling | Scraping pages behind login forms | ★★☆☆☆ |
| Special cases | Newspaper3k | Automatic news article extraction | News and media sites | ★☆☆☆☆ |
| Special cases | ScrapingBee | Rendering exposed as an API service | Getting rendered HTML quickly | ★☆☆☆☆ |
II. Tool Deep Dives with Code Examples
1. HTTP Request Layer
Requests (the baseline choice)
```python
import requests

# Send a GET request with custom headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...'
}
response = requests.get('https://example.com', headers=headers)
print(response.status_code)
print(response.json())  # Parse a JSON response body
```
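In real crawls it is worth adding a timeout and automatic retries so one slow or flaky host cannot stall the whole job. A minimal sketch using Requests' `HTTPAdapter` together with urllib3's `Retry` helper (the URL and the retry settings are illustrative):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry up to 3 times on common transient errors, with exponential backoff
retry = Retry(total=3, backoff_factor=0.5, status_forcelist=[429, 500, 502, 503, 504])
session = requests.Session()
session.mount('http://', HTTPAdapter(max_retries=retry))
session.mount('https://', HTTPAdapter(max_retries=retry))

# Always pass a timeout so a hung connection cannot block the crawler forever
response = session.get('https://example.com', timeout=10)
print(response.status_code)
```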
aiohttp (high-performance async)
```python
import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():
    urls = ['url1', 'url2', 'url3']
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        htmls = await asyncio.gather(*tasks)
        print([len(html) for html in htmls])

asyncio.run(main())
```
2. HTML Parsing Layer
BeautifulSoup (flexible parsing)
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, 'lxml')

# CSS selector
title = soup.select_one('h1.title').text

# Iterate over matching elements
for item in soup.select('div.item'):
    print(item.get('id'))
```
lxml (high-performance parsing)
```python
from lxml import etree

tree = etree.HTML(html_content)

# XPath selection
title = tree.xpath('//h1[@class="title"]/text()')[0]

# Bulk extraction
links = tree.xpath('//a/@href')
```
3. Dynamic Rendering Layer
Playwright (modern automation)
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # Navigate and wait for the JavaScript-rendered content
    page.goto('https://example.com')
    page.wait_for_selector('div.loaded-content')

    # Take a screenshot or extract data
    page.screenshot(path='screenshot.png')
    content = page.inner_text('div.content')

    browser.close()
```
Selenium (the classic option)
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://example.com')

# Explicit wait for the target element
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'div.content'))
)
print(element.text)
driver.quit()
```
4. Crawling Framework Layer
Scrapy (full-featured framework)
```python
import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        for product in response.css('div.product'):
            yield {
                'name': product.css('h3::text').get(),
                'price': product.css('span.price::text').get(),
                'url': product.css('a::attr(href)').get(),
            }

        # Follow pagination automatically
        next_page = response.css('a.next-page::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
```
Run it with:

```bash
scrapy crawl products -o products.json
```
5. Special-Case Tools
Newspaper3k (news article extraction)
```python
from newspaper import Article

url = 'https://example.com/article'
article = Article(url)
article.download()
article.parse()

print(article.title)
print(article.text)
print(article.authors)
print(article.publish_date)

# Automatically extract keywords and a summary
article.nlp()
print(article.keywords)
print(article.summary)
```
selenium-wire (request interception)
```python
from seleniumwire import webdriver

# Route all browser traffic through an upstream proxy
options = {
    'proxy': {
        'http': 'http://user:pass@proxy.example.com:8080',
        'https': 'http://user:pass@proxy.example.com:8080',
    }
}
driver = webdriver.Chrome(seleniumwire_options=options)
driver.get('https://example.com')

# Inspect every request the page issued
for request in driver.requests:
    if request.response:
        print(
            request.url,
            request.response.status_code,
            request.response.headers['Content-Type']
        )
```
III. Anti-Scraping Countermeasures and Practical Tips
1. Request Header Spoofing
```python
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
    'Referer': 'https://google.com',
    'Connection': 'keep-alive',
}
```
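Sending the same User-Agent on every request is itself a fingerprint. A minimal sketch that rotates through a small pool of User-Agent strings (the pool and helper function are illustrative; in practice you would keep a larger, up-to-date list):

```python
import random
import requests

# Illustrative pool of User-Agent strings; extend with current browser versions
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1 Safari/605.1.15',
]

def get_with_random_ua(url):
    # Pick a different User-Agent for each request
    ua = random.choice(USER_AGENTS)
    return requests.get(url, headers={'User-Agent': ua}, timeout=10)

response = get_with_random_ua('https://example.com')
print(response.status_code)
```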
2. Proxy IP Pool
Fetching free proxies:
```python
import requests
from lxml import etree

def get_proxies():
    url = 'https://free-proxy-list.net/'
    response = requests.get(url)
    tree = etree.HTML(response.text)

    proxies = []
    for row in tree.xpath('//table[@id="proxylisttable"]/tbody/tr'):
        ip = row.xpath('./td[1]/text()')[0]
        port = row.xpath('./td[2]/text()')[0]
        proxies.append(f'{ip}:{port}')

    return proxies
```
Validating proxies:
```python
import requests

def check_proxy(proxy):
    try:
        response = requests.get(
            'https://httpbin.org/ip',
            proxies={'http': proxy, 'https': proxy},
            timeout=5
        )
        return response.status_code == 200
    except requests.RequestException:
        return False
```
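The two helpers above can be combined into a simple rotating pool. A minimal sketch, assuming the `get_proxies()` and `check_proxy()` functions from the snippets above are in scope:

```python
import random
import requests

# Keep only the proxies that pass the health check
working = [p for p in get_proxies() if check_proxy(p)]

def get_via_random_proxy(url):
    # Use a different working proxy for each request
    proxy = random.choice(working)
    return requests.get(
        url,
        proxies={'http': f'http://{proxy}', 'https': f'http://{proxy}'},
        timeout=10,
    )

response = get_via_random_proxy('https://httpbin.org/ip')
print(response.json())
```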
3. CAPTCHA Handling
Example of calling a third-party CAPTCHA-solving service's API:
```python
import requests
import base64

def recognize_captcha(image_path):
    with open(image_path, 'rb') as f:
        image_data = f.read()

    # Encode the image as Base64
    image_base64 = base64.b64encode(image_data).decode()

    # Call the CAPTCHA-solving service's API
    response = requests.post(
        'https://api.dama.com/recognize',
        json={
            'image': image_base64,
            'type': 'common'
        },
        headers={'API-Key': 'your_api_key'}
    )

    return response.json()['code']
```
IV. Performance Optimization
1. Going Asynchronous
Synchronous version (slow):
```python
import requests

for url in urls:
    response = requests.get(url)
    process(response)
```
Asynchronous version (fast):
```python
import asyncio
import aiohttp

async def fetch_and_process(session, url):
    async with session.get(url) as response:
        data = await response.text()
        process(data)  # The processing function must be adapted to async code

async def main():
    # Reuse one session for all requests instead of opening one per URL
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*[fetch_and_process(session, url) for url in urls])

asyncio.run(main())
```
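Unbounded concurrency can overwhelm both the target site and your own machine. A minimal sketch that caps the number of in-flight requests with an `asyncio.Semaphore`, assuming the `fetch_and_process` coroutine and `urls` list from the snippet above:

```python
import asyncio
import aiohttp

CONCURRENCY_LIMIT = 10  # at most 10 requests in flight at once

async def bounded_fetch(semaphore, session, url):
    # The semaphore blocks here once the limit is reached
    async with semaphore:
        await fetch_and_process(session, url)

async def main():
    semaphore = asyncio.Semaphore(CONCURRENCY_LIMIT)
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*[bounded_fetch(semaphore, session, url) for url in urls])

asyncio.run(main())
```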
2. Distributed Crawler Architecture
Scrapy-Redis setup:
```python
# settings.py
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
REDIS_HOST = "localhost"
REDIS_PORT = 6379
```

```bash
# Multiple crawler nodes share the same Redis queue
scrapy crawl myspider -s REDIS_URL=redis://localhost:6379/0
```
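On the spider side, queue-driven crawling is typically done by inheriting from scrapy-redis' `RedisSpider`, which pulls its start URLs from a Redis list instead of `start_urls`. A minimal sketch; the spider name, `redis_key`, and parse logic are illustrative:

```python
from scrapy_redis.spiders import RedisSpider

class MySpider(RedisSpider):
    name = 'myspider'
    # Each node pops URLs from this Redis list; seed it with e.g.:
    #   LPUSH myspider:start_urls https://example.com
    redis_key = 'myspider:start_urls'

    def parse(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}
```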
V. Tool Selection Decision Flow
```
Start
│
├── Static page?
│   ├── Yes → Requests + BeautifulSoup
│   └── No  → Needs dynamic rendering?
│       ├── Yes → Complex interaction?
│       │   ├── Yes → Playwright / Selenium
│       │   └── No  → Scrapy-Splash / ScrapingBee
│       └── No  → Structured data at scale?
│           ├── Yes → Scrapy
│           └── No  → Special scenario?
│               ├── Yes → Pick by scenario (Newspaper3k, etc.)
│               └── No  → Combine the basic tools
```
VI. Recommended Learning Resources
- Official documentation:
  - Requests documentation
  - Scrapy documentation
  - Playwright documentation
- Hands-on tutorials:
  - The book "Python 网络爬虫从入门到实践"
  - The iMooc (慕课网) course "Python 高级爬虫实战"
  - GitHub project: "Python 爬虫 100 例" (100 Python scraping examples)
- Community forums:
  - Stack Overflow (the web-scraping tag)
  - Reddit's r/webscraping
  - The web scraping topic on Zhihu
VII. Legal and Ethical Guidelines
- Comply with applicable laws such as China's Cybersecurity Law and Data Security Law
- Respect each site's robots.txt rules; you can check them programmatically with the standard-library urllib.robotparser module (see the sketch after this list)
- Throttle your crawl rate so you do not disrupt the target site's normal operation
- Do not scrape sensitive information (personal data, trade secrets, etc.)
- Use scraped data only for personal study and research; commercial use requires authorization
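A minimal sketch of checking robots.txt with the standard library's `urllib.robotparser`; the URL and user-agent string are placeholders:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()

# Check whether our crawler is allowed to fetch a given path
if rp.can_fetch('MyCrawler/1.0', 'https://example.com/products'):
    print('Allowed to fetch')
else:
    print('Disallowed by robots.txt')
```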
Before starting a project, read the target site's Terms of Service and Privacy Policy, and consult legal counsel when necessary.
Once you have mastered these tools and strategies, you can handle more than 90% of scraping scenarios. Start with simple projects, build up experience step by step, and study advanced countermeasures in depth only when you run into complex anti-scraping mechanisms.