
Coroutines + Connection Pools: The Low-Level Optimization Logic Behind High-Concurrency Python Crawlers

news 2025/9/16 9:49:44

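I. The Root of the Bottleneck: Synchronous Blocking I/O and TCP Handshakes

Before optimizing anything, we need to understand why a traditional synchronous crawler is slow.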

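Synchronous blocking I/O: when you call requests.get(), the program sends an HTTP request and the calling thread then waits until the remote server returns a response. During that wait the CPU is mostly idle, so the bulk of the wall-clock time is wasted. It is like a supermarket with a single cashier: each customer has to wait until the previous one has finished checking out completely.
Expensive TCP connection setup: HTTP runs on top of TCP. A bare requests.get() call (without a shared Session) opens a new connection each time, which means a TCP three-way handshake per request, plus a TLS handshake for HTTPS. Under high concurrency, constantly creating and tearing down connections becomes one of the main performance bottlenecks.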
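To make the cost of blocking concrete, here is a minimal sketch that simulates it. The function names are hypothetical, and time.sleep() stands in for the network wait of requests.get(), so no real request is sent. The point it illustrates is that with blocking calls the total time grows linearly with the number of requests.

```python
import time


def fake_blocking_request(latency: float) -> str:
    """Stand-in for requests.get(): blocks the calling thread for `latency` seconds."""
    time.sleep(latency)
    return "response"


def crawl_sequentially(n: int, latency: float) -> float:
    """Issue n simulated requests one after another and return the elapsed time."""
    start = time.perf_counter()
    for _ in range(n):
        fake_blocking_request(latency)
    return time.perf_counter() - start


if __name__ == "__main__":
    elapsed = crawl_sequentially(n=5, latency=0.05)
    # Each call waits for the previous one, so the total is roughly n * latency.
    print(f"5 sequential requests took {elapsed:.2f} s")
```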

We have two tools for these two problems: coroutines address the I/O wait, and connection pools address TCP connection reuse.

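II. Core Tool One: Coroutines, the Art of Scheduling Around I/O Waits

Coroutines, sometimes called micro-threads, are lightweight units of execution scheduled in user space. Their key advantage is cooperative scheduling: a coroutine voluntarily yields the CPU when it reaches an I/O operation, rather than being suspended by the operating system at an arbitrary point.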

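Under the hood: the event loop and asynchronous I/O

The event loop: this is the core of asyncio. It is an endless loop that watches for events and manages all tasks. You can think of it as an extremely efficient project manager.
Tasks: calling an async function (async def) only creates a coroutine object. It becomes a Task once it is scheduled on the event loop, for example through asyncio.create_task() or asyncio.gather().
Awaitables: when a task reaches an await on an operation that cannot complete yet (typically I/O such as a network request), the following happens:
- The task effectively tells the event loop: "I am about to do I/O, which will be slow. Don't wait for me; go and run whatever else is ready."
- The event loop suspends the current task and switches to other tasks that are ready to continue.
- When the operating system reports that the I/O has completed (for example, the server's response has arrived), the event loop is notified and, at a suitable moment, resumes the suspended task right after its await.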
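The steps above can be observed with a small sketch that uses only the standard library. The worker and demo names are hypothetical, and asyncio.sleep(0) stands in for an I/O wait: it does nothing except hand control back to the event loop. The log shows that calling an async function runs nothing by itself, and that two tasks interleave at their await points.

```python
import asyncio


async def worker(name: str, log: list) -> str:
    log.append(f"{name}: start")
    await asyncio.sleep(0)  # yield to the event loop, as an I/O wait would
    log.append(f"{name}: resumed")
    return name


async def demo() -> list:
    log = []
    coro = worker("A", log)           # only creates a coroutine object
    assert asyncio.iscoroutine(coro)
    assert log == []                  # nothing has run yet
    task_a = asyncio.create_task(coro)              # now scheduled as a Task
    task_b = asyncio.create_task(worker("B", log))
    await asyncio.gather(task_a, task_b)
    return log


if __name__ == "__main__":
    print(asyncio.run(demo()))
    # ['A: start', 'B: start', 'A: resumed', 'B: resumed']
```

Task A is suspended at its await, task B gets to start, and only then does A resume; all of it happens on one thread.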

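All of this happens on a single thread. By switching tasks during I/O waits, the program keeps the CPU busy instead of idle, so a single process can keep thousands of network requests in flight at once.

A simple analogy: synchronous blocking is a single assembly line, where one stalled step stops the whole line. Coroutines are several assembly lines served by one worker: when a step (I/O) on one line stalls, the worker (the CPU) immediately moves to another line, so the worker is always busy.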
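The effect on total time can be sketched with simulated requests. As an assumption for illustration, asyncio.sleep() stands in for the network wait and the function names are hypothetical. Because every task waits at the same time, 100 simulated requests finish in roughly the latency of one rather than the sum of all of them.

```python
import asyncio
import time


async def fake_request(latency: float) -> str:
    """Stand-in for an async HTTP call: waits without blocking the thread."""
    await asyncio.sleep(latency)
    return "response"


async def crawl_concurrently(n: int, latency: float) -> float:
    """Run n simulated requests concurrently and return the elapsed time."""
    start = time.perf_counter()
    await asyncio.gather(*(fake_request(latency) for _ in range(n)))
    return time.perf_counter() - start


if __name__ == "__main__":
    elapsed = asyncio.run(crawl_concurrently(n=100, latency=0.05))
    # Sequentially this would take about 100 * 0.05 = 5 s.
    print(f"100 concurrent requests took {elapsed:.2f} s")
```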

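III. Core Tool Two: The Connection Pool, a Resource Manager for TCP Connections

The connection pool is another badly underrated low-level optimization. Its core idea is to reuse connections rather than rebuild them.

Under the hood: TCP connection reuse

An httpx.AsyncClient or aiohttp.ClientSession object maintains a connection pool internally by default.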

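When your crawler sends its first request: the client opens a TCP connection to the target server (going through the three-way handshake).
After the request completes: the connection is not closed immediately. It is placed in a container called the connection pool and marked as idle.
When your crawler sends the next request (to the same host): the client does not open a new TCP connection. It takes the idle, already established connection out of the pool and sends the new HTTP request over it.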

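This brings two main benefits:

Much lower latency: requests no longer pay for a TCP three-way handshake, and for HTTPS a TLS handshake, every time, so responses come back sooner.
Less load on the system: the operating system creates and destroys far fewer sockets, which cuts resource consumption.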

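Without a connection pool: 10 requests => 10 TCP handshakes => 10 sockets.
With a connection pool: 10 consecutive requests to the same host => 1 TCP handshake => 1 reused socket. Connection setup cost drops by roughly a factor of ten; how much total time that saves depends on the round-trip latency and on whether TLS is involved.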
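The 10-versus-1 comparison can be reproduced locally with a toy sketch. It is not HTTP and not how httpx implements its pool: it assumes a hypothetical line-based protocol and a local asyncio server that simply counts how many TCP connections it accepts. All function names are made up for illustration.

```python
import asyncio


async def start_counting_server():
    """Start a local server that counts accepted TCP connections."""
    stats = {"connections": 0}

    async def handle(reader, writer):
        stats["connections"] += 1
        # Answer every line on this connection until the client closes it.
        while await reader.readline():
            writer.write(b"ok\n")
            await writer.drain()
        writer.close()

    server = await asyncio.start_server(handle, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    return server, port, stats


async def without_reuse(port: int, n: int) -> None:
    """Open a fresh connection for every request."""
    for _ in range(n):
        reader, writer = await asyncio.open_connection("127.0.0.1", port)
        writer.write(b"ping\n")
        await writer.drain()
        await reader.readline()
        writer.close()
        await writer.wait_closed()


async def with_reuse(port: int, n: int) -> None:
    """Send every request over one kept-alive connection."""
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    for _ in range(n):
        writer.write(b"ping\n")
        await writer.drain()
        await reader.readline()
    writer.close()
    await writer.wait_closed()


async def count_connections(strategy, n: int = 10) -> int:
    server, port, stats = await start_counting_server()
    try:
        await strategy(port, n)
    finally:
        server.close()
        await server.wait_closed()
    return stats["connections"]


if __name__ == "__main__":
    print("without reuse:", asyncio.run(count_connections(without_reuse)))
    print("with reuse:   ", asyncio.run(count_connections(with_reuse)))
```

A real pool additionally tracks idle connections per host and expires them after a timeout, which is what the limits settings of httpx.AsyncClient control.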

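IV. In Practice: Building a High-Concurrency Crawler with Coroutines and a Connection Pool

Below we use the httpx library (it supports HTTP/1.1 and, optionally, HTTP/2, and has a more modern API) to show how to use both tools correctly.

1. Anti-pattern: an async crawler without a connection pool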

import asyncio
import time

import httpx


async def fetch_no_pool(url):
    """Anti-pattern: a new client per request, so TCP connections are never reused."""
    async with httpx.AsyncClient() as client:  # a new client (and a new pool) on every call
        response = await client.get(url)
        return response.text[:200]  # return only part of the body


async def main_no_pool():
    url = "https://httpbin.org/get"
    tasks = [fetch_no_pool(url) for _ in range(10)]
    start_time = time.time()
    results = await asyncio.gather(*tasks)
    end_time = time.time()
    print(f"Without connection pool: {end_time - start_time:.2f} s")
    # for result in results:
    #     print(result)


# asyncio.run(main_no_pool())

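Possible output: Without connection pool: 1.85 s
Analysis: the requests do run concurrently as coroutines, but each task creates its own AsyncClient, so no TCP connection can be reused and performance stays poor.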

2. The right way: coroutines plus a connection pool

import asyncio
import time

import httpx


async def fetch_with_pool(client, url):
    """Recommended: reuse one shared client and its connection pool."""
    response = await client.get(url)
    return response.text[:200]


async def main_with_pool():
    url = "https://httpbin.org/get"
    # Key step: share a single AsyncClient instance for the crawler's whole lifetime
    async with httpx.AsyncClient(
        limits=httpx.Limits(max_keepalive_connections=10, keepalive_expiry=30),
        timeout=httpx.Timeout(10.0),
    ) as client:
        tasks = [fetch_with_pool(client, url) for _ in range(10)]
        start_time = time.time()
        results = await asyncio.gather(*tasks)
        end_time = time.time()
        print(f"Coroutines + connection pool: {end_time - start_time:.2f} s")
        # for result in results:
        #     print(result)


# asyncio.run(main_with_pool())

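Possible output: Coroutines + connection pool: 0.45 s
Comparison: in this run the correct approach is about 4 times faster than the anti-pattern (0.45 s versus 1.85 s). Exact timings vary with network conditions, but the gap comes mainly from the connection setup that the shared client and its pool avoid.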

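3. Advanced tuning: fine-grained pool configuration and retries

A production-grade crawler also has to deal with rate limiting, retries, and proxies.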

import asyncio

import httpx
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

# Proxy settings (substitute your own host and credentials)
proxyHost = "www.16yun.cn"
proxyPort = "5445"
proxyUser = "16QMSOML"
proxyPass = "280651"


class HighConcurrencyCrawler:
    def __init__(self, concurrency=10, use_proxy=True):
        # Fine-grained connection pool settings
        self.limits = httpx.Limits(
            max_connections=concurrency,            # maximum number of open connections
            max_keepalive_connections=concurrency,  # maximum idle connections kept for reuse
            keepalive_expiry=10,                    # seconds an idle connection is kept alive
        )
        self.timeout = httpx.Timeout(10.0)
        self.client = None
        self.use_proxy = use_proxy
        self.proxy_url = f"http://{proxyUser}:{proxyPass}@{proxyHost}:{proxyPort}"

    def _new_client(self, proxy=None):
        # `proxy=` is the argument name in httpx >= 0.26. Older releases used
        # `proxies=`, which also accepted a {"http://": ..., "https://": ...} dict.
        return httpx.AsyncClient(limits=self.limits, timeout=self.timeout, proxy=proxy)

    async def __aenter__(self):
        # Create the client with or without a proxy
        self.client = self._new_client(self.proxy_url if self.use_proxy else None)
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        await self.client.aclose()

    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=2, max=10),
        # retry_if_exception_type matches on exception classes;
        # retry_if_exception expects a predicate function instead.
        retry=retry_if_exception_type((httpx.NetworkError, httpx.HTTPStatusError)),
    )
    async def fetch_url(self, url):
        try:
            resp = await self.client.get(url)
            resp.raise_for_status()
            return resp.text
        except httpx.ProxyError as e:
            print(f"Proxy connection error: {e}")
            raise
        except Exception as e:
            print(f"Request failed for {url}: {e}")
            raise

    async def crawl(self, urls):
        tasks = [self.fetch_url(url) for url in urls]
        return await asyncio.gather(*tasks, return_exceptions=True)


# Example: crawling through the proxy
async def main_with_proxy():
    urls = ["https://httpbin.org/ip"] * 5  # this URL echoes the IP the request came from
    async with HighConcurrencyCrawler(concurrency=5, use_proxy=True) as crawler:
        results = await crawler.crawl(urls)
        # Print the results to check whether the proxy took effect
        for i, result in enumerate(results):
            if not isinstance(result, Exception):
                print(f"Result {i+1}: {result}")
            else:
                print(f"Request {i+1} failed: {result}")


# Example: direct connection (for comparison)
async def main_without_proxy():
    urls = ["https://httpbin.org/ip"] * 3
    async with HighConcurrencyCrawler(concurrency=3, use_proxy=False) as crawler:
        results = await crawler.crawl(urls)
        for i, result in enumerate(results):
            if not isinstance(result, Exception):
                print(f"Direct result {i+1}: {result}")
            else:
                print(f"Direct request {i+1} failed: {result}")


# A more flexible setup: rotating through several proxies
class ProxyRotatorCrawler(HighConcurrencyCrawler):
    """Round-robin over several proxies.

    httpx binds a proxy to the client, not to an individual request, so this
    class keeps one pooled client per proxy and rotates between them.
    """

    def __init__(self, concurrency=10, proxy_list=None):
        super().__init__(concurrency, use_proxy=True)
        self.proxy_list = proxy_list or [self.proxy_url]
        self.current_proxy_index = 0
        self.clients = {}

    async def __aenter__(self):
        self.clients = {p: self._new_client(p) for p in self.proxy_list}
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        for client in self.clients.values():
            await client.aclose()

    def get_next_proxy(self):
        """Return the next proxy in round-robin order."""
        proxy = self.proxy_list[self.current_proxy_index]
        self.current_proxy_index = (self.current_proxy_index + 1) % len(self.proxy_list)
        return proxy

    async def fetch_url(self, url):
        current_proxy = self.get_next_proxy()  # a different proxy for each request
        try:
            resp = await self.clients[current_proxy].get(url)
            resp.raise_for_status()
            return resp.text
        except Exception as e:
            print(f"Request failed for {url} with proxy {current_proxy}: {e}")
            raise


if __name__ == "__main__":
    # Run the crawler through the proxy
    print("=== Via proxy ===")
    asyncio.run(main_with_proxy())
    print("\n=== Direct connection ===")
    asyncio.run(main_without_proxy())
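Rate limiting is listed among the production concerns, and one common way to approach it is to cap the number of requests in flight with asyncio.Semaphore. The sketch below is an assumption about how that could look, not part of the crawler above: the function names are hypothetical, and asyncio.sleep() stands in for `await client.get(url)` so the example runs without network access.

```python
import asyncio


async def fetch_limited(sem: asyncio.Semaphore, url: str, stats: dict) -> str:
    async with sem:  # waits here while max_in_flight requests are already running
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        try:
            await asyncio.sleep(0.01)  # stand-in for `await client.get(url)`
            return f"fetched {url}"
        finally:
            stats["active"] -= 1


async def crawl_limited(urls, max_in_flight: int = 3):
    """Fetch all URLs, never running more than max_in_flight at once."""
    sem = asyncio.Semaphore(max_in_flight)
    stats = {"active": 0, "peak": 0}
    results = await asyncio.gather(*(fetch_limited(sem, u, stats) for u in urls))
    return results, stats["peak"]


if __name__ == "__main__":
    urls = [f"https://example.com/page/{i}" for i in range(10)]
    results, peak = asyncio.run(crawl_limited(urls, max_in_flight=3))
    print(f"{len(results)} pages fetched, peak concurrency {peak}")
```

The semaphore limits concurrency on the application side, while max_connections in httpx.Limits caps the pool itself; a semaphore makes the limit explicit and lets excess tasks wait without counting against the pool timeout.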