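The Ultimate Guide to Bypassing Google reCAPTCHA with Python Crawlers
1. Introduction: Google reCAPTCHA, the Biggest Challenge in Crawler Engineering
As web security grows in importance, Google reCAPTCHA has become one of the strongest lines of defense in website anti-crawling systems. According to a 2025 crawler security report, 93.7% of international websites and 78.2% of websites in China have deployed Google reCAPTCHA.
Google reCAPTCHA has evolved through several versions:
- reCAPTCHA v1 (2007-2018): text-recognition CAPTCHA (retired)
- reCAPTCHA v2 (2014-present): "I'm not a robot" checkbox plus image challenges
- reCAPTCHA v3 (2018-present): invisible verification based on user-behavior scoring
- reCAPTCHA Enterprise (2020-present): enterprise-grade offering with more sophisticated AI detection
Against a verification system this strong, traditional crawling techniques are no longer effective. This article walks through bypass strategies for each reCAPTCHA type, from underlying principles to advanced techniques and from free approaches to paid services, so that you end up with a complete solution.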
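2. A Technical Deep Dive into reCAPTCHA
2.1 How reCAPTCHA v2 Works
reCAPTCHA v2 uses a multi-layer verification mechanism:
2.1.1 Layer 1: Checkbox Verification
- The user clicks the "I'm not a robot" checkbox
- Google collects behavioral data such as the browser fingerprint, mouse trajectory, and click timing
- If the risk score is low, verification passes immediately
- If the risk score is high, the flow moves on to the second layer
2.1.2 Layer 2: Image Challenge
- 9 or 16 image tiles are displayed
- The user must select the tiles that contain a specific object
- Several rounds may be required (select all images with traffic lights → select all images with crosswalks)
2.1.3 Implementation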
<!-- reCAPTCHA v2 embed code -->
<div class="g-recaptcha" data-sitekey="6LcXAAAAA..."></div>
<script src="https://www.google.com/recaptcha/api.js"></script>
After a successful verification, a hidden field is generated in the form:
<input type="hidden" name="g-recaptcha-response" value="03A...">
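The third-party services discussed later need the site key from the embed code above. As a minimal sketch (the class and function names here are illustrative and not part of the original article), the `data-sitekey` attribute can be read from a page's HTML with only the standard library:

```python
from html.parser import HTMLParser


class SiteKeyParser(HTMLParser):
    """Collects data-sitekey values from elements with class g-recaptcha."""

    def __init__(self):
        super().__init__()
        self.site_keys = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # An attribute without a value is reported as None, hence the "or"
        classes = (attr_map.get("class") or "").split()
        if "g-recaptcha" in classes and attr_map.get("data-sitekey"):
            self.site_keys.append(attr_map["data-sitekey"])


def extract_site_keys(html):
    """Return every reCAPTCHA site key found in the given HTML string."""
    parser = SiteKeyParser()
    parser.feed(html)
    return parser.site_keys
```

This assumes the widget is declared statically in the markup as shown above; pages that render the widget from JavaScript would need a different approach.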
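2.2 How reCAPTCHA v3 Works
reCAPTCHA v3 verifies users without showing any challenge:
2.2.1 Key Characteristics
- No user interaction: the user never notices the verification
- Behavioral scoring: returns a score from 0.0 to 1.0 (1.0 means trusted, 0.0 means bot)
- API call: verification runs in the background through a JavaScript API
- Custom threshold: each site can set its own score threshold (typically 0.5)
2.2.2 Implementation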
// reCAPTCHA v3 JavaScript call
grecaptcha.execute('6LcXAAAAA...', {action: 'login'}).then(function(token) {
    // Send the token to the server for verification
    document.getElementById('recaptchaResponse').value = token;
});
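Server-side verification:
# Python server-side verification example
import requests

def verify_recaptcha_v3(token, secret_key, remote_ip=None, threshold=0.5):
    url = "https://www.google.com/recaptcha/api/siteverify"
    data = {'secret': secret_key, 'response': token}
    if remote_ip:
        # remoteip is optional; pass the end user's IP address when it is known
        data['remoteip'] = remote_ip
    response = requests.post(url, data=data, timeout=10)
    result = response.json()
    # A failed verification carries no score, so default it to 0.0
    return result.get('success', False) and result.get('score', 0.0) >= threshold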
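Besides `success` and `score`, the siteverify response for v3 also reports the `action` the token was issued for. The following sketch shows how a server might combine the three checks; the function name and the reason strings are illustrative assumptions, not part of Google's API:

```python
def evaluate_siteverify(result, expected_action, threshold=0.5):
    """Decide whether a parsed siteverify JSON response should be accepted.

    Returns a (accepted, reason) tuple so the caller can log why a token failed.
    """
    if not result.get("success"):
        codes = ",".join(result.get("error-codes", []))
        return False, "verification failed: " + codes
    if result.get("action") != expected_action:
        # The token was issued for a different action than the one expected
        return False, "action mismatch"
    if result.get("score", 0.0) < threshold:
        return False, "score below threshold"
    return True, "ok"
```

Returning the reason alongside the decision keeps the threshold logic in one place while still letting the site treat low scores differently from outright failures.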
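2.3 reCAPTCHA Detection Mechanisms in Detail
Google reCAPTCHA detects bots along the following dimensions:
Detection dimension | Specific signals | Bypass difficulty |
---|---|---|
Browser fingerprint | User-Agent, WebGL, Canvas, font list | ★★★★☆ |
Behavior analysis | Mouse trajectory, click patterns, time on page | ★★★★★ |
Network characteristics | IP reputation, request frequency, TLS fingerprint | ★★★★☆ |
Automation tools | WebDriver property, automation framework traits | ★★★★★ |
Environment checks | Browser plugins, screen resolution, time zone | ★★★☆☆ |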
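3. Overview of Bypass Strategies
3.1 Strategy Categories
Strategy | Applicable scenario | Success rate | Cost | Complexity |
---|---|---|---|---|
Environment spoofing | reCAPTCHA v2 checkbox | 60-70% | Low | ★★☆☆☆ |
Behavior simulation | reCAPTCHA v2 image challenge | 80-90% | Medium | ★★★★☆ |
Third-party services | All types | 95%+ | High | ★★☆☆☆ |
Token reuse | reCAPTCHA v3 | 70-80% | Low | ★★★☆☆ |
Proxy IP pool | Combined with other strategies | +10-20% | Medium | ★★★☆☆ |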
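3.2 Principles for Choosing a Strategy
- Cost first: with a limited budget, try the free approaches first
- Success rate first: for business-critical scenarios, choose a paid service
- Maintenance cost: consider the complexity of long-term maintenance
- Legal compliance: make sure your usage complies with the site's terms of service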
4. Environment Spoofing in Practice
4.1 Browser Fingerprint Evasion
4.1.1 WebDriver Property Detection
Google checks the navigator.webdriver property:
// Detection code
if (navigator.webdriver) {
    // Treated as an automation tool
}
Workaround:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def create_stealth_driver():
    """Create a stealth Chrome driver."""
    chrome_options = Options()
    chrome_options.add_argument("--disable-blink-features=AutomationControlled")
    chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
    chrome_options.add_experimental_option('useAutomationExtension', False)
    driver = webdriver.Chrome(options=chrome_options)
    # Hide the webdriver property on every new document
    driver.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {
        'source': '''
            Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
        '''
    })
    return driver
4.1.2 Canvas/WebGL Fingerprint Evasion
def inject_canvas_fingerprint(driver):
    """Inject the Canvas fingerprint evasion script."""
    canvas_js = """
    // Override Canvas-related methods
    const toBlob = HTMLCanvasElement.prototype.toBlob;
    const toDataURL = HTMLCanvasElement.prototype.toDataURL;
    const getImageData = CanvasRenderingContext2D.prototype.getImageData;
    HTMLCanvasElement.prototype.toBlob = function() {
        setTimeout(() => toBlob.apply(this, arguments), 1000);
    };
    HTMLCanvasElement.prototype.toDataURL = function(type, quality) {
        const result = toDataURL.apply(this, arguments);
        // Add noise (crude: this edits the encoded string itself,
        // so the result may no longer decode as a valid image)
        return result.replace(/(.{10})$/, 'X$1');
    };
    CanvasRenderingContext2D.prototype.getImageData = function(x, y, w, h) {
        const imageData = getImageData.apply(this, arguments);
        if (imageData.data.length > 0) {
            // Add a tiny amount of noise
            imageData.data[0] = (imageData.data[0] + Math.floor(Math.random() * 3)) % 256;
        }
        return imageData;
    };
    // WebGL fingerprint evasion
    const getContext = HTMLCanvasElement.prototype.getContext;
    HTMLCanvasElement.prototype.getContext = function(type, attributes) {
        const context = getContext.call(this, type, attributes);
        if (type === 'webgl' || type === 'experimental-webgl') {
            const getParameter = context.getParameter;
            context.getParameter = function(parameter) {
                if (parameter === 37445) { // UNMASKED_VENDOR_WEBGL
                    return 'Intel Inc.';
                }
                if (parameter === 37446) { // UNMASKED_RENDERER_WEBGL
                    return 'Intel Iris OpenGL Engine';
                }
                return getParameter.call(this, parameter);
            };
        }
        return context;
    };
    """
    # execute_script only patches the document that is currently loaded
    driver.execute_script(canvas_js)
4.1.3 Plugin and Language Detection Evasion
def inject_plugin_fingerprint(driver):
    """Inject the plugin fingerprint evasion script."""
    plugin_js = """
    // Mimic the plugins of a real browser
    Object.defineProperty(navigator, 'plugins', {
        get: () => [
            {name: 'Chrome PDF Plugin', filename: 'internal-pdf-viewer', description: 'Portable Document Format'},
            {name: 'Chrome PDF Viewer', filename: 'mhjfbmdgcfjbbpaeojofohoefgiehjai', description: ''},
            {name: 'Native Client', filename: 'internal-nacl-plugin', description: ''}
        ]
    });
    // Set the languages
    Object.defineProperty(navigator, 'languages', {get: () => ['en-US', 'en']});
    // Set the hardware concurrency
    Object.defineProperty(navigator, 'hardwareConcurrency', {get: () => 8});
    // Set the device memory
    Object.defineProperty(navigator, 'deviceMemory', {get: () => 8});
    """
    driver.execute_script(plugin_js)
4.2 Complete Environment Spoofing Example
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time

class StealthBrowser:
    def __init__(self):
        self.driver = self._create_driver()
        self._inject_stealth_scripts()

    def _create_driver(self):
        """Create the base driver."""
        chrome_options = Options()
        chrome_options.add_argument("--disable-blink-features=AutomationControlled")
        chrome_options.add_argument("--disable-infobars")
        chrome_options.add_argument("--disable-extensions")
        chrome_options.add_argument("--disable-gpu")
        chrome_options.add_argument("--no-sandbox")
        chrome_options.add_argument("--disable-dev-shm-usage")
        chrome_options.add_argument("--window-size=1920,1080")
        chrome_options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")
        # Disable automation markers
        chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
        chrome_options.add_experimental_option('useAutomationExtension', False)
        driver = webdriver.Chrome(options=chrome_options)
        return driver

    def _inject_stealth_scripts(self):
        """Register every stealth script so it runs on each new document."""
        stealth_scripts = [
            self._get_webdriver_script(),
            self._get_canvas_script(),
            self._get_plugin_script(),
            self._get_chrome_script(),
        ]
        for script in stealth_scripts:
            self.driver.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {'source': script})

    def _get_webdriver_script(self):
        return """
        Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
        """

    def _get_canvas_script(self):
        return """
        const toBlob = HTMLCanvasElement.prototype.toBlob;
        const toDataURL = HTMLCanvasElement.prototype.toDataURL;
        const getImageData = CanvasRenderingContext2D.prototype.getImageData;
        HTMLCanvasElement.prototype.toBlob = function() {
            setTimeout(() => toBlob.apply(this, arguments), 1000);
        };
        HTMLCanvasElement.prototype.toDataURL = function(type, quality) {
            const result = toDataURL.apply(this, arguments);
            return result.replace(/(.{10})$/, 'X$1');
        };
        CanvasRenderingContext2D.prototype.getImageData = function(x, y, w, h) {
            const imageData = getImageData.apply(this, arguments);
            if (imageData.data.length > 0) {
                imageData.data[0] = (imageData.data[0] + Math.floor(Math.random() * 3)) % 256;
            }
            return imageData;
        };
        """

    def _get_plugin_script(self):
        return """
        Object.defineProperty(navigator, 'plugins', {
            get: () => [
                {name: 'Chrome PDF Plugin', filename: 'internal-pdf-viewer', description: 'Portable Document Format'},
                {name: 'Chrome PDF Viewer', filename: 'mhjfbmdgcfjbbpaeojofohoefgiehjai', description: ''},
                {name: 'Native Client', filename: 'internal-nacl-plugin', description: ''}
            ]
        });
        Object.defineProperty(navigator, 'languages', {get: () => ['en-US', 'en']});
        Object.defineProperty(navigator, 'hardwareConcurrency', {get: () => 8});
        Object.defineProperty(navigator, 'deviceMemory', {get: () => 8});
        """

    def _get_chrome_script(self):
        return """
        window.chrome = {
            runtime: {},
            loadTimes: function() {
                return {
                    requestTime: new Date().getTime(),
                    startLoadTime: new Date().getTime(),
                    commitLoadTime: new Date().getTime(),
                    finishDocumentLoadTime: new Date().getTime(),
                    finishLoadTime: new Date().getTime(),
                    firstPaintTime: new Date().getTime(),
                    firstPaintAfterLoadTime: new Date().getTime(),
                    navigationType: "Other",
                    wasFetchedViaSpdy: false,
                    wasNpnNegotiated: false,
                    npnNegotiatedProtocol: "",
                    wasAlternateProtocolAvailable: false,
                    connectionInfo: "unknown"
                };
            }
        };
        """

    def get_driver(self):
        return self.driver

    def close(self):
        self.driver.quit()

# Usage example
if __name__ == "__main__":
    browser = StealthBrowser()
    driver = browser.get_driver()
    try:
        driver.get("https://www.google.com/recaptcha/api2/demo")
        time.sleep(5)
        # Check whether the reCAPTCHA widget was served
        if "recaptcha" in driver.page_source.lower():
            print("Page loaded successfully; further handling may be needed")
        else:
            print("Possibly detected as a bot")
    finally:
        browser.close()
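5. Behavior Simulation in Practice
5.1 Mouse Trajectory Simulation
5.1.1 Characteristics of Human Mouse Movement
The mouse trajectory of a real user has the following characteristics:
- Varying acceleration: slow at the start, fast in the middle, slow at the end
- Curved path: not a straight line, with natural curvature
- Small jitter: coordinates fluctuate slightly at random
- Pauses: a brief pause near the target point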
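The "slow, fast, slow" profile in the list above can be expressed as an easing function. The sketch below uses the smoothstep polynomial 3u² − 2u³, which is one common choice and an assumption here rather than anything the article prescribes; the function names are illustrative:

```python
def ease_in_out(u):
    """Smoothstep easing: maps 0..1 to 0..1 with zero slope at both ends."""
    return 3 * u ** 2 - 2 * u ** 3


def eased_parameters(steps):
    """Return steps + 1 curve parameters for equally spaced time ticks.

    Consecutive values are close together near 0 and 1 (slow movement)
    and far apart around 0.5 (fast movement).
    """
    return [ease_in_out(i / steps) for i in range(steps + 1)]
```

Feeding these values in place of a linear `t = i / points` when sampling a curve yields points that bunch up at the start and end of the path, so a fixed delay between points produces the acceleration and deceleration described above.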
5.1.2 Generating a Bezier-Curve Trajectory
import random
import time
from selenium.webdriver.common.action_chains import ActionChains

def generate_bezier_curve(start_x, start_y, end_x, end_y, points=20):
    """Generate a cubic Bezier trajectory.
    :param start_x: starting X coordinate
    :param start_y: starting Y coordinate
    :param end_x: ending X coordinate
    :param end_y: ending Y coordinate
    :param points: number of trajectory segments
    :return: list of (x, y) trajectory points
    """
    # Control points: random offsets around the start and end points
    control_x1 = start_x + random.randint(-100, 100)
    control_y1 = start_y + random.randint(-100, 100)
    control_x2 = end_x + random.randint(-100, 100)
    control_y2 = end_y + random.randint(-100, 100)
    trajectory = []
    for i in range(points + 1):
        t = i / points
        # Cubic Bezier formula
        x = (1 - t)**3 * start_x + 3 * (1 - t)**2 * t * control_x1 + 3 * (1 - t) * t**2 * control_x2 + t**3 * end_x
        y = (1 - t)**3 * start_y + 3 * (1 - t)**2 * t * control_y1 + 3 * (1 - t) * t**2 * control_y2 + t**3 * end_y
        # Add small jitter
        x += random.uniform(-2, 2)
        y += random.uniform(-2, 2)
        trajectory.append((int(x), int(y)))
    return trajectory

def human_like_move_to_element(driver, element, duration=2.0):
    """Move the pointer to an element along a human-like path.
    :param driver: WebDriver instance
    :param element: target element
    :param duration: intended total move time in seconds (currently unused)
    """
    # Selenium cannot report the current pointer position, so start from a
    # random point near the element. Assumes Selenium 4.3+, where the offsets
    # of move_to_element_with_offset are measured from the element's center.
    start_x = random.randint(-150, -50)
    start_y = random.randint(-150, -50)
    ActionChains(driver).move_to_element_with_offset(element, start_x, start_y).perform()
    # Trajectory in element-relative coordinates, ending near the center (0, 0)
    trajectory = generate_bezier_curve(start_x, start_y, 0, 0)
    # Move step by step
    total_points = len(trajectory)
    for i in range(1, total_points):
        prev_x, prev_y = trajectory[i - 1]
        x, y = trajectory[i]
        # Pick the delay for this step according to the movement phase
        if i < total_points * 0.3:  # acceleration phase
            time_delay = random.uniform(0.01, 0.03)
        elif i < total_points * 0.7:  # constant-speed phase
            time_delay = random.uniform(0.02, 0.05)
        else:  # deceleration phase
            time_delay = random.uniform(0.03, 0.08)
        ActionChains(driver).move_by_offset(x - prev_x, y - prev_y).perform()
        time.sleep(time_delay)
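5.2 Simulating the reCAPTCHA v2 Checkbox Click
import random
import time
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def click_recaptcha_checkbox(driver, max_retries=3):
    """Click the reCAPTCHA checkbox in a human-like way.
    Assumes the driver has already switched into the frame that contains the checkbox.
    :param driver: WebDriver instance
    :param max_retries: maximum number of attempts
    :return: True if verification passed, otherwise False
    """
    for attempt in range(max_retries):
        try:
            # Wait for the checkbox to appear
            checkbox = WebDriverWait(driver, 10).until(
                EC.element_to_be_clickable((By.CLASS_NAME, "recaptcha-checkbox-checkmark"))
            )
            # Move to the checkbox like a human would
            human_like_move_to_element(driver, checkbox)
            # Random pause
            time.sleep(random.uniform(0.5, 1.5))
            # Click the checkbox
            checkbox.click()
            # Wait for the verification result
            time.sleep(2)
            # Check whether an image challenge appeared
            try:
                driver.find_element(By.CLASS_NAME, "rc-imageselect-payload")
                print("Image challenge appeared; further handling is required")
                return False
            except NoSuchElementException:
                # Check whether verification succeeded
                try:
                    driver.find_element(By.CLASS_NAME, "recaptcha-checkbox-checked")
                    print("reCAPTCHA verification succeeded!")
                    return True
                except NoSuchElementException:
                    print(f"Attempt {attempt + 1} failed, retrying...")
                    continue
        except Exception as e:
            print(f"Error while clicking the checkbox: {str(e)}")
            continue
    print("All retries failed")
    return False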
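5.3 Handling Image Challenges
5.3.1 Downloading and Recognizing Images
import base64
import random
import requests
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def download_recaptcha_images(driver):
    """Download the images of a reCAPTCHA image challenge.
    :param driver: WebDriver instance
    :return: list of image bytes (None for tiles that failed to download)
    """
    images = []
    # Wait for the images to load
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "rc-image-tile-wrapper"))
    )
    # Collect all image elements
    image_elements = driver.find_elements(By.CLASS_NAME, "rc-image-tile-44")
    for element in image_elements:
        try:
            # Get the image URL
            img_url = element.find_element(By.TAG_NAME, "img").get_attribute("src")
            # Download the image
            if img_url.startswith("data:image"):
                # Base64-encoded image
                img_data = img_url.split(",")[1]
                img_bytes = base64.b64decode(img_data)
            else:
                # Remote image
                response = requests.get(img_url, timeout=10)
                img_bytes = response.content
            images.append(img_bytes)
        except Exception as e:
            print(f"Error while downloading an image: {str(e)}")
            images.append(None)
    return images

def recognize_recaptcha_images(images, instruction):
    """Recognize reCAPTCHA images (via a third-party service or a local model).
    :param images: list of image bytes
    :param instruction: challenge instruction (e.g. "select all images with traffic lights")
    :return: list of indices of the images to click
    """
    # A third-party service such as Chaojiying or 2Captcha, or a locally
    # trained deep-learning model, can be integrated here.
    # Placeholder: pseudo-random selection (a real application needs real recognition)
    selected_indices = []
    for i, img in enumerate(images):
        if img is not None:
            # A real recognition service should be called here.
            # For demonstration, select roughly 50% of the images at random.
            if random.random() > 0.5:
                selected_indices.append(i)
    return selected_indices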
5.3.2 Simulating Image Clicks
def click_recaptcha_images(driver, selected_indices):
    """Click the selected reCAPTCHA images.
    :param driver: WebDriver instance
    :param selected_indices: list of indices of the images to click
    """
    # Collect all image elements
    image_elements = driver.find_elements(By.CLASS_NAME, "rc-image-tile-44")
    for index in selected_indices:
        if index < len(image_elements):
            element = image_elements[index]
            # Click like a human would
            human_like_move_to_element(driver, element)
            time.sleep(random.uniform(0.3, 0.8))
            element.click()
            # Random pause
            time.sleep(random.uniform(0.2, 0.5))
    # Click the verify button
    try:
        verify_button = driver.find_element(By.ID, "recaptcha-verify-button")
        human_like_move_to_element(driver, verify_button)
        time.sleep(random.uniform(0.5, 1.0))
        verify_button.click()
    except NoSuchElementException:
        print("Verify button not found")
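6. Integrating Third-Party Services
6.1 Integrating 2Captcha
6.1.1 About 2Captcha
2Captcha is a commercial CAPTCHA-solving service that supports reCAPTCHA v2/v3, with a reported solve rate above 95%.
6.1.2 Python Integration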
import requests
import time

class TwoCaptchaSolver:
    def __init__(self, api_key):
        self.api_key = api_key
        self.base_url = "https://2captcha.com"

    def solve_recaptcha_v2(self, site_key, page_url, invisible=0):
        """Solve reCAPTCHA v2.
        :param site_key: the site's site key
        :param page_url: page URL
        :param invisible: whether this is an invisible reCAPTCHA (0 or 1)
        :return: reCAPTCHA response token
        """
        # 1. Submit the CAPTCHA task
        task_data = {
            "key": self.api_key,
            "method": "userrecaptcha",
            "googlekey": site_key,
            "pageurl": page_url,
            "invisible": invisible,
            "json": 1
        }
        response = requests.post(f"{self.base_url}/in.php", data=task_data)
        result = response.json()
        if result["status"] == 1:
            task_id = result["request"]
            print(f"CAPTCHA task submitted, task ID: {task_id}")
            # 2. Poll for the result: up to 30 polls, 5 s apart (about 150 s)
            for _ in range(30):
                time.sleep(5)
                result = self._get_result(task_id)
                if result["status"] == 1:
                    return result["request"]
                elif result["request"] == "CAPCHA_NOT_READY":
                    continue
                else:
                    raise Exception(f"CAPTCHA solving failed: {result['request']}")
            raise Exception("CAPTCHA solving timed out")
        else:
            raise Exception(f"Failed to submit CAPTCHA task: {result['request']}")

    def solve_recaptcha_v3(self, site_key, page_url, action="verify", min_score=0.3):
        """Solve reCAPTCHA v3.
        :param site_key: the site's site key
        :param page_url: page URL
        :param action: verification action
        :param min_score: minimum required score
        :return: reCAPTCHA response token
        """
        task_data = {
            "key": self.api_key,
            "method": "userrecaptcha",
            "version": "v3",
            "googlekey": site_key,
            "pageurl": page_url,
            "action": action,
            "min_score": min_score,
            "json": 1
        }
        response = requests.post(f"{self.base_url}/in.php", data=task_data)
        result = response.json()
        if result["status"] == 1:
            task_id = result["request"]
            print(f"reCAPTCHA v3 task submitted, task ID: {task_id}")
            # Poll for the result: up to 20 polls, 3 s apart (about 60 s)
            for _ in range(20):
                time.sleep(3)
                result = self._get_result(task_id)
                if result["status"] == 1:
                    return result["request"]
                elif result["request"] == "CAPCHA_NOT_READY":
                    continue
                else:
                    raise Exception(f"reCAPTCHA v3 solving failed: {result['request']}")
            raise Exception("reCAPTCHA v3 solving timed out")
        else:
            raise Exception(f"Failed to submit reCAPTCHA v3 task: {result['request']}")

    def _get_result(self, task_id):
        """Fetch the CAPTCHA solving result."""
        params = {"key": self.api_key, "action": "get", "id": task_id, "json": 1}
        response = requests.get(f"{self.base_url}/res.php", params=params)
        return response.json()

    def report_bad(self, task_id):
        """Report an incorrect CAPTCHA result."""
        params = {"key": self.api_key, "action": "reportbad", "id": task_id, "json": 1}
        response = requests.get(f"{self.base_url}/res.php", params=params)
        return response.json()

# Usage example
if __name__ == "__main__":
    solver = TwoCaptchaSolver("your_2captcha_api_key")
    try:
        # Solve reCAPTCHA v2
        site_key = "6LcXAAAAA..."  # taken from the page source
        page_url = "https://example.com/login"
        recaptcha_token = solver.solve_recaptcha_v2(site_key, page_url)
        print(f"reCAPTCHA token: {recaptcha_token}")
        # Log in with the token
        login_data = {
            "username": "your_username",
            "password": "your_password",
            "g-recaptcha-response": recaptcha_token
        }
        response = requests.post(page_url, data=login_data)
        print(f"Login result: {response.status_code}")
    except Exception as e:
        print(f"An error occurred: {str(e)}")
6.2 Integrating Anti-Captcha
import requests
import time

class AntiCaptchaSolver:
    def __init__(self, api_key):
        self.api_key = api_key
        self.base_url = "https://api.anti-captcha.com"

    def solve_recaptcha_v2(self, site_key, page_url):
        """Solve reCAPTCHA v2."""
        # Create the task
        task_data = {
            "clientKey": self.api_key,
            "task": {
                "type": "NoCaptchaTaskProxyless",
                "websiteURL": page_url,
                "websiteKey": site_key
            }
        }
        response = requests.post(f"{self.base_url}/createTask", json=task_data)
        result = response.json()
        if result["errorId"] == 0:
            task_id = result["taskId"]
            print(f"Anti-Captcha task created, task ID: {task_id}")
            # Poll for the result: up to 30 polls, 5 s apart
            for _ in range(30):
                time.sleep(5)
                solution = self._get_task_result(task_id)
                if solution["status"] == "ready":
                    return solution["solution"]["gRecaptchaResponse"]
                elif solution["status"] == "processing":
                    continue
                else:
                    raise Exception(f"Anti-Captcha task failed: {solution}")
            raise Exception("Anti-Captcha task timed out")
        else:
            raise Exception(f"Failed to create Anti-Captcha task: {result['errorDescription']}")

    def _get_task_result(self, task_id):
        """Fetch the Anti-Captcha task result."""
        data = {"clientKey": self.api_key, "taskId": task_id}
        response = requests.post(f"{self.base_url}/getTaskResult", json=data)
        return response.json()
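Both solver classes repeat the same submit-then-poll loop. The sketch below factors that loop into one helper; `poll_until_ready` and its parameters are hypothetical names introduced for illustration, and the `sleep` argument exists only so the helper can be exercised without real waiting:

```python
import time


def poll_until_ready(fetch, interval, max_attempts, sleep=time.sleep):
    """Call fetch() every `interval` seconds until it returns a non-None value.

    fetch should return None while the task is still processing and the
    final result once it is ready. Raises TimeoutError after max_attempts.
    """
    for _ in range(max_attempts):
        sleep(interval)
        result = fetch()
        if result is not None:
            return result
    raise TimeoutError(f"no result after {max_attempts * interval} seconds")
```

The worst-case wait is simply `max_attempts * interval`, which makes the timeout explicit: 30 attempts at 5 seconds each is 150 seconds, not 30.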
7. reCAPTCHA v3 Bypass Strategies
7.1 Token Reuse
A reCAPTCHA v3 token is valid for a limited time (two minutes), and Google accepts each token for verification only once. A cache therefore cannot replay a token that has already been submitted; what it can do is hold a freshly obtained token until it is needed, provided that happens before the token expires.
import time

class RecaptchaV3TokenCache:
    def __init__(self, expiration_time=120):  # expires after 2 minutes
        self.cache = {}
        self.expiration_time = expiration_time

    def get_token(self, site_key, action):
        """Take a cached token; each token is handed out at most once."""
        key = f"{site_key}:{action}"
        if key in self.cache:
            # Remove the entry either way: it is expired or about to be used
            token, timestamp = self.cache.pop(key)
            if time.time() - timestamp < self.expiration_time:
                return token
        return None

    def set_token(self, site_key, action, token):
        """Cache a token."""
        key = f"{site_key}:{action}"
        self.cache[key] = (token, time.time())

    def clear_expired(self):
        """Remove expired tokens."""
        current_time = time.time()
        expired_keys = []
        for key, (token, timestamp) in self.cache.items():
            if current_time - timestamp >= self.expiration_time:
                expired_keys.append(key)
        for key in expired_keys:
            del self.cache[key]

# Usage example
token_cache = RecaptchaV3TokenCache()

def get_recaptcha_v3_token(site_key, action, page_url):
    """Get a reCAPTCHA v3 token."""
    # Check the cache first
    cached_token = token_cache.get_token(site_key, action)
    if cached_token:
        return cached_token
    # Otherwise obtain a new token from the third-party service
    solver = TwoCaptchaSolver("your_api_key")
    return solver.solve_recaptcha_v3(site_key, page_url, action)
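7.2 Simulating reCAPTCHA v3 Locally
For some simplistic reCAPTCHA v3 integrations (for example, a site that only checks that a token field is present and never verifies it with Google), you can try generating a token locally:
import hashlib
import time
import random

def generate_fake_recaptcha_v3_token(site_key, action):
    """Generate a fake reCAPTCHA v3 token (only for some simplistic integrations).
    Note: this does not work against Google's official reCAPTCHA v3 verification.
    """
    # Build the token from a timestamp and a random string
    timestamp = str(int(time.time()))
    random_str = ''.join(random.choices('abcdefghijklmnopqrstuvwxyz0123456789', k=32))
    # Illustrative only: real tokens are opaque strings issued by Google
    token_data = f"{site_key}.{action}.{timestamp}.{random_str}"
    fake_token = hashlib.sha256(token_data.encode()).hexdigest()
    return fake_token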
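8. Proxy IP Pool Integration
8.1 Why Proxy IPs Matter
Using proxy IPs can:
- Reduce the risk of IP bans
- Simulate users in different geographic locations
- Improve the reCAPTCHA pass rate
8.2 Implementing a Proxy IP Pool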
import requests

class ProxyPool:
    def __init__(self, proxy_list=None, proxy_api=None):
        self.proxy_list = proxy_list or []
        self.proxy_api = proxy_api
        self.current_index = 0

    def get_proxy(self):
        """Get a proxy."""
        if self.proxy_api:
            # Fetch a dynamic proxy from the API
            response = requests.get(self.proxy_api, timeout=5)
            if response.status_code == 200:
                proxy = response.text.strip()
                return {"http": f"http://{proxy}", "https": f"http://{proxy}"}
        if self.proxy_list:
            # Round-robin over the static list
            proxy = self.proxy_list[self.current_index]
            self.current_index = (self.current_index + 1) % len(self.proxy_list)
            return {"http": f"http://{proxy}", "https": f"http://{proxy}"}
        return None

    def test_proxy(self, proxy, timeout=5):
        """Test whether a proxy is usable."""
        try:
            response = requests.get("http://httpbin.org/ip", proxies=proxy, timeout=timeout)
            return response.status_code == 200
        except requests.RequestException:
            return False

# Usage example
proxy_pool = ProxyPool(proxy_list=["1.2.3.4:8080", "5.6.7.8:8080", "9.10.11.12:8080"])

def make_request_with_proxy(url, **kwargs):
    """Send a request through a proxy."""
    proxy = proxy_pool.get_proxy()
    if proxy:
        kwargs['proxies'] = proxy
    return requests.get(url, **kwargs)
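The round-robin pool above keeps handing out proxies even after they stop working. One possible extension, sketched here with hypothetical names (`HealthTrackingPool`, `report_failure`, `report_success`) that do not appear in the original code, is to count consecutive failures and skip proxies that exceed a limit:

```python
class HealthTrackingPool:
    """Round-robin proxy pool that skips proxies with too many failures."""

    def __init__(self, proxies, max_failures=3):
        self._order = list(proxies)
        self._index = 0
        self.max_failures = max_failures
        self.failures = {proxy: 0 for proxy in self._order}

    def get(self):
        """Return the next healthy proxy, or None if none is left."""
        for _ in range(len(self._order)):
            proxy = self._order[self._index]
            self._index = (self._index + 1) % len(self._order)
            if self.failures[proxy] < self.max_failures:
                return proxy
        return None

    def report_failure(self, proxy):
        """Record one more consecutive failure for this proxy."""
        self.failures[proxy] += 1

    def report_success(self, proxy):
        """Reset the failure count after a successful request."""
        self.failures[proxy] = 0
```

A caller would invoke `report_failure` when a request through the proxy raises or times out and `report_success` otherwise, so a proxy is only retired after several failures in a row.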
8.3 Using a Proxy with Selenium
def create_driver_with_proxy(proxy_host, proxy_port):
    """Create a WebDriver that uses a proxy."""
    chrome_options = Options()
    chrome_options.add_argument(f"--proxy-server={proxy_host}:{proxy_port}")
    # Other stealth settings...
    chrome_options.add_argument("--disable-blink-features=AutomationControlled")
    chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
    chrome_options.add_experimental_option('useAutomationExtension', False)
    driver = webdriver.Chrome(options=chrome_options)
    # Inject the stealth script
    driver.execute_cdp_cmd('Page.addScriptToEvaluate