
Python Web Scraping: Driving a Browser with the Selenium Library

news 2025/11/16 16:21:38

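I. Core Concepts of the Selenium Library

Selenium is a tool for testing and automating web applications. It drives a real browser (Edge, Firefox, and others) to click, type, open pages, and verify results. It differs from the Requests library in an important way: Requests only fetches the raw source returned by the server, whereas Selenium works through a browser driver, and the browser renders that source. As a result, Selenium can read the page after rendering, including content loaded dynamically by JavaScript, which covers the dynamic-page cases that Requests alone cannot handle.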
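To make the contrast concrete, here is a minimal sketch that uses only the standard library. Both HTML strings, the book-list id, and the ItemCounter class are made up for illustration: the first string stands in for what a Requests-style fetch might return for a JavaScript-driven page, the second for what the browser holds after rendering.

```python
from html.parser import HTMLParser

# Hypothetical raw source, as an HTTP client would receive it: the list is
# empty because the items are added later by JavaScript in the browser.
RAW_HTML = """
<html><body>
  <ul id="book-list"></ul>
  <script>/* fetches books and appends list items at runtime */</script>
</body></html>
"""

# Hypothetical DOM after a browser has run the script (what Selenium sees).
RENDERED_HTML = """
<html><body>
  <ul id="book-list"><li>Book A</li><li>Book B</li></ul>
</body></html>
"""


class ItemCounter(HTMLParser):
    """Counts <li> start tags in an HTML string."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.count += 1


def count_items(html: str) -> int:
    """Return the number of <li> elements found in html."""
    parser = ItemCounter()
    parser.feed(html)
    return parser.count


print(count_items(RAW_HTML))       # items visible in the raw source
print(count_items(RENDERED_HTML))  # items visible after rendering
```

The raw source contains no list items to extract, while the rendered version does, which is the gap Selenium is meant to close.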

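II. Preparing to Use Selenium

(1) Installing WebDriver

A browser runs on an engine (Edge, for example, is built on Chromium), and Selenium needs the WebDriver that matches that engine in order to drive the browser. Using Edge as the example:

Check the browser version: open Edge and look under Settings > About Microsoft Edge.
Download the matching EdgeDriver: go to Microsoft's official EdgeDriver download page (Microsoft Edge WebDriver | Microsoft Edge Developer) and pick the EdgeDriver build whose version matches your browser as closely as possible.
Configure the driver: unzip the download and move msedgedriver.exe (on Windows) into the Scripts folder of your Python installation, or into any other directory on PATH, so that Selenium can find it. To locate your Python installation, run where python (Windows) or which python3 (macOS/Linux).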
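The version check and PATH check above can be sketched as a small script. This is only an illustration: the function names and the sample version strings are hypothetical, and comparing major versions is a simplifying assumption rather than an official compatibility rule.

```python
import shutil
from typing import Optional


def major_version(version: str) -> int:
    """Return the leading number of a dotted version string, e.g. 120."""
    return int(version.strip().split(".")[0])


def versions_match(browser_version: str, driver_version: str) -> bool:
    """True when browser and driver share the same major version (assumed rule)."""
    return major_version(browser_version) == major_version(driver_version)


def find_driver(name: str = "msedgedriver") -> Optional[str]:
    """Return the full path of the driver if it is on PATH, else None."""
    return shutil.which(name)


# Example version strings (made up for illustration).
print(versions_match("120.0.2210.91", "120.0.2210.61"))
print(versions_match("120.0.2210.91", "119.0.2151.97"))
print(find_driver())  # a path if msedgedriver is on PATH, otherwise None
```

If find_driver() prints None, the driver executable is not in any directory on PATH and needs to be moved or PATH needs to be updated.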

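(2) Installing the Selenium Library

In Command Prompt (Windows) or a terminal (macOS/Linux), run:

pip install selenium

Afterwards, run pip show selenium to view the package information and confirm that the installation succeeded.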
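The same check can be done from inside Python, which is handy at the top of a scraping script. A minimal sketch, assuming Python 3.8 or later for importlib.metadata; the installed_version helper is a hypothetical name.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("selenium")
if version is None:
    print("selenium is not installed; run: pip install selenium")
else:
    print(f"selenium {version} is installed")
```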

III. Driving the Browser (Edge as the Example)

Selenium supports several browsers. The code for driving Edge is as follows:

from selenium import webdriver
from selenium.webdriver.edge.options import Options

# Create the browser options object
edge_options = Options()
# Point to the Edge executable (replace with the actual install path on your machine)
edge_options.binary_location = r"C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe"
# Start the Edge driver, attaching the browser through the options object
driver = webdriver.Edge(options=edge_options)
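The snippet above leaves the browser open when the script ends. A slightly fuller sketch, under stated assumptions, adds command-line switches and always closes the session. The helper names edge_arguments and fetch_title are hypothetical, and the switches are standard Chromium flags that may behave differently on older Edge builds.

```python
from typing import List


def edge_arguments(headless: bool = True) -> List[str]:
    """Build command-line switches to pass to Edge (Chromium-style flags)."""
    args = ["--window-size=1280,800"]
    if headless:
        # "--headless=new" is the Chromium headless switch; older browser
        # versions may only understand plain "--headless".
        args.append("--headless=new")
    return args


def fetch_title(url: str, headless: bool = True) -> str:
    """Open url in Edge, return the page title, and always close the browser."""
    # Imported here so edge_arguments() can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.edge.options import Options

    options = Options()
    for arg in edge_arguments(headless):
        options.add_argument(arg)

    driver = webdriver.Edge(options=options)
    try:
        driver.get(url)
        return driver.title
    finally:
        driver.quit()  # ends the session and closes every window


print(edge_arguments())
# Usage (launches a real browser):
#   print(fetch_title("https://www.ptpress.com.cn/"))
```

Wrapping the work in try/finally means driver.quit() runs even if loading the page raises an exception, so no orphaned browser or driver process is left behind.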

IV. Loading Web Pages

(1) Loading a single page with `get()`

get(url) opens the given page in the current browser session. Example:

from selenium import webdriver
from selenium.webdriver.edge.options import Options

edge_options = Options()
edge_options.binary_location = r"C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe"
driver = webdriver.Edge(options=edge_options)
# Load the periodicals page of the Posts & Telecom Press website
driver.get('https://www.ptpress.com.cn/periodical')

When this runs, Edge starts automatically and opens the target page, ready for data extraction and interaction.
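One practical detail: get() expects a full URL including the scheme, and a bare address such as www.ptpress.com.cn is typically rejected by the driver. A minimal sketch of a guard for that case follows; ensure_scheme is a hypothetical helper, not part of Selenium.

```python
from urllib.parse import urlsplit


def ensure_scheme(url: str, default_scheme: str = "https") -> str:
    """Prefix a bare address such as 'www.example.com' with a scheme."""
    url = url.strip()
    if urlsplit(url).scheme in ("http", "https"):
        return url  # already a full http(s) URL
    return f"{default_scheme}://{url.lstrip('/')}"


print(ensure_scheme("www.ptpress.com.cn/periodical"))
print(ensure_scheme("https://www.ptpress.com.cn/periodical"))
# Usage with a real driver:
#   driver.get(ensure_scheme(user_supplied_address))
```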

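(2) Opening multiple tabs with `execute_script()`

This method runs a JavaScript snippet, which makes it possible to open several tabs in the same browser. Its signature is execute_script(script, *args), where script is the JavaScript source as a string. Example: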

from selenium import webdriver
from selenium.webdriver.edge.options import Options

edge_options = Options()
edge_options.binary_location = r"C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe"
driver = webdriver.Edge(options=edge_options)
# Open the main page first
driver.get('https://www.ptpress.com.cn/')
# Open a new tab (Posts & Telecom Press login page)
driver.execute_script("window.open('https://www.ptpress.com.cn/login','_blank');")
# Open another new tab (the Shuyishe site)
driver.execute_script("window.open('https://www.shuyishe.com/','_blank');")

With JavaScript's window.open, several tabs can be opened in one go, which helps when a task involves navigating across multiple pages.
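Opening a tab with window.open does not move Selenium's focus: commands keep targeting the original tab until the script switches explicitly. A minimal sketch of visiting each tab follows. The function name titles_by_tab is hypothetical; it relies only on the WebDriver attributes window_handles, current_window_handle, switch_to.window() and title.

```python
from typing import Dict


def titles_by_tab(driver) -> Dict[str, str]:
    """Visit every open tab and return {window handle: page title}.

    `driver` is expected to be a Selenium WebDriver; only window_handles,
    current_window_handle, switch_to.window() and title are used.
    """
    original = driver.current_window_handle
    titles = {}
    try:
        for handle in driver.window_handles:
            driver.switch_to.window(handle)  # focus moves to this tab
            titles[handle] = driver.title
    finally:
        driver.switch_to.window(original)  # restore the starting tab
    return titles


# Usage with a real driver, after the window.open calls:
#   for handle, title in titles_by_tab(driver).items():
#       print(handle, title)
```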

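V. Getting the Rendered Page Source

Once the browser has loaded and rendered a page, the page_source property returns the complete rendered source, including content loaded dynamically by JavaScript. Example: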

from selenium import webdriver
from selenium.webdriver.edge.options import Options

edge_options = Options()
edge_options.binary_location = r"C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe"
driver = webdriver.Edge(options=edge_options)
driver.get('https://www.ptpress.com.cn/')
# Get the rendered page source
rendered_html = driver.page_source
print(rendered_html)

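The string returned by page_source can then be searched with regular expressions, XPath, or similar tools to extract the target data, such as product prices or news text.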
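As an illustration of the regular-expression route, the sketch below pulls prices out of a source string. The markup in SAMPLE_SOURCE, the price class name, and the extract_prices helper are all made up for this example. On a real page, the pattern has to be written against the actual markup.

```python
import re
from typing import List

# Stand-in for driver.page_source (made-up markup for illustration).
SAMPLE_SOURCE = """
<div class="book"><a href="/book/1">Python Basics</a><span class="price">59.80</span></div>
<div class="book"><a href="/book/2">Web Scraping</a><span class="price">79.00</span></div>
"""

# Captures the digits and dots between the opening and closing span tags.
PRICE_PATTERN = re.compile(r'<span class="price">([\d.]+)</span>')


def extract_prices(html: str) -> List[float]:
    """Return every price found in span.price elements, as floats."""
    return [float(p) for p in PRICE_PATTERN.findall(html)]


print(extract_prices(SAMPLE_SOURCE))
# Usage with a real driver:
#   prices = extract_prices(driver.page_source)
```

Regular expressions are brittle against markup changes, so for anything beyond a quick extraction an HTML parser or Selenium's own locators are usually the safer choice.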

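VI. Finding and Operating on Page Elements

(1) Finding a specific element

Selenium provides several element-locating methods, which can replace filtering the source with regular expressions. Commonly used ones:

find_element(By.ID, "id value"): locates by the element's id, which should be unique within the page.
find_element(By.NAME, "name value"): locates by the name attribute.
find_element(By.XPATH, "XPath expression"): flexible path-based locating, suited to complex pages.

Example (locating the search box and typing into it):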

from selenium import webdriver
from selenium.webdriver.edge.options import Options
from selenium.webdriver.common.by import By

edge_options = Options()
edge_options.binary_location = r"C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe"
driver = webdriver.Edge(options=edge_options)
driver.get('https://www.ptpress.com.cn/')
# Locate the search box by XPath (replace with the XPath of the actual page)
search_box = driver.find_element(By.XPATH, '//input[@placeholder="搜索图书、作者、ISBN"]')
# Read the search keyword from the console, then type it into the box as a user would
keyword = input("Enter a search keyword: ")
search_box.send_keys(keyword)

Once find_element has located the element, you can type into it, click it, and perform other interactions.
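When nothing matches, find_element raises NoSuchElementException, whereas find_elements returns an empty list. A minimal sketch of a lookup that returns None instead of raising is shown below; first_match is a hypothetical helper built on find_elements.

```python
def first_match(driver, by: str, value: str):
    """Return the first element matching the locator, or None if none match.

    Uses find_elements(), which returns an empty list when nothing matches,
    instead of find_element(), which raises NoSuchElementException.
    """
    matches = driver.find_elements(by, value)
    return matches[0] if matches else None


# Usage with a real driver (By.XPATH is the string "xpath"):
#   from selenium.webdriver.common.by import By
#   box = first_match(driver, By.XPATH, '//input[@type="text"]')
#   if box is not None:
#       box.send_keys("python")
```

This keeps the calling code free of try/except blocks when a missing element is an expected case, for example a banner that only appears on some pages.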

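(2) Finding multiple elements and batch operations

find_elements() (note the plural) returns every element on the page that matches the locator. Example (collecting all book cover image elements on the page):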

from selenium import webdriver
from selenium.webdriver.edge.options import Options
from selenium.webdriver.common.by import By

edge_options = Options()
edge_options.binary_location = r"C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe"
driver = webdriver.Edge(options=edge_options)
driver.get('https://www.ptpress.com.cn/search?keyword=python')
# Locate all book cover images (the XPath here is an assumed example)
book_covers = driver.find_elements(By.XPATH, '//img[@class="book-cover"]')
for cover in book_covers:
    # Read an attribute (such as src) or perform an action such as a click
    print(cover.get_attribute('src'))

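By iterating over the returned list, you can extract information in bulk (such as image links) or interact with each element in turn, which makes the automation more efficient.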
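In practice the collected values often need a little cleanup before use. The sketch below gathers src values into a list, skips missing ones, removes duplicates, and resolves relative paths. The function name collect_image_urls is hypothetical, and each element only needs a get_attribute(name) method, as Selenium's WebElement has.

```python
from typing import Iterable, List
from urllib.parse import urljoin


def collect_image_urls(elements: Iterable, base_url: str) -> List[str]:
    """Return absolute, de-duplicated src URLs, in first-seen order."""
    seen = set()
    urls = []
    for element in elements:
        src = element.get_attribute("src")
        if not src:  # attribute missing or empty
            continue
        # urljoin leaves absolute URLs unchanged and resolves relative ones.
        absolute = urljoin(base_url, src)
        if absolute not in seen:
            seen.add(absolute)
            urls.append(absolute)
    return urls


# Usage with a real driver:
#   covers = driver.find_elements(By.XPATH, '//img[@class="book-cover"]')
#   print(collect_image_urls(covers, driver.current_url))
```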

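That covers the core workflow for driving a browser with Selenium. Later posts will go deeper into advanced element interaction, integration with test automation frameworks, and other hands-on topics. Follow me to keep up with browser automation.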
