
A Simple, Efficient, Easy-to-Use Java Web Scraping Framework, with a Runnable Example

news · Source: original · 2025/6/14 9:00:43

WebScraper Utility Class User Manual

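Preface:

A simple, easy-to-use Java wrapper class for web scraping. The code and a worked example are both included here, so please leave a click and a bookmark in return [doge 👻].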

Spring Boot version: 3.4.5

Java version: 17

Install the dependencies:

<properties>
    <java.version>17</java.version>
    <maven.compiler.source>17</maven.compiler.source>
    <maven.compiler.target>17</maven.compiler.target>
    <selenium.version>4.20.0</selenium.version> <!-- Check for latest Selenium version -->
    <webdrivermanager.version>5.8.0</webdrivermanager.version> <!-- Check for latest WebDriverManager version -->
    <gson.version>2.10.1</gson.version> <!-- Check for latest Gson version -->
</properties>

<dependencies>
    <!-- Selenium Java -->
    <dependency>
        <groupId>org.seleniumhq.selenium</groupId>
        <artifactId>selenium-java</artifactId>
        <version>${selenium.version}</version>
    </dependency>
    <!-- WebDriverManager -->
    <dependency>
        <groupId>io.github.bonigarcia</groupId>
        <artifactId>webdrivermanager</artifactId>
        <version>${webdrivermanager.version}</version>
    </dependency>
    <!-- JSON Processing (Gson) -->
    <dependency>
        <groupId>com.google.code.gson</groupId>
        <artifactId>gson</artifactId>
        <version>${gson.version}</version>
    </dependency>
    <!-- SLF4J Simple Logger (for less verbose Selenium logging) -->
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-simple</artifactId>
        <version>2.0.12</version> <!-- Or a version compatible with your Selenium -->
    </dependency>
</dependencies>

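1. Class name: WebScraper

Purpose: wraps the core browser control, page interaction, and data extraction features in a flexible, easy-to-use scraping framework. The method signatures in this manual use Python-style shorthand with default values; the Java implementation listed later exposes the same operations under camelCase names (open_url becomes openUrl, for example) and uses overloads in place of default arguments.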

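2. Initialization method

Method signature: __init__(browser_type="chrome", headless=True, user_agent=None, proxy=None, timeout=30, debug=False)

Purpose: initializes the scraper instance and configures the browser and its developer tools.

Parameters:

- browser_type: browser type; one of "chrome", "firefox", "edge"
- headless: whether to run the browser in headless mode
- user_agent: custom User-Agent string
- proxy: proxy server configuration, in the form {"http": "http://proxy.example.com:8080", "https": "http://proxy.example.com:8080"}
- timeout: operation timeout (seconds)
- debug: whether to enable debug mode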
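
The defaults in the signature above can be gathered into a small immutable configuration object. The following is only a minimal sketch and is not part of the WebScraper class: the record name ScraperConfig and its defaults() and withHeadless() helpers are hypothetical, and the values simply mirror the documented defaults.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical holder for the initialization parameters described above.
record ScraperConfig(String browserType, boolean headless, String userAgent,
                     Map<String, String> proxy, int timeoutSeconds, boolean debug) {

    private static final Set<String> SUPPORTED = Set.of("chrome", "firefox", "edge");

    // Compact constructor: normalize and validate before the fields are assigned.
    ScraperConfig {
        browserType = browserType.toLowerCase(Locale.ROOT);
        if (!SUPPORTED.contains(browserType)) {
            throw new IllegalArgumentException("Unsupported browser: " + browserType);
        }
        if (timeoutSeconds <= 0) {
            throw new IllegalArgumentException("timeout must be positive");
        }
    }

    // Mirrors the documented defaults: chrome, headless, 30 second timeout, no debug.
    static ScraperConfig defaults() {
        return new ScraperConfig("chrome", true, null, null, 30, false);
    }

    // Returns a copy with only the headless flag changed.
    ScraperConfig withHeadless(boolean value) {
        return new ScraperConfig(browserType, value, userAgent, proxy, timeoutSeconds, debug);
    }
}
```

Validating the browser type up front means an unsupported value fails at construction time rather than when the driver is started.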

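3. Browser control methods

3.1 open_url(url)

Purpose: opens the given URL
Parameters: url - the target URL
Returns: whether the page finished loading

3.2 close()

Purpose: closes the browser instance
Parameters: none

3.3 refresh()

Purpose: refreshes the current page
Parameters: none

3.4 go_back()

Purpose: navigates back to the previous page
Parameters: none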

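4. Element location and interaction methods

4.1 find_element(selector, by="css", timeout=None)

Purpose: finds a single element
Parameters:
- selector: selector string
- by: selector type; one of "css", "xpath", "id", "class", "name", "link_text", "partial_link_text", "tag_name"
- timeout: how long to wait for the element to appear (seconds)
Returns: the matching element object, or None if nothing is found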
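
The manual spells some by values with underscores ("link_text"), while the BY_MAP keys in the Java listing have none ("linktext"). How getSeleniumBy reconciles the two is elided in the listing, so the following is only a sketch of one plausible approach: lowercase the value and strip underscores before looking it up. The enum name LocatorType and its parse method are hypothetical.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical enum covering the selector types listed above.
enum LocatorType {
    CSS, XPATH, ID, CLASS, NAME, LINKTEXT, PARTIALLINKTEXT, TAGNAME;

    // Accepts "link_text", "linktext", "LinkText", and so on.
    // A null value falls back to CSS, the documented default.
    static Optional<LocatorType> parse(String by) {
        if (by == null) {
            return Optional.of(CSS);
        }
        String key = by.trim().toLowerCase(Locale.ROOT).replace("_", "");
        for (LocatorType type : values()) {
            if (type.name().toLowerCase(Locale.ROOT).equals(key)) {
                return Optional.of(type);
            }
        }
        return Optional.empty(); // unknown selector type
    }
}
```

With Selenium on the classpath, each constant could then be dispatched to the matching factory method, such as By.cssSelector(selector) or By.xpath(selector).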

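4.2 find_elements(selector, by="css", timeout=None)

Purpose: finds multiple elements
Parameters: same as find_element
Returns: a list of the matching elements

4.3 click(element=None, selector=None, by="css", timeout=None)

Purpose: clicks an element
Parameters:
- element: element object (takes precedence when provided)
- selector: selector string (used when element is None)
- by: selector type
- timeout: how long to wait for the element to appear
Returns: the result of the operation

4.4 type_text(text, element=None, selector=None, by="css", timeout=None, clear_first=True)

Purpose: types text into an input field
Parameters:
- text: the text to type
- element: element object (takes precedence when provided)
- selector: selector string (used when element is None)
- by: selector type
- timeout: how long to wait for the element to appear
- clear_first: whether to clear the input field first
Returns: the result of the operation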

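5. Scrolling methods

5.1 scroll(direction="down", amount=None, element=None, smooth=True, duration=0.5)

Purpose: scrolls the page or an element
Parameters:
- direction: scroll direction; one of "up", "down", "left", "right"
- amount: scroll distance (pixels); defaults to 50% of the page height or width
- element: the element to scroll; defaults to the whole page
- smooth: whether to scroll smoothly
- duration: scroll duration (seconds)
Returns: the result of the operation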
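
The direction and amount parameters above boil down to a pair of pixel offsets. The sketch below shows one way that mapping could look, including the 50% default; the class name ScrollMath and the idea of passing the page dimensions in as arguments are assumptions, since the body of the real scroll method is not shown.

```java
import java.util.Locale;

// Hypothetical helper: turns a direction and optional amount into {dx, dy} pixel offsets.
final class ScrollMath {

    static int[] delta(String direction, Integer amount, int pageWidth, int pageHeight) {
        // When no amount is given, fall back to half the page height or width.
        int vertical = amount != null ? amount : pageHeight / 2;
        int horizontal = amount != null ? amount : pageWidth / 2;
        return switch (direction.toLowerCase(Locale.ROOT)) {
            case "down" -> new int[] {0, vertical};
            case "up" -> new int[] {0, -vertical};
            case "right" -> new int[] {horizontal, 0};
            case "left" -> new int[] {-horizontal, 0};
            default -> throw new IllegalArgumentException("Unknown direction: " + direction);
        };
    }
}
```

The resulting pair could then be handed to the browser through executeScript, for example with the standard JavaScript call window.scrollBy(dx, dy).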

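5.2 scroll_to_element(element=None, selector=None, by="css", timeout=None, align="center")

Purpose: scrolls to the given element
Parameters:
- element: element object (takes precedence when provided)
- selector: selector string (used when element is None)
- by: selector type
- timeout: how long to wait for the element to appear
- align: element alignment; one of "top", "center", "bottom"
Returns: the result of the operation

5.3 scroll_to_bottom(element=None, steps=10, delay=0.5)

Purpose: scrolls to the bottom of the page or an element
Parameters:
- element: the element to scroll; defaults to the whole page
- steps: number of scroll steps
- delay: delay between steps (seconds)
Returns: the result of the operation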

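6. Pagination methods

6.1 next_page(selector=None, method="click", url_template=None, page_param="page", next_page_func=None)

Purpose: moves to the next page
Parameters:
- selector: selector for the next-page button (used when method is "click")
- method: pagination method; one of "click", "url", "function"
- url_template: URL template (used when method is "url")
- page_param: name of the page-number parameter (used when method is "url")
- next_page_func: custom pagination function (used when method is "function")
Returns: whether the page change succeeded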
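
The manual does not specify the format of url_template, so the following sketch rests on two loudly stated assumptions: the template may contain a placeholder such as {page} named after page_param, and when it does not, the page number is appended as a query parameter. The class name PageUrls and its build method are hypothetical.

```java
// Hypothetical helper for the "url" pagination method.
final class PageUrls {

    static String build(String urlTemplate, String pageParam, int pageNum) {
        // Assumption 1: a placeholder like {page} marks where the page number goes.
        String placeholder = "{" + pageParam + "}";
        if (urlTemplate.contains(placeholder)) {
            return urlTemplate.replace(placeholder, Integer.toString(pageNum));
        }
        // Assumption 2: otherwise append pageParam=pageNum to the query string.
        String separator = urlTemplate.contains("?") ? "&" : "?";
        return urlTemplate + separator + pageParam + "=" + pageNum;
    }
}
```

For instance, build("https://example.com/list?page={page}", "page", 3) yields https://example.com/list?page=3. The same helper would also cover set_page, which takes the same url_template and page_param arguments.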

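6.2 has_next_page(selector=None, check_func=None)

Purpose: checks whether there is a next page
Parameters:
- selector: selector for the next-page button
- check_func: custom check function
Returns: a boolean indicating whether a next page exists

6.3 set_page(page_num, url_template=None, page_param="page")

Purpose: jumps to the given page number
Parameters:
- page_num: target page number
- url_template: URL template
- page_param: name of the page-number parameter
Returns: the result of the operation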

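7. Data extraction methods

7.1 get_text(element=None, selector=None, by="css", timeout=None)

Purpose: gets the text content of an element
Parameters: same as find_element
Returns: the text content, or None

7.2 get_attribute(attribute, element=None, selector=None, by="css", timeout=None)

Purpose: gets the value of an element attribute
Parameters:
- attribute: attribute name
- the remaining parameters are the same as for find_element
Returns: the attribute value, or None

7.3 extract_data(template)

Purpose: extracts page data according to a template
Parameters:
- template: the extraction template, a dictionary whose keys are the data field names and whose values are either selectors or extraction functions
Returns: the extracted data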
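
A template whose values are either selectors or functions implies a small dispatch step per field. The sketch below illustrates that dispatch in plain Java under stated assumptions: the class name TemplateExtractor is hypothetical, the textLookup argument stands in for whatever the real class uses to read text by selector, and the context argument stands in for the object handed to extraction functions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical dispatcher for a template of selectors and extraction functions.
final class TemplateExtractor {

    @SuppressWarnings("unchecked")
    static <C> Map<String, Object> extract(Map<String, Object> template, C context,
                                           Function<String, String> textLookup) {
        Map<String, Object> result = new LinkedHashMap<>(); // keeps field order
        for (Map.Entry<String, Object> entry : template.entrySet()) {
            Object rule = entry.getValue();
            if (rule instanceof String selector) {
                // A plain string is treated as a selector and resolved to text.
                result.put(entry.getKey(), textLookup.apply(selector));
            } else if (rule instanceof Function<?, ?> fn) {
                // A function receives the context and returns the field value.
                result.put(entry.getKey(), ((Function<C, Object>) fn).apply(context));
            } else {
                throw new IllegalArgumentException(
                        "Unsupported rule for field " + entry.getKey());
            }
        }
        return result;
    }
}
```

The unchecked cast is unavoidable with a Map<String, Object> template, which is presumably why extractData in the Java listing carries @SuppressWarnings("unchecked").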

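8. DevTools methods

8.1 start_capturing_network()

Purpose: starts capturing network requests
Parameters: none

8.2 stop_capturing_network()

Purpose: stops capturing network requests
Parameters: none

8.3 get_captured_requests(filter_type=None, url_pattern=None)

Purpose: returns the captured network requests
Parameters:
- filter_type: filter by request type; possible values include "xhr", "fetch", "script", "image", "stylesheet"
- url_pattern: filter by URL pattern; regular expressions are supported
Returns: the list of requests that match the filters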
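
The two filters above can be expressed as a short stream pipeline over the captured requests. This is a sketch only: the class name RequestFilter is hypothetical, and the "type" and "url" keys are assumed names for the fields of each captured request map, since the real record layout is not shown.

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical filter over captured requests stored as maps.
final class RequestFilter {

    static List<Map<String, Object>> filter(List<Map<String, Object>> requests,
                                            String filterType, String urlPattern) {
        // Compile the regular expression once; null means "no URL filter".
        Pattern pattern = urlPattern == null ? null : Pattern.compile(urlPattern);
        return requests.stream()
                // Assumed key "type": compared case-insensitively.
                .filter(r -> filterType == null
                        || filterType.equalsIgnoreCase(String.valueOf(r.get("type"))))
                // Assumed key "url": matched anywhere in the URL.
                .filter(r -> pattern == null
                        || pattern.matcher(String.valueOf(r.get("url"))).find())
                .collect(Collectors.toList());
    }
}
```

Using find() rather than matches() lets a short pattern such as /api/ match anywhere in the URL; anchors can be added when a full match is wanted.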

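8.4 add_request_interceptor(pattern, handler_func)

Purpose: adds a request interceptor
Parameters:
- pattern: URL matching pattern
- handler_func: handler function; it receives the request object and can modify the request or return a custom response
Returns: the interceptor ID

9. Helper methods

9.1 wait_for_element(selector, by="css", timeout=None, condition="visible")

Purpose: waits until an element satisfies a given condition
Parameters:
- selector: selector string
- by: selector type
- timeout: timeout
- condition: the condition to wait for; one of "visible", "present", "clickable", "invisible", "not_present"
Returns: the element object, or None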
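
The wait described above follows a poll-until-timeout pattern. The Java listing imports Selenium's WebDriverWait and ExpectedConditions, which are presumably what the real method uses; the sketch below is a plain-Java stand-in that shows only the underlying idea, with a hypothetical class name Waits and a probe supplier taking the place of the condition check.

```java
import java.time.Duration;
import java.util.function.Supplier;

// Hypothetical generic polling wait; not the Selenium implementation.
final class Waits {

    // Calls probe repeatedly until it returns a non-null value or the timeout elapses.
    // Returns null on timeout or interruption, matching the "element or None" contract above.
    static <T> T until(Supplier<T> probe, Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            T value = probe.get();
            if (value != null) {
                return value;
            }
            if (System.nanoTime() - deadline >= 0) {
                return null; // timed out
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return null;
            }
        }
    }
}
```

Each of the listed conditions would supply a different probe; "visible", for example, would return the element only once it is displayed.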

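9.2 execute_script(script, *args)

Purpose: executes JavaScript code
Parameters:
- script: the JavaScript code
- *args: arguments passed to the JavaScript
Returns: the result of the JavaScript execution

9.3 set_delay(min_delay, max_delay=None)

Purpose: sets the random delay applied between operations
Parameters:
- min_delay: minimum delay (seconds)
- max_delay: maximum delay (seconds); if None, the delay is fixed at min_delay
Returns: nothing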
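
The rule above (a random delay between the two bounds, or a fixed delay when max_delay is omitted) can be sketched as follows. The class name DelayPolicy and the nextDelayMillis method are hypothetical; the real performDelay body is not shown in the listing.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical delay policy: random within [min, max), or fixed when max is absent.
final class DelayPolicy {
    private final double minSeconds;
    private final double maxSeconds;

    DelayPolicy(double minSeconds, Double maxSeconds) {
        if (minSeconds < 0) {
            throw new IllegalArgumentException("min delay must not be negative");
        }
        this.minSeconds = minSeconds;
        // A missing or too-small max collapses to a fixed delay of minSeconds.
        this.maxSeconds = maxSeconds == null ? minSeconds : Math.max(minSeconds, maxSeconds);
    }

    // Picks the next delay in milliseconds.
    long nextDelayMillis() {
        double seconds = maxSeconds > minSeconds
                ? ThreadLocalRandom.current().nextDouble(minSeconds, maxSeconds)
                : minSeconds;
        return Math.round(seconds * 1000);
    }
}
```

A caller would typically sleep for nextDelayMillis() before each browser action so that requests are not sent at a perfectly regular rhythm.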

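9.4 take_screenshot(path=None)

Purpose: takes a screenshot of the current page
Parameters:
- path: where to save the file; if None, the image data is returned instead
Returns: the binary image data if path is None; otherwise the result of the save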
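
The save branch of take_screenshot comes down to writing a byte array to disk and reporting success. The following sketch shows that step in isolation, assuming the image bytes have already been obtained; the class name ScreenshotStore is hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: writes screenshot bytes to a file and reports success.
final class ScreenshotStore {

    static boolean save(byte[] imageData, String path) {
        try {
            Path target = Paths.get(path).toAbsolutePath();
            // Create missing parent directories so the write does not fail on a new folder.
            Files.createDirectories(target.getParent());
            Files.write(target, imageData);
            return true;
        } catch (IOException | RuntimeException e) {
            System.err.println("Failed to save screenshot: " + e.getMessage());
            return false;
        }
    }
}
```

Returning a boolean rather than throwing matches the boolean takeScreenshot(String path) overload in the Java listing.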

WebScraper class implementation

package com.example.demo.utils;

import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
import io.github.bonigarcia.wdm.WebDriverManager;
import org.openqa.selenium.By;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.NoSuchElementException;
import org.openqa.selenium.Proxy;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.edge.EdgeDriver;
import org.openqa.selenium.edge.EdgeOptions;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.firefox.FirefoxOptions;
import org.openqa.selenium.firefox.FirefoxProfile;
import org.openqa.selenium.remote.CapabilityType;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.*;
import java.util.function.Function;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class WebScraper implements AutoCloseable {

    // Supported selector type keys (the By values are placeholders)
    private static final Map<String, By> BY_MAP = new HashMap<>();

    static {
        BY_MAP.put("css", By.cssSelector("*"));
        BY_MAP.put("xpath", By.xpath("//*"));
        BY_MAP.put("id", By.id("placeholder_id"));
        BY_MAP.put("class", By.className("placeholder_class"));
        BY_MAP.put("name", By.name("placeholder_name"));
        BY_MAP.put("linktext", By.linkText("placeholder_linktext"));
        BY_MAP.put("partiallinktext", By.partialLinkText("placeholder_partial"));
        BY_MAP.put("tagname", By.tagName("div"));
    }

    private WebDriver driver;
    private final String browserType;
    private final boolean headless;
    private final String userAgent;
    private final Map<String, String> proxyConfig;
    private final Duration timeout;
    private final boolean debug;
    private final String profileDirectory;
    private int currentPageNum = 1;
    private double minDelay = 0.5;
    private double maxDelay = 1.5;
    private List<Map<String, Object>> networkRequestsRaw = new ArrayList<>();
    private final JavascriptExecutor jsExecutor;

    public WebScraper(String browserType, boolean headless, String userAgent,
                      Map<String, String> proxyConfig, int timeoutSeconds, boolean debug,
                      String profileDirectory) {
        this.browserType = browserType.toLowerCase();
        this.headless = headless;
        this.userAgent = userAgent;
        this.proxyConfig = proxyConfig;
        this.timeout = Duration.ofSeconds(timeoutSeconds);
        this.debug = debug;
        this.profileDirectory = profileDirectory;
        setupDriver();
        this.jsExecutor = (JavascriptExecutor) driver;
    }

    public WebScraper() {
        this("chrome", true, null, null, 30, false, null);
    }

    // Several overloaded constructors...

    private void printDebug(String message) {
        if (debug) {
            System.out.println("[DEBUG] " + message);
        }
    }

    private void setupDriver() {
        // Browser driver initialization logic...
    }

    private By getSeleniumBy(String byString, String selector) {
        // Selector type conversion logic...
    }

    private void performDelay() {
        // Operation delay logic...
    }

    // Browser control methods...

    public boolean openUrl(String url) {
        // Open URL logic...
    }

    @Override
    public void close() {
        // Close browser logic...
    }

    public void refresh() {
        // Refresh page logic...
    }

    public void goBack() {
        // Go back to previous page logic...
    }

    // Element location and interaction methods...

    public WebElement findElement(String selector, String by, Duration customTimeout) {
        // Find single element logic...
    }

    // Several overloads...

    public List<WebElement> findElements(String selector, String by, Duration customTimeout) {
        // Find multiple elements logic...
    }

    // Several overloads...

    public boolean click(WebElement element, String selector, String by, Duration customTimeout) {
        // Click element logic...
    }

    // Several overloads...

    public boolean typeText(String text, WebElement element, String selector, String by,
                            Duration customTimeout, boolean clearFirst) {
        // Type text logic...
    }

    // Several overloads...

    // Scrolling methods...

    public boolean scroll(String direction, Integer amount, WebElement element, boolean smooth,
                          double durationSeconds) {
        // Scroll logic...
    }

    // Several overloads...

    public boolean scrollToElement(WebElement element, String selector, String by,
                                   Duration customTimeout, String align) {
        // Scroll to element logic...
    }

    // Several overloads...

    public boolean scrollToBottom(WebElement element, int steps, double delaySeconds) {
        // Scroll to bottom logic...
    }

    // Several overloads...

    // Pagination methods...

    public boolean nextPage(String selector, String method, String urlTemplate, String pageParam,
                            Function<WebScraper, Boolean> nextPageFunc) {
        // Next page logic...
    }

    // Several overloads...

    public boolean hasNextPage(String selector, Function<WebScraper, Boolean> checkFunc) {
        // Check for next page logic...
    }

    // Several overloads...

    public boolean setPage(int pageNum, String urlTemplate, String pageParam) {
        // Jump to page logic...
    }

    // Data extraction methods...

    public String getText(WebElement element, String selector, String by, Duration customTimeout) {
        // Get text logic...
    }

    // Several overloads...

    public String getAttribute(String attribute, WebElement element, String selector, String by,
                               Duration customTimeout) {
        // Get attribute logic...
    }

    // Several overloads...

    @SuppressWarnings("unchecked")
    public Map<String, Object> extractData(Map<String, Object> template) {
        // Data extraction logic...
    }

    // DevTools methods...

    public void startCapturingNetwork() {
        // Start capturing network requests logic...
    }

    public void stopCapturingNetwork() {
        // Stop capturing network requests logic...
    }

    @SuppressWarnings("unchecked")
    public List<Map<String, Object>> getCapturedRequests(String filterType, String urlPattern) {
        // Get captured network requests logic...
    }

    // Several overloads...

    public String addRequestInterceptor(String pattern, Function<Object, Object> handlerFunc) {
        // Add request interceptor logic...
    }

    // Helper methods...

    public WebElement waitForElement(String selector, String by, Duration customTimeout,
                                     String condition) {
        // Wait for element logic...
    }

    // Several overloads...

    public Object executeScript(String script, Object... args) {
        // Execute JavaScript logic...
    }

    public void setDelay(double minDelaySeconds, double maxDelaySeconds) {
        // Set delay logic...
    }

    public boolean takeScreenshot(String path) {
        // Screenshot logic...
    }

    public byte[] takeScreenshot() {
        // Screenshot logic...
    }

    public WebDriver getDriver() {
        return driver;
    }
}

Baidu search usage example

package com.example.demo.utils;

import org.openqa.selenium.WebElement;

import java.util.concurrent.TimeUnit;

public class BaiduSearchDemo {

    public static void main(String[] args) {
        String keyword = "人工智能"; // search keyword ("artificial intelligence")
        int pageCount = 5;
        // Relies on one of the elided overloaded constructors
        // (presumably browser type, headless, timeout in seconds, debug)
        try (WebScraper scraper = new WebScraper("chrome", false, 30, true)) {
            performBaiduSearch(scraper, keyword, pageCount);
        } catch (Exception e) {
            System.err.println("Error during Baidu search: " + e.getMessage());
            e.printStackTrace();
        }
    }

    private static void performBaiduSearch(WebScraper scraper, String keyword, int pageCount) {
        System.out.println("Opening the Baidu home page...");
        if (!scraper.openUrl("https://www.baidu.com")) {
            System.err.println("Failed to open the Baidu home page");
            return;
        }

        System.out.println("Typing the search keyword: " + keyword);
        WebElement searchInput = scraper.findElement("#kw", "css");
        if (searchInput != null) {
            scraper.typeText(keyword, searchInput);
        } else {
            System.err.println("Search input not found");
            return;
        }

        System.out.println("Clicking the search button...");
        if (!scraper.click("#su", "css")) {
            System.err.println("Failed to click the search button");
            return;
        }

        // Give the results page time to load
        try {
            TimeUnit.SECONDS.sleep(2);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }

        for (int i = 1; i <= pageCount; i++) {
            System.out.println("Processing page " + i + "...");
            System.out.println("Scrolling to the bottom of the page...");
            if (!scraper.scrollToBottom()) {
                System.err.println("Failed to scroll to the bottom of the page");
            }
            extractCurrentPageInfo(scraper);

            if (i < pageCount) {
                System.out.println("Moving to the next page...");
                if (!goToNextPage(scraper)) {
                    System.err.println("Pagination failed, stopping");
                    break;
                }
                try {
                    TimeUnit.SECONDS.sleep(2);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }
    }

    private static void extractCurrentPageInfo(WebScraper scraper) {
        String pageTitle = scraper.executeScript("return document.title").toString();
        String currentUrl = scraper.executeScript("return window.location.href").toString();
        System.out.println("Current page title: " + pageTitle);
        System.out.println("Current page URL: " + currentUrl);

        // Print the titles of the first five results
        System.out.println("Search result titles:");
        int resultCount = 0;
        for (WebElement titleElement : scraper.findElements("h3.t a", "css")) {
            if (resultCount < 5) {
                String title = scraper.getText(titleElement);
                if (title != null) {
                    System.out.println("  " + (resultCount + 1) + ". " + title);
                }
                resultCount++;
            } else {
                break;
            }
        }
        System.out.println("------------------------");
    }

    private static boolean goToNextPage(WebScraper scraper) {
        String nextPageSelector = "a.n";
        if (scraper.hasNextPage(nextPageSelector)) {
            return scraper.nextPage(nextPageSelector);
        } else {
            System.out.println("Reached the last page, no more content");
            return false;
        }
    }
}