Moving 1688 Product Details into MySQL: An End-to-End Java Crawler Walkthrough (2025 Edition)
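1. Why scrape 1688 product details yourself?
- Product selection: live-commerce teams need to compare source factories quickly on price, minimum order quantity (MOQ), and SKUs
- Competitor tracking: when a rival's new listing takes off within 5 days, you can follow with the same item right away
- Training data: product titles plus attributes feed multimodal category prediction
- Price monitoring: a factory price change automatically triggers a purchasing alert
The official offer.get API requires an enterprise qualification plus request signing, which blocks nearly all individual developers. The detail page on the web is publicly visible, so scraping it remains the lowest-cost option. The rest of this article uses plain Java to walk the whole chain: detail page → JSON-LD → real-time API → SKU → database → Feishu notification.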
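2. Tech stack (all open source)
Module | Library | Notes |
---|---|---|
Network | Apache HttpAsyncClient 5 | Asynchronous, non-blocking I/O (callback-based, not coroutines); can reach around 10k QPS on a single core |
Parsing | JSoup + Jackson | Extracts JSON-LD / JSONP |
JSON | Jackson | Safer than fastjson |
Concurrency | CompletableFuture + token bucket | 15 QPS gets past anti-scraping checks |
Database | MyBatis-Plus + MySQL 8 | Batch insert + upsert |
Dedup | Redis + BloomFilter | Saves about 90% memory |
Proxy | Apache HttpClient proxy support | SOCKS5 with username/password |
Monitoring | Logback + Feishu | Webhook group notifications |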
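3. Environment setup from scratch (Linux / Windows / macOS)
bash
# 1. JDK 17, Git, and Maven (dnf is for Fedora/RHEL; use your platform's package manager elsewhere)
#    The -devel package ships javac, which Maven needs to compile.
sudo dnf install -y java-17-openjdk-devel git maven
# 2. Clone the project
git clone https://gitee.com/yourrepo/1688-detail-java.git
cd 1688-detail-java
# 3. Resolve dependencies and build in one step
mvn clean package -DskipTests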
4. Maven dependencies: the full list
xml
<!-- Versionless entries assume the Spring Boot parent POM (or BOM) manages their versions. -->
<dependencies>
  <!-- Network -->
  <dependency>
    <groupId>org.apache.httpcomponents.client5</groupId>
    <artifactId>httpclient5</artifactId>
    <version>5.2.1</version>
  </dependency>
  <!-- JSON -->
  <dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-databind</artifactId>
    <version>2.15.2</version>
  </dependency>
  <!-- HTML parsing -->
  <dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.15.4</version>
  </dependency>
  <!-- Rate limiting: the RateLimiter used in OfferClient is assumed to be Guava's -->
  <dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>32.1.2-jre</version>
  </dependency>
  <!-- Database -->
  <dependency>
    <groupId>com.baomidou</groupId>
    <artifactId>mybatis-plus-boot-starter</artifactId>
    <version>3.5.3</version>
  </dependency>
  <!-- mysql-connector-j is published under the com.mysql groupId -->
  <dependency>
    <groupId>com.mysql</groupId>
    <artifactId>mysql-connector-j</artifactId>
    <version>8.0.33</version>
  </dependency>
  <!-- Redis -->
  <dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-data-redis</artifactId>
  </dependency>
  <!-- Utilities -->
  <dependency>
    <groupId>org.projectlombok</groupId>
    <artifactId>lombok</artifactId>
    <optional>true</optional>
  </dependency>
</dependencies>
5. Core flow: a six-step closed loop (complete code for each step)
① Find the entry points: detail-page JSON-LD plus the real-time JSONP API (valid as of 2025-10)
Detail page:
https://detail.1688.com/offer/{offerId}.html
Product JSON-LD block:
HTML
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "2025 Summer New T-Shirt",
  "image": ["//img.alicdn.com/imgextra/..."],
  "description": "Pure cotton, breathable",
  "sku": [{"name": "Color", "value": "Black"},...],
  "offers": {"priceCurrency": "CNY", "price": "29.90"}
}
</script>
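The image entries in the JSON-LD sample are protocol-relative (they start with `//`), so they cannot be downloaded or stored as-is. A minimal sketch of normalizing them is shown here; the class name `ImageUrls` and the sample path `demo.jpg` are hypothetical, and choosing HTTPS as the scheme is an assumption.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper: turns protocol-relative image URLs into absolute HTTPS URLs. */
final class ImageUrls {
    private ImageUrls() {}

    /** "//host/path" becomes "https://host/path"; every other value passes through trimmed. */
    static String normalize(String url) {
        if (url == null) {
            return null;
        }
        String trimmed = url.trim();
        // Assumption: the CDN serves the same resource over HTTPS.
        return trimmed.startsWith("//") ? "https:" + trimmed : trimmed;
    }

    /** Applies normalize() to every entry of the JSON-LD "image" array. */
    static List<String> normalizeAll(List<String> urls) {
        return urls.stream().map(ImageUrls::normalize).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(normalize("//img.alicdn.com/imgextra/demo.jpg"));
    }
}
```

Running this step before persisting keeps the `pics` column directly usable by downstream consumers.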
Real-time stock/price API (JSONP):
https://laputa.1688.com/offer/ajax/OfferDetailWidget.do?offerId={offerId}&callback=jsonp123
Response:
JavaScript
jsonp123({"skuPriceList":[...],"moq":3,"quantity":9999})
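A JSONP response is JSON wrapped in a function call, so the wrapper has to come off before a JSON parser can read it. The following is a minimal sketch using only the JDK; the class name `Jsonp` is hypothetical, and it assumes the server echoes the callback name exactly as requested.

```java
/** Hypothetical helper: strips the "callback( ... )" wrapper from a JSONP body. */
final class Jsonp {
    private Jsonp() {}

    /**
     * Returns the JSON text between "callback(" and the final ")".
     * Throws if the body is not a call to the expected callback.
     */
    static String unwrap(String body, String callback) {
        String text = body.trim();
        // Some servers terminate the call with a semicolon.
        if (text.endsWith(";")) {
            text = text.substring(0, text.length() - 1).trim();
        }
        String prefix = callback + "(";
        if (!text.startsWith(prefix) || !text.endsWith(")")) {
            throw new IllegalArgumentException("Not a JSONP response for callback " + callback);
        }
        return text.substring(prefix.length(), text.length() - 1);
    }

    public static void main(String[] args) {
        System.out.println(unwrap("jsonp123({\"moq\":3,\"quantity\":9999})", "jsonp123"));
    }
}
```

Checking the callback name instead of searching for the first parenthesis means an HTML error page or a CAPTCHA page fails loudly instead of producing garbage JSON.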
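② Wrap requests and parsing in one class
java
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.google.common.util.concurrent.RateLimiter; // Guava
import org.apache.hc.client5.http.async.methods.*;
import org.apache.hc.client5.http.impl.async.*;
import org.apache.hc.core5.concurrent.FutureCallback;
import org.apache.hc.core5.http.message.BasicHeader;
import org.apache.hc.core5.reactor.IOReactorConfig;
import org.jsoup.Jsoup;
import org.jsoup.nodes.*;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class OfferClient {
    private static final ObjectMapper mapper = new ObjectMapper();
    // Element type is an assumption; match it to the OfferBase / OfferRealtime fields.
    private static final TypeReference<List<Object>> LIST = new TypeReference<>() {};
    private final CloseableHttpAsyncClient http;
    private final RateLimiter rateLimiter = RateLimiter.create(15); // 15 permits per second

    public OfferClient() {
        IOReactorConfig io = IOReactorConfig.custom().setIoThreadCount(8).build();
        http = HttpAsyncClients.custom()
                .setIOReactorConfig(io)
                .setDefaultHeaders(List.of(
                        new BasicHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"),
                        new BasicHeader("Referer", "https://detail.1688.com/")))
                .build();
        http.start();
    }

    /** ① Fetch the HTML and parse the JSON-LD base fields. */
    public CompletableFuture<OfferBase> fetchBase(String offerId) {
        String url = "https://detail.1688.com/offer/" + offerId + ".html";
        return get(url).thenApply(html -> parseBase(html, offerId));
    }

    /** ② Fetch real-time stock and price via JSONP. */
    public CompletableFuture<OfferRealtime> fetchRealtime(String offerId) {
        rateLimiter.acquire(); // blocks the caller until a permit is free
        String callback = "jsonp" + System.nanoTime();
        String url = "https://laputa.1688.com/offer/ajax/OfferDetailWidget.do?"
                + "offerId=" + offerId + "&callback=" + callback;
        // Strip the "callback( ... )" wrapper before parsing.
        return get(url).thenApply(body -> parseRealtime(
                readTree(body.substring(body.indexOf('(') + 1, body.lastIndexOf(')'))), offerId));
    }

    /** execute() returns a plain Future, so bridge its callback to a CompletableFuture. */
    private CompletableFuture<String> get(String url) {
        CompletableFuture<String> cf = new CompletableFuture<>();
        http.execute(SimpleRequestBuilder.get(url).build(), new FutureCallback<SimpleHttpResponse>() {
            @Override public void completed(SimpleHttpResponse r) { cf.complete(r.getBodyText()); }
            @Override public void failed(Exception e) { cf.completeExceptionally(e); }
            @Override public void cancelled() { cf.cancel(false); }
        });
        return cf;
    }

    /** readTree throws a checked exception, which lambdas cannot propagate. */
    private static JsonNode readTree(String json) {
        try {
            return mapper.readTree(json);
        } catch (JsonProcessingException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Parse the JSON-LD base fields. */
    private OfferBase parseBase(String html, String offerId) {
        Document doc = Jsoup.parse(html);
        Element script = doc.selectFirst("script[type=\"application/ld+json\"]");
        if (script == null) return OfferBase.empty(offerId);
        JsonNode ld = readTree(script.data());
        return OfferBase.builder().offerId(offerId)
                .title(ld.at("/name").asText(""))
                .pics(mapper.convertValue(ld.at("/image"), LIST))
                .price(ld.at("/offers/price").asDouble(0))
                .currency(ld.at("/offers/priceCurrency").asText("CNY"))
                .props(mapper.convertValue(ld.at("/sku"), LIST))
                .desc(ld.at("/description").asText(""))
                .build();
    }

    /** Parse the real-time JSONP fields. */
    private OfferRealtime parseRealtime(JsonNode root, String offerId) {
        return OfferRealtime.builder().offerId(offerId)
                .moq(root.at("/moq").asInt(1))
                .quantity(root.at("/quantity").asInt(0))
                .skuPrice(mapper.convertValue(root.at("/skuPriceList"), LIST))
                .build();
    }
}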
③ Concurrency: fill in real-time data in parallel with CompletableFuture
java
// Lives in OfferClient: both requests run in parallel and are merged when both finish.
public CompletableFuture<OfferDetail> fetchDetail(String offerId) {
    CompletableFuture<OfferBase> baseFuture = fetchBase(offerId);
    CompletableFuture<OfferRealtime> realFuture = fetchRealtime(offerId);
    return baseFuture.thenCombine(realFuture, (b, r) ->
            OfferDetail.builder()
                    .offerId(offerId)
                    .title(b.getTitle())
                    .pics(b.getPics())
                    .price(b.getPrice())
                    .currency(b.getCurrency())
                    .props(b.getProps())
                    .desc(b.getDesc())
                    .moq(r.getMoq())
                    .quantity(r.getQuantity())
                    .skuPrice(r.getSkuPrice())
                    .build());
}
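With a plain `thenCombine`, a failure or hang in either request fails the whole detail. One possible refinement, sketched here with JDK-only types, is to give the real-time call a timeout and a fallback so the base data still gets through. The records `Base`, `Realtime`, and `Detail` are simplified stand-ins for the article's classes, and the fallback values (MOQ 1, quantity 0) are assumptions mirroring the parser defaults.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical demo: merge two async results, tolerating failure of the second one. */
final class CombineDemo {
    record Base(String offerId, String title) {}
    record Realtime(int moq, int quantity) {}
    record Detail(String offerId, String title, int moq, int quantity) {}

    private CombineDemo() {}

    static CompletableFuture<Detail> combine(CompletableFuture<Base> base,
                                             CompletableFuture<Realtime> realtime,
                                             long timeoutMillis) {
        // orTimeout fails the future if it is still pending after the deadline;
        // exceptionally then replaces any failure with default values.
        CompletableFuture<Realtime> safeRealtime = realtime
                .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS)
                .exceptionally(ex -> new Realtime(1, 0));
        return base.thenCombine(safeRealtime,
                (b, r) -> new Detail(b.offerId(), b.title(), r.moq(), r.quantity()));
    }

    public static void main(String[] args) {
        Detail d = combine(
                CompletableFuture.completedFuture(new Base("123456789", "T-Shirt")),
                CompletableFuture.failedFuture(new IllegalStateException("403")),
                1_000).join();
        System.out.println(d);
    }
}
```

Whether a silent fallback is acceptable depends on the use case; for price monitoring it may be better to mark the record as stale instead of storing defaults.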
④ Persistence: MyBatis-Plus batch insert + Redis dedup
sql
CREATE TABLE tb_1688_detail (
  id BIGINT AUTO_INCREMENT PRIMARY KEY,
  offer_id VARCHAR(32) UNIQUE NOT NULL,
  title VARCHAR(255) NOT NULL,
  price DECIMAL(10,2) NOT NULL,
  currency CHAR(3) DEFAULT 'CNY',
  pics JSON,
  props JSON,
  `desc` TEXT,
  moq INT DEFAULT 1,
  quantity INT DEFAULT 0,
  sku_price JSON,
  created_at DATETIME DEFAULT CURRENT_TIMESTAMP
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
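The tech-stack table mentions upsert, and the UNIQUE constraint on `offer_id` is what makes it possible. As a rough sketch, the statement can be generated in Java as below; the class `UpsertSql` is hypothetical, the row alias syntax (`AS new`) assumes MySQL 8.0.19 or later, and identifiers are backtick-quoted because `desc` is a reserved word.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper: builds a MySQL "INSERT ... ON DUPLICATE KEY UPDATE" statement. */
final class UpsertSql {
    private UpsertSql() {}

    private static String quote(String identifier) {
        return "`" + identifier + "`";
    }

    /**
     * @param table     target table name (trusted input, not user-supplied)
     * @param keyColumn the unique column; it is inserted but never updated
     * @param columns   all columns to insert, in parameter order
     */
    static String build(String table, String keyColumn, List<String> columns) {
        String cols = columns.stream().map(UpsertSql::quote).collect(Collectors.joining(", "));
        String marks = columns.stream().map(c -> "?").collect(Collectors.joining(", "));
        String updates = columns.stream()
                .filter(c -> !c.equals(keyColumn))
                .map(c -> quote(c) + " = new." + quote(c))
                .collect(Collectors.joining(", "));
        return "INSERT INTO " + table + " (" + cols + ") VALUES (" + marks + ") AS new"
                + " ON DUPLICATE KEY UPDATE " + updates;
    }

    public static void main(String[] args) {
        System.out.println(build("tb_1688_detail", "offer_id",
                List.of("offer_id", "title", "price", "desc")));
    }
}
```

The resulting string can be used with a JDBC `PreparedStatement` or placed in a MyBatis mapper, so re-crawled offers refresh their price instead of failing on the unique key.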
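Service
java
@Service
public class OfferService {
    private final OfferDetailMapper mapper;
    private final RedisTemplate<String, String> redis;

    // Constructor injection, so both final fields are set by Spring.
    public OfferService(OfferDetailMapper mapper, RedisTemplate<String, String> redis) {
        this.mapper = mapper;
        this.redis = redis;
    }

    public void saveBatch(List<OfferDetail> list) {
        // SADD returns the number of members added (a Long): 1 means the id is new.
        List<OfferDetail> filtered = list.stream()
                .filter(o -> Long.valueOf(1L).equals(redis.opsForSet().add("offer:id", o.getOfferId())))
                .collect(Collectors.toList());
        // insertBatchSomeColumn must be registered through a custom MyBatis-Plus SqlInjector.
        if (!filtered.isEmpty()) mapper.insertBatchSomeColumn(filtered);
    }
}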
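The service above deduplicates with a plain Redis set, while the tech-stack table lists a Bloom filter. To illustrate the idea behind the latter, here is a minimal in-memory sketch using `java.util.BitSet`; it is not the Redis-backed variant, and the class name, the sizing, and the choice of hash functions (`String.hashCode` plus CRC32, combined by double hashing) are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;
import java.util.zip.CRC32;

/** Hypothetical in-memory Bloom filter for offer IDs (false positives possible, no false negatives). */
final class OfferBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    OfferBloomFilter(int size, int hashes) {
        if (size <= 0 || hashes <= 0) {
            throw new IllegalArgumentException("size and hashes must be positive");
        }
        this.size = size;
        this.hashes = hashes;
        this.bits = new BitSet(size);
    }

    /** Records the id; returns true if at least one of its bits was not set before. */
    boolean addIfAbsent(String offerId) {
        boolean isNew = false;
        for (int index : indexes(offerId)) {
            if (!bits.get(index)) {
                isNew = true;
                bits.set(index);
            }
        }
        return isNew;
    }

    /** False means "definitely never added"; true means "probably added". */
    boolean mightContain(String offerId) {
        for (int index : indexes(offerId)) {
            if (!bits.get(index)) {
                return false;
            }
        }
        return true;
    }

    /** Double hashing: index_i = (h1 + i * h2) mod size. */
    private int[] indexes(String offerId) {
        int h1 = offerId.hashCode();
        CRC32 crc = new CRC32();
        crc.update(offerId.getBytes(StandardCharsets.UTF_8));
        int h2 = (int) crc.getValue() | 1; // odd step, so it is never zero
        int[] result = new int[hashes];
        for (int i = 0; i < hashes; i++) {
            result[i] = Math.floorMod(h1 + i * h2, size);
        }
        return result;
    }

    public static void main(String[] args) {
        OfferBloomFilter filter = new OfferBloomFilter(1 << 20, 4);
        System.out.println(filter.addIfAbsent("123456789"));
        System.out.println(filter.addIfAbsent("123456789"));
    }
}
```

The trade-off is that a Bloom filter can occasionally report a new offer as already seen, so a small share of offers may be skipped in exchange for the memory saving.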
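⑤ Main entry point: run everything in one go
java
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.ConfigurableApplicationContext;

@SpringBootApplication
public class CrawlerApp {
    public static void main(String[] args) {
        // Start Spring so OfferService gets its mapper and RedisTemplate injected;
        // creating it with `new` would leave both unset.
        try (ConfigurableApplicationContext ctx = SpringApplication.run(CrawlerApp.class, args)) {
            List<String> offerIds = List.of("123456789", "987654321", "555666777");
            OfferClient client = new OfferClient();
            List<CompletableFuture<OfferDetail>> futures = offerIds.stream()
                    .map(client::fetchDetail)
                    .collect(Collectors.toList());
            CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
            List<OfferDetail> result = futures.stream()
                    .map(CompletableFuture::join)
                    .collect(Collectors.toList());
            ctx.getBean(OfferService.class).saveBatch(result);
            System.out.println("done, total = " + result.size());
        }
    }
}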
⑥ Docker scheduling: Feishu report at 8:00 every day
Dockerfile
dockerfile
# Stage 1: build the JAR
FROM maven:3.9-eclipse-temurin-17 AS build
COPY . /app
RUN mvn -f /app/pom.xml clean package -DskipTests
# Stage 2: run on a slim JRE image
FROM eclipse-temurin:17-jre
COPY --from=build /app/target/1688-detail-java-1.0-SNAPSHOT.jar /app.jar
ENTRYPOINT ["java","-jar","/app.jar"]
docker-compose.yml
yaml
version: "3.9"
services:
  crawler:
    build: .
    environment:
      - SPRING_DATASOURCE_URL=jdbc:mysql://db:3306/crawler?useSSL=false
      - SPRING_REDIS_HOST=redis
    depends_on:
      - db
      - redis
  db:
    image: mysql:8.0
    environment:
      MYSQL_ROOT_PASSWORD: 123456
      MYSQL_DATABASE: crawler
    volumes:
      - db_data:/var/lib/mysql
  redis:
    image: redis:7-alpine
volumes:
  db_data:
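Scheduling on the host
bash
crontab -e
# Every day at 8:00. -d detaches, so the cron job returns instead of
# staying attached to the long-running db and redis containers.
0 8 * * * docker-compose -f /home/1688/docker-compose.yml up -d --build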
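Feishu push
java
// Uses the classic (blocking) API that ships in httpclient5.
String body = "{\"msg_type\":\"text\",\"content\":{\"text\":\"1688 crawler added 3,124 detail records, duplicate rate 21%\"}}";
HttpPost post = new HttpPost("https://open.feishu.cn/open-apis/bot/v2/hook/xxx");
post.setEntity(new StringEntity(body, ContentType.APPLICATION_JSON));
// Close the client when done; the response handler releases the connection.
try (CloseableHttpClient client = HttpClients.createDefault()) {
    int status = client.execute(post, response -> response.getCode());
}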
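The message above hard-codes its numbers. A small sketch of building the same payload from real counts follows; the class `FeishuText` is hypothetical, and defining the duplicate rate as duplicates divided by all crawled records is an assumption, since the article does not say how the 21% is computed.

```java
import java.util.Locale;

/** Hypothetical helper: builds the Feishu text-message payload from crawl counters. */
final class FeishuText {
    private FeishuText() {}

    /** Escapes the characters that are not allowed raw inside a JSON string. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '"') {
                sb.append("\\\"");
            } else if (c == '\\') {
                sb.append("\\\\");
            } else if (c == '\n') {
                sb.append("\\n");
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Assumption: duplicate rate = duplicates / (added + duplicates), rounded to a whole percent. */
    static String payload(int added, int duplicates) {
        int seen = added + duplicates;
        long rate = seen == 0 ? 0 : Math.round(100.0 * duplicates / seen);
        String text = String.format(Locale.US,
                "1688 crawler added %,d detail records, duplicate rate %d%%", added, rate);
        return "{\"msg_type\":\"text\",\"content\":{\"text\":\"" + escape(text) + "\"}}";
    }

    public static void main(String[] args) {
        System.out.println(payload(3124, 830));
    }
}
```

The returned string can be passed as `body` to the `StringEntity` in the push snippet; in a project that already depends on Jackson, serializing a map with `ObjectMapper` would avoid the manual escaping.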
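6. Pitfalls and anti-scraping tips
- Missing JSON-LD: a few products render it with JavaScript; fall back to extracting the fields from the page with XPath
- Real-time API returns 403: the Referer header must be set to https://detail.1688.com/
- Rate limiting: 15 QPS per IP passes reliably; more than 200 requests within 10 minutes triggers a slider CAPTCHA
- Proxy pool: Qingguo Cloud costs about 0.8 yuan per GB, enough for roughly 80,000 detail pages
- Duplicates: the Redis key offer:id deduplicates within seconds and, combined with a Bloom filter, saves about 90% of memory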
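The per-second limiter in the client does not by itself enforce the 200-requests-per-10-minutes threshold from the list above. One way to stay under it, sketched here under the assumption that this threshold is counted over a sliding window, is a sliding-window log; the class name is hypothetical, and the caller passes in the current time so the logic stays easy to test.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sliding-window limiter: at most maxRequests within any windowMillis span. */
final class SlidingWindowLimiter {
    private final int maxRequests;
    private final long windowMillis;
    private final Deque<Long> timestamps = new ArrayDeque<>();

    SlidingWindowLimiter(int maxRequests, long windowMillis) {
        if (maxRequests <= 0 || windowMillis <= 0) {
            throw new IllegalArgumentException("maxRequests and windowMillis must be positive");
        }
        this.maxRequests = maxRequests;
        this.windowMillis = windowMillis;
    }

    /** Returns true and records the request if it fits in the window ending at nowMillis. */
    synchronized boolean tryAcquire(long nowMillis) {
        // Drop requests that have aged out of the window.
        while (!timestamps.isEmpty() && nowMillis - timestamps.peekFirst() >= windowMillis) {
            timestamps.pollFirst();
        }
        if (timestamps.size() >= maxRequests) {
            return false;
        }
        timestamps.addLast(nowMillis);
        return true;
    }

    public static void main(String[] args) {
        // 200 requests per 10 minutes, the threshold quoted in the pitfalls list.
        SlidingWindowLimiter limiter = new SlidingWindowLimiter(200, 10 * 60 * 1000L);
        System.out.println(limiter.tryAcquire(System.currentTimeMillis()));
    }
}
```

When `tryAcquire` returns false, the caller could sleep, switch to another proxy IP, or requeue the offer ID; which of these fits best depends on how the proxy pool is billed.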
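7. Conclusion
From the detail-page JSON-LD and the real-time JSONP API, through asynchronous concurrency and MyBatis-Plus batch inserts, to Docker scheduling and Feishu group notifications, this completes one full Java pipeline. The code can be opened directly in IDEA; change a single offerId and it can scrape any 1688 detail page. Happy crawling to everyone in operations, product, and algorithms, and even happier sales!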