
Node.js/Python in Practice: Building a Taobao Product Data Collector

news 2025/10/11 5:50:25

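Collecting Taobao product data is a common requirement in e-commerce analytics and market research. This article builds a simple Taobao product data collector from scratch in two mainstream languages, Node.js and Python, to illustrate the core steps of sending network requests and parsing the responses. The collector searches for products by keyword, extracts the key fields (title, price, sales volume, shop name, and so on), and stores the results. Compliance considerations are highlighted along the way.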

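I. Preparation Before Collecting

Before writing any code, complete the following preparation to avoid failures caused by environment or configuration problems:

1. Core Technical Principles

Taobao product listing pages load their data in one of two ways:

Static HTML rendering: some pages embed the product data directly in the HTML source, so it can be obtained by parsing the HTML;

Dynamic AJAX loading: most pages (for example, those with scroll loading or pagination) fetch JSON data through JavaScript requests, so you need to analyze the interface parameters and replay the request.

This article takes the dynamically loaded listing as its example, since it is closer to real-world usage: capture the product list request with the browser's developer tools, extract its parameters, and then send a simulated request.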
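As a rough illustration of "extract the parameters, then replay the request", the sketch below assembles a search URL from a parameter dictionary instead of hand-concatenating a string. The endpoint and parameter names are taken from the search URL used later in this article; trimming the list down to four parameters is an assumption, and the server may expect more of them. `build_search_url` and `SEARCH_ENDPOINT` are hypothetical names introduced here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Endpoint as it appears in the search URL used later in this article.
SEARCH_ENDPOINT = "https://s.taobao.com/search"


def build_search_url(keyword: str, page: int = 1) -> str:
    """Assemble the search URL from a dict of query parameters.

    urlencode percent-encodes the UTF-8 keyword, so no manual quoting is needed.
    The reduced parameter set is an assumption made for illustration.
    """
    params = {
        "q": keyword,
        "search_type": "item",
        "ie": "utf8",
        "page": page,
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    url = build_search_url("笔记本电脑", page=2)
    print(url)
    # Decode the query string again to confirm the round trip.
    print(parse_qs(urlsplit(url).query))
```

Keeping the parameters in a dictionary makes it easy to compare them one by one against what the developer tools show for the captured request.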

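2. Environment and Tools

Node.js environment (Node.js approach): install Node.js (v14+ recommended), together with axios (sending HTTP requests), cheerio (parsing HTML), and jsonfile (storing JSON data);

Python environment (Python approach): install Python (3.8+ recommended), together with requests (sending HTTP requests), lxml (parsing HTML/XML), and pandas (saving data to Excel, with openpyxl as the writer engine);

Browser developer tools: the F12 tools in Chrome/Firefox, used to capture the Taobao product list request and inspect its headers and parameters;

Proxy IPs (optional): frequent requests may get your IP restricted, so you can prepare a proxy pool to avoid being blocked (this article does not implement proxies and only outlines the idea).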
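The proxy idea mentioned above could look roughly like the following sketch, which cycles through a pool and produces the `proxies` dictionary that `requests.get` accepts. The addresses are hypothetical placeholders from a documentation-only IP range, and `proxy_rotation` is a name introduced here; a real pool would also need health checks and failure handling, which are omitted.

```python
from itertools import cycle
from typing import Dict, Iterator, List

# Hypothetical proxy addresses for illustration only (not working proxies).
PROXY_POOL: List[str] = [
    "http://192.0.2.10:8080",
    "http://192.0.2.11:8080",
]


def proxy_rotation(pool: List[str]) -> Iterator[Dict[str, str]]:
    """Yield requests-style proxies dicts, cycling through the pool in order."""
    for address in cycle(pool):
        yield {"http": address, "https": address}


if __name__ == "__main__":
    rotation = proxy_rotation(PROXY_POOL)
    for _ in range(3):
        proxies = next(rotation)
        # With requests installed, the dict would be passed like this:
        #   requests.get(url, headers=headers, proxies=proxies, timeout=10)
        print(proxies)
```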

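3. Compliance Notes

Taobao publishes an explicit robots.txt policy that prohibits unauthorized bulk data collection;

The code in this article is for technical learning only. Do not use it for commercial purposes or send high-frequency requests, so as not to violate the platform's rules;

If you need to collect e-commerce data legitimately, apply for API access instead.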
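One way to make the robots.txt point concrete is to test a URL against the rules before requesting it. The sketch below uses Python's standard `urllib.robotparser` on an inline sample file. The sample rules are hypothetical and are not Taobao's actual robots.txt; `is_allowed` is a helper name introduced here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration; NOT Taobao's actual file.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for candidate in (
        "https://example.com/search?q=laptop",
        "https://example.com/about",
    ):
        print(candidate, is_allowed(SAMPLE_ROBOTS_TXT, "my-collector", candidate))
```

In a real run the rules would be downloaded from the target site (for instance with `RobotFileParser.set_url` followed by `read`) rather than embedded in the script.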

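II. Node.js Implementation

Thanks to its asynchronous I/O model, Node.js handles network-request workloads efficiently. The complete collection flow is implemented below:

1. Project Initialization and Dependency Installation

Create the project folder, then run the following commands to install the dependencies: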

mkdir taobao-collector-nodejs
cd taobao-collector-nodejs
npm init -y
npm install axios cheerio jsonfile

2. Core Code

Create collector.js. The code has three parts: request configuration, data parsing, and data storage:

// Import dependencies
const axios = require('axios');
const cheerio = require('cheerio');
const jsonfile = require('jsonfile');
const path = require('path');

// 1. Collection settings
const config = {
  keyword: '笔记本电脑', // search keyword ("laptop"; customizable)
  page: 1, // page to collect (1-100, Taobao's default maximum)
  headers: {
    // Copy real request headers from the browser developer tools to mimic a browser
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36',
    'Referer': 'https://www.taobao.com/',
    'Cookie': 'paste the Cookie copied from a logged-in browser session here (used to get past some login checks)'
  },
  outputPath: path.join(__dirname, 'taobao_products_nodejs.json') // output file path
};

// 2. Send the request and fetch the product data
async function fetchTaobaoProducts() {
  try {
    // Taobao search URL (captured with the developer tools; parameters are filled in dynamically)
    const url = `https://s.taobao.com/search?q=${encodeURIComponent(config.keyword)}&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.jianhua.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306&page=${config.page}`;
    const response = await axios.get(url, { headers: config.headers });

    // 3. Parse the product data embedded in the returned HTML
    const $ = cheerio.load(response.data);
    const products = [];
    // Iterate over the product elements (class names come from the DOM shown in the developer tools)
    $('.item.J_MouserOnverReq').each((index, element) => {
      const link = $(element).find('.J_ClickStat');
      const href = link.attr('href');
      products.push({
        title: link.attr('title') || '', // product title
        price: $(element).find('.price.g_price.g_price-highlight strong').text() || '', // price
        sales: $(element).find('.deal-cnt').text() || '0', // sales volume (format: 100+)
        shopName: $(element).find('.shopname.J_MouserOnverReq a').attr('title') || '', // shop name
        location: $(element).find('.location').text() || '', // shop location
        // Prefix the scheme only when an href exists; otherwise the value would be "https:undefined"
        url: href ? `https:${href}` : '' // product detail page link
      });
    });

    // 4. Save the data to a JSON file (awaited so that write errors reach the catch block)
    await jsonfile.writeFile(config.outputPath, products, { spaces: 2 });
    console.log(`Node.js collection finished: ${products.length} products saved to ${config.outputPath}`);
    return products;
  } catch (error) {
    console.error('Node.js collection failed:', error.message);
    return [];
  }
}

// Run the collector
fetchTaobaoProducts();

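3. Code Notes and Running

Request headers: User-Agent mimics a browser, and the Cookie must be copied from a logged-in Taobao page (F12 → Application → Cookies); otherwise the login page may be returned;

Data parsing: cheerio (a jQuery-like library) extracts the product DOM elements. Confirm the class names with the developer tools, because Taobao may change them at any time and the selectors then need to be updated;

Running: execute node collector.js in a terminal. When collection finishes, taobao_products_nodejs.json, containing the product list data, is generated in the project root directory.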
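The Cookie value copied from the developer tools is one long `name=value; name2=value2` string. A small helper like the hypothetical `parse_cookie_header` below splits it into a dictionary, which makes it easier to check that the expected entries are present, and which could also be passed to the `cookies=` argument of `requests.get` in the Python version. The cookie names in the demo are made up.

```python
from typing import Dict


def parse_cookie_header(raw: str) -> Dict[str, str]:
    """Split a browser 'Cookie' header string into a name -> value dict.

    Only the first '=' separates name from value, so values that themselves
    contain '=' are preserved. Fragments without '=' are ignored.
    """
    cookies: Dict[str, str] = {}
    for pair in raw.split(";"):
        name, sep, value = pair.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


if __name__ == "__main__":
    # Made-up cookie names and values, for illustration only.
    sample = "session_id=abc123; token=x=y; theme=dark"
    print(parse_cookie_header(sample))
```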

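III. Python Implementation

With its rich ecosystem of data analysis libraries, Python has the edge in data processing and storage. The Python version of the collector follows:

1. Environment Setup and Dependency Installation

Create the project folder, then run the following commands to install the dependencies: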

mkdir taobao-collector-python
cd taobao-collector-python
pip install requests lxml pandas openpyxl

2. Core Code

Create collector.py. The logic matches the Node.js version, but the data is stored in Excel format, which is more convenient for later analysis:

import os
from urllib.parse import quote

import pandas as pd
import requests
from lxml import etree

# 1. Collection settings
config = {
    "keyword": "笔记本电脑",  # search keyword ("laptop")
    "page": 1,  # page to collect
    "headers": {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36",
        "Referer": "https://www.taobao.com/",
        "Cookie": "paste the Cookie copied from a logged-in browser session here",  # same as the Node.js config
    },
    "outputPath": os.path.join(os.getcwd(), "taobao_products_python.xlsx"),  # output Excel path
}


# 2. Send the request and fetch the product data
def fetch_taobao_products():
    try:
        # Build the URL (quote percent-encodes the Chinese keyword)
        url = (
            "https://s.taobao.com/search"
            f"?q={quote(config['keyword'])}&imgfile=&commend=all&ssid=s5-e&search_type=item"
            "&sourceId=tb.index&spm=a21bo.jianhua.201856-taobao-item.1&ie=utf8"
            f"&initiative_id=tbindexz_20170306&page={config['page']}"
        )
        # Send the GET request without following redirects (avoids landing on the login page)
        response = requests.get(url, headers=config["headers"], allow_redirects=False, timeout=10)
        # With redirects disabled, a 3xx answer has no product HTML, so stop early
        if response.status_code != 200:
            raise RuntimeError(f"unexpected HTTP status {response.status_code}; the Cookie may be missing or expired")
        response.encoding = "utf-8"  # set the encoding to avoid garbled Chinese text

        # 3. Parse the HTML (extract the data with lxml's XPath support)
        html = etree.HTML(response.text)
        products = []
        # Product list (XPath copied from the developer tools; @class is matched as an exact
        # string, trailing spaces included, so re-check it against the live page)
        product_elements = html.xpath('//div[@class="item J_MouserOnverReq  "]')
        for elem in product_elements:
            # XPath: text() extracts text, @name extracts an attribute
            title = elem.xpath('.//a[@class="J_ClickStat"]/@title')
            price = elem.xpath('.//strong[@class="J_price"]/text()')
            sales = elem.xpath('.//div[@class="deal-cnt"]/text()')
            shop_name = elem.xpath('.//a[@class="shopname J_MouserOnverReq"]/@title')
            location = elem.xpath('.//div[@class="location"]/text()')
            product_url = elem.xpath('.//a[@class="J_ClickStat"]/@href')
            # Handle empty results (avoids list index errors)
            products.append({
                "Title": title[0] if title else "",
                "Price (CNY)": price[0] if price else "",
                "Sales": sales[0] if sales else "0",
                "Shop": shop_name[0] if shop_name else "",
                "Location": location[0] if location else "",
                "URL": f"https:{product_url[0]}" if product_url else "",
            })

        # 4. Save the data to Excel (pandas needs the openpyxl engine installed)
        df = pd.DataFrame(products)
        df.to_excel(config["outputPath"], index=False, engine="openpyxl")
        print(f"Python collection finished: {len(products)} products saved to {config['outputPath']}")
        return products
    except Exception as e:
        print(f"Python collection failed: {e}")
        return []


# Run the collector
if __name__ == "__main__":
    fetch_taobao_products()
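The collector stores the sales field as text such as "100+", which is awkward for sorting or aggregation in later analysis. The sketch below converts such strings to integers. Support for the "万" (ten thousand) suffix and for trailing text after the number is an assumption about formats the page may use, and `parse_sales` is a hypothetical helper, not part of the code above.

```python
import re
from typing import Optional

# A number with an optional decimal part, optionally followed by "万" (x10,000).
_SALES_PATTERN = re.compile(r"(\d+(?:\.\d+)?)(万)?")


def parse_sales(text: str) -> Optional[int]:
    """Convert a sales string like '100+' into a lower-bound integer.

    Returns None when the text contains no number at all.
    """
    match = _SALES_PATTERN.search(text)
    if match is None:
        return None
    number = float(match.group(1))
    if match.group(2):
        number *= 10_000
    # round() guards against float artifacts such as 22999.999999999996
    return int(round(number))


if __name__ == "__main__":
    for raw in ("100+", "1.5万+人付款", ""):
        print(repr(raw), parse_sales(raw))
```

Because "100+" means "at least 100", the returned value is a lower bound rather than an exact count.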

3. Code Notes and Running