
Building a 1688 Shop Product Dataset: Data Collection and Formatting with a Python Crawler

news 2025/11/5 11:17:00

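1. Project Overview and Technology Choices

Our goal: given the URL of a 1688 shop's home page, produce a database or file (such as CSV or JSON) containing structured information on all of the shop's products.

This goal breaks down into three core steps:

Data collection: simulate browser requests to fetch the HTML source of the shop's product list pages and detail pages.
Data parsing: extract the product information we need from the HTML (title, price, sales volume, SKU, and so on).
Data formatting and storage: clean and normalize the extracted data, then write it to persistent storage.

Technology stack:

Programming language: Python 3.8+. Its rich library ecosystem makes it a leading choice for data collection.
HTTP library: requests. Simple to use and sufficient for static pages. For dynamically rendered pages, selenium serves as a fallback.
HTML parsing library: parsel (or lxml). Its syntax is similar to Scrapy's, and it is powerful and fast.
Data storage: pandas plus CSV/JSON files, which are convenient for later analysis and processing.
Anti-crawling countermeasures: random User-Agent, proxy IPs (recommended in production), and intervals between requests.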
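
Two of the countermeasures listed above, a random User-Agent and a pause between requests, can be sketched in a few lines. This is a minimal illustration rather than the article's implementation: `USER_AGENTS`, `build_headers`, and `polite_delay` are hypothetical names, and the User-Agent strings are ordinary examples that would need refreshing in practice.

```python
import random
import time

# Hypothetical pool of browser User-Agent strings (illustrative only).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/14.1.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0",
]


def build_headers():
    """Return request headers with a randomly chosen User-Agent."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    }


def polite_delay(min_s=1.0, max_s=3.0):
    """Sleep for a random interval and return the number of seconds slept."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay
```

With requests, the headers could be supplied per call, for example `session.get(url, headers=build_headers(), timeout=10)`, followed by `polite_delay()` before the next request.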

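2. Hands-On Code: Step-by-Step Implementation

Step 1: Analyze the 1688 page structure

First, open a target shop (for example, https://shop.abc.1688.com) and look at its product list page. Inspecting it with the browser's developer tools (F12) reveals two key points:

The product list is usually rendered through asynchronous loading (AJAX), so requesting the shop's home page directly may not return the complete list.
A more effective approach is to find the shop's dedicated product list endpoint. Among the XHR requests you can usually find a page of the form https://shop.abc.1688.com/page/offerlist.html or a similar JSON data endpoint.

This article parses the list page HTML, because that approach is more general. Keep in mind that 1688's page structure may change frequently.

Step 2: Build the request and fetch the first page

We need to simulate browser behavior by setting request headers, of which User-Agent is essential.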

import requests
from parsel import Selector
import pandas as pd
import time
import random


class Ali1688Spider:
    def __init__(self):
        self.session = requests.Session()
        # Common request headers that mimic a browser
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
        }
        self.session.headers.update(self.headers)

        # Proxy configuration (replace with your own provider's settings)
        self.proxyHost = "www.16yun.cn"
        self.proxyPort = "5445"
        self.proxyUser = "16QMSOML"
        self.proxyPass = "280651"
        # Proxy URL with embedded credentials
        self.proxyMeta = f"http://{self.proxyUser}:{self.proxyPass}@{self.proxyHost}:{self.proxyPort}"
        self.proxies = {
            "http": self.proxyMeta,
            "https": self.proxyMeta,
        }

    def get_page(self, url, max_retries=3, use_proxy=True):
        """Fetch a page's HTML, with simple retries and optional proxy support."""
        for i in range(max_retries):
            try:
                # Use the proxy or not, depending on the argument
                if use_proxy:
                    resp = self.session.get(url, timeout=10, proxies=self.proxies)
                else:
                    resp = self.session.get(url, timeout=10)
                resp.raise_for_status()  # raises for 4xx/5xx status codes

                # Check for anti-crawling notices (adjust to the actual pages).
                # The literals mean "access restricted" and "CAPTCHA"; they must
                # stay in Chinese to match the page text.
                if "访问受限" in resp.text or "验证码" in resp.text:
                    print(f"Request {i+1} may have been blocked, retrying...")
                    time.sleep(2)
                    continue

                # Report that the proxied request worked
                if use_proxy and resp.status_code == 200:
                    print(f"Request {i+1} succeeded (via proxy)")
                return resp.text

            except requests.exceptions.ProxyError as e:
                print(f"Proxy connection failed: {e}, retry {i+1}...")
                # On the final attempt, fall back to a direct connection
                if i == max_retries - 1:
                    print("Trying without the proxy...")
                    try:
                        resp = self.session.get(url, timeout=10)
                        resp.raise_for_status()
                        return resp.text
                    except requests.exceptions.RequestException:
                        pass
                time.sleep(2)
            except requests.exceptions.ConnectTimeout as e:
                print(f"Connection timed out: {e}, retry {i+1}...")
                time.sleep(2)
            except requests.exceptions.RequestException as e:
                print(f"Request failed: {e}, retry {i+1}...")
                time.sleep(2)
        return None

    def test_proxy_connection(self):
        """Check whether the proxy connection works."""
        test_url = "http://httpbin.org/ip"
        try:
            resp = self.session.get(test_url, timeout=10, proxies=self.proxies)
            if resp.status_code == 200:
                print("Proxy connection test succeeded")
                print(f"Current proxy IP: {resp.json()['origin']}")
                return True
            else:
                print("Proxy connection test failed")
                return False
        except Exception as e:
            print(f"Proxy test error: {e}")
            return False


# Initialize the crawler
spider = Ali1688Spider()

# Test the proxy connection
print("Testing proxy connection...")
spider.test_proxy_connection()

# Example usage
if __name__ == "__main__":
    # Try fetching a single page
    test_url = "https://shop.abc.1688.com/page/offerlist_1.htm"
    html_content = spider.get_page(test_url, use_proxy=True)
    if html_content:
        print("Page fetched successfully!")
        # Parsing logic can be added here
    else:
        print("Failed to fetch the page!")
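
The get_page method above waits a fixed two seconds before each retry. One common alternative, shown here only as a sketch, is exponential backoff with jitter, where the maximum wait doubles on each attempt. `backoff_delays` is a hypothetical helper, and the `base` and `cap` defaults are assumptions rather than values from the article.

```python
import random


def backoff_delays(max_retries=3, base=2.0, cap=30.0):
    """Yield one wait time (in seconds) per retry attempt.

    The upper bound doubles each attempt (base, 2*base, 4*base, ...),
    is capped at `cap`, and the actual wait is drawn uniformly from
    [0, bound] so that concurrent clients do not retry in lockstep.
    """
    for attempt in range(max_retries):
        bound = min(cap, base * (2 ** attempt))
        yield random.uniform(0, bound)


if __name__ == "__main__":
    for attempt, delay in enumerate(backoff_delays(), start=1):
        print(f"attempt {attempt}: would wait {delay:.2f}s before retrying")
```

Inside a retry loop, each `time.sleep(2)` could be replaced by sleeping for the next value from this generator.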

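Step 3: Parse the product list page and collect product links

Suppose we have found the URL pattern of a shop's product list pages, for example: https://shop.abc.1688.com/page/offerlist_[PAGE_NUM].htm.

Our first task is to extract the detail page link of every product on a list page.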

    # Method of Ali1688Spider: add it inside the class body.
    def parse_product_links(self, html_content):
        """Extract all product detail page links from a list page's HTML."""
        if not html_content:
            return []
        selector = Selector(text=html_content)
        product_links = []

        # Locate the product link elements with XPath
        # !!! This XPath is an example; adjust it to the target shop's actual HTML structure !!!
        link_elements = selector.xpath('//div[@class="offer-list-row"]//a[contains(@class, "offer-title")]/@href').getall()
        for link in link_elements:
            # Make sure the link is a complete URL
            if link.startswith('//'):
                full_link = 'https:' + link
            elif link.startswith('/'):
                # Assumes the shop's main domain is known; replace with the real one
                full_link = 'https://shop.abc.1688.com' + link
            else:
                full_link = link
            product_links.append(full_link)
        return list(set(product_links))  # remove duplicates


# Example: collect the product links on the first page
list_page_url = "https://shop.abc.1688.com/page/offerlist_1.htm"
html = spider.get_page(list_page_url)
product_links = spider.parse_product_links(html)
print(f"Found {len(product_links)} product links on the first page")
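
The link handling above distinguishes protocol-relative, root-relative, and absolute URLs by hand, and `list(set(...))` discards the original order. As an alternative sketch, the standard library's `urljoin` resolves all three cases against the list page URL, and `dict.fromkeys` removes duplicates while keeping first-seen order. `normalize_links` is a hypothetical helper, and the sample links are made up for illustration.

```python
from urllib.parse import urljoin


def normalize_links(links, base_url):
    """Resolve links against base_url and drop duplicates, keeping order."""
    absolute = (urljoin(base_url, link.strip()) for link in links)
    return list(dict.fromkeys(absolute))


if __name__ == "__main__":
    # Made-up sample links covering the three cases handled above.
    sample = [
        "//detail.1688.com/offer/123.html",        # protocol-relative
        "/page/offerlist_2.htm",                   # root-relative
        "https://detail.1688.com/offer/123.html",  # absolute (duplicate of the first)
    ]
    base = "https://shop.abc.1688.com/page/offerlist_1.htm"
    for url in normalize_links(sample, base):
        print(url)
```

Passing the list page URL as `base_url` also removes the need to hard-code the shop domain.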

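Step 4: Visit each product detail page and extract the data

This is the most important step. We visit each product link and extract the fields we care about.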

    # Method of Ali1688Spider: add it inside the class body.
    def parse_product_detail(self, html_content, product_url):
        """Parse a single product detail page and extract the product's fields."""
        if not html_content:
            return None
        selector = Selector(text=html_content)
        product_info = {}

        # 1. Product title
        # !!! All XPath expressions below are examples and must be adjusted to the actual page structure !!!
        title = selector.xpath('//h1[@class="d-title"]/text()').get()
        product_info['title'] = title.strip() if title else None

        # 2. Price: 1688 prices are often a range, so the parts are joined
        price_elements = selector.xpath('//span[contains(@class, "price-num")]/text()').getall()
        prices = [p.strip() for p in price_elements if p.strip()]
        product_info['price_range'] = '-'.join(prices) if prices else None

        # 3. Monthly sales (the label "月销量" must stay in Chinese to match the page text)
        sales = selector.xpath('//span[contains(text(), "月销量")]/following-sibling::span/text()').get()
        product_info['monthly_sales'] = sales.strip() if sales else '0'

        # 4. Stock (the label "库存" must stay in Chinese to match the page text)
        stock = selector.xpath('//span[contains(text(), "库存")]/following-sibling::span/text()').get()
        product_info['stock'] = stock.strip() if stock else None

        # 5. Company name
        company = selector.xpath('//a[contains(@class, "company-name")]/text()').get()
        product_info['company'] = company.strip() if company else None

        # 6. Product image links
        image_urls = selector.xpath('//div[contains(@class, "image-view")]//img/@src').getall()
        # Make sure each image link has an explicit scheme
        processed_image_urls = []
        for img_url in image_urls:
            if img_url.startswith('//'):
                processed_image_urls.append('https:' + img_url)
            else:
                processed_image_urls.append(img_url)
        product_info['image_urls'] = ' | '.join(processed_image_urls)  # image URLs separated by a vertical bar

        # 7. Product URL
        product_info['product_url'] = product_url

        # 8. Crawl timestamp
        product_info['crawl_time'] = pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S')
        return product_info


# Example: parse the first product
if product_links:
    first_product_url = product_links[0]
    print(f"Collecting: {first_product_url}")
    detail_html = spider.get_page(first_product_url)
    product_data = spider.parse_product_detail(detail_html, first_product_url)
    print(product_data)
    # Polite delay to avoid sending requests too quickly
    time.sleep(random.uniform(1, 3))

Step 5: Loop over pages and store the data

To collect every product in the shop, we need a loop that handles pagination.

    # Methods of Ali1688Spider: add them inside the class body.
    def crawl_entire_shop(self, shop_list_url_pattern, start_page=1, max_pages=10):
        """Crawl the products on multiple list pages of a shop."""
        all_products_data = []
        current_page = start_page
        while current_page <= max_pages:
            # Build the list page URL
            list_page_url = shop_list_url_pattern.format(page=current_page)
            print(f"Crawling page {current_page}: {list_page_url}")
            html_content = self.get_page(list_page_url)
            if not html_content:
                print(f"Failed to fetch page {current_page}, stopping.")
                break
            product_links = self.parse_product_links(html_content)
            if not product_links:
                print(f"No product links found on page {current_page}; this may be the last page.")
                break

            # Visit every product link on the current page
            for link in product_links:
                print(f"  Processing product: {link}")
                detail_html = self.get_page(link)
                product_info = self.parse_product_detail(detail_html, link)
                if product_info:
                    all_products_data.append(product_info)
                # Important: random delay between product requests to crawl politely
                time.sleep(random.uniform(1, 2))

            current_page += 1
            # Slightly longer delay after each list page
            time.sleep(random.uniform(2, 4))
        return all_products_data

    def save_to_csv(self, data, filename='1688_shop_products.csv'):
        """Save the data to a CSV file."""
        if not data:
            print("No data to save.")
            return
        df = pd.DataFrame(data)
        # utf_8_sig lets Excel open the file with Chinese text displayed correctly
        df.to_csv(filename, index=False, encoding='utf_8_sig')
        print(f"Saved {len(data)} product records to {filename}.")


# Main flow
if __name__ == '__main__':
    spider = Ali1688Spider()
    # Assumes the list page URL pattern has been worked out; {page} is the page number placeholder.
    # Replace this pattern with the real URL pattern of the target shop.
    shop_url_pattern = "https://shop.abc.1688.com/page/offerlist_{page}.htm"
    # Start crawling, here at most 5 pages
    all_products = spider.crawl_entire_shop(shop_url_pattern, start_page=1, max_pages=5)
    # Save the data
    spider.save_to_csv(all_products)
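
crawl_entire_shop above keeps every record in memory and writes the CSV only after the last page, so an interruption midway loses everything collected so far. One possible safeguard, sketched here under the assumption that a JSON Lines side file is acceptable, is to append each record to disk as soon as it is parsed. `append_jsonl` and `load_jsonl` are hypothetical helpers, not part of the original crawler.

```python
import json


def append_jsonl(record, path):
    """Append one product record to a JSON Lines file (one JSON object per line)."""
    with open(path, "a", encoding="utf-8") as f:
        # ensure_ascii=False keeps Chinese text readable in the file
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_jsonl(path):
    """Read all records back from a JSON Lines file, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

In the crawl loop, `append_jsonl(product_info, "1688_shop_products.jsonl")` could be called right after each successful parse; pandas can later load the file with `pd.read_json(path, lines=True)`.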

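3. Data Formatting and Further Processing

The all_products result is a list of dictionaries, which a pandas DataFrame handles very conveniently.

Data cleaning: use df.dropna() to handle missing values, or use regular expressions to clean the price and sales fields (for example, stripping characters such as "件" (pieces) and "元" (yuan) and keeping only the digits).
Data normalization: split the price into a minimum and a maximum price, and convert the sales string to an integer.
Data export: besides CSV, the data can be exported as JSON (df.to_json('data.json', orient='records', force_ascii=False)) or written directly to a database (such as SQLite or MySQL).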
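
The cleaning and normalization steps above can be sketched with two small functions. This is an illustration under stated assumptions: `split_price_range` and `sales_to_int` are hypothetical names, the input formats (such as "1.50-2.00" or "1200件") are guesses at what the scraped strings look like, and the handling of "万" (ten thousand) is an added assumption about how large sales counts may be abbreviated.

```python
import re

# Matches an integer or decimal number such as "12" or "1.50".
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def split_price_range(price_range):
    """Return (min_price, max_price) as floats, or (None, None) if no number is found."""
    if not isinstance(price_range, str):
        return (None, None)
    numbers = [float(n) for n in _NUMBER.findall(price_range)]
    if not numbers:
        return (None, None)
    return (min(numbers), max(numbers))


def sales_to_int(sales):
    """Convert a sales string such as '1200件' to an int; return 0 if no number is found."""
    if not isinstance(sales, str):
        return 0
    match = _NUMBER.search(sales.replace(",", ""))
    if not match:
        return 0
    value = float(match.group())
    # Assumption: a trailing "万" means the number is in units of ten thousand.
    if "万" in sales:
        value *= 10000
    return int(round(value))
```

On a DataFrame built from all_products, these could be applied per column, for example `df['monthly_sales'] = df['monthly_sales'].map(sales_to_int)`.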

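4. Caveats and Ethical Guidelines

Respect robots.txt: before crawling, always check the target site's robots.txt (for example, https://1688.com/robots.txt) and honor the site's crawler policy.
Limit request frequency: the time.sleep calls in this article's code are essential. Excessively frequent requests put load on the target site's servers and may also get your IP blocked.
Legal risk: crawling public data is usually not a problem, but using the data commercially, especially after large-scale crawling, may carry legal risk. Make sure your actions comply with the relevant laws and regulations and with the site's terms of service.
Dynamic content and anti-crawling: if the target shop's product list is loaded dynamically through JavaScript, requests cannot retrieve it directly. In that case, upgrade the stack to a browser automation tool such as Selenium or Playwright, or, better still, find and call the underlying JSON API directly.
Code robustness: in production, add more thorough error handling, logging, a proxy IP pool, and similar mechanisms to keep the crawler stable over long runs.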
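
The robots.txt check mentioned above can be automated with the standard library's `urllib.robotparser`. The sketch below parses an in-memory rule list so that it runs offline; `make_robots_checker` is a hypothetical helper, and `SAMPLE_RULES` is made up for illustration and is not 1688's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser


def make_robots_checker(robots_lines):
    """Build a RobotFileParser from the lines of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser


# Made-up rules for illustration; these are NOT 1688's actual robots.txt.
SAMPLE_RULES = [
    "User-agent: *",
    "Disallow: /private/",
]

if __name__ == "__main__":
    checker = make_robots_checker(SAMPLE_RULES)
    for url in (
        "https://example.com/page/offerlist_1.htm",
        "https://example.com/private/data.html",
    ):
        verdict = "allowed" if checker.can_fetch("*", url) else "disallowed"
        print(f"{url}: {verdict}")
```

For a live site, `parser.set_url(...)` followed by `parser.read()` downloads the real file instead; the crawler could then skip any URL for which `can_fetch` returns False.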

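Conclusion

In this walkthrough we built a Python crawler that automatically collects, parses, and formats product data from a 1688 shop. The process covers core techniques such as HTTP requests and HTML parsing, as well as data cleaning, storage, and anti-crawling strategy. With this stack you can build your own data foundation for market analysis, price monitoring, or product selection, and gain an information advantage in a competitive market.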
