A Detailed Walkthrough of a Car Model Sales Data Crawler

news 2025/8/12 10:01:37

import requests
from lxml import etree
import csv

requests: sends HTTP requests and retrieves the page content

lxml.etree: parses the HTML document so the required data can be extracted

csv: reads and writes CSV files

def get_html(url, timeout=30):
    """Fetch a page's HTML, sending browser-like request headers."""
    try:
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        }
        r = requests.get(url, timeout=timeout, headers=headers)
        r.raise_for_status()  # raise an exception on 4xx/5xx responses
        r.encoding = r.apparent_encoding  # use the encoding detected from the body
        return r.text
    except Exception as error:
        print(f"Failed to fetch page: {error}")
        return None

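This function connects to the target site and returns the page content:

The headers argument imitates a browser's request headers, which lowers the chance of the request being flagged as a crawler
requests.get() sends the GET request that fetches the page
r.raise_for_status() raises an exception when the HTTP request fails (status code 4xx/5xx)
r.apparent_encoding detects the encoding from the response body, which fixes garbled Chinese text

The try/except block catches any failure, prints the error, and returns None, so a failed request does not crash the program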
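To see why the encoding step matters, here is a minimal sketch that uses only the standard library. It assumes a page served as GBK bytes; the sample string and variable names are illustrative and not taken from the target site. When a text response carries no charset in its Content-Type header, requests falls back to ISO-8859-1, which is one common way Chinese pages end up garbled.

```python
# Illustrative sample: the bytes a GBK-encoded page would send for this text.
raw = "车型销量".encode("gbk")

# Decoding with the wrong codec does not fail, it silently produces mojibake.
wrong = raw.decode("iso-8859-1")

# Decoding with the codec the page actually uses recovers the text.
right = raw.decode("gbk")

print(repr(wrong))
print(right)
```

Assigning r.apparent_encoding to r.encoding plays the role of choosing the second codec: the encoding is guessed from the bytes themselves rather than trusted from the header.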

def parse(html):
    if not html:
        return []
    doc = etree.HTML(html)
    out_list = []
    table = doc.xpath('//*[@id="itable"]/table')  # locate the table
    if not table:
        print("Error: no table found under the element with id 'itable'")
        return []
    rows = table[0].xpath('.//tr')  # all rows, searched relative to the table
    if not rows:
        print("Error: the table contains no rows")
        return []
    for index, row in enumerate(rows[1:], 1):  # start at the second row (skip the header)
        try:
            tds = row.xpath('./td')  # all cells in the current row
            if len(tds) < 5:
                print(f"Row {index} has too few columns, skipping")
                continue
            # Extract the text of each cell
            car_model = ''.join(tds[1].xpath('.//text()')).strip()
            info = ''.join(tds[2].xpath('.//text()')).strip()
            monthly_sales = ''.join(tds[3].xpath('.//text()')).strip()
            annual_total = ''.join(tds[4].xpath('.//text()')).strip()
            if not all([car_model, info, monthly_sales, annual_total]):
                print(f"Row {index} contains an empty value, skipping")
                continue
            out_list.append([car_model, info, monthly_sales, annual_total])
            print(f"Extracted record {index}: {car_model}")
        except Exception as e:
            print(f"Failed to parse row {index}: {str(e)}, skipping")
            continue
    return out_list

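The parse function is the core of the data extraction:

etree.HTML() turns the HTML text into a tree that can be queried
XPath expressions locate the table and its rows (//*[@id="itable"]/table selects the table element directly inside the element whose id is itable)
Processing starts at the second row (rows[1:]), because the first row is normally the header
The text of each cell is extracted (tds[1].xpath('.//text()')) and cleaned (strip() removes surrounding whitespace)
Each record is validated, and rows with too few columns or with empty values are dropped

Because each row is parsed inside its own try/except, a failure on one row does not stop the rest of the run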
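The steps above can be tried offline on a small hand-written table. The sketch below is an assumption-laden illustration: SAMPLE_HTML and extract_rows are made up for this example and only mimic the column layout that parse() expects (rank, model, info, monthly sales, annual total); the real page may differ.

```python
from lxml import etree

# Hypothetical markup: one good row, one with an empty cell, one that is too short.
SAMPLE_HTML = """
<div id="itable">
  <table>
    <tr><th>Rank</th><th>Model</th><th>Info</th><th>Monthly</th><th>Annual</th></tr>
    <tr><td>1</td><td> Model A </td><td><a href="#">details</a></td><td>1200</td><td>9800</td></tr>
    <tr><td>2</td><td>Model B</td><td>details</td><td></td><td>7600</td></tr>
    <tr><td>3</td><td>Model C</td></tr>
  </table>
</div>
"""

def extract_rows(html):
    """Apply the same locate / skip header / clean / validate steps as parse()."""
    doc = etree.HTML(html)
    rows = doc.xpath('//*[@id="itable"]/table//tr')
    records = []
    for row in rows[1:]:  # skip the header row
        tds = row.xpath('./td')
        if len(tds) < 5:  # too few columns
            continue
        # Join nested text nodes (e.g. text inside <a>) and trim whitespace
        values = [''.join(td.xpath('.//text()')).strip() for td in tds[1:5]]
        if not all(values):  # at least one empty cell
            continue
        records.append(values)
    return records

records = extract_rows(SAMPLE_HTML)
print(records)
```

Only the first data row survives: the second has an empty monthly-sales cell and the third has fewer than five cells, so both are filtered out by the validation checks.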

def save_csv(items, path):
    """Save the extracted rows to a CSV file."""
    if not items:
        print("No valid data to save")
        return
    with open(path, 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.writer(f)
        # Header: car model, info, monthly sales, year-to-date total
        writer.writerow(['车型', '资料信息', '月销售量', '年累计'])
        writer.writerows(items)
    print(f"Saved {len(items)} records to {path}")

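This function saves the extracted data as a CSV file:

with open(...) guarantees the file is closed properly, even if an error occurs
encoding='utf-8-sig' writes a UTF-8 byte order mark at the start of the file, which lets Excel recognize the encoding and display Chinese text correctly
newline='' prevents extra blank lines from appearing between rows in the CSV file

The header row is written first (writer.writerow(...)), followed by all the data rows (writer.writerows(...))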
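The effect of these two open() arguments can be checked by looking at the bytes that land on disk. This is a small sketch using only the standard library; write_rows, the temporary file path, and the sample row are hypothetical and exist only for this illustration.

```python
import csv
import os
import tempfile

def write_rows(path, header, rows):
    """Write a header and data rows the same way save_csv() opens its file."""
    with open(path, 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)

# Write a tiny file into a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'demo.csv')
write_rows(path, ['车型', '月销售量'], [['Model A', '1200']])

# Inspect the raw bytes: utf-8-sig puts the BOM (EF BB BF) at the very start.
with open(path, 'rb') as f:
    raw = f.read()
has_bom = raw.startswith(b'\xef\xbb\xbf')

# Reading back with utf-8-sig strips the BOM again.
with open(path, newline='', encoding='utf-8-sig') as f:
    rows_back = list(csv.reader(f))

print(has_bom, rows_back)
```

With newline='' the csv module's own '\r\n' line endings are written unchanged; without it, text-mode newline translation on Windows would turn them into '\r\r\n', which spreadsheet programs show as a blank line after every row.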

if __name__ == "__main__":
    url = "https://www.icauto.com.cn/rank/"
    html = get_html(url)
    if html:
        data = parse(html)
        save_csv(data, '车型销售数据.csv')
    else:
        print("Aborting: could not fetch the page content")

The main program defines the crawler's overall execution flow: