Your First Python Financial Data Scraper
1. Introduction
In this guide I will walk you through building your first Python financial data scraper. Starting from scratch, it covers the development workflow, the key technical points, and best practices for scraping financial data.
2. Environment Setup
2.1 Create a Virtual Environment
# Create the project directory
mkdir python_spider
cd python_spider

# Create a virtual environment
python -m venv venv

# Activate the virtual environment (Windows)
venv\Scripts\activate

# Activate the virtual environment (macOS/Linux)
source venv/bin/activate
2.2 Install Required Dependencies
# Install the core dependencies
pip install requests pandas
3. Scraping Fundamentals
3.1 HTTP Request Basics
- GET requests: retrieve data
- POST requests: submit data
- Request headers (Headers): carry client information such as the User-Agent
- Status codes (Status Codes): 200 success, 404 not found, 500 server error (a minimal request example follows this list)
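The sketch below ties these basics together with the requests library: it sends a GET request with a browser-like User-Agent header, checks the status code, and parses the JSON body. The httpbin.org URL is only a placeholder test endpoint used for illustration.

import requests

# Placeholder test endpoint, used only to illustrate a GET request
url = "https://httpbin.org/get"
headers = {"User-Agent": "Mozilla/5.0"}  # present ourselves as a browser client

response = requests.get(url, headers=headers, timeout=10)
print(response.status_code)  # 200 on success; 404 / 500 indicate errors
print(response.json())       # parse the JSON body returned by the server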
3.2 Data Parsing Techniques
- JSON parsing: parse the structured data returned directly by the API
- Field mapping: convert raw field names into readable Chinese column names (see the short example after this list)
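A quick illustration of field mapping, using the f-code fields described in section 4.3; the sample payload is a hand-written fragment, not a live API response:

# Hand-written sample item in the API's raw field format
raw = {"f12": "688981", "f14": "中芯国际", "f2": 119.8}

# Map raw field codes to readable column names
field_map = {"f12": "股票代码", "f14": "股票名称", "f2": "最新价"}

record = {field_map[k]: v for k, v in raw.items() if k in field_map}
print(record)  # {'股票代码': '688981', '股票名称': '中芯国际', '最新价': 119.8}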
3.3 Data Storage
- CSV files: simple tabular storage
- JSON files: structured data storage
- Databases: persistent storage with SQLite/MySQL (a short SQLite sketch follows this list)
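A minimal sketch of the three storage options, using pandas plus the standard-library sqlite3 module; the file and table names here are illustrative, not fixed by the project:

import sqlite3
import pandas as pd

rows = [{"股票代码": "688981", "股票名称": "中芯国际", "最新价": 119.8}]
df = pd.DataFrame(rows)

# CSV: utf-8-sig keeps the Chinese headers readable in Excel
df.to_csv("stock_data.csv", index=False, encoding="utf-8-sig")

# JSON: one object per record
df.to_json("stock_data.json", orient="records", force_ascii=False)

# SQLite: persistent storage (illustrative database and table names)
with sqlite3.connect("stocks.db") as conn:
    df.to_sql("stocks", conn, if_exists="append", index=False)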
4. The First Finance Scraper: Fetching Stock Data
4.1 Target Analysis
We fetch basic stock information through Eastmoney's public API, including:
- Stock code and name
- Latest price, change percentage, and change amount
- Volume and turnover
- Amplitude and turnover rate
- Total market cap and float market cap
4.2 Technology Choices
- HTTP client: requests
- Parsing: JSON structure parsing
- Storage: pandas + CSV
4.3 Implementation
The full code lives in c01_hello.py; it is reproduced in Appendix B, with running instructions and sample output in Appendices A and C.
Example parameters for the Eastmoney clist/get endpoint:
- pn: page number, e.g. 1
- pz: page size, e.g. 20
- po: sort direction; the example uses 1 (the sample output in Appendix C is sorted by code in descending order)
- np: response format, 1 returns a JSON array
- ut: API identifier, e.g. bd1d9ddb04089700cf9c27f6f7426281
- fltt: float formatting, 2
- invt: list format, 2
- fid: sort field, e.g. f12
- fs: market and board filter, e.g. m:0+t:6,m:0+t:80,m:1+t:2,m:1+t:23
- fields: returned fields, e.g. f12,f14,f2,f3,f4,f5,f6,f7,f8,f20,f21
Common field mapping (a condensed request sketch follows the list):
- f12 → stock code
- f14 → stock name
- f2 → latest price
- f3 → change percentage (numeric percent value)
- f4 → change amount
- f5 → volume
- f6 → turnover
- f7 → amplitude (numeric percent value)
- f8 → turnover rate (numeric percent value)
- f20 → total market cap
- f21 → float market cap
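Putting the parameters and the field mapping together, a condensed version of the request in c01_hello.py looks roughly like this (error handling, delays, and logging are stripped out for brevity):

import requests

url = "https://push2.eastmoney.com/api/qt/clist/get"
params = {
    "pn": "1", "pz": "20", "po": "1", "np": "1",
    "ut": "bd1d9ddb04089700cf9c27f6f7426281",
    "fltt": "2", "invt": "2", "fid": "f12",
    "fs": "m:0+t:6,m:0+t:80,m:1+t:2,m:1+t:23",
    "fields": "f12,f14,f2,f3,f4,f5,f6,f7,f8,f20,f21",
}
headers = {"User-Agent": "Mozilla/5.0"}

payload = requests.get(url, params=params, headers=headers, timeout=10).json()
for item in payload["data"]["diff"]:
    print(item["f12"], item["f14"], item["f2"])  # code, name, latest price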
5. Scraper Development Workflow
5.1 Step Breakdown
- Target analysis: identify the endpoint and the fields you need
- Request sending: build the HTTP request and fetch the JSON
- Parsing: extract and map the core fields
- Cleaning: handle missing values and formatting
- Storage: save the results as a CSV file
- Error handling: catch network and parsing exceptions
- Performance tuning: throttle request frequency and add a retry strategy (see the retry sketch below)
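As one way to combine the last two steps, here is a simple retry wrapper around requests with an increasing delay between attempts; the retry count and delay values are arbitrary choices for this sketch, not values taken from c01_hello.py:

import time
import requests

def fetch_json(url, params=None, headers=None, retries=3, backoff=2.0):
    """Fetch JSON with basic error handling and an increasing retry delay."""
    for attempt in range(1, retries + 1):
        try:
            resp = requests.get(url, params=params, headers=headers, timeout=10)
            resp.raise_for_status()  # raise on 4xx / 5xx status codes
            return resp.json()
        except requests.exceptions.RequestException:
            if attempt == retries:
                raise  # give up after the last attempt
            time.sleep(backoff * attempt)  # wait longer after each failure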
5.2 Best Practices
- Respect robots.txt: follow the site's crawling rules (a robots.txt check is sketched below)
- Add reasonable delays: avoid putting load on the server
- Set a User-Agent: behave like a normal browser client
- Handle exceptions: keep the program robust
- Validate the data: ensure data quality
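Checking robots.txt can be done with the standard-library urllib.robotparser; this is a sketch only, and it assumes the host actually serves a robots.txt file (the user agent string and path below are illustrative):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://push2.eastmoney.com/robots.txt")
rp.read()  # download and parse the robots.txt rules

# can_fetch reports whether this path is allowed for the given user agent
allowed = rp.can_fetch("Mozilla/5.0", "https://push2.eastmoney.com/api/qt/clist/get")
print("allowed:", allowed)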
6. Common Problems and Solutions
6.1 Dealing with Anti-Scraping Measures
- IP blocking: use a pool of proxy IPs
- CAPTCHAs: use OCR or a third-party CAPTCHA-solving service
- JavaScript rendering: use Selenium or Playwright
- Rate limiting: add random delays (proxies and random delays are sketched below)
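For the IP and rate-limit points, requests accepts a proxies mapping per request, and a random delay is a one-liner; the proxy address below is a placeholder, not a working proxy:

import time
import random
import requests

proxies = {"https": "http://127.0.0.1:8888"}  # placeholder proxy address

time.sleep(random.uniform(1.0, 3.0))  # random delay between requests
resp = requests.get(
    "https://push2.eastmoney.com/api/qt/clist/get",
    proxies=proxies,
    timeout=10,
)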
6.2 Data Quality Assurance
- Field validation: check that records are complete
- Deduplication: avoid duplicate records
- Outlier handling: filter out implausible values
- Backups: back up important data regularly (a pandas cleanup sketch follows this list)
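Applied to the CSV produced by c01_hello.py, the first three points can be handled with pandas; the plausibility bound on the change percentage is an arbitrary example value:

import pandas as pd

df = pd.read_csv("stock_data.csv", dtype={"股票代码": str})

# Field validation: drop rows missing the code, name, or latest price
df = df.dropna(subset=["股票代码", "股票名称", "最新价"])

# Deduplication: keep one row per stock code
df = df.drop_duplicates(subset=["股票代码"])

# Outlier handling: discard daily changes outside an example ±21% bound
df = df[df["涨跌幅"].between(-21, 21)]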
7. Where to Go Next
7.1 Technical Directions
- Asynchronous scraping: use aiohttp for higher throughput (a minimal sketch follows this list)
- Distributed scraping: use Scrapy-Redis
- Browser automation: Selenium/Playwright
- API reverse engineering: analyze JavaScript-driven endpoints
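A minimal aiohttp sketch that fetches several pages of the same endpoint concurrently; the parameter set is trimmed for brevity, so a real run may need the full parameters from c01_hello.py:

import asyncio
import aiohttp

URL = "https://push2.eastmoney.com/api/qt/clist/get"
PARAMS = {"pn": "1", "pz": "20", "po": "1", "np": "1", "fltt": "2",
          "invt": "2", "fid": "f12", "fs": "m:0+t:6,m:0+t:80,m:1+t:2,m:1+t:23",
          "fields": "f12,f14,f2,f3"}

async def fetch_page(session: aiohttp.ClientSession, page: int) -> dict:
    # Each page is requested concurrently instead of one after another
    params = dict(PARAMS, pn=str(page))
    async with session.get(URL, params=params) as resp:
        return await resp.json(content_type=None)

async def main():
    timeout = aiohttp.ClientTimeout(total=10)
    headers = {"User-Agent": "Mozilla/5.0"}
    async with aiohttp.ClientSession(timeout=timeout, headers=headers) as session:
        pages = await asyncio.gather(*(fetch_page(session, p) for p in range(1, 4)))
        print(f"fetched {len(pages)} pages")

asyncio.run(main())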
7.2 Data Applications
- Data analysis: explore the data with pandas (see the sketch after this list)
- Data visualization: matplotlib/plotly
- Machine learning: build predictive models
- Real-time monitoring: build a data-monitoring system
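For example, a quick exploration of the scraped CSV with pandas and matplotlib; the column names match the file written by c01_hello.py, and rendering the Chinese labels may require configuring a CJK-capable font in matplotlib:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("stock_data.csv", dtype={"股票代码": str})

# Exploration: summary statistics of price, change percentage, and turnover rate
print(df[["最新价", "涨跌幅", "换手率"]].describe())

# Visualization: top 10 stocks by change percentage
top = df.nlargest(10, "涨跌幅")
top.plot.bar(x="股票名称", y="涨跌幅", legend=False)
plt.ylabel("涨跌幅 (%)")
plt.tight_layout()
plt.show()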
8. Summary
Building this first Python finance scraper covered the core workflow: environment setup, target analysis, implementation, and data storage. By working through the example you should now be able to handle:
- Sending basic HTTP requests
- Parsing JSON API responses
- Extracting and cleaning data
- Writing results to files
- Exception handling and designing robust programs
Remember that scraping is not only about the technology: compliance and data quality matter just as much. In real projects, always follow the relevant laws and regulations and the target site's terms of use.
9. References
- Requests official documentation
- Pandas official documentation
- HTTP status code reference
Appendix A: Running and Output
python c01_hello.py
- Running the script produces stock_data.csv (UTF-8-SIG encoding)
- Logs are written to stock_spider.log
- The console prints a collection summary and a preview of the first 5 rows
Appendix B: Full Code (c01_hello.py)
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
First Python financial data scraper - stock data collection.

Features:
1. Fetch the stock list from Eastmoney (eastmoney.com)
2. Parse basic stock information (code, name, price, change percentage, etc.)
3. Clean and format the data
4. Save the results to a CSV file
5. Full exception handling and logging

Author: a scraper engineer
Created: 2025
Version: v1.0
"""
import requests
import pandas as pd
import time
import random
import logging
from typing import List, Dict, Optional

# Configure the logging system
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('stock_spider.log', encoding='utf-8'),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger(__name__)


class StockSpider:
    """Stock data scraper."""

    def __init__(self):
        """Initialize the scraper configuration."""
        # Target URL - Eastmoney stock list endpoint
        self.base_url = "https://push2.eastmoney.com/api/qt/clist/get"
        # Request headers that mimic a regular browser
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Accept': 'application/json, text/plain, */*',
            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
            'Connection': 'keep-alive',
            'Referer': 'http://quote.eastmoney.com/center/gridlist.html',
        }
        # Request timeout in seconds
        self.timeout = 10
        # Delay between requests (seconds) to avoid hammering the server
        self.delay = random.uniform(1.0, 3.0)
        logger.info("Stock spider initialized")

    def make_request(self, url: str, params: Optional[Dict] = None) -> Optional[Dict]:
        """Send an HTTP GET request and return the parsed JSON.

        Args:
            url: the URL to request
            params: query string parameters

        Returns:
            dict parsed from the JSON response, or None on failure
        """
        try:
            # Random delay to reduce the chance of being rate limited
            time.sleep(self.delay)
            logger.info(f"Requesting URL: {url}")
            # Send the GET request
            response = requests.get(
                url,
                headers=self.headers,
                params=params,
                timeout=self.timeout
            )
            # Raise for non-2xx status codes
            response.raise_for_status()
            # Force UTF-8 decoding
            response.encoding = 'utf-8'
            logger.info(f"Request succeeded, status code: {response.status_code}")
            return response.json()
        except requests.exceptions.RequestException as e:
            logger.error(f"Request failed: {e}")
            return None
        except Exception as e:
            logger.error(f"Unexpected error: {e}")
            return None

    def parse_stock_data(self, data: Dict) -> List[Dict]:
        """Parse the JSON payload and extract stock records.

        Args:
            data: dict returned by the clist/get endpoint

        Returns:
            List[Dict]: list of stock records
        """
        stocks: List[Dict] = []
        try:
            if not data or 'data' not in data or not data['data']:
                return stocks
            diff = data['data'].get('diff') or []
            for item in diff:
                try:
                    stock_data = {
                        '股票代码': str(item.get('f12') or '').strip(),
                        '股票名称': str(item.get('f14') or '').strip(),
                        '最新价': self._clean_number(item.get('f2')),
                        '涨跌幅': self._clean_number(item.get('f3')),
                        '涨跌额': self._clean_number(item.get('f4')),
                        '成交量': self._clean_number(item.get('f5')),
                        '成交额': self._clean_number(item.get('f6')),
                        '振幅': self._clean_number(item.get('f7')),
                        '换手率': self._clean_number(item.get('f8')),
                        '总市值': self._clean_number(item.get('f20')),
                        '流通市值': self._clean_number(item.get('f21')),
                        '采集时间': time.strftime('%Y-%m-%d %H:%M:%S')
                    }
                    stocks.append(stock_data)
                except Exception as e:
                    logger.warning(f"Failed to parse one record: {e}")
                    continue
            logger.info(f"Parsed {len(stocks)} stock records")
        except Exception as e:
            logger.error(f"Failed to parse data: {e}")
        return stocks

    def _clean_number(self, text, is_percent: bool = False) -> Optional[float]:
        """Clean a numeric value and convert it to a float.

        Args:
            text: raw value (number or string)
            is_percent: whether the value is a percentage

        Returns:
            float: converted value, or None if conversion fails
        """
        if text is None:
            return None
        try:
            if isinstance(text, (int, float)):
                return float(text)
            s = str(text).strip()
            if not s or s == '-':
                return None
            # Strip percent signs and thousands separators; map 亿/万 to scientific notation
            cleaned = s.replace('%', '').replace(',', '').replace('亿', 'e8').replace('万', 'e4')
            value = float(cleaned)
            if is_percent and '%' in s:
                value = value / 100.0
            return value
        except (ValueError, TypeError):
            logger.warning(f"Could not convert number: {text}")
            return None

    def save_to_csv(self, data: List[Dict], filename: str = 'stock_data.csv'):
        """Save the records to a CSV file.

        Args:
            data: list of records to save
            filename: output file name
        """
        try:
            if not data:
                logger.warning("No data to save")
                return
            # Build a DataFrame and write it out with pandas
            df = pd.DataFrame(data)
            df.to_csv(
                filename,
                index=False,            # do not write the row index
                encoding='utf-8-sig',   # keep Chinese headers readable in Excel
                quoting=1               # quote all fields (csv.QUOTE_ALL)
            )
            logger.info(f"Data saved to {filename}, {len(data)} records")
            # Print a collection summary
            print("\n=== Collection summary ===")
            print(f"Collected at: {time.strftime('%Y-%m-%d %H:%M:%S')}")
            print(f"Number of stocks: {len(data)}")
            print(f"Output file: {filename}")
            if len(data) > 0:
                print("\nPreview of the first 5 rows:")
                print(df.head().to_string(index=False))
        except Exception as e:
            logger.error(f"Failed to save CSV file: {e}")

    def run(self):
        """Run the scraper."""
        logger.info("Starting the stock spider")
        try:
            # Request parameters for the Eastmoney JSON endpoint
            params = {
                'pn': '1',
                'pz': '20',
                'po': '1',
                'np': '1',
                'ut': 'bd1d9ddb04089700cf9c27f6f7426281',
                'fltt': '2',
                'invt': '2',
                'fid': 'f12',
                'fs': 'm:0+t:6,m:0+t:80,m:1+t:2,m:1+t:23',
                'fields': 'f12,f14,f2,f3,f4,f5,f6,f7,f8,f20,f21'
            }
            # Fetch the JSON payload
            data = self.make_request(self.base_url, params)
            if not data:
                logger.error("Failed to fetch data")
                return
            stock_data = self.parse_stock_data(data)
            if not stock_data:
                logger.warning("No stock data parsed")
                return
            # Save the records to CSV
            self.save_to_csv(stock_data)
            logger.info("Spider run finished")
        except KeyboardInterrupt:
            logger.info("Interrupted by user")
        except Exception as e:
            logger.error(f"Spider run failed: {e}")


def main():
    """Entry point."""
    print("=" * 60)
    print(" Python financial data scraper - stock data collection")
    print("=" * 60)
    print("What it does:")
    print("• Fetches live stock data from Eastmoney")
    print("• Parses basic stock information and saves it to a CSV file")
    print("• Includes full error handling and logging")
    print("• Follows scraping ethics and the relevant regulations")
    print("=" * 60)
    # Create and run the spider
    spider = StockSpider()
    spider.run()
    print("\n" + "=" * 60)
    print("Done!")
    print("See stock_data.csv for the collected data")
    print("See stock_spider.log for detailed logs")
    print("=" * 60)


if __name__ == "__main__":
    main()
Appendix C: Sample Collected Data
"股票代码","股票名称","最新价","涨跌幅","涨跌额","成交量","成交额","振幅","换手率","总市值","流通市值","采集时间"
"689009","九号公司-WD","60.06","-0.03","-0.02","28701.0","172355676.0","1.48","0.52","43098847472.0","33218698253.0","2025-11-13 10:26:21"
"688981","中芯国际","119.8","0.14","0.17","161767.0","1925898746.0","2.01","0.81","958411133733.0","239547593370.0","2025-11-13 10:26:21"
"688819","天能股份","35.38","3.97","1.35","31263.0","109227460.0","5.05","0.32","34392898000.0","34392898000.0","2025-11-13 10:26:21"
"688800","瑞可达","75.8","0.15","0.11","60873.0","461235153.0","5.56","2.96","15590114593.0","15590114593.0","2025-11-13 10:26:21"
"688799","华纳药厂","51.0","1.94","0.97","14050.0","70906365.0","3.62","1.07","6697320000.0","6697320000.0","2025-11-13 10:26:21"
"688798","艾为电子","78.39","-0.62","-0.49","11935.0","93311189.0","1.93","0.88","18274953776.0","10640912584.0","2025-11-13 10:26:21"
"688793","倍轻松","30.47","-0.23","-0.07","3587.0","10927514.0","1.47","0.42","2618756917.0","2618756917.0","2025-11-13 10:26:21"
"688789","宏华数科","76.72","-0.7","-0.54","2737.0","20984898.0","1.16","0.15","13767506191.0","13767506191.0","2025-11-13 10:26:21"
"688788","科思科技","65.6","-1.81","-1.21","7539.0","49942234.0","4.01","0.48","10290961165.0","10290961165.0","2025-11-13 10:26:21"
"688787","海天瑞声","107.58","1.03","1.1","3532.0","37874168.0","2.45","0.59","6489782864.0","6489782864.0","2025-11-13 10:26:21"
"688786","悦安新材","28.02","0.47","0.13","4132.0","11588119.0","1.79","0.29","4029276084.0","4029276084.0","2025-11-13 10:26:21"
"688783","西安奕材-U","27.62","2.71","0.73","90876.0","245398065.0","3.87","5.52","111524036000.0","4546737725.0","2025-11-13 10:26:21"
"688779","五矿新能","8.89","12.96","1.02","1059212.0","905284519.0","15.63","5.49","17150758830.0","17150758830.0","2025-11-13 10:26:21"
"688778","厦钨新能","77.05","4.82","3.54","37059.0","281562734.0","5.48","0.73","38886447945.0","38886447945.0","2025-11-13 10:26:21"
"688777","中控技术","49.96","1.03","0.51","25343.0","126287186.0","1.13","0.32","39527828769.0","39129759329.0","2025-11-13 10:26:21"
"688776","国光电气","101.5","0.5","0.5","32622.0","338982185.0","6.83","3.01","11000917029.0","11000917029.0","2025-11-13 10:26:21"
"688775","影石创新","265.81","-0.95","-2.56","5044.0","132662631.0","2.38","1.65","106589810000.0","8107622056.0","2025-11-13 10:26:21"
"688772","珠海冠宇","25.99","2.0","0.51","85094.0","217072951.0","3.89","0.75","29422469437.0","29422469437.0","2025-11-13 10:26:21"
"688768","容知日新","42.95","-0.99","-0.43","3460.0","14889130.0","1.29","0.4","3779115223.0","3747969430.0","2025-11-13 10:26:21"
"688767","博拓生物","43.07","-1.42","-0.62","4945.0","21454613.0","2.7","0.33","6431786695.0","6431786695.0","2025-11-13 10:26:21"