
Your First Python Finance Scraper

1. Introduction

As a professional scraping engineer, I will walk you through building your first Python financial-data scraper. Starting from scratch, this article covers the development workflow, key techniques, and best practices for financial scrapers.

2. Environment Setup

2.1 Create a virtual environment

# Create the project directory
mkdir python_spider
cd python_spider

# Create a virtual environment
python -m venv venv

# Activate it (Windows)
venv\Scripts\activate

# Activate it (macOS/Linux)
source venv/bin/activate

2.2 Install the dependencies

# Install the core libraries
pip install requests pandas

3. Scraping Fundamentals

3.1 HTTP request basics

  • GET request: retrieve data
  • POST request: submit data
  • Request headers: carry client information
  • Status codes: 200 OK, 404 Not Found, 500 Internal Server Error
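To see where these pieces go, the sketch below builds a GET request with requests but does not send it, so it runs without network access. The httpbin.org URL and the parameter names are placeholders for illustration, not part of the scraper.

```python
import requests

# Build (but don't send) a GET request to inspect how requests
# encodes query parameters and headers.
req = requests.Request(
    "GET",
    "https://httpbin.org/get",            # placeholder URL
    params={"symbol": "600519", "page": 1},
    headers={"User-Agent": "Mozilla/5.0"},
)
prepared = req.prepare()
print(prepared.method)  # GET
print(prepared.url)     # parameters are appended as a query string
```

Preparing a request this way is also a handy debugging trick: you can check the exact URL and headers a request would use before sending anything.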

3.2 Data parsing

  • JSON parsing: parse the structured data returned by the API directly
  • Field mapping: translate terse raw field codes into readable column names
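Both ideas fit in a few lines. The record below is a hypothetical sample shaped like one entry from the API; the mapping table shows how raw field codes become readable columns.

```python
import json

# Hypothetical raw record, shaped like one entry from a stock-list API.
raw = '{"f12": "600519", "f14": "贵州茅台", "f2": 1700.5, "f3": 1.23}'
record = json.loads(raw)

# Map terse field codes to readable column names.
FIELD_MAP = {"f12": "股票代码", "f14": "股票名称", "f2": "最新价", "f3": "涨跌幅"}
row = {FIELD_MAP[k]: v for k, v in record.items() if k in FIELD_MAP}
print(row)
```

Keeping the mapping in one dictionary means adding a new field later is a one-line change.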

3.3 Data storage

  • CSV files: simple tabular storage
  • JSON files: structured storage
  • Databases: SQLite/MySQL for persistent storage
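A minimal sketch of the first and last options, using an in-memory buffer and an in-memory SQLite database so nothing is written to disk (the column names and values are illustrative):

```python
import io
import sqlite3

import pandas as pd

df = pd.DataFrame([{"code": "600519", "name": "贵州茅台", "price": 1700.5}])

# CSV: write to a StringIO buffer here; use a filename and
# encoding="utf-8-sig" in practice so Excel displays Chinese correctly.
buf = io.StringIO()
df.to_csv(buf, index=False)

# SQLite: to_sql creates the table and inserts the rows in one call.
conn = sqlite3.connect(":memory:")
df.to_sql("stocks", conn, index=False, if_exists="replace")
count = conn.execute("SELECT COUNT(*) FROM stocks").fetchone()[0]
```

For a daily job, `if_exists="append"` plus a date column gives you a growing history table with no extra code.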

4. Your First Finance Scraper: Fetching Stock Data

4.1 Target analysis

We fetch basic stock information through Eastmoney's public API, including:

  • Stock code and name
  • Latest price, change percentage, change amount
  • Trading volume and turnover
  • Amplitude and turnover rate
  • Total and float market capitalization

4.2 Technology choices

  • HTTP client: requests
  • Parsing: JSON structure parsing
  • Storage: pandas + CSV

4.3 Implementation

The full code is in c01_hello.py.

Example parameters for the API (Eastmoney clist/get):

  • pn: page number, e.g. 1
  • pz: items per page, e.g. 20
  • po: sort direction, 1 for ascending
  • np: response format, 1 for a JSON array
  • ut: API token, e.g. bd1d9ddb04089700cf9c27f6f7426281
  • fltt: float formatting, 2
  • invt: list format, 2
  • fid: sort field, e.g. f12
  • fs: market and board filter, e.g. m:0+t:6,m:0+t:80,m:1+t:2,m:1+t:23
  • fields: fields to return, e.g. f12,f14,f2,f3,f4,f5,f6,f7,f8,f20,f21

Common field mappings:

  • f12 → 股票代码 (stock code)
  • f14 → 股票名称 (stock name)
  • f2 → 最新价 (latest price)
  • f3 → 涨跌幅 (change, numeric percent)
  • f4 → 涨跌额 (change amount)
  • f5 → 成交量 (volume)
  • f6 → 成交额 (turnover)
  • f7 → 振幅 (amplitude, numeric percent)
  • f8 → 换手率 (turnover rate, numeric percent)
  • f20 → 总市值 (total market cap)
  • f21 → 流通市值 (float market cap)
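Applied to the response shape this API returns (records live under data.diff), the mapping looks like this. The sample below is a trimmed, hand-built stand-in for a real response; the values are taken from the output later in this article.

```python
# Trimmed stand-in for a clist/get JSON response: records sit under data.diff.
sample = {
    "data": {
        "total": 2,
        "diff": [
            {"f12": "688981", "f14": "中芯国际", "f2": 119.8, "f3": 0.14},
            {"f12": "688800", "f14": "瑞可达", "f2": 75.8, "f3": 0.15},
        ],
    }
}

FIELD_MAP = {"f12": "股票代码", "f14": "股票名称", "f2": "最新价", "f3": "涨跌幅"}

# One readable dict per stock, ready for a pandas DataFrame.
rows = [
    {FIELD_MAP[k]: v for k, v in item.items() if k in FIELD_MAP}
    for item in sample["data"]["diff"]
]
```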

5. Development Workflow

5.1 Step breakdown

  1. Target analysis: identify the API and the fields
  2. Request sending: build the HTTP request and fetch the JSON
  3. Data parsing: extract and map the core fields
  4. Data cleaning: handle missing values and formats
  5. Data storage: save to a CSV file
  6. Error handling: catch network and parsing exceptions
  7. Performance tuning: throttle the request rate and add a retry strategy
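Step 7 usually means retrying failed requests with growing pauses. A minimal sketch of exponential backoff with jitter; `flaky` is a stand-in for a network fetch that fails twice before succeeding, so the example runs offline.

```python
import random
import time

def fetch_with_retry(fetch, max_retries=3, base_delay=1.0):
    """Call fetch(); on failure, retry with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the error to the caller
            # Wait 1x, 2x, 4x ... the base delay, plus random jitter
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))

calls = {"n": 0}

def flaky():
    """Stand-in for a network fetch: fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

result = fetch_with_retry(flaky, base_delay=0.01)
```

The jitter matters: if many clients retry on a fixed schedule, their retries arrive in synchronized bursts.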

5.2 Best practices

  1. Honor robots.txt: respect the site's crawling rules
  2. Add reasonable delays: avoid putting pressure on the server
  3. Set a User-Agent: mimic browser behavior
  4. Handle exceptions: keep the program robust
  5. Validate data: ensure data quality
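The standard library can check robots.txt for you. The sketch below parses a hypothetical robots.txt from a string so it runs offline; in practice you would point `set_url` at the site's real robots.txt and call `read()`.

```python
from urllib import robotparser

# Hypothetical robots.txt, parsed from a string for an offline example.
RULES = """\
User-agent: *
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

allowed = rp.can_fetch("MySpider/1.0", "https://example.com/quotes/today")
blocked = rp.can_fetch("MySpider/1.0", "https://example.com/private/data")
```

Calling `can_fetch` before each new URL pattern is cheap insurance against crawling paths the site has explicitly ruled out.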

6. Common Problems and Solutions

6.1 Dealing with anti-scraping measures

  • IP bans: use a proxy pool
  • CAPTCHAs: use OCR or a third-party solving service
  • JavaScript rendering: use Selenium or Playwright
  • Rate limits: add random delays
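The last two countermeasures above can be combined in a few lines: rotate the User-Agent and sleep a random interval between requests. The User-Agent strings below are illustrative, and the tiny delay bounds are only for demonstration.

```python
import random
import time

# Small pool of browser User-Agent strings (illustrative values).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64) Firefox/121.0",
]

def pick_headers() -> dict:
    """Rotate the User-Agent so consecutive requests look less uniform."""
    return {"User-Agent": random.choice(USER_AGENTS)}

def polite_sleep(lo: float = 1.0, hi: float = 3.0) -> float:
    """Sleep a random interval to avoid a fixed, detectable request cadence."""
    delay = random.uniform(lo, hi)
    time.sleep(delay)
    return delay

headers = pick_headers()
waited = polite_sleep(0.01, 0.02)  # tiny bounds just for the demo
```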

6.2 Ensuring data quality

  • Field validation: check data completeness
  • Deduplication: avoid duplicate records
  • Outlier handling: filter out implausible values
  • Backups: back up important data regularly

7. Where to Go Next

7.1 Technical topics

  • Asynchronous scraping: use aiohttp for throughput
  • Distributed scraping: use Scrapy-Redis
  • Browser automation: Selenium/Playwright
  • API reverse engineering: analyze JavaScript interfaces
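The core idea behind async scraping is issuing many requests concurrently instead of one at a time. The sketch below uses plain asyncio; `fake_fetch` is a stand-in for an aiohttp request so the example runs without network access.

```python
import asyncio

async def fake_fetch(code: str) -> dict:
    """Stand-in for an async HTTP fetch of one stock's data."""
    await asyncio.sleep(0.01)  # simulates network latency
    return {"code": code, "price": 1.0}

async def crawl(codes):
    # gather() runs all fetches concurrently rather than sequentially,
    # so total wall time is roughly one latency, not one per stock.
    return await asyncio.gather(*(fake_fetch(c) for c in codes))

results = asyncio.run(crawl(["600519", "688981", "000001"]))
```

With a real client you would swap `fake_fetch` for an aiohttp session call and cap concurrency with an `asyncio.Semaphore` so politeness rules still hold.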

7.2 Applying the data

  • Analysis: explore the data with pandas
  • Visualization: matplotlib/plotly
  • Machine learning: build predictive models
  • Real-time monitoring: build a data-monitoring system

8. Summary

Building your first Python finance scraper covers the core scraping workflow: environment setup, target analysis, implementation, and data storage. Through this example you should now have a grasp of:

  1. Sending basic HTTP requests
  2. Parsing JSON API responses
  3. Extracting and cleaning data
  4. Writing results to files
  5. Exception handling and robust program design

Remember that scraping is not only about the technical implementation: compliance and data quality matter just as much. In real projects, always follow applicable laws and the site's terms of service.

9. References

  1. Requests official documentation
  2. BeautifulSoup official documentation
  3. Pandas official documentation
  4. HTTP status code reference

4.4 Running and output

python c01_hello.py

  • Running it produces stock_data.csv (UTF-8-SIG encoding)
  • Logs are written to stock_spider.log
  • The console shows a collection summary and a preview of the first 5 rows

Full code


#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
First Python financial-data scraper - stock data collection.

Features:
1. Fetch the stock list from Eastmoney
2. Parse basic stock info (code, name, price, change, etc.)
3. Clean and format the data
4. Save to a CSV file
5. Full exception handling and logging

Author: a professional scraping engineer
Created: 2025
Version: v1.0
"""
import logging
import random
import time
from typing import Dict, List, Optional

import pandas as pd
import requests

# Configure logging to both a file and the console
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('stock_spider.log', encoding='utf-8'),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger(__name__)


class StockSpider:
    """Stock data spider."""

    def __init__(self):
        """Initialize the spider configuration."""
        # Target URL: the Eastmoney stock-list API
        self.base_url = "https://push2.eastmoney.com/api/qt/clist/get"
        # Request headers that mimic a browser
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/120.0.0.0 Safari/537.36',
            'Accept': 'application/json, text/plain, */*',
            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
            'Connection': 'keep-alive',
            'Referer': 'http://quote.eastmoney.com/center/gridlist.html',
        }
        # Request timeout in seconds
        self.timeout = 10
        # Delay between requests in seconds, to avoid hammering the server
        self.delay = random.uniform(1.0, 3.0)
        logger.info("Stock spider initialized")

    def make_request(self, url: str, params: Optional[Dict] = None) -> Optional[Dict]:
        """Send an HTTP GET request and return the decoded JSON.

        Args:
            url: the URL to request
            params: query parameters

        Returns:
            The decoded JSON as a dict, or None on failure.
        """
        try:
            # Random delay to avoid triggering anti-scraping measures
            time.sleep(self.delay)
            logger.info(f"Requesting URL: {url}")
            response = requests.get(
                url,
                headers=self.headers,
                params=params,
                timeout=self.timeout
            )
            # Raise for non-2xx status codes
            response.raise_for_status()
            response.encoding = 'utf-8'
            logger.info(f"Request succeeded, status code: {response.status_code}")
            return response.json()
        except requests.exceptions.RequestException as e:
            logger.error(f"Request failed: {e}")
            return None
        except Exception as e:
            logger.error(f"Unexpected error: {e}")
            return None

    def parse_stock_data(self, data: Dict) -> List[Dict]:
        """Extract stock records from the decoded JSON response.

        Args:
            data: the JSON dict returned by the API

        Returns:
            A list of stock records.
        """
        stocks: List[Dict] = []
        try:
            if not data or 'data' not in data or not data['data']:
                return stocks
            diff = data['data'].get('diff') or []
            for item in diff:
                try:
                    stock_data = {
                        '股票代码': str(item.get('f12') or '').strip(),
                        '股票名称': str(item.get('f14') or '').strip(),
                        '最新价': self._clean_number(item.get('f2')),
                        '涨跌幅': self._clean_number(item.get('f3')),
                        '涨跌额': self._clean_number(item.get('f4')),
                        '成交量': self._clean_number(item.get('f5')),
                        '成交额': self._clean_number(item.get('f6')),
                        '振幅': self._clean_number(item.get('f7')),
                        '换手率': self._clean_number(item.get('f8')),
                        '总市值': self._clean_number(item.get('f20')),
                        '流通市值': self._clean_number(item.get('f21')),
                        '采集时间': time.strftime('%Y-%m-%d %H:%M:%S')
                    }
                    stocks.append(stock_data)
                except Exception as e:
                    logger.warning(f"Failed to parse one record: {e}")
                    continue
            logger.info(f"Parsed {len(stocks)} stock records")
        except Exception as e:
            logger.error(f"Failed to parse data: {e}")
        return stocks

    def _clean_number(self, text, is_percent: bool = False) -> Optional[float]:
        """Normalize a raw value (number or string) to a float.

        Args:
            text: the raw value
            is_percent: whether the value is a percentage

        Returns:
            The converted float, or None if conversion fails.
        """
        if text is None:
            return None
        try:
            if isinstance(text, (int, float)):
                return float(text)
            s = str(text).strip()
            if not s or s == '-':
                return None
            # Strip separators; map 亿/万 to scientific-notation suffixes
            cleaned = (s.replace('%', '').replace(',', '')
                        .replace('亿', 'e8').replace('万', 'e4'))
            value = float(cleaned)
            if is_percent and '%' in s:
                value = value / 100.0
            return value
        except (ValueError, TypeError):
            logger.warning(f"Could not convert number: {text}")
            return None

    def save_to_csv(self, data: List[Dict], filename: str = 'stock_data.csv'):
        """Save the records to a CSV file.

        Args:
            data: the records to save
            filename: output file name
        """
        try:
            if not data:
                logger.warning("No data to save")
                return
            df = pd.DataFrame(data)
            df.to_csv(
                filename,
                index=False,           # drop the row index
                encoding='utf-8-sig',  # keeps Chinese readable in Excel
                quoting=1              # quote all fields
            )
            logger.info(f"Data saved to {filename}, {len(data)} records")
            # Print a short summary
            print("\n=== Collection summary ===")
            print(f"Collected at: {time.strftime('%Y-%m-%d %H:%M:%S')}")
            print(f"Stock count: {len(data)}")
            print(f"File path: {filename}")
            if len(data) > 0:
                print("\nPreview of the first 5 rows:")
                print(df.head().to_string(index=False))
        except Exception as e:
            logger.error(f"Failed to save CSV file: {e}")

    def run(self):
        """Run the spider end to end."""
        logger.info("Starting the stock spider")
        try:
            # Query parameters for the Eastmoney JSON API
            params = {
                'pn': '1',
                'pz': '20',
                'po': '1',
                'np': '1',
                'ut': 'bd1d9ddb04089700cf9c27f6f7426281',
                'fltt': '2',
                'invt': '2',
                'fid': 'f12',
                'fs': 'm:0+t:6,m:0+t:80,m:1+t:2,m:1+t:23',
                'fields': 'f12,f14,f2,f3,f4,f5,f6,f7,f8,f20,f21'
            }
            data = self.make_request(self.base_url, params)
            if not data:
                logger.error("Failed to fetch data")
                return
            stock_data = self.parse_stock_data(data)
            if not stock_data:
                logger.warning("No stock data parsed")
                return
            self.save_to_csv(stock_data)
            logger.info("Spider finished")
        except KeyboardInterrupt:
            logger.info("Interrupted by user")
        except Exception as e:
            logger.error(f"Spider failed: {e}")


def main():
    """Entry point."""
    print("=" * 60)
    print("        Python financial-data scraper - stock data")
    print("=" * 60)
    print("What it does:")
    print("• Fetches live stock data from Eastmoney")
    print("• Parses basic stock info and saves it to a CSV file")
    print("• Includes full error handling and logging")
    print("• Follows scraping ethics and applicable law")
    print("=" * 60)
    spider = StockSpider()
    spider.run()
    print("\n" + "=" * 60)
    print("Done!")
    print("See stock_data.csv for the collected data")
    print("See stock_spider.log for detailed logs")
    print("=" * 60)


if __name__ == "__main__":
    main()

Sample of the scraped data

"股票代码","股票名称","最新价","涨跌幅","涨跌额","成交量","成交额","振幅","换手率","总市值","流通市值","采集时间"
"689009","九号公司-WD","60.06","-0.03","-0.02","28701.0","172355676.0","1.48","0.52","43098847472.0","33218698253.0","2025-11-13 10:26:21"
"688981","中芯国际","119.8","0.14","0.17","161767.0","1925898746.0","2.01","0.81","958411133733.0","239547593370.0","2025-11-13 10:26:21"
"688819","天能股份","35.38","3.97","1.35","31263.0","109227460.0","5.05","0.32","34392898000.0","34392898000.0","2025-11-13 10:26:21"
"688800","瑞可达","75.8","0.15","0.11","60873.0","461235153.0","5.56","2.96","15590114593.0","15590114593.0","2025-11-13 10:26:21"
"688799","华纳药厂","51.0","1.94","0.97","14050.0","70906365.0","3.62","1.07","6697320000.0","6697320000.0","2025-11-13 10:26:21"
"688798","艾为电子","78.39","-0.62","-0.49","11935.0","93311189.0","1.93","0.88","18274953776.0","10640912584.0","2025-11-13 10:26:21"
"688793","倍轻松","30.47","-0.23","-0.07","3587.0","10927514.0","1.47","0.42","2618756917.0","2618756917.0","2025-11-13 10:26:21"
"688789","宏华数科","76.72","-0.7","-0.54","2737.0","20984898.0","1.16","0.15","13767506191.0","13767506191.0","2025-11-13 10:26:21"
"688788","科思科技","65.6","-1.81","-1.21","7539.0","49942234.0","4.01","0.48","10290961165.0","10290961165.0","2025-11-13 10:26:21"
"688787","海天瑞声","107.58","1.03","1.1","3532.0","37874168.0","2.45","0.59","6489782864.0","6489782864.0","2025-11-13 10:26:21"
"688786","悦安新材","28.02","0.47","0.13","4132.0","11588119.0","1.79","0.29","4029276084.0","4029276084.0","2025-11-13 10:26:21"
"688783","西安奕材-U","27.62","2.71","0.73","90876.0","245398065.0","3.87","5.52","111524036000.0","4546737725.0","2025-11-13 10:26:21"
"688779","五矿新能","8.89","12.96","1.02","1059212.0","905284519.0","15.63","5.49","17150758830.0","17150758830.0","2025-11-13 10:26:21"
"688778","厦钨新能","77.05","4.82","3.54","37059.0","281562734.0","5.48","0.73","38886447945.0","38886447945.0","2025-11-13 10:26:21"
"688777","中控技术","49.96","1.03","0.51","25343.0","126287186.0","1.13","0.32","39527828769.0","39129759329.0","2025-11-13 10:26:21"
"688776","国光电气","101.5","0.5","0.5","32622.0","338982185.0","6.83","3.01","11000917029.0","11000917029.0","2025-11-13 10:26:21"
"688775","影石创新","265.81","-0.95","-2.56","5044.0","132662631.0","2.38","1.65","106589810000.0","8107622056.0","2025-11-13 10:26:21"
"688772","珠海冠宇","25.99","2.0","0.51","85094.0","217072951.0","3.89","0.75","29422469437.0","29422469437.0","2025-11-13 10:26:21"
"688768","容知日新","42.95","-0.99","-0.43","3460.0","14889130.0","1.29","0.4","3779115223.0","3747969430.0","2025-11-13 10:26:21"
"688767","博拓生物","43.07","-1.42","-0.62","4945.0","21454613.0","2.7","0.33","6431786695.0","6431786695.0","2025-11-13 10:26:21"
