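Parsing Tongdaxin (通达信) DAT and BLK Data Files with Python [Source Code Included]
In financial data analysis, Tongdaxin (TDX) is one of the mainstream stock-analysis applications in China, and parsing its data file formats has long been a challenge for developers. This article walks through a complete Tongdaxin data parser project, from the initial binary file reading to the final smart generation of column names, and shows how the implementation evolved along the way.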
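Project Background and Challenges
Business Requirements
Tongdaxin stores stock data in proprietary binary file formats, including:
- BLK files: sector (board) files that store lists of stock codes
- DAT files: data files that store numeric information such as prices and trading volume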
Sample parser output (ETF records):
Stock code,Stock name,Market,Board type,Total volume (shares),Total amount (CNY),Float market cap (CNY),Total market cap (CNY),Change (%),Turnover rate (%),Ratio (%),P/E ratio (%),Trade time_1,Trade time_2,Trade time_3,Record size,Record index,Buy volume (lots),Sell volume (shares),Inner volume (lots),Buy amount (CNY),Parse method,Outer volume (shares),Open interest (lots),Sell amount (CNY),Net assets (CNY),P/B ratio (%),Yield (%),Dividend rate (%)
151010,,Shenzhen,ETF,822742320,822742320,959787312.0,221853234.0,0.00017279828898608685,0.017279828898608685,,0.004272463047527708,,,,32,16,8227423.2,825241909,8252419.09,825241909,lenient validation,168636464.0,1686364.64,168636464.0,892416010.0,,,
151072,,Shenzhen,ETF,221853234,221853234,892406285.0,842477617.0,,,0.004272463047527708,,,,,24,22,2218532.34,892416010,8924160.1,892416010,lenient validation,942879545.0,9428795.45,942879545.0,805965104.0,,,
151303,,Shenzhen,ETF,805965104,805965104,858862897.0,221262640.0,,,0.0002759810968200327,,,,,40,2,8059651.04,909718833,9097188.33,909718833,lenient validation,168637744.0,1686377.44,168637744.0,892416010.0,,,
151306,,Shenzhen,ETF,221262640,221262640,825231885.0,825637173.0,,,,,,,,32,3,2212626.4,892416010,8924160.1,892416010,lenient validation,825308729.0,8253087.29,825308729.0,822742325.0,,,
151309,,Shenzhen,ETF,858862897,858862897,808595763.0,942737933.0,,,0.004220944902044721,,,,,32,4,8588628.97,221264176,2212641.76,221264176,lenient validation,825569546.0,8255695.46,825569546.0,892612920.0,,,
151312,,Shenzhen,ETF,825569546,825569546,,,,,,,,,,8,17,8255695.46,808595763,8085957.63,808595763,fixed_size_8_bytes,,,,,,,
151314,,Shenzhen,ETF,221852724,221852724,942737933.0,875835704.0,,,0.004220944902044721,,,,,24,7,2218527.24,825569546,8255695.46,825569546,lenient validation,808726835.0,8087268.35,808726835.0,822742326.0,,,
151318,,Shenzhen,ETF,858862897,858862897,809054515.0,942737933.0,,,0.004220944902044721,,,,,40,5,8588628.97,221263921,2212639.21,221263921,lenient validation,825569546.0,8255695.46,825569546.0,842281272.0,,,
151319,,Shenzhen,ETF,825569546,825569546,,,,,,,,,,8,26,8255695.46,809054515,8090545.15,809054515,fixed_size_8_bytes,,,,,,,
151320,,Shenzhen,ETF,822742327,822742327,858862897.0,221262386.0,,,0.004220944902044721,,,,,32,7,8227423.27,842215733,8422157.33,842215733,lenient validation,168636464.0,1686364.64,168636464.0,825569546.0,,,
151326,,Shenzhen,ETF,221262386,221262386,942737933.0,808726840.0,,,0.004220944902044721,,,,,24,10,2212623.86,825569546,8255695.46,825569546,lenient validation,808858163.0,8088581.63,808858163.0,822742326.0,,,
151360,,Shenzhen,ETF,221721651,221721651,942737933.0,875770168.0,,,0.004220944902044721,,,,,24,13,2217216.51,825569546,8255695.46,825569546,lenient validation,808465971.0,8084659.71,808465971.0,805965108.0,,,
151370,,Shenzhen,ETF,858862897,858862897,,,,,,,,,,8,43,8588628.97,221261879,2212618.79,221261879,fixed_size_8_bytes,,,,,,,
151375,,Shenzhen,ETF,825569546,825569546,943141681.0,822742320.0,,,0.0043678275687852874,,,,,32,11,8255695.46,808793907,8087939.07,808793907,lenient validation,892406285.0,8924062.85,892406285.0,842086456.0,,,
151378,,Shenzhen,ETF,892406285,8
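A second sample, covering several markets and boards:
Stock code,Stock name,Market,Board type,Price (CNY),Total volume (shares),Total amount (CNY),Float market cap (CNY),Total market cap (CNY),Change (%),Turnover rate (%),P/E ratio (%),Ratio (%),Trade time_1,Trade time_2,Trade time_3,Record size,Record index,Buy volume (lots),Sell volume (shares),Inner volume (lots),Buy amount (CNY),P/B ratio (%),Parse method,Outer volume (shares),Open interest (lots),Sell amount (CNY),Net assets (CNY)
300394,,Shenzhen,ChiNext,,808648704,808648704,,,,,,,,,,24,10,8086487.04,876163888,8761638.88,876163888,,lenient validation,,,,
520880,,Shanghai,Fund/Bond,,842334209,842334209,,,,,,,,,,24,3,8423342.09,808990768,8089907.68,808990768,,lenient validation,,,,
600570,,Shanghai,Main board,,808845313,808845313,,,,,,,,,,24,6,8088453.13,808924464,8089244.64,808924464,,lenient validation,,,,
688711,,Shanghai,Main board,,943063041,943063041,,,,,,0.004339218503446318,,,,24,13,9430630.41,825308984,8253089.84,825308984,,lenient validation,,,,
999999,,Other,Other,,960036865,960036865,,,0.00017642976308707148,0.017642976308707148,0.0001766429195413366,,,,,24,0,9600368.65,960051513,9600515.13,960051513,0.01766429195413366,lenient validation,,,,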
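Technical Challenges
- Binary format parsing: Tongdaxin files use a custom binary format with no publicly documented specification
- Encoding compatibility: files may use any of several character encodings (UTF-8, GBK, GB2312)
- Data structure inference: the internal layout of the data has to be inferred through reverse engineering
- Semantic column names: the raw parse results carry no business meaning and need further smart processing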
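The encoding challenge above can be handled by trying each candidate encoding in turn. The following is a minimal sketch, not the project's actual code: the function name `decode_with_fallback` and the lossy last-resort behavior are assumptions made for illustration.

```python
from typing import Iterable, Tuple


def decode_with_fallback(
    raw: bytes,
    encodings: Iterable[str] = ("utf-8", "gbk", "gb2312"),
) -> Tuple[str, str]:
    """Return (text, encoding_used) for the first encoding that decodes cleanly.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue  # try the next candidate
    # Assumption: as a last resort, keep going and make bad bytes visible
    # as U+FFFD replacement characters instead of raising.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"


text, used = decode_with_fallback("深圳".encode("gbk"))
print(text, used)
```

UTF-8 is tried first because it is strict and rejects most GBK byte sequences, whereas GBK accepts a very wide range of input. Since GBK is a superset of GB2312, the third entry is rarely reached in practice.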
Technical Architecture Design
Overall Architecture
```
Raw data files → Basic parser → Enhanced parser → Smart column-name generator → Structured data output
      ↓               ↓               ↓                       ↓                           ↓
  .blk/.dat      Binary read    Data validation      Semantic processing             CSV/Excel
```
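The flow in the diagram can be pictured as a chain of small stages that each take and return a list of records. The sketch below is only an illustration of that idea: the stage functions `validate` and `add_column_names`, the generic keys `code` and `field_1`, and the rename table are all hypothetical stand-ins rather than the project's real modules.

```python
import csv
import io
from typing import Callable, Dict, List

Record = Dict[str, object]
Stage = Callable[[List[Record]], List[Record]]


def validate(records: List[Record]) -> List[Record]:
    """Hypothetical validation stage: keep records with a 6-digit numeric code."""
    return [
        r for r in records
        if str(r.get("code", "")).isdigit() and len(str(r["code"])) == 6
    ]


def add_column_names(records: List[Record]) -> List[Record]:
    """Hypothetical naming stage: rename generic keys to readable ones."""
    rename = {"code": "stock_code", "field_1": "total_volume"}  # assumed mapping
    return [{rename.get(k, k): v for k, v in r.items()} for r in records]


def run_pipeline(records: List[Record], stages: List[Stage]) -> List[Record]:
    """Apply each stage in order, feeding its output to the next one."""
    for stage in stages:
        records = stage(records)
    return records


def to_csv(records: List[Record]) -> str:
    """Serialize records to CSV text, using the first record's keys as header."""
    if not records:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


raw_records = [{"code": "151010", "field_1": 822742320}, {"code": "bad", "field_1": 1}]
print(to_csv(run_pipeline(raw_records, [validate, add_column_names])))
```

Keeping each stage as a plain function makes it easy to test stages in isolation and to reorder or drop them.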
Core Module Design
1. Basic parsing module (`tdx_reader.py`)
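```python
class TdxDataReader:
    """Base reader for Tongdaxin data files."""

    def __init__(self):
        # Character encodings the reader supports
        self.supported_encodings = ['utf-8', 'gbk', 'gb2312']

    def read_all_files(self):
        """Read all Tongdaxin files."""
        # The _read_blk_files, _read_dat_files and _generate_summary
        # helpers are not shown in this excerpt.
        return {
            'blk_data': self._read_blk_files(),
            'dat_data': self._read_dat_files(),
            'summary': self._generate_summary()
        }
```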
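As a rough illustration of the binary-reading step, the sketch below walks a byte buffer in fixed 8-byte records using the standard `struct` module. The record layout (two little-endian unsigned 32-bit integers) and the name `iter_fixed_records` are assumptions for illustration, not the documented Tongdaxin format or the project's actual implementation.

```python
import struct
from typing import Iterator, Tuple

# Assumption: each record holds two little-endian uint32 fields (8 bytes).
RECORD_FORMAT = "<II"
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)


def iter_fixed_records(raw: bytes) -> Iterator[Tuple[int, int, int]]:
    """Yield (index, field_a, field_b) for every complete record in raw."""
    # Drop a trailing partial record so iter_unpack sees a whole multiple.
    usable = len(raw) - (len(raw) % RECORD_SIZE)
    for index, (field_a, field_b) in enumerate(
        struct.iter_unpack(RECORD_FORMAT, raw[:usable])
    ):
        yield index, field_a, field_b


sample = struct.pack("<II", 1, 2) + struct.pack("<II", 3, 4) + b"\x00\x01"
for record in iter_fixed_records(sample):
    print(record)
```

When the real layout is unknown, a helper like this can be rerun with different candidate formats and record sizes while comparing the decoded values against figures shown in the Tongdaxin client.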
2. Enhanced parsing module (`enhanced_tdx_parser.py`)
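```python
class EnhancedTdxDataReader:
    """Enhanced Tongdaxin data parser."""

    def __init__(self):
        # Market detection rules, keyed by exchange and based on code prefixes
        self.market_rules = {
            'SH': lambda code: code.startswith(('60', '68', '51')),
            'SZ': lambda code: code.startswith(('00', '30', '15'))
        }

    def classify_stock(self, code):
        """Smart classification of a stock code."""
        # Automatic identification of market and board is implemented here
        pass
```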
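One way the classification step might look is sketched below. The market prefixes are the ones listed in `market_rules`; the board table `BOARD_PREFIXES`, the `"Other"` fallback, and the standalone function form are assumptions for illustration, with board labels chosen to agree with the sample output where the prefixes overlap.

```python
from typing import Tuple

# Prefixes taken from market_rules in the article.
MARKET_PREFIXES = {
    "SH": ("60", "68", "51"),
    "SZ": ("00", "30", "15"),
}

# Assumption: an illustrative, incomplete prefix-to-board table.
BOARD_PREFIXES = {
    "60": "Main board",
    "68": "Main board",  # labelled this way in the article's sample output
    "51": "ETF",
    "00": "Main board",
    "30": "ChiNext",
    "15": "ETF",
}


def classify_stock(code: str) -> Tuple[str, str]:
    """Return (market, board) for a stock code, or ("Other", "Other")."""
    for market, prefixes in MARKET_PREFIXES.items():
        if code.startswith(prefixes):  # str.startswith accepts a tuple
            return market, BOARD_PREFIXES.get(code[:2], "Other")
    return "Other", "Other"


for code in ("300394", "600570", "151010", "999999"):
    print(code, classify_stock(code))
```

Two-character prefixes keep the table small, but real exchange numbering has finer distinctions, so a production table would likely need longer prefixes.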
3. Smart column-name generator (`meaningful_tdx_parser.py`)
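```python
class MeaningfulColumnGenerator:
    """Smart column-name generator."""

    def __init__(self):
        # Candidate (Chinese) column names, grouped by financial category
        self.financial_terms = {
            'price': ['价格', '现价', '开盘价', '收盘价'],  # price, last, open, close
            'volume': ['成交量', '总成交量'],  # volume, total volume
            'amount': ['成交额', '总成交额'],  # amount, total amount
            'ratio': ['涨跌幅', '换手率', '市盈率']  # change, turnover rate, P/E
        }

    def generate_meaningful_names(self, data):
        """Generate meaningful column names."""
        # Semantic names are derived from numeric features and business rules
        pass
```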
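To make "derived from numeric features" concrete, here is a minimal heuristic sketch. The thresholds, the category order, and the helper names `guess_category` and `name_columns` are all assumptions for illustration; the article does not show the project's actual rules.

```python
from typing import Dict, List, Optional, Sequence


def guess_category(values: Sequence[Optional[float]]) -> str:
    """Guess a coarse category from value ranges (thresholds are illustrative)."""
    present = [v for v in values if v is not None]
    if not present:
        return "unknown"
    largest = max(abs(v) for v in present)
    if largest <= 1.0:
        return "ratio"  # small fractions look like rates or percentages
    if largest >= 10_000 and all(float(v).is_integer() for v in present):
        return "volume"  # large whole numbers look like share counts
    if largest < 10_000:
        return "price"  # moderate magnitudes look like prices
    return "amount"  # large non-integers look like money totals


def name_columns(columns: Dict[str, List[Optional[float]]]) -> Dict[str, str]:
    """Map generic column keys to names such as 'volume_1' or 'ratio_1'."""
    counts: Dict[str, int] = {}
    names: Dict[str, str] = {}
    for key, values in columns.items():
        category = guess_category(values)
        counts[category] = counts.get(category, 0) + 1
        names[key] = f"{category}_{counts[category]}"
    return names


print(name_columns({
    "field_0": [822742320, 221853234],
    "field_1": [0.00017, 0.0043],
    "field_2": [12.5, 13.1],
}))
```

Range-based guesses like these are fragile, so names produced this way are best treated as hints to be checked against known values rather than as ground truth.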
Core Technical Implementation
1. Binary file parsing
Parsing BLK files
```python
def read_blk_file(self,