
ML4T - Chapter 8, Section 0: Data prep

news 2025/10/4 12:09:56

1. Load the Quandl adjusted prices

2. Load the Lasso predictions

3. Unify the index names and determine the time window

4. Align prices and predictions

5. Full code

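What this script does (preparation stage):

The script takes the Lasso predictions for the best-performing hyperparameter (alpha), aligns them with the Quandl adjusted prices on a common (ticker, date) index, and writes the complete table 08_backtest.h5, which can be used directly for strategy backtesting.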

1. Load the Quandl adjusted prices

with pd.HDFStore(DATA_DIR / 'assets.h5') as store:
    prices = (store['quandl/wiki/prices']
              .filter(like='adj')
              .rename(columns=lambda x: x.replace('adj_', ''))
              .swaplevel(axis=0))
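To make the effect of this method chain concrete, here is a minimal sketch on a toy frame. The frame `raw`, its values, and the assumption that the stored index is ordered (date, ticker) are illustrative only and are not taken from assets.h5.

```python
import pandas as pd

# Hypothetical toy data: (date, ticker) index with raw and adjusted columns
index = pd.MultiIndex.from_product(
    [pd.to_datetime(['2017-01-03', '2017-01-04']), ['AAPL', 'MSFT']],
    names=['date', 'ticker'])
raw = pd.DataFrame({
    'close': [1.0, 2.0, 3.0, 4.0],
    'adj_close': [1.1, 2.1, 3.1, 4.1],
    'adj_volume': [10.0, 20.0, 30.0, 40.0],
}, index=index)

toy_prices = (raw
              .filter(like='adj')                               # keep only columns whose name contains 'adj'
              .rename(columns=lambda x: x.replace('adj_', ''))  # adj_close -> close, adj_volume -> volume
              .swaplevel(axis=0))                               # index becomes (ticker, date)

print(toy_prices)
```

The unadjusted `close` column is dropped, the adjusted columns take over the plain names, and the index levels are swapped so that they match the (ticker, date) order used for the predictions later on.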

2. Load the Lasso predictions

# with pd.HDFStore(PROJECT_DIR / '07_linear_models/data.h5') as store:
with pd.HDFStore(PROJECT_DIR / 'data.h5') as store:
    print(store.info())
    predictions = store[predictions]

# Use the Spearman rank correlation to measure how monotonically the predictions
# track the actual values, and pick the alpha with the best overall score.
best_alpha = predictions.groupby('alpha').apply(lambda x: spearmanr(x.actuals, x.predicted)[0]).idxmax()

predictions = predictions[predictions.alpha == best_alpha]
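The alpha selection can be sketched on a small hand-made table. The frame `toy` and its numbers are invented for illustration. This variant selects only the two needed columns before `apply`, which in recent pandas versions should also avoid the `DataFrameGroupBy.apply` FutureWarning; that is a variation on the script, not the code it actually runs.

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical predictions for two alpha values on the same four observations
toy = pd.DataFrame({
    'alpha':     [0.1, 0.1, 0.1, 0.1, 1.0, 1.0, 1.0, 1.0],
    'actuals':   [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0],
    'predicted': [0.2, 0.1, 0.4, 0.3, 0.1, 0.2, 0.3, 0.5],
})

# One Spearman rank correlation per alpha, computed over all rows of that alpha
ic_by_alpha = (toy.groupby('alpha')[['actuals', 'predicted']]
               .apply(lambda x: spearmanr(x.actuals, x.predicted)[0]))

# The alpha whose predictions rank the actual values best
toy_best_alpha = ic_by_alpha.idxmax()
print(ic_by_alpha)
print(toy_best_alpha)
```

Here the predictions for alpha 1.0 are perfectly monotonic in the actual values (correlation 1.0), while those for alpha 0.1 swap neighbouring pairs (correlation 0.6), so alpha 1.0 is selected.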

3. Unify the index names and determine the time window

predictions.index.names = ['ticker', 'date']

tickers = predictions.index.get_level_values('ticker').unique()

start = predictions.index.get_level_values('date').min().strftime('%Y-%m-%d')

stop = (predictions.index.get_level_values('date').max() + pd.DateOffset(1)).strftime('%Y-%m-%d')
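As a small illustration of how the window bounds come out, the sketch below applies the same expressions to a two-row index. The tickers and dates in `toy_index` are hypothetical.

```python
import pandas as pd

# Hypothetical (ticker, date) index with the earliest and latest prediction dates
toy_index = pd.MultiIndex.from_tuples(
    [('AAPL', pd.Timestamp('2014-12-09')), ('MPC', pd.Timestamp('2017-11-29'))],
    names=['ticker', 'date'])

dates = toy_index.get_level_values('date')
toy_start = dates.min().strftime('%Y-%m-%d')
# pd.DateOffset(1) adds one calendar day to the last prediction date
toy_stop = (dates.max() + pd.DateOffset(1)).strftime('%Y-%m-%d')

print(toy_start, toy_stop)
```

Because label-based slicing with `.loc` includes both endpoints, extending `stop` by one day means the price window can contain one more day than the predictions cover.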

4. Align prices and predictions

idx = pd.IndexSlice

prices = prices.sort_index().loc[idx[tickers, start:stop], :]

predictions = predictions.loc[predictions.alpha == best_alpha, ['predicted']]

return predictions.join(prices, how='right')
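The right join keeps every price row in the selected window and leaves `predicted` empty wherever no prediction exists. A minimal sketch with made-up tickers, dates, and values (all names prefixed `toy_` are hypothetical) shows the behaviour:

```python
import pandas as pd

idx = pd.IndexSlice
dates = pd.to_datetime(['2017-01-03', '2017-01-04', '2017-01-05'])

# Hypothetical prices for three tickers on three days
price_index = pd.MultiIndex.from_product(
    [['AAPL', 'MSFT', 'XOM'], dates], names=['ticker', 'date'])
toy_prices = pd.DataFrame({'close': [float(i) for i in range(9)]}, index=price_index)

# Hypothetical predictions for two tickers on the first two days only
pred_index = pd.MultiIndex.from_product(
    [['AAPL', 'MSFT'], dates[:2]], names=['ticker', 'date'])
toy_predictions = pd.DataFrame({'predicted': [0.01, -0.02, 0.03, 0.00]}, index=pred_index)

# Restrict prices to the predicted tickers and the date window, then right-join
tickers = toy_predictions.index.get_level_values('ticker').unique()
window = toy_prices.sort_index().loc[idx[tickers, '2017-01-03':'2017-01-05'], :]
aligned = toy_predictions.join(window, how='right')

print(aligned)
```

The result has six rows (two tickers times three days), of which only four carry a prediction; XOM is dropped because it has no predictions at all. This matches the pattern in the script's output, where `predicted` has fewer non-null entries than the price columns.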

5. Full code

#!/usr/bin/env python
# -*- coding: utf-8 -*-
__author__ = 'Stefan Jansen'  # https://github.com/stefan-jansen/machine-learning-for-trading/blob/main/08_ml4t_workflow/00_data/data_prep.py
__modified_author__ = 'MangoQuant'  # https://blog.csdn.net/2401_82851462

from pathlib import Path

import numpy as np
import pandas as pd
from scipy.stats import spearmanr

pd.set_option('display.expand_frame_repr', False)
np.random.seed(42)

# PROJECT_DIR = Path('..', '..')
PROJECT_DIR = Path('.')
DATA_DIR = PROJECT_DIR / 'data'


def get_backtest_data(predictions='lasso/predictions'):
    """Combine chapter 7 lr/lasso/ridge regression predictions
    with adjusted OHLCV Quandl Wiki data"""
    # Load the Quandl adjusted prices
    with pd.HDFStore(DATA_DIR / 'assets.h5') as store:
        prices = (store['quandl/wiki/prices']
                  .filter(like='adj')
                  .rename(columns=lambda x: x.replace('adj_', ''))
                  .swaplevel(axis=0))

    # Load the Lasso predictions
    # with pd.HDFStore(PROJECT_DIR / '07_linear_models/data.h5') as store:
    with pd.HDFStore(PROJECT_DIR / 'data.h5') as store:
        print(store.info())
        predictions = store[predictions]

    # Use the Spearman rank correlation to measure how monotonically the predictions
    # track the actual values, and pick the alpha with the best overall score.
    best_alpha = predictions.groupby('alpha').apply(lambda x: spearmanr(x.actuals, x.predicted)[0]).idxmax()
    predictions = predictions[predictions.alpha == best_alpha]

    # Unify the index names and determine the time window
    predictions.index.names = ['ticker', 'date']
    tickers = predictions.index.get_level_values('ticker').unique()
    start = predictions.index.get_level_values('date').min().strftime('%Y-%m-%d')
    stop = (predictions.index.get_level_values('date').max() + pd.DateOffset(1)).strftime('%Y-%m-%d')

    # Align prices and predictions
    idx = pd.IndexSlice
    prices = prices.sort_index().loc[idx[tickers, start:stop], :]
    predictions = predictions.loc[predictions.alpha == best_alpha, ['predicted']]
    return predictions.join(prices, how='right')


df = get_backtest_data('lasso/predictions')
print(df.info())
# df.to_hdf('backtest.h5', 'data')
df.to_hdf('08_backtest.h5', 'data')
print("08_backtest.h5 saved")

Output after running:

test1@budas-MacBook-Pro ML4T % python 08_00_data_prep.py 
<class 'pandas.io.pytables.HDFStore'>
File path: data.h5
/lasso/coeffs                    frame        (shape->[8,33])      
/lasso/predictions               frame        (shape->[590496,3])  
/lasso/scores                    frame        (shape->[6000,3])    
/logistic/coeffs                 frame        (shape->[11,33])     
/logistic/predictions            frame        (shape->[811932,4])  
/logistic/scores                 frame        (shape->[825,5])     
/lr/predictions                  frame        (shape->[73812,2])   
/lr/scores                       frame        (shape->[750,2])     
/model_data                      frame        (shape->[3566454,69])
/ridge/coeffs                    frame        (shape->[18,33])     
/ridge/predictions               frame        (shape->[1328616,3]) 
/ridge/scores                    frame        (shape->[13500,3])   
/Users/test1/Documents/code/my_develop/leader-follower-strategy/ML4T/08_00_data_prep.py:37: FutureWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.
  best_alpha = predictions.groupby('alpha').apply(lambda x: spearmanr(x.actuals, x.predicted)[0]).idxmax()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 198266 entries, ('AAPL', Timestamp('2014-12-09 00:00:00')) to ('MPC', Timestamp('2017-11-30 00:00:00'))
Data columns (total 6 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   predicted  73812 non-null   float64
 1   open       198266 non-null  float64
 2   high       198266 non-null  float64
 3   low        198266 non-null  float64
 4   close      198266 non-null  float64
 5   volume     198266 non-null  float64
dtypes: float64(6)
memory usage: 14.6+ MB
None
/Users/test1/Documents/code/my_develop/leader-follower-strategy/ML4T/08_00_data_prep.py:56: FutureWarning: Starting with pandas version 3.0 all arguments of to_hdf except for the argument 'path_or_buf' will be keyword-only.
  df.to_hdf('08_backtest.h5', 'data')
08_backtest.h5 saved