
Day 23: Machine Learning Pipelines

news 2025/11/2 8:55:23

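In machine learning, data preprocessing, feature extraction, model training, and evaluation are usually executed in sequence. To manage and reuse these steps more efficiently, we can use a Pipeline to build a complete machine learning workflow. This article introduces the basic concepts behind Pipeline and then uses a credit dataset to show how to build such a workflow with it.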

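I. Basic Concepts

(1) Definition of a Pipeline

In machine learning (and in scikit-learn in particular), a Pipeline is an estimator that chains several estimators together. When it is fitted, it runs fit and transform on each intermediate step in order and then fits the final step on the transformed data, so preprocessing, feature extraction, and model training can be handled as a single object.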
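To make the definition concrete, here is a minimal sketch on synthetic data. The data and the step names `scaler` and `clf` are illustrative assumptions, not part of the credit case study. It checks that a fitted Pipeline gives the same predictions as running the same steps by hand.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 200 samples, 3 features, label depends on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)
X_train, X_test = X[:150], X[150:]
y_train = y[:150]

# Manual version: fit the scaler, transform, then fit the model.
scaler = StandardScaler().fit(X_train)
manual_model = LogisticRegression().fit(scaler.transform(X_train), y_train)
manual_pred = manual_model.predict(scaler.transform(X_test))

# Pipeline version: fit() fits and applies every intermediate step, then fits
# the final step; predict() applies the transforms, then predicts.
pipe = Pipeline(steps=[("scaler", StandardScaler()), ("clf", LogisticRegression())])
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)

print("Same predictions:", np.array_equal(manual_pred, pipe_pred))
```

The Pipeline version needs one fit call and one predict call, and the scaler can never be accidentally fitted on the test data.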

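(2) Transformers and Estimators

Transformer: used for preprocessing and feature extraction, for example normalization, standardization, or feature selection. It implements fit and transform. Most transformers are stateful: fit learns the transformation rule from the training data (such as each column's mean and standard deviation) and stores it, and transform applies the stored rule to new data.
Estimator: in scikit-learn's terminology, any object that learns from data through fit, so transformers are estimators as well. In this article the term mostly refers to predictors such as classifiers and regressors, which implement fit and predict and keep the parameters learned during training for later predictions.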
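The following small sketch, on made-up numbers, shows what a transformer stores after fit. `StandardScaler` keeps the training mean and standard deviation in `mean_` and `scale_` and reuses them when transforming new data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_new = np.array([[4.0]])

scaler = StandardScaler()
scaler.fit(X_train)  # learns the mean and standard deviation of X_train
print("Learned mean:", scaler.mean_)
print("Learned scale:", scaler.scale_)

# transform() reuses the stored statistics; nothing is re-learned from X_new.
scaled_new = scaler.transform(X_new)
print("Scaled new sample:", scaled_new)
```

This is why a transformer must be fitted on training data only: whatever it learns in fit is applied unchanged to every later input.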

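(3) Advantages of a Pipeline

Preventing data leakage: during cross-validation, a Pipeline refits the preprocessing steps on the training portion of each fold only, so information from the validation portion never leaks into preprocessing.
Simpler hyperparameter tuning: the parameters of the preprocessing steps and of the model can be tuned together.
High reusability: a Pipeline separates the operations from their parameters, which makes the workflow easy to reuse and maintain.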
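The leakage point can be sketched with `cross_val_score` on synthetic data (the data and step names are assumptions for illustration). Because the whole Pipeline is passed to the cross-validation routine, the scaler is refitted inside every fold.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, unscaled data with a label that depends on two features.
rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=3.0, size=(300, 4))
y = (X[:, 0] + X[:, 1] > 10).astype(int)

pipe = Pipeline(steps=[("scaler", StandardScaler()), ("clf", LogisticRegression())])

# For each of the 5 folds, the pipeline is cloned and the scaler is fitted on
# that fold's training part only, so validation rows never influence the
# scaling statistics.
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print("Fold accuracies:", scores.round(3))
print("Mean accuracy:", scores.mean().round(3))
```

Scaling the full dataset first and cross-validating only the model would let every fold see statistics computed from its own validation rows.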

II. Code Walkthrough

(1) Code without a Pipeline

The following conventional example performs data preprocessing, feature encoding, and model training by hand:

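import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

# Configure a font that can render Chinese characters in plots
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

# Load the data
data = pd.read_csv('data.csv')

# Data preprocessing
# 1. Label encoding
home_ownership_mapping = {
    'Own Home': 1,
    'Rent': 2,
    'Have Mortgage': 3,
    'Home Mortgage': 4
}
data['Home Ownership'] = data['Home Ownership'].map(home_ownership_mapping)

# 'Years in current job' and 'Term' are text columns as well and must be
# encoded before a random forest can be fitted
years_order = ['< 1 year', '1 year', '2 years', '3 years', '4 years', '5 years',
               '6 years', '7 years', '8 years', '9 years', '10+ years']
years_mapping = {label: rank for rank, label in enumerate(years_order, start=1)}
data['Years in current job'] = data['Years in current job'].map(years_mapping)
data['Term'] = data['Term'].map({'Short Term': 0, 'Long Term': 1})

# 2. One-hot encoding
data = pd.get_dummies(data, columns=['Purpose'])

# 3. Handle missing values by filling with the mode
# Note: the mode is computed on the full dataset before the split, so
# test-set information leaks into the preprocessing
continuous_features = data.select_dtypes(include=['int64', 'float64']).columns.tolist()
for feature in continuous_features:
    mode_value = data[feature].mode()[0]
    # Assign the result back; inplace=True on a column selection may not
    # modify the DataFrame in recent pandas versions
    data[feature] = data[feature].fillna(mode_value)

# 4. Split the dataset
from sklearn.model_selection import train_test_split
X = data.drop(['Credit Default'], axis=1)
y = data['Credit Default']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 5. Train and evaluate the model
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)

print("Classification report of the default random forest on the test set:")
print(classification_report(y_test, rf_pred))
print("Confusion matrix of the default random forest on the test set:")
print(confusion_matrix(y_test, rf_pred))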

(2) Code with a Pipeline

The following example builds the same workflow with a Pipeline:

1. Import libraries and load the data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
import warnings
warnings.filterwarnings("ignore")
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Load the raw data
data = pd.read_csv('data.csv')
print("Raw data loaded, shape:", data.shape)

2. Separate features and label, then split the dataset

# Separate features and label
y = data['Credit Default']
X = data.drop(['Credit Default'], axis=1)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("\nDataset split completed (before preprocessing).")
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

3. Define the preprocessing steps

# Identify the different column types
object_cols = X.select_dtypes(include=['object']).columns.tolist()
numeric_cols = X.select_dtypes(exclude=['object']).columns.tolist()

# Ordinal features: one category order per feature, lowest to highest
ordinal_features = ['Home Ownership', 'Years in current job', 'Term']
ordinal_categories = [
    ['Own Home', 'Rent', 'Have Mortgage', 'Home Mortgage'],
    ['< 1 year', '1 year', '2 years', '3 years', '4 years', '5 years', '6 years', '7 years', '8 years', '9 years', '10+ years'],
    ['Short Term', 'Long Term']
]
ordinal_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OrdinalEncoder(categories=ordinal_categories, handle_unknown='use_encoded_value', unknown_value=-1))
])

# Nominal features
nominal_features = ['Purpose']
nominal_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    # sparse_output requires scikit-learn >= 1.2
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# Continuous features: every column that is neither ordinal nor nominal
continuous_features = [f for f in X.columns if f not in ordinal_features + nominal_features]
continuous_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('scaler', StandardScaler())
])

# Build the ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('ordinal', ordinal_transformer, ordinal_features),
        ('nominal', nominal_transformer, nominal_features),
        ('continuous', continuous_transformer, continuous_features)
    ],
    remainder='passthrough'
)

4. Build the complete Pipeline

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])
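Once the steps have names, they can be inspected and reconfigured by name. The sketch below uses a stand-in pipeline in which `StandardScaler` replaces the ColumnTransformer purely to keep the example self-contained; the parameter values are arbitrary.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in pipeline: StandardScaler takes the place of the ColumnTransformer.
demo_pipeline = Pipeline(steps=[
    ("preprocessor", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# Steps are reachable by the names given in `steps`.
print(demo_pipeline.named_steps["classifier"])

# Nested parameters follow the "<step name>__<parameter>" convention.
demo_pipeline.set_params(classifier__n_estimators=200, classifier__max_depth=10)
print(demo_pipeline.get_params()["classifier__n_estimators"])
```

The same double-underscore names are what a grid search expects as keys in its parameter grid.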

5. Train and evaluate with the Pipeline

print("\n--- 1. Random forest with default parameters (training set -> test set) ---")
start_time = time.time()

pipeline.fit(X_train, y_train)
pipeline_pred = pipeline.predict(X_test)

end_time = time.time()
print(f"Training and prediction time: {end_time - start_time:.4f} seconds")

print("\nClassification report of the default random forest on the test set:")
print(classification_report(y_test, pipeline_pred))
print("Confusion matrix of the default random forest on the test set:")
print(confusion_matrix(y_test, pipeline_pred))
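Because preprocessing and model live in one object, a fitted Pipeline can be saved and reloaded as a unit. The sketch below uses synthetic data, a stand-in preprocessing step, and a hypothetical file name; joblib is one common choice for persisting scikit-learn objects.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

fitted = Pipeline(steps=[
    ("preprocessor", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=20, random_state=42)),
])
fitted.fit(X_demo, y_demo)

# Persist preprocessing and model together in a single file.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "credit_pipeline.joblib")  # hypothetical name
    joblib.dump(fitted, model_path)
    restored = joblib.load(model_path)

# The restored pipeline accepts raw, unscaled features directly.
same = np.array_equal(fitted.predict(X_demo), restored.predict(X_demo))
print("Restored pipeline reproduces the predictions:", same)
```

A saved file should only be loaded with a compatible scikit-learn version and from a trusted source, since joblib files are pickle-based.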

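III. Summary

Logical order of a general machine learning Pipeline

1. Load the data

Read the dataset.
Take a first look at its basic properties (shape, column names, data types, and so on).

2. Separate features and label

Identify the target variable (label) and the feature variables.
Separate the features from the label.

3. Split the dataset

Split the data into a training set and a test set (optionally also a validation set).

4. Define the preprocessing steps

Identify the different feature types:
- Categorical features (both ordinal and nominal).
- Continuous features.
Categorical feature handling:
- Ordinal features: impute missing values + ordinal encoding.
- Nominal features: impute missing values + one-hot encoding.
Continuous feature handling:
- Impute missing values.
- Standardize or normalize.
Other feature handling:
- Special features such as text or date features.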
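For special features such as dates, one possible approach is a custom step built with `FunctionTransformer`. The sketch below is only an illustration: the helper `extract_date_parts` and the columns 'Application Date' and 'Amount' are hypothetical and do not come from the credit dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer


def extract_date_parts(frame):
    """Turn a single date column into numeric year, month, and weekday columns."""
    dates = pd.to_datetime(frame.iloc[:, 0])
    return pd.DataFrame(
        {"year": dates.dt.year, "month": dates.dt.month, "weekday": dates.dt.weekday},
        index=frame.index,
    )


# Hypothetical data: these columns are not part of the credit dataset.
toy_dates = pd.DataFrame({
    "Application Date": ["2024-01-15", "2024-02-29", "2023-12-31"],
    "Amount": [1000.0, 2500.0, 400.0],
})

date_preprocessor = ColumnTransformer(
    transformers=[
        ("date", FunctionTransformer(extract_date_parts), ["Application Date"]),
    ],
    remainder="passthrough",  # keep 'Amount' unchanged
)

date_features = date_preprocessor.fit_transform(toy_dates)
print(date_features)
```

The resulting branch can sit next to the ordinal, nominal, and continuous branches in the same ColumnTransformer.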

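5. Build the preprocessing transformer (ColumnTransformer)

Use a ColumnTransformer to apply different preprocessing steps to different feature columns.

6. Choose a model

Pick a model that suits the problem type (classification, regression, clustering, and so on).

7. Build the complete Pipeline

Chain the preprocessing steps and the model into a complete Pipeline.

8. Train and evaluate the model

Train the model on the training set.
Evaluate its performance on the test set.
Report evaluation metrics (accuracy, recall, F1 score, confusion matrix, and so on).

9. Tune and validate (optional)

Tune the model with methods such as cross-validation and grid search.
Verify how well the model generalizes.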
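The tuning step can be sketched as follows. The example searches over one preprocessing parameter and one model parameter at the same time; the synthetic data, the column names, and the grid values are all assumptions chosen to keep the run short.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic numeric data with some missing values (all names are made up).
rng = np.random.default_rng(0)
X_syn = pd.DataFrame(rng.normal(size=(120, 3)), columns=["f1", "f2", "f3"])
y_syn = (X_syn["f1"] + X_syn["f2"] > 0).astype(int)
X_syn.iloc[::10, 0] = np.nan  # inject missing values into f1

numeric_branch = Pipeline(steps=[
    ("imputer", SimpleImputer()),
    ("scaler", StandardScaler()),
])
search_pipeline = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("num", numeric_branch, ["f1", "f2", "f3"]),
    ])),
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=42)),
])

# Double underscores walk down the nesting:
# pipeline step -> ColumnTransformer entry -> inner pipeline step -> parameter.
param_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "classifier__max_depth": [3, None],
}
grid_search = GridSearchCV(search_pipeline, param_grid, cv=3, scoring="accuracy")
grid_search.fit(X_syn, y_syn)

print("Best parameters:", grid_search.best_params_)
print("Best cross-validation accuracy:", round(grid_search.best_score_, 3))
```

Since the imputer and scaler are refitted inside every fold, the reported cross-validation score is not inflated by preprocessing leakage.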

Template code for a general machine learning Pipeline

The following template follows the logical order described above:

# Import basic libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
import warnings
warnings.filterwarnings("ignore")
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

# Import machine learning libraries
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, StandardScaler, MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, mean_squared_error, r2_score

# 1. Load the data
data = pd.read_csv('data.csv')  # replace with the path to your data file
print("Raw data loaded, shape:", data.shape)

# 2. Separate features and label
target_column = 'target'  # replace with the name of the target column
y = data[target_column]
X = data.drop(columns=[target_column])

# 3. Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("\nDataset split completed.")
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

# 4. Define the preprocessing steps
# Identify the different feature types
categorical_features = X.select_dtypes(include=['object', 'category']).columns.tolist()
numeric_features = X.select_dtypes(exclude=['object', 'category']).columns.tolist()

# Ordinal features (if any)
ordinal_features = []  # replace with the names of the ordinal columns
ordinal_categories = []  # replace with one category order per ordinal column

# Nominal features
nominal_features = [col for col in categorical_features if col not in ordinal_features]

# Preprocessing transformers
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),  # or 'median', 'most_frequent'
    ('scaler', StandardScaler())  # or MinMaxScaler()
])

ordinal_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OrdinalEncoder(categories=ordinal_categories, handle_unknown='use_encoded_value', unknown_value=-1))
])

nominal_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('ord', ordinal_transformer, ordinal_features),
        ('nom', nominal_transformer, nominal_features)
    ],
    remainder='passthrough'  # keep unlisted columns as they are; use 'drop' to discard them
)

# 5. Choose a model
# Replace with the model of your choice, e.g. RandomForestClassifier, LogisticRegression, LinearRegression
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(random_state=42)

# 6. Build the complete Pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', model)
])

# 7. Train and evaluate the model
print("\n--- Model training and evaluation ---")
start_time = time.time()

# Train the model
pipeline.fit(X_train, y_train)

# Predict on the test set
y_pred = pipeline.predict(X_test)

end_time = time.time()
print(f"Training and prediction time: {end_time - start_time:.4f} seconds")

# Report evaluation metrics
print("\nClassification report on the test set:")
print(classification_report(y_test, y_pred))
print("Confusion matrix on the test set:")
print(confusion_matrix(y_test, y_pred))

# Optional: other evaluation metrics
print("\nAccuracy on the test set:", accuracy_score(y_test, y_pred))
# MSE and R² are regression metrics; enable them only when the model is a regressor
# print("Mean squared error (MSE) on the test set:", mean_squared_error(y_test, y_pred))
# print("R² score on the test set:", r2_score(y_test, y_pred))

# 8. Tune and validate (optional)
# Use GridSearchCV or another method for hyperparameter tuning
# param_grid = {
#     'classifier__n_estimators': [100, 200],
#     'classifier__max_depth': [None, 10, 20]
# }
# grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
# grid_search.fit(X_train, y_train)
# print("\nBest parameters:", grid_search.best_params_)
# print("Cross-validation score of the best model:", grid_search.best_score_)

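Notes on the general Pipeline

Data loading and preprocessing:
- Adjust the loading step to the actual file format (pd.read_csv, pd.read_excel, and so on).
- Adjust the preprocessing steps to the data at hand (missing-value strategy, scaling method, and so on).
Feature handling:
- Choose how to treat categorical and continuous features according to your needs (whether to standardize, whether to one-hot encode, and so on).
- For special features such as text or dates, add extra transformers as needed.
Model selection:
- Choose a model that suits the problem type (classification, regression, clustering, and so on).
Tuning and validation:
- Decide whether to use GridSearchCV or another tuning method based on your needs.
- Configure cross-validation and hyperparameter search to fit the task.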
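As an illustration of swapping the model for a different problem type, here is a hedged sketch of a regression variant on synthetic data. The step name `regressor`, the data, and the stand-in preprocessing step are assumptions; this is the setting in which mean squared error and R² are the appropriate metrics.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data: a noisy linear target.
rng = np.random.default_rng(0)
X_reg = rng.normal(size=(300, 4))
y_reg = 3.0 * X_reg[:, 0] - 2.0 * X_reg[:, 1] + rng.normal(scale=0.1, size=300)

Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

# Only the final step and the metrics change; the structure stays the same.
reg_pipeline = Pipeline(steps=[
    ("preprocessor", StandardScaler()),
    ("regressor", RandomForestRegressor(n_estimators=100, random_state=42)),
])
reg_pipeline.fit(Xr_train, yr_train)
yr_pred = reg_pipeline.predict(Xr_test)

print("MSE:", round(mean_squared_error(yr_test, yr_pred), 3))
print("R^2:", round(r2_score(yr_test, yr_pred), 3))
```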