Python Training Camp, Day 23

Knowledge review:

  1. The concepts of transformers and estimators
  2. Pipeline engineering
  3. The ColumnTransformer and Pipeline classes
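
To make the first and third points concrete, here is a minimal sketch on toy data (the values are made up for illustration). A transformer such as StandardScaler learns statistics in fit() and applies them in transform(); an estimator such as LogisticRegression learns in fit() and returns labels from predict(); a Pipeline chains both behind one interface. LogisticRegression is only a stand-in here, chosen for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data, made up for illustration: one feature, two well-separated classes.
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Transformer: fit() learns the mean and std, transform() applies them.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Estimator: fit() learns the model, predict() returns labels.
clf = LogisticRegression()
clf.fit(X_scaled, y)
direct_pred = clf.predict(X_scaled)

# Pipeline: chains the same two steps behind a single fit/predict interface.
pipe = Pipeline(steps=[("scaler", StandardScaler()), ("clf", LogisticRegression())])
pipe.fit(X, y)
pipe_pred = pipe.predict(X)

print(direct_pred, pipe_pred)
```

Both routes perform the same computation, so the two prediction arrays should match; the pipeline simply removes the need to carry the scaled array around by hand.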

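Homework:

Arrange all of the steps in their logical order and see whether you can build a general-purpose pipeline that works for any machine learning task.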

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier  # example model
from sklearn.metrics import accuracy_score


# 1. Data loading
def load_data(train_path, test_path):
    train_data = pd.read_csv(train_path)
    test_data = pd.read_csv(test_path)
    return train_data, test_data


# 2. Data preprocessing
def preprocess_data(train_data, test_data):
    # Separate the features from the target variable
    X_train = train_data.drop(columns=['target'])
    y_train = train_data['target']
    # The test set may not contain the target variable
    X_test = test_data.drop(columns=['target'], errors='ignore')

    # Identify the numeric and categorical columns
    numeric_features = X_train.select_dtypes(include=['int64', 'float64']).columns
    categorical_features = X_train.select_dtypes(include=['object', 'category']).columns

    # Build the preprocessing pipelines
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),  # fill missing values
        ('scaler', StandardScaler())  # standardize
    ])
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),  # fill missing values
        ('onehot', OneHotEncoder(handle_unknown='ignore'))  # one-hot encode
    ])
    preprocessor = ColumnTransformer(transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])
    return preprocessor, X_train, y_train, X_test


# 3. Model training
def train_model(preprocessor, X_train, y_train):
    # Define the complete Pipeline: preprocessing followed by the model
    model_pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', RandomForestClassifier(random_state=42))  # example model
    ])
    # Train the model
    model_pipeline.fit(X_train, y_train)
    return model_pipeline


# 4. Model evaluation
def evaluate_model(model_pipeline, X_test, y_test=None):
    if y_test is not None:
        y_pred = model_pipeline.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        print(f"Model accuracy: {accuracy:.2f}")
    else:
        print("The test set has no target variable, so the model cannot be evaluated.")


# 5. Main function
def main():
    train_path = 'train.csv'
    test_path = 'test.csv'
    train_data, test_data = load_data(train_path, test_path)
    preprocessor, X_train, y_train, X_test = preprocess_data(train_data, test_data)
    model_pipeline = train_model(preprocessor, X_train, y_train)

    # If the test set has the target variable, the model can be evaluated
    if 'target' in test_data.columns:
        y_test = test_data['target']
        evaluate_model(model_pipeline, X_test, y_test)
    else:
        evaluate_model(model_pipeline, X_test)


if __name__ == "__main__":
    main()
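
One way to move closer to the general-purpose pipeline the homework asks for is to make the model a parameter instead of hard-coding RandomForestClassifier. The sketch below is an illustration under stated assumptions, not a definitive solution: build_pipeline is a hypothetical helper name, the toy DataFrame is made up, and the parameter grid is arbitrary. It selects columns by dtype with make_column_selector at fit time and tunes the model step with GridSearchCV.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_pipeline(estimator):
    """Hypothetical helper: put any scikit-learn estimator behind the same preprocessing."""
    numeric = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ])
    categorical = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    # Columns are chosen by dtype when fit() runs, so no column lists are needed up front.
    preprocessor = ColumnTransformer(transformers=[
        ("num", numeric, make_column_selector(dtype_include="number")),
        ("cat", categorical, make_column_selector(dtype_exclude="number")),
    ])
    return Pipeline(steps=[("preprocessor", preprocessor), ("model", estimator)])


# Toy data, made up for illustration; "target" mirrors the column name used above.
df = pd.DataFrame({
    "age": [22, 35, np.nan, 41, 29, 52, 33, 47, 26, 38, 44, 31],
    "color": ["red", "blue", "red", np.nan, "blue", "red",
              "blue", "red", "blue", "red", "blue", "red"],
    "target": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0],
})
X = df.drop(columns=["target"])
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Step parameters are addressed as "<step name>__<parameter>".
param_grid = {"model__n_estimators": [10, 20], "model__max_depth": [2, None]}
search = GridSearchCV(
    build_pipeline(RandomForestClassifier(random_state=42)),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, test_accuracy)
```

Because preprocessing lives inside the pipeline, each cross-validation fold refits the imputers and scaler on its own training portion, which avoids leaking statistics from the validation rows. Swapping in another estimator only requires changing the argument to build_pipeline and the keys in param_grid.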

@浙大疏锦行
