day10: The full machine learning workflow
@浙大疏锦行
1. Reading the data
import pandas as pd  # data handling and analysis for tabular data
import numpy as np  # numerical computing with efficient array operations
import matplotlib.pyplot as plt  # plotting

# Configure a CJK-capable font (avoids garbled Chinese labels in plots)
plt.rcParams['font.sans-serif'] = ['SimHei']  # common Windows font
plt.rcParams['axes.unicode_minus'] = False  # render minus signs correctly
data = pd.read_csv('data.csv')  # load the dataset
print("Basic dataset info:")
data.info()
print("\nFirst 5 rows:")
print(data.head())
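The info() output below shows heavy missingness in Annual Income, Months since last delinquent, and Credit Score. A per-column missing count is often easier to scan; here is a minimal sketch on a tiny invented frame standing in for data.csv:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame standing in for data.csv
df = pd.DataFrame({
    "Annual Income": [482087.0, np.nan, 751412.0],
    "Term": ["Short Term", "Long Term", "Short Term"],
})

# Count missing values per column, most-missing first
missing = df.isna().sum().sort_values(ascending=False)
print(missing)
```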
Basic dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7500 entries, 0 to 7499
Data columns (total 18 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   Id                            7500 non-null   int64
 1   Home Ownership                7500 non-null   object
 2   Annual Income                 5943 non-null   float64
 3   Years in current job          7129 non-null   object
 4   Tax Liens                     7500 non-null   float64
 5   Number of Open Accounts       7500 non-null   float64
 6   Years of Credit History       7500 non-null   float64
 7   Maximum Open Credit           7500 non-null   float64
 8   Number of Credit Problems     7500 non-null   float64
 9   Months since last delinquent  3419 non-null   float64
 10  Bankruptcies                  7486 non-null   float64
 11  Purpose                       7500 non-null   object
 12  Term                          7500 non-null   object
 13  Current Loan Amount           7500 non-null   float64
 14  Current Credit Balance        7500 non-null   float64
 15  Monthly Debt                  7500 non-null   float64
 16  Credit Score                  5943 non-null   float64
 17  Credit Default                7500 non-null   int64
dtypes: float64(12), int64(2), object(4)
memory usage: 1.0+ MB

First 5 rows:
   Id Home Ownership  Annual Income Years in current job  Tax Liens  \
0   0       Own Home       482087.0                  NaN        0.0
1   1       Own Home      1025487.0            10+ years        0.0
2   2  Home Mortgage       751412.0              8 years        0.0
3   3       Own Home       805068.0              6 years        0.0
4   4           Rent       776264.0              8 years        0.0

   Number of Open Accounts  Years of Credit History  Maximum Open Credit  \
0                     11.0                     26.3             685960.0
1                     15.0                     15.3            1181730.0
2                     11.0                     35.0            1182434.0
3                      8.0                     22.5             147400.0
4                     13.0                     13.6             385836.0

   Number of Credit Problems  Months since last delinquent  Bankruptcies  \
0                        1.0                           NaN           1.0
1                        0.0                           NaN           0.0
2                        0.0                           NaN           0.0
3                        1.0                           NaN           1.0
4                        1.0                           NaN           0.0

              Purpose        Term  Current Loan Amount  \
0  debt consolidation  Short Term           99999999.0
1  debt consolidation   Long Term             264968.0
2  debt consolidation  Short Term           99999999.0
3  debt consolidation  Short Term             121396.0
4  debt consolidation  Short Term             125840.0

   Current Credit Balance  Monthly Debt  Credit Score  Credit Default
0                 47386.0        7914.0         749.0               0
1                394972.0       18373.0         737.0               1
2                308389.0       13651.0         742.0               0
3                 95855.0       11338.0         694.0               0
4                 93309.0        7180.0         719.0               0
2. Data cleaning
Cleaning object (string) columns
Cleaning numeric columns
Handling object columns
# First select the string (object) columns
discrete_features = data.select_dtypes(include=['object']).columns.tolist()
discrete_features
['Home Ownership', 'Years in current job', 'Purpose', 'Term']
# Inspect each one in turn
for feature in discrete_features:
    print(f"\nUnique values of {feature}:")
    print(data[feature].value_counts())
Unique values of Home Ownership:
Home Ownership
Home Mortgage 3637
Rent 3204
Own Home 647
Have Mortgage 12
Name: count, dtype: int64

Unique values of Years in current job:
Years in current job
10+ years 2332
2 years 705
3 years 620
< 1 year 563
5 years 516
1 year 504
4 years 469
6 years 426
7 years 396
8 years 339
9 years 259
Name: count, dtype: int64

Unique values of Purpose:
Purpose
debt consolidation 5944
other 665
home improvements 412
business loan 129
buy a car 96
medical bills 71
major purchase 40
take a trip 37
buy house 34
small business 26
wedding 15
moving 11
educational expenses 10
vacation 8
renewable energy 2
Name: count, dtype: int64

Unique values of Term:
Term
Short Term 5556
Long Term 1944
Name: count, dtype: int64
Dictionary-based mapping
# Home Ownership: label encoding
home_ownership_mapping = {'Own Home': 1, 'Rent': 2, 'Have Mortgage': 3, 'Home Mortgage': 4}
data['Home Ownership'] = data['Home Ownership'].map(home_ownership_mapping)

# Years in current job: label encoding
years_in_job_mapping = {
    '< 1 year': 1, '1 year': 2, '2 years': 3, '3 years': 4, '4 years': 5,
    '5 years': 6, '6 years': 7, '7 years': 8, '8 years': 9, '9 years': 10,
    '10+ years': 11
}
data['Years in current job'] = data['Years in current job'].map(years_in_job_mapping)

# Purpose: one-hot encoding; remember to convert the resulting bool columns to ints
data = pd.get_dummies(data, columns=['Purpose'])
data2 = pd.read_csv("data.csv")  # re-read the raw data to compare column names
list_final = []  # will hold the feature names added by one-hot encoding
for i in data.columns:
    if i not in data2.columns:
        list_final.append(i)  # these are the new one-hot feature names
for i in list_final:
    data[i] = data[i].astype(int)  # cast each one-hot column from bool to int

# Term: 0-1 mapping
term_mapping = {'Short Term': 0, 'Long Term': 1}
data['Term'] = data['Term'].map(term_mapping)
data.rename(columns={'Term': 'Long Term'}, inplace=True)  # rename the column
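The bool-to-int loop can also be avoided entirely: pd.get_dummies accepts a dtype argument. A small sketch on a toy frame (the column values are invented for illustration):

```python
import pandas as pd

# Toy frame with the same column names as the loan data
df = pd.DataFrame({
    "Home Ownership": ["Own Home", "Rent", "Home Mortgage"],
    "Term": ["Short Term", "Long Term", "Short Term"],
    "Purpose": ["other", "buy a car", "other"],
})

home_ownership_mapping = {"Own Home": 1, "Rent": 2, "Have Mortgage": 3, "Home Mortgage": 4}
df["Home Ownership"] = df["Home Ownership"].map(home_ownership_mapping)

# Map Term to 0/1 and rename in one step
df["Long Term"] = df["Term"].map({"Short Term": 0, "Long Term": 1})
df = df.drop(columns=["Term"])

# dtype=int makes the dummy columns integers directly, no astype loop needed
df = pd.get_dummies(df, columns=["Purpose"], dtype=int)
print(df)
```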
Handling numeric columns
# Missing-value imputation
# Outlier handling
continuous_features = data.select_dtypes(include=['int64', 'float64']).columns.tolist()  # numeric column names as a list

# Fill missing values in the continuous features with the mode
for feature in continuous_features:
    mode_value = data[feature].mode()[0]  # the column's mode (most frequent value)
    data[feature].fillna(mode_value, inplace=True)  # fill in place; this triggers the pandas warning discussed next
This warning comes from pandas: starting with pandas 3.0, in-place modification via inplace=True will behave differently. In this code, the problem is calling fillna with inplace=True on a single column of the DataFrame.

Cause

Accessing a column as data[feature] returns an intermediate object that may be either a view or a copy of the underlying data. Calling fillna(..., inplace=True) on that intermediate object does not reliably modify the original DataFrame, because the intermediate always behaves as a copy.

Fix

As the warning itself suggests, the simplest and recommended solution is to drop inplace=True and assign the result back to the column:

data[feature] = data[feature].fillna(mode_value)
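Applied to the imputation loop above, the assign-back pattern looks like this (a tiny invented frame is used for illustration):

```python
import numpy as np
import pandas as pd

# Tiny invented frame: one numeric column with a missing value
df = pd.DataFrame({"Credit Score": [749.0, np.nan, 742.0, 742.0]})

for feature in ["Credit Score"]:
    mode_value = df[feature].mode()[0]            # most frequent value (742.0 here)
    df[feature] = df[feature].fillna(mode_value)  # assign back; no inplace, no warning

print(df["Credit Score"].tolist())
```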
3. Visualization and analysis
4. Machine learning modeling
Data splitting
# Split into training and test sets
from sklearn.model_selection import train_test_split
X = data.drop(['Credit Default'], axis=1)  # features; axis=1 drops a column
y = data['Credit Default']  # label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # 20% held out as the test set, random seed 42
# Shapes of the resulting splits
print(f"Training set shape: {X_train.shape}, test set shape: {X_test.shape}")
Training set shape: (6000, 31), test set shape: (1500, 31)
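Credit Default is imbalanced (about 28% positives in the training set, per the LightGBM log later in this section), so a stratified split keeps the class ratio identical in both halves. train_test_split supports this via the stratify parameter; a synthetic sketch:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the same positive rate as Credit Default (~28%)
y = np.array([0] * 72 + [1] * 28)
X = np.arange(100).reshape(-1, 1)

# stratify=y preserves the positive rate in both train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"train positive rate: {y_train.mean():.2f}, test positive rate: {y_test.mean():.2f}")
```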
Model training and evaluation
# # Install xgboost
# !pip install xgboost -i https://pypi.tuna.tsinghua.edu.cn/simple/
# # Install lightgbm
# !pip install lightgbm -i https://pypi.tuna.tsinghua.edu.cn/simple/
# # Install catboost
# !pip install catboost -i https://pypi.tuna.tsinghua.edu.cn/simple
from sklearn.svm import SVC  # support vector machine classifier
from sklearn.neighbors import KNeighborsClassifier  # k-nearest neighbors classifier
from sklearn.linear_model import LogisticRegression  # logistic regression classifier
import xgboost as xgb  # XGBoost classifier
import lightgbm as lgb  # LightGBM classifier
from sklearn.ensemble import RandomForestClassifier  # random forest classifier
from catboost import CatBoostClassifier  # CatBoost classifier
from sklearn.tree import DecisionTreeClassifier  # decision tree classifier
from sklearn.naive_bayes import GaussianNB  # Gaussian naive Bayes classifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score  # scalar evaluation metrics
from sklearn.metrics import classification_report, confusion_matrix  # classification report and confusion matrix
import warnings
warnings.filterwarnings("ignore")  # silence all warnings
# KNN
knn_model = KNeighborsClassifier()
knn_model.fit(X_train, y_train)
knn_pred = knn_model.predict(X_test)

print("\nKNN classification report:")
print(classification_report(y_test, knn_pred))
print("KNN confusion matrix:")
print(confusion_matrix(y_test, knn_pred))

knn_accuracy = accuracy_score(y_test, knn_pred)
knn_precision = precision_score(y_test, knn_pred)
knn_recall = recall_score(y_test, knn_pred)
knn_f1 = f1_score(y_test, knn_pred)
print("KNN evaluation metrics:")
print(f"Accuracy: {knn_accuracy:.4f}")
print(f"Precision: {knn_precision:.4f}")
print(f"Recall: {knn_recall:.4f}")
print(f"F1 score: {knn_f1:.4f}")
KNN classification report:
              precision    recall  f1-score   support

           0       0.73      0.86      0.79      1059
           1       0.41      0.24      0.30       441

    accuracy                           0.68      1500
   macro avg       0.57      0.55      0.54      1500
weighted avg       0.64      0.68      0.65      1500

KNN confusion matrix:
[[908 151]
 [336 105]]
KNN evaluation metrics:
Accuracy: 0.6753
Precision: 0.4102
Recall: 0.2381
F1 score: 0.3013
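One caveat on the KNN result: KNN is distance-based, and the features here span wildly different scales (Annual Income in the hundreds of thousands vs. Tax Liens near zero), so unscaled distances are dominated by the large columns. A synthetic sketch of the effect, assuming a StandardScaler pipeline (this is not part of the original notebook):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; inflate one feature to mimic an income-sized column
X, y = make_classification(n_samples=600, n_features=8, random_state=42)
X[:, 0] *= 100_000

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

raw = KNeighborsClassifier().fit(X_tr, y_tr).score(X_te, y_te)
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X_tr, y_tr).score(X_te, y_te)
print(f"accuracy unscaled: {raw:.3f}, scaled: {scaled:.3f}")
```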
# Logistic regression
logreg_model = LogisticRegression(random_state=42)
logreg_model.fit(X_train, y_train)
logreg_pred = logreg_model.predict(X_test)

print("\nLogistic regression classification report:")
print(classification_report(y_test, logreg_pred))
print("Logistic regression confusion matrix:")
print(confusion_matrix(y_test, logreg_pred))

logreg_accuracy = accuracy_score(y_test, logreg_pred)
logreg_precision = precision_score(y_test, logreg_pred)
logreg_recall = recall_score(y_test, logreg_pred)
logreg_f1 = f1_score(y_test, logreg_pred)
print("Logistic regression evaluation metrics:")
print(f"Accuracy: {logreg_accuracy:.4f}")
print(f"Precision: {logreg_precision:.4f}")
print(f"Recall: {logreg_recall:.4f}")
print(f"F1 score: {logreg_f1:.4f}")
Logistic regression classification report:
              precision    recall  f1-score   support

           0       0.75      0.99      0.85      1059
           1       0.86      0.20      0.33       441

    accuracy                           0.76      1500
   macro avg       0.80      0.59      0.59      1500
weighted avg       0.78      0.76      0.70      1500

Logistic regression confusion matrix:
[[1044   15]
 [ 351   90]]
Logistic regression evaluation metrics:
Accuracy: 0.7560
Precision: 0.8571
Recall: 0.2041
F1 score: 0.3297
# Naive Bayes
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)
nb_pred = nb_model.predict(X_test)

print("\nNaive Bayes classification report:")
print(classification_report(y_test, nb_pred))
print("Naive Bayes confusion matrix:")
print(confusion_matrix(y_test, nb_pred))

nb_accuracy = accuracy_score(y_test, nb_pred)
nb_precision = precision_score(y_test, nb_pred)
nb_recall = recall_score(y_test, nb_pred)
nb_f1 = f1_score(y_test, nb_pred)
print("Naive Bayes evaluation metrics:")
print(f"Accuracy: {nb_accuracy:.4f}")
print(f"Precision: {nb_precision:.4f}")
print(f"Recall: {nb_recall:.4f}")
print(f"F1 score: {nb_f1:.4f}")
Naive Bayes classification report:
              precision    recall  f1-score   support

           0       0.98      0.19      0.32      1059
           1       0.34      0.99      0.50       441

    accuracy                           0.43      1500
   macro avg       0.66      0.59      0.41      1500
weighted avg       0.79      0.43      0.38      1500

Naive Bayes confusion matrix:
[[204 855]
 [  5 436]]
Naive Bayes evaluation metrics:
Accuracy: 0.4267
Precision: 0.3377
Recall: 0.9887
F1 score: 0.5035
# Decision tree
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
dt_pred = dt_model.predict(X_test)

print("\nDecision tree classification report:")
print(classification_report(y_test, dt_pred))
print("Decision tree confusion matrix:")
print(confusion_matrix(y_test, dt_pred))

dt_accuracy = accuracy_score(y_test, dt_pred)
dt_precision = precision_score(y_test, dt_pred)
dt_recall = recall_score(y_test, dt_pred)
dt_f1 = f1_score(y_test, dt_pred)
print("Decision tree evaluation metrics:")
print(f"Accuracy: {dt_accuracy:.4f}")
print(f"Precision: {dt_precision:.4f}")
print(f"Recall: {dt_recall:.4f}")
print(f"F1 score: {dt_f1:.4f}")
Decision tree classification report:
              precision    recall  f1-score   support

           0       0.79      0.75      0.77      1059
           1       0.46      0.51      0.48       441

    accuracy                           0.68      1500
   macro avg       0.62      0.63      0.62      1500
weighted avg       0.69      0.68      0.68      1500

Decision tree confusion matrix:
[[791 268]
 [216 225]]
Decision tree evaluation metrics:
Accuracy: 0.6773
Precision: 0.4564
Recall: 0.5102
F1 score: 0.4818
# Random forest
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)

print("\nRandom forest classification report:")
print(classification_report(y_test, rf_pred))
print("Random forest confusion matrix:")
print(confusion_matrix(y_test, rf_pred))

rf_accuracy = accuracy_score(y_test, rf_pred)
rf_precision = precision_score(y_test, rf_pred)
rf_recall = recall_score(y_test, rf_pred)
rf_f1 = f1_score(y_test, rf_pred)
print("Random forest evaluation metrics:")
print(f"Accuracy: {rf_accuracy:.4f}")
print(f"Precision: {rf_precision:.4f}")
print(f"Recall: {rf_recall:.4f}")
print(f"F1 score: {rf_f1:.4f}")
Random forest classification report:
              precision    recall  f1-score   support

           0       0.77      0.97      0.86      1059
           1       0.79      0.30      0.43       441

    accuracy                           0.77      1500
   macro avg       0.78      0.63      0.64      1500
weighted avg       0.77      0.77      0.73      1500

Random forest confusion matrix:
[[1023   36]
 [ 309  132]]
Random forest evaluation metrics:
Accuracy: 0.7700
Precision: 0.7857
Recall: 0.2993
F1 score: 0.4335
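A pattern across these models is decent precision but low recall on class 1, a typical symptom of class imbalance: the classifiers default to predicting the majority class. One lever worth trying is class_weight='balanced', which reweights samples inversely to class frequency. A synthetic sketch (not run on the loan data; the improvement is typical, not guaranteed):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Imbalanced toy problem, ~25% positives
X, y = make_classification(n_samples=2000, weights=[0.75], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

plain = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
weighted = RandomForestClassifier(class_weight="balanced", random_state=42).fit(X_tr, y_tr)

plain_recall = recall_score(y_te, plain.predict(X_te))
balanced_recall = recall_score(y_te, weighted.predict(X_te))
print(f"minority recall without class_weight: {plain_recall:.3f}")
print(f"minority recall with class_weight='balanced': {balanced_recall:.3f}")
```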
# XGBoost
xgb_model = xgb.XGBClassifier(random_state=42)
xgb_model.fit(X_train, y_train)
xgb_pred = xgb_model.predict(X_test)

print("\nXGBoost classification report:")
print(classification_report(y_test, xgb_pred))
print("XGBoost confusion matrix:")
print(confusion_matrix(y_test, xgb_pred))

xgb_accuracy = accuracy_score(y_test, xgb_pred)
xgb_precision = precision_score(y_test, xgb_pred)
xgb_recall = recall_score(y_test, xgb_pred)
xgb_f1 = f1_score(y_test, xgb_pred)
print("XGBoost evaluation metrics:")
print(f"Accuracy: {xgb_accuracy:.4f}")
print(f"Precision: {xgb_precision:.4f}")
print(f"Recall: {xgb_recall:.4f}")
print(f"F1 score: {xgb_f1:.4f}")
XGBoost classification report:
              precision    recall  f1-score   support

           0       0.77      0.91      0.84      1059
           1       0.62      0.37      0.46       441

    accuracy                           0.75      1500
   macro avg       0.70      0.64      0.65      1500
weighted avg       0.73      0.75      0.72      1500

XGBoost confusion matrix:
[[960  99]
 [280 161]]
XGBoost evaluation metrics:
Accuracy: 0.7473
Precision: 0.6192
Recall: 0.3651
F1 score: 0.4593
# LightGBM
lgb_model = lgb.LGBMClassifier(random_state=42)
lgb_model.fit(X_train, y_train)
lgb_pred = lgb_model.predict(X_test)

print("\nLightGBM classification report:")
print(classification_report(y_test, lgb_pred))
print("LightGBM confusion matrix:")
print(confusion_matrix(y_test, lgb_pred))

lgb_accuracy = accuracy_score(y_test, lgb_pred)
lgb_precision = precision_score(y_test, lgb_pred)
lgb_recall = recall_score(y_test, lgb_pred)
lgb_f1 = f1_score(y_test, lgb_pred)
print("LightGBM evaluation metrics:")
print(f"Accuracy: {lgb_accuracy:.4f}")
print(f"Precision: {lgb_precision:.4f}")
print(f"Recall: {lgb_recall:.4f}")
print(f"F1 score: {lgb_f1:.4f}")
[LightGBM] [Warning] Found whitespace in feature_names, replace with underlines
[LightGBM] [Info] Number of positive: 1672, number of negative: 4328
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000467 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2160
[LightGBM] [Info] Number of data points in the train set: 6000, number of used features: 26
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.278667 -> initscore=-0.951085
[LightGBM] [Info] Start training from score -0.951085

LightGBM classification report:
              precision    recall  f1-score   support

           0       0.78      0.94      0.85      1059
           1       0.70      0.36      0.47       441

    accuracy                           0.77      1500
   macro avg       0.74      0.65      0.66      1500
weighted avg       0.75      0.77      0.74      1500

LightGBM confusion matrix:
[[992  67]
 [284 157]]
LightGBM evaluation metrics:
Accuracy: 0.7660
Precision: 0.7009
Recall: 0.3560
F1 score: 0.4722
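To compare the seven models at a glance, the metrics printed above can be collected into one table (the values below are copied from the outputs in this section):

```python
import pandas as pd

# Metrics copied from the per-model outputs above
results = pd.DataFrame({
    "model": ["KNN", "Logistic Regression", "Naive Bayes", "Decision Tree",
              "Random Forest", "XGBoost", "LightGBM"],
    "accuracy":  [0.6753, 0.7560, 0.4267, 0.6773, 0.7700, 0.7473, 0.7660],
    "precision": [0.4102, 0.8571, 0.3377, 0.4564, 0.7857, 0.6192, 0.7009],
    "recall":    [0.2381, 0.2041, 0.9887, 0.5102, 0.2993, 0.3651, 0.3560],
    "f1":        [0.3013, 0.3297, 0.5035, 0.4818, 0.4335, 0.4593, 0.4722],
}).set_index("model")

print(results.sort_values("f1", ascending=False))
```

Note that Naive Bayes tops the F1 ranking only because its recall is extreme at the cost of accuracy; Random Forest and LightGBM offer the best accuracy/precision trade-off here.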