使用 Bank Churn 数据集进行二元分类
一、前言
分类任务:预测客户是继续使用其帐户还是关闭帐户(例如,流失)
项目地址:https://www.kaggle.com/competitions/playground-series-s4e1
二、具体步骤
(一)数据导入与预览
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
import seaborn as sns
from sklearn.model_selection import StratifiedKFold
from catboost import CatBoostClassifier, Pool
from sklearn.metrics import roc_auc_scoretrain = pd.read_csv('train.csv', index_col='id')
test = pd.read_csv('test.csv', index_col='id')
train.head(5)
CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||
0 | 15674932 | Okwudilichukwu | 668 | France | Male | 33.0 | 3 | 0.00 | 2 | 1.0 | 0.0 | 181449.97 | 0 |
1 | 15749177 | Okwudiliolisa | 627 | France | Male | 33.0 | 1 | 0.00 | 2 | 1.0 | 1.0 | 49503.50 | 0 |
2 | 15694510 | Hsueh | 678 | France | Male | 40.0 | 10 | 0.00 | 2 | 1.0 | 0.0 | 184866.69 | 0 |
3 | 15741417 | Kao | 581 | France | Male | 34.0 | 2 | 148882.54 | 1 | 1.0 | 1.0 | 84560.88 | 0 |
4 | 15766172 | Chiemenam | 716 | Spain | Male | 33.0 | 5 | 0.00 | 2 | 1.0 | 1.0 | 15068.83 | 0 |
train.info()
<class 'pandas.core.frame.DataFrame'>
Index: 165034 entries, 0 to 165033
Data columns (total 13 columns):# Column Non-Null Count Dtype
--- ------ -------------- ----- 0 CustomerId 165034 non-null int64 1 Surname 165034 non-null object 2 CreditScore 165034 non-null int64 3 Geography 165034 non-null object 4 Gender 165034 non-null object 5 Age 165034 non-null float646 Tenure 165034 non-null int64 7 Balance 165034 non-null float648 NumOfProducts 165034 non-null int64 9 HasCrCard 165034 non-null float6410 IsActiveMember 165034 non-null float6411 EstimatedSalary 165034 non-null float6412 Exited 165034 non-null int64
dtypes: float64(5), int64(5), object(3)
memory usage: 17.6+ MB
train.drop('CustomerId', axis=1).nunique()
Surname 2797
CreditScore 457
Geography 3
Gender 2
Age 71
Tenure 11
Balance 30075
NumOfProducts 4
HasCrCard 2
IsActiveMember 2
EstimatedSalary 55298
Exited 2
dtype: int64
绘制分布图 ———— 从总体上熟悉数据,也可以直观发现数据的异常点
# 绘制分布图 ———— 从总体上熟悉数据
plt.figure(figsize=(18, 12))
for i, column in enumerate(train.drop(columns=['CustomerId', 'Surname']).columns, 1):plt.subplot(len(train.columns)//3 + 1, 3, i)sns.histplot(train[column])plt.title(column)plt.tight_layout()
plt.show()
(二)特征工程
# 特征工程
def create_feature(df): df['age_bin'] = pd.cut(df['Age'], bins=[0, 20, 40, 60, 80, 110], labels=['<20', '20-40', '40-60', '60-80', '>=80'])df['is_low_score'] = df['CreditScore'].apply(lambda x: 1 if x < 500 else 0)df['is_senior'] = df['Age'].apply(lambda x: 1 if x >= 60 else 0)return dftrain = create_feature(train)
test = create_feature(test)# 分离特征与目标变量
X = train.drop(columns=['Exited'])
y = train['Exited']
cat_features = X.select_dtypes(['object', 'category']).columns.tolist()
(三)模型构建与评估
方法:GPU加速计算 + 5折交叉验证
超参数优化:未进行详细优化
# 建模与模型评估
n_fold = 5
folds = StratifiedKFold(n_splits=n_fold, random_state=42, shuffle=True) # 注意类别不平衡
auc_valids = [] # 储存每个 fold 的 AUC
y_pred = np.empty((n_fold, len(test))) # 存储每个fold的test预测值for fold, (train_index, valid_index) in enumerate(folds.split(X, y)): # 训练集X_train, y_train = X.iloc[train_index], y.iloc[train_index]# 测试集X_vilid, y_vilid = X.iloc[valid_index], y.iloc[valid_index]# 模型训练train_pool = Pool(X_train, y_train, cat_features=cat_features)valid_pool = Pool(X_vilid, y_vilid, cat_features=cat_features)clf = CatBoostClassifier(eval_metric='AUC', # 评估指标task_type='GPU', learning_rate=0.02, iterations=5000)clf.fit(train_pool, eval_set=valid_pool, verbose=500)# 验证集测试y_pred_valid = clf.predict_proba(X_vilid)[:,1] # 完整概率示例:[[0.2 0.8],而[:,1]表示正类的概率,即[0.8]auc_valid = roc_auc_score(y_vilid, y_pred_valid)print(f'Fold {fold} AUC: {auc_valid}')auc_valids.append(auc_valid)# 用不同fold训练的模型预测测试集y_pred_test = clf.predict_proba(test)[:, 1]y_pred[fold, :] = y_pred_testprint('-'*60)print(f'Mean AUC: {np.mean(auc_valids): .4f}')
Default metric period is 5 because AUC is/are not implemented for GPU
0: test: 0.8739918 best: 0.8739918 (0) total: 70.7ms remaining: 5m 53s
500: test: 0.8944888 best: 0.8944888 (500) total: 10s remaining: 1m 29s
1000: test: 0.8952877 best: 0.8952985 (980) total: 20.5s remaining: 1m 21s
1500: test: 0.8954829 best: 0.8954870 (1490) total: 31.2s remaining: 1m 12s
2000: test: 0.8955086 best: 0.8955306 (1625) total: 42s remaining: 1m 2s
2500: test: 0.8954988 best: 0.8955306 (1625) total: 52.9s remaining: 52.9s
3000: test: 0.8954397 best: 0.8955306 (1625) total: 1m 3s remaining: 42.5s
3500: test: 0.8953933 best: 0.8955306 (1625) total: 1m 14s remaining: 32s
4000: test: 0.8952920 best: 0.8955306 (1625) total: 1m 25s remaining: 21.4s
4500: test: 0.8952608 best: 0.8955306 (1625) total: 1m 37s remaining: 10.8s
4999: test: 0.8951928 best: 0.8955306 (1625) total: 1m 48s remaining: 0us
bestTest = 0.8955305815
bestIteration = 1625
Shrink model to first 1626 iterations.
Fold 0 AUC: 0.8955305898663352
------------------------------------------------------------
Default metric period is 5 because AUC is/are not implemented for GPU
0: test: 0.8743308 best: 0.8743308 (0) total: 21.1ms remaining: 1m 45s
500: test: 0.8945833 best: 0.8945833 (500) total: 9.9s remaining: 1m 28s
1000: test: 0.8954305 best: 0.8954305 (1000) total: 20.3s remaining: 1m 21s
1500: test: 0.8957238 best: 0.8957241 (1485) total: 31.1s remaining: 1m 12s
2000: test: 0.8957269 best: 0.8957676 (1760) total: 41.8s remaining: 1m 2s
2500: test: 0.8956528 best: 0.8957676 (1760) total: 52.7s remaining: 52.7s
3000: test: 0.8956784 best: 0.8957676 (1760) total: 1m 3s remaining: 42.5s
3500: test: 0.8956079 best: 0.8957676 (1760) total: 1m 14s remaining: 32s
4000: test: 0.8955335 best: 0.8957676 (1760) total: 1m 25s remaining: 21.4s
4500: test: 0.8954549 best: 0.8957676 (1760) total: 1m 37s remaining: 10.8s
4999: test: 0.8953199 best: 0.8957676 (1760) total: 1m 48s remaining: 0us
bestTest = 0.8957676291
bestIteration = 1760
Shrink model to first 1761 iterations.
Fold 1 AUC: 0.8957676119974756
------------------------------------------------------------
Default metric period is 5 because AUC is/are not implemented for GPU
0: test: 0.8746601 best: 0.8746601 (0) total: 24.1ms remaining: 2m
500: test: 0.8953991 best: 0.8953991 (500) total: 9.89s remaining: 1m 28s
1000: test: 0.8964818 best: 0.8964818 (1000) total: 20.5s remaining: 1m 21s
1500: test: 0.8968892 best: 0.8968893 (1495) total: 31.2s remaining: 1m 12s
2000: test: 0.8971167 best: 0.8971181 (1965) total: 42.1s remaining: 1m 3s
2500: test: 0.8972195 best: 0.8972385 (2350) total: 53.2s remaining: 53.1s
3000: test: 0.8972575 best: 0.8972622 (2985) total: 1m 4s remaining: 42.8s
3500: test: 0.8972616 best: 0.8972861 (3280) total: 1m 15s remaining: 32.3s
4000: test: 0.8972484 best: 0.8972864 (3850) total: 1m 26s remaining: 21.6s
4500: test: 0.8972290 best: 0.8972864 (3850) total: 1m 37s remaining: 10.8s
4999: test: 0.8972137 best: 0.8972864 (3850) total: 1m 49s remaining: 0us
bestTest = 0.8972864151
bestIteration = 3850
Shrink model to first 3851 iterations.
Fold 2 AUC: 0.8972863638690577
------------------------------------------------------------
Default metric period is 5 because AUC is/are not implemented for GPU
0: test: 0.8753492 best: 0.8753492 (0) total: 22.1ms remaining: 1m 50s
500: test: 0.8957655 best: 0.8957655 (500) total: 9.91s remaining: 1m 28s
1000: test: 0.8964766 best: 0.8964805 (990) total: 20.5s remaining: 1m 22s
1500: test: 0.8967288 best: 0.8967288 (1500) total: 31.3s remaining: 1m 12s
2000: test: 0.8967985 best: 0.8968331 (1860) total: 42s remaining: 1m 2s
2500: test: 0.8968449 best: 0.8968576 (2205) total: 52.9s remaining: 52.8s
3000: test: 0.8968754 best: 0.8968790 (2935) total: 1m 3s remaining: 42.5s
3500: test: 0.8968456 best: 0.8968830 (3010) total: 1m 14s remaining: 32s
4000: test: 0.8968126 best: 0.8968830 (3010) total: 1m 25s remaining: 21.5s
4500: test: 0.8967692 best: 0.8968830 (3010) total: 1m 37s remaining: 10.8s
4999: test: 0.8966937 best: 0.8968830 (3010) total: 1m 48s remaining: 0us
bestTest = 0.8968829513
bestIteration = 3010
Shrink model to first 3011 iterations.
Fold 3 AUC: 0.8968829799706398
------------------------------------------------------------
Default metric period is 5 because AUC is/are not implemented for GPU
0: test: 0.8713666 best: 0.8713666 (0) total: 32.9ms remaining: 2m 44s
500: test: 0.8930401 best: 0.8930401 (500) total: 9.91s remaining: 1m 28s
1000: test: 0.8936611 best: 0.8936611 (1000) total: 20.8s remaining: 1m 23s
1500: test: 0.8937590 best: 0.8937859 (1345) total: 33.2s remaining: 1m 17s
2000: test: 0.8937768 best: 0.8938069 (1880) total: 46.2s remaining: 1m 9s
2500: test: 0.8936644 best: 0.8938069 (1880) total: 58.7s remaining: 58.7s
3000: test: 0.8936239 best: 0.8938069 (1880) total: 1m 10s remaining: 47.1s
3500: test: 0.8935375 best: 0.8938069 (1880) total: 1m 21s remaining: 35.1s
4000: test: 0.8934290 best: 0.8938069 (1880) total: 1m 33s remaining: 23.3s
4500: test: 0.8933018 best: 0.8938069 (1880) total: 1m 44s remaining: 11.6s
4999: test: 0.8932022 best: 0.8938069 (1880) total: 1m 56s remaining: 0us
bestTest = 0.8938069344
bestIteration = 1880
Shrink model to first 1881 iterations.
Fold 4 AUC: 0.8938069232633625
------------------------------------------------------------
Mean AUC: 0.8959
(四)特征重要性可视化
import shapshap.initjs()
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(train_pool)# 特征重要性条形图
shap.summary_plot(shap_values, X_train, plot_type="bar")
说明:从特征重要性结果看,在前面特征工程构建的两个新特征并不重要,可以将其删除,避免使得模型变得复杂或引入噪声。特征工程构造的新变量,是需要经过验证最好,当然了,一些模型会自动进行特征选择,也可能不需要。
(五)结果保存
# 保存结果
y_pred_mean = y_pred.mean(axis=0)
submission = pd.DataFrame({'id': test.index.tolist(), 'Exited': y_pred_mean
})
submission.to_csv('submission.csv', index=False)
kaggle上提交测试集结果: