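How to Implement Credit Risk Scoring with Machine Learning
This example uses XGBoost to classify the credit risk of loan applicants and uses SHAP values to explain the predictions: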
python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier
import shap
# 1. Load the sample data (replace with a real dataset)
data = pd.read_csv("loan_applications.csv")
# Expected columns: age, income, credit_score, employment_status, loan_amount, default
# 2. Preprocessing
# Fill missing income values with the median
data.fillna({'income': data['income'].median()}, inplace=True)
# Feature groups
categorical_features = ['employment_status']  # categorical features
numerical_features = ['age', 'income', 'credit_score', 'loan_amount']
# Build the preprocessing transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ],
    sparse_threshold=0  # always return a dense array, which is convenient for SHAP
)
# Split into training and test sets
X = data.drop('default', axis=1)
y = data['default']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 3. Train the XGBoost model
model = XGBClassifier(
    objective='binary:logistic',
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)
# Assemble the full pipeline (preprocessing + classifier)
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', model)
])
pipeline.fit(X_train, y_train)
# 4. Evaluate the model
from sklearn.metrics import classification_report
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
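# 5. SHAP explanation (visualize what drives the credit risk predictions)
# A Pipeline that ends in a classifier has no transform(), so apply the fitted preprocessor directly
X_test_transformed = pipeline.named_steps['preprocessor'].transform(X_test)
# Column names after scaling and one-hot encoding (X.columns would not match the encoded matrix)
feature_names = pipeline.named_steps['preprocessor'].get_feature_names_out()
explainer = shap.TreeExplainer(pipeline.named_steps['classifier'])
shap_values = explainer.shap_values(X_test_transformed)
# Global summary of feature contributions across the test set
shap.summary_plot(shap_values, X_test_transformed, feature_names=feature_names)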
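Code walkthrough
- Data processing:
  - Fills missing values (missing income is imputed with the median)
  - Standardizes numerical features (StandardScaler)
  - One-hot encodes categorical features (OneHotEncoder)
- Model training:
  - Uses the XGBoost classifier with the binary logistic objective (binary:logistic)
  - The example does not configure early stopping; adding early_stopping_rounds together with a validation set is a possible optimization
- Model evaluation:
  - Prints a classification report (precision, recall, F1-score)
  - Recall on the default class matters most, because a bank is chiefly concerned with not missing high-risk applicants
- Risk explanation:
  - Uses SHAP values to analyze feature importance
  - Draws a summary plot with shap.summary_plot (requires the shap package)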
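The point about recall can be made concrete by varying the decision threshold. The following is a minimal sketch in plain Python: the labels and probabilities are made-up illustrative values, and `precision_recall_at` is a hypothetical helper, not a scikit-learn or XGBoost function.

```python
def precision_recall_at(y_true, y_prob, threshold):
    """Return (precision, recall) for the positive class at a given threshold."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Illustrative values only: 1 = default, 0 = repaid
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
y_prob = [0.9, 0.2, 0.4, 0.6, 0.1, 0.35, 0.3, 0.05]

for threshold in (0.5, 0.3):
    precision, recall = precision_recall_at(y_true, y_prob, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

On this toy data, lowering the threshold from 0.5 to 0.3 catches every defaulter at the cost of more false alarms. With the pipeline above, the probabilities would come from `pipeline.predict_proba(X_test)[:, 1]`, and the threshold should be chosen on a validation set rather than the test set.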
Extensions and optimizations
1. Handling imbalanced data
python
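# XGBClassifier has no class_weight parameter; weight the positive (default) class with scale_pos_weight
neg, pos = (y_train == 0).sum(), (y_train == 1).sum()
model = XGBClassifier(objective='binary:logistic', scale_pos_weight=neg / pos, random_state=42)
# Alternatively, oversample the minority class with SMOTE (apply it to the training data only)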
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(preprocessor.transform(X_train), y_train)
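To make the SMOTE call less opaque, the sketch below illustrates the interpolation idea behind it: each synthetic sample lies on the line segment between a minority sample and one of its nearest minority neighbours. This is a simplified illustration under that assumption, not imblearn's implementation; `smote_like_samples` and the sample points are hypothetical.

```python
import math
import random

def smote_like_samples(minority, n_new, k=2, seed=42):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen point (the point itself excluded)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Illustrative, already-scaled minority-class points
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (3.0, 3.0)]
new_points = smote_like_samples(minority, n_new=5)
print(len(new_points))
```

Because the synthetic points are built from existing ones, oversampling should happen after the train/test split so that no test information reaches the training set.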
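2. Time-series feature engineering (suited to credit card risk prediction)
python
# Add rolling statistics (assumes rows are sorted chronologically, one row per period)
data['rolling_3m_avg_income'] = data['income'].rolling(window=3, min_periods=1).mean()
# Shift by one period so the current row's own default label does not leak into its feature
data['rolling_6m_default_rate'] = data['default'].shift(1).rolling(window=6, min_periods=1).mean()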
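When the data holds several customers, the rolling window has to be computed per customer and must exclude the row being predicted. The following library-free sketch shows that logic; the `customer_id` and `month` fields are assumptions that do not appear in the sample dataset, and `add_trailing_default_rate` is a hypothetical helper.

```python
from collections import defaultdict, deque

def add_trailing_default_rate(records, window=6):
    """Attach each customer's default rate over the previous `window` periods.

    The current row is excluded from its own feature, so the label being
    predicted never leaks into the input.
    """
    history = defaultdict(lambda: deque(maxlen=window))
    enriched = []
    for rec in sorted(records, key=lambda r: (r["customer_id"], r["month"])):
        past = history[rec["customer_id"]]
        rate = sum(past) / len(past) if past else None  # None: no history yet
        enriched.append({**rec, "trailing_default_rate": rate})
        past.append(rec["default"])
    return enriched

# Illustrative records only
records = [
    {"customer_id": "A", "month": 1, "default": 0},
    {"customer_id": "B", "month": 1, "default": 0},
    {"customer_id": "A", "month": 2, "default": 1},
    {"customer_id": "A", "month": 3, "default": 0},
    {"customer_id": "B", "month": 2, "default": 0},
]
for row in add_trailing_default_rate(records, window=2):
    print(row["customer_id"], row["month"], row["trailing_default_rate"])
```

In pandas, the same effect is usually obtained by sorting, grouping by the customer key, and shifting before applying the rolling window.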
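3. Deploying a real-time risk-scoring API
python
from flask import Flask, request
import joblib
import pandas as pd
app = Flask(__name__)
# Assumes the whole fitted Pipeline (preprocessor + classifier) was saved with joblib.dump
pipeline = joblib.load('credit_risk_model.pkl')
@app.route('/predict', methods=['POST'])
def predict():
    # Wrap the JSON object in a one-row DataFrame so the pipeline sees named columns
    features = pd.DataFrame([request.json])
    # Convert to a plain float so the value is JSON-serializable
    probability = float(pipeline.predict_proba(features)[0, 1])
    return {
        "default_probability": round(probability * 100, 2),
        "risk_level": "high" if probability > 0.5 else "low"
    }
if __name__ == "__main__":
    # debug=True is for local testing only; disable it in production
    app.run(debug=True)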
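A scoring endpoint should reject malformed requests before they reach the model. The sketch below checks a payload against the columns listed for the sample dataset; the `REQUIRED_FIELDS` mapping and `validate_payload` are hypothetical names, and the accepted types are assumptions about how clients send the data.

```python
# Expected request fields and the Python types accepted for each (assumed)
REQUIRED_FIELDS = {
    "age": (int, float),
    "income": (int, float),
    "credit_score": (int, float),
    "employment_status": (str,),
    "loan_amount": (int, float),
}

def validate_payload(payload):
    """Return a list of problems; an empty list means the payload is usable."""
    if not isinstance(payload, dict):
        return ["request body must be a JSON object"]
    errors = []
    for field, allowed_types in REQUIRED_FIELDS.items():
        if field not in payload:
            errors.append(f"missing field: {field}")
        # bool is a subclass of int in Python, so reject it explicitly
        elif isinstance(payload[field], bool) or not isinstance(payload[field], allowed_types):
            errors.append(f"wrong type for field: {field}")
    return errors

good = {"age": 35, "income": 52000.0, "credit_score": 680,
        "employment_status": "employed", "loan_amount": 15000}
print(validate_payload(good))
print(validate_payload({"age": "35"}))
```

Inside the Flask handler, one option is to call `validate_payload(request.get_json(silent=True))` and return the error list with HTTP status 400 when it is not empty.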
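Practical deployment considerations
- Data security:
  - Use encrypted transport (TLS 1.3)
  - Mask sensitive fields (such as income and credit score) before transmission
- Performance optimization:
  - Consider LightGBM as an alternative to XGBoost if training speed is a bottleneck; it is often faster to train, but benchmark on your own data
  - Serve the model on GPU hardware to accelerate inference (for example, NVIDIA Tesla V100)
- Compliance requirements:
  - Ensure the model complies with 《商业银行信用风险内部评级指引》 (guidelines on internal credit risk ratings for commercial banks)
  - Record the model's decision logic to meet regulatory explainability requirements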
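Recording decision logic can start with a structured audit record per prediction. The following sketch assumes the record stores a keyed hash of the applicant ID instead of the raw ID, plus the top contributing factors (for example, SHAP values); the field names, the `build_audit_record` helper, and the 0.5 cutoff are illustrative assumptions, not a regulatory format.

```python
import hashlib
import hmac
import json
from datetime import datetime, timezone

def build_audit_record(applicant_id, probability, top_factors, model_version, secret_key):
    """Build a JSON-serializable audit record without storing the raw applicant ID."""
    pseudonym = hmac.new(secret_key, applicant_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "applicant": pseudonym,
        "model_version": model_version,
        "default_probability": round(probability, 4),
        "risk_level": "high" if probability > 0.5 else "low",
        "top_factors": top_factors,  # e.g. feature name -> contribution
    }

# Illustrative values only; a real key would come from a secrets manager
record = build_audit_record(
    applicant_id="applicant-001",
    probability=0.7312,
    top_factors={"credit_score": 0.21, "loan_amount": 0.09},
    model_version="v1",
    secret_key=b"replace-with-a-managed-secret",
)
print(json.dumps(record, indent=2))
```

Keeping the model version next to each decision makes it possible to reproduce a past prediction during an audit.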
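This example walks through a complete credit risk assessment workflow, from data preprocessing to model deployment. A real banking system must adapt it to its own business rules and regulatory requirements. Before deployment, it is advisable to complete:
- Model validation (backtesting on historical data)
- Stress testing (simulating extreme market conditions)
- Compliance audit (checking for algorithmic discrimination risk)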
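One simple starting point for the discrimination check is to compare approval rates across groups. The sketch below is only a first screen under stated assumptions: the group labels and decisions are made up, `approval_rate_ratio` is a hypothetical helper, and which attributes count as protected and what ratio is acceptable depend on the applicable regulation.

```python
from collections import defaultdict

def approval_rate_ratio(decisions):
    """decisions: iterable of (group, approved) pairs.

    Returns (approval rate per group, lowest rate divided by highest rate).
    """
    totals = defaultdict(int)
    approved = defaultdict(int)
    for group, ok in decisions:
        totals[group] += 1
        approved[group] += int(ok)
    rates = {g: approved[g] / totals[g] for g in totals}
    highest = max(rates.values())
    ratio = min(rates.values()) / highest if highest else 1.0
    return rates, ratio

# Illustrative decisions: group A has 8 of 10 approved, group B has 6 of 10
decisions = (
    [("A", True)] * 8 + [("A", False)] * 2
    + [("B", True)] * 6 + [("B", False)] * 4
)
rates, ratio = approval_rate_ratio(decisions)
print(rates, f"ratio={ratio:.2f}")
```

A ratio well below 1 does not prove discrimination, but it flags a gap that the audit should explain, for instance by checking whether legitimate risk factors account for it.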