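✅ Today's Goals
- Understand the basic principles of decision trees
- Learn the common splitting criteria, such as information entropy and Gini impurity
- Build models with DecisionTreeClassifier and RandomForestClassifier
- Learn to visualize a decision tree and inspect feature importances
- Compare the generalization ability of a single tree with that of an ensemble model (random forest)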
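📘 1. Introduction to Decision Tree Models
| Property | Description |
|---|---|
| Essence | Splits the decision path on feature conditions, forming a tree of tests |
| Strengths | Clear logic, highly interpretable, no feature scaling required |
| Weaknesses | Prone to overfitting, sensitive to noise |
| Applications | Credit scoring, rule modeling, visualizing classification logic |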
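The splitting criteria named in the goals can be computed by hand on a handful of labels. The sketch below is a minimal illustration using only the standard library; the toy labels and the helper names (`entropy`, `gini`, `weighted_impurity`) are assumptions made for this example, not part of scikit-learn. A tree prefers the split that lowers the weighted impurity of the children the most.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def weighted_impurity(left, right, measure):
    """Impurity after a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * measure(left) + len(right) / n * measure(right)


# Hypothetical toy labels: 1 = pass, 0 = fail
parent = [0, 0, 0, 0, 1, 1, 1, 1]
mixed_split = ([0, 0, 0, 1], [0, 1, 1, 1])   # children are still impure
clean_split = ([0, 0, 0, 0], [1, 1, 1, 1])   # children are pure

for name, measure in (("entropy", entropy), ("gini", gini)):
    before = measure(parent)
    after_mixed = weighted_impurity(*mixed_split, measure)
    after_clean = weighted_impurity(*clean_split, measure)
    print(f"{name}: parent={before:.3f}  "
          f"mixed split={after_mixed:.3f}  clean split={after_clean:.3f}")
```

Both measures are highest for the evenly mixed parent and fall to zero for the pure children, which is why either one can drive the choice of split.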
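🧠 2. Common Model APIs
Decision tree:
from sklearn.tree import DecisionTreeClassifier
# X_train and y_train are the training split prepared beforehand
clf = DecisionTreeClassifier(max_depth=3, criterion='gini')
clf.fit(X_train, y_train)
Random forest:
from sklearn.ensemble import RandomForestClassifier
# n_estimators is the number of trees in the forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)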
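The two snippets above assume that `X_train` and `y_train` already exist. As a self-contained sketch, the example below fits the same estimators on the iris dataset bundled with scikit-learn (the choice of dataset is an assumption for illustration only) and compares `criterion='gini'` with `criterion='entropy'`.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative data only: the iris dataset ships with scikit-learn
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Fit one shallow tree per splitting criterion and record test accuracy
results = {}
for criterion in ("gini", "entropy"):
    clf = DecisionTreeClassifier(max_depth=3, criterion=criterion, random_state=42)
    clf.fit(X_train, y_train)
    results[criterion] = clf.score(X_test, y_test)
    print(f"{criterion}: depth={clf.get_depth()}, "
          f"leaves={clf.get_n_leaves()}, test accuracy={results[criterion]:.3f}")

# A random forest is a collection of trees; estimators_ holds the fitted ones
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
rf_accuracy = rf.score(X_test, y_test)
print(f"random forest: trees={len(rf.estimators_)}, test accuracy={rf_accuracy:.3f}")
```

On many datasets the two criteria produce similar trees, so the choice between them is usually less important than the depth limit.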
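📊 3. Evaluation Recommendations
| Model | Best suited for | Interpretability | Accuracy |
|---|---|---|---|
| Decision tree | Visualizing logic, rule-based reasoning | ✅ Strong | Moderate |
| Random forest | Higher accuracy, less overfitting | Moderate | ✅ Strong |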
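One way to check the trade-off in the table is to compare training accuracy with cross-validated accuracy for both models. The sketch below is only an illustration: the synthetic dataset from `make_classification`, including the 10% label noise set by `flip_y`, is an assumption chosen so that overfitting becomes visible, and results on real data may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical noisy dataset: flip_y randomly reassigns about 10% of the labels
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, flip_y=0.1, random_state=42
)

models = {
    "decision tree": DecisionTreeClassifier(random_state=42),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

summary = {}
for name, model in models.items():
    # 5-fold cross-validation estimates accuracy on data the model has not seen
    cv_mean = cross_val_score(model, X, y, cv=5).mean()
    # Accuracy on the data the model was fitted on, for comparison
    train_acc = model.fit(X, y).score(X, y)
    summary[name] = (train_acc, cv_mean)
    print(f"{name}: train accuracy={train_acc:.3f}, CV accuracy={cv_mean:.3f}")
```

A large gap between training accuracy and cross-validated accuracy is the usual sign of overfitting; an unrestricted single tree tends to show a wider gap than a forest.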
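📈 4. Visualization and Analysis
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt
# Draw the fitted tree; filled=True colors each node by its majority class
plt.figure(figsize=(10, 6))
plot_tree(clf, feature_names=["score", "gender"], class_names=["fail", "pass"], filled=True)
plt.show()
import pandas as pd
# Importances are listed in the same order as the feature columns
importance = rf.feature_importances_
pd.DataFrame({"feature": ["score", "gender"], "importance": importance})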
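When a plot window is not available, `sklearn.tree.export_text` prints the same tree as indented text rules. The sketch below is a minimal, self-contained illustration: it rebuilds a pass/fail dataset with `np.random.default_rng` and unscaled scores, which is an assumption for this example and will not reproduce the exact numbers of the practice script.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Illustrative data: passing means a score of at least 60; gender is irrelevant
rng = np.random.default_rng(42)
scores = rng.integers(40, 100, size=100)
genders = rng.integers(0, 2, size=100)
X = np.column_stack((scores, genders))
y = (scores >= 60).astype(int)

clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X, y)

# Text rendering of the learned rules, one line per node
rules = export_text(clf, feature_names=["score", "gender"])
print(rules)

# Importances of a single tree, in column order
for name, value in zip(["score", "gender"], clf.feature_importances_):
    print(f"{name}: {value:.3f}")
```

Because the label here depends only on the score, the tree needs a single split on `score`, and the importance of `gender` stays at zero.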
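💡 Suggested Approach for Today
- Build the same "pass or fail" classification dataset
- Train a decision tree and vary max_depth to see its effect
- Train a random forest and check whether performance improves
- Print and compare the feature importances
- Visualize the structure of the decision tree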
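The `max_depth` experiment from the list above can be sketched as a simple loop. On the noise-free pass/fail data a single split is already enough, so this example flips about 10% of the labels to make depth matter; that noise, the sample size, and the use of `np.random.default_rng` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
size = 300
scores = rng.integers(40, 100, size=size)
genders = rng.integers(0, 2, size=size)
labels = (scores >= 60).astype(int)

# Assumption for illustration: flip roughly 10% of labels to simulate noise
flip = rng.random(size) < 0.10
labels = np.where(flip, 1 - labels, labels)

X = np.column_stack((scores, genders))
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42
)

# Fit one tree per depth limit; None means the tree grows until it stops itself
results = []
for depth in (1, 2, 3, 5, None):
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    results.append((depth, model.get_depth(),
                    model.score(X_train, y_train),
                    model.score(X_test, y_test)))

for depth, actual_depth, train_acc, test_acc in results:
    print(f"max_depth={depth!s:>4}  actual depth={actual_depth:2d}  "
          f"train={train_acc:.3f}  test={test_acc:.3f}")
```

If training accuracy keeps rising with depth while test accuracy stalls or drops, the extra depth is fitting noise rather than the underlying rule.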
📁 Practice script: decision_tree_forest_demo.py
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Font able to render CJK text (available on macOS); only needed for Chinese plot labels
plt.rcParams['font.family'] = 'Arial Unicode MS'
plt.rcParams['axes.unicode_minus'] = False

# Simulated data: 100 students, pass (1) if score >= 60, otherwise fail (0)
np.random.seed(42)
size = 100
scores = np.random.randint(40, 100, size)
genders = np.random.choice([0, 1], size=size)
labels = (scores >= 60).astype(int)

# The score is standardized here, although tree models do not require scaling;
# the thresholds shown in the plot are therefore in z-score units
X = np.column_stack(((scores - scores.mean()) / scores.std(), genders))
y = labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Decision tree
dt_model = DecisionTreeClassifier(max_depth=3, criterion='gini', random_state=42)
dt_model.fit(X_train, y_train)
y_pred_dt = dt_model.predict(X_test)
print("=== Decision tree evaluation ===")
print("Accuracy:", accuracy_score(y_test, y_pred_dt))
print(classification_report(y_test, y_pred_dt))

# Plot the tree structure
plt.figure(figsize=(10, 6))
plot_tree(dt_model, feature_names=["score", "gender"], class_names=["fail", "pass"], filled=True)
plt.title("Decision tree visualization")
plt.tight_layout()
plt.show()

# Random forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
print("\n=== Random forest evaluation ===")
print("Accuracy:", accuracy_score(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))

# Feature importances from the random forest, in column order
feature_importance = rf_model.feature_importances_
features = ["score", "gender"]
importance_df = pd.DataFrame({"feature": features, "importance": feature_importance})
print("\n=== Feature importances (random forest) ===")
print(importance_df)
Output:

=== Decision tree evaluation ===
Accuracy: 1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         7
           1       1.00      1.00      1.00        13

    accuracy                           1.00        20
   macro avg       1.00      1.00      1.00        20
weighted avg       1.00      1.00      1.00        20