
Part 8: Decision Trees and Random Forests: From Scratch to Hands-On Tuning

news 2025/9/6 9:25:55

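Abstract:
This article explains how decision trees are built (information gain, Gini impurity) and how recursive splitting works, then implements an ID3-style tree from scratch in Python. It goes on to cover the ensemble idea behind random forests and the Bagging mechanism, and uses Scikit-learn for hyperparameter tuning and feature-importance analysis. Working with the Iris and California housing datasets, it helps learners understand both the interpretability of "white-box" tree models and the strong performance of their ensembles.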

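1. Why decision trees?

Decision trees are among the most intuitive and interpretable models in machine learning. Their structure resembles a flowchart and mimics the way people make decisions.

1.1 Advantages of decision trees

✅ Highly interpretable: the decision path behind each prediction can be read directly
✅ No feature standardization needed: splits compare a feature against a threshold, so rescaling values does not change the tree
✅ Handles nonlinear relationships: through successive, layered splits
✅ Implicit feature selection: the most informative features tend to appear near the top of the tree

1.2 Typical use cases

Credit approval (whether to grant a loan)
Medical diagnosis (whether a patient has a disease)
Customer segmentation (high- vs. low-value customers)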

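2. How decision trees work: splitting "smartly"

2.1 Core idea

Recursively choose the best feature and split point so that the dataset is partitioned into increasingly "pure" subsets.

Goal: after each split, the impurity ("disorder") of the subsets decreases.

2.2 Purity measures: information gain (ID3/C4.5) and Gini impurity (CART)

(1) Entropy

Measures how mixed a dataset is:

Entropy(S) = - Σ p_i log₂(p_i)

p_i: proportion of class i in set S
The lower the entropy, the purer the data (if every sample belongs to the same class, entropy = 0)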

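import numpy as np

def entropy(y):
    """Shannon entropy (in bits) of a label array."""
    _, counts = np.unique(y, return_counts=True)
    probs = counts / len(y)
    # np.unique only returns classes that occur, so probs > 0 and the log is safe.
    # -p*log2(p) is written as p*log2(1/p) so that a pure set prints 0.000, not -0.000.
    return np.sum(probs * np.log2(1.0 / probs))

# Examples
print(f"Pure set: {entropy([0, 0, 0]):.3f}")    # 0.000
print(f"Mixed set: {entropy([0, 1]):.3f}")      # 1.000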

(2) Information gain

Split on the feature that yields the largest information gain:

IG(S, A) = Entropy(S) - Σ [ |S_v|/|S| × Entropy(S_v) ]

A: a feature
S_v: the subset of S in which feature A takes the value v
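The formula above can be checked by hand on a tiny example. The sketch below uses plain Python rather than NumPy; the helper names `entropy_bits` and `information_gain` and the toy labels are made up for illustration.

```python
import math
from collections import Counter

def entropy_bits(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def information_gain(parent, subsets):
    """IG = Entropy(parent) - sum(|S_v|/|S| * Entropy(S_v)) over the subsets."""
    n = len(parent)
    weighted = sum(len(s) / n * entropy_bits(s) for s in subsets)
    return entropy_bits(parent) - weighted

# Hypothetical toy labels: 4 positives and 4 negatives.
parent = [1, 1, 1, 1, 0, 0, 0, 0]
# Feature A separates the classes perfectly; feature B does not help at all.
split_a = [[1, 1, 1, 1], [0, 0, 0, 0]]
split_b = [[1, 1, 0, 0], [1, 1, 0, 0]]

ig_a = information_gain(parent, split_a)
ig_b = information_gain(parent, split_b)
print(f"IG(A) = {ig_a:.3f}")  # 1.000
print(f"IG(B) = {ig_b:.3f}")  # 0.000
```

Feature A removes the full 1 bit of uncertainty in the parent set, while feature B leaves both children as mixed as the parent, so a tree builder would split on A.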

(3) Gini impurity

Used by the CART algorithm; cheaper to compute than entropy because it needs no logarithm:

Gini(S) = 1 - Σ (p_i)²

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    probs = counts / len(y)
    return 1 - np.sum(probs ** 2)

📌 Scikit-learn's DecisionTreeClassifier uses Gini impurity by default (criterion='gini').
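To get a feel for the range of Gini values, here is a small sketch in plain Python. The helper name `gini_impurity` and the labels are illustrative only. For k equally frequent classes the impurity reaches its maximum of 1 - 1/k.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

pure = gini_impurity(["a", "a", "a", "a"])       # one class only
two_way = gini_impurity(["a", "a", "b", "b"])    # two balanced classes
three_way = gini_impurity(["a", "b", "c"])       # three balanced classes
print(f"{pure:.3f} {two_way:.3f} {three_way:.3f}")  # 0.000 0.500 0.667
```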

3. Implementing a decision tree from scratch (ID3-style)

# Requires `import numpy as np` and the `entropy` function defined above.

class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, value=None):
        self.feature = feature      # index of the feature used for the split
        self.threshold = threshold  # split threshold (continuous) or value (discrete)
        self.left = left            # left subtree
        self.right = right          # right subtree
        self.value = value          # predicted label (leaf nodes only)


class DecisionTree:
    def __init__(self, min_samples_split=2, max_depth=100, n_features=None):
        self.min_samples_split = min_samples_split
        self.max_depth = max_depth
        self.n_features = n_features
        self.root = None

    def fit(self, X, y):
        self.n_features = X.shape[1] if self.n_features is None else min(self.n_features, X.shape[1])
        self.root = self._grow_tree(X, y)

    def _grow_tree(self, X, y, depth=0):
        n_samples, n_feats = X.shape
        n_labels = len(np.unique(y))

        # Stopping criteria
        if depth >= self.max_depth or n_labels == 1 or n_samples < self.min_samples_split:
            return Node(value=self._most_common_label(y))

        # Random feature subset (this is what a random forest builds on)
        feat_idxs = np.random.choice(n_feats, self.n_features, replace=False)

        # Find the best split
        best_feat, best_thresh = self._best_criteria(X, y, feat_idxs)

        # Split the data
        left_idxs, right_idxs = self._split(X[:, best_feat], best_thresh)

        # Guard: if no threshold separates the samples, one side is empty, so stop here
        if not left_idxs.any() or not right_idxs.any():
            return Node(value=self._most_common_label(y))

        # Recursively build the subtrees
        left = self._grow_tree(X[left_idxs, :], y[left_idxs], depth + 1)
        right = self._grow_tree(X[right_idxs, :], y[right_idxs], depth + 1)
        return Node(best_feat, best_thresh, left, right)

    def _best_criteria(self, X, y, feat_idxs):
        best_gain = -1
        split_idx, split_thresh = None, None
        for feat_idx in feat_idxs:
            X_column = X[:, feat_idx]
            thresholds = np.unique(X_column)
            for threshold in thresholds:
                gain = self._information_gain(y, X_column, threshold)
                if gain > best_gain:
                    best_gain = gain
                    split_idx = feat_idx
                    split_thresh = threshold
        return split_idx, split_thresh

    def _information_gain(self, y, X_column, threshold):
        # Entropy of the parent node
        parent_entropy = entropy(y)

        # Child subsets
        left_idxs = X_column <= threshold
        y_left, y_right = y[left_idxs], y[~left_idxs]
        if len(y_left) == 0 or len(y_right) == 0:
            return 0

        # Weighted entropy of the children
        n = len(y)
        n_left, n_right = len(y_left), len(y_right)
        child_entropy = (n_left / n) * entropy(y_left) + (n_right / n) * entropy(y_right)
        return parent_entropy - child_entropy

    def _split(self, X_column, split_thresh):
        left_idxs = X_column <= split_thresh
        return left_idxs, ~left_idxs

    def _most_common_label(self, y):
        # np.bincount expects non-negative integer labels
        counter = np.bincount(y)
        return counter.argmax()

    def predict(self, X):
        return np.array([self._traverse_tree(x, self.root) for x in X])

    def _traverse_tree(self, x, node):
        if node.value is not None:
            return node.value
        if x[node.feature] <= node.threshold:
            return self._traverse_tree(x, node.left)
        return self._traverse_tree(x, node.right)
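The threshold scan at the heart of `_best_criteria` can be tried in isolation. The following is a minimal sketch in plain Python for a single feature; `entropy_bits`, `best_threshold`, and the toy measurements are hypothetical names and data chosen for illustration, not part of the class above.

```python
import math
from collections import Counter

def entropy_bits(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Try every distinct value as a `<=` threshold; return (threshold, gain)."""
    parent = entropy_bits(labels)
    n = len(labels)
    best = (None, 0.0)
    for t in sorted(set(values)):
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        if not left or not right:
            continue  # a split that leaves one side empty is useless
        child = len(left) / n * entropy_bits(left) + len(right) / n * entropy_bits(right)
        gain = parent - child
        if gain > best[1]:
            best = (t, gain)
    return best

# Hypothetical petal lengths (cm) for two classes.
lengths = [1.4, 1.5, 1.3, 4.7, 4.5, 5.1]
species = [0, 0, 0, 1, 1, 1]
threshold, gain = best_threshold(lengths, species)
print(threshold, round(gain, 3))  # 1.5 1.0
```

The scan settles on `<= 1.5` because that is the largest value of class 0, so both children are pure and the gain equals the parent entropy of 1 bit.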

4. Decision trees in practice with Scikit-learn

from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

# Load the data
iris = load_iris()
X, y = iris.data, iris.target

# Train
tree = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
tree.fit(X, y)

# Visualize the tree structure
plt.figure(figsize=(15, 10))
plot_tree(tree, feature_names=iris.feature_names, class_names=iris.target_names, filled=True)
plt.title("Decision tree visualization")
plt.show()

✅ Each node in the plot shows the split feature and threshold, the Gini impurity, the sample count, and the class distribution.
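When a plot is inconvenient (for example in a terminal or a log file), the same tree can be printed as nested if/else rules. This sketch refits an equivalent tree so that it runs on its own; the variable names `clf` and `rules` are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
clf.fit(iris.data, iris.target)

# One line per node, indented by depth, with the threshold used at each split.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
print("depth:", clf.get_depth(), "leaves:", clf.get_n_leaves())
```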

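5. Random forests: the king of ensemble learning

5.1 Why random forests?

A single decision tree overfits easily. A random forest combines many trees, which substantially improves generalization.

5.2 Core idea: Bagging + random features

Bagging (Bootstrap Aggregating):
- Draw multiple subsets from the training set by sampling with replacement
- Train one decision tree on each subset
Random features:
- At each split, search for the best split among a random subset of the features
Prediction:
- Classification: majority vote across the trees
- Regression: average across the trees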
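The sampling and aggregation steps above can be sketched without any ML library. The helper names `bootstrap_indices` and `majority_vote` and the example predictions are hypothetical; the point is only to show what "with replacement" and "voting" mean in code.

```python
import random
from collections import Counter

def bootstrap_indices(n, rng):
    """Draw n row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n) for _ in range(n)]

def majority_vote(predictions):
    """Return the most frequent label among the trees' predictions."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
n = 10_000
sample = bootstrap_indices(n, rng)
unique_fraction = len(set(sample)) / n
# In expectation about 1 - 1/e (roughly 63%) of the rows appear at least once.
print(f"distinct rows in one bootstrap sample: {unique_fraction:.1%}")

# Hypothetical predictions from five trees for one sample.
vote = majority_vote(["setosa", "versicolor", "setosa", "setosa", "virginica"])
print(vote)  # setosa

# Regression forests average the tree outputs instead of voting.
tree_outputs = [2.0, 2.5, 3.0]
average = sum(tree_outputs) / len(tree_outputs)
print(average)  # 2.5
```

The rows a given tree never sees (roughly the remaining third) are its out-of-bag samples, which is why each tree in the ensemble ends up trained on a somewhat different view of the data.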

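5.3 Advantages of random forests

✅ High accuracy: usually better than a single tree
✅ Resistant to overfitting: averaging many trees reduces variance
✅ Provides feature-importance estimates
✅ Handles high-dimensional data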
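One way to make the variance-reduction point concrete is the standard expression for the variance of an average of identically distributed predictors. The symbols are notation introduced here, not taken from the article: B is the number of trees, T_b(x) is the prediction of tree b, σ² is the variance of a single tree's prediction, and ρ is the pairwise correlation between trees.

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{B}\,\sigma^{2}
```

Adding more trees shrinks only the second term. The random feature subsets from section 5.2 target the first term: they make the trees less correlated, which lowers ρ.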

6. Random forests with Scikit-learn

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the random forest
rf = RandomForestClassifier(
    n_estimators=100,      # number of trees
    max_depth=5,           # limits the complexity of each tree
    max_features='sqrt',   # number of features considered at each split
    random_state=42
)
rf.fit(X_train, y_train)

# Predict
y_pred = rf.predict(X_test)
print(f"Test accuracy: {accuracy_score(y_test, y_pred):.3f}")
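Because each tree is trained on a bootstrap sample, the rows it never saw can serve as a built-in validation set. The sketch below, which is an addition and not part of the original walkthrough, asks Scikit-learn for this out-of-bag (OOB) estimate via `oob_score=True`; the names `X_all`, `y_all`, and `rf_oob` are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X_all, y_all = load_iris(return_X_y=True)

# oob_score=True scores every sample using only the trees that did not train on it.
rf_oob = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf_oob.fit(X_all, y_all)
print(f"OOB accuracy estimate: {rf_oob.oob_score_:.3f}")
```

The OOB estimate needs no separate hold-out split, which can be handy on small datasets such as Iris.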

7. Feature importance analysis

A random forest can report how much each feature contributes to the model.

# Get the feature importances (uses np, plt, iris, and rf from the earlier sections)
importances = rf.feature_importances_
feature_names = iris.feature_names

# Visualize, most important feature first
indices = np.argsort(importances)[::-1]
plt.figure(figsize=(10, 6))
plt.title("Feature importance")
plt.bar(range(len(importances)), importances[indices])
plt.xticks(range(len(importances)), [feature_names[i] for i in indices])
plt.ylabel("Importance")
plt.show()

# Print the values
for i in indices:
    print(f"{feature_names[i]}: {importances[i]:.3f}")

✅ Can be used for feature selection by dropping unimportant features.
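Before dropping features on the basis of `feature_importances_` alone, it can help to cross-check with permutation importance, which measures how much the test score drops when one feature's values are shuffled. This is a complementary technique and not the article's method; the variable names below are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature 10 times on held-out data and record the drop in accuracy.
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=42)
for name, mean, std in zip(iris.feature_names, result.importances_mean, result.importances_std):
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```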

8. Tuning in practice: optimizing a random forest with grid search

from sklearn.model_selection import GridSearchCV

# Parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7, None],
    'min_samples_split': [2, 5, 10]
}

# Grid search
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid, cv=5, scoring='accuracy', n_jobs=-1
)
grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.3f}")

# Final model
best_rf = grid_search.best_estimator_
test_acc = best_rf.score(X_test, y_test)
print(f"Test accuracy: {test_acc:.3f}")
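Grid search cost grows multiplicatively with every parameter added, so it is worth counting the fits before launching a search. The sketch below enumerates the same grid with the standard library only; `cv_folds`, `combinations`, and `total_fits` are illustrative names.

```python
from itertools import product

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 7, None],
    "min_samples_split": [2, 5, 10],
}
cv_folds = 5

# Cartesian product of all parameter values, one dict per candidate.
combinations = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
total_fits = len(combinations) * cv_folds
print(len(combinations), total_fits)  # 36 180
print(combinations[0])  # {'n_estimators': 50, 'max_depth': 3, 'min_samples_split': 2}
```

With the default `refit=True`, GridSearchCV trains one more model on the full training set using the best combination, so this search costs 181 forest fits in total.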

9. Regression task: California housing price prediction

from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error

# Load the data (downloaded and cached on first use)
housing = fetch_california_housing()
X, y = housing.data, housing.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the random forest regressor
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
rf_reg.fit(X_train, y_train)

# Predict
y_pred = rf_reg.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"RMSE: {rmse:.3f}")

# Feature importance
importances_reg = rf_reg.feature_importances_
indices_reg = np.argsort(importances_reg)[::-1]
plt.figure(figsize=(10, 6))
plt.title("Feature importance (regression)")
plt.bar(range(len(importances_reg)), importances_reg[indices_reg])
plt.xticks(range(len(importances_reg)), [housing.feature_names[i] for i in indices_reg], rotation=45)
plt.show()
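Regression trees cannot use entropy or Gini because the target is continuous. Scikit-learn's regression trees default to a squared-error criterion, which amounts to choosing the split that most reduces the variance of the target within the children. A minimal sketch of that idea in plain Python follows; the helper names and the toy house values are hypothetical.

```python
def mse(values):
    """Mean squared deviation from the mean (the impurity of a regression node)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def mse_reduction(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = len(left) / n * mse(left) + len(right) / n * mse(right)
    return mse(parent) - weighted

# Hypothetical house values (in units of $100k) at one node.
parent = [1.0, 1.0, 3.0, 3.0]
good = mse_reduction(parent, [1.0, 1.0], [3.0, 3.0])  # children are homogeneous
poor = mse_reduction(parent, [1.0, 3.0], [1.0, 3.0])  # children as spread out as the parent
print(good, poor)  # 1.0 0.0
```

Each leaf then predicts the mean target of its training samples, and the forest averages those leaf predictions across trees.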