
Random forest: what every parameter means, and how to do resampling and grid search

The full parameter list:

class RandomForestClassifier(
    n_estimators: Int = 100,
    *,
    criterion: Literal['gini', 'entropy', 'log_loss'] = "gini",
    max_depth: Int | None = None,
    min_samples_split: float = 2,
    min_samples_leaf: float = 1,
    min_weight_fraction_leaf: Float = 0,
    max_features: float | Literal['sqrt', 'log2'] = "sqrt",
    max_leaf_nodes: Int | None = None,
    min_impurity_decrease: Float = 0,
    bootstrap: bool = True,
    oob_score: bool = False,
    n_jobs: Int | None = None,
    random_state: Int | RandomState | None = None,
    verbose: Int = 0,
    warm_start: bool = False,
    class_weight: Mapping | Sequence[Mapping] | Literal['balanced', 'balanced_subsample'] | None = None,
    ccp_alpha: float = 0,
    max_samples: float | None = None
)

A random forest classifier. A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. Trees in the forest use the best split strategy, i.e. equivalent to passing splitter="best" to the underlying sklearn.tree.DecisionTreeClassifier. The sub-sample size is controlled with the max_samples parameter if bootstrap=True (default), otherwise the whole dataset is used to build each tree. For a comparison between tree-based ensemble models see the example sphx_glr_auto_examples_ensemble_plot_forest_hist_grad_boosting_comparison.py. Read more in the User Guide.

Parameters

n_estimators : int, default=100
The number of trees in the forest. versionchanged 0.22 The default value of n_estimators changed from 10 to 100 in 0.22. criterion : {“gini”, “entropy”, “log_loss”}, default=“gini”
The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “log_loss” and “entropy” both for the Shannon information gain, see tree_mathematical_formulation. Note: This parameter is tree-specific. max_depth : int, default=None
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples. min_samples_split : int or float, default=2
The minimum number of samples required to split an internal node: ◦ If int, then consider min_samples_split as the minimum number. ◦ If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split. versionchanged 0.18 Added float values for fractions. min_samples_leaf : int or float, default=1
The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression. ◦ If int, then consider min_samples_leaf as the minimum number. ◦ If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node. versionchanged 0.18 Added float values for fractions. min_weight_fraction_leaf : float, default=0.0
The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided. max_features : {“sqrt”, “log2”, None}, int or float, default=“sqrt”
The number of features to consider when looking for the best split: ◦ If int, then consider max_features features at each split. ◦ If float, then max_features is a fraction and max(1, int(max_features * n_features_in_)) features are considered at each split. ◦ If “sqrt”, then max_features=sqrt(n_features). ◦ If “log2”, then max_features=log2(n_features). ◦ If None, then max_features=n_features. versionchanged 1.1 The default of max_features changed from “auto” to “sqrt”. Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features. max_leaf_nodes : int, default=None
Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes. min_impurity_decrease : float, default=0.0
A node will be split if this split induces a decrease of the impurity greater than or equal to this value. The weighted impurity decrease equation is the following: N_t / N * (impurity - N_t_R / N_t * right_impurity
- N_t_L / N_t * left_impurity) where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child. N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed. versionadded 0.19 bootstrap : bool, default=True
Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree. oob_score : bool or callable, default=False
Whether to use out-of-bag samples to estimate the generalization score. By default, ~sklearn.metrics.accuracy_score is used. Provide a callable with signature metric(y_true, y_pred) to use a custom metric. Only available if bootstrap=True. n_jobs : int, default=None
The number of jobs to run in parallel. fit, predict, decision_path and apply are all parallelized over the trees. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary <n_jobs> for more details. random_state : int, RandomState instance or None, default=None
Controls both the randomness of the bootstrapping of the samples used when building trees (if bootstrap=True) and the sampling of the features to consider when looking for the best split at each node (if max_features < n_features). See Glossary <random_state> for details. verbose : int, default=0
Controls the verbosity when fitting and predicting. warm_start : bool, default=False
When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest. See Glossary <warm_start> and tree_ensemble_warm_start for details. class_weight : {“balanced”, “balanced_subsample”}, dict or list of dicts, default=None
Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one. For multi-output problems, a list of dicts can be provided in the same order as the columns of y. Note that for multioutput (including multilabel) weights should be defined for each class of every column in its own dict. For example, for four-class multilabel classification weights should be [{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] instead of [{1:1}, {2:5}, {3:1}, {4:1}]. The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)) The “balanced_subsample” mode is the same as “balanced” except that weights are computed based on the bootstrap sample for every tree grown. For multi-output, the weights of each column of y will be multiplied. Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified. ccp_alpha : non-negative float, default=0.0
Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen. By default, no pruning is performed. See minimal_cost_complexity_pruning for details. versionadded 0.22 max_samples : int or float, default=None
If bootstrap is True, the number of samples to draw from X to train each base estimator. ◦ If None (default), then draw X.shape[0] samples. ◦ If int, then draw max_samples samples. ◦ If float, then draw max(round(n_samples * max_samples), 1) samples. Thus, max_samples should be in the interval (0.0, 1.0]. versionadded 0.22 monotonic_cst : array-like of int of shape (n_features), default=None
Indicates the monotonicity constraint to enforce on each feature. ◦ 1: monotonic increase ◦ 0: no constraint ◦ -1: monotonic decrease If monotonic_cst is None, no constraints are applied. Monotonicity constraints are not supported for: ◦ multiclass classifications (i.e. when n_classes > 2), ◦ multioutput classifications (i.e. when n_outputs_ > 1), ◦ classifications trained on data with missing values. The constraints hold over the probability of the positive class. Read more in the User Guide <monotonic_cst_gbdt>. versionadded 1.4 Attributes **** : :class:~sklearn.tree.DecisionTreeClassifier
The child estimator template used to create the collection of fitted sub-estimators. versionadded 1.2 base_estimator_ was renamed to estimator_. **** : list of DecisionTreeClassifier
The collection of fitted sub-estimators. **** : ndarray of shape (n_classes,) or a list of such arrays
The classes labels (single output problem), or a list of arrays of class labels (multi-output problem). n_classes_ : int or list
The number of classes (single output problem), or a list containing the number of classes for each output (multi-output problem). n_features_in_ : int
Number of features seen during fit. versionadded 0.24 **** : ndarray of shape (n_features_in_,)
Names of features seen during fit. Defined only when X has feature names that are all strings. versionadded 1.0 n_outputs_ : int
The number of outputs when fit is performed. **** : ndarray of shape (n_features,)
The impurity-based feature importances. The higher, the more important the feature. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance. Warning: impurity-based feature importances can be misleading for high cardinality features (many unique values). See sklearn.inspection.permutation_importance as an alternative. **** : float
Score of the training dataset obtained using an out-of-bag estimate. This attribute exists only when oob_score is True. **** : ndarray of shape (n_samples, n_classes) or (n_samples, n_classes, n_outputs)
Decision function computed with out-of-bag estimate on the training set. If n_estimators is small it might be possible that a data point was never left out during the bootstrap. In this case, oob_decision_function_ might contain NaN. This attribute exists only when oob_score is True. **** : list of arrays
The subset of drawn samples (i.e., the in-bag samples) for each base estimator. Each subset is defined by an array of the indices selected. versionadded 1.4 See Also sklearn.tree.DecisionTreeClassifier : A decision tree classifier. sklearn.ensemble.ExtraTreesClassifier : Ensemble of extremely randomized tree classifiers. sklearn.ensemble.HistGradientBoostingClassifier : A Histogram-based Gradient Boosting Classification Tree, very fast for big datasets (n_samples >=
10_000). Notes The default values for the parameters controlling the size of the trees (e.g. max_depth, min_samples_leaf, etc.) lead to fully grown and unpruned trees which can potentially be very large on some data sets. To reduce memory consumption, the complexity and size of the trees should be controlled by setting those parameter values. The features are always randomly permuted at each split. Therefore, the best found split may vary, even with the same training data, max_features=n_features and bootstrap=False, if the improvement of the criterion is identical for several splits enumerated during the search of the best split. To obtain a deterministic behaviour during fitting, random_state has to be fixed. References [1]: L. Breiman, “Random Forests”, Machine Learning, 45(1), 5-32, 2001. Examples >>> from sklearn.ensemble import RandomForestClassifier

>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=1000, n_features=4,
...                            n_informative=2, n_redundant=0,
...                            random_state=0, shuffle=False)
>>> clf = RandomForestClassifier(max_depth=2, random_state=0)
>>> clf.fit(X, y)
RandomForestClassifier(...)
>>> print(clf.predict([[0, 0, 0, 0]]))
[1]

RandomForestClassifier: translation and commentary

The following is a translation of the official scikit-learn documentation for RandomForestClassifier (the random-forest classifier), with commentary on the core logic to clarify the design rationale and the role of each parameter.

I. Basic definition: translation and commentary

Translation

Random forest classifier: a random forest is a meta-estimator that fits a number of decision-tree classifiers on various sub-samples of the dataset and then averages their outputs to improve predictive accuracy and control over-fitting.
Every tree in the forest uses the best-split strategy (equivalent to passing splitter="best" to the underlying sklearn.tree.DecisionTreeClassifier).
With bootstrap=True (the default), the sub-sample size is controlled by the max_samples parameter; otherwise the full dataset is used to build each tree.
For a comparison of tree-based ensemble models, see the example sphx_glr_auto_examples_ensemble_plot_forest_hist_grad_boosting_comparison.py.
For more detail, consult the User Guide on the scikit-learn website.

Commentary

  • Meta-estimator: a random forest does not predict directly; it aggregates the results of many independent decision trees (voting/probability averaging for classification, the mean for regression), which lowers the over-fitting risk of any single tree (the "wisdom of the crowd").
  • Sources of randomness: two core mechanisms, (1) random row sampling (bootstrap, sampling with replacement) and (2) random feature selection at each split (max_features, only a subset of features is considered). They keep the trees diverse, which is the key to a good ensemble.
  • Split strategy: by default the best split is used (splitter="best"): the split gain (Gini, information gain) of the candidate features is computed and the split with the largest gain is taken, preserving the fitting power of each individual tree.

II. Core parameters: translation and commentary

(Grouped by function, with default values, accepted values and the effect on the model highlighted.)

1. Forest size

n_estimators - the number of decision trees in the forest.
Values: int, default=100 (the default changed from 10 to 100 in version 0.22). An incremental-training sketch follows this block.
  • More trees: a more stable model (lower variance, less over-fitting), but training time and memory grow linearly.
  • Too few trees: weak ensembling that is easily swayed by the noise of individual trees, with a risk of under-fitting.
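To make the n_estimators trade-off concrete, here is a minimal sketch (synthetic data and illustrative values, not taken from this article) that grows the forest incrementally with warm_start=True and watches the out-of-bag accuracy stabilise:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# warm_start=True reuses the already-fitted trees, so each fit() call only
# trains the newly added ones; oob_score_ tracks generalization as trees grow.
clf = RandomForestClassifier(warm_start=True, oob_score=True,
                             bootstrap=True, random_state=0, n_jobs=-1)
for n in [25, 50, 100, 200, 400]:
    clf.set_params(n_estimators=n)
    clf.fit(X, y)
    print(f"n_estimators={n:4d}  OOB accuracy={clf.oob_score_:.4f}")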

2. Split-quality criterion

criterion - the function used to measure the quality of a split.
Values: {"gini" (Gini impurity), "entropy" (Shannon information gain), "log_loss"}, default="gini". The parameter is tree-specific (it decides how every individual tree evaluates its splits); see tree_mathematical_formulation in the official docs.
  • Gini (default): simple and fast to compute, suited to large datasets and to problems where misclassification costs are roughly equal.
  • Entropy / log loss: more sensitive to the split gain (it requires computing logarithms), slightly slower, suited to problems that demand high class purity.

3. Tree depth and complexity

(Keeping individual trees from growing too deep is the main lever against over-fitting and the core of random-forest tuning; a grid-search sketch follows this block.)

max_depth - the maximum depth of a tree.
Values: int, default=None. With None a tree keeps splitting until every leaf is pure (contains a single class) or holds fewer than min_samples_split samples.
  • A concrete value (e.g. 10) limits depth and avoids fitting the training data's noise; useful for high-dimensional data.
  • The default None lets trees grow very deep and over-fit easily, although it can work well on simple, low-dimensional, nearly separable data.

min_samples_split - the minimum number of samples required to split an internal node.
Values: int or float, default=2. An int is an absolute count; a float is a fraction, ceil(min_samples_split * n_samples). Float values were added in 0.18.
  • Larger values (e.g. 5 or 10) mean fewer splits and simpler trees (more under-fitting risk, less over-fitting risk).
  • Smaller values (e.g. 2) produce more complex trees (more over-fitting risk).

min_samples_leaf - the minimum number of samples required at a leaf node.
Values: int or float, default=1; same int/float convention as min_samples_split. A split is only accepted if it leaves at least this many training samples in both the left and right branches, which smooths the model (especially in regression).
  • Larger values (e.g. 5) force leaves to hold more samples and stop a single outlier from dominating a leaf, reducing over-fitting.
  • The default 1 allows single-sample leaves and over-fits easily.

max_leaf_nodes - the maximum number of leaf nodes per tree.
Values: int, default=None. Trees are grown best-first (nodes with the largest relative impurity reduction are kept); None means an unlimited number of leaves.
  • A concrete value (e.g. 100) caps tree size, useful when memory is limited.
  • The default None lets trees grow freely to full depth.

min_impurity_decrease - the minimum impurity decrease required to split a node.
Values: float, default=0.0 (added in 0.19). The weighted impurity decrease is
N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)
where N is the total number of samples, N_t the samples at the current node, and N_t_L / N_t_R the samples in the left / right child (all as weighted sums when sample_weight is given).
  • A node is split only when the induced impurity decrease is at least this value.
  • Raising it (e.g. to 0.01) yields fewer splits and simpler trees; the default 0.0 splits whenever there is any gain.
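The complexity parameters above are the usual targets of a grid search. A minimal GridSearchCV sketch follows; the dataset, grid values and scoring choice are illustrative assumptions, not prescriptions from this article:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = make_classification(n_samples=3000, n_features=20, weights=[0.7, 0.3],
                           random_state=0)

param_grid = {
    "max_depth": [5, 10, 20, None],
    "min_samples_split": [2, 10],
    "min_samples_leaf": [1, 5],
    "max_features": ["sqrt", "log2", 0.5],
}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=200, random_state=0, n_jobs=-1),
    param_grid,
    scoring="balanced_accuracy",   # more robust than plain accuracy on imbalanced data
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))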

4. Feature sub-sampling

max_features - the number of features considered when looking for the best split.
Values: {"sqrt", "log2", None}, int or float; default="sqrt" (the default changed from "auto" to "sqrt" in 1.1).
  • int: consider that many features at each split.
  • float: consider max(1, int(max_features * n_features_in_)) features (a fraction of the feature count).
  • "sqrt": max_features = sqrt(n_features).
  • "log2": max_features = log2(n_features).
  • None: max_features = n_features (all features).
Note: the search for a split does not stop until at least one valid partition is found, even if that means inspecting more than max_features features.
  • This is the core of the randomness: limiting the features examined per split makes different trees rely on different features and prevents one very strong feature from dominating every tree (which would make the trees homogeneous and defeat the ensemble).
  • The default "sqrt" balances randomness against feature coverage and works well in practice.

5. Sample bootstrapping and out-of-bag evaluation

(An out-of-bag scoring sketch follows this block.)

bootstrap - whether bootstrap (with replacement) samples are used when building trees.
Values: bool, default=True.
  • True (default): each tree is trained on a sub-sample (its size controlled by max_samples), which increases randomness and reduces over-fitting.
  • False: every tree sees the full dataset, trees become very similar and the over-fitting risk rises.

max_samples - when bootstrap=True, the number of samples drawn from X to train each base estimator (tree).
Values: int or float, default=None (added in 0.22).
  • None: draw X.shape[0] samples (as many draws as there are rows; with replacement, so duplicates occur).
  • int: draw exactly that many samples.
  • float: draw max(round(n_samples * max_samples), 1) samples, so the value should lie in (0.0, 1.0].
  • Example: 0.8 trains each tree on 80% of the rows, further increasing tree diversity; the default None is the classic random-forest setting.

oob_score - whether to use out-of-bag samples (rows not drawn by the bootstrap) to estimate the generalization score.
Values: bool or callable, default=False.
  • True: sklearn.metrics.accuracy_score is used by default; a callable with signature metric(y_true, y_pred) supplies a custom metric. Only available when bootstrap=True.
  • Advantage: no separate validation split is needed; the model is scored on rows it never saw during training.
  • Caveat: with few trees or a very small max_samples, some rows may never end up out-of-bag, which biases the estimate.
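A minimal out-of-bag sketch tying bootstrap, max_samples and oob_score together (synthetic data; passing a callable to oob_score follows the parameter description above and assumes a scikit-learn release recent enough to accept a callable here):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score

X, y = make_classification(n_samples=2000, n_features=20, weights=[0.8, 0.2],
                           random_state=0)

# bootstrap=True is required for OOB; max_samples=0.8 trains each tree on ~80%
# of the rows, so the remaining rows can serve as out-of-bag evaluation rows.
clf = RandomForestClassifier(n_estimators=300, bootstrap=True, max_samples=0.8,
                             oob_score=balanced_accuracy_score,  # custom callable metric
                             random_state=0, n_jobs=-1).fit(X, y)
print("OOB balanced accuracy:", round(clf.oob_score_, 4))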

6. Computation and randomness

n_jobs - the number of jobs to run in parallel.
Values: int, default=None. fit, predict, decision_path and apply are parallelized over the trees. None means 1 job (unless inside a joblib.parallel_backend context); -1 uses all available CPU cores. See the glossary entry <n_jobs>.
  • The trees of a random forest are trained independently, so parallelism cuts training time dramatically (100 trees on 10 cores take roughly one tenth of the single-core time).
  • On memory-constrained machines, too many workers can exhaust RAM; adjust to the hardware.

random_state - the seed controlling randomness.
Values: int, RandomState instance or None, default=None. It controls both the bootstrap sampling of rows (when bootstrap=True) and the feature sampling at each split (when max_features < n_features). See the glossary entry <random_state>.
  • A fixed int (e.g. 42) makes repeated runs fully reproducible, which is convenient for experiments.
  • The default None gives slightly different results on every run (long-run performance is still stable).

verbose - verbosity of the logging during fitting and prediction.
Values: int, default=0.
  • 0 (default): no output.
  • 1: progress output.
  • 2 and above: more detailed per-tree training information (useful for debugging).

7. Class weights and pruning (special situations)

(A class_weight sketch on an imbalanced dataset follows this block.)

class_weight - weights associated with classes, used to handle class imbalance.
Values: {"balanced", "balanced_subsample"}, dict or list of dicts, default=None.
  • None: every class gets weight 1.
  • "balanced": weights are adjusted automatically, inversely proportional to class frequencies in the input data: n_samples / (n_classes * np.bincount(y)).
  • "balanced_subsample": same idea, but the weights are computed on each tree's bootstrap sample.
  • dict: manual weights, e.g. {0: 1, 1: 10} makes class 1 ten times heavier than class 0.
  • Multi-output problems need a list of dicts in the order of the columns of y.
  • Why it matters for imbalance (e.g. class 0 = 90%, class 1 = 10%): without weights the model drifts towards the majority class and predicts the minority class poorly; with "balanced" the minority class gains weight, the model pays more attention to it and its recall improves.

ccp_alpha - the complexity parameter for minimal cost-complexity pruning.
Values: non-negative float, default=0.0 (added in 0.22). The subtree with the largest cost complexity that is still smaller than ccp_alpha is chosen; by default no pruning is performed. See minimal_cost_complexity_pruning in the official docs.
  • Pruning removes branches that contribute little to performance and reduces tree complexity.
  • Larger values (e.g. 0.01) prune more aggressively and give simpler trees (higher under-fitting risk); the default 0.0 keeps the full tree.
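A small sketch of class_weight on a deliberately imbalanced synthetic dataset, comparing the settings discussed above by minority-class recall (all values are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# A 90% / 10% binary problem, mirroring the imbalance example above.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9, 0.1],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for cw in [None, "balanced", "balanced_subsample", {0: 1, 1: 10}]:
    clf = RandomForestClassifier(n_estimators=200, class_weight=cw,
                                 random_state=0, n_jobs=-1).fit(X_tr, y_tr)
    rec = recall_score(y_te, clf.predict(X_te))   # recall of the minority class (label 1)
    print(f"class_weight={cw!s:22s} minority recall={rec:.3f}")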

8. Monotonicity constraints (added in 1.4)

(A monotonic_cst sketch follows this block.)

monotonic_cst - the monotonicity constraint enforced on each feature.
Values: array-like of int with shape (n_features,), default=None.
  • 1: monotonically increasing (higher feature value implies a higher probability of the positive class).
  • 0: no constraint.
  • -1: monotonically decreasing.
Limitations: not supported for multiclass problems (n_classes > 2), multi-output problems (n_outputs_ > 1), or data with missing values; the constraint applies to the probability of the positive class. See <monotonic_cst_gbdt> in the User Guide.
  • Useful when business logic demands it, e.g. "higher income implies a lower default probability" (monotonically decreasing), so the model cannot produce counter-intuitive predictions such as high income with high default probability.
  • The default None leaves the model free to learn any relationship between feature and outcome.
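A minimal monotonic_cst sketch (binary problem, synthetic data; it assumes a scikit-learn release that includes this 1.4 feature):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.uniform(size=(2000, 2))
# The positive class becomes more likely as feature 0 grows, plus some noise.
y = (X[:, 0] + 0.3 * rng.randn(2000) > 0.5).astype(int)

# Force P(class 1) to be non-decreasing in feature 0; leave feature 1 unconstrained.
clf = RandomForestClassifier(n_estimators=100, monotonic_cst=[1, 0],
                             random_state=0).fit(X, y)
grid = np.column_stack([np.linspace(0, 1, 5), np.full(5, 0.5)])
print(clf.predict_proba(grid)[:, 1])   # should be (weakly) non-decreasing along feature 0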

III. Attributes: translation and notes

(Attributes available after fitting, used to inspect the trained model; a feature-importance sketch follows this list.)

estimator_ - the child-estimator template used to create the collection of fitted sub-estimators (base_estimator_ was renamed to estimator_ in 1.2). Useful for inspecting the base configuration of a single tree (criterion, max_depth, ...).
estimators_ - the list of fitted DecisionTreeClassifier sub-estimators; lets you inspect any individual tree (its structure, its feature importances, ...).
classes_ - the class labels (an ndarray for single-output problems, a list of arrays for multi-output problems). Confirms the class order used in training, e.g. classes_ = [0, 1] means 0 is the negative class and 1 the positive class.
n_classes_ - the number of classes (an int for single output, a list for multi-output). A quick view of the task size, e.g. n_classes_ = 3 for a three-class problem.
n_features_in_ - the number of features seen during fit (added in 0.24); helps verify that prediction inputs match the training features.
feature_names_in_ - the feature names seen during fit, defined only when X has all-string feature names (added in 1.0); handy for analysing importances together with feature_importances_.
n_outputs_ - the number of outputs of the fitted model; confirms whether the task is single- or multi-output (e.g. n_outputs_ = 2 for a two-label multilabel problem).
feature_importances_ - the impurity-based feature importances, an ndarray of shape (n_features,). Higher means more important; computed as the (normalized) total reduction of the split criterion contributed by that feature, also known as Gini importance. Warning: it can be misleading for high-cardinality features (IDs, timestamps, many unique values); prefer sklearn.inspection.permutation_importance in that case. Commonly used for feature selection (e.g. dropping features with importance below 0.01).
oob_score_ - the score of the training dataset obtained with the out-of-bag estimate; only defined when oob_score=True.
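As the warning above suggests, it is worth cross-checking feature_importances_ against permutation importance; a minimal sketch on synthetic data:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=8, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

print("impurity-based:", clf.feature_importances_.round(3))
# Permutation importance is computed on held-out data, so it is less biased
# towards high-cardinality or overfitted features.
perm = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=0)
print("permutation   :", perm.importances_mean.round(3))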

The resampling script

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, classification_report, confusion_matrix,
balanced_accuracy_score, f1_score, roc_auc_score)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.pipeline import Pipeline
from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTEENN, SMOTETomek
from imblearn.metrics import classification_report_imbalanced
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

def create_price_labels(df):
    """Create the price-movement labels."""
    price_change_rate = (df['close'] - df['open']) / df['open'] * 1000

    conditions = [
        price_change_rate >= 1,    # up by at least 1 per mille  -> label 1
        price_change_rate <= -1,   # down by at least 1 per mille -> label 2
    ]
    choices = [1, 2]
    df['label'] = np.select(conditions, choices, default=0)
    return df

def analyze_class_distribution(y, title="Class distribution"):
    """Print the class distribution and the imbalance ratio."""
    print(f"\n{title}:")
    print("-" * 40)

    counts = Counter(y)
    total = len(y)
    for label in sorted(counts.keys()):
        count = counts[label]
        percentage = count / total * 100
        print(f"Label {label}: {count} samples ({percentage:.2f}%)")

    # Imbalance ratio = majority-class count / minority-class count
    max_count = max(counts.values())
    min_count = min(counts.values())
    imbalance_ratio = max_count / min_count
    print(f"Imbalance ratio: {imbalance_ratio:.2f}:1")
    print()
    return counts

def plot_class_distribution(y_original, y_resampled=None, title="Class distribution"):
    """Visualise the class distribution, optionally before vs after resampling."""
    if y_resampled is not None:
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

        # Original distribution
        pd.Series(y_original).value_counts().sort_index().plot(kind='bar', ax=ax1, color='skyblue')
        ax1.set_title('Original distribution')
        ax1.set_xlabel('Class')
        ax1.set_ylabel('Samples')

        # Distribution after resampling
        pd.Series(y_resampled).value_counts().sort_index().plot(kind='bar', ax=ax2, color='lightcoral')
        ax2.set_title('Distribution after resampling')
        ax2.set_xlabel('Class')
        ax2.set_ylabel('Samples')
        plt.tight_layout()
    else:
        plt.figure(figsize=(8, 5))
        pd.Series(y_original).value_counts().sort_index().plot(kind='bar', color='skyblue')
        plt.title(title)
        plt.xlabel('Class')
        plt.ylabel('Samples')
        plt.xticks(rotation=0)
    plt.show()

def evaluate_resampling_methods(X_train_enc, y_train, X_test_enc, y_test):
    """Evaluate the different resampling methods with the same random forest."""
    # Define the resampling methods
    resampling_methods = {
        'Original': None,
        'Random Over-sampling': RandomOverSampler(random_state=42),
        'SMOTE': SMOTE(random_state=42, k_neighbors=3),   # smaller k to cope with small minority classes
        'ADASYN': ADASYN(random_state=42, n_neighbors=3),
        'Random Under-sampling': RandomUnderSampler(random_state=42),
        'SMOTEENN': SMOTEENN(random_state=42),
        'SMOTETomek': SMOTETomek(random_state=42)
    }
    results = []
    print("Evaluating resampling methods:")
    print("=" * 60)
    for method_name, sampler in resampling_methods.items():
        print(f"\n{method_name}:")
        print("-" * 40)
        try:
            # Apply the resampler (None = keep the original training set)
            if sampler is None:
                X_resampled, y_resampled = X_train_enc, y_train
            else:
                X_resampled, y_resampled = sampler.fit_resample(X_train_enc, y_train)

            # Inspect the distribution after resampling
            analyze_class_distribution(y_resampled, f"{method_name} after resampling")

            # Train the model; without resampling, fall back to class-weight balancing
            clf = RandomForestClassifier(
                n_estimators=100, random_state=42,
                class_weight=None if sampler is not None else 'balanced'
            )
            clf.fit(X_resampled, y_resampled)

            # Predict on the untouched test set
            y_pred = clf.predict(X_test_enc)
            y_pred_proba = clf.predict_proba(X_test_enc)

            # Evaluation metrics
            accuracy = accuracy_score(y_test, y_pred)
            balanced_acc = balanced_accuracy_score(y_test, y_pred)
            f1_macro = f1_score(y_test, y_pred, average='macro')
            f1_weighted = f1_score(y_test, y_pred, average='weighted')

            # Multi-class AUC (one-vs-rest, macro-averaged)
            try:
                auc_score = roc_auc_score(y_test, y_pred_proba, multi_class='ovr', average='macro')
            except Exception:
                auc_score = np.nan

            # Store the results
            result = {
                'Method': method_name,
                'Accuracy': accuracy,
                'Balanced_Accuracy': balanced_acc,
                'F1_Macro': f1_macro,
                'F1_Weighted': f1_weighted,
                'AUC_Macro': auc_score,
                'Train_Samples': len(y_resampled),
                'y_pred': y_pred,
                'model': clf
            }
            results.append(result)

            # Print the headline metrics
            print(f"Accuracy: {accuracy:.4f}")
            print(f"Balanced accuracy: {balanced_acc:.4f}")
            print(f"F1 (macro): {f1_macro:.4f}")
            print(f"F1 (weighted): {f1_weighted:.4f}")
            if not np.isnan(auc_score):
                print(f"AUC (macro): {auc_score:.4f}")
            print(f"Training samples: {len(y_resampled)}")
        except Exception as e:
            print(f"Method {method_name} failed: {str(e)}")
            continue
    return results

def plot_results_comparison(results):
“”“可视化不同方法的结果比较”“”
if not results:
print(“没有可用的结果进行比较”)
return

# 创建结果DataFrame
df_results = pd.DataFrame(results)# 设置图形
fig, axes = plt.subplots(2, 2, figsize=(15, 10))# 准确率比较
df_results.plot(x='Method', y='Accuracy', kind='bar', ax=axes[0,0], color='skyblue')
axes[0,0].set_title('准确率比较')
axes[0,0].set_ylabel('准确率')
axes[0,0].tick_params(axis='x', rotation=45)# 平衡准确率比较
df_results.plot(x='Method', y='Balanced_Accuracy', kind='bar', ax=axes[0,1], color='lightcoral')
axes[0,1].set_title('平衡准确率比较')
axes[0,1].set_ylabel('平衡准确率')
axes[0,1].tick_params(axis='x', rotation=45)# F1分数比较
df_results.plot(x='Method', y='F1_Macro', kind='bar', ax=axes[1,0], color='lightgreen')
axes[1,0].set_title('F1分数(宏平均)比较')
axes[1,0].set_ylabel('F1分数')
axes[1,0].tick_params(axis='x', rotation=45)# AUC分数比较
df_valid_auc = df_results.dropna(subset=['AUC_Macro'])
if len(df_valid_auc) > 0:df_valid_auc.plot(x='Method', y='AUC_Macro', kind='bar', ax=axes[1,1], color='orange')axes[1,1].set_title('AUC分数(宏平均)比较')axes[1,1].set_ylabel('AUC分数')axes[1,1].tick_params(axis='x', rotation=45)plt.tight_layout()
plt.show()# 打印排名表
print("\n方法性能排名:")
print("=" * 80)
ranking_df = df_results[['Method', 'Accuracy', 'Balanced_Accuracy', 'F1_Macro', 'F1_Weighted']].round(4)
ranking_df = ranking_df.sort_values('Balanced_Accuracy', ascending=False)
print(ranking_df.to_string(index=False))

def plot_confusion_matrices(results, y_test):
“”“绘制最佳几个方法的混淆矩阵”“”
# 选择平衡准确率最高的前3个方法
sorted_results = sorted(results, key=lambda x: x[‘Balanced_Accuracy’], reverse=True)[:3]

fig, axes = plt.subplots(1, len(sorted_results), figsize=(5*len(sorted_results), 4))
if len(sorted_results) == 1:axes = [axes]for idx, result in enumerate(sorted_results):cm = confusion_matrix(y_test, result['y_pred'])sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[idx])axes[idx].set_title(f"{result['Method']}\n平衡准确率: {result['Balanced_Accuracy']:.4f}")axes[idx].set_xlabel('预测标签')axes[idx].set_ylabel('真实标签')plt.tight_layout()
plt.show()

主函数

def main():
    """Run the full analysis pipeline."""

print("加载数据...")
# 读取数据
df = pd.read_csv('df7_train.csv')
df_test = pd.read_csv('df7_test_7.csv')# 创建标签
df = create_price_labels(df)
df_test = create_price_labels(df_test)# 标签向前移动(预测下一期)
df['label'] = df['label'].shift(-1)
df = df.dropna().reset_index(drop=True)df_test['label'] = df_test['label'].shift(-1)
df_test = df_test.dropna().reset_index(drop=True)# 定义特征列
feature_cols = ["vr_state", "delta_pattern", "site", "states_code", "AD_state_smooth"]# 分离特征和标签
X_train, X_test = df[feature_cols], df_test[feature_cols]
y_train, y_test = df["label"], df_test["label"]print("原始数据分析:")
print(f"训练集大小: {len(X_train)}")
print(f"测试集大小: {len(X_test)}")# 分析训练集标签分布
train_counts = analyze_class_distribution(y_train, "训练集标签分布")
test_counts = analyze_class_distribution(y_test, "测试集标签分布")# 可视化标签分布
plot_class_distribution(y_train, title="训练集标签分布")# 数据预处理
print("数据预处理...")
cat_cols = feature_cols# 对于离散特征,考虑使用LabelEncoder而不是OneHotEncoder
# 因为树模型可以直接处理数值型的离散特征
print("选择编码方式:")
print("1. OneHotEncoder - 适合无序分类特征")
print("2. LabelEncoder - 适合有序分类特征或树模型")# 这里我们提供两种选择
use_onehot = True  # 您可以根据特征性质调整if use_onehot:print("使用OneHotEncoder...")preprocessor = ColumnTransformer(transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols)],remainder="drop")X_train_enc = preprocessor.fit_transform(X_train)X_test_enc = preprocessor.transform(X_test)
else:print("使用LabelEncoder...")X_train_enc = X_train.copy()X_test_enc = X_test.copy()label_encoders = {}for col in cat_cols:le = LabelEncoder()X_train_enc[col] = le.fit_transform(X_train[col].astype(str))X_test_enc[col] = le.transform(X_test[col].astype(str))label_encoders[col] = leX_train_enc = X_train_enc.valuesX_test_enc = X_test_enc.valuesprint(f"编码后训练集形状: {X_train_enc.shape}")
print(f"编码后测试集形状: {X_test_enc.shape}")# 评估不同重采样方法
print("\n开始评估不同重采样方法...")
results = evaluate_resampling_methods(X_train_enc, y_train, X_test_enc, y_test)# 可视化结果比较
plot_results_comparison(results)# 绘制最佳方法的混淆矩阵
plot_confusion_matrices(results, y_test)# 推荐最佳方法
if results:best_result = max(results, key=lambda x: x['Balanced_Accuracy'])print(f"\n推荐方法: {best_result['Method']}")print(f"平衡准确率: {best_result['Balanced_Accuracy']:.4f}")# 详细分类报告print(f"\n{best_result['Method']} 详细分类报告:")print(classification_report(y_test, best_result['y_pred']))

if __name__ == "__main__":
    main()

Additional tips and best practices

print("""
💡 处理不平衡数据的建议:

  1. 评估指标选择:

    • 不要只看准确率,重点关注平衡准确率、F1分数
    • 对于不平衡数据,平衡准确率比普通准确率更有意义
  2. 重采样方法选择:

    • SMOTE: 适合大多数情况,但要注意过拟合
    • ADASYN: 适合噪声较多的数据
    • 组合方法(SMOTEENN, SMOTETomek): 同时处理过采样和欠采样
  3. 模型参数调整:

    • 使用class_weight=‘balanced’
    • 调整RandomForest的max_depth避免过拟合
    • 考虑使用XGBoost等对不平衡数据更友好的模型
  4. 交叉验证:

    • 使用StratifiedKFold保证每折的类别分布一致
    • 在交叉验证中应用重采样
  5. 特征工程:

    • 对于离散特征,考虑特征组合
    • 可以尝试特征选择减少噪声
      “”")

Comparison after resampling

Analysing the different sampling methods means looking at three things: how much the class imbalance improved, the model's core evaluation metrics (accuracy, balanced accuracy, F1, AUC), and how the sampling changed the data distribution. The detailed comparison follows (note: the SMOTE metrics were not fully reported, so SMOTE is discussed from the known figures plus the characteristics of the method).

I. Baseline data and the effect of sampling

First, the imbalance of the original data: class 0 is the majority class (59.02%), classes 1 and 2 are minority classes (roughly 20% each), giving an imbalance ratio of 2.90:1 (the majority class has 2.9 times the samples of a minority class). Both sampling methods aim to close this gap:

Dimension | Original | Random Over-sampling | SMOTE (synthetic minority over-sampling)
Class 0 samples (share) | 285,994 (59.02%) | 285,994 (33.33%) | 285,994 (33.33%)
Class 1 samples (share) | 100,095 (20.66%) | 285,994 (33.33%), ~185k duplicated | 285,994 (33.33%), ~185k synthesised
Class 2 samples (share) | 98,497 (20.33%) | 285,994 (33.33%), ~187k duplicated | 285,994 (33.33%), ~187k synthesised
Imbalance ratio | 2.90:1 | 1.00:1 (fully balanced) | 1.00:1 (fully balanced)
Total training samples | 484,586 | 857,982 (+77%) | 857,982 (+77%)
Sampling logic | none (original distribution) | randomly repeat existing minority samples | synthesise new minority samples from nearest neighbours (not copies)

II. Metric comparison and key conclusions

1. Reported metrics: Original vs Random Over-sampling

Metric | Original | Random Over-sampling | Interpretation
Accuracy | 0.5170 | 0.5086 | Slightly lower after resampling (-0.84%): the original model leans towards the majority class (class 0); after resampling it must also serve the minority classes, so majority-class accuracy, and with it overall accuracy, drops.
Balanced accuracy | 0.3893 | 0.3869 | Essentially flat (-0.24%): random over-sampling does not improve the per-class average accuracy, because duplicating samples adds no new information and the model still cannot separate the minority classes.
F1 (macro) | 0.3855 | 0.3821 | Slightly lower (-0.34%): macro F1 weighs all classes equally; minority-class F1 did not improve while majority-class F1 dropped.
F1 (weighted) | 0.5307 | 0.5248 | Slightly lower (-0.59%): weighted F1 is dominated by the majority class, whose performance degraded.
AUC (macro) | 0.5712 | 0.5703 | Practically unchanged (-0.09%): random over-sampling only changes the sample distribution, it adds no discriminative information, so ranking ability is the same.
2. SMOTE的指标推断与方法特性对比

SMOTE的评估指标未完全给出,但基于其“合成新样本”的特性,可结合随机过采样的缺陷推断其可能表现:

  • 优势 vs 随机过采样
    随机过采样通过“复制少数类样本”增加数量,会导致数据冗余(模型可能记住重复样本,增加过拟合风险),且未提供新的特征组合信息;
    SMOTE通过“K近邻合成新样本”(如对少数类样本A,找其5个近邻,在A与近邻的特征空间中生成新样本),能补充少数类的特征多样性,理论上更易让模型学习到少数类的真实分布,从而提升少数类的召回率(减少漏判)和F1分数。

  • 可能的指标表现

    • 平衡准确率/宏平均F1:大概率高于随机过采样(甚至高于原始数据),因合成样本提供了新信息,模型对少数类的识别能力提升;
    • 准确率:可能仍低于原始数据(因需兼顾少数类),但下降幅度可能小于随机过采样;
    • AUC分数:若合成样本质量高(未引入噪声),宏平均AUC可能略高于原始数据和随机过采样,因模型对类别边界的区分更准确。
  • SMOTE的潜在风险
    若少数类样本中存在噪声(如异常值),SMOTE会基于噪声样本合成“错误新样本”,导致模型学习到错误模式;此外,对高维数据(如编码后30维特征),近邻判断的准确性下降,合成样本的质量可能降低。

三、关键问题与优化方向

1. 随机过采样的无效性原因

随机过采样在本案例中未提升模型性能,核心问题是:

  • 无信息增益:仅复制少数类已有样本,未增加新的特征-标签关联信息,模型仍无法区分少数类与多数类的差异;
  • 过拟合风险:重复样本可能让模型对少数类的“特定样本”过度拟合,在测试集上对新的少数类样本仍无法识别。
2. 后续优化建议

基于当前结果,若目标是“提升少数类预测性能”(如类别1、2是关键业务指标,漏判代价高),可优先尝试以下方向:

  • 优先验证SMOTE的实际指标:若SMOTE的平衡准确率、宏平均F1高于原始数据,说明合成样本有效,可继续使用;若仍无效,需检查少数类样本是否存在“特征区分度低”(如少数类与多数类的特征分布高度重叠),需先优化特征工程(如增加业务相关特征)。
  • 尝试改进型过采样方法:如SMOTE的变体(SMOTEENN:SMOTE+编辑近邻法,移除合成的噪声样本;SMOTETomek:SMOTE+Tomek链接,移除类别边界的模糊样本),减少噪声对模型的干扰。
  • 结合类别权重调整:在采样基础上,给模型设置 class_weight="balanced"(如随机森林的 class_weight 参数),进一步提升少数类的权重,强制模型关注少数类。
  • 考虑欠采样(针对多数类):若样本量过大(85万样本训练耗时久),可尝试对多数类进行欠采样(如随机删除部分多数类样本),但需注意避免删除关键信息(建议用“聚类欠采样”,保留多数类的核心聚类)。
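The last bullet mentions cluster-based under-sampling of the majority class; a minimal sketch with imblearn's ClusterCentroids (synthetic data, illustrative parameters):

from collections import Counter

from imblearn.under_sampling import ClusterCentroids
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=6000, n_features=10, n_classes=3,
                           n_informative=5, weights=[0.6, 0.2, 0.2],
                           random_state=42)
print("before:", Counter(y))

# ClusterCentroids replaces majority-class samples with KMeans centroids,
# shrinking the majority class while trying to preserve its structure.
cc = ClusterCentroids(random_state=42)
X_res, y_res = cc.fit_resample(X, y)
print("after :", Counter(y_res))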

四、总结

  1. 原始数据的核心问题:模型倾向于“偏向多数类”,对少数类的识别能力弱(平衡准确率仅0.3893,远低于理想值),但样本分布是根本原因,而非模型本身。
  2. 随机过采样的结论:仅解决了“样本数量不平衡”,未解决“信息不平衡”,对模型性能无正向提升,甚至略有下降,不建议单独使用。
  3. SMOTE的预期价值:理论上优于随机过采样,需实际验证指标;若SMOTE仍无效,需优先优化特征工程(提升少数类与多数类的特征区分度),再考虑调整采样或模型参数。

最终选择哪种采样方法,需结合业务目标(如是否优先保证少数类召回率)和模型指标(如平衡准确率、宏平均F1)综合判断,而非单纯追求“样本分布平衡”。

Which indicators contain inf values?

Some errors turned up when computing the RSI indicator; those values need to be excluded (a guarded RSI sketch follows).
Some additional indicators could also be added.
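The RSI code that produced the inf values is not shown here, but the usual culprit is dividing the average gain by an average loss of zero over flat or one-directional windows. A hedged sketch of a Wilder-style RSI with an explicit guard, assuming a DataFrame with a 'close' column:

import numpy as np
import pandas as pd

def rsi(close: pd.Series, period: int = 14) -> pd.Series:
    """Wilder-smoothed RSI with an explicit guard against inf values."""
    delta = close.diff()
    gain = delta.clip(lower=0)
    loss = -delta.clip(upper=0)
    avg_gain = gain.ewm(alpha=1 / period, min_periods=period).mean()
    avg_loss = loss.ewm(alpha=1 / period, min_periods=period).mean()
    rsi_val = 100 - 100 / (1 + avg_gain / avg_loss)
    # avg_loss == 0 (no down moves in the window) makes the ratio inf;
    # by convention that is maximal relative strength, i.e. RSI = 100.
    rsi_val = rsi_val.mask(avg_loss == 0, 100.0)
    return rsi_val.replace([np.inf, -np.inf], np.nan)

# usage: df['rsi_14'] = rsi(df['close'], 14)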

Six resampling methods are compared (plus the unresampled baseline):

# Define the resampling methods
resampling_methods = {
    'Original': None,
    'Random Over-sampling': RandomOverSampler(random_state=42),
    'SMOTE': SMOTE(random_state=42, k_neighbors=3),   # smaller k to cope with small minority classes
    'ADASYN': ADASYN(random_state=42, n_neighbors=3),
    'Random Under-sampling': RandomUnderSampler(random_state=42),
    'SMOTEENN': SMOTEENN(random_state=42),
    'SMOTETomek': SMOTETomek(random_state=42)
}


CatBoost training configurations

    def get_catboost_configs(self):"""获取不同的CatBoost配置针对不平衡数据的不同处理策略"""base_params = {'iterations': 1000,'learning_rate': 0.1,'depth': 6,'random_seed': self.random_state,'verbose': False,'early_stopping_rounds': 100,'eval_metric': 'MultiClass','objective': 'MultiClass'}configs = {# 1. 基础配置'Basic': {**base_params,'name': 'Basic CatBoost'},# 2. 自动类别权重平衡'Auto Balanced': {**base_params,'auto_class_weights': 'Balanced','name': 'Auto Balanced Weights'},# 3. 手动类别权重(SqrtBalanced)'Sqrt Balanced': {**base_params,'auto_class_weights': 'SqrtBalanced','name': 'Sqrt Balanced Weights'},# 4. Bootstrap with Bayesian'Bootstrap Bayesian': {**base_params,'bootstrap_type': 'Bayesian','bagging_temperature': 1,'auto_class_weights': 'Balanced','name': 'Bootstrap Bayesian + Balanced'},# 5. 调整深度和学习率'Deep + Slow': {**base_params,'depth': 8,'learning_rate': 0.05,'iterations': 1500,'auto_class_weights': 'Balanced','name': 'Deep + Slow Learning'},# 6. L2正则化'L2 Regularized': {**base_params,'l2_leaf_reg': 10,'auto_class_weights': 'Balanced','name': 'L2 Regularized'}}return configs
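A minimal usage sketch for these configs (assumptions: the surrounding class exposes get_catboost_configs as above, X_train/y_train/X_val/y_val already exist, and the catboost package is installed). The 'name' key is display metadata, not a CatBoost parameter, so it is dropped before constructing the model:

from catboost import CatBoostClassifier

def train_all_configs(trainer, X_train, y_train, X_val, y_val):
    """Train one CatBoostClassifier per config; `trainer` is the object defining get_catboost_configs."""
    models = {}
    for key, config in trainer.get_catboost_configs().items():
        params = {k: v for k, v in config.items() if k != 'name'}   # 'name' is metadata only
        model = CatBoostClassifier(**params)
        model.fit(X_train, y_train, eval_set=(X_val, y_val))       # early stopping uses the eval set
        models[key] = model
    return models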

Building 120 models on the data

import pandas as pd
import numpy as np
from sklearn.model_selection import KFold, train_test_split
from sklearn.ensemble import RandomForestRegressor
from scipy.stats import pearsonr
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings('ignore')# ==================== 特征工程 ==================== #
def feature_engineering(df):"""与原始代码保持一致的特征工程"""df['exp_856P868P855P289'] = np.exp(df['X446'] + df['X76'] + df['X37'] + df['X287'])df['exp_860P868P855P289'] = np.exp(df['X3'] + df['X76'] + df['X37'] + df['X287'])df['exp_598P868P855P289'] = np.exp(df['X66'] + df['X76'] + df['X37'] + df['X287'])df['bid_ask_interaction'] = df['ask_qty'] * df['ask_qty']df['bid_buy_interaction'] = df['ask_qty'] * df['buy_qty']df['volume_weighted_sell'] = df['sell_qty'] * df['volume']df['buy_sell_ratio'] = df['buy_qty'] / (df['sell_qty'] + 1e-10)df['log_volume'] = np.log1p(df['volume'])df['net_order_flow'] = df['buy_qty'] - df['sell_qty']df['normalized_net_flow'] = df['net_order_flow'] / (df['volume'] + 1e-10)df['buying_pressure'] = df['buy_qty'] / (df['volume'] + 1e-10)df['total_depth'] = df['ask_qty'] + df['ask_qty']df['depth_imbalance'] = (df['ask_qty'] - df['ask_qty']) / (df['total_depth'] + 1e-10)df['log_depth'] = np.log1p(df['total_depth'])df['order_flow_imbalance'] = (df['buy_qty'] - df['sell_qty']) / (df['buy_qty'] + df['sell_qty'] + 1e-10)df['kyle_lambda'] = np.abs(df['net_order_flow']) / (df['volume'] + 1e-10)df['flow_toxicity'] = np.abs(df['order_flow_imbalance']) * df['volume']df['sqrt_volume'] = np.sqrt(df['volume'])df['volume_squared'] = df['volume'] ** 2df['imbalance_squared'] = df['order_flow_imbalance'] ** 2df = df.replace([np.inf, -np.inf], np.nan)df = df.fillna(0)return df# ==================== 配置 ==================== #
class Config:TRAIN_PATH = "/kaggle/input/drw-crypto-market-prediction/train.parquet"TEST_PATH = "/kaggle/input/drw-crypto-market-prediction/test.parquet"SUBMISSION_PATH = "/kaggle/input/drw-crypto-market-prediction/sample_submission.csv"SELECTED_FEATURES = ["X287", "X446", "X66", "X123", "X385", "X25", "X3", "X231","X37", "X174", "X298", "X168", "X1", "X76","buy_qty", "sell_qty", "volume", "ask_qty","bid_ask_interaction", "bid_buy_interaction", "volume_weighted_sell","buy_sell_ratio", "log_volume", "net_order_flow", "normalized_net_flow","buying_pressure", "total_depth", "depth_imbalance", "log_depth","order_flow_imbalance", "kyle_lambda", "flow_toxicity","sqrt_volume", "volume_squared", "imbalance_squared"]LABEL_COLUMN = "X683"# 交叉验证参数N_SPLIT_SCHEMES = 10   # 10种不同的拆分方案N_FOLDS = 10           # 每种方案10折交叉验证TEST_SIZE = 0.15       # 留出15%作为最终测试集(可选)USE_HOLDOUT_TEST = True  # 是否使用留出测试集RANDOM_STATE = 42# ==================== 随机森林参数配置 ==================== #
def get_rf_params(scheme_idx, fold_idx):"""为每个模型生成不同的随机森林参数"""np.random.seed(Config.RANDOM_STATE + scheme_idx * 100 + fold_idx)params = {'n_estimators': np.random.choice([50, 100, 150, 200, 250]),'max_depth': np.random.choice([10, 15, 20, 25, 30, None]),'min_samples_split': np.random.choice([2, 5, 10, 15]),'min_samples_leaf': np.random.choice([1, 2, 4, 8]),'max_features': np.random.choice(['sqrt', 'log2', 0.3, 0.5, 0.7]),'bootstrap': True,'random_state': Config.RANDOM_STATE + scheme_idx * 100 + fold_idx,'n_jobs': -1,'verbose': 0}return params# ==================== 标准交叉验证流程 ==================== #
def train_with_cv(train_df, test_df, use_holdout=True):"""标准交叉验证流程:如果use_holdout=True:1. 先划分出holdout测试集(15%)并锁定2. 对剩余85%进行10种方案×10折CV3. 每个模型在holdout集上评估如果use_holdout=False:1. 直接对全部训练数据进行10种方案×10折CV2. 使用OOF预测评估模型性能"""X_full = train_df[Config.SELECTED_FEATURES]y_full = train_df[Config.LABEL_COLUMN]X_test = test_df[Config.SELECTED_FEATURES]# ========== 第一步:是否留出测试集 ========== #if use_holdout:print(f"📊 使用留出法:保留{Config.TEST_SIZE*100:.0f}%数据作为Holdout测试集")X_trainval, X_holdout, y_trainval, y_holdout = train_test_split(X_full, y_full,test_size=Config.TEST_SIZE,random_state=Config.RANDOM_STATE,shuffle=True)print(f"   训练+验证集: {len(X_trainval):,} | Holdout测试集: {len(X_holdout):,}")else:print(f"📊 不使用留出法:使用全部训练数据进行交叉验证")X_trainval, y_trainval = X_full, y_fullX_holdout, y_holdout = None, Noneprint(f"   训练+验证集: {len(X_trainval):,}")# ========== 第二步:存储结果 ========== #all_models = []all_scores = {'cv_scores': [], 'holdout_scores': [] if use_holdout else None}# OOF预测(Out-of-Fold)oof_predictions_all_schemes = {}  # 每个方案的OOF预测# 测试集预测test_predictions = []print("\n" + "="*100)print(f"开始训练 {Config.N_SPLIT_SCHEMES} 种拆分方案 × {Config.N_FOLDS} 折 = {Config.N_SPLIT_SCHEMES * Config.N_FOLDS} 个模型")print("="*100)model_counter = 0# ========== 第三步:外层循环 - 10种拆分方案 ========== #for scheme_idx in range(Config.N_SPLIT_SCHEMES):print(f"\n{'='*100}")print(f"【拆分方案 {scheme_idx + 1}/{Config.N_SPLIT_SCHEMES}】")print(f"{'='*100}")# 为当前方案创建OOF数组oof_predictions = np.zeros(len(X_trainval))# 创建KFold对象(每个方案用不同的random_state)scheme_random_state = Config.RANDOM_STATE + scheme_idx * 1000kf = KFold(n_splits=Config.N_FOLDS,shuffle=True,random_state=scheme_random_state)scheme_models = []# ========== 第四步:内层循环 - 10折交叉验证 ========== #for fold_idx, (train_idx, valid_idx) in enumerate(kf.split(X_trainval), start=1):model_counter += 1# 获取当前fold的数据if isinstance(X_trainval, pd.DataFrame):X_train = X_trainval.iloc[train_idx]X_valid = X_trainval.iloc[valid_idx]y_train = y_trainval.iloc[train_idx]y_valid = y_trainval.iloc[valid_idx]else:X_train = X_trainval[train_idx]X_valid = X_trainval[valid_idx]y_train = y_trainval[train_idx]y_valid = y_trainval[valid_idx]# 获取模型参数params = get_rf_params(scheme_idx, fold_idx)# 训练模型model = RandomForestRegressor(**params)model.fit(X_train, y_train)# ========== 第五步:多种评估 ========== ## 1. 在验证集上评估(CV评估)valid_pred = model.predict(X_valid)cv_score = pearsonr(y_valid, valid_pred)[0]all_scores['cv_scores'].append(cv_score)# 2. 保存OOF预测oof_predictions[valid_idx] = valid_pred# 3. 在holdout测试集上评估(如果有)holdout_score = Noneif use_holdout:holdout_pred = model.predict(X_holdout)holdout_score = pearsonr(y_holdout, holdout_pred)[0]all_scores['holdout_scores'].append(holdout_score)# 4. 
对最终测试集预测test_pred = model.predict(X_test)test_predictions.append(test_pred)# 保存模型信息model_info = {'model': model,'scheme_idx': scheme_idx + 1,'fold_idx': fold_idx,'params': params,'cv_score': cv_score,'holdout_score': holdout_score,'train_size': len(train_idx),'valid_size': len(valid_idx)}all_models.append(model_info)scheme_models.append(model_info)# 打印进度if use_holdout:print(f"  Fold {fold_idx:2d}/10 | 模型#{model_counter:3d} | "f"CV={cv_score:+.4f} | Holdout={holdout_score:+.4f} | "f"n_est={params['n_estimators']}, max_depth={params['max_depth']}")else:print(f"  Fold {fold_idx:2d}/10 | 模型#{model_counter:3d} | "f"CV={cv_score:+.4f} | "f"n_est={params['n_estimators']}, max_depth={params['max_depth']}")# ========== 第六步:计算当前方案的OOF分数 ========== #if isinstance(y_trainval, pd.Series):oof_score = pearsonr(y_trainval.values, oof_predictions)[0]else:oof_score = pearsonr(y_trainval, oof_predictions)[0]oof_predictions_all_schemes[f'scheme_{scheme_idx+1}'] = oof_predictions# 方案小结scheme_cv_scores = [m['cv_score'] for m in scheme_models]if use_holdout:scheme_holdout_scores = [m['holdout_score'] for m in scheme_models]print(f"\n  📊 方案{scheme_idx+1}总结:")print(f"     CV平均: {np.mean(scheme_cv_scores):+.4f}{np.std(scheme_cv_scores):.4f})")print(f"     Holdout平均: {np.mean(scheme_holdout_scores):+.4f}{np.std(scheme_holdout_scores):.4f})")print(f"     OOF分数: {oof_score:+.4f}")else:print(f"\n  📊 方案{scheme_idx+1}总结:")print(f"     CV平均: {np.mean(scheme_cv_scores):+.4f}{np.std(scheme_cv_scores):.4f})")print(f"     OOF分数: {oof_score:+.4f}")print(f"\n{'='*100}")print(f"✅ 训练完成!共训练 {len(all_models)} 个模型")print(f"{'='*100}")return all_models, all_scores, test_predictions, oof_predictions_all_schemes, (X_holdout, y_holdout)# ==================== 模型性能分析 ==================== #
def analyze_models(all_models, all_scores, oof_predictions_all_schemes, y_trainval, holdout_data=None):"""详细分析模型性能"""print("\n" + "="*100)print("【模型性能分析】")print("="*100)cv_scores = np.array(all_scores['cv_scores'])# 1. CV分数统计print(f"\n📊 交叉验证(CV)分数统计:")print(f"   平均分: {cv_scores.mean():+.4f}")print(f"   标准差: {cv_scores.std():.4f}")print(f"   最高分: {cv_scores.max():+.4f}")print(f"   最低分: {cv_scores.min():+.4f}")print(f"   中位数: {np.median(cv_scores):+.4f}")# 2. Holdout分数统计(如果有)if all_scores['holdout_scores'] is not None:holdout_scores = np.array(all_scores['holdout_scores'])print(f"\n📊 Holdout测试集分数统计:")print(f"   平均分: {holdout_scores.mean():+.4f}")print(f"   标准差: {holdout_scores.std():.4f}")print(f"   最高分: {holdout_scores.max():+.4f}")print(f"   最低分: {holdout_scores.min():+.4f}")print(f"   中位数: {np.median(holdout_scores):+.4f}")# 3. OOF分数统计print(f"\n📊 OOF (Out-of-Fold) 分数统计:")for scheme_name, oof_pred in oof_predictions_all_schemes.items():if isinstance(y_trainval, pd.Series):oof_score = pearsonr(y_trainval.values, oof_pred)[0]else:oof_score = pearsonr(y_trainval, oof_pred)[0]print(f"   {scheme_name}: {oof_score:+.4f}")# 4. 各方案表现对比print(f"\n📈 各拆分方案的平均表现:")for scheme_idx in range(Config.N_SPLIT_SCHEMES):scheme_models = [m for m in all_models if m['scheme_idx'] == scheme_idx + 1]scheme_cv_scores = [m['cv_score'] for m in scheme_models]if all_scores['holdout_scores'] is not None:scheme_holdout_scores = [m['holdout_score'] for m in scheme_models]print(f"   方案{scheme_idx+1}: CV={np.mean(scheme_cv_scores):+.4f}{np.std(scheme_cv_scores):.4f}) | "f"Holdout={np.mean(scheme_holdout_scores):+.4f}{np.std(scheme_holdout_scores):.4f})")else:print(f"   方案{scheme_idx+1}: CV={np.mean(scheme_cv_scores):+.4f}{np.std(scheme_cv_scores):.4f})")
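The training loop above collects one test-set prediction per model in test_predictions. A minimal sketch of turning them into a single submission by simple averaging (the prediction column name in sample_submission is an assumption; adjust it to the actual file):

# Average the per-model predictions (stacked shape: n_models x n_test_rows).
final_pred = np.mean(np.vstack(test_predictions), axis=0)

submission = pd.read_csv(Config.SUBMISSION_PATH)
submission.iloc[:, -1] = final_pred   # assumes the prediction column is the last one
submission.to_csv("submission.csv", index=False)
print("saved submission.csv with", len(final_pred), "rows")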
