
How to Use Feature Extraction on Tabular Data for Machine Learning

news 2025/11/2 7:08:39

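The performance of a machine learning predictive model depends on your data, and your data is only as good as the way you prepare it for modeling.

The most common approach to data preparation is to study the dataset, review the expectations of the machine learning algorithm, and then carefully choose the data preparation techniques that transform the raw data to best meet those expectations. This is slow, expensive, and requires a great deal of expertise.

An alternative approach is to apply a suite of commonly used data preparation techniques to the raw data in parallel and combine the results of all the transforms into one large dataset on which a model can be fit and evaluated.

This is an alternative philosophy for data preparation that treats data transforms as a way to extract salient features from the raw data and expose the structure of the problem to the learning algorithm. It requires a learning algorithm that can weight the input features and rely on those most relevant to the target being predicted.

This approach requires less expertise, is computationally cheaper than a full grid search of data preparation methods, and can help discover unintuitive data preparation solutions that achieve good or best performance for a given predictive modeling problem.

In this tutorial, you will discover how to use feature extraction for data preparation with tabular data.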

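After completing this tutorial, you will know:

Feature extraction provides an alternative approach to data preparation for tabular data, in which all data transforms are applied in parallel to the raw input data and combined to create one large dataset.
How to use the feature extraction approach to data preparation to improve model performance on a standard classification dataset.
How to add feature selection to the feature extraction modeling pipeline to give a further lift in modeling performance on a standard dataset.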

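Tutorial Overview

This tutorial is divided into three parts; they are:

Feature Extraction Technique for Data Preparation
Dataset and Performance Baseline
1. Wine Classification Dataset
2. Baseline Model Performance
Feature Extraction Approach to Data Preparation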

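Feature Extraction Technique for Data Preparation

Data preparation can be challenging.

The approach most often prescribed and followed is to analyze the dataset, review the requirements of the algorithms, and transform the raw data to best meet the expectations of those algorithms.

This can be effective, but it is also slow and requires deep expertise in both data analysis and machine learning algorithms.

An alternative approach is to treat the preparation of the input variables as a hyperparameter of the modeling pipeline and to tune it along with the choice of algorithm and the algorithm configuration.

This too can be effective, exposing unintuitive solutions while requiring very little expertise, although it can be computationally expensive.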
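To make the hyperparameter view concrete, here is a minimal sketch in which the scaling step of a pipeline is searched over like any other parameter. The synthetic dataset from make_classification, the step name 'prep', and the four candidate options are assumptions chosen for illustration only; they do not come from the tutorial.

```python
# sketch: tune the data preparation step as a pipeline hyperparameter
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import StandardScaler

# assumption: a small synthetic dataset stands in for real tabular data
X, y = make_classification(n_samples=200, n_features=10, random_state=1)
# the 'prep' step is a placeholder that the grid search will replace
pipeline = Pipeline(steps=[('prep', StandardScaler()), ('m', LogisticRegression(solver='liblinear'))])
# candidate data preparation options; 'passthrough' means no preparation at all
param_grid = {'prep': ['passthrough', MinMaxScaler(), StandardScaler(), RobustScaler()]}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
search = GridSearchCV(pipeline, param_grid, scoring='accuracy', cv=cv)
search.fit(X, y)
# report the best preparation option and its cross-validated accuracy
print(search.best_params_, search.best_score_)
```

Each extra candidate, and each extra step that is tuned, multiplies the number of model fits, which is where the computational cost comes from.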

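A middle ground between these two approaches is to treat the transformation of the input data as a feature engineering or feature extraction procedure. This involves applying a suite of commonly used data preparation techniques to the raw data, aggregating all of the resulting features into one large dataset, and then fitting and evaluating a model on this data.

The philosophy of the approach treats each data preparation technique as a transform that extracts salient features from the raw data and presents them to the learning algorithm. Ideally, such transforms untangle complex relationships and compound input variables, in turn allowing the use of simpler modeling algorithms, such as linear machine learning techniques.

For lack of a better name, we will refer to this as the "Feature Engineering Method" or the "Feature Extraction Method" for configuring data preparation for a predictive modeling project.

It allows data analysis and algorithm expertise to inform the selection of data preparation methods, while still allowing unintuitive solutions to be found at a much lower computational cost.

The number of input features can also be addressed explicitly with feature selection techniques, which attempt to rank the large number of extracted features by importance or value and select only the small subset most relevant to predicting the target variable.

We can explore this approach to data preparation with a worked example.

Before we dive into a worked example, let's first select a standard dataset and develop a baseline in performance.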

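Dataset and Performance Baseline

In this section, we will first select a standard machine learning dataset and establish a baseline in performance on it. This provides the context for exploring the feature extraction approach to data preparation in the next section.

Wine Classification Dataset

We will use the wine classification dataset.

This dataset has 13 input variables that describe the chemical composition of wine samples and requires that each wine be classified as one of three types.

You can learn more about the dataset here:

Wine Dataset (wine.csv)
Wine Dataset Description (wine.names)

No need to download the dataset, as we will download it automatically as part of our worked examples.

Open the dataset and review the raw data. The first few rows of data are listed below.

We can see that it is a multi-class classification predictive modeling problem with numerical input variables, each of which has a different scale.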

14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065,1

13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050,1

13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185,1

14.37,1.95,2.5,16.8,113,3.85,3.49,.24,2.18,7.8,.86,3.45,1480,1

13.24,2.59,2.87,21,118,2.8,2.69,.39,1.82,4.32,1.04,2.93,735,1

…

The example below loads the dataset, splits it into input and output columns, and then summarizes the shape of the data arrays.

# example of loading and summarizing the wine dataset
from pandas import read_csv
# define the location of the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wine.csv'
# load the dataset as a data frame
df = read_csv(url, header=None)
# retrieve the numpy array
data = df.values
# split the columns into input and output variables
X, y = data[:, :-1], data[:, -1]
# summarize the shape of the loaded data
print(X.shape, y.shape)

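Running the example, we can see that the dataset was loaded correctly and that there are 178 rows of data with 13 input variables and a single target variable.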

(178, 13) (178,)
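The differing scales of the input variables can be checked directly by printing the minimum and maximum of each column. This is a minimal sketch that assumes the copy of the wine data bundled with scikit-learn (load_wine) matches wine.csv; it is used here only to avoid the download.

```python
# sketch: summarize the range of each input variable in the wine dataset
from sklearn.datasets import load_wine

# assumption: the bundled copy of the dataset is a stand-in for wine.csv
X, y = load_wine(return_X_y=True)
print(X.shape, y.shape)
# per-column minimum and maximum
mins, maxs = X.min(axis=0), X.max(axis=0)
for i, (low, high) in enumerate(zip(mins, maxs)):
    print('column %d: min=%.2f, max=%.2f' % (i, low, high))
# the width of each column's range, used to compare scales
ranges = maxs - mins
```

Some columns span less than one unit while others span more than a thousand, which is the kind of difference that scaling transforms are designed to remove.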

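Next, let's evaluate a model on this dataset and establish a baseline in performance.

Baseline Model Performance

We can establish a performance baseline on the wine classification task by evaluating a model on the raw input data.

In this case, we will evaluate a logistic regression model.

First, we can perform minimal data preparation by ensuring that the input variables are numeric and that the target variable is label encoded, as expected by the scikit-learn library.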

...
# minimally prepare dataset
X = X.astype('float')
y = LabelEncoder().fit_transform(y.astype('str'))
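As a small illustration of what the label encoding step does, the sketch below encodes a handful of made-up class values. The four values in y_raw are hypothetical and are not rows from the dataset.

```python
# sketch: label encoding of class values that were loaded as floats
from numpy import array
from sklearn.preprocessing import LabelEncoder

# hypothetical target values, stored as floats the way a numeric array would hold them
y_raw = array([1.0, 2.0, 3.0, 1.0])
encoder = LabelEncoder()
# convert to strings first, then map each distinct string to an integer
y_encoded = encoder.fit_transform(y_raw.astype('str'))
print(encoder.classes_)
print(y_encoded)
```

The distinct string labels are sorted and mapped to the integers 0, 1, and 2, so the example prints ['1.0' '2.0' '3.0'] followed by [0 1 2 0].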

Next, we can define our predictive model.

...
# define the model
model = LogisticRegression(solver='liblinear')

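We will evaluate the model using the gold standard of repeated stratified k-fold cross-validation, with 10 folds and three repeats.

Model performance will be assessed using classification accuracy.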

...
model = LogisticRegression(solver='liblinear')
# define the cross-validation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
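The following sketch shows what this cross-validation configuration implies: 10 folds repeated three times gives 30 train/test splits, each of which preserves the class proportions as closely as possible. It assumes the bundled load_wine copy of the dataset as a stand-in for wine.csv.

```python
# sketch: inspect the splits produced by repeated stratified k-fold cross-validation
from numpy import bincount
from sklearn.datasets import load_wine
from sklearn.model_selection import RepeatedStratifiedKFold

# assumption: the bundled copy of the dataset is a stand-in for wine.csv
X, y = load_wine(return_X_y=True)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# total number of train/test evaluations: folds multiplied by repeats
n_splits = cv.get_n_splits()
print('total evaluations: %d' % n_splits)
# look at the class balance in the first split only
train_ix, test_ix = next(cv.split(X, y))
print('train class counts:', bincount(y[train_ix]))
print('test class counts:', bincount(y[test_ix]))
```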

At the end of the run, we will report the mean and standard deviation of the accuracy scores collected across all repeats and folds.

...
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

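Tying this together, the complete example of evaluating a logistic regression model on the raw wine classification dataset is listed below.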

# baseline model performance on the wine dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wine.csv'
df = read_csv(url, header=None)
data = df.values
X, y = data[:, :-1], data[:, -1]
# minimally prepare dataset
X = X.astype('float')
y = LabelEncoder().fit_transform(y.astype('str'))
# define the model
model = LogisticRegression(solver='liblinear')
# define the cross-validation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

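Running the example evaluates the model and reports the mean and standard deviation of the classification accuracy.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that the logistic regression model fit on the raw input data achieved a mean classification accuracy of about 95.3 percent, providing a baseline in performance.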

Accuracy: 0.953 (0.048)

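Next, let's explore whether we can improve performance using the feature-extraction-based approach to data preparation.

Feature Extraction Approach to Data Preparation

In this section, we will explore whether we can improve performance using the feature extraction approach to data preparation.

The first step is to select a suite of commonly used data preparation techniques.

In this case, given that the input variables are numeric, we will use a range of transforms that change the scale of the input variables, such as MinMaxScaler, StandardScaler, and RobustScaler, as well as transforms that change the distribution of the input variables, such as QuantileTransformer and KBinsDiscretizer. Finally, we will also use transforms that remove linear dependencies between the input variables, such as PCA and TruncatedSVD.

The FeatureUnion class can be used to define a list of transforms to perform, the results of which are aggregated, i.e. unioned, together. This creates a new dataset with a large number of columns.

An estimate of the number of columns is 13 input variables times five transforms, or 65, plus the 14 columns output by the PCA and SVD dimensionality reduction methods, for a total of 79 features.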

...
# transforms for the feature union
transforms = list()
transforms.append(('mms', MinMaxScaler()))
transforms.append(('ss', StandardScaler()))
transforms.append(('rs', RobustScaler()))
transforms.append(('qt', QuantileTransformer(n_quantiles=100, output_distribution='normal')))
transforms.append(('kbd', KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')))
transforms.append(('pca', PCA(n_components=7)))
transforms.append(('svd', TruncatedSVD(n_components=7)))
# create the feature union
fu = FeatureUnion(transforms)
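The column estimate can be checked by fitting the union on the full dataset and inspecting the shape of the result. This is a minimal sketch that assumes the bundled load_wine copy of the dataset in place of wine.csv; fitting on all rows is done here only to count columns, not to evaluate a model.

```python
# sketch: confirm the number of columns produced by the feature union
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import StandardScaler

# assumption: the bundled copy of the dataset is a stand-in for wine.csv
X, y = load_wine(return_X_y=True)
# the same seven transforms used in the tutorial
transforms = list()
transforms.append(('mms', MinMaxScaler()))
transforms.append(('ss', StandardScaler()))
transforms.append(('rs', RobustScaler()))
transforms.append(('qt', QuantileTransformer(n_quantiles=100, output_distribution='normal')))
transforms.append(('kbd', KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')))
transforms.append(('pca', PCA(n_components=7)))
transforms.append(('svd', TruncatedSVD(n_components=7)))
fu = FeatureUnion(transforms)
# apply every transform and stack the outputs side by side
Xt = fu.fit_transform(X)
# 5 transforms x 13 columns + 7 PCA columns + 7 SVD columns = 79
print(Xt.shape)
```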

We can then create a modeling pipeline with the FeatureUnion as the first step and the logistic regression model as the final step.

...
# define the model
model = LogisticRegression(solver='liblinear')
# define the pipeline
steps = list()
steps.append(('fu', fu))
steps.append(('m', model))
pipeline = Pipeline(steps=steps)

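The pipeline can then be evaluated using repeated stratified k-fold cross-validation, as before.

Tying this together, the complete example is listed below.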

# data preparation as feature engineering for wine dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wine.csv'
df = read_csv(url, header=None)
data = df.values
X, y = data[:, :-1], data[:, -1]
# minimally prepare dataset
X = X.astype('float')
y = LabelEncoder().fit_transform(y.astype('str'))
# transforms for the feature union
transforms = list()
transforms.append(('mms', MinMaxScaler()))
transforms.append(('ss', StandardScaler()))
transforms.append(('rs', RobustScaler()))
transforms.append(('qt', QuantileTransformer(n_quantiles=100, output_distribution='normal')))
transforms.append(('kbd', KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')))
transforms.append(('pca', PCA(n_components=7)))
transforms.append(('svd', TruncatedSVD(n_components=7)))
# create the feature union
fu = FeatureUnion(transforms)
# define the model
model = LogisticRegression(solver='liblinear')
# define the pipeline
steps = list()
steps.append(('fu', fu))
steps.append(('m', model))
pipeline = Pipeline(steps=steps)
# define the cross-validation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

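Running the example evaluates the model and reports the mean and standard deviation of the classification accuracy.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see a lift in performance over the 95.3 percent achieved in the previous section, to a mean classification accuracy of about 96.8 percent.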

Accuracy: 0.968 (0.037)

Try adding more data preparation methods to the FeatureUnion to see if you can improve the performance.

Can you get better results?
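As one possible starting point, the sketch below adds two more transforms to the union and evaluates the pipeline in the same way. The choice of MaxAbsScaler and Normalizer is an assumption made purely for illustration, as is the use of the bundled load_wine copy of the dataset; no claim is made that these additions improve the score.

```python
# sketch: extend the feature union with two additional transforms
from numpy import mean
from numpy import std
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.preprocessing import MaxAbsScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import StandardScaler

# assumption: the bundled copy of the dataset is a stand-in for wine.csv
X, y = load_wine(return_X_y=True)
# the seven transforms from the tutorial
transforms = [
    ('mms', MinMaxScaler()),
    ('ss', StandardScaler()),
    ('rs', RobustScaler()),
    ('qt', QuantileTransformer(n_quantiles=100, output_distribution='normal')),
    ('kbd', KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')),
    ('pca', PCA(n_components=7)),
    ('svd', TruncatedSVD(n_components=7)),
]
# illustrative additions, not part of the original example
transforms.append(('mas', MaxAbsScaler()))
transforms.append(('norm', Normalizer()))
# count the columns the extended union produces: 79 + 13 + 13
n_columns = FeatureUnion(transforms).fit_transform(X).shape[1]
print('columns: %d' % n_columns)
# evaluate the extended pipeline with the same procedure as before
pipeline = Pipeline(steps=[('fu', FeatureUnion(transforms)), ('m', LogisticRegression(solver='liblinear'))])
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv)
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
```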

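We can also use feature selection to reduce the approximately 80 extracted features to the subset most relevant to the model. In addition to reducing the complexity of the model, this can also lift performance by removing irrelevant and redundant input features.

In this case, we will use the Recursive Feature Elimination (RFE) technique for feature selection and configure it to select the 15 most relevant features.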

...
# define the feature selection
rfe = RFE(estimator=LogisticRegression(solver='liblinear'), n_features_to_select=15)

We can then add the RFE feature selection to the modeling pipeline after the FeatureUnion and before the LogisticRegression algorithm.

...
# define the pipeline
steps = list()
steps.append(('fu', fu))
steps.append(('rfe', rfe))
steps.append(('m', model))
pipeline = Pipeline(steps=steps)

Tying this together, the complete example of the feature extraction data preparation method with feature selection is listed below.

# data preparation as feature engineering with feature selection for wine dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.feature_selection import RFE
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wine.csv'
df = read_csv(url, header=None)
data = df.values
X, y = data[:, :-1], data[:, -1]
# minimally prepare dataset
X = X.astype('float')
y = LabelEncoder().fit_transform(y.astype('str'))
# transforms for the feature union
transforms = list()
transforms.append(('mms', MinMaxScaler()))
transforms.append(('ss', StandardScaler()))
transforms.append(('rs', RobustScaler()))
transforms.append(('qt', QuantileTransformer(n_quantiles=100, output_distribution='normal')))
transforms.append(('kbd', KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')))
transforms.append(('pca', PCA(n_components=7)))
transforms.append(('svd', TruncatedSVD(n_components=7)))
# create the feature union
fu = FeatureUnion(transforms)
# define the feature selection
rfe = RFE(estimator=LogisticRegression(solver='liblinear'), n_features_to_select=15)
# define the model
model = LogisticRegression(solver='liblinear')
# define the pipeline
steps = list()
steps.append(('fu', fu))
steps.append(('rfe', rfe))
steps.append(('m', model))
pipeline = Pipeline(steps=steps)
# define the cross-validation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

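Running the example evaluates the model and reports the mean and standard deviation of the classification accuracy.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Again, we can see a further lift in performance, from 96.8 percent with all extracted features to about 98.9 percent with feature selection applied before modeling.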
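It can be useful to see which transforms the 15 selected features came from. The sketch below fits the pipeline once on the full dataset and counts the selected columns per transform, using the known output width of each transform to slice the RFE mask. It assumes the bundled load_wine copy of the dataset in place of wine.csv, and the widths list is written by hand to match the order in which the transforms were added.

```python
# sketch: count how many RFE-selected features come from each transform
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import StandardScaler

# assumption: the bundled copy of the dataset is a stand-in for wine.csv
X, y = load_wine(return_X_y=True)
transforms = [
    ('mms', MinMaxScaler()),
    ('ss', StandardScaler()),
    ('rs', RobustScaler()),
    ('qt', QuantileTransformer(n_quantiles=100, output_distribution='normal')),
    ('kbd', KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')),
    ('pca', PCA(n_components=7)),
    ('svd', TruncatedSVD(n_components=7)),
]
rfe = RFE(estimator=LogisticRegression(solver='liblinear'), n_features_to_select=15)
model = LogisticRegression(solver='liblinear')
pipeline = Pipeline(steps=[('fu', FeatureUnion(transforms)), ('rfe', rfe), ('m', model)])
# fit once on all data, only to inspect the selection (not to estimate performance)
pipeline.fit(X, y)
# boolean mask over the 79 union columns, True where a column was kept
support = pipeline.named_steps['rfe'].support_
# output width of each transform, in the order they were added to the union
widths = [13, 13, 13, 13, 13, 7, 7]
start = 0
for (name, _), width in zip(transforms, widths):
    n_kept = int(support[start:start + width].sum())
    print('%s: %d of %d columns kept' % (name, n_kept, width))
    start += width
```

The counts describe a single fit on all of the data; the features chosen within each cross-validation fold may differ.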