Learning StandardScaler and MinMaxScaler
The debts from glossing over things back then all have to be repaid eventually!
- Normalization
- StandardScaler and MinMaxScaler
  - StandardScaler
  - MinMaxScaler
- Understanding the source code
  - Basic mathematical principle
  - The fit method
  - The transform method is the most important
  - Verification code
  - Results
- MinMaxScaler
  - Verification code
  - Examples
Dear organization, here is how it all happened: ......
I wrote my own normalization function, ran a linear neural network with it, and what came out was a mess. After going back and forth, the problem had to be my normalization function.
Calling the library version really is more convenient.
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
Normalization
We all know normalization: like when training on images, you first divide by 255, supply a mean and standard deviation matrix, and hand everything to the function, done.
A simple way to think about it: you cannot let quantities around 1e6 and quantities around 1e-1 be fed into the computation together as different features.
The three features [1000, 0.1, 2] obviously receive different amounts of attention. You can see this through the attention mechanism: attention is just the features multiplied by an attention matrix before computing the next layer.
A rough picture is x_scaled = (x - u) / sigma: subtract the mean, divide by the standard deviation, and if I remember right that gives you a standard normal.
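To make that concrete, here is a minimal NumPy sketch (my own toy data, not from any library docs) of the (x - u) / sigma idea: two features on wildly different scales become comparable after standardization.
import numpy as np

x = np.array([[1000.0, 0.1],
              [2000.0, 0.3],
              [3000.0, 0.2]])

u = x.mean(axis=0)        # per-feature mean
sigma = x.std(axis=0)     # per-feature standard deviation
x_scaled = (x - u) / sigma
print(x_scaled)           # each column now has mean 0 and std 1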
StandardScaler and MinMaxScaler
For reference, https://baijiahao.baidu.com/s?id=1825808807439588177 explains this well.
In practice, StandardScaler is clearly the safer choice; MinMaxScaler sometimes causes trouble (one common reason is that it is very sensitive to outliers: a single extreme value stretches the whole range).
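As a quick illustration of that claim (my own toy example, using only the standard sklearn API), a single outlier stretches MinMaxScaler's range and squashes every other value toward 0:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([[1.0], [2.0], [3.0], [1000.0]])   # 1000.0 is an outlier
print(MinMaxScaler().fit_transform(x).ravel())
# roughly [0.     0.001  0.002  1.   ] -- the first three values become nearly indistinguishable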
StandardScaler
StandardScaler is like a precise referee: it shifts the data to mean 0 and scales it to standard deviation 1, which suits data that roughly follows a normal distribution. Imagine a running race where every runner starts on the same line and every lane has exactly the same length, so everyone competes fairly. StandardScaler sets that single standard for the data, putting every data point on the same starting line.
MinMaxScaler
MinMaxScaler is more like a coach obsessed with quantification: it stretches or compresses all the data into [0, 1] or [-1, 1], which suits data without an obvious distribution. Think of it as a strict referee who forces every runner's result into a fixed range; fast or slow, your score has to land inside it. That makes different runners easy to compare, but it only cares about relative position, not actual performance.
Understanding the source code
When in doubt, read the source. If I can't figure it out by thinking, let the code do the talking.
class StandardScaler(_OneToOneFeatureMixin, TransformerMixin, BaseEstimator):
    """Standardize features by removing the mean and scaling to unit variance.

    The standard score of a sample `x` is calculated as:

        z = (x - u) / s

    where `u` is the mean of the training samples or zero if `with_mean=False`,
    and `s` is the standard deviation of the training samples or one if
    `with_std=False`.

    Centering and scaling happen independently on each feature by computing
    the relevant statistics on the samples in the training set. Mean and
    standard deviation are then stored to be used on later data using
    :meth:`transform`.

    Standardization of a dataset is a common requirement for many
    machine learning estimators: they might behave badly if the
    individual features do not more or less look like standard normally
    distributed data (e.g. Gaussian with 0 mean and unit variance).

    For instance many elements used in the objective function of
    a learning algorithm (such as the RBF kernel of Support Vector
    Machines or the L1 and L2 regularizers of linear models) assume that
    all features are centered around 0 and have variance in the same
    order. If a feature has a variance that is orders of magnitude larger
    that others, it might dominate the objective function and make the
    estimator unable to learn from other features correctly as expected.

    This scaler can also be applied to sparse CSR or CSC matrices by passing
    `with_mean=False` to avoid breaking the sparsity structure of the data.

    Read more in the :ref:`User Guide <preprocessing_scaler>`.

    Parameters
    ----------
    copy : bool, default=True
        If False, try to avoid a copy and do inplace scaling instead.
        This is not guaranteed to always work inplace; e.g. if the data is
        not a NumPy array or scipy.sparse CSR matrix, a copy may still be
        returned.

    with_mean : bool, default=True
        If True, center the data before scaling.
        This does not work (and will raise an exception) when attempted on
        sparse matrices, because centering them entails building a dense
        matrix which in common use cases is likely to be too large to fit in
        memory.

    with_std : bool, default=True
        If True, scale the data to unit variance (or equivalently,
        unit standard deviation).

    Attributes
    ----------
    scale_ : ndarray of shape (n_features,) or None
        Per feature relative scaling of the data to achieve zero mean and unit
        variance. Generally this is calculated using `np.sqrt(var_)`. If a
        variance is zero, we can't achieve unit variance, and the data is left
        as-is, giving a scaling factor of 1. `scale_` is equal to `None`
        when `with_std=False`.

        .. versionadded:: 0.17
           *scale_*

    mean_ : ndarray of shape (n_features,) or None
        The mean value for each feature in the training set.
        Equal to ``None`` when ``with_mean=False``.

    var_ : ndarray of shape (n_features,) or None
        The variance for each feature in the training set. Used to compute
        `scale_`. Equal to ``None`` when ``with_std=False``.

    n_features_in_ : int
        Number of features seen during :term:`fit`.

        .. versionadded:: 0.24

    feature_names_in_ : ndarray of shape (`n_features_in_`,)
        Names of features seen during :term:`fit`. Defined only when `X`
        has feature names that are all strings.

        .. versionadded:: 1.0

    n_samples_seen_ : int or ndarray of shape (n_features,)
        The number of samples processed by the estimator for each feature.
        If there are no missing samples, the ``n_samples_seen`` will be an
        integer, otherwise it will be an array of dtype int. If
        `sample_weights` are used it will be a float (if no missing data)
        or an array of dtype float that sums the weights seen so far.
        Will be reset on new calls to fit, but increments across
        ``partial_fit`` calls.

    See Also
    --------
    scale : Equivalent function without the estimator API.

    :class:`~sklearn.decomposition.PCA` : Further removes the linear
        correlation across features with 'whiten=True'.

    Notes
    -----
    NaNs are treated as missing values: disregarded in fit, and maintained in
    transform.

    We use a biased estimator for the standard deviation, equivalent to
    `numpy.std(x, ddof=0)`. Note that the choice of `ddof` is unlikely to
    affect model performance.

    For a comparison of the different scalers, transformers, and normalizers,
    see :ref:`examples/preprocessing/plot_all_scaling.py
    <sphx_glr_auto_examples_preprocessing_plot_all_scaling.py>`.

    Examples
    --------
    >>> from sklearn.preprocessing import StandardScaler
    >>> data = [[0, 0], [0, 0], [1, 1], [1, 1]]
    >>> scaler = StandardScaler()
    >>> print(scaler.fit(data))
    StandardScaler()
    >>> print(scaler.mean_)
    [0.5 0.5]
    >>> print(scaler.transform(data))
    [[-1. -1.]
     [-1. -1.]
     [ 1.  1.]
     [ 1.  1.]]
    >>> print(scaler.transform([[2, 2]]))
    [[3. 3.]]
    """

    def __init__(self, *, copy=True, with_mean=True, with_std=True):
        self.with_mean = with_mean
        self.with_std = with_std
        self.copy = copy

    def _reset(self):
        """Reset internal data-dependent state of the scaler, if necessary.

        __init__ parameters are not touched.
        """
        # Checking one attribute is enough, because they are all set together
        # in partial_fit
        if hasattr(self, "scale_"):
            del self.scale_
            del self.n_samples_seen_
            del self.mean_
            del self.var_

    def fit(self, X, y=None, sample_weight=None):
        """Compute the mean and std to be used for later scaling.

        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features)
            The data used to compute the mean and standard deviation
            used for later scaling along the features axis.

        y : None
            Ignored.

        sample_weight : array-like of shape (n_samples,), default=None
            Individual weights for each sample.

            .. versionadded:: 0.24
               parameter *sample_weight* support to StandardScaler.

        Returns
        -------
        self : object
            Fitted scaler.
        """
        # Reset internal state before fitting
        self._reset()
        return self.partial_fit(X, y, sample_weight)
Basic mathematical principle
The standard score of a sample x is calculated as:
z = (x - u) / s
So StandardScaler's basic principle is standardization toward a standard normal, where u and s are of course computed per feature: the operation is applied to each feature separately, i.e. each column is standardized on its own.
Examples
--------
>>> from sklearn.preprocessing import StandardScaler
>>> data = [[0, 0], [0, 0], [1, 1], [1, 1]]
>>> scaler = StandardScaler()
>>> print(scaler.fit(data))
StandardScaler()
>>> print(scaler.mean_)
[0.5 0.5]
>>> print(scaler.transform(data))
[[-1. -1.]
 [-1. -1.]
 [ 1.  1.]
 [ 1.  1.]]
>>> print(scaler.transform([[2, 2]]))
[[3. 3.]]
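The last line makes sense once you plug in the fitted statistics: (2 - 0.5) / 0.5 = 3, hence [[3. 3.]]. As a small sanity check (my own snippet, not part of the docstring), the fitted attributes mean_, var_ and scale_ are simply NumPy's column-wise mean, variance and std with ddof=0:
import numpy as np
from sklearn.preprocessing import StandardScaler

data = [[0, 0], [0, 0], [1, 1], [1, 1]]
scaler = StandardScaler().fit(data)

print(scaler.mean_, np.mean(data, axis=0))    # [0.5 0.5] [0.5 0.5]
print(scaler.var_, np.var(data, axis=0))      # [0.25 0.25] [0.25 0.25]
print(scaler.scale_, np.std(data, axis=0))    # [0.5 0.5] [0.5 0.5]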
The fit method
def fit(self, X, y=None, sample_weight=None):
# Compute the mean and std to be used for later scaling.
The fit method simply computes the mean and variance of the current input data; once it has run, the normalization parameters are fixed.
scaler = StandardScaler()
scaler.fit(x_train)  # fit the scaler on the training data
x_train_scaled = scaler.transform(x_train)
It seems a lot of people never even notice this method exists and assume they can just call scaler.transform(x) directly.
So print(scaler.transform([[2, 2]])) uses the mean and variance of data = [[0, 0], [0, 0], [1, 1], [1, 1]].
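A quick sketch of the pitfall (my own example): calling transform before fit raises NotFittedError, and after fit the stored training statistics are reused on any new data.
from sklearn.preprocessing import StandardScaler
from sklearn.exceptions import NotFittedError

data = [[0, 0], [0, 0], [1, 1], [1, 1]]
scaler = StandardScaler()

try:
    scaler.transform([[2, 2]])          # no statistics yet
except NotFittedError as err:
    print("not fitted yet:", err)

scaler.fit(data)                        # mean_ and scale_ come from `data`
print(scaler.transform([[2, 2]]))       # [[3. 3.]] = (2 - 0.5) / 0.5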
The transform method is the most important
The inverse_transform method is its inverse: "Scale back the data to the original representation", i.e. it maps the data back to its original form.
To guard against users passing in junk, the method is mostly a pile of validation checks. The core is just subtracting mean_ and then dividing by scale_. The rest isn't doing anything complicated, yet I somehow still find it hard to follow, which is odd.
def transform(self, X, copy=None):
    """Perform standardization by centering and scaling.

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The data used to scale along the features axis.
    copy : bool, default=None
        Copy the input X or not.

    Returns
    -------
    X_tr : {ndarray, sparse matrix} of shape (n_samples, n_features)
        Transformed array.
    """
    check_is_fitted(self)

    copy = copy if copy is not None else self.copy
    X = self._validate_data(
        X,
        reset=False,
        accept_sparse="csr",
        copy=copy,
        estimator=self,
        dtype=FLOAT_DTYPES,
        force_all_finite="allow-nan",
    )

    if sparse.issparse(X):
        if self.with_mean:
            raise ValueError(
                "Cannot center sparse matrices: pass `with_mean=False` "
                "instead. See docstring for motivation and alternatives."
            )
        if self.scale_ is not None:
            inplace_column_scale(X, 1 / self.scale_)
    else:
        if self.with_mean:
            X -= self.mean_
        if self.with_std:
            X /= self.scale_
    return X
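And a minimal round-trip sketch (my own example) for the transform / inverse_transform pair mentioned above:
from sklearn.preprocessing import StandardScaler

data = [[0, 0], [0, 0], [1, 1], [1, 1]]
scaler = StandardScaler().fit(data)

scaled = scaler.transform(data)               # (x - mean_) / scale_
restored = scaler.inverse_transform(scaled)   # x * scale_ + mean_
print(restored)                               # back to the original values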
Verification code
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
import numpy as np


def min_max_scaler(x):
    x = np.array(x)
    min = x.min(axis=0)
    max = x.max(axis=0)
    x_sc = (x - min) / (max - min)
    return x_sc


def stand_scaler(x):
    # convert the list to a NumPy array
    my_array = np.array(x)
    # per-column mean
    mean_value = np.mean(my_array, axis=0)
    # per-column standard deviation (np.std, not the variance)
    variance_value = np.std(my_array, axis=0)
    return (x - mean_value) / variance_value


if __name__ == '__main__':
    # print("11" * 50)
    # scaler = MinMaxScaler()
    # data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
    # scaler.fit(data)
    # # print(scaler.transform([[2, 2]]))
    # print("MinMaxScaler on [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]:", scaler.transform(data))
    # print(f"Hand-written min_max scaler: {min_max_scaler(data)}")
    print("11" * 50)
    scaler = StandardScaler()
    data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
    scaler.fit(data)
    # print(scaler.transform([[2, 2]]))
    print("StandardScaler on [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]:\n", scaler.transform(data))
    print(f"Hand-written standardization function:\n{stand_scaler(data)}")
Results
StandardScaler on [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]:
 [[-1.18321596 -1.18321596]
 [-0.50709255 -0.50709255]
 [ 0.16903085  0.16903085]
 [ 1.52127766  1.52127766]]
Hand-written standardization function:
[[-1.18321596 -1.18321596]
 [-0.50709255 -0.50709255]
 [ 0.16903085  0.16903085]
 [ 1.52127766  1.52127766]]
MinMaxScaler
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
Easy to understand: the core is just the two steps, subtract the min and divide by (max - min). The * (max - min) + min part confused me at first; the min and max there are the bounds of the feature_range argument (default (0, 1)), so that line just rescales the [0, 1] result into whatever output range you asked for, and with the default range it changes nothing.
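Here is a short sketch (my own example) of that second line in action via the feature_range argument; the default (0, 1) is effectively a no-op, while (-1, 1) rescales the result into [-1, 1]:
from sklearn.preprocessing import MinMaxScaler

data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]

default_scaler = MinMaxScaler()                    # feature_range=(0, 1)
wide_scaler = MinMaxScaler(feature_range=(-1, 1))  # rescale into [-1, 1]

print(default_scaler.fit_transform(data))          # values in [0, 1]
print(wide_scaler.fit_transform(data))             # same pattern, values in [-1, 1]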
Verification code
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
import numpy as np


def min_max_scaler(x):
    min = x.min(axis=0)
    max = x.max(axis=0)
    x_sc = (x - min) / (max - min)
    return x_sc


if __name__ == '__main__':
    print("11" * 50)
    scaler = MinMaxScaler()
    data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
    data = np.array(data)
    scaler.fit(data)
    # print(scaler.transform([[2, 2]]))
    print("MinMaxScaler on [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]:", scaler.transform(data))
    print(f"Hand-written min_max scaler: {min_max_scaler(data)}")
Examples
>>> from sklearn.preprocessing import MinMaxScaler
>>> data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
>>> scaler = MinMaxScaler()
>>> print(scaler.fit(data))
MinMaxScaler()
>>> print(scaler.data_max_)
[ 1. 18.]
>>> print(scaler.transform(data))
[[0.   0.  ]
 [0.25 0.25]
 [0.5  0.5 ]
 [1.   1.  ]]
>>> print(scaler.transform([[2, 2]]))
[[1.5 0. ]]
At first glance this looks like it isn't split by feature and is done over the data as a whole, but it is actually still per feature (per column): the column-wise extremes are
max = [1, 18], min = [-1, 2]
and scaling each column with its own min and max gives
[[0.   0.  ]
 [0.25 0.25]
 [0.5  0.5 ]
 [1.   1.  ]]
The two columns only look identical here because their values happen to sit at the same relative positions between their respective min and max.
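To double-check the per-column behaviour (my own example, not from the docs), here is data whose columns sit at different relative positions; each column gets its own min and max, so the scaled columns no longer look identical:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([[-1.0, 2.0],
                 [0.0, 6.0],
                 [1.0, 18.0]])

scaler = MinMaxScaler().fit(data)
print(scaler.data_min_)        # [-1.  2.] -- per-column minima
print(scaler.data_max_)        # [ 1. 18.] -- per-column maxima
print(scaler.transform(data))  # column 0: [0, 0.5, 1]; column 1: [0, 0.25, 1]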