NumPy in Practice: Mean Normalization

2018-11-20 IntoTheVoid

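In machine learning, we train models on large amounts of data. Some machine learning algorithms need the data to be normalized in order to work properly. Normalization here means feature scaling: making sure all features are on a similar scale, that is, take values in a similar range. For example, a dataset may have values between 0 and 5,000. Scaling can bring them into a much narrower range, such as 0 to 1 under min-max scaling; the mean normalization used in this article instead centers each column on 0.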

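To walk through the process, first import NumPy and create a rank-2 ndarray of random integers between 0 and 5,000 (inclusive), with 1000 rows and 20 columns. This array simulates a dataset with a wide range of values.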

# import NumPy into Python
import numpy as np

# Create a 1000 x 20 ndarray with random integers in the half-open interval [0, 5001).
X = np.random.randint(0,5001,size=(1000,20))

# Print the shape of X
print(X.shape)

(1000, 20)
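As a quick check of the half-open interval, the sketch below confirms that the upper bound passed to `np.random.randint` is exclusive, so 5,000 is the largest value that can appear. The seed value and the name `sample` are illustrative assumptions and are not part of the original walkthrough.

```python
import numpy as np

# Seed the legacy global generator so this check is repeatable (the seed value is arbitrary).
np.random.seed(0)

# Same call as above: the upper bound 5001 is excluded, so 5000 is the largest possible value.
sample = np.random.randint(0, 5001, size=(1000, 20))

print(sample.ndim)                               # number of dimensions (the "rank")
print(sample.min() >= 0, sample.max() <= 5000)   # both bounds respected
```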

With the array created, we can normalize the data. We will use the following equation for mean normalization:

$$Norm\_Col_i = \frac{Col_i - \mu_i}{\sigma_i}$$

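where $Col_i$ is the $i$-th column of $X$, $\mu_i$ is the mean of the $i$-th column of $X$, and $\sigma_i$ is the standard deviation of the $i$-th column of $X$. In other words, mean normalization subtracts each column's mean from every value in that column and then divides by that column's standard deviation. The first step is therefore to compute the mean and standard deviation of each column of $X$.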

# Average of the values in each column of X
ave_cols = X.mean(axis=0)

# Standard Deviation of the values in each column of X
std_cols = X.std(axis=0)
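To make the effect of `axis=0` concrete, here is a minimal sketch on a hypothetical 3 x 2 array that is small enough to check by hand. The names `A`, `col_means`, and `col_stds` are made up for this illustration.

```python
import numpy as np

# A hypothetical 3 x 2 array, small enough to verify by hand.
A = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# axis=0 collapses the rows, leaving one statistic per column.
col_means = A.mean(axis=0)   # column means are 2 and 20
col_stds = A.std(axis=0)     # population standard deviation (ddof=0) by default

print(col_means.shape, col_stds.shape)
print(col_means)
print(col_stds)
```

Note that `std` computes the population standard deviation unless told otherwise; passing `ddof=1` would give the sample standard deviation instead.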

If the calculation above was done correctly, both ave_cols and std_cols should have shape (20,), because $X$ has 20 columns. The following code verifies this:

# Print the shape of ave_cols
print(ave_cols.shape)
# Print the shape of std_cols
print(std_cols.shape)

(20,)
(20,)

Now broadcasting lets us compute the mean-normalized version of $X$ in a single line of code, following the equation above.

# Mean normalize X: the (20,) vectors are broadcast across all 1000 rows of X
X_norm = (X - ave_cols) / std_cols
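To see what broadcasting does here, the following sketch compares the one-line expression with an explicit loop over columns. It builds its own array so that it runs on its own; the names `X_demo`, `mu`, `sigma`, `broadcast_result`, and `loop_result`, as well as the seed, are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for repeatability
X_demo = rng.integers(0, 5001, size=(1000, 20))

mu = X_demo.mean(axis=0)     # shape (20,)
sigma = X_demo.std(axis=0)   # shape (20,)

# Broadcasting: (1000, 20) combined with (20,) stretches the vector across all 1000 rows.
broadcast_result = (X_demo - mu) / sigma

# Equivalent explicit loop, normalizing one column at a time.
loop_result = np.empty(X_demo.shape, dtype=float)
for i in range(X_demo.shape[1]):
    loop_result[:, i] = (X_demo[:, i] - mu[i]) / sigma[i]

print(np.allclose(broadcast_result, loop_result))  # True
```

The broadcast form is shorter and avoids a Python-level loop, which is why it is usually preferred.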

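If mean normalization was done correctly, the average of all elements of $X_{norm}$ should be close to 0. The following code verifies this and also prints the minimum and maximum of each column: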

# Print the average of all the values of X_norm
print(X_norm.mean())
# Print the minimum value of each column of X_norm
print(X_norm.min(axis=0))
# Print the maximum value of each column of X_norm
print(X_norm.max(axis=0))

-2.48689957516e-17

[-1.67304308 -1.67221292 -1.75934167 -1.73833551 -1.71948809 -1.73249692
-1.75209287 -1.68476739 -1.7765245 -1.73833478 -1.65349246 -1.77839033
-1.79404877 -1.7932638 -1.75375541 -1.79912139 -1.75387997 -1.7214838
-1.77489744 -1.75469174]

[ 1.73959707 1.73242353 1.74403255 1.68515046 1.72215205 1.72411917
1.73467583 1.76609836 1.76611175 1.72002128 1.75578587 1.72210378
1.74709618 1.70113121 1.73983003 1.69194414 1.7162684 1.74064629
1.73171655 1.7051294 ]

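Note that because $X$ is created from random integers, the values above will differ from run to run. The column extremes should still land near $\pm 1.73$: for values spread roughly uniformly over a range, the standard deviation is about the range divided by $\sqrt{12}$, so the endpoints sit about $\sqrt{3} \approx 1.73$ standard deviations from the mean.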
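If repeatable numbers are needed, one option is a seeded generator. The sketch below uses `np.random.default_rng` rather than the `np.random.randint` call from the walkthrough, and also checks two properties that should hold for any draw: every column of the normalized array has mean close to 0 and standard deviation close to 1. The seed 42 and the names `X_seeded` and `X_seeded_norm` are hypothetical.

```python
import numpy as np

# A seeded Generator makes the "random" dataset identical on every run.
rng = np.random.default_rng(42)  # 42 is an arbitrary seed chosen for this example
X_seeded = rng.integers(0, 5001, size=(1000, 20))

# Same mean normalization as above, written in one expression.
X_seeded_norm = (X_seeded - X_seeded.mean(axis=0)) / X_seeded.std(axis=0)

# These properties hold for any draw, even though the individual values differ.
print(np.allclose(X_seeded_norm.mean(axis=0), 0.0))  # every column mean is ~0
print(np.allclose(X_seeded_norm.std(axis=0), 1.0))   # every column std is ~1
```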
