NumPy in Practice: Splitting a Dataset

2018-11-20 IntoTheVoid

Splitting the Dataset

After the data has been mean-normalized, it is common in machine learning to split the dataset into three sets:

Training set
Cross-validation set
Test set

A common split puts 60% of the data in the training set, 20% in the cross-validation set, and 20% in the test set.
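As a quick sanity check of the arithmetic, the sketch below turns the 60/20/20 percentages into row counts. The dataset size of 1000 rows is a hypothetical value chosen for illustration, and assigning the remainder to the test set is one possible convention, not a rule.

```python
# Hypothetical dataset size, chosen only for illustration.
n_rows = 1000

# Integer arithmetic avoids floating-point rounding surprises.
n_train = n_rows * 60 // 100
n_val = n_rows * 20 // 100
n_test = n_rows - n_train - n_val  # the remainder goes to the test set

print(n_train, n_val, n_test)  # 600 200 200
```

Computing the last size as a remainder means the three counts always add up to the total, even when the row count is not a multiple of 5.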

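In this part, X_norm is split into a training set, a cross-validation set, and a test set. Each set contains randomly chosen rows of X_norm, and no row may be chosen more than once. This guarantees that every row of X_norm is selected exactly once and that the rows are randomly distributed across the three new sets.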

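First, create a rank 1 ndarray containing a random permutation of the row indices of X_norm. The np.random.permutation() function does this: np.random.permutation(N) returns an array of the integers from 0 to N - 1 in random order. Here is an example: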

import numpy as np

# We create a random permutation of integers 0 to 4
np.random.permutation(5)

array([1, 2, 0, 3, 4])
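The property that matters for splitting is that a permutation contains each index exactly once. A minimal sketch of how this could be checked (the variable name perm is only illustrative):

```python
import numpy as np

perm = np.random.permutation(5)

# Sorting the permutation recovers 0..4, so every index appears exactly once.
print(np.array_equal(np.sort(perm), np.arange(5)))  # True
```

Because no index repeats, slicing the permutation into consecutive chunks cannot place the same row in two different sets.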

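Now create a rank 1 ndarray containing a random permutation of the row indices of X_norm. One line of code is enough: use the shape attribute to get the number of rows of X_norm, then pass that number to np.random.permutation(). Note that for a 2-D array such as X_norm, shape returns a tuple of two numbers in the form (rows, columns).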

# Create a rank 1 ndarray that contains a random permutation of the row indices of `X_norm`
row_indices = np.random.permutation(X_norm.shape[0])
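To make the shape lookup concrete, here is a small sketch on a stand-in array. X_demo and row_indices_demo are hypothetical names used only for this illustration; they are not part of the original exercise.

```python
import numpy as np

# Hypothetical stand-in array with 4 rows and 3 columns.
X_demo = np.zeros((4, 3))

print(X_demo.shape)  # (4, 3)

# shape[0] is the number of rows, which is what permutation() needs.
n_rows = X_demo.shape[0]
row_indices_demo = np.random.permutation(n_rows)
print(row_indices_demo.shape)  # (4,)
```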

Now use the row_indices ndarray to create the three datasets and select the rows that go into each one.

# Compute the split boundaries and save them in variables to use later:
# the first 60% of the shuffled indices go to training, the next 20% to cross validation.
n_rows = row_indices.shape[0]
train_end = int(n_rows * 0.6)
val_end = int(n_rows * 0.8)

# Create a Training Set
X_train = X_norm[row_indices[:train_end], :]

# Create a Cross Validation Set
X_crossVal = X_norm[row_indices[train_end:val_end], :]

# Create a Test Set
X_test = X_norm[row_indices[val_end:], :]

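If the steps above were carried out correctly, X_train should have 600 rows and 20 columns, X_crossVal should have 200 rows and 20 columns, and X_test should have 200 rows and 20 columns. You can verify this by running the following code: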

# Print the shape of X_train
print(X_train.shape)
# Print the shape of X_crossVal
print(X_crossVal.shape)
# Print the shape of X_test
print(X_test.shape)

(600, 20)
(200, 20)
(200, 20)
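Checking the shapes confirms the sizes, but not that each row landed in exactly one set. The sketch below runs the whole split end to end and adds that check. It assumes a stand-in X_norm of 1000 rows and 20 columns filled with random values (not the normalized data from this article), and train_idx, val_idx, test_idx, and all_idx are illustrative names.

```python
import numpy as np

# Stand-in for the normalized data: 1000 rows, 20 columns (assumed shape).
X_norm = np.random.rand(1000, 20)

# Shuffle the row indices, then cut them at the 60% and 80% marks.
row_indices = np.random.permutation(X_norm.shape[0])
train_end = int(X_norm.shape[0] * 0.6)
val_end = int(X_norm.shape[0] * 0.8)

train_idx = row_indices[:train_end]
val_idx = row_indices[train_end:val_end]
test_idx = row_indices[val_end:]

X_train = X_norm[train_idx, :]
X_crossVal = X_norm[val_idx, :]
X_test = X_norm[test_idx, :]

# Every row index should appear in exactly one of the three sets.
all_idx = np.concatenate([train_idx, val_idx, test_idx])
print(np.array_equal(np.sort(all_idx), np.arange(X_norm.shape[0])))  # True
print(X_train.shape, X_crossVal.shape, X_test.shape)  # (600, 20) (200, 20) (200, 20)
```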
