Methods for Splitting a Dataset (Part 1)

2019-05-22  微斯人_吾谁与归

Splitting a dataset by hand

1. Random numbers

2. Hash tables
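The two hand-rolled approaches above can be sketched as follows. This is only a minimal sketch: `split_by_random` and `split_by_hash` are hypothetical helper names of my own, and the CRC32 trick assumes every sample has a stable integer identifier. The hash-based split is stable: when new rows are appended later, previously seen rows never switch sides.

```python
import numpy as np
from zlib import crc32

def split_by_random(data, test_ratio, seed=42):
    # Shuffle indices with a fixed seed so the split is reproducible.
    rng = np.random.RandomState(seed)
    indices = rng.permutation(len(data))
    test_size = int(len(data) * test_ratio)
    return data[indices[test_size:]], data[indices[:test_size]]

def in_test_set(identifier, test_ratio):
    # A sample goes to the test set iff the hash of its id falls in the
    # lowest test_ratio fraction of the 32-bit range.
    return crc32(np.int64(identifier)) & 0xFFFFFFFF < test_ratio * 2**32

def split_by_hash(data, ids, test_ratio):
    mask = np.array([in_test_set(i, test_ratio) for i in ids])
    return data[~mask], data[mask]

data = np.arange(100)
train, test = split_by_random(data, 0.2)        # exactly 20% test
train_h, test_h = split_by_hash(data, np.arange(100), 0.2)  # ~20% test
```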

Using library tools

1. sklearn.model_selection.train_test_split

Signature: train_test_split(*arrays, **options)

Docstring:

Split arrays or matrices into random train and test subsets

Quick utility that wraps input validation and
next(ShuffleSplit().split(X, y)) and application to input data
into a single call for splitting (and optionally subsampling) data in a
oneliner.

Read more in the :ref:`User Guide <cross_validation>`.

Parameters

arrays : sequence of indexables with same length / shape[0]
Allowed inputs are lists, numpy arrays, scipy-sparse
matrices or pandas dataframes.


test_size : float, int or None, optional (default=0.25)
If float, it is the proportion of the dataset to include in the test split; if int, it is the absolute number of test samples; if None, the value is set to the complement of the train size. Defaults to 0.25.


train_size : float, int, or None, (default=None)
Same as above, but for the training set: a float is the proportion, an int is the absolute number of train samples; if None, it is set to the complement of the test size.


random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used
by np.random.


shuffle : boolean, optional (default=True)
Whether or not to shuffle the data before splitting. If shuffle=False, then stratify must be None.


stratify : array-like or None (default=None)
If not None, data is split in a stratified fashion, using this as
the class labels.


Returns

splitting : list, length=2 * len(arrays)
List containing train-test split of inputs.
.. versionadded:: 0.16
If the input is sparse, the output will be a
scipy.sparse.csr_matrix. Else, output type is the same as the
input type.

Examples

>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
>>> list(y)
[0, 1, 2, 3, 4]

>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.33, random_state=42)
...
>>> X_train
array([[4, 5],
       [0, 1],
       [6, 7]])
>>> y_train
[2, 0, 3]
>>> X_test
array([[2, 3],
       [8, 9]])
>>> y_test
[1, 4]

>>> train_test_split(y, shuffle=False)
[[0, 1, 2], [3, 4]]

File: d:\anaconda3\lib\site-packages\sklearn\model_selection\_split.py
Type: function
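As a small illustration of the stratify parameter (this example is my own, not from the docstring): with an imbalanced label vector, passing stratify=y preserves the class ratio in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape((10, 2))
y = np.array([0] * 8 + [1] * 2)   # imbalanced: 80% class 0, 20% class 1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y)
# both halves keep the 4:1 class ratio (4 zeros and 1 one each)
```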


2. sklearn.model_selection.StratifiedShuffleSplit

Init signature:

StratifiedShuffleSplit(n_splits=10, test_size='default', train_size=None, random_state=None)

Docstring:

Stratified ShuffleSplit cross-validator

Provides train/test indices to split data in train/test sets.

This cross-validation object is a merge of StratifiedKFold and
ShuffleSplit, which returns stratified randomized folds. The folds
are made by preserving the percentage of samples for each class.

Note: like the ShuffleSplit strategy, stratified random splits
do not guarantee that all folds will be different, although this is
still very likely for sizeable datasets.

Read more in the :ref:`User Guide <cross_validation>`.

Parameters

n_splits : int, default 10
Number of re-shuffling & splitting iterations.

test_size : float, int, None, optional
If float, should be between 0.0 and 1.0 and represent the proportion
of the dataset to include in the test split. If int, represents the
absolute number of test samples. If None, the value is set to the
complement of the train size. By default, the value is set to 0.1.
The default will change in version 0.21. It will remain 0.1 only
if train_size is unspecified, otherwise it will complement
the specified train_size.

train_size : float, int, or None, default is None
If float, should be between 0.0 and 1.0 and represent the
proportion of the dataset to include in the train split. If
int, represents the absolute number of train samples. If None,
the value is automatically set to the complement of the test size.

random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used
by np.random.

Examples

>>> import numpy as np
>>> from sklearn.model_selection import StratifiedShuffleSplit
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([0, 0, 0, 1, 1, 1])
>>> sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
>>> sss.get_n_splits(X, y)
5
>>> print(sss)       # doctest: +ELLIPSIS
StratifiedShuffleSplit(n_splits=5, random_state=0, ...)
>>> for train_index, test_index in sss.split(X, y):
...    print("TRAIN:", train_index, "TEST:", test_index)
...    X_train, X_test = X[train_index], X[test_index]
...    y_train, y_test = y[train_index], y[test_index]
TRAIN: [5 2 3] TEST: [4 1 0]
TRAIN: [5 1 4] TEST: [0 2 3]
TRAIN: [5 0 2] TEST: [4 3 1]
TRAIN: [4 1 0] TEST: [2 3 5]
TRAIN: [0 5 1] TEST: [3 4 2]
File:           d:\anaconda3\lib\site-packages\sklearn\model_selection\_split.py
Type:           ABCMeta
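In practice, StratifiedShuffleSplit is often used with n_splits=1 to produce a single stratified train/test split. A sketch under the same setup as the stratify example above (variable names are my own):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.arange(20).reshape((10, 2))
y = np.array([0] * 8 + [1] * 2)

# One stratified split: 50% test, class ratio preserved in both halves.
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=42)
train_idx, test_idx = next(sss.split(X, y))
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
```

Compared with calling train_test_split(..., stratify=y), this form also hands back the indices, which is convenient when the labels live in a separate table.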