Man page - mlpack_preprocess_split(1)

mlpack_preprocess_split

NAME
SYNOPSIS
DESCRIPTION
REQUIRED INPUT OPTIONS
OPTIONAL INPUT OPTIONS
OPTIONAL OUTPUT OPTIONS
ADDITIONAL INFORMATION

NAME

mlpack_preprocess_split - split data

SYNOPSIS

mlpack_preprocess_split -i unknown [ -I unknown ] [ -S bool ] [ -s int ] [ -z bool ] [ -r double ] [ -V bool ] [ -T unknown ] [ -L unknown ] [ -t unknown ] [ -l unknown ] [ -h -v ]

DESCRIPTION

This utility takes a dataset and, optionally, labels, and splits them into a training set and a test set. Before the split, the points in the dataset are randomly reordered. The fraction of the dataset to be used as the test set can be specified with the '--test_ratio (-r)' parameter; the default is 0.2 (20%).

The output training and test matrices may be saved with the '--training_file (-t)' and '--test_file (-T)' output parameters.

Optionally, labels can also be split along with the data by specifying the '--input_labels_file (-I)' parameter. Splitting labels works the same way as splitting the data. The output training and test labels may be saved with the '--training_labels_file (-l)' and '--test_labels_file (-L)' output parameters, respectively.

As a simple example, to split the dataset 'X.csv' into 'X_train.csv' and 'X_test.csv' with 60% of the data in the training set and 40% of the data in the test set, we could run

$ mlpack_preprocess_split --input_file X.csv --training_file X_train.csv --test_file X_test.csv --test_ratio 0.4

By default the dataset is shuffled before it is split; provide the '--no_shuffle (-S)' option to avoid shuffling the data. For example:

$ mlpack_preprocess_split --input_file X.csv --training_file X_train.csv --test_file X_test.csv --test_ratio 0.4 --no_shuffle

If we had a dataset 'X.csv' and associated labels 'y.csv', and we wanted to split these into 'X_train.csv', 'y_train.csv', 'X_test.csv', and 'y_test.csv', with 30% of the data in the test set, we could run

$ mlpack_preprocess_split --input_file X.csv --input_labels_file y.csv --test_ratio 0.3 --training_file X_train.csv --training_labels_file y_train.csv --test_file X_test.csv --test_labels_file y_test.csv
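The behavior described above (optional shuffle, then a cut at the test ratio, with labels following the same reordering) can be illustrated with a short Python sketch. This is not mlpack's implementation: the function name split() and its parameters are hypothetical, the test set size is assumed to be rounded down, and the test set is assumed to be taken from the end of the (possibly shuffled) order.

```python
import random


def split(points, labels=None, test_ratio=0.2, shuffle=True, seed=None):
    """Split points (and optional labels) into training and test parts."""
    if not 0.0 <= test_ratio <= 1.0:
        raise ValueError("test_ratio must be between 0 and 1")
    n = len(points)
    if labels is not None and len(labels) != n:
        raise ValueError("points and labels must have the same length")

    # One index permutation is shared by points and labels, so every
    # point stays paired with its label after the reordering.
    order = list(range(n))
    if shuffle:
        random.Random(seed).shuffle(order)

    # Assumption: the test set size is rounded down, and the test set is
    # taken from the end of the order.
    n_test = int(n * test_ratio)
    train_idx = order[:n - n_test]
    test_idx = order[n - n_test:]

    train_x = [points[i] for i in train_idx]
    test_x = [points[i] for i in test_idx]
    if labels is None:
        return train_x, test_x, None, None
    train_y = [labels[i] for i in train_idx]
    test_y = [labels[i] for i in test_idx]
    return train_x, test_x, train_y, test_y


# Toy data: 10 points; each label is the parity of the first coordinate.
points = [[i, i * i] for i in range(10)]
labels = [i % 2 for i in range(10)]

# Shuffled split with labels, 30% of the points in the test set.
train_x, test_x, train_y, test_y = split(points, labels, test_ratio=0.3, seed=42)
print(len(train_x), len(test_x))

# Split without shuffling: the original order is preserved.
ordered_train, ordered_test, _, _ = split(points, test_ratio=0.4, shuffle=False)
print(ordered_test)
```

In this sketch the labels are never shuffled separately; reusing one permutation for both is what keeps each point matched to its label.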

To maintain the ratio of each class in the training and test sets, the '--stratify_data (-z)' option can be used:

$ mlpack_preprocess_split --input_file X.csv --training_file X_train.csv --test_file X_test.csv --test_ratio 0.4 --stratify_data
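To show what maintaining the class ratio means, the following Python sketch splits each class separately with the same test ratio. It is an illustration only, not the algorithm mlpack uses: stratified_split() is a hypothetical name, and rounding each per-class test size down is an assumption.

```python
import random
from collections import Counter, defaultdict


def stratified_split(points, labels, test_ratio=0.2, seed=None):
    """Split so that each class contributes the same fraction to the test set."""
    if len(points) != len(labels):
        raise ValueError("points and labels must have the same length")
    rng = random.Random(seed)

    # Group the point indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Assumption: the per-class test size is rounded down.
        n_test = int(len(indices) * test_ratio)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    train_x = [points[i] for i in train_idx]
    test_x = [points[i] for i in test_idx]
    train_y = [labels[i] for i in train_idx]
    test_y = [labels[i] for i in test_idx]
    return train_x, test_x, train_y, test_y


# Toy data: 8 points of class 0 and 4 points of class 1 (a 2:1 ratio).
labels = [0] * 8 + [1] * 4
points = [[float(i)] for i in range(len(labels))]

train_x, test_x, train_y, test_y = stratified_split(
    points, labels, test_ratio=0.25, seed=1)

# Both parts keep the 2:1 class ratio of the full dataset.
print(Counter(train_y), Counter(test_y))
```

A plain shuffled split gives no such guarantee: with a small or imbalanced dataset, a rare class can end up missing from the test set entirely.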

REQUIRED INPUT OPTIONS

--input_file (-i) [ unknown ]

Matrix containing data.

OPTIONAL INPUT OPTIONS

--help (-h) [ bool ]

Default help info.

--info [string]

Print help on a specific option. Default value ''.

--input_labels_file (-I) [ unknown ]

Matrix containing labels.

--no_shuffle (-S) [ bool ]

Avoid shuffling the data before splitting.

--seed (-s) [ int ]

Random seed (0 for std::time(NULL)). Default value 0.

--stratify_data (-z) [ bool ]

Stratify the data according to labels.

--test_ratio (-r) [ double ]

Ratio of test set. Default value 0.2.

--verbose (-v) [ bool ]

Display informational messages and the full list of parameters and timers at the end of execution.

--version (-V) [ bool ]

Display the version of mlpack.
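The '--seed (-s)' convention listed above (0 selects a time-based seed, any other value is used as given) can be sketched in Python as follows. The helper names make_rng() and shuffled_order() are hypothetical, and Python's generator is not the one mlpack uses, so the actual orderings will differ; only the reproducibility property carries over.

```python
import random
import time


def make_rng(seed=0):
    """Return a random generator; a seed of 0 means "seed from the clock"."""
    if seed == 0:
        seed = int(time.time())
    return random.Random(seed)


def shuffled_order(n, seed=0):
    """Return the indices 0..n-1 in a shuffled order."""
    order = list(range(n))
    make_rng(seed).shuffle(order)
    return order


# The same nonzero seed reproduces the same shuffle on every run.
first = shuffled_order(10, seed=7)
second = shuffled_order(10, seed=7)
print(first == second)  # prints True
```

Fixing a nonzero seed is therefore the way to obtain the same training and test sets across runs.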

OPTIONAL OUTPUT OPTIONS

--test_file (-T) [ unknown ]

Matrix to save test data to.

--test_labels_file (-L) [ unknown ]

Matrix to save test labels to.

--training_file (-t) [ unknown ]

Matrix to save training data to.

--training_labels_file (-l) [ unknown ]

Matrix to save training labels to.

ADDITIONAL INFORMATION

For further information, including relevant papers, citations, and theory, consult the documentation found at http://www.mlpack.org or included with your distribution of mlpack.