Man page - mlpack_kmeans(1)
apt-get install mlpack-bin
NAME
mlpack_kmeans - k-means clustering
SYNOPSIS
mlpack_kmeans -c int -i unknown [ -a string ] [ -e bool ] [ -P bool ] [ -I unknown ] [ -E bool ] [ -K bool ] [ -l bool ] [ -m int ] [ -p double ] [ -r bool ] [ -S int ] [ -s int ] [ -V bool ] [ -C unknown ] [ -o unknown ] [ -h -v ]
DESCRIPTION
This program performs K-Means clustering on the given dataset. It can return the learned cluster assignments, and the centroids of the clusters. Empty clusters are not allowed by default; when a cluster becomes empty, the point furthest from the centroid of the cluster with maximum variance is taken to fill that cluster.
Optionally, the strategy to choose initial centroids can be specified. The k-means++ algorithm can be used to choose initial centroids with the '--kmeans_plus_plus (-K)' parameter. The Bradley and Fayyad approach ("Refining initial points for k-means clustering", 1998) can be used to select initial points by specifying the '--refined_start (-r)' parameter. This approach works by taking random samplings of the dataset; the number of samplings is set with the '--samplings (-S)' parameter, and the portion of the dataset used in each sample is set with the '--percentage (-p)' parameter (despite the name, it should be a fraction between 0.0 and 1.0).
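To illustrate the idea behind k-means++ seeding, here is a minimal Python sketch of the textbook procedure: the first centroid is drawn uniformly, and each later one is drawn with probability proportional to its squared distance from the nearest centroid chosen so far. This is not mlpack's implementation; the function names and the rng argument are hypothetical and exist only for this example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_plus_plus(points, k, rng=None):
    """Pick k initial centroids from points using textbook D^2 weighting.

    Illustrative only; 'rng' is a hypothetical argument, not an mlpack option.
    """
    rng = rng or random.Random()
    # First centroid: a uniformly random data point.
    centroids = [list(rng.choice(points))]
    while len(centroids) < k:
        # Weight each point by its squared distance to the closest centroid.
        weights = [min(sq_dist(p, c) for c in centroids) for p in points]
        if sum(weights) == 0:
            # Every point coincides with a chosen centroid; fall back to uniform.
            centroids.append(list(rng.choice(points)))
        else:
            centroids.append(list(rng.choices(points, weights=weights, k=1)[0]))
    return centroids
```

Points far from every existing centroid are the most likely to be picked next, which tends to spread the initial centroids across the dataset.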
There are several options available for the algorithm used for each Lloyd iteration, specified with the '--algorithm (-a)' option. The standard O(kN) approach can be used ('naive'). Other options include the Pelleg-Moore tree-based algorithm ('pelleg-moore'), Elkan's triangle-inequality based algorithm ('elkan'), Hamerly's modification to Elkan's algorithm ('hamerly'), the dual-tree k-means algorithm ('dualtree'), and the dual-tree k-means algorithm using the cover tree ('dualtree-covertree').
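For reference, a single naive Lloyd iteration can be sketched in a few lines of Python. Every point is compared against every centroid, which is where the O(kN) cost per iteration comes from. This is an illustrative sketch, not mlpack's code; lloyd_step is a hypothetical name, and an empty cluster simply keeps its old centroid here for brevity, which is not the program's default policy.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd_step(points, centroids):
    """Run one naive Lloyd iteration; return (assignments, new_centroids)."""
    k = len(centroids)
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    assignments = []
    for p in points:
        # Assignment step: k distance evaluations per point, O(kN) overall.
        best = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
        assignments.append(best)
        counts[best] += 1
        for d in range(dim):
            sums[best][d] += p[d]
    # Update step: each centroid becomes the mean of the points it owns.
    new_centroids = [
        [s / counts[j] for s in sums[j]] if counts[j] else list(centroids[j])
        for j in range(k)
    ]
    return assignments, new_centroids
```

The other '--algorithm' choices aim to produce the same result as this loop while skipping distance evaluations that cannot change an assignment.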
The behavior for when an empty cluster is encountered can be modified with the '--allow_empty_clusters (-e)' option. When this option is specified and there is a cluster owning no points at the end of an iteration, that cluster's centroid will simply remain in its position from the previous iteration. If the '--kill_empty_clusters (-E)' option is specified, then when a cluster owns no points at the end of an iteration, the cluster centroid is simply filled with DBL_MAX, killing it and effectively reducing k for the rest of the computation. Note that the default option when neither empty cluster option is specified can be time-consuming to calculate; therefore, specifying either of these parameters will often accelerate runtime.
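The three empty-cluster behaviors described above can be compared side by side in a small Python sketch. It is a simplification under stated assumptions, not mlpack's implementation: the policy strings 'allow', 'kill', and 'default' are hypothetical labels for this example, "variance" is taken to mean the mean squared distance of a cluster's points to its centroid, and the donor cluster's centroid is not recomputed after it gives up a point.

```python
import sys


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def resolve_empty(policy, empty, points, assignments, centroids, prev_centroids):
    """Fix up centroids[empty] for a cluster that owns no points (in place)."""
    if policy == "allow":
        # Keep the centroid where it was in the previous iteration.
        centroids[empty] = list(prev_centroids[empty])
    elif policy == "kill":
        # Fill with the largest double (DBL_MAX) so no point is ever closest.
        centroids[empty] = [sys.float_info.max] * len(prev_centroids[empty])
    else:
        # Default: find the non-empty cluster with maximum variance
        # (assumed here to be the mean squared distance to its centroid) ...
        k = len(centroids)
        totals = [0.0] * k
        counts = [0] * k
        for p, a in zip(points, assignments):
            totals[a] += sq_dist(p, centroids[a])
            counts[a] += 1
        donor = max(
            (j for j in range(k) if counts[j] > 0),
            key=lambda j: totals[j] / counts[j],
        )
        # ... and move its furthest point into the empty cluster.
        far = max(
            (i for i, a in enumerate(assignments) if a == donor),
            key=lambda i: sq_dist(points[i], centroids[donor]),
        )
        assignments[far] = empty
        centroids[empty] = list(points[far])
    return centroids
```

The default branch has to scan every point to compute per-cluster variances, which is consistent with the note that the other two options are usually faster.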
Initial centroids may be specified using the '--initial_centroids_file (-I)' parameter, and the maximum number of iterations may be specified with the '--max_iterations (-m)' parameter.
As an example, to use Hamerly's algorithm to perform k-means clustering with k=10 on the dataset 'data.csv', saving the centroids to 'centroids.csv' and the assignments for each point to 'assignments.csv', the following command could be used:
$ mlpack_kmeans --input_file data.csv --clusters 10 --algorithm hamerly --output_file assignments.csv --centroid_file centroids.csv
To run k-means on that same dataset with initial centroids specified in 'initial.csv' with a maximum of 500 iterations, storing the output centroids in 'final.csv', the following command may be used:
$ mlpack_kmeans --input_file data.csv --initial_centroids_file initial.csv --clusters 10 --max_iterations 500 --centroid_file final.csv
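Once an output file such as 'assignments.csv' has been written, a quick sanity check is to count how many points landed in each cluster. The Python sketch below assumes (this layout is not stated in this manual) that the file holds one point per row with the cluster label in the last column; cluster_sizes is a hypothetical helper name.

```python
import csv
from collections import Counter


def cluster_sizes(lines):
    """Count points per cluster from CSV lines.

    Assumes one point per row with the label in the last column;
    verify this layout against your own output before relying on it.
    """
    counts = Counter()
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        # Labels may be written as "3" or "3.0"; normalise to int.
        counts[int(float(row[-1]))] += 1
    return counts


# Example usage (file name taken from the first example command):
# with open("assignments.csv", newline="") as f:
#     print(cluster_sizes(f))
```

A cluster with a count of zero, or fewer distinct labels than the requested number of clusters, may indicate that one of the empty-cluster options took effect.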
REQUIRED INPUT OPTIONS
--clusters (-c) [ int ]
Number of clusters to find (0 autodetects from initial centroids).
--input_file (-i) [ unknown ]
Input dataset to perform clustering on.
OPTIONAL INPUT OPTIONS
--algorithm (-a) [ string ]
Algorithm to use for the Lloyd iteration ('naive', 'pelleg-moore', 'elkan', 'hamerly', 'dualtree', or 'dualtree-covertree'). Default value 'naive'.
--allow_empty_clusters (-e) [ bool ]
Allow empty clusters to persist.
--help (-h) [ bool ]
Default help info.
--in_place (-P) [ bool ]
If specified, a column containing the learned cluster assignments will be added to the input dataset file. In this case, --output_file is overridden. (Do not use in Python.)
--info [ string ]
Print help on a specific option. Default value ''.
--initial_centroids_file (-I) [ unknown ]
Start with the specified initial centroids.
--kill_empty_clusters (-E) [ bool ]
Remove empty clusters when they occur.
--kmeans_plus_plus (-K) [ bool ]
Use the k-means++ initialization strategy to choose initial points.
--labels_only (-l) [ bool ]
Only output labels into output file.
--max_iterations (-m) [ int ]
Maximum number of iterations before k-means terminates. Default value 1000.
--percentage (-p) [ double ]
Percentage of dataset to use for each refined start sampling (use when --refined_start is specified). Default value 0.02.
--refined_start (-r) [ bool ]
Use the refined initial point strategy by Bradley and Fayyad to choose initial points.
--samplings (-S) [ int ]
Number of samplings to perform for refined start
(use when --refined_start is specified).
Default value 100.
--seed (-s) [ int ]
Random seed. If 0, 'std::time(NULL)' is used. Default value 0.
--verbose (-v) [ bool ]
Display informational messages and the full list of parameters and timers at the end of execution.
--version (-V) [ bool ]
Display the version of mlpack.
OPTIONAL OUTPUT OPTIONS
--centroid_file (-C) [ unknown ]
If specified, the centroids of each cluster will be written to the given file.
--output_file (-o) [ unknown ]
Matrix to store output labels or labeled data to.
ADDITIONAL INFORMATION
For further information, including relevant papers, citations, and theory, consult the documentation found at http://www.mlpack.org or included with your distribution of mlpack.