spsklearn.cluster
=================

.. py:module:: spsklearn.cluster

.. autoapi-nested-parse::

   The :mod:`spsklearn.cluster` module implements clustering algorithms.

Classes
-------

.. autoapisummary::

   spsklearn.cluster.SphericalKMeans

Functions
---------

.. autoapisummary::

   spsklearn.cluster.spherical_k_means_plusplus
   spsklearn.cluster.spherical_k_means

Package Contents
----------------

.. py:function:: spherical_k_means_plusplus(X, n_clusters, *, sample_weight=None, random_state=None, n_local_trials=None)

   Init `n_clusters` seeds according to spherical k-means++.

   :param X: The data to pick seeds from.
   :type X: {array-like, sparse matrix} of shape (n_samples, n_features)
   :param n_clusters: The number of centroids to initialize.
   :type n_clusters: int
   :param sample_weight: The weights for each observation in `X`. If `None`, all observations
                         are assigned equal weight. `sample_weight` is ignored if `init`
                         is a callable or a user provided array.
   :type sample_weight: array-like of shape (n_samples,), default=None
   :param random_state: Determines random number generation for centroid initialization. Pass
                        an int for reproducible output across multiple function calls.
                        See :term:`Glossary `.
   :type random_state: int or RandomState instance, default=None
   :param n_local_trials: The number of seeding trials for each center (except the first), of
                          which the one reducing inertia the most is greedily chosen. Set to
                          `None` to make the number of trials depend logarithmically on the
                          number of seeds (2 + log(k)), which is the recommended setting.
                          Setting it to 1 disables the greedy cluster selection and recovers
                          the vanilla k-means++ algorithm, which was empirically shown to
                          work less well than its greedy variant.
   :type n_local_trials: int, default=None
   :returns: * **centers** (*ndarray of shape (n_clusters, n_features)*) -- The initial centers for k-means.
             * **indices** (*ndarray of shape (n_clusters,)*) -- The index location of the chosen centers in the data array `X`.
               For a given index and center, ``X[index] = center``.

   .. rubric:: Notes

   Selects initial cluster centers for spherical k-means clustering in a smart way to
   speed up convergence. See: Arthur, D. and Vassilvitskii, S. "k-means++: the advantages
   of careful seeding". ACM-SIAM Symposium on Discrete Algorithms. 2007.

.. py:function:: spherical_k_means(X, n_clusters, *, sample_weight=None, init='spherical-k-means++', n_init='auto', max_iter=300, verbose=False, tol=0.0001, random_state=None, copy_x=True, algorithm='lloyd', return_n_iter=False)

   Perform spherical k-means clustering.

   Read more in the :ref:`User Guide `.

   :param X: The observations to cluster. It must be noted that the data will be
             converted to C ordering, which will cause a memory copy if the given data
             is not C-contiguous.
   :type X: {array-like, sparse matrix} of shape (n_samples, n_features)
   :param n_clusters: The number of clusters to form as well as the number of centroids to
                      generate.
   :type n_clusters: int
   :param sample_weight: The weights for each observation in `X`. If `None`, all observations
                         are assigned equal weight. `sample_weight` is not used during
                         initialization if `init` is a callable or a user provided array.
   :type sample_weight: array-like of shape (n_samples,), default=None
   :param init: Method for initialization:

                - `'spherical-k-means++'` : selects initial cluster centers for spherical
                  k-means clustering in a smart way to speed up convergence. See section
                  Notes in k_init for more details.
                - `'random'`: choose `n_clusters` observations (rows) at random from data
                  for the initial centroids.
                - If an array is passed, it should be of shape `(n_clusters, n_features)`
                  and gives the initial centers.
                - If a callable is passed, it should take arguments `X`, `n_clusters` and
                  a random state and return an initialization.
   :type init: {'spherical-k-means++', 'random'}, callable or array-like of shape (n_clusters, n_features), default='spherical-k-means++'
   :param n_init: Number of times the k-means algorithm will be run with different centroid
                  seeds. The final result will be the best output of `n_init` consecutive
                  runs in terms of inertia.

                  When `n_init='auto'`, the number of runs depends on the value of `init`:
                  10 if using `init='random'` or `init` is a callable;
                  1 if using `init='spherical-k-means++'` or `init` is an array-like.

                  .. versionadded:: 1.2
                     Added 'auto' option for `n_init`.

                  .. versionchanged:: 1.4
                     Default value for `n_init` changed to `'auto'`.
   :type n_init: 'auto' or int, default='auto'
   :param max_iter: Maximum number of iterations of the k-means algorithm to run.
   :type max_iter: int, default=300
   :param verbose: Verbosity mode.
   :type verbose: bool, default=False
   :param tol: Relative tolerance with regards to the Frobenius norm of the difference in
               the cluster centers of two consecutive iterations to declare convergence.
   :type tol: float, default=1e-4
   :param random_state: Determines random number generation for centroid initialization. Use
                        an int to make the randomness deterministic. See :term:`Glossary `.
   :type random_state: int, RandomState instance or None, default=None
   :param copy_x: When pre-computing distances it is more numerically accurate to center
                  the data first. If `copy_x` is True (default), then the original data is
                  not modified. If False, the original data is modified, and put back
                  before the function returns, but small numerical differences may be
                  introduced by subtracting and then adding back the data mean. Note that
                  if the original data is not C-contiguous, a copy will be made even if
                  `copy_x` is False. If the original data is sparse, but not in CSR
                  format, a copy will be made even if `copy_x` is False.
   :type copy_x: bool, default=True
   :param algorithm: K-means algorithm to use. The classical EM-style algorithm is
                     `"lloyd"`.
                     The `"elkan"` variation can be more efficient on some datasets with
                     well-defined clusters, by using the triangle inequality. However, it
                     is more memory intensive due to the allocation of an extra array of
                     shape `(n_samples, n_clusters)`.
   :type algorithm: {"lloyd"}, default="lloyd"
   :param return_n_iter: Whether or not to return the number of iterations.
   :type return_n_iter: bool, default=False
   :returns: * **centroid** (*ndarray of shape (n_clusters, n_features)*) -- Centroids found at the last iteration of k-means.
             * **label** (*ndarray of shape (n_samples,)*) -- `label[i]` is the code or index of the centroid the i'th observation is closest to.
             * **inertia** (*float*) -- The final value of the inertia criterion (sum of squared distances to
               the closest centroid for all observations in the training set).
             * **best_n_iter** (*int*) -- Number of iterations corresponding to the best results.
               Returned only if `return_n_iter` is set to True.

   .. rubric:: Examples

   >>> import numpy as np
   >>> from spsklearn.cluster import spherical_k_means
   >>> X = np.array([[1, 2], [1, 4], [1, 0],
   ...               [10, 2], [10, 4], [10, 0]])
   >>> centroid, label, inertia = spherical_k_means(
   ...     X, n_clusters=2, n_init="auto", random_state=0
   ... )

.. py:class:: SphericalKMeans(n_clusters=8, *, init='spherical-k-means++', n_init='auto', max_iter=300, tol=0.0001, verbose=0, random_state=None, copy_x=True, algorithm='lloyd')

   Bases: :py:obj:`sklearn.cluster._kmeans._BaseKMeans`

   .. autoapi-inheritance-diagram:: spsklearn.cluster.SphericalKMeans
      :parts: 1

   Spherical K-Means clustering.

   :param n_clusters: The number of clusters to form as well as the number of centroids to
                      generate.
   :type n_clusters: int, default=8
   :param init: Method for initialization:

                * 'spherical-k-means++' : selects initial cluster centroids using sampling
                  based on an empirical probability distribution of the points'
                  contribution to the overall inertia.
                  This technique speeds up convergence. The algorithm implemented is
                  "greedy spherical-k-means++". It differs from the vanilla
                  spherical-k-means++ by making several trials at each sampling step and
                  choosing the best centroid among them.
                * 'random': choose `n_clusters` observations (rows) at random from data
                  for the initial centroids.
                * If an array is passed, it should be of shape (n_clusters, n_features)
                  and gives the initial centers.
                * If a callable is passed, it should take arguments X, n_clusters and a
                  random state and return an initialization.

                For an example of how to use the different `init` strategies, see the
                example entitled
                :ref:`sphx_glr_auto_examples_cluster_plot_kmeans_digits.py`.
   :type init: {'spherical-k-means++', 'random'}, callable or array-like of shape (n_clusters, n_features), default='spherical-k-means++'
   :param n_init: Number of times the k-means algorithm is run with different centroid
                  seeds. The final result is the best output of `n_init` consecutive runs
                  in terms of inertia. Several runs are recommended for sparse
                  high-dimensional problems (see :ref:`kmeans_sparse_high_dim`).

                  When `n_init='auto'`, the number of runs depends on the value of `init`:
                  10 if using `init='random'` or `init` is a callable;
                  1 if using `init='spherical-k-means++'` or `init` is an array-like.
   :type n_init: 'auto' or int, default='auto'
   :param max_iter: Maximum number of iterations of the k-means algorithm for a single run.
   :type max_iter: int, default=300
   :param tol: Relative tolerance with regards to the Frobenius norm of the difference in
               the cluster centers of two consecutive iterations to declare convergence.
   :type tol: float, default=1e-4
   :param verbose: Verbosity mode.
   :type verbose: int, default=0
   :param random_state: Determines random number generation for centroid initialization. Use
                        an int to make the randomness deterministic. See :term:`Glossary `.
   :type random_state: int, RandomState instance or None, default=None
   :param copy_x: When pre-computing distances it is more numerically accurate to center
                  the data first. If `copy_x` is True (default), then the original data is
                  not modified. If False, the original data is modified, and put back
                  before the function returns, but small numerical differences may be
                  introduced by subtracting and then adding back the data mean. Note that
                  if the original data is not C-contiguous, a copy will be made even if
                  `copy_x` is False. If the original data is sparse, but not in CSR
                  format, a copy will be made even if `copy_x` is False.
   :type copy_x: bool, default=True
   :param algorithm: Spherical k-means algorithm to use. The classical EM-style algorithm
                     is `"lloyd"`. The `"elkan"` variation can be more efficient on some
                     datasets with well-defined clusters, by using the triangle inequality.
                     However, it is more memory intensive due to the allocation of an extra
                     array of shape `(n_samples, n_clusters)`.
   :type algorithm: {"lloyd"}, default="lloyd"

   .. attribute:: cluster_centers_

      Coordinates of cluster centers. If the algorithm stops before fully converging (see
      ``tol`` and ``max_iter``), these will not be consistent with ``labels_``.

      :type: ndarray of shape (n_clusters, n_features)

   .. attribute:: labels_

      Labels of each point.

      :type: ndarray of shape (n_samples,)

   .. attribute:: inertia_

      Sum of squared distances of samples to their closest cluster center, weighted by
      the sample weights if provided.

      :type: float

   .. attribute:: n_iter_

      Number of iterations run.

      :type: int

   .. attribute:: n_features_in_

      Number of features seen during :term:`fit`.

      :type: int

   .. attribute:: feature_names_in_

      Names of features seen during :term:`fit`. Defined only when `X` has feature names
      that are all strings.

      :type: ndarray of shape (`n_features_in_`,)

   .. rubric:: Notes

   The spherical k-means problem is solved using either Lloyd's or Elkan's algorithm.
   The average complexity is given by O(k n T), where n is the number of samples and T is
   the number of iterations.

   The worst case complexity is given by O(n^(k+2/p)) with n = n_samples, p = n_features.
   Refer to :doi:`"How slow is the k-means method?" D. Arthur and S. Vassilvitskii - SoCG2006.<10.1145/1137856.1137880>`
   for more details.

   In practice, the k-means algorithm is very fast (one of the fastest clustering
   algorithms available), but it may fall into local minima. That is why it can be useful
   to restart it several times.

   If the algorithm stops before fully converging (because of ``tol`` or ``max_iter``),
   ``labels_`` and ``cluster_centers_`` will not be consistent, i.e. the
   ``cluster_centers_`` will not be the means of the points in each cluster. Also, the
   estimator will reassign ``labels_`` after the last iteration to make ``labels_``
   consistent with ``predict`` on the training set.

   .. py:attribute:: copy_x
      :value: True

   .. py:attribute:: algorithm
      :value: 'lloyd'

   .. py:method:: fit(X, y=None, sample_weight=None)

      Compute spherical k-means clustering.

      :param X: Training instances to cluster. It must be noted that the data will be
                converted to C ordering, which will cause a memory copy if the given
                data is not C-contiguous. If a sparse matrix is passed, a copy will be
                made if it's not in CSR format.
      :type X: {array-like, sparse matrix} of shape (n_samples, n_features)
      :param y: Not used, present here for API consistency by convention.
      :type y: Ignored
      :param sample_weight: The weights for each observation in X. If None, all observations
                            are assigned equal weight. `sample_weight` is not used during
                            initialization if `init` is a callable or a user provided
                            array.
      :type sample_weight: array-like of shape (n_samples,), default=None
      :returns: **self** -- Fitted estimator.
      :rtype: object
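The Lloyd-style iteration described in the notes above can be sketched in plain NumPy. This is a minimal illustration of the spherical variant, not the package's actual implementation: the helper `spherical_lloyd`, the toy data, and the convergence check are all hypothetical.

```python
import numpy as np


def spherical_lloyd(X, centers, max_iter=300, tol=1e-4):
    """Minimal Lloyd-style iteration on the unit sphere (illustrative only).

    Points are assigned to the center with the highest cosine similarity;
    each updated center is the renormalized mean of its assigned points.
    """
    # Project data and initial centers onto the unit sphere.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    centers = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    for _ in range(max_iter):
        # On unit vectors, cosine similarity is just a dot product.
        labels = np.argmax(X @ centers.T, axis=1)
        new_centers = np.vstack(
            [X[labels == k].mean(axis=0) for k in range(len(centers))]
        )
        new_centers /= np.linalg.norm(new_centers, axis=1, keepdims=True)
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:  # converged: centers barely moved
            break
    return centers, labels


# Two bundles of directions: one along the x-axis, one along the y-axis.
X = np.array([[3.0, 0.2], [5.0, 0.5], [9.0, 0.1],
              [0.2, 4.0], [0.5, 6.0], [0.1, 8.0]])
centers, labels = spherical_lloyd(X, X[[0, 3]])
print(labels)  # each bundle ends up in its own cluster
```

On unit-normalized vectors, maximizing cosine similarity is equivalent to minimizing Euclidean distance (since ||x - c||^2 = 2 - 2 x.c), which is why the spherical variant can reuse the standard Lloyd machinery with an extra renormalization step.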