spsklearn.cluster
=================

.. py:module:: spsklearn.cluster

.. autoapi-nested-parse::

   The :mod:`spsklearn.cluster` module implements clustering algorithms.

Classes
-------

.. autoapisummary::

   spsklearn.cluster.SphericalKMeans

Functions
---------

.. autoapisummary::

   spsklearn.cluster.spherical_k_means_plusplus
   spsklearn.cluster.spherical_k_means

Package Contents
----------------

.. py:function:: spherical_k_means_plusplus(X, n_clusters, *, sample_weight=None, random_state=None, n_local_trials=None)

   Init `n_clusters` seeds according to spherical k-means++.

   :param X: The data to pick seeds from.
   :type X: {array-like, sparse matrix} of shape (n_samples, n_features)
   :param n_clusters: The number of centroids to initialize.
   :type n_clusters: int
   :param sample_weight: The weights for each observation in `X`. If `None`, all observations
                         are assigned equal weight. `sample_weight` is ignored if `init`
                         is a callable or a user provided array.
   :type sample_weight: array-like of shape (n_samples,), default=None
   :param random_state: Determines random number generation for centroid initialization. Pass
                        an int for reproducible output across multiple function calls.
                        See :term:`Glossary `.
   :type random_state: int or RandomState instance, default=None
   :param n_local_trials: The number of seeding trials for each center (except the first), of
                          which the one reducing inertia the most is greedily chosen. Set to
                          `None` to make the number of trials depend logarithmically on the
                          number of seeds (2 + log(k)), which is the recommended setting.
                          Setting it to 1 disables the greedy cluster selection and recovers
                          the vanilla k-means++ algorithm, which was empirically shown to
                          work less well than its greedy variant.
   :type n_local_trials: int, default=None
   :returns: * **centers** (*ndarray of shape (n_clusters, n_features)*) -- The initial centers for k-means.
             * **indices** (*ndarray of shape (n_clusters,)*) -- The index location of the chosen centers in the data array `X`.
               For a given index and center, ``X[index] = center``.

   .. rubric:: Notes

   Selects initial cluster centers for spherical k-means clustering in a smart way to
   speed up convergence. See: Arthur, D. and Vassilvitskii, S. "k-means++: the advantages
   of careful seeding". ACM-SIAM Symposium on Discrete Algorithms. 2007.

.. py:function:: spherical_k_means(X, n_clusters, *, sample_weight=None, init='spherical-k-means++', n_init='auto', max_iter=300, verbose=False, tol=0.0001, random_state=None, copy_x=True, algorithm='lloyd', return_n_iter=False)

   Perform spherical k-means clustering.

   Read more in the :ref:`User Guide `.

   :param X: The observations to cluster. It must be noted that the data will be
             converted to C ordering, which will cause a memory copy if the given data
             is not C-contiguous.
   :type X: {array-like, sparse matrix} of shape (n_samples, n_features)
   :param n_clusters: The number of clusters to form as well as the number of centroids to
                      generate.
   :type n_clusters: int
   :param sample_weight: The weights for each observation in `X`. If `None`, all observations
                         are assigned equal weight. `sample_weight` is not used during
                         initialization if `init` is a callable or a user provided array.
   :type sample_weight: array-like of shape (n_samples,), default=None
   :param init: Method for initialization:

                - `'spherical-k-means++'` : selects initial cluster centers for spherical
                  k-means clustering in a smart way to speed up convergence. See section
                  Notes in k_init for more details.
                - `'random'`: choose `n_clusters` observations (rows) at random from data
                  for the initial centroids.
                - If an array is passed, it should be of shape `(n_clusters, n_features)`
                  and gives the initial centers.
                - If a callable is passed, it should take arguments `X`, `n_clusters` and
                  a random state and return an initialization.
   :type init: {'spherical-k-means++', 'random'}, callable or array-like of shape (n_clusters, n_features), default='spherical-k-means++'
   :param n_init: Number of times the k-means algorithm will be run with different centroid
                  seeds. The final result will be the best output of `n_init` consecutive
                  runs in terms of inertia.

                  When `n_init='auto'`, the number of runs depends on the value of `init`:
                  10 if using `init='random'` or `init` is a callable;
                  1 if using `init='spherical-k-means++'` or `init` is an array-like.

                  .. versionadded:: 1.2
                     Added 'auto' option for `n_init`.

                  .. versionchanged:: 1.4
                     Default value for `n_init` changed to `'auto'`.
   :type n_init: 'auto' or int, default='auto'
   :param max_iter: Maximum number of iterations of the k-means algorithm to run.
   :type max_iter: int, default=300
   :param verbose: Verbosity mode.
   :type verbose: bool, default=False
   :param tol: Relative tolerance with regards to the Frobenius norm of the difference in
               the cluster centers of two consecutive iterations to declare convergence.
   :type tol: float, default=1e-4
   :param random_state: Determines random number generation for centroid initialization. Use
                        an int to make the randomness deterministic. See :term:`Glossary `.
   :type random_state: int, RandomState instance or None, default=None
   :param copy_x: When pre-computing distances it is more numerically accurate to center
                  the data first. If `copy_x` is True (default), then the original data is
                  not modified. If False, the original data is modified, and put back
                  before the function returns, but small numerical differences may be
                  introduced by subtracting and then adding back the data mean. Note that
                  if the original data is not C-contiguous, a copy will be made even if
                  `copy_x` is False. If the original data is sparse, but not in CSR
                  format, a copy will be made even if `copy_x` is False.
   :type copy_x: bool, default=True
   :param algorithm: K-means algorithm to use. The classical EM-style algorithm is
                     `"lloyd"`.
                     The `"elkan"` variation can be more efficient on some datasets with
                     well-defined clusters, by using the triangle inequality. However, it
                     is more memory intensive due to the allocation of an extra array of
                     shape `(n_samples, n_clusters)`.
   :type algorithm: {"lloyd"}, default="lloyd"
   :param return_n_iter: Whether or not to return the number of iterations.
   :type return_n_iter: bool, default=False
   :returns: * **centroid** (*ndarray of shape (n_clusters, n_features)*) -- Centroids found at the last iteration of k-means.
             * **label** (*ndarray of shape (n_samples,)*) -- `label[i]` is the code or index of the centroid the i'th observation is closest to.
             * **inertia** (*float*) -- The final value of the inertia criterion (sum of squared distances to
               the closest centroid for all observations in the training set).
             * **best_n_iter** (*int*) -- Number of iterations corresponding to the best results.
               Returned only if `return_n_iter` is set to True.

   .. rubric:: Examples

   >>> import numpy as np
   >>> from spsklearn.cluster import spherical_k_means
   >>> X = np.array([[1, 2], [1, 4], [1, 0],
   ...               [10, 2], [10, 4], [10, 0]])
   >>> centroid, label, inertia = spherical_k_means(
   ...     X, n_clusters=2, n_init="auto", random_state=0
   ... )

.. py:class:: SphericalKMeans(n_clusters=8, *, init='spherical-k-means++', n_init='auto', max_iter=300, tol=0.0001, verbose=0, random_state=None, copy_x=True, algorithm='lloyd')

   Bases: :py:obj:`sklearn.cluster._kmeans._BaseKMeans`

   .. autoapi-inheritance-diagram:: spsklearn.cluster.SphericalKMeans
      :parts: 1

   Spherical K-Means clustering.

   :param n_clusters: The number of clusters to form as well as the number of centroids to
                      generate.
   :type n_clusters: int, default=8
   :param init: Method for initialization:

                * 'spherical-k-means++' : selects initial cluster centroids using sampling
                  based on an empirical probability distribution of the points'
                  contribution to the overall inertia.
                  This technique speeds up convergence. The algorithm implemented is
                  "greedy spherical-k-means++". It differs from the vanilla
                  spherical-k-means++ by making several trials at each sampling step and
                  choosing the best centroid among them.
                * 'random': choose `n_clusters` observations (rows) at random from data
                  for the initial centroids.
                * If an array is passed, it should be of shape (n_clusters, n_features)
                  and gives the initial centers.
                * If a callable is passed, it should take arguments X, n_clusters and a
                  random state and return an initialization.

                For an example of how to use the different `init` strategies, see the
                example entitled
                :ref:`sphx_glr_auto_examples_cluster_plot_kmeans_digits.py`.
   :type init: {'spherical-k-means++', 'random'}, callable or array-like of shape (n_clusters, n_features), default='spherical-k-means++'
   :param n_init: Number of times the k-means algorithm is run with different centroid
                  seeds. The final result is the best output of `n_init` consecutive runs
                  in terms of inertia. Several runs are recommended for sparse
                  high-dimensional problems (see :ref:`kmeans_sparse_high_dim`).

                  When `n_init='auto'`, the number of runs depends on the value of `init`:
                  10 if using `init='random'` or `init` is a callable;
                  1 if using `init='spherical-k-means++'` or `init` is an array-like.
   :type n_init: 'auto' or int, default='auto'
   :param max_iter: Maximum number of iterations of the k-means algorithm for a single run.
   :type max_iter: int, default=300
   :param tol: Relative tolerance with regards to the Frobenius norm of the difference in
               the cluster centers of two consecutive iterations to declare convergence.
   :type tol: float, default=1e-4
   :param verbose: Verbosity mode.
   :type verbose: int, default=0
   :param random_state: Determines random number generation for centroid initialization. Use
                        an int to make the randomness deterministic. See :term:`Glossary `.
   :type random_state: int, RandomState instance or None, default=None
   :param copy_x: When pre-computing distances it is more numerically accurate to center
                  the data first. If `copy_x` is True (default), then the original data is
                  not modified. If False, the original data is modified, and put back
                  before the function returns, but small numerical differences may be
                  introduced by subtracting and then adding back the data mean. Note that
                  if the original data is not C-contiguous, a copy will be made even if
                  `copy_x` is False. If the original data is sparse, but not in CSR
                  format, a copy will be made even if `copy_x` is False.
   :type copy_x: bool, default=True
   :param algorithm: Spherical k-means algorithm to use. The classical EM-style algorithm
                     is `"lloyd"`. The `"elkan"` variation can be more efficient on some
                     datasets with well-defined clusters, by using the triangle inequality.
                     However, it is more memory intensive due to the allocation of an extra
                     array of shape `(n_samples, n_clusters)`.
   :type algorithm: {"lloyd"}, default="lloyd"

   .. attribute:: cluster_centers_

      Coordinates of cluster centers. If the algorithm stops before fully converging (see
      ``tol`` and ``max_iter``), these will not be consistent with ``labels_``.

      :type: ndarray of shape (n_clusters, n_features)

   .. attribute:: labels_

      Labels of each point.

      :type: ndarray of shape (n_samples,)

   .. attribute:: inertia_

      Sum of squared distances of samples to their closest cluster center, weighted by
      the sample weights if provided.

      :type: float

   .. attribute:: n_iter_

      Number of iterations run.

      :type: int

   .. attribute:: n_features_in_

      Number of features seen during :term:`fit`.

      :type: int

   .. attribute:: feature_names_in_

      Names of features seen during :term:`fit`. Defined only when `X` has feature names
      that are all strings.

      :type: ndarray of shape (`n_features_in_`,)

   .. rubric:: Notes

   The spherical k-means problem is solved using either Lloyd's or Elkan's algorithm.
   The average complexity is given by O(k n T), where n is the number of samples and T is
   the number of iterations.

   The worst case complexity is given by O(n^(k+2/p)) with n = n_samples, p = n_features.
   Refer to :doi:`"How slow is the k-means method?" D. Arthur and S. Vassilvitskii - SoCG2006.<10.1145/1137856.1137880>`
   for more details.

   In practice, the k-means algorithm is very fast (one of the fastest clustering
   algorithms available), but it may fall into local minima. That is why it can be useful
   to restart it several times.

   If the algorithm stops before fully converging (because of ``tol`` or ``max_iter``),
   ``labels_`` and ``cluster_centers_`` will not be consistent, i.e. the
   ``cluster_centers_`` will not be the means of the points in each cluster. Also, the
   estimator will reassign ``labels_`` after the last iteration to make ``labels_``
   consistent with ``predict`` on the training set.

   .. py:attribute:: copy_x
      :value: True

   .. py:attribute:: algorithm
      :value: 'lloyd'

   .. py:method:: fit(X, y=None, sample_weight=None)

      Compute spherical k-means clustering.

      :param X: Training instances to cluster. It must be noted that the data will be
                converted to C ordering, which will cause a memory copy if the given
                data is not C-contiguous. If a sparse matrix is passed, a copy will be
                made if it's not in CSR format.
      :type X: {array-like, sparse matrix} of shape (n_samples, n_features)
      :param y: Not used, present here for API consistency by convention.
      :type y: Ignored
      :param sample_weight: The weights for each observation in X. If None, all observations
                            are assigned equal weight. `sample_weight` is not used during
                            initialization if `init` is a callable or a user provided
                            array.
      :type sample_weight: array-like of shape (n_samples,), default=None
      :returns: **self** -- Fitted estimator.
      :rtype: object
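The Lloyd-style iteration described in the notes above can be sketched in plain NumPy. This is a minimal illustration of the spherical variant, not the package's actual implementation: the helper `spherical_lloyd`, the toy data, and the convergence check are all hypothetical.

```python
import numpy as np


def spherical_lloyd(X, centers, max_iter=300, tol=1e-4):
    """Minimal Lloyd-style iteration on the unit sphere (illustrative only).

    Points are assigned to the center with the highest cosine similarity;
    each updated center is the renormalized mean of its assigned points.
    """
    # Project data and initial centers onto the unit sphere.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    centers = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    for _ in range(max_iter):
        # On unit vectors, cosine similarity is just a dot product.
        labels = np.argmax(X @ centers.T, axis=1)
        new_centers = np.vstack(
            [X[labels == k].mean(axis=0) for k in range(len(centers))]
        )
        new_centers /= np.linalg.norm(new_centers, axis=1, keepdims=True)
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:  # converged: centers barely moved
            break
    return centers, labels


# Two bundles of directions: one along the x-axis, one along the y-axis.
X = np.array([[3.0, 0.2], [5.0, 0.5], [9.0, 0.1],
              [0.2, 4.0], [0.5, 6.0], [0.1, 8.0]])
centers, labels = spherical_lloyd(X, X[[0, 3]])
print(labels)  # each bundle ends up in its own cluster
```

On unit-normalized vectors, maximizing cosine similarity is equivalent to minimizing Euclidean distance (since ||x - c||^2 = 2 - 2 x.c), which is why the spherical variant can reuse the standard Lloyd machinery with an extra renormalization step.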