von Mises-Fisher Mixture Models and Parameter Estimation with Expectation-Maximization
Parameter estimation for von Mises-Fisher Mixture Models (vMFMM)
1 Introduction
Mixture models, particularly the Gaussian Mixture Model (GMM), are widely used for clustering and density estimation in Euclidean space. GMMs assume that data points are generated from a mixture of Gaussian distributions, each parameterized by a mean and a covariance matrix. This works well when the data lies in $\mathbb{R}^d$ and clusters are roughly ellipsoidal in shape.
However, many modern applications involve directional data — that is, data points that lie on the surface of a $d$-dimensional unit hypersphere $\mathbb{S}^{d-1}$. Examples include:
- Document embeddings that are normalized to unit norm before computing cosine similarity,
- Visual features extracted from deep neural networks and L2-normalized,
- Geospatial or orientation data, where only the direction (not magnitude) matters.
In such cases, applying GMMs directly can be suboptimal or even misleading. The reason is simple: GMMs do not respect the geometry of the hypersphere — they operate in Euclidean space, not on hyperspheres.
To model such directional data properly, we need a distribution defined directly on the hypersphere: the von Mises-Fisher (vMF) distribution, often considered the spherical analogue of the multivariate Gaussian. Just as Gaussian distributions are used for Euclidean space, vMF distributions are a natural choice for modeling points on the unit hypersphere.
And just as GMMs model data generation in $\mathbb{R}^d$, we can construct a von Mises-Fisher Mixture Model (vMFMM) to capture multiple clusters on the hypersphere $\mathbb{S}^{d-1}$. In this post, we will:
- Review the von Mises-Fisher distribution and its properties;
- Introduce the vMF mixture model and its likelihood formulation;
- Explain how to estimate parameters using the Expectation-Maximization (EM) algorithm;
- Provide an implementation and discuss practical considerations.
2 von Mises-Fisher (vMF) distribution
2.1 From Gaussian to von Mises-Fisher
Recall the Gaussian distribution in $\mathbb{R}^d$:
\[g(\mathbf{x}| \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\Big\{ -\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\Big\},\]
where $\boldsymbol{\mu} \in \mathbb{R}^d$ is the mean and $\boldsymbol{\Sigma} \in \mathbb{R}^{d \times d}$ is the covariance matrix.
Let’s consider the analogue case on the hypersphere $\mathbb{S}^{d-1}$.
Firstly, we begin with an isotropic Gaussian with $\boldsymbol{\Sigma} = \sigma^2 \mathbf{I}$:
\[g(\mathbf{x}| \boldsymbol{\mu}, \sigma^2 \mathbf{I}) = \frac{1}{(2\pi \sigma^2)^{d/2} } \exp\Big\{ -\frac{1}{2\sigma^2}\|\mathbf{x} - \boldsymbol{\mu}\|^2\Big\}.\]
By constraining $\|\mathbf{x}\|=1$ and $\|\boldsymbol{\mu}\|=1$, we have
\[\begin{aligned} g(\mathbf{x}| \boldsymbol{\mu}, \sigma^2 \mathbf{I}) &= \frac{1}{(2\pi \sigma^2)^{d/2} } \exp\Big\{ -\frac{1}{2\sigma^2} \big(\|\mathbf{x}\|^2 + \|\boldsymbol{\mu}\|^2 - 2\boldsymbol{\mu}^{\top} \mathbf{x}\big)\Big\}\\ &= \frac{1}{(2\pi \sigma^2)^{d/2} } \exp\Big\{ -\frac{1}{\sigma^2} (1- \boldsymbol{\mu}^{\top} \mathbf{x})\Big\}\\ &\propto \exp\Big\{ \frac{1}{\sigma^2} \boldsymbol{\mu}^{\top} \mathbf{x}\Big\}. \end{aligned}\]
This is exactly the shape of the von Mises-Fisher distribution. Specifically, we define
\[V_{d}(\mathbf{x}; \boldsymbol{\mu}, \kappa) = C_{d}(\kappa) \cdot e^{\kappa\cdot \boldsymbol{\mu}^\top \mathbf{x}}, \quad\text{with}\quad \|\mathbf{x}\|=1, \|\boldsymbol{\mu}\|=1,\]
where
- $\boldsymbol{\mu} \in \mathbb{S}^{d-1}$ with $\|\boldsymbol{\mu}\|=1$ is the mean direction. It is a radial direction on the hypersphere $\mathbb{S}^{d-1}$ and represents the “center” of the vMF distribution: the probability density is higher for directions near this center (analogous to the mean of a Gaussian).
- $\kappa \ge 0$ is the concentration parameter. It controls how strongly the probability density concentrates around the mean direction $\boldsymbol{\mu}$: the larger $\kappa$ is, the more concentrated the distribution is around $\boldsymbol{\mu}$ (analogous to the role of $\sigma$ in a Gaussian, since $\kappa$ plays the role of $\frac{1}{\sigma^2}$ in the above derivation).
- For $\kappa=0$, the vMF distribution reduces to the uniform distribution on the hypersphere;
- For $\kappa \rightarrow +\infty$, the vMF distribution degenerates to a Dirac delta, i.e., a single point mass at $\boldsymbol{\mu}$.
- $C_{d}(\kappa) = \frac{\kappa^{\frac{d}{2}-1}}{(2\pi)^{\frac{d}{2}} I_{\frac{d}{2}-1}(\kappa)}$ is the normalizing constant, where $I_{n}(\cdot)$ is the modified Bessel function of the first kind of order $n$.
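To make the definition concrete, here is a minimal sketch (not from the original post; the helper name `vmf_log_density` is just for illustration) of evaluating the vMF log-density in Python. It uses SciPy's exponentially scaled Bessel function `ive`, so that $\log I_{\frac{d}{2}-1}(\kappa)$ stays finite even for large $\kappa$:

```python
# Minimal sketch of the vMF log-density; assumes x and mu are unit-norm
# 1-D numpy arrays and kappa > 0.
import numpy as np
from scipy.special import ive


def vmf_log_density(x, mu, kappa):
    d = mu.shape[0]
    nu = d / 2.0 - 1.0
    # log I_nu(kappa) = log(ive(nu, kappa)) + kappa  (ive is the exponentially scaled Bessel function)
    log_bessel = np.log(ive(nu, kappa)) + kappa
    # log C_d(kappa) = (d/2 - 1) log kappa - (d/2) log(2*pi) - log I_{d/2-1}(kappa)
    log_c = nu * np.log(kappa) - (d / 2.0) * np.log(2.0 * np.pi) - log_bessel
    return log_c + kappa * (mu @ x)
```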
Expectation and Variance
- The expectation of $V(\mathbf{x}; \boldsymbol{\mu}, \kappa)$ is
\[\mathbb{E}[\mathbf{x}] = A_{d}(\kappa)\, \boldsymbol{\mu},\]
where $A_{d}(\kappa) = \frac{I_{d/2}(\kappa)}{I_{d/2-1}(\kappa)}$.
- The (scalar) variance of $V(\mathbf{x}; \boldsymbol{\mu}, \kappa)$ is
\[\mathbb{E}\big[\|\mathbf{x} - \mathbb{E}[\mathbf{x}]\|^2\big] = 1 - A_{d}^2(\kappa).\]
- When $\kappa=0$, $A_d(0)=0$ (the uniform case);
- When $\kappa \rightarrow +\infty$, $A_{d}(\kappa) \rightarrow 1$.
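A quick numerical sanity check of these two limits (a sketch assuming $d=3$; not part of the original post) can be done with SciPy, where the exponential scaling of `ive` cancels in the ratio:

```python
# Check that A_d(kappa) = I_{d/2}(kappa) / I_{d/2-1}(kappa) goes to 0 as kappa -> 0
# and to 1 as kappa -> +infinity.
import numpy as np
from scipy.special import ive

def A(d, kappa):
    # The exp(-kappa) scaling factor of ive cancels in the ratio.
    return ive(d / 2.0, kappa) / ive(d / 2.0 - 1.0, kappa)

for kappa in [1e-6, 1.0, 10.0, 1e3]:
    print(f"A_3({kappa:g}) = {A(3, kappa):.6f}")
```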
2.2 Maximum Likelihood Estimation on vMF
Assume we have a set of observations $\mathcal{X} = \{\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_N\}$ drawn from $V(\mathbf{x}; \boldsymbol{\mu}, \kappa)$. We can estimate $\boldsymbol{\mu}, \kappa$ by maximum likelihood estimation (MLE). We begin by writing the log-likelihood $\log\mathcal{L}(\boldsymbol{\mu}, \kappa)$:
\[\begin{aligned} \log\mathcal{L}(\boldsymbol{\mu}, \kappa) &= \log \prod_{i=1}^{N} V(\mathbf{x}_i; \boldsymbol{\mu}, \kappa)\\ &= \sum_{i=1}^{N} \log V(\mathbf{x}_i; \boldsymbol{\mu}, \kappa) \\ &= \kappa\cdot \boldsymbol{\mu}^{\top} \Big(\sum_{i=1}^{N} \mathbf{x}_i\Big) + N\cdot \log C_d(\kappa)\\ &= \kappa\cdot \boldsymbol{\mu}^{\top} \mathbf{r} + N\cdot \log C_d(\kappa),\\ \end{aligned}\]
where we define
\[\begin{aligned} \mathbf{r} &= \sum_{i=1}^{N} \mathbf{x}_i \\ \bar{\mathbf{r}} &= \frac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i \\ \bar{r} &= \|\bar{\mathbf{r}} \| = \Big\|\frac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i\Big\|. \end{aligned}\]
MLE amounts to solving the following constrained problem:
\[\begin{aligned} \max_{\boldsymbol{\mu}, \kappa}\ &\log \mathcal{L}(\boldsymbol{\mu}, \kappa)\\ \text{s.t.}\ &\|\boldsymbol{\mu}\|=1. \end{aligned}\]
By introducing a Lagrange multiplier $\lambda$, we define
\[\begin{aligned} \ell(\boldsymbol{\mu}, \kappa) &\triangleq \log\mathcal{L}(\boldsymbol{\mu}, \kappa) - \lambda (\boldsymbol{\mu}^{\top}\boldsymbol{\mu}-1)\\ &= \kappa\cdot \boldsymbol{\mu}^{\top} \mathbf{r} + N\cdot \log C_d(\kappa) - \lambda (\boldsymbol{\mu}^{\top}\boldsymbol{\mu}-1). \end{aligned}\]
Solving Maximum Likelihood Estimation
We need to take the partial derivative w.r.t. $\boldsymbol{\mu}$ and $\kappa$.
Estimate $\boldsymbol{\mu}$
We first deal with $\boldsymbol{\mu}$
\[\frac{\partial \ell(\boldsymbol{\mu}, \kappa)}{\partial \boldsymbol{\mu}} = \kappa \mathbf{r} - 2\lambda \boldsymbol{\mu}.\]
Setting $\frac{\partial \ell(\boldsymbol{\mu}, \kappa)}{\partial \boldsymbol{\mu}}=0$, we have
\[\boldsymbol{\mu} = \frac{\kappa}{2\lambda} \mathbf{r}.\]
Remembering that $\|\boldsymbol{\mu}\| = \frac{\kappa}{2\lambda} \|\mathbf{r}\|=1$, we get
\[\lambda = \frac{\kappa}{2}\|\mathbf{r}\|.\]
Combining the above two equations, we obtain the estimate of $\boldsymbol{\mu}$:
\[\hat{\boldsymbol{\mu}} = \frac{\mathbf{r}}{\|\mathbf{r}\|} = \frac{\bar{\mathbf{r}}}{\|\bar{\mathbf{r}}\|} = \frac{\sum_{i=1}^{N} \mathbf{x}_i}{\|\sum_{i=1}^{N} \mathbf{x}_i\|}.\]
Estimate $\kappa$
Recall that
\[\ell(\boldsymbol{\mu}, \kappa) = \kappa\cdot \boldsymbol{\mu}^{\top} \mathbf{r} + N\cdot \log C_d(\kappa) - \lambda (\boldsymbol{\mu}^{\top}\boldsymbol{\mu}-1).\]
With the definition of $C_{d}(\kappa)$, we have \(\log C_{d}(\kappa) = (\frac{d}{2}-1)\cdot \log \kappa - \log I_{\frac{d}{2}-1}(\kappa) - \text{const}.\)
Also, based on the fact (see [1])
\[\frac{\mathrm{d} \log I_{n}(\kappa)}{\mathrm{d} \kappa} = \frac{I^{\prime}_{n}(\kappa)}{I_{n}(\kappa)} = \frac{I_{n+1}(\kappa) + \frac{n}{\kappa} I_{n}(\kappa)}{I_{n}(\kappa)} = \frac{I_{n+1}(\kappa)}{I_{n}(\kappa)} + \frac{n}{\kappa}\]we have
\[\frac{\mathrm{d} \log C_{d}(\kappa)}{\mathrm{d} \kappa} = \frac{d/2-1}{\kappa} - \frac{\mathrm{d} \log I_{\frac{d}{2}-1}(\kappa)}{\mathrm{d} \kappa} = - \frac{I_{\frac{d}{2}}(\kappa)}{I_{\frac{d}{2}-1}(\kappa)}.\]
Taking the partial derivative of $\ell(\boldsymbol{\mu}, \kappa)$ w.r.t. $\kappa$ and plugging in $\hat{\boldsymbol{\mu}} = \frac{\mathbf{r}}{\|\mathbf{r}\|} = \frac{\bar{\mathbf{r}}}{\|\bar{\mathbf{r}}\|}$, we have
\[\begin{aligned} \frac{\partial \ell(\boldsymbol{\mu}, \kappa)}{\partial \kappa} &= \boldsymbol{\mu}^{\top} \mathbf{r} + N \cdot \frac{\mathrm{d} \log C_{d}(\kappa)}{\mathrm{d} \kappa}\\ &= N\cdot \bar{r} - N\cdot \frac{I_{\frac{d}{2}}(\kappa)}{I_{\frac{d}{2}-1}(\kappa)}. \end{aligned}\]
Thus, $\kappa$ can be estimated by solving
\[\frac{I_{d/2}(\kappa)}{I_{d/2-1}(\kappa)} = \bar{r}.\]
Conclusion of MLE
\[\begin{aligned} &\hat{\boldsymbol{\mu}} = \frac{\mathbf{r}}{\|\mathbf{r}\|} = \frac{\bar{\mathbf{r}}}{\|\bar{\mathbf{r}}\|}\\ &\frac{I_{d/2}(\hat{\kappa})}{I_{d/2-1}(\hat{\kappa})} = \bar{r}, \end{aligned}\]
where we define
\[\begin{aligned} \mathbf{r} &= \sum_{i=1}^{N} \mathbf{x}_i \\ \bar{\mathbf{r}} &= \frac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i \\ \bar{r} &= \|\bar{\mathbf{r}} \| = \Big\|\frac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i\Big\|. \end{aligned}\]
Please note that the estimation equation for $\kappa$ is only implicit: there is no closed-form expression for $\hat{\kappa}$.
Banerjee et al. [2] give a numerical approximation for $\kappa$:
\[\hat{\kappa} = \frac{\bar{r}d - \bar{r}^3}{1 - \bar{r}^2}.\]
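Putting the two estimators together, a minimal sketch of the single-component vMF MLE in Python might look as follows (the function name `fit_vmf` is hypothetical, and `X` is assumed to be an `(N, d)` array of unit-norm rows; the exact Bessel-ratio equation could also be solved numerically instead of using the approximation):

```python
# Sketch of single-vMF maximum likelihood estimation.
import numpy as np

def fit_vmf(X):
    N, d = X.shape
    r = X.sum(axis=0)                      # r = sum_i x_i
    r_norm = np.linalg.norm(r)
    mu_hat = r / r_norm                    # mu_hat = r / ||r||
    r_bar = r_norm / N                     # mean resultant length
    kappa_hat = (r_bar * d - r_bar ** 3) / (1.0 - r_bar ** 2)  # approximation from [2]
    return mu_hat, kappa_hat
```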
3 von Mises-Fisher Mixture Models (vMFMM) and Parameter Estimation with Expectation-Maximization
3.1 von Mises-Fisher Mixture Models (vMFMM)
To model data generated on a hypersphere with multiple clusters, we can introduce the von Mises-Fisher Mixture Model (vMFMM):
\[p(\mathbf{x}; \boldsymbol{\alpha}, \boldsymbol{\theta})=\sum_{i=1}^K \alpha_{i} V(\mathbf{x}; \boldsymbol{\mu}_{i}, \kappa_{i})= \sum_{i=1}^K \alpha_{i} \cdot C_d(\kappa_{i}) e^{\kappa_{i} \cdot \boldsymbol{\mu}_{i}^\top \mathbf{x}},\]
where
- $\boldsymbol{\alpha} = \{\alpha_{1},\cdots, \alpha_{K}\}$ with $\sum_{i=1}^{K} \alpha_{i}=1$ are the mixing weights;
- $\boldsymbol{\theta} = \{ \boldsymbol{\mu}_{1}, \kappa_{1}, \cdots, \boldsymbol{\mu}_{K}, \kappa_{K} \}$ are the parameters of the individual components.
3.2 Parameter Estimation for vMFMM
Assume we have a set of observations $\mathcal{X} = \{\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_N\}$ generated from such a von Mises-Fisher mixture model. How can we estimate the parameters $\{\boldsymbol{\alpha}, \boldsymbol{\theta}\}$?
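The standard answer, following [2], is the Expectation-Maximization (EM) algorithm: the E-step computes the posterior responsibility $\gamma_{ij}$ of component $j$ for each observation $\mathbf{x}_i$, and the M-step re-runs a responsibility-weighted version of the single-component MLE derived above. As a sketch (the notation $\gamma_{ij}$ is introduced here; the full derivation is omitted), the updates are
\[\begin{aligned} \text{E-step:}\quad & \gamma_{ij} = \frac{\alpha_{j}\, V(\mathbf{x}_i; \boldsymbol{\mu}_{j}, \kappa_{j})}{\sum_{k=1}^{K} \alpha_{k}\, V(\mathbf{x}_i; \boldsymbol{\mu}_{k}, \kappa_{k})}\\ \text{M-step:}\quad & \alpha_{j} = \frac{1}{N}\sum_{i=1}^{N} \gamma_{ij}, \quad \mathbf{r}_{j} = \sum_{i=1}^{N} \gamma_{ij}\, \mathbf{x}_i, \quad \hat{\boldsymbol{\mu}}_{j} = \frac{\mathbf{r}_{j}}{\|\mathbf{r}_{j}\|}, \quad \bar{r}_{j} = \frac{\|\mathbf{r}_{j}\|}{\sum_{i=1}^{N} \gamma_{ij}}, \quad \hat{\kappa}_{j} = \frac{\bar{r}_{j} d - \bar{r}_{j}^3}{1 - \bar{r}_{j}^2}. \end{aligned}\]
The two steps are alternated until the log-likelihood stops improving.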
4 Implementation of the EM Algorithm for vMFMM
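Below is a minimal, self-contained sketch of the EM updates above in NumPy/SciPy (assumptions: `X` is an `(N, d)` array of unit-norm rows, a fixed number of iterations, random initialization from data points, and no numerical safeguards beyond log-space responsibilities; the function names are illustrative, not the post's original code):

```python
import numpy as np
from scipy.special import ive


def log_vmf(X, mu, kappa):
    """Row-wise log-density of V_d(x; mu, kappa). Uses log I_v(k) = log(ive(v, k)) + k."""
    d = X.shape[1]
    nu = d / 2.0 - 1.0
    log_c = nu * np.log(kappa) - (d / 2.0) * np.log(2.0 * np.pi) - (np.log(ive(nu, kappa)) + kappa)
    return log_c + kappa * (X @ mu)


def fit_vmf_mixture(X, K, n_iter=100, seed=0):
    N, d = X.shape
    rng = np.random.default_rng(seed)
    # Initialize means with random data points, uniform weights, moderate concentration.
    mu = X[rng.choice(N, K, replace=False)]
    kappa = np.full(K, 10.0)
    alpha = np.full(K, 1.0 / K)

    for _ in range(n_iter):
        # E-step: responsibilities gamma[i, j], computed in log space for stability.
        log_p = np.stack([np.log(alpha[j]) + log_vmf(X, mu[j], kappa[j]) for j in range(K)], axis=1)
        log_p -= log_p.max(axis=1, keepdims=True)
        gamma = np.exp(log_p)
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M-step: weighted version of the single-vMF MLE for each component.
        Nj = gamma.sum(axis=0)                     # effective counts per component
        alpha = Nj / N
        r = gamma.T @ X                            # (K, d) weighted resultant vectors
        r_norm = np.linalg.norm(r, axis=1)
        mu = r / r_norm[:, None]
        r_bar = r_norm / Nj
        kappa = (r_bar * d - r_bar ** 3) / (1.0 - r_bar ** 2)   # approximation from [2]

    return alpha, mu, kappa, gamma
```

Usage is simply `alpha, mu, kappa, gamma = fit_vmf_mixture(X, K=3)`; the returned `gamma` gives soft cluster assignments.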
References
[1] Derivative of Modified Bessel Function of the First Kind.
[2] A. Banerjee, I. S. Dhillon, J. Ghosh, and S. Sra. Clustering on the Unit Hypersphere using von Mises-Fisher Distributions. JMLR, 2005.