von Mises-Fisher Mixture Models and Parameter Estimation with Expectation-Maximization
Parameter estimation for von Mises-Fisher Mixture Models (vMFMM)
1 Introduction
Mixture models, particularly the Gaussian Mixture Model (GMM), are widely used for clustering and density estimation in Euclidean space. GMMs assume that data points are generated from a mixture of Gaussian distributions, each parameterized by a mean and a covariance matrix. This works well when the data lies in $\mathbb{R}^d$ and clusters are roughly ellipsoidal in shape.
However, many modern applications involve directional data — that is, data points that lie on the surface of a $d$-dimensional unit hypersphere $\mathbb{S}^{d-1}$. Examples include:
- Document embeddings that are normalized to unit norm before computing cosine similarity,
- Visual features extracted from deep neural networks and L2-normalized,
- Geospatial or orientation data, where only the direction (not magnitude) matters.
In such cases, applying GMMs directly can be suboptimal or even misleading. The reason is simple: GMMs do not respect the geometry of the hypersphere — they operate in Euclidean space, not on hyperspheres.
To model such directional data properly, we need a distribution defined directly on the hypersphere: the von Mises-Fisher (vMF) distribution, often considered the spherical analogue of the multivariate Gaussian. Just as Gaussian distributions are used for Euclidean space, vMF distributions are a natural choice for modeling points on the unit hypersphere.
And just as GMMs model data generation in $\mathbb{R}^d$, we can construct a von Mises-Fisher Mixture Model (vMFMM) to capture multiple clusters on the hypersphere $\mathbb{S}^{d-1}$. In this post, we will:
- Review the von Mises-Fisher distribution and its properties;
- Introduce the vMF mixture model and its likelihood formulation;
- Explain how to estimate parameters using the Expectation-Maximization (EM) algorithm;
- Provide an implementation and discuss practical considerations.
2 von Mises-Fisher (vMF) distribution
2.1 From Gaussian to von Mises-Fisher
Recall the Gaussian distribution in $\mathbb{R}^d$:
\[g(\mathbf{x}| \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\Big\{ -\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\Big\},\]
where $\boldsymbol{\mu} \in \mathbb{R}^d$ is the mean and $\boldsymbol{\Sigma} \in \mathbb{R}^{d \times d}$ is the covariance matrix.
Let’s consider the analogue case on the hypersphere $\mathbb{S}^{d-1}$.
Firstly, we begin with an isotropic Gaussian with $\boldsymbol{\Sigma} = \sigma^2 \mathbf{I}$:
\[g(\mathbf{x}| \boldsymbol{\mu}, \sigma^2 \mathbf{I}) = \frac{1}{(2\pi \sigma^2)^{d/2} } \exp\Big\{ -\frac{1}{2\sigma^2}\|\mathbf{x} - \boldsymbol{\mu}\|^2\Big\}.\]
By constraining $\|\mathbf{x}\|=1$ and $\|\boldsymbol{\mu}\|=1$, we have
\[\begin{aligned} g(\mathbf{x}| \boldsymbol{\mu}, \sigma^2 \mathbf{I}) &= \frac{1}{(2\pi \sigma^2)^{d/2} } \exp\Big\{ -\frac{1}{2\sigma^2} \big(\|\mathbf{x}\|^2 + \|\boldsymbol{\mu}\|^2 - 2\boldsymbol{\mu}^{\top} \mathbf{x}\big)\Big\}\\ &= \frac{1}{(2\pi \sigma^2)^{d/2} } \exp\Big\{ -\frac{1}{\sigma^2} (1- \boldsymbol{\mu}^{\top} \mathbf{x})\Big\}\\ &\propto \exp\Big\{ \frac{1}{\sigma^2} \boldsymbol{\mu}^{\top} \mathbf{x}\Big\}. \end{aligned}\]
This is exactly the shape of the von Mises-Fisher distribution. Specifically, we define
\[V_{d}(\mathbf{x}; \boldsymbol{\mu}, \kappa) = C_{d}(\kappa) \cdot e^{\kappa\cdot \boldsymbol{\mu}^\top \mathbf{x}}, \quad\text{with}\quad \|\mathbf{x}\|=1, \|\boldsymbol{\mu}\|=1,\]
where
- $\boldsymbol{\mu} \in \mathbb{S}^{d-1}$ with $\|\boldsymbol{\mu}\|=1$ is the mean direction. It is a radial direction on the hypersphere $\mathbb{S}^{d-1}$ and represents the “center” of the vMF distribution: the probability density is higher for directions near this center (analogous to the mean of a Gaussian).
- $\kappa \ge 0$ is the concentration parameter. It controls how strongly the probability density concentrates around the mean direction $\boldsymbol{\mu}$: the larger $\kappa$ is, the more concentrated the distribution is around $\boldsymbol{\mu}$ (analogous to the role of $\sigma$ in a Gaussian, since $\kappa$ plays the role of $\frac{1}{\sigma^2}$ in the above derivation).
- For $\kappa=0$, the vMF distribution reduces to the uniform distribution on the hypersphere;
- For $\kappa \rightarrow +\infty$, the vMF distribution degenerates to a Dirac delta, i.e., a single point mass at $\boldsymbol{\mu}$.
- $C_{d}(\kappa) = \frac{\kappa^{\frac{d}{2}-1}}{(2\pi)^{\frac{d}{2}} I_{\frac{d}{2}-1}(\kappa)}$ is the normalizing constant, where $I_{n}(\cdot)$ is the modified Bessel function of the first kind of order $n$.
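To make the definition concrete, here is a minimal sketch (not from the original post; the helper name `vmf_log_density` is just for illustration) of evaluating the vMF log-density in Python. It uses SciPy's exponentially scaled Bessel function `ive`, so that $\log I_{\frac{d}{2}-1}(\kappa)$ stays finite even for large $\kappa$:

```python
# Minimal sketch of the vMF log-density; assumes x and mu are unit-norm
# 1-D numpy arrays and kappa > 0.
import numpy as np
from scipy.special import ive


def vmf_log_density(x, mu, kappa):
    d = mu.shape[0]
    nu = d / 2.0 - 1.0
    # log I_nu(kappa) = log(ive(nu, kappa)) + kappa  (ive is the exponentially scaled Bessel function)
    log_bessel = np.log(ive(nu, kappa)) + kappa
    # log C_d(kappa) = (d/2 - 1) log kappa - (d/2) log(2*pi) - log I_{d/2-1}(kappa)
    log_c = nu * np.log(kappa) - (d / 2.0) * np.log(2.0 * np.pi) - log_bessel
    return log_c + kappa * (mu @ x)
```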
Expectation and Variance
- The expectation of $V(\mathbf{x}; \boldsymbol{\mu}, \kappa)$ is
\[\mathbb{E}[\mathbf{x}] = A_{d}(\kappa)\, \boldsymbol{\mu},\]
where $A_{d}(\kappa) = \frac{I_{d/2}(\kappa)}{I_{d/2-1}(\kappa)}$.
- The (scalar) variance of $V(\mathbf{x}; \boldsymbol{\mu}, \kappa)$ is
\[\mathbb{E}\big[\|\mathbf{x} - \mathbb{E}[\mathbf{x}]\|^2\big] = 1 - A_{d}^2(\kappa).\]
- When $\kappa=0$, $A_d(0)=0$ (the uniform case);
- When $\kappa \rightarrow +\infty$, $A_{d}(\kappa) \rightarrow 1$.
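A quick numerical sanity check of these two limits (a sketch assuming $d=3$; not part of the original post) can be done with SciPy, where the exponential scaling of `ive` cancels in the ratio:

```python
# Check that A_d(kappa) = I_{d/2}(kappa) / I_{d/2-1}(kappa) goes to 0 as kappa -> 0
# and to 1 as kappa -> +infinity.
import numpy as np
from scipy.special import ive

def A(d, kappa):
    # The exp(-kappa) scaling factor of ive cancels in the ratio.
    return ive(d / 2.0, kappa) / ive(d / 2.0 - 1.0, kappa)

for kappa in [1e-6, 1.0, 10.0, 1e3]:
    print(f"A_3({kappa:g}) = {A(3, kappa):.6f}")
```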
2.2 Maximum Likelihood Estimation on vMF
Assume we have a set of observations $\mathcal{X} = \{\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_N\}$ drawn from $V(\mathbf{x}; \boldsymbol{\mu}, \kappa)$. We can estimate $\boldsymbol{\mu}, \kappa$ by maximum likelihood estimation (MLE). We begin by writing the log-likelihood $\log\mathcal{L}(\boldsymbol{\mu}, \kappa)$:
\[\begin{aligned} \log\mathcal{L}(\boldsymbol{\mu}, \kappa) &= \log \prod_{i=1}^{N} V(\mathbf{x}_i; \boldsymbol{\mu}, \kappa)\\ &= \sum_{i=1}^{N} \log V(\mathbf{x}_i; \boldsymbol{\mu}, \kappa) \\ &= \kappa\cdot \boldsymbol{\mu}^{\top} \Big(\sum_{i=1}^{N} \mathbf{x}_i\Big) + N\cdot \log C_d(\kappa)\\ &= \kappa\cdot \boldsymbol{\mu}^{\top} \mathbf{r} + N\cdot \log C_d(\kappa),\\ \end{aligned}\]
where we define
\[\begin{aligned} \mathbf{r} &= \sum_{i=1}^{N} \mathbf{x}_i \\ \bar{\mathbf{r}} &= \frac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i \\ \bar{r} &= \|\bar{\mathbf{r}} \| = \Big\|\frac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i\Big\|. \end{aligned}\]
MLE amounts to solving the following constrained problem:
\[\begin{aligned} \max_{\boldsymbol{\mu}, \kappa}\ &\log \mathcal{L}(\boldsymbol{\mu}, \kappa)\\ \text{s.t.}\ &\|\boldsymbol{\mu}\|=1. \end{aligned}\]
By introducing a Lagrange multiplier $\lambda$, we define
\[\begin{aligned} \ell(\boldsymbol{\mu}, \kappa) &\triangleq \log\mathcal{L}(\boldsymbol{\mu}, \kappa) - \lambda (\boldsymbol{\mu}^{\top}\boldsymbol{\mu}-1)\\ &= \kappa\cdot \boldsymbol{\mu}^{\top} \mathbf{r} + N\cdot \log C_d(\kappa) - \lambda (\boldsymbol{\mu}^{\top}\boldsymbol{\mu}-1). \end{aligned}\]
Solving Maximum Likelihood Estimation
We need to take the partial derivative w.r.t. $\boldsymbol{\mu}$ and $\kappa$.
Estimate $\boldsymbol{\mu}$
We first deal with $\boldsymbol{\mu}$
\[\frac{\partial \ell(\boldsymbol{\mu}, \kappa)}{\partial \boldsymbol{\mu}} = \kappa \mathbf{r} - 2\lambda \boldsymbol{\mu}.\]
Setting $\frac{\partial \ell(\boldsymbol{\mu}, \kappa)}{\partial \boldsymbol{\mu}}=0$, we have
\[\boldsymbol{\mu} = \frac{\kappa}{2\lambda} \mathbf{r}.\]
Remembering that $\|\boldsymbol{\mu}\| = \frac{\kappa}{2\lambda} \|\mathbf{r}\|=1$, we get
\[\lambda = \frac{\kappa}{2}\|\mathbf{r}\|.\]
Combining the above two equations, we obtain the estimate of $\boldsymbol{\mu}$:
\[\hat{\boldsymbol{\mu}} = \frac{\mathbf{r}}{\|\mathbf{r}\|} = \frac{\bar{\mathbf{r}}}{\|\bar{\mathbf{r}}\|} = \frac{\sum_{i=1}^{N} \mathbf{x}_i}{\|\sum_{i=1}^{N} \mathbf{x}_i\|}.\]
Estimate $\kappa$
Recall that
\[\ell(\boldsymbol{\mu}, \kappa) = \kappa\cdot \boldsymbol{\mu}^{\top} \mathbf{r} + N\cdot \log C_d(\kappa) - \lambda (\boldsymbol{\mu}^{\top}\boldsymbol{\mu}-1).\]
With the definition of $C_{d}(\kappa)$, we have \(\log C_{d}(\kappa) = (\frac{d}{2}-1)\cdot \log \kappa - \log I_{\frac{d}{2}-1}(\kappa) - \text{const}.\)
Also, based on the fact (see [1])
\[\frac{\mathrm{d} \log I_{n}(\kappa)}{\mathrm{d} \kappa} = \frac{I^{\prime}_{n}(\kappa)}{I_{n}(\kappa)} = \frac{I_{n+1}(\kappa) + \frac{n}{\kappa} I_{n}(\kappa)}{I_{n}(\kappa)} = \frac{I_{n+1}(\kappa)}{I_{n}(\kappa)} + \frac{n}{\kappa}\]we have
\[\frac{\mathrm{d} \log C_{d}(\kappa)}{\mathrm{d} \kappa} = \frac{d/2-1}{\kappa} - \frac{\mathrm{d} \log I_{\frac{d}{2}-1}(\kappa)}{\mathrm{d} \kappa} = - \frac{I_{\frac{d}{2}}(\kappa)}{I_{\frac{d}{2}-1}(\kappa)}.\]
Taking the partial derivative of $\ell(\boldsymbol{\mu}, \kappa)$ w.r.t. $\kappa$ and plugging in $\hat{\boldsymbol{\mu}} = \frac{\mathbf{r}}{\|\mathbf{r}\|} = \frac{\bar{\mathbf{r}}}{\|\bar{\mathbf{r}}\|}$, we have
\[\begin{aligned} \frac{\partial \ell(\boldsymbol{\mu}, \kappa)}{\partial \kappa} &= \boldsymbol{\mu}^{\top} \mathbf{r} + N \cdot \frac{\mathrm{d} \log C_{d}(\kappa)}{\mathrm{d} \kappa}\\ &= N\cdot \bar{r} - N\cdot \frac{I_{\frac{d}{2}}(\kappa)}{I_{\frac{d}{2}-1}(\kappa)}. \end{aligned}\]
Thus, $\kappa$ can be estimated by solving
\[\frac{I_{d/2}(\kappa)}{I_{d/2-1}(\kappa)} = \bar{r}.\]
Conclusion of MLE
\[\begin{aligned} &\hat{\boldsymbol{\mu}} = \frac{\mathbf{r}}{\|\mathbf{r}\|} = \frac{\bar{\mathbf{r}}}{\|\bar{\mathbf{r}}\|}\\ &\frac{I_{d/2}(\hat{\kappa})}{I_{d/2-1}(\hat{\kappa})} = \bar{r}, \end{aligned}\]
where we define
\[\begin{aligned} \mathbf{r} &= \sum_{i=1}^{N} \mathbf{x}_i \\ \bar{\mathbf{r}} &= \frac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i \\ \bar{r} &= \|\bar{\mathbf{r}} \| = \Big\|\frac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i\Big\|. \end{aligned}\]
Please note that the estimation equation for $\kappa$ is only implicit: there is no closed-form expression for $\hat{\kappa}$.
Banerjee et al. [2] give a numerical approximation for $\kappa$:
\[\hat{\kappa} = \frac{\bar{r}d - \bar{r}^3}{1 - \bar{r}^2}.\]
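Putting the two estimators together, a minimal sketch of the single-component vMF MLE in Python might look as follows (the function name `fit_vmf` is hypothetical, and `X` is assumed to be an `(N, d)` array of unit-norm rows; the exact Bessel-ratio equation could also be solved numerically instead of using the approximation):

```python
# Sketch of single-vMF maximum likelihood estimation.
import numpy as np

def fit_vmf(X):
    N, d = X.shape
    r = X.sum(axis=0)                      # r = sum_i x_i
    r_norm = np.linalg.norm(r)
    mu_hat = r / r_norm                    # mu_hat = r / ||r||
    r_bar = r_norm / N                     # mean resultant length
    kappa_hat = (r_bar * d - r_bar ** 3) / (1.0 - r_bar ** 2)  # approximation from [2]
    return mu_hat, kappa_hat
```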
3 von Mises-Fisher Mixture Models (vMFMM) and Parameter Estimation with Expectation-Maximization
3.1 von Mises-Fisher Mixture Models (vMFMM)
To model data generated on a hypersphere with multiple clusters, we can introduce the von Mises-Fisher Mixture Model (vMFMM):
\[p(\mathbf{x}; \boldsymbol{\alpha}, \boldsymbol{\theta})=\sum_{i=1}^K \alpha_{i} V(\mathbf{x}; \boldsymbol{\mu}_{i}, \kappa_{i})= \sum_{i=1}^K \alpha_{i} \cdot C_d(\kappa_{i}) e^{\kappa_{i} \cdot \boldsymbol{\mu}_{i}^\top \mathbf{x}},\]
where
- $\boldsymbol{\alpha} = \{\alpha_{1},\cdots, \alpha_{K}\}$ with $\sum_{i=1}^{K} \alpha_{i}=1$ are the mixing weights;
- $\boldsymbol{\theta} = \{ \boldsymbol{\mu}_{1}, \kappa_{1}, \cdots, \boldsymbol{\mu}_{K}, \kappa_{K} \}$ are the parameters of the individual components.
3.2 Parameter Estimation for vMFMM
Assume we have a set of observations $\mathcal{X} = \{\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_N\}$ generated from such a von Mises-Fisher mixture model. How can we estimate the parameters $\{\boldsymbol{\alpha}, \boldsymbol{\theta}\}$?
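The standard answer, following [2], is the Expectation-Maximization (EM) algorithm: the E-step computes the posterior responsibility $\gamma_{ij}$ of component $j$ for each observation $\mathbf{x}_i$, and the M-step re-runs a responsibility-weighted version of the single-component MLE derived above. As a sketch (the notation $\gamma_{ij}$ is introduced here; the full derivation is omitted), the updates are
\[\begin{aligned} \text{E-step:}\quad & \gamma_{ij} = \frac{\alpha_{j}\, V(\mathbf{x}_i; \boldsymbol{\mu}_{j}, \kappa_{j})}{\sum_{k=1}^{K} \alpha_{k}\, V(\mathbf{x}_i; \boldsymbol{\mu}_{k}, \kappa_{k})}\\ \text{M-step:}\quad & \alpha_{j} = \frac{1}{N}\sum_{i=1}^{N} \gamma_{ij}, \quad \mathbf{r}_{j} = \sum_{i=1}^{N} \gamma_{ij}\, \mathbf{x}_i, \quad \hat{\boldsymbol{\mu}}_{j} = \frac{\mathbf{r}_{j}}{\|\mathbf{r}_{j}\|}, \quad \bar{r}_{j} = \frac{\|\mathbf{r}_{j}\|}{\sum_{i=1}^{N} \gamma_{ij}}, \quad \hat{\kappa}_{j} = \frac{\bar{r}_{j} d - \bar{r}_{j}^3}{1 - \bar{r}_{j}^2}. \end{aligned}\]
The two steps are alternated until the log-likelihood stops improving.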
4 Implementation of the EM Algorithm for vMFMM
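Below is a minimal, self-contained sketch of the EM updates above in NumPy/SciPy (assumptions: `X` is an `(N, d)` array of unit-norm rows, a fixed number of iterations, random initialization from data points, and no numerical safeguards beyond log-space responsibilities; the function names are illustrative, not the post's original code):

```python
import numpy as np
from scipy.special import ive


def log_vmf(X, mu, kappa):
    """Row-wise log-density of V_d(x; mu, kappa). Uses log I_v(k) = log(ive(v, k)) + k."""
    d = X.shape[1]
    nu = d / 2.0 - 1.0
    log_c = nu * np.log(kappa) - (d / 2.0) * np.log(2.0 * np.pi) - (np.log(ive(nu, kappa)) + kappa)
    return log_c + kappa * (X @ mu)


def fit_vmf_mixture(X, K, n_iter=100, seed=0):
    N, d = X.shape
    rng = np.random.default_rng(seed)
    # Initialize means with random data points, uniform weights, moderate concentration.
    mu = X[rng.choice(N, K, replace=False)]
    kappa = np.full(K, 10.0)
    alpha = np.full(K, 1.0 / K)

    for _ in range(n_iter):
        # E-step: responsibilities gamma[i, j], computed in log space for stability.
        log_p = np.stack([np.log(alpha[j]) + log_vmf(X, mu[j], kappa[j]) for j in range(K)], axis=1)
        log_p -= log_p.max(axis=1, keepdims=True)
        gamma = np.exp(log_p)
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M-step: weighted version of the single-vMF MLE for each component.
        Nj = gamma.sum(axis=0)                     # effective counts per component
        alpha = Nj / N
        r = gamma.T @ X                            # (K, d) weighted resultant vectors
        r_norm = np.linalg.norm(r, axis=1)
        mu = r / r_norm[:, None]
        r_bar = r_norm / Nj
        kappa = (r_bar * d - r_bar ** 3) / (1.0 - r_bar ** 2)   # approximation from [2]

    return alpha, mu, kappa, gamma
```

Usage is simply `alpha, mu, kappa, gamma = fit_vmf_mixture(X, K=3)`; the returned `gamma` gives soft cluster assignments.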
References
[1] Derivative of Modified Bessel Function of the First Kind.
[2] A. Banerjee, I. S. Dhillon, J. Ghosh, and S. Sra. Clustering on the Unit Hypersphere using von Mises-Fisher Distributions. JMLR, 2005.