In this article we will look at the basics of the mean shift algorithm.

Kernel Density Estimation

Let us first consider a univariate Gaussian PDF and data sampled from it.
Kernel density estimation uses the fact that the density of samples around a given point is proportional to the probability at that point.

Approximating the probability density by estimating the local density of points,
as seen in figure fig:image1, is therefore reasonable.

A large density of points is observed near the maximum of the PDF.

Density estimation (fig:image3)
The KDE estimates the PDF by a superposition of kernels, one centered at each data point, which is equivalent
to convolving the data points with a Gaussian kernel.
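As an illustration, this superposition view can be sketched in Python (the repository code accompanying this article is MATLAB; the data, grid, and bandwidth below are arbitrary choices for the sketch):

```python
import numpy as np

def gaussian_kde(x_grid, samples, h):
    """Kernel density estimate: average of Gaussian kernels
    centered at each sample, with bandwidth h."""
    # pairwise scaled differences between grid points and samples
    u = (x_grid[:, None] - samples[None, :]) / h
    # Gaussian kernel K(u) = exp(-u^2/2) / sqrt(2*pi)
    K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return K.sum(axis=1) / (len(samples) * h)

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=2000)  # univariate Gaussian data
grid = np.linspace(-4, 4, 81)
pdf_hat = gaussian_kde(grid, samples, h=0.3)
# the estimate should peak near the true mode (x = 0)
print(grid[np.argmax(pdf_hat)])
```

The estimate peaks where the samples are densest, consistent with the figure.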

Mean Shift

Mean shift is a tool for finding the modes of a set of data samples
drawn from an underlying PDF. The aim of the mean
shift algorithm is to find the densest region in a given set of data samples.

A high density of data points implies a high PDF value.

Let us consider a 2D region. The points in the 2D region are sampled
from an underlying unknown PDF.

Let $X=[x,y]$ be the random variable associated with a multivariate
PDF $P(X)$.

Sampling a point from $P(X)$ gives us a vector $X'=[x',y']$.

For example, let us consider a multivariate Gaussian distribution
where the random variables x and y take values in the range
-3 to 3.
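Such sampling can be sketched as follows, assuming a zero-mean, unit-covariance Gaussian (so most samples fall in the range -3 to 3):

```python
import numpy as np

rng = np.random.default_rng(1)
# zero-mean, unit-covariance 2D Gaussian; samples mostly lie in [-3, 3]
mean = np.array([0.0, 0.0])
cov = np.eye(2)
X = rng.multivariate_normal(mean, cov, size=1000)  # each row is one sample [x', y']
print(X.shape)  # (1000, 2)
```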

Modes of a Smooth Function

Let us say we want to find the modes of the PDF, where the PDF is approximated using kernel density
estimation. Modes are the points at which the PDF exhibits a local maximum.

Dense regions in the data correspond to modes, or local maxima, of the PDF.

Since the kernel is smooth, it is differentiable and yields a smooth PDF. The gradient of the density estimate is given by
\begin{eqnarray*}
\hat{f}_h(x) &=& \frac{1}{n}\sum_{i=1}^n K_h (x - x_i) = \frac{1}{nh} \sum_{i=1}^n K\Big(\frac{x-x_i}{h}\Big)\\
\nabla \hat{f}_h(x) &=& \frac{1}{nh}\sum_{i=1}^n \nabla K\Big(\frac{x-x_i}{h}\Big)
\end{eqnarray*}
For a Gaussian kernel,
\begin{eqnarray*}
\nabla \hat{f}_h(x) &=& \frac{C}{nh}\sum_{i=1}^n \nabla \exp\Big(-\frac{(x-x_i)^2}{2h^2}\Big) \\
 &=& \frac{C}{nh^3}\sum_{i=1}^n \exp\Big(-\frac{(x-x_i)^2}{2h^2}\Big)(x_i-x) \\
 &=& \frac{1}{nh^3}\sum_{i=1}^n K\Big(\frac{x-x_i}{h}\Big)(x_i-x)
\end{eqnarray*}
Equating the gradient to zero and writing $x'$ for the solution,
\begin{eqnarray*}
\sum_{i=1}^n K\Big(\frac{x-x_i}{h}\Big)(x_i-x') &=& 0 \\
x'\sum_{i=1}^n K\Big(\frac{x-x_i}{h}\Big) &=& \sum_{i=1}^n K\Big(\frac{x-x_i}{h}\Big)x_i\\
x' &=& \frac{\sum_{i=1}^n K\Big(\frac{x-x_i}{h}\Big)x_i}{\sum_{i=1}^n K\Big(\frac{x-x_i}{h}\Big)}
\end{eqnarray*}
The estimate $x'=m(x)$ is called the sample mean at $x$ with kernel $K$.
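A sketch of computing the sample mean $m(x)$ with a Gaussian kernel (the data, query point, and bandwidth are illustrative choices, not values from the original code):

```python
import numpy as np

def sample_mean(x, samples, h):
    """m(x): kernel-weighted mean of the samples around x."""
    # Gaussian weights K((x - x_i)/h) for each sample
    w = np.exp(-0.5 * ((x - samples) / h) ** 2)
    return np.sum(w * samples) / np.sum(w)

rng = np.random.default_rng(2)
samples = rng.normal(0.0, 1.0, size=2000)  # dense region near 0
x = 1.5
m = sample_mean(x, samples, h=0.5)
# m(x) is pulled from x toward the dense region near 0
print(m)
```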

The sample mean is always biased towards regions of high density.

Thus if we move along the vector $m(x)-x$, we reach a region of higher
density: the density at $m(x)$ will be greater than the density at $x$.
This forms the basis of the mean shift algorithm.

The vector $m(x)-x$ is called the mean shift vector; it always points in the direction of the
local maximum, or mode.
\begin{eqnarray*}
m(x)-x &=& \frac{\sum_{i=1}^n K\Big(\frac{x-x_i}{h}\Big)(x_i-x)}{\sum_{i=1}^n K\Big(\frac{x-x_i}{h}\Big)} \\
m(x)-x &=& \frac{h^2\nabla \hat{f}_h(x)}{\hat{f}_h(x)}
\end{eqnarray*}
This is an estimate of the normalized gradient of $\hat{f}_h(x)$.

Given any point $x$, we can take a small step in the direction of the vector
$m(x)-x$ to reach the local maximum.

Let the present estimate of the mode be $x$; we then compute $m(x)$ at this point.

For example, let the initial estimate of the location of the mode be $(-0.96,-2.01)$.
The density at this point can be approximated by interpolation
or computed again using non-parametric density estimation.

The plot for this is shown in fig:image5. The estimate clearly does
not lie at the maximum.

To find the direction of the mean shift vector we compute the gradient
of the normalized density estimate and take a small step in that
direction. This is repeated until the gradient magnitude is very small.
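The iteration can be sketched as repeatedly replacing the current estimate with the sample mean until the mean shift vector is negligible (a 1D Python sketch with illustrative data, bandwidth, and tolerance; the repository implementation is in MATLAB):

```python
import numpy as np

def mean_shift_mode(x0, samples, h, tol=1e-6, max_iter=500):
    """Iterate x <- m(x) until the mean shift step is tiny."""
    x = x0
    for _ in range(max_iter):
        w = np.exp(-0.5 * ((x - samples) / h) ** 2)  # kernel weights
        m = np.sum(w * samples) / np.sum(w)          # sample mean m(x)
        if abs(m - x) < tol:                         # mean shift vector ~ 0
            break
        x = m                                        # move toward higher density
    return x

rng = np.random.default_rng(3)
samples = rng.normal(0.0, 1.0, size=2000)
mode = mean_shift_mode(x0=2.5, samples=samples, h=0.4)
print(mode)  # should land near the true mode at 0
```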

Mean shift (fig:image5)
A video of the mean shift algorithm using KDE is shown in

In this case we scale the gradient by the estimated PDF value to obtain a normalized
gradient.
\begin{eqnarray*}
m(x)-x = \frac{h^2\nabla \hat{f}_h(x)}{\hat{f}_h(x)}
\end{eqnarray*}
This enables us to adaptively change the step size based on the estimated
PDF value. The step size magnitude is inversely proportional to the estimated PDF value:
if the estimated PDF value is small, we are far from the maximum and the
step size will be large.

If the estimated PDF value is large, we are close to the maximum and the step
size will be small.

Using the Gradient

To find the modes of the PDF, we do not actually need to estimate
the PDF itself; we only need the gradient of the PDF to move in the direction
of the mean shift vector.

The gradient of a superposition of kernels centered at each data point
is equivalent to convolving the data points with the gradient of the kernel.

Instead of convolving with the Gaussian kernel, we can convolve with the
gradient of the Gaussian kernel.

\begin{eqnarray*}
k(X) &=& C \exp\Big(-\frac{x^2+y^2}{2h^2}\Big) \\
\nabla k(X) &=& \Big[-\frac{x}{h^2}\,k(X) ,\; -\frac{y}{h^2}\,k(X) \Big]
\end{eqnarray*}
Thus, given an initial point $X$, we estimate the value at $X$ using the kernels
$-\frac{x}{h^2}\,k(X)$ and $-\frac{y}{h^2}\,k(X)$, which gives us the direction of the gradient at the point
$X$.
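A sketch of estimating the gradient directly by superposing gradient-of-Gaussian kernels in 2D (the data and bandwidth are illustrative assumptions):

```python
import numpy as np

def kde_gradient(X, samples, h):
    """Estimate the PDF gradient at X by superposing the gradient
    of a Gaussian kernel centered at each sample."""
    d = X - samples                                    # (n, 2) differences
    k = np.exp(-np.sum(d**2, axis=1) / (2.0 * h**2))   # kernel values
    # differentiating exp(...) brings down a factor -(X - x_i)/h^2
    return (-d / h**2 * k[:, None]).sum(axis=0) / len(samples)

rng = np.random.default_rng(4)
samples = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=3000)
g = kde_gradient(np.array([1.0, 1.0]), samples, h=0.5)
print(g)  # the gradient points back toward the mode at the origin
```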

Since we do not estimate the PDF at a point, but only the gradient of the PDF,
we take a step in the direction of the mean shift vector at each mean shift
iteration. In the earlier case, we scaled the gradient by the estimated PDF value
to obtain a normalized measure.

However, in the present case we do not adaptively change the step size;
instead we take a step of fixed size in the direction of the gradient.

This still incorporates some adaptive behavior, since the mean shift vector magnitude
depends on the gradient magnitude.

If the gradient magnitude is large, the step taken will be large; otherwise the
step will be small and refined, near the maximum.
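This fixed-step variant can be sketched as follows; the step size `step`, tolerance, and data are assumed constants for the sketch, not values from the original code:

```python
import numpy as np

def gradient_ascent_mode(x0, samples, h, step=2.0, tol=1e-5, max_iter=2000):
    """Fixed-step ascent along the estimated PDF gradient."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        d = x - samples
        k = np.exp(-np.sum(d**2, axis=1) / (2.0 * h**2))
        grad = (-d / h**2 * k[:, None]).sum(axis=0) / len(samples)
        if np.linalg.norm(grad) < tol:   # flat region or local maximum
            break
        x = x + step * grad              # fixed step size, no normalization
    return x

rng = np.random.default_rng(5)
samples = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=3000)
mode = gradient_ascent_mode([1.5, -1.5], samples, h=0.5)
print(mode)  # should converge near the origin
```

Note how the effective step shrinks automatically near the maximum, where the gradient magnitude is small.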

A video of the mean shift algorithm using gradient estimates is shown in

This iterative algorithm is a standard gradient ascent algorithm,
and convergence is guaranteed for an infinitesimally small step size.

Since the algorithm depends on the kernel density estimate,
the bandwidth of the kernel plays an important role in the mean shift
algorithm as well.
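The bandwidth effect can be illustrated by counting the modes a KDE exhibits at different bandwidths on bimodal data (an illustrative sketch; the cluster locations and bandwidths are assumptions):

```python
import numpy as np

def kde(grid, samples, h):
    """1D Gaussian kernel density estimate on a grid."""
    u = (grid[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(samples) * h * np.sqrt(2 * np.pi))

def count_modes(pdf):
    # interior grid points strictly higher than both neighbours
    return int(np.sum((pdf[1:-1] > pdf[:-2]) & (pdf[1:-1] > pdf[2:])))

rng = np.random.default_rng(6)
# bimodal data: two well-separated Gaussian clusters
samples = np.concatenate([rng.normal(-3, 0.5, 500), rng.normal(3, 0.5, 500)])
grid = np.linspace(-6, 6, 241)
small_h = count_modes(kde(grid, samples, h=0.3))  # resolves both modes
large_h = count_modes(kde(grid, samples, h=5.0))  # oversmooths into one mode
print(small_h, large_h)
```

Too small a bandwidth can also create spurious modes from sampling noise, so the choice of bandwidth directly determines which modes mean shift can find.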

Local Maxima

The algorithm terminates when we reach a region where the local density is flat
or we have reached a local maximum.

This is a problem for all algorithms that try to reach a global
maximum. The animation for the same is shown in

The code is written in MATLAB and is
available in the repository https://github.com/pi19404/m19404/tree/master/meanshift.
The file mean_shift.m is the main file. The file kgde2 implements a kernel density estimator
using bivariate Gaussian windows for 2D distributions. The file kgde2x implements
gradient estimation on the KDE; the dim parameter selects computation of the gradient
along the x or y direction.