Kernel density estimation

[Figure: A 1D kernel density estimation]

I will summarize some ideas about nonparametric estimation, including some basic results needed to develop more advanced theory later.

In the first post we talked about density estimation and nonparametric regression. Later, in the posts about the histogram (I, II, III, IV), we saw that the histogram is a nonparametric estimator and we studied its properties.

Now, we are ready to go one step further and study the kernel density estimators in general.

1. Kernel density estimation

Let $latex {X_{1},\ldots,X_{n}}&fg=000000$ be i.i.d. random variables with common density $latex {p}&fg=000000$ on $latex {{\mathbb R}}&fg=000000$. The distribution function associated with $latex {p}&fg=000000$ is $latex {F(x)=\int_{-\infty}^{x}p(t)dt}&fg=000000$. Consider the empirical distribution function

$latex \displaystyle F_{n}(x)=\frac{1}{n}\sum_{i=1}^{n}I(X_{i}\leq x). &fg=000000$

By the strong law of large numbers we have $latex {F_{n}(x)\rightarrow F(x)}&fg=000000$ almost surely as $latex {n\rightarrow\infty}&fg=000000$, for all $latex {x}&fg=000000$ in $latex {{\mathbb R}}&fg=000000$. Therefore, $latex {F_{n}(x)}&fg=000000$ is a consistent estimator of $latex {F(x)}&fg=000000$ for every $latex {x}&fg=000000$ in $latex {{\mathbb R}}&fg=000000$. The natural question is: how can we estimate $latex {p?}&fg=000000$
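As a quick numerical illustration of this consistency, here is a minimal sketch in Python; the standard normal distribution, the point $latex {x}&fg=000000$ and the sample sizes are arbitrary choices, not from the text:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    x = 0.5
    for n in (100, 10_000, 1_000_000):
        X = rng.normal(size=n)            # i.i.d. standard normal sample
        Fn = np.mean(X <= x)              # empirical distribution function at x
        print(n, Fn, norm.cdf(x))         # Fn(x) approaches F(x) as n grows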

The following argument was one of the first solutions. For $latex {h>0}&fg=000000$ small we have

$latex \displaystyle p(x)\approx\frac{F(x+h)-F(x-h)}{2h}. &fg=000000$

Replacing $latex {F}&fg=000000$ by its estimator $latex {F_{n}}&fg=000000$, we define

$latex \displaystyle \hat{p}_{n}^{R}(x)=\frac{F_{n}(x+h)-F_{n}(x-h)}{2h}, &fg=000000$

where $latex {\hat{p}_{n}^{R}(x)}&fg=000000$ is the Rosenblatt estimator. We can rewrite it in the form

$latex \displaystyle \hat{p}_{n}^{R}(x)=\frac{1}{2nh}\sum_{i=1}^{n}I(x-h<X_{i}\leq x+h)=\frac{1}{nh}\sum_{i=1}^{n}K_{0}\left(\frac{X_{i}-x}{h}\right) &fg=000000$

where $latex {K_{0}(u)=\frac{1}{2}I(-1<u\leq1)}&fg=000000$; this is essentially a histogram-type estimator with a moving window of width $latex {2h}&fg=000000$ centred at $latex {x}&fg=000000$.
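A quick numerical sanity check of this identity (a minimal sketch; the sample, the point $latex {x}&fg=000000$ and the bandwidth $latex {h}&fg=000000$ are arbitrary illustrative choices):

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=1000)             # i.i.d. sample from some density p
    x, h = 0.3, 0.2                       # evaluation point and bandwidth

    # Form 1: difference of the empirical distribution function
    Fn = lambda t: np.mean(X <= t)
    rosenblatt_cdf = (Fn(x + h) - Fn(x - h)) / (2 * h)

    # Form 2: kernel sum with K0(u) = (1/2) I(-1 < u <= 1)
    u = (X - x) / h
    rosenblatt_kernel = np.mean(0.5 * ((u > -1) & (u <= 1))) / h

    print(rosenblatt_cdf, rosenblatt_kernel)   # the two forms coincide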

In general,

$latex \displaystyle \hat{p}_{n}(x)=\frac{1}{nh}\sum_{i=1}^{n}K\left(\frac{X_{i}-x}{h}\right) &fg=000000$

where $latex {K:{\mathbb R}\rightarrow{\mathbb R}}&fg=000000$, $latex {\int K(u)du=1}&fg=000000$ and $latex {K(u)\geq0}&fg=000000$. The function $latex {K}&fg=000000$ is called the kernel and $latex {h}&fg=000000$ is the bandwidth, which depends on $latex {n}&fg=000000$. The function $latex {x\rightarrow\hat{p}_{n}(x)}&fg=000000$ is the kernel (Parzen–Rosenblatt) estimator.

Some classical examples of kernels are:

  • Rectangular: $latex {K(u)=\frac{1}{2}I( |u|\leq1)}&fg=000000$ .
  • Triangular: $latex {K(u)=(1-| u|)I(|u|\leq1)}&fg=000000$ .
  • Epanechnikov: $latex {K(u)=\frac{3}{4}(1-u^{2})I(|u|\leq1)}&fg=000000$ .
  • Gaussian: $latex {K(u)=\frac{1}{\sqrt{2\pi}}\exp(-u^{2}/2)}&fg=000000$.
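As an illustration, here is a minimal sketch of the Parzen–Rosenblatt estimator using the kernels listed above (the sample, the evaluation grid and the bandwidth are arbitrary illustrative choices):

    import numpy as np

    kernels = {
        "rectangular":  lambda u: 0.5 * (np.abs(u) <= 1),
        "triangular":   lambda u: (1 - np.abs(u)) * (np.abs(u) <= 1),
        "epanechnikov": lambda u: 0.75 * (1 - u**2) * (np.abs(u) <= 1),
        "gaussian":     lambda u: np.exp(-u**2 / 2) / np.sqrt(2 * np.pi),
    }

    def kde(x_grid, X, h, K):
        """Parzen-Rosenblatt estimator (1/(n h)) sum_i K((X_i - x)/h) on a grid."""
        u = (X[None, :] - x_grid[:, None]) / h
        return K(u).mean(axis=1) / h

    rng = np.random.default_rng(2)
    X = rng.normal(size=500)                      # i.i.d. sample
    x_grid = np.linspace(-3, 3, 61)
    for name, K in kernels.items():
        p_hat = kde(x_grid, X, h=0.4, K=K)
        print(name, round(float(p_hat[30]), 4))   # estimate at x = 0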

2. Mean Squared Error (or Risk) of Kernel Estimators

At an arbitrary point $latex {x_{0}\in{\mathbb R}}&fg=000000$ we have

$latex \displaystyle \mathrm{MSE}=\mathrm{MSE}(x_{0})\triangleq\mathop{\mathbb E}_p{\left(\hat{p}_{n}(x_{0})-p(x_{0})\right)^{2}} \\ =\int\left(\hat{p}_{n}(x_{0},x_{1},\ldots,x_{n})-p(x_{0})\right)^{2}\prod_{i=1} ^{n}\left[p(x_{i})dx_{i}\right]. &fg=000000$

A simple calculation gives us

$latex \displaystyle \mathrm{MSE}(x_{0})=b^{2}(x_{0})+\sigma^{2}(x_{0}) &fg=000000$

where $latex {b(x_{0})}&fg=000000$ is the bias and $latex {\sigma^{2}(x_{0})}&fg=000000$ is the variance of the estimator $latex {\hat{p}_{n}(x_{0})}&fg=000000$, defined as

$latex \displaystyle \begin{array}{rcl} b(x_{0}) & = & \mathop{\mathbb E}_p{\hat{p}_{n} (x_{0})}-p(x_{0})\\ \sigma^{2}(x_{0}) & = & \mathop{\mathbb E}_p{\left(\hat{p}_{n} (x_{0})-\mathop{\mathbb E}_p{\hat{p}_{n} (x_{0})}\right)^{2}}. \end{array} &fg=000000$

We shall analyze both terms separately.
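A small Monte Carlo sketch of this decomposition (Gaussian kernel and standard normal density; the point $latex {x_{0}}&fg=000000$, the bandwidth and the sample size are arbitrary illustrative choices):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(3)
    n, h, x0 = 200, 0.3, 0.5
    p_true = norm.pdf(x0)                       # true density at x0

    def p_hat(X):                               # Gaussian-kernel estimator at x0
        return np.mean(norm.pdf((X - x0) / h)) / h

    est = np.array([p_hat(rng.normal(size=n)) for _ in range(5000)])
    bias = est.mean() - p_true
    var = est.var()
    mse = np.mean((est - p_true) ** 2)
    print(mse, bias**2 + var)                   # the two quantities coincide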

2.1. Variance of the estimator $latex {\hat{p}_{n} }&fg=000000$

Proposition 1 Suppose that the density $latex {p}&fg=000000$ satisfies $latex {p(x)\leq p_{\max}<\infty}&fg=000000$ for all $latex {x\in{\mathbb R}}&fg=000000$. Let $latex {K:{\mathbb R}\rightarrow{\mathbb R}}&fg=000000$ be a function such that

$latex \displaystyle \int K^{2}(u)du<\infty. &fg=000000$

Then for any $latex {x_{0}\in{\mathbb R}}&fg=000000$, $latex {h>0}&fg=000000$, and $latex {n\geq1}&fg=000000$ we have

$latex \displaystyle \sigma^{2}(x_{0})\leq\frac{C_{1}}{nh} &fg=000000$

where $latex {C_{1}=p_{\max}\int K^{2}(u)du.}&fg=000000$

Proof: We have

$latex \displaystyle \begin{array}{rl} \mathop{\mathrm{Var}}(\hat{p}_{n} (x_{0}))= & \displaystyle\frac{1}{nh^{2}}\mathop{\mathrm{Var}}\left(K\left(\frac{X_{1}-x_{0}}{h}\right)\right)\\ \leq & \displaystyle\frac{1}{nh^{2}}\mathop{\mathbb E}_p{K^{2}\left(\frac{X_{1}-x_{0}}{h}\right)}\\ = & \displaystyle\frac{1}{nh}\int K^{2}(u)p(uh+x_{0})du\\ \leq & \displaystyle\frac{1}{nh}p_{\max}\int K^{2}(u)du. \end{array} &fg=000000$

$latex \Box&fg=000000$

Finally, if $latex {nh\rightarrow\infty}&fg=000000$ as $latex {n\rightarrow\infty}&fg=000000$, then the variance $latex {\sigma^{2}(x_{0})}&fg=000000$ goes to 0 as $latex {n\rightarrow\infty}&fg=000000$.
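A minimal numerical check of Proposition 1, using a standard normal density and the Gaussian kernel, for which $latex {p_{\max}=1/\sqrt{2\pi}}&fg=000000$ and $latex {\int K^{2}(u)du=1/(2\sqrt{\pi})}&fg=000000$ (the point $latex {x_{0}}&fg=000000$, the bandwidth and the sample size are arbitrary choices):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(4)
    n, h, x0 = 200, 0.3, 0.0
    p_max = norm.pdf(0)                          # sup of the standard normal density
    C1 = p_max / (2 * np.sqrt(np.pi))            # C1 = p_max * int K^2(u) du

    est = np.array([np.mean(norm.pdf((rng.normal(size=n) - x0) / h)) / h
                    for _ in range(5000)])
    print(est.var(), C1 / (n * h))               # empirical variance sits below the bound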

2.2. Bias of the estimator $latex {\hat{p}_{n} }&fg=000000$

The bias has the form

$latex \displaystyle b(x_{0})=\mathop{\mathbb E}_p{\hat{p}_{n} (x_{0})}-p(x_{0})=\frac{1}{h}\int K\left(\frac{z-x_{0}}{h}\right)p(z)dz-p(x_{0}). &fg=000000$

To analyze the behavior of $latex {b(x_{0})}&fg=000000$ as a function of $latex {h}&fg=000000$, we need some regularity conditions on the density $latex {p}&fg=000000$ and the kernel $latex {K}&fg=000000$. Denote by $latex {\lfloor\beta\rfloor}&fg=000000$ the integer part of the real number $latex {\beta}&fg=000000$.

Definition 2 Let $latex {T}&fg=000000$ be an interval in $latex {{\mathbb R}}&fg=000000$ and let $latex {\beta}&fg=000000$ and $latex {L}&fg=000000$ be two positive numbers. The Hölder class $latex {\Sigma(\beta,L)}&fg=000000$ on $latex {T}&fg=000000$ is the set of $latex {l=\lfloor\beta\rfloor}&fg=000000$ times differentiable functions $latex {f:T\rightarrow{\mathbb R}}&fg=000000$ whose derivative $latex {f^{(l)}}&fg=000000$ satisfies

$latex \displaystyle |f^{(l)}(x)-f^{(l)}(x^{\prime})|\leq L|x-x^{\prime}|^{\beta-l},\quad\forall\ x,x^{\prime}\in T &fg=000000$

Definition 3 Let $latex {l\geq1}&fg=000000$ be an integer. We say that $latex {K:{\mathbb R}\rightarrow{\mathbb R}}&fg=000000$ is a kernel of order $latex {l}&fg=000000$ if the functions $latex {u\mapsto u^{j}K(u)}&fg=000000$, $latex {j=1,\ldots,l}&fg=000000$, are integrable and satisfy

$latex \displaystyle \int K(u)du=1,\qquad\int u^{j}K(u)du=0,\quad j=1,\ldots,l. &fg=000000$
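For example, the Epanechnikov kernel listed above is a kernel of order 1: its first moment vanishes by symmetry while its second moment does not. A quick numerical check of these moment conditions (a sketch using a simple Riemann sum):

    import numpy as np

    u = np.linspace(-1, 1, 200_001)
    K = 0.75 * (1 - u**2)                        # Epanechnikov kernel on [-1, 1]
    moment = lambda j: np.sum(u**j * K) * (u[1] - u[0])

    print(moment(0))   # ~ 1    (K integrates to one)
    print(moment(1))   # ~ 0    (first moment vanishes, so the order is at least 1)
    print(moment(2))   # ~ 0.2  (second moment is nonzero, so the order is exactly 1)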

Often in the literature one finds an alternative definition of the order of a kernel: a kernel is of order $latex {l+1}&fg=000000$ if Definition 3 holds and, in addition, $latex {\int u^{l+1}K(u)du\neq0}&fg=000000$, which is more restrictive. Define the class of densities $latex {\mathop{\mathbb P}=\mathop{\mathbb P}(\beta,L)}&fg=000000$ as follows:

$latex \displaystyle \mathop{\mathbb P}(\beta,L)=\left\{ p\left|p\geq0,\int p(x)dx=1,\text{ and }p\in\Sigma(\beta,L)\text{ on }{\mathbb R}\right.\right\} . &fg=000000$

Proposition 2 Assume that $latex {p\in\mathop{\mathbb P}(\beta,L)}&fg=000000$ and let $latex {K}&fg=000000$ be a kernel of order $latex {l=\lfloor\beta\rfloor}&fg=000000$ satisfying

$latex \displaystyle \int| u|^{\beta}|K(u)|du<\infty. &fg=000000$

Then for all $latex {x_{0}\in{\mathbb R}}&fg=000000$, $latex {h>0}&fg=000000$ and $latex {n\geq1}&fg=000000$ we have

$latex \displaystyle |b(x_{0})|\leq C_{2}h^{\beta} &fg=000000$

where

$latex \displaystyle C_{2}=\frac{L}{l!}\int| u|^{\beta}|K(u)|du. &fg=000000$

Proof: We have

$latex \displaystyle \begin{array}{rl} b(x_{0}) & =\displaystyle\frac{1}{h}\int K\left(\frac{z-x_{0}}{h}\right)p(z)dz-p(x_{0})\\ & =\displaystyle\int K(u)\left[p(x_{0}+uh)-p(x_{0})\right]du. \end{array} &fg=000000$

Next, since $latex {K}&fg=000000$ is a kernel of order $latex {l=\lfloor\beta\rfloor}&fg=000000$, a Taylor expansion of $latex {p(x_{0}+uh)}&fg=000000$ around $latex {x_{0}}&fg=000000$ (with some $latex {0\leq\tau\leq1}&fg=000000$) gives

$latex \displaystyle \begin{array}{rl} b(x_{0}) & =\displaystyle\int K(u)\left[p(x_{0}+uh)-p(x_{0})\right]du\\ & = \displaystyle\int K(u)\left[p(x_{0})+p^{\prime}(x_{0})uh+\cdots+\frac{\left(uh\right)^{l}}{l!}p^{ (l)}(x_{0}+\tau uh)-p(x_{0})\right]du\\ & =\displaystyle\int K(u)\frac{\left(uh\right)^{l}}{l!}\left[p^{(l)}(x_{0}+\tau uh)-p^{(l)}(x_{0})\right]du \end{array} &fg=000000$

where the last equality uses the moment conditions $latex {\int u^{j}K(u)du=0}&fg=000000$, $latex {j=1,\ldots,l}&fg=000000$. Hence

$latex \displaystyle \begin{array}{rl} |{b(x_{0})}| & \displaystyle\leq\int|K(u)|\frac{|uh|^{l}}{l!}|p^{(l)}(x_{0}+\tau uh)-p^{(l)}(x_{0})|du\\ & \displaystyle\leq L\int|{K(u)}|\frac{|{uh}|^{l}}{l!}|{\tau uh}|^{\beta-l}du\leq C_{2}h^{\beta}. \end{array} &fg=000000$

$latex \Box&fg=000000$

2.3. Upper bound on the mean squared risk

We see that the bias and the variance behave in opposite directions. If we choose a small $latex {h}&fg=000000$, we get a large variance and, as a consequence, an undersmoothed estimator. On the other hand, if $latex {h}&fg=000000$ is large, the bias cannot be controlled, which leads to an oversmoothed estimator.

If the assumptions of Propositions 1 and 2 hold, we get

$latex \displaystyle \mathrm{MSE}\leq\frac{C_{1}}{nh}+C_{2}^{2}h^{2\beta}. &fg=000000$

The minimum with respect to $latex {h}&fg=000000$ is attained at

$latex \displaystyle h_{n}^{*}=\left(\frac{C_{1}}{2\beta C_{2}^{2}}\right)^{\frac{1}{2\beta+1}}n^{-\frac{1}{2\beta+1}} &fg=000000$

and for $latex {h=h_{n}^{*}}&fg=000000$

$latex \displaystyle \mathrm{MSE}(x_{0})=O\left(n^{-\frac{2\beta}{2\beta+1}}\right),\quad n\rightarrow\infty, &fg=000000$

uniformly in $latex {x_{0}}&fg=000000$. Moreover, for any $latex {\alpha>0}&fg=000000$ and $latex {h=\alpha n^{-\frac{1}{2\beta+1}}}&fg=000000$, these observations yield

$latex \displaystyle \sup_{x_{0}\in{\mathbb R}}\sup_{p\in\mathop{\mathbb P}(\beta,L)}\mathop{\mathbb E}_p{\left(\hat{p}_{n} (x_{0})-p(x_{0})\right)^{2}}\leq Cn^{-\frac{2\beta}{2\beta+1}}, &fg=000000$

where $latex {C>0}&fg=000000$ is a constant depending only on $latex {\beta}&fg=000000$, $latex {L}&fg=000000$, $latex {\alpha}&fg=000000$ and the kernel $latex {K}&fg=000000$.
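To see the bias-variance trade-off and the order of the optimal bandwidth numerically, here is a rough Monte Carlo sketch (standard normal density, Gaussian kernel, $latex {x_{0}=0.5}&fg=000000$; with the Gaussian kernel the bias is of order $latex {h^{2}}&fg=000000$, which corresponds to $latex {\beta=2}&fg=000000$, so the minimizing bandwidth should shrink roughly like $latex {n^{-1/5}}&fg=000000$; all of these choices are illustrative):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(5)
    x0, reps = 0.5, 1000
    hs = np.geomspace(0.05, 1.5, 20)

    def mse(n, h):
        X = rng.normal(size=(reps, n))                 # reps independent samples of size n
        est = norm.pdf((X - x0) / h).mean(axis=1) / h  # Gaussian-kernel estimator at x0
        return np.mean((est - norm.pdf(x0)) ** 2)

    for n in (100, 400, 1600):
        errs = [mse(n, h) for h in hs]
        print(n, round(float(hs[int(np.argmin(errs))]), 3))  # minimizer shrinks roughly like n^(-1/5)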

Next time, we will formalize the MISE (Mean Integrated Squared Error) and study its properties.

If you have any comments/suggestions/improvements please let me know below.


Source: Tsybakov, A. (2009). Introduction to nonparametric estimation. Springer.

 
