Optimizing the binwidth for the histogram using cross validation


Introduction

Cross-validation is a common technique to calibrate the binwidth of a histogram. Histograms are the simplest and, sometimes, the most effective tool to describe the density of a dataset.

As usual, suppose you have an independent and identically distributed random sample $X_1, X_2, \ldots, X_n$ from some unknown continuous distribution with density $f$. Recall that in a past post we explained the construction of the histogram and found the following histogram estimator for $f(x)$,

\begin{equation*}
\displaystyle
\hat{f}_h(x)=\frac{1}{nh}\sum_{i=1}^n \sum_j \I(X_i\in B_j) \I(x\in B_j),
\end{equation*}

where the $B_j$ form a uniform partition of the real line into intervals of width $h$.
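
To make the estimator concrete, here is a minimal numerical sketch that evaluates $\hat{f}_h$ on a simulated sample. The language (Python with NumPy) and the choice of anchoring the bins at the origin, $B_j = [jh, (j+1)h)$, are illustration choices of mine, not part of the original construction.

import numpy as np

def hist_estimator(x, data, h):
    """Evaluate the histogram estimator f_hat_h at the points x, with bins B_j = [j*h, (j+1)*h)."""
    n = len(data)
    data_bins = np.floor(np.asarray(data) / h).astype(int)  # bin index of each X_i
    x_bins = np.floor(np.atleast_1d(x) / h).astype(int)     # bin index of each query point
    # f_hat_h(x) = (number of observations in the bin containing x) / (n*h)
    return np.array([np.sum(data_bins == b) for b in x_bins]) / (n * h)

rng = np.random.default_rng(0)
X = rng.normal(size=200)                                    # i.i.d. sample from an "unknown" density
print(hist_estimator([-1.0, 0.0, 1.0], X, h=0.5))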

Then, continuing with the histogram series (II and III), we explored its asymptotic properties. Finally, we illustrated the theory with some simulated data.

There, we adjusted the binwidth by minimizing the mean integrated squared error (MISE), obtaining

\begin{equation*}
\displaystyle
h_{opt}= \left(\frac{6}{n\|f^\prime \|_2^2}\right)^{1/3}\approx n^{-1/3}
\end{equation*}

Notice that the binwidth depends on the unknown quantity $\|f^\prime \|_2$, so the main problem remains unsolved and we cannot use this formula in practice. Yeah, I know! I cheated before by minimizing the MISE and forgetting the influence of the constant. In fact, everything worked well because my example was normally distributed with a relatively large sample. Of course, those conditions are rare in statistics and we need to improve the choice of $h$.
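
To see what that shortcut amounts to, one can plug a normal reference density into the constant: for a normal density with standard deviation $\sigma$, $\|f^\prime \|_2^2 = 1/(4\sqrt{\pi}\sigma^3)$, so the rule becomes $h_{opt} \approx 3.49\,\sigma\,n^{-1/3}$, essentially Scott's rule of thumb. A quick Python/NumPy sketch of that plug-in (my own illustration, not the data-driven method developed below):

import numpy as np

def normal_reference_binwidth(data):
    """Plug the normal reference ||f'||_2^2 = 1/(4*sqrt(pi)*sigma^3) into
    h_opt = (6 / (n * ||f'||_2^2))^(1/3), i.e. h_opt = (24*sqrt(pi))^(1/3) * sigma * n^(-1/3)."""
    sigma = np.std(data, ddof=1)               # sigma estimated from the sample
    n = len(data)
    return (24.0 * np.sqrt(np.pi)) ** (1.0 / 3.0) * sigma * n ** (-1.0 / 3.0)

rng = np.random.default_rng(1)
print(normal_reference_binwidth(rng.normal(size=500)))   # about 3.49 * n^(-1/3) here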

Consequently, we will find a fully data-driven estimator for $h$ using a technique called cross-validation.

Derivation of the cross-validation formula

Define the integrated squared error as follows,

\begin{align*}
ISE(h) & = \displaystyle \int \left(\hat{f}_h(x) - f(x)\right)^2\, dx \\
& = \displaystyle \int \hat{f}_h^2(x)\, dx - 2 \int \hat{f}_h(x)
f(x)\, dx + \int f(x)^2\, dx
\end{align*}

Notice that minimizing the integrated squared error is equivalent to minimizing the expression
\begin{equation*}
\displaystyle
J(h) = \int \hat{f}_h^2(x)\, dx - 2 \int \hat{f}_h(x) f(x)\, dx
\end{equation*}

We aim to find an estimator $\hat{J}(h)$ of $J(h)$ such that $\E[\hat{J}(h)] = \E[J(h)]$.

Remark that we can estimate the first term using only the available sample. However, the second one depends on the unknown function $f$. Thus, the first thought that comes to mind to approximate $\int \hat{f}_h(x) f(x)\, dx$ is to use the empirical estimator

\begin{equation*}
\displaystyle
\frac{1}{n} \sum_{i=1}^n \hat{f}_h(X_i).
\end{equation*}

The problem here is that we are using the data twice: once to estimate $\hat{f}_h$ and once again to compute the empirical sum. To remedy this situation, define

\begin{equation*}
\displaystyle
\hat{f}_{h, -i}(x) = \frac{1}{(n-1)h} \sum_k \sum_{\substack{j=1 \\j \neq
i}}^n \I(X_j\in B_k) \I(x\in B_k).
\end{equation*}

Here, $\hat{f}_{h, -i}(x)$ is the leave-one-out estimator. You guessed right! We have removed the $i^{\text{th}}$ observation in each evaluation to ensure the independence between $\hat{f}_{h, -i}(\cdot)$ and $X_i$. In fact, one can prove that $\E[\hat{J}(h)] = \E[J(h)]$ (see, e.g., [1]).

The general criterion to find $h$ by cross-validation is,

\begin{equation}
\label{eq:hat_J_h}
\displaystyle
\hat{J}(h) = \int \hat{f}_h^2(x)\, dx
- \frac{2}{n} \sum_{i=1}^n \hat{f}_{h, -i}(X_i).
\end{equation}
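
Before simplifying it, note that \eqref{eq:hat_J_h} can already be evaluated by brute force: approximate the first integral numerically and compute the leave-one-out sum term by term. The sketch below does exactly that (Python/NumPy; the origin-anchored bins and the grid step are my own choices):

import numpy as np

def J_hat_direct(h, data):
    """Evaluate J_hat(h) straight from its definition: a numerical integral
    of f_hat_h^2 plus the explicit leave-one-out sum."""
    n = len(data)
    data_bins = np.floor(data / h).astype(int)
    # First term: Riemann sum of f_hat_h(x)^2 over a fine grid covering the data.
    step = h / 200.0
    grid = np.arange(data.min() - h, data.max() + h, step)
    grid_bins = np.floor(grid / h).astype(int)
    f_vals = np.array([np.sum(data_bins == b) for b in grid_bins]) / (n * h)
    first = np.sum(f_vals ** 2) * step
    # Second term: f_hat_{h,-i}(X_i) counts the other observations sharing X_i's bin.
    loo = np.array([np.sum(data_bins == data_bins[i]) - 1 for i in range(n)]) / ((n - 1) * h)
    return first - 2.0 / n * np.sum(loo)

rng = np.random.default_rng(2)
X = rng.normal(size=300)
print(J_hat_direct(0.5, X))   # smaller values indicate a better binwidth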

Particular case: The histogram

The last equation looks unwieldy, and trying to minimize it in this form seems futile. Given that we are working, for now, with the histogram case, we can simplify it even further and find something easier to compute.

First, denote by $N_k$ the number of observations belonging to the $k^{\text{th}}$ interval $B_k$. The random variable $N_k$ has the form
\begin{equation*}
N_k = \sum_{i=1}^n \I(X_i\in B_k).
\end{equation*}

Let us start with the first term of equation \eqref{eq:hat_J_h}.
\begin{align*}
\int \hat{f}_h^2(x)\, dx & = \displaystyle \frac{1}{n^2h^2} \int
\left(\sum_k \sum_{i=1}^n \I(x \in B_k) \I(X_i \in B_k)\right)^2\,
dx \\
& = \displaystyle \frac{1}{n^2h^2} \int \left(\sum_k \I(x \in B_k)
N_k \right)^2\, dx \\
& = \displaystyle \frac{1}{n^2h^2} \int \sum_k \I(x \in B_k) N_k^2 +
2\sum_{k < l} \I(x \in B_k) \I(x \in B_l) N_k N_l \, dx.
\end{align*}

The second sum inside the integral is zero (Why?).

Given that each $B_k$ has size $h$ we get,

\begin{align}
\int \hat{f}_h^2(x)\, dx
& = \frac{1}{n^2h^2} \sum_k N_k^2 \int \I(x \in B_k) \, dx \nonumber\\
& = \frac{1}{n^2h} \sum_k N_k^2. \label{eq:first_term}
\end{align}

On the other hand, we have to handle the second term of equation \eqref{eq:hat_J_h}. We write out the full expression of this term and rearrange the summation operators,
\begin{equation*}
\sum_{i=1}^n \hat{f}_{h, -i}(X_i) = \frac{1}{(n-1)h} \sum_k \sum_{i=1}^n
\I(X_i\in B_k) \sum_{\substack{j=1 \\j \neq i}}^n
\I(X_j\in B_k).
\end{equation*}

Remark that the expression $\sum_{j=1, \, j \neq i}^n \I(X_j\in B_k)$ is equal to $N_k - \I(X_i\in B_k)$. We can simplify the second term into,
\begin{align}
\sum_{i=1}^n \hat{f}_{h, -i}(X_i) & = \displaystyle \frac{1}{(n-1)h}
\sum_k \sum_{i=1}^n \I(X_i\in B_k) \left(N_k - \I(X_i\in B_k)
\right) \nonumber \\
& = \displaystyle \frac{1}{(n-1)h} \sum_k \left(N_k^2 -
N_k\right). \label{eq:second_term}
\end{align}

Gathering the expressions \eqref{eq:first_term} and \eqref{eq:second_term} we obtain,
\begin{equation*}
\hat{J}(h) = \frac{2}{(n-1)h} - \frac{n+1}{n^2(n-1)h} \sum_k N_k^2.
\end{equation*}

We have achieved our main goal: a statistic (a formula that depends only on the data) which we can minimize numerically to find the optimal binwidth $h$.
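
As an illustration, here is a minimal sketch that implements this statistic and minimizes it over a grid of candidate binwidths; the Python/NumPy implementation, the origin-anchored bins and the search grid for $h$ are my own choices.

import numpy as np

def J_hat(h, data):
    """Closed-form cross-validation criterion for the histogram binwidth."""
    n = len(data)
    counts = np.unique(np.floor(data / h).astype(int), return_counts=True)[1].astype(float)
    return 2.0 / ((n - 1) * h) - (n + 1) / (n ** 2 * (n - 1) * h) * np.sum(counts ** 2)

def cv_binwidth(data, h_grid=None):
    """Return the binwidth minimizing J_hat over a grid of candidate values."""
    if h_grid is None:
        # Arbitrary search range: 5% to 200% of the sample standard deviation.
        h_grid = np.linspace(0.05, 2.0, 200) * np.std(data, ddof=1)
    scores = [J_hat(h, data) for h in h_grid]
    return h_grid[int(np.argmin(scores))]

rng = np.random.default_rng(5)
X = rng.normal(size=500)
print(cv_binwidth(X))   # fully data-driven binwidth, roughly of order n^(-1/3)

Up to the error of the numerical integration, J_hat returns the same values as the brute-force J_hat_direct sketched earlier, which is a convenient check of the algebra.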

References
  1. A. B. Tsybakov, Introduction to Nonparametric Estimation, Springer Series in Statistics, Springer, 2009.
