Density Estimation by Histograms (Part II)

Squared bias (thin solid line), variance (dashed line) and MSE (thick line) for the histogram

We continue our discussion of the histogram estimator and its statistical properties. Today we begin the theory needed to reduce the mean squared error.

In order to study the statistical properties of $latex {\hat{f}_{h}(x)}&fg=000000$, we start by introducing the concept of mean squared error (MSE), or quadratic risk. We define

$latex \displaystyle \mathrm{MSE}(\hat{f}_{h}(x))=\mathop{\mathbb E}(\hat{f}_{h}(x)-f(x))^{2}. &fg=000000$

We can rewrite the MSE as:

$latex \displaystyle \mathrm{MSE}(\hat{f}_{h}(x))=\mathrm{Var}(\hat{f}_{h}(x))+\mathrm{Bias}^{2}(\hat{f}_{h}(x)), \ \ \ \ \ (1)&fg=000000$


where $latex {\mathrm{Var}(\hat{f}_{h}(x))=\mathop{\mathbb E}(\hat{f}_{h}(x)-\mathop{\mathbb E}\hat{f}_{h}(x))^{2}}&fg=000000$ and $latex {\mathrm{Bias}(\hat{f}_{h}(x))=\mathop{\mathbb E}(\hat{f}_{h}(x))-f(x).}&fg=000000$
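The decomposition of the MSE into variance plus squared bias can be checked numerically. The following is a minimal sketch, not part of the original derivation: it assumes the data are standard normal, that the bins are $latex {B_{j}=[(j-1)h,jh)}&fg=000000$ with origin at zero, and the function names `hist_estimate` and `mc_decomposition` are hypothetical helpers introduced only for illustration.

```python
import random
from math import exp, floor, pi, sqrt


def hist_estimate(sample, x, h):
    """Histogram estimate at x, assuming bins B_j = [(j-1)h, jh)."""
    j = floor(x / h) + 1                      # index of the bin containing x
    lo, hi = (j - 1) * h, j * h
    count = sum(1 for xi in sample if lo <= xi < hi)
    return count / (len(sample) * h)


def mc_decomposition(x=0.3, h=0.5, n=200, reps=500, seed=0):
    """Monte Carlo estimates of MSE, variance and bias at the point x.

    Assumption for illustration: X_i ~ N(0, 1), so the true density
    f(x) is known in closed form.
    """
    rng = random.Random(seed)
    f_x = exp(-x * x / 2) / sqrt(2 * pi)      # true density at x
    estimates = [
        hist_estimate([rng.gauss(0, 1) for _ in range(n)], x, h)
        for _ in range(reps)
    ]
    mean = sum(estimates) / reps
    var = sum((e - mean) ** 2 for e in estimates) / reps
    bias = mean - f_x
    mse = sum((e - f_x) ** 2 for e in estimates) / reps
    return mse, var, bias
```

Calling `mse, var, bias = mc_decomposition()` and comparing `mse` with `var + bias ** 2` shows the two agree up to floating-point error, because the identity also holds exactly for the empirical moments.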

The variance and the bias move in opposite directions as $latex {h}&fg=000000$ changes. That is, the bias tends to shrink as $latex {h}&fg=000000$ becomes small while the variance grows, and vice versa.

The main goal is to find the right tradeoff between the bias and the variance. The natural question is: how large should $latex {h}&fg=000000$ be in order to strike that balance? We will give a partial answer in this post.


Let $latex {X_{i}\ (i=1,\ldots,n)}&fg=000000$ be i.i.d. random variables with density $latex {f}&fg=000000$. For a fixed $latex {x\in B_{j}}&fg=000000$ we have the following,

$latex \displaystyle \begin{array}{rcl} \mathop{\mathbb E}(\hat{f}_{h}(x)) & = & {\displaystyle \frac{1}{nh}\sum_{i=1}^{n}\mathop{\mathbb E}(I(X_{i}\in B_{j}))}\\ & = & {\displaystyle \frac{1}{h}\mathop{\mathbb E}(I(X_{1}\in B_{j}))}. \end{array} &fg=000000$

Note that $latex {I(X_{1}\in B_{j})}&fg=000000$ is a Bernoulli random variable with parameter $latex {p=\int_{(j-1)h}^{jh}f(u)du}&fg=000000$, that is,

$latex \displaystyle I(X_{1}\in B_{j})=\begin{cases} 1 & \text{with probability }p,\\ 0 & \text{with probability }1-p. \end{cases} &fg=000000$

Therefore $latex {\mathop{\mathbb E}(I(X_{1}\in B_{j}))=\int_{(j-1)h}^{jh}f(u)du}&fg=000000$. It follows that $latex {\hat{f}_{h}(x)}&fg=000000$ is in general not an unbiased estimator, and its bias is

$latex \displaystyle \begin{array}{rcl} \mathrm{Bias} & = & \frac{1}{h}\int_{(j-1)h}^{jh}f(u)du-f(x)\\ & = & \frac{1}{h}\int_{(j-1)h}^{jh}\left(f(u)-f(x)\right)du. \end{array} &fg=000000$

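Now we take a first-order Taylor expansion of $latex {f(u)-f(x)}&fg=000000$ around the point $latex {x}&fg=000000$, namely $latex {f(u)-f(x)=f^{\prime}(x)(u-x)+o(h)}&fg=000000$ for $latex {u\in B_{j}}&fg=000000$. Integrating over the bin yields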

$latex \displaystyle \mathrm{Bias}=f^{\prime}(x)\left(h\left(j-\frac{1}{2}\right)-x\right)+o(h). \ \ \ \ \ (2)&fg=000000$


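Note that the bias tends to increase in absolute value as $latex {h}&fg=000000$ grows, since for $latex {x\in B_{j}}&fg=000000$ the factor $latex {h\left(j-\frac{1}{2}\right)-x}&fg=000000$ can be as large as $latex {h/2}&fg=000000$ in absolute value.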
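To get a feel for the quality of approximation (2), the exact bias can be compared with its first-order term. This is only a sketch under stated assumptions: the density is taken to be standard normal (so that the bin probability is available through the error function), the bins are $latex {B_{j}=[(j-1)h,jh)}&fg=000000$, and `exact_bias` and `first_order_bias` are hypothetical names.

```python
from math import erf, exp, floor, pi, sqrt


def normal_cdf(t):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))


def normal_pdf(t):
    """Standard normal density."""
    return exp(-t * t / 2.0) / sqrt(2.0 * pi)


def exact_bias(x, h):
    """(1/h) * integral of f over the bin containing x, minus f(x)."""
    j = floor(x / h) + 1
    p = normal_cdf(j * h) - normal_cdf((j - 1) * h)
    return p / h - normal_pdf(x)


def first_order_bias(x, h):
    """Leading term of (2): f'(x) * (h * (j - 1/2) - x)."""
    j = floor(x / h) + 1
    f_prime = -x * normal_pdf(x)              # derivative of the N(0,1) density
    return f_prime * (h * (j - 0.5) - x)


# Compare both quantities at x = 0.3 for a few shrinking bin widths.
for h in (0.25, 0.0625, 0.015625):
    print(h, exact_bias(0.3, h), first_order_bias(0.3, h))
```

In this example the gap between the two columns shrinks faster than $latex {h}&fg=000000$, which is what the $latex {o(h)}&fg=000000$ remainder in (2) expresses.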


Let us calculate the variance of the histogram:

$latex \displaystyle \mathrm{Var}(\hat{f}_{h}(x))=\mathrm{Var}\left({\displaystyle \frac{1}{nh}\sum_{i=1}^{n}I(X_{i}\in B_{j})}\right). &fg=000000$

Since the $latex {X_{i}}&fg=000000$ are i.i.d., we can write

$latex \displaystyle \begin{array}{rcl} \mathrm{Var}(\hat{f}_{h}(x)) & = & {\displaystyle \frac{1}{n^{2}h^{2}}{\displaystyle \sum_{i=1}^{n}\mathrm{Var}(I(X_{i}\in B_{j}))}}\\ & = & {\displaystyle \frac{1}{nh^{2}}{\displaystyle \mathrm{Var}(I(X_{1}\in B_{j})).}} \end{array} &fg=000000$

As we saw before, $latex {I(X_{1}\in B_{j})}&fg=000000$ is a Bernoulli random variable with parameter $latex {p=\int_{(j-1)h}^{jh}f(u)du}&fg=000000$, so its variance is $latex {p(1-p)}&fg=000000$. Then we get that

$latex \displaystyle \mathrm{Var}(\hat{f}_{h}(x))=\frac{1}{nh^{2}}\int_{(j-1)h}^{jh}f(u)du\left(1-\int_{(j-1)h}^{jh}f(u)du\right). &fg=000000$

It is easy to check that

$latex \displaystyle \mathrm{Var}(\hat{f}_{h}(x))=\frac{1}{nh}f(x)+o\left(\frac{1}{nh}\right). \ \ \ \ \ (3)&fg=000000$


In this case, the variance tends to decrease as $latex {h}&fg=000000$ grows.
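The step from the exact variance to (3) can be sketched as follows, assuming that $latex {f}&fg=000000$ is continuous at $latex {x}&fg=000000$. Since the bin $latex {B_{j}}&fg=000000$ has length $latex {h}&fg=000000$ and contains $latex {x}&fg=000000$,

$latex \displaystyle \int_{(j-1)h}^{jh}f(u)du=hf(x)+o(h), &fg=000000$

so that

$latex \displaystyle \frac{1}{nh^{2}}\left(hf(x)+o(h)\right)\left(1-hf(x)+o(h)\right)=\frac{1}{nh}f(x)+o\left(\frac{1}{nh}\right). &fg=000000$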

Since bias and variance vary in opposite directions, we have to find a value of $latex {h}&fg=000000$ that yields the optimal compromise for the MSE reduction.

Mean Squared Error

Plugging (2) and (3) into (1) we get,

$latex \displaystyle \mathrm{MSE}(\hat{f}_{h}(x))=f^{\prime}(x)^{2}\left(h\left(j-\frac{1}{2}\right)-x\right)^{2}+\frac{1}{nh}f(x)+o(h)+o\left(\frac{1}{nh}\right), \ \ \ \ \ (4)&fg=000000$


where $latex {o(h)}&fg=000000$ and $latex {o\left(\frac{1}{nh}\right)}&fg=000000$ are terms of order lower than $latex {h}&fg=000000$ and $latex {\frac{1}{nh}}&fg=000000$ respectively.

From (4), if we let $latex {h\rightarrow0}&fg=000000$ and $latex {nh\rightarrow\infty}&fg=000000$, we can conclude that the MSE converges to zero. The figure at the top of this post shows the typical behavior of the bias-variance tradeoff.
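The tradeoff can also be tabulated directly. The sketch below evaluates the exact squared bias and the exact variance at one point over a grid of bin widths. It is an illustration under assumptions not made in the post: standard normal data, bins $latex {B_{j}=[(j-1)h,jh)}&fg=000000$, the point $latex {x=0.3}&fg=000000$, sample size $latex {n=1000}&fg=000000$, and bin widths that are powers of two. The names `exact_mse_terms` and `mse_curve` are hypothetical.

```python
from math import erf, exp, floor, pi, sqrt


def normal_cdf(t):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))


def normal_pdf(t):
    """Standard normal density."""
    return exp(-t * t / 2.0) / sqrt(2.0 * pi)


def exact_mse_terms(x, h, n):
    """Exact squared bias and variance of the histogram at x."""
    j = floor(x / h) + 1                      # bin containing x
    p = normal_cdf(j * h) - normal_cdf((j - 1) * h)
    bias = p / h - normal_pdf(x)
    var = p * (1.0 - p) / (n * h * h)
    return bias ** 2, var


def mse_curve(x=0.3, n=1000, exponents=range(-7, 3)):
    """Rows (h, squared bias, variance, MSE) for h = 2**k."""
    rows = []
    for k in exponents:
        h = 2.0 ** k
        bias_sq, var = exact_mse_terms(x, h, n)
        rows.append((h, bias_sq, var, bias_sq + var))
    return rows


for h, bias_sq, var, mse in mse_curve():
    print(f"h={h:<10} bias^2={bias_sq:.2e} var={var:.2e} mse={mse:.2e}")
```

For the smallest bin widths the variance dominates, for the largest ones the squared bias dominates, and the smallest MSE on this grid is attained in between. The pointwise curve is not smooth, because the position of $latex {x}&fg=000000$ inside its bin changes with $latex {h}&fg=000000$.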

Question for you: is $latex {\hat{f}_{h}(x)}&fg=000000$ a consistent estimator?

In the next post we will give an explicit $latex {h}&fg=000000$ that minimizes the MSE.


