Density Estimation by Histograms (Part I)

Histogram of travel time (US Census 2000 data)

We introduce the histogram as a simple nonparametric density estimator. To keep things manageable, I will split this presentation across several posts.

Let $latex {X_1,\ldots,X_n}&fg=000000$ be i.i.d. random variables with pdf $latex {f}&fg=000000$. The histogram is the simplest nonparametric estimator of $latex {f}&fg=000000$. It is built as follows:

  1. Select an origin $latex {x_0}&fg=000000$ and divide the real line into bins of binwidth $latex {h}&fg=000000$:

    $latex \displaystyle B_j = \left[x_0 + (j-1)h, x_0 + jh\right) \quad j\in {\mathbb Z}. &fg=000000$

  2. Let $latex {n_j}&fg=000000$ be the number of observations that fall into $latex {B_j}&fg=000000$.
  3. Let $latex {\hat{f}_j=\frac{n_j}{n}}&fg=000000$ and let $latex {f_j=\int_{B_j} f(u)du}&fg=000000$.
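  4. Finally, plot the histogram by erecting a bar over each bin with height $latex {\hat{f}_j/h}&fg=000000$ and width $latex {h}&fg=000000$, so that the area of each bar equals the relative frequency $latex {\hat{f}_j}&fg=000000$.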

More formally, the histogram is given by

$latex \displaystyle \hat{f}_h(x)=\frac{1}{nh}\sum_{i=1}^n \sum_j I(X_i\in B_j) I(x\in B_j) &fg=000000$

where $latex {I}&fg=000000$ is the indicator function.
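The formula above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name `hist_density`, the toy data, and the choice of NumPy are not part of the original construction. Since each point belongs to exactly one bin, the double sum reduces to counting the observations that share a bin with $latex {x}&fg=000000$.

```python
import numpy as np

def hist_density(x, data, x0=0.0, h=1.0):
    """Histogram density estimate at the points x.

    Bins are B_j = [x0 + (j-1)h, x0 + jh), j an integer.
    (Hypothetical helper, written only to illustrate the formula.)
    """
    x = np.atleast_1d(np.asarray(x, dtype=float))
    data = np.asarray(data, dtype=float)
    # Index j of the bin that contains each evaluation point / observation
    bin_of_x = np.floor((x - x0) / h).astype(int) + 1
    bin_of_data = np.floor((data - x0) / h).astype(int) + 1
    # n_j for the bin containing each x: observations with the same index
    counts = (bin_of_data[None, :] == bin_of_x[:, None]).sum(axis=1)
    return counts / (data.size * h)

# Toy data, made up for the example
data = np.array([0.1, 0.4, 0.6, 1.2, 1.3, 2.7])
estimate = hist_density([0.5, 1.5, 2.5, 3.5], data, x0=0.0, h=1.0)
print(estimate)
```

With these six observations and $latex {h=1}&fg=000000$, the bins starting at 0, 1 and 2 contain three, two and one observations respectively, so the bar areas add up to one.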

If $latex {m_j}&fg=000000$ is the center of $latex {B_j}&fg=000000$, the histogram assigns the same estimate $latex {\hat{f}_h(m_j)}&fg=000000$ to every $latex {x\in \left[ m_j -\frac{h}{2},m_j +\frac{h}{2} \right) }&fg=000000$. This is rather restrictive, but we will see better alternatives later.

Derivation

The probability that an observation of $latex {X}&fg=000000$ falls into the bin $latex {\left[ m_j -\frac{h}{2},m_j +\frac{h}{2} \right) }&fg=000000$ is given by

$latex \displaystyle \mathop{\mathbb P}\left( X\in \left[ m_j -\frac{h}{2},m_j +\frac{h}{2} \right) \right) = \int_{m_j-\frac{h}{2}}^{m_j+\frac{h}{2}} f(u)du, &fg=000000$

which is just the area under $latex {f}&fg=000000$ in the interval $latex {\left[ m_j -\frac{h}{2},m_j +\frac{h}{2} \right)}&fg=000000$. We can approximate this area by the area of a rectangle with height $latex {f(m_j)}&fg=000000$ and width $latex {h}&fg=000000$,

$latex \displaystyle \int_{m_j-\frac{h}{2}}^{m_j+\frac{h}{2}} f(u)du \approx f(m_j)h \ \ \ \ \ (1)&fg=000000$
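To get a feeling for how good approximation (1) is, here is a small numerical check. It assumes, purely for illustration, that $latex {f}&fg=000000$ is the standard normal density and that the bin center is 0.5; neither choice comes from the derivation itself.

```python
import math

def normal_pdf(u):
    # Standard normal density (assumed f for this illustration)
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def normal_cdf(u):
    # Standard normal distribution function via the error function
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

m_j = 0.5  # bin center, chosen arbitrarily
errors = []
for h in (1.0, 0.5, 0.1):
    exact = normal_cdf(m_j + h / 2) - normal_cdf(m_j - h / 2)  # area under f
    approx = normal_pdf(m_j) * h                               # rectangle area
    errors.append(abs(exact - approx))
    print(f"h={h}: exact={exact:.6f}, rectangle={approx:.6f}")
```

In this example the gap between the two areas shrinks quickly as the binwidth decreases, which is why the rectangle is a reasonable stand-in for the integral when $latex {h}&fg=000000$ is small.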

A natural estimate for this probability is the relative frequency of observations in this interval. That means:

$latex \displaystyle \mathop{\mathbb P}\left( X\in \left[ m_j -\frac{h}{2},m_j +\frac{h}{2} \right) \right) \approx \frac{1}{n} \sharp \left\lbrace X_i\in \left[ m_j -\frac{h}{2},m_j +\frac{h}{2} \right) \right\rbrace. \ \ \ \ \ (2)&fg=000000$

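where $latex {\sharp}&fg=000000$ denotes the cardinality of the set. Combining (1) and (2) gives $latex {f(m_j)h}&fg=000000$ approximately equal to the relative frequency, and dividing by $latex {h}&fg=000000$ we get

$latex \displaystyle \hat{f}_h (m_j) = \frac{1}{nh} \sharp \left\lbrace X_i\in \left[ m_j -\frac{h}{2},m_j +\frac{h}{2} \right) \right\rbrace. &fg=000000$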
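The derivation can be tried out on simulated data: count the observations in the interval around $latex {m_j}&fg=000000$, take the relative frequency as in (2), and divide by $latex {h}&fg=000000$ as approximation (1) suggests. The sketch below assumes standard normal draws, a fixed seed, and a bin centered at zero, all of which are my own choices for the example.

```python
import math
import numpy as np

rng = np.random.default_rng(0)           # fixed seed, chosen arbitrarily
sample = rng.standard_normal(100_000)    # hypothetical data: standard normal draws

m_j, h = 0.0, 0.2                        # bin center and binwidth (assumed)
in_bin = (sample >= m_j - h / 2) & (sample < m_j + h / 2)
relative_frequency = in_bin.mean()       # right-hand side of (2)
f_hat = relative_frequency / h           # solve (1) for f(m_j)

f_true = 1.0 / math.sqrt(2.0 * math.pi)  # standard normal pdf at 0
print(f_hat, f_true)
```

With a large sample and a small binwidth, the estimate should land close to the true density at the bin center. How close depends on both the sample size and the binwidth.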

In the next post we will study the statistical properties of this estimator and work through some practical examples.
