Paper review: Zhu & Fang, 1996. Asymptotics for kernel estimate of sliced inverse regression.

[Photo: my guinea pig Pascal, very interested in the article.]

It is already known that, for $latex { Y\in {\mathbb R} }&fg=000000$ and $latex { \mathbf{X} \in {\mathbb R}^{p} }&fg=000000$, in the regression problem

$latex \displaystyle Y = f(\mathbf{X}) + \varepsilon, &fg=000000$

the curse of dimensionality arises when $latex { p }&fg=000000$ is large relative to the amount of data available. Richard E. Bellman (see [1]) coined this terminology while considering problems in dynamic optimization. He found that as the dimension of $latex { \mathbf{X} }&fg=000000$ increases, so does the sparsity of the data.

The Sliced Inverse Regression (SIR) method, which we have discussed before, is a technique to reduce the dimensionality of this kind of problem. The method focuses on estimating the eigenvectors associated with the largest eigenvalues of

$latex \displaystyle \Lambda = \mathop{\mathrm{Cov}}(\mathbb E[\mathbf{X}\vert Y]). &fg=000000$

This post intends to be a short review of the method presented in [6].
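
As a quick sanity check of the idea (my own toy simulation, not taken from the paper), the leading eigenvector of even a crude sliced estimate of $latex { \Lambda }&fg=000000$ recovers the index direction in a single-index model:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 5
beta = np.zeros(p); beta[0] = 1.0            # true index direction (toy assumption)

X = rng.standard_normal((n, p))              # E[X] = 0, Cov(X) = I
Y = (X @ beta) ** 3 + 0.1 * rng.standard_normal(n)

# Crude sliced estimate of Lambda = Cov(E[X|Y]): slice on Y, average X within slices
H = 10
slices = np.array_split(np.argsort(Y), H)
means = np.array([X[idx].mean(axis=0) for idx in slices])
weights = np.array([len(idx) / n for idx in slices])
Lambda_slice = (means.T * weights) @ means   # weighted sum of outer products of slice means

eigval, eigvec = np.linalg.eigh(Lambda_slice)
b_hat = eigvec[:, -1]                        # eigenvector of the largest eigenvalue
print(abs(b_hat @ beta))                     # close to 1: the direction is recovered
```

The paper of Zhu and Fang replaces this slicing step by kernel smoothing, which is what the rest of the post describes.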

In this paper, the authors use a kernel method to estimate each coefficient of this matrix. First, a Nadaraya-Watson kernel estimator is used to compute $latex { \mathbb E[\mathbf{X}\vert Y] }&fg=000000$ for fixed $latex { Y }&fg=000000$. Then, with this quantity, $latex { \mathop{\mathrm{Cov}}(\mathbb E [\mathbf{X}\vert Y]) }&fg=000000$ is estimated empirically (I will give some details below).

Theorems establishing the asymptotic normality of this estimator are proved. Moreover, using perturbation theory for linear operators (see [2]), the authors show that the eigenvalues and eigenvectors are also asymptotically normal.

To explain the method briefly, I shall introduce some notation. If you get lost along the way, please let me know in the comments.

Write $latex {(\mathbf{X},Y)}&fg=000000$ and its independent copies $latex {(\mathbf{X}_{i},Y_i)}&fg=000000$ as

$latex \displaystyle \begin{array}{rl} (\mathbf{X},Y) & =(X_{1},\ldots,X_{p},Y)^{\top}, \\ (\mathbf{X}_{j},Y_j) & =(X_{1j},\ldots,X_{pj},Y_j)^{\top},\quad j=1,\ldots,n. \end{array} &fg=000000$

Notice that if $latex { \mathbb E[\mathbf{X}] =\mathbf{0} }&fg=000000$, then

$latex \displaystyle \Lambda = \mathop{\mathrm{Cov}}(\mathbb E[\mathbf{X}\vert Y]) = \mathbb E[\mathbb E[\mathbf{X}\vert Y ] \mathbb E[\mathbf{X}\vert Y ]^\top ]. &fg=000000$
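
Indeed, by the tower property $latex { \mathbb E[\mathbb E[\mathbf{X}\vert Y]]=\mathbb E[\mathbf{X}]=\mathbf{0} }&fg=000000$, so the centering term of the covariance vanishes:

$latex \displaystyle \mathop{\mathrm{Cov}}(\mathbb E[\mathbf{X}\vert Y]) = \mathbb E[\mathbb E[\mathbf{X}\vert Y ] \mathbb E[\mathbf{X}\vert Y ]^\top ] - \mathbb E[\mathbb E[\mathbf{X}\vert Y ]]\,\mathbb E[\mathbb E[\mathbf{X}\vert Y ]]^\top = \mathbb E[\mathbb E[\mathbf{X}\vert Y ] \mathbb E[\mathbf{X}\vert Y ]^\top ]. &fg=000000$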

We denote

$latex \displaystyle \mathbb E[\mathbf{X}\vert Y ] = \mathbf{R}(Y) = (R_1(Y), \ldots,R_p(Y))^\top. &fg=000000$

Now, the trick is to apply the Nadaraya-Watson estimator to every element of $latex { \mathbf{R}(Y) }&fg=000000$. Let

$latex \displaystyle \hat{R}_i(Y)=\frac{\hat{g}_i(Y)}{\hat{f}(Y)} &fg=000000$

where

$latex \displaystyle \begin{array}{rl} \hat{g}_{i}(Y) & = \displaystyle \frac{1}{nh}\sum_{k=1}^{n}X_{ik}\,K \left(\frac{Y-Y_{k}}{h}\right), \\ \hat{f}(Y) & = \displaystyle \frac{1}{nh}\sum_{k=1}^{n}K\left(\frac{Y-Y_{k}}{h}\right), \end{array} &fg=000000$

and the function $latex {K(u)}&fg=000000$ satisfies the classical assumptions for a kernel (see [4]).
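
Here is a minimal sketch of these two kernel averages in Python (the Gaussian kernel and this particular vectorization are my own choices for illustration, not the paper's):

```python
import numpy as np

def nw_components(y0, X, Y, h):
    """Nadaraya-Watson pieces at a point y0:
    g_hat[i] = (1/(n*h)) * sum_k X[k, i] * K((y0 - Y[k]) / h)
    f_hat    = (1/(n*h)) * sum_k K((y0 - Y[k]) / h)"""
    n = len(Y)
    u = (y0 - Y) / h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)   # Gaussian kernel (my assumption)
    g_hat = (K @ X) / (n * h)                          # one entry per coordinate of X
    f_hat = K.sum() / (n * h)
    return g_hat, f_hat

def R_hat(y0, X, Y, h):
    """Estimated inverse regression curve at y0: R_hat(y0) = g_hat(y0) / f_hat(y0)."""
    g_hat, f_hat = nw_components(y0, X, Y, h)
    return g_hat / f_hat
```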

Defining

$latex \displaystyle \mathbf{\hat{R}}(Y) =(\hat{R}_{1}(Y),\ldots,\hat{R}_{p}(Y))^{\top}, &fg=000000$

it is possible to estimate $latex { \Lambda }&fg=000000$ by

$latex \displaystyle \hat{\Lambda}_n =\frac{1}{n} \sum_{j=1}^{n} \left( \mathbf{\hat{R}}(Y_j) \right) \left( \mathbf{\hat{R}}(Y_j) \right)^\top &fg=000000$

Remark 

Actually, the estimator $latex { \hat{\Lambda}_n }&fg=000000$ used in the paper also takes into account that $latex { f(Y) }&fg=000000$ and $latex { \hat{f}(Y) }&fg=000000$ should not take small values. We do not present all the details here, to keep the ideas simple.
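
Putting the pieces together, here is a rough sketch of $latex { \hat{\Lambda}_n }&fg=000000$ and its eigen-decomposition, reusing `nw_components` from the sketch above. The simple threshold `f_min` below is a crude stand-in for the paper's handling of small $latex { \hat{f}(Y_j) }&fg=000000$, not its actual construction:

```python
import numpy as np

def sir_kernel(X, Y, h, f_min=1e-3):
    """Kernel SIR sketch: estimate Lambda = Cov(E[X|Y]) from data and
    return it together with its eigen-decomposition."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)                  # center so that E[X] = 0 empirically
    Lambda_hat = np.zeros((p, p))
    kept = 0
    for j in range(n):
        g_hat, f_hat = nw_components(Y[j], Xc, Y, h)
        if f_hat < f_min:                     # crude trimming: skip low-density points
            continue
        R = g_hat / f_hat                     # \hat{R}(Y_j)
        Lambda_hat += np.outer(R, R)
        kept += 1
    Lambda_hat /= kept
    # eigenvalues/eigenvectors, ordered from largest to smallest eigenvalue
    eigval, eigvec = np.linalg.eigh(Lambda_hat)
    return Lambda_hat, eigval[::-1], eigvec[:, ::-1]
```

On the toy single-index model simulated earlier, one would expect the first returned eigenvector to be roughly parallel to `beta` for a bandwidth of reasonable order.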

The main theorem is that $latex { \hat{\Lambda}_n }&fg=000000$ is asymptotically normal. More precisely, the authors show that

$latex \displaystyle \sqrt{n} (\hat{\Lambda}_n - \Lambda) \Rightarrow \mathbf{H} &fg=000000$

where

$latex \displaystyle \lambda^\top \mathop{\mathrm{vech}}(\mathbf{H}) \sim N(0,\sigma_\lambda^2 )\quad \forall \lambda\neq 0&fg=000000$

and

$latex \displaystyle \begin{array}{rl} V_{il}(X,Y) &= \displaystyle\frac{1}{2}\left(X_i R_l(Y) + X_l R_i(Y) \right),\quad i,l=1,\ldots,p, \\ \sigma_\lambda^2 & = \lambda^\top \mathop{\mathrm{Cov}} (\mathop{\mathrm{vech}}(V(X,Y))) \lambda,\quad \lambda \in {\mathbb R} ^ {p(p+1)/2}. \end{array} &fg=000000$
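
For instance (my own reading of the notation, just to make it concrete): if $latex {\lambda}&fg=000000$ is the coordinate vector of $latex { \mathop{\mathrm{vech}} }&fg=000000$ that selects the $latex {(1,1)}&fg=000000$ entry, then $latex { V_{11}(X,Y)=X_{1}R_{1}(Y) }&fg=000000$ and the theorem reads

$latex \displaystyle \sqrt{n}\,\left(\hat{\Lambda}_{n,11} - \Lambda_{11}\right) \Rightarrow N\!\left(0, \mathop{\mathrm{Var}}\left(X_{1}R_{1}(Y)\right)\right). &fg=000000$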

I am not going to present the results on the asymptotic normality of the eigenvalues and eigenvectors; I encourage the interested reader to take a look at the paper.

The proof of $latex { \hat{\Lambda}_n }&fg=000000$’s asymptotic normality basically consists in decomposing it into several terms and showing that most of them are negligible with respect to the leading term involving $latex { V(X,Y) }&fg=000000$. To control those terms, the paper establishes two interesting auxiliary results:

$latex \displaystyle \begin{array}{rl} \displaystyle \sup_y \left\vert \hat{f}(y) – f(y) \right\vert & = \displaystyle O\left(h^4 +\frac{\log n}{n^{1/2} h}\right),\\ \displaystyle \sup_y \left\vert \hat{g}_i(y) – g_i(y) \right\vert & = \displaystyle O_p\left(h^4 +\frac{\log n}{n^{1/2} h}\right). \end{array} &fg=000000$

These results allow us to control the uniform convergence of $latex { \hat{f} }&fg=000000$ and $latex { \hat{g}_i }&fg=000000$ at the rate $latex { h^4 +n^{-1/2} h^{-1}\log n }&fg=000000$, the former with a deterministic $latex { O }&fg=000000$ and the latter in probability.
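
A small side calculation (mine, not the paper's): balancing the two terms in this rate suggests a bandwidth of the order

$latex \displaystyle h^{4} \asymp \frac{\log n}{n^{1/2}h} \quad\Longleftrightarrow\quad h \asymp \left(\frac{\log n}{n^{1/2}}\right)^{1/5}, &fg=000000$

which would give a uniform rate of order $latex { (n^{-1/2}\log n)^{4/5} }&fg=000000$ for both estimators.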

You can find all the proofs and the complete discussion in [5] and [3].

References

  1. Bellman, R. (1957). Dynamic programming. Princeton University Press, Princeton, N. J.
  2. Kato, T. (1995). Perturbation theory for linear operators, volume 132. Springer Verlag.
  3. Prakasa Rao, B. (1983). Nonparametric functional estimation. Probability and mathematical statistics. Academic Press, New York.
  4. Tsybakov, A. B. (2008). Introduction to nonparametric estimation. Springer series in statistics. Springer.
  5. Zhu, L.-X. (1993). Convergence rates of empirical processes indexed by classes of functions and their applications. J. Syst. Sci. Math. Sci., 13:33–41.
  6. Zhu, L.-X. and Fang, K.-T. (1996). Asymptotics for kernel estimate of sliced inverse regression. The Annals of Statistics, 24(3):1053–1068.
