
It is well known that for $latex { Y\in {\mathbb R} }&fg=000000$ and $latex { \mathbf{X} \in {\mathbb R}^{p} }&fg=000000$, the regression problem
$latex \displaystyle Y = f(\mathbf{X}) + \varepsilon, &fg=000000$
suffers from the curse of dimensionality when $latex { p }&fg=000000$ is large relative to the available data. Richard E. Bellman (see [1]) coined this terminology while studying problems in dynamic optimization. He found that as the dimension of $latex { \mathbf{X} }&fg=000000$ increases, so does the sparsity of the data.
The Sliced Inverse Regression (SIR) method, which we have discussed before, is a technique to reduce the dimensionality of this kind of problem. The method focuses on estimating the eigenvectors associated with the largest eigenvalues of
$latex \displaystyle \Lambda = \mathop{\mathrm{Cov}}(\mathbb E[\mathbf{X}\vert Y]). &fg=000000$
This post is intended as a short review of the method presented in [6].
In this paper, the authors use a kernel method to estimate each coefficient of the matrix. First, a Nadaraya-Watson kernel estimator is used to compute $latex { \mathbb E[\mathbf{X}\vert Y] }&fg=000000$ for fixed $latex { Y }&fg=000000$. Then, with this quantity, $latex { \mathop{\mathrm{Cov}}(\mathbb E[\mathbf{X}\vert Y]) }&fg=000000$ is estimated empirically (I will give some details below).
The paper proves asymptotic normality results for this estimator. Moreover, using perturbation theory for linear operators (see [2]), the authors show that the eigenvalues and eigenvectors also converge to normal distributions.
To explain the method briefly, I shall introduce some notation. If you get lost along the way, please let me know in the comments.
Write $latex {(\mathbf{X},Y)}&fg=000000$ and its $latex { n }&fg=000000$ independent copies $latex {(\mathbf{X}_{j},Y_j)}&fg=000000$ as
$latex \displaystyle \begin{array}{rl} (\mathbf{X},Y) & =(X_{1},\ldots,X_{p},Y)^{\top}, \\ (\mathbf{X}_{j},Y_j) & =(X_{1j},\ldots,X_{pj},Y_j)^{\top},\quad j=1,\ldots,n. \end{array} &fg=000000$
Notice that if $latex { \mathbb E[\mathbf{X}] =\mathbf{0} }&fg=000000$, then $latex { \mathbb E[\mathbb E[\mathbf{X}\vert Y]] = \mathbf{0} }&fg=000000$ as well, and the covariance reduces to a second moment:
$latex \displaystyle \Lambda = \mathop{\mathrm{Cov}}(\mathbb E[\mathbf{X}\vert Y]) = \mathbb E[\mathbb E[\mathbf{X}\vert Y ] \mathbb E[\mathbf{X}\vert Y ]^\top ]. &fg=000000$
We denote
$latex \displaystyle \mathbb E[\mathbf{X}\vert Y ] = \mathbf{R}(Y) = (R_1(Y), \ldots,R_p(Y))^\top. &fg=000000$
Now, the trick is to apply the Nadaraya-Watson estimator to every element of $latex { \mathbf{R}(Y) }&fg=000000$. Set
$latex \displaystyle \hat{R}_i(Y)=\frac{\hat{g}_i(Y)}{\hat{f}(Y)} &fg=000000$
where
$latex \displaystyle \begin{array}{rl} \hat{g}_{i}(Y) & = \displaystyle \frac{1}{nh}\sum_{k=1}^{n}X_{ik}K \left(\frac{Y-Y_{k}}{h}\right), \\ \hat{f}(Y) & = \displaystyle \frac{1}{nh}\sum_{k=1}^{n}K\left(\frac{Y-Y_{k}}{h}\right), \end{array} &fg=000000$
and the function $latex {K(u)}&fg=000000$ satisfies the classical assumptions for a kernel (see [4]).
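To make this concrete, here is a minimal NumPy sketch of the componentwise Nadaraya-Watson estimator (my own illustration, not code from the paper; the Gaussian kernel, the bandwidth, and the names gaussian_kernel and nw_estimate are assumed choices):

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Gaussian kernel; one common choice meeting the usual assumptions."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def nw_estimate(y, X, Y, h):
    """Nadaraya-Watson estimate of R(y) = E[X | Y = y].

    X is an (n, p) array of covariates, Y an (n,) array of responses,
    h the bandwidth.  Returns the p-vector R_hat(y) = g_hat(y) / f_hat(y).
    """
    w = gaussian_kernel((y - Y) / h)           # weights K((y - Y_k) / h)
    f_hat = w.mean() / h                       # f_hat(y) = (1/(nh)) sum_k K(.)
    g_hat = (w[:, None] * X).mean(axis=0) / h  # g_hat_i(y) for i = 1, ..., p
    return g_hat / f_hat                       # componentwise ratio R_hat_i(y)
```

In practice one should also guard against $latex { \hat{f}(y) }&fg=000000$ being close to zero; the trimmed estimator used in the paper (see the remark below) addresses exactly this point.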
Defining
$latex \displaystyle \mathbf{\hat{R}}(Y) =(\hat{R}_{1}(Y),\ldots,\hat{R}_{p}(Y))^{\top}, &fg=000000$
it is possible to estimate $latex { \Lambda }&fg=000000$ by
$latex \displaystyle \hat{\Lambda}_n =\frac{1}{n} \sum_{j=1}^{n} \mathbf{\hat{R}}(Y_j) \mathbf{\hat{R}}(Y_j)^\top. &fg=000000$
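Putting the pieces together, the following self-contained toy run (again a sketch under my own assumptions: a toy single-index model, a Gaussian kernel, a hand-picked bandwidth, and a crude 0.05 threshold standing in for the paper's trimming device) recovers the informative direction as the leading eigenvector of $latex { \hat{\Lambda}_n }&fg=000000$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, h = 500, 5, 0.3

# Toy single-index model: only the direction (1, 0, ..., 0) should matter.
X = rng.standard_normal((n, p))
Y = X[:, 0] + 0.5 * rng.standard_normal(n)
X -= X.mean(axis=0)                          # center X, as the derivation assumes

# Kernel weight matrix W[j, k] = K((Y_j - Y_k) / h), Gaussian kernel.
W = np.exp(-0.5 * ((Y[:, None] - Y[None, :]) / h) ** 2) / np.sqrt(2 * np.pi)

f_hat = W.mean(axis=1) / h                   # f_hat(Y_j) for every j
R_hat = (W @ X) / (n * h) / f_hat[:, None]   # row j holds R_hat(Y_j)^T

# Crude stand-in for the paper's low-density trimming: drop unstable points.
keep = f_hat > 0.05
Lambda_hat = R_hat[keep].T @ R_hat[keep] / keep.sum()

# The SIR directions: eigenvectors with the largest eigenvalues.
eigvals, eigvecs = np.linalg.eigh(Lambda_hat)
print(np.round(eigvals[::-1], 3))            # eigenvalues, descending
print(np.round(eigvecs[:, -1], 3))           # leading eigenvector
```

The leading eigenvector printed at the end should be close to $latex { \pm(1,0,\ldots,0)^\top }&fg=000000$, the true direction of the toy model.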
Remark
Actually, the estimator $latex { \hat{\Lambda}_n }&fg=000000$ used in the paper is built so that $latex { f(Y) }&fg=000000$ and $latex { \hat{f}(Y) }&fg=000000$ stay away from small values, which keeps the ratio $latex { \hat{g}_i/\hat{f} }&fg=000000$ stable. We do not present all the details, to keep the ideas simple.
The main theorem states that $latex { \hat{\Lambda}_n }&fg=000000$ is asymptotically normal. Precisely, they show that
$latex \displaystyle \sqrt{n} (\hat{\Lambda}_n - \Lambda) \Rightarrow \mathbf{H} &fg=000000$
where
$latex \displaystyle \lambda^\top \mathop{\mathrm{vech}}(\mathbf{H}) \sim N(0,\sigma_\lambda^2 )\quad \forall \lambda\neq 0&fg=000000$
and
$latex \displaystyle \begin{array}{rl} V_{il}(\mathbf{X},Y) &= \displaystyle\frac{1}{2}\left(X_i R_l(Y) + X_l R_i(Y) \right),\quad i,l=1,\ldots,p, \\ \sigma_\lambda^2 & = \lambda^\top \mathop{\mathrm{Cov}} (\mathop{\mathrm{vech}}(V(\mathbf{X},Y))) \lambda,\quad \lambda \in {\mathbb R} ^ {p(p+1)/2}, \end{array} &fg=000000$
where $latex { V(\mathbf{X},Y) }&fg=000000$ denotes the $latex { p\times p }&fg=000000$ matrix with entries $latex { V_{il}(\mathbf{X},Y) }&fg=000000$.
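The $latex { \sqrt{n} }&fg=000000$ rate can be probed with a crude Monte Carlo experiment (my own sanity check, not taken from the paper): the standard deviation of an entry of $latex { \hat{\Lambda}_n }&fg=000000$ across independent replications should roughly halve when $latex { n }&fg=000000$ is quadrupled.

```python
import numpy as np

rng = np.random.default_rng(1)

def lambda_hat_11(n, h, p=3):
    """First diagonal entry of Lambda_hat_n for one simulated sample."""
    X = rng.standard_normal((n, p))
    Y = X[:, 0] + 0.5 * rng.standard_normal(n)
    W = np.exp(-0.5 * ((Y[:, None] - Y[None, :]) / h) ** 2)
    R1 = (W @ X[:, 0]) / W.sum(axis=1)   # NW estimate of E[X_1 | Y_j]; constants cancel
    return np.mean(R1 ** 2)              # entry (1, 1) of Lambda_hat_n

for n in (250, 1000):
    h = n ** (-0.2)                      # ad-hoc bandwidth choice
    draws = np.array([lambda_hat_11(n, h) for _ in range(200)])
    print(n, draws.std())                # sd should roughly halve from n to 4n
```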
I am not going to present the results on the asymptotic normality of the eigenvalues and eigenvectors; I encourage the interested reader to take a look at the paper.
The proof of $latex { \hat{\Lambda}_n }&fg=000000$'s asymptotic normality basically consists in decomposing it into several terms and showing that most of them are negligible with respect to the leading term $latex { V(\mathbf{X},Y) }&fg=000000$. To control those terms, the paper establishes two interesting auxiliary results:
$latex \displaystyle \begin{array}{rl} \displaystyle \sup_y \left\vert \hat{f}(y) – f(y) \right\vert & = \displaystyle O\left(h^4 +\frac{\log n}{n^{1/2} h}\right),\\ \displaystyle \sup_y \left\vert \hat{g}_i(y) – g_i(y) \right\vert & = \displaystyle O_p\left(h^4 +\frac{\log n}{n^{1/2} h}\right). \end{array} &fg=000000$
These results control the convergence of $latex { \hat{f} }&fg=000000$ and $latex { \hat{g}_i }&fg=000000$ uniformly in $latex { y }&fg=000000$ at the rate $latex { h^4 +n^{-1/2} h^{-1}\log n }&fg=000000$, the former in the ordinary $latex { O }&fg=000000$ sense and the latter in probability.
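As a quick side calculation (mine, not the paper's, and ignoring constants), the bandwidth balancing the two terms of this rate solves
$latex \displaystyle \frac{d}{dh}\left(h^{4}+\frac{\log n}{n^{1/2}h}\right)=4h^{3}-\frac{\log n}{n^{1/2}h^{2}}=0\quad\Longrightarrow\quad h\asymp\left(\frac{\log n}{n^{1/2}}\right)^{1/5}, &fg=000000$
which, plugged back in, yields the uniform rate $latex { (\log n/n^{1/2})^{4/5} }&fg=000000$ for both estimators.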
You can find all the proofs and the complete discussion in [5] and [3].
References
- Bellman, R. (1957). Dynamic programming. Princeton University Press, Princeton, N. J.
- Kato, T. (1995). Perturbation theory for linear operators, volume 132. Springer Verlag.
- Prakasa Rao, B. (1983). Nonparametric functional estimation. Probability and mathematical statistics. Academic Press, New York.
- Tsybakov, A. B. (2008). Introduction to nonparametric estimation. Springer series in statistics. Springer.
- Zhu, L.-X. (1993). Convergence rates of empirical processes indexed by classes of functions and their applications. J. Syst. Sci. Math. Sci., 13:33-41.
- Zhu, L.-X. and Fang, K.-T. (1996). Asymptotics for kernel estimate of sliced inverse regression. The Annals of Statistics, 24(3):1053-1068.