July 15, 2013

Covariance estimation

Estimating the covariance matrix of a $d$-dimensional Gaussian random variable $X$ from $n$ observations is easy, right? We all know we get an unbiased, efficient estimate of the covariance between $X_i$ and $X_j$ by computing
$$ \cov(X_i, X_j) \approx \frac 1 {n-1} \sum_{k=1}^n (x_{i,k} - \bar x_i) (x_{j,k} - \bar x_j) $$
so we just do this for each pair $(i, j)$ and we have our covariance matrix! Actually, it's not always that simple...
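For reference, here is that simple entrywise recipe in numpy (the data below is simulated purely for illustration):

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(100, 5)   # n = 100 observations of a d = 5 Gaussian

n, d = X.shape
X_centered = X - X.mean(axis=0)

# The entrywise estimator above, with the 1/(n-1) normalisation.
S = np.dot(X_centered.T, X_centered) / (n - 1)

# numpy's built-in estimator agrees (columns are variables here).
assert np.allclose(S, np.cov(X, rowvar=False))
```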

Theoretical covariance (and correlation) matrices are positive semi-definite, and so estimated covariance matrices should be viewed as elements of the manifold embedded in $\reals^{d\times d}$ consisting of PSD matrices. Interpreted in this light, the pairwise correlation estimator is biased and inefficient. Indeed, for small $n/d$ it's not unlikely that pairwise correlation will yield a matrix which isn't PSD, leading to problems further down the line when, for example, performing an eigenvalue decomposition.

One solution to this problem is shrinkage, an implementation of which is available in the scikit-learn Python library.
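For instance, a minimal sketch with simulated data (the estimators expect one observation per row):

```python
import numpy as np
from sklearn.covariance import LedoitWolf, OAS, ShrunkCovariance

np.random.seed(0)
X = np.random.randn(100, 5)   # n = 100 observations of a d = 5 vector

# Ledoit-Wolf chooses the shrinkage intensity from the data.
lw = LedoitWolf().fit(X)
print(lw.shrinkage_)    # the estimated lambda
print(lw.covariance_)   # the shrunk covariance matrix

# OAS does the same with the Chen-Wiesel-Eldar-Hero intensity.
oas = OAS().fit(X)

# ShrunkCovariance applies a fixed lambda of your choosing.
fixed = ShrunkCovariance(shrinkage=0.1).fit(X)
```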

The idea behind shrinkage is to replace the traditional estimator with a convex combination $(1-\lambda) S + \lambda F$ of the sample covariance matrix $S$ and a shrinkage target $F$, a well-conditioned matrix which makes milder claims about the theoretical distribution.

So, to shrink a sample covariance matrix you need to decide on a shrinkage target and the convex weight $\lambda$. $F$ is typically taken to be the identity matrix multiplied by the average sample variance
$$
F = \frac{\tr S}{d}I
$$
so the tricky part is $\lambda$. Choosing this parameter is hard because the optimality conditions one would intuitively impose depend on knowing the theoretical, unobservable covariance matrix $\Sigma$. However, it turns out that these optimal parameters $\lambda^*$ can be estimated from the sample, ultimately providing a practical procedure for shrinking the sample covariance matrix. Such optimal parameters were derived by Ledoit and Wolf and subsequently improved by Chen, Wiesel, Eldar and Hero.
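The shrinkage step itself is tiny; all the interesting work goes into choosing the intensity. A sketch, with `lam` left as an input:

```python
import numpy as np

def shrink(S, lam):
    """Shrink the sample covariance S towards F = tr(S)/d * I.

    Returns the convex combination (1 - lam) * S + lam * F.
    """
    d = S.shape[0]
    F = np.trace(S) / d * np.eye(d)
    return (1 - lam) * S + lam * F
```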

Ledoit and Wolf derive the LW shrinkage intensity
$$
\hat\lambda_{LW}^* = \frac{b^2}{d^2}
$$
where $d^2 = \| S-F \|^2$, $b^2 = \min(d^2, n^{-2}\sum_{k=1}^n \| x_k x_k^\T - S\|^2)$ with $x_k$ the $k$-th (centred) observation, and $\|M\|^2=\tr(MM^\T)$ is the Frobenius norm (note that $d^2$ here is a squared distance, not the dimension). Asymptotically, the shrunk estimator is shown to have, amongst all convex combinations of $S$ and $F$, minimum quadratic risk with respect to $\Sigma$. This asymptotic result does not, however, help in situations in which $n/d$ is low: for example, when $n = d = 2$ we have $b = 0$ and hence no shrinkage occurs, which is the opposite of what one would want in such a situation.
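As a sanity check on the notation, here is a direct, unoptimised transcription of the LW intensity. It assumes the rows of `X` are already centred and uses the $1/n$-normalised sample covariance, which is the convention under which the $n = d = 2$ example above gives $b = 0$:

```python
import numpy as np

def lw_intensity(X):
    """Ledoit-Wolf shrinkage intensity, transcribed from the formulas above.

    Assumes the rows of X are centred observations and uses the
    1/n-normalised sample covariance.
    """
    n, p = X.shape
    S = np.dot(X.T, X) / n
    F = np.trace(S) / p * np.eye(p)

    d2 = np.sum((S - F) ** 2)   # ||S - F||^2 (squared Frobenius norm)
    b2_bar = sum(np.sum((np.outer(x, x) - S) ** 2) for x in X) / n ** 2
    b2 = min(d2, b2_bar)
    return b2 / d2
```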

The CWEH estimator, which its authors call the oracle approximating shrinkage (OAS) estimator, performs significantly better in low $n/d$ situations. Its shrinkage intensity is given by
$$
\hat\lambda_{OAS}^* = \min\paren{ \frac{(1-2/d)\tr(S^2)+\tr^2(S)}{(n+1-2/d)\bracket{ \tr(S^2) - \tr^2(S)/d } }, 1 }.
$$
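Transcribed in the same way, again as a sketch under the same centring and normalisation assumptions (and not necessarily identical to scikit-learn's implementation):

```python
import numpy as np

def oas_intensity(X):
    """OAS shrinkage intensity, transcribed from the formula above.

    Same assumptions as lw_intensity: centred rows, 1/n-normalised S.
    """
    n, p = X.shape
    S = np.dot(X.T, X) / n

    tr_S = np.trace(S)
    tr_S2 = np.trace(np.dot(S, S))

    num = (1 - 2.0 / p) * tr_S2 + tr_S ** 2
    den = (n + 1 - 2.0 / p) * (tr_S2 - tr_S ** 2 / p)
    return min(num / den, 1.0)
```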
