July 15, 2013

Covariance estimation

Estimating the covariance matrix of a $d$-dimensional Gaussian random variable $X$ from $n$ observations is easy right? We all know we get an unbiased, efficient estimate of the covariance between $X_i$ and $X_j$ by computing
$$ \cov(X_i, X_j) \approx \frac 1 {n-1} \sum_{k=1}^n (x_{i,k} - \bar x_i) (x_{j,k} - \bar x_j) $$
so we just do this for each pair $(i, j)$ and we have our covariance matrix! Actually, it's not always that simple...

Theoretical covariance (and correlation) matrices are positive semi-definite, and so estimated covariance matrices should be viewed as elements of the manifold embedded in $\reals^{n\times n}$ consisting of PSD matrices. Intepreted in this light, the pairwise correlation estimator is biased and inefficient. Indeed, for small $n/d$ it's not unlikely that pairwise correlation will yield a matrix which isn't PSD, leading to problems further down the line when, for example, performing an eigenvalue decomposition.

One solution to this problem is shrinkage, an implementation of which is available in the scikit-learn Python library.

The idea behind shrinkage is that of substituting the traditional estimator with a convex combination $(1-\lambda) S + \lambda F$ of the sample covariance matrix $S$ and a shrinkage target $F$, a well-conditioned matrix which makes milder claims about the theoretical distribution.

So, to shrink a sample covariance matrix you need to decide on a shrinkage target and the convex weight $\lambda$. $F$ is typically taken to be the identity matrix multiplied by the average sample variance
$$
F = \frac{\tr S}{d}I
$$
so the tricky part is $\lambda$. Choosing this parameter is tricky because the optimality conditions one would intuitively choose depend on knowing the theoretical, unobservable covariance matrix $\Sigma$. However, it turns out, that these optimal parameters $\lambda^*$ can be estimated from the sample, ultimately providing a practical procedure to shrink the sample covariance matrix. Such optimal parameters are derived by Ledoit and Wolf and subsequently improved by Chen, Wiesel, Eldar and Hero.

Ledoit and Wolf derive the LW shrinkage intensity
$$
\hat\lambda_{LW}^* = \frac{b^2}{d^2}
$$
where $d^2 = \| S-F \|^2/$, $b^2 = \min(d^2, n^{-2}\sum_{k=1}^n \| x^{(n)}_k (x^{(n)}_k)^\T - S\|^2)$ and $\|M\|^2=\tr(MM^\T)$ is the Frobenius norm. Asymptotically, the shrunk estimator is shown to have, amongst all convex combinations of $S$ and $F$, minimum quadratic risk with respect to $\Sigma$. This asymptotic result does not, however, help in situations in which $n/d$ is low, for example when $n = d = 2$ we have $b = 0$ and hence no shrinkage occurs, which is the opposite of what one would want in such a situation.

The CWEH estimator performs significantly better in low $n/d$ situations. The shrinkage intensity is given by
$$
\hat\lambda_{OAS}^* = \min\paren{ \frac{(1-2/d)\tr(S^2)+\tr^2(S)}{(n+1-2/d)\bracket{ \tr(S^2) - \tr^2(S)/d } }, 1 }.
$$

July 3, 2013

References in JSON

Modern object-oriented programming languages store data as objects which reference each other, in other words as a directed graph where each node is an object and each arc is a reference. JSON, on the other hand, can natively only represent trees, which are a strict subset of all possible directed graphs, specifically
$$ \begin{align*} \text{(trees)} &= \left\{ g \in \text{(directed graphs)} : \text{ one node in }g\text{ has zero predecessors and the others have exactly one} \right\} \\ &\subset \text{(directed graphs)} \end{align*} $$
In order to represent generic directed graphs in JSON, it's necessary to introduce reference semantics, in other words some mechanism to allow objects to be referenced by multiple predecessors, including objects they in turn reference (cycles).

Let's look at a few approaches that have been tried elsewhere

ID-based references

One approach is to recognise JSON dictionaries of the form
{ "$ref": <identifier> }
as a reference to the unique dictionary containing the key
{ "$id": <identifier>, ... }
This approach has the disadvantage that only JSON dictionary can be referenced, as there is no mechanism for assigning an id to an array

Path-based references

An alternative method is to reference other objects by describing where they are in relation to the referencing object. One popular proposal for doing this is JSONPath.

A simple object names-based approach

Getting back to our original problem, of how to represent an entire directed graph of objects, clearly it's necessary to have some way to retrieve at least a subset of those objects for manipulation, hence a JSON representation of the graph should have as the root object a dictionary associating names to objects.

When this is the case, it's very straightforward to reference objects via their associated names. In order to  represent any directed graph of objects, all that is needed is to assign arbitrary, unique names to multiply-referenced objects. Singly-referenced objects do not need to have names as they can be represented directly inline within the JSON document.

References can then be represented using the {"$ref": <id>} notation or even more simply by deciding on a custom marker such as the '@' sign and specifying that any string starting with that symbol is a reference to a named object. Clearly strings which are really supposed to start with '@' must be escaped, e.g. with a double '@@' which must be processed during the linking/unlinking algorithms.

Note that in reality this approach is not too dissimilar to the ID-based approach, but it provides a different way of assigning IDs which can be applied to any type of object, not just dictionaries.

C# implementation

Here is a C# implementation of the object names-based approach. Note that here it has been assumed that the JSON structure is represented in-memory with IDictionary and Array objects, but this can easily be generalized to any JSON value system. The linking and unlinking procedures act in-place, i.e. they modify the instances passed to them rather than returning new, modified instances.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace JsonUtils
{
  public class JsonLinker
  {
    public static void Link(IDictionary<string, object> vars)
    {
      Link(vars, vars);
    }

    static object Link(object obj, IDictionary<string, object> variables)
    {
      if(obj == null)
        return null;

      Type t = obj.GetType();
      if(obj is IDictionary<string, object>)
      {
        var dict = (IDictionary<string, object>) obj;
        foreach(var key in dict.Keys.ToArray())
          dict[key] = Link(dict[key], variables);
      }
      else if(obj is Array)
      {
        var a = (Array) obj;
        for(int k = 0; k < a.Length; k++)
          a.SetValue(Link(a.GetValue(k), variables), k);
      }
      else if(obj is string)
      {
        var s = (string) obj;
        if(s.Length >  0 && s[0] == '@')
        {
          if(s.Length > 1 && s[1] == '@')
            return s.Substring(1);
          else
            return variables[s.Substring(1)];
        }
      }

      return obj;
    }

    public static void Unlink(IDictionary<string, object> vars)
    {
      var context = new UnlinkContext(vars);
    }

    class UnlinkContext
    {
      IDictionary<string, object> variables;
      IDictionary<object, string> objects = new Dictionary<object, string>();
      ISet<object> visited = new HashSet<object>();

      public UnlinkContext(IDictionary<string, object> variables)
      {
        this.variables = variables;
        foreach(var kv in variables)
        {
          if(kv.Value != null && (kv.Value is IDictionary<string, object> || kv.Value is Array))
            objects[kv.Value] = kv.Key;
        }

        foreach(var v in variables.Values.ToArray())
          Traverse(v);

        foreach(var k in variables.Keys.ToArray())
          variables[k] = Unlink(variables[k], k);
      }

      void Traverse(object obj)
      {
        if(obj != null && (obj is IDictionary<string, object> || obj is Array))
        {
          if(visited.Contains(obj))
          {
            if(!objects.ContainsKey(obj))
            {
              string name = "obj" + objects.Count.ToString();
              objects.Add(obj, name);
              variables.Add(name, obj);
            }
            return;
          }

          visited.Add(obj);

          if(obj is IDictionary<string, object>)
          {
            var dict = (IDictionary<string, object>) obj;
            foreach(var value in dict.Values)
              Traverse(value);
          }
          else if(obj is Array)
          {
            var a = (Array) obj;
            for(int k = 0; k < a.Length; k++)
              Traverse(a.GetValue(k));
          }
        }
      }

      object Unlink(object obj, string label = null)
      {
        if(obj != null && (obj is IDictionary<string, object> || obj is Array))
        {
          string name;
          if(objects.TryGetValue(obj, out name) && name != label)
            return "@" + name;

          if(obj is IDictionary<string, object>)
          {
            var dict = (IDictionary<string, object>) obj;
            foreach(var key in dict.Keys.ToArray())
              dict[key] = Unlink(dict[key]);
          }
          else if(obj is Array)
          {
            var a = (Array) obj;
            for(int k = 0; k < a.Length; k++)
              a.SetValue(Unlink(a.GetValue(k)), k);
          }
          else if(obj is string)
          {
            var s = (string) obj;
            if(s.Length > 0 && s[0] == '@')
              return "@" + s;
          }
        }

        return obj;
      }
    }
  }
}

July 2, 2013

Mapping the Pyson syntax to JSON

The problem with creating a new syntax like Pyson is that it's necessary to reimplement a parser in every programming you wish to use it in. So an alternative could be to map the Pyson syntax on to JSON and only keep the semantic concepts associated with it, i.e. the
$$\text{AST} \to \text{data}$$
step.

For example, the following could be a JSON representation of a Pyson document we have seen a few times already
{
  "DateTime": {
    "#type": " @Type",
    "#args": [
      "DateTime",
      ["year", "month", "day"]
    ],
    "defaults"=[0, 0, 0],
    "format"="{year:D4}/{month:D2}/{day:D2} {hour:D2}:{minute:D2}:{second:D2}"
  }
  "obj3": {
    "#type": "@DateTime",
    "#args": [2013, 6, 24]
  }
  "period": [
    "@obj3",
    {
      "#type": "@DateTime",
      "#args": [2014, 6, 1]
    }
  ],
  "simulation": {
    "start": "@obj3"
  }
}
Here we've used strings starting with '@' to denote references and dictionary keys starting with '#' for keys with special meaning. The JSON version is somewhat more verbose and less readable than the Pyson version, however it has the advantage of being immediately parsable in any language that already supports JSON. To render it usable in the same way as Pyson, a second pass is needed to link the references and unescape strings that really should start with '@', and viceversa.

Pyson get some more syntax

Keeping in theme with the 'Python subset' idea, I thought of allowing keyword arguments in constructors as follows
DateTime = Type(
    "DateTime",
    ["year", "month", "day", "hour", "minute", "second"],
    defaults=[0, 0, 0],
    format="{year:D4}/{month:D2}/{day:D2} {hour:D2}:{minute:D2}:{second:D2}"
    )