Shannon’s entropy $H$ is a fundamental statistic that measures the uncertainty of a (discrete) distribution. It is a building block for mutual information $I(X;Y) = H(X) - H(X|Y)$ which has numerous applications in statistics, communication, signal processing, machine learning and so on. In the context of neuroscience, entropy can measure the maximum capacity of a neuron, quantify the amount of noise, and also serve as a cost function for theoretical derivation of learning rules. Amount of information coded by neural spike trains about a stimulus can be measured by mutual information, and provides a fundamental limit for neural codes.

Unfortunately, estimating entropy or mutual information is notoriously difficult, especially when the number of observations is less than the number of possible symbols [1]. For the neural data, this is often the case, due to the combinatorial nature of the symbols under consideration. If we consider binning a 100 ms window of spike trains from 10 neurons with a resolution of 1 ms bin, the total number of possible symbols become $2^{10 \cdot 100}$. Just to observe that many symbols, one needs $10^{292}$ years. Therefore, we must be clever. The question is how to extrapolate when you may have a severely under-sampled distribution.

In the literature, there have been many entropy estimators, and mutual information estimators based on them. We extend one of the best known entropy estimators called the NSB estimator [2,3], which is a Bayesian estimator with an approximately non-informative prior on entropy. This is achieved by mixing Dirichlet distributions appropriately. We have extended the procedure to a situation where the number of symbols with non-zero probability is unknown or arbitrarily large by mixing Pitman-Yor process as priors. The limit of the NSB estimator for infinite bins can be captured by Dirichlet process mixture prior. Pitman-Yor process is an extension of Dirichlet process with an extra parameter. Advantages of using Pitman-Yor mixture is that it can fit heavy-tailed distributions, and neural data (as well as many other natural phenomena) has heavy-tailed distribution. Our estimator shows significantly smaller bias for power-law tailed generation process as well as spiking neural data.

If you’re at COSYNE 2012, details are presented as a poster titled “Bayesian entropy estimation for infinite neural alphabets” by Evan Archer, myself and Jonathan Pillow. Look for  III-31 (Feb 25th, Saturday)

Update: preprint of this work can be found on the arXiv: Evan Archer*, Il Memming Park*, Jonathan Pillow. Bayesian Entropy Estimation for Countable Discrete DistributionsarXiv:1302.0328 (2013) (* equal contribution)

1. Liam Paninski. Estimation of Entropy and Mutual Information. Neural Computation, Vol. 15, No. 6. (1 June 2003), pp. 1191-1253, doi:10.1162/089976603321780272
2. I Nemenman, F Shafee, and W Bialek. Entropy and inference, revisited. NIPS 2001
3. I Nemenman, W Bialek, and R de Ruyter van Steveninck. Entropy and information in neural spike trains: Progress on the sampling problem. Phys. Rev. E, 69:056111, 2004.