On May 2nd 2015, I organized yet another BlackBoardDay, this time in New York City (on Columbia University campus, thanks Evan!).

I started the discussion by tracing the history of modern mathematics back to Gottlob Frege (Vika pointed out the axiomatic tradition goes back to Euclid (300 BC)).

• Gottlob Frege’s Begriffsschrift, the first symbolic logic system powerful enough for mathematics (1879)
• Giuseppe Peano’s axiomatization of arithmetic (1889)
• David Hilbert’s program to build a foundation of mathematics (1900-1920s)
• Bertrand Russell’s paradox in Frege’s system (1902)
• Kurt Gödel was born! (1906)
• Russell and Whitehead’s Principia Mathematica as foundation of mathematics (1910)
• Kurt Gödel’s completeness theorem (for first-order logic) (1930)
• Kurt Gödel’s incompleteness theorem (of Principia Mathematica and related systems) (1931)

Victoria (Vika) Gitman talked about non-standard models of Peano arithmetic. She listed the first-order form of the Peano axioms, which are supposed to describe addition, multiplication, and ordering of the natural numbers $\mathbb{N} = \{0, 1, 2, \cdots \}$. However, it turns out there are other countable models that are not the natural numbers and yet satisfy the Peano axioms. She used the compactness theorem, a corollary of the completeness theorem (Gödel 1930), which (loosely) states that if every finite subset of the axioms of a first-order system has a model, then the whole system has a model. She showed that if we add a constant symbol ‘c’ (in addition to 0 and 1) to the language of arithmetic, together with an infinite set of axioms {c > 0, c > 1, c > 2, … }, then every finite subset of the enlarged system is satisfied by $\mathbb{N}$ itself (interpret c as a large enough number), so by the compactness theorem the whole system has a model. This model is somewhat like integers $\mathbb{Z}$ sprinkled on rational numbers $\mathbb{Q}$, in the sense that (…, c-2, c-1, c, c+1, c+2, …) are all larger than the regular $\mathbb{N}$, but then 2c is larger than all of that. Then there are also fractions of c such as c/2, and so on. This is still countable, since it is a countable collection of countably infinite sets, but this totally blew our minds. In this non-standard model of arithmetic, those ‘numbers’ outside $\mathbb{N}$ can be represented as a pair in $\mathbb{Q} \times \mathbb{Z}$, but actual computation with those numbers turns out to be non-trivial (and often non-computable).

Ashish Myles talked about the incompleteness theorem, and other disturbing ideas. Starting from the analogy of the liar’s paradox, Ashish explained that arithmetic (with multiplication) can be used to encode logical statements as natural numbers, and also to write a (recursive) function that encapsulates the notion of ‘provable from axioms’. The Gödel statement G roughly says that “the statement encoded by the natural number for G is not provable”. Such a statement is true (in our meta language), since if it were false, G would be provable and there would be a contradiction. However, adding either G or “not G” as an axiom to the original system yields a consistent system. Even after including G (or “not G”) as an axiom to Peano arithmetic, there’ll be statements that are true but not provable! Vika gave an example of a statement that is true for the natural numbers but is not provable from the Peano axioms: every Goodstein sequence terminates at 0.
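To get a feel for Goodstein sequences, here is a minimal Python sketch. It uses the standard definition (write the current term in hereditary base-b notation, replace every b by b+1, subtract 1, then move on to the next base); the function names `bump` and `goodstein` are my own and not from the talk.

```python
def bump(n, base):
    """Write n in hereditary base-`base` notation and replace every `base` by `base + 1`."""
    result = 0
    exponent = 0
    while n > 0:
        n, digit = divmod(n, base)
        if digit:
            # Exponents are themselves rewritten hereditarily, hence the recursion.
            result += digit * (base + 1) ** bump(exponent, base)
        exponent += 1
    return result


def goodstein(start, max_terms):
    """Return the first `max_terms` terms of the Goodstein sequence, stopping early at 0."""
    terms = [start]
    value, base = start, 2
    while value > 0 and len(terms) < max_terms:
        value = bump(value, base) - 1
        base += 1
        terms.append(value)
    return terms


if __name__ == "__main__":
    print(goodstein(3, 10))  # reaches 0 after a handful of steps
    print(goodstein(4, 6))   # grows quickly; it does reach 0, but only after astronomically many steps
```

Starting from 3 the sequence dies out almost immediately, while starting from 4 it grows for an enormous number of steps before it eventually comes back down, which hints at why the termination proof needs more than Peano arithmetic.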

At this point, we were all feeling very cold inside, and needed some warm sunshine. So, we continued our discussion outside:

Kyle Mandli talked about the Axiom of Choice (AC), an axiom that is somewhat counterintuitive and independent of Zermelo-Fraenkel (ZF) set theory: both ZF with AC and ZF with not AC are consistent, provided ZF itself is (Gödel 1938; Cohen 1963). We discussed many counterintuitive “paradoxes” as well as the usefulness of AC in mathematics.

Diana Hall talked about a counterintuitive bet: suppose we toss a fair coin repeatedly to create a sequence. Would you bet on seeing HTH first or HHT first? At first one might think they are equally likely. However, there’s a sequence effect that makes them unequal!
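A quick simulation makes the asymmetry visible. This is only a sketch under my own assumptions: both patterns are watched on the same stream of fair tosses and whichever appears first wins. The function names are hypothetical.

```python
import random


def first_to_appear(rng, pattern_a="HHT", pattern_b="HTH"):
    """Toss a fair coin until one of the two length-3 patterns shows up; return the winner."""
    window = ""
    while True:
        window = (window + rng.choice("HT"))[-3:]  # keep only the last three tosses
        if window == pattern_a:
            return pattern_a
        if window == pattern_b:
            return pattern_b


def hht_win_fraction(n_games=20000, seed=0):
    """Estimate how often HHT shows up before HTH."""
    rng = random.Random(seed)
    wins = sum(first_to_appear(rng) == "HHT" for _ in range(n_games))
    return wins / n_games


if __name__ == "__main__":
    print(hht_win_fraction())  # close to 2/3
```

The reason is that once two heads in a row have appeared, HHT can no longer lose: extra heads keep the run alive and the first tail completes it. HTH has no such safety net, since a second tail sends it back to the start. Working through the states gives HHT a 2/3 chance of appearing first.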

Unfortunately, due to time constraints we couldn’t get to the topic Uygar had planned: “approximate solutions to combinatorial optimization problems implies P=NP”. Hopefully we’ll hear about it at BBD11!

NIPS is growing fast with 2400+ participants! I felt there were proportionally fewer “neuro” papers compared to last year, maybe because of a huge presence of deep network papers. My NIPS keywords of the year: Deep learning, Bethe approximation/partition function, variational inference, climate science, game theory, and Hamming ball. Here are my notes on the interesting papers/talks from my biased sampling as a neuroscientist, as I did for the previous meetings. Other bloggers have written about the conference: Paul Mineiro, John Platt, Yisong Yue and Yun Hyokun (in Korean).

## The NIPS Experiment

The program chairs, Corinna Cortes and Neil Lawrence, ran an experiment on the reviewing process and estimated the inconsistency. 10% of the papers were chosen to be reviewed independently by two pools of reviewers and area chairs, hence those authors got 6-8 reviews, and had to submit 2 author responses. The disagreement was around 25%, meaning around half of the accepted papers could have been rejected (the baseline assuming independent random acceptance was around 38%). This tells you that the variability in the NIPS reviewing process is substantial, so keep that in mind whether your papers got in or not! They accepted all papers that had disagreement between the two pools, so the overall acceptance rate was a bit higher this year. For details, see Eric Price’s blog post and Bert Huang’s post.

## Latent variable modeling of neural population activity

Extracting Latent Structure From Multiple Interacting Neural Populations
Joao Semedo, Amin Zandvakili, Adam Kohn, Christian K. Machens, Byron M. Yu

How can we quantify how two populations of neurons interact? A full interaction model would require O(N^2) parameters, which quickly makes the inference intractable. Therefore, a low-dimensional interaction model could be useful, and this paper does exactly this by extending the ideas of canonical correlation analysis to vector autoregressive processes.

Clustered factor analysis of multineuronal spike data
Lars Buesing, Timothy A. Machado, John P. Cunningham, Liam Paninski

How can you put more structure to a PLDS (Poisson linear dynamical system) model? They assumed disjoint groups of neurons would have loadings from a restricted set of factors only. For application, they actually restricted the loading weights to be non-negative, in order to separate out the two underlying components of oscillation in spinal cord. They have a clever subspace clustering based initialization, and a variational inference procedure.

A Bayesian model for identifying hierarchically organised states in neural population activity
Patrick Putzky, Florian Franzen, Giacomo Bassetto, Jakob H. Macke

How do you capture discrete states in the brain, such as UP/DOWN states? They propose using a probabilistic hierarchical hidden Markov model for populations of spiking neurons. The hierarchical structure reduces the effective number of parameters of the state transition matrix. The full model captures the population variability better than coupled GLMs, though the number of states and their structure are not learned. Estimation is via variational inference.

On the relations of LFPs & Neural Spike Trains
David E. Carlson, Jana Schaich Borg, Kafui Dzirasa, Lawrence Carin.

Analysis of Brain States from Multi-Region LFP Time-Series
Kyle R. Ulrich, David E. Carlson, Wenzhao Lian, Jana S. Borg, Kafui Dzirasa, Lawrence Carin

## Bayesian brain, optimal brain

Fast Sampling-Based Inference in Balanced Neuronal Networks
Guillaume Hennequin, Laurence Aitchison, Mate Lengyel

Sensory Integration and Density Estimation
Joseph G. Makin, Philip N. Sabes

Optimal Neural Codes for Control and Estimation
Alex K. Susemihl, Ron Meir, Manfred Opper

Spatio-temporal Representations of Uncertainty in Spiking Neural Networks
Cristina Savin, Sophie Denève

Optimal prior-dependent neural population codes under shared input noise
Agnieszka Grabska-Barwinska, Jonathan W. Pillow

Neurons as Monte Carlo Samplers: Bayesian Inference and Learning in Spiking Networks
Yanping Huang, Rajesh P. Rao

## Other Computational and/or Theoretical Neuroscience

Using the Emergent Dynamics of Attractor Networks for Computation (Posner lecture)
J. J. Hopfield

He introduced bump attractor networks via analogy of magnetic bubble (shift register) memory. He suggested that cadence and duration variations in voice can be naturally integrated with state-dependent synaptic input. Hopfield previously suggested using relative spike timings to solve a similar problem in olfaction. Note that this continuous attractor theory predicts low-dimensional neural representation. His paper is available as a preprint.

Deterministic Symmetric Positive Semidefinite Matrix Completion
William E. Bishop and Byron M. Yu

See workshop posting where Will gave a talk on this topic.

## General Machine Learning

Identifying and attacking the saddle point problem in high-dimensional non-convex optimization
Yann N. Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, Yoshua Bengio

From results in statistical physics, they hypothesize that saddle points become far more numerous in high dimensions, and that they are the main cause of slow convergence of stochastic gradient descent. In addition, the exact Newton method converges to saddles, and (stochastic) gradient descent is slow to get out of them, causing lengthy plateaus in training neural networks. They provide a theoretical justification for a known heuristic optimization method, which is to take the absolute value of the eigenvalues of the Hessian when taking the Newton step. This avoids saddles, and dramatically improves convergence speed.
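The absolute-eigenvalue idea is easy to try on a toy problem. The sketch below is my own illustration, not the authors' code: it uses the 2D function f(x, y) = x^2 - y^2 (a pure saddle at the origin), a closed-form eigendecomposition for symmetric 2x2 matrices, and hypothetical function names. It assumes the Hessian has no zero eigenvalues.

```python
import math


def eig_sym_2x2(a, b, d):
    """Eigenvalue/eigenvector pairs of the symmetric matrix [[a, b], [b, d]]."""
    if b == 0.0:
        return [(a, (1.0, 0.0)), (d, (0.0, 1.0))]  # already diagonal
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    pairs = []
    for lam in (mean + radius, mean - radius):
        vx, vy = lam - d, b  # solves (H - lam * I) v = 0 when b != 0
        norm = math.hypot(vx, vy)
        pairs.append((lam, (vx / norm, vy / norm)))
    return pairs


def curvature_scaled_step(grad, hessian, use_abs):
    """Newton step -H^{-1} g, or -|H|^{-1} g when use_abs is True."""
    (a, b), (_, d) = hessian
    step = [0.0, 0.0]
    for lam, (vx, vy) in eig_sym_2x2(a, b, d):
        scale = abs(lam) if use_abs else lam  # assumes lam != 0
        coeff = (grad[0] * vx + grad[1] * vy) / scale
        step[0] -= coeff * vx
        step[1] -= coeff * vy
    return tuple(step)


def demo():
    """One step from (1, 0.1) on the toy saddle f(x, y) = x**2 - y**2."""
    x, y = 1.0, 0.1
    grad = (2.0 * x, -2.0 * y)
    hessian = ((2.0, 0.0), (0.0, -2.0))
    newton = curvature_scaled_step(grad, hessian, use_abs=False)
    saddle_free = curvature_scaled_step(grad, hessian, use_abs=True)
    return (x + newton[0], y + newton[1]), (x + saddle_free[0], y + saddle_free[1])


if __name__ == "__main__":
    print(demo())
```

On this toy function the plain Newton step jumps straight onto the saddle at the origin, whereas the absolute-value version still moves to x = 0 along the positive-curvature direction but pushes y away from 0 along the negative-curvature direction, so the function value keeps decreasing.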

A* Sampling
Chris J. Maddison, Daniel Tarlow, Tom Minka

Extends the Gumbel-Max trick to an exact sampling algorithm for general (low-dimensional) continuous distributions with intractable normalizers. The trick involves perturbing a discrete-domain function by adding independent samples from the Gumbel distribution and taking the argmax. They construct a Gumbel process, which gives bounds on the intractable log partition function, and use it to sample.
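For reference, the basic discrete Gumbel-Max trick (not the A* extension from the paper) fits in a few lines. This is a minimal sketch with function names of my own choosing: add independent standard Gumbel noise to unnormalized log-weights and return the index of the maximum.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one standard Gumbel variate via -log(-log(U)) with U in (0, 1)."""
    u = rng.random()
    while u == 0.0:  # guard against log(0); random() never returns 1.0
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_max_sample(log_weights, rng):
    """Return index i with probability proportional to exp(log_weights[i])."""
    perturbed = [lw + sample_gumbel(rng) for lw in log_weights]
    return max(range(len(perturbed)), key=perturbed.__getitem__)


def empirical_frequencies(log_weights, n_samples=30000, seed=0):
    """Histogram of Gumbel-Max samples, normalized to frequencies."""
    rng = random.Random(seed)
    counts = [0] * len(log_weights)
    for _ in range(n_samples):
        counts[gumbel_max_sample(log_weights, rng)] += 1
    return [c / n_samples for c in counts]


if __name__ == "__main__":
    # Unnormalized weights 1 : 2 : 3, so the target probabilities are 1/6, 1/3, 1/2.
    print(empirical_frequencies([math.log(1.0), math.log(2.0), math.log(3.0)]))
```

Note that the normalizer never appears: only the unnormalized log-weights are needed, which is what makes the trick attractive when the partition function is intractable.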

Divide-and-Conquer Learning by Anchoring a Conical Hull
Tianyi Zhou, Jeff A. Bilmes, Carlos Guestrin

Spectral Learning of Mixture of Hidden Markov Models
Cem Subakan, Johannes Traa, Paris Smaragdis

Clamping Variables and Approximate Inference
The speaker’s slides are available online.

Information-based learning by agents in unbounded state spaces
Shariq A. Mobin, James A. Arnemann, Fritz Sommer

Expectation Backpropagation: Parameter-Free Training of Multilayer Neural Networks with Continuous or Discrete Weights
Daniel Soudry, Itay Hubara, Ron Meir

Self-Paced Learning with Diversity
Lu Jiang, Deyu Meng, Shoou-I Yu, Zhenzhong Lan, Shiguang Shan, Alexander Hauptmann

Last week, I co-organized the NIPS workshop titled: Large scale optical physiology: From data-acquisition to models of neural coding with Ferran Diego Andilla, Jeremy Freeman, Eftychios Pnevmatikakis and Jakob Macke. Optical neurophysiology promises larger population recordings, but we also face technical challenges in hardware, software, signal processing, and statistical tools for analyzing high-dimensional data. Here are highlights of some of the non-optical physiology talks:

Surya Ganguli presented exciting new results improving on his last NIPS workshop and last COSYNE workshop talks. Our experimental limitations force us to analyze severely subsampled data, and we often find correlations and low-dimensional dynamics. Surya asks “How would dynamical portraits change if we record from more neurons?” This time he had detailed results for single-trial experiments. Using matrix perturbation, random matrix, and non-commutative probability theory, they show a sharp phase transition in recoverability of the manifold. Their model was linear Gaussian, namely $R = U X + Z$, where X is a low-rank matrix of neural trajectories over time, U is a sparse subsampling matrix, and Z is additive Gaussian noise. The bound for recovery had the form $\mathrm{SNR} \sqrt{MP} \geq K$, where K is the dimension of the latent dynamics, P is the temporal duration (samples), M is the number of subsampled neurons, and SNR denotes the signal-to-noise ratio of a single neuron.

Vladimir Itskov gave a talk about inferring structural properties of the network from the estimated covariance matrix (we originally invited his collaborator Eva Pastalkova, but she couldn’t make it due to a job interview). An undirected graph whose weights correspond to an embedding in a Euclidean space shows a characteristic Betti curve: the curve of Betti numbers as a function of the threshold on the graph’s weights, which is varied to construct the topological objects. For certain random graphs, the characteristics are very different, hence they used it to quantify how ‘random’ or ‘low-dimensional’ the covariances they observed were. Unfortunately, these curves are computationally very expensive, so only up to the 3rd Betti number can be estimated, and the Betti curves are too noisy to be used for estimating dimensionality directly. But they found that hippocampal data were far from ‘random’. A similar talk was given at CNS 2013.

William Bishop, a 5th year graduate student working with Byron Yu and Rob Kass, talked about stitching partially overlapping covariance matrices, a problem first discussed at NIPS 2013 by Srini Turaga and coworkers: Can we estimate the full noise correlation matrix of a large population given smaller overlapping observations? He provided sufficient conditions for stitching, the most important of which is that the rank of the covariance matrix of the overlap be at least the rank of the entire covariance matrix. Furthermore, he analyzed theoretical bounds on perturbations, which can be used for designing strategies for choosing the overlaps carefully. For details see the corresponding main conference paper, Deterministic Symmetric Positive Semidefinite Matrix Completion.

Unfortunately, due to weather conditions Rob Kass couldn’t make it to the workshop.

Every last Sunday of April, I have been organizing a small workshop called BBD. We discuss logic, math, and science on a blackboard (this year, it was actually on a blackboard unlike the past 3 years!)

The main theme was paradox. A paradox is a puzzling contradiction; using some sort of reasoning one derives two contradicting conclusions. Consistency is an essential quality of a reasoning system, that is, it should not be able to produce contradictions by itself. Therefore, true paradoxes are hazardous to the fundamentals of being rational, but fortunately, most paradoxes are only apparent and can be resolved. Today (April 27th, 2014), we had several paradoxes presented:

Memming: I presented the Pinocchio paradox, which is a fun variant of the Liar paradox. Pinocchio’s nose grows if and only if Pinocchio tells a false statement. What happens when Pinocchio says “My nose grows now”/“My nose will grow now”? It either grows or does not grow. If it grows, he is telling the truth, so it should not grow. If it does not grow, the statement is false, so it should grow, but then it is true again. Our natural language allows self-referencing, but is it really logically possible? (In the incompleteness theorem, Gödel numbering allows self-referencing using arithmetic.) There are several possible resolutions, such as: Pinocchio cannot say that statement, Pinocchio’s world is inconsistent (and hence cannot have physical reality attached to it), Pinocchio cannot know the truth value, and so on. In any case, a good logical system shouldn’t be able to produce such paradoxes.

Jonathan Pillow, continuing on the fairy tale theme, presented the Sleeping Beauty paradox. Toss a coin; Sleeping Beauty will be awakened once if it is heads, twice if it is tails. Every time she is awakened, she is asked “What is your belief that the coin was heads?”, then given a drug that erases the memory of this awakening, and she goes back to sleep. One argument (the “halfer” position) says that since the a priori belief was 1/2, and each awakening does not provide more evidence, her belief does not change and she would answer 1/2. The other argument (the “thirder” position) says that she is twice as likely to be awakened after a tails toss, hence the probability should be 1/3. If a certain reward were assigned to making a correct guess, the thirder position seems to be the correct probability to use for the guess, but do we necessarily have a matching belief? This paradox is still under debate and has not had a full resolution yet.
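The two numbers in the debate correspond to two different ways of counting, which a short simulation makes concrete. This is a sketch with a hypothetical function name: it reports the fraction of experiments in which the coin was heads, and the fraction of awakenings that happen after a heads toss.

```python
import random


def sleeping_beauty(n_experiments=100000, seed=0):
    """Return (fraction of experiments with heads, fraction of awakenings after heads)."""
    rng = random.Random(seed)
    heads_experiments = 0
    heads_awakenings = 0
    total_awakenings = 0
    for _ in range(n_experiments):
        if rng.random() < 0.5:  # heads: one awakening
            heads_experiments += 1
            heads_awakenings += 1
            total_awakenings += 1
        else:  # tails: two awakenings
            total_awakenings += 2
    return heads_experiments / n_experiments, heads_awakenings / total_awakenings


if __name__ == "__main__":
    per_experiment, per_awakening = sleeping_beauty()
    print(per_experiment, per_awakening)  # roughly 1/2 and 1/3
```

The simulation does not settle the paradox. It only shows that the halfer number is a per-experiment frequency and the thirder number is a per-awakening frequency; which of the two deserves to be called her belief is the open question.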

Kyle Mandli presented the classical Zeno’s paradox, where your intuition that an infinite sum of finite things must be infinite is challenged. He also showed Gabriel’s horn, a simple (infinite) object with finite volume but infinite surface area. Hence, if you were to pour paint into this horn, you would need only a finite amount of paint, but would never be able to paint the entire surface. (Hence its nickname: painter’s paradox)
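For the record, here is the short calculation behind Gabriel’s horn, using the standard construction: rotate the curve $y = 1/x$ for $x \geq 1$ about the x-axis. The volume integral converges, while the surface area is bounded below by a divergent integral:

$$V = \pi \int_1^\infty \frac{dx}{x^2} = \pi, \qquad A = 2\pi \int_1^\infty \frac{1}{x}\sqrt{1 + \frac{1}{x^4}}\,dx \;\geq\; 2\pi \int_1^\infty \frac{dx}{x} = \infty.$$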

Karin Knudson introduced the Banach-Tarski paradox, where one solid unit ball in 3D can be decomposed into 5 pieces which, using only translations and rotations, are reassembled into two solid unit balls. In general, if two uncountable sets A, B are bounded with non-empty interior in $R^n$ with $n \geq 3$, then one can find a finite decomposition such that each piece in A is congruent to the corresponding piece in B. It requires some group theory, non-measurable sets, and the axiom of choice (fortunately).

Harold Gutch told us about the Borel-Kolmogorov paradox. What is the conditional distribution on a great circle when points are uniformly distributed on the surface of a sphere? One argument says it should be uniform by symmetry. But a simple sampling scheme in polar coordinates shows that it should be proportional to the cosine of the angle. Basically, the lesson is: never take conditional probabilities on sets of measure zero (not to be confused with conditional densities). Furthermore, he told us about a formula to produce infinitely many paradoxes from Jaynes’ book (ch 15) based on ill-defined series convergences.
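The dependence on how you approach the measure-zero set can be seen numerically. The sketch below is my own illustration with hypothetical names: it draws uniform points on the unit sphere and approximates the great circle {y = 0} (which passes through the poles) in two ways, a thin wedge of longitudes and a thin band of constant width. It then measures how often a point lies within 30 degrees of the equatorial plane, as seen along that circle.

```python
import math
import random


def uniform_sphere_point(rng):
    """Uniform point on the unit sphere via a normalized 3D Gaussian."""
    while True:
        x, y, z = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        r = math.sqrt(x * x + y * y + z * z)
        if r > 0.0:
            return x / r, y / r, z / r


def conditional_fractions(n_samples=200000, eps=0.05, seed=0):
    """Fraction of points within 30 degrees of the equatorial plane, under two conditionings."""
    rng = random.Random(seed)
    wedge_total = wedge_near = 0
    band_total = band_near = 0
    for _ in range(n_samples):
        x, y, z = uniform_sphere_point(rng)
        # Angle along the circle {y = 0}: within 30 degrees of the x-axis direction.
        near_equator = abs(z) < 0.5 * math.hypot(x, z)
        if abs(y) < eps * math.hypot(x, y):  # thin wedge of longitudes around the circle
            wedge_total += 1
            wedge_near += near_equator
        if abs(y) < eps:  # thin band of constant width around the circle
            band_total += 1
            band_near += near_equator
    return wedge_near / wedge_total, band_near / band_total


if __name__ == "__main__":
    print(conditional_fractions())  # roughly 0.5 (cosine law) versus 1/3 (uniform)
```

The wedge gets narrower towards the poles, so it favors points near the equator and reproduces the cosine density; the constant-width band treats every part of the circle equally and gives the uniform answer. Both are limits onto the same circle, which is exactly the paradox.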

Andrew Tan presented Rosser’s extension of Gödel’s first incompleteness theorem with the statement $R$ that colloquially says “For every proof of me, there’s a shorter disproof.” For any consistent system T that contains PA (Peano axioms), there exists an $R_T$, which is neither provable nor disprovable within T. Also, by the second incompleteness theorem, the consistency of PA (“con(PA)”) implies $G_{PA}$, which together with Gödel’s first incompleteness theorem that $G_{PA}$ is neither provable nor disprovable within PA, implies that PA augmented with “con(PA)” or “not con(PA)” are both consistent. However, the latter is paradoxical, since it appears that a consistent system declares its own inconsistency, and the natural number system that we are familiar with is not a model for the system. But, it could be resolved by creating a non-standard model of arithmetic. References: [V. Gitman’s blog post and talk slides][S. Reid, arXiv 2013]

I had a wonderful time, and I really appreciate my friends for joining me in this event!

Recently, there was a press release and a youtube video from University of Florida about one of my recent papers on neural code in the lobster olfactory system, and it was also covered by others [e.g. 1, 2, 3, 4]. I decided to write a bit about it from my own perspective. In general, I am interested in understanding how neurons process and represent information in their output, through which they communicate with other neurons and collectively compute. In this paper, we show how a subset of olfactory neurons can be used like a stop watch to measure temporal patterns of smell.

Unlike vision and audition, the olfactory world is perceived through a filament of odor plume riding on top of complex and chaotic turbulence. Therefore, you are not going to be in constant contact with the odor (say the scent of freshly baked chocolate chip cookies) while you search for the source (cookies!). Depending on the air flow, you might not even smell it at all for long periods of time, even if the target is nearby. Dogs are well known to be good at this task, and so are many other animals. We study lobsters. Lobsters heavily rely on olfaction to track, avoid, and detect odor sources such as other lobsters, predators, and food. Therefore, it is important for them to constantly analyze olfactory sensory information to put together an olfactory scene. In the auditory system, the minuscule temporal difference between sounds arriving at each of your ears is a critical cue for estimating the direction of the sound source. Similarly, one critical component for olfactory scene analysis is the temporal structure of the odor pattern. Therefore, we wanted to find out how neurons encode and process this information.

The neurons we study are a subtype of olfactory sensory neurons. Sensory neurons detect signals and encode them into a temporal pattern of activity, so that it can be processed by downstream neurons. Thus, it was very surprising when we (Dr. Yuriy Bobkov) found that those neurons were spontaneously generating signals, in the form of regular bursts of action potentials, even in the absence of odor stimuli [Bobkov & Ache 2007]. We were wondering why a sensory system would generate its own signal, because the downstream neurons would not know if the signals sent by these neurons are caused by external odor stimuli (smell), or are spontaneously generated. However, we realized that they can work like little clocks. When external odor molecules stimulate the neuron, it sends a signal in a time-dependent manner. Each neuron is too noisy to be a precise clock, but there is a whole population of these neurons, such that together they can measure the temporal aspects critical for the olfactory scene analysis. The temporal aspects, combined with other cues such as local flow information and navigation history, in turn can be used to track targets and estimate distances to sources. Furthermore, this temporal memory was previously believed to be formed in the brain, but our results suggest a simple yet effective mechanism in the very front end, the sensors themselves.

Applications: Currently electronic nose technology is mostly focused on discriminating ‘what’ the odor is. We bring to the table how animals might use the ‘when’ information to reconstruct the ‘where’ information, putting together an olfactory scene. Perhaps it could inspire novel search strategies for odor tracking robots. Another possibility is to build neuromorphic chips that emulate artificial neurons using the same principle to encode temporal patterns into instantaneously accessible information. This could be a part of a low-power sensory processing unit in a robot. The principles we found are likely not limited to lobsters and could be shared by other animals and sensory modalities.

References:

• Bobkov, Y. V. and Ache, B. W. (2007). Intrinsically bursting olfactory receptor neurons. J Neurophysiol, 97(2):1052-1057.
• Park, I. M., Bobkov, Y. V., Ache, B. W., and Príncipe, J. C. (2014). Intermittency coding in the primary olfactory system: A neural substrate for olfactory scene analysis. The Journal of Neuroscience, 34(3):941-952. [pdf]

Evan and I wrote a summary of the COSYNE 2014 workshop we organized!

Originally posted on Scalable models for high-dimensional neural data:

[ This blog post is collaboratively written by Evan and Memming ]
The Scalable Models workshop was a remarkable success! It attracted a huge crowd from the wee morning hours till the 7:30 pm close of the day. We attracted so much attention that we had to relocate from our original (tiny) allotted room (Superior A) to a (huge) lobby area (Golden Cliff). The talks offered both philosophical perspectives and methodological aspects, reflecting diverse viewpoints and approaches to high-dimensional neural data. Many of the discussions continued the next day in our sister workshop. Here we summarize each talk:

### Konrad Körding – Big datasets of spike data: why it is coming and why it is useful

Konrad started off the workshop by posing some philosophical questions about how big data might change the way we do science. He argued that neuroscience is rife with theories (for instance, how uncertainty is…


Shannon’s entropy is the fundamental building block of information theory, a theory of communication, compression, and randomness. Entropy has a very simple definition, $H = - \sum_i p_i \log_2(p_i)$, where $p_i$ is the probability of the i-th symbol. However, estimating entropy from observations is surprisingly difficult, and is still an active area of research. Typically, one does not have enough samples compared to the number of possible symbols (the so-called “undersampled regime”). Moreover, there’s no unbiased estimator [Paninski 2003], and the convergence rate of a consistent estimator could be arbitrarily slow [Antos and Kontoyiannis, 2001]. There are many estimators that aim to overcome these difficulties to some degree. Deciding which estimator to use can be overwhelming, so here’s my recommendation in the form of a flow chart:

[Figure: flow chart for choosing an entropy estimator]
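Before going through the chart, here is a small illustration of why the undersampled regime is hard. It is a sketch of the naive “plug-in” estimator (plug the empirical frequencies into the definition of $H$), which is not one of the estimators recommended here; the function names and the toy setup (a uniform distribution over 1000 symbols observed 100 times) are my own choices.

```python
import math
import random
from collections import Counter


def plugin_entropy_bits(samples):
    """Entropy (in bits) of the empirical distribution of the samples."""
    n = len(samples)
    counts = Counter(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def undersampled_demo(n_symbols=1000, n_samples=100, seed=0):
    """Return (true entropy, plug-in estimate) for a uniform distribution."""
    rng = random.Random(seed)
    samples = [rng.randrange(n_symbols) for _ in range(n_samples)]
    return math.log2(n_symbols), plugin_entropy_bits(samples)


if __name__ == "__main__":
    true_h, estimate = undersampled_demo()
    print(true_h, estimate)  # the estimate falls well short of the truth
```

With N samples the plug-in estimate can never exceed $\log_2 N$ bits, so with 100 samples it is capped below 6.65 bits while the true entropy is almost 10 bits. This severe downward bias is what the estimators below try to correct.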

Let me explain one by one. First of all, if you have continuous (analogue) observation, read the title of this post. CDM, PYM, DPM, NSB are Bayesian estimators, meaning that they have explicit probabilistic assumptions. Those estimators provide posterior distributions or credible intervals as well as point estimates of entropy. Note that the assumptions made by these estimators do not have to be valid to make them good entropy estimators. In fact, even if they are in the wrong class, these estimators are consistent, and often give reasonable answers even in the undersampled regime.

Nemenman-Shafee-Bialek (NSB) uses a mixture of Dirichlet priors to obtain an approximately uninformative implied prior on entropy. This reduces the bias of the estimator significantly in the undersampled regime, because a priori, the distribution could have any entropy.

Centered Dirichlet mixture (CDM) is a Bayesian estimator with a special prior designed for binary observations. It comes in two flavors, depending on whether your observations are close to independent (DBer) or the total number of 1’s is a good summary statistic (DSyn).

Pitman-Yor mixture (PYM) and Dirichlet process mixture (DPM) are for an infinite or unknown number of symbols. In many cases, natural data have a vast number of possible symbols, as in the case of species samples or language, and have power-law (or scale-free) distributions. Power-law distributions can hide a lot of entropy in their tails, in which case PYM is recommended. If you expect exponentially decaying tail probabilities when sorted, then DPM is appropriate. See my previous post for more.

Non-Bayesian estimators come in many different flavors:

Best upper bound (BUB) estimator is a bias correction method which bounds the maximum error in entropy estimation.

Coverage-adjusted estimator (CAE) uses the Good-Turing estimator for the “coverage” (1 – unobserved probability mass), and uses a Horvitz-Thompson estimator for entropy in combination.

James-Stein (JS) estimator regularizes entropy by estimating a mixture of uniform distribution and the empirical histogram with the James-Stein shrinkage. The main advantage of JS is that it also produces an estimate of the distribution.

Unseen estimator uses a Poissonization of the fingerprint and linear programming to find the likely underlying fingerprint, and uses its entropy as the estimate.

Other notable estimators include (1) a bias correction method by Panzeri & Treves (1996) which has been popular for a long time, (2) the Grassberger estimator, and (3) an asymptotic expansion of NSB that only works in the extremely undersampled regime and is inconsistent [Nemenman 2011]. These methods are faster than the others, if you need speed.
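To make one of the non-Bayesian estimators above concrete, here is a minimal sketch of the coverage-adjusted estimator in the form I understand from Chao & Shen (2003): estimate the coverage as 1 minus the fraction of samples that are singletons (Good-Turing), shrink the empirical probabilities by that coverage, and divide each term by its inclusion probability (Horvitz-Thompson). The function name is hypothetical, and the handling of the all-singletons case is my own assumption rather than part of the published method.

```python
import math


def coverage_adjusted_entropy_bits(counts):
    """Coverage-adjusted entropy estimate (in bits) from a list of symbol counts."""
    n = sum(counts)
    singletons = sum(1 for c in counts if c == 1)
    if singletons == n:
        # Assumption: avoid zero estimated coverage when every symbol was seen once.
        singletons = n - 1
    coverage = 1.0 - singletons / n  # Good-Turing estimate of the observed probability mass
    estimate = 0.0
    for c in counts:
        if c == 0:
            continue
        p = coverage * c / n  # coverage-adjusted probability
        inclusion = 1.0 - (1.0 - p) ** n  # probability that the symbol is seen at all
        estimate -= p * math.log2(p) / inclusion
    return estimate


if __name__ == "__main__":
    # Two symbols seen 5 times each: no singletons, so the coverage estimate is 1.
    print(coverage_adjusted_entropy_bits([5, 5]))
    # Many singletons: the estimate is pushed above the plug-in value.
    print(coverage_adjusted_entropy_bits([3, 2, 1, 1, 1, 1, 1]))
```

When there are no singletons the estimate is only slightly above the plug-in value, while a sample dominated by singletons signals a lot of unseen probability mass and the correction becomes large.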

There are many software packages available out there. Our estimators CDMentropy and PYMentropy are implemented for MATLAB with a BSD license (by now you surely noticed that this is a shameless self-promotion!). For R, some of these estimators are implemented in a package called entropy (on CRAN; written by the authors of the JS estimator). There’s also a python package called pyentropy. Targeting a more neuroscience-specific audience, the Spike Train Analysis Toolkit contains a few estimators implemented in MATLAB/C.

References

• A. Antos and I. Kontoyiannis. Convergence properties of functional estimates for discrete distributions. Random Structures & Algorithms, 19(3-4):163–193, 2001.
• E. Archer*, I. M. Park*, and J. Pillow. Bayesian estimation of discrete entropy with mixtures of stick-breaking priors. In P. Bartlett, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 2024–2032. MIT Press, Cambridge, MA, 2012. [PYMentropy]
• E. Archer*, I. M. Park*, J. Pillow. Bayesian Entropy Estimation for Countable Discrete Distributions. arXiv:1302.0328, 2013. [PYMentropy]
• E. Archer, I. M. Park, and J. Pillow. Bayesian entropy estimation for binary spike train data using parametric prior knowledge. In C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, 2013. [CDMentropy]
• A. Chao and T. Shen. Nonparametric estimation of Shannon’s index of diversity when there are unseen species in sample. Environmental and Ecological Statistics, 10(4):429–443, 2003. [CAE]
• P. Grassberger. Estimating the information content of symbol sequences and efficient codes. Information Theory, IEEE Transactions on, 35(3):669–675, 1989.
• J. Hausser and K. Strimmer. Entropy inference and the James-Stein estimator, with application to nonlinear gene association networks. The Journal of Machine Learning Research, 10:1469–1484, 2009. [JS]
• I. Nemenman. Coincidences and estimation of entropies of random variables with large cardinalities. Entropy, 13(12):2013–2023, 2011. [Asymptotic NSB]
• I. Nemenman, F. Shafee, and W. Bialek. Entropy and inference, revisited. In Advances in Neural Information Processing Systems 14, pages 471–478. MIT Press, Cambridge, MA, 2002. [NSB]
• I. Nemenman, W. Bialek, and R. Van Steveninck. Entropy and information in neural spike trains: Progress on the sampling problem. Physical Review E, 69(5):056111, 2004. [NSB]
• L. Paninski. Estimation of entropy and mutual information. Neural Computation, 15:1191–1253, 2003. [BUB]
• S. Panzeri and A. Treves. Analytical estimates of limited sampling biases in different information measures. Network: Computation in Neural Systems, 7:87–107, 1996.
• P. Valiant and G. Valiant. Estimating the Unseen: Improved Estimators for Entropy and other Properties. In Advances in Neural Information Processing Systems 26, pp. 2157-2165, 2013. [UNSEEN]
• V. Q. Vu, B. Yu, and R. E. Kass. Coverage-adjusted entropy estimation. Statistics in medicine, 26 (21):4039–4060, 2007. [CAE]