NIPS 2012 (proceedings) was held in Lake Tahoe, right next to the state line between California and Nevada. Despite the casinos all around the area, it was a great conference: there was a lot to learn and a lot of people to meet. My keywords for NIPS 2012 are deep learning, spectral learning, nonparanormal distribution, nonparametric Bayesian, negative binomial, graphical models, rank, and MDP/POMDP. Below are my notes on the topics that interested me. Also check out these great blog posts about the event by Dirk Van den Poel (@dirkvandenpoel), Yisong Yue (@yisongyue), John Moeller, Evan Archer, and Hal Daume III.

### Monday

Optimal kernel choice for large-scale two-sample tests
A. Gretton, B. Sriperumbudur, D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil, K. Fukumizu

This is an improvement over the maximum mean discrepancy (MMD), a divergence statistic for hypothesis testing using reproducing kernel Hilbert spaces. The statistical power of the test depends on the choice of kernel, and it was previously shown that taking the max value over multiple kernels still results in a divergence. Here they linearly combine kernels to maximize the statistical power in linear time, using a normal approximation of the test statistic. The disadvantage is that it requires more data for cross-validation.

Efficient coding provides a direct link between prior and likelihood in perceptual Bayesian inference
Xue-Xin Wei, Alan Stocker

Several biases observed in psychophysics show repulsion from the mode of the prior, which seems counterintuitive if we assume the brain is performing Bayesian inference. They show that this could be due to asymmetric likelihood functions that originate from the efficient coding principle. The tuning curves, and hence the likelihood functions, under the efficient coding hypothesis are constrained by the prior, reducing the degrees of freedom for the Bayesian interpretation of perception. They show that an asymmetric likelihood can arise under a wide range of circumstances, and claim that repulsive bias should be observed. They also predict that additive noise in the stimulus should decrease this effect.

Spiking and saturating dendrites differentially expand single neuron computation capacity
Romain Cazé, M. Humphries, B. Gutkin

Romain showed that boolean functions can be implemented by active dendrites. Neurons that generate dendritic spikes can be considered as a collection of AND gates, hence disjunctive normal form (DNF) can be directly implemented using the threshold in soma as the final stage. Similarly, saturating dendrites (inhibitory neurons) can be treated as OR gates, thus CNF can be implemented.
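As a toy illustration of the DNF idea (my own abstraction, not the model from the paper), each spiking dendritic branch below is a hypothetical AND gate over the inputs wired to it, and the soma thresholds the number of dendritic spikes. The function names and the example wiring are assumptions made for the sketch.

```python
from itertools import product

def dendrite_and(inputs, synapses):
    """Spiking branch as an AND gate: fires only if all of its inputs are active."""
    return all(inputs[i] == 1 for i in synapses)

def neuron_dnf(inputs, branches, soma_threshold=1):
    """Soma counts dendritic spikes; with threshold 1 it acts as an OR over branches."""
    n_spikes = sum(dendrite_and(inputs, branch) for branch in branches)
    return n_spikes >= soma_threshold

# (x0 AND x1) OR (x2 AND x3): two branches, each wired to two inputs.
branches = [(0, 1), (2, 3)]
truth_table = {x: neuron_dnf(x, branches) for x in product((0, 1), repeat=4)}
```

In this sketch, `truth_table[(1, 1, 0, 0)]` is true because the first branch fires, while `(1, 0, 1, 0)` activates one input per branch and stays below threshold on both.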

Coding efficiency and detectability of rate fluctuations with non-Poisson neuronal firing
Shinsuke Koyama

Hypothesis testing of whether the rate is constant or not for a renewal neuron can be done by decoding the rate from spike trains using empirical Bayes (EB). If the hyperparameter for the roughness is inferred to be zero by EB, the null hypothesis is accepted. Shinsuke derived a theoretical condition for the rejection based on the KL-divergence.

The coloured noise expansion and parameter estimation of diffusion processes
Simon Lyons, Amos Storkey, Simo Sarkka

For a continuous analogue of a nonlinear ARMA model, estimating parameters for stochastic differential equations is difficult. They approach it by using a truncated smooth basis expansion of the white noise process. The resulting colored noise is used for an MCMC sampling scheme.

Bayesian estimation of discrete entropy with mixtures of stick-breaking priors
Evan Archer*, Il Memming Park*, Jonathan W. Pillow (*equally contributed, equally presented)

### Tuesday

Diffusion decision making for adaptive k-Nearest Neighbor Classification
Yung-Kyun Noh, F. C. Park, Daniel D. Lee

The authors establish an interesting connection between the sequential probability ratio test (Wald test) for homogeneous Poisson processes with two different rates and k-nearest neighbor (k-NN) classification. The main assumption is that each class density is smooth, so that in the limit of large samples the distribution of nearest neighbors follows a (spatial) Poisson process. Using this connection, they propose several adaptive k-NN strategies motivated by the Wald test.
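For reference, here is a minimal sketch of the textbook Wald test between two Poisson rates. It is not the adaptive k-NN rule from the paper. The function name, the unit-length observation windows, and the error levels `alpha` and `beta` are assumptions for illustration.

```python
import math

def sprt_poisson(counts, rate0, rate1, alpha=0.05, beta=0.05):
    """Wald's SPRT for H0: rate0 vs H1: rate1, given Poisson counts per unit window."""
    upper = math.log((1 - beta) / alpha)   # crossing it accepts H1
    lower = math.log(beta / (1 - alpha))   # crossing it accepts H0
    llr = 0.0
    for n_used, k in enumerate(counts, start=1):
        # Log-likelihood ratio increment for one Poisson count k.
        llr += k * math.log(rate1 / rate0) - (rate1 - rate0)
        if llr >= upper:
            return "H1", n_used
        if llr <= lower:
            return "H0", n_used
    return "undecided", len(counts)
```

The test stops as soon as the accumulated evidence crosses either threshold, so the number of observations used adapts to how clear the data are, which is the property the adaptive k-NN strategies borrow.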

TCA: High dimensional principal component analysis for non-gaussian data
F. Han, H. Liu

Using an elliptical copula model (extending the nonparanormal), the eigenvectors of the covariance of the copula variables can be estimated from Kendall’s tau statistic which is invariant to the nonlinearity of the elliptical distribution and the transformation of the marginals. This estimator achieves close to the parametric convergence rate while being a semi-parametric model.

Classification with Deep Invariant Scattering Networks (invited)
Stephane Mallat

How can we obtain a stable, informative, invariant representation? To obtain an invariant representation with respect to a group (such as translation, rotation, scaling, and deformation), one can directly apply a group convolution to each sample. He proposed an interpretation of deep convolutional networks as learning the invariant representation, and a more direct approach when the invariance of interest is known, which is to use group invariant scattering (hierarchical wavelet decomposition). Scattering is contractive, preserves the norm, and is stable under deformation, hence it generates a good representation for the final discriminative layer. He hypothesized that the stable parts (which lack theoretical invariance) can be learned in a deep convolutional network through sparsity.

Spectral learning of linear dynamics from generalised-linear observations with application to neural population data
L. Buesing, J. Macke, M. Sahani

The Ho-Kalman algorithm is applied to Poisson observations with a canonical link function, and the parameters are then estimated through moment matching. This is a simple and effective initializer for EM, which tends to be slow and prone to local optima.

Spectral learning of general weighted automata via constrained matrix completion
B. Balle, M. Mohri

A parametric function from strings to reals known as a rational power series, or equivalently a weighted finite automaton, is estimated with a spectral method. Since the Hankel matrix of prefix-suffix values is structured, a constrained optimization is applied to complete it from data. How to choose the rows and columns of the Hankel matrix remains a difficult problem.

Discriminative learning of Sum-Product Networks
R. Gens, P. Domingos

The sum-product network (SPN) is a nice abstraction of a hierarchical mixture model, and it provides simple and tractable inference rules. In an SPN, all marginals are computable in linear time. Here, discriminative learning algorithms for SPNs are given. The hard inference variant takes the most probable state, and can overcome gradient dilution.

Perfect dimensionality recovery by Variational Bayesian PCA
S. Nakajima, R. Tomioka, M. Sugiyama, S. Babacan

Previous Bayesian PCA algorithms use the empirical Bayes procedure for sparsification; however, this may not be exact inference for recovering the dimensionality. Using random matrix theory, they provide a condition under which the dimension recovered by variational Bayesian inference is exact.

Fully bayesian inference for neural models with negative-binomial spiking
J. Pillow, J. Scott

Pillow & Scott’s negative binomial spiking poster presented by Memming (not an author), opposite side (left) was Mijung Park presenting her poster.

### Wednesday

Graphical models via generalized linear models
Eunho Yang, Genevera I. Allen, Pradeep Ravikumar, Zhandong Liu

Eunho introduced a family of graphical models with GLM marginals and Ising model style pairwise interaction. He said the Poisson-Markov-Random-Fields version must have negative coupling, otherwise the log partition function blows up. He showed conditions for which the graph structure can be recovered with high probability in this family.

No voodoo here! learning discrete graphical models via inverse covariance estimation
Po-Ling Loh, Martin Wainwright

I think Po-Ling gave the best oral presentation. For any graph with no loop, zeros in the inverse covariance matrix correspond to the absence of conditional dependence. In general, conditional dependencies could theoretically be recovered by triangulating the graph, but the practical cost is high. In practice, graphical lasso is a pretty good way of recovering the graph structure, especially for certain discrete distributions (e.g. Ising model).

Augment-and-Conquer Negative Binomial Processes
M. Zhou, L. Carin

A Poisson process over a gamma process measure is related to the Dirichlet process (DP) and the Chinese restaurant process (CRP). The negative binomial (NB) distribution has an alternative (i.e., not gamma-Poisson) augmented representation as a Poisson number of logarithmic random variables, which can be used to construct the Gamma-NB process. I do not fully understand the math, but it seems like this paper contains gems.
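The compound Poisson representation of the NB distribution can be checked numerically. The sketch below is my own illustration, not code from the paper. It assumes the parametrization in which NB(r, p) has mean r p / (1 - p), draws a Poisson(-r log(1 - p)) number of logarithmic(p) variables, and sums them. The helper names are hypothetical.

```python
import math
import random

def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for small lam."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def sample_logarithmic(p, rng):
    """Inverse-CDF sampling of P(k) = -p**k / (k * log(1 - p)), k = 1, 2, ..."""
    u, k = rng.random(), 1
    pmf = -p / math.log(1 - p)
    cdf = pmf
    while u > cdf and pmf > 0.0:
        pmf *= p * k / (k + 1)   # ratio P(k + 1) / P(k)
        k += 1
        cdf += pmf
    return k

def sample_nb_compound(r, p, rng):
    """Sum of a Poisson number of logarithmic variables."""
    n_terms = sample_poisson(-r * math.log(1 - p), rng)
    return sum(sample_logarithmic(p, rng) for _ in range(n_terms))

rng = random.Random(0)
samples = [sample_nb_compound(2.0, 0.5, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

For r = 2 and p = 0.5 the NB mean is r p / (1 - p) = 2 and the variance is r p / (1 - p)^2 = 4, so `mean` and `var` should land close to those values.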

Optimal Neural Tuning Curves for Arbitrary Stimulus Distributions: Discrimax, Infomax and Minimum Lp Loss
Zhuo Wang, Alan A. Stocker, Daniel D. Lee

Under different loss functions in the Lp family, the optimal tuning curve of a rate-limited Poisson neuron changes. Zhuo showed that as p goes to zero, the optimal tuning curve converges to the one that maximizes information. The derivations assume no input noise and a single neuron. [edit: we did a lab meeting about this paper]

Bayesian nonparametric models for ranked data
F. Caron, Y. Teh

Assuming observed partially ranked objects (e.g., top 10 books) have positive real-valued hidden strength, and assuming a size-biased ranking, they derive a simple inference scheme by introducing an auxiliary exponential variable.
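To see why an exponential auxiliary variable is natural here, recall the standard exponential race: if item i gets an arrival time drawn from an exponential distribution with rate equal to its strength, then sorting by arrival time yields a size-biased ranking. The sketch below illustrates only that generative fact, not the paper's inference scheme. The item names and strengths are made up.

```python
import random

def sample_ranking(strengths, rng, top=None):
    """Size-biased ranking via an exponential race; `top` truncates to a partial ranking."""
    arrival = {item: rng.expovariate(w) for item, w in strengths.items()}
    order = sorted(arrival, key=arrival.get)   # earliest arrival ranks first
    return order[:top]

rng = random.Random(0)
strengths = {"a": 10.0, "b": 1.0, "c": 1.0}   # hypothetical strengths
n_trials = 20000
wins = sum(sample_ranking(strengths, rng, top=1)[0] == "a" for _ in range(n_trials))
frac_a_first = wins / n_trials
```

The first-ranked item is "a" with probability 10 / (10 + 1 + 1), about 0.83, which `frac_a_first` should approximate.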

### Thursday

Efficient and direct estimation of a neural subunit model for sensory coding
Brett Vintch, Andrew D. Zaharia, J. Anthony Movshon, Eero P. Simoncelli

We already discussed this nice paper in our journal club. They fit a special LNLN model that assumes a single (per channel) convolutional kernel shifted (and weighted) in space. Brett said the convolutional STC initialization described in the paper works well even when the STC itself looks like noise.

Efficient Spike-Coding with Multiplicative Adaptation in a Spike Response Model
Sander M. Bohte

A multiplicative spike response model is proposed and fit with a fixed post-spike filter shape, an LNP-based receptive field, and a grid search over the parameter space (3D?). This model reproduces the experimentally observed adaptation to amplitude modulation and to variance modulation. The multiplicative dynamics must have a power-law decay that is close to 1/t, and it somehow restricts the firing rate of the neuron (Fig 2b).

Dropout: A simple and effective way to improve neural networks (invited, replacement)
Geoffrey Hinton, George Dahl

Dropout is a technique to randomly omit units in an artificial neural network to reduce overfitting. Hinton says dropout method is an efficient way of model averaging exponentially many models. It reduces overfitting because hidden units can’t depend on each other reliably. Related paper is on the arXiv.
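A minimal sketch of the idea, assuming a keep probability of 0.5 and plain Python lists as activations (both are my choices for illustration): during training each unit is kept or zeroed at random, and at test time all units are kept but scaled by the keep probability, which matches the training-time average.

```python
import random

def dropout_train(activations, keep_prob, rng):
    """Training: independently zero each unit with probability 1 - keep_prob."""
    return [a if rng.random() < keep_prob else 0.0 for a in activations]

def dropout_test(activations, keep_prob):
    """Test: keep every unit, scaled by keep_prob (the 'mean network')."""
    return [a * keep_prob for a in activations]

rng = random.Random(0)
acts = [1.0, 2.0, 4.0]
n_draws = 20000
totals = [0.0] * len(acts)
for _ in range(n_draws):
    for i, a in enumerate(dropout_train(acts, 0.5, rng)):
        totals[i] += a
averaged = [t / n_draws for t in totals]   # Monte Carlo average over thinned networks
mean_net = dropout_test(acts, 0.5)
```

Here `averaged` should be close to `mean_net`, which is the sense in which a single scaled network stands in for the average over exponentially many thinned ones, at least for a single linear layer.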

Compressive neural representation of sparse, high-dimensional probabilities
Xaq Pitkow

Naively representing probability distributions is inefficient, since it takes exponentially growing resources. Using ideas from compressed sensing, Xaq shows that random perceptron units can be used to represent a sparse high-dimensional probability distribution efficiently. The question is what kinds of operations on this representation are biologically plausible and useful.

The topographic unsupervised learning of natural sounds in the auditory cortex

Visual cortex is much more retinotopic than auditory cortex is tonotopic. Unlike natural images, natural auditory stimuli have harmonics that give rise to correlations in the frequency domain. Could both primary sensory cortices have the same principle for topographic learning rules but form different patterns because of differences in the input statistics? The authors’ model is consistent with the hypothesis, and moreover captures the nonlinear response to the pitch perception problem.

This concludes my 3rd NIPS (NIPS 2011, NIPS 2010)!

Suppose you mix two Gaussian random variables $\mathcal{N}(-1, 1)$ and $\mathcal{N}(1, 1)$ equally, that is, if one samples from the mixture, with probability 1/2 it comes from the first Gaussian, and with probability 1/2 from the second. It is evident that the mixture of Gaussians is not a Gaussian. (Do not confuse this with adding two Gaussian random variables, which produces another Gaussian random variable.)
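A quick numerical check of the distinction, assuming the two components have means -1 and +1 and unit variance: the kurtosis of an equal mixture is 2.5, whereas the sum of the two variables is Gaussian with kurtosis 3. The sample size and seed below are arbitrary choices.

```python
import random

def kurtosis(xs):
    """Fourth standardized moment; equals 3 for any Gaussian."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / m2 ** 2

rng = random.Random(0)
n = 100_000
# Mixture: pick one component at random, then sample from it.
mixture = [rng.gauss(rng.choice((-1.0, 1.0)), 1.0) for _ in range(n)]
# Sum: sample from both components and add the results.
summed = [rng.gauss(-1.0, 1.0) + rng.gauss(1.0, 1.0) for _ in range(n)]
kurt_mixture, kurt_summed = kurtosis(mixture), kurtosis(summed)
```

The value 2.5 follows from the mixture's variance of 2 and fourth moment of 10 (each component contributes $\mu^4 + 6\mu^2 + 3$ with $\mu = \pm 1$).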

Similarly, a mixture of inhomogeneous Poisson processes results in a non-Poisson point process. The figure below illustrates the difference between a mixture of two Poisson processes (B) and a Poisson process with the same marginal intensity (rate) function (A). The colored bars indicate the rate over the real line (e.g. time); in this case each has a constant rate over a fixed interval. The 4 realizations from each of processes A and B are represented by rows of vertical ticks.
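The difference can also be seen in simulation. The rates and intervals below are assumptions chosen for illustration, not taken from the figure: process B picks, with equal probability, a rate-10 process on [0, 1) or a rate-10 process on [1, 2), while process A is a single Poisson process with the matching marginal rate of 5 on [0, 2). Counting events in [0, 1) gives a Fano factor near 1 for A and near 6 for B.

```python
import random

def poisson_process(rate, start, stop, rng):
    """Homogeneous Poisson process on [start, stop) via exponential inter-event times."""
    times, t = [], start + rng.expovariate(rate)
    while t < stop:
        times.append(t)
        t += rng.expovariate(rate)
    return times

def sample_a(rng):
    return poisson_process(5.0, 0.0, 2.0, rng)

def sample_b(rng):
    if rng.random() < 0.5:
        return poisson_process(10.0, 0.0, 1.0, rng)
    return poisson_process(10.0, 1.0, 2.0, rng)

def fano_first_half(sampler, n_trials, rng):
    """Variance-to-mean ratio of the event count in [0, 1)."""
    counts = [sum(t < 1.0 for t in sampler(rng)) for _ in range(n_trials)]
    mean = sum(counts) / n_trials
    var = sum((c - mean) ** 2 for c in counts) / n_trials
    return var / mean

rng = random.Random(0)
fano_a = fano_first_half(sample_a, 20000, rng)
fano_b = fano_first_half(sample_b, 20000, rng)
```

Both processes have the same mean count of 5 in [0, 1), but for B the count is either Poisson(10) or exactly zero, so its variance is 5 + 25 = 30.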

Several special cases of mixed Poisson processes are studied [1], however, they are mostly limited to modeling over-dispersed homogeneous processes. In theoretical neuroscience, it is necessary to mix arbitrary (inhomogeneous) point processes. For example, to maximize the mutual information between the input spike trains and the output spike train of a neuron model, the entropy of a mixture of point processes is needed.

In general, a regular point process on the real line can be completely described by the conditional intensity function $\lambda(t|\mathcal{H}_t)$ where $\mathcal{H}_t$ is the full spiking history up to time $t$ [2]. Let us take the discrete limit to form regular point processes. Let $\rho_k$ be the probability of a spike (an event) in the $k$-th bin of size $\Delta$, that is,

$\rho_k \simeq \lambda(k \Delta|y_{1:k-1}) \Delta,$

where $y_{1:k-1}$ are the 0-1 responses in all the previous bins. The likelihood of observing $y_k = 0$ or $y_k = 1$, given the history is simply,

$P(y_k|y_{1:k-1}, \lambda) = {\rho_k}^{y_k} \left(1 - \rho_k\right)^{1 - y_k}.$

In the limit of small $\Delta$, this approximation converges to a regular point process. A fun fact is that a mixture of Bernoulli random variables is Bernoulli again, since it’s the only distribution for 0-1-valued random variables. Specifically, for a family of Bernoulli random variables with probability of 1 being $\rho_z$ indexed by $z$, and a mixing distribution $P(z)$, the probability of observing one symbol $y=0$ or $y=1$ is

$P(y) = \int P(y|z)P(z) \mathrm{d}z = \int {\rho_z}^{y} \left(1 - \rho_z\right)^{1 - y} P(z) \mathrm{d}z = {\bar\rho}^{y} \left(1 - \bar\rho\right)^{1 - y}$

where $\bar\rho = \int \rho_z P(z) \mathrm{d}z$ is the average probability.

Suppose we mix $\lambda(t|\mathcal{H}_t, z)$ with $P(z)$. Then, similarly, for the binned point process representation, the above implies that

$P(y_k|y_{1:k-1},\lambda) = \int P(y_k|y_{1:k-1},\lambda, z) P(z) \mathrm{d}z = {\bar\rho}_k^{y_k} \left(1 - \bar\rho_k \right)^{1 - y_k}$

where $\bar\rho_k = \int \rho_k(z) P(z) \mathrm{d}z$ is the marginal rate, with $\rho_k(z) \simeq \lambda(k \Delta|y_{1:k-1}, z) \Delta$. Moreover, due to the causal dependence between the $y_k$’s, we can chain the expansion and get the marginal probability of observing $y_{1:k}$,

$P(y_{1:k}) = P(y_k|y_{1:k-1}) P(y_{1:k-1}) = P(y_k|y_{1:k-1}) P(y_{k-1}|y_{1:k-2}) \cdots P(y_1)$

$= \prod_{i=1}^k {\bar\rho}_i^{y_i} \left(1-\bar\rho_i\right)^{1-y_i}.$

Therefore, in the limit the mixture point process is represented by the conditional intensity function,

$\lambda(t|\mathcal{H}_t) = \int \lambda(t|\mathcal{H}_t, z) P(z) \mathrm{d}z$.

Conclusion: The conditional intensity function of a mixture of point processes is given by the expected conditional intensity function over the mixing distribution.

References

1. Grandell. Mixed Poisson processes. Chapman & Hall / CRC Press 1997
2. Daley, Vere-Jones. An Introduction to the Theory of Point Processes. Springer.
3. Taro Toyoizumi, Jean-Pascal Pfister, Kazuyuki Aihara, Wulfram Gerstner. Generalized Bienenstock–Cooper–Munro rule for spiking neurons that maximizes information transmission. PNAS, 2005. doi:10.1073/pnas.0500495102

This was my first time at CNS (computational neuroscience conference, not to be confused with the cognitive neuroscience one with the same acronym). I was invited to give a talk at the “Examining the dynamic nature of neural representations with the olfactory system” workshop organized by Chris Buckley, Thomas Nowotny, and Taro Toyoizumi. I presented my story on how bursting olfactory receptor neurons can form an instantaneous memory about the temporal structure of odor plume encounters, along with a bit of a related Calcium imaging study. Below is my summary of the workshop talks I went to (system identification workshop and information theory workshop on the first day, and olfactory workshop on the second day).

Garrett Stanley talked about system identification of the rat barrel cortex response from whisker deflection. He started by criticizing the white-noise Volterra series approach; it requires too much data. Instead, by designing a sequence of parametric stimuli that will directly show 2nd order and 3rd order interactions, he could fit a parametric form of firing rate response with good predictive powers [1]. As far as I can tell, it seemed like a rank-1 approximation of the 3rd order Volterra kernel. However, this model was lacking the fine-temporal latency, as well as stimulus intensity dependent bimodal responses, which was later fixed by a better model with feedback [2].

Vladimir Brezina talked about modeling of feedback from muscle contractions onto a rhythmic central pattern generator in the crab heart. He used LNL and LN models to fit the response of 9 neurons and muscles in the crab heart. For the LNL system, he used a bilinear optimization of the squared error. However, for the spiking response of the LN model, instead of using the Bernoulli or Poisson likelihood (the GLM model), he used least squares to fit the parameters.

Matthieu Louis gave a talk about optogenetically controlling drosophila larva’s olfactory sensory neurons. They built an impressive closed loop system that can control the larva’s behavior as if it were in an odor gradient. They modeled the system as a black box with odor as input and behavior as output, skipping the model of the nervous system, and successfully predicted and controlled the behavior [3].

Daniel Coca talked about how fly photoreceptors can act as a nonlinear temporal filter that is optimized for detecting edges. He fit a NARMAX (nonlinear ARMA-X) model and analyzed it in the frequency domain and found that the phase response is consistent with phase congruency detection model for edge detection. Also, he explained how the system “linearizes” when stimulated with white Gaussian noise, although I couldn’t follow the details due to my lack of knowledge in nonlinear frequency domain analysis.

Tatyana Sharpee talked about sphere packing in the context of receptive fields of the retina, and conditional population firing rates in song birds. For the receptive fields, she showed that when maximizing the mutual information per unit lattice between a point source of light and the (binary) neural response of ganglion cells, elliptical receptive field shapes can help if the lattice is imperfect. For the song bird case, she showed that the noise correlation can change with training to improve the separation (classification performance) of the conditional distributions, while the irrelevant stimuli become less separable.

Rava Azeredo da Silveira talked about how a finely tuned correlation structure can immensely increase performance. Given two populations of neurons, each weakly tuned to a class (slightly higher firing rate for the preferred class), if the cross-population correlation is slightly higher than otherwise, the population response as a whole can be very certain about the class identity. He also talked about many other related things, such as asymptotics of the required population size versus noise.

Shy Shoham talked about Linear-Nonlinear-Poisson (LNP) and Linear-Nonlinear-Hawkes (LNH) models, and how to relate spike train (output) correlations to Gaussian (input) correlations [4,5]. LNH has a similar form to the GLM, but the feedback is added outside the nonlinearity. He referred to the procedure of inferring the underlying latent AR process as correlation-distortion, and proposed to use it for studying neural point processes as AR models, and hence to apply Granger causality and other signal processing tools. He also talked about semi-blind system identification, where the goal is to infer the linear kernel of the model given only the autocorrelation of the input and the autocorrelation of the population spike trains (the phase ambiguity of the filter is resolved by choosing the minimum-phase filter).

Maxim Bazhenov talked about modeling the transient synchronization in the locust olfactory system as a network phenomenon (interaction between projection neurons (PNs) and local inter-neurons (LNs)). The pattern of synchronization of PNs over multiple LFP cycles is repeatable, and his model reproduces it. He showed an interesting illustration of the connectivity between LNs posed as the graph coloring problem [6]. Each cluster of LNs targets everybody outside their cluster, enabling synchrony within. The connectivity matrix is effectively a block diagonal of zeros, and the off-diagonals are ones, because they are inhibitory neurons.
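The block structure is easy to write down. In the sketch below, which is my own illustration with a made-up color assignment, each LN gets a color (cluster label) and inhibits exactly those LNs with a different color. With neurons ordered by color, this gives zero blocks on the diagonal and ones elsewhere.

```python
def inhibitory_connectivity(colors):
    """W[i][j] = 1 if LN i inhibits LN j, i.e. if they belong to different clusters."""
    n = len(colors)
    return [[int(colors[i] != colors[j]) for j in range(n)] for i in range(n)]

colors = [0, 0, 1, 1, 2]          # hypothetical cluster label for each LN
W = inhibitory_connectivity(colors)
for row in W:
    print(row)
```

Neurons sharing a color have no inhibitory connection between them, so the color classes form a valid coloring of the inhibition graph, and each class is free to fire together.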

Nitin Gupta gave a talk on lateral horn (LH) cells. The normative model has been that the inhibitory neurons in LH act as feed-forward inhibition to limit the integration time within the Kenyon cells (KCs). He identified a heterogeneous population of neurons in LH (see [7] for beautifully filled neurons). Among the ones that project to the mushroom body (where KCs are), he found no evidence of GABA co-location, suggesting that there is no feed-forward inhibition through LH. He proposed an alternative model for limiting integration time in KCs, namely feedback inhibition through (non-spiking) GGNs.

Thomas Nowotny talked about how odor plume structure can help in separating mixtures from different sources, based on the results of [8]. He proposed a simple model of a lateral inhibition circuit among the glomeruli. The model showed counter-intuitive results for temporal mixtures of odor when linear decoding is used.

Kevin C. Daly gave a data packed talk on Manduca sexta (moth) olfactory system [9]. The oscillation he observed had a frequency modulation; starts at a high frequency and quickly falls, and it is odor dependent. He criticized the use of continuous odor application which may result in pathological responses (my wording), and instead he showed response to odor-puffs. (Interestingly, the blank puffs decreased the response.) He also emphasized the importance of not cutting the head of the animal, which preserves a pair of histamine neurons.

Aurel A. Lazar talked about a precise odor delivery system using laminar flows that can produce diverse temporal patterns of odor concentration with around 1% error. Using this system, they showed that the firing responses of the first two stages of the drosophila olfactory system, receptor neurons and projection neurons, are both temporally differentiating. The two were not recorded simultaneously, but thanks to the repeatable stimuli and responses, the conclusion is well supported.

References:

1. R. M. Webber and G. B. Stanley. Transient and steady-state dynamics of cortical adaptation, J. Neurophys., 95:2923-2932, 2006.
2. A. S. Boloori, R. A. Jenks, Gaelle Desbordes, and G. B. Stanley. Encoding and decoding cortical representations of tactile features in the vibrissa system, J. Neurosci., 30(30):9990-10005, 2010.
3. Gomez-Marin A, Stephens GJ, Louis M. Active sampling and decision making in Drosophila chemotaxis. Nature Communications 2:441. doi: 10.1038/ncomms1455 (2011).
4. Michael Krumin, Shy Shoham. Generation of Spike Trains with Controlled Auto- and Cross-Correlation Functions. Neural Computation. June 2009, Vol. 21, No. 6, Pages 1642-1664
5. Michael Krumin, Inna Reutsky,  Shy Shoham. Correlation-Based Analysis and Generation of Multiple Spike Trains Using Hawkes Models with an Exogenous Input.  Front Comput Neurosci. 2010; 4: 147
6. Assisi C, Stopfer M, Bazhenov M. Using the structure of inhibitory networks to unravel mechanisms of spatiotemporal patterning. Neuron. 2011 Jan 27;69(2):373-86.
7. Nitin Gupta, Mark Stopfer. Functional Analysis of a Higher Olfactory Center, the Lateral Horn. Journal of Neuroscience, 13 June 2012, 32(24): 8138-8148; doi: 10.1523/JNEUROSCI.1066-12.2012
8. Paul Szyszka, Jacob S. Stierle, Stephanie Biergans, C. Giovanni Galizia. The Speed of Smell: Odor-Object Segregation within Milliseconds. PLoS ONE, Vol. 7, No. 4. (27 April 2012), e36096, doi:10.1371/journal.pone.0036096
9. Daly KC, Galán RF, Peters OJ and Staudacher EM (2011) Detailed characterization of local field potential oscillations and their relationship to spike timing in the antennal lobe of the moth Manduca sexta. Front. Neuroeng. 4:12. doi: 10.3389/fneng.2011.00012
Last Sunday (April 29th) was Black board day (BBD), a small informal workshop I organize every year. It started 7 years ago on Kurt Gödel’s 100th birthday. We discuss logic, computation, math, and beyond. This year happens to be Alan Turing’s 100th birth year, so we had a theme that combines Turing machines and logic. It was a huge success thanks to the special guest speakers.

### Il Memming Park: On halting problem route to incompleteness

I tried to give an overview of how certain problems in mathematics that deal with natural numbers are very difficult, and why a mechanized theorem prover was a dream of Hilbert’s. Then I introduced the devilish diagonal argument of Cantor’s in the context of binary strings and languages. Basically, there are more languages (defined as sets of finite binary strings) than there are natural numbers. I introduced Turing machines and their 3 possible outcomes (accept, reject, and infinite loop) as well as the concept of universal Turing machines. Then, I constructed the halting problem and showed that the diagonal argument prevents us from having a Turing machine that can tell if another Turing machine will stop or not in finite time. Unfortunately, I didn’t have enough time to elaborate on how the halting problem has a similar structure to the proof of the incompleteness theorem, and how they could be connected.
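The diagonal construction behind the halting problem can be sketched in a few lines. This is a paraphrase in Python rather than the Turing machine formalism used in the talk. `halts` stands for any candidate decider, assumed deterministic, and both function names are hypothetical.

```python
def make_contrarian(halts):
    """Build a program that does the opposite of what `halts` predicts for it."""
    def contrarian():
        if halts(contrarian):
            while True:       # predicted to halt, so loop forever
                pass
        return "halted"       # predicted to loop, so halt immediately
    return contrarian

def refute(halts):
    """Show that the candidate decider is wrong about its own contrarian program."""
    program = make_contrarian(halts)
    if halts(program):
        # By construction the program would loop forever, so we do not run it.
        return "predicted halt, but the program loops forever"
    assert program() == "halted"   # safe to run: it returns immediately
    return "predicted loop, but the program halts"
```

Whatever answer a candidate gives for its contrarian program, `refute` exhibits the contradiction. For example, `refute(lambda f: True)` and `refute(lambda f: False)` each return a message describing the failure.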

### Kenneth Latimer: On Roger Penrose’s Emperor’s new mind

The controversial book Emperor’s new mind (1989) is famous for extending Lucas’ idea that since Turing machines can’t know if the Gödel statement is true, while humans do, the computational power of humans must be greater. Penrose further linked that idea to physics and the brain. During our discussion, we agreed that the Gödel statement is true, but its truth can only be judged from outside the system, and humans certainly are not using the same set of axioms as the system on which the Gödel statement is constructed. Also, the fact that we do not understand certain physics doesn’t imply that it is not computable. It was interesting that two people (Memming and Jonathan) were initially drawn into neuroscience because of this book.

### Michael Buice: Algebra of Probable Inference

Michael started by talking about adding oracles to Turing machines, the hierarchy of such oracle-equipped Turing machines, and the Kleene hierarchy of logical statements, but quickly jumped to a new topic. Instead of considering only True or False statements, if we allow things in between, then with reasonable assumptions we can derive the axioms of probability theory. Heuristically speaking, Gödel’s incompleteness theorem would imply that there are statements for which, even with infinite observations, the posterior probability does not converge to 0 or 1 and always stays in between. The derivation is given in Richard Cox’s papers, and the theory was expanded by Jaynes.

### Ryan Usher: An Incomplete, Inconsistent, Undecidable and Unsatisfiable Look at the Colloquial Identity and Aesthetic Possibilities of Math or Logic

Ryan started by stating how he finds beauty in mathematical proofs, especially in Henkin’s completeness theorem. But he was unsatisfied with how often beautiful results such as Gödel’s incompleteness theorem are abused in completely irrelevant contexts, such as in economics and social sciences. He had numerous quotes and examples showing the sad current state of abuse. He claimed that this is partly because terms like “consistency” and “completeness” have very rigorous meanings in a mathematical context, but people often associate them with their commonsense meanings.

### Jonathan Pillow: Do we live inside a Turing machine?

Jonathan summarized the argument by Bostrom (2003) that it is very probable that we are living inside a simulation. Under the assumptions that
1. Simulated human brain brings consciousness (“substance independence”)
2. Large scale simulation of human brain + physical world around human is possible
and assuming a high probability of technological advancement sufficient for such simulation, and some grad student in the future wishing to run an “ancestor simulation”, a simple counting argument over all humans inside and outside simulations shows that we are probably living in a simulation. (The photo below is Jonathan’s writing. It was a white board, but in the spirit of black board day, I inverted the photo.)


Primary olfactory receptor neurons (ORNs) bind to odor molecules in the medium and send action potentials to the brain. This signaling is not simply ON and OFF; each ORN has delicate sensitivity to various odors and shows diverse temporal activation patterns. Using both electrophysiology and Calcium-sensitive dye imaging, my collaborators Yuriy V. Bobkov and Kirill Y. Ukhanov studied the temporal aspect of Lobster ORNs. The heterogeneous response patterns are well presented in a recent paper published in PLoS One. I was particularly interested in a special type of ORN called bursting ORNs. Bursting ORNs oscillate spontaneously, and the Calcium imaging data allows population analysis. I was involved in the analysis to see if there is any sign of synchrony, using a resampling-based burst-triggered averaging technique. It turns out that they rarely interact, if at all. Moreover, they have a wide range of oscillation periods. Since they are coupled through the environment (a filament of odor molecules in the medium), in natural environments or under controlled odor stimulation they sometimes synchronize, which is the subject of another paper under review.

Note: the publication actually has my first name as Ill instead of Il which is silly and sick. I asked for a correction, but it seems PLoS One will only publish a note for the correction and not correct the actual article (because of the inconsistency it will cause for other indexing systems [1][2]). This could have been fixed in the proof, if PLoS did proofs before final publications, but they don’t (presumably to lower costs). In my opinion, this is a flaw of PLoS journals. EDIT: there’s a note saying that my name is misspelled now.

As I did for past COSYNEs (2009, 2010, 2011), this is a summary of my personal experience this year. I loved both the main meeting and the workshops. It’s definitely one of the best conferences. I had to present my own posters, so I couldn’t see many others. Therefore, this selection is severely subsampled. Also, there might be details that I am not remembering correctly. If you spot any mistakes, please let me know.

## Neural dynamics in neural coding

II-67. Jeffrey Seely, Matthew T. Kaufman, John Cunningham, Stephen Ryu, Krishna Shenoy, Mark Churchland. Dimensionality in motor cortex: differences between models and experiment

Is population activity in the motor cortex well explained by tuning curves for each neuron, or is it better explained by linear dynamics? To answer this question, they collapsed each experimental condition (motor output) to a temporally interpolated histogram of the same length. A large 3-D array A(n,c,t) for n neurons, c conditions, and t time bins is constructed and sliced in 2 different ways, each giving a low-rank approximation (PCA): one is across conditions, which implies tuning-curve-like characteristics, and the other is across time, which represents the dynamic modes (basis solutions to a differential equation). They sequentially chose the component, either condition or dynamics, that explains the most variance, subtracted it, and repeated. They showed that real data are mostly explained by dynamics, while tuning-based models generate data that are mostly condition-based. This is a pretty convincing argument using only very basic tools.

Workshop talk: David Sussillo. Rethinking gating: selective integration of sensory signals through network dynamics

Frontal eye field (FEF) spiking responses were analyzed in a dynamical systems framework, for a colored random dots task in which a contextual cue determines whether the monkey has to use the motion direction or the majority color of the dots to make the decision. (This talk is related to Valerio Mante’s poster II-58, but I missed it in the main meeting.) The question is how the monkey switches context: is there some sort of gating mechanism that controls whether the motion stimulus or the color stimulus reaches FEF? Or does all the information get to FEF, where the decision is formed? Directions in the population firing rate (state) space are extracted by regression on the conditional firing rates: $r(t) = \beta_1(t) \cdot \mathrm{choice} + \beta_2(t) \cdot \mathrm{color} + \beta_3(t) \cdot \mathrm{motion} + \text{interaction terms}$ (the neurons were recorded one at a time; they assume independence). The trajectory in the neural state space (reconstructed from the $\beta$'s) shows integration-like behavior along the relevant stimulus dimension, while the irrelevant dimension is encoded but not integrated. They further built a recurrent neural network model trained with Hessian-free optimization (James Martens, Ilya Sutskever, Learning Recurrent Neural Networks with Hessian-Free Optimization, ICML 2011) and saw similar performance and dynamics by tuning only the stimulus noise level. He further showed a fixed point analysis of the trained network and a non-normal matrix decomposition to explain the integration on the line attractor. A similar talk given jointly by Mante and Sussillo at the Santa Fe Institute can be found online. EDIT: they gave another related talk at the Redwood Center, with online video.

## Learning to fire a precise temporal pattern from spike train input

I was pleased to find 3 very cool posters related to learning the synaptic weights of an integrate-and-fire neuron for spike train input. It is the comeback of the tempotron (Robert Gütig, Nature 2006)!

I-22. Robert Gütig. The multi-class tempotron: a neuron model for processing of sensory streams

The tempotron is a classifier that either emits a spike or stays silent to indicate the class of an input spike pattern from many neurons. Robert extended the tempotron to allow not just zero or one spike, but to learn to fire a prescribed number of spikes. This is done by considering the rank-ordered membrane voltage peaks simultaneously, whereas the original tempotron only deals with the maximum peak. He showed that this could be used to detect an event or feature in time. Training only requires the number of occurrences; it does not require tagging the time series with precise timings.
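For readers who have not seen it, the decision rule and error-driven update of the original single-spike tempotron can be sketched roughly as follows. This is my own minimal sketch of the 2006 model as I understand it, not the multi-spike extension from the poster. The time constants, threshold, learning rate, time grid and all function names are illustrative assumptions.

```python
import math

TAU_M, TAU_S = 15.0, 3.75   # membrane / synaptic time constants in ms (assumed values)

def kernel(dt):
    """Causal postsynaptic kernel: difference of exponentials, zero for dt < 0."""
    if dt < 0:
        return 0.0
    return math.exp(-dt / TAU_M) - math.exp(-dt / TAU_S)

def voltage(weights, spikes, t):
    """Membrane voltage at time t; spikes[i] lists the spike times of input i."""
    return sum(w * sum(kernel(t - ti) for ti in times)
               for w, times in zip(weights, spikes))

def peak(weights, spikes, t_grid):
    """Return (t_max, V_max), the voltage maximum over a grid of time points."""
    return max(((t, voltage(weights, spikes, t)) for t in t_grid),
               key=lambda tv: tv[1])

def tempotron_step(weights, spikes, label, t_grid, theta=1.0, lr=0.1):
    """One update: weights change only if the pattern is misclassified.

    label is True if the neuron should fire (peak voltage reaches theta).
    """
    t_max, v_max = peak(weights, spikes, t_grid)
    fired = v_max >= theta
    if fired == label:
        return list(weights)
    sign = 1.0 if label else -1.0
    # Each weight moves in proportion to its input's contribution at t_max.
    return [w + sign * lr * sum(kernel(t_max - ti) for ti in times)
            for w, times in zip(weights, spikes)]
```

Only the single largest voltage peak enters the update here, which is the limitation the multi-class version removes by looking at several rank-ordered peaks at once.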

II-23. Raoul-Martin Memmesheimer, Ran Rubin, Haim Sompolinsky. Learning precisely timed spiking responses

One of the main advantages of the tempotron is that it can use time as an extra degree of freedom, allowing a higher capacity (# of patterns / synapse) compared to the traditional perceptron. However, this is also a disadvantage, because the precise timing of the spikes is not controlled. This poster describes a couple of simple iterative procedures for updating the weights and threshold of an integrate-and-fire (IF) neuron to produce a desired spiking pattern. The tricky part of such a task is the complication of the reset after erroneous spikes. Two algorithms, (1) first error learning and (2) high threshold learning, are proposed to overcome this difficulty. They showed that the algorithm converges to a solution in a similar fashion to the perceptron, if a solution exists.

II-39. Ran Rubin, Raoul-Martin Memmesheimer, Haim Sompolinsky. Support Vector Machines in Spiking Neurons with Non-Linear Dendrites

This is a companion poster to II-23. They extend the method to find a robust solution by maximizing the margin. Using an auxiliary voltage trace that assumes the resets happened a short time before each desired spike (as in the high threshold learning algorithm), they formulated the problem as an SVM-like constrained optimization problem. They also proposed active dendrites as nonlinear positive semi-definite kernels (a pointwise nonlinearity applied to the original inner product).

## Probabilistic modeling based neural/stimulus distances

Measuring similarity given a generative system for the data can be done with divergences. Given a probabilistic spiking neuron population model, one can measure the similarity between the stimuli or between the population responses; there were two posters, one for each idea, both using the Ising model.

I-7. Elad Ganmor, Ronen Segev, Elad Schneidman. Semantic organization of a neural population codebook and accurate decoding using a neural thesaurus

Trial-to-trial variability of the population response $P(r|s)$ was captured by an Ising model. Using Bayes' rule, they measured the Jensen-Shannon divergence between posteriors: $d(r_1, r_2) = D_{JS}(P(s|r_1), P(s|r_2))$ (not a metric unless the square root is taken). They only consider the instantaneous response (20 neurons, 10 ms bin, binarized), with no temporal structure. Using hierarchical clustering (forming the codebook) on the test response patterns, they showed that this method captures most of the mutual information with just a few clusters.
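To make the distance concrete, here is a small sketch of the computation for a discrete stimulus set: posteriors via Bayes' rule, followed by the Jensen-Shannon divergence in bits. The function names and the toy numbers are mine, and the likelihood values are simply made up rather than taken from an Ising model fit.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in bits for discrete distributions."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits; symmetric and bounded in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def posterior(likelihoods, prior):
    """Bayes' rule: P(s | r) from P(r | s) for each stimulus s and the prior P(s)."""
    joint = [l * p for l, p in zip(likelihoods, prior)]
    z = sum(joint)
    return [j / z for j in joint]

# Toy example with three stimuli (made-up numbers).
prior = [0.5, 0.25, 0.25]
post_r1 = posterior([0.8, 0.1, 0.1], prior)   # P(s | r_1)
post_r2 = posterior([0.1, 0.1, 0.8], prior)   # P(s | r_2)
d = js_divergence(post_r1, post_r2)
```

Two response patterns that imply the same posterior over stimuli get distance zero, which is what makes them "synonyms" in the thesaurus picture.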

I-35. Gasper Tkacik, Einat Granot-Atedgi, Ronen Segev, Elad Schneidman. Retinal metric: a stimulus distance measure derived from population neural responses

They used the symmetric Kullback-Leibler divergence between the stimulus-conditioned response distributions as a similarity measure in stimulus space: $d(s_1, s_2) = D_{KL}^{sym}(P(r|s_1);P(r|s_2))$. The conditional distribution was modeled with a stimulus-driven maximum entropy Ising model in which the higher order interaction terms do not depend on the stimulus: $P(r|s) = \frac{1}{Z} \exp\left(h(s) \cdot r + \sum_{i \neq j} J_{ij} r_i r_j\right)$. They did not use the JS divergence because it is difficult to compute from the Ising model. This similarity reveals which features of the stimulus the population really cares about.
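For a handful of neurons the distance can be evaluated by brute force, which helps to see what is being computed. The sketch below enumerates all binary patterns, so it only works for small populations; it is not how the authors computed it. I take the symmetric divergence to be the plain sum of the two KL directions (some definitions include a factor of 1/2), and all names are my own.

```python
import itertools
import math

def ising_distribution(h, J):
    """P(r) over all patterns r in {0,1}^n by brute-force enumeration.

    log P(r) = sum_i h[i] r_i + sum_{i != j} J[i][j] r_i r_j - log Z
    """
    n = len(h)
    patterns = list(itertools.product((0, 1), repeat=n))
    log_weights = [sum(h[i] * r[i] for i in range(n))
                   + sum(J[i][j] * r[i] * r[j]
                         for i in range(n) for j in range(n) if i != j)
                   for r in patterns]
    z = sum(math.exp(lw) for lw in log_weights)
    return [math.exp(lw) / z for lw in log_weights]

def symmetric_kl(p, q):
    """D_KL(p || q) + D_KL(q || p) in nats; both need full support."""
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

def stimulus_distance(h_s1, h_s2, J):
    """Distance between two stimuli that enter only through the fields h(s)."""
    return symmetric_kl(ising_distribution(h_s1, J), ising_distribution(h_s2, J))

# Three neurons with stimulus-independent couplings (made-up numbers).
J = [[0.0, 0.2, -0.1],
     [0.2, 0.0, 0.3],
     [-0.1, 0.3, 0.0]]
d12 = stimulus_distance([0.5, -1.0, 0.0], [-0.5, 0.3, 1.0], J)
```

Because the couplings are shared across stimuli, two stimuli are close exactly when they drive the single-neuron fields $h(s)$ in a similar way.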

## Extending Spike Triggered Covariance

There were 3 closely related talks about spike triggered covariance (STC) analysis in the Characterizing Neural Responses to Structured and Naturalistic Stimuli workshop organized by Kanaka Rajan and William Bialek.

Jonathan Pillow

His talk was focused on Empirical Bayes (EB) methods for the inference of hierarchical models. The first part was about Mijung Park’s work on spatio-temporal and frequency localized prior design for receptive fields. The second part, which was brief due to time constraints, was about a Bayesian extension of STC where the number of receptive fields is inferred by EB.

William Bialek

He talked about the full history from reverse correlation (Boer 1968), to STC, to maximizing mutual information. He introduced Kanaka’s work (arXiv:1201.0321v1 [q-bio.NC], also poster III-34, but I missed it) on maximizing mutual information between a quadratic projection of the stimulus and the response. This is an interesting extension of MID (see Sharpee’s talk below). MID tends to degrade as the number of dimensions to extract increases, but their method seems to work better.

Tatyana Sharpee

Maximally informative dimensions (MID) aims at finding the receptive fields of a linear-nonlinear cascade, regardless of the nonlinearity, by maximizing mutual information. This is an ideal goal; however, estimating and maximizing mutual information is very difficult in practice (as in the case of the information bottleneck), and implementations suffer from local minima and (histogram) parameterization. She presented an approach from the opposite direction: minimize mutual information (or equivalently, maximize the conditional entropy of the response given the stimulus). Using a maximum entropy model with the first two moments constrained, she derived a quadratic form of logistic regression as the model to fit to binary spiking responses: $P(spike|s) = \frac{1}{1+\exp(a+h\cdot s+s^\top \cdot J \cdot s)}$. This is closely related to our BSTC work, which has a similar quadratic form of Poisson regression model as a special case. (I missed the related poster by Ryan Rowekamp et al., II-35.) Ref: J.D. Fitzgerald, R. J. Rowekamp, L.C. Sincich and T.O. Sharpee, (2011) “Second order dimensionality reduction using minimum and maximum mutual information models”, PLoS Computational Biology, 7(10): e1002249 doi:10.1371/journal.pcbi.1002249
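As a quick illustration of the model form only (not of the fitting procedure), the spike probability can be evaluated like this. The sign convention follows the formula as written above; the function name and the toy parameters are my own assumptions.

```python
import math

def spike_probability(s, a, h, J):
    """P(spike | s) = 1 / (1 + exp(a + h.s + s^T J s)) for a stimulus vector s."""
    n = len(s)
    linear = sum(hi * si for hi, si in zip(h, s))
    quadratic = sum(s[i] * J[i][j] * s[j] for i in range(n) for j in range(n))
    return 1.0 / (1.0 + math.exp(a + linear + quadratic))

# Toy 2-D stimulus: a purely quadratic (energy-like) dependence, with h = 0.
a = 2.0
h = [0.0, 0.0]
J = [[-1.5, 0.0],
     [0.0, 0.0]]
p_plus = spike_probability([1.0, 0.3], a, h, J)
p_minus = spike_probability([-1.0, -0.3], a, h, J)   # same as p_plus: sign-invariant
```

With $h = 0$ the response depends on the stimulus only through the quadratic term, so it is invariant to flipping the sign of $s$, the kind of dependence that STC is usually used to pick up.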

III-36. Brett Vintch, Andrew D Zaharia, Tony Movshon, Eero P Simoncelli. Fitting receptive fields in V1 and V2 as linear combinations of nonlinear subunits

I missed this one, but this one is also highly related. They have a low complexity model that generates a set of filters that are generally obtained from STC on V1 complex cells.

Shannon’s entropy $H$ is a fundamental statistic that measures the uncertainty of a (discrete) distribution. It is a building block for mutual information $I(X;Y) = H(X) - H(X|Y)$, which has numerous applications in statistics, communication, signal processing, machine learning and so on. In the context of neuroscience, entropy can measure the maximum capacity of a neuron, quantify the amount of noise, and also serve as a cost function for the theoretical derivation of learning rules. The amount of information that neural spike trains carry about a stimulus can be measured by mutual information, which provides a fundamental limit for neural codes.
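When the distribution is fully known, both quantities are simple sums. The sketch below assumes the joint probability table is given exactly, so it sidesteps the estimation problem entirely; the function names are mine.

```python
import math

def entropy(p):
    """Shannon entropy in bits of a discrete distribution given as probabilities."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def mutual_information(joint):
    """I(X;Y) = H(X) - H(X|Y), using H(X|Y) = H(X,Y) - H(Y); joint[x][y] = P(x, y)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    h_xy = entropy([p for row in joint for p in row])
    return entropy(px) - (h_xy - entropy(py))

# A binary response that copies a fair binary stimulus carries exactly 1 bit;
# a response that is independent of the stimulus carries none.
mi_copy = mutual_information([[0.5, 0.0], [0.0, 0.5]])
mi_indep = mutual_information([[0.25, 0.25], [0.25, 0.25]])
```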

Unfortunately, estimating entropy or mutual information is notoriously difficult, especially when the number of observations is less than the number of possible symbols [1]. For neural data, this is often the case, due to the combinatorial nature of the symbols under consideration. If we consider binning a 100 ms window of spike trains from 10 neurons at a resolution of 1 ms, the total number of possible symbols becomes $2^{10 \cdot 100}$. Just to observe that many symbols, one would need on the order of $10^{292}$ years. Therefore, we must be clever. The question is how to extrapolate when you may have a severely under-sampled distribution.
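The back-of-the-envelope arithmetic behind that number goes roughly as follows, assuming one symbol is observed per non-overlapping 100 ms window and, optimistically, that every window yields a symbol not seen before. The variable names are mine.

```python
import math

n_neurons, window_ms, bin_ms = 10, 100, 1
n_bins = n_neurons * (window_ms // bin_ms)   # 1000 binary bins per symbol
n_symbols = 2 ** n_bins                      # exact integer, roughly 1e301

seconds_per_year = 365.25 * 24 * 3600
windows_per_year = seconds_per_year / (window_ms / 1000)   # one symbol per window

# Work in log10, since the ratio itself is an unwieldy number.
log10_years = math.log10(n_symbols) - math.log10(windows_per_year)
print(f"about 10^{log10_years:.1f} years")
```

The exponent comes out between 292 and 293, and that is a lower bound, since repeated symbols would only make the wait longer.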

In the literature, there are many entropy estimators, and mutual information estimators based on them. We extend one of the best known entropy estimators, the NSB estimator [2,3], which is a Bayesian estimator with an approximately non-informative prior on entropy. This is achieved by mixing Dirichlet distributions appropriately. We have extended the procedure to situations where the number of symbols with non-zero probability is unknown or arbitrarily large, by using a mixture of Pitman-Yor processes as the prior. The limit of the NSB estimator for infinitely many bins can be captured by a Dirichlet process mixture prior. The Pitman-Yor process is an extension of the Dirichlet process with an extra parameter. The advantage of using a Pitman-Yor mixture is that it can fit heavy-tailed distributions, and neural data (as well as many other natural phenomena) have heavy-tailed distributions. Our estimator shows significantly smaller bias for power-law tailed generating processes as well as for spiking neural data.

If you’re at COSYNE 2012, details are presented as a poster titled “Bayesian entropy estimation for infinite neural alphabets” by Evan Archer, myself and Jonathan Pillow. Look for III-31 (Feb 25th, Saturday).

Update: a preprint of this work can be found on the arXiv: Evan Archer*, Il Memming Park*, Jonathan Pillow. Bayesian Entropy Estimation for Countable Discrete Distributions. arXiv:1302.0328 (2013) (* equal contribution)

1. Liam Paninski. Estimation of Entropy and Mutual Information. Neural Computation, Vol. 15, No. 6. (1 June 2003), pp. 1191-1253, doi:10.1162/089976603321780272
2. I Nemenman, F Shafee, and W Bialek. Entropy and inference, revisited. NIPS 2001
3. I Nemenman, W Bialek, and R de Ruyter van Steveninck. Entropy and information in neural spike trains: Progress on the sampling problem. Phys. Rev. E, 69:056111, 2004.