CNS 2013


CNS 2013 badge

The Computational NeuroScience (CNS) conference is held annually, alternating between America and Europe. This year it was held in Paris; next year it will be in Québec City, Canada. There were more theoretical and simulation-based studies than experimental ones. Among the experimental studies, many were about oscillations and synchrony.

Disclaimer: I was occupied with several things and could not attend the whole conference, so my selection is heavily biased. These notes are primarily for my future reference.

Simon Laughlin. The influence of metabolic energy on neural computation (keynote)

There are three main categories of energy cost in the brain: (1) maintenance, (2) spike generation, and (3) synapses. Assuming a finite energy budget for the brain, the optimal efficient coding strategy can vary from a small number of neurons with high rates to a large population with sparse coding [see Fig 3, Laughlin 2001]. Variation of cost ratios across animals may be associated with different coding strategies that optimize energy per bit. He illustrated the balance through various plots of the law of diminishing returns. He emphasized reverse engineering the brain, and concluded with the 10 principles of neural design (transcribed from the slides thanks to the photo by @neuroflips):
(1) save on wire, (2) make components irreducibly small, (3) send only what is needed, (4) send at the lowest rate, (5) sparsify, (6) compute directly with analogue primitives, (7) mix analogue and digital, (8) adapt, match and learn, (9) complexify (elaborate to specialize), (10) compute with chemistry??????. (question marks are from the original slide)

Sophie Denève. Rescuing the spike (keynote)

She proposed that the high trial-to-trial variability observed in spike trains from single neurons is due to degeneracy in the population encoding. There are many ways the presynaptic population can evoke similar membrane potential fluctuations in a linear readout neuron. Hence she claims that, through precisely controlled lateral inhibition, the neural code is precise at the population level but seems variable if we only observe a single neuron. She briefly mentioned how a linear dynamical system might be implemented in such a coding system, but it seemed limited as to what kind of computations can be achieved.

There were several noise correlation (joint variability in the population activity) related talks:

Joel Zylberberg et al. Consistency requirements determine optimal noise correlations in neural populations

The “sign rule” says that if the signal correlation and the noise correlation have opposite signs, linear Fisher information (and OLE performance) is improved (see Fig 1, Averbeck, Latham, Pouget 2006). They proved a theorem confirming the sign rule in a general setup, and furthermore showed that the optimal noise correlation does NOT necessarily obey the sign rule (see Hu, Zylberberg, Shea-Brown 2013). Experimental data from the retina do not obey the sign rule: noise correlation is positive even for cells tuned to the same direction. However, it is still near optimal according to their theory.
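To make the sign rule concrete, here is a minimal sketch for just two neurons. It assumes equal noise variance and a 2x2 noise covariance parameterized by a single correlation rho; the function name and the numbers are mine for illustration and are not from the talk.

```python
def linear_fisher_information(slopes, variance, rho):
    """Linear Fisher information f'^T C^{-1} f' for a pair of neurons.

    slopes   -- derivatives (f1', f2') of the two tuning curves at the stimulus
    variance -- noise variance, assumed equal for both neurons
    rho      -- noise correlation, with -1 < rho < 1
    """
    d1, d2 = slopes
    # C = variance * [[1, rho], [rho, 1]], so
    # C^{-1} = [[1, -rho], [-rho, 1]] / (variance * (1 - rho**2)).
    return (d1 * d1 - 2.0 * rho * d1 * d2 + d2 * d2) / (variance * (1.0 - rho ** 2))


# Two similarly tuned neurons: slopes of the same sign (positive signal correlation).
same_sign = (1.0, 1.0)
info_negative = linear_fisher_information(same_sign, 1.0, -0.5)
info_zero = linear_fisher_information(same_sign, 1.0, 0.0)
info_positive = linear_fisher_information(same_sign, 1.0, 0.5)
print(info_negative, info_zero, info_positive)
```

With these toy numbers, negative noise correlation gives more information than independent noise, which in turn gives more than positive noise correlation. This only illustrates the rule itself; it says nothing about the optimal correlation structure discussed in the talk.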

Federico Carnevale et al. The role of neural correlations in a decision-making task

During a vibration detection task, cross-correlations among neurons in the premotor cortex (in a 250 ms window) were shown to be dependent on behavior (see Carnevale et al. 2012). Federico told me that there were no sharp peaks in the cross-correlation. He further extrapolated the choice probability to the network level based on multivariate Gaussian approximation, and a simplification to categorize neurons into two classes (transient or sustained response).

Alex Pouget and Peter Latham each gave talks in the Functional role of correlations workshop.

Both were on Fisher information and the effect of noise correlations. Pouget’s talk focused on “differential correlation”, which is noise along the direction of the manifold on which the tuning curves encode information (noise that looks like signal). Latham talked about why there are so many neurons in the brain, assuming linear Fisher information and additive noise (but I forgot the details!)

On the first day of the workshop, I participated in the New approaches to spike train analysis and neuronal coding workshop organized by Conor Houghton and Thomas Kreuz.

Florian Mormann. Measuring spike-field coherence and spike train synchrony

He emphasized using nonparametric statistics for testing the circular variable of interest: the phase of the LFP oscillation at spike times. In the second part, he talked about the spike-distance (see Kreuz 2012), which is a smooth, time-scale-invariant measure of instantaneous synchrony among spike trains.
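As a reminder of what the circular variable looks like in practice, here is a small sketch of two standard descriptive statistics for spike phases: the mean resultant length (often called vector strength) and the circular mean. The function names are mine. A nonparametric test of the kind he advocated would compare such a statistic against surrogate data, which is not shown here.

```python
import math


def mean_resultant_length(phases):
    """Length of the average unit vector of the phases (in radians).

    1 means all spikes occur at the same LFP phase; values near 0 mean
    the phases are spread evenly (or symmetrically) around the circle.
    """
    c = sum(math.cos(p) for p in phases) / len(phases)
    s = sum(math.sin(p) for p in phases) / len(phases)
    return math.hypot(c, s)


def circular_mean(phases):
    """Preferred phase: the direction of the average unit vector."""
    c = sum(math.cos(p) for p in phases)
    s = sum(math.sin(p) for p in phases)
    return math.atan2(s, c)


locked = [0.1, 0.1, 0.1]                                   # perfectly phase-locked spikes
spread = [0.0, math.pi / 2, math.pi, 3 * math.pi / 2]      # evenly spread phases
print(mean_resultant_length(locked), mean_resultant_length(spread))
```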

Rodrigo Quian Quiroga. Extracting information in time patterns and correlations with wavelets

Using Haar wavelet time bins as the feature space, he proposed a scale-free linear analysis of spike trains. In addition, he proposed discovering relevant temporal structure through feature selection using mutual information. The method doesn’t seem to be able to find higher-order interactions between time bins.
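For reference, this is roughly what a Haar feature space for a binned spike train looks like. It is a generic orthonormal Haar transform written from the textbook definition, not the speaker’s code, and it assumes the number of bins is a power of two.

```python
import math


def haar_transform(x):
    """Orthonormal Haar wavelet transform of a sequence of binned counts.

    Returns [overall average term, coarsest detail, ..., finest details].
    The length of x must be a power of two.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    approx = [float(v) for v in x]
    details = []
    root2 = math.sqrt(2.0)
    while len(approx) > 1:
        pairs = range(0, len(approx), 2)
        diff = [(approx[i] - approx[i + 1]) / root2 for i in pairs]
        approx = [(approx[i] + approx[i + 1]) / root2 for i in pairs]
        details = diff + details  # coarser scales go in front of finer ones
    return approx + details


binned = [1, 1, 0, 0]  # spikes only in the first half of the window
coeffs = haar_transform(binned)
print(coeffs)
```

Each coefficient summarizes the spike count at one time scale and location, so selecting coefficients amounts to selecting time windows of different widths.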

Ralph Andrzejak. Detecting directional couplings between spiking signals and time-continuous signals

Using distance based directional coupling analysis (see Chicharro, Andrzejak 2009; Andrzejak, Kreuz 2011), he showed that it is possible to find unidirectional coupling between continuous signals and spike trains via spike train distances. He mentioned the possibility of using spectral Granger causality for a similar purpose.

Adrià Tauste Campo. Estimation of directed information between simultaneous spike trains in decision making

Bayesian conditional information estimation through the use of context-tree weighting was used to infer directional information (analogous to Granger causality, but with mutual information). A compact Markovian structure is learned for binary time series.

I presented a poster on Bayesian entropy estimation in the main meeting, and gave a talk about nonparametric (kernel) methods for spike trains in the workshop.

8th Black Board Day (BBD8)

Birthday cake for both Gödel and Shannon!

Last Sunday (April 28th, 2013) was the 8th Black Board Day (BBD), which is a small informal workshop I organize every year. It started 8 years ago on my hero Kurt Gödel’s 100th birthday. This year, I found out that April 30th (1916) is Claude Shannon’s birthday, so I decided the theme would be his information theory.

I started by introducing probabilistic reasoning as an extension of logic in this uncertain world (as Michael Buice told us in BBD7). I quickly introduced two key concepts: Shannon’s entropy H(X) = -\sum_i p_i \log_2 p_i, which additively quantifies the uncertainty of a sequence of independent random quantities in bits, and mutual information I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X), which quantifies how much uncertainty about X is reduced by the knowledge of Y (and vice versa; it’s symmetric). I showed a simple example of the source coding theorem, which states that a symbol sequence can be maximally compressed to the length of its entropy (information content), and stated the noisy channel coding theorem, which provides an achievable limit on the information rate that can be passed through a channel (the channel capacity). Legend says that von Neumann told Shannon to use the word “entropy” due to its similarity to the concept in physics, so I gave a quick microcanonical picture that connects the Boltzmann entropy to Shannon’s entropy.
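The two quantities are easy to compute for small discrete distributions. The sketch below is mine, not something shown at the workshop. It uses the identity I(X;Y) = H(X) + H(Y) - H(X,Y), which is equivalent to the conditional-entropy form given above.

```python
import math


def entropy(p):
    """Shannon entropy in bits of a discrete distribution p (a list of probabilities)."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def mutual_information(joint):
    """Mutual information in bits, where joint[i][j] = P(X = i, Y = j)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    pxy = [v for row in joint for v in row]
    return entropy(px) + entropy(py) - entropy(pxy)


fair_coin = [0.5, 0.5]
copy_channel = [[0.5, 0.0], [0.0, 0.5]]      # Y is always equal to X
independent = [[0.25, 0.25], [0.25, 0.25]]   # Y tells us nothing about X
print(entropy(fair_coin), mutual_information(copy_channel), mutual_information(independent))
```

A fair coin carries one bit; a noiseless copy of it shares that whole bit, and an independent coin shares none.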

Andrew Tan: Holographic entanglement entropy

Andrew on the white board

Andrew wanted to connect how space-time structure can be derived from holographic entanglement entropy, and furthermore to link it to graphical models such as the restricted Boltzmann machine. He gave overviews of quantum mechanics (deterministic linear dynamics of the quantum states), density matrix, von Neumann entropy, and entanglement entropy (entropy of a reduced density matrix, where we assume partial observation and marginalization over the rest). Then, he talked about the asymptotic behaviors of entropy for the ground state and critical regime, and introduced a parameterized form of Hamiltonian that gives rise to a specific dependence structure in space-time, and sketched what the dimension of boundary and area of the dependence structure are. Unfortunately, we did not have enough time to finish what he wanted to tell us (see Swingle 2012 for details).

Jonathan Pillow: Information Schminformation

Information theory is widely applied to neuroscience and sometimes to machine learning. Jonathan sympathized with Shannon’s note (1956) called “the bandwagon”, which criticized the possible abuse/overselling of information theory. First, Jonathan focused on the derivation of a “universal” rate-distortion theory based on the “information bottleneck principle”. Then, he continued with his recent ideas on optimal neural codes under different Bayesian distortion functions. He showed a multiple-choice exam example where maximizing mutual information can be worse, and a linear neural coding example for different cost functions.

Free discussion time!


Typical coffee break at COSYNE main meeting

Feb 28–Mar 5 was the 10th COSYNE meeting, and my 6th participation. Thanks to my wonderful collaborators, I had a total of 4 posters in the main meeting (Jonathan Pillow had 7, which tied with Larry Abbott for the largest number of abstracts). Hence, I didn’t have a chance to sample enough posters for the first two nights (I also noticed a few presentations that overlapped with NIPS 2012). I tried to be a bit more social this year; I organized a small (unofficial) Korean social (with the help of Kijung Yoon), a tweet-up, and enjoyed many social drinking nights. Following are my notes on what I found interesting.

EDIT: here are some other blog posts about the meeting: [Jonathan Pillow] [Anne Churchland]

Main meeting—Day 1

William Bialek. Are we asking the right questions?
Not all sensory information is equally important. Rather, Bialek claims that the bits that can predict the future are the important ones. Since neurons only have access to the presynaptic neurons’ spiking pattern, this should be achieved by neural computation that predicts its own future patterns (presumably under some constraints to prevent trivial solutions). When such information is measured over time, at least in some neurons in the fly visual system, its decay is very slow: “Even a fly is not Markovian”. This indicates that the neuronal population state may be critical. (see Bialek, Nemenman, Tishby 2001)

Evan Archer, Il Memming Park, Jonathan W Pillow. Semi-parametric Bayesian entropy estimation for binary spike trains [see Evan’s blog]

Jacob Yates, Il Memming Park, Lawrence Cormack, Jonathan W Pillow, Alexander Huk. Precise characterization of multiple LIP neurons in relation to stimulus and behavior

Jonathan W Pillow, Il Memming Park. Beyond Barlow: a Bayesian theory of efficient neural coding

Main meeting—Day 2

Eve Marder. The impact of degeneracy on system robustness
She stressed how there could be multiple implementations of the same functionality, a property she refers to as degeneracy. Her story was centered around modeling the lobster STG oscillation (side note: the connectome is not enough to predict behavior). Since there is rapid decay and rebuilding of receptors and channels, there must be homeostatic mechanisms that constantly tune parameters for the vital oscillatory bursting in STG. There are multiple stable fixed points in the parameter space, and single-cell RNA quantification supports it.

Mark H Histed, John Maunsell. The cortical network can sum inputs linearly to guide behavioral decisions
Using optogenetics in behaving mice, they tried to resolve the synchrony vs rate code debate. He showed that behaviorally, the population integrated weak inputs almost perfectly and was not sensitive to synchrony. Hence, he claims that the brain may well just operate on linear population codes.

Arnulf Graf, Richard Andersen. Learning to infer eye movement plans from populations of intraparietal neurons
Spike trains from monkey area LIP were used for an “eye-movement intention” based brain–machine interface. During the brain–control period, LIP neurons changed their tuning. Decoding was done with a MAP decoder which was updated online through the trials. To encourage(?) the monkey, the brain–control period had a different target distribution, and the decoder took this “behavioral history” or “prior” into account. Neurons with the lowest performance improved the most, demonstrating the ability of LIP neurons to swiftly change their firing pattern.

Il Memming Park, Evan Archer, Nicholas Priebe, Jonathan W Pillow. Got a moment or two? Neural models and linear dimensionality reduction

Evan’s presenting the GQM poster

David Pfau, Eftychios A. Pnevmatikakis, Liam Paninski. Robust learning of low dimensional dynamics from large neural ensembles
Latent dynamics with an arbitrary noise process are recovered from high-dimensional spike train observations using low-rank optimization techniques (convex relaxation). Even a spike history filter can be included by assuming a low-rank matrix corrupted by sparse noise. A nice method; I look forward to its application to real data.

Main meeting—Day 3

Unofficial Korean social

Carlos Brody. Neural substrates of decision-making in the rat
Using rats trained in a psychophysics factory on the Poisson click task, he showed that rats are noiseless integrators by fitting a detailed drift diffusion model with 8 (or 9?) parameters. From the model, he extracted detailed expected decision variable statistics related to activity in PPC and FOF (analogues of LIP and FEF in monkeys), which showed that FOF is more threshold-like and PPC is integrator-like in their firing rate representation. However, upon pharmacologically disabling either area, the rat psychophysics was not harmed, which indicates that the accumulation of sensory evidence happens somewhere earlier in the information processing. (Jeffrey Erlich said it might be auditory cortex during the workshop.) [EDIT: Brody’s Science paper is out.]

N. Parga, F. Carnevale, V. de Lafuente, R. Romo. On the role of neural correlations in decision-making tasks
I had a hard time understanding the speaker, but it was interesting to see how spike count correlation and a Gaussian assumption for decision making could accurately predict the choice probability.

Jonathan Aljadeff, Ronen Segev, Michael J. Berry II, Tatyana O. Sharpee. Singular dimensions in spike triggered ensembles of correlated stimuli
Due to the large concentration of eigenvalues in the stimulus covariance of natural scenes, they show that the result of spike-triggered covariance analysis (using the difference between the raw STC and the stimulus covariance) contains a spurious component that corresponds to the largest eigenvalue. They argue this using random matrix theory, and propose a correction that projects out the spurious dimension before STC analysis. Surprisingly, they then recover more dimensions with eigenvalues larger than the surrogate eigenvalues. I wonder if a model-based approach like GQM (or BSTC) would do a better job for those ill-conditioned stimulus distributions.

Gergo Orban, Pierre-Olivier Polack, Peyman Golshani, Mate Lengyel. Stimulus-dependence of membrane potential and spike count variability in V1 of behaving mice
It is well known that the Fano factor of spike trains is reduced when a stimulus is presented (e.g. Churchland et al. 2010). Gergo measured contrast-dependent trial-to-trial variability of V1 membrane potentials in awake mice. By computing the statistics from 5 out of 6 cycles of the repeated stimulus, he found that the variability is reduced as the contrast gets stronger. The spikes were clipped from the membrane potential for this analysis.
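For reference, the Fano factor mentioned here is the variance of the spike count across repeated trials divided by its mean. The sketch below uses made-up counts and my own function name. The talk itself analyzed membrane potential variability rather than spike counts.

```python
def fano_factor(counts):
    """Variance-to-mean ratio of spike counts across repeated trials.

    Uses the unbiased sample variance; a Poisson process gives a value near 1.
    """
    n = len(counts)
    mean = sum(counts) / n
    variance = sum((c - mean) ** 2 for c in counts) / (n - 1)
    return variance / mean


# Hypothetical counts with the same mean rate but different variability.
spontaneous = [2, 4, 4, 6]
stimulated = [4, 4, 4, 4]
print(fano_factor(spontaneous), fano_factor(stimulated))
```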

Jakob H Macke, Iain Murray, Peter Latham. How biased are maximum entropy models of neural population activity?
This was based on their NIPS 2011 paper with the same title. If you use a maximum entropy model as an entropy estimator, your estimate of entropy will be biased, as with all entropy estimators. They have an exact form of the bias, which is inversely proportional to the number of samples if the model class is right.

Ryan P Adams, Geoffrey Hinton, Richard Zemel. Unsupervised learning of latent spiking representations
By taking the limit of small bin size of an RBM, they built a point process model with continuous coupling to a hidden point process. The work seems to be still preliminary. They used a Gaussian process to constrain the coupling to be smooth.

Main meeting—Day 4

D. Acuna, M. Berniker, H. Fernandes, K. Kording. An investigation of how prior beliefs influence decision–making under uncertainty in a 2AFC task.
Subjects performing optimal Bayesian inference could be using several different strategies to generate behavior from the posterior; sampling from the posterior and MAP inference are compared. The two strategies make different predictions for the just-noticeable-difference (JND) as a function of prior uncertainty. They find that human subjects were consistent with MAP inference and not with sampling.
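A toy way to see why the two strategies are distinguishable: in a two-alternative choice where the posterior probability that option A is correct is p, MAP always picks the more probable option, while sampling picks A with probability p. The sketch below only compares expected accuracy under that simplification, which is my assumption; it does not reproduce the JND analysis of the poster.

```python
def accuracy_map(p):
    """Expected accuracy when always choosing the more probable alternative."""
    return max(p, 1.0 - p)


def accuracy_sampling(p):
    """Expected accuracy when choosing A with probability p (posterior sampling).

    The choice is correct if A is chosen and A is right (p * p),
    or B is chosen and B is right ((1 - p) * (1 - p)).
    """
    return p * p + (1.0 - p) * (1.0 - p)


posterior = 0.8
print(accuracy_map(posterior), accuracy_sampling(posterior))
```

For p = 0.8 the sampling strategy is correct about 68% of the time versus 80% for MAP, so the two strategies predict different psychometric behavior whenever the posterior is not at 0, 0.5, or 1.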

Phillip N. Sabes. On the duality of motor cortex: movement representation and dynamical machine
He poses the question of whether activity in the motor cortex is a representation (tuning curve) of motor-related variables, or whether the motor cortex is just generating dynamics for motor output. Also, he says jPCA applied to feed-forward non-normal dynamics shows results similar to Churchland et al. 2012: the dynamics are not necessarily oscillating. He suggested that dynamics is the way to interpret them, but in the end the neurons were also tuned.

Workshops—Day 5

Randy Bruno. The neocortical circuit is two circuits. (Why so many layers and cell types? workshop)
By studying the thalamic input to the cortex in a sedated animal, he found that the thalamic axons synapse 80% onto layer 4 and 20% onto layer 5/6 (more 5 than 6; see Oberlaender et al. 2012). He blocked the activity in L4 and above, which did not change the L5/6 membrane potential response to whisker deflection. He suggested that L1/2/3/4 and L5/6 are two different circuits that can function independently.

Alex Huk. Temporal dynamics of sensorimotor integration in the primate dorsal stream (Neural mechanism for orienting decisions across the animal kingdom workshop)

Workshops—Day 6

Matteo Carandini. Adaptation to stimulus statistics in visual cortex (Priors in perception, decision-making and physiology workshop)
Matteo showed adaptations in LGN and V1 due to changes in the input statistics. For LGN, the position of the stimulus was used, which in turn shifted V1 receptive fields (V1 didn’t adapt, it just didn’t know about the adaptation in LGN). For V1, random full-field orientation was used (as in Benucci et al. 2009) but with a sudden change in the distribution over orientations. The effect on V1 tuning could be explained by changes in the gain of each neuron and of each stimulus orientation. This equalized the population (firing rate) response. [EDIT: this is published in Nat Neurosci 2013]

Eero Simoncelli. Implicit embedding of prior probabilities in optimally efficient neural populations (Priors in perception, decision-making and physiology workshop)
Eero presented an elegant theory (work with Deep Ganguli presented in NIPS 2010; Evan’s review) of optimal tuning curves given the prior distribution. He showed that 4 visual and 2 auditory neurophysiology and psychophysics results can be explained well by it.

Albert Lee. Cellular mechanisms underlying spatially-tuned firing in the hippocampus (Dendritic computation in neural circuits workshop)
Among the place cells there are also silent neurons in CA1. Using impressive whole-cell patch recordings from CA1 cells in awake, freely moving mice, he showed that the silent cells not only don’t spike, but also lack tuned membrane potential fluctuations. However, by injecting current into the cells so that they would have a higher membrane potential (closer to the threshold), they successfully activated the silent cells and made them place cells (Lee, Lin and Lee 2012).

Marina Garrett. Functional and structural mapping of mouse visual cortical areas (A new chapter in the study of functional maps in visual cortex workshop)
She used intrinsic imaging to find continuous retinotopic maps. Using the gradient of the retinotopy, in combination with the eccentricity map, she defined borders of visual areas. She defined 9 (or 10?) areas surrounding V1 (Marshel et al. 2011). Several areas had temporal selectivity, while others had temporal selectivity, which are the hallmark of parietal and temporal pathways (dorsal and ventral in primates). She also found connectivity patterns that were increasingly multi-modal for higher areas.

Books that greatly influenced me

I highly recommend these beautifully written texts.

1. The Selfish Gene (1976) by Richard Dawkins [worldcat][amazon]

This was my introduction to the meme of Darwinism, and to memes themselves. I read it when I was 13 or 14 years old. My world view has been fundamentally changed ever since.

2. The Emperor’s New Mind (1989) by Roger Penrose [worldcat][amazon]

I read this book when I was 18 years old. I liked it so much that I bought many copies and gave them to my friends as presents. It motivated me to study logic and computation theory as a means to understand the mind. Although the core idea is flawed, the book overall brought me the great joy of thinking about what human minds can do, and how they can do it.

3. The Myth of Sisyphus (1942) by Albert Camus [worldcat][amazon]

In times of despair, when I thought I couldn’t understand this seemingly illogical world and was frustrated by its complexity, this book spoke to me dearly. I was 19 or 20 years old.

4. I am a Strange Loop (2007) by Douglas Hofstadter [worldcat][amazon]

Before this book, I was a pure reductionist (since I was little; my father is a physicist), trying to understand the world by going into the smaller scale of things. Now, I also think about what abstraction can bring to the table—understanding in a different, more humane level. I was in graduate school when it came out.

Spawning a realistic model of the brain?


Originally posted on Pillow Lab Blog:

I (Memming) presented Eliasmith et al. “A Large-Scale Model of the Functioning Brain” Science 2012 for our computational neuroscience journal club. The authors combined their past efforts for building various modules for solving cognitive tasks to build a large-scale spiking neuron model called SPAUN.

They built a downloadable network of 2.5 million spiking neurons (leaky-integrate-and-fire (LIF) units) that has a visual system for static images, working memory for sequences of symbols (mostly numbers), and a motor system for drawing numbers, and that performs 8 different tasks without modification. I was impressed by the tasks it performed (video). But I must say I was disappointed after I found out that it was “designed” to solve each problem by the authors, and combined with a central control unit (basal ganglia) which uses its “subroutines” to solve. Except for the small set of weights specific for the reward task, the network has…

NIPS 2012


NIPS 2012 (proceedings) was held in Lake Tahoe, right next to the state line between California and Nevada. Despite the casinos all around the area, it was a great conference: a lot of things to learn, and a lot of people to meet. My keywords for NIPS 2012 are deep learning, spectral learning, nonparanormal distribution, nonparametric Bayesian, negative binomial, graphical models, rank, and MDP/POMDP. Below are my notes on the topics that interested me. Also check out these great blog posts about the event by Dirk Van den Poel (@dirkvandenpoel), Yisong Yue (@yisongyue), John Moeller, Evan Archer, and Hal Daume III.


Optimal kernel choice for large-scale two-sample tests
A. Gretton, B. Sriperumbudur, D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil, K. Fukumizu

This is an improvement over the maximum mean discrepancy (MMD), a divergence statistic for hypothesis testing using reproducing kernel Hilbert spaces. The statistical power of the test depends on the choice of kernel, and previously, it was shown that taking the max value over multiple kernels still results in a divergence. Here they linearly combine kernels to maximize the statistical power in linear time, using normal approximation of the test statistic. The disadvantage is that it requires more data for cross-validation.
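For context, the linear-time MMD statistic that this line of work builds on averages a kernel expression over non-overlapping pairs of samples, so its cost grows linearly with the sample size. The sketch below uses a single Gaussian kernel on scalar data with a fixed bandwidth; those choices and the function names are my assumptions. The kernel combination and power optimization proposed in the paper are not shown.

```python
import math


def gaussian_kernel(a, b, bandwidth):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def linear_time_mmd2(xs, ys, bandwidth=1.0):
    """Linear-time unbiased estimate of the squared MMD between two samples.

    Averages h = k(x1, x2) + k(y1, y2) - k(x1, y2) - k(x2, y1)
    over consecutive, non-overlapping pairs of samples.
    """
    n_pairs = min(len(xs), len(ys)) // 2
    if n_pairs < 1:
        raise ValueError("need at least two samples from each distribution")
    total = 0.0
    for i in range(n_pairs):
        x1, x2 = xs[2 * i], xs[2 * i + 1]
        y1, y2 = ys[2 * i], ys[2 * i + 1]
        total += (gaussian_kernel(x1, x2, bandwidth) + gaussian_kernel(y1, y2, bandwidth)
                  - gaussian_kernel(x1, y2, bandwidth) - gaussian_kernel(x2, y1, bandwidth))
    return total / n_pairs


near_zero = [0.0, 0.1, 0.2, 0.3]
near_five = [5.0, 5.1, 5.2, 5.3]
print(linear_time_mmd2(near_zero, near_zero), linear_time_mmd2(near_zero, near_five))
```

Because the statistic is an average of independent terms, it is approximately normal for large samples, which is what makes the power calculation in the paper tractable.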

Efficient coding provides a direct link between prior and likelihood in perceptual Bayesian inference
Xue-Xin Wei, Alan Stocker

Several biases observed in psychophysics show repulsion from the mode of the prior, which seems counterintuitive if we assume the brain is performing Bayesian inference. They show that this could be due to asymmetric likelihood functions that originate from the efficient coding principle. The tuning curves, and hence the likelihood functions, under the efficient coding hypothesis are constrained by the prior, reducing the degrees of freedom for the Bayesian interpretation of perception. They show that asymmetric likelihoods could arise under a wide range of circumstances, and claim that repulsive bias should be observed. They also predict that additive noise in the stimulus should decrease this effect.

Spiking and saturating dendrites differentially expand single neuron computation capacity
Romain Cazé, M. Humphries, B. Gutkin

Romain showed that Boolean functions can be implemented by active dendrites. Neurons that generate dendritic spikes can be considered as a collection of AND gates, hence disjunctive normal form (DNF) can be directly implemented using the threshold in the soma as the final stage. Similarly, saturating dendrites (inhibitory neurons) can be treated as OR gates, thus conjunctive normal form (CNF) can be implemented.
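The DNF construction can be written down in a few lines. This is an idealized sketch of the idea as I understood it: each spiking dendritic branch acts as an AND gate over its synapses, and a soma threshold of one active branch acts as the final OR. The function names and the example wiring are hypothetical.

```python
def dendritic_and(inputs, synapses):
    """Idealized spiking dendrite: fires only if all of its synapses are active."""
    return all(inputs[i] for i in synapses)


def neuron_dnf(inputs, branches, soma_threshold=1):
    """Neuron output: the soma fires if enough dendritic branches fire.

    With soma_threshold=1 this is an OR over AND-branches, i.e. a DNF formula.
    """
    active_branches = sum(dendritic_and(inputs, b) for b in branches)
    return active_branches >= soma_threshold


# (x0 AND x1) OR (x2 AND x3): not computable by a single linear threshold unit.
branches = [(0, 1), (2, 3)]
print(neuron_dnf([1, 1, 0, 0], branches), neuron_dnf([1, 0, 1, 0], branches))
```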

Coding efficiency and detectability of rate fluctuations with non-Poisson neuronal firing
Shinsuke Koyama

Hypothesis testing of whether the rate is constant or not for a renewal neuron can be done by decoding the rate from spike trains using empirical Bayes (EB). If the hyperparameter for the roughness is inferred to be zero by EB, the null hypothesis is accepted. Shinsuke derived a theoretical condition for the rejection based on the KL-divergence.

The coloured noise expansion and parameter estimation of diffusion processes
Simon Lyons, Amos Storkey, Simo Sarkka

For a continuous analogue of a nonlinear ARMA model, estimating parameters for stochastic differential equations is difficult. They approach it by using a truncated smooth basis expansion of the white noise process. The resulting colored noise is used for an MCMC sampling scheme.

Bayesian estimation of discrete entropy with mixtures of stick-breaking priors
Evan Archer*, Il Memming Park*, Jonathan W. Pillow (*equally contributed, equally presented)
PYM Estimator NIPS 2012


Diffusion decision making for adaptive k-Nearest Neighbor Classification
Yung-Kyun Noh, F. C. Park, Daniel D. Lee

The authors establish an interesting connection between the sequential probability ratio test (Wald test) for homogeneous Poisson processes with two different rates and k-nearest neighbor (k-NN) classification. The main assumption is that each class density is smooth, so that in the limit of large samples the distribution of nearest neighbors follows a (spatial) Poisson process. Using this connection, several adaptive k-NN strategies motivated by the Wald test are proposed.

TCA: High dimensional principal component analysis for non-gaussian data
F. Han, H. Liu

Using an elliptical copula model (extending the nonparanormal), the eigenvectors of the covariance of the copula variables can be estimated from Kendall’s tau statistic which is invariant to the nonlinearity of the elliptical distribution and the transformation of the marginals. This estimator achieves close to the parametric convergence rate while being a semi-parametric model.
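The key ingredient is that for elliptical distributions the Pearson correlation rho and Kendall’s tau are related by rho = sin(pi * tau / 2), and tau only depends on ranks, so it is unchanged by monotone transformations of the marginals. Here is a minimal sketch of that building block only (tau-a, with ties counted as zero); the function names are mine and the eigenvector estimation of the paper is not shown.

```python
import math


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant pairs - discordant pairs) / total pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            if product > 0:
                score += 1
            elif product < 0:
                score -= 1
    return 2.0 * score / (n * (n - 1))


def tau_to_correlation(tau):
    """Correlation implied by Kendall's tau under an elliptical (copula) model."""
    return math.sin(math.pi * tau / 2.0)


x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
y_warped = [v ** 3 for v in y]  # monotone transformation of the marginal
print(kendall_tau(x, y), kendall_tau(x, y_warped), tau_to_correlation(kendall_tau(x, y)))
```

Filling a matrix with these pairwise estimates gives a correlation matrix estimate whose eigenvectors can then be computed as in ordinary PCA.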

Classification with Deep Invariant Scattering Networks (invited)
Stephane Mallat

How can we obtain a stable, informative, invariant representation? To obtain an invariant representation with respect to a group (such as translation, rotation, scaling, and deformation), one can directly apply a group-convolution to each sample. He proposed an interpretation of deep convolutional networks as learning the invariant representation, and a more direct approach when the invariance of interest is known, which is to use group invariant scattering (hierarchical wavelet decomposition). Scattering is contractive, preserves the norm, and is stable under deformation, hence it generates a good representation for the final discriminative layer. He hypothesized that the stable parts (which lack theoretical invariance) can be learned in deep convolutional networks through sparsity.

Spectral learning of linear dynamics from generalised-linear observations with application to neural population data
L. Buesing, J. Macke, M. Sahani

The Ho-Kalman algorithm is applied to Poisson observations with a canonical link function, and the parameters are then estimated through moment matching. This is a simple and great initializer for EM, which tends to be slow and prone to local optima.

Spectral learning of general weighted automata via constrained matrix completion
B. Balle, M. Mohri

A parametric function from strings to reals known as a rational power series, or equivalently a weighted finite automaton, is estimated with a spectral method. Since the Hankel matrix of prefix-suffix values has structure, a constrained optimization is applied for its completion from data. How to choose the rows and columns of the Hankel matrix remains a difficult problem.

Discriminative learning of Sum-Product Networks
R. Gens, P. Domingos

A sum-product network (SPN) is a nice abstraction of a hierarchical mixture model, and it provides simple and tractable inference rules. In an SPN, all marginals are computable in linear time. Here, discriminative learning algorithms for SPN inference are given. The hard inference variant takes the most probable state, and can overcome gradient dilution.

Perfect dimensionality recovery by Variational Bayesian PCA
S. Nakajima, R. Tomioka, M. Sugiyama, S. Babacan

Previous Bayesian PCA algorithms utilize the empirical Bayes procedure for sparsification; however, this may not be exact inference for recovering the dimensionality. Using random matrix theory, they provide a condition under which the dimension recovered by variational Bayesian inference is exact.

Fully bayesian inference for neural models with negative-binomial spiking
J. Pillow, J. Scott

Pillow & Scott’s negative binomial spiking poster presented by Memming (not an author), opposite side (left) was Mijung Park presenting her poster.


Graphical models via generalized linear models
Eunho Yang, Genevera I. Allen, Pradeep Ravikumar, Zhandong Liu

Eunho introduced a family of graphical models with GLM marginals and Ising model style pairwise interaction. He said the Poisson-Markov-Random-Fields version must have negative coupling, otherwise the log partition function blows up. He showed conditions for which the graph structure can be recovered with high probability in this family.

No voodoo here! learning discrete graphical models via inverse covariance estimation
Po-Ling Loh, Martin Wainwright

I think Po-Ling did the best oral presentation. For any graph with no loops, zeros in the inverse covariance matrix correspond to the absence of conditional dependence. In general, conditional dependencies could theoretically be recovered by triangulating the graph, but the practical cost is high. In practice, graphical lasso is a pretty good way of recovering the graph structure, especially for certain discrete distributions (e.g. Ising model).

Augment-and-Conquer Negative Binomial Processes
M. Zhou, L. Carin

A Poisson process over a gamma process measure is related to the Dirichlet process (DP) and the Chinese restaurant process (CRP). The negative binomial (NB) distribution has an alternative (i.e., not gamma-Poisson) augmented representation as the sum of a Poisson number of logarithmic random variables, which can be used to construct the gamma-NB process. I do not fully understand the math, but it seems like this paper contains gems.
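The two representations of the NB distribution can be compared by simulation. This is only a sketch of the distributional identity, not of the paper's inference scheme. The parameters r and p are arbitrary, with the NB pmf taken proportional to p**m * (1 - p)**r.

```python
import numpy as np

rng = np.random.default_rng(0)
r, p = 3.0, 0.4            # hypothetical NB parameters
n_samples = 200_000

# Compound Poisson: draw L ~ Poisson(-r * log(1 - p)), then sum L logarithmic variables.
n_terms = rng.poisson(-r * np.log1p(-p), size=n_samples)
draws = rng.logseries(p, size=n_terms.sum())
owner = np.repeat(np.arange(n_samples), n_terms)   # which sample each draw belongs to
compound = np.bincount(owner, weights=draws, minlength=n_samples)

# Gamma-Poisson: draw a rate from Gamma(r, scale=p / (1 - p)), then a Poisson count.
rates = rng.gamma(shape=r, scale=p / (1.0 - p), size=n_samples)
gamma_poisson = rng.poisson(rates)

print(compound.mean(), compound.var())
print(gamma_poisson.mean(), gamma_poisson.var())
```

Both constructions should give mean r*p/(1 - p) and variance r*p/(1 - p)**2 up to sampling error.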

Optimal Neural Tuning Curves for Arbitrary Stimulus Distributions: Discrimax, Infomax and Minimum Lp Loss
Zhuo Wang, Alan A. Stocker, Daniel D. Lee

The optimal tuning curve of a rate-limited Poisson neuron changes depending on which loss function in the Lp family is assumed. Zhuo showed that as p goes to zero, the optimal tuning curve converges to that of the maximum information. The derivations assume no input noise, and a single neuron. [edit: we did a lab meeting about this paper]

Bayesian nonparametric models for ranked data
F. Caron, Y. Teh

Assuming that observed partially ranked objects (e.g., top 10 books) have positive real-valued hidden strengths, and assuming a size-biased ranking, they derive a simple inference scheme by introducing an auxiliary exponential variable.
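One way to see why exponential variables appear is the standard "exponential race" view of size-biased ranking: if each item arrives at an exponential time whose rate is its strength, sorting by arrival time gives a size-biased order. The sketch below only illustrates that sampling fact with made-up strengths; it is not the authors' inference scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
strength = np.array([5.0, 3.0, 1.5, 0.5])   # hypothetical positive strengths
n_trials = 100_000

# Item i arrives at an Exponential(rate=strength[i]) time; earlier arrival ranks higher.
arrival = rng.exponential(scale=1.0 / strength, size=(n_trials, strength.size))
rankings = np.argsort(arrival, axis=1)

# Under size-biased ranking, item i is ranked first with probability strength[i] / sum.
first_freq = np.bincount(rankings[:, 0], minlength=strength.size) / n_trials
print(first_freq, strength / strength.sum())
```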


Efficient and direct estimation of a neural subunit model for sensory coding
Brett Vintch, Andrew D. Zaharia, J. Anthony Movshon, Eero P. Simoncelli

We already discussed this nice paper in our journal club. They fit a special LNLN model that assumes a single (per channel) convolutional kernel shifted (and weighted) in space. Brett said the convolutional STC initialization described in the paper works well even when the STC itself looks like noise.

Efficient Spike-Coding with Multiplicative Adaptation in a Spike Response Model
Sander M. Bohte

A multiplicative spike response model is proposed and fit with a fixed post-spike filter shape, an LNP-based receptive field, and a grid search over the parameter space (3D?). This model reproduces the experimentally observed adaptation due to amplitude modulation and the variance modulation. The multiplicative dynamics must have a power-law decay that is close to 1/t, and it somehow restricts the firing rate of the neuron (Fig 2b).

Dropout: A simple and effective way to improve neural networks (invited, replacement)
Geoffrey Hinton, George Dahl

Dropout is a technique that randomly omits units in an artificial neural network to reduce overfitting. Hinton says the dropout method is an efficient way of averaging exponentially many models. It reduces overfitting because hidden units can’t depend on each other reliably. A related paper is on the arXiv.
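A minimal sketch of the idea in NumPy follows. It uses the "inverted" scaling convention (rescale the kept units during training so nothing changes at test time), which is a common variant and an assumption here, not necessarily the exact scheme from the talk. The function name and arguments are hypothetical.

```python
import numpy as np


def dropout(activations, drop_prob, rng, training=True):
    """Randomly zero units with probability drop_prob; rescale the rest (inverted dropout)."""
    if not training or drop_prob == 0.0:
        return activations
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob


rng = np.random.default_rng(0)
hidden = np.ones((4, 8))                 # stand-in for a layer's activations
print(dropout(hidden, 0.5, rng))         # about half the entries are 0, the rest are 2
print(dropout(hidden, 0.5, rng, training=False))
```

Each random mask corresponds to one "thinned" network, so training with fresh masks on every step is what gives the model-averaging interpretation.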

Compressive neural representation of sparse, high-dimensional probabilities
Xaq Pitkow

Naively representing probability distributions is inefficient, since the required resources grow exponentially with the dimension. Using ideas from compressed sensing, Xaq shows that random perceptron units can be used to represent a sparse high-dimensional probability distribution efficiently. The question is what kinds of operations on this representation are biologically plausible and useful.

The topographic unsupervised learning of natural sounds in the auditory cortex
Hiroki Terashima, Masato Okada

Visual cortex is much more retinotopic than auditory cortex is tonotopic. Unlike natural images, natural auditory stimuli have harmonics that give rise to correlations in the frequency domain. Could both primary sensory cortices share the same principle for topographic learning but form different patterns because of differences in the input statistics? The authors’ model is consistent with the hypothesis, and moreover captures nonlinear responses related to pitch perception.

This concludes my 3rd NIPS (NIPS 2011, NIPS 2010)!

Mixture of point processes


Suppose you mix two Gaussian random variables \mathcal{N}(-1, 1) and \mathcal{N}(1, 1) equally, that is, if one samples from the mixture, with probability 1/2 it comes from the first Gaussian, and otherwise from the second. It is evident that the mixture of Gaussians is not a Gaussian. (Do not confuse this with adding two Gaussian random variables, which produces another Gaussian random variable.)
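A quick numerical check of the distinction, assuming unit-variance components centered at -1 and +1: the excess kurtosis is zero for any Gaussian, so a clearly nonzero value for the mixture shows it is not Gaussian, while the sum stays at zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400_000

# Mixture: pick one component per sample (means -1 and +1 are assumed here).
component = rng.integers(0, 2, size=n)
mixture = rng.normal(loc=np.where(component == 0, -1.0, 1.0), scale=1.0)

# Sum of two independent Gaussian variables: still Gaussian.
summed = rng.normal(-1.0, 1.0, size=n) + rng.normal(1.0, 1.0, size=n)


def excess_kurtosis(x):
    """Fourth standardized moment minus 3; equals 0 for a Gaussian."""
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 4) - 3.0


kurt_mixture = excess_kurtosis(mixture)   # theory for this mixture: -0.5
kurt_summed = excess_kurtosis(summed)     # theory: 0
print(kurt_mixture, kurt_summed)
```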

Similarly, a mixture of inhomogeneous Poisson processes results in a non-Poisson point process. The figure below illustrates the difference between a mixture of two Poisson processes (B) and a Poisson process with the same marginal intensity (rate) function (A). The colored bars indicate the rate over the real line (e.g. time); in this case the rate is constant over a fixed interval. The 4 realizations from each process A and B are represented by rows of vertical ticks.
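The difference can also be checked with spike counts. The setup below is a hypothetical stand-in for the figure, not a reconstruction of it: component 1 fires at a constant rate on [0, 1), component 2 at the same rate on [1, 2), and process A is a Poisson process with half that rate on all of [0, 2). Both have the same mean count in each window, but only the Poisson process has independent counts in disjoint windows.

```python
import numpy as np

rng = np.random.default_rng(0)
rate, n_trials = 10.0, 100_000

# Process A: Poisson with rate/2 on both [0, 1) and [1, 2); window counts are independent.
a_first = rng.poisson(rate / 2, size=n_trials)
a_second = rng.poisson(rate / 2, size=n_trials)

# Process B: with probability 1/2 all spikes fall in [0, 1), otherwise all in [1, 2).
pick_first = rng.random(n_trials) < 0.5
b_total = rng.poisson(rate, size=n_trials)
b_first = np.where(pick_first, b_total, 0)
b_second = np.where(pick_first, 0, b_total)

cov_a = np.cov(a_first, a_second)[0, 1]   # close to 0
cov_b = np.cov(b_first, b_second)[0, 1]   # strongly negative
fano_a = a_first.var() / a_first.mean()   # close to 1 for Poisson counts
fano_b = b_first.var() / b_first.mean()   # well above 1 for the mixture
print(cov_a, cov_b, fano_a, fano_b)
```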

Several special cases of mixed Poisson processes have been studied [1]; however, they are mostly limited to modeling over-dispersed homogeneous processes. In theoretical neuroscience, it is necessary to mix arbitrary (inhomogeneous) point processes. For example, to maximize the mutual information between the input spike trains and the output spike train of a neuron model [3], the entropy of a mixture of point processes is needed.

In general, a regular point process on the real line can be completely described by the conditional intensity function \lambda(t|\mathcal{H}_t) where \mathcal{H}_t is the full spiking history up to time t [2]. Let us discretize time to approximate a regular point process. Let \rho_k be the probability of a spike (an event) in the k-th bin of size \Delta, that is,

\rho_k \simeq \lambda(k \Delta|y_{1:k-1}) \Delta,

where y_{1:k-1} are the 0-1 responses in all the previous bins. The likelihood of observing y_k = 0 or y_k = 1, given the history, is simply

P(y_k|y_{1:k-1}, \lambda) = {\rho_k}^{y_k} \left(1 - \rho_k\right)^{1 - y_k}.
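As a rough sketch of this binned representation, the code below draws 0-1 responses bin by bin from a history-dependent intensity and accumulates the per-bin Bernoulli terms into a log-likelihood. The intensity function, its parameters, and all function names are illustrative assumptions.

```python
import numpy as np


def relative_refractory_intensity(k, y_past, base_rate=20.0, dead_bins=5):
    """Hypothetical intensity: the rate drops tenfold for dead_bins bins after a spike."""
    return 0.1 * base_rate if y_past[-dead_bins:].any() else base_rate


def simulate(intensity, n_bins, delta, rng):
    """Draw y_k ~ Bernoulli(rho_k) with rho_k = intensity(k, y_{1:k-1}) * delta."""
    y = np.zeros(n_bins, dtype=int)
    rho = np.zeros(n_bins)
    for k in range(n_bins):
        rho[k] = intensity(k, y[:k]) * delta
        y[k] = rng.random() < rho[k]
    return y, rho


def binned_log_likelihood(y, rho):
    """Sum over bins of y_k * log(rho_k) + (1 - y_k) * log(1 - rho_k)."""
    return float(np.sum(y * np.log(rho) + (1 - y) * np.log1p(-rho)))


rng = np.random.default_rng(0)
y_sim, rho_sim = simulate(relative_refractory_intensity, n_bins=1000, delta=0.001, rng=rng)
log_lik = binned_log_likelihood(y_sim, rho_sim)
print(y_sim.sum(), log_lik)
```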

In the limit of small \Delta, this approximation converges to a regular point process. A fun fact is that a mixture of Bernoulli random variables is Bernoulli again, since it’s the only distribution for 0-1-valued random variables. Specifically, for a family of Bernoulli random variables with probability of 1 being \rho_z indexed by z, and a mixing distribution P(z), the probability of observing one symbol y=0 or y=1 is

P(y) = \int P(y|z)P(z) \mathrm{d}z = \int {\rho_z}^{y} \left(1 - \rho_z\right)^{1 - y} P(z) \mathrm{d}z = {\bar\rho}^{y} \left(1 - \bar\rho\right)^{1 - y}

where \bar\rho = \int \rho_z P(z) \mathrm{d}z is the average probability.
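The Bernoulli fact is easy to confirm by simulation. In this sketch the mixing distribution over \rho_z is a Beta(2, 5), chosen arbitrarily: the mixed 0-1 variable has probability of 1 equal to the average of \rho_z.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000

rho_z = rng.beta(2.0, 5.0, size=n)   # hypothetical mixing distribution over rho_z
y = rng.random(n) < rho_z            # one Bernoulli(rho_z) draw per sampled z

rho_bar = rho_z.mean()               # Monte Carlo estimate of the average probability
print(y.mean(), rho_bar)             # both close to 2/7 for Beta(2, 5)
```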

Suppose we mix \lambda(t|\mathcal{H}_t, z) with P(z). Then, similarly, for the binned point process representation, the above implies that

P(y_k|y_{1:k-1},\lambda) = \int P(y_k|y_{1:k-1},\lambda, z) P(z) \mathrm{d}z = {\bar\rho}_k^{y_k} \left(1 - \bar\rho_k \right)^{1 - y_k}

where \bar\rho_k = \int \rho_k P(z) \mathrm{d}z is the marginal rate. Moreover, due to causal dependence between the y_k's, we can chain the expansion and get the marginal probability of observing y_{1:k},

P(y_{1:k}) = P(y_k|y_{1:k-1}) P(y_{1:k-1}) = P(y_k|y_{1:k-1}) P(y_{k-1}|y_{1:k-2}) \cdots P(y_1)

= \prod_{i=1}^k {\bar\rho}_i^{y_i} \left(1-\bar\rho_i\right)^{1-y_i}.

Therefore, in the limit the mixture point process is represented by the conditional intensity function,

\lambda(t|\mathcal{H}_t) = \int \lambda(t|\mathcal{H}_t, z) P(z) \mathrm{d}z.

Conclusion: The conditional intensity function of a mixture of point processes is given by the expected conditional intensity function over the mixing distribution.


  1. Grandell. Mixed Poisson processes. Chapman & Hall / CRC Press 1997
  2. Daley, Vere-Jones. An Introduction to the Theory of Point Processes. Springer.
  3. Taro Toyoizumi, Jean-Pascal Pfister, Kazuyuki Aihara, Wulfram Gerstner. Generalized Bienenstock–Cooper–Munro rule for spiking neurons that maximizes information transmission. PNAS, 2005. doi:10.1073/pnas.0500495102
