
NIPS 2011


This was my second NIPS (see last year’s NIPS summary). It had a lower acceptance rate of 22% (I served as a reviewer last year and this year). I felt like there were more computational neuroscience related posters than last year (perhaps due to the location in Europe). Non-parametric Bayes, reinforcement learning (MDP), and sparse learning were still big, while there were fewer kernel-related posters. This post is a summary of my experience, and any errors are mine (please let me know if you find any).


Dynamical segmentation of single trials from population neural data
Biljana Petreska, M. Sahani, B. Yu, J. Cunningham, S. Ryu, K. Shenoy, Gopal Santhanam

A randomly switching piecewise-linear dynamical system model is constructed via discrete latent states. Given a state, the dynamics of the spiking neurons are assumed to be linear. The model is fit to 105 simultaneously recorded neurons (Utah array) during a motor task. The number of states was chosen heuristically. This is an unsupervised method that automatically captures the structure of the dynamics. The results suggest that the neurons tend to be in a linear dynamical state both while waiting for the go-cue and during early movement, and go through nonlinear dynamical transitions in between.
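To make the generative idea concrete, here is a minimal sketch of a randomly switching linear dynamical system. It only simulates the latent trajectory; the spiking observation model and the fitting procedure from the paper are not shown. The function name, the 2x2 matrices, the transition probabilities, and the noise level are all my own illustrative assumptions.

```python
import random

def simulate_switching_lds(A_list, trans, steps, x0, noise_std=0.05, seed=0):
    """Simulate a latent trajectory of a randomly switching linear system.

    A_list: one dynamics matrix per discrete state (illustrative values only).
    trans:  row-stochastic Markov transition matrix over the discrete states.
    """
    rng = random.Random(seed)
    state, x = 0, list(x0)
    states, xs = [], []
    for _ in range(steps):
        # Sample the next discrete state from the Markov chain.
        state = rng.choices(range(len(A_list)), weights=trans[state])[0]
        A = A_list[state]
        # Within a state the dynamics are linear, plus Gaussian noise.
        x = [sum(A[i][j] * x[j] for j in range(len(x))) + rng.gauss(0.0, noise_std)
             for i in range(len(x))]
        states.append(state)
        xs.append(x)
    return states, xs

# Hypothetical example: a decaying state and a slowly rotating state.
A_list = [[[0.9, 0.0], [0.0, 0.9]], [[0.99, -0.1], [0.1, 0.99]]]
trans = [[0.95, 0.05], [0.10, 0.90]]
states, xs = simulate_switching_lds(A_list, trans, steps=200, x0=[1.0, 0.0])
```

Each segment with a constant discrete state is a plain linear system; the nonlinearity of the overall trajectory comes only from the switches.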

Inferring spike-timing-dependent plasticity from spike train data
Ian H. Stevenson, Konrad P. Kording

Different synapses have different forms of STDP, and while spike train data are abundant, in vivo whole-cell recordings are very difficult. Hence, learning the synaptic plasticity rule from spike train observations alone is of great importance. This is one of my long-term goals as well. They fit a unidirectionally coupled GLM with a binned weight modulation function that depends on the time since the previous presynaptic spike. The results are promising for simulated models. I’d love to see it applied to well-controlled real data.

Active dendrites: adaptation to spike-based communication
Balázs B Ujfalussy, Máté Lengyel

In the presence of correlated presynaptic population activity, to compute a function of presynaptic voltage online from spikes, the neuron has to be nonlinear. In particular, this paper links it to the nonlinear summation property of dendrites. In previous work, Pfister, J., Dayan, P., and Lengyel, M. (2010) explained the role of short-term plasticity (dynamical synapse model) as an optimal predictor of the presynaptic membrane potential for a single neuron. This work extends it to the population case.

From stochastic nonlinear integrate-and-fire to generalized linear models
Skander Mensi, Richard Naud, Wulfram Gerstner

This poster shows that given a stochastic (adaptive-exponential) leaky integrate-and-fire neuron model, it is possible to construct a nearly equivalent GLM (in the form of a spike response model (SRM) with escape noise). The sub-threshold dynamics are linearized to provide the linear filter (corresponding to the impulse response) and the reset/refractoriness part of the history filter, while spike adaptation is captured as a slower time scale component of the history filter. The link function can then be estimated through the empirical observation that \log(-\log(p(V))) is close to linear. (I was totally thrown off by the notation p(V), which was the probability of spiking given a membrane potential, not the marginal distribution of the model's voltage.)

Gaussian process modulated renewal processes
Vinayak Rao, Yee Whye Teh

This is an extension of the work of R. P. Adams, I. Murray and D. J. C. MacKay on Poisson intensity estimation to hazard-rate-modulated renewal processes. The basic ideas are similar: use a sigmoidal link function, and use a thinning-like procedure for point processes to sample exactly.
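As a reminder of how thinning with a sigmoidal link works, here is a minimal sketch for the simpler Poisson case (not the renewal extension of this paper). The intensity is assumed to be lam_max * sigmoid(g(t)), so it is bounded by lam_max. In the actual models g would be a Gaussian process draw; here it is just a fixed function, and all names are my own.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sample_by_thinning(g, lam_max, T, seed=0):
    """Sample event times on (0, T] with intensity lam_max * sigmoid(g(t)).

    g is a stand-in for a latent function (a GP draw in the real model).
    """
    rng = random.Random(seed)
    t, events = 0.0, []
    while True:
        # Candidate points come from a homogeneous Poisson process of rate lam_max.
        t += rng.expovariate(lam_max)
        if t > T:
            break
        # Keep each candidate with probability sigmoid(g(t)) <= 1.
        if rng.random() < sigmoid(g(t)):
            events.append(t)
    return events

# Hypothetical latent function for illustration.
events = sample_by_thinning(lambda t: 2.0 * math.sin(t), lam_max=5.0, T=20.0)
```

Because the sigmoid keeps the acceptance probability in (0, 1), no numerical bound on the intensity has to be searched for, which is what makes the sampling exact.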


Learning in Hilbert vs. Banach spaces: A measure embedding viewpoint
Bharath K. Sriperumbudur, Kenji Fukumizu, Gert R. G. Lanckriet

Kernel embedding of probability distributions and the induced divergence is an emerging direction in kernel methods. The divergence is related to the Bayes risk of the Parzen window classifier in particular, and this paper extends the results to Banach spaces. For a Banach space with a norm that is uniformly Fréchet differentiable and uniformly convex, there is a semi-inner product inducing a reproducing kernel Banach space (RKBS), which has properties analogous to an RKHS. They showed that the kernel embedding is injective when the kernel is the Fourier transform of a signed measure (cf. Bochner’s theorem, which requires a positive measure for positive definiteness). The resulting divergence is not computable unless the semi-metric is of a special form, and the convergence rate turns out to be at best the same as in the RKHS case.

Modelling genetic variations with fragmentation-coagulation processes
Yee Whye Teh, Charles Blundell, Lloyd T. Elliott

Similar to the Chinese restaurant process (CRP) for clustering, the temporal evolution of clusters by fragmentation (breaking a table into two tables) and coagulation (merging two tables) can be described as a fragmentation-coagulation process (FCP). They show that the FCP is exchangeable, reversible, and has the CRP as its asymptotic distribution.
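For readers who have not seen the CRP, here is a minimal sketch of its sequential seating rule. It covers only the static CRP, not the fragmentation and coagulation dynamics of the paper; the function name and the concentration value are my own choices.

```python
import random

def sample_crp(n_customers, alpha, seed=0):
    """Seat customers one by one according to the Chinese restaurant process.

    alpha > 0 is the concentration parameter.
    """
    rng = random.Random(seed)
    tables = []        # tables[k] = number of customers at table k
    assignments = []   # assignments[i] = table index of customer i
    for _ in range(n_customers):
        # An existing table k has weight tables[k]; a new table has weight alpha.
        weights = tables + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(tables):
            tables.append(1)   # open a new table
        else:
            tables[k] += 1     # join an existing table
        assignments.append(k)
    return assignments, tables

assignments, tables = sample_crp(100, alpha=1.5)
```

In this picture, a fragmentation event splits one entry of `tables` into two and a coagulation event merges two entries into one.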

Priors over recurrent continuous time processes [code]
Ardavan Saeedi, Alexandre Bouchard-Côté

This paper received the best student paper award this year, and Ardavan is only a master's student! The problem he is interested in is a continuous-time series that depends on a discrete latent state and is only partially observed, for example, a recurrent disease with coarsely quantified states. He introduces the Gamma-exponential process, which gives a prior over infinite Markovian transition rate matrices, extends it to the hierarchical case, and shows how to do inference.

Kernel Beta process
Lu Ren, Yingjian Wang, David Dunson, Lawrence Carin (none of the authors made it to the conference)

The Beta process is a distribution over discrete random measures where each “stick” is in [0, 1], but the sticks do not sum to 1 as they do for the Dirichlet process (DP). In this paper, they smooth the sticks in relation to covariates through a kernel, such that their heights are correlated. The kernel here does not have to be positive definite; it only has to be a bounded positive function (like a pdf). I was curious whether a similar approach could be taken for the DP; it turns out this was originally done in a similar fashion for the DP by Dunson and Park (2008) (‘kernel stick breaking process’).

Sparse estimation with structured dictionaries
David Wipf

Given an ill-posed problem Y = \Phi X + \epsilon, where the dictionary \Phi and the observation Y are known, under a sparsity assumption this can be solved with l_1 regularization when \Phi is incoherent (roughly independent columns). However, a more structured dictionary can cause problems. This paper alleviates the problem by transforming the sparse variables, which effectively re-normalizes them. It turns out the solution is similar to iteratively reweighted l_1 with a different penalty. [Workshop version recording]
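For reference, the plain l_1-regularized baseline (not the method of this paper) can be solved by iterative soft thresholding. Below is a minimal sketch for a single observation vector y, minimizing 0.5*||y - Phi x||^2 + lam*||x||_1. The function names are my own, and the step size is assumed to be at most 1 over the largest eigenvalue of Phi^T Phi.

```python
def soft_threshold(v, t):
    """Shrink each entry of v toward zero by t (proximal operator of t*||.||_1)."""
    return [max(abs(x) - t, 0.0) * (1 if x > 0 else -1) for x in v]

def ista(Phi, y, lam, step, n_iter=500):
    """Iterative soft thresholding for 0.5*||y - Phi x||^2 + lam*||x||_1.

    Phi is a list of rows; step is assumed small enough for convergence.
    """
    m, n = len(Phi), len(Phi[0])
    x = [0.0] * n
    for _ in range(n_iter):
        # Residual r = Phi x - y.
        r = [sum(Phi[i][j] * x[j] for j in range(n)) - y[i] for i in range(m)]
        # Gradient of the quadratic term: Phi^T r.
        grad = [sum(Phi[i][j] * r[i] for i in range(m)) for j in range(n)]
        # Gradient step followed by shrinkage.
        x = soft_threshold([x[j] - step * grad[j] for j in range(n)], step * lam)
    return x

# Hypothetical toy problem with an identity dictionary.
Phi = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
x_hat = ista(Phi, y=[3.0, 0.05, -2.0], lam=0.1, step=1.0)
```

With an identity dictionary the solution is simply the soft-thresholded observation, so the small second entry is set to zero. The trouble described above shows up when the columns of Phi are strongly correlated.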

Sequence learning with hidden units in spiking neural networks
Johanni Brea, Walter Senn, Jean-Pascal Pfister

Given a point process, the problem is to train a spiking neural network composed of GLM units (including hidden units) that would generate the training patterns. Minimization of the KL-divergence between the given point process and the one parameterized by the GLM is done by online gradient descent. The gradient requires marginalization over the spikes of the hidden units: \frac{-\partial D_{KL}(p^\ast || p_\theta)}{\partial \theta} = E\left[ \frac{\partial \log p(v,h;\theta)}{\partial \theta} \right], so they developed an importance sampling scheme where the samples from the hidden units are obtained given the training spikes. The resulting training rule is Hebbian, and analogous to STDP. The results are shown for the case where the given distribution is a delta, that is, when the network has to produce exactly one pattern, and that pattern only.
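The reason hidden spikes have to be marginalized can be seen from a short derivation. This is my own sketch, assuming v denotes the visible spike trains, h the hidden ones, and that the sum over h stands formally for an integral over hidden spike trains:

```latex
\begin{aligned}
-\frac{\partial}{\partial \theta} D_{KL}(p^\ast \,\|\, p_\theta)
  &= \frac{\partial}{\partial \theta} E_{p^\ast(v)}\left[ \log p(v;\theta) \right] \\
  &= E_{p^\ast(v)}\left[ \frac{1}{p(v;\theta)} \sum_h \frac{\partial p(v,h;\theta)}{\partial \theta} \right] \\
  &= E_{p^\ast(v)}\left[ E_{p(h \mid v;\theta)}\left[ \frac{\partial \log p(v,h;\theta)}{\partial \theta} \right] \right]
\end{aligned}
```

The first line uses the fact that the entropy of p^\ast does not depend on \theta. The inner expectation is over hidden spikes conditioned on the visible ones, which is the part that needs sampling.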


I presented Bayesian Spike Triggered Covariance analysis as a poster.


Empirical models of spiking in neural populations
Jakob H. Macke, Lars Büsing, John P. Cunningham, Byron M. Yu, Krishna V. Shenoy, Maneesh Sahani

A comparison study between a coupled GLM and a latent variable model (Poisson linear dynamical system) for fitting motor cortex observations (preparation phase only). While the GLM explicitly allows only coupling inputs from the spiking history of the population, the latent variable model allows a low-dimensional hidden common input source with linear dynamics. They show that the latent variable model fits better and could reconstruct the cross-correlations while the GLM couldn’t. There was quite a bit of discussion on the floor after the oral presentation. The difference in performance was probably due to (1) the relatively large bin size (10 ms), and (2) the neurons being recorded with a Utah array, which means a low probability of direct connectivity. The coupled GLM was successfully applied to the retina, where the coupling is local and the sampling of the neurons was very dense, with a 0.1 ms bin size. It would be interesting to see further developments of latent variable models and GLMs in modeling such motor system data.


Hierarchical algorithms for χ-armed bandits
Rémi Munos

This was a non-Bayesian invited talk for the Bayesian optimization, experimental design and bandits workshop. He talked about his paper for the main conference, “Optimistic Optimization of a Deterministic Function without the Knowledge of its Smoothness”. In this case, the smoothness assumption comes from f(x^*) - f(x) \leq l(x^*, x), where l is a semi-metric, and hence the function is bounded from below around the global maximum by the semi-metric. Then, using a hierarchical partitioning of the input space that respects the semi-metric, one can get a bound on the function. (This assumption is neither weaker nor stronger than Lipschitz continuity, since the absolute value is missing and it only holds relative to the maximum.) When the knowledge of the semi-metric is perfect, the convergence rate of the simple regret (best function value) can be exponential (depending on the semi-metric; multiple semi-metrics can give the bound but the convergence rate can differ). When the semi-metric is unknown and one overestimates the exponent, for example, global convergence is not guaranteed.
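Here is a minimal sketch of the known-semi-metric case on [0, 1], in the spirit of optimistic optimization with a hierarchical partition. It is my own simplification, not the algorithm from the talk: I assume l(x, y) = c*|x - y|**alpha is known, split cells into three equal parts, and always expand the cell with the largest optimistic bound. The unknown-smoothness case that the paper addresses is not covered.

```python
import heapq

def doo_maximize(f, c, alpha, n_evals):
    """Optimistic optimization of f on [0, 1].

    Assumes f(x*) - f(x) <= c * |x* - x|**alpha for the maximizer x*.
    """
    def make(a, b):
        mid = 0.5 * (a + b)
        val = f(mid)
        # If x* lies in [a, b], then f(x*) <= val + c * (half width)**alpha.
        bound = val + c * (0.5 * (b - a)) ** alpha
        return (-bound, a, b, mid, val)   # negated bound: heapq is a min-heap

    heap = [make(0.0, 1.0)]
    best_x, best_val = heap[0][3], heap[0][4]
    evals = 1
    while evals + 3 <= n_evals:
        # Expand the cell with the highest optimistic bound.
        _, a, b, _, _ = heapq.heappop(heap)
        w = (b - a) / 3.0
        for k in range(3):
            # The middle child re-evaluates the parent's center; kept for simplicity.
            cell = make(a + k * w, a + (k + 1) * w)
            heapq.heappush(heap, cell)
            if cell[4] > best_val:
                best_x, best_val = cell[3], cell[4]
            evals += 1
    return best_x, best_val

# Hypothetical test function with maximum at x = 0.3.
best_x, best_val = doo_maximize(lambda x: -(x - 0.3) ** 2, c=1.0, alpha=2.0, n_evals=100)
```

For this test function the assumed semi-metric is exact, so cells far from 0.3 are never expanded again once their bound drops below the best value found nearby.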

Dynamic Batch Bayesian Optimization
Javad Azimi, Ali Jalali, Xiaoli Fern

When parallel experiments are possible, experimental design with batch sampling can improve the efficiency, but sequential design often performs better than batch design. Under the assumption that the maximum of the function has a known bound, and using the GP predictive covariance, they choose a set of points that are loosely independent, and could improve the criterion.

Future information minimization as PAC Bayes regularization in Reinforcement Learning
Naftali Tishby

This was the last invited talk for the New frontiers in model order selection workshop. Tishby talked about reinforcement learning in a POMDP setup, but I couldn’t fully follow (in fact it mostly went over my head). In a perception-action cycle, the Bellman equation describes the world evolution and associated reward, and he describes a counterpart for the agent (mental state?) using an associated Bellman equation with information-to-go (mutual information with respect to a goal). Then he describes reinforcement learning as a coding problem (relating to Kraft’s inequality, which says a subtree of an optimal coding tree is an optimal coding tree). At some point, he reaches a PAC-Bayesian bound, and claims that reinforcement learning self-regularizes.

Between the philosophy of science and machine learning
David Corfield [U of Kent]

This was the first invited talk for the Philosophy and machine learning workshop. He talked about a broad range of philosophers (of science) and a couple of examples of their interaction with ML. The first example was Karl Popper’s idea of the complexity of a theory in terms of falsifiable dimensions and its similarity to the VC dimension (see their paper in 2009 for details). The second example was Judea Pearl’s use of counterfactuals (by David Lewis), and its impact on the philosophy of science. He talked about what kinds of sciences can benefit from ML, certainly the ones with lots of data. He also went through the ideas of many philosophers, including Popper, Carnap, Kuhn, Lakatos, and Feyerabend. It is certainly a very fascinating area, but my impression was that we don’t have much to talk about yet.

[Google is hosting some workshop videos freely online] [Hal Daumé III’s blogpost on NIPS 2011]
