Skip to content




This year, we had the first tutorial session for COSYNE. From the beginning of COSYNE, there was a demand & plan for a tutorial session (according to Alex Pouget). There were 293 registered for the tutorial, and it was their first COSYNE for [155, 208] (95% CI) participants. Everybody who gave me feedback was very happy with Jonathan Pillow‘s 3.5 hour lecture (slides & code) on statistical modeling techniques for neural signals, so we are planning to run another tutorial next year at #COSYNE19 Lisbon, Portugal.

Main meeting

Basic stats: 857 registrations, 709 abstracts submitted, 396 accepted (55.8%) which was increased from 330 at Salt lake city thanks to a bigger venue at Denver (curiously, it used to be where NIPS was up till 2000).

My biased keywords of #COSYNE18: neural manifold, state space, ring attractor, recurrent neural network. Those dynamical system views are going strong.

Below are just some notes for my future self.

Tiago Branco, Computation of instinctive escape decisions (invited talk)Screen Shot 2018-03-12 at 9.14.29 AM
Beautiful dissecting of escape decision and response circuit in a looming dark circle task. He showed that mSC (superior colliculus) is causally involved in threat evidence representation, and the immediate downstream dPAG (periaqueductal gray) causally represents escape decision  Interestingly, the synaptic connection from mSC to dPAG is weak and unreliable (fig from bioRxiv), possibly contributing to the threshold computation for escape decision.

Iain D. Couzin, Collective sensing and decision-making in animal groups: From fish schools to primate societies (Gatsby lecture)Screen Shot 2018-03-12 at 9.54.52 AM
Ian showed how he went from theoretical models of swarm behavior to virtual reality for studying interacting fish (I thought this is how you spend money in science!), how the group can solve spatial optimization problems without any individual having access to the gradient (PNAS 2009), how collective consensus can go from following strongly biased minority to a “democratized” decision making as the swarm size increased (Science 2017), and also how the collective transitions from averaging to winner take all (fig from Trends CogSci 2009).

I-80. Michael Okun, Kenneth Harris. Frequency domain structure of intrinsic infraslow dynamics of cortical microcircuits
In the time scale of tens of seconds (infraslow), they showed that inter-spike intervals alone does not, but with matching power spectral density does explain much of the slow variations. (Spikes with matching spectra and ISI were generated using a variation of amplitude adjusted Fourier transform).

I-15. Rudina Morina, Benjamin Cowley, Akash Umakantha, Adam Snyder, Matthew Smith, Byron Yu. The relationship between pairwise correlations and dimensionality reduction
From paired recordings, the spike count correlation distribution is often reported as evidence of low-dimensional activity. From population recordings, factor analysis is often used as measures of neural dimensionality. How do these two relate? Both quantities only depend on the covariance matrix, hence they investigate how the mean and standard deviation of spike count correlation (rSC) relate to dimensionality as a function of shared variance using the generative model of factor analysis. They found an interesting trend: low-dim can be either large mean rSC with small std rSC or small mean and large std (which can be shown by rotating the loadings matrix).

I-13. Lee Susman, Naama Brenner, Omri Barak. Stable memory with unstable synapses
Using anti-Hebbian learning rule latex-image-1, they stored memories in limit cycles instead of stable fixed points as in traditional Hopfield networks. Also, they added a Chaotic non-stationary dynamics in the symmetric portion of the network which made the network continuously fluctuate and escape limit cycles (without losing the memory).

T-14. Emily Mackevicius, Andrew Bahle, Alex Williams, Shijie Gu, Natalia Denissenko, Mark Goldman, Michale Fee. Unsupervised discovery of neural sequences in large-scale recordingsScreen Shot 2018-03-12 at 10.38.25 AM
Conventional low-rank decomposition of temporal data matrix cannot find sequential structure. They have extended the convolutional non-negative matrix factorization (Non-negative Matrix Factor Deconvolution; Smaragdis 2004, 2007) for neural data, and called it SeqNMF. (MATLAB code on github; thanks for the quick bug fix @ItsNeuronal; bioRxiv 273128). It’s performance on syllable extraction and spike sequence detection was very nice.

Timothy Behrens. Building models of the world for behavioural control (invited talk)Screen Shot 2018-03-12 at 1.06.50 PM
I usually ignore fMRI talks, but the 6-fold symmetry of conceptual space was very cool. In the “canonical” bird neck & leg length space, OFC and other areas showed grid-network like signal modulations (Science 2016).

CATNIP lab had several posters on the 2nd day:

Marlene Cohen. Understanding the relationship between neural variability and behavior (invited)Screen Shot 2018-03-12 at 1.33.48 PM
If correlated variability is important, it should (1) be related to performance, (2) related to individual perceptual decisions, (3) be selectively communicated between brain areas. She showed that the first principal component of noise correlation, but not the signal encoding direction, is most correlated with the choice. This was just recently published in Science 2018.

T-24. Caroline Haimerl, Eero Simoncelli. Shared stochastic modulation can facilitate biologically plausible decoding
Noise correlation tends to be in the direction of strongest decoding signal for unknown reason (Lin et al. 2015; Rabinowitz et al. 2016). They used the neural response weighted by the shared gain modulation w_n_=_int_r_n(t) for decoding, which was near optimal.

Byron Yu. Brain-computer interfaces for basic science (invited)Screen Shot 2018-03-13 at 11.00.50 AM
They used BCI to study how the monkey can change its output in a short time scale within the “neural manifold”. Surprisingly, the neural repertoire (distribution of possible population firing patterns) does not shift nor change in shape, but mostly reassigns meaning! (Nature Neuroscience 2018) The animal can learn out of manifold perturbation as well, but that takes days (as detailed in Emily Oby‘s talk (T-25) followed right after).

T-26. Evan Remington, Devika Narain, Eghbal Hosseini, Mehrdad Jazayeri. Control of sensorimotor dynamics through adjustment of inputs and initial conditionScreen Shot 2018-03-13 at 11.14.52 AM
In a ready-set-go time interval production task with variable gain (animal has to reproduce 1.5 times the duration sometimes), the mean neural activity of the population forms a #neuralManifold. On the interval subspace, the two temporal gains produced identical mean trajectories, while on the gain subspace, they were separated. (bioRxiv 261214)

Máté Lengyel. Sampling: coding, dynamics, and computation in the cortex (invited)
If the population neural activity represents samples from the posterior, what neural dynamics would produce them (Rubin et al. 2015)? He showed that a stochastic stabilized supralinear network (SSN) with a ring architecture (not ring attractor) can sample and also reproduce neurophysiological temporal dynamics such as on/off-set response and quenching of variability (bioRxiv 2016). Also he trained an RNN to amortize inference of a simple Gaussian scale mixture model of vision, and the solution found by the RNN turns out to be non-detailed balance solution to sampling (as demonstrated by the anti-symmetric part of cross-correlation over time).

T-37. Lucas Pinto, David Tank, Carlos Brody, Stephan Thiberge. Widespread cortical involvement in evidence-based navigation
Using a transparent skull animal on Poisson towers VR decision-making task (Front. Behav. Neurosci. 2018), they found that many areas in the dorsal cortex (V1, SS, M1, RSC, mM2, aM2, etc) were correlated with accumulation of evidence. Further optogenetic inactivation of each area disrupted animal’s performance.

Vivek Jayaraman. Navigational attractor dynamics in the Drosophila brain: Going from models to mechanism (invited)
Beautiful work on the ellipsoid body–protocerebral bridge circuit and their computation involving bump attractor dynamics and path integration.

Joni Wallis. Dynamics of prefrontal computations during decision-making (invited)
Theta-oscillation phase of OFC locks to the trials when the reward criteria were linearly changing. A closed-loop microstimulation of OFC at the peak of theta disrupts learning, possibly due to disrupted theta-locked communication with hippocampus.

III-108. Rainer Engelken, Fred Wolf. A spatiotemporally-resolved view of cellular contributions to network chaos
They implemented a event-based recurrent spiking neural network that is so efficient that they can simulate a very large number (15 million) of neurons and study their dynamics. They quantified Lyapunov exponents efficiently and computed cross-correlation against the participation index.

III-75. KiJung Yoon, Xaq Pitkow. Learning nonlinearities for identifying regular structure in the brain’s inference algorithm
Loopy belief propagation often produces poor inference on non-tree graphical models. Can we do better by training a recurrent neural network to do amortized inference on graphs? The answer is yes, and it can generalize to larger networks and unseen graph structures.

III-121. Dongsung Huh, Terrence Sejnowski. Gradient descent for spiking neural networks
They derived a differentiable synapse which gradually responses to membrane voltage near threshold. The presynaptic neuron still spikes, but the differentiable synapse allows gradient descent training of the recurrent spiking neural network. During training they can slowly make the synapse tighter and tighter to finally reach a non-differentiable synapse (arXiv 2017).


NIPS 2015 workshops


My experience during the 2 days of NIPS workshops after the main meeting (part 1,2,3).

Statistical Methods for Understanding Neural Systems workshop

2015-12-11 17.51.37-1Organized by: Allie Fletcher, Jakob Macke, Ryan P. Adams, Jascha Sohl-Dickstein

Towards a theory of high dimensional, single trial neural data analysis: On the role of random projections and phase transitions
Surya Ganguli

Surya talked about conditions for recovering the embedding dimension of discrete neural responses from noisy single trial observations (very similar to his talk at NIPS 2014 workshop organized by me). He models neural response as R = S U X + Z where S is M \times N sparse sampling matrix, U is a N \times K random orthogonal embedding matrix, X is the K \times P latent manifold driven by P stimulus conditions. Assuming Gaussian noise, and using free probability theory [Nica & Speicher], he shows the recovery condition \frac{\mathrm{var}(X)}{\mathrm{var}(Z)} \sqrt{M \cdot P} \geq K.

Translating between human and animal studies via Bayesian multi-task learning
Katherine Heller

Katherine talked about using a hierarchical Bayesian model and variational inference algorithms to infer linear latent dynamics. She talked about several ideas including (1) structural prior for connectivity, (2) using cross-spectral mixture kernel for LFP [Wilson & Adams ICML 2013Ulrich et al. NISP 2015], (3) combining fMRI and LFP through shared dynamics.

Similarity matching: A new theory of neural computation
Dmitri (Mitya) Chklovskii

Principled derivation of local learning rules for PCA [Pehlevan & Chklovskii NIPS 2015] and NMF [Pehlevan & Chklovskii 2015].

Small Steps Towards Biologically Plausible Deep Learning
Yoshua Bengio

What should hidden layers do in a deep neur(on)al network? He talked about some happy coincidences: What is the objective function for STDP in this setting [Bengio et al. 2015]? Deep autoencoders and symmetric weight learning [Arora et al. 2015]. Energy based models approximates back-propagation [Bengio 2015].

The Human Visual Hierarchy is Isomorphic to the Hierarchy learned by a Deep Convolutional Neural Network Trained for Object Recognition
Pulkit Agrawal

Which layers of various CNN trained on image discrimination task explain the fMRI voxels the best? [Agrawal et al. 2014] shows hierarchy of CNN matches the visual hierarchy and it’s not because of the receptive field sizes.

Unsupervised learning with deterministic reconstruction using What-where-convnet
Yann LeCun

CNN often loses the ‘where’ information in the pooling process. What-where-convnet keeps the ‘where’ information in the pooling stage and use it to reconstruct the image [Zhao et al. 2015].

Mechanistic fallacy and modelling how we think
Neil Lawrence

He came out as a cognitive scientist. He talked about System 1 (fast subconscious data-driven inference which handles uncertainty well) and System 2 (slow conscious symbolic inference that thinks it is driving the body), and how they could talk to one another. Interesting solution to the variations of the trolly problem and how System 1 kicks in and gives the ‘irrational’ answer.

Approximation methods for inferring time-varying interactions of a large neural population (poster)
Christian Donner and Hideaki Shimazaki

Inference on an Ising model with latent diffusion dynamics on the parameters (both first and second order). Due to large number of parameters, it needs multiple trials with identical latent process to make good inference.

Panel discussion: Neil Lawrence, Yann LeCun, Yoshua Bengio, Konrad Kording, Surya Ganguli, Matthias Bethge

Discussion on the interface between neuroscience and machine learning. Are we only focusing on ‘vision’ problems too much? What problems should neuroscience focus on to help advance machine learning? How can datasets and problems change machine learning? Should we train DNN to perform more diverse tasks?

Correlations and signatures of criticality in neural population models (ML and stat physics workshop)
Jakob Macke

Jakob talked about how subsampling neural data to infer different sizes of neural dynamics could lead to misleading conclusions (esp. criticality).

Black Box Learning and Inference workshop

  • Dustin Tran, Rajesh Ranganath, David M. Blei. Variational Gaussian Process. [arXiv 2015]
  • Yuri Burda, Roger Grosse, Ruslan Salakhutdinov. Importance Weighted Autoencoders. [arXiv 2015]
    A tighter lower-bound to the marginal likelihood! Better generative model estimation!
  • Alp Kucukelbir. Automatic Differentiation Variational Inference in Stan. [NIPS 2015][software]
  • Jan-Willem van de Meent, David Tolpin, Brooks Paige, Frank Wood. Black-Box Policy Search with Probabilistic Programs. [arXiv 2015]

NIPS 2015 Part 3


Day 3 and 4 of NIPS main meeting (part 1, part 2). More amazing deep learning results.

Embed to control: a locally linear latent dynamics model for control from raw images
Manuel Watter, Jost Springenberg, Joschka Boedecker, Martin Riedmiller

E2CTo implement optimal control in the latent state space, they used iterative Linear-Quadratic-Gaussian control applied directly to video. A gaussian latent state space was decoded from images through a deep variational latent variable model. One step prediction of latent dynamics was modeled to be locally linear where the dynamics matrices were parameterized by a neural network that depends on the current state. A variant of a variational cost that minimizes instantaneous reconstruction, and also KL divergence between predicted latent and the reconstructed latent. Deconvolution network was used, and as can be seen in the [video online], the generated images are a bit blurry, but iLQG control works well.

Deep visual analogy-making
Scott E. Reed, Yi Zhang, Yuting Zhang, Honglak Lee

Screen Shot 2015-12-15 at 9.48.11 AMSimple vector space embedding of natural words in [Mikolov et al. NIPS 2013] showed “Madrid” – “Spain” + “France” is closest to “Paris”. Authors show that making such analogy in computer generated images is possible through a deep architecture (top figure on the right). To make an aScreen Shot 2015-12-15 at 9.47.56 AMnalogy of the form a : b = c : ?, first three images are encoded via f, then f(b) – f(a) + f(c), or more generally T((f(b)-f(a)), f(c)), is decoded via g to generate the output image. They trained convolutional neural network f such that T(f(b)-f(a)) is close to f(d)-f(c). Decoder with same architecture but with up-sampling instead of pooling is used for g. The performance on simple object transformation and video game character animation are quite impressive! [recorded talk]

Deep convolutional inverse graphics network
Tejas D. Kulkarni, William F. Whitney, Pushmeet Kohli, Josh Tenenbaum

Screen Shot 2015-12-15 at 10.24.09 AMAuthors propose a CNN autoencoder and training method that aims to infer ‘graphics parameters’ such as lighting and viewing angle from images. Usually the deep latent variables are hard to interpret, but here they force interpretability by training subsets of latent variables only (holding others constant) and using input with the corresponding invariance. The resulting ‘disentangled’ representation learns a meaningful approximation of a 3D graphics engine. Trained via SGVB [Kingma & Welling ICLR 2014].

Action-conditioned video prediction using deep networks in Atari games
Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L. Lewis, Satinder Singh

In model based reinforcement learning, predicting the next state given the current state and action accurately is a key operation. Authors show very impressive video prediction given a couple of previous frames of Atari games and a chosen action. Hidden state is estimated using CNN, temporal correlation is learned using LSTM, and action interacts multiplicatively with the state. They used curriculum learning to make increasingly long prediction sequences with SGD + BPTT. They replaced the model-free DQN [Minh et al. NIPS 2013 workshop] with their model. See impressive results at [online videos and supplement] for yourself!

Empirical localization of homogeneous divergences on discrete sample spaces
Takashi Takenouchi, Takafumi Kanamori

Screen Shot 2015-12-15 at 11.07.11 AMIn many discrete probability models the (computationally intractable) normalizer for the distribution often hinders efficient estimation for high dimensional data (e.g., Ising model). Instead of using KL-divergence (equivalent to MLE) between the model and empirical distribution, if a homogeneous divergence which invariant under scaling of the underlying measure, we might be able to circumvent the difficulty. Authors use the pseudo-spherical (PS) divergence [Good, I. J. (1971)] and a trick to weight the model by the empirical distribution to make a convex optimization procedure for obtaining near MLE solution.

Equilibrated adaptive learning rates for non-convex optimizationScreen Shot 2015-12-15 at 11.28.36 AM
Yann Dauphin, Harm de Vries, Yoshua Bengio

Improving condition number of Hessian is important for SGD convergence speed. Authors revived equilibrated preconditioner as an adaptive learning rate for SGD.

Computational principles for deep neuronal architectures (invited talk)
Haim Sompolinsky

(1) In many early sensory systems, there’s an expansion of representation to a larger number of downstream neurons with sparser representation. This expansion ratio is around 10-100 times, and sparseness of 0.1-0.01 (fraction of neurons active). In [Babadi & Sompolinsky Neuron 2014], they derived how random connection is worse than hebbian learning for a certain scaling and sparseness constraints for representing cluster identities in the input space. (2) How about stacking such layers? Hebbian synaptic learning squashes noise as the network gets deeper. (3) Learning context-dependent influence as mixed (distributed) representation. [Mante et al. Nature 2013] is not biologically feasibly learned. Interleave sensory and context signal into deep structure with hebbian learning to solve it. (4) Extend perceptron theory for learning point clouds to manifold clouds (i.e., line segments, and L-p balls).

Efficient exact gradient update for training deep networks with very large sparse targets
Pascal Vincent, Alexandre de Brébisson, Xavier Bouthillier

If output is very high-dimensional, but sparse, as in the classification with large number of categories, the gradient computation bottleneck is the last layer. Authors propose a clever computational trick to compute gradient efficiently.

Attractor network dynamics enable preplay and rapid path planning in maze-like environments
Dane S. Corneil, Wulfram Gerstner

Hippocampal network can produce a sequence of activation (at rest) that represents goal-directed future plans. By taking the eigendecomposition of the Markov transition matrix of the maze, they obtain the ‘successor representation’ [Dayan NECO 1993] and implement it with a biologically plausible neural network.

A tractable approximation to optimal point process filtering: application to neural encoding
Yuval Harel, Ron Meir, Manfred Opper

By taking the limit of large number neurons with tuning curve centers drawn from a Gaussian, they derive a near optimal point process decoding framework. By optimizing on a grid, they derive the theoretically optimal Gaussian that minimizes MSE.

Bounding errors of expectation propagation
Guillaume P. Dehaene, Simon Barthelmé

New theory showing that EP converges faster than Laplace approximation.

Nonparametric von Mises estimators for entropies, divergences and mutual information
Kirthevasan Kandasamy, Akshay Krishnamurthy, Barnabas Poczos, Larry Wasserman, James M. Robins

Use plug-in estimator for divergences using kernel density estimator and correct for bias using von Mises expansion. It works well in low dimensions (up to 6?). [matlab code on github]


NIPS 2015 Part 2


Continuing from Part 1 (and Part 3) from Neural Information Processing Systems conference 2015 at Montreal.

Linear Response Methods for Accurate Covariance Estimates from Mean Field Variational Bayes
Ryan J. Giordano, Tamara Broderick, Michael I. Jordan

Variational inference often results in factorized forms of approximate posterior that are tighter than the exact Bayesian posterior. Authors derive a method to recover the lost covariance among parameters by perturbing the posterior. For exponential family variational distribution, a simple closed form transformation involving the Hessian of the expected log posterior. [julia code on github]

Solving Random Quadratic Systems of Equations Is Nearly as Easy as Solving Linear Systems
Yuxin Chen, Emmanuel Candes

RScreen Shot 2015-12-09 at 6.24.42 AMecovering x given noisy squared measurements, e.g., y_i \sim \mathrm{Poisson}(|\mathbf{a_i}^\top\mathbf{x}|^2) is a nonlinear optimization problem. Authors show that using (1) a truncated spectral initialization, and (2) truncated gradient descent, exact recovery can be achieved with optimal sample O(n) and time complexity O(nm) for iid gaussian \mathbf{a_i}. Those truncations essentially remove responses and corresponding gradients that are larger than typical values. This stabilizes the optimization. [matlab code online] [recorded talk]

Closed-form Estimators for High-dimensional Generalized Linear Models
Eunho Yang, Aurelie C. Lozano, Pradeep K. Ravikumar

Authors derive closed-form estimators with theoretical guarantee for GLM with sparse parameter. In the high-dim regime of n \gg p, sample covariance matrix is rank-deficient, but by thresholding and making it sparse, it becomes full rank (original sample cov should be well approx by this sparse cov). They invert the inverse link function by remapping the observations by small amount: e.g., 0 is mapped to \epsilon > 0 for Poisson so that logarithm doesn’t blow up.Screen Shot 2015-12-09 at 9.02.02 AM.png


Newton-Stein Method: A Second Order Method for GLMs via Stein’s Lemma
Murat A. Erdogdu

By replacing the sum over the samples in the Hessian for GLM regression with expectation, and applying Stein’s lemma assuming Gaussian stimuli, he derived a computationally cheap 2nd order method (O(np) per iteration). This trick relies on large enough sample size n \gg p, and the Gaussian stimuli distribution assumption can be relaxed in practice if p \gg 1 by central limit theorem. Unfortunately, the condition for the theory doesn’t hold for Poisson-GLM!Screen Shot 2015-12-09 at 10.58.43 AM

Stochastic Expectation Propagation
Yingzhen Li, José Miguel Hernández-Lobato, Richard E. Turner

In EP, each likelihood contribution to the posterior is stored as an independent factor which is not scalable for large datasets. Authors propose to further approximate by using n copies of the same factor thus making the memory requirement of EP independent of n. This is similar to assumed density filtering (ADF) but with much better performance close to EP.SEP


Competitive Distribution Estimation: Why is Good-Turing Good
Alon Orlitsky, Ananda Theertha Suresh

Best paper award (1 of 2). Shows theoretical near optimality of Good-Turing estimator for discrete distributions.

Learning Continuous Control Policies by Stochastic Value Gradients
Nicolas Heess, Gregory Wayne, David Silver, Tim Lillicrap, Tom Erez, Yuval Tassa

Using the reparameterization trick for continuous state space, continuous action reinforcement learning problem. [youtube video]

High-dimensional neural spike train analysis with generalized count linear dynamical systems
Yuanjun Gao, Lars Büsing, Krishna V. Shenoy, John P. Cunningham

A super flexible count distribution with neural application.

Unlocking neural population non-stationarities using hierarchical dynamics models
Mijung Park, Gergo Bohner, Jakob H. Macke

Latent processes with two time scales: one fast linear dynamics, and one slow Gaussian process.

NIPS 2015 Part 1


NIPS2015_participants.pngNIPS has grown to 3755 participants this year with 21.9% acceptance rate, 11% deep learning, 42 sponsors, 101 area chairs, 1524 reviewers. Here are some posters from the first night that I thought were excellent (Part 2, Part 3, workshops). They arranged the posters using KPCA axis that approximately aligned with deep-to-non-deep, so the first 40 posters or so were heavy on deep-learning. [meta blog post on other blogger’s NIPS summaries]

Deep temporal sigmoid belief networks for sequence modeling
Zhe Gan, Chunyuan Li, Ricardo Henao, David E. Carlson, Lawrence Carin

Screen Shot 2015-12-07 at 10.31.32 PMThey propose a pair of probabilistic time series models for variational inference (one generative and one recognition model) and use variance controlled log-derivative trick to do stochastic optimization. Using a binary vector, they can model an exponentially large state space, and further introduce hierarchy (deep structure) that can produce longer time scale nonlinear dependences. Each node is extremely simple: linear-logistic-Bernoulli. Zhe told he that they applied to 3-bouncing-balls video dataset represented as 900 dimensional vector, but the generated samples were not perfect and balls would often get stuck. [code on github]

Variational dropout and the local reparameterization trick [arXiv]
Diederik P. Kingma, Tim Salimans, Max Welling

Screen Shot 2015-12-07 at 11.02.16 PM

Screen Shot 2015-12-07 at 11.02.07 PM

They apply the SGVB reparameterization trick to parameters instead of latent variables. Most importantly, they chose a reparametrization such that the noise is independent for each observation. Upper figure (from illustrates the parameterization with slow learning due to the noise being correlated with all samples in the mini-batch, while the lower figure shows the independent form. This relates to variational interpretation of dropout, but now the dropout rate can be learned in a more principled manner.


Local expectation gradients for black box variational inference
Michalis Titsias, Miguel Lázaro-Gredilla

The two mains tricks for estimating stochastic gradient for E_{q(x)}[f(x)] are the log-derivative trick, and the reparameterization trick (used by the first two papers I introduced above). Reparameterization has much smaller variance, hence leads to faster convergence, however, it can only be used for continuous x and differentiable f. Here authors propose an extension of the log-derivative trick with small variance by (numerically) integrating over 1d over the latent variable that is directly controlled by the corresponding parameter, while holding the Markov blanket constant.

A recurrent latent variable model for sequential data
Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth goel, Aaron C. Courville, Yoshua Bengio

Screen Shot 2015-12-07 at 11.37.29 PMIn conventional recurrent neural network (RNN), noise is limited to the input/output space, and the internal states are deterministic. Authors add a stochastic latent variable node to an RNN, and incorporate variational autoencoder (VAE) concepts. Latents are only time dependent through the deterministic recurrent states (with hidden LSTM units), and had a much lower dimension. They train on raw waveform of speech, and were able to generate mumbling sound that resembles the speech (I sampled their cool audio), and similarly for 2D handwriting. Their model seemed to work equally well with different complexity of observation models, unlike plain RNNs which require complex observation models to generate reasonable output. [code on github]

Texture synthesis using convolutional neural networks
Leon Gatys, Alexander S. Ecker, Matthias Bethge

Screen Shot 2015-12-07 at 11.49.58 PM.pngTo generate texture images, they started with a deep convolutional neural network, and trained another network’s input with fixed weights until the covariance in certain layers matched. If they started with a white noise image, they could sample textures (via gradient descent optimization). [code on github]

10th Black Board Day (BBD10)


On May 2nd 2015, I organized yet another BlackBoardDay, this time in New York City (on Columbia University campus, thanks Evan!).

I started the discussion by tracing the history of modern mathematics back to Gottlob Frege (Vika pointed out the axiomatic tradition goes back to Euclid (300 BC)).

  • Gottlob Frege’s Begriffsschrift, the first symbolic logic system powerful enough for mathematics (1879)
  • Giuseppe Peano’s axiomatization of arithmetic (1889)
  • David Hilbert’s program to build a foundation of mathematics (1900-1920s)
  • Bertrand Russell’s paradox in Frege’s system (1902)
  • Kurt Gödel was born! (1906)
  • Russell and Whitehead’s Principia Mathematica as foundation of mathematics (1910)
  • Kurt Gödel’s completeness theorem (for first-order logic) (1930)
  • Kurt Gödel’s incompleteness theorem (of Principia Mathematica and related systems) (1931)

Victoria (Vika) Gitman talked about non-standard models of Peano arithmetic. She listed the first-order form of Peano axioms which is supposed to describe addition, multiplication, and ordering of natural numbers \mathbb{N} = \{0, 1, 2, \cdots \}. However, it turns out there are other countable models that are not natural number and yet satisfy Peano axioms. She used the compactness theorem, a corollary of completeness theorem (Gödel 1930), that (loosely) states that for a consistent first-order system, if any finite subset of axioms has a model, then the system has a model. She showed that if we add a constant symbol ‘c’ (in addition to 0 and 1) to the language of arithmetic, and a set of infinite axioms which is consistent with the Peano axioms: {c > 0, c > 1, c > 2, … }, then using the compactness theorem, there exists a model. This model is somewhat like integers \mathbb{Z} sprinkled on rational numbers \mathbb{Q}, in the sense that (…, c-2, c-1, c, c+1, c+2, …) are all larger than the regular \mathbb{N}, but then 2c is larger than all of that. Then there are also fractions of c such as c/2, and so on. This is still countable, since it is a countable collection of countably infinite sets, but this totally blew our minds. In this non-standard model of arithmetic, those ‘numbers’ outside \mathbb{N} can be represented as a pair in \mathbb{Q} \times \mathbb{Z}, but actual computation with those numbers turn out to be non-trivial (and often non-computable).

Vika writing on the black board

Ashish Myles talked about the incompleteness theorem, and other disturbing ideas. Starting from the analogy of liar’s paradox, Ashish stated that arithmetic (with multiplication) can be used to encode logical statements into natural numbers, and also write a (recursive) function that encapsulates the notion of ‘provable from axioms’. The Gödel statement G roughly says that “the natural number that encodes G is not provable”. Such statement is true (in our meta language) since if it is false, there’s a contradiction. However, either adding G or not G as an axiom to the original system is consistent. Even after including G (or “not G”) as an axiom to Peano arithmetic, there’ll be statements that are true but not provable! Vika gave an example statement that is true for natural number but is not provable from Peano axioms: all Goodstein sequence terminates at 0.2015-05-02 14.58.41_Ashish

At this point, we were all feeling very cold inside, and needed some warm sunshine. So, we continued our discussion outside:

Kyle Mandli talked about Axiom of Choice (AC), which is an axiom that is somewhat counter intuitive, and independent of the Zermelo-Fraenkel (ZF) set theory: Both ZF with AC and ZF with not AC are consistent (Gödel 1964). We discussed many counter intuitive “paradoxes” as well as usefulness of AC in mathematics.

Diana Hall talked about an counter intuitive bet: suppose we have a fair coin, and we are tossing to create a sequence. Would you bet on seeing HTH first or HHT first? At first one might think they are equally likely. However, since there’s a sequence effect that makes them non-equal!

Outside discussion. It was too cold inside!

Unfortunately, due to time constraints we couldn’t talk about Uygar planed: “approximate solutions to combinatorial optimization problems implies P=NP”, hopefully we’ll hear about it on BBD11!

NIPS 2014 main meeting


NIPS is growing fast with 2400+ participants! I felt there were proportionally less “neuro” papers compared to last year, maybe because of a huge presence of deep network papers. My NIPS keywords of the year: Deep learning, Bethe approximation/partition function, variational inference, climate science, game theory, and Hamming ball. Here are my notes on the interesting papers/talks from my biased sampling by a neuroscientist as I did for the previous meetings. Other bloggers have written about the conference: Paul Mineiro, John Platt, Yisong Yue and Yun Hyokun (in Korean).

The NIPS Experiment

The program chairs, Corinna Cortes and Neil Lawrence, ran an experiment on the reviewing process and estimated the inconsistency. 10% of the papers were chosen to be reviewed independently by two pools of reviewers and area chair, hence those authors got 6-8 reviews, and had to submit 2 author responses. The disagreement was around 25%, meaning around half of the accepted papers could have been rejected (the baseline assuming independent random acceptance was around 38%). This tells you that the variability in NIPS reviewing process is, so keep that in mind whether your papers got in or not! They accepted all papers that had disagreement between the two pools, so the overall acceptance rate was a bit higher this year. For details, see Eric Price’s blog post and Bert Huang’s post.

Latent variable modeling of neural population activity

Extracting Latent Structure From Multiple Interacting Neural Populations
Joao Semedo, Amin Zandvakili, Adam Kohn, Christian K. Machens, Byron M. YuSemedo_AR_pCCA

How can we quantify how two populations of neurons interact? A full interaction model would require O(N^2) which quickly makes the inference intractable. Therefore, low-dimensional interaction model could be useful, and this paper exactly does this by extending the ideas of canonical correlation analysis to vector autoregressive processes.

Clustered factor analysis of multineuronal spike data
Lars Buesing, Timothy A. Machado, John P. Cunningham, Liam Paninski

How can you put more structure to a PLDS (Poisson linear dynamical system) model? They assumed disjoint groups of neurons would have loadings from a restricted set of factors only. For application, they actually restricted the loading weights to be non-negative, in order to separate out the two underlying components of oscillation in spinal cord. They have a clever subspace clustering based initialization, and a variational inference procedure.

A Bayesian model for identifying hierarchically organised states in neural population activity
Patrick Putzky, Florian Franzen, Giacomo Bassetto, Jakob H. MackeHDMT population spiking model

How do you capture discrete states in the brain, such as UP/DOWN states? They propose using a probabilistic hierarchical hidden Markov model for population of spiking neurons. The hierarchical structure reduces the effective number of parameters of the state transition matrix. The full model captures the population variability better than coupled GLMs, though the number of states and their structure is not learned. Estimation is via variational inference.

On the relations of LFPs & Neural Spike Trains
David E. Carlson, Jana Schaich Borg, Kafui Dzirasa, Lawrence Carin.

Analysis of Brain States from Multi-Region LFP Time-Series
Kyle R. Ulrich, David E. Carlson, Wenzhao Lian, Jana S. Borg, Kafui Dzirasa, Lawrence Carin

Bayesian brain, optimal brain

Fast Sampling-Based Inference in Balanced Neuronal Networks
Guillaume Hennequin, Laurence Aitchison, Mate Lengyel

Sensory Integration and Density Estimation
Joseph G. Makin, Philip N. Sabes

Optimal Neural Codes for Control and Estimation
Alex K. Susemihl, Ron Meir, Manfred Opper

Spatio-temporal Representations of Uncertainty in Spiking Neural Networks
Cristina Savin, Sophie Denève

Optimal prior-dependent neural population codes under shared input noise
Agnieszka Grabska-Barwinska, Jonathan W. Pillow

Neurons as Monte Carlo Samplers: Bayesian Inference and Learning in Spiking Networks
Yanping Huang, Rajesh P. Rao

Other Computational and/or Theoretical Neuroscience

Using the Emergent Dynamics of Attractor Networks for Computation (Posner lecture)
J. J. Hopfield

He introduced bump attractor networks via analogy of magnetic bubble (shift register) memory. He suggested that cadence and duration variations in voice can be naturally integrated with state-dependent synaptic input. Hopfield previously suggested using relative spike timings to solve a similar problem in olfaction. Note that this continuous attractor theory predicts low-dimensional neural representation. His paper is available as a preprint.

Deterministic Symmetric Positive Semidefinite Matrix Completion
William E. Bishop and Byron M. Yu

See workshop posting where Will gave a talk on this topic.

General Machine Learning

Identifying and attacking the saddle point problem in high-dimensional non-convex optimization
Yann N. Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, Yoshua Bengio

From results in statistical physics, they hypothesize that there are more saddles in high-dimension which are the main cause of slow convergence of stochastic gradient descent. In addition, exact Newton method converges to saddles, (stochastic) gradient descent is slow to get out of saddles, causing lengthy platou in training neural networks. They provide a theoretical justification for a known heuristic optimization method which is to take the absolute value of eigenvalues of the Hessian when taking the Newton step. This avoids saddles, and dramatically improves convergence speed.

A* Sampling
Chris J. Maddison, Daniel Tarlow, Tom Minka

Extends the Gumbel-Max Trick to an exact sampling algorithm for general (low-dimensional) continuous distributions with intractable normalizers. The trick involves perturbing a discrete-domain function by adding an independent samples from Gumbel distribution.They construct Gumbel process which gives bounds on the intractable log partition function, and use it to sample.

Divide-and-Conquer Learning by Anchoring a Conical Hull
Tianyi Zhou, Jeff A. Bilmes, Carlos Guestrin

Spectral Learning of Mixture of Hidden Markov Models
Cem Subakan, Johannes Traa, Paris Smaragdis

Clamping Variables and Approximate Inference
Adrian Weller, Tony Jebara
His slides are available online.

Information-based learning by agents in unbounded state spaces
Shariq A. Mobin, James A. Arnemann, Fritz Sommer

Expectation Backpropagation: Parameter-Free Training of Multilayer Neural Networks with Continuous or Discrete Weights
Daniel Soudry, Itay Hubara, Ron Meir

Self-Paced Learning with Diversity
Lu Jiang, Deyu Meng, Shoou-I Yu, Zhenzhong Lan, Shiguang Shan, Alexander Hauptmann