My experience during the 2 days of NIPS workshops after the main meeting (part 1,2,3).

## Statistical Methods for Understanding Neural Systems workshop

Organized by: Allie Fletcher, Jakob Macke, Ryan P. Adams, Jascha Sohl-Dickstein

Towards a theory of high dimensional, single trial neural data analysis: On the role of random projections and phase transitions
Surya Ganguli

Surya talked about conditions for recovering the embedding dimension of discrete neural responses from noisy single-trial observations (very similar to his talk at the NIPS 2014 workshop organized by me). He models the neural response as $R = S U X + Z$, where S is an $M \times N$ sparse sampling matrix, U is an $N \times K$ random orthogonal embedding matrix, and X is the $K \times P$ latent manifold driven by P stimulus conditions. Assuming Gaussian noise, and using free probability theory [Nica & Speicher], he shows the recovery condition $\frac{\mathrm{var}(X)}{\mathrm{var}(Z)} \sqrt{M \cdot P} \geq K$.
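To get a feel for the recovery condition, it can be written as a one-line predicate. This is only a sketch: the function name `is_recoverable` and the example numbers are my own illustrative choices, not from the talk, and `snr` stands for the ratio var(X)/var(Z).

```python
import math

def is_recoverable(snr, num_neurons, num_conditions, latent_dim):
    """Check the condition snr * sqrt(M * P) >= K.

    snr            -- var(X) / var(Z)
    num_neurons    -- M, number of recorded (subsampled) neurons
    num_conditions -- P, number of stimulus conditions
    latent_dim     -- K, dimension of the latent manifold
    """
    return snr * math.sqrt(num_neurons * num_conditions) >= latent_dim

# Hypothetical numbers: 100 neurons, 400 conditions, 10 latent dimensions.
print(is_recoverable(0.1, 100, 400, 10))   # 0.1 * 200 = 20, above K = 10
print(is_recoverable(0.01, 100, 400, 10))  # 0.01 * 200 = 2, below K = 10
```

Since M and P enter only through their product, recording from four times as many neurons helps as much as showing four times as many stimulus conditions.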

Translating between human and animal studies via Bayesian multi-task learning
Katherine Heller

Katherine talked about using a hierarchical Bayesian model and variational inference algorithms to infer linear latent dynamics. She talked about several ideas including (1) structural prior for connectivity, (2) using cross-spectral mixture kernel for LFP [Wilson & Adams ICML 2013; Ulrich et al. NIPS 2015], (3) combining fMRI and LFP through shared dynamics.

Similarity matching: A new theory of neural computation
Dmitri (Mitya) Chklovskii

Principled derivation of local learning rules for PCA [Pehlevan & Chklovskii NIPS 2015] and NMF [Pehlevan & Chklovskii 2015].

Small Steps Towards Biologically Plausible Deep Learning
Yoshua Bengio

What should hidden layers do in a deep neur(on)al network? He talked about some happy coincidences: What is the objective function for STDP in this setting [Bengio et al. 2015]? Deep autoencoders and symmetric weight learning [Arora et al. 2015]. Energy-based models approximate back-propagation [Bengio 2015].

The Human Visual Hierarchy is Isomorphic to the Hierarchy learned by a Deep Convolutional Neural Network Trained for Object Recognition
Pulkit Agrawal

Which layers of various CNNs trained on an image discrimination task explain the fMRI voxels the best? [Agrawal et al. 2014] shows that the hierarchy of the CNN matches the visual hierarchy, and that this is not because of the receptive field sizes.

Unsupervised learning with deterministic reconstruction using What-where-convnet
Yann LeCun

CNNs often lose the ‘where’ information in the pooling process. What-where-convnet keeps the ‘where’ information in the pooling stage and uses it to reconstruct the image [Zhao et al. 2015].

Mechanistic fallacy and modelling how we think
Neil Lawrence

He came out as a cognitive scientist. He talked about System 1 (fast subconscious data-driven inference which handles uncertainty well) and System 2 (slow conscious symbolic inference that thinks it is driving the body), and how they could talk to one another. He offered an interesting solution to the variations of the trolley problem and how System 1 kicks in and gives the ‘irrational’ answer.

Approximation methods for inferring time-varying interactions of a large neural population (poster)
Christian Donner and Hideaki Shimazaki

Inference on an Ising model with latent diffusion dynamics on the parameters (both first and second order). Due to the large number of parameters, it needs multiple trials with an identical latent process to make good inferences.

Panel discussion: Neil Lawrence, Yann LeCun, Yoshua Bengio, Konrad Kording, Surya Ganguli, Matthias Bethge

Discussion on the interface between neuroscience and machine learning. Are we only focusing on ‘vision’ problems too much? What problems should neuroscience focus on to help advance machine learning? How can datasets and problems change machine learning? Should we train DNN to perform more diverse tasks?

Correlations and signatures of criticality in neural population models (ML and stat physics workshop)
Jakob Macke

Jakob talked about how subsampling neural data to infer different sizes of neural dynamics could lead to misleading conclusions (esp. criticality).

## Black Box Learning and Inference workshop

• Dustin Tran, Rajesh Ranganath, David M. Blei. Variational Gaussian Process. [arXiv 2015]
• Yuri Burda, Roger Grosse, Ruslan Salakhutdinov. Importance Weighted Autoencoders. [arXiv 2015]
A tighter lower-bound to the marginal likelihood! Better generative model estimation!
• Alp Kucukelbir. Automatic Differentiation Variational Inference in Stan. [NIPS 2015][software]
• Jan-Willem van de Meent, David Tolpin, Brooks Paige, Frank Wood. Black-Box Policy Search with Probabilistic Programs. [arXiv 2015]

Day 3 and 4 of NIPS main meeting (part 1, part 2). More amazing deep learning results.

Embed to control: a locally linear latent dynamics model for control from raw images
Manuel Watter, Jost Springenberg, Joschka Boedecker, Martin Riedmiller

To implement optimal control in the latent state space, they used iterative Linear-Quadratic-Gaussian (iLQG) control applied directly to video. A Gaussian latent state space was decoded from images through a deep variational latent variable model. One-step prediction of the latent dynamics was modeled to be locally linear, where the dynamics matrices were parameterized by a neural network that depends on the current state. They used a variant of a variational cost that minimizes the instantaneous reconstruction error as well as the KL divergence between the predicted latent and the reconstructed latent. A deconvolution network was used, and as can be seen in the [video online], the generated images are a bit blurry, but iLQG control works well.

Deep visual analogy-making
Scott E. Reed, Yi Zhang, Yuting Zhang, Honglak Lee

Simple vector space embedding of natural words in [Mikolov et al. NIPS 2013] showed “Madrid” – “Spain” + “France” is closest to “Paris”. The authors show that making such analogies in computer-generated images is possible through a deep architecture (top figure on the right). To make an analogy of the form a : b = c : ?, the first three images are encoded via f, then f(b) – f(a) + f(c), or more generally T((f(b)-f(a)), f(c)), is decoded via g to generate the output image. They trained a convolutional neural network f such that T(f(b)-f(a)) is close to f(d)-f(c). A decoder with the same architecture but with up-sampling instead of pooling is used for g. The performance on simple object transformation and video game character animation is quite impressive! [recorded talk]
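The additive version of the analogy, f(b) – f(a) + f(c), is easy to play with on toy vectors. The sketch below is not the paper's model: the 2-D embeddings are invented for illustration, and a nearest-neighbour lookup stands in for the learned decoder g.

```python
import math

# Toy 2-D embeddings, invented for illustration (not real word vectors).
embedding = {
    "Spain":  (1.0, 0.0),
    "Madrid": (1.0, 1.0),
    "France": (2.0, 0.0),
    "Paris":  (2.0, 1.0),
    "Berlin": (3.0, 1.0),
}

def analogy(a, b, c):
    """Answer a : b = c : ? by the nearest neighbour of f(b) - f(a) + f(c)."""
    target = tuple(fb - fa + fc for fa, fb, fc in
                   zip(embedding[a], embedding[b], embedding[c]))
    # Exclude the three query items, as is usual for analogy lookups.
    candidates = [w for w in embedding if w not in (a, b, c)]
    return min(candidates, key=lambda w: math.dist(embedding[w], target))

print(analogy("Spain", "Madrid", "France"))
```

In this toy space the second coordinate plays the role of "is a capital", so subtracting f(a) from f(b) isolates exactly that direction.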

Deep convolutional inverse graphics network
Tejas D. Kulkarni, William F. Whitney, Pushmeet Kohli, Josh Tenenbaum

Authors propose a CNN autoencoder and training method that aims to infer ‘graphics parameters’ such as lighting and viewing angle from images. Usually the deep latent variables are hard to interpret, but here they force interpretability by training subsets of latent variables only (holding others constant) and using input with the corresponding invariance. The resulting ‘disentangled’ representation learns a meaningful approximation of a 3D graphics engine. Trained via SGVB [Kingma & Welling ICLR 2014].

Action-conditioned video prediction using deep networks in Atari games
Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L. Lewis, Satinder Singh

In model-based reinforcement learning, accurately predicting the next state given the current state and action is a key operation. The authors show very impressive video prediction given a couple of previous frames of Atari games and a chosen action. The hidden state is estimated using a CNN, temporal correlation is learned using an LSTM, and the action interacts multiplicatively with the state. They used curriculum learning to make increasingly long prediction sequences with SGD + BPTT. They replaced the model-free DQN [Mnih et al. NIPS 2013 workshop] with their model. See the impressive results at [online videos and supplement] for yourself!

Empirical localization of homogeneous divergences on discrete sample spaces
Takashi Takenouchi, Takafumi Kanamori

In many discrete probability models the (computationally intractable) normalizer for the distribution often hinders efficient estimation for high-dimensional data (e.g., Ising model). Instead of using the KL-divergence (equivalent to MLE) between the model and the empirical distribution, if we use a homogeneous divergence, which is invariant under scaling of the underlying measure, we might be able to circumvent the difficulty. The authors use the pseudo-spherical (PS) divergence [Good, I. J. (1971)] and a trick to weight the model by the empirical distribution to make a convex optimization procedure for obtaining a near-MLE solution.

Equilibrated adaptive learning rates for non-convex optimization
Yann Dauphin, Harm de Vries, Yoshua Bengio

Improving condition number of Hessian is important for SGD convergence speed. Authors revived equilibrated preconditioner as an adaptive learning rate for SGD.

Computational principles for deep neuronal architectures (invited talk)
Haim Sompolinsky

(1) In many early sensory systems, there’s an expansion of representation to a larger number of downstream neurons with sparser representation. This expansion ratio is around 10-100 times, with a sparseness of 0.1-0.01 (fraction of neurons active). In [Babadi & Sompolinsky Neuron 2014], they derived how random connectivity is worse than Hebbian learning under certain scaling and sparseness constraints for representing cluster identities in the input space. (2) How about stacking such layers? Hebbian synaptic learning squashes noise as the network gets deeper. (3) Learning context-dependent influence as a mixed (distributed) representation. The solution of [Mante et al. Nature 2013] cannot feasibly be learned in a biologically plausible way. Interleaving sensory and context signals into a deep structure with Hebbian learning solves it. (4) Extending perceptron theory from learning point clouds to manifold clouds (i.e., line segments, and L-p balls).

Efficient exact gradient update for training deep networks with very large sparse targets
Pascal Vincent, Alexandre de Brébisson, Xavier Bouthillier

If output is very high-dimensional, but sparse, as in the classification with large number of categories, the gradient computation bottleneck is the last layer. Authors propose a clever computational trick to compute gradient efficiently.

Attractor network dynamics enable preplay and rapid path planning in maze-like environments
Dane S. Corneil, Wulfram Gerstner

Hippocampal network can produce a sequence of activation (at rest) that represents goal-directed future plans. By taking the eigendecomposition of the Markov transition matrix of the maze, they obtain the ‘successor representation’ [Dayan NECO 1993] and implement it with a biologically plausible neural network.
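For reference, the successor representation of a Markov chain with transition matrix T and discount γ is $M = \sum_{k \ge 0} \gamma^k T^k = (I - \gamma T)^{-1}$. The paper gets there through an eigendecomposition; the sketch below instead truncates the power series (pure Python, no linear algebra library), and the 3-state corridor is a made-up example.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def successor_representation(transition, gamma, num_terms=200):
    """Approximate M = sum_k gamma^k T^k by truncating the series."""
    n = len(transition)
    identity = [[float(i == j) for j in range(n)] for i in range(n)]
    total = [row[:] for row in identity]
    power = identity  # holds gamma^k T^k for the current k
    for _ in range(num_terms):
        power = [[gamma * x for x in row] for row in matmul(power, transition)]
        total = [[t + p for t, p in zip(trow, prow)]
                 for trow, prow in zip(total, power)]
    return total

# Hypothetical 3-state corridor: a random walk with reflecting ends.
T = [[0.0, 1.0, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 1.0, 0.0]]
M = successor_representation(T, gamma=0.9)
for row in M:
    print([round(x, 3) for x in row])
```

Entry M[i][j] is the expected discounted number of future visits to state j when starting from state i, so each row sums to 1/(1 - γ).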

A tractable approximation to optimal point process filtering: application to neural encoding
Yuval Harel, Ron Meir, Manfred Opper

By taking the limit of a large number of neurons with tuning curve centers drawn from a Gaussian, they derive a near-optimal point process decoding framework. By optimizing on a grid, they derive the theoretically optimal Gaussian that minimizes MSE.

Bounding errors of expectation propagation
Guillaume P. Dehaene, Simon Barthelmé

New theory showing that EP converges faster than Laplace approximation.

Nonparametric von Mises estimators for entropies, divergences and mutual information
Kirthevasan Kandasamy, Akshay Krishnamurthy, Barnabas Poczos, Larry Wasserman, James M. Robins

Use plug-in estimator for divergences using kernel density estimator and correct for bias using von Mises expansion. It works well in low dimensions (up to 6?). [matlab code on github]

Continuing from Part 1 (and Part 3) from Neural Information Processing Systems conference 2015 at Montreal.

Linear Response Methods for Accurate Covariance Estimates from Mean Field Variational Bayes
Ryan J. Giordano, Tamara Broderick, Michael I. Jordan

Variational inference often results in factorized forms of the approximate posterior that are tighter than the exact Bayesian posterior. The authors derive a method to recover the lost covariance among parameters by perturbing the posterior. For an exponential family variational distribution, this amounts to a simple closed-form transformation involving the Hessian of the expected log posterior. [julia code on github]

Solving Random Quadratic Systems of Equations Is Nearly as Easy as Solving Linear Systems
Yuxin Chen, Emmanuel Candes

Recovering x given noisy squared measurements, e.g., $y_i \sim \mathrm{Poisson}(|\mathbf{a_i}^\top\mathbf{x}|^2)$ is a nonlinear optimization problem. Authors show that using (1) a truncated spectral initialization, and (2) truncated gradient descent, exact recovery can be achieved with optimal sample O(n) and time complexity O(nm) for iid gaussian $\mathbf{a_i}$. Those truncations essentially remove responses and corresponding gradients that are larger than typical values. This stabilizes the optimization. [matlab code online] [recorded talk]

Closed-form Estimators for High-dimensional Generalized Linear Models
Eunho Yang, Aurelie C. Lozano, Pradeep K. Ravikumar

The authors derive closed-form estimators with theoretical guarantees for GLMs with sparse parameters. In the high-dimensional regime of $p \gg n$, the sample covariance matrix is rank-deficient, but by thresholding and making it sparse, it becomes full rank (the original sample covariance should be well approximated by this sparse covariance). They invert the inverse link function by remapping the observations by a small amount: e.g., 0 is mapped to $\epsilon > 0$ for Poisson so that the logarithm doesn’t blow up.

By replacing the sum over the samples in the Hessian for GLM regression with expectation, and applying Stein’s lemma assuming Gaussian stimuli, he derived a computationally cheap 2nd order method (O(np) per iteration). This trick relies on large enough sample size $n \gg p$, and the Gaussian stimuli distribution assumption can be relaxed in practice if $p \gg 1$ by central limit theorem. Unfortunately, the condition for the theory doesn’t hold for Poisson-GLM!

Stochastic Expectation Propagation
Yingzhen Li, José Miguel Hernández-Lobato, Richard E. Turner

In EP, each likelihood contribution to the posterior is stored as an independent factor which is not scalable for large datasets. Authors propose to further approximate by using n copies of the same factor thus making the memory requirement of EP independent of n. This is similar to assumed density filtering (ADF) but with much better performance close to EP.

Competitive Distribution Estimation: Why is Good-Turing Good
Alon Orlitsky, Ananda Theertha Suresh

Best paper award (1 of 2). Shows theoretical near optimality of Good-Turing estimator for discrete distributions.
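As a reminder of what Good-Turing does, here is the classical missing-mass formula: the total probability of symbols never seen in a sample of size N is estimated by N1/N, where N1 is the number of symbols seen exactly once. This is just the textbook formula, not the refined estimator analyzed in the paper, and the helper name is my own.

```python
from collections import Counter

def good_turing_missing_mass(samples):
    """Good-Turing estimate of the probability mass of unseen symbols: N1 / N."""
    counts = Counter(samples)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(samples)

# Toy sample: "c" and "d" each appear once, so N1 = 2 and N = 7.
sample = ["a", "a", "a", "b", "b", "c", "d"]
print(good_turing_missing_mass(sample))
```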

Learning Continuous Control Policies by Stochastic Value Gradients
Nicolas Heess, Gregory Wayne, David Silver, Tim Lillicrap, Tom Erez, Yuval Tassa

Using the reparameterization trick for continuous state space, continuous action reinforcement learning problem. [youtube video]

High-dimensional neural spike train analysis with generalized count linear dynamical systems
Yuanjun Gao, Lars Büsing, Krishna V. Shenoy, John P. Cunningham

A super flexible count distribution with neural application.

Unlocking neural population non-stationarities using hierarchical dynamics models
Mijung Park, Gergo Bohner, Jakob H. Macke

Latent processes with two time scales: one fast linear dynamics, and one slow Gaussian process.

NIPS has grown to 3755 participants this year with 21.9% acceptance rate, 11% deep learning, 42 sponsors, 101 area chairs, 1524 reviewers. Here are some posters from the first night that I thought were excellent (Part 2, Part 3, workshops). They arranged the posters using KPCA axis that approximately aligned with deep-to-non-deep, so the first 40 posters or so were heavy on deep-learning. [meta blog post on other blogger’s NIPS summaries]

Deep temporal sigmoid belief networks for sequence modeling
Zhe Gan, Chunyuan Li, Ricardo Henao, David E. Carlson, Lawrence Carin

They propose a pair of probabilistic time series models for variational inference (one generative and one recognition model) and use a variance-controlled log-derivative trick to do stochastic optimization. Using a binary vector, they can model an exponentially large state space, and further introduce hierarchy (deep structure) that can produce longer time scale nonlinear dependences. Each node is extremely simple: linear-logistic-Bernoulli. Zhe told me that they applied it to a 3-bouncing-balls video dataset represented as a 900-dimensional vector, but the generated samples were not perfect and balls would often get stuck. [code on github]

Variational dropout and the local reparameterization trick [arXiv]
Diederik P. Kingma, Tim Salimans, Max Welling

They apply the SGVB reparameterization trick to parameters instead of latent variables. Most importantly, they chose a reparametrization such that the noise is independent for each observation. Upper figure (from dpkingma.com) illustrates the parameterization with slow learning due to the noise being correlated with all samples in the mini-batch, while the lower figure shows the independent form. This relates to variational interpretation of dropout, but now the dropout rate can be learned in a more principled manner.

Local expectation gradients for black box variational inference
Michalis Titsias, Miguel Lázaro-Gredilla

The two main tricks for estimating the stochastic gradient of $E_{q(x)}[f(x)]$ are the log-derivative trick and the reparameterization trick (used by the first two papers I introduced above). Reparameterization has much smaller variance, hence leads to faster convergence; however, it can only be used for continuous x and differentiable f. Here the authors propose an extension of the log-derivative trick with small variance by (numerically) integrating in 1D over the latent variable that is directly controlled by the corresponding parameter, while holding the Markov blanket constant.
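The variance gap between the two tricks is easy to see on a toy problem. The sketch below is my own example, not from the paper: it estimates $\frac{d}{d\mu} E_{x \sim N(\mu, 1)}[x^2] = 2\mu$ with both estimators and compares their sample variances.

```python
import random
import statistics

def gradient_samples(mu, num_samples=20000, seed=0):
    """Per-sample estimates of d/dmu E_{x ~ N(mu, 1)}[x^2] (true value: 2 * mu)."""
    rng = random.Random(seed)
    score, reparam = [], []
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)
        x = mu + eps  # x ~ N(mu, 1)
        # Log-derivative trick: f(x) * d/dmu log q(x) = x^2 * (x - mu)
        score.append(x * x * (x - mu))
        # Reparameterization trick: d/dmu f(mu + eps) = 2 * (mu + eps)
        reparam.append(2.0 * x)
    return score, reparam

score, reparam = gradient_samples(mu=1.5)
print("log-derivative :", statistics.mean(score), statistics.variance(score))
print("reparameterized:", statistics.mean(reparam), statistics.variance(reparam))
```

Both averages should land near 2μ = 3, but the per-sample variance of the log-derivative estimator is roughly an order of magnitude larger here.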

A recurrent latent variable model for sequential data
Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C. Courville, Yoshua Bengio

In conventional recurrent neural network (RNN), noise is limited to the input/output space, and the internal states are deterministic. Authors add a stochastic latent variable node to an RNN, and incorporate variational autoencoder (VAE) concepts. Latents are only time dependent through the deterministic recurrent states (with hidden LSTM units), and had a much lower dimension. They train on raw waveform of speech, and were able to generate mumbling sound that resembles the speech (I sampled their cool audio), and similarly for 2D handwriting. Their model seemed to work equally well with different complexity of observation models, unlike plain RNNs which require complex observation models to generate reasonable output. [code on github]

Texture synthesis using convolutional neural networks
Leon Gatys, Alexander S. Ecker, Matthias Bethge

To generate texture images, they started with a deep convolutional neural network, and trained another network’s input with fixed weights until the covariance in certain layers matched. If they started with a white noise image, they could sample textures (via gradient descent optimization). [code on github]

On May 2nd 2015, I organized yet another BlackBoardDay, this time in New York City (on Columbia University campus, thanks Evan!).

I started the discussion by tracing the history of modern mathematics back to Gottlob Frege (Vika pointed out the axiomatic tradition goes back to Euclid (300 BC)).

• Gottlob Frege’s Begriffsschrift, the first symbolic logic system powerful enough for mathematics (1879)
• Giuseppe Peano’s axiomatization of arithmetic (1889)
• David Hilbert’s program to build a foundation of mathematics (1900-1920s)
• Bertrand Russell’s paradox in Frege’s system (1902)
• Kurt Gödel was born! (1906)
• Russell and Whitehead’s Principia Mathematica as foundation of mathematics (1910)
• Kurt Gödel’s completeness theorem (for first-order logic) (1930)
• Kurt Gödel’s incompleteness theorem (of Principia Mathematica and related systems) (1931)

Victoria (Vika) Gitman talked about non-standard models of Peano arithmetic. She listed the first-order form of the Peano axioms, which are supposed to describe addition, multiplication, and ordering of the natural numbers $\mathbb{N} = \{0, 1, 2, \cdots \}$. However, it turns out there are other countable models that are not the natural numbers and yet satisfy the Peano axioms. She used the compactness theorem, a corollary of the completeness theorem (Gödel 1930), which (loosely) states that if every finite subset of a set of first-order axioms has a model, then the whole set has a model. She showed that if we add a constant symbol ‘c’ (in addition to 0 and 1) to the language of arithmetic, together with an infinite set of axioms {c > 0, c > 1, c > 2, … }, every finite subset of which is consistent with the Peano axioms, then by the compactness theorem there exists a model. This model is somewhat like integers $\mathbb{Z}$ sprinkled on rational numbers $\mathbb{Q}$, in the sense that (…, c-2, c-1, c, c+1, c+2, …) are all larger than the regular $\mathbb{N}$, but then 2c is larger than all of that. Then there are also fractions of c such as c/2, and so on. This is still countable, since it is a countable collection of countably infinite sets, but this totally blew our minds. In this non-standard model of arithmetic, those ‘numbers’ outside $\mathbb{N}$ can be represented as a pair in $\mathbb{Q} \times \mathbb{Z}$, but actual computation with those numbers turns out to be non-trivial (and often non-computable).

Ashish Myles talked about the incompleteness theorem, and other disturbing ideas. Starting from the analogy of the liar’s paradox, Ashish stated that arithmetic (with multiplication) can be used to encode logical statements into natural numbers, and also to write a (recursive) function that encapsulates the notion of ‘provable from axioms’. The Gödel statement G roughly says that “the natural number that encodes G is not provable”. Such a statement is true (in our meta language) since if it is false, there’s a contradiction. However, adding either G or not G as an axiom to the original system is consistent. Even after including G (or “not G”) as an axiom to Peano arithmetic, there’ll be statements that are true but not provable! Vika gave an example of a statement that is true for the natural numbers but is not provable from the Peano axioms: every Goodstein sequence terminates at 0.
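Goodstein sequences are simple to compute even though their termination is out of reach for Peano arithmetic. Start with n written in hereditary base-2 notation (exponents are themselves written in base 2, and so on); at each step replace every occurrence of the base b by b + 1, then subtract 1. The sketch below is my own small implementation for illustration, with hypothetical helper names.

```python
def bump_base(n, base):
    """Write n in hereditary base-`base` notation, then replace `base` by `base + 1`."""
    result, exponent = 0, 0
    while n > 0:
        n, digit = divmod(n, base)
        if digit:
            # Exponents are themselves rewritten recursively.
            result += digit * (base + 1) ** bump_base(exponent, base)
        exponent += 1
    return result

def goodstein(n, max_terms=10):
    """Return the first terms of the Goodstein sequence starting at n."""
    sequence, base = [n], 2
    while n > 0 and len(sequence) < max_terms:
        n = bump_base(n, base) - 1
        base += 1
        sequence.append(n)
    return sequence

print(goodstein(3))               # reaches 0 after a few steps
print(goodstein(4, max_terms=4))  # already grows quickly
```

Starting from 3 the sequence dies out within six terms, while starting from 4 it grows for an astronomically long time before eventually coming back down, hence the `max_terms` cap.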

At this point, we were all feeling very cold inside, and needed some warm sunshine. So, we continued our discussion outside:

Kyle Mandli talked about the Axiom of Choice (AC), which is an axiom that is somewhat counterintuitive, and independent of the Zermelo-Fraenkel (ZF) set theory: both ZF with AC and ZF with not AC are consistent, provided ZF itself is (Gödel 1938; Cohen 1963). We discussed many counterintuitive “paradoxes” as well as the usefulness of AC in mathematics.

Diana Hall talked about a counterintuitive bet: suppose we have a fair coin, and we are tossing it to create a sequence. Would you bet on seeing HTH first or HHT first? At first one might think they are equally likely. However, there’s a sequence effect that makes them unequal!
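A quick simulation makes the asymmetry visible. The intuition: once HH has appeared, HHT is guaranteed to show up before HTH (extra heads just keep us at HH), which works out to a 2/3 chance that HHT comes first. The code below is my own sketch with a fixed seed, not something shown at the meeting.

```python
import random

def first_pattern(rng, patterns=("HTH", "HHT")):
    """Toss a fair coin until one of the patterns appears; return that pattern."""
    recent = ""
    while True:
        recent = (recent + rng.choice("HT"))[-3:]  # keep the last three tosses
        if recent in patterns:
            return recent

def estimate_hht_first(num_games=20000, seed=0):
    """Monte Carlo estimate of P(HHT appears before HTH)."""
    rng = random.Random(seed)
    wins = sum(first_pattern(rng) == "HHT" for _ in range(num_games))
    return wins / num_games

print(estimate_hht_first())  # should be close to 2/3
```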

Unfortunately, due to time constraints we couldn’t get to the talk Uygar had planned: “approximate solutions to combinatorial optimization problems implies P=NP”. Hopefully we’ll hear about it at BBD11!

NIPS is growing fast with 2400+ participants! I felt there were proportionally fewer “neuro” papers compared to last year, maybe because of a huge presence of deep network papers. My NIPS keywords of the year: Deep learning, Bethe approximation/partition function, variational inference, climate science, game theory, and Hamming ball. Here are my notes on the interesting papers/talks from my biased sampling as a neuroscientist, as I did for the previous meetings. Other bloggers have written about the conference: Paul Mineiro, John Platt, Yisong Yue and Yun Hyokun (in Korean).

## The NIPS Experiment

The program chairs, Corinna Cortes and Neil Lawrence, ran an experiment on the reviewing process and estimated the inconsistency. 10% of the papers were chosen to be reviewed independently by two pools of reviewers and area chairs, hence those authors got 6-8 reviews, and had to submit 2 author responses. The disagreement was around 25%, meaning around half of the accepted papers could have been rejected (the baseline assuming independent random acceptance was around 38%). This tells you that the variability in the NIPS reviewing process is substantial, so keep that in mind whether your papers got in or not! They accepted all papers that had disagreement between the two pools, so the overall acceptance rate was a bit higher this year. For details, see Eric Price’s blog post and Bert Huang’s post.
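The arithmetic behind those two numbers can be reproduced in a few lines. The acceptance rate of 25% used below is a round number I am assuming for illustration (it is not stated above), and I also assume the disagreements split evenly between "accepted by pool 1 only" and "accepted by pool 2 only".

```python
def random_baseline_disagreement(accept_rate):
    """P(two independent pools disagree) if each accepts papers at random."""
    return 2 * accept_rate * (1 - accept_rate)

def share_of_accepts_rejected_by_other(disagreement, accept_rate):
    """Fraction of one pool's accepted papers that the other pool rejected.

    Assumes disagreements split evenly, so disagreement / 2 of all papers
    are accepted by this pool and rejected by the other.
    """
    return (disagreement / 2) / accept_rate

accept_rate = 0.25  # assumed round number, for illustration only
print(random_baseline_disagreement(accept_rate))              # 0.375
print(share_of_accepts_rejected_by_other(0.25, accept_rate))  # 0.5
```

Under these assumptions the random baseline comes out at 37.5%, matching the "around 38%" figure, and a 25% disagreement rate means half of one pool's accepts were rejected by the other.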

## Latent variable modeling of neural population activity

Extracting Latent Structure From Multiple Interacting Neural Populations
Joao Semedo, Amin Zandvakili, Adam Kohn, Christian K. Machens, Byron M. Yu

How can we quantify how two populations of neurons interact? A full interaction model would require O(N^2) parameters, which quickly makes the inference intractable. Therefore, a low-dimensional interaction model could be useful, and this paper does exactly this by extending the ideas of canonical correlation analysis to vector autoregressive processes.

Clustered factor analysis of multineuronal spike data
Lars Buesing, Timothy A. Machado, John P. Cunningham, Liam Paninski

How can you put more structure to a PLDS (Poisson linear dynamical system) model? They assumed disjoint groups of neurons would have loadings from a restricted set of factors only. For application, they actually restricted the loading weights to be non-negative, in order to separate out the two underlying components of oscillation in spinal cord. They have a clever subspace clustering based initialization, and a variational inference procedure.

A Bayesian model for identifying hierarchically organised states in neural population activity
Patrick Putzky, Florian Franzen, Giacomo Bassetto, Jakob H. Macke

How do you capture discrete states in the brain, such as UP/DOWN states? They propose using a probabilistic hierarchical hidden Markov model for population of spiking neurons. The hierarchical structure reduces the effective number of parameters of the state transition matrix. The full model captures the population variability better than coupled GLMs, though the number of states and their structure is not learned. Estimation is via variational inference.

On the relations of LFPs & Neural Spike Trains
David E. Carlson, Jana Schaich Borg, Kafui Dzirasa, Lawrence Carin.

Analysis of Brain States from Multi-Region LFP Time-Series
Kyle R. Ulrich, David E. Carlson, Wenzhao Lian, Jana S. Borg, Kafui Dzirasa, Lawrence Carin

## Bayesian brain, optimal brain

Fast Sampling-Based Inference in Balanced Neuronal Networks
Guillaume Hennequin, Laurence Aitchison, Mate Lengyel

Sensory Integration and Density Estimation
Joseph G. Makin, Philip N. Sabes

Optimal Neural Codes for Control and Estimation
Alex K. Susemihl, Ron Meir, Manfred Opper

Spatio-temporal Representations of Uncertainty in Spiking Neural Networks
Cristina Savin, Sophie Denève

Optimal prior-dependent neural population codes under shared input noise
Agnieszka Grabska-Barwinska, Jonathan W. Pillow

Neurons as Monte Carlo Samplers: Bayesian Inference and Learning in Spiking Networks
Yanping Huang, Rajesh P. Rao

## Other Computational and/or Theoretical Neuroscience

Using the Emergent Dynamics of Attractor Networks for Computation (Posner lecture)
J. J. Hopfield

He introduced bump attractor networks via analogy of magnetic bubble (shift register) memory. He suggested that cadence and duration variations in voice can be naturally integrated with state-dependent synaptic input. Hopfield previously suggested using relative spike timings to solve a similar problem in olfaction. Note that this continuous attractor theory predicts low-dimensional neural representation. His paper is available as a preprint.

Deterministic Symmetric Positive Semidefinite Matrix Completion
William E. Bishop and Byron M. Yu

See workshop posting where Will gave a talk on this topic.

## General Machine Learning

Identifying and attacking the saddle point problem in high-dimensional non-convex optimization
Yann N. Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, Yoshua Bengio

From results in statistical physics, they hypothesize that there are more saddles in high dimensions, and that these are the main cause of slow convergence of stochastic gradient descent. In addition, the exact Newton method converges to saddles, and (stochastic) gradient descent is slow to get out of saddles, causing lengthy plateaus in training neural networks. They provide a theoretical justification for a known heuristic optimization method, which is to take the absolute value of the eigenvalues of the Hessian when taking the Newton step. This avoids saddles, and dramatically improves convergence speed.
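The effect of taking absolute eigenvalues shows up already on the textbook saddle f(x, y) = x² − y². The sketch below is a simplification of my own: the Hessian is diagonal, so its eigenvalues are just the diagonal entries and no eigendecomposition is needed.

```python
def newton_step(point, grad, hess_diag):
    """Plain Newton update for a diagonal Hessian: x - g / h."""
    return [x - g / h for x, g, h in zip(point, grad, hess_diag)]

def saddle_free_step(point, grad, hess_diag):
    """Same update, but with |h| in place of h for each eigenvalue."""
    return [x - g / abs(h) for x, g, h in zip(point, grad, hess_diag)]

# f(x, y) = x^2 - y^2 has a saddle at the origin.
point = [1.0, 0.5]
grad = [2 * point[0], -2 * point[1]]
hess_diag = [2.0, -2.0]  # eigenvalues of the (diagonal) Hessian

print(newton_step(point, grad, hess_diag))       # lands exactly on the saddle
print(saddle_free_step(point, grad, hess_diag))  # moves away from it along y
```

Along the positive-curvature direction both updates agree; along the negative-curvature direction the plain Newton step is attracted to the saddle while the modified step descends.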

A* Sampling
Chris J. Maddison, Daniel Tarlow, Tom Minka

Extends the Gumbel-Max trick to an exact sampling algorithm for general (low-dimensional) continuous distributions with intractable normalizers. The trick involves perturbing a discrete-domain function by adding independent samples from a Gumbel distribution. They construct a Gumbel process which gives bounds on the intractable log partition function, and use it to sample.
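For the discrete case, the Gumbel-Max trick fits in a few lines: add independent standard Gumbel noise to the unnormalized log-probabilities and take the argmax, and the winner is an exact sample from the normalized distribution. The sketch below covers only this discrete trick (not A* sampling itself); the weights are arbitrary illustrative values.

```python
import math
import random
from collections import Counter

def gumbel_max_sample(log_weights, rng):
    """Draw index i with probability proportional to exp(log_weights[i])."""
    def gumbel():
        # Standard Gumbel noise via inverse CDF: -log(-log(U)), U ~ Uniform(0, 1)
        u = rng.random()
        while u == 0.0:  # guard against log(0)
            u = rng.random()
        return -math.log(-math.log(u))
    perturbed = [lw + gumbel() for lw in log_weights]
    return max(range(len(perturbed)), key=perturbed.__getitem__)

rng = random.Random(0)
weights = [1.0, 2.0, 7.0]  # unnormalized; the normalizer is never computed
log_weights = [math.log(w) for w in weights]
num_draws = 20000
draws = Counter(gumbel_max_sample(log_weights, rng) for _ in range(num_draws))
frequencies = [draws[i] / num_draws for i in range(len(weights))]
print(frequencies)  # should be close to 0.1, 0.2, 0.7
```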

Divide-and-Conquer Learning by Anchoring a Conical Hull
Tianyi Zhou, Jeff A. Bilmes, Carlos Guestrin

Spectral Learning of Mixture of Hidden Markov Models
Cem Subakan, Johannes Traa, Paris Smaragdis

Clamping Variables and Approximate Inference
His slides are available online.

Information-based learning by agents in unbounded state spaces
Shariq A. Mobin, James A. Arnemann, Fritz Sommer

Expectation Backpropagation: Parameter-Free Training of Multilayer Neural Networks with Continuous or Discrete Weights
Daniel Soudry, Itay Hubara, Ron Meir

Self-Paced Learning with Diversity
Lu Jiang, Deyu Meng, Shoou-I Yu, Zhenzhong Lan, Shiguang Shan, Alexander Hauptmann

Last week, I co-organized the NIPS workshop titled: Large scale optical physiology: From data-acquisition to models of neural coding with Ferran Diego Andilla, Jeremy Freeman, Eftychios Pnevmatikakis and Jakob Macke. Optical neurophysiology promises larger population recordings, but we are also faced with technical challenges in hardware, software, signal processing, and statistical tools to analyze high-dimensional data. Here are highlights of some of the non-optical physiology talks:

Surya Ganguli presented exciting new results improving on his last NIPS workshop and last COSYNE workshop talks. Our experimental limitations force us to analyze severely subsampled data, and we often find correlations and low-dimensional dynamics. Surya asks “How would dynamical portraits change if we record from more neurons?” This time he had detailed results for single-trial experiments. Using matrix perturbation, random matrix, and non-commutative probability theory, they show a sharp phase transition in the recoverability of the manifold. Their model was linear Gaussian, namely $R = U X + Z$, where X is a set of low-rank neural trajectories over time, U is a sparse subsampling matrix, and Z is additive Gaussian noise. The bound for recovery had the form $\mathrm{SNR} \sqrt{MP} \geq K$, where K is the dimension of the latent dynamics, P is the temporal duration (samples), M is the number of subsampled neurons, and SNR denotes the signal-to-noise ratio of a single neuron.

Vladimir Itskov gave a talk about inferring structural properties of the network from the estimated covariance matrix (we originally invited his collaborator Eva Pastalkova, but she couldn’t make it due to a job interview). An undirected graph whose weights correspond to an embedding in a Euclidean space shows a characteristic Betti curve: the curve of Betti numbers as a function of the threshold on the graph’s weights, which is varied to construct the topological objects. For certain random graphs, the characteristics are very different, hence they used it to quantify how ‘random’ or ‘low-dimensional’ the covariances they observed were. Unfortunately, these curves are very computationally expensive, so only up to the 3rd Betti number can be estimated, and the Betti curves are too noisy to be used for estimating dimensionality directly. But they found that hippocampal data were far from ‘random’. A similar talk was given at CNS 2013.

William Bishop, a 5th year graduate student working with Byron Yu and Rob Kass, talked about stitching partially overlapping covariance matrices, a problem first discussed at NIPS 2013 by Srini Turaga and coworkers: Can we estimate the full noise correlation matrix of a large population given smaller overlapping observations? He provided sufficient conditions for stitching, the most important of which is that the rank of the covariance matrix of the overlap must be at least the rank of the entire covariance matrix. Furthermore, he analyzed theoretical bounds on perturbations, which can be used for designing strategies for choosing the overlaps carefully. For details see the corresponding main conference paper, Deterministic Symmetric Positive Semidefinite Matrix Completion.

Unfortunately, due to weather conditions Rob Kass couldn’t make it to the workshop.