# NIPS 2015 Part 3

Day 3 and 4 of NIPS main meeting (part 1, part 2). More amazing deep learning results.

**Embed to control: a locally linear latent dynamics model for control from raw images**

Manuel Watter, Jost Springenberg, Joschka Boedecker, Martin Riedmiller

To implement optimal control in the latent state space, they used iterative Linear-Quadratic-Gaussian control applied directly to video. A gaussian latent state space was decoded from images through a deep variational latent variable model. One step prediction of latent dynamics was modeled to be locally linear where the dynamics matrices were parameterized by a neural network that depends on the current state. A variant of a variational cost that minimizes instantaneous reconstruction, and also KL divergence between predicted latent and the reconstructed latent. Deconvolution network was used, and as can be seen in the [video online], the generated images are a bit blurry, but iLQG control works well.

**Deep visual analogy-making
**Scott E. Reed, Yi Zhang, Yuting Zhang, Honglak Lee

Simple vector space embedding of natural words in [Mikolov et al. NIPS 2013] showed “Madrid” – “Spain” + “France” is closest to “Paris”. Authors show that making such analogy in computer generated images is possible through a deep architecture (top figure on the right). To make an analogy of the form a : b = c : ?, first three images are encoded via f, then f(b) – f(a) + f(c), or more generally T((f(b)-f(a)), f(c)), is decoded via g to generate the output image. They trained convolutional neural network f such that T(f(b)-f(a)) is close to f(d)-f(c). Decoder with same architecture but with up-sampling instead of pooling is used for g. The performance on simple object transformation and video game character animation are quite impressive! [recorded talk]

**Deep convolutional inverse graphics network**

Tejas D. Kulkarni, William F. Whitney, Pushmeet Kohli, Josh Tenenbaum

Authors propose a CNN autoencoder and training method that aims to infer ‘graphics parameters’ such as lighting and viewing angle from images. Usually the deep latent variables are hard to interpret, but here they force interpretability by training subsets of latent variables only (holding others constant) and using input with the corresponding invariance. The resulting ‘disentangled’ representation learns a meaningful approximation of a 3D graphics engine. Trained via SGVB [Kingma & Welling ICLR 2014].

**Action-conditioned video prediction using deep networks in Atari games
**Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L. Lewis, Satinder Singh

In model based reinforcement learning, predicting the next state given the current state and action accurately is a key operation. Authors show very impressive video prediction given a couple of previous frames of Atari games and a chosen action. Hidden state is estimated using CNN, temporal correlation is learned using LSTM, and action interacts multiplicatively with the state. They used curriculum learning to make increasingly long prediction sequences with SGD + BPTT. They replaced the model-free DQN [Minh et al. NIPS 2013 workshop] with their model. See impressive results at [online videos and supplement] for yourself!

**Empirical localization of homogeneous divergences on discrete sample spaces**

Takashi Takenouchi, Takafumi Kanamori

In many discrete probability models the (computationally intractable) normalizer for the distribution often hinders efficient estimation for high dimensional data (e.g., Ising model). Instead of using KL-divergence (equivalent to MLE) between the model and empirical distribution, if a *homogeneous divergence* which invariant under scaling of the underlying measure, we might be able to circumvent the difficulty. Authors use the pseudo-spherical (PS) divergence [Good, I. J. (1971)] and a trick to weight the model by the empirical distribution to make a *convex optimization procedure for obtaining near MLE solution*.

**Equilibrated adaptive learning rates for non-convex optimization**

Yann Dauphin, Harm de Vries, Yoshua Bengio

Improving condition number of Hessian is important for SGD convergence speed. Authors revived equilibrated preconditioner as an adaptive learning rate for SGD.

**Computational principles for deep neuronal architectures (invited talk)
**Haim Sompolinsky

(1) In many early sensory systems, there’s an expansion of representation to a larger number of downstream neurons with sparser representation. This expansion ratio is around 10-100 times, and sparseness of 0.1-0.01 (fraction of neurons active). In [Babadi & Sompolinsky Neuron 2014], they derived how random connection is worse than hebbian learning for a certain scaling and sparseness constraints for representing cluster identities in the input space. (2) How about stacking such layers? Hebbian synaptic learning squashes noise as the network gets deeper. (3) Learning context-dependent influence as mixed (distributed) representation. [Mante et al. Nature 2013] is not biologically feasibly learned. Interleave sensory and context signal into deep structure with hebbian learning to solve it. (4) Extend perceptron theory for learning point clouds to manifold clouds (i.e., line segments, and L-p balls).

**Efficient exact gradient update for training deep networks with very large sparse targets
**Pascal Vincent, Alexandre de Brébisson, Xavier Bouthillier

If output is very high-dimensional, but sparse, as in the classification with large number of categories, the gradient computation bottleneck is the last layer. Authors propose a clever computational trick to compute gradient efficiently.

**Attractor network dynamics enable preplay and rapid path planning in maze-like environments**

Dane S. Corneil, Wulfram Gerstner

Hippocampal network can produce a sequence of activation (at rest) that represents goal-directed future plans. By taking the eigendecomposition of the Markov transition matrix of the maze, they obtain the ‘successor representation’ [Dayan NECO 1993] and implement it with a biologically plausible neural network.

**A tractable approximation to optimal point process filtering: application to neural encoding
**Yuval Harel, Ron Meir, Manfred Opper

By taking the limit of large number neurons with tuning curve centers drawn from a Gaussian, they derive a near optimal point process decoding framework. By optimizing on a grid, they derive the theoretically optimal Gaussian that minimizes MSE.

**Bounding errors of expectation propagation**

Guillaume P. Dehaene, Simon Barthelmé

New theory showing that EP converges faster than Laplace approximation.

**Nonparametric von Mises estimators for entropies, divergences and mutual information**

Kirthevasan Kandasamy, Akshay Krishnamurthy, Barnabas Poczos, Larry Wasserman, James M. Robins

Use plug-in estimator for divergences using kernel density estimator and correct for bias using von Mises expansion. It works well in low dimensions (up to 6?). [matlab code on github]

Reblogged this on Do not stop thinking.