NIPS 2012 (proceedings) was held in Lake Tahoe, right next to the state line between California and Nevada. Despite the casinos all around the area, it was a great conference: a lot of things to learn, and a lot of people to meet. My keywords for NIPS 2012 are deep learning, spectral learning, nonparanormal distribution, nonparametric Bayesian, negative binomial, graphical models, rank, and MDP/POMDP. Below are my notes on the topics that interested me. Also check out these great blog posts about the event by Dirk Van den Poel (@dirkvandenpoel), Yisong Yue (@yisongyue), John Moeller, Evan Archer, and Hal Daume III.
Optimal kernel choice for large-scale two-sample tests
A. Gretton, B. Sriperumbudur, D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil, K. Fukumizu
This is an improvement over the maximum mean discrepancy (MMD), a divergence statistic for hypothesis testing using reproducing kernel Hilbert spaces. The statistical power of the test depends on the choice of kernel, and previously, it was shown that taking the maximum over multiple kernels still yields a valid divergence. Here they linearly combine kernels to maximize the statistical power in linear time, using a normal approximation of the test statistic. The disadvantage is that it requires more data for cross-validation.
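To make the linear-time statistic concrete, here is a minimal sketch for scalar data with a Gaussian kernel. It is not the authors' code: the function names are my own, and `pick_bandwidth` chooses a single kernel from a list by the ratio of the estimate to its standard error, which is a simplification of the weighted kernel combination in the paper. In practice the kernel should be chosen on data held out from the test itself.

```python
import math
import random


def gaussian_kernel(a, b, bandwidth):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def linear_time_mmd(xs, ys, bandwidth):
    """Return (estimate, standard error) of the squared MMD in one pass.

    Consecutive samples are paired so that each data point is used once,
    giving O(n) cost instead of the O(n^2) of the full U-statistic.
    """
    n_pairs = min(len(xs), len(ys)) // 2
    hs = []
    for i in range(n_pairs):
        x1, x2 = xs[2 * i], xs[2 * i + 1]
        y1, y2 = ys[2 * i], ys[2 * i + 1]
        hs.append(gaussian_kernel(x1, x2, bandwidth)
                  + gaussian_kernel(y1, y2, bandwidth)
                  - gaussian_kernel(x1, y2, bandwidth)
                  - gaussian_kernel(x2, y1, bandwidth))
    mean = sum(hs) / n_pairs
    var = sum((h - mean) ** 2 for h in hs) / (n_pairs - 1)
    return mean, math.sqrt(var / n_pairs)


def pick_bandwidth(xs, ys, candidates):
    """Hypothetical helper: keep the candidate with the largest
    estimate-to-standard-error ratio (a proxy for test power)."""
    def ratio(bandwidth):
        mean, std_err = linear_time_mmd(xs, ys, bandwidth)
        return mean / (std_err + 1e-12)
    return max(candidates, key=ratio)


rng = random.Random(0)
same_a = [rng.gauss(0.0, 1.0) for _ in range(2000)]
same_b = [rng.gauss(0.0, 1.0) for _ in range(2000)]
shifted = [rng.gauss(3.0, 1.0) for _ in range(2000)]

mmd_same, se_same = linear_time_mmd(same_a, same_b, bandwidth=1.0)
mmd_diff, se_diff = linear_time_mmd(same_a, shifted, bandwidth=1.0)
```

Because the statistic is an average of independent terms, it is approximately normal, which is what makes the power criterion tractable.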
Efficient coding provides a direct link between prior and likelihood in perceptual Bayesian inference
Xue-Xin Wei, Alan Stocker
Several biases observed in psychophysics show repulsion from the mode of the prior, which seems counterintuitive if we assume the brain is performing Bayesian inference. They show that this could be due to asymmetric likelihood functions that originate from the efficient coding principle. The tuning curves, and hence the likelihood functions, under the efficient coding hypothesis are constrained by the prior, reducing the degrees of freedom for the Bayesian interpretation of perception. They show that asymmetric likelihoods can arise under a wide range of circumstances, and claim that repulsive bias should be observed. They also predict that additive noise in the stimulus should decrease this effect.
Spiking and saturating dendrites differentially expand single neuron computation capacity
Romain Cazé, M. Humphries, B. Gutkin
Romain showed that Boolean functions can be implemented by active dendrites. Neurons that generate dendritic spikes can be considered as a collection of AND gates, hence a disjunctive normal form (DNF) can be implemented directly, using the somatic threshold as the final stage. Similarly, saturating dendrites (inhibitory neurons) can be treated as OR gates, thus a conjunctive normal form (CNF) can be implemented.
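A toy illustration of the DNF case, as I understood it: each spiking branch is an AND over its synapses, and a somatic threshold of one acts as the final OR. The function names and the binary abstraction are my own assumptions, not the model in the paper.

```python
def dendrite_spikes(inputs, synapses):
    """A spiking dendrite as an AND gate: it fires only if every
    synapse on the branch is active."""
    return all(inputs[i] for i in synapses)


def neuron_fires(inputs, branches, soma_threshold=1):
    """The soma thresholds the number of dendritic spikes;
    a threshold of 1 makes it an OR over branches."""
    n_spikes = sum(dendrite_spikes(inputs, branch) for branch in branches)
    return n_spikes >= soma_threshold


# DNF: (x0 AND x1) OR (x2 AND x3), one clause per dendritic branch.
branches = [(0, 1), (2, 3)]

fires_on_first_clause = neuron_fires([1, 1, 0, 0], branches)
fires_on_mixed_input = neuron_fires([1, 0, 1, 0], branches)
```

This particular function is not linearly separable, so a point neuron that thresholds a single weighted sum of its four inputs cannot compute it.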
Coding efficiency and detectability of rate fluctuations with non-Poisson neuronal firing
Hypothesis testing of whether the rate is constant or not for a renewal neuron can be done by decoding the rate from spike trains using empirical Bayes (EB). If the hyperparameter for the roughness is inferred to be zero by EB, the null hypothesis is accepted. Shinsuke derived a theoretical condition for the rejection based on the KL-divergence.
The coloured noise expansion and parameter estimation of diffusion processes
Simon Lyons, Amos Storkey, Simo Sarkka
For a continuous analogue of a nonlinear ARMA model, estimating parameters for stochastic differential equations is difficult. They approach it by using a truncated smooth basis expansion of the white noise process. The resulting colored noise is used for an MCMC sampling scheme.
Bayesian estimation of discrete entropy with mixtures of stick-breaking priors
Evan Archer*, Il Memming Park*, Jonathan W. Pillow (*equally contributed, equally presented)
Diffusion decision making for adaptive k-Nearest Neighbor Classification
Yung-Kyun Noh, F. C. Park, Daniel D. Lee
An interesting connection between the sequential probability ratio test (Wald test) for homogeneous Poisson processes with two different rates and k-nearest neighbor (k-NN) classification is established by the authors. The main assumption is that each class density is smooth, thus in the limit of large samples, the distribution of nearest neighbors follows a (spatial) Poisson process. Using this connection, several adaptive k-NN strategies motivated by the Wald test are proposed.
TCA: High dimensional principal component analysis for non-gaussian data
F. Han, H. Liu
Using an elliptical copula model (extending the nonparanormal), the eigenvectors of the covariance of the copula variables can be estimated from Kendall’s tau statistic, which is invariant to the nonlinearity of the elliptical distribution and the transformation of the marginals. This estimator achieves close to the parametric convergence rate even though the model is semiparametric.
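The invariance is easy to see numerically. The sketch below uses the standard identity for elliptical distributions, correlation = sin(pi/2 * tau), on a bivariate Gaussian whose marginals are warped by monotone functions. It covers a single pair of variables only; the warps, sample size, and function names are my own choices for illustration. The eigenvectors would then come from the matrix of such pairwise estimates.

```python
import math
import random


def sign(v):
    return (v > 0) - (v < 0)


def kendall_tau(xs, ys):
    """Kendall's tau by direct O(n^2) pair counting (no tie correction)."""
    n = len(xs)
    total = 0
    for i in range(n):
        for j in range(i + 1, n):
            total += sign(xs[i] - xs[j]) * sign(ys[i] - ys[j])
    return 2.0 * total / (n * (n - 1))


def latent_correlation(xs, ys):
    """Map tau to a correlation with the sin(pi/2 * tau) identity."""
    return math.sin(0.5 * math.pi * kendall_tau(xs, ys))


rng = random.Random(1)
rho = 0.6
z1 = [rng.gauss(0.0, 1.0) for _ in range(600)]
z2 = [rho * a + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0) for a in z1]

# Monotone warps of the marginals: the ranks, and hence tau, are unchanged.
x = [math.exp(a) for a in z1]
y = [b ** 3 for b in z2]

tau_latent = kendall_tau(z1, z2)
tau_observed = kendall_tau(x, y)
rho_hat = latent_correlation(x, y)
```

The estimate `rho_hat` recovers the latent correlation from the warped observations without ever estimating the marginal transformations.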
Classification with Deep Invariant Scattering Networks (invited)
How can we obtain a stable, informative, invariant representation? To obtain an invariant representation with respect to a group (such as translation, rotation, scaling, and deformation), one can directly apply a group-convolution to each sample. He proposed an interpretation of deep convolutional networks as learning the invariant representation, and a more direct approach when the invariance of interest is known, which is to use group invariant scattering (hierarchical wavelet decomposition). Scattering is contractive, preserves the norm, and is stable under deformation, hence it generates a good representation for the final discriminative layer. He hypothesized that the stable parts (which lack theoretical invariance) can be learned in deep convolutional networks through sparsity.
Spectral learning of linear dynamics from generalised-linear observations with application to neural population data
L. Buesing, J. Macke, M. Sahani
The Ho-Kalman algorithm is applied to Poisson observations with the canonical link function, and the parameters are then estimated through moment matching. This is a simple and great initializer for EM, which tends to be slow and prone to local optima.
Spectral learning of general weighted automata via constrained matrix completion
B. Balle, M. Mohri
A parametric function from strings to reals known as a rational power series, or equivalently a weighted finite automaton, is estimated with a spectral method. Since the Hankel matrix of prefix-suffix values is structured, a constrained optimization is applied for its completion from data. How to choose the rows and columns of the Hankel matrix remains a difficult problem.
Discriminative learning of Sum-Product Networks
R. Gens, P. Domingos
A sum-product network (SPN) is a nice abstraction of a hierarchical mixture model, and it provides simple and tractable inference rules. In an SPN, all marginals are computable in linear time. Here, discriminative learning algorithms for SPNs are given. The hard inference variant takes the most probable state, and can overcome gradient dilution.
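To see why marginals are cheap, here is a toy SPN over two binary variables written as nested closures. An unobserved variable makes its leaves evaluate to 1, so summing it out costs the same single bottom-up pass as evaluating a joint. The structure, weights, and function names are made up for illustration, and no learning (discriminative or otherwise) is shown.

```python
def leaf(var, p_one):
    """Bernoulli leaf; returns 1.0 when the variable is unobserved (summed out)."""
    def evaluate(evidence):
        if var not in evidence:
            return 1.0
        return p_one if evidence[var] == 1 else 1.0 - p_one
    return evaluate


def product(*children):
    """Product node over children with disjoint variables."""
    def evaluate(evidence):
        value = 1.0
        for child in children:
            value *= child(evidence)
        return value
    return evaluate


def weighted_sum(weights, children):
    """Sum node: a mixture over children; weights sum to one."""
    def evaluate(evidence):
        return sum(w * child(evidence) for w, child in zip(weights, children))
    return evaluate


spn = weighted_sum(
    [0.3, 0.7],
    [product(leaf(0, 0.9), leaf(1, 0.2)),
     product(leaf(0, 0.1), leaf(1, 0.8))],
)

joint = spn({0: 1, 1: 0})   # P(X0=1, X1=0)
marginal = spn({0: 1})      # P(X0=1), with X1 summed out in the same pass
```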
Perfect dimensionality recovery by Variational Bayesian PCA
S. Nakajima, R. Tomioka, M. Sugiyama, S. Babacan
Previous Bayesian PCA algorithms utilize the empirical Bayes procedure for sparsification; however, this may not be an exact inference for recovering the dimensionality. Using random matrix theory, they provide a condition under which the dimension recovered by variational Bayesian inference is exact.
Fully bayesian inference for neural models with negative-binomial spiking
J. Pillow, J. Scott
Graphical models via generalized linear models
Eunho Yang, Genevera I. Allen, Pradeep Ravikumar, Zhandong Liu
Eunho introduced a family of graphical models with GLM marginals and Ising-model-style pairwise interactions. He said the Poisson Markov random field version must have negative coupling, otherwise the log partition function blows up. He showed conditions under which the graph structure can be recovered with high probability in this family.
No voodoo here! learning discrete graphical models via inverse covariance estimation
Po-Ling Loh, Martin Wainwright
I think Po-Ling gave the best oral presentation. For any graph with no loops, zeros in the inverse covariance matrix correspond to conditional independence. In general, the conditional dependencies could in theory be recovered by triangulating the graph, but the practical cost is high. In practice, graphical lasso is a pretty good way of recovering the graph structure, especially for certain discrete distributions (e.g. Ising model).
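The loop-free case can be checked exactly on a three-node binary chain X1 - X2 - X3: enumerate the eight states, compute the covariance, and invert it. The couplings and biases below are arbitrary values I picked for illustration, and the helper names are my own.

```python
import itertools
import math


def chain_distribution(coupling_12, coupling_23):
    """Exact joint of a 3-node binary chain X1 - X2 - X3 (no X1-X3 edge)."""
    weights = {}
    for state in itertools.product([0, 1], repeat=3):
        x1, x2, x3 = state
        energy = coupling_12 * x1 * x2 + coupling_23 * x2 * x3 + 0.3 * x1 - 0.2 * x3
        weights[state] = math.exp(energy)
    z = sum(weights.values())
    return {state: w / z for state, w in weights.items()}


def covariance(dist):
    """3x3 covariance matrix of the node variables under the exact joint."""
    mean = [sum(p * s[k] for s, p in dist.items()) for k in range(3)]
    return [[sum(p * (s[r] - mean[r]) * (s[c] - mean[c]) for s, p in dist.items())
             for c in range(3)] for r in range(3)]


def inverse_3x3(m):
    """Invert a 3x3 matrix via the adjugate (cofactor) formula."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    adj = [[e * i - f * h, c * h - b * i, b * f - c * e],
           [f * g - d * i, a * i - c * g, c * d - a * f],
           [d * h - e * g, b * g - a * h, a * e - b * d]]
    return [[v / det for v in row] for row in adj]


sigma = covariance(chain_distribution(1.0, -0.8))
precision = inverse_3x3(sigma)
missing_edge = precision[0][2]   # X1-X3: no edge, should vanish
present_edge = precision[0][1]   # X1-X2: an edge, should not vanish
```

The X1-X3 entry of the inverse covariance is zero up to rounding even though X1 and X3 are marginally correlated, while the entry for the X1-X2 edge stays away from zero.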
Augment-and-Conquer Negative Binomial Processes
M. Zhou, L. Carin
A Poisson process over a gamma process measure is related to the Dirichlet process (DP) and the Chinese restaurant process (CRP). The negative binomial (NB) distribution has an alternative (i.e., not gamma-Poisson) augmented representation as a Poisson number of logarithmic random variables, which can be used to construct the gamma-NB process. I do not fully understand the math, but it seems like this paper contains gems.
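The compound Poisson representation itself is a classical identity and can be verified with probability generating functions. The notation below (r for the NB dispersion, p for the probability parameter, ℓ for the Poisson count, u_t for the logarithmic variables) is my own choice and may differ from the paper.

```latex
\begin{align*}
\ell &\sim \mathrm{Poisson}\bigl(-r \ln(1-p)\bigr), \qquad
u_t \overset{\text{iid}}{\sim} \mathrm{Log}(p), \qquad
m = \sum_{t=1}^{\ell} u_t, \\
G_u(z) &= \frac{\ln(1-pz)}{\ln(1-p)}, \qquad
G_\ell(s) = \exp\bigl(-r \ln(1-p)\,(s-1)\bigr), \\
G_m(z) &= G_\ell\bigl(G_u(z)\bigr)
        = \exp\bigl(-r \ln(1-pz) + r \ln(1-p)\bigr)
        = \left(\frac{1-p}{1-pz}\right)^{r}.
\end{align*}
```

The last expression is the generating function of the NB distribution with mass function Γ(r+m) / (m! Γ(r)) (1-p)^r p^m, so m is NB distributed.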
Optimal Neural Tuning Curves for Arbitrary Stimulus Distributions: Discrimax, Infomax and Minimum Lp Loss
Zhuo Wang, Alan A. Stocker, Daniel D. Lee
Assuming different loss functions in the Lp family, the optimal tuning curve of a rate-limited Poisson neuron changes. Zhuo showed that as p goes to zero, the optimal tuning curve converges to that of the maximum information. The derivations assume no input noise, and a single neuron. [edit: we did a lab meeting about this paper]
Bayesian nonparametric models for ranked data
F. Caron, Y. Teh
Assuming observed partially ranked objects (e.g., top 10 books) have positive real-valued hidden strength, and assuming a size-biased ranking, they derive a simple inference scheme by introducing an auxiliary exponential variable.
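My understanding of why exponential variables are natural here: a size-biased ranking can be generated by an exponential race, where each item gets an arrival time with rate equal to its strength and items are ranked by arrival order. The sketch below only samples from such a model with fixed strengths; it is not the authors' inference scheme, and the names are hypothetical.

```python
import random


def sample_ranking(strengths, rng):
    """Draw a full ranking: item i gets an Exponential(strength_i) arrival
    time, and items are ranked by arrival order."""
    arrivals = [(rng.expovariate(w), item) for item, w in enumerate(strengths)]
    return [item for _, item in sorted(arrivals)]


rng = random.Random(0)
strengths = [10.0, 1.0, 0.1]
n_draws = 5000

first_counts = [0] * len(strengths)
for _ in range(n_draws):
    first_counts[sample_ranking(strengths, rng)[0]] += 1

# Under size-biased sampling, item 0 should come first with
# probability 10 / (10 + 1 + 0.1).
top_frequency = first_counts[0] / n_draws
expected_top = strengths[0] / sum(strengths)
```

A partial ranking such as a top-10 list corresponds to keeping only the first few entries of the sampled order.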
Efficient and direct estimation of a neural subunit model for sensory coding
Brett Vintch, Andrew D. Zaharia, J. Anthony Movshon, Eero P. Simoncelli
We already discussed this nice paper in our journal club. They fit a special LNLN model that assumes a single (per channel) convolutional kernel shifted (and weighted) in space. Brett said the convolutional STC initialization described in the paper works well even when the STC itself looks like noise.
Efficient Spike-Coding with Multiplicative Adaptation in a Spike Response Model
Sander M. Bohte
A multiplicative spike response model is proposed and fit with a fixed post-spike filter shape, an LNP-based receptive field, and a grid search over the parameter space (3D?). This model reproduces the experimentally observed adaptation due to amplitude modulation and variance modulation. The multiplicative dynamics must have a power-law decay that is close to 1/t, and it somehow restricts the firing rate of the neuron (Fig 2b).
Dropout: A simple and effective way to improve neural networks (invited, replacement)
Geoffrey Hinton, George Dahl
Dropout is a technique to randomly omit units in an artificial neural network to reduce overfitting. Hinton says the dropout method is an efficient way of averaging exponentially many models. It reduces overfitting because hidden units can’t depend on each other reliably. A related paper is on the arXiv.
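A minimal sketch of the idea on a plain list of activations: during training each unit is zeroed with probability `drop_prob`, and at test time the activations are scaled by the keep probability so that they match the training-time average. The function name and signature are my own, and scaling activations rather than outgoing weights is an assumption made to keep the example short.

```python
import random


def dropout(activations, drop_prob, rng, training):
    """Randomly omit units when training; use the scaled mean otherwise."""
    if training:
        return [0.0 if rng.random() < drop_prob else a for a in activations]
    return [a * (1.0 - drop_prob) for a in activations]


rng = random.Random(0)
activations = [1.0, 2.0, 3.0]
n_masks = 4000

# Average the thinned activations over many random masks.
running = [0.0] * len(activations)
for _ in range(n_masks):
    for k, value in enumerate(dropout(activations, 0.5, rng, training=True)):
        running[k] += value / n_masks

mean_network = dropout(activations, 0.5, rng, training=False)
```

Averaging over sampled masks approaches the deterministic scaled output, which is the sense in which one test-time pass stands in for averaging many thinned networks, at least for a single layer.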
Compressive neural representation of sparse, high-dimensional probabilities
Naively representing probability distributions is inefficient since it takes exponentially growing resources. Using ideas from compressed sensing, Xaq shows that random perceptron units can be used to represent a sparse high-dimensional probability distribution efficiently. The question is what kinds of operations on this representation are biologically plausible and useful.
The topographic unsupervised learning of natural sounds in the auditory cortex
Hiroki Terashima, Masato Okada
Visual cortex is much more retinotopic than auditory cortex is tonotopic. Unlike natural images, natural auditory stimuli have harmonics that give rise to correlations in the frequency domain. Could both primary sensory cortices share the same principle for topographic learning but form different patterns because of differences in the input statistics? The authors’ model is consistent with this hypothesis, and moreover captures nonlinear responses related to pitch perception.