This was my first NIPS (Neural Information Processing Systems). It is primarily a machine learning conference with some neural inspirations. According to the analysis done by the organizers, there were more than 1350 registered participants, 1256 submissions with 24.1% acceptance rate. Some of the keywords that have high mutual information with the acceptance are ‘tree-width, subgradient, concave’. From my perspective, I found non-parameteric Bayesian inference, reinforcement learning, semi-definite programming, and sparse/low-rank learning very popular along with general learning theory.

I summarized some of the posters that were interesting to me. If there is anything wrong, it’s my fault, and please let me know. You can find all the papers from the online proceedings.

Update: video recordings of the talks [invited talks] [oral sessions].

### Day 1

Efficient algorithms for learning kernels from multiple similarity matrices with general convex loss functions
Achintya Kundu, vikram Tankasali, Chiranjib Bhattacharyya, Aharon Ben-Tal

Aiming to learn a positive semi-definite kernel matrix from a given set of similarity measures, and a classification problem, this paper uses subgradient with mirror descent optimization. They combine the SVM cost with loss function on the similarity matrices. For the mirror descent, they chose negative of matrix entropy (unnormalized entropy of eigenvalue distribution) that guarantees  the convergence.

Generative Local Metric Learning for Nearest Neighbor Classification
Yung-Kyun Noh, Byoung-Tak Zhang, Daniel Lee

Using the asymptotic NN-classifier error bound $\int \frac{p(x)q(x)}{p(x)+q(x)}\mathrm{d}x$ which is measure-independent, the goal is to learn a global Euclidean metric scaling for $\mathbb{R}^D$. The critical problem is that the theoretically measure-independent quantity becomes measure-dependent when estimated from finite samples. By first order expansion of the probability estimated by the nearest neighbor estimator, the authors explicitly derive the bias, and tries to learn the metric while reducing the bias. It is posed as a semi-definite programming (SDP).

Estimation of Rényi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs
David Pal, Barnabas Poczos, Csaba Szepesvari

There have been works on Rényi’s entropy estimation from nearest-neighbor, but this one proves strong convergence results. In addition, they show that given a consistent estimator for Rényi’s entropy, one can use empirical copula to obtain a consistent estimator for Renyi’s mutual information. The fact that the marginals are uniform make it very simple.

Random Conic Pursuit for Semidefinite Programming
Ariel Kleiner, Ali Rahimi, Michael Jordan

According to the authors, interior point method for SDP is slow and tends to jump to the solution instead of slowly converging. As an alternative, they propose a stochastic gradient descent, where for every iteration a random rank 1 positive definite matrix is chosen, and the optimization is solved for the 2-dimensional solution plane instead of the entire cone of semi-definite matrices.

Short-term memory in neuronal networks through dynamical compressed sensing
Surya Ganguli, Haim Sompolinsky

When sparse signal is provided as a single common input to a network with linear recurrent dynamics, the problem of recovering past input signals from current activation pattern can be formulated as an compressed sensing problem. They built a theory of memory depth under this setting where the recurrent dynamics matrix is orthogonal. The assumptions are not very realistic, however, viewing the problem from a compressive sensing point of view connects to the FWIRE project.

### Day 2

Machine Learning with HumanIntelligence: Principled Corner Cutting (PC2) – invited talk
Xiao Li Meng

This was an inspiring invited talk with a practical philosophy. Self-consistency principle as a general method to deal with missing data was fresh to me. Compromising statistical principles for more efficient methods that resembles machine learning principles were discussed with personal experiences.

Identifying Dendritic Processing
Aurel A. Lazar, Yevgeniy Slutskiy

Extending the time encoding machine and perfect reconstruction techniques, the authors propose estimating a linear filter followed by a non-noisy integrate-and-fire neuron. Interestingly, due to causality the dendritic process (modeled as a linear filter) has infinite spectrum which conflicts with the (RKHS) theory of band limited functions required for the time encoding machine theory. As a result, they can only recover the filter projected in the band limited space. Moreover, using a clever trick of switching the input and the filter, multiple presentations of different input can be combined to estimate the filter even under the usual Nyquist spiking rate limit.

Near-Optimal Bayesian Active Learning with Noisy Observations
Daniel Golovin, Andreas Krause, Debajyoti Ray

Determining the optimal hypothesis (within a finite set) with a few observation is a problem of active learning. Counter intuitively, the choice that provides maximal information gain do NOT perform well, and in pathological problems the posterior does not eliminate choices. Instead they show the choice that reduces the posterior probability mass the most is adaptive submodular, and works under noisy observations. The main contributions are efficient algorithms that implement such policy.

Robust PCA via Outlier Pursuit
Huan Xu, Constantine Caramanis, Sujay Sanghavi

When PCA is formulated as SVD on the data matrix, outliers are defined as corrupted columns that do not lie in the low dimensional space. Using a combination of nuclear norm (sum of singular values) and L1 norm of the column-wise L2 norm, they propose a convex optimization formulation. The proposed method has deterministic error bound related to the incoherence parameter.

Global Analytic Solution for Variational Bayesian Matrix Factorization
Shinichi Nakajima, Masashi Sugiyama, Ryota Tomioka

The problem of SVD is posed as a generative model. Gaussian model for a low rank matrix is assumed, and solved with analytical solutions using variational Bayes approach of factorizing the posterior of columns of the low rank matrix. Without assistance of symbolic mathematical tools such as mathematica, the authors arrived at analytic solutions with flat prior as well as ARD prior. (The full equations for the solutions were too lengthy to be included in the poster.)

Improvements to the Sequence Memoizer
Jan Gasthaus and Yee Whye Teh

Using hierarchical Pitman-Yor process (a “heavy-tail” version of Dirchlet process with extra parameters) as the prior, the aim is to learn a distribution of a symbol conditioned on all previous symbols. An application is compression where the inference process is replicated in the decoding end with same random seed. I’m new to non-parameteric Bayesian inference, so the details slipped through me, but I would definitely revisit this paper.

Global seismic monitoring as probabilistic inference
Nimar S. Arora, Stuart Russell, Paul Kidwell, Erik Sudderth

From a collection of seismic sensors, they aim to find a probable cause (artificial nuclear activity). Using prior knowledge combined with a heuristic Bayesian inference algorithm, they infer a sequence of events (with position and amplitude) as causes. The inference is done through operations on a Poisson process initialized sequence as a hypothesis and improve it via adding, aligning, reassigning, and pruning; simiar to dependent DP process by Dahua Lin (below). I think a similar process can be used to infer hidden causes of spike train observations in a generative point process modeling.

### Day 3

This paper got the best student paper award. Dirichlet process is related to a compound Poisson process known as Gamma process; Gamma process takes positive values independently distributed as Gamma distribution at each event generated by a Poisson process. Dirichlet process is simply a normalized form of Gamma process, such that the “amplitude” of Gamma process at each point represents the probability of being at that position. It is well known that Poisson process is the only completely random process, hence one can perform operations such as thinning, superposition, and random translation to create a similar Poisson process. The authors create dependent Dirichlet process through these operations and apply it to non-stationary clustering problem by developing Gibbs sampling for it.

A Novel Kernel for Learning a Neuron Model from Spike Train Data
Nicholas K. Fisher, Arunava Banerjee

Neuron modeling is posed as a classification problem for input spike trains that do not elicit an action potential vs. that do. SVM is used with conjunction to a novel spike train kernel of the form $\sum_i \sum_j \frac{t^1_i t^2_j}{t^1_i + t^2_j}$. This was derived from a combination of basis functions $f_t = \frac{1}{\tau} \exp(-\beta / t) \exp(-t/\tau)$ for each pair of spike timings they call REEF (reciprocal exponential – exponential function). The kernel is the uniform integration of outer product of REEF. This is a novel binless spike train positive definite kernel that weights the spikes in time non-uniformly (like spikernel). Here they applied to learning SRM0 (simplest spike response model), but I think this has potential to be applied to arbitrary neuron models.

Universal Kernels on Non-Standard Input Spaces
Andreas Christmann, Ingo Steinwart

Universal kernels have nice convergence properties and learning guarantees (on a compact set); Guassian kernel and Laplacian kernels are widely used because of this property. The authors show that one can construct universal kernels from a simple form $\sum_{n=0}^\infty a_n t^n$. When $a_n > 0$, then $k(x,y) = \sum_{n=0}^\infty ^n_{l_2}$ is a universal kernel in a closed ball. Extending this result, if one can find a continuous injective mapping to a separable Hilbert space, by Mercer theorem and the above result, one can build a universal kernel from any compact space. Two interesting examples of kernels on probability space are given; one for characteristic kernel and embedded probability, and the other is using (generalized) characteristic functions of probabilities.

Of course my poster was on the third day as well, so I had little time for other’s posters. (Update: check out http://nipsposterface.com/)