Last Sunday (April 29th) was Black board day (BBD), a small informal workshop I organize every year. It started 7 years ago on Kurt Gödel’s 100th birthday. We discuss logic, computation, math, and beyond. This year happens to be Alan Turing’s 100th birth year, so we had a theme that combines Turing machines and logic. It was a huge success thanks to the special guest speakers.

### Il Memming Park: On halting problem route to incompleteness

I tried to give an overview of how certain problems in mathematics that deal with natural numbers are very difficult, and why a mechanized theorem prover was a dream of Hilbert’s. Then I introduced Cantor’s devilish diagonal argument in the context of binary strings and languages. Basically, there are more languages (each defined as a set of finite binary strings) than there are natural numbers. I introduced Turing machines and their 3 possible outcomes (accept, reject, and infinite loop) as well as the concept of universal Turing machines. Then, I constructed the halting problem and showed that the diagonal argument prevents us from having a Turing machine that can tell whether another Turing machine will stop in finite time. Unfortunately, I didn’t have enough time to elaborate on how the halting problem has a similar structure to the proof of the incompleteness theorem, and how they could be connected.
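The diagonal construction can be made concrete with a toy sketch. The four languages below are hypothetical examples chosen only for illustration, and a real argument needs the full infinite list; the point is just that the diagonal language disagrees with the i-th language on the i-th string, so it cannot appear anywhere in the list.

```python
from itertools import count, product

def binary_strings():
    """Enumerate all finite binary strings: '', '0', '1', '00', '01', ..."""
    for n in count(0):
        for bits in product("01", repeat=n):
            yield "".join(bits)

# Hypothetical finite prefix of an enumeration of languages,
# each given as a membership test on binary strings.
languages = [
    lambda s: True,               # every string
    lambda s: len(s) % 2 == 0,    # even length
    lambda s: s.endswith("1"),    # ends with 1
    lambda s: s.count("1") == 2,  # exactly two 1s
]

# Pair the i-th string with the i-th language.
strings = [s for s, _ in zip(binary_strings(), languages)]
index_of = {s: i for i, s in enumerate(strings)}

def in_diagonal(s):
    """s_i belongs to the diagonal language D iff s_i does NOT belong to L_i."""
    return not languages[index_of[s]](s)

# D disagrees with L_i on s_i for every i, by construction.
differs = [in_diagonal(s) != languages[i](s) for i, s in enumerate(strings)]
print(strings, differs)
```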

### Kenneth Latimer: On Roger Penrose’s Emperor’s new mind

The controversial book Emperor’s new mind (1989) is famous for extending Lucas’ idea that since Turing machines can’t know whether the Gödel statement is true, while humans do, the computational power of humans must be greater. Penrose further linked that idea to physics and the brain. During our discussion, we agreed that the Gödel statement is true, but its truth can only be judged from outside the system, and humans are certainly not using the same set of axioms as the system on which the Gödel statement is constructed. Also, the fact that we do not understand certain physics doesn’t imply that it is not computable. It was interesting that two people (Memming and Jonathan) were initially drawn into neuroscience because of this book.

### Michael Buice: Algebra of Probable Inference

Michael started by talking about adding oracles to Turing machines and the hierarchy of such oracle-equipped Turing machines, as well as the Kleene hierarchy of logical statements, but quickly jumped into a new topic. Instead of considering only True or False statements, if we allow things in between, then with a few reasonable assumptions we can derive the axioms of probability theory. Heuristically speaking, Gödel’s incompleteness theorem would imply that there are statements for which, even with infinite observations, the posterior probability does not converge to 0 or 1 and always stays in between. The derivation is given in Richard Cox’s papers, and the theory was expanded by Jaynes.

### Ryan Usher: An Incomplete, Inconsistent, Undecidable and Unsatisfiable Look at the Colloquial Identity and Aesthetic Possibilities of Math or Logic

Ryan started by stating how he finds beauty in mathematical proofs, especially in Henkin’s completeness theorem. But he was unhappy with how often beautiful results such as Gödel’s incompleteness theorem are abused in completely irrelevant contexts such as economics and the social sciences. He had numerous quotes and examples showing the current sad state of abuse. He claimed that this is partly because terms like “consistency” and “completeness” have very rigorous meanings in a mathematical context, but people often associate them with their commonsense meanings.

### Jonathan Pillow: Do we live inside a Turing machine?

Jonathan summarized the argument by Bostrom (2003) that it is very probable that we are living inside a simulation. Under the assumption that
1. Simulated human brain brings consciousness (“substance independence”)
2. Large scale simulation of human brain + physical world around human is possible
Then, assuming a high probability of the technological advancement needed for such a simulation, and some grad student in the future wishing to run an “ancestor simulation”, a simple counting argument over all humans, simulated and not, shows that we are probably living in a simulation. (Below photo is Jonathan’s writing. It was a white board, but in the spirit of black board day, I inverted the photo.)


Primary olfactory receptor neurons (ORNs) bind to odor molecules in the medium and send action potentials to the brain. This signaling is not simply ON and OFF; each ORN has delicate sensitivity to various odors and shows diverse temporal activation patterns. Using both electrophysiology and calcium-sensitive dye imaging, my collaborators Yuriy V. Bobkov and Kirill Y. Ukhanov studied the temporal aspect of lobster ORNs. The heterogeneous response patterns are well presented in a recent paper published in PLoS One. I was particularly interested in a special type of ORN called bursting ORNs. Bursting ORNs oscillate spontaneously, and the calcium imaging data allow population analysis. I was involved in the analysis to see if there’s any sign of synchrony, using a resampling-based burst-triggered averaging technique. It turns out that they rarely interact, if at all. Moreover, they have a wide range of oscillation periods. Since they are coupled through the environment (a filament of odor molecules in the medium), in natural environments or under controlled odor stimulation they sometimes synchronize, which is the subject of another paper under review.

Note: the publication actually has my first name as Ill instead of Il which is silly and sick. I asked for a correction, but it seems PLoS One will only publish a note for the correction and not correct the actual article (because of the inconsistency it will cause for other indexing systems [1][2]). This could have been fixed in the proof, if PLoS did proofs before final publications, but they don’t (presumably to lower costs). In my opinion, this is a flaw of PLoS journals. EDIT: there’s a note saying that my name is misspelled now.

As I did for past COSYNE‘s (2009, 2010, 2011), this is a summary of my personal experience this year. I loved both the main meeting and workshops. It’s definitely one of the best conferences. I had to present my own posters, so I couldn’t see many others. Therefore, this selection is severely subsampled. Also, there might be details that I am not remembering correctly. If you spot any mistake, please let me know.

## Neural dynamics in neural coding

II-67. Jeffrey Seely, Matthew T. Kaufman, John Cunningham, Stephen Ryu, Krishna Shenoy, Mark Churchland. Dimensionality in motor cortex: differences between models and experiment

Is population activity in the motor cortex well explained by tuning curves for each neuron, or is it better explained by linear dynamics? To answer this question, they collapsed each experimental condition (motor output) to a temporally interpolated histogram of the same length. A large 3-D array A(n,c,t) for n neurons, c conditions, and t time bins is constructed and sliced with 2 different possible low-rank approximations (PCA): one across conditions, which implies tuning-curve-like characteristics, and the other across time, which represents the dynamic modes (basis solutions to a differential equation). They sequentially chose the component, either condition or dynamics, that explains the most variance, subtracted it, and repeated. They showed that real data are mostly dynamics, while tuning-based models generate data that are mostly condition based. This is a pretty convincing argument using only very basic tools.
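To make the two slicings concrete, here is a rough sketch of how the array A(n,c,t) can be unfolded along conditions or along time and how the variance captured by a few principal components can be compared. This is not the authors' sequential procedure; the synthetic "tuning-like" data, the sizes, and the function name `variance_explained` are all assumptions for illustration.

```python
import numpy as np

def variance_explained(M, k):
    """Fraction of the squared Frobenius norm of M captured by its top-k singular values."""
    s = np.linalg.svd(M, compute_uv=False)
    return float(np.sum(s[:k] ** 2) / np.sum(s ** 2))

rng = np.random.default_rng(0)
n, c, t, k = 30, 12, 40, 3   # neurons, conditions, time bins, components (all hypothetical)

# Synthetic "tuning-like" data: every neuron and time bin mixes the same
# k condition profiles, with unconstrained time courses.
cond_profiles = rng.standard_normal((k, c))
weights = rng.standard_normal((n, t, k))
A = np.einsum("ntk,kc->nct", weights, cond_profiles)   # shape (n, c, t)

# Unfold the 3-D array so that rows index conditions or time bins.
cond_unfold = A.transpose(1, 0, 2).reshape(c, n * t)
time_unfold = A.transpose(2, 0, 1).reshape(t, n * c)

ve_cond = variance_explained(cond_unfold, k)   # close to 1 for this construction
ve_time = variance_explained(time_unfold, k)   # clearly below 1 here
print(ve_cond, ve_time)
```

For data generated by low-dimensional dynamics the roles would be expected to flip, which is the kind of contrast the poster reports between real recordings and tuning-based models.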

Workshop talk: David Sussillo. Rethinking gating: selective integration of sensory signals through network dynamics

Frontal Eye Field (FEF) spiking responses during a colored random dots task, where a contextual cue determines whether the monkey has to use the direction of the dots or the majority color to make the decision, are analyzed in a dynamical systems framework. (This talk is related to Valerio Mante’s poster II-58, but I missed it in the main meeting.) The question is how the monkey switches context: is it some sort of gating mechanism that controls whether the motion stimulus or the color stimulus reaches FEF? Or does all the information get to FEF, where the decision is formed? Directions in the population firing rate (state) space are extracted by regression on the conditional firing rates: $r(t) = \beta_1(t) \cdot \mathrm{choice} + \beta_2(t) \cdot \mathrm{color} + \beta_3(t) \cdot \mathrm{motion} + \mbox{interaction terms}$ (the neurons were recorded one at a time; they assume independence). The trajectory in the neural state space (reconstructed from the $\beta$’s) shows integration-like behavior for the relevant stimulus, while the irrelevant dimension is encoded but not integrated. They further built a recurrent neural network model trained with [James Martens, Ilya Sutskever, Learning Recurrent Neural Networks with Hessian-Free Optimization, ICML 2011] and saw similar performance and dynamics by tuning only the stimulus noise level. He further showed a fixed point analysis of the trained network and a non-normal matrix decomposition to explain the integration on the line attractor. A similar talk given jointly by Mante and Sussillo at the Santa Fe Institute can be found online. EDIT: they gave another related talk at the Redwood Center with online video.

## Learning to fire a precise temporal pattern from spike train input

I was pleased to find 3 very cool posters related to learning synaptic weights of an integrate-and-fire neuron for spike train input. It is the comeback of the tempotron (Robert Gütig, Nature 2006)!

I-22. Robert Gütig. The multi-class tempotron: a neuron model for processing of sensory streams

The tempotron is a classifier that either emits a spike or not to indicate the class, given an input spike pattern from many neurons. Robert extended the tempotron to learn to fire a prescribed number of spikes, not just zero or one. This is done by considering the rank-ordered membrane voltage peaks simultaneously, where the original tempotron only deals with the maximum peak. He showed that this could be used to detect an event or feature in time. The training is done by providing just the number of occurrences (it does not require tagging the time series with precise timings).

II-23. Raoul-Martin Memmesheimer, Ran Rubin, Haim Sompolinsky. Learning precisely timed spiking responses

One of the main advantages of the tempotron is that it can use time as an extra degree of freedom, allowing a higher capacity (# of patterns / synapse) compared to the traditional perceptron. However, this is also a disadvantage because the precise timing of the spikes is not controlled. This poster describes a couple of simple iterative procedures for updating the weights and threshold of an integrate-and-fire (IF) neuron to produce a desired spiking pattern. The tricky part of such a task is the complication of the reset after erroneous spikes. Two algorithms, (1) first error learning and (2) high threshold learning, are proposed to overcome this difficulty. They showed that the algorithm converges to a solution in a similar fashion to the perceptron, if a solution exists.

II-39. Ran Rubin, Raoul-Martin Memmesheimer, Haim Sompolinsky. Support Vector Machines in Spiking Neurons with Non-Linear Dendrites

This is a companion poster to II-23. They extend the method to find a robust solution by maximizing the margin. Using an auxiliary voltage trace assuming the resets happened in a small time before the desired spike occurred (as in the high threshold learning algorithm), they formulated the problem as an SVM-like optimization problem with constraints. They also proposed active dendrites as nonlinear positive semi-definite kernels (point nonlinearity on the original inner product).

## Probabilistic modeling based neural/stimulus distances

Measuring similarity given a generative system for the data can be done with divergences. Given a probabilistic spiking neuron population model, one can measure the similarity between the stimuli or between the population responses; there were two posters for each idea using the Ising model.

I-7. Elad Ganmor, Ronen Segev, Elad Schneidman. Semantic organization of a neural population codebook and accurate decoding using a neural thesaurus

Trial-to-trial variability of the population response $P(r|s)$ was captured by an Ising model. Using Bayes’ rule, they measured the Jensen-Shannon divergence: $d(r_1, r_2) = D_{JS}(P(s|r_1), P(s|r_2))$ (not a metric unless the square root is taken). They only consider the instantaneous response (20 neurons, 10 ms bin, binarized), with no temporal structure. Using hierarchical clustering (forming the codebook) on the test response patterns, they showed that this method captures most of the mutual information with just a few clusters.
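As a small illustration of the distance $d(r_1, r_2)$ above, the sketch below computes the Jensen-Shannon divergence between stimulus posteriors. The three posterior vectors are made-up numbers over 4 hypothetical stimuli, not values from the poster.

```python
import numpy as np

def kl_bits(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; terms with p = 0 contribute 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: average KL to the mixture m = (p + q) / 2."""
    m = 0.5 * (p + q)
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)

# Hypothetical posteriors P(s | r) over 4 stimuli for three response patterns.
post_r1 = np.array([0.7, 0.2, 0.1, 0.0])
post_r2 = np.array([0.1, 0.2, 0.3, 0.4])
post_r3 = np.array([0.6, 0.3, 0.1, 0.0])

d12 = js_divergence(post_r1, post_r2)
d13 = js_divergence(post_r1, post_r3)
d23 = js_divergence(post_r2, post_r3)
print(d12, d13, d23)   # r1 and r3 are "synonyms" compared to r2
```

The resulting pairwise distances (or their square roots, which do satisfy the triangle inequality) could then be fed to any off-the-shelf hierarchical clustering routine.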

I-35. Gasper Tkacik, Einat Granot-Atedgi, Ronen Segev, Elad Schneidman. Retinal metric: a stimulus distance measure derived from population neural responses

They used the symmetric Kullback-Leibler divergence between the stimulus-conditioned response distributions as a similarity measure between stimuli: $d(s_1, s_2) = D_{KL}^{sym}(P(r|s_1);P(r|s_2))$. The conditional distribution was modeled with a stimulus-driven maximum entropy Ising model where the higher order interaction terms do not depend on the stimulus: $P(r|s) = \frac{1}{Z} \exp\left(h(s)r + \sum_{i \neq j} J_{ij} r_i r_j\right)$. They did not use the JS divergence because it is difficult to compute from the Ising model. This similarity reveals which features of the stimulus the population really cares about.
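For a handful of neurons the model above can be evaluated exactly by enumerating all binary patterns, which gives a quick way to see the stimulus distance at work. This is only a toy sketch: the fields `h_s1`, `h_s2`, the coupling matrix `J`, and the convention of summing the two KL directions without a factor of 1/2 are assumptions, and a real 20+ neuron population would need something smarter than brute-force enumeration.

```python
import itertools
import numpy as np

def ising_distribution(h, J):
    """Exact P(r) proportional to exp(h.r + sum_{i != j} J_ij r_i r_j) over r in {0,1}^N."""
    n = len(h)
    patterns = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
    J_off = J - np.diag(np.diag(J))              # drop the i == j terms
    log_w = patterns @ h + np.einsum("pi,ij,pj->p", patterns, J_off, patterns)
    w = np.exp(log_w - log_w.max())              # subtract max for numerical stability
    return w / w.sum()

def symmetric_kl(p, q):
    """D(p||q) + D(q||p) in nats; valid when all probabilities are positive."""
    return float(np.sum((p - q) * np.log(p / q)))

rng = np.random.default_rng(0)
n_neurons = 5
J = 0.3 * rng.standard_normal((n_neurons, n_neurons))   # shared across stimuli
h_s1 = rng.standard_normal(n_neurons)                   # hypothetical h(s_1)
h_s2 = h_s1 + 0.5 * rng.standard_normal(n_neurons)      # hypothetical h(s_2)

p1 = ising_distribution(h_s1, J)
p2 = ising_distribution(h_s2, J)
d_s1_s2 = symmetric_kl(p1, p2)
print(d_s1_s2)
```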

## Extending Spike Triggered Covariance

There were 3 very related talks about spike triggered covariance (STC) analysis in the Characterizing Neural Responses to Structured and Naturalistic Stimuli workshop organized by Kanaka Rajan and William Bialek.

Jonathan Pillow

His talk was focused on Empirical Bayes (EB) methods for the inference of hierarchical models. The first part was about Mijung Park’s work on spatio-temporal and frequency localized prior design for receptive fields. The second part, which was brief due to time constraints, was about a Bayesian extension of STC where the number of receptive fields is inferred by EB.

William Bialek

He talked about the full history from reverse correlation (Boer 1968), to STC, to maximizing mutual information. He introduced Kanaka’s work (arXiv:1201.0321v1 [q-bio.NC], also poster III-34, but I missed it) on maximizing mutual information between a quadratic projection of the stimulus and the response. This is an interesting extension of MID (see Sharpee’s talk below). MID tends to degrade as the number of dimensions to extract increases, but their method seems to work better.

Tatyana Sharpee

Maximally informative dimension (MID) aims at finding the receptive fields of a linear-nonlinear cascade regardless of the nonlinearity by maximizing mutual information. This is an ideal goal; however, estimation and maximization of mutual information is very difficult in practice (as in the case of the information bottleneck), and implementations suffer from local optima and (histogram) parameterization. She presented an approach from the opposite direction: minimize mutual information (or equivalently, maximize the conditional entropy of the response given the stimulus). Using a maximum entropy model with the first two moments constrained, she derived a quadratic form of logistic regression as the model to fit to a binary spiking response: $P(spike|s) = \frac{1}{1+\exp(a+h\cdot s+s^\top \cdot J \cdot s)}$. This is closely related to our BSTC work, which has a similar quadratic form of the Poisson regression model as a special case. (I missed the related poster by Ryan Rowekamp et al. II-35) Ref: J.D. Fitzgerald, R. J. Rowekamp, L.C. Sincich and T.O. Sharpee, (2011) “Second order dimensionality reduction using minimum and maximum mutual information models”, PLoS Computational Biology, 7(10): e1002249 doi:10.1371/journal.pcbi.1002249
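The quadratic logistic form is simple enough to write down directly. The sketch below just evaluates the formula as stated, with made-up parameter values; it does not fit anything, and the parameter values are purely illustrative.

```python
import numpy as np

def spike_probability(s, a, h, J):
    """P(spike | s) = 1 / (1 + exp(a + h.s + s^T J s)) for a stimulus vector s."""
    return 1.0 / (1.0 + np.exp(a + h @ s + s @ J @ s))

# Hypothetical parameters for a 3-dimensional stimulus.
a = 1.0
h = np.array([-2.0, 0.5, 0.0])
J = np.diag([0.0, -1.0, 0.2])      # a symmetric quadratic kernel

s = np.array([1.0, 0.5, -0.3])
p = spike_probability(s, a, h, J)
p_linear_only = spike_probability(s, a, h, np.zeros((3, 3)))   # ordinary logistic regression
print(p, p_linear_only)
```

With $J = 0$ the model reduces to ordinary logistic regression, so the quadratic term is what lets it capture STC-like, sign-invariant stimulus features.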

III-36. Brett Vintch, Andrew D Zaharia, Tony Movshon, Eero P Simoncelli. Fitting receptive fields in V1 and V2 as linear combinations of nonlinear subunits

I missed this one, but this one is also highly related. They have a low complexity model that generates a set of filters that are generally obtained from STC on V1 complex cells.

Shannon’s entropy $H$ is a fundamental statistic that measures the uncertainty of a (discrete) distribution. It is a building block for mutual information $I(X;Y) = H(X) - H(X|Y)$ which has numerous applications in statistics, communication, signal processing, machine learning and so on. In the context of neuroscience, entropy can measure the maximum capacity of a neuron, quantify the amount of noise, and also serve as a cost function for theoretical derivation of learning rules. Amount of information coded by neural spike trains about a stimulus can be measured by mutual information, and provides a fundamental limit for neural codes.
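For a fully known discrete distribution these quantities are one-liners. The sketch below computes entropy and $I(X;Y) = H(X) - H(X|Y)$ from a joint probability table, using $H(X|Y) = H(X,Y) - H(Y)$; the two example tables are toy assumptions. The hard part discussed next is that in practice the table itself has to be estimated from few samples.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a discrete distribution (zero entries are ignored)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_information_bits(joint):
    """I(X;Y) = H(X) - H(X|Y), with H(X|Y) = H(X,Y) - H(Y); rows index X, columns index Y."""
    joint = np.asarray(joint, dtype=float)
    p_x = joint.sum(axis=1)
    p_y = joint.sum(axis=0)
    h_x_given_y = entropy_bits(joint) - entropy_bits(p_y)
    return entropy_bits(p_x) - h_x_given_y

# Toy joint tables: independent variables vs. a perfectly correlated binary pair.
independent = np.outer([0.5, 0.5], [0.25, 0.75])
correlated = np.array([[0.5, 0.0],
                       [0.0, 0.5]])

mi_independent = mutual_information_bits(independent)   # 0 bits
mi_correlated = mutual_information_bits(correlated)     # 1 bit
print(mi_independent, mi_correlated)
```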

Unfortunately, estimating entropy or mutual information is notoriously difficult, especially when the number of observations is less than the number of possible symbols [1]. For neural data, this is often the case, due to the combinatorial nature of the symbols under consideration. If we consider binning a 100 ms window of spike trains from 10 neurons with 1 ms bins, the total number of possible symbols becomes $2^{10 \cdot 100}$. Just to observe that many symbols, one needs $10^{292}$ years. Therefore, we must be clever. The question is how to extrapolate when you may have a severely under-sampled distribution.
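The order of magnitude is easy to check. The sketch assumes one new, distinct symbol per non-overlapping 100 ms window and a 365.25-day year, which is the most optimistic possible scenario.

```python
from math import log10

n_neurons, n_bins = 10, 100          # 10 neurons, 100 bins of 1 ms
window_seconds = 0.1                 # one symbol per 100 ms window (assumption)
seconds_per_year = 365.25 * 24 * 3600

# Work in log10 to avoid handling a 302-digit number directly.
log10_symbols = n_neurons * n_bins * log10(2)          # log10 of 2^(10*100)
log10_years = log10_symbols + log10(window_seconds) - log10(seconds_per_year)

print(f"2^1000 is about 10^{log10_symbols:.1f} symbols")
print(f"observing each once takes about 10^{log10_years:.1f} years")   # about 10^292.5
```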

In the literature, there are many entropy estimators, and mutual information estimators based on them. We extend one of the best known entropy estimators, the NSB estimator [2,3], which is a Bayesian estimator with an approximately non-informative prior on entropy. This is achieved by mixing Dirichlet distributions appropriately. We have extended the procedure to the situation where the number of symbols with non-zero probability is unknown or arbitrarily large, by using a mixture of Pitman-Yor processes as the prior. The limit of the NSB estimator for infinitely many bins can be captured by a Dirichlet process mixture prior. The Pitman-Yor process is an extension of the Dirichlet process with an extra parameter. The advantage of using a Pitman-Yor mixture is that it can fit heavy-tailed distributions, and neural data (as well as many other natural phenomena) have heavy-tailed distributions. Our estimator shows significantly smaller bias for power-law tailed generating processes as well as for spiking neural data.

If you’re at COSYNE 2012, details are presented as a poster titled “Bayesian entropy estimation for infinite neural alphabets” by Evan Archer, myself and Jonathan Pillow. Look for III-31 (Feb 25th, Saturday).

1. Liam Paninski. Estimation of Entropy and Mutual Information. Neural Computation, Vol. 15, No. 6. (1 June 2003), pp. 1191-1253, doi:10.1162/089976603321780272
2. I Nemenman, F Shafee, and W Bialek. Entropy and inference, revisited. NIPS 2001
3. I Nemenman, W Bialek, and R de Ruyter van Steveninck. Entropy and information in neural spike trains: Progress on the sampling problem. Phys. Rev. E, 69:056111, 2004.

In optimal experiment design (or active learning) one seeks an online strategy for function approximation (or system identification). It is particularly useful in situations where it is costly to obtain each sample. But, what if the goal is to optimize a certain target instead of learning the entire function? For problems where parameter adjustment for maximum efficiency is required, for example, drug combination, neural micro-stimulation parameters or aircraft design, one is often not interested in recovering the full system response, but only the optimal set of parameters. Therefore it makes sense to do active learning about the locations of optimal set of parameters, but not on learning the full function.

So we decided to work on the problem under a Bayesian inference framework, and named the problem Active Bayesian Optimization (ABO). The main issue is the complexity of the posterior on the minimizer that we want to learn. Our effort based on approximation is briefly presented in this arXiv paper [1]. Unfortunately, we were not the first to think of the ABO problem. Villemonteix and colleagues [2] presented the problem in a similar setup, using sampling techniques instead of approximation. We learned this at the NIPS Bayesian optimization workshop (2011), where the referees told us about previous work. At the workshop, we also found another recent solution to the ABO problem by Hennig and Schuler [3]. They used a clever approximation to the multi-modal posterior of the minimizer with EP (expectation propagation). Approximate Bayesian inference techniques or clever prior design are definitely needed for ABO, and the initial solutions in [1-3] are somewhat slow and can be computationally intractable. This is an exciting area that has great potential to grow.
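To give a feel for what "the posterior on the minimizer" means, here is a brute-force Monte Carlo sketch: draw functions from a Gaussian process posterior on a grid, record where each draw attains its minimum, and compute the entropy of the resulting histogram. This is not the approximation used in [1] or [3]; the squared-exponential kernel, its length scale, the noise level, the grid, and the test function are all assumptions, and the approach only works in one or two dimensions.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.2):
    """Squared-exponential covariance between two sets of 1-D inputs."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def minimizer_posterior(x_obs, y_obs, grid, n_samples=2000, noise=1e-4, seed=0):
    """Monte Carlo estimate of P(argmin f = grid[i] | data) under a zero-mean GP prior."""
    rng = np.random.default_rng(seed)
    K = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    Ks = rbf_kernel(grid, x_obs)
    mean = Ks @ np.linalg.solve(K, y_obs)
    cov = rbf_kernel(grid, grid) - Ks @ np.linalg.solve(K, Ks.T)
    cov = 0.5 * (cov + cov.T) + 1e-6 * np.eye(len(grid))   # symmetrize and add jitter
    samples = rng.multivariate_normal(mean, cov, size=n_samples)
    counts = np.bincount(samples.argmin(axis=1), minlength=len(grid))
    return counts / n_samples

def entropy_bits(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

f = lambda x: 10.0 * (x - 0.3) ** 2        # hypothetical objective with its minimum at 0.3
grid = np.linspace(0.0, 1.0, 50)
x_few = np.array([0.9])
x_many = np.linspace(0.0, 1.0, 12)

p_few = minimizer_posterior(x_few, f(x_few), grid)
p_many = minimizer_posterior(x_many, f(x_many), grid)
H_few, H_many = entropy_bits(p_few), entropy_bits(p_many)
print(H_few, H_many)   # more observations should leave less uncertainty about the minimizer
```

An active strategy in this spirit would pick the next evaluation point to reduce this entropy as much as possible in expectation, which is where the computational difficulty comes from.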

1. Il Memming Park, Marcel Nassar, Mijung Park. Active Bayesian Optimization: Minimizing Minimizer Entropy.  arXiv:1202.2143v1 [stat.ME]
2. Julien Villemonteix, Emmanuel Vazquez, Eric Walter. An informational approach to the global optimization of expensive-to-evaluate functions. arXiv:cs/0611143v2 [cs.NA] (published in Journal of Global Optimization 2008)
3. Philipp Hennig and Christian J. Schuler. Entropy search for Information-Efficient global optimization. December 2011, arXiv:1112.1217

This was my second NIPS (see last year’s NIPS summary). It had a lower acceptance rate of 22% (I served as a reviewer last year and this year). I felt like there were more computational neuroscience related posters than last year (perhaps due to the location in Europe). Non-parametric Bayes, reinforcement learning (MDP), and sparse learning were still big, while there were fewer kernel related posters. This post is a summary of my experience, and any errors are my own (please let me know if you find any).

### Monday

Dynamical segmentation of single trials from population neural data
Biljana Petreska, M. Sahani, B. Yu, J. Cunningham, S. Ryu, K. Shenoy, Gopal Santhanam

A randomly switching piecewise-linear dynamical system model is constructed via discrete latent states. Given a state, the dynamics of spiking neurons are assumed to be linear. This model is fit to 105 simultaneously recorded neurons (Utah array) during a motor task. The number of states was chosen heuristically. This is an unsupervised method that automatically captures the structure of the dynamics. The results suggest that neurons tend to be in a linear dynamical state both when waiting for the go-cue and during early movement, and go through nonlinear dynamical transitions in between.

Inferring spike-timing-dependent plasticity from spike train data
Ian H. Stevenson, Konrad P. Kording

Different synapses have different forms of STDP, and while spike train data are abundant, in vivo whole cell recordings are very difficult. Hence, learning the synaptic plasticity rule from spike train observations alone is of great importance. This is one of my long-term goals as well. They fit a unidirectionally coupled GLM with a binned weight modulation function that depends on the timing relative to the previous presynaptic spike. The results are promising for simulated models. I’d love to see it applied to well controlled real data.

Active dendrites: adaptation to spike-based communication
Balázs B Ujfalussy, Máté Lengyel

In the presence of correlated presynaptic population activity, to compute a function of presynaptic voltage online from spikes, the neuron has to be nonlinear. In particular, this paper links it to the nonlinear summation property of the dendrite. In previous work by Pfister, J., Dayan, P., Lengyel, M. (2010), they explained the role of short-term plasticity (dynamical synapse model) as optimal predictor for presynaptic membrane potential for a single neuron. This work expands it to the population case.

From stochastic nonlinear integrate-and-fire to generalized linear models
Skander Mensi, Richard Naud, Wulfram Gerstner

This poster shows that given a stochastic (adaptive-exponential) leaky integrate-and-fire neuron model, it is possible to construct a nearly equivalent GLM (as a form of spike response model (SRM) with escape noise). The sub-threshold dynamics are linearized to provide the linear filter (corresponding to the impulse response) and the reset/refractoriness part of the history filter, while the spike adaptation is captured as a slower time scale component of the history filter. Then the link function can be estimated through the empirical observation that $\log(-\log(p(V)))$ is close to being linear. (I was totally thrown off by the notation $p(V)$, which was the probability of spiking given a membrane potential, not the marginal distribution of the model’s voltage.)

Gaussian process modulated renewal processes
Vinayak Rao, Yee Whye Teh

This is an extension of the work by R. P. Adams, I. Murray and D. J.C. MacKay on Poisson intensity estimation to hazard rate modulated renewal processes. The basic ideas are similar: use a sigmoidal link function, and use a thinning-like procedure for point processes to sample exactly.
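The thinning idea is easiest to see in the plain Poisson case: if the intensity is $\lambda_{\max}\,\sigma(g(t))$ with a sigmoid $\sigma$, it is bounded by $\lambda_{\max}$, so one can sample a homogeneous process at rate $\lambda_{\max}$ and keep each point with probability $\sigma(g(t))$. The sketch below does this for a fixed, hypothetical function `g` standing in for a Gaussian process draw; it does not cover the renewal (hazard-modulated) extension of the poster.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_by_thinning(g, lam_max, T, rng):
    """Exact sample from a Poisson process on [0, T] with intensity lam_max * sigmoid(g(t))."""
    n_candidates = rng.poisson(lam_max * T)                  # homogeneous process at rate lam_max
    candidates = np.sort(rng.uniform(0.0, T, size=n_candidates))
    accept_prob = sigmoid(g(candidates))                     # intensity(t) / lam_max
    keep = rng.uniform(size=n_candidates) < accept_prob
    return candidates[keep], candidates

rng = np.random.default_rng(1)
g = lambda t: 3.0 * np.sin(2.0 * np.pi * t / 5.0)   # hypothetical stand-in for a GP sample path
T, lam_max = 20.0, 10.0

events, candidates = sample_by_thinning(g, lam_max, T, rng)
print(len(candidates), len(events))
```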

### Tuesday

Learning in Hilbert vs. Banach spaces: A measure embedding viewpoint
Bharath K. Sriperumbudur, Kenji Fukumizu, Gert R. G. Lanckriet

Kernel embedding of probability distributions and the induced divergence is an emerging direction of kernel methods. The divergence is related to the Bayes risk of the Parzen window classifier in particular, and this paper extends the results to Banach spaces. For a Banach space with a norm that is uniformly Fréchet differentiable and uniformly convex, there is a semi-inner product inducing a reproducing kernel Banach space (RKBS), which has properties analogous to an RKHS. They showed that the kernel embedding is injective when the kernel is a Fourier transform of a signed measure (c.f. Bochner’s theorem requires a positive measure for positive definiteness). The resulting divergence is not computable unless the semi-metric is of a special form, and the convergence rate turns out to be at best the same as in the RKHS case.
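For reference, in the familiar RKHS case the induced divergence is easy to estimate from samples: it is the squared distance between the empirical mean embeddings. The sketch below uses a Gaussian kernel and the biased (V-statistic) estimator; the bandwidth, sample sizes, and distributions are assumptions, and nothing here touches the Banach space extension of the paper.

```python
import numpy as np

def gaussian_gram(x, y, bandwidth=1.0):
    """Gram matrix k(x_i, y_j) of the Gaussian kernel for 1-D samples."""
    sq_dist = (x[:, None] - y[None, :]) ** 2
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd_squared(x, y, bandwidth=1.0):
    """Biased estimate of || mu_P - mu_Q ||^2 in the RKHS of the Gaussian kernel."""
    return float(gaussian_gram(x, x, bandwidth).mean()
                 + gaussian_gram(y, y, bandwidth).mean()
                 - 2.0 * gaussian_gram(x, y, bandwidth).mean())

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=500)
x_again = rng.normal(0.0, 1.0, size=500)    # same distribution as x
y = rng.normal(3.0, 1.0, size=500)          # shifted distribution

d_same = mmd_squared(x, x_again)
d_diff = mmd_squared(x, y)
print(d_same, d_diff)   # the shifted pair should be much farther apart
```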

Modelling genetic variations with fragmentation-coagulation processes
Yee Whye Teh, Charles Blundell, Lloyd T. Elliott

Similar to the Chinese restaurant process (CRP) for clustering, a temporal evolution of clusters by fragmentation (breaking a table into two tables) and coagulation (merging two tables) can be described as a Fragmentation-Coagulation Process (FCP). They show that the FCP is exchangeable, reversible, and has the CRP as its asymptotic distribution.
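As a reminder of the static building block, a CRP partition can be sampled sequentially: customer $i$ joins an existing table with probability proportional to its occupancy, or opens a new one with probability proportional to the concentration $\alpha$. The sketch covers only the plain CRP, not the fragmentation and coagulation dynamics; `alpha` and the number of customers are arbitrary choices.

```python
import numpy as np

def sample_crp(n_customers, alpha, rng):
    """Sequentially seat customers; returns table assignments and table sizes."""
    table_sizes = []
    assignments = []
    for _ in range(n_customers):
        # Existing tables weighted by size, a new table weighted by alpha.
        weights = np.array(table_sizes + [alpha], dtype=float)
        k = int(rng.choice(len(weights), p=weights / weights.sum()))
        if k == len(table_sizes):
            table_sizes.append(1)      # open a new table
        else:
            table_sizes[k] += 1        # join an existing table
        assignments.append(k)
    return assignments, table_sizes

rng = np.random.default_rng(0)
assignments, table_sizes = sample_crp(100, alpha=2.0, rng=rng)
print(len(table_sizes), sorted(table_sizes, reverse=True))
```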

Priors over recurrent continuous time processes [code]
Ardavan Saeedi, Alexandre Bouchard-Côté

This paper received the best student paper award this year, and Ardavan is only a masters student! The problem he is interested in is a discrete latent state dependent continuous time series with partial observation process. For example, a recurrent disease with coarsely quantified states. He introduces the Gamma-exponential process, where an infinite Markovian transition rate matrix prior is given, extends to hierarchical case, and shows how to do inference.

Kernel Beta process
Lu Ren, Yingjian Wang, David Dunson, Lawrence Carin (none of the authors made it to the conference)

The beta process is a distribution over discrete random measures where each “stick” is in $[0, 1]$, but the sticks do not sum to 1 as they do in the Dirichlet process (DP). In this paper, they smooth the sticks in relation to covariates through a kernel, such that their heights are correlated. The kernel here does not have to be positive definite; it only has to be a bounded positive function (like a pdf). I’m curious if a similar approach can be taken for DP. This was originally done in similar fashion for DP by Dunson and Park (2008) (‘kernel stick breaking process’).

Sparse estimation with structured dictionaries
David Wipf

Given an ill-posed problem $Y = \Phi X + \epsilon$, where the dictionary $\Phi$ and the observation $Y$ are known, under a sparsity assumption this can be solved with $l_1$ regularization when $\Phi$ is incoherent (roughly independent columns). However, when the dictionary is more structured, it can cause problems. This paper alleviates this problem by transforming the sparse variables, which effectively re-normalizes them. It turns out the solution is similar to iteratively reweighted $l_1$ with a different penalty. [Workshop version recording]
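For the baseline case of an incoherent dictionary, plain $l_1$ regularization can be solved with a few lines of iterative soft-thresholding (ISTA). This is a sketch of the standard baseline only, not the method of the paper; the random Gaussian dictionary, the sparsity pattern, and the value of `lam` are assumptions.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(Phi, y, lam, n_iter=2000):
    """Minimize 0.5 * ||y - Phi x||^2 + lam * ||x||_1 by iterative soft-thresholding."""
    step = 1.0 / np.linalg.norm(Phi, 2) ** 2     # 1 / Lipschitz constant of the gradient
    x = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ x - y)
        x = soft_threshold(x - step * grad, step * lam)
    return x

def objective(Phi, y, x, lam):
    return 0.5 * np.sum((y - Phi @ x) ** 2) + lam * np.sum(np.abs(x))

rng = np.random.default_rng(0)
Phi = rng.standard_normal((50, 100))
Phi /= np.linalg.norm(Phi, axis=0)               # unit-norm columns (incoherent dictionary)

support = [5, 37, 80]                            # hypothetical sparse ground truth
x_true = np.zeros(100)
x_true[support] = [2.0, -3.0, 1.5]
y = Phi @ x_true + 0.01 * rng.standard_normal(50)

lam = 0.05
x_hat = ista(Phi, y, lam)
largest = sorted(np.argsort(-np.abs(x_hat))[:3].tolist())
print(largest)   # indices of the three largest recovered coefficients
```

With a highly structured (coherent) dictionary the same code can pick the wrong columns, which is the failure mode the paper sets out to address.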

Sequence learning with hidden units in spiking neural networks
Johanni Brea, Walter Senn, Jean-Pascal Pfister

Given a point process, the problem is to train a spiking neural network composed of GLM units (including hidden units) that would generate the training patterns. Minimization of the KL-divergence between the given point process and the one parameterized by the GLM is done by online gradient descent. The gradient requires marginalization over the spikes of the hidden units: $\frac{-\partial D_{KL}(p^\ast || p_\theta)}{\partial \theta} = E\left[ \frac{\partial \log p(v,h;\theta)}{\partial \theta} \right]$, so they developed an importance sampling scheme where the samples from the hidden units are obtained given the training spikes. The resulting training rule is Hebbian, and analogous to STDP. The results are shown for the case when the given distribution is a delta, that is, when the network has to produce exactly one pattern, and that pattern only.

### Wednesday

I presented Bayesian Spike Triggered Covariance analysis as a poster.

### Thursday

Empirical models of spiking in neural populations
Jakob H. Macke, Lars Büsing, John P. Cunningham, Byron M. Yu, Krishna V. Shenoy, Maneesh Sahani

A comparison study between the coupled GLM and a latent variable model (Poisson linear dynamical system) for fitting motor cortex observations (preparation phase only). While the GLM only allows explicit coupling to the spiking history of the population, the latent variable model allows a low-dimensional hidden common input source with linear dynamics. They show that the latent variable model fits better and could reconstruct the cross-correlations while the GLM couldn’t. There was quite a bit of discussion on the floor after the oral presentation. The difference in performance was probably due to (1) the relatively large bin size (10 ms), and (2) the neurons being recorded by a Utah array, which means a low probability of direct connectivity. The coupled GLM was successfully applied to the retina, where the coupling is local and the sampling of the neurons was very dense, with a 0.1 ms bin size. It would be interesting to see further developments of latent variable models and GLMs in modeling such motor system data.

### Workshops

Hierarchical algorithms for χ-armed bandits
Rémi Munos

This was a non-Bayesian invited talk for the Bayesian optimization, experimental design and bandits workshop. He talked about his paper for the main conference, “Optimistic Optimization of a Deterministic Function without the Knowledge of its Smoothness”. In this case, the smoothness assumption comes from $f(x^*) - f(x) \leq l(x^*, x)$, where $l$ is a semi-metric, and hence the function is bounded from below around the global maximum by the semi-metric. Then, using a hierarchical partitioning of the input space that respects the semi-metric, one can get a bound on the function. (This assumption is neither weaker nor stronger than Lipschitz continuity, since the absolute value is missing and it only holds relative to the maximum.) When the knowledge of the semi-metric is perfect, the convergence rate of the simple regret (best function value) can be exponential (depending on the semi-metric; multiple semi-metrics can give the bound, but the convergence rate can differ). When the semi-metric is unknown and one overestimates the exponent, for example, global convergence is not guaranteed.
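The known-semi-metric case can be sketched as a small optimistic tree search on $[0, 1]$: every cell gets an upper bound (center value plus the largest possible $l(x^*, x)$ within the cell), and the cell with the highest bound is split into three. This is a simplified sketch in the spirit of the talk, not the algorithm from the paper; the test function, the choice $l(x, y) = |x - y|$, and the budget are assumptions, and the middle child wastefully re-evaluates its parent's center.

```python
import heapq

def optimistic_maximize(f, radius, budget):
    """Optimistic search on [0, 1] with ternary splits.

    radius(depth) must upper-bound l(x*, x) for any x in a cell at that depth.
    """
    evaluations = 0
    best_x, best_value = None, float("-inf")
    heap = []   # max-heap on the upper bound, stored as negated values

    def add_cell(lo, hi, depth):
        nonlocal evaluations, best_x, best_value
        center = 0.5 * (lo + hi)
        value = f(center)
        evaluations += 1
        if value > best_value:
            best_x, best_value = center, value
        heapq.heappush(heap, (-(value + radius(depth)), lo, hi, depth))

    add_cell(0.0, 1.0, 0)
    while evaluations + 3 <= budget:
        _, lo, hi, depth = heapq.heappop(heap)     # most optimistic cell
        third = (hi - lo) / 3.0
        for j in range(3):
            add_cell(lo + j * third, lo + (j + 1) * third, depth + 1)
    return best_x, best_value

f = lambda x: 1.0 - abs(x - 0.3)                   # hypothetical objective, maximum at 0.3
radius = lambda depth: 0.5 * 3.0 ** (-depth)       # half-width of a cell, for l(x, y) = |x - y|

best_x, best_value = optimistic_maximize(f, radius, budget=61)
print(best_x, best_value)
```

Because the semi-metric here matches the function exactly, the search keeps splitting the cell that contains the maximizer, so the error shrinks geometrically with the number of evaluations.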

Dynamic Batch Bayesian Optimization
Javad Azimi, Ali Jalali, Xiaoli Fern

When parallel experiments are possible, experimental design with batch sampling can improve efficiency, but sequential design often performs better than batch design. Under the assumption that the maximum of the function has a known bound, and using the GP predictive covariance, they choose a set of points that are approximately independent of each other and could each improve the criterion.

Future information minimization as PAC Bayes regularization in Reinforcement Learning
Naftali Tishby

This was the last invited talk for the New frontiers in model order selection workshop. Tishby talked about reinforcement learning in a POMDP setup, but I couldn’t fully follow (in fact, it mostly went over my head). In a perception-action cycle, the Bellman equation describes the world evolution and the associated reward, and he describes a counterpart for the agent (mental state?) using an associated Bellman equation with information-to-go (mutual information with respect to a goal). Then he describes reinforcement learning as a coding problem (relating to Kraft’s inequality, and using the fact that a subtree of an optimal coding tree is itself an optimal coding tree). At some point, he reaches a PAC-Bayesian bound and claims that reinforcement learning self-regularizes.

Between the philosophy of science and machine learning
David Corfield [U of Kent]

This was the first invited talk for the Philosophy and machine learning workshop. He talked about a broad range of philosophers (of science) and a couple of examples of their interaction with ML. The first example was Karl Popper’s idea of the complexity of a theory in terms of falsifiable dimensions and its similarity to VC dimension (see their paper in 2009 for details). The second example was Judea Pearl’s use of counterfactuals (from David Lewis), and its impact on philosophy of science. He talked about what kinds of sciences can benefit from ML: certainly the ones with lots of data. He also went through the ideas of many philosophers, including Popper, Carnap, Kuhn, Lakatos, and Feyerabend. It is certainly a very fascinating area, but my impression was that we don’t have much to talk about yet.


A widely used tool in neural characterization, where one is interested in the stimulus (or behavior) features that a neuron is sensitive to, is spike-triggered averaging (STA), also known as reverse correlation analysis [Dayan & Abbott]. To obtain the STA, at the occurrence of each spike one takes the stimulus in a window time-locked to the spike, which potentially caused the spike (or the behavior that was potentially caused by the spike), and averages these windows over all spikes.

It essentially estimates the first-order term of the Volterra expansion of the neural response function, that is, it approximates the neuron as a linear system. Although a neuron is not really a linear system, STA works well in practice. Moreover, it is a consistent estimator for a linear-nonlinear Poisson (LNP) model if the stimulus is white Gaussian noise [Bussgang 1952 in Dayan & Abbott]. In [Paninski 2003] this condition is extended to an arbitrary radially symmetric stimulus that induces a non-zero mean response.
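As a rough illustration of the consistency claim, the following sketch simulates a single-filter LNP neuron driven by white Gaussian noise and computes the STA. Everything here is an assumption made for the example: the exponential nonlinearity, the dimensions, the random seed, and the simplification that each row of `X` already holds the stimulus window preceding one time bin.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 10, 20000                      # assumed filter length and number of time bins

# Each row is the (already windowed) stimulus for one bin: white Gaussian noise.
X = rng.standard_normal((N, D))

# Hypothetical ground-truth filter with unit norm.
w_true = rng.standard_normal(D)
w_true /= np.linalg.norm(w_true)

# LNP model: linear projection, pointwise nonlinearity, Poisson spike counts.
rate = np.exp(X @ w_true - 1.0)
y = rng.poisson(rate)

# STA: the spike-count-weighted average of the stimulus.
sta = (y @ X) / y.sum()

# Direction agreement between the STA and the true filter.
cosine = sta @ w_true / np.linalg.norm(sta)
```

With these settings the STA should point in nearly the same direction as `w_true`, up to sampling noise; only the direction (not the scale) of the filter is generally recoverable this way.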

When the neuron’s feature space is low-dimensional but not 1-dimensional, STA is not sufficient, since it recovers only a 1-dimensional subspace. Spike-triggered covariance (STC) is an extension of STA that can consistently estimate the filters of a multi-dimensional LNP model [Paninski 2003]. Let us denote the zero-mean stimulus distribution as $p(x)$, and the spike-triggered distribution as $q(x)$. Then, STA is the mean of $\hat{q}(x)$ (the empirical estimate of $q(x)$), and STC is given by the eigenvectors of the covariance matrix of $\hat{q}(x)$. STC is a consistent estimator only when the stimulus distribution is Gaussian [for details, see Paninski 2003].
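A small simulation can show why a second-moment analysis is needed. The sketch below assumes a two-filter LNP neuron with a symmetric quadratic nonlinearity, for which the STA is close to zero while the spike-triggered covariance has inflated variance along the two filters. The filters, the nonlinearity, the sizes, and the seed are all hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 8, 50000                       # assumed stimulus dimension and sample count

# Zero-mean white Gaussian stimulus, i.e. samples from p(x) with identity covariance.
X = rng.standard_normal((N, D))

# Two hypothetical orthonormal filters spanning the neuron's feature space.
Q, _ = np.linalg.qr(rng.standard_normal((D, 2)))
w1, w2 = Q[:, 0], Q[:, 1]

# Symmetric quadratic nonlinearity: the STA carries almost no information here.
rate = 0.2 * ((X @ w1) ** 2 + (X @ w2) ** 2)
y = rng.poisson(rate)
n_spikes = y.sum()

# Mean and covariance of the empirical spike-triggered distribution q_hat(x).
sta = (y @ X) / n_spikes
Xc = X - sta
stc = (Xc * y[:, None]).T @ Xc / n_spikes

# Compare with the stimulus covariance (identity); eigh sorts eigenvalues ascending,
# so the last two eigenvectors are the directions of largest variance increase.
eigvals, eigvecs = np.linalg.eigh(stc - np.eye(D))
top2 = eigvecs[:, -2:]

# How much of each true filter lies in the recovered 2-D subspace (1.0 is perfect).
proj_w1 = np.linalg.norm(top2.T @ w1)
proj_w2 = np.linalg.norm(top2.T @ w2)
```

Because the two leading eigenvalues are nearly equal in this example, the individual eigenvectors are arbitrary rotations within the feature subspace, so the comparison is made on the subspace and not on the vectors themselves.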

STA/STC are moment-based estimators, and do not have a probabilistic model. Analogous to how PPCA (probabilistic principal component analysis) provided a generative model for PCA, allowing Bayesian extensions of PCA, we formulate the STA/STC problem as maximum likelihood estimation in a generative model. Inspired by iSTAC [Pillow & Simoncelli 2006], we extend the LNP model (figure) with an exponentiated quadratic nonlinearity. This allows us to put priors on the features and develop Bayesian estimators. We further extend it to a general family of models that allows consistent estimation using arbitrary stimulus distributions and a flexible class of nonlinearities. This result will be presented at Neural Information Processing Systems (NIPS) 2011. If you are coming to NIPS, it’s poster W88!
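For readers who want a concrete form, an exponentiated quadratic LNP model can be written roughly as follows. The symbols $C$ (a symmetric matrix), $b$ (a vector), $a$ (a scalar), and $\Delta$ (the time bin size) are notation assumed here for illustration and may not match the paper's.

$$\lambda(x) = \exp\left(\tfrac{1}{2} x^\top C x + b^\top x + a\right), \qquad y \mid x \sim \mathrm{Poisson}\big(\lambda(x)\,\Delta\big)$$

Under this assumed form, the log-likelihood of spike counts $y_t$ given stimuli $x_t$ is, up to a constant that does not depend on the parameters,

$$\mathcal{L}(C, b, a) = \sum_t y_t \left(\tfrac{1}{2} x_t^\top C x_t + b^\top x_t + a\right) - \Delta \sum_t \lambda(x_t).$$

The first sum depends on the data only through the spike count, the spike-weighted sum of the stimuli, and the spike-weighted sum of their outer products, which is where the connection to STA and STC comes from.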