7th Black Board Day (BBD7)
Il Memming Park: On halting problem route to incompleteness
Kenneth Latimer: On Roger Penrose’s Emperor’s new mind
Michael Buice: Algebra of Probable Inference
Ryan Usher: An Incomplete, Inconsistent, Undecidable and Unsatisfiable Look at the Colloquial Identity and Aesthetic Possibilities of Math or Logic
Jonathan Pillow: Do we live inside a Turing machine?
- Simulated human brain brings consciousness (“substance independence”)
- Large scale simulation of human brain + physical world around human is possible
- Alan Turing. (1936) On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society. Series 2, 42: 230–265
- Michael Sipser. Introduction to the Theory of Computation (Memming’s halting problem proof followed this one)
- Roger Penrose. Emperor’s new mind
- Torkel Franzén. Gödel’s Theorem: An Incomplete Guide to Its Use and Abuse (recommended by Ryan)
- Richard T. Cox. Algebra of Probable Inference
- Cox, R. (1946). Probability, frequency and reasonable expectation. American Journal of Physics, 14(1), 1–13.
- E.T. Jaynes. Probability Theory: The Logic of Science
- Martin Davis. The Undecidable: Basic Papers on Undecidable Propositions, Unsolvable Problems and Computable Functions (Dover Books on Mathematics; a collection of papers)
- Martin Davis. Computability and Unsolvability (Michael Buice: One of the most beautiful books written by humankind; an introduction to recursive function theory, computability, and Turing machines. One of the few books that does so in a complete and rigorous manner; it also covers logic and Gödel’s theorem.)
- Bostrom, N. , 2003, Are You Living in a Computer Simulation?, Philosophical Quarterly (2003), Vol. 53, No. 211, pp. 243-255.
Primary olfactory receptor neurons (ORNs) bind to odor molecules in the medium and send action potentials to the brain. This signaling is not simply ON and OFF: each ORN has delicate sensitivity to various odors and shows diverse temporal activation patterns. Using both electrophysiology and calcium-sensitive dye imaging, my collaborators Yuriy V. Bobkov and Kirill Y. Ukhanov studied the temporal aspect of lobster ORNs. The heterogeneous response patterns are well presented in a recent paper published in PLoS One. I was particularly interested in a special type of ORN called bursting ORNs. Bursting ORNs oscillate spontaneously, and the calcium imaging data allow population analysis. I was involved in the analysis to see if there is any sign of synchrony, using a resampling-based burst-triggered averaging technique. It turns out that they rarely, if ever, interact. Moreover, they have a wide range of oscillation periods. Since they are coupled through the environment (a filament of odor molecules in the medium), in natural environments or under controlled odor stimulation they sometimes synchronize, which is the subject of another paper under review.
Note: the publication actually has my first name as Ill instead of Il which is silly and sick. I asked for a correction, but it seems PLoS One will only publish a note for the correction and not correct the actual article (because of the inconsistency it will cause for other indexing systems [1][2]). This could have been fixed in the proof, if PLoS did proofs before final publications, but they don’t (presumably to lower costs). In my opinion, this is a flaw of PLoS journals. EDIT: there’s a note saying that my name is misspelled now.
Interesting talks/posters from COSYNE 2012
As I did for past COSYNEs (2009, 2010, 2011), this is a summary of my personal experience this year. I loved both the main meeting and the workshops. It’s definitely one of the best conferences. I had to present my own posters, so I couldn’t see many others. Therefore, this selection is severely subsampled. Also, there might be details that I am not remembering correctly. If you spot any mistake, please let me know.
Neural dynamics in neural coding
II-67. Jeffrey Seely, Matthew T. Kaufman, John Cunningham, Stephen Ryu, Krishna Shenoy, Mark Churchland. Dimensionality in motor cortex: differences between models and experiment
Is population activity in the motor cortex well explained by tuning curves for each neuron, or is it better explained by linear dynamics? To answer this question, they collapsed each experimental condition (motor output) to a temporally interpolated histogram of same length. A large 3-D matrix A(n,c,t) for n neurons, c conditions, t time bins is constructed, and sliced with 2 different possible low rank approximations (PCA): one is with conditions which implies tuning curve like characteristics, and the other is with time which represents the dynamic modes (basis solutions to a differential equation). They sequentially chose the component, either condition or dynamics, that explains the most variance, subtracted it, and repeated. They showed that real data is mostly dynamics while tuning based models generate data that are mostly condition based. This is a pretty convincing argument using only very basic tools.
Workshop talk: David Sussillo. Rethinking gating: selective integration of sensory signals through network dynamics
Frontal Eye Field (FEF) spiking responses to a colored random dots task, where a contextual cue determines whether the monkey has to use the direction of the dots or the majority color to make the decision, are analyzed in a dynamical system framework. (This talk is related to Valerio Mante’s poster II-58, but I missed it in the main meeting.) The question is how the monkey switches context: is there some sort of gating mechanism that controls whether the motion stimulus or the color stimulus reaches FEF? Or does all the information get to FEF, where the decision is formed? Directions in the population firing rate (state) space are extracted by regression on the conditional firing rates (the neurons were recorded one at a time; they assume independence). The trajectory in the neural state space (reconstructed from these regression directions) shows integration-like behavior along the relevant stimulus dimension, while the irrelevant dimension is encoded but not integrated. They further built a recurrent neural network model trained with Hessian-free optimization [James Martens, Ilya Sutskever, Learning Recurrent Neural Networks with Hessian-Free Optimization, ICML 2011 pdf] and saw similar performance and dynamics by tuning only the stimulus noise level. He further showed a fixed point analysis of the trained network and a non-normal matrix decomposition to explain the integration on the line attractor. A similar talk given jointly by Mante and Sussillo at Santa Fe Institute can be found online. EDIT: they gave another related talk at the Redwood Center, with online video.
Learning to fire a precise temporal pattern from spike train input
I was pleased to find 3 very cool posters related to learning the synaptic weights of an integrate-and-fire neuron for spike train input. It is the comeback of the tempotron (Robert Gütig, Nature 2006)!
I-22. Robert Gütig. The multi-class tempotron: a neuron model for processing of sensory streams
Tempotron is a classifier that either emits a spike or not to indicate the class given an input spike pattern from many neurons. Robert extended the tempotron to allow not just one or zero spike, but to learn to fire a prescribed number of spikes. This is done by considering the rank-ordered membrane voltage peaks simultaneously, where the original tempotron only deals with the maximum peak. He showed that this could be used to detect an event or feature in time. The training is done by providing just the number of occurrences (does not require tagging time series with precise timings).
II-23. Raoul-Martin Memmesheimer, Ran Rubin, Haim Sompolinsky. Learning precisely timed spiking responses
One of the main advantages of the tempotron is that it can use time as an extra degree of freedom, allowing a higher capacity (# of patterns / synapse) compared to the traditional perceptron. However, this is also a disadvantage because the precise timing of the spikes is not controlled. This poster describes a couple of simple iterative procedures for updating the weights and threshold of an IF neuron to produce a desired spiking pattern. The tricky part of such a task is the complication of the reset after erroneous spikes. Two algorithms, (1) first error learning and (2) high threshold learning, are proposed to overcome this difficulty. They showed that the algorithm converges to the solution in a similar fashion to the perceptron, if there is a solution.
II-39. Ran Rubin, Raoul-Martin Memmesheimer, Haim Sompolinsky. Support Vector Machines in Spiking Neurons with Non-Linear Dendrites
This is a companion poster to II-23. They extend the method to find a robust solution by maximizing the margin. Using an auxiliary voltage trace assuming the resets happened in a small time before the desired spike occurred (as in the high threshold learning algorithm), they formulated the problem as an SVM-like optimization problem with constraints. They also proposed active dendrites as nonlinear positive semi-definite kernels (point nonlinearity on the original inner product).
Probabilistic modeling based neural/stimulus distances
Measuring similarity given a generative system for the data can be done with divergences. Given a probabilistic spiking neuron population model, one can measure the similarity between the stimuli or between the population responses; there were two posters for each idea using the Ising model.
I-7. Elad Ganmor, Ronen Segev, Elad Schneidman. Semantic organization of a neural population codebook and accurate decoding using a neural thesaurus
Trial-to-trial variability of the population response was captured by an Ising model. Using Bayes’ rule, they measured the Jensen-Shannon divergence between the stimulus posteriors of two response patterns, $D_{JS}\left(P(s \mid r_1) \,\|\, P(s \mid r_2)\right)$ (not a metric unless the square root is taken). They only consider the instantaneous response (20 neurons, 10 ms bins, binarized), with no temporal structure. Using hierarchical clustering (forming the codebook) on the test response patterns, they showed that this method captures most of the mutual information with just a few clusters.
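The remark about the square root can be checked numerically. The following is a minimal Python sketch, not the poster's analysis: the helper names and the three toy distributions are assumptions chosen only to show that the Jensen-Shannon divergence can violate the triangle inequality while its square root does not for this example.

```python
import math

def entropy_bits(p):
    """Shannon entropy (in bits) of a discrete distribution given as a list."""
    return -sum(q * math.log2(q) for q in p if q > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: H(m) - (H(p) + H(q)) / 2, m = (p + q) / 2."""
    m = [0.5 * (a + b) for a, b in zip(p, q)]
    return entropy_bits(m) - 0.5 * entropy_bits(p) - 0.5 * entropy_bits(q)

# Toy distributions (hypothetical): two disjoint point masses and their midpoint.
p = [1.0, 0.0]
q = [0.0, 1.0]
mid = [0.5, 0.5]

d_pq = js_divergence(p, q)
d_pm = js_divergence(p, mid)
d_mq = js_divergence(mid, q)

# The divergence itself breaks the triangle inequality on this triple ...
triangle_fails_for_js = d_pm + d_mq < d_pq
# ... but its square root satisfies it here.
triangle_holds_for_sqrt = math.sqrt(d_pm) + math.sqrt(d_mq) >= math.sqrt(d_pq)

print(d_pq, d_pm, d_mq, triangle_fails_for_js, triangle_holds_for_sqrt)
```

Going through the midpoint costs less total divergence than the direct path, which is exactly what a metric forbids.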
I-35. Gasper Tkacik, Einat Granot-Atedgi, Ronen Segev, Elad Schneidman. Retinal metric: a stimulus distance measure derived from population neural responses
They used the symmetric Kullback-Leibler divergence between the stimulus-conditioned response distributions as a similarity measure between stimuli: $D(s_1, s_2) = D_{KL}\left(P(r \mid s_1) \,\|\, P(r \mid s_2)\right) + D_{KL}\left(P(r \mid s_2) \,\|\, P(r \mid s_1)\right)$ for similarity in the stimulus space. The conditional distribution was modeled with a stimulus-driven maximum entropy Ising model in which the higher-order interaction terms do not depend on the stimulus: $P(r \mid s) = \frac{1}{Z(s)} \exp\left(\sum_i h_i(s)\, r_i + \sum_{i<j} J_{ij}\, r_i r_j\right)$. They did not use the JS divergence because it is difficult to compute from the Ising model. This similarity reveals which features of the stimulus the population really cares about.
Extending Spike Triggered Covariance
There were 3 very related talks about spike triggered covariance (STC) analysis in the Characterizing Neural Responses to Structured and Naturalistic Stimuli workshop organized by Kanaka Rajan and William Bialek.
Jonathan Pillow
His talk was focused on Empirical Bayes (EB) methods for the inference of hierarchical models. The first part was about Mijung Park’s work on spatio-temporal and frequency localized prior design for receptive fields. The second part, which was brief due to time constraints, was about a Bayesian extension of STC where the number of receptive fields is inferred by EB.
William Bialek
He talked about the full history from reverse correlation (Boer 1968), to STC, to maximizing mutual information. He introduced Kanaka’s work (arXiv:1201.0321v1 [q-bio.NC], also poster III-34, but I missed it) on maximizing the mutual information between quadratic projections of the stimulus and the response. This is an interesting extension of MID (see Sharpee’s talk below). MID tends to degrade as the number of dimensions to extract increases, but their method seems to work better.
Tatyana Sharpee
Maximally informative dimension (MID) aims at finding the receptive fields of a linear-nonlinear cascade regardless of the nonlinearity by maximizing mutual information. This is an ideal goal; however, estimation and maximization of mutual information are very difficult in practice (as in the case of information bottleneck), and implementations suffer from local optima and sensitivity to the (histogram) parameterization. She presented an approach from the opposite direction: minimize mutual information (or equivalently, maximize the conditional entropy of the response given the stimulus). Using a maximum entropy model with the first two moments constrained, she derived a quadratic form of logistic regression as the model to fit to the binary spiking response: $P(\mathrm{spike} \mid \mathbf{x}) = \left(1 + \exp\left(a + \mathbf{b}^\top \mathbf{x} + \mathbf{x}^\top C\, \mathbf{x}\right)\right)^{-1}$. This is closely related to our BSTC work, which has a similar quadratic form of the Poisson regression model as a special case. (I missed the related poster by Ryan Rowekamp et al. II-35) Ref: J.D. Fitzgerald, R. J. Rowekamp, L.C. Sincich and T.O. Sharpee, (2011) “Second order dimensionality reduction using minimum and maximum mutual information models”, PLoS Computational Biology, 7(10): e1002249 doi:10.1371/journal.pcbi.1002249
III-36. Brett Vintch, Andrew D Zaharia, Tony Movshon, Eero P Simoncelli. Fitting receptive fields in V1 and V2 as linear combinations of nonlinear subunits
I missed this one, but this one is also highly related. They have a low complexity model that generates a set of filters that are generally obtained from STC on V1 complex cells.
Shannon’s entropy is a fundamental statistic that measures the uncertainty of a (discrete) distribution. It is a building block for mutual information, $I(X;Y) = H(X) - H(X \mid Y)$, which has numerous applications in statistics, communication, signal processing, machine learning and so on. In the context of neuroscience, entropy can measure the maximum capacity of a neuron, quantify the amount of noise, and also serve as a cost function for the theoretical derivation of learning rules. The amount of information coded by neural spike trains about a stimulus can be measured by mutual information, which provides a fundamental limit for neural codes.
Unfortunately, estimating entropy or mutual information is notoriously difficult, especially when the number of observations is less than the number of possible symbols [1]. For neural data, this is often the case, due to the combinatorial nature of the symbols under consideration. If we consider binning a 100 ms window of spike trains from 10 neurons with a resolution of 1 ms, the total number of possible symbols becomes $2^{1000} \approx 10^{301}$ (treating each bin as binary). Just to observe that many symbols, even at one 100 ms window per observation, one needs on the order of $10^{292}$ years. Therefore, we must be clever. The question is how to extrapolate when you may have a severely under-sampled distribution.
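To make the under-sampling problem concrete, here is a minimal Python sketch, assuming a uniform distribution over a hypothetical alphabet of 1000 symbols and only 100 observations. It uses the naive plug-in (maximum likelihood) estimator, which is not the estimator discussed in this post; the point is only that the plug-in estimate can never exceed the logarithm of the number of samples, so it is badly biased downward in this regime.

```python
import math
import random
from collections import Counter

def plugin_entropy_bits(samples):
    """Plug-in entropy estimate (bits): entropy of the empirical histogram."""
    n = len(samples)
    counts = Counter(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

random.seed(0)
n_symbols = 1000   # alphabet size (assumed for illustration)
n_samples = 100    # far fewer observations than symbols

true_entropy = math.log2(n_symbols)                 # uniform distribution
samples = [random.randrange(n_symbols) for _ in range(n_samples)]
estimate = plugin_entropy_bits(samples)

print(f"true entropy:     {true_entropy:.2f} bits")
print(f"plug-in estimate: {estimate:.2f} bits")
print(f"upper bound log2(N): {math.log2(n_samples):.2f} bits")
```

With 100 samples the histogram has at most 100 occupied bins, so the estimate is capped near 6.6 bits while the true value is close to 10 bits.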
In the literature, there have been many entropy estimators, and mutual information estimators based on them. We extend one of the best known entropy estimators, the NSB estimator [2,3], which is a Bayesian estimator with an approximately non-informative prior on entropy. This is achieved by mixing Dirichlet distributions appropriately. We have extended the procedure to situations where the number of symbols with non-zero probability is unknown or arbitrarily large by using a mixture of Pitman-Yor processes as the prior. The limit of the NSB estimator for infinitely many bins can be captured by a Dirichlet process mixture prior. The Pitman-Yor process is an extension of the Dirichlet process with an extra parameter. An advantage of using a Pitman-Yor mixture is that it can fit heavy-tailed distributions, and neural data (as well as many other natural phenomena) have heavy-tailed distributions. Our estimator shows significantly smaller bias for power-law-tailed generating processes as well as for spiking neural data.
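The heavy-tail property of the Pitman-Yor process can be seen with the standard stick-breaking construction, in which the k-th stick fraction is drawn from Beta(1 - d, α + k d), with discount d and concentration α. The sketch below is only an illustration of the prior, not of our estimator; the function name, parameter values and truncation at 200 sticks are assumptions. Setting d = 0 recovers the Dirichlet process.

```python
import random

def pitman_yor_sticks(discount, concentration, n_sticks, rng):
    """Truncated stick-breaking draw of Pitman-Yor weights.

    Returns the first n_sticks weights and the probability mass left over
    in the (untruncated) tail.
    """
    weights = []
    remaining = 1.0
    for k in range(1, n_sticks + 1):
        v = rng.betavariate(1.0 - discount, concentration + k * discount)
        weights.append(remaining * v)
        remaining *= 1.0 - v
    return weights, remaining

rng = random.Random(0)
# Dirichlet process (discount 0): the tail mass decays geometrically.
dp_weights, dp_tail = pitman_yor_sticks(0.0, 1.0, 200, rng)
# Pitman-Yor with discount 0.5: the tail mass decays as a power law.
py_weights, py_tail = pitman_yor_sticks(0.5, 1.0, 200, rng)

print(f"DP tail mass after 200 sticks: {dp_tail:.3e}")
print(f"PY tail mass after 200 sticks: {py_tail:.3e}")
```

The Pitman-Yor draw keeps a non-negligible amount of mass beyond the first 200 symbols, which is the kind of long tail the Dirichlet process prior cannot represent.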
If you’re at COSYNE 2012, details are presented as a poster titled “Bayesian entropy estimation for infinite neural alphabets” by Evan Archer, myself and Jonathan Pillow. Look for III-31 (Feb 25th, Saturday)
- Liam Paninski. Estimation of Entropy and Mutual Information. Neural Computation, Vol. 15, No. 6. (1 June 2003), pp. 1191-1253, doi:10.1162/089976603321780272
- I Nemenman, F Shafee, and W Bialek. Entropy and inference, revisited. NIPS 2001
- I Nemenman, W Bialek, and R de Ruyter van Steveninck. Entropy and information in neural spike trains: Progress on the sampling problem. Phys. Rev. E, 69:056111, 2004.
Active Bayesian Optimization
In optimal experiment design (or active learning) one seeks an online strategy for function approximation (or system identification). It is particularly useful in situations where it is costly to obtain each sample. But, what if the goal is to optimize a certain target instead of learning the entire function? For problems where parameter adjustment for maximum efficiency is required, for example, drug combination, neural micro-stimulation parameters or aircraft design, one is often not interested in recovering the full system response, but only the optimal set of parameters. Therefore it makes sense to do active learning about the locations of optimal set of parameters, but not on learning the full function.
So we decided to work on the problem under a Bayesian inference framework, and named the problem Active Bayesian Optimization (ABO). The main issue is the complexity of the posterior on the minimizer that we want to learn. Our effort based on approximation is briefly presented in this arXiv paper [1]. Unfortunately, we were not the first to think of the ABO problem. Villemonteix and colleagues [2] have presented the problem in a similar setup using sampling techniques instead of approximation. We got to know this from the NIPS Bayesian optimization workshop (2011), where the referees told us about previous work. At the workshop, we also found another recent solution to the ABO problem by Hennig and Schuler [3]. They used a clever approximation to the multi-modal posterior of the minimizer with EP (expectation propagation). Approximate Bayesian inference techniques or clever prior design are definitely needed for ABO, and the initial solutions in [1-3] are somewhat slow and can be computationally intractable. This is an exciting area that has great potential to grow.
- Il Memming Park, Marcel Nassar, Mijung Park. Active Bayesian Optimization: Minimizing Minimizer Entropy. arXiv:1202.2143v1 [stat.ME]
- Julien Villemonteix, Emmanuel Vazquez, Eric Walter. An informational approach to the global optimization of expensive-to-evaluate functions. arXiv:cs/0611143v2 [cs.NA] (published in Journal of Global Optimization 2008)
- Philipp Hennig and Christian J. Schuler. Entropy search for Information-Efficient global optimization. December 2011, arXiv:1112.1217
NIPS 2011
This was my second NIPS (see last year’s NIPS summary). It had a lower acceptance rate of 22% (I served as a reviewer last year and this year). I felt like there were more computational neuroscience related posters than last year (perhaps due to the location in Europe). Non-parametric Bayes, reinforcement learning (MDP), and sparse learning were still big, while there were fewer kernel-related posters. This post is a summary of my experience, and any error is my own (please let me know if you find any).
Monday
Dynamical segmentation of single trials from population neural data
Biljana Petreska, M. Sahani, B. Yu, J. Cunningham, S. Ryu, K. Shenoy, Gopal Santhanam
A randomly switching piecewise-linear dynamical system model is constructed via discrete latent states. Given a state, the dynamics of spiking neurons are assumed to be linear. This model is fit to 105 simultaneously recorded neurons (Utah array) during a motor task. The number of states was chosen heuristically. This is an unsupervised method that automatically captures the structure of the dynamics. The results suggest that neurons tend to be in a linear dynamical state both when waiting for the go-cue and during early movement, and go through nonlinear dynamical transitions in between.
Inferring spike-timing-dependent plasticity from spike train data
Ian H. Stevenson, Konrad P. Kording
Different synapses have different forms of STDP, and while spike train data are abundant, in vivo whole cell recordings are very difficult. Hence, learning the synaptic plasticity rule from just spike train observations is of great importance. This is one of my long-term goals as well. They fit a unidirectionally coupled GLM with a binned weight modulation function as a function of the timing relative to the previous presynaptic spike. The results are promising for simulated models. I’d love to see it applied to well-controlled real data.
Active dendrites: adaptation to spike-based communication
Balázs B Ujfalussy, Máté Lengyel
In the presence of correlated presynaptic population activity, to compute a function of presynaptic voltage online from spikes, the neuron has to be nonlinear. In particular, this paper links it to the nonlinear summation property of the dendrite. In previous work by Pfister, J., Dayan, P., Lengyel, M. (2010), they explained the role of short-term plasticity (dynamical synapse model) as optimal predictor for presynaptic membrane potential for a single neuron. This work expands it to the population case.
From stochastic nonlinear integrate-and-fire to generalized linear models
Skander Mensi, Richard Naud, Wulfram Gerstner
This poster shows that given a stochastic (adaptive-exponential) leaky-integrate-and-fire neuron model, it is possible to construct a nearly equivalent GLM (as a form of spike response model (SRM) with escape noise). The sub-threshold dynamics are linearized to provide the linear filter (corresponding to the impulse response) and the reset/refractoriness part of the history filter, while the spike adaptation is captured as a slower time scale component of the history filter. Then the link function can be estimated through empirical observation, and it turns out to be close to linear. (I was totally thrown off by the notation for the probability of spiking given a membrane potential, which I first read as the marginal distribution of the model’s voltage.)
Gaussian process modulated renewal processes
Vinayak Rao, Yee Whye Teh
This is an extension of R. P. Adams, I. Murray and D. J.C. MacKay’s work, which was on Poisson intensity estimation, to hazard-rate-modulated renewal processes. The basic ideas are similar: use a sigmoidal link function, and use a point-process-thinning-like procedure to sample exactly.
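For readers unfamiliar with thinning, here is a minimal Python sketch of the Poisson case only, with an intensity of the form λ(t) = λ* σ(g(t)). It is not the renewal-process sampler from the paper: the latent function g is a fixed sinusoid standing in for a Gaussian process draw, and the function names and parameter values are assumptions.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sample_by_thinning(g, rate_max, t_end, rng):
    """Sample an inhomogeneous Poisson process with rate rate_max * sigmoid(g(t)).

    Candidate events come from a homogeneous Poisson process with rate
    rate_max; each candidate at time t is kept with probability sigmoid(g(t)).
    """
    events = []
    t = 0.0
    while True:
        t += rng.expovariate(rate_max)   # next candidate event
        if t > t_end:
            break
        if rng.random() < sigmoid(g(t)):
            events.append(t)
    return events

rng = random.Random(0)
t_end = 4.0 * math.pi                    # two full periods of the latent function
rate_max = 50.0

def latent(t):
    # Stand-in for a GP sample path (assumption for illustration).
    return 3.0 * math.sin(t)

events = sample_by_thinning(latent, rate_max, t_end, rng)

# Over whole periods the sigmoid averages to 1/2, so the expected count is
# rate_max * t_end / 2, roughly 314 events.
print(len(events), rate_max * t_end / 2.0)
```

Because the sigmoid bounds the acceptance probability by one, the dominating rate λ* is always valid, which is what makes the sampler exact.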
Tuesday
Learning in Hilbert vs. Banach spaces: A measure embedding viewpoint
Bharath K. Sriperumbudur, Kenji Fukumizu, Gert R. G. Lanckriet
Kernel embedding of probability distributions and the induced divergence are an emerging direction in kernel methods. The divergence is related to the Bayes risk of the Parzen window classifier in particular, and this paper extends the results to Banach spaces. For a Banach space with a norm that is uniformly Fréchet differentiable and uniformly convex, there is a semi-inner product inducing a reproducing kernel Banach space (RKBS), which has properties analogous to an RKHS. They showed that the kernel embedding is injective when the kernel is the Fourier transform of a signed measure (cf. Bochner’s theorem, which requires a positive measure for positive definiteness). The resulting divergence is not computable unless the semi-metric is of a special form, and the convergence rate turns out to be at best the same as in the RKHS case.
Modelling genetic variations with fragmentation-coagulation processes
Yee Whye Teh, Charles Blundell, Lloyd T. Elliott
Similar to Chinese restaurant process (CRP) for clustering, a temporal evolution of clusters by fragmentation (breaking a table into two tables) and coagulation (merging two tables) can be described as a Fragmentation-Coagulation Process (FCP). They show that FCP is exchangeable, reversible, and has asymptotic distribution of CRP.
Priors over recurrent continuous time processes [code]
Ardavan Saeedi, Alexandre Bouchard-Côté
This paper received the best student paper award this year, and Ardavan is only a master’s student! The problem he is interested in is a continuous time series that depends on a discrete latent state, with a partial observation process. For example, a recurrent disease with coarsely quantified states. He introduces the Gamma-exponential process, which gives a prior over infinite Markovian transition rate matrices, extends it to the hierarchical case, and shows how to do inference.
Kernel Beta process
Lu Ren, Yingjian Wang, David Dunson, Lawrence Carin (none of the authors made it to the conference)
The Beta process is a distribution over discrete random measures where each “stick” is in $[0, 1]$, but the sticks do not sum to 1 as they do in the Dirichlet process (DP). In this paper, they smooth the sticks in relation to covariates through a kernel, such that their heights are correlated. The kernel here does not have to be positive definite; it only has to be a bounded positive function (like a pdf).
I’m curious if a similar approach can be taken for DP. This was originally done in similar fashion for DP by Dunson and Park (2008) (‘kernel stick breaking process’).
Sparse estimation with structured dictionaries
David Wipf
Given an ill-posed problem $y = \Phi x$, where the dictionary $\Phi$ and the observation $y$ are known, under a sparsity assumption this can be solved with $\ell_1$ regularization when $\Phi$ is incoherent (roughly independent columns). However, when the dictionary is more structured, it can cause problems. This paper alleviates this problem by transforming the sparse variables, which effectively re-normalizes them. It turns out the solution is similar to iteratively reweighted $\ell_1$ with a different penalty. [Workshop version recording]
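As a baseline for what plain $\ell_1$ regularization does in the easy, incoherent case, here is a minimal Python sketch using ISTA (iterative soft-thresholding) on a small random Gaussian dictionary. This is not the method from the paper; the problem sizes, the regularization weight and all function names are assumptions chosen for illustration.

```python
import math
import random

def matvec(a, x):
    return [sum(p * q for p, q in zip(row, x)) for row in a]

def soft_threshold(v, t):
    return math.copysign(max(abs(v) - t, 0.0), v)

def largest_eigenvalue(a, a_t, n_iter=100):
    """Power iteration estimate of the largest eigenvalue of A^T A."""
    v = [1.0] * len(a_t)
    norm = 1.0
    for _ in range(n_iter):
        w = matvec(a_t, matvec(a, v))
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return norm

def ista(a, y, lam, n_iter=1500):
    """Minimize 0.5 * ||A x - y||^2 + lam * ||x||_1 by iterative soft-thresholding."""
    a_t = [list(col) for col in zip(*a)]
    lipschitz = 1.05 * largest_eigenvalue(a, a_t)   # small safety margin
    x = [0.0] * len(a_t)
    for _ in range(n_iter):
        residual = [p - q for p, q in zip(matvec(a, x), y)]
        grad = matvec(a_t, residual)
        x = [soft_threshold(xi - gi / lipschitz, lam / lipschitz)
             for xi, gi in zip(x, grad)]
    return x

random.seed(0)
m, n, k = 25, 40, 2   # observations, dictionary atoms, non-zeros (all assumed)
phi = [[random.gauss(0.0, 1.0) / math.sqrt(m) for _ in range(n)] for _ in range(m)]
support = sorted(random.sample(range(n), k))
x_true = [0.0] * n
for j in support:
    x_true[j] = random.choice([-1.0, 1.0]) * random.uniform(1.0, 2.0)
y = matvec(phi, x_true)

x_hat = ista(phi, y, lam=0.05)
top = sorted(sorted(range(n), key=lambda j: -abs(x_hat[j]))[:k])
rel_err = (math.sqrt(sum((a - b) ** 2 for a, b in zip(x_hat, x_true)))
           / math.sqrt(sum(b * b for b in x_true)))
print("true support:", support, "recovered:", top, "relative error:", round(rel_err, 3))
```

A random Gaussian dictionary has nearly uncorrelated columns, which is the regime where this simple approach is expected to work; a structured dictionary with highly correlated columns is where it degrades.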
Sequence learning with hidden units in spiking neural networks
Johanni Brea, Walter Sen, Jean-Pascal Pfister
Given a point process, the problem is to train a spiking neural network composed of GLM units (including hidden units) that would generate the training patterns. Minimization of the KL divergence between the given point process and the one parameterized by the GLM is done by online gradient descent. The gradient requires marginalization over the spikes of the hidden units, so they developed an importance sampling scheme where the samples from the hidden units are obtained given the training spikes. The resulting training rule is Hebbian, and analogous to STDP. The results are shown for the case where the given distribution is a delta, that is, when the network has to produce exactly one pattern, and that pattern only.
Wednesday
I presented Bayesian Spike Triggered Covariance analysis as a poster.
Thursday
Empirical models of spiking in neural populations
Jakob H. Macke, Lars Büsing, John P. Cunningham, Byron M. Yu, Krishna V. Shenoy, Maneesh Sahani
A comparison study between a coupled GLM and a latent variable model (Poisson linear dynamical system) for fitting motor cortex observations (preparation phase only). While the GLM explicitly allows only coupling through the spiking history of the population, the latent variable model allows a low dimensional hidden common input source with linear dynamics. They show that the latent variable model fits better and could reconstruct the cross-correlations while the GLM couldn’t. There was quite a bit of discussion on the floor after the oral presentation. The difference in performance was probably due to (1) the relatively large bin size (10 ms), and (2) the neurons being recorded by a Utah array, which means a low probability of direct connectivity. The coupled GLM was successfully applied to the retina, where the coupling is local and the neurons were very densely sampled, with a 0.1 ms bin size. It would be interesting to see further developments of latent variable models and GLMs in modeling such motor system data.
Workshops
Hierarchical algorithms for χ-armed bandits
Rémi Munos
This was a non-Bayesian invited talk for the Bayesian optimization, experimental design and bandits workshop. He talked about his paper for the main conference, “Optimistic Optimization of a Deterministic Function without the Knowledge of its Smoothness”. In this case, the smoothness assumption comes from $f(x^*) - f(x) \le \ell(x, x^*)$, where $\ell$ is a semi-metric and $x^*$ is a global maximizer, and hence the function is bounded from below around the global maximum by the semi-metric. Then, using a hierarchical partitioning of the input space that respects the semi-metric, one can get a bound on the function. (This assumption is neither weaker nor stronger than Lipschitz continuity, since the absolute value is missing and it is only from the maximum.) When the knowledge of the semi-metric is perfect, the convergence rate of the simple regret (best function value) can be exponential (depending on the semi-metric; multiple semi-metrics can give the bound, but the convergence rate can differ). When the semi-metric is unknown and one overestimates the exponent, for example, global convergence is not guaranteed.
Dynamic Batch Bayesian Optimization
Javad Azimi, Ali Jalali, Xiaoli Fern
When parallel experiments are possible, experimental design with batch sampling can improve the efficiency, but sequential design often performs better than batch design. Under the assumption that the maximum of the function has a known bound, and using the GP predictive covariance, they choose a set of points that are loosely independent, and could improve the criterion.
Future information minimization as PAC Bayes regularization in Reinforcement Learning
Naftali Tishby
This was the last invited talk for the New frontiers in model order selection workshop. Tishby talked about reinforcement learning in a POMDP setup, but I couldn’t fully follow (in fact it went over my head mostly). In a perception-action cycle, the Bellman equation describes the world evolution and associated reward, and he describes a counterpart for the agent (mental state?) using an associated Bellman equation with information-to-go (mutual information with respect to a goal). Then he describes reinforcement learning as a coding problem (relating to Kraft’s inequality and the fact that a subtree of an optimal coding tree is an optimal coding tree). At some point, he reaches a PAC-Bayesian bound, and claims that reinforcement learning self-regularizes.
Between the philosophy of science and machine learning
David Corfield [U of Kent]
This was the first invited talk for the Philosophy and machine learning workshop. He talked about a broad range of philosophers (of science) and a couple of examples of their interaction with ML. The first example was Karl Popper’s idea of the complexity of a theory in terms of falsifiable dimensions and its similarity to the VC dimension (see their paper in 2009 for details). The second example was Judea Pearl’s use of counterfactuals (by David Lewis), and its impact on the philosophy of science. He talked about what kinds of sciences can benefit from ML, certainly the ones with lots of data. He also went through many philosophers’ ideas, including those of Popper, Carnap, Kuhn, Lakatos, and Feyerabend. It is certainly a very fascinating area, but my impression was that we don’t have much to talk about yet.
[Google is hosting some workshop videos freely online] [Hal Daumé III's blogpost on NIPS 2011]
Bayesian Spike Triggered Covariance Analysis
A widely used tool in neural characterization, where one is interested in the stimulus (or behavior) features that a neuron is sensitive to, is spike triggered averaging (STA) or otherwise known as reverse correlation analysis [Dayan & Abbott]. At the occurrence of each spike, one averages the stimulus in a window time locked relative to the spike timing, that potentially causes the spike (or behavior that is caused by the spike) to obtain STA.
It essentially estimates the first order Volterra expansion of the neural response function, that is, it approximates a neuron as a linear system. Although a neuron is not really a linear system, STA works well in practice. Moreover, it is a consistent estimator for a linear-nonlinear Poisson (LNP) model if the stimulus is white Gaussian noise [Bussgang 1952 in Dayan & Abbott]. In [Paninski 2003] this condition is extended to an arbitrary radially symmetric stimulus that induces a non-zero mean response.
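A minimal simulation sketch of STA, assuming a hypothetical 6-dimensional filter, an exponential nonlinearity, and binarized Poisson spiking per stimulus frame (all choices made for illustration, none from the paper): with white Gaussian stimuli, the spike triggered average should point along the true filter.

```python
import math
import random

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def simulate_lnp(w, n_samples, rng, offset=-2.0):
    """White Gaussian stimuli driving rate = exp(w . x + offset).

    A frame is marked as a spike if a Poisson count with that rate would be
    at least one, i.e. with probability 1 - exp(-rate).
    """
    stimuli, spikes = [], []
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0) for _ in w]
        rate = math.exp(dot(w, x) + offset)
        stimuli.append(x)
        spikes.append(rng.random() < 1.0 - math.exp(-rate))
    return stimuli, spikes

def spike_triggered_average(stimuli, spikes):
    """Mean of the stimuli that were followed by a spike."""
    triggered = [x for x, s in zip(stimuli, spikes) if s]
    dim = len(stimuli[0])
    return [sum(x[i] for x in triggered) / len(triggered) for i in range(dim)]

rng = random.Random(0)
w_true = [0.6, -0.8, 0.0, 0.0, 0.0, 0.0]          # unit-norm filter (assumed)
stimuli, spikes = simulate_lnp(w_true, 20000, rng)
sta = spike_triggered_average(stimuli, spikes)

cosine = dot(sta, w_true) / math.sqrt(dot(sta, sta) * dot(w_true, w_true))
print("STA:", [round(v, 2) for v in sta])
print("cosine similarity with the true filter:", round(cosine, 3))
```

Only the direction is recovered; the overall scale of the STA depends on the nonlinearity.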
When the neuron’s feature space is low-dimensional but not 1-dimensional, STA is not sufficient, since it recovers only a 1-dimensional subspace. Spike triggered covariance (STC) is an extension of STA that can consistently estimate the filters of a multi-dimensional LNP model [Paninski 2003]. Let us denote the zero-mean stimulus distribution as $P(\mathbf{x})$, and the spike triggered distribution as $P(\mathbf{x} \mid \mathrm{spike})$. Then, STA is the mean of $P(\mathbf{x} \mid \mathrm{spike})$ (an empirical estimate of $E[\mathbf{x} \mid \mathrm{spike}]$), and STC is the eigenvectors of the covariance matrix of $P(\mathbf{x} \mid \mathrm{spike})$. STC is a consistent estimator only when the stimulus distribution is Gaussian [for details, see Paninski 2003].
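The case where STA fails but STC succeeds can be sketched with a simulated neuron whose rate depends on the squared filter output, so that positive and negative projections are equally likely to cause a spike. This is a minimal Python illustration under assumed parameters (6-dimensional white Gaussian stimuli, a hypothetical unit-norm filter, binarized Poisson spiking), with the leading eigenvector found by power iteration rather than a full eigendecomposition.

```python
import math
import random

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def spike_triggered_stimuli(w, n_samples, rng, gain=0.2):
    """Stimuli that elicited a spike for rate = gain * (w . x)^2."""
    triggered = []
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0) for _ in w]
        rate = gain * dot(w, x) ** 2
        if rng.random() < 1.0 - math.exp(-rate):
            triggered.append(x)
    return triggered

def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def covariance(rows, mean):
    dim = len(mean)
    cov = [[0.0] * dim for _ in range(dim)]
    for x in rows:
        c = [xi - mi for xi, mi in zip(x, mean)]
        for i in range(dim):
            for j in range(dim):
                cov[i][j] += c[i] * c[j]
    return [[v / (len(rows) - 1) for v in row] for row in cov]

def top_eigenvector(matrix, rng, n_iter=200):
    """Power iteration for the eigenvector with the largest eigenvalue."""
    v = [rng.gauss(0.0, 1.0) for _ in matrix]
    for _ in range(n_iter):
        v = [dot(row, v) for row in matrix]
        norm = math.sqrt(dot(v, v))
        v = [c / norm for c in v]
    return v

rng = random.Random(0)
w_true = [0.6, -0.8, 0.0, 0.0, 0.0, 0.0]          # unit-norm filter (assumed)
triggered = spike_triggered_stimuli(w_true, 30000, rng)

sta = mean_vector(triggered)                       # close to zero by symmetry
stc_axis = top_eigenvector(covariance(triggered, sta), rng)

sta_norm = math.sqrt(dot(sta, sta))
alignment = abs(dot(stc_axis, w_true))             # sign of an eigenvector is arbitrary
print("STA norm:", round(sta_norm, 3), "STC alignment:", round(alignment, 3))
```

Here the spike triggered variance is inflated along the filter, so the leading eigenvector is the informative one; in general, the relevant eigenvectors are those whose eigenvalues differ from the stimulus variance in either direction.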
STA/STC are moment-based estimators and do not have an underlying probabilistic model. Analogous to how PPCA (probabilistic principal component analysis) provided a generative model for PCA, allowing Bayesian extensions of PCA, we formulate the STA/STC problem as maximum likelihood estimation in a generative model. Inspired by iSTAC [Pillow & Simoncelli 2006], we extend the LNP model (figure) with an exponentiated quadratic nonlinearity. This allows us to put priors on the features and develop Bayesian estimators. We further extend it to a general family of models that allows consistent estimation using arbitrary stimulus distributions and a flexible class of nonlinearities. This result will be presented at Neural Information Processing Systems (NIPS) 2011. If you are coming to NIPS, it’s poster W88!
- Dayan & Abbott. Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. MIT Press 2001
- Paninski. Convergence properties of three spike-triggered analysis techniques. Network: Computation in Neural Systems. v 14 p 437-464. 2003
- Pillow & Simoncelli. Dimensionality reduction in neural models: An information-theoretic generalization of spike-triggered average and covariance analysis. Journal of Vision. v 6 p 414-428. 2006






