How to win big by thinking straight about relatively trivial problems

Presentation transcript:

1 How to win big by thinking straight about relatively trivial problems
Tony Bell, University of California at Berkeley

2 Density Estimation
Make the model $q$ like the reality $p$ by minimising the Kullback-Leibler divergence
$D_{\mathrm{KL}}(p\|q) = \int p(x)\,\log\frac{p(x)}{q(x)}\,dx$,
by gradient descent in a parameter $w$ of the model:
$\Delta w \propto -\partial D_{\mathrm{KL}}/\partial w = \langle \partial \log q(x)/\partial w \rangle_p$.
THIS RESULT IS COMPLETELY GENERAL.
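A minimal numerical sketch of this claim (my own toy example; the Gaussian setup and variable names are assumptions, not from the slides). Since $D_{\mathrm{KL}}(p\|q) = \langle \log p \rangle_p - \langle \log q \rangle_p$ and the first term does not depend on the model parameter, descending the KL divergence is the same as ascending the data-averaged log-likelihood:

```python
import numpy as np

# Fit the mean of a unit-variance Gaussian model q(x) = N(mu, 1)
# to data drawn from the "reality" p(x) = N(2, 1).
# Minimising D_KL(p||q) over mu is gradient ascent on E_p[log q(x)],
# because the entropy term E_p[log p(x)] is constant in mu.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=10_000)  # samples from p

mu, lr = 0.0, 0.1
for _ in range(100):
    grad = np.mean(data - mu)  # d/dmu of E_p[log q] for a unit-variance Gaussian
    mu += lr * grad            # ascend the log-likelihood = descend the KL

print(f"learned mu = {mu:.3f} (true value 2.0)")
```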

3 The passive case ($\partial p/\partial w = 0$)
For a general model distribution written in the 'energy-based' form $q(x) = e^{-E(x)}/Z$, with energy $E(x)$ and partition function (or zeroth moment) $Z = \int e^{-E(x)}\,dx$, the gradient evaluates in the simple 'Boltzmann-like' form
$\Delta w \propto \langle -\partial E/\partial w \rangle_p - \langle -\partial E/\partial w \rangle_q$:
learn on the data while awake, unlearn on the model's own data while asleep.
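A hedged sketch of that wake/sleep gradient (a toy of my own choosing: a quadratic energy, so exact 'sleep' samples stand in for MCMC):

```python
import numpy as np

# Energy-based model q(x) = exp(-E(x)) / Z with E(x) = w * x**2 / 2,
# i.e. a zero-mean Gaussian with precision w, so samples from q can be
# drawn exactly for the 'sleep' phase.
rng = np.random.default_rng(0)
data = rng.normal(0.0, 0.5, size=10_000)  # 'reality' p: std 0.5, precision 4

w, lr = 1.0, 0.5
for _ in range(500):
    wake = np.mean(data**2 / 2)                               # <dE/dw> under p
    sleep_x = rng.normal(0.0, 1.0 / np.sqrt(w), size=10_000)  # samples from q
    sleep = np.mean(sleep_x**2 / 2)                           # <dE/dw> under q
    w += lr * (sleep - wake)  # learn on data while awake, unlearn while asleep

print(f"learned precision w = {w:.2f} (target 4.0)")
```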

4 The single-layer case
Many problems are solved by modeling in the transformed space $u = Wx$: a linear transform followed by density shaping. The learning rule (natural gradient), valid for a non-loopy hypergraph, is
$\Delta W \propto (I - \varphi(u)\,u^{\mathsf T})\,W$, where $\varphi(u) = -\partial \log q(u)/\partial u$.
The score function $\varphi$ is the important quantity.
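A sketch of that rule (assuming the standard natural-gradient ICA update with a tanh score function for a sparse prior; the slide itself gives no code):

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 2, 20_000

# Two independent sparse (Laplacian) sources, linearly mixed.
s = rng.laplace(size=(n, T))
A = rng.normal(size=(n, n))          # unknown mixing matrix
x = A @ s                            # observed data

W = np.eye(n)                        # unmixing matrix to learn
lr, batch = 0.01, 256
for _ in range(2000):
    idx = rng.integers(T, size=batch)
    u = W @ x[:, idx]                # candidate sources
    phi = np.tanh(u)                 # score -d log q(u)/du for a sparse prior
    # Natural-gradient rule: dW = (I - <phi(u) u^T>) W
    W += lr * (np.eye(n) - (phi @ u.T) / batch) @ W

# Rows of W A should each have one dominant entry
# (sources recovered up to permutation and scale).
print(np.round(W @ A, 2))
```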

5 Conditional Density Modeling
To model a conditional density $q(x|y)$, use the same density-estimation rules. This little-known fact has hardly ever been exploited. It can be used in place of regression everywhere.
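One concrete reading of this (a hypothetical sketch; the slide's equations are images, so the ratio rule $q(x|y) = q(x,y)/q(y)$ and the kernel-density stand-ins are my assumptions):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Synthetic data with a nonlinear relationship: x = sin(y) + noise.
y = rng.uniform(-3, 3, size=5_000)
x = np.sin(y) + 0.2 * rng.normal(size=y.size)

kde_joint = gaussian_kde(np.vstack([x, y]))  # model of q(x, y)
kde_y = gaussian_kde(y)                      # model of q(y)

def conditional_density(x_grid, y0):
    """q(x | y0) = q(x, y0) / q(y0)."""
    pts = np.vstack([x_grid, np.full_like(x_grid, y0)])
    return kde_joint(pts) / kde_y(y0)

x_grid = np.linspace(-2, 2, 201)
q = conditional_density(x_grid, y0=1.0)
print(f"mode of q(x | y=1) at x = {x_grid[np.argmax(q)]:.2f} "
      f"(sin(1) = {np.sin(1):.2f})")
```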

6 ICA, ISA, IVA, DCA: Independent Components, Subspaces and Vectors
(ie: the score function is hard to get at due to Z)

7 IVA used for audio separation in a real room:

8 Score functions derived from sparse factorial and radial densities:
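A short sketch of the two families (my notation): a sparse factorial density $q(u) \propto \prod_i e^{-|u_i|}$ gives an elementwise score, while a radial density $q(u) \propto e^{-\|u\|}$ gives a score that couples all components of a vector, which is the IVA case:

```python
import numpy as np

def score_factorial_laplace(u):
    """Score -d log q/du for a factorial Laplacian q(u) ~ prod_i exp(-|u_i|):
    acts on each component independently."""
    return np.sign(u)

def score_radial(u, axis=0, eps=1e-12):
    """Score for a radial density q(u) ~ exp(-||u||): depends only on the
    amplitude of the whole vector, so components stay coupled."""
    norm = np.sqrt(np.sum(u**2, axis=axis, keepdims=True))
    return u / (norm + eps)

u = np.array([[0.5], [-2.0]])
print(score_factorial_laplace(u).ravel())  # [ 1. -1.]  componentwise
print(score_radial(u).ravel())             # u / ||u||  vector-coupled
```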

9 Results on real-room source separation:

10 Why does IVA work on this problem?
Because the score function, and thus the learning, is sensitive only to the amplitude of the complex vectors, which represent correlations between the amplitudes of the frequency components associated with a single speaker. Arbitrary dependencies can exist between the phases of such a vector. Thus all phase structure (ie: higher-order statistical structure) is confined within each vector and removed between vectors; the short check below illustrates this. It's a simple trick, just relaxing the independence assumptions in a way that fits speech. But we can do much more:
- build conditional models across frequency components
- make models for data that is even more structured:
  Video is [time x space x colour]
  Many experiments are [time x sensor x task-condition x trial]
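A quick numerical check of the amplitude-only claim (an illustration of my own construction, not the slide's code): with a radial score, multiplying each frequency component by an arbitrary phase leaves the learning signal's amplitudes unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=4) + 1j * rng.normal(size=4)  # one complex frequency vector

def radial_score(u):
    """IVA-style score: depends only on the vector's amplitude ||u||."""
    return u / np.sqrt(np.sum(np.abs(u) ** 2))

phases = np.exp(1j * rng.uniform(0, 2 * np.pi, size=4))  # arbitrary phases
print(np.allclose(np.abs(radial_score(u)), np.abs(radial_score(phases * u))))
# True: the per-component amplitudes of the score are phase-invariant
```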

11-18 Figure slides (images only; no transcript text).

19 The big picture
Behind this effort is an attempt to explore something called "The Levels Hypothesis": the idea that in biology, in the brain, in nature, there is a kind of density estimation taking place across scales. To explore this idea, we have a twofold strategy:
1. EMPIRICAL/DATA ANALYSIS: build algorithms that can probe the EEG across scales, ie: across frequencies.
2. THEORETICAL: formalise mathematically the learning process in such systems.

20 A Multi-Level View of Learning

LEVEL      UNIT         DYNAMICS (INTERACTIONS)                                LEARNING
society    organism     behaviour: ecology, predation, symbiosis               natural selection, sensory-motor learning
organism   cell         spikes                                                 synaptic plasticity (= STDP)
cell       protein      molecular forces: direct, voltage, Ca, 2nd messenger   gene expression, protein recycling
protein    amino acid   molecular forces                                       molecular change

(Increasing timescale.)
LEARNING at a LEVEL is CHANGE IN INTERACTIONS between its UNITS, implemented by INTERACTIONS at the LEVEL beneath, and by extension resulting in CHANGE IN LEARNING at the LEVEL above. Interactions are fast; learning is slow. This separation of timescales allows INTERACTIONS at one LEVEL to be LEARNING at the LEVEL above.

21 Infomax between Layers vs. Infomax between Levels

Infomax between Layers (eg: V1 density-estimates the Retina):
- x = retina, y = V1
- square (in the ICA formalism), feedforward
- information flows within a level
- predicts independent activity; only models outside input
- the pdf modelled is that of all spike times

Infomax between Levels (eg: synapses density-estimate spikes):
- x = all neural spikes, y = all synaptic readout (synapses, dendrites, synaptic weights)
- overcomplete, includes all feedback
- information flows between levels
- arbitrary dependencies; models input and intrinsic activity
- the pdf modelled is that of all synaptic 'readouts'

This SHIFT in looking at the problem alters the question so that, if it is answered, we have an unsupervised theory of 'whole brain learning'. If we can make this pdf uniform, then we have a model constructed from all synaptic and dendritic causality.

22 Formalisation of the problem
IF p is the 'data' distribution, q is the 'model' distribution, w is a synaptic weight, and I(y,t) is the spike-synapse mutual information,
THEN if we were doing classical Infomax, we would use the gradient (1), taken with the data distribution held fixed.
BUT if one's actions can change the data, THEN an extra term appears, giving the gradient (2): change the world to fit the model, as well as changing one's model to fit the world.
It is easier to live in a world where one can do both, therefore (2) must be easier than (1). This is what we are now researching.
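A hedged reconstruction of the two gradients (the slide's equations are images, so this notation, including the functional-derivative form of the extra term, is an assumption):

```latex
% (1) Classical Infomax: the data distribution p is held fixed,
%     only the model depends on the weight w
\Delta w \;\propto\; \left.\frac{\partial I(y,t)}{\partial w}\right|_{p\ \text{fixed}}

% (2) Active case: the agent's actions also reshape p,
%     so the total derivative picks up an extra term
\frac{dI(y,t)}{dw} \;=\;
  \left.\frac{\partial I(y,t)}{\partial w}\right|_{p\ \text{fixed}}
  \;+\; \int \frac{\partial p(x)}{\partial w}\,
             \frac{\delta I(y,t)}{\delta p(x)}\,dx
```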

