1
Inductive Transfer, Machine Lifelong Learning, and AGI
Daniel L. Silver … with input from Ryan Poirier, Duane Currie, Liangliang Tu, and Ben Fowler. Acadia University, Wolfville, NS, Canada
2
Position: It is now appropriate to seriously consider the nature of systems that learn over a lifetime.
Motivation / rationale:
- A body of related work on which to build
- The power and low cost of modern computers
- The challenges and benefits of this research to the areas of AI, brain sciences, and human learning
3
Machine Lifelong Learning
Example: learning to learn how to transform images. Requires methods of efficiently and effectively:
- Retaining transform model knowledge
- Using this knowledge to learn new transforms
4
AAAI 2013 Spring Symposium Lifelong Machine Learning March 25–27, 2013
Stanford University, Stanford, CA
5
Outline
- Machine Learning and Inductive Transfer
- Machine Lifelong Learning and Related Work
- Context-sensitive Multiple Task Learning
- Challenges and Benefits for AGI
6
What is Learning? Inductive inference/modeling
Developing a general model/hypothesis from examples. It's like fitting a curve to data – also considered modeling the data (statistical modeling).
7
What is Learning?
Learning requires an inductive bias – a heuristic that goes beyond the data. Do you know any inductive biases? How do you choose which to use?
8
Inductive Bias
[Figure: street map with ASH ST, FIR ST, ELM ST, PINE ST, OAK ST and SECOND/THIRD cross streets, illustrating how prior knowledge guides route finding]
Human learners use inductive bias. Inductive bias depends upon:
- Having prior knowledge
- Selection of the most related knowledge
9
Inductive Biases
- Universal heuristics – Occam's Razor
- Knowledge of intended use – medical diagnosis
- Knowledge of the source – a teacher
- Knowledge of the task domain
- Analogy with previously learned tasks
10
What is Machine Learning?
The study of how to build computer programs that:
- Improve with experience
- Generalize from examples
- Self-program, to some extent
11
History of Machine Learning
[Patterson, D., Artificial Neural Networks: Theory and Applications, 1996, Figure 1.10 p13]
12
Classes of ML Methods
- Supervised – develops models that predict the value of one variable from one or more others: artificial neural networks, inductive decision trees, genetic algorithms, k-nearest neighbour, Bayesian networks, support vector machines
- Unsupervised – generates groups or clusters of data that share similar features: k-means, self-organizing feature maps
- Reinforcement learning – develops models from the results of a final outcome, e.g. win/loss of a game: TD-learning, Q-learning
13
Supervised Machine Learning Framework
[Diagram: training examples (x, f(x)) drawn from instance space X feed an inductive learning system, which produces a model/classifier h; testing examples are used to verify that h(x) ≈ f(x)]
14
Supervised Machine Learning
Problem: We wish to learn to classify two people (A and B) based on their keyboard typing.
Approach:
- Acquire lots of typing examples from each person
- Extract relevant features – representation!
- Transform the feature representation as needed
- Use an algorithm to fit a model to the data – search!
- Test the model on an independent set of typing examples from each person
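As a rough illustration of this pipeline (not the actual experiment from the talk), the sketch below fits a logistic regression classifier to two invented typing features – typing speed and mistake rate – and tests it on held-out examples. All data values, feature names, and sizes are placeholders.

```python
# Hypothetical sketch of the typing-classification pipeline using scikit-learn.
# Feature values are invented; only typing speed and mistake rate are used.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Acquire examples: simulate typing sessions for person A and person B.
speed_a = rng.normal(70, 5, 100)      # words per minute
mistakes_a = rng.normal(3, 1, 100)    # mistakes per 100 words
speed_b = rng.normal(55, 5, 100)
mistakes_b = rng.normal(6, 1, 100)

# Extract/represent features as an (n_samples, 2) matrix.
X = np.column_stack([np.concatenate([speed_a, speed_b]),
                     np.concatenate([mistakes_a, mistakes_b])])
y = np.array(["A"] * 100 + ["B"] * 100)

# Hold out an independent test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit a model (search for parameters that explain the training data).
model = LogisticRegression().fit(X_train, y_train)

# Test on unseen typing examples.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```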
15
Classification: Logistic Regression
Y = f(M, T), where M is mistakes and T is typing speed.
[Figure: scatter plot of A and B examples over mistakes vs. typing speed, with a logistic regression decision boundary separating the two classes]
16
Classification: Artificial Neural Network
[Figure: a small neural network with inputs M (mistakes) and T (typing speed) and output Y = f(M, T), shown over the mistakes vs. typing speed scatter plot]
17
Classification: Inductive Decision Tree
[Figure: a decision tree with a root node splitting on the typing features and leaf nodes labelled A or B, shown beside the mistakes vs. typing speed scatter plot of A and B examples]
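For comparison, a minimal decision-tree version of the same hypothetical typing task, again with invented data and a shallow tree so the learned splits are easy to read:

```python
# Hedged sketch: the same two-feature typing task fit with a shallow decision tree.
# Data values are invented for illustration.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
speed = np.concatenate([rng.normal(70, 5, 100), rng.normal(55, 5, 100)])   # words per minute
mistakes = np.concatenate([rng.normal(3, 1, 100), rng.normal(6, 1, 100)])  # mistakes per 100 words
X = np.column_stack([speed, mistakes])
y = np.array(["A"] * 100 + ["B"] * 100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
tree = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)

print(export_text(tree, feature_names=["typing_speed", "mistakes"]))  # the learned splits
print("test accuracy:", tree.score(X_test, y_test))
```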
18
Supervised Machine Learning Framework
[Diagram repeated: training examples (x, f(x)) from instance space X feed an inductive learning system that produces a model/classifier h, which is checked on testing examples so that h(x) ≈ f(x)]
After the model is developed and used, it is thrown away.
19
Machine Lifelong Learning (ML3)
Considers methods of retaining and using learned knowledge to improve the effectiveness and efficiency of future learning.
We investigate systems that must learn:
- From impoverished training sets
- For diverse domains of related/unrelated tasks
- Where practice of the same task is possible
Applications: intelligent agents, robotics, user modeling, DM
20
Machine Lifelong Learning Framework
[Diagram: the supervised learning framework extended with a long-term memory of domain knowledge; knowledge transfer and inductive bias selection feed the short-term inductive learning system, and retention & consolidation move the new model of classifier h back into long-term memory]
21
Related Work Michalski (1980s) Utgoff and Mitchell (1983)
Michalski (1980s): Constructive inductive learning
- Principle: new knowledge is easier to induce if search is done using the correct representation
- Two interrelated searches during learning: search for the best representational space for hypotheses, and search for the best hypothesis within the current representational space
Utgoff and Mitchell (1983)
- Importance of inductive bias to learning – systems should be able to search for an appropriate inductive bias using prior knowledge
- Proposed that a system should be able to shift its bias by adjusting the operations of the modeling language
22
Related Work Solomonoff (1989) Thrun and Mitchell (1990s)
Solomonoff (1989): Incremental learning
- System primed on a small, incomplete set of primitive concepts; first learns to express the solutions to a set of simple problems
- Then given more difficult problems and, if necessary, additional primitive concepts, etc.
Thrun and Mitchell (1990s): Explanation-based neural networks (EBNN)
- Transfers knowledge across multiple learning tasks
- Uses domain knowledge of previous learning tasks (back-propagation gradients) to guide the development of a new task
23
Related Work Ring (1997) Rivest and Schultz (late 1990s)
Ring (1997): Continual learning – CHILD
- Builds more complicated hypotheses on top of those already developed, both incrementally and hierarchically, using reinforcement learning methods
Rivest and Schultz (late 1990s): Knowledge-based cascade-correlation neural networks
- Extends the original cascade-correlation approach
- Selects previously learned sub-networks as well as simple hidden units
- Uses past learning to bias new learning (knowledge transfer)
24
Related Work Hinton and Bengio (2007+) Carlson et al (2010)
Hinton and Bengio (2007+): Learning of deep architectures of neural networks
- Layered networks of unsupervised auto-encoders efficiently develop hierarchies of features that capture regularities in their respective inputs
- Used to develop models for families of tasks
Carlson et al. (2010): Never-ending language learner
- Each day: extracts information from the web to populate a growing knowledge base, and learns to perform this task better than on the previous day
- Uses an MTL approach in which a large number of different semantic functions are trained together
25
Deep Learning Architectures
Consider the problem of trying to classify these hand-written digits.
26
Deep Learning Architectures
[Diagram: a deep network for 28×28-pixel images of the digits 0–9 – a 500-neuron layer of low-level features, a 500-neuron layer of higher-level features, and 2000 top-level artificial neurons connected to the 10 digit labels]
Neural network trained on 40,000 examples. Learns to label/recognize images and to generate images from labels; probabilistic in nature. (Demo)
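For something concrete, here is a minimal sketch of a feed-forward classifier with the same layer sizes as the diagram (500, 500, 2000 units). It is a plain deterministic network trained with labels, not the probabilistic generative model used in the demo; PyTorch and the dummy batch are purely illustrative.

```python
# Hedged sketch: a small fully connected digit classifier with layer sizes
# echoing the slide's diagram. Not the probabilistic deep belief network of the demo.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Flatten(),                # 28x28 image -> 784-dim vector
    nn.Linear(28 * 28, 500),     # low-level features
    nn.ReLU(),
    nn.Linear(500, 500),         # higher-level features
    nn.ReLU(),
    nn.Linear(500, 2000),        # top-level representation
    nn.ReLU(),
    nn.Linear(2000, 10),         # one output per digit label 0-9
)

x = torch.randn(32, 1, 28, 28)   # a batch of dummy images
logits = model(x)                # shape: (32, 10)
print(logits.shape)
```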
27
Related Work Transfer Learning Workshops and Competitions
28
Power of Modern Computers
Moore's Law is expected to accelerate as the power of computers moves to a log scale with the use of multiple processing cores. ML3 systems that focus on a constrained domain of tasks (e.g. medical diagnosis) are computationally tractable and practical now!
29
Power of Modern Computers
IBM's Watson – Jeopardy!, February 2011: a massively parallel data-processing system capable of competing with humans in real-time question answering.
- 90 IBM Power 7 servers, each with four 8-core processors
- 15 TB (220M text pages) of RAM
- Tasks divided into thousands of stand-alone jobs distributed across roughly 80 teraflops of processing power (1 teraflop = 1 trillion ops/sec)
- Uses a variety of AI approaches, including machine learning methods
30
Power of Modern Computers
Andrew Ng's work on deep learning networks (ICML 2012)
- Problem: learn to recognize human faces, cats, etc. from unlabeled data
- Dataset of 10 million images, each 200×200 pixels
- 9-layer locally connected neural network (1B connections)
- Parallel algorithm: 1,000 machines (16,000 cores) for three days
Quoc V. Le, Marc'Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg S. Corrado, Jeffrey Dean, and Andrew Y. Ng. "Building High-level Features Using Large Scale Unsupervised Learning." ICML 2012: 29th International Conference on Machine Learning, Edinburgh, Scotland, June 2012.
31
Power of Modern Computers
Results: a face detector that is 81.7% accurate and robust to translation, scaling, and rotation.
Further results: 15.8% accuracy in recognizing 20,000 object categories from ImageNet – a 70% relative improvement over the previous state of the art.
32
Machine Lifelong Learning Framework
[Diagram repeated: the lifelong learning framework with a long-term domain knowledge memory, knowledge transfer and inductive bias selection into the short-term inductive learning system, and retention & consolidation of the resulting model]
33
Machine Lifelong Learning One Implementation
[Diagram: one implementation – a consolidated MTL network acts as long-term domain knowledge memory holding tasks f1(x) … fk(x); knowledge transfer and inductive bias selection prime a short-term multiple task learning (MTL) network that learns the new task from training examples, and retention & consolidation move the new model back into the consolidated network]
34
Multiple Task Learning (MTL)
Multiple hypotheses develop in parallel within one back-propagation network [Caruana, Baxter 93–95]. An inductive bias occurs through the shared use of a common internal representation. Knowledge (inductive) transfer to the primary task f1(x) depends on the choice of secondary tasks.
[Diagram: inputs x1 … xn feed a common feature layer (the common internal representation), which feeds task-specific output representations for f1(x), f2(x), …, fk(x) [Caruana, Baxter]]
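A minimal sketch of the MTL idea, assuming placeholder sizes and dummy data (this is not the authors' code): one shared hidden layer provides the common internal representation, each task has its own output head, and training all heads together lets every task shape the shared weights.

```python
# Hedged sketch of multiple task learning (MTL): one shared hidden layer,
# one output head per task; all tasks train together through the shared weights.
import torch
import torch.nn as nn

n_inputs, n_hidden, n_tasks = 10, 20, 5

shared = nn.Sequential(nn.Linear(n_inputs, n_hidden), nn.Sigmoid())      # common internal representation
heads = nn.ModuleList([nn.Linear(n_hidden, 1) for _ in range(n_tasks)])  # task-specific outputs f1..fk

opt = torch.optim.SGD(list(shared.parameters()) + list(heads.parameters()), lr=0.1)
loss_fn = nn.BCEWithLogitsLoss()

# Dummy data: same inputs for every task, one binary target column per task.
X = torch.randn(100, n_inputs)
Y = torch.randint(0, 2, (100, n_tasks)).float()

for epoch in range(200):
    h = shared(X)
    loss = sum(loss_fn(head(h).squeeze(1), Y[:, k]) for k, head in enumerate(heads))
    opt.zero_grad()
    loss.backward()
    opt.step()
```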
35
Machine Lifelong Learning via MTL & Task Rehearsal
Rehearsal of virtual examples for prior tasks y2–y6 ensures knowledge retention; virtual examples from related prior tasks provide knowledge transfer. The approach uses lots of internal representation, a rich set of virtual training examples, a small learning rate (slow learning), and a validation set to prevent the growth of high-magnitude weights [Poirier04].
[Diagram: a short-term learning network for f1(x) alongside the long-term consolidated domain knowledge network with outputs y2 … y6, both over inputs x1 … xn; virtual examples of f1(x) are generated for long-term consolidation]
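The sketch below illustrates task rehearsal under simplifying assumptions of my own (a single consolidated feed-forward network, mean-squared-error training, randomly drawn inputs for the virtual examples). It is not the authors' consolidation system, only the core idea: rehearse functional knowledge of prior tasks while a new task is consolidated.

```python
# Hedged sketch of task rehearsal for consolidation. Prior-task knowledge is
# retained by rehearsing "virtual examples": inputs paired with the targets the
# existing consolidated network produces for the prior tasks.
import torch
import torch.nn as nn

n_inputs, n_hidden, n_prior_tasks = 10, 20, 5

# Existing consolidated network for prior tasks (weights assumed already trained).
consolidated = nn.Sequential(nn.Linear(n_inputs, n_hidden), nn.Sigmoid(),
                             nn.Linear(n_hidden, n_prior_tasks))

# Generate virtual examples: random (or stored) inputs plus the frozen network's outputs.
with torch.no_grad():
    X_virtual = torch.randn(500, n_inputs)
    Y_virtual = torch.sigmoid(consolidated(X_virtual))

# New consolidated network with an extra output for the new task f1.
new_consolidated = nn.Sequential(nn.Linear(n_inputs, n_hidden), nn.Sigmoid(),
                                 nn.Linear(n_hidden, n_prior_tasks + 1))

X_new = torch.randn(30, n_inputs)               # small real training set for f1
Y_new = torch.randint(0, 2, (30, 1)).float()

opt = torch.optim.SGD(new_consolidated.parameters(), lr=0.05)
loss_fn = nn.MSELoss()

for epoch in range(500):
    out_new = torch.sigmoid(new_consolidated(X_new))
    out_virt = torch.sigmoid(new_consolidated(X_virtual))
    # New-task loss on real examples + rehearsal loss on virtual examples of prior tasks.
    loss = loss_fn(out_new[:, -1:], Y_new) + loss_fn(out_virt[:, :n_prior_tasks], Y_virtual)
    opt.zero_grad()
    loss.backward()
    opt.step()
```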
36
Lifelong Learning with MTL
[Figure: mean percent misclassification results for the Band domain, the Logic domain, and the Coronary Artery Disease domain, with conditions labelled A–D]
37
An Environmental Example
Stream flow rate prediction [Lisa Gaudette, 2006]: x = weather data, f(x) = flow rate.
38
Limitations of MTL for ML3
Problems with multiple outputs:
- Training examples must have matching target values
- Redundant representation
- Frustrates practice of a task
- Prevents a fluid development of domain knowledge
- No way to naturally associate examples with tasks
Inductive transfer is limited to the sharing of hidden-node weights and relies on selecting related secondary tasks.
[Diagram repeated: the MTL network with inputs x1 … xn, a common feature layer, and task-specific outputs f1(x), f2(x), …, fk(x)]
39
Context Sensitive MTL (csMTL)
We have developed an alternative approach that is meant to overcome these limitations:
- Uses a single-output neural network structure
- Context inputs associate an example with a task
- All weights are shared – the focus shifts from learning separate tasks to learning a domain of tasks
- Conjecture: no measure of task relatedness is required
[Diagram: primary inputs x1 … xn and context inputs c1 … ck feed one output for all tasks, y' = f'(c, x)]
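A minimal sketch of the csMTL encoding and network, with invented sizes and dummy data (not the thesis implementation): each example's primary inputs are concatenated with a one-hot context vector naming its task, and a single-output network is trained on the pooled examples from all tasks.

```python
# Hedged sketch of context-sensitive MTL (csMTL): one-hot task context inputs are
# concatenated with the primary inputs, and a single-output network is trained on
# examples pooled across all tasks.
import torch
import torch.nn as nn

n_primary, n_tasks, n_hidden = 10, 5, 20

net = nn.Sequential(
    nn.Linear(n_primary + n_tasks, n_hidden),  # all weights shared across tasks
    nn.Sigmoid(),
    nn.Linear(n_hidden, 1),                    # one output y' = f'(c, x) for all tasks
)

# Dummy pooled data: 40 examples per task, each tagged with its task's one-hot context.
X_parts, y_parts = [], []
for task in range(n_tasks):
    x = torch.randn(40, n_primary)
    c = torch.zeros(40, n_tasks)
    c[:, task] = 1.0                           # context inputs identify the task
    X_parts.append(torch.cat([x, c], dim=1))
    y_parts.append(torch.randint(0, 2, (40, 1)).float())

X, y = torch.cat(X_parts), torch.cat(y_parts)

opt = torch.optim.SGD(net.parameters(), lr=0.1)
loss_fn = nn.BCEWithLogitsLoss()
for epoch in range(300):
    loss = loss_fn(net(X), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```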
40
Context Sensitive MTL (csMTL)
Overcomes the limitations of standard MTL for long-term consolidation of tasks:
- Eliminates redundant outputs for the same task
- Facilitates accumulation of knowledge through practice
- Examples can be associated with tasks directly by the environment
- Develops a fluid domain of task knowledge indexed by the context inputs
- … AND … accommodates tasks that have multiple outputs
[Diagram repeated: context inputs c1 … ck and primary inputs x1 … xn feeding one output y' for all tasks]
41
csMTL Empirical Studies Results from Repeated Studies
42
Why is csMTL doing so well?
Two constraints reduce the number of free parameters in csMTL:
- A constraint between the context-to-hidden weights and the hidden-node bias weights (for hidden node j, training example n, task z)
- A constraint between the context-to-hidden weights and the output-node weights (for hidden node j, output k, task z)
[Diagram repeated: inputs x1 … xn and context inputs c1 … ck feeding the single output y']
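One way to see where the first constraint comes from, assuming a one-hot context encoding (the notation w_{ji}, v_{jz}, b_j below is mine, not from the slides): for an example of task z only context input c_z is active, so its weight into hidden node j enters the net input exactly like an addition to that node's bias.

```latex
% Net input to hidden node j for a training example (x, c) of task z,
% assuming a one-hot context vector (c_z = 1, all other context inputs 0):
\[
  \mathrm{net}_j
    = \sum_{i=1}^{n} w_{ji}\, x_i + \sum_{z'=1}^{k} v_{jz'}\, c_{z'} + b_j
    = \sum_{i=1}^{n} w_{ji}\, x_i + \underbrace{\left( v_{jz} + b_j \right)}_{\text{task-specific bias}}
\]
% The context weight v_{jz} therefore acts like a per-task shift of the bias b_j,
% tying the context-to-hidden weights to the hidden-node bias weights.
```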
43
csMTL and Other ML Methods
Will the csMTL encoding work with other machine learning methods? IDTs? kNN? SVMs? Bayesian nets? Deep architecture learning networks?
44
csMTL Using IDT (Logic Domain)
45
csMTL for WEKA (2009–10)
A modified version of the WEKA MLP, called MLP_CS:
- Accepts csMTL-encoded examples with context inputs
- Can be used for transfer learning by researchers and practitioners
46
csMTL and Tasks with Multiple Outputs
Liangliang Tu (2010) – image morphing: inductive transfer between tasks that have multiple outputs; transforms 30×30 grey-scale images using inductive transfer.
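A hedged sketch of what a multiple-output csMTL mapping could look like (sizes, architecture, and the number of transforms are placeholders, not the thesis implementation): the one-hot context selects which transform to apply, and the network maps a flattened 30×30 grey-scale image to a 30×30 output image.

```python
# Hedged sketch: csMTL with multiple outputs. The context one-hot selects the
# image transform; the network maps a 30x30 input image to a 30x30 output image.
import torch
import torch.nn as nn

n_pixels, n_transforms, n_hidden = 30 * 30, 4, 200

net = nn.Sequential(
    nn.Linear(n_pixels + n_transforms, n_hidden),
    nn.Sigmoid(),
    nn.Linear(n_hidden, n_pixels),   # one output unit per pixel of the transformed image
    nn.Sigmoid(),                    # grey-scale intensities in [0, 1]
)

image = torch.rand(1, n_pixels)            # a dummy 30x30 grey-scale image, flattened
context = torch.zeros(1, n_transforms)
context[0, 2] = 1.0                        # ask for the third transform
output_image = net(torch.cat([image, context], dim=1)).reshape(30, 30)
print(output_image.shape)
```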
47
csMTL and Tasks with Multiple Outputs
48
csMTL and Tasks with Multiple Outputs
Demo
49
Domain Knowledge Network
An ML3 system based on csMTL (work with Ben Fowler, 2010), addressing the stability–plasticity problem:
- Functional transfer (virtual examples) for consolidation
- Representational transfer from the consolidated domain knowledge (CDK) network for rapid learning
[Diagram: a short-term learning network f1(c,x) and a long-term consolidated domain knowledge network f'(c,x), each with one output for all tasks, over task context inputs c1 … ck and standard inputs x1 … xn]
50
Challenges and Benefits of ML3 to AGI
Method of knowledge retention?
- Representational: weights (ANN), distance metric (kNN), branches (IDT), choice of kernel (SVM)
- Functional: examples (ANN), hyper-priors (NB), minimization guides (EBNN)
Method of knowledge transfer (from a prior task A to a new task B)? Functional or representational.
51
Challenges and Benefits
Weighing the relevance and accuracy of prior versus new knowledge (training examples)? How do we select relevant prior knowledge? Role of unsupervised learning?
52
Challenges and Benefits
Stability–plasticity problem: how do we integrate new knowledge with old?
- No loss of new knowledge
- No loss of prior knowledge
- Efficient methods of storage and recall
ML3 methods that can efficiently and effectively retain learned knowledge will suggest approaches to "common knowledge" representation – a "Big AI" problem.
53
Challenges and Benefits
Practice makes perfect! An ML3 system must be capable of learning from examples of tasks over a lifetime. Practice should increase model accuracy and overall domain knowledge. How can this be done? This research is important to AI, psychology, and education.
54
Challenges and Benefits
Scalability is often a difficult but important challenge. An ML3 system must scale with increasing:
- Number of inputs and outputs
- Number of training examples
- Number of tasks
- Complexity of tasks and size of hypothesis representation
Preferably with polynomial growth.
55
Challenges and Benefits
The study of ML3 systems will be beneficial to the understanding of human and machine learning:
- Insight into curriculum and training sequences
- Best practices for rapid, accurate learning
- Best practices for knowledge consolidation
Of interest to AI and education.
56
Challenges and Benefits
Applications in software agents and robots provide useful test platforms for empirical studies:
- Examples are encountered periodically and intermittently
- Practice is often necessary
- Consolidation of new knowledge with old is needed for continual learning
- Integration of supervised, unsupervised, and reinforcement learning approaches
57
Conclusions
Machine lifelong learning is a logical next step for machine learning:
- Transfer learning of new tasks
- Consolidation of new task knowledge with prior knowledge
ML3 is a natural research area for AGI:
- Considers an agent that learns and practices tasks
- Takes a systems perspective on ML and AI
- Transfer takes advantage of prior learning
- Consolidation will inform approaches to common knowledge representation – a key element for Big AI
58
Conclusions
csMTL is a method of inductive transfer using multiple tasks:
- Single task output, additional context inputs
- Shifts the focus to learning a continuous domain of tasks
- Eliminates redundant task representation
Empirical studies:
- csMTL performs transfer at or above the level of MTL
- Less dependent upon task relatedness
- Scales to tasks with multiple outputs
- The csMTL encoding does not work for all ML methods
59
Future Work
- Conditions under which csMTL ANNs succeed or fail
- General ML characteristics under which the csMTL encoding will work
- Explore deep learning architectures
- Explore domains with real-valued context inputs grounded in their environment
- Investigate knowledge loss during consolidation
- Explore common knowledge retention/consolidation for activities other than learning
60
Thank You! danny.silver@acadiau.ca