Thirty-Three Years of Knowledge-Based Machine Learning

Presentation transcript:

Thirty-Three Years of Knowledge-Based Machine Learning Jude Shavlik University of Wisconsin USA

Key Question of AI: How to Get Knowledge into Computers? Two extremes: hand coding and supervised ML. How can we mix these two extremes? A small ML subcommunity has looked at ways to do so.

The Standard Task for Supervised Machine Learning
Given training examples:

  Age  Weight  Height  BP      Sex  ...  Outcome
  38   142     64in    150/88  F    ...
  27   189     70in    125/72  M    ...

Produce a 'model' (that classifies future examples).
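A minimal sketch of this set-up (feature values come from the table above, but the Outcome labels, the new patient, and the choice of scikit-learn's decision tree are illustrative assumptions, not part of the slide):

```python
from sklearn.tree import DecisionTreeClassifier

# Each row: [age, weight_lb, height_in, systolic_bp, sex_is_female]
X = [[38, 142, 64, 150, 1],
     [27, 189, 70, 125, 0]]
y = ["positive", "negative"]          # the Outcome column (made-up labels)

model = DecisionTreeClassifier().fit(X, y)       # produce a 'model'...
print(model.predict([[45, 160, 66, 140, 1]]))    # ...that classifies a future example
```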

Learning with Data in Multiple Tables (Relational ML). [Diagram: a Patients table linked to tables of Previous Blood Tests, Previous Rx, and Previous Mammograms.] Key challenge: a different amount of data for each patient.
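A toy sketch, with invented fields and values, of why this is hard: each patient links to a different number of rows, so any fixed-length feature vector must either aggregate away detail or pad:

```python
# Hypothetical relational layout (field names and values are made up).
patients = [
    {"pid": 1, "age": 38},
    {"pid": 2, "age": 27},
]
blood_tests = [            # many rows per patient; the counts differ
    {"pid": 1, "wbc": 6.1}, {"pid": 1, "wbc": 7.4}, {"pid": 1, "wbc": 5.9},
    {"pid": 2, "wbc": 8.2},
]

# Flattening into one table forces a choice: aggregate (lose detail) or pad (waste space).
for p in patients:
    tests = [t["wbc"] for t in blood_tests if t["pid"] == p["pid"]]
    print(p["pid"], "num_tests:", len(tests), "mean_wbc:", sum(tests) / len(tests))
```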

How Does the ‘Domain Expert’ Impart His/Her Expertise? Choose Features Choose and Label Examples Can AI’ers Allow More? Knowledge-Based Artificial Neural Networks Knowledge-Based Support Vector Machines Markov Logic Networks (Domingos’ group)

Two Underexplored Questions in ML How can we go beyond teaching machines solely via I/O pairs? ‘advice giving’ How can we understand what an ML algorithm has discovered? ‘rule extraction’

Outline Explanation-Based Learning (1980’s) Knowledge-Based Neural Nets (1990’s) Knowledge-Based SVMs (2000’s) Markov Logic Networks (2010’s)

Explanation-Based Learning (EBL) – My PhD Years, 1983-1987. The EBL Hypothesis: by understanding why an example is a member of a concept, one can learn the essential properties of the concept. Trade-off: the need to collect many examples is traded for the ability to 'explain' single examples (via a 'domain theory'); ie, assume a smarter learner.

Knowledge-Based Artificial Neural Networks, KBANN (1988-2001). [Flow diagram: insert an initial symbolic domain theory into an initial neural network; refine that network with training examples to get a trained neural network; then extract a final symbolic domain theory. Related work by Mooney, Pazzani, Cohen, etc.]

What Inspired KBANN? Geoff Hinton was an invited speaker at ICML-88 I recall him saying something like “one can backprop through any function” And I thought “what would it mean to backprop through a domain theory?”

Inserting Prior Knowledge into a Neural Network. [Diagram: the domain theory's final conclusions correspond to the network's output units, its supporting facts to the input units, and its intermediate conclusions to the hidden units.]

Jumping Ahead a Bit Notice that symbolic knowledge induces a graphical model, which is then numerically optimized Similar perspective later followed in Markov Logic Networks (MLNs) However in MLNs, symbolic knowledge expressed in first-order logic

Mapping Rules to Nets. Example rule set: 'If A and B then Z' and 'If B and ¬C then Z'. KBANN maps propositional rule sets to neural networks. [Diagram: a unit for Z connected to inputs A, B, C, D, with a bias on Z and weights on the links (the figure shows values such as 4, 2, -4, and 6).]
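A short sketch of that mapping, assuming the usual KBANN recipe (weight +ω for positive antecedents, -ω for negated ones, and a bias that lets a unit fire only when the rule body is satisfied); ω = 4 and the numbers printed below are illustrative rather than the slide's exact values:

```python
import math

OMEGA = 4.0   # antecedent link weight (illustrative)

def rule_to_unit(antecedents):
    """Map one conjunctive rule to a sigmoid unit, KBANN-style.

    antecedents: list of (feature, is_positive) pairs.
    Positive antecedents get weight +OMEGA, negated ones -OMEGA; the bias is set
    so the net input is positive only when every positive antecedent is true
    and every negated one is false.
    """
    weights = {f: (OMEGA if pos else -OMEGA) for f, pos in antecedents}
    n_pos = sum(1 for _, pos in antecedents if pos)
    bias = -(n_pos - 0.5) * OMEGA
    return weights, bias

def activation(weights, bias, example):
    """Sigmoid output of a unit on a {feature: 0 or 1} example."""
    net = bias + sum(w * example.get(f, 0) for f, w in weights.items())
    return 1.0 / (1.0 + math.exp(-net))

# The slide's two rules for Z: "If A and B then Z" and "If B and not C then Z".
rule1 = rule_to_unit([("A", True), ("B", True)])
rule2 = rule_to_unit([("B", True), ("C", False)])

# Z itself becomes an OR over the two rule units: weights +OMEGA, bias -OMEGA/2.
example = {"A": 1, "B": 1, "C": 0, "D": 0}   # D is present but irrelevant to the rules
h1 = activation(*rule1, example)
h2 = activation(*rule2, example)
z = 1.0 / (1.0 + math.exp(-(-OMEGA / 2 + OMEGA * h1 + OMEGA * h2)))
print(round(h1, 2), round(h2, 2), round(z, 2))   # all well above 0.5: both rules fire, so Z fires
```

After this initialization, ordinary backpropagation refines the weights, which is the "Refine" step in the KBANN flow above.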

Case Study: Learning to Recognize Genes (Towell, Shavlik & Noordewier, AAAI-90). Domain theory: promoter :- contact, conformation. contact :- minus_35, minus_10. Plus <4 rules for conformation>, <4 rules for minus_35>, <4 rules for minus_10>. (Halved the error rate of standard BP.) [Diagram: the corresponding network, with promoter above contact and conformation, which sit above minus_35 and minus_10, over the DNA-sequence inputs.] We compile rules to a more basic language, but here we compile for refinability.

Learning Curves (similar results on many tasks and with other advice-taking algo's). [Figure: test-set errors vs. amount of training data for KBANN, a standard ANN, and the domain theory alone; annotations mark a fixed amount of data and a given error-rate spec.]

From Prior Knowledge to Advice (Maclin PhD 1995). Originally the 'theory refinement' community assumed domain knowledge was available before learning starts (prior knowledge). When applying KBANN to reinforcement learning, we began to realize you should be able to provide domain knowledge to a machine learner whenever you think of something to say. Changing the metaphor: commanding vs. advising computers. Continual (ie, lifelong) Human Teacher – Machine Learner Cooperation.

What Would You Like to Say to This Penguin? IF a Bee is (Near and West) & an Ice is (Near and North) Then Begin Move East Move North END

Some Advice for Soccer-Playing Robots: if distanceToGoal ≤ 10 and shotAngle ≥ 30, then prefer shoot over all other actions.

Some Sample Results. [Figure: learning curves with advice vs. without advice.]

Overcoming Bad Advice. [Figure: learning curves for no advice vs. bad advice, showing that the learner eventually overcomes the bad advice.]

Rule Extraction. Initially Geoff Towell (PhD, 1991) viewed this as simplifying the trained neural network (M-of-N rules). Mark Craven (PhD, 1996) realized this is simply another learning task! Ie, learn what the neural network computes: collect I/O pairs from the trained neural network and give them to a decision-tree learner. Applies to SVMs, decision forests, etc.

Understanding What a Trained ANN has Learned - Human ‘Readability’ of Trained ANNs is Challenging Rule Extraction (Craven & Shavlik, 1996) Could be an ENSEMBLE of models Roughly speaking, train ID3 to learn the I/O behavior of the neural network – note we can generate as many labeled training ex’s as desired by forward prop’ing through trained ANN!
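A minimal sketch of that recipe, with an invented network and dataset, and scikit-learn's CART-style tree standing in for ID3:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)

# A trained "black box" network (stand-in for the model whose behavior we want to read).
X_train = rng.normal(size=(500, 4))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
net = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0).fit(X_train, y_train)

# Rule extraction as "just another learning task": generate as many inputs as we like,
# label them by forward-propagating through the trained network, then fit a tree to that.
X_query = rng.normal(size=(5000, 4))
y_query = net.predict(X_query)
tree = DecisionTreeClassifier(max_depth=3).fit(X_query, y_query)

print(export_text(tree, feature_names=["f0", "f1", "f2", "f3"]))
```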

KBANN Recap Use symbolic knowledge to make an initial guess at the concept description Standard neural-net approaches make a random guess Use training examples to refine the initial guess (‘early stopping’ reduces overfitting) Nicely maps to incremental (aka online) machine learning Valuable to show user the learned model expressed in symbols rather than numbers

Knowledge-Based Support Vector Machines (2001-2011) Question arose during 2001 PhD defense of Tina Eliassi-Rad How would you apply the KBANN idea using SVMs? Led to collaboration with Olvi Mangasarian (who has worked on SVMs for over 50 years!)

Generalizing the Idea of a Training Example for SVMs. Can extend the SVM linear program to handle 'regions as training examples'. In standard supervised learning, training data can be viewed as points in a "feature space." In the work at Wisconsin, the idea of a "training example" has been extended to be REGIONS in feature space (with "points" being a limiting case). Notice that an IF-THEN rule defines a region in feature space (eg, "If age lies in a given range and height is between 50 and 60 inches, then category is POSITIVE"). This provides a mechanism for domain knowledge to be used in combination with (traditional) training examples.

Knowledge-Based Support Vector Regression. Add soft constraints to the linear program (so the learner need only follow the advice approximately). [Figure: output vs. inputs, with the advice "in this region, y should exceed 4" marked over part of the input space.] Note that with advice one gets a different line than one would get by simply "curve fitting" the data points. (A simple line is used here, but f(x) could be a non-linear function as well.) The notation ||·||₁ is the "one norm", simply the sum of absolute values. The first term in the expression being minimized is the "model complexity" and the second is the "data misfit":

  minimize   ||w||₁ + C ||s||₁ + penalty for violating advice
  such that  f(x) = y ± s
             constraints that represent advice
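Putting the pieces together, a hedged sketch of such a linear program (not the exact published formulation): assume a linear model f(x) = wᵀx + b, labeled data (X, y), and a single piece of advice "if Bx ≤ d then f(x) ≥ β"; the implication is imposed softly through its Farkas/dual form with multipliers u ≥ 0 and slacks z, ζ (the symbols B, d, β, u, z, ζ, μ are my notation, not the slide's):

```latex
\begin{aligned}
\min_{w,\,b,\,s \ge 0,\;u \ge 0,\;z,\;\zeta \ge 0} \quad
  & \lVert w \rVert_1 \;+\; C\,\lVert s \rVert_1 \;+\; \mu_1 \lVert z \rVert_1 \;+\; \mu_2\,\zeta \\
\text{s.t.} \quad
  & -s \;\le\; Xw + b\mathbf{1} - y \;\le\; s
      && \text{(fit the labeled points, softly)} \\
  & B^{\top}u + w \;=\; z
      && \text{(advice } Bx \le d \Rightarrow w^{\top}x + b \ge \beta, \\
  & d^{\top}u - b + \beta \;\le\; \zeta
      && \;\;\text{imposed via its Farkas form, softly)}
\end{aligned}
```

When z = 0 and ζ = 0, any x in the advice region satisfies wᵀx + b ≥ β exactly; the penalties μ₁, μ₂ control how strictly the advice is enforced relative to model simplicity and data fit.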

Sample Advice-Taking Results. Advice: if distanceToGoal ≤ 10 and shotAngle ≥ 30, then prefer shoot over all other actions, ie Q(shoot) > Q(pass) and Q(shoot) > Q(move). [Figure: 2-on-1 BreakAway, rewards +1/-1, advice-taking learner vs. standard RL.] Maclin et al: AAAI '05, '06, '07.

Automatically Creating Advice Interesting approach to transfer learning (Lisa Torrey, PhD 2009) Learn in task A Perform ‘rule extraction’ Give as advice for related task B Since advice not assumed 100% correct, differences between tasks A and B handled by training ex’s for task B So advice giving is done by MACHINE!

Close-Transfer Scenarios 2-on-1 BreakAway 4-on-3 BreakAway 3-on-2 BreakAway

Distant-Transfer Scenarios 3-on-2 KeepAway 3-on-2 BreakAway 3-on-2 MoveDownfield

Some Results: Transfer to 3-on-2 BreakAway Torrey, Shavlik, Walker & Maclin: ECML 2006, ICML Workshop 2006

KBSVM Recap Can view symbolic knowledge as a way to label regions of feature space (rather than solely labeling points) Maximize Model Simplicity + Fit to Advice + Fit to Training Examples Note: does not fit view of “guess initial model, then refine using training ex’s”

Markov Logic Networks, 2009+ (and statistical-relational learning in general). My current favorite for combining symbolic knowledge and numeric learning. MLN = set of weighted FOPC sentences, eg wgt = 3.2: ∀x,y,z parent(x, y) ∧ parent(z, y) → married(x, z). Have worked on speeding up MLN inference (via RDBMS) plus learning MLN rule sets.
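A tiny illustrative sketch of MLN semantics (my code, not the author's software): the unnormalized probability of a possible world is exp(weight × number of true groundings of the rule), so worlds that violate fewer groundings of parent(x,y) ∧ parent(z,y) → married(x,z) get more probability mass:

```python
import itertools
import math

people = ["ann", "bob", "cal"]   # a made-up domain of constants

def true_groundings(world):
    """Count true groundings of: parent(x,y) ^ parent(z,y) -> married(x,z)."""
    count = 0
    for x, y, z in itertools.product(people, repeat=3):
        body = ("parent", x, y) in world and ("parent", z, y) in world
        head = ("married", x, z) in world
        if (not body) or head:     # an implication grounding is true unless body true, head false
            count += 1
    return count

w = 3.2   # the rule weight from the slide

def unnormalized_prob(world):
    return math.exp(w * true_groundings(world))

world_a = {("parent", "ann", "cal"), ("parent", "bob", "cal"), ("married", "ann", "bob")}
world_b = {("parent", "ann", "cal"), ("parent", "bob", "cal")}   # violates one more grounding
print(unnormalized_prob(world_a) > unnormalized_prob(world_b))   # True
```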

Logic & Probability: Two Major Math Underpinnings of AI. [Diagram: starting from logic, add probabilities; starting from probabilities, add relations; the result is probabilistic logic learning, also called SRL. Inference is in the inner loop; MLNs are a popular approach.]

Learning a Set of First-Order Regression Trees (each path to a leaf is an MLN rule) – ICDM '11. [Diagram: iterate: compare the data against the predicted probabilities from the current rules to get gradients, learn a new tree from those gradients, and add it to the current rules; the final ruleset is the sum of the learned trees.]

Relational Probability Trees (Blockeel & De Raedt, AIJ 1998; Neville et al, KDD 2003). Computing prob(getFine(?X) | evidence). [Diagram: a tree whose internal nodes test relational conditions such as speed(?X) > 75 mph, job(?X, politician), knows(?X, ?Y), and job(?Y, politician), with leaf probabilities ranging from 0.01 to 0.99.]

Basic Idea of Boosted Relational Dependency Networks
1. Learn a relational regression tree
2. For each training example, collect the error in its predicted probability
3. Learn another first-order regression tree that fixes the errors collected in Step 2 (eg, for neg ex 9, predict -0.13)
4. Go to Step 2
Learns 'structure' and weights simultaneously. [Diagram: example first-order regression trees with internal tests such as a(X,Y), b(Y,Z), c(X), d(Y), e(Y,Z) and numeric values at the leaves.]
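A propositional sketch of that loop (the real algorithm fits relational regression trees over first-order features; here ordinary regression trees and an invented 0/1 dataset stand in):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)     # 0/1 labels (synthetic)

F = np.zeros(len(y))            # summed tree outputs: the current "ruleset" (log-odds potential)
trees = []
for _ in range(10):                                  # Step 4: go back to Step 2
    p = 1.0 / (1.0 + np.exp(-F))                     # predicted probabilities from current rules
    gradient = y - p                                 # Step 2: per-example error (eg, -0.13 for a neg ex)
    tree = DecisionTreeRegressor(max_depth=3).fit(X, gradient)   # Step 3: a tree that fixes the errors
    trees.append(tree)
    F += tree.predict(X)

print("final training accuracy:", ((F > 0) == (y == 1)).mean())
```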

Some Results: advisedBy(x,y)

          AUC-PR        Training Time
  MLN-BT  0.94 ± 0.06   18.4 sec
  MLN-BC  0.95 ± 0.05   33.3 sec
  Alch-D  0.31 ± 0.10   7.1 hrs
  Motif   0.43 ± 0.03   1.8 hrs
  LHL     0.42 ± 0.10   37.2 sec

Illustrative Task Given a Text Corpus of Sports Articles, Determine Who Won/Lost Each Week winner(TeamName, GameID) loser(TeamName, GameID) Evidence: we ran Stanford NLP Toolkit on the articles and produced lots of facts (nextWord, partOfSpeech, parseTree, etc) DARPA gave us a few hundred pos ex’s

Some Expert-Provided ‘Advice’ X ‘beat’ ⟶ winner(X, ArticleDate) X ‘was beaten’ ⟶ loser(X, ArticleDate) Home teams more likely to win. If you found a likely game winner, then if you see a different team mentioned nearby in the text, it is likely the game loser. If a team didn’t score any touchdowns, it probably lost. If ‘predict*’ in the sentence, ignore it.

One Great Rule (very high precision, but moderate recall). Look, in short phrases, for TeamName1 * Number1 * TeamName2 * Number2 where Number1 > Number2. Infer with high confidence: winningTeam(TeamName1, ArticleDate), losingTeam(TeamName2, ArticleDate), winnerScore(TeamName1, Number1, ArticleDate), loserScore(TeamName2, Number2, ArticleDate). The rule matches none of the DARPA-provided training examples!
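A hedged regular-expression sketch of that pattern (the actual system worked over Stanford-toolkit parses rather than raw text; the "capitalized word = team name" heuristic and all names below are made up):

```python
import re

# TeamName1 ... Number1 ... TeamName2 ... Number2, within a short phrase.
PATTERN = re.compile(r"([A-Z][a-z]+)\D{0,15}(\d+)\D{0,15}([A-Z][a-z]+)\D{0,15}(\d+)")

def extract_result(phrase, article_date):
    m = PATTERN.search(phrase)
    if not m:
        return None
    team1, n1, team2, n2 = m.group(1), int(m.group(2)), m.group(3), int(m.group(4))
    if n1 > n2:                     # only fire when the first score beats the second
        return {"winningTeam": (team1, article_date),
                "losingTeam": (team2, article_date),
                "winnerScore": (team1, n1, article_date),
                "loserScore": (team2, n2, article_date)}
    return None

print(extract_result("Packers 31, Bears 17", "2011-09-25"))
```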

Differences from KBANN Rules involve logical variables During learning, we create new rules to correct errors in initial rules Recent followup: also refine initial rules (note that KBSVMs also do NOT refine rules, though we had one AAAI paper on that)

Wrapping Up Symbolic knowledge refined/extended by Neural networks Support-vector machines MLN rule and weight learning Variety of views taken Make initial guess at concept, then refine weights Use advice to label a region in feature space Make initial guess at concept, then add wgt’ed rules Seeing what was learned – rule extraction Applications in genetics, cancer, machine reading, robot learning, etc

Some Suggestions Allow humans to continually observe learning and provide symbolic knowledge at any time Never assume symbolic knowledge is 100% correct Allow user to see what was learned in a symbolic representation to facilitate additional advice Put a graphic on every slide 

Thanks for Your Time. Questions? Papers, PowerPoint presentations, and some software available online:
pages.cs.wisc.edu/~shavlik/mlrg/publications.html
hazy.cs.wisc.edu/hazy/tuffy/
pages.cs.wisc.edu/~tushar/rdnboost/i