Thirty-Three Years of Knowledge-Based Machine Learning
Jude Shavlik, University of Wisconsin, USA
Key Question of AI: How to Get Knowledge into Computers?
- Hand coding
- Supervised ML
How can we mix these two extremes? A small ML subcommunity has looked at ways to do so.
The Standard Task for Supervised Machine Learning
Given training examples, produce a 'model' that classifies future examples.

  Age  Weight  Height  BP      Sex  ...  Outcome
  38   142     64in    150/88  F    ...
  27   189     70in    125/72  M    ...
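The setup above can be sketched in a few lines of Python. The features and labels here are invented for illustration, and a simple 1-nearest-neighbor classifier stands in for whatever supervised learner one actually uses:

```python
# Supervised learning in miniature: examples are feature vectors
# (age, weight, height in inches, systolic BP, sex) with an outcome label,
# and the learned "model" classifies new cases. Data is invented;
# 1-nearest-neighbor stands in for any supervised learner.

def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def nearest_neighbor_model(train):
    """Return a classifier closed over the training examples."""
    def classify(features):
        _, label = min(train, key=lambda ex: distance(ex[0], features))
        return label
    return classify

# (age, weight, height, systolic BP, sex encoded 0=F/1=M) -> outcome
train = [
    ((38, 142, 64, 150, 0), "positive"),
    ((27, 189, 70, 125, 1), "negative"),
]
model = nearest_neighbor_model(train)
print(model((40, 150, 65, 145, 0)))  # closest to the first example
```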
Learning with Data in Multiple Tables (Relational ML)
Tables: Patients, Previous Blood Tests, Previous Rx, Previous Mammograms
Key challenge: a different amount of data for each patient
How Does the 'Domain Expert' Impart His/Her Expertise?
- Choose features
- Choose and label examples
Can AI'ers allow more?
- Knowledge-Based Artificial Neural Networks
- Knowledge-Based Support Vector Machines
- Markov Logic Networks (Domingos' group)
Two Underexplored Questions in ML
- How can we go beyond teaching machines solely via I/O pairs? ('advice giving')
- How can we understand what an ML algorithm has discovered? ('rule extraction')
Outline
- Explanation-Based Learning (1980s)
- Knowledge-Based Neural Nets (1990s)
- Knowledge-Based SVMs (2000s)
- Markov Logic Networks (2010s)
Explanation-Based Learning (EBL) – My PhD Years, 1983-1987
The EBL Hypothesis: by understanding why an example is a member of a concept, one can learn the essential properties of the concept.
Trade-off: the need to collect many examples is exchanged for the ability to 'explain' single examples (via a 'domain theory'). I.e., assume a smarter learner.
Knowledge-Based Artificial Neural Networks, KBANN (1988-2001)
Insert an initial symbolic domain theory into a neural network, refine the network with training examples, then extract a final symbolic domain theory (related work by Mooney, Pazzani, Cohen, etc.).
What Inspired KBANN?
Geoff Hinton was an invited speaker at ICML-88. I recall him saying something like "one can backprop through any function," and I thought, "what would it mean to backprop through a domain theory?"
Inserting Prior Knowledge into a Neural Network
Domain theory → neural network:
- Final conclusions → output units
- Intermediate conclusions → hidden units
- Supporting facts → input units
Jumping Ahead a Bit
Notice that symbolic knowledge induces a graphical model, which is then numerically optimized. A similar perspective was later followed in Markov Logic Networks (MLNs); however, in MLNs the symbolic knowledge is expressed in first-order logic.
Mapping Rules to Nets
If A and B then Z
If B and ¬C then Z
Maps propositional rule sets to neural networks. [Diagram: a network over inputs A, B, C, D with a hidden unit per rule feeding Z; example weights of 4 for a positive antecedent and -4 for a negated one, with biases chosen so each unit fires only when its rule is satisfied.]
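The mapping can be made concrete with threshold units. This is a hedged sketch in the spirit of the slide's diagram (the exact weights there are illustrative): positive antecedents get weight +4, negated antecedents -4, and each rule unit's threshold is set so it fires exactly when its rule is satisfied; Z then ORs its rules together.

```python
# KBANN-style rule-to-network mapping for:
#   If A and B then Z
#   If B and not C then Z
# Weights +4 / -4 and the thresholds below are illustrative choices.

def unit(weights, threshold, inputs):
    """A threshold unit: fires iff the weighted sum exceeds the threshold."""
    return sum(w * x for w, x in zip(weights, inputs)) > threshold

def z(a, b, c):
    # Rule 1: If A and B then Z  -> weights (+4, +4), threshold 6
    rule1 = unit([4, 4], 6, [a, b])
    # Rule 2: If B and not C then Z -> weights (+4, -4), threshold 2
    rule2 = unit([4, -4], 2, [b, c])
    # Z is a disjunction of its rules (an OR unit: one firing rule suffices)
    return unit([4, 4], 2, [rule1, rule2])

print(z(1, 1, 1))  # True: rule 1 fires
print(z(0, 1, 0))  # True: rule 2 fires
print(z(0, 1, 1))  # False: neither rule fires
```

Because the units are differentiable in the real KBANN (sigmoids rather than hard thresholds), these hand-set weights become the starting point that backprop then refines.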
Case Study: Learning to Recognize Genes (Towell, Shavlik & Noordewier, AAAI-90)
promoter :- contact, conformation.
contact :- minus_35, minus_10.
<4 rules for conformation>
<4 rules for minus_35>
<4 rules for minus_10>
Input: a DNA sequence. We compile rules to a more basic language, but here we compile for refinability. (Halved the error rate of standard backprop.)
Learning Curves (similar results on many tasks and with other advice-taking algorithms)
[Figure: test-set errors vs. amount of training data for KBANN, a standard ANN, and the domain theory alone. For a fixed amount of data KBANN makes fewer errors; for a given error-rate spec it needs less training data.]
From Prior Knowledge to Advice (Maclin PhD, 1995)
- Originally the 'theory refinement' community assumed domain knowledge was available before learning starts (prior knowledge)
- When applying KBANN to reinforcement learning, we began to realize you should be able to provide domain knowledge to a machine learner whenever you think of something to say
- Changing the metaphor: commanding vs. advising computers
- Continual (i.e., lifelong) human teacher – machine learner cooperation
What Would You Like to Say to This Penguin?
IF a Bee is (Near and West) and
   an Ice is (Near and North)
THEN BEGIN
   Move East
   Move North
END
Some Advice for Soccer-Playing Robots
if distanceToGoal ≤ 10 and shotAngle ≥ 30
then prefer shoot over all other actions
Some Sample Results
[Figure: learning curves with advice vs. without advice.]
Overcoming Bad Advice
[Figure: learning curves with no advice vs. bad advice; the learner eventually overcomes the bad advice.]
Rule Extraction
Initially Geoff Towell (PhD, 1991) viewed this as simplifying the trained neural network (M-of-N rules).
Mark Craven (PhD, 1996) realized this is simply another learning task! I.e., learn what the neural network computes:
- Collect I/O pairs from the trained neural network
- Give them to a decision-tree learner
Applies to SVMs, decision forests, etc.
Understanding What a Trained ANN Has Learned
Human 'readability' of trained ANNs is challenging.
Rule Extraction (Craven & Shavlik, 1996): roughly speaking, train ID3 to learn the I/O behavior of the neural network (which could be an ENSEMBLE of models). Note we can generate as many labeled training examples as desired by forward-propagating through the trained ANN!
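The "just another learning task" view can be sketched directly: treat the trained model as a black box, generate as many labeled I/O pairs as we like by querying it, and fit an interpretable learner to those pairs. The black-box function and the stump learner below are invented stand-ins (the talk used ID3 on a trained neural network):

```python
# Rule extraction as learning: query a black-box model to create labeled
# examples, then fit a simple symbolic learner to them. The black box here
# is a stand-in computing (A and B) or (B and not C).
import itertools

def black_box(a, b, c):
    """Stand-in for a trained network."""
    return bool((a and b) or (b and not c))

# Step 1: collect I/O pairs by forward-propagating inputs through the model.
pairs = [((a, b, c), black_box(a, b, c))
         for a, b, c in itertools.product([0, 1], repeat=3)]

# Step 2: give the pairs to a simple symbolic learner; here, a decision
# stump that picks the single feature best predicting the output.
def best_stump(pairs, n_features):
    def accuracy(i):
        return sum((x[i] == 1) == y for x, y in pairs) / len(pairs)
    return max(range(n_features), key=accuracy)

feature = best_stump(pairs, 3)
print("most predictive feature index:", feature)  # B appears in both rules
```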
KBANN Recap
- Use symbolic knowledge to make an initial guess at the concept description (standard neural-net approaches make a random guess)
- Use training examples to refine the initial guess ('early stopping' reduces overfitting)
- Nicely maps to incremental (aka online) machine learning
- Valuable to show the user the learned model expressed in symbols rather than numbers
Knowledge-Based Support Vector Machines (2001-2011)
The question arose during the PhD defense of Tina Eliassi-Rad: how would you apply the KBANN idea using SVMs? This led to a collaboration with Olvi Mangasarian (who has worked on SVMs for over 50 years!).
Generalizing the Idea of a Training Example for SVMs
In standard supervised learning, training data can be viewed as points in a 'feature space.' In the work at Wisconsin, the idea of a 'training example' has been extended to REGIONS in feature space (with points as a limiting case). Notice that an IF-THEN rule defines a region in feature space (e.g., "if age is between 10 and 20 and height is between 50 and 60 inches, then category is POSITIVE"). This provides a mechanism for domain knowledge to be used in combination with traditional training examples: the SVM linear program can be extended to handle such 'regions as training examples.'
Knowledge-Based Support Vector Regression
Add soft constraints to the linear program (so the learner need only follow advice approximately):

  minimize  ||w||1 + C ||s||1 + penalty for violating advice
  such that f(x) = y ± s, plus constraints that represent advice

The notation ||.||1 is the one-norm, simply the sum of absolute values. The first term being minimized is the model complexity and the second is the data misfit. Example advice: "in this region, y should exceed 4." Note that with advice we get a different line than one would get by simply curve-fitting the data points. (A simple line is used here, but f(x) could be a non-linear function as well.)
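A hedged reconstruction of the objective, with our own notation for the pieces the slide names informally (slack vector s for data misfit, z for advice violation with penalty weight mu, and a region Bx <= d over which the advice says f(x) should exceed a level h):

```latex
\begin{aligned}
\min_{w,\,b,\,s,\,z}\quad & \|w\|_1 \;+\; C\,\|s\|_1 \;+\; \mu\,\|z\|_1 \\
\text{such that}\quad & -s \;\le\; (Xw + b\mathbf{1}) - y \;\le\; s
  && \text{(fit the training points)} \\
& Bx \le d \;\Rightarrow\; w^{\top}x + b \;\ge\; h - z
  && \text{(advice over a region)}
\end{aligned}
```

In the Mangasarian line of work, the implication constraint is converted into ordinary linear constraints via theorems of the alternative, so the whole problem remains a linear program; the sketch above only names the pieces.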
Sample Advice-Taking Results
advice:
  if distanceToGoal ≤ 10 and shotAngle ≥ 30
  then prefer shoot over all other actions
  (i.e., Q(shoot) > Q(pass) and Q(shoot) > Q(move))
Task: 2-vs-1 BreakAway, rewards +1/-1, standard RL.
Maclin et al.: AAAI '05, '06, '07
Automatically Creating Advice
An interesting approach to transfer learning (Lisa Torrey, PhD 2009):
- Learn in task A
- Perform 'rule extraction'
- Give the result as advice for related task B
Since advice is not assumed 100% correct, differences between tasks A and B are handled by training examples for task B. So the advice giving is done by MACHINE!
Close-Transfer Scenarios
2-on-1 BreakAway, 3-on-2 BreakAway, 4-on-3 BreakAway
Distant-Transfer Scenarios
3-on-2 KeepAway, 3-on-2 MoveDownfield, 3-on-2 BreakAway
Some Results: Transfer to 3-on-2 BreakAway
Torrey, Shavlik, Walker & Maclin: ECML 2006, ICML Workshop 2006
KBSVM Recap
- Can view symbolic knowledge as a way to label regions of feature space (rather than solely labeling points)
- Maximize: model simplicity + fit to advice + fit to training examples
- Note: does not fit the view of "guess initial model, then refine using training examples"
Markov Logic Networks, 2009+ (and statistical-relational learning in general)
My current favorite for combining symbolic knowledge and numeric learning. An MLN is a set of weighted FOPC sentences, e.g.

  wgt = 3.2   ∀x,y,z  parent(x, y) ∧ parent(z, y) → married(x, z)

We have worked on speeding up MLN inference (via RDBMSs) plus learning MLN rule sets.
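The semantics of a weighted sentence can be sketched in a few lines: a possible world's unnormalized weight is exp(weight × number of satisfied groundings), so worlds that satisfy more groundings of the rule are exponentially more probable. The tiny domain and facts below are invented for illustration:

```python
# MLN semantics in miniature for the slide's rule:
#   wgt=3.2  forall x,y,z: parent(x,y) ^ parent(z,y) -> married(x,z)
# A world's unnormalized weight is exp(weight * #satisfied groundings).
import itertools, math

people = ["ann", "bob", "cal"]
weight = 3.2

def satisfied_groundings(parent, married):
    """Count groundings of the clause that are true in this world."""
    count = 0
    for x, y, z in itertools.product(people, repeat=3):
        body = (x, y) in parent and (z, y) in parent
        head = (x, z) in married
        if (not body) or head:   # the implication holds
            count += 1
    return count

def world_weight(parent, married):
    return math.exp(weight * satisfied_groundings(parent, married))

# World A: ann and bob are cal's parents and are married (reflexive pairs
# included because groundings with x = z require married(x, x)).
w_consistent = world_weight({("ann", "cal"), ("bob", "cal")},
                            {("ann", "bob"), ("bob", "ann"),
                             ("ann", "ann"), ("bob", "bob")})
# World B: same parents, but nobody is married.
w_violating = world_weight({("ann", "cal"), ("bob", "cal")}, set())
print(w_consistent > w_violating)  # True: satisfying worlds weigh more
```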
Logic & Probability: Two Major Math Underpinnings of AI
Start from logic and add probabilities, or start from probabilistic models and add relations: either way you arrive at probabilistic logic learning, also called statistical-relational learning (SRL). MLNs are a popular approach. (Note: inference sits in the inner loop of learning.)
Learning a Set of First-Order Regression Trees (each path to a leaf is an MLN rule) – ICDM '11
[Diagram of the boosting loop: compare predicted probabilities to the data to get gradients, learn a tree that fits the gradients, add it to the current rules, and iterate; the final ruleset is the sum of the learned trees.]
Relational Probability Trees
Blockeel & De Raedt, AIJ 1998; Neville et al., KDD 2003
[Example tree computing prob(getFine(?X) | evidence): internal tests include speed(?X) > 75 mph, job(?X, politician), knows(?X, ?Y), and job(?Y, politician); leaves hold probabilities ranging from 0.01 to 0.99.]
Basic Idea of Boosted Relational Dependency Networks
1. Learn a relational regression tree
2. For each training example, collect the error in its predicted probability
3. Learn another first-order regression tree that fixes the errors collected in Step 2 (e.g., for negative example 9, predict -0.13)
4. Go to Step 2
Learns 'structure' and weights simultaneously.
[Example tree with internal tests a(X,Y), b(Y,Z), c(X), d(Y), e(Y,Z) and regression values at the leaves.]
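The loop above is functional-gradient boosting. A hedged sketch for plain (non-relational) data: fit a regression stump to the gradient (true label minus predicted probability), add it to the model, repeat. All data and the stump learner are invented stand-ins for the relational regression trees in the talk:

```python
# Functional-gradient boosting in miniature: each round fits a "tree"
# (here a depth-0 stump) to the residual y - P(y=1|x) and adds it in.
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def fit_stump(xs, residuals):
    """Predict the mean residual (real RDN-Boost trees split on
    relational features instead)."""
    mean = sum(residuals) / len(residuals)
    return lambda x: mean

def boost(xs, ys, rounds=50):
    trees = []
    for _ in range(rounds):
        scores = [sum(t(x) for t in trees) for x in xs]
        # Functional gradient of the log-likelihood: y - P(y=1 | x)
        grads = [y - sigmoid(s) for y, s in zip(ys, scores)]
        trees.append(fit_stump(xs, grads))
    return trees

xs = [0, 1, 2, 3]
ys = [1, 1, 1, 0]          # mostly positive examples
trees = boost(xs, ys)
prob = sigmoid(sum(t(xs[0]) for t in trees))
print(round(prob, 2))      # converges to the base rate, 0.75
```

Because the stump ignores its input, the model can only learn the base rate; the point of the relational trees in RDN-Boost is precisely that they can split on first-order features and so fix per-example errors.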
Some Results: advisedBy(x, y)

  System   AUC-PR       Training Time
  MLN-BT   0.94 ± 0.06  18.4 sec
  MLN-BC   0.95 ± 0.05  33.3 sec
  Alch-D   0.31 ± 0.10  7.1 hrs
  Motif    0.43 ± 0.03  1.8 hrs
  LHL      0.42 ± 0.10  37.2 sec
Illustrative Task: Given a Text Corpus of Sports Articles, Determine Who Won/Lost Each Week
Targets: winner(TeamName, GameID), loser(TeamName, GameID)
Evidence: we ran the Stanford NLP Toolkit on the articles and produced lots of facts (nextWord, partOfSpeech, parseTree, etc.)
DARPA gave us a few hundred positive examples.
Some Expert-Provided 'Advice'
- X 'beat' → winner(X, ArticleDate)
- X 'was beaten' → loser(X, ArticleDate)
- Home teams are more likely to win.
- If you found a likely game winner and you see a different team mentioned nearby in the text, it is likely the game loser.
- If a team didn't score any touchdowns, it probably lost.
- If 'predict*' is in the sentence, ignore it.
One Great Rule (very high precision, but moderate recall)
Look, in short phrases, for
  TeamName1 * Number1 * TeamName2 * Number2
where Number1 > Number2. Infer with high confidence:
  winningTeam(TeamName1, ArticleDate)
  losingTeam(TeamName2, ArticleDate)
  winnerScore(TeamName1, Number1, ArticleDate)
  loserScore(TeamName2, Number2, ArticleDate)
This rule matches none of the DARPA-provided training examples!
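The pattern can be sketched as a matcher over raw text. The team list and sentence below are invented, and the talk's version ran over parsed NLP facts in an MLN rather than a regex; this just shows the shape of the rule:

```python
# The "one great rule" as a pattern matcher: in a short phrase, find
# TeamName1 ... Number1 ... TeamName2 ... Number2 with Number1 > Number2.
import re

TEAMS = ["Packers", "Bears"]   # invented stand-in vocabulary

def game_facts(phrase):
    team = "|".join(TEAMS)
    pattern = rf"({team})\D*(\d+)\D*({team})\D*(\d+)"
    m = re.search(pattern, phrase)
    if not m:
        return None
    t1, n1, t2, n2 = m.group(1), int(m.group(2)), m.group(3), int(m.group(4))
    if n1 > n2:     # the rule only fires when the first score is larger
        return {"winningTeam": t1, "losingTeam": t2,
                "winnerScore": n1, "loserScore": n2}
    return None

print(game_facts("Packers 31, Bears 17"))
# {'winningTeam': 'Packers', 'losingTeam': 'Bears',
#  'winnerScore': 31, 'loserScore': 17}
```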
Differences from KBANN
- Rules involve logical variables
- During learning, we create new rules to correct errors in the initial rules
- Recent follow-up: also refine the initial rules (note that KBSVMs also do NOT refine rules, though we had one AAAI paper on that)
Wrapping Up
Symbolic knowledge refined/extended by:
- Neural networks
- Support-vector machines
- MLN rule and weight learning
Variety of views taken:
- Make an initial guess at the concept, then refine weights
- Use advice to label a region in feature space
- Make an initial guess at the concept, then add weighted rules
Seeing what was learned: rule extraction.
Applications in genetics, cancer, machine reading, robot learning, etc.
Some Suggestions
- Allow humans to continually observe learning and provide symbolic knowledge at any time
- Never assume symbolic knowledge is 100% correct
- Allow the user to see what was learned, in a symbolic representation, to facilitate additional advice
- Put a graphic on every slide
Thanks for Your Time. Questions?
Papers, PowerPoint presentations, and some software available online:
  pages.cs.wisc.edu/~shavlik/mlrg/publications.html
  hazy.cs.wisc.edu/hazy/tuffy/
  pages.cs.wisc.edu/~tushar/rdnboost/