Knowledge Representation. (Figure: map of 573 topics: Perception, NLP, Robotics, Multi-agent, Reinforcement Learning, MDPs, Supervised Learning, Planning, Uncertainty, Search, Knowledge Representation; cross-cutting dimensions: Expressiveness, Tractability, Problem Spaces, Agency.) © Daniel S. Weld
Weekly Course Outline: Overview, agents, problem spaces; Search; Learning; Planning (and CSPs); Uncertainty; Planning under uncertainty; Reinforcement learning; Topics; Project presentations & contest. (Braces on the slide group the early weeks into the Search and Knowledge Representation units.)
Search: Problem spaces. Blind search: depth-first, breadth-first, iterative deepening, iterative broadening. Informed search: best-first, Dijkstra's, A*, IDA*, SMA*, DFB&B, beam, hill climbing, limited discrepancy, AO*, LAO*, RTDP. Local search. Heuristics: evaluation, construction, learning; pattern databases. Online methods. Techniques from operations research. Constraint satisfaction. Adversary search.
KR: Propositional logic: syntax (CNF, Horn clauses, …), semantics (truth tables), inference (FC, resolution, DPLL, WalkSAT). First-order logic: syntax (quantifiers, Skolem functions, …), semantics (interpretations), inference (FC, resolution, compilation). Frame systems. Two theses… Representing events, action & change.
Machine Learning Overview: What is inductive learning? Understanding an ML technique in depth: induction of decision trees. Bagging. Overfitting. Applications.
Inductive Learning: inductive learning, or "prediction": given examples of a function (X, F(X)), predict the function F(X) for new examples X. Classification: F(X) is discrete. Regression: F(X) is continuous. Probability estimation: F(X) = Probability(X).
Widely-used Approaches: decision trees, rule induction, Bayesian learning, neural networks, genetic algorithms, instance-based learning, support vector machines, etc.
Defining a Learning Problem: A program is said to learn from experience E with respect to task T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E. Task T: playing checkers. Performance measure P: percent of games won against opponents. Experience E: playing practice games against itself.
Issues: What feedback (experience) is available? How should features be represented? What kind of knowledge is being increased? How is that knowledge represented? What prior information is available? What is the right learning algorithm? How do we avoid overfitting?
Choosing the Training Experience: the credit assignment problem. Direct training examples, e.g., individual checker boards plus the correct move for each: supervised learning. Indirect training examples, e.g., a complete sequence of moves and the final result: reinforcement learning. Which examples: random, teacher chooses, or learner chooses?
Choosing the Target Function: What type of knowledge will be learned? How will the knowledge be used by the performance program? E.g., a checkers program: assume it knows the legal moves and needs to choose the best move. So learn the function F: Boards -> Moves? That is hard to learn. Alternative: F: Boards -> R. Note the similarity to the choice of problem space.
The Ideal Evaluation Function
V(b) = 100 if b is a final, won board
V(b) = -100 if b is a final, lost board
V(b) = 0 if b is a final, drawn board
Otherwise, if b is not final, V(b) = V(s), where s is the best reachable final board.
Nonoperational… we want an operational approximation V̂ of V.
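A minimal sketch of why this definition is nonoperational, following the slide's simplified definition (it ignores whose turn it is); successors, is_final, and outcome are hypothetical names for a checkers interface, not from the slides:

```python
# Ideal value function V, as defined above: the value of a non-final
# board is the value of the best reachable final board. Computing it
# requires searching the entire game tree, hence "nonoperational".

def V(b):
    if is_final(b):
        return outcome(b)     # +100 won, -100 lost, 0 drawn
    return max(V(s) for s in successors(b))
```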
How to Represent the Target Function: x1 = number of black pieces on the board; x2 = number of red pieces on the board; x3 = number of black kings on the board; x4 = number of red kings on the board; x5 = number of black pieces threatened by red; x6 = number of red pieces threatened by black. V̂(b) = a + b·x1 + c·x2 + d·x3 + e·x4 + f·x5 + g·x6. Now we just need to learn 7 numbers!
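As a sketch, the representation and the evaluation it induces; the feature-extractor names are hypothetical, and only the seven weights are learned:

```python
# Linear evaluation function V̂(b) = a + b*x1 + ... + g*x6.

def features(board):
    return [num_black_pieces(board),       # x1
            num_red_pieces(board),         # x2
            num_black_kings(board),        # x3
            num_red_kings(board),          # x4
            num_black_threatened(board),   # x5
            num_red_threatened(board)]     # x6

def V_hat(board, weights):
    a, rest = weights[0], weights[1:]      # weights = [a, b, c, d, e, f, g]
    return a + sum(w * x for w, x in zip(rest, features(board)))
```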
Example: Checkers. Task T: playing checkers. Performance measure P: percent of games won against opponents. Experience E: playing practice games against itself. Target function V: board -> R. Representation of the approximation: V̂(b) = a + b·x1 + c·x2 + d·x3 + e·x4 + f·x5 + g·x6.
Target Function: a profound formulation: any type of inductive learning can be expressed as approximating a function. E.g., checkers: V: boards -> evaluation. E.g., handwriting recognition: V: image -> word. E.g., mushrooms: V: mushroom-attributes -> {E, P} (edible or poisonous).
More Examples: Collaborative Filtering; (Parts of) Robocode Controller
Bias (concept introduced by name here): the problem of disjunction. Two types of bias: restriction and preference.
Decision Trees: a convenient representation, developed with learning in mind, and deterministic. Expressive: equivalent to propositional DNF; handles discrete and continuous parameters. Simple learning algorithm; handles noise well. The algorithm classifies as: constructive (builds the DT by adding nodes), eager, and batch (though incremental versions exist).
Concept Learning: e.g., learn the concept "edible mushroom". The target function has two values: T or F. Represent concepts as decision trees; use hill-climbing search through the space of decision trees, starting with a simple concept and refining it into a complex concept as needed.
Example: "Good day for tennis". Attributes of instances: Wind, Temperature, Humidity, Outlook. A feature is an attribute with one value, e.g., outlook = sunny. Sample instance: wind = weak, temp = hot, humidity = high, outlook = sunny.
Experience: "Good day for tennis"

Day  Outlook  Temp  Humid  Wind  PlayTennis?
d1   s        h     h      w     n
d2   s        h     h      s     n
d3   o        h     h      w     y
d4   r        m     h      w     y
d5   r        c     n      w     y
d6   r        c     n      s     y
d7   o        c     n      s     y
d8   s        m     h      w     n
d9   s        c     n      w     y
d10  r        m     n      w     y
d11  s        m     n      s     y
d12  o        m     h      s     y
d13  o        h     n      w     y
d14  r        m     h      s     n

(Outlook: s = sunny, o = overcast, r = rain; Temp: h = hot, m = mild, c = cool; Humid: h = high, n = normal; Wind: w = weak, s = strong.)
Decision Tree Representation: Good day for tennis? Leaves = classification; arcs = choice of value for the parent attribute.

Outlook:
  Sunny    -> Humidity: High -> No; Normal -> Yes
  Overcast -> Yes
  Rain     -> Wind: Strong -> No; Weak -> Yes

A decision tree is equivalent to logic in disjunctive normal form:
GoodDay <=> (Sunny ∧ Normal) ∨ Overcast ∨ (Rain ∧ Weak)
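Since the tree is just DNF, it can be written directly as a Boolean function; a sketch:

```python
# The tennis tree as its DNF equivalent: one disjunct per path
# from the root to a "Yes" leaf.

def good_day(outlook, humidity, wind):
    return ((outlook == "sunny" and humidity == "normal")
            or outlook == "overcast"
            or (outlook == "rain" and wind == "weak"))
```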
DT Learning as Search: Nodes = decision trees; Operators = tree refinement (sprouting the tree); Initial node = the smallest tree possible, a single leaf; Heuristic = information gain; Goal = the best tree possible (???).
What is the Simplest Tree? (The 14-example training table from the earlier slide.) A single leaf that always predicts "yes". How good? [10+, 4-] means: correct on 10 examples, incorrect on 4.
Successors of the single-leaf tree: sprout a split on Humid, Wind, Outlook, or Temp. Which attribute should we use to split?
To be decided: How to choose the best attribute? (Information gain; entropy = disorder.) When to stop growing the tree?
Disorder is bad; homogeneity is good. (Figure: three candidate splits labeled "No", "Better", and "Good", ordered by the purity of the resulting subsets.)
(Figure: entropy as a function of the proportion P of positive examples; entropy is 0 at P = .00 and P = 1.00, and peaks at 1.0 when P = .50.)
Entropy (disorder) is bad; homogeneity is good. Let S be a set of examples. Entropy(S) = -P log2(P) - N log2(N), where P is the proportion of positive examples, N is the proportion of negative examples, and 0 log 0 == 0. Example: S has 10 positive and 4 negative examples. Entropy([10+, 4-]) = -(10/14) log2(10/14) - (4/14) log2(4/14) = 0.863.
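A short sketch of the definition, reproducing the slide's example:

```python
from math import log2

def entropy(pos, neg):
    """Entropy of a set with pos positive and neg negative examples."""
    total = pos + neg
    e = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:              # convention: 0 log 0 == 0
            e -= p * log2(p)
    return e

print(entropy(10, 4))          # ~0.863, matching Entropy([10+, 4-])
```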
Information Gain: a measure of the expected reduction in entropy resulting from splitting along an attribute A:

Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)

where Entropy(S) = -P log2(P) - N log2(N).
Gain of Splitting on Wind

Day  Wind    Tennis?
d1   weak    n
d2   strong  n
d3   weak    y
d4   weak    y
d5   weak    y
d6   strong  y
d7   strong  y
d8   weak    n
d9   weak    y
d10  weak    y
d11  strong  y
d12  strong  y
d13  weak    y
d14  strong  n

Values(Wind) = {weak, strong}; S = [10+, 4-]; S_weak = [6+, 2-]; S_strong = [4+, 2-].
Gain(S, Wind) = Entropy(S) - Σ_{v ∈ {weak, strong}} (|S_v| / |S|) Entropy(S_v)
= 0.863 - (8/14) · 0.811 - (6/14) · 0.918 ≈ 0.006
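Checking the calculation with the entropy() helper from the earlier sketch:

```python
# Gain(S, Wind) = Entropy(S) - sum_v (|Sv|/|S|) Entropy(Sv)
gain_wind = (entropy(10, 4)                 # S        = [10+, 4-]
             - 8/14 * entropy(6, 2)         # S_weak   = [6+, 2-]
             - 6/14 * entropy(4, 2))        # S_strong = [4+, 2-]
print(round(gain_wind, 3))                  # ~0.006: wind is a weak predictor
```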
Evaluating Attributes: Gain(S, Wind) ≈ 0.006; Gain(S, Humid) = 0.151; Gain(S, Temp) = 0.029; Gain(S, Outlook) = 0.170. (Some of these values differ from what this exact table yields, but ignore that detail.) Outlook has the highest gain.
Resulting Tree…. Good day for tennis? Split on Outlook: Sunny -> No [2+, 3-]; Overcast -> Yes [4+]; Rain -> Yes [4+, 1-]. The impure Sunny and Rain branches still need refinement.
Recurse! The Outlook = Sunny subset:

Day  Temp  Humid  Wind    Tennis?
d1   h     h      weak    n
d2   h     h      strong  n
d8   m     h      weak    n
d9   c     n      weak    y
d11  m     n      strong  y
One Step Later… Outlook: Sunny -> Humidity (Normal -> Yes [2+]; High -> No [3-]); Overcast -> Yes [4+]; Rain -> [4+, 1-], still to be refined.
Decision Tree Algorithm

BuildTree(TrainingData)
  Split(TrainingData)

Split(D)
  If (all points in D are of the same class) Then Return
  For each attribute A
    Evaluate splits on attribute A
  Use the best split to partition D into D1, D2
  Split(D1)
  Split(D2)
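A runnable sketch of this procedure using the information-gain heuristic; the tuple-based tree encoding and dict-based examples are illustrative choices, not from the slides:

```python
from collections import Counter
from math import log2

def set_entropy(examples):                  # examples: [(features_dict, label)]
    counts = Counter(label for _, label in examples)
    n = len(examples)
    return -sum(c/n * log2(c/n) for c in counts.values())

def gain(examples, attr):
    remainder = 0.0
    for v in set(f[attr] for f, _ in examples):
        subset = [(f, y) for f, y in examples if f[attr] == v]
        remainder += len(subset)/len(examples) * set_entropy(subset)
    return set_entropy(examples) - remainder

def build_tree(examples, attrs):            # attrs: set of attribute names
    labels = set(y for _, y in examples)
    if len(labels) == 1:                    # all points in the same class
        return ("leaf", labels.pop())
    if not attrs:                           # out of attributes: majority vote
        return ("leaf", Counter(y for _, y in examples).most_common(1)[0][0])
    best = max(attrs, key=lambda a: gain(examples, a))
    children = {v: build_tree([(f, y) for f, y in examples if f[best] == v],
                              attrs - {best})
                for v in set(f[best] for f, _ in examples)}
    return ("split", best, children)
```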
Application to Robocode: assume a fixed architecture:

If <in-danger>
  Then escape: If <ok-escape-dir> Then forward 45 Else …
  Else attack: If <shooting-best>
    Then if <gun-aimed> Then …
    Else ram: …

Learn these classifiers (the <…> tests) from a training set generated by trial battles.
Movie Recommendation: what are the features? (Figure: example movies Rambo, Matrix, and Rambo 2.)
Issues: content vs. social; non-Boolean attributes; missing data; scaling up.
Stopped here
Missing Data 1: assign the attribute its most common value at that node, or its most common value given the example's classification.

Day  Temp  Humid  Wind    Tennis?
d1   h     h      weak    n
d2   h     h      strong  n
d8   m     h      weak    n
d9   c     ?      weak    y
d11  m     n      strong  y
Fractional Values: at this node, 75% of the examples have humid = h and 25% have humid = n, so split the missing-valued example (d9) into fractional examples and use the fractions in the gain calculations: the h branch becomes [0.75+, 3-] and the n branch [1.25+, 0-]. Further subdivide if other attributes are missing. The same approach classifies a test example with a missing attribute: the classification is the most probable one, summing over the leaves where the example got divided.
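A sketch of the bookkeeping for the sunny-node example above:

```python
# At the sunny node, humidity is 'h' for 75% of examples and 'n' for 25%,
# so d9 (a positive example with humidity missing) is split fractionally.
pos_h, neg_h = 0.75 * 1, 3            # d9's 0.75 share joins d1, d2, d8
pos_n, neg_n = 1 + 0.25 * 1, 0        # d11 plus d9's remaining 0.25 share
print([pos_h, neg_h], [pos_n, neg_n]) # [0.75, 3] and [1.25, 0]
```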
Non-Boolean Features: For features with multiple discrete values: construct a multi-way split? Test for one value vs. all of the others? Group values into two disjoint subsets? For real-valued features: discretize? Consider a threshold split using observed values?
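One way to realize the threshold idea, as a sketch (reusing set_entropy() from the tree-building sketch):

```python
# Try a threshold midway between each pair of adjacent observed values
# and keep the one with the highest information gain.

def best_threshold(examples, attr):
    values = sorted(set(f[attr] for f, _ in examples))
    best_t, best_g = None, -1.0
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left  = [(f, y) for f, y in examples if f[attr] <= t]
        right = [(f, y) for f, y in examples if f[attr] >  t]
        g = (set_entropy(examples)
             - len(left)/len(examples)  * set_entropy(left)
             - len(right)/len(examples) * set_entropy(right))
        if g > best_g:
            best_t, best_g = t, g
    return best_t, best_g
```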
Skipping remaining slides
What if the data is too big for main memory? How to evaluate split points, for discrete and for continuous attributes? How to partition the data once a split point is found? Post-pruning.
=> Attribute Lists: rewrite the training data as attribute lists, one per attribute, written to disk. If an attribute is continuous, sort its list on that attribute's value.
Evaluating an Index: the Gini index. Easy to evaluate (it only requires a histogram), and a low value is good. Summing over the classes j, with p_j the relative frequency of class j in D:

gini(D) = 1 - Σ_j p_j²
gini_split(D) = |D1|/|D| · gini(D1) + |D2|/|D| · gini(D2)
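The two formulas as a sketch, checked on the running tennis example:

```python
def gini(counts):                    # counts: class histogram, e.g. [pos, neg]
    total = sum(counts)
    return 1.0 - sum((c/total)**2 for c in counts)

def gini_split(counts1, counts2):    # weighted Gini of a binary partition
    n1, n2 = sum(counts1), sum(counts2)
    return n1/(n1+n2) * gini(counts1) + n2/(n1+n2) * gini(counts2)

print(gini([10, 4]))                 # ~0.408 for [10+, 4-]
print(gini_split([6, 2], [4, 2]))    # ~0.405 for the wind split
```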
Histograms: associated with each node of the (growing) tree; they record the distribution of class values and are used to evaluate splits. Continuous attributes: C_above and C_below histograms. Discrete attributes: a count matrix.
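A sketch of how the C_below / C_above histograms drive the scan over a sorted attribute list, using gini_split() from the previous sketch; the record layout is an assumption:

```python
def best_continuous_split(sorted_list, class_totals):
    # sorted_list: [(value, class_index, rid), ...], pre-sorted by value.
    c_below = [0] * len(class_totals)      # class counts seen so far
    c_above = list(class_totals)           # class counts still to come
    best_val, best_score = None, float("inf")
    for value, cls, _rid in sorted_list:
        c_below[cls] += 1
        c_above[cls] -= 1
        if sum(c_above) == 0:              # no split after the last record
            break
        score = gini_split(c_below, c_above)
        if score < best_score:             # low Gini is good
            best_val, best_score = value, score
    return best_val, best_score
```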
Evaluating continuous split points
Evaluating discrete split points
Performing the Split: We must split every attribute list in two. Easy for the 'winning' attribute: just scan through the old list and add each entry to the correct one of the two new lists. The remaining attributes are harder: there is no test to apply to those attributes to find the correct part, so use the RID (record identifier). Details coming up….
Splitting a node’s attribute list
Splitting with Limited Memory: Scan through the list for the 'winning' attribute, adding each entry to the correct one of the two new lists and inserting its RID into a hash table, noting the destination child (L or R). Later, scan each of the other attribute lists, probing the hash table with each record's RID to find its destination and writing the record to the corresponding child list. If the hash table is too big for memory, do this in stages: partition the splitting attribute only part way, e.g., until one has scanned as many records as would fit in the hash table given the memory constraints.
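An in-memory sketch of the two phases (the attribute-list layout and function names are assumptions; the staged, memory-limited variant described above would process the hash table in chunks):

```python
def split_attribute_lists(lists, winner, passes_test):
    # lists: {attr: [(value, class, rid), ...]}; passes_test: the winning split.
    hash_table = {}                                  # rid -> 'L' or 'R'
    left  = {a: [] for a in lists}
    right = {a: [] for a in lists}
    for value, cls, rid in lists[winner]:            # phase 1: winning attribute
        side = 'L' if passes_test(value) else 'R'
        hash_table[rid] = side
        (left if side == 'L' else right)[winner].append((value, cls, rid))
    for attr, records in lists.items():              # phase 2: probe by RID
        if attr == winner:
            continue
        for value, cls, rid in records:
            dest = left if hash_table[rid] == 'L' else right
            dest[attr].append((value, cls, rid))
    return left, right
```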
Evaluation: the test machine had only 16 MB of RAM.