CS 540 - Fall 2016 (© Jude Shavlik), Lecture 4

Today's Topics
Read Section 18.8.1 of textbook and Wikipedia article(s) linked to class home page
Read Chapter 3 & Section 4.1 (skim Section 3.6 and rest of Chapter 4), Sections 5.1, 5.2, 5.3, 5.7, 5.8, & 5.9 (skim rest of Chapter 5) of textbook
Sign up for Piazza!
Information Gain Derived (and Generalized to k Output Categories)
Handling Numeric and Hierarchical Features
Advanced Topic: Regression Trees
The Trouble with Too Many Possible Values
What if Measuring Features is Costly?  Feature Values Missing?
Summer Internships Panel tomorrow [last Fri], Rm 1240 CS, 3:30pm

ID3 Info Gain Measure Justified (Ref. C4.5, J. R. Quinlan, Morgan Kaufmann, 1993, pp 21-22)
Definition of Information
The info conveyed by a message M depends on its probability: info(M) = -log2[Prob(M)] (due to Claude Shannon)
Note: last lecture we used infoNeeded() as a more informative name for info()
The Supervised Learning Task
Select an example from a set S and announce that it belongs to class C
The probability of this occurring is approximately fC, the fraction of C's in S
Hence the info in this announcement is, by definition, -log2(fC)
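
To make the definition concrete, here is a minimal sketch (plain Python; the function name is mine, not from the lecture) of the information conveyed by a message of a given probability:

```python
import math

def info(prob):
    """Bits of information conveyed by a message whose probability is prob,
    i.e., -log2(prob), following Shannon's definition."""
    return -math.log2(prob)

# A rare announcement carries more information than a common one:
print(info(0.5))    # 1.0 bit
print(info(0.125))  # 3.0 bits
```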

ID3 Info Gain Measure (cont.)
Let there be K different classes in set S, namely C1, C2, ..., CK
What's the expected info from a msg about the class of an example in set S?
info(S) = -Σi fCi log2(fCi), summing over the K classes, where fCi is the fraction of S belonging to class Ci
info(S) is the average number of bits of information (by looking at feature values) needed to classify a member of set S
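
A compact sketch of this expected-info computation and the resulting info-gain score for a split; the function names and the tiny example are mine, not from the slides:

```python
import math
from collections import Counter

def expected_info(labels):
    """info(S): the average number of bits needed to announce the class of a
    member of S, i.e., -sum over classes of fCi * log2(fCi)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(labels, feature_values):
    """Reduction in expected info from splitting S on a discrete feature:
    info(S) minus the size-weighted info of each resulting subset."""
    n = len(labels)
    subsets = {}
    for y, v in zip(labels, feature_values):
        subsets.setdefault(v, []).append(y)
    remainder = sum(len(sub) / n * expected_info(sub) for sub in subsets.values())
    return expected_info(labels) - remainder

# Tiny example: 4 positives and 4 negatives split by a binary feature
labels  = ['+', '+', '+', '+', '-', '-', '-', '-']
feature = [ 1,   1,   1,   0,   0,   0,   0,   1 ]
print(expected_info(labels))       # 1.0 bit
print(info_gain(labels, feature))  # about 0.19 bits
```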

Handling Hierarchical Features in ID3
Define a new feature for each level in the hierarchy, e.g., for a Shape hierarchy whose top level divides into Circular and Polygonal:
Shape1 = { Circular, Polygonal }
Shape2 = { the finer-grained shape values one level further down }
Let ID3 choose the appropriate level of abstraction!
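
A minimal sketch of this idea; the leaf-level shape names are hypothetical stand-ins, since the slide only shows the top-level split into Circular vs. Polygonal:

```python
# Turn one hierarchical attribute into one feature per level, so the tree
# learner can pick the right level of abstraction.  The leaf-level shape
# names below are assumed for illustration only.
HIERARCHY = {
    'Circular':  ['Circle', 'Ellipse'],      # assumed subtypes
    'Polygonal': ['Triangle', 'Rectangle'],  # assumed subtypes
}

def level_features(leaf_value):
    """Map a leaf-level value to its (Shape1, Shape2) feature values."""
    for top, leaves in HIERARCHY.items():
        if leaf_value in leaves:
            return {'Shape1': top, 'Shape2': leaf_value}
    raise ValueError(f'unknown shape: {leaf_value}')

print(level_features('Ellipse'))  # {'Shape1': 'Circular', 'Shape2': 'Ellipse'}
```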

Handling Numeric Features in ID3
On the fly, create binary features and choose the best one
Step 1: Plot the current examples along the feature's value (in the slide's figure, points at values 5, 7, 9, 11, 13, colored green = pos and red = neg)
Step 2: Divide midway between every consecutive pair of points with different categories to create new binary features, e.g., featurenew1 ≡ F < 8 and featurenew2 ≡ F < 10
Step 3: Choose the split with the best info gain (it competes with all other features)
Note: "on the fly" means in each recursive call to ID3
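
A minimal sketch (not the lecture's code) of Step 2: sort the examples by the feature's value and place a candidate threshold midway between consecutive examples whose classes differ. The class labels below are an assumption chosen to reproduce the slide's thresholds of 8 and 10:

```python
def candidate_thresholds(values, labels):
    """Candidate 'F < threshold' splits: midpoints between consecutive
    (sorted) examples whose class labels differ."""
    pairs = sorted(zip(values, labels))
    thresholds = []
    for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
        if y1 != y2 and v1 != v2:
            thresholds.append((v1 + v2) / 2.0)
    return thresholds

# The slide's example: class changes between 7 and 9 and between 9 and 11
values = [5, 7, 9, 11, 13]
labels = ['+', '+', '-', '+', '+']   # assumed labeling consistent with the figure
print(candidate_thresholds(values, labels))  # [8.0, 10.0]
```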

Handling Numeric Features (cont.)
Technical Note: A numeric feature cannot be discarded after it is used in one portion of the d-tree; it may be tested again, with a different threshold, deeper in the tree
[Figure: a subtree whose root tests F < 10, one of whose branches tests the same feature again with F < 5, leading to + and - leaves]

Advanced Topic: Regression Trees (assume features are numerically valued)
[Figure: a regression tree whose root tests Age > 25 (branches No / Yes); one branch ends in a leaf with Output = 4 f3 + 7 f5 - 2 f9, while the other tests Gender (branches M / F), whose leaves have Output = 100 f4 - 2 f8 and Output = 7 f6 - 2 f1 - 2 f8 + f7. That is, each leaf holds a linear model over the features rather than a class label.]
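
A minimal sketch of how prediction with such a tree works: internal nodes test features and leaves return weighted sums of feature values. Which No/Yes branch leads to which subtree, and the 0/1 encoding of Gender, are my assumptions, since the transcript of the figure is ambiguous:

```python
def predict(x):
    """x is a dict of numeric features: 'Age', 'Gender' (0 = M, 1 = F,
    an assumed encoding), and f1..f9."""
    if x['Age'] > 25:
        if x['Gender'] == 0:                                       # M branch
            return 100 * x['f4'] - 2 * x['f8']
        return 7 * x['f6'] - 2 * x['f1'] - 2 * x['f8'] + x['f7']   # F branch
    return 4 * x['f3'] + 7 * x['f5'] - 2 * x['f9']                 # Age <= 25 leaf
```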

Advanced Topic: Scoring "Splits" for Regression (Real-Valued) Problems
We want to return real values at the leaves
- For each feature F, "split" as done in ID3
- Use the residual error remaining, say from a Linear Least Squares (LLS) fit, instead of info gain to score candidate splits (why not a weighted sum of the total error?)
Commonly the models at the leaves are weighted sums of the features (y = mx + b); some approaches just place constants at the leaves
[Figure: a scatter plot of Output vs. X with a linear least-squares (LLS) fit line]
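
A minimal sketch of scoring one candidate numeric split by the residual error that remains afterwards. For brevity it fits a constant (the mean) in each branch, one of the options mentioned above; a fuller version would fit an LLS line per branch. The data values are made up:

```python
def squared_residual(ys):
    """Total squared error left after fitting a constant (the mean) to ys."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def split_score(xs, ys, threshold):
    """Lower is better: residual error remaining after splitting on x < threshold."""
    left  = [y for x, y in zip(xs, ys) if x < threshold]
    right = [y for x, y in zip(xs, ys) if x >= threshold]
    return squared_residual(left) + squared_residual(right)

xs = [1, 2, 3, 10, 11, 12]
ys = [1.0, 1.2, 0.9, 5.0, 5.1, 4.9]
print(split_score(xs, ys, 6))    # small residual: the split separates the two clusters
print(split_score(xs, ys, 2.5))  # much larger residual
```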

Unfortunate Characteristic Property of Using the Info-Gain Measure
It FAVORS FEATURES WITH HIGH BRANCHING FACTORS (i.e., many possible values)
Extreme case: splitting on something like Student ID puts at most one example in each leaf, so every leaf's Info(.,.) score is zero and the split gets a perfect score, yet it generalizes very poorly (i.e., it memorizes the data)
[Figure: a node splitting on Student ID into branches 1, 99, ..., 999999, each leading to a leaf holding a single example such as 1+ / 0- or 0+ / 1-]
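
A quick check of this effect, assuming the expected_info and info_gain functions from the sketch earlier in these notes are in scope (the data are made up):

```python
# Continuing the earlier info_gain sketch: a unique Student-ID "feature"
# achieves the maximum possible gain, namely all of info(S), yet it
# memorizes the data and generalizes poorly.
labels     = ['+', '-', '+', '-', '+', '-']
student_id = [1, 2, 3, 4, 5, 6]              # a different value per example
real_feat  = ['a', 'a', 'a', 'b', 'b', 'b']  # an ordinary, imperfect feature
print(info_gain(labels, student_id))  # 1.0 = expected_info(labels)
print(info_gain(labels, real_feat))   # ~0.08
```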

One Fix (used in HW0/HW1)
Convert all features to binary, e.g., Color = { Red, Blue, Green } becomes the three binary features Color = Red?, Color = Blue?, Color = Green?
That is, go from one N-valued feature to N binary-valued features
Also used in Neural Nets and SVMs
D-tree readability is probably reduced, but not necessarily
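
A minimal sketch (mine, not the HW0/HW1 code) of this conversion from one N-valued feature to N binary features:

```python
def binarize(feature_name, values, example_value):
    """Return one 'feature = value?' binary feature per possible value,
    set to 1 for the value this example actually has."""
    return {f'{feature_name}={v}?': int(example_value == v) for v in values}

print(binarize('Color', ['Red', 'Blue', 'Green'], 'Blue'))
# {'Color=Red?': 0, 'Color=Blue?': 1, 'Color=Green?': 0}
```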

Considering the Cost of Measuring a Feature
We want trees with high accuracy whose tests are also inexpensive to compute (e.g., take a temperature vs. do a CAT scan)
Common heuristic: score each feature by InformationGain(F)² / Cost(F)
Used in medical domains as well as robot-sensing tasks
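
A minimal sketch of ranking features by this heuristic; the gain and cost numbers are made-up illustrative values:

```python
features = {
    'temperature': {'gain': 0.30, 'cost': 1.0},
    'cat_scan':    {'gain': 0.45, 'cost': 50.0},
}

def cost_sensitive_score(gain, cost):
    """The slide's heuristic: InformationGain(F)**2 / Cost(F)."""
    return gain ** 2 / cost

best = max(features, key=lambda f: cost_sensitive_score(**features[f]))
print(best)  # 'temperature': the cheap test wins despite its lower raw gain
```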

What about Missing Feature Values?
Quinlan proposed and evaluated some ideas
It might be best to use a Bayes Net (covered later) to infer the most likely values of the missing features, given the class and the known feature values

We'll return to d-trees after a digression into:
train/tune/test sets
k-nearest neighbors
Still to cover on d-trees:
overfitting reduction
ensembles (train a set of d-trees)