Data Science Algorithms: The Basic Methods
Covering Algorithms
WFH: Data Mining, Chapter 4.4
Rodney Nielsen
Many of these slides were adapted from: I. H. Witten, E. Frank and M. A. Hall

Algorithms: The Basic Methods
Inferring rudimentary rules
Naïve Bayes, probabilistic model
Constructing decision trees
Constructing rules
Association rule learning
Linear models
Instance-based learning
Clustering

Covering Algorithms
Convert decision tree into a rule set
Straightforward, but rule set overly complex
More effective conversions are not trivial
Instead, can generate rule set directly
For each class in turn, find rule set that covers all instances in it (excluding instances not in the class)
Called a covering approach: at each stage a rule is identified that "covers" some of the instances

Example: Generating a Rule
Rules for class "a", refined step by step:
If true then class = a
If x > 1.2 then class = a
If x > 1.2 and y > 2.6 then class = a
Possible rule set for class "b":
If x ≤ 1.2 then class = b
If x > 1.2 and y ≤ 2.6 then class = b
Could add more rules, get "perfect" rule set
Student Q: Name an example where you would want to generate "sensible" rules as opposed to perfect rules.
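
A minimal Python sketch of how such a rule set behaves (the classify_b function and the sample points are my own illustration, not part of the slides): instances covered by neither "b" rule fall through to class "a".

    def classify_b(x, y):
        """Apply the rule set for class "b" from the example above."""
        if x <= 1.2:
            return "b"
        if x > 1.2 and y <= 2.6:
            return "b"
        return "a"  # not covered by the "b" rules, so left for the "a" rules

    # Hypothetical points chosen only to exercise each rule
    for point in [(1.0, 3.0), (2.0, 1.0), (2.0, 3.0)]:
        print(point, "->", classify_b(*point))
    # (1.0, 3.0) -> b, (2.0, 1.0) -> b, (2.0, 3.0) -> a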

Rules vs. Trees
The corresponding decision tree produces exactly the same predictions
But: rule sets can be clearer when decision trees suffer from replicated subtrees
Note: in multiclass situations, a covering algorithm concentrates on one class at a time, whereas a decision tree learner takes all classes into account

Simple Covering Algorithm
Generates a rule by adding tests that maximize the rule's accuracy
Similar to the situation in decision trees: the problem of selecting an attribute to split on
But: a decision tree inducer maximizes overall purity
Each new test reduces the rule's coverage

Selecting a Test
Goal: maximize accuracy
t: total number of instances covered by the rule
p: positive examples of the class covered by the rule
(t – p: number of errors made by the rule)
Select the test that maximizes the accuracy of the rule: p/t
We are finished when p/t = 1 or the set of instances can't be split any further
Student Q: It seems like the separate-and-conquer strategy has its strength in its simplicity, while the advantage of the divide-and-conquer approach is its accuracy. How does it make sense for one to consider the separate-and-conquer strategy at the expense of accuracy?
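
A minimal sketch of this selection criterion in Python (the test_accuracy function and the tiny toy dataset are assumptions for illustration, not from the slides): count t and p for a candidate condition and report the accuracy p/t.

    from fractions import Fraction

    def test_accuracy(instances, target_class, attribute, value):
        """Score a candidate test "attribute = value" for a rule predicting target_class.

        t = number of instances covered by the test, p = covered instances that
        actually belong to target_class. The learner picks the test with the
        largest p/t, breaking ties in favour of the larger p (greater coverage).
        """
        covered = [x for x in instances if x[attribute] == value]
        t = len(covered)
        p = sum(1 for x in covered if x["class"] == target_class)
        return p, t

    # Tiny hypothetical dataset just to show the call
    data = [
        {"outlook": "sunny", "windy": "false", "class": "yes"},
        {"outlook": "sunny", "windy": "true",  "class": "no"},
        {"outlook": "rainy", "windy": "false", "class": "yes"},
    ]
    p, t = test_accuracy(data, "yes", "outlook", "sunny")
    print(p, t, Fraction(p, t))  # 1 2 1/2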

Example: Contact Lens Data
Rule we seek: If ? then recommendation = hard
Possible tests/conditions (hard instances covered / total instances covered):
Age = Young                             2/8
Age = Pre-presbyopic                    1/8
Age = Presbyopic                        1/8
Spectacle prescription = Myope          3/12
Spectacle prescription = Hypermetrope   1/12
Astigmatism = no                        0/12
Astigmatism = yes                       4/12
Tear production rate = Reduced          0/12
Tear production rate = Normal           4/12
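
The fractions above can be recomputed from the standard 24-instance contact lens dataset (Cendrowska) that these slides are based on. A short, runnable Python sketch; the tuple/dict encoding and the candidate_tests helper are my own.

    # (age, spectacle prescription, astigmatism, tear production rate, recommendation)
    ROWS = [
        ("young", "myope", "no", "reduced", "none"),
        ("young", "myope", "no", "normal", "soft"),
        ("young", "myope", "yes", "reduced", "none"),
        ("young", "myope", "yes", "normal", "hard"),
        ("young", "hypermetrope", "no", "reduced", "none"),
        ("young", "hypermetrope", "no", "normal", "soft"),
        ("young", "hypermetrope", "yes", "reduced", "none"),
        ("young", "hypermetrope", "yes", "normal", "hard"),
        ("pre-presbyopic", "myope", "no", "reduced", "none"),
        ("pre-presbyopic", "myope", "no", "normal", "soft"),
        ("pre-presbyopic", "myope", "yes", "reduced", "none"),
        ("pre-presbyopic", "myope", "yes", "normal", "hard"),
        ("pre-presbyopic", "hypermetrope", "no", "reduced", "none"),
        ("pre-presbyopic", "hypermetrope", "no", "normal", "soft"),
        ("pre-presbyopic", "hypermetrope", "yes", "reduced", "none"),
        ("pre-presbyopic", "hypermetrope", "yes", "normal", "none"),
        ("presbyopic", "myope", "no", "reduced", "none"),
        ("presbyopic", "myope", "no", "normal", "none"),
        ("presbyopic", "myope", "yes", "reduced", "none"),
        ("presbyopic", "myope", "yes", "normal", "hard"),
        ("presbyopic", "hypermetrope", "no", "reduced", "none"),
        ("presbyopic", "hypermetrope", "no", "normal", "soft"),
        ("presbyopic", "hypermetrope", "yes", "reduced", "none"),
        ("presbyopic", "hypermetrope", "yes", "normal", "none"),
    ]
    ATTRS = ["age", "spectacle prescription", "astigmatism", "tear production rate"]
    LENSES = [dict(zip(ATTRS + ["recommendation"], row)) for row in ROWS]

    def candidate_tests(instances, target="hard"):
        """Yield (attribute, value, p, t) for every candidate test attribute = value."""
        for attr in ATTRS:
            for value in sorted({x[attr] for x in instances}):
                covered = [x for x in instances if x[attr] == value]
                p = sum(1 for x in covered if x["recommendation"] == target)
                yield attr, value, p, len(covered)

    for attr, value, p, t in candidate_tests(LENSES):
        print(f"{attr} = {value}: {p}/{t}")
    # "astigmatism = yes" scores 4/12 and is selected as the first test.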

Modified Rule and Resulting Data
Rule with best test added: If astigmatism = yes then recommendation = hard
Instances covered by the rule:
Age             Spectacle prescription   Astigmatism   Tear production rate   Recommended lenses
Young           Myope                    Yes           Reduced                None
Young           Myope                    Yes           Normal                 Hard
Young           Hypermetrope             Yes           Reduced                None
Young           Hypermetrope             Yes           Normal                 Hard
Pre-presbyopic  Myope                    Yes           Reduced                None
Pre-presbyopic  Myope                    Yes           Normal                 Hard
Pre-presbyopic  Hypermetrope             Yes           Reduced                None
Pre-presbyopic  Hypermetrope             Yes           Normal                 None
Presbyopic      Myope                    Yes           Reduced                None
Presbyopic      Myope                    Yes           Normal                 Hard
Presbyopic      Hypermetrope             Yes           Reduced                None
Presbyopic      Hypermetrope             Yes           Normal                 None

Further Refinement
Current state: If astigmatism = yes and ? then recommendation = hard
Possible tests:
Age = Young                             2/4
Age = Pre-presbyopic                    1/4
Age = Presbyopic                        1/4
Spectacle prescription = Myope          3/6
Spectacle prescription = Hypermetrope   1/6
Tear production rate = Reduced          0/6
Tear production rate = Normal           4/6

Modified Rule and Resulting Data
Rule with best test added: If astigmatism = yes and tear production rate = normal then recommendation = hard
Instances covered by the modified rule:
Age             Spectacle prescription   Astigmatism   Tear production rate   Recommended lenses
Young           Myope                    Yes           Normal                 Hard
Young           Hypermetrope             Yes           Normal                 Hard
Pre-presbyopic  Myope                    Yes           Normal                 Hard
Pre-presbyopic  Hypermetrope             Yes           Normal                 None
Presbyopic      Myope                    Yes           Normal                 Hard
Presbyopic      Hypermetrope             Yes           Normal                 None

Further Refinement
Current state: If astigmatism = yes and tear production rate = normal and ? then recommendation = hard
Possible tests:
Age = Young                             2/2
Age = Pre-presbyopic                    1/2
Age = Presbyopic                        1/2
Spectacle prescription = Myope          3/3
Spectacle prescription = Hypermetrope   1/3
Tie between the first and the fourth test (both have accuracy 1)
We choose the one with greater coverage: Spectacle prescription = Myope
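
A brief Python sketch of this tie-break, assuming the six covered instances from the table above (the dict encoding and the score helper are mine): rank candidate tests by accuracy p/t and break ties by coverage p.

    from fractions import Fraction

    # The six instances covered by "astigmatism = yes and tear production rate = normal",
    # keeping only the attributes still available for further tests.
    COVERED = [
        {"age": "young",          "spectacle prescription": "myope",        "rec": "hard"},
        {"age": "young",          "spectacle prescription": "hypermetrope", "rec": "hard"},
        {"age": "pre-presbyopic", "spectacle prescription": "myope",        "rec": "hard"},
        {"age": "pre-presbyopic", "spectacle prescription": "hypermetrope", "rec": "none"},
        {"age": "presbyopic",     "spectacle prescription": "myope",        "rec": "hard"},
        {"age": "presbyopic",     "spectacle prescription": "hypermetrope", "rec": "none"},
    ]

    def score(attr, value):
        covered = [x for x in COVERED if x[attr] == value]
        p = sum(1 for x in covered if x["rec"] == "hard")
        return p, len(covered)

    tests = [(a, v) for a in ("age", "spectacle prescription")
             for v in sorted({x[a] for x in COVERED})]
    # Rank by accuracy p/t first, then by coverage p to break ties.
    best = max(tests, key=lambda av: (Fraction(*score(*av)), score(*av)[0]))
    print(best)  # ('spectacle prescription', 'myope'): 3/3 beats 2/2 on coverage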

The Result
Final rule: If astigmatism = yes and tear production rate = normal and spectacle prescription = myope then recommendation = hard
Second rule for recommending "hard" lenses (built from instances not covered by the first rule): If age = young and astigmatism = yes and tear production rate = normal then recommendation = hard
These two rules cover all "hard" lenses
The process is repeated with the other two classes

Pseudo-Code for PRISM
For each class C
  Initialize E to the instance set
  While E contains instances in class C
    Create a rule R with an empty left-hand side that predicts class C
    Until p/t = 1.0 OR no more attributes, do
      For each attribute A not mentioned in R, and each value v,
        consider adding the condition A = v to the left-hand side of R
      Select A and v to maximize the accuracy p/t
        (break ties by choosing the condition with the largest p)
      Add condition A = v to R
    Remove the instances covered by R from E
Student Q: What kind of rule can you add to a test in order to prevent it from covering negative data examples?
Student Q: PRISM only constructs "perfect" rules by ignoring any rule with less than 100% accuracy. What kind of effect would this have on the running time, and why may it not be the appropriate method for certain situations?
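
A compact, runnable Python translation of the pseudocode above. This is my own sketch, not the book's or Weka's implementation; instances are assumed to be dicts with a "class" key, and ties in p/t are broken by coverage as described.

    from fractions import Fraction

    def learn_prism(instances, class_attr="class"):
        """Learn one rule set per class, following the PRISM pseudocode.

        Each rule is (conditions, class_value), where conditions is a dict
        {attribute: value}; an instance is covered if it matches every condition.
        """
        attributes = [a for a in instances[0] if a != class_attr]
        rules = []
        for cls in sorted({x[class_attr] for x in instances}):
            E = list(instances)                      # initialize E to the instance set
            while any(x[class_attr] == cls for x in E):
                conditions = {}                      # rule R with an empty left-hand side
                covered = list(E)
                while True:
                    p = sum(1 for x in covered if x[class_attr] == cls)
                    remaining = [a for a in attributes if a not in conditions]
                    if p == len(covered) or not remaining:
                        break                        # p/t = 1.0 or no more attributes
                    best, best_key = None, (Fraction(-1), -1)
                    for a in remaining:              # consider every condition A = v
                        for v in {x[a] for x in covered}:
                            sub = [x for x in covered if x[a] == v]
                            sp = sum(1 for x in sub if x[class_attr] == cls)
                            key = (Fraction(sp, len(sub)), sp)   # accuracy, then coverage
                            if key > best_key:
                                best, best_key = (a, v), key
                    a, v = best
                    conditions[a] = v                # add condition A = v to R
                    covered = [x for x in covered if x[a] == v]
                rules.append((conditions, cls))
                # remove the instances covered by R from E
                E = [x for x in E if not all(x[a] == v for a, v in conditions.items())]
        return rules

    # Usage on a tiny hypothetical dataset:
    data = [
        {"outlook": "sunny", "windy": "false", "class": "yes"},
        {"outlook": "sunny", "windy": "true",  "class": "no"},
        {"outlook": "rainy", "windy": "false", "class": "yes"},
        {"outlook": "rainy", "windy": "true",  "class": "no"},
    ]
    for conditions, cls in learn_prism(data):
        print("If", " and ".join(f"{a} = {v}" for a, v in conditions.items()) or "true",
              "then class =", cls)
    # If windy = true then class = no
    # If windy = false then class = yes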

Rules vs. Decision Lists
PRISM with the outer loop removed generates a decision list for one class
Subsequent rules are designed for examples that are not covered by previous rules
But: order doesn't matter, because all rules predict the same class
The outer loop considers all classes separately
No order dependence implied
Problems: overlapping rules; no guarantee all new examples are covered, so a default rule is required
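
A small Python sketch (the classify function and the default value are my own assumptions) of why a default rule is needed when PRISM's rules are applied as an unordered set to new data:

    def classify(instance, rules, default="none"):
        """Apply an unordered rule set; fall back to a default class if no rule fires.

        rules: list of (conditions, class_value) pairs, e.g. as learned by PRISM.
        If rules for different classes overlap, this naive version simply returns
        the class of the first matching rule; a real system needs a conflict strategy.
        """
        for conditions, cls in rules:
            if all(instance.get(a) == v for a, v in conditions.items()):
                return cls
        return default  # default rule: new examples are not guaranteed to be covered

    rules = [({"astigmatism": "yes", "tear production rate": "normal",
               "spectacle prescription": "myope"}, "hard")]
    print(classify({"astigmatism": "no", "tear production rate": "reduced",
                    "spectacle prescription": "myope"}, rules))  # -> "none" (default)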

Separate and Conquer
Methods like PRISM (for dealing with one class) are separate-and-conquer algorithms:
First, identify a useful rule
Then, separate out all the instances it covers
Finally, "conquer" the remaining instances
Difference from divide-and-conquer methods: the subset covered by a rule doesn't need to be explored any further

Student Questions
Student Q: What is an alternative approach to divide-and-conquer classification methods?
Student Q: Since rule generation is usually more understandable to humans trying to extrapolate from the data, is rule generation considered good at describing the patterns in the data in comparison to the algorithms we discussed previously?
Student Q: How do you evaluate the error rate of a rule on a validation dataset and decide if it's good enough to keep?
Student Q: How do you evaluate the error rate of a rule on a test set and decide if it's good enough to keep?
Student Q: When pruning a decision tree, does it make sense to prune based on the lowest p/t?
Student Q: The rules derived from a decision tree may be much more numerous than necessary, may contain redundant terms, and may not provide easily understandable classification rules if the tree is too large. Is there a way to deal with this large-tree dilemma?
Student Q: Is a rule-generating method a potentially better option than a decision tree when looking for a minority class?
Student Q: Section 6.2 talks a lot about how post-pruning is better than pre-pruning for rule generalization, but makes a point of bringing forth the flaws in each method. It ends with only one method truly working but being overly complex. Couldn't a form of pre-pruning be used based on information gain? Set a threshold for the lowest acceptable information gain and then generate rules while leaving out rules under that threshold?