Data Science Algorithms: The Basic Methods


1 Data Science Algorithms: The Basic Methods
Covering Algorithms
WFH: Data Mining, Chapter 4.4
Rodney Nielsen
Many of these slides were adapted from: I. H. Witten, E. Frank and M. A. Hall

2 Algorithms: The Basic Methods
Inferring rudimentary rules
Naïve Bayes, probabilistic model
Constructing decision trees
Constructing rules
Association rule learning
Linear models
Instance-based learning
Clustering

3 Covering Algorithms
Convert a decision tree into a rule set:
  Straightforward, but the rule set is overly complex
  More effective conversions are not trivial
Instead, can generate a rule set directly:
  For each class in turn, find the rule set that covers all instances in it (excluding instances not in the class)
Called a covering approach: at each stage a rule is identified that "covers" some of the instances

4 Example: Generating a Rule
Rules for class "a" (progressively refined):
  If true then class = a
  If x > 1.2 then class = a
  If x > 1.2 and y > 2.6 then class = a
Possible rule set for class "b":
  If x ≤ 1.2 then class = b
  If x > 1.2 and y ≤ 2.6 then class = b
Could add more rules to get a "perfect" rule set
Student Q: Name an example where you would want to generate "sensible" rules as opposed to perfect rules.
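To make the example concrete, here is a minimal sketch (not part of the original slides) of how the finished rules for class "b" would be applied to (x, y) points; the thresholds 1.2 and 2.6 come from the rules above, and the function name is purely illustrative.

```python
# Minimal sketch: applying the covering rules from the example slide to (x, y)
# points. Thresholds come from the slide; the function name is illustrative.

def classify(x: float, y: float) -> str:
    if x <= 1.2:
        return "b"                 # first rule for class b
    if x > 1.2 and y <= 2.6:
        return "b"                 # second rule for class b
    return "a"                     # remaining region of instance space: class a

print(classify(1.0, 3.0))  # b
print(classify(2.0, 2.0))  # b
print(classify(2.0, 3.0))  # a
```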

5 Rules vs. Trees
The corresponding decision tree produces exactly the same predictions
But: rule sets can be clearer when the corresponding decision tree suffers from replicated subtrees
Note: in multiclass situations, a covering algorithm concentrates on one class at a time, whereas a decision tree learner takes all classes into account

6 Simple Covering Algorithm
Generates a rule by adding tests that maximize the rule's accuracy
Similar to the situation in decision trees: the problem of selecting an attribute to split on
But: a decision tree inducer maximizes overall purity
Each new test reduces the rule's coverage

7 Selecting a Test
Goal: maximize accuracy
  t: total number of instances covered by the rule
  p: positive examples of the class covered by the rule
  t – p: number of errors made by the rule
Select the test that maximizes the rule's accuracy p/t
We are finished when p/t = 1 or the set of instances can't be split any further
Student Q: It seems like the separate-and-conquer strategy's strength is its simplicity, while the advantage of the divide-and-conquer approach is its accuracy. How does it make sense for one to consider the separate-and-conquer strategy at the expense of accuracy?
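As a small illustration of this criterion, the snippet below ranks a few candidate tests by accuracy p/t and breaks ties by coverage; the counts are taken from the contact lens example on the next slide, and the code itself is only a sketch, not from the book.

```python
# Rank candidate tests (attribute, value) by accuracy p/t, breaking ties by
# larger coverage p. Counts are the first-stage contact-lens counts below.

candidates = {
    ("age", "young"): (2, 8),                       # (p, t)
    ("spectacle prescription", "myope"): (3, 12),
    ("astigmatism", "yes"): (4, 12),
    ("tear production rate", "normal"): (4, 12),
}

best = max(candidates.items(), key=lambda kv: (kv[1][0] / kv[1][1], kv[1][0]))
print(best)  # (('astigmatism', 'yes'), (4, 12)) -- tied with tear production rate = normal
```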

8 Example: Contact Lens Data
Rule we seek: If ? then recommendation = hard
Possible tests/conditions, with accuracy p/t (hard-lens instances covered / instances covered):
  Age = Young                              2/8
  Age = Pre-presbyopic                     1/8
  Age = Presbyopic                         1/8
  Spectacle prescription = Myope           3/12
  Spectacle prescription = Hypermetrope    1/12
  Astigmatism = no                         0/12
  Astigmatism = yes                        4/12
  Tear production rate = Reduced           0/12
  Tear production rate = Normal            4/12

9 Modified Rule and Resulting Data
Rule with best test added: If astigmatism = yes then recommendation = hard
Instances covered by the rule: the 12 instances with astigmatism = yes
(table columns: Age, Spectacle prescription, Astigmatism, Tear production rate, Recommended lenses)

10 Further Refinement
Current state: If astigmatism = yes and ? then recommendation = hard
Possible tests, with accuracy p/t:
  Age = Young                              2/4
  Age = Pre-presbyopic                     1/4
  Age = Presbyopic                         1/4
  Spectacle prescription = Myope           3/6
  Spectacle prescription = Hypermetrope    1/6
  Tear production rate = Reduced           0/6
  Tear production rate = Normal            4/6

11 Modified Rule and Resulting Data
Rule with best test added: If astigmatism = yes and tear production rate = normal then recommendation = hard
Instances covered by the modified rule: the 6 instances with astigmatism = yes and tear production rate = normal
(table columns: Age, Spectacle prescription, Astigmatism, Tear production rate, Recommended lenses)

12 Further Refinement
Current state: If astigmatism = yes and tear production rate = normal and ? then recommendation = hard
Possible tests, with accuracy p/t:
  Age = Young                              2/2
  Age = Pre-presbyopic                     1/2
  Age = Presbyopic                         1/2
  Spectacle prescription = Myope           3/3
  Spectacle prescription = Hypermetrope    1/3
Tie in accuracy between the first and the fourth test (Age = Young at 2/2 and Spectacle prescription = Myope at 3/3); we choose the one with greater coverage

13 The Result
Final rule: If astigmatism = yes and tear production rate = normal and spectacle prescription = myope then recommendation = hard
Second rule for recommending "hard lenses" (built from the instances not covered by the first rule): If age = young and astigmatism = yes and tear production rate = normal then recommendation = hard
These two rules cover all "hard lenses"
The process is then repeated with the other two classes

14 Pseudo-Code for PRISM
For each class C
  Initialize E to the instance set
  While E contains instances in class C
    Create a rule R with an empty left-hand side that predicts class C
    Until p/t = 1.0 or there are no more attributes to use, do
      For each attribute A not mentioned in R, and each value v,
        consider adding the condition A = v to the left-hand side of R
      Select A and v to maximize the accuracy p/t
        (break ties by choosing the condition with the largest p)
      Add the condition A = v to R
    Remove the instances covered by R from E

Student Q: What kind of rule can you add to a test in order to prevent it from covering negative data examples?
Student Q: PRISM only constructs "perfect" rules by ignoring any rule with less than 100% accuracy. What kind of effect would this have on the running time, and why may it not be the appropriate method for certain situations?
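Below is a minimal Python sketch of the pseudo-code above, assuming instances are represented as dicts of attribute values plus a "class" key; function and variable names are illustrative, and this is one interpretation of the slide, not the book's reference implementation.

```python
# Minimal PRISM sketch. Instances are dicts such as
# {"age": "young", "astigmatism": "yes", ..., "class": "hard"}.

def covers(rule, inst):
    """A rule is a list of (attribute, value) conditions; an empty rule covers everything."""
    return all(inst[a] == v for a, v in rule)

def prism(instances):
    rule_sets = {}
    for target in {inst["class"] for inst in instances}:
        E = list(instances)                              # initialize E to the instance set
        rules = []
        while any(inst["class"] == target for inst in E):
            rule = []                                    # rule with an empty left-hand side
            covered = list(E)
            while True:
                p = sum(1 for i in covered if i["class"] == target)
                t = len(covered)
                if p == t:                               # p/t = 1.0: rule is "perfect"
                    break
                used = {a for a, _ in rule}
                candidates = {}                          # (attribute, value) -> (t, p)
                for inst in covered:
                    for a, v in inst.items():
                        if a == "class" or a in used:
                            continue
                        ct, cp = candidates.get((a, v), (0, 0))
                        candidates[(a, v)] = (ct + 1, cp + (inst["class"] == target))
                if not candidates:                       # no more attributes to add
                    break
                # select A = v maximizing accuracy p/t, break ties by largest p
                (a, v), _ = max(candidates.items(),
                                key=lambda kv: (kv[1][1] / kv[1][0], kv[1][1]))
                rule.append((a, v))
                covered = [i for i in covered if i[a] == v]
            rules.append(rule)
            E = [i for i in E if not covers(rule, i)]    # remove covered instances from E
        rule_sets[target] = rules
    return rule_sets
```

Run on the contact lens data, a loop like this should derive rules of the kind shown on the previous slides for the "hard" class.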

15 Rules vs. Decision Lists
PRISM with the outer loop removed generates a decision list for one class
  Subsequent rules are designed for examples not covered by previous rules
  But: order doesn't matter, because all rules predict the same class
The outer loop considers all classes separately
  No order dependence is implied
Problems: overlapping rules; no guarantee that all new examples are covered, so a default rule is required
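A minimal sketch of how the per-class rule sets could be applied at prediction time, with a default rule for uncovered instances; the rule representation matches the PRISM sketch above, and taking the first matching class is just one simple way to resolve overlapping rules, not the book's prescription.

```python
# Apply per-class rule sets with a default rule for instances no rule covers.
# rule_sets: {class_label: [rule, ...]}, where a rule is a list of (attr, value).

def classify(rule_sets, inst, default):
    for label, rules in rule_sets.items():
        for rule in rules:
            if all(inst[a] == v for a, v in rule):
                return label        # rules within a class are unordered; any match suffices
    return default                  # uncovered instance: fall back to the default rule
```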

16 Separate and Conquer
Methods like PRISM (for dealing with one class) are separate-and-conquer algorithms:
  First, identify a useful rule
  Then, separate out all the instances it covers
  Finally, "conquer" the remaining instances
Difference from divide-and-conquer methods: the subset covered by a rule doesn't need to be explored any further
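For comparison with the PRISM sketch earlier, the separate-and-conquer loop for a single class can be written as the short skeleton below; learn_one_rule stands for any rule learner (e.g. the inner loop of the PRISM sketch) and covers is the same predicate used there, so this is purely illustrative.

```python
# Bare separate-and-conquer skeleton for one class. learn_one_rule(instances,
# target) and covers(rule, inst) are assumed helpers, not defined here.

def separate_and_conquer(instances, target, learn_one_rule, covers):
    rules = []
    remaining = list(instances)
    while any(i["class"] == target for i in remaining):            # keep conquering what remains
        rule = learn_one_rule(remaining, target)                   # identify a useful rule
        rules.append(rule)
        remaining = [i for i in remaining if not covers(rule, i)]  # separate out covered instances
    return rules
```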

17 Student Questions
Student Q: What is an alternative approach to divide-and-conquer classification methods?
Student Q: Since rule generation is usually more understandable to humans trying to extrapolate from the data, is rule generation considered good at describing the patterns in the data compared with the algorithms we discussed previously?
Student Q: How do you evaluate the error rate of a rule on a test or validation dataset and decide whether it's good enough to keep?
Student Q: When pruning a decision tree, does it make sense to prune based on the lowest p/t?
Student Q: The rules derived from a decision tree may be much more numerous than necessary, may contain redundant terms, and may not provide easily understandable classification rules if the tree is too large. Is there a way to deal with this large-tree dilemma?
Student Q: Is a rule-generating method a potentially better option than a decision tree when looking for a minority class?
Student Q: Section 6.2 talks a lot about how post-pruning is better than pre-pruning for rule generalization, but makes a point of bringing forth the flaws in each method. It ends with only one method truly working, but being overly complex. Couldn't a form of pre-pruning be used based on information gain? Set a threshold for the lowest acceptable information gain and then generate rules while leaving out those below the threshold?

