Data Quality Class 9. Rule Discovery Decision and Classification Trees Association Rules.


1 Data Quality Class 9

2 Rule Discovery: Decision and Classification Trees; Association Rules

3 Decision and Classification Trees
Each node in the tree represents a question. The decision as to which path to take from that node depends on the answer to the question. At each step along the path from the root of the tree to the leaves, the set of records that conform to the answers along the way grows smaller.

4 Decision and Classification – 2
At each node in the tree, we have a representative set of records that conform to the answers to the questions along that path. Each node in the tree represents a segmenting question, which subdivides the current representative set into two smaller segments. Every path is unique. Each node in the tree also represents the expression of a rule.
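
To illustrate how every root-to-leaf path expresses one rule, here is a minimal Python sketch. The `Node` class, the function `paths_to_rules`, the questions, and the leaf labels A, B, C are all hypothetical and chosen only for illustration; they are not taken from the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    # An internal node carries a yes/no question; a leaf carries only a label.
    question: Optional[str] = None
    yes: Optional["Node"] = None
    no: Optional["Node"] = None
    label: Optional[str] = None


def paths_to_rules(node, conditions=()):
    """Walk the tree and emit one rule per root-to-leaf path."""
    if node.question is None:  # leaf: the conditions collected so far form the rule
        lhs = " AND ".join(conditions) or "TRUE"
        return [f"IF {lhs} THEN {node.label}"]
    yes_rules = paths_to_rules(node.yes, conditions + (node.question,))
    no_rules = paths_to_rules(node.no, conditions + (f"NOT ({node.question})",))
    return yes_rules + no_rules


# Hypothetical two-question tree with made-up leaf labels.
tree = Node(
    "monthly_bill > 100",
    yes=Node("PayPerViews >= 5", yes=Node(label="A"), no=Node(label="B")),
    no=Node(label="C"),
)
rules = paths_to_rules(tree)
for rule in rules:
    print(rule)
```

Because each leaf is reached by exactly one sequence of answers, the number of rules equals the number of leaves and no two rules are identical.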

5 Example

6 Decision and Classification – 3

7 CART
CART stands for Classification and Regression Tree. Grab a training set. Subselect some records that we know already share some attribute properties in common. All other data attributes become independent variables. The results of the decision tree process are to be applied to the entire data set at a later date.

8 CART 2
Decide which of the independent variables is the best for splitting the records. The choice for the next split is based on choosing the criteria that divide the records into sets where, in each set, a single characteristic predominates. Evaluate the possible ways to split based on each independent variable, measuring how good that split will be.

9 Selection Heuristics
Gini: favors the split that best separates one class from the rest, with the goal of isolating the records of that class from the other records.
Twoing: tries to distribute the records evenly at each split opportunity.
There are other heuristics.
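
As a rough illustration of how such heuristics can be scored, the sketch below uses the commonly cited formulas: Gini impurity as 1 minus the sum of squared class proportions, and the twoing value as (pL · pR / 4) times the squared sum of absolute differences in class proportions between the two sides. The slides do not give formulas, so treat these as an assumption; the function names are hypothetical.

```python
from collections import Counter


def class_proportions(labels):
    """Fraction of records in each class."""
    n = len(labels)
    return {c: k / n for c, k in Counter(labels).items()}


def gini_impurity(labels):
    """0.0 when one class fills the set; larger when classes are mixed."""
    return 1.0 - sum(p * p for p in class_proportions(labels).values())


def gini_split_score(left, right):
    """Size-weighted impurity of the two sides of a split (lower is better)."""
    n = len(left) + len(right)
    return (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / n


def twoing_score(left, right):
    """Twoing value of a split (higher is better); rewards balanced, distinct sides."""
    n = len(left) + len(right)
    p_left, p_right = len(left) / n, len(right) / n
    cl, cr = class_proportions(left), class_proportions(right)
    diff = sum(abs(cl.get(c, 0.0) - cr.get(c, 0.0)) for c in set(cl) | set(cr))
    return p_left * p_right / 4 * diff ** 2


mixed = ["a", "a", "b", "b"]
print(gini_impurity(mixed))                      # impurity before splitting
print(gini_split_score(["a", "a"], ["b", "b"]))  # a perfect split
print(twoing_score(["a", "a"], ["b", "b"]))
```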

10 CART 3
The complete tree is built by recursively splitting the data at each decision point in the tree. At each step, if we find that for a certain attribute all values are the same, we eliminate that attribute from future consideration. When we reach a point where no appropriate split can be found, we determine that node to be a leaf node. When the tree is complete, the splitting properties at each internal node can be evaluated and assigned some meaning.
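
The recursive procedure can be sketched in Python as follows. This is a simplified illustration, not the full CART method: it assumes records are dictionaries, considers only equality questions of the form `attribute == value`, and scores splits with size-weighted Gini impurity. The names `build_tree` and `best_split` and the sample data are hypothetical.

```python
from collections import Counter


def gini(rows, target):
    n = len(rows)
    counts = Counter(r[target] for r in rows)
    return 1.0 - sum((k / n) ** 2 for k in counts.values())


def best_split(rows, target, attributes):
    """Return (score, attr, value, yes_rows, no_rows), or None if nothing helps."""
    best = None
    parent = gini(rows, target)
    for attr in attributes:
        for value in sorted({r[attr] for r in rows}):
            yes = [r for r in rows if r[attr] == value]
            no = [r for r in rows if r[attr] != value]
            if not yes or not no:
                continue
            score = (len(yes) * gini(yes, target) + len(no) * gini(no, target)) / len(rows)
            # Keep a split only if it improves on the unsplit node.
            if score < parent and (best is None or score < best[0]):
                best = (score, attr, value, yes, no)
    return best


def build_tree(rows, target, attributes):
    # Attributes whose values are all the same are dropped from consideration.
    usable = [a for a in attributes if len({r[a] for r in rows}) > 1]
    split = best_split(rows, target, usable)
    if split is None:  # no appropriate split: this node becomes a leaf
        majority = Counter(r[target] for r in rows).most_common(1)[0][0]
        return {"leaf": majority, "size": len(rows)}
    _, attr, value, yes, no = split
    return {
        "question": f"{attr} == {value!r}",
        "yes": build_tree(yes, target, usable),
        "no": build_tree(no, target, usable),
    }


rows = [
    {"plan": "basic", "region": "east", "churn": "no"},
    {"plan": "basic", "region": "west", "churn": "no"},
    {"plan": "premium", "region": "east", "churn": "yes"},
    {"plan": "premium", "region": "west", "churn": "yes"},
]
tree = build_tree(rows, "churn", ["plan", "region"])
print(tree)
```

In this toy data set `plan` separates the classes completely, so the tree asks a single question and `region` never appears in it.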

11 Example 2

12 Rules
If (monthly_bill > 100) AND (PayPerViews < 2)
If (monthly_bill > 100) AND (PayPerViews >= 2) AND (PayPerViews < 5)
If (monthly_bill > 100) AND (PayPerViews >= 5)
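
As a small sketch, the three conditions can be written as Python predicates that assign a record to a segment. The segment names are hypothetical, and the code assumes the middle rule is meant to include a PayPerViews count of exactly 2, so that the three conditions partition all records with a monthly bill above 100.

```python
# Each rule is a predicate over a record (a dict of attribute values).
RULES = {
    "segment_1": lambda r: r["monthly_bill"] > 100 and r["PayPerViews"] < 2,
    "segment_2": lambda r: r["monthly_bill"] > 100 and 2 <= r["PayPerViews"] < 5,
    "segment_3": lambda r: r["monthly_bill"] > 100 and r["PayPerViews"] >= 5,
}


def segments_for(record):
    """Names of all rules whose condition the record satisfies."""
    return [name for name, rule in RULES.items() if rule(record)]


print(segments_for({"monthly_bill": 120, "PayPerViews": 2}))
print(segments_for({"monthly_bill": 50, "PayPerViews": 7}))
```

A record with a monthly bill of 100 or less matches none of the three rules, because they all come from the same branch of the tree.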

13 Association Rules
Rules of the form X → Y, where X is a set of (attribute, value) pairs and Y is a set of (attribute, value) pairs. An example is “94% of the customers that purchase tortilla chips and cheese also purchase salsa”. This can be used for many application domains, such as market basket analysis. It can also be used to discover data quality rules.

14 Association Rules 2
Formally:
– Let D be a database of records
– Each record R in D contains a set of (attribute, value) pairs (each pair is also called an item)
– An itemset X is a subset of the (attribute, value) pairs of a record R (i.e., X ⊆ R)
– An association rule is an implication of the form X → Y, where X and Y are both itemsets and share no attributes
– The rule holds with confidence c if c% of the records that contain X also contain Y
– The rule has support s if s% of the records in D contain both X and Y (i.e., contain X ∪ Y)

15 Association Rules 3
Confidence is the percentage of the records containing X in which the rule holds. Support is the percentage of all records in which X and Y appear together. Association rules describe a relation imposed on individual values that appear in the data. Association rules with high confidence are likely to imply generalities about the data. We can infer data quality rules from the discovery of association rules.
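
The two measures can be computed directly, as in the sketch below. It assumes records are dictionaries of attribute values and follows the common convention that support counts the records containing every item of both X and Y. The function names and the five sample baskets are made up for illustration.

```python
def contains(record, itemset):
    """True if the record has every (attribute, value) pair in the itemset."""
    return all(record.get(attr) == value for attr, value in itemset.items())


def support(records, x, y):
    """Fraction of all records that contain both X and Y."""
    both = sum(1 for r in records if contains(r, x) and contains(r, y))
    return both / len(records)


def confidence(records, x, y):
    """Fraction of the records containing X that also contain Y."""
    with_x = [r for r in records if contains(r, x)]
    if not with_x:
        return 0.0
    return sum(1 for r in with_x if contains(r, y)) / len(with_x)


baskets = [
    {"chips": "yes", "cheese": "yes", "salsa": "yes"},
    {"chips": "yes", "cheese": "yes", "salsa": "yes"},
    {"chips": "yes", "cheese": "yes", "salsa": "no"},
    {"chips": "no", "cheese": "yes", "salsa": "yes"},
    {"chips": "no", "cheese": "no", "salsa": "no"},
]
x = {"chips": "yes", "cheese": "yes"}
y = {"salsa": "yes"}
print(support(baskets, x, y))     # 2 of 5 baskets contain all three items
print(confidence(baskets, x, y))  # 2 of the 3 baskets with chips and cheese
```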

16 Association Rules 4
Example:
– (CustomerType == Business) AND (total > $1000) → (managerApproval == “required”) with confidence 85% and support 25%
– This means that in 25% of the records, all of those attributes had the indicated values
– Of the records with (CustomerType == Business) AND (total > $1000), 85% of the time the attribute managerApproval had the value “required”
– We might infer from this a more general rule: business orders greater than $1000 require manager approval
– This calls into question the 15% of the time it didn’t hold true: is it a data quality problem, or is this not a general rule?
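
One way to follow up on the records where the rule fails is to list them for review. The sketch below is a hypothetical illustration: the function `rule_exceptions`, the field names, and the four sample orders are assumptions, and the rule is expressed as two predicates so that the range condition on the total can be tested.

```python
def rule_exceptions(records, antecedent, consequent):
    """Return (confidence, exceptions) for the rule antecedent -> consequent."""
    matching = [r for r in records if antecedent(r)]
    exceptions = [r for r in matching if not consequent(r)]
    conf = 1 - len(exceptions) / len(matching) if matching else 0.0
    return conf, exceptions


orders = [
    {"id": 1, "CustomerType": "Business", "total": 1500, "managerApproval": "required"},
    {"id": 2, "CustomerType": "Business", "total": 2200, "managerApproval": "required"},
    {"id": 3, "CustomerType": "Business", "total": 1800, "managerApproval": "none"},
    {"id": 4, "CustomerType": "Business", "total": 4000, "managerApproval": "required"},
    {"id": 5, "CustomerType": "Consumer", "total": 3000, "managerApproval": "none"},
]

conf, suspects = rule_exceptions(
    orders,
    antecedent=lambda r: r["CustomerType"] == "Business" and r["total"] > 1000,
    consequent=lambda r: r["managerApproval"] == "required",
)
print(conf)
print([r["id"] for r in suspects])  # candidates for a data quality review
```

Whether the listed records are errors or legitimate exceptions is a business question; the code only narrows down where to look.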

17 Association Rules 5
We can set some minimum support and minimum confidence levels.
Definitions:
– L_k is the set of large itemsets (itemsets that meet the minimum support) having k items
– C_k is the set of candidate itemsets having k items

18 Association Rules Algorithm
L_1 = { large itemsets with 1 item }
for (k = 2; L_{k-1} not empty; k++) do
    C_k = generate_new_candidates(L_{k-1})
    forall records R in D do
        C_R = subset(C_k, R)
        forall candidates c in C_R do
            c.count++
        end
    end
    L_k = { c in C_k | c.count >= minimum support }
end
answer = union of all L_k

19 Candidate Generation and Subset
Candidate generation takes the set of all large itemsets of size (k – 1). First, it joins L_{k-1} with L_{k-1}, pairing itemsets that share (k – 2) items, to get a superset of the set of candidates. A candidate is then pruned if any of its subsets of size (k – 1) does not have minimum support (i.e., the subset is not in L_{k-1}). The subset operation takes a record and finds all candidate itemsets of iteration k contained within that record.
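
The level-wise algorithm with its join and prune steps can be sketched in Python as follows. This is a minimal, unoptimized illustration under stated assumptions: each record is a set of items (plain strings here, though (attribute, value) tuples would work equally well), minimum support is an absolute record count, and an itemset is kept when its count is at least that minimum. The function names and the sample baskets are hypothetical.

```python
from itertools import combinations


def generate_candidates(prev_large, k):
    """Join large (k-1)-itemsets sharing k-2 items, then prune by subsets."""
    prev = set(prev_large)
    candidates = set()
    for a in prev:
        for b in prev:
            union = a | b
            if len(union) != k:  # a and b must differ in exactly one item
                continue
            # Prune: every (k-1)-subset of the candidate must itself be large.
            if all(frozenset(s) in prev for s in combinations(union, k - 1)):
                candidates.add(union)
    return candidates


def apriori(records, min_support):
    """Return {itemset: count} for every itemset meeting min_support."""
    counts = {}
    for record in records:
        for item in record:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    large = {s: n for s, n in counts.items() if n >= min_support}
    result = dict(large)
    k = 2
    while large:
        candidates = generate_candidates(large, k)
        counts = {c: 0 for c in candidates}
        for record in records:
            for c in candidates:
                if c <= record:  # the candidate is contained in this record
                    counts[c] += 1
        large = {c: n for c, n in counts.items() if n >= min_support}
        result.update(large)
        k += 1
    return result


baskets = [
    {"chips", "cheese", "salsa"},
    {"chips", "cheese", "salsa"},
    {"chips", "cheese"},
    {"cheese", "salsa"},
    {"bread"},
]
large_itemsets = apriori(baskets, min_support=2)
for itemset, count in sorted(large_itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The inner loop over candidates plays the role of the subset operation; a production implementation would typically use a hash tree or similar index instead of scanning every candidate for every record.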

20 More on Association Rules
We can adjust our goals for finding rules by quantizing the values in each attribute.
– In other words, we can assign values of attributes that belong to large ranges into quantized components, making the rule process less cumbersome
We can also use clustering to enhance the association rule algorithm.
– If we don’t know how to quantize to begin with, use clustering for values
Association rules can uncover interesting data quality and business rules.
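
Quantizing a numeric attribute before mining can be as simple as mapping each value to a labelled range. The sketch below uses Python's standard `bisect` module; the boundaries, the labels, and the function name `quantize` are hypothetical choices for illustration.

```python
from bisect import bisect_right


def quantize(value, boundaries, labels):
    """Map a number to the label of the range it falls in.

    boundaries must be sorted, and labels needs one more entry than boundaries.
    """
    return labels[bisect_right(boundaries, value)]


BOUNDARIES = [100, 1000]           # ranges: below 100, 100 up to 999, 1000 and above
LABELS = ["low", "medium", "high"]

order = {"CustomerType": "Business", "total": 1500}
item = ("total", quantize(order["total"], BOUNDARIES, LABELS))
print(item)
```

The resulting (attribute, value) pair can then be treated as an ordinary item, so a range condition such as a high order total becomes a single discrete value for the rule discovery step.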

