
1 November 3, 2015 | Data Mining: Concepts and Techniques
Chapter 5: Mining Frequent Patterns, Association and Correlations
Basic concepts and a road map
Efficient and scalable frequent itemset mining methods
Mining various kinds of association rules
From association mining to correlation analysis
Constraint-based association mining
Summary

2 Mining Various Kinds of Association Rules
Mining multilevel association rules
Mining multidimensional association rules
Mining quantitative association rules
Mining interesting correlation patterns

3 Mining Multilevel Association Rules
Association rules generated from mining data at multiple levels of abstraction are called multiple-level or multilevel association rules.
Multilevel association rules can be mined efficiently using concept hierarchies under a support-confidence framework.

4 Concept Hierarchy
A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts.
Data can be generalized by replacing low-level concepts within the data by their higher-level concepts, or ancestors, from a concept hierarchy.

5 Approaches to Mining Multilevel Association Rules
Using uniform minimum support for all levels (referred to as uniform support): the same minimum support threshold is used when mining at each level of abstraction.

6 …
Using reduced minimum support at lower levels (referred to as reduced support): each level of abstraction has its own minimum support threshold; the deeper the level of abstraction, the smaller the corresponding threshold.

7 …
Using item- or group-based minimum support (referred to as group-based support): some groups are more important than others, so it is sometimes more desirable to set up user-specific, item-based, or group-based minimum support thresholds when mining multilevel rules.
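The uniform- and reduced-support strategies above can be sketched in a few lines. The transactions, item names, and hierarchy below are illustrative assumptions, not data from the slides:

```python
from collections import Counter

# Toy data (illustrative): each transaction holds leaf-level items;
# `parent` maps each leaf item to its higher-level concept.
transactions = [
    {"2% milk", "wheat bread"},
    {"skim milk", "white bread"},
    {"2% milk", "white bread"},
    {"skim milk", "wheat bread"},
    {"whole milk"},
]
parent = {"2% milk": "milk", "skim milk": "milk", "whole milk": "milk",
          "wheat bread": "bread", "white bread": "bread"}

def frequent_items(itemset_list, min_sup):
    """Return the items whose support meets min_sup at one abstraction level."""
    counts = Counter()
    for t in itemset_list:
        counts.update(t)
    n = len(itemset_list)
    return {i: c / n for i, c in counts.items() if c / n >= min_sup}

# Generalize transactions to level 1 by replacing each item with its ancestor.
level1 = [{parent[i] for i in t} for t in transactions]

# Uniform support: the same threshold at every level.
print(frequent_items(level1, 0.5))        # high-level concepts pass
print(frequent_items(transactions, 0.5))  # no leaf item reaches 50%

# Reduced support: a lower threshold at the deeper level.
print(frequent_items(transactions, 0.25))
```

With a uniform 50% threshold only the generalized concepts survive; lowering the threshold at the leaf level recovers the specialized items, which is exactly the trade-off the two strategies make.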

8 Multilevel Association: Redundancy Filtering
Concept hierarchies are useful in data mining since they permit the discovery of knowledge at different levels of abstraction, such as multilevel association rules.
Some rules may be redundant due to "ancestor" relationships between items. Example:
milk => wheat bread [support = 8%, confidence = 70%]
2% milk => wheat bread [support = 2%, confidence = 72%]
The first rule is an ancestor of the second rule. A rule is redundant if its support is close to the "expected" value, based on the rule's ancestor.

9 Checking for Redundant Multilevel Association Rules
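The redundancy test from slide 8 can be sketched as follows. The 25% tolerance and the assumption that 2% milk accounts for a quarter of all milk purchases are illustrative choices, not values from the slides:

```python
def expected_support(ancestor_sup, descendant_fractions):
    """Expected support of a descendant rule: the ancestor rule's support
    scaled by the fraction each specialized item contributes to its
    generalized parent."""
    exp = ancestor_sup
    for frac in descendant_fractions:
        exp *= frac
    return exp

def is_redundant(rule_sup, ancestor_sup, descendant_fractions, tol=0.25):
    """Flag a rule whose support is within `tol` (relative) of the
    value expected from its ancestor rule."""
    exp = expected_support(ancestor_sup, descendant_fractions)
    return abs(rule_sup - exp) / exp <= tol

# "milk => wheat bread" has 8% support; assume (for illustration) that
# 2% milk makes up a quarter of milk purchases.
print(is_redundant(rule_sup=0.02, ancestor_sup=0.08,
                   descendant_fractions=[0.25]))   # redundant
print(is_redundant(rule_sup=0.06, ancestor_sup=0.08,
                   descendant_fractions=[0.25]))   # not redundant
```

The 2% observed support matches the 8% x 1/4 = 2% expectation exactly, so the specialized rule carries no new information beyond its ancestor.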

10 Mining Multidimensional Association Rules
Single-dimensional (intra-dimensional) rules: buys(X, "milk") => buys(X, "bread")
Multidimensional rules involve two or more dimensions or predicates:
Inter-dimension association rules (no repeated predicates): age(X, "19-25") ∧ occupation(X, "student") => buys(X, "coke")
Hybrid-dimension association rules (repeated predicates): age(X, "19-25") ∧ buys(X, "popcorn") => buys(X, "coke")
Categorical attributes: finite number of possible values, no ordering among values (e.g., occupation, brand, color)
Quantitative attributes: numeric, implicit ordering among values (e.g., age, income)

11 Mining Quantitative Associations
Techniques can be categorized by how numerical (quantitative) attributes, such as age or salary, are treated. Discretization occurs before mining:
1. Static discretization based on predefined concept hierarchies (data cube methods): quantitative attributes are discretized using predefined concept hierarchies.
2. Dynamic discretization based on data distribution (quantitative rules): quantitative attributes are discretized or clustered into "bins" based on the distribution of the data.

12 Static Discretization of Quantitative Attributes
Quantitative attributes are discretized prior to mining using a concept hierarchy: numeric values are replaced by ranges.
In a relational database, finding all frequent k-predicate sets will require k or k+1 table scans. A k-predicate set is a set containing k conjunctive predicates. Example: the 3-predicate set {age, occupation, buys}.
The data cube is well suited for mining: the cells of an n-dimensional cuboid can be used to store the support counts of the corresponding n-predicate sets, and mining from data cubes can be much faster.
(Lattice of cuboids: (), (age), (income), (buys), (age, income), (age, buys), (income, buys), (age, income, buys).)
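A minimal sketch of static discretization, assuming a hypothetical predefined hierarchy for age (the boundary values are invented for illustration):

```python
import bisect

# Hypothetical predefined concept hierarchy for "age": interior boundaries
# and the range labels they induce.
age_edges = [20, 30, 40, 50, 60]
age_labels = ["<20", "20-29", "30-39", "40-49", "50-59", "60+"]

def discretize_age(age):
    """Replace a numeric age with its predefined range label."""
    return age_labels[bisect.bisect_right(age_edges, age)]

print(discretize_age(34))   # "30-39"
print(discretize_age(19))   # "<20"
```

After this replacement every quantitative attribute behaves like a categorical one, so ordinary frequent-predicate-set mining (or a data cube over the range labels) applies directly.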

13 Mining Quantitative Association Rules
Numeric attributes are dynamically discretized such that the confidence or compactness of the rules mined is maximized.
2-D quantitative association rules: A_quan1 ∧ A_quan2 => A_cat
Example: age(X, "34-35") ∧ income(X, "30-50K") => buys(X, "high resolution TV")
How can we find such rules? Cluster adjacent association rules to form general rules using a 2-D grid.

14 …
Approach used in a system called ARCS (Association Rule Clustering System). Steps:
Binning: a partitioning process in which the intervals are considered "bins". Three common strategies:
Equal-width binning: the interval size of each bin is the same (this is what ARCS employs).
Equal-frequency binning: each bin has approximately the same number of tuples assigned to it.
Clustering-based binning: clustering is performed on the quantitative attribute to group neighboring points into the same bin.
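The first two binning strategies can be sketched as follows (the income values are invented for illustration):

```python
def equal_width_bins(values, k):
    """Equal-width binning: split the value range into k same-size
    intervals; returns the k-1 interior cut points."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def equal_frequency_bins(values, k):
    """Equal-frequency binning: cut points chosen so each bin holds
    roughly the same number of tuples."""
    s = sorted(values)
    n = len(s)
    return [s[i * n // k] for i in range(1, k)]

incomes = [18, 22, 25, 31, 40, 55, 90, 120]
print(equal_width_bins(incomes, 4))       # evenly spaced cut points
print(equal_frequency_bins(incomes, 4))   # cut points follow the data
```

Note how the skewed income distribution pulls the equal-frequency cut points toward the dense low end, while the equal-width cuts ignore the distribution entirely; this difference is why the choice of binning matters for the rules ARCS finds.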

15 …
A 2-D array for each possible bin combination involving both quantitative attributes is created.
Finding frequent predicate sets: the 2-D array containing the count distribution can be scanned to find the frequent predicate sets (those satisfying min_sup).
Clustering the association rules: the strong association rules obtained in the previous step are then mapped to a 2-D grid.

16 2-D Grid

17 Mining Other Interesting Patterns
Flexible support constraints: some items (e.g., diamond) may occur rarely but are valuable; customized min_sup specification and application.
Top-k closed frequent patterns: it is hard to specify min_sup, but top-k with a minimum length is more desirable; dynamically raise min_sup during FP-tree construction and mining, and select the most promising path to mine.

18 Chapter 5: Mining Frequent Patterns, Association and Correlations
Basic concepts and a road map
Efficient and scalable frequent itemset mining methods
Mining various kinds of association rules
From association mining to correlation analysis
Constraint-based association mining
Summary

19 From Association Analysis to Correlation Analysis
The support-confidence framework yields association rules.
A correlation rule is measured not only by its support and confidence but also by the correlation between itemsets A and B.
Lift is a simple correlation measure: lift = P(A ∪ B) / (P(A) P(B)).
The occurrence of itemset A is independent of the occurrence of itemset B if P(A ∪ B) = P(A) P(B); otherwise, itemsets A and B are dependent and correlated as events.
lift = 1: A and B are independent
lift < 1: the occurrence of A is negatively correlated with the occurrence of B
lift > 1: A and B are positively correlated

20 Interestingness Measure: Correlations (Lift)
play basketball => eat cereal [40%, 66.7%] is misleading: the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
play basketball => not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence.
Measure of dependent/correlated events: lift.

            | Basketball | Not basketball | Sum (row)
Cereal      | 2000       | 1750           | 3750
Not cereal  | 1000       | 250            | 1250
Sum (col.)  | 3000       | 2000           | 5000
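Computing lift from the contingency table above reproduces the slide's conclusion:

```python
def lift(n_ab, n_a, n_b, n_total):
    """lift(A => B) = P(A and B) / (P(A) * P(B)), from raw counts."""
    return (n_ab / n_total) / ((n_a / n_total) * (n_b / n_total))

# Counts from the basketball/cereal table.
print(round(lift(2000, 3000, 3750, 5000), 2))  # basketball => cereal
print(round(lift(1000, 3000, 1250, 5000), 2))  # basketball => not cereal
```

The first rule's lift is below 1 (negative correlation) even though its confidence looked high, while the second rule's lift is above 1, which is why the slide calls it the more accurate rule.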

21 Correlation Analysis Using χ²
The χ² statistic compares the observed value in each cell of the contingency table with the expected value derived from the row and column totals.
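A minimal sketch of the χ² computation, applied to the basketball/cereal table from slide 20:

```python
def chi_square(table):
    """Pearson chi-square for a 2-D contingency table (list of rows):
    sum over cells of (observed - expected)^2 / expected."""
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_sums[i] * col_sums[j] / total  # expected count
            chi2 += (obs - exp) ** 2 / exp
    return chi2

# Rows: cereal / not cereal; columns: basketball / not basketball.
print(round(chi_square([[2000, 1750], [1000, 250]]), 2))
```

The statistic comes out far above the independence threshold, confirming that playing basketball and eating cereal are correlated (and, together with lift < 1, negatively so).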

22 Are Lift and χ² Good Measures of Correlation?
"Buy walnuts => buy milk [1%, 80%]" is misleading if 85% of customers buy milk.
Support and confidence are not good measures of correlation, and many alternative interestingness measures have been proposed.

23 …

24 Which Measures Should Be Used?
Lift and χ² are not good measures of correlation in large transactional databases; all_conf or coherence could be good measures.
Both all_conf and coherence have the downward closure property.
The downward closure property of frequent patterns: any subset of a frequent itemset must be frequent. If {beer, diaper, nuts} is frequent, so is {beer, diaper}, i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}.
Efficient algorithms can be derived for mining with such measures.
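A small sketch of all-confidence (all_conf); the toy transaction database is an illustrative assumption:

```python
def all_confidence(itemset, transactions):
    """all_conf(X) = sup(X) / max item support in X. Unlike lift, this
    measure has the downward closure property."""
    n = len(transactions)
    sup_x = sum(itemset <= t for t in transactions) / n
    max_item_sup = max(sum(i in t for t in transactions) / n
                       for i in itemset)
    return sup_x / max_item_sup

# Toy TDB (illustrative).
tdb = [{"beer", "diaper", "nuts"}, {"beer", "diaper"},
       {"beer", "milk"}, {"diaper", "milk"}]
print(all_confidence({"beer", "diaper"}, tdb))
```

Because sup(X) can only shrink and the max item support can only grow as X gains items, all_conf of any superset is bounded by all_conf of the set itself, which is exactly the downward closure property the slide relies on for efficient mining.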

25 …

26 Chapter 5: Mining Frequent Patterns, Association and Correlations
Basic concepts and a road map
Efficient and scalable frequent itemset mining methods
Mining various kinds of association rules
From association mining to correlation analysis
Constraint-based association mining
Summary

27 Constraint-Based (Query-Directed) Mining
Finding all the patterns in a database autonomously? Unrealistic! The patterns could be too many but not focused.
Data mining should be an interactive process: the user directs what is to be mined using a data mining query language (or a graphical user interface).
Constraint-based mining:
User flexibility: the user provides constraints on what is to be mined.
System optimization: the system explores such constraints for efficient mining.

28 Constraints in Data Mining
Knowledge type constraint: classification, association, etc.
Data constraint, using SQL-like queries: e.g., find product pairs sold together in stores in Chicago in Dec. '02.
Dimension/level constraint: levels of the concept hierarchy in relevance to region, price, brand, customer category.
Rule (or pattern) constraint: specifies the form of rules to be mined; constraints are expressed as metarules (rule templates), e.g., small sales (price < $10) triggers big sales (sum > $200).
Interestingness constraint: strong rules, e.g., min_support ≥ 3%, min_confidence ≥ 60%.

29 Constrained Mining
Constrained mining aims at reducing the search space: finding all patterns satisfying the constraints by pushing them into the mining process (constraint-pushing).
Constrained mining vs. query processing in a DBMS: database query processing requires finding all answers; constrained pattern mining shares a similar philosophy with pushing selections deeply into query processing.

30 Rule Constraints
Rule constraints can be classified into the following five categories w.r.t. frequent itemset mining:
Anti-monotonic
Monotonic
Succinct
Convertible
Inconvertible

31 Anti-Monotonicity in Constraint Pushing
Anti-monotonicity: when an itemset S violates the constraint, so does any of its supersets.
sum(S.price) ≤ v is anti-monotone; sum(S.price) ≥ v is not anti-monotone.
Example. C: range(S.profit) ≤ 15 is anti-monotone: itemset ab violates C, and so does every superset of ab.

TDB (min_sup = 2)
TID | Transaction
10  | a, b, c, d, f
20  | b, c, d, f, g, h
30  | a, c, d, e, f
40  | c, e, f, g

Item | Profit
a    | 40
b    | 0
c    | -20
d    | 10
e    | -30
f    | 30
g    | 20
h    | -10
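A quick check of the anti-monotone example, using the profit table from this slide:

```python
# Profit table from the slide.
profit = {"a": 40, "b": 0, "c": -20, "d": 10,
          "e": -30, "f": 30, "g": 20, "h": -10}

def satisfies_range(S, v=15):
    """C: range(S.profit) <= v. Anti-monotone: adding items can only
    widen (never shrink) the range, so every superset of a violating
    itemset also violates C and can be pruned."""
    vals = [profit[i] for i in S]
    return max(vals) - min(vals) <= v

print(satisfies_range({"a", "b"}))        # range 40: violates C
print(satisfies_range({"a", "b", "c"}))   # superset also violates
print(satisfies_range({"d", "g"}))        # range 10: satisfies C
```

Once {a, b} fails, the search never needs to count the support of any superset of {a, b}, which is the pruning that constraint-pushing exploits.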

32 Monotonicity for Constraint Pushing
Monotonicity: when an itemset S satisfies the constraint, so does any of its supersets.
sum(S.price) ≥ v is monotone; min(S.price) ≤ v is monotone.
Example. C: range(S.profit) ≥ 15: itemset ab satisfies C, and so does every superset of ab.
(Same TDB and profit table as slide 31.)

33 Succinctness
Succinctness: given A1, the set of items satisfying a succinctness constraint C, any set S satisfying C is based on A1, i.e., S contains a subset belonging to A1.
Idea: whether an itemset S satisfies constraint C can be determined based on the selection of items, without looking at the transaction database.
min(S.price) ≤ v is succinct; sum(S.price) ≥ v is not succinct.
Optimization: if C is succinct, C is pre-counting pushable.
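A sketch of why a succinct constraint is pre-counting pushable; the price table and threshold are illustrative assumptions:

```python
# Hypothetical price table. C: min(S.price) <= v is succinct, so the
# satisfying itemsets can be characterized from the item list alone.
prices = {"a": 10, "b": 200, "c": 50, "d": 5}
v = 20
cheap = {i for i, p in prices.items() if p <= v}   # A1: items enabling C

def satisfies(S):
    """S satisfies C iff it contains at least one item from A1;
    this is decidable before any database scan."""
    return bool(set(S) & cheap)

print(satisfies({"a", "b"}))   # contains a (price 10 <= 20)
print(satisfies({"b", "c"}))   # min price is 50 > 20
```

Because A1 is known up front, candidate generation can be restricted to itemsets touching A1 before counting begins, rather than filtering results afterward.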

34 The Apriori Algorithm: Example
(Diagram: scanning database D generates candidate sets C1, C2, C3 and frequent itemsets L1, L2, L3.)

35 Naïve Algorithm: Apriori + Constraint
(Diagram: the same Apriori run over database D, with the constraint sum(S.price) < 5 applied to the mined itemsets.)

36 The Constrained Apriori Algorithm: Push an Anti-monotone Constraint Deep
(Diagram: the same Apriori run, pruning candidates that violate sum(S.price) < 5 during candidate generation.)

37 The Constrained Apriori Algorithm: Push a Succinct Constraint Deep
(Diagram: the same Apriori run with constraint min(S.price) <= 1; the constraint is not immediately to be used.)

38 Converting "Tough" Constraints
Convert tough constraints into anti-monotone or monotone ones by properly ordering the items.
Examine C: avg(S.profit) ≥ 25. Order items in value-descending order. If an itemset afb violates C, so do afbh and afb*: the constraint becomes anti-monotone!
(Same TDB and profit table as slide 31.)

39 Strongly Convertible Constraints
avg(X) ≥ 25 is convertible anti-monotone w.r.t. the item-value-descending order R: if an itemset af violates a constraint C, so does every itemset with af as a prefix, such as afd.
avg(X) ≥ 25 is convertible monotone w.r.t. the item-value-ascending order R⁻¹: if an itemset d satisfies a constraint C, so do itemsets df and dfa, which have d as a prefix.
Thus, avg(X) ≥ 25 is strongly convertible.
(Same profit table as slide 31.)

40 Can Apriori Handle Convertible Constraints?
A convertible constraint that is neither monotone, anti-monotone, nor succinct cannot be pushed deep into an Apriori mining algorithm: within the level-wise framework, no direct pruning based on the constraint can be made.
Itemset df violates constraint C: avg(X) ≥ 25. Since adf satisfies C, Apriori needs df to assemble adf, so df cannot be pruned.
But the constraint can be pushed into the frequent-pattern growth framework!
(Same item-value table as slide 31.)

41 Mining with Convertible Constraints
C: avg(X) ≥ 25, min_sup = 2.
List the items in every transaction in value-descending order R: C is convertible anti-monotone w.r.t. R.
Scan the TDB once and remove infrequent items: item h is dropped; itemsets a and f are good, …
Projection-based mining: impose an appropriate order on item projection; many tough constraints can be converted into (anti-)monotone ones.

TDB (min_sup = 2), items listed in value-descending order:
TID | Transaction
10  | a, f, d, b, c
20  | f, g, d, b, c
30  | a, f, d, c, e
40  | f, g, h, c, e

Item | Value
a    | 40
f    | 30
g    | 20
d    | 10
b    | 0
h    | -10
c    | -20
e    | -30
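The value-descending ordering trick can be sketched with the slide's item values:

```python
# Item values from the slide. C: avg(S.value) >= 25 is explored in
# value-descending order so that it behaves anti-monotonically on prefixes.
value = {"a": 40, "f": 30, "g": 20, "d": 10,
         "b": 0, "h": -10, "c": -20, "e": -30}
order = sorted(value, key=value.get, reverse=True)

def avg_ok(prefix, v=25):
    """Does the prefix satisfy C: avg(value) >= v?"""
    return sum(value[i] for i in prefix) / len(prefix) >= v

# Extending a prefix in descending order can only lower its average,
# so once a prefix fails C the whole branch can be pruned.
print(order[:4])                            # highest-valued items first
print(avg_ok(["a", "f"]))                   # avg 35: keep exploring
print(avg_ok(["a", "f", "g", "d", "b"]))    # avg 20: prune this branch
```

This is the key property the FP-growth-style algorithm exploits: each projected database is explored along order R, so the convertible constraint prunes exactly like an anti-monotone one.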

42 Handling Multiple Constraints
Different constraints may require different or even conflicting item orderings.
If there exists an order R such that both C1 and C2 are convertible w.r.t. R, then there is no conflict between the two convertible constraints.
If there is a conflict in the item order: try to satisfy one constraint first, then use the order for the other constraint to mine frequent itemsets in the corresponding projected database.

43 What Constraints Are Convertible?

Constraint | Convertible anti-monotone | Convertible monotone | Strongly convertible
avg(S) ≤ v, ≥ v | Yes | Yes | Yes
median(S) ≤ v, ≥ v | Yes | Yes | Yes
sum(S) ≤ v (items could be of any value, v ≥ 0) | Yes | No | No
sum(S) ≤ v (items could be of any value, v ≤ 0) | No | Yes | No
sum(S) ≥ v (items could be of any value, v ≥ 0) | No | Yes | No
sum(S) ≥ v (items could be of any value, v ≤ 0) | Yes | No | No
… | | |

44 Constraint-Based Mining: A General Picture

Constraint | Antimonotone | Monotone | Succinct
v ∈ S | no | yes | yes
S ⊇ V | no | yes | yes
S ⊆ V | yes | no | yes
min(S) ≤ v | no | yes | yes
min(S) ≥ v | yes | no | yes
max(S) ≤ v | yes | no | yes
max(S) ≥ v | no | yes | yes
count(S) ≤ v | yes | no | weakly
count(S) ≥ v | no | yes | weakly
sum(S) ≤ v (a ∈ S, a ≥ 0) | yes | no | no
sum(S) ≥ v (a ∈ S, a ≥ 0) | no | yes | no
range(S) ≤ v | yes | no | no
range(S) ≥ v | no | yes | no
avg(S) θ v, θ ∈ {=, ≤, ≥} | convertible | convertible | no
support(S) ≥ ξ | yes | no | no
support(S) ≤ ξ | no | yes | no

45 A Classification of Constraints
(Diagram of constraint classes: anti-monotone, monotone, succinct, convertible anti-monotone, convertible monotone, strongly convertible, inconvertible.)

46 Chapter 5: Mining Frequent Patterns, Association and Correlations
Basic concepts and a road map
Efficient and scalable frequent itemset mining methods
Mining various kinds of association rules
From association mining to correlation analysis
Constraint-based association mining
Summary

47 Frequent-Pattern Mining: Summary
Frequent pattern mining is an important task in data mining.
Scalable frequent pattern mining methods: Apriori (candidate generation and test), projection-based (FP-growth), and the vertical format approach.
Mining a variety of rules and interesting patterns.
Constraint-based mining.
Mining sequential and structured patterns.
Extensions and applications.

48 Frequent-Pattern Mining: Research Problems
Mining fault-tolerant frequent, sequential, and structured patterns: patterns that allow limited faults (insertion, deletion, mutation).
Mining truly interesting patterns: surprising, novel, concise, …
Application exploration: e.g., DNA sequence analysis and bio-pattern classification.
"Invisible" data mining.

