
Fast Algorithms for Mining Association Rules



Presentation on theme: "Fast Algorithms for Mining Association Rules"— Presentation transcript:

1 Fast Algorithms for Mining Association Rules
Rakesh Agrawal, Ramakrishnan Srikant. Modified from slides by Dan Li. Presenter: Jimmy Jiang. Discussion Lead: Leo Li

2 Origin of Problem
Basket data opens the possibility of customized, data-driven strategies for retail companies. Simple example: beer and diapers, a famous case of market basket analysis: customers who buy diapers tend to also buy beer.
Progress in bar-code technology has made it possible for retail organizations to collect and store massive amounts of sales data, referred to as basket data. Successful organizations view such databases as important pieces of the marketing infrastructure. They are interested in information-driven marketing processes, managed by database technology, that enable marketers to develop and implement customized marketing programs and strategies.
The story is about Wal-Mart, which analyzed customers' buying habits and, surprisingly, found a statistically significant correlation between purchases of beer and purchases of diapers. One explanation: mom called dad to buy diapers on the way home, and he decided to buy a six-pack as well (since he did not have time for the pub). As a result, Wal-Mart put the diapers next to the beer, increasing beer sales, especially to young fathers: a way of using association rules to promote high-profit items.

3 Usage of Data Mining (general)
Tasks: clustering, predictive modeling, dependency modeling, data summarization, change and deviation detection. Association rules fall under dependency modeling.
Applications: market basket analysis, direct/interactive marketing, fraud detection, science, sports statistics, etc.
Data mining is digging through and analyzing a large amount of data and then extracting its meaning. In a narrower sense:
Market basket analysis - understand what products or services are commonly purchased together; e.g., beer and diapers.
Market segmentation - identify the common characteristics of customers who buy the same products from your company.
Customer churn - predict which customers are likely to leave your company and go to a competitor.
Fraud detection - identify which transactions are most likely to be fraudulent.
Direct marketing - identify which prospects should be included in a mailing list to obtain the highest response rate.
Interactive marketing - predict what each individual accessing a Web site is most likely interested in seeing.
Trend analysis - reveal the difference between a typical customer this month and last.

4 Association Rule Example and Formal Definition
An association rule is an implication X → Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅.
I: the total set of items; a transaction is a set of items in I.
X: antecedent; Y: consequent; both are subsets of I.
Confidence c: c% of the transactions that contain X also contain Y.
Support s: s% of all transactions contain X ∪ Y.
While confidence is a measure of the rule's strength, support corresponds to statistical significance.
Example: buys(x, "computer") → buys(x, "financial management software") [0.5%, 60%]
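The two measures can be sketched directly; a minimal toy example with hypothetical transactions:

```python
# Toy transaction database (hypothetical items, not from the paper).
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
]

def support(itemset, db):
    """Fraction of all transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(X, Y, db):
    """Of the transactions containing X, the fraction that also contain Y."""
    return support(X | Y, db) / support(X, db)

print(support({"diapers", "beer"}, transactions))       # 0.5
print(confidence({"diapers"}, {"beer"}, transactions))  # 2/3
```

Here the rule {diapers} → {beer} has support 50% (2 of 4 transactions contain both) and confidence about 67% (2 of the 3 diaper transactions also contain beer).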

5 Mining Association Rules
The problem of finding association rules falls within the purview of database mining, also called knowledge discovery in databases.
Eg1: find the proportion (confidence) of transactions purchasing diapers that also purchase beer.
Eg2: find the proportion (support) of all transactions that purchase both diapers and beer.
Finding association rules is valuable for cross-marketing, catalog design, add-on sales, store layout, and so on.
General goal: finding frequent patterns, associations, correlations, or causal structures among sets of items in transaction databases.
Given a set of transactions D, the problem of mining association rules is to generate all association rules that have support and confidence greater than the user-specified minimum support (minsup) and minimum confidence (minconf), respectively.
Related work includes: induction of classification rules, discovery of causal rules, learning of logical definitions, fitting of functions to data, and clustering.
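Given the frequent itemsets and their support counts, rule generation can be sketched as follows. This is a simplified version that enumerates every antecedent of a frequent itemset; the support counts below are hypothetical:

```python
from itertools import combinations

def gen_rules(freq_itemset, support_of, minconf):
    """Emit rules X -> Y with X ∪ Y = freq_itemset and confidence >= minconf.
    `support_of` maps frozensets to their support counts."""
    l = frozenset(freq_itemset)
    rules = []
    for r in range(1, len(l)):
        for antecedent in map(frozenset, combinations(l, r)):
            conf = support_of[l] / support_of[antecedent]
            if conf >= minconf:
                rules.append((set(antecedent), set(l - antecedent), conf))
    return rules

# Hypothetical support counts for {diapers, beer} and its subsets.
supports = {
    frozenset({"diapers", "beer"}): 50,
    frozenset({"diapers"}): 75,
    frozenset({"beer"}): 60,
}
for X, Y, c in gen_rules({"diapers", "beer"}, supports, minconf=0.6):
    print(X, "->", Y, round(c, 2))
```

With minconf = 0.6 both {diapers} → {beer} (conf 50/75 ≈ 0.67) and {beer} → {diapers} (conf 50/60 ≈ 0.83) qualify.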

6 Problem Decomposition
Mining association rules can be decomposed into two sub-problems:
Find all sets of items (itemsets) that have transaction support above minimum support: Apriori and AprioriTid.
Use the large itemsets to generate the desired rules: not discussed here.

7 Discovering Large Itemsets
Intuition: any subset of a large itemset must be large. ("Large" means having at least minimum support.)
Algorithms for discovering large itemsets make multiple passes over the data. In the first pass, determine which individual items are large. Each subsequent pass has three parts:
Previous large itemsets (seed sets) are used to generate candidate itemsets.
Count the actual support for the candidate itemsets.
Determine which candidates are really large, and pass them to the next pass.
This process continues until no new large itemsets are found. The intuition behind both algorithms is the same.

8 Algorithm Apriori Intuition: every subset of a large itemset must be large. So combine almost-matching pairs of large (k-1)-itemsets, and prune out those with a non-large (k-1)-subset. Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset. Notation: Ck is the candidate itemset set of size k; Lk is the frequent (large) itemset set of size k. Ck is generated by joining Lk-1 with itself. For each transaction t in the database, increment the count of every candidate in Ck that is contained in t; Lk = candidates in Ck with minimum support.
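The pass structure above can be sketched in Python. This is a minimal illustration, not the paper's implementation: the real algorithm uses a hash tree for the subset test, and the candidate generation here is a naive version of apriori-gen:

```python
from collections import defaultdict
from itertools import combinations

def apriori(db, minsup_count):
    """Return all large (frequent) itemsets as {frozenset: support_count}."""
    # First pass: count individual items to get L1.
    counts = defaultdict(int)
    for t in db:
        for item in t:
            counts[frozenset([item])] += 1
    Lk = {s: c for s, c in counts.items() if c >= minsup_count}
    large = dict(Lk)
    k = 2
    while Lk:
        Ck = apriori_gen(set(Lk), k)      # candidates from L(k-1)
        counts = defaultdict(int)
        for t in db:                      # count candidates contained in t
            for c in Ck:
                if c <= t:
                    counts[c] += 1
        Lk = {s: c for s, c in counts.items() if c >= minsup_count}
        large.update(Lk)
        k += 1
    return large

def apriori_gen(prev_large, k):
    """Join L(k-1) with itself, then prune candidates with a non-large subset."""
    candidates = {a | b for a in prev_large for b in prev_large if len(a | b) == k}
    return {c for c in candidates
            if all(frozenset(s) in prev_large for s in combinations(c, k - 1))}

db = [frozenset("ABC"), frozenset("AB"), frozenset("AC"), frozenset("BC")]
print(apriori(db, 2))  # {A,B,C} occurs only once, so it is pruned out
```

With minimum support count 2, all singletons and pairs survive, but {A, B, C} appears in only one transaction and is not large.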

9 Itemset Generation The apriori-gen function has two steps: join and prune.
In the join step, we join the previous large itemsets Lk-1 with themselves.
In the prune step, we delete every candidate whose (k-1)-subsets are not all large.
Subset functions are implemented efficiently with a hash tree.
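The two steps can be sketched as follows, assuming itemsets are stored as lexicographically sorted tuples; the join condition then reduces to matching the first k-2 items:

```python
from itertools import combinations

def apriori_gen(Lk_1, k):
    """Candidate generation in two steps (itemsets are sorted tuples).
    Join: merge pairs of large (k-1)-itemsets agreeing on their first k-2 items.
    Prune: drop candidates having any (k-1)-subset that is not large."""
    prev = sorted(Lk_1)
    candidates = []
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            if p[:k - 2] == q[:k - 2]:                        # join step
                candidates.append(p[:k - 2] + tuple(sorted((p[-1], q[-1]))))
    prev_set = set(prev)
    return [c for c in candidates                             # prune step
            if all(s in prev_set for s in combinations(c, k - 1))]

L3 = [("1","2","3"), ("1","2","4"), ("1","3","4"), ("1","3","5"), ("2","3","4")]
print(apriori_gen(L3, 4))  # only {1,2,3,4} survives
```

The join of {1,3,4} and {1,3,5} produces {1,3,4,5}, but the prune step rejects it because its subset {1,4,5} is not in L3, leaving {1,2,3,4} as the only candidate.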

10 Example for Apriori Items within an itemset are kept in lexicographic order. First pass: generate the large itemsets that each contain only one item, L1. Second pass: generate the candidates C2. To generate C2, we first join L1 with L1, then prune the result by deleting every itemset whose (k-1)-subsets are not all large.

11 Example for Apriori (cont'd)
We scan each transaction in the database and use the itemsets in C2 to build Ct, the set of candidates contained in that transaction. Eventually we obtain the number of itemsets supported by the original transactions.

12 AprioriTid Candidate generation is the same as in Apriori, but the database is used for counting support only in the first pass. More memory is needed: a storage set kept in memory holds the frequent sets per transaction. Idea: reduce transaction database scans. The interesting feature of this algorithm is that the database D is not used for counting support after the first pass.
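A simplified sketch of one AprioriTid counting pass: the per-transaction candidate sets (called C̄k in the paper) replace the raw database. This version assumes a candidate's two generators are the (k-1)-itemsets obtained by dropping its largest or second-largest item, which matches the join step on sorted itemsets; the paper's version stores generator IDs rather than the itemsets themselves:

```python
from collections import defaultdict

def aprioritid_pass(Cbar, Ck, minsup_count):
    """One AprioriTid pass. `Cbar` maps TID -> set of (k-1)-candidates
    (frozensets) supported by that transaction; the raw database is unused."""
    counts = defaultdict(int)
    new_Cbar = {}
    for tid, supported in Cbar.items():
        # A candidate is in this transaction iff both of its generators
        # (drop largest item / drop second-largest item) were supported.
        present = {c for c in Ck
                   if c - frozenset([max(c)]) in supported
                   and c - frozenset([sorted(c)[-2]]) in supported}
        for c in present:
            counts[c] += 1
        if present:                      # transactions supporting nothing drop out
            new_Cbar[tid] = present
    Lk = {c for c in Ck if counts[c] >= minsup_count}
    return Lk, new_Cbar
```

Transactions that support no candidates disappear from C̄k, so the structure shrinks in later passes, which is what makes AprioriTid fast once C̄k fits in memory.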

13 Performance Average size of transactions: 5-20;
average size of maximal potentially large itemsets: 2-6; dataset sizes: 2.4-8.4 MB. (Experiments ran on an IBM RS/6000 530H workstation with a CPU clock rate of 33 MHz and 64 MB of main memory, running AIX. The data resided in the AIX file system and was stored on a 2GB SCSI 3.5" drive with a measured sequential throughput of about 2 MB/second.)

14 Results Apriori/AprioriTid outperform AIS/SETM
(varying the average number of transactions and of maximal potentially large itemsets).
AIS: candidate itemsets are generated and counted on-the-fly as the database is scanned.
SETM: to use the standard SQL join operation for candidate generation, SETM separates candidate generation from counting. It saves a copy of each candidate itemset together with the TID of the generating transaction in a sequential structure. At the end of the pass, the support count of candidate itemsets is determined by sorting and aggregating this sequential structure.

15 Smaller number of candidates
A smaller number of candidates reduces computational cost. Remaining challenges:
Multiple scans of the transaction database
Huge number of candidates
Tedious workload of support counting for candidates

16 Discussion: In 1994, the authors mention that the datasets used in the scale-up experiments were not even 1 GB. Nowadays this hardly counts as scale-up (as in previous discussions, Hive and Dremel can handle TB- and even PB-level data). Do you think the Apriori algorithm is still suitable today? Are there any limitations or issues exposed by the increased data sizes?

17 Improving Apriori's efficiency
AprioriHybrid: use Apriori in the initial passes; heuristically estimate the size of C̄k (the per-transaction candidate sets); switch to AprioriTid when C̄k is expected to fit in memory. The switch takes time, but it is still better in most cases.
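The switching heuristic can be sketched as follows; this is a guess at the estimate the slide alludes to, in which the size of C̄k is approximated by the sum of the candidates' support counts plus one entry per transaction:

```python
def should_switch(candidate_supports, num_transactions, mem_entries, Lk_shrinking):
    """Decide whether to switch from Apriori to AprioriTid (a sketch).
    C̄k holds roughly one entry per (candidate, supporting transaction) pair
    plus one TID entry per transaction; switch only when that estimate fits
    in memory and the number of large itemsets is already declining."""
    estimated_size = sum(candidate_supports.values()) + num_transactions
    return estimated_size < mem_entries and Lk_shrinking

# Hypothetical numbers: 3 supporting transactions for one candidate, 4 TIDs.
print(should_switch({frozenset("ab"): 3}, 4, 100, True))  # True: estimate 7 fits
```

Requiring that Lk is shrinking guards against switching too early, when C̄k would soon grow again.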

18 AprioriHybrid (figure: left, varying minimum support; right, varying number of transactions)

19 Later Work Parallel versions Quantitative association rules
E.g., "10% of married people between age 50 and 60 have at least 2 cars." Online association rules

20 Conclusion Two new algorithms, Apriori and AprioriTid, are discussed.
These algorithms outperform AIS and SETM. Apriori and AprioriTid can be combined into AprioriHybrid. AprioriHybrid matches whichever of Apriori and AprioriTid is better.

21 Discussion This paper has spun off more similar algorithms in the database world than any other data mining paper. Why do you think this is the case? Is it the algorithm? The problem? The approach? Or something else?

