Presentation is loading. Please wait.

Presentation is loading. Please wait.

Data Mining, Frequent-Itemset Mining. Data Mining Some mining problems Find frequent itemsets in "market-basket" data – "50% of the people who buy hot.

Similar presentations


Presentation on theme: "Data Mining, Frequent-Itemset Mining. Data Mining Some mining problems Find frequent itemsets in "market-basket" data – "50% of the people who buy hot."— Presentation transcript:

1 Data Mining, Frequent-Itemset Mining

2 Data Mining Some mining problems Find frequent itemsets in "market-basket" data – "50% of the people who buy hot dogs also buy mustard," Find "similar" items in a large collection. E.g.: – Find documents on the Web that share a significant amount of words – Find books that have been bought by many of the same Amazon customers. Find clusters of data. E.g. – Find clusters of Web pages by the words they use.

3 Frequent-Itemset Mining Market-Basket Model A large set of items, e.g., things sold in a supermarket. A large set of baskets, each of which is a small set of the items, e.g., the things one customer buys on one day. Fundamental problem What sets of items are often bought together? Application If a large number of baskets contain both hot dogs and mustard, we can use this information in several ways. How?

4 Beer and Diapers What’s the explanation here?

5 On-Line Purchases Amazon.com offers several million different items for sale, and has several tens of millions of customers. Baskets = Customers, Items = Books, DVDs, etc. Motivation: Find out what items are bought together. Baskets = Books, DVDs, etc. Items = Customers Motivation: Find out similar customers.

6 Words and Documents Baskets = sentences; Items = words in those sentences. Motivation: Find words that appear together unusually frequently, i.e., linked concepts. Baskets = sentences, Items = documents containing those sentences. Motivation: Items that appear together too often could represent plagiarism.

7 Genes Baskets = people; Items = genes or blood-chemistry factors. Motivation: Detect combinations of genes that result in diabetes

8 Support Support for a set of items (itemset) I = the number of baskets containing all items in I. Given a support threshold s, itemsets that appear in > s baskets are called frequent itemsets.

9 Example: Frequent Itemsets Items={milk, coke, pepsi, beer, juice}. Support = 3 baskets. B 1 = {m, c, b}B 2 = {m, p, j} B 3 = {m, b}B 4 = {c, j} B 5 = {m, p, b}B 6 = {m, c, b, j} B 7 = {c, b, j}B 8 = {b, c} Frequent itemsets: {m}, {c}, {b}, {j},, {b,c}, {c,j}. {m,b}

10 Scale of Problem WalMart sells 100,000 items and can store billions of baskets. The Web has over 100,000,000 words and billions of pages.

11 Association Rules If-then rules about the contents of baskets. {i 1, i 2,…,i k } → j means: “if a basket contains all of i 1,…,i k then it is likely to contain j.” Confidence of this association rule is the probability of j given i 1,…,i k. Example B 1 = {m, c, b}B 2 = {m, p, j} B 3 = {m, b}B 4 = {c, j} B 5 = {m, p, b}B 6 = {m, c, b, j} B 7 = {c, b, j}B 8 = {b, c} An association rule: {m, b} → c. – Confidence = 2/4 = 50%.

12 Interest The interest of an association rule X → Y is the absolute value of the amount by which the confidence differs from the probability of Y being in a given basket. Example B 1 = {m, c, b}B 2 = {m, p, j} B 3 = {m, b}B 4 = {c, j} B 5 = {m, p, b}B 6 = {m, c, b, j} B 7 = {c, b, j}B 8 = {b, c} For association rule {m, b} → c, item c appears in 5/8 of the baskets. Interest = |2/4 - 5/8| = 1/8 --- not very interesting.

13 Finding Association Rules Typical question: – “find all association rules with support ≥ s and confidence ≥ c.” Note: “support” of an association rule is the support of the set of items it mentions. Hard part: finding the high-support (frequent ) itemsets. – Checking the confidence of association rules involving those sets is relatively easy.


Download ppt "Data Mining, Frequent-Itemset Mining. Data Mining Some mining problems Find frequent itemsets in "market-basket" data – "50% of the people who buy hot."

Similar presentations


Ads by Google