Presentation is loading. Please wait.

Presentation is loading. Please wait.

Data Mining, Frequent-Itemset Mining

Similar presentations


Presentation on theme: "Data Mining, Frequent-Itemset Mining"— Presentation transcript:

1 Data Mining, Frequent-Itemset Mining
COMP 451/651 Data Mining, Frequent-Itemset Mining Chapter 7 1

2 Data Mining Discovery of useful, possibly unexpected, patterns in data. Example patterns: "50% of the people who buy hot dogs also buy mustard,“ "these three individual's pattern of credit-card expenditures indicate that they are running an illigal activity." Some mining problems Find frequent itemsets in "market-basket" data Find "similar" items in a large collection. Example applications: Find documents on the Web that share a significant amount of common text or Find books that have been bought by many of the same Amazon customers. Find clusters of data Example application Find clusters of Web pages by the words they use.

3 Frequent-Itemset Mining
The Market-Basket Model A large set of items, e.g., things sold in a supermarket. A large set of baskets, each of which is a small set of the items, e.g., the things one customer buys on one day. Fundamental problem What sets of items are often bought together? Application If a large number of baskets contain both hot dogs and mustard, we can use this information in several ways. How?

4 Hot Dogs and Mustard Apparently, many people walk from where the hot dogs are to where the mustard is. We can put them close together, and put between them other foods that might also be bought with hot dogs and mustard, e.g., ketchup or potato chips. Doing so can generate additional "impulse" sales. The store can run a sale on hot dogs and at the same time raise the price of mustard. People will come to the store for the cheap hot dogs, and many will need mustard too. It is not worth the trouble to go to another store for cheaper mustard, so they buy that too. The store makes back on mustard what it loses on hot dogs, and also gets more customers into the store.

5 Beer and Diapers What’s the explanation here?

6 On-Line Purchases Amazon.com offers several million different items for sale, and has several tens of millions of customers. Basket = Customer, Item = Book, DVD, etc. Motivation: Find out what items are bought together. Basket = Book, DVD, etc. Item = Customer Motivation: Find out similar customers.

7 Words and Documents Baskets = sentences; items = words in those sentences. Lets us find words that appear together unusually frequently, i.e., linked concepts. Baskets = sentences, items = documents containing those sentences. Items that appear together too often could represent plagiarism.

8 Genes Baskets = people; items = genes or blood-chemistry factors.
Has been used to detect combinations of genes that result in diabetes

9 Support Question: find sets of items that appear “frequently” in the baskets. Support for itemset I = the number of baskets containing all items in I. Given a support threshold s, sets of items that appear in > s baskets are called frequent itemsets.

10 Example: Frequent Itemsets
Items={milk, coke, pepsi, beer, juice}. Support = 3 baskets. B1 = {m, c, b} B2 = {m, p, j} B3 = {m, b} B4 = {c, j} B5 = {m, p, b} B6 = {m, c, b, j} B7 = {c, b, j} B8 = {b, c} Frequent itemsets: {m}, {c}, {b}, {j}, {m,b} , {b,c} , {c,j}.

11 Scale of Problem WalMart sells 100,000 items and can store billions of baskets. The Web has over 100,000,000 words and billions of pages.

12 Association Rules If-then rules about the contents of baskets.
{i1, i2,…,ik} → j means: “if a basket contains all of i1,…,ik then it is likely to contain j.” Confidence of this association rule is the probability of j given i1,…,ik. Example B1 = {m, c, b} B2 = {m, p, j} B3 = {m, b} B4 = {c, j} B5 = {m, p, b} B6 = {m, c, b, j} B7 = {c, b, j} B8 = {b, c} An association rule: {m, b} → c. Confidence = 2/4 = 50%.

13 Interest The interest of an association rule X → Y is the absolute value of the amount by which the confidence differs from the probability of Y being in a given basket. Example B1 = {m, c, b} B2 = {m, p, j} B3 = {m, b} B4 = {c, j} B5 = {m, p, b} B6 = {m, c, b, j} B7 = {c, b, j} B8 = {b, c} For association rule {m, b} → c, item c appears in 5/8 of the baskets. Interest = |2/4 - 5/8| = 1/8 --- not very interesting.

14 Finding Association Rules
A typical question: “find all association rules with support ≥ s and confidence ≥ c.” Note: “support” of an association rule is the support of the set of items it mentions. Hard part: finding the high-support (frequent ) itemsets. Checking the confidence of association rules involving those sets is relatively easy.


Download ppt "Data Mining, Frequent-Itemset Mining"

Similar presentations


Ads by Google