Data Mining
Jim: Which cow should I buy?
Jim's cows:

  Name     Age  Milk Avg. (MA)  Rating
  Mona      5         6         Good
  Lisa      6         4         Bad
  Mary      3         8         Good
  Quirri    5         6         Bad
  Paula     6         2         Good
  Abdul     7        10         Bad

Cows on sale:

  Name     Age  Milk Avg. (MA)
  Phil      3         5
  Collins   2         3
  Larry     5         9
  Bird      5         2

Which cow should I buy?
And suppose I know their:
  Behavior
  Preferred mating months
  Milk production
  Nutritional habits
  Immune system data
  …
Now suppose I have 10,000 cows…
"Understanding" data
Looking for patterns is not new:
  Hunters seek patterns in animal migration
  Politicians seek patterns in voting habits
  …
Available data is increasing very fast (exponentially?):
  Greater opportunities to extract valuable information
  But "understanding" the data becomes more difficult
Data Mining
Data Mining: the process of discovering patterns in data, usually stored in a database. The patterns lead to advantages (economic or other).
A very fast-growing area of research, because of:
  Huge databases (Walmart: about 20 million transactions/day)
  Automatic data capture of transactions (bar codes, satellites, scanners, cameras, etc.)
  Large financial advantage
  Evolving analytical methods
Data Mining techniques in some HUJI courses

  Technique                    Course
  Decision Trees               Artificial Intelligence
  EM, Perceptron, SVM, PCA, …  Intro. to Machine Learning; Intro. to Information Processing and Learning
  Neural Networks              Neural Networks 1, 2
  K-Nearest Neighbor           Computational Geometry
Data Mining
Two extremes for the expression of the patterns:
  1. "Black box": "Buy cows Zehava, Petra and Paulina."
  2. "Transparent box" (structural patterns): "Buy cows with age < 3, or cows with calm behavior and > 90 liters of milk production per month."
The weather example

  Outlook   Temp.  Humidity  Windy  Play
  Sunny     Hot    High      False  No
  Sunny     Hot    High      True   No
  Overcast  Hot    High      False  Yes
  Rainy     Mild   High      False  Yes
  Rainy     Cool   Normal    False  Yes
  Rainy     Cool   Normal    True   No
  Overcast  Cool   Normal    True   Yes
  Sunny     Mild   High      False  No
  Sunny     Cool   Normal    False  Yes

Today is overcast, with mild temperature, high humidity, and wind. Will we play?
Questions one can ask
A set of rules learned from this data could be presented as a Decision List:
  If outlook=sunny and humidity=high then play=no
  ElseIf outlook=rainy and windy=true then play=no
  ElseIf outlook=overcast then play=yes
  ElseIf humidity=normal then play=yes
  Else play=yes
This is an example of Classification Rules.
We could also look for Association Rules:
  If temperature=cool then humidity=normal
  If windy=false and play=no then outlook=sunny and humidity=high
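As an aside (not part of the original slides), the decision list above reads naturally as a small top-down function; the names below are illustrative only. A sketch in Python:

  def will_play(outlook, temperature, humidity, windy):
      # Decision list evaluated top-down; the first matching rule wins.
      # (temperature is part of the schema but unused by these rules.)
      if outlook == "sunny" and humidity == "high":
          return "no"
      elif outlook == "rainy" and windy:
          return "no"
      elif outlook == "overcast":
          return "yes"
      elif humidity == "normal":
          return "yes"
      else:
          return "yes"

  # Today: overcast, mild temperature, high humidity, windy.
  print(will_play("overcast", "mild", "high", True))  # -> yes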
Example (cont.)
The previous example is very simplified. Real databases will probably:
  1. Contain numerical values as well.
  2. Contain "noise" and errors (stochastic).
  3. Be a lot larger.
And the analysis we are asked to perform might not be for Association Rules, but rather Decision Trees, Neural Networks, etc.
Caution
David Rhine was a parapsychologist active from the 1930s to the 1950s.
He hypothesized that some people have Extra-Sensory Perception (ESP).
He asked people to say whether 10 hidden cards were red or blue.
He discovered that almost 1 in every 1000 people has ESP!
He told these people that they have ESP and called them in for another test.
He discovered that almost all of them had lost their ESP!
He concluded that you shouldn't tell people they have ESP; it makes them lose it.
[Source: J. Ullman]
Another example
Classic example: a database of purchases in a supermarket.
Such huge DBs, saved over long periods of time, are called Data Warehouses.
It is extremely valuable for the manager of the store to extract Association Rules from the Data Warehouse.
It is even more valuable if this information can be associated with the person buying; hence the club memberships…
Each shopping basket is a list of items that were bought in a single purchase by some customer.
Supermarket example
Beer and diapers were found to often be bought together by men, so they were placed in the same aisle. [Data-mining urban legend]
The Purchases relation

  transid  item
  111      pen
  111      ink
  111      milk
  111      juice
  112      pen
  112      ink
  112      milk
  113      pen
  113      milk
  114      pen
  114      ink
  114      juice

Itemset: a set of items.
Support of an itemset: the fraction of transactions that contain all items in the itemset.
What is the support of:
  1. {pen}?
  2. {pen, ink}?
  3. {pen, juice}?
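A minimal sketch (not from the slides) of how support can be computed over such a relation; the data and helper names below are illustrative, and the later sketches reuse them.

  # The Purchases relation as (transid, item) pairs.
  purchases = [
      (111, "pen"), (111, "ink"), (111, "milk"), (111, "juice"),
      (112, "pen"), (112, "ink"), (112, "milk"),
      (113, "pen"), (113, "milk"),
      (114, "pen"), (114, "ink"), (114, "juice"),
  ]

  # Group the items of each transaction into a basket.
  baskets = {}
  for transid, item in purchases:
      baskets.setdefault(transid, set()).add(item)

  def support(itemset):
      """Fraction of transactions containing every item in `itemset`."""
      hits = sum(1 for items in baskets.values() if itemset <= items)
      return hits / len(baskets)

  print(support({"pen"}))           # 1.0
  print(support({"pen", "ink"}))    # 0.75
  print(support({"pen", "juice"}))  # 0.5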
Frequent itemsets
We would like to find items that are purchased together with high frequency: frequent itemsets.
We look for itemsets whose support > minSupport.
If minSupport is set to 0.7, then the frequent itemsets in our example are: {pen}, {ink}, {milk}, {pen, ink}, {pen, milk}.
The A-Priori property of frequent itemsets: every subset of a frequent itemset is also a frequent itemset.
Algorithm for finding frequent itemsets
Suppose we have n items.
The naive approach: for every subset of items, check whether it is frequent. With 2^n subsets, this is very expensive.
Improvement (based on the A-Priori property): first identify frequent itemsets of size 1, then try to expand them. This greatly reduces the number of candidate frequent itemsets.
A single scan of the table is enough to determine which candidate itemsets are frequent.
The algorithm terminates when no new frequent itemsets are found in an iteration.
Algorithm for finding frequent itemsets

  foreach item, check if it is a frequent itemset
      (i.e., it appears in > minSupport of the transactions)
  k = 1
  repeat
      foreach new frequent itemset I_k with k items:
          generate all itemsets I_{k+1} with k+1 items such that I_k is contained in I_{k+1}
      scan all transactions once and add itemsets that have support > minSupport
      k++
  until no new frequent itemsets are found
The Purchases relation (repeated for reference):

  transid  item
  111      pen
  111      ink
  111      milk
  111      juice
  112      pen
  112      ink
  112      milk
  113      pen
  113      milk
  114      pen
  114      ink
  114      juice
Finding frequent itemsets on the table "Purchases", with minSupport = 0.7:
In the first pass, the following single itemsets are found to be frequent: {pen}, {ink}, {milk}.
Now we generate the candidates for k=2: {pen, ink}, {pen, milk}, {pen, juice}, {ink, milk}, {ink, juice} and {milk, juice}.
By scanning the relation, we determine that the following are frequent: {pen, ink}, {pen, milk}.
Now we generate the candidates for k=3: {pen, ink, milk}, {pen, milk, juice}, {pen, ink, juice}.
By scanning the relation, we determine that none of these are frequent, and the algorithm ends with: { {pen}, {ink}, {milk}, {pen, ink}, {pen, milk} }.
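For what it is worth, the sketch above reproduces this trace on the example data:

  print(frequent_itemsets(baskets, min_support=0.7))
  # {frozenset({'pen'}), frozenset({'ink'}), frozenset({'milk'}),
  #  frozenset({'pen', 'ink'}), frozenset({'pen', 'milk'})}   (set order may vary)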
Algorithm refinement
One important refinement: after the candidate-generation phase, and before the scan of the relation, eliminate every candidate itemset that has a subset which is not frequent. This is justified by the A-Priori property.
In the second iteration, this means we eliminate {pen, juice}, {ink, juice} and {milk, juice} as candidates, since {juice} is not frequent; we only check {pen, ink}, {pen, milk} and {ink, milk}.
In the third iteration, only {pen, ink, milk} is generated as a candidate, and it is eliminated before the scan because {ink, milk} is not frequent. So we do not perform the third scan of the relation at all.
More complex algorithms use the same tools: iterative generation and testing of candidate itemsets.
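This pruning step could be slotted into the earlier sketch roughly as follows (illustrative; `current` holds the frequent itemsets of the previous level):

  def prune(candidates, current):
      # A-Priori pruning: a candidate survives only if every subset obtained by
      # removing one item is itself a frequent itemset of the previous level.
      return {c for c in candidates
              if all(c - {item} in current for item in c)}

  # In frequent_itemsets(), between candidate generation and the scan:
  # candidates = prune(candidates, current)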
Association Rules
Up until now we discussed the identification of frequent itemsets. We now wish to go one step further.
An association rule has the structure {pen} => {ink}, meaning: "if a pen is purchased in a transaction, it is likely that ink will also be purchased in that transaction".
It describes the data in the DB (the past); extrapolation to future transactions should be done with caution.
More formally, an Association Rule is LHS => RHS, where both LHS and RHS are sets of items. It says that if every item in LHS was purchased in a transaction, it is likely that the items in RHS were purchased as well.
Measures for Association Rules
1. Support of "LHS => RHS" is the support of the itemset (LHS U RHS), in other words the fraction of transactions that contain all items in (LHS U RHS).
2. Confidence of "LHS => RHS": consider all transactions that contain all items in LHS. The fraction of these transactions that also contain all items in RHS is the confidence of the rule:
     Confidence(LHS => RHS) = S(LHS U RHS) / S(LHS)
The confidence of a rule is an indication of the strength of the rule.
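Both measures translate directly into code; a small sketch reusing the illustrative `support` helper from the Purchases example:

  def rule_support(lhs, rhs):
      """Support of LHS => RHS: support of the combined itemset."""
      return support(lhs | rhs)

  def rule_confidence(lhs, rhs):
      """Confidence of LHS => RHS: S(LHS u RHS) / S(LHS)."""
      return support(lhs | rhs) / support(lhs)

  print(rule_support({"milk"}, {"pen"}))     # 0.75
  print(rule_confidence({"milk"}, {"pen"}))  # 1.0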
  transid  item
  111      pen
  111      ink
  111      milk
  111      juice
  112      pen
  112      ink
  112      milk
  113      pen
  113      milk
  114      pen
  114      ink
  114      juice

What is the support of {pen} => {ink}? And the confidence?
What is the support of {ink} => {pen}? And the confidence?
Finding Association Rules
A user can ask for rules with minimum support minSup and minimum confidence minConf.
First, all frequent itemsets with support > minSup are computed with the previous algorithm.
Second, rules are generated from the frequent itemsets and checked against minConf.
Finding Association Rules
Find all frequent itemsets using the previous algorithm.
For each frequent itemset X with support S(X):
  For each division of X into two itemsets, LHS and RHS:
    calculate the confidence of LHS => RHS, which is S(X) / S(LHS).
We already computed S(LHS) in the previous algorithm (LHS is frequent because X is frequent).
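A sketch of this rule-generation step (illustrative; it builds on the earlier `support` and `frequent_itemsets` sketches and recomputes supports rather than caching them the way a real implementation would):

  from itertools import combinations

  def association_rules(frequent, min_conf):
      """Return (LHS, RHS, confidence) for every qualifying split of each frequent itemset."""
      rules = []
      for x in frequent:
          if len(x) < 2:
              continue
          for r in range(1, len(x)):
              for lhs in map(frozenset, combinations(x, r)):
                  rhs = x - lhs
                  conf = support(x) / support(lhs)
                  if conf >= min_conf:
                      rules.append((set(lhs), set(rhs), conf))
      return rules

  for lhs, rhs, conf in association_rules(frequent_itemsets(baskets, 0.7), min_conf=0.9):
      print(lhs, "=>", rhs, round(conf, 2))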
Generalized association rules

  transid  date     item
  111      1.5.99   pen
  111      1.5.99   ink
  111      1.5.99   milk
  111      1.5.99   juice
  112      10.5.99  pen
  112      10.5.99  ink
  112      10.5.99  milk
  113      15.5.99  pen
  113      15.5.99  milk
  114      1.6.99   pen
  114      1.6.99   ink
  114      1.6.99   juice

We would like to know whether the rule {pen} => {juice} behaves differently on the first day of the month compared to other days. How? What are its support and confidence in general? And on the first days of the month?
Generalized association rules
By specifying different attributes to group by (date, in the last example), we can come up with interesting rules which we would otherwise miss.
Another example would be to group by location and check whether the same rules hold for customers from Jerusalem compared to Tel Aviv.
By comparing the support and confidence of the rules under different conditions, we can observe differences in the data.
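One way this grouping could look in code (a sketch only; the per-transaction dates are those from the table two slides back, and the helper names are made up):

  # transid -> (date, basket of items)
  dated = {
      111: ("1.5.99",  {"pen", "ink", "milk", "juice"}),
      112: ("10.5.99", {"pen", "ink", "milk"}),
      113: ("15.5.99", {"pen", "milk"}),
      114: ("1.6.99",  {"pen", "ink", "juice"}),
  }

  def rule_stats(lhs, rhs, baskets):
      """Support and confidence of LHS => RHS over the given group of baskets."""
      both = sum(1 for items in baskets.values() if lhs | rhs <= items)
      with_lhs = sum(1 for items in baskets.values() if lhs <= items)
      return both / len(baskets), (both / with_lhs if with_lhs else None)

  # Group the transactions: first day of a month vs. all other days.
  first_day = {t: items for t, (date, items) in dated.items() if date.startswith("1.")}
  other_days = {t: items for t, (date, items) in dated.items() if not date.startswith("1.")}

  print(rule_stats({"pen"}, {"juice"}, first_day))   # support, confidence on the 1st of the month
  print(rule_stats({"pen"}, {"juice"}, other_days))  # support, confidence on other days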
Caution in prediction
When we find a pattern in the data, we wish to use it for prediction; in many cases that is the whole point.
However, we have to be cautious about this.
For example, suppose {pen} => {ink} has high support and confidence. We might give a discount on pens in order to increase sales of pens, and therefore also sales of ink.
However, this assumes a causal link between {pen} and {ink}.
Caution in prediction
Suppose pens and pencils are often sold together.
We would then also get the rule {pencil} => {ink} with high support and confidence.
However, it is clear there is no causal link between buying pencils and buying ink.
If we promoted pencils, it would not cause an increase in the sales of ink, despite the high support and confidence.
The chance of inferring "wrong" rules (rules which are not causal links) decreases as the DB size increases, but we should keep in mind that such rules do come up.
Therefore, the generated rules are only a good starting point for identifying causal links.