1 Statistics 202: Statistical Aspects of Data Mining Professor David Mease Tuesday, Thursday 9:00-10:15 AM Terman 156 Lecture 8 = Finish chapter 6 Agenda:

Slides:



Advertisements
Similar presentations
1 Senn, Information Technology, 3 rd Edition © 2004 Pearson Prentice Hall James A. Senns Information Technology, 3 rd Edition Chapter 7 Enterprise Databases.
Advertisements

1 Statistics 202: Statistical Aspects of Data Mining Professor David Mease Tuesday, Thursday 9:00-10:15 AM Terman 156 Lecture 5 = More of chapter 3 Agenda:
Chapter 1 The Study of Body Function Image PowerPoint
STATISTICS HYPOTHESES TEST (II) One-sample tests on the mean and variance Professor Ke-Sheng Cheng Department of Bioenvironmental Systems Engineering National.
1 G601, IO I Eric Rasmusen, 31 August – Info This is for one 75 minute session on the chapter. The big idea is bayesian equilibrium.
Jeopardy Q 1 Q 6 Q 11 Q 16 Q 21 Q 2 Q 7 Q 12 Q 17 Q 22 Q 3 Q 8 Q 13
FACTORING ax2 + bx + c Think “unfoil” Work down, Show all steps.
Year 6 mental test 10 second questions
- A Powerful Computing Technology Department of Computer Science Wayne State University 1.
REVIEW: Arthropod ID. 1. Name the subphylum. 2. Name the subphylum. 3. Name the order.
Configuration management
1 Undirected Breadth First Search F A BCG DE H 2 F A BCG DE H Queue: A get Undiscovered Fringe Finished Active 0 distance from A visit(A)
Association Rule Mining
VOORBLAD.
1 Breadth First Search s s Undiscovered Discovered Finished Queue: s Top of queue 2 1 Shortest path from s.
Factor P 16 8(8-5ab) 4(d² + 4) 3rs(2r – s) 15cd(1 + 2cd) 8(4a² + 3b²)
CMPT 275 Software Engineering
© 2012 National Heart Foundation of Australia. Slide 2.
Understanding Generalist Practice, 5e, Kirst-Ashman/Hull
Chapter 5 Test Review Sections 5-1 through 5-4.
GG Consulting, LLC I-SUITE. Source: TEA SHARS Frequently asked questions 2.
25 seconds left…...
Januar MDMDFSSMDMDFSSS
©Brooks/Cole, 2001 Chapter 12 Derived Types-- Enumerated, Structure and Union.
PSSA Preparation.
Pertemuan XIV FUNGSI MAYOR Assosiation. What Is Association Mining? Association rule mining: –Finding frequent patterns, associations, correlations, or.
Association Rule Mining. 2 The Task Two ways of defining the task General –Input: A collection of instances –Output: rules to predict the values of any.
Data Mining Association Analysis: Basic Concepts and Algorithms
Association Analysis. Association Rule Mining: Definition Given a set of records each of which contain some number of items from a given collection; –Produce.
Data Mining Techniques So Far: Cluster analysis K-means Classification Decision Trees J48 (C4.5) Rule-based classification JRIP (RIPPER) Logistic Regression.
Data Mining Association Analysis: Basic Concepts and Algorithms Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Data Mining Association Analysis: Basic Concepts and Algorithms
Data Mining Association Analysis: Basic Concepts and Algorithms
1) Go over HW #1 solutions (Due today)
1 Statistics 202: Statistical Aspects of Data Mining Professor David Mease Tuesday, Thursday 9:00-10:15 AM Terman 156 Lecture 9 = Review for midterm exam.
1 Statistics 202: Statistical Aspects of Data Mining Professor David Mease Tuesday, Thursday 9:00-10:15 AM Terman 156 Lecture 7 = Finish chapter 3 and.
1 Statistics 202: Statistical Aspects of Data Mining Professor David Mease Tuesday, Thursday 9:00-10:15 AM Terman 156 Lecture 11 = Finish ch. 4 and start.
1 Statistics 202: Statistical Aspects of Data Mining Professor David Mease Tuesday, Thursday 9:00-10:15 AM Terman 156 Lecture 2 = Start chapter 2 Agenda:
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining By Tan, Steinbach, Kumar Lecture.
Modul 7: Association Analysis. 2 Association Rule Mining  Given a set of transactions, find rules that will predict the occurrence of an item based on.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
ASSOCIATION RULE DISCOVERY (MARKET BASKET-ANALYSIS) MIS2502 Data Analytics Adapted from Tan, Steinbach, and Kumar (2004). Introduction to Data Mining.
EXAM REVIEW MIS2502 Data Analytics. Exam What Tool to Use? Evaluating Decision Trees Association Rules Clustering.
CSE4334/5334 DATA MINING CSE4334/5334 Data Mining, Fall 2014 Department of Computer Science and Engineering, University of Texas at Arlington Chengkai.
1 What is Association Analysis: l Association analysis uses a set of transactions to discover rules that indicate the likely occurrence of an item based.
ASSOCIATION RULES (MARKET BASKET-ANALYSIS) MIS2502 Data Analytics Adapted from Tan, Steinbach, and Kumar (2004). Introduction to Data Mining.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining: Association Analysis This lecture node is modified based on Lecture Notes for.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Elective-I Examination Scheme- In semester Assessment: 30 End semester Assessment :70 Text Books: Data Mining Concepts and Techniques- Micheline Kamber.
1 Data Mining Lecture 6: Association Analysis. 2 Association Rule Mining l Given a set of transactions, find rules that will predict the occurrence of.
Introduction to Data Mining Mining Association Rules Reference: Tan et al: Introduction to data mining. Some slides are adopted from Tan et al.
Stats 202: Statistical Aspects of Data Mining Professor Rajan Patel
Statistics 202: Statistical Aspects of Data Mining
Data Mining Association Analysis: Basic Concepts and Algorithms
Data Mining Association Analysis: Basic Concepts and Algorithms
Frequent Pattern Mining
Data Mining Association Analysis: Basic Concepts and Algorithms
Association Rule Mining
Data Mining Association Analysis: Basic Concepts and Algorithms
Data Mining Association Analysis: Basic Concepts and Algorithms
Association Analysis: Basic Concepts and Algorithms
Association Analysis: Basic Concepts
Presentation transcript:

1 Statistics 202: Statistical Aspects of Data Mining Professor David Mease Tuesday, Thursday 9:00-10:15 AM Terman 156 Lecture 8 = Finish chapter 6 Agenda: 1) Reminder about midterm exam ( July 26 ) 2) Reminder about homework (due 9AM Tues) 3) Lecture over rest of Chapter 6 (sections 6.1and 6.7) 4) A few sample midterm questions

2 Announcement – Midterm Exam: The midterm exam will be Thursday, July 26 The best thing will be to take it in the classroom (9:00-10:15 AM in Terman 156) For remote students who absolutely can not come to the classroom that day please me to confirm arrangements with SCPD You are allowed one 8.5 x 11 inch sheet (front and back) for notes No books or computers are allowed, but please bring a hand held calculator The exam will cover the material that we covered in class from Chapters 1,2,3 and 6

3 Announcement – Midterm Exam: For remote students who absolutely can not come to the classroom that day please me to confirm arrangements with SCPD (see ) I have heard from: Catrina Jack C Steven V Jeff N Trent P Duyen N Jason E If you are not one of these people, I will assume you will take the exam in the classroom unless you contact me and tell me otherwise

4 Homework Assignment: Chapter 3 Homework Part 2 and Chapter 6 Homework is due 9AM Tuesday 7/24 Either to me bring it to class, or put it under my office door. SCPD students may use or fax or mail. The assignment is posted at Important: If using , please submit only a single file (word or pdf) with your name and chapters in the file name. Also, include your name on the first page.

5 Introduction to Data Mining by Tan, Steinbach, Kumar Chapter 6: Association Analysis

6 What is Association Analysis: l Association analysis uses a set of transactions to discover rules that indicate the likely occurrence of an item based on the occurrences of other items in the transaction l Examples: {Diaper} {Beer}, {Milk, Bread} {Eggs,Coke} {Beer, Bread} {Milk} l Implication means co-occurrence, not causality!

7 Definitions: lItemset –A collection of one or more items –Example: {Milk, Bread, Diaper} –k-itemset = An itemset that contains k items lSupport count ( ) –Frequency of occurrence of an itemset –E.g. ({Milk, Bread,Diaper}) = 2 lSupport –Fraction of transactions that contain an itemset –E.g. s({Milk, Bread, Diaper}) = 2/5 lFrequent Itemset –An itemset whose support is greater than or equal to a minsup threshold

8 Another Definition: lAssociation Rule –An implication expression of the form X Y, where X and Y are itemsets –Example: {Milk, Diaper} {Beer}

9 Even More Definitions: lAssociation Rule Evaluation Metrics –Support (s) =Fraction of transactions that contain both X and Y –Confidence (c) =Measures how often items in Y appear in transactions that contain X lExample:

10 In class exercise #26: Compute the support for itemsets {a}, {b, d}, and {a,b,d} by treating each transaction ID as a market basket.

11 In class exercise #27: Use the results in the previous problem to compute the confidence for the association rules {b, d} {a} and {a} {b, d}. State what these values mean in plain English.

12 In class exercise #28: Compute the support for itemsets {a}, {b, d}, and {a,b,d} by treating each customer ID as a market basket.

13 In class exercise #29: Use the results in the previous problem to compute the confidence for the association rules {b, d} {a} and {a} {b, d}. State what these values mean in plain English.

14 In class exercise #30: The data contains access logs from May 7, 2007 to July 1, Treating each row as a "market basket" find the support and confidence for the rule Mozilla/5.0 (compatible; Yahoo! Slurp;

15 An Association Rule Mining Task: l Given a set of transactions T, find all rules having both - support minsup threshold - confidence minconf threshold l Brute-force approach: - List all possible association rules - Compute the support and confidence for each rule - Prune rules that fail the minsup and minconf thresholds - Problem: this is computationally prohibitive!

16 The Support and Confidence Requirements can be Decoupled l All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer} l Rules originating from the same itemset have identical support but can have different confidence l Thus, we may decouple the support and confidence requirements {Milk,Diaper} {Beer} (s=0.4, c=0.67) {Milk,Beer} {Diaper} (s=0.4, c=1.0) {Diaper,Beer} {Milk} (s=0.4, c=0.67) {Beer} {Milk,Diaper} (s=0.4, c=0.67) {Diaper} {Milk,Beer} (s=0.4, c=0.5) {Milk} {Diaper,Beer} (s=0.4, c=0.5)

17 Two Step Approach: 1) Frequent Itemset Generation = Generate all itemsets whose support minsup 2) Rule Generation = Generate high confidence ( confidence minconf ) rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset l Note: Frequent itemset generation is still computationally expensive and your book discusses algorithms that can be used

18 In class exercise #31: Use the two step approach to generate all rules having support.4 and confidence.6 for the transactions below.

19 Drawback of Confidence Association Rule: Tea Coffee Confidence(Tea Coffee) = P(Coffee|Tea) = 0.75 Coffee Tea15520 Tea

20 Drawback of Confidence Association Rule: Tea Coffee Confidence(Tea Coffee) = P(Coffee|Tea) = 0.75 but support(Coffee) = P(Coffee) = 0.9 Although confidence is high, rule is misleading confidence(Tea Coffee) = P(Coffee|Tea) = Coffee Tea15520 Tea

21 Other Proposed Metrics:

22 Simpsons Paradox (page 384) l Occurs when a 3 rd (possibly hidden) variable causes the observed relationship between a pair of variables to disappear or reverse directions l Example: My friend and I play a basketball game and each shoot 20 shots. Who is the better shooter?

23 Simpsons Paradox (page 384) l Occurs when a 3 rd (possibly hidden) variable causes the observed relationship between a pair of variables to disappear or reverse directions l Example: My friend and I play a basketball game and each shoot 20 shots. Who is the better shooter? l But, who is the better shooter if you control for the distance of the shot? Who would you rather have on your team?

24 Another example of Simpsons Paradox l A search engine labels web pages as good and bad. A researcher is interested in studying the relationship between the duration of time a user spends on the web page (long/short) and the good/bad attribute.

25 Another example of Simpsons Paradox l A search engine labels web pages as good and bad. A researcher is interested in studying the relationship between the duration of time a user spends on the web page (long/short) and the good/bad attribute. l It is possible that this relationship reverses direction when you control for the type of query (adult/non-adult). Which relationship is more relevant?

26 Sample Midterm Question #1: What is the definition of data mining used in your textbook? A) the process of automatically discovering useful information in large data repositories B) the computer-assisted process of digging through and analyzing enormous sets of data and then extracting the meaning of the data C) an analytic process designed to explore data in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data

27 Sample Midterm Question #2: If height is measured as short, medium or tall then it is what kind of attribute? A) Nominal B) Ordinal C) Interval D) Ratio

28 Sample Midterm Question #3: If my data frame in R is called data, which of the following will give me the third column? A) data[2,] B) data[3,] C) data[,2] D) data[,3] E) data(2,) F) data(3,) G) data(,2) H) data(,3)

29 Sample Midterm Question #4: Compute the confidence for the association rule {b, d} {a} by treating each row as a market basket. Also, state what this value means in plain English.