
High-Level Applications of MapReduce & A MapReduce Implementation of Frequent Itemset Mining — Peng Bo (彭波), School of EECS, Peking University, 7/14/2009

Outline: Assignment review; High-level applications of MapReduce; A MapReduce implementation of frequent itemset mining; Course schedule

Review of Lecture 3: Process synchronization refers to the coordination of simultaneous threads or processes to complete a task, ensuring the correct runtime order and avoiding unexpected race conditions.

What makes this work? Underneath the socket layer are several more protocols. Most important are TCP and IP (which are used hand-in-hand so often that they're often spoken of as one protocol: TCP/IP).

Why is This Necessary? The Internet is not actually tube-like "underneath the hood": unlike the phone system (circuit-switched), the packet-switched Internet uses many routes at once.

"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." -- Leslie Lamport

Ken Arnold, CORBA designer: "Failure is the defining difference between distributed and local programming."

The Eight Design Fallacies: 1. The network is reliable. 2. Latency is zero. 3. Bandwidth is infinite. 4. The network is secure. 5. Topology doesn't change. 6. There is one administrator. 7. Transport cost is zero. 8. The network is homogeneous. -- Peter Deutsch and James Gosling, Sun Microsystems

Random Walks Over the Web. Model: a user starts at a random Web page and randomly clicks on links, surfing from page to page. In the long run, what fraction of the time is spent on any given page? That fraction is the page's PageRank.

PageRank: Defined. Given page x with in-bound links t_1 … t_n, where C(t) is the out-degree of t, α is the probability of a random jump, and N is the total number of nodes in the graph:

PR(x) = α(1/N) + (1 − α) · Σ_{i=1..n} PR(t_i) / C(t_i)

(figure: page x receiving in-bound links from t_1 … t_n)
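To make the formula concrete, here is a minimal Python sketch of the corresponding power iteration. The dict-based graph representation and the names (pagerank, out-link lists) are illustrative assumptions, not from the lecture; it also assumes every page has at least one out-link.

    # Minimal PageRank power iteration following the formula above.
    # Assumes no dangling nodes (every page has at least one out-link).
    def pagerank(graph, alpha=0.15, iterations=50):
        n = len(graph)
        pr = {page: 1.0 / n for page in graph}           # uniform start
        for _ in range(iterations):
            nxt = {page: alpha / n for page in graph}    # random-jump term
            for page, out_links in graph.items():
                share = (1 - alpha) * pr[page] / len(out_links)
                for target in out_links:
                    nxt[target] += share                 # mass flowing along links
            pr = nxt
        return pr

    # Tiny example: three pages in a cycle with one extra link.
    print(pagerank({"x": ["y"], "y": ["z"], "z": ["x", "y"]}))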

The Google File System: Main Contribution. We treat component failures as the norm rather than the exception, optimize for huge files that are mostly appended to (perhaps concurrently) and then read (usually sequentially), and both extend and relax the standard file system interface to improve the overall system.

MapReduce: Simplified Data Processing on Large Clusters: Main Contribution. MapReduce is a programming model and an associated implementation for processing and generating large data sets. 1. The model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization, and load balancing. 2. A large variety of problems are easily expressible as MapReduce computations. 3. The authors developed an implementation of MapReduce that scales to large clusters comprising thousands of machines.

High-Level Applications of MapReduce

Pig Latin: A Not-So-Foreign Language For Data Processing — Chris Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins (Yahoo! Research)

Data Processing Renaissance. Internet companies are swimming in data (e.g. TBs/day at Yahoo!). Data analysis is the "inner loop" of product innovation. Data analysts are skilled programmers.

Data Warehousing…? Scale: often not scalable enough. Cost: prohibitively expensive at web scale (up to $200K/TB). SQL: little control over the execution method, and query optimization is hard in a parallel environment with little or no statistics and lots of UDFs.

New Systems For Data Analysis: Map-Reduce, Apache Hadoop, Dryad, ...

Map-Reduce. (figure: input records are mapped to (key, value) pairs, grouped by key, and reduced to output records.) Just a group-by-aggregate?
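To make the "group-by-aggregate" reading concrete, here is a minimal in-memory sketch (not Hadoop) of the map, group-by-key, reduce flow; all names are illustrative.

    from collections import defaultdict

    def map_reduce(records, mapper, reducer):
        groups = defaultdict(list)
        for record in records:                  # map phase
            for key, value in mapper(record):
                groups[key].append(value)       # "shuffle": group values by key
        # reduce phase: collapse each key's values into an output record
        return {key: reducer(key, values) for key, values in groups.items()}

    # Example: count visits per url -- a plain group-by-aggregate.
    visits = [("Amy", "cnn.com"), ("Amy", "bbc.com"), ("Fred", "cnn.com")]
    print(map_reduce(visits,
                     mapper=lambda r: [(r[1], 1)],
                     reducer=lambda url, ones: sum(ones)))
    # {'cnn.com': 2, 'bbc.com': 1}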

The Map-Reduce Appeal. Scale: scalable due to a simpler design (only parallelizable operations, no transactions). Cost: runs on cheap commodity hardware. Procedural (vs. SQL): control over a processing "pipe".

Disadvantages: 1. Extremely rigid data flow: other flows (join, union, split, chains of map-reduce stages) are constantly hacked in. 2. Common operations must be coded by hand: join, filter, projection, aggregates, sorting, distinct. 3. Semantics are hidden inside the map-reduce functions, making programs difficult to maintain, extend, and optimize. (A hand-coded join is sketched below.)
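As an illustration of point 2, here is roughly what "coding a join by hand" means: a reduce-side join, where the map phase buckets each record by the join key and tags it with its source relation, and the reduce phase pairs the two sides. A sketch with made-up relation names matching the example to come.

    from collections import defaultdict

    def reduce_side_join(visits, url_info):
        # "Map" phase: bucket every record by its join key (url),
        # tagged with the relation it came from.
        buckets = defaultdict(lambda: {"visits": [], "info": []})
        for user, url, time in visits:
            buckets[url]["visits"].append((user, time))
        for url, category, pagerank in url_info:
            buckets[url]["info"].append((category, pagerank))
        # "Reduce" phase: emit the cross-product of the two sides per url.
        for url, sides in buckets.items():
            for user, time in sides["visits"]:
                for category, pagerank in sides["info"]:
                    yield (user, url, time, category, pagerank)

    visits = [("Amy", "cnn.com", "8:00"), ("Fred", "cnn.com", "12:00")]
    url_info = [("cnn.com", "News", 0.9)]
    print(list(reduce_side_join(visits, url_info)))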

Pros and Cons. Pros: scalable, cheap, control over execution. Cons: inflexible, lots of hand coding, semantics hidden. We need a high-level, general data flow language.

Enter Pig Latin: it keeps the pros (scalable, cheap, control over execution) while providing the needed high-level, general data flow language.

Outline Map-Reduce and the need for Pig Latin Pig Latin example Salient features Implementation

Example Data Analysis Task: find the top 10 most visited pages in each category.

Visits:
User  Url         Time
Amy   cnn.com     8:00
Amy   bbc.com     10:00
Amy   flickr.com  10:05
Fred  cnn.com     12:00

Url Info:
Url         Category  PageRank
cnn.com     News      0.9
bbc.com     News      0.8
flickr.com  Photos    0.7
espn.com    Sports    0.9

Data Flow: Load Visits → Group by url → Foreach url generate count → Join on url with (Load Url Info) → Group by category → Foreach category generate top10 urls.

In Pig Latin:

    visits      = load '/data/visits' as (user, url, time);
    gVisits     = group visits by url;
    visitCounts = foreach gVisits generate url, count(visits);
    urlInfo     = load '/data/urlInfo' as (url, category, pRank);
    visitCounts = join visitCounts by url, urlInfo by url;
    gCategories = group visitCounts by category;
    topUrls     = foreach gCategories generate top(visitCounts,10);
    store topUrls into '/data/topUrls';

Outline Map-Reduce and the need for Pig Latin Pig Latin example Salient features Implementation

Step-by-step Procedural Control. Target users are entrenched procedural programmers.

"The step-by-step method of creating a program in Pig is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data." -- Jasmine Novak, Engineer, Yahoo!

Automatic query optimization is hard; Pig Latin does not preclude optimization.

"With the various interleaved clauses in SQL, it is difficult to know what is actually happening sequentially. With Pig, the data nesting and the temporary tables get abstracted away. Pig has fewer primitives than SQL does, but it's more powerful." -- David Ciemiewicz, Search Excellence, Yahoo!

Quick Start and Interoperability: Pig operates directly over files (the same script as above; no import/export step is needed).

Quick Start and Interoperability: schemas are optional and can be assigned dynamically (the "as (user, url, time)" clauses above can be omitted).

User Code as a First-Class Citizen: user-defined functions (UDFs) can be used in every construct: load, store, group, filter, foreach.

Nested Data Model: Pig Latin has a fully nestable data model with atomic values, tuples, bags (lists), and maps. This is more natural to programmers than flat tuples and avoids expensive joins (see the paper). (figure: example nested map, e.g. 'yahoo' mapping to {finance, news})

Outline Map-Reduce and the need for Pig Latin Pig Latin example Novel features Implementation

(figure: the implementation stack. Either SQL, via automatic rewrite and optimization, or a user writing Pig Latin directly, feeds Pig; Pig compiles to Hadoop Map-Reduce, which runs on the cluster.) Pig is open-source.

Compilation into Map-Reduce: the data flow (Load Visits → Group by url → Foreach url generate count → Join on url with Load Url Info → Group by category → Foreach category generate top10(urls)) compiles into three map-reduce jobs (Map 1/Reduce 1 … Map 3/Reduce 3). Every group or join operation forms a map-reduce boundary; the other operations are pipelined into the map and reduce phases.

Usage: first production release about a year ago; 150+ early adopters within Yahoo!; over 25% of the Yahoo! map-reduce user base.

Related Work. Sawzall: a data-processing language on top of map-reduce, with a rigid structure of filtering followed by aggregation. DryadLINQ: an SQL-like language on top of Dryad. Nested data models: object-oriented databases.

Future Work. An optional "safe" query optimizer that performs only high-confidence rewrites. A user interface (boxes-and-arrows UI) to promote collaboration and the sharing of code fragments and UDFs. Tight integration with a scripting language, using the loops and conditionals of the host language.

Credits: Arun Murthy, Pi Song, Santhosh Srinivasan, Amir Youssefi, Shubham Chopra, Alan Gates, Shravan Narayanamurthy, Olga Natkovich.

A MapReduce Implementation of Frequent Itemset Mining

'Basket data': a very common type of data, often also called transaction data. The next slide shows an example transaction database, where each record represents a transaction between (usually) a customer and a shop. Each record in a supermarket's transaction DB, for example, corresponds to a basket of specific items.

(table: an example transaction database of 20 records; each record is a basket of items drawn from apples, beer, cheese, dates, eggs, fish, glue, honey, ice-cream)

Discovering Rules: a common and useful application of data mining. A 'rule' is something like: if a basket contains apples and cheese, then it also contains beer. Any such rule has two associated measures: confidence (when the 'if' part is true, how often is the 'then' part true? This is the same as accuracy) and coverage or support (how much of the database contains the 'if' part?).

(table: the 20-record example transaction database again.) What is the confidence and coverage of: "if the basket contains beer and cheese, then it also contains honey"? 2/20 of the records contain both beer and cheese, so coverage is 10%. Of these 2, 1 contains honey, so confidence is 50%.
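A small sketch of how coverage and confidence can be computed directly from baskets. Since the table itself is not reproduced here, the baskets below are hypothetical stand-ins chosen to match the 10%/50% answer.

    def coverage_and_confidence(baskets, if_items, then_items):
        if_items, then_items = set(if_items), set(then_items)
        with_if = [b for b in baskets if if_items <= b]
        with_both = [b for b in with_if if then_items <= b]
        coverage = len(with_if) / len(baskets)        # how often the 'if' part holds
        confidence = len(with_both) / len(with_if)    # how often 'then' follows
        return coverage, confidence

    # Hypothetical 20 baskets standing in for the slide's table.
    baskets = [{"beer", "cheese", "honey"}, {"beer", "cheese"}] + [{"apples"}] * 18
    print(coverage_and_confidence(baskets, {"beer", "cheese"}, {"honey"}))
    # (0.1, 0.5): coverage 10%, confidence 50%, matching the slide.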

Interesting/Useful Rules. Statistically, anything that is interesting is something that happens significantly more often than you would expect by chance. E.g. a basic statistical analysis of basket data may show that 10% of baskets contain bread and 4% of baskets contain washing-up powder. I.e., if you choose a basket at random, there is a probability of 0.1 that it contains bread and a probability of 0.04 that it contains washing-up powder.

Bread and washing-up powder: what is the probability of a basket containing both? The laws of probability say that if these two things are independent, the chance is 0.1 × 0.04 = 0.004. That is, we would expect 0.4% of baskets to contain both bread and washing-up powder.

Interesting means surprising. We therefore have a prior expectation that just 4 in 1,000 baskets should contain both bread and washing-up powder. If we investigate and discover that it is really 20 in 1,000 baskets, we will be very surprised. It tells us that something is going on in shoppers' minds: bread and washing-up powder are connected in some way. There may be ways to exploit this discovery … put the powder and bread at opposite ends of the supermarket?

Finding surprising rules. Suppose we ask: "what is the most surprising rule in this database?" This would presumably be a rule whose accuracy differs from its expected accuracy more than any other rule's does. But it also has to have a suitable level of coverage, or else it may be just a statistical blip, and/or unexploitable.

Here are some interesting ones in our mini basket DB: if a basket contains glue, then it also contains either beer or eggs (confidence 100%; coverage 25%). If a basket contains apples and dates, then it also contains honey (confidence 100%; coverage 20%).

Finding surprising rules. Looking only at rules of the form "if the basket contains X and Y, then it also contains Z", realistic numbers tell us that there may be around 500,000,000 distinct possible rules. For each of these we need to work out its accuracy and coverage by trawling through a database of around 20,000,000 basket records … c operations … We must somehow search through 500,000,000 (or usually immensely more) rules to sniff out what may be the interesting ones. Does MapReduce work here?

Find rules in two stages. Agrawal and colleagues divided the problem of finding good rules into two phases: 1. Find all itemsets with a specified minimal support (coverage). An itemset is just a specific set of items, e.g. {apples, cheese}; the Apriori algorithm can efficiently find all itemsets whose coverage is above a given minimum. 2. Use these itemsets to help generate interesting rules. Having done stage 1, we have considerably narrowed down the possibilities, and can do reasonably fast processing of the large itemsets to generate candidate rules.

Terminology. k-itemset: a set of k items; e.g. {beer, cheese, eggs} is a 3-itemset, {cheese} is a 1-itemset, {honey, ice-cream} is a 2-itemset. Support: an itemset has support s% if s% of the records in the DB contain that itemset. Minimum support: the Apriori algorithm starts with the specification of a minimum level of support, and focuses on itemsets at this level or above.

Terminology. Large itemset: this doesn't mean an itemset with many items; it means one whose support is at least the minimum support. L_k: the set of all large k-itemsets in the DB. C_k: a set of candidate large k-itemsets. The algorithm we will look at generates C_k, which contains all the k-itemsets that might be large, and then whittles it down to L_k.

Terminology for sets. Let A = {cat, dog}, B = {dog, eel, rat}, and C = {eel, rat}. We use 'A + B' to mean A union B, so A + B = {cat, dog, eel, rat}. When X is a subset of Y, we use Y − X to mean the set of things in Y that are not in X; e.g. B − C = {dog}.

(table: a 20-record transaction database over items a–i.) E.g. the 3-itemset {a, b, h} has support 15%, the 2-itemset {a, i} has support 0%, and the 4-itemset {b, c, d, h} has support 5%. If minimum support is 10%, then {b} is a large itemset, but {b, c, d, h} is a small itemset!

Insight: what is the relationship between k-itemsets and (k+1)-itemsets? If a k-itemset is not a large itemset, then any (k+1)-itemset that contains it cannot be a large itemset either.

The Apriori algorithm for finding large itemsets efficiently in big DBs:

    1: Find all large 1-itemsets
    2: For (k = 2; while L_{k-1} is non-empty; k++)
    3: { C_k = apriori-gen(L_{k-1})
    4:   For each c in C_k, initialise c.count to zero
    5:   For all records r in the DB
    6:   { C_r = subset(C_k, r); For each c in C_r, c.count++ }
    7:   Set L_k := all c in C_k whose count >= minsup
    8: } /* end -- return all of the L_k sets */
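A compact, runnable Python rendering of the pseudocode above, keeping the line numbering in comments. Itemsets are frozensets; the apriori_gen here uses a union-based join plus the prune step (the ordered join is illustrated later), which after pruning yields the same candidate set. This is a sketch, not the lecture's code.

    from itertools import combinations

    def apriori_gen(prev_large, k):
        # Join: pair up large (k-1)-itemsets whose union has exactly k items.
        joined = {a | b for a in prev_large for b in prev_large if len(a | b) == k}
        # Prune: drop candidates with a (k-1)-subset that is not large.
        return {c for c in joined
                if all(frozenset(s) in prev_large for s in combinations(c, k - 1))}

    def apriori(db, minsup):
        """db: list of baskets (sets of items); minsup: an absolute count."""
        counts = {}                                    # line 1: scan for 1-itemsets
        for r in db:
            for item in r:
                c = frozenset([item])
                counts[c] = counts.get(c, 0) + 1
        L = {1: {c for c, n in counts.items() if n >= minsup}}
        k = 2
        while L[k - 1]:                                # line 2
            Ck = apriori_gen(L[k - 1], k)              # line 3
            counts = {c: 0 for c in Ck}                # line 4
            for r in db:                               # line 5
                for c in Ck:                           # line 6: C_r = subset(C_k, r)
                    if c <= r:
                        counts[c] += 1
            L[k] = {c for c, n in counts.items() if n >= minsup}   # line 7
            k += 1
        return [L[i] for i in range(1, k) if L[i]]     # line 8: all non-empty L_k

    db = [{"a", "b", "d", "g"}, {"c", "d", "e"}, {"a", "c", "d"}, {"a", "c", "f"}]
    print(apriori(db, minsup=2))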

Explaining the Apriori Algorithm … Line 1: find all large 1-itemsets. To start off, we simply find all of the large 1-itemsets by a basic scan of the DB: we take each item in turn and count the number of times it appears in a basket. In our running example, if minimum support were 60%, the only large 1-itemsets would be {a}, {b}, {c}, {d} and {f}, so we would get L_1 = {{a}, {b}, {c}, {d}, {f}}.

Explaining the Apriori Algorithm … Line 2: For (k = 2; while L_{k-1} is non-empty; k++). We already have L_1. This line just means that the remainder of the algorithm generates L_2, L_3, and so on, until we get to an L_k that is empty. How these are generated is as follows:

Explaining the Apriori Algorithm … Line 3: C_k = apriori-gen(L_{k-1}). Given the large (k-1)-itemsets, this step generates candidate k-itemsets that might be large. Because of how apriori-gen works, the set C_k is guaranteed to contain all the large k-itemsets, but it also contains some that will turn out not to be large.

Explaining the Apriori Algorithm … Line 4: For each c in C_k, initialise c.count to zero. We are going to work out the support for each candidate k-itemset in C_k by counting how many times each itemset appears in a record of the DB; this step starts us off by initialising those counts to zero.

Explaining the Apriori Algorithm … Lines 5-6: For all records r in the DB { C_r = subset(C_k, r); For each c in C_r, c.count++ }. We take each record r in the DB and collect the candidate k-itemsets from C_k that are contained in r; for each of these, we update its count.

Explaining the Apriori Algorithm … Line 7: Set L_k := all c in C_k whose count >= minsup. Now we have the count for every candidate; those whose counts are big enough are valid large itemsets of the right size. We therefore now have L_k, and we go back into the for loop of line 2 and start working towards L_{k+1}.

Explaining the Apriori Algorithm … Line 8: end. We finish at the point where we get an empty L_k. The algorithm returns all of the (non-empty) L_k sets, which gives us an excellent start in finding interesting rules (although the large itemsets themselves will usually be very interesting and useful).

apriori-gen: notes. Suppose we have worked out that the large 2-itemsets are L_2 = {{milk, noodles}, {milk, tights}, {noodles, quorn}}. apriori-gen now generates 3-itemsets that may all be large.

apriori-gen: the join step. Keep an ordering of the items: a < b means that a comes before b in alphabetical order. Suppose we have L_k and wish to generate C_{k+1}. First we take every distinct pair of sets in L_k, {a_1, a_2, … a_k} and {b_1, b_2, … b_k}; in all cases where {a_1, a_2, … a_{k-1}} = {b_1, b_2, … b_{k-1}} and a_k < b_k, {a_1, a_2, … a_k, b_k} is a candidate (k+1)-itemset.

An illustration of that. Suppose the 2-itemsets are L_2 = {{milk, noodles}, {milk, tights}, {noodles, quorn}, {noodles, peas}, {noodles, tights}}. The pairs that satisfy {a_1, … a_{k-1}} = {b_1, … b_{k-1}} and a_k < b_k are: {milk, noodles} | {milk, tights}; {noodles, peas} | {noodles, quorn}; {noodles, peas} | {noodles, tights}; {noodles, quorn} | {noodles, tights}. So the candidate 3-itemsets are: {milk, noodles, tights}, {noodles, peas, quorn}, {noodles, peas, tights}, {noodles, quorn, tights}.

apriori-gen: the prune step. In the prune step, we take the candidate (k+1)-itemsets we have and remove any for which some k-subset of it is not a large k-itemset; such an itemset couldn't possibly be a large (k+1)-itemset. E.g. in the current example, with n = noodles etc., we have L_2 = {{milk, n}, {milk, tights}, {n, quorn}, {n, peas}, {n, tights}} and the candidate 3-itemsets so far: {m, n, t}, {n, p, q}, {n, p, t}, {n, q, t}. Now, {p, q} is not a large 2-itemset, so {n, p, q} is pruned; {p, t} is not a large 2-itemset, so {n, p, t} is pruned; {q, t} is not a large 2-itemset, so {n, q, t} is pruned. After this we finally have C_3 = {{milk, noodles, tights}}.
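A standalone sketch of apriori-gen reproducing the milk/noodles example. Note that this version's join builds every union of two large 2-itemsets that has exactly 3 items, which over-generates compared with the ordered join on the earlier slide, but the prune step removes exactly the extras, so the final C_3 matches.

    from itertools import combinations

    def apriori_gen(prev_large, k):
        prev_large = {frozenset(s) for s in prev_large}
        # Join step (union-based variant of the ordered join).
        joined = {a | b for a in prev_large for b in prev_large if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be a large itemset.
        return {c for c in joined
                if all(frozenset(s) in prev_large for s in combinations(c, k - 1))}

    L2 = [{"milk", "noodles"}, {"milk", "tights"}, {"noodles", "quorn"},
          {"noodles", "peas"}, {"noodles", "tights"}]
    print(apriori_gen(L2, 3))
    # {frozenset({'milk', 'noodles', 'tights'})} -- the C_3 from the slide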

Understanding rules. The Apriori algorithm finds interesting (i.e. frequent) itemsets. E.g. it may find that {apples, bananas, milk} has coverage 30%, meaning 30% of transactions contain all three of these things. What can you say about the coverage of {apples, milk}? (It must be at least 30%, since every basket containing the 3-itemset also contains this pair.) We can invent several potential rules, e.g.: IF a basket contains apples and bananas, THEN it also contains milk. Suppose the support of {apples, bananas} is 40%; what is the confidence of this rule? (30/40 = 75%.)

Understanding rules II. Suppose itemset A = {beer, cheese, eggs} has 30% support in the DB; {beer, cheese} has 40%, {beer, eggs} has 30%, {cheese, eggs} has 50%, and each of beer, cheese, and eggs alone has 50% support. What is the confidence of: IF a basket contains beer and cheese, THEN it also contains eggs? The confidence of a rule 'if A then B' is simply support(A + B) / support(A). So it is 30/40 = 0.75; this rule has 75% confidence. What is the confidence of: IF a basket contains beer, THEN it also contains cheese and eggs? 30/50 = 0.6, so this rule has 60% confidence.
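The rule support(A + B) / support(A) as a small helper, checked against the numbers above; the support table is a hypothetical stand-in, keyed by frozenset.

    def confidence(support, A, B):
        # confidence(if A then B) = support(A + B) / support(A)
        return support[frozenset(A) | frozenset(B)] / support[frozenset(A)]

    support = {frozenset({"beer", "cheese", "eggs"}): 0.30,
               frozenset({"beer", "cheese"}): 0.40,
               frozenset({"beer"}): 0.50}
    print(confidence(support, {"beer", "cheese"}, {"eggs"}))   # 0.75
    print(confidence(support, {"beer"}, {"cheese", "eggs"}))   # 0.6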

Understanding rules III. If the rule 'if A then B' has confidence c, and support(A) = 2 × support(B), what can be said about the confidence of 'if B then A'? Confidence c is support(A + B) / support(A) = support(A + B) / (2 × support(B)). Let d be the confidence of 'if B then A': d = support(A + B) / support(B). Clearly, d = 2c. E.g. A might be milk and B might be newspapers.

Summary: the Apriori algorithm for efficiently finding frequent (large) itemsets in large DBs; associated terminology; associated notes about rules, and working out the confidence of a rule from the support of its component itemsets.

A full run-through of Apriori. (table: a 20-record transaction database D over items a–g.) We will assume minsup is 4 (20%). This will not be run through in the lecture; it is here to help with revision.

First we find all the large 1-itemsets, i.e., in this case, all the 1-itemsets contained by at least 4 records in the DB. In this example, that's all of them, so L_1 = {{a}, {b}, {c}, {d}, {e}, {f}, {g}}. Now we set k = 2 and run apriori-gen to generate C_2. The join step when k = 2 just gives us the set of all alphabetically ordered pairs from L_1, and we cannot prune any away, so we have C_2 = {{a, b}, {a, c}, {a, d}, {a, e}, {a, f}, {a, g}, {b, c}, {b, d}, {b, e}, {b, f}, {b, g}, {c, d}, {c, e}, {c, f}, {c, g}, {d, e}, {d, f}, {d, g}, {e, f}, {e, g}, {f, g}}.

So we have C_2 = {{a, b}, {a, c}, …, {f, g}} (all 21 pairs listed above). Line 4 of the Apriori algorithm now tells us to set a counter for each of these to 0. Line 5 prepares us to take each record in the DB in turn and find which members of C_2 it contains. The first record, r1 = {a, b, d, g}, contains: {a, b}, {a, d}, {a, g}, {b, d}, {b, g}, {d, g}. Hence C_r1 = {{a, b}, {a, d}, {a, g}, {b, d}, {b, g}, {d, g}}, and the rest of line 6 tells us to increment the counters of these itemsets. The second record, r2 = {c, d, e}, gives C_r2 = {{c, d}, {c, e}, {d, e}}, and we increment the counters for these three itemsets. … After all 20 records, we look at the counters and find that the itemsets whose counters reached minsup (4) are: {a, c}, {a, d}, {c, d}, {c, e}, {c, f}. So L_2 = {{a, c}, {a, d}, {c, d}, {c, e}, {c, f}}.

So we have L_2 = {{a, c}, {a, d}, {c, d}, {c, e}, {c, f}}. We now set k = 3 and run apriori-gen on L_2. The join step finds the following pairs that meet the required pattern: {a, c}:{a, d}; {c, d}:{c, e}; {c, d}:{c, f}; {c, e}:{c, f}. This leads to the candidate 3-itemsets {a, c, d}, {c, d, e}, {c, d, f}, {c, e, f}. We prune {c, d, e} since {d, e} is not in L_2; we prune {c, d, f} since {d, f} is not in L_2; we prune {c, e, f} since {e, f} is not in L_2. We are left with C_3 = {{a, c, d}}. We now run lines 5-7 to count how many records contain {a, c, d}. The count is 4, so L_3 = {{a, c, d}}.

So we have L_3 = {{a, c, d}}. We now set k = 4, but when we run apriori-gen on L_3 we get the empty set, and hence we find L_4 = {}. This means we now finish and return the set of all of the non-empty L_k sets; these are all of the large itemsets: Result = {{a}, {b}, {c}, {d}, {e}, {f}, {g}, {a, c}, {a, d}, {c, d}, {c, e}, {c, f}, {a, c, d}}. Each large itemset is intrinsically interesting and may be of business value; simple rule-generation algorithms can now use the large itemsets as a starting point.

MapReduce Implementation: C_k = apriori-gen(L_{k-1}) (the join step and the prune step); C_r = subset(C_k, r) (the per-record candidate counting); 2-itemset generation. (A sketch of the counting pass as map/reduce follows.)
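One natural way to cast the counting pass (lines 4-7) as MapReduce, sketched as a minimal in-memory simulation: the mapper plays the role of subset(C_k, r), emitting (candidate, 1) for each candidate contained in a basket; the reducer sums the counts and keeps the candidates that reach minsup. apriori-gen would run on the driver between passes. All names here are illustrative assumptions.

    from collections import defaultdict
    from itertools import combinations

    def mapper(basket, Ck):
        # subset(C_k, r): emit (candidate, 1) for each candidate contained in r.
        for c in Ck:
            if c <= basket:
                yield (c, 1)

    def reducer(candidate, counts, minsup):
        total = sum(counts)          # sum the partial counts for this candidate
        if total >= minsup:          # keep only large itemsets (line 7)
            yield (candidate, total)

    def counting_pass(db, Ck, minsup):
        groups = defaultdict(list)   # simulated shuffle: group values by key
        for basket in db:
            for key, value in mapper(basket, Ck):
                groups[key].append(value)
        return {c: n for cand, counts in groups.items()
                for c, n in reducer(cand, counts, minsup)}

    db = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
    C2 = {frozenset(p) for p in combinations("abcde", 2)}
    print(counting_pass(db, C2, minsup=2))   # the large 2-itemsets L_2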

Announcements

Special Talk: Brewster Kahle. His stated goal is "Universal Access to All Knowledge". Director of the Internet Archive; a key supporter of the Open Content Alliance; in 2005, Kahle was elected a fellow of the American Academy of Arts and Sciences. Venue: Science Building No. 1, Room 1131. Time: 9:00 a.m. on the 17th.

Course Schedule. This Thursday's class is moved to next Tuesday. This Friday: attend the talk by Brewster Kahle, founder of the Internet Archive. Next Tuesday: each group presents its course project proposal.

Q&A