Jeffrey D. Ullman Stanford University

• Data mining, the extraction of actionable information from (usually) very large datasets, is the subject of extreme hype, fear, and interest.
• It’s not all about machine learning. But a lot of it is.
• Emphasis in CS246 on algorithms that scale.
  - Parallelization often essential.

• Often, especially for ML-type algorithms, the result is a model = a simple representation of the data, typically used for prediction.
• Example: PageRank is a number Google assigns to each Web page, representing the “importance” of the page.
  - Calculated from the link structure of the Web.
  - Summarizes in one number all the links leading to one page.
  - Used to help decide which pages Google shows you.

• Problem: Most teams don’t play each other, and there are weaker and stronger leagues. So which teams are the best? Who would win a game between teams that never played each other?
• Model (old style): a list of the teams from best to worst.
• Algorithm (old style):
  1. Cost of a list = # of teams out of order.
  2. Start with some list (e.g., order by % won).
  3. Make incremental cost improvements until none possible (e.g., swap adjacent teams).

(Figure: a small tournament on teams A-E; an edge’s tail is the winner and its head the loser. The list A B C D E has cost 4; adjacent swaps give A B C E D with cost 3, then A B E C D with cost 2. Note: the best ordering, E C A B D, with a cost of 1, is unreachable using only adjacent swaps.)
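To make the old-style algorithm concrete, here is a minimal Python sketch of local search by adjacent swaps. The list of game results is invented for illustration; it is not the tournament in the figure.

# Each (winner, loser) pair is one game that was actually played (hypothetical data).
games = [("E", "C"), ("C", "A"), ("A", "B"), ("B", "D"), ("E", "D")]

def cost(ordering, games):
    # Cost of a list = number of games whose loser is ranked ahead of the winner.
    pos = {team: i for i, team in enumerate(ordering)}
    return sum(1 for winner, loser in games if pos[loser] < pos[winner])

def rank_by_adjacent_swaps(start, games):
    # Start with some list and keep making adjacent swaps that lower the cost.
    ordering = list(start)
    improved = True
    while improved:
        improved = False
        for i in range(len(ordering) - 1):
            candidate = ordering[:]
            candidate[i], candidate[i + 1] = candidate[i + 1], candidate[i]
            if cost(candidate, games) < cost(ordering, games):
                ordering, improved = candidate, True
    return ordering

best = rank_by_adjacent_swaps(["A", "B", "C", "D", "E"], games)
print(best, cost(best, games))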

• Model has one variable for each of n teams (“power rating”).
  - Variable x for team X.
• Cost function is an expression of the variables.
• Example: hinge loss function h({x,y}) =
  - 0 if teams X and Y never played, or X won and x > y, or Y won and y > x;
  - y - x if X won and y > x, and x - y if Y won and x > y.
• Cost function = Σ_{x,y} h({x,y}) - terms to prevent all variables from being equal, e.g., c(Σ_{x,y} |x - y|).

• Start with some solution, e.g., all variables = 1.
• Find the gradient of the cost for one or more variables (= derivative of the cost function with respect to that variable).
• Make changes to variables in the direction of the gradient that lower cost, and repeat until improvements are “small.”
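A hedged sketch of this procedure for the power-rating model above: the games, the constant c, the step size, and the iteration count are all invented, and the gradient is estimated numerically rather than derived by hand.

import itertools

games = [("A", "B"), ("B", "C"), ("D", "C"), ("A", "D")]   # (winner, loser), hypothetical
teams = sorted({t for game in games for t in game})
c = 0.1                                                    # weight of the "keep ratings apart" term

def cost(r):
    # Hinge loss: a game costs (loser's rating - winner's rating) when that is positive.
    hinge = sum(max(0.0, r[loser] - r[winner]) for winner, loser in games)
    # Subtract c * sum of |x - y| over pairs to prevent all variables from being equal.
    spread = sum(abs(r[x] - r[y]) for x, y in itertools.combinations(teams, 2))
    return hinge - c * spread

ratings = {t: 1.0 for t in teams}      # start with some solution: all variables = 1
step, eps = 0.01, 1e-4
for _ in range(500):
    # Numerical gradient: derivative of the cost with respect to each variable.
    grad = {}
    for t in teams:
        bumped = dict(ratings)
        bumped[t] += eps
        grad[t] = (cost(bumped) - cost(ratings)) / eps
    # Move each variable a small step in the direction that lowers the cost.
    for t in teams:
        ratings[t] -= step * grad[t]

print(sorted(ratings.items(), key=lambda kv: -kv[1]))   # teams from highest to lowest rating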

• In many applications, all we want is an algorithm that will say “yes” or “no.”
• Example: a model for spam based on weighted occurrences of words or phrases.
  - Would give high weight to words like “Viagra” or phrases like “Nigerian prince.”
• Problem: when the weights are in favor of spam, there is no obvious reason why it is spam.
  - Sometimes, no one cares; other times understanding is vital.

• Rules like “Nigerian prince” -> spam are understandable and actionable.
• But the downside is that every email with that phrase will be considered spam.
• Next lecture will talk about these association rules, and how they are used in managing (brick and mortar) stores, where understanding the meaning of a rule is essential.

Prerequisites
Requirements
Staff
Resources
Honor Code
Lateness Policy

• Basic Algorithms.
  - CS161 is surely sufficient, but may be more than needed.
• Probability, e.g., CS109 or Stat116.
  - There will be a review session, and a review doc is linked from the class home page.
• Linear algebra.
  - Another review doc + review session is available.
• Programming (Java).
• Database systems (SQL, relational algebra).
  - CS145 is sufficient but not necessary.

• Final Exam (40%). The exam will be held Weds., March 16, 8:30AM.
  - Place TBD.
  - Note: the on-line schedule is misaligned, and 8:30AM is my interpretation; more later if needed.
• There is no alternative final, but if you truly have a conflict, we can arrange for you to take the exam immediately after the regular final.

• Gradiance on-line homework (20%).
  - This automated system lets you try questions as many times as you like, and the goal is that everyone will work until they get all problems right.
• Sign up at www.gradiance.com/services and enter class 62B99A55.
• Note: your score is based on the most recent submission, not the max.
• Note: after the due date, you can see the solutions to all problems by looking at one of your submissions, so you must try at least once.

• You should work each of the implied problems before answering the multiple-choice questions.
• That way, if you have to repeat the work to get 100%, you will have available what you need for the questions you have solved correctly, and the process can go quickly.
• Note: there is a 10-minute delay between submissions, to protect against people who randomly fire off guesses.

• Written Homework (40%).
• Four major assignments, involving programming, proofs, and algorithm development.
  - Preceded by a “warmup” assignment, called “tutorial,” to introduce everyone to Hadoop.
• Submission of work will be via gradescope.com, and you need to use class code 92B7E9 for CS246.

• There are 12 great TAs this year! Tim Althoff (chief TA), Sameep Bagadia, Ivaylo Bahtchevanov, Duyun Chen, Nihit Desai, Shubham Gupta, Jeff Hwang, Himabindu Lakkaraju, Caroline Suen, Jacky Wang, Leon Yao, You Zhou.
• See the class page for the schedule of office hours.
• Email to the class staff list will contact the TAs + Ullman.

• Class textbook: Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman, and J. D. Ullman.
  - Sold by Cambridge Univ. Press, but available for free download.
• On-line review notes for Probability, Proofs, and Linear Algebra are available through the class home page.
• Coursera MOOC at class.coursera.org/mmds-003
• Piazza discussion group (please join) at piazza.com/stanford/winter2016/cs246

• We’ll follow the standard CS Dept. approach: you can get help, but you MUST acknowledge the help on the work you hand in.
• Failure to acknowledge your sources is a violation of the Honor Code.
• We use MOSS to check the originality of your code.

• There is no possibility of submitting Gradiance homework late.
• For the five written homeworks, you have two late periods.
  - If a homework is due on a Thursday (Tuesday), then the late period expires 11:59PM on the following Tuesday (Thursday).
  - You can only use one late period per homework.

Jeffrey D. Ullman Stanford University

Chunking
Replication
Distribution on Racks

• Standard architecture:
  1. Cluster of commodity Linux nodes (compute nodes).
     - Node = processor + main memory + disk.
  2. Gigabit Ethernet interconnect.

(Figure: cluster architecture. Each rack contains some number of compute nodes, each with CPU, memory, and disk, connected by a switch; 1 Gbps between any pair of nodes in a rack, 2-10 Gbps backbone among racks.)

• Problem: if compute nodes can fail, how can we store data persistently?
• Answer: Distributed File System.
  - Provides global file namespace.
  - Examples: Google GFS, Colossus; Hadoop HDFS.

• Chunk Servers.
  - File is split into contiguous chunks, typically 64MB.
  - Each chunk is replicated (usually 3x).
  - Try to keep replicas in different racks.
• Master Node for a file.
  - Stores metadata, location of all chunks.
  - Possibly replicated.

(Figure: file chunks spread across racks of compute nodes.)

(Figure: 3-way replication of files, with copies on different racks.)

• More recent approach to stable file storage.
• Allows reconstruction of a lost chunk.
• Advantage: less redundancy for a given probability of loss.
• Disadvantage: no choice regarding where to obtain a given chunk.
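This slide appears to describe erasure coding. As a loose illustration of the idea only (real systems use Reed-Solomon-style codes over larger groups of chunks), a single XOR parity chunk lets any one lost chunk of a group be rebuilt, at the cost of one extra chunk rather than full extra copies.

from functools import reduce

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

chunks = [b"chunk-0!", b"chunk-1!", b"chunk-2!"]   # equal-sized data chunks (made up)
parity = reduce(xor_bytes, chunks)                 # the one extra chunk that is stored

# Suppose chunk 1 is lost: rebuild it from the surviving chunks plus the parity chunk.
rebuilt = reduce(xor_bytes, [chunks[0], chunks[2], parity])
assert rebuilt == chunks[1]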

A Quick Introduction
Word-Count Example
Fault-Tolerance

1. Easy parallel programming.
2. Invisible management of hardware and software failures.
3. Easy management of very-large-scale data.

• A MapReduce job starts with a collection of input elements of a single type.
  - Technically, all types are key-value pairs.
• Apply a user-written Map function to each input element, in parallel.
  - A mapper applies the Map function to a single element.
  - Many mappers are grouped in a Map task (the unit of parallelism).
• The output of the Map function is a set of 0, 1, or more key-value pairs.
• The system sorts all the key-value pairs by key, forming key-(list of values) pairs.

• Another user-written function, the Reduce function, is applied to each key-(list of values).
  - Application of the Reduce function to one key and its list of values is a reducer.
  - Often, many reducers are grouped into a Reduce task.
• Each reducer produces some output, and the output of the entire job is the union of what is produced by each reducer.
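The execution model just described fits in a few lines. This is only a hedged single-machine sketch of the logic (apply Map to every element, sort/group by key, apply Reduce to each key and its value list); it is not how Hadoop actually schedules or distributes Map and Reduce tasks.

from itertools import groupby

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: apply the user-written Map function to every input element.
    pairs = [kv for element in inputs for kv in map_fn(element)]
    # The system sorts the key-value pairs by key, forming key-(list of values) pairs.
    pairs.sort(key=lambda kv: kv[0])
    # Reduce phase: apply the user-written Reduce function to each key and its list of values.
    return [out
            for key, group in groupby(pairs, key=lambda kv: kv[0])
            for out in reduce_fn(key, [value for _, value in group])]

A word count, for instance, would pass a map_fn that emits (word, 1) for every word of a document and a reduce_fn that sums the list of 1's for each word.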

(Figure: input key-value pairs flow to mappers; the resulting key-value pairs are grouped by key and passed to reducers, which produce the output.)

• We have a large file of documents (the input elements).
• Documents are sequences of words.
• Count the number of times each distinct word appears in the file.

map(key, value):
  // key: document ID; value: text of document
  FOR (each word w IN value)
    emit(w, 1);

reduce(key, value-list):
  // key: a word; value-list: a list of integers
  result = 0;
  FOR (each integer v on value-list)
    result += v;
  emit(key, result);

The values on value-list are expected to be all 1’s, but “combiners” allow local summing of integers with the same key before passing to reducers.
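A hedged, runnable Python version of the same word count on a single machine (the sample documents are invented; in practice the job would run over files in a distributed file system via Hadoop or Spark):

from collections import defaultdict

def map_fn(doc_id, text):
    # key: document ID; value: text of the document
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # key: a word; value-list: a list of integers
    yield (word, sum(counts))

documents = {1: "the map step emits pairs", 2: "the reduce step sums the pairs"}

grouped = defaultdict(list)                       # stands in for the sort/shuffle step
for doc_id, text in documents.items():
    for key, value in map_fn(doc_id, text):
        grouped[key].append(value)

for key in sorted(grouped):
    for output in reduce_fn(key, grouped[key]):
        print(output)                             # e.g., ('the', 3), ('pairs', 2), ...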

• MapReduce is designed to deal with compute nodes failing to execute a Map task or Reduce task.
  - Re-execute failed tasks, not whole jobs.
• Key point: MapReduce tasks have the blocking property: no output is used until the task is complete.
  - Thus, we can restart a Map task that failed without fear that a Reduce task has already used some output of the failed Map task.

Relational Join
Matrix Multiplication in One or Two Rounds

• Stick the tuples of two relations together when they agree on common attributes (column names).
• Example: R(A,B) JOIN S(B,C) = {abc | ab is in R and bc is in S}.
(Figure: a table with columns A, B joined with a table with columns B, C, yielding a table with columns A, B, C.)

• Each tuple (a,b) in R is mapped to key = b, value = (R,a).
  - Note: “R” in the value is just a bit that means “this value represents a tuple in R, not S.”
• Each tuple (b,c) in S is mapped to key = b, value = (S,c).
• After grouping by keys, each reducer gets a key-list that looks like (b, [(R,a1), (R,a2),…, (S,c1), (S,c2),…]).

• For each pair (R,a) and (S,c) on the list for key b, emit (a,b,c).
  - Note this process can produce a quadratic number of outputs as a function of the list length.
• If you took CS245, you may recognize this algorithm as essentially a “parallel hash join.”
  - It’s a really efficient way to join relations, as long as you don’t have too many tuples with a common shared value.
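A minimal single-machine sketch of this join; the relations R and S are invented, and the tags "R" and "S" play the role of the bits described above.

from collections import defaultdict

R = [("a1", "b1"), ("a2", "b1"), ("a3", "b2")]   # tuples (a, b) of R(A,B)
S = [("b1", "c1"), ("b2", "c2"), ("b2", "c3")]   # tuples (b, c) of S(B,C)

# Map: tag each tuple with the relation it came from, keyed by the join attribute b.
grouped = defaultdict(list)
for a, b in R:
    grouped[b].append(("R", a))
for b, c in S:
    grouped[b].append(("S", c))

# Reduce: for each key b, pair every (R, a) with every (S, c) and emit (a, b, c).
for b, values in grouped.items():
    r_side = [v for tag, v in values if tag == "R"]
    s_side = [v for tag, v in values if tag == "S"]
    for a in r_side:
        for c in s_side:
            print((a, b, c))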

• Multiply matrix M = [m_ij] by N = [n_jk].
  - Want: P = [p_ik], where p_ik = Σ_j m_ij * n_jk.
• First pass is similar to the relational join; second pass is a group+aggregate operation.
• Typically, large relations are sparse (mostly 0’s), so we can assume the nonzero element m_ij is really a tuple of a relation (i, j, m_ij).
  - Similarly for n_jk.
  - 0 elements are not represented at all.

• The Map function: m_ij -> key = j, value = (M,i,m_ij); n_jk -> key = j, value = (N,k,n_jk).
  - As for the join, M and N here are bits indicating which relation the value comes from.
• The Reduce function: for key j, pair each (M,i,m_ij) on its list with each (N,k,n_jk) and produce key = (i,k), value = m_ij * n_jk.

• The Map function: the identity function.
  - Result is that each key (i,k) is paired with the list of products m_ij * n_jk for all j.
• The Reduce function: sum all the elements on the list, and produce key = (i,k), value = that sum.
  - I.e., each output element ((i,k),s) says that the element p_ik of the product matrix P is s.
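A hedged single-machine sketch of the two-pass algorithm on tiny invented sparse matrices; the grouping dictionaries stand in for MapReduce's sort/shuffle step.

from collections import defaultdict

# Sparse matrices as relations: (i, j, m_ij) and (j, k, n_jk); zero entries omitted.
M = [(0, 0, 2.0), (0, 1, 3.0), (1, 1, 4.0)]
N = [(0, 0, 5.0), (1, 0, 6.0), (1, 1, 7.0)]

# First pass: join M and N on j, emitting ((i, k), m_ij * n_jk).
by_j = defaultdict(lambda: ([], []))
for i, j, m in M:
    by_j[j][0].append((i, m))
for j, k, n in N:
    by_j[j][1].append((k, n))

first_pass = []                      # key-value pairs ((i, k), product)
for j, (ms, ns) in by_j.items():
    for i, m in ms:
        for k, n in ns:
            first_pass.append(((i, k), m * n))

# Second pass: group by (i, k) and sum (the group+aggregate step).
P = defaultdict(float)
for (i, k), product in first_pass:
    P[(i, k)] += product

print(dict(P))   # e.g., P[(0, 0)] = 2*5 + 3*6 = 28.0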

• We can use a single pass if:
  1. Keys (reducers) correspond to output elements (i,k).
  2. We send input elements to more than one reducer.
• The Map function: m_ij -> for all k: key = (i,k), value = (M,j,m_ij); n_jk -> for all i: key = (i,k), value = (N,j,n_jk).
• The Reduce function: for each (M,j,m_ij) on the list for key (i,k), find the (N,j,n_jk) with the same j. Multiply m_ij by n_jk and then sum the products.
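And a sketch of the one-pass version on the same invented matrices. It assumes the set of row indices of M and the set of column indices of N are known, so each element can be replicated to every reducer (i,k) that needs it.

from collections import defaultdict

M = [(0, 0, 2.0), (0, 1, 3.0), (1, 1, 4.0)]   # nonzero (i, j, m_ij)
N = [(0, 0, 5.0), (1, 0, 6.0), (1, 1, 7.0)]   # nonzero (j, k, n_jk)
I = {i for i, _, _ in M}                      # row indices of M (assumed known)
K = {k for _, k, _ in N}                      # column indices of N (assumed known)

# Map: replicate each element to every reducer (i, k) that needs it.
grouped = defaultdict(list)
for i, j, m in M:
    for k in K:
        grouped[(i, k)].append(("M", j, m))
for j, k, n in N:
    for i in I:
        grouped[(i, k)].append(("N", j, n))

# Reduce: for key (i, k), match M- and N-values with the same j and sum the products.
P = {}
for (i, k), values in grouped.items():
    m_by_j = {j: v for tag, j, v in values if tag == "M"}
    n_by_j = {j: v for tag, j, v in values if tag == "N"}
    P[(i, k)] = sum(m_by_j[j] * n_by_j[j] for j in m_by_j if j in n_by_j)

print(P)   # same product as the two-pass version, e.g., P[(0, 0)] == 28.0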

Data-Flow Systems
Bulk-Synchronous Systems
Tyranny of Communication

• MapReduce uses two ranks of tasks: one for Map, the second for Reduce.
  - Data flows from the first rank to the second.
• Generalize in two ways:
  1. Allow any number of ranks.
  2. Allow functions other than Map and Reduce.
• As long as data flow is in one direction only, we can have the blocking property and allow recovery of tasks rather than whole jobs.

• Two recent data-flow systems from Apache.
• Incorporate group+aggregate (GA) as an explicit operator.
• Allow any acyclic data flow involving these primitives.
• Are tuned to operate in main memory.
  - Big performance improvement over Hadoop in many cases.

• The 2-pass MapReduce algorithm had a second Map function that didn’t really do anything.
• We could think of it as a five-rank data-flow algorithm of the form Map-GA-Reduce-GA-Reduce, where the output forms are:
  1. (j, (M,i,m_ij)) and (j, (N,k,n_jk)).
  2. j with a list of (M,i,m_ij)’s and (N,k,n_jk)’s.
  3. ((i,k), m_ij * n_jk).
  4. (i,k) with a list of m_ij * n_jk’s.
  5. ((i,k), p_ik).

Graph Model of Data
Some Systems Using This Model

• Views all computation as a recursion on some graph.
• Graph nodes send messages to one another.
  - Messages are bunched into supersteps, where each graph node processes all data received.
  - Sending individual messages would result in far too much overhead.
• Checkpoint all compute nodes after some fixed number of supersteps.
  - On failure, roll all tasks back to the previous checkpoint.

(Figure: shortest-paths example. Node N receives messages from its neighbors of the form “I found a path from node M to you of length L” (e.g., lengths L+3, L+5, L+6). N asks: is this the shortest path from M I know about? If so, it updates its table of shortest paths to N.)
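A hedged single-machine sketch of this bulk-synchronous, message-passing style for single-source shortest paths; the graph is invented, and a real system such as Pregel or Giraph would distribute the graph nodes and checkpoint between supersteps.

import math

# Directed graph with edge lengths (invented for illustration); source node is M.
edges = {"M": [("A", 3), ("B", 5)], "A": [("N", 3)], "B": [("N", 1)], "N": []}
dist = {v: math.inf for v in edges}          # each node's best known distance from M
messages = {"M": [0]}                        # superstep 0: the source learns distance 0

while messages:                              # run supersteps until no messages are sent
    next_messages = {}
    for node, lengths in messages.items():
        best = min(lengths)
        if best < dist[node]:                # "Is this the shortest path from M I know about?"
            dist[node] = best
            for neighbor, weight in edges[node]:   # if so, tell the neighbors
                next_messages.setdefault(neighbor, []).append(best + weight)
    messages = next_messages

print(dist)    # {'M': 0, 'A': 3, 'B': 5, 'N': 6}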

• Pregel: the original, from Google.
• Giraph: open-source (Apache) Pregel.
  - Built on Hadoop.
• GraphX: a similar front end for Spark.
• GraphLab: similar system that deals more effectively with nodes of high degree.
  - Will split the work for such a graph node among several compute nodes.