
1 Introduction to MapReduce Paradigm for Data Mining
COSC 526, Class 2
Arvind Ramanathan, Computational Science & Engineering Division, Oak Ridge National Laboratory, Oak Ridge
Ph: 865-576-7266, E-mail: ramanathana@ornl.gov

2 Verification, Validation and Uncertainty Quantification of Machine Learning Algorithms: Phase I Demonstration
Last class…
–Class logistics
–Introduction to big data
–Types of data and compute systems
–The Bonferroni principle and “how not to design an experiment”
–The big data mining process

3 This class…
–The need for the MapReduce paradigm
–MapReduce decision making and the design of MapReduce algorithms
–Example usage for simple statistics: word count, co-occurrence counts

4 What is common to data mining/analytics algorithms?
Acquire (data) → Extract and Clean → Aggregate and Integrate → Represent → Analyze and Model → Interpret
–Iterate over a large set of data
–Extract some quantities of interest from the data
–Shuffle and sort the data
–Aggregate intermediate results
–Make it look pretty!

5 Traditional Architecture of Data Mining
Classical machine learning/data mining: data is fetched from disk, loaded into main memory, and processed on the CPUs. [Diagram: CPU – Memory – Disk]

6 Compute-Intensive vs. Data-Intensive Computing
Compute intensive:
–Traditionally designed to optimize floating-point operations (FLOPS)
–Key assumption: the working data set fits in main memory
–Memory bandwidth is usually high (and optimized)
–“Computationally dense”: every application must rethink how to make the best use of the compute resources
Data intensive:
–Must be optimized for data movement, storage, and analysis
–Data operations, not FLOPS, are what matter
–Key assumption: the working data set will not fit in memory (and may not even be available on the same machine)
–Current architectures are optimized for either media or transactional use

7 Compute-Intensive vs. Data-Intensive Computing (2)
[Chart: integer vs. floating-point characteristics of data mining and bioinformatics workloads]
Berkin Özisikyilmaz, Ramanathan Narayanan, Joseph Zambreno, Gokhan Memik, and Alok N. Choudhary. An architectural characterization study of data mining and bioinformatics workloads. In IISWC, pages 61–70, 2006.
Key take-home message: current compute architectures are not optimized for data mining/analytics operations!

8 Programmers shoulder the responsibility in traditional HPC environments!
[Diagrams: message passing among processes P1–P5; shared memory across P1–P5]
–Issues related to scheduling, data distribution, synchronization, inter-process communication, etc.
–Architectural considerations: SIMD/MIMD, network topology, etc.
–OS issues: mutexes, deadlocks, etc.

9 Scalable Algorithms for Data Mining
–Data sizes are vast (> 100 terabytes)
–Even assuming a nominal read speed of 35 MB/s, it can take over a month just to access/read the data!
–How about answering more useful questions: the number of categories, the types of datasets represented, etc.? That takes even longer!!
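The back-of-the-envelope claim above is easy to check (assuming exactly 100 TB and a sustained sequential read speed of 35 MB/s):

```python
# Time to scan 100 TB at a nominal sequential read speed of 35 MB/s.
data_bytes = 100 * 10**12          # 100 TB (decimal units)
read_speed = 35 * 10**6            # 35 MB/s
seconds = data_bytes / read_speed
days = seconds / (24 * 3600)
print(f"{days:.0f} days")          # roughly 33 days -- over a month
```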

10 Challenges
–How do we ease access to the data? We need (reasonably) fast and efficient, (somewhat) fault-tolerant access → Hadoop Distributed File System (HDFS) / Google File System (GFS)
–How do we distribute computation? Parallel programming is hard! Use commodity clusters for processing → Hadoop MapReduce / Google MapReduce
MapReduce is an elegant paradigm for working with big data.

11 What would you change in the underlying architectures?
–Hybrid Memory Cube (HMC)
–Non-volatile random access memory
–Global address space (GAS)
Synergistic Challenges in Data-Intensive Science and Exascale Computing (DOE ASCAC report, 2013)

12 Let’s talk about distributing computations
[Diagram: a commodity cluster — many nodes, each with its own CPU, memory, and disk, connected by switches]
What do we do when we have supercomputers?

13 Programming Model for Data Mining
Transferring data over the network takes time. Key ideas:
–Bring the computation close to the data
–Replicate the data multiple times for reliability
MapReduce provides:
–A storage infrastructure (at Google: GFS; in this class: Hadoop HDFS)
–A programming model: a parallel paradigm that is easier than conventional MPI

14 MapReduce Architecture

15 This is not the first lecture on MapReduce…
Material is (in part) inspired by:
–William Cohen’s lectures (10-601 class at CMU)
–Jure Leskovec (Stanford)
–Aditya Prakash (Virginia Tech)
–Cloudera, Google, and many, many others!
Materials “redrawn”, “reorganized”, and “reworked” to reflect how we use it.

16 1st Key Idea
MapReduce: bring computations close to the data.
Programmers specify two functions:
–map(in_key, in_value) → list(out_key, intermediate_value)
–reduce(out_key, list(intermediate_value)) → list(out_value)
All values with the same key are reduced together.
Let the “runtime” handle everything else: scheduling, I/O, networking, inter-process communication, etc.

17 Visual interpretation of MapReduce
[Diagram: input records (k1, v1) … (k6, v6) flow through map, which emits pairs (a, 1), (b, 1), (c, 4), (b, 6), (a, 5), (d, 3), (b, 5), (c, 4); Shuffle & Sort aggregates values by key: a → [1, 5], b → [1, 6, 5], c → [4, 4], d → [3]; reduce produces a → 6, b → 12, c → 8, d → 3]
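This data flow can be sketched as a minimal in-memory simulation (the names `run_mapreduce`, `map_fn`, and `reduce_fn` are illustrative, not part of any real framework’s API):

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Toy, single-process simulation of the map -> shuffle/sort -> reduce flow."""
    # Map phase: each input record emits zero or more (key, value) pairs.
    intermediate = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)   # shuffle: group by key
    # Reduce phase: all values with the same key are reduced together.
    return {k: reduce_fn(k, vs) for k, vs in sorted(intermediate.items())}

# Reproduce the diagram: the pairs emitted by map, summed by reduce.
pairs = [("a", 1), ("b", 1), ("c", 4), ("b", 6), ("a", 5), ("d", 3), ("b", 5), ("c", 4)]
result = run_mapreduce(pairs, lambda k, v: [(k, v)], lambda k, vs: sum(vs))
print(result)   # {'a': 6, 'b': 12, 'c': 8, 'd': 3}
```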

18 Other things the programmer also specifies
partition(out_key, numberOfPartitions):
–A simple hash, e.g., hash(out_key) mod n
–Divides the key space for parallel reduce operations
combine(out_key, intermediate_values) → list:
–A “mini-reduce” function that runs in memory after the map phase
–Optimizes network traffic
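Both hooks can be sketched in a few lines (function names are illustrative; a summing combiner like this one is only valid because addition is associative and commutative):

```python
def partition(out_key, num_partitions):
    # Assign each key to one of num_partitions reducers via a simple hash.
    return hash(out_key) % num_partitions

def combine(out_key, intermediate_values):
    # "Mini-reduce" run on the mapper's local output before it crosses the
    # network: many (key, 1) pairs collapse into a single (key, count) pair.
    return [(out_key, sum(intermediate_values))]

# A mapper that saw the word "toast" three times ships one pair, not three.
print(combine("toast", [1, 1, 1]))        # [('toast', 3)]
print(partition("toast", 4) in range(4))  # True
```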

19 Now, what does MapReduce look like?
[Diagram: the data flow of slide 17, extended with a combine step after map (e.g., two local d values merged into d → 8) and a partition step that routes keys to reducers before Shuffle & Sort]

20 Let’s understand the MapReduce runtime
–Scheduling: workers are assigned to map and reduce tasks
–Data distribution: processes are moved to the data
–Synchronization: intermediate data is gathered, sorted, and shuffled
–Fault tolerance: worker failures are detected and tasks are restarted
All on top of the Hadoop Distributed File System.

21 Now for an example: WordCount
Let’s look at a small corpus of documents:
–“Joe likes toast”
–“Jane likes toast with jam”
–“Joe burnt toast”
How do we write the algorithm?

22 WordCount (2)
def map(String doc_id, String text):
  for each word w in text:
    emit(w, 1)

def reduce(String term, Iterator values):
  int sum = 0
  for each v in values:
    sum += v
  emit(term, sum)
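As a runnable sketch, the pseudocode above translates directly to Python, driven by the three-document toast corpus (single process; a dictionary stands in for the shuffle-and-sort phase):

```python
import re
from collections import defaultdict

corpus = {
    "doc1": "Joe likes toast",
    "doc2": "Jane likes toast with jam",
    "doc3": "Joe burnt toast",
}

def map_fn(doc_id, text):
    # Emit (word, 1) for every word, lowercased so "Joe" == "joe".
    for w in re.findall(r"\w+", text.lower()):
        yield (w, 1)

def reduce_fn(term, values):
    return (term, sum(values))

# Shuffle & sort: group the mappers' output by key.
groups = defaultdict(list)
for doc_id, text in corpus.items():
    for word, count in map_fn(doc_id, text):
        groups[word].append(count)

counts = dict(reduce_fn(t, vs) for t, vs in groups.items())
print(counts)  # {'joe': 2, 'likes': 2, 'toast': 3, 'jane': 1, 'with': 1, 'jam': 1, 'burnt': 1}
```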

23 Now, what does MapReduce look like?
[Data-flow diagram repeated from slide 19; focus on the map step]

24 WordCount (3): Slow-Motion (SloMo) Map
[Diagram: each document’s words emitted as (word, 1) pairs]

25 Now, what does MapReduce look like?
[Data-flow diagram repeated from slide 19; focus on the shuffle & sort step]

26 WordCount (4): SloMo Shuffle & Sort
Input (map output): (Joe, 1) (likes, 1) (toast, 1) | (Jane, 1) (likes, 1) (toast, 1) (with, 1) (jam, 1) | (Joe, 1) (burnt, 1) (toast, 1)
Output (grouped by key): Joe → [1, 1]; Jane → [1]; likes → [1, 1]; toast → [1, 1, 1]; with → [1]; jam → [1]; burnt → [1]

27 Now, what does MapReduce look like?
[Data-flow diagram repeated from slide 19; focus on the reduce step]

28 WordCount (5): SloMo Reduce
Input: Joe → [1, 1]; Jane → [1]; likes → [1, 1]; toast → [1, 1, 1]; with → [1]; jam → [1]; burnt → [1]
Output: Joe → 2; Jane → 1; likes → 2; toast → 3; with → 1; jam → 1; burnt → 1

29 A look under the hood: what happens when you invoke WordCount?
–The input is split into chunks (e.g., 64 MB per piece), and multiple copies of the program are forked across the cluster
–The master task is special: it assigns the M map tasks and R reduce tasks to idle workers
–Each map worker reads the split it is assigned; key-value pairs are written to a buffer, then to local disk
–Reduce workers are notified by the master about the locations of the intermediate files, which they read remotely
–Reduce workers sort the intermediate data by key and write the final results (e.g., OutputFile0, OutputFile1)

30 How do commodity clusters use this?
[Diagram: compute nodes connected to NAS/SAN storage]
Main problem: how do we handle data storage + compute together?

31 2nd Key Idea
MapReduce: replicate the data multiple times for reliability.
Hadoop Distributed File System:
–Store data (and its replicas) on the local disks
–Start jobs on the nodes that hold the data
Why?
–There is not enough RAM to hold the data in main memory
–Disk access is slow, but disk throughput is usually high

32 File storage design
–Files are stored as chunks (e.g., 128 MB)
–Reliability through replication: each chunk is replicated across 3+ chunkservers
–A single master coordinates access and metadata (centralized management)
–No data caching: little benefit for large data and streaming reads
–Simple API
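The storage cost of this design is easy to work out. As a sketch, assuming the 128 MB chunks and 3-way replication stated above, a 1 GB file occupies:

```python
# How a 1 GB file is stored under chunking + replication.
chunk_size = 128 * 2**20            # 128 MB chunks
replication = 3                     # each chunk kept on 3+ chunkservers
file_size = 1 * 2**30               # a 1 GB file

chunks = -(-file_size // chunk_size)        # ceiling division: 8 chunks
stored = chunks * chunk_size * replication  # raw bytes on disk
print(chunks, stored // 2**20)              # 8 chunks, 3072 MB stored
```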

33 How does HDFS work?
–The NameNode stores the cluster metadata
–Files and directories are represented by inodes, which store attributes such as permissions
–File data is stored across DataNodes, replicated for fault tolerance

34 WordCount code: main
[Java driver code shown on slide]
Other input formats: KeyValueInputFormat, SequenceFileInputFormat

35 WordCount: map function
[Java mapper code shown on slide]

36 WordCount: reduce function
[Java reducer code shown on slide]
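The Java classes on these slides are not reproduced in this transcript. As an alternative sketch, the same map and reduce logic can be expressed in the Hadoop Streaming style, where mappers and reducers exchange tab-separated lines of text (the local pipeline below stands in for `cat docs | mapper | sort | reducer`):

```python
from itertools import groupby

def mapper(lines):
    # Streaming-style mapper: read raw text lines, emit "word\t1" lines.
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(lines):
    # Streaming-style reducer: input arrives sorted by key, so consecutive
    # lines sharing a word can be summed with groupby.
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(c) for _, c in group)}"

# Simulate the shuffle with sorted(); on a cluster, Hadoop does this step.
docs = ["Joe likes toast", "Jane likes toast with jam", "Joe burnt toast"]
for out in reducer(sorted(mapper(docs))):
    print(out)
```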

37 MapReduce limits
–Moving data is very expensive: writing and reading are both costly
–No reduce job can start until all map jobs are done and the data in its partition has been shuffled and sorted

38 Limitations of MapReduce
–No control over the order in which reduce jobs run; the only ordering guarantee is that reduce jobs start after the map jobs finish
–Assume that the map and reduce jobs will run across different machines and different memory spaces

39 Programming pitfalls
–Don’t set a static variable and assume other processes can read it. They can’t. It appears to work when run locally, but it doesn’t on a cluster.
–Don’t communicate between mappers or between reducers: the overhead is high, you don’t know which mappers/reducers are actually running at any given point, and there’s no easy way to find out what machine they’re running on (you shouldn’t be looking for them anyway).
Thanks to Shannon Quinn for these pointers!

40 Designing MapReduce Algorithms

41 A slightly more complex example: term co-occurrence
Given a large text collection, compute a matrix over all words:
–M = an N × N matrix (N = vocabulary size)
–M_ij = the number of times terms i and j co-occur in a sentence
Why?
–Distributional profiles are a way of measuring semantic distance
–Semantic distance is important for NLP tasks

42 Example of a large counting problem
Term co-occurrence matrix computation involves:
–A large event space (the number of terms)
–A large number of observations (the number of documents)
–Keeping track of interesting statistics about the events
Approach: mappers generate partial counts; reducers aggregate them.

43 First approach: “Pairs”
Each mapper takes a sentence:
–Generate all co-occurring term pairs
–For each pair, emit ((a, b), count)
Reducers sum the counts associated with each pair; use combiners to aggregate partial results.
Advantages: easy to implement and understand.
Disadvantages: the upper bound on the number of pairs is unknown.
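A toy single-process sketch of the pairs approach (the corpus is illustrative, tokenization is just whitespace `split`, and a dictionary simulates the shuffle and reduce):

```python
from collections import defaultdict
from itertools import permutations

sentences = ["a b c", "a b", "b c c"]  # toy corpus: one sentence per document

def pairs_map(sentence):
    # Emit ((a, b), 1) for every ordered pair of co-occurring terms.
    words = sentence.split()
    for i, j in permutations(range(len(words)), 2):
        yield ((words[i], words[j]), 1)

# Shuffle & sort, then reduce: sum the counts for each pair.
counts = defaultdict(int)
for s in sentences:
    for pair, c in pairs_map(s):
        counts[pair] += c

print(counts[("a", "b")])  # 2: "a" and "b" co-occur in the first two sentences
```

Every single co-occurrence crosses the network as its own key-value pair, which is exactly the shuffle-volume disadvantage noted above.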

44 Second approach: “Stripes”
Group the pairs for a term into an associative array. Each mapper takes a sentence:
–Generate all co-occurring term pairs
–For each term a, emit a → {b: count_b, c: count_c, …}
Reducers perform an element-wise sum of the associative arrays, e.g.:
a → {b: 1, d: 5, e: 3}
a → {b: 1, c: 2, d: 5, f: 2}
gives a → {b: 2, c: 2, d: 10, e: 3, f: 2}
Advantages: far less sorting and shuffling of key-value pairs; can make better use of combiners.
Disadvantages: more difficult to implement; the intermediate objects are “larger” than typical intermediate results.
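The element-wise sum of stripes maps directly onto Python’s `collections.Counter` (a sketch on the same toy corpus as before; in a real Hadoop job each stripe would be a serialized map):

```python
from collections import Counter, defaultdict

def stripes_map(sentence):
    # For each term, emit one stripe: a map from co-occurring term to count.
    words = sentence.split()
    for i, a in enumerate(words):
        yield (a, Counter(words[:i] + words[i + 1:]))

# Reduce: element-wise sum of all stripes emitted for the same term.
merged = defaultdict(Counter)
for s in ["a b c", "a b", "b c c"]:
    for term, stripe in stripes_map(s):
        merged[term] += stripe

print(dict(merged["a"]))  # {'b': 2, 'c': 1}
```

Note that one stripe per term replaces many individual pair records, which is why far fewer (but larger) key-value pairs are shuffled.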

45 How do the runtimes compare?
[Chart: running time of the pairs vs. stripes implementations]

46 Summary and To-Dos

47 Summary
MapReduce is a big-data programming paradigm built around map jobs and reduce jobs. Careful consideration of data movement is required.

48 Notes, and what to expect next
–Please form project teams as soon as possible: 2 is good; 3 is okay. More team members → higher expectations!
–Assignment 1 is due today!
–Additional notes on Hadoop have been posted on the website
Next class:
–Probability and statistics review basics
–Naïve Bayes and logistic regression on Hadoop

49 THANK YOU!!!

