Computing & Information Sciences, Kansas State University
How-To, Wednesday, 24 Mar 2010
Laboratory for Knowledge Discovery in Databases

KSU CIS Department How-To
Getting Started with Google MapReduce in C++, Apache Hadoop, and R
William H. Hsu
Laboratory for Knowledge Discovery in Databases
Department of Computing and Information Sciences
Kansas State University
Slides for this tutorial:

What This How-To Is
- Overview of MapReduce
  - Basic Definitions and Brief Synopsis
  - Deciding When to Use: Pros and Cons
- Installation/Compilation Guide for MapReduce
  - C++
  - Apache Hadoop
  - R
- Programming Resources and References

What This How-To Is Not
- Lecture or Seminar on MapReduce Algorithm
  - Functional Programming Foundations
  - Analyzing Performance
  - Applications Survey
- Tutorial on Platforms: C++, Hadoop, R
- Full Workshop
  - Parallel Computing
  - Distributed Computing

Simple Motivating Example [1]: Distributed Grep
Very large text collection -> Split data -> grep -> matches -> cat -> All matches
Adapted from slide © 2006 Hendra Setiawan, National University of Singapore

Simple Motivating Example [2]: Distributed Word Count
Very large text collection -> Split data -> count -> sum -> total count
Adapted from slide © 2006 Hendra Setiawan, National University of Singapore

What Is MapReduce?
- Programming Model and Associated Implementation
- Characteristics and Purpose
  - Processing large data sets
  - Exploiting large sets of commodity computers
  - Executing processes in distributed manner
  - Offers high degree of transparency
- Other Goals: Simplicity, Generality, Scalability
- May Be Suitable for Your Task
Adapted from slide © 2006 Hendra Setiawan, National University of Singapore

Building Blocks: Map
Adapted from slide © 2009 Christoph Meinel, Hasso-Plattner Institute

Building Blocks: Reduce
Adapted from slide © 2009 Christoph Meinel, Hasso-Plattner Institute

Building Blocks: Map/Reduce [1]
Adapted from slide © 2009 Christoph Meinel, Hasso-Plattner Institute

Building Blocks: Map/Reduce [2]
Adapted from slide © 2009 Christoph Meinel, Hasso-Plattner Institute

MapReduce Architecture
- Map
  - Accepts input key/value pair
  - Emits intermediate key/value pair
- Reduce
  - Accepts intermediate key/value* pair
  - Emits output key/value pair
Diagram labels: MAP, Partitioning Function, REDUCE, Result
Adapted from slide © 2006 Hendra Setiawan, National University of Singapore

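The map, partition, and reduce stages listed on this slide can be sketched in a single process. This is only an illustration under stated assumptions, not the Google or Hadoop implementation: the names `run_mapreduce`, `station_map`, and `max_reduce`, the two-partition default, and the temperature readings are all hypothetical.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn, num_partitions=2):
    """Single-process sketch of the map -> partition -> reduce flow."""
    # One dictionary of key -> list of values per (hypothetical) reducer.
    partitions = [defaultdict(list) for _ in range(num_partitions)]

    # Map phase: each input key/value pair yields intermediate pairs.
    for key, value in records:
        for ikey, ivalue in map_fn(key, value):
            # Partitioning function: route each intermediate key to one reducer.
            index = hash(ikey) % num_partitions
            partitions[index][ikey].append(ivalue)

    # Reduce phase: each reducer sees a key with all of its values (key/value*).
    output = {}
    for partition in partitions:
        for ikey in sorted(partition):
            for okey, ovalue in reduce_fn(ikey, partition[ikey]):
                output[okey] = ovalue
    return output


def station_map(line_number, line):
    """Input: (line number, 'city temperature'). Emits (city, temperature)."""
    city, temperature = line.split()
    yield city, int(temperature)


def max_reduce(city, temperatures):
    """Input: (city, all temperatures for that city). Emits (city, maximum)."""
    yield city, max(temperatures)


# Made-up readings, used only to exercise the sketch.
readings = [(0, "Topeka 28"), (1, "Wichita 35"), (2, "Topeka 31"), (3, "Wichita 30")]
result = run_mapreduce(readings, station_map, max_reduce)
print(result)
```

The result does not depend on which partition a key lands in, because every value for a given key is routed to the same partition before reduce runs.
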
Example Applications: Distributed Grep, WC Revisited
- Distributed Grep
  - Map: if match(value, pattern) emit(value, 1)
  - Reduce: emit(key, sum(value*))
- Distributed Word Count
  - Map: for all w in value do emit(w, 1)
  - Reduce: emit(key, sum(value*))
Adapted from slide © 2006 Hendra Setiawan, National University of Singapore

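The two pseudocode examples above can be written out as a small runnable sketch. The in-memory `map_reduce` helper, the function names, and the sample lines are assumptions made for illustration; a real job would hand the same map and reduce functions to a framework such as Hadoop.

```python
import re
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Tiny in-memory stand-in for a MapReduce run (illustration only)."""
    groups = defaultdict(list)
    for key, value in records:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)  # group intermediate values by key
    return dict(pair for ikey in groups for pair in reduce_fn(ikey, groups[ikey]))


def make_grep_map(pattern):
    """Build a map function: if match(value, pattern) emit(value, 1)."""
    compiled = re.compile(pattern)

    def grep_map(key, value):
        if compiled.search(value):
            yield value, 1

    return grep_map


def word_count_map(key, value):
    """for all w in value do emit(w, 1)"""
    for word in value.split():
        yield word, 1


def sum_reduce(key, values):
    """emit(key, sum(value*)); shared by both applications."""
    yield key, sum(values)


# Hypothetical input: (line number, line text) pairs.
lines = [(0, "the quick brown fox"), (1, "the lazy dog"), (2, "the quick brown fox")]

grep_result = map_reduce(lines, make_grep_map(r"quick"), sum_reduce)
word_counts = map_reduce(lines, word_count_map, sum_reduce)
print(grep_result)
print(word_counts)
```

Both applications share the same reduce function; only the map function differs, which is the point the slide makes by listing them side by side.
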
Word Count Example Illustrated
Adapted from slide © 2009 Christoph Meinel, Hasso-Plattner Institute

Distributed Sort [1]: Mapping To “Pre-Sorted” Buckets
Adapted from slide © 2006 Hendra Setiawan, National University of Singapore
See also: HP Labs technical note on TeraSort

Distributed Sort [2]: Partition Function
- Default: hash(key) mod R
- Guarantee
  - Relatively well-balanced partitions
  - Ordering guarantee within partition
- Distributed Sort
  - Map: emit(key, value)
  - Reduce (with R=1): emit(key, value)
Adapted from slide © 2006 Hendra Setiawan, National University of Singapore
See also: HP Labs technical note on TeraSort

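The role of the partition function in a distributed sort can be illustrated with a single-process sketch. The names `hash_partition`, `make_range_partition`, and `distributed_sort`, the boundary values, and the sample records are assumptions for illustration only. The range partitioner is one way to get the "pre-sorted buckets" of the previous slide; it is not the default named on this one.

```python
import bisect


def hash_partition(key, num_reducers):
    """Default partitioner from the slide: hash(key) mod R."""
    return hash(key) % num_reducers


def make_range_partition(boundaries):
    """Partitioner that sends key ranges to reducers in increasing order.

    `boundaries` must be sorted and contain num_reducers - 1 split points.
    """
    def range_partition(key, num_reducers):
        return bisect.bisect_right(boundaries, key)

    return range_partition


def distributed_sort(records, partition_fn, num_reducers):
    buckets = [[] for _ in range(num_reducers)]
    for key, value in records:
        # Map: emit(key, value) unchanged; the partitioner picks the reducer.
        buckets[partition_fn(key, num_reducers)].append((key, value))
    output = []
    for bucket in buckets:
        # Ordering guarantee within a partition: keys arrive sorted.
        bucket.sort(key=lambda pair: pair[0])
        # Reduce: emit(key, value) unchanged.
        output.extend(bucket)
    return output


records = [(42, "a"), (7, "b"), (99, "c"), (15, "d"), (63, "e")]

single = distributed_sort(records, hash_partition, 1)                    # R = 1
ranged = distributed_sort(records, make_range_partition([20, 60]), 3)   # R = 3
hashed = distributed_sort(records, hash_partition, 3)                    # R = 3

print([key for key, _ in single])
print([key for key, _ in ranged])
print([key for key, _ in hashed])
```

With R=1 the single reducer sees every key in order, so the output is globally sorted. With several reducers, hash partitioning only sorts within each partition, while range partitioning keeps the concatenated output sorted as well.
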
Rationale: The Need for MapReduce
Adapted from slide © 2007 Scott Beamer, University of California – Berkeley (CS61C Machine Structures)

Functional Programming and Parallelism [1]
- reduce (aka foldr)
    (reduce + (map square '(1 2 3)))
    (reduce + '(1 4 9))
    14
- Pure functional programming: easily parallelizable
  - Do you see how you could parallelize above evaluation?
  - What if reduce function argument were associative? Would that help?
Adapted from slide © 2007 Scott Beamer, University of California – Berkeley (CS61C Machine Structures)

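The questions on this slide can be explored with a short sketch. It mirrors the Scheme expression using Python's built-in `map` and `functools.reduce`, then shows why associativity matters: partial results computed on separate chunks can be combined safely only if the operator is associative. The helper `chunked_reduce` is hypothetical, assumes a non-empty input, and runs the chunks sequentially; in a real system each chunk could go to a different worker.

```python
from functools import reduce
from operator import add, sub


def square(x):
    return x * x


# Python counterpart of (reduce + (map square '(1 2 3))).
squares = list(map(square, [1, 2, 3]))  # [1, 4, 9]
total = reduce(add, squares)            # 14


def chunked_reduce(op, items, num_chunks):
    """Reduce each chunk independently, then combine the partial results.

    Assumes `items` is non-empty. Each chunk's reduce is independent of the
    others, so the chunks could be processed in parallel.
    """
    size = max(1, -(-len(items) // num_chunks))  # ceiling division
    chunks = [items[i:i + size] for i in range(0, len(items), size)]
    partials = [reduce(op, chunk) for chunk in chunks]
    return reduce(op, partials)


data = [square(n) for n in range(1, 101)]

# Addition is associative: chunked and sequential results agree.
sequential_sum = reduce(add, data)
chunked_sum = chunked_reduce(add, data, 4)

# Subtraction is not associative: regrouping changes the answer.
sequential_diff = reduce(sub, [10, 3, 2, 1])         # ((10 - 3) - 2) - 1 = 4
chunked_diff = chunked_reduce(sub, [10, 3, 2, 1], 2)  # (10 - 3) - (2 - 1) = 6

print(total, sequential_sum, chunked_sum, sequential_diff, chunked_diff)
```

The `map` step is parallelizable with no conditions, since each element is squared independently. The `reduce` step needs the associativity condition before it can be split across workers.
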
Functional Programming and Parallelism [2]
- Imagine 10,000-machine cluster
  - Ready to help you compute anything you could cast as MapReduce problem!
- Google is famous for developing this abstraction
  - … but their Reduce is not the same as functional programming reduce
  - Builds a reverse-lookup table
- Hides lots of difficulty of writing parallel code!
  - System takes care of load balancing, dead machines, etc.
Adapted from slide © 2007 Scott Beamer, University of California – Berkeley (CS61C Machine Structures)

MapReduce Transparencies
- Google Distributed File System
- Features
  - Parallel I/O
  - Fault-tolerance
  - Locality optimization
  - Load-balancing
Adapted from slide © 2006 Hendra Setiawan, National University of Singapore

When To Use MapReduce
- Available Compute Cluster
- Large Data Set
  - Text corpora
  - Web documents
  - Raw numerical data (e.g., signals, sequences)
- Data (Assumed to Be) Independent
- Can Be Cast into map and reduce
Adapted from slide © 2006 Hendra Setiawan, National University of Singapore

Preliminaries Under Linux
- Download using lynx
- Unpack: bzcat mapreduce.tar.bz2 | tar -xf -
- Set up rsync
- Start inetd (or xinetd)
- Fix type errors in MapReduceScheduler.c
- Compile using make

C++ Implementation [1]
- Complete Tutorial
- Download
- Unpack and Verify
Adapted from tutorial © 2007 Dan Gibson, University of Wisconsin-Madison

C++ Implementation [2]: (Function) Arguments to Scheduler
Setting up sched_args
Adapted from tutorial © 2007 Dan Gibson, University of Wisconsin-Madison

C++ Implementation [3]: Map Function
map Function Setup
Adapted from tutorial © 2007 Dan Gibson, University of Wisconsin-Madison

C++ Implementation [3]: Reduce Function
reduce Function Setup
Adapted from tutorial © 2007 Dan Gibson, University of Wisconsin-Madison

C++ Implementation [4]: Key Comparison Function
Setting up intcmp
Adapted from tutorial © 2007 Dan Gibson, University of Wisconsin-Madison

C++ Implementation [5]: Compilation
Output of make
Adapted from tutorial © 2007 Dan Gibson, University of Wisconsin-Madison

C++ Implementation [6]: Execution
Call to map_reduce_scheduler and Follow-Up Statements
Adapted from tutorial © 2007 Dan Gibson, University of Wisconsin-Madison

Hadoop Implementation
Download and Documentation: Tutorials
- Cloudera (Video):
- Apache (Written):
Cover slide from tutorial © 2009 Cloudera

R Implementation
Downloads and Documentation
- Comprehensive R Archive Network (CRAN) package
  - R interpreter:
  - MapReduce in CRAN:
  - Example from Open Data Group:
Adapted from tutorial © 2009 Cloudera

Programming Resources and References
- Basic Tutorials
  - Setiawan, National University of Singapore
  - Meinel, Hasso-Plattner Institute
  - Beamer, Berkeley
- Algorithm Design
  - Google
  - Apache
- Implementations
  - Gibson, C++ version for Linux & Solaris
  - Cutting, Hadoop version
  - Brown, R version (CRAN)
- Other Tutorials
  - Chris Olston, Yahoo Research
  - Google Code

Acknowledgements
- Tutorial Material
  - Hendra Setiawan, National University of Singapore
  - Christoph Meinel, Hasso-Plattner Institute
  - Scott Beamer, Berkeley
- Algorithm Design
  - Google (Jeffrey Dean, Sanjay Ghemawat): Original MapReduce
  - Apache Software Foundation (Doug Cutting, now of Cloudera): Hadoop
- Implementations
  - Dan Gibson, University of Wisconsin-Madison: C++ version
  - Doug Cutting, Cloudera: Hadoop version
  - Chris Brown, Open Data Group: R version (CRAN)
- Thanks Also To
  - Alley Stoughton, Kansas State University: K-State CIS How-To Series
  - Chris Olston, Yahoo Research: talks on data parallelism, PIG (DSSI-2007)