Published by Andra Ray. Modified over 9 years ago.
Computing & Information Sciences, Kansas State University
How-To, Wednesday, 24 Mar 2010
Laboratory for Knowledge Discovery in Databases

KSU CIS Department How-To
William H. Hsu, http://www.cis.ksu.edu/~bhsu
Laboratory for Knowledge Discovery in Databases (www.kddresearch.org)
Department of Computing and Information Sciences, Kansas State University
Slides for this tutorial:
Getting Started with Google MapReduce in C++, Apache Hadoop, and R

What This How-To Is
- Overview of MapReduce
  - Basic Definitions and Brief Synopsis
  - Deciding When to Use: Pros and Cons
- Installation/Compilation Guide for MapReduce
  - C++
  - Apache Hadoop
  - R
- Programming Resources and References

What This How-To Is Not
- Lecture or Seminar on MapReduce
  - Algorithm
  - Functional Programming Foundations
  - Analyzing Performance
  - Applications Survey
- Tutorial on Platforms: C++, Hadoop, R
- Full Workshop
  - Parallel Computing
  - Distributed Computing

Simple Motivating Example [1]: Distributed Grep
Diagram: very large text collection -> split data -> grep -> matches -> cat -> all matches
Adapted from slide © 2006 Hendra Setiawan, National University of Singapore, http://bit.ly/9KOR3b
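
The split / grep / cat pipeline in the diagram can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (split_lines, grep_chunk, distributed_grep) and the sample corpus are hypothetical, and the chunks are processed one after another in a single process rather than on separate machines.

```python
import re

def split_lines(lines, n_chunks):
    """Split a list of lines into roughly equal contiguous chunks."""
    size = max(1, -(-len(lines) // n_chunks))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]

def grep_chunk(chunk, pattern):
    """Per-chunk 'grep' step: keep only the lines that match the pattern."""
    regex = re.compile(pattern)
    return [line for line in chunk if regex.search(line)]

def distributed_grep(lines, pattern, n_chunks=3):
    """Split the data, grep each chunk independently, then concatenate ('cat')."""
    chunks = split_lines(lines, n_chunks)
    partial = [grep_chunk(c, pattern) for c in chunks]  # each call is independent
    return [line for matches in partial for line in matches]

# Hypothetical sample data, standing in for a very large text collection.
corpus = [
    "map emits pairs",
    "reduce sums values",
    "a map is a function",
    "grep finds lines",
]
print(distributed_grep(corpus, r"\bmap\b"))
# prints ['map emits pairs', 'a map is a function']
```

Because each grep_chunk call touches only its own chunk, the calls could be handed to separate workers without changing the result.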

Simple Motivating Example [2]: Distributed Word Count
Diagram: very large text collection -> split data -> count -> sum -> total count
Adapted from slide © 2006 Hendra Setiawan, National University of Singapore, http://bit.ly/9KOR3b
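
A minimal sketch of the split / count / sum steps for word count, assuming whitespace-separated, case-insensitive words. The names count_chunk and distributed_word_count are hypothetical, and the per-chunk counts are computed sequentially here.

```python
from collections import Counter

def count_chunk(chunk):
    """Per-chunk 'count' step: word frequencies for one piece of the data."""
    return Counter(word for line in chunk for word in line.lower().split())

def distributed_word_count(lines, n_chunks=2):
    """Split the data, count each chunk independently, then sum the counts."""
    size = max(1, -(-len(lines) // n_chunks))  # ceiling division
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    partial_counts = [count_chunk(c) for c in chunks]  # independent per chunk
    total = Counter()
    for counts in partial_counts:  # the 'sum' step
        total.update(counts)
    return total

print(distributed_word_count(["a b a", "b c", "a"]))
```

Summing Counter objects is order-independent, which is what lets the partial counts arrive from the workers in any order.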

Outline
- Overview of MapReduce
  - Basic Definitions and Brief Synopsis
  - Deciding When to Use: Pros and Cons
- Installation/Compilation Guide for MapReduce
  - C++
  - Apache Hadoop
  - R
- Programming Resources and References

What Is MapReduce?
- Programming Model and Associated Implementation
- Characteristics and Purpose
  - Processing large data sets
  - Exploiting large sets of commodity computers
  - Executing processes in distributed manner
  - Offers high degree of transparency
- Other Goals: Simplicity, Generality, Scalability
- May Be Suitable for Your Task
Adapted from slide © 2006 Hendra Setiawan, National University of Singapore, http://bit.ly/9KOR3b

Building Blocks: Map
Adapted from slide © 2009 Christoph Meinel, Hasso-Plattner Institute, http://bit.ly/bToUx2

Building Blocks: Reduce
Adapted from slide © 2009 Christoph Meinel, Hasso-Plattner Institute, http://bit.ly/bToUx2

Building Blocks: Map/Reduce [1]
Adapted from slide © 2009 Christoph Meinel, Hasso-Plattner Institute, http://bit.ly/bToUx2

Building Blocks: Map/Reduce [2]
Adapted from slide © 2009 Christoph Meinel, Hasso-Plattner Institute, http://bit.ly/bToUx2

MapReduce Architecture
- Map
  - Accepts input key/value pair
  - Emits intermediate key/value pair
- Reduce
  - Accepts intermediate key/value* pair (a key together with the list of all values emitted for it)
  - Emits output key/value pair
Diagram: MAP -> Partitioning Function -> REDUCE -> Result
Adapted from slide © 2006 Hendra Setiawan, National University of Singapore, http://bit.ly/9KOR3b
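
The role of the partitioning function between Map and Reduce can be sketched as follows. This is an illustration, not any framework's actual API: partition and shuffle are hypothetical names, and zlib.crc32 is used as the hash only because Python's built-in hash() of strings varies between runs.

```python
import zlib
from collections import defaultdict

def partition(key, num_reducers):
    """Hash-style partitioner: deterministic hash of the key, modulo R."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(intermediate_pairs, num_reducers):
    """Route each intermediate (key, value) pair to one reducer, grouped by key."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in intermediate_pairs:
        buckets[partition(key, num_reducers)][key].append(value)
    return buckets

# Hypothetical intermediate output of a word-count Map step.
pairs = [("the", 1), ("cat", 1), ("the", 1), ("hat", 1)]
buckets = shuffle(pairs, num_reducers=2)
for reducer_id, groups in enumerate(buckets):
    print(reducer_id, dict(groups))
```

Since the partition depends only on the key, every value for a given key reaches the same reducer, which is what allows Reduce to see the complete key/value* pair.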

Example Applications: Distributed Grep, WC Revisited
- Distributed Grep
  - Map: if match(value, pattern) emit(value, 1)
  - Reduce: emit(key, sum(value*))
- Distributed Word Count
  - Map: for all w in value do emit(w, 1)
  - Reduce: emit(key, sum(value*))
Adapted from slide © 2006 Hendra Setiawan, National University of Singapore, http://bit.ly/9KOR3b
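
The Map and Reduce pseudocode above translates fairly directly into Python. The sketch below assumes a tiny sequential driver, run_mapreduce, which is a hypothetical stand-in for the real scheduler; the record keys are just line numbers and are ignored by both Map functions.

```python
import re
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Sequential stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}

def make_grep_map(pattern):
    """Distributed grep Map: if match(value, pattern) emit(value, 1)."""
    regex = re.compile(pattern)
    def grep_map(_key, value):
        if regex.search(value):
            yield (value, 1)
    return grep_map

def word_count_map(_key, value):
    """Word count Map: for all w in value do emit(w, 1)."""
    for w in value.split():
        yield (w, 1)

def sum_reduce(_key, values):
    """Shared Reduce: emit(key, sum(value*))."""
    return sum(values)

# Hypothetical input: (line number, line text) records.
records = list(enumerate(["to be or not", "to be", "be quick"]))
word_counts = run_mapreduce(records, word_count_map, sum_reduce)
grep_counts = run_mapreduce(records, make_grep_map(r"\bto\b"), sum_reduce)
print(word_counts)
print(grep_counts)
```

Both applications share the same Reduce; only the Map function differs, which is the point of the slide.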

Word Count Example Illustrated
Adapted from slide © 2009 Christoph Meinel, Hasso-Plattner Institute, http://bit.ly/bToUx2

Distributed Sort [1]: Mapping To “Pre-Sorted” Buckets
Adapted from slide © 2006 Hendra Setiawan, National University of Singapore, http://bit.ly/9KOR3b
See also: HP Labs technical note on TeraSort, http://bit.ly/biHbcA

Distributed Sort [2]: Partition Function
- Partition Function
  - Default: hash(key) mod R
  - Guarantee
    - Relatively well-balanced partitions
    - Ordering guarantee within partition
- Distributed Sort
  - Map: emit(key, value)
  - Reduce (with R=1): emit(key, value)
Adapted from slide © 2006 Hendra Setiawan, National University of Singapore, http://bit.ly/9KOR3b
See also: HP Labs technical note on TeraSort, http://bit.ly/biHbcA
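
With R=1 the within-partition ordering guarantee already yields a fully sorted output. For R > 1, one common approach (an assumption here, not something the slide spells out) is to replace the hash partitioner with a range partitioner so that the buckets themselves are "pre-sorted". The names range_partition and distributed_sort and the split points are hypothetical.

```python
import bisect

def range_partition(key, boundaries):
    """Bucket index for key, given sorted split points (R = len(boundaries) + 1)."""
    return bisect.bisect_right(boundaries, key)

def distributed_sort(keys, boundaries):
    """Identity map, range partition, sort within each partition, concatenate."""
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for key in keys:  # Map: emit(key, value) unchanged
        buckets[range_partition(key, boundaries)].append(key)
    for bucket in buckets:  # ordering guarantee within each partition
        bucket.sort()
    return [k for bucket in buckets for k in bucket]  # read buckets in order

data = [42, 7, 19, 88, 3, 61, 25]
print(distributed_sort(data, boundaries=[20, 50]))
# prints [3, 7, 19, 25, 42, 61, 88]
```

Passing an empty list of boundaries gives a single bucket, which corresponds to the R=1 case on the slide. How well balanced the buckets are depends entirely on how the split points are chosen.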

Rationale: The Need for MapReduce
Adapted from slide © 2007 Scott Beamer, University of California – Berkeley (CS61C Machine Structures), http://bit.ly/bhGXiq

Functional Programming and Parallelism [1]
- reduce (aka foldr)
    (reduce + (map square '(1 2 3)))
    (reduce + '(1 4 9))
    14
- Pure functional programming: easily parallelizable
  - Do you see how you could parallelize above evaluation?
  - What if reduce function argument were associative? Would that help?
Adapted from slide © 2007 Scott Beamer, University of California – Berkeley (CS61C Machine Structures), http://bit.ly/bhGXiq
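
The same evaluation can be tried in Python, along with one possible answer to the associativity question. Note that functools.reduce folds from the left, whereas the slide calls reduce a foldr; for + the result is the same. The helper chunked_reduce is a hypothetical name, and it assumes a non-empty input list.

```python
from functools import reduce
from operator import add, sub

def square(x):
    return x * x

squares = list(map(square, [1, 2, 3]))  # [1, 4, 9]
total = reduce(add, squares)            # 14

def chunked_reduce(op, items, n_chunks):
    """Reduce each chunk independently, then reduce the partial results.

    The per-chunk reductions do not depend on each other, so they could
    run on different machines. Assumes items is non-empty.
    """
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    partials = [reduce(op, items[i:i + size]) for i in range(0, len(items), size)]
    return reduce(op, partials)

big = [square(x) for x in range(1, 101)]
print(chunked_reduce(add, big, 4) == reduce(add, big))  # prints True

# Subtraction is not associative, so regrouping changes the answer.
print(reduce(sub, [1, 2, 3, 4]))             # prints -8
print(chunked_reduce(sub, [1, 2, 3, 4], 2))  # prints 0
```

Associativity is what makes the regrouping safe: with an associative operator the chunked result always equals the sequential one, so the work can be divided freely.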

Functional Programming and Parallelism [2]
- Imagine 10,000-machine cluster
  - Ready to help you compute anything you could cast as MapReduce problem!
- Abstraction
  - Google famous for developing this
  - … but their Reduce not same as functional programming reduce
    - Builds a reverse-lookup table
  - Hides lots of difficulty of writing parallel code!
  - System takes care of load balancing, dead machines, etc.
Adapted from slide © 2007 Scott Beamer, University of California – Berkeley (CS61C Machine Structures), http://bit.ly/bhGXiq
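
One way to read "builds a reverse-lookup table" is that Reduce is called once per key with every value emitted for that key, rather than folding a single list down to one value. The sketch below illustrates that reading with a word-to-documents index; it is an interpretation, and the names index_map, index_reduce, build_reverse_lookup, and the sample documents are hypothetical.

```python
from collections import defaultdict

def index_map(doc_id, text):
    """Map: emit (word, doc_id) once for each distinct word in the document."""
    for word in set(text.lower().split()):
        yield (word, doc_id)

def index_reduce(_word, doc_ids):
    """Reduce: receives all document ids for one word and returns them sorted."""
    return sorted(doc_ids)

def build_reverse_lookup(docs):
    """Group Map output by key, then call Reduce once per key."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for word, d in index_map(doc_id, text):
            groups[word].append(d)
    return {w: index_reduce(w, ids) for w, ids in groups.items()}

docs = {"d1": "map then reduce", "d2": "reduce the data", "d3": "map the data"}
index = build_reverse_lookup(docs)
print(index["reduce"])  # prints ['d1', 'd2']
```

Going from documents-to-words to words-to-documents is the "reverse" lookup; the grouping step in the middle is the part the framework performs on the programmer's behalf.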

MapReduce Transparencies
- Google Distributed File System
- Features
  - Parallel I/O
  - Fault-tolerance
  - Locality optimization
  - Load-balancing
Adapted from slide © 2006 Hendra Setiawan, National University of Singapore, http://bit.ly/9KOR3b

When To Use MapReduce
- Available Compute Cluster
- Large Data Set
  - Text corpora
  - Web documents
  - Raw numerical data (e.g., signals, sequences)
- Data (Assumed to Be) Independent
- Can Be Cast into map and reduce
Adapted from slide © 2006 Hendra Setiawan, National University of Singapore, http://bit.ly/9KOR3b

Preliminaries Under Linux
- Download using lynx
- bzcat mapreduce.tar.bz2 | tar -xf -
- Set up rsync
- Start inetd (or xinetd)
- Fix Type Errors in MapReduceScheduler.c
- Compile using make

C++ Implementation [1]
- Complete Tutorial: http://pages.cs.wisc.edu/~gibson/mapReduceTutorial.html
- Download: http://pages.cs.wisc.edu/~gibson/filelib/mapreduce.tar.bz2
- Unpack and Verify
Adapted from tutorial © 2007 Dan Gibson, University of Wisconsin-Madison, http://bit.ly/dnKaZL

C++ Implementation [2]: (Function) Arguments to Scheduler
- Setting up sched_args: http://pages.cs.wisc.edu/~gibson/mapreduceexample/main.C.html
Adapted from tutorial © 2007 Dan Gibson, University of Wisconsin-Madison, http://bit.ly/dnKaZL

C++ Implementation [3]: Map Function
- map Function Setup: http://bit.ly/98Hnfi
Adapted from tutorial © 2007 Dan Gibson, University of Wisconsin-Madison, http://bit.ly/dnKaZL

C++ Implementation [3]: Reduce Function
- reduce Function Setup: http://bit.ly/9AhCIt
Adapted from tutorial © 2007 Dan Gibson, University of Wisconsin-Madison, http://bit.ly/dnKaZL

C++ Implementation [4]: Key Comparison Function
- Setting up intcmp: http://bit.ly/aYvcVp
Adapted from tutorial © 2007 Dan Gibson, University of Wisconsin-Madison, http://bit.ly/dnKaZL

C++ Implementation [5]: Compilation
- Output of make: http://pages.cs.wisc.edu/~gibson/mapReduceTutorial.html
Adapted from tutorial © 2007 Dan Gibson, University of Wisconsin-Madison, http://bit.ly/dnKaZL

C++ Implementation [6]: Execution
- Call to map_reduce_scheduler and Follow-Up Statements: http://pages.cs.wisc.edu/~gibson/mapReduceTutorial.html
Adapted from tutorial © 2007 Dan Gibson, University of Wisconsin-Madison, http://bit.ly/dnKaZL

Hadoop Implementation
- Download and Documentation: http://hadoop.apache.org/mapreduce/
- Tutorials
  - Cloudera (Video): http://vimeo.com/cloudera/videos/
  - Apache (Written): http://bit.ly/b0whwX
Cover slide from tutorial © 2009 Cloudera, http://vimeo.com/3584536

R Implementation
- Downloads and Documentation
  - Comprehensive R Archive Network (CRAN) package
  - R interpreter: http://cran.r-project.org/
  - MapReduce in CRAN: http://bit.ly/9a0AqL
  - Example from Open Data Group: http://bit.ly/9EKWxC
Adapted from tutorial © 2009 Cloudera, http://vimeo.com/3584536

Programming Resources and References
- Basic Tutorials
  - Setiawan, National University of Singapore - http://bit.ly/9KOR3b
  - Meinel, Hasso-Plattner Institute - http://bit.ly/bToUx2
  - Beamer, Berkeley - http://bit.ly/bhGXiq
- Algorithm Design
  - Google - http://labs.google.com/papers/mapreduce.html
  - Apache - http://bit.ly/b0whwX, http://vimeo.com/3584536
- Implementations
  - Gibson, C++ version for Linux & Solaris - http://bit.ly/dnKaZL
  - Cutting, Hadoop version - http://vimeo.com/3584536
  - Brown, R version (CRAN) - http://bit.ly/9a0AqL
- Other Tutorials
  - Chris Olston, Yahoo Research - http://bit.ly/a28mkl
  - Google Code - http://bit.ly/9CeBSd

Acknowledgements
- Tutorial Material
  - Hendra Setiawan - National University of Singapore
  - Christoph Meinel - Hasso-Plattner Institute
  - Scott Beamer - Berkeley
- Algorithm Design
  - Google (Jeffrey Dean, Sanjay Ghemawat) - Original MapReduce
  - Apache Software Foundation (Doug Cutting, now of Cloudera) - Hadoop
- Implementations
  - Dan Gibson, University of Wisconsin-Madison - C++ version
  - Doug Cutting, Cloudera - Hadoop version
  - Chris Brown, Open Data Group - R version (CRAN)
- Thanks Also To
  - Alley Stoughton, Kansas State University - K-State CIS How-To Series
  - Chris Olston, Yahoo Research - talks on data parallelism, PIG (DSSI-2007)