Computing & Information Sciences Kansas State University How-To Wednesday, 24 Mar 2010 Laboratory for Knowledge Discovery in Databases KSU CIS Department.

Presentation transcript:

Getting Started with Google MapReduce in C++, Apache Hadoop, and R
How-To
William H. Hsu
Laboratory for Knowledge Discovery in Databases
Department of Computing and Information Sciences
Kansas State University
Slides for this tutorial:

What This How-To Is
Overview of MapReduce
  - Basic Definitions and Brief Synopsis
  - Deciding When to Use: Pros and Cons
Installation/Compilation Guide for MapReduce
  - C++
  - Apache Hadoop
  - R
Programming Resources and References

What This How-To Is Not
Lecture or Seminar on MapReduce Algorithm
  - Functional Programming Foundations
  - Analyzing Performance
  - Applications Survey
Tutorial on Platforms: C++, Hadoop, R
Full Workshop
  - Parallel Computing
  - Distributed Computing


Simple Motivating Example [1]: Distributed Grep
[Diagram: very large text collection -> split data -> grep -> matches -> cat -> all matches]
Adapted from slide © 2006 Hendra Setiawan, National University of Singapore

Simple Motivating Example [2]: Distributed Word Count
[Diagram: very large text collection -> split data -> count -> sum -> total count]
Adapted from slide © 2006 Hendra Setiawan, National University of Singapore

Outline
Overview of MapReduce
  - Basic Definitions and Brief Synopsis
  - Deciding When to Use: Pros and Cons
Installation/Compilation Guide for MapReduce
  - C++
  - Apache Hadoop
  - R
Programming Resources and References

What Is MapReduce?
Programming Model and Associated Implementation
Characteristics and Purpose
  - Processing large data sets
  - Exploiting large sets of commodity computers
  - Executing processes in distributed manner
  - Offers high degree of transparency
Other Goals: Simplicity, Generality, Scalability
May Be Suitable for Your Task
Adapted from slide © 2006 Hendra Setiawan, National University of Singapore

Building Blocks: Map
Adapted from slide © 2009 Christoph Meinel, Hasso-Plattner Institute

Building Blocks: Reduce
Adapted from slide © 2009 Christoph Meinel, Hasso-Plattner Institute

Building Blocks: Map/Reduce [1]
Adapted from slide © 2009 Christoph Meinel, Hasso-Plattner Institute

Building Blocks: Map/Reduce [2]
Adapted from slide © 2009 Christoph Meinel, Hasso-Plattner Institute

MapReduce Architecture
Map
  - Accepts input key/value pair
  - Emits intermediate key/value pair
Reduce
  - Accepts intermediate key/value* pair
  - Emits output key/value pair
[Diagram: MAP -> Partitioning Function -> REDUCE -> Result]
Adapted from slide © 2006 Hendra Setiawan, National University of Singapore
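The data flow on this slide can be sketched in a few lines of single-process Python. This is only an illustration of the model, not any of the implementations covered later: `run_mapreduce`, `partition`, `identity_map`, and `sum_reduce` are hypothetical names, and `zlib.crc32` is used as a stable stand-in for the partitioning hash.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Stable stand-in for the partitioning function hash(key) mod R."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_reducers


def run_mapreduce(records, map_fn, reduce_fn, num_reducers=2):
    """Run map, partition/group, and reduce in one process."""
    # Map phase: each input key/value pair may emit intermediate pairs,
    # which are routed to a reducer and grouped by intermediate key.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in records:
        for ikey, ivalue in map_fn(key, value):
            partitions[partition(ikey, num_reducers)][ikey].append(ivalue)

    # Reduce phase: each reducer sees one key with all of its values
    # (the "key/value*" pair) and emits output key/value pairs.
    output = []
    for part in partitions:
        for ikey in sorted(part):
            output.extend(reduce_fn(ikey, part[ikey]))
    return output


def identity_map(key, value):
    yield key, value


def sum_reduce(key, values):
    yield key, sum(values)


result = run_mapreduce([("a", 1), ("b", 2), ("a", 3)], identity_map, sum_reduce)
print(sorted(result))  # [('a', 4), ('b', 2)]
```

The output is sorted before printing because the order across reducers depends on how keys happen to be partitioned.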

Example Applications: Distributed Grep, WC Revisited
Distributed Grep
  - Map: if match(value, pattern) emit(value, 1)
  - Reduce: emit(key, sum(value*))
Distributed Word Count
  - Map: for all w in value do emit(w, 1)
  - Reduce: emit(key, sum(value*))
Adapted from slide © 2006 Hendra Setiawan, National University of Singapore
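The two pseudocode examples translate fairly directly into Python. The sketch below is an assumption-laden illustration: `map_reduce` is a hypothetical in-memory driver, the sample lines are invented, and matching uses Python's `re.search` rather than any particular grep implementation.

```python
import re
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    output = {}
    for ikey in sorted(groups):
        for okey, ovalue in reduce_fn(ikey, groups[ikey]):
            output[okey] = ovalue
    return output


def make_grep_map(pattern):
    """Distributed grep, map side: if match(value, pattern) emit(value, 1)."""
    regex = re.compile(pattern)

    def grep_map(_key, line):
        if regex.search(line):
            yield line, 1

    return grep_map


def wordcount_map(_key, text):
    """Word count, map side: for all w in value do emit(w, 1)."""
    for word in text.split():
        yield word, 1


def sum_reduce(key, values):
    """Shared reduce: emit(key, sum(value*))."""
    yield key, sum(values)


# Invented sample input: (line number, line text) pairs.
lines = [
    (0, "the quick brown fox"),
    (1, "jumps over the lazy dog"),
    (2, "the end"),
]

grep_result = map_reduce(lines, make_grep_map(r"o"), sum_reduce)
word_counts = map_reduce(lines, wordcount_map, sum_reduce)
print(grep_result)
print(word_counts["the"])  # 3
```

Both applications share the same reduce function; only the map side differs, which is the point the slide makes by placing them side by side.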

Word Count Example Illustrated
Adapted from slide © 2009 Christoph Meinel, Hasso-Plattner Institute

Distributed Sort [1]: Mapping To “Pre-Sorted” Buckets
Adapted from slide © 2006 Hendra Setiawan, National University of Singapore
See also: HP Labs technical note on TeraSort

Distributed Sort [2]: Partition Function
Default: hash(key) mod R
Guarantee
  - Relatively well-balanced partitions
  - Ordering guarantee within partition
Distributed Sort
  - Map: emit(key, value)
  - Reduce (with R=1): emit(key, value)
Adapted from slide © 2006 Hendra Setiawan, National University of Singapore
See also: HP Labs technical note on TeraSort
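A small sketch may help show why the partition function matters for sorting. It assumes an identity map and reduce, with the framework sorting keys within each partition. All names here are hypothetical, `zlib.crc32` stands in for the default hash, and the range boundaries `[20, 50]` are invented to illustrate the "pre-sorted buckets" idea from the previous slide.

```python
import zlib
from bisect import bisect_right


def hash_partition(key, num_reducers):
    """Stand-in for the default partitioner, hash(key) mod R."""
    return zlib.crc32(str(key).encode("utf-8")) % num_reducers


def make_range_partition(boundaries):
    """Range partitioner: bucket i holds keys below boundaries[i].

    Expects len(boundaries) == R - 1, so bucket order follows key order.
    """
    def range_partition(key, _num_reducers):
        return bisect_right(boundaries, key)

    return range_partition


def distributed_sort(records, num_reducers, partition_fn):
    """Identity map and reduce; keys are sorted within each partition."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in records:  # Map: emit(key, value)
        buckets[partition_fn(key, num_reducers)].append((key, value))
    # Reduce: emit(key, value), in key order within each partition.
    return [sorted(bucket, key=lambda kv: kv[0]) for bucket in buckets]


records = [(42, "d"), (7, "a"), (99, "e"), (15, "b"), (23, "c")]

# With R = 1, the single partition's ordering guarantee gives a total order.
single = distributed_sort(records, 1, hash_partition)[0]

# With R = 3 and range partitioning, concatenating buckets is also sorted.
ranged = distributed_sort(records, 3, make_range_partition([20, 50]))
flattened = [kv for bucket in ranged for kv in bucket]

print(single)
print(ranged)
```

With the default hash partitioner and R greater than 1, each partition is sorted but the concatenation generally is not, which is why the slide's sort uses R=1 or buckets chosen by key range.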


Rationale: The Need for MapReduce
Adapted from slide © 2007 Scott Beamer, University of California – Berkeley (CS61C Machine Structures)

Functional Programming and Parallelism [1]
reduce (aka foldr)
  (reduce + (map square '(1 2 3))) => (reduce + '(1 4 9)) => 14
Pure functional programming: easily parallelizable
  - Do you see how you could parallelize the above evaluation?
  - What if the reduce function argument were associative?
  - Would that help?
Adapted from slide © 2007 Scott Beamer, University of California – Berkeley (CS61C Machine Structures)
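One way to explore the questions on this slide is to reproduce the expression in Python and then reduce in independent chunks. This is a sketch under stated assumptions: `chunked_reduce` is a hypothetical helper, threads stand in for separate machines, and Python's `functools.reduce` folds from the left rather than the right, which makes no difference for an associative operator such as `+`.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from operator import add, sub


def square(x):
    return x * x


# (reduce + (map square '(1 2 3))) => (reduce + '(1 4 9)) => 14
squares = list(map(square, [1, 2, 3]))
total = reduce(add, squares)
print(squares, total)  # [1, 4, 9] 14


def chunked_reduce(op, items, num_chunks):
    """Reduce each chunk independently, then reduce the partial results.

    Matches a sequential reduce only when op is associative.
    Assumes items is a non-empty list.
    """
    size = max(1, -(-len(items) // num_chunks))  # ceiling division
    chunks = [items[i:i + size] for i in range(0, len(items), size)]
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda chunk: reduce(op, chunk), chunks))
    return reduce(op, partials)


numbers = [square(n) for n in range(1, 101)]
parallel_total = chunked_reduce(add, numbers, 4)
print(parallel_total == reduce(add, numbers))  # True

# Subtraction is not associative, so chunking changes the answer.
sequential_diff = reduce(sub, [10, 3, 2, 1])        # ((10 - 3) - 2) - 1
chunked_diff = chunked_reduce(sub, [10, 3, 2, 1], 2)  # (10 - 3) - (2 - 1)
print(sequential_diff, chunked_diff)  # 4 6
```

The `map` step needs no such condition: each element is squared independently, so it can be split across workers in any way.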

Functional Programming and Parallelism [2]
Imagine 10,000-machine cluster
Ready to help you compute anything you could cast as MapReduce problem!
Abstraction
  - Google famous for developing this
  - … but their Reduce not same as functional programming reduce
  - Builds a reverse-lookup table
  - Hides lots of difficulty of writing parallel code!
  - System takes care of load balancing, dead machines, etc.
Adapted from slide © 2007 Scott Beamer, University of California – Berkeley (CS61C Machine Structures)
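The remark that Google's Reduce differs from a functional `reduce` can be read as follows: instead of folding one list into one value, the framework first groups intermediate values by key and then calls Reduce once per key. Building an inverted index is one plausible illustration of the "reverse-lookup table" bullet. That reading, the function names, and the sample documents below are all assumptions made for this sketch.

```python
from collections import defaultdict


def index_map(doc_id, text):
    """Emit (word, doc_id) once for each distinct word in the document."""
    for word in set(text.lower().split()):
        yield word, doc_id


def index_reduce(word, doc_ids):
    """Called once per word with every document id emitted for it."""
    yield word, sorted(doc_ids)


def build_index(docs):
    # Grouping by key is done by the framework, not by user code.
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for word, emitted_id in index_map(doc_id, text):
            grouped[word].append(emitted_id)

    index = {}
    for word, doc_ids in grouped.items():
        for key, value in index_reduce(word, doc_ids):
            index[key] = value
    return index


# Invented sample documents.
docs = {"d1": "map reduce", "d2": "reduce only", "d3": "Map again"}
index = build_index(docs)
print(index["map"])     # ['d1', 'd3']
print(index["reduce"])  # ['d1', 'd2']
```

The result maps each word back to the documents containing it, a lookup in the reverse direction of the input.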

MapReduce Transparencies
Google Distributed File System
Features
  - Parallel I/O
  - Fault-tolerance
  - Locality optimization
  - Load-balancing
Adapted from slide © 2006 Hendra Setiawan, National University of Singapore

When To Use MapReduce
Available Compute Cluster
Large Data Set
  - Text corpora
  - Web documents
  - Raw numerical data (e.g., signals, sequences)
Data (Assumed to Be) Independent
Can Be Cast into map and reduce
Adapted from slide © 2006 Hendra Setiawan, National University of Singapore


Preliminaries Under Linux
Download using lynx
bzcat mapreduce.tar.bz2 | tar -xf -
Set up rsync
Start inetd (or xinetd)
Fix type errors in MapReduceScheduler.c
Compile using make

C++ Implementation [1]
Complete Tutorial
Download
Unpack and Verify
Adapted from tutorial © 2007 Dan Gibson, University of Wisconsin-Madison

C++ Implementation [2]: (Function) Arguments to Scheduler
Setting up sched_args
Adapted from tutorial © 2007 Dan Gibson, University of Wisconsin-Madison

C++ Implementation [3]: Map Function
map Function Setup
Adapted from tutorial © 2007 Dan Gibson, University of Wisconsin-Madison

C++ Implementation [3]: Reduce Function
reduce Function Setup
Adapted from tutorial © 2007 Dan Gibson, University of Wisconsin-Madison

C++ Implementation [4]: Key Comparison Function
Setting up intcmp
Adapted from tutorial © 2007 Dan Gibson, University of Wisconsin-Madison

C++ Implementation [5]: Compilation
Output of make
Adapted from tutorial © 2007 Dan Gibson, University of Wisconsin-Madison

C++ Implementation [6]: Execution
Call to map_reduce_scheduler and Follow-Up Statements
Adapted from tutorial © 2007 Dan Gibson, University of Wisconsin-Madison


Hadoop Implementation
Download and Documentation:
Tutorials
  - Cloudera (Video):
  - Apache (Written):
Cover slide from tutorial © 2009 Cloudera


R Implementation
Downloads and Documentation
  - Comprehensive R Archive Network (CRAN) package
  - R interpreter:
  - MapReduce in CRAN:
Example from Open Data Group:
Adapted from tutorial © 2009 Cloudera


Programming Resources and References
Basic Tutorials
  - Setiawan, National University of Singapore
  - Meinel, Hasso-Plattner Institute
  - Beamer, Berkeley
Algorithm Design
  - Google
  - Apache
Implementations
  - Gibson, C++ version for Linux & Solaris
  - Cutting, Hadoop version
  - Brown, R version (CRAN)
Other Tutorials
  - Chris Olston, Yahoo Research
  - Google Code

Acknowledgements
Tutorial Material
  - Hendra Setiawan, National University of Singapore
  - Christoph Meinel, Hasso-Plattner Institute
  - Scott Beamer, Berkeley
Algorithm Design
  - Google (Jeffrey Dean, Sanjay Ghemawat): original MapReduce
  - Apache Software Foundation (Doug Cutting, now of Cloudera): Hadoop
Implementations
  - Dan Gibson, University of Wisconsin-Madison: C++ version
  - Doug Cutting, Cloudera: Hadoop version
  - Chris Brown, Open Data Group: R version (CRAN)
Thanks Also To
  - Alley Stoughton, Kansas State University: K-State CIS How-To Series
  - Chris Olston, Yahoo Research: talks on data parallelism, PIG (DSSI-2007)