MapReduce using Hadoop Jan Krüger … in 30 minutes...


MapReduce using Hadoop
● MapReduce in theory...
– Introduction
– Implementation
– Scalable distributed FS
● … and practice
● Discussion

MapReduce Introduction
● Basis: the functional building blocks map and reduce
● Differs from the well-known functional map / reduce functions used e.g. in Haskell [2]
● Google's implementation: a framework named 'MapReduce' (since 2003)
● Many other implementations exist, including open-source ones

MapReduce Framework Properties
● Automatic parallelization, distribution and scheduling of jobs
● Fault tolerance
● Automatic load balancing
● Optimized network and data transfer
● Monitoring

MapReduce Framework Applications
● Indexing of large data sets (e.g. for searching)
● Distributed search for patterns in large data sets
● Sorting large data sets
● Evaluation of log data (web)
● Grepping data of interest from documents (e.g. web pages)
● Graph generation (user profiling, web page linking)

MapReduce Programming Model
Input: [a], output: [b]
Map :: (key1, a) → [(key2, b)]
Combine :: [(key2, b)] → [(key2, [b])] (performed by the MapReduce library)
Reduce :: key2 → [b] → b
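The three signatures can be read as a small pipeline. The following is a minimal in-memory sketch in Python, not the way any real framework is implemented: everything runs sequentially in one process, and the names `group_by_key`, `map_reduce`, `map_reading` and `reduce_max`, as well as the sensor data, are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

K1 = TypeVar("K1")
K2 = TypeVar("K2")
A = TypeVar("A")
B = TypeVar("B")


def group_by_key(pairs: Iterable[Tuple[K2, B]]) -> Dict[K2, List[B]]:
    """Combine :: [(key2, b)] -> [(key2, [b])], the step done by the library."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def map_reduce(
    inputs: Iterable[Tuple[K1, A]],
    map_fn: Callable[[K1, A], List[Tuple[K2, B]]],
    reduce_fn: Callable[[K2, List[B]], B],
) -> Dict[K2, B]:
    """Run map, grouping and reduce one after another in a single process."""
    intermediate = []
    for key, value in inputs:
        intermediate.extend(map_fn(key, value))
    grouped = group_by_key(intermediate)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


# Hypothetical example: highest reading per sensor.
def map_reading(filename, line):
    sensor, reading = line.split(",")
    return [(sensor, float(reading))]


def reduce_max(sensor, readings):
    return max(readings)


result = map_reduce(
    [("a.csv", "s1,3.5"), ("a.csv", "s2,1.0"), ("b.csv", "s1,7.25")],
    map_reading,
    reduce_max,
)
print(result)
```

The user supplies only `map_fn` and `reduce_fn`; the grouping in between is the part a framework would take over and distribute.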

MapReduce Example: WordCount
Counting the number of occurrences of each word in a large collection of documents!

map(String value):
    List intermediateResult;
    foreach word w in value:
        intermediateResult.add(w, 1);
    return intermediateResult;

reduce(String key, List values):
    result = 0;
    foreach v in values:
        result += v;
    return result;
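As a runnable counterpart to the pseudocode, here is a minimal Python sketch. It assumes words are separated by whitespace and simulates the grouping step in memory; the function names and the sample documents are made up for this illustration.

```python
from collections import defaultdict


def map_words(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.split()]


def reduce_counts(word, counts):
    """Sum all partial counts collected for one word."""
    return sum(counts)


def word_count(documents):
    # Map phase: every document is processed independently.
    intermediate = []
    for document in documents:
        intermediate.extend(map_words(document))

    # Grouping: collect all values that share a key.
    grouped = defaultdict(list)
    for word, count in intermediate:
        grouped[word].append(count)

    # Reduce phase: one call per distinct word.
    return {word: reduce_counts(word, counts) for word, counts in grouped.items()}


counts = word_count(["the quick brown fox", "the lazy dog", "the fox"])
print(counts["the"], counts["fox"], counts["dog"])
```

Because each call to `map_words` and `reduce_counts` depends only on its own arguments, a framework is free to run them on different machines.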

MapReduce Google's Implementation
Execution Overview (figure taken from [1]): the user program (1) forks a master and workers; the master (2) assigns map and reduce tasks; in the map phase, workers (3) read the input and (4) write intermediate files to local disks; in the reduce phase, workers (5) read the intermediate files remotely and (6) write the output.

GFS Google's File System “Moving Computation is Cheaper than Moving Data !”

GFS Google's File System
● DFS optimized for very large datasets
● Data is stored in chunks (typically 64 MB in size)
● Chunks are stored redundantly (typically 3 times) on so-called chunkservers
● Favors high data throughput over fast random access
● Write-once, read-many data
● Streaming data access
● Fault tolerant
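The chunk size and replication factor translate directly into how many chunks a file occupies. The following small Python calculation uses the typical values from the list above; it assumes "64 MB" means 64 × 1024² bytes, and `chunk_layout` is a hypothetical helper, not part of any GFS or Hadoop API.

```python
CHUNK_SIZE = 64 * 1024**2  # typical chunk size named above (assumed binary MB)
REPLICAS = 3               # typical replication factor named above


def chunk_layout(file_size_bytes):
    """Return (number of chunks, number of chunk replicas stored cluster-wide)."""
    # Integer ceiling division: a partially filled last chunk still counts.
    chunks = -(-file_size_bytes // CHUNK_SIZE)
    return chunks, chunks * REPLICAS


chunks, replicas = chunk_layout(1024**3)  # a 1 GiB file
print(chunks, replicas)
```

A 1 GiB file is split into 16 chunks, so 48 chunk replicas are spread over the chunkservers.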

GFS Architecture (figure taken from [3]): an application uses the GFS client; the GFS master holds the file namespace (e.g. /foo/bar → chunk 2ef0); chunkservers store chunks on their local file systems. Labels in the figure: file name; chunk handle, chunk location; chunk handle, byte range; chunk data; instructions; state.

… and practice MapReduce using Hadoop
● Hadoop was created by Doug Cutting, who named it after his son's stuffed elephant.
● Hadoop is open source, available at apache.org
– MapReduce : MapReduce framework
– HDFS : Hadoop Distributed File System
– HBase : distributed database
● Implemented in Java, using C++ to speed up some operations
● Currently supported by Yahoo, Amazon (A9), Google, IBM, …
● Requirements : Java 1.6 and ssh/sshd running
● Different run modes : single, pseudo-distributed and distributed

… and practice MapReduce using Hadoop Counting the number of occurrences of each word in a large collection of documents, using Hadoop's Streaming interface

… and practice Python Mapper
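A minimal sketch of what a word-count mapper for Hadoop Streaming might look like, assuming Python 3. Hadoop Streaming passes input lines to the script on standard input and, by default, treats everything up to the first tab of each output line as the key. The helper name `map_line` is hypothetical.

```python
#!/usr/bin/env python3
import sys


def map_line(line):
    """Return one 'word<TAB>1' record for every whitespace-separated word."""
    return ["%s\t1" % word for word in line.split()]


if __name__ == "__main__":
    # Each input line is mapped independently of all others.
    for line in sys.stdin:
        for record in map_line(line):
            print(record)
```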

… and practice Python Reducer
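A minimal sketch of a matching word-count reducer, assuming Python 3 and input records of the form `word<TAB>count`. Hadoop Streaming delivers the records to the reducer sorted by key, so all counts for one word arrive as one consecutive run. The helper name `reduce_lines` is hypothetical.

```python
#!/usr/bin/env python3
import sys
from itertools import groupby


def reduce_lines(lines):
    """Yield 'word<TAB>total' for each run of identical words in sorted input."""
    pairs = (
        line.rstrip("\n").split("\t", 1)
        for line in lines
        if line.strip()  # skip blank lines
    )
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))


if __name__ == "__main__":
    for record in reduce_lines(sys.stdin):
        print(record)
```

Such a script can be tried without a cluster by piping mapper output through `sort` into it, since `sort` plays the role of the framework's grouping step.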

Discussion (1) [5] Using MapReduce for bioinformatic applications ?

Discussion (2) Using MapReduce: an expensive risk?

End Thanks for your attention!

References
(1) J. Dean and S. Ghemawat – Google Inc. :: MapReduce : Simplified Data Processing on Large Clusters, OSDI 2004
(2) R. Lämmel – Microsoft :: MapReduce : Google's MapReduce Programming Model – Revisited, SCP 2007
(3) S. Ghemawat et al. – Google Inc. :: The Google File System, SOSP 2003
(4) R. Grimm :: Das MapReduce-Framework
(5) M. Schatz :: CloudBurst : highly sensitive read mapping with MapReduce
(6) Apache Hadoop