MapReduce: CS 595, Lecture 11


MapReduce & Hadoop

MapReduce:
- A programming model for expressing distributed computations on massive amounts of data
- An execution framework for large-scale data processing on clusters of commodity servers
- Pioneered by Google, which processes 20 PB of data per day with it
- Popularized by the open-source Hadoop project; used by Yahoo!, Facebook, Amazon, ...

Underlying file systems:
- GFS: Google File System
- HDFS: Hadoop Distributed File System

What is MapReduce?

A simple data-parallel programming model designed for scalability and fault tolerance, composed of:
- Map method(s) that perform filtering and sorting of the input.
  Example: in word counting, map functions break input strings into tokens and output a key/value (word/count) pair for each word.
- Reduce method(s) that are called once for each key among the key/value pairs generated by the mapping functions.
  Example: in word counting, reduce functions take the key/value pairs from the mappers, sum them, and generate a single output count per word (see the sketch below).
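A minimal Python sketch of the word-count mapper and reducer just described. The names map_fn and reduce_fn are illustrative, not Hadoop's actual API:

    def map_fn(line):
        """Map: break an input line into tokens, emit one (word, 1) pair per token."""
        for word in line.split():
            yield (word, 1)

    def reduce_fn(word, counts):
        """Reduce: sum all the counts emitted for a single word."""
        yield (word, sum(counts))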

What is MapReduce?

The slide's diagram traces the word-count pipeline, Input -> Map/Combine -> Shuffle/Sort -> Reduce -> Output, over three input lines:

    "the quick brown fox"
    "the fox ate the mouse"
    "how now brown cow"

The mappers emit intermediate pairs such as (the, 1), (brown, 1), (fox, 1), ...; a combiner may pre-sum a mapper's local output into pairs such as (the, 2). The shuffle/sort phase groups pairs by word, and the reducers produce the final counts:

    ate 1, brown 2, cow 1, fox 2, how 1, mouse 1, now 1, quick 1, the 3
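To make the pipeline concrete, here is a self-contained, in-memory simulation of the diagram above; a real framework would run each phase across many machines:

    from itertools import groupby

    def map_fn(line):                       # same sketch as the previous slide
        for word in line.split():
            yield (word, 1)

    def reduce_fn(word, counts):
        yield (word, sum(counts))

    lines = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]

    # Map phase: apply map_fn to every input line.
    pairs = [p for line in lines for p in map_fn(line)]

    # Shuffle/sort phase: sort and group the intermediate pairs by word.
    pairs.sort(key=lambda kv: kv[0])

    # Reduce phase: sum the counts for each word.
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(next(reduce_fn(word, [c for _, c in group])))
        # ('ate', 1), ('brown', 2), ('cow', 1), ..., ('the', 3)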

Where is MapReduce used?

In research:
- Astronomical image analysis (Washington)
- Ocean climate simulation (Washington)
- Bioinformatics (Maryland)
- Analyzing Wikipedia conflicts (PARC)
- Natural language processing (CMU)
- Particle physics (Nebraska)

In industry:
- Reporting and machine learning (Facebook)
- Processing tweets and log files (Twitter)
- User activity, server metrics, transaction logs (LinkedIn)
- Scaling tests (Yahoo!)
- Search optimization and research (eBay)

MapReduce design goals

Scalability to accommodate big data: thousands of machines with tens of thousands of disks. (Slide photo: eBay's Hadoop cluster.)

MapReduce design goals

Cost efficiency:
- Commodity machines: cheap, but unreliable
- Commodity network: Gigabit Ethernet
- Automatic fault tolerance: fewer administrators needed
- Easy to use: fewer programmers needed

MapReduce challenges

Cheap nodes are prone to failure:
- Mean time between failures for one node: 3 years
- Mean time between failures for 1,000 nodes: about 1 day
- Solution: build fault tolerance into MapReduce itself

Commodity networking means low bandwidth:
- Solution: push computation to the data
- The Hadoop framework is built on top of HDFS; DataNodes in HDFS are also responsible for handling computational requests
- This requires precise mapping of application computations to specific DataNodes

Programming distributed systems is difficult:
- Solution: a data-parallel programming model
- Users write only "map" and "reduce" functions; Hadoop distributes the work and handles faults
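The failure numbers above follow from simple arithmetic, assuming node failures are independent and evenly spread:

    # Rough failure arithmetic, assuming independent node failures.
    node_mtbf_days = 3 * 365              # one node fails about every 3 years
    nodes = 1000
    cluster_mtbf_days = node_mtbf_days / nodes
    print(cluster_mtbf_days)              # ~1.1: a 1000-node cluster loses a
                                          # node roughly once a day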

File Systems

What is a file system?
- File systems determine how data is stored and retrieved.
- Distributed file systems manage storage across a network of servers, with increased complexity due to the networking involved.
- GFS and HDFS are distributed file systems.

GFS Assumptions

- Hardware failures are common
- Files are large (GBs to TBs)
- Millions (not billions) of files
- Two main types of file reads: large streaming reads and small random reads
- Jobs include sequential writes that append data to files
- Once written, files are seldom modified other than by appends; random file modification is possible, but not efficient in GFS
- High sustained bandwidth has priority over low latency

*HDFS is heavily influenced by GFS

GFS/HDFS

GFS and HDFS are not a good fit for:
- Low-latency (ms) data access
- Many small files
- Constantly changing file structure/data

*Not all details of GFS are public knowledge

Question

How would you design a distributed file system? From the application's point of view: how do you write data to the cluster, and how do you read it back?

GFS Architecture revisited

Files on GFS

- A single file can contain many objects (e.g., web documents, logs)
- Files are divided into fixed-size chunks of 64 MB, each with a 64-bit identifier:
  - 2^64 unique chunk handles
  - 2^64 x 64 MB (exabyte scale) of addressable filesystem space for data
- Chunks are stored as Linux files
- Reads and writes of data are specified by the tuple (chunk_handle, byte_range)
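A small sketch of how a client might turn a file offset into the chunk index and in-chunk byte range it asks the master about; the function name and exact interface here are assumptions for illustration:

    CHUNK_SIZE = 64 * 2**20       # 64 MB, the GFS default

    def to_chunk_coordinates(byte_offset, length):
        """Translate a read at (byte_offset, length) into a chunk index plus
        an in-chunk byte range. A real read that spans a chunk boundary would
        be split into one such request per chunk touched."""
        chunk_index = byte_offset // CHUNK_SIZE
        start = byte_offset % CHUNK_SIZE
        return chunk_index, (start, start + length)

    # A 4 KB read at offset 200 MB falls in chunk 3 (chunks 0-2 cover 0-192 MB):
    print(to_chunk_coordinates(200 * 2**20, 4096))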

Chunks

- One chunk = 64 MB or 128 MB (can be changed)
- Stored as a plain Linux file on a chunkserver
- Advantages of a large, but not too large, chunk size:
  - Reduced need for client/master interaction: one request per chunk, and clients can cache all chunk locations even for large data sets
  - Reduced size of the metadata on the master, which is kept in RAM
- Disadvantage: a chunkserver can become a hotspot for popular file(s)
- How could chunkserver hotspots be resolved?
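To see why large chunks keep the master's in-RAM metadata small, a back-of-the-envelope sketch, assuming the under-64-bytes-per-chunk figure reported in the GFS paper:

    # Master metadata for one file, assuming ~64 bytes of metadata per chunk
    # (the upper bound reported in the GFS paper).
    CHUNK_SIZE = 64 * 2**20
    file_size = 1 * 2**40                  # a 1 TB file
    chunks = file_size // CHUNK_SIZE       # 16,384 chunks
    print(chunks * 64)                     # ~1 MB of master RAM for 1 TB of data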

HDFS: Hadoop Distributed File System

GFS vs. HDFS

    GFS                              HDFS
    Master                           NameNode
    ChunkServer                      DataNode
    Operation Log                    Journal, Edit Log
    Chunk                            Block
    Random file writes possible      Only append is possible
    Multiple writer/reader model     Single writer, multiple reader model
    Default chunk size: 64 MB        Default block size: 128 MB

Hadoop architecture

NameNode:
- The master of HDFS; directs DataNode services to perform low-level I/O tasks
- Keeps track of file/block locations, the locations of replicas, and which blocks make up files that span multiple blocks (a toy sketch of this bookkeeping follows)

Secondary NameNode:
- Periodically checkpoints the primary NameNode's metadata for fault tolerance; despite the name, it is not a hot standby

DataNode:
- Serves and stores files in HDFS
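A toy Python sketch of the kind of metadata described above; the dictionaries, paths, and names are illustrative, not Hadoop's actual data structures:

    # file -> ordered list of block IDs
    file_to_blocks = {
        "/logs/2015-11-30.log": ["blk_001", "blk_002"],
    }
    # block ID -> DataNodes holding a replica
    block_to_datanodes = {
        "blk_001": ["dn-rack1-07", "dn-rack1-12", "dn-rack2-03"],
        "blk_002": ["dn-rack1-07", "dn-rack2-03", "dn-rack3-01"],
    }

    def locate(path):
        """Resolve a file into (block, replica locations) pairs: the
        information a client needs before contacting any DataNode."""
        return [(blk, block_to_datanodes[blk]) for blk in file_to_blocks[path]]

    print(locate("/logs/2015-11-30.log"))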

Hadoop name and data nodes

Hadoop cluster architecture

Fault tolerance in Hadoop

If a task crashes:
- Retry the task on another node
  - Fine for map tasks, because they have no dependencies
  - Fine for reduce tasks, because the map outputs they consume are on disk
- If the same task fails repeatedly: ignore that input block if possible, or fail the job and notify the user
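A minimal sketch of this retry policy, assuming a hypothetical task callable and a fixed attempt limit (Hadoop's actual limit is a configuration knob):

    import random

    MAX_ATTEMPTS = 4    # illustrative; Hadoop's limit is configurable

    def run_with_retries(task, nodes):
        """Retry a failed task on (possibly) different nodes, then give up."""
        for attempt in range(1, MAX_ATTEMPTS + 1):
            node = random.choice(nodes)
            try:
                return task(node)
            except RuntimeError as err:
                print(f"attempt {attempt} failed on {node}: {err}")
        # At this point the scheduler would skip the input block if it can,
        # or fail the whole job and notify the user.
        raise RuntimeError("task failed repeatedly")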

Fault tolerance in Hadoop

If a node crashes:
- Relaunch its current tasks on other nodes
- Rerun any map tasks that the failed node ran, since their output files were lost when the node failed

Fault tolerance in Hadoop

If a task is going slowly (a straggler):
- Launch a second copy of the task on another node
- Take the output of whichever copy finishes first, and terminate the other
- This is very important in large Hadoop clusters: stragglers occur frequently due to failing hardware, software bugs, misconfiguration, etc., and a single straggler can noticeably slow down a job
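A hedged sketch of this idea using Python threads: run two copies of a task and keep whichever finishes first. Real Hadoop launches the backup copy only after the original is detected to be slow, and kills the loser outright:

    from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

    def speculative_run(task, data):
        """Run two copies of the same task; return the first result."""
        with ThreadPoolExecutor(max_workers=2) as pool:
            copies = [pool.submit(task, data) for _ in range(2)]
            done, losers = wait(copies, return_when=FIRST_COMPLETED)
            for f in losers:
                f.cancel()        # best effort; Python threads can't be killed
            return done.pop().result()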

Fault tolerance in Hadoop

Advantages of MapReduce/Hadoop:
- Because of the data-parallel programming model, MapReduce can control job execution in useful ways:
  - Automatic division of a job into tasks
  - Automatic placement of computation near the data
  - Automatic load balancing
  - Recovery from failures and stragglers
- Users of MapReduce/Hadoop can focus their efforts on the application instead of the complexities of distributed computing

Hadoop in the wild: Yahoo!

- HDFS cluster with 4,500 nodes
- NameNodes: up to 64 GB of RAM each
- 40 DataNodes per rack, one switch per rack; 16 GB of RAM each; Gigabit Ethernet
- 70% of disk space allocated to HDFS
- Total storage: 9.8 PB raw -> 3.2 PB usable with 3x replication
- 60 million files, 63 million blocks; about 54,000 blocks per DataNode
- 1-2 nodes lost per day; the cluster re-replicates lost blocks in about two minutes