
1 MapReduce CS 595 Lecture 11

2 MapReduce & Hadoop
MapReduce: pioneered by Google
- A programming model for expressing distributed computations on massive amounts of data
- An execution framework for large-scale data processing on clusters of commodity servers
- Google processes 20 PB of data per day with it
- Popularized by the open-source Hadoop project; used by Yahoo!, Facebook, Amazon, ...
- GFS: Google File System; HDFS: Hadoop Distributed File System

3 What is Map Reduce? A simple data-parallel programming model designed for scalability and fault tolerance. A job is composed of:
- Mapping method(s) that perform filtering and sorting of the input. Example: in word counting, map functions break input strings into tokens and output a key/value (word/count) pair for each word.
- Reducing method(s) that are called once for each key in the key/value pairs generated by the mapping functions. Example: in word counting, reduce functions take the key/value pairs from the mappers, sum the counts, and generate a single output count for each word.
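To make the word-count description concrete, here is a minimal mapper/reducer sketch in Java using the standard org.apache.hadoop.mapreduce API. It is not code from the lecture; class names are the conventional WordCount example.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map: break each input line into tokens and emit a (word, 1) pair per token.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: called once per word with all of its counts; sum them and emit a single total.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}
```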

4 What is Map Reduce? A worked word-count example, stage by stage (Input -> Map/Combine -> Shuffle/Sort -> Reduce -> Output):
Input (one line per map task):
  "the quick brown fox" | "the fox ate the mouse" | "how now brown cow"
Map/Combine output (per line, with map-side combining of repeated words):
  (the,1) (quick,1) (brown,1) (fox,1) | (the,2) (fox,1) (ate,1) (mouse,1) | (how,1) (now,1) (brown,1) (cow,1)
Shuffle/Sort: group all pairs by key across the map outputs
Reduce/Output (sum the counts for each key):
  ate,1  brown,2  cow,1  fox,2  how,1  mouse,1  now,1  quick,1  the,3
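A driver program wires these stages together; the reducer can also be reused as a combiner, which is what produces the map-side partial sum "(the,2)" above. A sketch, assuming the TokenizerMapper/IntSumReducer classes from the previous snippet and illustrative input/output paths:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class); // map-side partial sums ("Map/Combine")
        job.setReducerClass(WordCount.IntSumReducer.class);  // final per-word totals

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));    // illustrative path
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output")); // illustrative path

        // The shuffle/sort phase runs between the map and reduce stages configured above.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```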

5 Where is Map Reduce used?
In research:
- Astronomical image analysis (Washington)
- Ocean climate simulation (Washington)
- Bioinformatics (Maryland)
- Analyzing Wikipedia conflicts (PARC)
- Natural language processing (CMU)
- Particle physics (Nebraska)
In industry:
- Reporting / machine learning (Facebook)
- Processing tweets and log files (Twitter)
- User activity, server metrics, transaction logs (LinkedIn)
- Scaling tests (Yahoo!)
- Search optimization and research (eBay)

6 Map Reduce design goals
Scalability to accommodate BIG DATA: thousands of machines with tens of thousands of disks
(Image: eBay's Hadoop cluster)

7 Map Reduce design goals
Cost efficiency:
- Commodity machines: cheap, but unreliable
- Commodity network: Gigabit Ethernet
- Automatic fault tolerance: fewer administrators needed
- Easy to use: fewer programmers needed

8 Map Reduce challenges
Cheap nodes are prone to failure:
- Mean time between failures for one node: 3 years
- Mean time between failures for 1,000 nodes: about 1 day
- Solution: build fault tolerance into MapReduce
Commodity network = low bandwidth:
- Solution: push computation to the data. The Hadoop framework is built on top of HDFS, and the DataNodes in HDFS are also responsible for handling computational requests; this requires precise mapping of application computations to specific DataNodes.
Programming distributed systems is difficult:
- Solution: a data-parallel programming model. Users write "map" and "reduce" functions; Hadoop distributes the work and handles faults.
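A back-of-the-envelope check on those failure numbers, assuming independent node failures and a per-node MTBF of roughly 3 years:

```latex
\text{MTBF}_{\text{cluster}} \approx \frac{\text{MTBF}_{\text{node}}}{N}
  = \frac{3 \times 365\ \text{days}}{1000} \approx 1.1\ \text{days}
```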

9 File Systems
What is a file system?
- File systems determine how data is stored and retrieved.
- Distributed file systems manage storage across a network of servers, with increased complexity due to networking.
- GFS and HDFS are distributed file systems.

10 GFS Assumptions
- Hardware failures are common
- Files are large (GB/TB); there are millions (not billions) of them
- Two main types of file reads: large streaming reads and small random reads
- Jobs include sequential writes that append data to files
- Once written, files are seldom modified other than by appending; random file modification is possible, but not efficient in GFS
- High sustained bandwidth has priority over low latency
*HDFS is heavily influenced by GFS

11 GFS/HDFS GFS/HDFS are not a good fit for:
- Low-latency (ms) data access
- Many small files
- Constantly changing file structure/data
*Not all details of GFS are public knowledge

12 Question How would you design a distributed file system?
(Diagram: an application reading from and writing to a cluster of servers)
- How would the application write data to the cluster?
- How would it read data from the cluster?

13 GFS Architecture revisited

14 Files on GFS
- A single file can contain many objects (e.g., web documents, logs)
- Files are divided into fixed-size chunks of 64 MB, each with a 64-bit identifier
  - 2^64 unique chunk handles
  - 2^64 MB (exabyte scale) of total filesystem space for data
- Chunks are stored as Linux files on the chunkservers
- Reads and writes of data are specified by the tuple (chunk_handle, byte_range)
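With fixed-size chunks, translating a byte offset in a file into the chunk that holds it is simple arithmetic. The sketch below is purely illustrative (the class and method names are hypothetical, not the GFS or HDFS client API); a real client would then ask the master for the chunk handle of that index.

```java
// Illustrative only: map a byte offset within a file to (chunk index, offset in chunk).
public final class ChunkAddressing {
    static final long CHUNK_SIZE = 64L * 1024 * 1024; // 64 MB fixed-size chunks

    // Which chunk holds this byte offset?
    static long chunkIndex(long byteOffset) {
        return byteOffset / CHUNK_SIZE;
    }

    // Offset of the byte within its chunk (start of the byte_range).
    static long offsetInChunk(long byteOffset) {
        return byteOffset % CHUNK_SIZE;
    }

    public static void main(String[] args) {
        long offset = 200L * 1024 * 1024; // a read starting at byte 200 MB of the file
        System.out.println("chunk index     = " + chunkIndex(offset));    // 3
        System.out.println("offset in chunk = " + offsetInChunk(offset)); // 8 MB
    }
}
```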

15 Chunks
One chunk = 64 MB or 128 MB (can be changed); stored as a plain Linux file on a chunkserver
Advantages of a large, but not too large, chunk size:
- Reduced need for client/master interaction: one request per chunk, and clients can cache all chunk locations for large data sets
- Reduced size of metadata on the master (kept in RAM)
Disadvantage:
- A chunkserver can become a hotspot for popular file(s)
Question: how could chunkserver hotspots be resolved?

16 HDFS: Hadoop Distributed File System

17 GFS vs. HDFS
GFS                              HDFS
Master                           NameNode
ChunkServer                      DataNode
Operation Log                    Journal, Edit Log
Chunk                            Block
Random file writes possible      Only append is possible
Multiple writer/reader model     Single writer / multiple reader model
Default chunk size: 64 MB        Default block size: 128 MB

18 Hadoop architecture
NameNode
- The master of HDFS; directs the DataNode services to perform low-level I/O tasks
- Keeps track of: which blocks make up each file, where those blocks are located, and where the replicas are
Secondary NameNode
- Periodically checkpoints the NameNode's metadata (merges the edit log into the namespace image); despite the name, it is not a hot standby
DataNode
- Serves and stores file blocks in HDFS
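How a client actually uses this split between NameNode (metadata) and DataNodes (block data) is easiest to see from the client API. A minimal sketch using the standard org.apache.hadoop.fs.FileSystem interface; the NameNode URI and file paths are placeholders:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        // fs.defaultFS points the client at the NameNode; DataNodes are discovered via its metadata.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // hypothetical NameNode host

        FileSystem fs = FileSystem.get(conf);

        // Write: the client asks the NameNode for target DataNodes, then streams block data to them.
        Path path = new Path("/user/demo/hello.txt"); // illustrative path
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the client gets block locations from the NameNode, then reads directly from DataNodes.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }

        fs.close();
    }
}
```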

19 Hadoop name and data nodes

20 Hadoop cluster architecture

21 Fault tolerance in Hadoop
If a task crashes:
- Retry the task on another node
  - OK for map tasks because they have no dependencies
  - OK for reduce tasks because the map outputs they consume are stored on disk
- If the same task fails repeatedly:
  - Ignore that input block, if possible
  - Otherwise, fail the job and notify the user
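These retry policies are configurable per job. A sketch using standard Hadoop 2.x+ configuration keys (the property names are taken from mapred-default.xml and are stated here as assumptions, with their usual defaults):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class RetryConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed property names; the default for both is 4 attempts.
        conf.setInt("mapreduce.map.maxattempts", 4);     // retries before a map task is declared failed
        conf.setInt("mapreduce.reduce.maxattempts", 4);  // retries before a reduce task is declared failed
        // Fraction of map tasks allowed to fail permanently before the whole job fails
        // (the "ignore that input block, if possible" case); 0 means any permanent failure fails the job.
        conf.setInt("mapreduce.map.failures.maxpercent", 0);

        Job job = Job.getInstance(conf, "job with task retry limits");
        System.out.println("configured retries for: " + job.getJobName());
    }
}
```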

22 Fault tolerance in Hadoop
If a node crashes:
- Relaunch its currently running tasks on other nodes
- Rerun any map tasks the failed node had already completed; this is necessary because their output files were stored on that node and were lost when it failed

23 Fault tolerance in Hadoop
If a task is going slowly (a straggler):
- Launch a second copy of the task on another node
- Take the output of whichever copy finishes first, and terminate the other process
This is very important in large Hadoop clusters:
- Stragglers occur frequently due to failing hardware, software bugs, misconfiguration, etc.
- Single stragglers may noticeably slow down a job
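In Hadoop this mechanism is called speculative execution, and it can be switched on or off per job. A sketch using standard Hadoop 2.x+ properties (names assumed from mapred-default.xml; both default to true):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculativeConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed property names: allow backup copies of slow map and reduce tasks.
        conf.setBoolean("mapreduce.map.speculative", true);
        conf.setBoolean("mapreduce.reduce.speculative", true);

        Job job = Job.getInstance(conf, "job with speculative execution");
        System.out.println("speculation enabled for: " + job.getJobName());
    }
}
```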

24 Fault tolerance in Hadoop
Advantages of MapReduce/Hadoop: using the data-parallel programming model, MapReduce can control job execution in useful ways:
- Automatic division of a job into tasks
- Automatic placement of computation near the data
- Automatic load balancing
- Recovery from failures and stragglers
Users of the MapReduce/Hadoop system focus their efforts on the application instead of the complexities of distributed computing.

25 Hadoop in the wild: Yahoo!
- HDFS cluster with 4,500 nodes
- NameNodes: up to 64 GB RAM each
- 40 DataNodes per rack, one switch per rack; 16 GB RAM each; Gigabit Ethernet
- 70% of disk space allocated to HDFS
- Total storage: 9.8 PB raw, 3.2 PB usable with 3x replication
- 60 million files, 63 million blocks; about 54k blocks per DataNode
- 1-2 nodes lost per day; the cluster re-replicates the lost blocks in about two minutes

