1
Project Presentation Schedule

Student              Date                   Time
Wei Li               Nov 30, 2015 Monday    9:00-9:25am
Shubbhi Taneja       Nov 30, 2015 Monday    9:25-9:50am
Rodrigo Sanandan     Dec 2, 2015 Wednesday  9:00-9:25am
Yi Zhou              Dec 2, 2015 Wednesday  9:25-9:50am
Moayad Almohaishi    Dec 4, 2015 Friday     9:00-9:25am
2
COMP7330/7336 Advanced Parallel and Distributed Computing
MapReduce – Job Processing
Dr. Xiao Qin
Auburn University
http://www.eng.auburn.edu/~xqin
xqin@auburn.edu
3
Review: Map-Reduce Framework
4
Grep
- Input consists of (url+offset, single line)
- map(key=url+offset, val=line): if the line matches the regexp, emit (line, "1")
- reduce(key=line, values=uniq_counts): don't do anything; just emit line
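A minimal single-process sketch of this grep job in Python. The names (map_fn, reduce_fn, run_grep) and the sample records are invented for illustration, and the shuffle a real framework performs between map and reduce is simulated with a dict:

```python
import re
from collections import defaultdict

def map_fn(key, line, pattern):
    # key is (url, offset); emit (line, "1") for every line that matches.
    if re.search(pattern, line):
        yield (line, "1")

def reduce_fn(line, values):
    # Don't aggregate the counts; just re-emit the matching line.
    yield line

def run_grep(records, pattern):
    groups = defaultdict(list)              # simulated shuffle/grouping
    for key, line in records:
        for out_key, out_val in map_fn(key, line, pattern):
            groups[out_key].append(out_val)
    return [out for line, vals in groups.items()
                for out in reduce_fn(line, vals)]

records = [(("file://a", 0), "error: disk failure"),
           (("file://a", 20), "all good"),
           (("file://b", 0), "error: timeout")]
print(run_grep(records, r"error"))          # prints the two matching lines
```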
5
MapReduce at Google
A C++ library linked into user programs
Status of implementation (OSDI '04):
- 100s/1000s of 2-CPU x86 machines, 2-4 GB of memory
- Limited bisection bandwidth
- Storage is on local IDE disks
- GFS: distributed file system manages data (SOSP '03)
- Job scheduling system: jobs are made up of tasks; the scheduler assigns tasks to machines
6
Execution Overview*
How is this distributed?
- Partition input key/value pairs into chunks, run map() tasks in parallel
- After all map()s are complete, consolidate all emitted values for each unique emitted key
- Now partition the space of output map keys, and run reduce() in parallel
If map() or reduce() fails, re-execute!
* Adapted from Google slides
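The flow above can be sketched end to end in plain Python, run sequentially for clarity. All names and the chunking scheme are assumptions for illustration; a real runtime executes steps 2 and 4 in parallel across machines and simply re-runs a failed task on its chunk:

```python
from collections import defaultdict

def execute(records, map_fn, reduce_fn, num_map_chunks, num_reduce_parts):
    # 1. Partition input key/value pairs into chunks.
    chunks = [records[i::num_map_chunks] for i in range(num_map_chunks)]

    # 2. Run map() over each chunk (in parallel in a real system).
    intermediate = []
    for chunk in chunks:
        for key, value in chunk:
            intermediate.extend(map_fn(key, value))

    # 3. After all maps complete, consolidate the emitted values for
    #    each unique emitted key (the shuffle).
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # 4. Partition the space of output keys and run reduce() on each
    #    partition (again in parallel in a real system).
    partitions = defaultdict(list)
    for key, values in groups.items():
        partitions[hash(key) % num_reduce_parts].append((key, values))
    results = []
    for part in partitions.values():
        for key, values in part:
            results.extend(reduce_fn(key, values))
    return results
```

With the grep functions from the earlier sketch, this could be invoked as execute(records, lambda k, v: map_fn(k, v, r"error"), reduce_fn, 3, 2).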
7
Job Processing
(figure: a JobTracker coordinating TaskTracker 0 through TaskTracker 5 on a "grep" job)
1. Client submits "grep" job, indicating code and input files
2. JobTracker breaks the input file into k chunks (in this case 6) and assigns work to TaskTrackers
3. After map(), TaskTrackers exchange map output to build the reduce() keyspace
4. JobTracker breaks the reduce() keyspace into m chunks (in this case 6) and assigns work
5. reduce() output may go to NDFS
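Steps 2 and 4 boil down to assigning k map chunks and m reduce partitions to the TaskTracker pool. A toy round-robin assignment is sketched below; real Hadoop scheduling also weighs data locality and per-tracker load, and every name here is invented for illustration:

```python
def assign_tasks(num_tasks, trackers):
    # Map each task index to a tracker, cycling through the pool.
    return {task: trackers[task % len(trackers)] for task in range(num_tasks)}

trackers = [f"TaskTracker {i}" for i in range(6)]
map_assignment = assign_tasks(6, trackers)      # k = 6 map chunks
reduce_assignment = assign_tasks(6, trackers)   # m = 6 reduce partitions
print(map_assignment[0])                        # 'TaskTracker 0'
```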
8
Execution [figure only]
9
Parallel Execution [figure only]
10
Task Granularity and Pipelining
Fine-granularity tasks: map tasks >> machines
- Minimizes time for fault recovery
- Can pipeline shuffling with map execution
- Better dynamic load balancing
Often use 200,000 map and 5,000 reduce tasks running on 2,000 machines, i.e., roughly 100 map tasks per machine, so re-executing one failed task redoes only a tiny fraction of the job.
Question: why do map tasks 1 and 3 have different execution times?
[Slides 11-21: figures only; no extractable text]
22
Big Data: Solution
"Googled" MapReduce!
- Divide and conquer
- Google File System (GFS) to store data
Apache Hadoop
- Framework for running applications on large clusters of commodity hardware
- Storage: HDFS
- Processing: MapReduce
23
Hadoop in Data Centers
Hadoop is:
- Economical
- Easy to use
- Portable
- Reliable
The infrastructure it needs lives in data centers: Facebook's Hadoop cluster has 30 PB of storage, and Yahoo!, Amazon, and Google all have Hadoop data centers.
24
Hadoop Architecture
- Distributed storage: HDFS
- Distributed processing: MapReduce
25
Master-Slave-Client Architecture
Storage layer (HDFS):
- NameNode (master): metadata management; manages the DataNodes
- DataNode (slave): stores data
- Client: file I/O operations
Processing layer (MapReduce):
- JobTracker (master): task scheduling; assigns tasks to TaskTrackers
- TaskTracker (slave): executes job tasks
- Client: job submission
26
HDFS
- Data is organized into files and directories
- Files are divided into uniform-sized blocks and distributed across cluster nodes
- Blocks are replicated to handle hardware failure
- The filesystem keeps checksums of data for corruption detection and recovery
- HDFS exposes block placement so that computation can be migrated to the data
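The block-plus-checksum idea can be sketched in a few lines of Python. This is an illustration, not HDFS code: the block size and function names are assumptions, and real HDFS keeps checksums for small sub-block chunks in separate checksum metadata:

```python
import zlib

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, a typical HDFS block size of this era

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split a byte string into uniform-sized blocks, each paired with a
    CRC32 checksum so corruption can be detected when the block is read."""
    return [(data[off:off + block_size], zlib.crc32(data[off:off + block_size]))
            for off in range(0, len(data), block_size)]

def verify_block(block: bytes, checksum: int) -> bool:
    # On read, recompute the checksum; a mismatch means this replica is
    # corrupt and the reader should fall back to another replica.
    return zlib.crc32(block) == checksum
```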
27
HDFS Write
(figure: the client writes blocks A, B, and C across DataNode 1, DataNode 2, and DataNode 3; the NameNode handles metadata management, and the DataNodes store the data and report their blocks back to the NameNode)
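A toy version of this write path, with invented class and method names (not the real HDFS client protocol): the client asks a NameNode-like object which DataNodes should hold each block, then sends the block to the first target, which forwards it down a replication pipeline:

```python
REPLICATION = 3
BLOCK_SIZE = 4                    # tiny block size so the demo splits data

class NameNode:
    """Tracks which DataNodes hold each block (metadata management)."""
    def __init__(self, num_datanodes):
        self.num_datanodes = num_datanodes
        self.block_map = {}       # filename -> [(block_index, [datanode ids])]

    def allocate(self, filename, block_index):
        # Pick REPLICATION DataNodes round-robin (real HDFS is rack-aware).
        targets = [(block_index + i) % self.num_datanodes
                   for i in range(REPLICATION)]
        self.block_map.setdefault(filename, []).append((block_index, targets))
        return targets

class DataNode:
    """Stores blocks; forwards each block down the replication pipeline."""
    def __init__(self):
        self.blocks = {}          # (filename, block_index) -> bytes

    def write(self, filename, block_index, data, pipeline, cluster):
        self.blocks[(filename, block_index)] = data
        if pipeline:              # hand off to the next replica in the chain
            cluster[pipeline[0]].write(filename, block_index, data,
                                       pipeline[1:], cluster)

def hdfs_write(namenode, cluster, filename, data):
    for index, offset in enumerate(range(0, len(data), BLOCK_SIZE)):
        block = data[offset:offset + BLOCK_SIZE]
        first, *rest = namenode.allocate(filename, index)
        cluster[first].write(filename, index, block, rest, cluster)

cluster = [DataNode() for _ in range(3)]
namenode = NameNode(len(cluster))
hdfs_write(namenode, cluster, "demo.txt", b"hello hdfs!")
print(namenode.block_map["demo.txt"])   # block index -> replica placement
```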
28
HDFS Read
(figure: the client obtains block locations from the NameNode, which handles metadata management, then reads the blocks directly from DataNode 1, DataNode 2, and DataNode 3; the DataNodes store the data and report back to the NameNode)
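The matching read path, reusing the toy NameNode and DataNode from the write sketch above (again, illustrative names, not the real HDFS API): the client gets block locations from the NameNode and pulls each block from the first replica that still has it:

```python
def hdfs_read(namenode, cluster, filename):
    data = b""
    for block_index, replicas in namenode.block_map[filename]:
        for node_id in replicas:             # try replicas in order
            block = cluster[node_id].blocks.get((filename, block_index))
            if block is not None:            # replica is alive and has it
                data += block
                break
        else:
            raise IOError(f"block {block_index} of {filename} lost on all replicas")
    return data

print(hdfs_read(namenode, cluster, "demo.txt"))  # b'hello hdfs!'
```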
[Slide 29: figure only; no extractable text]