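Large-Scale Data Processing / Cloud Computing, Lecture 5 – Hadoop Runtime. Peng Bo, School of Electronics Engineering and Computer Science, Peking University. 7/23/2013. This work is licensed under a Creative Commons license.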


1 Large-Scale Data Processing / Cloud Computing, Lecture 5 – Hadoop Runtime. Peng Bo, School of Electronics Engineering and Computer Science, Peking University. 7/23/2013. http://net.pku.edu.cn/~course/cs402/ This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details. Jimmy Lin, University of Maryland. SEWMGroup

2 Course grading
– Project: cancelled
– 4 assignments: wordcount (ungraded), co-occurrence, index, pagerank
– 1 week
  – grace time: one day
  – 10% deducted for each day of delay (60% at most)

3 'wordcount' How does it work?

4 Hadoop Cluster [Diagram: a namenode running the namenode daemon; a job submission node running the jobtracker; three slave nodes, each running a tasktracker and a datanode daemon on top of the Linux file system.]

5 Job submission process

6 Job submission, part 1
– Asks the jobtracker for a new job ID by calling getNewJobId() on JobTracker (step 2).
– Checks the output specification of the job. For example, if the output directory has not been specified or it already exists, the job is not submitted and an error is thrown to the MapReduce program.
– Computes the input splits for the job. If the splits cannot be computed (because the input paths don't exist, for example), the job is not submitted and an error is thrown to the MapReduce program.

7 Job submission, part 2
– Copies the resources needed to run the job, including the job JAR file, the configuration file, and the computed input splits, to the jobtracker's filesystem in a directory named after the job ID (step 3). The job JAR is copied with a high replication factor (controlled by the mapred.submit.replication property, which defaults to 10) so that there are many copies across the cluster for the tasktrackers to access when they run tasks for the job.
– Tells the jobtracker that the job is ready for execution by calling submitJob() on JobTracker (step 4).
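The client-side checks on slides 6 and 7 can be sketched in plain Java. This is only an illustration, not Hadoop's submission code: the class JobSubmissionSketch, its method names, the map standing in for the filesystem, the fixed split size, and the staged file names are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the client side of job submission; not a Hadoop class.
class JobSubmissionSketch {

    // Output check: the directory must be specified and must not exist yet.
    static void checkOutputSpec(String outputDir, Set<String> existingPaths) {
        if (outputDir == null || outputDir.isEmpty()) {
            throw new IllegalStateException("Output directory not set");
        }
        if (existingPaths.contains(outputDir)) {
            throw new IllegalStateException("Output directory already exists: " + outputDir);
        }
    }

    // Split computation: each split is {start offset, length} within the input file.
    // fileLengths is a toy "filesystem" mapping path -> length in bytes (assumption).
    static List<long[]> computeSplits(String inputPath, Map<String, Long> fileLengths, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        Long length = fileLengths.get(inputPath);
        if (length == null) {
            throw new IllegalStateException("Input path does not exist: " + inputPath);
        }
        List<long[]> splits = new ArrayList<>();
        for (long start = 0; start < length; start += splitSize) {
            splits.add(new long[] {start, Math.min(splitSize, length - start)});
        }
        return splits;
    }

    // Resources are staged in a directory named after the job ID (file names are assumed).
    static List<String> stagedResources(String jobId) {
        return Arrays.asList(jobId + "/job.jar", jobId + "/job.xml", jobId + "/job.split");
    }

    // Runs the checks in the order given on the slides; any failure aborts the submission.
    static List<String> submit(String jobId, String inputPath, String outputDir,
                               Map<String, Long> fileLengths, long splitSize) {
        checkOutputSpec(outputDir, fileLengths.keySet());
        computeSplits(inputPath, fileLengths, splitSize);
        return stagedResources(jobId);
    }
}
```

The point of the ordering is that both failure cases are detected on the client, before anything is copied and before the jobtracker is told the job is ready.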

8 InputFormat Class Hierarchy

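9 [Diagram: MapReduce data flow with combiners and partitioners]
– map: inputs (k1, v1) … (k6, v6) go to four mappers, which emit (a, 1) (b, 2) | (c, 3) (c, 6) | (a, 5) (c, 2) | (b, 7) (c, 8)
– combine: (a, 1) (b, 2) | (c, 9) | (a, 5) (c, 2) | (b, 7) (c, 8)
– partition
– Shuffle and Sort: aggregate values by keys: a → [1, 5], b → [2, 7], c → [2, 9, 8] (without the combiner: c → [2, 3, 6, 8])
– reduce: outputs (r1, s1), (r2, s2), (r3, s3)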
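The stages named on this slide (map, combine, partition, shuffle and sort, reduce) can be traced with a small in-memory word count. This is a single-process sketch for intuition only; the class WordCountFlowSketch and its methods are hypothetical and do not use the Hadoop API, and each input line plays the role of one map task.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory walk-through of the data flow; not Hadoop code.
class WordCountFlowSketch {

    // map: emit (word, 1) for every whitespace-separated token in one input record.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // combine: pre-aggregate one mapper's output locally, before anything is shuffled.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            sums.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return sums;
    }

    // partition: pick the reducer responsible for a key.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // reduce: sum all values that arrived for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    static Map<String, Integer> run(List<String> lines, int numReducers) {
        // shuffle and sort: each reducer receives its keys in sorted order,
        // with the values for a key grouped into one list.
        List<TreeMap<String, List<Integer>>> reducerInputs = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            reducerInputs.add(new TreeMap<>());
        }
        for (String line : lines) {
            for (Map.Entry<String, Integer> e : combine(map(line)).entrySet()) {
                reducerInputs.get(partition(e.getKey(), numReducers))
                        .computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                        .add(e.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (TreeMap<String, List<Integer>> input : reducerInputs) {
            for (Map.Entry<String, List<Integer>> e : input.entrySet()) {
                result.put(e.getKey(), reduce(e.getValue()));
            }
        }
        return result;
    }
}
```

Because addition is associative and commutative, running the combiner changes how many pairs cross the shuffle but not the final counts.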

10 Serialization. Serialization is the process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage. Deserialization is the reverse process of turning a byte stream back into a series of structured objects. In Hadoop, interprocess communication between nodes in the system is implemented using remote procedure calls (RPCs).

11 The Writable Interface
public interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}
public interface WritableComparable<T> extends Writable, Comparable<T>
A Writable which is also Comparable. Implementations provide the ordering, for example:
public int compareTo(WritableComparable w) {}
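A custom key type built on these two interfaces might look like the sketch below. The interfaces are redeclared locally so that the example compiles without Hadoop on the classpath (in a real job they come from org.apache.hadoop.io); the WordPair class and its roundTrip helper are hypothetical and exist only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Local copies of the interfaces (assumption: used here instead of org.apache.hadoop.io).
interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

interface WritableComparable<T> extends Writable, Comparable<T> {}

// Hypothetical key type: a pair of words, as a co-occurrence job might use.
class WordPair implements WritableComparable<WordPair> {
    private String left = "";
    private String right = "";

    // No-argument constructor: an empty instance is created first, then filled by readFields.
    WordPair() {}

    WordPair(String left, String right) {
        this.left = left;
        this.right = right;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(left);
        out.writeUTF(right);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Fields must be read in exactly the order they were written.
        left = in.readUTF();
        right = in.readUTF();
    }

    @Override
    public int compareTo(WordPair other) {
        int c = left.compareTo(other.left);
        return c != 0 ? c : right.compareTo(other.right);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof WordPair && compareTo((WordPair) o) == 0;
    }

    @Override
    public int hashCode() {
        return 31 * left.hashCode() + right.hashCode();
    }

    // Serializes to a byte array and deserializes into a fresh instance.
    static WordPair roundTrip(WordPair original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            WordPair copy = new WordPair();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Keys need compareTo because they are sorted during the shuffle; equals and hashCode are kept consistent with it so that equal keys are grouped and partitioned together.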

12

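13 Shuffle and Sort [Diagram: Mapper → circular buffer (in memory) → spills (on disk) → merged spills (on disk) → intermediate files (on disk) → Reducer. A Combiner is marked on the map side and a "Combiner?" on the merge path; the intermediate files also feed other reducers, and the Reducer also receives data from other mappers.]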
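The buffer, spill, and merge stages labeled on this slide can be illustrated with sorted runs and a k-way merge. This is a deliberately simplified sketch: it counts records rather than bytes, keeps the "spills" in memory instead of on disk, and ignores partitions and combiners. The class SpillMergeSketch and its parameters are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration of sort-and-spill followed by a merge; not Hadoop code.
class SpillMergeSketch {

    // Each time the buffer fills up, its contents are sorted and written out as one spill.
    static List<List<String>> spill(List<String> mapOutputKeys, int bufferCapacity) {
        if (bufferCapacity <= 0) {
            throw new IllegalArgumentException("bufferCapacity must be positive");
        }
        List<List<String>> spills = new ArrayList<>();
        for (int i = 0; i < mapOutputKeys.size(); i += bufferCapacity) {
            int end = Math.min(i + bufferCapacity, mapOutputKeys.size());
            List<String> run = new ArrayList<>(mapOutputKeys.subList(i, end));
            Collections.sort(run);
            spills.add(run);
        }
        return spills;
    }

    // k-way merge: a heap holds one cursor {spill index, position} per non-empty spill,
    // ordered by the key each cursor currently points at.
    static List<String> merge(List<List<String>> spills) {
        PriorityQueue<int[]> heap = new PriorityQueue<>(
                (x, y) -> spills.get(x[0]).get(x[1]).compareTo(spills.get(y[0]).get(y[1])));
        for (int s = 0; s < spills.size(); s++) {
            if (!spills.get(s).isEmpty()) {
                heap.add(new int[] {s, 0});
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] cursor = heap.poll();
            List<String> run = spills.get(cursor[0]);
            merged.add(run.get(cursor[1]));
            if (cursor[1] + 1 < run.size()) {
                heap.add(new int[] {cursor[0], cursor[1] + 1});
            }
        }
        return merged;
    }
}
```

Since every spill is already sorted, the merge only ever compares the current head of each spill, which is why the merged output can be produced in a single sequential pass.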

14 Partitioner
public abstract class Partitioner<KEY, VALUE> {
    public abstract int getPartition(KEY key, VALUE value, int numPartitions);
}
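A hash-based implementation of getPartition is sketched below. The abstract class is redeclared locally so the example compiles without Hadoop (in a real job it is org.apache.hadoop.mapreduce.Partitioner), and HashPartitionerSketch is a hypothetical name; the formula follows the usual hash-modulo approach of masking off the sign bit and taking the remainder.

```java
// Local copy of the abstract class (assumption: stands in for Hadoop's Partitioner).
abstract class Partitioner<KEY, VALUE> {
    public abstract int getPartition(KEY key, VALUE value, int numPartitions);
}

// Hypothetical hash partitioner: equal keys always land in the same partition.
class HashPartitionerSketch<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numPartitions) {
        // Masking with Integer.MAX_VALUE clears the sign bit, so a negative
        // hashCode still yields an index in [0, numPartitions).
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

The value is ignored here on purpose: all pairs with the same key must reach the same reducer, so only the key may influence the result.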

15 Q&A
