HADOOP
Priyanshu Jha, A. D. Dilip (6th IT)
Map Reduce
MapReduce is a patented[1] software framework introduced by Google to support distributed computing on large data sets on clusters of computers. It is a framework for processing huge datasets on certain kinds of distributable problems using a large number of computers (nodes), collectively referred to as a cluster. Computational processing can occur on data stored either in a filesystem (unstructured) or within a database (structured).
“Map” step
The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem and passes the answer back to its master node.
“Reduce” step
The master node then takes the answers to all the sub-problems and combines them in some way to get the output: the answer to the problem it was originally trying to solve.
Map Reduce
The advantage of MapReduce is that it allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel, though in practice this is limited by the data source and/or the number of CPUs near that data. Similarly, a set of 'reducers' can perform the reduction phase; all that is required is that all outputs of the map operation which share the same key are presented to the same reducer, at the same time.
The Map and Reduce functions of MapReduce are both defined with respect to data structured in (key, value) pairs. Map takes one pair of data with a type in one data domain, and returns a list of pairs in a different domain:

Map(k1, v1) -> list(k2, v2)

The map function is applied in parallel to every item in the input dataset. This produces a list of (k2, v2) pairs for each call. After that, the MapReduce framework collects all pairs with the same key from all lists and groups them together, creating one group for each of the different generated keys. The Reduce function is then applied in parallel to each group, which in turn produces a collection of values in the same domain:

Reduce(k2, list(v2)) -> list(v3)

Each Reduce call typically produces either one value v3 or an empty return, though one call is allowed to return more than one value. The returns of all calls are collected as the desired result list.
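As a concrete trace of these signatures (using the word-count example developed on the following slides; the document name and contents here are made up), Map("doc1", "to be or not to be") returns the list (to, 1), (be, 1), (or, 1), (not, 1), (to, 1), (be, 1). The framework then groups this into (to, [1, 1]), (be, [1, 1]), (or, [1]), (not, [1]), and Reduce produces 2, 2, 1 and 1 respectively. Here k1 is a document name, v1 its text, k2 a word, and v2/v3 are counts.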
Example

void map(String name, String document):
  // name: document name
  // document: document contents
  for each word w in document:
    EmitIntermediate(w, "1");

void reduce(String word, Iterator partialCounts):
  // word: a word
  // partialCounts: a list of aggregated partial counts
  int result = 0;
  for each pc in partialCounts:
    result += ParseInt(pc);
  Emit(AsString(result));
Example
Here, each document is split into words, and each word is counted initially with a "1" value by the Map function, using the word as the result key. The framework puts together all the pairs with the same key and feeds them to the same call to Reduce, so this function just needs to sum all of its input values to find the total appearances of that word.
HADOOP
Apache Hadoop is a Java software framework that supports data-intensive distributed applications under a free license. The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.
Hadoop includes:
Hadoop Common: the common utilities.
Avro: a data serialization system with support for scripting languages.
Chukwa: a system for managing large distributed systems.
HBase: a scalable, distributed database for large tables.
HDFS: a distributed file system.
Hive: data summarization and ad hoc querying.
MapReduce: distributed processing on compute clusters.
Pig: a high-level data-flow language for parallel computation.
ZooKeeper: a coordination service for distributed applications.
HDFS File System
The HDFS filesystem stores large files (an ideal file size is a multiple of 64 MB[10]) across multiple machines. It achieves reliability by replicating the data across multiple hosts, and hence does not require RAID storage on the hosts. With the default replication value, 3, data is stored on three nodes: two on the same rack, and one on a different rack.
The filesystem is built from a cluster of data nodes, each of which serves up blocks of data over the network using a block protocol specific to HDFS. They also serve the data over HTTP, allowing access to all content from a web browser or other client. Data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high.
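As a small illustration of how client code sees these blocks and replicas, here is a minimal sketch using the Hadoop FileSystem API; the cluster address hdfs://namenode:9000 and the path /data/example.txt are placeholders rather than values from the slides:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder address; in practice this normally comes from core-site.xml.
    conf.set("fs.defaultFS", "hdfs://namenode:9000");
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/data/example.txt");

    // Ask the namenode to keep 3 replicas of this file's blocks
    // (the default discussed above); returns false for directories.
    boolean accepted = fs.setReplication(file, (short) 3);
    System.out.println("Replication change accepted: " + accepted);

    // Report the block size and replication as recorded by the namenode.
    System.out.println("Block size:  " + fs.getFileStatus(file).getBlockSize());
    System.out.println("Replication: " + fs.getFileStatus(file).getReplication());

    fs.close();
  }
}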
A filesystem requires one unique server, the name node. This is a single point of failure for an HDFS installation. If the name node goes down, the filesystem is offline. When it comes back up, the name node must replay all outstanding operations. This replay process can take over half an hour for a big cluster.[11]
The filesystem includes what is called a Secondary Namenode, which misleads some people into thinking that when the primary Namenode goes offline, the Secondary Namenode takes over. In fact, the Secondary Namenode regularly connects to the primary Namenode and downloads a snapshot of its directory information, which is then saved to a directory. This snapshot is used together with the edit log of the primary Namenode to create an up-to-date directory structure.
Another limitation of HDFS is that it cannot be directly mounted by an existing operating system. Getting data into and out of the HDFS file system, an action that often needs to be performed before and after executing a job, can be inconvenient. A Filesystem in Userspace has been developed to address this problem, at least for Linux and some other Unix systems.
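The copy step this refers to can also be scripted against the same FileSystem API; a rough sketch follows, with all paths and the cluster address being placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyInOut {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder address
    FileSystem fs = FileSystem.get(conf);

    // Stage local input into HDFS before running a job...
    fs.copyFromLocalFile(new Path("/tmp/input.txt"),
                         new Path("/user/demo/input/input.txt"));

    // ...and pull the job's output back out afterwards.
    fs.copyToLocalFile(new Path("/user/demo/output/part-00000"),
                       new Path("/tmp/part-00000"));

    fs.close();
  }
}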
Replicating data three times is costly. To alleviate this cost, recent versions of HDFS have erasure coding support, whereby multiple blocks of the same file are combined together to generate a parity block. HDFS creates parity blocks asynchronously and then decreases the replication factor of the file from 3 to 2. Studies have shown that this technique decreases the physical storage requirements from a factor of 3 to a factor of around 2.2.
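To illustrate the parity idea only (the real feature typically uses more sophisticated codes, so treat this as a sketch rather than HDFS's implementation), a single XOR parity block lets any one lost data block be rebuilt from the surviving block and the parity:

public class XorParityDemo {
  // XOR two equal-length blocks byte by byte.
  static byte[] xor(byte[] a, byte[] b) {
    byte[] out = new byte[a.length];
    for (int i = 0; i < a.length; i++) {
      out[i] = (byte) (a[i] ^ b[i]);
    }
    return out;
  }

  public static void main(String[] args) {
    byte[] block1 = "data block one..".getBytes();
    byte[] block2 = "data block two..".getBytes();

    // Parity block computed from the two data blocks.
    byte[] parity = xor(block1, block2);

    // If block1 is lost, rebuild it from block2 and the parity block.
    byte[] recovered = xor(block2, parity);
    System.out.println(new String(recovered)); // prints "data block one.."
  }
}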
Word Count Example
Read text files and count how often words occur.
The input is text files.
The output is a text file; each line contains: word, tab, count.
Map: produce pairs of (word, count).
Reduce: for each word, sum up the counts.
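For example (a made-up input), a file containing the two lines "Hello World Bye World" and "Hello Hadoop" would produce an output file with the tab-separated lines:

Bye	1
Hadoop	1
Hello	2
World	2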
Map Class
// Requires: java.io.IOException, java.util.StringTokenizer,
// org.apache.hadoop.io.* and org.apache.hadoop.mapred.* (old MapReduce API).
public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one); // emit (word, 1) for every token
    }
  }
}
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get(); // add up the partial counts for this word
    }
    output.collect(key, new IntWritable(sum));
  }
}
public static void main(String[] args) throws Exception {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");

  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);

  conf.setMapperClass(Map.class);
  conf.setCombinerClass(Reduce.class); // the reducer also serves as a local combiner
  conf.setReducerClass(Reduce.class);

  conf.setInputFormat(TextInputFormat.class);
  conf.setOutputFormat(TextOutputFormat.class);

  FileInputFormat.setInputPaths(conf, new Path(args[0]));
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));

  JobClient.runJob(conf);
}
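Assuming the three pieces above are packaged into a single WordCount class and built into a jar (the jar name and paths below are placeholders), the job would typically be launched with:

hadoop jar wordcount.jar WordCount /user/demo/input /user/demo/output

where the two path arguments become args[0] and args[1].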
HDFS Limitations
“Almost” GFS (Google FS)
No file update options (record append, etc.); all files are write-once
Does not implement demand replication
Designed for streaming
Random seeks devastate performance