
Distributed Systems Lecture 3 Big Data and MapReduce 1.


1 Distributed Systems, Lecture 3: Big Data and MapReduce

2 Previous lecture
Overview of the main cloud computing aspects:
– Definition
– Models
– Elasticity
– Cloud stack
– Virtualization
AWS

3 Data-intensive computing
Clouds are designed for data-intensive applications. Approach: move the application to the data.
Computation-intensive computing:
– Example areas: MPI-based high-performance computing, Grids
– Typically run on supercomputers (e.g., NCSA Blue Waters)
– High CPU utilization
Data-intensive computing:
– Data is typically stored at datacenters
– Compute nodes are used nearby (same datacenter or rack, chosen by latency)
– Compute nodes run computation services
– High I/O utilization
In data-intensive computing, the focus shifts from computation to the data: CPU utilization is no longer the most important resource metric.

4 Big Data (1)
Data mining the huge amounts of data collected in a wide range of domains, from astronomy to healthcare, has become essential for planning and performance:
– Sloan Digital Sky Survey (2000): 200 GB/night
– Large Synoptic Survey Telescope (2016): 140 TB every five days
– NASA Center for Climate Simulation: 32 PB of information
– eBay: two data warehouses of 7.5 PB and 40 PB, respectively
– Amazon holds some of the world's largest databases: 7.5 TB, 18.5 TB, and 24.7 TB
– Facebook: 50 million photos
Michael Dell (CEO of Dell): "Top thinkers are no longer the people who can tell you what happened in the past, but those who can predict the future. Welcome to the Data Economy, which holds the promise for significantly advancing society and economic growth on a global scale."

5 Big Data (2)
The dimensions of Big Data:
Volume
– Analyzing large volumes (TBs) of distributed data
– E.g., deriving insights from large historical data sets
Velocity
– Fast processing of variable data sets (streaming data)
– E.g., trend analysis, weather forecasting
Variety (complexity)
– Highly complex analysis at large scale, e.g., audio/video analysis at web scale, speech to text, etc.
– Unstructured or structured data, e.g., relational databases, graphs

6 Big Data (3)
[Figure: the three Big Data dimensions as axes. Volume: MB, GB, TB, PB. Velocity: historic batch, periodic O(days/hours), realtime O(seconds). Variety: relational data, photos, audio, video, graphs.]

7 Big Data (4)
Data is an important asset to any organization (the Data Economy):
– Discovery of knowledge; enabling discovery; annotation of data
– Complex computational models
– No single environment is good enough: elastic, on-demand capacity is needed → cloud computing
– New programming models for Big Data on the cloud
– Supporting algorithms and data structures

8 Big Data analytics (1)
Cloud-based Big Data programming:
– Large numbers of low-cost "commodity" machines
– Optimized for performance/$
– The failure rate increases with the number of resources
– High-volume, distributed data (usually unstructured or tuple-based)
Embarrassingly parallel computations on big data.
Programming model (MapReduce):
– Split the data and process each chunk independently; join the intermediate results and output the aggregated result
– The same computation is performed at different nodes on different pieces of the dataset

9 Big Data analytics (2): Hadoop
[Figure: Hadoop architecture. The Hadoop master runs the MapReduce JobTracker and the HDFS NameNode. Each slave (worker) runs a MapReduce TaskTracker and an HDFS DataNode holding data blocks.]

10 Hadoop Distributed File System (HDFS) (3)
[Figure: a 1 GB file "x" split into blocks stored on DataNodes across three racks. The NameNode keeps the metadata, e.g., filename "x", size 1 GB; block "1" on datanodes d1, d3, d4; block "2" on datanodes d2, d4, d5; ...]
Each block is replicated (3 times by default): one replica of each block stays on the same rack, the rest are spread across the cluster.
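The replication scheme above can be sketched in a few lines of plain Java. This is a deliberately simplified illustration of the rack-aware placement idea (not the actual NameNode code): one replica goes on the writer's rack, and the remaining two are spread to a different rack.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified sketch of HDFS-style replica placement for 3 replicas:
// one replica on the writer's rack, two more on a different rack.
public class ReplicaPlacement {
    // racks maps a rack name to the datanodes on that rack
    static List<String> place(Map<String, List<String>> racks, String writerRack) {
        List<String> replicas = new ArrayList<>();
        replicas.add(racks.get(writerRack).get(0));       // replica on the local rack
        for (Map.Entry<String, List<String>> e : racks.entrySet()) {
            if (!e.getKey().equals(writerRack)) {         // first remote rack found
                replicas.add(e.getValue().get(0));        // two replicas off-rack,
                replicas.add(e.getValue().get(1));        // tolerating a rack failure
                break;
            }
        }
        return replicas;
    }

    public static void main(String[] args) {
        Map<String, List<String>> racks = new LinkedHashMap<>();
        racks.put("rack1", List.of("d1", "d2"));
        racks.put("rack2", List.of("d3", "d4"));
        System.out.println(ReplicaPlacement.place(racks, "rack1"));
    }
}
```

Placing the off-rack replicas together is the same trade-off real HDFS makes: the block survives a whole-rack failure while only one replica crosses the inter-rack link during the write.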

11 Hadoop MR execution (4)
The terms are borrowed from functional languages (e.g., Lisp). Example: sum of squares.
(map square '(1 2 3 4))
– Output: (1 4 9 16) [processes each record sequentially and independently]
(reduce + '(1 4 9 16))
– (+ 16 (+ 9 (+ 4 1)))
– Output: 30 [processes the set of all records in groups]
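The same sum-of-squares example can be written in Java, the language of the Hadoop examples later in the lecture, using the Streams API. This is a plain-Java analogue of the Lisp snippet, not Hadoop code:

```java
import java.util.Arrays;
import java.util.List;

// map squares each element independently; reduce folds the results together.
public class SumOfSquares {
    static int sumOfSquares(List<Integer> xs) {
        return xs.stream()
                 .map(x -> x * x)          // map: (1 2 3 4) -> (1 4 9 16)
                 .reduce(0, Integer::sum); // reduce: (+ 16 (+ 9 (+ 4 1)))
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(Arrays.asList(1, 2, 3, 4))); // prints 30
    }
}
```

The key property MapReduce exploits is visible here: each `map` call is independent, so the map step parallelizes trivially, while `reduce` combines results with an associative operation.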

12 Hadoop MR execution (5)
[Figure: the JobTracker executes "wc" on file "x" (blocks b1..b7) with 4 mappers and 3 reducers. Mappers are launched with a preference for collocation with the data. The shuffle stage routes each key to one reducer (e.g., k1, k2, k3 to one reducer and k4, k5 to another). Each reducer receives its keys with all their grouped values (k1 => v1, v2, ...; k2 => v1, v2; ...) and writes outputs o1, o2, o3.]

13 Hadoop MR programming model (6)
Input → Map → Shuffle → Reduce → Output
User-defined functions:
– Map(k, v) -> (k', v')
– Reduce(k', v'[]) -> (k'', v'')
The shuffle is handled by the MR system.

void map(String key, String value) {
    // do work
    // emit (key, value) pairs to reducers
}

void reduce(String key, Iterator values) {
    // for each key, iterate through all values
    // aggregate the results
    // emit the final result
}

14 Hadoop - example: word count (7)
Input: a large number of documents. Output: the count of each word occurring across the documents.

void map(String key, String value) {
    // key: document name (ignored)
    // value: document contents
    for each word w in value
        Emit(w, 1)
}

void reduce(String key, Iterator values) {
    // key: word
    // values: list of counts
    int count = 0;
    for each v in values
        count += v
    Emit(key, count)
}

15 Hadoop – example: word count (8)

public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}

16 Hadoop – example: word count (9)

public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

Source: http://muhammadkhojaye.blogspot.com/2012/04/how-to-run-amazon-elastic-mapreduce-job.html

17 Hadoop – example: word count (10)

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
}

18 Hadoop - example: word count (11)
[Figure: word count over the input lines "Peter Piper picked a peck of pickled peppers" and "A peck of pickled peppers Peter Piper picked". Each mapper emits (word, 1) pairs, e.g., (Peter, 1), (Piper, 1), (picked, 1), (a, 1), (peck, 1), (of, 1), (pickled, 1), (peppers, 1). A local shuffle and sort groups the pairs by key across mappers, and the reducers emit the final counts: (Peter, 2), (Piper, 2), (picked, 2), (a, 2), (peck, 2), (of, 2), (pickled, 2), (peppers, 2).]
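The whole map, shuffle, and reduce pipeline from this example can be simulated in memory with plain Java collections. This sketch follows the three stages of the figure step by step; it uses ordinary lists and maps instead of the Hadoop runtime:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory simulation of the map -> shuffle -> reduce word-count pipeline.
public class WordCountSim {
    static Map<String, Integer> wordCount(List<String> documents) {
        // Map phase: emit a (word, 1) pair for every token in every document
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String doc : documents)
            for (String word : doc.split("\\s+"))
                if (!word.isEmpty())
                    emitted.add(Map.entry(word, 1));

        // Shuffle phase: group the emitted pairs by key
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (Map.Entry<String, Integer> kv : emitted)
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                   .add(kv.getValue());

        // Reduce phase: sum the list of counts for each word
        Map<String, Integer> counts = new HashMap<>();
        grouped.forEach((word, ones) ->
            counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }
}
```

The three loops correspond one-to-one to the user-defined map function, the system-provided shuffle, and the user-defined reduce function; Hadoop's contribution is running each of them distributed and fault-tolerant.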

19 Hadoop - more examples (12)
Distributed search (grep):
– Map: emit a line if it matches the pattern
– Reduce: concatenate
Analysis of large-scale system logs
More:
– Jerry Zhao, Jelena Pjesivac-Grbovic, "MapReduce: The Programming Model and Practice", Sigmetrics 2009 tutorial. research.google.com/pubs/archive/36249.pdf
– http://www.dbms2.com/2008/08/26/known-applications-of-mapreduce/

20 Hadoop - other patterns and optimizations (13)
– Map-Reduce-Reduce...: the output of one reduce stage feeds further reduce stages
– Iterative Map-Reduce: repeated Map-Reduce rounds
– Local combiners: run a reduce-like function on each mapper's output before the shuffle; e.g., a mapper that emits (A, 1), (B, 1), (A, 1) sends only (A, 2), (B, 1) over the network
– Custom data partitioners, e.g., hash(line) mod R
Goal: increase local work (mappers + combiners) and reduce data transfer over the network.
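The combiner idea can be shown concretely. In real Hadoop word count the combiner is simply the reducer class reused locally (job.setCombinerClass in the driver); the plain-Java sketch below only illustrates the effect, pre-aggregating one mapper's (word, 1) pairs before anything is sent:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Local combine on a single mapper's output: many (word, 1) pairs
// collapse into one (word, partialCount) pair per distinct word,
// shrinking what must cross the network to the reducers.
public class CombinerDemo {
    static Map<String, Integer> combine(List<String> mapperOutputKeys) {
        Map<String, Integer> partial = new HashMap<>();
        for (String word : mapperOutputKeys)
            partial.merge(word, 1, Integer::sum); // sum counts for repeated words
        return partial;
    }
}
```

A combiner is safe here because addition is associative and commutative; for non-associative reduce functions (e.g., computing a mean by summing then dividing), the combiner must be designed separately or omitted.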

21 Hadoop - other important aspects (14)
Data locality:
– Map-heavy jobs: execute the mappers on the data partitions, then move the (smaller) intermediate data to the reducers
– Reduce-heavy jobs: do not move data to the reducers; instead, move the reducers to the intermediate data
Fault tolerance:
– Worker failure: re-execute the task on failure (Hadoop), or regenerate the current state with minimum re-execution (e.g., Resilient Distributed Datasets in Spark)
– Master failure: primary/secondary master, with regular backups to the secondary; on failure of the primary, the secondary takes over

22 Hadoop in real life
It is easy to write and run highly parallel programs in the new cloud programming paradigms:
– Google: MapReduce and Pregel (and many others)
– Amazon: Elastic MapReduce service (pay-as-you-go)
Google (MapReduce):
– Indexing: a chain of 24 MapReduce jobs
– ~200K jobs processing 50 PB/month (in 2006)
Yahoo! (Hadoop + Pig):
– WebMap: a chain of 100 MapReduce jobs
– 280 TB of data, 2500 nodes, 73 hours
Facebook (Hadoop + Hive):
– ~300 TB total, adding 2 TB/day (in 2008)
– 3K jobs processing 55 TB/day
Similar numbers from other companies, e.g., Yieldex, eharmony.com, etc.
NoSQL: MySQL has been an industry standard for a while, but Cassandra has been reported to be up to 2400 times faster on some workloads!

23 Useful links
Installing Hadoop on a single node (Linux) and the WordCount example:
– https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
Running WordCount on AWS Elastic MapReduce:
– http://muhammadkhojaye.blogspot.com/2012/04/how-to-run-amazon-elastic-mapreduce-job.html

24 Next lecture
Failure detection

