
Chapter X: Big Data

Chapter X: Big Data
- The MapReduce paradigm
- Distributed file systems
- Hadoop
- Big Data storage systems

The MapReduce Paradigm
- A platform for reliable, scalable parallel computing
- Abstracts the issues of the distributed and parallel environment away from the programmer
- The paradigm dates back many decades, but very large scale implementations running on clusters of 10^3 to 10^4 machines are more recent: Google MapReduce, Hadoop, ...
- Data access is done using distributed file systems

Distributed File Systems
- Highly scalable distributed file systems for large data-intensive applications, e.g. 10K nodes, 100 million files, 10 PB
- Provide redundant storage of massive amounts of data on cheap, unreliable machines
  - Files are replicated to handle hardware failure
  - The system detects failures and recovers from them
- Examples: the Google File System (GFS) and the Hadoop File System (HDFS)

MapReduce: File Access Count Example
Given a log file in the following format:

    ...
    2013/02/21 10:31:22.00EST /slide-dir/11.ppt
    2013/02/21 10:43:12.00EST /slide-dir/12.ppt
    2013/02/22 18:26:45.00EST /slide-dir/13.ppt
    2013/02/22 20:53:29.00EST /slide-dir/12.ppt
    ...

Goal: find how many times each of the files in the slide-dir directory was accessed between 2013/01/01 and 2013/01/31.
Options:
- A sequential program is too slow on massive datasets
- Loading the data into a database is expensive; operating directly on the log files is cheaper
- A custom-built parallel program for this task is possible, but very laborious
- The map-reduce paradigm

MapReduce Programming Model
- Inspired by the map and reduce operations commonly used in functional programming languages such as Lisp
- Input: a set of key/value pairs
- The user supplies two functions:
    map(k, v) → list(k1, v1)
    reduce(k1, list(v1)) → v2
- (k1, v1) is an intermediate key/value pair
- The output is the set of (k1, v2) pairs
- For our example, assume that the system breaks the file up into lines and calls the map function with the value of each line; the key is the line number

MapReduce: File Access Count Example

map(String key, String record) {
    String attribute[3];
    // break up record into tokens (based on the space character)
    // and store the tokens in the array attribute
    String date = attribute[0];
    String time = attribute[1];
    String filename = attribute[2];
    if (date between 2013/01/01 and 2013/01/31
            and filename starts with "/slide-dir/")
        emit(filename, 1);
}

reduce(String key, List recordlist) {
    String filename = key;
    int count = 0;
    for each record in recordlist
        count = count + 1;
    output(filename, count);
}
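
For concreteness, here is a hedged Java sketch of the same map function written against the Hadoop API introduced later in this chapter. The class name AccessCountMap and the assumption that log fields are separated by single spaces are illustrative, not part of the original pseudocode:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative sketch: the access-count map function in Hadoop.
// Assumes each input value is one log line: "date time filename".
public class AccessCountMap
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text filename = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] attribute = value.toString().split(" ");
        if (attribute.length < 3) return;        // skip malformed lines
        String date = attribute[0];              // e.g. "2013/02/21"
        String file = attribute[2];              // e.g. "/slide-dir/11.ppt"
        // yyyy/mm/dd dates compare correctly as plain strings
        if (date.compareTo("2013/01/01") >= 0
                && date.compareTo("2013/01/31") <= 0
                && file.startsWith("/slide-dir/")) {
            filename.set(file);
            context.write(filename, one);        // emit (filename, 1)
        }
    }
}

The corresponding reduce function would be structurally identical to the word-count reducer shown later, summing the 1s emitted for each filename.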

Schematic Flow of Keys and Values
[Figure: flow of keys and values in a MapReduce task. The map inputs are key/value pairs (mk1, mv1) ... (mkn, mvn); the map outputs are intermediate pairs such as (rk1, rv1), (rk7, rv2), (rk2, rv8); the system groups these by key, so each reduce input is a key together with the list of its values, e.g. (rk1, [rv1, rv7, ...]), (rk2, [rv8, rvi, ...]), (rk3, [rv3, ...]).]

MapReduce: Word Count Example
- Consider the problem of counting the number of occurrences of each word in a large collection of documents
- How would you do it in parallel?
- Solution:
  - Divide the documents among the workers
  - Each worker parses its documents to find all words; the map function outputs (word, count) pairs
  - Partition the (word, count) pairs across the workers based on the word
  - For each word at a worker, the reduce function locally adds up the counts
- Given the input "One a penny, two a penny, hot cross buns.", the records output by the map() function would be ("One", 1), ("a", 1), ("penny", 1), ("two", 1), ("a", 1), ("penny", 1), ("hot", 1), ("cross", 1), ("buns", 1)
- The records output by the reduce function would be ("One", 1), ("a", 2), ("penny", 2), ("two", 1), ("hot", 1), ("cross", 1), ("buns", 1)
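
To make the data flow concrete, here is a small plain-Java simulation of the two steps on the sample sentence. It is illustrative only: it runs in a single process and does not use Hadoop, but it produces exactly the map and reduce outputs listed above:

import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class WordCountSim {
    public static void main(String[] args) {
        String input = "One a penny, two a penny, hot cross buns.";
        // "map" step: emit a (word, 1) pair for every word
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String w : input.split("[\\s,.]+")) {
            mapped.add(new AbstractMap.SimpleEntry<>(w, 1));
        }
        // "shuffle + reduce" step: group the pairs by word and sum the counts
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : mapped) {
            counts.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        // prints {One=1, a=2, penny=2, two=1, hot=1, cross=1, buns=1}
        System.out.println(counts);
    }
}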

Pseudo-code

map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
        Emit(w, "1");

// The group-by step is done by the system on the key of the
// intermediate Emit above, and reduce is called on the list of
// values in each group.

reduce(String output_key, Iterator intermediate_values):
    // output_key: a word
    // intermediate_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
        result += ParseInt(v);
    Output(output_key, result);

Parallel Processing of a MapReduce Job
[Figure: the master, started by the user program, assigns map and reduce tasks to workers. Map workers 1..n read the input file partitions (Part 1 ... Part n), apply the map function, and write intermediate files to local disk. Reduce workers 1..m remote-read and sort the intermediate files, apply the reduce function, and write the output files (File 1 ... File m).]

Hadoop
- Google pioneered map-reduce implementations that could run on thousands of machines (nodes) and transparently handle failures of machines
- Hadoop is a widely used open-source implementation of MapReduce, written in Java
  - Map and reduce functions can be written in several different languages; we use Java
- Input and output for map-reduce systems such as Hadoop must be done in parallel
  - Google used the GFS distributed file system
  - Hadoop uses the Hadoop File System (HDFS)
    - File blocks are partitioned across many machines
    - Blocks are replicated so data is not lost or unavailable if a machine crashes
    - A central "name node" provides metadata, such as which blocks are contained in which files
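
As a brief illustration of accessing HDFS from a program, the following is a minimal sketch using the Hadoop FileSystem API. It assumes the configuration's default file system points at an HDFS cluster; the path /logs/access.log is purely illustrative:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // resolves to HDFS when fs.defaultFS names an HDFS cluster
        FileSystem fs = FileSystem.get(conf);
        try (FSDataInputStream in = fs.open(new Path("/logs/access.log"));
             BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in))) {
            String line;
            // blocks are fetched from the data nodes transparently
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}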

Hadoop: Types in Hadoop
- The generic Mapper and Reducer interfaces both take four type arguments that specify the types of the input key, input value, output key, and output value
- The Map class on the next slide implements the Mapper interface
  - The map input key is of type LongWritable, i.e. a long integer
  - The map input value, which is (all or part of) a document, is of type Text
  - The map output key is of type Text, since the key is a word
  - The map output value is of type IntWritable, which is an integer value

Hadoop Code in Java: Map Function

public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);   // emit (word, 1)
        }
    }
}

Hadoop Code in Java: Reduce Function

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();           // add up the counts for this word
        }
        context.write(key, new IntWritable(sum));
    }
}

Hadoop Job Parameters
- The classes containing the map and reduce functions for the job are set by the methods setMapperClass() and setReducerClass()
- The types of the job's output key and values are set by the methods setOutputKeyClass() and setOutputValueClass()
- The input format of the job is set by the method job.setInputFormatClass()
  - The default input format in Hadoop is TextInputFormat, where the map key is a byte offset into the file and the map value is the contents of one line of the file
- The directories where the input files are stored and where the output files must be created are set by addInputPath() and setOutputPath()
- And many more parameters

Hadoop Code in Java: Overall Program

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}

Local Pre-Aggregation
- Combiners perform partial aggregation before data is sent over the network, to minimize network traffic
  - E.g. within a machine
  - And/or at the rack level
- In Hadoop, the reduce function is used as the combiner by default if combiners are enabled
  - But an alternative implementation of the combiner can be specified, which is necessary if the input and output types of the reducer are different (see the sketch below)
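
In code, enabling a combiner is a single extra job setting. A minimal sketch, reusing the word-count Reduce class from the earlier slides (legal precisely because its input and output types match):

// In the WordCount driver, before submitting the job:
// reuse Reduce as the combiner, since its input types
// (Text, IntWritable) equal its output types.
job.setCombinerClass(Reduce.class);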

Implementations
- Google: not available outside Google
- Hadoop: an open-source implementation in Java
  - Uses HDFS for stable storage
  - Download: http://lucene.apache.org/hadoop/
- Aster Data: a cluster-optimized SQL database that also implements MapReduce
  - An IITB alumnus is among the founders
- And several others, such as Cassandra at Facebook

MapReduce vs. Parallel Databases
- MapReduce is widely used for parallel processing
  - By Google, Yahoo!, and hundreds of other companies
  - Example uses: computing PageRank, building keyword indices, analyzing web click logs, ...
- Database people say: parallel databases have been doing this for decades
- MapReduce people say:
  - We operate at scales of thousands of machines
  - We handle failures seamlessly
  - We allow procedural code in map and reduce, and allow data of any type
  - There are many real-world uses of MapReduce that cannot be expressed in SQL

MapReduce vs. Parallel Databases (Cont.)
- MapReduce is cumbersome for writing simple queries
- Current approach: declarative querying, with execution on MapReduce infrastructure
  - Pig Latin (from Yahoo!): a declarative language supporting complex data using JSON
    - The programmer has to specify a parser for each input source
  - Hive (from Facebook): SQL-based syntax
    - Allows the specification of a schema for each input source
    - Has become very popular
  - The SCOPE system from Microsoft
- There are many proposed extensions of MapReduce to allow joins, pipelining of data, etc.

Big Data Storage Systems
- The need: storing massive amounts of data
- Scalability: the ability to grow the system by adding more machines
  - Scalability of both storage volume and access rate
- Distributed file systems are good for scalable storage of unstructured data, e.g. log files
- Massively parallel databases are ideal for storing records
  - But it is hard to support all database features across thousands of machines
- Massively parallel key-value stores are built to support scalable storage and access of records
  - E.g. Bigtable from Google, PNUTS/Sherpa from Yahoo!, and HBase from the Apache Hadoop project
- Applications using key-value stores manage query processing on their own

Key-Value Stores
- Key-value stores support two basic operations:
  - put(key, value): stores a value with an associated key
  - get(key): retrieves the stored value associated with the specified key
- Some systems, such as Bigtable, additionally provide range queries on key values
- Multiple versions of the data may be stored, by adding a timestamp to the key (illustrated in the sketch below)
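
As an illustration of this interface (a toy sketch, not the API of any particular system), here is a minimal in-memory key-value store in Java with timestamp-based versioning and a range query:

import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy versioned key-value store: each put records a new version
// keyed by timestamp; get returns the most recent version.
public class SimpleKVStore<V> {
    // key -> (timestamp -> value); TreeMap keeps versions ordered
    private final Map<String, NavigableMap<Long, V>> store = new HashMap<>();

    public void put(String key, V value) {
        store.computeIfAbsent(key, k -> new TreeMap<>())
             .put(System.currentTimeMillis(), value);
    }

    // returns the most recent version, or null if the key is absent
    public V get(String key) {
        NavigableMap<Long, V> versions = store.get(key);
        return (versions == null || versions.isEmpty())
                ? null : versions.lastEntry().getValue();
    }

    // range query over keys, as provided by systems such as Bigtable
    public NavigableMap<String, V> getRange(String from, String to) {
        NavigableMap<String, V> result = new TreeMap<>();
        for (Map.Entry<String, NavigableMap<Long, V>> e : store.entrySet()) {
            String k = e.getKey();
            if (k.compareTo(from) >= 0 && k.compareTo(to) <= 0
                    && !e.getValue().isEmpty()) {
                result.put(k, e.getValue().lastEntry().getValue());
            }
        }
        return result;
    }
}

A real store would of course persist and replicate data across many machines; the sketch only mirrors the put/get/range interface described above.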

PNUTS Parallel Storage System Architecture
[Figure: PNUTS architecture diagram, not reproduced in this transcript]

Data Representation
- Records in many big data applications need to have a flexible schema
  - Not all records have the same structure
  - Some attributes may have complex substructure
- The XML and JSON data representation formats are widely used
- An example of a JSON object:

{
    "ID": "22222",
    "name": {
        "firstname": "Albert",
        "lastname": "Einstein"
    },
    "deptname": "Physics",
    "children": [
        { "firstname": "Hans", "lastname": "Einstein" },
        { "firstname": "Eduard", "lastname": "Einstein" }
    ]
}