Big Data, Map-Reduce, Hadoop

Presentation Overview: What is Big Data? What is map-reduce (input/output data types; why is it useful and where is it used)? Execution overview. Features (fault tolerance, ordering guarantee, other perks and bonuses). Hands-on demonstration and follow-along. Map-reduce-merge.

What is Big Data? Large, interconnected data sets. Typically implies fault-tolerant and load-balanced systems. Frequently open source. Many smaller computers, clustered. NoSQL, non-transactional, non-relational. Read-oriented. Hadoop and Solr are leading players.

What is map-reduce? Map-reduce is a programming model (and an associated implementation) for processing and generating large data sets. It consists of two steps: map and reduce. The "map" step takes an input key/value pair and produces a list of intermediate key/value pairs. The "reduce" step takes an intermediate key and the list of all values for that key and outputs the final values for that key.

Types
    map: (k1, v1) → list(k2, v2)
    reduce: (k2, list(v2)) → list(v2)
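
A minimal Python sketch of these two signatures, using word count as the instance. The names `map_fn` and `reduce_fn` and the choice of a document name as k1 are illustrative assumptions, not part of any Hadoop API.

```python
from typing import Iterable, Iterator, Tuple


def map_fn(key: str, value: str) -> Iterator[Tuple[str, int]]:
    """map: (k1, v1) -> list(k2, v2). Here k1 is a document name and v1 its text."""
    for word in value.split():
        yield (word, 1)


def reduce_fn(key: str, values: Iterable[int]) -> Iterator[int]:
    """reduce: (k2, list(v2)) -> list(v2). Here the list holds a single total."""
    yield sum(values)
```

Note that the intermediate types (k2, v2) differ from the input types (k1, v1), while reduce emits values of the same type it consumes.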

Why is this useful? Map-reduce jobs are automatically parallelized. Partial failure of the processing cluster is expected and tolerable. Redundancy and fault tolerance are built in, so the programmer doesn't have to worry about them. It scales very well. Many jobs are naturally expressible in the map/reduce paradigm.

What are some uses? Word count map:. reduce: Grep map:. reduce: identity Inverted index map:. reduce: Distributed sort (special case) map:. reduce: identity Users: Google, Yahoo!, Amazon, Facebook, etc.

Execution overview: map. The user begins a map-reduce job. One of the machines becomes the master. The input is partitioned into M splits (16-64 MB each), which are distributed among the machines. A worker reads its split and begins work. Its output is divided into R pieces of the intermediate key space by a partitioning function. Upon completion, the worker notifies the master, which passes the locations of the intermediate data on to the reduce workers.

Execution overview: reduce. When a reduce worker is notified about a job, it uses RPC to read the intermediate data for its partition from the mappers, then sorts it by key so that all values for the same key are grouped together. The reducer processes each key with its list of values, then writes its output to the final output file for its reduce partition. When all reducers are finished, the master wakes up the user program.
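
The two execution slides can be sketched as a single-process simulation. This is only an illustration under stated assumptions: `run_job`, `wc_map`, and `wc_reduce` are hypothetical names, splits are taken round-robin rather than as 16-64 MB byte ranges, and CRC32 stands in for whatever hash a real implementation uses.

```python
from collections import defaultdict
from zlib import crc32


def wc_map(key, value):
    """Word-count map: emit (word, 1) for every word in the line."""
    return [(word, 1) for word in value.split()]


def wc_reduce(key, values):
    """Word-count reduce: emit a single total for the key."""
    return [sum(values)]


def run_job(records, map_fn, reduce_fn, M=3, R=2):
    # 1. Partition the input into M splits (round-robin, for simplicity).
    splits = [records[i::M] for i in range(M)]

    # 2. Each map task writes its output into R regions, chosen by the
    #    partitioning function hash(key) mod R.
    regions = [[[] for _ in range(R)] for _ in range(M)]
    for m, split in enumerate(splits):
        for key, value in split:
            for k2, v2 in map_fn(key, value):
                regions[m][crc32(k2.encode("utf-8")) % R].append((k2, v2))

    # 3. Each reduce task reads its region from every map task, sorts by key,
    #    groups the values per key, and writes one output list ("file").
    outputs = []
    for r in range(R):
        pulled = sorted(
            (pair for m in range(M) for pair in regions[m][r]),
            key=lambda kv: kv[0],
        )
        grouped = defaultdict(list)
        for k2, v2 in pulled:
            grouped[k2].append(v2)
        outputs.append(
            [(k2, out) for k2 in sorted(grouped) for out in reduce_fn(k2, grouped[k2])]
        )
    return outputs


records = [("line1", "go comets go"), ("line2", "comets fight"), ("line3", "go")]
outputs = run_job(records, wc_map, wc_reduce, M=3, R=2)
```

`outputs` holds R lists, one per reduce partition, each sorted by key. A real cluster runs the map and reduce tasks on different machines and moves the regions over the network.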

What are M and R? M is the number of map pieces. R is the number of reduce pieces. Ideally, M and R are much larger than the number of workers. This allows one machine to perform many different tasks, which improves load balancing and speeds up recovery. The master makes O(M+R) scheduling decisions and keeps O(M*R) states in memory. At least R files end up being written.
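
To get a feel for these bounds, here is a small arithmetic sketch. The function name and the values of M, R, and the worker count are illustrative assumptions, not figures from this presentation.

```python
def master_bookkeeping(M, R, workers):
    """Return (scheduling decisions, state entries, average tasks per worker)."""
    decisions = M + R          # one decision per map task and per reduce task
    state_entries = M * R      # one entry per (map task, reduce task) pair
    tasks_per_worker = (M + R) / workers
    return decisions, state_entries, tasks_per_worker


decisions, state_entries, tasks_per_worker = master_bookkeeping(
    M=200_000, R=5_000, workers=2_000
)
```

With these values each worker handles on the order of a hundred tasks, which is what lets the work of a failed machine be spread over many others.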

Example: counting words We have UTD's fight song: C-O-M-E-T-S! Go! Green, Orange, White! Comets! Go! Strong of will, we fight for right! Let's all show our comet might! We want to count the number of occurrences of each word. The next slides show the map and reduce phases.

First stage: map. Go through the input, and for each word return a tuple of (word, 1). Output:...

Between map and reduce... Between the mapper and the reducer, some gears turn within Hadoop, and it groups identical keys and sorts by key before starting the reducer. Here's the output:...

Second stage: reducer. The reducer receives the content, one key/value-list pair at a time, and does its own processing. For wordcount, it sums the values in each list. Here's the output: … Then it writes these tuples to the final files in HDFS.
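
The three stages of the word count can be sketched in plain Python on the fight song. The tokenization rule (lowercase, keep letters, hyphens, and apostrophes) is an assumption made for this sketch, so the exact tuples may differ from the ones shown on the original slides. `mapper`, `shuffle`, and `reducer` are illustrative names.

```python
import re
from itertools import groupby

SONG = (
    "C-O-M-E-T-S! Go! Green, Orange, White! Comets! Go! "
    "Strong of will, we fight for right! Let's all show our comet might!"
)


def mapper(text):
    """Map stage: emit (word, 1) for every word in the input."""
    for word in re.findall(r"[a-z'-]+", text.lower()):
        yield (word, 1)


def shuffle(pairs):
    """Between map and reduce: sort by key and group identical keys."""
    ordered = sorted(pairs, key=lambda kv: kv[0])
    for key, group in groupby(ordered, key=lambda kv: kv[0]):
        yield (key, [value for _, value in group])


def reducer(key, values):
    """Reduce stage: sum the list of values for one key."""
    return (key, sum(values))


counts = dict(reducer(key, values) for key, values in shuffle(mapper(SONG)))
```

Under this tokenization, "go" is the only word that occurs more than once, and "comet" and "comets" are counted as different words.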

How can we improve our wordcount? Also, any questions?

Fault tolerance. Worker failure is expected. If a worker fails during a map phase, its workload is reassigned to another worker. If a machine that ran map tasks fails during the reduce phase, its map tasks are re-executed as well, because their intermediate output was stored on the failed machine. Master failure is not expected, though checkpointing can be used for recovery. If a particular record causes the mapper or reducer to reliably crash, the map-reduce system can figure this out, skip the record, and proceed.
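
The bad-record behavior can be sketched as follows. This is a simplified single-process illustration, not how any real framework implements it: `run_with_skipping`, `max_failures`, and the in-memory `failures` table (standing in for the master's bookkeeping) are assumptions of the sketch.

```python
def run_with_skipping(records, process, max_failures=2):
    """Re-execute a task until it completes, skipping records that keep crashing."""
    failures = {}    # record index -> number of crashes observed so far
    skipped = set()  # record indexes the task is told to skip
    while True:
        results = []
        crashed = False
        for index, record in enumerate(records):
            if index in skipped:
                continue
            try:
                results.append(process(record))
            except Exception:
                # Note which record caused the crash, then abandon this attempt.
                failures[index] = failures.get(index, 0) + 1
                if failures[index] >= max_failures:
                    skipped.add(index)
                crashed = True
                break
        if not crashed:
            return results, skipped


# The second record reliably crashes (division by zero) and ends up skipped.
results, skipped = run_with_skipping([1, 0, 5], lambda x: 10 // x)
```

Each crash either retries the task or marks one more record as skipped, so the loop always terminates.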

Ordering guarantee The implementation of map-reduce guarantees that within a given partition, the intermediate key/value pairs are processed in increasing key order. This means that each reduce partition ends up with an output file sorted by key.

Partitioning function. By default, intermediate keys are distributed evenly across the R reduce tasks by a hash(intermediate key) mod R function. You can specify a custom partitioning function. This is useful for locality reasons: for example, if the key is a URL and you want all URLs belonging to a single host to be processed on a single machine.
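
A minimal sketch of the default and the host-based partitioner. The function names are hypothetical, and CRC32 is used only because it is deterministic across runs; a real framework has its own hash.

```python
from urllib.parse import urlparse
from zlib import crc32


def default_partition(key: str, R: int) -> int:
    """Default: hash(intermediate key) mod R."""
    return crc32(key.encode("utf-8")) % R


def host_partition(url: str, R: int) -> int:
    """Custom: hash only the hostname, so one host maps to one reduce task."""
    host = urlparse(url).hostname or ""
    return crc32(host.encode("utf-8")) % R


same_host = (
    host_partition("http://example.com/a", 4),
    host_partition("http://example.com/b?page=2", 4),
)
```

Both entries of `same_host` are equal, because the path and query string no longer influence the partition number.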

Combiner function. After a map phase, the mapper transmits the entire intermediate data file over the network to the reducer. Sometimes this file is highly compressible, for example when it contains many (word, 1) pairs for the same word. The user can specify a combiner function. It's just like a reduce function, except it's run by the mapper before passing the job to the reducer.
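
A small sketch of what a combiner buys for word count. `map_words` and `combine` are illustrative names. The combiner is safe here because addition is associative and commutative, so summing partial sums gives the same total.

```python
from collections import Counter


def map_words(line):
    """Mapper: emit (word, 1) for every word."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Combiner: pre-sum the counts on the mapper's side."""
    totals = Counter()
    for word, n in pairs:
        totals[word] += n
    return sorted(totals.items())


raw = map_words("go comets go go")   # four pairs would cross the network
combined = combine(raw)              # only two pairs after combining
```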

Counters A counter can be associated with any action that a mapper or a reducer does. This is in addition to default counters such as the number of input and output key/value pairs processed. A user can watch the counters in real time to see the progress of a job. When the map/reduce job finishes, these counters are provided to the user program.
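
A single-process sketch of a user-defined counter. The counter names and `counting_mapper` are hypothetical; in a real cluster the per-worker values would be collected and aggregated for the user, whereas here one `Counter` object plays both roles.

```python
from collections import Counter

counters = Counter()


def counting_mapper(line):
    """Word-count mapper that also counts words and fully uppercase words."""
    for word in line.split():
        counters["words seen"] += 1
        if word.isupper():
            counters["uppercase words"] += 1
        yield (word, 1)


pairs = list(counting_mapper("GO Comets GO"))
```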

What is Hadoop? Hadoop is the implementation of the map/reduce design that we will use. Hadoop is released under the Apache License 2.0, so it's open source. Hadoop uses the Hadoop Distributed File System, HDFS. (In contrast to what we've seen with Lucene.) Get the release from:

Preparing Hadoop on your system
Configure passwordless public-key SSH on localhost.
Configure Hadoop: look at the two configuration files at
Format the HDFS: bin/hadoop namenode -format
Start Hadoop: bin/start-all.sh, run from the Hadoop directory (and wait ≈20 seconds)

Example: grep. Standard Unix 'grep' behavior: run it on the command line with the search string as the first argument and the list of files or directories as the subsequent argument(s).
    $ grep HelloWorld file1.c file2.c file3.c
    file2.c:System.out.println("I say HelloWorld!");
    $

Preparing for 'grep' in Hadoop Hadoop's jobs always operate within the HDFS. Hadoop will read its input from HDFS, and will write its output to HDFS. Thus, to prepare: Download a free electronic book: Load the file into HDFS: bin/hadoop fs -copyFromLocal book.txt /book.txt

Using 'grep' within Hadoop
    bin/hadoop jar \
        hadoop examples.jar \
        grep /book.txt /grep-result \
        "search string"
    bin/hadoop fs -ls /grep-result
    bin/hadoop fs -cat /grep-result/part
A good string to try: "Horace de \S+"
Between job runs:
    bin/hadoop fs -rmr /grep-result

How 'grep' in Hadoop works. The program runs two map/reduce jobs in sequence. The first job counts how many times a matching string occurred, and the second job sorts matching strings by their frequency and stores the output in a single output file.
Each mapper of the first job takes a line as input and matches the user-provided regular expression against the line. It extracts all matching strings and emits (matching string, 1) pairs. Each reducer sums the frequencies of each matching string. The output consists of sequence files containing the matching string and count. The reduce phase is optimized by running a combiner that sums the frequency of strings from local map output. As a result, it reduces the amount of data that needs to be shipped to a reduce task.
The second job takes the output of the first job as input. The mapper is an inverse map, while the reducer is an identity reducer. The number of reducers is one, so the output is stored in one file, and it is sorted by the count in descending order. The output file is text, each line of which contains a count and a matching string.
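
The two-job pipeline can be sketched in plain Python. This imitates the data flow only and is not the code of the Hadoop example; `grep_count_job` and `sort_job` are hypothetical names, and the sample lines are made up for illustration.

```python
import re
from collections import Counter


def grep_count_job(lines, pattern):
    """Job 1: map emits (match, 1) per match; combiner and reducer sum per match."""
    regex = re.compile(pattern)
    counts = Counter()
    for line in lines:
        for match in regex.finditer(line):
            counts[match.group(0)] += 1
    return list(counts.items())


def sort_job(pairs):
    """Job 2: inverse map (match, count) -> (count, match), one sorting reducer."""
    inverted = [(count, match) for match, count in pairs]
    return sorted(inverted, key=lambda cm: cm[0], reverse=True)


lines = [
    "Horace de Saint arrived.",
    "Then Horace de Saint met Horace de X at noon.",
]
result = sort_job(grep_count_job(lines, r"Horace de \S+"))
```

Each entry of `result` is a (count, matching string) pair, most frequent first, which mirrors the layout of the final text file described above.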

Another example: word count
    bin/hadoop jar hadoop examples.jar \
        wordcount /book.txt /wc-result
    bin/hadoop fs -cat /wc-result/part | \
        sort -n -k 2
You can also try passing a "-r #" option to increase the number of parallel reducers. Each mapper takes a line as input and breaks it into words. It then emits a key/value pair of the word and 1. Each reducer sums the counts for each word and emits a single key/value with the word and sum. As an optimization, the reducer is also used as a combiner on the map outputs. This reduces the amount of data sent across the network by combining each word into a single record.

Does map-reduce satisfy all needs? Map-reduce is great for homogeneous data, such as grepping a large collection of files or word-counting a huge document. Joining heterogeneous databases does not work well. As is, we'd need additional map-reduce steps, such as map-reducing one database and reading from the others on the fly. We want to support relational algebra.

Solution. The solution to these problems is map-reduce-merge. It is map-reduce with a new additional merging step. The merge phase makes it easier to process data relationships among heterogeneous data sets. Types:
    map: (k1, v1)α → [(k2, v2)]α
    reduce: (k2, [v2])α → (k2, [v3])α (notice that the output [v3] is a list)
    merge: ((k2, [v3])α, (k3, [v4])β) → (k4, v5)γ
If α = β, then the merging step performs a self-merge (a self-join in relational algebra).
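
A toy sketch of the three signatures, joining per-department totals (lineage α) with department names (lineage β). All names and data here are made up for illustration, and the merge uses a simple dictionary lookup on equal keys, which is an assumption of the sketch rather than the mechanism of map-reduce-merge itself.

```python
def map_reduce(records, map_fn, reduce_fn):
    """map: (k1, v1) -> [(k2, v2)], then reduce: (k2, [v2]) -> (k2, [v3])."""
    groups = {}
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups.setdefault(k2, []).append(v2)
    return [(k2, reduce_fn(k2, values)) for k2, values in sorted(groups.items())]


def merge(left, right, merger):
    """merge: ((k2, [v3]) from alpha, (k3, [v4]) from beta) -> (k4, v5)."""
    right_index = dict(right)
    return [
        merger(key, v3, right_index[key])
        for key, v3 in left
        if key in right_index          # merge condition: equal keys
    ]


employees = [("e1", ("d1", 100)), ("e2", ("d1", 50)), ("e3", ("d2", 70))]
departments = [("d1", "Sales"), ("d2", "Ops"), ("d3", "Legal")]

# Lineage alpha: total bonus per department id.
alpha = map_reduce(employees, lambda k, v: [(v[0], v[1])], lambda k, vs: [sum(vs)])
# Lineage beta: department names keyed by department id.
beta = map_reduce(departments, lambda k, v: [(k, v)], lambda k, vs: vs)
# Lineage gamma: (department name, total bonus).
joined = merge(alpha, beta, lambda key, totals, names: (names[0], totals[0]))
```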

New terms Partition selector: determines which data partitions produced by reducers should be retrieved for merging. Processor: user-defined logic of processing data from an individual source. Merger: user-defined logic of processing data merged from two sources where data satisfies a merge condition. Configurable iterator: next slide.

Configurable iterators The map and reduce user-defined functions get one iterator for the values. The merge function gets two iterators, one for each data source. The iterators do not have to move forward – they can be instrumented to do whatever the user wants. Relational join algorithms have specific patterns for the merging step.
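
One such pattern is the sort-merge join, sketched here with two cursors over key-sorted inputs. This is a generic textbook version under the assumption that both inputs are already sorted by key. `sort_merge_join` is a hypothetical name, and the right-hand cursor is re-scanned for each matching left row, which shows an iterator that does not simply move forward.

```python
def sort_merge_join(left, right):
    """Join two key-sorted lists of (key, value) pairs on equal keys."""
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1                      # advance the left iterator
        elif left_key > right_key:
            j += 1                      # advance the right iterator
        else:
            # Scan every right row with this key, then move only the left
            # cursor, so the next left row can re-read the same right rows.
            scan = j
            while scan < len(right) and right[scan][0] == left_key:
                out.append((left_key, left[i][1], right[scan][1]))
                scan += 1
            i += 1
    return out


left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
joined_rows = sort_merge_join(left, right)
```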