Hadoop: A Software Framework for Data Intensive Computing Applications


Hadoop: A Software Framework for Data Intensive Computing Applications
Ravi Mukkamala
Department of Computer Science, Old Dominion University, Norfolk, VA, USA
mukka@cs.odu.edu

What is Hadoop?
A software platform that lets one easily write and run applications that process vast amounts of data. It includes:
– MapReduce – offline computing engine
– HDFS – Hadoop Distributed File System
– HBase (pre-alpha) – online data access
Yahoo! is the biggest contributor.
Here's what makes it especially useful:
– Scalable: it can reliably store and process petabytes.
– Economical: it distributes the data and processing across clusters of commonly available computers (thousands of nodes).
– Efficient: by distributing the data, it can process it in parallel on the nodes where the data is located.
– Reliable: it automatically maintains multiple copies of data and automatically redeploys computing tasks when failures occur.

What does it do?
– Hadoop implements Google's MapReduce, using HDFS.
– MapReduce divides applications into many small blocks of work.
– HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster.
– MapReduce can then process the data where it is located.
– Hadoop's target is to run on clusters on the order of 10,000 nodes.

Hadoop: Assumptions
Hadoop is written with large clusters of computers in mind and is built around the following assumptions:
– Hardware will fail.
– Processing will be run in batches; thus there is an emphasis on high throughput as opposed to low latency.
– Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size.
– It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster.
– It should support tens of millions of files in a single instance.
– Applications need a write-once, read-many access model.
– Moving computation is cheaper than moving data.
– Portability is important.

Apache Hadoop Wins Terabyte Sort Benchmark (July 2008)
One of Yahoo!'s Hadoop clusters sorted 1 terabyte of data in 209 seconds, beating the previous record of 297 seconds in the annual general-purpose (Daytona) terabyte sort benchmark. The benchmark specifies the input data (10 billion 100-byte records), which must be completely sorted and written to disk. The sort used 1800 maps and 1800 reduces and allocated enough buffer memory to hold the intermediate data in memory. The cluster had:
– 910 nodes
– 2 quad-core Xeons @ 2.0 GHz per node
– 4 SATA disks per node
– 8 GB RAM per node
– 1 gigabit Ethernet on each node
– 40 nodes per rack
– 8 gigabit Ethernet uplinks from each rack to the core
– Red Hat Enterprise Linux Server Release 5.1 (kernel 2.6.18)
– Sun Java JDK 1.6.0_05-b13

Example Applications and Organizations Using Hadoop
– A9.com (Amazon): to build Amazon's product search indices; processes millions of sessions daily for analytics, using both the Java and streaming APIs; clusters vary from 1 to 100 nodes.
– Yahoo!: more than 100,000 CPUs in ~20,000 computers running Hadoop; biggest cluster: 2000 nodes (2×4-CPU boxes with 4 TB of disk each); used to support research for Ad Systems and Web Search.
– AOL: used for a variety of things ranging from statistics generation to running advanced algorithms for behavioral analysis and targeting; 50-machine cluster, Intel Xeon, dual processors, dual core, each with 16 GB RAM and an 800 GB hard disk, giving a total of 37 TB of HDFS capacity.
– Facebook: to store copies of internal log and dimension data sources and use them as a source for reporting/analytics and machine learning; 320-machine cluster with 2,560 cores and about 1.3 PB of raw storage.
– FOX Interactive Media: 3×20-machine cluster (8 cores/machine, 2 TB/machine storage); 10-machine cluster (8 cores/machine, 1 TB/machine storage); used for log analysis, data mining and machine learning.
– University of Nebraska-Lincoln: one medium-sized Hadoop cluster (200 TB) to store and serve physics data.

More Hadoop Applications
– Adknowledge: to build the recommender system for behavioral targeting, plus other clickstream analytics; clusters vary from 50 to 200 nodes, mostly on EC2.
– Contextweb: to store ad-serving logs and use them as a source for ad optimization, analytics, reporting and machine learning; 23-machine cluster with 184 cores and about 35 TB of raw storage; each (commodity) node has 8 cores, 8 GB RAM and 1.7 TB of storage.
– Cornell University Web Lab: generating web graphs on 100 nodes (dual 2.4 GHz Xeon processors, 2 GB RAM, 72 GB hard drive).
– NetSeer: up to 1000 instances on Amazon EC2; data storage in Amazon S3; used for crawling, processing, serving and log analysis.
– The New York Times: large-scale image conversions; EC2 to run Hadoop on a large virtual cluster.
– Powerset / Microsoft: natural language search; up to 400 instances on Amazon EC2; data storage in Amazon S3.

MapReduce Paradigm
– Programming model developed at Google.
– Sort/merge-based distributed computing.
– Initially intended for Google's internal search/indexing application, but now used extensively by many other organizations (e.g., Yahoo!, Amazon.com, IBM).
– It is functional-style programming (as in LISP) that is naturally parallelizable across a large cluster of workstations or PCs.
– The underlying system takes care of partitioning the input data, scheduling the program's execution across several machines, handling machine failures, and managing the required inter-machine communication. (This is the key to Hadoop's success.)

How does MapReduce work?
– The runtime partitions the input and provides it to different Map instances: Map(key, value) → (key', value').
– The runtime collects the (key', value') pairs and distributes them to several Reduce functions so that each Reduce function gets all pairs with the same key'.
– Each Reduce produces a single (or zero) output file.
– Map and Reduce are user-written functions.

Example MapReduce: counting the occurrences of words in a given set of documents

map(String key, String value):
  // key: document name; value: document contents
  // map(k1, v1) → list(k2, v2)
  for each word w in value:
    EmitIntermediate(w, "1");

(Example: if the input string is "Saibaba is God. I am I", Map produces {<"Saibaba",1>, <"is",1>, <"God",1>, <"I",1>, <"am",1>, <"I",1>}.)

reduce(String key, Iterator values):
  // key: a word; values: a list of counts
  // reduce(k2, list(v2)) → list(v2)
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));

(Example: reduce("I", <1,1>) → 2.)
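For comparison with the pseudocode above, a minimal sketch of the same word count written against Hadoop's Java MapReduce API (the org.apache.hadoop.mapreduce classes). The class names are illustrative, and the mapper and reducer are shown in one listing for brevity.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every word in the input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);            // EmitIntermediate(w, "1")
        }
    }
}

// Reducer: sums the partial counts collected for each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();                       // result += ParseInt(v)
        }
        context.write(key, new IntWritable(sum)); // Emit(AsString(result))
    }
}

A driver that wires these classes into a job is sketched after the execution overview below.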

Example applications
– Distributed grep (as in the Unix grep command)
– Count of URL access frequency
– Reverse web-link graph: list of all source URLs associated with a given target URL
– Inverted index: produces <word, list(document ID)> pairs (a sketch follows below)
– Distributed sort
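As a sketch of how one of these applications fits the same pattern, the inverted index might look as follows, assuming an input format that presents each record as a (document ID, document contents) pair of Text values; all class names here are illustrative.

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: for each word in a document, emit (word, documentID).
class InvertedIndexMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text docId, Text contents, Context context)
            throws IOException, InterruptedException {
        for (String w : contents.toString().split("\\s+")) {
            if (!w.isEmpty()) {
                context.write(new Text(w), docId);
            }
        }
    }
}

// Reduce: concatenate all document IDs seen for a word into one posting list,
// producing the <word, list(document ID)> pairs described above.
class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text word, Iterable<Text> docIds, Context context)
            throws IOException, InterruptedException {
        StringBuilder postings = new StringBuilder();
        for (Text id : docIds) {
            if (postings.length() > 0) postings.append(',');
            postings.append(id.toString());
        }
        context.write(word, new Text(postings.toString()));
    }
}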

MapReduce: Fault Tolerance
Worker failure: The master pings every worker periodically. If no response is received from a worker within a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state and therefore become eligible for scheduling on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling.
Master failure: It is easy to make the master write periodic checkpoints of its data structures. If the master task dies, a new copy can be started from the last checkpointed state. However, in most cases the user simply restarts the job.
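The following is purely an illustrative sketch, not Hadoop's or Google's actual code, of the bookkeeping described above for worker failure: a master that times out silent workers and resets their tasks to idle. The timeout value and class names are assumptions.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class MasterSketch {
    enum TaskState { IDLE, IN_PROGRESS, COMPLETED }

    static class Task {
        TaskState state = TaskState.IDLE;
        String worker;        // worker currently (or last) assigned to this task
        boolean isMap;        // map and reduce tasks are treated differently on failure
    }

    final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();
    final List<Task> tasks = new ArrayList<>();
    static final long TIMEOUT_MS = 30_000;   // assumed ping timeout

    // Called periodically: any worker silent past the timeout is treated as failed.
    void checkWorkers(long now) {
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
            if (now - e.getValue() > TIMEOUT_MS) {
                handleFailure(e.getKey());
            }
        }
    }

    // Completed map tasks are reset too, because their output lives on the failed
    // worker's local disk; completed reduce output is already in the global file system.
    void handleFailure(String failedWorker) {
        for (Task t : tasks) {
            boolean onFailedWorker = failedWorker.equals(t.worker);
            boolean mustRedo = t.state == TaskState.IN_PROGRESS
                    || (t.state == TaskState.COMPLETED && t.isMap);
            if (onFailedWorker && mustRedo) {
                t.state = TaskState.IDLE;    // eligible for rescheduling elsewhere
                t.worker = null;
            }
        }
        lastHeartbeat.remove(failedWorker);
    }
}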

Mapping Workers to Processors
The input data (on HDFS) is stored on the local disks of the machines in the cluster. HDFS divides each file into 64 MB blocks and stores several copies of each block (typically 3) on different machines. The MapReduce master takes the location information of the input files into account and attempts to schedule a map task on a machine that contains a replica of the corresponding input data. Failing that, it attempts to schedule a map task near a replica of that task's input data. When running large MapReduce operations on a significant fraction of the workers in a cluster, most input data is read locally and consumes no network bandwidth.
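The replica placement that makes this scheduling possible is visible through the HDFS client API. A small example, with a hypothetical file path, listing which hosts hold each block of a file; this is the location information a scheduler consults when placing map tasks near their data.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/data/input/part-00000");   // hypothetical HDFS file

        // For each block: its offset, length, and the hosts holding a replica.
        FileStatus status = fs.getFileStatus(path);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}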

Task Granularity
– The map phase has M pieces and the reduce phase has R pieces.
– M and R should be much larger than the number of worker machines. Having each worker perform many different tasks improves dynamic load balancing and also speeds up recovery when a worker fails.
– The larger M and R are, the more scheduling decisions the master must make.
– R is often constrained by users because the output of each reduce task ends up in a separate output file.
– Typically (at Google), M = 200,000 and R = 5,000, using 2,000 worker machines, i.e., roughly 100 map tasks and 2–3 reduce tasks per machine over the life of a job.

Additional Support Functions
Partitioning function: The users of MapReduce specify the number of reduce tasks/output files that they desire (R). Data gets partitioned across these tasks using a partitioning function on the intermediate key. A default partitioning function is provided that uses hashing (e.g., hash(key) mod R). In some cases it may be useful to partition data by some other function of the key, so the user of the MapReduce library can provide a special partitioning function (a sketch follows below).
Combiner function: The user can specify a Combiner function that does partial merging of the intermediate data on local disk before it is sent over the network. The Combiner function is executed on each machine that performs a map task. Typically the same code is used to implement both the combiner and the reduce functions.
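A sketch of a user-supplied partitioning function in the Hadoop Java API; the key type and the bucketing rule (first letter of the key rather than hash(key) mod R) are illustrative choices, not part of the original slide.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes keys by their first character so that each reduce output file
// covers one alphabetic range instead of a hash bucket.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String s = key.toString();
        char first = s.isEmpty() ? 'a' : Character.toLowerCase(s.charAt(0));
        int bucket = (first >= 'a' && first <= 'z') ? (first - 'a') : 26;
        return bucket % numPartitions;   // must return a value in [0, R)
    }
}

In Hadoop the combiner is attached with job.setCombinerClass(...); for word count the reducer class itself is commonly reused, since summing partial counts per map task yields the same final totals.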

Problem: Seeks Are Expensive
– CPU and transfer speed, RAM and disk size double every 18–24 months.
– Seek time is nearly constant (improving only ~5%/year).
– Time to read an entire drive is growing; scalable computing must go at the transfer rate.
Example: updating a terabyte DB, given 10 MB/s transfer, 10 ms/seek, 100 B/entry (10 billion entries), 10 kB/page (1 billion pages). Updating 1% of the entries (100 million) takes:
– 1000 days with random B-tree updates
– 100 days with batched B-tree updates
– 1 day with sort & merge
To process 100 TB datasets:
– on 1 node: scanning @ 50 MB/s = 23 days
– on a 1000-node cluster: scanning @ 50 MB/s = 33 min, but MTBF = 1 day
Hence the need for a framework for distribution that is efficient, reliable, and easy to use.

HDFS
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems; however, the differences from other distributed file systems are significant. HDFS:
– is highly fault-tolerant and is designed to be deployed on low-cost hardware;
– provides high-throughput access to application data and is suitable for applications that have large data sets;
– relaxes a few POSIX requirements to enable streaming access to file system data;
– is part of the Apache Hadoop Core project. The project URL is http://hadoop.apache.org/core/.
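A minimal sketch of the write-once, stream-read access model through the HDFS client API; the NameNode address and file path are hypothetical.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");   // hypothetical NameNode
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/hello.txt");        // hypothetical path

        // Write once...
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello, hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // ...read many times, as a stream.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}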

HDFS Architecture (diagram)

Example Runs [1]
Cluster configuration: ≈1800 machines, each with two 2 GHz Intel Xeon processors, 4 GB of memory (1–1.5 GB reserved for other tasks), two 160 GB IDE disks, and a gigabit Ethernet link. All of the machines were in the same hosting facility, so the round-trip time between any pair of machines was less than a millisecond.
Grep: scans through 10^10 100-byte records (distributed over 1000 input files by GFS), searching for a relatively rare three-character pattern (the pattern occurs in 92,337 records). The input is split into approximately 64 MB pieces (M = 15000), and the entire output is placed in one file (R = 1). The entire computation took approximately 150 seconds from start to finish, including 60 seconds to start the job.
Sort: sorts 10^10 100-byte records (approximately 1 terabyte of data). As before, the input data is split into 64 MB pieces (M = 15000), with R = 4000. Including startup overhead, the entire computation took 891 seconds.

Execution Overview
1. The MapReduce library in the user program first splits the input files into M pieces, typically 16 MB to 64 MB per piece. It then starts up many copies of the program on a cluster of machines.
2. One of the copies of the program is the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.
3. A worker who is assigned a map task reads the contents of the assigned input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.
4. Periodically, the buffered pairs are written to local disk, and the locations of these buffered pairs on the local disk are passed back to the master, who forwards the locations to the reduce workers.

Execution Overview (cont.)
5. When a reduce worker is notified by the master about these locations, it uses remote procedure calls (RPCs) to read the buffered data from the local disks of the map workers. When a reduce worker has read all the intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together.
6. The reduce worker iterates over the sorted intermediate data and, for each unique intermediate key encountered, passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.
7. When all map tasks and reduce tasks have been completed, the master wakes up the user program, and the MapReduce call in the user program returns to the user code. The output of the MapReduce execution is available in the R output files (one per reduce task).
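Step 1 above corresponds to the user-side driver that configures and submits the job. A hedged sketch using the Hadoop Java API, reusing the hypothetical WordCountMapper and WordCountReducer classes from the earlier word-count sketch; input and output paths come from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);   // partial sums before the shuffle
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(4);                       // R: number of output files

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // waitForCompletion blocks until all map and reduce tasks finish,
        // i.e. until "the MapReduce call returns" in step 7 above.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}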

References
[1] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," OSDI 2004. (Google)
[2] D. Cutting and E. Baldeschwieler, "Meet Hadoop," OSCON, Portland, OR, USA, 25 July 2007. (Yahoo!)
[3] R. E. Bryant, "Data Intensive Scalable Computing: The Case for DISC," Tech. Report CMU-CS-07-128, http://www.cs.cmu.edu/~bryant