IBM Research ® © 2007 IBM Corporation INTRODUCTION TO HADOOP & MAP-REDUCE


IBM Research | India Research Lab
What is Hadoop?
An open-source, batch-oriented (offline), data- and I/O-intensive, general-purpose framework for creating distributed applications that process huge amounts of data.
HUGE:
- A few thousand machines
- Petabytes of data
- Thousands of jobs processed each week
What is not Hadoop?
- A relational database
- An OLTP system
- A structured data store of any kind

Hadoop vs Relational
- General Purpose vs Relational Data
- User Control vs System Defined
- No Schema vs Schema
- Key-Value Pairs vs Tables
- Offline/Batch vs Online/Real-time

Hadoop Eco-System
- HDFS: Hadoop Distributed File System
- Map-Reduce System: a distributed framework for executing work in parallel
- Hive/Pig/Jaql: SQL-like languages to manipulate relational data on HDFS
- HBase: column store on Hadoop
- Misc: Avro, Ganglia, Sqoop, ZooKeeper, Mahout

HDFS
- Hadoop Distributed File System
  - Stores files in blocks across many nodes in a cluster
  - Replicates the blocks across nodes for durability
  - Default block size: 64 MB
- Master/Slave Architecture
  - HDFS Master: NameNode
    - Runs on a single node as a master process
    - Directs client access to files in HDFS
  - HDFS Slave: DataNode
    - Runs on all nodes in the cluster
    - Handles block creation/replication/deletion
    - Takes orders from the NameNode
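As a rough illustration of the block arithmetic above, the following sketch computes how many blocks a file occupies and how many block replicas the cluster ends up storing. The 64 MB block size comes from the slide; the replication factor of 3, the 200 MB file, and the class and method names are assumptions made only for this example.

```java
class HdfsBlockMathSketch {
    // Default block size quoted on the slide: 64 MB.
    static final long BLOCK_SIZE_BYTES = 64L * 1024 * 1024;

    // Number of blocks needed for a file (the last block may be only partly filled).
    static long blockCount(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE_BYTES - 1) / BLOCK_SIZE_BYTES;
    }

    // Total block replicas stored across the cluster for a given replication factor.
    static long replicaCount(long fileSizeBytes, int replicationFactor) {
        return blockCount(fileSizeBytes) * replicationFactor;
    }

    public static void main(String[] args) {
        long fileSize = 200L * 1024 * 1024; // hypothetical 200 MB file
        int replicationFactor = 3;          // assumed, not stated on the slide
        System.out.println("blocks   = " + blockCount(fileSize));
        System.out.println("replicas = " + replicaCount(fileSize, replicationFactor));
    }
}
```

Under these assumptions a 200 MB file needs 4 blocks (three full ones and one partial), so the cluster holds 12 block replicas in total.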

HDFS
(Diagram: NameNode and Data Nodes)

HDFS: Put File
(Diagram: NameNode and Data Nodes; File1.txt is stored as blocks, each placed on several Data Nodes. Placements shown: "…, 4, 5", "2, 5, 6", "2, 3, 4")

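HDFS: Read File
(Diagram: NameNode and Data Nodes; blocks are read from the Data Nodes. Locations shown: "…, 4", "2, 6", "2, 3")
Read throughput = Transfer-Rate x Number of Machines (blocks are read from several Data Nodes in parallel)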

HDFS
- Fault-Tolerant
  - Handles node failures
- Self-Healing
  - Rebalances files across the cluster
  - When a node fails, its blocks are automatically re-copied from the two remaining replicas
- Scalable
  - Just by adding new nodes

Map-Reduce
- Logical functions: Mappers and Reducers
- Developers write map and reduce functions, then submit a jar to the Hadoop cluster
- Hadoop handles distributing the map and reduce tasks across the cluster
- Typically batch-oriented

Map-Reduce Job-Flow

Word-Count
Input:
  Hadoop Uses Map-Reduce
  There is a Map-Phase
  There is a Reduce phase
Map output:
  (Hadoop, 1) (Uses, 1) (Map, 1) (Reduce, 1)
  (There, 1) (is, 1) (a, 1) (Map, 1) (Phase, 1)
  (There, 1) (is, 1) (a, 1) (Reduce, 1) (Phase, 1)
Sort/Shuffle (keys partitioned into the ranges A-I, J-Q, R-Z):
  A-I: (a, [1,1]) (Hadoop, 1) (is, [1,1])
  J-Q: (map, [1,1]) (phase, [1,1])
  R-Z: (reduce, [1,1]) (there, [1,1]) (uses, 1)
Reduce output:
  (a, 2) (hadoop, 1) (is, 2) (map, 2) (phase, 2) (reduce, 2) (there, 2) (uses, 1)
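The word-count flow above can be traced without a cluster. The following is a minimal single-process sketch, not Hadoop code: it simulates the map, sort/shuffle, and reduce steps in plain Java on the three input lines from the slide. The class and method names are hypothetical, and lower-casing the tokens and splitting on non-letters are assumptions chosen so that the result matches the reduce output shown on the slide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordCountFlowSketch {
    // Map step: emit (word, 1) for every token of one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty()) {
                pairs.add(Map.entry(token, 1));
            }
        }
        return pairs;
    }

    // Sort/shuffle step: group all values by key, with keys kept in sorted order.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce step: sum the list of ones collected for a single word.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs the three steps in sequence over all input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(pairs).entrySet()) {
            counts.put(group.getKey(), reduce(group.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "Hadoop Uses Map-Reduce", "There is a Map-Phase", "There is a Reduce phase");
        // Prints {a=2, hadoop=1, is=2, map=2, phase=2, reduce=2, there=2, uses=1}
        System.out.println(wordCount(lines));
    }
}
```

In a real job the three steps run on different machines, and the shuffle moves data over the network; here they are ordinary method calls so that each intermediate result can be inspected.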

Map-Reduce Daemons
- Job-Tracker (Master)
  - Manages map-reduce jobs
  - Partitions tasks across different nodes
  - Manages task failures and restarts tasks on different nodes
  - Speculative execution
- Task-Tracker (Slave)
  - Creates individual map and reduce tasks
  - Reports task status to the job-tracker

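Word-Count Map
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Type parameters, in order: input key, input value, output key, output value.
    public class WordCountMap extends Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text line, Context context) throws IOException, InterruptedException {
            // Whitespace splitting stands in for the slide's unspecified Tokenize helper.
            String[] tokens = line.toString().split("\\s+");
            for (int i = 0; i < tokens.length; i++) {
                context.write(new Text(tokens[i]), new IntWritable(1));
            }
        }
    }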

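Word-Count Reduce
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Type parameters, in order: input key, input value, output key, output value.
    public class WordCountReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int count = 0;  // sum of the ones emitted for this word
            for (IntWritable value : values) { count += value.get(); }
            context.write(key, new IntWritable(count));
        }
    }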

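Word-Count Runner Class
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountRunner {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance();  // replaces the deprecated new Job()
            job.setMapperClass(WordCountMap.class);
            job.setReducerClass(WordCountReduce.class);
            job.setJarByClass(WordCountRunner.class);
            // Assumption: the slide does not show where the paths come from;
            // here they are taken from the first two command-line arguments.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(IntWritable.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.setNumReduceTasks(1);
            job.waitForCompletion(true);
        }
    }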

Running a Job
    ./bin/hadoop jar WC.jar WordCountRunner WC

Cluster View of a MR Job Flow
(Diagram labels: NameNode, JobTracker, three worker nodes each running a Task Tracker and a Data Node; JAR and MR job submitted; MAP PHASE (M tasks) -> k,v -> SHUFFLE/SORT -> k,v -> REDUCE PHASE (R tasks) -> JOB FINISHED)

Map-Reduce Example: Aggregation
- Compute Avg of B for each distinct value of A
Input: relation R with columns A, B, C (seven tuples)
MAP 1 and MAP 2 output, as (A, B) pairs:
  (1, 10) (2, 20) (1, 10) (1, 30) (3, 40) (2, 10) (1, 20)
Reducer 1 and Reducer 2 input, grouped by A:
  (1, [10, 10, 30, 20]) (2, [20, 10]) (3, 40)
Reducer output, as (A, Avg of B):
  (1, 17.5) (2, 15) (3, 40)
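The aggregation example can be reproduced with a short in-memory sketch. It is plain Java rather than a Hadoop job: the (A, B) pairs below are the map outputs listed on the slide, the grouping step plays the role of the shuffle, and the averaging step plays the role of the reducer. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class AverageByKeySketch {
    // The (A, B) pairs emitted by the mappers on the slide.
    static List<int[]> sampleMapOutput() {
        return List.of(
                new int[] {1, 10}, new int[] {2, 20}, new int[] {1, 10}, new int[] {1, 30},
                new int[] {3, 40}, new int[] {2, 10}, new int[] {1, 20});
    }

    static Map<Integer, Double> averageByKey(List<int[]> mapOutput) {
        // Shuffle: collect every B value under its A key.
        Map<Integer, List<Integer>> grouped = new TreeMap<>();
        for (int[] pair : mapOutput) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        // Reduce: one average per key.
        Map<Integer, Double> averages = new TreeMap<>();
        for (Map.Entry<Integer, List<Integer>> group : grouped.entrySet()) {
            double sum = 0;
            for (int value : group.getValue()) {
                sum += value;
            }
            averages.put(group.getKey(), sum / group.getValue().size());
        }
        return averages;
    }

    public static void main(String[] args) {
        // Prints {1=17.5, 2=15.0, 3=40.0}
        System.out.println(averageByKey(sampleMapOutput()));
    }
}
```

One design point worth noting: an average cannot be pre-aggregated by simply averaging partial averages. If a combiner were used, it would have to forward (sum, count) pairs and leave the final division to the reducer.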

Map-Reduce Example: Join
- Select R.A, R.B, S.D where R.A == S.A
Input: relations R(A, B, C) and S(A, D, E), five tuples each
MAP 1 output (R tuples, keyed by A and tagged with R):
  (1, [R, 10]) (2, [R, 20]) (1, [R, 10]) (1, [R, 30]) (3, [R, 40])
MAP 2 output (S tuples, keyed by A and tagged with S):
  (1, [S, 20]) (2, [S, 30]) (2, [S, 10]) (3, [S, 50]) (3, [S, 40])
Reducer 1 and Reducer 2 input, grouped by A:
  (1, [(R, 10), (R, 10), (R, 30), (S, 20)])
  (2, [(R, 20), (S, 30), (S, 10)])
  (3, [(R, 40), (S, 50), (S, 40)])
Reducer output, as (R.A, R.B, S.D):
  (1, 10, 20) (1, 10, 20) (1, 30, 20) (2, 20, 30) (2, 20, 10) (3, 40, 50) (3, 40, 40)
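The equi-join above follows the usual reduce-side join pattern: tag each tuple with its relation, key it by the join attribute, and let each reducer pair up the R and S values it receives. The sketch below simulates that in plain Java on the tuples from the slide. The class and method names are hypothetical, and the integer tags (0 for R, 1 for S) are an encoding chosen for this example only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ReduceSideJoinSketch {
    // R tuples as (A, B) and S tuples as (A, D), taken from the slide.
    static List<int[]> sampleR() {
        return List.of(new int[] {1, 10}, new int[] {2, 20}, new int[] {1, 10},
                new int[] {1, 30}, new int[] {3, 40});
    }

    static List<int[]> sampleS() {
        return List.of(new int[] {1, 20}, new int[] {2, 30}, new int[] {2, 10},
                new int[] {3, 50}, new int[] {3, 40});
    }

    static List<int[]> join(List<int[]> r, List<int[]> s) {
        // Map + shuffle: key by A; each value is {tag, payload}, tag 0 = R, tag 1 = S.
        Map<Integer, List<int[]>> grouped = new TreeMap<>();
        for (int[] t : r) {
            grouped.computeIfAbsent(t[0], k -> new ArrayList<>()).add(new int[] {0, t[1]});
        }
        for (int[] t : s) {
            grouped.computeIfAbsent(t[0], k -> new ArrayList<>()).add(new int[] {1, t[1]});
        }
        // Reduce: per key, separate the two relations and emit their cross product.
        List<int[]> output = new ArrayList<>();
        for (Map.Entry<Integer, List<int[]>> group : grouped.entrySet()) {
            List<Integer> bValues = new ArrayList<>();
            List<Integer> dValues = new ArrayList<>();
            for (int[] tagged : group.getValue()) {
                if (tagged[0] == 0) {
                    bValues.add(tagged[1]);
                } else {
                    dValues.add(tagged[1]);
                }
            }
            for (int b : bValues) {
                for (int d : dValues) {
                    output.add(new int[] {group.getKey(), b, d});
                }
            }
        }
        return output;
    }

    public static void main(String[] args) {
        for (int[] row : join(sampleR(), sampleS())) {
            System.out.println("(" + row[0] + ", " + row[1] + ", " + row[2] + ")");
        }
    }
}
```

Because the reducer for a key must hold all R and S values of that key before emitting pairs, a heavily repeated key makes one reducer do most of the work.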

Map-Reduce Example: Inequality Join
- Select R.A, R.B, S.D where R.A <= S.A
- Consider a 3-node cluster
Input: relations R(A, B, C) and S(A, D, E), five tuples each
MAP 1 output (each R tuple is replicated to every reducer whose index is >= R.A):
  (r1, [R, 1, 10]) (r2, [R, 1, 10]) (r3, [R, 1, 10]) (r2, [R, 2, 20]) (r3, [R, 2, 20]) ….. (r3, [R, 3, 40])
MAP 2 output (each S tuple goes to the reducer matching S.A):
  (r1, [S, 1, 20]) (r2, [S, 2, 30]) (r2, [S, 2, 10]) (r3, [S, 3, 50]) (r3, [S, 3, 40])
Reducer 1 input: (r1, ([R, 1, 10], [R, 1, 10], [R, 1, 30], [S, 1, 20]))
Reducer 2 input: ……
Reducer 3 input: (r3, ([R, 1, 10], [R, 2, 20], [R, 1, 10], [R, 1, 30], [R, 3, 40], [S, 3, 50], [S, 3, 40]))
Reducer 1 output: (1, 10, 20) (1, 30, 20)
Reducer 3 output: (1, 10, 50) (1, 10, 40) (2, 20, 50) (2, 20, 40) (1, 10, 50) (1, 10, 40) (1, 30, 50) (1, 30, 40) (3, 40, 50) (3, 40, 40)
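The replication idea behind the inequality join can be sketched in a few lines of plain Java. This is a simulation under stated assumptions, not the presenter's implementation: it assumes the values of A are the integers 1..n and that reducer k handles the S tuples with A = k, so an R tuple with A = a must be copied to reducers a through n. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class InequalityJoinSketch {
    // R tuples as (A, B) and S tuples as (A, D), taken from the slide.
    static List<int[]> sampleR() {
        return List.of(new int[] {1, 10}, new int[] {2, 20}, new int[] {1, 10},
                new int[] {1, 30}, new int[] {3, 40});
    }

    static List<int[]> sampleS() {
        return List.of(new int[] {1, 20}, new int[] {2, 30}, new int[] {2, 10},
                new int[] {3, 50}, new int[] {3, 40});
    }

    // Assumes every A value lies in 1..numReducers.
    static List<int[]> join(List<int[]> r, List<int[]> s, int numReducers) {
        List<List<int[]>> rAtReducer = new ArrayList<>();
        List<List<int[]>> sAtReducer = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            rAtReducer.add(new ArrayList<>());
            sAtReducer.add(new ArrayList<>());
        }
        // Map over R: replicate the tuple to reducers R.A, R.A + 1, ..., numReducers.
        for (int[] t : r) {
            for (int k = t[0]; k <= numReducers; k++) {
                rAtReducer.get(k - 1).add(t);
            }
        }
        // Map over S: send the tuple only to the reducer numbered S.A.
        for (int[] t : s) {
            sAtReducer.get(t[0] - 1).add(t);
        }
        // Reduce: every R tuple at reducer k has R.A <= k = S.A, so all pairs qualify.
        List<int[]> output = new ArrayList<>();
        for (int k = 0; k < numReducers; k++) {
            for (int[] sTuple : sAtReducer.get(k)) {
                for (int[] rTuple : rAtReducer.get(k)) {
                    output.add(new int[] {rTuple[0], rTuple[1], sTuple[1]});
                }
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<int[]> result = join(sampleR(), sampleS(), 3);
        System.out.println(result.size() + " joined tuples");
    }
}
```

The price of this scheme is communication: an R tuple with A = a is shipped n - a + 1 times, and the reducer with the largest index receives every R tuple.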

Designing a Map-Reduce Algorithm
- Thinking in terms of Map and Reduce
  - What data should be the key?
  - What data should be the values?
- Minimizing Cost
  - Reading cost
  - Communication cost
  - Processing cost at the reducer
- Load Balancing
  - All reducers should get a similar volume of traffic
  - It should not happen that only a few machines are busy while the others sit idle
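To make the load-balancing point concrete, the sketch below counts how many records each reducer would receive for a given list of keys. The partitioning rule, (hashCode & Integer.MAX_VALUE) % numReducers, is the one used by Hadoop's default HashPartitioner; everything else (class name, method names, the sample keys) is hypothetical and chosen only to show a skewed distribution.

```java
import java.util.Arrays;
import java.util.List;

class PartitionLoadSketch {
    // Same rule as Hadoop's default HashPartitioner: non-negative hash modulo reducer count.
    static int partition(Object key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Number of records each reducer would receive for the given keys.
    static int[] load(List<?> keys, int numReducers) {
        int[] counts = new int[numReducers];
        for (Object key : keys) {
            counts[partition(key, numReducers)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical skewed key set: key 1 appears four times.
        List<Integer> keys = List.of(1, 1, 1, 1, 2, 3);
        // Prints [1, 4, 1]: the reducer that owns key 1 gets most of the traffic.
        System.out.println(Arrays.toString(load(keys, 3)));
    }
}
```

Hashing spreads distinct keys evenly, but it cannot split a single hot key; that is why the choice of key is listed first among the design questions.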

SQL-Like Languages For Map-Reduce
- Hive, Pig, JAQL
- A user need not write native Java Map-Reduce code
- SQL-like statements can be written to process data on Hadoop
- Allows users without a sound understanding of map-reduce to work on data stored on HDFS

JAQL
- Simpler language for writing Map-Reduce jobs
  - Reduces the barrier to Hadoop use by eliminating the need to write Java programs for many users
  - Exploits massive parallelism using Hadoop
- Provides a simple yet powerful language to manipulate semi-structured data
  - Uses JSON as data model
  - Most data has a natural JSON representation
  - Easily extended using Java, Python, JavaScript
  - Inspired by UNIX pipes
- Other languages: Hive, Pig
- Resources

JavaScript Object Notation (JSON)
- $emp = [
    {name: "Jon Doe", income: 20000, mgr: false},
    {name: "Vince Wayne", income: 32500, mgr: false}
  ]
- $emp = [
    {name: "Jon Doe", income: 20000, mgr: false, dob: {day: 1, month: 1, year: 1975}},
    {name: "Vince Wayne", income: 32500, mgr: false, dob: {day: 1, month: 2, year: 1978}}
  ]
- $emp = [
    {name: "Jon Doe", income: 20000, mgr: false, skills: ["java", "C++", "Hadoop"]},
    {name: "Vince Wayne", income: 32500, mgr: false, skills: ["java", "DB2", "SQL"]}
  ]
- $emp = [
    {name: "Jon Doe", income: 20000, mgr: false,
     exp: [{org: 'IBM', from: 2000, to: 2005}, {org: 'yahoo', from: 2005, to: '2010'}]},
    {name: "Vince Wayne", income: 32500, mgr: false,
     exp: [{org: 'IBM', from: 2000, to: 2003}, {org: 'oracle', from: 2003, to: '2010'}]}
  ]
JSON has arrays, records, strings, numbers, booleans, and null
[] == array, {} == record or object, x: == field name

Accessing Data
- $emp = [
    {name: "Jon Doe", income: 20000, mgr: false, skills: ["java", "C++", "Hadoop"],
     exp: [{org: 'IBM', from: 2000, to: 2005}, {org: 'yahoo', from: 2005, to: '2010'}]},
    {name: "Vince Wayne", income: 32500, mgr: false, skills: ["java", "C++", "Hadoop"],
     exp: [{org: 'IBM', from: 2000, to: 2003}, {org: 'oracle', from: 2003, to: '2010'}]}
  ]
- $emp[0] = {name: "Jon Doe", income: 20000, mgr: false, skills: ["java", "C++", "Hadoop"],
     exp: [{org: 'IBM', from: 2000, to: 2005}, {org: 'yahoo', from: 2005, to: '2010'}]}
- $emp[0].name = "Jon Doe"
- $emp[0].exp[0] = {org: 'IBM', from: 2000, to: 2005}
- $emp[0].exp[0].org = 'IBM'
- $emp[0].skills[0] = 'java'
- $emp[*].name = ['Jon Doe', 'Vince Wayne']
- $emp[0].exp[*].org = ['IBM', 'yahoo']
- $emp[*].exp[*].org = [['IBM', 'yahoo'], ['IBM', 'oracle']]

JAQL core functionalities
- Filter
- Transform
- Group
- Join
- Sort
- Expand

Filter
- $input -> filter <predicate>;
- In the <predicate>, the variable $ is bound to each item of the input
- The <predicate> can be composed of the relations ==, !=, >, >=, <, <=
- Complex expressions can be created with not, and, or, which are evaluated in this order
- If the <predicate> evaluates to true, the item from the input is included in the output

Filter Example
- $employees = [
    {name: "Jon Doe", income: 20000, mgr: false},
    {name: "Vince Wayne", income: 32500, mgr: false},
    {name: "Jane Dean", income: 72000, mgr: true},
    {name: "Alex Smith", income: 25000, mgr: false}
  ];
- $employees -> filter $.mgr or $.income > 30000;
- [
    { "income": 32500, "mgr": false, "name": "Vince Wayne" },
    { "income": 72000, "mgr": true, "name": "Jane Dean" }
  ]
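For readers more familiar with Java than JAQL, the filter example corresponds to a simple loop with a boolean predicate. The sketch below is only an analogy written for this purpose; the Employee class, the method names, and the sample data holder are hypothetical, and the data is the employee list from the slide.

```java
import java.util.ArrayList;
import java.util.List;

class FilterSketch {
    // Minimal stand-in for one JSON record of the $employees array.
    static final class Employee {
        final String name;
        final int income;
        final boolean mgr;

        Employee(String name, int income, boolean mgr) {
            this.name = name;
            this.income = income;
            this.mgr = mgr;
        }
    }

    static List<Employee> sampleEmployees() {
        return List.of(
                new Employee("Jon Doe", 20000, false),
                new Employee("Vince Wayne", 32500, false),
                new Employee("Jane Dean", 72000, true),
                new Employee("Alex Smith", 25000, false));
    }

    // Java counterpart of: $employees -> filter $.mgr or $.income > 30000;
    static List<Employee> managersOrHighEarners(List<Employee> employees) {
        List<Employee> kept = new ArrayList<>();
        for (Employee e : employees) {
            if (e.mgr || e.income > 30000) {
                kept.add(e);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        for (Employee e : managersOrHighEarners(sampleEmployees())) {
            System.out.println(e.name + " " + e.income + " " + e.mgr);
        }
    }
}
```

The loop keeps Vince Wayne (income above 30000) and Jane Dean (a manager), which matches the JAQL result on the slide.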

Group By
- $input -> group by <variable> = <grouping expression> into <aggregation expression>
- Similar to SQL group-by
- In the <aggregation expression>, $ is bound to the grouped items
- To get an array of all values for an item that are aggregated into one group, use $[*]

Group By Example
- $employees = [
    {id: 1, dept: 1, band: 7, income: 12000},
    {id: 2, dept: 1, band: 8, income: 13000},
    {id: 3, dept: 2, band: 7, income: 15000},
    {id: 4, dept: 1, band: 8, income: 10000},
    {id: 5, dept: 3, band: 7, income: 8000},
    {id: 6, dept: 2, band: 8, income: 5000},
    {id: 7, dept: 1, band: 7, income: 24000}
  ]
- $employees -> group by $dept = $.dept into {$dept, total: sum($[*].income)};
  [ {dept: 1, total: 59000}, {dept: 2, total: 20000}, {dept: 3, total: 8000} ]
- $employees -> group by $dept_group = $.dept into {$dept_group, total: sum($[*].income)};
- $employees -> group by $dept_group = {$.dept, $.band} into {$dept_group.*, total: sum($[*].income)}
- $employees -> group by $dept_group = {$.dept, $.band} into {$dept_group, total: sum($[*].income)}
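The first group-by query (total income per department) can be checked against a plain Java equivalent. This is an analogy for illustration, not how JAQL executes the query: the rows are the seven employees from the slide encoded as {id, dept, band, income}, and the class and method names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

class GroupBySketch {
    // Rows from the slide, encoded as {id, dept, band, income}.
    static int[][] sampleEmployees() {
        return new int[][] {
            {1, 1, 7, 12000}, {2, 1, 8, 13000}, {3, 2, 7, 15000}, {4, 1, 8, 10000},
            {5, 3, 7, 8000}, {6, 2, 8, 5000}, {7, 1, 7, 24000}
        };
    }

    // Java counterpart of grouping by dept and computing sum($[*].income).
    static Map<Integer, Integer> totalIncomeByDept(int[][] employees) {
        Map<Integer, Integer> totals = new TreeMap<>();
        for (int[] row : employees) {
            int dept = row[1];
            int income = row[3];
            totals.merge(dept, income, Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Prints {1=59000, 2=20000, 3=8000}
        System.out.println(totalIncomeByDept(sampleEmployees()));
    }
}
```

Grouping by the pair (dept, band), as in the later queries on the slide, would only change the map key from a single integer to a composite of both fields.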

Join
- join <variable list> where <join condition> into <output expression>
- <variable list> contains two or more variables that should share at least one attribute
- <join condition>: only equality predicates are allowed
- <output expression> is applied to all items from the input that match the join condition. To copy all fields of an input, use $input.*
- Add the keyword 'preserve' to make it a full join

Join Example
- $users = [
    {name: "Jon Doe", password: "asdf1234", id: 1},
    {name: "Jane Doe", password: "qwertyui", id: 2},
    {name: "Max Mustermann", password: "q1w2e3r4", id: 3}
  ];
  $pages = [
    {userid: 1, url: "code.google.com/p/jaql/"},
    {userid: 2, url: "…"},
    {userid: 1, url: "java.sun.com/javase/6/docs/api/"}
  ]
- join $users, $pages where $users.id == $pages.userid into {$users.name, $pages.*}
- [
    { "name": "Jon Doe", "url": "code.google.com/p/jaql/", "userid": 1 },
    { "name": "Jon Doe", "url": "java.sun.com/javase/6/docs/api/", "userid": 1 },
    { "name": "Jane Doe", "url": "…", "userid": 2 }
  ]
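The JAQL join above behaves like an inner equi-join, which in plain Java can be sketched as a hash join: index the users by id, then probe that index with each page. The code is an analogy for illustration, not how JAQL executes the query. The class and method names are hypothetical, and because the URL of the second page is missing from the transcript, the value "example.org/placeholder" below is a made-up stand-in.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class JaqlJoinSketch {
    // Minimal stand-in for one record of the $pages array.
    static final class Page {
        final int userId;
        final String url;

        Page(int userId, String url) {
            this.userId = userId;
            this.url = url;
        }
    }

    // id -> name for the three users on the slide (passwords omitted).
    static Map<Integer, String> sampleUsers() {
        Map<Integer, String> users = new LinkedHashMap<>();
        users.put(1, "Jon Doe");
        users.put(2, "Jane Doe");
        users.put(3, "Max Mustermann");
        return users;
    }

    static List<Page> samplePages() {
        return List.of(
                new Page(1, "code.google.com/p/jaql/"),
                new Page(2, "example.org/placeholder"), // made-up URL, original is missing
                new Page(1, "java.sun.com/javase/6/docs/api/"));
    }

    // Inner join on users.id == pages.userid; each result line is "name | url | userid".
    static List<String> join(Map<Integer, String> userNamesById, List<Page> pages) {
        List<String> rows = new ArrayList<>();
        for (Page page : pages) {
            String name = userNamesById.get(page.userId);
            if (name != null) {
                rows.add(name + " | " + page.url + " | " + page.userId);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String row : join(sampleUsers(), samplePages())) {
            System.out.println(row);
        }
    }
}
```

Max Mustermann has no page and therefore does not appear in the result, which is the inner-join behaviour shown on the slide; keeping unmatched users is what the 'preserve' keyword is for.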

IBM InfoSphere BigInsights
- IBM's offering for managing Big Data
- Powered by Hadoop and other components
- Provides a fully tested environment

Recap
- Introduction to Apache Hadoop
- HDFS and Map-Reduce Programming Framework
  - Name Node, Data Node
  - Job Tracker, Task Tracker
  - Map and Reduce method signatures
- Word-Count Example
  - Flow in Map-Reduce
  - Java implementation
- More Map-Reduce Examples
  - Aggregation, Equi-Join and Inequality Join
- Introduction to JAQL and IBM BigInsights

Advanced Concepts In Hadoop
- Map-Reduce Programming Framework
  - Combiner, Counter, Partitioner, Distributed-Cache
- Hadoop I/O
  - Input-Formats and Output-Formats
    - Input and Output-Formats provided by Hadoop
    - Writing custom Input and Output Formats
    - Passing custom objects as key-values
- Chaining Map-Reduce Jobs
- Hadoop Tuning and Optimization
  - Configuration parameters
- Hadoop Eco-System
  - Hive/Pig/JAQL
  - HBase
  - Avro, ZooKeeper, Mahout, Sqoop, Ganglia etc.
- An Overview of Hadoop Research
  - Join Processing: multi-way equi and theta joins, set-similarity joins, k-NN joins, interval and spatial joins
  - Graph Processing, Text Processing etc.
  - Systems: ReStore, PerfXPlain, Stubby, RAMP, HadoopDB etc.
