INTRODUCTION TO HADOOP & MAP-REDUCE
IBM Research ® © 2007 IBM Corporation

IBM Research | India Research Lab

What is Hadoop?
An open-source, batch-oriented (offline), data- and I/O-intensive, general-purpose framework for creating distributed applications that process huge amounts of data.

HUGE
- A few thousand machines
- Petabytes of data
- Thousands of jobs processed each week

What is not Hadoop?
- A relational database
- An OLTP system
- A structured data store of any kind

Hadoop vs Relational
- General purpose vs relational data
- User control vs system defined
- No schema vs schema
- Key-value pairs vs tables
- Offline/batch vs online/real-time

Hadoop Eco-System
- HDFS: Hadoop Distributed File System
- Map-Reduce system: a distributed framework for executing work in parallel
- Hive/Pig/Jaql: SQL-like languages to manipulate relational data on HDFS
- HBase: column store on Hadoop
- Misc: Avro, Ganglia, Sqoop, ZooKeeper, Mahout

HDFS
- Hadoop Distributed File System
  - Stores files in blocks across many nodes in a cluster (default block size: 64 MB)
  - Replicates the blocks across nodes for durability
- Master/slave architecture
  - HDFS master: NameNode
    - Runs on a single node as a master process
    - Directs client access to files in HDFS
  - HDFS slave: DataNode
    - Runs on all nodes in the cluster
    - Handles block creation, replication, and deletion
    - Takes orders from the NameNode
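The block layout described above comes down to simple arithmetic. The following is a minimal sketch, not HDFS code: the class and method names are hypothetical, the replication factor of 3 is an assumption, and the round-robin placement is only an illustration (real HDFS placement is rack-aware). Only the 64 MB block size comes from the slide.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the Hadoop API.
final class BlockLayoutSketch {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB default from the slide

    // Number of blocks needed for a file: ceil(fileSizeBytes / BLOCK_SIZE).
    static long blockCount(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Assigns each block to `replication` distinct nodes, round-robin.
    // Assumes replication <= nodes; node ids start at 1.
    static List<List<Integer>> place(long blocks, int nodes, int replication) {
        List<List<Integer>> layout = new ArrayList<>();
        for (long b = 0; b < blocks; b++) {
            List<Integer> replicas = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                replicas.add((int) ((b + r) % nodes) + 1);
            }
            layout.add(replicas);
        }
        return layout;
    }

    public static void main(String[] args) {
        long blocks = blockCount(200L * 1024 * 1024); // a 200 MB file
        System.out.println(blocks + " blocks placed on " + place(blocks, 6, 3));
    }
}
```

A 200 MB file needs four blocks under these assumptions; the last block is only partly filled.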

HDFS
[Figure: a NameNode and six Data Nodes, labelled 1 to 6]

HDFS: Put File
[Figure: File1.txt is put into HDFS through the NameNode; the Data Nodes 1 to 6 are shown with the labels "1, 4, 5", "2, 5, 6", and "2, 3, 4"]

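HDFS: Read File
[Figure: a file is read from the Data Nodes 1 to 6, shown with the labels "1, 4", "2, 6", and "2, 3"]
Aggregate read rate = per-machine transfer rate x number of machines, so read time falls as more machines serve blocks in parallel.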

HDFS
- Fault-tolerant
  - Handles node failures
- Self-healing
  - Rebalances files across the cluster
  - When a node fails, its blocks are automatically re-copied from the remaining two nodes that hold replicas
- Scalable
  - Capacity grows just by adding new nodes

Map-Reduce
- Logical functions: mappers and reducers
- Developers write map and reduce functions, then submit a jar to the Hadoop cluster
- Hadoop handles distributing the map and reduce tasks across the cluster
- Typically batch-oriented

Map-Reduce Job-Flow

Word-Count

Input:
  Hadoop Uses Map-Reduce
  There is a Map-Phase
  There is a Reduce phase

Map output (tokens lowercased):
  (hadoop, 1) (uses, 1) (map, 1) (reduce, 1)
  (there, 1) (is, 1) (a, 1) (map, 1) (phase, 1)
  (there, 1) (is, 1) (a, 1) (reduce, 1) (phase, 1)

Sort/Shuffle, partitioned by key range:
  A-I: (a, [1,1]) (hadoop, [1]) (is, [1,1])
  J-Q: (map, [1,1]) (phase, [1,1])
  R-Z: (reduce, [1,1]) (there, [1,1]) (uses, [1])

Reduce output:
  (a, 2) (hadoop, 1) (is, 2) (map, 2) (phase, 2) (reduce, 2) (there, 2) (uses, 1)
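The three stages of the word-count flow can be simulated in plain Java without a cluster. This is a sketch under stated assumptions: no Hadoop classes are used, the class and method names are hypothetical, and tokens are lowercased and split on non-letters so that the result matches the reducer output shown on the slide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of map, shuffle/sort, and reduce; illustration only.
final class WordCountFlowSketch {

    // Map: emit (word, 1) for every token of the line (assumption: lowercased).
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty()) {
                out.add(Map.entry(token, 1));
            }
        }
        return out;
    }

    // Shuffle/sort: group the values by key, with keys in sorted order.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce: sum the list of ones collected for each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        grouped.forEach((word, ones) ->
                counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    // Runs the whole flow over a list of input lines.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        return reduce(shuffle(pairs));
    }

    public static void main(String[] args) {
        System.out.println(run(List.of(
                "Hadoop Uses Map-Reduce",
                "There is a Map-Phase",
                "There is a Reduce phase")));
    }
}
```

In a real job the shuffle is performed by the framework between the map and reduce tasks; here it is an ordinary in-memory grouping.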

Map-Reduce Daemons
- Job-Tracker (master)
  - Manages map-reduce jobs
  - Partitions tasks across different nodes
  - Manages task failures and restarts tasks on different nodes
  - Speculative execution
- Task-Tracker (slave)
  - Creates individual map and reduce tasks
  - Reports task status to the job-tracker

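Word-Count Map

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Type parameters: input key, input value, output key, output value.
    public class WordCountMap extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        public void map(LongWritable key, Text line, Context context)
                throws IOException, InterruptedException {
            // Split the line on whitespace and emit (token, 1) for each token.
            String[] tokens = line.toString().split("\\s+");
            for (int i = 0; i < tokens.length; i++) {
                context.write(new Text(tokens[i]), new IntWritable(1));
            }
        }
    }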

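Word-Count Reduce

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Type parameters: input key, input value, output key, output value.
    public class WordCountReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum the ones emitted by the mappers for this word.
            int count = 0;
            for (IntWritable value : values) {
                count += value.get();
            }
            context.write(key, new IntWritable(count));
        }
    }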

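Word-Count Runner Class

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountRunner {
        public static void main(String[] args) throws Exception {
            // Assumption: the input and output paths are taken from the command line.
            Path inputFilesPath = new Path(args[0]);
            Path outputPath = new Path(args[1]);

            // new Job() is deprecated in newer releases in favor of Job.getInstance().
            Job job = new Job();
            job.setJarByClass(WordCountRunner.class);
            job.setMapperClass(WordCountMap.class);
            job.setReducerClass(WordCountReduce.class);
            FileInputFormat.addInputPath(job, inputFilesPath);
            FileOutputFormat.setOutputPath(job, outputPath);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(IntWritable.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.setNumReduceTasks(1);
            job.waitForCompletion(true);
        }
    }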

Running a Job

    ./bin/hadoop jar WC.jar WordCountRunner WC

Arguments after the class name (here WC) are passed to the main method of WordCountRunner.

Cluster View of a MR Job Flow
[Figure: the job JAR is submitted to the master (NameNode, JobTracker) and distributed to three worker nodes, each running a Task Tracker and a Data Node. Map phase: map tasks (M) emit (k, v) pairs. Shuffle/sort: the pairs are grouped by key. Reduce phase: reduce tasks (R) produce the output. Job finished.]

Map-Reduce Example: Aggregation
- Compute the average of B for each distinct value of A

      A   B   C
  R1  1  10  12
  R2  2  20  34
  R3  1  10  22
  R4  1  30  56
  R5  3  40  17
  R6  2  10  49
  R7  1  20  44

Map output:
  MAP 1: (1, 10) (2, 20) (1, 10) (1, 30)
  MAP 2: (3, 40) (2, 10) (1, 20)

Reducer input (after shuffle):
  Reducer 1: (1, [10, 10, 30, 20])
  Reducer 2: (2, [20, 10]) (3, [40])

Reducer output:
  (1, 17.5) (2, 15) (3, 40)
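The aggregation example can be checked with a small plain-Java sketch. It assumes each row is an int array {A, B, C} as in the slide's table; the class and method names are hypothetical, and the grouping map stands in for the shuffle that Hadoop would perform.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: average of B per distinct A, in map/shuffle/reduce style.
final class AverageByKeySketch {

    // Sample relation R from the slide; each row is {A, B, C}.
    static final List<int[]> R = List.of(
            new int[]{1, 10, 12}, new int[]{2, 20, 34}, new int[]{1, 10, 22},
            new int[]{1, 30, 56}, new int[]{3, 40, 17}, new int[]{2, 10, 49},
            new int[]{1, 20, 44});

    static Map<Integer, Double> averageBByA(List<int[]> rows) {
        // Map + shuffle: emit (A, B) and group the B values by A.
        Map<Integer, List<Integer>> grouped = new TreeMap<>();
        for (int[] row : rows) {
            grouped.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        // Reduce: average the values of each group.
        Map<Integer, Double> result = new TreeMap<>();
        grouped.forEach((a, bs) ->
                result.put(a, bs.stream().mapToInt(Integer::intValue).average().orElse(0.0)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(averageBByA(R));
    }
}
```

One design point worth noting: an average cannot be pre-aggregated by simply averaging partial averages, so a combiner for this job would have to forward (sum, count) pairs instead.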

Map-Reduce Example: Join
- SELECT R.A, R.B, S.D FROM R, S WHERE R.A = S.A

      A   B   C            A   D   E
  R1  1  10  12        S1  1  20  22
  R2  2  20  34        S2  2  30  36
  R3  1  10  22        S3  2  10  29
  R4  1  30  56        S4  3  50  16
  R5  3  40  17        S5  3  40  37

Map output (each value is tagged with its source relation):
  MAP 1: (1, [R, 10]) (2, [R, 20]) (1, [R, 10]) (1, [R, 30]) (3, [R, 40])
  MAP 2: (1, [S, 20]) (2, [S, 30]) (2, [S, 10]) (3, [S, 50]) (3, [S, 40])

Reducer input (after shuffle):
  Reducer 1: (1, [(R, 10), (R, 10), (R, 30), (S, 20)])
  Reducer 2: (2, [(R, 20), (S, 30), (S, 10)])
             (3, [(R, 40), (S, 50), (S, 40)])

Reducer output (A, B, D):
  (1, 10, 20) (1, 10, 20) (1, 30, 20)
  (2, 20, 30) (2, 20, 10)
  (3, 40, 50) (3, 40, 40)
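The reduce-side equi-join above can be sketched in plain Java. This is an illustration under assumptions, not Hadoop code: rows are int arrays ({A, B, C} for R and {A, D, E} for S), the names are hypothetical, and two per-relation maps play the role of the R/S tags that the mappers attach to each value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: equi-join on A, in map/shuffle/reduce style.
final class ReduceSideJoinSketch {

    // Sample relations from the slide.
    static final List<int[]> R = List.of(
            new int[]{1, 10, 12}, new int[]{2, 20, 34}, new int[]{1, 10, 22},
            new int[]{1, 30, 56}, new int[]{3, 40, 17});
    static final List<int[]> S = List.of(
            new int[]{1, 20, 22}, new int[]{2, 30, 36}, new int[]{2, 10, 29},
            new int[]{3, 50, 16}, new int[]{3, 40, 37});

    // Returns {A, B, D} triples for all pairs of rows with R.A == S.A.
    static List<int[]> join(List<int[]> r, List<int[]> s) {
        // Map + shuffle: group R.B and S.D by the join key A, one map per source tag.
        Map<Integer, List<Integer>> rByKey = new TreeMap<>();
        Map<Integer, List<Integer>> sByKey = new TreeMap<>();
        for (int[] row : r) {
            rByKey.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        for (int[] row : s) {
            sByKey.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        // Reduce: for each key, output the cross product of its R and S values.
        List<int[]> out = new ArrayList<>();
        for (Map.Entry<Integer, List<Integer>> e : rByKey.entrySet()) {
            List<Integer> ds = sByKey.getOrDefault(e.getKey(), List.of());
            for (int b : e.getValue()) {
                for (int d : ds) {
                    out.add(new int[]{e.getKey(), b, d});
                }
            }
        }
        return out;
    }
}
```

Because every value for a key reaches the same reducer, a key with many matching rows on both sides produces a large cross product at a single reducer, which is the load-balancing concern raised later in the deck.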

Map-Reduce Example: Inequality Join
- SELECT R.A, R.B, S.D FROM R, S WHERE R.A <= S.A
- Consider a 3-node cluster (reducers r1, r2, r3)

      A   B   C            A   D   E
  R1  1  10  12        S1  1  20  22
  R2  2  20  34        S2  2  30  36
  R3  1  10  22        S3  2  10  29
  R4  1  30  56        S4  3  50  16
  R5  3  40  17        S5  3  40  37

Map output:
  MAP 1 (each R tuple with A = a is sent to reducers ra through r3):
    (r1, [R, 1, 10]) (r2, [R, 1, 10]) (r3, [R, 1, 10]) (r2, [R, 2, 20]) (r3, [R, 2, 20]) ….. (r3, [R, 3, 40])
  MAP 2 (each S tuple with A = a is sent to reducer ra only):
    (r1, [S, 1, 20]) (r2, [S, 2, 30]) (r2, [S, 2, 10]) (r3, [S, 3, 50]) (r3, [S, 3, 40])

Reducer input (after shuffle):
  Reducer 1: (r1, ([R, 1, 10], [R, 1, 10], [R, 1, 30], [S, 1, 20]))
  Reducer 2: ……
  Reducer 3: (r3, ([R, 1, 10], [R, 2, 20], [R, 1, 10], [R, 1, 30], [R, 3, 40], [S, 3, 50], [S, 3, 40]))

Reducer output (A, B, D):
  Reducer 1: (1, 10, 20) (1, 10, 20) (1, 30, 20)
  Reducer 2: ……
  Reducer 3: (1, 10, 50) (1, 10, 40) (2, 20, 50) (2, 20, 40) (1, 10, 50) (1, 10, 40) (1, 30, 50) (1, 30, 40) (3, 40, 50) (3, 40, 40)
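The routing rule behind the inequality join can be sketched as follows. This is an illustration only, assuming (as in the slide's data) that the join key A takes values 1 to the number of reducers, so reducer k receives every R row with A <= k and every S row with A == k. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: join on R.A <= S.A by replicating R rows to several reducers.
final class InequalityJoinSketch {

    // Sample relations from the slide: R rows are {A, B, C}, S rows are {A, D, E}.
    static final List<int[]> R = List.of(
            new int[]{1, 10, 12}, new int[]{2, 20, 34}, new int[]{1, 10, 22},
            new int[]{1, 30, 56}, new int[]{3, 40, 17});
    static final List<int[]> S = List.of(
            new int[]{1, 20, 22}, new int[]{2, 30, 36}, new int[]{2, 10, 29},
            new int[]{3, 50, 16}, new int[]{3, 40, 37});

    // Returns {R.A, R.B, S.D} triples; assumes 1 <= A <= reducers for every row.
    static List<int[]> join(List<int[]> r, List<int[]> s, int reducers) {
        List<List<int[]>> rAt = new ArrayList<>();
        List<List<int[]>> sAt = new ArrayList<>();
        for (int k = 0; k <= reducers; k++) { // index 0 is unused
            rAt.add(new ArrayList<>());
            sAt.add(new ArrayList<>());
        }
        // Map: an R row with A = a goes to reducers a..reducers.
        for (int[] row : r) {
            for (int k = row[0]; k <= reducers; k++) {
                rAt.get(k).add(row);
            }
        }
        // Map: an S row with A = a goes to reducer a only.
        for (int[] row : s) {
            sAt.get(row[0]).add(row);
        }
        // Reduce: each reducer outputs the cross product of its R and S rows.
        List<int[]> out = new ArrayList<>();
        for (int k = 1; k <= reducers; k++) {
            for (int[] rr : rAt.get(k)) {
                for (int[] ss : sAt.get(k)) {
                    out.add(new int[]{rr[0], rr[1], ss[1]});
                }
            }
        }
        return out;
    }
}
```

Each matching pair is produced exactly once, because an S row lives on a single reducer. The price is communication cost: R rows with small keys are copied to every reducer.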

Designing a Map-Reduce Algorithm
- Thinking in terms of map and reduce
  - What data should be the key?
  - What data should be the values?
- Minimizing cost
  - Reading cost
  - Communication cost
  - Processing cost at the reducer
- Load balancing
  - All reducers should receive a similar volume of traffic
  - Avoid having only a few machines busy while the others sit idle

SQL-Like Languages for Map-Reduce
- Hive, Pig, JAQL
- A user need not write native Java map-reduce code
- SQL-like statements can be written to process data on Hadoop
- Allows users without a sound understanding of map-reduce to work on data stored on HDFS

JAQL
- A simpler language for writing map-reduce jobs
  - Lowers the barrier to Hadoop use by eliminating the need for many users to write Java programs
  - Exploits massive parallelism using Hadoop
- Provides a simple yet powerful language to manipulate semi-structured data
  - Uses JSON as its data model
  - Most data has a natural JSON representation
  - Easily extended using Java, Python, JavaScript
  - Inspired by UNIX pipes
- Other languages: Hive, Pig
- Resources
  - http://code.google.com/p/jaql
  - http://jaql.org

JavaScript Object Notation (JSON)

- Flat records:
  $emp = [
    {name: "Jon Doe", income: 20000, mgr: false},
    {name: "Vince Wayne", income: 32500, mgr: false}
  ]
- Nested record (dob):
  $emp = [
    {name: "Jon Doe", income: 20000, mgr: false, dob: {day: 1, month: 1, year: 1975}},
    {name: "Vince Wayne", income: 32500, mgr: false, dob: {day: 1, month: 2, year: 1978}}
  ]
- Array-valued field (skills):
  $emp = [
    {name: "Jon Doe", income: 20000, mgr: false, skills: ["java", "C++", "Hadoop"]},
    {name: "Vince Wayne", income: 32500, mgr: false, skills: ["java", "DB2", "SQL"]}
  ]
- Array of records (exp):
  $emp = [
    {name: "Jon Doe", income: 20000, mgr: false,
     exp: [{org: 'IBM', from: 2000, to: 2005}, {org: 'yahoo', from: 2005, to: '2010'}]},
    {name: "Vince Wayne", income: 32500, mgr: false,
     exp: [{org: 'IBM', from: 2000, to: 2003}, {org: 'oracle', from: 2003, to: '2010'}]}
  ]

JSON has arrays, records, strings, numbers, booleans, and null.
[] == array, {} == record or object, x: == field name

Accessing Data

$emp = [
  {name: "Jon Doe", income: 20000, mgr: false,
   skills: ["java", "C++", "Hadoop"],
   exp: [{org: 'IBM', from: 2000, to: 2005}, {org: 'yahoo', from: 2005, to: '2010'}]},
  {name: "Vince Wayne", income: 32500, mgr: false,
   skills: ["java", "C++", "Hadoop"],
   exp: [{org: 'IBM', from: 2000, to: 2003}, {org: 'oracle', from: 2003, to: '2010'}]}
]

- $emp[0] = {name: "Jon Doe", income: 20000, mgr: false, skills: ["java", "C++", "Hadoop"], exp: [{org: 'IBM', from: 2000, to: 2005}, {org: 'yahoo', from: 2005, to: '2010'}]}
- $emp[0].name = "Jon Doe"
- $emp[0].exp[0] = {org: 'IBM', from: 2000, to: 2005}
- $emp[0].exp[0].org = 'IBM'
- $emp[0].skills[0] = "java"
- $emp[*].name = ["Jon Doe", "Vince Wayne"]
- $emp[0].exp[*].org = ['IBM', 'yahoo']
- $emp[*].exp[*].org = [['IBM', 'yahoo'], ['IBM', 'oracle']]

JAQL core functionalities
- Filter
- Transform
- Group
- Join
- Sort
- Expand

Filter
- $input -> filter <predicate>;
- In the <predicate>, the variable $ is bound to each item of the input
- The <predicate> can be composed of the relations ==, !=, >, >=, <, <=
- Complex expressions can be created with not, and, or, which are evaluated in this order
- If the <predicate> evaluates to true, the item from the input is included in the output

Filter Example

$employees = [
  {name: "Jon Doe", income: 20000, mgr: false},
  {name: "Vince Wayne", income: 32500, mgr: false},
  {name: "Jane Dean", income: 72000, mgr: true},
  {name: "Alex Smith", income: 25000, mgr: false}
];

$employees -> filter $.mgr or $.income > 30000;

Result:
[
  { "income": 32500, "mgr": false, "name": "Vince Wayne" },
  { "income": 72000, "mgr": true, "name": "Jane Dean" }
]
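For readers more familiar with Java than with JAQL, the same predicate can be written against a plain Java list. This is only an analogy, not how JAQL executes the query: the Employee class and the method name are hypothetical and simply mirror the fields of the items above.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustration only: a Java counterpart of
//   $employees -> filter $.mgr or $.income > 30000;
final class FilterSketch {

    // Hypothetical type mirroring the fields of the JSON items.
    static final class Employee {
        final String name;
        final int income;
        final boolean mgr;

        Employee(String name, int income, boolean mgr) {
            this.name = name;
            this.income = income;
            this.mgr = mgr;
        }
    }

    static final List<Employee> EMPLOYEES = List.of(
            new Employee("Jon Doe", 20000, false),
            new Employee("Vince Wayne", 32500, false),
            new Employee("Jane Dean", 72000, true),
            new Employee("Alex Smith", 25000, false));

    // Keeps the items for which the predicate evaluates to true.
    static List<Employee> managersOrHighEarners(List<Employee> employees) {
        return employees.stream()
                .filter(e -> e.mgr || e.income > 30000)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        for (Employee e : managersOrHighEarners(EMPLOYEES)) {
            System.out.println(e.name + " " + e.income + " " + e.mgr);
        }
    }
}
```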

Group By
- $input -> group by <variable> = <grouping expression> into <aggregate expression>;
- Similar to SQL group-by
- In the <aggregate expression>, $ is bound to the grouped items
- To get an array of all values for an item that are aggregated into one group, use $[*]

Group By Example

$employees = [
  {id: 1, dept: 1, band: 7, income: 12000},
  {id: 2, dept: 1, band: 8, income: 13000},
  {id: 3, dept: 2, band: 7, income: 15000},
  {id: 4, dept: 1, band: 8, income: 10000},
  {id: 5, dept: 3, band: 7, income: 8000},
  {id: 6, dept: 2, band: 8, income: 5000},
  {id: 7, dept: 1, band: 7, income: 24000}
];

- Total income per department:
  $employees -> group by $dept = $.dept into {$dept, total: sum($[*].income)};
  Result: [ {dept: 1, total: 59000}, {dept: 2, total: 20000}, {dept: 3, total: 8000} ]
- The same grouping with a differently named grouping variable:
  $employees -> group by $dept_group = $.dept into {$dept_group, total: sum($[*].income)};
- Grouping on two fields, copying the fields of the grouping record into the output:
  $employees -> group by $dept_group = {$.dept, $.band} into {$dept_group.*, total: sum($[*].income)};
- Grouping on two fields, nesting the grouping record in the output:
  $employees -> group by $dept_group = {$.dept, $.band} into {$dept_group, total: sum($[*].income)};
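The totals in the group-by example can be reproduced with a short plain-Java sketch. It is an analogy rather than JAQL semantics: rows are assumed to be int arrays {id, dept, band, income}, and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: sum of income per group, for one and for two grouping fields.
final class GroupBySketch {

    // Sample data from the slide; each row is {id, dept, band, income}.
    static final List<int[]> EMPLOYEES = List.of(
            new int[]{1, 1, 7, 12000}, new int[]{2, 1, 8, 13000},
            new int[]{3, 2, 7, 15000}, new int[]{4, 1, 8, 10000},
            new int[]{5, 3, 7, 8000}, new int[]{6, 2, 8, 5000},
            new int[]{7, 1, 7, 24000});

    // Counterpart of: group by $dept = $.dept into {$dept, total: sum($[*].income)}
    static Map<Integer, Integer> totalByDept(List<int[]> employees) {
        Map<Integer, Integer> totals = new TreeMap<>();
        for (int[] e : employees) {
            totals.merge(e[1], e[3], Integer::sum);
        }
        return totals;
    }

    // Counterpart of grouping on {$.dept, $.band}; the key is the pair [dept, band].
    static Map<List<Integer>, Integer> totalByDeptAndBand(List<int[]> employees) {
        Map<List<Integer>, Integer> totals = new LinkedHashMap<>();
        for (int[] e : employees) {
            totals.merge(List.of(e[1], e[2]), e[3], Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        System.out.println(totalByDept(EMPLOYEES));
        System.out.println(totalByDeptAndBand(EMPLOYEES));
    }
}
```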

Join
- join <input list> where <join condition> into <output expression>;
- <input list> contains two or more variables that should share at least one attribute
- <join condition>: only equality predicates are allowed
- <output expression> is applied to all items from the inputs that match the join condition. To copy all fields of an input, use $input.*
- Add the keyword 'preserve' to make it a full join

Join Example

$users = [
  {name: "Jon Doe", password: "asdf1234", id: 1},
  {name: "Jane Doe", password: "qwertyui", id: 2},
  {name: "Max Mustermann", password: "q1w2e3r4", id: 3}
];

$pages = [
  {userid: 1, url: "code.google.com/p/jaql/"},
  {userid: 2, url: "www.cnn.com"},
  {userid: 1, url: "java.sun.com/javase/6/docs/api/"}
];

join $users, $pages where $users.id == $pages.userid into {$users.name, $pages.*};

Result:
[
  { "name": "Jon Doe", "url": "code.google.com/p/jaql/", "userid": 1 },
  { "name": "Jon Doe", "url": "java.sun.com/javase/6/docs/api/", "userid": 1 },
  { "name": "Jane Doe", "url": "www.cnn.com", "userid": 2 }
]

IBM InfoSphere BigInsights
- IBM's offering for managing big data
- Powered by Hadoop and other components
- Provides a fully tested environment

Recap
- Introduction to Apache Hadoop
- HDFS and the Map-Reduce programming framework
  - Name Node, Data Node
  - Job Tracker, Task Tracker
- Map and reduce method signatures
- Word-count example
  - Flow in Map-Reduce
  - Java implementation
- More Map-Reduce examples
  - Aggregation, equi-join, and inequality join
- Introduction to JAQL and IBM BigInsights

Advanced Concepts in Hadoop
- Map-Reduce programming framework
  - Combiner, Counter, Partitioner, Distributed-Cache
- Hadoop I/O
  - Input-formats and output-formats
    - Input and output formats provided by Hadoop
    - Writing custom input and output formats
    - Passing custom objects as key-values
- Chaining Map-Reduce jobs
- Hadoop tuning and optimization
  - Configuration parameters
- Hadoop Eco-System
  - Hive/Pig/JAQL
  - HBase
  - Avro, ZooKeeper, Mahout, Sqoop, Ganglia etc.
- An overview of Hadoop research
  - Join processing: multi-way equi and theta joins, set-similarity joins, k-NN joins, interval and spatial joins
  - Graph processing, text processing etc.
  - Systems: ReStore, PerfXPlain, Stubby, RAMP, HadoopDB etc.

