IBM Research | India Research Lab

What is Hadoop?
An open-source, batch-oriented (offline), data- and I/O-intensive, general-purpose framework for creating distributed applications that process huge amounts of data.

HUGE means:
- A few thousand machines
- Petabytes of data
- Thousands of jobs processed each week

What is Hadoop not?
- A relational database
- An OLTP system
- A structured data store of any kind

Hadoop vs Relational
- General purpose vs relational data
- User control vs system defined
- No schema vs schema
- Key-value pairs vs tables
- Offline/batch vs online/real-time

Hadoop Eco-System
- HDFS: Hadoop Distributed File System
- Map-Reduce system: a distributed framework for executing work in parallel
- Hive/Pig/Jaql: SQL-like languages to manipulate relational data on HDFS
- HBase: column store on Hadoop
- Misc: Avro, Ganglia, Sqoop, ZooKeeper, Mahout

HDFS: Hadoop Distributed File System
- Stores files in blocks across many nodes in a cluster
  - Default block size: 64 MB
- Replicates the blocks across nodes for durability
- Master/slave architecture

HDFS Master: NameNode
- Runs on a single node as a master process
- Directs client access to files in HDFS

HDFS Slave: DataNode
- Runs on all nodes in the cluster
- Handles block creation, replication, and deletion
- Takes orders from the NameNode

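The block and replica ideas above can be sketched in plain Java. This is only a toy model: the class name `HdfsBlockSketch`, the round-robin placement, and the 6-node cluster are assumptions made for illustration, and real HDFS uses its own placement policy. It shows how many 64 MB blocks a file needs and how each block could land on three distinct nodes.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of block splitting and replica placement; not the real HDFS policy.
final class HdfsBlockSketch {

    // Number of blocks needed for a file: ceil(fileBytes / blockBytes).
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // Hypothetical round-robin placement: block b goes to nodes b, b+1, ... (mod nodes).
    // The chosen nodes are distinct as long as replication <= nodes.
    static List<Integer> replicaNodes(int blockIndex, int replication, int nodes) {
        List<Integer> chosen = new ArrayList<>();
        for (int r = 0; r < replication; r++) {
            chosen.add((blockIndex + r) % nodes);
        }
        return chosen;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blocks = blockCount(200 * mb, 64 * mb); // a 200 MB file needs 4 blocks
        System.out.println("blocks = " + blocks);
        for (int b = 0; b < blocks; b++) {
            System.out.println("block " + b + " -> nodes " + replicaNodes(b, 3, 6));
        }
    }
}
```

Losing any single node in this model still leaves two copies of every block, which is the property the replication bullet relies on.
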
HDFS: Put File
[Figure: File1.txt is put into HDFS through the NameNode; each block is stored on three Data Nodes (e.g. 2, 5, 6 and 2, 3, 4).]

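HDFS: Read File
[Figure: a file is read from HDFS through the NameNode and the Data Nodes; each block is listed with the Data Nodes that hold it (e.g. 2, 6 and 2, 3).]
Aggregate read rate = per-node transfer rate x number of machines read in parallel
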
HDFS
- Fault-tolerant
  - Handles node failures
- Self-healing
  - Rebalances files across the cluster
  - If a node fails, its data is automatically copied from the remaining two nodes
- Scalable
  - Just by adding new nodes

Map-Reduce
- Logical functions: mappers and reducers
- Developers write map and reduce functions, then submit a jar to the Hadoop cluster
- Hadoop handles distributing the map and reduce tasks across the cluster
- Typically batch-oriented

Map-Reduce Job-Flow

Word-Count

Input:
    Hadoop Uses Map-Reduce
    There is a Map-Phase
    There is a Reduce phase

Map output:
    (Hadoop, 1) (Uses, 1) (Map, 1) (Reduce, 1)
    (There, 1) (is, 1) (a, 1) (Map, 1) (Phase, 1)
    (There, 1) (is, 1) (a, 1) (Reduce, 1) (Phase, 1)

Sort/Shuffle (keys partitioned into the ranges A-I, J-Q, R-Z):
    A-I: (a, [1,1]) (hadoop, [1]) (is, [1,1])
    J-Q: (map, [1,1]) (phase, [1,1])
    R-Z: (reduce, [1,1]) (there, [1,1]) (uses, [1])

Reduce output:
    A-I: (a, 2) (hadoop, 1) (is, 2)
    J-Q: (map, 2) (phase, 2)
    R-Z: (reduce, 2) (there, 2) (uses, 1)

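The three stages of the word-count flow can be simulated in plain Java without a cluster. This is a minimal sketch, not Hadoop code: the class `WordCountFlowSketch` and its method names are made up for illustration, and lowercasing the words and splitting on non-letters are assumptions chosen so that the result matches the lowercase keys shown after the shuffle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the word-count flow; no Hadoop classes are used.
final class WordCountFlowSketch {

    // Map phase: emit (word, 1) for every token of one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty()) {
                out.add(Map.entry(token, 1));
            }
        }
        return out;
    }

    // Sort/shuffle: group the emitted values by key, with keys in sorted order.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the list of ones for each word.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        grouped.forEach((word, ones) ->
                counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    // Runs map, shuffle, and reduce in sequence over all input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        return reduce(shuffle(pairs));
    }

    public static void main(String[] args) {
        // prints {a=2, hadoop=1, is=2, map=2, phase=2, reduce=2, there=2, uses=1}
        System.out.println(wordCount(List.of(
                "Hadoop Uses Map-Reduce",
                "There is a Map-Phase",
                "There is a Reduce phase")));
    }
}
```
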
Map-Reduce Daemons

Job-Tracker (master)
- Manages map-reduce jobs
- Partitions tasks across different nodes
- Manages task failures and restarts tasks on different nodes
- Speculative execution

Task-Tracker (slave)
- Creates individual map and reduce tasks
- Reports task status to the job-tracker

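Word-Count Map

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Mapper<type of input key, type of input value, type of output key, type of output value>
    public class WordCountMap extends Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text line, Context context) throws IOException, InterruptedException {
            // Whitespace split stands in for the slide's Tokenize(line) helper.
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) context.write(new Text(token), new IntWritable(1)); // emit (word, 1)
            }
        }
    }
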
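Word-Count Reduce

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Reducer<type of input key, type of input value, type of output key, type of output value>
    public class WordCountReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int count = 0; // this loop stands in for the slide's count(values) helper
            for (IntWritable value : values) count += value.get();
            context.write(key, new IntWritable(count)); // emit (word, total)
        }
    }
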
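Word-Count Runner Class

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountRunner {
        public static void main(String[] args) throws Exception {
            // Assumption: the slide does not show where the paths come from; here they are read from the command line.
            Path inputFilesPath = new Path(args[0]);
            Path outputPath = new Path(args[1]);

            Job job = new Job(); // later Hadoop releases deprecate this in favor of Job.getInstance()
            job.setJarByClass(WordCountRunner.class);
            job.setMapperClass(WordCountMap.class);
            job.setReducerClass(WordCountReduce.class);

            FileInputFormat.addInputPath(job, inputFilesPath);
            FileOutputFormat.setOutputPath(job, outputPath);

            // Key/value types emitted by the mapper and by the reducer.
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(IntWritable.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            job.setNumReduceTasks(1);
            job.waitForCompletion(true); // submit the job and block until it finishes
        }
    }
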
Running a Job

    ./bin/hadoop jar WC.jar WordCountRunner WC

Cluster View of a MR Job Flow
[Figure: NameNode and JobTracker, plus three worker nodes that each run a Task Tracker and a Data Node. The job JAR is submitted; map tasks (M) run in the MAP PHASE and emit (k, v) pairs; SHUFFLE/SORT delivers the (k, v) pairs to the reduce tasks (R) of the REDUCE PHASE; JOB FINISHED.]

Map-Reduce Example: Aggregation
Compute Avg of B for each distinct value of A, over relation R(A, B, C).

Map output, as (A, B) pairs:
    MAP 1: (1, 10) (2, 20) (1, 10)
    MAP 2: (1, 30) (3, 40) (2, 10) (1, 20)

Reducer input after the shuffle (spread over Reducer 1 and Reducer 2):
    (1, [10, 10, 30, 20]) (2, [20, 10]) (3, [40])

Reducer output:
    (1, 17.5) (2, 15) (3, 40)

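The aggregation example can be reproduced in a few lines of plain Java. This is a sketch only: `AverageByKeySketch` is a hypothetical name, rows are modelled as `{A, B}` integer pairs, and the grouping map plays the role of the shuffle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of "average of B per distinct A"; no Hadoop classes are used.
final class AverageByKeySketch {

    // Each row is {A, B}. Returns A -> average of B.
    static Map<Integer, Double> averageByKey(int[][] rows) {
        // Map + shuffle: emit (A, B) and collect the B values of each A.
        Map<Integer, List<Integer>> grouped = new TreeMap<>();
        for (int[] row : rows) {
            grouped.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        // Reduce: average the collected values of each key.
        Map<Integer, Double> averages = new TreeMap<>();
        grouped.forEach((a, bs) ->
                averages.put(a, bs.stream().mapToInt(Integer::intValue).average().orElse(0.0)));
        return averages;
    }

    public static void main(String[] args) {
        int[][] r = {{1, 10}, {2, 20}, {1, 10}, {1, 30}, {3, 40}, {2, 10}, {1, 20}};
        System.out.println(averageByKey(r)); // prints {1=17.5, 2=15.0, 3=40.0}
    }
}
```
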
Map-Reduce Example: Join
Select R.A, R.B, S.D where R.A == S.A, for relations R(A, B, C) and S(A, D, E).

Map output, keyed on A and tagged with the source relation:
    MAP 1 (R): (1, [R, 10]) (2, [R, 20]) (1, [R, 10]) (1, [R, 30]) (3, [R, 40])
    MAP 2 (S): (1, [S, 20]) (2, [S, 30]) (2, [S, 10]) (3, [S, 50]) (3, [S, 40])

Reducer input after the shuffle (spread over Reducer 1 and Reducer 2):
    (1, [(R, 10), (R, 10), (R, 30), (S, 20)])
    (2, [(R, 20), (S, 30), (S, 10)])
    (3, [(R, 40), (S, 50), (S, 40)])

Reducer output, as (A, B, D):
    (1, 10, 20) (1, 10, 20) (1, 30, 20)
    (2, 20, 30) (2, 20, 10)
    (3, 40, 50) (3, 40, 40)

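The equi-join above follows the usual reduce-side join pattern: tag each value with its source relation, group by the join key, and pair R values with S values inside each group. A minimal plain-Java sketch, assuming rows are modelled as `{A, B}` and `{A, D}` integer pairs (the class `ReduceSideJoinSketch` and the helper `Tagged` are hypothetical names):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of a reduce-side equi-join; no Hadoop classes are used.
final class ReduceSideJoinSketch {

    // A value tagged with the relation it came from ("R" or "S").
    static final class Tagged {
        final String source;
        final int value;

        Tagged(String source, int value) {
            this.source = source;
            this.value = value;
        }
    }

    // r rows are {A, B}; s rows are {A, D}; result rows are {A, B, D}.
    static List<int[]> join(int[][] r, int[][] s) {
        // Map + shuffle: emit (A, tagged value) and group by A.
        Map<Integer, List<Tagged>> grouped = new TreeMap<>();
        for (int[] row : r) {
            grouped.computeIfAbsent(row[0], k -> new ArrayList<>()).add(new Tagged("R", row[1]));
        }
        for (int[] row : s) {
            grouped.computeIfAbsent(row[0], k -> new ArrayList<>()).add(new Tagged("S", row[1]));
        }
        // Reduce: within each key, pair every R value with every S value.
        List<int[]> out = new ArrayList<>();
        for (Map.Entry<Integer, List<Tagged>> group : grouped.entrySet()) {
            for (Tagged left : group.getValue()) {
                if (!left.source.equals("R")) {
                    continue;
                }
                for (Tagged right : group.getValue()) {
                    if (right.source.equals("S")) {
                        out.add(new int[] {group.getKey(), left.value, right.value});
                    }
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] r = {{1, 10}, {2, 20}, {1, 10}, {1, 30}, {3, 40}};
        int[][] s = {{1, 20}, {2, 30}, {2, 10}, {3, 50}, {3, 40}};
        for (int[] row : join(r, s)) {
            System.out.println("(" + row[0] + ", " + row[1] + ", " + row[2] + ")");
        }
    }
}
```

Duplicate R rows are kept, so the two (1, 10) tuples each produce their own output row.
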
Map-Reduce Example: Inequality Join
Select R.A, R.B, S.D where R.A <= S.A, for relations R(A, B, C) and S(A, D, E).
Consider a 3-node cluster with reducers r1, r2, r3.

Map output:
    MAP 1 (R; each tuple is sent to every reducer whose index is at least R.A):
        (r1, [R, 1, 10]) (r2, [R, 1, 10]) (r3, [R, 1, 10]) (r2, [R, 2, 20]) (r3, [R, 2, 20]) ….. (r3, [R, 3, 40])
    MAP 2 (S; each tuple is sent to the reducer whose index is S.A):
        (r1, [S, 1, 20]) (r2, [S, 2, 30]) (r2, [S, 2, 10]) (r3, [S, 3, 50]) (r3, [S, 3, 40])

Reducer input:
    Reducer 1: (r1, ([R, 1, 10], [R, 1, 10], [R, 1, 30], [S, 1, 20]))
    Reducer 2: ……
    Reducer 3: (r3, ([R, 1, 10], [R, 2, 20], [R, 1, 10], [R, 1, 30], [R, 3, 40], [S, 3, 50], [S, 3, 40]))

Reducer output, as (R.A, R.B, S.D):
    Reducer 1: (1, 10, 20) (1, 10, 20) (1, 30, 20)
    Reducer 2: ……
    Reducer 3: (1, 10, 50) (1, 10, 40) (2, 20, 50) (2, 20, 40) (1, 10, 50) (1, 10, 40) (1, 30, 50) (1, 30, 40) (3, 40, 50) (3, 40, 40)

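The routing used in the inequality join can be sketched in plain Java. The sketch assumes, as in the example, that the values of A are integers between 1 and the number of reducers, so that reducer k can be responsible for the S tuples with A = k. The class name `InequalityJoinSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java simulation of the R.A <= S.A join; no Hadoop classes are used.
final class InequalityJoinSketch {

    // r rows are {A, B}; s rows are {A, D}; result rows are {R.A, R.B, S.D}.
    // Assumption: 1 <= A <= reducers for every row.
    static List<int[]> join(int[][] r, int[][] s, int reducers) {
        List<List<int[]>> rAt = new ArrayList<>();
        List<List<int[]>> sAt = new ArrayList<>();
        for (int k = 0; k <= reducers; k++) { // index 0 is unused
            rAt.add(new ArrayList<>());
            sAt.add(new ArrayList<>());
        }
        // Map: an S tuple goes only to reducer S.A.
        for (int[] row : s) {
            sAt.get(row[0]).add(row);
        }
        // Map: an R tuple is replicated to reducers R.A, R.A + 1, ..., reducers.
        for (int[] row : r) {
            for (int k = row[0]; k <= reducers; k++) {
                rAt.get(k).add(row);
            }
        }
        // Reduce: every R tuple at reducer k satisfies R.A <= k = S.A, so pair them all.
        List<int[]> out = new ArrayList<>();
        for (int k = 1; k <= reducers; k++) {
            for (int[] left : rAt.get(k)) {
                for (int[] right : sAt.get(k)) {
                    out.add(new int[] {left[0], left[1], right[1]});
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] r = {{1, 10}, {2, 20}, {1, 10}, {1, 30}, {3, 40}};
        int[][] s = {{1, 20}, {2, 30}, {2, 10}, {3, 50}, {3, 40}};
        System.out.println(join(r, s, 3).size()); // prints 21
    }
}
```

Each matching pair is produced exactly once, at the reducer that owns the S tuple. The price is communication cost: an R tuple with a small A is copied to many reducers.
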
Designing a Map-Reduce Algorithm
- Thinking in terms of map and reduce
  - What data should be the key?
  - What data should be the values?
- Minimizing cost
  - Reading cost
  - Communication cost
  - Processing cost at the reducer
- Load balancing
  - All reducers should receive a similar volume of traffic
  - It should not happen that only a few machines are busy while the others sit idle

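Load balance depends on how keys are assigned to reducers. The sketch below uses the same arithmetic as Hadoop's default hash partitioner, (hashCode & Integer.MAX_VALUE) % number of reducers, but in plain Java; the class `PartitionLoadSketch` and the sample keys are made up for illustration. It counts how many records each reducer would receive, which makes key skew visible.

```java
import java.util.Arrays;
import java.util.List;

// Counts records per reducer under hash partitioning; no Hadoop classes are used.
final class PartitionLoadSketch {

    // Non-negative hash of the key, modulo the number of reducers.
    static int partition(Object key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Number of records that each reducer receives for the given keys.
    static int[] loadPerReducer(List<String> keys, int numReducers) {
        int[] load = new int[numReducers];
        for (String key : keys) {
            load[partition(key, numReducers)]++;
        }
        return load;
    }

    public static void main(String[] args) {
        // A skewed key distribution: "a" occurs four times.
        List<String> keys = List.of("a", "a", "a", "a", "b", "c");
        System.out.println(Arrays.toString(loadPerReducer(keys, 2))); // prints [1, 5]
    }
}
```

With a frequent key such as "a", one reducer ends up with most of the records no matter how many reducers are added, so the choice of key matters as much as the partitioning function.
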
SQL-Like Languages for Map-Reduce
- Hive, Pig, JAQL
- A user need not write native Java Map-Reduce code
- SQL-like statements can be written to process data on Hadoop
- Allows users without a sound understanding of map-reduce to work on data stored on HDFS

JAQL
- A simpler language for writing Map-Reduce jobs
  - Lowers the barrier to Hadoop use by eliminating the need for many users to write Java programs
  - Exploits massive parallelism using Hadoop
- Provides a simple yet powerful language to manipulate semi-structured data
  - Uses JSON as its data model
  - Most data has a natural JSON representation
- Easily extended using Java, Python, JavaScript
- Inspired by UNIX pipes
- Other languages: Hive, Pig

JavaScript Object Notation (JSON)

Flat records:
    $emp = [
      {name: "Jon Doe", income: 20000, mgr: false},
      {name: "Vince Wayne", income: 32500, mgr: false}
    ]

A record nested inside a record:
    $emp = [
      {name: "Jon Doe", income: 20000, mgr: false, dob: {day: 1, month: 1, year: 1975}},
      {name: "Vince Wayne", income: 32500, mgr: false, dob: {day: 1, month: 2, year: 1978}}
    ]

An array inside a record:
    $emp = [
      {name: "Jon Doe", income: 20000, mgr: false, skills: ["java", "C++", "Hadoop"]},
      {name: "Vince Wayne", income: 32500, mgr: false, skills: ["java", "DB2", "SQL"]}
    ]

An array of records inside a record:
    $emp = [
      {name: "Jon Doe", income: 20000, mgr: false,
       exp: [{org: 'IBM', from: 2000, to: 2005}, {org: 'yahoo', from: 2005, to: '2010'}]},
      {name: "Vince Wayne", income: 32500, mgr: false,
       exp: [{org: 'IBM', from: 2000, to: 2003}, {org: 'oracle', from: 2003, to: '2010'}]}
    ]

- JSON has arrays, records, strings, numbers, booleans, and null
- [] == array, {} == record or object, x: == field name

Accessing Data

    $emp = [
      {name: "Jon Doe", income: 20000, mgr: false, skills: ["java", "C++", "Hadoop"],
       exp: [{org: 'IBM', from: 2000, to: 2005}, {org: 'yahoo', from: 2005, to: '2010'}]},
      {name: "Vince Wayne", income: 32500, mgr: false, skills: ["java", "C++", "Hadoop"],
       exp: [{org: 'IBM', from: 2000, to: 2003}, {org: 'oracle', from: 2003, to: '2010'}]}
    ]

Expressions and their values:
    $emp[0]             = {name: "Jon Doe", income: 20000, mgr: false, skills: ["java", "C++", "Hadoop"],
                           exp: [{org: 'IBM', from: 2000, to: 2005}, {org: 'yahoo', from: 2005, to: '2010'}]}
    $emp[0].name        = "Jon Doe"
    $emp[0].exp[0]      = {org: 'IBM', from: 2000, to: 2005}
    $emp[0].exp[0].org  = 'IBM'
    $emp[0].skills[0]   = "java"
    $emp[*].name        = ['Jon Doe', 'Vince Wayne']
    $emp[0].exp[*].org  = ['IBM', 'yahoo']
    $emp[*].exp[*].org  = [['IBM', 'yahoo'], ['IBM', 'oracle']]

JAQL Core Functionalities
- Filter
- Transform
- Group
- Join
- Sort
- Expand

Filter

    $input -> filter <predicate>;

- In the predicate, the variable $ is bound to each item of the input
- The predicate can be composed of the relations ==, !=, >, >=, <, <=
- Complex expressions can be created with not, and, or, which are evaluated in this order
- If the predicate evaluates to true, the item from the input is included in the output

Filter Example

    $employees = [
      {name: "Jon Doe", income: 20000, mgr: false},
      {name: "Vince Wayne", income: 32500, mgr: false},
      {name: "Jane Dean", income: 72000, mgr: true},
      {name: "Alex Smith", income: 25000, mgr: false}
    ];

    $employees -> filter $.mgr or $.income > 30000;

Result:
    [
      { "income": 32500, "mgr": false, "name": "Vince Wayne" },
      { "income": 72000, "mgr": true, "name": "Jane Dean" }
    ]

Group By

    $input -> group by <variable> = <grouping expression> into <aggregation expression>;

- Similar to SQL GROUP BY
- In the into clause, $ is bound to the grouped items
- To get an array of all values of a field over the items aggregated into one group, use $[*], e.g. $[*].income

Group By Example

    $employees = [
      {id: 1, dept: 1, band: 7, income: 12000},
      {id: 2, dept: 1, band: 8, income: 13000},
      {id: 3, dept: 2, band: 7, income: 15000},
      {id: 4, dept: 1, band: 8, income: 10000},
      {id: 5, dept: 3, band: 7, income: 8000},
      {id: 6, dept: 2, band: 8, income: 5000},
      {id: 7, dept: 1, band: 7, income: 24000}
    ];

    $employees -> group by $dept = $.dept into {$dept, total: sum($[*].income)};

Result:
    [ {dept: 1, total: 59000}, {dept: 2, total: 20000}, {dept: 3, total: 8000} ]

Variations on the grouping variable and key:
    $employees -> group by $dept_group = $.dept into {$dept_group, total: sum($[*].income)};
    $employees -> group by $dept_group = {$.dept, $.band} into {$dept_group.*, total: sum($[*].income)};
    $employees -> group by $dept_group = {$.dept, $.band} into {$dept_group, total: sum($[*].income)};

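For readers who think in Java, the first group-by query (total income per department) corresponds roughly to the following plain-Java sketch. It is only an analogy to the JAQL semantics, not how JAQL is implemented; the class `GroupBySketch` is a hypothetical name and each employee is reduced to a `{dept, income}` pair.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java analogue of: group by $.dept into {dept, total: sum(income)}.
final class GroupBySketch {

    // Each row is {dept, income}. Returns dept -> total income.
    static Map<Integer, Integer> totalIncomeByDept(List<int[]> rows) {
        return rows.stream().collect(Collectors.groupingBy(
                row -> row[0],                           // grouping key: dept
                TreeMap::new,                            // keep departments sorted
                Collectors.summingInt(row -> row[1])));  // aggregate: sum of income
    }

    public static void main(String[] args) {
        List<int[]> employees = List.of(
                new int[] {1, 12000}, new int[] {1, 13000}, new int[] {2, 15000},
                new int[] {1, 10000}, new int[] {3, 8000}, new int[] {2, 5000},
                new int[] {1, 24000});
        System.out.println(totalIncomeByDept(employees)); // prints {1=59000, 2=20000, 3=8000}
    }
}
```
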
Join

    join <input list> where <join conditions> into <output expression>;

- The input list contains two or more variables that should share at least one attribute
- Join conditions: only equality predicates are allowed
- The output expression is applied to all items from the inputs that match the join condition
- To copy all fields of an input, use $input.*
- Add the keyword preserve to make it a full join

Join Example

    $users = [
      {name: "Jon Doe", password: "asdf1234", id: 1},
      {name: "Jane Doe", password: "qwertyui", id: 2},
      {name: "Max Mustermann", password: "q1w2e3r4", id: 3}
    ];

    $pages = [
      {userid: 1, url: ""},
      {userid: 2, url: ""},
      {userid: 1, url: ""}
    ];

    join $users, $pages where $users.id == $pages.userid into {$users.name, $pages.*};

Result:
    [
      { "name": "Jon Doe", "url": "", "userid": 1 },
      { "name": "Jon Doe", "url": "", "userid": 1 },
      { "name": "Jane Doe", "url": "", "userid": 2 }
    ]

IBM InfoSphere BigInsights
- IBM's offering for managing Big Data
- Powered by Hadoop and other components
- Provides a fully tested environment

Recap
- Introduction to Apache Hadoop
- HDFS and the Map-Reduce programming framework
  - Name Node, Data Node
  - Job Tracker, Task Tracker
- Map and Reduce method signatures
- Word-Count example
  - Flow in Map-Reduce
  - Java implementation
- More Map-Reduce examples
  - Aggregation, equi-join, and inequality join
- Introduction to JAQL and IBM BigInsights

Advanced Concepts in Hadoop
- Map-Reduce programming framework
  - Combiner, Counter, Partitioner, Distributed Cache
- Hadoop I/O
  - Input formats and output formats
    - Input and output formats provided by Hadoop
    - Writing custom input and output formats
  - Passing custom objects as key-values
- Chaining Map-Reduce jobs
- Hadoop tuning and optimization
  - Configuration parameters
- Hadoop Eco-System
  - Hive/Pig/JAQL
  - HBase
  - Avro, ZooKeeper, Mahout, Sqoop, Ganglia, etc.
- An overview of Hadoop research
  - Join processing: multi-way equi and theta joins, set-similarity joins, k-NN joins, interval and spatial joins
  - Graph processing, text processing, etc.
  - Systems: ReStore, PerfXPlain, Stubby, RAMP, HadoopDB, etc.
