Big Data Technology: Introduction to Hadoop Antonino Virgillito
Hadoop Open-source platform for distributed processing of large data. Functions: distribution of data and processing across machines; management of the cluster; simplified programming model that makes it easy to write distributed algorithms. What makes Hadoop unique is its simplified programming model, which allows the user to quickly write and test distributed programs, and its efficient, automatic distribution of data and work across machines, which in turn exploits the underlying parallelism of the CPU cores.
Hadoop scalability Hadoop can reach massive scalability by exploiting a simple distribution architecture and coordination model. Huge clusters can be built from (cheap) commodity hardware: a 1000-CPU machine would be much more expensive than 1000 single-CPU or 250 quad-core machines. A cluster can easily scale up with little or no modification to the programs. One of the major benefits of using Hadoop in contrast to other distributed systems is its flat scalability curve. Executing Hadoop on a limited amount of data on a small number of nodes may not demonstrate particularly stellar performance, as the overhead involved in starting Hadoop programs is relatively high. Other parallel/distributed programming paradigms, such as MPI (Message Passing Interface), may perform much better on two, four, or perhaps a dozen machines. Though the effort of coordinating work among a small number of machines may be better performed by such systems, the price they pay in performance and engineering effort (when adding more hardware as a result of increasing data volumes) increases non-linearly.
Hadoop Components HDFS (Hadoop Distributed File System): abstraction of a file system over a cluster; stores large amounts of data by transparently spreading them over different machines. MapReduce: simple programming model that enables parallel execution of data processing programs; executes the work on the data, near the data. In a nutshell: HDFS places the data on the cluster and MapReduce does the processing work.
Hadoop Principle Hadoop is basically a middleware platform that manages a cluster of machines; to the user, the cluster appears as one big data set. The core component is a distributed file system (HDFS): files in HDFS are split into blocks that are scattered over the cluster. The cluster can grow indefinitely simply by adding new nodes.
The MapReduce Paradigm Parallel processing paradigm in which the programmer is unaware of parallelism. Programs are structured into a two-phase execution: Map and Reduce. In the Map phase, data elements are classified into categories; in the Reduce phase, an algorithm is applied to all the elements of the same category.
MapReduce and Hadoop In the Hadoop stack, MapReduce is logically placed on top of HDFS.
MapReduce and Hadoop MR works on (big) files loaded on HDFS. Each node in the cluster executes the MR program in parallel, applying the map and reduce phases to the blocks it stores. Output is written back on HDFS. Scalability principle: perform the computation where the data is.
Hadoop pros & cons Good for: repetitive tasks on big-size data. Not good for: replacing an RDBMS; complex processing requiring various phases and/or iterations; processing small to medium-size data.
HDFS
HDFS Design Principles Targeted at storing large files; performance with “small” files can be poor due to the overhead of distribution. Reliable and scalable: fast failover and extension of the cluster. Reliable but NOT highly available (a SPOF is present). Optimized for long sequential reads rather than random read/write access. Block-structured: files are split into blocks that are treated independently with respect to distribution. The block size is configurable but is typically “big” (default 64 MB) to optimize the handling of big files.
HDFS Interface HDFS acts as a separate file system with respect to the operating system. A shell is available implementing common operating system commands (ls, cat, etc.), as well as commands for moving files to/from the local file system. A web interface allows browsing the file system and shows the state of the cluster. The same operations are also available programmatically, as sketched below.
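To make the interface concrete, the following is a minimal sketch of the equivalent operations through the HDFS Java API (the org.apache.hadoop.fs.FileSystem class); the paths and file names are invented for the example, and the cluster configuration is assumed to be picked up from the standard configuration files.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();          // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);

    // copy a local file into HDFS (what the shell does with "put")
    fs.copyFromLocalFile(new Path("/tmp/input.csv"), new Path("/user/demo/input.csv"));

    // list a directory (what the shell does with "ls")
    for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
      System.out.println(status.getPath() + "\t" + status.getLen());
    }

    // copy a file back to the local file system (what the shell does with "get")
    fs.copyToLocalFile(new Path("/user/demo/input.csv"), new Path("/tmp/copy.csv"));

    fs.close();
  }
}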
HDFS Architecture Two kinds of nodes: NameNode and DataNode. NameNode: maps blocks to DataNodes; maintains file system metadata (file names, permissions and block locations); coordinates block creation, deletion and replication; is a single node in the cluster (Single Point of Failure); is contacted by clients for triggering file operations; maintains the state of the DataNodes. DataNode: stores blocks; each block is replicated on several DataNodes (the number of replicas is specified on a per-file basis); all the nodes in the cluster (possibly except one) are DataNodes; DataNodes are contacted by clients for data transfer operations and send heartbeats to the NameNode.
HDFS Operations: Read
HDFS Operations: Write
Block Replica Consistency Simple consistency model: write once, read many. Concurrent operations on metadata are serialized at the NameNode. The NameNode records a transaction log that is used to reconstruct the state of the file system at startup (checkpointing).
MapReduce
MapReduce Programming model for parallel execution Implementations available in several programming languages/platforms Hadoop implementation: Java Clean abstraction for programmers
Programming Model A MapReduce program transforms an input list into an output list. Processing is organized into two steps:
map (in_key, in_value) -> (out_key, intermediate_value) list
reduce (out_key, intermediate_value list) -> out_value list
map The data source must be structured in records (lines out of files, rows of a database, etc.). Each record has an associated key. Records are fed into the map function as key-value pairs, e.g., (filename, line). map() produces one or more intermediate values along with an output key derived from the input. In other words, map identifies input values with the same characteristics, which are represented by the output key (not necessarily related to the input key).
reduce After the map phase is over, all the intermediate values for a given output key are combined together into a list. reduce() aggregates the intermediate values into one or more final values for that same key (in practice, usually only one final value per key).
Parallelism Different instances of the map() function run in parallel, creating different intermediate values from different input data sets. Elements of a list being computed by map cannot see the effects of the computations on other elements: data cannot be shared among map instances. Since the order in which map is applied to the input records does not matter, execution can be reordered or parallelized; all values are processed independently. Instances of the reduce() function also run in parallel, each working on a different output key. Each instance of reduce processes all the intermediate records for the same intermediate key. A single-machine sketch of the whole model is shown below.
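Before looking at the Hadoop implementation, the following is a minimal single-machine sketch of the model in plain Java, using word count as the transformation: map emits (word, 1) pairs, the framework's job of grouping intermediate values by output key is played here by a HashMap, and reduce sums the values of each group. Everything in it (class and method names, the sample input) is invented for illustration; the same example appears later in pseudocode and with the real Hadoop API.

import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MiniMapReduce {

  // map: (in_key, in_value) -> list of (out_key, intermediate_value)
  static List<Map.Entry<String, Integer>> map(Long lineNo, String line) {
    List<Map.Entry<String, Integer>> out = new ArrayList<>();
    for (String word : line.split("\\s+")) {
      out.add(new AbstractMap.SimpleEntry<>(word, 1));
    }
    return out;
  }

  // reduce: (out_key, list of intermediate_value) -> out_value
  static int reduce(String word, List<Integer> counts) {
    int sum = 0;
    for (int c : counts) sum += c;
    return sum;
  }

  public static void main(String[] args) {
    List<String> lines = Arrays.asList("the cat", "the dog");

    // "map phase": apply map to every record and group the intermediate
    // values by output key (this grouping is what Hadoop does for us)
    Map<String, List<Integer>> grouped = new HashMap<>();
    long lineNo = 0;
    for (String line : lines) {
      for (Map.Entry<String, Integer> kv : map(lineNo++, line)) {
        grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
      }
    }

    // "reduce phase": one reduce call per distinct output key
    for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
      System.out.println(e.getKey() + "\t" + reduce(e.getKey(), e.getValue()));
    }
  }
}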
MapReduce Applications Data aggregation Log analysis Statistics Machine learning …
MapReduce Applications Amazon: we build Amazon's product search indices Facebook: we use Hadoop to store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning Journey Dynamics: Using Hadoop MapReduce to analyse billions of lines of GPS data to create TrafficSpeeds, our accurate traffic speed forecast product. LinkedIn: We use Hadoop for discovering People You May Know and other fun facts. The New York Times: Large scale image conversions … http://wiki.apache.org/hadoop/PoweredBy Clusters from 4 to 4500 nodes
Example: Count word occurrences
map(String input_key, String input_value):
  // input_key: line number – not used
  // input_value: line content
  for each word in input_value:
    emit(word, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  emit(AsString(result));
Example: word count
WordCount in Java - 1
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

  // Mapper: tokenize each input line and emit (word, 1) pairs
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  // Reducer: sum the counts of each word
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
  ...
WordCount in Java - 2
  ...
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}
Hadoop Ecosystem The term “ecosystem” with regard to Hadoop might relate to: Apache projects; non-Apache projects; companies providing custom Hadoop distributions; companies providing user-friendly Hadoop interfaces; Hadoop as a service. We only consider the Apache projects as listed in the Hadoop home page as of March 2013.
Apache Projects Data storage (NoSQL DBs): HBase, Cassandra, Hive. Data analysis: Pig, Mahout, Chukwa. Coordination and management: Ambari, ZooKeeper. Utility: Flume, Sqoop.
Data storage HBase: scalable, distributed database that supports structured data storage for large tables, based on the BigTable model. Cassandra: scalable, fault-tolerant database with no single points of failure, based on the BigTable model. Hive: data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. http://cloudstory.in/2012/04/introduction-to-big-data-hadoop-ecosystem-part-3/
Tools for Data Analysis with Hadoop Pig, Hive and statistical software all sit on top of the Hadoop stack (MapReduce over HDFS). Hive is treated only in the appendix.
Apache Pig Tool for querying data on Hadoop clusters, widely used in the Hadoop world: Yahoo! estimates that 50% of the Hadoop workload on its 100,000-CPU clusters is generated by Pig scripts. Allows writing data manipulation scripts in a high-level language called Pig Latin. Interpreted language: scripts are translated into MapReduce jobs. Mainly targeted at joins and aggregations.
Pig Example A real example of a Pig script used at Twitter, shown next to its Java equivalent. http://www.slideshare.net/kevinweil/hadoop-pig-and-twitter-nosql-east-2009
Pig Commands Loading datasets from HDFS:
users = load 'Users.csv' using PigStorage(',') as (username: chararray, age: int);
pages = load 'Pages.csv' using PigStorage(',') as (username: chararray, url: chararray);
Pig Commands Filtering data users_1825 = filter users by age>=18 and age<=25;
Pig Commands Join datasets joined = join users_1825 by username, pages by username;
Pig Commands Group records: grouped = group joined by url; This creates a new dataset with elements named group and joined; there will be one record for each distinct url: dump grouped;
(www.twitter.com, {(alice, 15), (bob, 18)})
(www.facebook.com, {(carol, 24), (alice, 14), (bob, 18)})
Pig Commands Apply function to records in a dataset summed = foreach grouped generate group as url, COUNT(joined) AS views;
Pig Commands Sort a dataset: sorted = order summed by views desc; Filter the first n rows: top_5 = limit sorted 5;
Pig Commands Write a dataset to HDFS: store top_5 into 'top5_sites.csv';
Word Count in Pig
A = load '/tmp/bible+shakes.nopunc';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = filter B by word matches '\\w+';
D = group C by word;
E = foreach D generate COUNT(C) as count, group as word;
F = order E by count desc;
store F into '/tmp/wc';
Another Pig Example: Correlation What is the correlation between users that have phones and users that tweet?
Pig: User Defined Functions There are times when Pig’s built-in operators and functions will not suffice. Pig provides the ability to implement your own: Filter, e.g., res = FILTER bag BY udfFilter(post); Load function, e.g., res = load 'file.txt' using udfLoad(); Eval, e.g., res = FOREACH bag GENERATE udfEval($1). UDFs can be written in several programming languages: Java, Python, JavaScript.
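As an illustration, the following is a minimal sketch of how a filter UDF such as the udfFilter used above could be written in Java; the class name, the single chararray argument and the 140-character threshold are assumptions made for the example.

import java.io.IOException;

import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

public class UdfFilter extends FilterFunc {
  @Override
  public Boolean exec(Tuple input) throws IOException {
    // drop records with a missing field
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return false;
    }
    String post = (String) input.get(0);
    // keep only "long" posts; the threshold is purely illustrative
    return post.length() > 140;
  }
}

After packaging the class into a jar, a script would make it visible with REGISTER myudfs.jar; and then call it exactly as in the FILTER example above.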
Hive Hive is a data warehouse system for Hadoop that facilitates ad-hoc queries and the analysis of large datasets stored in Hadoop. Hive provides a SQL-like language called HiveQL. Due to its SQL-like interface, Hive is increasingly becoming the technology of choice for using Hadoop.
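As an illustration of HiveQL, here is a minimal sketch of running a query from Java through the HiveServer2 JDBC driver; the host, port, credentials and the pages table are placeholders invented for the example.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    Connection con =
        DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "");
    Statement stmt = con.createStatement();

    // HiveQL looks like SQL; behind the scenes the query is executed as MapReduce jobs
    ResultSet rs = stmt.executeQuery(
        "SELECT url, COUNT(*) AS views FROM pages GROUP BY url ORDER BY views DESC LIMIT 5");
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
    }

    rs.close();
    stmt.close();
    con.close();
  }
}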
Using Hadoop from Statistical Software R: the rhdfs and rmr packages issue HDFS commands and write MapReduce jobs. SAS: SAS In-Memory Statistics; SAS/ACCESS makes data stored in Hadoop appear as native SAS datasets (uses the Hive interface). SPSS: transparent integration with Hadoop data.
RHadoop Set of packages that allows the integration of R with HDFS and MapReduce: Hadoop provides the storage while R brings the analysis. It is just a library: not a special run-time, not a different language, not a special-purpose language; you can incrementally port your code and use all R packages. Requires R to be installed and configured on all nodes of the cluster.
WordCount in R
wordcount = function(input, output = NULL, pattern = " ") {

  wc.map = function(., lines) {
    keyval(
      unlist(strsplit(x = lines, split = pattern)),
      1)}

  wc.reduce = function(word, counts) {
    keyval(word, sum(counts))}

  mapreduce(
    input = input,
    output = output,
    input.format = "text",
    map = wc.map,
    reduce = wc.reduce,
    combine = T)}
Case Study 1: Air Traffic Data Input data set: Ticket Id | Booking No | Origin | Destination | Flight No. | Miles (one record per O-D couple). Compute the following dataset: Origin | Final Destination | Number of Passengers, where the Final Destination is obtained by chaining origins and destinations with the same booking number. A possible MapReduce key design is sketched below.
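A minimal sketch of one possible key design for this case study, written against the same old mapred API as the WordCount example; the class names, the pipe-separated input format and the assumption of one passenger per booking are choices made for the illustration, not part of the assignment. The idea: the mapper keys every leg by its booking number, the reducer chains the legs of one booking (the trip origin never appears as a destination, the final destination never appears as an origin) and emits a 1 for the resulting (origin, final destination) pair; a second, word-count-style job then sums these 1s.

import java.io.IOException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class ChainLegs {

  // map: key each leg by its booking number so that all legs of one trip
  // reach the same reduce call
  public static class LegMap extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      // TicketId | BookingNo | Origin | Destination | FlightNo | Miles
      String[] f = value.toString().split("\\|");
      output.collect(new Text(f[1].trim()),
                     new Text(f[2].trim() + "\t" + f[3].trim()));
    }
  }

  // reduce: chain the legs of one booking and emit one "passenger" for the
  // (trip origin, final destination) pair
  public static class ChainReduce extends MapReduceBase
      implements Reducer<Text, Text, Text, IntWritable> {
    public void reduce(Text bookingNo, Iterator<Text> legs,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      Set<String> origins = new HashSet<String>();
      Set<String> destinations = new HashSet<String>();
      while (legs.hasNext()) {
        String[] od = legs.next().toString().split("\t");
        origins.add(od[0]);
        destinations.add(od[1]);
      }
      String start = null, end = null;
      for (String o : origins) if (!destinations.contains(o)) start = o;
      for (String d : destinations) if (!origins.contains(d)) end = d;
      if (start != null && end != null) {
        // a second, word-count-style job sums these 1s per (origin, final destination)
        output.collect(new Text(start + "\t" + end), new IntWritable(1));
      }
    }
  }
}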
Case Study 2: Maritime Traffic Data Input data sets: Ship ID | Longitude | Latitude | Timestamp (one record per position tracking) and Ship ID | Origin | Destination | Number of Passengers (one record per ship). Design the processing architecture and compute the following dataset: Ship ID | Period (night/day) | Total stop time. A possible sketch of the stop-time computation follows.
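Again purely as an illustrative sketch with the same API, one way to compute the stop time: the mapper keys every position report by (ship, period) and the reducer sorts the reports of one group by time and accumulates the intervals in which the ship has barely moved. The epoch-seconds timestamps, the 07:00–19:00 day/night cut-off, the "not moving" threshold and the class names are all assumptions; only the position-tracking dataset is needed for this particular output.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class StopTime {

  // map: key each position report by (ship, period); the period is derived
  // from the hour of an epoch-seconds timestamp (assumption)
  public static class PositionMap extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      String[] f = value.toString().split("\\|");     // ShipID | Longitude | Latitude | Timestamp
      long ts = Long.parseLong(f[3].trim());
      int hour = (int) ((ts / 3600) % 24);
      String period = (hour >= 7 && hour < 19) ? "day" : "night";   // illustrative cut-off
      output.collect(new Text(f[0].trim() + "\t" + period),
                     new Text(ts + "," + f[1].trim() + "," + f[2].trim()));
    }
  }

  // reduce: sort the positions of one (ship, period) group by time and add up
  // the intervals in which the ship has (almost) not moved
  public static class StopReduce extends MapReduceBase
      implements Reducer<Text, Text, Text, LongWritable> {
    public void reduce(Text shipAndPeriod, Iterator<Text> positions,
                       OutputCollector<Text, LongWritable> output, Reporter reporter)
        throws IOException {
      List<double[]> pts = new ArrayList<double[]>();   // each entry: {timestamp, lon, lat}
      while (positions.hasNext()) {
        String[] p = positions.next().toString().split(",");
        pts.add(new double[] { Double.parseDouble(p[0]),
                               Double.parseDouble(p[1]),
                               Double.parseDouble(p[2]) });
      }
      Collections.sort(pts, new Comparator<double[]>() {
        public int compare(double[] a, double[] b) { return Double.compare(a[0], b[0]); }
      });
      long stopSeconds = 0;
      for (int i = 1; i < pts.size(); i++) {
        double dLon = pts.get(i)[1] - pts.get(i - 1)[1];
        double dLat = pts.get(i)[2] - pts.get(i - 1)[2];
        if (Math.sqrt(dLon * dLon + dLat * dLat) < 0.001) {   // "not moving" threshold (illustrative)
          stopSeconds += (long) (pts.get(i)[0] - pts.get(i - 1)[0]);
        }
      }
      output.collect(shipAndPeriod, new LongWritable(stopSeconds));
    }
  }
}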