Big Data Technology: Introduction to Hadoop


1 Big Data Technology: Introduction to Hadoop
Antonino Virgillito

2 Hadoop
Open source platform for distributed processing of large data
Functions:
Distribution of data and processing across machines
Management of the cluster
Simplified programming model: easy to write distributed algorithms
What makes Hadoop unique is its simplified programming model, which allows the user to quickly write and test distributed systems, and its efficient, automatic distribution of data and work across machines, which in turn exploits the underlying parallelism of the CPU cores.

3 Hadoop scalability
Hadoop can reach massive scalability by exploiting a simple distribution architecture and coordination model
Huge clusters can be built using (cheap) commodity hardware: a 1000-CPU machine would be much more expensive than 1000 single-CPU or 250 quad-core machines
The cluster can easily scale up with little or no modification to the programs
One of the major benefits of using Hadoop in contrast to other distributed systems is its flat scalability curve. Executing Hadoop on a limited amount of data on a small number of nodes may not demonstrate particularly stellar performance, as the overhead involved in starting Hadoop programs is relatively high. Other parallel/distributed programming paradigms such as MPI (Message Passing Interface) may perform much better on two, four, or perhaps a dozen machines. Though the effort of coordinating work among a small number of machines may be better performed by such systems, the price paid in performance and engineering effort (when adding more hardware as a result of increasing data volumes) increases non-linearly.

4 Hadoop Components
HDFS: Hadoop Distributed File System
Abstraction of a file system over a cluster
Stores large amounts of data by transparently spreading them over different machines
MapReduce
Simple programming model that enables parallel execution of data processing programs
Executes the work on the data near the data
In a nutshell: HDFS places the data on the cluster and MapReduce does the processing work

5 Hadoop Principle
Hadoop is basically a middleware platform that manages a cluster of machines
The core component is a distributed file system (HDFS)
Files in HDFS are split into blocks that are scattered over the cluster
The cluster can grow indefinitely simply by adding new nodes

6 The MapReduce Paradigm
Parallel processing paradigm: the programmer is unaware of parallelism
Programs are structured into a two-phase execution: Map and Reduce
In the Map phase, data elements are classified into categories; in the Reduce phase, an algorithm is applied to all the elements of the same category

7 MapReduce and Hadoop
MapReduce is logically placed on top of HDFS

8 MapReduce and Hadoop
MapReduce works on (big) files loaded on HDFS
Each node in the cluster executes the MapReduce program in parallel, applying the map and reduce phases to the blocks it stores
Output is written on HDFS
Scalability principle: perform the computation where the data is

9 Hadoop pros & cons
Good for: repetitive tasks on big-size data
Not good for: replacing an RDBMS; complex processing requiring various phases and/or iterations; processing small to medium size data

10 HDFS

11 HDFS Design Principles
Targeted at storing large files: performance with “small” files can be poor due to the overhead of distribution
Reliable and scalable: fast failover and extension of the cluster
Reliable but NOT highly available (a single point of failure is present)
Optimized for long sequential reads rather than random read/write access
Block-structured: files are split into blocks that are treated independently with respect to distribution
Block size is configurable but is typically “big” (default 64 MB) to optimize handling of big files

12 HDFS Interface
HDFS acts as a file system separate from that of the operating system
A shell is available implementing common operating system commands: ls, cat, etc.
Commands for moving files to/from the local file system are present
A web interface allows browsing the file system and shows the state of the cluster
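The slides focus on the shell and web interfaces; the same operations are also available programmatically. Below is a minimal, illustrative Java sketch using Hadoop's FileSystem API (the paths and file names are invented for the example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    // Reads core-site.xml / hdfs-site.xml from the classpath to locate the cluster
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Copy a local file into HDFS (equivalent to the shell's "put" command)
    fs.copyFromLocalFile(new Path("data.txt"), new Path("/user/demo/data.txt"));

    // List a directory (equivalent to the shell's "ls" command)
    for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
      System.out.println(status.getPath() + " (" + status.getLen() + " bytes)");
    }
  }
}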

13 HDFS Architecture
Two kinds of nodes: NameNode and DataNode
NameNode
Maps blocks to DataNodes
Maintains file system metadata (file names, permissions and block locations)
Coordinates block creation, deletion and replication
One node in the cluster (Single Point of Failure)
Contacted by clients for triggering file operations
Maintains the state of DataNodes
DataNode
Stores blocks
Each block is replicated on multiple DataNodes (the number of replicas is specified on a per-file basis)
All the nodes in the cluster (possibly except one) are DataNodes
Contacted by clients for data transfer operations
Sends heartbeats to the NameNode

14 HDFS Operations: Read

15 HDFS Operations: Write

16 Block Replica Consistency
Simple consistency model: write once – read many
Concurrent operations on metadata are serialized at the NameNode
The NameNode records a transaction log that is used to reconstruct the state of the file system at startup (checkpointing)

17 MapReduce

18 MapReduce Programming model for parallel execution
Implementations available in several programming languages/platforms (the Hadoop implementation is in Java)
Clean abstraction for programmers

19 Programming Model
A MapReduce program transforms an input list into an output list
Processing is organized into two steps:
map (in_key, in_value) -> (out_key, intermediate_value) list
reduce (out_key, intermediate_value list) -> out_value list

20 map
The data source must be structured in records (lines out of files, rows of a database, etc.)
Each record has an associated key
Records are fed into the map function as key-value pairs: e.g., (filename, line)
map() produces one or more intermediate values along with an output key from the input
In other words, map identifies input values with the same characteristics, which are represented by the output key
The output key is not necessarily related to the input key

21 reduce
After the map phase is over, all the intermediate values for a given output key are combined together into a list
reduce() aggregates the intermediate values into one or more final values for that same intermediate key
In practice, there is usually only one final value per key
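Before looking at the real Hadoop code later in the deck, it may help to see the two steps on toy data. The following is a purely illustrative, single-machine Java sketch (not Hadoop code, and not from the slides) that mimics map, the grouping of intermediate values, and reduce for word counting:

import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapReduceToy {
  public static void main(String[] args) {
    List<String> lines = Arrays.asList("to be or not", "to be");

    // "map": each line is turned into a list of (word, 1) pairs
    List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
    for (String line : lines) {
      for (String word : line.split(" ")) {
        intermediate.add(new AbstractMap.SimpleEntry<>(word, 1));
      }
    }

    // "shuffle": intermediate values are grouped by their output key
    Map<String, List<Integer>> grouped = new TreeMap<>();
    for (Map.Entry<String, Integer> pair : intermediate) {
      grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
    }

    // "reduce": each key's list of values is aggregated into one final value
    for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
      int sum = 0;
      for (int v : entry.getValue()) sum += v;
      System.out.println(entry.getKey() + "\t" + sum);
    }
  }
}

Running it prints each word with its count (be 2, not 1, or 1, to 2); Hadoop performs conceptually the same steps, but distributes the map and reduce calls across the cluster.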

22 Parallelism
Different instances of the map() function run in parallel, creating different intermediate values from different input data sets
Elements of a list being processed by map cannot see the effects of the computations on other elements: data cannot be shared among map instances
Since the order of application of map to input records does not matter, we can reorder or parallelize execution: all values are processed independently
Instances of the reduce() function also run in parallel, each working on a different output key
Each instance of reduce processes all the intermediate records for the same intermediate key

23 MapReduce Applications
Data aggregation
Log analysis
Statistics
Machine learning

24 MapReduce Applications
Amazon: we build Amazon's product search indices
Facebook: we use Hadoop to store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning
Journey Dynamics: using Hadoop MapReduce to analyse billions of lines of GPS data to create TrafficSpeeds, our accurate traffic speed forecast product
LinkedIn: we use Hadoop for discovering People You May Know and other fun facts
The New York Times: large-scale image conversions
Clusters from 4 to 4500 nodes

25 Example: Count word occurrences
map(String input_key, String input_value):
  // input_key: line number – not used
  // input_value: line content
  for each word in input_value:
    emit(word, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  emit(AsString(result));

26 Example: word count

27 Example: word count

28 WordCount in Java - 1
public class WordCount {

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
  ...

29 WordCount in Java - 2
  ...
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    conf.setInputPath(new Path(args[0]));
    conf.setOutputPath(new Path(args[1]));

    JobClient.runJob(conf);
  }
}

30 Hadoop Ecosystem
The term “ecosystem” with regard to Hadoop might relate to:
Apache projects
Non-Apache projects
Companies providing custom Hadoop distributions
Companies providing user-friendly Hadoop interfaces
Hadoop as a service
We only consider the Apache projects as listed on the Hadoop home page as of March 2013

31 Apache Projects
Data storage (NoSQL DBs): HBase, Hive, Cassandra
Data analysis: Pig, Mahout, Chukwa
Coordination and management: Ambari, ZooKeeper
Utility: Flume, Sqoop

32 Data storage
HBase: scalable, distributed database that supports structured data storage for large tables, based on the BigTable model
Cassandra: scalable, fault-tolerant database with no single points of failure, based on the BigTable model
Hive: data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems

33 Tools for Data Analysis with Hadoop
Pig, Hive, and statistical software sit on top of Hadoop (MapReduce and HDFS)
Hive is treated only in the appendix

34 Apache Pig Tool for querying data on Hadoop clusters
Widely used in the Hadoop world: Yahoo! estimates that 50% of the Hadoop workload on its 100,000-CPU clusters is generated by Pig scripts
Allows writing data manipulation scripts in a high-level language called Pig Latin
Interpreted language: scripts are translated into MapReduce jobs
Mainly targeted at joins and aggregations

35 Pig Example Real example of a Pig script used at Twitter
The Java equivalent…

36 Pig Commands Loading datasets from HDFS
users = load 'Users.csv' using PigStorage(',') as (username: chararray, age: int);
pages = load 'Pages.csv' using PigStorage(',') as (username: chararray, url: chararray);

37 Pig Commands Filtering data
users_1825 = filter users by age>=18 and age<=25;

38 Pig Commands Join datasets
joined = join users_1825 by username, pages by username;

39 Pig Commands
Group records
grouped = group joined by url;
Creates a new dataset with elements named group and joined. There will be one record for each distinct url:
dump grouped;
( {(alice, 15), (bob, 18)})
( {(carol, 24), (alice, 14), (bob, 18)})

40 Pig Commands Apply function to records in a dataset
summed = foreach grouped generate group as url, COUNT(joined) AS views;

41 Pig Commands
Sort a dataset
sorted = order summed by views desc;
Filter first n rows
top_5 = limit sorted 5;

42 Pig Commands
Write a dataset to HDFS
store top_5 into 'top5_sites.csv';

43 Word Count in Pig
A = load '/tmp/bible+shakes.nopunc';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = filter B by word matches '\\w+';
D = group C by word;
E = foreach D generate COUNT(C) as count, group as word;
F = order E by count desc;
store F into '/tmp/wc';

44 Another Pig Example: Correlation
What is the correlation between users that have phones and users that tweet?

45 Pig: User Defined Functions
There are times when Pig's built-in operators and functions will not suffice
Pig provides the ability to implement your own:
Filter, e.g.: res = FILTER bag BY udfFilter(post);
Load function, e.g.: res = load 'file.txt' using udfLoad();
Eval, e.g.: res = FOREACH bag GENERATE udfEval($1);
UDFs can be written in several programming languages: Java, Python, JavaScript
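As an illustration, a minimal eval UDF might look like the sketch below; it assumes the standard org.apache.pig.EvalFunc API, and the class name UpperCase is invented for the example:

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical eval UDF that upper-cases its first argument
public class UpperCase extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // Return null for empty or null records instead of failing
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        return input.get(0).toString().toUpperCase();
    }
}

Such a class is packaged in a jar, made visible to Pig with REGISTER, and then used like a built-in function, e.g. res = FOREACH bag GENERATE UpperCase($1);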

46 Hive
Hive is a data warehouse system for Hadoop that facilitates ad-hoc queries and the analysis of large datasets stored in Hadoop. Hive provides a SQL-like language called HiveQL. Due to its SQL-like interface, Hive is increasingly becoming the technology of choice for using Hadoop.

47 Using Hadoop from Statistical Software
R: the packages rhdfs and rmr issue HDFS commands and write MapReduce jobs
SAS: SAS In-Memory Statistics; SAS/ACCESS makes data stored in Hadoop appear as native SAS datasets (uses the Hive interface)
SPSS: transparent integration with Hadoop data

48 RHadoop
Set of packages that allows integration of R with HDFS and MapReduce
Hadoop provides the storage while R brings the analysis
Just a library: not a special run-time, not a different language, not a special-purpose language
Incrementally port your code and use all packages
Requires R installed and configured on all nodes in the cluster

49 WordCount in R
wordcount = function(input, output = NULL, pattern = " ") {

  wc.map = function(., lines) {
    keyval(
      unlist(strsplit(x = lines, split = pattern)),
      1)}

  wc.reduce = function(word, counts) {
    keyval(word, sum(counts))}

  mapreduce(
    input = input,
    output = output,
    input.format = "text",
    map = wc.map,
    reduce = wc.reduce,
    combine = T)}

50 Case Study 1: Air Traffic data
Input data set: Ticket Id | Booking No | Origin | Destination | Flight No. | Miles (one record per origin-destination pair)
Compute the following dataset: Origin | Final Destination | Number of Passengers
The Final Destination is obtained by chaining origins and destinations with the same booking number (see the sketch below for one possible approach to the chaining step)
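A hedged sketch of the chaining step, assuming one MapReduce job that keys each segment by booking number so that reduce receives all segments of a booking; the helper below shows only the per-booking logic (the field layout, class name, and method names are assumptions, and round trips are not handled):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ChainSegments {

    /** Each segment is {origin, destination}; all segments share one booking number. */
    static String[] chain(List<String[]> segments) {
        Map<String, String> next = new HashMap<>();   // origin -> destination
        Map<String, String> prev = new HashMap<>();   // destination -> origin
        for (String[] s : segments) {
            next.put(s[0], s[1]);
            prev.put(s[1], s[0]);
        }
        // The true origin is the only origin that never appears as a destination
        String origin = null;
        for (String o : next.keySet()) {
            if (!prev.containsKey(o)) { origin = o; break; }
        }
        if (origin == null) return null;              // round trip: ambiguous, skip
        // Walk the chain to reach the final destination (bounded by the segment count)
        String destination = origin;
        int hops = 0;
        while (next.containsKey(destination) && hops++ < segments.size()) {
            destination = next.get(destination);
        }
        return new String[] { origin, destination };
    }

    public static void main(String[] args) {
        List<String[]> booking = new ArrayList<>();
        booking.add(new String[] { "FCO", "FRA" });
        booking.add(new String[] { "FRA", "JFK" });
        String[] od = chain(booking);
        System.out.println(od[0] + " -> " + od[1]);   // FCO -> JFK
    }
}

A second pass (or a combined reduce) would then count passengers per (Origin, Final Destination) pair.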

51 Case Study 2: Maritime Traffic Data
Input data sets:
Ship ID | Longitude | Latitude | Timestamp (one record per tracked position)
Ship ID | Origin | Destination | Number of passengers (one record per ship)
Design the processing architecture
Compute the following dataset: Ship ID | Period (night/day) | Total stop time

