Big Data Technology: Introduction to Hadoop

Presentation transcript:

Big Data Technology: Introduction to Hadoop Antonino Virgillito

Hadoop Open-source platform for distributed processing of large data. Functions: distribution of data and processing across machines; management of the cluster. Simplified programming model: easy to write distributed algorithms. What makes Hadoop unique is its simplified programming model, which allows the user to quickly write and test distributed systems, and its efficient, automatic distribution of data and work across machines, which in turn exploits the underlying parallelism of the CPU cores.

Hadoop scalability Hadoop can reach massive scalability by exploiting a simple distribution architecture and coordination model. Huge clusters can be built from (cheap) commodity hardware: a 1000-CPU machine would be much more expensive than 1000 single-CPU or 250 quad-core machines. The cluster can easily scale up with little or no modification to the programs. One of the major benefits of Hadoop in contrast to other distributed systems is its flat scalability curve. Executing Hadoop on a limited amount of data on a small number of nodes may not demonstrate particularly stellar performance, as the overhead involved in starting Hadoop programs is relatively high. Other parallel/distributed programming paradigms such as MPI (Message Passing Interface) may perform much better on two, four, or perhaps a dozen machines. But while the work of coordinating a small number of machines may be better performed by such systems, the price paid in performance and engineering effort when adding more hardware to cope with increasing data volumes grows non-linearly.

Hadoop Components HDFS (Hadoop Distributed File System): an abstraction of a file system over a cluster; stores large amounts of data by transparently spreading it across different machines. MapReduce: a simple programming model that enables parallel execution of data processing programs; it executes the work near the data. In a nutshell: HDFS places the data on the cluster and MapReduce does the processing work.

Hadoop Principle Hadoop is basically a middleware platform that manages a cluster of machines. The core component is a distributed file system (HDFS). Files in HDFS are split into blocks that are scattered over the cluster, while the content still appears to the user as one big data set. The cluster can grow indefinitely simply by adding new nodes.

The MapReduce Paradigm A parallel processing paradigm in which the programmer is unaware of parallelism. Programs are structured into a two-phase execution: Map and Reduce. In the Map phase, data elements are classified into categories; in the Reduce phase, an algorithm is applied to all the elements of the same category.

MapReduce and Hadoop In the Hadoop architecture, MapReduce is logically placed on top of HDFS.

MapReduce and Hadoop MapReduce works on (big) files loaded on HDFS. Each node in the cluster executes the MapReduce program in parallel, applying the map and reduce phases on the blocks it stores; the output is written back to HDFS. Scalability principle: perform the computation where the data is.

Hadoop pros & cons Good for: repetitive tasks on big data. Not good for: replacing an RDBMS; complex processing requiring various phases and/or iterations; processing small to medium size data.

HDFS

HDFS Design Principles Targeted at storing large files: performance with “small” files can be poor due to the overhead of distribution. Reliable and scalable: fast failover and extension of the cluster. Reliable but NOT highly available (a single point of failure is present). Optimized for long sequential reads rather than random read/write access. Block-structured: files are split into blocks that are treated independently with respect to distribution; the block size is configurable but is typically “big” (64 MB by default) to optimize the handling of big files.

HDFS Interface HDFS acts as a separate file system with respect to the operating system. A shell is available implementing common operating system commands (ls, cat, etc.), together with commands for moving files to/from the local file system; see the examples below. A web interface allows browsing the file system and shows the state of the cluster.
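A minimal sketch of typical HDFS shell usage (the paths are illustrative, not taken from the slides):

  hadoop fs -ls /user/demo              # list an HDFS directory
  hadoop fs -cat /user/demo/data.txt    # print an HDFS file to standard output
  hadoop fs -put local.txt /user/demo/  # copy a local file into HDFS
  hadoop fs -get /user/demo/data.txt .  # copy an HDFS file to the local file system
  hadoop fs -mkdir /user/demo/out       # create a directory in HDFS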

HDFS Architecture Two kinds of nodes: NameNode and DataNode. NameNode: maps blocks to DataNodes; maintains file system metadata (file names, permissions and block locations); coordinates block creation, deletion and replication; maintains the state of the DataNodes; is contacted by clients to trigger file operations. There is one NameNode in the cluster, and it is a single point of failure. DataNode: stores blocks; each block is replicated on multiple DataNodes (the number of replicas is specified on a per-file basis); is contacted by clients for data transfer operations; sends heartbeats to the NameNode. All the nodes in the cluster (possibly except one) are DataNodes.

HDFS Operations: read

HDFS Operations: Write
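The read and write paths can also be exercised from a client program. A minimal sketch using the HDFS Java API (FileSystem, Path); the path and the content are illustrative assumptions:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();  // picks up core-site.xml / hdfs-site.xml
      FileSystem fs = FileSystem.get(conf);      // client handle that talks to the NameNode

      // Write: the NameNode allocates blocks, the client streams data to the DataNodes
      Path file = new Path("/user/demo/hello.txt");
      FSDataOutputStream out = fs.create(file);
      out.writeUTF("hello HDFS");
      out.close();

      // Read: block locations come from the NameNode, data is read from the DataNodes
      FSDataInputStream in = fs.open(file);
      System.out.println(in.readUTF());
      in.close();
    }
  }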

Block Replica Consistency Simple consistency model: write once, read many. Concurrent operations on metadata are serialized at the NameNode. The NameNode records a transaction log that is used to reconstruct the state of the file system at startup (checkpointing).

MapReduce

MapReduce A programming model for parallel execution, with implementations available in several programming languages/platforms; the Hadoop implementation is in Java. It provides a clean abstraction for programmers.

Programming Model A MapReduce program transforms an input list into an output list. Processing is organized into two steps: map (in_key, in_value) -> (out_key, intermediate_value) list; reduce (out_key, intermediate_value list) -> out_value list.

map The data source must be structured in records (lines out of files, rows of a database, etc.). Each record has an associated key. Records are fed into the map function as key-value pairs, e.g., (filename, line). map() produces one or more intermediate values along with an output key from the input; in other words, map identifies input values with the same characteristics, which are represented by the output key. The output key is not necessarily related to the input key.

reduce After the map phase is over, all the intermediate values for a given output key are combined into a list. reduce() aggregates the intermediate values into one or more final values for that same intermediate key (in practice, usually only one final value per key).

Parallelism Different instances of the map() function run in parallel, creating different intermediate values from different input data sets. Elements of a list being processed by map cannot see the effects of the computation on other elements, and data cannot be shared among map instances. Since the order in which map is applied to input records does not matter, execution can be reordered or parallelized: all values are processed independently. Instances of the reduce() function also run in parallel, each working on a different output key; each instance of reduce processes all the intermediate records for the same intermediate key.

MapReduce Applications Data aggregation, log analysis, statistics, machine learning, …

MapReduce Applications Amazon: builds Amazon's product search indices. Facebook: stores copies of internal log and dimension data sources and uses them as a source for reporting/analytics and machine learning. Journey Dynamics: uses Hadoop MapReduce to analyse billions of lines of GPS data to create TrafficSpeeds, an accurate traffic speed forecast product. LinkedIn: uses Hadoop for discovering People You May Know and other fun facts. The New York Times: large-scale image conversions. Clusters range from 4 to 4500 nodes. http://wiki.apache.org/hadoop/PoweredBy

Example: Count word occurrences

map(String input_key, String input_value):
  // input_key: line number – not used
  // input_value: line content
  for each word in input_value:
    emit(word, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  emit(AsString(result));

Example: word count

Example: word count

WordCount in Java - 1

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Emits (word, 1) for every token in the input line
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    // Sums all the counts received for a word and emits (word, total)
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
  ...

WordCount in Java - 2

  ...
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);  // the reducer doubles as a local combiner
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    // Input and output are HDFS paths passed on the command line
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}
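A hedged note on running the job (the jar name and the paths are illustrative): the two classes are compiled and packaged into a jar, then submitted to the cluster with

  hadoop jar wordcount.jar WordCount /user/demo/input /user/demo/output

The output directory must not exist beforehand; the results are written to HDFS as part-00000, part-00001, … files, one word and its count per line.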

Hadoop Ecosystem The term “ecosystem”, with regard to Hadoop, might relate to: Apache projects; non-Apache projects; companies providing custom Hadoop distributions; companies providing user-friendly Hadoop interfaces; Hadoop as a service. We only consider Apache projects as listed on the Hadoop home page as of March 2013.

Apache Projects Data storage (NoSQL DBs): HBase, Hive, Cassandra. Data analysis: Pig, Mahout, Chukwa. Coordination and management: Ambari, ZooKeeper. Utility: Flume, Sqoop.

Data storage HBase: a scalable, distributed database that supports structured data storage for large tables, based on the BigTable model. Cassandra: a scalable, fault-tolerant database with no single point of failure, based on the BigTable model. Hive: a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. http://cloudstory.in/2012/04/introduction-to-big-data-hadoop-ecosystem-part-3/

Tools for Data Analysis with Hadoop Pig, Hive and statistical software all sit on top of the Hadoop stack (MapReduce over HDFS). Hive is treated only in the appendix.

Apache Pig A tool for querying data on Hadoop clusters, widely used in the Hadoop world: Yahoo! estimates that 50% of the Hadoop workload on its 100,000-CPU clusters is generated by Pig scripts. Allows writing data manipulation scripts in a high-level language called Pig Latin. Interpreted: scripts are translated into MapReduce jobs. Mainly targeted at joins and aggregations.

Pig Example Real example of a Pig script used at Twitter The Java equivalent… http://www.slideshare.net/kevinweil/hadoop-pig-and-twitter-nosql-east-2009

Pig Commands Loading datasets from HDFS:

users = load 'Users.csv' using PigStorage(',') as (username: chararray, age: int);
pages = load 'Pages.csv' using PigStorage(',') as (username: chararray, url: chararray);

Pig Commands Filtering data:

users_1825 = filter users by age >= 18 and age <= 25;

Pig Commands Join datasets:

joined = join users_1825 by username, pages by username;

Pig Commands Group records:

grouped = group joined by url;

This creates a new dataset with elements named group and joined; there will be one record for each distinct url:

dump grouped;
(www.twitter.com, {(alice, 15), (bob, 18)})
(www.facebook.com, {(carol, 24), (alice, 14), (bob, 18)})

Pig Commands Apply a function to the records in a dataset:

summed = foreach grouped generate group as url, COUNT(joined) as views;

Pig Commands Sort a dataset:

sorted = order summed by views desc;

Keep only the first n rows:

top_5 = limit sorted 5;

Pig Commands Write a dataset to HDFS:

store top_5 into 'top5_sites.csv';

Word Count in Pig

A = load '/tmp/bible+shakes.nopunc';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = filter B by word matches '\\w+';
D = group C by word;
E = foreach D generate COUNT(C) as count, group as word;
F = order E by count desc;
store F into '/tmp/wc';
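As a usage note (the script file name is an illustrative assumption): the statements can be typed interactively in the Grunt shell, or saved to a file such as wc.pig and run on the cluster with pig wc.pig, or tested against the local file system with pig -x local wc.pig.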

Another Pig Example: Correlation What is the correlation between users that have phones and users that tweet?

Pig: User Defined Functions There are times when Pig’s built-in operators and functions will not suffice, so Pig provides the ability to implement your own. Filter, e.g.: res = FILTER bag BY udfFilter(post); Load function, e.g.: res = load 'file.txt' using udfLoad(); Eval, e.g.: res = FOREACH bag GENERATE udfEval($1); There is a choice between several programming languages: Java, Python, JavaScript. A sketch of a simple Java eval UDF is shown below.
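A minimal sketch of a Java eval UDF, assuming the standard Pig UDF API (org.apache.pig.EvalFunc); the package, class name and behaviour are illustrative, not taken from the slides:

  package myudfs;

  import java.io.IOException;
  import org.apache.pig.EvalFunc;
  import org.apache.pig.data.Tuple;

  // Hypothetical UDF that upper-cases a chararray field
  public class Upper extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
      if (input == null || input.size() == 0 || input.get(0) == null) {
        return null;  // a null result means "no output" for this record
      }
      return ((String) input.get(0)).toUpperCase();
    }
  }

It would then be packaged into a jar and used from a script roughly as: register myudfs.jar; upper_names = foreach users generate myudfs.Upper(username);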

Hive Hive is a data warehouse system for Hadoop that facilitates ad-hoc queries and the analysis of large datasets stored in Hadoop. Hive provides a SQL-like language called HiveQL. Due to its SQL-like interface, Hive is increasingly becoming the technology of choice for using Hadoop.
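A minimal HiveQL sketch, mirroring the Pig example above; the table layout and the path are illustrative assumptions, not taken from the slides:

  CREATE TABLE pages (username STRING, url STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
  LOAD DATA INPATH '/user/demo/Pages.csv' INTO TABLE pages;

  -- top 5 most viewed urls
  SELECT url, COUNT(*) AS views
  FROM pages
  GROUP BY url
  ORDER BY views DESC
  LIMIT 5;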

Using Hadoop from Statistical Software R: the rhdfs and rmr packages issue HDFS commands and write MapReduce jobs. SAS: SAS In-Memory Statistics and SAS/ACCESS; the latter makes data stored in Hadoop appear as native SAS datasets, using the Hive interface. SPSS: transparent integration with Hadoop data.

RHadoop A set of packages that integrates R with HDFS and MapReduce: Hadoop provides the storage while R brings the analysis. It is just a library: not a special run-time, not a different language, not a special-purpose language. You can incrementally port your code and use all R packages. Requires R to be installed and configured on all nodes in the cluster.

WordCount in R

wordcount = function(input, output = NULL, pattern = " ") {

  wc.map = function(., lines) {
    # emit one (word, 1) key-value pair per word in the line
    keyval(unlist(strsplit(x = lines, split = pattern)), 1) }

  wc.reduce = function(word, counts) {
    # sum the counts collected for each word
    keyval(word, sum(counts)) }

  mapreduce(
    input = input,
    output = output,
    input.format = "text",
    map = wc.map,
    reduce = wc.reduce,
    combine = T) }
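As a hedged usage note (the paths are illustrative): with the rmr2 and rhdfs packages loaded, the job would be launched as wordcount('/tmp/bible+shakes.nopunc', '/tmp/wc-r'), and the resulting key-value pairs can then be fetched back from HDFS with from.dfs().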

Case Study 1: Air Traffic Data Input data set: Ticket Id | Booking No | Origin | Destination | Flight No. | Miles, with one record per origin-destination pair. Compute the following dataset: Origin | Final Destination | Number of Passengers, where the final destination is obtained by chaining origins and destinations with the same booking number. One possible MapReduce decomposition is sketched below.
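A hedged sketch of one possible decomposition, in the same pseudocode style as the word-count example; it assumes one passenger per ticket record and well-formed itineraries (both are assumptions, not stated on the slide):

  // Job 1: reconstruct complete itineraries from individual legs
  map(ticket_record):
    // bring all the legs of the same booking to the same reducer
    emit(booking_no, (origin, destination));

  reduce(booking_no, legs):
    // the itinerary origin is the airport that is never a destination within the booking,
    // the final destination is the airport that is never an origin
    first_origin      = the origin in legs that is not the destination of any leg;
    final_destination = the destination in legs that is not the origin of any leg;
    emit((first_origin, final_destination), 1);

  // Job 2: sum the passengers for each (origin, final destination) pair
  map(pair, count):
    emit(pair, count);

  reduce(pair, counts):
    emit(pair, sum(counts));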

Case Study 2: Maritime Traffic Data Input data sets: Ship ID | Longitude | Latitude | Timestamp, with one record per position tracking, and Ship ID | Origin | Destination | Number of Passengers, with one record per ship. Design the processing architecture and compute the following dataset: Ship ID | Period (night/day) | Total stop time.