Central Florida Business Intelligence User Group


1 Central Florida Business Intelligence User Group
Hadoop Introduction
Curtis Boyden, Pentaho Corporation
May 19, 2011

2 What we are going to discuss
- Hadoop
- HDFS
- Hadoop MapReduce (M/R)
- Hive

3 Hadoop
- What Hadoop is
- What Hadoop can do
- What Hadoop is not
- How Hadoop works
- How Hadoop is used

4 What Hadoop is
"an open source Java framework for processing and querying vast amounts of data on large clusters of commodity hardware." - Yahoo! Developer Network
- Distributed computing platform
- Multiple ASF projects: HDFS (filesystem), Hadoop M/R (processing logic), Hive (SQL-like warehouse), and more...

5 Hadoop: FOSS
- Inspired by Google's MapReduce framework
- Developed as an Apache Software Foundation project
- Major contributors: Yahoo, Facebook, Cloudera, LinkedIn, and more

6 What Hadoop can do
- Store large files (HDFS)
- Scale affordably (utilize commodity hardware)
- Handle failover automatically
- Process large files efficiently (M/R & HDFS)

7 What Hadoop is not
- Storage for many small files
- A low-latency data store (fast random reads)
- An RDBMS
- A framework for processing streaming data in real time

8 How Hadoop works
[Architecture diagram] Source: Apache Hadoop
- Master node runs the NameNode & JobTracker services
- Worker nodes run the DataNode & TaskTracker services
1. Data is loaded into HDFS
2. A client submits an M/R job for execution
3. M/R tasks execute on worker nodes, preferring nodes that hold the data locally
4. Results are stored back to HDFS

9 HDFS
- What HDFS is
- What HDFS is not
- How HDFS works
- How HDFS is used

10 What HDFS is
"[HDFS] is the primary storage system used by Hadoop applications" - Apache Hadoop
- Distributed filesystem
- High-throughput access to data
- Scalable
- Data replication / location awareness

11 What HDFS is not
- A low-latency data store
- An RDBMS
- "A POSIX filesystem" - Apache Hadoop
- A "substitute for a HA-SAN" - Apache Hadoop

12 How HDFS works
[Architecture diagram] Source: Apache Hadoop

13 How HDFS is used
- Main storage for data processed by MapReduce algorithms
- Provides data locality for MapReduce
- The filesystem for Hadoop
- Example uses: large file store, Hive table data
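The block-and-replica model behind HDFS can be illustrated with a small simulation. This is an illustrative sketch, not the Hadoop API: the 64 MB block size and replication factor of 3 mirror the HDFS defaults of this era, while the round-robin placement is a deliberate simplification of HDFS's rack-aware placement policy.

```java
import java.util.*;

// Illustrative sketch (not the Hadoop API): a file is split into fixed-size
// blocks, and each block is replicated across several data nodes. 64 MB blocks
// and 3 replicas mirror HDFS defaults; round-robin placement is a
// simplification of HDFS's rack-aware policy.
public class HdfsBlockSketch {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB
    static final int REPLICATION = 3;

    // Returns, for each block of the file, the list of nodes holding a replica.
    static List<List<String>> placeBlocks(long fileSize, List<String> nodes) {
        int numBlocks = (int) ((fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE);
        List<List<String>> placement = new ArrayList<>();
        for (int b = 0; b < numBlocks; b++) {
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < REPLICATION; r++) {
                replicas.add(nodes.get((b + r) % nodes.size()));
            }
            placement.add(replicas);
        }
        return placement;
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("node1", "node2", "node3", "node4");
        // A 200 MB file needs 4 blocks (3 full blocks + 1 partial block).
        List<List<String>> placement = placeBlocks(200L * 1024 * 1024, nodes);
        for (int b = 0; b < placement.size(); b++) {
            System.out.println("block " + b + " -> " + placement.get(b));
        }
    }
}
```

Because every block lives on several nodes, a MapReduce task can usually be scheduled on a machine that already holds its input block, which is the data locality the later slides rely on.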

14 Hadoop M/R
- What M/R is
- What M/R is not
- What problems M/R solves
- How M/R works
- How M/R is used

15 What MapReduce is
"A software framework for distributed processing of large data sets on compute clusters" - Apache Hadoop
- Programming model
- Parallel
- Scalable
- Automated failover

16 What M/R is not
- "MapReduce is not always the best algorithm"
- To support parallelism, "each MR operation [must be] independent from all the others"
- "If you need to know everything that has gone before, you have a problem." - Apache Hadoop

17 What problems M/R solves
- Huge datasets - distributed storage
- Massively parallel processing - distributed computing
Example use cases:
- Processing weblogs
- Indexing the web for search
- Data analysis

18 How M/R works
- Data locality: process the data you have local access to
- Source and result data are key/value pairs (KVPs)
- Mapper processes KVPs into new KVPs
  - One input KVP per map iteration
  - Any number of KVPs can be emitted per map iteration
- Reducer processes the set of values for a given key
  - A single key with its list of values per reduce iteration
  - Any number of KVPs can be emitted per reduce iteration
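The two phases above can be sketched as a self-contained, in-memory word count (the same example the next slides walk through in Hadoop's API). This is an illustrative simulation of the programming model, not Hadoop code: the shuffle that groups mapper output by key is stood in for by a sorted map, and all the names here are hypothetical.

```java
import java.util.*;

// In-memory sketch of the MapReduce programming model using word count.
// Not Hadoop code: the "shuffle" that groups mapper output by key is
// simulated with a TreeMap instead of a distributed sort.
public class MapReduceSketch {

    // Map phase: one input record (a line) in, any number of KVPs out.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1)); // emit (WORD, 1)
            }
        }
        return out;
    }

    // Reduce phase: one key with all of its values in, one total out.
    static int reduce(String key, List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v; // tally the occurrences of this word
        }
        return total;
    }

    // Driver: map every line, shuffle by key, then reduce every key.
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kvp : map(line)) {
                shuffled.computeIfAbsent(kvp.getKey(), k -> new ArrayList<>())
                        .add(kvp.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffled.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("the quick brown fox", "the lazy dog");
        System.out.println(wordCount(lines)); // {brown=1, dog=1, fox=1, lazy=1, quick=1, the=2}
    }
}
```

In real Hadoop the map and reduce calls run on many machines in parallel, and the framework guarantees that all values for one key reach a single reducer.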

19 Mapper

  public void map(Object key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    StringTokenizer wordList = new StringTokenizer(value.toString());
    // For each word in the line
    while (wordList.hasMoreTokens()) {
      // Set the key's value to the word itself
      this.word.set(wordList.nextToken());
      // Emit a key/value pair: WORD, 1 ('one' is an IntWritable(1) field)
      output.collect(this.word, this.one);
    }
  }

The incoming key/value types are dictated by a configurable InputFormat.
- map(...) is executed once per line of the input file
- key: the position (byte offset) of the line
- value: the text of the line
The line is broken into its individual words, and each word is emitted with a count of 1.

20 Mapper
Input: "If you've never seen an elephant ski, you've never been on acid" - Eddie Izzard
Output:
If, 1
you've, 1
never, 1
seen, 1
an, 1
elephant, 1
ski,, 1
you've, 1
never, 1
been, 1
on, 1
acid, 1
-, 1
Eddie, 1
Izzard, 1

21 Reducer

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int wordCount = 0;
    // key is a single WORD; values is the list of counts for its occurrences
    // For each value (count) of the given key (WORD)
    while (values.hasNext()) {
      // Increment the count of the key (WORD)
      wordCount += values.next().get();
    }
    // Set the total counted occurrences of the key (WORD)
    this.totalWordCount.set(wordCount);
    // Send the KVP result to the output collector
    output.collect(key, this.totalWordCount);
  }

Values sharing a key are gathered from the various Mapper outputs; all values for a given key are processed by a single Reducer on a single machine.
- reduce(...) is executed once per key
- key: the word
- values: the list of "count" occurrences for the key (in our example, always 1)
The key (the word) passes through unchanged, while the list of counts is tallied into the total number of occurrences in the input file.

22 Reducer
Input (from mapper): key: If, values: [1] -> Output: If, 1
Input (from mapper): key: never, values: [1, 1] -> Output: never, 2
Remaining inputs -> outputs:
you've: [1, 1] -> you've, 2
seen: [1] -> seen, 1
an: [1] -> an, 1
elephant: [1] -> elephant, 1
ski,: [1] -> ski,, 1
been: [1] -> been, 1
on: [1] -> on, 1
acid: [1] -> acid, 1
-: [1] -> -, 1
Eddie: [1] -> Eddie, 1
Izzard: [1] -> Izzard, 1

23 How M/R is used
- Hadoop M/R reads/writes data from/to HDFS
- Hive queries data with M/R
- Any application can execute M/R jobs

24 Hive
- What Hive is
- How Hive works
- How Hive is used

25 Other Hadoop projects
- Avro: a data serialization system
- Cassandra: a scalable multi-master database with no single point of failure
- Chukwa: a data collection system for managing large distributed systems
- HBase: a scalable, distributed database that supports structured data storage for large tables
- Hive: a data warehouse infrastructure that provides data summarization and ad hoc querying
- Mahout: a scalable machine learning and data mining library
- Pig: a high-level data-flow language and execution framework for parallel computation
- ZooKeeper: a high-performance coordination service for distributed applications
More info:

26 Thank you

