Central Florida Business Intelligence User Group

Hadoop Introduction
Curtis Boyden, Pentaho Corporation
Central Florida Business Intelligence User Group, May 19, 2011

What we are going to discuss
- Hadoop
- HDFS
- Hadoop MapReduce (M/R)
- Hive

Hadoop
- What Hadoop is
- What Hadoop can do
- What Hadoop is not
- How Hadoop works
- How Hadoop is used

What Hadoop is
"an open source Java framework for processing and querying vast amounts of data on large clusters of commodity hardware." - Yahoo! Developer Network
- Distributed computing platform
- Multiple ASF projects:
  - HDFS (filesystem)
  - Hadoop M/R (logic)
  - Hive (SQL DB)
  - More...

Hadoop: FOSS
- Inspired by Google's MapReduce framework
- Developed under the Apache Software Foundation
- Major contributors: Yahoo!, Facebook, Cloudera, LinkedIn, and more

What Hadoop can do
- Store large files (HDFS)
- Scale affordably (utilizes commodity hardware)
- Handle failover automatically
- Process large files efficiently (M/R & HDFS)

What Hadoop is not
- A store for many small files
- A store for fast, random access to data (read latency)
- An RDBMS
- A framework for processing streaming data in real time

How Hadoop works
Source: Apache Hadoop
- Master node runs the NameNode & JobTracker services
- Worker nodes run the DataNode & TaskTracker services
- Data is loaded into HDFS
- A client submits an M/R job for execution
- M/R tasks execute on the worker nodes that hold the data locally
- Results are stored back to HDFS

HDFS
- What HDFS is
- What HDFS is not
- How HDFS works
- How HDFS is used

What HDFS is
"[HDFS] is the primary storage system used by Hadoop applications" - Apache Hadoop
- Distributed filesystem
- High-throughput access to data
- Scalable
- Data replication / location awareness
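To make "large files" and "data replication" concrete, here is a back-of-envelope sketch in plain Java (no Hadoop APIs). The 64 MB block size and replication factor of 3 used below are the classic HDFS defaults of that era; both are configurable per cluster, so treat the numbers as illustrative assumptions:

```java
public class HdfsSizing {
    // Classic HDFS defaults (configurable per cluster)
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB
    static final int REPLICATION = 3;

    // Number of HDFS blocks needed for a file (the last block may be partial)
    static long blocks(long fileBytes) {
        return (fileBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Raw cluster bytes consumed once every block is replicated
    static long rawBytes(long fileBytes) {
        return fileBytes * REPLICATION;
    }

    public static void main(String[] args) {
        long oneGb = 1024L * 1024 * 1024;
        System.out.println(blocks(oneGb));   // 16 blocks of 64 MB
        System.out.println(rawBytes(oneGb)); // 3 GB of raw storage
    }
}
```

This is also why HDFS dislikes many small files: each file occupies at least one block's worth of NameNode metadata, so a million tiny files cost far more bookkeeping than one large file of the same total size.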

What HDFS is not
- A low-latency data store
- An RDBMS
- "A POSIX filesystem" - Apache Hadoop
- "A substitute for a HA SAN" - Apache Hadoop
http://wiki.apache.org/hadoop/HadoopIsNot

How HDFS works Source: Apache Hadoop

How HDFS is used
- Main storage for data to be processed by MapReduce algorithms
- Enables MapReduce data locality
- The filesystem for Hadoop
- Example uses: large file store, Hive table data

Hadoop M/R
- What M/R is
- What M/R is not
- What problems M/R solves
- How M/R works
- How M/R is used

What MapReduce is
"A software framework for distributed processing of large data sets on compute clusters" - Apache Hadoop
- Programming model
- Parallel
- Scalable
- Automated failover

What M/R is not
"MapReduce is not always the best algorithm"
To support parallelism, "each MR operation [must be] independent from all the others."
"If you need to know everything that has gone before, you have a problem." - Apache Hadoop

What problems M/R solves
- Huge datasets: distributed storage
- Massively parallel processing: distributed computing
- Example use cases: processing weblogs, indexing the web for search, data analysis
http://www.cloudera.com/why-hadoop/
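The scale argument is easy to sanity-check with arithmetic. Assuming a hypothetical 400 TB dataset and roughly 100 MB/s of sequential read throughput per node (both illustrative numbers, not from this deck), a single machine needs about 46 days just to read the data once, while 1,000 nodes reading in parallel need about an hour:

```java
public class ScanTime {
    // Hypothetical numbers: 400 TB dataset, 100 MB/s per-node read throughput
    static final double DATA_BYTES = 400e12;
    static final double BYTES_PER_SEC = 100e6;

    // Seconds to scan the whole dataset with n nodes reading in parallel
    static double scanSeconds(int nodes) {
        return DATA_BYTES / (BYTES_PER_SEC * nodes);
    }

    public static void main(String[] args) {
        System.out.printf("1 node: %.1f days%n", scanSeconds(1) / 86400);
        System.out.printf("1000 nodes: %.1f minutes%n", scanSeconds(1000) / 60);
    }
}
```

The speedup assumes the work really is independent per node, which is exactly the constraint the "What M/R is not" slide describes.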

How M/R works
- Data locality: process the data you have local access to
- Source and result data are key/value pairs (KVPs)
- A Mapper processes KVPs into new KVPs
  - One input KVP per map iteration
  - Any number of KVPs can be generated in a map iteration
- A Reducer processes the set of values for a given key
  - A single key, with its list of values, per reduce iteration
  - Any number of KVPs can be generated in a reduce iteration
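The KVP flow above can be sketched without the Hadoop APIs at all. This is a minimal, single-process imitation of the three phases for word counting: map (emit pairs), shuffle (group values by key, which the Hadoop framework does between phases), and reduce (aggregate each key's values). In real Hadoop these phases run distributed across the cluster:

```java
import java.util.*;

public class MiniMapReduce {
    // Map: one input value (a line) in, any number of (word, 1) pairs out
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) out.add(Map.entry(word, 1));
        }
        return out;
    }

    // Shuffle: group all emitted values by key (the framework's job in Hadoop)
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce: one key with all of its values in, one total out
    static int reduce(List<Integer> counts) {
        int total = 0;
        for (int c : counts) total += c;
        return total;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> grouped = shuffle(map("the quick fox the fox"));
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            System.out.println(e.getKey() + ", " + reduce(e.getValue()));
        }
        // fox, 2
        // quick, 1
        // the, 2
    }
}
```

The Hadoop Mapper and Reducer on the next two slides implement exactly the map and reduce steps of this sketch, with the framework supplying the shuffle.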

Mapper

    private final Text word = new Text();
    private static final IntWritable ONE = new IntWritable(1);

    public void map(Object key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        StringTokenizer wordList = new StringTokenizer(value.toString());
        // For each word in the line
        while (wordList.hasMoreTokens()) {
            // Set the key's value to the word itself
            this.word.set(wordList.nextToken());
            // Emit a key/value pair: WORD, 1
            output.collect(this.word, ONE);
        }
    }

The incoming key/value pairs are dictated by a configurable InputFormat. map(...) is executed once per line of the input file:
- key: the position of the line within the file
- value: the text of the given line
The line's string is broken up into its individual words, and each word is individually assigned a count of 1.

Mapper
Input: "If you've never seen an elephant ski, you've never been on acid - Eddie Izzard"
Output:
- If, 1
- you've, 1
- never, 1
- seen, 1
- an, 1
- elephant, 1
- ski,, 1
- you've, 1
- never, 1
- been, 1
- on, 1
- acid, 1
- -, 1
- Eddie, 1
- Izzard, 1

Reducer

    private final IntWritable totalWordCount = new IntWritable();

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int wordCount = 0;
        // key is a single WORD; values is the list of counts of that key's occurrences
        // For each value (count) of the given key (WORD)
        while (values.hasNext()) {
            // Increment the count of the key (WORD)
            wordCount += values.next().get();
        }
        // Set the total counted occurrences of the key (WORD)
        this.totalWordCount.set(wordCount);
        // Send the KVP result to the output collector
        output.collect(key, this.totalWordCount);
    }

The incoming values are aggregated from the various Mapper outputs that share a common key. All values for a given key are processed by a single Reducer on a single machine. reduce(...) is executed once per key:
- key: the word
- values: a list of "count" occurrences for the key (in our example, always 1)
The key represents the word and is not itself involved in the processing, while the list of "word occurrences" is tallied up into the total number of occurrences of that word in the input file.

Reducer
Input (from mapper): Key: If, Values: 1
Output: If, 1
Input (from mapper): Key: never, Values: 1, 1
Output: never, 2
Remaining inputs:
- you've: 1, 1
- seen: 1
- an: 1
- elephant: 1
- ski,: 1
- been: 1
- on: 1
- acid: 1
- -: 1
- Eddie: 1
- Izzard: 1
Remaining outputs:
- you've, 2
- seen, 1
- an, 1
- elephant, 1
- ski,, 1
- been, 1
- on, 1
- acid, 1
- -, 1
- Eddie, 1
- Izzard, 1

How M/R is used
- Hadoop M/R reads/writes data from/to HDFS
- Hive queries data with M/R
- Any application can execute M/R jobs

Hive
- What Hive is
- How Hive works
- How Hive is used

Other Hadoop projects
- Avro: a data serialization system
- Cassandra: a scalable multi-master database with no single point of failure
- Chukwa: a data collection system for managing large distributed systems
- HBase: a scalable, distributed database that supports structured data storage for large tables
- Hive: a data warehouse infrastructure that provides data summarization and ad hoc querying
- Mahout: a scalable machine learning and data mining library
- Pig: a high-level data-flow language and execution framework for parallel computation
- ZooKeeper: a high-performance coordination service for distributed applications
More info: http://hadoop.apache.org

Thank you