A Hadoop Overview. Outline Progress Report MapReduce Programming Hadoop Cluster Overview HBase Overview Q & A.


A Hadoop Overview

Outline Progress Report MapReduce Programming Hadoop Cluster Overview HBase Overview Q & A


Progress Hadoop setup has been completed.  Version , running under standalone mode. HBase setup has been completed.  Version , with no assistance from HDFS. A simple demonstration of MapReduce.  A simple word count program.

Testing Platform Fedora 10 JDK 1.6.0_18 Hadoop HBase One can connect to the machine using PieTTY or PuTTY.  Host:  Account: labuser  Password: robot3233  Port: 3385 (using an SSH connection)

Outline Progress Report MapReduce Programming Hadoop Cluster Overview HBase Overview Q & A

MapReduce A computing framework comprising a map phase, a shuffle phase, and a reduce phase. The Map function and Reduce function are provided by the user. Key-Value Pairs (KVPs)  map is invoked once for each input KVP and may output any number of KVPs.  reduce is invoked once for each key together with its corresponding values, and may output any number of KVPs.
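The KVP contract above can be sketched in plain Java. This is a simulation of the map/shuffle/reduce data flow only, not the Hadoop API; the class and method names (MapReduceSketch, run) are illustrative.

```java
import java.util.*;

public class MapReduceSketch {
    // map: one input record -> any number of (word, 1) KVPs
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) out.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return out;
    }

    // reduce: one key and all of its values -> an aggregated count
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // driver: maps every record, shuffles (groups by key), then reduces
    static Map<String, Integer> run(List<String> records) {
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (String record : records)
            for (Map.Entry<String, Integer> kvp : map(record))
                shuffled.computeIfAbsent(kvp.getKey(), k -> new ArrayList<>())
                        .add(kvp.getValue());
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffled.entrySet())
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        return result;
    }
}
```

Running run over the records "the quick fox" and "the fox" yields the word counts {fox=2, quick=1, the=2}, which is exactly the word count program demonstrated later.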

MapReduce(cont.)

What does the user have to do? 1. Specify the input/output format 2. Specify the output key/value type 3. Specify the input/output location 4. Specify the mapper/reducer class 5. Specify the number of reduce tasks 6. Specify the partitioner class (discussed later)

What does the user have to do? (cont.) Specify the input/output format  An “input/output format” is a class that translates between raw data and KVPs.  It has to inherit from the InputFormat / OutputFormat class.  The input format is required.  The most common choices are the KeyValueTextInputFormat class and the SequenceFileInputFormat class.  The output format is optional; the default is the TextOutputFormat class.

What does the user have to do? (cont.) Specify the output key/value type  The type of the KVPs output by the reducer.  The key type has to implement the WritableComparable interface.  The value type has to implement the Writable interface. Specify the input/output location  The directories for input files / output files.  The input directory should exist and contain at least one file.  The output directory should not already exist.

What does the user have to do? (cont.) Specify the mapper/reducer class  The two classes should extend the MapReduceBase class.  The map/reduce class should implement the Mapper / Reducer interface. Specify the number of reduce tasks  Usually approximately the number of compute nodes.  1 if we want a single output file.  0 if we don’t need the reduce phase.  Note that in this case the result will not be sorted.  The reducer class is not required in this case.

Map Phase Configuration

Element                                           Required?  Default
Input path(s)                                     Yes        –
Class to convert the input path elements to KVPs  Yes        –
Map output key class                              No         Job output key class
Map output value class                            No         Job output value class
Class supplying the map function                  Yes        –
Suggested minimum number of map tasks             No         Cluster default
Number of threads to run each map task            No         1

Reduce Phase Configuration

Element                                   Required?  Default
Output path                               Yes        –
Class to convert the KVPs to output files No         TextOutputFormat
Job input key class                       No         Job output key class
Job input value class                     No         Job output value class
Job output key class                      Yes        –
Job output value class                    Yes        –
Class supplying the reduce function       Yes        –
The number of reduce tasks                No         Cluster default

MapReduceIntro.java

public class MapReduceIntro {
  protected static Logger logger = Logger.getLogger(MapReduceIntro.class);

  public static void main(final String[] args) {
    try {
      // Initial configuration
      final JobConf conf = new JobConf(MapReduceIntro.class);
      conf.set("hadoop.tmp.dir", "/tmp");
      // Map phase configuration
      conf.setInputFormat(KeyValueTextInputFormat.class);
      FileInputFormat.setInputPaths(conf, MapReduceIntroConfig.getInputDirectory());
      conf.setMapperClass(IdentityMapper.class);
      // Reduce phase configuration
      FileOutputFormat.setOutputPath(conf, MapReduceIntroConfig.getOutputDirectory());
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(Text.class);
      conf.setNumReduceTasks(1);
      conf.setReducerClass(IdentityReducer.class);
      // Job running
      final RunningJob job = JobClient.runJob(conf);
      if (!job.isSuccessful()) {
        logger.error("The job failed.");
        System.exit(1);
      }
      System.exit(0);
    } catch (final Exception e) {
      logger.error("The job failed.", e);
      System.exit(1);
    }
  }
}

IdentityMapper.java

public class IdentityMapper<K, V>                 // K, V: the input (and output) key/value types
    extends MapReduceBase implements Mapper<K, V, K, V> {

  public void map(K key, V val, OutputCollector<K, V> output,
                  Reporter reporter)              // Reporter: discussed later
      throws IOException {
    output.collect(key, val);                     // collect output KVPs
  }
}

IdentityReducer.java

public class IdentityReducer<K, V>
    extends MapReduceBase implements Reducer<K, V, K, V> {

  public void reduce(K key, Iterator<V> values,   // the input values come as an Iterator!
                     OutputCollector<K, V> output, Reporter reporter)
      throws IOException {
    while (values.hasNext()) {
      output.collect(key, values.next());
    }
  }
}

Compiling Using the default Java compiler.  Note that we have to supply the -classpath parameter so that the compiler can find the Hadoop core libraries and the other classes needed.  $ javac -classpath $HADOOP_HOME/hadoop-core.jar:. -d . MyClass.java  ($HADOOP_HOME/hadoop-core.jar holds the Hadoop core libraries; the trailing . is the location of other class files)

Creating jar file To create an executable jar file: 1. Create a file “manifest.mf”:  Main-Class: myclass (the driver class; note the white space after the colon and the carriage return at the end of the line)  Class-Path: MyExample.jar (a white-space-separated list) 2. Type the command:  $ jar -cmf manifest.mf MyExample.jar *.class  The wildcard character * is also accepted.

Run the jar file Using the hadoop command.  $ hadoop jar MyExample.jar Remember that the output path should not exist.  If the path exists, remove it first with the rm -r path command.

A simple demonstration A simple word count program.

Reporter

Outline Progress Report MapReduce Programming Hadoop Cluster Overview HBase Overview Q & A

Hadoop Full name: the Apache Hadoop project.  An open-source implementation of reliable, scalable distributed computing.  An aggregation of the following subprojects (HDFS and MapReduce form its core):  Avro  Chukwa  HBase  HDFS  Hive  MapReduce  Pig  ZooKeeper

Virtual Machine (VM) Virtualization  All services are delivered through VMs.  Allows for dynamic configuration and management.  Multiple VMs can run on a single commodity machine.  e.g., VMware

HDFS (Hadoop Distributed File System) The highly scalable distributed file system of Hadoop.  Resembles the Google File System (GFS).  Provides reliability by replication. NameNode & DataNode  NameNode  Maintains file system metadata and namespace.  Provides management and control services.  Usually one instance.  DataNode  Provides data storage and retrieval services.  Usually several instances.

MapReduce The sophisticated distributed computing service of Hadoop.  A computation framework.  Usually resides on HDFS. JobTracker & TaskTracker  JobTracker  Manages the distribution of tasks to the TaskTrackers.  Provides job monitoring and control, and the submission of jobs.  TaskTracker  Manages a single map or reduce task on a compute node.

Cluster Makeup A Hadoop cluster is usually made up of:  Real machines.  Not required to be homogeneous.  Homogeneity helps maintainability.  Server processes.  Multiple processes can run on a single VM. Master & Slave  The node/machine running the JobTracker or NameNode is a master node.  The ones running the TaskTracker or DataNode are slave nodes.

Cluster Makeup(cont.)

Administrator Scripts Administrator can use the following script files to start or stop server processes.  Can be located in $HADOOP_HOME/bin  start-all.sh/stop-all.sh  start-mapred.sh/stop-mapred.sh  start-dfs.sh/stop-dfs.sh  slaves.sh  hadoop

Configuration By default, each Hadoop Core server loads its configuration from several files.  These files are located in $HADOOP_HOME/conf  Usually identical copies of these files are maintained on every machine in the cluster.

Outline Progress Report MapReduce Programming Hadoop Cluster Overview HBase Overview Q & A

HBase The Hadoop scalable distributed database.  Resembles Google BigTable.  Not a relational database.  Resides in HDFS. Master & RegionServer  Master  For bootstrapping and RegionServer recovery.  Assigns regions to RegionServers.  RegionServer  Holds 0 or more regions.  Responsible for data transactions.

Hbase(cont.)

Row, Column, Timestamp A data cell is the intersection of an individual row key and a column.  Cells store an uninterpreted array of bytes.  Cell data is versioned by timestamp.

Row The row key is the primary key of the table.  It can consist of an arbitrary byte array.  Strings, binary data.  Each row key must be distinct.  The table is sorted by row key.  Any mutation of a single row is atomic.

Column/Column Family Columns are grouped into families, which share a common prefix.  Ex: temperature:air and temperature:dew_point.  The prefix has to be a printable string.  The column name itself can also be an arbitrary byte array.  Column family members can be added or dropped dynamically.  Column families must be pre-specified in the table schema.  HBase is in fact column-family-oriented storage.  Members of the same column family are stored together in the file system.
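The row / column-family / timestamp model described above can be sketched as a nested sorted map: rows kept in key order, each column holding its versions newest-first. This is a plain-Java illustration of the data model only, not the HBase API; the class name HBaseModelSketch and its methods are made up.

```java
import java.util.*;

public class HBaseModelSketch {
    // table: row key -> column ("family:qualifier") -> timestamp -> value
    // Rows are kept sorted by key; timestamps are sorted descending so the
    // newest version of a cell comes first.
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> table =
            new TreeMap<>();

    public void put(String row, String column, long ts, String value) {
        table.computeIfAbsent(row, r -> new TreeMap<>())
             .computeIfAbsent(column, c -> new TreeMap<>(Comparator.reverseOrder()))
             .put(ts, value);
    }

    // Returns the newest version of the cell, or null if it does not exist.
    public String get(String row, String column) {
        NavigableMap<String, NavigableMap<Long, String>> columns = table.get(row);
        if (columns == null) return null;
        NavigableMap<Long, String> versions = columns.get(column);
        return (versions == null || versions.isEmpty()) ? null : versions.firstEntry().getValue();
    }
}
```

Putting two versions of temperature:air for the same row and reading it back returns the value with the larger timestamp, mirroring how a get with no explicit timestamp returns the latest cell version.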

Region The table is automatically horizontally partitioned into regions.  That is, a region is a subset of the data rows.  Regions are stored on separate RegionServers.  A region is defined by its first row, its last row, and a randomly generated identifier.  The partitioning is carried out automatically by the master.
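Because the table is sorted by row key and each region is defined by its first row, locating the region (and hence the RegionServer) that holds a given row reduces to a floor lookup in a sorted map of region start keys. The sketch below is a plain-Java illustration under that assumption, not HBase code; RegionSketch and serverFor are made-up names.

```java
import java.util.*;

public class RegionSketch {
    // Regions keyed by their first row; the value is the RegionServer that
    // holds that region. floorEntry finds the region whose start key is the
    // largest one not exceeding the requested row key.
    private final NavigableMap<String, String> regions = new TreeMap<>();

    public RegionSketch(Map<String, String> startRowToServer) {
        regions.putAll(startRowToServer);
    }

    // Returns the server holding the region responsible for rowKey,
    // or null if no region starts at or before it.
    public String serverFor(String rowKey) {
        Map.Entry<String, String> region = regions.floorEntry(rowKey);
        return region == null ? null : region.getValue();
    }
}
```

With regions starting at "" and "m" assigned to two servers, rows before "m" resolve to the first server and rows from "m" onward to the second, which is how row-key-range partitioning distributes load.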

Administrator Scripts Administrator can use the following script files to start or stop server processes.  Can be located in $HBASE_INSTALL/bin  start-hbase.sh / stop-hbase.sh  hbase hbase shell to initial a command line interface. hbase master / hbase regionserver

HBase shell command line Type command help to get information.  create ‘table’, ‘column family1’, ‘column family2’, …  put ‘table’, ‘row’, ‘column’, ‘value’  get ‘table’, ‘row’, {COLUMN=>…}  alter ‘table’, {NAME=>‘...’}  To modify a table schema, we have to disable it first!  scan ‘table’  disable ‘table’  drop ‘table’  To drop a table, we have to disable it first!!  list

A Simple Demonstration Command line operation

Operations Create table (and its schema)  Shell  create ‘table’, ‘cf1’, ‘cf2’, …  create ‘table’, {NAME=>‘cf1’}, {NAME=>‘cf2’}, …  API

HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
HTableDescriptor table = new HTableDescriptor("table");
table.addFamily(new HColumnDescriptor("cf1:"));
table.addFamily(new HColumnDescriptor("cf2:"));
admin.createTable(table);

Operations(cont.) Modify table (and its schema)  Shell  alter ‘table’, {NAME=>’cf’, KEY=>’value’, …}  API  Note that an exception is thrown if the table has not been disabled first.

HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
admin.modifyColumn("table", "cf", new HColumnDescriptor(…));
admin.modifyTable(new HTableDescriptor(…));

Operations(cont.) Write data  Shell  put ‘table’, ‘row’, ‘cf:name’, ‘value’, ts  API

HTable table = new HTable("table");
BatchUpdate update = new BatchUpdate("row");
update.put("cf:name", "value");
table.commit(update);

Operations(cont.) Retrieve data  Shell  get ‘table’, ‘row’, {COLUMN=>’cf:name’, …}  API

HTable table = new HTable("table");
RowResult row = table.getRow("row");
Cell data = table.get("row", "cf:name");

 If we don’t know beforehand which rows to retrieve, we can use a Scanner object instead:

Scanner scanner = table.getScanner("cf:name");

Operations Delete a cell  Shell  delete ‘table’, ‘row’, ‘cf:name’  API

HTable table = new HTable("table");
BatchUpdate update = new BatchUpdate("row");
update.delete("cf:name");
table.commit(update);

Operations(cont.) Enable/Disable a table  Shell  enable/disable ‘table’  API

HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
admin.disableTable("table");
admin.enableTable("table");

Outline Progress Report MapReduce Programming Hadoop Cluster Overview HBase Overview Q & A

Hadoop API  HBase API  Any questions?