HBase and Bigtable Storage

Presentation transcript:

HBase and Bigtable Storage Xiaoming Gao Judy Qiu Hui Li

Outline
- HBase and Bigtable storage
- HBase use cases
- Hands-on: load a CSV file into an HBase table with MapReduce
- Demo: a search engine system built with MapReduce technologies (Hadoop/HDFS/HBase/Pig)

HBase Introduction
- HBase is an open-source, distributed, sorted map data store modeled after Google's BigTable
- HBase is built on Hadoop: fault tolerance, scalability, batch processing with MapReduce
- HBase uses HDFS for storage
- Data sets can grow to petabytes

HBase Cluster Architecture
- Region: a subset of a table's rows, like a range partition
- Region server: serves data for reads and writes
- Master: coordinates the region servers, assigns regions, and detects region server failures
- Tables are split into regions and served by region servers
- Regions are vertically divided by column families into "stores"
- Stores are saved as files on HDFS

Data Model: A Big Sorted Map
- Not a relational database, and no SQL
- Tables consist of rows, each of which has a primary key (row key)
- Each row can have any number of columns; conceptually a table is:
  SortedMap<RowKey, List<SortedMap<Column, List<(Value, Timestamp)>>>>
- The timestamp is a long value; all data in HBase is stored as byte[]
- Since there are no joins or secondary indexes, designs often use a second table to support lookups by another key
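To make the nested sorted-map view concrete, here is a minimal sketch in plain Java (not the HBase API); the row key "user123", the column "info:name", and the values are made-up examples.

import java.util.NavigableMap;
import java.util.TreeMap;

public class DataModelSketch {
    public static void main(String[] args) {
        // row key -> (column "family:qualifier" -> (timestamp -> value))
        NavigableMap<String, NavigableMap<String, NavigableMap<Long, byte[]>>> table = new TreeMap<>();

        NavigableMap<Long, byte[]> versions = new TreeMap<>();
        versions.put(1342000000000L, "Alice".getBytes());        // older version
        versions.put(1342000100000L, "Alice Smith".getBytes());  // newer version

        NavigableMap<String, NavigableMap<Long, byte[]>> row = new TreeMap<>();
        row.put("info:name", versions);   // column name = family:qualifier

        table.put("user123", row);

        // Reading the latest version of a cell is just nested, sorted map lookups
        byte[] latest = table.get("user123").get("info:name").lastEntry().getValue();
        System.out.println(new String(latest));
    }
}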

HBase vs. RDBMS
- Data layout: RDBMS is row-oriented; HBase is column-family-oriented
- Indexes: RDBMS indexes rows and columns; HBase indexes only the row key
- Hardware requirement: RDBMS needs large arrays of fast, expensive disks; HBase is designed for commodity hardware
- Max data size: RDBMS handles TBs; HBase scales to ~1 PB
- Read/write throughput: RDBMS serves thousands of queries/second; HBase serves millions of queries/second
- Query language: RDBMS uses SQL (joins, grouping); HBase offers Get/Put/Scan
- Ease of use: RDBMS relational data modeling is easy to learn; HBase is a sorted map with a significant learning curve, though communities and tools are growing

When to Use HBase
- Dataset scale: indexing huge numbers of web pages on the internet or genome data; data mining over large social media data sets
- Read/write scale: reads and writes are distributed as tables are distributed across nodes; writes are extremely fast and require no index updates
- Batch analysis: massive, complex SQL-style analytic queries can be executed in parallel via MapReduce jobs

Use Cases
- Facebook analytics: real-time counters of URLs shared and preferred links
- Twitter: 25 TB of messages every month
- Mozilla: stores crash reports, 2.5 million per day

Programming with HBase
- HBase shell: scan, list, create
- Native Java API, e.g. Get(byte[] row, byte[] column, long ts, int version)
- Non-Java clients: Thrift server (Ruby, C++, PHP), REST server
- HBase MapReduce API: hbase.mapreduce.TableMapper, hbase.mapreduce.TableReducer
- High-level interfaces: Pig, Hive
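For reference, below is a minimal sketch of the native Java client API as it existed in the HBase 0.9x era (HTable, Put, Get); the table name "csv2hbase" and column family "f1" match the hands-on exercise, while the row key, qualifier, and value are made up.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class JavaApiSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "csv2hbase");   // table from the hands-on exercise

        // Write one cell: row key, column family, qualifier, value
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("f1"), Bytes.toBytes("q1"), Bytes.toBytes("hello"));
        table.put(put);

        // Read it back
        Get get = new Get(Bytes.toBytes("row1"));
        Result result = table.get(get);
        byte[] value = result.getValue(Bytes.toBytes("f1"), Bytes.toBytes("q1"));
        System.out.println(Bytes.toString(value));

        table.close();
    }
}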

Hands-on HBase MapReduce Programming
HBase MapReduce API imports:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;

Hands-on: load CSV file into HBase table with MapReduce
- CSV stands for comma-separated values
- CSV files are common in many scientific fields, such as flow cytometry in bioinformatics
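The mapper shown later expects each line to hold four comma-separated fields (row key, column family, qualifier, value), so a hypothetical input.csv for this exercise might look like the following made-up lines:

row1,f1,temperature,36.5
row2,f1,temperature,37.1
row3,f1,pressure,101.3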

Hands-on: load CSV file into HBase table with MapReduce
Main entry point of the program:

public static void main(String[] args) throws Exception {
  Configuration conf = HBaseConfiguration.create();
  String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
  if (otherArgs.length != 2) {
    System.err.println("Wrong number of arguments: " + otherArgs.length);
    System.err.println("Usage: <csv file> <hbase table name>");
    System.exit(-1);
  }
  Job job = configureJob(conf, otherArgs);
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}

Hands-on: load CSV file into HBase table with MapReduce
Configure the HBase MapReduce job:

public static Job configureJob(Configuration conf, String[] args) throws IOException {
  Path inputPath = new Path(args[0]);
  String tableName = args[1];
  Job job = new Job(conf, tableName);
  job.setJarByClass(CSV2HBase.class);
  FileInputFormat.setInputPaths(job, inputPath);
  job.setInputFormatClass(TextInputFormat.class);
  job.setMapperClass(CSV2HBase.class);
  TableMapReduceUtil.initTableReducerJob(tableName, null, job);
  job.setNumReduceTasks(0);
  return job;
}

(A later extension slide shows the converse: using an HBase table as a read-only MapReduce source.)

Hands-on: load CSV file into HBase table with MapReduce
The map function:

public void map(LongWritable key, Text line, Context context) throws IOException {
  // Input is a CSV file; each map() call receives a single line, keyed by its offset
  // Each line is comma-delimited: row,family,qualifier,value
  String[] values = line.toString().split(",");
  if (values.length != 4) {
    return;
  }
  byte[] row = Bytes.toBytes(values[0]);
  byte[] family = Bytes.toBytes(values[1]);
  byte[] qualifier = Bytes.toBytes(values[2]);
  byte[] value = Bytes.toBytes(values[3]);
  Put put = new Put(row);
  put.add(family, qualifier, value);
  try {
    context.write(new ImmutableBytesWritable(row), put);
  } catch (InterruptedException e) {
    e.printStackTrace();
  }
  // count and checkpoint are fields of the mapper class, used only for progress reporting
  if (++count % checkpoint == 0) {
    context.setStatus("Emitting Put " + count);
  }
}

Hands-on: steps to load CSV file into HBase table with MapReduce
1. Check the HBase installation in the Ubuntu sandbox (http://salsahpc.indiana.edu/ScienceCloud/virtualbox_appliance_guide.html):
   echo $HBASE_HOME
2. Start the Hadoop and HBase cluster:
   start-all.sh
   start-hbase.sh
3. Create the HBase table with the specified data schema:
   hbase shell
   create 'csv2hbase', 'f1'
4. Compile the program with Ant:
   cd hbasetutorial
   ant
5. Upload input.csv into HDFS:
   hadoop dfs -mkdir input
   hadoop dfs -copyFromLocal input.csv input/input.csv
6. Run the program:
   bin/hadoop jar dist/lib/cglHBaseSummerSchool.jar iu.pti.hbaseapp.CSV2HBase input/input.csv csv2hbase
7. Check the inserted records in the HBase table:
   scan 'csv2hbase'

Hands-on: load CSV file into HBase table with MapReduce

Extension: set HBase table as input
Using TableInputFormat and TableMapReduceUtil to use an HTable as input to a map/reduce job:

public static Job configureJob(Configuration conf, String[] args) throws IOException {
  // tableName and columnFamily are presumably parsed from args elsewhere in the class
  conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(new Scan()));
  conf.set(TableInputFormat.INPUT_TABLE, tableName);
  conf.set("index.tablename", tableName);
  conf.set("index.familyname", columnFamily);
  String[] fields = new String[args.length - 2];
  for (int i = 0; i < fields.length; i++) {
    fields[i] = args[i + 2];
  }
  conf.setStrings("index.fields", fields);
  conf.set("index.familyname", "attributes");
  Job job = new Job(conf, tableName);
  job.setJarByClass(IndexBuilder.class);
  job.setMapperClass(Map.class);
  job.setNumReduceTasks(0);
  job.setInputFormatClass(TableInputFormat.class);
  job.setOutputFormatClass(MultiTableOutputFormat.class);
  return job;
}

Extension: write output to HBase table

public static class Map extends Mapper<ImmutableBytesWritable, Result, ImmutableBytesWritable, Writable> {
  private byte[] family;
  private HashMap<byte[], ImmutableBytesWritable> indexes;

  protected void map(ImmutableBytesWritable rowKey, Result result, Context context)
      throws IOException, InterruptedException {
    for (java.util.Map.Entry<byte[], ImmutableBytesWritable> index : indexes.entrySet()) {
      byte[] qualifier = index.getKey();
      ImmutableBytesWritable tableName = index.getValue();
      byte[] value = result.getValue(family, qualifier);
      if (value != null) {
        Put put = new Put(value);
        put.add(INDEX_COLUMN, INDEX_QUALIFIER, rowKey.get());
        context.write(tableName, put);
      }
    }
  }
}

The mapper writes out to multiple HBase tables; the index values are table names derived from the other fields.

Big Data Challenge
Scales: mega (10^6), giga (10^9), tera (10^12), peta (10^15)
The primary function of data-flow languages and runtimes is the management and manipulation of data. The sample systems include the MapReduce architecture pioneered by Google and its open-source implementation, Hadoop. High-throughput sequencing now produces on the order of 30 petabytes of new gene data.

Search Engine System with MapReduce Technologies
- A search engine system for the summer school, built to show how MapReduce technologies can be used to tackle a big-data challenge
- Uses Hadoop/HDFS/HBase/Pig
- Indexed 656K web pages (540 MB in size) selected from the ClueWeb09 data set
- Calculates ranking values for 2 million web sites

Architecture for SESSS (diagram)
Components: Apache Lucene inverted indexing system; PHP-script web UI on the Apache server on the Salsa portal; Thrift client and Thrift server; Hive/Pig scripts; HBase tables (1. inverted index table, 2. page rank table); ranking system (Pig script) on the Hadoop cluster on FutureGrid.

Demo: Search Engine System for Summer School
- build-index-demo.exe (build the index with HBase)
- pagerank-demo.exe (compute page rank with Pig)
- http://salsahpc.indiana.edu/sesss/index.php

High Level Language: Pig Latin Hui Li Judy Qiu Some material adapted from slides by Adam Kawa the 3rd meeting of WHUG June 21, 2012

What is Pig
- A framework for analyzing large unstructured and semi-structured data on top of Hadoop
- The Pig engine parses and compiles Pig Latin scripts into MapReduce jobs that run on Hadoop
- Pig Latin is a simple but powerful data-flow language, similar to scripting languages
- Writing a Pig Latin job is as simple as writing SQL queries; for complex cases, developers can integrate user-defined functions into Pig statements

Motivation for Using Pig
- Faster development: fewer lines of code (writing MapReduce feels like writing SQL queries)
- Code reuse (Pig library, PiggyBank)
- One test, finding the top 5 most frequent words: 10 lines of Pig Latin vs. 200 lines in Java; 15 minutes in Pig Latin vs. 4 hours in Java
- Pig accelerates the development process; many companies, such as Yahoo and Twitter, use Pig Latin to process large-scale data

Word Count using MapReduce
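The original slide showed the classic Java MapReduce word count for comparison with the Pig version that follows; below is a minimal sketch of that program using the org.apache.hadoop.mapreduce API, reconstructed here rather than copied from the slide.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();
    // Emit (word, 1) for every token in the input line
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Sum the counts for each word
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}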

Pig performance vs. MapReduce
PigMix: Pig vs. MapReduce. Where does Pig stand compared to Java MapReduce in terms of performance? PigMix is a set of queries used to test Pig performance from release to release; it measures the performance gap between direct use of MapReduce and using Pig. Performance has steadily improved across releases, and there have been 7 releases in roughly the last two years, since Pig became part of Apache. In the next version, 0.8, which will be out in a few days, the ratio is around 0.9. The MapReduce queries in PigMix do not include all the optimizations that are present in Pig, because implementing them takes a lot of effort, so not all Pig optimizations are tested in PigMix. One example is Pig's skew join, which enables joining tables where some values of the join key have a very large number of records; a naive join implementation in MapReduce will run out of memory. So PigMix tells only part of the story. http://wiki.apache.org/pig/PigMix

Word Count using Pig

Lines = LOAD 'input/hadoop.log' AS (line: chararray);
Words = FOREACH Lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
Groups = GROUP Words BY word;
Counts = FOREACH Groups GENERATE group, COUNT(Words) AS count;
Results = ORDER Counts BY count DESC;
Top5 = LIMIT Results 5;
STORE Top5 INTO '/output/top5words';

Who Uses Pig, and for What
- 70% of production jobs at Yahoo (tens of thousands per day)
- Twitter, LinkedIn, eBay, AOL, ...
- Used to: process web logs, build user behavior models, process images, build maps of the web, and do research on raw data sets

Pig Tutorial
- Accessing Pig
- Basic Pig knowledge (word count): Pig data types, Pig operations, how to run Pig scripts
- Advanced Pig features (k-means clustering): embedding Pig within Python, user-defined functions

Accessing Pig
Access approaches:
- Batch mode: submit a script directly
- Interactive mode: Grunt, the Pig shell
- PigServer Java class, a JDBC-like interface
Execution modes:
- Local mode: pig -x local
- MapReduce mode: pig -x mapreduce
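As a sketch of the PigServer approach (driving Pig from a Java program), the snippet below is not taken from the slides; it assumes a local file input.txt and runs the same DISTINCT example shown later in the Grunt shell.

import java.util.Iterator;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class PigServerSketch {
    public static void main(String[] args) throws Exception {
        // Run Pig in local mode from Java; use ExecType.MAPREDUCE to submit to a cluster
        PigServer pig = new PigServer(ExecType.LOCAL);
        pig.registerQuery("Lines = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("Unique = DISTINCT Lines;");

        // Iterate over the tuples of the 'Unique' relation
        Iterator<Tuple> it = pig.openIterator("Unique");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}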

Pig Data Types
Concepts: fields, tuples, bags, relations
- A field is a piece of data
- A tuple is an ordered set of fields
- A bag is a collection of tuples
- A relation is a bag
Simple types: int, long, float, double, boolean, null, chararray, bytearray
Complex types:
- Tuple, like a row in a database: (0002576169, Tome, 21, "Male")
- Data bag, like a table or view in a database: {(0002576169, Tome, 21, "Male"), (0002576170, Mike, 20, "Male"), (0002576171, Lucy, 20, "Female"), ...}
- Map: a set of key/value pairs

Pig Operations
- Loading data: LOAD loads input data, e.g. Lines = LOAD 'input/access.log' AS (line: chararray);
- Projection: FOREACH ... GENERATE ... (similar to SELECT) takes a set of expressions and applies them to every record
- Grouping: GROUP collects together records with the same key
- Dump/Store: DUMP displays results on screen, STORE saves results to the file system
- Aggregation: AVG, COUNT, COUNT_STAR, MAX, MIN, SUM
There are more than 20 Pig operations.

How to run Pig Latin scripts
- Local mode: the local host and local file system are used; neither Hadoop nor HDFS is required; useful for prototyping and debugging
- MapReduce mode: runs on a Hadoop cluster and HDFS
- Batch mode: run a script directly
  pig -x local my_pig_script.pig
  pig -x mapreduce my_pig_script.pig
- Interactive mode: use the Pig shell (Grunt) to run statements
  grunt> Lines = LOAD '/input/input.txt' AS (line:chararray);
  grunt> Unique = DISTINCT Lines;
  grunt> DUMP Unique;

Hands-on: Word Count using Pig Latin
cd pigtutorial/pig-hands-on/
tar -xf pig-wordcount.tar
cd pig-wordcount
pig -x local
grunt> Lines = LOAD 'input.txt' AS (line: chararray);
grunt> Words = FOREACH Lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grunt> Groups = GROUP Words BY word;
grunt> counts = FOREACH Groups GENERATE group, COUNT(Words);
grunt> DUMP counts;

Sample: K-means using Pig Latin
K-means is a method of cluster analysis which aims to partition n observations into k clusters, where each observation belongs to the cluster with the nearest mean.
- Assignment step: assign each observation to the cluster with the closest mean
- Update step: recalculate each mean as the centroid of the observations assigned to its cluster
Reference: http://en.wikipedia.org/wiki/K-means_clustering
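Written out, the assignment and update steps for iteration t are (standard k-means notation, not taken from the slides):

S_i^{(t)} = \{\, x_p : \lVert x_p - m_i^{(t)} \rVert \le \lVert x_p - m_j^{(t)} \rVert \ \forall j \,\}, \qquad
m_i^{(t+1)} = \frac{1}{|S_i^{(t)}|} \sum_{x_p \in S_i^{(t)}} x_p

In the tutorial that follows, the observations are one-dimensional gpa values, so the "distance" is simply the absolute difference from each centroid.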

K-means Using Pig Latin

PC = Pig.compile("""
    register udf.jar
    DEFINE find_centroid FindCentroid('$centroids');
    raw = load 'student.txt' as (name:chararray, age:int, gpa:double);
    centroided = foreach raw generate gpa, find_centroid(gpa) as centroid;
    grouped = group centroided by centroid;
    result = foreach grouped generate group, AVG(centroided.gpa);
    store result into 'output';
""")

K-means Using Pig Latin (driver loop in embedded Python)

while iter_num < MAX_ITERATION:
    PCB = PC.bind({'centroids': initial_centroids})
    results = PCB.runSingle()
    iter = results.result("result").iterator()
    centroids = [None] * v   # v is the number of centroids
    distance_move = 0.0
    # get the new centroids of this iteration and compute how far they moved since the last iteration
    for i in range(v):
        tuple = iter.next()
        centroids[i] = float(str(tuple.get(1)))
        distance_move = distance_move + fabs(last_centroids[i] - centroids[i])
    distance_move = distance_move / v
    if distance_move < tolerance:
        converged = True
        break
    ...

Embedding Pig Statements in Python Scripts
- Pig does not support flow-control statements: if/else, while loops, for loops, etc.
- The Pig embedding API can leverage all language features provided by Python, including control flow: loops and exit criteria
- Similar to database embedding APIs, with easier parameter passing
- JavaScript is available as well; the framework is extensible, and any JVM implementation of a language could be integrated

User Defined Functions (UDFs)
What is a UDF?
- A way to do an operation on a field or fields
- Called from within a Pig script
- Currently all done in Java
Why use a UDF?
- You need to do more than grouping or filtering
- Actually, filtering itself is a UDF
- You may be more comfortable in Java land than in SQL/Pig Latin
Example registration (from the k-means script above):
P = Pig.compile("""register udf.jar
DEFINE find_centroid FindCentroid('$centroids'); ...
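For illustration, a Java UDF along the lines of the FindCentroid used above might look like the sketch below. The actual udf.jar from the tutorial is not shown on the slides, so this is an assumed implementation that snaps a gpa value to the nearest of the centroids passed in via the DEFINE statement (colon-separated, e.g. '0.0:1.0:2.0:3.0').

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical UDF: returns the centroid closest to the input gpa value
public class FindCentroid extends EvalFunc<Double> {
    private final double[] centroids;

    public FindCentroid(String centroidList) {
        String[] parts = centroidList.split(":");
        centroids = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            centroids[i] = Double.parseDouble(parts[i]);
        }
    }

    @Override
    public Double exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        double gpa = ((Number) input.get(0)).doubleValue();
        double best = centroids[0];
        for (double c : centroids) {
            if (Math.abs(gpa - c) < Math.abs(gpa - best)) {
                best = c;
            }
        }
        return best;
    }
}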

Hands-on: Run Pig Latin K-means
export PIG_CLASSPATH=/opt/pig/lib/jython-2.5.0.jar
hadoop dfs -copyFromLocal input.txt ./input.txt
pig -x mapreduce kmeans.py
pig -x local kmeans.py

Hands-on: Run Pig Latin K-means (sample output)

2012-07-14 14:51:24,636 [main] INFO org.apache.pig.scripting.BoundScript - Query to run:
register udf.jar
DEFINE find_centroid FindCentroid('0.0:1.0:2.0:3.0');
raw = load 'student.txt' as (name:chararray, age:int, gpa:double);
centroided = foreach raw generate gpa, find_centroid(gpa) as centroid;
grouped = group centroided by centroid;
result = foreach grouped generate group, AVG(centroided.gpa);
store result into 'output';

Input(s): Successfully read 10000 records (219190 bytes) from: "hdfs://iw-ubuntu/user/developer/student.txt"
Output(s): Successfully stored 4 records (134 bytes) in: "hdfs://iw-ubuntu/user/developer/output"
last centroids: [0.371927835052, 1.22406743491, 2.24162171881, 3.40173705722]

References
- http://pig.apache.org (Pig official site)
- http://en.wikipedia.org/wiki/K-means_clustering
- Docs: http://pig.apache.org/docs/r0.9.0
- Papers: http://wiki.apache.org/pig/PigTalksPapers
- http://en.wikipedia.org/wiki/Pig_Latin
- Slides by Adam Kawa, the 3rd meeting of WHUG, June 21, 2012

Questions?

Acknowledgement