HBase: A Column-Oriented Database

Overview
An Apache project
Influenced by Google's BigTable
Built on Hadoop
▫A distributed file system (HDFS)
▫Supports MapReduce
Goals
▫Scalability
▫Versions
▫Compression
▫In-memory tables

Architectural issues
The general architecture is a cluster of nodes
A standalone mode is available for a single machine
There is a Java client API
There is a shell, implemented in JRuby

Modeling constructs
Table
▫Has a row key
▫A series of column families
 Each column has a name and a value
Operations
▫Create table
▫Insert a row with a "Put" command
 Only one column at a time (in the shell)
▫Query a table with a "Get" command
 (uses a table name and a row key)
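These constructs can be sketched as a toy in-memory model (plain Python, not the real HBase client API; the class, table, and column names here are all illustrative):

```python
# Toy model of an HBase table: rows keyed by a row key, each holding
# column families of name -> value pairs. Illustrative only.

class ToyTable:
    def __init__(self, name, families):
        self.name = name
        self.families = set(families)      # declared when the table is created
        self.rows = {}                     # row_key -> {family: {column: value}}

    def put(self, row_key, family, column, value):
        if family not in self.families:
            raise KeyError("unknown column family: " + family)
        row = self.rows.setdefault(row_key, {f: {} for f in self.families})
        row[family][column] = value        # one column at a time, as in the shell

    def get(self, row_key):
        # A Get uses only the table and the row key
        return self.rows.get(row_key)

t = ToyTable("users", ["info"])
t.put("row1", "info", "name", "Ada")
t.put("row1", "info", "city", "London")
print(t.get("row1")["info"]["name"])   # -> Ada
```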

Filters
Scan
▫Can get a series of rows between two key values
▫Can apply a filter on such things as column families and timestamps
▫Filters can be pushed down to the server
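A minimal sketch of a range scan with a filter, again as a plain-Python toy rather than the real Scan API:

```python
# Toy scan: walk rows in sorted key order, keep those in [start, stop),
# and apply an optional filter predicate. Illustrative only.

rows = {
    "row1": {"info": {"name": "Ada"}},
    "row2": {"info": {"name": "Bob"}},
    "row3": {"stats": {"visits": 7}},
}

def scan(table, start, stop, row_filter=None):
    for key in sorted(table):              # HBase stores rows sorted by key
        if start <= key < stop:
            row = table[key]
            if row_filter is None or row_filter(key, row):
                yield key, row

# Filter that keeps only rows containing the "info" column family
hits = [k for k, r in scan(rows, "row1", "row3",
                           row_filter=lambda k, r: "info" in r)]
print(hits)   # -> ['row1', 'row2']
```

In real HBase the filter runs on the region server, so non-matching rows never cross the network; here the same predicate simply runs inline.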

Updating
When a column value is written to the database, old values are kept, organized by timestamp
▫Each such versioned value is a cell
You can assign timestamps explicitly
▫Otherwise, the current timestamp is used at insert time
▫A Get returns the most recent version by default
Operations that alter column family structure are expensive
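The versioning behavior can be sketched like this (illustrative Python, not HBase itself):

```python
# Toy versioned cell: each write appends a (timestamp, value) pair;
# a read returns the value with the highest timestamp, as a Get does
# by default. Illustrative only.
import time

class Cell:
    def __init__(self):
        self.versions = []                 # list of (timestamp, value)

    def put(self, value, ts=None):
        if ts is None:
            ts = time.time_ns()            # current timestamp unless given
        self.versions.append((ts, value))

    def get(self):
        # Most recent version wins
        return max(self.versions)[1]

c = Cell()
c.put("old", ts=1)                         # explicit timestamps
c.put("new", ts=2)
print(c.get())   # -> new
```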

Other characteristics
Text compression
Rows are stored in order by key value
A region is a contiguous set of rows
▫Each region is stored on a single region server
▫Regions can be automatically split and merged
Uses write-ahead logging to prevent loss of data on node failures
▫This is similar to journaling in Unix file systems
Supports a master/slave multi-cluster replication strategy
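Because rows are sorted by key, a region is just a contiguous key range, and splitting is a matter of cutting that range in two. A rough sketch under an assumed size threshold (not HBase's real splitting policy):

```python
# Toy region split: a region (a sorted, contiguous run of row keys)
# splits at its midpoint once it exceeds an assumed threshold.
# Illustrative only -- real HBase splits on store-file size.
MAX_ROWS = 4   # assumed threshold for illustration

def maybe_split(region):
    """Return [region], or two contiguous halves if it is too big."""
    keys = sorted(region)
    if len(keys) <= MAX_ROWS:
        return [keys]
    mid = len(keys) // 2
    return [keys[:mid], keys[mid:]]

region = ["row%d" % i for i in range(6)]
left, right = maybe_split(region)
print(left, right)   # two contiguous key ranges
```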

An HBase cluster (diagram not reproduced in the transcript)

Terminology
A region is a subset of the rows of a table
▫Tables are automatically sharded into regions
A master coordinates the slaves
▫Assigns regions
▫Detects region server failures
▫Handles administrative functions
A client reads and writes rows directly to the region servers
A client finds region server addresses in ZooKeeper

Tasks of components
A ZooKeeper cluster provides a coordination service for the HBase cluster
▫Helps clients find the correct server
▫Elects the master
The master allocates regions and performs load balancing
The region servers hold the regions
Hadoop supports MapReduce

Features
Favors consistency over availability
Efficient MapReduce integration
Range-partitioned queries, not based on hashing or other random access
Automatic sharding
Efficient storage of very sparse columns

Some key concepts
De-normalization
Fast random retrieval by row key
A multi-component architecture that leverages existing software tools
Controllable in-memory table selection

Important HBase properties
Strongly consistent reads/writes: HBase is not an "eventually consistent" data store
Automatic sharding: HBase tables are distributed on the cluster via regions, and regions are automatically split and redistributed as your data grows
MapReduce: HBase supports massively parallelized processing via MapReduce

When to use (or not use) HBase
Java client API: HBase provides an easy-to-use Java API for programmatic access
If you have hundreds of millions or billions of rows, HBase is a good candidate
If you only have a few thousand or a few million rows, a traditional RDBMS might be a better choice, because all of your data might wind up on a single node (or two) while the rest of the cluster sits idle

More on using or not using HBase
Make sure you can live without the extra features an RDBMS provides (e.g., typed columns, secondary indexes, transactions, advanced query languages)
Make sure you have enough hardware; even HDFS doesn't do well with fewer than five DataNodes
HBase can run quite well stand-alone on a laptop, but this should be considered a development configuration only

The HBase API
Get a row
Put a row, with a column/value pair
Scan, with a key range and a filter
MapReduce via Hive

High-level MapReduce diagram (not reproduced in the transcript)

A more detailed MapReduce diagram (not reproduced in the transcript)

MapReduce example steps
1. The system takes input from a file system and splits it across separate Map nodes
2. The Map function runs and generates an output for each Map node; in the word-count example, every word is listed and grouped per node
3. This output is a set of intermediate key-value pairs that are moved to Reduce nodes as input
4. The Reduce function runs and generates an output for each Reduce node; in the word-count example, it sums the number of times each word occurs
5. The system aggregates the outputs from each node into a final result
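The five steps above can be sketched as a single-process word count (illustrative Python, not the Hadoop framework):

```python
# Word count as split -> map -> shuffle (group by key) -> reduce -> aggregate.
from collections import defaultdict

def map_fn(chunk):
    # Step 2: emit (word, 1) for every word in this node's chunk
    return [(word, 1) for word in chunk.split()]

def reduce_fn(word, counts):
    # Step 4: sum the counts for one word
    return word, sum(counts)

chunks = ["the cat sat", "the dog sat"]        # step 1: split the input
intermediate = [pair for c in chunks for pair in map_fn(c)]

groups = defaultdict(list)                      # step 3: shuffle to reducers
for word, count in intermediate:
    groups[word].append(count)

result = dict(reduce_fn(w, cs) for w, cs in groups.items())  # steps 4-5
print(result)   # -> {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}
```

In a real cluster the map and reduce calls run on different nodes and the shuffle moves the intermediate pairs over the network; the data flow is the same.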

Diagram of MapReduce (not reproduced in the transcript)