Data storing and data access

Plan
- Basic Java API for HBase – demo
- Bulk data loading
- Hands-on – distributed storage for user files
- SQL on NoSQL
- Summary

Basic Java API for HBase
import org.apache.hadoop.hbase.*;

Data modification operations
Storing data:
- Put – inserts or updates one or more cells
- Append – appends to an existing cell's value
- Bulk loading – writes the data directly to store files (HFiles)
Reading data:
- Get – single-row lookup (one or many cells)
- Scan – scans a range of rows
Deleting data:
- Delete – removes cells, rows, or column families
- Truncate – removes all rows from a table

Adding a row with Java API
1. Create a configuration: Configuration config = HBaseConfiguration.create();
2. Establish a connection: Connection connection = ConnectionFactory.createConnection(config);
3. Open the table: Table table = connection.getTable(TableName.valueOf(table_name));
4. Create a Put object: Put p = new Put(key);
5. Set values for columns: p.addColumn(family, col_name, value); …
6. Push the data to the table: table.put(p);
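Put together, a minimal runnable sketch could look as follows (the "users" table, the "main" family and the cell values are illustrative assumptions, borrowed from the demo that follows):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class PutExample {
    public static void main(String[] args) throws Exception {
        Configuration config = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(config);
             Table table = connection.getTable(TableName.valueOf("users"))) {
            Put p = new Put(Bytes.toBytes("id0001"));  // row key
            p.addColumn(Bytes.toBytes("main"), Bytes.toBytes("first_name"), Bytes.toBytes("Zbigniew"));
            p.addColumn(Bytes.toBytes("main"), Bytes.toBytes("last_name"), Bytes.toBytes("Baranowski"));
            table.put(p);                              // inserts or updates the cells
        }
    }
}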

Getting data with Java API
1. Open a connection and instantiate a table object
2. Create a Get object: Get g = new Get(key);
3. (optional) Specify a certain family or column only: g.addColumn(family_name, col_name); // or g.addFamily(family_name);
4. Get the data from the table: Result result = table.get(g);
5. Extract values from the result object: byte[] value = result.getValue(family, col_name); // …
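Continuing the sketch above (same assumed table and column names):

Get g = new Get(Bytes.toBytes("id0001"));
g.addColumn(Bytes.toBytes("main"), Bytes.toBytes("last_name"));  // fetch a single column
Result result = table.get(g);
byte[] value = result.getValue(Bytes.toBytes("main"), Bytes.toBytes("last_name"));
System.out.println(Bytes.toString(value));                       // null check omitted for brevity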

Scanning the data with Java API
1. Open a connection and instantiate a table object
2. Create a Scan object: Scan s = new Scan(start_row, stop_row);
3. (optional) Set columns to be retrieved: s.addColumn(family, col_name);
4. Get a result scanner object: ResultScanner scanner = table.getScanner(s);
5. Iterate through the results: for (Result row : scanner) { // do something with the row }
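As a sketch, again with illustrative keys and column names:

Scan s = new Scan(Bytes.toBytes("id0001"), Bytes.toBytes("id0100"));  // stop row is exclusive
s.addColumn(Bytes.toBytes("main"), Bytes.toBytes("last_name"));
try (ResultScanner scanner = table.getScanner(s)) {                   // close the scanner when done
    for (Result row : scanner) {
        System.out.println(Bytes.toString(row.getRow()) + " -> "
                + Bytes.toString(row.getValue(Bytes.toBytes("main"), Bytes.toBytes("last_name"))));
    }
}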

Filtering scan results
- Filtering does not prevent the servers from reading the data set – it only reduces network utilization!
- There are many filters available, for column names, values etc.
- …and they can be combined
scan.setFilter(new ValueFilter(CompareFilter.CompareOp.GREATER_OR_EQUAL, new BinaryComparator(Bytes.toBytes(1500))));
scan.setFilter(new PageFilter(25));
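Note that repeated setFilter calls replace one another; combining filters goes through a FilterList. A minimal sketch:

FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);  // logical AND
filters.addFilter(new ValueFilter(CompareFilter.CompareOp.GREATER_OR_EQUAL,
        new BinaryComparator(Bytes.toBytes(1500))));
filters.addFilter(new PageFilter(25));                                   // at most 25 rows per scanner
scan.setFilter(filters);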

Demo
Let's store persons' data in HBase. Description of a person:
- id
- first_name
- last_name
- date of birth
- profession
- …?
Additional requirement: fast record lookup by last name.

Demo – source data in CSV
,Zbigniew,Baranowski,M, ,Poland,IT,CERN
,Kacper,Surdy,M, ,Poland,IT,CERN
,Michel,Jackson,M, ,USA,Music,None
,Barack,Obama,M, ,USA,President,USA
,Andrzej,Duda,M, ,Poland,President,Poland
,Ewa,Kopacz,F, ,Poland,Prime Minister,Poland
,Rolf,Heuer,M, ,Germany,DG,CERN
,Fabiola,Gianotti,F, ,Italy,Particle Physicist,CERN
,Lionel,Messi,M, ,Argentina,Football Player,FC Barcelona

Demo – designing
- Generate a new id when inserting a person
  - has to be a unique sequence of incremented numbers
  - incrementing has to be an atomic operation (see the counter sketch below)
  - the most recent id value has to be stored (in a table)
- Row key = id?
  - maybe row key = "last_name+id"?
  - let's keep: row key = id
- Fast last_name lookups
  - additional indexing table
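Atomic id generation maps directly onto HBase's atomic counters. A minimal sketch, assuming the "counters" table keeps one counter row per user table with an illustrative "cnt:value" column:

Table counters = connection.getTable(TableName.valueOf("counters"));
long newId = counters.incrementColumnValue(
        Bytes.toBytes("users"),   // row key = name of the table being loaded
        Bytes.toBytes("cnt"),     // column family (illustrative)
        Bytes.toBytes("value"),   // column qualifier (illustrative)
        1L);                      // atomically adds 1 and returns the new value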

Demo – tables
- users – with the users' data
  - row_key = userID
- counters – for userID generation
  - row_key = main_table_name
- usersIndex – for indexing the users table
  - row_key = last_name+userID?
  - row_key = column_name+value+userID

Demo – Java classes
- UsersLoader – loads the data
  - generates a userID from the "counters" table
  - loads the users' data into the "users" table
  - updates the "usersIndex" table
- UsersScanner – performs range scans
  - scans the "usersIndex" table
  - ranges provided by the caller
  - gets the details of the matching records from the "users" table

Hands-on
Get the scripts:
wget cern.ch/zbaranow/hbase.zip
unzip hbase.zip
cd hbase/part1
Preview: UsersLoader.java, UsersScanner.java
Create the tables:
hbase shell -n tables.txt
Compile and run:
javac -cp `hbase classpath` *.java
java -cp `hbase classpath` UsersLoader users.csv 2>/dev/null
java -cp `hbase classpath` UsersScanner last_name Baranowski Baranowskj 2>/dev/null

Schema design considerations

Key values
- The row key is the most important aspect of the design
  - fast data reading vs. fast data storing
- Fast data access (range scans)
  - keep in mind the right order of the row key parts: "username+timestamp" vs. "timestamp+username"
  - for fast retrieval of recent data it is better to insert new rows into the first regions of the table (example: key = timestamp)
- Fast data storing
  - distribute rows across regions, by salting or hashing the key (see the sketch below)
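A common way to distribute sequential keys across regions is salting. A minimal sketch; the bucket count is an illustrative assumption and should match the table's region splits:

static final int BUCKETS = 16;

static byte[] saltedKey(long id) {
    // prefix a monotonically increasing id with a salt bucket so that
    // consecutive inserts land in different regions
    int salt = (int) (id % BUCKETS);
    return Bytes.toBytes(String.format("%02d-%d", salt, id));
}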

Tables
Two options:
- Wide – large number of columns
- Tall – large number of rows

Wide layout:
Key | F1:COL1 | F1:COL2 | F2:COL3 | F2:COL4
r1  | r1v1    | r1v2    | r1v3    | r1v4
r2  | r2v1    | r2v2    | r2v3    | r2v4
r3  | r3v1    | r3v2    | r3v3    | r3v4

Tall layout (the same data, spread across Region 1 – Region 4):
Key     | F1:V
r1_col1 | r1v1
r1_col2 | r1v2
r1_col3 | r1v3
r1_col4 | r1v4
r2_col1 | r2v1
r2_col2 | r2v2
r2_col3 | r2v3
r2_col4 | r2v4
r3_col1 | r3v1

Bulk data loading

Bulk loading
Why?
- for loading big data sets already available on HDFS
- faster – writes the data directly to HBase store files (HFiles)
- no footprint on region servers
How?
1. Load the data into HDFS
2. Generate HFiles with the data using MapReduce (write your own job or use importtsv, which has some limitations)
3. Embed the generated files into HBase

Bulk load – demo
1. Create a target table
2. Load the CSV file to HDFS
3. Run ImportTsv
4. Run LoadIncrementalHFiles
All commands in: bulkLoading.txt
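For reference, the two MapReduce-driven steps typically look like the commands below (the table name, column mapping and HDFS paths are illustrative assumptions; the exact commands are in bulkLoading.txt):

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.separator=, \
  -Dimporttsv.columns=HBASE_ROW_KEY,main:first_name,main:last_name \
  -Dimporttsv.bulk.output=/tmp/hfiles users /tmp/users.csv
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/hfiles users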

Part 2: Distributed storage (hands-on)

Hands-on: distributed storage
Let's imagine we need to provide a backend storage system for a large-scale application, e.g. a mail service or a cloud drive. We want the storage to be:
- distributed
- content-addressed
In the following hands-on we'll see how HBase can do this.

Distributed storage: insert client
The application will be able to upload a file from a local file system and save a reference to it in the 'users' table. A file will be referenced by its SHA-1 fingerprint.
General steps (sketched below):
- read the file and calculate its fingerprint
- check for file existence
- save the file in the 'files' table if it does not exist yet
- add a reference in the 'users' table, in the 'media' column family
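A minimal sketch of the fingerprint-and-store logic (the 'data:content' column and the variable names are illustrative assumptions):

import java.nio.file.*;
import java.security.MessageDigest;

byte[] content = Files.readAllBytes(Paths.get(localPath));
byte[] fingerprint = MessageDigest.getInstance("SHA-1").digest(content);

Table files = connection.getTable(TableName.valueOf("files"));
Put p = new Put(fingerprint);
p.addColumn(Bytes.toBytes("data"), Bytes.toBytes("content"), content);
// store only if no cell exists yet for this fingerprint:
// identical files are deduplicated (content-addressed storage)
files.checkAndPut(fingerprint, Bytes.toBytes("data"), Bytes.toBytes("content"), null, p);

Table users = connection.getTable(TableName.valueOf("users"));
Put ref = new Put(Bytes.toBytes(userId));
ref.addColumn(Bytes.toBytes("media"), Bytes.toBytes(fileName), fingerprint);
users.put(ref);   // reference the file from the user's row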

Distributed storage: download client
The application will be able to download a file given a user ID and a file (media) name.
General steps (sketched below):
- retrieve the fingerprint from the 'users' table
- get the file data from the 'files' table
- save the data to the local file system
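The read path, sketched under the same naming assumptions:

Get userGet = new Get(Bytes.toBytes(userId));
userGet.addColumn(Bytes.toBytes("media"), Bytes.toBytes(fileName));
byte[] fingerprint = users.get(userGet).getValue(Bytes.toBytes("media"), Bytes.toBytes(fileName));

Result file = files.get(new Get(fingerprint));
byte[] content = file.getValue(Bytes.toBytes("data"), Bytes.toBytes("content"));
Files.write(Paths.get(localPath), content);   // save to the local file system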

Distributed storage: exercise location
Get to the source files:
cd ../part2
Fill in the TODOs – use the docs and the previous examples for support.
Compile with:
javac -cp `hbase classpath` InsertFile.java
javac -cp `hbase classpath` GetMedia.java

SQL on HBase

Running SQL on HBase
- From Hive or Impala
- An HBase table is mapped to an external table
- Some DML is supported:
  - insert (but not overwrite)
  - updates are achieved by duplicating a row with an insert statement

Use cases for SQL on HBase
- Data warehouses
  - fact tables: big data scans -> Impala + Parquet
  - dimension tables: random lookups -> HBase
- Read-write storage
  - metadata
  - counters

How to?
Create an external table with Hive:
- provide column names and types (the key column should always be a string)
- STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
- WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,main:first_name,main:last_name,…")
- TBLPROPERTIES ("hbase.table.name" = "users");
Try it out!
hive -f ./part2/SQLonHBase.txt
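Assembled into a complete statement, the mapping might look like this (the Hive table name and the shortened column list are illustrative; the demo's actual DDL is in SQLonHBase.txt):

CREATE EXTERNAL TABLE users_hbase (
  key        STRING,   -- the row key, always a string
  first_name STRING,
  last_name  STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,main:first_name,main:last_name")
TBLPROPERTIES ("hbase.table.name" = "users");

-- then query it like any other table
SELECT last_name, first_name FROM users_hbase WHERE key = 'id0001';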

Summary

What was not covered
- Writing coprocessors – stored procedures
- HBase table permissions
- Filtering of data scanner results
- Using MapReduce for data storing and retrieving
- Bulk data loading with custom MapReduce
- Using different APIs – Thrift

Summary
- HBase is a key-value, wide-column store
  - horizontal (regions) + vertical (column families) partitioning
  - row key values are indexed within regions
  - type-free – data is stored in byte arrays
  - tables are semi-structured
- Fast random data access by key
- Not for massively parallel data processing!
- Stored data can be modified (updated, deleted)

Other similar NoSQL engines
- Apache Cassandra
- MongoDB
- Apache Accumulo (on Hadoop)
- Hypertable (on Hadoop)
- HyperDex
- BerkeleyDB / Oracle NoSQL

Announcement: Hadoop users forum
Why?
- Exchange knowledge and experience about:
  - the technology itself
  - current successful projects
- Service requirements
Who?
- Everyone who is interested in Hadoop (and not only)
How?
- e-group:
When?
- Every 2-4 weeks, starting from the 7th of October