Discussion
Zhang Gang, 2012/11/8

Discussion
1. Progress on HBase
2. Cassandra or HBase


HBase Schema Design
- HBase reference guide
- How to design a good HBase schema:
  - row key
  - column family

HBase Schema Design
- Row key:
  - Monotonically increasing keys, such as time-series keys, can cause writes to pile up on a single region.
  - Randomizing the input records so that they are not in sorted order can mitigate this, so it is best to avoid using a timestamp or a sequence as the row key.
  - At present I use startTime (a timestamp) as the row key; in the future I will explore whether there is a better replacement.
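
One common way to randomize an otherwise sequential key is to prefix it with a small bucket number derived from a hash of the key ("salting"). The sketch below only illustrates that idea; it is not the schema used in these slides. The helper name `salted_row_key`, the bucket count of 16, and the `NN-` prefix format are all assumptions made for the example.

```python
import hashlib


def salted_row_key(start_time: int, num_buckets: int = 16) -> bytes:
    """Return the timestamp prefixed with a deterministic bucket number.

    Hypothetical helper: consecutive timestamps hash to different buckets,
    so they no longer sort next to each other.
    """
    raw = str(start_time).encode("ascii")
    # Derive the bucket from a hash of the key so a reader can recompute it.
    bucket = int.from_bytes(hashlib.md5(raw).digest()[:2], "big") % num_buckets
    return f"{bucket:02d}-".encode("ascii") + raw


if __name__ == "__main__":
    # Five consecutive example timestamps (arbitrary values).
    for ts in range(1352332800, 1352332805):
        print(salted_row_key(ts))
```

The trade-off is that a query over a time range can no longer be a single scan: it needs one scan per bucket, with the results merged afterwards.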

HBase Schema Design
- Column family:
  - I was wrong about the schema with two column families.
  - HBase currently does not do well with anything above two or three column families.
  - Try to make do with one column family in your schemas if you can.
  - If you have thousands or even millions of columns, you can consider having more than one column family. We only have 21 columns, so one column family is enough and is the best choice.

HBase Schema Design
- Optimization (minimize row and column sizes):
  - In HBase, every value is stored as a cell that is accompanied by its row key, column name, and timestamp. If the row key and column names are long, they waste a lot of space (see below).
  - Column family: keep the name as short as possible.
  - Row key length: keep it as short as is reasonable while still being useful for the required data access.
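
To make the cost of long names concrete, here is a rough per-cell size estimate. It assumes the classic HBase KeyValue layout (4-byte key length, 4-byte value length, 2-byte row length, 1-byte family length, 8-byte timestamp, 1-byte type, plus the raw bytes of the row key, family, qualifier, and value) and ignores compression, block encoding, and tags. The function name `cell_size` and the sample qualifier `jobId` are hypothetical.

```python
# Fixed bytes per cell under the assumed KeyValue layout:
# key length (4) + value length (4) + row length (2)
# + family length (1) + timestamp (8) + key type (1)
FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1


def cell_size(row_key: bytes, family: bytes, qualifier: bytes, value: bytes) -> int:
    """Estimate the uncompressed size in bytes of one stored cell."""
    return FIXED_OVERHEAD + len(row_key) + len(family) + len(qualifier) + len(value)


if __name__ == "__main__":
    row = b"1352332800"  # a timestamp-style row key, as in the slides
    short = cell_size(row, b"d", b"jobId", b"42")
    long_ = cell_size(row, b"job_details", b"jobId", b"42")
    print("family 'd':", short, "bytes; family 'job_details':", long_, "bytes")
```

Because the row key, family, and qualifier are repeated in every cell, each extra byte in a name is paid once per cell, not once per table.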

Sqoop  Have successfully configured the sqoop in my PC. On farm, have a Exception-- ”access denied for user ‘zhang’, but it seems successfully transfer the data.  Command : – Sqoop import –connect jdbc:

Sqoop  sqoop on my PC: – test: 81,280 records, s – test: 215,500 records, s – test: 1,539,763 records,310s – then:35,427,339 records, s/about 3.43h – the HBase table size: about 35G, compare mysql table(5G), the size is bigger. So design a good schema is very necessary.

Sqoop  sqoop on the farm: – two exceptions: – then found access denied – import: 35,427,339 records,5120s/about 1.39h – hbase-name:’hb_type_job’ – row-key: ’startTime’ – column-family: ’d’  s

Cassandra or HBase

- Review of our requirements:
  - Big data: 5 GB now, growing by 1.5 GB per year, so not very big.
  - High scalability: we want the database we choose to scale well (many candidates have this feature).
  - Write/read: we read more than we write (one of the reasons we chose HBase before).

Cassandra or HBase  Written in: Java  Main point: Best of BigTable and Dynamo  Tunable trade-offs for distribution and replication (N, R, W)  Querying by column, range of keys  BigTable-like features: columns, column families  Has secondary indices  Writes are much faster than reads (!)  Map/reduce possible with Apache Hadoop  All nodes are similar, as opposed to Hadoop/Hbase  Gossip protocol, multi data center, no single point of failure

Cassandra or HBase  C has only one type of nodes, all nodes are similar. H consists of several different types of nodes (Muster/RegionServer).  H must deployed over the HDFS, compare this C is much more simple  Data consistency of C is tunable(N,W,R).  H better support map/reduce  H provides the developer with row locking facilities whereas Cassandra can not. C just use timestamp.  C has better I/O performance and better scalability but not good at range scan.  CAP:C focus on AC and H focus on CP  H has an SQL compatibility interface(Hive),so H support SQL

Cassandra or HBase  The structure of C is simple,deploy and maintenance is simple, compare C(save money, save time),H is much more complex deploy or maintenance.  H maybe more suitable for data warehousing, and large scale data processing and analysis. And C being more suitable for real time transaction processing and the serving of interactive data.

Cassandra or HBase  How do I incorporate my logo to a slide that will apply to all the other slides? – bb  Aa – bb  Aa – On

Cassandra or HBase  the possibility we start to explore Cassandra – more simple than Hadoop HBase. – written by Java.(same as HBase) – pycassa:It is a python client library for Apache Cassandra.  problem: seem doesn’t have a ready- made tool for transfer the data from mysql to Cassandra.
