HBase and Hive at StumbleUpon


HBase and Hive at StumbleUpon
Jean-Daniel Cryans, DB Engineer at StumbleUpon, HBase Committer
@jdcryans, jdcryans@apache.org

Highlights
- Why Hive and HBase?
- HBase refresher
- Hive refresher
- Integration
- Hive @ StumbleUpon
- Data flows
- Use cases

HBase Refresher
Apache HBase in a few words: "HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable"
Used for:
- Powering websites/products, such as StumbleUpon and Facebook's Messages
- Storing data that's used as a sink or a source for analytical jobs (usually MapReduce)
Main features:
- Horizontal scalability
- Machine failure tolerance
- Row-level atomic operations, including compare-and-swap ops like incrementing counters
- Augmented key-value schemas: the user can group columns into families, which are configured independently (replication scope, compression, caching priority, and so on)
- Multiple clients, such as its native Java library, Thrift, and REST
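To make the "augmented key-value schema" concrete, here is a hypothetical sketch (not HBase's actual API, and `MiniHBaseTable` is an invented name) that models a table as row key -> family -> qualifier -> value, with families fixed up front and an increment helper standing in for HBase's row-level atomic counter:

```python
# Toy model of HBase's data model: nested maps keyed by row, family, qualifier.
class MiniHBaseTable:
    def __init__(self, families):
        self.families = set(families)   # families are declared at table creation
        self.rows = {}                  # row key -> {family: {qualifier: value}}

    def put(self, row, family, qualifier, value):
        if family not in self.families:
            raise KeyError("unknown column family: %s" % family)
        self.rows.setdefault(row, {}).setdefault(family, {})[qualifier] = value

    def get(self, row, family, qualifier):
        return self.rows.get(row, {}).get(family, {}).get(qualifier)

    def increment(self, row, family, qualifier, amount=1):
        # In real HBase this is an atomic row-level operation on the region server;
        # here it is just read-modify-write for illustration.
        new_value = (self.get(row, family, qualifier) or 0) + amount
        self.put(row, family, qualifier, new_value)
        return new_value

table = MiniHBaseTable(["d"])
table.put("row1", "d", "fullname", "JD")
views = table.increment("row1", "d", "views")   # first increment yields 1
```

Note how qualifiers are free-form per row while families are fixed, which is what lets HBase attach per-family settings like compression and caching priority.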

Hive Refresher
Apache Hive in a few words: "A data warehouse infrastructure built on top of Apache Hadoop"
Used for:
- Ad-hoc querying and analysis of large data sets without having to learn MapReduce
Main features:
- SQL-like query language called QL
- Built-in user-defined functions (UDFs) to manipulate dates, strings, and other data types, plus data-mining functions
- Plug-in capabilities for custom mappers, reducers, and UDFs
- Support for different storage types, such as plain text, RCFiles, HBase, and others
- Multiple clients, such as a shell, JDBC, and Thrift
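The point of QL is that Hive compiles a SQL statement into MapReduce jobs for you. A minimal sketch of that translation, for something like `SELECT domain, COUNT(*) FROM visits GROUP BY domain` (the `visits`/`domain` names are made up for illustration):

```python
# Toy map/reduce pipeline roughly mirroring what Hive generates for a GROUP BY count.
from itertools import groupby
from operator import itemgetter

def map_phase(rows):
    for row in rows:
        yield (row["domain"], 1)                  # emit (group key, 1) per input row

def reduce_phase(pairs):
    pairs = sorted(pairs, key=itemgetter(0))      # stand-in for the shuffle/sort
    return {key: sum(count for _, count in group)
            for key, group in groupby(pairs, key=itemgetter(0))}

visits = [{"domain": "a.com"}, {"domain": "b.com"}, {"domain": "a.com"}]
counts = reduce_phase(map_phase(visits))
```

The analyst writes one line of QL; the mapper emission, shuffle, and reducer aggregation above are what Hive spares them from writing by hand.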

Integration
Reasons to use Hive on HBase:
- A lot of data is already sitting in HBase due to its use in a real-time environment, but it is never used for analysis
- To give people who don't code (business analysts) access to data in HBase that is usually only queried through MapReduce
- When a more flexible storage solution is needed, so that rows can be updated live by either a Hive job or an application and are immediately visible to the other
Reasons not to do it:
- Running SQL queries on HBase to answer live user requests (it's still a MapReduce job)
- Hoping for interoperability with other SQL analytics systems

Integration: how it works
Hive can use tables that already exist in HBase or manage its own, but they all still reside in the same HBase instance. A Hive table definition either points to an existing HBase table or manages that table from Hive.

Integration: how it works
When using an already existing HBase table, defined as EXTERNAL, you can create multiple Hive tables that point to it: one definition can point to some of its columns, another to other columns under different names.

Integration: how it works
Columns are mapped however you want, changing names and giving types. In the example below, a Hive table "persons" maps onto an HBase table "people": a column name was changed, one HBase column isn't used, and a Hive MAP points to a whole family.

  persons (Hive)                  people (HBase)
  name STRING                     d:fullname
  age INT                         d:age
  siblings MAP<string, string>    f:
  (not mapped)                    d:address
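A sketch of how such a mapping could be applied to one HBase row (the function and variable names here are invented for illustration; real Hive does this inside its HBase storage handler):

```python
# Each Hive column is bound to an HBase "family:qualifier"; a bare "family:"
# binds a Hive MAP column to the whole family; ":key" binds to the row key.
mapping = [":key", "d:fullname", "d:age", "f:"]          # hbase.columns.mapping
hive_columns = ["rowkey", "name", "age", "siblings"]

def hbase_row_to_hive(row_key, cells, mapping, hive_columns):
    """cells: {family: {qualifier: value}} for one HBase row."""
    record = {}
    for col, spec in zip(hive_columns, mapping):
        if spec == ":key":
            record[col] = row_key
        elif spec.endswith(":"):                          # whole family -> Hive MAP
            record[col] = dict(cells.get(spec[:-1], {}))
        else:
            family, qualifier = spec.split(":", 1)
            record[col] = cells.get(family, {}).get(qualifier)
    return record

cells = {"d": {"fullname": "JD", "age": "31", "address": "sf"},
         "f": {"sister": "ann"}}
rec = hbase_row_to_hive("row1", cells, mapping, hive_columns)
```

Note that `d:address` never appears in the Hive record: unmapped HBase columns are simply invisible from the Hive side.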

Integration
Drawbacks (that can be fixed with brain juice):
- Binary keys and values (like integers represented on 4 bytes) aren't supported, since Hive prefers string representations (HIVE-1634)
- Compound row keys aren't supported; there's no way to use multiple parts of a key as different "fields"
- This means that concatenated binary row keys, which is what people often use in HBase, are completely unusable
- Filtering is done at the Hive level instead of being pushed down to the region servers
- Partitions aren't supported
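The binary-value drawback is easy to see with a concrete example. An HBase counter stores a number as raw big-endian bytes, while Hive (before the HIVE-1634 fix) expects the textual form of the same number:

```python
# A 4-byte big-endian int, as HBase stores counters, versus the string form
# Hive expects to parse.
import struct

value_as_hbase_counter = struct.pack(">i", 1634)   # raw bytes: b'\x00\x00\x06b'
value_as_hive_expects = b"1634"                    # textual representation

# Both encode the same number...
assert struct.unpack(">i", value_as_hbase_counter)[0] == 1634
assert int(value_as_hive_expects) == 1634
# ...but the byte sequences are entirely different, so a string-oriented
# reader sees garbage instead of "1634".
assert value_as_hbase_counter != value_as_hive_expects
```

Concatenating several such binary fields into one row key compounds the problem, which is why compound binary keys were unusable from stock Hive.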

Hive @ StumbleUpon

Data Flows
Data is being generated all over the place:
- Apache logs
- Application logs
- MySQL clusters
- HBase clusters
We currently use all of that data in Hive, except for the Apache logs

Data Flows: moving application log files
Wild log file -> tail'ed continuously -> parsed into HBase format -> inserted into HBase
HBase -> format transformed -> dumped into HDFS -> read nightly
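The "parses into HBase format" step could look roughly like this sketch (the log layout, field names, and row-key design are invented for illustration; the talk doesn't specify them):

```python
# Turn one tail'ed, tab-separated log line into the (row key, family,
# qualifier, value) cells an HBase insert needs.
def log_line_to_cells(line):
    timestamp, userid, event = line.rstrip("\n").split("\t")
    row_key = "%s-%s" % (userid, timestamp)      # hypothetical key design
    return [(row_key, "d", "event", event)]

cells = log_line_to_cells("1339999999\tuser42\tstumble\n")
```

A continuous tailer would run every new line through such a function and batch the resulting cells into HBase puts.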

Data Flows: moving MySQL data
MySQL -> dumped nightly with CSV import -> HDFS
MySQL -> Tungsten replicator -> parsed into HBase format -> inserted into HBase

Data Flows: moving HBase data
Prod HBase -> read in parallel by a CopyTable MR job -> imported in parallel into the MR HBase cluster
* HBase replication currently only works for a single slave cluster; in our case HBase replicates to a backup cluster.

Use Cases
Front-end engineers:
- They need some statistics regarding their latest product
Research engineers:
- Ad-hoc queries on user data to validate some assumptions
- Generating statistics about recommendation quality
Business analysts:
- Statistics on growth and activity
- Effectiveness of advertiser campaigns
- Users' behavior versus past activities to determine, for example, why certain groups react better to email communications
- Ad-hoc queries on the stumbling behavior of slices of the user base

Use Cases
Using a simple table in HBase:

CREATE EXTERNAL TABLE blocked_users(
  userid INT,
  blockee INT,
  blocker INT,
  created BIGINT)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,f:blockee,f:blocker,f:created")
TBLPROPERTIES("hbase.table.name" = "m2h_repl-userdb.stumble.blocked_users");

HBase is a special case here: it has a unique row key, mapped with :key. Not all the columns in the HBase table need to be mapped.

Use Cases
Using a complicated table in HBase:

CREATE EXTERNAL TABLE ratings_hbase(
  userid INT,
  created BIGINT,
  urlid INT,
  rating INT,
  topic INT,
  modified BIGINT)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key#b@0,:key#b@1,:key#b@2,default:rating#b,default:topic#b,default:modified#b")
TBLPROPERTIES("hbase.table.name" = "ratings_by_userid");

#b means binary, @ means position in the composite key (a StumbleUpon-specific hack)
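A sketch of the composite binary row key that `:key#b@0,:key#b@1,:key#b@2` implies: userid, created, and urlid packed back-to-back as big-endian binary fields. The exact byte widths below (4-byte ints, 8-byte long) are assumptions inferred from the Hive column types, not something the talk states:

```python
# Pack/unpack a hypothetical (userid INT, created BIGINT, urlid INT)
# composite HBase row key as fixed-width big-endian fields.
import struct

def pack_ratings_key(userid, created, urlid):
    return struct.pack(">iqi", userid, created, urlid)   # 4 + 8 + 4 = 16 bytes

def unpack_ratings_key(key):
    return struct.unpack(">iqi", key)                    # (userid, created, urlid)

key = pack_ratings_key(42, 1339999999000, 7)
```

Fixed-width fields are what make the `@position` trick workable: each "field" of the key lives at a known byte offset, so the storage handler can slice it back out.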

Use Cases
Some metrics: a SELECT over the full stumbles table (currently 1.2 TB after LZO compression) used to take over 2 hours with 20 machines; today it takes 12 minutes with 80 newer machines.

Wrapping up
- Hive is a good complement to HBase, giving ad-hoc querying capabilities without having to write a new MR job each time. (All you need to know is SQL.)
- Even though it enables relational queries, it is not meant for live systems. (It is not a MySQL replacement.)
- The Hive/HBase integration is functional, but it still lacks some features before it can be called ready. (Unless you want to get your hands dirty.)

In Conclusion… ?

Have a job yet? We’re hiring!
- Analytics Engineer
- Database Administrator
- Site Reliability Engineer
- Senior Software Engineer
(and more) http://www.stumbleupon.com/jobs/