HIVE Data Warehousing & Analytics on Hadoop Joydeep Sen Sarma, Ashish Thusoo Facebook Data Team.

Slides:



Advertisements
Similar presentations
HBase and Hive at StumbleUpon
Advertisements

Introduction to cloud computing
Michael Pizzo Software Architect Data Programmability Microsoft Corporation.
CS525: Special Topics in DBs Large-Scale Data Management MapReduce High-Level Langauges Spring 2013 WPI, Mohamed Eltabakh 1.
Parallel Computing MapReduce Examples Parallel Efficiency Assignment
Hive - A Warehousing Solution Over a Map-Reduce Framework.
HadoopDB Inneke Ponet.  Introduction  Technologies for data analysis  HadoopDB  Desired properties  Layers of HadoopDB  HadoopDB Components.
Introduction to Hive Liyin Tang
Hive: A data warehouse on Hadoop
Hadoop/Hive General Introduction Open-Source Solution for Huge Data Sets Zheng Shao Hadoop Committer - Apache Software Foundation 11/23/2008.
1 Foundations of Software Design Lecture 27: Java Database Programming Marti Hearst Fall 2002.
CS525: Big Data Analytics MapReduce Languages Fall 2013 Elke A. Rundensteiner 1.
A warehouse solution over map-reduce framework Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff.
Raghav Ayyamani. Copyright Ellis Horowitz, Why Another Data Warehousing System? Problem : Data, data and more data Several TBs of data everyday.
Hive – A Warehousing Solution Over a Map-Reduce Framework Presented by: Atul Bohara Feb 18, 2014.
Hive: A data warehouse on Hadoop Based on Facebook Team’s paperon Facebook Team’s paper 8/18/20151.
HADOOP ADMIN: Session -2
Facebook (stylized facebook) is a Social Networking System and website launched in February 2004, operated and privately owned by Facebook, Inc. As.
Pig: Making Hadoop Easy Wednesday, June 10, 2009 Santa Clara Marriott.
Hive : A Petabyte Scale Data Warehouse Using Hadoop
HBase A column-centered database 1. Overview An Apache project Influenced by Google’s BigTable Built on Hadoop ▫A distributed file system ▫Supports Map-Reduce.
Cloud Computing Other High-level parallel processing languages Keke Chen.
Whirlwind tour of Hadoop Inspired by Google's GFS Clusters from systems Batch Processing High Throughput Partition-able problems Fault Tolerance.
Presented by CH.Anusha.  Apache Hadoop framework  HDFS and MapReduce  Hadoop distributed file system  JobTracker and TaskTracker  Apache Hadoop NextGen.
NoSQL continued CMSC 461 Michael Wilson. MongoDB  MongoDB is another NoSQL solution  Provides a bit more structure than a solution like Accumulo  Data.
Hive Facebook 2009.
Whirlwind Tour of Hadoop Edward Capriolo Rev 2. Whirlwind tour of Hadoop Inspired by Google's GFS Clusters from systems Batch Processing High.
Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters Hung-chih Yang(Yahoo!), Ali Dasdan(Yahoo!), Ruey-Lung Hsiao(UCLA), D. Stott Parker(UCLA)
An Introduction to HDInsight June 27 th,
+ Hbase: Hadoop Database B. Ramamurthy. + Motivation-0 Think about the goal of a typical application today and the data characteristics Application trend:
Hive – A Warehousing Solution Over a MapReduce Framework Bingbing Liu
A NoSQL Database - Hive Dania Abed Rabbou.
Hive – SQL on top of Hadoop
SLIDE 1IS 257 – Fall 2014 NewSQL and VoltDB University of California, Berkeley School of Information IS 257: Database Management.
Carey Probst Technical Director Technology Business Unit - OLAP Oracle Corporation.
Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!
Introduction to Hbase. Agenda  What is Hbase  About RDBMS  Overview of Hbase  Why Hbase instead of RDBMS  Architecture of Hbase  Hbase interface.
Hive. What is Hive? Data warehousing layer on top of Hadoop – table abstractions SQL-like language (HiveQL) for “batch” data processing SQL is translated.
Impala. Impala: Goals General-purpose SQL query engine for Hadoop High performance – C++ implementation – runtime code generation (using LLVM) – direct.
Apache PIG rev Tools for Data Analysis with Hadoop Hadoop HDFS MapReduce Pig Statistical Software Hive.
HIVE Data Warehousing & Analytics on Hadoop Facebook Data Team.
Scalable data access with Impala Zbigniew Baranowski Maciej Grzybek Daniel Lanza Garcia Kacper Surdy.
HADOOP Course Content By Mr. Kalyan, 7+ Years of Realtime Exp. M.Tech, IIT Kharagpur, Gold Medalist. Introduction to Big Data and Hadoop Big Data › What.
1 Copyright © 2008, Oracle. All rights reserved. Repository Basics.
Hive Big data for CSci 4707 students! Eric Atherton and Henry Hoang.
Hadoop/Hive/PIG Open-Source Solutions for Huge Data Sets.
1 Gaurav Kohli Xebia Breaking with DBMS and Dating with Relational Hbase.
HIVE – A PETABYTE SCALE DATA WAREHOUSE USING HADOOP -Abhilash Veeragouni Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning.
HIVE A Warehousing Solution Over a MapReduce Framework
Mail call Us: / / Hadoop Training Sathya technologies is one of the best Software Training Institute.
Pig, Making Hadoop Easy Alan F. Gates Yahoo!.
Hadoop.
Hive - A Warehousing Solution Over a Map-Reduce Framework
INTRODUCTION TO PIG, HIVE, HBASE and ZOOKEEPER
A Warehousing Solution Over a Map-Reduce Framework
Hive Mr. Sriram
Hadoop EcoSystem B.Ramamurthy.
Rekha Singhal, Amol Khanapurkar, TCS Mumbai.
Introduction to Spark.
Introduction to PIG, HIVE, HBASE & ZOOKEEPER
Server & Tools Business
Introduction to Apache
Pig - Hive - HBase - Zookeeper
Data Warehousing in the age of Big Data (1)
Charles Tappert Seidenberg School of CSIS, Pace University
Cloud Computing for Data Analysis Pig|Hive|Hbase|Zookeeper
Building a Threat-Analytics Multi-Region Data Lake on AWS
Copyright © JanBask Training. All rights reserved Get Started with Hadoop Hive HiveQL Languages.
Pig Hive HBase Zookeeper
Presentation transcript:

HIVE Data Warehousing & Analytics on Hadoop Joydeep Sen Sarma, Ashish Thusoo Facebook Data Team

Why Another Data Warehousing System?  Problem: Data, data and more data –200GB per day in March 2008 back to 1TB compressed per day today  The Hadoop Experiment  Problem: Map/Reduce is great but every one is not a Map/Reduce expert –I know SQL and I am a python and php expert  So what do we do: HIVE

What is HIVE?  A system for querying and managing structured data built on top of Map/Reduce and Hadoop  We had: –Structured logs with rich data types (structs, lists and maps) –A user base wanting to access this data in the language of their choice –A lot of traditional SQL workloads on this data (filters, joins and aggregations) –Other non SQL workloads

Data Warehousing at Facebook Today Web ServersScribe Servers Filers Hive on Hadoop Cluster Oracle RAC Federated MySQL

HIVE: Components HDFS Hive CLI DDL QueriesBrowsing Map Reduce MetaStore Thrift API SerDe ThriftJuteJSON.. Execution Hive QL Parser Planner Mgmt. Web UI

Data Model Logical Partitioning Hash Partitioning Schema Library clicks HDFSMetaStore / hive/clicks /hive/clicks/ds= /hive/clicks/ds= /0 … Tables #Buckets=32 Bucketing Info Partitioning Cols

Dealing with Structured Data  Type system –Primitive types –Recursively build up using Composition/Maps/Lists  Generic (De)Serialization Interface (SerDe) –To recursively list schema –To recursively access fields within a row object  Serialization families implement interface –Thrift DDL based SerDe –Delimited text based SerDe –You can write your own SerDe  Schema Evolution

MetaStore  Stores Table/Partition properties: –Table schema and SerDe library –Table Location on HDFS –Logical Partitioning keys and types –Other information  Thrift API –Current clients in Php (Web Interface), Python (old CLI), Java (Query Engine and CLI), Perl (Tests)  Metadata can be stored as text files or even in a SQL backend

Hive CLI  DDL: –create table/drop table/rename table –alter table add column  Browsing: –show tables –describe table –cat table  Loading Data  Queries

Hive Query Language  Philosophy –SQL like constructs + Hadoop Streaming  Query Operators in initial version –Projections –Equijoins and Cogroups –Group by –Sampling  Output of these operators can be: –passed to Streaming mappers/reducers –can be stored in another Hive Table –can be output to HDFS files –can be output to local files

Hive Query Language  Package these capabilities into a more formal SQL like query language in next version  Introduce other important constructs: –Ability to stream data thru custom mappers/reducers –Multi table inserts –Multiple group bys –SQL like column expressions and some XPath like expressions –Etc..

Joins FROM page_view pv JOIN user u ON (pv.userid = u.id) INSERT INTO TABLE pv_users SELECT pv.*, u.gender, u.age WHERE pv.date = ; Outer Joins FROM page_view pv FULL OUTER JOIN user u ON (pv.userid = u.id) INSERT INTO TABLE pv_users SELECT pv.*, u.gender, u.age WHERE pv.date = ;

Aggregations and Multi-Table Inserts FROM pv_users INSERT INTO TABLE pv_gender_uu SELECT pv_users.gender, count(DISTINCT pv_users.userid) GROUP BY(pv_users.gender) INSERT INTO TABLE pv_ip_uu SELECT pv_users.ip, count(DISTINCT pv_users.id) GROUP BY(pv_users.ip);

Running Custom Map/Reduce Scripts FROM ( FROM pv_users SELECT TRANSFORM(pv_users.userid, pv_users.date) USING 'map_script' AS(dt, uid) CLUSTER BY(dt)) map INSERT INTO TABLE pv_users_reduced SELECT TRANSFORM(map.dt, map.uid) USING 'reduce_script' AS (date, count);

Inserts into Files, Tables and Local Files FROM pv_users INSERT INTO TABLE pv_gender_sum SELECT pv_users.gender, count_distinct(pv_users.userid) GROUP BY(pv_users.gender) INSERT INTO DIRECTORY ‘/user/facebook/tmp/pv_age_sum.dir’ SELECT pv_users.age, count_distinct(pv_users.userid) GROUP BY(pv_users.age) INSERT INTO LOCAL DIRECTORY ‘/home/me/pv_age_sum.dir’ FIELDS TERMINATED BY ‘,’ LINES TERMINATED BY \013 SELECT pv_users.age, count_distinct(pv_users.userid) GROUP BY(pv_users.age);

Hadoop Facebook  Types of Applications: –Summarization  Eg: Daily/Weekly aggregations of impression/click counts –Ad hoc Analysis  Eg: how many group admins broken down by state/country –Data Mining (Assembling training data)  Eg: User Engagement as a function of user attributes

Hadoop Facebook  Usage statistics: –Total Users: ~140 (about 50% of engineering !) in the last 1 ½ months –Hive Data (compressed): 80 TB total, ~1TB incoming per day –Job statistics:  ~1000 jobs/day  ~100 loader jobs/day

Hadoop Facebook  Some problems: –No Fair Sharing: Big tasks can hog the cluster –No snapshots: What if a software bug corrupts the NameNode transaction log  Solutions: –Simple fair sharing (Matie Zaharia) –Investigating Snapshots (Dhrubha Bortharkur)

Conclusion  JIRA  Soon to be checked into hadoop trunk  Release available in hadoop version 0.19  People: –Suresh Anthony –Zheng Shao –Prasad Chakka –Pete Wyckoff –Namit Jain –Raghu Murthy –Joydeep Sen Sarma –Ashish Thusoo