HIVE Data Warehousing & Analytics on Hadoop Facebook Data Team.

Slides:



Advertisements
Similar presentations
Introduction to cloud computing
Advertisements

MAP REDUCE PROGRAMMING Dr G Sudha Sadasivam. Map - reduce sort/merge based distributed processing Best for batch- oriented processing Sort/merge is primitive.
CS525: Special Topics in DBs Large-Scale Data Management MapReduce High-Level Langauges Spring 2013 WPI, Mohamed Eltabakh 1.
Overview of MapReduce and Hadoop
© Hortonworks Inc Daniel Dai Thejas Nair Page 1 Making Pig Fly Optimizing Data Processing on Hadoop.
Parallel Computing MapReduce Examples Parallel Efficiency Assignment
Hive - A Warehousing Solution Over a Map-Reduce Framework.
Reynold Xin Shark: Hive (SQL) on Spark. Stage 0: Map-Shuffle-Reduce Mapper(row) { fields = row.split("\t") emit(fields[0], fields[1]); } Reducer(key,
Introduction to Hive Liyin Tang
Hive: A data warehouse on Hadoop
Hadoop/Hive General Introduction Open-Source Solution for Huge Data Sets Zheng Shao Hadoop Committer - Apache Software Foundation 11/23/2008.
HIVE Data Warehousing & Analytics on Hadoop Joydeep Sen Sarma, Ashish Thusoo Facebook Data Team.
Apache Hadoop and Hive Dhruba Borthakur Apache Hadoop Developer
CS525: Big Data Analytics MapReduce Languages Fall 2013 Elke A. Rundensteiner 1.
A warehouse solution over map-reduce framework Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff.
Raghav Ayyamani. Copyright Ellis Horowitz, Why Another Data Warehousing System? Problem : Data, data and more data Several TBs of data everyday.
Hive – A Warehousing Solution Over a Map-Reduce Framework Presented by: Atul Bohara Feb 18, 2014.
The Hadoop Distributed File System, by Dhyuba Borthakur and Related Work Presented by Mohit Goenka.
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc
Hadoop: The Definitive Guide Chap. 8 MapReduce Features
Hive: A data warehouse on Hadoop Based on Facebook Team’s paperon Facebook Team’s paper 8/18/20151.
HADOOP ADMIN: Session -2
Hadoop & Cheetah. Key words Cluster  data center – Lots of machines thousands Node  a server in a data center – Commodity device fails very easily Slot.
Hadoop Team: Role of Hadoop in the IDEAL Project ●Jose Cadena ●Chengyuan Wen ●Mengsu Chen CS5604 Spring 2015 Instructor: Dr. Edward Fox.
Facebook (stylized facebook) is a Social Networking System and website launched in February 2004, operated and privately owned by Facebook, Inc. As.
Hive : A Petabyte Scale Data Warehouse Using Hadoop
CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.
HBase A column-centered database 1. Overview An Apache project Influenced by Google’s BigTable Built on Hadoop ▫A distributed file system ▫Supports Map-Reduce.
Cloud Computing Other High-level parallel processing languages Keke Chen.
Whirlwind tour of Hadoop Inspired by Google's GFS Clusters from systems Batch Processing High Throughput Partition-able problems Fault Tolerance.
Presented by CH.Anusha.  Apache Hadoop framework  HDFS and MapReduce  Hadoop distributed file system  JobTracker and TaskTracker  Apache Hadoop NextGen.
MapReduce – An overview Medha Atre (May 7, 2008) Dept of Computer Science Rensselaer Polytechnic Institute.
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
Introduction to Hadoop and HDFS
Hive Facebook 2009.
Whirlwind Tour of Hadoop Edward Capriolo Rev 2. Whirlwind tour of Hadoop Inspired by Google's GFS Clusters from systems Batch Processing High.
Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters Hung-chih Yang(Yahoo!), Ali Dasdan(Yahoo!), Ruey-Lung Hsiao(UCLA), D. Stott Parker(UCLA)
Hive – A Warehousing Solution Over a MapReduce Framework Bingbing Liu
Hive – SQL on top of Hadoop
MapReduce Kristof Bamps Wouter Deroey. Outline Problem overview MapReduce o overview o implementation o refinements o conclusion.
SLIDE 1IS 257 – Fall 2014 NewSQL and VoltDB University of California, Berkeley School of Information IS 257: Database Management.
Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!
Hive. What is Hive? Data warehousing layer on top of Hadoop – table abstractions SQL-like language (HiveQL) for “batch” data processing SQL is translated.
CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.
Nov 2006 Google released the paper on BigTable.
Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies
Apache PIG rev Tools for Data Analysis with Hadoop Hadoop HDFS MapReduce Pig Statistical Software Hive.
Scalable data access with Impala Zbigniew Baranowski Maciej Grzybek Daniel Lanza Garcia Kacper Surdy.
Learn. Hadoop Online training course is designed to enhance your knowledge and skills to become a successful Hadoop developer and In-depth knowledge of.
1 Copyright © 2008, Oracle. All rights reserved. Repository Basics.
Hive Big data for CSci 4707 students! Eric Atherton and Henry Hoang.
Hadoop/Hive/PIG Open-Source Solutions for Huge Data Sets.
1 Gaurav Kohli Xebia Breaking with DBMS and Dating with Relational Hbase.
Image taken from: slideshare
”Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters” Published In SIGMOD '07 By Yahoo! Senthil Nathan N IIT Bombay.
HIVE A Warehousing Solution Over a MapReduce Framework
Mail call Us: / / Hadoop Training Sathya technologies is one of the best Software Training Institute.
Hadoop.
INTRODUCTION TO PIG, HIVE, HBASE and ZOOKEEPER
A Warehousing Solution Over a Map-Reduce Framework
Hive Mr. Sriram
Introduction to MapReduce and Hadoop
Hadoop EcoSystem B.Ramamurthy.
Rekha Singhal, Amol Khanapurkar, TCS Mumbai.
MapReduce Computing Paradigm Basics Fall 2013 Elke A. Rundensteiner
Ministry of Higher Education
Server & Tools Business
Charles Tappert Seidenberg School of CSIS, Pace University
Cloud Computing for Data Analysis Pig|Hive|Hbase|Zookeeper
Copyright © JanBask Training. All rights reserved Get Started with Hadoop Hive HiveQL Languages.
Presentation transcript:

HIVE Data Warehousing & Analytics on Hadoop Facebook Data Team

Why Another Data Warehousing System?  Problem: Data, data and more data –200GB per day in March 2008 –2+TB(compressed) raw data per day today  The Hadoop Experiment –Much superior to availability and scalability of commercial DBs –Efficiency not that great, but throw more hardware –Partial Availability/resilience/scale more important than ACID  Problem: Programmability and Metadata –Map-reduce hard to program (users know sql/bash/python) –Need to publish data in well known schemas  Solution: HIVE

What is HIVE?  A system for querying and managing structured data built on top of Hadoop –Uses Map-Reduce for execution –HDFS for storage – but any system that implements Hadoop FS API  Key Building Principles: –Structured data with rich data types (structs, lists and maps) –Directly query data from different formats (text/binary) and file formats (Flat/Sequence) –SQL as a familiar programming tool and for standard analytics –Allow embedded scripts for extensibility and for non standard applications –Rich MetaData to allow data discovery and for optimization

Data Warehousing at Facebook Today Web ServersScribe Servers Filers Hive on Hadoop Cluster Oracle RAC Federated MySQL

Hive/Hadoop Facebook  Types of Applications: –Summarization  Eg: Daily/Weekly aggregations of impression/click counts  Complex measures of user engagement –Ad hoc Analysis  Eg: how many group admins broken down by state/country –Data Mining (Assembling training data)  Eg: User Engagement as a function of user attributes –Spam Detection  Anomalous patterns in UGC  Application api usage patterns –Ad Optimization –Too many to count..

Hadoop Facebook  Data statistics: –Total Data:180TB (mostly compressed) –Net Data added/day:2+TB (compressed)  6TB of uncompressed source logs  4TB of uncompressed dimension data reloaded daily  Usage statistics: –3200 jobs/day with 800K tasks(map-reduce tasks)/day –55TB of compressed data scanned daily –15TB of compressed output data written to hdfs –80 MM compute minutes/day

HIVE: Components HDFS Hive CLI DDL QueriesBrowsing Map Reduce MetaStore Thrift API SerDe ThriftJuteJSON.. Execution Hive QL Parser Planner Mgmt. Web UI

Data Model Logical Partitioning Hash Partitioning Schema Library clicks HDFSMetaStore / hive/clicks /hive/clicks/ds= /hive/clicks/ds= /0 … Tables #Buckets=32 Bucketing Info Partitioning Cols

Dealing with Structured Data  Type system –Primitive types –Recursively build up using Composition/Maps/Lists  ObjectInspector interface for user-defined types –To recursively list schema –To recursively access fields within a row object  Generic (De)Serialization Interface (SerDe)  Serialization families implement interface –Thrift DDL based SerDe –Delimited text based SerDe –You can write your own SerDe (XML, JSON …)

MetaStore  Stores Table/Partition properties: –Table schema and SerDe library –Table Location on HDFS –Logical Partitioning keys and types –Partition level metadata –Other information  Thrift API –Current clients in Php (Web Interface), Python interface to Hive, Java (Query Engine and CLI)  Metadata stored in any SQL backend  Future –Statistics –Schema Evolution

Hive Query Language  Basic SQL –From clause subquery –ANSI JOIN (equi-join only) –Multi-table Insert –Multi group-by –Sampling –Objects traversal  Extensibility –Pluggable Map-reduce scripts using TRANSFORM

Running Custom Map/Reduce Scripts FROM ( FROM pv_users SELECT TRANSFORM(pv_users.userid, pv_users.date) USING 'map_script' AS(dt, uid) CLUSTER BY(dt)) map INSERT INTO TABLE pv_users_reduced SELECT TRANSFORM(map.dt, map.uid) USING 'reduce_script' AS (date, count);

Machine 2 Machine 1 (Simplified) Map Reduce Review Local Map Global Shuffle Local Sort Local Reduce

Hive QL – Join SQL: INSERT INTO TABLE pv_users SELECT pv.pageid, u.age FROM page_view pv JOIN user u ON (pv.userid = u.userid); pageiduseridtime 11119:08: :08: :08:14 useridagegender 11125female 22232male pageidage X = page_view user pv_users

Hive QL – Join in Map Reduce keyvalue pageiduseridtime 11119:08: :08: :08:14 useridagegender 11125female 22232male page_view user pv_users keyvalue Map keyvalue keyvalue Shuffle Sort pagei d age pageidage 132 Reduce

Joins Outer Joins INSERT INTO TABLE pv_users SELECT pv.*, u.gender, u.age FROM page_view pv FULL OUTER JOIN user u ON (pv.userid = u.id) WHERE pv.date = ;

Join To Map Reduce  Only Equality Joins with conjunctions supported  Future –Pruning of values send from map to reduce on the basis of projections –Make Cartesian product more memory efficient –Map side joins  Hash Joins if one of the tables is very small  Exploit pre-sorted data by doing map-side merge join

Hive Optimizations – Merge Sequential Map Reduce Jobs  SQL: –FROM (a join b on a.key = b.key) join c on a.key = c.key SELECT … keyavbv keyav 1111 A Map Reduce keybv 1222 B keycv 1333 C AB Map Reduce keyavbvcv ABC

Hive QL – Group By SELECT pageid, age, count(1) FROM pv_users GROUP BY pageid, age; pageidage pv_users pageidagecount

Hive QL – Group By in Map Reduce pageidage pv_users pageidagecount pageidage Map keyvalue 1 1 keyvalue 1 1 keyvalue 1 1 keyvalue 1 1 Shuffle Sort pageidagecount 2252 Reduce

Hive QL – Group By with Distinct SELECT pageid, COUNT(DISTINCT userid) FROM page_view GROUP BY pageid pageiduseridtime 11119:08: :08: :08: :08:20 page_view pageidcount_distinct_userid 12 21

Hive QL – Group By with Distinct in Map Reduce page_view pageidcount Shuffle and Sort pageidcount 11 Reduce pageiduseridtime 11119:08: :08:13 pageiduseridtime 12229:08: :08:20 keyv keyv Map Reduce pageidcount 12 pageidcount 21

Group by Future optimizations  Map side partial aggregations –Hash Based aggregates –Exploit pre-sorted data for distinct counts  Partial aggregations in Combiner  Be smarter about how to avoid multiple stage  Exploit table/column statistics for deciding strategy

Inserts into Files, Tables and Local Files FROM pv_users INSERT INTO TABLE pv_gender_sum SELECT pv_users.gender, count_distinct(pv_users.userid) GROUP BY(pv_users.gender) INSERT INTO DIRECTORY ‘/user/facebook/tmp/pv_age_sum.dir’ SELECT pv_users.age, count_distinct(pv_users.userid) GROUP BY(pv_users.age) INSERT INTO LOCAL DIRECTORY ‘/home/me/pv_age_sum.dir’ FIELDS TERMINATED BY ‘,’ LINES TERMINATED BY \013 SELECT pv_users.age, count_distinct(pv_users.userid) GROUP BY(pv_users.age);

Future Work  Cost-based optimization  Multiple interfaces (JDBC…)  Performance Comparisons with similar work (PIG)  SQL Compliance (order by, nested queries…)  Integration with BI tools  Data Compression –Columnar storage schemes –Exploit lazy/functional Hive field retrieval interfaces  Better data locality –Co-locate hash partitions on same rack –Exploit high intra-rack bandwidth for merge joins

Hive Performance  full table aggregate (not grouped)  Input data size: 1,407,867,660 (32 files)  count in mapper and 2 map-reduce jobs for sum –time taken 30 seconds –Test cluster: 10 nodes from ( from test t select transform (t.userid) as (cnt) using myCount' ) mout select sum(mout.cnt);

Hadoop Facebook  QOS/Isolation: –Big jobs can hog the cluster –JobTracker memory as limited resource –Limit memory impact of runaway tasks  Fair Scheduler (Matei)  Protection –What if a software bug corrupts the NameNode transaction log/image?  HDFS SnapShots (Dhruba)  Data Archival –Not all data is hot and needs colocation with Compute  HDFS Symlinks (Dhruba) Data Archival  Performance –Really hard to understand what bottlenecks are

Conclusion  Available as a contrib project in hadoop - -Checkout src/contrib/hive from trunk (works against 0.19 onwards)  Latest distributions (including for hadoop-0.17) at:  People: –Suresh Anthony –Zheng Shao –Prasad Chakka –Pete Wyckoff –Namit Jain –Raghu Murthy –Joydeep Sen Sarma –Ashish Thusoo