Hive – A Warehousing Solution Over a MapReduce Framework
Bingbing Liu
2009-12-12

Outline
–Introduction
–Data Model
–Architecture
–HiveQL

What is Hive?
A system for managing and querying structured data built on top of Hadoop
–Map-Reduce for execution
–HDFS for storage
–Metadata on raw files
Key Building Principles:
–SQL as a familiar data warehousing tool
–Extensibility – Types, Functions, Formats, Scripts
–Scalability and Performance

Hive/Hadoop Usage @ Facebook
Types of Applications:
–Reporting
  E.g.: daily/weekly aggregations of impression/click counts
  Complex measures of user engagement
–Ad hoc Analysis
  E.g.: how many group admins, broken down by state/country
–Data Mining (assembling training data)
  E.g.: user engagement as a function of user attributes
–Spam Detection
  Anomalous patterns for site integrity
  Application API usage patterns
–Ad Optimization
–Too many to count..
700 terabytes of data
5,000 queries/day
More than 100 users

Data Warehousing at Facebook Today
[Diagram. Labeled components: Web Servers, Scribe Servers, Filers, Hive on Hadoop Cluster, Oracle RAC, Federated MySQL]

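Data Model
How data is organized in Hive:
–Tables: conceptually similar to a table in an RDBMS; in storage, each table corresponds to an HDFS directory.
–Partitions: each table can have one or more partitions, which determine how data is distributed across subdirectories.
–Buckets: within each partition, rows are assigned to buckets by a hash of a column; each bucket is a file.
Example: to partition the data by column ds:
Create table sc (sno int) partitioned by (ds string)
Rows with ds=2009-12-08 are then stored in the partition subdirectory /sc/ds=2009-12-08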
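The partition layout and bucketing described on this slide can be sketched in a few lines of Python. This is only an illustration: the function names `partition_dir` and `bucket_for` are hypothetical, the bucket count of 4 is made up, and plain modulo stands in for whatever hash function Hive actually applies.

```python
def partition_dir(table_dir: str, partition_col: str, value: str) -> str:
    """Build the subdirectory for one partition, e.g. /sc/ds=2009-12-08."""
    return f"{table_dir}/{partition_col}={value}"


def bucket_for(column_value: int, num_buckets: int) -> int:
    """Pick a bucket for a row from its column value.

    Assumption: modulo on an integer column stands in for the real hash.
    """
    return column_value % num_buckets


# Partition directory for the slide's example table sc, partitioned by ds.
print(partition_dir("/sc", "ds", "2009-12-08"))

# Bucket index for a few sno values, assuming 4 buckets (made-up number).
print([bucket_for(sno, 4) for sno in (1, 2, 5, 8)])
```

Rows that share a hash value always land in the same bucket file, which is what makes bucketing usable as a second, hash-based level of partitioning below the directory level.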

Data Model
[Diagram: table sc as seen in HDFS and in the MetaStore. Labels: Logical Partitioning, Hash Partitioning. HDFS paths: /hive/sc, /hive/sc/ds=2009-12-08, /hive/sc/ds=2009-12-08/sc.txt … Metastore DB contents: Tables (student, course), Data Location, Bucketing Info, Partitioning Cols]

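Metastore
Stored locally or in a traditional RDBMS (not in HDFS).
Database – the namespace for all tables; the default is “default”.
Table – includes the list of columns and their types, storage information, and serialization/deserialization (SerDe) information.
–Storage includes the underlying location of the data, the data format (type), and bucket information.
Partition – each partition can have its own columns, SerDe information, and storage information.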
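As a rough picture of the metadata listed on this slide, the objects can be modeled as plain records. The sketch below is hypothetical: the class and field names are chosen for illustration and are not the Metastore's actual schema, and the values "text" and "placeholder-serde" are placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Storage:
    location: str          # underlying location of the data
    data_format: str       # data format (type); placeholder values below
    num_buckets: int = 0   # bucket information; 0 means "not bucketed" here


@dataclass
class Partition:
    values: Dict[str, str]  # partition column -> value
    storage: Storage        # a partition carries its own storage info
    serde: str              # and its own SerDe info
    columns: List[Tuple[str, str]] = field(default_factory=list)


@dataclass
class Table:
    name: str
    columns: List[Tuple[str, str]]  # (column name, type)
    storage: Storage
    serde: str
    partition_cols: List[Tuple[str, str]] = field(default_factory=list)
    partitions: List[Partition] = field(default_factory=list)


@dataclass
class Database:
    name: str = "default"  # default namespace named on the slide
    tables: Dict[str, Table] = field(default_factory=dict)


# Register the slide's table sc with one partition.
db = Database()
sc = Table(
    name="sc",
    columns=[("sno", "int")],
    storage=Storage("/hive/sc", "text"),
    serde="placeholder-serde",
    partition_cols=[("ds", "string")],
)
sc.partitions.append(
    Partition(
        values={"ds": "2009-12-08"},
        storage=Storage("/hive/sc/ds=2009-12-08", "text"),
        serde="placeholder-serde",
    )
)
db.tables[sc.name] = sc
print(db.name, sc.partitions[0].storage.location)
```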

Architecture
[Diagram. Labeled components: HDFS; Map Reduce; Hive; CLI; Web UI; Thrift API; DDL, Queries, Browsing; MetaStore; DB; SerDe (Thrift, Jute, JSON..); Parser; Planner; Optimizer; Execution]

HiveQL – Hive Query Language
Support:
–Select, project, aggregate, union all
–Load data to table from local or HDFS directory
–Equi-joins
–Subqueries in from clause
–Multi-table Insert
–Multi-group-by

Example
Student (sno int, sname string, class int);
Course (cno int, cname string);
Sc (sno int, cno int, grade int) partitioned by (ds string);

The traditional form, Insert into table test (1, 1, 1);, is not supported.

HiveQL – Join
SQL:
INSERT OVERWRITE TABLE test
SELECT t1.sname, t2.cno
FROM student t1 JOIN sc t2 ON (t1.sno = t2.sno);
student X sc = test
student
Sno  Sname  Class
1    Wang   1
2    Zhang  1
3    Zhou   2
4    Chen   2
sc
Sno  Cno  Grade
1    1    90
1    2    80
2    1    79
2    2    80
test
Sname  Cno
Wang   1
Wang   2
Zhang  1
Zhang  2

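HiveQL – Join in Map Reduce
[Diagram: data flow from student and sc to test]
Map: rows of student and sc are emitted as key/value pairs keyed on sno (student keys 1, 2, 3, 4; sc keys 1, 1, 2, 2).
Shuffle Sort: pairs with the same key are brought together (key 1: three values; key 2: three values; keys 3 and 4 carry a student row only).
Reduce: output rows (Sname, Cno): Wang 1, Wang 2 for key 1; Zhang 1, Zhang 2 for key 2.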
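The map, shuffle/sort, and reduce steps of this join can be imitated in plain Python using the slide's sample rows. This is a single-process sketch of a reduce-side join, not Hive's implementation; the function names and the "student"/"sc" tags on values are assumptions made for illustration.

```python
from collections import defaultdict

# Sample rows from the slides: student(sno, sname, class), sc(sno, cno, grade).
student = [(1, "Wang", 1), (2, "Zhang", 1), (3, "Zhou", 2), (4, "Chen", 2)]
sc = [(1, 1, 90), (1, 2, 80), (2, 1, 79), (2, 2, 80)]


def map_phase():
    """Emit (join key, tagged value); the tag records the source table."""
    for sno, sname, _class in student:
        yield sno, ("student", sname)
    for sno, cno, _grade in sc:
        yield sno, ("sc", cno)


def shuffle_sort(pairs):
    """Group values by key and return the groups in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(groups):
    """Per key, pair every student name with every course number."""
    for _sno, values in groups:
        names = [v for tag, v in values if tag == "student"]
        cnos = [v for tag, v in values if tag == "sc"]
        for sname in names:
            for cno in cnos:
                yield sname, cno


test = list(reduce_phase(shuffle_sort(map_phase())))
print(test)
```

Keys 3 and 4 reach the reducer with a student value but no sc value, so they produce no output rows, which matches the inner-join result shown for the test table.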

Query plan
Map:
TableScanOperator Table:student [sno int,sname string,class int]
ReduceSinkOperator Partition cols:col[0] [0 int,1 string,2 int]
TableScanOperator Table:sc [sno int,cno int,grade int]
ReduceSinkOperator Partition cols:col[0] [0 int,1 int,2 int]
Reduce:
JoinOperator Predicate: col[0,0]=col[1,0] [0 int,1 string,2 int,3 int,4 int,5 int]
SelectOperator Expressions:[col[1],col[4]] [0 string,1 int]
FileOutputOperator Table:test [0 string,1 int]

Hive QL – Group By
SELECT student.class, count(1)
FROM student
GROUP BY student.class;
student
Sno  Sname  Class
1    Wang   1
2    Zhang  1
3    Zhou   2
4    Chen   2
Result
Class  count
1      2
2      2

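Hive QL – Group By in Map Reduce
[Diagram: data flow for the group-by over student, split across two mappers]
Map: the first mapper reads rows (1, Wang, 1) and (2, Zhang, 1) and emits key 1, value 1 twice; the second reads rows (3, Zhou, 2) and (4, Chen, 2) and emits key 2, value 1 twice.
Shuffle Sort: pairs are grouped by key (key 1: values 1, 1; key 2: values 1, 1).
Reduce: the values per key are summed, giving class 1, count 2 and class 2, count 2.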
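The same flow can be traced in plain Python with the slide's sample rows. This is a single-process sketch, not Hive's implementation; the function names are hypothetical, and the two-way split of the student table mirrors the two mappers in the diagram.

```python
from collections import defaultdict

# The student table split across two mappers, as in the diagram.
splits = [
    [(1, "Wang", 1), (2, "Zhang", 1)],
    [(3, "Zhou", 2), (4, "Chen", 2)],
]


def map_phase(rows):
    """Emit (class, 1) for every row, mirroring count(1) grouped by class."""
    for _sno, _sname, cls in rows:
        yield cls, 1


def shuffle_sort(pairs):
    """Group values by key and return the groups in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(groups):
    """Sum the emitted ones per class to get the count."""
    for cls, ones in groups:
        yield cls, sum(ones)


pairs = [pair for split in splits for pair in map_phase(split)]
result = list(reduce_phase(shuffle_sort(pairs)))
print(result)
```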

Query plan
TableScanOperator Table:student [sno int,sname string,class int]
ReduceSinkOperator Partition cols:col[2] [0 int,1 string,2 int]
GroupByOperator Aggregations:[count[2]] Keys:[col[2]] [0 int,1 bigint]
FileOutputOperator Table:tmp1 [0 int, 1 bigint]
TableScanOperator Table:tmp1 [0 int, 1 bigint]
ReduceSinkOperator Partition cols:col[0] [0 int, 1 bigint]
SelectOperator Expressions:[col[0],col[1]] [0 int, 1 bigint]
Question: what is the aggregation key if the query groups by sno, class? 0?

Multi group by
