Hive – A Warehousing Solution Over a MapReduce Framework, Bingbing Liu, 2009-12-12

1 Hive – A Warehousing Solution Over a MapReduce Framework
Bingbing Liu, 2009-12-12

2 Outline
- Introduction
- Data Model
- Architecture
- HiveQL

3 What is Hive?
A system for managing and querying structured data, built on top of Hadoop:
- Map-Reduce for execution
- HDFS for storage
- Metadata on raw files
Key building principles:
- SQL as a familiar data warehousing tool
- Extensibility: types, functions, formats, scripts
- Scalability and performance

4 Hive/Hadoop Usage @ Facebook
Types of applications:
- Reporting, e.g. daily/weekly aggregations of impression/click counts; complex measures of user engagement
- Ad hoc analysis, e.g. how many group admins, broken down by state/country
- Data mining (assembling training data), e.g. user engagement as a function of user attributes
- Spam detection: anomalous patterns for site integrity; application API usage patterns
- Ad optimization
- Too many to count...
Scale: 700 terabytes of data, 5000 queries/day, more than 100 users

5 Data Warehousing at Facebook Today
[Diagram components: Web Servers, Scribe Servers, Filers, Hive on Hadoop Cluster, Oracle RAC, Federated MySQL]

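7 Data Model
How data is organized in Hive:
- Tables: conceptually similar to a table in an RDBMS; in storage, each table corresponds to an HDFS directory.
- Partitions: each table can have one or more partitions, which determine how data is distributed across subdirectories.
- Buckets: within each partition, data is assigned to buckets based on a hash of a column; each bucket is a file.
Example: partition the data by column ds:
CREATE TABLE sc (sno int) PARTITIONED BY (ds string)
For rows with ds=2009-12-08, the partition's subdirectory in storage is /sc/ds=2009-12-08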
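The directory-per-partition and file-per-bucket layout on this slide can be illustrated with a small Python sketch. The function names partition_dir and bucket_for are hypothetical, and plain modulo stands in for the hash function, which the slide does not specify.

```python
def partition_dir(table_dir: str, part_col: str, part_value: str) -> str:
    """Build the subdirectory path that holds one partition of a table."""
    return f"{table_dir.rstrip('/')}/{part_col}={part_value}"


def bucket_for(value: int, num_buckets: int) -> int:
    """Assign a row to a bucket from a hash of its bucketing column.

    Assumption: modulo on the integer value stands in for the real hash.
    """
    return value % num_buckets


# The slide's example: table sc partitioned by ds.
print(partition_dir("/sc", "ds", "2009-12-08"))
# Spread sno values 1..4 over two hypothetical buckets.
print({sno: bucket_for(sno, 2) for sno in (1, 2, 3, 4)})
```

Rows that share a partition value land in the same subdirectory, and within it the bucket number decides which file a row belongs to.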

8 Data Model
[Diagram labels: tables student, course, sc; Logical Partitioning; Hash Partitioning]
HDFS paths: /hive/sc, /hive/sc/ds=2009-12-08, /hive/sc/ds=2009-12-08/sc.txt …
MetaStore (Metastore DB) contents: Tables, Data Location, Bucketing Info, Partitioning Cols

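9 Metastore
Stored locally or in a traditional RDBMS (not in HDFS).
- Database: the namespace for all tables; the default is "default".
- Table: holds the list of columns and their types, plus storage and serialization/deserialization (SerDe) information.
- Storage: covers the underlying data location, the data format (type), and bucket information.
- Partition: each partition can have its own columns, SerDe information, and storage information.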
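The metadata objects listed on this slide can be pictured as a few nested records. The Python sketch below is only an illustration: the class and field names, the "TextFile" and "text" values, and the rule that None means "same as the table" are all assumptions, not the Metastore's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Column = Tuple[str, str]  # (column name, type name)


@dataclass
class Storage:
    location: str         # where the data lives in the underlying store
    data_format: str      # data format (type)
    num_buckets: int = 0  # bucket information


@dataclass
class Partition:
    values: Dict[str, str]                  # e.g. {"ds": "2009-12-08"}
    storage: Storage                        # a partition has its own storage info
    columns: Optional[List[Column]] = None  # assumption: None = same as the table
    serde: Optional[str] = None             # assumption: None = same as the table


@dataclass
class Table:
    name: str
    columns: List[Column]
    storage: Storage
    serde: str
    database: str = "default"               # default namespace for tables
    partitions: List[Partition] = field(default_factory=list)


# Hypothetical entry for the sc table used in this deck.
sc = Table(
    name="sc",
    columns=[("sno", "int"), ("cno", "int"), ("grade", "int")],
    storage=Storage(location="/hive/sc", data_format="TextFile"),
    serde="text",
)
sc.partitions.append(
    Partition(
        values={"ds": "2009-12-08"},
        storage=Storage(location="/hive/sc/ds=2009-12-08", data_format="TextFile"),
    )
)
print(sc.database, sc.name, sc.partitions[0].storage.location)
```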

10 Architecture
[Diagram components: Hive CLI (DDL, Queries, Browsing), Web UI, Thrift API, Parser, Planner, Optimizer, Execution, MetaStore (DB), SerDe (Thrift, Jute, JSON, ..), Map Reduce, HDFS]

11 HiveQL: Hive Query Language
Supports:
- Select, project, aggregate, union all
- Loading data into a table from a local or HDFS directory
- Equi-joins
- Subqueries in the FROM clause
- Multi-table insert
- Multi-group-by

12 Example
Student (sno int, sname string, class int);
Course (cno int, cname string);
Sc (sno int, cno int, grade int) partitioned by (ds string);

14 The traditional row-level form, Insert into table test (1, 1, 1);, is not supported.

15 HiveQL: Join
SQL:
INSERT OVERWRITE TABLE test
SELECT t1.sname, t2.cno
FROM student t1 JOIN sc t2 ON (t1.sno = t2.sno);

student:
Sno  Sname  Class
1    Wang   1
2    Zhang  1
3    Zhou   2
4    Chen   2

sc:
Sno  Cno  Grade
1    1    90
1    2    80
2    1    79
2    2    80

student X sc = test:
sname  cno
Wang   1
Wang   2
Zhang  1
Zhang  2

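16 HiveQL: Join in Map Reduce
Inputs:
student (Sno, Sname, Class): (1, Wang, 1), (2, Zhang, 1), (3, Zhou, 2), (4, Chen, 2)
sc (Sno, Cno, Grade): (1, 1, 90), (1, 2, 80), (2, 1, 79), (2, 2, 80)
Map: emits key/value pairs. student yields keys 1, 2, 3, 4; sc yields keys 1, 1, 2, 2.
Shuffle/Sort: pairs are grouped by key: key 1 (three pairs), key 2 (three pairs), then keys 3 and 4.
Reduce: writes test (Sname, Cno): (Wang, 1), (Wang, 2) from key 1; (Zhang, 1), (Zhang, 2) from key 2.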
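The map, shuffle/sort, and reduce steps of this join can be simulated in a few lines of Python. This is a minimal sketch of a reduce-side join over the slide's sample rows, not Hive's implementation; the "student"/"sc" source tags and the function names are assumptions added for illustration.

```python
from collections import defaultdict

student = [(1, "Wang", 1), (2, "Zhang", 1), (3, "Zhou", 2), (4, "Chen", 2)]
sc = [(1, 1, 90), (1, 2, 80), (2, 1, 79), (2, 2, 80)]


def map_phase():
    """Emit (sno, tagged value) pairs; the tag records the source table."""
    for sno, sname, _class in student:
        yield sno, ("student", sname)
    for sno, cno, _grade in sc:
        yield sno, ("sc", cno)


def shuffle(pairs):
    """Group values by key and order the groups by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups):
    """For each key, pair every student name with every course number."""
    out = []
    for _sno, values in groups.items():
        names = [v for tag, v in values if tag == "student"]
        cnos = [v for tag, v in values if tag == "sc"]
        out.extend((sname, cno) for sname in names for cno in cnos)
    return out


test_rows = reduce_phase(shuffle(map_phase()))
print(test_rows)
# [('Wang', 1), ('Wang', 2), ('Zhang', 1), ('Zhang', 2)]
```

Keys 3 and 4 reach a reducer with a student row but no sc row, so they contribute nothing to the inner join.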

17 Query plan
Map:
TableScanOperator    Table: student    [sno int, sname string, class int]
ReduceSinkOperator   Partition cols: col[0]    [0 int, 1 string, 2 int]
TableScanOperator    Table: sc    [sno int, cno int, grade int]
ReduceSinkOperator   Partition cols: col[0]    [0 int, 1 int, 2 int]
Reduce:
JoinOperator         Predicate: col[0,0] = col[1,0]    [0 int, 1 string, 2 int, 3 int, 4 int, 5 int]
SelectOperator       Expressions: [col[1], col[4]]    [0 string, 1 int]
FileOutputOperator   Table: test    [0 string, 1 int]

19 HiveQL: Group By
SELECT student.class, count(1) FROM student GROUP BY student.class;

student:
Sno  Sname  Class
1    Wang   1
2    Zhang  1
3    Zhou   2
4    Chen   2

Result:
class  count
1      2
2      2

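20 HiveQL: Group By in Map Reduce
Input (labeled pv_users on the slide), split across two mappers:
Split 1: (1, Wang, 1), (2, Zhang, 1)
Split 2: (3, Zhou, 2), (4, Chen, 2)
Map: emits one (class, 1) pair per row. Split 1 emits (1, 1), (1, 1); split 2 emits (2, 1), (2, 1).
Shuffle/Sort: pairs are grouped by key. Key 1 receives (1, 1), (1, 1); key 2 receives (2, 1), (2, 1).
Reduce: writes (class, count): (1, 2) and (2, 2).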
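The same three phases can be simulated in Python for the count query. This is a minimal sketch over the slide's sample rows; the two-way split and the function names are assumptions for illustration, not Hive's code.

```python
from collections import defaultdict

# The student table, split across two hypothetical mappers.
splits = [
    [(1, "Wang", 1), (2, "Zhang", 1)],
    [(3, "Zhou", 2), (4, "Chen", 2)],
]


def map_phase(rows):
    """Emit (class, 1) for every input row."""
    for _sno, _sname, cls in rows:
        yield cls, 1


def shuffle(pairs):
    """Group the emitted values by key and order the groups by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups):
    """Sum the ones for each class to get count(1)."""
    return [(cls, sum(values)) for cls, values in groups.items()]


# Run each mapper on its split, then shuffle and reduce the combined output.
mapped = [pair for rows in splits for pair in map_phase(rows)]
result = reduce_phase(shuffle(mapped))
print(result)
# [(1, 2), (2, 2)]
```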

21 Query plan
TableScanOperator    Table: student    [sno int, sname string, class int]
ReduceSinkOperator   Partition cols: col[2]    [0 int, 1 string, 2 int]
GroupByOperator      Aggregations: [count[2]]    Keys: [col[2]]    [0 int, 1 bigint]
FileOutputOperator   Table: tmp1    [0 int, 1 bigint]
TableScanOperator    Table: tmp1    [0 int, 1 bigint]
ReduceSinkOperator   Partition cols: col[0]    [0 int, 1 bigint]
SelectOperator       Expressions: [col[0], col[1]]    [0 int, 1 bigint]
Note: the aggregation key. What if the query is GROUP BY sno, class? 0?

23 Multi group by
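The slide gives only a title, so the following Python sketch rests on an assumption: that multi group by means computing several GROUP BY aggregations over the same input in a single scan. The choice of sum(grade) grouped by sno and by cno, and the function name multi_group_by, are hypothetical.

```python
from collections import Counter

sc = [(1, 1, 90), (1, 2, 80), (2, 1, 79), (2, 2, 80)]


def multi_group_by(rows):
    """Compute sum(grade) grouped by sno and by cno in one pass over rows."""
    by_sno = Counter()
    by_cno = Counter()
    for sno, cno, grade in rows:  # single scan feeds both aggregations
        by_sno[sno] += grade
        by_cno[cno] += grade
    return dict(by_sno), dict(by_cno)


grade_by_sno, grade_by_cno = multi_group_by(sc)
print(grade_by_sno)  # {1: 170, 2: 159}
print(grade_by_cno)  # {1: 169, 2: 160}
```

Reading the input once and updating both aggregates avoids a second scan of the table, which is the saving such a feature would aim for.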
