Presentation is loading. Please wait.

Presentation is loading. Please wait.

Hive 实战 数据平台及产品部 少杰. Agenda 简介 Hive QL Hive 扩展 SQL vs HQL.

Similar presentations


Presentation on theme: "Hive 实战 数据平台及产品部 少杰. Agenda 简介 Hive QL Hive 扩展 SQL vs HQL."— Presentation transcript:

1 Hive 实战 数据平台及产品部 少杰

2 Agenda 简介 Hive QL Hive 扩展 SQL vs HQL

3 简介 分布式计算 MapReduce 编程模型 Hadoop Hive

4 简介 Hive 系统结构

5 简介 数据流 (in taobao) – 数据源: weblog/db/… – 数据同步: jdbcdump – 报表计算 / 预处理 /ETL : Hive – 数据入库: dbloader

6 Hive QL 数据类型 – Primitive int / bigint / smallint / tinyint boolean double / float string – Array – Map – Struct – No precision / length config – No date / datetime type

7 Hive QL DDL – create table – CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name [(col_name data_type,...)] [PARTITIONED BY (col_name data_type,...)] [ [ROW FORMAT row_format] [STORED AS file_format] | [ WITH SERDEPROPERTIES (...) ] ] [LOCATION hdfs_path]

8 Hive QL DDL – create table example CREATE TABLE page_view( viewTime INT, userid BIGINT, page_url STRING, referrer_url STRING, ip STRING COMMENT 'IP Address of the User‘ ) COMMENT 'This is the page view table' PARTITIONED BY(dt STRING, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' STORED AS SEQUENCEFILE;

9 Hive QL DML – load data – LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2...)]

10 Hive QL DML – insert – INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2...)] select_statement1 FROM from_statement – FROM from_statement INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2...)] select_statement1 [INSERT OVERWRITE TABLE tablename2 [PARTITION...] select_statement2]... – INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2]...) select_statement FROM from_statement – (HDFS) 不支持 UPDATE !

11 Hive QL DML/DDL – add partition – ALTER TABLE table_name ADD PARTITION (partcol1=val1, partcol2=val2...) [LOCATION 'filepath' ]

12 Hive QL Query - select – SELECT [ALL | DISTINCT] select_expr, select_expr,... FROM table_reference [WHERE where_condition] [GROUP BY col_list] [ CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list] ] [LIMIT number] – 不支持 exist in 子查询

13 Hive QL Query - join – join_table: table_reference JOIN table_factor [join_condition] | table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition – table_reference: table_factor | join_table – table_factor: tbl_name [alias] | table_subquery alias | ( table_references ) – join_condition: ON equality_expression ( AND equality_expression )* equality_expression: expression = expression – 等值 Join – 合并 Join 的原则 – NULL 值处理

14 Hive QL Query - subqueries – SELECT... FROM (subquery) name... – select_statement UNION ALL select_statement UNION ALL select_statement...

15 Hive 扩展 UDFs – 类别 UDF - 1:1 UDAF – N:1 (UDTF) – Implement UDF extends UDF / GenericUDF implement evaluate() function – Implement UDAF extends UDAF / GenericUDAF implement – iterate – merge – terminatePartial – terminate

16 Hive 扩展 Transform – FROM ( FROM src MAP expression (',' expression)* USING 'my_map_script' ( AS colName (',' colName)* )? ( clusterBy? | distributeBy? sortBy? ) src_alias ) REDUCE expression (',' expression)* USING 'my_reduce_script' ( AS colName (',' colName)* )?

17 Hive vs SQL 语义 – 无关系约束(第一范式?) – 不支持 exist in 子查询 – 只支持等值 Join 数据类型

18 Hive 优化器 Partition Pruning (ppr) 分区裁减 where(pt=‘’) Predicate Push down (ppd) Column Pruning (cp) Mapjoin transformer

19 Hive 优化 数据偏斜 – MapJoin 缺点: 1 内存 2 小表数 *MAP 数是否太大 – Group by (distinct) skew 内存优化 – 驱动表 : 优化内存,将大表作为驱动表即 a join b b 为驱动表 I/O 优化 – Map aggregation MR 任务合并 – multi-insert 节省两次 m/r 的扫描 – multi-groupby – multi-distinct


Download ppt "Hive 实战 数据平台及产品部 少杰. Agenda 简介 Hive QL Hive 扩展 SQL vs HQL."

Similar presentations


Ads by Google