05 | Processing Big Data with Hive
Graeme Malcolm | Data Technology Specialist, Content Master
Pete Harris | Learning Product Planner, Microsoft

Module Overview
- What is Hive?
- Creating Hive Tables
- Loading Data into Hive Tables
- Querying Hive Tables
- Using Hive with PowerShell

What is Hive?
- A metadata service that projects tabular schemas over HDFS folders
- Enables the contents of folders to be queried as tables, using SQL-like query semantics
- Queries are translated into Map/Reduce jobs
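
To make "projecting a schema over a folder" concrete, here is a minimal Python sketch of the idea. It is not how Hive is implemented; the `SCHEMA` list, the `query_folder` helper, and the file names are all hypothetical. The point it illustrates is that the files stay plain delimited text and the column names and types are applied only when the folder is read.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical schema: (column name, converter) pairs standing in for table metadata.
SCHEMA = [("col1", str), ("col2", int)]


def convert_or_null(convert, value):
    """Apply a converter; a malformed value becomes None, much as Hive returns NULL."""
    try:
        return convert(value)
    except ValueError:
        return None


def query_folder(folder, schema, delimiter=" "):
    """Yield one dict per row, applying `schema` to every file in `folder` at read time."""
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        with path.open(newline="") as handle:
            for fields in csv.reader(handle, delimiter=delimiter):
                if not fields:
                    continue  # skip blank lines
                yield {
                    name: convert_or_null(convert, value)
                    for (name, convert), value in zip(schema, fields)
                }


def demo():
    """Write two data files into a temporary folder and read them back as rows."""
    with tempfile.TemporaryDirectory() as folder:
        Path(folder, "part-0.txt").write_text("a 1\nb oops\n")
        Path(folder, "part-1.txt").write_text("c 3\n")
        return list(query_folder(folder, SCHEMA))
```

Calling `demo()` returns three rows; the row whose second field is not a number gets `None` for `col2` instead of failing the whole read.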

Creating Hive Tables
- Use the CREATE TABLE HiveQL statement
  - Defines schema metadata to be projected onto data in a folder when the table is queried (not when it is created)
- Specify file format and file location
  - Defaults to text file format in the /hive/warehouse/<table_name> folder
- Create internal or external tables
  - Internal tables manage the lifetime of the underlying folders
  - External tables are metadata only; the underlying folders are managed independently of the table

CREATE TABLE

Internal table in the default location, /hive/warehouse/table1 (folders deleted when table is dropped):
    CREATE TABLE table1
      (col1 STRING, col2 INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';

Stored in a custom location (but still internal, so the folder is deleted when table is dropped):
    CREATE TABLE table2
      (col1 STRING, col2 INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
    STORED AS TEXTFILE LOCATION '/data/table2';

External table (folders and files are left intact in Azure Blob Store when the table is dropped):
    CREATE EXTERNAL TABLE table3
      (col1 STRING, col2 INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
    STORED AS TEXTFILE LOCATION '/data/table3';
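
The difference between dropping an internal and an external table can be sketched as a toy model in Python. The `ToyMetastore` class below is hypothetical and only mimics the behavior described on the slide: dropping always removes the metadata entry, and removes the folder only for internal tables.

```python
import shutil
import tempfile
from pathlib import Path


class ToyMetastore:
    """Hypothetical stand-in for table metadata: table name -> (folder, external flag)."""

    def __init__(self):
        self.tables = {}

    def create_table(self, name, location, external=False):
        # Register the table and make sure its folder exists.
        Path(location).mkdir(parents=True, exist_ok=True)
        self.tables[name] = (Path(location), external)

    def drop_table(self, name):
        # The metadata entry always goes away.
        location, external = self.tables.pop(name)
        # Only an internal table owns its folder, so only then is the data deleted.
        if not external:
            shutil.rmtree(location)


def demo():
    """Create one internal and one external table, drop both, report what survives."""
    with tempfile.TemporaryDirectory() as root:
        store = ToyMetastore()
        internal_dir = Path(root, "table2")
        external_dir = Path(root, "table3")
        store.create_table("table2", internal_dir)
        store.create_table("table3", external_dir, external=True)
        Path(external_dir, "data.txt").write_text("a 1\n")
        store.drop_table("table2")
        store.drop_table("table3")
        return internal_dir.exists(), external_dir.exists(), store.tables
```

After `demo()` runs, the internal table's folder is gone, the external table's folder and file are still there, and no metadata remains for either table.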

Loading Data into Hive Tables
- Save data files in table folders
- Use the LOAD statement
  - Moves or copies files to the appropriate folder (with LOCAL, files are copied from the local file system; without it, files are moved within HDFS)
    LOAD DATA LOCAL INPATH '/data/source' INTO TABLE MyTable;
- Use the INSERT statement
  - Inserts data from one table into another
    FROM StagingTable
    INSERT INTO TABLE MyTable
    SELECT Col1, Col2;
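
The "moves or copies" distinction for LOAD can be illustrated with a small Python sketch on an ordinary file system. The `load_data` function and its `local` flag are hypothetical names; the sketch assumes the usual Hive behavior that LOAD DATA LOCAL copies the source file, while LOAD DATA without LOCAL moves it.

```python
import shutil
import tempfile
from pathlib import Path


def load_data(source, table_folder, local):
    """Place `source` in `table_folder`: copy it if `local` is True, otherwise move it."""
    table_folder = Path(table_folder)
    table_folder.mkdir(parents=True, exist_ok=True)
    target = table_folder / Path(source).name
    if local:
        shutil.copy2(source, target)  # the source file stays where it was
    else:
        shutil.move(str(source), str(target))  # the source file disappears
    return target


def demo():
    """Load one file each way and report which sources survive."""
    with tempfile.TemporaryDirectory() as root:
        local_src = Path(root, "local.txt")
        staged_src = Path(root, "staged.txt")
        local_src.write_text("a 1\n")
        staged_src.write_text("b 2\n")
        table_folder = Path(root, "warehouse", "mytable")
        load_data(local_src, table_folder, local=True)
        load_data(staged_src, table_folder, local=False)
        loaded = sorted(p.name for p in table_folder.iterdir())
        return local_src.exists(), staged_src.exists(), loaded
```

In both cases the table folder ends up holding the file; the only difference is whether the original is left behind.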

Querying Hive Tables with HiveQL
- Query data using the SELECT statement
- Hive translates the query into Map/Reduce jobs and applies the table schema to the underlying data files
    SELECT Col1, SUM(Col2) AS TotalCol2
    FROM MyTable
    WHERE Col1 >= '2013-06-01' AND Col1 <= '2013-06-30'
    GROUP BY Col1
    ORDER BY Col1;
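
As a rough picture of what "translated into Map/Reduce jobs" means for the query above, the following Python sketch runs the same filter, grouping, and sum in three explicit phases. The function names and the sample `ROWS` are made up for illustration, and the final sort is folded into the reduce step for brevity; the plan Hive actually generates may be split across jobs differently.

```python
from collections import defaultdict

# Hypothetical sample rows of (Col1, Col2).
ROWS = [
    ("2013-06-01", 5),
    ("2013-06-02", 3),
    ("2013-06-01", 2),
    ("2013-07-01", 9),
]


def map_phase(rows, start, end):
    """Map: apply the WHERE filter and emit (Col1, Col2) key/value pairs."""
    for col1, col2 in rows:
        if start <= col1 <= end:
            yield col1, col2


def shuffle(pairs):
    """Shuffle: collect all values that share a key, as GROUP BY Col1 requires."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: compute SUM(Col2) per key, then sort by key for ORDER BY Col1."""
    return sorted((key, sum(values)) for key, values in groups.items())


def run_query(rows, start="2013-06-01", end="2013-06-30"):
    """Chain the three phases to mimic the SELECT statement."""
    return reduce_phase(shuffle(map_phase(rows, start, end)))
```

With the sample rows, `run_query(ROWS)` drops the July row, sums the two values for 1 June, and returns the totals in date order.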

Demo: Using Hive
In this demonstration, you will see how to:
- Create Hive Tables
- Load Data into Hive Tables
- Query a Hive Table with HiveQL
- Drop a Hive Table

Using Hive in PowerShell
- The New-AzureHDInsightHiveJobDefinition cmdlet
  - Creates a job definition
  - Use the Query parameter for explicit HiveQL statements, or the File parameter to reference a saved script
  - Run the job with the Start-AzureHDInsightJob cmdlet
- The Invoke-Hive cmdlet
  - Simpler syntax to run a HiveQL query

Demo: Using Hive in PowerShell
In this demonstration, you will see how to:
- Use PowerShell to Run a HiveQL Command
- Use PowerShell to Query a Hive Table

Module Summary
- Hive enables Map/Reduce processing through SQL-like syntax
- Internal tables manage the lifetime of their data; external tables are metadata only
- Use HiveQL queries in PowerShell scripts to perform Hive operations and retrieve data