Download presentation
Presentation is loading. Please wait.
1
HIVE CSCE 587 Spring 2018
2
Step 1: Data Start by downloading the data
process-data-with-apache-hive/assets/driver_data.zip There are two files in this data set that we saw last week with PIG: drivers.csv timesheet.csv
3
Step 2: Load the files into HDFS
Unlike last week, we will use the GUI to load. Start by logging onto ambaria: vm-Hadoop-xx.cse.sc.edu:8080 use your maria_dev credentials
4
Step 2: Load the files into HDFS
Click on the icon that resembles 3x3 grid on the menu bar (it is at the top of the window on the far right side) Select “Files View”
5
Step 2: Load the files into HDFS
Navigate to /user/maria_dev 1. scroll to bottom of list to find user. 2. click on “user”. 3. scroll down list to find “maria_dev” 4. click on “maria_dev”
6
Navigating to /user/maria_dev
You should see something like this, although you should also see the other files that you created last week with Hadoop and PIG.
7
Click on the “Upload” button
Then select the “Browse” button and navigate to where you stored the files on the linux file system. Select drivers.csv to upload Do the same for timesheet.csv
8
Had you started with a tabula rasa, your directory would look like this:
9
3. HIVE View 2.0 Switch context from “Files” view to “Hive View 2.0:
10
Step 3.1 This brings up the query editor
11
Step 3.2: Create an empty table
Enter: create table temp_drivers (col_value STRING); Then click on “Execute”
12
Result
13
Step 3. 3 Enter: LOAD DATA INPATH '/user/maria_dev/drivers
Step 3.3 Enter: LOAD DATA INPATH '/user/maria_dev/drivers.csv' OVERWRITE INTO TABLE temp_drivers; Then click on “Execute”
14
Go back to the “Files” view What has changed. The file drivers
Go back to the “Files” view What has changed? The file drivers.csv is no longer there Loading the file into Hive has “consumed” drivers.csv
15
Take a peek at the table “temp_drivers” Enter: select
Take a peek at the table “temp_drivers” Enter: select * from temp_drivers limit 10; then click on “Execute”
16
Step 3.4: extract the fields we want enter: CREATE TABLE drivers (driverId INT, name STRING, ssn BIGINT, location STRING, certified STRING, wageplan STRING); the click on “Execute”
17
Step 3.5: QUERY TO EXTRACT DATA FROM Temp_drivers insert overwrite table drivers SELECT regexp_extract(col_value, '^(?:([^,]*),?){1}', 1) driverId, regexp_extract(col_value, '^(?:([^,]*),?){2}', 1) name, regexp_extract(col_value, '^(?:([^,]*),?){3}', 1) ssn, regexp_extract(col_value, '^(?:([^,]*),?){4}', 1) location, regexp_extract(col_value, '^(?:([^,]*),?){5}', 1) certified, regexp_extract(col_value, '^(?:([^,]*),?){6}', 1) wageplan from temp_drivers;
18
Step 3.5: create a query to extract DATA FROM Temp_drivers insert overwrite table drivers SELECT regexp_extract(col_value, '^(?:([^,]*),?){1}', 1) driverId, regexp_extract(col_value, '^(?:([^,]*),?){2}', 1) name, regexp_extract(col_value, '^(?:([^,]*),?){3}', 1) ssn, regexp_extract(col_value, '^(?:([^,]*),?){4}', 1) location, regexp_extract(col_value, '^(?:([^,]*),?){5}', 1) certified, regexp_extract(col_value, '^(?:([^,]*),?){6}', 1) wageplan from temp_drivers;
19
Take a peek at the first 10 rows of the resulting table Enter: select
Take a peek at the first 10 rows of the resulting table Enter: select * from drivers limit 10; Then click on “Execute”
20
Processing timesheet.csv creating similar tables from timesheet.csv
CREATE TABLE temp_timesheet (col_value string); Processing timesheet.csv creating similar tables from timesheet.csv Start by creating temp_timesheet with the following command: CREATE TABLE temp_timesheet (col_value string); Then populate it with data from timesheet.csv LOAD DATA INPATH '/user/maria_dev/timesheet.csv' OVERWRITE INTO TABLE temp_timesheet; Finally, look at the first 10 lines of the table as a sanity check. Select * from temp_timesheet limit 10;
21
3.6 CREATE TABLE temp_timesheet (col_value string); LOAD DATA INPATH '/user/maria_dev/timesheet.csv' OVERWRITE INTO TABLE temp_timesheet;
22
Creating timesheet from temp_timesheet
Start by creating an empty table timesheet Enter: CREATE TABLE timesheet (driverId INT, week INT, hours_logged INT , miles_logged INT); Then click on “Execute” Next populate the table by extracting columns fromm temp_timesheet
23
Extracting columns from temp_timesheet
Enter: insert overwrite table timesheet SELECT regexp_extract(col_value, '^(?:([^,]*),?){1}', 1) driverId, regexp_extract(col_value, '^(?:([^,]*),?){2}', 1) week, regexp_extract(col_value, '^(?:([^,]*),?){3}', 1) hours_logged, regexp_extract(col_value, '^(?:([^,]*),?){4}', 1) miles_logged from temp_timesheet; Then click on “Execute”
24
Take a peek at the first 10 rows enter: select
Take a peek at the first 10 rows enter: select * from timesheet limit 10; click on “Execute”
25
Now group timesheet data by driverID so that we can sum the hours logged and sum the miles logged Enter: SELECT driverId, sum(hours_logged), sum(miles_logged) FROM timesheet GROUP BY driverId; Then click on “Execute”
26
Results after grouping by driverID and summing logged hours and logged miles
27
Combine columns from drivers and timesheet tables
Columns from drivers table: driverId name Columns from timesheets table: total_hours total_miles Join column: driverId
28
Combine columns from drivers and timesheet tables
SELECT d.driverId, d.name, t.total_hours, t.total_miles from drivers d JOIN (SELECT driverId, sum(hours_logged)total_hours, sum(miles_logged)total_miles FROM timesheet GROUP BY driverId ) t ON (d.driverId = t.driverId);
29
Results after entering command and executing
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.