Big Data for Relational Practitioners
Len Wyatt
Program Manager, Microsoft Corporation
DBI225
[Diagram: reading a giant file from HDFS. The HDFS client asks the NameNode, which returns the locations of the blocks of the file; the DataNodes then return the blocks of the file to the client.]
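The same read path can be exercised from code. Below is a minimal sketch using the HDFS Java client API; the NameNode address and file path are hypothetical placeholders, not values from the deck.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; substitute your cluster's.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:9000"), conf);

        // open() contacts the NameNode for the file's block locations;
        // the returned stream then reads those blocks from the DataNodes.
        try (FSDataInputStream in = fs.open(new Path("/data/giantfile.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            System.out.println(reader.readLine()); // print the first line
        }
    }
}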
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Emit (word, 1) for every token in the input line
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Sum the counts emitted for each word
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

Source:
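These two classes are not runnable on their own; a driver must wire them into a job. A minimal sketch, assuming Map and Reduce are nested in a WordCount class and the input and output paths arrive as command-line arguments:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // The Map and Reduce classes from the previous slide go here.

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(Map.class);
        // Summing is associative, so the reducer doubles as a combiner.
        job.setCombinerClass(Reduce.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}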
demo
A Quick-and-Dirty Data Warehouse in Hadoop
[Diagram: transaction semantics by workload. With ACID semantics, SQL Server covers both OLTP and DW; with BASE semantics, HBase and Cassandra cover OLTP, while Hive covers DW.]
Define schema with Hive DDL (state the structure, map it to the file):

create external table CUSTOMER (
    C_CUSTKEY    int,
    C_MKTSEGMENT string,
    C_NATIONKEY  int,
    C_NAME       string,
    C_ADDRESS    string,
    C_PHONE      string,
    C_ACCTBAL    float,
    C_COMMENT    string
)
row format delimited fields terminated by '|'
stored as textfile
location 'asv://customer/';
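Once the external table is defined, any Hive client can query it like an ordinary table. A minimal sketch using the Hive JDBC driver; the HiveServer2 host, port, and credentials are assumptions for illustration, not values from the deck.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Requires the hive-jdbc jar on the classpath.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Hypothetical HiveServer2 endpoint; substitute your cluster's.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hive-host:10000/default", "user", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "select C_MKTSEGMENT, count(*) from CUSTOMER group by C_MKTSEGMENT")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}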
orders = load '/wh/orders/orders.tbl' using PigStorage('|') as (
    ORDERDATE:chararray, ORDERKEY:long, CUSTKEY:int,
    ORDERSTATUS:chararray, TOTALPRICE:double, COMMENT:chararray );
custs = load '/wh/customer/customer.tbl' using PigStorage('|') as (
    CUSTKEY:int, MKTSEGMENT:chararray, NATIONKEY:int,
    NAME:chararray, ADDRESS:chararray, PHONE:chararray );
nations = load '/wh/nation/nation.tbl' using PigStorage('|') as (
    id:int, nation:chararray, region:int );

custnat = join custs by NATIONKEY, nations by id;
ordernat = join custnat by CUSTKEY, orders by CUSTKEY;
ordersbynat = group ordernat by NATIONKEY;
sums = foreach ordersbynat generate group,
    COUNT(ordernat.TOTALPRICE), SUM(ordernat.TOTALPRICE);
dump sums;

The logic is in the last five statements; the rest is schema.
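The same Pig Latin can also be embedded in a Java program through Pig's PigServer API. A minimal sketch that runs a cut-down version of the script above; the output path is a hypothetical placeholder.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigEmbedExample {
    public static void main(String[] args) throws Exception {
        // ExecType.MAPREDUCE runs on the cluster; use ExecType.LOCAL to test locally.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        pig.registerQuery("nations = load '/wh/nation/nation.tbl' using PigStorage('|') "
                + "as (id:int, nation:chararray, region:int);");
        pig.registerQuery("byregion = group nations by region;");
        pig.registerQuery("counts = foreach byregion generate group, COUNT(nations);");
        // store() triggers execution, much as dump does at the Grunt shell.
        pig.store("counts", "/wh/out/nation_counts"); // hypothetical output path
    }
}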
hive> select devicemake, devicemodel, sum(querydwelltime) as a
    > from hivesampletable
    > group by devicemake, devicemodel
    > order by a;
Total MapReduce jobs = 2
Launching Job 1 out of 2
Starting Job = job_ _0003, Tracking URL =
Kill Command = c:\Apps\dist\bin\hadoop.cmd job -Dmapred.job.tracker= :9010 -kill job_ _
:29:21,382 Stage-1 map = 0%, reduce = 0%
:29:33,601 Stage-1 map = 50%, reduce = 0%
:29:37,617 Stage-1 map = 100%, reduce = 0%
:29:48,648 Stage-1 map = 100%, reduce = 33%
:29:51,664 Stage-1 map = 100%, reduce = 100%
Ended Job = job_ _0003
Launching Job 2 out of 2
Starting Job = job_ _0004, Tracking URL =
Kill Command = c:\Apps\dist\bin\hadoop.cmd job -Dmapred.job.tracker= :9010 -kill job_ _
:30:18,195 Stage-2 map = 0%, reduce = 0%
:30:30,210 Stage-2 map = 100%, reduce = 0%
:30:45,241 Stage-2 map = 100%, reduce = 33%
:30:48,257 Stage-2 map = 100%, reduce = 100%
Ended Job = job_ _0004
OK
Samsung     SGH-i
LG          LG-C
HTC         7 Mozart
SAMSUNG     SGH-i917R
HTC         PD
Apple       iPhone

Note that this single query compiles to two MapReduce jobs: the GROUP BY aggregation runs as one job and the ORDER BY sort as a second.
[Diagram: a classic data warehouse architecture. Sources (OLTP DB, HR DB, Customer Mgmt. DB, external sources) feed the data warehouse DB through ETL, with an optional staging area; downstream, data marts and OLAP cubes serve reports, interactive tools, and dashboards.]
[Diagram: the same architecture mapped onto the Hadoop ecosystem. OLTP lives in an RDBMS or in HBase; Sqoop handles data interchange with relational sources and targets; Flume acquires files from external sources; HDFS provides persistent storage; Pig transforms data in HDFS; the DW lives in Hive, which presents data as tables; Oozie manages workflows; a presentation DB and OLAP cube serve reports, dashboards, and interactive tools.]
demo
The Best of Both Worlds