Published by Emery Patrick. Modified over 9 years ago.
1
Big Data for Relational Practitioners. Len Wyatt, Program Manager, Microsoft Corporation. DBI225
4
[Diagram: reading a giant file from HDFS. Components: HDFS client, NameNode, DataNodes. The NameNode returns the locations of the blocks of the file; the DataNodes return the blocks of the file.]
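The read path on this slide can be sketched with a small in-memory model. This is only an illustration in plain Java, not the Hadoop client API: the class name `HdfsReadSketch`, the 4-byte `BLOCK_SIZE`, and the round-robin block placement are all assumptions made for the example, and replication is not modeled.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory model of the HDFS read path (hypothetical names, not Hadoop classes).
final class HdfsReadSketch {
    // Assumed block size, kept tiny so a short string spans several blocks.
    static final int BLOCK_SIZE = 4;

    // A DataNode holds block contents keyed by block id.
    static final class DataNode {
        final Map<Integer, byte[]> blocks = new HashMap<>();
    }

    // The NameNode holds metadata only: file name -> ordered {blockId, dataNodeIndex} pairs.
    static final class NameNode {
        final Map<String, List<int[]>> locations = new HashMap<>();
    }

    final NameNode nameNode = new NameNode();
    final List<DataNode> dataNodes = new ArrayList<>();
    private int nextBlockId = 0;

    HdfsReadSketch(int nodeCount) {
        for (int i = 0; i < nodeCount; i++) {
            dataNodes.add(new DataNode());
        }
    }

    // Split the content into blocks and place them round-robin on the DataNodes.
    void write(String name, String content) {
        byte[] bytes = content.getBytes(StandardCharsets.UTF_8);
        List<int[]> locs = new ArrayList<>();
        for (int off = 0; off < bytes.length; off += BLOCK_SIZE) {
            int blockId = nextBlockId++;
            int node = blockId % dataNodes.size();
            int end = Math.min(off + BLOCK_SIZE, bytes.length);
            dataNodes.get(node).blocks.put(blockId, Arrays.copyOfRange(bytes, off, end));
            locs.add(new int[] {blockId, node});
        }
        nameNode.locations.put(name, locs);
    }

    // Read: ask the NameNode for block locations, then fetch each block from its DataNode.
    // The file must have been written first; unknown names are not handled here.
    String read(String name) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int[] loc : nameNode.locations.get(name)) {
            byte[] block = dataNodes.get(loc[1]).blocks.get(loc[0]);
            out.write(block, 0, block.length);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        HdfsReadSketch fs = new HdfsReadSketch(3);
        fs.write("giant", "110010101001");
        System.out.println(fs.read("giant"));
    }
}
```

The point of the split is that the NameNode never touches file contents: it answers the "where" question, and the bulk data flows directly between the client and the DataNodes.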
5
Output
6
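import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Mapper: emits (word, 1) for every token in each input line.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts emitted for each word.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}

Source: http://wiki.apache.org/hadoop/WordCount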
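To see what the framework does around these two classes, the following sketch runs the same map and reduce logic locally in plain Java, with no Hadoop dependency. The class `LocalWordCount` and its `run` method are hypothetical names for this illustration; the in-memory grouping step merely stands in for the shuffle that Hadoop performs between the two phases.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Local, single-process imitation of the word count job (not Hadoop code).
final class LocalWordCount {

    // Map phase: emit a (word, 1) pair for every token in the line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            out.add(new AbstractMap.SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return out;
    }

    // Reduce phase: sum all values that were emitted for one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    static Map<String, Integer> run(List<String> lines) {
        // Stand-in for the shuffle: collect every emitted value under its key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        // Call reduce once per key, as the framework would.
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("big data for relational practitioners", "big data")));
    }
}
```

In a real job the map calls run in parallel on different blocks of the input and the grouped values travel over the network; only the per-record logic is the same.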
9
Demo: A Quick-and-Dirty Data Warehouse in Hadoop
10
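[Diagram: a grid of workload (OLTP, DW) against consistency model (ACID, BASE), with SQL Server, Hive, HBase, Cassandra, and SQL Server again placed in its cells.]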
11
Define schema with Hive DDL (state the structure, map to file)

create external table CUSTOMER (
    C_CUSTKEY     int,
    C_MKTSEGMENT  string,
    C_NATIONKEY   int,
    C_NAME        string,
    C_ADDRESS     string,
    C_PHONE       string,
    C_ACCTBAL     float,
    C_COMMENT     string
)
row format delimited fields terminated by '|'
stored as textfile
location 'asv://customer/';
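The DDL on this slide only declares a structure over a file that already exists; the types are applied when a row is read. A rough plain-Java sketch of that idea follows. The class `CustomerRowSketch` and the sample line are made up for illustration, and the sketch simply throws on a malformed numeric field, which is a simplification rather than a description of how Hive behaves.

```java
// Applies the CUSTOMER column layout to one '|'-delimited text line at read time.
// Hypothetical helper for illustration; not part of Hive.
final class CustomerRowSketch {
    final int custKey;
    final String mktSegment;
    final int nationKey;
    final String name;
    final String address;
    final String phone;
    final float acctBal;
    final String comment;

    private CustomerRowSketch(String[] f) {
        // Column order follows the DDL on the slide.
        custKey = Integer.parseInt(f[0]);
        mktSegment = f[1];
        nationKey = Integer.parseInt(f[2]);
        name = f[3];
        address = f[4];
        phone = f[5];
        acctBal = Float.parseFloat(f[6]);
        comment = f[7];
    }

    static CustomerRowSketch parse(String line) {
        // The -1 limit keeps trailing empty fields instead of dropping them.
        String[] fields = line.split("\\|", -1);
        if (fields.length != 8) {
            throw new IllegalArgumentException("expected 8 fields, got " + fields.length);
        }
        return new CustomerRowSketch(fields);
    }

    public static void main(String[] args) {
        // Invented sample row, not taken from the presentation's data set.
        CustomerRowSketch c = parse("1|BUILDING|15|Customer One|17 Example St|555-0100|711.56|sample comment");
        System.out.println(c.name + " in nation " + c.nationKey + " has balance " + c.acctBal);
    }
}
```

Because nothing is converted or validated at load time, pointing a second table definition at the same location is cheap: the file stays put and only the interpretation changes.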
12
orders = load '/wh/orders/orders.tbl' using PigStorage ('|') as (
    ORDERDATE:chararray, ORDERKEY:long, CUSTKEY:int,
    ORDERSTATUS:chararray, TOTALPRICE:double, COMMENT:chararray );
custs = load '/wh/customer/customer.tbl' using PigStorage ('|') as (
    CUSTKEY:int, MKTSEGMENT:chararray, NATIONKEY:int,
    NAME:chararray, ADDRESS:chararray, PHONE:chararray );
nations = load '/wh/nation/nation.tbl' using PigStorage ('|') as (
    id:int, nation:chararray, region:int );

custnat = join custs by NATIONKEY, nations by id;
ordernat = join custnat by CUSTKEY, orders by CUSTKEY;
ordersbynat = group ordernat by NATIONKEY;
sums = foreach ordersbynat generate group, COUNT(ordernat.TOTALPRICE), SUM(ordernat.TOTALPRICE);
dump sums;

Callout: the logic is in the final statements (join, group, foreach); the rest is schema.
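For readers more at home in a general-purpose language, the data flow of this Pig script (two inner joins, a group, then a count and a sum per nation) can be sketched in plain Java over in-memory lists. The class `OrdersByNationSketch` and its nested types are hypothetical, only the columns the logic touches are modeled, and the sketch assumes customer keys and nation ids are unique.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// In-memory imitation of the join / group / aggregate steps (not Pig code).
final class OrdersByNationSketch {

    static final class Cust {
        final int custKey;
        final int nationKey;
        Cust(int custKey, int nationKey) { this.custKey = custKey; this.nationKey = nationKey; }
    }

    static final class Order {
        final int custKey;
        final double totalPrice;
        Order(int custKey, double totalPrice) { this.custKey = custKey; this.totalPrice = totalPrice; }
    }

    // Returns nation key -> {order count, sum of total price}.
    static Map<Integer, double[]> ordersByNation(List<Cust> custs, Set<Integer> nationIds, List<Order> orders) {
        // custnat: inner join of customers and nations, so unknown nations drop out.
        Map<Integer, Integer> custToNation = new HashMap<>();
        for (Cust c : custs) {
            if (nationIds.contains(c.nationKey)) {
                custToNation.put(c.custKey, c.nationKey);
            }
        }
        // ordernat + ordersbynat + sums: join orders by customer key,
        // then accumulate count and sum under the customer's nation key.
        Map<Integer, double[]> sums = new TreeMap<>();
        for (Order o : orders) {
            Integer nation = custToNation.get(o.custKey);
            if (nation == null) {
                continue; // order without a matching customer: dropped by the inner join
            }
            double[] agg = sums.computeIfAbsent(nation, k -> new double[2]);
            agg[0] += 1;
            agg[1] += o.totalPrice;
        }
        return sums;
    }

    public static void main(String[] args) {
        // Invented sample rows for illustration only.
        List<Cust> custs = Arrays.asList(new Cust(1, 10), new Cust(2, 10), new Cust(3, 20));
        Set<Integer> nations = new HashSet<>(Arrays.asList(10, 20));
        List<Order> orders = Arrays.asList(new Order(1, 100.0), new Order(2, 25.0), new Order(3, 10.0));
        for (Map.Entry<Integer, double[]> e : ordersByNation(custs, nations, orders).entrySet()) {
            System.out.println(e.getKey() + " count=" + (long) e.getValue()[0] + " sum=" + e.getValue()[1]);
        }
    }
}
```

The Java version needs explicit lookup tables and loops for what Pig expresses in four statements, and it only works while everything fits in one machine's memory; Pig compiles the same flow into distributed jobs.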
13
hive> select devicemake, devicemodel, sum(querydwelltime) as a
    > from hivesampletable
    > group by devicemake, devicemodel
    > order by a;
Total MapReduce jobs = 2
Launching Job 1 out of 2
Starting Job = job_201206011857_0003, Tracking URL = http://10.114.202.178:50030/jobdetails.jsp?jobid=job_201206011857_0003
Kill Command = c:\Apps\dist\bin\hadoop.cmd job -Dmapred.job.tracker=10.114.202.178:9010 -kill job_201206011857_0003
2012-06-02 22:29:21,382 Stage-1 map = 0%, reduce = 0%
2012-06-02 22:29:33,601 Stage-1 map = 50%, reduce = 0%
2012-06-02 22:29:37,617 Stage-1 map = 100%, reduce = 0%
2012-06-02 22:29:48,648 Stage-1 map = 100%, reduce = 33%
2012-06-02 22:29:51,664 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201206011857_0003
Launching Job 2 out of 2
Starting Job = job_201206011857_0004, Tracking URL = http://10.114.202.178:50030/jobdetails.jsp?jobid=job_201206011857_0004
Kill Command = c:\Apps\dist\bin\hadoop.cmd job -Dmapred.job.tracker=10.114.202.178:9010 -kill job_201206011857_0004
2012-06-02 22:30:18,195 Stage-2 map = 0%, reduce = 0%
2012-06-02 22:30:30,210 Stage-2 map = 100%, reduce = 0%
2012-06-02 22:30:45,241 Stage-2 map = 100%, reduce = 33%
2012-06-02 22:30:48,257 Stage-2 map = 100%, reduce = 100%
Ended Job = job_201206011857_0004
OK
Samsung SGH-i987 0.4610394
LG LG-C900 6.315
HTC 7 Mozart 10.442
SAMSUNG SGH-i917R 15.5504033
HTC PD67100 15.590325499999999
Apple iPhone 3.1 18.7357592
14
[Diagram: data warehouse architecture. Components shown: OLTP DB, HR DB, Customer Mgmt. DB, external sources, staging area, data warehouse, data mart, OLAP cube, reports, interactive tools, dashboards. Connections are labeled ETL and (optional) ETL.]
15
[Diagram: data warehouse architecture with Hadoop components. Labels: persistent storage in HDFS; Flume for file acquisition; Sqoop data interchange with relational sources; Sqoop data interchange with relational targets; Pig transforms data in HDFS; Hive presents data as tables; Oozie manages workflows; OLTP in HBase; OLTP in RDBMS; DW in Hive; presentation DB; OLAP cube; reports; dashboards; interactive tools; external sources.]
17
Demo: The Best of Both Worlds