Published by Jessie Sutton. Modified over 9 years ago.
1
Cloud Computing: Other High-Level Parallel Processing Languages
Keke Chen
2
Outline
  Sawzall
  Dryad and DryadLINQ (Microsoft, abandoned)
  Hive
3
Sawzall
  Simplifies MapReduce programming
  Filters + aggregators: filters play the role of mappers, aggregators the role of reducers
4
Example (code figure not preserved)
  Figure annotations: mappers; reducers; convert the input record to float
5
Input
  A Sawzall program works on a single record at a time
  It acts as a filter over the data stream
  The input can be parsed into values (e.g., float) or data structures
    x: float = input;    (variable: type = input)
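The record-at-a-time model above can be sketched in Python. This is only an illustration, not Sawzall itself: `SumTable`, `process_record`, and the three table names are hypothetical, and the statistics collected (count, total, sum of squares) are assumed for the example.

```python
class SumTable:
    """Stand-in for a Sawzall `table sum of float` aggregator (hypothetical)."""

    def __init__(self):
        self.value = 0.0

    def emit(self, x):
        # Aggregation happens outside the per-record code, like a reducer.
        self.value += x


count = SumTable()
total = SumTable()
sum_of_squares = SumTable()


def process_record(raw):
    """Per-record filter: sees exactly one input record, like a mapper."""
    x = float(raw)              # x: float = input;
    count.emit(1)               # emit count <- 1;
    total.emit(x)               # emit total <- x;
    sum_of_squares.emit(x * x)  # emit sum_of_squares <- x * x;


# Sample records invented for the illustration.
for record in ["1.5", "2.0", "4.5"]:
    process_record(record)

mean = total.value / count.value
```

The per-record function never sees another record or the table contents, which is what allows the records to be processed in parallel.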
6
Aggregators
  Definition: table agg_name of data_type/variable
  Examples:
    c: table collection of string;
    S: table sample(100) of string;
    S: table sum of {count: int, revenue: float};
  More aggregators: maximum, quantile, top, unique
7
Indexed aggregators
  Similar to "group by"; the index is the group id
  Example:
    t1: table sum[country: string] of int;
    country: string = input;
    emit t1[country] <- 1;
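The "group by" behavior of an indexed aggregator can be sketched in Python as follows. `IndexedSumTable` is a hypothetical helper and the country codes are made-up input; the sketch only mirrors the `t1` example above.

```python
from collections import defaultdict


class IndexedSumTable:
    """Stand-in for `table sum[key: string] of int` (hypothetical helper)."""

    def __init__(self):
        self.cells = defaultdict(int)

    def emit(self, key, value):
        # Each distinct index value gets its own running sum,
        # much like one group in a SQL GROUP BY.
        self.cells[key] += value


t1 = IndexedSumTable()

# Made-up input: one country string per record.
for country in ["US", "CN", "US", "DE", "US"]:
    t1.emit(country, 1)  # emit t1[country] <- 1;

counts = dict(t1.cells)
```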
8
More examples
    proto "querylog.proto"
    queries_per_degree: table sum[lat: int][lon: int] of int;
    log_record: queryLogProto = input;
    loc: Location = locationinfo(log_record.ip);
    emit queries_per_degree[int(loc.lat)][int(loc.lon)] <- 1;
9
Performance
  Single-CPU speed: 51 times slower than compiled C++
10
Performance
11
Dryad and DryadLINQ
  Dryad
    Provides a low-level parallel data flow processing interface
    Acyclic data flow graphs
    Data communication methods include pipes, files, messages, and shared memory
  DryadLINQ
    A high-level language for application developers
    Hides the data flow details
12
Job = Directed Acyclic Graph
  Processing vertices
  Channels (file, pipe, shared memory)
  Inputs
  Outputs
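A job graph of this shape can be sketched with Python's standard library. The vertex names and the channel assignments are invented for illustration, and the ordering logic is a simplification rather than Dryad's actual scheduler.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical job: two readers feed a join vertex, which feeds a writer.
# Each entry maps a vertex to the set of vertices it reads from.
upstream = {
    "read_a": set(),
    "read_b": set(),
    "join": {"read_a", "read_b"},
    "write": {"join"},
}

# Channel type per edge (file, pipe, or shared memory, as listed above).
channels = {
    ("read_a", "join"): "file",
    ("read_b", "join"): "file",
    ("join", "write"): "pipe",
}

# One valid execution order: every vertex appears after its upstream vertices.
# static_order() raises graphlib.CycleError if the graph is not acyclic.
order = list(TopologicalSorter(upstream).static_order())

# Job inputs have no upstream vertex; job outputs feed no other vertex.
inputs = [v for v, deps in upstream.items() if not deps]
outputs = [v for v in upstream if all(v not in deps for deps in upstream.values())]
```

Because the graph is acyclic, such an order always exists, which is what lets a coordinator decide when each vertex is ready to run.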
13
Runtime
  Services
    Name server
    Daemon
  Job Manager (JM)
    Centralized coordinating process
    User application to construct graph
    Linked with Dryad libraries for scheduling vertices
  Vertex executable
    Dryad libraries to communicate with JM
    User application sees channels in/out
    Arbitrary application code, can use local FS
14
Graph operators
15
Hive
  Developed by Facebook (open source)
  Mimics the SQL language
  Built on Hadoop/MapReduce
16
Hive data model: table etc.
  Table
    Similar to a DB table, stored in Hadoop directories
    Built-in compression, serialization/deserialization
  Partitions
    Groups in the table
    Subdirectory in the table directory
  Buckets
    Files in the partition directory
    Key (column) based partition
  Layout: /table/partition/bucket1
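The /table/partition/bucket layout above can be illustrated with a small Python sketch. `bucket_path` is a hypothetical helper, the `column=value` directory name and the `bucketN` file name are illustrative, and the CRC32 hash is an assumption chosen for determinism, not a claim about the hash function Hive uses.

```python
import zlib


def bucket_path(table_dir, partition_col, partition_value, bucket_key, num_buckets):
    """Return an illustrative /table/partition/bucket path for one row."""
    # Key-based bucketing: hash the bucketing column, then take it modulo
    # the number of buckets (hash choice is an assumption, see lead-in).
    bucket = zlib.crc32(str(bucket_key).encode("utf-8")) % num_buckets
    return f"{table_dir}/{partition_col}={partition_value}/bucket{bucket}"


# Hypothetical row: userid 42 in the 2009-03-20 partition, 4 buckets.
path = bucket_path("/status_updates", "ds", "2009-03-20", bucket_key=42, num_buckets=4)
```

Rows with the same partition value land in the same subdirectory, and rows with the same key always land in the same bucket file within it.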
17
Hive data model: column types
  Integers, floating point numbers, generic strings, dates, and booleans
  Nestable collection types: array and map
19
Architecture
  The metastore stores the schemas of databases. It uses a non-HDFS data store.
20
Query processing
  Steps (similar to a DBMS):
    Parse
    Semantic analyzer
    Logical plan generator (algebra tree)
    Optimizer
    Physical plan generator (to MapReduce jobs)
21
Operations: DDL and DML
  HiveQL: SQL-like, with slightly different syntax
  User-defined filtering and aggregation functions
    Java only
  Map/reduce plugin for streaming processing
    Can be implemented in any language
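A streaming plugin can be an ordinary script in any language. The following is a minimal Python sketch, assuming rows arrive as tab-separated lines on standard input and results leave the same way on standard output; the column layout (userid, status) and the word-count logic are invented for illustration.

```python
import io
import sys


def transform(lines):
    """Map each tab-separated row (userid, status) to (userid, word count)."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # skip malformed rows rather than failing the whole job
        userid, status = fields
        yield f"{userid}\t{len(status.split())}"


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Entry point for use as a streaming script: rows in, rows out."""
    for out in transform(stdin):
        stdout.write(out + "\n")


# In-memory demonstration with made-up rows; the third row is malformed.
demo_in = io.StringIO("1\thello world\n2\tbusy day at school\nbad row\n")
demo_out = io.StringIO()
main(demo_in, demo_out)
```

When deployed as a real script, the demonstration lines would be replaced by a plain call to `main()`.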
22
Example: Facebook status updates
  Tables:
    status_updates(userid int, status string, ds string)
    profiles(userid int, school string, gender int)
  Operations
    Load data:
      LOAD DATA LOCAL INPATH '/logs/status_updates' INTO TABLE status_updates PARTITION (ds='2009-03-20')
    Count status updates by school and by gender
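What the last operation computes can be sketched in plain Python over tiny in-memory samples. The rows and school names below are made up, and the join-then-count logic is one plausible reading of "count status updates by school and by gender", not the query from the slides.

```python
from collections import Counter

# Made-up sample rows with the same columns as the two tables.
status_updates = [  # (userid, status, ds)
    (1, "at the library", "2009-03-20"),
    (2, "game night", "2009-03-20"),
    (1, "finals week", "2009-03-20"),
    (3, "hello", "2009-03-19"),
]
profiles = [  # (userid, school, gender)
    (1, "school_a", 0),
    (2, "school_b", 1),
    (3, "school_a", 1),
]

# Join on userid, restricted to a single ds partition.
profile_by_user = {userid: (school, gender) for userid, school, gender in profiles}
joined = [
    profile_by_user[userid]
    for userid, _status, ds in status_updates
    if ds == "2009-03-20" and userid in profile_by_user
]

# Two separate group-by counts over the same joined rows.
by_school = Counter(school for school, _gender in joined)
by_gender = Counter(gender for _school, gender in joined)
```

Filtering on `ds` first mirrors how a partitioned table lets a query read only one partition subdirectory.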
23
More query examples
24
Query examples
25
Query examples: using Hadoop streaming