Cloud Computing Other High-level parallel processing languages Keke Chen.

1 Cloud Computing
Other high-level parallel processing languages
Keke Chen

2 Outline
- Sawzall
- Dryad and DryadLINQ (Microsoft, abandoned)
- Hive

3 Sawzall
- Simplifies MapReduce programming
- Filters + aggregators: the filter plays the mapper role, the aggregator plays the reducer role

4 Example
[Figure not preserved; its labels were "mappers", "reducers", and "Convert the input record to float"]

5 Input
- A Sawzall program works on a single record
  - It acts as a filter over the data stream
- Input can be parsed into
  - Values, e.g., float
  - Data structures
    x: float = input;   (variable: type = input)
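The record-at-a-time model above can be sketched in Python. This is not Sawzall and not how its runtime is built; the function names `process_record` and `run` and the table names are hypothetical, and the sketch assumes each record is a string holding one number. The per-record function plays the filter (mapper) role and the summing loop plays the aggregator (reducer) role.

```python
from collections import defaultdict

def process_record(record):
    """Filter step: sees exactly one record and yields (table, value) pairs."""
    x = float(record)              # analogous to  x: float = input;
    yield "count", 1
    yield "total", x
    yield "sum_of_squares", x * x

def run(records):
    """Aggregation step: every named table sums the values emitted into it."""
    tables = defaultdict(float)
    for record in records:         # records are processed independently
        for table, value in process_record(record):
            tables[table] += value
    return dict(tables)

result = run(["1.0", "2.0", "3.0"])
```

Because `process_record` never looks at more than one record, the loop in `run` could be split across machines and the partial sums combined afterwards.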

6 Aggregators
- Definition: table agg_name of data_type/variable
- Examples:
    c: table collection of string;
    s: table sample(100) of string;
    s: table sum of {count: int, revenue: float};
- More aggregators: maximum, quantile, top, unique

7 Indexed aggregators
- Similar to "group by"; the index is the group id
- Example:
    t1: table sum[country: string] of int;
    country: string = input;
    emit t1[country] <- 1;
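For readers more familiar with Python, the indexed `sum` table behaves roughly like a counter keyed by the index. The sketch below is only an analogy; `count_by_country` is a hypothetical name and the input values are made up.

```python
from collections import Counter

def count_by_country(records):
    t1 = Counter()                   # stands in for  table sum[country: string] of int
    for record in records:           # each record is handled on its own
        country = record.strip()     # analogous to  country: string = input;
        t1[country] += 1             # analogous to  emit t1[country] <- 1;
    return t1

t1 = count_by_country(["US", "CN", "US", "DE"])
```

Each distinct index value becomes its own group, which is why the slide compares indexed aggregators to "group by".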

8 Another example
    proto "querylog.proto"
    queries_per_degree: table sum[lat: int][lon: int] of int;
    log_record: queryLogProto = input;
    loc: Location = locationinfo(log_record.ip);
    emit queries_per_degree[int(loc.lat)][int(loc.lon)] <- 1;

9 Performance
- Single-CPU speed: Sawzall is 51 times slower than compiled C++

10 Performance

11 Dryad and DryadLINQ
- Dryad provides a low-level interface for parallel data-flow processing
  - Acyclic data-flow graphs
  - Data communication methods include pipes, files, messages, and shared memory
- DryadLINQ
  - A high-level language for application developers
  - It hides the data-flow details

12 Job = Directed Acyclic Graph
- Processing vertices
- Channels (file, pipe, shared memory)
- Inputs
- Outputs
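The idea of a job as a directed acyclic graph can be illustrated with a small single-process Python sketch. It does not use or imitate the Dryad API; `run_job`, the vertex names, and the sample inputs are all hypothetical, and "channels" are reduced to plain in-memory edges. It requires Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter

def run_job(vertices, channels, inputs):
    """Run each vertex once, after all of its upstream vertices.

    vertices: name -> function taking a list of upstream outputs
    channels: list of (source, destination) edges
    inputs:   name -> input data for vertices with no upstream channel
    """
    preds = {name: [] for name in vertices}
    for src, dst in channels:
        preds[dst].append(src)
    results = {}
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    for name in TopologicalSorter(preds).static_order():
        if preds[name]:
            args = [results[p] for p in preds[name]]
        else:
            args = [inputs[name]]
        results[name] = vertices[name](args)
    return results

vertices = {
    "read_a": lambda args: [int(s) for s in args[0]],
    "read_b": lambda args: [int(s) for s in args[0]],
    "merge": lambda args: sum(sum(part) for part in args),
}
channels = [("read_a", "merge"), ("read_b", "merge")]
inputs = {"read_a": ["1", "2"], "read_b": ["3"]}
out = run_job(vertices, channels, inputs)
```

In a real system the two reader vertices could run on different machines and the channels would be files, pipes, or shared memory; the dependency ordering is the part this sketch keeps.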

13 Runtime
- Services
  - Name server
  - Daemon
- Job Manager (JM)
  - Centralized coordinating process
  - User application constructs the graph
  - Linked with Dryad libraries for scheduling vertices
- Vertex executable
  - Dryad libraries communicate with the JM
  - User application sees channels in/out
  - Arbitrary application code; can use the local file system

14 Graph operators

15 Hive
- Developed by Facebook (open source)
- Mimics the SQL language
- Built on Hadoop/MapReduce

16 Hive data model: table etc.
- Table
  - Similar to a DB table, stored in Hadoop directories
  - Built-in compression and serialization/deserialization
- Partitions
  - Groups in the table
  - Subdirectory in the table directory
- Buckets
  - Files in the partition directory
  - Key (column) based partitioning
- Layout: /table/partition/bucket1
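The /table/partition/bucket layout can be made concrete with a short Python sketch. The function `bucket_path` is hypothetical, the CRC32 hash is an assumption chosen only because it is deterministic (Hive uses its own hash function and file naming), and the `ds=2009-03-20` partition name is just an illustrative value.

```python
import zlib

def bucket_path(table, partition, key, num_buckets):
    """Map a row to a file path under the table's partition directory."""
    # Hash the bucketing column value; crc32 is an assumption of this sketch.
    bucket = zlib.crc32(str(key).encode("utf-8")) % num_buckets
    return f"/{table}/{partition}/bucket{bucket}"

p1 = bucket_path("status_updates", "ds=2009-03-20", 42, 4)
p2 = bucket_path("status_updates", "ds=2009-03-20", 42, 4)
```

Rows with the same key always land in the same bucket file, which is the property that key-based bucketing relies on.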

17 Hive data model: column types
- Integers, floating-point numbers, generic strings, dates, and booleans
- Nestable collection types: array and map

18

19 Architecture
- The metastore stores the schemas of the databases. It uses a non-HDFS data store.

20 Query processing
- Steps (similar to a DBMS):
  - Parse
  - Semantic analyzer
  - Logical plan generator (algebra tree)
  - Optimizer
  - Physical plan generator (to MapReduce jobs)

21 Operations: DDL and DML
- HiveQL: SQL-like, with slightly different syntax
- User-defined filtering and aggregation functions
  - Java only
- Map/reduce plugin for stream processing
  - Can be implemented in any language
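A streaming plugin is, in essence, a program that reads rows from standard input and writes rows to standard output. The Python sketch below assumes rows arrive as tab-separated lines of the form userid, status; the functions `transform_row` and `run` and the sample rows are hypothetical, and the word-count logic is only an example of per-row processing.

```python
import io

def transform_row(line):
    """Turn one 'userid<TAB>status' row into 'userid<TAB>word_count'."""
    userid, status = line.rstrip("\n").split("\t", 1)
    return f"{userid}\t{len(status.split())}"

def run(stdin, stdout):
    """Stream rows from stdin to stdout, one output row per input row."""
    for line in stdin:
        if line.strip():             # skip blank lines
            stdout.write(transform_row(line) + "\n")

# Demonstration with in-memory streams instead of real stdin/stdout.
demo_in = io.StringIO("1\thello world\n2\tgoing to class today\n")
demo_out = io.StringIO()
run(demo_in, demo_out)
```

A standalone script would call `run(sys.stdin, sys.stdout)` after `import sys`; because the contract is only lines in and lines out, the same plugin could be written in any language.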

22 Example
- Facebook status updates
  - Tables:
      status_updates(userid int, status string, ds string)
      profiles(userid int, school string, gender int)
- Operations
  - Load data:
      LOAD DATA LOCAL INPATH '/logs/status_updates' INTO TABLE status_updates PARTITION (ds='2009-03-20')
  - Count status updates by school and by gender
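The "count status updates by school and by gender" operation amounts to a join on userid followed by two group-by counts. The Python sketch below shows that logic on in-memory rows; it is not HiveQL, `count_updates` is a hypothetical name, and the sample rows are invented for illustration.

```python
from collections import Counter

def count_updates(status_updates, profiles, ds):
    """Join updates to profiles on userid, then count per school and per gender."""
    by_user = {p["userid"]: p for p in profiles}
    by_school, by_gender = Counter(), Counter()
    for s in status_updates:
        if s["ds"] != ds:                  # keep only the requested partition
            continue
        profile = by_user.get(s["userid"])
        if profile is None:                # inner join: drop unmatched updates
            continue
        by_school[profile["school"]] += 1
        by_gender[profile["gender"]] += 1
    return by_school, by_gender

# Made-up sample rows following the two table schemas.
status_updates = [
    {"userid": 1, "status": "hi", "ds": "2009-03-20"},
    {"userid": 2, "status": "exam week", "ds": "2009-03-20"},
    {"userid": 1, "status": "done", "ds": "2009-03-20"},
    {"userid": 3, "status": "old", "ds": "2009-03-19"},
]
profiles = [
    {"userid": 1, "school": "A", "gender": 0},
    {"userid": 2, "school": "B", "gender": 1},
    {"userid": 3, "school": "A", "gender": 1},
]
by_school, by_gender = count_updates(status_updates, profiles, "2009-03-20")
```

Both counts are computed in one pass over the joined rows, so the join work is shared rather than repeated for each grouping.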

23 More query examples

24 Query examples

25 Query examples – using Hadoop streaming

