Cloud Computing Other High-level parallel processing languages Keke Chen.


Outline
• Sawzall
• Dryad and DryadLINQ (Microsoft, abandoned)
• Hive

Sawzall
• Simplifies MapReduce programming
• Filters + aggregators: filters play the role of mappers, aggregators the role of reducers

Example
• The filter statements act as mappers; the aggregator tables act as reducers
• The input record is converted to float

Input
• A Sawzall program works on a single record at a time, acting as a filter over the data stream
• The input can be parsed into values (e.g., float) or data structures
    x: float = input;    (variable: type = input)

Aggregators
• Definition: table agg_name of data_type/variable
• Examples:
    c: table collection of string;
    S: table sample(100) of string;
    S: table sum of {count: int, revenue: float}
• More aggregators: maximum, quantile, top, unique
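The filter/aggregator split can be imitated in a few lines of Python. This is only a sketch of the idea, not Sawzall itself: the function names `sawzall_filter`, `aggregate_sum`, and `run`, and the table names `count`, `total`, and `sum_of_squares`, are hypothetical. The filter sees one record at a time and emits values; the aggregator, standing in for `table sum`, merges the emitted values.

```python
def sawzall_filter(raw):
    """Process ONE record and emit (table_name, value) pairs.

    Mirrors `x: float = input;` followed by `emit` statements.
    Table names here are hypothetical.
    """
    x = float(raw)                    # convert the input record to float
    yield ("count", 1)                # emit count <- 1;
    yield ("total", x)                # emit total <- x;
    yield ("sum_of_squares", x * x)   # emit sum_of_squares <- x * x;


def aggregate_sum(emits):
    """Stand-in for `table sum of ...`: add up the values emitted to each table."""
    tables = {}
    for name, value in emits:
        tables[name] = tables.get(name, 0) + value
    return tables


def run(records):
    """Run the filter over every record, then aggregate all emitted values."""
    emits = (e for raw in records for e in sawzall_filter(raw))
    return aggregate_sum(emits)


result = run(["1.0", "2.0", "3.0"])
print(result)
```

Because the filter never looks at more than one record, the records can be split across machines and the partial sums merged afterwards, which is what makes the model easy to parallelize.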

Indexed aggregators
• Similar to "group by"; the index is the group id
• Example:
    t1: table sum[country: string] of int;
    country: string = input;
    emit t1[country] <- 1;
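The "group by" behavior of an indexed aggregator can be sketched in Python with a counter keyed by the index. The function name `count_by_country` is hypothetical, and each input record is assumed to be a plain country string, as in the example above.

```python
from collections import Counter


def count_by_country(records):
    """Imitate `t1: table sum[country: string] of int`."""
    t1 = Counter()
    for country in records:   # country: string = input;
        t1[country] += 1      # emit t1[country] <- 1;
    return dict(t1)


counts = count_by_country(["US", "CN", "US", "DE"])
print(counts)
```

Each distinct index value gets its own running sum, so the result has one entry per group.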

More examples
    proto "querylog.proto"
    queries_per_degree: table sum[lat: int][lon: int] of int;
    log_record: queryLogProto = input;
    loc: Location = locationinfo(log_record.ip);
    emit queries_per_degree[int(loc.lat)][int(loc.lon)] <- 1;

Performance
• Single-CPU speed: Sawzall is about 51 times slower than compiled C++

Performance

Dryad and DryadLINQ
• Dryad provides a low-level parallel data-flow processing interface
    - Acyclic data-flow graphs
    - Data communication methods include pipes, files, messages, and shared memory
• DryadLINQ
    - A high-level language for application developers
    - Hides the data-flow details

Job = Directed Acyclic Graph
• Processing vertices
• Channels (file, pipe, shared memory)
• Inputs
• Outputs
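To make the "job = DAG" idea concrete, here is a minimal single-process sketch in Python. It is not Dryad's API: `run_job`, the vertex names, and the use of in-memory lists as "channels" are all assumptions made for illustration. Vertices run in topological order, and each vertex receives the outputs of its upstream vertices.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_job(vertices, edges):
    """Run a toy data-flow job.

    vertices: dict mapping vertex name -> function(list_of_upstream_outputs)
    edges:    list of (src, dst) pairs; each pair plays the role of a channel
    """
    preds = {name: [] for name in vertices}
    for src, dst in edges:
        preds[dst].append(src)

    results = {}
    # static_order() yields every vertex after all of its predecessors
    # and raises graphlib.CycleError if the graph is not acyclic.
    for name in TopologicalSorter(preds).static_order():
        upstream = [results[p] for p in preds[name]]
        results[name] = vertices[name](upstream)
    return results


# Hypothetical job: two input vertices, one merge vertex, one output vertex.
vertices = {
    "read_a": lambda ups: [1, 2, 3],
    "read_b": lambda ups: [4, 5],
    "merge": lambda ups: sorted(x for u in ups for x in u),
    "total": lambda ups: sum(ups[0]),
}
edges = [("read_a", "merge"), ("read_b", "merge"), ("merge", "total")]

results = run_job(vertices, edges)
print(results["merge"], results["total"])
```

A real runtime would place vertices on different machines and pick a channel type per edge; the ordering constraint imposed by the graph stays the same.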

Runtime
• Services
    - Name server
    - Daemon
• Job Manager (JM)
    - Centralized coordinating process
    - User application constructs the graph
    - Linked with Dryad libraries for scheduling vertices
• Vertex executable
    - Dryad libraries communicate with the JM
    - User application sees channels in/out
    - Arbitrary application code; can use the local file system

Graph operators

Hive
• Developed by Facebook (open source)
• Mimics the SQL language
• Built on Hadoop/MapReduce

Hive data model: table etc.
• Table
    - Similar to a DB table, stored in Hadoop directories
    - Built-in compression and serialization/deserialization
• Partitions
    - Groups in the table
    - Subdirectories in the table directory
• Buckets
    - Files in the partition directory
    - Key (column) based partitioning
• Layout: /table/partition/bucket1
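The table/partition/bucket layout can be illustrated with a small Python helper that maps a row to a path. This is a sketch under stated assumptions, not Hive's implementation: the function `bucket_path`, the CRC32 hash, and the `bucketN` file names are illustrative choices, and Hive's real hash function and file naming differ.

```python
import zlib


def bucket_path(table, partition_col, partition_value, key, num_buckets):
    """Return an illustrative /table/partition/bucket path for one row.

    The bucket is chosen by hashing the bucketing key modulo the bucket
    count. CRC32 is used only to get a deterministic hash for the example.
    """
    bucket = zlib.crc32(str(key).encode("utf-8")) % num_buckets
    return f"/{table}/{partition_col}={partition_value}/bucket{bucket}"


path = bucket_path("status_updates", "ds", "2009-03-20", key=42, num_buckets=4)
print(path)
```

Rows with the same key always land in the same bucket file, which is what lets a query read only the relevant partition directory and, for some operations, only the relevant bucket.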

Hive data model: column types
• Primitive types: integers, floating-point numbers, generic strings, dates, and booleans
• Nestable collection types: array and map

Architecture
• The metastore stores the schemas of databases; it uses a non-HDFS data store

Query processing
• Steps (similar to a DBMS)
    - Parser
    - Semantic analyzer
    - Logical plan generator (algebra tree)
    - Optimizer
    - Physical plan generator (to MapReduce jobs)

Operations: DDL and DML
• HiveQL: SQL-like, with slightly different syntax
• User-defined filtering and aggregation functions
    - Java only
• Map/reduce plugin for stream processing
    - Can be implemented in any language
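A streaming plugin is just a program that reads rows as tab-separated lines on standard input and writes tab-separated lines to standard output. The Python sketch below shows that shape. The `transform` function and the three-column input (userid, status, ds) are assumptions chosen for illustration, and in-memory streams are used here so the example runs on its own.

```python
import io


def transform(stream_in, stream_out):
    """Read tab-separated rows and write (userid, word_count) rows.

    Assumes each input line has exactly three columns: userid, status, ds.
    """
    for line in stream_in:
        userid, status, ds = line.rstrip("\n").split("\t")
        stream_out.write(f"{userid}\t{len(status.split())}\n")


# In a real streaming script this would be called as
# transform(sys.stdin, sys.stdout); StringIO stands in for both here.
demo_in = io.StringIO("1\thello world\t2009-03-20\n2\thi\t2009-03-20\n")
demo_out = io.StringIO()
transform(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

Since the contract is only lines in and lines out, the same role can be filled by a script in any language.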

Example
• Facebook status updates
    - Tables:
        status_updates(userid int, status string, ds string)
        profiles(userid int, school string, gender int)
• Operations
    - Load data:
        LOAD DATA LOCAL INPATH '/logs/status_updates' INTO TABLE status_updates PARTITION (ds='2009-03-20')
    - Count status updates by school and by gender
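The "count status updates by school and by gender" operation amounts to a join of the two tables on userid followed by two group-by counts. The Python sketch below spells out that logic on in-memory tuples. It illustrates the semantics only, not how Hive executes the query; the function `count_updates` and the sample rows are hypothetical.

```python
from collections import Counter


def count_updates(status_updates, profiles, ds):
    """Join status_updates with profiles on userid for one partition (ds),
    then count updates per school and per gender."""
    by_user = {userid: (school, gender) for userid, school, gender in profiles}
    by_school, by_gender = Counter(), Counter()
    for userid, status, row_ds in status_updates:
        if row_ds != ds or userid not in by_user:
            continue                  # wrong partition, or no matching profile
        school, gender = by_user[userid]
        by_school[school] += 1
        by_gender[gender] += 1
    return dict(by_school), dict(by_gender)


# Hypothetical sample rows matching the two table schemas.
status_updates = [
    (1, "a", "2009-03-20"),
    (1, "b", "2009-03-20"),
    (2, "c", "2009-03-20"),
    (3, "d", "2009-03-19"),
]
profiles = [(1, "MIT", 0), (2, "CMU", 1), (3, "MIT", 1)]

school_counts, gender_counts = count_updates(status_updates, profiles, "2009-03-20")
print(school_counts, gender_counts)
```

Both counts come from a single pass over the joined rows, so the join does not need to be repeated for each grouping.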

More query examples

Query examples

Query examples – using hadoopstreaming
