
1 Survey on Programming and Tasking in Cloud Computing Environments PhD Qualifying Exam Zhiqiang Ma Supervisor: Lin Gu Feb. 18, 2011

2 Outline: Introduction; Approaches (application framework level, language level, instruction level); Our work: MRlite; Conclusion

3 Cloud computing Internet services are the most popular applications nowadays, serving millions of users with large and complex computation: Google already processed 20 TB of data in 2004. Cloud computing provides massive computing resources, available on demand. It is a promising model to support processing large datasets housed on clusters.

4 How to program and task? Challenges: parallelize the execution; schedule the large-scale distributed computation; handle faults; achieve high performance; ensure fairness. Programming models for the Grid do not automatically parallelize users' programs, and they pass the fault-tolerance work on to applications.

5 Outline: Introduction; Approaches (application framework level, language level, instruction level); Our work: MRlite; Conclusion

6 Approaches
Approach | Advantage | Disadvantage
Application framework level | |
Language level | |
Instruction level | |

7 MapReduce MapReduce is a parallel computing framework for large-scale data processing. It has been successfully used in datacenters comprising commodity computers and has been a fundamental piece of software in the Google architecture for many years. An open source variant already exists, Hadoop, which is widely used in solving data-intensive problems.

8 MapReduce Map and Reduce are higher-order functions. Map applies an operation to all elements in a list; Reduce is like "fold", aggregating the elements of a list. Example: with m: x -> x^2, mapping m over [1, 2, 3, 4, 5] yields [1, 4, 9, 16, 25]; with r: + and initial value 0, reducing yields the running values 1, 5, 14, 30, 55, so 1^2 + 2^2 + 3^2 + 4^2 + 5^2 = 55.
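A minimal Python sketch of this example, using the built-in map and functools.reduce (names and structure are illustrative, not from the original slides):

from functools import reduce

squares = map(lambda x: x * x, [1, 2, 3, 4, 5])     # m: x -> x^2
total = reduce(lambda acc, y: acc + y, squares, 0)  # r: +, initial value 0
print(total)  # 55 = 1^2 + 2^2 + 3^2 + 4^2 + 5^2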

9 MapReduce's data flow [diagram]

10 MapReduce Massive parallel processing made simple. Example: word count. Map parses a document and generates <word, 1> pairs; Reduce receives all pairs for a specific word and counts them.
Map: // D is a document
  for each word w in D:
    output <w, 1>
Reduce: // for key w
  count = 0
  for each input item:
    count = count + 1
  output <w, count>
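As a sketch, the same word count expressed as runnable Python (the grouping dictionary stands in for MapReduce's shuffle phase; all names are illustrative):

from collections import defaultdict

def map_fn(document):
    # Parse a document and emit a <word, 1> pair per word
    return [(w, 1) for w in document.split()]

def reduce_fn(word, values):
    # Receive all pairs for one word and count them
    return word, sum(values)

docs = ["the quick brown fox", "the lazy dog"]
groups = defaultdict(list)  # stands in for the shuffle phase
for d in docs:
    for word, one in map_fn(d):
        groups[word].append(one)
print(dict(reduce_fn(w, v) for w, v in groups.items()))  # {'the': 2, ...}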

11 MapReduce easily scales up: input files, Map phase, intermediate files, Reduce phase, output files.

12 MapReduce [diagram: input, computation, output]

13 Dryad Dryad is a general-purpose execution environment for distributed, data-parallel applications. It concentrates on throughput, not latency. An application written in Dryad is modeled as a directed acyclic graph (DAG); many programs can be represented as a distributed execution graph.

14 Dryad [diagram: processing vertices connected by channels (file, pipe, shared memory), with inputs and outputs]

15 Dryad Concurrency arises from vertices running simultaneously across multiple machines. Vertex subroutines are usually quite simple as sequential programs. Users have control over the communication graph, and each vertex can have multiple inputs and outputs. A sketch of the vertex-and-channel model appears below.
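A minimal Python sketch of the vertex-and-channel idea (purely illustrative; this is not the Dryad API, which is C++-based):

class Vertex:
    def __init__(self, name, fn):
        self.name, self.fn = name, fn
        self.inputs, self.outputs = [], []

def channel(src, dst):
    # In Dryad a channel may be a file, a pipe, or shared memory
    src.outputs.append(dst)
    dst.inputs.append(src)

# A two-stage graph: filter the inputs, then sort the matches
grep = Vertex("grep", lambda lines: [l for l in lines if "error" in l])
sort_stage = Vertex("sort", sorted)
channel(grep, sort_stage)  # an edge of the directed acyclic graph
print(sort_stage.fn(grep.fn(["error: disk", "ok", "error: net"])))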

16 Approaches
Approach | Advantage | Disadvantage
Application framework level | Relieves users of the details of distributing the execution; automatically parallelizes users' programs | Programs must follow the specific model
Language level | |
Instruction level | |

17 Tasking of execution Performance: locality is crucial; speculative execution. Fairness: the same cluster is shared by multiple users, and small jobs require short response times while throughput is important for big jobs. Correctness: fault tolerance.

18 Locality and fairness Locality is crucial because bandwidth is a scarce resource; input data are stored, with duplication, in the same cluster that runs the executions. Fairness: short jobs require short response times. Locality and fairness conflict with each other.

19 FIFO scheduler in Hadoop Jobs wait in a queue in priority order (FIFO by default). When slots become available, the scheduler assigns them, in priority order, to tasks that have local data, and limits the assignment of non-local tasks to optimize locality. A sketch appears below.
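A schematic Python sketch of this slot-assignment policy (illustrative job and task records, not Hadoop's actual scheduler code; the one-non-local-task limit matches the diagram two slides down):

from dataclasses import dataclass, field

@dataclass
class Task:
    local_nodes: set

@dataclass
class Job:
    pending: list
    nonlocal_dispatched: int = 0

def assign_slot(node, queue):
    # Scan jobs in FIFO priority order, preferring tasks with local data
    for job in queue:
        for task in job.pending:
            if node in task.local_nodes:
                job.pending.remove(task)
                return task
    # Fall back to a non-local task from the head job, one at a time
    head = queue[0]
    if head.pending and head.nonlocal_dispatched < 1:
        head.nonlocal_dispatched += 1
        return head.pending.pop(0)
    return None

jobs = [Job([Task({"node1"})]), Job([Task({"node2"})])]
print(assign_slot("node2", jobs))  # second job's task: it has local data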

20 FIFO scheduler [diagram: a job queue with a 2-task job ahead of a 1-task job, dispatched to Nodes 1-4]

21 FIFO scheduler – locality optimization [diagram: a 4-task job ahead of a 1-task job on Nodes 1-4; one node is far away in the network topology, and the scheduler dispatches only one non-local task at a time]

22 Problem: fairness [diagram: a 3-task job at the head of the queue occupies the slots on Nodes 1-4 while later jobs wait]

23 Problem: response time [diagram: a small job with only 1 task waits behind a 3-task job for slots on Nodes 1-4]

24 Fair scheduling Assign free slots to the job that has the fewest running tasks. Strict fairness: running jobs get nearly equal numbers of slots, and small jobs finish quickly. A sketch appears below.
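A minimal sketch of this policy (illustrative Python, not Hadoop's fair scheduler implementation):

jobs = [
    {"name": "big", "running": 3, "pending": ["t4", "t5"]},
    {"name": "small", "running": 0, "pending": ["t1"]},
]

def fair_assign(jobs):
    # Give the free slot to the job with the fewest running tasks
    candidates = [j for j in jobs if j["pending"]]
    if not candidates:
        return None
    job = min(candidates, key=lambda j: j["running"])
    job["running"] += 1
    return job["name"], job["pending"].pop(0)

print(fair_assign(jobs))  # ('small', 't1'): the small job gets the slot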

25 Fair Scheduling [diagram: slots on Nodes 1-4 divided nearly evenly among the jobs in the queue]

26 Problem: locality [diagram: the job due a slot under fair sharing has no local data on the free node among Nodes 1-4]

27 Delay Scheduling Skip a job that cannot launch a local task; this relaxes fairness slightly. Allow a job to launch non-local tasks if it has been skipped long enough, to avoid starvation. A sketch appears below.
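A schematic sketch of delay scheduling (illustrative Python; the skip-count threshold of 2 is taken from the example on the next slide):

THRESHOLD = 2  # skip-count limit, as in the example on the next slide

def delay_schedule(node, jobs):
    # jobs are ordered by fairness (most underserved first)
    for job in jobs:
        local = [t for t in job["pending"] if node in t["locations"]]
        if local:
            job["skipcount"] = 0          # got a local task: reset counter
            job["pending"].remove(local[0])
            return local[0]
        job["skipcount"] += 1             # no local data here: skip the job
        if job["skipcount"] >= THRESHOLD:
            job["skipcount"] = 0          # skipped long enough: go non-local
            return job["pending"].pop(0)
    return None

job = {"pending": [{"locations": {"node1"}}], "skipcount": 0}
print(delay_schedule("node2", [job]))  # None: job skipped once
print(delay_schedule("node2", [job]))  # non-local task after threshold hit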

28 Delay Scheduling [diagram: the job at the head of the queue is skipped as Nodes 1-4 offer slots without local data; its skip count rises 0, 1, 2 against a threshold of 2, after which it may launch a non-local task] Waiting time is short because tasks finish quickly, and the skipped job stays at the head of the queue.

29 "Fault" Tolerance Nodes fail: re-run tasks. Nodes are slow (stragglers): run backup tasks (speculative execution) to minimize the job's response time. This is especially important for short jobs.

30 Speculative execution The scheduler schedules backup executions of the remaining in-progress tasks. A task is marked as completed whenever either the primary or the backup execution completes. This improves job response time by 44% according to Google's experiments.

31 Speculative execution mechanism It seems a simple problem, but resources for speculative tasks are not free. How do we choose nodes to run speculative tasks? How do we distinguish stragglers from nodes that are only slightly slower? Stragglers should be found early.

32 Hadoop's scheduler Hadoop starts speculative tasks based on a simple heuristic: comparing each task's progress to the average. This assumes a homogeneous environment, where the default scheduler works well. The assumption breaks in virtualized "utility computing" environments such as EC2. How can speculative execution (backup tasks) be performed robustly in heterogeneous environments?

33 Speculative execution in Hadoop When there are no "higher priority" tasks, Hadoop looks for a task to execute speculatively. Assumption: there is no cost to launching a speculative task. It compares each task's progress to the average progress. Assumption: nodes perform similarly ("a slow node is faulty"; "nodes that ask for new tasks are fast"). In "utility computing", nodes may be slightly (2-3x) slower, which may not hurt the response time, and nodes that ask for tasks are not necessarily fast.

34 Speculative execution in Hadoop The threshold for speculative execution is (average progress score of each category of tasks) - 0.2. Tasks below the threshold are treated as "equally slow", and candidates are ranked by locality, so the wrong tasks may be chosen: should the scheduler back up a 35%-completed 2x-slower task with data available on an idle node, or a 5%-completed 10x-slower task? Too many speculative tasks cause thrashing, taking resources away from useful tasks.
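A sketch of this heuristic (illustrative Python, not Hadoop's Java source):

tasks = [
    {"id": 1, "progress": 0.9},
    {"id": 2, "progress": 0.8},
    {"id": 3, "progress": 0.3},  # a straggler
]

def speculation_candidates(tasks):
    avg = sum(t["progress"] for t in tasks) / len(tasks)
    threshold = avg - 0.2
    # Everything below the threshold is treated as "equally slow";
    # Hadoop then ranks these candidates by locality, not by slowness.
    return [t["id"] for t in tasks if t["progress"] < threshold]

print(speculation_candidates(tasks))  # [3]: avg 0.667, threshold ~0.467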

35 Speculative execution in Hadoop Progress score: for a map task, the fraction of input data read; for a reduce task, three phases (1/3 each), weighted by the fraction of data processed. This leads to incorrect speculation of reduce tasks: the copy phase takes most of the time but accounts for only 1/3. If 30% of the tasks finish quickly and 70% are in the copy phase, the average progress score is 30%*1 + 70%*(1/3) = 53%, so the threshold is 33%.

36 LATE Longest Approximate Time to End. Principles: rank candidates by longest time to end, choosing the right task, the one that actually hurts the job's response time (slow nodes can be utilized as long as they do not hurt the response time); only launch speculative tasks on fast nodes, since not every node that asks for a task is fast; cap speculative tasks, to limit resource contention and thrashing.

37 LATE algorithm If a node asks for a new task and there are fewer than SpeculativeCap speculative tasks running (the cap on speculative tasks): ignore the request if the node's total progress is below SlowNodeThreshold (only launch speculative tasks on fast nodes); rank currently running tasks by estimated time left (longest time to end first); launch a copy of the highest-ranked task whose progress rate is below SlowTaskThreshold. A sketch appears below.
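A compact Python sketch of the LATE decision rule as stated on this slide (illustrative; the three knob values below are placeholders, and in the LATE paper the thresholds are percentiles of observed node and task performance; the time-left estimate is the formula from the appendix):

SPECULATIVE_CAP = 10        # placeholder values for the three knobs
SLOW_NODE_THRESHOLD = 0.5
SLOW_TASK_THRESHOLD = 0.5

def time_left(t):
    # estimated time left = (1 - progress score) / progress rate
    rate = t["progress"] / t["elapsed"]
    return (1 - t["progress"]) / rate

def late_choose(node_progress, running, num_speculative):
    if num_speculative >= SPECULATIVE_CAP:
        return None                      # cap speculative tasks
    if node_progress < SLOW_NODE_THRESHOLD:
        return None                      # only use fast nodes for backups
    for t in sorted(running, key=time_left, reverse=True):
        if t["progress"] / t["elapsed"] < SLOW_TASK_THRESHOLD:
            return t                     # back up the slowest-to-finish task
    return None

running = [{"id": "a", "progress": 0.2, "elapsed": 100},
           {"id": "b", "progress": 0.9, "elapsed": 100}]
print(late_choose(0.8, running, 3)["id"])  # 'a': longest estimated time left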

38 Approaches
Approach | Advantage | Disadvantage
Application framework level | Relieves users of the details of distributing the execution; automatically parallelizes users' programs | Programs must follow the specific model
Language level | |
Instruction level | |

39 Language level approach Programming frameworks are still not clear and compact enough, and traditional programming languages give no special focus to high parallelism on large computing clusters. A new language can be clear, compact and expressive, automatically parallelize "normal" programs, and give users a comfortable way to think about data processing problems on large distributed datasets.

40 Sawzall Sawzall is an interpreted, procedural, high-level programming language that exploits high parallelism and automates the analysis of very large data sets. It gives users a way to design distributed data processing programs clearly and expressively.

41 Overall flow Filtering (the Map step): analyzes each record individually; expressed in Sawzall. Aggregation (the Reduce step): collates and reduces the intermediate values; performed by predefined aggregators.

42 An example Find the most-linked-to page of each domain:

max_pagerank_url: table maximum(1) [domain: string] of url: string weight pagerank: int;
doc: Document = input;
emit max_pagerank_url[domain(doc.url)] <- doc.url weight doc.pagerank;

The table is a "maximum" aggregator that keeps the highest-weighted value; it stores a url, indexed by domain and weighted by pagerank. input is a pre-defined variable, initialized by Sawzall and interpreted into the Document type. emit sends an intermediate value to the aggregator.

43 Unusual features Sawzall runs on one record at a time: nothing in the language lets one input record influence another. The emit statement is the only output primitive. This explicit line between filtering and aggregation enables a high degree of parallelism, even though it is hidden from the language.

44 Approaches
Approach | Advantage | Disadvantage
Application framework level | Relieves users of the details of distributing the execution; automatically parallelizes users' programs | Programs must follow the specific model
Language level | Clearer, more expressive; a comfortable way to program | More restrictive programming model
Instruction level | |

45 Instruction level approach Provides instruction-level abstractions and compatibility for users' applications. It may adopt a traditional ISA such as x86/x86-64, run traditional applications without any modification, and make it easier to migrate applications to cloud computing environments.

46 Amazon Elastic Compute Cloud (EC2) Provides virtual machines that run traditional OSes, so traditional programs can work on EC2. An Amazon Machine Image (AMI) boots instances; it is the unit of deployment, a packaged-up environment. Users design and implement the application logic in the AMI; EC2 handles the deployment and resource allocation. A sketch of booting an instance appears below.
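For illustration, booting an instance from an AMI with the boto3 SDK (a modern API that postdates this survey; the region, AMI ID, and instance type are placeholders):

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
resp = ec2.run_instances(
    ImageId="ami-12345678",   # placeholder AMI: the packaged-up environment
    InstanceType="m1.small",  # placeholder instance type
    MinCount=1,
    MaxCount=1,
)
print(resp["Instances"][0]["InstanceId"])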

47 vNUMA A virtual shared-memory multiprocessor machine built from commodity workstations. It makes the computational power available to legacy applications and OSes. [diagram: vNUMA virtualizes several physical machines (PMs) into a single VM]

48 Architecture A hypervisor runs on each node. CPU: virtual CPUs are mapped to real CPUs on the nodes. Memory: divided between the nodes in equal-sized portions, with each node managing a subset of the pages. A toy sketch of the division appears below.
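A toy Python sketch of dividing guest-physical pages between nodes in equal-sized portions (purely illustrative; vNUMA's real placement policy is more involved):

NUM_NODES = 4
NUM_PAGES = 1 << 20                 # guest-physical pages in the machine
PAGES_PER_NODE = NUM_PAGES // NUM_NODES

def managing_node(page):
    # Node i manages pages [i*PAGES_PER_NODE, (i+1)*PAGES_PER_NODE)
    return page // PAGES_PER_NODE

print(managing_node(0), managing_node(NUM_PAGES - 1))  # 0 3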

49 Memory mapping [diagram: application, OS, VM, VMM, and physical machines in the translation path] When an application reads *a: the OS translates a, an address in the application's virtual memory, to the VM's physical memory address b; the VMM maps b to the real physical address c on some node; the read then finds *c.
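The two-level translation as a toy Python sketch (dictionaries stand in for page tables; all addresses and names are illustrative):

guest_page_table = {0xA000: 0xB000}      # app virtual a -> VM physical b
host_map = {0xB000: ("node2", 0xC000)}   # VM physical b -> (node, real c)

def vnuma_read(virtual_addr):
    # First level: application virtual address to VM physical address
    b = guest_page_table[virtual_addr & ~0xFFF] | (virtual_addr & 0xFFF)
    # Second level: VM physical address to a real address on some node
    node, c_page = host_map[b & ~0xFFF]
    return node, c_page | (b & 0xFFF)    # fetch *c on that node

node, c = vnuma_read(0xA123)
print(node, hex(c))  # node2 0xc123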

50 Approaches
Approach | Advantage | Disadvantage
Application framework level | Relieves users of the details of distributing the execution; automatically parallelizes users' programs | Programs must follow the specific model
Language level | Clearer, more expressive; a comfortable way to program | More restrictive programming model
Instruction level | Supports traditional applications | Users handle the tasking; hard to scale up

51 Outline: Introduction; Approaches (application framework level, language level, instruction level); Our work: MRlite; Conclusion

52 Our work We analyze MapReduce's design and use a case study to probe its limitations: one-way scalability, and difficulty handling dynamic, interactive and semantic-rich applications. We design a new parallelization framework, MRlite, able to scale "up" like MapReduce and scale "down" to process moderate-size data, with low latency, massive parallelism, and small run-time system overhead. Goal: design a general parallelization framework and programming paradigm for cloud computing.

53 Architecture of MRlite [diagram: application, MRlite client, MRlite master/scheduler, slaves, and a high-speed distributed storage layer, with data flow and command flow between them] Linked together with the app, the MRlite client library accepts calls from the app and submits jobs to the master. The MRlite master accepts jobs from clients and schedules them to execute on slaves. Distributed slave nodes accept tasks from the master and execute them. The high-speed distributed storage stores intermediate files.

54 Result The evaluation shows that MRlite is one order of magnitude faster than Hadoop on problems that MapReduce has difficulty in handling.

55 Outline: Introduction; Approaches (application framework level, language level, instruction level); Our work: MRlite; Conclusion

56 Conclusion Cloud computing needs a general programming framework. Cloud computing shall not be a platform to run just simple OLAP applications; it is important to support complex computation and even OLTP on large data sets. We design MRlite, a general parallelization framework for cloud computing: it handles applications with complex logic flow and data dependencies, mitigates the one-way scalability problem, and is able to handle all MapReduce tasks with comparable (if not better) performance.

57 Conclusion Emerging computing platforms increasingly emphasize parallelization capability, such as GPGPU. MRlite respects applications' natural logic flow and data dependencies. This modularization of parallelization capability apart from application logic enables MRlite to integrate GPGPU processing very easily (future work).

58 Thank you!

59 Appendix

60 LATE: Estimate finish times (Appendix)
progress rate = progress score / execution time
estimated time left = (1 - progress score) / progress rate = (1 / progress score - 1) x execution time
The smaller the progress score, the longer the estimated time left.
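The same estimate as a small Python helper (names are illustrative):

def estimated_time_left(progress_score, execution_time):
    # (1 - PS) / (PS / T)  ==  (1/PS - 1) * T
    progress_rate = progress_score / execution_time
    return (1 - progress_score) / progress_rate

print(estimated_time_left(0.25, 60))  # 180.0: 3x the elapsed time remains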

61 LATE: Solving the problems in Hadoop's default scheduler (Appendix) In "utility computing", nodes may be slightly (2-3x) slower, which may not hurt the response time, and nodes that ask for tasks are not necessarily fast. Too many speculative tasks cause thrashing. Ranking candidates by locality may choose the wrong tasks. Reduce tasks are speculated incorrectly.

