1
Presentation Overview 1. Models of Parallel Computing: The evolution of the conceptual framework behind parallel systems. 2. Grid Computing: The creation of a structure within the parallel framework to facilitate efficient use of shared resources. 3. Cilk Language and Scheduler: A method for scheduling parallel tasks on a small low-latency network, and a programming language to provide parallel computing with time guarantees.
2
Presentation Overview [Diagram of topics: Parallel Computing, Grid Computing, Scheduling, Resource Discovery, Cilk, Set Matching, P2P Techniques, Routing Techniques]
3
Models of Parallel Computing All models of parallel computing can be subdivided into four broad categories: synchronous shared memory models, asynchronous shared memory models, synchronous independent memory models, and asynchronous independent memory models.
4
What defines a synchronous model? Generally speaking, in a synchronous system: All memory operations take exactly unit time. All processors that wish to perform an operation at time t do so simultaneously. Memory access conflicts are resolved using standard concurrency techniques.
5
Synchronous Shared Memory: PRAM Consists of P RAM processors, each with a local register set. Unbounded global shared memory. Processors operate synchronously.
6
PRAM Properties Processing units have no memory of their own beyond their local registers. Processing units communicate only via global memory. Assumes synchronous memory access. Each processor has random access to any global memory cell in unit time.
7
Problems with PRAM Problem 1: Assumes that the processors act synchronously without any overhead. Problem 2: Assumes 100% processor and memory reliability. Problem 3: Does not exploit caching or locality (all operations are “performed” in the main memory). Problem 4: Model is unrealistic for real computers.
8
Asynchronous Shared Memory Models Most asynchronous shared memory systems build on the PRAM model, making it more feasible for actual implementation. We can easily make the PRAM model more realistic by assuming asynchronous operation and including an explicit synchronization step after every round, where a round is the smallest unit of time that allows every processor to complete its computation in a given time step. These models can generally be implemented on MIMD architectures, and they charge appropriately for the cost of synchronization.
9
Synchronous Independent Memory Models These models consist of a connected set of processor/memory pairs. Synchronization is assumed during computation. The best example of a synchronous independent memory model is Bulk-Synchronous Parallel (BSP).
10
Bulk-Synchronous Parallel Model (BSP): Processing units are processor/memory pairs. There is a router to provide inter-processor communication. There is a barrier synchronizer to explicitly synchronize computation.
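The three BSP ingredients listed above can be illustrated with a small Python sketch. This is only a toy simulation under stated assumptions: threads stand in for processor/memory pairs, a list of mailboxes (`inbox`, a hypothetical name) stands in for the router, and `threading.Barrier` plays the barrier synchronizer. The prefix-sum workload is chosen for illustration and does not come from the slides.

```python
import threading

def bsp_prefix_sum(values):
    """Inclusive prefix sum, one value per simulated processor/memory pair."""
    p = len(values)
    if p == 0:
        return []
    local = list(values)            # local[i] is private to processor i
    inbox = [None] * p              # stand-in for the router: one mailbox each
    barrier = threading.Barrier(p)  # the barrier synchronizer

    def worker(i):
        step = 1
        while step < p:
            # Communication phase: send my value `step` positions to the right.
            if i + step < p:
                inbox[i + step] = local[i]
            barrier.wait()          # every message of this superstep is delivered
            # Computation phase: combine the received value with local state.
            if inbox[i] is not None:
                local[i] += inbox[i]
                inbox[i] = None
            barrier.wait()          # nobody sends again until all have read
            step *= 2

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return local

print(bsp_prefix_sum([1, 2, 3, 4, 5]))
```

Each loop iteration is one superstep: communicate, synchronize, compute, synchronize. The two barrier waits per superstep are the explicit synchronization cost that BSP makes visible.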
11
BSP Properties BSP is conceptually simple, and provides a nice bridge to future models of computation that do not rely on shared memory. BSP is intuitive from a programming standpoint. It can use any network topology with a router. Inter-processor message delivery time is not guaranteed; only a lower bound can be achieved (network latency). Synchronous operation is taken for granted in program cost. Synchronization time is not guaranteed.
12
Asynchronous Independent Memory Models Most asynchronous independent memory models build on the BSP framework. These models tend to generalize BSP while providing upper bounds on communication cost and overhead. We briefly summarize the LogP model.
13
LogP Provides an upper bound on network latency, and thus on inter-processor communication time (overhead). All processors are seen as equidistant (the network diameter is used for analysis). Resolves problems with router saturation in BSP. Solves some of BSP's practical problems.
14
Summary of Computing Models Shared memory models are conceptually ideal from a programming point of view, but difficult to implement. Independent memory models are more feasible, but add complexity to synchronization. We will proceed to discuss Grid Computing with the general LogP model in mind.
15
Grid Computing Exploring Resource Discovery Protocols
16
What is Grid Computing? “A grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to computational resources. These resources include, but are not limited to, processors, data storage devices, software applications, and instruments such as telescopes or satellite dishes”. [Foster, Kesselman 1998]
17
What is Grid Computing? Dependability: The system must provide predictable and sustained service. Consistency: A grid should provide uniform service despite the vast heterogeneity of connected systems. Pervasiveness: Services should be constantly available regardless of where you move throughout the system (or similar service should be available) Inexpensiveness: The distributed structure should allow for affordable use of computational power relative to income and use.
18
What is Grid Computing? “[Grid Computing] is the synergistic use of high-performance networking, computing, and advanced software to provide access to advanced computational capabilities, regardless of the location of users and resources.” [Foster 1998]
19
What is Grid Computing? Goal: To access and make efficient use of remote resources.
20
The Power Grid A Motivating Analogy In 1910 efficient electric power generation was possible, but every user had to have his own generator. Connecting many heterogeneous electric generators together in a grid provided low-cost access to standardized service. Similarly a computational grid could provide reliable low-cost access to computational power. Computation today is like electricity in 1910
21
Why do we want a Grid? Solving difficult research problems. Running large-scale simulations. Increasing resource utilization. Efficient use of scarce/distant resources. Collaborative design and education.
22
Major Classes of Grid Use: Distributed Computing, High Throughput, On Demand, Data Intensive, Collaborative.
23
Challenges of Grid Computing: Building a framework for communication. Parallelizing code. Dynamically scheduling resource use. Providing consistent service despite heterogeneity. Providing reliable service despite local failures. Finding resources efficiently.
24
Finding Resources in the Grid Given an instance (or run) of a problem we want to solve, how can we expedite the following? 1. Determine what resources we will need to solve the problem. 2. Locate sufficient resources in the Grid. 3. Reserve these resources. 4. Execute the problem run.
25
Different Views of the Resource Discovery Problem We can think of the Resource Discovery problem in 3 ways: 1. A Peer-to-Peer indexing problem. 2. A routing problem. 3. A Web search/crawling problem. We will need to re-pose the Resource Discovery problem under each of these disciplines.
26
P2P for Resource Discovery We will need a separate index for each resource. Several resources may be used in parallel. We want a least-cost fit whenever possible, but an overfit is likely acceptable. We need to have accountability for resource use, and a way to credit users who share resources. We will want caching, since users are likely to request the same types of resources multiple times.
27
P2P for Resource Discovery Peer-to-Peer structure is desirable, but the search/lookup must be modified. We will start to solve this problem by employing set matching techniques for peer-to-peer lookups.
28
Condor Classified Advertisements Condor Classified Advertisements (ClassAds) provide a mapping from attribute names to expressions. Condor matchmaking takes two ClassAds and evaluates one w.r.t. the other. Two ClassAds match iff each has an attribute “requirements” that evaluates as true in the context of the other ClassAd. A ClassAd can have an attribute “rank” that gives a numerical value to the quality of a particular matching (larger rank == better match).
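The two-way matching rule above can be sketched in Python. This is an illustration only, not Condor's implementation: real ClassAds use their own expression language, whereas here each ad is assumed to be a plain dictionary whose `requirements` and `rank` entries are Python callables taking `(my, other)`. The attribute names (`memory_mb`, `mips`, `owner`) and the sample ads are hypothetical.

```python
# Hypothetical request ad: wants at least 512 MB, prefers faster machines.
job = {
    "owner": "alice",
    "memory_needed_mb": 512,
    "requirements": lambda my, other: other.get("memory_mb", 0) >= my["memory_needed_mb"],
    "rank": lambda my, other: other.get("mips", 0),
}

# Hypothetical offer ads, each with its own requirements on the requester.
machines = [
    {"name": "m1", "memory_mb": 256, "mips": 900,
     "requirements": lambda my, other: True},
    {"name": "m2", "memory_mb": 1024, "mips": 400,
     "requirements": lambda my, other: other.get("owner") != "mallory"},
    {"name": "m3", "memory_mb": 2048, "mips": 700,
     "requirements": lambda my, other: True},
]

def matches(a, b):
    """Two ads match iff each one's requirements hold in the context of the other."""
    req_a, req_b = a.get("requirements"), b.get("requirements")
    if req_a is None or req_b is None:
        return False
    return bool(req_a(a, b)) and bool(req_b(b, a))

def best_match(request, offers):
    """Among matching offers, return the one the request ranks highest."""
    candidates = [o for o in offers if matches(request, o)]
    if not candidates:
        return None
    rank = request.get("rank", lambda my, other: 0)
    return max(candidates, key=lambda o: rank(request, o))

print(best_match(job, machines)["name"])
```

Here `m1` is rejected because it fails the job's memory requirement, and `m3` is preferred over `m2` because the job's rank expression favors the higher `mips` value.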
29
Set Extended ClassAd Syntax We can extend this structure and consider a match between a single set request and a ClassAd set. Set expressions place constraints on collective properties of the set (e.g. total disk space or total processing power). Individual expressions place constraints on each ClassAd in the set (e.g. each computer must have more than 1 GB of RAM). In this context the Set Matching Algorithm will attempt to create a set of ClassAds that meets both individual and set requirements.
30
Set Matching Algorithm Note: The number of possible set matches is exponential in the number of ClassAds, so we will proceed with a heuristic approach.
31
Set Matching Algorithm: variables ClassAdSet: Set of all ClassAds to be considered. BestSet: Closest set found so far. CandidateSet: Set considered at each iteration. LastRank: Rank of BestSet. Rank: Rank of CandidateSet.
32
Set Matching Algorithm
CandidateSet = {}; BestSet = {}; LastRank = -infinity;
while (ClassAdSet is not empty) {
    next = the Y in ClassAdSet that maximizes rank(CandidateSet + Y);
    ClassAdSet -= next;
    CandidateSet += next;
    Rank = rank(CandidateSet);
    if (requirements(CandidateSet) == true and Rank > LastRank) {
        BestSet = CandidateSet;
        LastRank = Rank;
    }
}
return BestSet;
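The greedy heuristic above translates almost line for line into Python. The sketch below assumes that `rank` and `requirements` are caller-supplied functions over a list of ads; the sample ads, the 100 GB disk target, and the "closest to the target" rank function are hypothetical choices made for the example.

```python
def set_match(class_ads, rank, requirements):
    """Greedy set matching: grow the candidate set one ad at a time and
    remember the best-ranked candidate that satisfied the requirements."""
    remaining = list(class_ads)
    candidate, best = [], None
    last_rank = float("-inf")
    while remaining:
        # Pick the ad whose addition gives the highest-ranked candidate set.
        nxt = max(remaining, key=lambda ad: rank(candidate + [ad]))
        remaining.remove(nxt)
        candidate.append(nxt)
        current_rank = rank(candidate)
        if requirements(candidate) and current_rank > last_rank:
            best, last_rank = list(candidate), current_rank
    return best

# Hypothetical ads: disk in GB, RAM in GB.
ads = [
    {"name": "a", "disk": 60, "ram": 2},
    {"name": "b", "disk": 50, "ram": 4},
    {"name": "c", "disk": 20, "ram": 1},
]

def total_disk(ad_set):
    return sum(ad["disk"] for ad in ad_set)

def rank(ad_set):
    # Prefer sets whose total disk is closest to the 100 GB target.
    return -abs(total_disk(ad_set) - 100)

def requirements(ad_set):
    # Set expression: enough total disk. Individual expression: RAM per ad.
    return total_disk(ad_set) >= 100 and all(ad["ram"] >= 1 for ad in ad_set)

print([ad["name"] for ad in set_match(ads, rank, requirements)])
```

The loop evaluates `rank` once per remaining ad in every iteration, so it performs on the order of n^2 rank evaluations for n ads instead of examining all 2^n subsets. The price is that the result is not guaranteed to be the best possible set.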
33
Resource Discovery We can use Set Matching for Resource Discovery: The user provides a mapper that maps the workload for a certain application or problem to resource requirements and topology. The resource set is compiled using MDS and a “resource monitor”. Set matching is applied in conjunction with the mapper to find an appropriate set of resources.
34
Resource Discovery: MDS MDS: The Monitoring and Discovery Service component of the Globus™ Toolkit provides information about a server’s configuration, CPU load, etc. Any query tool can be used in its place. Servers can be queried periodically to maintain a central database, or as needed within a P2P structure.
35
Resource Discovery: Architecture
36
P2P Resource Discovery Consider a P2P network with a fixed-degree topology where each node has the ClassAds for all of its neighbors. We could attempt to locate resources using the following technique: 1. Run Set-Matching locally on the ClassAd NeighborSet. 2. If the requirements are not met, forward BestSet to a neighbor. 3. Repeat the process without visiting a node more than once. 4. Report BestSet (or CandidateSet) when the TTL expires.
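The forwarding technique can be sketched as a walk over the overlay graph. The Python below is a simplified illustration under several assumptions: the overlay is a hypothetical five-node ring, each ad carries only a `disk` attribute, and a largest-disk-first greedy pick (`local_match`) stands in for the full Set Matching Algorithm to keep the example short.

```python
def local_match(known_ads, needed_disk):
    """Stand-in for set matching: take the largest ads until the total suffices."""
    total, chosen = 0, []
    for name in sorted(known_ads, key=lambda n: -known_ads[n]["disk"]):
        chosen.append(name)
        total += known_ads[name]["disk"]
        if total >= needed_disk:
            return chosen
    return None

def discover(graph, ads, start, needed_disk, ttl):
    """Forward the request from node to node until it is satisfied or TTL expires."""
    visited, collected = set(), {}
    node = start
    while node is not None and ttl > 0:
        visited.add(node)
        # A node knows its own ad and the ads of its direct neighbors.
        for n in [node] + graph[node]:
            collected[n] = ads[n]
        chosen = local_match(collected, needed_disk)
        if chosen is not None:
            return chosen, True
        ttl -= 1
        # Forward to a neighbor that has not been visited yet, if any.
        unvisited = [n for n in graph[node] if n not in visited]
        node = unvisited[0] if unvisited else None
    return sorted(collected), False   # report what was gathered so far

# Hypothetical overlay: a ring of five nodes, each with degree 2.
graph = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}
ads = {n: {"disk": 10 * (n + 1)} for n in graph}

print(discover(graph, ads, start=0, needed_disk=100, ttl=3))
print(discover(graph, ads, start=0, needed_disk=100, ttl=1))
```

With a TTL of 3 the request is satisfied after one forward. With a TTL of 1 it expires at the first node and only the partial candidate set is reported.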
37
What is Cilk? Cilk (pronounced “silk”) is a C-based runtime system for multithreaded distributed applications. It includes: a C language extension and a thread scheduler.
38
What are Cilk’s Goals? Provide a guaranteed bound on running time. Define a set of problems that lend themselves to efficient distributed multithreading. Encourage programmers to code for multithreading.
39
Motivation: Multithreaded programs written in a traditional language like C/C++ typically run within an acceptable approximation of the optimal running time in practice, but these same implementations often have poor worst-case performance. Cilk guarantees performance within a constant factor of optimal, but limits itself to fully strict computations.
40
What is this fully strict business? 1. A fully strict computation consists of tasks that pass data only to their direct parent task. A task is a single time unit of work. Threads are composed of one or more tasks in order. 2. In a fully strict computation, threads cannot block. Instead, a thread spawns a special successor thread to receive return values. Successor threads do not acquire the CPU until the return values are ready.
41
Additional Definitions: Task – A single time unit of work, executed by exactly one processor. Thread States – A thread can be alive (ready to execute) or stalled (waiting for data from another thread). Activation Frame – The memory shared by tasks in a single thread, which remains allocated regardless of the state of the thread. Activation Subtree – At any time t, the activation subtree consists of those threads that are alive. Activation Depth – The combined size of all child activation frames with respect to a parent thread.
42
The Cilk Model of Multithreaded Computation
43
Scheduling with Work Stealing Work Sharing – When a task is created, the host tries to migrate it to another processor. The drawback is that threads are migrated even if the overall workload is high. Work Stealing – Underutilized processors attempt to migrate tasks from other processors. The advantage is that under high workload, communication is minimized, because task migration only takes place when the recipient of the task has the necessary resources to service it.
44
Goals of Work Stealing Keep the processors busy. Bound runtimes. Limit number of active threads in order to bound memory usage. Maximize locality of related tasks (keep them on the same processor). Minimize communication between remote tasks.
45
Work Stealing Definitions: T_1 = number of tasks in a computation, which is also the time it would take on a single processor. T_P = time used by a P-processor schedule of the computation. T_∞ = length of the computation’s critical path. S_1 = activation depth of the computation on a single processor. S_P = activation depth on P processors. Remember: Activation Depth – The combined size of all child activation frames (allocated memory) with respect to a parent task.
46
Greedy Scheduling At each step, execute anything that is ready, in any order, utilizing as many of the P processors as you have ready tasks (i.e., tasks not waiting on a dependency). Analysis: achieves T_P ≤ T_1/P + T_∞. In other words, the running time is at most the time to execute all tasks with the work divided evenly among P processors, plus the time to execute the critical path, i.e. the longest chain of dependencies. Problem: Memory usage is unbounded.
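The greedy bound can be checked on a small example. The Python sketch below simulates greedy scheduling of unit-time tasks on a dependency graph; the fork-join graph (`fork_join`) is a hypothetical example, and the simulator is an illustration of the rule stated above rather than any particular scheduler's implementation.

```python
def greedy_schedule(deps, p):
    """Simulate greedy scheduling of unit-time tasks; return the number of steps.

    deps maps each task to the set of tasks it must wait for."""
    remaining = {task: set(d) for task, d in deps.items()}
    done, steps = set(), 0
    while remaining:
        ready = [t for t, d in remaining.items() if d <= done]
        if not ready:
            raise ValueError("dependency cycle")
        running = ready[:p]            # run as many ready tasks as processors allow
        for t in running:
            del remaining[t]
        done.update(running)           # results become visible at the next step
        steps += 1
    return steps

def critical_path(deps):
    """Length of the longest dependency chain (T_inf), counted in tasks."""
    memo = {}
    def depth(t):
        if t not in memo:
            memo[t] = 1 + max((depth(d) for d in deps[t]), default=0)
        return memo[t]
    return max(depth(t) for t in deps)

# Hypothetical fork-join computation: a root, six independent workers, a join.
workers = [f"w{i}" for i in range(6)]
fork_join = {"root": set(), "join": set(workers)}
fork_join.update({w: {"root"} for w in workers})

t1, t_inf = len(fork_join), critical_path(fork_join)
for p in (1, 2, 3, 8):
    print(p, greedy_schedule(fork_join, p), "<=", t1 / p + t_inf)
```

For this graph T_1 = 8 and T_∞ = 3, and every simulated T_P stays below T_1/P + T_∞. The simulator says nothing about memory, which is exactly the weakness the slide points out.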
47
Memory Usage with Greedy Scheduling Greedy Scheduling can duplicate memory across multiple processors. For example, when a new task is spawned and different processors are handling the parent and the child, the parent’s address space will also be copied to the processor handling the child. We want an algorithm that guarantees that total memory usage will be within a constant factor of what the computation would consume on a single processor.
48
Busy-Leaves Scheduling with thread pools A global pool is kept containing threads not bound to a processor. All processors follow this algorithm: 1. If idle, get a new thread A from the pool. 2. If A spawns a thread B, return A to the pool and commence work on B. 3. If A stalls, return A to the pool. 4. If A dies, check whether all children of its parent B are dead. If so, commence work on B. This algorithm essentially guarantees that all leaves in the execution tree are busy.
49
Analysis of Busy-Leaves Scheduling T_P ≤ T_1/P + T_∞ and S_P ≤ S_1 · P. In other words, the amount of memory allocated for the entire computation will be at most P times the amount of memory it would take to run on a single processor. Problem: Competition for access to the global thread pool can slow down the overall running time.
50
Randomized Work-Stealing Algorithm Randomized Work-Stealing eliminates the global shared pool and replaces it with a stack at each processor. New tasks are put on the top of the stack, and migrated tasks are taken off the bottom of the stack. Algorithm: 1. If idle, remove the thread A at the top of the local stack. 2. If A enables a stalled parent B, B is placed on the stack. B may have to be found and stolen from another stack. 3. If A spawns a child C, A is put on the stack and work on C commences. 4. If A dies or stalls, check the stack for another task. If one exists, commence execution. If the stack is empty, steal the bottommost thread of a randomly chosen processor.
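A toy simulation makes the stack discipline concrete. The Python sketch below is a simplification, not the Cilk scheduler: it assumes unit-time tasks shaped like the Fibonacci call tree, ignores stalls and successor threads, and lets a spawning processor keep working on one child while pushing the other onto its stack. Owners take work from the top of their own stack and thieves take from the bottom of a random victim's stack, following the convention in the text. All names are hypothetical.

```python
import random
from collections import deque

def simulate(n, p, seed=0):
    """Return (tasks executed, time steps, successful steals) for fib-shaped work."""
    rng = random.Random(seed)
    stacks = [deque() for _ in range(p)]   # right end = top, left end = bottom
    current = [None] * p                   # task each processor is working on
    current[0] = n
    executed = steps = steals = 0
    while any(c is not None for c in current) or any(stacks):
        steps += 1
        for i in range(p):
            if current[i] is None:
                if stacks[i]:
                    current[i] = stacks[i].pop()              # own work: top
                else:
                    victim = rng.randrange(p)
                    if victim != i and stacks[victim]:
                        current[i] = stacks[victim].popleft() # steal: bottom
                        steals += 1
                    continue               # a steal attempt uses up this step
            task = current[i]
            executed += 1
            if task < 2:
                current[i] = None          # leaf task: the thread dies
            else:
                stacks[i].append(task - 2) # push one child for later (or thieves)
                current[i] = task - 1      # keep working on the other child
    return executed, steps, steals

for p in (1, 2, 4, 8):
    print(p, simulate(10, p))
```

The number of executed tasks is the same for every P, since it is T_1. Only the step count and the number of steals change with the processor count and the random seed.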
51
Analysis of Randomized Work Stealing (Outline) At each step, give a dollar to every processor. The processor must put its dollar in a logical bucket. This is known as an accounting argument. Each processor puts its dollar in: the Work bucket if it executed a task at this step; the Steal bucket if it initiated a steal at this step; the Wait bucket if it waited for a queued steal request. At the end of the computation: There are exactly T_1 dollars in the Work buckets. The expected sum of all Steal buckets is O(P T_∞). The expected total number of bytes communicated is O(P T_∞ S_max). The expected sum of all Wait buckets is at most the sum of the Steal buckets.
52
Implementation of Cilk Cilk uses explicit continuation passing, meaning that any return values must be explicitly sent to the appropriate successor thread. Data structures: Closure - Holds a pointer to a function, a slot for each of its arguments, and a counter indicating how many arguments are still to be supplied. A closure is ready when all its arguments are present. Continuation - Holds a pointer to an empty closure slot. Continuations can be shared among threads; for example, ?k can be passed to a spawned function to be filled in later. Function Calls:
spawn function (args)        // any child thread
spawn_next function (args)   // successor threads
send_argument (k, value)     // sends value to k
53
Example: Fibonacci in Cilk
thread fib (cont int k, int n) {    // k is where the return value will be placed
    if (n < 2)
        send_argument(k, n);        // done; return value
    else {                          // main work done in this section
        cont int x, y;
        spawn_next sum (k, ?x, ?y); // successor waits for both results
        spawn fib (x, n - 1);
        spawn fib (y, n - 2);
    }
}
thread sum (cont int k, int x, int y) {
    send_argument(k, x + y);
}
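To show how closures and continuations fit together, the same program can be mimicked in Python. This is a single-processor sketch under stated assumptions, not Cilk's runtime: `Closure`, `cont`, `spawn`, `send_argument`, and the `ready` queue are hypothetical stand-ins, a continuation is modeled as a `(closure, slot)` pair, and `spawn_next` is expressed as a `spawn` whose missing arguments are marked `EMPTY`.

```python
from collections import deque

EMPTY = object()      # marks an argument slot that has not been filled yet
ready = deque()       # closures whose arguments are all present

class Closure:
    def __init__(self, fn, args):
        self.fn = fn
        self.args = list(args)
        self.missing = sum(a is EMPTY for a in args)  # the join counter

def spawn(fn, *args):
    """Create a closure; it becomes ready only once no argument is missing."""
    closure = Closure(fn, args)
    if closure.missing == 0:
        ready.append(closure)
    return closure

def cont(closure, slot):
    """A continuation: a reference to one empty slot of a closure."""
    return (closure, slot)

def send_argument(k, value):
    closure, slot = k
    closure.args[slot] = value
    closure.missing -= 1
    if closure.missing == 0:
        ready.append(closure)

def fib(k, n):
    if n < 2:
        send_argument(k, n)
    else:
        successor = spawn(sum_thread, k, EMPTY, EMPTY)  # plays spawn_next
        spawn(fib, cont(successor, 1), n - 1)
        spawn(fib, cont(successor, 2), n - 2)

def sum_thread(k, x, y):
    send_argument(k, x + y)

def run(n):
    ready.clear()
    results = []
    root = Closure(results.append, [EMPTY])  # receives the final value
    spawn(fib, cont(root, 0), n)
    while ready:                             # trivial one-processor scheduler
        closure = ready.pop()
        closure.fn(*closure.args)
    return results[0]

print(run(10))
```

No thread ever blocks: `sum_thread` simply does not enter the ready queue until both of its argument slots have been filled, which is the behavior the successor-thread rule describes.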
54
Programming for Parallelism For systems with many processors and good scheduling, performance depends on the critical path (T_∞). Remember: T_P ≈ T_1/P + T_∞. The critical path dictates performance more and more as the number of processors increases. This is easy to see by taking the limit of T_P as P goes to infinity: the T_1/P term vanishes and only T_∞ remains.
55
Practical Example: *Socrates *Socrates is a chess program written in Cilk, once considered one of the strongest programs in the world. It was developed on a 32-processor cluster, but the final build was to run on a 512-processor machine. Say, for example, that there were two competing algorithms: A. T_32 = 65 seconds, with T_1 = 2048 and T_∞ = 1. B. T_32 = 40 seconds, with T_1 = 1024 and T_∞ = 8. On 32 processors B looks faster, but on 512 processors: A. T_512 ≈ T_1/P + T_∞ = 2048/512 + 1 = 5. B. T_512 ≈ 1024/512 + 8 = 10. The algorithm with the shorter critical path wins on the larger machine.
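The arithmetic in this example is easy to reproduce, and a few lines of Python also locate the processor count at which the ranking of the two algorithms flips. The function and variable names below are hypothetical, and the estimate T_1/P + T_∞ is the slide's model, not a measurement.

```python
def predicted_time(t1, t_inf, p):
    """Estimate T_P with the model T_1/P + T_inf used in the example."""
    return t1 / p + t_inf

# (T_1, T_inf) for the two competing algorithms from the example.
ALGORITHMS = {"A": (2048, 1), "B": (1024, 8)}

def crossover(a, b, max_p=4096):
    """Smallest processor count at which a is predicted to beat b, if any."""
    for p in range(1, max_p + 1):
        if predicted_time(*a, p) < predicted_time(*b, p):
            return p
    return None

for p in (32, 512):
    for name, (t1, t_inf) in ALGORITHMS.items():
        print(f"P={p:3d} {name}: {predicted_time(t1, t_inf, p)}")
print("A overtakes B at P =", crossover(ALGORITHMS["A"], ALGORITHMS["B"]))
```

Under this model A overtakes B once 1024/P drops below 7, that is, from 147 processors upward. A benchmark on the 32-processor development cluster alone would have picked the wrong algorithm for the 512-processor target.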
56
Example 2: Merge-Sort If you were to naively translate merge-sort into a parallel algorithm:
Merge-Sort(A, p, r)                        // sort array A from index p to index r
    if p < r                               // more than one element remains
        then q <- floor((p + r) / 2)       // split into two equal pieces
             spawn Merge-Sort(A, p, q)     // sort first partition
             spawn Merge-Sort(A, q+1, r)   // sort second partition
             sync                          // wait for both sorts to finish
             Merge(A, p, q, r)             // merge the two sorted partitions
57
Merge-Sort Analysis Because Merge() takes O(n) time on an array of n elements, merge-sort takes O(n lg n) time on a single processor: T_1 = O(n lg n). For the parallel version, the two recursive sorts run concurrently but the merge is still serial: T_∞(n) = T_∞(n/2) + O(n). Thus the critical path of the parallel version is O(n). This is not a great improvement: the speedup is at most T_1/T_∞, or just lg n in this case.
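For readers who want the missing step, the recurrence can be unrolled as a geometric series. The constant c below is an assumption standing in for the hidden constant of the O(n) merge, and the final line assumes the bounds are tight.

```latex
\begin{align*}
T_\infty(n) &= T_\infty(n/2) + cn \\
            &= cn + \frac{cn}{2} + \frac{cn}{4} + \cdots + O(1) \\
            &\le 2cn = O(n), \\[4pt]
\frac{T_1(n)}{T_\infty(n)} &= \frac{\Theta(n \lg n)}{\Theta(n)} = \Theta(\lg n).
\end{align*}
```

The serial merge at the top level alone already costs cn, so no amount of parallelism in the recursive sorts can push the critical path below linear.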
58
Better Parallel Merge-Sort Algorithm: You need to merge sorted arrays A and B (A is the larger). Take the median of A: O(1). Partition B against the median of A using binary search: O(lg n). Recursively merge A_low with B_low and A_high with B_high; the two merges are independent and can run in parallel.
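The divide step of this merge can be written down directly. The Python sketch below runs the two recursive merges one after the other, so it only shows the structure; in a parallel setting they are the two calls that would be spawned. The function name `p_merge` is hypothetical, and the slicing copies are kept for clarity rather than efficiency.

```python
from bisect import bisect_left

def p_merge(a, b):
    """Merge two sorted lists by splitting around the median of the larger one."""
    if len(a) < len(b):
        a, b = b, a                  # make sure a is the larger list
    if not a:
        return []                    # both lists are empty
    mid = len(a) // 2
    pivot = a[mid]                   # median of a: O(1)
    cut = bisect_left(b, pivot)      # partition b by binary search: O(lg n)
    # The two recursive merges touch disjoint data, so they could run in parallel.
    low = p_merge(a[:mid], b[:cut])
    high = p_merge(a[mid + 1:], b[cut:])
    return low + [pivot] + high

print(p_merge([1, 4, 7, 9], [2, 3, 8]))
```

Everything in `low` is at most the pivot and everything in `high` is at least the pivot, which is why concatenating the three pieces yields a sorted result.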
59
Analysis of Parallel Merge-Sort Critical path of the parallel merge: PM_∞(n) ≤ PM_∞(3n/4) + O(lg n) = O(lg^2 n). Because the two recursive merges are done in parallel, the worst case for the pair is the worst case for a single merge. The argument is 3n/4 because, in the worst case, we must merge half of A with all of B, and B holds at most half of the n elements.
60
Example 3: Matrix Multiplication We said that a low critical path was always good, but given a limit on P, compromises can and must be made. For example, say we have a matrix multiplication algorithm that requires 10^7 processors and results in a speedup from O(n^3) to O(lg^2 n). If we only have 10^6 processors available, it would be advantageous to reduce the parallelism (and thus the number of required concurrent processors), even if this results in a higher runtime, say O(n).