1
Day 2
2
Agenda Parallelism basics Parallel machines Parallelism again High Throughput Computing –Finding the right grain size
3
One thing to remember Easy Hard
4
Seeking Concurrency Data dependence graphs Data parallelism Functional parallelism Pipelining
5
Data Dependence Graph Directed graph Vertices = tasks Edges = dependences
6
Data Parallelism Independent tasks apply same operation to different elements of a data set Okay to perform operations concurrently for i ← 0 to 99 do a[i] ← b[i] + c[i] endfor
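A concrete sketch of this loop in C with OpenMP follows (the array contents and the use of OpenMP are illustrative assumptions; the slides do not prescribe a language):

    #include <stdio.h>

    #define N 100

    int main(void)
    {
        int a[N], b[N], c[N];

        /* Initialize the inputs (values chosen arbitrarily). */
        for (int i = 0; i < N; i++) {
            b[i] = i;
            c[i] = 2 * i;
        }

        /* Every iteration is independent, so the iterations can be
           divided among threads. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = b[i] + c[i];

        printf("a[99] = %d\n", a[99]);   /* expect 99 + 198 = 297 */
        return 0;
    }

Compiled with OpenMP enabled (e.g., gcc -fopenmp), the 100 independent additions are divided among the available threads; without the flag the pragma is ignored and the loop simply runs sequentially.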
7
Functional Parallelism Independent tasks apply different operations to different data elements First and second statements Third and fourth statements a ← 2 b ← 3 m ← (a + b) / 2 s ← (a² + b²) / 2 v ← s − m²
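One way to express this functional parallelism in C is with OpenMP sections, sketched below; the variables mirror the example above, and the choice of OpenMP is an assumption rather than anything the slides prescribe:

    #include <stdio.h>

    int main(void)
    {
        double a = 2.0, b = 3.0;   /* first and second statements: independent */
        double m, s, v;

        /* The third and fourth statements are independent of each other,
           so they may execute concurrently. */
        #pragma omp parallel sections
        {
            #pragma omp section
            m = (a + b) / 2.0;

            #pragma omp section
            s = (a * a + b * b) / 2.0;
        }

        /* The fifth statement depends on both m and s. */
        v = s - m * m;

        printf("m = %.2f, v = %.2f\n", m, v);   /* 2.50 and 0.25 */
        return 0;
    }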
8
Pipelining Divide a process into stages Produce several items simultaneously
9
Data Clustering Data mining = looking for meaningful patterns in large data sets Data clustering = organizing a data set into clusters of “similar” items Data clustering can speed retrieval of related items
10
Document Vectors (figure) Documents such as Alice in Wonderland, A Biography of Jules Verne, The Geology of Moon Rocks, and The Story of Apollo 11 plotted as vectors along "Moon" and "Rocket" term axes
11
Document Clustering
12
Clustering Algorithm Compute document vectors Choose initial cluster centers Repeat –Compute performance function –Adjust centers Until function value converges or max iterations have elapsed Output cluster centers
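A sequential C sketch of this repeat-until-convergence loop follows, using k-means-style centers, 2-D vectors, Euclidean distance, and made-up sample data; K, N, the data, the performance function, and the convergence test are all illustrative assumptions, not details taken from the slides:

    #include <math.h>
    #include <stdio.h>

    #define N 8              /* number of document vectors (assumed) */
    #define K 2              /* number of clusters (assumed) */
    #define DIM 2            /* vector length (assumed) */
    #define MAX_ITERS 100

    static double dist2(const double *x, const double *y)
    {
        double d = 0.0;
        for (int j = 0; j < DIM; j++)
            d += (x[j] - y[j]) * (x[j] - y[j]);
        return d;
    }

    int main(void)
    {
        double vec[N][DIM] = { {0,0},{0,1},{1,0},{1,1},{8,8},{8,9},{9,8},{9,9} };
        double center[K][DIM] = { {0,0}, {9,9} };   /* initial centers (assumed) */
        double prev = 1e300;

        for (int iter = 0; iter < MAX_ITERS; iter++) {
            double sum[K][DIM] = {{0}};
            int count[K] = {0};
            double perf = 0.0;   /* performance function: total squared distance */

            /* Assign each vector to its closest center. */
            for (int i = 0; i < N; i++) {
                int best = 0;
                for (int k = 1; k < K; k++)
                    if (dist2(vec[i], center[k]) < dist2(vec[i], center[best]))
                        best = k;
                perf += dist2(vec[i], center[best]);
                count[best]++;
                for (int j = 0; j < DIM; j++)
                    sum[best][j] += vec[i][j];
            }

            /* Adjust each center to the mean of its assigned vectors. */
            for (int k = 0; k < K; k++)
                if (count[k] > 0)
                    for (int j = 0; j < DIM; j++)
                        center[k][j] = sum[k][j] / count[k];

            if (fabs(prev - perf) < 1e-9)   /* function value has converged */
                break;
            prev = perf;
        }

        for (int k = 0; k < K; k++)
            printf("center %d: (%.2f, %.2f)\n", k, center[k][0], center[k][1]);
        return 0;
    }

The per-vector loops in each pass are where the data-parallel opportunities listed on the next slide appear.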
13
Data Parallelism Opportunities Operation being applied to a data set Examples –Generating document vectors –Finding closest center to each vector –Picking initial values of cluster centers
14
Functional Parallelism Opportunities Draw data dependence diagram Look for sets of nodes such that there are no paths from one node to another
15
Data Dependence Diagram Build document vectors Compute function value Choose cluster centers Adjust cluster centers Output cluster centers
16
Programming Parallel Computers Extend compilers: translate sequential programs into parallel programs Extend languages: add parallel operations Add parallel language layer on top of sequential language Define totally new parallel language and compiler system
17
Strategy 1: Extend Compilers Parallelizing compiler –Detect parallelism in sequential program –Produce parallel executable program Focus on making Fortran programs parallel
18
Extend Compilers (cont.) Advantages –Can leverage millions of lines of existing serial programs –Saves time and labor –Requires no retraining of programmers –Sequential programming easier than parallel programming
19
Extend Compilers (cont.) Disadvantages –Parallelism may be irretrievably lost when programs are written in sequential languages –Performance of parallelizing compilers on a broad range of applications is still up in the air
20
Extend Language Add functions to a sequential language –Create and terminate processes –Synchronize processes –Allow processes to communicate
21
Extend Language (cont.) Advantages –Easiest, quickest, and least expensive –Allows existing compiler technology to be leveraged –New libraries can be ready soon after new parallel computers are available
22
Extend Language (cont.) Disadvantages –Lack of compiler support to catch errors –Easy to write programs that are difficult to debug
23
Add a Parallel Programming Layer Lower layer –Core of computation –Process manipulates its portion of data to produce its portion of result Upper layer –Creation and synchronization of processes –Partitioning of data among processes A few research prototypes have been built based on these principles
24
Create a Parallel Language Develop a parallel language “from scratch” –occam is an example Add parallel constructs to an existing language –Fortran 90 –High Performance Fortran –C*
25
New Parallel Languages (cont.) Advantages –Allows programmer to communicate parallelism to compiler –Improves probability that executable will achieve high performance Disadvantages –Requires development of new compilers –New languages may not become standards –Programmer resistance
26
Current Status Low-level approach is most popular –Augment existing language with low-level parallel constructs –MPI and OpenMP are examples Advantages of low-level approach –Efficiency –Portability Disadvantage: More difficult to program and debug
27
Architectures Interconnection networks Processor arrays (SIMD/data parallel) Multiprocessors (shared memory) Multicomputers (distributed memory) Flynn’s taxonomy
28
Interconnection Networks Uses of interconnection networks –Connect processors to shared memory –Connect processors to each other Interconnection media types –Shared medium –Switched medium
29
Shared versus Switched Media
30
Shared Medium Allows only one message at a time Messages are broadcast Each processor “listens” to every message Arbitration is decentralized Collisions require resending of messages Ethernet is an example
31
Switched Medium Supports point-to-point messages between pairs of processors Each processor has its own path to switch Advantages over shared media –Allows multiple messages to be sent simultaneously –Allows scaling of network to accommodate increase in processors
32
Switch Network Topologies View switched network as a graph –Vertices = processors or switches –Edges = communication paths Two kinds of topologies –Direct –Indirect
33
Direct Topology Ratio of switch nodes to processor nodes is 1:1 Every switch node is connected to –1 processor node –At least 1 other switch node
34
Indirect Topology Ratio of switch nodes to processor nodes is greater than 1:1 Some switches simply connect other switches
35
Evaluating Switch Topologies Diameter Bisection width Number of edges / node Constant edge length? (yes/no)
36
2-D Mesh Network Direct topology Switches arranged into a 2-D lattice Communication allowed only between neighboring switches Variants allow wraparound connections between switches on edge of mesh
37
2-D Meshes
38
Vector Computers Vector computer: instruction set includes operations on vectors as well as scalars Two ways to implement vector computers –Pipelined vector processor: streams data through pipelined arithmetic units –Processor array: many identical, synchronized arithmetic processing elements
39
Why Processor Arrays? Historically, high cost of a control unit Scientific applications have data parallelism
40
Processor Array
41
Data/instruction Storage Front end computer –Program –Data manipulated sequentially Processor array –Data manipulated in parallel
42
Processor Array Performance Performance: work done per time unit Performance of processor array –Speed of processing elements –Utilization of processing elements
43
Performance Example 1 1024 processors Each adds a pair of integers in 1 μsec What is the performance when adding two 1024-element vectors (one element per processor)?
44
Performance Example 2 512 processors Each adds two integers in 1 μsec What is the performance when adding two vectors of length 600?
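Working these out from the figures above: in Example 1 all 1024 additions happen in the same microsecond, so the performance is 1024 operations per μsec, roughly 1.0 × 10⁹ operations per second. In Example 2 a 600-element vector does not fit in one pass over 512 processors: the first microsecond performs 512 additions and a second microsecond performs the remaining 88, so 600 additions take 2 μsec, about 3 × 10⁸ operations per second, well below the peak rate of 5.12 × 10⁸.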
45
2-D Processor Interconnection Network Each VLSI chip has 16 processing elements
46
if (COND) then A else B
49
Processor Array Shortcomings Not all problems are data-parallel Speed drops for conditionally executed code Don’t adapt to multiple users well Do not scale down well to “starter” systems Rely on custom VLSI for processors Expense of control units has dropped
50
Multicomputer, aka Distributed Memory Machines Distributed memory multiple-CPU computer Same address on different processors refers to different physical memory locations Processors interact through message passing Commercial multicomputers Commodity clusters
51
Asymmetrical Multicomputer
52
Asymmetrical MC Advantages Back-end processors dedicated to parallel computations Easier to understand, model, tune performance Only a simple back-end operating system needed Easy for a vendor to create
53
Asymmetrical MC Disadvantages Front-end computer is a single point of failure Single front-end computer limits scalability of system Primitive operating system in back-end processors makes debugging difficult Every application requires development of both front-end and back-end program
54
Symmetrical Multicomputer
55
Symmetrical MC Advantages Alleviate performance bottleneck caused by single front-end computer Better support for debugging Every processor executes same program
56
Symmetrical MC Disadvantages More difficult to maintain illusion of single “parallel computer” No simple way to balance program development workload among processors More difficult to achieve high performance when multiple processes on each processor
57
Commodity Cluster Co-located computers Dedicated to running parallel jobs No keyboards or displays Identical operating system Identical local disk images Administered as an entity
58
Network of Workstations Dispersed computers First priority: person at keyboard Parallel jobs run in background Different operating systems Different local images Checkpointing and restarting important
59
DM programming model Communicating sequential programs Disjoint address spaces Communicate by sending “messages” A message is an array of bytes –Send(dest, char *buf, int len); –receive(&dest, char *buf, int &len);
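The Send/receive pair above is pseudocode; a minimal concrete version of the same model using MPI point-to-point calls might look like the following sketch (the message text and the choice of ranks 0 and 1 are illustrative):

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char *argv[])
    {
        int rank;
        char buf[64];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* Process 0 fills a byte buffer and sends it to process 1. */
            strcpy(buf, "hello from rank 0");
            MPI_Send(buf, (int)strlen(buf) + 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Process 1 receives it; the two ranks share no memory. */
            MPI_Recv(buf, (int)sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received: %s\n", buf);
        }

        MPI_Finalize();
        return 0;
    }

Run with at least two processes (e.g., mpirun -np 2 ./a.out); with a single rank the send has no matching receiver.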
60
Multiprocessors Multiprocessor: multiple-CPU computer with a shared memory Same address on two different CPUs refers to the same memory location Avoid three problems of processor arrays –Can be built from commodity CPUs –Naturally support multiple users –Maintain efficiency in conditional code
61
Centralized Multiprocessor Straightforward extension of uniprocessor Add CPUs to bus All processors share same primary memory Memory access time same for all CPUs –Uniform memory access (UMA) multiprocessor –Symmetrical multiprocessor (SMP)
62
Centralized Multiprocessor
63
Private and Shared Data Private data: items used only by a single processor Shared data: values used by multiple processors In a multiprocessor, processors communicate via shared data values
64
Problems Associated with Shared Data Cache coherence –Replicating data across multiple caches reduces contention –How to ensure different processors have same value for same address? Synchronization –Mutual exclusion –Barrier
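A small C/OpenMP sketch of the two synchronization mechanisms named above, mutual exclusion and a barrier (the shared counter and the thread count are illustrative assumptions):

    #include <stdio.h>

    int main(void)
    {
        int counter = 0;   /* shared data */

        #pragma omp parallel num_threads(4)
        {
            /* Mutual exclusion: one thread at a time updates the shared
               counter, so no increment is lost. */
            #pragma omp critical
            counter++;

            /* Barrier: no thread continues until all have incremented. */
            #pragma omp barrier

            /* One thread reports the total; with 4 threads it prints 4. */
            #pragma omp single
            printf("after barrier, counter = %d\n", counter);
        }
        return 0;
    }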
65
Cache-coherence Problem (figure sequence) Memory initially holds X = 7; CPU A reads X and caches the value 7, then CPU B reads X and caches 7 as well; CPU B then writes X = 2, updating memory and its own cache, while CPU A's cache still holds the stale value 7, so the two processors see different values for the same address
69
Write Invalidate Protocol (figure sequence) Both caches hold X = 7 and a cache control monitor snoops the shared bus; before writing, CPU B broadcasts an "intent to write X," which invalidates CPU A's copy; CPU B then writes X = 2 and holds the only valid cached copy
73
Distributed Multiprocessor Distribute primary memory among processors Increase aggregate memory bandwidth and lower average memory access time Allow greater number of processors Also called non-uniform memory access (NUMA) multiprocessor
74
Distributed Multiprocessor
75
Cache Coherence Some NUMA multiprocessors do not support it in hardware –Only instructions, private data in cache –Large memory access time variance Implementation more difficult –No shared memory bus to “snoop” –Directory-based protocol needed
76
Flynn’s Taxonomy Instruction stream Data stream Single vs. multiple Four combinations –SISD –SIMD –MISD –MIMD
77
SISD Single Instruction, Single Data Single-CPU systems Note: co-processors don’t count –Functional –I/O Example: PCs
78
SIMD Single Instruction, Multiple Data Two architectures fit this category –Pipelined vector processor (e.g., Cray-1) –Processor array (e.g., Connection Machine)
79
MISD Multiple Instruction, Single Data Example: systolic array
80
MIMD Multiple Instruction, Multiple Data Multiple-CPU computers –Multiprocessors –Multicomputers
81
Summary Commercial parallel computers appeared in 1980s Multiple-CPU computers now dominate Small-scale: Centralized multiprocessors Large-scale: Distributed memory architectures (multiprocessors or multicomputers)
82
Programming the Beast Task/channel model Algorithm design methodology Case studies
83
Task/Channel Model Parallel computation = set of tasks Task –Program –Local memory –Collection of I/O ports Tasks interact by sending messages through channels
84
Task/Channel Model Task Channel
85
Foster’s Design Methodology Partitioning Communication Agglomeration Mapping
86
Foster’s Methodology
87
Partitioning Dividing computation and data into pieces Domain decomposition –Divide data into pieces –Determine how to associate computations with the data Functional decomposition –Divide computation into pieces –Determine how to associate data with the computations
88
Example Domain Decompositions
89
Example Functional Decomposition
90
Partitioning Checklist At least 10x more primitive tasks than processors in target computer Minimize redundant computations and redundant data storage Primitive tasks roughly the same size Number of tasks an increasing function of problem size
91
Communication Determine values passed among tasks Local communication –Task needs values from a small number of other tasks –Create channels illustrating data flow Global communication –Significant number of tasks contribute data to perform a computation –Don’t create channels for them early in design
92
Communication Checklist Communication operations balanced among tasks Each task communicates with only a small group of neighbors Tasks can perform communications concurrently Tasks can perform computations concurrently
93
Agglomeration Grouping tasks into larger tasks Goals –Improve performance –Maintain scalability of program –Simplify programming In MPI programming, goal often to create one agglomerated task per processor
94
Agglomeration Can Improve Performance Eliminate communication between primitive tasks agglomerated into consolidated task Combine groups of sending and receiving tasks
95
Agglomeration Checklist Locality of parallel algorithm has increased Replicated computations take less time than communications they replace Data replication doesn’t affect scalability Agglomerated tasks have similar computational and communication costs Number of tasks increases with problem size Number of tasks suitable for likely target systems Tradeoff between agglomeration and code-modification costs is reasonable
96
Mapping Process of assigning tasks to processors Centralized multiprocessor: mapping done by operating system Distributed memory system: mapping done by user Conflicting goals of mapping –Maximize processor utilization –Minimize interprocessor communication
97
Mapping Example
98
Optimal Mapping Finding optimal mapping is NP-hard Must rely on heuristics
99
Mapping Decision Tree Static number of tasks –Structured communication: if computation time per task is constant, agglomerate tasks to minimize communication and create one task per processor; if computation time per task is variable, cyclically map tasks to processors –Unstructured communication: use a static load balancing algorithm Dynamic number of tasks (continued on the next slide)
100
Mapping Strategy Static number of tasks (handled above) Dynamic number of tasks –Frequent communications between tasks: use a dynamic load balancing algorithm –Many short-lived tasks: use a run-time task-scheduling algorithm
101
Mapping Checklist Considered designs based on one task per processor and multiple tasks per processor Evaluated static and dynamic task allocation If dynamic task allocation chosen, task allocator is not a bottleneck to performance If static task allocation chosen, ratio of tasks to processors is at least 10:1
102
Case Studies Boundary value problem Finding the maximum The n-body problem Adding data input
103
Boundary Value Problem Ice water Rod Insulation
104
Rod Cools as Time Progresses
105
Finite Difference Approximation
106
Partitioning One data item per grid point Associate one primitive task with each grid point Two-dimensional domain decomposition
107
Communication Identify communication pattern between primitive tasks Each interior primitive task has three incoming and three outgoing channels
108
Agglomeration and Mapping Agglomeration
109
Sequential Execution Time χ – time to update element n – number of elements m – number of iterations Sequential execution time: m(n−1)χ
110
Parallel Execution Time p – number of processors λ – message latency Parallel execution time: m(χ⌈(n−1)/p⌉ + 2λ)
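As a worked example with assumed values (none of these numbers come from the slides): take χ = 1 μsec per element update, λ = 100 μsec message latency, n = 10,001 elements, m = 1,000 iterations, and p = 10 processors. The sequential time is m(n−1)χ = 1,000 × 10,000 × 1 μsec = 10 s, while the parallel time is m(χ⌈(n−1)/p⌉ + 2λ) = 1,000 × (1,000 + 200) μsec = 1.2 s, a speedup of roughly 8.3; the 2λ term is the per-iteration cost of exchanging boundary values with neighboring processors.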
111
Reduction Given associative operator ⊕, compute a₀ ⊕ a₁ ⊕ a₂ ⊕ … ⊕ aₙ₋₁ Examples –Add –Multiply –And, Or –Maximum, Minimum
112
Parallel Reduction Evolution
115
Binomial Trees Subgraph of hypercube
116
Finding Global Sum (figure sequence) The 16 initial values are combined pairwise, halving the number of partial sums at each step (16 → 8 → 4 → 2 → 1), until the single global sum 25 remains after log₂ 16 = 4 communication steps; the pattern of exchanges forms a binomial tree
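MPI packages this tree-structured combining as a single collective operation; a sketch of a global sum in C follows (each process's contribution is an assumed value):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size, local, global;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        local = rank + 1;   /* each process contributes one value (assumed) */

        /* The library combines the values with the associative operator
           MPI_SUM, typically using a tree-structured pattern internally. */
        MPI_Reduce(&local, &global, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global sum over %d processes = %d\n", size, global);

        MPI_Finalize();
        return 0;
    }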
121
Agglomeration
122
sum
123
The n-body Problem
125
Partitioning Domain partitioning Assume one task per particle Task has particle’s position, velocity vector Iteration –Get positions of all other particles –Compute new position, velocity
126
Gather
127
All-gather
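A sketch of the all-gather step in MPI, where every process ends up with every other process's contribution; one integer per process is a simplifying assumption, since the real n-body code would exchange position vectors:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        int rank, size, mine;
        int *all;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        mine = 100 + rank;                 /* this process's "position" (assumed) */
        all = malloc(size * sizeof(int));

        /* After the call, every process holds all[0..size-1]. */
        MPI_Allgather(&mine, 1, MPI_INT, all, 1, MPI_INT, MPI_COMM_WORLD);

        if (rank == 0)
            for (int i = 0; i < size; i++)
                printf("value from rank %d: %d\n", i, all[i]);

        free(all);
        MPI_Finalize();
        return 0;
    }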
128
Complete Graph for All-gather
129
Hypercube for All-gather
130
Communication Time Hypercube Complete graph
131
Adding Data Input
132
Scatter
133
Scatter in log p Steps (figure) At each step, every processor that already holds data sends half of what it holds to a processor that has none, so 8 pieces reach 8 processors in log₂ 8 = 3 steps
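MPI provides the scatter as a collective as well; the sketch below distributes one integer of the root's data to each process (the data values are illustrative):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        int rank, size, piece;
        int *data = NULL;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            /* Only the root holds the full data set before the scatter. */
            data = malloc(size * sizeof(int));
            for (int i = 0; i < size; i++)
                data[i] = i + 1;
        }

        /* Each process, including the root, receives one element. */
        MPI_Scatter(data, 1, MPI_INT, &piece, 1, MPI_INT, 0, MPI_COMM_WORLD);
        printf("rank %d got %d\n", rank, piece);

        if (rank == 0)
            free(data);
        MPI_Finalize();
        return 0;
    }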
134
Summary: Task/channel Model Parallel computation –Set of tasks –Interactions through channels Good designs –Maximize local computations –Minimize communications –Scale up
135
Summary: Design Steps Partition computation Agglomerate tasks Map tasks to processors Goals –Maximize processor utilization –Minimize inter-processor communication
136
Summary: Fundamental Algorithms Reduction Gather and scatter All-gather
137
High Throughput Computing Easy problems – formerly known as “embarrassingly parallel,” now known as “pleasingly parallel” Basic idea – “Gee, I have a whole bunch of jobs (each a single run of a program) that I need to do; why not run them concurrently rather than sequentially?” Sometimes called “bag of tasks” or parameter-sweep problems
138
Bag-of-tasks
139
Examples A large number of proteins – each represented by a different file – to “dock” with a target protein –For all files x, execute f(x,y) Exploring a parameter space in n dimensions –Uniform –Non-uniform Monte Carlo simulations
140
Tools The most common tool is a queuing system – sometimes called a load management system or a local resource manager PBS, LSF, and SGE are the three most common; Condor is also often used They all have the same basic functions; we’ll use PBS as an exemplar Script languages (bash, Perl, etc.)
141
PBS qsub options script-file –Submit the script to run –Options can specify number of processors, other required resources (memory, etc.) –Returns the job ID (a string)
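A minimal PBS submission script, as a sketch; the job name, resource request, and output file name are illustrative, and the exact resource syntax varies by site:

    #!/bin/bash
    # Minimal PBS script: print the name of the compute node that runs the job.
    #PBS -N hostname_test
    #PBS -l nodes=1:ppn=1
    #PBS -j oe
    #PBS -o output.log

    # PBS starts the job in the home directory; move to the submission directory.
    cd "$PBS_O_WORKDIR"
    hostname

It would be submitted with qsub job.pbs; qstat then shows it in the queue and qdel <jobid> removes it.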
144
Other PBS commands qstat – gives the status of jobs submitted to the queue qdel – deletes a job from the queue
145
Blasting a set of jobs
146
Issues Overhead per job is substantial –Don’t want to run millisecond jobs –May need to “bundle them up” May not be enough jobs to saturate resources –May need to break up jobs I/O system may become saturated –Copy large files to /tmp, check for their existence in your shell script, and copy them only if they are not there May be more jobs than the queuing system can handle (many start to break down at several thousand jobs) Jobs may fail for no good reason –Develop scripts to check for output and re-submit failed jobs, up to k times
147
Homework 1.Submit a simple job to the queue that echoes the host name; redirect the output to a file of your choice. 2.Via a script, submit 100 “hostname” jobs to the queue. Output should go to “output.X”, where X is the job number. 3.For each file in a rooted directory tree, run “wc” to count the words. Maintain the results in a “shadow” directory tree. Your script should be able to detect results that have already been computed.