
1 Computer Architecture II 1 Computer architecture II Programming for performance

2 Computer Architecture II 2 Recap Parallelization strategies –What to partition? –Embarrassingly Parallel Computations –Divide-and-Conquer –Pipelined Computations Application examples Parallelization steps 3 programming models –Data parallel –Shared memory –Message passing

3 Computer Architecture II 3 4 Steps in Creating a Parallel Program Decomposition of computation into tasks Assignment of tasks to processes Orchestration of data access, comm, synch. Mapping processes to processors

4 Computer Architecture II 4 Plan for today Programming for performance Amdahl’s law Partitioning for performance –Addressing decomposition and assignment Orchestration for performance

5 Computer Architecture II 5 Creating a Parallel Program Assumption: Sequential algorithm is given –Sometimes need very different algorithm, but beyond scope Pieces of the job: –Identify work that can be done in parallel –Partition work and perhaps data among processes –Manage data access, communication and synchronization –Note: work includes computation, data access and I/O Main goal: Speedup (plus low prog. effort and resource needs) Speedup(p) = Performance(p) / Performance(1) For a fixed problem: Speedup(p) = Time(1) / Time(p)

6 Computer Architecture II 6 Amdahl´s law Suppose a fraction f of your application is not parallelizable, and the remaining 1-f is parallelizable on p processors: Speedup(p) = T1/Tp <= T1 / (f*T1 + (1-f)*T1/p) = 1 / (f + (1-f)/p) <= 1/f
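A minimal sketch of this bound in C (the helper name and the sample values of f and p are illustrative, not from the slides):

#include <stdio.h>

/* Upper bound on speedup from Amdahl's law:
   f = serial (non-parallelizable) fraction, p = number of processors. */
double amdahl_speedup(double f, int p) {
    return 1.0 / (f + (1.0 - f) / p);
}

int main(void) {
    /* Even with only 5% serial work, 1024 processors stay below the 1/f = 20 limit. */
    printf("f=0.05, p=16   -> %.2f\n", amdahl_speedup(0.05, 16));
    printf("f=0.05, p=1024 -> %.2f\n", amdahl_speedup(0.05, 1024));
    printf("limit 1/f      -> %.2f\n", 1.0 / 0.05);
    return 0;
}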

7 Computer Architecture II 7 Amdahl’s Law (for 1024 processors) See: Gustafson, Montry, Benner, “Development of Parallel Methods for a 1024 Processor Hypercube”, SIAM J. Sci. Stat. Comp. 9, No. 4, 1988, p. 609.

8 Computer Architecture II 8 Amdahl´s law But: –Many problems can be “embarrassingly” parallelized Ex: image processing, differential equation solver –In some cases the serial fraction does not increase with the problem size –Additional speedup can be achieved from additional resources (super-linear speedup due to more memory)

9 Computer Architecture II 9 Speedup [Figure: speedup versus number of processors p, showing linear, sub-linear and superlinear speedup curves]

10 Computer Architecture II 10 Superlinear Speedup? Possible causes Algorithm –e.g., with optimization problems, throwing many processors at it increases the chances that one will “get lucky” and find the optimum fast Hardware –e.g., with many processors, it is possible that the entire application data resides in cache (vs. RAM) or in RAM (vs. Disk)

11 Computer Architecture II 11 Parallel Efficiency Eff_p = S_p / p Typically <= 1, unless superlinear speedup Used to measure how well the processors are utilized –If increasing the number of processes by a factor of 10 increases the speedup only by a factor of 2, perhaps it’s not worth it: efficiency drops by a factor of 5
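A small sketch of the same check in C (the measured times are made-up illustrative values matching the 10x-processes / 2x-speedup scenario above):

#include <stdio.h>

/* Speedup and efficiency from measured execution times. */
double speedup(double t1, double tp)           { return t1 / tp; }
double efficiency(double t1, double tp, int p) { return speedup(t1, tp) / p; }

int main(void) {
    double t1 = 100.0;   /* sequential time, illustrative */
    /* 10x more processes but only 2x more speedup: efficiency drops by 5x. */
    printf("p=10:  S=%4.1f  Eff=%.2f\n", speedup(t1, 20.0), efficiency(t1, 20.0, 10));
    printf("p=100: S=%4.1f  Eff=%.2f\n", speedup(t1, 10.0), efficiency(t1, 10.0, 100));
    return 0;
}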

12 Computer Architecture II 12 Performance Goal => Speedup Architect Goal –observe how program uses machine and improve the design to enhance performance Programmer Goal –observe how the program uses the machine and improve the implementation to enhance performance

13 Computer Architecture II 13 4 Steps in Creating a Parallel Program Decomposition of computation into tasks Assignment of tasks to processes Orchestration of data access, comm, synch. Mapping processes to processors

14 Computer Architecture II 14 Partitioning for Performance First two phases of parallelization process: decomposition & assignment Goals: 1. Balancing the workload and reducing wait time at synch points 2. Reducing inherent communication 3. Reducing extra work for determining and managing a good assignment (static versus dynamic) Tensions between the 3 goals –Maximize load balance => smaller tasks => increased communication –No communication (run on 1 processor) => extreme load imbalance (all others idle) –Load balance => extra work to compute or manage the partitioning (e.g. dynamic techniques)

15 Computer Architecture II 15 1. Load Balance –Work: data access, computation –Not just equal work, but all processors must be busy at the same time Speedup <= Sequential Work / Max Work on any Processor –Ex: Speedup <= 1000/400 = 2.5

16 Computer Architecture II 16 1. Load balance a) Identify enough concurrency: data and functional parallelism (last class) b) Managing concurrency c) Task granularity d) Reduce communication and synchronization

17 Computer Architecture II 17 1 b) Static versus Dynamic assignment Static: before the program starts it is clear who does what #pragma omp parallel for schedule(static) for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; } Dynamic –External scheduler –Self-scheduled: each process picks a chunk of loop iterations and executes them #pragma omp parallel for schedule(dynamic,4) for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; } Dynamic guided self-scheduling: processes take larger chunks first and then progressively smaller ones #pragma omp parallel for schedule(guided,4) for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }
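A self-contained sketch of the three scheduling clauses (array size, chunk size and initialization values are illustrative; compile with an OpenMP flag such as -fopenmp):

#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N], b[N];
    int i;

    for (i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* Static: iterations are split into equal chunks assigned before the loop runs. */
    #pragma omp parallel for schedule(static)
    for (i = 0; i < N; i++) a[i] = a[i] + b[i];

    /* Dynamic: each thread grabs the next chunk of 4 iterations whenever it is free. */
    #pragma omp parallel for schedule(dynamic, 4)
    for (i = 0; i < N; i++) a[i] = a[i] + b[i];

    /* Guided: chunks start large and shrink progressively, down to 4 iterations. */
    #pragma omp parallel for schedule(guided, 4)
    for (i = 0; i < N; i++) a[i] = a[i] + b[i];

    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}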

18 Computer Architecture II 18 Dynamic Tasking with Task Queues Centralized queue: simple protocol –Problems: communication, synchronization, contention Distributed queues: complicated protocol –Initial distribution of jobs may cause load imbalance Solution: task stealing (whom to steal from, how many tasks to steal,...) –Termination detection
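A minimal self-scheduling sketch with a centralized "queue", here just a shared counter advanced atomically (OpenMP; the task count, chunk size and work function are illustrative):

#include <stdio.h>

#define NTASKS 1024
#define CHUNK  8

void do_task(int t) { (void)t; /* placeholder for the real work of task t */ }

int main(void) {
    int next = 0;   /* centralized queue: index of the next unclaimed task */

    #pragma omp parallel
    {
        for (;;) {
            int first;
            /* Grab a chunk of tasks; this atomic capture is the whole protocol. */
            #pragma omp atomic capture
            { first = next; next += CHUNK; }

            if (first >= NTASKS) break;           /* queue empty: terminate         */
            int last = first + CHUNK;
            if (last > NTASKS) last = NTASKS;
            for (int t = first; t < last; t++)    /* contention only on the counter */
                do_task(t);
        }
    }
    printf("all %d tasks done\n", NTASKS);
    return 0;
}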

19 Computer Architecture II 19 1.c Task granularity Task granularity: amount of work associated with a task General rule: –Coarse-grained: often less load balance –Fine-grained: better load balance, but more overhead, often more communication and contention [Figure: coarse-grained versus fine-grained task assignment on two processors]

20 Computer Architecture II 20 1.d Reducing Serialization Synchronization for task assignment may cause serialization (for instance the access to a queue) Speedup <= Sequential Work / Max (Work + Synch Wait Time) [Figure: three processes with work, synchronization points and synchronization wait time]

21 Computer Architecture II 21 Reducing Serialization Event synchronization –Reduce use of conservative synchronization point-to-point events instead of barriers finer granularity of access may reduce the synchronization time –But fine-grained synchronization is more difficult to program and means more synchronization operations Mutual exclusion –Separate locks for separate data lock per task in task queue, not per queue finer grain => less contention/serialization, more space, less reuse –Smaller, less frequent critical sections don’t do reading/testing in the critical section, only modification e.g. search for the task to dequeue outside the critical section (see the sketch below)
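A sketch of the last point: only the actual modification happens inside the critical section (the queue type and helper name are hypothetical, not a real API):

/* Hypothetical centralized task queue; only the dequeue itself is serialized. */
typedef struct { int tasks[1024]; int head, tail; } queue_t;

int try_dequeue(queue_t *q, int *task) {
    int found = 0;
    if (q->head == q->tail)        /* cheap test outside the critical section   */
        return 0;                  /* (may be stale, so it is re-checked below) */
    #pragma omp critical(taskq)
    {
        if (q->head != q->tail) {  /* re-test, then modify: short critical section */
            *task = q->tasks[q->head++];
            found = 1;
        }
    }
    return found;
}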

22 Computer Architecture II 22 2. Reducing Inherent Communication Communication is expensive! Measure: communication-to-computation ratio Inherent communication –Determined by assignment of tasks to processes –Actual communication may be larger (artifactual) One principle: assign tasks that access the same data to the same process Speedup <= Sequential Work / Max (Work + Synch Wait Time + Comm Cost) [Figure: three processes with work, synchronization wait time and communication]

23 Computer Architecture II 23 Domain Decomposition Ocean example: communicate with the neighbors, compute in the assigned domain Perimeter-to-area communication-to-computation ratio (surface-to-volume in 3-d): depends on n and p; decreases with n, increases with p

24 Computer Architecture II 24 Domain Decomposition Block versus strip decomposition: Communication-to-computation ratio: 4*sqrt(p)/n for block, 2*p/n for strip –Block better –Application dependent: strip may be better in other cases Best domain decomposition depends on information requirements
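A tiny sketch that evaluates both ratios (the grid size and processor counts are illustrative):

#include <stdio.h>
#include <math.h>

int main(void) {
    int n = 1024;                      /* n x n grid, illustrative */
    int ps[] = { 16, 64, 256 };
    for (int k = 0; k < 3; k++) {
        int p = ps[k];
        double block = 4.0 * sqrt((double)p) / n;   /* perimeter/area, block decomposition */
        double strip = 2.0 * p / n;                 /* perimeter/area, strip decomposition */
        printf("p=%3d  block=%.4f  strip=%.4f\n", p, block, strip);
    }
    return 0;
}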

25 Computer Architecture II 25 Finding a Domain Decomposition GOALS: load balance & low communication Static, by inspection –Must be predictable: Ocean Static, but not by inspection –Input-dependent, requires analyzing input structure –E.g. sparse matrix computations Semi-static (periodic repartitioning) –Characteristics change, but slowly; e.g. Barnes-Hut Static or semi-static, with dynamic task stealing –Initial decomposition, but highly unpredictable; e.g. ray tracing

26 Computer Architecture II 26 3. Reducing Extra Work Common sources of extra work: –Computing a good partition e.g. partitioning in Barnes-Hut –Using redundant computation to avoid communication –Task, data and process management overhead applications, languages, runtime systems, OS –Imposing structure on communication coalescing small messages Speedup <= Sequential Work / Max (Work + Synch Wait Time + Comm Cost + Extra Work)

27 Computer Architecture II 27 PART II: memory-aware optimizations So far we have seen the parallel computer as a collection of communicating processors –Goals: balance load, reduce inherent communication and extra work –We have assumed an unlimited memory In reality the parallel computer uses a multi-cache, multi-memory system

28 Computer Architecture II 28 Memory-oriented View Multiprocessor as Extended Memory Hierarchy Levels in extended hierarchy: –Registers, caches, local memory, remote memory –Glued together by communication architecture –Levels communicate at a certain granularity of data transfer [Figure: per-processor cache, L2 cache, L3 cache and memory connected by potential interconnects; going down the hierarchy, transfer granularity and access time increase while capacity increases and cost/unit decreases]

29 Computer Architecture II 29 Memory-oriented view Performance depends heavily on the memory hierarchy Time spent by a program (usually given in cycles): Time_prog = Time_compute + Time_access Data access time can be reduced by: –Optimizing the machine: larger caches, lower latency, larger bandwidth –Optimizing the program: temporal and spatial locality

30 Computer Architecture II 30 Artifactual Communication in Extended Hierarchy poor allocation of data across distributed memories –Data is allocated in the memory of one node even though it is accessed only by another (remote) node unnecessary data in a transfer (e.g. sender sends more than needed) unnecessary transfers due to system granularities (e.g. cache coherence at block granularity) redundant communication of data (e.g. update protocols: data sent at every modification, but needed only once) finite replication capacity (in cache or main memory): replicated data has to be evicted from memory and brought in again later

31 Computer Architecture II 31 Replication-induced artifactual communication Communication induced by finite capacity is the most fundamental artifact –Analogous to how cache size determines miss rate and memory traffic in uniprocessors View as three-level hierarchy for simplicity –Local cache, local memory, remote memory (ignore network topology) Classify “misses” in “cache” at any level as for uniprocessors (4 “C”s) compulsory or cold misses (no size effect) capacity misses (yes): something has to be evicted because of limited space conflict or collision misses (yes): data that maps onto the same cache blocks communication or coherence misses (no)

32 Computer Architecture II 32 Working set –Working set: size of data that fits in a certain level of memory 1. A few points for near-neighbor reuse 2. Three sub-rows 3. Whole matrix
while (!done) do
  diff = 0;
  for i ← 1 to n do
    for j ← 1 to n do
      temp = A[i,j];
      A[i,j] ← 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] + A[i,j+1] + A[i+1,j]);
      diff += abs(A[i,j] - temp);
    endfor
  endfor
  if (diff/(n*n) < TOL) then done = 1;
endwhile

33 Computer Architecture II 33 Working Set Perspective –Hierarchy of working sets –At the first-level cache (fully associative, one-word blocks), the working sets are inherent to the algorithm [Figure: working-set curve for a program: traffic versus replication capacity] –Traffic from any type of miss can be local or nonlocal (communication)

34 Computer Architecture II 34 Orchestration for Performance Reducing amount of communication: –Inherent: change the partitioning (seen earlier) –Artifactual: exploit spatial, temporal locality in the memory hierarchy Techniques often similar to those on uniprocessors Structuring communication to reduce cost

35 Computer Architecture II 35 Reducing Artifactual Communication Message passing model –Communication and replication are both explicit –Even artifactual communication is in explicit messages Shared address space model –Artifactual communication occurs transparently due to interactions of program and system –Used here for the explanation

36 Computer Architecture II 36 Exploiting Temporal Locality –Def: reuse of data elements already brought into the cache –Structure the algorithm so working sets fit into the cache –Often the same techniques that reduce inherent communication: assign tasks accessing the same elements to the same processor; schedule tasks for data reuse once assigned –Ocean solver example: blocking Each grid element is accessed 5 times Bring it into the cache once, then reuse it: rewrite the loops (see the sketch below)
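A hedged sketch of the loop rewrite in C: the sweep is traversed in column blocks so the few sub-rows touched by the stencil stay in cache between rows (grid size, block width and initialization values are illustrative; note the update order changes, which matters only if the solver requires a strict sweep order):

#include <stdio.h>

#define N 1024
#define B 64                         /* block width: ~3 sub-rows of B elements fit in cache */

static double A[N + 2][N + 2];       /* grid with a one-element boundary halo               */

/* One blocked sweep of the 5-point stencil: traversing column blocks lets the
   sub-rows loaded for row i be reused when sweeping rows i+1, i+2, ...        */
double blocked_sweep(void) {
    double diff = 0.0;
    for (int jb = 1; jb <= N; jb += B)
        for (int i = 1; i <= N; i++)
            for (int j = jb; j < jb + B && j <= N; j++) {
                double temp = A[i][j];
                A[i][j] = 0.2 * (A[i][j] + A[i][j-1] + A[i-1][j]
                                 + A[i][j+1] + A[i+1][j]);
                diff += (A[i][j] > temp) ? A[i][j] - temp : temp - A[i][j];
            }
    return diff;
}

int main(void) {
    for (int i = 0; i < N + 2; i++)
        for (int j = 0; j < N + 2; j++)
            A[i][j] = (double)(i + j);
    printf("diff after one sweep: %f\n", blocked_sweep());
    return 0;
}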

37 Computer Architecture II 37 Exploiting Spatial Locality Def: when a data element is accessed, its neighbors are likely to be accessed as well Major spatial-related causes of artifactual communication: –Conflict misses –Data distribution/layout (allocation granularity) –Fragmentation (communication granularity) –False sharing of data (coherence granularity) AVOIDING ARTIFACTUAL COMMUNICATION: keep the data accessed by one processor contiguous –Fix problems by modifying data structures, or layout/alignment

38 Computer Architecture II 38 Spatial Locality Example – Repeated sweeps over a 2-d grid, each time adding 1 to the elements – 4-d array to achieve spatial locality (processor row x processor column x row index x column index within the partition)
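A sketch of the 2-d versus 4-d layout (processor-grid dimensions and names are illustrative): in the 4-d array, each processor's sub-block is one contiguous chunk of memory, so the cache lines and pages it touches are its own.

#include <stdio.h>

#define N  1024
#define PD 4                /* PD x PD grid of processors (illustrative)      */
#define SB (N / PD)         /* side of the sub-block owned by each processor  */

/* 2-d layout: a processor's sub-block is scattered across the array, so its
   sub-rows share cache lines and pages with data owned by its neighbors.     */
static double grid2d[N][N];

/* 4-d layout: grid4d[pi][pj] is the contiguous SB x SB block owned by the
   processor at position (pi, pj) of the processor grid.                      */
static double grid4d[PD][PD][SB][SB];

/* Map a global element (i, j) to its location in the 4-d layout. */
static double *elem4d(int i, int j) {
    return &grid4d[i / SB][j / SB][i % SB][j % SB];
}

int main(void) {
    grid2d[10][20] = 1.0;
    *elem4d(10, 20) = 1.0;
    printf("element (10,20): 2-d %.1f, 4-d %.1f\n", grid2d[10][20], *elem4d(10, 20));
    return 0;
}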

39 Computer Architecture II 39 Tradeoffs with Inherent Communication Partitioning the grid solver: blocks versus rows –Blocks have a spatial locality problem on remote data: when accessing elements of neighboring processors, whole cache blocks are fetched at the column boundaries –Row-wise partitioning can perform better despite a worse inherent communication-to-computation ratio

40 Computer Architecture II 40 Example Performance Impact on Origin2000 [Figure: speedups for Ocean and for the kernel solver]

41 Computer Architecture II 41 Case study 1: Simulating Ocean Currents Ocean: modeled as several two-dimensional grids: static and regular Steps: set up the movement equations, solve them, update the grid values Multigrid method –Sweep the n x n grid first (finest level) –Go coarser (n/2 x n/2, n/4 x n/4) or finer depending on the error in the current sweep [Figure: (a) cross sections, (b) spatial discretization of a cross section]

42 Computer Architecture II 42 Case Study 1: Ocean

43 Computer Architecture II 43 Partitioning Function parallelism: identify independent computations => reduce synchronization Data parallelism: static partitioning within a grid, similar issues to those for the kernel solver –Block versus strip: inherent communication favors block; artifactual communication favors strip (spatial locality) –Load imbalance due to grid border elements: in the block case internal blocks have no border elements

44 Computer Architecture II 44 Orchestration Spatial locality: similar to the equation solver –4D versus 2D arrays –Block partitioning: poor spatial locality across rows, good across columns –But there are lots of grids, so cache conflicts across grids

45 Computer Architecture II 45 Orchestration Temporal locality: complex working set hierarchy (six working sets, three important) –A few points for near-neighbor reuse –Three sub-rows –Partition of one grid Synchronization –Barriers between phases and solver sweeps –Locks for global variables –Lots of work between synchronization events

46 Computer Architecture II 46 Execution Time Breakdown 1026 x 1026 grid with block partitioning on a 32-processor Origin2000 (4MB second-level cache) 4-d grids much better than 2-d –Smaller access time (better locality) –Less time waiting at barriers [Figure: execution time breakdown for 4D arrays versus 2D arrays]

47 Computer Architecture II 47 Case Study 2: Barnes-Hut Simulate the interactions of many stars evolving over time Computing forces is expensive –O(n^2) brute-force approach –Barnes-Hut: hierarchical method taking advantage of the force law G*m1*m2 / r^2

48 Computer Architecture II 48 Case Study 2: Barnes-Hut Sequential algorithm –For each body (n times) traverse the tree top-down and compute the total force acting on that body –If a cell is far enough away, compute the force from the cell as a whole (its center of mass) –Expected tree height: log n [Figure: recursive subdivision of space; each leaf space cell contains one body]

49 Computer Architecture II 49 Application Structure –Main data structures: array of bodies, array of cells, and arrays of pointers to them Each body/cell has several fields: mass, position, pointers to others Contiguous chunks of the pointer arrays (bodies and cells) are assigned to processes (a rough sketch of these structures follows below)
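A rough C sketch of these structures (the field names and the oct-tree fanout of 8 children are illustrative assumptions, not taken from the slides):

#define MAX_BODIES 524288

/* Illustrative Barnes-Hut data structures. */
typedef struct body {
    double mass;
    double pos[3], vel[3], force[3];
    double cost;                 /* work in the previous time-step, used for assignment */
} body_t;

typedef struct cell {
    double mass;                 /* total mass of the subtree below this cell           */
    double cm[3];                /* center of mass of the subtree                       */
    struct cell *child[8];       /* children space cells; NULL if absent                */
    body_t *leaf_body;           /* non-NULL if this cell is a leaf holding one body    */
    double cost;                 /* accumulated work, used later by cost-zones          */
} cell_t;

/* Arrays of bodies and cells plus arrays of pointers to them; contiguous
   chunks of the pointer arrays are what gets assigned to processes.        */
body_t  bodies[MAX_BODIES];
body_t *body_ptrs[MAX_BODIES];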

50 Computer Architecture II 50 Partitioning Decomposition: bodies in most phases, cells in computing moments Challenges for assignment: –Non-uniform body distribution => non-uniform work and communication Cannot assign by inspection –Distribution changes dynamically across time-steps Cannot assign statically –Information needs fall off with distance from body Partitions should be spatially contiguous for locality –Different phases have different work distributions across bodies No single assignment ideal for all –Communication: fine-grained and irregular

51 Computer Architecture II 51 Load Balancing Particles are not equal –The number and mass of the bodies acting upon each particle differ –Solution: assign costs to particles based on the work they require Work is unknown beforehand and changes across time-steps –But: the system evolves slowly –Solution: use the work per particle in the current phase as an estimate of its cost in the next time-step

52 Computer Architecture II 52 Load balancing: Orthogonal Recursive Bisection (ORB) Recursively bisect space into subspaces with equal work –Work is associated with bodies, as computed in the previous phase –Continue until one partition per processor –costly

53 Computer Architecture II 53 Another Approach: Cost-zones Insight: the quad-tree already contains an encoding of spatial locality Cost-zones is low-overhead and very easy to program –Store the cost in each node of the tree –Compute the total work of the system (e.g. 1000 ops) and divide by the number of processors (e.g. 1000 ops / 10 proc = 100 ops/proc) –Each processor traverses the tree and picks its range (0-100, 100-200, ...) (see the sketch below)
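A hedged sketch of the cost-zones traversal, reusing the illustrative cell_t/body_t from the earlier sketch (the depth-first order and the way leaves are claimed are assumptions):

/* Walk the tree depth-first, accumulating cost; claim the bodies whose
   running-cost prefix starts inside this process's zone [zone_lo, zone_hi). */
void costzones(cell_t *c, double zone_lo, double zone_hi,
               double *running, body_t **my_bodies, int *n_mine) {
    if (c == NULL) return;
    if (c->leaf_body != NULL) {                    /* leaf: one body             */
        if (*running >= zone_lo && *running < zone_hi)
            my_bodies[(*n_mine)++] = c->leaf_body; /* starts in my zone: mine    */
        *running += c->cost;
        return;
    }
    if (*running + c->cost <= zone_lo || *running >= zone_hi) {
        *running += c->cost;                       /* whole subtree outside zone */
        return;
    }
    for (int k = 0; k < 8; k++)                    /* descend in a fixed order   */
        costzones(c->child[k], zone_lo, zone_hi, running, my_bodies, n_mine);
}

Each processor p would call this with zone_lo = p * total_cost / nproc and zone_hi = (p+1) * total_cost / nproc, so the tree order itself gives spatially contiguous partitions.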

54 Computer Architecture II 54 Orchestration and Mapping Spatial locality: data distribution is much more difficult than in Ocean –Redistribution across time-steps –Logical granularity (body/cell) much smaller than a page –Partitions contiguous in physical space are not contiguous in the arrays where the bodies are stored Temporal locality and working sets –First working set (body-to-body interaction) –Second working set (computing the forces on a body: good temporal locality because the system evolves slowly) Synchronization: –Barriers between phases –No synch within force calculation: data written is different from data read –Locks in tree-building, point-to-point event synch in the center-of-mass phase Mapping: ORB maps well to a hypercube, cost-zones to a linear array

55 Computer Architecture II 55 Execution Time Breakdown 512K bodies on 32-processor Origin2000 – Static assignment of bodies versus costzones – Good load balance – Slow access for static due to lack of locality

56 Computer Architecture II 56 Raytrace Map a 3D scene onto a 2D display pixel by pixel Rays shot through pixels in the image are called primary rays –They reflect and refract when they hit objects and compute color and opacity –This recursive process generates a ray tree per primary ray A hierarchical spatial data structure keeps track of primitives in the scene (similar to the Barnes-Hut tree) –Nodes are space cells, leaves hold linked lists of the primitives they contain Tradeoffs between execution time and image quality

57 Computer Architecture II 57 Partitioning Scene-oriented approach –Partition scene cells, process rays while they are in an assigned cell Ray-oriented approach –Partition primary rays (pixels), access scene data as needed –Simpler; used here Static assignment: bad load balance because of the unpredictability of ray bounces Dynamic assignment: use contiguous blocks to exploit spatial coherence among neighboring rays, plus tiles for task stealing [Figure: a block is the unit of assignment, a tile the unit of decomposition and stealing; all tiles are inserted into a queue and stolen one at a time]

58 Computer Architecture II 58 Orchestration and Mapping Spatial locality –Proper data distribution for ray-oriented approach very difficult –Dynamically changing, unpredictable access, fine-grained access Distribute the memory pages round-robin to avoid contention –Poor spatial locality Temporal locality –Working sets large and ill defined due to unpredictability –Replication would do well (but capacity limited) Synchronization: –One barrier at end, locks on task queues Mapping: natural to 2-d mesh for image, but likely not important

59 Computer Architecture II 59 Execution Time Breakdown –Scene: balls arranged in a bunch –Task stealing is clearly very important for load balance

