1 Parallel Applications
15-740 Computer Architecture
Ning Hu, Stefan Niculescu & Vahe Poladian
November 22, 2002
2 Papers surveyed
Application and Architectural Bottlenecks in Distributed Shared Memory Multiprocessors, by Chris Holt, Jaswinder Pal Singh and John Hennessy
Scaling Application Performance on a Cache-coherent Multiprocessor, by Dongming Jiang and Jaswinder Pal Singh
A Comparison of the MPI, SHMEM and Cache-coherent Shared Address Space Programming Models on the SGI Origin2000, by Hongzhang Shan and Jaswinder Pal Singh
Application and Architectural Bottlenecks in Distributed Shared Memory Multiprocessors
4 Question
Can realistic applications achieve reasonable performance on large-scale DSM machines?
Problem size
Programming effort and optimization
Main architectural bottlenecks
5 Metrics
Minimum problem size needed to achieve a desired level of parallel efficiency (parallel efficiency >= 70%)
Assumption: problem size ↗, performance ↗
Question: Is the assumption always true? Why or why not?
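For reference, this uses the conventional definition of parallel efficiency as speedup divided by processor count, so the 70% threshold can be written as:

```latex
% E(p): parallel efficiency on p processors; T(1): best sequential time,
% T(p): parallel execution time on p processors.
E(p) \;=\; \frac{S(p)}{p} \;=\; \frac{T(1)}{p\,T(p)} \;\ge\; 0.7
```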
6 Programming Difficulty and Optimization
Techniques already employed:
Balance the workload
Reduce inherent communication
Incorporate the major forms of data locality (temporal and/or spatial)
Further optimizations:
Place data appropriately in physically distributed memory instead of allowing pages of data to be placed round-robin across memories
Modify major data structures substantially to reduce unnecessary communication and to facilitate proper data placement
Algorithmic enhancements to further reduce load imbalance or the amount of necessary communication
Prefetching: software-controlled, with prefetches inserted by hand (see the sketch below)
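The slides do not show the prefetching code itself; as a rough sketch of hand-inserted, software-controlled prefetching, a loop can issue prefetches a fixed distance ahead of its current element so that memory latency overlaps with computation. The GCC/Clang builtin stands in for the machine's prefetch instruction, and PF_DIST is an illustrative tuning parameter, not a value from the paper.

```c
#include <stddef.h>

#define PF_DIST 16  /* hypothetical prefetch distance, in elements */

/* Sum a large array, prefetching data PF_DIST elements ahead so that the
 * memory access latency overlaps with computation on earlier elements. */
double sum_with_prefetch(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            __builtin_prefetch(&a[i + PF_DIST], /*rw=*/0, /*locality=*/1);
        s += a[i];
    }
    return s;
}
```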
7 Simulation Environment and Validation
Simulated architecture: Stanford FLASH multiprocessor
Validation: Stanford DASH machine vs. simulator; simulator speedups ≈ DASH machine speedups
8 Applications
Subset of SPLASH-2
Important scientific & engineering computations
Different communication patterns & requirements
Indicative of several types of applications running on large-scale DSMs
9 Results – With & w/o prefetching
10 Architectural Bottleneck
11 Conclusion
Problem size: possible to get good performance on large-scale DSMs using problem sizes that are often surprisingly small, except for Radix
Programming difficulty: in most cases not difficult to program; scalable performance can be achieved without changing the code very much
Architectural bottleneck: end-point contention, which requires extremely efficient communication controllers
Scaling Application Performance on a Cache-coherent Multiprocessor
14 The Question
Can distributed shared memory, cache-coherent, non-uniform memory access architectures scale on parallel apps?
What do we mean by scale: achieving a parallel efficiency of 60% for a fixed problem size as the number of processors increases
15 DSM, cc-NUMA
Each processor has a private cache
Shared address space constructed from the "public" memory of each processor
Loads / stores used to access memory
Hardware ensures cache coherence
Non-uniform: the miss penalty for remote data is higher
SGI Origin2000 chosen as an aggressive representative of this architectural family
16 Origin 2000 overview
Nodes placed at the vertices of a hypercube, so communication latency grows by only a constant amount each time the number of nodes doubles (i.e., logarithmically in the node count)
Each node holds two 195 MHz processors, each with its own 32 KB first-level cache and a 4 MB second-level cache
Total addressable memory is 32 GB
Among the most aggressive machines in terms of remote-to-local memory access latency ratio
17 Benchmarks
SPLASH-2: Barnes-Hut, Ocean, Radix Sort, etc.
3 new applications: Shear Warp, Infer, and Protein
Cover a range of communication-to-computation ratios and of temporal and spatial locality
Initial problem sizes determined from earlier experiments: simulation with 256 processors, implementation with 32 processors
18 Initial Experiments
19 Avg. Exec. Time Breakdown
20 Problem Size
Idea: increase the problem size until the desired level of efficiency is achieved
Question: Is this feasible?
Question: Even if feasible, is it desirable?
21 Changing Problem Size
22 Why problem size helps
Communication-to-computation ratio improves (see the estimate below)
Less load imbalance, in both computation and communication costs
Less waiting at synchronization
Superlinear effects of cache size: helps at larger processor counts, hurts at smaller processor counts
Less false sharing
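As a textbook illustration of the first point (not a figure from the paper): for a nearest-neighbor grid computation such as Ocean on an n x n grid with p processors, each processor computes on roughly n^2/p points but communicates only its block boundary of roughly n/sqrt(p) points, so

```latex
\frac{\text{communication}}{\text{computation}}
  \;\approx\; \frac{n/\sqrt{p}}{\,n^{2}/p\,}
  \;=\; \frac{\sqrt{p}}{n},
```

which shrinks as the problem size n grows for a fixed processor count p.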
23 Application Restructuring
What kind of restructuring: algorithmic changes, data partitioning
Ways restructuring helps: reduced communication, better data placement, static partitioning for better load balance (see the sketch below)
Restructuring is app-specific and complex
Bonus side-effect: restructured apps also scale well on Shared Virtual Memory (clustered workstation) systems
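A minimal sketch of what static partitioning plus better data placement can look like on a cc-NUMA machine, assuming an OpenMP-style shared-memory code (a generic idiom, not code from the paper): the arrays are initialized with the same static partition used by the compute loop, so first-touch page placement keeps each processor's rows in its local memory.

```c
/* Generic cc-NUMA idiom (not from the paper): match the initialization
 * partition to the compute partition so that first-touch page placement
 * keeps each thread's rows in local memory. */
void relax(double *grid, double *next, int n, int iters)
{
    /* First touch: each thread initializes, and thereby places, its own rows. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            grid[i * n + j] = next[i * n + j] = 0.0;

    for (int t = 0; t < iters; t++) {
        /* Same static partition: each thread mostly reads local pages. */
        #pragma omp parallel for schedule(static)
        for (int i = 1; i < n - 1; i++)
            for (int j = 1; j < n - 1; j++)
                next[i * n + j] = 0.25 * (grid[(i - 1) * n + j] + grid[(i + 1) * n + j] +
                                          grid[i * n + j - 1] + grid[i * n + j + 1]);
        double *tmp = grid; grid = next; next = tmp;
    }
}
```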
24 Application Restructuring
25 Conclusions
Original versions not scalable on cc-NUMA
Simulation not accurate enough for quantitative results; implementation needed
Increasing the problem size is a poor solution
App restructuring works: restructured apps also perform well on SVM, and the parallel efficiency of these versions is better
However, to validate the results, it would be a good idea to run the restructured apps on a larger number of processors
A Comparison of the MPI, SHMEM and Cache-coherent Programming Models on the SGI Origin2000
28 Purpose
Compare the three programming models on the Origin2000
Focus on scientific applications that access data regularly or predictably, or that do not require fine-grained replication of irregularly accessed data
29 SGI Origin2000
Cache-coherent NUMA machine
64 processors: 32 nodes with 2 MIPS R10000 processors each
512 MB memory per node
Interconnection network: 16-vertex hypercube, with a pair of nodes associated with each vertex
30 Three Programming Models
CC-SAS: linear address space for shared memory
MP: processes communicate explicitly via the Message Passing Interface (MPI)
SHMEM: shared-memory one-sided communication library, using get and put primitives (see the sketch below)
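To make the contrast concrete, here is roughly how one process sends a value to a partner under MP (two-sided MPI) versus SHMEM (a one-sided put into the partner's symmetric buffer). This is a generic illustration, not code from the study; the function names are made up, and sym_buf is assumed to have been allocated with shmem_malloc so that it exists at the same address on every processing element.

```c
#include <mpi.h>
#include <shmem.h>  /* illustration only; the two models would normally live in separate programs */

/* MP: two-sided, requires a matching receive on the partner. */
void exchange_mpi(double *val, int partner)
{
    double recv;
    MPI_Sendrecv(val, 1, MPI_DOUBLE, partner, 0,
                 &recv, 1, MPI_DOUBLE, partner, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    *val = recv;
}

/* SHMEM: one-sided, the sender writes directly into the partner's symmetric
 * memory; the barrier stands in for whatever synchronization the application
 * uses before the partner reads the data. */
void send_shmem(double *sym_buf, double val, int partner)
{
    shmem_double_put(sym_buf, &val, 1, partner);
    shmem_barrier_all();
}
```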
31 Applications and Algorithms
FFT: all-to-all communication (regular)
Ocean: nearest-neighbor communication
Radix: all-to-all communication (irregular)
LU: one-to-many communication
32 Questions to be answered
Can parallel algorithms be structured in the same way for good performance in all three models?
If there are substantial performance differences under the three models, where are the key bottlenecks?
Do we need to change the data structures or algorithms substantially to remove those bottlenecks?
33 Performance Result
34 Questions: Anything unusual in the previous slide?
Working sets fit in the aggregate cache of the multiprocessor but not in the uniprocessor's cache, producing superlinear speedups
Why is MP so much worse than CC-SAS and SHMEM?
35 Analysis: Execution time = BUSY + LMEM + RMEM + SYNC, where
BUSY: CPU computation time
LMEM: CPU stall time for local cache misses
RMEM: CPU stall time for sending/receiving remote data
SYNC: CPU time spent at synchronization events
36 Time breakdown for MP
37 Improving MP performance
Remove the extra data copy: allocate all data involved in communication in the shared address space
Reduce SYNC time: use lock-free queue management for communication instead (see the sketch below)
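The slide does not show the queue itself; as a generic sketch of the lock-free idea (names and sizes are illustrative, not from the paper), a single-producer/single-consumer ring buffer avoids locks because the sender only advances the head index and the receiver only advances the tail index.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define QCAP 256  /* illustrative capacity; must be a power of two */

typedef struct {
    void *slot[QCAP];
    _Atomic unsigned head;  /* written only by the producer */
    _Atomic unsigned tail;  /* written only by the consumer */
} spsc_queue;

/* Producer side: returns false if the queue is full. */
static bool spsc_enqueue(spsc_queue *q, void *msg)
{
    unsigned h = atomic_load_explicit(&q->head, memory_order_relaxed);
    unsigned t = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (h - t == QCAP)
        return false;                            /* full */
    q->slot[h % QCAP] = msg;
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false if the queue is empty. */
static bool spsc_dequeue(spsc_queue *q, void **msg)
{
    unsigned t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    unsigned h = atomic_load_explicit(&q->head, memory_order_acquire);
    if (t == h)
        return false;                            /* empty */
    *msg = q->slot[t % QCAP];
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);
    return true;
}
```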
38 Speedups under Improved MP
39 Question: Why does CC-SAS perform best for small problem sizes?
Extra packing/unpacking operations in MP and SHMEM
Extra packet queue management in MP
40 Speedups for Ocean
41 Speedups for Radix
42 Speedups for LU
43 Conclusions
Good algorithm structures are portable among the programming models.
MP is much worse than CC-SAS and SHMEM out of the box, but similar performance can be achieved once the extra data copies and queue synchronization are addressed.