1 Parallel Applications
15-740 Computer Architecture
Ning Hu, Stefan Niculescu & Vahe Poladian
November 22, 2002

2 Papers surveyed
- Application and Architectural Bottlenecks in Distributed Shared Memory Multiprocessors, by Chris Holt, Jaswinder Pal Singh and John Hennessy
- Scaling Application Performance on a Cache-coherent Multiprocessor, by Dongming Jiang and Jaswinder Pal Singh
- A Comparison of the MPI, SHMEM and Cache-coherent Shared Address Space Programming Models on the SGI Origin2000, by Hongzhang Shan and Jaswinder Pal Singh

3 Application and Architectural Bottlenecks in Distributed Shared Memory Multiprocessors

4 Question
- Can realistic applications achieve reasonable performance on large-scale DSM machines?
  - Problem size
  - Programming effort and optimization
  - Main architectural bottlenecks

5 Metrics
- Minimum problem size needed to achieve a desired level of parallel efficiency
  - Target: parallel efficiency >= 70%
- Assumption: problem size ↗ implies performance ↗
- Question: Is the assumption always true? Why or why not?
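For reference, parallel efficiency here is the standard metric: speedup divided by processor count. A minimal sketch with made-up timings (not the paper's data):

```c
#include <stdio.h>

/* Parallel efficiency: E = speedup / p = T1 / (p * Tp).
   The paper's threshold is E >= 0.70. */
double parallel_efficiency(double t_serial, double t_parallel, int nprocs) {
    return t_serial / (nprocs * t_parallel);
}

int main(void) {
    /* Hypothetical timings: 100 s serial, 2.2 s on 64 processors. */
    printf("efficiency = %.2f (target >= 0.70)\n",
           parallel_efficiency(100.0, 2.2, 64));   /* ~0.71 */
    return 0;
}
```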

6 Programming Difficulty and Optimization
- Techniques already employed
  - Balance the workload
  - Reduce inherent communication
  - Incorporate the major forms of data locality (temporal and/or spatial)
- Further optimization
  - Place data appropriately in physically distributed memory instead of allowing pages to be placed round-robin across memories
  - Substantially modify major data structures to reduce unnecessary communication and facilitate proper data placement
  - Algorithmic enhancements to further improve load balance or reduce the amount of necessary communication
- Prefetching
  - Software-controlled
  - Insert prefetches by hand (see the sketch below)
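A minimal sketch of hand-inserted, software-controlled prefetching; GCC/Clang's __builtin_prefetch stands in here for the target machine's prefetch instruction, and PF_DIST is an assumed tuning parameter, not a value from the paper:

```c
/* Hand-inserted software prefetching, sketched with GCC/Clang's
   __builtin_prefetch. PF_DIST is machine-dependent and would be
   tuned so the prefetch hides the (remote) miss latency. */
#define PF_DIST 16

void scale(double *dst, const double *src, int n, double a) {
    for (int i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            __builtin_prefetch(&src[i + PF_DIST], /*rw=*/0, /*locality=*/1);
        dst[i] = a * src[i];
    }
}
```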

7 Simulation Environment and Validation
- Simulated architecture: Stanford FLASH multiprocessor
- Validation: Stanford DASH machine vs. simulator
  - Speedups: simulator ≈ DASH machine

8 Applications
- Subset of SPLASH-2
- Important scientific & engineering computations
- Different communication patterns & requirements
- Indicative of several types of applications running on large-scale DSMs

9 Results: with & without prefetching

10 Architectural Bottleneck

11 Conclusion
- Problem size
  - Good performance is possible on large-scale DSMs with problem sizes that are often surprisingly small, except for Radix
- Programming difficulty
  - In most cases not difficult to program
  - Scalable performance can be achieved without changing the code much
- Architectural bottleneck
  - End-point contention
  - Requires extremely efficient communication controllers

13 Scaling Application Performance on a Cache-coherent Multiprocessor

14 The Question
- Can distributed shared memory, cache-coherent, non-uniform memory access architectures scale on parallel apps?
- What we mean by scale:
  - achieve parallel efficiency of 60%
  - for a fixed problem size
  - while increasing the number of processors

15 DSM, cc-NUMA
- Each processor has a private cache
- Shared address space constructed from the "public" memory of each processor
- Loads/stores used to access memory
- Hardware ensures cache coherence
- Non-uniform: the miss penalty for remote data is higher
- SGI Origin2000 chosen as an aggressive representative of this architectural family

16 Origin 2000 overview
- Nodes placed at the vertices of a hypercube
  - Each doubling of the node count adds only one network hop, so worst-case communication latency grows logarithmically with the number of nodes
- Each node holds two 195 MHz processors, each with its own 32 KB first-level cache and 4 MB second-level cache
- Total addressable memory is 32 GB
- Most aggressive in terms of remote-to-local memory access latency ratio
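A quick sketch of why the hypercube keeps latency low (my illustration, not from the slides): node IDs differ in one bit per dimension, so the minimal hop count between two nodes is the Hamming distance of their IDs, at most log2(N).

```c
#include <stdio.h>

/* Minimal hop count between hypercube nodes = Hamming distance of
   their IDs. Worst case is log2(N), so doubling the node count adds
   only one hop of worst-case latency. */
int hypercube_hops(unsigned a, unsigned b) {
    return __builtin_popcount(a ^ b);   /* GCC/Clang builtin */
}

int main(void) {
    /* 32 nodes -> 5-bit IDs -> worst case 5 hops. */
    printf("hops(0, 31) = %d\n", hypercube_hops(0u, 31u));  /* 5 */
    return 0;
}
```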

17 Benchmarks
- SPLASH-2: Barnes-Hut, Ocean, Radix Sort, etc.
- 3 new: Shear Warp, Infer, and Protein
- Range of communication-to-computation ratios, temporal and spatial locality
- Initial problem sizes determined from earlier experiments:
  - simulation with 256 processors
  - implementation with 32 processors

18 Initial Experiments

19 Avg. Exec. Time Breakdown

20 Problem Size
- Idea: increase the problem size until the desired level of efficiency is achieved
- Question: Is it feasible?
- Question: Even if feasible, is it desirable?

21 Changing Problem Size

22 Why problem size helps
- Communication-to-computation ratio improves (see the sketch below)
- Less load imbalance, in both computation and communication costs
- Less waiting at synchronization
- Superlinearity effects of cache size
  - Help at larger processor counts
  - Hurt at smaller processor counts
- Less false sharing
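A back-of-the-envelope sketch of the first point (my illustration, assuming a 2D grid partitioned into square blocks, as in Ocean-like codes): computation scales with block area while communication scales with block perimeter, so the ratio falls as the problem grows.

```c
#include <math.h>
#include <stdio.h>

/* Communication-to-computation ratio for an n x n grid split among p
   processors as square blocks: each block computes on (n/sqrt(p))^2
   points but exchanges only its ~4*(n/sqrt(p)) boundary points. */
double comm_to_comp(double n, double p) {
    double side = n / sqrt(p);
    return (4.0 * side) / (side * side);   /* = 4*sqrt(p)/n */
}

int main(void) {
    for (int n = 256; n <= 4096; n *= 4)
        printf("n=%4d, p=64: ratio = %.4f\n", n, comm_to_comp(n, 64));
    return 0;   /* ratio shrinks as n grows, so efficiency improves */
}
```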

23 Application Restructuring
- What kind of restructuring:
  - Algorithmic changes, data partitioning
- Ways restructuring helps:
  - Reduced communication
  - Better data placement
  - Static partitioning for better load balance (sketched below)
- Restructuring is app-specific and complex
- Bonus side effect:
  - Restructured apps also scale well on shared virtual memory (clustered workstation) systems
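A minimal sketch of the static-partitioning idea (my illustration, not the paper's code): give each process an explicit, contiguous, near-equal block of rows instead of relying on round-robin page placement.

```c
/* Static row-block partitioning: rows are split as evenly as possible
   (within one row) across nprocs processes, so ownership is explicit
   and data can be placed in the owner's local memory. */
typedef struct { int first_row, last_row; } block_t;

block_t partition_rows(int nrows, int nprocs, int rank) {
    int base = nrows / nprocs, extra = nrows % nprocs;
    block_t b;
    b.first_row = rank * base + (rank < extra ? rank : extra);
    b.last_row  = b.first_row + base + (rank < extra ? 1 : 0) - 1;
    return b;   /* rows [first_row, last_row] are owned by `rank` */
}
```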

24 Application Restructuring

25 Conclusions
- Original versions not scalable on cc-NUMA
- Simulation not accurate enough for quantitative results; implementation needed
- Increasing the problem size is a poor solution
- App restructuring works:
  - Restructured apps also perform well on SVM
  - Parallel efficiency of these versions is better
- However, to validate the results, it would be good to run the restructured apps on a larger number of processors

27 A Comparison of the MPI, SHMEM and Cache-coherent Programming Models on the SGI Origin2000

28 Purpose
- Compare the three programming models on the Origin2000
- Focus on scientific applications that access data regularly or predictably
  - or that do not require fine-grained replication of irregularly accessed data

29 SGI Origin2000
- Cache-coherent, NUMA machine
- 64 processors
  - 32 nodes, 2 MIPS R10000 processors each
  - 512 MB memory per node
- Interconnection network
  - 16-vertex hypercube
  - A pair of nodes associated with each vertex

30 Three Programming Models
- CC-SAS
  - Linear address space for shared memory
- MP
  - Communicate with other processes explicitly via a message passing interface (MPI)
- SHMEM
  - Shared-memory one-sided communication library
  - Via get and put primitives (see the sketch below)
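For flavor, a minimal runnable sketch of the put primitive, written against the modern OpenSHMEM API (the original SGI SHMEM spelled some of these calls differently): one PE writes directly into another PE's memory with no matching receive, unlike MPI's two-sided sends.

```c
#include <shmem.h>
#include <stdio.h>

/* One-sided communication: PE 0 puts a value directly into PE 1's
   symmetric memory. No matching receive is needed (contrast with MPI). */
long dest = 0;   /* symmetric: exists at the same address on every PE */

int main(void) {
    shmem_init();
    int me = shmem_my_pe();
    if (me == 0 && shmem_n_pes() > 1) {
        long val = 42;
        shmem_long_put(&dest, &val, 1, /*target PE=*/1);
    }
    shmem_barrier_all();    /* make the put visible before reading */
    if (me == 1) printf("PE 1 now sees %ld\n", dest);
    shmem_finalize();
    return 0;
}
```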

31 Applications and Algorithms
- FFT
  - All-to-all communication (regular)
- Ocean
  - Nearest-neighbor communication
- Radix
  - All-to-all communication (irregular)
- LU
  - One-to-many communication
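The all-to-all pattern behind FFT (regular) and Radix (irregular) maps directly onto an MPI collective; a minimal runnable sketch with made-up payloads:

```c
#include <mpi.h>
#include <stdio.h>

/* All-to-all exchange: the communication pattern behind the FFT
   transpose and (irregularly) the Radix key redistribution. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int send[64], recv[64];            /* assumes p <= 64 */
    for (int i = 0; i < p; i++) send[i] = rank * 100 + i;

    /* Every rank sends one int to, and receives one int from,
       every other rank. */
    MPI_Alltoall(send, 1, MPI_INT, recv, 1, MPI_INT, MPI_COMM_WORLD);

    printf("rank %d received %d from rank 0\n", rank, recv[0]);
    MPI_Finalize();
    return 0;
}
```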

32 Questions to be answered
- Can parallel algorithms be structured the same way for good performance in all three models?
- If there are substantial performance differences among the three models, where are the key bottlenecks?
- Do we need to change the data structures or algorithms substantially to remove those bottlenecks?

33 Performance Results

34 Questions
- Anything unusual in the previous slide?
  - Working sets fit in the caches in the multiprocessor runs but not in the uniprocessor run
- Why is MP much worse than CC-SAS and SHMEM?

35 Analysis: Execution time = BUSY + LMEM + RMEM + SYNC, where
- BUSY: CPU computation time
- LMEM: CPU stall time on local cache misses
- RMEM: CPU stall time sending/receiving remote data
- SYNC: CPU time spent at synchronization events

36 Time breakdown for MP

37 Improving MP performance
- Remove the extra data copy
  - Allocate all data involved in communication in the shared address space
- Reduce SYNC time
  - Use lock-free queue management in communication instead (sketched below)
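A minimal sketch of what "lock-free queue management" could look like (my illustration using C11 atomics, not the paper's implementation): a single-producer/single-consumer ring buffer where the producer only writes the tail and the consumer only writes the head, so no lock is ever taken.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define QSIZE 256   /* power of two; capacity of the ring */

typedef struct {
    _Atomic unsigned head, tail;   /* head: consumer; tail: producer */
    void *slot[QSIZE];
} spsc_queue_t;

/* Producer side: enqueue a message, fail if the ring is full. */
bool spsc_push(spsc_queue_t *q, void *msg) {
    unsigned t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    unsigned h = atomic_load_explicit(&q->head, memory_order_acquire);
    if (t - h == QSIZE) return false;          /* full */
    q->slot[t % QSIZE] = msg;
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);
    return true;
}

/* Consumer side: dequeue a message, fail if the ring is empty. */
bool spsc_pop(spsc_queue_t *q, void **msg) {
    unsigned h = atomic_load_explicit(&q->head, memory_order_relaxed);
    unsigned t = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (h == t) return false;                  /* empty */
    *msg = q->slot[h % QSIZE];
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return true;
}
```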

38 Speedups under Improved MP

39 Question
- Why does CC-SAS perform best at small problem sizes?
  - Extra packing/unpacking operations in MP and SHMEM
  - Extra packet queue management in MP

40 Speedups for Ocean

41 Speedups for Radix

42 Speedups for LU

43 Conclusions
- Good algorithm structures are portable among the programming models.
- MP is much worse than CC-SAS and SHMEM out of the box.
- MP can achieve similar performance once the extra data copies and queue synchronization are addressed.

