1 Parallel Applications 15-740 Computer Architecture Ning Hu, Stefan Niculescu & Vahe Poladian November 22, 2002

2 Papers surveyed  Application and Architectural Bottlenecks in Distributed Shared Memory Multiprocessors by Chris Holt, Jaswinder Pal Singh and John Hennessy  Scaling Application Performance on a Cache-coherent Multiprocessor by Dongming Jiang and Jaswinder Pal Singh  A Comparison of the MPI, SHMEM and Cache-coherent Shared Address Space Programming Models on the SGI Origin2000 by Hongzhang Shan and Jaswinder Pal Singh

Application and Architectural Bottlenecks in Distributed Shared Memory Multiprocessors

4 Question  Can realistic applications achieve reasonable performance on large scale DSM machines?  Problem Size  Programming effort and Optimization  Main architectural bottlenecks

5 Metrics  Minimum problem size needed to achieve a desired level of parallel efficiency  Parallel efficiency >= 70% (see the definition below)  Assumption: Problem size ↗  Performance ↗  Question: Is the assumption always true? Why or why not?
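For concreteness, here is the usual definition of parallel efficiency; this is my formulation of the standard metric, which appears consistent with how the paper uses its 70% threshold:

```latex
E(p) \;=\; \frac{S(p)}{p} \;=\; \frac{T_1}{p \, T_p}
```

where T_1 is the best sequential execution time and T_p the execution time on p processors. The study then asks for the smallest problem size at which E(p) >= 0.7.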

6 Programming Difficulty and Optimization  Techniques already employed  Balance the workload  Reduce inherent communication  Incorporate the major forms of data locality (temporal and/or spatial)  Further optimization  Place the data appropriately in physically distributed memory instead of allowing pages of data to be placed round-robin across memories  Modify major data structures substantially to reduce unnecessary communication and to facilitate proper data placement  Algorithmic enhancements to further reduce load imbalance or the amount of necessary communication  Prefetching  Software-controlled  Insert prefetches by hand (see the sketch below)
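As an illustration of hand-inserted software prefetching, here is a minimal sketch assuming a GCC/Clang-style compiler; the builtin and the prefetch distance are my choices for illustration, not details taken from the paper:

```c
#include <stddef.h>

/* Stream through b[], computing into a[], while requesting data a few
 * iterations ahead so the cache-miss latency overlaps with computation. */
void scale(double *a, const double *b, double s, size_t n) {
    const size_t dist = 16;                      /* prefetch distance: a tuning knob */
    for (size_t i = 0; i < n; i++) {
        if (i + dist < n)
            __builtin_prefetch(&b[i + dist], 0, 1);  /* read access, low temporal locality */
        a[i] = s * b[i];
    }
}
```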

7 Simulation Environment and Validation  Simulated Architecture  Stanford FLASH multiprocessor  Validation  Stanford DASH machine vs. Simulator  Speedups: Simulator ≈ DASH machine

8 Applications  Subset of SPLASH-2  Important scientific & engineering computations  Different communication patterns & requirements  Indicative of several types of applications running on large-scale DSMs

9 Results – With & w/o prefetching

10 Architectural Bottleneck

11 Conclusion  Problem size  Possible to get good performance on large-scale DSMs using problem sizes that are often surprisingly small, except for Radix  Programming difficulty  In most cases, not difficult to program  Scalable performance can be achieved without changing the code too much  Architectural bottleneck  End-point contention  Requires extremely efficient communication controllers

Scaling Application Performance on a Cache-coherent Multiprocessor

14 The Question  Can distributed shared memory, cache-coherent, non-uniform memory access architectures scale on parallel apps?  What do we mean by scale:  Achieve a parallel efficiency of 60%  For a fixed problem size  While increasing the number of processors

15 DSM, cc-NUMA  Each processor has a private cache  Shared address space constructed from the "public" memory of each processor  Loads/stores used to access memory  Hardware ensures cache coherence  Non-uniform: miss penalty for remote data is higher  SGI Origin2000 chosen as an aggressive representative of this architectural family

16 Origin 2000 overview  Nodes placed at the vertices of a hypercube:  Ensures that worst-case communication latency grows only linearly as the number of nodes doubles, i.e., logarithmically in the node count (see the note below)  Each node holds two 195 MHz processors, each with its own 32 KB first-level cache and 4 MB second-level cache  Total addressable memory is 32 GB  Most aggressive in terms of the remote-to-local memory access latency ratio
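To unpack the latency claim, assuming a standard binary hypercube topology (my arithmetic, not the slide's):

```latex
\text{max hops} \;=\; \log_2 N
\qquad\Rightarrow\qquad
\log_2 32 = 5 \text{ hops between the farthest nodes of the 32-node machine}
```

so each doubling of the node count adds only one hop to the worst-case path.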

17 Benchmarks  SPLASH-2: Barnes-Hut, Ocean, Radix Sort, etc.  Three new: Shear Warp, Infer, and Protein  Range of communication-to-computation ratios, temporal and spatial locality  Initial problem sizes determined from earlier experiments:  Simulation with 256 processors  Implementation with 32 processors

18 Initial Experiments

19 Avg. Exec. Time Breakdown

20 Problem Size  Idea:  Increase the problem size until the desired level of efficiency is achieved  Question:  Feasible?  Question:  Even if feasible, is it desirable?

21 Changing Problem Size

22 Why problem size helps  Communication-to-computation ratio improved (see the note below)  Less load imbalance, both in computation and communication costs  Less waiting at synchronization  Superlinearity effects of cache size  Help larger processor counts  Hurt smaller processor counts  Less false sharing
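One way to see the first point is the standard surface-to-volume argument for a nearest-neighbor grid computation (an illustration, not a derivation from the paper): each of the p processors owns an (n/√p) × (n/√p) block of an n × n grid, so

```latex
\frac{\text{communication}}{\text{computation}}
\;\propto\;
\frac{4n/\sqrt{p}}{\,n^2/p\,}
\;=\;
\frac{4\sqrt{p}}{n}
```

which shrinks as the problem size n grows for a fixed processor count p.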

23 Application Restructuring  What kind of restructuring:  Algorithmic changes, data partitioning  Ways restructuring helps:  Reduced communication  Better data placement  Static partitioning for better load balance  Restructuring is app-specific and complex  Bonus side-effect:  Restructured apps also scale well on Shared Virtual Memory (clustered workstation) systems

24 Application Restructuring

25 Conclusions  Original versions not scalable on cc-NUMA  Simulation not accurate for quantitative results; implementation needed  Increasing the problem size is a poor solution  App restructuring works:  Restructured apps also perform well on SVM  Parallel efficiency of these versions is better  However, to validate the results, it would be a good idea to run the restructured apps on a larger number of processors

A Comparison of the MPI, SHMEM and Cache-coherent Programming Models on the SGI Origin2000

28 Purpose  Compare the three programming models on the Origin2000  We focus on scientific applications that access data regularly or predictably  Or that do not require fine-grained replication of irregularly accessed data

29 SGI Origin2000  Cache-coherent NUMA machine  64 processors  32 nodes, 2 MIPS R10000 processors each  512 MB memory per node  Interconnection network  16-vertex hypercube  A pair of nodes associated with each vertex

30 Three Programming Models  CC-SAS  Linear address space for shared memory  MP  Communicate with other processes explicitly via the Message Passing Interface (MPI)  SHMEM  Shared-memory one-sided communication library  Via get and put primitives (see the sketch below)
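A minimal sketch of how the same transfer looks under the MP and SHMEM models; this is illustrative only (the benchmarks are not written this way), and in practice the two halves would live in separate programs. CC-SAS would simply read and write the same shared array with ordinary loads and stores.

```c
#include <mpi.h>
#include <shmem.h>

#define N 1024

/* MP: explicit two-sided message passing from rank 0 to rank 1. */
void mp_exchange(double *buf, int rank) {
    if (rank == 0)
        MPI_Send(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

/* SHMEM: one-sided put into the remote PE's symmetric buffer. */
void shmem_exchange(double *sym_buf, const double *local, int my_pe) {
    if (my_pe == 0)
        shmem_double_put(sym_buf, local, N, 1);  /* write directly into PE 1 */
    shmem_barrier_all();  /* make the put visible before PE 1 uses sym_buf */
}
```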

31 Applications and Algorithms  FFT  All-to-all communication (regular)  Ocean  Nearest-neighbor communication  Radix  All-to-all communication (irregular)  LU  One-to-many communication

32 Questions to be answered  Can parallel algorithms be structured in the same way for good performance in all three models?  If there are substantial differences in performance under the three models, where are the key bottlenecks?  Do we need to change the data structures or algorithms substantially to solve those bottlenecks?

33 Performance Result

34 Questions:  Anything unusual in the previous slide?  Working sets fit in the cache for multiprocessors but not for uniprocessors  Why is MP much worse than CC-SAS and SHMEM?

35 Analysis: Execution time = BUSY + LMEM + RMEM + SYNC, where  BUSY: CPU computation time  LMEM: CPU stall time for local cache misses  RMEM: CPU stall time for sending/receiving remote data  SYNC: CPU time spent at synchronization events

36 Time breakdown for MP

37 Improving MP performance  Remove the extra data copy  Allocate all data involved in communication in the shared address space  Reduce SYNC time  Use lock-free queue management in communication instead (see the sketch below)
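For intuition, here is a minimal single-producer/single-consumer lock-free queue of the kind such an optimization might use; the paper does not give its implementation, so the C11-atomics design below is an assumption, not the authors' code:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define QSIZE 256  /* must be a power of two */

/* The producer only writes `head`, the consumer only writes `tail`,
 * so no lock is needed; acquire/release ordering publishes the slot
 * contents together with the index update. */
typedef struct {
    void *slot[QSIZE];
    _Atomic size_t head;  /* next slot to fill (producer-owned) */
    _Atomic size_t tail;  /* next slot to drain (consumer-owned) */
} spsc_queue;

bool spsc_enqueue(spsc_queue *q, void *msg) {
    size_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t t = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (h - t == QSIZE)
        return false;                            /* queue full */
    q->slot[h & (QSIZE - 1)] = msg;
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return true;
}

bool spsc_dequeue(spsc_queue *q, void **msg) {
    size_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t h = atomic_load_explicit(&q->head, memory_order_acquire);
    if (t == h)
        return false;                            /* queue empty */
    *msg = q->slot[t & (QSIZE - 1)];
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);
    return true;
}
```

In the improved MP scheme described above, a queue like this would live in the shared address space, so enqueue and dequeue replace lock-protected queue operations between communicating processes.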

38 Speedups under Improved MP

39 Question:  Why does CC-SAS perform best for small problem sizes?  Extra packing/unpacking operations in MP and SHMEM  Extra packet queue management in MP

40 Speedups for Ocean

41 Speedups for Radix

42 Speedups for LU

43 Conclusions  Good algorithm structures are portable among programming models.  MP is much worse than CC-SAS and SHMEM.  Similar performance can be achieved if the extra data copies and queue synchronization are handled well.