A comparison of CC-SAS, MP and SHMEM on SGI Origin2000.


Three Programming Models
- CC-SAS: a linear address space for shared memory
- MP: communicate with other processes explicitly through a message-passing interface
- SHMEM: communicate through one-sided get and put primitives

Platforms
- Tightly coupled multiprocessors: the SGI Origin2000, a cache-coherent distributed-shared-memory machine
- Less tightly coupled clusters: workstations connected by Ethernet

Purpose
Compare the three programming models on the Origin2000, a modern 64-processor hardware cache-coherent machine. We focus on scientific applications that access data regularly or predictably.

Questions to be answered
- Can parallel algorithms be structured the same way for good performance in all three models?
- If there are substantial performance differences among the three models, where are the key bottlenecks?
- Do we need to change the data structures or algorithms substantially to remove those bottlenecks?

Applications and Algorithms
- FFT: all-to-all communication (regular)
- Ocean: nearest-neighbor communication
- Radix: all-to-all communication (irregular)
- LU: one-to-many communication

Performance Results

Question: Why is MP much worse than CC-SAS and SHMEM?

Analysis
Execution time = BUSY + LMEM + RMEM + SYNC, where
- BUSY: CPU computation time
- LMEM: CPU stall time on local cache misses
- RMEM: CPU stall time sending/receiving remote data
- SYNC: CPU time spent at synchronization events

Where does the time go in MP?

Improving MP performance
- Remove the extra data copy: allocate all data involved in communication in the shared address space
- Reduce SYNC time: use lock-free queue management in communication instead of locks
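The lock-free queue idea can be sketched as a single-producer/single-consumer ring buffer: the head index is written only by the consumer and the tail only by the producer, so neither side needs a lock. This is a minimal sketch of the technique, not the paper's implementation (which would live in the shared address space and use the machine's memory-ordering guarantees).

```python
# Minimal SPSC ring buffer: lock-free because each index has one writer.
class SPSCQueue:
    def __init__(self, capacity):
        self.buf = [None] * (capacity + 1)  # one slot wasted to tell full from empty
        self.head = 0                       # advanced only by the consumer
        self.tail = 0                       # advanced only by the producer

    def push(self, item):
        nxt = (self.tail + 1) % len(self.buf)
        if nxt == self.head:
            return False                    # queue full
        self.buf[self.tail] = item
        self.tail = nxt                     # publish only after the slot is written
        return True

    def pop(self):
        if self.head == self.tail:
            return None                     # queue empty
        item = self.buf[self.head]
        self.head = (self.head + 1) % len(self.buf)
        return item

q = SPSCQueue(2)
assert q.push(1) and q.push(2) and not q.push(3)  # capacity reached
assert q.pop() == 1 and q.pop() == 2 and q.pop() is None
```

On real hardware the "publish after the slot is written" ordering must be enforced with the platform's memory fences; the Python sketch only shows the index discipline.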

Speedups under Improved MP

Why does CC-SAS perform best?

- Extra packing/unpacking operations in MP and SHMEM
- Extra packet-queue management in MP
- …
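The packing overhead is easy to see for strided data such as a matrix column: a message-passing program must gather the strided elements into a contiguous buffer before sending, and scatter them back on arrival, while a CC-SAS program just reads the remote column with ordinary loads. A small illustrative sketch (hypothetical helper names):

```python
# Sketch of the pack/unpack copies that MP and SHMEM pay for strided data.
def pack_column(matrix, col):
    """Gather one (strided) column into a contiguous send buffer."""
    return [row[col] for row in matrix]

def unpack_column(matrix, col, buf):
    """Scatter a received buffer back into a (strided) column."""
    for row, value in zip(matrix, buf):
        row[col] = value

a = [[1, 2], [3, 4], [5, 6]]
b = [[0, 0], [0, 0], [0, 0]]
buf = pack_column(a, 1)      # extra copy #1, on the sender
unpack_column(b, 1, buf)     # extra copy #2, on the receiver
assert [row[1] for row in b] == [2, 4, 6]
```

The two extra copies (and the buffer allocation) are pure overhead relative to a shared-address-space program that accesses the column in place.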

Speedups for Ocean

Speedups for Radix

Speedups for LU

Conclusions
- Good algorithm structures are portable among the programming models.
- MP performs much worse than CC-SAS and SHMEM on a hardware-coherent machine. However, it can achieve similar performance once the extra data copies and queue synchronization are addressed.
- Something about programmability

Future work
- What about applications that genuinely have irregular, unpredictable, and naturally fine-grained data access and communication patterns?
- What about software-coherent machines (i.e., clusters)?