1 Parallel Applications
15-740 Computer Architecture
Ning Hu, Stefan Niculescu & Vahe Poladian
November 22, 2002

2 Scaling Parallel Applications
Ning Hu, Stefan Niculescu, Vahe Poladian

3 The Question
- Can distributed shared memory, cache-coherent, non-uniform memory access (DSM cc-NUMA) architectures scale on parallel applications?
- What do we mean by "scale"?
  - Achieving parallel efficiency of at least 60%
  - For a fixed problem size
  - While increasing the number of processors (see the efficiency formula below)
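For reference, parallel efficiency is the standard metric relating speedup to processor count; the slide assumes it without spelling it out:

```latex
% T(1): sequential execution time; T(p): execution time on p processors.
\[
  S(p) = \frac{T(1)}{T(p)}, \qquad
  E(p) = \frac{S(p)}{p} = \frac{T(1)}{p\,T(p)}
\]
% "Scaling" in this talk means sustaining E(p) >= 0.6 as p grows,
% with the problem size held fixed.
```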

4 DSM, cc-NUMA
- Each processor has a private cache
- A shared address space is constructed from the "public" memory of each processor
- Ordinary loads/stores are used to access all of memory (see the sketch below)
- Hardware ensures cache coherence
- Non-uniform: the miss penalty for remote data is higher than for local data
- SGI Origin2000 chosen as an aggressive representative of this architectural family
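To make the programming model concrete, here is a minimal pthreads sketch (illustrative, not from the talk): threads simply load and store through a shared array, and on a cc-NUMA machine the hardware transparently fetches remote lines and keeps caches coherent; remoteness shows up only as a higher miss penalty.

```c
/* Minimal sketch of the cc-NUMA/DSM programming model (illustrative only).
 * Both threads use plain loads/stores on the same array; on a cc-NUMA
 * machine one thread's accesses may be remote, but the code is identical. */
#include <pthread.h>
#include <stdio.h>

#define N 1024
static double shared_data[N];   /* lives in some node's physical memory */

static void *worker(void *arg) {
    long id = (long)arg;
    /* Plain stores: hardware keeps caches coherent and fetches remote
     * lines transparently; remote misses simply cost more cycles. */
    for (long i = id; i < N; i += 2)
        shared_data[i] = i * 2.0;
    return NULL;
}

int main(void) {
    pthread_t t[2];
    for (long id = 0; id < 2; id++)
        pthread_create(&t[id], NULL, worker, (void *)id);
    for (long id = 0; id < 2; id++)
        pthread_join(t[id], NULL);
    printf("shared_data[1] = %f\n", shared_data[1]);
    return 0;
}
```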

5 Origin 2000 Overview
- Nodes are placed at the vertices of a hypercube:
  - This ensures communication latency grows linearly with the number of doublings of the node count, i.e., logarithmically in the total number of nodes (see the sketch below)
- Each node holds two 195 MHz processors, each with its own 32 KB first-level cache and a 4 MB second-level cache
- Total addressable memory is 32 GB
- Among the most aggressive machines in terms of its remote-to-local memory access latency ratio
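The hypercube property can be made concrete with a small helper (illustrative, not from the talk): the number of network hops between two nodes equals the Hamming distance of their node IDs, so doubling the machine adds one dimension and at most one extra hop to the worst-case path.

```c
/* Hop count between two hypercube nodes = Hamming distance of their IDs.
 * Doubling the machine (adding one dimension) adds at most one hop, so
 * worst-case latency grows as log2(number of nodes). */
#include <stdio.h>

static int hypercube_hops(unsigned a, unsigned b) {
    unsigned x = a ^ b;               /* differing address bits */
    int hops = 0;
    while (x) { hops += x & 1u; x >>= 1; }
    return hops;
}

int main(void) {
    /* In a 32-node (5-dimensional) hypercube, nodes 0 and 31 are 5 hops apart. */
    printf("hops(0, 31) = %d\n", hypercube_hops(0, 31));
    return 0;
}
```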

6 Benchmarks
- SPLASH-2: Barnes-Hut, Ocean, Radix Sort, etc.
- 3 new applications: Shear Warp, Infer, and Protein
- Together they cover a range of communication-to-computation ratios and of temporal and spatial locality
- Initial problem sizes determined from earlier experiments:
  - Simulation with 256 processors
  - Implementation with 32 processors

7 Initial Experiments

8 Avg. Exec. Time Breakdown

9 Problem Size
- Idea: increase the problem size until the desired level of efficiency is achieved
- Question: is this feasible?
- Question: even if feasible, is it desirable? (see the back-of-the-envelope sketch below)
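A back-of-the-envelope way to frame both questions (illustrative reasoning, not from the slide), using a solver on an n-by-n grid as the assumed example:

```latex
% Illustrative only: a solver on an n-by-n grid, where
\[
  \text{Memory}(n) = \Theta(n^2), \qquad \text{Work}(n) = \Omega(n^2)
\]
% Feasibility: growing n to recover efficiency is capped by the machine's
% physical memory (32 GB on this Origin 2000 configuration).
% Desirability: even at perfect efficiency, doubling n means at least a
% 4x longer run, so the larger problem may not be the one users care about.
```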

10 Changing Problem Size

11 Why Problem Size Helps
- Communication-to-computation ratio improves
- Less load imbalance, in both computation and communication costs
- Less waiting at synchronization points
- Superlinearity effects of cache size:
  - Help at larger processor counts
  - Hurt at smaller processor counts
- Less false sharing (illustrated in the sketch below)
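False sharing, the last point above, deserves a concrete illustration (a minimal sketch with illustrative names and sizes): two threads update logically unrelated variables that happen to sit on the same cache line, so the coherence protocol bounces the line between their caches even though no data is actually shared. Larger problems spread each processor's data over more lines, making such accidental overlap rarer.

```c
/* Illustration of false sharing (names and sizes are illustrative).
 * Both counters fit in one cache line, so two threads incrementing
 * "different" data still fight over the same coherence unit. */
#include <pthread.h>
#include <stdio.h>

#define ITERS 10000000L

struct counters {
    volatile long a;   /* updated by thread 0 */
    volatile long b;   /* updated by thread 1: same cache line as a! */
    /* Padding each counter out to its own 64-byte line, e.g.
     * char pad[64 - sizeof(long)]; after each field, removes the
     * false sharing and typically speeds this loop up dramatically. */
};

static struct counters c;

static void *bump_a(void *arg) {
    (void)arg;
    for (long i = 0; i < ITERS; i++) c.a++;
    return NULL;
}

static void *bump_b(void *arg) {
    (void)arg;
    for (long i = 0; i < ITERS; i++) c.b++;
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, bump_a, NULL);
    pthread_create(&t1, NULL, bump_b, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("a = %ld, b = %ld\n", c.a, c.b);
    return 0;
}
```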

12 Application Restructuring
- What kind of restructuring: algorithmic changes, data partitioning
- Ways restructuring helps:
  - Reduced communication
  - Better data placement
  - Static partitioning for better load balance (sketched below)
- Restructuring is application-specific and complex
- Bonus side effect: restructured apps also scale well on shared virtual memory (SVM) systems, i.e., clusters of workstations
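As one concrete flavor of such restructuring (an illustrative sketch under assumed names, not the paper's actual code): assigning each thread a contiguous block of grid rows, rather than scattering rows round-robin, keeps each thread's working set in one region; on a NUMA machine that region can be placed in the thread's local memory, improving placement and reducing communication.

```c
/* Sketch: static block partitioning of grid rows among nthreads threads.
 * Contiguous blocks keep each thread's data in one region (good NUMA
 * placement, little boundary communication), unlike round-robin rows. */
#include <stdio.h>

static void block_range(long nrows, int nthreads, int tid,
                        long *lo, long *hi) {
    long base = nrows / nthreads, extra = nrows % nthreads;
    /* The first "extra" threads get one extra row for load balance. */
    *lo = tid * base + (tid < extra ? tid : extra);
    *hi = *lo + base + (tid < extra ? 1 : 0);   /* exclusive bound */
}

int main(void) {
    long lo, hi;
    for (int tid = 0; tid < 4; tid++) {
        block_range(10, 4, tid, &lo, &hi);
        printf("thread %d: rows [%ld, %ld)\n", tid, lo, hi);
    }
    return 0;
}
```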

13 App Restructuring

14 Conclusions
- The original versions of the applications are not scalable on cc-NUMA
- Simulation is not accurate enough for quantitative results; implementation runs are needed
- Increasing the problem size is a poor solution
- Application restructuring works:
  - Restructured apps also perform well on SVM
  - The parallel efficiency of these versions is better
- However, to validate the results, it would be a good idea to run the restructured apps on a larger number of processors

15 STOP