Architectural Interactions in High Performance Clusters


Architectural Interactions in High Performance Clusters
RTPP 98
David E. Culler, Computer Science Division, University of California, Berkeley

Run-Time Framework (diagram: on each node, a parallel program runs over a run-time layer, which runs over the machine architecture; the run-time layers and the machines are each connected through the network)

Two Example Run-Time Layers
Split-C: a thin global address space abstraction over Active Messages (get, put, read, write)
MPI: a thicker message passing abstraction over Active Messages (send, receive)

Split-C over Active Messages
Read, write, get, and put are built on small Active Message request/reply pairs (RPC style): a request handler runs on the remote node and a reply handler completes the operation back on the requester
Bulk transfer (store & get)
A sketch of the request/reply structure follows below.
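A minimal sketch of this structure, assuming a hypothetical Active Message API (am_request, am_reply, and am_poll stand in for the real GAM calls, whose exact signatures differ):

```c
/* Sketch only: a Split-C style blocking read layered on an Active
 * Message request/reply pair.  The am_* primitives are hypothetical
 * stand-ins, not the actual GAM API. */
#include <stdint.h>

typedef struct { volatile int done; uint64_t value; } read_wait_t;

/* Hypothetical AM primitives: a request runs a handler on the remote
 * node; a reply runs a handler back on the requester; poll drains the
 * network and runs pending handlers. */
extern void am_request(int node, void (*h)(int, void *, void *),
                       void *arg0, void *arg1);
extern void am_reply(int node, void (*h)(int, uint64_t, void *),
                     uint64_t val, void *arg1);
extern void am_poll(void);

/* Reply handler: runs on the requesting node, deposits the value. */
static void read_reply(int from, uint64_t value, void *wait) {
    read_wait_t *w = wait;
    w->value = value;
    w->done  = 1;
}

/* Request handler: runs on the remote node and replies immediately
 * with the requested word (the RPC of the slide). */
static void read_request(int from, void *addr, void *wait) {
    am_reply(from, read_reply, *(uint64_t *)addr, wait);
}

/* Blocking global read: one small request, then poll until the reply
 * handler fires.  Split-C's get is the same shape minus the final wait. */
uint64_t global_read(int node, uint64_t *remote_addr) {
    read_wait_t w = { 0, 0 };
    am_request(node, read_request, remote_addr, &w);
    while (!w.done)
        am_poll();
    return w.value;
}
```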

Model Framework: LogP (diagram: processor/memory pairs connected by a limited-volume interconnection network, at most L/g messages in flight per processor)
L: latency in sending a (small) message between modules
o: overhead felt by the processor on sending or receiving a message
g: gap between successive sends or receives (1/rate)
P: number of processor/memory modules
Round-trip time: 2 × (2o + L)
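In formula form, with illustrative parameter values (not measurements from the talk) to make the round-trip arithmetic concrete:

```latex
\text{RTT} = 2\,(2o + L);
\qquad \text{e.g. } o = 3\,\mu\text{s},\; L = 5\,\mu\text{s}
\;\Rightarrow\; \text{RTT} = 2\,(2 \cdot 3 + 5) = 22\,\mu\text{s}
```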

LogP Summary of Current Machines (chart of LogP parameters per machine; maximum bandwidths of 38, 141, and 47 MB/s)

Methodology
Apparatus: 35 Ultra 170s (64 MB memory, 0.5 MB L2 cache, Solaris 2.5), Myricom M2F Lanai interfaces on a Myrinet in a fat-tree variant, GAM + Split-C
Modify the Active Message layer to inflate L, o, g, or G independently
Execute a diverse suite of applications and observe the effect
Evaluate against natural performance models

Adjusting L, o, and g (and G) in situ
o: stall the Ultra in the host-side AM library on message write (send) and on message read (receive)
L: defer marking a message as valid in the Lanai until receive time + L
g: delay the Lanai after message injection (after each fragment, for bulk transfers)
A sketch of the send-side stall follows below.
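A sketch of the send-side overhead hook, assuming the AM library funnels sends through a single entry point (am_raw_send and extra_o_ns are hypothetical names). The essential design choice, per the slide, is a busy-wait on the host CPU, so the added time is charged to the processor (o) rather than to the wire (L) or to the injection rate (g):

```c
/* Sketch: inflating the send-side overhead o in situ by spinning on
 * the host processor before the message reaches the interface.  The
 * receive path gets a symmetric stall on message read. */
#include <stddef.h>
#include <time.h>

static long extra_o_ns;   /* calibrated inflation per message (hypothetical) */

extern void am_raw_send(int dest, const void *msg, size_t len);  /* stand-in */

/* Busy-wait for ns nanoseconds on the host processor; a sleep would
 * free the CPU and so would not model overhead at all. */
static void stall_host(long ns) {
    struct timespec t0, t;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    do {
        clock_gettime(CLOCK_MONOTONIC, &t);
    } while ((t.tv_sec - t0.tv_sec) * 1000000000L
             + (t.tv_nsec - t0.tv_nsec) < ns);
}

/* Wrapped send: stall, then hand the message to the interface. */
void am_send_inflated(int dest, const void *msg, size_t len) {
    stall_host(extra_o_ns);
    am_raw_send(dest, msg, len);
}
```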

Calibration

Application Characteristics
Message frequency
Write-based vs. read-based
Short vs. bulk messages
Synchronization
Communication balance

Applications used in the Study

Baseline Communication

Application Sensitivity to Communication Performance

Sensitivity to Overhead

Sensitivity to gap (1/msg rate)

Sensitivity to Latency

Sensitivity to bulk BW (1/G)

Modeling Effects of Overhead
Tpred = Torig + 2 × (max #msgs) × o, counting request/response pairs
The processor with the most messages limits overall time
Why does this model under-predict?
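As a worked instance with illustrative numbers (not from the study): if the busiest processor exchanges m = 10^4 request/response pairs and overhead is inflated by Δo = 10 µs, the model predicts

```latex
T_{\text{pred}} = T_{\text{orig}} + 2\,m\,\Delta o
               = T_{\text{orig}} + 2 \times 10^{4} \times 10\,\mu\text{s}
               = T_{\text{orig}} + 0.2\,\text{s}
```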

Modeling Effects of gap
Uniform communication model: Tpred = Torig if g < I (the average message interval); Tpred = Torig + m(g − I) otherwise
Bursty communication: Tpred = Torig + m·g
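Written out, with m the number of messages on the critical processor and I the average interval between them:

```latex
T_{\text{pred}} =
\begin{cases}
T_{\text{orig}}, & g < I \ \text{(the gap hides between messages)}\\
T_{\text{orig}} + m\,(g - I), & g \ge I
\end{cases}
\qquad
T_{\text{pred}}^{\text{bursty}} = T_{\text{orig}} + m\,g
```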

Extrapolating to Low Overhead

MPI over AM: ping-pong bandwidth
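For reference, a minimal sketch of the kind of ping-pong microbenchmark behind a curve like this (standard MPI calls only; the message size and iteration count are arbitrary choices, not the study's):

```c
/* Ping-pong bandwidth: rank 0 sends a buffer, rank 1 echoes it back;
 * bandwidth counts the bytes moved in both directions.  Run with at
 * least two ranks, e.g.: mpirun -np 2 ./pingpong */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, iters = 100, bytes = 1 << 20;   /* 1 MB messages */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    char *buf = malloc(bytes);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double dt = MPI_Wtime() - t0;
    if (rank == 0)
        printf("bandwidth: %.1f MB/s\n", 2.0 * bytes * iters / dt / 1e6);
    free(buf);
    MPI_Finalize();
    return 0;
}
```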

MPI over AM: start-up

NPB2 Speedup: NOW vs SP2

NOW vs. Origin

Single Processor Performance

Understanding Speedup
Speedup(p) = T1 / maxp(Tcompute + Tcomm + Twait)
Tcompute = (work/p + extra) × efficiency
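The same model in display form; the max over processors is the point of the denominator, since the slowest processor, including its communication and wait time, sets the parallel execution time:

```latex
\mathrm{Speedup}(p) =
  \frac{T_1}{\max_{p}\bigl(T_{\text{compute}} + T_{\text{comm}} + T_{\text{wait}}\bigr)},
\qquad
T_{\text{compute}} = \Bigl(\tfrac{\text{work}}{p} + \text{extra}\Bigr) \times \text{efficiency}
```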

Performance Tools for Clusters
Independent data collection on every node: timing, sampling, tracing
Little perturbation of global effects
A minimal tracing sketch follows below.
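A minimal sketch of per-node tracing in this spirit: each node appends (timestamp, event) records to a local buffer and dumps them for offline merging, so collection never blocks and never touches other nodes (all names here are illustrative, not the actual NOW tooling):

```c
/* Per-node event tracer: fixed local buffer, drop on overflow rather
 * than block, dump to a per-node file for offline merging. */
#include <stdio.h>
#include <time.h>

#define TRACE_CAP 65536

typedef struct { struct timespec ts; int event; } trace_rec_t;

static trace_rec_t trace_buf[TRACE_CAP];
static int trace_len;

/* Record one event with a local timestamp; O(1), no I/O, no locks. */
static inline void trace_event(int event) {
    if (trace_len < TRACE_CAP) {
        clock_gettime(CLOCK_MONOTONIC, &trace_buf[trace_len].ts);
        trace_buf[trace_len++].event = event;
    }
}

/* Dump this node's buffer; merging across nodes happens offline. */
void trace_dump(int my_node) {
    char path[64];
    snprintf(path, sizeof path, "trace.%d", my_node);
    FILE *f = fopen(path, "w");
    if (!f) return;
    for (int i = 0; i < trace_len; i++)
        fprintf(f, "%ld.%09ld %d\n", (long)trace_buf[i].ts.tv_sec,
                trace_buf[i].ts.tv_nsec, trace_buf[i].event);
    fclose(f);
}
```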

Where the Time Goes: LU-a

Where the Time Goes: BT-a

Constant Problem Size Scaling (chart; axis ticks 4, 8, 16, 32, 64, 128, 256, likely processor counts)

Communication Scaling (charts: normalized messages per processor; average message size)

Communication Scaling: Volume

Extra Work

Cache Working Sets: LU (8-fold reduction in miss rate from 4 to 8 processors)

Cache Working Sets: BT

Cycles per Instruction

MPI Internal Protocol (sender/receiver timeline figure)

Revised Protocol (sender/receiver timeline figure)

Sensitivity to Overhead

Conclusions
Run-time systems for parallel programs must deal with a host of architectural interactions: communication, computation, and the memory system
Build a performance model of your RTPP; it is the only way to recognize anomalies
Build tools along with the run-time to reflect characteristics and sensitivity back to the parallel program
Much can lurk beneath a perfect speedup curve