Understanding Application Scaling: NAS Parallel Benchmarks 2.2 on NOW and SGI Origin 2000
Frederick Wong, Rich Martin, Remzi Arpaci-Dusseau, David Wu, and David Culler
Computer Science Division, Department of Electrical Engineering and Computer Science
University of California, Berkeley
June 15th, 1998

Introduction
- The NAS Parallel Benchmarks suite 2.2 (NPB) has been widely used to evaluate modern parallel systems
- 7 scientific benchmarks that represent the most common computation kernels
- NPB is written on top of the Message Passing Interface (MPI) for portability
- NPB is a Constant Problem Size (CPS) scaling benchmark suite
- This study focuses on understanding NPB scaling on both NOW and the SGI Origin 2000
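As a point of reference (a standard formulation, not taken from the slides), constant-problem-size scaling fixes the input size N while the processor count p grows, so speedup and efficiency are measured as:

    S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p}, \qquad \text{data per processor} \approx \frac{N}{p}

The shrinking per-processor share of the data is what drives the working-set effects discussed later.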

Motivation
- An early study of NPB showed ideal speedup on NOW!
  - Scaling as good as the T3D and better than the SP-2
  - Per-node performance better than the T3D, close to the SP-2
- Submitted results for the Origin 2000 show a spread

Presentation Outline
- Hardware Configuration
- Time Breakdown of the Applications
- Communication Performance
- Computation Performance
- Conclusion

Hardware Configuration
- SGI Origin 2000 (64 nodes)
  - MIPS R10000 processor, 195 MHz, 32KB/32KB L1 caches
  - 4MB external L2 cache per processor
  - 16GB memory total
  - MPI performance: 13 μs one-way latency, 150 MB/s peak bandwidth, half-power point at 8KB message size
- Network Of Workstations (NOW)
  - UltraSPARC I processor, 167 MHz, 16KB/16KB L1 caches
  - 512KB external L2 cache per processor
  - 128 MB memory per processor
  - MPI performance: 22 μs one-way latency, 27 MB/s peak bandwidth, half-power point at 4KB message size
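For context, latency and bandwidth figures of this kind are what a simple MPI ping-pong micro-benchmark produces. The sketch below is illustrative only; the message size, iteration count, and output format are assumptions, not the harness used in the study.

    /*
     * Illustrative MPI ping-pong micro-benchmark (a sketch, not the harness
     * used in the study).  Ranks 0 and 1 bounce an n-byte message back and
     * forth; the round-trip time gives one-way latency and delivered
     * bandwidth at that message size.  Run with at least two ranks.
     */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, i, iters = 1000;
        int n = 8 * 1024;                 /* message size in bytes (assumed) */
        char *buf;
        double t0, t, one_way, bw;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        buf = malloc(n);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t = MPI_Wtime() - t0;

        if (rank == 0) {
            one_way = t / (2.0 * iters);        /* seconds per one-way message */
            bw = (double)n / one_way / 1e6;     /* delivered MB/s at this size */
            printf("size=%d B  latency=%.1f us  bandwidth=%.1f MB/s\n",
                   n, one_way * 1e6, bw);
        }

        free(buf);
        MPI_Finalize();
        return 0;
    }

Sweeping n from a few bytes to several hundred KB traces out the bandwidth curve from which the half-power points quoted above are read off.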

Time Breakdown -- LU
- Black line: total running time
  - analogy: a single man takes 10 secs to do the job
  - ideally, 2 men would require 5 secs each, so the total amount of work stays at 10 secs
- In practice there is more work, because the parallel version needs communication
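One way to obtain such a breakdown (a minimal sketch, not the instrumentation used in the study; the wrapper name timed_allreduce is invented for illustration) is to accumulate the time spent inside MPI calls and subtract it from the total running time:

    /*
     * Illustrative time-breakdown instrumentation: communication time is
     * accumulated around wrapped MPI calls, and compute time is what remains.
     */
    #include <mpi.h>
    #include <stdio.h>

    static double comm_time = 0.0;    /* seconds spent inside wrapped MPI calls */

    static int timed_allreduce(const void *sendbuf, void *recvbuf, int count,
                               MPI_Datatype type, MPI_Op op, MPI_Comm comm)
    {
        double t0 = MPI_Wtime();
        int rc = MPI_Allreduce(sendbuf, recvbuf, count, type, op, comm);
        comm_time += MPI_Wtime() - t0;
        return rc;
    }

    static void report(double t_start)
    {
        double total = MPI_Wtime() - t_start;
        printf("total %.2f s = compute %.2f s + communicate %.2f s\n",
               total, total - comm_time, comm_time);
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        double t_start = MPI_Wtime();

        double local = 1.0, global = 0.0;
        /* ... computation phase would go here ... */
        timed_allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        report(t_start);
        MPI_Finalize();
        return 0;
    }

The same pattern extends to point-to-point sends and receives, giving the per-phase breakdowns shown in the charts.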

Time Breakdown -- LU

Time Breakdown -- SP

Communication Performance
- Micro-benchmarks show that the SGI Origin 2000 has better point-to-point communication performance than NOW

Communication Efficiency
- The absolute bandwidth delivered by the applications is close on the two systems (e.g., SP on 32 processors on NOW vs. on the SGI)
- Communication efficiency on the SGI reaches only about 30% of its potential bandwidth
- Protocol tradeoffs are pronounced
  - hand-shake vs. bulk-send protocols in point-to-point messages
  - collective operations
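A common linear cost model (an assumption for illustration, not taken from the slides) makes the half-power point and the notion of delivered bandwidth precise:

    T(n) \approx t_0 + \frac{n}{B_{peak}}, \qquad B(n) = \frac{n}{T(n)}, \qquad n_{1/2} = t_0 \, B_{peak}

where n_{1/2} is the message size at which delivered bandwidth reaches half of B_{peak}, and communication efficiency as used above is B_{delivered}/B_{peak}. Real protocol behavior (hand-shake vs. bulk-send, collective implementations) deviates from this simple model, which is exactly why micro-benchmark numbers alone mispredict application performance.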

Computation Performance
- The relative single-node performance of the benchmarks roughly matches the difference in processor performance
- Both the computational CPI and the L2 miss rate change significantly on both platforms as the applications scale
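The textbook decomposition behind these two quantities (standard architecture formulas, not specific to this study) is:

    T_{compute} = IC \times CPI \times t_{cycle}, \qquad CPI = CPI_{exec} + \frac{\text{L2 misses}}{\text{instruction}} \times \text{miss penalty}

so a change in either the instruction count (extra work under scaling) or the L2 miss behavior shows up directly in computation time.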

Recap on CPS Scaling

LU Working Set
- 4 processors: knee starts at 256KB
- 8 processors: knee starts at 128KB
- 16 processors: knee starts at 64KB
- 32 processors: knee starts at 32KB
- The miss rate drops as the global cache grows from 2MB to 4MB
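The knees are consistent with the per-processor working set shrinking under CPS scaling (a back-of-the-envelope reading, not a claim from the slides):

    WS(p) \approx \frac{WS_{total}}{p} \quad\Rightarrow\quad 256\text{KB at } p=4,\; 128\text{KB at } p=8,\; \ldots,\; 32\text{KB at } p=32

while the aggregate (global) L2 capacity grows as p \times C_{L2}. Once that aggregate capacity exceeds the fixed total working set, somewhere between 2MB and 4MB here, the global miss rate drops.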

SP Working Set
- Cost under scaling: the extra work worsens the memory system's performance
- Cost-benefit: total memory references on the SGI grow by 12.38% from the 4-processor run to the 25-processor run

Conclusion
- NPB
  - μ-benchmarks make it hard to predict delivered communication performance
  - the effective increase in global cache reduces computation time
  - the sequential node architecture is a dominant factor in NPB performance
- NOW
  - an inexpensive way to go parallel
  - absolute performance is excellent
  - MPI on NOW has good scalability and performance
  - NOW vs. a proprietary system -- detailed instrumentation ability
- Speedup cannot tell the whole story; scalability involves:
  - the interplay of program and machine scaling
  - delivered communication performance, not μ-benchmark numbers
  - complicated memory system performance