VTF Applications Performance and Scalability
Sharon Brunett, CACR/Caltech
ASCI Site Review, October 28-29, 2003

ASCI Platform Specifics
LLNL's IBM SP3 (frost)
– 65-node SMP, 375 MHz Power3 Nighthawk-2 (16 CPUs/node)
– 16 GB memory/node
– ~20 TB global parallel file system
– SP Switch2 (Colony switch), 2 GB/sec bi-directional node-to-node bandwidth
LANL's HP/Compaq AlphaServer ES45 (QSC)
– 256-node SMP, 1.25 GHz Alpha EV6 (4 CPUs/node)
– 16 GB memory/node
– ~12 TB global file system
– Quadrics (QsNet) interconnect: 2 μs latency, 300 MB/sec bandwidth
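As a rough cross-check of the aggregate capacity implied by the per-node figures above, a minimal sketch (simple arithmetic on the quoted numbers, not vendor specifications):

```c
#include <stdio.h>

/* Aggregate capacity implied by the per-node figures on this slide.
 * Back-of-the-envelope totals, not official machine specifications. */
int main(void) {
    /* LLNL Frost: 65 SMP nodes, 16 CPUs and 16 GB per node */
    int frost_nodes = 65, frost_cpus_per_node = 16, frost_gb_per_node = 16;
    /* LANL QSC: 256 SMP nodes, 4 CPUs and 16 GB per node */
    int qsc_nodes = 256, qsc_cpus_per_node = 4, qsc_gb_per_node = 16;

    printf("Frost: %d CPUs, %d GB total memory\n",
           frost_nodes * frost_cpus_per_node,   /* 1040 CPUs */
           frost_nodes * frost_gb_per_node);    /* 1040 GB   */
    printf("QSC:   %d CPUs, %d GB total memory\n",
           qsc_nodes * qsc_cpus_per_node,       /* 1024 CPUs */
           qsc_nodes * qsc_gb_per_node);        /* 4096 GB   */
    return 0;
}
```

The 1024-processor runs reported on later slides therefore use essentially all of Frost's CPUs, or all of QSC's.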

Multiscale Polycrystal Studies
Quantitative assessment of microstructural effects in macroscopic material response through the computation of full-field solutions of polycrystals
– Inhomogeneous plastic deformation fields
– Grain-boundary effects: stress concentration, dislocation pile-up, constraint-induced multislip
– Size dependence: (inverse) Hall-Petch effect
Resolve (as opposed to model) mesoscale behavior, exploiting the power of high-performance computing
Enable full-scale simulation of engineering systems incorporating micromechanical effects

Mesh Generation
Ingrain subdivision behavior can be simulated in both single crystals and polycrystals
– Texture simulation results agree well with experimental results
Mesh generation method preserves the topology of individual grain shapes
– Enables effective interactions between grains
Increasing the grain count in polycrystals gives a more stable mechanical response
A single grain corresponds to a single cell in a crystal

1.5 Million Element, 1241 Grain Multiscale Polycrystal Simulation
Simulation carried out on 1024 processors of LLNL's IBM SP3, frost

Multiscale Polycrystal Performance
Aggregate parallel performance
LANL's QSC
– Floating point operations: 10.67% of peak
– Integer operations: 15.39% of peak
– Memory operations: % of peak
– DCPI hardware counters used to collect data
– Qopcounter tool used to analyze the DCPI database
LLNL's Frost
– L1 cache hit rate: 98% (load/store instructions executed without main memory access)
– Load Store Unit idle: 36%
– Floating point operations: 4.47% of peak
– hpmcount tool used to count hardware events during program execution
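The percent-of-peak figures above come from hardware event counts (hpmcount on Frost, DCPI on QSC) compared against the processors' theoretical peak rates. A minimal sketch of that conversion follows; the peak value used (375 MHz Power3 with two fused multiply-add units, i.e. 1.5 GFLOP/s per CPU) and the sample numbers are assumptions for illustration, not taken from the slide.

```c
#include <stdio.h>

/* Convert a measured FLOP count into a percent-of-peak figure.
 * ASSUMPTION: 1.5e9 FLOP/s peak per 375 MHz Power3 CPU (4 floating-point
 * results per cycle); substitute the correct peak for the machine measured. */
double percent_of_peak(double measured_flops, double elapsed_sec,
                       int ncpus, double peak_flops_per_cpu) {
    double achieved = measured_flops / elapsed_sec;   /* FLOP/s achieved */
    double peak     = peak_flops_per_cpu * ncpus;     /* FLOP/s possible */
    return 100.0 * achieved / peak;
}

int main(void) {
    /* Hypothetical numbers, stand-ins for hpmcount/DCPI output */
    double flops = 4.0e12, seconds = 300.0;
    printf("%.2f%% of peak\n",
           percent_of_peak(flops, seconds, 64, 1.5e9));   /* ~13.9%% */
    return 0;
}
```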

Multiscale Polycrystal Performance II
MPI routines can consume ~30% of runtime for large runs on Frost
– Workload imbalance as grains are distributed across nodes
– MPI_Waitall every step dominates communication time; nearest-neighbor sends take longer from nodes with computationally heavy grains
Routines taking the most CPU time on QSC
– resolved_fcc_cuitino 18.85%
– upslip_fcc_cuitino_explicit 11.74%
– setafcc 9.16%
– matvec 8.5%
– ~50% of execution time in these 4 routines
Room for performance improvement with better load balancing and routine-level optimization
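The MPI_Waitall cost pattern described above is characteristic of a nonblocking nearest-neighbor exchange: every rank posts its receives and sends, then blocks in MPI_Waitall until all neighbors have delivered, so a rank adjacent to a computationally heavy node inherits that node's delay as wait time. A minimal, hypothetical sketch of such an exchange (not the VTF source code):

```c
#include <mpi.h>

/* Hypothetical halo exchange with 'nnbr' neighbor ranks.
 * Ranks that finish their local grain computations early still sit in
 * MPI_Waitall until slower neighbors post their sends, which is how
 * load imbalance shows up as "MPI time". */
void exchange_halos(double *sendbuf[], double *recvbuf[], int count,
                    const int nbr[], int nnbr, MPI_Comm comm) {
    MPI_Request req[2 * nnbr];

    for (int i = 0; i < nnbr; ++i)
        MPI_Irecv(recvbuf[i], count, MPI_DOUBLE, nbr[i], 0, comm, &req[i]);
    for (int i = 0; i < nnbr; ++i)
        MPI_Isend(sendbuf[i], count, MPI_DOUBLE, nbr[i], 0, comm, &req[nnbr + i]);

    /* Dominant communication cost reported on Frost: time spent here
     * waiting for the slowest neighbor each step. */
    MPI_Waitall(2 * nnbr, req, MPI_STATUSES_IGNORE);
}
```

Balancing grains across nodes so that neighbors reach the exchange at roughly the same time reduces the MPI_Waitall share directly, which is the load-balancing improvement the slide points to.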

Multiscale Polycrystal Scaling on LLNL's IBM SP3, Frost
[scaling plot; element counts shown on the axis]
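The plotted values do not survive in this transcript, but scaling figures like this one are conventionally reduced to speedup and parallel efficiency relative to a baseline run. A minimal sketch of that reduction, with placeholder timings rather than the measured Frost data:

```c
#include <stdio.h>

/* Reduce wall-clock times at several processor counts to speedup and
 * parallel efficiency relative to the smallest run.  The timings below
 * are hypothetical placeholders, NOT the measured Frost results. */
int main(void) {
    int    procs[] = { 64, 128, 256, 512, 1024 };
    double secs[]  = { 800.0, 420.0, 230.0, 130.0, 80.0 };
    int n = sizeof procs / sizeof procs[0];

    for (int i = 0; i < n; ++i) {
        double speedup    = secs[0] / secs[i];               /* vs. baseline run   */
        double efficiency = speedup * procs[0] / procs[i];   /* fraction of ideal  */
        printf("%5d procs: speedup %.2f, efficiency %.0f%%\n",
               procs[i], speedup, 100.0 * efficiency);
    }
    return 0;
}
```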

Multiscale Polycrystal Scaling on LANL's HP/Compaq, QSC
[scaling plot; element counts shown on the axis]

Scaling for Polycrystalline Copper in a Shear Compression Specimen Configuration
LANL's HP/Compaq QSC system
[scaling plot; element counts shown on the axis]

3D Converging Shock Simulations in a Wedge
1024-processor ASCI Frost run of a converging shock. The interface is nominally a 2D ellipse perturbed with a prescribed spectrum and randomized phases.
– The 2D elliptical interface is computed using local shock polar analysis to yield a perfectly circular transmitted shock
Resolution: 2000x400x400, with over 1 TB of data generated
[Figure panels: Density, Pressure]
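The "over 1 TB of data generated" figure is consistent with the quoted resolution: a 2000x400x400 grid has 3.2e8 cells, so a handful of double-precision fields per cell already gives tens of gigabytes per snapshot. A small sketch of that arithmetic; the field count and snapshot count are assumptions for illustration, not values from the slide:

```c
#include <stdio.h>

int main(void) {
    /* Grid resolution quoted on the slide */
    long long nx = 2000, ny = 400, nz = 400;
    long long cells = nx * ny * nz;            /* 3.2e8 cells */

    /* ASSUMPTIONS for illustration only */
    int fields_per_cell   = 5;                 /* e.g. density, pressure, 3 velocity components */
    int bytes_per_value   = 8;                 /* double precision */
    int snapshots_written = 100;

    double bytes_per_snapshot = (double)cells * fields_per_cell * bytes_per_value;
    double total_tb = bytes_per_snapshot * snapshots_written / 1e12;

    printf("%.1f GB per snapshot, %.2f TB for %d snapshots\n",
           bytes_per_snapshot / 1e9, total_tb, snapshots_written);  /* ~12.8 GB, ~1.3 TB */
    return 0;
}
```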

Density Field in a 3D Wedge
Density field in the wedge. The transmitted shock front appears to be stable while the gas interface is Richtmyer-Meshkov unstable. The simulation ran on 1024 processors of LLNL's IBM SP3, frost, with a 2000x400x400 initial grid.

Wedge3D Performance on LLNL's IBM SP3, Frost
Aggregate parallel performance for a 1400x280x280 grid
LLNL's Frost
– Floating point operations: 5.8 to 10% of peak, depending on node
– hpmcount tool used to count hardware events during program execution
Most time-consuming communication calls: MPI_Wait() and MPI_Allreduce()
– Accounting for 3 to 30% of runtime on a 128-way run (175x70x70 grid per processor)
– Occasional high MPI time on a few nodes seems to be caused by system daemons competing for resources
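One lightweight way to obtain per-rank figures like the 3 to 30% MPI share above, independently of hpmcount, is to bracket the dominant calls with MPI_Wtime and reduce the totals at the end of the run. A minimal, hypothetical sketch (the helper names are illustrative, not part of the Wedge3D code):

```c
#include <mpi.h>
#include <stdio.h>

static double mpi_time;   /* accumulated seconds spent in bracketed MPI calls */

/* Bracket a call site with MPI_Wtime; here, the per-step MPI_Allreduce. */
static void timed_allreduce(double *local, double *global, MPI_Comm comm) {
    double t0 = MPI_Wtime();
    MPI_Allreduce(local, global, 1, MPI_DOUBLE, MPI_MIN, comm);
    mpi_time += MPI_Wtime() - t0;
}

/* At the end of the run, report the spread of MPI time across ranks;
 * a wide min/max gap points at load imbalance or interfering daemons. */
static void report_mpi_share(double wall_clock_sec, MPI_Comm comm) {
    double tmin, tmax;
    int rank;
    MPI_Reduce(&mpi_time, &tmin, 1, MPI_DOUBLE, MPI_MIN, 0, comm);
    MPI_Reduce(&mpi_time, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, comm);
    MPI_Comm_rank(comm, &rank);
    if (rank == 0)
        printf("MPI share: %.1f%% (min) to %.1f%% (max) of %.1f s wall clock\n",
               100.0 * tmin / wall_clock_sec,
               100.0 * tmax / wall_clock_sec, wall_clock_sec);
}
```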

Wedge3D Scaling on LLNL's IBM SP3, Frost
[scaling plot; grid sizes XxYxZ shown on the axis]

Fragmentation 2D Scaling on LANL's HP/Compaq, QSC
[scaling plot; levels of subdivision spanning 450K to 1.1M elements; series: 85K -> 1.1M elements, 61K -> 915K elements, 450K elements]

Crack Patterns in the Configuration Occurring During Scalability Studies on QSC

Fragmentation 2D Performance on LANL's HP/Compaq, QSC
16-processor run with 2 levels of subdivision (60K elements); dcpiprof tool used to profile the run
Procedures with highest CPU cycle consumption
– element_driver 14.9%
– assemble 13.9%
– NewNeohookean 8.12%
FLOP rates for large runs
– Problems processing the DCPI database for large runs; reported to LANL support
– Small runs yield 3% of FLOP peak
– Only ~10% of time spent in fragmentation routines!
Much room for improvement in our I/O performance when dumping to the parallel file system (/scratch[1,2])