Slide-1 HPCchallenge Benchmarks MITRE ICL/UTK HPCS HPCchallenge Benchmark Suite David Koester, Ph.D. (MITRE) Jack Dongarra (UTK) Piotr Luszczek (ICL/UTK) 28 September 2004

Slide-2 HPCchallenge Benchmarks MITRE ICL/UTK Outline Brief DARPA HPCS Overview Architecture/Application Characterization HPCchallenge Benchmarks Preliminary Results Summary

Slide-3 HPCchallenge Benchmarks MITRE ICL/UTK High Productivity Computing Systems
HPCS program focus area: create a new generation of economically viable computing systems and a procurement methodology for the security/industrial community (2007 – 2010).
Impact:
– Performance (time-to-solution): speed up critical national security applications by a factor of 10X to 40X
– Programmability (idea-to-first-solution): reduce the cost and time of developing application solutions
– Portability (transparency): insulate research and operational application software from the system
– Robustness (reliability): apply all known techniques to protect against outside attacks, hardware faults, and programming errors
Applications: intelligence/surveillance, reconnaissance, cryptanalysis, weapons analysis, airborne contaminant modeling, and biotechnology
Fill the critical technology and capability gap from today (late-1980s HPC technology) to the future (quantum/bio computing).

Slide-4 HPCchallenge Benchmarks MITRE ICL/UTK High Productivity Computing Systems (Program Overview)
Goal: create a new generation of economically viable computing systems and a procurement methodology for the security/industrial community (2007 – 2010).
[Program roadmap figure] Vendor track: Phase 1, Concept Study; Phase 2, Advanced Design & Prototypes; Phase 3, Full Scale Development, leading to petascale/s systems. Productivity team track: new evaluation framework, test evaluation framework, then validated procurement evaluation methodology. Half-way point: Phase 2 Technology Assessment Review.

Slide-5 HPCchallenge Benchmarks MITRE ICL/UTK HPCS Program Goals ‡
HPCS overall productivity goals:
– Execution (sustained performance): 1 Petaflop/sec (scalable to greater than 4 Petaflop/sec); reference: production workflow
– Development: 10X over today's systems; reference: lone researcher and enterprise workflows
Productivity framework:
– Baselined for today's systems
– Successfully used to evaluate the vendors' emerging productivity techniques
– Provides a solid reference for evaluation of the vendors' proposed Phase III designs
Subsystem performance indicators:
1) 2+ PF/s LINPACK
2) 6.5 PB/sec data STREAM bandwidth
3) 3.2 PB/sec bisection bandwidth
4) 64,000 GUPS
‡ Bob Graybill (DARPA/IPTO), emphasis added

Slide-6 HPCchallenge Benchmarks MITRE ICL/UTK Outline Brief DARPA HPCS Overview Architecture/Application Characterization HPCchallenge Benchmarks Preliminary Results Summary

Slide-7 HPCchallenge Benchmarks MITRE ICL/UTK Processor-Memory Performance Gap*
Alpha full cache miss, measured in instructions executed: 180 ns / 1.7 ns = 108 clocks x 4, or 432 instructions
Caches in the Pentium Pro: 64% of the area, 88% of the transistors
*Taken from a Patterson-Keeton talk to SIGMOD

Slide-8 HPCchallenge Benchmarks MITRE ICL/UTK Processing vs. Memory Access
Doesn't cache solve this problem?
– It depends: with small amounts of contiguous data, usually; with large amounts of non-contiguous data, usually not
– In most computers the programmer has no control over the cache
– Often "a few" Bytes/FLOP is considered OK
However, consider operations on the transpose of a matrix (e.g., for adjoint problems):
– X a = b and X^T a = b
– If X is big enough, 100% cache misses are guaranteed, and we need at least 8 Bytes/FLOP (assuming a and b can be held in cache)
Latency and limited bandwidth of processor-memory and node-node communication are major limiters of performance for scientific computation.
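The effect can be seen in a minimal C sketch (illustrative only, not part of the benchmark suite; the matrix size and function names are invented here): with a row-major X, computing X a = b streams through memory with unit stride, while X^T a = b walks columns with a stride of a full row, so once X outgrows the cache nearly every access misses.

#include <stddef.h>

#define N 4096   /* one row of doubles is 32 KB, larger than a typical L1 cache */

/* b = X * a: the inner loop walks a row of X with unit stride, so
 * cache lines and hardware prefetching are used effectively. */
void matvec(const double *X, const double *a, double *b)
{
    for (int i = 0; i < N; i++) {
        double sum = 0.0;
        for (int j = 0; j < N; j++)
            sum += X[(size_t)i * N + j] * a[j];
        b[i] = sum;
    }
}

/* b = X^T * a: the inner loop walks a column of X, i.e., a stride of
 * N doubles, so nearly every access touches a new cache line and
 * misses once X no longer fits in any level of cache. */
void matvec_transposed(const double *X, const double *a, double *b)
{
    for (int i = 0; i < N; i++) {
        double sum = 0.0;
        for (int j = 0; j < N; j++)
            sum += X[(size_t)j * N + i] * a[j];
        b[i] = sum;
    }
}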

Slide-9 HPCchallenge Benchmarks MITRE ICL/UTK Processing vs. Memory Access: High Performance LINPACK
Consider another benchmark, Linpack: A x = b
Solve this linear system for the vector x, where A is a known matrix and b is a known vector.
Linpack uses the BLAS routines, which divide A into blocks.
On average, Linpack requires 1 memory reference for every 2 FLOPs, or 4 Bytes/FLOP, and many of these can be cache references.
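The blocking idea behind those BLAS routines can be sketched in a few lines of C (illustrative only; the block size NB and the function name are invented here, and real BLAS/HPL kernels are considerably more sophisticated). Each NB x NB tile is reused on the order of NB times while it sits in cache, which is how Linpack averages roughly one memory reference per two FLOPs.

#include <stddef.h>

#define NB 64   /* illustrative block size; tuned libraries pick this per CPU */

/* C += A * B on n x n row-major matrices, processed in NB x NB tiles.
 * Each (ii, jj, kk) tile performs about 2*NB^3 FLOPs on about 3*NB^2
 * doubles, so memory traffic per FLOP shrinks as NB grows. */
void dgemm_blocked(int n, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < n; ii += NB)
        for (int jj = 0; jj < n; jj += NB)
            for (int kk = 0; kk < n; kk += NB)
                for (int i = ii; i < ii + NB && i < n; i++)
                    for (int k = kk; k < kk + NB && k < n; k++) {
                        double aik = A[(size_t)i * n + k];
                        for (int j = jj; j < jj + NB && j < n; j++)
                            C[(size_t)i * n + j] += aik * B[(size_t)k * n + j];
                    }
}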

Slide-10 HPCchallenge Benchmarks MITRE ICL/UTK Processing vs. Memory Access: STREAM TRIAD
Consider the simple benchmark STREAM TRIAD: a(i) = b(i) + q * c(i)
– a(i), b(i), and c(i) are vectors; q is a scalar
– The vector length is chosen to be much longer than the cache size
– Each iteration performs 2 memory loads + 1 memory store and 2 FLOPs, i.e., 12 Bytes/FLOP (assuming 64-bit precision)
No computer has enough memory bandwidth to reference 12 Bytes for each FLOP!
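A minimal C sketch of the TRIAD kernel and its bytes-per-FLOP accounting (illustrative only; the array length, timer, and output are invented here, and this is not the official STREAM code):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define VLEN 20000000UL   /* chosen to be much larger than any cache */

int main(void)
{
    double *a = malloc(VLEN * sizeof *a);
    double *b = malloc(VLEN * sizeof *b);
    double *c = malloc(VLEN * sizeof *c);
    double q = 3.0;
    if (!a || !b || !c) return 1;
    for (size_t i = 0; i < VLEN; i++) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < VLEN; i++)    /* 2 loads + 1 store, 2 FLOPs */
        a[i] = b[i] + q * c[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec   = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double bytes = 3.0 * VLEN * sizeof(double);   /* 24 bytes per iteration */
    double flops = 2.0 * VLEN;
    printf("TRIAD: %.2f GB/s, %.2f GFLOP/s, %.1f bytes/FLOP\n",
           bytes / sec / 1e9, flops / sec / 1e9, bytes / flops);
    free(a); free(b); free(c);
    return 0;
}

The measured bandwidth divided by 12 Bytes/FLOP bounds the FLOP rate this kernel can sustain, regardless of the processor's peak.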

Slide-11 HPCchallenge Benchmarks MITRE ICL/UTK Processing vs. Memory Access: RandomAccess
[Data-driven memory access diagram]
– Table T: 2^n 64-bit words, sized to fill roughly 1/2 of memory
– Data stream {a_i}: 2^(n+2) pseudorandom 64-bit values; the table address k is taken from the highest n bits of a_i
– Update: T[k] = T[k] XOR a_i (bit-level exclusive or)
– The commutative and associative nature of XOR allows the updates to be processed in any order
– Acceptable error: 1%; look-ahead and storage: 1024 per "node"
– Expected number of accesses per memory location T[k]: 2^(n+2) / 2^n = 4
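A minimal serial C sketch of the RandomAccess update loop (illustrative only; the table size, update count, and the simple shift/XOR stream generator are stand-ins rather than the official HPCC generator, and the error-tolerance and look-ahead rules are not modeled):

#include <stdint.h>
#include <stdlib.h>

#define LOG2_TABLE 20                        /* n: table of 2^n 64-bit words */
#define TABLE_SIZE (1ULL << LOG2_TABLE)
#define NUPDATE    (4ULL * TABLE_SIZE)       /* 2^(n+2) updates, 4 per entry on average */

int main(void)
{
    uint64_t *T = malloc(TABLE_SIZE * sizeof *T);
    if (!T) return 1;
    for (uint64_t i = 0; i < TABLE_SIZE; i++) T[i] = i;

    uint64_t a = 1;                           /* stand-in pseudorandom stream a_i */
    for (uint64_t i = 0; i < NUPDATE; i++) {
        a = (a << 1) ^ ((int64_t)a < 0 ? 0x7ULL : 0);  /* next a_i */
        uint64_t k = a >> (64 - LOG2_TABLE);           /* highest n bits form the address */
        T[k] ^= a;   /* XOR update: commutative and associative, so order is irrelevant */
    }
    free(T);
    return 0;
}

Because consecutive k values are effectively random over a table far larger than cache, nearly every update is a full memory round trip; GUPS (giga-updates per second) measures how fast a system can complete them.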

Slide-12 HPCchallenge Benchmarks MITRE ICL/UTK Bounding Mission Partner Applications

Slide-13 HPCchallenge Benchmarks MITRE ICL/UTK Outline Brief DARPA HPCS Overview Architecture/Application Characterization HPCchallenge Benchmarks Preliminary Results Summary

Slide-14 HPCchallenge Benchmarks MITRE ICL/UTK HPCS HPCchallenge Benchmarks
HPCchallenge Benchmarks:
– Being developed by Jack Dongarra (ICL/UT)
– Funded by the DARPA High Productivity Computing Systems (HPCS) program (Bob Graybill, DARPA/IPTO)
Purpose: to examine the performance of High Performance Computer (HPC) architectures using kernels with more challenging memory access patterns than High Performance Linpack (HPL)

Slide-15 HPCchallenge Benchmarks MITRE ICL/UTK HPCchallenge Goals
To examine the performance of HPC architectures using kernels with more challenging memory access patterns than HPL
– HPL works well on all architectures ― even cache-based, distributed memory multiprocessors ― due to:
1. Extensive memory reuse
2. Scalability with respect to the amount of computation
3. Scalability with respect to the communication volume
4. Extensive optimization of the software
To complement the Top500 list
To provide benchmarks that bound the performance of many real applications as a function of memory access characteristics ― e.g., spatial and temporal locality

Slide-16 HPCchallenge Benchmarks MITRE ICL/UTK HPCchallenge Benchmarks
Local:
– DGEMM (matrix x matrix multiply)
– STREAM (COPY, SCALE, ADD, TRIAD)
– EP-RandomAccess
– 1D FFT
Global:
– High Performance LINPACK (HPL)
– PTRANS — parallel matrix transpose
– G-RandomAccess
– 1D FFT
– b_eff — interprocessor bandwidth and latency (a minimal ping-pong sketch follows below)
HPCchallenge pushes spatial and temporal boundaries and sets performance bounds. Available for download.
[Spatial vs. temporal locality figure: DGEMM, HPL, PTRANS, STREAM, FFT, EP-RandomAccess, G-RandomAccess]
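For the b_eff component, which directly measures interprocessor bandwidth and latency, here is a minimal MPI ping-pong sketch of that kind of point-to-point measurement (illustrative only; the message size, repetition count, and partner choice are invented here, and the actual b_eff benchmark averages over many communication patterns and message sizes):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int nbytes = 1 << 20;       /* 1 MiB messages to expose bandwidth */
    const int reps   = 100;
    char *buf = malloc(nbytes);

    if (size >= 2 && rank < 2 && buf) {   /* ranks 0 and 1 ping-pong */
        int peer = 1 - rank;
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, nbytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, nbytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else {
                MPI_Recv(buf, nbytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, nbytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
            }
        }
        double t = (MPI_Wtime() - t0) / (2.0 * reps);   /* one-way time per message */
        if (rank == 0)
            printf("one-way time %.2f us, bandwidth %.2f MB/s\n",
                   t * 1e6, nbytes / t / 1e6);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}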

Slide-17 HPCchallenge Benchmarks MITRE ICL/UTK Web Site
[Screenshot of the HPCchallenge web site] Sections: Home, Rules, News, Download, FAQ, Links, Collaborators, Sponsors, Upload, Results

Slide-18 HPCchallenge Benchmarks MITRE ICL/UTK Outline Brief DARPA HPCS Overview Architecture/Application Characterization HPCchallenge Benchmarks Preliminary Results Summary

Slide-19 HPCchallenge Benchmarks MITRE ICL/UTK Preliminary Results: Machine List (1 of 2)

Affiliation     Manufacturer                     System               Processor Type      Procs
U Tenn          Atipa Cluster AMD 128 procs      Conquest cluster     AMD Opteron         128
AHPCRC          Cray X1 124 procs                X1                   Cray X1 MSP         124
AHPCRC          Cray X1 124 procs                X1                   Cray X1 MSP         124
AHPCRC          Cray X1 124 procs                X1                   Cray X1 MSP         124
ERDC            Cray X1 60 procs                 X1                   Cray X1 MSP         60
ERDC            Cray X1 60 procs                 X1                   Cray X1 MSP         60
ORNL            Cray X1 252 procs                X1                   Cray X1 MSP         252
ORNL            Cray X1 252 procs                X1                   Cray X1 MSP         252
AHPCRC          Cray X1 120 procs                X1                   Cray X1 MSP         120
ORNL            Cray X1 64 procs                 X1                   Cray X1 MSP         64
AHPCRC          Cray T3E 1024 procs              T3E                  Alpha               1024
ORNL            HP zx6000 Itanium 2 128 procs    Integrity zx6000     Intel Itanium 2     128
PSC             HP AlphaServer SC45 128 procs    AlphaServer SC45     Alpha 21264B        128
ERDC            HP AlphaServer SC45 484 procs    AlphaServer SC45     Alpha 21264B        484

Slide-20 HPCchallenge Benchmarks MITRE ICL/UTK Preliminary Results: Machine List (2 of 2)

Affiliation     Manufacturer                     System                          Processor Type       Procs
IBM             IBM 655 Power4+ 64 procs         eServer pSeries 655             IBM Power 4+         64
IBM             IBM 655 Power... procs           eServer pSeries 655             IBM Power ...        ...
IBM             IBM 655 Power... procs           eServer pSeries 655             IBM Power ...        ...
NAVO            IBM p690 Power4 504 procs        p690                            IBM Power 4          504
ARL             IBM SP Power3 512 procs          RS/6000 SP                      IBM Power 3          512
ORNL            IBM p690 Power4 256 procs        p690                            IBM Power 4          256
ORNL            IBM p690 Power4 64 procs         p690                            IBM Power 4          64
ARL             Linux Networx Xeon 256 procs     Powell                          Intel Xeon           256
U Manchester    SGI Altix Itanium 2 32 procs     Altix 3700                      Intel Itanium 2      32
ORNL            SGI Altix Itanium 2 128 procs    Altix                           Intel Itanium 2      128
U Tenn          SGI Altix Itanium 2 32 procs     Altix                           Intel Itanium 2      32
U Tenn          SGI Altix Itanium 2 32 procs     Altix                           Intel Itanium 2      32
U Tenn          SGI Altix Itanium 2 32 procs     Altix                           Intel Itanium 2      32
U Tenn          SGI Altix Itanium 2 32 procs     Altix                           Intel Itanium 2      32
NASA ASC        SGI Origin R16K 256 procs        Origin 3900                     SGI MIPS R16K        256
U Aachen/RWTH   SunFire 15K 128 procs            Sun Fire 15k/6800 SMP-Cluster   Sun UltraSparc III   128
OSC             Voltaire Cluster Xeon 128 procs  Pinnacle 2X200 Cluster          Intel Xeon           128

Slide-21 HPCchallenge Benchmarks MITRE ICL/UTK STREAM TRIAD vs HPL, Processors
[Results plot] STREAM TRIAD: a(i) = b(i) + q * c(i); HPL: A x = b

Slide-22 HPCchallenge Benchmarks MITRE ICL/UTK STREAM TRIAD vs HPL, >252 Processors
[Results plot] STREAM TRIAD: a(i) = b(i) + q * c(i); HPL: A x = b

Slide-23 HPCchallenge Benchmarks MITRE ICL/UTK STREAM ADD vs PTRANS, Processors
[Results plot] STREAM ADD: a(i) = b(i) + c(i); PTRANS: A = A + B^T

Slide-24 HPCchallenge Benchmarks MITRE ICL/UTK STREAM ADD vs PTRANS, >252 Processors
[Results plot] STREAM ADD: a(i) = b(i) + c(i); PTRANS: A = A + B^T
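A serial C sketch of the PTRANS operation A = A + B^T shown on the two preceding slides (illustrative only; the real PTRANS distributes both matrices over a two-dimensional process grid, so the dominant cost there is the exchange of blocks between processors, which this local loop does not show):

#include <stddef.h>

/* A = A + B^T on n x n row-major matrices.  A is accessed with unit
 * stride while B is read with stride n, so even the serial version is
 * memory-bandwidth bound; the parallel benchmark additionally moves
 * every block of B across the interconnect. */
void ptrans_local(int n, double *A, const double *B)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            A[(size_t)i * n + j] += B[(size_t)j * n + i];
}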

Slide-25 HPCchallenge Benchmarks MITRE ICL/UTK Outline Brief DARPA HPCS Overview Architecture/Application Characterization HPCchallenge Benchmarks Preliminary Results Summary

Slide-26 HPCchallenge Benchmarks MITRE ICL/UTK Summary
DARPA HPCS subsystem performance indicators:
– 2+ PF/s LINPACK
– 6.5 PB/sec data STREAM bandwidth
– 3.2 PB/sec bisection bandwidth
– 64,000 GUPS
It is important to understand architecture/application characterization:
– Where did all the lost "Moore's Law" performance go?
HPCchallenge Benchmarks:
– Peruse the results!
– Contribute!