Slide 1: DOE Evaluation of the Cray X1
Mark R. Fahey
James B. White III (Trey)
Center for Computational Sciences
Oak Ridge National Laboratory

Slide 2: Acknowledgement
Research sponsored by the Mathematical, Information, and Computational Sciences Division, Office of Advanced Scientific Computing Research, U.S. Department of Energy, under Contract No. DE-AC05-00OR22725 with UT-Battelle, LLC.

Slide 3: "Cray X1 Evaluation"
- A. S. Bland
- J. J. Dongarra
- J. B. Drake
- T. H. Dunigan, Jr.
- T. H. Dunning, Jr.
- A. Geist
- B. Gorda
- W. D. Gropp
- R. J. Harrison
- R. Kendall
- D. Keyes
- J. A. Nichols
- L. Oliker
- H. Simon
- R. Stevens
- J. B. White, III
- P. H. Worley
- T. Zacharia

Slide 4: Outline
- CCS X1
- Evaluation overview
- Applications
  - Climate
  - Fusion
  - Materials
  - Biology

Slide 5: Phase 1 – March 2003
- 32 processors
  - 8 nodes, each with 4 processors
- 128 GB shared memory
- 8 TB of disk space
- 400 GigaFLOP/s

Slide 6: Phase 2 – September 2003
- 256 processors
  - 64 nodes
- 1 TB shared memory
- 20 TB of disk space
- 3.2 TeraFLOP/s

Slide 7: X1 evaluation
- Compare performance with other systems
  - Applications Performance Matrix
- Determine most-effective usage
- Evaluate system-software reliability and performance
- Predict scalability
- Collaborate with Cray on the next generation

Slide 8: Hierarchical approach
- System software
- Microbenchmarks
- Parallel-paradigm evaluation
- Full applications
- Scalability evaluation

Slide 9: System-software evaluation
- Job-management systems
- Mean time between failures
- Mean time to repair
- All problems, with Cray responses
- Scalability and fault tolerance of the OS
- Filesystem performance and scalability
- Tuning for HPSS, NFS, and wide-area high-bandwidth networks
- See Buddy Bland's talk, "Early Operations Experience with the Cray X1", Thursday at 11:00

Slide 10: Microbenchmarking
- Results of benchmarks
  - See Pat Worley's talk today at 4:45
- Performance metrics of components
  - Vector & scalar arithmetic
  - Memory hierarchy (see the sketch below)
  - Message passing
  - Process & thread management
  - I/O primitives
- Models of component performance
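To make the "memory hierarchy" item above concrete, here is a minimal STREAM-style triad kernel of the kind such microbenchmarks use. It is an illustrative sketch, not one of the evaluation's actual benchmark codes; the array size, repeat count, and timing choices are arbitrary assumptions.

```fortran
program triad_bench
  ! STREAM-style "triad" kernel, a(i) = b(i) + s*c(i): a rough probe of
  ! sustained memory bandwidth.  Sizes and repeat count are arbitrary.
  implicit none
  integer, parameter :: n = 10000000, nrep = 10
  real(kind=8), allocatable :: a(:), b(:), c(:)
  real(kind=8) :: s, t0, t1, bytes
  integer :: i, k

  allocate(a(n), b(n), c(n))
  b = 1.0d0
  c = 2.0d0
  s = 3.0d0

  call cpu_time(t0)                      ! serial kernel, so CPU time ~ wall time
  do k = 1, nrep
     do i = 1, n                         ! unit stride: should vectorize fully
        a(i) = b(i) + s*c(i)
     end do
     if (a(1) < 0.0d0) print *, 'never'  ! keep the loop from being optimized away
  end do
  call cpu_time(t1)

  bytes = 3.0d0 * 8.0d0 * real(n, 8) * real(nrep, 8)   ! 2 loads + 1 store per element
  print '(a, f8.2, a)', 'triad bandwidth: ', bytes / ((t1 - t0) * 1.0d9), ' GB/s'
end program triad_bench
```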

Slide 11: Parallel-paradigm evaluation
- MPI-1, MPI-2 one-sided, SHMEM, Global Arrays, Co-Array Fortran, UPC, OpenMP, MLP, … (see the sketch below)
- Identify the best techniques for the X1
- Develop optimization strategies for applications
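As a concrete illustration of two of the paradigms being compared, the sketch below expresses the same nearest-neighbor exchange with two-sided MPI-1 and with a one-sided Co-Array Fortran put. It is not taken from any of the evaluation codes; the buffer names and neighbor indices are assumptions, and in practice a coarray dummy argument needs an explicit interface (e.g. a module procedure). The CAF version maps a remote store directly onto the X1's globally addressable memory, with no matching receive on the target image.

```fortran
! (a) Two-sided MPI-1: explicit matched send/receive.
subroutine exchange_mpi(sendbuf, recvbuf, n, right, left, comm)
  use mpi
  implicit none
  integer, intent(in) :: n, right, left, comm   ! right/left are MPI ranks (0-based)
  real(kind=8), intent(in)  :: sendbuf(n)
  real(kind=8), intent(out) :: recvbuf(n)
  integer :: ierr, status(MPI_STATUS_SIZE)
  call MPI_Sendrecv(sendbuf, n, MPI_DOUBLE_PRECISION, right, 0, &
                    recvbuf, n, MPI_DOUBLE_PRECISION, left,  0, &
                    comm, status, ierr)
end subroutine exchange_mpi

! (b) One-sided Co-Array Fortran: a direct remote store into the neighbor's
! halo buffer; "right" is an image index (1-based).
subroutine exchange_caf(edge, halo, n, right)
  implicit none
  integer, intent(in) :: n, right
  real(kind=8), intent(in)    :: edge(n)
  real(kind=8), intent(inout) :: halo(n)[*]   ! one halo buffer per image
  halo(:)[right] = edge(:)                    ! one-sided "put" to the neighbor
  sync all                                    ! all puts complete before halo is read
end subroutine exchange_caf
```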

Slide 12: Scalability evaluation
- Hot-spot analysis
  - Inter- and intra-node communication
  - Memory contention
  - Parallel I/O
- Trend analysis for selected communication and I/O patterns
- Trend analysis for kernel benchmarks
- Scalability predictions from performance models and bounds

Slide 13: Full applications
- Full applications of interest to the DOE Office of Science
- Scientific goals require multi-terascale resources
- Evaluation of performance, scaling, and efficiency
- Evaluation of the ease and effectiveness of targeted tuning

Slide 14: Identifying applications
- Set priorities
  - Potential performance payoff
  - Potential science payoff
- Schedule the pipeline
  - Porting/development
  - Processor tuning
  - Production runs – science!
  - Small number of applications in each stage
- Potential user
  - Knows the application
  - Willing and able to learn the X1
  - Motivated to tune the application, not just recompile

Slide 15: Workshops
- Prototype workshop at ORNL, Nov. 5-6
- Fusion, Feb. 3-5
  - M3D, NIMROD, GYRO, GTC, AORSA, and TORIC
  - Provides flexibility when impediments are encountered
  - Concurrent work by different teams
- Climate, Feb. 6
  - CAM, CLM, POP
  - Want to optimize for both Cray and NEC
- Materials, March 2
  - Dynamic Cluster Algorithm, FLAPW, LSMS, Socorro
- Biology, May 9
  - AMBER; contribute to BioLib

Slide 16: Fusion: NIMROD
- CCS is teaming with the developer and Cray on the port
  - Cray has actively participated
- Performance inhibitors
  - Uses Fortran reshape quite a bit
  - Uses Fortran sums extensively inside loops that should be vectorizable
  - Compiler cannot vectorize; the arrays are actually pointers
    - Dramatic effect on performance
  - Data structures are derived types whose array components are pointers rather than allocatables
- How much performance could be recouped with arrays? (See the sketch below.)
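The sketch below is not NIMROD source; it illustrates, with hypothetical names, why array components declared as pointers force the compiler to assume aliasing in an update loop, and how either the allocatable attribute (a Fortran 95 TR15581 / Fortran 2003 feature) or a no-dependence directive recovers vectorization.

```fortran
module field_mod
  implicit none

  type :: field_ptr
     ! Pointer components: the compiler must assume c may alias a or b.
     real(kind=8), pointer :: a(:), b(:), c(:)
  end type field_ptr

  type :: field_alloc
     ! Allocatable components: guaranteed distinct, so the loop can vectorize.
     real(kind=8), allocatable :: a(:), b(:), c(:)
  end type field_alloc

contains

  subroutine update_ptr(f, s, n)
    integer, intent(in) :: n
    type(field_ptr), intent(inout) :: f
    real(kind=8), intent(in) :: s
    integer :: i
!dir$ ivdep   ! compiler-specific assertion: without it, possible aliasing blocks vectorization
    do i = 1, n
       f%c(i) = f%a(i) + s * f%b(i)
    end do
  end subroutine update_ptr

  subroutine update_alloc(f, s, n)
    integer, intent(in) :: n
    type(field_alloc), intent(inout) :: f
    real(kind=8), intent(in) :: s
    integer :: i
    do i = 1, n               ! unit stride, no aliasing: vectorizes as written
       f%c(i) = f%a(i) + s * f%b(i)
    end do
  end subroutine update_alloc

end module field_mod
```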

Slide 17: Fusion: GYRO
- Developer and CCS are teaming
- Hand-coded transpose operations using loops over MPI_Alltoall (see the sketch below)
  - Expected to scale to an Earth Simulator-class machine
  - Scaling is ultimately limited by this
- Functional port complete
  - Has identified a couple of bugs in the code
- Several routines easily vectorized by manual loop interchange and directives
- Vectorized sin/cos calls by rewriting small chunks of code
- Hand optimizations have yielded a 5x speedup so far
  - About 35% faster than the IBM POWER4
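For readers unfamiliar with the pattern, the sketch below shows the general shape of a distributed transpose built on MPI_Alltoall; it is not GYRO's routine, and the column-block decomposition, array names, and block packing are assumptions. Because every call redistributes the full array across all tasks, such transposes tend to become the limiting step at scale, consistent with the note above.

```fortran
subroutine transpose_alltoall(a, at, nloc, nproc, comm)
  ! a  : (nproc*nloc, nloc) local slab of A, distributed by columns
  ! at : (nproc*nloc, nloc) local slab of transpose(A) on exit
  use mpi
  implicit none
  integer, intent(in) :: nloc, nproc, comm
  real(kind=8), intent(in)  :: a(nproc*nloc, nloc)
  real(kind=8), intent(out) :: at(nproc*nloc, nloc)
  real(kind=8) :: sendbuf(nloc, nloc, nproc), recvbuf(nloc, nloc, nproc)
  integer :: p, i, j, ierr

  ! Pack: block p holds the local rows destined for rank p-1.
  do p = 1, nproc
     sendbuf(:, :, p) = a((p-1)*nloc+1 : p*nloc, :)
  end do

  ! One collective moves every block to its owner.
  call MPI_Alltoall(sendbuf, nloc*nloc, MPI_DOUBLE_PRECISION, &
                    recvbuf, nloc*nloc, MPI_DOUBLE_PRECISION, comm, ierr)

  ! Unpack with a local transpose of each nloc x nloc block.
  do p = 1, nproc
     do j = 1, nloc
        do i = 1, nloc
           at((p-1)*nloc + j, i) = recvbuf(i, j, p)
        end do
     end do
  end do
end subroutine transpose_alltoall
```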

Slide 18: Fusion: GTC
- The developer ran GTC on the SX-6
- Cray had looked at parts of GTC prior to the workshop
- Result:
  - The developer is working directly with Cray
  - GTC has been ported to the X1
  - Some optimizations introduced
  - Work ongoing

Slide 19: Fusion: AORSA
- Uses ScaLAPACK (a generic sketch of this kind of solve follows below)
  - Cray has a ScaLAPACK implementation, not yet tuned
  - Cray is pursuing ScaLAPACK optimizations
- Ported
  - Performance worse than expected
  - Culprit is the matrix scaling routine
  - Fix implemented, tests underway
- Working with the Cray benchmarking group
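The following is a generic sketch of the kind of distributed dense solve a ScaLAPACK-based code relies on: block-cyclic LU with PZGETRF/PZGETRS. It is not AORSA's driver; the process grid (16 ranks), block size, matrix size, and random fill-in are all illustrative assumptions.

```fortran
program scalapack_solve
  ! Generic ScaLAPACK complex LU solve; run on nprow*npcol (here 16) MPI ranks.
  implicit none
  integer, parameter :: n = 4096, nb = 64, nrhs = 1
  integer :: nprow, npcol, ictxt, myrow, mycol, mloc, nloc, lld, info
  integer :: desca(9), descb(9)
  integer, allocatable :: ipiv(:)
  real(kind=8), allocatable :: re(:,:), im(:,:)
  complex(kind=8), allocatable :: a(:,:), b(:,:)
  integer, external :: numroc

  nprow = 4
  npcol = 4
  call blacs_get(-1, 0, ictxt)                      ! default system context
  call blacs_gridinit(ictxt, 'Row-major', nprow, npcol)
  call blacs_gridinfo(ictxt, nprow, npcol, myrow, mycol)

  mloc = numroc(n, nb, myrow, 0, nprow)             ! local rows of A
  nloc = numroc(n, nb, mycol, 0, npcol)             ! local columns of A
  lld  = max(1, mloc)
  allocate(a(lld, max(1, nloc)), b(lld, 1), ipiv(mloc + nb))
  allocate(re(lld, max(1, nloc)), im(lld, max(1, nloc)))

  call descinit(desca, n, n,    nb, nb, 0, 0, ictxt, lld, info)
  call descinit(descb, n, nrhs, nb, nb, 0, 0, ictxt, lld, info)

  call random_number(re)                            ! placeholder data; the real
  call random_number(im)                            ! code fills A from the physics
  a = cmplx(re, im, kind=8)
  b = (1.0d0, 0.0d0)

  call pzgetrf(n, n, a, 1, 1, desca, ipiv, info)    ! distributed LU factorization
  call pzgetrs('N', n, nrhs, a, 1, 1, desca, ipiv, b, 1, 1, descb, info)

  call blacs_gridexit(ictxt)
  call blacs_exit(0)
end program scalapack_solve
```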

Slide 20: Fusion: M3D
- M3D uses PETSc
  - Parallel data layout is done within this framework
  - Uses the iterative solvers
  - Accounts for 90% of the time
- Need to port PETSc to the X1
  - Estimated at 6 man-months
  - Requires significant changes

Slide 21: Climate: CAM
- People involved: Cray (1), NEC (2), NCAR (1), ORNL (2)
- Porting and profiling ongoing at Cray
- NEC expects single-node optimizations for the SX-6 to be complete by early fall
- Coordination between NEC and Cray?
- Radiation and cloud models are the focus of most work

Slide 22: Climate: CLM
- Land component of the Community Climate System Model
- Data structures are being changed to make the code easier to extend and maintain
  - Fortran user-defined types with pointers (see the layout sketch below)
- Vectorization involvement: NCAR (1), ORNL (2), Cray (1), NEC (1)
  - Coordination to be worked out
- See Trey White's presentation, Wednesday at 8:45
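The sketch below is not CLM's actual data structure; it is a simplified illustration of the layout issue behind such a restructuring. One derived-type object per land point gives the compiler strided access through an array of structures, while a single type holding arrays over all points exposes long unit-stride loops (a directive may still be needed when those components are pointers). All names are hypothetical.

```fortran
module land_layout_sketch
  implicit none

  type :: point_state
     ! One object per land point: loops stride through memory by the size
     ! of the whole type, wasting memory bandwidth on a vector machine.
     real(kind=8) :: temp, moisture, flux
  end type point_state

  type :: column_state
     ! One object for all points: each component is a long array, so
     ! physics kernels become unit-stride vector loops.
     real(kind=8), pointer :: temp(:), moisture(:), flux(:)
  end type column_state

contains

  subroutine update_pointwise(pts, n, dt)
    integer, intent(in) :: n
    type(point_state), intent(inout) :: pts(n)
    real(kind=8), intent(in) :: dt
    integer :: i
    do i = 1, n                     ! strided access through the array of structures
       pts(i)%flux = pts(i)%flux + dt * (pts(i)%temp - pts(i)%moisture)
    end do
  end subroutine update_pointwise

  subroutine update_columns(col, n, dt)
    integer, intent(in) :: n
    type(column_state), intent(inout) :: col
    real(kind=8), intent(in) :: dt
    integer :: i
!dir$ ivdep                         ! pointer components: assert no aliasing
    do i = 1, n                     ! unit stride over all land points
       col%flux(i) = col%flux(i) + dt * (col%temp(i) - col%moisture(i))
    end do
  end subroutine update_columns

end module land_layout_sketch
```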

Slide 23: Climate: POP
- Organization involvement: LANL (1), Cray (1), NCAR (2), CRIEPI (2)
  - Need to coordinate between CRIEPI and Cray
- Significant optimizations already implemented successfully
  - Vectorization and Co-Array Fortran
- See Pat Worley's presentation, "Early Performance Evaluation of the Cray X1", today at 4:45

Slide 24: Materials: Dynamic Cluster Algorithm
- Uses MPI, OpenMP, PBLAS, and BLAS
  - Significant time spent in dger and in cgemm (see the sketch below)
  - On the IBM POWER4, the BLAS 2 calls dominate
- A quick port was performed
  - Optimizations targeted a couple of routines by adding a few directives
  - Took a couple of days
  - Showed a dramatic speedup over the IBM POWER4
- For the small problem that was solved
  - Time spent in computation became nearly negligible
  - Formatted I/O became dominant
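The two BLAS calls named in the profile look roughly like the sketch below; the array names, sizes, and surrounding routine are hypothetical, not taken from the DCA code. The rank-1 update streams through all of G on every call, so it is bound by memory bandwidth rather than floating-point rate, which is consistent with the BLAS 2 calls dominating on the cache-based POWER4 and speeding up sharply on the X1.

```fortran
subroutine dca_kernels(g, x, y, a, b, c, n, alpha)
  implicit none
  integer, intent(in) :: n
  real(kind=8), intent(in) :: alpha
  real(kind=8), intent(inout) :: g(n, n)
  real(kind=8), intent(in) :: x(n), y(n)
  complex, intent(in) :: a(n, n), b(n, n)       ! single-precision complex, as cgemm expects
  complex, intent(inout) :: c(n, n)

  ! BLAS 2 rank-1 update, G := G + alpha * x * y^T (memory-bandwidth bound).
  call dger(n, n, alpha, x, 1, y, 1, g, n)

  ! BLAS 3 complex matrix multiply, C := A*B + C (compute bound).
  call cgemm('N', 'N', n, n, n, (1.0, 0.0), a, n, b, n, (1.0, 0.0), c, n)
end subroutine dca_kernels
```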

Slide 25: Materials: LSMS
- Locally Self-consistent Multiple Scattering
- The code spends most of its time in matrix-matrix multiplications and in computing a partial inverse (see the sketch below)
- Communication involves exchanges of smaller matrices with neighbors
- Expected to vectorize well
- Developers are moving to sparse-matrix formulations to scale to larger problems
  - Working with the Cray benchmarking group
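As an illustration of the kernels listed above, the sketch below combines a complex matrix-matrix multiply with an LU-based block inverse using standard BLAS/LAPACK calls. It is not LSMS source: the routine and variable names are invented, the particular combination (1 - t·g)^-1 is schematic, and the full block is inverted here even though the real code needs only part of the inverse.

```fortran
subroutine tau_block(tmat, gmat, tau, kkr, info)
  implicit none
  integer, intent(in) :: kkr                  ! hypothetical block dimension
  complex(kind=8), intent(in)    :: tmat(kkr, kkr), gmat(kkr, kkr)
  complex(kind=8), intent(inout) :: tau(kkr, kkr)
  integer, intent(out) :: info
  complex(kind=8), parameter :: one = (1.0d0, 0.0d0), zero = (0.0d0, 0.0d0)
  complex(kind=8) :: work(64*kkr)
  integer :: ipiv(kkr), i

  ! Matrix-matrix multiply: tau := 1 - t*g (streams well on a vector machine).
  call zgemm('N', 'N', kkr, kkr, kkr, -one, tmat, kkr, gmat, kkr, zero, tau, kkr)
  do i = 1, kkr
     tau(i, i) = tau(i, i) + one
  end do

  ! LU factorization followed by inversion of the block.
  call zgetrf(kkr, kkr, tau, kkr, ipiv, info)
  if (info /= 0) return
  call zgetri(kkr, tau, kkr, ipiv, work, size(work), info)
end subroutine tau_block
```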

Slide 26: Materials: FLAPW
- Full-Potential Linearized Augmented Plane Wave (FLAPW) method
  - All-electron method
  - Considered the most precise electronic-structure method in solid-state physics
- A validation code, and as such important to a large percentage of the materials community
- Cray has started porting it

Slide 27: Biology
- Workshop held May 9 (a few days ago!)
- Bioinformatics plans to exploit special features of the X1
  - No one or two codes are the most important
  - Can probably use the primitives in BioLib
  - Will collaborate with Cray on adding more primitives to BioLib
- Molecular-dynamics-based biology codes are expected to vectorize
  - Cray is working on an AMBER port, nearly done

Slide 28: Conclusions
- Future: Chemistry and Astrophysics workshops
- Workshops have been very productive
  - A focused set of codes to port is important
  - Identify users/teams
- Early results are promising
- Evaluation continues; much to do
- DOE science projects will be direct beneficiaries
  - Scientists will see firsthand the impact of the X1
  - Could have profound implications for the SciDAC ISICs

Slide 29: References
- System-software evaluation
  - Buddy Bland's talk, Thursday at 11:00
- Results of standard benchmarks
  - Pat Worley's talk, today at 4:45
- Optimization experiment with CLM
  - Trey White's talk, Wednesday at 8:45

Slide 30: Contacts
- Trey White
- Mark Fahey
- Pat Worley
- Buddy Bland