SGI's Platform Strategy: Addressing the Productivity Gap in HPC
Dave Parry
Senior Vice President and General Manager, Server and Platform Group
Silicon Graphics, Inc.
Changing Economics in HPC: 2002 IT Costs Now in People and Software
[Chart: cost per hour across the vector, RISC, and commodity eras; software plus IT & engineering personnel ~$50/hr vs. basic hardware ~$1/hr]
Datasets Are Getting (Much) Bigger, Too
[Chart: worldwide production of information, in exabytes. Source: Gartner Group]
[Chart: satellite systems archive growth. Source: NOAA]
Programming Is Getting Harder (AKA The Folly of "Least Common Denominator Computing")
To copy the value stored in "b" (in CPU C2's memory space) to "a" (in CPU C1's memory space):
– OpenMP™ (shared memory): a = b
– SHMEM or MPI-2 (one-sided): C1 does a "get(b)", i.e. i = shmem_int_get(b) or i = MPI_Get(b)
– MPI (two-sided): C1 does a "send" to ask C2 for "b"; C2 does a "recv" to wait for C1's request; C2 finds that "b" is needed by C1; C2 does a local "get(b)"; C2 does a "send(b)"; C1 does a "recv(b)", i.e. a = MPI_Recv(b)
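The escalating effort across the three models can be sketched with a toy Python simulation. This is not real OpenMP, SHMEM, or MPI; the `CPU` class and the three copy functions are illustrative names, modeling only how many explicit messages each style needs to move "b" into "a":

```python
# Toy model (not real OpenMP/SHMEM/MPI): two "CPUs" copy the value of
# C2's variable "b" into C1's variable "a" under three programming models.
# All class and function names here are illustrative, not library APIs.

class CPU:
    def __init__(self, name, memory):
        self.name = name
        self.memory = dict(memory)   # this CPU's private memory space
        self.messages = 0            # explicit messages this CPU sends

def shared_memory_copy(c1, c2):
    """OpenMP-style: one globally addressable memory, so a = b is a load/store."""
    c1.memory["a"] = c2.memory["b"]  # direct access, zero messages

def one_sided_copy(c1, c2):
    """SHMEM / MPI-2-style one-sided get: C1 reaches into C2's memory alone."""
    c1.messages += 1                 # C1 issues a single "get(b)"
    c1.memory["a"] = c2.memory["b"]  # C2 does not participate

def two_sided_copy(c1, c2):
    """MPI-style two-sided: request and reply, both CPUs must participate."""
    c1.messages += 1                 # C1 "send": ask C2 for "b"
    request = "b"                    # C2 "recv": waits for C1's request
    value = c2.memory[request]       # C2 does a local "get(b)"
    c2.messages += 1                 # C2 "send(b)"
    c1.memory["a"] = value           # C1 "recv(b)"

for copy in (shared_memory_copy, one_sided_copy, two_sided_copy):
    c1, c2 = CPU("C1", {"a": None}), CPU("C2", {"b": 42})
    copy(c1, c2)
    print(copy.__name__, "messages:", c1.messages + c2.messages)
```

The message counts (0, 1, 2) track the slide's point: the same one-line assignment grows into a multi-step protocol as the programming model moves away from shared memory.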
Memory Is Getting "Slower"
[Chart: memory latency in processor cycles across Origin® generations at successive clock rates; the MHz labels were lost in extraction]
Summing up the Productivity Picture
Productivity = cost⁻¹ * value * efficiency * usability
Where:
– cost⁻¹ = MFLOPS per dollar (Moore's Law)
– value = hardware cost / cost of ownership
– efficiency = productive cycles / MFLOPS (constant at 5–10%)
– usability = programming effort per productive cycle
[Chart: MFLOPS per acquisition dollar (Moore's Law) vs. productive science per total dollar (productivity); the widening difference is the productivity gap]
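The identity can be made concrete with a short calculation. Every number below is hypothetical, chosen only to show that when efficiency and the other factors stay fixed, productivity tracks Moore's Law at a constant discount, so the absolute gap between peak MFLOPS/$ and delivered science keeps widening:

```python
# Illustrative sketch of the slide's productivity identity. All inputs are
# hypothetical: value=0.5, efficiency=0.05 (the slide's 5-10% range),
# usability=0.8, and MFLOPS/$ doubling every 18 months (Moore's Law).

def productivity(mflops_per_dollar, value, efficiency, usability):
    """Productivity = cost^-1 * value * efficiency * usability."""
    return mflops_per_dollar * value * efficiency * usability

for year in (0, 3, 6):
    peak = 1.0 * 2 ** (year / 1.5)   # Moore's-law growth in MFLOPS per dollar
    prod = productivity(peak, value=0.5, efficiency=0.05, usability=0.8)
    # The ratio stays constant, so the absolute gap (peak - prod) widens.
    print(f"year {year}: peak={peak:.1f}  productivity={prod:.2f}  gap={peak - prod:.2f}")
```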
Technology Directions to Close the Gap
A Data-Centric View of Each Aspect of HPC
– Data access: one shared view of the data with pervasive access
– Computation: one shared view of the dataset with pervasive access
– Visualization: one shared view of the visual model with pervasive access
We Want to Work Differently
[Diagram: HPC/capability and HPC/capacity systems sharing visual data over a grid infrastructure. Image courtesy of Janssen Pharmaceuticals]
A Different View of System Architecture
Scalable shared memory: globally addressable; thousands of ports; flat & high bandwidth; flexible & configurable
Terascale-to-petascale data set: bring function to data
Attached functions: compute, I/O, graphics
Changing Economics in HPC: 2002 IT Costs Now in People and Software (recap)
Challenges:
– "Impedance match" to HPC applications
– Availability of HPC-class architectures
Use an HPC Processor for HPC Applications
Use an HPC Processor for HPC Applications
[Chart: SPECfp_rate_base2000 advantage for Altix at 2P; the multiplier was lost in extraction. Best Opteron result was run in single-user mode with interleaved memory banks.]
Combine Your HPC Processor with an HPC Architecture
[Roadmap chart, CY2001–CY2004 by half-year (exact cell assignments partly lost in extraction): max SMP system size starts at 2p (SGI 750, CY2001); max kernel image or partition size reaches 512p (Altix with Itanium 2, then Madison, then Madison 9M in CY2004); max NUMAlinked system size grows through 64p, 128p, and 256p to 1024p+]
Combine Your HPC Processor with an HPC Architecture
Linpack HPC (N×N) performance:
– Altix 1.3 GHz is 1.46x faster than IBM eServer p690 1.3 GHz at 128P
– Altix 1.5 GHz is 16% faster than p690 1.7 GHz, in spite of a lower peak flop rate
Source: July 24, 2003, and SGI performance reports
Combine Your HPC Processor with an HPC Architecture
SPECfp_rate_base2000 performance:
– World-record results for 32- and 64-processor systems
– SGI's 1.5 GHz 32P result delivers 2x the performance of IBM eServer p690 1.7 GHz
– SGI's 1.3 GHz 64P result is 1.95x better than Sun Fire 15K 1.2 GHz
New Paradigms (usability)
[Table comparing architecture classes against usability features; checkmark placement was lost in extraction]
Architecture classes:
– Cluster: D.I.Y. PCs connected; cluster tools from IBM (>32P), Compaq (>32P), Sun (>64P), HP (>64P)
– Bus/switch SMP: IBM® 32P, Compaq 32P, Sun™ 64P, HP™ 64P
– SGI® NUMA: SGI® Origin 3000, single system image at 512–1024P
Features compared: single physical memory, single OS, cache coherence, single address space, single administrative view, running OpenMP™ codes, running MPI codes, plus possible new paradigms at the application level (New_1?, New_2?)
Global Shared Memory between Supercluster Nodes
[Diagram: two 64P partitions, each built from C-Bricks, R-Bricks, power bays, and an IX-Brick, and each running its own operating system]
System layer: CPU_SETS, parallel scheduler, Array Services
OpenMP™ apps run within a partition; MPI/SHMEM apps span partitions
STREAM Triad Results, SSI and Supercluster Configs
– The 64P result is a world record for a microprocessor-based system, and fifth overall
– 1.56x better performance than IBM eServer p690 at 32P
* The 128-CPU result uses MPI code on an Altix Supercluster with two 64P nodes; OpenMP code was used for smaller CPU counts
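For context, the Triad kernel behind these results is the simple vector operation a[j] = b[j] + scalar*c[j], timed to measure sustained memory bandwidth. Below is a minimal pure-Python sketch of it; the real STREAM benchmark is C/Fortran with OpenMP or MPI, and the array size, trial count, and function name here are illustrative only:

```python
# Minimal sketch of the STREAM Triad kernel (a[j] = b[j] + scalar * c[j]).
# Pure Python runs far below hardware peak; this only shows the kernel's
# shape and how a bandwidth figure is derived from it.
import time

def stream_triad(n, scalar=3.0, trials=3):
    a = [0.0] * n
    b = [1.0] * n
    c = [2.0] * n
    best = float("inf")
    for _ in range(trials):
        t0 = time.perf_counter()
        for j in range(n):                  # the Triad kernel itself
            a[j] = b[j] + scalar * c[j]
        best = min(best, time.perf_counter() - t0)
    # Triad touches three arrays of 8-byte doubles per pass: 24 * n bytes.
    return a, (24 * n) / best               # result array, bytes per second

a, rate = stream_triad(100_000)
print(f"a[0] = {a[0]}, ~{rate / 1e6:.0f} MB/s sustained (pure Python)")
```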
A Path to Architectural Convergence
[Diagram: application-specific compute for defense and homeland security, media, manufacturing, science, and energy, converging from Origin and Altix toward a multi-paradigm architecture]
A Different View of System Architecture
Scalable shared memory: globally addressable; thousands of ports; flat & high bandwidth; flexible & configurable
Terascale-to-petascale data set: bring function to data
Attached functions: compute, reconfigurable, I/O, graphics
Multi-Paradigm Computing: UltraViolet
Scalable shared memory: globally addressable; thousands of ports; flat & high bandwidth; flexible & configurable
Terascale-to-petascale data set: bring function to data
Attached functions: scalar, vector, streaming, reconfigurable, I/O, graphics
A Data-Centric View of Each Aspect of HPC
– Data access: one shared view of the data with pervasive access
– Computation: one shared view of the dataset with pervasive access
– Visualization: one shared view of the visual model with pervasive access
Innovation Workflow Means Data Must Be Shared
Adapting to the way people work: imagine, design, compute, post-process, visualize, decide
– From the original concept to the final result, data is at the core of the workflow
– Information is shared between groups, and data is moved between hosts
– Data sets grow at each step
– Processes improve when data copies are avoided, shortening time to insight
SGI® in Data Management: Integrated HW/SW Solutions
– Storage: RAID, JBOD, hub, switch, HBA, tape; DAS, NAS, SAN; TP900, TP9100, TP9500, HDS 9960, Ciprico 7000 and TALON™, Brocade, STK, ADIC, SGI firmware
– Scalability: bandwidth to over 12 GByte/s, capacity up to 18 M TB
– Data sharing: CXFS™, Samba/CIFS, BDS, NFS, FTP, ...
– High availability: FailSafe™, failover
– Backup and archive/HSM: Legato, Data Migration Facility / Tape Management Facility
– File serving and volume management: SGI® File Server, XFS™ / XVM, snapshot
– SAN management: SGI® SAN Server 1000, SAN topology and cluster management, topology monitoring
SAN with CXFS: High-Performance Data Sharing with Unlimited Scale
A unique high-performance solution:
– Each host shares one or more volumes consolidated in one or more RAID arrays
– Centralized storage management
– High modularity
– True high-performance data sharing, with near-local file system performance
– Fully resilient (HA)
– Fully POSIX compliant
– As easy to share files as with NFS, but faster
True heterogeneity: Windows NT® & 2000, SGI® IRIX, Sun™ Solaris, Linux 64 for Altix, IBM AIX, Linux 32; more under development
CXFS Usage: Wide-Area & Grid Data Sharing
– SAN across distances of up to 8,000 km
– Faster than WAN FTP or NFS
– Single namespace: easy to administer, no data copies
Data Lifecycle Management: Storage Hierarchy & TCO Model with DMF
DMF migrates data between tiers based on the age, size, and type of each file:
– Primary storage (TP9400): online, high-performance disk
– Nearline disk: high capacity, low cost, lower performance
– Tape libraries (STK L700 with 9840 drives): high-performance archive
Example policies: demote files unused more than 7 days but less than 365 to nearline; demote files between 1 and 2 years old to tape; archive files older than 2 years; promote files used in the last 24 hours or last 7 days
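The age-based part of such a policy can be sketched in a few lines. This is a toy model, not DMF's actual rule syntax; the thresholds come from the slide, while the function name and tier labels are illustrative:

```python
# Toy sketch of an age-based HSM tiering policy like the one on the slide
# (not real DMF configuration). Thresholds mirror the slide's demote/archive
# rules; "dmf_tier" and the tier names are illustrative only.

def dmf_tier(age_days):
    """Pick a storage tier for a file based on days since it was last used."""
    if age_days > 2 * 365:
        return "archive"     # archive: older than 2 years
    if age_days > 365:
        return "tape"        # demote: older than 1 year, younger than 2
    if age_days > 7:
        return "nearline"    # demote: older than 7 days, younger than 365
    return "primary"         # recently used files stay on fast online disk

for age in (1, 30, 400, 900):
    print(f"{age:4d} days old -> {dmf_tier(age)}")
```

A real policy would also weigh file size and type, as the slide notes, and promote files back up the hierarchy on access.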
SGI® High-Performance Data Management Leadership
Top performance and virtually unlimited scalability:
–Broke the 3 GByte/sec SAN barrier (2000)
–First 2Gb SAN fabric (2001)
–Delivered the first 12 GB/sec (15 GB/sec peak) SAN (2002)
–Wide-area data sharing (2002)
–Broke the backup record: 10 TByte in an hour (2003)
Summing up the Productivity Picture
Productivity = cost⁻¹ * value * efficiency * usability
[Chart: Moore's Law vs. productivity over time, with the widening productivity gap between them]
Productivity in Weather and Climate HPC: SGI Altix
– Brings serious supercomputing capability to Linux
– Robust multi-OS shared filesystem with unmatched scale
– Ports of many key development and administration tools
– Ease of use from the largest node size in the industry
– Environmental codes being ported, optimized, and scaled
POP Performance: 1-Degree Global Problem
[Chart: forecast years per wallclock day vs. processor count for Altix 1.3 GHz, Altix 1.5 GHz (scaled), and ES40]
MM5 Performance: T3a Case
[Chart: Altix 1.5 GHz vs. IBM p690 (clock rate lost in extraction), Xeon 2.2 GHz/Myrinet, and Athlon 1.4 GHz/Dolphin SCI; all runs use MPI]
Other applications
© 2003 Silicon Graphics, Inc. All rights reserved. Silicon Graphics, SGI, Origin, OpenGL, XFS, InfiniteReality, IRIX, and the SGI logo are registered trademarks and OpenMP, NUMAflex, CXFS, InfinitePerformance, and the Silicon Graphics logo are trademarks of Silicon Graphics, Inc., in the United States and/or other countries worldwide. R10000 is a registered trademark of MIPS Technologies, Inc. Pentium and Itanium are registered trademarks of Intel Corporation. Windows is a registered trademark or trademark of Microsoft Corporation in the United States and/or other countries worldwide. Linux is a registered trademark of Linus Torvalds. All other trademarks mentioned herein are the property of their respective owners. (06/03)