Published by Oswin Grant. Modified over 9 years ago.
1
Better answers
Compaq HPTC Solutions
Bruce Foster, Ph.D., MBA
bruce.foster@compaq.com
2
Top100 SuperComputer Architectures (June 1999)
3
The Barriers to Performance Scaling
[Figure: CPU cycle time (nsec) versus number of CPUs (10 to 100K) for SC (PVP), SMP, Cluster, Farm, and MPP systems. The chart marks physical limits, complexity limits, and a 100 to 200 GFLOP limit.]
4
Clusters of SMPs Are Breaking Through to the TeraFlop Level
[Figure: CPU cycle time (nsec) versus number of CPUs (10 to 100K) for SC (PVP), SMP, Cluster, Farm, and MPP systems. The chart marks physical limits and complexity limits.]
Fastest microprocessors with best interconnects for SMP clusters yield maximum application performance (TeraFLOP level).
5
High Performance Computing Systems
HPTC Solutions
  AlphaServer Systems
  Interconnects
  Software
  Services
6
Compaq is in it for the long haul!
Alpha roadmap is committed to performance leadership for 10 years and beyond.
Tandem will use Alpha in their next generation systems.
Tandem owns 36 of the top 38 stock markets worldwide.
Over 50% of Compaq's revenue is from Enterprise Systems.
7
Wide Presence in HPTC market
Intel/ServerNet clusters at NCSA
Alpha Linux/ServerNet at Caltech
Alpha Tru64 UNIX/Fast Ethernet at Swinburne
Alpha Linux/Myrinet "C-Plant" at Sandia (#44 on Top500 list)
HPTi win at FSL (Alpha Linux/Myrinet), 4 TFlop system
Compaq Visual Fortran for W95/NT
Compaq Compilers for Alpha/Linux
Several very large SC systems (#34 on Top500 list)
Celera: 300 x 4 CPU ES40s (1.2 TFlop)
ASCI PathForward and ASCI Turquoise
8
1999 Small and Medium AlphaServers
Compaq DS10
Compaq DS20 System
  2 CPUs, small PC tower
  5.13 GB/s peak, 1.3 GB/s single-CPU McCalpin memory bandwidth
Compaq ES40 System
  4 CPUs, bigger cabinet
  EV67 systems: 2.5 GB/s 4-CPU McCalpin bandwidth
  Double the I/O bandwidth and more slots
9
Next Generation DS/ES AlphaServers: Designed to Protect Your Investment
[Roadmap chart spanning First Generation, Second Generation, and Third Generation systems, with rows for Processor, Architecture, Memory, and Storage. Labels recovered from the chart, in the order they were extracted: 125 MHz data bus, Ultra2, 64-bit RAID, EV67 600+ MHz, EV68 800+ MHz, 8 MB L2 cache, 32 GB, Ultra3 SCSI, DVD, Alpha 21264 500 MHz, 4 MB L2 cache, 16 GB of memory, 83 MHz data bus, 2 64-bit PCI busses, 33 MHz PCI, Ultra2, 4 PCI busses, 66 MHz PCI, AGP.]
Note: Feature set varies between AlphaServer DS and ES products based on customer needs.
10
SC'99
16x4 ES40 => 64 CPUs
Quadrics Interconnect
1.7 TB Storage
11
LINPACK NxN Rmax (GFlops)
13
Cluster and Parallel File System
Cluster File System
  File system mounted on any node is visible to all nodes without race conditions
  Each node is both a CFS server and CFS client
  Coherency is maintained by exchanging tokens
  Semantics are POSIX and X/OPEN compliant
  Performance depends on access type and pattern
Parallel File System
  Aggregates CFS files into a single parallel file
  Enables striping a single logical file across multiple underlying local files
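The striping idea can be sketched as an offset calculation. This is only an illustration: the slide does not say how the Parallel File System lays out stripes, so the round-robin placement, the function name `locate`, and the parameters `stripe_size` and `num_files` are all assumptions made for the example.

```python
from typing import NamedTuple


class StripeLocation(NamedTuple):
    file_index: int    # which underlying file holds the byte (hypothetical numbering)
    local_offset: int  # byte offset inside that underlying file


def locate(logical_offset: int, stripe_size: int, num_files: int) -> StripeLocation:
    """Map a byte offset in the logical file to an underlying file and offset.

    Assumes simple round-robin striping: stripe 0 goes to file 0,
    stripe 1 to file 1, and so on, wrapping around after num_files stripes.
    """
    if logical_offset < 0 or stripe_size <= 0 or num_files <= 0:
        raise ValueError("offset must be >= 0; stripe_size and num_files must be > 0")
    stripe_number = logical_offset // stripe_size   # which stripe of the logical file
    within_stripe = logical_offset % stripe_size    # position inside that stripe
    file_index = stripe_number % num_files          # round-robin choice of file
    local_stripe = stripe_number // num_files       # stripes already stored in that file
    return StripeLocation(file_index, local_stripe * stripe_size + within_stripe)
```

With a 4-byte stripe over 3 files, logical byte 13 falls in stripe 3, which wraps back to the first underlying file. Consecutive stripes land on different files, which is what lets a large sequential read be served by several files at once.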
14
Compilers & Tools
Compaq F90, C, C++, Java, …
Shared memory
  Parallelization within SMP node by OpenMP
  3rd party decomposition tools (KAI)
Cray T3D/E-compatible Shmem library
MPI (MPI 2, MPI-I/O, thread-safe)
Debugger: TotalView (Etnus, Inc.)
Performance analysis: Vampir (PALLAS GmbH)
Load balancing: LSF (Platform Computing)
15
Our Capability Machine is Here
A 16-CPU AlphaServer at SC'99
  16-way GS160 AlphaServer
  16 * 1.46 GF/CPU = 23.4 GFLOPS
  High sustainable memory bandwidth
32-way:
  32 CPUs: 46.8 GFLOPS
  Very high sustainable memory bandwidth
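The peak figures on this slide are straightforward arithmetic, sketched below. The 731 MHz clock comes from the 1.46 GFlops per EV67 CPU quoted elsewhere in this deck. The value of 2 floating-point operations per cycle is inferred from those two numbers rather than stated on the slide, and the function name `peak_gflops` is made up for the example.

```python
def peak_gflops(num_cpus: int, clock_mhz: float = 731.0, flops_per_cycle: int = 2) -> float:
    """Theoretical peak in GFLOPS: CPUs x clock x floating-point operations per cycle.

    flops_per_cycle=2 is an inference from 1.46 GFlops at 731 MHz,
    not a figure given on the slide.
    """
    return num_cpus * clock_mhz * flops_per_cycle / 1000.0


for cpus in (1, 16, 32):
    print(f"{cpus:2d} CPUs: {peak_gflops(cpus):.2f} GFLOPS peak")
```

These are peak numbers only. Sustained application performance depends on memory bandwidth and interconnect behaviour, which is why the slide stresses sustainable memory bandwidth alongside the GFLOPS totals.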
16
Alpha Microprocessor Summary
EV6 (21264)
  0.35 µm, 466 - 500 MHz
  4-wide superscalar
  Out-of-order execution
EV67 (21264a)
  0.25 µm, 667 - 730 MHz
  8 MB L2 cache
EV68 (21264b)
  0.18 µm, 800 - 1042 MHz
EV7 (21364)
  0.18 µm, ~1200 MHz
  L2 cache on-chip
  RAMBUS
  Glueless MP
EV8 (21464)
  0.13 µm, ~1500 MHz
  8-wide superscalar
  SMT
...
Future Alpha Microprocessors planned through to 2025!
17
EV67/667 MHz Preliminary HPTC Applications Results
30 to 45% improvement over ES40 EV6/500 MHz
Competitive leadership
  1.15 to over 2 times HP N4000 (better than an 8 CPU N4000)
  Over 2 times SGI Origin 2000 (better than an 8 CPU Origin 2000)
  Over 2 times Sun UE3000
  2 to 4 times Intel Xeon III
18
New High-end AlphaServer Architecture
A new way of looking at Servers
[Diagram: a Global Switch connecting eight building blocks, each labeled with a CPU (EV6/EV67), Mem, I/O, and Switch.]
Each Quad Building Block
  4 EV67 CPUs (731 MHz, 1.46 GFlops)
  4 Memory Arrays (total of 16 GB, 32-way)
  6.4 GB/s Local Switch
  28 PCI slots
Quads aggregate via a Global Switch (8 ports)
  Combines up to 8 quads
  High Bandwidth, Low Latency
Preserves SMP programming model
Up to 8 System Partitions
  Hardware firewalls provide software fault isolation between partitions
  Can be dynamically reconfigured
  Support multiple instances and versions of the same O/S, or completely different O/Ss (Tru64 UNIX, OpenVMS, and soon Linux)
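The quad figures above can be added up to get whole-system capacity, as in the minimal sketch below. The per-quad values are the ones on the slide; the `Quad` class and `system_totals` function are hypothetical names for the example, and linear scaling with quad count is assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quad:
    """One quad building block, using the per-quad figures from the slide."""
    cpus: int = 4
    memory_gb: int = 16
    pci_slots: int = 28


def system_totals(num_quads: int, quad: Quad = Quad(), max_quads: int = 8) -> dict:
    """Aggregate capacity of a system built from identical quads.

    max_quads=8 reflects the 8-port Global Switch. Linear scaling is an
    assumption of this sketch.
    """
    if not 1 <= num_quads <= max_quads:
        raise ValueError(f"num_quads must be between 1 and {max_quads}")
    return {
        "cpus": num_quads * quad.cpus,
        "memory_gb": num_quads * quad.memory_gb,
        "pci_slots": num_quads * quad.pci_slots,
    }


print(system_totals(8))
```

For a full 8-quad system this gives 32 CPUs, 128 GB of memory, and 224 PCI slots, which agrees with the GS Series figures quoted on the "Complete Suite of HPTC Systems" slide.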
19
Overview of CY2000
CPUs/SMP
  DS10 (1 CPU), DS20 (2 CPUs), ES40 (4 CPUs), and GS80 (8), GS160 (16) and GS320 (32)
Systems up to 4096 CPUs
  128-way
Microprocessor speed
  Around 1 GHz at end-2000
20
Systems Area Network: FAST Message Passing
Quadrics
  Backbone of our AlphaServer SC systems.
  High Bandwidth, Low Latency, High Node/CPU Count
  It's a PCI card; this allows systems of both small and big servers.
ServerNet
  Engineered for low per-node SAN cost.
  Brings Tandem Non-Stop technology to Alpha Linux Beowulfs
Myrinet
  Ties together hundreds of Alphas on Sandia's C-Plant.
Ethernet/Fast Ethernet
  Low cost interconnect for medium size systems (Alpha at Swinburne, Sydney Uni (Gordon Bell winner), CSIRO multiple divisions)
21
Customer Comments: Alpha and Red Hat
Comments from "The Center for the Neural Basis of Cognition":
  It runs about six times faster on that {DS20} machine than on a Pentium II 400.
Comments from West Coast University math department:
  System            Cache   Compiler   Time
  PII-450           512k    g77 -O3    75:02
  Celeron 450A      128K    g77 -O3    74:44
  Alpha 21164-600   4 MB    g77 -O3    29:27
  Alpha 21264-500   4 MB    g77 -O3    17:16
  Alpha 21264-500   4 MB    fort -O3    8:42
  I'm impressed (both with the AlphaServer 21264 and Compaq Fortran). It's a 5 mesh fluid flow used for modeling blood flows.
Comments from Canadian University:
  With your Fortran compiler the DS20 is about 3.5x the speed of an SGI Origin 200 with a 180 MHz R10K CPU, pretty impressive.
Slide callouts: 9 times! 6 times! 3.5 times!
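The speedups implied by the West Coast University timings can be checked with a few lines of arithmetic. The sketch assumes the times are in minutes:seconds, which the slide does not state, although the ratios are the same if they are hours:minutes. The helper names `to_seconds` and `speedup` are made up for the example.

```python
def to_seconds(mmss: str) -> int:
    """Convert a 'minutes:seconds' string such as '75:02' to seconds.

    Assumes the slide's timings are minutes:seconds; the unit is not stated.
    """
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def speedup(baseline: str, candidate: str) -> float:
    """How many times faster the candidate run is than the baseline run."""
    return to_seconds(baseline) / to_seconds(candidate)


# Timings copied from the slide.
timings = {
    "PII-450, g77 -O3": "75:02",
    "Celeron 450A, g77 -O3": "74:44",
    "Alpha 21164-600, g77 -O3": "29:27",
    "Alpha 21264-500, g77 -O3": "17:16",
    "Alpha 21264-500, fort -O3": "8:42",
}

baseline = timings["PII-450, g77 -O3"]
for label, elapsed in timings.items():
    print(f"{label:28s} {speedup(baseline, elapsed):4.1f}x vs PII-450")
```

The last row works out to roughly 8.6 times the PII-450, which appears to be the source of the slide's rounded "9 times" callout. On the same Alpha 21264-500, switching from g77 to Compaq Fortran roughly halves the run time.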
22
Complete Suite of HPTC Systems
DS Series
  1-2 processors
  Up to 4 GB of memory
  6 PCI slots
ES Series
  1-4 processors
  Up to 16 GB of memory
  Up to 10 PCI slots
GS Series
  1-32 processors
  Up to 128+ GB of memory
  Up to 224 PCI slots
SC Series
  64-512 processors
  Up to 2 TB memory
  Up to 1.2K I/O slots
Switch-based system, 64-bit PCI I/O subsystems, Very Large Memory
Scalable clusters on DIGITAL UNIX, OpenVMS and Linux
Modular system packaging, advanced systems management
Other labels on the slide: Apr, Feb, May, Coming Soon, EV67 667 MHz, Announcing
23
Thank You!
Please visit our HPTC Web Site or send eMail to Steve Tolnai or myself
http://www.compaq.com/hpc
eMail: tolnai@compaq.com, bruce.foster@compaq.com