Further expansion of HPE's largest warm-water-cooled Apollo 8000 system: "Prometheus" at Cyfronet
Patryk Lasoń, Marek Magryś
ACC Cyfronet AGH-UST
- established in 1973
- part of AGH University of Science and Technology in Krakow, Poland
- provides free computing resources for scientific institutions
- centre of competence in HPC and Grid Computing
- IT service management expertise (ITIL, ISO 20k)
- member of the PIONIER consortium, operator of the Krakow MAN
- home for supercomputers
PL-Grid infrastructure
- Polish national IT infrastructure supporting e-Science
- based upon the resources of the most powerful academic resource centres
- compatible and interoperable with the European Grid
- offering grid and cloud computing paradigms
- coordinated by Cyfronet
Benefits for users:
- unified infrastructure from 5 separate compute centres
- unified access to software, compute and storage resources
- non-trivial quality of service
Challenges:
- unified monitoring, accounting, security
- creating an environment of cooperation rather than competition
Federation is the key to success.
PLGrid Core project
- Competence Centre in the Field of Distributed Computing Grid Infrastructures
- Duration: 01.01.2014 – 31.11.2015
- Project Coordinator: Academic Computer Centre CYFRONET AGH
- Main objective: to support the development of ACC Cyfronet AGH as a specialized competence centre in the field of distributed computing infrastructures, with particular emphasis on grid technologies, cloud computing and infrastructures supporting computations on big data.
ZEUS 374 TFLOPS #269 on Top500
Zeus usage
New building 5 MW, UPS + diesel
Prometheus: 2.4 PFLOPS, #49 on Top500
Prometheus – Phase 1
- installed in Q2 2015
- HP Apollo 8000
- 13 m², 15 racks (3 CDU, 12 compute)
- 1.65 PFLOPS
- 1728 nodes, Intel Haswell E5-2680v3
- 41472 cores, 13824 per island
- 216 TB DDR4 RAM
- N+1/N+N redundancy
Prometheus – Phase 2
- installed in Q4 2015: 4th island
  - 432 regular nodes (2 CPUs, 128 GB RAM)
  - 72 nodes with GPGPUs (2x NVIDIA Tesla K40 XL)
- 2.4 PFLOPS total peak performance (Rpeak): 2140 TFLOPS in CPUs, 256 TFLOPS in GPUs (a quick cross-check of the CPU figures is sketched below)
- 2232 nodes, 53568 CPU cores, 279 TB RAM
- <850 kW power draw (including cooling)
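As a rough cross-check of the quoted CPU Rpeak values, a minimal sketch, assuming the E5-2680v3's nominal 2.5 GHz clock and 16 double-precision FLOPs per cycle per core (two AVX2 FMA units); these assumptions are not stated on the slides:

```python
# Back-of-the-envelope Rpeak check for the Prometheus CPU partition.
# Assumed (not from the slides): 2.5 GHz nominal clock for the Xeon E5-2680v3
# and 16 double-precision FLOPs/cycle/core (two AVX2 FMA units).
CLOCK_HZ = 2.5e9
FLOPS_PER_CYCLE = 16

def cpu_rpeak_tflops(cores: int) -> float:
    """Theoretical CPU peak in TFLOPS for a given core count."""
    return cores * CLOCK_HZ * FLOPS_PER_CYCLE / 1e12

print(cpu_rpeak_tflops(41472))   # Phase 1: ~1659 TFLOPS, i.e. the quoted 1.65 PFLOPS
print(cpu_rpeak_tflops(53568))   # Phase 2: ~2143 TFLOPS, close to the quoted 2140 TFLOPS
```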
Prometheus storage
- diskless compute nodes
- separate procurement for storage
- Lustre on top of DDN hardware
- two filesystems:
  - Scratch: 120 GB/s, 5 PB usable space
  - Archive: 60 GB/s, 5 PB usable space
- HSM-ready
- NFS for home directories and software
Prometheus: IB fabric
[Fabric diagram: core IB switches connecting three compute islands of 576 CPU nodes each, a fourth compute island with 432 CPU nodes and 72 GPU nodes, and a service island with service and storage nodes]
Why liquid cooling?
- water: up to 1000x more efficient heat exchange than air
- less energy needed to move the coolant (a rough water-vs-air flow comparison is sketched below)
- hardware (CPUs, DIMMs) can handle ~80°C
- challenge: cool 100% of the hardware with liquid, including network switches and PSUs
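To illustrate the coolant-transport side of the argument, a minimal sketch comparing the volumetric flow of water and air needed to carry the same heat load, using Q = ρ·c_p·V̇·ΔT; the 800 kW load and 10 K temperature rise are illustrative assumptions, not figures from the slides:

```python
# Illustrative comparison (assumed numbers): coolant volume flow needed to
# absorb a heat load Q at a temperature rise dT, from Q = rho * cp * Vdot * dT.
def volume_flow_m3s(q_watts: float, rho_kg_m3: float, cp_j_kgk: float, dt_k: float) -> float:
    """Volumetric flow [m^3/s] required to absorb q_watts at a dT temperature rise."""
    return q_watts / (rho_kg_m3 * cp_j_kgk * dt_k)

Q, DT = 800e3, 10.0                                   # assumed 800 kW load, 10 K rise
water = volume_flow_m3s(Q, 998.0, 4186.0, DT)         # ~0.019 m^3/s (~19 L/s)
air   = volume_flow_m3s(Q, 1.2, 1005.0, DT)           # ~66 m^3/s
print(f"water: {water:.3f} m^3/s, air: {air:.1f} m^3/s, ratio ~{air / water:.0f}x")
```

The ratio of several thousand reflects water's far higher volumetric heat capacity, which is why far less pumping effort is needed per kilowatt removed.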
What about MTBF?
- the less movement the better
- examples of moving parts: pumps, fans, HDDs
- pump MTBF: 50 000 h
- fan MTBF: 50 000 h
- 2300-node system MTBF: ~5 h (see the series-system calculation sketched below)
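The ~5 h figure follows from a simple series-system model: with independent, exponentially distributed failures, the system failure rate is the sum of the component rates. A minimal sketch, assuming (hypothetically) one pump and three fans per node; the per-node part count is an assumption, not from the slides:

```python
# Series-system MTBF with independent, exponentially distributed failures:
# lambda_system = sum(lambda_i), so MTBF_system = 1 / sum(count_i / MTBF_i).
def system_mtbf_hours(part_counts: dict, part_mtbf_hours: dict) -> float:
    """MTBF of a system in which any single part failure counts as a system failure."""
    failure_rate = sum(n / part_mtbf_hours[name] for name, n in part_counts.items())
    return 1.0 / failure_rate

NODES = 2300
# Assumed (hypothetical) per-node moving parts: 1 pump and 3 fans, each 50 000 h MTBF.
counts = {"pump": 1 * NODES, "fan": 3 * NODES}
mtbf = {"pump": 50_000.0, "fan": 50_000.0}
print(system_mtbf_hours(counts, mtbf))   # ~5.4 h, matching the ~5 h quoted on the slide
```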
Why Apollo 8000?
- most energy efficient
- the only solution with 100% warm-water cooling
- highest density
- lowest TCO
Even more Apollo
- focuses also on the '1' in PUE
- dry node maintenance
- power distribution
- fewer fans
- detailed 'energy to solution' monitoring
- fewer cables
- prefabricated piping
- simplified management
Secondary loop
System software
- CentOS 7
- boot to RAM over IB, image distribution over HTTP
- the whole machine boots up in 10 minutes with just 1 boot server
- hostname/IP generator based on a MAC collector (an illustrative sketch follows below)
- data automatically collected from APM and iLO
- graphical monitoring of power, temperature and network traffic
- SNMP data source, GUI allows easy problem location
- now synced with SLURM
- spectacular iLO LED blinking system developed for the official launch
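As an illustration of MAC-based hostname/IP generation, a minimal sketch; the hostname pattern ("p0001"...) and subnet are hypothetical and not the scheme actually used on Prometheus:

```python
# Hypothetical sketch of a MAC-collector-driven hostname/IP generator.
# The hostname pattern and subnet below are illustrative assumptions only.
import ipaddress

def assign_nodes(macs: list, subnet: str = "10.1.0.0/16", prefix: str = "p") -> dict:
    """Map collected MAC addresses to stable hostnames and IPs, in collection order."""
    hosts = ipaddress.ip_network(subnet).hosts()
    table = {}
    for idx, (mac, ip) in enumerate(zip(macs, hosts), start=1):
        table[mac.lower()] = (f"{prefix}{idx:04d}", str(ip))
    return table

collected = ["3C:A8:2A:00:00:01", "3C:A8:2A:00:00:02"]   # e.g. output of the MAC collector
for mac, (host, ip) in assign_nodes(collected).items():
    print(mac, host, ip)      # could then feed DHCP/DNS and the boot server configuration
```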
HPL: power usage
HPL: water temperature
Real application performance
- Prometheus vs. Zeus: 4x difference core-to-core in theory
- storage system (scratch) 10x faster
- more time to focus on the most popular codes:
  - COSMOS++ – 4.4x
  - Quantum Espresso – 5.6x
  - ADF – 6x
  - widely used QC code with a name derived from a famous mathematician – 2x
Future plans
- continue to move users from the previous system
- add a few large-memory nodes
- further improvements of the monitoring tools
- detailed energy and temperature monitoring
- energy-aware scheduling
- collect the annual energy use and PUE (see the sketch below)
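For the annual PUE figure, a minimal sketch of the calculation, assuming metered annual energy totals for the whole facility and for the IT equipment alone; the numbers are illustrative, not measured Prometheus data:

```python
# PUE = total facility energy / IT equipment energy, over the same metering period.
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power Usage Effectiveness for a given period, e.g. one year."""
    return total_facility_kwh / it_equipment_kwh

# Illustrative numbers only (not measured Prometheus data):
print(round(pue(total_facility_kwh=5_600_000, it_equipment_kwh=5_200_000), 3))  # ~1.077
```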