A 1.7 Petaflops Warm-Water-Cooled System: Operational Experiences and Scientific Results
Łukasz Flis, Karol Krawentek, Marek Magryś
ACC Cyfronet AGH-UST
– established in 1973, part of AGH University of Science and Technology in Krakow, Poland
– provides free computing resources for scientific institutions
– centre of competence in HPC and Grid Computing
– IT service management expertise (ITIL, ISO 20k)
– member of PIONIER consortium, operator of the Krakow MAN
– home for supercomputers
International projects
PL-Grid infrastructure
Polish national IT infrastructure supporting e-Science
– based upon resources of the most powerful academic resource centres
– compatible and interoperable with the European Grid
– offering grid and cloud computing paradigms
– coordinated by Cyfronet
Benefits for users
– unified infrastructure from 5 separate compute centres
– unified access to software, compute and storage resources
– non-trivial quality of service
Challenges
– unified monitoring, accounting, security
– creating an environment of cooperation rather than competition
Federation – the key to success
PLGrid Core project
Competence Centre in the Field of Distributed Computing Grid Infrastructures
– Duration:
– Project Coordinator: Academic Computer Centre CYFRONET AGH
The main objective of the project is to support the development of ACC Cyfronet AGH as a specialized competence centre in the field of distributed computing infrastructures, with particular emphasis on grid technologies, cloud computing and infrastructures supporting computations on big data.
Zeus usage
Why upgrade?
– Job size growth
– Users hate waiting for resources
– New projects, new requirements
– Follow the advances in HPC
– Power costs
New building
Requirements for the new system
– Petascale system
– Low TCO
– Energy efficiency
– Density
– Expandability
– Good MTBF
– Hardware:
  – core count
  – memory size
  – network topology
  – storage
Requirements: Liquid Cooling
– Water: up to 1000x more efficient heat exchange than air
– Less energy needed to move the coolant (see the flow estimate below)
– Hardware (CPUs, DIMMs) can handle ~80°C
– Challenge: cool 100% of HW with liquid
  – network switches
  – PSUs
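As a rough illustration of why water moves heat so cheaply, the sketch below estimates the coolant flow needed to carry away a full-system heat load. The 680 kW load and the 10 K supply/return temperature rise are assumed round numbers, not figures from this slide.

```python
# Rough estimate of the water flow needed to remove a given heat load.
# Assumed numbers (not from the slide): 680 kW heat load, 10 K rise
# between supply and return water.
HEAT_LOAD_W = 680e3          # total heat to remove [W]
DELTA_T_K = 10.0             # coolant temperature rise [K]
CP_WATER = 4186.0            # specific heat of water [J/(kg*K)]
RHO_WATER = 1000.0           # density of water [kg/m^3]

mass_flow = HEAT_LOAD_W / (CP_WATER * DELTA_T_K)      # kg/s
volume_flow_m3h = mass_flow / RHO_WATER * 3600.0      # m^3/h

print(f"mass flow:   {mass_flow:.1f} kg/s")        # ~16 kg/s
print(f"volume flow: {volume_flow_m3h:.1f} m^3/h")  # ~58 m^3/h
```

A few tens of cubic metres per hour is a modest pumping job compared with the fan power needed to push an equivalent amount of heat out with air.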
Requirements: MTBF
– The less movement the better
  – fewer pumps
  – fewer fans
  – fewer HDDs
– Example (system MTBF combined below)
  – pump MTBF: hrs
  – fan MTBF: hrs
  – 1800 node system MTBF: 7 hrs
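The slide's point is that per-component MTBFs collapse quickly at scale. A minimal sketch of the usual series-system calculation follows; the per-component MTBF values and counts are placeholders, since the slide's actual figures were not preserved in this copy.

```python
# System MTBF from component MTBFs, assuming independent failures and
# that any single component failure counts as a system failure:
#   failure_rate_system = sum(count_i / MTBF_i)
#   MTBF_system = 1 / failure_rate_system
# Counts and per-unit MTBFs below are placeholders, not the slide's data.
components = {
    # name: (count in an 1800-node system, MTBF of one unit in hours)
    "pump": (1800, 50_000.0),
    "fan":  (1800 * 2, 50_000.0),
}

failure_rate = sum(count / mtbf for count, mtbf in components.values())
print(f"system MTBF: {1.0 / failure_rate:.1f} hours")
```

Even with generously rated parts, thousands of moving components drive the whole-system MTBF down to a handful of hours, which is why the design removes pumps, fans and disks from the nodes.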
Requirements: Compute
– Max jobsize ~10k cores
– Fastest CPUs, but compatible with old codes
  – Two-socket nodes
  – No accelerators at this point
– Newest memory
  – At least 4 GB/core
– Fast interconnect
  – InfiniBand FDR
  – No need for full CBB fat tree
Requirements: Topology
[Diagram: core IB switches connecting a service island (service nodes, storage nodes) and four compute islands of 576 nodes each]
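A quick check of why the island topology is enough, assuming two 12-core sockets per node (the Haswell E5-2680 v3 configuration described later in the talk): one 576-node island already exceeds the ~10k-core maximum job size, so full bisection bandwidth between islands is not required.

```python
# Back-of-the-envelope check that the ~10k-core maximum job size fits
# inside a single compute island, assuming 2 x 12-core sockets per node.
NODES_PER_ISLAND = 576
CORES_PER_NODE = 2 * 12
MAX_JOB_CORES = 10_000

cores_per_island = NODES_PER_ISLAND * CORES_PER_NODE
print(f"cores per island: {cores_per_island}")                        # 13824
print(f"max job fits in one island: {MAX_JOB_CORES <= cores_per_island}")
```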
Why Apollo 8000?
– Most energy efficient
– The only solution with 100% warm-water cooling
– Highest density
– Lowest TCO
Even more Apollo
– Focuses also on the '1' in PUE!
  – Power distribution
  – Fewer fans
– Detailed monitoring of 'energy to solution'
– Dry node maintenance
– Fewer cables
– Prefabricated piping
– Simplified management
Prometheus
– HP Apollo 8000, m², 15 racks (3 CDU, 12 compute)
– 1.65 PFLOPS
– PUE < 1.05, 680 kW peak power
– 1728 nodes (576 per island), Intel Haswell E5-2680 v3, 41,472 cores
– 216 TB DDR4 RAM
– System prepared for expansion
– CentOS 7
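The PUE figure can be turned into a bound on total facility power. The sketch below only combines the two numbers given on this slide (PUE below 1.05 at 680 kW peak IT power); how the overhead splits between cooling and power distribution is not stated and not assumed here.

```python
# PUE = total facility power / IT power.  Using the slide's figures to
# bound the cooling and distribution overhead at peak load.
IT_POWER_KW = 680.0
PUE_BOUND = 1.05                 # upper bound from the slide

facility_power_kw = IT_POWER_KW * PUE_BOUND
overhead_kw = facility_power_kw - IT_POWER_KW
print(f"facility power <= {facility_power_kw:.0f} kW")                 # ~714 kW
print(f"cooling + distribution overhead <= {overhead_kw:.0f} kW")      # ~34 kW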
Prometheus storage
– Diskless compute nodes
– Separate tender for storage
  – Lustre-based
  – 2 file systems:
    – Scratch: 120 GB/s, 5 PB usable space
    – Archive: 60 GB/s, 5 PB usable space
  – HSM-ready
– NFS for home directories and software
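As a generic Lustre illustration (not something shown on the slide): a single large file only approaches the scratch file system's aggregate bandwidth if it is striped across many OSTs. The sketch below wraps the standard `lfs` client tool; the directory path is hypothetical.

```python
# Generic Lustre striping example; the path is hypothetical and the
# layout policy is an illustration, not the centre's actual default.
import subprocess

scratch_dir = "/net/scratch/my_job_output"   # hypothetical path

# Stripe new files created in this directory across all available OSTs
# ("-c -1" means "use every OST").
subprocess.run(["lfs", "setstripe", "-c", "-1", scratch_dir], check=True)

# Show the resulting layout.
subprocess.run(["lfs", "getstripe", scratch_dir], check=True)
```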
Deployment timeline
– Day 0 – Contract signed ( )
– Day 23 – Installation of the primary loop starts
– Day 35 – First delivery (service island)
– Day 56 – Apollo piping arrives
– Day – 1st and 2nd islands delivered
– Day – 3rd island delivered
– Day – basic acceptance ends
– Official launch event on
Facility preparation
– Primary loop installation took 5 weeks
– Secondary (prefabricated) loop took just 1 week
– Upgrade of the raised floor done "just in case"
– Additional pipes for leakage/condensation drain
– Water dam with emergency drain
– A lot of space needed for the hardware deliveries (over 100 pallets)
Secondary loop
Challenges
– Power infrastructure being built in parallel
– Boot over InfiniBand
  – UEFI, high-frequency port flapping
  – OpenSM overloaded with port events
– BIOS settings being lost occasionally
– Node location in APM is tricky
– 5 dead IB cables (2‰)
– 8 broken nodes (4‰)
– 24h work during weekend
Solutions
– Boot to RAM over IB, image distribution with HTTP
  – Whole machine boots up in 10 min with just 1 boot server
– Hostname/IP generator based on a MAC collector (see the sketch below)
  – Data automatically collected from APM and iLO
– Graphical monitoring of power, temperature and network traffic
  – SNMP data source
  – GUI allows easy problem location
  – Now synced with SLURM
– Spectacular iLO LED blinking system developed for the official launch
– 24h work during weekend
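The slide only says that hostnames and IPs are generated from MACs collected via APM and iLO. The sketch below shows one way such a generator could look; the naming scheme, boot subnet and the example MAC-to-location table are invented for illustration and are not the centre's actual tool.

```python
# Sketch of a MAC-based hostname/IP generator.  The (rack, chassis,
# slot) table, hostname scheme and subnet are invented; the real tool
# collects this data automatically from APM and iLO.
import ipaddress

# (rack, chassis, slot) -> node MAC, as a MAC collector might report it
collected = {
    (1, 1, 1): "94:57:a5:00:00:01",
    (1, 1, 2): "94:57:a5:00:00:02",
    (1, 2, 1): "94:57:a5:00:00:03",
}

BOOT_NET = ipaddress.ip_network("10.1.0.0/16")   # assumed boot subnet
hosts = list(BOOT_NET.hosts())

def generate(collected):
    """Yield (hostname, ip, mac) triples in a deterministic order."""
    for idx, ((rack, chassis, slot), mac) in enumerate(sorted(collected.items())):
        hostname = f"p{rack:02d}c{chassis:02d}n{slot:02d}"   # invented scheme
        yield hostname, str(hosts[idx]), mac

for hostname, ip, mac in generate(collected):
    # e.g. feed these triples into the boot server's DHCP/DNS config
    print(f"{hostname}  {ip}  {mac}")
```

Generating the mapping from collected data rather than maintaining it by hand is what makes a 10-minute whole-machine boot with a single boot server practical.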
System expansion
– Prometheus expansion already ordered
– 4th island
  – 432 regular nodes (2 CPUs, 128 GB RAM)
  – 72 nodes with GPGPUs (2x Nvidia Tesla K40XL)
– Installation to begin in September
– 2.4 PFLOPS total performance (Rpeak)
– 2232 nodes, 53,568 CPU cores, 279 TB RAM
Future plans
– Push the system to its limits
– Further improvements of the monitoring tools
– Continue to move users from the previous system
– Detailed energy and temperature monitoring
– Energy-aware scheduling
– Survive the summer and measure performance
– Collect the annual energy and PUE (see the sketch below)
– HP-CAST 25 presentation?
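Collecting annual energy and PUE amounts to integrating periodic power readings on both the IT and facility side. A minimal sketch follows; the sample readings and the one-hour interval are invented, the real numbers would come from the facility meters and the SNMP monitoring already in place.

```python
# Aggregate annual energy and average PUE from periodic power readings.
# The sample readings and interval below are invented placeholders.
readings = [
    # (IT power in kW, facility power in kW), one sample per interval
    (610.0, 638.0),
    (655.0, 684.0),
    (630.0, 660.0),
]
INTERVAL_H = 1.0   # hours between samples

it_energy_kwh = sum(it for it, _ in readings) * INTERVAL_H
facility_energy_kwh = sum(fac for _, fac in readings) * INTERVAL_H

print(f"IT energy:       {it_energy_kwh:.0f} kWh")
print(f"facility energy: {facility_energy_kwh:.0f} kWh")
print(f"average PUE:     {facility_energy_kwh / it_energy_kwh:.3f}")
```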