
1 A 1.7 Petaflops Warm-Water-Cooled System: Operational Experiences and Scientific Results
Łukasz Flis, Karol Krawentek, Marek Magryś

2

3 ACC Cyfronet AGH-UST
– established in 1973
– part of AGH University of Science and Technology in Krakow, Poland
– provides free computing resources for scientific institutions
– centre of competence in HPC and Grid Computing
– IT service management expertise (ITIL, ISO 20k)
– member of the PIONIER consortium
– operator of the Krakow MAN
– home for supercomputers

4 International projects

5 PL-Grid infrastructure
Polish national IT infrastructure supporting e-Science:
– based upon the resources of the most powerful academic resource centres
– compatible and interoperable with the European Grid
– offering grid and cloud computing paradigms
– coordinated by Cyfronet
Benefits for users:
– unified infrastructure from 5 separate compute centres
– unified access to software, compute and storage resources
– non-trivial quality of service
Challenges:
– unified monitoring, accounting and security
– creating an environment of cooperation rather than competition
Federation – the key to success

6 PLGrid Core project
Competence Centre in the Field of Distributed Computing Grid Infrastructures
Duration: 01.01.2014 – 30.11.2015
Project Coordinator: Academic Computer Centre CYFRONET AGH
The main objective of the project is to support the development of ACC Cyfronet AGH as a specialized competence centre in the field of distributed computing infrastructures, with particular emphasis on grid technologies, cloud computing and infrastructures supporting computations on big data.

7

8

9

10 Zeus usage

11 Why upgrade?
– Job size growth
– Users hate waiting for resources
– New projects, new requirements
– Follow the advances in HPC
– Power costs

12 New building

13 Requirements for the new system
– Petascale system
– Low TCO
– Energy efficiency
– Density
– Expandability
– Good MTBF
– Hardware: core count, memory size, network topology, storage

14 Requirements: Liquid Cooling
– Water: up to 1000x more efficient heat exchange than air
– Less energy needed to move the coolant
– Hardware (CPUs, DIMMs) can handle ~80°C
– Challenge: cool 100% of the hardware with liquid, including network switches and PSUs

15 Requirements: MTBF
The less movement the better:
– fewer pumps
– fewer fans
– fewer HDDs
Example (see the sketch below):
– pump MTBF: 50 000 hrs
– fan MTBF: 50 000 hrs
– resulting MTBF of an 1800-node system: ~7 hrs
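As a rough illustration of the example above: in a series system the failure rates of all components add up, so the aggregate MTBF collapses once thousands of moving parts are involved. The sketch below assumes, hypothetically, four moving parts per node (the slide does not give the exact count), each with the quoted 50 000 h MTBF, under a simple exponential-failure model.

```python
# Rough series-system MTBF estimate under an exponential-failure model:
# the system fails whenever any single component fails, so failure rates add.
# The component count per node is an illustrative assumption, not a slide figure.

def system_mtbf(component_mtbf_hrs: float, n_components: int) -> float:
    """MTBF of a series system of identical, independent components."""
    failure_rate = n_components / component_mtbf_hrs  # failures per hour
    return 1.0 / failure_rate

nodes = 1800
moving_parts_per_node = 4        # e.g. fans + pump per node (assumed)
component_mtbf = 50_000.0        # hours, as quoted on the slide

print(system_mtbf(component_mtbf, nodes * moving_parts_per_node))
# ~6.9 hours, close to the ~7 hrs quoted on the slide
```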

16 Requirements: Compute
– Max job size ~10k cores
– Fastest CPUs, but compatible with old codes: two-socket nodes, no accelerators at this point
– Newest memory: at least 4 GB/core
– Fast interconnect: InfiniBand FDR, no need for a full CBB fat tree

17 Requirements: Topology
[Diagram: a service island (service nodes, storage nodes) and four compute islands of 576 nodes each, connected through core IB switches]

18

19

20 Why Apollo 8000?
– Most energy efficient
– The only solution with 100% warm-water cooling
– Highest density
– Lowest TCO

21 Even more Apollo
Focuses also on the ‘1’ in PUE (see the definition below):
– power distribution
– fewer fans
– detailed ‘energy to solution’ monitoring
Dry node maintenance
Fewer cables
Prefabricated piping
Simplified management
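For reference, PUE is the ratio of total facility power to the power delivered to the IT equipment, so the ‘1’ is the IT share itself and everything above it is cooling and power-distribution overhead:

\[
\mathrm{PUE} = \frac{P_{\text{facility}}}{P_{\text{IT}}} \;\ge\; 1
\]

A PUE below 1.05 therefore means less than 5% of the drawn power is spent outside the compute hardware.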

22 Prometheus
– HP Apollo 8000
– 13 m², 15 racks (3 CDU, 12 compute)
– 1.65 PFLOPS
– PUE <1.05, 680 kW peak power
– 1728 nodes, Intel Haswell E5-2680v3
– 41472 cores, 13824 per island
– 216 TB DDR4 RAM
– System prepared for expansion
– CentOS 7
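A quick back-of-the-envelope check of the headline figures; this is only a sketch, and the per-node assumptions (2 x 12-core E5-2680v3 sockets, 128 GB RAM per regular node, taken from the CPU model and the expansion slide) are not stated on this slide itself.

```python
# Sanity check of the Prometheus headline numbers (illustrative).
nodes = 1728
cores_per_node = 2 * 12              # two E5-2680v3 sockets, 12 cores each (assumed)
ram_per_node_gib = 128               # per regular node, as on the expansion slide

print(nodes * cores_per_node)                # 41472 cores, as on the slide
print(nodes * ram_per_node_gib / 1024)       # 216 TiB RAM, matching the quoted 216 TB
print(ram_per_node_gib / cores_per_node)     # ~5.3 GB/core, above the 4 GB/core requirement

# With PUE < 1.05 at 680 kW peak, cooling/distribution overhead stays below:
print(680 * (1.05 - 1.0))                    # < 34 kW
```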

23 Prometheus storage
– Diskless compute nodes
– Separate tender for storage, Lustre-based, with 2 file systems:
  – Scratch: 120 GB/s, 5 PB usable space
  – Archive: 60 GB/s, 5 PB usable space
– HSM-ready
– NFS for home directories and software
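To put the scratch figures in perspective, a minimal arithmetic sketch of what the aggregate bandwidth implies (decimal units assumed; only the capacity and bandwidth from the slide are used):

```python
# How long would it take to stream the full scratch capacity at peak
# aggregate bandwidth? (1 PB = 10**15 B, 1 GB = 10**9 B.)
scratch_capacity_pb = 5
scratch_bandwidth_gbps = 120       # GB/s aggregate, as on the slide

seconds = scratch_capacity_pb * 10**15 / (scratch_bandwidth_gbps * 10**9)
print(seconds / 3600)              # ~11.6 hours to fill 5 PB at 120 GB/s
```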

24 Deployment timeline
– Day 0 - Contract signed (20.10.2014)
– Day 23 - Installation of the primary loop starts
– Day 35 - First delivery (service island)
– Day 56 - Apollo piping arrives
– Day 98 - 1st and 2nd islands delivered
– Day 101 - 3rd island delivered
– Day 111 - Basic acceptance ends
Official launch event on 27.04.2015

25 Facility preparation
– Primary loop installation took 5 weeks; the secondary (prefabricated) loop took just 1 week
– Upgrade of the raised floor done "just in case"
– Additional pipes for leakage/condensation drain
– Water dam with emergency drain
– A lot of space needed for the hardware deliveries (over 100 pallets)

26 Secondary loop

27 Challenges
– Power infrastructure being built in parallel
– Boot over InfiniBand: UEFI, high-frequency port flapping, OpenSM overloaded with port events
– BIOS settings being lost occasionally
– Node location in APM is tricky
– 5 dead IB cables (2‰)
– 8 broken nodes (4‰)
– 24h work during weekends

28 Solutions
– Boot to RAM over IB, with image distribution over HTTP: the whole machine boots up in 10 min with just 1 boot server
– Hostname/IP generator based on a MAC collector: data automatically collected from APM and iLO (see the sketch below)
– Graphical monitoring of power, temperature and network traffic: SNMP data source, GUI allows easy problem location, now synced with SLURM
– Spectacular iLO LED blinking system developed for the official launch
– 24h work during weekends
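The slide only states that hostnames and IPs are generated automatically from MAC data collected via APM and iLO. The sketch below is a minimal, hypothetical version of such a generator; the input CSV layout, the "pXXXX" naming scheme, the management subnet and the dnsmasq output format are all assumptions, not the site's actual tooling.

```python
# Minimal sketch of a hostname/IP generator driven by collected MAC addresses.
# Input format, naming scheme and subnet are hypothetical placeholders.
import csv
import ipaddress

MGMT_NET = ipaddress.ip_network("10.1.0.0/16")   # assumed management subnet

def generate_host_records(mac_csv_path: str):
    """Yield (hostname, ip, mac) tuples, ordered by physical location."""
    hosts = list(MGMT_NET.hosts())
    with open(mac_csv_path, newline="") as f:
        # Expected columns (assumed): rack, chassis, slot, mac
        rows = sorted(csv.DictReader(f),
                      key=lambda r: (int(r["rack"]), int(r["chassis"]), int(r["slot"])))
    for index, row in enumerate(rows, start=1):
        hostname = f"p{index:04d}"               # e.g. p0001, p0002, ...
        yield hostname, str(hosts[index]), row["mac"].lower()

if __name__ == "__main__":
    for hostname, ip, mac in generate_host_records("collected_macs.csv"):
        # One static DHCP entry per node, e.g. in dnsmasq syntax
        print(f"dhcp-host={mac},{hostname},{ip}")
```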

29

30 System expansion
– Prometheus expansion already ordered: a 4th island
  – 432 regular nodes (2 CPUs, 128 GB RAM)
  – 72 nodes with GPGPUs (2x Nvidia Tesla K40XL)
– Installation to begin in September
– 2.4 PFLOPS total performance (Rpeak)
– 2232 nodes, 53568 CPU cores, 279 TB RAM

31 Future plans
– Push the system to its limits
– Further improvements of the monitoring tools
– Continue to move users from the previous system
– Detailed energy and temperature monitoring
– Energy-aware scheduling
– Survive the summer and measure performance
– Collect the annual energy and PUE (see the sketch below)
– HP-CAST 25 presentation?
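Annual PUE is normally computed from accumulated energy rather than instantaneous power. The sketch below shows the bookkeeping only; the sampling interval, the metering source and the example readings are assumptions, since the slide just says the annual energy and PUE will be collected.

```python
# Illustrative annual-PUE bookkeeping: integrate facility and IT power samples
# into energy and take the ratio. Interval, source and readings are assumed.
def annual_pue(samples, interval_hours=0.25):
    """samples: iterable of (facility_kw, it_kw) readings taken every interval_hours."""
    facility_kwh = it_kwh = 0.0
    for facility_kw, it_kw in samples:
        facility_kwh += facility_kw * interval_hours
        it_kwh += it_kw * interval_hours
    return facility_kwh / it_kwh, it_kwh

# Example with made-up readings: 620 kW IT load, ~4% facility overhead
readings = [(645.0, 620.0)] * 4 * 24 * 365      # one year of 15-minute samples
pue, it_energy_kwh = annual_pue(readings)
print(round(pue, 3), round(it_energy_kwh))      # ~1.04, ~5.43 million kWh
```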

32

