A 1.7 Petaflops Warm-Water-Cooled System: Operational Experiences and Scientific Results – Łukasz Flis, Karol Krawentek, Marek Magryś

Presentation transcript:

A 1.7 Petaflops Warm-Water-Cooled System: Operational Experiences and Scientific Results
Łukasz Flis, Karol Krawentek, Marek Magryś

ACC Cyfronet AGH-UST
– established in 1973
– part of AGH University of Science and Technology in Krakow, Poland
– provides free computing resources for scientific institutions
– centre of competence in HPC and Grid Computing
– IT service management expertise (ITIL, ISO 20k)
– member of PIONIER consortium
– operator of Krakow MAN
– home for supercomputers

International projects

PL-Grid infrastructure
Polish national IT infrastructure supporting e-Science
– based upon resources of most powerful academic resource centres
– compatible and interoperable with European Grid
– offering grid and cloud computing paradigms
– coordinated by Cyfronet
Benefits for users
– unified infrastructure from 5 separate compute centres
– unified access to software, compute and storage resources
– non-trivial quality of service
Challenges
– unified monitoring, accounting, security
– create environment of cooperation rather than competition
Federation – the key to success

PLGrid Core project
Competence Centre in the Field of Distributed Computing Grid Infrastructures
– Duration: …
– Project Coordinator: Academic Computer Centre CYFRONET AGH
The main objective of the project is to support the development of ACC Cyfronet AGH as a specialized competence centre in the field of distributed computing infrastructures, with particular emphasis on grid technologies, cloud computing and infrastructures supporting computations on big data.

Zeus usage

Why upgrade?
– Job size growth
– Users hate waiting for resources
– New projects, new requirements
– Follow the advances in HPC
– Power costs

New building

Requirements for the new system
– Petascale system
– Low TCO
– Energy efficiency
– Density
– Expandability
– Good MTBF
– Hardware:
  – core count
  – memory size
  – network topology
  – storage

Requirements: Liquid Cooling
– Water: up to 1000x more efficient heat exchange than air (see the sketch below)
– Less energy needed to move the coolant
– Hardware (CPUs, DIMMs) can handle ~80°C
– Challenge: cool 100% of HW with liquid
  – network switches
  – PSUs
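The "1000x" figure can be sanity-checked from textbook densities and heat capacities; a minimal Python sketch using approximate room-temperature values (illustrative only, not measurements from this installation):

```python
# Back-of-envelope comparison of volumetric heat capacity, using
# approximate textbook values at ~25 °C (illustrative only).
WATER_DENSITY = 997.0   # kg/m^3
WATER_CP = 4186.0       # J/(kg*K)
AIR_DENSITY = 1.18      # kg/m^3
AIR_CP = 1005.0         # J/(kg*K)

water_vol_cp = WATER_DENSITY * WATER_CP  # ~4.2e6 J/(m^3*K)
air_vol_cp = AIR_DENSITY * AIR_CP        # ~1.2e3 J/(m^3*K)

# Per unit volume and per degree, water absorbs roughly 3,500x more
# heat than air, so "up to 1000x" is a conservative slide-level claim.
print(f"water/air volumetric heat capacity ratio: "
      f"{water_vol_cp / air_vol_cp:,.0f}x")
```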

Requirements: MTBF
– The less movement the better:
  – fewer pumps
  – fewer fans
  – fewer HDDs
– Example (see the sketch below):
  – pump MTBF: … hrs
  – fan MTBF: … hrs
  – 1800-node system MTBF: 7 hrs
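The system-level figure follows from adding the failure rates of all moving parts; a minimal sketch of that series model, with placeholder per-component MTBFs (the slide's own numbers did not survive extraction) and an assumed pump/fan count per node:

```python
# Series-system failure model: the cluster "fails" whenever any pump,
# fan or disk fails, so failure rates (1/MTBF) of all components add up.
def system_mtbf(component_mtbfs):
    return 1.0 / sum(1.0 / m for m in component_mtbfs)

NODES = 1800
PUMP_MTBF = 50_000.0   # hours, assumed placeholder
FAN_MTBF = 100_000.0   # hours, assumed placeholder

# Assume one pump and three fans per conventionally cooled node.
per_node = [PUMP_MTBF] + [FAN_MTBF] * 3
cluster = per_node * NODES
print(f"{NODES}-node system MTBF: {system_mtbf(cluster):.1f} h")
# With these placeholders the result is ~11 h, the same order of
# magnitude as the 7 h quoted on the slide.
```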

Requirements: Compute
– Max job size ~10k cores
– Fastest CPUs, but compatible with old codes
  – two-socket nodes
  – no accelerators at this point
– Newest memory
  – at least 4 GB/core
– Fast interconnect
  – InfiniBand FDR
  – no need for a full CBB fat tree

Requirements: Topology
[Diagram: a service island (service nodes, storage nodes) and four compute islands of 576 nodes each, all connected through core IB switches]

Why Apollo 8000?
– Most energy efficient
– The only solution with 100% warm-water cooling
– Highest density
– Lowest TCO

Even more Apollo
– Focuses also on the ‘1’ in PUE!
  – power distribution
  – fewer fans
  – detailed ‘energy to solution’ monitoring
– Dry node maintenance
– Fewer cables
– Prefabricated piping
– Simplified management

Prometheus
– HP Apollo 8000
– 15 racks (3 CDU, 12 compute)
– 1.65 PFLOPS (see the sketch below)
– PUE <1.05, 680 kW peak power
– 1728 nodes, Intel Haswell E5-2680 v3
– 41,472 cores, 13,824 per island
– 216 TB DDR4 RAM
– System prepared for expansion
– CentOS 7
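The quoted Rpeak can be roughly reconstructed from the node count, assuming two 12-core E5-2680 v3 per node at the 2.5 GHz base clock and 16 double-precision FLOP/cycle/core with AVX2 FMA:

```python
# Peak-performance (Rpeak) reconstruction under the assumptions above.
nodes = 1728
cores_per_node = 2 * 12        # two 12-core Xeon E5-2680 v3
clock_hz = 2.5e9               # base clock
flop_per_cycle = 16            # AVX2 FMA, double precision

total_cores = nodes * cores_per_node
rpeak_pflops = total_cores * clock_hz * flop_per_cycle / 1e15
print(f"cores: {total_cores:,}")            # 41,472
print(f"Rpeak: {rpeak_pflops:.2f} PFLOPS")  # ~1.66 PFLOPS
```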

Prometheus storage
– Diskless compute nodes
– Separate tender for storage
  – Lustre-based
  – 2 file systems:
    – Scratch: 120 GB/s, 5 PB usable space
    – Archive: 60 GB/s, 5 PB usable space
  – HSM-ready
– NFS for home directories and software

Deployment timeline
– Day 0 – contract signed
– Day 23 – installation of the primary loop starts
– Day 35 – first delivery (service island)
– Day 56 – Apollo piping arrives
– Day … – 1st and 2nd islands delivered
– Day … – 3rd island delivered
– Day … – basic acceptance ends
– Official launch event on …

Facility preparation
– Primary loop installation took 5 weeks
– Secondary (prefabricated) loop took just 1 week
– Upgrade of the raised floor done „just in case”
– Additional pipes for leakage/condensation drain
– Water dam with emergency drain
– A lot of space needed for the hardware deliveries (over 100 pallets)

Secondary loop

Challenges
– Power infrastructure being built in parallel
– Boot over InfiniBand
  – UEFI, high-frequency port flapping
  – OpenSM overloaded with port events
– BIOS settings being lost occasionally
– Node location in APM is tricky
– 5 dead IB cables (2‰)
– 8 broken nodes (4‰)
– 24h work during the weekend

Solutions
– Boot to RAM over IB, image distribution over HTTP
  – the whole machine boots up in 10 min with just 1 boot server
– Hostname/IP generator based on a MAC collector (see the sketch below)
  – data automatically collected from APM and iLO
– Graphical monitoring of power, temperature and network traffic
  – SNMP data source
  – GUI allows easy problem location
  – now synced with SLURM
– Spectacular iLO LED blinking system developed for the official launch
– 24h work during the weekend
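The MAC-based generator itself is an in-house tool; the sketch below is only a hypothetical illustration of the idea (deterministically mapping MACs harvested from APM/iLO to hostnames and addresses), with invented names and subnets:

```python
# Hypothetical hostname/IP generator: MACs collected from APM/iLO are
# assigned names and addresses in discovery (rack/slot) order.
import ipaddress

def assign(macs, base_net="10.1.0.0/16", prefix="p"):
    """Map each MAC to a (hostname, IP) pair; inputs are illustrative."""
    hosts = ipaddress.ip_network(base_net).hosts()
    return {mac.lower(): (f"{prefix}{i:04d}", str(next(hosts)))
            for i, mac in enumerate(macs, start=1)}

for mac, (host, ip) in assign(["AA:BB:CC:00:00:01",
                               "AA:BB:CC:00:00:02"]).items():
    print(mac, host, ip)   # e.g. aa:bb:cc:00:00:01 p0001 10.1.0.1
```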

System expansion
– Prometheus expansion already ordered
– 4th island:
  – 432 regular nodes (2 CPUs, 128 GB RAM)
  – 72 nodes with GPGPUs (2x Nvidia Tesla K40XL)
– Installation to begin in September
– 2.4 PFLOPS total performance (Rpeak; see the sketch below)
– 2232 nodes, 53,568 CPU cores, 279 TB RAM
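The 2.4 PFLOPS figure is roughly consistent with the same per-core assumptions as before plus a nominal ~1.43 TFLOPS of double precision per Tesla K40-class GPU (an assumed datasheet value):

```python
# Rpeak of the expanded system under the stated assumptions.
cpu_nodes = 2232                         # 2160 regular + 72 GPU nodes
cpu_rpeak = cpu_nodes * 24 * 2.5e9 * 16  # E5-2680 v3 assumptions as above
gpu_rpeak = 72 * 2 * 1.43e12             # two K40-class GPUs per GPU node

print(f"CPU cores: {cpu_nodes * 24:,}")                       # 53,568
print(f"Rpeak: {(cpu_rpeak + gpu_rpeak) / 1e15:.2f} PFLOPS")  # ~2.35
```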

Future plans
– Push the system to its limits
– Further improvements of the monitoring tools
– Continue to move users from the previous system
– Detailed energy and temperature monitoring
– Energy-aware scheduling
– Survive the summer and measure performance
– Collect the annual energy and PUE (see the sketch below)
– HP-CAST 25 presentation?
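For the PUE bookkeeping mentioned above, the metric is simply total facility energy divided by IT energy over the same period; a minimal sketch with invented annual figures chosen only to be consistent with the quoted PUE < 1.05:

```python
# PUE = total facility energy / IT equipment energy (same period).
def pue(total_facility_kwh, it_kwh):
    return total_facility_kwh / it_kwh

annual_it_kwh = 4_500_000.0       # assumed annual IT load
annual_overhead_kwh = 200_000.0   # assumed cooling + distribution losses
print(f"PUE = {pue(annual_it_kwh + annual_overhead_kwh, annual_it_kwh):.3f}")
# -> PUE = 1.044, i.e. below the 1.05 target quoted earlier.
```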