Common Practices for Managing Small HPC Clusters (Supercomputing 12)

Small HPC at Supercomputing 2012
Survey instrument: tinyurl.com/smallHPC
Fourth annual BoF: discussing the 2011 and 2012 survey results.

GPUs

How many GPU cores does this cluster have?
  Answer             2011    2012
  None               61%     41%
  1 - 999                    45%
  1,000 - 4,999      6%      5%
  5,000 - 9,999      0%
  10,000 - 24,999    0%      5%
  25,000 or more     0%      5%
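For sites tallying this figure on their own hardware, below is a minimal sketch of counting GPUs per node, assuming the NVIDIA driver and the nvidia-smi tool are installed; the cores-per-GPU and node-count values are hypothetical placeholders, since nvidia-smi does not report CUDA core counts directly.

    # Count GPUs on the local node with `nvidia-smi -L` and estimate cluster GPU cores.
    # CORES_PER_GPU and GPU_NODES are assumed, site-specific figures, not survey data.
    import subprocess

    CORES_PER_GPU = 2496      # hypothetical value for the installed GPU model
    GPU_NODES = 4             # hypothetical number of GPU-equipped nodes

    # `nvidia-smi -L` prints one line per GPU on this node.
    out = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True, check=True)
    gpus_per_node = sum(1 for line in out.stdout.splitlines() if line.strip())
    print(f"{gpus_per_node} GPUs on this node, "
          f"~{gpus_per_node * CORES_PER_GPU * GPU_NODES} GPU cores cluster-wide")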

Hardware Configuration

How many CPU cores are in this HPC cluster?
  Answer             2011    2012
  < 200              21%     14%
  200 - 999
  1,000 - 4,999              36%
  5,000 - 9,999              23%
  > 10,000

How much memory is there per physical server, i.e., per compute node?
  2011: > 64 GB 37%; Unsure 5%; remaining ranges (labels missing): 16%, 32%, 11%, 21%, 58%
  2012: > 256 GB 23%; Unsure 0%; remaining ranges (labels missing): 9%, 23%, 41%, 14%, 32%, 23%, 18%

How much memory is there per core?
  > 16 GB 18%; Unsure 0%; remaining ranges (labels missing): 0%, 5%, 32%, 50%, 9%, 23%, 23%
Long tail toward the high side.
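The per-core figure is simply the node's memory divided by its core count. A small worked sketch, using hypothetical node configurations rather than survey data:

    # Memory per core = memory per node / cores per node.
    # The configurations below are illustrative only.
    configs = [
        ("16 cores, 64 GB/node", 16, 64),
        ("32 cores, 128 GB/node", 32, 128),
        ("32 cores, 256 GB/node", 32, 256),
    ]
    for label, cores, mem_gb in configs:
        print(f"{label}: {mem_gb / cores:.1f} GB per core")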

What low latency network is used for MPI communication among CPU cores?
  Answer                          2011    2012
  InfiniBand                      74%     77%
  10 Gigabit Ethernet (10 GbE)    11%     14%
  1 Gigabit Ethernet (1 GbE)      11%     9%
  Unsure                          5%      0%
Little change from 2011.
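Latency, more than bandwidth, is usually what drives the InfiniBand choice for tightly coupled MPI codes. Below is a minimal ping-pong sketch for measuring round-trip latency between two ranks, assuming mpi4py and an MPI implementation are installed; run it with something like "mpirun -np 2 python pingpong.py".

    # Two-rank ping-pong: rank 0 times round trips of a small message to rank 1.
    from mpi4py import MPI
    import time

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    buf = bytearray(8)        # small message: we are timing latency, not bandwidth
    trips = 1000

    comm.Barrier()
    start = time.perf_counter()
    for _ in range(trips):
        if rank == 0:
            comm.Send(buf, dest=1, tag=0)
            comm.Recv(buf, source=1, tag=0)
        elif rank == 1:
            comm.Recv(buf, source=0, tag=0)
            comm.Send(buf, dest=0, tag=0)
    elapsed = time.perf_counter() - start

    if rank == 0:
        print(f"average round-trip latency: {elapsed / trips * 1e6:.1f} microseconds")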

Would you choose this form of low latency networking again?
  Answer    2011    2012
  Yes       94%     86%
  No        6%      14%
Reasons given by the 14% who said No:
  - InfiniBand next time.
  - 1 Gigabit Ethernet is actually a 2 Gb network, and the vendor is no longer viable for low-latency bandwidth.
  - The 10 Gigabit Ethernet vendor is no longer active in this area.
  - Will move to 10 Gb in the near future.

Scheduler Questions

What is the scheduler?
  Answer                          2011    2012
  PBS                             6%      9%
  TORQUE                          33%     59%
  SGE (Oracle/Sun Grid Engine)    28%     23%
  LSF                             33%     5%
  Lava                            0%
  Condor                          6%      0%
  Maui/Moab                       39%     55%
  SLURM                           0%      9%
  Unsure                          ---     0%
  Other                           0%      5%
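TORQUE, typically paired with Maui or Moab, remains the most common answer. As a point of reference, a minimal sketch of scripted submission to a TORQUE/PBS-style scheduler; the job name, resource requests, and executable below are hypothetical and should be adjusted per site.

    # Write a TORQUE/PBS job script and submit it with qsub.
    # All resource requests and paths are illustrative placeholders.
    import subprocess
    import textwrap

    job_script = textwrap.dedent("""\
        #!/bin/bash
        #PBS -N example_job
        #PBS -l nodes=1:ppn=8
        #PBS -l walltime=02:00:00
        #PBS -l mem=16gb
        cd $PBS_O_WORKDIR
        ./my_application            # hypothetical executable
    """)

    with open("example_job.pbs", "w") as f:
        f.write(job_script)

    # qsub prints the job ID on success.
    result = subprocess.run(["qsub", "example_job.pbs"], capture_output=True, text=True)
    print(result.stdout.strip() or result.stderr.strip())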

Would you choose this scheduler again?
  Answer    2011    2012
  Yes       94%     79%
  No        6%      21%
Comments from those who said No:
  - Documentation has been shaky since Oracle's acquisition of Sun; the spinoff, the various versions, and invalid documentation links made it quite challenging.
  - Queue-based preemption caused problems (Oracle/SGE).
  - The volume of jobs being submitted caused load issues on the open-source scheduler; upgrading (Maui/Moab).
  - SGE is no longer supported; looking to use UGE.

Do you allow serial jobs on this cluster?
  Yes   95%
  No    5%
No change from 2011.

What is the maximum amount of memory allowed per serial job?
  2011: No maximum enforced 16%; < 2 GB 32%; > 24 GB 37%; remaining ranges (labels missing): 11%, 21%, 58%
  2012: No maximum enforced 9%; 2 GB or less 23%; More than 24 GB 23%; remaining ranges (labels missing): 41%, 14%, 32%
Decreases at the high end.
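Schedulers normally enforce these caps themselves, for example through a per-job memory resource request, but the same idea can be illustrated stand-alone with a POSIX rlimit. A minimal sketch follows; the 4 GB cap and the executable name are hypothetical.

    # Launch a serial job under a hard address-space limit (a hypothetical 4 GB cap).
    # This only illustrates the mechanism; production limits belong in the scheduler.
    import resource
    import subprocess

    LIMIT_BYTES = 4 * 1024**3   # assumed 4 GB cap

    def set_limit():
        # Applied in the child process just before exec (Unix only).
        resource.setrlimit(resource.RLIMIT_AS, (LIMIT_BYTES, LIMIT_BYTES))

    subprocess.run(["./serial_job"], preexec_fn=set_limit)  # hypothetical executable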

What is the maximum amount of memory allowed per multi-core (MP or MPI) job?
  2011: No maximum enforced 74%; < 8 GB 0%; > 48 GB 11%; remaining ranges (labels missing): 5%, 0%, 11%
  2012: No maximum enforced 59%; 8 GB or less 0%; More than 48 GB 32%; remaining ranges (labels missing): 0%, 9%, 0%
More maximums are enforced than in 2011, but the maximums are high.

Storage

Where do users' home directories reside?
  Answer                                            2011    2012
  Local disk                                        0%
  NFS                                               47%     55%
  Parallel file system (e.g., Lustre or Panasas)    47%     41%
  Unsure                                            0%
  Other                                             5%
Little change from 2011.
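A quick way to see what actually backs a given directory is to match it against the mount table. Below is a minimal, Linux-only sketch; the /home path is an assumption, as home areas differ between sites.

    # Report which filesystem backs a directory (e.g., the user home area) and its free space.
    import shutil

    PATH = "/home"   # assumed location of user home directories

    # Find the longest mount point in /proc/mounts that prefixes PATH.
    best = ("", "", "")
    with open("/proc/mounts") as mounts:
        for line in mounts:
            device, mountpoint, fstype = line.split()[:3]
            if PATH.startswith(mountpoint) and len(mountpoint) > len(best[1]):
                best = (device, mountpoint, fstype)

    usage = shutil.disk_usage(PATH)
    print(f"{PATH}: {best[2]} on {best[0]} (mounted at {best[1]}), "
          f"{usage.free / 1e12:.2f} TB free of {usage.total / 1e12:.2f} TB")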

Would you configure users' home directories this way again?
  Answer    2011    2012
  Yes       88%     100%
  No        12%     0%
People are satisfied with what they are doing.

What type of high performance storage/scratch space?
  Answer                                            2011    2012
  Local disk                                        26%     27%
  NFS                                               16%     27%
  Parallel file system (e.g., Lustre or Panasas)    100%
  Unsure                                            0%
  Other                                             0%      9% (GPFS)
Little change from 2011.

Would you configure high performance/scratch space this way again?
  Answer    2011    2012
  Yes       100%    95%
  No        0%      5%
Essentially no change.

Do you have an online, medium performance data storage service?
  Answer    2011    2012
  Yes       28%     55%
  No        72%     45%
  Unsure    0%

Which of the following storage environments on this cluster do you back up?
  Answer                                2011    2012
  Home directories                      53%     68%
  High performance / scratch space      5%      9%
  Medium performance, online storage    11%     45%
  None                                  47%     27%
  Unsure                                0%
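Home directories are the most commonly backed-up area. Below is a minimal sketch of a scripted rsync mirror of /home to a backup host; the destination host and path are hypothetical, and a real deployment would add logging, retention, and error handling.

    # Nightly home-directory mirror using rsync over SSH.
    # -a preserves permissions and timestamps; --delete removes files that no
    # longer exist on the source so the mirror stays consistent.
    import subprocess
    import datetime

    src = "/home/"
    dest = "backup-server:/backups/home/"   # hypothetical backup host and path

    cmd = ["rsync", "-a", "--delete", src, dest]
    print(datetime.datetime.now().isoformat(), "running:", " ".join(cmd))
    subprocess.run(cmd, check=True)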

Current Directions

If you were buying new compute nodes today, how many cores per node?
  Answer    2011    2012
  4         0%      0%
  8         5%      9%
  12        21%     5%
  16        32%     50%
  > 16      16%
  24                0%
  32                14%
  Unsure    26%     14%
  Other             9%

If you were buying new compute nodes today, how much memory per node?
  2011: 0-8 GB 0%; 9-16 GB 0%; > 48 GB 35%; Unsure 24%; remaining ranges (labels missing): 0%, 41%
  2012: 0-8 GB 9%; 9-16 GB 9%; More than 64 GB 41%; Unsure 9%; remaining ranges (labels missing): 0%, 23%, 9%

Staffing

How many different individuals, excluding students, are involved in operation, support, and development?
  1 individual              5%
  2-3 individuals           41%
  4-5 individuals           32%
  6-8 individuals           18%
  9-10 individuals          5%
  11-15 individuals         0%
  More than 15 individuals  0%

Approximately how many FTE, including students, operate the cluster to maintain the status quo (excluding user support)?
  Answer            2012
  < 1 FTE           27%
  1.1 - 2 FTE       41%
  2.1 - 4 FTE       23%
  4.1 - 6 FTE       9%
  6.1 - 8 FTE       0%
  More than 8 FTE   0%

Approximately how many FTE, including students, support users of the cluster?
  Answer            2012
  < 1 FTE           32%
  1.1 - 2 FTE       27%
  2.1 - 4 FTE       27%
  4.1 - 6 FTE       9%
  6.1 - 8 FTE       5%
  More than 8 FTE   0%
Generally fewer FTE than for operations support.

Approximately how many FTE, including students, are involved in hardware/software development efforts related to the cluster?
  Answer            2012
  < 1 FTE           48%
  1.1 - 2 FTE       33%
  2.1 - 4 FTE       14%
  4.1 - 6 FTE       5%
  6.1 - 8 FTE       0%
  More than 8 FTE   0%
Even fewer FTE than for user support.

Inward-facing staff versus outward-facing staff?
  Answer                            2012
  There is a clear separation.      9%
  There is some separation.         41%
  There is almost no separation.    50%
Most staff do both operational and end-user support.

Small HPC BoFs Contact Information
  Survey: tinyurl.com/smallHPC
  Website: https://sites.google.com/site/smallhpc/
  List: see link at the above website
  Contacts: Roger, David