PDSF and the Alvarez Clusters
Presented by Shane Canon, NERSC/PDSF

NERSC Hardware

The National Energy Research Scientific Computing Center (NERSC) is one of the nation's top unclassified computing resources, funded by the DOE for over 25 years with the mission of providing computing and network services for research. NERSC is located at Lawrence Berkeley Laboratory in Berkeley, CA.

High Performance Computing Resources:
- IBM SP cluster, processors, 1.2+ TB RAM, 20+ TB cluster filesystem
- Cray T3E, 692 processors, 177 GB RAM
- Cray PVP, 64 processors, 3 GW RAM
- PDSF, 160 compute nodes, 281 processors, 7.5 TB disk space
- HPSS, 6 StorageTek silos, 880 TB of near-line and offline storage, soon to be expanded to a full petabyte

NERSC Facilities

New Oakland Scientific Facility:
- 20,000 sq. ft. data center
- 24x7 operations team
- OC48 (2.5 Gbit/sec) connection to LBL/ESnet
- Options on a 24,000 sq. ft. expansion

NERSC Internet Access

ESnet Headquarters:
- Provides leading-edge networking to DOE researchers
- Backbone has an OC12 (622 Mbit/sec) connection to CERN
- Backbone connects key DOE sites
- Headquartered at Lawrence Berkeley; location assures prompt response

Cluster Design
- Embarrassingly parallel
- Commodity networking
- Commodity parts
- Buy "at the knee"
- No modeling
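Buying "at the knee" means choosing the configuration just before the price/performance curve turns steeply upward. A small illustration of that selection below; the node configurations, prices, and benchmark scores are made up.

```python
"""Illustration of 'buying at the knee' of the price/performance curve.

The configurations, prices, and benchmark scores are hypothetical; the
point is to pick the option with the best performance per dollar before
the top-end premium kicks in.
"""

# (config name, price in dollars, relative benchmark score) -- made-up data
candidates = [
    ("1x 800 MHz PIII",   1500,  80),
    ("2x 866 MHz PIII",   2300, 165),
    ("2x 1 GHz PIII",     2900, 185),
    ("2x 1 GHz, max RAM", 4200, 190),   # past the knee: small gain, big cost
]

# The "knee" here is simply the configuration with the best perf per dollar.
best = max(candidates, key=lambda c: c[2] / c[1])

for name, price, perf in candidates:
    marker = "  <-- knee" if (name, price, perf) == best else ""
    print(f"{name:20s} ${price:5d}  perf={perf:3d}  perf/$={perf / price:.3f}{marker}")
```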

Issues with Cluster Configuration
- Maintaining consistency
- Scalability
  - System
  - Human
- Adaptability/flexibility
- Community tools

Cluster Configuration: Present
- Installation: home grown (nfsroot/tar image)
- Configuration management: rsync/RPM, Cfengine
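In practice the rsync/RPM approach amounts to pushing a canonical file tree to every node and flagging drift in the installed package set. A minimal sketch of that idea follows; the node names, paths, and reference package list are hypothetical, not PDSF's actual scripts.

```python
"""Sketch of rsync/RPM-style consistency management across cluster nodes.

Hosts, paths, and the reference list are hypothetical; PDSF's real
tooling differed.
"""
import subprocess

NODES = ["pdsf-n001", "pdsf-n002"]           # hypothetical node names
CONFIG_TREE = "/cluster/config/"             # hypothetical canonical config tree
REFERENCE_PKGS = "/cluster/config/rpm-list"  # hypothetical 'rpm -qa' reference


def push_config(node):
    """Mirror the canonical config tree onto a node with rsync."""
    subprocess.run(["rsync", "-a", "--delete", CONFIG_TREE,
                    f"{node}:{CONFIG_TREE}"], check=True)


def package_drift(node):
    """Return package names that differ between the node and the reference."""
    with open(REFERENCE_PKGS) as f:
        reference = set(f.read().split())
    installed = set(subprocess.check_output(
        ["ssh", node, "rpm", "-qa"], text=True).split())
    return installed ^ reference   # symmetric difference = drift


if __name__ == "__main__":
    for node in NODES:
        push_config(node)
        drift = package_drift(node)
        if drift:
            print(f"{node}: {len(drift)} packages out of sync")
```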

Cluster Configuration: Future
- Installation: Kickstart (or SystemImager/SystemInstaller)
- Configuration management: RPM, Cfengine, database
- Resource management: integrate with configuration management
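The database item points toward keeping node attributes in a single store and generating both installer input and resource-manager settings from it, so the two cannot drift apart. A toy sketch of that pattern, with an invented schema, node inventory, and role-to-package mapping:

```python
"""Toy sketch of database-driven node configuration (invented schema).

Node roles and attributes live in one table; the installer package list
and the batch-slot count are both generated from it.
"""
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (name TEXT, role TEXT, cpus INTEGER)")
conn.executemany("INSERT INTO nodes VALUES (?, ?, ?)", [
    ("pdsf-n001", "compute", 2),      # hypothetical inventory
    ("pdsf-n002", "compute", 2),
    ("pdsf-io01", "fileserver", 2),
])

ROLE_PACKAGES = {                      # invented role -> RPM mapping
    "compute": ["lsf-client", "afs-client"],
    "fileserver": ["nfs-utils"],
}

for name, role, cpus in conn.execute("SELECT name, role, cpus FROM nodes"):
    print(f"# kickstart %packages fragment for {name} ({role})")
    for pkg in ROLE_PACKAGES[role]:
        print(pkg)
    print(f"# batch slots for {name}: {cpus}")
```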

NERSC Staff

NERSC and LBL have dedicated, experienced staff in the fields of high performance computing, Grid computing, and mass storage.

Researchers:
- Will Johnston, Head of Distributed Systems Dept.; Grid researcher; project manager for the NASA Information Power Grid
- Arie Shoshani, Head of Scientific Data Management; researches mass storage issues related to scientific computing
- Doug Olson, Project Coordinator, Particle Physics Data Grid; coordinator for STAR computing at PDSF
- Dave Quarrie, Chief Software Architect, ATLAS
- Craig Tull, Offline Software Framework/Control; coordinator for ATLAS computing at PDSF

NERSC High Performance Computing Department:
- Advanced Systems Group evaluates and vets HW/SW for production computing (4 FTE)
- Computing Systems Group manages the infrastructure for computing (9 FTE)
- Computer Operations & Support provides 24x7x365 support (14 FTE)
- Networking and Security Group provides networking and security (3 FTE)
- Mass Storage manages the near-line and off-line storage facilities (5 FTE)

PDSF & STAR PDSF has been working with the STAR since Data collection occurs at Brookhaven, and DST’s are sent to NERSC - PDSF is the primary offsite computing facility for STAR - Collaboration carries out DST analysis and simulations at PDSF - STAR has 37 collaborating institutions (too many for arrows!)

PDSF Philosophy

PDSF is a Linux cluster built from commodity hardware and open source software.
- Our mission is to provide the most effective distributed computer cluster possible that is suitable for experimental HENP applications
- The PDSF acronym came from the SSC lab in 1995, along with the original equipment
- Architecture is tuned for "embarrassingly parallel" applications
- Uses LSF 4.1 for batch scheduling
- AFS access, and access to HPSS for mass storage
- High-speed (Gigabit Ethernet) access to the HPSS system
- One of several Linux clusters at LBL:
  - The Alvarez cluster has a similar architecture, but supports a Myrinet cluster interconnect
  - The NERSC PC Cluster project by the Future Technology Group is an experimental cluster
  - A Genome cluster at LBL supports research into the fruit fly genome
- 160 compute nodes, 281 processors, 7.5 TB of storage
- Cluster uptime for the year 2000 was > 98%; for the most recently measured period (January 2001), cluster utilization for batch jobs was 78%
- Overall, the cluster has had zero downtime due to security issues
- PDSF and NERSC have a track record of solid security balanced with unobtrusive practices
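Because the workload is embarrassingly parallel and scheduled with LSF, a common pattern is to submit one independent batch job per input file. A sketch of such a submission loop follows; the queue name, file layout, and analysis executable are hypothetical, and only standard bsub options (-q, -J, -o) are used.

```python
"""Sketch of submitting embarrassingly parallel jobs to LSF with bsub.

Queue name, paths, and the analysis executable are hypothetical; the bsub
options used (-q queue, -J job name, -o output file) are standard LSF.
"""
import glob
import subprocess

QUEUE = "star_analysis"                  # hypothetical LSF queue
INPUTS = glob.glob("/data/dst/*.root")   # hypothetical DST files

for i, dst in enumerate(INPUTS):
    subprocess.run([
        "bsub",
        "-q", QUEUE,
        "-J", f"dst_ana_{i}",
        "-o", f"logs/dst_ana_{i}.out",
        "analyze_dst", dst,              # hypothetical analysis executable
    ], check=True)
```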

More About PDSF

PDSF uses a common resource pool for all projects.
- PDSF supports multiple experiments: STAR, ATLAS, BABAR, D0, Amanda, E871, E895, E896, and CDF
- Multiple projects have access to the computing resources, and the software available supports all experiments
- The actual level of access is determined by the batch scheduler, using fair-share rules
- Each project's investment goes into purchasing hardware and support infrastructure for the entire cluster
- The use of a common configuration decreases management overhead, lowers administration complexity, and increases the availability of usable computing resources
- Use of commodity Intel hardware keeps us vendor neutral and lowers the cost to all of our users
- Low cost and easy access to hardware make it possible to update configurations relatively quickly to support new computing requirements
- Because the physical resources available are always greater than any individual contributor's investment, there is usually some excess capacity for sudden peaks in usage, and always a buffer to absorb sudden hardware failures
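Fair-share scheduling ties each project's effective priority to its share of the cluster (here, its hardware contribution) relative to its recent usage. A toy illustration of that idea with made-up numbers; the formula is a generic fair-share heuristic, not LSF's exact policy.

```python
"""Toy fair-share calculation: priority ~ share / recent usage.

Project names, contributions, and usage figures are made up, and the
formula is a generic illustration rather than LSF's actual algorithm.
"""

# Fraction of cluster hardware each project has contributed (hypothetical).
contribution = {"STAR": 0.60, "ATLAS": 0.25, "Amanda": 0.15}

# Recent CPU usage as a fraction of delivered cycles (hypothetical).
recent_usage = {"STAR": 0.70, "ATLAS": 0.10, "Amanda": 0.05}


def dynamic_priority(project):
    """Higher when a project is below its share, lower when above it."""
    return contribution[project] / max(recent_usage[project], 1e-6)


for p in sorted(contribution, key=dynamic_priority, reverse=True):
    print(f"{p:8s} share={contribution[p]:.2f} "
          f"used={recent_usage[p]:.2f} priority={dynamic_priority(p):.2f}")
```

A project that has been idle sees its priority rise above its nominal share, which is how the common pool absorbs sudden peaks in any one experiment's demand.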