Slide 1: Linux Reconstruction Farms at Fermilab
Steven C. Timm, Fermilab
10/18/01

Slide 2: Outline
- Hardware Configuration
- Software
- Management Tools

Slide 3: Hardware Configuration
- Four farms currently installed: CDF (154), D0 (122), Fixed Target (90), CMS (56)
- 422 dual-CPU nodes in all, MHz
- Also a small prototype farm for development
- Gb disk each, 512 Mb RAM
- Typical I/O node: SGI Origin 2000, 1 Tb disk (RAID), 4 CPUs, 2 x Gigabit Ethernet

Slide 4: Farms I/O Node
- SGI Origin 2000, 4 x 400 MHz CPUs
- 2 x Gigabit Ethernet
- 1 Tb disk

Slide 5: Farm Workers
- Dual PIII, MHz
- 50 Gb disk
- 512 Mb RAM

Slide 6: Farm Workers
- 2U dual PIII, 750 MHz
- 50 Gb disk
- 1 Gb RAM

Slide 7: Qualified Vendors
- We evaluate vendors on hardware reliability, competency in Linux, service quality, and price/performance.
- Vendors chosen for desktops and farm workers.
- 13 companies submitted evaluation units; five were chosen in each category.

Slide 8: Hardware Maintenance
- Recently decommissioned the first Linux farm at Fermilab after three years of running.
- Mean time between failures: ~24 months.
- Out of 36 nodes over 3 years, replaced 25 hard drives; 6 other faults (motherboards, power supply, memory).
- Other nodes tend to show the same pattern once the initial hardware is working.

Slide 9: Manpower and Womanpower
- SCS does all system admin work on the farms.
- CDF: few users, 154 nodes, 1 Tb RAID on the I/O node; worker-node storage also used as part of dfarm (½ time of Steve).
- D0: few users, 122 nodes, RAID and non-RAID disk (was ½ time of Troy).
- Fixed Target: many users, 90 nodes, non-RAID disk, lots of tape drives (3/4 time of Karen).
- CDF/D0 users configure much of their own products.

Slide 10: Manpower and Womanpower, contd.
Much of the actual time is spent on:
- Planning for the growth of the farms
- Burn-in of newly arriving nodes
- Dealing with vendors on delivery of unacceptable nodes
- Evaluating new hardware
- Dealing with things that don't scale

Slide 11: Fermi Linux
- Currently running 6.1; 7.1 is planned.
- Adds a number of security fixes.
- Follows all kernel and installer updates.
- Updates are sent out to ~1000 nodes by Autorpm.
- Qualified vendors ship machines with it preloaded.

Slide 12: ICABOD
- Vendor ships system with Linux OS loaded.
- Expect scripts (a sketch follows below):
  – Reinstall the system if necessary
  – Change root password, partition disks
  – Configure static IP address
  – Install Kerberos and ssh keys
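The slide lists ICABOD's post-delivery steps only at a high level; the actual Expect scripts are not reproduced in the transcript. As a rough illustration of the kind of automation involved, here is a hypothetical Python sketch using the pexpect module, with made-up host names, prompt patterns, and passwords; it is not the Fermilab ICABOD code.

    # Hypothetical sketch of one ICABOD-style first-contact step: log in with
    # the vendor's factory root password and set the site root password.
    # All names and prompt patterns below are assumptions for illustration.
    import pexpect

    NODE = "fnpc101.fnal.gov"      # hypothetical worker-node name
    FACTORY_PASSWORD = "changeme"  # password the vendor shipped with
    NEW_ROOT_PASSWORD = "s3cret"   # site-chosen root password

    def set_root_password(node):
        """Log in over ssh and change root's password interactively."""
        child = pexpect.spawn(f"ssh root@{node}", timeout=60)
        child.expect("password:")
        child.sendline(FACTORY_PASSWORD)
        child.expect("#")                     # root shell prompt
        child.sendline("passwd root")
        child.expect("New password:")
        child.sendline(NEW_ROOT_PASSWORD)
        child.expect("Retype new password:")
        child.sendline(NEW_ROOT_PASSWORD)
        child.expect("#")
        child.sendline("exit")
        child.close()

    if __name__ == "__main__":
        set_root_password(NODE)

The same expect/sendline pattern extends naturally to the other steps on the slide (partitioning, static IP configuration, installing Kerberos and ssh keys).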

Slide 13: Burn-in
- All nodes go through a 1-month burn-in test.
- Load both CPUs (2 x )
- Disk (Bonnie)
- Network test
- Monitor temperatures and current draw.
- Reject if more than 2% downtime (see the acceptance sketch below).
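The slide gives the acceptance criterion (reject a node that is down more than 2% of the burn-in month) but not the bookkeeping behind it. A small hypothetical Python sketch of that bookkeeping, assuming a per-node list of recorded outage intervals, could look like this:

    # Hypothetical burn-in acceptance check: total the recorded outage
    # intervals for a node and reject it if downtime exceeds 2% of the
    # one-month burn-in period. The log format is assumed for illustration.
    from datetime import datetime, timedelta

    BURN_IN = timedelta(days=30)      # nominal one-month burn-in
    MAX_DOWNTIME_FRACTION = 0.02      # reject if more than 2% downtime

    def downtime_fraction(outages):
        """outages: list of (start, end) datetime pairs from monitoring."""
        down = sum((end - start for start, end in outages), timedelta())
        return down / BURN_IN

    def accept_node(name, outages):
        frac = downtime_fraction(outages)
        verdict = "ACCEPT" if frac <= MAX_DOWNTIME_FRACTION else "REJECT"
        print(f"{name}: {frac:.1%} downtime -> {verdict}")
        return verdict == "ACCEPT"

    if __name__ == "__main__":
        # 18 hours down out of a 720-hour month is 2.5%, so this node fails.
        outages = [(datetime(2001, 10, 3, 2, 0), datetime(2001, 10, 3, 20, 0))]
        accept_node("fnpc102", outages)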

Slide 14: Management Tools

Slide 15: Things that break at ~150 nodes
- NIS password system: we are working on a replacement based on rsync.
- NFS? Maybe.
- Autorpm.
- Sequential commands to 150 nodes are very slow (see the parallel-dispatch sketch below).
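The slide only notes that issuing a command to 150 nodes one at a time is too slow; it does not say what replaced it. A minimal, hypothetical sketch of the obvious remedy, fanning the same command out over ssh from a thread pool, is shown below (node names and the degree of parallelism are assumptions, not the actual farm tooling):

    # Hypothetical parallel command dispatcher: run the same command on
    # every farm node concurrently instead of in a slow sequential loop.
    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    NODES = [f"fnpc{i:03d}.fnal.gov" for i in range(1, 151)]  # made-up names

    def run_on_node(node, command):
        """Run one command on one node over ssh and capture its output."""
        result = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", node, command],
            capture_output=True, text=True, timeout=120,
        )
        return node, result.returncode, result.stdout.strip()

    def run_everywhere(command, workers=30):
        """Dispatch the command to all nodes in parallel and print results."""
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = [pool.submit(run_on_node, n, command) for n in NODES]
            for future in futures:
                node, rc, out = future.result()
                print(f"{node}: rc={rc} {out}")

    if __name__ == "__main__":
        run_everywhere("uptime")

The same fan-out pattern applies to the rsync-based password distribution mentioned above: pushing the password files to each node in parallel rather than looping over the farm serially.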

Slide 16: NGOP Monitor (Display)

Slide 17: NGOP Monitor (Display)

Slide 18: Future Plans
- Next level of integration: one “pod” of six racks plus switch, console server, and display.
- Linux on disk servers, for NFS/NIS.
- Develop a SAN-based filesystem so that we can have redundant file servers.
- Biggest challenge is a scalable network file system; nobody has beaten this problem yet.