1
Fermilab Site Report HEPiX Fall 2011 Keith Chadwick Work supported by the U.S. Department of Energy under contract No. DE-AC02-07CH11359
2
Physics!
–CDF & D0: top quark asymmetry results.
–CDF: discovery of the Ξb0 baryon.
–CDF: the Λc(2595) baryon.
–Combined CDF & D0 limits on the Standard Model Higgs mass.
3
TeVatron Shutdown
On Friday 30-Sep-2011, the Fermilab TeVatron was shut down after 28 years of operation.
–The collider reached peak luminosities of 4 × 10^32 cm^-2 s^-1.
–The CDF and Dzero detectors recorded 8.63 PiB and 7.54 PiB of data respectively, corresponding to nearly 12 inverse femtobarns of data.
–CDF and Dzero data analysis continues.
–Fermilab has committed to 5+ years of support for analysis and 10+ years of access to the data.
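As a rough cross-check of the figures above (an illustrative calculation, not taken from the slide), accumulating ~12 fb⁻¹ at the peak luminosity would take on the order of a year of continuous running:

```latex
% 1 fb^{-1} = 10^{39} cm^{-2}, so 12 fb^{-1} = 1.2 x 10^{40} cm^{-2}.
t_{\text{equiv}} \approx \frac{\mathcal{L}_{\text{int}}}{\mathcal{L}_{\text{peak}}}
 = \frac{1.2\times10^{40}\ \text{cm}^{-2}}{4\times10^{32}\ \text{cm}^{-2}\,\text{s}^{-1}}
 = 3\times10^{7}\ \text{s} \approx 350\ \text{days}
```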
4
CDF & D0 Publications (CDF and D0 publication plots)
5
CD to CS Reorganization
Computing Sector:
–Vicky White serves as the Associate Director for Computing & CIO.
The Computing Sector has two divisions:
–Core Computing Division, led by Jon Bakken.
–Scientific Computing Division, led by TBD.
http://cdorg.fnal.gov/adm/orgcharts/orgchart.pdf
6
Core Computing Division – a strong base for science
Scientific Computing relies on Core Computing services and Computing Facility infrastructure:
–Core networking and network services
–Computer rooms, power and cooling
–Enterprise virtualization
–Email, videoconferencing, web services
–Document databases, Indico, calendaring
–Service desk
–Monitoring and alerts
–Logistics
–Desktop support (Windows and Mac)
–Printer support
–Computer security
–Business services, including many projects – identity management, FermiDash, Teamcenter, EBS, timecards, etc.
–… and more
7
Scientific Computing Division
Scientific Computing is responsible for meeting the ever-evolving needs of the Fermilab scientific program:
–Data custodianship (enStore, dCache, Lustre).
–Data reduction and analysis (FermiGrid, CDF & D0 analysis clusters, CMS LPC & Tier 1, GPCF, FermiCloud, etc.).
–Experiment support (electronics, hardware, software, etc.).
–Staffed by physicists, engineers, developers, system administrators, database administrators, and computer services specialists.
8
Soudan Mine Update
The CDMS detector and the MINOS “far detector” have been back in operation since 25-May-2011.
9
Fermilab Computing Facilities
Feynman Computing Center (FCC):
–FCC2 computer room.
–FCC3 computer rooms.
–High-availability services – e.g. core network, email, etc.
–Tape robotic storage (three 10,000-slot libraries).
–UPS & standby power generation.
–ARRA project: upgrade cooling and add a high-availability computing room – completed.
Grid Computing Center (GCC):
–Three computer rooms – GCC-[A,B,C].
–Tape robot room – GCC-TRR.
–High-density computational computing.
–CMS, Run II, and Grid Farm batch worker nodes.
–Lattice HPC nodes.
–Tape robotic storage (four 10,000-slot libraries).
–UPS & taps for portable generators.
Lattice Computing Center (LCC):
–High Performance Computing (HPC).
–Accelerator simulation and cosmology nodes.
–Systems for integration & development.
–No UPS.
EPA Energy Star award for 2010.
10
It’s not easy being Green… It Requires a Lot of Work
A lot of space has been retired / consolidated – LCC 108, Tape Vault Mezz, FCC1.
Ensured acquisition of EPEAT-registered (95%), ENERGY STAR qualified (100%), or FEMP-designated (95%) electronic office products when procuring electronics in eligible product categories.
–http://www.epeat.net/
Fermilab participates in the Federal Electronics Challenge (FEC). The FEC focuses on the procurement, use, and disposal of electronics (computing, cell phones, etc.).
–We have received two bronze awards.
Personal Computing Environmental Policy:
–http://computing.fnal.gov/xms/About/Computing_Policies/Personal_Computing_Environmental_Policy
Think Green – guidance on procuring and using computing:
–http://computing.fnal.gov/xms/Services/Think_Green
11
Cooling Incidents at GCC
Tuesday 7-Jun-2011 through Wednesday 8-Jun-2011:
–An outside temperature of 93F / 34C caused a loss of cooling due to “high head” – the refrigerant coming back from the condensers was too hot.
–Additional monitoring was installed.
Tuesday 19-Jul-2011 through Monday 25-Jul-2011:
–Additional internal and external cooling was installed due to predicted temperatures of 100F / 38C.
–Ran at a 30% capacity reduction through Monday 25-Jul-2011.
An engineering study is in progress to design a fix for the underlying issue(s).
12
Power Incidents - 1
Thursday 28-Jul-2011 & Friday 29-Jul-2011 [site wide]:
–A site-wide power outage occurred due to a lightning strike on the 345 kV primary power lines to the site; the site switched to the secondary (lower capacity) power lines.
–GCC was down for ~8 hours on Thursday and ~2 hours on Friday when the site was switched back to the primary power lines.
–FCC rode through on UPS + generators.
Tuesday 16-Aug-2011 [GCC-B]:
–The GCC-B computer room lost power at ~20:45.
–Investigation determined that an indicator light in a PLC board in the main room breaker enclosure failed in such a manner as to “short out” the PLC board and trip off the entire room.
–Power was restored at ~14:00 on Wednesday 17-Aug-2011.
13
Power Incidents - 2
Thursday 25-Aug-2011 [GCC-B]:
–On Tuesday 23-Aug-2011, an internal UPS that supports the PLC board (replaced on Tuesday 16-Aug-2011) was identified as being “unstable” – likely damaged during the previous incident.
–A scheduled power down of GCC-B was taken on Thursday 25-Aug-2011 from 0830-1130 to replace the failing internal UPS.
Saturday 15-Oct-2011 [FCC2]:
–Scheduled ~8-hour downtime to complete ARRA-funded electrical work and upgrade the EPO for the FCC2 computer room.
–Downtime preparations started at 0300, power was shut down at 0600, power was back at 1300, EPO tests ran until 1400, the network restarted at 1400, file servers restarted at 1500, services restarted at 1600, and everything was complete by 1800.
An annual report is written on all data center service disruptions, including RCAs and lessons learned as appropriate.
14
Tape Robots & Tape Drives
The 7th SL8500 tape robot has been installed – we now have 4 in GCC & 3 in FCC.
10 T10KC tape drives were put into limited production on 20-Jun-2011:
–Writing a second copy of data onto T10KC tapes.
–Two types of errors were encountered once the T10KC drives entered limited production.
–Fortunately there was no data loss, since the T10KC copy was a second copy.
–New firmware was installed in all T10KC drives to address these issues.
After testing LTO5 and T10KC, Fermilab has decided to adopt T10KC tape technology and will start migrating to this media in FY2012.
–USCMS has purchased 30 T10KC drives (and will still write a second copy to LTO4 for a while).
–Run II (CDF/D0) has purchased 8 T10KC drives.
We are working to provide small-file aggregation/caching for enStore (the general idea is sketched below).
–Expect to release early in CY2012.
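A minimal sketch of the small-file aggregation idea (illustrative only – this is not the enStore implementation, and the paths are hypothetical): many small files are packed into a single tar container so that the tape layer only ever handles large objects.

```python
import tarfile
from pathlib import Path

def aggregate_small_files(source_dir: str, container: str,
                          max_bytes: int = 5 * 1024**3) -> list[str]:
    """Pack files from source_dir into one tar container of at most max_bytes,
    so the tape layer writes a single large object instead of many tiny ones.
    Returns the list of files that were packed."""
    packed, total = [], 0
    with tarfile.open(container, mode="w") as tar:   # plain tar; tape drives compress in hardware
        for path in sorted(Path(source_dir).rglob("*")):
            if not path.is_file():
                continue
            size = path.stat().st_size
            if total + size > max_bytes:
                break                                # a new container would be started here
            tar.add(path, arcname=str(path.relative_to(source_dir)))
            packed.append(str(path))
            total += size
    return packed

if __name__ == "__main__":
    # Hypothetical cache and staging paths, for illustration only.
    files = aggregate_small_files("/cache/incoming", "/cache/aggregate_0001.tar")
    print(f"packed {len(files)} files")
```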
15
Storage
The Data Movement and Storage department is exploring alternative disk storage for analysis (e.g. Lustre) and will be capable of bringing up a small production-level Lustre system.
The CMS Tier 1 has deployed CERN's EOS for user data files and reports that it has worked very well. More disk will probably be added and the system will grow in importance.
All CMS Tier 1 data files on disk are available to external users via xrootd's data reflector capabilities and via the OSG "Data Anytime, Anywhere" program of work (see the sketch below).
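As an illustration of what remote xrootd access looks like from a user's point of view (the redirector hostname and file path below are hypothetical, and this is not an official CMS recipe), a file can be pulled with the standard xrdcp client:

```python
import subprocess

def xrootd_copy(redirector: str, remote_path: str, local_path: str) -> None:
    """Copy a file from an xrootd redirector using the standard xrdcp client.
    Requires the xrootd client tools and (for CMS data) a valid grid proxy."""
    url = f"root://{redirector}//{remote_path.lstrip('/')}"
    subprocess.run(["xrdcp", url, local_path], check=True)

if __name__ == "__main__":
    # Hypothetical redirector and file path, for illustration only.
    xrootd_copy("cms-xrd-redirector.example.org",
                "/store/user/example/file.root",
                "file.root")
```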
16
Computing Sector Project Management
Capital “P” Projects and small “p” projects:
Capital “P” projects typically run under formal project management.
–Service-Now migration, Email migration, FermiDash, TeamCenter, etc.
Small “p” projects typically run under line management.
–Yearly worker node procurements, GPCF, FermiCloud, FermiGrid-HA2, the VoIP pilot, enhancements to NIMI & Tissue, server consolidation and virtualization, etc.
17
Capital “P” Projects
Project Name | Description | Current Status
Teamcenter | Implement a common Engineering Data Management System (EDMS) to capture all elements of the Fermilab engineering process and documents. | Target go-live is 1Q CY2012
Exchange Migration | Deliver an e-mail and calendaring service based on Exchange Server 2010; migrate all IMAP4, Exchange 2007, Lotus Notes, and Meeting Maker users to it. | Target go-live is ~now
Service-Now | Migrate the Fermilab ITIL support tool from BMC Remedy to the cloud-based Service-Now in support of ISO 20K certification. | Went live 19-Oct-2011
FermiDash | Deliver a management dashboard for senior Laboratory management. | 1st draft of the dashboard available
SharePoint Deployment | Production SharePoint deployment. | SharePoint in production
FY11 Computer Security Compliance | Address FY11 computer security compliance issues. | Planning in progress
Windows 7 Deployment | Manage the Windows 7 pre-deployment testing and deployment; phase 1 is complete, phase 2 (replacement of hardware unable to run Windows 7) is being planned. | Roll-out in progress
EBS R12 Upgrade | Update the E-Business Suite to release 12. | Planning in progress
Identity Management | Provide a single authoritative source of truth for managing and maintaining information about individuals (employees, visitors, contractors, etc.) associated with the laboratory, and a secure, trusted electronic identity that can be used in a variety of ways, in particular to authorize use of computing services. | Technology investigations underway
18
Email Migration
Migration from obsolete IMAP servers to Exchange 2010 with anti-virus & anti-spam filters:
–Exchange servers are hosted onsite but managed by off-site subcontractors.
–Anti-virus & anti-spam filters are hosted offsite “in the cloud”.
1st phase of the migration was implemented on Tuesday 02-Aug-2011 – the anti-virus / anti-spam service was commissioned:
–We encountered a few issues over the next 36 hours related to email processing for non-“fnal.gov” domains that were hosted at Fermilab.
–Resolved by a reconfiguration of the email delivery ACLs late on Wednesday 03-Aug-2011.
Email delivery incident on Saturday 08-Oct-2011:
–A user's SMTP credentials were compromised, likely through an IM client that transmitted (shared) credentials in the clear.
–Spamcop notifications started at 6:21, the user's credentials were randomized at 12:08, the subcontractor disabled external routing of Fermilab email, and a workaround for external email routing was implemented at 18:42.
19
Windows 7 Rollout
Phase 1 is complete:
–Migrated all the desktops with hardware capable of running Windows 7.
Phase 2 is being planned:
–Will target the older systems that need an upgrade or replacement to run Windows 7.
(Windows 7 deployment plot)
20
Mac OS X
Versions: 10.4 (Tiger), 10.5 (Leopard), 10.6 (Snow Leopard), 10.7 (Lion).
10.7 (Lion):
–Released by Apple on 20-Jul-2011.
–Approved for Fermilab deployment on 28-Sep-2011.
Desktop Support is deploying the Casper service to enable central management of Macs.
–This replaces QMX.
(Mac OS X plot)
21
Scientific Linux
SL(F/C) 4 – end of life on 2/12/2012!
SL(F/C) 5
SL(F/C) 6
Jason Harrington & Tyler Parsons have joined the SL team; Troy Dawson has departed and Pat Riehecky has been added.
You will hear more about SL in Connie's talk.
22
Fermilab CPU Core Count
23
Data Storage at Fermilab
24
High Speed Networking
The Chicago MAN and Fermilab LightPath connections to StarLight are working extremely well, delivering production high-bandwidth data transfers in and out of Fermilab for the LHC experiments.
We encountered and resolved some performance issues with various network fabric extenders.
Work is underway to implement a distributed core for the network in order to be resilient to building outages.
Working on an IPv6 testbed (a basic reachability check is sketched below).
–FermiGrid and FermiCloud will actively participate.
Working on preparations to participate in the DOE 100 Gb/sec test network.
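A minimal sketch of the kind of IPv6 reachability check a testbed host might run (the hostname is hypothetical; this is an illustration, not part of the actual testbed plan):

```python
import socket

def ipv6_reachable(host: str, port: int = 80, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection can be opened to the host over IPv6."""
    try:
        infos = socket.getaddrinfo(host, port, socket.AF_INET6, socket.SOCK_STREAM)
    except socket.gaierror:
        return False            # no AAAA record / no IPv6 address for this host
    for family, socktype, proto, _canon, sockaddr in infos:
        try:
            with socket.socket(family, socktype, proto) as s:
                s.settimeout(timeout)
                s.connect(sockaddr)
                return True
        except OSError:
            continue            # try the next returned address
    return False

if __name__ == "__main__":
    # Hypothetical test host, for illustration only.
    print(ipv6_reachable("ipv6-test.example.org"))
```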
25
HPC
Lattice QCD:
–Ds (~430 worker nodes), J/Psi (~860 worker nodes), and Kaon (~284 worker nodes) are all running well.
Computational Cosmology:
–~1200 cores, running well.
Wilson Cluster:
–In the process of adding 34 nodes (each with 32 cores) to the cluster.
GPU Cluster:
–Ordered at the end of FY2011; delivery will be in the next couple of weeks.
26
FermiGrid Occupancy & Utilization
Cluster(s) | Current Size (Slots) | Average Size (Slots) | Average Occupancy | Average Utilization
CDF | 5630 | 5477 | 93% | 67%
CMS | 7132 | 6772 | 94% | 87%
D0 | 6540 | 6335 | 79% | 53%
GP | 3042 | 2890 | 78% | 68%
Total | 21927 | 21463 | 87% | 70%
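For context (an illustrative calculation, not from the slide): the totals row is, to within rounding, the slot-weighted average of the per-cluster figures, using the average sizes as weights. For occupancy:

```latex
% Slot-weighted average occupancy across the four clusters:
\frac{5477(0.93) + 6772(0.94) + 6335(0.79) + 2890(0.78)}
     {5477 + 6772 + 6335 + 2890} \approx 0.87
```

The same weighting applied to the utilization column gives roughly 0.69, consistent with the 70% in the totals row.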
27
FY2011 Worker Node Acquisition
Specifications: quad-processor, 8-core AMD 6128/6128HE at 2.0 GHz, 64 GB DDR3 memory, 3 × 2 TB disk, 4-year warranty, $3,654 each.
Who | Retirements | Base | Option | Purchase | Assign
CDF 200016+365236
CMS --404+2064
D0 23303767
IF --040 0
GP 4809965
Wilson --34--34
Total 99106+61266
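Simple arithmetic on the quoted specification (illustrative, not from the slide): each node provides 32 cores, 2 GB of memory per core, and roughly $114 per core.

```latex
% Per-node arithmetic from the specification above:
4\ \text{CPUs} \times 8\ \text{cores} = 32\ \text{cores/node},\qquad
\frac{64\ \text{GB}}{32\ \text{cores}} = 2\ \text{GB/core},\qquad
\frac{\$3{,}654}{32\ \text{cores}} \approx \$114/\text{core}
```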
28
Other Significant FY2011 Purchases
Storage:
–5.52 Petabytes of Nexsan E60 SATA drives for raw cache disk (will mostly be used for dCache, with some Lustre).
–180 Terabytes of BlueArc SATA disk.
–60 Terabytes of BlueArc 15K SAS disk.
Servers:
–119 servers,
–16 different configurations,
–all done in a single order.
29
FermiGrid-HA2 Deployment
The 1st rack was moved from FCC1 to FCC2 on Tuesday 24-May-2011, the 2nd rack was moved from FCC1 to GCC-B on Tuesday 7-Jun-2011, and the FermiGrid-HA2 physical reorganization was completed on Tuesday 07-Jun-2011 at ~1300.
–Critical services are now hosted in two data centers (FCC2 & GCC-B).
–Non-critical services are split across the two data centers.
The plan had been to use a scheduled power outage of FCC2 on 13-Aug-2011 as the final acceptance test for the FermiGrid-HA2 project.
–This power outage was later moved to 15-Oct-2011.
The GCC-B cooling outage at 1500 on Tuesday 07-Jun-2011 resulted in all systems in GCC-B being immediately shut down when facilities personnel switched off the main electrical panel breakers. FermiGrid-HA2 functioned exactly as designed:
–The critical services failed over to the single remaining copy of the service on FCC2.
–The non-critical services went to reduced capacity.
–When power was restored at 1700, the second copy of the critical services transparently rejoined the service “pool”, and the non-critical services resumed operation.
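A minimal sketch of the failover behavior described above, from a client's point of view (hostnames are hypothetical, and this is a generic illustration rather than the actual FermiGrid-HA2 mechanism): try the replica in one data center and fall back to the other if it is unreachable.

```python
import socket

# Hypothetical service replicas in the two data centers (FCC2 and GCC-B).
REPLICAS = ["service-fcc2.example.gov", "service-gccb.example.gov"]

def connect_with_failover(port: int, timeout: float = 3.0) -> socket.socket:
    """Try each replica in turn and return the first socket that connects."""
    last_error = None
    for host in REPLICAS:
        try:
            return socket.create_connection((host, port), timeout=timeout)
        except OSError as err:      # replica down or unreachable: try the next one
            last_error = err
    raise ConnectionError(f"no replica reachable: {last_error}")

if __name__ == "__main__":
    sock = connect_with_failover(8443)
    print("connected to", sock.getpeername())
    sock.close()
```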
30
FermiGrid-HA2 Service Availability
Service | Raw Availability | HA Configuration | Measured HA Availability | Minutes of Downtime
VOMS – VO Management Service | 99.657% | Active-Active | 100.000% | 0
GUMS – Grid User Mapping Service | 99.652% | Active-Active | 100.000% | 0
SAZ – Site AuthoriZation Service | 99.657% | Active-Active | 100.000% | 0
Squid – Web Cache | 99.640% | Active-Active | 100.000% | 0
MyProxy – Grid Proxy Server | 99.954% | Active-Standby | 99.954% | 240
ReSS – Resource Selection Service | 99.635% | Active-Active | 100.000% | 0
Gratia – Fermilab and OSG Accounting | 99.365% | Active-Standby | 99.997% | 120
Databases | 99.765% | Active-Active | 99.988% | 60
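For context (an illustrative calculation, assuming a one-year measurement window, which is consistent with the MyProxy row), availability and downtime minutes are related by:

```latex
% One year = 525,600 minutes.
A = 1 - \frac{\text{downtime}}{525{,}600\ \text{min}}
\quad\Longrightarrow\quad
1 - \frac{240}{525{,}600} \approx 99.954\%\ \text{(MyProxy, Active-Standby)}
```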
31
FermiCloud
Collaboration with KISTI: Vcluster (Grid cluster on demand).
MPI on FermiCloud – achieving 70% of “bare metal” performance without Mellanox SR-IOV drivers.
Relocated ~half of FermiCloud from GCC-B to the new FCC3 computer rooms.
Collaboration with OpenNebula: OpenNebula 3.0 includes X.509 authentication patches written at Fermilab & contributed back to the OpenNebula project!
SAN upgrade underway: 2 x SATABeast, 2x2 Brocade switches, and dual FC HBAs in all systems, designed to be fault tolerant in the event of a building outage.
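A minimal sketch of X.509-based identity mapping of the kind a cloud authentication layer performs (illustrative only – this is not the OpenNebula patch itself; the DN, user name, and certificate path are hypothetical, and the third-party cryptography package is assumed):

```python
# Illustration of mapping an X.509 certificate subject DN to a cloud user name.
# Requires the third-party "cryptography" package; NOT the OpenNebula patch itself.
from cryptography import x509

# Hypothetical mapping from certificate subject DNs to cloud user names.
DN_TO_USER = {
    "CN=Jane Doe,OU=People,O=Example Lab,C=US": "jdoe",
}

def user_from_certificate(pem_path: str) -> str:
    """Extract the subject DN from a PEM certificate and map it to a user name."""
    with open(pem_path, "rb") as f:
        cert = x509.load_pem_x509_certificate(f.read())
    dn = cert.subject.rfc4514_string()
    try:
        return DN_TO_USER[dn]
    except KeyError:
        raise PermissionError(f"unknown subject DN: {dn}")

if __name__ == "__main__":
    # Hypothetical certificate file, for illustration only.
    print(user_from_certificate("usercert.pem"))
```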
32
Summary
Physics results from CDF, D0, CMS, the Intensity Frontier, and the Cosmic Frontier are continuing.
Fermilab is in a time of transition, with a very bright and interesting future.
The Computing Sector has reorganized.
The Fermilab computing facilities have faced several challenges and performed exceedingly well.
Several large “P” computing Projects and numerous small “p” computing projects are underway.
Virtualization, Grid, and Cloud Computing are well established and growing.
Thanks to the extremely hard-working members of the Computing Sector!