Online Servers and Schedule
Paul J Smith

Online Servers

Recent changes:
- NUCs -> micethins: there has been some discussion of changing the name – suggest micenopi01 -> micenopi08?
- MICEIOCs – I'm assuming Pierrick will discuss this, but it requires adding several small machines to the system – no technical problems envisaged. Note: we don't know the long-term reliability of the NUCs, so we will carry a few spares.
- Nagios (test) server -> up and running in a basic fashion, but needs quite a lot more attention. Also requires a backup on heplnm.
- DAQ machines -> Yordan has installed several new machines; I'm assuming he will discuss the details in his presentation.
- OS SL6.4 installation -> people have been adding documentation to this – thank you! Ed's script is now complete – but who is going to keep the script up to date?

Schedule

The main categories for the online schedule are:
- MLCR upgrade – 75% complete; we are functional
- Nagios and system health – system for monitoring the other machines
- Backups and spares
- Documentation
- DAQ
- Online commissioning run – planned runs that will be useful
- Network switches – aim to replace the current network stack
- On-call rotas – to be focused on early next year
- Online monitoring – need more information

Nagios and System Health

As the online group is only too well aware (!), we have a Nagios system that currently checks whether each machine is up and running. I/we intend to add to this functionality over the forthcoming months. It has been surprising how often machines go up and down – to reduce the amount of nagging, notifications are limited to one message every 12 hours.

We want a second Nagios system on heplnm that monitors the first Nagios machine and a few sample machines on micenet, most likely using NSCA (passive checking). This will tell us that Nagios and micenet are up.

We also need to integrate Nagios with Pierrick's EPICS system – as I understand it, a good/not-good signal will be sufficient, but this needs further discussion.

There is a longer-term aim to add Cacti and Ganglia, but I consider these less critical (nice to have).
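A minimal sketch of the two pieces above, assuming stock Nagios object syntax (the host name and address are invented for illustration): with the default interval_length, notification_interval is in minutes, so 720 gives at most one message every 12 hours; the passive service is the kind an NSCA-fed second Nagios on heplnm could use, with a freshness check to alarm if results stop arriving.

```
define host {
    host_name               micenopi01        ; invented example host
    address                 192.168.0.11
    check_command           check-host-alive
    notification_interval   720               ; re-notify at most every 12 hours
}

define service {
    host_name               micenopi01
    service_description     micenet-heartbeat
    active_checks_enabled   0                 ; results arrive via send_nsca
    passive_checks_enabled  1
    check_freshness         1                 ; alarm if no result arrives
    freshness_threshold     3600              ; ...within an hour
    check_command           check_dummy!2!"no passive result received"
}
```

A real definition would also inherit from a template (max_check_attempts, contacts, etc.); only the directives relevant to the points above are shown.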

Nagios and System Health

Backups and Spares

The current situation with backups, from Chris Rogers:
1) Software on the onrec machines is transient (i.e. the master copy of the software is in bzr somewhere). If a hard drive fails, it would take a couple of hours to restore the installation of the latest version.
2) The PPD side I handle.
3) The CDB is backed up to PPD by rsync, and from PPD to heplnm069. We take monthly and weekly snapshots.
4) The archiver data is the main thing that needs backing up. It is backed up, but no snapshots are taken, so if the data becomes corrupted, the corruption will be replicated to the backups and the data can be lost. The archiver data is on one of the miceecserv machines. We need to think about how we handle things (e.g. how do we handle loss of connectivity to R9?).
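The snapshot point is worth making concrete. The sketch below (all paths invented, plain-Python stand-ins for what would really be rsync jobs) shows why a mirror alone is not enough: once the live data is corrupted, the next sync replicates the corruption, but an earlier dated snapshot still holds the good copy.

```python
"""Why snapshots matter for the archiver backup (illustrative only)."""
import shutil
import tempfile
from pathlib import Path

base = Path(tempfile.mkdtemp())
live = base / "archiver"           # stand-in for the live archiver data
snaps = base / "snapshots"
live.mkdir()
snaps.mkdir()

def snapshot(stamp: str) -> None:
    """Take a dated, independent copy of the live data."""
    shutil.copytree(live, snaps / stamp)

(live / "run.dat").write_text("good data")
snapshot("weekly-01")              # taken while the data is intact

(live / "run.dat").write_text("corrupted")   # silent corruption of the live file
snapshot("weekly-02")              # this snapshot replicates the corruption

# The older snapshot still holds the uncorrupted data:
assert (snaps / "weekly-01" / "run.dat").read_text() == "good data"
```

In production one would use rsync with --link-dest so that unchanged files are hard-linked between snapshots rather than re-copied; the recovery property is the same.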

Backups and Spares

The current situation with backups, from Matt Robinson: "I need the people responsible for various machines to give me a list of file-system locations on each machine which should be backed up."

There is a mirrored RAID (1 TB) NAS in the MLCR where we can store backups, as long as they are not too big. Currently 2/3 of this space is free. Note that there is no off-site storage for this kind of backup at present! The only machines currently using this backup drive are the Nagios test server and the target machines. Matt has added instructions on how to back your data up to this drive.

Note that it is not the online group's responsibility to back up your machines/data/software unless you have explicitly asked! If you want to take advantage of the backup facilities offered by the online group, it is the system owner's responsibility to let us know and to ensure that what we are offering is suitable.
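One way a job acting on such a list could work is sketched below. Everything here is hypothetical (the manifest entry, the NAS path, the size limit): each owner lists the locations to protect, and a nightly job copies them to the NAS mount, refusing anything that would swamp the shared 1 TB mirror.

```python
"""Manifest-driven backup sketch; all names and limits are invented."""
import shutil
import tempfile
from pathlib import Path

nas = Path(tempfile.mkdtemp())      # stand-in for the MLCR NAS mount point
src = Path(tempfile.mkdtemp())
(src / "nagios").mkdir()
(src / "nagios" / "nagios.cfg").write_text("# keep me")

manifest = [src / "nagios"]         # one entry per location an owner asked us to keep

for path in manifest:
    # Refuse anything that would not leave headroom on the shared mirror.
    size = sum(f.stat().st_size for f in path.rglob("*") if f.is_file())
    if size > 900 * 1024**3:
        raise RuntimeError(f"{path} is too big for the NAS")
    shutil.copytree(path, nas / path.name)   # rsync -a in a real cron job

assert (nas / "nagios" / "nagios.cfg").exists()
```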

Backups and Spares

I have toured the MLCR with Craig and Henry, and I have a list of the spares situation. I have summarised this in a document, but it has not yet been checked or published. I don't think we are in too bad a position – the main conclusion is that we may need to buy a few additional servers, depending on whether we feel we can tolerate the risk of the current ones failing. A more comprehensive list will be available soon.

Spares for the DAQ – Yordan has specified that we need one spare TDC and one spare digitiser. I understand that the funds for these items have been found.

Documentation

The documentation needs improving. I see several areas:
- Hardware
- Processes and checks
- Nagios
- Online on-call

An additional large portion of documentation will be required for the online on-call, but this will rely on the technical documentation being improved first. The deadline for completing the documentation is when we require online on-call experts to be trained. I will say more about the on-call in a few slides' time.

DAQ & DAQ/MICE Commissioning Run

As Yordan will undoubtedly detail, there has been some progress here:
- 7 new DAQ machines have been installed in the MLCR, with OS and software installed. 5 are for active use and 2 are spares: 1 is earmarked as a tracker DAQ spare (it is in the tracker rack) and 1 is a general DAQ spare.
- There will be a period in November (16th–26th) when the trackers will be tested. Most of the tracker DAQ group will be around at this time.
- "We will want to connect waveguides to the solenoid… As for detectors, the most invasive thing I'd like to do is inject light into TOF1 (to trigger) at the same time as firing the tracker LEDs to get a sense for the triggering delays. It's not clear to me whether direct light injection into the TOFs has ever been done, or whether it is permitted." – D. Adey
- I will try to use this time to become a bit more familiar with the DAQ; this may help highlight any potential holes in the schedule.
- There is a final commissioning date of 21st January, for which the online group needs to be ready.

Network Switches

The order for the new network stack has gone in. It's an HP 5406R zl2 and should arrive by the end of October – we will then schedule an installation date.

We have ordered a 6-slot chassis unit (J9821A), with:
- 2 x PSUs (J9828A)
- 4 x 24-port GigE modules (J9550A)
- 2 x 20-port GigE modules with 2 x SFP+ slots (J9536A)

"The warranty is free next-business-day replacement for the lifetime of the product. We tend to buy a spare and have it on the shelf for a quick replacement whilst raising a ticket for the free replacement of the spare." – RAL Networks

On-Call Rota

A draft document has been produced. It includes:
1) Introduction – a definition of what the on-call responsibilities are in general terms.
2) A list of responsibilities.
3) Staffing – how do we staff? Share responsibility with other groups; there is a need for clear documentation and training.
4) Recruitment – the number of individuals required has not yet been ascertained; there is a requirement for some technical expertise.
5) Training – defines the need to allow enough time to write the documentation and to train suitable and willing individuals.

Chris has received this document and has kindly suggested, and was implementing, some changes to it.

Online Monitoring

The online monitoring is working, but there was a comment that users weren't getting anything useful out of it. The system will be changed to compare live plots with a set of reference histograms.

Rhys Gardener is in the process of finding out exactly what needs monitoring and how it can be turned into something that gives timely and useful information to the individuals on shift. Rhys has commented that he can do some testing of the software with empty plots, i.e. during the proposed January run, but ideally he requires beamline conditions, so the target activation run in February will be an ideal time to test this.
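As an illustration only (this is not the actual monitoring code, and the bin contents below are made up), comparing a live plot against a reference histogram can be as simple as a normalised chi-square per bin, with a threshold deciding whether the shifter sees "OK" or an alarm:

```python
"""Toy live-vs-reference histogram comparison; all numbers invented."""
reference = [100, 220, 310, 190, 80]   # known-good reference run
live_good = [98, 225, 305, 195, 77]    # same shape, normal fluctuations
live_bad  = [100, 220, 30, 190, 80]    # e.g. a dead channel in one bin

def chi2_per_bin(live, ref):
    """Chi-square per bin, with the reference scaled to the live statistics."""
    total_live, total_ref = sum(live), sum(ref)
    chi2 = 0.0
    for l, r in zip(live, ref):
        expected = r * total_live / total_ref
        chi2 += (l - expected) ** 2 / expected
    return chi2 / len(ref)

assert chi2_per_bin(live_good, reference) < 1.0    # shifter sees "OK"
assert chi2_per_bin(live_bad, reference) > 10.0    # shifter gets an alarm
```

A real implementation would also have to handle empty reference bins and pick thresholds per histogram, which is exactly the kind of tuning that needs beamline conditions.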

Conclusions

Servers – there have been some additions to the server list with the MLCR upgrade, some changes with the DAQ machine upgrade and the Nagios (test) server, and the proposed changes to the way the IOCs are handled by Pierrick.

Most urgently, I believe we need to address backups – what is being backed up, are there enough backups, and can we recover? Contact the online group if you have critical files that need backing up!

The schedule is coming together. It has evolved over the last couple of months into something that is representative of what needs to be accomplished for the Step IV run.

The documentation requires some attention. There is always something more urgent to do, but it is scheduled because it is important.

Are there any unknown unknowns? The upcoming testing should help to answer this.

I don't believe the schedule is resource-limited – but there are a lot of things to do!