GridPP Deployment & Operations

GridPP has built a Computing Grid of more than 5,000 CPUs, with equipment based at many of the particle physics centres in the UK. It includes an LHC Tier-1 regional centre at the Rutherford Appleton Laboratory (RAL) and 4 regional Tier-2s: ScotGrid, NorthGrid, SouthGrid and the London Tier-2. Over the next year the number of CPUs will increase to over 10,000 as the LHC begins operation.

The EGEE Computing Grid

GridPP manages the UK contribution to the Large Hadron Collider (LHC) Computing Grid (LCG) and other UK HEP grid activities. LCG is a major contributor to the Enabling Grids for E-sciencE (EGEE) project. The EGEE grid is currently the largest functioning Grid in the world, with over 40,000 CPUs and over 30 million GB of storage at over 150 sites in 40 countries. Thousands of jobs are run every day on the EGEE production Grid by users from a wide range of Virtual Organisations.

Organisation

The 19 sites in the UK have formed into 4 regional Tier-2s: ScotGrid, NorthGrid, SouthGrid and the London Tier-2. RAL hosts another site, the GridPP Tier-1, which is managed independently of the other resources. Each Tier-2 has a technical coordinator and a manager to direct the work of the system administrators and monitor the provision of resources. There is also a core deployment team who support and advise in areas such as storage, networking and monitoring.

Middleware releases

Core grid services are run by the Regional Operations Centres (ROCs). They provide information about the resources available, the location of data and who is allowed to use what. The sites run a specific set of software, the middleware, to provide data for information systems and interpret grid job requests. As this is updated, the deployment team work with the sites to ensure problems are solved quickly. Periods of downtime are scheduled in a Grid Operations Centre Database, which is linked into the mechanism users use to submit jobs, so that a site that is unavailable does not receive any jobs.
This database also serves as a useful repository of information about a site, which can be used for accounting and monitoring.

Preparing for LCG

GridPP deployment and operations needs to ensure the sites are prepared for the volume of data expected when the LHC experiments enter full operation. LCG has focussed on Tier-0 to Tier-1 testing. GridPP has extended this to Tier-1 to Tier-2 testing, to understand bottlenecks in the connecting networks and to optimise storage resource management arrangements at sites.

Site Monitoring

The deployment team is responsible for monitoring and solving problems with the sites on a daily basis. This area is developing quickly, with support coming from across the EGEE project. Most of the tools which have been made available rely on information from the Site Availability Monitoring (SAM) tests. The SAM tests are run every 3 hours for each site to test basic functionality. All results from these and other tests are reviewed on a regular basis by a team of operators who spot areas of concern and submit trouble tickets to the Regional Operations Centre (ROC), site or user via the Global Grid User Support (GGUS) centre based in Karlsruhe, Germany. The UK & Ireland helpdesk interfaces directly to the GGUS system.

Resource usage

GridPP (together with LCG and EGEE) has developed a number of metrics to assist with the ongoing challenge of ensuring that resources are being utilised effectively. They cover areas such as the percentage of successful jobs, the number of users, and the trouble tickets issued to a site. These reveal information which helps improve performance and reliability.

Computer racks full of worker nodes, a typical scene at each of the GridPP sites.

A snapshot from the UK-developed Real Time Grid Monitor showing sites and jobs on the EGEE grid.
GridPP sites and their grouping within regional Tier-2s.

Screen shots of two site monitoring tools, with tests run every few hours to identify problems.

The results of a set of sustained inter-site data transfer tests during … The graph shows the rates achieved, first copying data into the Edinburgh storage element across the wide area network and then reading it back out again. All GridPP sites have been tested with such single-direction transfers as well as simultaneous inbound and outbound transfers.

Example usage views: relative usage of EGEE resources by VO (pie chart) and successful hours of GridPP CPU time by the LHC experiments for Q
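
As an illustration of the resource usage metrics described under "Resource usage", the sketch below computes two of them per site: the percentage of successful jobs and the number of distinct users. It is only a minimal sketch. The `JobRecord` layout, the `site_metrics` function and the sample site and user names are hypothetical, and do not reflect the actual GridPP, LCG or EGEE accounting tools or data formats.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class JobRecord:
    """One finished grid job (hypothetical record layout, for illustration only)."""
    site: str
    user: str
    succeeded: bool


def site_metrics(jobs):
    """Return {site: {"jobs", "success_pct", "users"}} for an iterable of JobRecord."""
    totals = defaultdict(int)      # jobs seen per site
    successes = defaultdict(int)   # successful jobs per site
    users = defaultdict(set)       # distinct users per site
    for job in jobs:
        totals[job.site] += 1
        if job.succeeded:
            successes[job.site] += 1
        users[job.site].add(job.user)
    return {
        site: {
            "jobs": totals[site],
            "success_pct": 100.0 * successes[site] / totals[site],
            "users": len(users[site]),
        }
        for site in totals
    }


# Made-up sample data; the site and user names are placeholders.
sample = [
    JobRecord("site-a", "alice", True),
    JobRecord("site-a", "bob", False),
    JobRecord("site-a", "alice", True),
    JobRecord("site-a", "alice", True),
    JobRecord("site-b", "carol", True),
]
metrics = site_metrics(sample)
for site, values in sorted(metrics.items()):
    print(site, values)
```

A count of trouble tickets per site could be added in the same way, given a second list of ticket records keyed by site.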