EU funding for DataGrid under contract IST-2000-25182 is gratefully acknowledged

GridPP Tier-1A Centre

CCLRC provides the GridPP collaboration (funded by PPARC) with a large computing facility at RAL based on commodity hardware and open source software. The current resources (~700 CPUs, ~80 TB of usable disk and 180 TB of tape) provide an integrated service as a prototype Tier-1 centre for the LHC project and a Tier-A centre for the BaBar experiment based at SLAC. The service is expected to grow substantially over the next few years.

The concept of a Tier-1 centre predates the grid. The MONARC project described a model of distributed computing for the LHC with a series of layers or "tiers": the Tier-0 at CERN holds all the raw data, each Tier-1 holds all the reconstructed data, and the Tier-2s and the individual institutes and physicists hold smaller and smaller abstractions of the data. The model also works partly in reverse, with the Tier-1 and Tier-2 centres producing large amounts of simulated data that migrate from Tier-2 to Tier-1 to CERN.

The principal objective of the Tier1A is to assist GridPP in the deployment of experimental grid technologies and to meet its international commitments to various projects. It runs five independent international grid testbeds, four for the European DataGrid (EDG) collaborations and one for the LCG project. It provides grid gateways into the main Tier1A "production" service for a number of projects, including the EDG and LCG projects, BaBarGrid for the SLAC-based BaBar experiment, and SAM-Grid for the Fermilab CDF and D0 collaborations.

EDG Job Submission

Job submission into the Tier1A can be carried out from any networked workstation that has access to the EDG User Interface software. The EDG software provides the following tools (a worked example appears at the foot of this panel):
1. Authentication: grid-proxy-init
2. Job submission: edg-job-submit
3. Monitoring and control: edg-job-status, edg-job-cancel, edg-job-get-output
4. Data publication and replication: globus-url-copy, GDMP
5. Resource scheduling and use of mass storage systems: JDL, sandboxes, storage elements

Cluster Infrastructure

The Tier1A uses Red Hat Linux deployed on Pentium III and Pentium 4 rack-mounted PC hardware. Disk storage is provided by external SCSI/IDE RAID controllers and commodity IDE disks. An STK Powderhorn silo provides near-line tape access (see the Datastore poster).

COMPUTING FARM: 7 racks holding 236 dual Pentium III/P4 CPUs.
DISK STORE: 80 TByte disk-based mass storage unit after RAID 5 overhead. PCs are clustered on network switches with up to 8x1000 Mb ethernet.
TAPE ROBOT: Upgraded last year; uses 60 GB STK 9940 tapes. Current capacity is 45 TB, but the silo could hold 330 TB.
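Putting the EDG User Interface tools listed above together, a typical submission session might look like the sketch below. This is a hypothetical walkthrough rather than the Tier1A's documented procedure: the JDL attributes shown are standard ones, but the script, file names, job-identifier file and storage-element URL are invented for illustration.

# Hypothetical EDG UI session; file names, hosts and paths are illustrative.

# 1. Authentication: create a short-lived proxy from the user's grid certificate.
grid-proxy-init

# 2. Describe the job in a JDL file; the sandboxes list the files shipped
#    to the worker node and returned with the job output.
cat > hello.jdl <<'EOF'
Executable    = "hello.sh";
StdOutput     = "hello.out";
StdError      = "hello.err";
InputSandbox  = {"hello.sh"};
OutputSandbox = {"hello.out", "hello.err"};
EOF

# 3. Job submission: the resource broker matches the JDL against available
#    computing elements; -o saves the returned job identifier to a file.
edg-job-submit -o job.id hello.jdl

# 4. Monitoring and control, then retrieval of the output sandbox.
edg-job-status     -i job.id
edg-job-get-output -i job.id
# edg-job-cancel   -i job.id     # abandon the job if necessary

# 5. Data publication and replication: copy a result file to a storage
#    element over GridFTP (the SE host and path here are invented).
globus-url-copy file://$PWD/hello.out \
  gsiftp://some-storage-element.example.org/data/hello.out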
GridPP Tier-1A Centre

The service needs to be flexible and able to rapidly deploy staff and hardware to meet the changing needs of GridPP. Access can be either by traditional login with manual job submission and editing, or via one of several deployed grid gateways.

There are five independent grid testbeds available. These range from a few standalone nodes for EDG work-package developers to ten or more systems deployed into the international EDG and LCG production testbeds, which take part in wide-area work. A wide range of work runs on the EDG testbed, including work from the Biomedical and Earth Observation work packages.

A number of grid gateways have been installed into the production farm, where the bulk of the hardware is available. Some, such as the EDG/LCG gateways, are maintained by Tier1A staff, while others, such as SAM-Grid, are managed by the user community in close collaboration with the Tier1A. Once through the gateway, jobs are scheduled within the main production facility.

Behind the gateways, the hardware is deployed into a number of logical hardware pools running three separate releases of Red Hat Linux, all controlled by the OpenPBS batch system and the MAUI scheduler. With so many systems, it is necessary to automate as much of the installation process as possible. We use LCFG on the grid testbeds and a complex Kickstart infrastructure on the main production service.

Behind the scenes are a large number of service systems and business processes, such as:
· File servers
· Security scanners and password crackers
· RPM package monitoring and update services
· Automation and system monitoring
· Performance monitoring and accounting
· System consoles
· Change control
· Helpdesk
Gradually over the next year it is likely that some of the above will be replaced by grid-based fabric management tools being developed by EDG.

Cluster Management Tools

Real-time Performance Monitoring
On a large, complex cluster, real-time monitoring of the system statistics of all components is needed to understand workflow and assist in problem resolution. We use Ganglia to provide detailed system metrics such as CPU load, memory utilisation and disk I/O rates. Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and grids. It is based on a hierarchical design targeted at federations of clusters and relies on a multicast-based listen/announce protocol to monitor state within clusters.

Near Real-time CPU Accounting
CPU accounting data is vital for meeting resource allocations and monitoring activity on the cluster. The Tier1A service takes accounting information from the OpenPBS system and loads it into a MySQL database for post-processing by locally written Perl scripts (a sketch of this flow appears at the end of this panel).

Automate
The commercial Automate package, coupled to CERN's SURE monitoring system, is used to maintain a watch on the service. Alarms are raised in SURE if systems break or critical system parameters are found to be out of bounds. In the event of a mission-critical fault, SURE notifies Automate, which in turn makes a decision based on the time of day and the agreed service level before escalating the problem, for example by paging a computer operator outside office hours.

Helpdesk
Request Tracker is used to provide a web-based helpdesk. Queries can be submitted either by email or via the web before being queued into subject queues in the system and assigned ticket numbers. Specialists can then take ownership of a problem, and an escalation and notification system ensures that tickets do not get overlooked.
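As a rough illustration of the accounting flow described under "Near Real-time CPU Accounting": the Tier1A's own post-processing is done with locally written Perl scripts, but the shell/awk sketch below shows the general idea of extracting job-end records from an OpenPBS accounting log and loading them into MySQL. The log path, database name and table layout here are assumptions, not the Tier1A's actual setup.

# Illustrative sketch only: the real post-processing uses local Perl scripts;
# the log path, database and table names below are invented.
ACCT_FILE=/var/spool/pbs/server_priv/accounting/20031015   # assumed log location

# OpenPBS accounting records look roughly like:
#   <timestamp>;E;<jobid>;user=... queue=... resources_used.cput=hh:mm:ss ...
# Pick out job-end ("E") records and emit one INSERT per finished job.
awk -F';' '$2 == "E" {
    jobid = $3; user = ""; queue = ""; cput = "";
    n = split($4, kv, " ");
    for (i = 1; i <= n; i++) {
        if (kv[i] ~ /^user=/)                 user  = substr(kv[i], 6);
        if (kv[i] ~ /^queue=/)                queue = substr(kv[i], 7);
        if (kv[i] ~ /^resources_used\.cput=/) cput  = substr(kv[i], 21);
    }
    printf "INSERT INTO pbs_jobs (jobid, username, queue, cput) VALUES (\"%s\", \"%s\", \"%s\", \"%s\");\n", jobid, user, queue, cput;
}' "$ACCT_FILE" | mysql accounting   # assumes an "accounting" database and pbs_jobs table exist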