BINP/GCF Status Report Jan 2010


1 BINP/GCF Status Report Jan 2010 A.S.Zaytsev@inp.nsk.su

2 Overview
- Current status
- Resource accounting
- Summary of recent activities and achievements
- BINP/GCF & NUSC (NSU) integration
- BINP LCG site related activities
- Proposed hardware upgrades
- Future prospects

3 BINP LCG Farm: Present Status
- CPU: 40 cores (100 kSI2k)
- RAM: 200 GB
- HDD: 25 TB raw (22 TB visible)
- Input power limit: 15 kVA
- Heat output: 5 kW

4 Resource Allocation Accounting (up to 80 VM slots are now available within 200 GB of RAM)
Computing Power (90% full, 150% reserved, 200% limit):
- LCG: 4 host systems now (40%); a 70% share is planned for production with ATLAS VO (near future)
- KEDR: 4.0-4.5 host systems (40-45%)
- VEPP-2000, CMD-3, SND, test VMs, etc.: 1.5-2.0 host systems (15-20%)
Centralized Storage (35% full, 90% reserved, 100% limit):
- LCG: 0.5 TB (VM images) + 15 TB (DPM + VO SW)
- KEDR: 0.5 TB (VM images) + 4 TB (local backups)
- CMD-3: 1 TB reserved for the scratch area & local home
- NUSC / NSU: up to 4 TB reserved for the local NFS/PVFS2 buffer
A rough sketch of this accounting arithmetic is given below.
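The following is a minimal sketch of the slot and share arithmetic above. It assumes uniformly sized VM slots and takes the upper-bound host-system counts (4 + 4.5 + 1.5 = 10) as the whole farm; neither assumption is stated explicitly on the slide.

```python
# Sketch of the VM-slot and host-system share accounting (illustrative only).
TOTAL_RAM_GB = 200
VM_SLOTS = 80
print("RAM per VM slot:", TOTAL_RAM_GB / VM_SLOTS, "GB")  # 2.5 GB, assuming uniform slots

# Upper-bound host-system shares quoted on the slide (assumed to sum to the full farm).
shares = {"LCG": 4.0, "KEDR": 4.5, "VEPP-2000/CMD-3/SND/test": 1.5}
total_hosts = sum(shares.values())  # 10 host systems
for name, hosts in shares.items():
    print(f"{name}: {hosts} host systems = {100 * hosts / total_hosts:.0f}%")
```

With these assumptions the shares come out at 40% / 45% / 15%, consistent with the ranges quoted on the slide.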

5 BINP/GCF Activities in 2009Q4 (sorted by priority, from highest to lowest)
- [done] Testing and tuning the 10 Gbps NSC/SCN channel to NSU and getting it to production state
- [done] Deploying a minimalistic LCG site locally at BINP
- [done] BINP/GCF and NUSC (NSU) cluster network and virtualization systems integration
- [done] Probing the feasibility of efficient use of resources under VMware with native KEDR VMs deployed in various ways
- [done] Finding the long term stable configuration of KEDR VMs while running on several host systems in parallel
- [in progress] Getting to production with ATLAS VO with the 25 kSI2k / 15 TB SLC4 based LCG site configuration
- [in progress] Preparing LCG VMs for running on the NUSC (NSU) side
- [in progress] Studying the impact of BINP-MSK & BINP-CERN connectivity issues on GStat & SAM test failures

6 BINP/GCF & NUSC (NSU) Integration
BINP/GCF runs XEN images; NUSC runs VMware images (converted from XEN). Various deployment options were studied:
- IDE/SCSI virtual disk (VD)
- VD performance/reliability tuning
- Locally/centrally deployed images
- 1:1 and 2:1 VCPU/real CPU core modes
- Allowing swap to be disabled on the host system
Up to 2 host systems with 16 VCPUs combined have been tested (1 GB RAM/VCPU). So far, long term stability (up to 5 days) has been demonstrated only for locally deployed VMs; most likely the problems are related to the centralized storage system of the NUSC cluster. Work is now suspended due to the hardware failure of the NSC/SCN switch on the BINP side (more news by the end of the week). A conversion sketch is given below.
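The slides do not say which tool was used to convert the XEN images to VMware format; qemu-img is one common option, so the helper below is only an illustrative sketch under that assumption (file names are hypothetical).

```python
"""Illustrative sketch: convert a Xen raw disk image to a VMware VMDK with qemu-img.
The actual conversion procedure used for the KEDR VMs is not described in the slides."""
import subprocess

def xen_to_vmdk(src_img, dst_vmdk, src_format="raw"):
    # Equivalent to: qemu-img convert -f raw -O vmdk <src> <dst>
    subprocess.check_call(
        ["qemu-img", "convert", "-f", src_format, "-O", "vmdk", src_img, dst_vmdk]
    )

if __name__ == "__main__":
    # Hypothetical file names for a KEDR worker VM image.
    xen_to_vmdk("kedr-wn01-xen.img", "kedr-wn01.vmdk")
```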

7 BINP LCG Site Related Activities
STEP 1: DONE
- Defining the basic site configuration, deploying the LCG VMs, going through the GOCDB registration, etc.
STEP 2: DONE
- Refining the VM configuration, tuning up the network, getting new RDIG host certs, VO registration, handling the errors reported by SAM tests, etc.
STEP 3: IN PROGRESS
- Get OK for all the SAM tests (currently being dealt with)
- Confirm the stability of operations for 1-2 weeks
- Upscale the number of WNs to the production level (from 12 up to 32 CPU cores = 80 kSI2k max; see the sketch below)
- Ask ATLAS VO admins to install the experiment software on the site
- Test the site's ability to run ATLAS production jobs
- Check whether the 110 Mbps SB RAS channel is capable of carrying the load of an 80 kSI2k site
- Get to production with ATLAS VO
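A quick check of the capacity figure above, using the per-core rating implied by the present-status slide (100 kSI2k over 40 cores, i.e. roughly 2.5 kSI2k per core; the uniform-rating assumption is mine, not stated on the slides):

```python
# Per-core rating implied by the present-status slide: 100 kSI2k / 40 cores.
KSI2K_PER_CORE = 100.0 / 40.0  # ~2.5 kSI2k per core, assumed uniform across WNs

def site_capacity_ksi2k(cores):
    """Rough LCG site CPU capacity for a given number of worker-node cores."""
    return cores * KSI2K_PER_CORE

print(site_capacity_ksi2k(32))  # 80.0 kSI2k, matching the production-level figure
print(site_capacity_ksi2k(40))  # 100.0 kSI2k, the full farm
```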

8 BINP/GCF Activities in 2010Q1-2 (sorted by priority, from highest to lowest)
- Recovering from the 10 Gbps NSC/SCN failure on the BINP side
- Getting to production with 32-64 VCPUs for KEDR VMs on the NUSC side
- Recovering BINP LCG site visibility under GStat 2.0
- Getting to production with ATLAS VO with the 25 kSI2k / 15 TB LCG site configuration
- Testing LCG VMs on the NUSC (NSU) side
- Finding a stable configuration of LCG VMs for NUSC
- Upscaling the LCG site to 80-200 kSI2k by using both BINP/GCF and NUSC resources
- Migrating the LCG site to SLC5.x86_64 and CREAM CE as suggested by ATLAS VO and RDIG
- Making a quantitative conclusion on how the existing NSC networking channel limits our LCG site performance/reliability
- Allowing other local experiments to access NUSC resources via GRID farm interfaces (using the farm as a pre-production environment)

9 Future Prospects
- Major upgrade of the BINP/GCF hardware focusing on storage system capacity and performance:
  - Up to 0.5 PB of online storage
  - Switched SAN fabric
- Further extension of the SC Network and virtualization environment:
  - TSU with 1100+ CPU cores is the most attractive target
- Solving the problem with NSK-MSK connectivity for the LCG site:
  - A dedicated VPN to MSK-IX seems to be the best solution
- Start getting the next generation hardware this year:
  - 8x increase of CPU core density
  - Adding a DDR IB (20 Gbps) network to the farm
  - 8 Gbps FC based SAN
  - 2x increase of storage density
- Establish private 10 Gbps links between the local experiments and the BINP/GCF farm, thus allowing them to use NUSC resources

10 680 CPU cores / 540 TB Configuration (2012, prospected)
- 16 CPU cores / 1U, 4 GB RAM / CPU core
- 8 Gbps FC SAN fabric
- 20 Gbps (DDR IB) / 10 Gbps (Ethernet) / 4x 1 Gbps (Ethernet) interconnect
- 95 kVA UPS subsystem
- 1.4 M$ in total

11 168 CPU cores / 300 TB Configuration (2010, proposed)
- 55 kVA UPS subsystem
- 5x CPU power, 10x storage capacity
- DDR IB & 8 Gbps FC added already at this stage
- +14 MRub

12 PDU & Cooling Requirements
PDU:
- 15 kVA are available now (close to the limit; no way to plug in the proposed 20 kVA UPS devices!)
- 170-200 kVA (0.4 kV) & APC EPO subsystems are needed (a draft of the technical specs was prepared in 2009Q2)
- Engineering drawings for the BINP/ITF hall have been recovered by CSD
- The list of requirements is yet to be finalized
Cooling:
- 30-35 kW are available now (7 kW modules, open technical water circuit)
- 120-150 kW of extra cooling is required (assuming an N+1 redundancy scheme; see the sizing sketch below)
- Various cooling schemes were studied; locally installed water-cooled air conditioners seem to be the best solution (18 kW modules, closed water loop)
- No final design yet
Once the hardware purchasing plans for 2010 are settled, the upgrade must be initiated.
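The slide does not state how many 18 kW units the N+1 scheme would require, so the sizing below is only an illustrative sketch of that arithmetic.

```python
import math

def cooling_modules(required_kw, module_kw=18, spare=1):
    """N+1 sizing: enough modules to cover the load plus one spare unit."""
    return math.ceil(required_kw / module_kw) + spare

# Extra cooling of 120-150 kW with 18 kW water-cooled units:
for load_kw in (120, 150):
    print(load_kw, "kW ->", cooling_modules(load_kw), "modules")
# 120 kW -> 8 modules (7 + 1 spare); 150 kW -> 10 modules (9 + 1 spare)
```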

13 Prospected 10 Gbps SC Network Layout (diagram; labels: 1000+ CPU cores, 2010Q3-4; 1100+ CPU cores, since 2007)

14 Summary
- Major success has been achieved in integrating the BINP/GCF and NUSC (NSU) computing resources
- The scheme tested with KEDR VMs should be exploited by other experiments as well (e.g. CMD-3)
- The 10 Gbps channel (once restored) will allow the direct use of NUSC resources from the BINP site (e.g. ANSYS for the needs of VEPP-2000)
- The LCG site may take advantage of the NUSC resources as well (200 kSI2k would give us a much better standing)
- The upgrade of the BINP/ITF infrastructure is required for installing the new hardware (at least for the PDU subsystem)
- If we are able to get the extra networking hardware as proposed, we may start plugging the experiments into the GRID farm and NUSC resources with 10 Gbps Ethernet uplinks this year

15 Questions & Comments

