BEIJING-LCG2 Site Report

Computing Infrastructure
~13,000 CPU cores
5 PB disk space
5 PB tape library
Water cooling
Power supply system
Computer Center, IHEP

Local Topology of Computing & Storage

Computing And Storage Resources
Lustre as main disk storage
  9 MDSs, 43 OSSs, 590 OSTs
  Version:
  Capacity: 4.2 PB
Gluster experimental system
  350 TB of storage provided for cosmic-ray experiments
  Dynamic and scalable distributed metadata service; solves ls/mkdir performance problems and directory tree inconsistency
DPM & dCache (WLCG)
  940 TB, with SRM interface
HSM, with modified CASTOR
  2 tape libraries + 2 robots
  26 drives
  Capacity: 5 PB
Local cluster
  CPU cores +300 GPU cards
  Scheduler 1: PBS-2.5.5, Scheduler 2: Condor
  300+ active users
Grid site (WLCG)
  1700 cores
  Torque

Cluster & Storage Service Topology
Local User
Grid User
Dirac
Grid WMS
Login Nodes
Grid PBS Cluster: 1700 cores
Local PBS Cluster: 12000 cores
Cloud Cluster: 240 cores
Lustre: 4.2 PB
dCache: 540 TB
DPM: 400 TB
Gluster: 350 TB
Castor: 5 PB
Part of the Lustre and Gluster data is backed up in Castor, usually user data and some experiment data.

BEIJING-LCG2 GRID Site Services
On cloud
APEL, MyProxy, Vobox, Frontier-Squid, LFC, BDII, UI
Cream CE, Torque, ArcCE, WN
dCache pool node, DPM pool node
Network: 1 Gb Ethernet, 10 Gb Ethernet
Storage: 940 TB (ATLAS: 400 TB, CMS: 540 TB)
Computing: 1700 cores, HEPSPEC 23000

WLCG Computing Resource
Disk capacity: 940 TB (ATLAS 400 TB, CMS 540 TB)
24-slot Infortrend boxes with 4 TB disks
Each box has two arrays; each array has 12 disks in RAID 6
Usable capacity: 80 TB per box
Disk servers
  ATLAS: 2 x HP 380g9 + 1 Dell R630
  CMS: 2 x HP 380g9 + 2 Dell R630
  CPU: E5-2630v3, Mem: 64 GB, 10 Gb Ethernet, 8GB HBA
New worker nodes
  29 x HP blade bl460c gen9
  CPU: 2 x E V3 12 core
  Mem: 4 x 16 GB
  Disk: 1 TB SATA
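The 80 TB per box figure follows from the RAID 6 layout: each 12-disk array gives up two disks' worth of capacity to parity. A minimal Python sketch of that arithmetic, assuming raw disk sizes and ignoring filesystem overhead (the function name is illustrative, not from the slides):

```python
def raid6_usable_tb(disks_per_array, disk_tb, arrays=1):
    """Usable capacity of RAID 6 arrays; two disks per array go to parity."""
    if disks_per_array < 4:
        raise ValueError("RAID 6 needs at least 4 disks per array")
    return arrays * (disks_per_array - 2) * disk_tb


# One 24-slot box: two 12-disk RAID 6 arrays of 4 TB disks.
per_box = raid6_usable_tb(disks_per_array=12, disk_tb=4, arrays=2)
print(per_box)  # 80
```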

Hepspec06
Before: 10832 (70*112 + 16*187)
New blades: 13718 (29*473.06)
Now: 24550
After retirement: (29* *187.61)
Pledge:
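The HEPSPEC06 totals can be rechecked from the products quoted on the slide. The sketch below reads each product as a node count times a per-node score, which is an assumption about the slide's notation; the variable and function names are illustrative.

```python
# (node count, HEPSPEC06 per node) pairs as quoted on the slide
old_nodes = [(70, 112), (16, 187)]
new_blades = [(29, 473.06)]


def total_hs06(groups):
    """Sum count * per-node score over all node groups."""
    return sum(count * per_node for count, per_node in groups)


before = total_hs06(old_nodes)
added = total_hs06(new_blades)
now = before + added

# The slide truncates the fractional part, so compare integer parts.
print(before, int(added), int(now))
```

The integer parts agree with the figures on the slide (10832, 13718 and 24550).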

Other Grid Services
All our grid services run on VMware ESXi hosts.
All the servers were replaced with new machines: DELL R630 and R620
CPU: E5-2630V3/E5-2640v2, Mem: 64 GB, Disk: 4 x 1 TB, 1 Gb Ethernet

Middleware, OS, Configuration Tools
Middleware: upgraded to EMI3 (May 2014)
OS: all servers run Scientific Linux 6.5 (May 2014)
Configuration tools: Foreman + Puppet (May 2014)

Job Number       1,702,760   3,015,983
Walltime         4,658,857   5,281,175
CPU Efficiency   84.7%       90.7%
Node Utility     50.35%      68.06%

System Maintenance
Firewall
Argus certificate OCSP
Aging hardware, more frequent hardware failures, and full disks are the reasons problems have gradually increased.

Grid Jobs statistics 2015

Site Troubleshooting
Lots of "a_wait process": added memory to ccsrm.ihep.ac.cn.
Cream CE timeout: Argus uses OCSP to update certificates; upgraded the argus package.
SAM test failed only at 0 o'clock: synced uid and gid across all the servers.
SAM test timeout: it was a DNS problem; changed DNS to another server.
SAM test failed: the network has high packet loss between China and Europe.
Could not update the CERN CA: the connection to CERN was blocked by the upstream network service.
CVMFS warning: cleaned the CVMFS cache.
dCache pool node high load and service crashed.
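One of the fixes above was syncing uid and gid across all servers. The slides do not say how the mismatch was found; as a hedged illustration, the sketch below compares passwd-format text collected from several hosts and reports users whose (uid, gid) pair differs. The host names, accounts and function names are hypothetical.

```python
def parse_passwd(text):
    """Map user name -> (uid, gid) from passwd-format text."""
    ids = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) < 4:
            continue  # skip malformed entries
        ids[fields[0]] = (int(fields[2]), int(fields[3]))
    return ids


def find_mismatches(hosts):
    """hosts: {hostname: passwd text}. Return users whose ids differ between hosts."""
    seen = {}
    for host, text in hosts.items():
        for user, pair in parse_passwd(text).items():
            seen.setdefault(user, {})[host] = pair
    return {user: per_host for user, per_host in seen.items()
            if len(set(per_host.values())) > 1}


# Hypothetical sample data: cms001 has a different uid on the worker node.
sample = {
    "ce.example": "atlas001:x:5001:5000::/home/atlas001:/bin/bash\n"
                  "cms001:x:6001:6000::/home/cms001:/bin/bash",
    "wn.example": "atlas001:x:5001:5000::/home/atlas001:/bin/bash\n"
                  "cms001:x:6002:6000::/home/cms001:/bin/bash",
}
print(find_mismatches(sample))
```

In practice the passwd text would come from each server (or from the account source managed by the configuration tool); here it is inlined so the example runs on its own.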

Configuration Tools
Quattor -> Puppet: migration finished in May 2014.
Puppet server 3.8
Provisioning server: Foreman 1.8

NGI_CHINA
Name: NGI_CHINA
ROC Web:
SAM-NAGIOS:
Myegi:
Member:
