ATLAS computing status in IHEP
Erming Pei, CC-IHEP
Yangzhou, May 15th, 2009

Agenda
– Farm
– Grid
– Issues
– File System

Resource
Old farm
– SLC3: atlas02 + 8 cores
– SLC4: autilas + 16 cores (will be integrated into the new farm)
New farm
– atlasui02 + 128 cores
– New server, new release (in testing)

Storage

File System                          Size  Used  Avail  Use%  Mounted on
HOME
– AFS                                8.6G  0     8.6G   0%    /afs
– 202.122.33.48:/home/atlas          932G  88G   845G   10%   /ihepbatch/home-atlas
Software
– bjlcg2.ihep.ac.cn:/data/exp_soft   1.1T  327G  791G   30%   /ihepbatch/exp_soft
– autilas.ihep.ac.cn:/opt/atlassw    29G   16G   13G    57%   /opt/atlassw
Data
– 192.168.50.30:/atlas/data0         2.8T  543M  2.8T   1%    /ihepbatch/atlasdata0
– 192.168.50.30:/atlas/data1         3.7T  1.4T  2.3T   38%   /ihepbatch/atlasdata1
– 192.168.50.30:/atlas/data2         2.8T  1.3T  1.5T   47%   /ihepbatch/atlasdata2
– 192.168.50.30:/atlas/data3         3.1T  512K  3.1T   1%    /ihepbatch/atlasdata3
– 192.168.50.30:/atlas/data4         3.1T  512K  3.1T   1%    /ihepbatch/atlasdata4
– 192.168.50.30:/atlas/data5         3.1T  512K  3.1T   1%    /ihepbatch/atlasdata5
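The listing above has the shape of a `df -h` report. As a minimal sketch, a similar per-mount summary can be produced with Python's standard-library `shutil.disk_usage`. The helper names `human` and `usage_line` and the mount list are illustrative assumptions, and the percentage is computed as used/total, which can differ slightly from what `df` prints.

```python
import shutil


def human(nbytes):
    """Format a byte count with a binary-unit suffix, roughly like df -h."""
    value = float(nbytes)
    for unit in ("B", "K", "M", "G", "T"):
        if value < 1024 or unit == "T":
            return f"{value:.1f}{unit}"
        value /= 1024


def usage_line(mount):
    """Return 'size used avail use% mount' for one mount point."""
    total, used, free = shutil.disk_usage(mount)
    percent = 100 * used / total if total else 0
    return f"{human(total)} {human(used)} {human(free)} {percent:.0f}% {mount}"


if __name__ == "__main__":
    # Hypothetical mount list; on the farm this could be a data disk mount point.
    for mount in ["/"]:
        print(usage_line(mount))
```

Running it on a worker node before submitting jobs gives a quick view of which data disks still have free space.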

Storage (diagram labels): Grid software repository; AFS; /home/atlas; SE (DPM); ATLAS Disk Server; HOME; Local Data; ATLAS software; Grid Data; atlasui02; autilas; Torque/Maui

Software
– DQ2 end-user tools: /opt/atlassw/DQ2/endusers
– Ganga: 5.1.10 (updated by Lianyou)

Job management
– Server: Torque
– Scheduler: Maui
– Both are optimized
(Diagram labels: atlasui02, autilas, Torque Server, Maui)

Job monitor

Local DPM Access
T3 → T2: DPM access failed
– “rfio:/…”
Reason:
– Both Castor and DPM have rf* tools
– They use the same library: libshift.so
Solution:
– Link DPM library (libdpm.so) to Castor library (libshift.so)
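Since both Castor and DPM provide rf* tools built against a library named libshift.so, it can help to check which file that name actually resolves to on a given node. The sketch below only inspects symbolic links with the standard library. The helper names and the path in the usage section are hypothetical, and checking for a `libdpm` target reflects just one reading of the fix described above.

```python
import os


def resolve_library(path):
    """Return (is_symlink, real_path) for a shared-library path."""
    return os.path.islink(path), os.path.realpath(path)


def points_to(path, expected_name):
    """True if `path` resolves to a file whose name starts with `expected_name`."""
    _, real = resolve_library(path)
    return os.path.basename(real).startswith(expected_name)


if __name__ == "__main__":
    # Hypothetical location; adjust to wherever libshift.so lives on the node.
    candidate = "/usr/lib/libshift.so"
    print(resolve_library(candidate))
    print("resolves to a DPM library:", points_to(candidate, "libdpm"))
```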

Tests with Athena 14.2.23
Jobs:
– Simulation jobs
– Reconstruction jobs
Tests:
– Old farm
– New farm
– Front end
– Back end
– Interactive (directly on computing nodes)

GangaRobot

Stress tests (GangaRobot)

Panda Jobs

Grid (Tier-2)

Disk Usage

Issues
Many job failures in testing; only a few jobs succeeded.
Conclusion:
– I/O issue: standardize job submission operations and move data from HOME space to the Data disks.
– Most probably something is wrong with the new batch system (the latest version, Torque 2.4.1); we will change to other versions and test again.
– Next step: separate the local software environment from the Grid.

Issues (diagram labels): AFS; /home/atlas; SE (DPM); ATLAS Disk Server; HOME; Local Data; atlasui02; autilas; Torque/Maui; Local Software; Grid software repository; NFS

Comments
Standardize your operations
– Put your input data in /atlas/datax1, or read it from DPM.
– Submit jobs from /home/atlas/xxx; AFS space is currently not supported for batch jobs.
– Put your output data in /atlas/datax2.
– Please don't mix Home and Data space.
– Add some debug statements to your script, e.g., add 'hostname' to your job script so that you know which node your job ran on.
Insert intervals when submitting bulk jobs
Data space
– Public/Private
– Public datasets are classified by dataset name rather than by user name
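Two of the recommendations above, adding `hostname` to the job script and inserting intervals between bulk submissions, can be sketched as follows. This is an illustration under stated assumptions, not the site's actual submission procedure: it assumes Torque's `qsub` is on the PATH (with no script argument, `qsub` reads the script from standard input), and the function names and the default interval are hypothetical.

```python
import subprocess
import time


def make_job_script(command):
    """Build a job script that records the execution host before the payload."""
    return "\n".join([
        "#!/bin/bash",
        "hostname   # debug line: shows which node ran the job",
        command,
        "",
    ])


def submit_bulk(scripts, interval_s=5.0, submit=None, sleep=time.sleep):
    """Submit job scripts one by one, pausing `interval_s` seconds between them."""
    if submit is None:
        # Assumes Torque's qsub is on PATH; the script is passed on stdin.
        def submit(script):
            result = subprocess.run(["qsub"], input=script, text=True,
                                    capture_output=True, check=True)
            return result.stdout.strip()
    job_ids = []
    for index, script in enumerate(scripts):
        if index > 0:
            sleep(interval_s)  # spread the load on the batch server
        job_ids.append(submit(script))
    return job_ids
```

A call such as `submit_bulk([make_job_script("echo test")] * 3, interval_s=10)` would submit three jobs ten seconds apart. The `submit` and `sleep` parameters exist so the pacing logic can be exercised without a batch system.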

File System

Lustre (diagram labels): MDS Server; Disk Server

LUSTRE stress test (1): 600 BES analysis jobs ran for 8 hours without problems; read performance was stable at 800 MB/s.

LUSTRE stress test (2): 256 dd write jobs ran simultaneously for one day without problems; performance was stable at 350 MB/s.
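To put the two LUSTRE stress-test figures in per-job terms, the arithmetic can be sketched as below. It assumes the aggregate rates (800 MB/s read over 8 hours with 600 jobs, 350 MB/s write over one day with 256 dd jobs) were held steady for the whole run and uses decimal units (1 TB = 10^6 MB); the function names are illustrative.

```python
def per_job_rate(aggregate_mb_s, n_jobs):
    """Average throughput per job in MB/s."""
    return aggregate_mb_s / n_jobs


def total_volume_tb(aggregate_mb_s, hours):
    """Data moved at a steady aggregate rate, in TB (1 TB = 10**6 MB)."""
    return aggregate_mb_s * hours * 3600 / 1e6


read_rate = per_job_rate(800, 600)    # test 1: 600 analysis jobs
write_rate = per_job_rate(350, 256)   # test 2: 256 dd writers
print(f"read:  {read_rate:.2f} MB/s per job, {total_volume_tb(800, 8):.1f} TB in 8 h")
print(f"write: {write_rate:.2f} MB/s per job, {total_volume_tb(350, 24):.1f} TB in 24 h")
```

Under these assumptions each job averages roughly 1.3 to 1.4 MB/s in both tests.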

Real application test

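Test method
– Two dedicated test queues, btq1 and btq2, were set up on the cluster, each with 300 CPUs; each queue contains computing nodes with 2, 4, and 8 CPUs.
– 300, 250, 200, 150, 100, and 50 analysis jobs were submitted to each of the two queues.
– The analysis jobs in the two queues analyzed data files in the LUSTRE and GPFS file systems respectively (mainly read operations, with a small number of writes).
– During the runs, the efficiency and network traffic of the computing nodes and the network traffic of the file servers were monitored.
– The efficiency of a computing node is taken from its CPU USER utilization.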

Test results – CPU utilization ★

Test results – network traffic ★

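Conclusion
– Under current conditions, running 150 analysis jobs simultaneously works well: CPU utilization reaches above 60%.
– Estimate: to run 1500 analysis jobs simultaneously and efficiently, a parallel file system backed by about 30 file servers is needed.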
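The estimate of about 30 file servers for 1500 jobs can be read as a simple linear scaling. The sketch below makes that arithmetic explicit. Linear scaling is an assumption here, and the slides do not state how many file servers were used in the test, so the figure of 3 servers is only what the stated ratio implies.

```python
import math


def servers_needed(target_jobs, jobs_per_server):
    """File servers required if load scales linearly with job count."""
    return math.ceil(target_jobs / jobs_per_server)


# The stated estimate: about 30 servers for 1500 jobs, i.e. 50 jobs per server.
jobs_per_server = 1500 / 30
print(jobs_per_server)                        # 50.0
print(servers_needed(1500, jobs_per_server))  # 30
# Under the same ratio, 150 concurrent jobs correspond to about 3 servers.
print(servers_needed(150, jobs_per_server))   # 3
```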

Questions?
