
ATLAS computing status in IHEP. Erming Pei, CC-IHEP. Yangzhou, May 15th, 2009.


1 ATLAS computing status in IHEP. Erming Pei, CC-IHEP. Yangzhou, May 15th, 2009

2 Agenda Farm Grid Issues File System

3 Farm

4 Resource
Old farm
– SLC3: atlas02 + 8 cores
– SLC4: autilas + 16 cores
– Will be integrated into the new farm
New farm
– atlasui02 + 128 cores
– New server, new release (in testing)

5 Storage
File System                         Size  Used  Avail  Use%  Mounted on
HOME
– AFS                               8.6G  0     8.6G   0%    /afs
– 202.122.33.48:/home/atlas         932G  88G   845G   10%   /ihepbatch/home-atlas
Software
– bjlcg2.ihep.ac.cn:/data/exp_soft  1.1T  327G  791G   30%   /ihepbatch/exp_soft
– autilas.ihep.ac.cn:/opt/atlassw   29G   16G   13G    57%   /opt/atlassw
Data
– 192.168.50.30:/atlas/data0        2.8T  543M  2.8T   1%    /ihepbatch/atlasdata0
– 192.168.50.30:/atlas/data1        3.7T  1.4T  2.3T   38%   /ihepbatch/atlasdata1
– 192.168.50.30:/atlas/data2        2.8T  1.3T  1.5T   47%   /ihepbatch/atlasdata2
– 192.168.50.30:/atlas/data3        3.1T  512K  3.1T   1%    /ihepbatch/atlasdata3
– 192.168.50.30:/atlas/data4        3.1T  512K  3.1T   1%    /ihepbatch/atlasdata4
– 192.168.50.30:/atlas/data5        3.1T  512K  3.1T   1%    /ihepbatch/atlasdata5
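To get a single figure for the data area, the six data volumes in the listing above can be summed. The following is a minimal sketch, assuming the sizes are df -h style values with binary suffixes (K, M, G, T); the helper names `parse_size` and `totals` are illustrative and not part of the original slides.

```python
# Sketch: sum the size and used space of the six data volumes on the slide.
# Assumption: suffixes are binary multiples, as printed by "df -h".

UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text):
    """Convert a df -h style size such as '2.8T' or '512K' to bytes."""
    if text[-1] in UNITS:
        return float(text[:-1]) * UNITS[text[-1]]
    return float(text)  # plain number without a suffix, e.g. '0'


# (mount point, size, used), copied from the slide
DATA_VOLUMES = [
    ("/ihepbatch/atlasdata0", "2.8T", "543M"),
    ("/ihepbatch/atlasdata1", "3.7T", "1.4T"),
    ("/ihepbatch/atlasdata2", "2.8T", "1.3T"),
    ("/ihepbatch/atlasdata3", "3.1T", "512K"),
    ("/ihepbatch/atlasdata4", "3.1T", "512K"),
    ("/ihepbatch/atlasdata5", "3.1T", "512K"),
]


def totals(volumes):
    """Return (total size, total used) in bytes over all volumes."""
    size = sum(parse_size(s) for _, s, _ in volumes)
    used = sum(parse_size(u) for _, _, u in volumes)
    return size, used


size, used = totals(DATA_VOLUMES)
print(f"data area: {size / UNITS['T']:.1f}T total, {used / UNITS['T']:.1f}T used")
```

Because df -h rounds each value, the totals are only as precise as the rounded inputs.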

6 Storage [diagram labels: Grid software repository, AFS, /home/atlas, SE (DPM), ATLAS Disk Server, HOME, Local Data, ATLAS software, Grid Data, atlasui02, autilas, Torque/Maui]

7 Software
– DQ2 end-user tools: /opt/atlassw/DQ2/endusers
– Ganga: 5.1.10 (updated by Lianyou)

8 Job management
– Server: Torque
– Scheduler: Maui
– Both are optimized
[diagram labels: atlasui02, autilas, Torque server, Maui]

9 Job monitor

10 Local DPM Access
T3 → T2: DPM access failed
– “rfio:/…”
Reason:
– Both Castor and DPM have rf* tools
– They use the same library: libshift.so
Solution:
– Link DPM library (libdpm.so) to Castor library (libshift.so)
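The linking step can be sketched as follows. This is one reading of the slide, assuming the intent is that the name libshift.so should resolve to libdpm.so so that rf* tools load the DPM implementation. The function name `link_shift_to_dpm` and the directory layout are hypothetical, and the demonstration runs in a throwaway directory rather than a real library path.

```python
import os
import tempfile


def link_shift_to_dpm(lib_dir, dpm_name="libdpm.so", shift_name="libshift.so"):
    """Make lib_dir/libshift.so a symlink to libdpm.so in the same directory."""
    target = os.path.join(lib_dir, dpm_name)
    link = os.path.join(lib_dir, shift_name)
    if not os.path.exists(target):
        raise FileNotFoundError(target)
    if os.path.lexists(link):
        os.remove(link)  # replace an existing file or stale link
    os.symlink(dpm_name, link)  # relative link within the same directory
    return link


# Demonstration in a temporary directory with an empty stand-in library file.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "libdpm.so"), "w").close()
    link = link_shift_to_dpm(demo_dir)
    resolved = os.path.basename(os.path.realpath(link))
    print(resolved)
```

On a real node the directory holding the link would also have to be the one the dynamic loader searches first; how that is arranged is not stated on the slide.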

11 Tests with Athena 14.2.23
Jobs:
– Simulation jobs
– Reconstruction jobs
Tests:
– Old farm
– New farm
– Front end
– Back end
– Interactive (directly on computing nodes)

12 Grid

13 GangaRobot

14 Stress tests (GangaRobot)

15 Panda Jobs

16 Grid (Tier-2)

17 Disk Usage

18 Issues
Many jobs failed in testing; only a few succeeded.
Conclusion:
– I/O issue: standardize job submission operations and move data from HOME space to the Data disks.
– Most probably something is wrong with the new batch system (the latest version, Torque 2.4.1); we will change to other versions and test again.
– Next step: separate the local software environment from the Grid.

19 Issues [diagram labels: AFS, /home/atlas, SE (DPM), ATLAS Disk Server, HOME, Local Data, atlasui02, autilas, Torque/Maui, Local Software, Grid software repository, NFS]

20 Comments
Standardize your operations
– Put your input data in /atlas/datax1, or read it from DPM.
– Submit jobs from /home/atlas/xxx (AFS space is currently not supported for batch jobs).
– Put your output data in /atlas/datax2.
– Please don't mix Home and Data space.
– Add some debug statements to your script, e.g., add 'hostname' to your job script so that you know which node your job ran on.
– Insert intervals when submitting bulk jobs.
Data space
– Public/Private
– Public datasets are classified by dataset name rather than by user name.
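Two of the recommendations above, printing the hostname in each job and pausing between bulk submissions, can be sketched as below. This is an illustration only: the helper names, the 5-second default interval, and the bash shebang are assumptions, and `qsub` is called in its simplest form (script path only). The submit and sleep functions are injectable so the loop can be tried without a batch system.

```python
import subprocess
import time


def make_job_script(command):
    """Return a shell job script that prints the node name before running."""
    return "\n".join([
        "#!/bin/bash",
        "hostname",  # debug line: records which node ran the job
        command,
        "",
    ])


def qsub(script_path):
    """Submit one script with Torque's qsub and return the job id it prints."""
    result = subprocess.run(
        ["qsub", script_path], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def submit_bulk(script_paths, interval_s=5.0, submit=qsub, sleep=time.sleep):
    """Submit scripts one by one, pausing interval_s seconds between them."""
    job_ids = []
    for i, path in enumerate(script_paths):
        if i > 0:
            sleep(interval_s)  # spread the load on the batch server
        job_ids.append(submit(path))
    return job_ids
```

A dry run such as `submit_bulk(paths, submit=print, sleep=lambda s: None)` shows the submission order without contacting the Torque server.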

21 File System

22 NFS

23 Lustre [diagram labels: MDS server, disk server]

24 LUSTRE stress test (1): 600 BES analysis jobs ran for 8 hours without problems; read performance was stable at 800 MB/s.

25 LUSTRE stress test (2): 256 dd write jobs ran concurrently for one day without problems; performance was stable at 350 MB/s.
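The aggregate rates from the two stress tests can be turned into per-job rates and total data volumes with simple arithmetic. The sketch below assumes the quoted rates were sustained for the whole run, that "one day" means 24 hours, and that 1 TB = 10^6 MB; none of these are stated on the slides.

```python
def per_job_mb_s(aggregate_mb_s, jobs):
    """Average throughput per job, in MB/s."""
    return aggregate_mb_s / jobs


def volume_tb(aggregate_mb_s, hours):
    """Total data moved in TB (decimal: 1 TB = 1e6 MB)."""
    return aggregate_mb_s * hours * 3600 / 1e6


# Test (1): 600 analysis jobs reading at 800 MB/s for 8 hours
read_per_job = per_job_mb_s(800, 600)
read_volume = volume_tb(800, 8)

# Test (2): 256 dd jobs writing at 350 MB/s for one day (assumed 24 hours)
write_per_job = per_job_mb_s(350, 256)
write_volume = volume_tb(350, 24)

print(f"read:  {read_per_job:.2f} MB/s per job, {read_volume:.2f} TB in total")
print(f"write: {write_per_job:.2f} MB/s per job, {write_volume:.2f} TB in total")
```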

26 Real application tests

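27 Test method
– Two dedicated test queues, btq1 and btq2, were set up on the cluster, each with 300 CPUs; each queue contains computing nodes with 2, 4, and 8 CPUs.
– 300, 250, 200, 150, 100, and 50 analysis jobs were submitted to each of the two queues in turn.
– The analysis jobs in the two queues analyzed data files on the LUSTRE and GPFS file systems respectively (mainly reads, with a small number of writes).
– During the runs, the efficiency and network traffic of the computing nodes and the network traffic of the file servers were monitored.
– Computing-node efficiency is taken from the CPU user utilization.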

28 Test results: CPU utilization

29 Test results: network traffic

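30 Conclusion
– Under current conditions, running 150 analysis jobs concurrently works well: CPU utilization reaches 60% or more.
– Projection: running 1500 analysis jobs concurrently and efficiently requires a parallel file system backed by about 30 file servers.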
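The projection above implies a ratio of roughly 1500 / 30 = 50 analysis jobs per file server. The sketch below applies that ratio to other job counts, assuming capacity scales linearly with the number of file servers; the slide does not state how many file servers were used in the test, so the ratio is derived from the projection only, and the function name is illustrative.

```python
import math

# Ratio implied by the slide's projection (1500 jobs on about 30 servers).
IMPLIED_JOBS_PER_SERVER = 1500 // 30


def file_servers_needed(target_jobs, jobs_per_server=IMPLIED_JOBS_PER_SERVER):
    """Estimate file servers for target_jobs, assuming linear scaling."""
    return math.ceil(target_jobs / jobs_per_server)


for jobs in (150, 600, 1500):
    print(jobs, "jobs ->", file_servers_needed(jobs), "file servers")
```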

31 Questions?

