QMUL e-Science Research Cluster: Introduction, (New) Hardware, Performance, Software Infrastructure, What still needs to be done.


1 QMUL e-Science Research Cluster: Introduction, (New) Hardware, Performance, Software Infrastructure, What still needs to be done

2 Slide 2 Alex Martin QMUL e-Science Research Cluster Background
- Formed an e-Science consortium within QMUL to bid for SRIF money etc. (there was no existing central resource).
- Received money in all 3 SRIF rounds so far. Led by EPP + Astro + Materials + Engineering.
- Started from scratch in 2002: new machine room, Gb networking. Now have 230 kW of A/C.
- Differing needs: other fields tend to need parallel-processing support, MPI etc.
- Support effort is a bit of a problem.

3 Slide 3 Alex Martin QMUL e-Science Research Cluster History of the High Throughput Cluster
- Already in its 4th year (3 installation phases).
- In addition, an Astro cluster of ~70 machines.

4 Slide 4 Alex Martin QMUL e-Science Research Cluster

5 Slide 5 Alex Martin QMUL e-Science Research Cluster

6 Slide 6 Alex Martin QMUL e-Science Research Cluster
- 280 + 4 dual dual-core 2 GHz Opteron nodes
- 40 + 4 with 8 GByte RAM, the remainder with 4 GByte
- Each with 2 x 250 GByte HD
- 3Com SuperStack 3 3870 network stack
- Dedicated second network for MPI traffic
- APC 7953 vertical PDUs
- Total measured power usage seems to be ~1 A/machine, ~65-70 kW total

7 Slide 7 Alex Martin QMUL e-Science Research Cluster Crosscheck:
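The crosscheck presumably shown on this slide can be reproduced with back-of-the-envelope arithmetic. The node count and per-machine current come from slide 6; the ~230 V mains voltage is an assumption here, not stated on the slides:

```python
# Power crosscheck for the figures on slide 6 (a sketch, not the slide's own calculation).
NODES = 280 + 4          # production + extra dual dual-core Opteron nodes (slide 6)
AMPS_PER_NODE = 1.0      # measured draw per machine (slide 6)
MAINS_VOLTS = 230        # assumed UK mains voltage

total_kw = NODES * AMPS_PER_NODE * MAINS_VOLTS / 1000.0
print(f"Estimated load: {total_kw:.1f} kW")  # ~65.3 kW, consistent with the quoted 65-70 kW
```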

8 Slide 8 Alex Martin QMUL e-Science Research Cluster
- Ordered in the last week of March; 1st batch of machines delivered in 2 weeks, 5 further batches 1 week apart.
- 3-week delay for proper PDUs.
- Cluster cabled up and powered 2 weeks ago.
- Currently all production boxes run legacy SL3/x86.
- Issues with scalability of services (torque/ganglia); the shared experimental area is also an I/O bottleneck.

9 Slide 9 Alex Martin QMUL e-Science Research Cluster

10 Slide 10 Alex Martin QMUL e-Science Research Cluster The cluster has been fairly heavily used: ~40-45% utilisation on average.

11 Slide 11 Alex Martin QMUL e-Science Research Cluster Tier-2 Allocations

12 Slide 12 Alex Martin QMUL e-Science Research Cluster S/W Infrastructure
- MySQL database containing all static info about machines and other hardware, plus network and power configuration.
- S/W configuration info kept in a Subversion repository: OS version and release tag.
- Automatic (re)installation and upgrades using a combination of both; tftp/kickstart pulls dynamic pages from the web (Mason).
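The installation scheme above could be sketched as follows: combine a host record (as it might come from the MySQL hardware database) with the release tag (as kept in Subversion) to render a per-node kickstart fragment. All field names, hostnames, and URLs here are hypothetical; the real system serves such pages dynamically via Mason.

```python
# Hypothetical sketch: render a kickstart fragment for one node from a
# host record. In the real cluster the record comes from the MySQL
# hardware database and the OS version/release tag from Subversion.

def render_kickstart(host: dict) -> str:
    """Build a minimal kickstart snippet for one node (illustrative only)."""
    lines = [
        f"# node {host['name']}, rack {host['rack']}, switch port {host['switch_port']}",
        f"url --url http://install.example.org/{host['os_version']}/",
        "network --bootproto dhcp",
        "%post",
        f"echo '{host['release_tag']}' > /etc/cluster-release",
        "%end",
    ]
    return "\n".join(lines)

# Example host record (all values invented for illustration):
node = {"name": "cn042", "rack": "r3", "switch_port": 12,
        "os_version": "sl3-x86", "release_tag": "prod-2006-04"}
print(render_kickstart(node))
```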

13 Slide 13 Alex Martin QMUL e-Science Research Cluster http://www.esc.qmul.ac.uk/cluster/

14 Slide 14 Alex Martin QMUL e-Science Research Cluster Ongoing work
- Commission the SL4/x86_64 service (~30% speed improvement); assume non-HEP usage initially. Able to migrate boxes on demand.
- Tune MPI performance for jobs up to ~160 CPUs (non-IP protocol?).
- Better integrated monitoring (ganglia + pbs + opensmart? + existing db); dump Nagios?
- Add 1-wire temperature and power sensors.
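For the last item, a minimal sketch of reading a 1-wire temperature sensor via the Linux w1 sysfs interface (a DS18B20-style sensor exposing a w1_slave file is assumed here; the slides do not name a specific sensor):

```python
# Hypothetical sketch: parse a reading from a 1-wire temperature sensor.
# The Linux w1 driver exposes each sensor as
# /sys/bus/w1/devices/<id>/w1_slave; the second line of that file ends
# with "t=<temperature in millidegrees Celsius>".

def parse_w1_temp(raw: str) -> float:
    """Extract degrees Celsius from the contents of a w1_slave file."""
    last = raw.strip().splitlines()[-1]       # line ending in "t=NNNNN"
    millideg = int(last.split("t=")[-1])
    return millideg / 1000.0

# Example file contents (sample data, invented for illustration):
sample = ("3f 01 4b 46 7f ff 01 10 2d : crc=2d YES\n"
          "3f 01 4b 46 7f ff 01 10 2d t=19937\n")
print(parse_w1_temp(sample))  # 19.937
```

On a real node the string would come from `open("/sys/bus/w1/devices/<id>/w1_slave").read()`, polled periodically and fed into the monitoring database.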

15 Slide 15 Alex Martin QMUL e-Science Research Cluster Ongoing work, continued
- Learn how to use a large amount of distributed storage in an efficient and robust way.
- Need to provide a POSIX f/s (probably extending poolfs, or something like Lustre).

