
T3 report: - Oxford HEPSYSMAN Sean Brisbane June 2015.



1 T3 report: - Oxford HEPSYSMAN Sean Brisbane June 2015

2 Activities
T3
– Dominant experiments shifting
– Storage highlights
– AWS hybrid cloud
T2
– Mostly moved off torque
Hybrid T2 & T3
– Finally started to happen

3 TIER 3

4 T3 usage
Cluster very heavily used for 3 weeks
– Borrowed ~100 CPUs from T2
Seen a big shift in recent months away from LHCb and towards T2K and SNO+ use of T3
– T2K data/MC analysis
– SNO+ production + simulation work
Seen a reduction in data-analysis-type jobs from LHCb, and to an extent ATLAS, for some time
– Assuming this is due to the long shutdown

5 T2K
Data processing workload requirement now at ~ATLAS levels
But they only have 1 server
Can only run 20 jobs at once before I/O is maxed out
Trying to avoid adding to lustre
– See n/40/material/slides/1.pdf
Want access to ~100 files
– Each sequential on disk
– 100 clients + NFS means quite random i/o at block level
So, can we improve a single disk server by a factor of ~5?
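A toy illustration of why ~100 sequential readers look random at block level. This is a sketch under stated assumptions, not a model of the real server: it assumes each file occupies one contiguous run of block addresses and that clients are served in strict round-robin order. The function names `interleaved_requests` and `contiguous_fraction` are made up for this example.

```python
def interleaved_requests(num_files, blocks_per_file):
    """Block addresses requested when num_files clients each read one file.

    Assumption: file f occupies blocks [f * blocks_per_file, (f + 1) * blocks_per_file)
    and the clients take turns, one block each (strict round-robin).
    """
    requests = []
    for block in range(blocks_per_file):
        for f in range(num_files):
            requests.append(f * blocks_per_file + block)
    return requests


def contiguous_fraction(requests):
    """Fraction of requests whose address directly follows the previous request."""
    if len(requests) < 2:
        return 1.0
    hits = sum(1 for a, b in zip(requests, requests[1:]) if b == a + 1)
    return hits / (len(requests) - 1)
```

With a single reader every request follows the previous one on disk; with 100 interleaved readers none do in this idealized case, even though every individual file is read strictly in order.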

6 Yes (sort of)
Transparent to user:
– Un-tuned, access in place: 80MB/s
– Tuned, access in place: ~160MB/s (middling RAID + Linux system tuning so far)
Changes to user workflow:
– Copy to local scratch disks: ~300MB/s
– “Full sequential i/o”, pre-warm files + middling tuning: 800MB/s
Concurrent jobs needed to max out I/O:
– Initially: ~20
– System tuning: ~40
– Take the decisions of when to read each block from disk away from the system: ~80


8 The Amazon CPU Case
Amazon massively over-provisions so as to NEVER run out of compute instances for people paying in full.
They sell off the spare capacity on the spot market for a more reasonable cost
– They will terminate instances if someone is willing to pay more than you and they are out of capacity
Cost per core hour is “similar” to running one's own hardware flat-out
On-demand scalability: so no need to over-buy on hardware
They gave me $5,000 to spend.
Note: There is no case to keep significant data on AWS

9 Hybrid cloud (for proof of concept)
[Diagram labels: Data, Fixed, Compute, AWS, Scalable, Virtual Private Cloud(s), Batch Server(s), VPN Server(s)]

10 Tier 3 vs Tier 0, 1, 2
Grid computing favours maximum utilization
– Should take care of the bulk jobs for HEP
T3 favours responsiveness
– Left with fast turn-around and development jobs on T3
Trade-off between responsiveness and utilization for a fixed-size (CPU, RAM/core) cluster.
[Usage plot annotations: period where high-memory jobs were running on ~50% of cores; daily peaks; weekend lull; scale-up, in this case by borrowing grid nodes; empty SL5 nodes (old hardware)]

11 Quantify Risks: AWS hybrid cloud
1. Amazon data transfer costs $0.09/GB.
– Data transfer costs could be ~97% of the cost of using AWS
– Can we separate data-heavy and data-light jobs?
– What does the JISC/Janet agreement do? Amazon say it does not reduce bandwidth costs
2. Are the spot instances reliable enough for our use?
– Jobs last many hours
– Jobs don't checkpoint
3. What is the manpower needed to make this viable?
– Let's get a feel for that
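The arithmetic behind the first risk can be sketched as below. Only the $0.09/GB transfer price comes from the slide; the per-core-hour price is a parameter the caller must supply, and the function name `aws_cost_breakdown` is hypothetical.

```python
TRANSFER_USD_PER_GB = 0.09  # data transfer price quoted on the slide


def aws_cost_breakdown(data_gb, core_hours, usd_per_core_hour):
    """Split an estimated AWS bill into data transfer and compute.

    usd_per_core_hour is an assumption supplied by the caller (for example
    a recent spot price); the slides do not state one.
    """
    transfer = data_gb * TRANSFER_USD_PER_GB
    compute = core_hours * usd_per_core_hour
    total = transfer + compute
    share = transfer / total if total else 0.0
    return {
        "transfer_usd": transfer,
        "compute_usd": compute,
        "transfer_share": share,
    }
```

A data-heavy job that moves many GB per core hour drives `transfer_share` towards 1, which is why separating data-heavy from data-light jobs matters for this kind of estimate.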

12 Quantify Data transfer costs
Systemtap: framework for writing kernel modules (for monitoring)
$PBS_JOBID to track job
Look at ways to categorize jobs
– group (easy)
– user (easy)
– job type, e.g. MC prod, analysis, toy (hard, requires user to categorize)

13 Data and CPU costs per group
Let's assume we ran all of our jobs from the last 2 months on Amazon: what is the cost? (US $)
* CPU scaled according to HS06 results for m3.medium instance type
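One way such a per-group table could be computed is sketched below. The job record layout, the HS06 scores and the core-hour price are all hypothetical inputs; the slide only states that CPU was scaled by HS06 results for the m3.medium instance type, and the $0.09/GB default is the transfer price quoted earlier in the talk.

```python
from collections import defaultdict


def cost_per_group(jobs, hs06_local, hs06_aws, usd_per_core_hour, usd_per_gb=0.09):
    """Estimate AWS CPU and data costs per group.

    jobs: iterable of dicts with keys "group", "cpu_hours" and "data_gb"
          (a made-up record layout for this sketch).
    hs06_local / hs06_aws: per-core HS06 scores of a local core and of the
          AWS instance; their ratio converts local CPU hours to AWS hours.
    """
    scale = hs06_local / hs06_aws
    totals = defaultdict(lambda: {"cpu_usd": 0.0, "data_usd": 0.0})
    for job in jobs:
        entry = totals[job["group"]]
        entry["cpu_usd"] += job["cpu_hours"] * scale * usd_per_core_hour
        entry["data_usd"] += job["data_gb"] * usd_per_gb
    return dict(totals)
```

Groups whose `data_usd` is small relative to `cpu_usd` would be the natural first candidates for running on AWS.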

14 Data and CPU costs per group
2 groups stand out as easy targets
Discount LHCb as their usage has been anomalous for the last few months
* CPU scaled according to HS06 results for m3.medium instance type

15 Spot pricing + reliability
If the spot price goes above the bid: instance dies
– Bid fairly high & keep an eye on it
– AWS provide historical plots
– Some availability zones are much better value than others
So far, spot instance lifetime is 2 weeks and counting if the right zone is used
Sometimes the spot price is up to 10x the on-demand price
Once, an instance just died
– Maybe AWS killed it for kernel updates?
– Only saw this once, at the beginning
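The "bid fairly high and keep an eye on it" rule can be checked against a price history with a few lines of code. This is a sketch only: the history is assumed to be a plain list of equally spaced price samples that the user has obtained separately, and `spot_survival` is a name invented for this example.

```python
def spot_survival(prices, bid):
    """Summarize how a bid would have fared against a spot price history.

    prices: list of spot prices sampled at equal intervals (assumed input).
    Returns (fraction of samples above the bid,
             longest run of consecutive samples at or below the bid).
    An instance dies whenever the price rises above the bid, so the longest
    run approximates the best uninterrupted lifetime in sample intervals.
    """
    above = 0
    longest = 0
    current = 0
    for price in prices:
        if price > bid:
            above += 1
            current = 0
        else:
            current += 1
            longest = max(longest, current)
    fraction = above / len(prices) if prices else 0.0
    return fraction, longest
```

Since jobs last many hours and do not checkpoint, the longest run is the more relevant of the two numbers: it has to exceed the longest job comfortably.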

16 AWS POC selected highlights: The Good
Create a cluster that scales in response to batch queue depth
– Amazon has done this for us using CloudFormation*
– Caveat: CentOS image + torque + all *systems must be cloud based
Security: default security is pretty good on AWS.
– Virtual private VLAN, firewalls and rich ACLs available *
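What "scales in response to batch queue depth" means can be shown with a minimal policy function. This is not the CloudFormation template's logic, which the slides do not show; it is a hypothetical rule with made-up parameter names and limits.

```python
import math


def desired_workers(queued_jobs, slots_per_worker, min_workers=0, max_workers=20):
    """Number of worker nodes wanted for a given batch queue depth.

    Hypothetical policy: enough workers to give every queued job a slot,
    clamped between min_workers and max_workers.
    """
    if slots_per_worker <= 0:
        raise ValueError("slots_per_worker must be positive")
    needed = math.ceil(queued_jobs / slots_per_worker)
    return max(min_workers, min(max_workers, needed))
```

The `max_workers` cap acts as the budget guard: with spot prices that can spike, an unbounded scale-up rule could spend a fixed credit very quickly.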

17 AWS POC: The Bad
Customizing the image AND keeping auto-scaling
– Much harder than I expected (I was warned!)
– Simplest customization: boot image, touch helloworld, clean up & shut down
– Spent ~3 days (inc. 4 hrs with Amazon tech guy)
– Workflow established (not complicated)
– However, the clean-up script exists but doesn't always work.
Feeling that latency may be worse than I hoped (except Ireland + east coast)
Spiky spot pricing?
Long-term reliability of the market?

18 AWS POC: The Ugly
Shared data-access, auth + authz piece:
– VPN + NFS + NIS over WAN, software on cvmfs
– NFS is slow over WAN, even for a directory walk
1 shared torque server very difficult to get working
– L3 VPN + NAT or on multi-homed systems is problematic
– Past time we ditched torque
Use manual scp to share VPN secrets
– AWS supports cloud-init user-data


20 Summary
Had issues with lustre storage
Shift away from LHCb/ATLAS to SNO+/T2K
T2K I/O requirement is ballooning
Looking at AWS
T2: mostly moved away from torque

21 Backup

22 Reality bites
Real workload: RAID tuning to chunk size 256k didn't help much (a gain of ~50%)
We don't actually have sequential i/o
– Each file is sequential on disk, but ~100 concurrent accesses to different files from workers over NFS randomize this too much
– Linux read-ahead is not helping; must consider this random I/O
– LSI RAID card can only re-order ~1MB requests to select best read rate [3]. RAID card read-ahead doesn't help
Linux system tuning on top of RAID gained ~50%, mainly from switching to the deadline scheduler
– 8& kb_bei_MegaRAID_Controllern_anpassen&edit-text=&act=url
Status: 80MB/s -> 160MB/s, not enough
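To check which I/O scheduler a block device is actually using, Linux exposes `/sys/block/<device>/queue/scheduler`, where the active scheduler is shown in square brackets (for example `noop [deadline] cfq`). The helper below is a small sketch for reading that file; the function names are made up, and the device name is whatever the fileserver's array appears as.

```python
import re


def active_scheduler(sysfs_text):
    """Return the scheduler shown in [brackets], or None if none is marked."""
    match = re.search(r"\[([^\]]+)\]", sysfs_text)
    return match.group(1) if match else None


def read_active_scheduler(device):
    """Read the active I/O scheduler for a block device such as "sda".

    Assumes a Linux host; the path is the standard sysfs location.
    """
    with open(f"/sys/block/{device}/queue/scheduler") as fh:
        return active_scheduler(fh.read())
```

Switching scheduler is done by writing the desired name to the same file as root; that step is left out here because it changes system state.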

23 Storage outlook
So far, only been able to experiment with non-destructive optimizations for the realistic case
– Benchmarks now show ~12Gb/s possible, with 12 disks and 1M raid chunk size.
Buying 24-bay server
– More disks, more sequential throughput
Try software raid
– Perhaps Linux can queue multiple sequential read requests
Need to change my pre-warming method and buy new fileserver to get > 10GB/s

24 Storage tuning: benchmark
RAID layout chunk size better tuned for large-file sequential i/o
– Iozone bench for (mostly) sequential i/o
– iozone -r 1M -s 128g -t 6g

Chunk Size          R720xd benchmark local access    NFSv4
64k                 270MB/s                          250MB/s
256k                640-690MB/s                      570MB/s
1024k               1160-1300MB/s                    ?
1024 (sw raid6)     690MB/s                          ?

25 Creating Sequential I/O
Huge improvements if we can create sequential i/o:
– Single, supervisory process controls access to raid array
– All blocks needed in the near future are read at once from disk
– All other disk access is blocked
– Multiple clients served from the RAM cache
Call the above pre-warm.
Currently doing this at application level. T2K were already using a supervisor script to filter jobs onto the batch queue; modify this script.
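A minimal application-level sketch of the pre-warm idea follows. It is not the T2K supervisor script, which the slides do not show. It assumes a Unix host, that reading a file end to end leaves its blocks in the page cache (which only helps if RAM is large enough), and that every pre-warming process agrees to take the same advisory lock; it cannot block disk access from processes that ignore the lock. The name `prewarm` and the lock-file path are hypothetical.

```python
import fcntl


def prewarm(paths, lock_path, chunk_bytes=16 * 1024 * 1024):
    """Read each file sequentially, one pre-warm at a time.

    Holding an exclusive advisory lock on lock_path means that cooperating
    pre-warm processes take turns, so the array sees one large sequential
    stream instead of many interleaved ones.
    Returns the total number of bytes read.
    """
    total = 0
    with open(lock_path, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # wait for any other pre-warm to finish
        try:
            for path in paths:
                with open(path, "rb") as fh:
                    while True:
                        chunk = fh.read(chunk_bytes)
                        if not chunk:
                            break
                        total += len(chunk)
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
    return total
```

After `prewarm` returns, the batch jobs that need those files can be released onto the queue and should be served largely from RAM rather than from disk.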
