Offline Physics Computing
- Simulation: CPU intensive, very low I/O, very long runs
- Reconstruction: CPU intensive, moderate I/O, long runs
- Data analysis: CPU intensive, high I/O, long runs
- Interactive analysis: CPU intensive, high I/O, quick response, short runs at ~10-minute intervals
- Tape storage: 180,000 3480 tapes in the vault; staging model
- Traditionally done on mainframes (Cray, IBM): simulation, reconstruction, batch data analysis, interactive analysis
Application Characteristics
- High data volumes (disks, tapes, tape robots): 180,000 tapes in the vault
- Staging model for data analysis (i.e. transfer the tape file to disk, process it, transfer the output back to tape) -- see the sketch after this list
- Standard I/O packages (Zebra, EPIO)
- Little file sharing, sometimes for staged input
- Yes, FORTRAN!
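A minimal sketch in C of the stage-in / process / stage-out cycle described above. The file names, pool paths and the placeholder "process" step are assumptions made for illustration only; real jobs used the SHIFT staging commands and Fortran I/O packages such as Zebra or EPIO.

```c
/* Sketch of the tape staging cycle used for data analysis.
 * All paths are invented; the "process" step is a placeholder. */
#include <stdio.h>
#include <stdlib.h>

/* Copy src to dst; stands in for a tape-to-disk (or disk-to-tape) transfer. */
static int copy_file(const char *src, const char *dst)
{
    FILE *in = fopen(src, "rb");
    FILE *out = fopen(dst, "wb");
    char buf[65536];
    size_t n;

    if (!in || !out)
        return -1;
    while ((n = fread(buf, 1, sizeof buf, in)) > 0)
        fwrite(buf, 1, n, out);
    fclose(in);
    fclose(out);
    return 0;
}

int main(void)
{
    /* 1. Stage in: copy the tape file onto a disk pool file system. */
    if (copy_file("/tape/RUN1234.RAW", "/pool/RUN1234.RAW") != 0)
        return EXIT_FAILURE;

    /* 2. Process: the analysis job reads the disk copy at disk speed.
     *    A plain copy stands in here for the real Fortran job that would
     *    read the event records and write a summary output file. */
    if (copy_file("/pool/RUN1234.RAW", "/pool/RUN1234.DST") != 0)
        return EXIT_FAILURE;

    /* 3. Stage out: write the result back to tape and free the disk space. */
    if (copy_file("/pool/RUN1234.DST", "/tape/RUN1234.DST") != 0)
        return EXIT_FAILURE;
    remove("/pool/RUN1234.RAW");
    remove("/pool/RUN1234.DST");

    return EXIT_SUCCESS;
}
```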
SHIFT Motivations
- Exploit cheap RISC technology spin-off
  - RISC technology: 1988, Apollo DN10000: mainframe CPU capacity, 4 CPUs, $1000/MIPS, one OS
  - DN10000: faster than a mini, comparable to a mainframe
- Exploit cheap SCSI technology
  - SCSI disk today: > 2 GB/disk, 1.8 MB/s, < 4 k$
  - 3480 SCSI (STK): 2 MB/s, 12 k$
- Exploit HOPE success
- Answer to HEP computing's ever-growing needs: data volumes, CPU, I/O bandwidth
- SHIFT == cost-effective computing
SHIFT Goals
- Mainframe quality: stable, resilient
- Scalable (down & up)
  - Scale down (small experiments, external institutes, e.g. China)
  - Scale up (higher demand, e.g. 40 GB at design, 250 GB now)
- Heterogeneous (open), because of economics
- Integrated: present a single view to applications, all disks & tapes available everywhere
- Minimal (portable) software development: minimize effort, adaptable to new OSes, devices and hardware (i.e. Unix based)
SHIFT Model
- Split into functional blocks
- Interconnected with a high-speed backplane
- Simulation showed that for a 100 CERN Unit (CU) system, with 1 CU requiring 20 KB/s, the backplane needed 18 MB/s plus 3 MB/s of tape I/O per interface
  - > FDDI (at that time), and FDDI would have required a lot of CPU time for protocol processing
  - ==> UltraNet (1 Gb/s, 3-12 MB/s per interface) + software!!
(Diagram: CPU servers, disk servers and tape servers connected by the high-speed interconnect)
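A back-of-the-envelope sketch in C of the kind of aggregate-bandwidth estimate quoted above: farm I/O traffic plus concurrent tape staging streams. The per-CU rate, the backplane-crossing factor and the tape stream count below are illustrative assumptions, not the inputs of the original simulation, and the sketch does not attempt to reproduce the 18 MB/s figure.

```c
/* Toy backplane sizing estimate for a SHIFT-style system.
 * All parameter values are assumptions used only to show the shape
 * of the calculation. */
#include <stdio.h>

static double backplane_mbps(int cern_units,       /* farm capacity in CU        */
                             double kbps_per_cu,   /* analysis I/O per CU        */
                             double crossings,     /* times each byte crosses the
                                                      backplane (stage-in, job
                                                      read, stage-out, ...)      */
                             int tape_streams,     /* concurrent tape copies     */
                             double mbps_per_tape) /* 3480 streaming rate        */
{
    double farm = cern_units * kbps_per_cu * crossings / 1024.0; /* KB/s -> MB/s */
    double tape = tape_streams * mbps_per_tape;
    return farm + tape;
}

int main(void)
{
    /* Example with assumed values: 100 CU farm, 20 KB/s per CU. */
    double need = backplane_mbps(100, 20.0, 3.0, 2, 2.0);
    printf("estimated backplane load: %.1f MB/s\n", need);
    return 0;
}
```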
SHIFT Software
- Pseudo distributed file system: pseudo distributed in order to present one single view
- Tape access: at the tape file level, assuming a tape staging model
- Batch queues: available system wide (actually load balanced) -- see the dispatch sketch below
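A minimal sketch, in C, of the load-balanced dispatch implied by system-wide batch queues: pick the least-loaded CPU server and submit the job there. The server names, the load metric and the "submit" step are hypothetical; in SHIFT the batch system (NQS) performed the equivalent selection with live load information.

```c
/* Sketch of load-balanced job dispatch across CPU servers.
 * Server names and load figures are invented for illustration. */
#include <stdio.h>

struct server {
    const char *name;
    double load;          /* e.g. normalised run-queue length */
};

/* Return the index of the least-loaded server. */
static int pick_server(const struct server *s, int n)
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (s[i].load < s[best].load)
            best = i;
    return best;
}

int main(void)
{
    struct server farm[] = {
        { "cpusrv1", 3.2 },
        { "cpusrv2", 0.8 },
        { "cpusrv3", 1.9 },
    };
    int n = (int)(sizeof farm / sizeof farm[0]);
    int target = pick_server(farm, n);

    /* In reality the job would now be queued on the chosen host;
     * here we just report the decision. */
    printf("submit job to %s\n", farm[target].name);
    return 0;
}
```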
SHIFT Software (components)
- Disk Pool Manager (DPM): organizes files in pools of file systems (local & remote) -- see the sketch after this list
- Remote Tape Copy (RTCOPY): generalized remote tape staging
- Clustered batch queues (NQS): load balancing
- Remote File I/O (RFIO): exploits the network
- Integration with I/O packages: Zebra, EPIO, FATMEN
- Unix tape control system
- Operator interface
- Monitoring
- Developed in 4 months; currently 60,000 lines of code; publicly available
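A minimal sketch of the pool idea behind the Disk Pool Manager: a pool groups several file systems, possibly on different disk servers, and a new file is placed on the member with the most free space. The structures, hosts, mount points and sizes are assumptions for illustration; the real DPM also handles remote file systems, staging state and garbage collection.

```c
/* Sketch of disk-pool file placement: a pool groups several file systems,
 * and a new file goes to the member with the most free space.
 * Hosts, paths and sizes are invented for illustration. */
#include <stdio.h>

struct filesystem {
    const char *host;       /* disk server owning the file system */
    const char *mountpoint; /* e.g. /pool1/data3                  */
    long free_mb;           /* free space in megabytes            */
};

/* Choose the pool member with the most free space that can hold the file. */
static const struct filesystem *allocate(const struct filesystem *pool,
                                         int n, long size_mb)
{
    const struct filesystem *best = NULL;
    for (int i = 0; i < n; i++)
        if (pool[i].free_mb >= size_mb &&
            (best == NULL || pool[i].free_mb > best->free_mb))
            best = &pool[i];
    return best;
}

int main(void)
{
    struct filesystem pool[] = {
        { "dsksrv1", "/pool1/data1", 1200 },
        { "dsksrv1", "/pool1/data2",  300 },
        { "dsksrv2", "/pool1/data3", 2100 },
    };
    int n = (int)(sizeof pool / sizeof pool[0]);
    const struct filesystem *fs = allocate(pool, n, 500);

    if (fs)
        printf("place file on %s:%s\n", fs->host, fs->mountpoint);
    else
        printf("pool full, trigger garbage collection\n");
    return 0;
}
```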
CORE
- This is the environment
- CSF is the low-I/O, high-CPU facility
- PIAF is being integrated
- Links to PARC, to CHEOPS and to private simulation farms (Opal)
- Intention to expand to all LEP experiments and to prepare for LHC
CPU Usage
- CORE is the main CPU provider
- IBM is still a big one (but $/MIPS); Cray is outrageously expensive
- IBM is still general purpose; CORE/SHIFT are not
- Still have a lot to learn from the mainframe world:
  - Resilience (disk failure rates)
  - Maintainability (checkpoint/restart)
  - Performance tuning
  - Failure analysis
(Chart: CPU usage, weeks 1-42, 1992)
SHIFT Stage Statistics
- But after all, we do stage data, and not just a little
- Tape drive usage is not bad: SUN/STK staged data per drive per week: 48 GB
(Chart: gigabytes staged, weeks 1-43, 1992)
Tape Mount Statistics
- 10% of all tape mounts ... but with 2 drives
- We are increasing the number of drives, which are limiting data analysis (jobs sit idle waiting for tape mounts)
- Tape mounts per drive are as good as on the mainframe:
  - IBM: 215 mounts/drive/week
  - SUN/STK: 253 mounts/drive/week
(Chart: tape mounts, weeks 1-43, 1992)
Conclusions
- RISC + SCSI is cost effective
- CSF style is easy to solve; not tied to one manufacturer for CPU
- The network is not important apart from speed; only the abstraction matters (sockets)
- Simulation: every major HEP site has such a project
- (Remote) I/O is the major problem: very few people have tried this, and it caused us grey hairs
- Administration is complex: growing number of machines, configuration, monitoring
- Really scalable: up (e.g. 40 GB to 250, DP) and down (Aleph, IHEP, SMC)
- Ready for the LHC challenges (one order of magnitude)
- Only starting with parallel interactive data analysis