1
Update on tape system
Enrico Fattibene, CDG – CNAF, 18/09/2017
2
Tape status
- Used: 5354 tapes (44.3 PB)
- Free: 969 tapes (7.7 PB)
- Repackable: 1 PB (after the recent CMS data deletion)
- Incoming: 270 tapes (2.3 PB)
- Usage rate: 50 tapes per week (a back-of-envelope check of these figures follows below)
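A quick consistency check of the numbers on this slide; the per-cartridge capacity and the weeks of remaining headroom are derived from the figures above, not stated on the slide:

    # Back-of-envelope check of the tape status figures (derived values, not quoted on the slide).
    used_tapes, used_pb = 5354, 44.3
    free_tapes, incoming_tapes = 969, 270
    tapes_per_week = 50

    tb_per_tape = used_pb * 1000 / used_tapes                            # ~8.3 TB per cartridge
    weekly_write_pb = tapes_per_week * tb_per_tape / 1000                # ~0.41 PB written per week
    weeks_of_headroom = (free_tapes + incoming_tapes) / tapes_per_week   # ~25 weeks of free tapes

    print(f"~{tb_per_tape:.1f} TB/tape, ~{weekly_write_pb:.2f} PB/week, ~{weeks_of_headroom:.0f} weeks of headroom")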
3
HSM transfer rate
- Mean daily transfer rate for each server reaches 600 MB/s
- Each tape server can handle 800 MB/s of in-bound and out-bound traffic simultaneously
- The theoretical limit is set by the FC8 connection to the Tape Area Network (see the sketch below)
- We do not reach a mean daily transfer rate of 800 MB/s because of some inefficiency:
  - For migrations, it is the time spent scanning the filesystem to select the candidate files
  - For recalls, it depends on the position of the data on tape
- We plan to move the HSM services to new servers equipped with FC16 connections to the Tape Area Network
  - This would double the transfer rate of each server
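A rough sketch of where the 800 MB/s ceiling and the expected doubling come from; the line rates and encodings are the standard Fibre Channel figures (assumed, not stated on the slide), and real-world throughput after protocol overhead is closer to 800 and 1600 MB/s:

    # Approximate usable bandwidth of the Fibre Channel links (standard FC parameters assumed).
    def fc_payload_mb_per_s(line_rate_gbaud, encoding_efficiency):
        # Payload bandwidth per direction, ignoring frame and protocol overhead.
        return line_rate_gbaud * 1e9 * encoding_efficiency / 8 / 1e6

    fc8  = fc_payload_mb_per_s(8.5, 8 / 10)      # 8GFC, 8b/10b encoding  -> ~850 MB/s raw payload
    fc16 = fc_payload_mb_per_s(14.025, 64 / 66)  # 16GFC, 64b/66b encoding -> ~1700 MB/s raw payload
    print(f"FC8 ~{fc8:.0f} MB/s, FC16 ~{fc16:.0f} MB/s per direction")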
4
Future data transfer
- The data transfer rate to CNAF tapes will increase over the next years: 160 PB expected by 2021
- Experiments plan to use tape as nearline disk
  - Significant increase of recall activity expected
- Need to optimize drive usage to reduce costs
- Planning to buy a new library
5
Drive usage configuration
Tape drives are shared among the experiments
- Each experiment can use a maximum number of drives for recall or migration, statically defined in GEMSS
- In case of scheduled massive recall or migration activity these parameters are changed manually by the administrators
- We noticed several cases of free drives that could have been used by pending recall threads
- Considering the 8 Gbit/s limit on the FC connection of each HSM server, we are working on a software solution to dynamically allocate additional drives to the experiments and to manage concurrent recall access (a sketch of such a policy follows below)
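A minimal, hypothetical sketch of the allocation policy described above. The names (Experiment, extra_drives) and the per-drive rate are illustrative assumptions and do not correspond to the actual GEMSS interfaces:

    # Hypothetical sketch: grant idle drives to an experiment with pending recalls,
    # without exceeding the free pool or the per-server FC bandwidth.
    from dataclasses import dataclass

    FC_LIMIT_MB_S = 800      # ~8 Gbit/s FC link of one HSM server (usable payload)
    DRIVE_RATE_MB_S = 200    # assumed sustained rate of a single tape drive

    @dataclass
    class Experiment:
        name: str
        static_limit: int      # drives granted by the static GEMSS configuration
        drives_in_use: int
        pending_recalls: int   # recall threads waiting for a drive

    def extra_drives(exp: Experiment, free_drives: int) -> int:
        """Free drives that could be temporarily granted to this experiment."""
        bandwidth_cap = FC_LIMIT_MB_S // DRIVE_RATE_MB_S   # drives one HSM server can feed
        headroom = max(0, bandwidth_cap - exp.drives_in_use)
        return min(exp.pending_recalls, free_drives, headroom)

    cms = Experiment("CMS", static_limit=3, drives_in_use=3, pending_recalls=5)
    print(extra_drives(cms, free_drives=4))   # -> 1: one more drive still fits under the FC limit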
6
Drive usage efficiency
Plot: total number of pending recall threads and free drives, June-July 2017
- In several cases a subset of the free drives could have been used by recall threads
Plot: total duration (in days) of recall and migration processes, June-July 2017 (stacked plot, aggregated by day)
- The total usage never exceeds 8 drive-days per day, out of 17 drives available (below 50% utilization)
7
Recent intense recall activity by CMS
- At the end of July / beginning of August we noticed increased recall activity from CMS
- In an attempt to help CMS with this activity we doubled the number of allocated tape drives from 3 (the default) to 6, and then raised it to 7
- Since there was no communication from the experiment regarding this activity, we asked the CMS contact for information
  - We suggested opening a ticket
- After some investigation it turned out that this activity was "not intended"
8
Recent intense recall activity by CMS (cont.)
In 28 days (from 30/7 to 26/8):
- Bytes recalled: 1.11 PB (of 12 PB total)
- Tapes used in recall: 72% (out of 1544 total)
- Files recalled: 400k
- Number of mounts (total): 9858
- Mounts/day (average): 352
- Mounts/tape (average): 9, (max): 17
- Tape drives used (average): 6, (max): 7
- Waiting time: 5.6 days (average of the daily maxima)
In the end we re-read roughly 10% of all CMS data (a consistency check of these numbers follows below)
- 1 read error, recovered after copying the tape to new media
- 1 robot arm (out of 4) broken; the repair took more than a week
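The averages on this slide follow from the totals; a quick check (the number of tapes touched is derived from the 72% figure, not stated directly):

    # Consistency check of the CMS recall statistics above.
    days, mounts = 28, 9858
    recalled_pb, total_pb = 1.11, 12.0
    tapes_total, tapes_used_pct = 1544, 72

    tapes_used = tapes_total * tapes_used_pct / 100                               # ~1112 tapes touched
    print(f"mounts/day  ~{mounts / days:.0f}")                                    # ~352, as quoted
    print(f"mounts/tape ~{mounts / tapes_used:.1f}")                              # ~8.9, consistent with the quoted 9
    print(f"re-read     ~{100 * recalled_pb / total_pb:.0f}% of all CMS data")    # ~9%, i.e. the quoted ~10%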
9
Lessons learned and questions
- We need to pay attention to unannounced but heavy activities
- Should we impose a protection mechanism against such events?
  - Artificially delay recall processing?
  - Limit the number of tape drives?
- If we did so, we could reduce the number of mounts at the cost of longer recall times
- Experiments should consider that the recall time may then increase significantly
  - From a few minutes to 5-7 days
- Is using tape as a slow disk, as ATLAS asks, still convenient in that case?