New tape server software: status and plans
CASTOR face-to-face workshop, September 2014
Eric Cano, on behalf of the CERN IT-DSS group
Data & Storage Services (DSS), CERN IT Department, CH-1211 Genève 23, Switzerland

Overview
Features for first release
New tape server architecture
  – Control and reporting flows
  – Memory management and data flow
  – Error handling
  – Main process and sessions
  – Stuck session and recovery
Development methodologies and QA
What changes in practice?
What is still missing?
Logical Block Protection investigation
Release plans and potential new features

Features for first release
Continuation of the push to replace legacy tape software
  – Started with creation of tape gateway and bridge
  – VMGR+VDQM will be next
Drop-in replacement
  – Tapeserverd consolidated in a single daemon
  – Replaces the previous stack: taped & satellites + rtcpd + tapebridged
Identical outside protocols (almost)
  – Stager / CLI client (readtp is unchanged)
  – VMGR/VDQM
  – tpstat/tpconfig
  – New labelling command (castor-tape-label)
Keep what works:
  – One process per session (pid listed in tpstat, as before)
Better logs
Latency shadowing (no impact of slow DB)
Empty mount protection
Result of big teamwork since last meeting:
  – E. Cano, S. Murray, V. Kotlyar, D. Kruse, D. Come

New tape server architecture
Pipelined: based on FIFOs and threads/thread pools
  – Always fast to post to FIFO
      – Push data blocks, reports, requests for more work
  – Each FIFO output is served by one thread (or thread pool)
      – Simple loop: pop, use/serve the data/request, repeat
  – All latencies are shadowed in the various threads
  – Keep the instruction pipeline non-empty with task prefetch
  – N-way parallel disk access (as before)
  – All reporting is asynchronous
Tape thread is the central element that we want to keep busy at full speed

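The "post to a FIFO, serve it with one thread or a pool" pattern can be sketched as follows. This is a minimal illustration and not the tapeserverd code: `BlockingFifo` and `serveWithPool` are hypothetical names, and "serving" an element is reduced to summing integers.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical blocking FIFO; name and interface are illustrative only.
template <typename T>
class BlockingFifo {
public:
  // Posting is cheap: lock, enqueue, wake one consumer.
  void push(T value) {
    {
      std::lock_guard<std::mutex> lock(m_mutex);
      m_queue.push(std::move(value));
    }
    m_cond.notify_one();
  }

  // Tells consumers that no further element will be posted.
  void close() {
    {
      std::lock_guard<std::mutex> lock(m_mutex);
      m_closed = true;
    }
    m_cond.notify_all();
  }

  // Blocks until an element is available. Returns false once the FIFO
  // is closed and fully drained.
  bool pop(T& out) {
    std::unique_lock<std::mutex> lock(m_mutex);
    m_cond.wait(lock, [this] { return !m_queue.empty() || m_closed; });
    if (m_queue.empty()) return false;
    out = std::move(m_queue.front());
    m_queue.pop();
    return true;
  }

private:
  std::mutex m_mutex;
  std::condition_variable m_cond;
  std::queue<T> m_queue;
  bool m_closed = false;
};

// Serves one FIFO with a pool of nThreads consumers and returns the sum
// of all elements. Each consumer runs the simple loop: pop, serve, repeat.
long serveWithPool(BlockingFifo<int>& fifo, unsigned nThreads) {
  std::atomic<long> total(0);
  std::vector<std::thread> pool;
  for (unsigned i = 0; i < nThreads; ++i) {
    pool.emplace_back([&fifo, &total] {
      int item = 0;
      while (fifo.pop(item)) total += item;
    });
  }
  for (auto& t : pool) t.join();
  return total.load();
}
```

Producers only ever hold the lock for the duration of an enqueue, which is what keeps posting fast regardless of how slow the consumer side is.
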
Migration session overview (diagram)
Migration Mount Manager (main thread)
  – Instantiates memory manager, injector, packer, disk and tape threads
  – Gives initial kick to task injector
  – Waits for completion
Memory manager (1 thread, client queue)
  – Provides free blocks
Task Injector (1 thread)
  – Receives requests for more work, sent on threshold
  – Gets more work from tape gateway, creates and pushes tasks
Disk Read Thread Pool (n threads)
  – Task queue: pop, execute, delete
  – Disk Read Task: get free blocks, read data from disk, push full data block to the data FIFO
Tape Write Single Thread (1 thread)
  – Task queue: pop, execute, delete
  – Tape Write Task: pop block from the data FIFO, write to tape, (flush,) report result, return free block
Report Packer (1 thread)
  – Packs information and sends bulk report on flush/end of session
Global Status Reporter (1 thread)
  – Packs information for tapeserverd and …

Recall session overview (diagram)
Recall Mount Manager (main thread)
  – Instantiates memory manager, injector, packer, disk and tape threads
  – Gives initial kick to task injector
  – Waits for completion
Memory manager (no thread)
  – Holds free blocks
Task Injector (1 thread)
  – Receives requests for more work, sent on threshold
  – Gets more work from tape gateway, creates and pushes tasks
Tape Read Single Thread (1 thread)
  – Task queue: pop, execute, delete
  – Tape Read Task: pull free blocks, read data from tape, push full data block to the data FIFO
Disk Write Thread Pool (n threads)
  – Task queue: pop, execute, delete
  – Disk Write Task: pop block from the data FIFO, write to disk, report result, return free block
Report Packer (1 thread)
  – Receives individual file reports, flush reports, end of session report
  – Packs information and sends bulk report on threshold/end of session
Global Status Reporter (1 thread)
  – Packs information for tapeserverd and …

Control flow
Task injector
  – Initially called synchronously (empty mount detection)
  – Triggered by requests for more work (stored in a FIFO)
  – Gets more work from client
  – Creates and injects tasks
      – Tasks are created, linked to each other (reader/writer couple) and injected into the tape and disk thread FIFOs
Disk thread pool
  – Pops disk tasks, executes them, deletes them and moves to the next
Tape thread
  – Same as disk, after initializing the session
      – Mounting
      – Tape identification
      – Positioning for writing
      – … and unmounting in the end
The reader thread (or thread pool) requests more work
  – Based on task FIFO content thresholds
  – Always asks for n files or m bytes (whichever comes first, configurable)
  – Asks again when half of that is still available in the task FIFO
  – Asks again one last time when the task FIFO becomes empty (last call)

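The threshold rules for requesting more work could look roughly like the sketch below. All names (`MoreWorkPolicy`, `RequestBatch`, `afterPop`, `tasksInjected`) are hypothetical, and treating "half of that" as "half of the files or half of the bytes, whichever is reached first" is an assumption made for the example.

```cpp
#include <cstdint>

// Hypothetical request sent to the task injector.
struct RequestBatch {
  std::uint64_t maxFiles;
  std::uint64_t maxBytes;
  bool lastCall;  // true when the task FIFO ran empty
};

// Decides when the reader side should ask the injector for more work.
class MoreWorkPolicy {
public:
  MoreWorkPolicy(std::uint64_t maxFiles, std::uint64_t maxBytes)
      : m_maxFiles(maxFiles), m_maxBytes(maxBytes) {}

  // Called after each task is popped, with what is left in the task FIFO.
  // Returns true and fills `request` when a request should be posted.
  bool afterPop(std::uint64_t filesLeft, std::uint64_t bytesLeft,
                RequestBatch& request) {
    if (filesLeft == 0) {
      if (m_lastCallSent) return false;
      m_lastCallSent = true;
      request = RequestBatch{m_maxFiles, m_maxBytes, true};
      return true;
    }
    // Assumption: the threshold trips on files or bytes, whichever first.
    const bool belowHalf =
        filesLeft <= m_maxFiles / 2 || bytesLeft <= m_maxBytes / 2;
    if (belowHalf && !m_requestPending) {
      m_requestPending = true;
      request = RequestBatch{m_maxFiles, m_maxBytes, false};
      return true;
    }
    return false;
  }

  // Called when the injector has pushed new tasks: re-arm the thresholds.
  void tasksInjected() {
    m_requestPending = false;
    m_lastCallSent = false;
  }

private:
  std::uint64_t m_maxFiles;
  std::uint64_t m_maxBytes;
  bool m_requestPending = false;
  bool m_lastCallSent = false;
};
```

Asking at the half-way mark gives the injector time to fetch the next batch while the remaining tasks are still being executed, so the tape thread does not wait on the client.
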
Reporting flow
Reports to client (file related)
  – Posted to a FIFO
  – Packed and transmitted in a separate thread
      – Sent on flush in migrations
      – Sent on thresholds in recalls
  – End of session also follows this path
Reports to parent process (tape/drive related)
  – Posted to a FIFO
  – Transmitted asynchronously by a separate thread
  – Parent process keeps track of the session’s status and informs the VDQM and VMGR

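The packing of file reports into bulk messages can be illustrated with the single-threaded sketch below. `ReportPacker`, `FileReport` and the sink callback are hypothetical names; in this example a threshold of 0 stands for "send only on flush" (migration-like), while a non-zero threshold sends as soon as that many reports have accumulated (recall-like).

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical per-file report.
struct FileReport {
  std::uint64_t fSeq;
  bool success;
};

class ReportPacker {
public:
  using Sink = std::function<void(const std::vector<FileReport>&)>;

  // threshold == 0: reports are only sent on flush or end of session.
  ReportPacker(Sink sink, std::size_t threshold)
      : m_sink(std::move(sink)), m_threshold(threshold) {}

  void reportFile(const FileReport& report) {
    m_pending.push_back(report);
    if (m_threshold != 0 && m_pending.size() >= m_threshold) send();
  }

  // A tape flush makes the pending files safe to report in one bulk message.
  void reportFlush() { send(); }

  // End of session follows the same path as the other reports.
  void reportEndOfSession() { send(); }

private:
  void send() {
    if (m_pending.empty()) return;
    m_sink(m_pending);
    m_pending.clear();
  }

  Sink m_sink;
  std::size_t m_threshold;
  std::vector<FileReport> m_pending;
};
```

In a threaded setup the `report*` calls would be posted to a FIFO and this logic would run in the packer thread, so that a slow client never delays the tape or disk threads.
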
Memory management and data flow
Same as before: circulate a fixed number of memory blocks (size and count configurable)
Errors can be piggybacked on data blocks
  – Writer side always does the reporting, even for read errors
Central memory manager
  – Migration: actively pushes blocks for each tape write task
      – Disk read tasks pull blocks from there
      – They return the blocks with data in a second FIFO
      – Data gets written to tape by the tape write task
  – Recalls: passive container
      – Tape read task pulls memory blocks as needed
      – Pushes them to the disk write tasks (in FIFOs)
      – Disk write tasks push the data to the disk server
  – Memory blocks get recycled to the memory manager after writing to disk or tape

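The circulation of a fixed set of memory blocks, with errors riding on the blocks themselves, might be sketched like this. `MemBlock` and `BlockPool` are hypothetical names, and the pool is deliberately non-blocking (it returns a null pointer when all blocks are in flight) to keep the example single-threaded; a real memory manager would make the caller wait instead.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical memory block. An error on the reader side travels with the
// block so that the writer side can do the reporting.
struct MemBlock {
  std::vector<char> payload;
  std::size_t used = 0;
  bool failed = false;
  std::string errorMessage;

  void reset() {
    used = 0;
    failed = false;
    errorMessage.clear();
  }
};

// Fixed-size pool: blocks are allocated once and then only circulate.
class BlockPool {
public:
  BlockPool(std::size_t count, std::size_t blockSize) {
    for (std::size_t i = 0; i < count; ++i) {
      std::unique_ptr<MemBlock> block(new MemBlock);
      block->payload.resize(blockSize);
      m_free.push_back(std::move(block));
    }
  }

  // Returns a free block, or nullptr when every block is in flight.
  std::unique_ptr<MemBlock> getFreeBlock() {
    if (m_free.empty()) return nullptr;
    std::unique_ptr<MemBlock> block = std::move(m_free.back());
    m_free.pop_back();
    return block;
  }

  // Called by the writer side once the data is on disk or tape.
  void releaseBlock(std::unique_ptr<MemBlock> block) {
    block->reset();
    m_free.push_back(std::move(block));
  }

  std::size_t freeCount() const { return m_free.size(); }

private:
  std::vector<std::unique_ptr<MemBlock>> m_free;
};
```

Because the number of blocks is fixed, a fast reader is naturally throttled by the writer: it cannot get ahead by more than the pool size.
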
Error handling
Reporting
  – Errors get logged when they happen
  – If an error happens in the reader, it gets propagated to the writer through the data path
  – The writer propagates the error to the client
Session behaviour on error
  – Recalls: carry on for stager, halt on error for readtp
      – Absolute positioning by blockId (stager)
      – Relative positioning by fSeq (readtp)
  – Migrations: any error ends the session

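The session behaviour on error reduces to a small decision, sketched below with hypothetical enum and function names. The comments restate the rules from the slide and are not taken from the actual code.

```cpp
// Hypothetical session and client kinds.
enum class SessionType { Migration, Recall };
enum class ClientType { Stager, Readtp };

// Returns true when the session has to stop after a file error.
bool endSessionOnError(SessionType session, ClientType client) {
  // Migrations: any error ends the session.
  if (session == SessionType::Migration) return true;
  // Recalls for readtp position relative to the previous file (fSeq),
  // so the session halts on error.
  // Recalls for the stager position absolutely (blockId), so the session
  // can carry on with the next file.
  return client == ClientType::Readtp;
}
```
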
Main process and sessions
The session is forked by the parent process
  – Parent process keeps track of sessions and drive statuses in a drive catalogue
  – Answers VDQM requests
  – Filters input requests based on drive state
  – Manages the configuration files
The child session reports tape related status to the parent process
  – Mounts, unmounts
  – Amount of data transferred, for the watchdog
The parent process informs the VMGR and VDQM on behalf of the child session
  – Client library completely rewritten
Forking is actually done by a utility sub-process (forker)
  – No actual forking from the multithreaded parent process
Process inventory:
  – 1 parent process + 1 fork helper process
  – N session processes (at most 1 per drive)

ZeroMQ + Protocol buffers
The communication between parent and session processes is a no-risk protocol
  – Both ends get released/deployed together
  – Can be changed at any time
Opportunity to experiment with new serialization methods
  – Need to replace umbrello
This gave good results
  – Protocol buffers provide robust serialization with little development effort
  – ZMQ handles many communication scenarios
Still in finalization (issues in the watchdog communication)

Stuck sessions and recovery
Stuck sessions do happen
  – RFIO problems suspected
Currently handled by a script
  – Log file based: no movement for a set time => kill
  – Problematic with unusually big files
Watchdog will get more internal data
  – Too much to be logged
  – If data stops flowing for a given time => kill
Clean-up process launched automatically when session killed
No clean-up after session failure
  – A non-stuck session failed to do its own clean-up
  – => drive down

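The "data stops flowing for a given time => kill" rule can be sketched as below. `TransferWatchdog` and its methods are hypothetical names; time points are passed in explicitly so that the logic can be tested without waiting, which is a choice made for this example only.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical watchdog fed with the session's byte counter.
class TransferWatchdog {
public:
  using Clock = std::chrono::steady_clock;

  TransferWatchdog(std::chrono::seconds timeout, Clock::time_point start)
      : m_timeout(timeout), m_lastProgress(start) {}

  // Called whenever the session reports its total transferred bytes.
  // Only an increase of the counter counts as progress.
  void reportBytes(std::uint64_t totalBytes, Clock::time_point now) {
    if (totalBytes > m_lastBytes) {
      m_lastBytes = totalBytes;
      m_lastProgress = now;
    }
  }

  // True when no data has moved for longer than the timeout, in which
  // case the parent would kill the session and launch the clean-up.
  bool isStuck(Clock::time_point now) const {
    return now - m_lastProgress > m_timeout;
  }

private:
  std::chrono::seconds m_timeout;
  Clock::time_point m_lastProgress;
  std::uint64_t m_lastBytes = 0;
};
```

Unlike a check on log activity, a byte counter keeps moving during a very large file, so a long transfer is not mistaken for a stuck one.
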
Development methodologies and QA
Full C++, maintainable software
  – Object encapsulation for separately manageable units
      – Easy unit testing
  – Exception handling simplifies error reporting a lot
  – RAII (destructors) simplifies resource management
  – Cleaner drive specifics implementation through inheritance
      – Easy to add new models
Hardcoding-free SCSI and tape format layers
  – Naming conventions matching the SCSI documentation
  – String error reporting for all SCSI errors
  – Very similar approach with the AUL tape format
Unit testing
  – Allows running various scenarios systematically
      – On RPM build
      – Migrations, recalls, good day, bad day, full tape
      – Using fake objects for drive, client interface
      – Easier debugging when problems can be reproduced in unit test context
  – Run tests standalone + through valgrind and helgrind
      – Automatic detection of memory leaks and race conditions
Completely brought to the CASTOR tree
Automated system testing would be a nice addition to this setup

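How inheritance, fake objects, RAII and exceptions fit together in a unit test can be shown with a small sketch. None of these names (`DriveInterface`, `FakeDrive`, `MountGuard`, `writeSession`) come from the CASTOR tree; they are invented for the illustration.

```cpp
#include <cstddef>
#include <exception>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical drive abstraction: real drive models and test fakes both
// derive from it.
class DriveInterface {
public:
  virtual ~DriveInterface() = default;
  virtual void mount(const std::string& vid) = 0;
  virtual void unmount() = 0;
  virtual void writeBlock(const std::vector<char>& data) = 0;
};

// Fake drive for unit tests: records what happened, can simulate a bad day.
class FakeDrive : public DriveInterface {
public:
  void mount(const std::string& vid) override { mountedVid = vid; }
  void unmount() override {
    mountedVid.clear();
    ++unmountCount;
  }
  void writeBlock(const std::vector<char>& data) override {
    if (failWrites) throw std::runtime_error("fake write error");
    bytesWritten += data.size();
  }

  std::string mountedVid;
  int unmountCount = 0;
  std::size_t bytesWritten = 0;
  bool failWrites = false;
};

// RAII: the tape is unmounted when the guard goes out of scope,
// including when an exception unwinds the stack.
class MountGuard {
public:
  MountGuard(DriveInterface& drive, const std::string& vid) : m_drive(drive) {
    m_drive.mount(vid);
  }
  ~MountGuard() { m_drive.unmount(); }
  MountGuard(const MountGuard&) = delete;
  MountGuard& operator=(const MountGuard&) = delete;

private:
  DriveInterface& m_drive;
};

// Writes all blocks to the tape; returns false if any write failed.
bool writeSession(DriveInterface& drive, const std::string& vid,
                  const std::vector<std::vector<char>>& blocks) {
  try {
    MountGuard guard(drive, vid);
    for (const auto& block : blocks) drive.writeBlock(block);
    return true;
  } catch (const std::exception&) {
    return false;
  }
}
```

The same `writeSession` code runs against the fake in a unit test and against a real drive class in production, which is what makes "good day" and "bad day" scenarios reproducible on every build.
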
What changes in practice?
The new logs
  – Convergence with the rest of CASTOR logs
  – Single line at completion of tape thread
      – Summarises the session for tape log
  – More detailed timings
      – Will make it easier to pinpoint performance bottlenecks
  – New log parsing required
      – Should be greatly simplified as all relevant information is on a single line
A single daemon
Configuration not radically changed

What is still missing?
Support for Oracle libraries
The parent process’s watchdog for transfer sessions
  – Will move stuck transfer detection from operators' scripts to the daemon itself (with better precision)
File transfer protocol switching
  – Add local file support
      – Reliance on RFIO removed
  – Add Xroot support
      – Switched on by configuration instead of RFIO
      – Diskserver >= required (for stat call)
  – Add Ceph support
      – Disk path based switch, automatic
Fine tuning of logs for operations
Document the latest developments

Release and deployment
Data transfers are being validated now on IBM drives
Oracle drives will follow with mount support
Some previously mentioned features missing
Target date for a tapeserverd-only CASTOR release: end of November
Production deployment ~January
Compatible with current stagers
on disk server will be needed for using Xroot
is the end of road for rtcpd/taped

Logical block protection
Tests of the tape drive feature have been done by F. Nikolaidis, J. Leduc and K. Ha
Adds a 4-byte checksum to tape blocks
Protects the data block during the transfer from computer memory to tape drive
Two checksum algorithms in use today:
  – Reed-Solomon
  – CRC32-C
Reed-Solomon requires 2 threads to match drive throughput
CRC32-C can fit in a single thread
  – CRC32-C is available on most recent drives

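For illustration, a plain bitwise CRC32-C (Castagnoli polynomial, reflected form 0x82F63B78) and the idea of appending a 4-byte checksum to a block can be sketched as follows. The function names are hypothetical, the little-endian byte order used when appending is an assumption of this example and not a statement about the on-tape format, and a production version would use a table-driven or hardware-assisted CRC to reach drive speed.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Bitwise CRC32-C: initial value and final XOR are both 0xFFFFFFFF.
std::uint32_t crc32c(const unsigned char* data, std::size_t len) {
  std::uint32_t crc = 0xFFFFFFFFu;
  for (std::size_t i = 0; i < len; ++i) {
    crc ^= data[i];
    for (int bit = 0; bit < 8; ++bit) {
      // XOR with the polynomial only when the low bit is set.
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
  }
  return crc ^ 0xFFFFFFFFu;
}

// Appends the 4-byte checksum to the block (byte order: assumption).
void appendChecksum(std::vector<unsigned char>& block) {
  const std::uint32_t crc = crc32c(block.data(), block.size());
  for (int i = 0; i < 4; ++i) {
    block.push_back(static_cast<unsigned char>((crc >> (8 * i)) & 0xFFu));
  }
}

// Recomputes the checksum over the payload and compares it with the
// trailing 4 bytes.
bool verifyBlock(const std::vector<unsigned char>& block) {
  if (block.size() < 4) return false;
  const std::size_t payloadLen = block.size() - 4;
  std::uint32_t stored = 0;
  for (int i = 0; i < 4; ++i) {
    stored |= static_cast<std::uint32_t>(block[payloadLen + i]) << (8 * i);
  }
  return crc32c(block.data(), payloadLen) == stored;
}
```

Computing the checksum in memory before the block leaves for the drive is what lets a corruption on the path between host memory and tape be detected.
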
Next tape developments
Tapeserverd
  – Logical block protection integration
  – Support for pre-emption of session
VDQM/VMGR
  – Merge of the two in a single tape resource manager
      – Simplify interface
      – Asymmetric drive support
      – Improve scheduling (atomic tape-in-drive semantics for migrations)
          – Today, the chosen tape might not have compatible drives available, leading to migration delays
      – Remove need for manual synchronization
      – Consider pre-emptive scheduling
          – Max out the system with background tasks (repack, verify)
          – Interrupt and make space for user sessions when they come
          – Allow over quota for users when free drives exist
          – Leading to 100% utilisation of the drives
          – Facilitates tape server upgrades
  – Integrate the authentication part for tape (from Cupv)

Conclusion
Tape server stack has been rewritten and consolidated
  – New features already provide improvements
      – Empty mount protection for both read and write
      – Full request and report latency shadowing
      – Better timing monitoring is already in place
  – Major clean-up will allow easier development and maintenance
More new features coming
  – Xroot/Ceph support
  – Logical block protection
  – Session pre-emption
End of the road for rtcpd/taped
  – Will be dropped as soon as we are happy with tapeserverd in production
More tape software consolidation around the corner
  – VDQM/VMGR