New tape server software: status and plans
CASTOR face-to-face workshop, September 2014
Eric Cano, on behalf of the CERN IT-DSS group
Data & Storage Services (DSS), CERN IT Department, CH-1211 Genève 23, Switzerland

Overview
- Features for first release
- New tape server architecture
  – Control and reporting flows
  – Memory management and data flow
  – Error handling
  – Main process and sessions
  – Stuck sessions and recovery
- Development methodologies and QA
- What changes in practice?
- What is still missing?
- Logical block protection investigation
- Release plans and potential new features

Features for first release
- Continuation of the push to replace the legacy tape software
  – Started with the creation of the tape gateway and bridge
  – VMGR + VDQM will be next
- Drop-in replacement
  – tapeserverd is consolidated into a single daemon
  – Replaces the previous stack: taped & satellites + rtcpd + tapebridged
- (Almost) identical external protocols
  – Stager / CLI client (readtp is unchanged)
  – VMGR / VDQM
  – tpstat / tpconfig
  – New labelling command (castor-tape-label)
- Keep what works:
  – One process per session (PID listed in tpstat, as before)
- Better logs
- Latency shadowing (no impact from a slow DB)
- Empty mount protection
- Result of big teamwork since the last meeting:
  – E. Cano, S. Murray, V. Kotlyar, D. Kruse, D. Come

New tape server architecture
- Pipelined: based on FIFOs and threads / thread pools (see the sketch below)
  – Posting to a FIFO is always fast: data blocks, reports and requests for more work are all pushed this way
  – Each FIFO's output is served by one thread (pool), in a simple loop: pop, use/serve the data/request, repeat
  – All latencies are shadowed in the various threads
  – The pipeline is kept non-empty with task prefetch
  – N-way parallel disk access (as before)
  – All reporting is asynchronous
- The tape thread is the central element that we want to keep busy at full speed
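A minimal sketch of the FIFO building block described above, assuming C++11; the class name and interface are illustrative, not the actual CASTOR code. Producers post without waiting for the consumer; each consumer thread sits in a blocking pop loop.

#include <condition_variable>
#include <mutex>
#include <queue>

// Illustration of the FIFO pattern: unbounded, so posting is always fast;
// the consumer thread blocks in pop() until work arrives.
template <typename T>
class BlockingFifo {
public:
  void push(T item) {
    {
      std::lock_guard<std::mutex> lock(m_mutex);
      m_queue.push(std::move(item));
    }
    m_cond.notify_one();
  }

  T pop() {
    std::unique_lock<std::mutex> lock(m_mutex);
    m_cond.wait(lock, [this] { return !m_queue.empty(); });
    T item = std::move(m_queue.front());
    m_queue.pop();
    return item;
  }

private:
  std::mutex m_mutex;
  std::condition_variable m_cond;
  std::queue<T> m_queue;
};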

Migration session overview (pipeline diagram)
- Migration Mount Manager (main thread): instantiates the memory manager, task injector, report packer, and the disk and tape threads; gives the initial kick to the task injector; waits for completion
- Task Injector (1 thread): gets more work from the tape gateway, creates and pushes tasks; triggered again by "request more on threshold"
- Disk Read Thread Pool (n threads): each thread pops a task from the task queue, executes it and deletes it; a Disk Read Task gets free blocks from the memory manager, reads data from disk and pushes full data blocks into its tape write task's data FIFO
- Tape Write Single Thread (1 thread): pops tasks from its task queue, executes and deletes them; a Tape Write Task pops blocks from its data FIFO, writes them to tape, (flushes,) reports the result and returns free blocks to the memory manager
- Report Packer (1 thread): packs information and sends a bulk report on flush / end of session
- Global Status Reporter (1 thread): packs information for tapeserverd (the parent process)
- Memory Manager (1 thread; free blocks, client queue): actively provides free blocks for each tape write task; the disk read tasks get their free blocks from there

Recall session overview (pipeline diagram)
- Recall Mount Manager (main thread): instantiates the memory manager, task injector, report packer, and the disk and tape threads; gives the initial kick to the task injector; waits for completion
- Task Injector (1 thread): gets more work from the tape gateway, creates and pushes tasks; triggered again by "request more on threshold"
- Tape Read Single Thread (1 thread): pops tasks from its task queue, executes and deletes them; a Tape Read Task pulls free blocks from the memory manager, reads data from tape and pushes full data blocks to the disk write tasks' data FIFOs
- Disk Write Thread Pool (n threads): each thread pops a task from the task queue, executes it and deletes it (worker loop sketched below); a Disk Write Task pops blocks from its data FIFO, writes them to disk, reports the result and returns free blocks to the memory manager
- Report Packer (1 thread): packs individual file reports, flush reports and the end-of-session report, and sends bulk reports on threshold / end of session
- Global Status Reporter (1 thread): packs information for tapeserverd (the parent process)
- Memory Manager (free blocks, no thread): passive container from which the tape read task pulls blocks
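A sketch of the "pop, execute, delete" worker loop run by the thread pools and by the single tape thread, reusing the BlockingFifo sketched earlier; Task, DiskThreadPool and the null-task end-of-session marker are illustrative names, not the actual CASTOR classes.

#include <memory>
#include <thread>
#include <vector>

// Illustrative task interface: disk read/write and tape read/write tasks
// would all implement execute().
class Task {
public:
  virtual ~Task() {}
  virtual void execute() = 0;
};

class DiskThreadPool {
public:
  explicit DiskThreadPool(size_t nbThreads) {
    for (size_t i = 0; i < nbThreads; ++i)
      m_threads.emplace_back(&DiskThreadPool::workerLoop, this);
  }

  void push(std::unique_ptr<Task> task) { m_tasks.push(std::move(task)); }

  void finish() {
    // One null task per worker acts as the end-of-session marker.
    for (size_t i = 0; i < m_threads.size(); ++i) m_tasks.push(nullptr);
    for (auto &t : m_threads) t.join();
  }

private:
  void workerLoop() {
    for (;;) {
      std::unique_ptr<Task> task = m_tasks.pop();  // pop
      if (!task) return;                           // end of session
      task->execute();                             // execute
    }                                              // delete (unique_ptr goes out of scope)
  }

  BlockingFifo<std::unique_ptr<Task>> m_tasks;
  std::vector<std::thread> m_threads;
};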

Control flow
- Task injector
  – Initially called synchronously (empty mount detection)
  – Triggered by requests for more work (stored in a FIFO)
  – Gets more work from the client
  – Creates and injects tasks: tasks are created, linked to each other (reader/writer couple) and injected into the tape and disk thread FIFOs
- Disk thread pool
  – Pops disk tasks, executes them, deletes them and moves on to the next
- Tape thread
  – Same as the disk threads, after initializing the session: mounting, tape identification, positioning for writing... and unmounting at the end
- The reader thread (pool) requests more work (policy sketched below)
  – Based on task FIFO content thresholds
  – Always asks for n files or m bytes (whichever comes first, configurable)
  – Asks again when half of that is still available in the task FIFO
  – Asks again one last time when the task FIFO becomes empty (last call)
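A sketch of the request-more-work policy just described; the structure name and the default numbers are purely illustrative (the real limits are configurable), and the half-full test is simplified.

#include <cstdint>

// Illustrative thresholds: request up to maxFiles files or maxBytes bytes,
// ask again when the task queue has drained to half of that, and make one
// final request ("last call") when the queue becomes empty.
struct InjectionPolicy {
  uint64_t maxFiles = 500;                          // illustrative default
  uint64_t maxBytes = 80ULL * 1000 * 1000 * 1000;   // illustrative default

  bool shouldRequestMore(uint64_t queuedFiles, uint64_t queuedBytes,
                         bool lastCallAlreadyMade) const {
    if (queuedFiles == 0 && queuedBytes == 0)
      return !lastCallAlreadyMade;                  // last call on empty queue
    return queuedFiles <= maxFiles / 2 && queuedBytes <= maxBytes / 2;
  }
};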

Reporting flow
- Reports to the client (file related)
  – Posted to a FIFO
  – Packed and transmitted in a separate thread: sent on flush in migrations, on thresholds in recalls
  – End of session also follows this path
- Reports to the parent process (tape/drive related)
  – Posted to a FIFO
  – Transmitted asynchronously by a separate thread
  – The parent process keeps track of the session's status and informs the VDQM and VMGR

Memory management and data flow
- Same as before: circulate a fixed number of memory blocks (size and count configurable); see the pool sketch below
- Errors can be piggy-backed on data blocks
  – The writer side always does the reporting, even for read errors
- Central memory manager
  – Migration: actively pushes blocks for each tape write task; disk read tasks pull blocks from there, return them with data in a second FIFO, and the data gets written to tape by the tape write task
  – Recalls: passive container; the tape read task pulls memory blocks as needed and pushes them to the disk write tasks (in FIFOs), which push the data to the disk server
  – Memory blocks get recycled to the memory manager after writing to disk or tape
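A sketch of the fixed-size block pool described above, reusing the BlockingFifo from the earlier sketch; the class and member names are illustrative, not the actual CASTOR memory manager.

#include <cstddef>
#include <memory>
#include <vector>

// Illustrative data block: a fixed-size payload plus how much of it is used.
struct DataBlock {
  std::vector<char> payload;
  size_t usedBytes = 0;
};

class MemoryManager {
public:
  // Allocate the whole pool once; blocks then circulate between the
  // memory manager, the reader tasks and the writer tasks.
  MemoryManager(size_t blockCount, size_t blockSize) {
    for (size_t i = 0; i < blockCount; ++i) {
      m_allBlocks.emplace_back(new DataBlock());
      m_allBlocks.back()->payload.resize(blockSize);
      m_freeBlocks.push(m_allBlocks.back().get());
    }
  }

  // Reader side: blocks here if every block is currently in flight.
  DataBlock *getFreeBlock() { return m_freeBlocks.pop(); }

  // Writer side: recycle the block once its data is on tape or on disk.
  void releaseBlock(DataBlock *block) {
    block->usedBytes = 0;
    m_freeBlocks.push(block);
  }

private:
  std::vector<std::unique_ptr<DataBlock>> m_allBlocks;  // owns the fixed set of blocks
  BlockingFifo<DataBlock *> m_freeBlocks;               // blocks currently free
};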

Error handling
- Reporting
  – Errors get logged when they happen
  – If an error happens in the reader, it gets propagated to the writer through the data path (sketched below)
  – The writer propagates the error to the client
- Session behaviour on error
  – Recalls: carry on for the stager, halt on error for readtp (absolute positioning by blockId for the stager, relative positioning by fSeq for readtp)
  – Migrations: any error ends the session
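A sketch of how a read error can travel down the data path as described above; the types and the reportFailedJob call are illustrative, not the actual CASTOR interfaces.

#include <string>
#include <vector>

// Illustrative block carrying either data or a piggy-backed error.
struct Block {
  std::vector<char> data;
  bool failed = false;
  std::string errorMessage;
};

// Reader side: on failure, mark the block instead of reporting directly.
void markReadError(Block &block, const std::string &what) {
  block.failed = true;
  block.errorMessage = what;
}

// Writer side: a failed block turns into a client error report; a good
// block is written out normally. The writer always owns the reporting.
bool handleBlockOnWriterSide(const Block &block /*, ReportPacker &reports */) {
  if (block.failed) {
    // reports.reportFailedJob(..., block.errorMessage);  // hypothetical call
    return false;  // abandon this file (and the whole session, for migrations)
  }
  // write block.data to tape or disk here
  return true;
}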

Main process and sessions
- The session is forked by the parent process
  – The parent process keeps track of sessions and drive statuses in a drive catalogue
  – Answers VDQM requests
  – Filters incoming requests based on drive state
  – Manages the configuration files
- The child session reports tape-related status to the parent process
  – Mounts, unmounts
  – Amount of data transferred, for the watchdog
- The parent process informs the VMGR and VDQM on behalf of the child session
  – Client library completely rewritten
- Forking is actually done by a utility sub-process (the forker); see the sketch below
  – No actual forking from the multithreaded parent process
- Process inventory:
  – 1 parent process + 1 fork helper process
  – N session processes (at most 1 per drive)
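A sketch of the fork-helper pattern, assuming POSIX: the helper is forked before the parent creates any threads, so the actual fork() of a session never happens in a multithreaded process. The one-byte request and the immediate waitpid are simplifications; the real request carries the session parameters and the parent tracks sessions asynchronously.

#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdint>

int main() {
  int pair[2];
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, pair) < 0) return 1;

  pid_t helper = fork();
  if (helper == 0) {                  // fork helper process
    close(pair[0]);
    uint8_t request;
    while (read(pair[1], &request, 1) == 1) {
      pid_t session = fork();         // safe: this process has no threads
      if (session == 0) {
        // exec or run the data-transfer session here
        _exit(0);
      }
      waitpid(session, nullptr, 0);   // simplification: reap immediately
    }
    _exit(0);
  }

  // Parent (may now become multithreaded): request a session fork.
  close(pair[1]);
  uint8_t request = 1;
  if (write(pair[0], &request, 1) != 1) return 1;
  // ... continue handling VDQM requests, the drive catalogue, etc.
  return 0;
}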

ZeroMQ + Protocol Buffers
- The parent/session inter-process communication is a no-risk protocol (a sketch follows)
  – Both ends get released/deployed together
  – Can be changed at any time
- Opportunity to experiment with new serialization methodologies
  – Need to replace umbrello
- This gave good results
  – Protocol Buffers provide robust serialization with little development effort
  – ZMQ handles many communication scenarios
- Still being finalized (issues in the watchdog communication)
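A sketch of the serialization and transport combination: SessionStatus is a hypothetical message generated by protoc (the .proto in the comment is illustrative), and the ipc endpoint is made up; only the ZeroMQ calls themselves are the real library API.

#include <zmq.h>
#include <cstdint>
#include <string>
// Hypothetical message, e.g. generated from a .proto like:
//   message SessionStatus { required string vid = 1; required uint64 bytes = 2; }
#include "session_status.pb.h"   // assumed generated header

// Serialize a Protocol Buffers message and push it to the parent process
// over a ZeroMQ request socket, then wait for the acknowledgement.
void reportStatus(void *zmqContext, const std::string &vid, uint64_t bytes) {
  SessionStatus msg;
  msg.set_vid(vid);
  msg.set_bytes(bytes);

  std::string wire;
  msg.SerializeToString(&wire);

  void *socket = zmq_socket(zmqContext, ZMQ_REQ);
  zmq_connect(socket, "ipc:///var/run/tapeserverd.sock");  // illustrative endpoint
  zmq_send(socket, wire.data(), wire.size(), 0);

  char reply[256];
  zmq_recv(socket, reply, sizeof(reply), 0);               // wait for the ack
  zmq_close(socket);
}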

Stuck sessions and recovery
- Stuck sessions do happen
  – RFIO problems suspected
- Currently handled by a script
  – Log-file based: no movement for a set time => kill
  – Problematic with unusually big files
- The watchdog will get more internal data (see the sketch below)
  – Too much to be logged
  – If data stops flowing for a given time => kill
- A clean-up process is launched automatically when a session is killed
- No clean-up after a session failure
  – A non-stuck session failed to do its own clean-up
  – => drive down
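A sketch of the data-flow watchdog described above; the class name, the reporting channel and the use of SIGKILL are illustrative simplifications.

#include <signal.h>
#include <sys/types.h>
#include <cstdint>
#include <ctime>

// The parent samples a bytes-transferred counter reported by the session;
// if it does not move for timeoutSeconds, the session gets killed and the
// clean-up process is launched.
class SessionWatchdog {
public:
  SessionWatchdog(pid_t sessionPid, time_t timeoutSeconds)
    : m_pid(sessionPid), m_timeout(timeoutSeconds),
      m_lastBytes(0), m_lastProgress(time(nullptr)) {}

  // Called periodically with the latest counter from the session process.
  void check(uint64_t bytesTransferred) {
    const time_t now = time(nullptr);
    if (bytesTransferred != m_lastBytes) {
      m_lastBytes = bytesTransferred;
      m_lastProgress = now;              // data is flowing again
    } else if (now - m_lastProgress > m_timeout) {
      kill(m_pid, SIGKILL);              // stuck: kill, then run the clean-up
    }
  }

private:
  pid_t m_pid;
  time_t m_timeout;
  uint64_t m_lastBytes;
  time_t m_lastProgress;
};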

Development methodologies and QA
- Full C++, maintainable software
  – Object encapsulation for separately manageable units: easy unit testing
  – Exception handling simplifies error reporting a lot
  – RAII (destructors) simplifies resource management (sketched below)
  – Cleaner implementation of drive specifics through inheritance: easy to add new models
- Hardcoding-free SCSI and tape-format layers
  – Naming conventions matching the SCSI documentation
  – String error reporting for all SCSI errors
  – Very similar approach for the AUL tape format
- Unit testing
  – Allows running various scenarios systematically, on RPM build: migrations, recalls, good day, bad day, full tape, using fake objects for the drive and client interfaces
  – Easier debugging when problems can be reproduced in a unit-test context
  – Tests run standalone and through valgrind and helgrind: automatic detection of memory leaks and race conditions
  – Completely brought into the CASTOR tree
- Automated system testing would be a nice addition to this setup
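A sketch of the RAII idiom mentioned above, with the drive calls elided; the class is illustrative, not the actual CASTOR code. The point is that the destructor runs on every exit path, including exceptions, so the release of the drive cannot be forgotten.

#include <string>

// Illustrative RAII guard: the tape is unmounted in the destructor, so any
// return or exception inside the session still releases the drive.
class ScopedTapeMount {
public:
  explicit ScopedTapeMount(const std::string &vid)
    : m_vid(vid), m_mounted(false) {
    // drive.mountTape(m_vid);  (elided in this sketch)
    m_mounted = true;
  }

  ~ScopedTapeMount() {
    if (m_mounted) {
      try {
        // drive.unmountTape();  (elided in this sketch)
      } catch (...) {
        // never throw from a destructor; log the failure instead
      }
    }
  }

private:
  std::string m_vid;
  bool m_mounted;
};

// Usage sketch:
//   void runSession(const std::string &vid) {
//     ScopedTapeMount mount(vid);
//     // ... transfer files; any exception still unmounts the tape ...
//   }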

What changes in practice?
- The new logs
  – Convergence with the rest of the CASTOR logs
  – A single line at completion of the tape thread summarises the session for the tape log
  – More detailed timings will make it easier to pinpoint performance bottlenecks
  – New log parsing required, but it should be greatly simplified as all relevant information is on a single line
- A single daemon
- Configuration not radically changed

What is still missing?
- Support for Oracle libraries
- The parent process's watchdog for transfer sessions
  – Will move stuck-transfer detection from operators' scripts to internal (with better precision)
- File transfer protocol switching
  – Add local file support: reliance on RFIO removed
  – Add Xroot support: switched on by configuration instead of RFIO (diskserver >= required, for the stat call)
  – Add Ceph support: disk-path-based switch, automatic
- Fine tuning of logs for operations
- Document the latest developments

Release and deployment
- Data transfers are being validated now on IBM drives
- Oracle drives will follow, with mount support
- Some previously mentioned features are still missing
- Target date for a tapeserverd-only CASTOR release: end of November
- Production deployment: ~January
- Compatible with current stagers
- A newer release on the disk servers will be needed for using Xroot
- This release is the end of the road for rtcpd/taped

Logical block protection
- Tests of this tape drive feature have been done by F. Nikolaidis, J. Leduc and K. Ha
- Adds a 4-byte checksum to tape blocks
- Protects the data block during the transfer from computer memory to the tape drive
- 2 checksum algorithms are in use today:
  – Reed-Solomon
  – CRC32-C
- Reed-Solomon requires 2 threads to match drive throughput; CRC32-C can fit in a single thread (sketch below)
  – CRC32-C is available on most recent drives
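A bit-by-bit software sketch of CRC32-C (Castagnoli polynomial, reflected form 0x82F63B78) for illustration; production code would use a table-driven or hardware-assisted version (the SSE4.2 CRC32 instruction implements this same polynomial), which is what makes it fit in a single thread.

#include <cstddef>
#include <cstdint>

// Bit-by-bit CRC32-C (Castagnoli), reflected polynomial 0x82F63B78.
// For the standard test vector "123456789" this returns 0xE3069283.
uint32_t crc32c(const void *buf, size_t len, uint32_t crc = 0) {
  const uint8_t *p = static_cast<const uint8_t *>(buf);
  crc = ~crc;                      // pre-invert, as required by CRC32-C
  while (len--) {
    crc ^= *p++;
    for (int i = 0; i < 8; ++i)    // process one bit at a time
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
  }
  return ~crc;                     // post-invert
}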

Next tape developments
- Tapeserverd
  – Logical block protection integration
  – Support for pre-emption of sessions
- VDQM/VMGR
  – Merge of the two into a single tape resource manager
  – Simplify the interface
  – Asymmetric drive support
  – Improve scheduling (atomic tape-in-drive semantics for migrations): today, the chosen tape might not have compatible drives available, leading to migration delays; remove the need for manual synchronization
  – Consider pre-emptive scheduling: max out the system with background tasks (repack, verify), interrupt and make space for user sessions when they come, and allow going over quota for users when free drives exist; this leads to 100% utilisation of the drives and facilitates tape server upgrades
  – Integrate the authentication part for tape (from Cupv)

Conclusion
- The tape server stack has been re-written and consolidated
  – New features already provide improvements: empty mount protection for both read and write, full request and report latency shadowing; better timing monitoring is already in place
  – The major clean-up will allow easier development and maintenance
- More new features coming
  – Xroot/Ceph support
  – Logical block protection
  – Session pre-emption
- End of the road for rtcpd/taped
  – Will be dropped as soon as we are happy with tapeserverd in production
- More tape software consolidation around the corner
  – VDQM/VMGR