2011+ Massimo Lamanna / CERN (for the IT-DSS group) ATLAS retreat Napoli 2-4 Jan 2011

Introduction
- IT-DSS is responsible for the data management for physics at CERN: mainly, but not exclusively, the LHC experiments.
- Physics data (on disk and tape): notably AFS, CASTOR and EOS.
- Production services, but not a steady-state situation: technology evolves, and new ideas are being discussed within HEP, including the rediscovery of "old" ideas ("all relevant data on disk" was part of the computing model back in 1999, Hoffmann review).
- Experience from LHC data taking: 10+ M files per month; a factor of 3 in total size over the next few years (40 PB → 120 PB in 2015).
- We want to learn more about your 2011 plans (rates, data model, etc.): real data are more interesting than Monte Carlo. Users, users, users!
- Master the operational load!

How we proceeded so far
- Bringing EOS alive: integrating the best ideas, technology and experience.
- Refocus on the experiments' requirements.
- Expose developments at an early stage; operations on board from the start.
- First experience: the Large Scale Test (LST) with ATLAS; a similar activity is starting with CMS, and there is interest from others.
- Some similarities with Tier-3 / Analysis Facility approaches.

Next steps (2011)
- Software: EOS consolidation (April).
- Operations: set up EOS as a service (streamlined ops effort). A steady process: continue to build on the CASTOR experience (procedures, etc.); simplify the overall model and procedures (in a nutshell: no need for synchronous, human intervention); more monitoring (including user monitoring).
- Deployment: a D1 area for dependable, serviceable, high-performance, high-concurrency, multi-user data access (the analysis use cases).

Immediate steps
- Streamlining of the space tokens (disk pools): fewer (but larger) entities ("disk pools"); sharing of different approaches/experiences; reduction of duplication and optimisation of resources.
- Analysis areas (CASTOR → EOS).
- Hot-disk areas: the interplay with CernVM-FS data will be understood.
- Replication per directory: fine-grained control of reliability, availability and performance.
- Continuing to "protect" tape resources from users :) CMS is looking into the ATLAS model; successful implementation depends on experiment support.
- More transparency (monitoring and user-support activities): Service-Now migration; keeping users (not only the ADC experts) aware of what is going on will help our 1st/2nd/3rd line support.

Replica healing (3-replica case): asynchronous operations (failure/draining/rebalancing/...)
[Diagram: clients, the MGM and replica nodes; all circles represent machines in production.]
The intervention on the failing box can be done when appropriate (asynchronously), because the system re-establishes the foreseen number of copies. The machine carries no state: it can be reinstalled from scratch and the disk cleaned, as in the batch-node case.

Future developments (example): availability and reliability
- Plain (reliability of the service = reliability of the hardware). Example: a CASTOR disk server running RAID1 (mirroring: n=1, m=1; storage cost S = 1 + m/n). Since the two copies belong to the same PC box, availability can be a problem.
- Replication: reliable, maximum performance, but heavy storage overhead. Example: n=1, m=2; S = 1 + 200% (EOS).
- Reed-Solomon, double/triple parity, NetRaid5, NetRaid6: maximum reliability, minimum storage overhead. Example: (n+m) = 10+3, can lose any 3 out of 13; S = 1 + 30%. An EOS/RAID5 prototype is being evaluated internally.
- Low-Density Parity Check (LDPC) / Fountain codes: excellent performance; more storage overhead than Reed-Solomon but still much less than replication. Example: (n+m) = 8+6, can lose any 3 out of 14 (S = 1 + 75%).
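The storage cost quoted for each layout follows directly from S = 1 + m/n (n data stripes, m redundancy stripes). A quick check, as a minimal Python sketch reproducing the figures above (illustrative only):

# Storage cost S = 1 + m/n for an (n + m) redundancy layout:
# n data stripes, m redundancy stripes.

def storage_cost(n, m):
    """Return the storage cost factor S for n data + m redundancy stripes."""
    return 1.0 + m / n

layouts = {
    "RAID1 mirroring (n=1, m=1)":    (1, 1),
    "Triple replication (n=1, m=2)": (1, 2),
    "Reed-Solomon 10+3":             (10, 3),
    "LDPC / fountain 8+6":           (8, 6),
}

for name, (n, m) in layouts.items():
    s = storage_cost(n, m)
    print(f"{name:32s}  S = {s:.2f}  (overhead {100 * (s - 1):.0f}%)")

This prints S = 2.00, 3.00, 1.30 and 1.75 respectively, matching the overheads listed on the slide.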

EOS CASTOR

Spare (for discussion)

Requirements for analysis
- Multi-PB facility
- RW file access (random and sequential reads, updates)
- Fast hierarchical namespace – target capacity: 10^9 files, 10^6 containers (directories)
- Strong authorization
- Quotas
- Checksums
- Distributed redundancy of services and data – dependability and durability
- Dynamic hardware pool scaling and replacement without downtime – operability

Starting points
- April 2010: storage discussions within the IT-DSS group; prototyping/development started in May.
- Input/discussion at the Daam workshop (June 17/18): demonstrators; build on xroot strengths and know-how.
- The prototype has been under evaluation since August. Pilot user: ATLAS. Input from the CASTOR team (notably operations).
- ATLAS Large Scale Test (pool of ~1.5 PB): now being opened to ATLAS users; run by the CERN DSS operations team; still much work left to do.
- Good points: early involvement of the users; operations in the project from the beginning.
- This activity is what we call EOS.

Selected features of EOS
- A set of XRootd plug-ins, and it speaks the xroot protocol with you.
- Just a Bunch Of Disks (JBOD): no hardware RAID arrays; "network RAID" within node groups.
- Per-directory settings: operations (and users) decide availability/performance (number of replicas per directory, not physical placement); one pool of disks, different classes of service (see the sketch below).
- Dependable and durable: self-healing; "asynchronous" operations (e.g. replace broken disks when "convenient" while the system keeps running).
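The per-directory setting can be pictured as a policy attribute attached to a directory and inherited by everything below it. A minimal, purely illustrative Python sketch; the policy table, paths and the default of 2 replicas are assumptions, not the actual EOS mechanism:

# Illustrative only: per-directory replica policy, inherited down the tree.
# The policy store and the default of 2 replicas are assumptions for the sketch.

import posixpath

directory_policy = {
    "/eos/atlas":     {"nreplicas": 2},   # default for the experiment area
    "/eos/atlas/hot": {"nreplicas": 3},   # hot data: extra availability
}

def replicas_for(path, default=2):
    """Walk up from the file's directory to the closest directory with a policy."""
    d = posixpath.dirname(path)
    while True:
        if d in directory_policy:
            return directory_policy[d]["nreplicas"]
        if d in ("/", ""):
            return default
        d = posixpath.dirname(d)

print(replicas_for("/eos/atlas/hot/conditions.root"))   # -> 3
print(replicas_for("/eos/atlas/data10/file.root"))      # -> 2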

EOS Architecture
[Diagram: head node, file servers and message queue (MQ), with synchronous and asynchronous namespace interactions.]
- Head node: namespace, quota; strong authentication; capability engine; file placement; file location.
- Message queue: service state messages; file transaction reports.
- File server: file and file metadata store; capability authorization; checksumming and verification; disk error detection (scrubbing).
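The split between the head node's "capability engine" and the file server's "capability authorization" can be illustrated with a signed token: the head node authorizes the request and signs a capability naming the file and the target server, and the file server only verifies the signature. A sketch under assumed names and an assumed token format (not the actual EOS wire protocol):

# Illustrative capability flow between head node (MGM) and file server (FST).
# Token fields and the shared-secret signing scheme are assumptions for the sketch.

import hmac, hashlib, json, time

SHARED_SECRET = b"mgm-fst-shared-secret"   # assumed: known to MGM and all FSTs

def issue_capability(path, fst, mode, lifetime=300):
    """Head node: authorize the request and sign a capability for one file server."""
    cap = {"path": path, "fst": fst, "mode": mode, "expires": time.time() + lifetime}
    payload = json.dumps(cap, sort_keys=True).encode()
    sig = hmac.new(SHARED_SECRET, payload, hashlib.sha1).hexdigest()
    return payload, sig

def verify_capability(payload, sig):
    """File server: check signature and expiry, no call back to the head node."""
    expected = hmac.new(SHARED_SECRET, payload, hashlib.sha1).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False
    return json.loads(payload)["expires"] > time.time()

payload, sig = issue_capability("/eos/atlas/data/file.root", "fst042.cern.ch", "read")
print(verify_capability(payload, sig))   # -> True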

EOS Namespace
Version 1 (current)
- In-memory hierarchical namespace using Google hash maps
- Stored on disk as a changelog file, rebuilt in memory on startup
- Two views: hierarchical view (directory view) and storage-location view (filesystem view)
- Very fast, but limited by the size of memory: 1 GB = ~1 M files
Version 2 (under development)
- Only the view index in memory; metadata read from disk/buffer cache
- Perfect use case for SSDs (need random IOPS)
- 10^9 files = ~20 GB per view
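The "changelog file, rebuilt in memory on startup" idea is an append-only log replayed into an in-memory map. A small illustrative Python sketch; the record format and field names are invented for the example:

# Illustrative changelog-backed namespace: every mutation is appended to a log,
# and the in-memory map is rebuilt by replaying the log at startup.
# The record layout (one JSON object per line) is an assumption for the sketch.

import json

class ChangelogNamespace:
    def __init__(self, logfile):
        self.logfile = logfile
        self.files = {}                      # path -> metadata (the in-memory view)

    def create(self, path, meta):
        self._append({"op": "create", "path": path, "meta": meta})
        self.files[path] = meta

    def unlink(self, path):
        self._append({"op": "unlink", "path": path})
        self.files.pop(path, None)

    def _append(self, record):
        with open(self.logfile, "a") as f:
            f.write(json.dumps(record) + "\n")

    def boot(self):
        """Replay the changelog to rebuild the in-memory namespace."""
        self.files.clear()
        with open(self.logfile) as f:
            for line in f:
                rec = json.loads(line)
                if rec["op"] == "create":
                    self.files[rec["path"]] = rec["meta"]
                elif rec["op"] == "unlink":
                    self.files.pop(rec["path"], None)

ns = ChangelogNamespace("/tmp/eos-ns.changelog")
ns.create("/eos/demo/file.root", {"size": 1024, "replicas": 2})
ns.boot()
print(ns.files)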

Namespace: V1 vs. V2

                                              V1               V2 *
  Inode scale                                 100 M inodes     1000 M inodes
  In-memory size                              ~100 GB          20 GB x n(replica)
    (replicas have minor space contribution)
  Boot time                                   ~520 s **        minutes ** (difficult to guess)
  Pool size (avg. 10 MB/file, 2 replicas)     2 PB             20 PB
  Pool nodes (assuming 40 TB/node)            50               500
  File systems (assuming 20 per node)         1000             10000
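The table values are back-of-the-envelope arithmetic on the stated assumptions; a small check in Python, taking the per-file memory figures from the previous slide:

# Back-of-the-envelope check of the namespace/pool sizing table.
# Assumptions as stated on the slides: V1 needs ~1 GB of RAM per 1 M files,
# V2 needs ~20 bytes of index per file per view; 10 MB/file, 2 replicas,
# 40 TB per node, 20 file systems per node.

def sizing(n_files):
    v1_ram_gb = n_files / 1e6 * 1.0          # 1 GB per million files (V1)
    v2_ram_gb = n_files * 20 / 1e9           # ~20 bytes per file per view (V2)
    pool_pb   = n_files * 10e6 * 2 / 1e15    # 10 MB/file, 2 replicas
    nodes     = pool_pb * 1e15 / 40e12       # 40 TB per node
    return v1_ram_gb, v2_ram_gb, pool_pb, nodes, nodes * 20

for n_million in (100, 1000):
    v1, v2, pool, nodes, fs = sizing(n_million * 1e6)
    print(f"{n_million} M files: V1 RAM {v1:.0f} GB, V2 index {v2:.0f} GB, "
          f"pool {pool:.0f} PB, {nodes:.0f} nodes, {fs:.0f} file systems")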

High Availability - Namespace scaling
[Diagram: head nodes (HN) in two deployment patterns.]
- HA and read scale-out: an active-passive pair for the read-write master (one HN active in rw mode, one passive ready to fail over to rw mode), plus active-active read-only slaves (HNs active in ro mode).
- Write scale-out: partition the namespace across head nodes by path, e.g. one HN each for /atlas, /cms, /alice and /lhcb, or finer grained, e.g. /atlas/data and /atlas/user.
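Write scale-out by path can be pictured as a longest-prefix routing table in front of the head nodes. An illustrative sketch; host names and the routing table are invented:

# Illustrative longest-prefix routing of namespace operations to head nodes.
# Host names and the routing table itself are invented for the example.

ROUTES = {
    "/atlas/data": "hn-atlas-data.cern.ch",
    "/atlas/user": "hn-atlas-user.cern.ch",
    "/atlas":      "hn-atlas.cern.ch",
    "/cms":        "hn-cms.cern.ch",
    "/alice":      "hn-alice.cern.ch",
    "/lhcb":       "hn-lhcb.cern.ch",
}

def head_node_for(path):
    """Pick the head node whose prefix matches the longest part of the path."""
    best = None
    for prefix, host in ROUTES.items():
        if path.startswith(prefix) and (best is None or len(prefix) > len(best[0])):
            best = (prefix, host)
    return best[1] if best else None

print(head_node_for("/atlas/data/run2010/file.root"))   # -> hn-atlas-data.cern.ch
print(head_node_for("/lhcb/MC/file.root"))              # -> hn-lhcb.cern.ch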

File creation test
[Plot: file-creation rate on the MGM and FSTs.]
- Namespace size: 10 million files
- 22 ROOT clients: ~1 kHz creation rate
- 1 ROOT client: ~220 Hz

File read test
[Plot: read-open rate on the MGM and FSTs.]
- Namespace size: 10 million files; 100 million read opens in total
- 350 ROOT clients: ~7 kHz read-open rate, with ~20% CPU usage
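Both tests amount to hammering the head node with file opens over the xroot protocol from many parallel ROOT clients. A minimal illustrative client in PyROOT; the host name, path scheme and file count are invented, and in the real tests 22 (creation) or 350 (read) such clients ran in parallel:

# Illustrative open-rate test client: repeatedly open files over xroot and
# measure the achieved open rate for this one client.

import time
import ROOT

BASE = "root://eos-head-node.cern.ch//eos/test/files"   # assumed URL
N_OPENS = 1000

start = time.time()
opened = 0
for i in range(N_OPENS):
    f = ROOT.TFile.Open(f"{BASE}/file_{i % 10000}.root")
    if f and not f.IsZombie():
        opened += 1
        f.Close()
elapsed = time.time() - start
print(f"{opened} opens in {elapsed:.1f} s -> {opened / elapsed:.0f} Hz per client")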

Replica layout
[Diagram: client write(offset, len) chained through file servers FS1 → FS2 → FS3, with the return code sent back to the client.]
- Layouts: plain (no replica); replica (here 3 replicas); more sophisticated redundant storage (RAID5, LDPC).
- Network I/O for file creation with 3 replicas: 500 MB/s of injection results in 1 GB/s of output and ~1.5 GB/s of input on eth0 summed over all disk servers.
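The traffic figures are consistent with a chained replication pattern: each byte is received once per replica and forwarded (n-1) times between disk servers. A quick check in Python (the chained-forwarding model is an assumption that reproduces the numbers above):

# Network traffic on the disk servers for chained n-replica writes:
# every replica is received once (inbound), and all servers but the last
# in the chain forward the stream to the next one (outbound).

def server_traffic(injection_mb_s, n_replicas):
    inbound  = injection_mb_s * n_replicas          # every replica received once
    outbound = injection_mb_s * (n_replicas - 1)    # all but the last forward it
    return inbound, outbound

inbound, outbound = server_traffic(500, 3)
print(f"in: {inbound} MB/s, out: {outbound} MB/s")   # -> in: 1500 MB/s, out: 1000 MB/s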

Replica healing
[Diagram: a file with Rep 1 and Rep 2 online and Rep 3 offline. The client asks the MGM to open the file for update (rw), is told to come back after a short wait, and the MGM schedules a transfer from replica 2 to a new replica 4.]
A client RW reopen of an existing file triggers the creation of a new replica and the dropping of the offline replica.
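A minimal sketch of that healing decision; the data structures and helper calls are invented for illustration. On an rw open, replicas on failed servers are dropped and a new copy is scheduled from a healthy source, while the actual transfer runs asynchronously:

# Illustrative replica-healing decision on a client rw open.
# Data structures and the scheduler callback are invented for the example.

def heal_on_rw_open(replicas, target_count, all_file_servers, schedule_transfer):
    """replicas: dict mapping file-server id -> 'online' or 'offline'."""
    online  = [fs for fs, state in replicas.items() if state == "online"]
    offline = [fs for fs, state in replicas.items() if state != "online"]

    for fs in offline:                       # drop replicas on failed servers
        del replicas[fs]

    candidates = (fs for fs in all_file_servers
                  if fs not in replicas and fs not in offline)
    while len(replicas) < target_count and online:
        new_fs = next(candidates)            # pick a server without a replica yet
        schedule_transfer(source=online[0], destination=new_fs)
        replicas[new_fs] = "scheduled"       # the copy itself runs asynchronously

    return replicas

replicas = {"fst01": "online", "fst02": "online", "fst03": "offline"}
healed = heal_on_rw_open(replicas, 3, ["fst0%d" % i for i in range(1, 9)],
                         lambda source, destination: print("copy", source, "->", destination))
print(healed)   # fst03 dropped, a new replica scheduled on another server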

Replica placement
[Diagram: disks across Node 1 … Node 8 coupled into scheduling groups.]
- In order to minimize the risk of data loss we couple disks into scheduling groups (the current default is 8 disks per group).
- The system selects the scheduling group in which to store a file in a round-robin fashion; all the other replicas of the file are stored within the same group.
- Data placement is optimised against the hardware layout (PC boxes, network infrastructure, etc.).
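A sketch of that placement policy; the group layout and naming are invented. The scheduling group is chosen round-robin per file, and all replicas of the file are placed on different disks within that group:

# Illustrative round-robin placement over scheduling groups: one group is
# chosen per file, and all replicas of the file stay inside that group
# (on different disks). Group layout and naming are invented for the sketch.

import itertools, random

# 8 disks per scheduling group, each disk on a different node (node.disk labels).
scheduling_groups = [
    ["node%d.disk%d" % (node, group) for node in range(1, 9)]
    for group in range(4)
]
next_group = itertools.cycle(range(len(scheduling_groups)))

def place_file(n_replicas):
    """Pick the next group round-robin and n_replicas distinct disks inside it."""
    group = scheduling_groups[next(next_group)]
    return random.sample(group, n_replicas)

for filename in ("a.root", "b.root", "c.root"):
    print(filename, "->", place_file(3))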