Forschungszentrum Karlsruhe in der Helmholtz-Gemeinschaft File Systems for your Cluster Selecting a storage solution for tier 2 Suggestions and experiences Jos van Wezel Institute for Scientific Computing Karlsruhe, Germany

Forschungszentrum Karlsruhe in der Helmholtz-Gemeinschaft Overview Estimated sizes and needs GridKa today and roadmap Connection models Hardware choices Software choices LCG

Forschungszentrum Karlsruhe in der Helmholtz-Gemeinschaft Scaling the tiers
–Tier 0: 2 PB disk, 10 PB tape, 6000 kSi (data collection, distribution to tier 1)
–Tier 1: 1 PB disk, 10 PB tape, 2000 kSi (data processing, calibration, archiving for tier 2, distribution to tier 2)
–Tier 2: 0.2 PB disk, no tape, 3000 kSi (data selections, simulation, distribution to tier 3)
–Tier 3: location and/or group specific
1 Opteron today ~ 1 kSi

Forschungszentrum Karlsruhe in der Helmholtz-Gemeinschaft GridKa growth

Forschungszentrum Karlsruhe in der Helmholtz-Gemeinschaft Storage at GridKa: GPFS via NFS to nodes; dCache via dcap to nodes

Forschungszentrum Karlsruhe in der Helmholtz-Gemeinschaft GridKa road map
–expand and stabilize the GPFS / NFS combination
–possibly install Lustre
–integrate dCache
–look for an alternative to TSM, only if really needed
–try SATA disks
–decide the path for the parallel FS and dCache
–decide the tape backend
–scale for LHC (200 – 300 MB/s continuous for some weeks)

Forschungszentrum Karlsruhe in der Helmholtz-Gemeinschaft Tier 2 targets (source: G. Quast / Uni-KA)
–5 MB/s throughput per node
–300 nodes
–1000 MB/s aggregate throughput
–200 TB overall disk storage
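The arithmetic behind these targets is worth making explicit. Below is a minimal back-of-the-envelope sketch in Python, assuming the "5 MB per node" figure is a sustained per-node rate in MB/s (the unit is not spelled out on the slide); all other numbers are taken from the slide.

```python
# Tier-2 sizing figures from the slide above.
nodes = 300
per_node_mb_s = 5             # assumed to mean MB/s per worker node
target_aggregate_mb_s = 1000  # stated aggregate throughput target
disk_tb = 200                 # overall disk storage

peak_demand = nodes * per_node_mb_s   # demand if every node streams at once
print(f"peak demand if all nodes stream: {peak_demand} MB/s")
print(f"stated aggregate target:         {target_aggregate_mb_s} MB/s")
print(f"implied concurrency factor:      {target_aggregate_mb_s / peak_demand:.2f}")
print(f"disk share per worker node:      {disk_tb * 1000 / nodes:.0f} GB")
```

Under that assumption the 1000 MB/s target implies that only a fraction of the 300 nodes hit the storage at full rate at the same time.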

Forschungszentrum Karlsruhe in der Helmholtz-Gemeinschaft Estimate your needs (1)
Can you charge for the storage?
–influences the choice between on-line and off-line (tape) storage
–classification of data (volatile, precious, high IO, low IO)
How many nodes will access the storage simultaneously?
–absolute number of nodes
–number of nodes that run a particular job
–job classification to separate accesses

Forschungszentrum Karlsruhe in der Helmholtz-Gemeinschaft Estimate your needs (2)
What kind of access (read/write/transfer sizes)?
–ability to control the access pattern: pre-staging, software tuning
–job classification to influence the access pattern, spread via the scheduler
What size will the storage eventually have?
–use the benefit of random access via a large number of controllers
–up to 4 TB or 100 MB/s per controller (see the sketch below)
–need high-speed disks
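The "up to 4 TB or 100 MB/s per controller" rule of thumb translates directly into a controller count. Here is a rough sketch applied to the tier-2 targets from the earlier slide; treat it as an illustration of the rule, not a procurement formula.

```python
import math

# Rule of thumb from the slide: roughly one controller per ~4 TB of disk
# or per ~100 MB/s of throughput, whichever requires more controllers.
capacity_tb = 200          # tier-2 disk target
throughput_mb_s = 1000     # tier-2 aggregate throughput target
tb_per_controller = 4
mb_s_per_controller = 100

by_capacity = math.ceil(capacity_tb / tb_per_controller)
by_bandwidth = math.ceil(throughput_mb_s / mb_s_per_controller)

print(f"controllers needed for capacity:  {by_capacity}")
print(f"controllers needed for bandwidth: {by_bandwidth}")
print(f"controllers to plan for:          {max(by_capacity, by_bandwidth)}")
```

For these targets capacity, not bandwidth, dictates the controller count, which is exactly the "large number of controllers" the slide argues for.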

Forschungszentrum Karlsruhe in der Helmholtz-Gemeinschaft Disk technology keys
Disk areal density is higher than tape's
–disks are rigid
Density growth rate for disks continues (but more slowly)
–deviation from Moore's law (same for CPU)
The superparamagnetic effect is not yet limiting progress
–the end has been in sight for 20 years
Convergence of disk and tape costs has stopped
–still a factor of 4 to 5 difference
Disks and tape will both be around for at least another 10 years

Forschungszentrum Karlsruhe in der Helmholtz-Gemeinschaft [Chart: disk areal density vs. head-to-media spacing. Axes: head-to-media spacing (nm), areal density (Mb/in²). Data points span from the IBM RAMAC (1956): 5 MB, 2000 kb/in², to the Hitachi Deskstar 7K400 (2004): 400 GB, 61 Gb/in².]

Forschungszentrum Karlsruhe in der Helmholtz-Gemeinschaft To SATA or not, when compared to SCSI/FC
–up to 4 times cheaper (3 k / TB vs. 10 k / TB)
–2 times slower in a multi-user environment (access time)
–not really for 24/7 operation (more failures)
–larger capacity per disk: max 140 GB SCSI vs. 400 GB SATA (today)
–no large-scale experience
–drive warranty of only 1 or 2 years
GridKa uses SCSI, SAN and expensive controllers
–bad experiences with IDE NAS boxes (160 GB disks, 3Ware controllers)
–new products appear with SATA disks and expensive controllers
–IO operations are more important than throughput for most accesses
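To put the price difference in context, the following small sketch scales the per-TB figures up to the 200 TB tier-2 disk target; it assumes the slide's "3 k / TB vs. 10 k / TB" are amounts in EUR per TB, which the slide does not state explicitly.

```python
# Hypothetical scaling of the slide's per-TB prices to the 200 TB target.
capacity_tb = 200
price_per_tb_eur = {"SATA": 3_000, "SCSI/FC": 10_000}   # assumed to be EUR/TB

for tech, price in price_per_tb_eur.items():
    total_meur = capacity_tb * price / 1e6
    print(f"{tech:8s}: {total_meur:.1f} MEUR for {capacity_tb} TB")

# The raw gap overstates the saving: SATA's shorter warranty and higher
# failure rate in 24/7 operation (see above) add spares and replacement cost.
```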

Forschungszentrum Karlsruhe in der Helmholtz-Gemeinschaft Network attached storage [diagram: the IO path runs via the network on the client side and locally, over Fibre Channel or SCSI, on the server side]

Forschungszentrum Karlsruhe in der Helmholtz-Gemeinschaft NAS example
Server with 4 dual SCSI busses
–more than 1 GB/s transfer
4 x 2 SATA RAID boxes (16 * 250 GB)
–~4 TB per bus
2 * 4 * 2 * 4 = 72 TB on a server, est. 30 kEUR, or 35 kEUR with point-to-point FC. Not that bad.

Forschungszentrum Karlsruhe in der Helmholtz-Gemeinschaft SAN [diagram: the IO path to each host runs via the SAN or iSCSI]

Forschungszentrum Karlsruhe in der Helmholtz-Gemeinschaft SAN or Ethernet
SAN has easier management
–exchange of hardware without interruption
–joining separate storage elements
iSCSI needs a separate net (SCSI over IP)
Very scalable performance
–via switches or directors
1 SCSI bus maxes out at 320 MB/s
–better than current FC, but FC is duplex
–not a fabric
–example follows
ELVM for easier management
Network block device: kernel 2.6 brings a new 16 TB limit
SAN is expensive (500 EUR per HBA, 1000 EUR per switch port)
A direct-connection limitation can be partly compensated by a high-speed interconnect (InfiniBand, Myrinet, etc.)
–tightly coupled cluster with InfiniBand
–can be used for FC too, depending on the FS software

Forschungszentrum Karlsruhe in der Helmholtz-Gemeinschaft Combining FC and Infiniband

Forschungszentrum Karlsruhe in der Helmholtz-Gemeinschaft Software to drive the hardware
File systems
–GPFS (IBM): GridKa uses this, and so does Uni-KA
–SAN-FS (IBM, $$): supports a range of architectures
–Lustre (HP, $): Uni-KA Rechenzentrum cluster
–PVFS (stability is rather low)
–GFS (now Red Hat) or OpenGFS
–NFS: the Linux implementation is messy, but RHEL 3.0 seems promising
NAS boxes reach impressive throughput, are stable, offer easy management and grow as needed (NetApp, Exanet)
–Terragrid (very new)
(Almost-POSIX) access via a library preload
–write once / read many
–changing a file means creating a new one and deleting the old (see the sketch below)
–not usable for all software (e.g. no DBMS!)
–examples: GridFTP (gfal), (x)rootd (rfio), dCache (dcap/gfal/rfio)
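To make the "write once / read many" point concrete, here is a toy sketch of what updating a file looks like under those semantics. The Storage class is a made-up in-memory stand-in, not any real dcap, rfio or gfal interface.

```python
class Storage:
    """Toy write-once / read-many store (illustration only)."""

    def __init__(self):
        self.files = {}                       # name -> immutable content

    def write_once(self, name, data):
        if name in self.files:
            raise IOError(f"{name} exists and cannot be overwritten in place")
        self.files[name] = data

    def delete(self, name):
        del self.files[name]

def replace(storage, name, new_data):
    # The only way to "change" a file: delete the old one, write a new one.
    storage.delete(name)
    storage.write_once(name, new_data)

s = Storage()
s.write_once("hits.root", b"version 1")
replace(s, "hits.root", b"version 2")   # works, but only as delete + rewrite
# In-place updates, as a DBMS needs, are impossible: hence "no DBMS!" above.
```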

Forschungszentrum Karlsruhe in der Helmholtz-Gemeinschaft GPFS
–stripes over n disks
–Linux and AIX, or combined
–max FS size 70 TB
–HSM option
–scalable and very robust
–easy management
–SAN, IP+SAN, or IP only
–add and remove storage on-line
–vendor lock-in

Forschungszentrum Karlsruhe in der Helmholtz-Gemeinschaft [Chart: accumulated GPFS throughput (MB/s) as a function of the number of nodes/RAID arrays, with separate reading and writing curves]

Forschungszentrum Karlsruhe in der Helmholtz-Gemeinschaft SAN FS
–metadata server failover
–policy-based management
–add and remove storage on-line
–$$$

Forschungszentrum Karlsruhe in der Helmholtz-Gemeinschaft Lustre
–object based
–LDAP config database
–failover of OSTs
–support for heterogeneous networks, e.g. InfiniBand
–advanced security
–open source

Forschungszentrum Karlsruhe in der Helmholtz-Gemeinschaft SRM: Storage Resource Manager
Glue between the worldwide grid and local mass storage (the SE)
A storage element should offer:
–GridFTP
–an SRM interface
–information publication via MDS
LCG has SRM2 almost… ready; SRM1 is in operation
SRM is built on top of known MSS (CASTOR, dCache, Jasmine)
dCache implements SRM v1

Forschungszentrum Karlsruhe in der Helmholtz-Gemeinschaft User-SRM interaction
Legend:
–LFN: logical file name
–RMC: replication metadata catalog
–GUID: grid unique identifier
–RLC: replica location catalog
–RLI: replica location index
–RLC + RLI = RLS
–RLS: replica location service
–SURL: site URL
–TURL: transfer URL
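The legend is easier to follow as a concrete name-resolution chain. Below is a sketch with toy in-memory catalogues standing in for the real RMC, RLS and SRM services; every name and URL in it is invented for illustration.

```python
# Toy catalogues: the real RMC and RLS are grid services, not dictionaries.
rmc = {"lfn:/grid/exp/run123/hits.root": "guid-0001"}             # LFN -> GUID
rls = {"guid-0001": ["srm://gridka.example/pnfs/exp/hits.root",   # GUID -> SURLs
                     "srm://cern.example/castor/exp/hits.root"]}

def srm_get(surl):
    # A real SRM would stage the file and hand back a transfer URL (TURL);
    # here we only mimic that step by rewriting the URL scheme.
    return surl.replace("srm://", "gsiftp://")

lfn = "lfn:/grid/exp/run123/hits.root"
guid = rmc[lfn]              # RMC: logical file name -> GUID
surl = rls[guid][0]          # RLS (RLC + RLI): GUID -> a replica's site URL
turl = srm_get(surl)         # SRM: SURL -> TURL for the actual transfer
print(turl)                  # the TURL is what GridFTP / dcap / rfio finally use
```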

Forschungszentrum Karlsruhe in der Helmholtz-Gemeinschaft In short
–loosely coupled cluster: Ethernet
–tightly coupled cluster: InfiniBand
–from 100 to 200 TB: locally attached storage, NFS and/or RFIO
–above 200 TB: SAN, cluster file system and RFIO
–HSM via dCache, with a Grid SRM interface
–tape: TSM / GSI solution?? or Vanderbilt Enstor
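These rules of thumb can be written down as a small decision helper; the thresholds and wording are copied from the slide, and anything outside the 100 TB to 200+ TB range it discusses is simply left open.

```python
def suggest_storage(disk_tb, tightly_coupled):
    """Encode the slide's rules of thumb (illustration only)."""
    interconnect = "InfiniBand" if tightly_coupled else "Ethernet"
    if 100 <= disk_tb <= 200:
        layout = "locally attached storage with NFS and/or RFIO"
    elif disk_tb > 200:
        layout = "SAN with a cluster file system, plus RFIO"
    else:
        layout = "below the range the slide discusses"
    return f"{interconnect}; {layout}; HSM via dCache behind a Grid SRM interface"

print(suggest_storage(disk_tb=150, tightly_coupled=False))
print(suggest_storage(disk_tb=300, tightly_coupled=True))
```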

Forschungszentrum Karlsruhe in der Helmholtz-Gemeinschaft Some encountered difficulties
Prescribed chain of software revision levels
–support is given only to those who live by the rules
–disk -> controller -> HBA -> driver -> kernel -> application
Linux limitations
–block addressability < 2^31 (see the worked example below)
–number of LUs < 128
NFS on Linux is a moving target
–enhancements or fixes almost always introduce new bugs
–limited experience in large (> 100 clients) installations
Storage units become difficult to handle
–exchanging 1 TB and rebalancing a live 5 TB file system takes 20 hrs
–restoring a 5 TB file system can take up to a week
–acquisition needs 1 FTE / 10^6
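For reference, the block-addressability limit quoted above works out as follows, assuming the usual 512-byte sector size of kernels of that period.

```python
# 2^31 addressable sectors of 512 bytes each (sector size is an assumption).
sector_size = 512
max_sectors = 2 ** 31
max_bytes = max_sectors * sector_size
print(f"largest addressable block device: {max_bytes / 2**40:.0f} TiB")  # 1 TiB
```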

Forschungszentrum Karlsruhe in der Helmholtz-Gemeinschaft Thank you for your attention