STORM & GPFS on Tier-2 Milan

STORM & GPFS on Tier-2 Milan
Massimo.Pistolese@mi.infn.it, Francesco.Prelz@mi.infn.it
Workshop CCR – May 14 2009

Why StoRM?
- Security model inherited from the ACL support of the underlying file system (see the ACL sketch below).
- Lightweight disk-space and authorization manager; no special hardware required.
- Scalable.
- Easily configurable with yaim; individual parts of the final configuration can still be tuned later.
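
StoRM enforces authorization through ordinary ACLs on the underlying file system, so the permissions it relies on can be inspected and pre-set with standard tools. A minimal sketch, assuming a GPFS file system with POSIX ACL semantics and a locally mapped "atlas" group (path and group name are illustrative, not taken from this setup):

  # grant the locally mapped atlas group read/write on a storage area
  setfacl -R -m g:atlas:rwx /gpfs/storage_1/atlas
  # default ACL, so that newly created files inherit the permission
  setfacl -R -d -m g:atlas:rwx /gpfs/storage_1/atlas
  # inspect what StoRM will find on the file system
  getfacl /gpfs/storage_1/atlas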

Why StoRM?
- SOAP web service; the roles of its components (front-end, back-end, gridftp, MySQL) can be assigned to separate machines.
- Request rate handled: up to 40 Hz.
- Gridftp throughput: 120 MB/s on a 1 Gb/s LAN.
- Well matched with the GPFS architecture.
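
From a client's point of view the front-end/back-end/gridftp split is invisible: everything is reached through the single SRM endpoint. A hedged example with the standard LCG data-management tools (file and path names are placeholders; the endpoint is the one shown in the setup slide below):

  # copy a local file into the Milan StoRM SE and list it back
  lcg-cp -v -b -D srmv2 file:$PWD/test.data \
      "srm://t2cmcondor.mi.infn.it:8444/srm/managerv2?SFN=/atlas/test/test.data"
  lcg-ls -b -D srmv2 "srm://t2cmcondor.mi.infn.it:8444/srm/managerv2?SFN=/atlas/test/"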

Why GPFS?
- Visible as a local file system: no remote protocols.
- Cluster structured, with slave clustering.
- Redundancy, scalability, robustness, and an abstraction layer over the storage hardware.
- High performance: concurrent read from 16 clients at 380 MB/s (against a physical network limit of 412 MB/s); concurrent write at 320 MB/s.
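
Because GPFS appears as a local file system on every node, checking the cluster needs nothing beyond the standard GPFS administration commands plus df. A quick sanity-check sketch (mount points are illustrative):

  mmlscluster          # nodes and their roles in the GPFS cluster
  mmlsnsd              # NSDs and the servers that export them
  mmlsmount all -L     # which nodes have each file system mounted
  df -h /gpfs          # file systems show up as ordinary local mounts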

Network and storage setup

Setup overview (from the diagram): three GPFS file systems, /dev/software, /dev/storage_1 and /dev/storage_2, are built on the NSDs c1-c17, which sit on storage volumes of 40 TB, 46 TB and 46 TB attached over fibre channel (FC multipath) to the NSD servers ts-b1-1 ... ts-b1-4, ts-b1-5/ts-b1-6 and ts-b1-7/ts-b1-8. The file systems are mounted via GPFS on the StoRM front-end and back-end (storm-BE on se-b1-1, which also hosts a gridftp server for the ops VO), on the four gridftp servers gridftp-b1-1 ... gridftp-b1-4, on the CE ce-b1-1 (Condor scheduler), on the Condor central manager t2cmcondor and on the worker nodes wn-b1-1 ... wn-b1-38 (Condor execute nodes). User access goes through ui.mi.infn.it and the aliases glite-UI-nodes.mi.infn.it, glite-condor-nodes.mi.infn.it and glite-condor.mi.infn.it.

Data access endpoints:
  srm://t2cmcondor.mi.infn.it:8444/srm/managerv2?SFN=/atlas/…
  gsiftp://gridftp-b1-1.mi.infn.it:2811/atlas/…

File system creation (software area):
  mmcrfs /dev/software -F nsd-software.txt -B 64K -m 2 -M 2 -r 2 -R 2 -Q yes -n 512 -A yes -v no -N 50000000

NSD descriptor excerpt for the /dev/storage_1 file system (disks sdb-sde, NSD servers ts-b1-1 ... ts-b1-4, failure group 1, NSD names c2-c5):
  #sdb:ts-b1-2,ts-b1-1,ts-b1-4,ts-b1-3::dataAndMetadata:1:c2
  c2:::dataAndMetadata:1:::
  #sdc:ts-b1-3,ts-b1-4,ts-b1-1,ts-b1-2::dataAndMetadata:1:c3
  c3:::dataAndMetadata:1:::
  #sdd:ts-b1-4,ts-b1-3,ts-b1-2,ts-b1-1::dataAndMetadata:1:c4
  c4:::dataAndMetadata:1:::
  #sde:ts-b1-1,ts-b1-2,ts-b1-3,ts-b1-4::dataAndMetadata:1:c5
  c5:::dataAndMetadata:1:::

StoRM back-end configuration via yaim:
  /opt/glite/yaim/bin/ig_yaim -c -s siteinfo/ig-site-info.def -n ig_SE_storm_backend
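
For completeness, a minimal sketch of how the stanza file above would be used end to end; the sequence is standard GPFS 3.x practice and is an assumption here, not something spelled out on the slide:

  # register the raw FC devices described in the stanza file as NSDs
  mmcrnsd -F nsd-software.txt
  # create the file system: 64 KB blocks, 2 data and 2 metadata replicas,
  # quotas on, sized for ~512 nodes and ~50M inodes (options as on the slide)
  mmcrfs /dev/software -F nsd-software.txt -B 64K -m 2 -M 2 -r 2 -R 2 \
         -Q yes -n 512 -A yes -v no -N 50000000
  # mount it everywhere in the cluster
  mmmount /dev/software -a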

Issues with the first production run
Project outline: analyse MSSM A/H → tau tau → l h at 14 TeV, to obtain the discovery potential for this channel in ATLAS (result to be published as a PUB note). To do this we must produce ALL our own datasets.
Production targets:
- Total: 35 M events
- Milano share: 3.2 M events

Issues with the first production run
Job specification:
- Atlfast II simulation using the job transform (csc_simul_reco_trf.py)
- Input: evgen, fetched with lcg-cp from the SE INFN-MILANO_LOCALGROUPDISK
- Output: AOD, written with lcg-cp to the SE INFN-MILANO_LOCALGROUPDISK
- Events per job: 250; total jobs: 12800
- Requirements: 2 GB RAM / 2 GB swap
Running time:
- Intel Xeon L5420 @ 2.50 GHz, 6144 KB cache: ~6 hours per job (TYPE1)
- Intel Xeon 3.06 GHz, 512 KB cache: ~18 hours per job (TYPE2)
Cluster performance:
- 48 CPUs (TYPE1): 192 jobs/day
- 124 CPUs (TYPE2): 165 jobs/day
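
A rough sketch of the staging pattern each job follows (SURLs and file names are placeholders; the real jobs resolve dataset locations through the ATLAS data-management tools):

  # fetch the input evgen file from INFN-MILANO_LOCALGROUPDISK
  lcg-cp -v -b -D srmv2 \
      "srm://t2cmcondor.mi.infn.it:8444/srm/managerv2?SFN=/atlas/localgroupdisk/evgen/EVNT.pool.root" \
      file:$PWD/EVNT.pool.root
  # ... run csc_simul_reco_trf.py ...
  # push the produced AOD back to the same storage area
  lcg-cp -v -b -D srmv2 file:$PWD/AOD.pool.root \
      "srm://t2cmcondor.mi.infn.it:8444/srm/managerv2?SFN=/atlas/localgroupdisk/aod/AOD.pool.root"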

Issues with the first production run
- Failure rate: 50%. Main causes: environment setup and variable tuning; machine requirements (a large amount of memory is needed); and, mainly, issues with the setup, GPFS and storage.
- When functioning correctly, the site runs at the typical ATLAS production failure rate of ~3%.
- GPFS integrated perfectly with StoRM: no special settings were required, and new clients can easily be added (see the sketch below).
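
What "adding a new client" amounts to, as a minimal sketch run from a node with GPFS admin rights (the node name is illustrative):

  # add a new worker node to the GPFS cluster and bring it online
  mmaddnode -N wn-b1-39.mi.infn.it
  mmstartup -N wn-b1-39.mi.infn.it
  # mount all file systems on the new node
  mmmount all -N wn-b1-39.mi.infn.it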

Issues with the first production run
- GPFS caching is constrained by OS limits (2 GB of virtual memory per process on 32-bit nodes); 64-bit clients can accommodate a larger cache.
- CNFS is a good way to provide fault-tolerant, read-only NFS mounts for better and faster caching, but special care must be taken: it is better to separate the disk servers from the NFS servers and to enable GPFS redundancy on the file systems involved.
- Local hosts should resolve the gridftp servers, to prevent StoRM from generating too much internal traffic (VLANs could also be used).
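
A hedged sketch of the corresponding GPFS cache tuning (values and node-class names are illustrative, not the Milan settings): the pagepool can be kept small on 32-bit nodes and enlarged on 64-bit clients:

  # stay well inside the 2 GB per-process limit on 32-bit worker nodes
  mmchconfig pagepool=256M -N wn-32bit-nodes
  # let 64-bit clients cache more aggressively (takes effect at the next GPFS start)
  mmchconfig pagepool=2G -N wn-64bit-nodes
  mmlsconfig pagepool    # verify the per-node settings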

Conclusions
- Most issues were found in the GPFS and worker node setup.
- The GPFS cluster architecture allows pools of similar machines to be organized neatly, with central steering.
- GPFS performance is easy to achieve in a SAN, and head nodes can be added as needed.
- StoRM scalability: it can manage as many front-end and gridftp servers as needed (a configuration sketch follows).
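
As an illustration of the last point, scaling out is mostly a matter of listing more hosts in the yaim site-info file and re-running the configuration; the variable names below follow the StoRM/ig_yaim conventions of that era and should be checked against the documentation of the deployed release:

  # siteinfo/ig-site-info.def (excerpt, illustrative values)
  STORM_BACKEND_HOST="se-b1-1.mi.infn.it"
  STORM_FRONTEND_HOST_LIST="t2cmcondor.mi.infn.it"
  STORM_GRIDFTP_POOL_LIST="gridftp-b1-1.mi.infn.it gridftp-b1-2.mi.infn.it gridftp-b1-3.mi.infn.it gridftp-b1-4.mi.infn.it"

  # re-run yaim after adding servers
  /opt/glite/yaim/bin/ig_yaim -c -s siteinfo/ig-site-info.def -n ig_SE_storm_backend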