Facility Evolution
Shigeki Misawa
RHIC Computing Facility, Brookhaven National Laboratory

Change Drivers at the Facility
● Users
● Component costs
● Technology
● Security
● Operational experience

User Desires
● Unlimited online storage space
● Uniform, global file namespace
● Infinite file system bandwidth
● Unlimited I/O transaction rates
● Infinite processing power
● No learning curve
● High availability
● $0 cost

Current Configuration

Current Configuration (cont'd)
● HPSS managed tape archive
  – 10 IBM servers
  – 9840/9940 tape drives
● NFS servers
  – 14 E450
  – 5 V480
  – 100 TB RAID 5
● MTI and Zyzzx (CMD) storage
● Brocade SAN
● GigE backbone
  – Alcatel (PacketEngine)
  – Cisco
  – Alteon (phased out)
  – SysKonnect NICs
● ~1000 dual-CPU Linux nodes w/ local scratch
● 2 AFS cells
  – IBM AFS 3.6.x

Current Configuration (cont'd)
● Secure gateways
  – ssh/bbftp
● Facility firewall
● Software for optimized file retrieval from HPSS
● Custom and LSF batch control
● Limited management of NFS/local scratch space
● Other experiment-specific middleware

Matching User Desires
● HPSS provides near-infinite storage space, although not online
● NFS provides large amounts of online storage with a uniform, global namespace
● Linux farm provides a significant amount of processing power
● Relatively low cost
● Issues
  – High bandwidth?
  – Transaction rate?
  – High availability?
  – Learning curve?

User Processing Models
● Reconstruction
  – Processing models among users are similar
  – Processing is well defined
● Analysis
  – Wide range of processing styles in use
  – Transitioning to a finite set of processing models is difficult to institute
    ● Requires high-level experiment acceptance
    ● Requires adoption of processing models by individual users

Security Perspective
● Facility firewall
● On-site and off-site networks both considered hostile
● Gateways (ssh, bbftp) bypass the firewall and provide access with no cleartext passwords
● Exposure is getting smaller
● Management and monitoring are getting better
● Primary vulnerabilities
  – User lifecycle management
  – Reusable passwords
  – Distributed file systems

Changes Affecting Security
● Rigorous DOE-mandated user life cycle management
● Kerberos 5 and LDAP authorization/authentication (see sketch below)
● Tighter network security (inbound/outbound access)
● Grid integration
  – Web services
  – Grid daemons
  – Deployment vs. maintenance
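A minimal illustration of the LDAP half of this change, not the facility's actual integration: assuming the third-party python-ldap module and placeholder server/base-DN names, a service could look up authorization attributes for a user who has already authenticated with Kerberos 5.

    # Hypothetical sketch: LDAP authorization lookup after Kerberos 5 authentication.
    # The hostname, base DN, and attribute name are illustrative assumptions.
    import ldap

    LDAP_URI = "ldap://ldap.example.org"       # placeholder directory server
    BASE_DN = "ou=People,dc=example,dc=org"    # placeholder search base

    def allowed_groups(username):
        """Return the group memberships recorded for an authenticated user."""
        conn = ldap.initialize(LDAP_URI)
        conn.simple_bind_s()  # anonymous bind; a real setup would use SASL/GSSAPI
        results = conn.search_s(
            BASE_DN,
            ldap.SCOPE_SUBTREE,
            "(uid=%s)" % username,
            ["memberOf"],      # attribute name varies with the directory schema
        )
        groups = []
        for _dn, attrs in results:
            groups.extend(value.decode() for value in attrs.get("memberOf", []))
        return groups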

HPSS
● Provides virtually unlimited storage capacity
● pftp access is inconvenient and can be problematic
● General user access to HSI: a good thing?
● Workarounds
  – Limit direct access
  – Identify and eliminate problematic access
  – Utilize optimization tools (see sketch below)
● Prognosis
  – Good enough; too much invested to switch
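As a rough illustration of what "optimization tools" can mean here, the hedged sketch below batches a list of requested HPSS files into a single HSI session instead of issuing many independent pftp transfers. Real retrieval optimizers also sort requests by tape and position within the tape, which this sketch omits. It assumes an installed hsi client and an already-authenticated user; the paths in the usage comment are made up.

    # Hypothetical sketch, not the facility's retrieval software: fetch a list of
    # HPSS files in one hsi session by feeding commands to hsi on stdin.
    import subprocess

    def fetch_from_hpss(hpss_paths, dest_dir="."):
        """Retrieve a list of HPSS files with a single hsi invocation."""
        commands = ["lcd %s" % dest_dir]                 # local destination directory
        commands += ["get %s" % path for path in hpss_paths]
        script = "\n".join(commands) + "\nquit\n"
        subprocess.run(["hsi"], input=script, text=True, check=True)

    # Example (illustrative paths only):
    # fetch_from_hpss(["/home/rhic/run02/file001.root",
    #                  "/home/rhic/run02/file002.root"], "/scratch")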

Gigabit/100 Mb Ethernet
● GigE itself is not a performance issue
  – Driving it (i.e., filling the link) is another question
● GigE port/switch costs are an issue
● Vendor shake-out is somewhat of an issue
● Mix of copper/optical technology is a nuisance
● 100 Mb compute node connectivity is sufficient for the foreseeable future

Linux Farm
● Adequately provides the needed processing power
● Split between pure batch and batch/interactive nodes
● Nodes originally disk-light, now disk-heavy
  – Cheaper ($/MB) than RAID 5 NFS disk
  – Better performance
  – Robust with respect to NFS/HPSS hiccups
  – Changing the processing model

Linux Farm Limitations
● Individual nodes are now mission critical (since they are stateful)
● Maintenance is an issue (hard drives must be easily swappable)
● Nodes are no longer identical
● Node lifecycle is problematic
● Local disk space management issues (see sketch below)
● Non-global namespace and access to data
● EOL of CPU vs. EOL of disk
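One way to attack the local scratch management problem is a per-node cleanup policy. The sketch below is a hypothetical example rather than a facility tool: it deletes the least-recently-accessed files under an assumed /scratch mount point until usage drops back below an assumed 80% threshold.

    # Hypothetical per-node scratch pruning sketch; path and threshold are assumptions.
    import os
    import shutil

    SCRATCH = "/scratch"          # assumed local scratch mount point
    TARGET_USED_FRACTION = 0.80   # start deleting above 80% full

    def prune_scratch():
        usage = shutil.disk_usage(SCRATCH)
        if usage.used / usage.total <= TARGET_USED_FRACTION:
            return
        # collect (access time, size, path) for every regular file under SCRATCH
        files = []
        for root, _dirs, names in os.walk(SCRATCH):
            for name in names:
                path = os.path.join(root, name)
                try:
                    st = os.stat(path)
                except OSError:
                    continue
                files.append((st.st_atime, st.st_size, path))
        files.sort()  # oldest access time first
        to_free = usage.used - int(usage.total * TARGET_USED_FRACTION)
        freed = 0
        for _atime, size, path in files:
            if freed >= to_free:
                break
            try:
                os.remove(path)
                freed += size
            except OSError:
                pass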

NFS Servers
● Provide relatively large amounts of storage (~100 TB)
● Availability and reliability are getting better
  – The servers themselves are not a problem
  – Problems seen with all disk-side components: RAID controllers, hubs, cables, disk drives, GBICs, configuration tools, monitoring tools, switches, ...
● Overloaded servers are now the primary problem

NFS Servers
● E450 servers (4 x 450 MHz CPUs); V480 servers (2 x 900 MHz CPUs, coming online soon)
● SysKonnect GigE NICs (jumbo/non-jumbo frames)
● MTI and Zyzzx RAID 5 storage
● Brocade fabric
● Veritas VxFS/VxVM

Performance of NFS Servers
● Maximum observed bandwidth: 55 MB/s (non-jumbo frames)
● NFS logging recently enabled (and then disabled)
  – Variable access patterns
  – 1.4 TB/day maximum observed bandwidth out of a single server
  – The files that transfer the most MB are usually not the most frequently accessed ones
  – Data files are accessed, but shared libraries and log files are also accessed
  – Limited statistics make further conclusions difficult
● NFS servers and disks are poorly utilized
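For scale: 1.4 TB/day out of one server is roughly 1.4 x 10^6 MB / 86,400 s ≈ 16 MB/s averaged over the day, well below the 55 MB/s peak observed from a single server, which is consistent with the conclusion that the servers and disks sit poorly utilized most of the time.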

NFS Logging (Solaris)
● Potentially a useful tool
● Webalizer used to analyze the resulting log files
  – Daily stats on 'hits' and KB transferred
  – Time, client, and file distributions
● Poor implementation
  – Generation of binary/text log files is problematic
    ● On a busy FS, observed 10 MB/minute (1-2 GB/hour) log write rate
    ● Under high load, nfslogd cannot keep up with binary log file generation
    ● nfslogd is unable to analyze binary log files > 1.5 GB
    ● nfslogd cannot be run offline (see sketch below)
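Because nfslogd cannot be run offline and chokes on large logs, a simple offline summarizer for the translated text logs can be useful. The sketch below is hypothetical and assumes an xferlog-style record layout; the field positions are assumptions that would need to be adjusted to the actual log format in use.

    # Hypothetical offline summary of NFS transfer logs (not a facility tool).
    # Assumes one whitespace-separated record per line containing the client host,
    # byte count, and file path; the indices below are assumptions to adjust.
    from collections import defaultdict
    import sys

    CLIENT_FIELD, BYTES_FIELD, PATH_FIELD = 6, 7, 8   # assumed field positions

    def summarize(log_path, top=20):
        bytes_by_client = defaultdict(int)
        bytes_by_file = defaultdict(int)
        with open(log_path) as log:
            for line in log:
                fields = line.split()
                if len(fields) <= PATH_FIELD:
                    continue            # skip malformed records
                try:
                    nbytes = int(fields[BYTES_FIELD])
                except ValueError:
                    continue
                bytes_by_client[fields[CLIENT_FIELD]] += nbytes
                bytes_by_file[fields[PATH_FIELD]] += nbytes
        for label, table in (("client", bytes_by_client), ("file", bytes_by_file)):
            print("Top %d by %s:" % (top, label))
            for key, total in sorted(table.items(), key=lambda kv: -kv[1])[:top]:
                print("  %12d  %s" % (total, key))

    if __name__ == "__main__":
        summarize(sys.argv[1])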

NFS Optimization
● Cannot tune without usage statistics
  – File access statistics
  – Space usage
  – I/O transactions per FS
  – Bandwidth per FS
● A loosely organized user community makes tracking and controlling user behavior difficult

Future Directions for NFS Servers
● Continued expansion of the current architecture
  – Current plan
● Replace with fully distributed disks
  – Needs middleware (from the grid?) to manage; can users be taught to use the system?
● Fewer but more capable servers (i.e., bigger SMP servers)
  – Not likely (cost)
  – Will performance increase?

Future Directions for NFS Servers (cont'd)
● More, cheaper NFS appliances
  – Currently not likely (administrative issues)
● IDE RAID systems?
  – Technological maturity/administrative issues
● Specialized CIFS/NFS servers?
  – Cost/technological maturity/administrative issues
● Other file system technologies?
  – Cost/technological maturity/administrative issues

AFS Servers
● Current state
  – 2 AFS cells
  – IBM AFS 3.6.x
  – ~12 servers
● System is working adequately
● Future directions
  – Transition to OpenAFS
  – Transition to Kerberos 5
  – AFS home directories?
  – LSF/Grid integration?
  – Security issues?
  – Linux file servers?
    ● Stability
    ● Backups

Grid Integration
● Satisfying DOE cyber security requirements
● Integration of grid authentication and authorization with site authentication and authorization (see sketch below)
● Stateful grid computing: difficult issues
● Stateless grid computing: not palatable to users
● The transition from a site cluster to a grid-enabled cluster can be achieved in many ways, with different tradeoffs
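One concrete piece of this integration is mapping grid identities onto site accounts. The hedged sketch below parses a Globus-style grid-mapfile (conventionally /etc/grid-security/grid-mapfile) to translate a certificate distinguished name into a local account; it is an illustration, not the facility's implementation, and the example DN is made up.

    # Hypothetical grid-to-site identity mapping sketch. Grid-mapfile lines look like:
    #   "/DC=org/DC=doegrids/OU=People/CN=Some User" localaccount
    import shlex

    GRIDMAP = "/etc/grid-security/grid-mapfile"   # conventional Globus location

    def local_account(dn, mapfile=GRIDMAP):
        """Return the local account mapped to a grid DN, or None if unmapped."""
        with open(mapfile) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue
                parts = shlex.split(line)   # handles the quoted, space-containing DN
                if len(parts) >= 2 and parts[0] == dn:
                    return parts[1]
        return None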

Future Direction Issues
● The current facility implements mainstream technologies, although on a large scale.
● The current infrastructure is showing the limitations of using these technologies at a large scale.
● Future directions involve deploying non-mainstream (though relatively mature) technologies, or immature technologies, in a production environment.
● As a complication, some of these new technologies replace existing mature systems that are currently deployed.