Storage at the ATLAS Great Lakes Tier-2
Tier-2 Storage Administrator Talks
Shawn McKee / University of Michigan
OSG Storage Forum



AGLT2 Storage: Outline
- Overview of AGLT2
  - Description, hardware and configuration
- Storage Implementation
  - Original options: AFS, NFS, Lustre, dCache
  - Chose dCache (broad support in USATLAS/OSG)
- Issues
  - Monitoring needed
  - Manpower to maintain (including creating custom cron scripts)
  - Reliability and performance
- Future?
  - NFSv4(.x), Lustre (1.8.1+), Hadoop

The ATLAS Great Lakes Tier-2
- Within the US there are five Tier-2 computing centers, most split over two physical locations. They support “production” and user analysis tasks.
- The ATLAS Great Lakes Tier-2 (AGLT2) is hosted at the University of Michigan and Michigan State University.
- AGLT2 design goals:
  - Incorporate 10GE networks for high-performance data transfers
  - Utilize 2.6 custom kernels (UltraLight) and SL[C]4 OS (soon to SL[C]5)
  - Deploy inexpensive high-capacity storage systems with large partitions using the XFS filesystem (still must address SRM)
  - Take advantage of MiLR to the extent possible

AGLT2 Storage Node Information
- AGLT2 has 13 dCache storage nodes distributed between MSU and UM with either 40 or 52 usable TB each.
- Total dCache production storage is 500TB
- We use space-tokens to manage space allocations for ATLAS and currently have 7 space-token areas.
- We use XFS for the filesystem on our 50 pools. Each pool varies from TB in size.
- dCache version is and we are running Chimera
- We also use AFS (1.4.10) and NFSv3 for storage

AGLT2 Storage Node UMFS05
Current node:
- 2xE5450 processors
- 32GB, 10GE Myricom CX4
- 1TB disks (15 disks/shelf)
- 4xRAID6 array (13TB each)
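
The capacity figures on this slide can be checked with a little arithmetic. The sketch below assumes that each RAID6 array spans one 15-disk shelf (the slide does not state this explicitly) and uses nominal disk sizes, ignoring filesystem overhead. The function name is illustrative only.

```python
def raid6_usable_tb(disks_per_array: int, disk_tb: float) -> float:
    """Nominal usable capacity of one RAID6 array.

    RAID6 keeps two disks' worth of parity, so usable space is
    (number of disks - 2) times the size of one disk.
    """
    if disks_per_array < 4:
        raise ValueError("RAID6 needs at least 4 disks")
    return (disks_per_array - 2) * disk_tb


# Assumption: one array per 15-disk shelf of 1TB disks.
per_array_tb = raid6_usable_tb(15, 1.0)

# Four such arrays per storage node, as listed on the slide.
per_node_tb = 4 * per_array_tb

print(per_array_tb, per_node_tb)
```

Under that assumption the result agrees with the 13TB per array quoted here and with the 52 usable TB per node quoted for the larger storage nodes.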

AGLT2 Network Architecture Example
- Good pricing from Dell for access layer switches:
  - Managed with 10GE ports, lots of 10/100/1000 copper ports
  - QoS and layer 3 capabilities, redundant power supply

dCache at AGLT2
- We chose dCache because BNL and OSG both supported it.
- We are really using the system outside of its original design purpose
  - We wanted a way to tie multiple storage locations into a single user-accessible name-space providing unified storage.
  - dCache was intended as a front-end for an HSM system
- Lots of manpower and attention has been required
  - Significant effort required to “watch for” and debug problems early
  - For a while we felt like the little Dutch boy: every time we fixed a problem, another popped up!
  - /Chimera has improved the situation

AGLT2 Lustre Storage
- We have been exploring options for alternative storage systems. BestMan provides an SRM for a POSIX filesystem.
- Lustre was originally tested by us in 2007: Too early!
  - Recently set up a 200TB (4 server) Lustre configuration with dual MDT+MGS headnodes configured with heartbeat using V1.6.6
    - Performance very good (faster for clients than /tmp)
    - Some “issues” with huge load on one storage server
  - Need (or ) support for “patchless” client. Should be available in v
  - Planning to re-test once we have the needed kernel support

Monitoring Storage
- Monitoring our Tier-2 is a critical part of our work and storage monitoring is one of the most important aspects
- We use a number of tools to track status:
  - Cacti
  - Ganglia
  - Custom web pages
  - Brian’s dCache billing web interface
  - alerting (low space, failed auto-tests, specific errors noted)
- See or
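
As an illustration of the kind of low-space alerting mentioned above, here is a minimal sketch. It is not the site's actual tooling: the function name, the 5% threshold and the mount point list are all assumptions, and a real check would send a notification instead of printing.

```python
import shutil


def low_space_pools(paths, min_free_fraction=0.05):
    """Return (path, free_fraction) for each mount point below the threshold."""
    low = []
    for path in paths:
        usage = shutil.disk_usage(path)
        if usage.total == 0:
            # Skip pseudo-filesystems that report no capacity.
            continue
        free_fraction = usage.free / usage.total
        if free_fraction < min_free_fraction:
            low.append((path, free_fraction))
    return low


if __name__ == "__main__":
    # Hypothetical list of pool mount points; "." is used so the sketch runs anywhere.
    for path, fraction in low_space_pools(["."]):
        print(f"LOW SPACE: {path} has {fraction:.1%} free")
```

Run from cron, a check like this covers only the "low space" case; failed auto-tests and specific error patterns would need their own probes.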

Scripts to Manage Storage
- As we have found problems we sometimes needed to automate the solution
  - Rephot: automated replica creation developed by Wenjing Wu; scans for “hot” files and automatically adds/removes replicas
  - Pool-balancer: rebalances pools within a group
- Numerous scripts (cron) for maintenance
  - Directory ownership
  - Consistency checking, including Adler32 checksum verification
  - Repairing inconsistencies
  - Tracking usage (real space and space-token)
  - Auto-loading PNFSID into Chimera DB for faster access
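
The Adler32 verification step listed above can be sketched with Python's standard `zlib` module. This is only an illustration of the technique, not the site's script: the function names are hypothetical, and where the expected checksum comes from (for example a file catalog) is left as an assumption.

```python
import zlib


def adler32_of_file(path, chunk_size=1024 * 1024):
    """Compute the Adler32 checksum of a file, reading it in chunks.

    Returns the checksum as 8 lowercase hexadecimal digits.
    """
    checksum = 1  # Adler32 is defined to start from 1
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            checksum = zlib.adler32(chunk, checksum)
    return f"{checksum & 0xFFFFFFFF:08x}"


def verify(path, expected_hex):
    """Compare a file's checksum with an expected value (assumed to come from a catalog)."""
    return adler32_of_file(path) == expected_hex.lower().zfill(8)
```

Reading in chunks keeps memory use flat for multi-gigabyte files, which matters when a cron job walks every file on a pool.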

AGLT2 Storage Issues
- Information is scattered across server nodes and log files
- Java based components have verbose output which makes it hard to find the real problem
- With dCache we find the large number of components have a large phase-space for problems
- Errors are not always indicative of the real problem…
- Is syslog-ng a part of the solution?
- Can error messages be significantly improved?

Benchmarking and Optimization
- We have spent some time trying to optimize our storage performance.
- See Systems Systems Systems
- Explored system/kernel/network and I/O tunings for our hardware. Have achieved good performance for single read/write (>700MB/sec) per partition. Multiple readers/writers at a few hundred MB/sec.
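
For readers who want a feel for what a single-writer measurement looks like, here is a rough sketch of a sequential write test. It is not the benchmark used for the numbers above: the function name and default sizes are assumptions, and a serious test would write far more data than fits in the page cache and would also cover reads and concurrent clients.

```python
import os
import tempfile
import time


def sequential_write_mb_per_s(directory, total_mb=64, block_mb=1):
    """Write a temporary file sequentially and return the rate in MB/sec."""
    block = b"\0" * (block_mb * 1024 * 1024)
    blocks = total_mb // block_mb
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as handle:
            for _ in range(blocks):
                handle.write(block)
            handle.flush()
            os.fsync(handle.fileno())  # include the time to reach the disk
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)
    return (blocks * block_mb) / elapsed


if __name__ == "__main__":
    rate = sequential_write_mb_per_s(tempfile.gettempdir())
    print(f"{rate:.0f} MB/sec")
```

Pointing `directory` at a partition on the storage node, and repeating the run after each tuning change, gives a crude before/after comparison.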

AGLT2 dCache Pool Transfer Rates STEP09
Avg 616 MBytes/sec over 8 days
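
To put the average rate in perspective, the implied total volume can be worked out directly. The sketch assumes decimal units (1 TB = 1,000,000 MB) and a constant average over exactly 8 days; the plot itself may use different unit conventions.

```python
def total_terabytes(avg_mbytes_per_s, days):
    """Total data moved at a constant average rate, in decimal terabytes."""
    seconds = days * 24 * 60 * 60
    return avg_mbytes_per_s * seconds / 1_000_000  # MB -> TB


print(f"{total_terabytes(616, 8):.0f} TB")  # prints "426 TB"
```

Under these assumptions the pools moved roughly 426 TB during the period shown.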

Future AGLT2 Storage
- We are interested in exploring BestMan+X where ‘X’ is
  - NFSv4[.x]
  - Lustre 1.8.x
  - Hadoop
- We are interested in the following characteristics for each option:
  - Expertise required to install/configure
  - Manpower required to maintain
  - Robustness over time
  - Performance (single and multiple reader/writer)
- Our plan is to test these during the rest of 2009

Questions?

Backup Slides

AGLT2 “Server” 10GE Activity STEP09
Shown are the aggregated graphs for our 10GE storage servers during STEP09 (units: bytes/sec). Results are a combination of local and remote traffic.

MiLR 10GE Protected Network
- We have a single “/23” network for the AGL-Tier2
  - Internally each site (UM/MSU) has a /24
- Our network will have 3 10GE wavelengths on MiLR in a “triangle”
  - Loss of any one of the 3 waves doesn’t impact connectivity for either site
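
The address plan on this slide, one /23 split into a /24 per site, can be illustrated with Python's `ipaddress` module. The 10.10.0.0/23 block is a placeholder, not the real AGLT2 network, and which site receives which half is an assumption made for the example.

```python
import ipaddress


def split_between_sites(block, sites=("UM", "MSU")):
    """Split a network into two equal halves and assign one to each site."""
    network = ipaddress.ip_network(block)
    halves = list(network.subnets(prefixlen_diff=1))  # /23 -> two /24s
    return dict(zip(sites, halves))


# Placeholder address block, used only for illustration.
for site, subnet in split_between_sites("10.10.0.0/23").items():
    print(site, subnet, subnet.num_addresses)
```

Keeping both sites inside one /23 means a single route covers the whole Tier-2, while each site can still be addressed separately by its /24.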