AGLT2 Site Report Shawn McKee University of Michigan HEPiX Fall 2014 / UNL.

Presentation transcript:

AGLT2 Site Report Shawn McKee University of Michigan HEPiX Fall 2014 / UNL

Outline
- Site Summary and Status
- Monitoring
- Provisioning with Cobbler
- HTCondor MCORE details
- Virtualization Status
- Networking Upgrade
- Updates on projects
- Plans for the future

Site Summary
- The ATLAS Great Lakes Tier-2 (AGLT2) is a distributed LHC Tier-2 for ATLAS spanning UM/Ann Arbor and MSU/East Lansing, with roughly 50% of storage and compute at each site.
- 5722 single-core job slots (added 480 cores); MCORE slots increased from 240 to 420 (dynamic); 269 Tier-3 job slots usable by the Tier-2; average 9.26 HS06/slot.
- 3.5 petabytes of storage (adding 192 TB, retiring 36 TB); total of 54.4 kHS06, up from 49.0 kHS06 in the spring.
- Most Tier-2 services virtualized in VMware.
- 2x40 Gb inter-site connectivity; UM has 100G to the WAN, MSU has 10G to the WAN; many 10Gb internal ports and 16 x 40Gb ports.
- High-capacity storage systems have 2 x 10Gb bonded links; 40Gb link between the Tier-2 and Tier-3 physical locations.

AGLT2 Monitoring
- AGLT2 has a number of monitoring components in use. As shown in Annecy we have:
- Customized "summary" page
- OMD (Open Monitoring Distribution) at both UM and MSU
- Ganglia
- Central syslogging via ELK: Elasticsearch, Logstash, Kibana (a minimal forwarding sketch follows below)
- SRMwatch to track dCache SRM status
- GLPI to track tickets (with FusionInventory)
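Central syslogging with ELK mostly comes down to each node forwarding its syslog stream to the collector. A minimal client-side sketch, assuming a hypothetical collector host name (the Logstash/Elasticsearch side is omitted since option names vary by version):

```
# /etc/rsyslog.conf fragment on a worker or service node (illustrative only;
# "loghost.aglt2.org" is an assumed collector name, not from the slides)
*.*  @@loghost.aglt2.org:514    # @@ = forward all facilities over TCP to the ELK collector
```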

AGLT2 Provisioning/Config Mgmt (Cobbler)
- AGLT2 uses a Cobbler server configuration managed by CFEngine and duplicated at both sites for building service nodes (excepting site-specific network/host info).
- Created a flexible default kickstart template with Cobbler's template language (Cheetah) to install a variety of "profiles" selected when adding a system to Cobbler (server, cluster-compute, desktop, etc.). Simple PXE-based installation from the network.
- Cobbler handles (with included post-install scripts) creating bonded NIC configurations; we used to deal with those manually.
- Cobbler manages mirroring of the OS and extra repositories.
- Kickstart setup is kept minimal; most configuration is done by CFEngine on first boot.
- Dell machines get BIOS and firmware updates in post-install using utils/packages from the Dell yum repositories.
- See Ben Meekhof's talk Thursday for details; a sketch of the Cobbler workflow follows below.
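A minimal sketch of the Cobbler side of this workflow, with hypothetical profile, host, and address values (not the actual AGLT2 configuration):

```
# Tie a distro to the shared Cheetah kickstart template via a profile,
# then register a system against that profile; post-install scripts and
# CFEngine handle the rest after first boot.
cobbler profile add --name=cluster-compute \
    --distro=sl6-x86_64 \
    --kickstart=/var/lib/cobbler/kickstarts/aglt2-default.ks

cobbler system add --name=c-101-1 --profile=cluster-compute \
    --mac=00:11:22:33:44:55 --ip-address=10.10.1.101 --hostname=c-101-1.local

cobbler sync    # regenerate PXE/DHCP configuration so the node can network-boot
```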

HTCondor CE at AGLT2
- Bob Ball worked for about a month on the AGLT2 setup:
  - Steep learning curve for newcomers
  - Lots of non-apparent niceties in preparing the job-router configuration
  - RSL is no longer available for routing decisions
- The content of a job route cannot be changed except during a condor-ce restart; however, variables CAN be modified and placed in ClassAd attributes set in the router. Used at AGLT2 to control MCORE slot access (see the sketch after this list).
- Currently in place on the test gatekeeper only; will be extended to the primary gatekeeper around 10/22/14.
- See full details of our experience and setup at
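For illustration, a sketch of what such a job route could look like. This is not the actual AGLT2 route; the route names, the "localQueue" attribute, and the use of the xcount attribute for the requested core count are assumptions.

```
# HTCondor-CE job router sketch: set_<attr> stamps a ClassAd value onto routed
# jobs, which downstream policy can use to steer MCORE (8-core) work.
JOB_ROUTER_ENTRIES @=jre
[
  name = "AGLT2_MCORE";
  TargetUniverse = 5;                    /* route into the local pool (vanilla universe) */
  Requirements = (TARGET.xcount =?= 8);  /* xcount = requested cores; assumed attribute */
  set_localQueue = "mcore";              /* hypothetical ClassAd variable for site policy */
]
[
  name = "AGLT2_Default";
  TargetUniverse = 5;
  set_localQueue = "default";
]
@jre
```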

MCORE at AGLT2
- AGLT2 has supported MCORE jobs for many months now.
- Condor is configured for two MCORE job types:
  - Static slots (10 total, 8 cores each)
  - Dynamic slots (420 of 8 cores each)
- Requirements statements are added by the "condor_submit" script, depending on the count of queued MP8 jobs.
- The result is instant access for a small number of jobs, with a gradual release of cores for more over time (a configuration sketch follows below).
- Full details at
- [Plots of queued and running MCORE jobs]
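A minimal worker-node startd sketch of the two slot flavors, assuming one static 8-core slot per node plus a partitionable slot for the rest (the exact AGLT2 counts and policy knobs are not shown here):

```
# Static 8-core slot reserved for MCORE work
SLOT_TYPE_1               = cpus=8
SLOT_TYPE_1_PARTITIONABLE = False
NUM_SLOTS_TYPE_1          = 1            # one static MCORE slot per node (assumption)

# Remaining cores in a partitionable slot; dynamic 8-core slots are carved
# out of it as matching jobs arrive
SLOT_TYPE_2               = cpus=auto
SLOT_TYPE_2_PARTITIONABLE = True
NUM_SLOTS_TYPE_2          = 1
```

On the submit side, an MCORE job would then carry request_cpus = 8 together with the Requirements expression injected by the condor_submit wrapper.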

Virtualization at AGLT2
- Most Tier-2 services run on VMware (vSphere 5.5).
- UM uses iSCSI storage backends: Dell MD3600i, MD3000i and Sun NAS 7410. vSphere manages virtual disk allocation between units and RAID volumes based on volume performance capabilities and VM demand.
- MSU runs on DAS (Dell MD3200).
- Working on site resiliency details:
  - Multisite SSO operational between sites (SSO at either site manages both sites).
  - MSU is operating site-specific Tier-2 VMs (dCache doors, xrootd, Cobbler) on vSphere.
  - The VMware Replication Appliance performs daily replications of critical UM VMs to the MSU site. This is working well.
- Our goal is to have MSU capable of bringing up Tier-2 service VMs within one day of losing the UM site. Queued: a real test of this process.

AGLT2 100G Network Details
- [Network diagram; one link was down due to problematic optics]

Software-Defined Storage Research
- NSF proposal submitted involving campus and our Tier-2.
- Exploring Ceph for future software-defined storage.
- Goal is centralized storage that supports in-place access from CPUs across campus.
- Intends to leverage Dell "dense" MD3xxx storage (12 Gbps SAS) in JBOD mode.
- Still waiting for news…

Update on DIIRT
- At Ann Arbor, Gabriele Carcassi presented on "Using Control Systems for Operation and Debugging".
- This effort has continued and is now called DIIRT (Data Integration In Real Time).
- [Data-flow diagram: scripts -> NFS (CSV or JSON) -> diirt server -> WebSockets + JSON -> web pages (HTML + JavaScript) and Control System Studio (UI for operators)]
- Currently implemented:
  - Scripts populate an NFS directory from Condor/Ganglia (see the sketch after this list).
  - Files are served by the diirt server through WebSockets.
  - Control System Studio can create a "drag'n'drop" UI.
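For flavor, a hypothetical collector script in the spirit of this data flow: query the Condor queue and drop a JSON summary into the NFS directory that the diirt server watches. The output path, constraint, and field names are assumptions, not the actual AGLT2 scripts.

```python
#!/usr/bin/env python
import json
import subprocess
import time

OUTPUT = "/net/diirt/data/condor_summary.json"  # hypothetical NFS export path

def queued_and_running(constraint="RequestCpus >= 8"):
    """Count idle and running jobs matching a constraint via condor_q."""
    out = subprocess.check_output(
        ["condor_q", "-constraint", constraint, "-af", "JobStatus"]).decode()
    statuses = [int(s) for s in out.split()]
    return {
        "queued": sum(1 for s in statuses if s == 1),   # JobStatus 1 = Idle
        "running": sum(1 for s in statuses if s == 2),  # JobStatus 2 = Running
        "timestamp": int(time.time()),
    }

if __name__ == "__main__":
    with open(OUTPUT, "w") as fh:
        json.dump(queued_and_running(), fh)
```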

DIIRT UI
- The canvas allows drag-and-drop of elements to assemble views; no programming required.
- The server can feed remote clients in real time.
- Project info at

Future Plans
- Participating in SC14 (simple WAN data-pump system).
- Our Tier-3 uses Lustre 2.1 and has ~500TB: approximately 35M files averaging 12MB/file. We will purchase new hardware providing another 500TB.
- Intend to go to Lustre 2.5+ and are VERY interested in using Lustre on ZFS for this (see the sketch below).
- Plan: install the new Lustre instance, migrate the existing Lustre data over, then rebuild the older hardware into the new instance, retiring some components for spare parts.
- Still exploring OpenStack as an option for our site; would like to use Ceph for the back-end.
- New network components support Software-Defined Networking (OpenFlow). Once v1.3 is supported we intend to experiment with SDN in our Tier-2 and as part of the LHCONE point-to-point testbed.
- Working on IPv6 dual-stack for all nodes in our Tier-2.
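As an illustration of what Lustre on ZFS looks like, a minimal sketch of formatting and mounting one OST with a ZFS backend (Lustre 2.5+). The pool name, filesystem name, MGS node, and device names are assumptions:

```
# Create a raidz2-backed ZFS OST for a hypothetical filesystem "aglt2t3"
mkfs.lustre --ost --backfstype=zfs --fsname=aglt2t3 --index=0 \
    --mgsnode=mgs.local@tcp \
    ost0pool/ost0 raidz2 /dev/sdb /dev/sdc /dev/sdd /dev/sde

# Mount the OST; the mount target is the ZFS dataset created above
mount -t lustre ost0pool/ost0 /mnt/lustre/ost0
```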

Summary
- Monitoring is helping us easily find and fix issues.
- Virtualization tools are working well and we are close to meeting our site resiliency goals.
- Network upgrade in place: 2x40G inter-site, 100G WAN.
- DIIRT is a new project allowing us to customize how we manage and correlate diverse data.
- FUTURE: OpenStack, IPv6, Lustre on ZFS for Tier-3, SDN.
- Questions?