
Big Computing in High Energy Physics
David Toback
Presentation transcript:

Guy Almes¹, Daniel Cruz¹, Jacob Hill², Steve Johnson¹, Michael Mason¹,³, Vaikunth Thukral¹, David Toback¹, Joel Walker²
1. Texas A&M University  2. Sam Houston State University  3. Los Alamos National Laboratory

- Overview of CMS grid computing
- Tier3 site at Texas A&M
  o Tier3 site on an already existing cluster
- Advantages and disadvantages of installing a Tier3 on an existing cluster
- Performance
- Custom monitoring tools for Tier3 sites
  o Why we need custom monitoring
- Conclusions

- Computers → Clusters → Grid → CMS Grid
  o Tiered (layered) structure, organized by functionality
- How the CMS Grid handles data and jobs:
  o PhEDEx: moves data around the world; we bring over tens of TB per month
  o CRAB: how analysis jobs are submitted to the grid
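As an illustration of the kind of bookkeeping this enables, the sketch below queries the PhEDEx data service for the blocks resident at a site and totals their size. This is not part of the talk's tools: the endpoint name, the site name T3_US_TAMU, and the JSON field names are assumptions based on the datasvc interface and should be verified against the PhEDEx documentation.

```python
# Hedged sketch: total the data volume PhEDEx reports as resident at a site.
# Assumptions: the "blockreplicas" datasvc call, its "node" parameter, and the
# JSON layout ("phedex" -> "block" -> "bytes") are recalled from the PhEDEx
# documentation and should be double-checked; the site name is illustrative.
import json
import urllib.request

DATASVC = "https://cmsweb.cern.ch/phedex/datasvc/json/prod/blockreplicas"
SITE = "T3_US_TAMU"  # illustrative CMS site name

def resident_terabytes(site: str) -> float:
    """Return the total size (TB) of blocks PhEDEx lists at `site`."""
    url = f"{DATASVC}?node={site}"
    with urllib.request.urlopen(url, timeout=60) as resp:
        payload = json.load(resp)
    blocks = payload.get("phedex", {}).get("block", [])
    total_bytes = sum(int(b.get("bytes", 0)) for b in blocks)
    return total_bytes / 1e12

if __name__ == "__main__":
    print(f"{SITE}: {resident_terabytes(SITE):.1f} TB resident")
```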

- Why this particular structure? Better allocation and handling of resources
- Tier 0: CERN
- Tier 1: a few national labs
- Tier 2: larger university installations for national use (shared resources)
- Tier 3: local use (our type of center; a balance between meeting our users' needs and helping the community)

- Worth noting: most people just create their own cluster
- When the time came for the A&M CMS group to add grid computing capabilities, we decided to join an already existing cluster
- Brazos is a computing cluster at the university
  o Designed for high throughput (as opposed to high performance)
  o Mostly used by stakeholders in the College of Science

- Advantages:
  o Already-established support and system administration
  o We have 256 *dedicated* priority cores, but can run on all 2565 cores if needed
- Disadvantages:
  o Dealing with university firewall issues
  o Supporting operating systems for various users
  o System administrators who want us to change how we run on the grid
  o We want to let users from around the globe run if needed, but shared clusters often have rules about who can run

- The A&M Tier3 site came online in December 2010
- Spent the first few months of 2011 dealing with testing woes
- Usage really took off around August 2011
- [Plot: core-hours per month run by our HEPX group, April 2011 – May 2012; dips are due to many factors: problems with the cluster, user inactivity, etc.]
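For reference, a core-hours-per-month figure like the one plotted can be assembled from the batch system's accounting records. The sketch below assumes a hypothetical CSV export with job end times, core counts, and walltimes; the column names and file path are made up for illustration and are not the site's actual accounting format.

```python
# Hedged sketch: aggregate monthly core-hours from batch accounting records.
# The CSV layout (columns "end_time", "cores", "walltime_s") and the file path
# are hypothetical; a real site would export these from its scheduler's
# accounting database using its own reporting tools.
import csv
from collections import defaultdict
from datetime import datetime

def core_hours_by_month(path: str) -> dict:
    """Sum cores * walltime (in hours) per calendar month."""
    totals = defaultdict(float)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            end = datetime.fromisoformat(row["end_time"])  # e.g. 2011-08-15T04:12:00
            month = end.strftime("%Y-%m")
            totals[month] += int(row["cores"]) * float(row["walltime_s"]) / 3600.0
    return dict(sorted(totals.items()))

if __name__ == "__main__":
    for month, hours in core_hours_by_month("hepx_jobs.csv").items():
        print(f"{month}  {hours:10.0f} core-hours")
```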

- CMS does a *great* job of monitoring Tier 0s, 1s, and 2s, as well as jobs and data transfers
  o Tier3 monitoring is generally unsupported by CMS; Tier3s are mostly supported by the user community
- Decided to put in place monitoring specific to our Tier3 site (and transferable to other Tier3s):
  o Are data transfers and load tests working?
  o Is the cluster up and running?
  o Are test jobs finishing quickly? (custom setup by TAMU)
  o Are users' jobs working?
- Standard CMS monitoring is spread across lots of pages; we combined everything you need into one main webpage with all the info
- Most important: we now have automated checks on the status that send emails if anything goes wrong (see the sketch below)
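A minimal sketch of the automated-alert idea, not the actual TAMU implementation: run a set of site checks and email the admins when one fails. The individual checks, recipient addresses, and SMTP host are placeholders; the real monitoring runs site-specific probes (transfers, load tests, test jobs, cluster health).

```python
# Hedged sketch of automated status checks with email alerts.
# Recipients, mail relay, and the check list are placeholders for illustration.
import smtplib
import socket
from email.message import EmailMessage

ADMINS = ["tier3-admins@example.edu"]   # placeholder recipients
SMTP_HOST = "localhost"                 # placeholder mail relay

def check_grid_endpoint(host="cmsweb.cern.ch", port=443, timeout=10) -> bool:
    """Crude reachability probe for a central CMS service."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

CHECKS = {
    "grid endpoint reachable": check_grid_endpoint,
    # "recent test job succeeded": check_test_job,   # site-specific, omitted
    # "data transfer load test OK": check_load_test, # site-specific, omitted
}

def run_checks_and_alert():
    failures = [name for name, check in CHECKS.items() if not check()]
    if failures:
        msg = EmailMessage()
        msg["Subject"] = "[Tier3 monitor] %d check(s) failing" % len(failures)
        msg["From"] = "tier3-monitor@example.edu"
        msg["To"] = ", ".join(ADMINS)
        msg.set_content("Failing checks:\n" + "\n".join(failures))
        with smtplib.SMTP(SMTP_HOST) as smtp:
            smtp.send_message(msg)

if __name__ == "__main__":
    run_checks_and_alert()
```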

- Need monitoring to handle *many* types of debugging
- Example: a user has a generic complaint: "my jobs aren't working." How do we figure out where the problem is?
  o User problem? Permissions problem? CMS software problem? Incompatibility between software and cluster?
  o Is the right software installed? Is the cluster up and running? Is the cluster connecting to the grid? Did someone, somewhere in the world, change anything?
- Lots of things need to work just for a job to run to completion

- These are all INTERFACE problems:
  o System administrators can't do it alone: they don't know the CMS software
  o Physicists can't do it alone: they don't know cluster administration or grid infrastructure
- Goal is to provide LOTS of standardized checks, so we can quickly locate the general source of the problem (a sketch of this layered approach follows)
- This has led to our custom monitoring website
- The fully functional website is now in use
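To make the "standardized checks" idea concrete, here is a hedged sketch of a layered diagnostic: probe each layer in order (user grid proxy, CMS software area, batch system, grid connectivity) and report the first one that fails. The specific commands and paths are examples of the kinds of probes a site might run, not the production checks from the talk.

```python
# Hedged sketch of layered diagnostics: find the first broken layer.
# The probe commands (voms-proxy-info, qstat) and the CMS software path are
# illustrative; a site would substitute its own batch system and install area.
import os
import shutil
import socket
import subprocess

def command_ok(cmd) -> bool:
    """Return True if the command exists and exits with status 0."""
    if shutil.which(cmd[0]) is None:
        return False
    return subprocess.run(cmd, capture_output=True).returncode == 0

def reachable(host, port, timeout=10) -> bool:
    """Simple TCP reachability test for a remote grid service."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

LAYERS = [
    ("user grid proxy",   lambda: command_ok(["voms-proxy-info", "--exists"])),
    ("CMS software area", lambda: os.path.isdir("/cvmfs/cms.cern.ch")),
    ("batch system",      lambda: command_ok(["qstat"])),  # Torque-style example
    ("grid connectivity", lambda: reachable("cmsweb.cern.ch", 443)),
]

def first_broken_layer():
    for name, probe in LAYERS:
        if not probe():
            return name
    return None

if __name__ == "__main__":
    broken = first_broken_layer()
    print("all layers OK" if broken is None else f"problem localized to: {broken}")
```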

- This has enabled us to successfully install a Tier3 site on an existing cluster
- Because these tools incorporate our experience, they prove useful to other Tier3s
- Working on this at the moment: currently installing at Texas Tech

- Successfully created a Tier3 site on an existing cluster at Texas A&M: fully functional, with all the advantages of a big, well-supported system
- Have developed a powerful new monitoring system that alerts us to problems (via email) in real time and provides an easy interface to the data needed for debugging
- Our experience bringing up a Tier3 on an existing cluster provides an excellent model for other institutions, and our monitoring tools are readily ported to other institutions (in progress)

"In this talk, we will present a brief overview of the CMS Tier3 site at Texas A&M Universty. It is largely unique in that we added resources to an already-existing cluster to create our site as opposed to creating a stand-alone system. We will comment on some of the particulars of our site, the advantages and disadvantages of this choice, as well as how it has performed. We will also discuss some powerful new custom monitoring tools we've developed to optimize our cluster performance."