OSG Production Foundations for 2M+ Hours/Day
April 9, 2014
Rob Quick, with help from Shawn McKee and Chander Sehgal


OSG Council Aug 18th 2010
Once Upon a Time…
Phoenix, November 2003, Super Computing

Agenda
- OSG Networking
- Capturing Opportunistic Cycles
- OSG Operations
- OSG as a Community

OSG Networking Area
OSG Networking was added at the beginning of OSG's second 5-year period in 2012.
The "Mission" is to have OSG become the network service data source for its constituents:
- Information about network performance, bottlenecks and problems should be easily available.
- OSG should support our VOs, users and site admins in finding network problems and bottlenecks.
- OSG should provide network metrics to higher-level services so they can make informed decisions about their use of the network (which sources and destinations for jobs or data are most effective?).
Goal: OSG hosts network information for its constituents, aiding in finding and fixing problems and enabling applications and users to take better advantage of their networks.

Year 1 & 2 Goals and Key Initiatives in Network Area
Year 1 of OSG Networking was primarily focused on getting network monitoring in place:
- Deploying perfSONAR-PS: instrumenting OSG sites with standardized tools to gather network metrics
- OSG Network Service: gathering OSG network metrics centrally and making them available for users and applications
- Network Documentation: creating documentation for OSG users and VO managers to guide them in understanding and diagnosing network issues
Year 2 primary components:
- Complete deployment of perfSONAR-PS
- Improve the modular dashboard
- Explore extending coverage to include WLCG
- Enable alarming and problem analysis based upon network metrics
- Improve tools and documentation from the user perspective

Replacement Prototype: MaDDash
- MaDDash (Monitoring and Debugging Dashboard), supported by ESnet
- Must be migrated to OSG!

Prototype: Service Monitoring
- OMD (Open Monitoring Distribution): an integrated package over Nagios
- Checks and verifies that primitive services are functional
- Ensures we get good network metrics
- Must be migrated to OSG!

Alerting/Alarming for Network Issues
What most sites want is a tool that lets them know if there is a network problem (and ideally WHERE it is).
In year 2 we started to develop this capability for OSG sites:
- Primitive OSG perfSONAR-PS service monitoring is easy, and we have Nagios-type plugins that check services.
- Much harder is deciding when network metrics gathered by perfSONAR-PS require an alert or alarm:
  - Is the change in metrics due to "normal" (heavy) network use, or is there a new problem?
  - If there is a real problem, where is it located? This is critical because we should only alert someone if the problem is one they can fix.
There is an interesting project at Georgia Tech called Pythia (see Terena presentation):
- Submitted a new NSF SI2-SSE proposal, "PuNDIT" (Pythia Network Diagnosis Infrastructure), which targets OSG/WLCG
- The goal is to provide this needed alerting/alarming component
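The "normal heavy use versus new problem" question above can be illustrated with a small sketch. This is not Pythia or PuNDIT, and it is not an OSG tool; it is a hypothetical baseline check, assuming throughput samples in Mbps for one source/destination pair. The function names, the threshold `k`, and the `min_consecutive` rule are all assumptions made for illustration.

```python
from statistics import median


def robust_baseline(samples):
    """Return (median, median absolute deviation) of throughput samples."""
    med = median(samples)
    mad = median([abs(x - med) for x in samples])
    return med, mad


def should_alert(history, recent, k=5.0, min_consecutive=3):
    """Alert only when the newest samples all sit far below the baseline.

    history: older throughput samples (Mbps) that define "normal".
    recent:  newest samples, oldest first.
    A single low sample is treated as ordinary contention; only a
    sustained drop of `min_consecutive` samples triggers an alert.
    """
    if len(history) < 10 or len(recent) < min_consecutive:
        return False  # not enough data to judge either way
    med, mad = robust_baseline(history)
    # Use at least 1% of the median as spread so a flat history
    # does not make the threshold unreasonably tight.
    floor = med - k * max(mad, 0.01 * med)
    return all(x < floor for x in recent[-min_consecutive:])


history = [900, 950, 920, 940, 910, 930, 925, 935, 915, 945]
print(should_alert(history, [400, 420, 410]))  # sustained drop
print(should_alert(history, [400, 930, 410]))  # intermittent dip
```

A check like this only addresses whether something changed. It says nothing about where the problem is located, which the slide identifies as the harder and more important question.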

Network Area Near Term Goals
- OSG is strongly encouraging non-WLCG sites to deploy perfSONAR-PS toolkit instances so we can help them with network issues.
- Automate the creation of "mesh-configurations" using OIM and GOCDB registration information.
- OSG production has an older network datastore and monitoring in place, BUT it must be merged with the newer replacements:
  - Prototype services need to migrate into OSG from AGLT2
  - Must integrate new RESTful API components from perfSONAR v3.4
  - Must test API and client use-cases from OSG and WLCG
- We must evaluate the impact of monitoring and gathering network metrics for all of WLCG before committing to provide their monitoring and data aggregation.
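As a rough illustration of what automating mesh-configurations from registration information could look like, the sketch below builds a full mesh of test pairs from a list of registered hosts. The record layout (`hostname`, `services`) and the output dictionary are hypothetical; they are not the OIM or GOCDB schema, nor the actual perfSONAR mesh-configuration format.

```python
import json
from itertools import permutations


def build_mesh(hosts, test_type):
    """Build a full-mesh test description for hosts offering `test_type`.

    hosts: iterable of dicts with hypothetical keys 'hostname' and
    'services', standing in for records pulled from a registration
    database.
    """
    members = sorted(h["hostname"] for h in hosts if test_type in h["services"])
    return {
        "description": f"{test_type} full mesh",
        "type": test_type,
        "members": members,
        # Every ordered pair, so each direction is tested separately.
        "pairs": [list(pair) for pair in permutations(members, 2)],
    }


# Placeholder host names, not real registrations.
registered = [
    {"hostname": "ps-bw.site-a.example.org", "services": ["bandwidth"]},
    {"hostname": "ps-bw.site-b.example.org", "services": ["bandwidth"]},
    {"hostname": "ps-lat.site-b.example.org", "services": ["latency"]},
    {"hostname": "ps-bw.site-c.example.org", "services": ["bandwidth"]},
]

mesh = build_mesh(registered, "bandwidth")
print(json.dumps(mesh, indent=2))
```

Deriving the member list from registration data means a newly registered instance joins the mesh without anyone editing test definitions by hand, which is the point of the automation goal.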

OSG Eco-system
All OSG usage for the 12 months ending 31-March-2014.
Some of these VOs access opportunistic cycles, e.g. osg, glow, engage, hcc, sbgrid.

OSG Opportunistic Eco-system
Usage by "opportunistic VOs" for the 12 months ending 31-March-2014.
Of these, the OSG VO provides access to US researchers who are not already affiliated with an existing community in OSG.

OSG VO Mission & Usage
The OSG VO does not own any computing resources and exists only to harvest unused cycles at OSG sites (opportunistic cycles) and make them available to researchers who are not already affiliated with an OSG VO.
For the 12 months ending 31-March-2014, the OSG VO harvested 64.4M hours (from sites by using gWMS) and delivered 57.7M hours to various submit hosts to enable the computing of researchers.

| Submit Host | Wall Hours |
|---|---|
| OSG-XD (XSEDE and OSG Direct)** | 54,694,294 |
| UCSDgrid | 1,104,882 |
| Bakerlab | 1,012,264 |
| OSGCONNECT** | 870,640 |
| ISI | 3,539 |
| LSU | 63 |
| Total | 57,685,682 |

** Core OSG Services
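The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers given above; the 64.4M harvested figure is rounded in the source, so the resulting fraction is approximate, and the variable names are illustrative.

```python
# Wall hours delivered per submit host, as listed on the slide.
delivered = {
    "OSG-XD": 54_694_294,
    "UCSDgrid": 1_104_882,
    "Bakerlab": 1_012_264,
    "OSGCONNECT": 870_640,
    "ISI": 3_539,
    "LSU": 63,
}

harvested_hours = 64.4e6  # rounded figure quoted on the slide

total_delivered = sum(delivered.values())
delivered_fraction = total_delivered / harvested_hours
shares = {host: hours / total_delivered for host, hours in delivered.items()}

print(f"total delivered: {total_delivered:,} hours")
print(f"delivered / harvested: {delivered_fraction:.1%}")
for host, share in shares.items():
    print(f"{host:12s} {share:8.3%}")
```

The per-host sum matches the slide's total of 57,685,682 hours, and roughly nine tenths of the harvested hours reached a submit host, with OSG-XD accounting for the large majority of them.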

Access to OSG DHTC Fabric via OSG VO
Diagram labels: OSG DHTC Fabric (>100 sites); OSG Flocking Node; Interactive Login Node; XSEDE Users; OSG-Direct Users; OSG-Connect; Duke-Connect; iPlant; Virginia Tech; BakerLab; ISI; Others …
All access operates under the OSG VO using glideinWMS.

OSG-Direct users, April 2013 to March 2014

| Project Name | PI | Institution | Field of Science | Wall Hours |
|---|---|---|---|---|
| Snowmass | Meenakshi Narain | Brown University | High Energy Physics | 8,632,986 |
| SPLINTER | Robert Quick | Indiana University | Medicine | 4,601,962 |
| Duke-QGP | Steffen A. Bass | Duke University | Nuclear Physics | 2,543,933 |
| ECFA | Meenakshi Narain | Brown University | High Energy Physics | 1,744,646 |
| UMich | Paul Wolberg | University of Michigan | Microbiology | 1,433,598 |
| Pheno | Stefan Hoeche | SLAC | High Energy Physics | 1,108,623 |
| RIT | P. Stanislaw Radziszowski | Rochester Institute of Technology | Computer Science | 721,291 |
| UPRRP-MR | Steven Massey | Universidad de Puerto Rico (UPRRP) | Bioinformatics | 714,359 |
| IU-GALAXY | Robert Quick | Indiana University | Bioinformatics | 640,484 |
| DetectorDesign | John Strologas | University of New Mexico | Medical Imaging | 451,803 |
| EIC | Tobias Toll | Brookhaven National Laboratory | Accelerator Physics | 410,594 |
| OSG-Staff | Chander Sehgal | Fermilab | Computer Science | 43,948 |
| DeerDisease | Lene Jung Kjaer | Southern Illinois University | Biological Sciences | 28,599 |
| SNOplus | Joshua R Klein | University of Pennsylvania | Physics - Neutrino | 489 |
| P0-LBNE | Maxim Potekhin | Brookhaven National Laboratory | Physics - Neutrino | 17 |
| BNLPET | Martin Purschke | Brookhaven National Laboratory | Medical Imaging | 1 |
| Total | 16 users | | | 23,077,333 |

XSEDE users, April 2013 to March 2014

| Project Name | PI | Institution | Field of Science | Wall Hours |
|---|---|---|---|---|
| TG-IBN130001 | Donald Krieger | University of Pittsburgh | Biological Sciences | 29,495,083 |
| TG-PHY120014 | Qaisar Shafi | University of Delaware | Physics | 528,458 |
| TG-TRA100004 | Andrew Ruether | Swarthmore College | Other | 444,374 |
| TG-DMR130036 | Emanuel Gull | University of Michigan | Materials Research | 318,768 |
| TG-MCB100109 | Lillian Chong | University of Pittsburgh | Molecular Biosciences | 264,362 |
| TG-CHE130091 | Paul Siders | University of Minnesota; Duluth | Chemistry | 86,280 |
| TG-ATM130015 | Phillip Anderson | University of Texas at Dallas | Atmospheric Sciences | 77,169 |
| TG-IRI130016 | Joseph Cohen | University of Massachusetts; Boston | Information; Robotics; and Intelligent Systems | 70,536 |
| TG-DMS120024 | Benjamin Ong | Michigan State University | Mathematical Sciences | 68,908 |
| TG-CHE130103 | Jeremy Moix | Massachusetts Institute of Technology | Chemistry | 58,355 |
| TG-ATM130009 | Phillip Anderson | University of Texas at Dallas | Atmospheric Sciences | 39,971 |
| TG-MCB090163 | Michael Hagan | Brandeis University | Molecular Biosciences | 38,590 |
| TG-OCE130029 | Yvonne Chan | University of Hawaii; Manoa | Ocean Sciences | 31,670 |
| TG-TRA120014 | Pol Llovet | Montana State University | Cross-Disciplinary Activities | 19,472 |
| TG-IBN130008 | Jorden Schossau | Michigan State University | Biological Sciences | 16,857 |
| TG-MCB120070 | Joseph Hargitai | Albert Einstein College of Medicine | Molecular Biosciences | 378 |
| TG-TRA120041 | Hanning Chen | George Washington University | Computer and Information Science | 231 |
| TG-MCB090174 | Shantenu Jha | Rutgers University | Molecular Biosciences | 58 |
| TG-PHY110015 | Pran Nath | Northeastern University | Physics | 37 |
| TG-MCB130072 | Robert Quick | Indiana University | Molecular Biosciences | 16 |
| TG-CCR120041 | Luca Clementi | San Diego Supercomputer Center | Computer and Computation Research | 12 |
| TG-STA110014 | S Nancy Wilkins-Diehr | University of California-San Diego | Other | 5 |
| Total | 22 users | | | 31,559,590 |

Operations Mission and Structure
The mission of OSG Operations is to maintain and support a production quality computing environment for research communities.
Operations Support:
- Support Desk
- Ticket Tracking
- Community Notification and Communication
Operations Infrastructure:
- Compute Services
- Distributed
- IU, FNAL, UCSD, UNL, UC

Service Levels
Maintaining all services at SLA levels:
- This includes compute and support services.
- All compute services at 99.41% availability
- Only missed a single monthly metric, for MyOSG in July 2013
- All critical services at 99.92% availability
- An outage could lead to mass job failure
- This is approximately 12 hours of downtime between June 2012 and February
- Service Desk: no exceptions to SLA
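The relationship between an availability percentage and hours of downtime is simple arithmetic, sketched below. The slide does not state the year of the closing February; the sketch assumes the period runs from 1 June 2012 through the end of February 2014, which is an assumption rather than a figure from the source.

```python
from datetime import date


def downtime_hours(availability_pct, start, end):
    """Hours of downtime implied by an availability percentage.

    The period covers `start` up to, but not including, `end`.
    """
    total_hours = (end - start).days * 24
    return total_hours * (1 - availability_pct / 100)


# Assumed reporting period: the end year is not given on the slide.
period_start = date(2012, 6, 1)
period_end = date(2014, 3, 1)

critical = downtime_hours(99.92, period_start, period_end)
compute = downtime_hours(99.41, period_start, period_end)

print(f"critical services: {critical:.1f} hours down")
print(f"all compute services: {compute:.1f} hours down")
```

Under that assumption the critical-services figure comes out at roughly 12 hours, consistent with the number quoted on the slide.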

Communication and Interoperability
Continual communication via notifications, blog aggregation, and a real-time operational event tracker.
Inter-area communication and coordination with major stakeholders:
- Bring all area coordinators together weekly for the Production meeting
- ATLAS, CMS, and invited VOs
Ongoing collaboration with WLCG and EGI:
- ENMR VO fully interoperational
Interoperability for researchers on peering infrastructures:
- WLCG, XSEDE Campus Bridging, EGI-InSPIRE

Impact of Production Foundations
- Stable Infrastructure
- Timely Support
- Adoption of New Technologies
- Continual Communication
- Resource and Infrastructure Monitoring

Impact of Production Foundations

Things We Learned Yesterday
"Researchers are people." - Lauren
"You have to believe in sharing." - Miron
So what are we doing?
- Networks
- Science
- Operations
- Technology
- Karaoke
But what are we really doing?

The Real Challenge of Operations
Building a strong sense of community for users, resource suppliers, and OSG staff:
- Stable services
- Built in continued quasi-daily one-on-one interactions
- Done in long-term dialogues
- You cannot have a sense of community without a sense of caring.
"What should young people do with their lives today? Many things, but the most daring thing is to create stable communities…" Kurt Vonnegut

Thoughts?