2018 COMPUTING REVIEW HOMEWORK QUESTIONS

Q1-4 & Q8: Scientific Computing (Chip & Sandy)
Q6, 9, 10: GlueX & CLAS12 (David & Graham)
Q5, 7: PhDs, publications (Rolf)

Q1: What criteria are used to decide which workflows run locally vs remotely?

GlueX made an early choice to use OSG for their simulations because those jobs are much less I/O intensive (i.e., there is no need to move much data). When it became clear that we might not have enough local computing to do both GlueX and CLAS12, we looked into using NERSC. GlueX's reconstruction workflow gave the shortest latency on KNLs and was a large near-term load, so earlier this calendar year we received a Director's Reserve allocation, and we (Chris Larrieu) then started modifying SWIF to run on that remote resource (SWIF-2). We were already using Globus for LQCD transfers to/from NERSC and OLCF, so we knew it was capable enough and only had to learn how to drive it from software. With early experience looking promising, GlueX submitted a larger production allocation request that could absorb a large fraction of their event reconstruction, and I submitted a smaller startup request for CLAS12 (which will reuse all of the SWIF-2 investments).
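As a rough illustration of the "drive it from software" step (this is not the actual SWIF-2 code; the endpoint UUIDs, paths, label, and token below are hypothetical placeholders), a transfer between a JLab endpoint and a NERSC endpoint can be scripted with the Globus Python SDK roughly like this:

import globus_sdk

# Hypothetical placeholders -- not the real JLab/NERSC endpoint UUIDs or paths.
JLAB_ENDPOINT = "aaaaaaaa-1111-2222-3333-444444444444"
NERSC_ENDPOINT = "bbbbbbbb-5555-6666-7777-888888888888"

# Authorization details omitted; any valid Globus authorizer (e.g. built from a
# refresh token or client credentials) can be used here.
authorizer = globus_sdk.AccessTokenAuthorizer("TRANSFER_ACCESS_TOKEN")
tc = globus_sdk.TransferClient(authorizer=authorizer)

# Describe the transfer: verify checksums and only sync files that changed.
tdata = globus_sdk.TransferData(
    tc,
    JLAB_ENDPOINT,
    NERSC_ENDPOINT,
    label="raw data for remote reconstruction",
    sync_level="checksum",
    verify_checksum=True,
)
tdata.add_item("/cache/halld/run_012345/", "/scratch/halld/run_012345/",
               recursive=True)

# Submit the transfer and poll until it finishes (or fails).
task_id = tc.submit_transfer(tdata)["task_id"]
tc.task_wait(task_id, timeout=3600, polling_interval=30)
print(tc.get_task(task_id)["status"])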

Q2: Is there a list of new feature requests from the Halls and individual users? Is there a process for prioritizing computing/data/software work to address these feature requests?

There is no separate formal process for, nor formal list of, feature requests. Sometimes users submit a trouble ticket that is in fact a feature request, and Sandy captures those. Sandy meets with the hall points-of-contact for SciComp once or twice a week and also collects requests via that meeting. I join those meetings every couple of months, particularly when we are considering significant changes to the system, and I discuss potential changes and encourage them to make suggestions. We do keep a long "to-do" list, many items of which never get done because newer things are added higher up the priority order. The halls know that we are oversubscribed, and I can think of one person who could probably describe a request that has been pending for two years. Several times a year my group reviews at least the first "page" of that list for near-term prioritization. This accurately reflects the IT Division process for all groups. In MIS, we are piloting processes that will slightly increase the level of formality for major upgrades.

Q3: Are there metrics (e.g., user satisfaction, ticket turnaround times, etc.) that you can share from user feedback?

Simple counts of user support tickets over the past year: 156 in Scientific Computing; 39 for the tape library (MSS). The goal is to respond within 24 hours and resolve within 2 days. Note that not all of these are "trouble"; they also include new projects, quota changes, etc. A small fraction require more interaction with the users and might take days to close due to communication lags. We do not have a dashboard or formal reporting process on performance, although we did in the more formal three-lab LQCD project. USQCD does a three-lab satisfaction survey every year. IT has done a user survey twice, most recently in May 2017. Operations staff generally get good marks; documentation sometimes gets a "B". Overall satisfaction in 2016 was 80% satisfied or very satisfied, 13% neutral, 7% dissatisfied. Satisfaction with the "fair share" system was lowest (15% negative); support for physics software was second lowest (14% negative). In 2017, overall satisfaction (all of IT) rose to 89% positive. Most SciComp services were rated 4 out of 5, with only 2% negative. The survey is available if you wish. In 2019, IT is deploying ServiceNow to replace the homegrown ticketing system. We expect over the next couple of years to develop dashboards for key services.
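Once ServiceNow (or any ticket export) is in place, the turnaround metrics asked about here are straightforward to compute. The sketch below only illustrates that calculation; the CSV column names (opened, first_response, closed) and the file name are hypothetical, not an existing JLab export format.

import csv
from datetime import datetime
from statistics import median

# Minimal sketch: compute median response and resolution turnaround (in hours)
# from a ticket export with ISO 8601 timestamps in the assumed columns.
def turnaround_hours(path):
    response, resolution = [], []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            opened = datetime.fromisoformat(row["opened"])
            if row["first_response"]:
                delta = datetime.fromisoformat(row["first_response"]) - opened
                response.append(delta.total_seconds() / 3600)
            if row["closed"]:
                delta = datetime.fromisoformat(row["closed"]) - opened
                resolution.append(delta.total_seconds() / 3600)
    return median(response), median(resolution)

if __name__ == "__main__":
    resp, res = turnaround_hours("scicomp_tickets.csv")
    # Goals stated above: respond within 24 hours, resolve within 2 days.
    print(f"median response {resp:.1f} h (goal 24 h), "
          f"median resolution {res:.1f} h (goal 48 h)")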

Q4A: Please give us performance information on the current computing facility. E.g., what were significant downtimes of the farm and Lustre for the past year?

Significant outages over the past year:
Nov 20, 4 hours: farm paused for a central computing /home outage
Oct 17-18: sporadic Lustre service due to file-locking issues; parameters adjusted
Aug 28, 4 hours (MSS): four LTO-8 drive installs
Aug 1-2: two Lustre hangs (5 hours overnight, 1 hour daytime) caused by OSS14 hangs; 2014 nodes pulled from production to become read-only
Jul 20 (MSS): gripper failure; IBM same-day repair (8 hours)
Jun 5, 3 hours: outage for Facilities power installs for HPC
May 8-9: 18-hour farm outage for Facilities transformer work for the cooling system; MySQL upgrade
Dec 19: /work outage to add additional disks

Q4B: … How does Lustre perform under high load? Can you give some examples of significant improvements that could be made if additional resources were available?

Our worst problem has been when thousands of jobs are doing lots of small I/O: bandwidth suffers badly. The next worst is lots of jobs each opening lots of files: metadata server performance suffers. Sometimes an OSS hangs (especially 3 of the 2014 servers). If anything suffers, the trouble tends to propagate back to the MDS, and then everything suffers.

Already done: the 2014 OSSes are now read-only (that helped).

Our plans for 2019 will make major improvements in (1) MDS performance (SSDs), (2) overall bandwidth (a 50% increase), and (3) Lustre software maturity (moving from 2.5 to at least 2.10). This $190K investment should be a major improvement.
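For context on the small-I/O problem, the contrast below is a generic illustration only (not JLab code; the record size, chunk size, and output path are made-up values): issuing one write per small record sends the file system many tiny requests, while aggregating records into multi-megabyte buffers sends far fewer, larger ones, which is the pattern Lustre handles much better.

import io

RECORD = b"x" * 512            # a small event record (hypothetical size)
N_RECORDS = 200_000

def write_unbuffered(path):
    # Pathological pattern: one small write per record, many tiny requests.
    with open(path, "wb", buffering=0) as f:
        for _ in range(N_RECORDS):
            f.write(RECORD)

def write_buffered(path, chunk_mb=8):
    # Friendlier pattern: aggregate records into ~8 MB chunks before writing.
    with open(path, "wb", buffering=0) as f:
        buf = io.BytesIO()
        for _ in range(N_RECORDS):
            buf.write(RECORD)
            if buf.tell() >= chunk_mb * 1024 * 1024:
                f.write(buf.getvalue())
                buf.seek(0)
                buf.truncate()
        f.write(buf.getvalue())  # flush the final partial chunk

if __name__ == "__main__":
    # On the farm this would target a Lustre path; "example.dat" is a stand-in.
    write_buffered("example.dat")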

Q8: WHAT DURATION (# OF YEARS) ARE YOU OBLIGATED TO HOLD THE HALLS AND LQCD DATA?

For the halls and JLab/NP LQCD, we hold the data forever. Increasing tape capacity and falling cost per PB make this not excessively expensive. We do, however, encourage "deleting" data sets that are no longer needed so that we can recycle the tapes and/or avoid migration to newer tape generations. For non-JLab LQCD, we hold the data for 18 months past the end of an allocation year. We encourage BNL- and FNAL-affiliated LQCD researchers to work with their "home" (science) institutions to work out similar arrangements.

Q6: WHAT IS THE IMPACT OF NERSC ALLOCATIONS & SCHEDULING ON YOUR PUBLICATION TIMELINE?

CLAS12: Minimal. CLAS12 requires offsite resources for the simulation workload but has headroom in its allocation of local resources for onsite simulation if required. The plan is to use a mix of local, OSG, and NERSC resources; any impact of NERSC allocation or scheduling would be offset by changing the mix to use more OSG.

GlueX: For the next 2: none. For future running, we are already exploring additional supercomputing sites (e.g., IU) that will supplement NERSC and JLab.

Q9: HOW DOES THE NERSC PREFERENCE FOR NON-SIMULATION SUBMISSIONS IMPACT YOUR PLANNING?

A preference from a resource provider for a particular type of work would be covered by changing the mix between NERSC and OSG.

GlueX: GlueX does not plan to run simulation at NERSC for this reason; the current plan calls for simulation to be run on OSG.

CLAS12: The same applies. If NERSC or some other resource preferred reconstruction, we would move some of that workload from local resources to that resource, which would free up local resources for simulation.

Q10: WHAT IS THE SOURCE OF THE 50% INEFFICIENCY (SLIDE 13 OF GRAHAM'S TALK)?

This number came from looking at the efficiency of reconstructing 10% of the Spring 2018 CLAS12 data. The previous bullet on my slide 13 states that CLAS12 needs to be efficient in its use of local resources. We will certainly improve the efficiency, but even if we remain only as efficient as we have been so far, processing all of the 2019 data plus the 2018 backlog will take less than a year. So we are not in a situation where the backlog will grow over time.

Q5 & Q7: JLAB PUBLICATIONS & PhD THESES UNDERWAY

From this year's Operations Review:
PhD theses (experiment & theory): a steady stream of ~30 PhDs per year; total PhD theses based on JLab research is well over 600; ~200 grad students are working on or waiting for PhD data/topics (the 22-week run in FY18 and prospects for an FY19 run helped a lot)
Publications per year, on average (experiment): PRLs/PLs (8); other refereed publications (26); instrumentation papers (8)
Students analyzing 12-GeV data: Halls A & C ~43, Hall B ~13, Hall D ~20
About 10 PhDs already awarded based on 12-GeV-era data (a few in Hall A, a few from Hall B/HPS, 4 from Hall D/GlueX)

Q5 & Q7: FROM 2017 S&T REVIEW

Q5 & Q7: Jefferson Lab User Growth ("Other Users" includes US National Labs and Industry)