Overload of Frontier launchpad by MC Overlay

Overload of Frontier launchpad by MC Overlay
Elizabeth Gallas (Oxford)
ADC Weekly Meeting, April 15, 2014

Overview

- Caveat: these are bits and pieces which I am aware of and which seem relevant to the discussion; there was limited time to collect metrics.
- This is an open discussion! Corrections and additions are welcome!
- Problem: MC Overlay jobs cause Frontier overload on the grid.
- Aspects of the issue:
  - Conditions aspects of MC Overlay jobs
  - Conditions folders of interest
  - Conditions DB & COOL
  - Conditions deployment on the grid (Frontier & DB Releases)
  - MC Overlay task deployment on the grid
  - Reconstruction software
  -> how these aspects, in combination, result in overload

April 2014   E. Gallas

MC Overlay jobs

- Overlay real "zero bias" events on simulated events.
- An exception to the norm w.r.t. conditions: access data from multiple conditions instances:
  - COMP200 (Run 1 real-data conditions)
  - OFLP200 (MC conditions)
  -> This is not thought to contribute to the problem.
- What seems exceptional and notable is the conditions data volume needed by each job to reconstruct these events:
  - It is much greater (10-200x) than the conditions volume of typical reconstruction (estimates vary).
  - It is greater than the event data volume of each job (event volume: a few hundred events? x 1.5 MB).
  -> The metadata is larger than the data itself.

Conditions deployment on the grid

- Two modes of access (direct Oracle access demoted): DB Release files or Frontier.
- MC Overlay can't use just the default DB Release: it doesn't contain real-data conditions. So it is using Frontier.
- Alastair: an unusual aspect of the overload is that it actually brings down Frontier servers: a reboot is required!
  -> Try to understand the cause of the comatosis (more later on this).
- Could we use a DB Release (for some/all conditions)? From Misha:
  - Yes, sure, it's possible to make a DB Release for these data.
  - DB access could also be mixed in any way (Frontier + DB Release).
  - Finding the folder list is the main role of the DBRelease-on-Demand system; it is release- and jobOption-specific.
  - The problem is how to distribute it: before, HOTDISK was used; now it should be CVMFS, which requires a new approach.
  - The DB Release size can't be known without studying it.
- Alastair: CVMFS likes small files, not large ones.

Conditions DB and Athena

- IOVDbSvc gets conditions in a time window wider than the actual request.
- So each conditions retrieval probably contains a bit more data than the job needs.
- This mechanism is generally very effective in reducing subsequent queries in related time windows.
- Unsure if this mechanism is helping here: it depends on whether subsequent zero-bias events in the same job are in the Run/LB range of the retrieved conditions.
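The padded-window behavior described above can be illustrated with a toy model (hypothetical Python; the class, the `pad` parameter, and the backend are illustrative, not the real IOVDbSvc code). It shows why events clustered in one Run/LB range need only one DB query, while events scattered across runs force a fresh query each time:

```python
# Toy sketch (NOT the real Athena IOVDbSvc): fetch a window wider than
# the requested IOV so that nearby lookups hit the in-memory cache
# instead of issuing new database queries.

class CachedFolder:
    def __init__(self, fetch, pad=600):
        self.fetch = fetch     # fetch(start, stop) -> [(iov_start, iov_stop, payload)]
        self.pad = pad         # extra width added on each side of a request
        self.window = None     # (start, stop) currently cached
        self.rows = []
        self.db_queries = 0

    def get(self, t):
        # Re-query only when t falls outside the cached window.
        if self.window is None or not (self.window[0] <= t < self.window[1]):
            self.window = (t - self.pad, t + self.pad)
            self.rows = self.fetch(*self.window)
            self.db_queries += 1
        for start, stop, payload in self.rows:
            if start <= t < stop:
                return payload
        return None

# A toy backend: one payload per 10-unit IOV.
def backend(start, stop):
    lo, hi = int(start // 10), int(stop // 10) + 1
    return [(i * 10, (i + 1) * 10, f"cond-{i}") for i in range(lo, hi)]

clustered = CachedFolder(backend, pad=600)
for t in range(0, 500, 25):        # events close together in "time"
    clustered.get(t)
print(clustered.db_queries)        # -> 1 query serves all events

scattered = CachedFolder(backend, pad=600)
for t in range(0, 50_000, 5_000):  # events scattered beyond the window
    scattered.get(t)
print(scattered.db_queries)        # -> 10 separate queries
```

This is the open question on the slide: whether zero-bias events within one overlay job land inside the retrieved window (the first case) or scatter across runs (the second).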

MC Overlay task deployment

- Assumptions about how these tasks are deployed:
  - Related jobs (same MC process id) are deployed to specific sites (or clouds), and each requires a unique set of zero-bias events over all the jobs.
- Each of the "related" jobs:
  - is in a cloud using the same Squids and/or Frontier;
  - accesses the conditions needed for the zero-bias events being overlaid.
- The conditions being accessed are always distinct: this completely undermines any benefit of Frontier caching (queries are always unique).
- Multiply this by the hundreds/thousands of jobs in the task, each retrieving distinct conditions -> obvious stress on the system.
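The caching argument can be made concrete with a toy hit-rate model (illustrative Python, not a real Squid; the job and query counts are made up). When jobs in a task share query keys, almost everything after the first job is a cache hit; when every job queries a distinct IOV, every request falls through to Oracle:

```python
# Toy model of Frontier/Squid caching: a cache only helps when
# different requests repeat the same query key.

def cache_hits(requests):
    seen, hits = set(), 0
    for key in requests:
        if key in seen:
            hits += 1
        seen.add(key)
    return hits

n_jobs, queries_per_job = 1000, 50

# Typical reco: jobs in a task cover the same run/LB range -> shared keys.
shared = [("folder", q) for _ in range(n_jobs) for q in range(queries_per_job)]
# MC Overlay: every job asks for conditions for its own zero-bias events
# -> every key is unique.
unique = [("folder", j, q) for j in range(n_jobs) for q in range(queries_per_job)]

print(cache_hits(shared) / len(shared))   # -> 0.999 (nearly all served from cache)
print(cache_hits(unique) / len(unique))   # -> 0.0 (every request reaches the backend)
```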

Query evaluation

- Folder of interest, identified via Frontier logs: ATLAS_COOLONL_TDAQ.COMP200_F0063_IOVS
  - IOV: 1351385600000000000-1351390400000000000
- Evaluate this specific query:
  - Run/LB range: run 213486, LB 612-700 (part of run 213486)
  - Folder: COOLONL_TDAQ/COMP200 /TDAQ/OLC/BUNCHLUMIS
  - IOV basis: TIME (not Run/LB)
  - Channel count: 4069 channels retrieved (generally less; depends on the IOV)
  - Payload: RunLB (UInt63), AverageRawInstLum (Float), BunchRawInstLum (Blob64k) -> LOB!! Large Object!!, Valid (UInt32)
- The query retrieves 2583 rows, each including LOBs.
- The number of rows >> the number of LBs (~80): this is the nature of the folder being used, bunch-wise luminosity!
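A back-of-envelope estimate of this query's size, using the numbers from the slide. The per-row blob size is an assumption: the Blob64k column type only gives an upper bound, and actual BunchRawInstLum sizes vary:

```python
# Rough size estimate of the BUNCHLUMIS query seen in the Frontier logs.
# Row and LB counts come from the slide; 64 KiB/row is a worst-case
# assumption from the Blob64k column type, not a measurement.

rows        = 2583        # rows returned, each carrying a LOB payload
lumi_blocks = 80          # approximate LBs covered by the IOV range
blob_bytes  = 64 * 1024   # upper bound per BunchRawInstLum blob

payload_mb = rows * blob_bytes / 1e6
print(f"rows per LB: {rows / lumi_blocks:.0f}")            # -> rows per LB: 32
print(f"worst-case payload: {payload_mb:.0f} MB per query")  # -> 169 MB
```

Even if real blobs are well below the 64 KiB bound, thousands of such queries per task add up quickly, which is the "stress on the system" point from the previous slide.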

… more about LOBs …

- The folder accessed has a LOB (Large Object) payload.
- Back to COOL (and via Frontier): LOB access from COOL is not the same as access to other payload column types:
  - There is some more back-and-forth communication between the client (Frontier) and Oracle.
  - Rows are retrieved individually.
- Always a question: can LOB access be improved?
- Also: is there something about Frontier and LOBs that might cause the Frontier failure?
  - It doesn't happen with single jobs; it only seems to occur when loaded above a certain level.
  - No individual query in these jobs results in data throughput beyond the system capacity; it is somehow triggered by load.
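The cost of row-by-row retrieval can be sketched with a simple latency model (illustrative Python; the round-trip time and batch size are invented numbers, and the "one round trip per LOB row" behavior is the assumption the slide raises, not a measurement):

```python
import math

# Toy latency model: inline columns come back in bulk array fetches,
# while LOB columns are assumed to cost extra round trips per row.

def fetch_time_ms(rows, rtt_ms, per_row_roundtrips, batch=1000):
    if per_row_roundtrips:
        # One (or more) client<->server round trip per row.
        return rows * per_row_roundtrips * rtt_ms
    # Bulk array fetch: one round trip per batch of rows.
    return math.ceil(rows / batch) * rtt_ms

rows, rtt = 2583, 2.0   # row count from the slide; 2 ms RTT is made up
print(fetch_time_ms(rows, rtt, per_row_roundtrips=0))  # inline payload: -> 6.0
print(fetch_time_ms(rows, rtt, per_row_roundtrips=1))  # LOB payload: -> 5166.0
```

Under this model the per-row round trips, not raw throughput, dominate the LOB query time, which would fit the observation that the failure is triggered by load rather than by any single oversized query.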

No system has infinite capacity

- General ATLAS Database Domain goal: develop and deploy systems which can deliver any data in databases needed by jobs, even large volumes when needed.
- In reality: capacity, bandwidth, etc. are not infinite. So consider ways to moderate requests while still satisfying the use cases.
- In this case, bunch-wise luminosity is being retrieved:
  - More channels are being retrieved than are being used (an inefficiency in the COOL callback mechanism).
  - Improvements to the folders are already planned for Run 2.
- Thanks to Eric and Mika (lumi experts) and Andy (MC expert) for critical feedback.
- I asked in email off-thread (answers on the next slide): is bunch-wise luminosity really needed?

Is bunch-wise lumi needed?

- Andy: "… not doing anything special for overlay for lumi info … running standard reco … must be the default for standard reco of data as well … What's unusual for overlay is that each event can be from a different LB, whereas for data the events are mostly from the same LB within a job."
- Eric: "Yes of course, and this could trigger the IOVDbSvc to constantly reload this information for every event from COOL. … per-BCID luminosity … used by LAr as a part of standard reco since (early) 2012 … used to predict the LAr noise as a function of position in the bunch train from out-of-time pileup. I don't know exactly what happens in the overlay job, but presumably it also accesses this information to find the right mix of events."

Attempt at a summary

- Aspects of conditions implementation, usage, and deployment all seem to conspire: there is no one smoking gun.
- DB caching mechanisms: completely undermined by this pattern of access.
- Software: using default reconstruction for luminosity.
  - Bunch-wise corrections are needed for real-data reco (LAr); there are NO problems with this in deployment, and it should not change!
  - But is the default overkill for zero-bias overlay? Would the BCID-averaged luminosity suffice (using a different folder)? That would eliminate the need for LOB access in this use case.
- Conditions DB/COOL and Frontier:
  - COOL side: no obvious culprit; BLOB sizes vary.
  - Frontier: evaluate the cause of failure with LOBs under high load.
- Task deployment: the DB Release option? Any other ideas?
- Please be patient:
  - We must find the best overall long-term solution for this case, without undermining software which is critical for other use cases.
  - Use this case for studying the bottlenecks.