OEP infrastructure issues
Gregory Dubois-Felsmann
Trigger & Online Workshop
Caltech, 2 December 2004

Obligatory caveat
I'm available for advice and to provide continuity, but…
I won't be able to undertake any more non-trivial, non-emergency OEP development.

What OEP is
A conceptual unit of the online system
 – The framework for processing all complete-event data in the online system
An implementation; a set of code that:
 – Defines and navigates the raw event structure (both the data from ODF and the persistence of data from Level 3)
 – Makes this data available to applications in the standard BaBar Framework:
    Level 3
    Fast Monitoring
    Other monitoring applications: event displays, beam spot monitoring, etc.
 – Provides distributed histogramming services
 – Controls the lifetimes of all the processes that do this work
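
As a rough illustration of the last point, process lifetime control, a minimal supervision loop might look like the sketch below. This is hypothetical, using plain POSIX fork/exec/wait; the real OEP process-control tools (OepDaemon/OepManager, discussed later) are more elaborate, and the program names and arguments here are invented. A real tool would also rate-limit restarts and handle orderly shutdown.

    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstdio>
    #include <map>
    #include <string>
    #include <vector>

    // Start one child process running the given command line.
    static pid_t spawn(const std::vector<std::string>& argv) {
        pid_t pid = fork();
        if (pid == 0) {                        // child: replace image with the target program
            std::vector<char*> args;
            for (const std::string& a : argv)
                args.push_back(const_cast<char*>(a.c_str()));
            args.push_back(nullptr);
            execvp(args[0], args.data());
            _exit(127);                        // only reached if exec failed
        }
        return pid;                            // parent: remember the child's pid
    }

    int main() {
        // Hypothetical per-node process list: one Level 3 instance, one Fast Monitoring job.
        std::vector<std::vector<std::string>> cmds = {
            {"./L3Filter", "--node", "farm01"},
            {"./FastMon",  "--node", "farm01"}
        };
        std::map<pid_t, std::vector<std::string>> children;
        for (const auto& cmd : cmds) children[spawn(cmd)] = cmd;

        for (;;) {                             // supervision loop
            int status = 0;
            pid_t done = wait(&status);        // block until some child exits
            if (done < 0) break;               // no children left to wait for
            auto it = children.find(done);
            if (it == children.end()) continue;
            std::fprintf(stderr, "pid %d exited (status %d), restarting\n",
                         static_cast<int>(done), status);
            std::vector<std::string> cmd = it->second;
            children.erase(it);
            children[spawn(cmd)] = cmd;        // restart the process
        }
        return 0;
    }

The essential property is that the supervisor learns synchronously whenever one of its processes exits and can restart it, which is what lets extra Fast Monitoring processes be added without manual babysitting.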

Current status
(Up-to-date performance metrics are not available because we, unexpectedly, are not running.)
The conceptual design and the event data format have turned out to work well, and I don't see a need to revise them.
The performance of the system as implemented has been very satisfactory for several years.
 – On the old Solaris farm we were CPU-constrained, but that time was dominated by the performance of the Level 3 algorithms themselves.
 – On the Linux farm we have had lots of headroom even at 1 L3/node…
 – Until quite recently: Rainer reports that since we started running Fast Monitoring on Linux (i.e., faster) and running the second monitoring farm instance for beam spot measurement, the trickle stream service has become CPU-intensive.

There have been some upgrades
Several iterations of improvement in process lifetime control tools (OepDaemon/OepManager – many thanks to Jim H.)…
 … which enabled running more Fast Monitoring processes and additional sets of them
Rewrite from scratch of the low-level DHP infrastructure, plus much fine-tuning
Improvements in logging performance (see Jim's talk)

There is more that can be done
Framework overhead, and interface-to-Framework overhead
 – This was typically found to be about 25% in the old Solaris days
 – Can address several things:
    Framework overhead – Level 3 runs a large number of modules, so this can add up
     – There may be some effort invested in this, motivated by speeding up the physics executables, which have enormous numbers of modules
    Interface-to-Framework overhead – there is some unnecessary copying of data that could be eliminated by trickier coding (sketched below) – probably a trivial benefit
    Event navigation overhead – probably a 10% speedup in Level 3 from the long-planned "fast module scanning" project
     – This is a fairly straightforward, non-multi-threaded programming problem and doesn't need anything other than a good C++ programmer
One related project, for the record
 – Making input modules work for non-event data
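
To make the "unnecessary copying" item concrete: one generic way such a copy arises and is removed is the difference between handing a module its own copy of the event buffer and handing it a read-only reference to a buffer that already exists. The type and function names below are invented for illustration; they are not the real OEP or Framework interfaces.

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    struct RawEvent {                        // stand-in for one raw event buffer
        std::vector<std::uint32_t> words;
    };

    // Costly pattern: passing the event by value copies the whole buffer on every call.
    std::size_t sizeByValue(RawEvent evt) { return evt.words.size(); }

    // Cheaper pattern: hand modules a read-only reference to the buffer the
    // event builder already owns, so no per-call copy is made.
    std::size_t sizeByRef(const RawEvent& evt) { return evt.words.size(); }

    int main() {
        RawEvent evt;
        evt.words.assign(100000, 0u);        // roughly 400 kB of event data
        std::printf("%zu %zu\n", sizeByValue(evt), sizeByRef(evt));
        return 0;
    }

Because Level 3 runs many modules per event, even a cheap per-module copy multiplies up, but as the slide notes the expected benefit is probably trivial.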

Still more that can be done
CPU utilization
 – We have two CPUs on each farm node
 – The load from (ODF event level + OEP framework + Level 3 code) is concentrated in a single thread that runs the Level 3 algorithms
 – Could run two parallel streams of Level 3 processing (sketched below)
 – Requires a (much) more sophisticated version of the interface-to-Framework OEP code
 – This was in the original design but was sacrificed to 1999-era schedule triage; the need hasn't been acute enough since then (it only became relevant after the Linux upgrade)
 – This is a straightforward design but needs to be implemented by someone with a good understanding of multiprocessing
 – There are some technical questions about DHP and logging, basically: are the multiple L3 instances to be treated as independent sources, or will they be re-aggregated per node?
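
A minimal sketch of the "two parallel streams" idea, purely for illustration: one dispatcher feeds a shared queue, and two worker threads each run an independent Level 3 stream, one per CPU. The Event type is invented and std::thread is used for brevity; the real design would live in the interface-to-Framework OEP code and would also have to answer the DHP/logging question above.

    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <optional>
    #include <queue>
    #include <thread>

    struct Event { int id; };

    class EventQueue {
    public:
        void push(const Event& e) {
            { std::lock_guard<std::mutex> lk(m_); q_.push(e); }
            cv_.notify_one();
        }
        std::optional<Event> pop() {         // empty optional means "shut down and drained"
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [this] { return !q_.empty() || done_; });
            if (q_.empty()) return std::nullopt;
            Event e = q_.front(); q_.pop();
            return e;
        }
        void shutdown() {
            { std::lock_guard<std::mutex> lk(m_); done_ = true; }
            cv_.notify_all();
        }
    private:
        std::queue<Event> q_;
        std::mutex m_;
        std::condition_variable cv_;
        bool done_ = false;
    };

    // Each worker is an independent "Level 3 stream"; two of them keep both CPUs busy.
    void level3Worker(int instance, EventQueue& q) {
        while (std::optional<Event> e = q.pop()) {
            std::printf("L3 instance %d processed event %d\n", instance, e->id);
        }
    }

    int main() {
        EventQueue q;
        std::thread w1(level3Worker, 1, std::ref(q));
        std::thread w2(level3Worker, 2, std::ref(q));
        for (int i = 0; i < 8; ++i) q.push(Event{i});   // events arriving from the event builder
        q.shutdown();                                    // no more events; let workers drain and exit
        w1.join();
        w2.join();
        return 0;
    }

Whether each worker then reports histograms and log messages as an independent source, or the two are merged per node, is exactly the open DHP/logging question on the slide.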

Yet more that can be done
Trickle stream
 – The Fast Monitoring architecture depends on transferring events over the network from the Level 3 processes, on a sampling basis, to other machines running the monitoring code (sketched below)
 – Apparently the server side of the existing system is expensive
 – The long-pending "advanced trickle stream" is being commissioned now. It shares no code with the old protocol, so we'll have to re-measure this
 – It doesn't seem likely to be an intrinsic problem – we receive a higher volume of data on the network from the event builder, very inexpensively
 – The more sophisticated event distribution system mentioned above would be able to take this load out of the Level 3 process
 – But one could consider a model in which (some) Fast Monitoring code runs on the same machines that run Level 3
    There are concerns about further eroding the "deadtime firewall"
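
For concreteness, the sampling idea behind the trickle stream can be sketched as follows: Level 3 prescales its accepted events and ships every Nth one to a monitoring node with a non-blocking send, dropping the sample rather than stalling if the network buffer is full, so that monitoring cannot add deadtime. This is a hypothetical illustration only; it shares no code or protocol with either the old or the new BaBar trickle stream, and the host and port are invented.

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <unistd.h>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    int connectToMonitor(const char* host, uint16_t port) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        sockaddr_in addr{};
        addr.sin_family = AF_INET;
        addr.sin_port = htons(port);
        inet_pton(AF_INET, host, &addr.sin_addr);
        if (fd < 0 || connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0) {
            if (fd >= 0) close(fd);
            return -1;
        }
        return fd;
    }

    // Called once per accepted event; only every 'prescale'-th event is shipped.
    void trickle(int fd, const std::vector<uint32_t>& event, int prescale) {
        static long counter = 0;
        if (fd < 0 || ++counter % prescale != 0) return;
        ssize_t n = send(fd, event.data(), event.size() * sizeof(uint32_t),
                         MSG_DONTWAIT);               // never block the Level 3 process
        if (n < 0) std::fprintf(stderr, "sample dropped (send would block or failed)\n");
    }

    int main() {
        int fd = connectToMonitor("127.0.0.1", 9930); // hypothetical monitoring host/port
        std::vector<uint32_t> event(1000, 0u);        // stand-in event payload
        for (int i = 0; i < 100; ++i) trickle(fd, event, 10);  // ship roughly 1 event in 10
        if (fd >= 0) close(fd);
        return 0;
    }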

Scaling
We run on 30 nodes now. We know we can run on 60 (from experience in the Sun era).
We don't quite understand the implications of running two (or more) instances of Level 3 per node for the scaling of DHP and logging.
So the scaling of a (more nodes) × (more processes/node) system is not fully understood.
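
A rough illustration of why this matters (simple arithmetic, not from the slides): 30 nodes with one Level 3 instance each means 30 sources feeding DHP and logging today; 60 nodes with two instances each would mean 120 sources if every instance reports independently (a factor of four), but only 60 if the instances are re-aggregated per node, which is exactly the open question from the previous slide.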

Conclusions
We will probably need to use one or more of these tools in order to get to 2007.
The development work will require someone with a solid understanding of C++ and multiprocessing.