Comments on #3: “Motivation for Regional Analysis Centers and Use Cases” Chip Brock 3/13/2.

what’s the problem?
- well, no *problem*, really
- an offsite architecture has lots of components and correlations, and it’s complicated: it has to mesh with the existing SAM, GRID, and database group efforts
  Comment: This is true whether you have RACs or not, if users are to be able to use their own resources at their remote sites. RACs will make it easier and less dependent on network performance.
- how do we design it? how do we explain it?
- one way to approach its design is from a systems perspective
- another way is from specifically how it will be used in practice; that’s what I’m trying to do
- I suspect that through illustrations - what I’ve been calling “stories” - we can head off questions and help to focus ourselves
- it seems a luxury to have this planning opportunity now!
  Comment: I totally disagree. In order for us to meaningfully utilize something like the RACs and the Grid and expedite Run IIb results before the LHC experiments start pumping out theirs, we’ve got to have the system in place, at the latest, by the end of Run IIa. It would have been more optimal if this effort had begun a couple of years earlier.

what’s a story?
- pick a small set of measurements/tasks
- imagine what each step of such a measurement would be
- do it in the context of the data formats that we have available
- try to design the offsite architecture to accommodate these sorts of tasks
- do it in the context of the likely kinds of offsite institutions
- I picked three imagined tasks (yesterday); each exercises a different part of the data tier:
  - measurement of the W cross section
  - establishing a jet cone energy algorithm
  - determining the em scale corrections
  Comment: I think there is one more case: jet cross section measurements.

As I understand it, these are the data formats:
- ROOT tuples
- thumbnail (TMB): 5 kB; clusters, named trigger, muon hits, jets, taus, vertices
- DST: 50 kB, resident on disk; standard RECO output: hits, clusters, global tracks, non-tracking raw data; contains the TMB
- DBG: 500 kB, created on demand; trigger, cal-corrected hits, clusters, vertices, physics objects
  Comment: I am sorry, but what does DBG stand for?
- RAW: 250 kB, on tape
(a rough per-tier volume estimate is sketched below)
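To put the per-event sizes above in perspective, here is a minimal back-of-the-envelope sketch in Python. The per-event sizes come from the list above; the event count is purely an illustrative assumption and not a number from the slides.

```python
# Rough storage estimate per data tier, using the per-event sizes quoted in
# the list above. The event count is an illustrative assumption only.

SIZES_KB = {
    "thumbnail": 5,    # ROOT-readable summary per event
    "DST": 50,         # standard RECO output, disk-resident
    "DBG": 500,        # created on demand
    "RAW": 250,        # raw data, kept on tape
}

def tier_volume_tb(n_events: int, size_kb: float) -> float:
    """Total volume of one tier in terabytes (1 TB = 1e12 bytes)."""
    return n_events * size_kb * 1.0e3 / 1.0e12

if __name__ == "__main__":
    n_events = 1_000_000_000  # hypothetical 10^9 events, not a figure from the slides
    for tier, kb in SIZES_KB.items():
        print(f"{tier:9s}: {tier_volume_tb(n_events, kb):7.1f} TB")
```

At that kind of scale only the thumbnail, and perhaps a skimmed DST stream, looks plausibly disk-resident at an RAC, while RAW stays on tape, which is the trade-off the later slides keep returning to.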

here’s what you can maybe do:
Comment: I have comments on tasks for each data tier on your spreadsheet. Knowing the capabilities of each data tier would determine which tiers we must keep at the RACs.
- this is all very sketchy and stream-of-consciousness now… just to give an idea of what I think would be worth putting down

institutions:
- I can imagine institutions of generically the following sorts, where tasks requiring the following data formats might be doable:
  - em scale calibration
  - jet cone algorithm
  - W cross section
  Comment: As I understand your logic, in my picture a US-A group is a generic, small US group with very little storage and computing resources, while B1 is a somewhat larger institution that has more than A but not as much as C. In this picture you are already expanding the idea into a truly gridified network, because one can imagine accessing data from a neighbouring institution’s desktop machine, because it holds the data you want, even if that machine is not as large as an RAC.

so, the telling of the story: what’s involved in a particular measurement, e.g. the W cross section? nominally:
- acquire a data set of measured “W’s”
  Comment: This is one of the cases that would require a specialized data stream, which might be small enough to be kept at your US-B1 institutions. It looks like it would be about 3 TB at the end of Run II just for the W→eν samples, assuming no prescale.
- determine/apply corrections
- determine/calculate the background
- subtract and count the corrected W’s
- normalize by the live luminosity
(the counting-experiment formula behind these steps is written out just below)
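Spelled out, the nominal recipe above is just the standard counting-experiment expression (a generic textbook formula, not something taken from the slides):

```latex
\sigma_W \cdot B(W \to e\nu)
  = \frac{N_{\mathrm{obs}} - N_{\mathrm{bkg}}}{\epsilon \cdot \int \! L \, dt}
```

where N_obs is the number of selected W candidates, N_bkg the estimated background, ε the combined corrections and efficiencies, and ∫L dt the live integrated luminosity; each factor maps onto one of the bullets above.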

in practice:
1. prepare the dataset, sitting at your ROOT-capable remote desk:
- prepare the dataset how? stream DSTs? to an RAC? to home? at FNAL?
  Comment: In my scheme this depends heavily on which stage of the analysis the user is at. If the analysis is at a baby stage, this person should not be allowed to acquire the full data set, beyond what is available to him through the thumbnail at his RAC. A group led by Meena is supposed to work on the policies for approving full data set access. When the particular analysis is ready and approved following the determined policy, one should expect to acquire data throughout the network; ultimately the job will have to be packaged and sent to where the data reside. Whether this process produces reduced ROOT tuples or summary histograms, one should expect a considerably reduced-size summary to be sent back to one’s own location for access there. (A hypothetical sketch of such a policy check follows this slide.)
- extract them into private ROOT tuples: at an RAC? at FNAL?
  Comment: As answered above, it is inconceivable to have the full data set stored at one location, especially if one considers the background samples. Therefore the natural place for the user to get the data from is the “D0Grid Network”, not one specific site.
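As an illustration of the staged access policy described in the comment above, here is a hypothetical sketch in Python. The class, the stage names, and the routing decisions are invented for illustration; none of this corresponds to an existing DØ or SAM interface.

```python
# Hypothetical sketch of the access policy described above: early-stage
# ("baby") analyses stay on the thumbnail tier at their home RAC; only
# approved analyses get packaged jobs sent across the full D0Grid network.
# All names and stages here are illustrative, not an existing DØ interface.

from dataclasses import dataclass

@dataclass
class AnalysisRequest:
    user: str
    home_rac: str
    stage: str        # "baby" or "approved"
    tiers: list[str]  # data tiers the user asks for, e.g. ["thumbnail", "DST"]

def route_request(req: AnalysisRequest) -> str:
    """Decide where a request may run and which tiers it may touch."""
    if req.stage != "approved":
        # Early analyses are restricted to the thumbnail at the home RAC.
        return f"run interactively at {req.home_rac} on thumbnail only"
    # Approved analyses are packaged and shipped to wherever the data reside;
    # only a reduced summary (tuples/histograms) comes back to the user.
    return "package job and submit to the D0Grid network; return summary output"

print(route_request(AnalysisRequest("user1", "RAC-B1", "baby", ["DST"])))
print(route_request(AnalysisRequest("user1", "RAC-B1", "approved", ["DST"])))
```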

2. make cuts:
- do cuts on ROOT tuples?
- propagate the cut-list info back for any db/DST information? how? SAM? a file list?
- MC analyzed where? at a third site?
  Comment: This is part of the packaged job’s functionality. All interactive analysis should be limited to the smaller data sets that belong to the user’s RAC. Once the cuts and the analysis are finalized, one packages the analysis into a ROOT-tuple macro, makes up a job, and puts the job request into the Grid network job distributor. The job distributor must know where the necessary data reside and what the computing capacity of that particular location is before sending a job to the site. The outputs from these smaller “staged” jobs should be transported to the requester’s local machine for further examination while concatenation of the output is still in progress. This also means that either the RAC that holds a popular data stream must sit at an optimal position on the network, or these specific, popular data sets must be duplicated and stored across the RACs. (A hypothetical sketch of such a distributor follows this slide.)
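To make the distributor logic in the comment concrete, here is a minimal hypothetical sketch. The replica catalogue, the capacity numbers, and the function names are all invented for illustration and do not correspond to SAM or to any existing D0Grid component.

```python
# Hypothetical job-distributor sketch: pick sites that hold the requested
# dataset, weight them by spare capacity, and fan the job out as staged
# sub-jobs whose partial outputs stream back to the requester as they finish.
# Catalogue contents and capacities are invented for illustration only.

from typing import Iterator

# which RACs hold which data streams (illustrative)
REPLICA_CATALOGUE = {
    "W_enu_DST":   ["FNAL", "RAC-Europe"],
    "QCD_bkg_TMB": ["FNAL", "RAC-Europe", "RAC-US-B1"],
}

# free CPU slots per site (illustrative)
CAPACITY = {"FNAL": 40, "RAC-Europe": 25, "RAC-US-B1": 10}

def distribute(dataset: str, n_subjobs: int) -> Iterator[tuple[str, int]]:
    """Yield (site, number of sub-jobs) pairs, proportional to spare capacity."""
    sites = REPLICA_CATALOGUE.get(dataset, [])
    if not sites:
        raise LookupError(f"no RAC holds dataset {dataset!r}")
    total = sum(CAPACITY[s] for s in sites)
    assigned = 0
    for i, site in enumerate(sites):
        share = (n_subjobs * CAPACITY[site]) // total if i < len(sites) - 1 \
                else n_subjobs - assigned   # last site takes the remainder
        assigned += share
        if share:
            yield site, share

for site, n in distribute("W_enu_DST", 20):
    print(f"send {n} staged sub-jobs to {site}; stream partial output back")
```

The staged fan-out mirrors the comment: sub-jobs go where the data already sit, weighted by spare capacity, and partial outputs can be shipped back while the rest are still running.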

3. measure/calculate backgrounds:
- where? at ROOT level? prepared where? same as steps 1-2?
  Comment: It should be the user’s choice whether the background analysis is done at ROOT level or goes back to the DST or a heavier data tier. However, the access policy is as I prescribed for the signal case two slides back. As far as data access is concerned, there is no significant difference between signal and background-enriched samples, though it will vary depending on the popularity and size of the desired data set.
4. luminosity determined where, and how?
- event list / run list prepared from step 1, sent back as a query to the database.. where? FNAL? an RAC?
- flat file, cache, or subset db prepared and sent where? to an RAC?
  Comment: As I laid out before, if the full-statistics analysis requires jobs to be packaged and sent to where the desired data reside, then access to the relevant database information must also be available at the RACs. It seems to me the optimal solution would be for the RACs to hold some sort of replication of all the necessary database information. That would also serve the purpose of re-reconstruction at the RACs. (A hypothetical sketch of such a run-list luminosity lookup follows this slide.)
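To illustrate the run-list luminosity query against a database replica held at an RAC, here is a minimal hypothetical sketch using an in-memory SQLite table. The schema, run numbers, and luminosity values are invented and do not reflect the actual DØ luminosity database.

```python
# Hypothetical sketch: sum the live luminosity for a run list against a
# locally replicated run/luminosity database at an RAC. The schema and the
# numbers are invented; this is not the real DØ luminosity database layout.

import sqlite3

def build_replica() -> sqlite3.Connection:
    """Create a tiny in-memory stand-in for a replicated luminosity db."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE lumi (run INTEGER PRIMARY KEY, live_nb REAL)")
    con.executemany("INSERT INTO lumi VALUES (?, ?)",
                    [(160582, 12.4), (160583, 8.7), (160590, 15.1)])
    return con

def live_luminosity(con: sqlite3.Connection, run_list: list[int]) -> float:
    """Return the total live luminosity (nb^-1) for the runs in run_list."""
    marks = ",".join("?" * len(run_list))
    (total,) = con.execute(
        f"SELECT COALESCE(SUM(live_nb), 0) FROM lumi WHERE run IN ({marks})",
        run_list).fetchone()
    return total

con = build_replica()
print(live_luminosity(con, [160582, 160590]))   # 27.5 nb^-1 in this toy example
```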

the goal of this exercise would be:
- describe the ideal procedure in terms of real projects, imagining the real flow of requests, data movement, etc.
- if we can do this, then we know how to design the system
  Comment: This is exactly why we need use cases. They would certainly define what kinds of services must be provided by the RACs and which data tiers must be kept at the RACs. However, as I said before, this does more for the D0Grid architecture than for how the hardware and the RACs have to be configured.
- and we know how to explain it