Slide 1: Comments on #3: “Motivation for Regional Analysis Centers and Use Cases”
Chip Brock, 3/13/2
Slide 2: what’s the problem?
well, no *problem*, really
- an offsite architecture has lots of components and correlations, and it’s complicated: it has to mesh with the existing SAM, Grid, and database group efforts
  Comment: This is true whether you have RACs or not, if users are to use their own resources at their remote sites. RACs will make that easier and less dependent on network performance.
- how do we design it? how do we explain it?
- one way to approach its design is from a systems perspective
- another way is from how it will specifically be used in practice; that’s what I’m trying to do
- I suspect that through illustrations (what I’ve been calling “stories”) we can head off questions and help focus ourselves
- seems a luxury to have this planning opportunity now!
  Comment: I totally disagree. For us to meaningfully utilize something like the RACs and the Grid, and to expedite Run IIb results before the LHC experiments start pumping out theirs, we must have the system in place by the end of Run IIa at the latest. It would have been better still if this effort had begun a couple of years ago.
Slide 3: what’s a story?
- pick a small set of measurements/tasks
- imagine what each step of such a measurement would be
- do it in the context of the data formats that we have available
- try to design the offsite architecture to accommodate these sorts of tasks
- do it in the context of the likely kinds of offsite institutions
- I picked three imagined tasks (yesterday); each exercises a different part of the data tier:
  - measurement of the W cross section
  - establishing a jet cone energy algorithm
  - determining the EM scale corrections
Comment: I think there is one more case: jet cross section measurements.
Slide 4: As I understand it, these are the data formats:
- ROOT tuples / thumbnail (TMB): 5 kB; clusters, named trigger, muon hits, jets, taus, vertices
- DST: 50 kB; resident on disk; standard RECO output: hits, clusters, global tracks, non-tracking raw data; contains the TMB
- DBG: 500 kB; created on demand; trigger, cal corrected hits, clusters, vertices, physics objects
  Comment: I am sorry, but what does DBG stand for?
- RAW: 250 kB; on tape
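As a rough illustration, the per-event sizes above translate directly into storage requirements per tier. A minimal sketch; the 60M-event sample size below is an assumption chosen for illustration, not a number taken from this slide:

```python
# Per-event sizes (kB) as quoted on the slide.
TIER_KB = {"TMB": 5, "DST": 50, "DBG": 500, "RAW": 250}

def tier_size_tb(tier: str, n_events: int) -> float:
    """Total sample size in TB (decimal units: 1 TB = 1e9 kB) for n_events."""
    return TIER_KB[tier] * n_events / 1e9

N_EVENTS = 60_000_000  # hypothetical sample size, for illustration only
for tier in TIER_KB:
    print(f"{tier}: {tier_size_tb(tier, N_EVENTS):.2f} TB")
```

This makes the tier trade-off concrete: at these sizes a DBG-level copy of a sample is 100 times larger than its thumbnail copy, which is the kind of arithmetic that decides what a given institution can host.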
Slide 5: here’s what you can maybe do:
Comment: I have comments on the tasks for each data tier in your spreadsheet. Knowing the capability of each data tier would determine which tiers we must keep at the RACs.
this is all very sketchy and stream-of-consciousness for now… just to give an idea of what I think would be worth putting down
Slide 6: institutions
I can imagine institutions of generically the following sorts, where tasks requiring the following data formats might be doable.
Comment: As I understand the logic of your picture, “US-A” is a generic, small US group with very little storage and computing resource, while “US-B1” is a somewhat larger institution with more than A but not as much as C. In this picture you are already expanding the idea into a truly gridified network, because one can imagine accessing data from a neighbouring institution’s desktop machine when it holds the data you want, even if that institution is not as large as an RAC.
[Figure: tasks mapped to institution types: EM scale calibration, jet cone algorithm, W cross section]
Slide 7: so, the telling of the story
what’s involved in a particular measurement, e.g. the W cross section? nominally:
- acquire a data set of measured “W’s”
  Comment: This is one of the cases that would require a specialized data stream, which might be small enough to be kept at your US-B1 institutions. It looks like it would be about 3 TB at the end of Run II just for the W→eν samples, assuming no prescale.
- determine/apply corrections
- determine/calculate background
- subtract and count corrected W’s
- multiply by live luminosity
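The counting steps above correspond to the standard counting-experiment formula, sigma = (N_obs - N_bkg) / (efficiency * integrated luminosity). A minimal sketch; every number below is invented for illustration and is not a D0 result:

```python
def w_cross_section(n_obs: float, n_bkg: float,
                    efficiency: float, lumi_inv_pb: float) -> float:
    """Counting-experiment cross section in pb:
    subtract background, correct for efficiency, divide by luminosity."""
    return (n_obs - n_bkg) / (efficiency * lumi_inv_pb)

# Illustrative (invented) inputs: 12000 candidates, 1500 estimated
# background events, 20% total efficiency, 20 pb^-1 of live luminosity.
sigma = w_cross_section(n_obs=12000, n_bkg=1500,
                        efficiency=0.20, lumi_inv_pb=20.0)
print(f"sigma(W) = {sigma:.0f} pb")
```

Each input arrives from a different part of the story: the candidate count from the streamed data set, the background from the samples discussed on slide 10, and the luminosity from the database queries also discussed there, which is exactly why the data-access question touches every tier.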
Slide 8: in practice: prepare dataset
sitting at your ROOT-capable remote desk: prepare the dataset. how? stream DSTs? to an RAC? to home? at FNAL?
Comment: In my scheme this depends heavily on what stage of the analysis the user is at. If the analysis is in its infancy, the user should not be allowed to acquire the full data set beyond what is available through the thumbnails at his RAC. A group led by Meena is supposed to work on the policies for approving full data-set access. Once the analysis is mature and approved under those policies, one should expect to acquire data throughout the network; ultimately the job will have to be packaged and sent to where the data reside. Whether that process produces reduced ROOT tuples or summary histograms, the user should get a considerably reduced summary sent back to his own location to examine.
extract them into private ROOT tuples? at an RAC? at FNAL?
Comment: As answered above, it is inconceivable to have the entire data set stored at one location, especially once background samples are considered. The natural place for the user to get the data from is therefore the “D0Grid network”, not one specific site.
Slide 9: make cuts
do cuts on ROOT tuples? propagate the cut-list info back for any db/DST information? how? SAM? a file list? MC analyzed where? a third site?
Comment: This is part of the packaged job’s functionality. All interactive analysis should be limited to the smaller data sets that belong to the user’s RAC. Once the cuts and analysis are finalized, one packages the analysis in a ROOT-tuple macro, makes up a job, and submits the job request to the Grid network’s job distributor. The job distributor must know where the necessary data reside and what the computing capacity of each location is before sending a job to a site. The outputs of these smaller “staged” jobs should be transported to the requester’s local machine for further examination while concatenation of the output is still in progress. This also means that either the RAC holding a popular data stream must sit at an optimal position on the network, or these specific, popular data sets must be duplicated across the RACs.
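The job-distributor behaviour described in the comment (route each packaged job to a site that both holds the data and has capacity) could be sketched as follows. The site names, capacity numbers, and `Site`/`Job` structures are all hypothetical, invented only to illustrate the routing idea, not an actual D0Grid interface:

```python
from dataclasses import dataclass, field

@dataclass
class Site:
    name: str
    datasets: set                 # data streams resident at this site
    free_cpus: int                # crude measure of available capacity

@dataclass
class Job:
    dataset: str                  # data stream the packaged macro needs

def route(job: Job, sites: list) -> Site:
    """Pick a site that holds the job's dataset, preferring the one with
    the most free capacity, as the comment describes for the distributor."""
    candidates = [s for s in sites if job.dataset in s.datasets]
    if not candidates:
        raise LookupError(f"no site holds dataset {job.dataset!r}")
    return max(candidates, key=lambda s: s.free_cpus)

# Hypothetical sites: a central lab plus two RACs holding replicated streams.
sites = [Site("FNAL",  {"W_enu", "QCD"}, 10),
         Site("RAC-1", {"W_enu"},        40),
         Site("RAC-2", {"QCD"},          80)]
print(route(Job("W_enu"), sites).name)
```

Note how the last paragraph of the comment falls out of this sketch: if a popular stream lives at only one poorly connected site, every job for it lands there, so replicating the stream across RACs is what gives the distributor any real choice.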
Slide 10: measure/calculate backgrounds
where? at the ROOT level? prepared where? same as 1–2?
Comment: It should be the user’s choice whether the background analysis is done at the ROOT level or goes back to the DST or a heavier data tier. The access policy, however, is as I prescribed for the signal case two slides earlier. As far as data access is concerned, there is no significant difference between signal and background-enriched samples, though the details will vary with the popularity and size of the desired data set.
luminosity determined where, how? event list and run list prepared from step 1, sent back as a query to a database… where? FNAL? the RAC? flat file, cache, or subset db prepared and sent where? the RAC?
Comment: As I laid out before, if a full-statistics analysis requires jobs to be packaged and sent to where the desired data reside, then access to the relevant database information must also be available at the RACs. The optimal solution seems to be for the RACs to hold some sort of replica of all the necessary database information; this would also serve the purpose of re-reconstruction at the RACs.
Slide 11: the goal of this exercise would be:
describe the ideal procedure in terms of real projects, imagining the real flow of requests, data movement, etc. If we can do this, then we know how to design the system, and we know how to explain it.
Comment: This is exactly the reason why we need use cases. They would certainly define what kinds of services the RACs must provide and what data tiers must be kept at the RACs. However, as I said before, this bears more on the D0Grid architecture than on how the hardware and the RACs have to be configured.