“Grey areas” of the new architecture
Massimo Sgaravatto
INFN Padova
Issues
- Many topics reported in D1.4 were not discussed in depth
- Some were NEVER discussed
- Not sure there is a general consensus on what has been written (I hope so)
- In any case, D1.4 is too vague
  - OK for a “high-level architecture” document such as D1.4
  - Not enough, in my opinion, to describe in detail how the whole system will work and how everything must be reorganized/implemented
- Not all components are in the picture (e.g. the Grid Accounting components)
Examples of areas that must be clarified
- Reservation and co-allocation
  - How is a reservation/co-allocation used by a job?
  - Where and how is the status of a reservation/co-allocation kept? LB?
  - Interfaces with GARA
- Interfaces with LB
  - Which components push events to LB?
  - Which events are pushed to LB?
  - “Collection” jobs (e.g. jobs belonging to the same DAG)
  - LB API needed for job checkpointing
- Which events can the Log Monitor notify to the Workload Manager, and what are the expected actions?
- Is a job submitted to CondorG when a suitable resource has been found, or is it immediately inserted into the CondorG queue on hold and then released when a suitable resource is found?
- …
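The last question can be made concrete with a toy model of the two submission strategies. This is only a sketch under stated assumptions: `ToyQueue`, `Job`, `State`, the two `submit_*` functions and the host name `ce.example.org` are hypothetical names invented for illustration and are not part of CondorG or of the WP1 code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class State(Enum):
    HELD = "held"   # in the queue, but not allowed to run yet
    IDLE = "idle"   # in the queue, ready to run on its resource


@dataclass
class Job:
    job_id: str
    state: Optional[State] = None       # None: not in the queue at all
    resource: Optional[str] = None


class ToyQueue:
    """Hypothetical stand-in for the CondorG queue (not the Condor API)."""

    def __init__(self) -> None:
        self.jobs: Dict[str, Job] = {}

    def insert(self, job: Job, state: State) -> None:
        job.state = state
        self.jobs[job.job_id] = job

    def release(self, job_id: str, resource: str) -> None:
        job = self.jobs[job_id]
        if job.state is not State.HELD:
            raise ValueError(f"job {job_id} is not on hold")
        job.resource = resource
        job.state = State.IDLE


def submit_after_match(queue: ToyQueue, job: Job, resource: str) -> None:
    """Option A: the job enters the queue only once a resource is known."""
    job.resource = resource
    queue.insert(job, State.IDLE)


def submit_on_hold(queue: ToyQueue, job: Job) -> None:
    """Option B: the job enters the queue immediately, on hold."""
    queue.insert(job, State.HELD)


queue = ToyQueue()
a = Job("job-a")
b = Job("job-b")
submit_on_hold(queue, b)                      # b is queued before any matchmaking
in_queue_before_match = sorted(queue.jobs)
submit_after_match(queue, a, "ce.example.org")
queue.release("job-b", "ce.example.org")      # matchmaking done: release b
```

In this toy model the practical difference is where a job waiting for a match lives: under option B it is already visible in the queue (on hold), while under option A something outside the queue has to keep track of it until a resource is found.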
What is needed (in my opinion)
- The whole architecture must be defined much more clearly and in much more detail
- Considering the various use cases (the various commands and the various events that could occur), we need to define the exact functionalities provided by these components and the interfaces between them
- Clear responsibilities must be defined for the various components
- This must be done NOW if we want to rely on the new architecture by release 2.0
Responsibilities
- User Interface: Datamat
- Network Server: Catania (recycle some existing RB code?)
- Protocol: Catania (recycle some existing RB code?)
- Workload Manager: CNAF (recycle some existing RB code?)
- Reservation Agent: CNAF
- Co-Allocation Agent: CNAF
- Resource Broker (MatchMaker): Catania
- Partitioner: Padova
- Helper: Francesco G.
- Job Adapter: CNAF (recycle some existing jobwrapper code)
- JSS object: Padova
- Log Monitor: Padova (evolution of JSSparser)
- Logging & Bookkeeping: CESNET
- Integration with DAGMan: CNAF
- Grid Accounting components: Torino
- Interactive jobs support integration
Proposed schedule
- Today: define responsibilities for the various modules
- Today: define which functionalities can realistically be in place (and tested) for release 2.0 (~8 working weeks until the end of September)
  - Planned new functionalities (releases 1.4 and 2.0):
    - Support for interactive jobs
    - Support for job dependencies
    - Integration with WP2 query optimization service
    - Java API (if needed by applications)
    - GUI
    - Advance reservation API
    - Deployment of Accounting infrastructure over Testbed (HLRs with command line interface)
    - Support for logical trivial job checkpointing
    - Support for job partitioning
    - Full integration of cost estimation/accounting into scheduling policies
    - Integration of advance reservation/co-allocation into Resource Broker
    - RB relying on the new IS Glue Schema
- Today and next days: identify which other components are missing from the picture and plug them in (only Grid Accounting stuff?)
Proposed schedule
- (Chat) meetings to discuss in more detail the functionalities of the various components and the interfaces between them
  - Start with the existing functionalities, then consider, one by one, the new functionalities that will be in place for release 2.0
  - Starting this Wednesday (“real” meeting between a few partners)
- Date ??: new CVS in place
- Date ??: start implementation relying on the new CVS
- September 2-5: EDG Workshop in Budapest
- September 9: start hands-on meeting
- September 30: release 2.0
Mail from Bob Jones
… …
Reflecting on what we discussed and taking into account the opinions of several of you, I think we should be more realistic and assume there will only be at most one more EDG release after 1.2 that is deployed on the production testbed in The SC2002 et al. demos for November should be prepared based on release 1.2.
Obviously the development and certification testbeds will be more advanced.
For the EU review at the start of 2003, I think we could imagine providing demos of what is currently possible on the production testbed (i.e. reuse the SC2002 et al. demos) and also show them the latest features of the development or certification testbeds.
Mail from Bob Jones
Mware sw scheduling info:
Please look at the software release plan (http://edms.cern.ch/document/333297) and, for each item for your WP listed in release 1.2, 1.3, 1.4 & 2.0 tell me:
- Delivery date: When you expect it to be delivered
  - Note 1: If it is already included in release 1.2 then just say "1.2"
  - Note 2: "delivered" means documented and tested (REALLY!)
- Effort Required: State how much effort is required to make the delivery (remember: documented & tested). Please specify in (wo)man weeks. Identify who will perform the work (i.e. specify the names and how many weeks of work they do each)
  - Note 1: please check with the people concerned that your information is correct and that they can schedule the estimated time (i.e. they are not over committed with other tasks, on holiday for that period etc.)
- Dependencies: List other sw not already included in release 1.2 that it depends on (both in your WP and any other)
- GLUE schema: please be sure to include details of the work on the information providers/consumers (including their current status).
In general I prefer you to be pessimistic rather than optimistic about your dates
Software release plan
Item | Expected release date | Involved people | Estimated effort required | Dependencies
WP1 Software release plan
Item | Expected release date | Involved teams | Estimated effort required | Dependencies
C++ API | 1.3 | Datamat | - | -
Support for MPICH jobs | 1.3 | Padova | - | -
Improving error reporting | 1.3 | Datamat, Catania | - | -
Support for interactive jobs | 1.4 | Milano | - | -
Job dependencies | 1.4 | CNAF | - | Condor team?
Integration with WP2 Query Optim. Service | 1.4 | Catania | - | WP2 Query Opt. Service
WP1 Software release plan
Item | Expected release date | Involved teams | Estimated effort required | Dependencies
Java API (if needed) | 1.4 | Datamat | - | -
GUI | 1.4 | Datamat | - | -
Deployment of Accounting infrast. over Testbed (HLRs with command line interface) | 1.4 | Torino | - | WP4?
Advance reservation API | 1.4 | CNAF | - | -
WP1 Software release plan
Item | Expected release date | Involved teams | Estimated effort required | Dependencies
RB relying on the Glue schema | 1.4 | Catania | - | Schema and DIT defined; WP4 (inf. pr.)
Job checkpointing | 2.0 | Pd, Ces. | - | LB
Job partitioning | 2.0 | Padova | - | Job checkp., job depend.
Full integration of cost estimation/accounting into scheduling policies | 2.0 | Catania, Torino | - | -
Integration of advance res./co-all. into RB | 2.0 | Catania, CNAF | - | -
My personal ideas
- Deliver new 1.2 RPMs as requested
  - JSS problems + fixes for outstanding issues with autotools (if any)
- No new 1.3 RPMs
  - To avoid being asked to support 1.3 (as happened with 1.2) and therefore being unable to implement the new stuff
- Deliver 2.0 RPMs (but with fewer functionalities than originally planned)
WP1 Sw rel. plan (my prop.)
Item | Expected release date (planned -> proposed) | Involved teams | Estimated effort required | Dependencies
C++ API | 1.3 -> 2.0 | SM, MP (CT), Datamat (FP, AM), CESNET (AK), Pd (RP) | 3 person-weeks | -
Support for MPICH jobs | 1.3 -> 2.0 | Padova (AG) | ½ person-week | -
Improving error reporting and communication from UI | 1.3 -> 2.0 | Datamat (FP, AM), Catania (SM, MP) | 2 person-weeks | -
Support for interactive jobs | 1.4 -> 2.0 | Mi (MM), CNAF (ER), Datamat (FP, AM) | 3 person-weeks | -
Job dependencies | 1.4 -> 2.0 | CNAF (FG, ER), CESNET (all), Datamat (FP, AM) | 16 person-weeks | -
Integration with WP2 Query Optim. Service | 1.4 -> 2.0 | Catania (SM, MP) | 1 person-week | WP2 Query Opt. Service
WP1 Sw rel. plan (my prop.)
Item | Expected release date (planned -> proposed) | Involved teams | Estimated effort required | Dependencies
Java API + GUI | 1.4 -> 2.0 | Datamat (GA) | 6 person-weeks | -
Deployment of Accounting infrast. over Testbed (HLRs with command line interface) | 1.4 -> 2.0 | Torino (AG, SB) | 8 person-weeks | WP4
Advance reservation API | 1.4 -> 2.0 | CNAF (FG, ER, SF) | 2 person-weeks | -
WP1 Sw rel. plan (my prop.)
Item | Expected release date (planned -> proposed) | Involved teams | Estimated effort required | Dependencies
RB relying on the Glue schema | 1.4 -> 2.0 | Catania (SM, MP) | 2 person-weeks | Schema and DIT defined; WP4 (inf. pr.)
Job checkpointing | 2.0 | Pd (AG, RP), Ces. (MM) | 6 person-weeks | LB
Job partitioning | 2.0 -> after 2.0 | Padova (AG, RP) | 4 person-weeks | Job checkp., job depend.
Full integration of price estimation/accounting into scheduling policies | 2.0 -> after 2.0 | Catania (SM, MP), Torino (SB, AG) | 8 person-weeks | -
Integration of advance res./co-all. into RB | 2.0 -> after 2.0 | Catania (SM, MP), CNAF (ER, SF, FG) | 12 person-weeks | WP4, WP5, WP7
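As a rough cross-check of the proposal, the effort estimates in the three proposal tables can be totalled per proposed release. The sketch below only transcribes the person-week figures from the slides; the names `ESTIMATES` and `effort_by_release` are illustrative, and the totals say nothing about how the work parallelizes across teams.

```python
from collections import defaultdict

# (item, proposed release, estimated effort in person-weeks),
# transcribed from the "my prop." tables
ESTIMATES = [
    ("C++ API", "2.0", 3),
    ("Support for MPICH jobs", "2.0", 0.5),
    ("Improving error reporting and communication from UI", "2.0", 2),
    ("Support for interactive jobs", "2.0", 3),
    ("Job dependencies", "2.0", 16),
    ("Integration with WP2 Query Optim. Service", "2.0", 1),
    ("Java API + GUI", "2.0", 6),
    ("Deployment of Accounting infrast. over Testbed", "2.0", 8),
    ("Advance reservation API", "2.0", 2),
    ("RB relying on the Glue schema", "2.0", 2),
    ("Job checkpointing", "2.0", 6),
    ("Job partitioning", "after 2.0", 4),
    ("Full integration of price estimation/accounting", "after 2.0", 8),
    ("Integration of advance res./co-all. into RB", "after 2.0", 12),
]


def effort_by_release(estimates):
    """Sum the estimated person-weeks for each proposed release."""
    totals = defaultdict(float)
    for _item, release, weeks in estimates:
        totals[release] += weeks
    return dict(totals)


if __name__ == "__main__":
    for release, weeks in effort_by_release(ESTIMATES).items():
        print(f"{release}: {weeks} person-weeks")
```

With the figures above this gives 49.5 person-weeks for release 2.0 and 24 person-weeks deferred until after 2.0.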