EGEE-II INFSO-RI
Enabling Grids for E-sciencE
EGEE and gLite are registered trademarks
The Grid Observatory: goals and challenges
C. Germain-Renaud (CNRS/LRI & LAL)
EGEE’07 Conference, Budapest, Hungary, 1-5 October 2007

Application Track - Grid Observatory

Overview
NA4 cluster in EGEE-III proposal
Integrate the collection of data on the behaviour of the EGEE grid and users with the development of models and of an ontology for the domain knowledge

Some immediate questions
Resource allocation
– Performance of the gLite scheduling hierarchy
– Published waiting time
– Reactive grids – Everybody's grid
Dimensioning
– Patterns and trends in requests and usage
– Anticipate peaks
On-line fault management
– Detection
– Diagnosis
– Prevention

The big picture
Considering current technologies, we expect that the total number of device administrators will exceed 220 million by 2010 (Gartner, June 2001)
No more Moore’s Law free lunch: much more complex software & applications
The Virtual Organization concept creates common goods

Autonomic Computing
Computing systems that manage themselves in accordance with high-level objectives from humans.
Kephart & Chess, A Vision of Autonomic Computing, IEEE Computer, 2003
– Self-*: configuration, optimization, healing, protection
– Of open, non-steady-state dynamic systems
– Academia and industry are both involved

Autonomic Grids
Statistical analysis, data mining, machine learning
[Figure: autonomic control loop with the stages monitor, analyze, plan, execute, and knowledge]
DATA REQUIRED

Data Collection and Publication
Acquisition, consolidation, long-term conservation of traces of EGEE activities
– Permanent storage of reliable, exhaustive, filtered information
– Exhaustive: added value in snapshots of the inputs and grid state, e.g. workload and available services during a relevant time range
– Filtered: from operational to structured
[Figure: L&B schema, annotated "No join!"]

Data Collection and Publication
Acquisition, consolidation, long-term conservation of traces of EGEE activities
– Permanent storage of reliable, exhaustive, filtered information: from operational to structured
– No monitoring development: rich ecosystem of sources, with very different scopes, deployment and institutional status
– Centralized CIC tools (GOCDB, SAM, SFT, …), core gLite (L&B, BDII, …), sites (Maui/PBS logs), gLite integrators (R-GMA, Job Provenance), experiment integrators (Dashboard), external software (MonALISA)

Data Collection and Publication
Acquisition, consolidation, long-term conservation of traces of EGEE activities
– Permanent storage of reliable, exhaustive, filtered information: from operational to structured
– No monitoring development: rich ecosystem of sources, with very different scopes, deployment and institutional status
The major challenge is exhaustiveness
– Some data are outside the scope: external traffic on shared resources
– Inside the scope, we need snapshots of the grid state and inputs
– Privacy-related legal constraints
– Scientific usage will help
– Interaction with EGI
– Long-term: privacy-preserving data mining

Data Collection and Publication
Publication service: navigation and querying
– Integration of independent sources
– Indexing according to the needs of the user communities
  Scheduling: ongoing work with CoreGrid
  Jobs: ongoing work with KD-Ubiq
Ontology
– The Glue Information Model: an ontology of the resources
– Concepts for the grid dynamics, e.g. job lifecycle or user relations
– Expert concepts as prior knowledge of non-trivial correlations: workflows, failure modes, …
[Figure labels: Resource, Job]

Models
Intrinsic characterizations of «grid traffic»: (distribution of) e.g. job arrival rate, running time, application data locality
– Likely to be similar to IP traffic: many short jobs and a significant number of long ones, at all scales
– Long-range dependencies
Characterizations of middleware-dependent metrics, e.g. queuing delays, overhead, SE load

Models
Intrinsic characterizations of «grid traffic»: (distribution of) e.g. job arrival rate, running time, application data locality
– Likely to be similar to IP traffic: many short jobs and a significant number of long ones, at all scales
– Long-range dependencies
Characterizations of middleware-dependent metrics, e.g. queuing delays, SE load
Inference of models for middleware components and applications, users and usage profiles, user interactions

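The expectation of "many short, and a significant number of long, at all scales" describes a heavy-tailed distribution. One common way to quantify this is the Hill estimator of the tail index. The sketch below is only an illustration: the function name `hill_tail_index`, the choice of `k`, and the synthetic Pareto sample are assumptions, not part of the Grid Observatory work, and no EGEE data is used.

```python
import math
import random


def hill_tail_index(samples, k):
    """Hill estimate of the tail index alpha from the k largest samples."""
    xs = sorted(samples, reverse=True)
    if not 0 < k < len(xs):
        raise ValueError("k must satisfy 0 < k < len(samples)")
    threshold = xs[k]  # (k+1)-th largest value, used as the tail cut-off
    if threshold <= 0:
        raise ValueError("tail samples must be positive")
    # Sum of log-excesses of the k largest values over the cut-off.
    log_excess = sum(math.log(x / threshold) for x in xs[:k])
    return k / log_excess


# Synthetic stand-in for job running times (NOT EGEE data): Pareto, alpha = 1.5.
rng = random.Random(0)
runtimes = [rng.paretovariate(1.5) for _ in range(20000)]
alpha_hat = hill_tail_index(runtimes, k=2000)
print(f"estimated tail index: {alpha_hat:.2f}")
```

On real traces the estimate depends on `k`, so it is usually plotted over a range of `k` values before a tail index is read off. For Pareto-type tails, an index below 2 corresponds to infinite variance.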
Autonomic dependability
On-line failure detection and anticipation
Passive vs active probing: a lot of information is available from user work
Black-box
– On-line statistics from «similar» actions (executions, data access, middleware modules)

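On-line statistics over a group of «similar» actions can be kept in constant memory. The sketch below uses Welford's running mean and variance and a simple z-score rule; the class `RunningStats`, the helper `is_anomalous`, and the thresholds are hypothetical names and values chosen for illustration, and the sample durations are made up.

```python
import math


class RunningStats:
    """Welford's one-pass mean and variance for a stream of measurements."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def push(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Sample standard deviation; 0.0 until two values have been seen.
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def is_anomalous(stats, x, z=3.0, min_samples=10):
    """Flag x if it lies more than z standard deviations from the running mean."""
    if stats.n < min_samples or stats.std == 0.0:
        return False  # not enough history to judge
    return abs(x - stats.mean) / stats.std > z


# Made-up durations (seconds) of similar actions, e.g. one kind of data access.
history = RunningStats()
for duration in [10, 12, 11, 9, 10, 11, 12, 10, 9, 11]:
    history.push(duration)
print(is_anomalous(history, 60), is_anomalous(history, 11))
```

One `RunningStats` instance per group of similar actions (per site, per middleware module, and so on) would be the natural layout. A z-score rule assumes roughly symmetric data, which heavy-tailed running times may violate.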
Evaluation
Assessing performance at the grid scale is a challenge
– Need a snapshot of the inputs and grid state, e.g. workload and available services during a relevant time range
– Classical optimization does not scale
– Advanced optimization: anytime algorithms

Abrupt changepoint detection
Page-Hinkley statistics
Time-sequential version of Wald’s statistics, also known as CUSUM
«Intelligent threshold» test which minimizes the expected time before a change detection for a fixed false positive rate
Routine in quality control, clinical trials
[Figure panels: VO software bug; Blackhole]

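A minimal sketch of the Page-Hinkley test for an upward shift in the mean of a stream, such as a job failure indicator. The parameter values `delta` and `threshold` and the helper `first_alarm` are assumptions for illustration; the slides do not give the settings used on EGEE data, and the stream below is synthetic.

```python
from dataclasses import dataclass


@dataclass
class PageHinkley:
    delta: float = 0.05      # tolerated drift magnitude (assumed value)
    threshold: float = 5.0   # alarm threshold, often written lambda (assumed value)
    n: int = 0
    mean: float = 0.0        # running mean of the observations
    cum: float = 0.0         # m_t: cumulative sum of (x - mean - delta)
    cum_min: float = 0.0     # M_t: minimum of m_t seen so far

    def update(self, x: float) -> bool:
        """Feed one observation; return True when an upward change is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return self.cum - self.cum_min > self.threshold


def first_alarm(stream, **params):
    """Index of the first alarm in the stream, or None if no alarm is raised."""
    detector = PageHinkley(**params)
    for i, x in enumerate(stream):
        if detector.update(x):
            return i
    return None


# Synthetic outcomes: 0.0 = job succeeded, 1.0 = job failed (e.g. a blackhole site).
outcomes = [0.0] * 100 + [1.0] * 50
print(first_alarm(outcomes))
```

Raising `threshold` lowers the false positive rate at the cost of a longer detection delay, which is the trade-off the test is designed to balance.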
Autonomic dependability
On-line failure detection and anticipation
Passive vs active probing: a lot of information is available from user work
Black-box
– On-line statistics from «similar» actions (executions, data access, middleware modules)
Supervised and unsupervised learning

Mining the L&B logs
Constructive induction
Double clustering

Autonomic dependability
On-line failure detection and anticipation
Passive vs active probing: a lot of information is available from user work
Black-box
– On-line statistics from «similar» actions (executions, data access, middleware modules)
Supervised and unsupervised learning
Active probing
– Adaptive on-line test selection for best coverage of possibly faulty components
– Experience planning

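Selecting tests for best coverage of suspect components can be approximated with a greedy set-cover heuristic. This is a swapped-in textbook technique, not the method of the talk: the function `select_probes`, the probe names, and the component names are all hypothetical. It covers one selection round; an adaptive scheme would re-run it with an updated suspect set as probe results arrive.

```python
def select_probes(probes, suspects, budget):
    """Greedily pick up to `budget` probes that cover the most suspect components.

    probes   -- dict mapping probe name to the set of components it exercises
    suspects -- iterable of components that may be faulty
    Returns (chosen probe names in order, suspects left uncovered).
    """
    remaining = set(suspects)
    chosen = []
    while remaining and len(chosen) < budget:
        candidates = [p for p in probes if p not in chosen]
        if not candidates:
            break
        # Probe with the largest number of still-uncovered suspects.
        best = max(candidates, key=lambda p: len(probes[p] & remaining))
        if not probes[best] & remaining:
            break  # no probe adds coverage any more
        chosen.append(best)
        remaining -= probes[best]
    return chosen, remaining


# Hypothetical probes and components, for illustration only.
probes = {
    "p1": {"ce1", "se1"},
    "p2": {"ce1"},
    "p3": {"wms", "ce2"},
    "p4": {"se1", "ce2", "wms"},
}
chosen, uncovered = select_probes(probes, {"ce1", "se1", "ce2", "wms"}, budget=3)
print(chosen, uncovered)
```

Greedy selection is not optimal in general, but it gives a well-understood approximation and is cheap enough to recompute on-line.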
Goals & Challenges
Contributions to a quantitative approach to grid middleware and architecture, in the RISC sense
Operational impacts on EGEE: evaluation, autonomic dependability
Basic research in autonomic computing
Collaboration between EGEE and national research initiatives and other EU projects: DEMAIN, PASCAL, KD-Ubiq, CoreGrid, and hopefully more
Adequate tradeoff between productivity and sustainability