Z. Z. Vilakazi iThemba LABS / UCT-CERN Research Centre Curation in Natural sciences.

1 Z. Z. Vilakazi iThemba LABS / UCT-CERN Research Centre Curation in Natural sciences

2 Acknowledgments
● Common effort of the ALICE and LCG Collaborations.
● Thanks to my colleagues of the ALICE-MUON Collaboration.
● Special thanks to Jean Cleymans, Bruce Becker, Artur Szostak, Gareth de Vaux, Sukalyan Chattopadyay, Corrado Cicalo, Timm Steinbeck, Volker Lindenstruth, Heinz Tilsner, Florent Staley and others.

3 Topics for discussion
● Management of large data sets
● Inter-operability
● Standards and protocols
● Security and certification

4 Digital Curation
● Maintenance of digital research data and other digital materials over their entire life-cycle and over time for current and future generations of users.
● Processes of digital archiving and preservation.
● Also includes all the processes needed for good data creation and management, and the capacity to add value to data to generate new sources of information and knowledge.
"..., and services in this field." (Centre for Digital Curation)

5 Digital Curation (2)
● Curation and long-term preservation of digital resources will be of increasing importance for a wide range of activities within research and education.
● Through sensors, experiments, digitisation and computer simulation, digital resources and data are growing in volume and complexity at a staggering rate.
● The cost of producing these resources is very high: satellites, particle accelerators, genome sequencing, and large scale digitisation and electronic publishing collectively represent a cumulative investment of billions of pounds in digital research and learning.
● Long-term curation and preservation of digital resources is seen as a challenge which is difficult if not impossible for individual institutions to resolve on their own, due to the complexity and scale of the challenges involved.

6 Curation in Physical Sciences
● Data is being generated in large volumes.
● In laboratories, old archival material (design specifications, codes etc.) can serve as reference resources.
● Remote information access through online publications.
● Data management and real-time remote analysis
  ● Heavily dependent on bandwidth
  ● New middleware is being developed for access to data across geographically disparate centres.
● Data sharing in astro, nuclear and particle physics
  ● Usually characterised by large collaborations (in excess of 100 people)
● MetaData are essential for the selection of events
  ● Can use the Grid file catalogue for one part of the MetaData
  ● During the Data Challenge we used the file catalogue for storing part of the MetaData
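
The idea of selecting events through metadata held in a file catalogue can be sketched as follows. This is only an illustration under stated assumptions: the `CatalogueEntry` class, the `select` helper, the file names and the metadata keys are all hypothetical and do not correspond to the actual Grid file catalogue or to AliEn.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogueEntry:
    """One file in a hypothetical catalogue, tagged with free-form metadata."""
    lfn: str                                   # logical file name (illustrative)
    metadata: dict = field(default_factory=dict)


def select(entries, **criteria):
    """Return the entries whose metadata matches every key/value in criteria."""
    return [
        entry for entry in entries
        if all(entry.metadata.get(key) == value for key, value in criteria.items())
    ]


# Hypothetical catalogue content; the paths and tags are made up for the example.
catalogue = [
    CatalogueEntry("/demo/run001/events.root", {"beam": "Pb-Pb", "trigger": "central"}),
    CatalogueEntry("/demo/run002/events.root", {"beam": "p-p", "trigger": "minimum-bias"}),
    CatalogueEntry("/demo/run003/events.root", {"beam": "Pb-Pb", "trigger": "dimuon"}),
]

if __name__ == "__main__":
    for entry in select(catalogue, beam="Pb-Pb"):
        print(entry.lfn)
```

Keeping the selection criteria in the catalogue means an analysis job only has to open the files that match, which matters when data access is limited by bandwidth.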

7 Data Handling and Computation for Physics Analysis (CERN)
[Diagram. Stages: simulation, reconstruction, analysis. Elements: detector; raw data; event filter (selection & reconstruction); processed data; event summary data; event reprocessing; event simulation; batch physics analysis; analysis objects (extracted by physics topic); interactive physics analysis.]

8 Experimental conditions in heavy-ion colliders
● Beam: Pb-Pb, Ca-Ca, p-p, p-A
● Rates:
  ● 8000 events/s minimum bias
  ● 50-100/s central events (2-5% of σ_tot)
  ● Acquisition rate: 100 Hz (central), 1000 Hz (dimuons)
  ● 1 month/year (10^6 s) = 10^7 central events
● Multiplicity: dN/dy from 2000 to 8000, so a total of about 60000

9 Consequences
● More than 60 GBytes produced per second in ALICE: High Level Trigger (HLT) + compression to reduce raw data to 1.2 GB/s
● 2 to 3 PB/year in 1 month of data taking
● Very fast acquisition and network
● ALICE will be one of the largest databases in history
● Need a GRID to distribute and analyse data
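
A back-of-envelope check of these figures, as a rough sketch only. It takes 60 GB/s as the rate before the HLT and compression (the slide says "more than"), 1.2 GB/s as the rate after, and 10^6 s as one month of data taking. The variable names are illustrative. At a constant 1.2 GB/s the sketch gives roughly 1.2 PB for 10^6 s, the same order of magnitude as the 2 to 3 PB/year quoted on the slide; it makes no attempt to account for the difference.

```python
GB = 1e9   # bytes
PB = 1e15  # bytes

raw_rate = 60 * GB        # bytes/s before HLT and compression (lower bound)
stored_rate = 1.2 * GB    # bytes/s after HLT and compression
live_time = 1e6           # seconds of data taking per year (about one month)

# How much the HLT plus compression must reduce the data stream.
reduction_factor = raw_rate / stored_rate

# Volume written in one year if the stored rate is sustained for the whole run.
annual_volume_pb = stored_rate * live_time / PB

if __name__ == "__main__":
    print(f"reduction factor: about {reduction_factor:.0f}x")
    print(f"stored per year:  about {annual_volume_pb:.1f} PB")
```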

10 The Grid Vision
● The GRID: networked data processing centres and "middleware" software as the "glue" of resources.
● Researchers perform their activities regardless of geographical location, interact with colleagues, share and access data.
● Scientific instruments and experiments provide huge amounts of data.

11 Classification of Grids
● Computational Grids (including CPU scavenging Grids), which focus primarily on computationally-intensive operations.
● Data Grids, for the controlled sharing and management of large amounts of distributed data.
● Equipment Grids, which have a primary piece of equipment, e.g. a telescope, and where the surrounding Grid is used to control the equipment remotely and to analyse the data produced.

12 Grid beyond high energy physics
● Due to the computational power of EGEE, new communities are requesting services for different research fields.
● Normally these communities do not need the complex structure that is required by the HEP communities.
● In many cases, their productions are shorter and well defined in the year.
● The amount of CPU required is much lower, and so are the storage requirements.
● 20 applications from 7 domains: High Energy Physics, Biomedicine, Earth Sciences, Computational Chemistry, Astronomy, Geo-physics and financial simulation.

13 LCG services – built on two major science grid infrastructures
● EGEE – Enabling Grids for E-Science
● OSG – US Open Science Grid

14 LCG Service Hierarchy
Tier-0 – the accelerator centre
● Data acquisition & initial processing
● Long-term data curation
● Distribution of data → Tier-1 centres
Tier-1 – "online" to the data acquisition process → high availability
● Managed Mass Storage – grid-enabled data service
● Data-heavy analysis
● National, regional support
● Tier-1 centres: Canada – TRIUMF (Vancouver); France – IN2P3 (Lyon); Germany – Forschungszentrum Karlsruhe; Italy – CNAF (Bologna); Netherlands Tier-1 (Amsterdam); Nordic countries – distributed Tier-1; Spain – PIC (Barcelona); Taiwan – Academia Sinica (Taipei); UK – CLRC (Oxford); US – FermiLab (Illinois), Brookhaven (NY)
Tier-2 – ~100 centres in 20 countries
● Simulation
● End-user analysis – batch and interactive

15 Tier0 / Tier1 / Tier2 Networks Cape Town ?

16 Summary of Tier0/1/2 Roles
● Tier0 (CERN): safe keeping of RAW data (first copy); first pass reconstruction; distribution of RAW data and reconstruction output to Tier1; reprocessing of data during LHC down-times.
● Tier1: safe keeping of a proportional share of RAW and reconstructed data; large scale reprocessing and safe keeping of corresponding output; distribution of data products to Tier2s and safe keeping of a share of simulated data produced at these Tier2s.
● Tier2: handling analysis requirements and a proportional share of simulated event production and reconstruction.
● Very difficult to estimate network requirements!
N.B. There are differences in roles by experiment. Essential to test using the complete production chain of each!
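
The division of roles above can be written down as a small lookup table, for example to check which tiers share a given responsibility. This is a hypothetical sketch: the `TIER_ROLES` dictionary only paraphrases the slide, and `tiers_responsible_for` is an illustrative helper, not part of any LCG software.

```python
# Roles per tier, paraphrased from the summary slide (illustrative only).
TIER_ROLES = {
    "Tier0": [
        "safe keeping of RAW data (first copy)",
        "first pass reconstruction",
        "distribution of RAW data and reconstruction output to Tier1",
        "reprocessing of data during LHC down-times",
    ],
    "Tier1": [
        "safe keeping of a proportional share of RAW and reconstructed data",
        "large scale reprocessing and safe keeping of corresponding output",
        "distribution of data products to Tier2s",
        "safe keeping of a share of simulated data produced at Tier2s",
    ],
    "Tier2": [
        "handling analysis requirements",
        "proportional share of simulated event production and reconstruction",
    ],
}


def tiers_responsible_for(keyword):
    """Return the tiers with at least one role mentioning keyword (case-insensitive)."""
    needle = keyword.lower()
    return [
        tier for tier, roles in TIER_ROLES.items()
        if any(needle in role.lower() for role in roles)
    ]


if __name__ == "__main__":
    for word in ("RAW", "reprocessing", "analysis"):
        print(word, "->", tiers_responsible_for(word))
```

Since the slide notes that roles differ by experiment, a table like this would in practice need one variant per experiment.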

17 Physics Data Challenge(s) (F. Carminati, CERN)
[Diagram across Tier1 and Tier2 sites. Steps: production of RAW; shipment of RAW to CERN; reconstruction of RAW in all T1s; analysis. Legend: AliEn job control; data transfer.]

18 ALICE Network in the World
● Active sites (map): Yerevan, CERN, Saclay, Lyon, Dubna, Cape Town (ZA), Birmingham, Cagliari, NIKHEF, GSI, Catania, Bologna, Torino, Padova, IRB, Kolkata (India), OSU/OSC, LBL/NERSC, Merida, Bari
● 37 people, 21 institutions

19 Undersea Cable Capacity

20 Asymmetric Inter-regional Bandwidth

21 Result: Sample Bandwidth Costs for African Universities Source: IEEAF

22 Topics for discussion
● Management of large data sets
  ● $$ and R
  ● Database management skills
  ● Digital divide: cyber infrastructure (network/HR/libraries/data sets/LAN etc.)
● Inter-operability: e.g. Astro-Grid, mammo Grid etc.
● Standards and protocols
  ● Preservation and quality
  ● Access (meaning of numbers)/terminology and use of unfamiliar data
  ● Configuration management
  ● Ex: Particle data book
● Security and certification
  ● Certification authorities
● Dialogue between researchers & librarians
● Role of libraries and curators
● Guidelines
● Academic training programme/schools outreach
  ● Schools: new curriculum development (lost data)
  ● Research students: access to previous theses
● Resource management

23 Challenges
● Strategy for Natural sciences across different domains
