Grid in Alice – Status and Perspective. Predrag Buncic, P. Saiz, A. Peters, C. Cristiou, J-F. Grosse-Oetringhaus, A. Harutyunyan, A. Hayrapetyan. Split, 5 October.

Overview
- Grid in Alice
  - Dreaming about Grid (2001 – 2005)
  - Waking up (2005 – 2006)
  - Grid Future (2007+)
- Conclusions

ALICE Need for Grid: 1.25 GB/sec, 2 PB/year, 8 h/event

Five years ago…

Alice Grid (figure): the User Interface sits on top of the AliEn stack, which interfaces to the VDT/OSG and EDG stacks. "Nice! Now I do not have to worry about the ever-changing GRID environment…"

AliEn v1.0 (2001)
- New approach
  - Uses standard protocols and widely used Open Source components
  - Interfaces to many Grids
- End-to-end solution
  - SOA (Service Oriented Architecture): SOAP/Web Services (18)
    - Core Services (Brokers, Optimizers, etc.)
    - Site Services
    - Package Manager
    - Other (non-Web) services (LDAP, database proxy, POSIX I/O)
  - Distributed file and metadata catalogue
  - API and a set of user interfaces
- Used as the production system for Alice since the end of 2001
  - Has survived 5 software years

Technology matrix
- Globus: 2.x (proprietary protocol), 3.x (OGSA), 4.x (WS)
- EDG: *.x
- LCG: 1.x, 2.x
- gLite: 0.x, 1.x/2.x, 3.x
- AliEn: 1.x (WS), 2.x (WS)
- ?
Legend: proprietary protocol (Globus); OGSA = Open Grid Services Architecture (Globus); WS = Web Services (W3C)

Before: Globus model (figure: RB, Sites A–F). Flat grid, each user interacts directly with site resources.

AliEn model (figure)
- Virtual Organisation: collection of Sites, Users & Services
- Grid Service Provider (Supersite): hosts the Core Services (per V.O.)
- Resource Provider (Site): hosts an instance of the CE and SE services (per V.O.)

gLite: Initial goals
- Re-engineer and harden Grid middleware (AliEn, EDG, VDT and others)
- Provide production-quality middleware
- Evolution (figure): LCG-1, LCG-2 (Globus 2 based) -> gLite-1, gLite-2 (Web services based)

Reducing M/W Scope (hourglass figure)
- API
- Grid middleware: Common Service Layer (description, discovery, resource access)
- Low-level network and message transport layer (TCP/IP -> HTTP -> SOAP), as spelled out in the sketch below
- Baseline Services
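
To make the layering concrete, here is a minimal sketch that spells out, as plain text, how a SOAP call is carried inside an HTTP request that in turn travels over TCP/IP. The endpoint, namespace and method names ("JobBroker", "getJob") are hypothetical and not taken from any real middleware.

```cpp
// Illustrative only: the TCP/IP -> HTTP -> SOAP layering shown as text.
// The service, method and host names are hypothetical.
#include <iostream>
#include <string>

int main() {
    // SOAP layer: an XML envelope describing the remote call
    std::string soapEnvelope =
        "<?xml version=\"1.0\"?>\n"
        "<soap:Envelope xmlns:soap=\"http://schemas.xmlsoap.org/soap/envelope/\">\n"
        "  <soap:Body>\n"
        "    <getJob xmlns=\"urn:example-vo:JobBroker\">\n"   // hypothetical service/method
        "      <site>CERN</site>\n"
        "    </getJob>\n"
        "  </soap:Body>\n"
        "</soap:Envelope>\n";

    // HTTP layer: the envelope travels as the body of a POST request
    std::string httpRequest =
        "POST /services/JobBroker HTTP/1.1\r\n"
        "Host: vo-services.example.org\r\n"
        "Content-Type: text/xml; charset=utf-8\r\n"
        "Content-Length: " + std::to_string(soapEnvelope.size()) + "\r\n"
        "SOAPAction: \"urn:example-vo:JobBroker#getJob\"\r\n"
        "\r\n" + soapEnvelope;

    // TCP/IP layer (not shown): these bytes would be written to a socket
    std::cout << httpRequest;
    return 0;
}
```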

AliEn v2.0
- Implementation of the gLite architecture
  - The gLite architecture was derived from AliEn
- New API Service and ROOT API
  - Shell, C++, Perl, Java bindings (see the catalogue-query sketch below)
- Analysis support
  - Batch and interactive
  - ROOT/PROOF interfaces
  - Complex XML datasets and tag files for event-level metadata
  - Handling of complex workflows
- New (tactical) SE and POSIX I/O
  - Uses the xrootd protocol in place of aiod (gLite I/O)
- Job Agent model
  - Improved job execution efficiency (late binding)
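
To give a flavour of the ROOT API, here is a minimal ROOT macro sketch, assuming an AliEn API service is reachable through ROOT's TGrid interface; the catalogue path and file name used in the query are illustrative.

```cpp
// alien_query.C -- minimal sketch; the catalogue path below is illustrative
#include <cstdio>
#include "TGrid.h"
#include "TGridResult.h"

void alien_query()
{
   // Connect to the AliEn API service through ROOT's generic Grid interface
   TGrid *grid = TGrid::Connect("alien://");
   if (!grid) {
      printf("Could not connect to AliEn\n");
      return;
   }

   // Query the distributed file catalogue for ESD files under the given path
   TGridResult *result = grid->Query("/alice/sim/2006/", "AliESDs.root");
   if (result) {
      printf("Query returned %d files\n", result->GetEntries());
      result->Print();   // list the logical file names that were found
   }
}
```

The files returned by such a query could then be opened via TFile::Open("alien://…") or handed to PROOF, which is where the batch and interactive modes meet.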

Distributed Analysis (figure)
- The user job (many events) starts from a file catalogue query that defines the input data set (ESDs, AODs)
- The Job Optimizer splits it into sub-jobs 1..n, grouped by the SE location of the input files (see the sketch below)
- The Job Broker submits each sub-job to a CE with the closest SE
- The sub-jobs are processed at the CEs and SEs; a file-merging job collects output files 1..n into the final job output
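
The "grouped by SE file location" step can be pictured with the following conceptual sketch; it is not AliEn code, and closestSE() is a hypothetical stand-in for the replica information kept in the file catalogue.

```cpp
// Conceptual sketch of splitting an analysis job by storage-element location.
// closestSE() is a hypothetical stand-in for the catalogue's replica info.
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Pretend lookup: which SE holds (or is closest to) a given logical file name
std::string closestSE(const std::string &lfn)
{
    return lfn.find("cern") != std::string::npos ? "ALICE::CERN::SE"
                                                 : "ALICE::FZK::SE";
}

int main()
{
    // Result of the file catalogue query (illustrative LFNs)
    std::vector<std::string> lfns = {
        "/alice/sim/2006/run1/cern/AliESDs.root",
        "/alice/sim/2006/run1/fzk/AliESDs.root",
        "/alice/sim/2006/run2/cern/AliESDs.root"
    };

    // One sub-job per SE, each processing only the files stored there
    std::map<std::string, std::vector<std::string>> subJobs;
    for (const auto &lfn : lfns)
        subJobs[closestSE(lfn)].push_back(lfn);

    for (const auto &sj : subJobs) {
        std::cout << "Sub-job for " << sj.first << " with "
                  << sj.second.size() << " input files\n";
        // ...the Job Broker would now submit this sub-job to a CE close to sj.first
    }
    return 0;
}
```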

ROOT / AliEn UI

Reality check: PDC’06
1. Production of MC events for detector and software performance studies
2. Verification of the ALICE distributed computing model
  - Integration and debugging of the GRID components into a stable system
    - LCG Resource Broker, LCG File Catalogue, File Transfer Service, VO-boxes
    - AliEn central services – catalogue, job submission and control, task queue, monitoring
  - Distributed calibration and alignment framework
  - Full data chain: RAW data from DAQ, registration in the AliEn FC, first-pass reconstruction at T0, replication at T1
  - Computing resources: verification of scalability and stability of the on-site services, and building of expert support
  - End-user analysis on the GRID

ALICE sites on the world map

History of running jobs
- 7M p+p events combined in 70 runs (ESDs, simulated RAW and ESD tag files)
- 50K Pb+Pb events in 100 runs (ESDs and ESD tag files)

Concurrently running jobs

Resources statistics
- Resources contribution (normalized SI2k units): 50% from T1s, 50% from T2s
- The role of the T2s remains very important!

Data movement (xrootd)
- Step 1: produced data is sent to CERN
  - Up to 150 MB/sec data rate (limited by the number of available CPUs) – half of the rate during Pb+Pb data export
- Total of 0.5 PB of data registered in CASTOR2 (300K files, 1.6 GB/file)

Data movement (FTS)
- Step 2: data is replicated from CERN to the T1s
  - Test of the LCG File Transfer Service
  - The goal is 300 MB/sec – the exercise is still ongoing

Next Step: AliEn v3.0
Addressing the issues and problems encountered so far, and trying to anticipate technology trends:
- Scalability and complexity
  - We would like the Grid to grow, but how do we manage the complexity of such a system?
- Intra-VO scheduling
  - How do we manage priorities within the VO, in particular for analysis?
- Security
  - How do we reconcile a (Web) Service Oriented Architecture with growing security paranoia aimed at closing all network ports?
  - How do we fulfil the legal requirements for process and file traceability?
- Grid collaborative environment
  - How do we work together on the Grid?

Intra-VO Scheduling
A simple economy concept on top of the existing fair-share model (a conceptual sketch follows below):
- Users pay (virtual) money for utilizing Grid resources
- Sites earn money by providing resources
- The more a user is prepared to 'pay' for job execution, the sooner it is likely to be executed
Rationale
- To motivate sites to provide more resources with better QoS
- To make users aware of the cost of their work
Implementation (in AliEn v2-12): Lightweight Banking Service for Grid (LBSG)
- Account creation/deletion
- Funds addition
- Funds transactions
- Retrieval of transaction lists and balances
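
A conceptual sketch of how such an economy could weight scheduling is given below; the formula, the Account/Job structures and the settlement step are illustrative assumptions, not the actual AliEn or LBSG implementation.

```cpp
// Conceptual sketch: economy-weighted job priority on top of a fair-share value.
// The formula and data structures are illustrative, not the AliEn/LBSG code.
#include <algorithm>
#include <cstdio>

struct Account {
    double balance;    // virtual credits the owner currently holds
};

struct Job {
    double fairShare;  // base priority from the usual fair-share mechanism (0..1)
    double bidPrice;   // credits the user is prepared to pay for this job
};

// Higher bids raise the effective priority, but only while the account can pay.
double effectivePriority(const Job &job, const Account &user)
{
    double affordableBid = std::min(job.bidPrice, user.balance);
    return job.fairShare * (1.0 + affordableBid);
}

// When a site runs the job, credits move from the user to the site.
void settle(Account &user, Account &site, double cost)
{
    double paid = std::min(cost, user.balance);
    user.balance -= paid;
    site.balance += paid;
}

int main()
{
    Account user{10.0}, site{0.0};
    Job job{0.5, 2.0};
    std::printf("priority = %.2f\n", effectivePriority(job, user));
    settle(user, site, 2.0);   // job ran: debit the user, credit the site
    std::printf("user balance = %.1f, site balance = %.1f\n",
                user.balance, site.balance);
    return 0;
}
```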

Overlay Messaging Networks
Due to security concerns, any service that listens on an open network port is seen as very risky.
Solution:
- Use Instant Messaging protocols to create an overlay network and avoid opening ports
  - IM can be used to route SOAP messages between central and site services (see the sketch below)
    - No need for incoming connectivity on the site head node
  - It provides presence information for free
    - Simplifies configuration and discovery
- XMPP (Jabber)
  - A set of open technologies for streaming XML between two clients
    - Many open-source implementations
    - Distributed architecture: clients connect to servers, servers connect directly to each other
  - Jabber is used by Google IM/Talk
    - This channel could be used to connect grid users with Google collaborative tools
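
To illustrate the routing idea, the sketch below takes the same kind of SOAP envelope as in the earlier layering example but wraps it in an XMPP message stanza instead of an HTTP request. The JIDs are hypothetical, and a real deployment would use an XMPP library and proper XML escaping of the payload.

```cpp
// Illustrative only: a SOAP envelope carried inside an XMPP <message> stanza.
// The JIDs (jobbroker@vo.example.org, ce@site.example.org) are hypothetical.
#include <iostream>
#include <string>

int main()
{
    std::string soapEnvelope =
        "<soap:Envelope xmlns:soap=\"http://schemas.xmlsoap.org/soap/envelope/\">"
        "<soap:Body><getJob/></soap:Body></soap:Envelope>";

    // The site service keeps an outgoing connection to its XMPP server,
    // so no inbound port has to be opened on the site head node.
    // NB: a real implementation would XML-escape the payload or use a
    // dedicated child element instead of <body>.
    std::string stanza =
        "<message from=\"ce@site.example.org/agent\" "
        "to=\"jobbroker@vo.example.org/service\" type=\"normal\">"
        "<body>" + soapEnvelope + "</body>"
        "</message>";

    std::cout << stanza << std::endl;   // would be streamed over the XMPP link
    return 0;
}
```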

Sandboxing and Auditing
The concept of VO identity is gaining recognition:
- The model is accepted by OSG and exploited by the LHC experiments
- The VO acts as an intermediary on behalf of its users
  - Task Queue – repository for user requests (AliEn, DIRAC, PanDA)
  - Computing Element – requests jobs and submits them to the local batch system
- Recently this model was extended to the worker node
  - A Job Agent (pilot job) running under the VO identity on the worker node serves many real users (see the sketch below)
The next big step in enhancing Grid security would be to run the Job Agents (pilot jobs) within a Virtual Machine:
- This can provide perfect process and file sandboxing
- Software running inside one VM cannot negatively affect the execution of another VM
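
The Job Agent pattern can be sketched as a simple pull loop; everything here (the fetchJob() stub, the sandbox paths, the job representation) is a hypothetical illustration of the pattern rather than AliEn code.

```cpp
// Conceptual sketch of a pilot job (Job Agent): the agent runs under the VO
// identity, pulls real users' jobs from the central Task Queue and executes
// each one in its own sandbox directory. fetchJob() is a hypothetical stub.
#include <cstdlib>
#include <iostream>
#include <optional>
#include <string>

struct UserJob {
    std::string user;     // the real user the job belongs to (for auditing)
    std::string command;  // what to run
};

// Stand-in for a call to the VO's central Task Queue service.
std::optional<UserJob> fetchJob()
{
    static int remaining = 2;                  // pretend the queue holds 2 jobs
    if (remaining-- <= 0) return std::nullopt;
    return UserJob{"alice_user", "echo analysis step"};
}

int main()
{
    int slot = 0;
    while (auto job = fetchJob()) {
        // One sandbox directory per job keeps users' files apart; running the
        // whole agent inside a VM would give even stronger isolation.
        std::string sandbox = "/tmp/agent-sandbox-" + std::to_string(slot++);
        std::string cmd = "mkdir -p " + sandbox +
                          " && cd " + sandbox + " && " + job->command;
        std::cout << "Running job of " << job->user << " in " << sandbox << "\n";
        std::system(cmd.c_str());   // the user and outcome would be logged for traceability
    }
    return 0;
}
```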

The Big Picture
Layers: Virtual Cluster (User layer) on top of Virtual Grid (V.O. layer) on top of Physical Grid (Common layer).
- Large "physical grid"
  - Reliably executes jobs, stores, retrieves and moves files
- An individual V.O. will, at any given point in time, have access to a subset of these resources
  - It uses standard tools to submit jobs (Job Agents as well as other required components of the V.O. grid infrastructure) to physical grid sites
  - In this way the V.O. 'upper' middleware layer creates an overlay: a grid tailored to the V.O.'s needs, but on a smaller scale
  - At this scale, depending on the size of the V.O., some of the existing solutions might be applicable
- Individual users interacting with the V.O. middleware will typically see a subset of the resources available to the entire V.O.
  - Each session will have a certain number of resources allocated
  - In the most complicated case, users will want to interactively steer a number of jobs running concurrently on many Grid sites
  - Once again an overlay (Virtual Cluster), valid for the duration of the user session
- AliEn/PROOF demo at SC05

Conclusions
- Alice has been using Grid resources to carry out production (and analysis) since 2001
- At present, common Grid software does not provide a sufficient and complete solution
  - Experiments, including Alice, have developed their own (sometimes heavy) complementary software stacks
- In Alice, we are reaching a 'near production' level of service based on AliEn components combined with baseline LCG services
  - Testing of the ALICE computing model with ever-increasing complexity of tasks
  - Seamless integration of the interactive and batch processing models: strategic alliance with ROOT
  - Gradual build-up of the distributed infrastructure in preparation for data taking in 2007
  - Improvements of the AliEn software: hidden thresholds are only uncovered under high load, and storage still requires a lot of work and attention
- Possible directions
  - Convergence of P2P and Web technologies
  - Complete virtualization of distributed computational resources
  - By layering the experiment software stack on top of a basic physical Grid infrastructure we can reduce the scale of the problem and make the Grid really work