Exploiting Virtualization & Cloud Computing in ATLAS
Fernando H. Barreiro Megino (CERN IT-ES)
CHEP 2012, New York, May 2012
EGI-InSPIRE RI-261323

ATLAS Cloud Computing R&D

ATLAS Cloud Computing R&D is a young initiative, with almost 10 people participating part time on various topics. The goal: how can we integrate cloud resources with our current grid resources?

Data processing and workload management:
- PanDA queues in the cloud: centrally managed; a non-trivial deployment, but scalable. Benefits ATLAS and the sites, transparent to users.
- Tier-3 analysis clusters: instant cloud sites. Institute managed, low/medium complexity.
- Personal analysis queue: one click, run my jobs. User managed, low complexity (almost transparent).

Data storage:
- Short-term data caching to accelerate the data processing use cases above (transient data)
- Object storage and archival in the cloud
- Integration with DDM

Keywords: EFFICIENCY, ELASTICITY

Outline: Data processing and workload management
- PanDA queues in the cloud (up next)
- Analysis Clusters in the Cloud
- Personal PanDA Analysis Queues in the Cloud

Helix Nebula: The Science Cloud

- European cloud computing initiative: CERN, EMBL, ESA + European IT industry
- Goals: evaluate cloud computing for science, build a sustainable European cloud computing infrastructure, and identify and adopt policies for trust, security and privacy
- CERN/ATLAS is one of three flagship users, testing a few commercial cloud providers (CloudSigma, T-Systems, ATOS...)
- Agreed to run MC production jobs for ~3 weeks per provider
- First step completed: ran ATLAS simulation jobs (i.e. small I/O requirements) on CloudSigma

Basic idea: the KISS principle

- The simplest possible model: use CernVM with preinstalled software
- VMs at CloudSigma are configured to join a Condor pool whose master is at CERN (one of the pilot factories)
- A new PanDA queue, HELIX, is created
- Real MC production tasks are assigned manually
- I/O is copied over the WAN from CERN (lcg-cp/lcg-cr for input/output); the per-job I/O pattern is sketched below
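An illustrative sketch of that per-job I/O pattern, assuming the usual lcg-utils command-line interface: input is pulled over the WAN from CERN with lcg-cp, the payload runs locally on the CloudSigma VM, and the output is copied back and registered with lcg-cr. All SURLs, file names and the payload command are hypothetical placeholders.

```python
import subprocess

def run_job(input_surl, output_surl, payload_cmd):
    # Stage in: copy the input from the CERN storage element to local disk
    subprocess.check_call(["lcg-cp", "--vo", "atlas",
                           input_surl, "file:/tmp/input.data"])
    # Run the MC simulation payload against the local copy
    subprocess.check_call(payload_cmd)
    # Stage out: copy the output back to CERN and register it in the catalogue
    subprocess.check_call(["lcg-cr", "--vo", "atlas", "-d", output_surl,
                           "file:/tmp/output.data"])

run_job("srm://srm-atlas.cern.ch/atlas/input.data",    # hypothetical SURL
        "srm://srm-atlas.cern.ch/atlas/output.data",   # hypothetical SURL
        ["./run_simulation.sh"])                       # hypothetical payload
```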

Results

- 100 nodes/200 CPUs at CloudSigma used for production tasks
- Smooth running with very few failures: finished 6 x 1000-job MC tasks over ~2 weeks
- We ran 1 identical task at CERN to get reference numbers
- Wall-clock performance cannot be compared directly, since the two sites do not have the same hardware: CloudSigma provides ~1.5 GHz of an AMD Opteron 6174 per job slot, while CERN has ~2.3 GHz Xeon L5640 cores
- The best comparison would be CHF/event, which is presently unknown
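For concreteness, the cost metric the slide calls for would be computed as below; the symbol names are our own, and the per-slot price is precisely the unknown piece:

```latex
\mathrm{cost\ per\ event} =
  \frac{P_{\mathrm{slot}}\,[\mathrm{CHF/h}] \times T_{\mathrm{wall}}\,[\mathrm{h}]}
       {N_{\mathrm{events}}}
```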

Cloud Scheduler

- Allows users to run jobs on IaaS resources by submitting to a standard Condor queue
- A simple Python software package that monitors the state of a Condor queue and boots VMs in response to waiting jobs
- Custom Condor job description attributes identify the VM requirements of each job
- Controls multiple IaaS cloud sites through their cloud APIs
- The same approach has been used successfully for BaBar jobs
A minimal sketch of the boot-on-demand loop follows.
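A minimal sketch, not the actual Cloud Scheduler code: idle jobs in the Condor queue are inspected for a custom attribute describing the VM they need, and a matching VM is booted through a cloud API. The VMType attribute name and the AMI mapping are assumptions for illustration, and boto's EC2 API stands in for "multiple IaaS cloud sites".

```python
import subprocess
import boto

VM_IMAGES = {"atlas-worker": "ami-12345678"}   # hypothetical attribute -> image map

def idle_vm_requests():
    # List the VMType attribute of every idle (JobStatus == 1) job
    out = subprocess.check_output(
        ["condor_q", "-constraint", "JobStatus == 1",
         "-format", "%s\n", "VMType"])
    return out.decode().split()

def boot_missing_vms(conn, running):
    for vm_type in idle_vm_requests():
        if running.get(vm_type, 0) == 0:       # naive policy: one VM per type
            conn.run_instances(VM_IMAGES[vm_type], instance_type="m1.large")
            running[vm_type] = 1

boot_missing_vms(boto.connect_ec2(), running={})
```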

Cloud Scheduler: Results

Cloud resources are aggregated and served to one analysis and one production queue.

Analysis queue: HammerCloud benchmarks
- WNs in FutureGrid (Chicago), SE at the University of Victoria
- Evaluates the performance and feasibility of analysis jobs in the cloud
- I/O intensive: remote data access from Victoria depends on the WAN bandwidth available between the sites
- Data access tested through WebDAV and GridFTP
- Initial results: 692/702 successful jobs

Production queue: in continuous operation for simulation jobs
- Low I/O requirements
- Several thousand jobs have been executed on over 100 VMs running at FutureGrid (Chicago) and a Synnefo cloud (UVic)
- Plan to scale up the number of running jobs as cloud resources become available

Outline: Data processing and workload management
- PanDA queues in the cloud
- Analysis Clusters in the Cloud (up next)
- Personal PanDA Analysis Queues in the Cloud

Analysis Clusters in the Cloud

A site should be able to easily deploy new analysis clusters on commercial or private cloud resources: an easy, user-transparent way to scale out jobs in the cloud is needed, so that scientists can spend their time analyzing data rather than doing system administration.

Goal: migrate the functionality of a physical Data Analysis Cluster (DAC) into the cloud. The services to support in the cloud are:
- Job execution and submission: Condor
- User management: LDAP
- ATLAS software distribution: CVMFS
- Web caching: squid

We evaluated cloud management tools that let us define these DACs in all their complexity: CloudCRV, StarCluster, Scalr. Scalr was found to have the most robust feature set and the most active community. A hedged contextualization sketch follows below.
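A hedged sketch of how a DAC worker node could be contextualized at boot: a cloud-init user-data script that brings up the four services listed above. Package names, the image id and the service commands are illustrative assumptions; in practice the deployment would be driven by a tool such as Scalr.

```python
import boto

USER_DATA = """#!/bin/bash
# Install the DAC services (package names are illustrative)
yum install -y condor openldap-clients cvmfs squid
service condor start     # job execution and submission
service squid start      # web cache in front of CVMFS
service autofs restart   # mounts /cvmfs for ATLAS software distribution
"""

conn = boto.connect_ec2()
conn.run_instances("ami-12345678",           # hypothetical DAC worker image
                   instance_type="m1.large",
                   user_data=USER_DATA)
```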

Outline: Data processing and workload management
- PanDA queues in the cloud
- Analysis Clusters in the Cloud
- Personal PanDA Analysis Queues in the Cloud (up next)

Personal PanDA Analysis Queues in the Cloud

- Enable users to access extra computing resources on demand, from private and commercial cloud providers
- Users reach the new cloud resources with minimal changes to the analysis workflow they already use on grid sites: they simply submit PanDA jobs to a virtual cloud site
- The virtual cloud site has no pilot factory associated with it: the user has to run PanDA pilots to retrieve their jobs
- A personal PanDA pilot retrieves only its owner's jobs and ignores all others (sketched below)
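A deliberately speculative sketch of the "personal pilot" behaviour: ask the PanDA server for work, restricted to jobs owned by this user. The endpoint URL and parameter names are hypothetical placeholders, not the real PanDA dispatcher API.

```python
import urllib
import urllib2

PANDA_GETJOB = "https://pandaserver.example.org/getJob"  # hypothetical endpoint

def get_personal_job(user_dn, queue="MY_CLOUD_QUEUE"):
    # Only jobs submitted by user_dn should be dispatched to this pilot
    data = urllib.urlencode({"siteName": queue, "prodUserID": user_dn})
    return urllib2.urlopen(PANDA_GETJOB, data).read()
```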

Cloud Factory

We need a simple tool to automate the above workflow:
- Sense queued jobs in the personal queue
- Instantiate sufficient cloud VMs via the cloud API
- Run the personal pilots
- Monitor the VMs (alive/zombie, busy/idle, X509 proxy status)
- Destroy VMs once they are no longer needed (zombies, expired proxies, idle machines)
A simplified sketch of the lifecycle loop follows.
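A minimal sketch of the lifecycle management described above, using boto's EC2 API for illustration. Booting VMs in response to queued jobs would follow the same pattern as the Cloud Scheduler sketch earlier; here the focus is the teardown side. The proxy-lifetime bookkeeping via an instance tag is a placeholder assumption.

```python
import boto

def proxy_seconds_left(vm):
    # Hypothetical: VMs report their remaining X509 proxy lifetime in a tag
    return int(vm.tags.get("proxy_timeleft", "0"))

def reap_vms(conn):
    for reservation in conn.get_all_instances():
        for vm in reservation.instances:
            if vm.state != "running":
                continue
            # An expired proxy means the pilot can no longer fetch jobs or
            # move data, so the VM is no longer useful and is destroyed
            if proxy_seconds_left(vm) <= 0:
                conn.terminate_instances(instance_ids=[vm.id])

reap_vms(boto.connect_ec2())
```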

Cloud Factory testing at LxCloud

- LxCloud: an OpenNebula testing instance with an EC2 API at CERN
- CVMFS for ATLAS releases; a CERN storage element as the data source
- HammerCloud stress tests were run to compare job performance on LxCloud and on the CERN LSF batch facility
- No difference in reliability; a small decrease in average performance, from 10.8 Hz in LSF to 10.3 Hz in LxCloud

Storage and Data Management

Data Access Tests

Goal: evaluate the different storage abstraction implementations that cloud platforms provide. Amazon EC2 offers at least three storage options:
- Simple Storage Service (S3)
- Elastic Block Store (EBS)
- Ephemeral store associated with a VM
Each layout has different cost-performance trade-offs that need to be analyzed.

Cloud storage performance on a 3-node PROOF farm: the EBS volume performs better than the ephemeral disk, but the ephemeral disk comes free with EC2 instances. Also studied: scaling of storage space and performance with the size of the analysis farm.

Xrootd Storage Cluster in the Cloud

Evaluated the write performance of a cluster whose data comes from outside the cloud: 2 data servers and 1 redirector node on Amazon EC2 large instances, backed by either Amazon ephemeral storage or Amazon EBS. Transfers used the Xrootd native copy program, set to fetch files from multiple sources.

- Most usable configuration: 2 ephemeral storage partitions joined together by Linux LVM (see the sketch below). Average transfer rate of 16 MB/s to one data server, around 45 MB/s across the three nodes. VM startup time increases due to assembling, formatting and configuring the storage.
- With 1 or 2 EBS partitions: average transfer rate under 12 MB/s, a large number of very slow transfers (< 1 MB/s), and very high storage costs at current Amazon rates.
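A sketch of the "2 ephemeral partitions joined together by Linux LVM" setup: the two instance-store devices are combined into a single volume group and one logical volume for the xrootd data directory. Device names vary by instance type and are assumptions here; this assembly step is what lengthens VM startup time.

```python
import subprocess

DISKS = ["/dev/xvdb", "/dev/xvdc"]   # assumed ephemeral device names

def assemble_storage(mount_point="/data/xrd"):
    subprocess.check_call(["pvcreate"] + DISKS)               # physical volumes
    subprocess.check_call(["vgcreate", "vg_xrd"] + DISKS)     # one volume group
    subprocess.check_call(["lvcreate", "-l", "100%FREE",
                           "-n", "data", "vg_xrd"])           # span both disks
    subprocess.check_call(["mkfs.ext4", "/dev/vg_xrd/data"])  # format
    subprocess.check_call(["mkdir", "-p", mount_point])
    subprocess.check_call(["mount", "/dev/vg_xrd/data", mount_point])

assemble_storage()
```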

Future Evaluation of Object Stores

Integration of Amazon S3 with ATLAS DDM: the DDM team will demonstrate the compatibility of their future system with the S3-API implementations of various storage providers (Huawei, OpenStack Swift and Amazon).

Main questions to be answered:
- How to store, retrieve and delete data from an S3 storage
- How to combine the data organization models: the S3 bucket/object model vs. the ATLAS dataset/file model (one possible mapping is sketched below)
- How to integrate cloud storage with the existing grid middleware
- How to integrate the authentication and authorization mechanisms
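One possible, assumed mapping of the ATLAS dataset/file model onto the S3 bucket/object model: one bucket per dataset, one object per file. Shown with boto's S3 API against a generic S3-compatible endpoint; the endpoint name and the bucket-naming convention are illustrative assumptions, not the DDM design.

```python
import boto
import boto.s3.connection

conn = boto.connect_s3(
    host="s3.example.org",   # hypothetical S3-compatible endpoint (Swift, Huawei, AWS)
    calling_format=boto.s3.connection.OrdinaryCallingFormat())

def store(dataset, filename, local_path):
    bucket = conn.create_bucket(dataset.lower())   # dataset -> bucket
    key = bucket.new_key(filename)                 # file    -> object
    key.set_contents_from_filename(local_path)

def retrieve(dataset, filename, local_path):
    key = conn.get_bucket(dataset.lower()).get_key(filename)
    key.get_contents_to_filename(local_path)

def delete(dataset, filename):
    conn.get_bucket(dataset.lower()).delete_key(filename)
```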

Conclusions

Conclusions and cloud futures

Data processing:
- Many activities are reaching the point where we can start getting feedback from users. In the coming months we should determine what we can deliver in production, start focusing and eliminating options, and improve automation and monitoring.
- We still suffer from a lack of standardization among providers.

Cloud storage:
- This is the hard part. We look forward to good progress in caching (Xrootd in the cloud).
- Some "free" S3 endpoints are only now coming online, so effective R&D is just starting; an ATLAS DDM S3 evaluation and integration proposal was written recently.

Support grid sites that want to offer private cloud resources:
- Develop guidelines and best practices
- Good examples already exist, e.g. LxCloud, PIC, BNL and others

Thank you for your attention.
Fernando H. Barreiro Megino (CERN IT-ES), CHEP 2012, New York, May 2012