Analysis support issues, Virtual analysis facilities (i.e. Tier 3s in the cloud) Doug Benjamin (Duke University), on behalf of Sergey Panitkin, Val Hendrix, Henrik Ohman, Peter Onyisi

Presentation transcript:

Analysis support issues, Virtual analysis facilities (i.e. Tier 3s in the cloud) Doug Benjamin (Duke University), on behalf of Sergey Panitkin, Val Hendrix, Henrik Ohman, Peter Onyisi

How do ATLAS physicists work today?
User surveys were conducted (US ATLAS: 145 responses; ATLAS-wide: 267)
[Charts: ATLAS-wide and US ATLAS survey results]

Difference in cultures: ATLAS vs US ATLAS (local usage)
US analyzers rely more heavily on local resources than their colleagues

What data do people use now?
US and ATLAS-wide analyzers are not that different

What types of D3PDs are used?
US ATLAS plot shown; the ATLAS-wide plot is not so different

How do people get their data?
dq2-get is very popular among analyzers
o Seen both in the surveys and in dq2 traces
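For context, fetching a dataset with dq2-get amounts to a single command; below is a minimal sketch that wraps it from Python. The dataset name is a placeholder, and a working DQ2 client setup plus a valid grid proxy are assumed.

```python
# Minimal sketch: fetch an ATLAS dataset locally with dq2-get.
# Assumes the DQ2 client tools are set up and a valid grid proxy exists;
# the dataset name is a placeholder.
import subprocess

dataset = "user.someone.SomeAnalysis.D3PD.v1/"   # placeholder dataset name
subprocess.check_call(["dq2-get", dataset])      # downloads the dataset to the current directory
```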

Moving toward the future
2014/2015 will be an important time for ATLAS and US ATLAS
o The machine will be at full energy
o ATLAS wants a trigger rate to tape of ~1 kHz (vs ~400 Hz now)
o We need to produce results as fast as before (maybe faster)
o ARRA-funded computing will be 5 years old and beginning to age out
o US funding of LHC computing is likely flat at best
Need to evolve the ATLAS computing model
o Need at least a factor of 5 improvement (more data, and we need to go faster)
Need to evolve how people do their work
o Fewer local resources will be available in the future
o US ATLAS users need to move towards central resources
o But… we need to avoid a loss of functionality and performance

Future of Tier 3 sites
Why do research groups like local computing?
o Groups like the clusters that they have because they are now easy to maintain (we succeeded!)
o The clusters give them instant access to their resources
o They have unlimited quota to use what they want, for whatever they want
We need a multi-faceted approach to the Tier 3 evolution
Need to allow and encourage sites with funding to continue to refresh their existing clusters
o We do not want to give the impression in any way that we would not welcome additional DOE/NSF funding being used in this way
o What about cast-off compute nodes from Tier 1/Tier 2 sites?
Need to use beyond-pledge resources for the benefit of all users and groups of a given country
o If we cannot get sufficient funding to refresh the existing Tier 3s, beyond-pledge resources are a key component of the solution

New technologies on the horizon
Federated storage
o With caching, Tier 3s can really reduce data management
WAN data access (a minimal read sketch follows below)
o User code must be improved to reduce latencies
o Decouples storage and CPU
Analysis queues using beyond-pledge resources
o Can be used to put beyond-pledge resources to work for data analysis, not just MC production
Cloud computing
o Users are becoming more comfortable with cloud computing (gmail, iCloud)
o Virtual Tier 3s
All are coupled; done right, they should allow us to do more with the resources that we will have
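To illustrate the WAN data access point above, the sketch below opens a D3PD directly over an xrootd federation with PyROOT and turns on the TTree read cache to cut down on round trips. The redirector hostname, file path and tree name are placeholders rather than real ATLAS endpoints.

```python
# Minimal sketch of WAN data access via an xrootd federation.
# Hostname, path and tree name are illustrative placeholders.
import ROOT

url = "root://fax-redirector.example.org//atlas/rucio/user.someone/my_d3pd.root"
f = ROOT.TFile.Open(url)                 # read directly over the WAN, no local copy
tree = f.Get("physics")                  # typical D3PD tree name; adjust for your sample

# A generous TTree cache reduces the number of WAN round trips, which is the
# kind of latency reduction user code needs for efficient remote reads.
tree.SetCacheSize(100 * 1024 * 1024)     # 100 MB read cache
tree.AddBranchToCache("*", True)

print("entries:", tree.GetEntries())
```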

Virtual analysis clusters (Tier 3s)
R&D effort by Val Hendrix, Sergey Panitkin, DB, plus a qualification task for the student Henrik Ohman
o It is a part-time effort for all of us, which is an issue for progress
o Peter Onyisi has just joined the University of Texas as a professor and is starting up
We have been working on a variety of clouds (by necessity, not necessarily by choice)
o Public clouds: EC2 (micro spot instances; see the sketch below), Google (while still free)
o Research clouds: FutureGrid (we have an approved project)
o We are scrounging resources where we can (not always the most efficient)
Sergey and I have both independently been working on data access/handling using Xrootd and the federation
We could use a stable private cloud testbed.
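As a concrete but purely illustrative example of the EC2 micro spot instances mentioned above, the sketch below requests a few of them with the boto3 Python library; the AMI ID, bid price, key pair and region are placeholders, and boto3 itself post-dates this work, so treat it as a sketch rather than the setup actually used.

```python
# Illustrative sketch: request cheap EC2 micro spot instances for a virtual Tier 3 worker pool.
# AMI ID, bid price, key pair and region are placeholders; adjust to your own account.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
response = ec2.request_spot_instances(
    SpotPrice="0.01",                  # maximum bid in USD per instance-hour
    InstanceCount=4,                   # small test pool of workers
    Type="one-time",
    LaunchSpecification={
        "ImageId": "ami-00000000",     # placeholder worker-node image
        "InstanceType": "t1.micro",
        "KeyName": "my-keypair",       # placeholder SSH key pair
    },
)
for req in response["SpotInstanceRequests"]:
    print("spot request:", req["SpotInstanceRequestId"], req["State"])
```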

Virtual analysis clusters (cont.)
PanDA is used for workload management in most cases (PROOF is the only exception)
o Perhaps we can use PROOF on Demand
Open issues
o Configuration and contextualization
- We are collaborating with CERN and BNL on puppet; the outlook looks promising
o What is the scale of the virtual clusters?
- Personal analysis cluster
- Research group analysis cluster (institution level)
- Multi-institution analysis cluster (many people, several groups, up to country-wide)
o Proxies for the PanDA pilots (personal or robotic); a proxy-creation sketch follows below
o What are the latencies incurred by this dynamic system? Spin-up and tear-down time?
o What level of data reuse do we want (i.e. caching)?
o How is the data handled in and out of the clusters?
- Initial activity used federated storage as the source
o What is the performance penalty of virtualization?
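The pilot-proxy item above comes down in practice to creating a VOMS proxy before pilots are submitted. A minimal sketch is shown below; it assumes a grid UI with the voms-proxy-init client installed and a personal certificate, and a robotic certificate would follow the same pattern with its own certificate and key files.

```python
# Minimal sketch: create a VOMS proxy for ATLAS before launching PanDA pilots.
# Assumes voms-proxy-init is installed and a valid user certificate is in place;
# a robotic certificate would use the same commands with its own cert/key files.
import subprocess

subprocess.check_call([
    "voms-proxy-init",
    "-voms", "atlas",      # request the ATLAS VOMS attribute
    "-valid", "96:00",     # 96-hour lifetime for long-running pilots
])
subprocess.check_call(["voms-proxy-info", "-all"])   # verify the proxy and its extensions
```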

Google Compute Engine
During the initial free trial period Sergey got a project approved
o He was able to extend it through at least into December
o A review next week on the progress and usage
o We would like to extend it further
Current status
o A variety of VMs, mostly based on CentOS 6 x86_64 (instance creation is sketched below)
o Have a prototype SL 5.8 image (with lots of issues)
o Have 10 TB of persistent storage, up to 1000 instances or 1000 cores
PROOF cluster established
o 8-core VMs with TB-scale ephemeral storage
o Over 60 worker nodes, more than 500 cores
o CentOS 6 image used
o Used primarily by colleagues at BNL
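Creating one of these worker VMs programmatically could look roughly like the sketch below, which uses the Compute Engine REST API through the google-api-python-client library; the project name, zone, machine type and image are placeholders, not the values of the actual project.

```python
# Illustrative sketch: create a single GCE worker VM through the Compute Engine API.
# Project, zone, image and machine type are placeholders, not the real project's values.
from googleapiclient import discovery

project, zone = "my-atlas-project", "us-central1-a"      # placeholders
compute = discovery.build("compute", "v1")               # uses application default credentials

config = {
    "name": "proof-worker-001",
    "machineType": "zones/%s/machineTypes/n1-standard-8" % zone,   # 8-core worker
    "disks": [{
        "boot": True,
        "autoDelete": True,
        "initializeParams": {
            # placeholder CentOS 6 based worker image
            "sourceImage": "projects/%s/global/images/centos6-worker" % project,
        },
    }],
    "networkInterfaces": [{
        "network": "global/networks/default",
        "accessConfigs": [{"type": "ONE_TO_ONE_NAT", "name": "External NAT"}],
    }],
}

operation = compute.instances().insert(project=project, zone=zone, body=config).execute()
print("insert operation:", operation["name"])
```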

Google Compute Engine (cont.)
Analysis cluster
o Goal: run PanDA jobs producing D3PDs
o Storage will be provided by BNL space tokens
- A local site mover will copy files to the VMs for processing (sketched below)
- Output is directed back to BNL
Have a robotic certificate (with the needed production VOMS extensions)
o Tested a CentOS 6 VM
- Could not use the image to clone other VMs and connect via ssh
- It did contain the OSG worker-node middleware and the required libraries to pass the node validation tests
- Will need to make a new version to fix the non-functioning ssh
Need to validate D3PD production on the SL 6 platform (not AOD production)
o CVMFS established and tested in the cluster
o Condor installed but not fully tested
o APFv2 needs to be installed and configured
- Could use help with the queue definitions in schedconfig
o Within the last 7 days we received a prototype SL 5.8 image from Google
- Added the needed libraries and configurations, but…
- Had to install and hack the code used to create cloned images
- SSH access does not work --- might have to
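The stage-in/stage-out flow mentioned above (copy the input from BNL to the VM, run the job, send the output back) could be sketched roughly as follows with xrdcp; the xrootd endpoint and file paths are placeholders and not the actual site-mover configuration.

```python
# Rough sketch of the stage-in / stage-out flow for an analysis job on a GCE VM.
# The xrootd endpoint and file paths are placeholders, not the real BNL configuration.
import subprocess

BNL_DOOR = "root://xrootd-door.example.bnl.gov:1094"   # placeholder xrootd endpoint

def stage_in(remote_path, local_path):
    """Copy one input file from BNL storage to the VM's local disk."""
    subprocess.check_call(["xrdcp", BNL_DOOR + remote_path, local_path])

def stage_out(local_path, remote_path):
    """Copy the produced D3PD back to a BNL space-token area."""
    subprocess.check_call(["xrdcp", local_path, BNL_DOOR + remote_path])

stage_in("/atlas/userdisk/user.someone/input.AOD.root", "/tmp/input.root")
# ... run the D3PD-making job on /tmp/input.root, writing /tmp/output.D3PD.root ...
stage_out("/tmp/output.D3PD.root", "/atlas/scratchdisk/user.someone/output.D3PD.root")
```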

Proof on Google Compute Engine (Sergey Panitkin)
A T3-type PROOF/Xrootd cluster on Google Compute Engine: 64 nodes with ~500 cores and 180 TB of ephemeral storage. The cluster is intended for ATLAS D3PD-based physics analysis. Studying data transfer from ATLAS using dq2, xrdcp and also direct read. Plan to study the scalability of the PROOF workload distribution system at a very large number of nodes.
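Running a D3PD analysis against a PROOF cluster like this could look roughly like the sketch below; the master hostname, input files and selector are placeholders.

```python
# Minimal sketch of running a D3PD analysis on a remote PROOF cluster.
# Master hostname, input file pattern and selector name are placeholders.
import ROOT

proof = ROOT.TProof.Open("proofuser@proof-master.example.org")   # connect to the PROOF master

chain = ROOT.TChain("physics")                                   # typical D3PD tree name
chain.Add("root://proof-master.example.org//data/d3pd/sample_*.root")
chain.SetProof()                                                 # route Process() through PROOF
chain.Process("MySelector.C+")                                   # user-supplied TSelector
```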

GCE/FutureGrid (Henrik Ohman)
Node configuration
o Working with puppet for node configuration (cvmfs, xrootd, condor, panda, etc.); a masterless sketch follows below
o Considering the benefits and drawbacks of masterful vs. masterless setups: what would be easiest for a researcher who wants to set up their own cluster in the cloud?
Cluster orchestration
o Investigating whether puppet can also be used for scaling the size of the cluster
- (node_gce)
- (gce_compute)
This work will be usable on GCE, FutureGrid (CloudStack) and Amazon Web Services
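A masterless setup of the kind weighed above essentially amounts to shipping the manifests and modules to each node during contextualization and running puppet apply locally; a minimal sketch with hypothetical paths is shown below.

```python
# Minimal sketch of a masterless puppet run on a freshly booted cloud node.
# The module and manifest paths are hypothetical; the real ones depend on how
# contextualization delivers the configuration to the VM.
import subprocess

subprocess.check_call([
    "puppet", "apply",
    "--modulepath", "/opt/t3-config/modules",   # modules for cvmfs, xrootd, condor, panda, ...
    "/opt/t3-config/manifests/worker.pp",       # node-type manifest for an analysis worker
])
```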

Amazon Cloud (AWS) - Val Hendrix, Henrik Ohman, DB
Beate's LBNL group received a grant to run physics analysis on the Amazon cloud
Purpose: "the analysis using Amazon resources will start from skimmed/slimmed D3PDs and go from there. They would eventually like to do the analysis from DAODs, but I think that would be too expensive in terms of CPU and disk for this particular project." -- Mike Hance
There is a limited sum of grant money, and they would like to find another place to do their analysis once the Amazon money runs out.

Future Grid (Peter Onyisi)
The UT-Austin ATLAS group is interested in having a "personal cluster" solution deployed on university-hosted cloud resources (the TACC Alamo FutureGrid site).
o We have our own CentOS-derived disk image that delivers basic functionality, but will migrate to more "mainline" images soon
We are looking at a very interactive use case (more towards PROOF than PanDA)
o The really heavy lifting will use T1/T2 resources
o Good integration with local storage resources is essential
o Remote access to D3PDs via federated xrootd is also in the plan

Conclusions
New technologies are maturing that can be used and that have the potential to be transformative
Additional resources from the facilities would go a long way toward helping
Virtual Tier 3s are making slow progress
o New labor has been added