Evolving Data Logistics Needs of the LHC Experiments (Paul Sheldon, Vanderbilt University)

Presentation transcript:

Evolving Data Logistics Needs of the LHC Experiments
Paul Sheldon, Vanderbilt University

Outline
- To meet their computing and storage needs (and their socio-political and funding needs), the LHC experiments deployed a globally distributed "grid" computing model.
- They successfully demonstrated grid concepts in production at an extremely large scale, enabling significant science.
- Short-term (2 years) and longer-term (10 years) projects will require them to significantly evolve this computing model: wide-area distributed storage, and agile, intelligent network-based solutions (DYNES and ANSE).
- Thanks to Artur Barczyk, Ken Bloom, Eric Boyd, Oli Gutsche, Ian Fisk, Harvey Newman, Jim Shank, Jason Zurawski, and others for slides and ideas.

A High Profile Success
The LHC would not have been successful without grid computing, data logistics, and robust high-performance networks.

Distributed Compute Resources and People
CMS Collaboration: 38 countries, >163 institutes.

Distributed Compute Resources and People
- The LHC experiments decided early on that it was not possible, or even desirable, to put all or even most computing resources at CERN: sociology, politics, funding.
- Physics is done in small groups, which are themselves geographically distributed.
- To maximize the quality and rate of scientific discovery, all physicists must have equal ability to access and analyze the experiment's data.
- The response: adopt a grid-like computing model, using those pieces of grid concepts and tools that were "ready" and needed.

Tiered Computing Model
Tier 0: LHC raw data is logged to the "Tier 0" compute center at CERN. The Tier 0 does initial processing, verification, and data-quality tests.

Tiered Computing Model
Tier 1: Data is then distributed over a 20 Gb/s dedicated network to the Tier 1 centers, the top-level centers for each region of the world (the US CMS Tier 1 is at Fermilab). These process the data to create the "data products" that physicists will use. (Diagram: not all Tier 1s are shown.)

Tier 2 and Tier 3 Centers
- 8 regional Tier 2s in the US, 53 worldwide. Tier 3s serve individual universities.
- Tier 2s and 3s provide the computing and storage used by physicists to analyze the data and produce physics.
- Physicists work in small groups, but those groups may be globally distributed. (Diagram: data flow through the tiers.)
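To make the tier structure of the last three slides concrete, here is a minimal sketch that models the hierarchy as a small Python data structure and traces how a dataset flows from the Tier 0 down to a Tier 2. The site names and the topology details are illustrative assumptions, not the real WLCG configuration.

```python
# Illustrative sketch of the tiered LHC computing model described above.
# Site names and fan-out are invented for illustration; this is not the real WLCG topology.

TIERS = {
    "T0_CERN":       {"tier": 0, "feeds": ["T1_FNAL", "T1_KIT"]},           # raw data logging, first-pass processing
    "T1_FNAL":       {"tier": 1, "feeds": ["T2_Vanderbilt", "T2_Nebraska"]}, # regional center, makes data products
    "T1_KIT":        {"tier": 1, "feeds": ["T2_DESY"]},
    "T2_Vanderbilt": {"tier": 2, "feeds": []},                               # user analysis and simulation
    "T2_Nebraska":   {"tier": 2, "feeds": []},
    "T2_DESY":       {"tier": 2, "feeds": []},
}

def distribution_path(source, destination, graph=TIERS, path=None):
    """Return the chain of sites a dataset follows from source to destination, if any."""
    path = (path or []) + [source]
    if source == destination:
        return path
    for child in graph[source]["feeds"]:
        found = distribution_path(child, destination, graph, path)
        if found:
            return found
    return None

if __name__ == "__main__":
    print(distribution_path("T0_CERN", "T2_Vanderbilt"))
    # ['T0_CERN', 'T1_FNAL', 'T2_Vanderbilt']
```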

Analysis Projects are Mostly Self-Organized
- Worker nodes at each T2/T3 site present a batch system; access is through grid interfaces and certificates (no local access).
- User analysis workflows are built and submitted by "grid-enabled" software frameworks (CRAB, PanDA), which break tasks into jobs, decide where to run them, monitor them, and retrieve the output when they are done (see the sketch after this slide).
- Data tethering: originally, jobs could only run at a site if it hosted the required data locally. The evolving model (more later) is relaxing this requirement.
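The sketch below captures, at a conceptual level, what such a framework does: split a task into jobs, pick sites that host the data (the original "tethered" model), run, and collect output. It is not the actual CRAB or PanDA API; all class, function, and site names are invented for illustration.

```python
# Conceptual sketch of a "grid-enabled" workflow framework (in the spirit of CRAB/PanDA):
# split a task into jobs, choose sites, run, and collect output.
# Names and behavior are invented for illustration; this is not the real CRAB/PanDA API.
import random

class AnalysisTask:
    def __init__(self, dataset_files, files_per_job=10):
        self.dataset_files = dataset_files
        self.files_per_job = files_per_job

    def split(self):
        """Break the task into jobs, each processing a slice of the input files."""
        f = self.dataset_files
        return [f[i:i + self.files_per_job] for i in range(0, len(f), self.files_per_job)]

def choose_site(sites_hosting_data):
    # Original ("tethered") model: only sites that host the data locally are eligible.
    return random.choice(sites_hosting_data)

def run_task(task, sites_hosting_data):
    outputs = []
    for job_files in task.split():
        site = choose_site(sites_hosting_data)
        # In reality the job goes to the site's batch system via grid interfaces and certificates;
        # here we just pretend it ran and record one output record per job.
        outputs.append({"site": site, "n_files": len(job_files), "status": "done"})
    return outputs

if __name__ == "__main__":
    task = AnalysisTask([f"file_{i}.root" for i in range(95)], files_per_job=20)
    for out in run_task(task, ["T2_US_Vanderbilt", "T2_US_Nebraska"]):
        print(out)
```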

How is it Working?
200K-300K LHC jobs are launched per day. (Chart: CPU hours, in HS06 hours, delivered by CMS Tier 1 and Tier 2 sites; 1 HS06-hour is roughly 1 GFLOP-hour.)

Data Movement
- Some data is strategically placed based on high-level "management" decisions. Users can also request that datasets be placed at "friendly" sites where resources have been made available to their analysis group (they can whitelist those sites in CRAB/PanDA).
- A system for organized transfers (PhEDEx in CMS) handles destination-based transfer requests ("transfer dataset X to site A"). The system picks source sites based on link quality and load; several source sites might be used. A sketch of this selection logic follows this slide.
- PhEDEx schedules GridFTP transfers between sites through several layers of software. These layers protect sites, enable multiple streams, and take care of authentication (Globus tools, SRM, FTS, etc.).
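As a rough illustration of destination-based source selection, the sketch below scores candidate source sites by recent link quality and current load and spreads the blocks of a dataset across the best ones. This is not PhEDEx's actual algorithm or API; the metrics, weights, and site names are invented.

```python
# Toy illustration of destination-based source selection for an organized transfer system.
# This is NOT the real PhEDEx algorithm; link metrics, weights, and site names are invented.

def score(link):
    """Higher is better: favor good recent transfer quality and a short queue."""
    return link["quality"] / (1.0 + link["queued_tb"])

def assign_sources(dataset_blocks, links_to_destination, max_sources=2):
    """Pick up to max_sources source sites and spread the dataset blocks across them."""
    ranked = sorted(links_to_destination, key=score, reverse=True)[:max_sources]
    assignment = {link["source"]: [] for link in ranked}
    for i, block in enumerate(dataset_blocks):
        src = ranked[i % len(ranked)]["source"]   # simple round-robin over the best links
        assignment[src].append(block)
    return assignment

if __name__ == "__main__":
    links = [
        {"source": "T1_US_FNAL",   "quality": 0.98, "queued_tb": 5.0},
        {"source": "T2_DE_DESY",   "quality": 0.90, "queued_tb": 0.5},
        {"source": "T2_UK_London", "quality": 0.60, "queued_tb": 0.2},
    ]
    blocks = [f"block_{i}" for i in range(6)]
    print(assign_sources(blocks, links))   # blocks split across the two best-scoring sources
```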

How is it Working?
(Chart: 2012 aggregate throughput, in MB/s, on all Tier-N links. Averages of roughly 5 GB/s for ATLAS and 1 GB/s for CMS, about 180 PB/year combined.)
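As a quick arithmetic check on those figure numbers, a combined sustained rate of about 6 GB/s over a year does indeed come out near the quoted 180 PB/year:

```python
# Quick check: ~5 GB/s (ATLAS) + ~1 GB/s (CMS) sustained for a year, expressed in petabytes.
combined_gb_per_s = 5 + 1
seconds_per_year = 365 * 24 * 3600
petabytes = combined_gb_per_s * seconds_per_year / 1e6   # 1 PB = 1e6 GB
print(f"{petabytes:.0f} PB/year")   # ~189 PB/year, consistent with the ~180 PB/year quoted
```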

Model Has Evolved
Originally the model was strictly hierarchical. In addition to strategic pre-placement of data sets, there are now:
- Meshed data flows: any site can use any other site to get data.
- Dynamic data caching: analysis sites pull datasets from other sites "on demand".
- Remote data access: jobs executing locally access data at remote sites in quasi-real time (see the sketch below).
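The following is a conceptual outline of the access pattern behind dynamic caching and remote reads: try local storage first, fall back to a remote replica, and optionally cache what was fetched for later jobs. It is not the experiments' actual data-federation software; the paths, the helper, and the use of plain filesystem calls as a stand-in for remote protocols are all illustrative assumptions.

```python
# Conceptual outline of "cache locally, else read remotely" data access.
# Not the experiments' actual federation software; paths and helpers are invented, and
# os.path/shutil calls stand in for real remote-access protocols.
import os
import shutil

LOCAL_CACHE = "/local/cache"

def open_input(logical_name, remote_replicas, cache_on_miss=True):
    """Return a usable path for the requested file, fetching from a remote replica if needed."""
    local_path = os.path.join(LOCAL_CACHE, logical_name)
    if os.path.exists(local_path):
        return local_path                      # dynamic-cache hit: data already pulled earlier
    for remote_path in remote_replicas:        # quasi-real-time remote access, nearest replica first
        if os.path.exists(remote_path):        # stand-in for opening the file over a remote protocol
            if cache_on_miss:
                os.makedirs(os.path.dirname(local_path), exist_ok=True)
                shutil.copy(remote_path, local_path)   # populate the cache for later jobs
                return local_path
            return remote_path                 # stream directly without caching
    raise FileNotFoundError(logical_name)
```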

Near-Term Data Requirements: LHC Run 2
LHC Run 2 (begins 2015): data transfer volume 2-4 times larger.
- Energy increase from 8 to 13 TeV: more complex events, larger event sizes. Trigger rate increases to 1 kHz.
- Dynamic data placement and cache release: replicate samples in high demand automatically, and release cached replicas automatically when demand decreases (a toy policy is sketched below).
- Establish a CMS data federation for all CMS files on disk: allow remote access of files, with local caching as the job runs, so processing can start before samples arrive at the destination site.
- Higher demand for sustained network bandwidth, filling the gaps between bursts.
- Increase the diversity of resources: get access to more resources.
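A minimal sketch of the popularity-driven placement idea described above, assuming a simple access counter per dataset. The thresholds, bookkeeping, and dataset names are invented; this is not the experiments' actual dynamic data placement system.

```python
# Toy popularity-driven replication / cache-release policy, in the spirit of the slide above.
# Thresholds and bookkeeping are invented; this is not CMS's actual dynamic placement system.

ADD_REPLICA_ABOVE = 100    # accesses per week that trigger an extra replica
RELEASE_BELOW = 5          # accesses per week below which cached replicas are released
MIN_REPLICAS = 1           # always keep at least one custodial copy

def update_placement(datasets):
    """datasets: {name: {"accesses_last_week": int, "replicas": int}} -> list of actions."""
    actions = []
    for name, info in datasets.items():
        if info["accesses_last_week"] > ADD_REPLICA_ABOVE:
            info["replicas"] += 1
            actions.append(("replicate", name))
        elif info["accesses_last_week"] < RELEASE_BELOW and info["replicas"] > MIN_REPLICAS:
            info["replicas"] -= 1
            actions.append(("release_cache", name))
    return actions

if __name__ == "__main__":
    catalog = {
        "/HotHiggsSample": {"accesses_last_week": 250, "replicas": 2},
        "/ColdOldSample":  {"accesses_last_week": 1,   "replicas": 3},
        "/CustodialOnly":  {"accesses_last_week": 0,   "replicas": 1},
    }
    print(update_placement(catalog))
    # [('replicate', '/HotHiggsSample'), ('release_cache', '/ColdOldSample')]
```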

10 Year Projections
Requirements in 10 years? Look 10 years back… (Source: Ian Fisk and Jim Shank.)

Fisk and Shank 10 Year Projections
- Processing capacity has increased by a factor of 30. This is essentially what would be expected from Moore's law with a two-year doubling cycle (five doublings in ten years gives a factor of 32), which says we spent similar amounts.
- Storage and networking have both grown by a large factor as well, driven by the higher trigger rate and roughly 10 times larger event sizes.
- In their judgment, future programs will likely show similar growth. "How to make better use of resources as the technology changes?"
(Fisk and Shank, Snowmass Planning Exercise for High Energy Physics.)

Fisk/Shank View of the Future (figures)

DYNES: Addressing Growing/Evolving Network Needs
NSF MRI-R2: DYnamic NEtwork System (DYNES), a nationwide cyber-instrument spanning ~40 US universities and ~14 Internet2 connectors. It extends Internet2/ESnet dynamic circuit tools (OSCARS, …).
Who is it? Project team: Internet2, Caltech, Univ. of Michigan, and Vanderbilt Univ.; a community of regional networks and campuses; the LHC, the astrophysics community, OSG, WLCG, and other virtual organizations.
What are the goals? Support large, long-distance scientific data flows in the LHC, in other data-intensive science (LIGO, Virtual Observatory, LSST, …), and in the broader scientific community. Build a distributed virtual instrument at sites of interest to the LHC but available to the R&E community generally.

Why DYNES?
- The LHC exemplifies the growing demand for research bandwidth. Its "tiered" computing infrastructure already encompasses 140+ sites moving data at an average rate of ~5 GB/s, and LHC data volumes and transfer rates are expected to expand by an order of magnitude over the next few years. US LHCNet, for example, is expected to scale up its capacity substantially by 2014 between its points of presence in NYC, Chicago, CERN, and Amsterdam.
- Network usage on this scale can only be accommodated with planning, an appropriate architecture, and nationwide community involvement: by the LHC groups at universities and labs; by campuses, regional, and state networks connecting to Internet2; and by ESnet, US LHCNet, NSF/IRNC, and major networks in the US and Europe.

DYNES Mission/Goals
- Provide access to Internet2's dynamic circuit network in a manageable way, with fair sharing, based on a "hybrid" network architecture utilizing both routed and circuit-based paths.
- Enable such circuits with bandwidth guarantees across multiple US network domains and to Europe; this will require scheduling services at some stage.
- Build a community with high-throughput capability using standardized, common methods.

Why Dynamic Circuits?
Internet2's and ESnet's dynamic circuit services provide:
- Increased effective bandwidth capacity and reliability of network access, by isolating traffic from standard IP "many small flows" traffic.
- Guaranteed bandwidth as a service, by building a system to automatically schedule and implement virtual circuits.
- Improved ability for scientists to access network measurement data for network segments end-to-end, via perfSONAR monitoring.
Static "nailed-up" circuits will not scale, general-purpose network firewalls are incompatible with large-scale science dataflows, and implementing many high-capacity ports on traditional routers would be expensive.

DYNES Dataflow
A user requests a transfer via a DYNES Transfer URL. DYNES-provided equipment at end sites and regionals (Inter-Domain Controllers, switches, FDT servers) carries the flow. The DYNES Agent provides the functionality to request a circuit instantiation, initiate and manage the data transfer, and terminate the dynamically provisioned resources. (Diagram: Site A to Site Z over a dynamic circuit. A sketch of this agent logic follows.)
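Below is a conceptual sketch of that agent workflow: provision a dynamic circuit, run the transfer, then release the circuit. The `CircuitController` class and `start_fdt_transfer` helper are invented stand-ins, not the real OSCARS or FDT APIs, and the bandwidth and duration values are arbitrary.

```python
# Conceptual sketch of the DYNES Agent workflow described on the slide:
# provision a dynamic circuit, run the transfer, then release the circuit.
# CircuitController and start_fdt_transfer are invented stand-ins, not real OSCARS/FDT APIs.
import time

class CircuitController:
    """Stand-in for an Inter-Domain Controller client."""
    def reserve(self, src_site, dst_site, bandwidth_gbps, duration_s):
        print(f"reserving {bandwidth_gbps} Gb/s {src_site} -> {dst_site} for {duration_s}s")
        return {"circuit_id": "demo-circuit-001"}

    def release(self, circuit):
        print(f"releasing {circuit['circuit_id']}")

def start_fdt_transfer(circuit, source_url, dest_url):
    """Stand-in for launching a server-to-server transfer over the provisioned circuit."""
    print(f"transferring {source_url} -> {dest_url} over {circuit['circuit_id']}")
    time.sleep(0.1)   # pretend the transfer runs here
    return "done"

def dynes_agent(request):
    controller = CircuitController()
    circuit = controller.reserve(request["src_site"], request["dst_site"],
                                 bandwidth_gbps=10, duration_s=3600)
    try:
        return start_fdt_transfer(circuit, request["source_url"], request["dest_url"])
    finally:
        controller.release(circuit)   # always tear down the dynamically provisioned resources

if __name__ == "__main__":
    dynes_agent({"src_site": "SiteA", "dst_site": "SiteZ",
                 "source_url": "fdt://siteA/data/set1", "dest_url": "fdt://siteZ/data/set1"})
```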

DYNES Topology
(Map of deployment sites, based on applications received from potential sites.)

ANSE: Integrating DYNES into LHC Computing Frameworks
NSF CC-NIE: Advanced Network Services for Experiments (ANSE). Project team: Caltech, Michigan, Texas-Arlington, and Vanderbilt.
- Goal: enable strategic workflow planning that includes network capacity, as well as CPU and storage, as a co-scheduled resource.
- Path forward: integrate advanced network-aware tools with the mainstream production workflows of ATLAS and CMS, using in-depth monitoring and network provisioning. Complex workflows are a natural match, and a challenge, for SDN.
- Exploit state-of-the-art progress in high-throughput long-distance data transport, network monitoring, and control.

ANSE Methodology
Use agile, managed bandwidth for tasks with levels of priority, along with CPU and disk storage allocation:
- Define goals for time-to-completion, with a reasonable chance of success.
- Define metrics of success, such as the rate of work completion with reasonable resource-usage efficiency.
- Define and achieve a "consistent" workflow. Dynamic circuits are a natural match.
Process-oriented approach: measure resource usage and job/task progress in real time. If resource use or the rate of progress is not as requested/planned, diagnose, analyze, and decide if and when task re-planning is needed. Classes of work are defined by resources required, estimated time to complete, priority, etc. (A toy co-scheduling sketch follows.)
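To make the co-scheduling idea concrete, the sketch below estimates the bandwidth needed to finish a transfer task by its deadline and decides whether to request a circuit or re-plan. The efficiency factor, the numbers, and the "request_circuit"/"replan" actions are hypothetical assumptions, not ANSE's actual implementation.

```python
# Toy illustration of network-aware task planning as described above: estimate the bandwidth
# needed to meet a time-to-completion goal, then decide whether to co-schedule a circuit or re-plan.
# The actions, efficiency factor, and numbers are hypothetical, not ANSE's actual implementation.

def required_gbps(remaining_tb, seconds_to_deadline, efficiency=0.8):
    """Bandwidth needed to move the remaining data before the deadline, padded for overheads."""
    bits = remaining_tb * 8e12
    return bits / seconds_to_deadline / 1e9 / efficiency

def plan_task(task, available_circuit_gbps):
    need = required_gbps(task["remaining_tb"], task["seconds_to_deadline"])
    if need <= available_circuit_gbps:
        return ("request_circuit", round(need, 1))   # co-schedule bandwidth with CPU/disk
    return ("replan", round(need, 1))                # goal not achievable: adjust priority,
                                                     # deadline, or task size and try again

if __name__ == "__main__":
    task = {"remaining_tb": 50, "seconds_to_deadline": 24 * 3600}   # 50 TB due in one day
    print(plan_task(task, available_circuit_gbps=10))
    # ~4.6 Gb/s raw, ~5.8 Gb/s with the 80% efficiency padding -> ('request_circuit', 5.8)
```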

Summary and Conclusions
- The successful, globally distributed computing model has been extremely important for the LHC's success. It has intentionally not been bleeding-edge, but it has certainly validated the distributed/grid model.
- For the future, CPU may scale sufficiently, but we cannot afford to grow storage by a factor of 100. Agile, large, intelligent network-based systems are key; they lead to less bursty, more sustained use of networks.
- Projects such as DYNES and ANSE are essential first steps as the LHC evolves toward what Harvey Newman calls a data-intensive Content Delivery Network.