FermiGrid/CDF/D0/OSG

Global Collaboration With Grids. Ziggy wants his humans home by the end of the day for food and attention. Follow Ziggy through National, Campus, and Community grids to see how it happens.

What is DØ? The DØ experiment consists of a worldwide collaboration of scientists conducting research on the fundamental nature of matter:
– 500 scientists and engineers
– 60 institutions
– 15 countries
The research is focused on precise studies of interactions of protons and antiprotons at the highest available energies.

DØ Detector. The detector is designed to stop as many as possible of the subatomic particles created from the energy released by colliding proton and antiproton beams.
– The intersection region, where the matter-antimatter annihilation takes place, is close to the geometric center of the detector.
– The beam collision area is surrounded by tracking chambers in a strong magnetic field parallel to the direction of the beam(s).
– Outside the tracking chamber are the pre-shower detectors and the calorimeter.

What is reprocessing? Periodically an experiment will reprocess data taken previously because of improvements in the understanding of the detector:
– calorimeter recalibration
– improvements in the algorithms used in the analysis
The reprocessing effort pushes the limits of software and infrastructure to get the most physics out of the data collected by the DØ detector. [Photo: a new layer of the silicon detector of the DØ detector.]

Case for using OSG resources. Goal: reprocess ~500 M Run II events with the newly calibrated detector and improved reconstruction software by the end of March '07, when the data have to be ready for physics analysis.
– Input: 90 TB of detector data + … TB in executables
– Output: 60 TB of data, produced in 500 CPU-years
Estimated resources: about … CPUs for a period of about 4 months.
Problem: DØ did not have enough dedicated resources to complete the task in the target 3 months.
Solution: use SAM-Grid–OSG interoperability to allow SAM-Grid jobs to be executed on OSG clusters.
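As a rough consistency check on these figures, a back-of-envelope sketch (the implied CPU count and per-event cost follow directly from the quoted CPU-years, duration, and event count):

```python
# Back-of-envelope check of the reprocessing estimate, using only the
# figures quoted on the slide.
events = 500e6            # ~500 M Run II events
cpu_years = 500           # total processing quoted as "500 CPU years"
campaign_months = 4       # target duration of about 4 months

cpus_needed = cpu_years / (campaign_months / 12.0)
seconds_per_event = cpu_years * 365 * 24 * 3600 / events

print(f"~{cpus_needed:.0f} CPUs running continuously for {campaign_months} months")
print(f"~{seconds_per_event:.0f} CPU-seconds per event on average")
```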

OSG Usage Model. Opportunistic usage model:
– Agreed to share computing cycles with OSG users
– The exact amount of resources available at any time cannot be guaranteed

OSG Clusters              CPUs
Brazil                    230
CC-IN2P3 Lyon             500
Louisiana LTU-CCT         250 (128)
UCSD                      300 (70)
Purdue-ITaP               600 (?)
Oklahoma University       200
Indiana University        250
NERSC-LBL                 250
University of Nebraska    256
CMS FNAL                  2250

SAM-Grid. SAM-Grid is an infrastructure that understands DØ processing needs and maps them onto the available (OSG) resources. It implements job-to-resource mappings for both computing and storage, and it uses SAM (Sequential Access via Metadata) for automated management of storage elements and metadata cataloguing. It also provides job submission, job-status tracking, and progress monitoring.

SAM-Grid Architecture [architecture diagram]

Challenge: Certification. Production at a new site is compared with "standard" production at the DØ farm; if the outputs are "the same", the site is certified. Note: problems were experienced during certification on virtualized operating systems, because the default random seed in Python was set to the same value on all machines. [Plot: reference farm vs. OSG cluster comparison.]
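The random-seed issue has a straightforward guard; a minimal sketch, assuming per-job seeding is acceptable for the workload (this is not the actual DØ production code):

```python
import os
import random
import socket
import time

def per_job_seed():
    """Combine host name, process id, wall-clock time and OS entropy so
    that cloned (virtual) worker nodes do not share a random seed."""
    material = f"{socket.gethostname()}:{os.getpid()}:{time.time_ns()}"
    return int.from_bytes(os.urandom(8), "big") ^ (hash(material) & 0xFFFFFFFF)

random.seed(per_job_seed())
print(random.random())   # now differs from job to job, even on cloned VMs
```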

Challenge: Data Accessibility. [Plots of data-transfer tests with 30 streams, contrasting an acceptable transfer with an unacceptable one; in the quoted case the data were transferred in about 2000 seconds.]

Challenge: Troubleshooting. The OSG Troubleshooting team was instrumental to the success of the project. [Plots: OSG-related problems before the intervention of the Troubleshooting team (03/27/2007); most jobs succeeding afterwards (04/17/2007).]

Reprocessing Summary. "This was the first major production of real high-energy physics data (as opposed to simulations) ever run on OSG resources," said Brad Abbott, head of the DØ computing group. On OSG, DØ sustained execution of over 1000 simultaneous jobs and moved over 70 terabytes of data overall. Reprocessing was completed in June. Towards the end of the production run, the throughput on OSG was more than 5 million events per day, two to three times more than originally planned. In addition to the reprocessing effort, OSG provided 300,000 CPU-hours to DØ for one of the most precise measurements to date of the top quark mass, achieved in time for the spring physics conferences.

Reprocessing over time

DØ Discovery: Single Top Production. The top quark was discovered in 1995 at the Tevatron via the pair-production mode. The predicted single-top-quark production has recently been confirmed by the DØ data, giving an important measurement of the t-b coupling. The final state is similar to that of the WH -> lv + bb search, so this is also a key milestone in the Higgs search.

Conclusion. A successful and pioneering effort in data-intensive production in an opportunistic environment. Challenges remain in support, coordination of resource usage, and reservation of the shared resources. An iterative approach to enabling new resources helped make the computing problem more manageable.

The Collider Detector at Fermilab (CDF) [Detector diagram: central hadronic calorimeter, muon detector, central outer tracker (COT).]

A Mountain of Data. 5.8 × 10^9 events, 804 TB of raw data, 2.4 PB total data. At least 2× more data coming before the end of the run.

Computing Model. Each event is independent: one job can fail and the others will continue. There is no inter-process communication, and the computing is mostly integer arithmetic.
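Because events are independent and need no inter-process communication, the workload is embarrassingly parallel. A minimal illustration of per-event isolation, with a placeholder reconstruct function standing in for the real CDF code, where one failing event does not stop the others:

```python
from concurrent.futures import ProcessPoolExecutor

def reconstruct(event):
    """Placeholder per-event reconstruction; raises on a bad event."""
    if event.get("corrupt"):
        raise ValueError(f"bad event {event['id']}")
    return {"id": event["id"], "n_hits": len(event.get("hits", []))}

def process(events):
    results, failed = [], []
    with ProcessPoolExecutor() as pool:
        futures = {pool.submit(reconstruct, ev): ev["id"] for ev in events}
        for fut, ev_id in futures.items():
            try:
                results.append(fut.result())
            except Exception:
                failed.append(ev_id)   # this event fails; the rest continue
    return results, failed

if __name__ == "__main__":
    events = [{"id": i, "hits": [0] * (i % 4), "corrupt": (i == 2)} for i in range(8)]
    ok, bad = process(events)
    print(f"{len(ok)} events reconstructed, {len(bad)} failed: {bad}")
```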

The Computing Problem: a WW candidate event. Reconstruction/analysis means connecting the dots on 3-D spiral tracks, correlating them with calorimeter energy, finding the missing energy (the large red arrow in the event display), and combinatoric fitting to see what is consistent with a W particle.

CAF Software. Front-end submission, authentication, and monitoring software: users submit, debug, and monitor from their desktop, and it works with various batch systems. CDF began with dedicated facilities at Fermilab and at remote institutions. Monitoring is provided through a web page.

Why the Open Science Grid? The majority of the CPU load is simulation, which requires about 10 GHz-seconds per event, and some analyses need more than 1 billion simulated events. The increasing data volume means that the demand for computing is growing faster than the dedicated resources at FNAL and elsewhere. Simulation is relatively easy to set up on remote sites, and CDF member institutions that previously had dedicated CDF facilities are now using grid interfaces. Strategy:
– Data analysis mostly close to home (FermiGrid CAF)
– Monte Carlo simulation spread across the OSG (NAMCAF)

Condor Glide-ins. A "pilot job" is submitted to a number of remote sites; the pilot job calls the home server to get a work unit. The integrity of the job and executable is checked with MD5 checksums. To CDF users it looks like a local batch pool. Glidekeeper daemons monitor the remote sites and submit enough glide-ins in advance to use the available slots.
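A minimal sketch of the kind of MD5 integrity check described above (illustrative only; not the GlideCAF implementation):

```python
import hashlib

def md5sum(path, chunk_size=1 << 20):
    """Compute a file's MD5 checksum in streaming fashion."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_payload(path, expected_md5):
    """Refuse to run a work unit whose executable does not match the
    checksum advertised by the home server."""
    actual = md5sum(path)
    if actual != expected_md5:
        raise RuntimeError(f"checksum mismatch for {path}: {actual} != {expected_md5}")
```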

GlideCAF Overview (diagram by Igor Sfiligoi, "GlideCAF over multiple sites", Elba). The GlideCAF portal runs a collector, a main schedd, a submitter daemon, a negotiator, a glidekeeper daemon, a glide-in schedd, and monitoring daemons; the remote side is a Globus gatekeeper in front of a grid pool with its local batch queue. (1) The glidekeeper checks whether user jobs are queued. (2) If jobs are queued, a glide-in is submitted to the glide-in schedd. (3-5) Glide-ins are sent through Globus into the sites' local batch queues. (6-7) When a glide-in lands on an available batch slot, its startd registers with the collector. (8) The real user job is then matched and dispatched to that slot.

NAMCAF: CDF Computing on the Open Science Grid. The North American CAF is a single submission point for all OSG sites: it presents the CDF user interface and uses OSG tools underneath, with no CDF-specific hardware or software at the OSG sites. It accesses OSG sites at MIT, Fermilab, UCSD, Florida, and Chicago; sites at Purdue, Toronto, Wisconsin, and McGill are to be added. It already provides up to 1000 job slots. Similar entry points exist for European sites (LCGCAF) and for Taiwan and Japan sites (PACCAF).

CDF OSG Usage [plots of OSG usage by DØ and CDF]

Auxiliary tools: gLExec. All glide-in jobs on the grid appear to come from the same user. gLExec uses Globus callouts to contact the site authentication infrastructure (LCAS/LCMAPS on EGEE, GUMS/SAZ on OSG), so each individual user job authenticates to the site at the start of the job. This gives the site independent control over whose jobs it accepts through glide-ins.

W boson mass measurement. The CDF Run 2 result is the most precise single measurement of the W mass (roughly a million CPU-hours were used for the mass fitting). [Plot: comparison with previous measurements, including LEP.]

What is FermiGrid? FermiGrid is:
– The Fermilab campus Grid and Grid portal: the site Globus gateway, which accepts jobs from sources external to Fermilab and forwards them onto internal clusters.
– A set of common services to support the campus Grid and interface to the Open Science Grid (OSG) and the LHC Computing Grid (LCG): VOMS, VOMRS, GUMS, SAZ, MyProxy, Squid, Gratia accounting, etc.
– A forum for promoting stakeholder interoperability and resource sharing within Fermilab: CMS, CDF, DØ, KTeV, MiniBooNE, MINOS, MIPP, etc.
– The Open Science Grid portal to Fermilab compute and storage services.
A FermiGrid web site with additional documentation is available. Work supported by the U.S. Department of Energy under contract No. DE-AC02-07CH11359.

MatchMaking Service (jobmanager-cemon). What is it?
– FermiGrid has a matchmaking service deployed on the central gatekeeper (fermigrid1.fnal.gov). It is used to match incoming jobs against the various resources available at the time each job is submitted.
How can users make use of the MatchMaking Service?
– Users begin by submitting jobs to the fermigrid1 central gatekeeper through jobmanager-cemon.
– By default, the value of the "requirements" attribute is set so that a user's job is matched against clusters that support the user's VO (Virtual Organization) and have at least one free slot available at the time the job is submitted to fermigrid1.
– However, users can add further conditions to this "requirements" attribute through the attribute named "gluerequirements" in the Condor submit file. These additional conditions should be specified in terms of Glue Schema attributes.
More information is available in the FermiGrid documentation.
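A sketch of how such a submission might be scripted (a hypothetical helper, not an official FermiGrid tool; the Glue Schema attribute in the example is only illustrative):

```python
import subprocess

def write_submit_file(path, executable, glue_requirements=None):
    """Write a Condor-G submit description that targets the FermiGrid
    central gatekeeper through jobmanager-cemon, optionally adding the
    'gluerequirements' matchmaking condition described above."""
    lines = [
        "universe        = globus",
        "globusscheduler = fermigrid1.fnal.gov/jobmanager-cemon",
        f"executable      = {executable}",
        "output          = job.$(Cluster).out",
        "error           = job.$(Cluster).err",
        "log             = job.log",
    ]
    if glue_requirements:
        lines.append(f"gluerequirements = {glue_requirements}")
    lines.append("queue")
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")

if __name__ == "__main__":
    # Example: only match clusters advertising more than 10 free CPU slots.
    write_submit_file("job.submit", "reco.sh",
                      glue_requirements="GlueCEStateFreeCPUs > 10")
    subprocess.run(["condor_submit", "job.submit"], check=True)
```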

FermiGrid - Current Architecture. A site-wide gateway sits between the exterior and the interior clusters (CMS WC1, CDF OSG1, CDF OSG2, D0 CAB1, D0 CAB2, GP Farm), which mount shared BlueArc storage. The clusters send ClassAds via CEMon to the site-wide gateway, and the VOMS, GUMS, and SAZ servers are kept in periodic synchronization.
Step 1 - The user issues voms-proxy-init and receives VOMS-signed credentials.
Step 2 - The user submits their grid job via globus-job-run, globus-job-submit, or condor-g.
Step 3 - The gateway checks the job against the Site AuthoriZation (SAZ) service.
Step 4 - The gateway requests a GUMS mapping based on VO and role.
Step 5 - The grid job is forwarded to the target cluster.
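The user-facing half of this flow (steps 1 and 2) can be driven from a simple script; a minimal sketch, assuming the standard grid client tools are installed and using "dzero" purely as an example VO:

```python
import subprocess

GATEWAY = "fermigrid1.fnal.gov/jobmanager-cemon"

def run_test_job(vo="dzero"):
    # Step 1: obtain VOMS-signed credentials for the chosen VO/Role.
    subprocess.run(["voms-proxy-init", "-voms", vo], check=True)
    # Step 2: hand a trivial job to the site-wide gateway; steps 3-5
    # (SAZ check, GUMS mapping, forwarding) happen on the server side.
    result = subprocess.run(["globus-job-run", GATEWAY, "/bin/hostname"],
                            check=True, capture_output=True, text=True)
    print("job ran on:", result.stdout.strip())

if __name__ == "__main__":
    run_test_job()
```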

SAZ - Animation. [Animated diagram: the gatekeeper passes the job's DN, VO, Role, and CA to the SAZ service, which applies administrator-controlled authorization decisions.]

FermiGrid - Current Performance
VOMS:
– Current record ~1700 voms-proxy-inits/day.
– Not a driver for FermiGrid-HA.
GUMS:
– Current record > 1M mapping requests/day.
– Maximum system load < 3 at a CPU utilization of 130% (max 200%).
SAZ:
– Current record > 129K authorization decisions/day.
– Maximum system load < 5.

BlueArc/dCache. The Open Science Grid has two storage methods at Fermilab:
– NFS-mounted $OSG_DATA, implemented with a BlueArc NFS filer
– SRM/dCache, with a 7 TB volatile area for any grid user
Large areas are backed up on tape for the Fermilab experiments.
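A minimal sketch of a job staging its output to the NFS-mounted area (the subdirectory layout is an assumption; real VOs follow their own conventions, and large outputs would go to SRM/dCache instead):

```python
import os
import shutil

def stage_to_osg_data(local_file, subdir="myvo"):
    """Copy a job output into the NFS-mounted $OSG_DATA area."""
    osg_data = os.environ.get("OSG_DATA")
    if not osg_data:
        raise RuntimeError("$OSG_DATA is not defined on this worker node")
    dest_dir = os.path.join(osg_data, subdir)
    os.makedirs(dest_dir, exist_ok=True)
    return shutil.copy2(local_file, dest_dir)
```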

FermiGrid-HA - Component Design. [Diagram: active/active VOMS, GUMS, and SAZ services behind active and standby LVS directors with heartbeat, backed by replicated active/active MySQL.]

FermiGrid-HA - Actual Component Deployment

Service   Xen VM     fermigrid5 (Xen Domain 0)   fermigrid6 (Xen Domain 0)
LVS       -          Active                      Standby
VOMS      Xen VM 1   fg5x1 (Active)              fg6x1 (Active)
GUMS      Xen VM 2   fg5x2 (Active)              fg6x2 (Active)
SAZ       Xen VM 3   fg5x3 (Active)              fg6x3 (Active)
MySQL     Xen VM 4   fg5x4 (Active)              fg6x4 (Active)

Supported by the Department of Energy Office of Science SciDAC-2 program from the High Energy Physics, Nuclear Physics and Advanced Software and Computing Research programs, and the National Science Foundation Math and Physical Sciences, Office of CyberInfrastructure and Office of International Science and Engineering Directorates.

Open Science Grid. The Vision: transform compute- and data-intensive science through a cross-domain, self-managed, national distributed cyber-infrastructure that brings together campus and community infrastructure and facilitates the needs of Virtual Organizations at all scales. Submit Local, Run Global.

Open Science Grid. Campus grids (e.g. DOSAR, FermiGrid, GLOW, GPN, GROW, …), science community infrastructure (e.g. ATLAS, CMS, LIGO, …), and national and international cyber-infrastructure for science (e.g. TeraGrid, EGEE, …) need to be harmonized into a well-integrated whole.

Open Science Grid: International Partners. EGEE, TeraGrid, NorduGrid, NYSGrid, GROW, GLOW, APAC, DiSUN, FermiGrid, LCG, TIGRE, ASGC, NWICG. An international science community: common goals, shared data, collaborative work.

Open Science Grid

Open Science Grid: Rosetta, a non-physics experiment. "For each protein we design, we consume about 3,000 CPU hours across 10,000 jobs," says Kuhlman. "Adding in the structure and atom design process, we've consumed about 100,000 CPU hours in total so far."

Open Science Grid: CHARMM CHARMM: CHemistry at HARvard Macromolecular Mechanics “I’m running many different simulations to determine how much water exists inside proteins and whether these water molecules can influence the proteins,” Damjanovic says.

Open Science Grid: How it all comes together. [Diagram: resources that trust the VO, the VO management service, the OSG infrastructure, and VO middleware & applications.] Virtual Organization Management services (VOMS) allow registration, administration, and control of the members of the group. Resources trust and authorize VOs, not individual users. The OSG infrastructure provides the fabric for job submission and scheduling, resource discovery, security, monitoring, …

Open Science Grid: Software Stack
– Applications: user science codes and interfaces
– VO middleware: HEP data and workflow management, biology portals and databases, astrophysics data replication, etc.
– OSG release cache: OSG-specific configurations, utilities, etc.
– Virtual Data Toolkit (VDT): core technologies plus software needed by stakeholders; many components shared with EGEE
– Core grid technology distributions: Condor, Globus, MyProxy (shared with TeraGrid and others)
– Existing operating systems, batch systems, and utilities

Open Science Grid: Security. Operational security is a priority:
– Incident response
– Signed agreements, template policies
– Auditing, assessment, and training
Symmetry of sites and VOs:
– VO and site are two faces of a coin; we believe in symmetry
– VO and site each have responsibilities
Trust relationships:
– A site trusts the VOs that use it
– A VO trusts the sites it runs on
– VOs trust their users

Open Science Grid: Come Join OSG! How to become an OSG citizen:
– Join the OSGEDU VO: run small applications after learning how to use OSG from schools.
– Be part of the Engagement program and the Engage VO: support within the Facility to bring applications to production on the distributed infrastructure.
– Be a standalone VO and a member of the Consortium: ongoing use of OSG and participation in one or more activity groups.