Published by Elinor Quinn. Modified over 8 years ago.

Global Collaboration With Grids
Ziggy wants his humans home by the end of the day for food and attention.
Follow Ziggy through National, Campus, and Community grids to see how it happens.

What is DØ?
The DØ experiment consists of a worldwide collaboration of scientists conducting research on the fundamental nature of matter.
– 500 scientists and engineers
– 60 institutions
– 15 countries
The research is focused on precise studies of interactions of protons and antiprotons at the highest available energies.

DØ Detector
The detector is designed to stop as many as possible of the subatomic particles created from energy released by colliding proton/antiproton beams.
– The intersection region where the matter-antimatter annihilation takes place is close to the geometric center of the detector.
– The beam collision area is surrounded by tracking chambers in a strong magnetic field parallel to the direction of the beam(s).
– Outside the tracking chamber are the preshower detectors and the calorimeter.

What is reprocessing?
Periodically an experiment reprocesses previously taken data because of improvements in the understanding of the detector:
– the calorimeter recalibration
– improvements in the algorithms used in the analysis
The reprocessing effort pushes the limits of software and infrastructure to get the most physics out of the data collected by the DØ detector.
[Figure: a new layer of silicon detector in the DØ detector]

Case for using OSG resources
Goal: reprocess ~500 M Run II events with the newly calibrated detector and improved reconstruction software by the end of March ’07, when the data have to be ready for physics analysis.
– Input: 90 TB of detector data + 250 TB in executables
– Output: 60 TB of data in 500 CPU years
Estimated resources: about 1500-2000 CPUs for a period of about 4 months.
Problem: DØ did not have enough dedicated resources to complete the task in the target 3 months.
Solution: use SAM-Grid–OSG interoperability to allow SAM-Grid jobs to be executed on OSG clusters.

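The resource estimate on this slide can be checked with simple arithmetic. The sketch below assumes that "500 CPU years" means 500 years of continuous running on a single CPU and that every CPU is fully used for the whole period; the function name `cpus_needed` is made up for this illustration.

```python
def cpus_needed(cpu_years: float, months: float) -> float:
    """CPUs that must run continuously to deliver `cpu_years` of work
    within `months` of wall-clock time (assumes 100% utilization)."""
    return cpu_years * 12.0 / months


# Figures from the slide: 500 CPU years, with 3 to 4 months available.
for months in (4, 3):
    print(f"{months} months: about {cpus_needed(500, months):.0f} CPUs")
```

Under these assumptions, a 4-month run corresponds to the lower end of the quoted 1500-2000 CPU range and a 3-month run to the upper end.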
OSG Usage Model
Opportunistic usage model
– Agreed to share computing cycles with OSG users
– Exact amount of resources at any time cannot be guaranteed

OSG Cluster              CPUs
Brazil                   230
CC-IN2P3 Lyon            500
LOUISIANA LTU-CCT        250 (128)
UCSD                     300 (70)
PURDUE-ITaP              600 (?)
Oklahoma University      200
Indiana University       250
NERSC – LBL              250
University of Nebraska   256
CMS FNAL                 2250

SAM-Grid
SAM-Grid is an infrastructure that understands DØ processing needs and maps them onto available resources (OSG).
Implements job-to-resource mappings
– Both computing and storage
Uses SAM (Sequential Access via Metadata)
– Automated management of storage elements
– Metadata cataloguing
Job submission and job status tracking
Progress monitoring

SAMGrid Architecture
Challenge: Certification
Compare production at a new site with “standard” production at the DØ farm. If the results are “the same”, the site is certified.
[Figure: comparison plots, labelled "Reference" and "OSG Cluster"]
Note: problems were experienced during certification on a virtual OS: the default random seed in Python was set to the same value on all machines.

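The random-seed problem noted on this slide is a common pitfall when identical machine images start up in the same state. One way to avoid it, sketched here as an illustration and not as the fix DØ actually applied, is to derive each job's seed from something unique to the job. The names `job_seed` and `make_rng`, and the site and job identifiers, are hypothetical.

```python
import hashlib
import random


def job_seed(site: str, job_id: int) -> int:
    """Derive a reproducible 64-bit seed from the site name and job number."""
    digest = hashlib.sha256(f"{site}:{job_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def make_rng(site: str, job_id: int) -> random.Random:
    """Return a private generator so jobs never share the global default seed."""
    return random.Random(job_seed(site, job_id))


# Two jobs at the same (made-up) site get different random streams,
# while re-running the same job reproduces its stream exactly.
rng_a = make_rng("example-site", 1)
rng_b = make_rng("example-site", 2)
print(rng_a.random(), rng_b.random())
```

Seeding from a job identifier keeps runs reproducible, which matters when a site's output has to be compared against a reference production.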
Challenge: Data Accessibility
Data accessibility test:
– Acceptable: 2000 secs to transfer data (30 streams)
– NOT acceptable: 10000 secs to transfer data (30 streams)

Challenge: Troubleshooting
The OSG Troubleshooting team was instrumental to the success of the project.
[Figure: OSG-related problems before the intervention of the Troubleshooting Team (03/27/2007)]
[Figure: most jobs succeed (04/17/2007)]

Reprocessing Summary
“This was the first major production of real high energy physics data (as opposed to simulations) ever run on OSG resources,” said Brad Abbott, head of the DØ Computing group.
On OSG, DØ sustained execution of over 1000 simultaneous jobs, and overall moved over 70 terabytes of data.
Reprocessing was completed in June. Towards the end of the production run, the throughput on OSG was more than 5 million events per day, two to three times more than originally planned.
In addition to the reprocessing effort, OSG provided 300,000 CPU hours to DØ for one of the most precise measurements to date of the top quark mass, helping to achieve this result in time for the spring physics conferences.

Reprocessing over time
DØ Discovery: Single Top Production
Top quark discovered in 1995 at the Tevatron using the pair production mode
Prediction of single top quark production has recently been confirmed by the DØ data
Important measurement of the t-b coupling
Similar final state to the WH -> lν + bb search
– Therefore also a key milestone in the Higgs search

Conclusion
Successful and pioneering effort in data-intensive production in an opportunistic environment
Challenges in support, coordination of resource usage, and reservation of the shared resources
Iterative approach in enabling new resources helped make the computing problem more manageable

The Collider Detector at Fermilab (CDF)
[Figure labels: central hadronic calorimeter, muon detector, central outer tracker (COT)]

A Mountain of Data
5.8 × 10^9 events
804 TB raw data
2.4 PB total data
At least 2x more data coming before end of run.

Computing Model
Each event is independent: one job can fail and others will continue
No inter-process communication
Mostly integer computing

The Computing Problem: WW candidate event
Reconstruction/analysis
Connecting the dots on 3-D spiral tracks
Correlate with calorimeter energy
Find missing energy (large red arrow)
Combinatoric fitting to see what is consistent with a W particle

CAF Software
Front-end submission, authentication and monitoring software
Users submit, debug, monitor from desktop
Works with various batch systems
CDF began with dedicated facilities at Fermilab and remote institutions
Monitoring page at

Why the Open Science Grid
Majority of CPU load is simulation
Requires 10 GHz-sec per event
Some analyses need > 1 billion simulated events
Increasing data volume means that demand for computing is growing faster than dedicated resources at FNAL and elsewhere.
Simulation is relatively easy to set up on remote sites
CDF member institutions that previously had dedicated CDF facilities are now using grid interfaces
Strategy:
– Data analysis mostly close to home (FermiGrid CAF)
– Monte Carlo simulations spread across the OSG (NAMCAF).

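To get a feel for the simulation load quoted on this slide, the two numbers can be combined into a rough CPU-time estimate. This sketch reads "10 GHz-sec per event" as 10 seconds on a 1 GHz processor and assumes that time scales inversely with clock speed; both are simplifying assumptions, and the 2.5 GHz core is only an example value.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def cpu_years(events: float, ghz_sec_per_event: float = 10.0,
              core_ghz: float = 1.0) -> float:
    """Rough CPU-years needed to simulate `events` events on cores of
    `core_ghz` GHz, assuming cost scales inversely with clock speed."""
    return events * ghz_sec_per_event / core_ghz / SECONDS_PER_YEAR


# One billion simulated events, on 1 GHz and on (hypothetical) 2.5 GHz cores.
for ghz in (1.0, 2.5):
    print(f"{ghz} GHz cores: about {cpu_years(1e9, core_ghz=ghz):.0f} CPU-years")
```

Even with faster cores, a single billion-event analysis lands in the range of a hundred or more CPU-years, which is the motivation the slide gives for spreading Monte Carlo production across the OSG.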
Condor Glide-ins
Submit “pilot job” to a number of remote sites
Pilot job calls home server to get a work unit
Integrity of job and executable checked by MD5 checksums
To CDF users, it looks like a local batch pool
Glidekeeper daemons monitor remote sites and submit enough jobs in advance to use available slots.

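The MD5 integrity check mentioned on this slide can be illustrated with a minimal sketch. It assumes the home server sends an expected MD5 digest alongside each work unit; the function names and the payload are hypothetical and do not come from the CDF software. MD5 catches accidental corruption in transit, but it is not collision-resistant against deliberate tampering.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Hex MD5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()


def verify_payload(payload: bytes, expected_md5: str) -> bool:
    """True if the payload received by the pilot matches the digest
    that the home server advertised for it."""
    return md5_hex(payload) == expected_md5.strip().lower()


# Hypothetical work unit: the pilot only runs it if the digest matches.
payload = b"#!/bin/sh\necho running user job\n"
advertised = md5_hex(payload)          # stands in for the server's value
print(verify_payload(payload, advertised))             # intact payload
print(verify_payload(payload + b"x", advertised))      # corrupted payload
```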
GlideCAF overview (Igor Sfiligoi, "Glide CAF over multiple sites", Elba 2006)
[Diagram components: GlideCAF (Portal) with Collector, Main Schedd, Submitter Daemon, Negotiator, Glidekeeper Daemon, Glide-in Schedd, Monitoring Daemons and Batch queue; Grid Pool with Globus, Batch queue, Available Batch Slot and Glide-in Startd]
Annotations recoverable from the diagram (numbered (1) to (8) in the original), in order:
– Checks to see if jobs are queued.
– If jobs are queued, a glide-in is submitted to a second schedd.
– Glide-ins to Grid
– Glide-ins to grid-local batch
– Startd registers
– Real job goes to slot

NAMCAF: CDF Computing On Open Science Grid
North American CAF: single submission point for all OSG sites
– CDF user interface, uses OSG tools underneath
– No CDF-specific hardware or software at OSG sites
Accesses OSG sites at MIT, Fermilab, UCSD, Florida & Chicago
OSG sites at Purdue, Toronto, Wisconsin, McGill to be added
Provides up to 1000 job slots already
Similar entry points to European sites (LCGCAF) and Taiwan, Japan sites (PACCAF)

Auxiliary tools: gLExec
All glide-in jobs on the grid appear to come from the same user.
gLExec uses Globus callouts to contact the site authentication infrastructure
– EGEE: LCAS/LCMAPS
– OSG: GUMS/SAZ
Each individual user job authenticates to the site at the start of the job
Gives the site independent control over whose glide-ins it accepts.

W boson mass measurement
The CDF Run 2 result is the most precise single measurement of the W mass (used about a million CPU hours for mass fitting)
[Figure label: LEP experiments @CERN]

What is FermiGrid?
FermiGrid is:
– The Fermilab campus Grid and Grid portal: the site Globus gateway, which accepts jobs from sources external to Fermilab and forwards them to internal clusters.
– A set of common services to support the campus Grid and interface to Open Science Grid (OSG) / LHC Computing Grid (LCG): VOMS, VOMRS, GUMS, SAZ, MyProxy, Squid, Gratia Accounting, etc.
– A forum for promoting stakeholder interoperability and resource sharing within Fermilab: CMS, CDF, DØ; KTeV, MiniBooNE, MINOS, MIPP, etc.
– The Open Science Grid portal to Fermilab Compute and Storage Services.
FermiGrid Web Site & Additional Documentation:
–
Work supported by the U.S. Department of Energy under contract No. DE-AC02-07CH11359.

Jobmanager-cemon MatchMaking Service
What is it?
– FermiGrid has a matchmaking service deployed on the central gatekeeper. This service matches incoming jobs against the resources available at the time the job is submitted.
How can users make use of the MatchMaking Service?
– Users begin by submitting jobs to the fermigrid1 central gatekeeper through jobmanager-cemon.
– By default, the value of the "requirements" attribute is set so that a user's job is matched against clusters that support the user's VO (Virtual Organization) and have at least one free slot available at the time the job is submitted to fermigrid1.
– However, users can add further conditions to this "requirements" attribute, using the attribute named "gluerequirements" in the condor submit file.
– These additional conditions should be specified in terms of Glue Schema attributes.
More information:
–

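The matching rule described on this slide (default: the cluster supports the user's VO and has at least one free slot; optionally, extra user conditions) can be modelled in a few lines. This is a toy model of the logic only, not the FermiGrid implementation and not condor submit-file syntax. The `Cluster` class, the `ram_mb` attribute (a stand-in for a Glue Schema attribute) and all numbers are invented for illustration; only the cluster names are taken from the architecture slide.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set


@dataclass
class Cluster:
    name: str
    supported_vos: Set[str]
    free_slots: int
    attrs: Dict[str, int] = field(default_factory=dict)  # Glue-like attributes


def match(clusters: List[Cluster], vo: str,
          extra: Optional[Callable[[Cluster], bool]] = None) -> List[str]:
    """Names of clusters satisfying the default requirements and,
    if given, the user's additional condition."""
    matched = []
    for c in clusters:
        default_ok = vo in c.supported_vos and c.free_slots >= 1
        if default_ok and (extra is None or extra(c)):
            matched.append(c.name)
    return matched


# Made-up snapshot of cluster state at submission time.
clusters = [
    Cluster("CDF OSG1", {"cdf"}, free_slots=12, attrs={"ram_mb": 2048}),
    Cluster("CDF OSG2", {"cdf"}, free_slots=0, attrs={"ram_mb": 4096}),
    Cluster("GP Farm", {"cdf", "dzero"}, free_slots=3, attrs={"ram_mb": 1024}),
]
print(match(clusters, "cdf"))
print(match(clusters, "cdf", extra=lambda c: c.attrs.get("ram_mb", 0) >= 2048))
```

Because the match is evaluated against the state at submission time, a cluster that frees slots a moment later is not considered, which is consistent with the slide's wording.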
FermiGrid - Current Architecture
Step 1 – User issues voms-proxy-init; user receives VOMS-signed credentials
Step 2 – User submits their grid job via globus-job-run, globus-job-submit, or condor-g
Step 3 – Gateway checks against Site Authorization Service
Step 4 – Gateway requests GUMS mapping based on VO & Role
Step 5 – Grid job is forwarded to target cluster
Clusters send ClassAds via CEMon to the site-wide gateway.
[Diagram labels: Exterior, Interior; VOMS Server, SAZ Server, GUMS Server; Periodic Synchronization; Site Wide Gateway; BlueArc; clusters CMS WC1, CDF OSG1, CDF OSG2, D0 CAB1, D0 CAB2, GP Farm]

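Steps 3 to 5 of the architecture slide (authorization check, mapping, forwarding) can be sketched as a small decision function. This is a simplified illustration under stated assumptions: SAZ is reduced to a set of denied DNs, GUMS to a dictionary keyed by (VO, role), and the choice of target cluster to a lookup by VO. None of these stand-ins reflects the real service interfaces, and every name and value below is hypothetical.

```python
from typing import Dict, Set, Tuple


class AuthorizationError(Exception):
    """Raised when the hypothetical gateway refuses a job."""


def forward_job(dn: str, vo: str, role: str,
                saz_denied: Set[str],
                gums_mapping: Dict[Tuple[str, str], str],
                cluster_for_vo: Dict[str, str]) -> Dict[str, str]:
    # Step 3: site authorization check (SAZ stand-in: a set of denied DNs).
    if dn in saz_denied:
        raise AuthorizationError(f"{dn} is not authorized at this site")
    # Step 4: map (VO, role) to a local account (GUMS stand-in: a dict).
    try:
        account = gums_mapping[(vo, role)]
    except KeyError:
        raise AuthorizationError(f"no mapping for VO={vo!r}, role={role!r}") from None
    # Step 5: forward the job to the target cluster for this VO.
    return {"cluster": cluster_for_vo[vo], "account": account}


# Made-up example data.
result = forward_job(
    dn="/DC=org/DC=example/CN=Alice",
    vo="cdf", role="analysis",
    saz_denied={"/DC=org/DC=example/CN=Mallory"},
    gums_mapping={("cdf", "analysis"): "cdfuser"},
    cluster_for_vo={"cdf": "CDF OSG1"},
)
print(result)
```

Doing the site-wide authorization check before the mapping means a banned identity is rejected without ever being assigned a local account.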
SAZ - Animation
[Animation labels: Gatekeeper; Job; DN, VO, Role, CA; SAZ; ADMIN]

FermiGrid - Current Performance
VOMS:
– Current record ~1700 voms-proxy-inits/day.
– Not a driver for FermiGrid-HA.
GUMS:
– Current record > 1M mapping requests/day.
– Maximum system load < 3 at a CPU utilization of 130% (max 200%).
SAZ:
– Current record > 129K authorization decisions/day.
– Maximum system load < 5.

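The daily records on this slide are easier to compare as average per-second rates. The conversion below treats each record as if it were spread evenly over 24 hours, which is an assumption: real traffic is bursty, so peak rates would be higher than these averages.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def per_second(requests_per_day: float) -> float:
    """Average request rate implied by a daily total."""
    return requests_per_day / SECONDS_PER_DAY


# Daily records quoted on the slide (lower bounds for GUMS and SAZ).
daily_records = {"VOMS": 1_700, "GUMS": 1_000_000, "SAZ": 129_000}
for service, per_day in daily_records.items():
    print(f"{service}: {per_second(per_day):.2f} requests/s on average")
```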
BlueArc/dCache
Open Science Grid has two storage methods
– NFS-mounted $OSG_DATA
  Implemented with BlueArc NFS filer
– SRM/dCache
  Volatile area, 7 TB, for any grid user
  Large areas backed up on tape for Fermi experiments

FermiGrid-HA - Component Design
[Diagram labels: LVS Active, LVS Standby, Heartbeat; VOMS Active (×2), GUMS Active (×2), SAZ Active (×2), MySQL Active (×2); Replication]

FermiGrid-HA - Actual Component Deployment

                 fermigrid5 (Active)   fermigrid6 (Active)
LVS              Active                Standby
Xen Domain 0     fermigrid5            fermigrid6
Xen VM 1 VOMS    fg5x1 (Active)        fg6x1 (Active)
Xen VM 2 GUMS    fg5x2 (Active)        fg6x2 (Active)
Xen VM 3 SAZ     fg5x3 (Active)        fg6x3 (Active)
Xen VM 4 MySQL   fg5x4 (Active)        fg6x4 (Active)

Supported by the Department of Energy Office of Science SciDAC-2 program from the High Energy Physics, Nuclear Physics and Advanced Software and Computing Research programs, and the National Science Foundation Math and Physical Sciences, Office of CyberInfrastructure and Office of International Science and Engineering Directorates.
Open Science Grid
The Vision: Transform compute- and data-intensive science through a cross-domain, self-managed, national distributed cyber-infrastructure that brings together campus and community infrastructure and facilitates the needs of Virtual Organizations at all scales.
Submit local, Run Global

Open Science Grid
Campus Grids, Community Grids and National Grids need to be harmonized into a well-integrated whole:
– CS/IT Campus Grids (e.g. DOSAR, FermiGrid, GLOW, GPN, GROW…)
– Science Community Infrastructure (e.g. ATLAS, CMS, LIGO, …)
– National & International Cyber Infrastructure for Science (e.g. TeraGrid, EGEE, …)

Open Science Grid: International Partners
EGEE, TeraGrid, NorduGrid, NYSGrid, GROW, GLOW, APAC, DiSUN, FermiGrid, LCG, TIGRE, ASGC, NWICG
An International Science Community: Common Goals, Shared Data, Collaborative Work

Open Science Grid
Open Science Grid: Rosetta – A non-physics experiment
“For each protein we design, we consume about 3,000 CPU hours across 10,000 jobs,” says Kuhlman. “Adding in the structure and atom design process, we’ve consumed about 100,000 CPU hours in total so far.”

Open Science Grid: CHARMM
CHARMM: CHemistry at HARvard Macromolecular Mechanics
“I’m running many different simulations to determine how much water exists inside proteins and whether these water molecules can influence the proteins,” Damjanovic says.

Open Science Grid: How it all comes together
[Diagram labels: Resources that Trust the VO; VO Management Service; OSG Infrastructure; VO Middleware & Applications]
Virtual Organization Management Services (VOMS) allow registration, administration and control of members of the group.
Resources trust and authorize VOs, not individual users.
OSG infrastructure provides the fabric for job submission and scheduling, resource discovery, security, monitoring, …

Open Science Grid: Software Stack
Applications
– User Science Codes and Interfaces
– VO Middleware: HEP (data and workflow management etc.); Biology (portals, databases etc.); Astrophysics (data replication etc.)
Infrastructure
– OSG Release Cache: OSG-specific configurations, utilities, etc.
– Virtual Data Toolkit (VDT): core technologies + software needed by stakeholders; many components shared with EGEE
– Core Grid Technology Distributions: Condor, Globus, MyProxy; shared with TeraGrid and others
– Existing operating systems, batch systems and utilities

Open Science Grid: Security
Operational security is a priority
– Incident response
– Signed agreements, template policies
– Auditing, assessment and training
Symmetry of Sites and VOs
– VO and Site: two faces of a coin; we believe in symmetry
– VO and Site each have responsibilities
Trust Relationships
– A Site trusts the VOs that use it.
– A VO trusts the Sites it runs on.
– VOs trust their users.

Open Science Grid: Come Join OSG!!!
How to become an OSG Citizen:
Join the OSGEDU VO:
– Run small applications after learning how to use OSG from schools
Be part of the Engagement program and Engage VO:
– Support within the Facility to bring applications to production on the distributed infrastructure
Be a standalone VO and a Member of the Consortium:
– Ongoing use of OSG & participate in one or more activity groups.