Slide 1: Grids for 21st Century Data Intensive Science
Paul Avery, University of Florida
http://www.phys.ufl.edu/~avery/  avery@phys.ufl.edu
University of Michigan, May 8, 2003

Slide 2: Grids and Science

Slide 3: The Grid Concept
- Grid: geographically distributed computing resources configured for coordinated use
- Fabric: physical resources & networks provide the raw capability
- Middleware: software ties it all together (tools, services, etc.)
- Goal: transparent resource sharing

Slide 4: Fundamental Idea: Resource Sharing
- Resources for complex problems are distributed
  - Advanced scientific instruments (accelerators, telescopes, …)
  - Storage, computing, people, institutions
- Communities require access to common services
  - Research collaborations (physics, astronomy, engineering, …)
  - Government agencies, health care organizations, corporations, …
- "Virtual Organizations"
  - Create a "VO" from geographically separated components
  - Make all community resources available to any VO member
  - Leverage strengths at different institutions
- Grids require a foundation of strong networking
  - Communication tools, visualization
  - High-speed data transmission, instrument operation

Slide 5: Some (Realistic) Grid Examples
- High energy physics: 3,000 physicists worldwide pool Petaflops of CPU resources to analyze Petabytes of data
- Fusion power (ITER, etc.): physicists quickly generate 100 CPU-years of simulations of a new magnet configuration to compare with data
- Astronomy: an international team remotely operates a telescope in real time
- Climate modeling: climate scientists visualize, annotate, & analyze Terabytes of simulation data
- Biology: a biochemist exploits 10,000 computers to screen 100,000 compounds in an hour

Slide 6: Grids: Enhancing Research & Learning
- Fundamentally alters the conduct of scientific research
  - Central model: people and resources flow inward to labs
  - Distributed model: knowledge flows between distributed teams
- Strengthens universities
  - Couples universities to data intensive science
  - Couples universities to national & international labs
  - Brings front-line research and resources to students
  - Exploits intellectual resources of formerly isolated schools
  - Opens new opportunities for minority and women researchers
- Builds partnerships to drive advances in IT/science/engineering
(Diagram: partnerships linking "application" sciences, computer science, physics, astronomy, biology, universities, laboratories, scientists, students, the research community, and the IT industry)

Slide 7: Grid Challenges
- Operate a fundamentally complex entity
  - Geographically distributed resources
  - Each resource under different administrative control
  - Many failure modes
- Manage workflow across the Grid
  - Balance policy vs. instantaneous capability to complete tasks
  - Balance effective resource use vs. fast turnaround for priority jobs
  - Match resource usage to policy over the long term
  - Goal-oriented algorithms: steering requests according to metrics (a scheduling sketch follows this slide)
- Maintain a global view of resources and system state
  - Coherent end-to-end system monitoring
  - Adaptive learning for execution optimization
- Build high-level services & an integrated user environment

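To make the policy-vs.-turnaround balance concrete, here is a minimal, hypothetical sketch of a goal-oriented matchmaker. It is not taken from any of the Grid projects described in this talk: the site names, policy shares, and scoring weight are invented purely for illustration.

```python
# Minimal sketch of policy-aware job placement (illustrative only; the site
# names, policy shares, and scoring weight are invented, not from the talk).

from dataclasses import dataclass

@dataclass
class Site:
    name: str
    free_cpus: int          # instantaneous capability
    policy_share: float     # fraction of this site promised to our VO
    used_share: float       # fraction our VO has actually consumed recently

def score(site: Site, job_cpu_hours: float) -> float:
    """Lower is better: estimated turnaround, penalized if we are over quota."""
    turnaround = job_cpu_hours / max(site.free_cpus, 1)   # crude time estimate
    over_quota = max(site.used_share - site.policy_share, 0.0)
    return turnaround * (1.0 + 10.0 * over_quota)         # steer toward policy

def choose_site(sites: list[Site], job_cpu_hours: float) -> Site:
    return min(sites, key=lambda s: score(s, job_cpu_hours))

if __name__ == "__main__":
    sites = [Site("TierA", 400, 0.30, 0.60),   # fast, but our VO is well over its share
             Site("TierB", 150, 0.50, 0.20)]   # slower, but under quota
    print(choose_site(sites, job_cpu_hours=1000).name)     # -> TierB

```

In this toy scoring, a site that offers faster turnaround can still lose to a slower one if the VO has been exceeding its agreed share there, which is the long-term policy matching the slide describes.
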
Slide 8: Data Grids

Slide 9: Data Intensive Science: 2000-2015
- Scientific discovery increasingly driven by data collection
  - Computationally intensive analyses
  - Massive data collections
  - Data distributed across networks of varying capability
  - Internationally distributed collaborations
- Dominant factor: data growth (1 Petabyte = 1000 TB)
  - 2000: ~0.5 Petabyte
  - 2005: ~10 Petabytes
  - 2010: ~100 Petabytes
  - 2015: ~1000 Petabytes?
- How to collect, manage, access and interpret this quantity of data?
- Drives demand for "Data Grids" to handle the additional dimension of data access & movement
(The sketch below checks the growth rate implied by these projections.)

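A quick back-of-the-envelope check of the growth implied by the projections above; the only inputs are the slide's own numbers.

```python
# Back-of-the-envelope check of the data-growth projection on this slide:
# ~0.5 PB (2000) -> ~10 PB (2005) -> ~100 PB (2010) -> ~1000 PB (2015?).

projections_pb = {2000: 0.5, 2005: 10, 2010: 100, 2015: 1000}

years = sorted(projections_pb)
total_factor = projections_pb[years[-1]] / projections_pb[years[0]]
span = years[-1] - years[0]
annual_growth = total_factor ** (1 / span) - 1

print(f"Overall growth 2000-2015: {total_factor:.0f}x")
print(f"Implied average annual growth: {annual_growth:.0%}")   # roughly 66% per year

```
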
Slide 10: Data Intensive Physical Sciences
- High energy & nuclear physics
  - Including new experiments at CERN's Large Hadron Collider
- Astronomy
  - Digital sky surveys: SDSS, VISTA, other Gigapixel arrays
  - VLBI arrays: multiple-Gbps data streams
  - "Virtual" Observatories (multi-wavelength astronomy)
- Gravity wave searches
  - LIGO, GEO, VIRGO, TAMA
- Time-dependent 3-D systems (simulation & data)
  - Earth observation, climate modeling
  - Geophysics, earthquake modeling
  - Fluids, aerodynamic design
  - Dispersal of pollutants in the atmosphere

Slide 11: Data Intensive Biology and Medicine
- Medical data
  - X-ray, mammography data, etc. (many Petabytes)
  - Radiation oncology (real-time display of 3-D images)
- X-ray crystallography
  - Bright X-ray sources, e.g. Argonne Advanced Photon Source
- Molecular genomics and related disciplines
  - Human Genome, other genome databases
  - Proteomics (protein structure, activities, …)
  - Protein interactions, drug delivery
- Brain scans (1-10 μm, time dependent)

Slide 12: Driven by LHC Computing Challenges
- Complexity: millions of individual detector channels
- Scale: PetaOps (CPU), Petabytes (data)
- Distribution: global distribution of people & resources
- 1800 physicists, 150 institutes, 32 countries

Slide 13: CMS Experiment at LHC
- "Compact" Muon Solenoid at the LHC (CERN)
(Figure: detector drawing, with a "Smithsonian standard man" shown for scale)

Slide 14: LHC Data Rates: Detector to Storage
- Detector readout: 40 MHz (~1000 TB/sec)
- Level 1 Trigger (special hardware): reduces to 75 kHz (75 GB/sec)
- Level 2 Trigger (commodity CPUs): reduces to 5 kHz (5 GB/sec)
- Level 3 Trigger (commodity CPUs), physics filtering: 100 Hz, 100-1500 MB/sec of raw data to storage
(A rate-reduction sketch follows this slide.)

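The rates and throughputs above imply the event-rate reductions and rough event sizes worked out in this short sketch. The per-event sizes are derived from the quoted throughputs, so treat them as approximate.

```python
# Quick arithmetic on the trigger chain quoted on this slide (rates and
# throughputs are the slide's numbers; the per-event sizes are derived).

chain = [
    ("Detector readout", 40e6, 1000e12),   # 40 MHz, ~1000 TB/s
    ("After Level 1",    75e3,   75e9),    # 75 kHz,   75 GB/s
    ("After Level 2",     5e3,    5e9),    #  5 kHz,    5 GB/s
    ("After Level 3",     1e2,  1.5e9),    # 100 Hz, 100-1500 MB/s (upper end)
]

prev_rate = None
for name, rate_hz, bytes_per_s in chain:
    size_mb = bytes_per_s / rate_hz / 1e6
    if prev_rate is None:
        print(f"{name}: ~{size_mb:.0f} MB/event")
    else:
        print(f"{name}: rejection factor ~{prev_rate / rate_hz:,.0f}x, "
              f"~{size_mb:.0f} MB/event")
    prev_rate = rate_hz

total = chain[0][1] / chain[-1][1]
print(f"Overall event-rate reduction: ~{total:,.0f}x")   # ~400,000x

```
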
Slide 15: LHC: Higgs Decay into 4 Muons
- 10^9 events/sec; selectivity: 1 in 10^13
(Event display: all charged tracks with pT > 2 GeV; reconstructed tracks with pT > 25 GeV, plus 30 minimum-bias events)

Slide 16: CMS Experiment: Hierarchy of LHC Data Grid Resources
(Diagram of the tiered CMS computing model)
- Tier 0: online system feeding the CERN Computer Center (> 20 TIPS) at 100-1500 MBytes/s
- Tier 1: national centers (UK, Korea, Russia, USA)
- Tier 2: regional centers
- Tier 3: institute servers with physics caches
- Tier 4: individual PCs
- Wide-area links between tiers range from 1-2.5 Gbps up to 10-40 Gbps
- Capacity ratio roughly Tier 0 : (all Tier 1) : (all Tier 2) ~ 1:1:1
- ~10s of Petabytes by 2007-8; ~1000 Petabytes in 5-7 years
(A data-movement sketch follows this slide.)

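For a rough sense of why multi-Gbps links matter at this scale, the sketch below computes the time to move one Petabyte at the link speeds quoted on the slide, ignoring protocol overhead and contention.

```python
# How long does it take to move 1 PB at the link speeds quoted on this slide?
# Pure arithmetic, ignoring protocol overhead and contention.

PB = 1e15  # bytes

for gbps in (1, 2.5, 10, 40):
    bytes_per_s = gbps * 1e9 / 8
    days = PB / bytes_per_s / 86400
    print(f"{gbps:>4} Gbps: ~{days:.0f} days per Petabyte")

```

At 1 Gbps a Petabyte takes roughly three months to transfer; even at 10 Gbps it is over a week, which is why the tiered model distributes data rather than shipping everything everywhere.
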
Slide 17: Digital Astronomy
- Future dominated by detector improvements
  - Moore's Law growth in CCDs
  - Gigapixel arrays on the horizon
- Growth in CPU/storage tracking data volumes
(Chart: total area of 3m+ telescopes worldwide in m^2 ("glass") vs. total number of CCD pixels in Mpixels; 25-year growth: 30x in glass, 3000x in pixels)

Slide 18: The Age of Astronomical Mega-Surveys
- Next-generation mega-surveys will change astronomy
  - Large sky coverage
  - Sound statistical plans, uniform systematics
- The technology to store and access the data is here
  - Following Moore's law
- Integrating these archives for the whole community
  - Astronomical data mining will lead to stunning new discoveries
  - "Virtual Observatory" (next slides)

Slide 19: Virtual Observatories
(Diagram: multi-wavelength astronomy across multiple surveys, combining)
- Source catalogs and image data
- Specialized data: spectroscopy, time series, polarization
- Information archives: derived & legacy data (NED, Simbad, ADS, etc.)
- Discovery tools: visualization, statistics
- Standards

Slide 20: Virtual Observatory Data Challenge
- Digital representation of the sky
  - All-sky + deep fields
  - Integrated catalog and image databases
  - Spectra of selected samples
- Size of the archived data
  - 40,000 square degrees
  - Resolution: ~50 trillion pixels
  - One band (2 bytes/pixel): 100 Terabytes
  - Multi-wavelength: 500-1000 Terabytes
  - Time dimension: many Petabytes
- Large, globally distributed database engines
  - Multi-Petabyte data size, distributed widely
  - Thousands of queries per day, GByte/s I/O speed per site
  - Data Grid computing infrastructure
(The sketch below checks the storage arithmetic.)

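The storage figures above follow from simple arithmetic; the sketch below reproduces them. The 5-10 band count is inferred from the quoted 500-1000 Terabyte range, not stated on the slide.

```python
# Checking the storage numbers quoted on this slide.
pixels = 50e12            # ~50 trillion pixels for the all-sky image
bytes_per_pixel = 2       # one band, 2 bytes/pixel

one_band_tb = pixels * bytes_per_pixel / 1e12
print(f"One band: {one_band_tb:.0f} TB")            # 100 TB, as quoted

# The 500-1000 TB multi-wavelength figure corresponds to roughly 5-10 bands.
for bands in (5, 10):
    print(f"{bands} bands: {bands * one_band_tb:.0f} TB")

```
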
Slide 21: Sloan Sky Survey Data Grid

Slide 22: International Grid/Networking Projects
- US, EU, E. Europe, Asia, S. America, …

Slide 23: Global Context: Data Grid Projects
- U.S. projects
  - Particle Physics Data Grid (PPDG): DOE
  - GriPhyN: NSF
  - International Virtual Data Grid Laboratory (iVDGL): NSF
  - TeraGrid: NSF
  - DOE Science Grid: DOE
  - NSF Middleware Initiative (NMI): NSF
- EU, Asia major projects
  - European Data Grid (EU, EC)
  - LHC Computing Grid (LCG) (CERN)
  - EU national projects (UK, Italy, France, …)
  - CrossGrid (EU, EC)
  - DataTAG (EU, EC)
  - Japanese project
  - Korea project

Slide 24: Particle Physics Data Grid
- Funded 2001-2004 at US$9.5M (DOE)
- Driven by HENP experiments: D0, BaBar, STAR, CMS, ATLAS

Slide 25: PPDG Goals
- Serve high energy & nuclear physics (HENP) experiments
  - Unique challenges, diverse test environments
- Develop advanced Grid technologies
  - Focus on end-to-end integration
  - Maintain practical orientation
  - Networks, instrumentation, monitoring
  - DB file/object replication, caching, catalogs, end-to-end movement
- Make tools general enough for the wider community
- Collaboration with GriPhyN, iVDGL, EDG, LCG
  - ESNet Certificate Authority work, security

Slide 26: GriPhyN and iVDGL
- Both funded through the NSF ITR program
  - GriPhyN: $11.9M (NSF) + $1.6M (matching), 2000-2005
  - iVDGL: $13.7M (NSF) + $2M (matching), 2001-2006
- Basic composition
  - GriPhyN: 12 funded universities, SDSC, 3 labs (~80 people)
  - iVDGL: 16 funded institutions, SDSC, 3 labs (~80 people)
  - Experiments: US-CMS, US-ATLAS, LIGO, SDSS/NVO
- Large overlap of people, institutions, management
- Grid research vs. Grid deployment
  - GriPhyN: CS research, Virtual Data Toolkit (VDT) development
  - iVDGL: Grid laboratory deployment
  - 4 physics experiments provide frontier challenges
  - VDT in common

Slide 27: GriPhyN Computer Science Challenges
- Virtual data (more later)
  - Data + programs (content), plus the executions of those programs
  - Representation, discovery, & manipulation of workflows and associated data & programs
- Planning
  - Mapping workflows in an efficient, policy-aware manner onto distributed resources
- Execution
  - Executing workflows, including data movements, reliably and efficiently
- Performance
  - Monitoring system performance for scheduling & troubleshooting

Slide 28: Goal: "PetaScale" Virtual-Data Grids
(Architecture diagram: interactive user tools serving production teams, workgroups, and single researchers sit on top of virtual data tools, request planning & scheduling tools, and request execution & management tools; these rely on resource management, security and policy, and other Grid services to run transformations over distributed resources (code, storage, CPUs, networks) and raw data sources at PetaOps / Petabytes performance scales)

Slide 29: GriPhyN/iVDGL Science Drivers
- US-CMS & US-ATLAS
  - HEP experiments at the LHC (CERN), from 2007
  - 100s of Petabytes
- LIGO
  - Gravity wave experiment, from 2002
  - 100s of Terabytes
- Sloan Digital Sky Survey
  - Digital astronomy (1/4 of the sky), from 2001
  - 10s of Terabytes
- Common demands: massive CPU, large distributed datasets, large distributed communities, with data and community both growing

Slide 30: Virtual Data: Derivation and Provenance
- Most scientific data are not simple "measurements"
  - They are computationally corrected/reconstructed
  - They can be produced by numerical simulation
- Science & engineering projects are ever more CPU and data intensive
  - Programs are significant community resources (transformations)
  - So are the executions of those programs (derivations)
- Management of dataset transformations is important!
  - Derivation: instantiation of a potential data product
  - Provenance: exact history of any existing data product
- We already do this, but manually!

Slide 31: Virtual Data Motivations (1)
(Diagram: a transformation consumes and generates data products; a derivation records the execution of a transformation that produced a particular data product)
- "I've detected a muon calibration error and want to know which derived data products need to be recomputed."
- "I've found some interesting data, but I need to know exactly what corrections were applied before I can trust it."
- "I want to search a database for 3-muon SUSY events. If a program that does this analysis exists, I won't have to write one from scratch."
- "I want to apply a forward jet analysis to 100M events. If the results already exist, I'll save weeks of computation."
(A minimal bookkeeping sketch follows this slide.)

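The first use case ("which derived data products need to be recomputed?") is essentially dependency tracking over transformations and derivations. The sketch below is a hypothetical illustration of that bookkeeping; it is not Chimera's actual API, and the transformation and product names are invented.

```python
# Hypothetical illustration (not the Chimera API) of the bookkeeping behind
# "which derived data products need to be recomputed?": a derivation records
# which transformation ran, what it consumed, and what it generated.

from dataclasses import dataclass

@dataclass
class Derivation:
    transformation: str       # program that was run
    inputs: list[str]         # data products consumed
    outputs: list[str]        # data products generated

catalog = [
    Derivation("reconstruct_tracks", ["raw_events", "muon_calibration_v1"], ["tracks"]),
    Derivation("select_dimuons",     ["tracks"],                            ["dimuon_sample"]),
    Derivation("make_histograms",    ["dimuon_sample"],                     ["mass_plots"]),
]

def downstream(changed: str) -> set[str]:
    """All derived products that depend, directly or indirectly, on `changed`."""
    stale, frontier = set(), {changed}
    while frontier:
        frontier = {out for d in catalog for out in d.outputs
                    if set(d.inputs) & frontier} - stale
        stale |= frontier
    return stale

# A bad muon calibration invalidates everything derived from it.
print(downstream("muon_calibration_v1"))
# prints the three downstream products: tracks, dimuon_sample, mass_plots

```

The same catalog answers the other use cases: walking the chain backwards from a product gives its exact provenance, and looking up an existing derivation by transformation and inputs tells you whether a requested result already exists.
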
Slide 32: Virtual Data Motivations (2)
- Data track-ability and result audit-ability
  - Universally sought by scientific applications
- Facilitates tool and data sharing and collaboration
  - Data can be sent along with its recipe
- Repair and correction of data
  - Rebuild data products (cf. "make")
- Workflow management
  - Organizing, locating, specifying, and requesting data products
- Performance optimizations
  - Ability to re-create data rather than move it
- Moves these tasks from manual/error-prone to automated/robust

Slide 33: "Chimera" Virtual Data System
- Virtual Data API
  - A Java class hierarchy to represent transformations & derivations
- Virtual Data Language
  - Textual form for people & illustrative examples
  - XML form for machine-to-machine interfaces
- Virtual Data Database
  - Makes the objects of a virtual data definition persistent
- Virtual Data Service (future)
  - Provides a service interface (e.g., OGSA) to persistent objects
- Version 1.0 available; to be put into VDT 1.1.7

Slide 34: Chimera Application: SDSS Analysis
- Chimera Virtual Data System + GriPhyN Virtual Data Toolkit + iVDGL Data Grid (many CPUs)
(Figures: galaxy cluster data; cluster size distribution)

Slide 35: Virtual Data and LHC Computing
- US-CMS
  - Chimera prototype tested with CMS Monte Carlo production (~200K events)
  - Currently integrating Chimera into standard CMS production tools
  - Integrating virtual data into Grid-enabled analysis tools
- US-ATLAS
  - Integrating Chimera into ATLAS software
- HEPCAL document includes the first virtual data use cases
  - Very basic cases, need elaboration
  - Discuss with the LHC experiments: requirements, scope, technologies
- New proposal to the NSF ITR program ($15M)
  - "Dynamic Workspaces for Scientific Analysis Communities"
- Continued progress requires collaboration with CS groups
  - Distributed scheduling, workflow optimization, …
  - Need collaboration with CS to develop robust tools

Slide 36: iVDGL Goals and Context
- International Virtual-Data Grid Laboratory
  - A global Grid laboratory (US, EU, E. Europe, Asia, S. America, …)
  - A place to conduct Data Grid tests "at scale"
  - A mechanism to create common Grid infrastructure
  - A laboratory for other disciplines to perform Data Grid tests
  - A focus of outreach efforts to small institutions
- Context of iVDGL in the US-LHC computing program
  - Develop and operate proto-Tier2 centers
  - Learn how to do Grid operations (GOC)
- International participation
  - DataTag
  - UK e-Science programme: supports 6 CS Fellows per year in the U.S.

Slide 37: US-iVDGL Sites (Spring 2003)
(Map of Tier1, Tier2, and Tier3 sites: UF, Wisconsin, BNL, Indiana, Boston U, SKC, Brownsville, Hampton, PSU, J. Hopkins, Caltech, FIU, FSU, Arlington, Michigan, LBL, Oklahoma, Argonne, Vanderbilt, UCSD/SDSC, NCSA, Fermilab)
- Prospective partners: EU, CERN, Brazil, Australia, Korea, Japan

Slide 38: US-CMS Grid Testbed

Slide 39: US-CMS Testbed Success Story
- Production run for Monte Carlo data production
  - Assigned 1.5 million events for "eGamma Bigjets"
  - ~500 sec per event on a 750 MHz processor; all production stages from simulation to ntuple
  - 2 months of continuous running across 5 testbed sites
  - Demonstrated at Supercomputing 2002
(The sketch below works out the implied scale of this run.)

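For scale, the quoted numbers imply roughly the following; this is simple arithmetic on the slide's figures, with "2 months" taken as about 60 days.

```python
# Rough scale of the production run described on this slide, derived from the
# quoted numbers (1.5M events, ~500 s/event, ~2 months of continuous running).

events = 1.5e6
sec_per_event = 500
cpu_seconds = events * sec_per_event

cpu_years = cpu_seconds / (365 * 86400)
wall_seconds = 2 * 30 * 86400          # ~2 months of continuous running
avg_cpus = cpu_seconds / wall_seconds

print(f"Total work: ~{cpu_years:.0f} CPU-years")                   # ~24 CPU-years
print(f"Average concurrent CPUs over 2 months: ~{avg_cpus:.0f}")   # ~145

```
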
Slide 40: Creation of WorldGrid
- Joint iVDGL/DataTag/EDG effort
  - Resources from both sides (15 sites)
  - Monitoring tools (Ganglia, MDS, NetSaint, …)
  - Visualization tools (Nagios, MapCenter, Ganglia)
- Applications: ScienceGrid
  - CMS: CMKIN, CMSIM
  - ATLAS: ATLSIM
- Submit jobs from the US or EU; jobs can run on any cluster
- Demonstrated at IST2002 (Copenhagen) and SC2002 (Baltimore)

Slide 41: WorldGrid Sites

Slide 42: Grid Coordination

Slide 43: U.S. Project Coordination: Trillium
- Trillium = GriPhyN + iVDGL + PPDG
  - Large overlap in leadership, people, experiments
  - Driven primarily by HENP, particularly the LHC experiments
- Benefits of coordination
  - Common software base + packaging: VDT + Pacman
  - Collaborative / joint projects: monitoring, demos, security, …
  - Wide deployment of new technologies, e.g. virtual data
  - Stronger, broader outreach effort
- Forum for US Grid projects
  - Joint view, strategies, meetings and work
  - Unified entity for dealing with EU & other Grid projects

Slide 44: International Grid Coordination
- Global Grid Forum (GGF)
  - International forum for general Grid efforts
  - Many working groups, standards definitions
- Close collaboration with the EU DataGrid (EDG)
  - Many connections with EDG activities
- HICB: HEP Inter-Grid Coordination Board
  - Non-competitive forum; strategic issues, consensus
  - Cross-project policies, procedures and technology; joint projects
- HICB-JTB Joint Technical Board
  - Definition, oversight and tracking of joint projects
  - GLUE interoperability group
- Participation in the LHC Computing Grid (LCG)
  - Software Computing Committee (SC2)
  - Project Execution Board (PEB)
  - Grid Deployment Board (GDB)

Slide 45: HEP and International Grid Projects
- HEP continues to be the strongest science driver (in collaboration with computer scientists)
  - Many national and international initiatives
  - The LHC is a particularly strong driving function
- US-HEP is committed to working with international partners
  - Many networking initiatives with EU colleagues
  - Collaboration on the LHC Grid project
- Grid projects driving & linked to network developments
  - DataTag, SCIC, the US-CERN link, Internet2
- New partners being actively sought
  - Korea, Russia, China, Japan, Brazil, Romania, …
  - Participate in US-CMS and US-ATLAS Grid testbeds
  - Link to WorldGrid, once some software issues are fixed

Slide 46: New Grid Efforts

Slide 47: CHEPREO: An Inter-Regional Center for High Energy Physics Research and Educational Outreach at Florida International University
- Scope
  - E/O Center in the Miami area
  - iVDGL Grid activities
  - CMS research
  - AMPATH network (S. America)
  - International activities (Brazil, etc.)
- Status
  - Proposal submitted Dec. 2002
  - Presented to NSF review panel
  - Project Execution Plan submitted
  - Funding in June?

Slide 48: A Global Grid Enabled Collaboratory for Scientific Research (GECSR)
- $4M ITR proposal
  - Caltech (HN PI, JB CoPI), Michigan (CoPI, CoPI), Maryland (CoPI)
  - Plus senior personnel from Lawrence Berkeley Lab, Oklahoma, Fermilab, Arlington (U. Texas), Iowa, Florida State
- First Grid-enabled Collaboratory: tight integration between
  - The Science of Collaboratories
  - A globally scalable work environment
  - Sophisticated collaborative tools (VRVS, VNC; next-generation)
  - An agent-based monitoring & decision support system (MonALISA)
- Initial targets are the global HENP collaborations, but GECSR is expected to be widely applicable to other large-scale collaborative scientific endeavors
- "Giving scientists from all world regions the means to function as full partners in the process of search and discovery"

Slide 49: Large ITR Proposal: $15M
- "Dynamic Workspaces: Enabling Global Analysis Communities"

Slide 50: UltraLight Proposal to NSF
- 10 Gb/s+ network
  - Caltech, UF, FIU, UM, MIT
  - SLAC, FNAL
  - International partners
  - Cisco
- Applications
  - HEP
  - VLBI
  - Radiation oncology
  - Grid projects

Slide 51: GLORIAD
- New 10 Gb/s network linking US-Russia-China
- Plus a Grid component linking science projects
  - H. Newman, P. Avery participating
- Meeting at NSF April 14 with US-Russia-China representatives
  - HEP people (Hesheng, et al.)
  - Broad agreement that HEP can drive the Grid portion
  - More meetings planned

Slide 52: Summary
- Progress on many fronts in PPDG/GriPhyN/iVDGL
  - Packaging: Pacman + VDT
  - Testbeds (development and production)
  - Major demonstration projects
  - Productions based on Grid tools using iVDGL resources
- WorldGrid providing excellent experience
  - Excellent collaboration with EU partners
  - Building links to our Asian and other partners
  - Excellent opportunity to build lasting infrastructure
- Looking to collaborate with more international partners
  - Testbeds, monitoring, deploying VDT more widely
- New directions
  - Virtual data: a powerful paradigm for LHC computing
  - Emphasis on Grid-enabled analysis

Slide 53: Grid References
- Grid Book: www.mkp.com/grids
- Globus: www.globus.org
- Global Grid Forum: www.gridforum.org
- PPDG: www.ppdg.net
- GriPhyN: www.griphyn.org
- iVDGL: www.ivdgl.org
- TeraGrid: www.teragrid.org
- EU DataGrid: www.eu-datagrid.org