Published by Anthony Parker. Modified over 10 years ago.
1
Data Challenges in e-Science Aberdeen Prof. Malcolm Atkinson Director www.nesc.ac.uk 2nd December 2003
2
What is e-Science?
3
Foundation for e-Science sensor nets Shared data archives computers software colleagues instruments Grid e-Science methodologies will rapidly transform science, engineering, medicine and business driven by exponential growth (×1000/decade) enabling a whole-system approach Diagram derived from Ian Foster's slide
4
Three-way Alliance Computing Science Systems, Notations & Formal Foundation Process & Trust Theory Models & Simulations Shared Data Experiment & Advanced Data Collection Shared Data Multi-national, Multi-discipline, Computer-enabled Consortia, Cultures & Societies Requires Much Engineering, Much Innovation Changes Culture, New Mores, New Behaviours New Opportunities, New Results, New Rewards
5
Biochemical Pathway Simulator (Computing Science, Bioinformatics, Beatson Cancer Research Labs) DTI Bioscience Beacon Project Harnessing Genomics Programme Slide from Professor Muffy Calder, Glasgow
6
Why is Data Important?
7
Information, knowledge, decisions & designs Derived & Synthesised Data Data as Evidence – all disciplines Hypothesis, Curiosity or … Collections of Data Analysis & Models Driven by creativity, imagination and perspiration
8
Information, knowledge, decisions & designs Data as Evidence - Historically Hypothesis, Curiosity or … Collections of Data Analysis & Models Derived & Synthesised Data Individual's idea Personal collection Personal effort Lab Notebook Driven by creativity, imagination, perspiration & personal resources
9
Information, knowledge, decisions & designs Data as Evidence – Enterprise Scale Hypothesis, Curiosity, Business Goals Collections of Digital Data Analysis & Computational Models Derived & Synthesised Data Agreed Hypothesis or Goals Enterprise databases & (archive) file stores Data Production Pipelines Data Products & Results Driven by creativity, imagination, perspiration & company's resources
10
Information, knowledge, decisions & designs Data as Evidence – e-Science Shared Goals Multiple hypotheses Collections of Published & Private Data Analysis Computation Annotation Derived & Synthesised Data Communities and Challenges Multi-enterprise & Public Curation Synthesis from Multiple Sources Multi-enterprise Models, Computation & Workflow Shared Data Products & Results Driven by creativity, imagination, perspiration & shared resources
11
Distributed Aircraft Maintenance Environment: Universities of Leeds, Oxford, Sheffield & York
global in-flight engine diagnostics in-flight data airline maintenance centre ground station global network e.g. SITA internet, e-mail, pager DS&S Engine Health Center data centre
100,000 aircraft × 0.5 GB/flight × 4 flights/day = 200 TB/day
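The 200 TB/day figure on this slide follows from the three quantities quoted with it. A minimal sketch of that arithmetic, assuming decimal units (1 TB = 1000 GB); the function name is illustrative and not part of the original slide:

```python
# Figures quoted on the slide.
AIRCRAFT = 100_000
GB_PER_FLIGHT = 0.5
FLIGHTS_PER_DAY = 4


def daily_volume_tb(aircraft, gb_per_flight, flights_per_day, gb_per_tb=1000):
    """Return the fleet-wide data volume per day in terabytes.

    Assumes decimal units (1 TB = 1000 GB), which is what reproduces
    the slide's round figure.
    """
    return aircraft * gb_per_flight * flights_per_day / gb_per_tb


fleet_tb_per_day = daily_volume_tb(AIRCRAFT, GB_PER_FLIGHT, FLIGHTS_PER_DAY)
print(f"{fleet_tb_per_day:.0f} TB/day")  # 200 TB/day
```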
12
LHC Distributed Simulation & Analysis
Online System, Offline Farm (~20 TIPS)
Tier 0: CERN Computer Centre (>20 TIPS)
Tier 1: RAL Regional Centre, US Regional Centre, French Regional Centre, Italian Regional Centre
Tier 2: Tier2 Centres (~1 TIPS each), ScotGRID++ (~1 TIPS)
Tier 3: Institutes (~0.25 TIPS), each with a physics data cache
Tier 4: Workstations
Link rates shown on the diagram: ~PBytes/sec, ~100 MBytes/sec, ~Gbits/sec or Air Freight, ~Gbits/sec, 100 - 1000 Mbits/sec
One bunch crossing per 25 ns
100 triggers per second
Each event is ~1 Mbyte
Physicists work on analysis channels
Each institute has ~10 physicists working on one or more channels
Data for these channels should be cached by the institute server
1 TIPS = 25,000 SpecInt95; PC (1999) = ~15 SpecInt95
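The event figures on this slide can be checked against the ~100 MBytes/sec rate it quotes. A rough sketch using only the slide's numbers; the variable names are illustrative, and decimal units (1 TB = 1,000,000 MB) are an assumption:

```python
# Figures quoted on the slide.
BUNCH_CROSSING_INTERVAL_NS = 25
TRIGGERS_PER_SECOND = 100
EVENT_SIZE_MB = 1  # "~1 Mbyte" per event

# One crossing every 25 ns corresponds to 40 million crossings per second.
crossings_per_second = 1e9 / BUNCH_CROSSING_INTERVAL_NS

# Only triggered events are kept: 100 events/s at ~1 MB each.
stored_mb_per_second = TRIGGERS_PER_SECOND * EVENT_SIZE_MB

# Ratio of crossings to recorded events.
crossings_per_trigger = crossings_per_second / TRIGGERS_PER_SECOND

# Daily volume if that rate were sustained around the clock (an assumption).
stored_tb_per_day = stored_mb_per_second * 86_400 / 1e6

print(f"{crossings_per_second:.0f} crossings/s")
print(f"{stored_mb_per_second} MB/s recorded")
print(f"1 trigger per {crossings_per_trigger:.0f} crossings")
print(f"{stored_tb_per_day:.2f} TB/day at a sustained rate")
```

The recorded rate matches the ~100 MBytes/sec link shown on the diagram.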
13
DataGrid Testbed Sites (>40) Dubna Moscow RAL Lund Lisboa Santander Madrid Valencia Barcelona Paris Berlin Lyon Grenoble Marseille Brno Prague Torino Milano BO-CNAF PD-LNL Pisa Roma Catania ESRIN CERN IPSL Estec KNMI HEP sites ESA sites Francois.Etienne@in2p3.fr - Antonia.Ghiselli@cnaf.infn.it
14
Shared Goals Multiple hypotheses Collections of Published & Private Data Analysis Computation Annotation Derived & Synthesised Data Shared Goals Multiple hypotheses Collections of Published & Private Data Analysis Computation Annotation Derived & Synthesised Data Multiple overlapping communities Shared Goals Multiple hypotheses Collections of Published & Private Data Analysis Computation Annotation Derived & Synthesised Data Supported by common standards & shared infrastructure
15
Life-science Examples
16
Database Growth PDB Content Growth Bases 45,356,382,990
17
Wellcome Trust: Cardiovascular Functional Genomics Glasgow Edinburgh Leicester Oxford London Netherlands Shared data Public curated data BRIDGES IBM Depends on building & maintaining security, privacy & trust
18
Comparative Functional Genomics Large amounts of data Highly heterogeneous Data types Data forms community Highly complex and inter-related Volatile myGrid Project: Carole Goble, University of Manchester
19
UCSF UIUC From Klaus Schulten, Center for Biomolecular Modeling and Bioinformatics, Urbana-Champaign
20
DOE X-ray grand challenge: ANL, USC/ISI, NIST, U.Chicago tomographic reconstruction real-time collection wide-area dissemination desktop & VR clients with shared controls Advanced Photon Source Online Access to Scientific Instruments archival storage From Steve Tuecke 12 Oct. 01
21
Home Computers Evaluate AIDS Drugs Community = 1000s of home computer users Philanthropic computing vendor (Entropia) Research group (Scripps) Common goal = advance AIDS research From Steve Tuecke 12 Oct. 01
23
Astronomy Examples
25
Global Knowledge Communities driven by Data: e.g., Astronomy No. & sizes of data sets as of mid-2002, grouped by wavelength 12 waveband coverage of large areas of the sky Total about 200 TB data Doubling every 12 months Largest catalogues near 1B objects Data and images courtesy Alex Szalay, Johns Hopkins
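To show what "doubling every 12 months" implies, the sketch below projects the slide's mid-2002 total forward. It assumes the doubling rate stays constant, which the slide does not claim beyond the date it reports; the function name is illustrative:

```python
def projected_volume_tb(start_tb, years, doubling_period_years=1.0):
    """Project a data volume forward under constant exponential doubling.

    Assumption: the doubling period stays fixed over the whole interval.
    """
    return start_tb * 2 ** (years / doubling_period_years)


START_TB = 200  # "Total about 200 TB data" as of mid-2002

for years in (1, 5, 10):
    print(f"after {years:2d} years: {projected_volume_tb(START_TB, years):,.0f} TB")
```

Ten doublings multiply the volume by 1024, which is close to the ×1000/decade growth mentioned earlier in the talk.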
26
Sloan Digital Sky Survey Production System Slide from Ian Foster's SSDBM 03 keynote
27
Supernova Cosmology Requires Complex, Widely Distributed Workflow Management
28
Engineering Examples
29
whole-system simulations braking performance steering capabilities traction dampening capabilities landing gear models lift capabilities drag capabilities responsiveness wing models deflection capabilities responsiveness stabilizer models airframe models crew capabilities - accuracy - perception - stamina - reaction times - SOPs human models thrust performance reverse thrust performance responsiveness fuel consumption engine models NASA Information Power Grid: coupling all sub-system simulations - slide from Bill Johnson
30
Mathematicians Solve NUG30
Looking for the solution to the NUG30 quadratic assignment problem
An informal collaboration of mathematicians and computer scientists
Condor-G delivered 3.46E8 CPU seconds in 7 days (peak 1009 processors) in U.S. and Italy (8 sites)
Solution: 14,5,28,24,1,3,16,15,10,9,21,2,4,29,25,22,13,26,17,30,6,20,19,8,18,7,27,12,11,23
MetaNEOS: Argonne, Iowa, Northwestern, Wisconsin
From Miron Livny 7 Aug. 01
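A short sketch of what the NUG30 slide describes. It computes the average processor count implied by the quoted CPU time, checks that the listed solution is a permutation of 1 to 30, and evaluates the quadratic assignment objective by brute force on a three-facility toy instance. The toy `FLOW` and `DIST` matrices are invented for illustration and are not the NUG30 data; brute force is used only because the toy is tiny and is not the method the collaboration used.

```python
from itertools import permutations

# Figures quoted on the slide.
CPU_SECONDS = 3.46e8
WALL_DAYS = 7
average_processors = CPU_SECONDS / (WALL_DAYS * 86_400)

# The assignment listed on the slide.
solution = [14, 5, 28, 24, 1, 3, 16, 15, 10, 9, 21, 2, 4, 29, 25, 22,
            13, 26, 17, 30, 6, 20, 19, 8, 18, 7, 27, 12, 11, 23]
is_permutation = sorted(solution) == list(range(1, 31))


def qap_cost(flow, dist, assignment):
    """Quadratic assignment objective: sum of flow[i][j] * dist[p(i)][p(j)].

    assignment[i] is the location given to facility i (0-based).
    """
    n = len(assignment)
    return sum(flow[i][j] * dist[assignment[i]][assignment[j]]
               for i in range(n) for j in range(n))


# Toy instance (made up): three facilities, three locations on a line.
FLOW = [[0, 5, 2], [5, 0, 3], [2, 3, 0]]
DIST = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]

best = min(permutations(range(3)), key=lambda p: qap_cost(FLOW, DIST, p))
best_cost = qap_cost(FLOW, DIST, best)

print(f"average processors over 7 days: {average_processors:.0f}")
print(f"slide solution is a permutation of 1..30: {is_permutation}")
print(f"toy optimum {best} with cost {best_cost}")
```

Exhaustive search works for three facilities because there are only 3! = 6 assignments; a 30-facility instance has 30! of them, which is why the computation needed a large pool of processors.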
31
Network for Earthquake Engineering Simulation NEESgrid: national infrastructure to couple earthquake engineers with experimental facilities, databases, computers, & each other On-demand access to experiments, data streams, computing, archives, collaboration NEESgrid: Argonne, Michigan, NCSA, UIUC, USC From Steve Tuecke 12 Oct. 01
32
National Airspace Simulation Environment NASA Information Power Grid: aircraft, flight paths, airport operations and the environment are combined to get a virtual national airspace Virtual National Air Space VNAS GRC engine models LaRC airframe models landing gear models ARC wing models stabilizer models human models FAA ops data weather data airline schedule data digital flight data radar tracks terrain data surface data 22,000 commercial US flights a day 50,000 engine runs 22,000 airframe impact runs 132,000 landing/take-off gear runs 48,000 human crew runs 66,000 stabilizer runs 44,000 wing runs simulation drivers
33
Data Challenges
34
It's Easy to Forget How Different 2003 is From 1993
Derived from Ian Foster's slide at SSDBM July 03
Enormous quantities of data: Petabytes
  For an increasing number of communities the gating step is not collection but analysis
Ubiquitous Internet: >100 million hosts
  Collaboration & resource sharing the norm
  Security and Trust are crucial issues
Ultra-high-speed networks: >10 Gb/s
  Global optical networks
  Bottlenecks: last kilometre & firewalls
Huge quantities of computing: >100 Top/s
  Moore's law gives us all supercomputers
  Organising their effective use is the challenge
Moore's law everywhere
  Instruments, detectors, sensors, scanners, …
  Organising their effective use is the challenge
35
Tera and Peta Bytes (May 2003, Approximately Correct)
                     Terabyte                  Petabyte
RAM time to move     15 minutes                2 months
1Gb WAN move time    10 hours ($1000)          14 months ($1 million)
Disk Cost            7 disks = $5000 (SCSI)    6800 Disks + 490 units + 32 racks = $7 million
Disk Power           100 Watts                 100 Kilowatts
Disk Weight          5.6 Kg                    33 Tonnes
Disk Footprint       Inside machine            60 m²
See also Distributed Computing Economics, Jim Gray, Microsoft Research, MSR-TR-2003-24
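The WAN move times on this slide can be related to raw link speed with a little arithmetic. The sketch below assumes "1Gb WAN" means a 1 gigabit per second link, decimal units, and an average month of 30.44 days; none of these is stated on the slide, which itself labels its figures "Approximately Correct":

```python
TB = 10**12  # bytes, decimal units assumed
PB = 10**15


def transfer_seconds(n_bytes, bits_per_second):
    """Time to move n_bytes over a link at a given sustained bit rate."""
    return n_bytes * 8 / bits_per_second


# Ideal time for 1 TB if the full 1 Gb/s were sustained end to end.
ideal_tb_hours = transfer_seconds(TB, 1e9) / 3600

# Sustained rate implied by the slide's "10 hours" for a terabyte.
implied_bits_per_second = TB * 8 / (10 * 3600)

# Petabyte move time at that same implied rate, in months.
pb_months = transfer_seconds(PB, implied_bits_per_second) / 86_400 / 30.44

print(f"ideal 1 TB at 1 Gb/s: {ideal_tb_hours:.1f} hours")
print(f"rate implied by 10 hours: {implied_bits_per_second / 1e6:.0f} Mb/s")
print(f"1 PB at that rate: {pb_months:.1f} months")
```

Under these assumptions the slide's 10 hours corresponds to a sustained rate well below the nominal link speed, and scaling by 1000 gives a petabyte time consistent with the quoted 14 months.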
36
The Story so Far Technology enables Grids, More Data & … Information Grids will dominate Collaboration essential Combining approaches Combining skills Sharing resources (Structured) Data is the language of Collaboration Data Access & Integration a Ubiquitous Requirement Many hard technical challenges Scale, heterogeneity, distribution, dynamic variation Intimate combinations of data and computation With unpredictable (autonomous) development of both
37
Scientific Data Opportunities Global Production of Published Data Volume Diversity Combination Analysis Discovery Challenges Data Huggers Meagre metadata Ease of Use Optimised integration Dependability Opportunities Specialised Indexing New Data Organisation New Algorithms Varied Replication Shared Annotation Intensive Data & Computation Challenges Fundamental Principles Approximate Matching Multi-scale optimisation Autonomous Change Legacy structures Scale and Longevity Privacy and Mobility
38
UK e-Science
39
From presentation by Tony Hey
40
e-Science Programmes Vision UK will lead in the exploitation of e-Infrastructure New, faster and better research Engineering design, medical diagnosis, decision support, … e-Business, e-Research, e-Design & e-Decision Depends on Leading e-Infrastructure development & deployment
41
e-Science and SR2002
Research Council          2004-6            (2001-4)
Medical                   £13.1M            (£8M)
Biological                £10.0M            (£8M)
Environmental             £8.0M             (£7M)
Eng & Phys                £18.0M            (£17M)
HPC                       £2.5M             (£9M)
Core Prog.                £16.2M            (£15M) + £20M
Particle Phys & Astro     £31.6M            (£26M)
Economic & Social         £10.6M            (£3M)
Central Labs              £5.0M             (£5M)
42
www.nesc.ac.uk National e-Science Centre HPC(x) National e-Science Institute International relationships Engineering Task Force Grid Support Centre Architecture Task Force OGSA-DAI One of 11 Centre Projects GridNet to support standards work One of 6 administration projects Training team 5 Application projects 15 Fundamental Research projects EGEE Globus Alliance
43
NeSI in Edinburgh National e-Science Centre
44
NeSI Events held in the 2nd Year (from 1 Aug 2002 to 31 Jul 2003)
We have had 86 events (Year 1 figures in brackets):
  11 project meetings (4)
  11 research meetings (7)
  25 workshops (17 + 1)
  2 summer schools (0)
  15 training sessions (8)
  12 outreach events (3)
  5 international meetings (1)
  5 e-Science management meetings (7)
  (though the definitions are fuzzy!)
> 3600 Participant Days
Suggestions always welcome
Establishing a training team
Investing in community building, skill generation & knowledge development
45
NeSI Workshops Space for real work Crossing communities Creativity: new strategies and solutions Written reports Scientific Data Mining, Integration and Visualisation Grid Information Systems Portals and Portlets Virtual Observatory as a Data Grid Imaging, Medical Analysis and Grid Environments Grid Scheduling Provenance & Workflow GeoSciences & Scottish Bioinformatics Forum http://www.nesc.ac.uk/events/
46
E-Infrastructure
47
OGSA Infrastructure Architecture OGSI: Interface to Grid Infrastructure Data Intensive Applications for Application area X Compute, Data & Storage Resources Distributed Simulation, Analysis & Integration Technology for Application area X Data Intensive Users Virtual Integration Architecture Generic Virtual Data Access and Integration Layer Structured Data Integration Structured Data Access Structured Data Relational XML Semi-structured Transformation Registry Job Submission Data Transport Resource Usage Banking Brokering Workflow Authorisation
48
Data Access & Integration Services
Components: Client, Registry, Factory, Grid Data Service (GDS), XML / Relational database
Interaction types: SOAP/HTTP, service creation, API interactions
1a. Request to Registry for sources of data about x
1b. Registry responds with Factory handle
2a. Request to Factory for access to database
2b. Factory creates GridDataService to manage access
2c. Factory returns handle of GDS to client
3a. Client queries GDS with XPath, SQL, etc
3b. GDS interacts with database
3c. Results of query returned to client as XML
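The registry, factory and Grid Data Service interaction on this slide can be sketched as a toy, in-process simulation. Everything below is hypothetical: the class names, method names and the `stars` table are invented for illustration and are not the OGSA-DAI API, and plain method calls stand in for the SOAP/HTTP exchanges.

```python
import sqlite3
import xml.etree.ElementTree as ET


class GridDataService:
    """Hypothetical stand-in for a GDS managing access to one database."""

    def __init__(self, connection):
        self._connection = connection

    def query(self, sql):
        # 3b. GDS interacts with the database.
        cursor = self._connection.execute(sql)
        columns = [description[0] for description in cursor.description]
        # 3c. Results of the query are returned to the client as XML.
        root = ET.Element("results")
        for row in cursor:
            row_element = ET.SubElement(root, "row")
            for name, value in zip(columns, row):
                ET.SubElement(row_element, name).text = str(value)
        return ET.tostring(root, encoding="unicode")


class Factory:
    """Hypothetical factory that creates GDS instances for one database."""

    def __init__(self, connection):
        self._connection = connection

    def create_service(self):
        # 2b/2c. Factory creates a GDS and returns its handle.
        return GridDataService(self._connection)


class Registry:
    """Hypothetical registry mapping a topic to a factory handle."""

    def __init__(self):
        self._factories = {}

    def register(self, topic, factory):
        self._factories[topic] = factory

    def find(self, topic):
        # 1a/1b. Registry responds with a Factory handle.
        return self._factories[topic]


# Illustrative setup: an in-memory database standing in for a data resource.
database = sqlite3.connect(":memory:")
database.execute("CREATE TABLE stars (name TEXT, magnitude REAL)")
database.execute("INSERT INTO stars VALUES ('Sirius', -1.46)")
registry = Registry()
registry.register("stars", Factory(database))

# Client side, following steps 1a to 3c.
factory = registry.find("stars")                               # 1a, 1b
gds = factory.create_service()                                 # 2a, 2b, 2c
xml_result = gds.query("SELECT name, magnitude FROM stars")    # 3a, 3b, 3c
print(xml_result)
```

The point of the indirection is that the client never holds a database connection itself; it only holds handles obtained from the registry and the factory.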
49
Future DAI Services?
Components: Client, Analyst, Data Registry, Data Access & Integration master, GDS 1, GDS 2, GDS 3, GDTS 1, GDTS 2, XML database (Sx), Relational database (Sy)
Interaction types: SOAP/HTTP, service creation, API interactions
Context: Problem Solving Environment, Application Code, scientific Application coding, Semantic Meta data, scientific insights
1a. Request to Registry for sources of data about x & y
1b. Registry responds with Factory handle
2a. Request to Factory for access and integration from resources Sx and Sy
2b. Factory creates GridDataServices network
2c. Factory returns handle of GDS to client
3a. Client submits sequence of scripts, each with a set of queries to GDS with XPath, SQL, etc
3b. Client tells analyst
3c. Sequences of result sets returned to analyst as formatted binary described in a standard XML notation
50
Integration is our Focus Supporting Collaboration Bring together disciplines Bring together people engaged in shared challenge Inject initial energy Invent methods that work Supporting Collaborative Research Integrate compute, storage and communications Deliver and sustain integrated software stack Operate dependable infrastructure service Integrate multiple data sources Integrate data and computation Integrate experiment with simulation Integrate visualisation and analysis High-level tools and automation essential Fundamental research as a foundation
51
Take Home Message Data is a Major Source of Challenges AND an Enabler of New Science, Engineering, Medicine, Planning, … Information Grids Support for collaboration Support for computation and data grids Structured data is fundamental Integrated strategies & technologies needed E-Infrastructure is Here – More to do – technically & socio-economically Join in – explore the potential – develop the methods & standards NeSC would like to help you develop e-Science We seek suggestions and collaboration