Published by Lily Allen. Modified over 11 years ago.
1
TICER Summer School, Thursday 24th August 2006. Dave Berry & Malcolm Atkinson, National e-Science Centre, Edinburgh. www.nesc.ac.uk
2
Digital Libraries, Grids & E-Science. What is E-Science? What is Grid Computing? Data Grids (Requirements, Examples, Technologies). Data Virtualisation. The Open Grid Services Architecture. Challenges.
4
What is e-Science? Goal: to enable better research in all disciplines. Method: develop collaboration supported by advanced distributed computation –to generate, curate and analyse rich data resources (from experiments, observations, simulations & publications; quality management, preservation and reliable evidence) –to develop and explore models and simulations (computation and data at all scales; trustworthy, economic, timely and relevant results) –to enable dynamic distributed collaboration (facilitating collaboration with information and resource sharing; security, trust, reliability, accountability, manageability and agility)
5
climateprediction.net and GENIE. Largest climate model ensemble: >45,000 users, >1,000,000 model years. [Figure: response of Atlantic circulation to freshwater forcing; panels labelled 10K and 2K]
6
Integrative Biology (courtesy of David Gavaghan & IB Team). Tackling two Grand Challenge research questions: What causes heart disease? How does a cancer form and grow? Together these diseases cause 61% of all UK deaths. Building a powerful, fault-tolerant Grid infrastructure for biomedical science. Enabling biomedical researchers to use distributed resources such as high-performance computers, databases and visualisation tools to develop coupled multi-scale models of how these killer diseases develop.
7
Biomedical Research Informatics Delivered by Grid Enabled Services. [Diagram labels: Synteny Grid Service; blast; Portal] http://www.brc.dcs.gla.ac.uk/projects/bridges/
8
eDiaMoND: Screening for Breast Cancer. From 1 Trust to Many Trusts: collaborative working, audit capability, epidemiology. Other modalities: MRI, PET, ultrasound. Better access to case information and digital tools. Supplement mentoring with access to digital training cases and sharing of information across clinics. [Diagram labels: Letters; Radiology reporting systems; eDiaMoND Grid; 2ndary Capture or FFD; Case Information; X-Rays and Case Information; Digital Reading; SMF; Case and Reading Information; CAD; Temporal Comparison; Screening; Electronic Patient Records; Assessment/Symptomatic; Biopsy; Symptomatic/Assessment Information; Training; Manage Training Cases; Perform Training; 3D Images; Patients] Provided by eDiamond project: Prof. Sir Mike Brady et al.
9
E-Science Data Resources. Curated databases –Public, institutional, group, personal. Online journals and preprints. Text mining and indexing services. Raw storage (disk & tape). Replicated files. Persistent archives. Registries. …
10
EBank. Slide from Jeremy Frey
11
Biomedical data – making connections
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg
Slide provided by Carole Goble, University of Manchester
12
Using Workflows to Link Services. Describe the steps in a scripting language. Steps performed by a Workflow Enactment Engine. Many languages in use –Trade off: familiarity & availability –Trade off: detailed control versus abstraction. Incrementally develop correct process –Sharable & editable –Basis for scientific communication & validation –Valuable IPR asset. Repetition is now easy –Parameterised explicitly & implicitly
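A minimal sketch of the idea on this slide, assuming a workflow is simply an ordered list of Python functions and the enactment engine is a loop that runs them in turn. The step names (`fetch`, `transform`, `summarise`) and the `enact` helper are hypothetical, chosen only for illustration; real systems such as Taverna or a BPEL engine do far more.

```python
from typing import Callable, Dict, List

# Hypothetical step signature: each step takes the shared context and returns it.
Step = Callable[[dict], dict]


def enact(steps: List[Step], params: Dict[str, object]) -> dict:
    """Toy enactment engine: run the steps in order over a shared context."""
    context = dict(params)      # explicit parameters seed the context
    context["log"] = []         # record which steps ran, as a crude provenance trail
    for step in steps:
        context = step(context)
        context["log"].append(step.__name__)
    return context


def fetch(ctx: dict) -> dict:
    # Stand-in for retrieving data from a service.
    ctx["data"] = list(range(ctx["n"]))
    return ctx


def transform(ctx: dict) -> dict:
    # Stand-in for a format or unit conversion.
    ctx["data"] = [x * ctx["scale"] for x in ctx["data"]]
    return ctx


def summarise(ctx: dict) -> dict:
    ctx["total"] = sum(ctx["data"])
    return ctx


WORKFLOW = [fetch, transform, summarise]
result = enact(WORKFLOW, {"n": 4, "scale": 10})
print(result["total"], result["log"])
```

Re-running the same workflow with different parameters is a one-line change, which is the sense in which repetition becomes easy.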
13
Workflow Systems
Language | WF Enact. | Comments
Shell scripts | Shell + OS | Common but not often thought of as WF. Depend on context, e.g. NFS across all sites
Perl | Perl runtime | Popular in bioinformatics. Similar context dependence – distribution has to be coded
Java | JVM | Popular target because of JVM ubiquity – similar dependence – distribution has to be coded
BPEL | BPEL Enactment | OASIS standard for industry – coordinating use of multiple Web Services – low level detail – tools
Taverna | Scufl | EBI, OMII-UK & MyGrid http://taverna.sourceforge.net/index.php
VDT / Pegasus | Chimera & DAGman | High-level abstract formulation of workflows, automated mapping towards executable forms, cached result re-use
Kepler | | BIRN, GEON & SEEK http://kepler-project.org/
14
Workflow example: Taverna in MyGrid http://www.mygrid.org.uk/ allows the e-Scientist to describe and enact their experimental processes in a structured, repeatable and verifiable way. Components: GUI, workflow language, enactment engine.
15
Pub/Sub for laboratory data, using a broker and ultimately delivered over GPRS. [Diagram label: Notification] Comb-e-chem: Jeremy Frey
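The publish/subscribe pattern named on this slide can be sketched as follows. This is an in-memory toy under stated assumptions, not the Comb-e-chem implementation: the `Broker` class, the topic name `lab/temperature` and the message fields are all hypothetical, and a real deployment would put a network transport (such as the GPRS link mentioned above) between broker and subscriber.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Callback = Callable[[dict], None]


class Broker:
    """Toy in-memory broker: routes each published message to the topic's subscribers."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callback]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callback) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: dict) -> int:
        """Deliver the message and return how many subscribers received it."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(message)
        return len(callbacks)


# A laboratory instrument publishes a reading; a subscriber collects notifications.
received: List[dict] = []
broker = Broker()
broker.subscribe("lab/temperature", received.append)
delivered = broker.publish("lab/temperature", {"sensor": "t1", "celsius": 21.5})
print(delivered, received)
```

The publisher never needs to know who is listening, which is what lets new consumers of laboratory data be added without touching the instrument side.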
16
Relevance to Digital Libraries. Similar concerns –Data curation & management –Metadata, discovery –Secure access (AAA +) –Provenance & data quality –Local autonomy –Availability, resilience. Common technology –Grid as an implementation technology
18
What is a Grid? A grid is a system consisting of: distributed but connected resources; and software and/or hardware that provides and manages logically seamless access to those resources to meet desired objectives. [Diagram labels: License; Printer; R2AD; Database; Web server; Data Center; Cluster; Handheld; Supercomputer; Workstation; Server] Source: Hiro Kishimoto GGF17 Keynote May 2006
19
Virtualizing Resources. [Diagram labels: Resources; Web services; Access; Storage; Sensors; Applications; Information; Computers; Resource-specific interfaces; Type-specific interfaces; Common interfaces] Hiro Kishimoto: Keynote GGF17
20
Ideas and Forms. Key ideas –Virtualised resources –Secure access –Local autonomy. Many forms –Cycle stealing –Linked supercomputers –Distributed file systems –Federated databases –Commercial data centres –Utility computing
21
Grid Middleware. [Diagram labels: Virtualized resources; Grid middleware services; Brokering Service; Registry Service; Data Service; CPU Resource; Printer Service; Job-Submit Service; Compute Service; Application Service; Notify; Advertise] Hiro Kishimoto: Keynote GGF17
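One possible reading of the diagram is that services advertise themselves to a registry and a brokering service picks among them. The sketch below illustrates that interaction only; the `Registry` class, the `choose_service` function, the `free_slots` attribute and the selection rule are assumptions made for the example and are not taken from any particular middleware.

```python
from typing import Dict, List, Optional


class Registry:
    """Toy registry service: stores advertisements and answers lookups by kind."""

    def __init__(self) -> None:
        self._adverts: List[Dict[str, object]] = []

    def advertise(self, name: str, kind: str, free_slots: int) -> None:
        self._adverts.append({"name": name, "kind": kind, "free_slots": free_slots})

    def find(self, kind: str) -> List[Dict[str, object]]:
        return [ad for ad in self._adverts if ad["kind"] == kind]


def choose_service(registry: Registry, kind: str) -> Optional[str]:
    """Toy broker: pick the advertised service of this kind with the most free slots."""
    candidates = [ad for ad in registry.find(kind) if ad["free_slots"] > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda ad: ad["free_slots"])["name"]


registry = Registry()
registry.advertise("cluster-a", "compute", free_slots=2)
registry.advertise("cluster-b", "compute", free_slots=8)
registry.advertise("archive-1", "data", free_slots=0)
print(choose_service(registry, "compute"))
```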
22
Key Drivers for Grids. Collaboration –Expertise is distributed –Resources (data, software licences) are location-specific –Necessary to achieve critical mass of effort –Necessary to raise sufficient resources. Computational Power –Rapid growth in number of processors –Powered by Moore's law + device roadmap –Challenge to transform models to exploit this. Deluge of Data –Growth in scale: number and size of resources –Growth in complexity –Policy drives greater data availability
23
Minimum Grid Functionalities. Supports distributed computation –Data and computation –Over a variety of: hardware components (servers, data stores, …); software components (services: resource managers, computation and data services) –With regularity that can be exploited: by applications; by other middleware & tools; by providers and operations –It will normally have security mechanisms to develop and sustain trust regimes
24
Grid & Related Paradigms. Utility Computing: computing services; no knowledge of provider; enabled by grid technology. Distributed Computing: loosely coupled; heterogeneous; single administration. Cluster: tightly coupled; homogeneous; cooperative working. Grid Computing: large scale; cross-organizational; geographical distribution; distributed management. Source: Hiro Kishimoto GGF17 Keynote May 2006
26
Why use / build Grids? Research Arguments –Enables new ways of working –New distributed & collaborative research –Unprecedented scale and resources. Economic Arguments –Reduced system management costs –Shared resources: better utilisation –Pooled resources: increased capacity –Load sharing & utility computing –Cheaper disaster recovery
27
Why use / build Grids? Operational Arguments –Enable autonomous organisations to: write complementary software components; set up, run & use complementary services; share operational responsibility –General & consistent environment for abstraction, automation, optimisation & tools. Political & Management Arguments –Stimulate innovation –Promote intra-organisation collaboration –Promote inter-enterprise collaboration
28
Grids In Use: E-Science Examples. Data sharing and integration: life sciences (sharing standard data-sets, combining collaborative data-sets); medical informatics (integrating hospital information systems for better care and better science); sciences (high-energy physics). Capability computing: life sciences (molecular modeling, tomography); engineering (materials science); sciences (astronomy, physics). High-throughput, capacity computing for: life sciences (BLAST, CHARMM, drug screening); engineering (aircraft design, materials, biomedical); sciences (high-energy physics, economic modeling). Simulation-based science and engineering: earthquake simulation. Source: Hiro Kishimoto GGF17 Keynote May 2006
30
Database Growth. PDB: 33,367 protein structures. EMBL DB: 111,416,302,701 nucleotides. Slide provided by Richard Baldock, MRC HGU Edinburgh
31
Requirements: Users' viewpoint. Find data –Registries & human communication. Understand data –Metadata description, standard / familiar formats & representations, standard value systems & ontologies. Data access –Find how to interact with data resource –Obtain permission (authority) –Make connection –Make selection. Move data –In bulk or streamed (in increments)
32
Requirements: Users' viewpoint 2. Transform data –To format, organisation & representation required for computation or integration. Combine data –Standard database operations + operations relevant to the application model. Present results –To humans: data movement + transform for viewing –To application code: data movement + transform to the required format –To standard analysis tools, e.g. R –To standard visualisation tools, e.g. Spitfire
33
Requirements: Owners' viewpoint. Create data –Automated generation, accession policies, metadata generation –Storage resources. Preserve data –Archiving –Replication –Metadata –Protection. Provide services with available resources –Definition & implementation: costs & stability –Resources: storage, compute & bandwidth
34
Requirements: Owners' viewpoint 2. Protect services –Authentication, authorisation, accounting, audit –Reputation. Protect data –Comply with owner requirements – encryption for privacy, … Monitor and control use –Detect and handle failures, attacks, misbehaving users –Plan for future loads and services. Establish case for continuation –Usage statistics –Discoveries enabled
36
Large Hadron Collider. The most powerful instrument ever built to investigate elementary particle physics. Data challenge: –10 Petabytes/year of data –20 million CDs each year! Simulation, reconstruction, analysis: –LHC data handling requires computing power equivalent to ~100,000 of today's fastest PC processors
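The two data-challenge figures on this slide can be cross-checked with a few lines of arithmetic. The sketch assumes decimal units (1 PB = 10^15 bytes, 1 MB = 10^6 bytes), which the slide does not state; under that assumption the comparison implies about 500 MB per CD.

```python
PETABYTE = 10**15   # assumed decimal petabyte, in bytes
MEGABYTE = 10**6    # assumed decimal megabyte, in bytes

annual_bytes = 10 * PETABYTE     # "10 Petabytes/year of data"
cds_per_year = 20_000_000        # "20 million CDs each year"

# Capacity per CD implied by dividing the yearly volume across the CDs.
implied_cd_capacity_mb = annual_bytes // cds_per_year // MEGABYTE
print(implied_cd_capacity_mb)
```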
37
Composing Observations in Astronomy. Data and images courtesy Alex Szalay, Johns Hopkins. No. & sizes of data sets as of mid-2002, grouped by wavelength. 12 waveband coverage of large areas of the sky. Total about 200 TB data. Doubling every 12 months. Largest catalogues near 1B objects
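To see what "doubling every 12 months" implies, the growth can be written as size(t) = 200 TB × 2^(t/12), with t in months after mid-2002. The sketch below evaluates that formula; it assumes the doubling rate stays constant, which is an extrapolation for illustration and not a claim made by the slide. The function name `projected_tb` is hypothetical.

```python
BASELINE_TB = 200.0      # "Total about 200 TB data" as of mid-2002
DOUBLING_MONTHS = 12.0   # "Doubling every 12 months"


def projected_tb(months_after_baseline: float) -> float:
    """Projected archive size in TB, assuming the doubling period never changes."""
    return BASELINE_TB * 2 ** (months_after_baseline / DOUBLING_MONTHS)


for years in (1, 2, 4):
    print(years, "years:", projected_tb(12 * years), "TB")
```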
38
GODIVA Data Portal. Grid for Ocean Diagnostics, Interactive Visualisation and Analysis. Daily Met Office marine forecasts and gridded research datasets. National Centre for Ocean Forecasting. ~3Tb climate model datastore via Web Services. Interactive visualisations inc. movies. ~30 accesses a day worldwide. Other GODIVA software produces 3D/4D visualisations reading data remotely via Web Services. Online movies: www.nerc-essc.ac.uk/godiva
39
GODIVA Visualisations. Unstructured meshes. Grid rotation/interpolation. GeoSpatial databases v. files (Postgres, IBM, Oracle). Perspective 3D visualisation. Google maps viewer.
40
NERC Data Grid. The DataGrid focuses on federation of NERC Data Centres. Grid for data discovery, delivery and use across sites. Data can be stored in many different ways (flat files, databases…). Strong focus on metadata and ontologies. Clear separation between discovery and use of data. Prototype focussing on atmospheric and oceanographic data. www.ndg.nerc.ac.uk
41
Global In-flight Engine Diagnostics. [Diagram labels: in-flight data; airline; maintenance centre; ground station; global network, e.g. SITA; internet, e-mail, pager; DS&S Engine Health Center; data centre] Distributed Aircraft Maintenance Environment: Leeds, Oxford, Sheffield & York, Jim Austin. 100,000 aircraft × 0.5 GB/flight × 4 flights/day = 200 TB/day. Now BROADEN. Significant in getting Boeing 787 engine contract.
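The daily data volume quoted on this slide follows directly from the other three figures. A short check, assuming decimal units (1 TB = 1,000 GB):

```python
aircraft = 100_000
gb_per_flight = 0.5
flights_per_day = 4

gb_per_day = aircraft * gb_per_flight * flights_per_day
tb_per_day = gb_per_day / 1000   # assumed decimal conversion, 1 TB = 1000 GB
print(tb_per_day)
```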
43
Storage Resource Manager (SRM) http://sdm.lbl.gov/srm-wg/ De facto & written standard in physics, … Collaborative effort –CERN, FNAL, JLAB, LBNL and RAL. Essential bulk file storage –(Pre) allocation of storage (abstraction over storage systems) –File delivery / registration / access –Data movement interfaces, e.g. GridFTP. Rich function set –Space management, permissions, directory, data transfer & discovery
44
Storage Resource Broker (SRB) http://www.sdsc.edu/srb/index.php/Main_Page SDSC developed. Widely used –Archival document storage –Scientific data: bio-sciences, medicine, geo-sciences, … Manages –Storage resource allocation (abstraction over storage systems) –File storage –Collections of files –Metadata describing files, collections, etc. –Data transfer services
45
Condor Data Management. Stork –Manages file transfers –May manage reservations. Nest –Manages data storage –C.f. GridFTP with reservations, over multiple protocols
46
Globus Tools and Services for Data Management. GridFTP –A secure, robust, efficient data transfer protocol. The Reliable File Transfer Service (RFT) –Web services-based, stores state about transfers. The Data Access and Integration Service (OGSA-DAI) –Service to access data resources, particularly relational and XML databases. The Replica Location Service (RLS) –Distributed registry that records locations of data copies. The Data Replication Service –Web services-based, combines data replication and registration functionality. Slides from Ann Chervenak
47
RLS in Production Use: LIGO. Laser Interferometer Gravitational Wave Observatory. Currently use RLS servers at 10 sites –Contain mappings from 6 million logical files to over 40 million physical replicas. Used in customized data management system: the LIGO Lightweight Data Replicator System (LDR) –Includes RLS, GridFTP, custom metadata catalog, tools for storage management and data validation. Slides from Ann Chervenak
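The core idea of a replica location registry, a mapping from one logical file name to many physical copies, can be sketched in a few lines. This is not the Globus RLS API: the `ReplicaCatalog` class, the `lfn:` naming and the example.org URLs are hypothetical, and a production service would be distributed across sites instead of held in one in-memory dictionary.

```python
from collections import defaultdict
from typing import DefaultDict, List, Set


class ReplicaCatalog:
    """Toy replica catalog: one logical name maps to a set of physical locations."""

    def __init__(self) -> None:
        self._replicas: DefaultDict[str, Set[str]] = defaultdict(set)

    def register(self, logical_name: str, physical_url: str) -> None:
        """Record that a copy of the logical file exists at this location."""
        self._replicas[logical_name].add(physical_url)

    def lookup(self, logical_name: str) -> List[str]:
        """Return all known locations, sorted for stable output."""
        return sorted(self._replicas.get(logical_name, set()))


catalog = ReplicaCatalog()
catalog.register("lfn:run42/frame001", "gsiftp://site-a.example.org/data/frame001")
catalog.register("lfn:run42/frame001", "gsiftp://site-b.example.org/archive/frame001")
print(catalog.lookup("lfn:run42/frame001"))
```

A client can then pick whichever replica is closest or least loaded, which is how 6 million logical files can fan out to over 40 million physical copies.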
48
RLS in Production Use: ESG. Earth System Grid: climate modeling data (CCSM, PCM, IPCC). RLS at 4 sites. Data management coordinated by ESG portal. Datasets stored at NCAR –64.41 TB in 397,253 total files –1230 portal users. IPCC data at LLNL –26.50 TB in 59,300 files –400 registered users –Data downloaded: 56.80 TB in 263,800 files –Avg. 300 GB downloaded/day –200+ research papers being written. Slides from Ann Chervenak
49
gLite Data Management. FTS –File Transfer Service. LFC –Logical file catalogue. Replication Service –Accessed through LFC. AMGA –Metadata services. (Slide from the 2nd EGEE Review, CERN: gLite Middleware Status. Enabling Grids for E-sciencE, INFSO-RI-508833)
50
Data Management Services. FiReMan catalog –Resolves logical filenames (LFN) to physical location of files and storage elements –Oracle and MySQL versions available –Secure services –Attribute support –Symbolic link support –Deployed on the Pre-Production Service and DILIGENT testbed. gLite I/O –Posix-like access to Grid files –Castor, dCache and DPM support –Has been used for the BioMedical Demo –Deployed on the Pre-Production Service and the DILIGENT testbed. AMGA MetaData Catalog –Used by the LHCb experiment –Has been used for the BioMedical Demo
51
File Transfer Service. Reliable file transfer. Full scalable implementation –Java Web Service front-end, C++ agents, Oracle or MySQL database support –Support for Channel, Site and VO management –Interfaces for management and statistics monitoring. Gsiftp, SRM and SRM-copy support. Support for MySQL and Oracle. Multi-VO support. GridFTP and SRM copy support
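"Reliable file transfer" usually means that a failed copy is retried and its state is recorded. The sketch below shows that pattern only; it is not the gLite FTS interface. The `reliable_transfer` function, the `copy_once` callback, the retry limit and the example URLs are all hypothetical.

```python
from typing import Callable, Dict


def reliable_transfer(
    copy_once: Callable[[str, str], None],
    source: str,
    dest: str,
    max_attempts: int = 3,
) -> Dict[str, object]:
    """Retry a single-shot copy until it succeeds or the attempt budget runs out."""
    attempts = 0
    last_error = None
    while attempts < max_attempts:
        attempts += 1
        try:
            copy_once(source, dest)
            return {"status": "done", "attempts": attempts}
        except OSError as err:
            last_error = str(err)   # keep state about the failure, then try again
    return {"status": "failed", "attempts": attempts, "error": last_error}


def make_flaky_copy(failures_before_success: int) -> Callable[[str, str], None]:
    """Test helper: a copy function that fails a fixed number of times first."""
    remaining = {"count": failures_before_success}

    def copy_once(source: str, dest: str) -> None:
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise OSError("simulated network failure")

    return copy_once


outcome = reliable_transfer(make_flaky_copy(2), "srm://a.example.org/f", "srm://b.example.org/f")
print(outcome)
```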
52
Commercial Solutions. Vendors include: –Avaki –Data Synapse. Benefits & costs –Well packaged and documented –Support –Can be expensive, but look for academic rates
54
Data Integration Strategies. Use a service provided by a data owner. Use a scripted workflow. Use data virtualisation services –Arrange that multiple data services have common properties –Arrange federations of these –Arrange access presenting the common properties –Expose the important differences –Support integration accommodating those differences
55
Data Virtualisation Services. Form a federation –Set of data resources – incremental addition –Registration & description of collected resources –Warehouse data or access dynamically to obtain updated data –Virtual data warehouses – automating division between collection and dynamic access. Describe relevant relationships between data sources –Incremental description + refinement / correction. Run jobs, queries & workflows against combined set of data resources –Automated distribution & transformation. Example systems –IBM's Information Integrator –GEON, BIRN & SEEK –OGSA-DAI is an extensible framework for building such systems
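A minimal sketch of the federation idea, assuming each data resource is registered together with a function that maps its rows into a common representation, so that one query can run across all of them. The `Federation` class, the site names and the depth fields are hypothetical and do not reflect OGSA-DAI or any of the example systems listed above.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Row = Dict[str, float]
Source = Callable[[], Iterable[Row]]     # called at query time, i.e. dynamic access
Normaliser = Callable[[Row], Row]        # maps a local row to the common form


class Federation:
    """Toy federation: resources are added incrementally and queried as one set."""

    def __init__(self) -> None:
        self._resources: Dict[str, Tuple[Source, Normaliser]] = {}

    def register(self, name: str, source: Source, to_common: Normaliser) -> None:
        self._resources[name] = (source, to_common)

    def query(self, predicate: Callable[[Row], bool]) -> List[dict]:
        results = []
        for name, (source, to_common) in self._resources.items():
            for row in source():
                common = to_common(row)          # transformation step
                if predicate(common):
                    results.append({"source": name, **common})
        return results


FEET_TO_METRES = 0.3048

federation = Federation()
# Site A already stores depths in metres; site B stores them in feet.
federation.register("site_a", lambda: [{"depth_m": 50.0}, {"depth_m": 250.0}], lambda r: r)
federation.register(
    "site_b",
    lambda: [{"depth_ft": 1000.0}, {"depth_ft": 100.0}],
    lambda r: {"depth_m": r["depth_ft"] * FEET_TO_METRES},
)

deep = federation.query(lambda r: r["depth_m"] > 100.0)
print(deep)
```

The caller sees a single schema (`depth_m`) while each member keeps its own representation, which is the sense of "automated distribution & transformation" on the slide.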
56
Virtualisation variations. Extent to which homogeneity obtained –Regular representation choices – e.g. units –Consistent ontologies –Consistent data model –Consistent schema – integrated super-schema –DB operations supported across federation –Ease of adding federation elements –Ease of accommodating change as federation members change their schema and policies –Drill through to primary forms supported
57
OGSA-DAI http://www.ogsadai.org.uk A framework for data virtualisation. Wide use in e-Science –BRIDGES, GEON, CaBiG, GeneGrid, MyGrid, BioSimGrid, e-Diamond, IU RGRBench, … Collaborative effort –NeSC, EPCC, IBM, Oracle, Manchester, Newcastle. Querying of data resources –Relational databases –XML databases –Structured flat files. Extensible activity documents –Customisation for particular applications
59
The Open Grid Services Architecture. An open, service-oriented architecture (SOA) –Resources as first-class entities –Dynamic service/resource creation and destruction. Built on a Web services infrastructure. Resource virtualization at the core. Build grids from small number of standards-based components –Replaceable, coarse-grained, e.g. brokers. Customizable –Support for dynamic, domain-specific content… –…within the same standardized framework. Hiro Kishimoto: Keynote GGF17
60
OGSA Capabilities. Security: cross-organizational users; trust nobody; authorized access only. Information Services: registry; notification; logging/auditing. Execution Management: job description & submission; scheduling; resource provisioning. Data Services: common access facilities; efficient & reliable transport; replication services. Self-Management: self-configuration; self-optimization; self-healing. Resource Management: discovery; monitoring; control. [Diagram layers: OGSA; OGSA profiles; Web services foundation] Hiro Kishimoto: Keynote GGF17
61
Basic Data Interfaces. Storage Management: e.g. Storage Resource Management (SRM). Data Access: ByteIO; Data Access & Integration (DAI). Data Transfer: Data Movement Interface Specification (DMIS); protocols (e.g. GridFTP). Replica management; metadata catalog; cache management. Hiro Kishimoto: Keynote GGF17
63
The State of the Art. Many successful Grid & E-Science projects –A few examples shown in this talk. Many Grid systems –All largely incompatible –Interoperation talks under way. Standardisation efforts –Mainly via the Open Grid Forum –A merger of the GGF & EGA. Significant user investment required –Few out-of-the-box solutions
64
Technical Challenges. Issues you can't avoid –Lack of Complete Knowledge (LOCK) –Latency –Heterogeneity –Autonomy –Unreliability –Scalability –Change. A challenging goal –balance technical feasibility –against virtual homogeneity, stability and reliability –while remaining affordable, manageable and maintainable
65
Areas In Development. Data provenance. Quality of Service –Service Level Agreements. Resource brokering –Across all resources. Workflow scheduling –Co-scheduling. Licence management. Software provisioning –Deployment and update. Other areas too!
66
Operational Challenges. Management of distributed systems –With local autonomy. Deployment, testing & monitoring. User training. User support. Rollout of upgrades. Security –Distributed identity management –Authorisation –Revocation –Incident response
67
Grids as a Foundation for Solutions. The grid per se doesn't provide –Supported e-Science methods –Supported data & information resources –Computations –Convenient access. Grids help providers of these, via –International & national secure e-Infrastructure –Standards for interoperation –Standard APIs to promote re-use. But Research Support must be built –Application developers –Resource providers
68
Collaboration Challenges. Defining common goals. Defining common formats –E.g. schemas for data and metadata. Defining a common vocabulary –E.g. for metadata. Finding common technology –Standards should help, eventually. Collecting metadata –Automate where possible
69
Social Challenges. Changing cultures –Rewarding data & resource sharing –Require publication of data. Taking the first steps –If everyone shares, everyone wins –The first people to share must not lose out. Sustainable funding –Technology must persist –Data must persist
71
Summary. E-Science exploits distributed computing resources to enable new discoveries, new collaborations and new ways of working. Grid is an enabling technology for e-Science. Many successful projects exist. Many challenges remain.
72
Globus Alliance; CeSC (Cambridge); Digital Curation Centre; e-Science Institute; UK e-Science; EGEE, ChinaGrid; Grid Operations Support Centre; National Centre for e-Social Science; National Institute for Environmental e-Science; Open Middleware Infrastructure Institute