UK e-Science: Future Infrastructure for Scientific Data Mining, Integration and Visualisation. Malcolm Atkinson, Director of the National e-Science Centre (www.nesc.ac.uk)


UK e-Science Future Infrastructure for Scientific Data Mining, Integration and Visualisation. Malcolm Atkinson, Director of National e-Science Centre. 25th October 2002, SDMIV workshop, e-Science Institute, Edinburgh

Overview
- UK e-Science: reminder of investment and infrastructure
- International e-Science: examples and collaboration
- Data Access and Integration: Lego bricks for scientific application developers, tailored to application and computing scientists
- A computer scientist's Christmas list
- Diversity and opportunity
- The way ahead

e-Science
Fundamentally about Collaboration
- Sharing: ideas, thought processes and stimuli, effort, resources
- Requires: communication, a common understanding & framework, mechanisms for sharing fairly, organisation and infrastructure
Scientists (biologists) have done this for centuries

e-Science (take 2)
Fundamentally about Collaboration
- Sharing: ideas, thought processes and stimuli, effort, resources
- Requires: communication, a common understanding & framework, mechanisms for sharing fairly, organisation and infrastructure
What is shared: text, digital media, structured, organised & curated data, computable models, visualisation, shared instruments, shared systems, shared administration, ... nationally & internationally distributed, ... routine, daily, automated, ...
That requires very significant investment in digital systems and their support

e-Science (take 3)
Fundamentally about Collaboration
- Sharing: ideas, thought processes and stimuli, effort, resources
- Requires: communication, a common understanding & framework, mechanisms for sharing fairly, organisation and infrastructure
- Digital networks, digital workplaces, digital instruments, ...
- Metadata, ontologies, standards, shared curated data, shared codes, ...
- Common platforms, shared software, shared training, ...
- Citation, authentication, authorisation, accounting, provenance, policies, ...
- Shared provision of platform
The Grid SHOULD make this much easier by providing a common, supported, high-level software and organisational infrastructure

Grid Expectations
- Persistence: always there, always working, always supported
- Stability: you can build on foundations that don't move
- Trustworthy & predictable: honours commitments (digital policies, digital contracts, security, ...; data integrity, longevity and accessibility; performance)
- High-level & extensible: the capabilities you need are already there
- Ubiquitous: your collaborators use it

Grid Reality
- Persistence (always there, always working, always supported): political, economic & technical issues to solve
- Stability (foundations that don't move): early days, but Open Grid Services link with Web Services + GGF standardisation
- Trustworthy & predictable (honours commitments on digital policies, contracts, security, data integrity, longevity, accessibility, performance): not yet, but very substantial global effort to achieve this
- High-level & extensible (the capabilities you need are already there): good basis for extension; commitment to basic functionality; WS + community effort
- Ubiquitous (your collaborators use it): global & industrial rallying cry; must work with Web Services

UK Grid Network: Cambridge, Newcastle, Edinburgh, Oxford, Glasgow, Manchester, Cardiff, Southampton, London, Belfast, Daresbury Lab, RAL, Hinxton. National e-Science Centre with always-on Access Grid video walls; HPC(x)

SuperJANET4 backbone (Tony Hey, July 2001): regional networks (Scotland via Glasgow, Scotland via Edinburgh, NNW, Northern Ireland, MidMAN, TVN, South Wales MAN, SWAN & BWEMAN, YHMAN, NorMAN, EMMAN, EastNet, LMN, Kentish MAN, LeNSE, external links) connected via WorldCom PoPs at Glasgow, Edinburgh, Manchester, Reading, Leeds, Bristol, London and Portsmouth; link capacities 10 Gbps, 2.5 Gbps, 622 Mbps and 155 Mbps

National e-Science Centre
- Events: workshops, research meetings, international meetings
- History of events: GGF5, HPDC11, summer school; > 50 workshops held; > 1000 people in total; many return often
- Planned events: 25 workshops; conferences to 2005
- Visitors: 3 arrived, 4 arranged
- International collaboration, visits & visitors: China, Argonne National Lab, SDSC, NCSA, ...
- Centre projects: pilot projects, regional support, research projects (EPSRC, MRC, WT, SHEFC)

A day in the life of NeSC

UCSF, UIUC. From Klaus Schulten, Center for Biomolecular Modeling and Bioinformatics, Urbana-Champaign

DataGrid Testbed sites (> 40): Dubna, Moscow, RAL, Lund, Lisboa, Santander, Madrid, Valencia, Barcelona, Paris, Berlin, Lyon, Grenoble, Marseille, Brno, Prague, Torino, Milano, BO-CNAF, PD-LNL, Pisa, Roma, Catania, ESRIN, CERN, IPSL, ESTEC, KNMI. HEP sites and ESA sites

A Simplified Grid Anatomy
- Scientific Application
- Grid Plumbing & Security Infrastructure: Scheduling, Accounting, Authorisation, Monitoring, Diagnosis, Logging
- Data & Compute Resources
Roles: Operations Team, Application Developers, Distributed Owners, Scientific Users

The Crux
Scientific Application; Grid Plumbing & Security Infrastructure (Scheduling, Accounting, Authorisation, Monitoring, Diagnosis, Logging); Data & Compute Resources; Operations Team, Application Developers, Distributed Owners, Scientific Users.
Keep all the (pink) groups HAPPY

A SDMIV Grid Anatomy
- Scientific Application
- Data Access, Data Integration, Structured Data
- Grid Plumbing & Security Infrastructure: Scheduling, Accounting, Authorisation, Monitoring, Diagnosis, Logging
- Data & Compute Resources
Roles: Distributed SDMIV Users, Data Providers, Data Curators

Database Growth: PDB protein structures

Data Mining: Science vs Commerce
Science:
- Data in files: FTP a local copy/subset, ASCII or binary
- Each scientist builds their own analysis toolkit
- Analysis is a tcl script running the toolkit on local data
- Some simple visualisation tools: x vs y
Commerce:
- Data in a database
- Standard reports for standard things; report writers for non-standard things
- GUI tools to explore data
- Decision trees, clustering, anomaly finders
Jim Gray, UCSC, April 2002

But... some science is hitting a wall: FTP and GREP are not adequate
- You can GREP 1 MB in a second; you can FTP 1 MB in 1 sec
- You can GREP 1 GB in a minute; you can FTP 1 GB/min (= 1 $/GB)
- You can GREP 1 TB in 2 days ... 2 days and 1K$
- You can GREP 1 PB in 3 years ... 3 years and 1M$
- Oh!, and 1 PB ~ 10,000 disks (250 KW, 60 racks = 120 m²)
At some point you need indices to limit search, and parallel data search and analysis. This is where databases can help.
Jim Gray, UCSC, April 2002
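The arithmetic behind the "grep wall" is easy to reproduce. The scan rate below is an assumed single-disk figure (~6 MB/s) chosen to roughly match the slide's terabyte number; the exact rates are illustrative, but the linear blow-up with data size is the point:

```python
# Back-of-envelope check of the "grep wall": time to scan data sequentially.
# The scan rate is an assumed single-disk figure, not a measurement; it is
# picked so that 1 TB takes about the slide's "2 days".

SCAN_RATE = 6e6          # bytes/second for a sequential grep (assumption)
SECONDS_PER_DAY = 86400

def scan_time_days(nbytes, rate=SCAN_RATE):
    """Days needed to scan nbytes at a fixed sequential rate."""
    return nbytes / rate / SECONDS_PER_DAY

for label, size in [("1 GB", 1e9), ("1 TB", 1e12), ("1 PB", 1e15)]:
    days = scan_time_days(size)
    print(f"{label}: {days:10.2f} days ({days / 365:7.2f} years)")
```

Even with generous assumptions, a petabyte scan lands in years, which is why the slide argues for indices and parallel search rather than brute-force GREP.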

Web Services + Grid Technology = Grid Services (OGSA & OGSI)

Web Services
- Rapid integration: dynamic binding
- Commercial power: financial & political
- Independence: client from service, service from client
- Separation: function from delivery
- Description: WSDL, WSC, WSEF, ...
- Tools & platforms: Java ONE, Visual .NET, WebSphere, Oracle, ...
www.w3c.org/TR/SOAP or TR/wsdl

Grid Technology
- Virtual organisations: sharing & collaboration
- Security: single sign-in, delegation
- Distribution & fast FTP, but various protocols
- Resource management: discovery, process creation, scheduling, monitoring
- Portability: ubiquitous APIs & modules
- Government agency buy-in; industrial buy-in
Foster, I., Kesselman, C. and Tuecke, S., The Anatomy of the Grid: Enabling Scalable Virtual Organizations, Intl. J. Supercomputer Applications, 15(3).

Open Grid Services Architecture: applications use virtual Grid services; multiple implementations of Grid services, using operations implemented by OGS infrastructure. Foster, I., Kesselman, C., Nick, J. and Tuecke, S., The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration.

Scientific Data: Deluge of Data
Exponential growth, doubling times:
- Astronomy: 12 months
- Bio-sequences: 9 months
- Functional genomics: 6 months
- Bytes/dollar: 12 to 18 months
Not how big it is but ...

Scientific Data: Deluge of Data (take 2)
Exponential growth, doubling times:
- Astronomy: 12 months
- Bio-sequences: 9 months
- Functional genomics: 6 months
- Bytes/dollar: 12 to 18 months
Not how big it is but what you do with it: sharing, curation, metadata, automated movement, access & integration, computational access

Scientific Data: Deluge of Data (take 3)
Exponential growth, doubling times:
- Astronomy: 12 months
- Bio-sequences: 9 months
- Functional genomics: 6 months
- Bytes/dollar: 12 to 18 months
Not how big it is but how you embrace & manage change: the database is a knowledge chest; the database is a communication hub; autonomously managed (curated) change. An essential part of e-biomedical, astronomical, ..., science & engineering
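The doubling times above carry a sting: data that doubles every 6 months outgrows storage that gets cheap at an 18-month doubling. A quick calculation (the doubling times are the slide's; the rest is arithmetic):

```python
# Growth implied by a fixed doubling time: after t years, x 2**(t/T).
# Data doubling faster than bytes/dollar means the cost of keeping
# everything rises even though disks get cheaper.

def growth(years, doubling_time_years):
    """Multiplicative growth after `years` with a fixed doubling time."""
    return 2 ** (years / doubling_time_years)

years = 5
data = growth(years, 0.5)      # functional genomics: doubles every 6 months
budget = growth(years, 1.5)    # bytes/dollar: doubles every ~18 months

print(f"data grows x{data:.0f}, bytes per dollar grows x{budget:.1f}")
print(f"cost of 'keep everything' grows x{data / budget:.0f}")
```

Over five years the gap is roughly a hundredfold, which is why curation, selection and sharing matter more than raw capacity.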

Wellcome Trust: Cardiovascular Functional Genomics. Sites: Glasgow, Edinburgh, Leicester, Oxford, London, Netherlands. Shared data; public curated data

Data Access & Integration: Central to e-Science
Astronomy, Earth Sciences, Ecology, Biology, Medicine, ...
- Collaboration: shared databases, curated knowledge, accumulated observations, accumulated simulations
- Computation: data mining, input to models, calibration of models
- Presentation: publication of results, visualisation

GGF DAIS WG
- Chairs: Norman Paton (University of Manchester), Leanne Guy (CERN), Dave Pearson (Oracle UK)
- Activity: BoF at GGF4 Toronto; WG meeting at GGF5 Edinburgh; papers for GGF6; workshops & mailing lists
- Goals: agree standards for database access & integration; freely available reference implementations; OGSA-DAI one source & focus for discussions
Norman Paton, Inderpal Narang, Leanne Guy, Susan Malaika, Greg Riccardi, ...

OGSA-DAI project
- A Lego kit for Data Access & Integration: components for e-Science applications; accelerated application development
- Multiple data models
- Distributed data access via Grid & proxies
- Integration, translation & transformation
- Open-source reference implementation for the DAIS-WG standard
- Trigger for component construction: start a community

OGSA-DAI Partners: EPCC & NeSC, IBM UK (Hursley), IBM USA, Manchester e-SC, Newcastle e-SC, Oracle. £3 million, 18 months, started February 2002

Primary Components

Advanced Components

Composed Components

Composing Components OGSA-DAI Component Data Transport

DAI Key Components
- GridDataService (GDS): access to data & DB operations
- GridDataServiceFactory (GDSF): makes GDS & GDSF
- GridDataServiceRegistry (GDSR): discovery of GDS(F) & data
- GridDataTranslationService: translates or transforms data
- GridDataTransportDepot (GDTD): data transport with persistence
Relational & XML models supported; role-based authorisation; binary structured files
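The factory/registry pattern these components follow can be sketched generically. The class and method names below illustrate the interaction style only; they are hypothetical, not the OGSA-DAI API:

```python
# Hypothetical sketch of the GDS/GDSF/GDSR interaction pattern:
# a registry for discovery, a factory that creates service instances,
# and a data service that executes requests against one data resource.
# Names and signatures are illustrative, not OGSA-DAI's.

class GridDataService:
    """Per-session handle onto one data resource (cf. GDS)."""
    def __init__(self, resource):
        self.resource = resource

    def perform(self, statement):
        # A real GDS would accept a request ("perform") document;
        # here we just echo what would be executed.
        return f"executed {statement!r} against {self.resource}"

class GridDataServiceFactory:
    """Creates GDS instances for a named resource (cf. GDSF)."""
    def __init__(self, resource):
        self.resource = resource

    def create_service(self):
        return GridDataService(self.resource)

class GridDataServiceRegistry:
    """Maps resource names to factories, enabling discovery (cf. GDSR)."""
    def __init__(self):
        self._factories = {}

    def register(self, name, factory):
        self._factories[name] = factory

    def find(self, name):
        return self._factories[name]

# Discovery, creation, use:
registry = GridDataServiceRegistry()
registry.register("proteinDB", GridDataServiceFactory("proteinDB"))

gds = registry.find("proteinDB").create_service()
print(gds.perform("SELECT * FROM structures"))
```

The design choice the table reflects is that clients never construct services directly: they discover a factory through the registry and obtain a short-lived service instance from it, which is what lets the Grid manage lifetime, authorisation and placement.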

OGSA Relationship (OGSI portTypes per class)
Class: GridService / Registry / NotificationConsumer / NotificationProducer
- GDS: Mandatory / - / Optional / Normal
- GDSF: Mandatory / - / Optional / Normal
- GDSR: Mandatory / Normal / - / -
- GDTS: Mandatory / - / - / -
- GDTD: Mandatory / - / Optional / Normal

DAI portType Usage
Class: GridDataService / DataTransport / Factory
- GDS: Mandatory / Normal / -
- GDSF: Optional / Normal / Mandatory
- GDSR: Optional / - / -
- GDTS: Optional / Mandatory / -
- GDTD: Optional / Mandatory / -

Distributed Query

OGSA-DAI Time Line (Feb '02 to Sep '03)
- Phase 1 starts (Feb '02)
- RDB + GT2/OGSA prototypes available
- XML + OGSA prototype available
- Design documents & demos for DAIS at GGF5
- XML + OGSA prototypes for early adopters
- WS + GSI UK support (> 100 downloads)
- GGF6 WG papers & prototypes
- Ship alpha release for GT3 integration
- Phase 2 starts
- Presentation at GGF7
- Productisation, RAMPS & extension

OGSA-DAI Summary
- On schedule & going well: contributions via GGF5 & 6; releases scheduled with GT3 releases
- Status: early days; released prototypes; tested architectural design using OGSA
- Working with early-adopter pilot projects: AstroGrid & MyGrid
- First PRODUCT release Dec '02
- Influence OGSA-DAI direction via DAIS-WG & direct messages to us

Data Processing
Flow: Instrument / In Silico → Raw Data → Multi-stage Processing (with Reference Data) → Processed Data Archive
Processing characteristics:
- Well-defined workflow
- Correction, calibration, transformation, filtering, merging
- Relatively static reference data
- Stable processing functions (audited changes)
- Periodic reprocessing from archive
Dave Pearson, Provenance and Derivation workshop, 18 Oct 02, Chicago

Analysis and Interpretation: Summarisation
Flow: Processed Data Archive → Summarised Data
Analysis characteristics:
- Variable workflow
- Standard functions
- Standard and personal filtering and summarisation
- Retain drill-down capability
Dave Pearson, Provenance and Derivation workshop, 18 Oct 02, Chicago

Analysis and Interpretation
Characteristics:
- Highly dynamic workflow
- Multiple data types
- Volatile data
- Annotations, inferences, conclusions
- Evidential reasoning
- Shared multiple versions of truth
- Periodic version consolidation
Flow: Processed Data / Summarised Data → Retrieval & Update → Result Data → Personalised Database → Conclusions/Inferences (descriptions, trends, correlations, relationships)
Dave Pearson, Provenance and Derivation workshop, 18 Oct 02, Chicago

Metadata Requirements
Technical metadata:
- Direct referencing: physical location and data schema/structure
- Data currency/status: version, time stamping
- Accreditation/access permissions: ownership (Dublin Core)
- Query time/governance: data volume, number of records, access paths
Contextual metadata:
- Logical referencing of physical data: semantic/syntactic ontologies
- Lexical translation: thesaurus, ontological mapping
- Named derivations (summarisations)
Scope of requirements: all science communities; related to provenance
Dave Pearson, Provenance and Derivation workshop, 18 Oct 02, Chicago

Metadata Requirements
Data versioning:
- Distinguish latest/agreed version of data
- Maintain history record of change
- Synchronise and mirror replicated data
- Distinguish shared personal interpretations and/or annotations
Provenance:
- Record of data processing: calibration, filtering, transformation
- Record of workflow: methods, standards and protocols
- Reasoning: evidential justification for inferences & conclusions
Scope of requirements: all science communities; includes technical and contextual metadata
Dave Pearson, Provenance and Derivation workshop, 18 Oct 02, Chicago
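One way to make the versioning and provenance requirements concrete is to sketch the record they imply. All field names below are illustrative inventions, not drawn from Dublin Core or any other standard:

```python
# Illustrative provenance record covering the requirements above:
# versioning with history, processing steps, workflow reference,
# evidential reasoning, and ownership. Field names are hypothetical.

provenance_record = {
    "dataset": "protein_structures_v7",
    "version": {
        "number": 7,
        "agreed": True,                       # latest/agreed version
        "supersedes": "protein_structures_v6",
        "history": ["v5: recalibrated", "v6: outliers filtered"],
    },
    "processing": [                           # record of data processing
        {"step": "calibration", "tool": "calib", "params": {"ref": "std-2002"}},
        {"step": "filtering", "tool": "qfilter", "params": {"min_score": 0.9}},
    ],
    "workflow": {"protocol": "lab-SOP-12", "standard": "internal"},
    "reasoning": "scores below 0.9 judged unreliable against std-2002 reference",
    "owner": "curation-team@example.org",     # accreditation/ownership
}

# With such a record, basic provenance questions become mechanical:
steps = [p["step"] for p in provenance_record["processing"]]
print("processing steps:", steps)
print("is agreed version:", provenance_record["version"]["agreed"])
```

However the record is encoded, the requirement is the same: a consumer must be able to recover what was done, in what order, under which protocol, and why, without contacting the original analyst.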

Provenance Issues
- Schema evolution
- Granularity of record
- Processed vs derived
- Inheritance
- Lack of structured annotations, ontologies
- Interactive analysis = dynamic workflow
- Multiple derived data sources
- Context of usage
- Best practice can change
- Multiple versions of the truth
- Evidential reasoning
- Existing data & applications
- Where is the provenance record stored?
Dave Pearson, Provenance and Derivation workshop, 18 Oct 02, Chicago

Collaborative Annotation
See DAS (Distributed Annotation Service)
Challenges: autonomy, selective viewing, identification, provenance, derivation

Biomedical e-Scientists
- Is this one species?
- Understanding bird energy
- Understanding a river/ocean interaction
- Understanding a biochemical pathway
- Understanding a cell
- Understanding a heart or brain
- Understanding Rhododendra
- Understanding evolution
- ...
No one-size-fits-all solutions, but sharable, re-usable components

Opportunities
- Many, many ... more than we can address: compute needs, data management needs, data integration needs, ...
- Must choose some pioneers: to meet a range of common requirements; to provoke a rich & high-level platform; to generate re-usable components
- A long-term commitment needed

Advancing SDMIV Grid
- Scientific Application
- SDMIV (Grid) Application Component Library
- Data Access, Data Integration, Structured Data
- Grid Plumbing & Security Infrastructure: Scheduling, Accounting, Authorisation, Monitoring, Diagnosis, Logging
- Data & Compute Resources
Roles: Distributed SDMIV Users

Summary
- e-Science: data challenges as well as compute challenges, and they need to be put together; needs ubiquitous, supported, consistent platforms
- Grid: a (potentially) invaluable platform; the only show in town
- Data integration: hard; develop & use a standard kit of parts; we have started to build the kit; no ready-made general integration; combines application and computing science
- Opportunities: no one-size-fits-all, but re-usable subsystems; invest in a wider range of problem-driven pioneering; strategic choices needed