Standards and Ontologies to Enable Discovery Data and Information Integration Robin McEntire GlaxoSmithKline 19 Nov, 2002.



Q: What non-existing technology do you most wish you had?
A: A technology that would allow you to put in a DNA sequence and then spit out the specific protein function, disease association, known pharmacophores that could be developed into small molecules, and market value of small molecule or protein therapeutic (antibody) drugs generated from that gene.
Martin Leach, CuraGen Director of Bioinformatics. Bioinform 4(26), 10 (6 Nov 2000)

Drug Discovery Process, circa 2002
[Diagram. Labels: data mining; microarrays; transgenics; cheminformatics; bioinformatics; HT chemistry; chemical diversity; HT Screening; SAR; identify ‘hit’; optimize ‘hit’ structure; target validation; target identification/validation; in vivo testing; genotyping]

Discovery Process IT
[Diagram. Labels as extracted: Sequencing Synthesis Screening Synthesis Planning Inventory Compound design Analyze Results Prepare reagents Develop Assay Candidate targets Select target Discovery Analytical]

Drug Discovery Today

Bottleneck                    Solution
Few novel targets             Genomics
Lead explosion in a series    Combi-chem
Too long to screen            HTS & uHTS
Relating genes to disease     Pharmacogenomics

New Bottleneck: Data analysis, interpretation, & integration

Integration of discovery information

ID   MURA_BACSU STANDARD; PRT; 429 AA.
DE   PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE   (EC ) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
DE   ENOLPYRUVYL TRANSFERASE) (EPT).
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
OC   BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE;
OC   BACILLUS.
KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
FT   ACT_SITE BINDS PEP (BY SIMILARITY).
FT   CONFLICT S -> A (IN REF. 3).
SQ   SEQUENCE 429 AA; MW; 02018C5C CRC32;
     MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI
     GGNVHFENGE MVVDPTSMIS MPLPNGKVKK LRASYYLMGA MLGRFKQAVI GLPGGCHLGP
     RPIDQHIKGF EALGAEVTNE QGAIYLRAER LRGARIYLDV VSVGATINIM LAAVLAEGKT
     IIENAAKEPE IIDVATLLTS MGAKIKGAGT NVIRIDGVKE LHGCKHTIIP DRIEAGTFMI
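The entry above is organized by two-letter line codes (ID, DE, GN, OS, KW, ...). As a minimal sketch of why such a record is easy for software to read but still hard to integrate, the Python below groups lines by code. The function name `parse_flat_entry` and the shortened sample record are illustrative assumptions, not part of the talk and not an official parser.

```python
from collections import defaultdict

def parse_flat_entry(text):
    """Group the lines of a flat-file entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        code, value = line[:2], line[2:].strip()
        # Skip blank lines, indented continuation lines, and the "//" terminator.
        if not code.strip() or code == "//":
            continue
        fields[code].append(value)
    return dict(fields)

# Shortened sample, copied from fields shown on the slide (assumed layout).
SAMPLE = """\
ID   MURA_BACSU STANDARD; PRT; 429 AA.
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
//
"""

entry = parse_flat_entry(SAMPLE)
entry_name = entry["ID"][0].split()[0]
keywords = [k.strip(" .") for k in entry["KW"][0].split(";")]
```

Parsing gives the structure only. Knowing that the KW value TRANSFERASE means the same thing as a term in another database is the semantic problem the rest of the talk addresses.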

What technologies can help?
Integration - to assist the transformation of data to information and to knowledge
Text Mining - to expose the information/knowledge locked in text documents (internal and external)
Grid computing
Open source and public domain initiatives
...

Two fundamental problems for information integration
Heterogeneous software systems
– hardware platforms
– operating systems
– network protocols
– programming languages & application formats
Heterogeneous data semantics
– naming conflicts
– measurement conflicts
– representation conflicts
– computational conflicts
– granularity conflicts
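A small sketch of three of the semantic conflicts listed above (naming, measurement, representation), written in Python. The field names, the synonym table, and the compound ID are hypothetical; real sources would need a much richer mapping.

```python
# Hypothetical synonym table: source-specific field name -> shared name.
FIELD_SYNONYMS = {
    "cmpd_id": "compound_id",
    "CompoundNumber": "compound_id",
    "ic50_nM": "ic50",
    "IC50_uM": "ic50",
}

# Factor that converts each source field to nanomolar (assumed shared unit).
UNIT_TO_NM = {"ic50_nM": 1.0, "IC50_uM": 1000.0}

def normalize(record):
    """Map one source record onto the shared names and units."""
    out = {}
    for name, value in record.items():
        shared = FIELD_SYNONYMS.get(name, name)   # naming conflict
        if name in UNIT_TO_NM:
            # float() handles the representation conflict (text vs number),
            # the factor handles the measurement conflict (uM vs nM).
            value = float(value) * UNIT_TO_NM[name]
        out[shared] = value
    return out

site_a = {"cmpd_id": "X-001", "ic50_nM": 250}
site_b = {"CompoundNumber": "X-001", "IC50_uM": "0.25"}
```

Both records normalize to the same dictionary, so they can be compared or merged. The tables themselves are the costly part, which is the argument for sharing them.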

Solutions
Require all information providers to use a single consistent vocabulary
– This works until the next scientific advance
Convert all software to a single language, OS, hardware platform
– This works until the next merger

Alternatively...
Focus on interoperability
Collaboratively develop standards to support software interoperability
Collaboratively develop tools and shareable ontologies
Use the Tom Sawyer approach!

How to cope?
Don’t rely on particular hardware platforms
→ Your system will outlive hardware
Don’t rely on one operating system
→ There will always be many, perhaps from one vendor
Don’t rely on a single programming language
→ They come and go faster than hardware
Do follow the first principle of good design
→ Define small, well-documented interfaces between modules
→ Define common terminologies and common business objects
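One way to read "small, well-documented interfaces between modules" is sketched below in Python using `typing.Protocol`. The interface name `CompoundSource`, its single method, and the two implementations are hypothetical and only illustrate the principle.

```python
from typing import Protocol

class CompoundSource(Protocol):
    """The whole contract between modules: one small, documented method."""

    def smiles_for(self, compound_id: str) -> str:
        """Return the SMILES string for a compound, or raise KeyError."""
        ...

class InMemorySource:
    """Simple implementation backed by a dictionary."""

    def __init__(self, table):
        self._table = dict(table)

    def smiles_for(self, compound_id):
        return self._table[compound_id]

class PrefixedSource:
    """Stand-in for a legacy system that uses its own ID prefix."""

    def __init__(self, inner, prefix):
        self._inner = inner
        self._prefix = prefix

    def smiles_for(self, compound_id):
        return self._inner.smiles_for(self._prefix + compound_id)

def describe(source: CompoundSource, compound_id: str) -> str:
    """Client code depends only on the interface, not on an implementation."""
    return f"{compound_id}: {source.smiles_for(compound_id)}"

direct = InMemorySource({"LEGACY-1": "CCO"})
wrapped = PrefixedSource(direct, "LEGACY-")
```

`describe` works unchanged with either object, so the storage, language, or platform behind the interface can be replaced without touching the client.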

Coping -- Software Architecture
The real issue isn’t how many tiers you have, it’s understanding how to organize a distributed application
– what are the components?
– where do they live?
– how do they talk?
Most applications tend to follow a common structural pattern: presentation, “business model” (analysis), and data storage

Two-tier systems
“Business model” is embedded in presentation (“fat client”) or data storage (stored procedures, triggers)
Back end: physical storage, legacy applications, etc.
Data representation is medium of exchange (brittle, low-level)
– Flat file, ASN.1, XML, ...

Three-tier systems
Local objects on desktop manage presentation, act as clients to middle tier
Middle layer provides abstract model of business process and information, encapsulates back end
Back end: physical storage, legacy applications, etc.
Distributed object technology is the established technology of choice for the middle tier

Focus on modeling business behavior
Business logic/process is a first-class citizen
– business logic focuses on behavior, not data
– insulates client from data representation
– encapsulates (hides) implementation, legacy systems
“Middle” layer should embody an abstract model of business process
– its development is a long-term, core investment
– this is where component technology is headed
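The three-tier split can be sketched in a few lines of Python. This is an in-process illustration only, not distributed object technology: the class `AssayService`, the table layout, and the sample values are assumptions made for the example, with the standard-library `sqlite3` module standing in for the back end.

```python
import sqlite3

def make_backend():
    """Back end: physical storage (an in-memory SQLite database here)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE results (cmpd TEXT, ic50_nm REAL)")
    conn.executemany(
        "INSERT INTO results VALUES (?, ?)",
        [("X-001", 250.0), ("X-002", 12.0), ("X-003", 9000.0)],
    )
    return conn

class AssayService:
    """Middle tier: exposes business behavior and hides the storage schema."""

    def __init__(self, conn):
        self._conn = conn

    def potent_compounds(self, max_ic50_nm):
        """Return IDs of compounds at or below the given IC50, sorted by ID."""
        rows = self._conn.execute(
            "SELECT cmpd FROM results WHERE ic50_nm <= ? ORDER BY cmpd",
            (max_ic50_nm,),
        )
        return [row[0] for row in rows]

def render(service):
    """Presentation tier: formats output and knows nothing about SQL."""
    return ", ".join(service.potent_compounds(300.0))

report = render(AssayService(make_backend()))
```

If the table is renamed or moved to another system, only `AssayService` changes; `render` keeps calling `potent_compounds`.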

Component Interfaces are needed - but are not the whole story
Integration of life sciences information across scientific disciplines and business areas is essential, however...
Terminology is inconsistent
– information searches are usually incomplete and inaccurate
Definitions and descriptions of objects across a business area differ among data sources
– integrating multiple sources is labor-intensive, expensive, and time-consuming
Make common, shareable ontologies a part of the component marketplace

Text Mining

Text Mining - Challenges and Possibilities
Information overload. There’s too much.
Free text is a large category: most bio-information is only in text
– Medline indexes about 600K entries/year.
– Pharmas make heavy use of full-text ejournals
– The USPTO has over 2 million full-text patents online
Business needs to
– find documents/information
– screen and sort inputs
– discover relationships and mine information

Text Mining
We would like
– Better retrieval
– Help with handling the documents we have
– Help finding specific pieces of information without having to read each document
What might help?
– Statistical techniques
– Natural language processing techniques
– Knowledge domain based techniques
Controlled vocabularies and ontologies are key
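As a minimal sketch of the knowledge-domain-based approach, the Python below tags a sentence with preferred terms from a controlled vocabulary. The vocabulary is a hypothetical toy: the MURA/MURZ and enoylpyruvate transferase synonyms come from the sequence entry shown earlier in the talk, the rest is assumed for illustration.

```python
import re

# Hypothetical controlled vocabulary: surface form (lowercase) -> preferred term.
VOCAB = {
    "mura": "MURA",
    "murz": "MURA",
    "enoylpyruvate transferase": "MURA",
    "bacillus subtilis": "Bacillus subtilis",
    "b. subtilis": "Bacillus subtilis",
}

def tag_terms(text):
    """Return the sorted preferred terms whose surface forms occur in text."""
    lowered = text.lower()
    found = set()
    for surface, preferred in VOCAB.items():
        # Lookarounds stop a short form matching inside a longer word.
        pattern = r"(?<!\w)" + re.escape(surface) + r"(?!\w)"
        if re.search(pattern, lowered):
            found.add(preferred)
    return sorted(found)

doc = "MurZ from B. subtilis is an enoylpyruvate transferase."
tags = tag_terms(doc)
```

A search for MURA now retrieves this sentence even though the string "MurA" never appears in it, which is the retrieval gain a shared vocabulary offers.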

Grid Computing

Still being defined to some extent. A good working definition for a large part of The Grid is “A heterogeneous, location-transparent pool of network accessible computation, data and application resources within a secure, managed common namespace.”
Unifies compute, data and application resources
– Allows use of resources regardless of location
– Allows aggregation of discrete resources
Analogous to the electric power grid. Resource available to the user can come from anywhere
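Location transparency within a common namespace can be illustrated with a toy registry in Python. The class `Namespace`, the logical name, and the site names are hypothetical; this is not the API of any grid middleware, and it leaves out the security and management aspects of the definition.

```python
class Namespace:
    """Toy common namespace: logical resource name -> sites that provide it."""

    def __init__(self):
        self._sites = {}

    def register(self, logical_name, site):
        """Record that a site offers the named resource."""
        self._sites.setdefault(logical_name, []).append(site)

    def resolve(self, logical_name, is_up):
        """Return the first reachable site; the caller never names a location."""
        for site in self._sites.get(logical_name, []):
            if is_up(site):
                return site
        raise LookupError(f"no reachable provider for {logical_name!r}")

ns = Namespace()
ns.register("blast/protein-db", "site-a.example")
ns.register("blast/protein-db", "site-b.example")

down = {"site-a.example"}  # pretend the first site is offline
chosen = ns.resolve("blast/protein-db", lambda site: site not in down)
```

The user asks for "blast/protein-db" and receives whichever provider is available, much as an appliance draws power without knowing which station generated it.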

The Grid
More than technology for high performance computing -- it’s a different way of looking at computing and network-accessible resources
There is an explosion in the complexity, diversity and distribution of hardware, software and information
Mergers, acquisitions, joint ventures, and partnerships in all industries are creating the need for distributed and virtual organizations
Consortial efforts to build consensus and standards (Global Grid Forum, GGF)
Controlled vocabularies and ontologies are key

Build Shareable Ontologies
Express formalized ontologies in a common language (or a small number of languages), facilitating representation and exchange of ontological knowledge
Establish consortia and community-based initiatives to build common ontologies to establish shared understandings within the industry
Do the experiment -- insert ontologies into the component, text mining and grid computing space!
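What a formalized, exchangeable ontology buys a program can be hinted at with a tiny is-a hierarchy in Python. The terms and links below are an illustrative assumption, not drawn from any published ontology, and a plain dictionary stands in for a real ontology language.

```python
# Hypothetical is-a links: term -> list of direct parent terms.
IS_A = {
    "carboxyvinyltransferase": ["transferase"],
    "transferase": ["enzyme"],
    "enzyme": ["protein"],
    "antibody": ["protein"],
}

def ancestors(term):
    """Return every term reachable by following is-a links upward."""
    seen = []
    stack = list(IS_A.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.append(parent)
            stack.extend(IS_A.get(parent, []))
    return seen

def is_a(term, candidate):
    """True if term equals candidate or candidate is one of its ancestors."""
    return term == candidate or candidate in ancestors(term)

lineage = ancestors("carboxyvinyltransferase")
```

Because the hierarchy is data rather than code, two organizations that exchange it can answer "find all enzymes" the same way without agreeing on software.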

Role of External Alliances and Collaborations in the Enterprise Architecture

External Alliances and Collaborations
Two essentials:
– The job is too big for any one organisation
– Standard components, infrastructure and ontologies promote best-of-breed
External alliances can play a vital role in defining & developing suitable services & standards

Engagement with alliances
Shopper / Victim
– No alliance engagement: shop for (or simply accept) vendor-supported standards
Watcher
– Semi-passive acceptance: evaluate & select from alliance (& other) products
Navigator
– Active participant: influences software & component development to suit enterprise strategic needs

Standards selection criteria
Robustness
Architectural fit
Availability of implementations
Stability
Continuing development
Level of adoption / acceptance
Size & vigor of user community
Cost of adoption / migration

Infrastructure standards (examples)
Data Interchange Services (e.g., PDF, HTML, ISO/IEC [JPEG], XML)
Data Management Services (ISO 9075:1992 [SQL], SQL CLI)
Graphics & Imaging Services (GIF, TIFF, GKS, CGM)
International Operation Services (ISO/IEC Universal Multiple-Octet Coded Character Set)
Location & Directory Services (IETF RFC1738 [URL], RFC2251 [LDAP])
Network Services (IETF RFC 821 SMTP, X.400, IETF RFC 793 TCP)
Object-Oriented Provision of Services (CORBA, X/Open G302)
Operating System Services (IEEE Std 1003 [POSIX])
Security Services (ISO/IEC , SSL, IETF RFC 2222 SASL)
Software Engineering Services (ISO/IEC DIS [C++], Java JDK, VM)
System & Network Management Services (SNMP)
User Interface Services (X Window system)
Source: Standards Information Base (The Open Group)

Information standards examples

Component/Service/Ontology Selection Criteria
Fitness to purpose
Architectural fit
Platform requirements
Availability
– Open source
– Vendor supported
Flexibility, configurability
Staff training
Longevity, stability
Total cost of use (licensing terms)

Standardized components & services

Sources of standards
Vendors
Information Providers
Academic Research Projects
Standards Organizations
Industry Consortia
Home-grown

Component & standards development alliances & consortia
ISO, ANSI, IEEE, IETF, OASIS, W3C
Health Level Seven (HL7)
Life Sciences Research DTF (OMG LSR)
Open Bioinformatics Foundation: Biopython, BioJava, BioCORBA, Bioperl, BioDAS, BioMOBY, BioSOAP
Microarray Gene Expression Database Group (MGED)
Clinical Data Interchange Standards Consortium (CDISC)
Interoperable Informatics Infrastructure Consortium (I3C)
Global Grid Forum (GGF)

Alliance selection criteria
Technical scope of alliance mission (roadmap)
Alliance architectural commitments
Membership (breadth of industry participation)
Standards adoption process
Ability to influence
Ease of participation (cost, mechanism, openness)
Track record (i.e., stability, longevity, productivity)
IP Issues
Alliance staff support
Total cost of membership
Other benefits of membership?

Acknowledgements
David Benton, Jim Butler, Filip Fuma, Scott Harker, Paula Matuszek, Richard Moore