Approaches to Storing and Querying Structural Information in Botanical Specimen Descriptions. Trevor Paterson, BBSRC (Biotechnology and Biological Sciences Research Council).

Presentation transcript:

Approaches to Storing and Querying Structural Information in Botanical Specimen Descriptions
Trevor Paterson
The Prometheus Database Project
BBSRC (Biotechnology and Biological Sciences Research Council)

Prometheus II: Capturing Botanical Descriptions for Taxonomy (Oracle RDB)
A novel model for composing and recording taxonomic descriptions, using an ontology of defined terms.
JK, PB, GR, CR, Trevor Paterson, MP, MW, Sarah McDonald, Kate Armstrong
PII Visualisation and Data Entry Tools: JK, MW, TP, Alan Cannon

Prometheus I: A Taxonomic Database (POET OODB)
A novel object-based representation of multiple overlapping taxonomic hierarchies, based on specimen circumscription.
Jessie Kennedy, Cédric Raguenaud, Mark Watson, Martin Pullan, Mark Newman, Peter Barclay
PI refactor (Oracle RDB): JK, Gordon Russell, Andrew Cumming
PI Visualisation Tools: JK, Martin Graham

‘Character Descriptions’ in Taxonomy
• Characters are central to the taxonomic process
• they describe specimens, species and higher ‘Taxa’ (groups)
• they are the means of recognizing and categorizing diversity (forming taxonomies)
• they are a major data output/resource generated from the process, e.g.
  - for reuse by other taxonomists
  - for reuse in other applications, e.g. for specimen identification (field guides, keys etc.)
e.g. (Character 1; Character 2)
  taxon1: petals lanceolate; petal length 5 mm
  taxon2: petals ovate; petal length 2-4 mm

Problems with Taxonomic Characters
• No formal methodology or model for recognizing and describing Characters
• Language (terminology) used in descriptions is ill-defined (natural language based)
• A taxonomic revision is often the work of one individual and can be highly idiosyncratic (lack of formalism, personal terminology)
Consequences
• loss of data (not recorded)
• ambiguous / poorly-defined data
• data not comparable between projects

Prometheus System for Taxonomic Description
• A standard conceptual model for the composition of character descriptions
• Ontologies of defined and related ‘terms’ to record character descriptions
• Store description data in electronic/database form

The Prometheus II Character Model
Representing Characters as Atomic Statements: Description Elements

The Prometheus II Character Model
An Ontology of Defined Terms for Unambiguous Character Description

How do we record ‘which’ structure we are talking about: contextual localisation.
i.e. how do we unambiguously represent:
- a general leaf
- a basal leaf
- an apical leaf
- a leaf on a stem
- a leaf on a branch
- a leaf that is part of a whorl
or, even more complex, how do we represent a region or a substructure on any of these potential leaves?

In order to provide a flexible, unambiguous yet non-prescriptive means of localizing structures...
• Our ontology records a PartOf relationship between structures to capture these potential structural relationships
• This PartOf relationship is OPTIONAL, i.e. ontologically a leaf might be part of a stem, whorl, branch etc., but a given specimen may only have some of these possible structural compositions
• Only when a particular structural context is referred to in a description is its reality asserted
• The PartOf relationship encompasses any compositional relationship: not only strict part/subpart, but also less clear associations and attachments.

The PartOf relationship in the ontology specifies all the possible compositional relationships between anatomical structures
• a given structure can potentially be PartOf a number of Parent Structures
• PartOf forms an acyclic, directed graph
• Can be materialized as a tree hierarchy (by duplicating structures with more than one potential parent)
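The materialization step above can be sketched in Python. This is only an illustration under stated assumptions: the `PART_OF` mapping and its parent links are hypothetical, not the project's ontology, and `paths_to` is an invented helper. It shows how a structure with more than one potential parent ends up duplicated, once per root-to-structure path.

```python
from typing import Dict, List

# Hypothetical PartOf graph: each structure maps to its possible parents.
# The parent links here are illustrative assumptions only.
PART_OF: Dict[str, List[str]] = {
    "Inflorescence": [],
    "Floret": ["Inflorescence"],
    "Flower": ["Inflorescence", "Floret"],
    "Androecium": ["Flower"],
}


def paths_to(structure: str, part_of: Dict[str, List[str]]) -> List[List[str]]:
    """Return every root-to-structure path through the acyclic PartOf graph."""
    parents = part_of.get(structure, [])
    if not parents:
        # A structure with no parent is a root of the hierarchy.
        return [[structure]]
    paths: List[List[str]] = []
    for parent in parents:
        for prefix in paths_to(parent, part_of):
            paths.append(prefix + [structure])
    return paths


# Flower has two potential parents, so Androecium appears under two paths.
for path in paths_to("Androecium", PART_OF):
    print(".".join(path))
```

Because the graph is acyclic the recursion terminates; each printed line corresponds to one copy of the structure in the materialized tree.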

Specifying an exact structural context from the optional Part_Of hierarchy
[Figure: Part_Of hierarchy showing Androecium, Column, Flower, Floret, Inflorescence]

Software that we are developing for assisting a structured approach to taxonomic description uses the ontological PartOf hierarchy to organize data entry interfaces, for editing and scoring.

The Project Definition Interface (Ontology → Proforma)
- Central organizing relationship for the ontology: the ‘PartOf’ hierarchy of structures. (‘Edit’ by filtering or adding necessary regions and generic structures.)
- Properties applicable to selected structure. (Expand to show states available.)

On a project-by-project basis a filtered version of the Ontology can be specified (a ‘Proforma’ Ontology)
• where only the structures of interest for a given project are included
• and ‘generic structures’ and ‘regions’ are explicitly added to the compositional tree
• This then represents a description template or ‘Proforma’ used for a particular project
• Therefore the actual structural ‘Paths’ used vary from project to project
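A rough Python sketch of this filtering step follows. The function `make_proforma`, the example structures and the rule that a path is dropped when any structure on it is excluded are all assumptions made for illustration; the slides do not say how excluded intermediate structures are handled.

```python
from typing import Dict, Iterable, List, Set


def make_proforma(
    ontology_paths: Iterable[List[str]],
    wanted: Set[str],
    additions: Dict[str, List[str]],
) -> List[List[str]]:
    """Keep paths built only from wanted structures, then attach added nodes.

    `additions` maps a structure to the regions or generic structures that a
    project explicitly adds beneath it (an assumed representation).
    """
    proforma = [p for p in ontology_paths if all(s in wanted for s in p)]
    extra: List[List[str]] = []
    for path in proforma:
        for added in additions.get(path[-1], []):
            extra.append(path + [added])
    return proforma + extra


# Hypothetical ontology paths and project choices.
ontology = [
    ["Stem"],
    ["Stem", "Leaf"],
    ["Stem", "Branch"],
    ["Stem", "Branch", "Leaf"],
]
project_paths = make_proforma(
    ontology,
    wanted={"Stem", "Leaf"},
    additions={"Leaf": ["apex", "hair"]},
)
```

Two projects that choose different `wanted` sets or additions end up with different path sets, which is why paths cannot be assumed to be shared between projects.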

Creating a Proforma Ontology
[Figure: the full ONTOLOGY graph of structures is filtered to a PROFORMA ONTOLOGY tree, to which Generic Structures (spine, hair) and Regions (lower surface, upper surface, apex, base, centre) are added]

A Completed Proforma Interface

Yet another level of complexity is added by allowing any structure to be duplicated at the time of project definition. In fact, multiple instances of a structure might also be required at scoring time, to record observations for separate, distinguishable instances of the same structure.

Representing multiple copies of Structures
[Figure: proforma ontology tree in which Leaf (B) is cloned as ‘Leaf’ #1 [B#1] and ‘Leaf’ #2 [B#2]]
• Structure B (Leaf) is cloned
• The path of the Leaf structures B, D and E in the proforma ontology has to include its ‘clone’ identity, i.e. B#1 or B#2.
• We might want to record data for multiple instances of each Leaf, and include an ‘instance’ identity, e.g. B#1: instance 1, 2, 3...

How can we store/represent (and query!) all this context/path information for structures in a relational database?
• We want to refer to a structure in the context of its parents
• We need to take account of differences between project-specific versions of the ontology
• We need to allow for changes (additions) to the ontology over time
• We want to be able to ‘clone’ more than one version of a structure
• We want to be able to score multiple instances of the same structure

Given that we are concerned with representing a structure hierarchy...
What do we know about Hierarchical Trees in Relational Databases?
A well-studied problem, various ‘solutions’:
• adjacency lists: calculate transitive closure
• proprietary tree traversal queries (‘connect by’)
• node numbering algorithms: maintenance problems
Need an easy and efficient solution:
• materialize the path, and traverse programmatically
• what about XML, as a native tree representation?

Node Numbering and Materializing the Part_Of hierarchy
[Table columns: NodeID | ParentID | Materialized Path]
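As an illustration of the three columns named on this slide, the following Python sketch derives a materialized path for each node from a NodeID/ParentID table. The node numbers in `NODES` are made up, and the dot-separated output format is assumed from the string paths used later in the talk.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical node table: NodeID -> ParentID (None marks the root).
NODES: Dict[int, Optional[int]] = {1: None, 2: 1, 3: 2, 4: 2, 5: 1}


def materialized_path(node_id: int, nodes: Dict[int, Optional[int]]) -> str:
    """Walk up the ParentID links and return the root-to-node path."""
    parts: List[str] = []
    current: Optional[int] = node_id
    while current is not None:
        parts.append(str(current))
        current = nodes[current]
    return ".".join(reversed(parts))


# One row per node: (NodeID, ParentID, Materialized Path).
table: List[Tuple[int, Optional[int], str]] = [
    (node, parent, materialized_path(node, NODES))
    for node, parent in NODES.items()
]
```

Once the path is stored alongside the node, finding ancestors needs no recursive query, at the cost of rewriting stored paths whenever the hierarchy changes.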

We could store a table of paths in the ontology, but this carries heavy maintenance costs with versioning and project additions. Instead, we could store the actual path with each description statement (Description Element).

XML and Trees
Intuitively, you can represent the whole structural hierarchy as an XML tree.
Again, node numbering algorithms and XML node indexing algorithms exist for XML databases, but these still have maintenance problems.

Or we can materialize the path of each node as an XML fragment instead of as a string.
[Figure: an XML path fragment and its equivalent string]
As with the strings, these could be stored as a table of paths for the ontology, or the path can be stored as part of each description statement (DE).
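The XML shown on the slide did not survive in this transcript, so the sketch below is a guess at the general shape rather than the project's actual format: each structure on the path becomes a nested element. The element name `Structure` and the attribute name `TermID` are assumptions (the latter suggested by the `TermIDattr` section name used in one of the later queries), and `path_to_xml` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional


def path_to_xml(term_ids: List[int]) -> str:
    """Nest one element per structure, outermost parent first."""
    root: Optional[ET.Element] = None
    parent: Optional[ET.Element] = None
    for term_id in term_ids:
        attrs = {"TermID": str(term_id)}  # assumed attribute name
        if parent is None:
            root = parent = ET.Element("Structure", attrs)
        else:
            parent = ET.SubElement(parent, "Structure", attrs)
    if root is None:
        raise ValueError("path must contain at least one structure")
    return ET.tostring(root, encoding="unicode")


# A three-step path becomes three nested elements.
fragment = path_to_xml([288, 239, 243])
```

The nesting is recursive (an element contains an element of the same name), which matches the remark later in the talk about the fragment being recursive.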

These XML fragments can store more information than a simple string
[Figure: XML fragment] equivalent to the string 288(1).239(2).243(1)[3]
Even more data could be added to the XML as attributes, such as Term="Flower" ParentNodeID="X"
Adding this additional information to a single string path is complex
- e.g. parsing out the brackets...
- could store additional strings, such as: Inflorescence.Floret.Flower
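To make the "parsing out the brackets" point concrete, here is a small Python sketch that splits a path such as 288(1).239(2).243(1)[3] into its parts. It assumes that the number in round brackets is the clone identity and the trailing number in square brackets is the instance identity; the slides imply this but do not spell it out. `Step` and `parse_path` are hypothetical names.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class Step(NamedTuple):
    term_id: int
    clone: Optional[int]  # assumed meaning of the "(n)" suffix


STEP_RE = re.compile(r"^(\d+)(?:\((\d+)\))?$")
INSTANCE_RE = re.compile(r"\[(\d+)\]$")


def parse_path(path: str) -> Tuple[List[Step], Optional[int]]:
    """Split a dot-separated path into steps plus an optional instance number."""
    instance: Optional[int] = None
    match = INSTANCE_RE.search(path)
    if match:
        instance = int(match.group(1))  # assumed meaning of the "[n]" suffix
        path = path[: match.start()]
    steps: List[Step] = []
    for token in path.split("."):
        step_match = STEP_RE.match(token)
        if step_match is None:
            raise ValueError(f"unrecognised path step: {token!r}")
        clone = int(step_match.group(2)) if step_match.group(2) else None
        steps.append(Step(int(step_match.group(1)), clone))
    return steps, instance


steps, instance = parse_path("288(1).239(2).243(1)[3]")
```

Every extra piece of information packed into the string needs another delimiter and another parsing rule, whereas an XML fragment could carry the same data as named attributes.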

Testing the alternative representations of the Materialized Path for storage and retrieval of Structural Context Information in Plant Character descriptions
As a dot-separated string, parsed programmatically by PL/SQL
vs
As an XML fragment, parsed by Oracle XML database functions
[Note: because our XML fragment is recursive, it cannot be validated against an acceptable XSD schema that would allow Oracle to shred the XML into relational tables; therefore XML storage is essentially as a CLOB.]

(1) Paths stored as strings are in the format: (2)
‘Traversing’ the path encoded in such strings can be achieved by combining various Oracle PL/SQL functions, e.g.
  INSTR (searches a string for a substring, returning the index i)
  SUBSTR (returns a substring of length l, from index i, of a string)
  CONCAT (can concatenate strings and characters)
  TO_CHAR (allows conversion of integers to strings)
These can be combined to write stored PL/SQL functions to automate path traversal, e.g.
  getParent (descStructure Number, path varchar2) returns StructureID
  getParentPath (descStructure Number, path varchar2) returns PathString

  select getParent (504, ' (2) ') "Parent" from dual;

  Parent
  row selected
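The stored PL/SQL functions themselves are not shown in the slides. As a rough Python analogue of what getParent and getParentPath are described as returning, the sketch below tokenizes the path instead of using INSTR/SUBSTR. The path syntax (dot-separated IDs, optional "(clone)" suffix, optional trailing "[instance]") is assumed, and the example path is invented.

```python
import re
from typing import List, Optional, Tuple


def _steps(path: str) -> List[Tuple[int, str]]:
    """Split a dot-separated path into (term_id, raw_token) pairs."""
    path = re.sub(r"\[\d+\]$", "", path)  # drop an assumed trailing [instance]
    pairs: List[Tuple[int, str]] = []
    for token in path.split("."):
        if not token:
            continue  # tolerate a leading or trailing dot
        match = re.match(r"\d+", token)
        if match is None:
            raise ValueError(f"unrecognised path step: {token!r}")
        pairs.append((int(match.group()), token))
    return pairs


def get_parent(desc_structure: int, path: str) -> Optional[int]:
    """Return the ID of the structure immediately above desc_structure."""
    steps = _steps(path)
    for i, (term_id, _token) in enumerate(steps):
        if term_id == desc_structure:
            return steps[i - 1][0] if i > 0 else None
    return None


def get_parent_path(desc_structure: int, path: str) -> Optional[str]:
    """Return the part of the path that leads to the parent of desc_structure."""
    steps = _steps(path)
    for i, (term_id, _token) in enumerate(steps):
        if term_id == desc_structure:
            return ".".join(tok for _id, tok in steps[:i]) if i > 0 else None
    return None


example_path = "288(1).239(2).243(1).504"  # hypothetical path ending in scale
```

With this invented path, `get_parent(504, example_path)` yields the ID of the structure directly above 504, which is the single traversal step that Query B relies on.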

(2) Stored XML Paths are in the format:
Oracle XPath expressions allow traversal up and down the tree in a single query:

  select unique extractValue(PATH, "504Parent"
  from DE
  where existsnode(PATH, = 1
    and existsnode(PATH, = 0 ;

  504Parent
  100, 235, 373, 423, 521, 553, 96
  7 rows selected.
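The XPath expressions in the Oracle query above were lost in this transcript. To show the general idea of stepping from a node to its parent with a path expression, here is a sketch using the limited XPath subset in Python's standard `xml.etree.ElementTree`, not Oracle's XML functions. The element name `Structure`, the attribute `TermID` and the sample fragment are assumptions.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional

# Hypothetical stored path fragment: 288 contains 243, which contains 504.
FRAGMENT = (
    '<Structure TermID="288">'
    '<Structure TermID="243">'
    '<Structure TermID="504"/>'
    "</Structure>"
    "</Structure>"
)


def parent_term_ids(xml_text: str, term_id: int) -> List[Optional[str]]:
    """Find elements with the given TermID and return their parents' TermIDs."""
    root = ET.fromstring(xml_text)
    # ".." steps from each matching element up to its parent element.
    parents = root.findall(f".//Structure[@TermID='{term_id}']/..")
    return [p.get("TermID") for p in parents]


scale_parents = parent_term_ids(FRAGMENT, 504)
```

The match and the upward step are expressed in one path expression, which is the convenience the slide attributes to the XML representation.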

Two test queries to compare the efficiency of string-based query versus XML query:
Simple Query A: Find all information pertaining to Flowers
- first need the ID of Flowers: a trivial look-up (ID = ‘243’)
- then fetch all DescriptionElements that have flower somewhere in their path
More complex Query B: Find which descriptions have parts with scales [‘504’], and what these parts are

Simple Query A: Find all information pertaining to Flowers (ID = ‘243’)
(Avi) Using simple string comparison:
  where DE.structurepath like '%.243.%'   (i.e. flower within the path)
     or DE.structurepath like '%.243(%'   (i.e. a clone of flower in the path)
     or DE.structurepath like '%.243'     (i.e. flower at the end of a path)
(Ai & Aii) Using an XPath expression in Oracle's SQL/XML query language:
  where existsNode(path, = 1
(Aiii) Using an XPath expression in Oracle Text's 'contains' function (with a path_section index):
  where contains(path, ) >0
(Aiv & Av) Whilst with xml_section or auto_section indexing the query might resemble:
  where contains(path, '243 WITHIN TermIDattr') >0
  or
  where contains(path, '243 WITHIN >0
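The string-comparison variant can be tried out with Python's built-in `sqlite3` module standing in for Oracle. The table and column names follow the queries in this talk, but the rows are invented and the sketch assumes every ID in a stored path is preceded by a dot. It shows why the delimiters matter: the patterns match ID 243 but not 2430.

```python
import sqlite3
from typing import List

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE descriptionelement (idseq INTEGER PRIMARY KEY, structurepath TEXT)"
)
# Hypothetical rows; every ID is assumed to be preceded by a dot.
conn.executemany(
    "INSERT INTO descriptionelement VALUES (?, ?)",
    [
        (1, ".288.239.243"),     # flower at the end of the path
        (2, ".288.243(2).504"),  # a clone of flower within the path
        (3, ".288.2430.17"),     # a different ID that merely starts with 243
        (4, ".243.504"),         # flower within the path
    ],
)


def elements_under(term_id: int) -> List[int]:
    """Return idseq of every element whose path contains term_id as a whole ID."""
    t = str(term_id)
    rows = conn.execute(
        "SELECT idseq FROM descriptionelement "
        "WHERE structurepath LIKE ? OR structurepath LIKE ? OR structurepath LIKE ? "
        "ORDER BY idseq",
        (f"%.{t}.%", f"%.{t}(%", f"%.{t}"),
    ).fetchall()
    return [row[0] for row in rows]


flower_elements = elements_under(243)
```

Under the leading-dot assumption, rows 1, 2 and 4 match and row 3 does not. A path that stored its first ID without a leading dot would need a fourth pattern.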

More complex example B: to find which descriptions have parts with scales [‘504’], and what these parts are
We would want to search for:
- scale ID being present in the path of a description element
- and get the parent of scale [i.e. a single path traversal]
Checking that:
- if presence has been scored for scale, it is present
- the DE is not marked ‘not scored’
It is necessary to scope the query, e.g. using ProformaID and OntologyID.
We could easily add increasing complexity/specificity, e.g. to look for the same information but only where it applies to scales found on Flowers:
- get paths where flower ID [‘243’] precedes scale ID at any level [traversal up and down the path]

-- Query B using string paths and the getparent() function
select distinct * from (
  select distinct "DescriptionID", "SpecimenID",
         t.term "Structure Possessing Scales", "Scale State"
  from (
    select de.idseq "DE_ID",
           getparent(504, de.structurepath) "ParentID",
           d.idseq "DescriptionID",
           d.specimen "SpecimenID",
           ('Present') "Scale State"
    from descriptionelement de, description d, term t, statelist s
    where (de.structurepath like '%.504.%'
           or de.structurepath like '%.504(%'
           or de.structurepath like '%.504')
      and de.description = d.idseq
      and de.describedstructure = t.idseq
      and de.not_scored_flag = 'F'
      and de.stategroup = 134
      and s.de = de.idseq
      and s.state = 2325
  ) x, term t
  where x."ParentID" = t.idseq
)
UNION (
  (select distinct "DescriptionID", "SpecimenID",
          t.term "Structure Possessing Scales", "Scale State"
   from (
     select de.idseq "DE_ID",
            getparent(504, de.structurepath) "ParentID",
            d.idseq "DescriptionID",
            d.specimen "SpecimenID",
            ('Quantitative Score') as "Scale State"
     from descriptionelement de, description d, term t
     where (de.structurepath like '%.504.%'
            or de.structurepath like '%.504(%'
            or de.structurepath like '%.504')
       and de.description = d.idseq
       and de.describedstructure = t.idseq
       and de.not_scored_flag = 'F'
       and (de.stategroup != 134 or de.stategroup is null)
   ) x, term t
   where x."ParentID" = t.idseq)
);

-- Query B using XML paths
select * from ((
  select "DescriptionID", "SpecimenID",
         t.term "Parent of Scale", ('Quantitative Score') "ScaleState"
  from (
    select DE.Description "DescriptionID",
           D.Specimen "SpecimenID",
           extractValue(PATH, "att"
    from descriptionelement DE, Description D
    where existsnode(PATH, = 1
      and existsnode(PATH, = 0
      and DE.not_scored_flag = 'F'
      and DE.Description = D.idseq
      and (DE.stategroup is Null or DE.stategroup != 134)
  ) x, term t
  where x."att" = t.idseq
) UNION (
  select "DescriptionID", "SpecimenID",
         t.term "Parent of Scale", ('Present') "ScaleState"
  from (
    select DE.Description "DescriptionID",
           D.Specimen "SpecimenID",
           extractValue(PATH, "att"
    from descriptionelement DE, Description D, statelist s
    where existsnode(PATH, = 1
      and existsnode(PATH, = 0
      and DE.not_scored_flag = 'F'
      and DE.Description = D.idseq
      and DE.stategroup = 134
      and s.de = de.idseq
      and s.state = 2325
  ) x, term t
  where x."att" = t.idseq
));

DescriptionID  SpecimenID  Structure Possessing Scales  Scale State
                           Blade                        Present
                           Blade                        Quantitative Score
                           Bract                        Present
                           Bract                        Quantitative Score
                           Bract                        Present
                           Bract                        Quantitative Score
                           Blade                        Present
                           Blade                        Quantitative Score
                           Filament                     Present
                           Nectary                      Quantitative Score
                           Sepal                        Quantitative Score
                           Stipe                        Present
                           Petal                        Quantitative Score
                           Stipe                        Quantitative Score
                           Nectary                      Present
                           Petal                        Quantitative Score
rows selected.

Some Conclusions
• to represent contextual information in plant descriptions we have stored a materialized representation of the compositional path for each structure being described, either as a delimited string or as an XML fragment
• our results show that this path is queried most efficiently in our database if stored as a string and queried using string comparison functions
• however, storing the materialized path as an XML fragment provides an attractive alternative that allows more straightforward query design and might allow a richer and more readable representation of the information
• storage, indexing and querying of XML datatypes are database dependent, but simple string-parsing functions are provided as part of most database systems
• optimization of XML query and indexing is complicated, although in many respects the syntax of XPath queries allows simpler query formulation than parsing the string path.

Jessie Kennedy, Cédric Raguenaud, Mark Watson, Martin Pullan, Mark Newman, Peter Barclay, Martin Graham, Gordon Russell, Andrew Cumming, Sarah McDonald, Kate Armstrong, Alan Cannon, Robert Kukla, Trevor Paterson