10 March 2004, Richard J. White – COMSC / BB Unit
Reliable knowledge discovery in a biodiversity Grid. Part 2: Litchi and ambiguous names, by Richard J. White.


Reliable knowledge discovery in a biodiversity Grid
Part 2: Litchi and ambiguous names
by Richard J. White
Presented to the Biostatistics & Bioinformatics Unit, Cardiff, Wednesday 10 March 2004

Ambiguous nomenclature
Challenges in creating global biodiversity information systems by merging and linking databases:
- ambiguities arise from the way scientific names refer to species
- for example, if two species are combined, one of the original names must be re-used to refer to the new concept
- conversely, when a species is divided into two, one part must retain the original name

A problem in Biodiversity Informatics
- The way species are named may affect the reliability and usability of species information systems
- Techniques to handle the problem semi-automatically can be developed
- This problem and potential solutions may in some cases generalise to other naming schemes

Names for species
- A new name is published by an author who thinks the species is new and therefore needs a name
- Later, others may disagree and merge this species with another
- The older name is then re-used to designate the merged species: same name, different meaning (broader circumscription)

Names for species
- Alternatively, a species may be split in two; one of the new species gets a new name
- The older name is re-used to designate the other one: same name, different meaning (narrower circumscription)

Example
- Locate sequence data for all species of Vicia
- Some data may be listed under species of the obsolete genus Orobus
- A name such as Vicia narbonensis might be regarded by some as just another name for Vicia faba

Example
- You want to discover all there is to know about one species
- It may be listed in different sources under different names
- These examples show why taxonomists attach great importance to synonyms
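
The search problem in these two examples can be sketched in a few lines of Python. This is an illustration only and is not taken from Litchi: the names `SYNONYM_PAIRS`, `build_index` and `expand_names` are hypothetical, and the single data entry simply encodes the statement above that some sources regard Vicia narbonensis as another name for Vicia faba.

```python
from collections import defaultdict

# Hypothetical synonym statements: (name, name it is treated as a synonym of).
SYNONYM_PAIRS = [
    ("Vicia narbonensis", "Vicia faba"),  # an opinion held by some sources
]

def build_index(pairs):
    """Build a symmetric lookup table from synonym pairs."""
    index = defaultdict(set)
    for a, b in pairs:
        index[a].add(b)
        index[b].add(a)
    return index

def expand_names(name, index):
    """Return the query name plus every name reachable through synonym links."""
    seen = {name}
    stack = [name]
    while stack:
        current = stack.pop()
        for other in index.get(current, ()):
            if other not in seen:
                seen.add(other)
                stack.append(other)
    return sorted(seen)

index = build_index(SYNONYM_PAIRS)
print(expand_names("Vicia faba", index))
```

A query for either name then retrieves records filed under both. Following every synonym link in this way is a deliberate simplification: it ignores whose opinion each synonymy reflects, which is exactly the ambiguity the rest of the talk addresses.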

(PDL cover)

(PDL page)

(ILDIS search results)

(ILDIS species page)

“Mr Linnaeus”
- A web-based mock-up to explore aspects of the user interface of a system for interpreting “taxonomically intelligent links”
- Prepared by Helen Bradbrook, an MSc student in the School of Plant Sciences at the University of Reading

Ambiguous nomenclature
- The problems are inherent in the subjective nature of the species concept
- They cannot be removed by, for example, using numbers instead of names (unless a completely new name or number is invented every time the circumscription changes)
- Some of these issues were addressed in the LITCHI project …

LITCHI Project
A rule-based tool for the detection and repair of conflicts and merging of data in taxonomic databases

Litchi
- a BBSRC/EPSRC “Bioinformatics Initiative” project (with Reading)
- using “conflicts” between species databases arising from ambiguous nomenclature
- but information is implicit in the lists of synonyms accompanying species names
- rule-based (Prolog) definition, detection and resolution of conflicts

Project Staff
- Suzanne Embury, Alex Gray, Andrew Jones, Iain Sutherland
  Object and Knowledge-based Systems Group, Department of Computer Science, University of Wales, Cardiff, PO Box 916, Cardiff CF24 3XF
- Frank Bisby, Sue Brandt
  Centre for Plant Diversity and Systematics, School of Plant Sciences, The University of Reading, Reading RG6 6AS
- John Robinson, Richard White
  Biodiversity & Ecology Research Division, School of Biological Sciences, University of Southampton, Southampton SO16 7PX

Why is LITCHI needed?
- Species names are the key to biodiversity information
- Trend towards large biodiversity databases and global systems
- Manual merging of taxonomic databases is very time-consuming
- Users want to browse “seamlessly” from one web-site to another
- Users want to assemble reliable data sets drawn from several sources, but information on naming “conflicts” is hard to find and checking for them is tedious

Example 1
Checklist A
  Caragana arborescens Lam. [accepted name]
  Caragana sibirica Medikus [synonym]
Checklist B
  Caragana sibirica Medikus [accepted name]
  Caragana arborescens Lam. [synonym]
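
The reversal in Example 1 is the kind of conflict that can be found mechanically. The Python sketch below is illustrative only and is not Litchi's own implementation (which the later slides describe as Prolog rules). It assumes a hypothetical representation in which each checklist is a dictionary from an accepted name to its list of synonyms, and `status_conflicts` is an invented helper name.

```python
# Hypothetical checklist model: accepted name -> list of synonyms.
CHECKLIST_A = {"Caragana arborescens Lam.": ["Caragana sibirica Medikus"]}
CHECKLIST_B = {"Caragana sibirica Medikus": ["Caragana arborescens Lam."]}

def all_synonyms(checklist):
    """Collect every synonym listed anywhere in a checklist."""
    return {s for synonyms in checklist.values() for s in synonyms}

def status_conflicts(a, b):
    """Report names accepted in one checklist but listed as synonyms in the other."""
    conflicts = []
    for name in sorted(set(a) & all_synonyms(b)):
        conflicts.append((name, "accepted in A, synonym in B"))
    for name in sorted(set(b) & all_synonyms(a)):
        conflicts.append((name, "accepted in B, synonym in A"))
    return conflicts

for name, problem in status_conflicts(CHECKLIST_A, CHECKLIST_B):
    print(f"{name}: {problem}")
```

Both names are reported, one for each direction of the disagreement. Deciding which checklist is right remains a taxonomic judgement that a check of this kind cannot make.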

Example 2
Checklist A
  Caesalpinia crista L. [accepted name]
Checklist B
  Caesalpinia crista L. [accepted name]
  Caesalpinia bonduc (L.) Roxb. [accepted name]
  Caesalpinia crista L., p.p. [synonym]

Example 3
In the case of the species Cytisus scoparius:
- Treatment A will list it as Cytisus scoparius (synonym Sarothamnus scoparius)
- Treatment B will list it as Sarothamnus scoparius (synonym Cytisus scoparius)

Treatment A (Genus Cytisus)    Treatment B
Cytisus scoparius              Sarothamnus scoparius (Genus Sarothamnus)
Cytisus striatus               Sarothamnus striatus (Genus Sarothamnus)
Cytisus multiflorus            Cytisus multiflorus (Genus Cytisus)
Cytisus praecox                Cytisus praecox (Genus Cytisus)

- Treatment A recognises one genus, Cytisus
- Treatment B recognises two genera, Cytisus and Sarothamnus

What we did
- Formulated rules for integrity and conflict, first in English and then in definite clauses of logic
- Translated these declarative rules to build and test a Prolog model
- Devised and tested algorithms to detect and report conflicts
- Devised and tested algorithms to manage the partially-automated correction of the conflicting elements
- Built and operated a prototype software system

Integrity and conflict rules
- How a scientific name should be composed (Rules of Nomenclature)
- Rules for citing the assemblage of names and synonyms for one taxon
- Rules of integrity and “concept relationships” (overlap etc.) between the taxa in a taxonomic treatment
- Rules for detecting conflicts between treatments
- Rules for classifying conflicts to determine the action to be taken

Testing the rules
Conflicts were detected in the ILDIS database by Rule 3, which states that a full name may not appear as an accepted name and a synonym in the same checklist:

  ¬(∃ n,a,l) accepted_name(n,a,_,l,_) ∧ synonym(n,a,_,l,_)

In Prolog form, this rule is expressed:

  litchi_rule3 :-
      accepted_name(N, A, _, L, _),
      synonym(N, A, _, L, _).

A detected conflict
The Prolog conflict detection engine reported:

  conflict(3:[Astragalus,variegatus]:[Freyn,&,Bornm,.]:combinedlist)

The conflict report includes the following information:

  Astragalus variegatus Freyn & Bornm. (accepted name)
  Astragalus sarypulensis B.Fedtsch. (synonym)

  Astragalus rufescens Freyn (accepted name)
  Astragalus variegatus Freyn & Bornm. (synonym)
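
For readers who do not use Prolog, the Rule 3 check can be paraphrased in Python. This is a sketch under stated assumptions, not the Litchi code: the `NameRecord` type and the `rule3_violations` function are hypothetical, only three of the five arguments of the Prolog facts are modelled (name, author and checklist), and the records are the four names from the conflict report above.

```python
from typing import NamedTuple

class NameRecord(NamedTuple):
    """Hypothetical flat record for one name in a checklist."""
    name: str        # genus and species epithet
    author: str      # author citation
    status: str      # "accepted" or "synonym"
    checklist: str   # identifier of the checklist the record belongs to

RECORDS = [
    NameRecord("Astragalus variegatus", "Freyn & Bornm.", "accepted", "combinedlist"),
    NameRecord("Astragalus sarypulensis", "B.Fedtsch.", "synonym", "combinedlist"),
    NameRecord("Astragalus rufescens", "Freyn", "accepted", "combinedlist"),
    NameRecord("Astragalus variegatus", "Freyn & Bornm.", "synonym", "combinedlist"),
]

def rule3_violations(records):
    """Full names that occur as both accepted name and synonym in one checklist."""
    def keys(status):
        return {(r.name, r.author, r.checklist) for r in records if r.status == status}
    return sorted(keys("accepted") & keys("synonym"))

for name, author, checklist in rule3_violations(RECORDS):
    print(f"Rule 3 conflict in {checklist}: {name} {author}")
```

The set intersection plays the role of the shared variables N, A and L in the Prolog clause: a violation exists exactly when the same name, author and checklist appear under both statuses.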

Conflict display

Repairing violations
- The user may wish to look at the context of a violation to determine the appropriate repair
- Domain-specific knowledge can be applied to narrow down the set of (taxonomically) valid repairs presented to the user

Conflict repair

Implementing LITCHI: major aspects
- Design of a suitable architecture
- Development of a model for species checklists
- Modelling taxonomic practice using constraints
- Providing appropriate support to the editor in repairing constraint violations

Summary
- We modelled the knowledge integrity rules in a taxonomic treatment.
- The knowledge tested is implicit in the assemblage of scientific names and synonyms used to represent each taxon (examples later).
- Practical uses include detecting and resolving taxonomic conflicts when merging or linking two databases.

Outcome of project
- A prototype tool for merging checklists and checking the integrity of individual checklists was implemented and is freely available (but scarcely usable)
- We plan to extend this work:
  - “re-implemented” production version
  - dynamic linking (so-called “taxonomically intelligent links”)

Litchi 2
Solutions to the nomenclature challenges, including Litchi and its interaction with Spice, are being developed further in the course of the new BBSRC “Biodiversity World” Grid demonstrator project and the EU “Species 2000 europa” and ENBI projects (involving the same parties)

Litchi 2
- “Intelligent linking” is intended to protect users from nomenclatural ambiguities and to explain them
- Development of these techniques would be easier if we had an explicit representation of the overlaps between species in different databases
- Such “cross-maps” can be constructed automatically using similar rules in the new Litchi version 2
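
The idea of a cross-map can be illustrated with a naive Python sketch. The slides do not give the rules used by Litchi 2, so everything here is an assumption made for illustration: checklists are modelled as dictionaries from accepted name to synonyms, two taxa are linked whenever their name sets share a name, the functions `normalise`, `name_set` and `cross_map` are hypothetical, and the pro parte synonym from Example 2 is assumed to belong to Caesalpinia bonduc, as its position on that slide suggests.

```python
def normalise(name):
    """Drop a trailing ', p.p.' (pro parte) marker so names can be compared."""
    suffix = ", p.p."
    return name[:-len(suffix)] if name.endswith(suffix) else name

def name_set(accepted, synonyms):
    """All names, accepted or synonym, that a checklist attaches to one taxon."""
    return {normalise(accepted)} | {normalise(s) for s in synonyms}

def cross_map(a, b):
    """Link each taxon in checklist a to every taxon in b that shares a name."""
    links = []
    for accepted_a, synonyms_a in a.items():
        for accepted_b, synonyms_b in b.items():
            shared = name_set(accepted_a, synonyms_a) & name_set(accepted_b, synonyms_b)
            if shared:
                links.append((accepted_a, accepted_b, sorted(shared)))
    return links

CHECKLIST_A = {"Caesalpinia crista L.": []}
CHECKLIST_B = {
    "Caesalpinia crista L.": [],
    "Caesalpinia bonduc (L.) Roxb.": ["Caesalpinia crista L., p.p."],
}

for taxon_a, taxon_b, shared in cross_map(CHECKLIST_A, CHECKLIST_B):
    print(f"A: {taxon_a}  <->  B: {taxon_b}  (shared: {', '.join(shared)})")
```

Here the single taxon in checklist A is linked to two taxa in checklist B, which is the overlap an intelligent link would need to explain to the user. A fuller treatment would record the kind of relationship (equal, broader, narrower, overlapping) rather than only the fact that names are shared.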

Future projects
- Ambiguous nomenclature: an on-going programme of projects (already involving collaboration with staff here in COMSC) building tools such as Litchi to help bioinformaticians deal with ambiguous nomenclature
- These techniques might be extended to other areas of bioinformatics where subjective identification and ambiguous nomenclature occur, such as the names of proteins (as suggested by Andrew Jones), genes, geographical areas, habitat types, etc.

An “intelligent” system
- It would know about the synonymies and ambiguities existing in various data domains
- It would help the user work with such data
- It would contain a thesaurus, “knowledge-base” or “ontology”

An “intelligent” system
- These are hard to construct by hand
- Litchi shows how this might be done by supervised automatic procedures in the case of species names
- We want to generalise these ideas and techniques to other data domains, maybe those that you are interested in