1 Managing Information Quality in e-Science using Semantic Web technology
Alun Preece, Binling Jin, Edoardo Pignotti (Department of Computing Science, University of Aberdeen)
Paolo Missier, Suzanne Embury, Mark Greenwood (School of Computer Science, University of Manchester)
David Stead, Al Brown (Molecular and Cell Biology, University of Aberdeen)
www.qurator.org
Describing the Quality of Curated e-Science Information Resources

2 Combining the strengths of UMIST and The Victoria University of Manchester
Information and quality in e-science
– Scientists are required to place their data in the public domain
– Scientists use other scientists' experimental results as part of their own work
– Variations in the quality of the data
– No control over the quality of public data
– Difficult to measure and assess quality: no standards
– How can I decide whether I can trust this data?
Diagram labels: E-science experiment; Lab experiment; In silico experiments (e.g. workflow-based); Public BioDBs

3 A concrete scenario
Qualitative proteomics: identification of proteins in a cell sample
– Wet lab: Step 1, ..., Step n
– Candidate data for matching (peptide peak lists)
– Information service (“Dry lab”): match algorithm
– Reference DBs: MSDB, NCBI, SwissProt/Uniprot
– Hit list: {ID, Hit Ratio, Mass Coverage, …}
– False negatives: incompleteness of reference DBs, pessimistic matching
– False positives: optimistic matching

4 Quality is personal
– Scientists tend to express their quality requirements for data by giving acceptability criteria
– These are personal and vary with the expected use of the data
– “What is the right trade-off between false positives and false negatives?”

5 Requirements for IQ ontology
1. Establish a common vocabulary
–Let scientists express quality concepts and criteria in a controlled way
–Within homogeneous scientific communities
–Enable navigation and discovery of existing IQ concepts
2. Sharing and reuse: let users contribute to the ontology while ensuring consistency
–Achieve cost reduction
3. Making IQ computable in practice
–Automatically apply acceptability criteria to the data

6 Quality Indicators
Quality Indicators: measurable quantities that can be used to define acceptability criteria: “Hit Ratio”, “Mass Coverage”, “ELDP”
–provided by the matching algorithm
–Experimentally established correlation between these indicators and the probability of mismatch
Diagram labels: Match algorithm; Information service (“Dry lab”); Hit list: {proteinID, Hit Ratio, Mass Coverage, …}

7 Data acceptability criteria
– Indicators are used as indirect “clues” to assess quality
– Quality Assertions (QA) formally capture these clues as functions of indicators
– Data classification or ranking functions, e.g. PIClassifier defined as f(proteinID, Hit Ratio, Mass Coverage, ELDP) → { (proteinID, rank) }
–This provides a custom ranking of the match results
– Formalized acceptability criteria are conditions on QAs: accept(proteinID) if PIClassifier(proteinID, …) > X OR …
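
The ranking function and acceptance condition on this slide can be sketched in Python. This is only an illustration: the weighted-sum scoring, the weights, the threshold and the sample values are all assumptions made for the example, not the scoring used by the authors' PIClassifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One entry of the hit list, carrying the three indicators named on the slide."""
    protein_id: str
    hit_ratio: float
    mass_coverage: float
    eldp: float


def pi_classifier(hits, weights=(0.5, 0.5, 0.0)):
    """Return (protein_id, score) pairs, best first.

    The weighted sum is a hypothetical stand-in for the real quality
    assertion; the slide only says it is some function of the indicators.
    """
    w_hr, w_mc, w_eldp = weights
    scored = [
        (h.protein_id, w_hr * h.hit_ratio + w_mc * h.mass_coverage + w_eldp * h.eldp)
        for h in hits
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def accept(hits, threshold):
    """Acceptability criterion: keep proteins whose score exceeds the threshold X."""
    return [pid for pid, score in pi_classifier(hits) if score > threshold]


# Made-up indicator values, for illustration only.
hits = [
    Hit("P1", hit_ratio=0.75, mass_coverage=0.5, eldp=0.0),
    Hit("P2", hit_ratio=0.25, mass_coverage=0.25, eldp=0.0),
    Hit("P3", hit_ratio=0.5, mass_coverage=0.5, eldp=0.0),
]
accepted = accept(hits, threshold=0.4)
```

Raising the threshold trades false positives for false negatives, which is why the criterion is left to each scientist.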

8 IQ ontology backbone
Class restrictions:
– MassCoverage: restriction on is-evidence-for with filler ImprintHitEntry
– PIScoreClassifier: restriction on assertion-based-on-evidence with filler HitScore
– PIScoreClassifier: restriction on assertion-based-on-evidence with filler MassCoverage
Properties:
– assertion-based-on-evidence: QualityAssertion → QualityEvidence
– is-evidence-for: QualityEvidence → DataEntity
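
The two backbone properties and their domain and range can be sketched as a simple type check in Python. This is a closed-world simplification, not DL reasoning (in OWL, domain and range axioms drive inference rather than validation), and the subclass links below are assumptions read off the slide rather than statements taken from the ontology itself.

```python
# Assumed subclass links: each concrete class hangs off one backbone class.
SUBCLASS_OF = {
    "MassCoverage": "QualityEvidence",
    "HitScore": "QualityEvidence",
    "PIScoreClassifier": "QualityAssertion",
    "ImprintHitEntry": "DataEntity",
}

# Property name -> (domain, range), as given on the slide.
PROPERTY_SIGNATURE = {
    "assertion-based-on-evidence": ("QualityAssertion", "QualityEvidence"),
    "is-evidence-for": ("QualityEvidence", "DataEntity"),
}


def is_a(cls, ancestor):
    """Walk the subclass chain upwards and report whether ancestor is reached."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def well_typed(subject, prop, obj):
    """Check that a (subject, property, object) statement fits the property's signature."""
    domain, range_ = PROPERTY_SIGNATURE[prop]
    return is_a(subject, domain) and is_a(obj, range_)
```

For example, `well_typed("PIScoreClassifier", "assertion-based-on-evidence", "MassCoverage")` holds, while swapping subject and object does not.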

9 Quality properties
– Users may add to a collection of generic quality properties
– Generic quality properties (part of the backbone): Accuracy, Currency, Consistency, Completeness, Conformity, Timeliness, Conciseness
– User-defined quality property: PI-acceptability
– How do we ensure consistent specialization?

10 Specializations of base ontology concepts
– Abstract assertion (informal): “a Quality Property is based upon one or more Quality Indicators for a Data Entity”
– Concrete assertion (informal): “the property Accuracy of Protein Identification is based upon the Hit Ratio indicator for Protein Hit data”
Diagram labels: Proteomics; Protein identification; Data Entity; Quality Indicator; Quality Property; Accuracy Property; Protein Hit; Accuracy of Protein identification; Hit Ratio

11 Maintaining consistency by reasoning
Axiomatic definition for Accuracy: a restriction on QtyProperty-from-QtyAssertion whose filler is a restriction on QA-based-on-evidence with filler ConfidenceEvidence
Diagram labels: PI-TopK PMF-Match Ranking PI-acceptability Mass Coverage Hit Ratio PIMatch Confidence Characterization Accuracy
Diagram relations: QtyProperty-from-QtyAssertion; Pref-based-on-evidence; Based-on; Output-of; Has-quality characterization; Is a

12 Computing quality in practice
Goal: to make quality assertions defined in the ontology computable in practice
Annotation model: representation of indicator values as semantic annotations
–model: RDF schema
–annotation instances: RDF metadata
Binding model: representation of the mapping between
–Data ontology classes → data resources
–Functions ontology classes → service resources

13 Data resource annotations
– Resource = data items at various granularity
– Data item → indicator values
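
The mapping from data items to indicator values can be sketched as a small set of RDF-style triples in plain Python. The `urn:example:` identifiers, the namespace and the values are hypothetical placeholders; the actual annotation schema is not shown in the transcript.

```python
# Hypothetical namespace for quality indicator properties.
Q = "urn:example:quality#"


def annotate(store, data_item, indicator, value):
    """Record one (data item, indicator, value) triple in the store."""
    store.add((data_item, Q + indicator, value))


def indicator_values(store, data_item):
    """Collect every indicator value attached to one data item."""
    return {
        predicate[len(Q):]: value
        for subject, predicate, value in store
        if subject == data_item and predicate.startswith(Q)
    }


store = set()
# Annotations at two granularities: individual hits and a whole hit list.
annotate(store, "urn:example:hit/P1", "HitRatio", 0.75)
annotate(store, "urn:example:hit/P1", "MassCoverage", 0.5)
annotate(store, "urn:example:hit/P2", "HitRatio", 0.25)
annotate(store, "urn:example:hitlist/run1", "HitCount", 2)
```

Keeping the data item as the triple subject lets the same lookup serve any granularity, from a single hit to a whole result set.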

14 Data resource bindings
– Data class → data resource
– Account for different granularities, data types

15 Service resource bindings
– Function class → (Web) service implementation
–E.g. annotation function, QA function
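
The two kinds of binding can be sketched as lookup tables in Python. Everything here is an assumption made for illustration: the locator string, the registry layout, and the use of a local callable in place of a real Web service.

```python
# Data class -> data resource locator (hypothetical locator).
DATA_BINDINGS = {
    "ImprintHitEntry": "urn:example:imprint/hits",
}

# Function class -> implementation; a local callable stands in for a Web service.
SERVICE_BINDINGS = {}


def bind_service(function_class, implementation):
    """Register the implementation that realises an ontology function class."""
    SERVICE_BINDINGS[function_class] = implementation


def invoke(function_class, *args):
    """Resolve a function class to its bound implementation and call it."""
    try:
        implementation = SERVICE_BINDINGS[function_class]
    except KeyError:
        raise LookupError(f"no service bound to {function_class!r}") from None
    return implementation(*args)


# A toy QA function: accept a hit when its score exceeds a threshold.
bind_service("PIScoreClassifier", lambda score, threshold: score > threshold)
result = invoke("PIScoreClassifier", 0.625, 0.4)
```

Separating the registry from the ontology means an implementation can be swapped without touching the class definitions.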

16 The complete quality model

17 IQ Service Example

18 Summary
– An extensible OWL DL ontology for Information Quality
–Consistency maintained using DL reasoning
– Used by e-scientists to share and reuse:
–Quality indicators and metrics
–Formal criteria for data acceptability
– Annotation model: generic schema for associating quality metadata to data resources
– Binding model: generic schema for mapping ontology concepts to (data, service) resources
– Model tested on data for proteomics experiments