Modelling and computing the quality of information in e-science
  Paolo Missier, Suzanne Embury, Mark Greenwood (School of Computer Science, University of Manchester, UK)
  Alun Preece, Binling Jin (Department of Computing Science, University of Aberdeen, UK)
  Roma, 3/4/07
Combining the strengths of UMIST and The Victoria University of Manchester
Quality of data
  Main driver, historically: data cleaning for
    – Integration: use of same IDs across data sources
    – Warehousing, analytics: restore completeness, reconcile referential constraints, cross-validate numeric data by aggregation
  Focus: record de-duplication, reconciliation, “linkage”
    – Ample literature, see e.g. the Nov 2006 issue of IEEE TKDE
  Consistency of data across sources
  Managing uncertainty in databases (Trio, Stanford)
  Data quality control in the data management practice
Common quality issues
  Completeness: not missing any of the results
  Correctness: each data item should reflect the actual real-world entity that it is intended to model
    – The actual address where you live, the correct balance in your bank account…
  Timeliness: delivered in time for use by a consumer process
    – E.g. stock information
  …
Taxonomy for data quality dimensions
Our motivation: quality in public e-science data
  GenBank, UniProt, EnsEMBL, Entrez, dbSNP
  Large volumes of data in many public repositories
  Increasingly creative uses for this data
  Problem: using third-party data of unknown quality may result in misleading scientific conclusions
Some quality issues in biology
  “Quality” covers a broader spectrum of issues than traditional DQ
    – “X% of database A may be wrong (unreliable) – but I have no easy way to test that”
    – “This microarray data looks ok but is testing the wrong hypothesis”
    – “The output from this sequence matching algorithm produces false positives”
    – …
  Each of these issues calls for a separate testing procedure
  Difficult to generalize
Correctness in biology - examples
  Data type | Creation process | Correctness
  UniProt protein annotation | Manual curation | Functional annotation f for p is correct if function f can reliably be attributed to p
  Qualitative proteomics: protein identification | Generate peptide peak lists, match peak lists (e.g. Imprint) | No false positives: every protein in the output is actually present in the cell sample
  Transcriptomics: gene expression report (up/down-regulation) | Microarray data analysis | No false positives, no false negatives
Defining quality in e-science is challenging
  In-silico experiments express cutting-edge research
    – Experimental data liable to change rapidly
    – Definitions of quality are themselves experimental
  Scientists’ quality requirements are often just a hunch
    – Quality tests missing or based on experimental heuristics
    – Definitions of quality criteria are personal and subjective
  Quality controls tightly coupled to data processing
    – Often implicit and embedded in the experiment
    – Not reusable
  “Quality”: personal criteria for data acceptability
Research goals
  1. Make personal definitions of quality explicit and formal
    – Identify a common denominator for quality concepts
    – Expressed as a conceptual model for Information Quality
    – Elicit “nuggets” of latent quality knowledge from the experts
  2. Make existing data processing quality-aware
    – Define an architectural framework that accommodates personal definitions of quality
    – Compute quality levels and expose them to the user
Example: protein identification
  Diagram labels: data output; protein identification algorithm; “wet lab” experiment; protein hitlist; protein function prediction
  Correct entry = true positive
  Evidence:
    – Mass coverage (MC) measures the amount of protein sequence matched
    – Hit ratio (HR) gives an indication of the signal-to-noise ratio in a mass spectrum
    – ELDP reflects the completeness of the digestion that precedes the peptide mass fingerprinting
  This evidence is independent of the algorithm / software package
  It is readily available and inexpensive to obtain
Correctness of protein identification
  Estimator function (computes a score rather than a probability):
    PMF score = (HR × 100) + MC + (ELDP × 10)
  Prediction performance, comparing 3 models:
    – ROC curve: true positives vs false positives
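The estimator and the ROC comparison on this slide can be sketched in a few lines of Python. This is only an illustration: the class and function names, the example values, and the labelled hit list are all made up, and the slide does not say how the three compared models were built. The ROC helper assumes the labelled data contains at least one correct and one incorrect hit.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class PmfEvidence:
    """Evidence for one protein hit (field names are illustrative)."""
    hit_ratio: float      # HR
    mass_coverage: float  # MC
    eldp: float           # ELDP


def pmf_score(ev: PmfEvidence) -> float:
    """PMF score = (HR x 100) + MC + (ELDP x 10); a score, not a probability."""
    return ev.hit_ratio * 100 + ev.mass_coverage + ev.eldp * 10


def roc_points(scored: Sequence[Tuple[float, bool]],
               thresholds: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) for each threshold.

    `scored` pairs a score with a ground-truth flag (True = the protein is
    really in the sample). A hit is predicted correct when score >= threshold.
    Assumes both classes occur in `scored`.
    """
    positives = sum(1 for _, truth in scored if truth)
    negatives = len(scored) - positives
    points = []
    for t in thresholds:
        tp = sum(1 for s, truth in scored if s >= t and truth)
        fp = sum(1 for s, truth in scored if s >= t and not truth)
        points.append((fp / negatives, tp / positives))
    return points


# Made-up values, for illustration only.
example = PmfEvidence(hit_ratio=0.5, mass_coverage=30.0, eldp=2.0)
labelled = [(100.0, True), (80.0, True), (60.0, False), (40.0, True), (20.0, False)]
curve = roc_points(labelled, thresholds=[90.0, 50.0, 10.0])
```

Lowering the threshold moves along the curve from few accepted hits (few false positives, few true positives) towards accepting everything.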
Quality process components
  Diagram labels: data output; protein identification algorithm; “wet lab” experiment; protein hitlist; quality filtering; protein function prediction
  Evidence: mass coverage (MC), hit ratio (HR), ELDP
  Quality assertion: PMF score = (HR × 100) + MC + (ELDP × 10)
  Goal: to automatically add the additional filtering step in a principled way
Quality Assertions
  QA(D): any function of evidence (metadata for D) that computes equivalence classes on D
    1. Score model (total or partial order)
    2. Classification model
  Quality-equivalent regions (diagram: dataset D partitioned into regions A, B, C)
  Actions associated with regions: e.g. accept/reject, but possibly more
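One way to read the definition of QA(D) is as a function from evidence to a label, with the quality-equivalent regions being the groups of data items that share a label. The sketch below assumes evidence is a plain dictionary per item; the function names and the class cut-offs (100 and 50) are hypothetical and chosen only to make the example concrete.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Mapping

Evidence = Mapping[str, float]
QualityAssertion = Callable[[Evidence], Hashable]


def partition(data: Mapping[str, Evidence],
              qa: QualityAssertion) -> Dict[Hashable, List[str]]:
    """Group data item ids into equivalence classes induced by `qa`."""
    regions: Dict[Hashable, List[str]] = defaultdict(list)
    for item_id, evidence in data.items():
        regions[qa(evidence)].append(item_id)
    return dict(regions)


def score_model(ev: Evidence) -> float:
    """Score model: items are ordered by a numeric score."""
    return ev["HR"] * 100 + ev["MC"] + ev["ELDP"] * 10


def classification_model(ev: Evidence) -> str:
    """Classification model: a small set of named classes.

    The cut-offs are invented for illustration.
    """
    score = score_model(ev)
    if score >= 100:
        return "high"
    if score >= 50:
        return "mid"
    return "low"


# Made-up evidence values for four hits.
hits = {
    "P1": {"HR": 0.75, "MC": 30.0, "ELDP": 1.0},   # score 115
    "P2": {"HR": 0.25, "MC": 20.0, "ELDP": 1.0},   # score 55
    "P3": {"HR": 0.125, "MC": 5.0, "ELDP": 0.0},   # score 17.5
    "P4": {"HR": 0.5, "MC": 40.0, "ELDP": 2.0},    # score 110
}
regions = partition(hits, classification_model)
ranking = sorted(hits, key=lambda item_id: score_model(hits[item_id]), reverse=True)
```

With a score model every distinct score is its own class and the useful structure is the ordering; with a classification model the classes are the regions to which actions are attached.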
Layered definition of Quality
  Layers (diagram):
    – Data sources (DB, Env)
    – Annotation functions, producing quality evidence annotations
    – Quality Assertion functions (QA), capturing custom quality knowledge
    – Quality Views (QV): definition of acceptability regions
  Layer characteristics (diagram labels): long-lived, reusable commodities; expert-defined; dynamic, user controlled
Abstract Quality Views
  An operational definition for personal quality:
  1. Formulate a quality assertion on the dataset
    – e.g. a ranking of proteins by PMF score
    – “Quality knowledge, possibly subjective”
  2. Identify the underlying evidence necessary to compute the assertion
    – The variables used to compute the score (HR, MC, ELDP)
    – Objective, inexpensive
  3. Define annotation functions that compute evidence values
    – Functions that compute HR, MC, ELDP
  4. Define quality regions on the ranked dataset
    – In this case, intervals of acceptability
  5. Associate actions with each region
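The five steps can be lined up in a short Python sketch. Everything concrete here is an assumption made for illustration: the raw field names, the way HR and ELDP are derived from them, the interval boundaries, and the action names are not taken from the slides.

```python
from typing import Callable, Dict, List, Tuple

Hit = Dict[str, float]  # raw fields of one protein hit (hypothetical names)

# Steps 2 and 3: the evidence needed, and annotation functions computing it.
# The formulas below are placeholders, not the authors' definitions.
ANNOTATORS: Dict[str, Callable[[Hit], float]] = {
    "HR": lambda h: h["matched_peaks"] / h["total_peaks"],
    "MC": lambda h: h["mass_coverage_pct"],
    "ELDP": lambda h: h["limit_peptides"] - h["missed_cleavage_peptides"],
}


# Step 1: the quality assertion, here the PMF score used for ranking.
def assertion(evidence: Dict[str, float]) -> float:
    return evidence["HR"] * 100 + evidence["MC"] + evidence["ELDP"] * 10


# Steps 4 and 5: regions as half-open score intervals [low, high), each
# named after the action to take. Boundaries are invented.
REGIONS: List[Tuple[str, float, float]] = [
    ("reject", float("-inf"), 50.0),
    ("review", 50.0, 100.0),
    ("accept", 100.0, float("inf")),
]


def run_quality_view(hits: Dict[str, Hit]) -> Dict[str, List[str]]:
    """Annotate each hit, score it, and file it under its region's action."""
    outcome: Dict[str, List[str]] = {name: [] for name, _, _ in REGIONS}
    for hit_id, hit in hits.items():
        evidence = {name: fn(hit) for name, fn in ANNOTATORS.items()}
        score = assertion(evidence)
        for name, low, high in REGIONS:
            if low <= score < high:
                outcome[name].append(hit_id)
                break
    return outcome


# Made-up hits.
sample_hits = {
    "A": {"matched_peaks": 15, "total_peaks": 20, "mass_coverage_pct": 30.0,
          "limit_peptides": 3, "missed_cleavage_peptides": 1},   # score 125
    "B": {"matched_peaks": 5, "total_peaks": 20, "mass_coverage_pct": 10.0,
          "limit_peptides": 1, "missed_cleavage_peptides": 1},   # score 35
    "C": {"matched_peaks": 10, "total_peaks": 20, "mass_coverage_pct": 20.0,
          "limit_peptides": 1, "missed_cleavage_peptides": 0},   # score 80
}
outcome = run_quality_view(sample_hits)
```

Keeping annotators, assertion, and regions as separate values mirrors the layering on the earlier slide: the first can be reused across views, while the regions are what a user would adjust.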
Computable quality views as commodities
  Cost-effective quality-awareness for data processing:
    – Reuse of high-level definitions of quality views
    – Compilation of abstract quality views into quality components
  Qurator architectural framework:
    – Abstract quality views, binding and compilation, executable quality process
    – Runtime environment
    – Data-specific quality services
Quality hypotheses discovery and testing
  Learning cycle (diagram): quality model definition; compilation of the abstract quality view; execution on test data; performance assessment; quality model
  Targeted compilation: the abstract quality view is compiled into target-specific quality components, each deployed into a quality-enhanced user environment
  Multiple target environments: workflow, query processor
Experimental quality
  Making data processing quality-aware using Quality Views
    – Query, browsing, retrieval, data-intensive workflows
  Discovery and validation: “nuggets of quality knowledge”
  Diagram labels: Quality View; model testing; test datasets; embedding quality views and flow-through testing
Execution model for Quality Views
  Binding and compilation yield an executable component:
    – Sub-flow of an existing workflow
    – Query processing interceptor
  Host workflow: D → D’
  Diagram labels: abstract quality view; QV compiler; embedded quality workflow; quality view on D’; Qurator quality framework; services registry; services implementation
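The binding, compilation, and embedding steps can be mimicked with plain functions, as a rough sketch. This is not how the QV compiler or Taverna work internally: the workflow is reduced to a list of Python callables, and the view format, registry keys, and service names are all hypothetical.

```python
from typing import Any, Callable, Dict, List

Step = Callable[[Any], Any]


def compile_quality_view(view: Dict[str, Any],
                         registry: Dict[str, Callable]) -> Step:
    """Bind the abstract names in `view` to implementations, return one step."""
    # Binding: an unknown service name raises KeyError here, before execution.
    annotators = [registry[name] for name in view["annotators"]]
    assertion = registry[view["assertion"]]
    accepted = set(view["accept"])

    def quality_step(items: List[dict]) -> List[dict]:
        kept = []
        for item in items:
            evidence: Dict[str, Any] = {}
            for annotate in annotators:
                evidence.update(annotate(item))
            if assertion(evidence) in accepted:
                kept.append(item)
        return kept

    return quality_step


def embed(host: List[Step], quality_step: Step, after: int) -> List[Step]:
    """Return a new workflow with the quality step placed after index `after`."""
    return host[:after + 1] + [quality_step] + host[after + 1:]


def run(workflow: List[Step], data: Any) -> Any:
    for step in workflow:
        data = step(data)
    return data


# Hypothetical host workflow: identification (D) then function prediction.
def identify_proteins(sample: Any) -> List[dict]:
    return [{"id": "P1", "coverage": 30}, {"id": "P2", "coverage": 5}]


def predict_functions(hits: List[dict]) -> List[str]:
    return [hit["id"] for hit in hits]


registry = {
    "coverage_annotator": lambda item: {"Coverage": item["coverage"]},
    "coverage_classifier": lambda ev: "high" if ev["Coverage"] > 12 else "low",
}
view = {"annotators": ["coverage_annotator"],
        "assertion": "coverage_classifier",
        "accept": ["high"]}

host = [identify_proteins, predict_functions]
quality_aware = embed(host, compile_quality_view(view, registry), after=0)
```

The host workflow is left untouched; the quality-aware variant is a copy with one extra step that turns D into the filtered D’ before the downstream step sees it.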
Example: original proteomics workflow
  Diagram labels: Taverna workflow; quality flow embedding point
Example: embedded quality workflow
Interactive conditions / actions
Generic quality process pattern
  Collect evidence
    – Fetch persistent annotations
    – Compute on-the-fly annotations
    – Persistent evidence (XML fragment): <variables <var variableName="Coverage" evidence="q:Coverage"/> <var variableName="PeptidesCount" evidence="q:PeptidesCount"/>
  Compute assertions
    – Classifier (XML fragment): <QualityAssertion serviceName="PIScoreClassifier" serviceType="q:PIScoreClassifier" tagSemType="q:PIScoreClassification" tagName="ScoreClass"
  Evaluate conditions
    – ScoreClass in {"q:high", "q:mid"} and Coverage > 12
  Execute actions
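The four stages of the pattern can be traced in a small Python sketch. The condition is the one shown on the slide; everything else is assumed: the annotation store layout, the on-the-fly functions, and in particular the body of the classifier, which is a stand-in with invented cut-offs rather than the real PIScoreClassifier service.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# (data item id, evidence name) -> stored value; layout is hypothetical.
AnnotationStore = Dict[Tuple[str, str], float]


def collect_evidence(item_id: str, needed: Iterable[str],
                     store: AnnotationStore,
                     compute: Dict[str, Callable[[str], float]]) -> Dict[str, float]:
    """Fetch persistent annotations, computing any missing ones on the fly."""
    evidence = {}
    for name in needed:
        key = (item_id, name)
        evidence[name] = store[key] if key in store else compute[name](item_id)
    return evidence


def score_class(evidence: Dict[str, float]) -> str:
    """Stand-in for the classifier; the cut-offs are invented."""
    count = evidence["PeptidesCount"]
    if count >= 10:
        return "q:high"
    if count >= 5:
        return "q:mid"
    return "q:low"


def condition(tags: Dict[str, str], evidence: Dict[str, float]) -> bool:
    """ScoreClass in {"q:high", "q:mid"} and Coverage > 12."""
    return tags["ScoreClass"] in {"q:high", "q:mid"} and evidence["Coverage"] > 12


def run_quality_process(item_ids: Iterable[str], store: AnnotationStore,
                        compute: Dict[str, Callable[[str], float]]) -> Dict[str, List[str]]:
    result: Dict[str, List[str]] = {"accept": [], "reject": []}
    for item_id in item_ids:
        evidence = collect_evidence(item_id, ["Coverage", "PeptidesCount"], store, compute)
        tags = {"ScoreClass": score_class(evidence)}           # compute assertions
        action = "accept" if condition(tags, evidence) else "reject"
        result[action].append(item_id)                          # execute actions
    return result


# Made-up data: P1 and P2 have persistent annotations, P3 is computed on the fly.
store = {("P1", "Coverage"): 30.0, ("P1", "PeptidesCount"): 12,
         ("P2", "Coverage"): 10.0, ("P2", "PeptidesCount"): 7}
raw = {"P3": {"Coverage": 20.0, "PeptidesCount": 6}}
compute = {"Coverage": lambda i: raw[i]["Coverage"],
           "PeptidesCount": lambda i: raw[i]["PeptidesCount"]}
result = run_quality_process(["P1", "P2", "P3"], store, compute)
```

P2 is rejected even though its class is acceptable, because the condition combines the assertion tag with a raw evidence value.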
Reference (semantic) model
  Layers (diagram):
    – Data sources (DB, Env)
    – Annotation functions, producing quality evidence annotations
    – Quality Assertion functions (QA), capturing custom quality knowledge
    – Quality Views (QV): definition of acceptability regions
  Common Semantic Model (IQ Ontology)
A semantic model for quality concepts
  Quality “upper ontology” (OWL)
  Quality evidence types
  Evidence meta-data model (RDF)
  Evidence annotations are class instances
Main taxonomies and properties
  Properties:
    – assertion-based-on-evidence: QualityAssertion → QualityEvidence
    – is-evidence-for: QualityEvidence → DataEntity
  Class restrictions:
    – MassCoverage: is-evidence-for ImprintHitEntry
    – PIScoreClassifier: assertion-based-on-evidence HitScore
    – PIScoreClassifier: assertion-based-on-evidence MassCoverage
The ontology-driven user interface
  Detecting inconsistencies:
    – No annotators for this Evidence type
    – Unsatisfied input requirements for Quality Assertion
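The kind of check behind the "unsatisfied input requirements" message can be illustrated without an ontology reasoner. The sketch below copies the two assertion-based-on-evidence restrictions for PIScoreClassifier from the previous slide into a dictionary and compares them with what a set of annotators provides. The annotator names are hypothetical, and the real user interface presumably works on the OWL model rather than on dictionaries.

```python
from typing import Dict, List, Set, Tuple

# assertion-based-on-evidence restrictions, copied from the slide.
ASSERTION_BASED_ON_EVIDENCE: Dict[str, Set[str]] = {
    "PIScoreClassifier": {"HitScore", "MassCoverage"},
}


def check_quality_view(assertions: List[str],
                       annotators: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Return (assertion, evidence type) pairs that no annotator provides.

    `annotators` maps an annotator name to the evidence types it produces.
    An empty result means every assertion has its inputs satisfied.
    """
    provided: Set[str] = set().union(*annotators.values())
    problems = []
    for assertion in assertions:
        missing = ASSERTION_BASED_ON_EVIDENCE[assertion] - provided
        for evidence_type in sorted(missing):
            problems.append((assertion, evidence_type))
    return problems


# Hypothetical annotator that only produces HitScore.
partial = check_quality_view(["PIScoreClassifier"],
                             {"HitScoreAnnotator": {"HitScore"}})
complete = check_quality_view(["PIScoreClassifier"],
                              {"HitScoreAnnotator": {"HitScore"},
                               "CoverageAnnotator": {"MassCoverage"}})
```

Running the check when a view is being composed, rather than when it executes, is what lets the interface flag the problem before any data is processed.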
Qurator architecture
Quality-aware query processing
Research issues
  Quality modelling:
  Provenance as evidence
    – Can data/process provenance be turned into evidence?
  Experimental elicitation of new Quality Assertions
    – Seeking new collaborations with biologists!
  Classification with uncertainty
    – Data elements belong to a quality class with some probability
  Computing Quality Assertions with limited evidence
    – Evidence may be expensive and sometimes unavailable
    – Robust classification / score models
  Architecture:
  Metadata management model
    – Quality Evidence is a type of metadata with known features…
Summary
  For complex data types, there is often no single “correct” and agreed-upon definition of quality of data
  Qurator provides an environment for fast prototyping of quality hypotheses
    – Based on the notion of “evidence” supporting a quality hypothesis
    – With support for an incremental learning cycle
  Quality views offer an abstract model for making data processing environments quality-aware
    – To be compiled into executable components and embedded
    – Qurator provides an invocation framework for Quality Views
  Publications:
  Qurator is registered with OMII-UK