Modelling and computing the quality of information in e-science Paolo Missier, Suzanne Embury, Mark Greenwood School of Computer Science University of.

Presentation transcript:

Modelling and computing the quality of information in e-science
Paolo Missier, Suzanne Embury, Mark Greenwood, School of Computer Science, University of Manchester, UK
Alun Preece, Binling Jin, Department of Computing Science, University of Aberdeen, UK
Roma, 3/4/07

Combining the strengths of UMIST and The Victoria University of Manchester
Quality of data
Main driver, historically: data cleaning for
– Integration: use of the same IDs across data sources
– Warehousing, analytics: restore completeness, reconcile referential constraints, cross-validate numeric data by aggregation
Focus: record de-duplication, reconciliation, “linkage”
– Ample literature: see e.g. the Nov 2006 issue of IEEE TKDE
Consistency of data across sources
Managing uncertainty in databases (Trio, Stanford)
Data quality control in data management practice

Common quality issues
Completeness: not missing any of the results
Correctness: each data item should reflect the actual real-world entity that it is intended to model
– The actual address where you live, the correct balance in your bank account…
Timeliness: delivered in time for use by a consumer process
– E.g. stock information
…

Taxonomy for data quality dimensions

Our motivation: quality in public e-science data
GenBank, UniProt, EnsEMBL, Entrez, dbSNP
Large volumes of data in many public repositories
Increasingly creative uses for this data
Problem: using third-party data of unknown quality may result in misleading scientific conclusions

Some quality issues in biology
“Quality” covers a broader spectrum of issues than traditional DQ
– “X% of database A may be wrong (unreliable), but I have no easy way to test that”
– “This microarray data looks OK but is testing the wrong hypothesis”
– “This sequence matching algorithm produces false positives”
…
Each of these issues calls for a separate testing procedure
Difficult to generalize

Correctness in biology: examples
Data type | Creation process | Correctness
UniProt protein annotation | Manual curation | Functional annotation f for p is correct if function f can reliably be attributed to p
Qualitative proteomics: protein identification | Generate peptide peak lists, match peak lists (e.g. Imprint) | No false positives: every protein in the output is actually present in the cell sample
Transcriptomics: gene expression report (up/down-regulation) | Microarray data analysis | No false positives, no false negatives

Defining quality in e-science is challenging
In-silico experiments express cutting-edge research
– Experimental data liable to change rapidly
– Definitions of quality are themselves experimental
Scientists’ quality requirements are often just a hunch
– Quality tests missing or based on experimental heuristics
– Definitions of quality criteria are personal and subjective
Quality controls tightly coupled to data processing
– Often implicit and embedded in the experiment
– Not reusable
“Quality”: personal criteria for data acceptability

Research goals
1. Make personal definitions of quality explicit and formal
– Identify a common denominator for quality concepts
– Expressed as a conceptual model for Information Quality
– Elicit “nuggets” of latent quality knowledge from the experts
2. Make existing data processing quality-aware
– Define an architectural framework that accommodates personal definitions of quality
– Compute quality levels and expose them to the user

Example: protein identification
Pipeline (diagram): “wet lab” experiment → protein identification algorithm → data output (protein hitlist) → protein function prediction
Correct entry: a true positive
Evidence:
– Mass coverage (MC) measures the amount of protein sequence matched
– Hit ratio (HR) gives an indication of the signal-to-noise ratio in a mass spectrum
– ELDP reflects the completeness of the digestion that precedes the peptide mass fingerprinting
This evidence is independent of the algorithm / software package
It is readily available and inexpensive to obtain

Correctness of protein identification
Estimator function (computes a score rather than a probability):
PMF score = (HR x 100) + MC + (ELDP x 10)
Prediction performance, comparing 3 models:
ROC curve: true positives vs false positives
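The estimator on this slide is simple enough to sketch directly. The Python below is an illustration only: the `Evidence` field names, the threshold rule (a hit is predicted correct when its score is at least the threshold) and the `roc_points` helper are assumptions added here, and the slide does not say how its ROC curves were actually produced.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Evidence:
    """Evidence values for one protein hit (field names are hypothetical)."""
    hit_ratio: float      # HR
    mass_coverage: float  # MC
    eldp: float           # ELDP


def pmf_score(e: Evidence) -> float:
    """PMF score = (HR x 100) + MC + (ELDP x 10); a score, not a probability."""
    return e.hit_ratio * 100 + e.mass_coverage + e.eldp * 10


def roc_points(scored: Sequence[Tuple[float, bool]],
               thresholds: Sequence[float]) -> List[Tuple[float, float]]:
    """Return one (false positive rate, true positive rate) pair per threshold.

    `scored` pairs each score with whether the hit is a known true positive.
    Assumption: a hit is predicted correct when its score is >= the threshold.
    """
    positives = sum(1 for _, is_true in scored if is_true)
    negatives = len(scored) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one true and one false hit")
    points = []
    for t in thresholds:
        tp = sum(1 for s, is_true in scored if s >= t and is_true)
        fp = sum(1 for s, is_true in scored if s >= t and not is_true)
        points.append((fp / negatives, tp / positives))
    return points
```

Sweeping the threshold over a labelled test set gives the points of a ROC curve, which is one way to compare candidate score models against each other.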

Quality process components
Pipeline (diagram): “wet lab” experiment → protein identification algorithm → data output (protein hitlist) → protein function prediction
Goal: to automatically add the additional filtering step in a principled way
Quality filtering
Quality assertion: PMF score = (HR x 100) + MC + (ELDP x 10)
Evidence: mass coverage (MC), hit ratio (HR), ELDP
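One way to picture the additional filtering step is as a wrapper around the identification step, so that downstream function prediction only sees hits that pass the quality assertion. This is a minimal sketch under assumptions: `add_quality_filter`, the dictionary layout of a hit and the threshold are all hypothetical, and the slides do not prescribe this mechanism.

```python
from typing import Any, Callable, Dict, List

Hit = Dict[str, Any]  # one hitlist entry carrying its evidence values


def add_quality_filter(identify: Callable[[str], List[Hit]],
                       assertion: Callable[[Hit], float],
                       min_score: float) -> Callable[[str], List[Hit]]:
    """Wrap an identification step so its hitlist is filtered by a quality score.

    The wrapped step has the same signature as the original one, so the rest
    of the pipeline does not need to change.
    """
    def filtered(sample: str) -> List[Hit]:
        return [hit for hit in identify(sample) if assertion(hit) >= min_score]
    return filtered


def pmf_assertion(hit: Hit) -> float:
    """Quality assertion from the slide: (HR x 100) + MC + (ELDP x 10)."""
    return hit["HR"] * 100 + hit["MC"] + hit["ELDP"] * 10
```

Because the wrapper keeps the original step's signature, the filter can be added or removed without touching the identification algorithm itself.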

Quality Assertions
QA(D): any function of evidence (metadata for D) that computes equivalence classes on D
1. Score model (total or partial order)
2. Classification model: quality-equivalent regions (figure labels: D, B, A, C)
Actions associated with regions: e.g. accept/reject, but possibly more
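The idea that a quality assertion induces equivalence classes can be sketched as grouping data items by the value the assertion returns on their evidence. The class names, score cut-offs and actions below are invented for illustration and do not come from the slides.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Mapping, TypeVar

E = TypeVar("E")  # type of the evidence attached to one data item


def quality_classes(evidence: Mapping[str, E],
                    qa: Callable[[E], Hashable]) -> Dict[Hashable, List[str]]:
    """Partition item ids into classes: two items share a class iff qa agrees."""
    classes: Dict[Hashable, List[str]] = defaultdict(list)
    for item_id, ev in evidence.items():
        classes[qa(ev)].append(item_id)
    return dict(classes)


def classify_by_score(score: float) -> str:
    """Hypothetical classification model layered on a score model."""
    if score >= 80:
        return "high"
    if score >= 40:
        return "mid"
    return "low"


# Hypothetical actions attached to each quality-equivalent region.
ACTIONS = {"high": "accept", "mid": "review", "low": "reject"}
```

With a score model, `qa` can simply be the identity on scores, and the ordering of the scores then gives the ranking; the classifier above coarsens that ranking into a few regions.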

Layered definition of Quality
– Data sources (DB, Env) and annotation functions: quality evidence annotations
– Quality Assertion functions (QA): custom quality knowledge
– Quality Views (QV): definition of acceptability regions
Layer labels in the diagram: long-lived, reusable; commodities; expert-defined; dynamic; user controlled

Abstract Quality Views
An operational definition for personal quality:
1. Formulate a quality assertion on the dataset
– i.e. a ranking of proteins by PMF score
– “Quality knowledge, possibly subjective”
2. Identify the underlying evidence necessary to compute the assertion
– The variables used to compute the score (HR, MC, ELDP)
– Objective, inexpensive
3. Define annotation functions that compute evidence values
– Functions that compute HR, MC, ELDP
4. Define quality regions on the ranked dataset
– In this case, intervals of acceptability
5. Associate actions with each region
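The five steps can be read as the fields of a small declarative object. The sketch below is an assumption-laden illustration, not Qurator's actual representation: the class name, field layout, region bounds and action names are hypothetical, and the annotation functions simply read stored values where real ones would compute HR, MC and ELDP.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class AbstractQualityView:
    # Steps 2 and 3: evidence name -> annotation function computing its value.
    annotators: Dict[str, Callable[[dict], float]]
    # Step 1: quality assertion, a function of the evidence values.
    assertion: Callable[[Dict[str, float]], float]
    # Step 4: regions as (lower bound, name), sorted by descending lower bound.
    regions: List[Tuple[float, str]]
    # Step 5: region name -> action.
    actions: Dict[str, str]

    def apply(self, item: dict) -> Tuple[str, str]:
        """Return the (region, action) assigned to one data item."""
        evidence = {name: fn(item) for name, fn in self.annotators.items()}
        score = self.assertion(evidence)
        for lower, region in self.regions:
            if score >= lower:
                return region, self.actions[region]
        raise ValueError("score falls below every region")


# Example view for the protein identification case (bounds are invented).
pmf_view = AbstractQualityView(
    annotators={name: (lambda hit, n=name: hit[n]) for name in ("HR", "MC", "ELDP")},
    assertion=lambda ev: ev["HR"] * 100 + ev["MC"] + ev["ELDP"] * 10,
    regions=[(50.0, "acceptable"), (float("-inf"), "unacceptable")],
    actions={"acceptable": "keep", "unacceptable": "discard"},
)
```

Keeping the regions and actions separate from the assertion means a user can move the acceptability bounds without touching the expert-defined scoring function.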

Computable quality views as commodities
Cost-effective quality-awareness for data processing:
– Reuse of high-level definitions of quality views
– Compilation of abstract quality views into quality components
Abstract quality views → binding and compilation → executable quality process
Qurator architectural framework:
- runtime environment
- data-specific quality services

Quality hypotheses discovery and testing
Discovery and testing (diagram): quality model definition, abstract quality view, compilation, execution on test data, performance assessment, quality model
Deployment (diagram): targeted compilation, target-specific quality components, deployment, quality-enhanced user environments
Multiple target environments: workflow, query processor

Experimental quality
Making data processing quality-aware using Quality Views
– Query, browsing, retrieval, data-intensive workflows
Discovery and validation: “nuggets of quality knowledge”
– Quality View, model testing, test datasets
Embedding quality views and flow-through testing

Execution model for Quality views
Binding → compilation → executable component
– Sub-flow of an existing workflow
– Query processing interceptor
Diagram components: host workflow (D → D’); abstract Quality view; QV compiler; embedded quality workflow; Quality view on D’; Qurator quality framework (services registry, services implementation)
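Binding and compilation can be pictured as resolving the names in an abstract view against a services registry and returning a callable that is chained after the host workflow. The sketch below assumes a dictionary-based registry and view; `compile_view`, `embed` and the data layout are hypothetical stand-ins, not the Qurator compiler or its API.

```python
from typing import Any, Callable, Dict, List

Registry = Dict[str, Callable[..., Any]]
Item = Dict[str, Any]


def compile_view(view: Dict[str, Any],
                 registry: Registry) -> Callable[[List[Item]], List[Item]]:
    """Bind the service names in an abstract view and return an executable step."""
    names = view["annotators"] + [view["assertion"]]
    missing = [n for n in names if n not in registry]
    if missing:
        raise KeyError(f"unbound services: {missing}")
    annotators = [registry[n] for n in view["annotators"]]
    assertion = registry[view["assertion"]]

    def component(items: List[Item]) -> List[Item]:
        out = []
        for item in items:
            evidence: Dict[str, Any] = {}
            for annotate in annotators:
                evidence.update(annotate(item))  # each annotator returns a dict
            out.append({**item, "quality": assertion(evidence)})
        return out

    return component


def embed(host_step: Callable[[Any], List[Item]],
          component: Callable[[List[Item]], List[Item]]) -> Callable[[Any], List[Item]]:
    """Chain the quality component after a host workflow step (D -> D')."""
    return lambda d: component(host_step(d))
```

Failing at compile time when a name cannot be bound keeps an incomplete view from being embedded in a running workflow.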

Example: original proteomics workflow
[Figure: Taverna workflow, with the quality flow embedding point marked]

Example: embedded quality workflow

Interactive conditions / actions

Generic quality process pattern
Collect evidence
- Fetch persistent annotations (persistent evidence)
- Compute on-the-fly annotations
  <variables>
    <var variableName="Coverage" evidence="q:Coverage"/>
    <var variableName="PeptidesCount" evidence="q:PeptidesCount"/>
Compute assertions (classifier)
  <QualityAssertion serviceName="PIScoreClassifier" serviceType="q:PIScoreClassifier" tagSemType="q:PIScoreClassification" tagName="ScoreClass"
Evaluate conditions
  ScoreClass in {"q:high", "q:mid"} and Coverage > 12
Execute actions
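The four stages of the pattern can be sketched as a short loop. The variable names `Coverage` and `PeptidesCount`, the tag `ScoreClass` and the condition are taken from the slide; everything else (function names, the persistent store as a dictionary, the fallback rule and the toy classifier) is an assumption made for illustration, since the slide does not show how PIScoreClassifier works.

```python
from typing import Any, Callable, Dict, List, Tuple

Store = Dict[Tuple[str, str], Any]           # (item id, evidence name) -> value
Annotators = Dict[str, Callable[[str], Any]]  # evidence name -> on-the-fly function


def collect_evidence(item_id: str, names: List[str],
                     persistent: Store, on_the_fly: Annotators) -> Dict[str, Any]:
    """Fetch persistent annotations first; compute the rest on the fly."""
    evidence = {}
    for name in names:
        stored = persistent.get((item_id, name))
        evidence[name] = stored if stored is not None else on_the_fly[name](item_id)
    return evidence


def condition(tags: Dict[str, str], evidence: Dict[str, Any]) -> bool:
    """The condition shown on the slide."""
    return tags["ScoreClass"] in {"q:high", "q:mid"} and evidence["Coverage"] > 12


def run_quality_process(item_ids: List[str], persistent: Store,
                        on_the_fly: Annotators,
                        classifier: Callable[[Dict[str, Any]], str]) -> List[str]:
    """Collect evidence, compute the assertion tag, evaluate, then act (accept)."""
    accepted = []
    for item_id in item_ids:
        evidence = collect_evidence(item_id, ["Coverage", "PeptidesCount"],
                                    persistent, on_the_fly)
        tags = {"ScoreClass": classifier(evidence)}
        if condition(tags, evidence):
            accepted.append(item_id)  # the only action in this sketch
    return accepted


def toy_classifier(evidence: Dict[str, Any]) -> str:
    """Invented stand-in for PIScoreClassifier; the cut-offs are arbitrary."""
    count = evidence["PeptidesCount"]
    return "q:high" if count >= 5 else "q:mid" if count >= 2 else "q:low"
```

The assertion has to run before the condition is evaluated, because the condition refers to the `ScoreClass` tag that the assertion produces.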

Reference (semantic) model
– Data sources (DB, Env) and annotation functions: quality evidence annotations
– Quality Assertion functions (QA): custom quality knowledge
– Quality Views (QV): definition of acceptability regions
Common Semantic Model (IQ Ontology)

A semantic model for quality concepts
Quality “upper ontology” (OWL)
Evidence meta-data model (RDF)
Quality evidence types
Evidence annotations are class instances

Main taxonomies and properties
Properties:
– assertion-based-on-evidence: QualityAssertion → QualityEvidence
– is-evidence-for: QualityEvidence → DataEntity
Class restrictions:
– MassCoverage: is-evidence-for restricted to ImprintHitEntry
– PIScoreClassifier: assertion-based-on-evidence restricted to HitScore
– PIScoreClassifier: assertion-based-on-evidence restricted to MassCoverage

The ontology-driven user interface
Detecting inconsistencies:
– No annotators for this Evidence type
– Unsatisfied input requirements for Quality Assertion
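The two inconsistencies named on this slide can be approximated with plain set checks. Note the swap: the slide describes an ontology-driven interface, whereas this sketch uses ordinary dictionaries in place of the ontology, and the function name, argument layout and message wording are all hypothetical.

```python
from typing import Dict, Iterable, List, Set


def find_inconsistencies(view_evidence: Set[str],
                         annotators: Dict[str, List[str]],
                         assertion_inputs: Dict[str, Iterable[str]]) -> List[str]:
    """Report evidence types with no annotator and assertions missing inputs.

    view_evidence: evidence types declared in the quality view.
    annotators: evidence type -> names of annotation functions that produce it.
    assertion_inputs: quality assertion -> evidence types it is based on.
    """
    problems = []
    for ev in sorted(view_evidence):
        if not annotators.get(ev):
            problems.append(f"no annotators for evidence type {ev}")
    for qa, required in sorted(assertion_inputs.items()):
        missing = sorted(set(required) - view_evidence)
        if missing:
            problems.append(f"unsatisfied inputs for {qa}: {', '.join(missing)}")
    return problems
```

Running such a check while the view is being edited lets the interface flag a problem before the view is compiled.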

Qurator architecture

Quality-aware query processing

Research issues
Quality modelling:
Provenance as evidence
– Can data/process provenance be turned into evidence?
Experimental elicitation of new Quality Assertions
– Seeking new collaborations with biologists!
Classification with uncertainty
– Data elements belong to a quality class with some probability
Computing Quality Assertions with limited evidence
– Evidence may be expensive and sometimes unavailable
– Robust classification / score models
Architecture:
Metadata management model
– Quality Evidence is a type of metadata with known features…

Summary
For complex data types, there is often no single “correct” and agreed-upon definition of data quality
Qurator provides an environment for fast prototyping of quality hypotheses
– Based on the notion of “evidence” supporting a quality hypothesis
– With support for an incremental learning cycle
Quality views offer an abstract model for making data processing environments quality-aware
– To be compiled into executable components and embedded
– Qurator provides an invocation framework for Quality Views
Publications:
Qurator is registered with OMII-UK