Some Interesting Issues in Constructing Gene/Protein Networks
Limsoon Wong, Institute for Infocomm Research, Singapore
Copyright © 2005 by Limsoon Wong


Issues
Sound:
– Are the contents of our databases correct?
– Trying our hands at a data cleansing problem
Complete:
– Is the structure of our databases expressive enough to capture critical information explicitly?
Understandable:
– Are our databases or search results understandable?
Other issues relating to NLP/IE

Soundness: Are the contents of our databases correct?
This part is based on the work of Judice Koh and Vladimir Brusic.

Categories of errors found

Example: Spelling Errors
– Usually typos
– Occur in different fields of the record
– We identified 569 possible misspelled words affecting up to 20,505 nucleotide records in Entrez.

Misspelling: asociated. Correction: associated.
Context: EMBL:Y18050, E.faecium pbp5 gene. TITLE Modification of penicillin-binding protein 5 asociated with high level ampicillin resistance in Enterococcus faecium. gi| |emb|X |EFPBP5G

Misspelling: tranmembrane. Correction: transmembrane.
Context: Swiss-Prot:P03385, Env polyprotein precursor. DEFINITION Env polyprotein precursor [Contains: Surface protein (SU) (GP70); Tranmembrane protein (TM) (p15E); R protein]. gi|119478|sp|P03385|ENV_MLVMO

Misspelling: Cassete. Correction: Cassette.
Context: Patent Database:A76783, Sequence 11 from Patent WO. CDS < /note="gene cassete encoding intercalating jun-zipper and linker". gi| |emb|A ||pat|WO| |11[ ]

Misspelling: Immunoglobin. Correction: Immunoglobulin.
Context: GenBank:AAD26534, nectin-1 [Rattus norvegicus]. TITLE Nectin/PRR: An Immunogloblin-like Cell Adhesion Molecule Recruited to Cadherin-based Adherens Junctions through Interaction with Afadin, a PDZ Domain-containing Protein. gi| |gb|AAD

[Figure: error taxonomy by level (attribute, record, single-source database, multiple-source database). Categories shown: invalid values, ambiguity, incompatible schema, spelling errors, format violation, annotation error, dubious sequences, sequence redundancy, data provenance flaws, cross-annotation error, sequence structure violation, vector-contaminated sequence, erroneous data transformation.]
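The kind of spelling check illustrated on this slide can be sketched as a dictionary lookup with approximate matching. This is only a minimal illustration, not the method used in the work the slide reports: the vocabulary, the function name `suggest_corrections`, and the 0.85 similarity cutoff are all assumptions made for the example.

```python
import difflib
import re

# Hypothetical vocabulary of correctly spelled terms; a real check would use
# a much larger domain dictionary.
VOCABULARY = ["associated", "transmembrane", "cassette", "immunoglobulin",
              "protein", "gene"]


def suggest_corrections(text, vocabulary=VOCABULARY, cutoff=0.85):
    """Map each unknown word in `text` to its closest vocabulary entry.

    Words already in the vocabulary are skipped. Words with no entry at or
    above the similarity cutoff are left out of the result.
    """
    known = set(vocabulary)
    suggestions = {}
    for token in re.findall(r"[A-Za-z]+", text):
        word = token.lower()
        if word in known:
            continue
        matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        if matches:
            suggestions[word] = matches[0]
    return suggestions


if __name__ == "__main__":
    title = "protein 5 asociated with high level ampicillin resistance"
    print(suggest_corrections(title))
```

A high cutoff keeps ordinary words that are simply missing from the small vocabulary from being flagged. Any suggestion would still need human review before a record is changed.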

Example: Overlapping Intron/Exon Errors
– Syn7 gene of putative polyketide synthase in NCBI TPA record BN has overlapping intron 5 and exon 6.
– rpb7+ RNA polymerase II subunit in GENBANK record AF has overlapping exon 1 and exon 2.

[Figure: error taxonomy by level (attribute, record, single-source database, multiple-source database). Categories shown: invalid values, ambiguity, incompatible schema, overlapping intron/exon, annotation error, dubious sequences, sequence redundancy, data provenance flaws, cross-annotation error, sequence structure violation, vector-contaminated sequence, erroneous data transformation.]
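An overlap of this kind can be detected with a simple interval test on the annotated feature coordinates. The sketch below assumes each feature is a `(name, start, end)` tuple with inclusive coordinates; the function name `find_overlaps` and the coordinates in the example are made up for illustration and do not come from the records named above.

```python
from itertools import combinations


def find_overlaps(features):
    """Return the name pairs of features whose coordinate ranges overlap.

    Each feature is (name, start, end) with inclusive start and end, so two
    features overlap when each one starts no later than the other ends.
    """
    overlaps = []
    for (name_a, start_a, end_a), (name_b, start_b, end_b) in combinations(features, 2):
        if start_a <= end_b and start_b <= end_a:
            overlaps.append((name_a, name_b))
    return overlaps


if __name__ == "__main__":
    # Hypothetical annotation: exon 2 starts before exon 1 ends.
    annotation = [("exon 1", 1, 120), ("exon 2", 100, 250), ("intron 2", 251, 400)]
    print(find_overlaps(annotation))
```

The all-pairs comparison is quadratic in the number of features, which is acceptable for a single gene record.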

Example: Sequences with Identical Information
Replication of sequence information:
– Different views
– Overlapping annotations of the same sequence
– Submission of the same sequence to different databases
– Repeated submission of the same sequence to the same database
– Initially submitted by different groups
– Protein sequences may be translated from duplicate nucleotide sequences
Truncated Entrez link from the slide: =protein&list_uids= &dopt=GenPept

[Figure: error taxonomy by level (attribute, record, single-source database, multiple-source database). Categories shown: invalid values, ambiguity, incompatible schema, annotation error, dubious sequences, sequence redundancy, data provenance flaws, cross-annotation error, sequence structure violation, vector-contaminated sequence, erroneous data transformation.]

Soundness: Trying our hands at a data cleansing problem
This part is based on the work of Judice Koh.

[Figure: errors and the cleansing methods that address them, organised by level (attribute, record, single-source database, multi-source database).
Errors shown: concatenated values, spelling errors, format violation, synonyms, homonyms/abbreviations, misuse of fields, mis-fielded values, features that do not correspond with the sequence, undersized sequences, uninformative sequences, replication of sequence information, different views, overlapping annotations of the same sequence, fragments, sequence structure violation, vector-contaminated sequences, putative features, cross-annotation error.
Methods shown: schema remapping, sequence structure parser, dictionary lookup, integrity constraints, duplicate detection, comparative analysis, vector screening.]

Dataset
Source: Entrez (GenBank, GenPept, SwissProt, DDBJ, PIR, PDB)
– Scorpion venom dataset: query "scorpion AND (venom OR toxin)", 520 records, 251 duplicate pairs
– Snake PLA2 venom dataset: query "serpentes AND venom AND PLA2", 780 records, 444 duplicate pairs
Expert annotation: 695 duplicate pairs are collectively identified.

Results
Rule 1: S(Seq)=1 ^ N(Seq Length)=1 ^ M(PDB)=0 (99.7%)
Identical sequences with the same sequence length that did not originate from PDB are 99.7% likely to be duplicates.
Rule 2: S(Seq)=1 ^ M(PDB)=0 ^ M(Species)=1 (97.1%)
Identical sequences of the same species that did not originate from PDB are 97.1% likely to be duplicates.
Rule 3: S(Seq)=1 ^ N(Seq Length)=1 ^ M(Species)=1 ^ M(PDB)=0 (96.8%)
Identical sequences with the same sequence length, of the same species, that did not originate from PDB are 96.8% likely to be duplicates.
What else do we learn?
– The definitions of the sequence records do not play a role in identifying duplicate records.
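The three rules can be read as conjunctions over features of a record pair. The sketch below shows one way to evaluate them. It is a simplified reading: the `Record` fields, the binary feature definitions, and the interpretation of M(PDB)=0 as "neither record comes from PDB" are assumptions, and the slide does not say how S, N and M were actually computed.

```python
from dataclasses import dataclass


@dataclass
class Record:
    accession: str  # hypothetical identifier
    sequence: str
    species: str
    database: str   # e.g. "GenBank", "SwissProt", "PDB"


def pair_features(a, b):
    """Binary features for a record pair (a simplifying assumption)."""
    return {
        "S(Seq)": int(a.sequence == b.sequence),
        "N(Seq Length)": int(len(a.sequence) == len(b.sequence)),
        "M(Species)": int(a.species.lower() == b.species.lower()),
        # Assumed meaning: 1 if either record originated from PDB.
        "M(PDB)": int("PDB" in (a.database, b.database)),
    }


# Rule conditions and confidences as stated on the slide.
RULES = [
    ("Rule 1", {"S(Seq)": 1, "N(Seq Length)": 1, "M(PDB)": 0}, 0.997),
    ("Rule 2", {"S(Seq)": 1, "M(PDB)": 0, "M(Species)": 1}, 0.971),
    ("Rule 3", {"S(Seq)": 1, "N(Seq Length)": 1, "M(Species)": 1, "M(PDB)": 0}, 0.968),
]


def matching_rules(a, b):
    """Return (rule name, confidence) for every rule the pair satisfies."""
    feats = pair_features(a, b)
    return [(name, conf) for name, cond, conf in RULES
            if all(feats[key] == value for key, value in cond.items())]


if __name__ == "__main__":
    r1 = Record("X1", "MKTLLV", "Example species", "GenBank")
    r2 = Record("X2", "MKTLLV", "Example species", "SwissProt")
    print(matching_rules(r1, r2))
```

With binary features, identical sequences always have equal length, so Rule 2 fires whenever Rule 3 does. The distinction would matter only if S and N were graded similarity scores.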

Completeness: Is the structure of our databases expressive enough to capture critical information explicitly?

Expressive Power
– Take a key paper, such as the Kohn paper that summarises current knowledge on p53 regulation.
– Is there a structured database that is able to capture all the information in that paper explicitly?
– Is there a semi-structured database that is able to capture all the information in that paper explicitly?
– How well does this (semi-)structured database generalize to other papers of a similar type?

Understandability: Are our databases or search results understandable?

Self-Organization
– Take a search on p53. You will get more than 300k hits, or some number like that, on MEDLINE.
– It is not feasible for anyone to go through all of that to find what they want!
– This problem keeps growing, as MEDLINE doubles every 1-2 years.
– We need to organize the database and/or the search results into a hierarchy or "semantic" net to make it easier for users to understand or browse the results.
– How do we define this hierarchy/net?
– Can this hierarchy/net be self-organized?

Problems relating to NLP/IE
This part is mostly based on the work of Chris Tan and See-Kiong Ng.

Handling full-length papers
– Source document structure parsing
– Hyper-linked file tracking
– Figure and table processing
– Special symbol handling

Information retrieval
– Document and sentence retrieval
– Relevant interaction filtering

Bio name recognition
– Nomenclature loosely followed
– Frequent use of conjunction and disjunction in bio names, with multiple bio-entity names sharing one head noun
– Long descriptive names
– Names of genes and proteins used interchangeably

Bio-interaction extraction
– Biological interactions are inherently complex, so the sentences describing them also tend to be complicated.

Domain knowledge is often needed for interaction template filling.

Extraction of other relevant info
– Contextual information: species, cell type, cellular localisation, etc.
– Negative information
– Speculative & incomplete facts

Information integration
– Bio-name mapping
– Bio-interaction mapping: how do you know two complex sentences are talking about the same interaction?
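The simplest form of bio-name mapping can be sketched as normalisation followed by a synonym lookup, with two extracted interactions compared as unordered pairs of canonical names. This is an illustrative baseline only, not a method from the talk: the synonym table, the function names, and the normalisation rule are assumptions, and the sketch does not address the harder question of whether two complex sentences describe the same interaction.

```python
import re

# Hypothetical synonym table: canonical symbol -> known aliases.
SYNONYMS = {
    "TP53": ["p53", "tumor protein p53"],
    "MDM2": ["HDM2", "mdm-2"],
}


def normalize(name):
    """Lower-case a name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


# Normalised alias -> canonical symbol.
LOOKUP = {
    normalize(alias): canonical
    for canonical, aliases in SYNONYMS.items()
    for alias in [canonical] + aliases
}


def canonical_name(name):
    """Return the canonical symbol for a name, or None if it is unknown."""
    return LOOKUP.get(normalize(name))


def same_interaction(pair_a, pair_b):
    """True if both pairs map to the same set of known canonical symbols."""
    ids_a = frozenset(canonical_name(n) for n in pair_a)
    ids_b = frozenset(canonical_name(n) for n in pair_b)
    return None not in ids_a and ids_a == ids_b


if __name__ == "__main__":
    print(same_interaction(("p53", "Mdm-2"), ("HDM2", "TP53")))
```

Returning False when any name is unmapped is a conservative choice: an unknown name gives no evidence that the two mentions refer to the same entities.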

Resource for training & benchmarking
– Is there such a good resource, especially for the more complex tasks?

Acknowledgements
– Data Cleansing: Judice Koh, Vladimir Brusic, Mong Li Lee, Asif M. Khan, Paul T.J. Tan, Heiny Tan, Kenneth Lee, Wilson Goh, Songsak Tongchusak, Kavitha Gopalakrishnan
– NLP/IE Issues: See-Kiong Ng, Chris Tan

[Figure: I2R organisation chart. Communications & Devices, Services & Applications, Media Processing, Human Centric Media, Semantics, Infocomm Security, Context-Aware Systems, Knowledge Discovery, Radio Systems, Networking, Lightwave, Embedded Systems, Digital Wireless.]