Machine-learning in building bioinformatics databases for infectious diseases Victor Tong Institute for Infocomm Research A*STAR, Singapore ASEAN-China.

Presentation transcript:

Machine-learning in building bioinformatics databases for infectious diseases Victor Tong Institute for Infocomm Research A*STAR, Singapore ASEAN-China International Bioinformatics Workshop Apr 2008

Overview Definitions and background Architectures of existing immunological databases Machine-learning for biological databases Conclusion

The information-centric world. Biology produces more data than we can process: >3000 HLA alleles; different T-cell receptors; linear 9mer epitopes; post-translational spliced epitopes. Data are stored in databases, literature, laboratory records, clinical records, … A major issue: turning data into knowledge.

Use of bioinformatics. Manual curation is impractical: ≥ 16 million PubMed abstracts; ~80K immunology-related references. Large amounts of data that are difficult to interpret. Protein-protein interaction extraction from text. Bioinformatics: systematic construction and updating of databases.

Ad hoc bioinformatics: biological system → computational analysis → biological interpretation.

More systematic use of bioinformatics: the same elements (biological system, computational analysis, biological interpretation) are supplemented by a formal description, a mathematical problem, and conversion of results.

Knowledge discovery from databases is the process of automated extraction of useful information or knowledge from individual or multiple databases

1) Data explosion. Current databases: volume of data increasing exponentially (GenBank, SWISS-PROT, IMGT, PubMed, etc.). New databases: growth in numbers; increase in size; more complex. Biologists: maintain personal data banks of information relevant to their research; define objectives for data mining and analysis.

2) Data quality. Nature of biological data: fuzzy and complex; varying interpretations. Problems with raw data: inconsistent, inaccurate, redundant, irrelevant, incomplete, incorrect. Data cleaning: set a limit on the percentage error that can be tolerated in the data; prevent propagation of errors to our databases; prevent deterioration of data quality.

3) Database creation and maintenance. Software tools and programming efforts: data collection; constructing databases; integrating data mining tools; updating the databases. Nature of the databases: short lifespan; hard to maintain.

4) Data integration. Disparities in data sources: data structures, data formats, views, search mechanisms, location.

Overview Definitions and background Architectures of existing immunological databases Machine-learning for biological databases Conclusion

Web resources for immune epitope information
Immune Epitope Database and Analysis Resource (IEDB): contains B-cell epitopes, T-cell epitopes and MHC ligands for humans, non-human primates, rodents, and other animal species. URL:
The international ImMunoGeneTics information system (IMGT): specializes in Ig, T-cell receptors, MHC, Ig superfamily, MHC superfamily, and related proteins of the immune system of human and other vertebrate species. URL:
SYFPEITHI: contains ~3,500 T-cell epitopes, MHC ligands and peptide motifs for humans and rodents. URL:

Web resources for immune epitope information
MHCBN: contains T-cell epitopes, TAP ligands, MHC binding peptides and MHC non-binding peptides for humans and rodents. URL:
MPID-T: contains 3D structural information of 187 T-cell receptors, MHCs and interacting epitopes for humans and rodents, spanning 40 alleles. URL:
AntiJen/JenPep: contains T-cell epitopes, MHC ligands, TAP ligands and B-cell epitopes. URL:

The IEDB class diagram

Relationships between an epitope & contexts

Overview Definitions and background Architectures of existing immunological databases Machine-learning for biological databases Conclusion

Naïve Bayes classifiers. Assumption: attribute values are conditionally independent given the target value. Goal: to assign a new instance the most probable target value $v_{target}$, chosen among the candidate values $v_j$, given its set of attribute values $a_1, \dots, a_n$. The target class may be defined as: $v_{target} = \arg\max_{v_j} P(v_j) \prod_{i} P(a_i \mid v_j)$
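The decision rule above can be sketched in a few lines of Python. This is only an illustration, not the implementation used in the talk: it assumes a bag-of-words (multinomial) model with Laplace smoothing, works in log space to avoid underflow, and the names `train_naive_bayes`, `classify` and the toy abstracts are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents, labels):
    """Collect the counts needed to estimate P(v_j) and P(a_i | v_j)."""
    class_counts = Counter(labels)        # documents per class
    word_counts = defaultdict(Counter)    # word occurrences per class
    vocabulary = set()
    for tokens, label in zip(documents, labels):
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def classify(tokens, model):
    """Return argmax over v_j of log P(v_j) + sum_i log P(a_i | v_j)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)            # log prior
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token not in vocabulary:
                continue                                 # ignore unseen words
            # Laplace (add-one) smoothing: an assumption of this sketch
            likelihood = (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy, made-up training data (not from the talk)
toy_docs = [
    ["mhc", "binding", "peptide"],
    ["epitope", "t-cell", "binding"],
    ["stock", "market", "price"],
    ["market", "trade"],
]
toy_labels = ["immunology", "immunology", "other", "other"]
model = train_naive_bayes(toy_docs, toy_labels)
print(classify(["peptide", "binding"], model))
```

Summing logarithms instead of multiplying probabilities gives the same argmax while staying numerically stable for long abstracts.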

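Comparison of popular text classification algorithms. Dataset: 20,910 PubMed abstracts; 181,299 unique words. AROC (area under the ROC curve) is reported for the naïve Bayes classifier (NBC), artificial neural network (ANN), support vector machine (SVM) and decision tree (DT). Wang et al., BMC Bioinformatics 2007, 8:269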
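For readers unfamiliar with the metric, the sketch below shows one common way to compute it, assuming AROC denotes the area under the ROC curve. It uses the rank-based definition: the probability that a randomly chosen relevant abstract is scored above a randomly chosen irrelevant one, with ties counted as one half. The function name `aroc` and the example scores are hypothetical.

```python
def aroc(scores, labels):
    """Area under the ROC curve from classifier scores and 0/1 labels."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0    # positive ranked above negative
            elif p == n:
                wins += 0.5    # ties count as half
    return wins / (len(positives) * len(negatives))


# Made-up scores: one relevant abstract (0.2) is ranked below an irrelevant one (0.8)
print(aroc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))
```

The pairwise loop is quadratic and is meant for clarity only; a sort-based computation would be preferable on a corpus of tens of thousands of abstracts.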

Feature selection (FS). Data source: PubMed abstracts; Medical Subject Headings (MeSH), the National Library of Medicine's controlled vocabulary used for indexing articles and for cataloging books and other holdings; publication title; author(s); etc.

Feature selection (FS). Algorithms: Document frequency (DF) ranks features by the number of abstracts they appear in. Information gain (IG) measures the number of bits of information obtained for category prediction from the presence or absence of a feature in a document: $IG(u) = -\sum_{i=1}^{m} P(c_i) \log P(c_i) + P(u) \sum_{i=1}^{m} P(c_i \mid u) \log P(c_i \mid u) + P(\bar{u}) \sum_{i=1}^{m} P(c_i \mid \bar{u}) \log P(c_i \mid \bar{u})$, where $u$ is the feature of interest, $\bar{u}$ denotes its absence, and $c_i$ ($i = 1, \dots, m$) are the categories the documents belong to.
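Both criteria can be computed directly from tokenised abstracts. The following is a minimal sketch, assuming base-2 logarithms (so that IG is in bits), probabilities estimated as simple relative frequencies, and the complement term weighted by P(ū), the fraction of documents lacking the feature. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def document_frequency(documents):
    """Number of documents each feature appears in."""
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))    # count each feature once per document
    return df


def _sum_p_log_p(category_counts, total):
    """Sum over categories of P(c) * log2 P(c); 0.0 for an empty subset."""
    if total == 0:
        return 0.0
    return sum((c / total) * math.log2(c / total)
               for c in category_counts.values() if c > 0)


def information_gain(feature, documents, labels):
    """IG(u) = -sum P(c) log P(c) + P(u) sum P(c|u) log P(c|u)
               + P(not u) sum P(c|not u) log P(c|not u)."""
    n = len(documents)
    all_counts = Counter(labels)
    with_counts = Counter(lab for toks, lab in zip(documents, labels) if feature in toks)
    without_counts = all_counts - with_counts
    n_with = sum(with_counts.values())
    n_without = n - n_with
    return (-_sum_p_log_p(all_counts, n)
            + (n_with / n) * _sum_p_log_p(with_counts, n_with)
            + (n_without / n) * _sum_p_log_p(without_counts, n_without))


# Toy, made-up corpus (not from the talk)
ig_docs = [
    ["the", "mhc", "binding"],
    ["the", "binding", "epitope"],
    ["the", "market"],
    ["the", "price"],
]
ig_labels = ["immunology", "immunology", "other", "other"]
print(document_frequency(ig_docs)["binding"])
print(information_gain("binding", ig_docs, ig_labels))
print(information_gain("the", ig_docs, ig_labels))
```

In this toy corpus "the" has the highest document frequency but carries no information about the category, whereas "binding" separates the two categories perfectly, which illustrates why DF and IG can rank features very differently.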

Feature condensation (FC). Stemming: reduces words to their common root, e.g. "binding", "binds", "bind" to "bind". Porter stemmer – A ROC = to A ROC = Domain-specific vocabulary may be reduced to unsuitable terms.
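The idea, and the caveat about domain vocabulary, can be illustrated with a deliberately crude suffix stripper. This is not the Porter stemmer, which applies a much richer, multi-step rule set; the suffix list, the minimum stem length and the `protected` parameter are assumptions made for this sketch only.

```python
# Illustrative suffix list only; the Porter stemmer uses far more elaborate rules.
SUFFIXES = ("ing", "ed", "s")


def crude_stem(word, protected=frozenset()):
    """Strip one common suffix unless the word is in a protected vocabulary."""
    lowered = word.lower()
    if lowered in protected:
        return lowered
    for suffix in SUFFIXES:
        # keep at least three characters of stem
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return lowered[: -len(suffix)]
    return lowered


print([crude_stem(w) for w in ["binding", "binds", "bind"]])
print(crude_stem("virus"))
print(crude_stem("virus", protected=frozenset({"virus"})))
```

The second call shows how naive rules can mangle a domain term ("virus" loses its final "s"); a protected word list is one possible, hypothetical way to guard against that.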

Feature extraction (FE). Rules to capture immune-related expressions and group them together. Reduction of feature space (i.e. number of unique words). Enrichment of information content. Better performance?

Feature extraction (FE). Examples: Sequence length: identify the sequence length and replace it with “~range 50~” if the sequence to be mapped spans 50 amino acids. MHC alleles: identify MHC alleles and replace them with “~mhc_allele~”. Protein sequences: identify sequences as strings that a) contain exclusively the characters representing the 20 amino acids, b) are in upper case, and c) have length > threshold, and replace them with “~sequence~”.
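Two of these rules (MHC alleles and protein sequences) lend themselves to a regular-expression sketch. The patterns below are assumptions rather than the rules used in the cited work: the allele pattern only covers HLA-style names such as HLA-A*0201 or HLA-DRB1*01:01, and the length threshold of 7 is an arbitrary illustrative value.

```python
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # one-letter codes of the 20 amino acids

# Hypothetical pattern for HLA-style allele names; other nomenclatures are not covered
MHC_ALLELE_RE = re.compile(r"\bHLA-[A-Z]+[0-9]*\*[0-9]{2}(?::?[0-9]{2})*\b")


def extract_features(text, threshold=7):
    """Replace MHC alleles and protein sequences with placeholder tokens."""
    # Rule 1: MHC alleles -> ~mhc_allele~
    text = MHC_ALLELE_RE.sub("~mhc_allele~", text)
    # Rule 2: upper-case runs of amino-acid letters with length > threshold -> ~sequence~
    sequence_re = re.compile(rf"\b[{AMINO_ACIDS}]{{{threshold + 1},}}\b")
    return sequence_re.sub("~sequence~", text)


print(extract_features("The peptide GILGFVFTL binds HLA-A*0201."))
```

Alleles are replaced before sequences so that the placeholder, being lower case, cannot be matched by the sequence rule. Short upper-case abbreviations such as "DNA" fall below the threshold and are left untouched.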

Performance comparison Wang et al., BMC Bioinformatics 2007, 8:269

Overview Definitions and background Architectures of existing immunological databases Machine-learning for biological databases Conclusion

Machine-learning algorithms enable a systematic approach to database construction and facilitate scientific discovery. Their application must be performed with due care and must be scientifically and technically sound.