Large-scale knowledge aggregation for infectious diseases. ASEAN-China International Bioinformatics Workshop, Singapore, 17th April 2008. Olivo Miotto, Institute.


Large-scale knowledge aggregation for infectious diseases
ASEAN-China International Bioinformatics Workshop, Singapore, 17th April 2008
Olivo Miotto, Institute of Systems Science and Yong Loo Lin School of Medicine, National University of Singapore


Page 3 Large-scale Research Questions
What can we learn from large-scale studies of pathogens?
- Does H5N1 avian influenza have pandemic potential?
- What makes human flu different from avian flu?
- What are stable potential immune epitopes to use as vaccine candidates for influenza?
- How does each serotype of dengue differ from all others?
Large scale. Statistical evidence. Historical data. Systematic analysis.

Page 4 We need Metadata!
Metadata = descriptive data about sequences
- If you want to compare avian vs human, you need host organism information
- If you want conservation analysis, you need serotype and host information
- If you want to study a period of virus evolution, you need date information
- If you want a balanced dataset, you may need to filter according to country, date, subtype
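The metadata-driven selection described on this slide can be sketched as below. This is only an illustration under stated assumptions, not the tooling used in the talk: the record layout, the field names (`host`, `country`, `year`, `subtype`) and the sample values are all hypothetical.

```python
from collections import defaultdict

# Hypothetical records: each sequence carries the metadata the slide asks for.
records = [
    {"id": "seq1", "host": "Avian", "country": "Vietnam", "year": 2004, "subtype": "H5N1"},
    {"id": "seq2", "host": "Human", "country": "Vietnam", "year": 2004, "subtype": "H5N1"},
    {"id": "seq3", "host": "Human", "country": "Vietnam", "year": 2004, "subtype": "H5N1"},
    {"id": "seq4", "host": "Human", "country": "Thailand", "year": 2005, "subtype": "H3N2"},
    {"id": "seq5", "host": "Avian", "country": "Thailand", "year": None, "subtype": "H5N1"},
]

def select(records, **criteria):
    """Keep records whose metadata match every given field value."""
    return [r for r in records if all(r.get(k) == v for k, v in criteria.items())]

def balance(records, keys, per_group):
    """Keep at most `per_group` records for each combination of `keys`."""
    seen = defaultdict(int)
    kept = []
    for r in records:
        group = tuple(r.get(k) for k in keys)
        if None in group:  # missing metadata: the record cannot be placed in a group
            continue
        if seen[group] < per_group:
            seen[group] += 1
            kept.append(r)
    return kept

avian = select(records, host="Avian")   # avian vs human comparison needs host info
human = select(records, host="Human")
balanced = balance(records, keys=("country", "year", "subtype"), per_group=1)
```

Note that `seq5` drops out of the balanced set because its year is missing, which is the practical reason the slide insists on complete metadata.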

Page 5 Knowledge Mining
[Workflow diagram. Labels recovered from the slide, grouped by kind; the connections between boxes are not recoverable.]
- Stages: Knowledge Aggregation; Active Text Mining; Conservation Analysis; Characteristic Mutations Analysis
- Tasks: Extract Desired Source Knowledge from Public Databases; Identify Biomedical literature with Cross-reactivity information; Identify Evolutionarily Stable Region across subgroups; Identify mutations in H5N1 that characterize transmissibility amongst humans
- Data and user-supplied items: User-defined Queries; User-defined Dictionaries; User-defined Extraction Rules and Priorities; User-defined Patterns; Curator's Knowledge; Cross-reference Identifiers; Public Database Records; Biomedical Text; Previous Annotations; Viral Sequence and Metadata; Viral Protein References; Documents with Cross-reactivity information
- Results: H5N1 mutation map; Evidence of strain co-circulation; Epitope Vaccine Candidates

Page 6 Scalability in Bioinformatics Knowledge Mining
- Integrative scalability: we need to integrate heterogeneous information from multiple data repositories that serve multiple purposes
- Quantitative scalability: we need methods that can exploit and effectively explore large-scale data sets
- Hierarchical scalability: we need to cascade analysis tasks, passing knowledge from one task to the next

Page 7 Obstacles to Scalability
Heterogeneity of biological databases
- Systemic: access to data in different databases
- Syntactic: data formats, use of free text
- Structural: different table structures in different databases
- Semantic: data with different meaning and intent
Semantic heterogeneity is particularly insidious
- Data is rarely used in the way it was originally intended
Low level of end-user technical expertise
- Biologists, not computer scientists
- Excel spreadsheets, Web page “scraping”
- Does not scale up

Page 8 Semantic Heterogeneity in GenBank
[Figure: examples labelled "Good", "Not so Good" and "Pretty Bad"]

Page 9 Semantic Heterogeneity in GenBank
- Fields (e.g. country/date) are inconsistently encoded
- Inconsistent level of detail between databases
- Inconsistent field location within different records of the same database
- Implicit encoding of the data (e.g. within the title of a publication)
- Multiple usage of the same field
Usage of the isolation_source field in different GenPept records:
  AAT85667  /isolation_source="Homo sapiens"
  BAC77216  /isolation_source="Samoa"
  AAN74539  /isolation_source="isolated in 1993"
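To make the "multiple usage of the same field" problem concrete, here is a minimal sketch of how one might decide which property a free-text `isolation_source` value actually encodes. The dictionaries, the year pattern and the function name are assumptions made for illustration; they are not the rules used by the authors.

```python
import re

# Hypothetical user-defined dictionaries; a real curator would supply far larger ones.
HOSTS = {"homo sapiens": "Human"}
COUNTRIES = {"samoa": "Samoa"}
YEAR = re.compile(r"\b(1[89]\d{2}|20\d{2})\b")  # four-digit years 1800-2099

def interpret_isolation_source(value):
    """Guess which metadata property a free-text isolation_source value encodes."""
    text = value.strip().lower()
    if text in HOSTS:
        return {"host": HOSTS[text]}
    if text in COUNTRIES:
        return {"country": COUNTRIES[text]}
    match = YEAR.search(text)
    if match:
        return {"year": int(match.group(1))}
    return {}  # unrecognised value: leave it for manual curation

# The three GenPept examples quoted on the slide.
examples = {
    "AAT85667": "Homo sapiens",
    "BAC77216": "Samoa",
    "AAN74539": "isolated in 1993",
}
extracted = {acc: interpret_isolation_source(v) for acc, v in examples.items()}
```

The same field yields a host for one record, a country for another and a year for the third, so a single fixed mapping from field to property cannot work.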

Page 10 Influenza Large-Scale Studies
- Analyze all influenza protein sequences available
  - GenBank + GenPept = 92,343 documents
  - Final dataset comprises 40,169 unique sequences
- Various types of analysis, e.g.
  - Identify amino acid mutation sites that characterize human-transmissible strains
  - Compare the diversity of viral sequences over different periods of time and geographical areas
- Several metadata fields required: Protein name, Subtype, Isolate, Host, Country, Year
Manual curation is not an option!

Page 11 The Aggregator of Biological Knowledge (ABK)
An end-user environment for data retrieval, extraction and analysis
- Uses XML technology and structural rules to allow biologists to extract and reconcile the data needed
- Wrapper framework provides access to multiple sources
- Manages extracted results
- Offers plug-in architecture for analysis tools
[Diagram, shown once labelled "KDD System" and once labelled "ABK": Researcher; Public Repositories; Data Collection; Data Management; Data Analysis; arrows labelled query, input, filter, augment, manage, control]

Page 12 ABK Structural Rules
- Concise visualization of XML as name/value tree
- Familiar presentation of metadata for biologists
- Point-and-click selection of location and constraints
- Automatic formation of XML Structural Rule
- Hierarchical value reconciliation
- Tabulated visualization and manual curation
- RDF storage and output
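The slide does not show what a structural rule looks like internally. As a rough sketch of the idea (a location in the XML tree plus constraints on what must hold there), one could write something like the following. The XML layout, the rule representation and the function names are all hypothetical; this is not ABK's actual rule format.

```python
import xml.etree.ElementTree as ET

# Hypothetical record layout, loosely modelled on a feature/qualifier tree.
RECORD = """
<record accession="X00001">
  <feature key="source">
    <qualifier><name>organism</name><value>Influenza A virus</value></qualifier>
    <qualifier><name>country</name><value>Viet Nam</value></qualifier>
  </feature>
  <feature key="CDS">
    <qualifier><name>product</name><value>polymerase PB2</value></qualifier>
  </feature>
</record>
"""

def make_rule(prop, feature_key, qualifier_name):
    """A structural rule: where to look (location) and what must hold there (constraints)."""
    return {"property": prop, "feature_key": feature_key, "qualifier_name": qualifier_name}

def apply_rule(rule, root):
    """Return every value found at the rule's location that satisfies its constraints."""
    values = []
    for feature in root.findall("feature"):
        if feature.get("key") != rule["feature_key"]:
            continue  # constraint on the feature type
        for qualifier in feature.findall("qualifier"):
            if qualifier.findtext("name") == rule["qualifier_name"]:
                values.append(qualifier.findtext("value"))
    return values

root = ET.fromstring(RECORD)
country_rule = make_rule("Country", "source", "country")
protein_rule = make_rule("Protein name", "CDS", "product")
countries = apply_rule(country_rule, root)
proteins = apply_rule(protein_rule, root)
```

In a point-and-click interface, the user's selection in the name/value tree would supply the arguments that `make_rule` takes here by hand.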

Page 13 Data Extraction and Cleaning
[Screenshot: DENV-1 sequences, with callouts]
- Values produced by user-defined rules
- Different rules (or different documents) produced conflicting values
- User can fill in or override values
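One possible way to reconcile the values that several rules produce for the same property is sketched below. The priority scheme (a lower number wins), the conflict flag and the function name are assumptions made for illustration; the slides do not specify how the actual reconciliation is implemented.

```python
def reconcile(candidates, override=None):
    """Pick one value for a property from the outputs of several rules.

    candidates: list of (priority, value) pairs; in this sketch a lower number
    means a higher priority, and a value of None means the rule found nothing.
    override: a value entered by the curator, if any; it always wins.
    Returns (value, conflict), where conflict is True when the rules disagreed.
    """
    found = [(p, v) for p, v in candidates if v]
    conflict = len({v for _, v in found}) > 1
    if override is not None:
        return override, conflict
    if not found:
        return None, False
    _, best = min(found, key=lambda pv: pv[0])
    return best, conflict

# Two rules disagree on the spelling of the country: the higher-priority value
# is kept and the conflict is flagged for the curator.
country, country_conflict = reconcile([(1, "Viet Nam"), (2, "Vietnam")])

# Only the fallback rule produced a year.
year, year_conflict = reconcile([(1, None), (2, "1993")])

# No rule produced a host, so the curator fills it in.
host, host_conflict = reconcile([], override="Human")
```

Keeping the conflict flag separate from the chosen value lets a tabulated view highlight disagreements even after a value has been selected.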

Page 14 Rule performance
- Multiple rules often needed
- Some properties are very fragmented

Page 15 Can H5N1 viruses spread amongst humans?

Page 16 The Antigenic Variability Analyzer (AVANA)

Page 17 Using MI to detect Characteristic Sites
- At a characteristic site, the residue observed is strongly associated with a set of sequences
  - E.g.: Arg -> Avian, Thr -> Human
- This association is explored by measuring the mutual information (MI) of
  - the residue observed at a site
  - the label of the set in which it is observed
- MI is in the range 0 to 1.0
  - MI = 0.0 -> no statistical association between the residue observed and the set in which it occurs
  - MI = 1.0 -> residues observed in one set are never observed in the other, and vice versa
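As an illustration of the measure, the sketch below computes the standard mutual information, in bits, between the residue at one alignment site and a two-valued set label. It is a textbook formulation, not necessarily the exact computation used in AVANA; the function name and the toy residue columns are made up. With base-2 logarithms the value reaches 1.0 only when the two sets are equally weighted; how the original analysis treats sets of unequal size is not stated in the slides.

```python
from collections import Counter
from math import log2

def mutual_information(residues_a, residues_b):
    """MI (in bits) between the residue at one alignment site and the set label.

    residues_a / residues_b: the residue observed at that site in each sequence
    of set A and set B (e.g. avian and human), one character per sequence.
    """
    n_a, n_b = len(residues_a), len(residues_b)
    n = n_a + n_b
    counts_a, counts_b = Counter(residues_a), Counter(residues_b)
    mi = 0.0
    for counts, n_set in ((counts_a, n_a), (counts_b, n_b)):
        p_label = n_set / n
        for residue, count in counts.items():
            p_joint = count / n                                      # P(residue, label)
            p_residue = (counts_a[residue] + counts_b[residue]) / n  # P(residue)
            mi += p_joint * log2(p_joint / (p_residue * p_label))
    return mi

fully_characteristic = mutual_information("RRRR", "TTTT")  # disjoint residues
uninformative = mutual_information("RRTT", "RRTT")         # identical distributions
partial = mutual_information("RRRT", "TTTT")               # some overlap
```

For these toy columns, `fully_characteristic` is 1.0 and `uninformative` is 0.0, matching the two extremes described on the slide, while `partial` falls in between.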

Page 18 PB2 Protein
[Plot: MI and Entropy; A2A (719 sequences), H2H (1650 sequences)]
Spikes indicate characteristic sites

Page 19 RNP proteins: PB
[Figure: PB2 (759 aa), 17 sites, A2A vs H2H; regions labelled Nuclear Localization Signal, PB1 binding, NP binding, RNA cap binding; residue strings as extracted: DEMTITAIVATALRAEASVAAVKDE, NTTAVMTSMVSMKTKTTIIRN]

Page 20 H2H characteristic mutations in H5N1

Page 21 Ongoing Projects at ISS
- InViDiA (Integrated Virus Diversity Analysis): Web-based tool for metadata-enabled diversity analysis
- WADE (Web-based Aggregation and Display of Epitopes): Web-based tool for aggregating epitope predictions from multiple prediction systems

Page 22 Thanks to
- Johns Hopkins University: Prof. J Thomas August
- Dana-Farber Cancer Institute, Harvard: Dr. Vladimir Brusic
- Dept. of Biochemistry, NUS: Prof. Tan Tin Wee, AT Heiny, Asif M Khan, Hu Yong Li
- Institut Pasteur: Dr. Hervé Bourhy
Partial Grant Support: National Institute of Allergy and Infectious Diseases, NIH Grant No. 5 U19 AI56541, Contract No. HHSN C