Large-scale knowledge aggregation for infectious diseases ASEAN-China International Bioinformatics Workshop Singapore, 17th April 2008 Olivo Miotto Institute.

1 Large-scale knowledge aggregation for infectious diseases. ASEAN-China International Bioinformatics Workshop, Singapore, 17th April 2008. Olivo Miotto, Institute of Systems Science and Yong Loo Lin School of Medicine, National University of Singapore

2 Large-scale Research Questions. What can we learn from large-scale studies of pathogens? Does H5N1 avian influenza have pandemic potential? What makes human flu different from avian flu? Which stable potential immune epitopes could be used as vaccine candidates for influenza? How does each serotype of dengue differ from all others?

3 Large-scale Research Questions. What can we learn from large-scale studies of pathogens? Does H5N1 avian influenza have pandemic potential? What makes human flu different from avian flu? Which stable potential immune epitopes could be used as vaccine candidates for influenza? How does each serotype of dengue differ from all others? Answering them requires: large scale, statistical evidence, historical data, systematic analysis.

4 We need Metadata! Metadata = descriptive data about sequences. If you want to compare avian vs human, you need host organism information. If you want conservation analysis, you need serotype and host information. If you want to study a period of virus evolution, you need date information. If you want a balanced dataset, you may need to filter according to country, date and subtype.

5 Knowledge Mining (workflow diagram). Diagram labels: H5N1 mutation map; Knowledge Aggregation; User-defined Dictionaries; User-defined Extraction Rules and Priorities; Cross-reference Identifiers; Identify mutations in H5N1 that characterize transmissibility amongst humans; User-defined Queries; Extract Desired Source Knowledge from Public Databases; Public Database Records; Conservation Analysis; Evidence of strain co-circulation; Viral Protein References; Identify Evolutionarily Stable Region across subgroups; Characteristic Mutations Analysis; Epitope Vaccine Candidates; Active Text Mining; Identify Biomedical literature with Cross-reactivity information; Documents with Cross-reactivity information; User-defined Dictionaries; Curator's Knowledge; User-defined Patterns; Biomedical Text; Viral Sequence and Metadata; Previous Annotations.

6 Scalability in Bioinformatics Knowledge Mining. Integrative scalability: we need to integrate heterogeneous information from multiple data repositories with multiple purposes. Quantitative scalability: we need methods that can leverage and effectively explore large-scale data sets. Hierarchical scalability: we need to cascade analysis tasks, flowing knowledge from one task to the next.

7 Obstacles to Scalability. Heterogeneity of biological databases. Systemic: access to data in different databases. Syntactic: data formats, use of free text. Structural: different table structures in different databases. Semantic: data with different meaning and intent. Semantic heterogeneity is particularly insidious: data is rarely used in the way it was originally intended. Low level of end-user technical expertise: the users are biologists, not computer scientists; Excel spreadsheets and Web page “scraping” do not scale up.

8 Semantic Heterogeneity in GenBank (figure: examples labelled "Good", "Not so Good" and "Pretty Bad")

9 Semantic Heterogeneity in GenBank. Fields (e.g. country/date) are inconsistently encoded. Inconsistent level of detail between databases. Inconsistent field location within different records of the same database. Implicit encoding of the data (e.g. within the title of a publication). Multiple usage of the same field. Usage of the isolation_source field in different GenPept records: /isolation_source="Homo sapiens" (AAT85667); /isolation_source="Samoa" (BAC77216); /isolation_source="isolated in 1993" (AAN74539).
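
The three isolation_source values quoted on this slide carry a host, a country and a date respectively. A minimal Python sketch of how a dictionary-driven interpreter might sort such values is given here. The dictionaries, the year pattern and the function name `interpret_isolation_source` are hypothetical illustrations, not part of any tool described in the talk.

```python
import re

# Hypothetical user-defined dictionaries (tiny on purpose; real ones would be much larger).
HOSTS = {"homo sapiens"}
COUNTRIES = {"samoa"}

# Assumption: a four-digit year starting with 19 or 20 denotes an isolation date.
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")


def interpret_isolation_source(value):
    """Guess which metadata property a free-text isolation_source value encodes.

    Returns a (property, value) pair; property is "unknown" when nothing matches.
    """
    cleaned = value.strip()
    key = cleaned.lower()
    if key in HOSTS:
        return ("host", cleaned)
    if key in COUNTRIES:
        return ("country", cleaned)
    match = YEAR.search(key)
    if match:
        return ("year", match.group(0))
    return ("unknown", cleaned)


for raw in ["Homo sapiens", "Samoa", "isolated in 1993"]:
    print(raw, "->", interpret_isolation_source(raw))
```

Values that match nothing are returned as "unknown" rather than guessed, so they can be left for manual curation.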

10 Influenza Large-Scale Studies. Analyze all influenza protein sequences available: GenBank + GenPept = 92,343 documents. The final dataset comprises 40,169 unique sequences. Various types of analysis, e.g. identify amino acid mutation sites that characterize human-transmissible strains; compare the diversity of viral sequences over different periods of time and geographical areas. Several metadata fields required: Protein name, Subtype, Isolate, Host, Country, Year. Manual curation is not an option!

11 The Aggregator of Biological Knowledge (ABK). An end-user environment for data retrieval, extraction and analysis. Uses XML technology and structural rules to allow biologists to extract and reconcile the data needed. Wrapper framework provides access to multiple sources. Manages extracted results. Offers plug-in architecture for analysis tools. (Diagram, drawn once for a KDD system and once for ABK. Components: Researcher, Public Repositories, Data Collection, Data Management, Data Analysis. Arrows: query, input, filter, augment, manage, control.)

12 ABK Structural Rules. Concise visualization of XML as a name/value tree. Familiar presentation of metadata for biologists. Point-and-click selection of location and constraints. Automatic formation of XML structural rule. Hierarchical value reconciliation. Tabulated visualization and manual curation. RDF storage and output.
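
The talk does not show how a structural rule is written, so the following Python sketch only illustrates the general idea of a location plus a constraint on an XML record. It assumes a record in the style of NCBI's GBSeq XML. The sample record is made up for the example, and `apply_rule` is a hypothetical helper, not ABK's rule syntax.

```python
import xml.etree.ElementTree as ET

# Made-up record fragment in the style of GBSeq XML, for illustration only.
RECORD = """
<GBSeq>
  <GBSeq_feature-table>
    <GBFeature>
      <GBFeature_key>source</GBFeature_key>
      <GBFeature_quals>
        <GBQualifier>
          <GBQualifier_name>country</GBQualifier_name>
          <GBQualifier_value>Viet Nam</GBQualifier_value>
        </GBQualifier>
        <GBQualifier>
          <GBQualifier_name>isolation_source</GBQualifier_name>
          <GBQualifier_value>Homo sapiens</GBQualifier_value>
        </GBQualifier>
      </GBFeature_quals>
    </GBFeature>
  </GBSeq_feature-table>
</GBSeq>
"""


def apply_rule(xml_text, qualifier_name):
    """Return the values of every qualifier whose name equals qualifier_name.

    Location: any GBQualifier element in the record.
    Constraint: its GBQualifier_name child must equal qualifier_name.
    """
    root = ET.fromstring(xml_text.strip())
    path = f".//GBQualifier[GBQualifier_name='{qualifier_name}']/GBQualifier_value"
    return [element.text for element in root.findall(path)]


print(apply_rule(RECORD, "country"))
print(apply_rule(RECORD, "isolation_source"))
```

A rule of this shape returns an empty list when the property is absent from a record, which is one reason several rules per property may be needed.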

13 Data Extraction and Cleaning (screenshot: DENV-1 sequences). Annotations: values produced by user-defined rules; different rules (or different documents) produced conflicting values; user can fill in or override values.
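
One way to picture reconciliation of conflicting values is a priority-ordered choice with a manual override. The Python sketch below is an assumption about how that could work, not a description of the tool's actual logic; `reconcile` and its argument layout are hypothetical.

```python
def reconcile(candidates, override=None):
    """Pick one value for a property from several rule outputs.

    candidates: list of (priority, value) pairs, one per rule that fired;
                a lower priority number means a more trusted rule (assumption).
    override:   value typed in by the curator, which always wins if given.

    Returns (value, conflict), where conflict is True when the rules disagreed.
    """
    # Ignore rules that produced nothing.
    produced = [(priority, value) for priority, value in candidates if value]
    conflict = len({value for _, value in produced}) > 1
    if override is not None:
        return override, conflict
    if not produced:
        return None, False
    best_value = min(produced, key=lambda pair: pair[0])[1]
    return best_value, conflict


# Two rules disagree on the country; the higher-priority rule wins and the conflict is flagged.
print(reconcile([(2, "Thailand"), (1, "Viet Nam")]))
# The curator overrides both.
print(reconcile([(2, "Thailand"), (1, "Viet Nam")], override="Cambodia"))
```

Returning the conflict flag alongside the value lets a tabulated view highlight cells that deserve a manual check.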

14 Rule performance. Multiple rules are often needed. Some properties are very fragmented.

15 Can H5N1 viruses spread amongst humans?

16 The Antigenic Variability Analyzer (AVANA)

17 Using MI to detect Characteristic Sites. At a characteristic site, the residue observed is strongly associated with a set of sequences, e.g. Arg -> Avian, Thr -> Human. This association is explored by measuring the mutual information (MI) between the residue observed at a site and the label of the set in which it is observed. MI is in the range 0 to 1.0. MI = 0.0 -> no statistical association between the residue observed and the set in which it occurs. MI = 1.0 -> residues observed in one set are never observed in the other, and vice versa.
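
A minimal Python sketch of a per-site score with these properties follows. It computes the mutual information, in bits, between the residue and the set label, under the assumption that the two sets are given equal weight regardless of how many sequences each contains. That weighting keeps the score between 0 and 1.0, but it is an assumption made for this example and is not confirmed as the exact formula used by AVANA. The function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(probabilities):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * log2(p) for p in probabilities if p > 0)


def residue_frequencies(column):
    """Relative frequency of each residue in one alignment column of one set."""
    counts = Counter(column)
    total = sum(counts.values())
    return {residue: n / total for residue, n in counts.items()}


def site_mi(column_a, column_b):
    """MI between residue and set label at one site, with equally weighted sets.

    I(residue; label) = H(residue) - H(residue | label), where the label
    takes each of its two values with probability 0.5 (assumption).
    """
    freq_a = residue_frequencies(column_a)
    freq_b = residue_frequencies(column_b)
    residues = set(freq_a) | set(freq_b)
    mixture = [0.5 * freq_a.get(r, 0.0) + 0.5 * freq_b.get(r, 0.0) for r in residues]
    return entropy(mixture) - 0.5 * entropy(freq_a.values()) - 0.5 * entropy(freq_b.values())


avian = ["R"] * 10   # every avian sequence has Arg at this site
human = ["T"] * 8    # every human sequence has Thr at this site
print(site_mi(avian, human))  # 1.0
```

When both sets show the same residue distribution the score is 0, and it rises towards 1.0 as the residues seen in the two sets stop overlapping.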

18 PB2 Protein (plot of MI and entropy; A2A: 719 sequences, H2H: 1650 sequences). Spikes indicate characteristic sites.

19 RNP proteins: PB2 (759 aa), 17 sites. Figure labels: Nuclear Localization Signal; PB1 binding; NP binding; RNA cap binding; A2A; H2H. Site positions as extracted (unsegmented): 9446481105199271292368475613627661674567588702. Residue rows as extracted: DEMTITAIVATALRAEASVAAVKDE NTTAVMTSMVSMKTKTTIIRN. Figure source: http://www-micro.msb.le.ac.uk/3035/Orthomyxoviruses.html

20 H2H characteristic mutations in H5N1

21 Ongoing Projects at ISS. InViDiA (Integrated Virus Diversity Analysis): Web-based tool for metadata-enabled diversity analysis. WADE (Web-based Aggregation and Display of Epitopes): Web-based tool for aggregating epitope predictions from multiple prediction systems.

22 Thanks to: Johns Hopkins University: Prof. J Thomas August. Dana-Farber Cancer Institute, Harvard: Dr. Vladimir Brusic. Dept. of Biochemistry, NUS: Prof. Tan Tin Wee, AT Heiny, Asif M Khan, Hu Yong Li. Institut Pasteur: Dr. Hervé Bourhy. Partial grant support: National Institute of Allergy and Infectious Diseases, NIH Grant No. 5 U19 AI56541, Contract No. HHSN2662-00400085C

