Published by Barrie Parker. Modified over 9 years ago.
1
Bioinformatics Research Overview
Li Liao
Develop new algorithms and (statistical) learning methods
> Capable of incorporating domain knowledge
> Effective, Expressive, Interpretable
2
Motivations
Understanding correlations between genotype and phenotype
Predicting phenotype from genotype
Phenotypes:
–Protein function
–Drug/therapy response
–Drug-drug interactions for expression
–Drug mechanism
–Interacting pathways of metabolism
3
Projects
–Homology detection, protein family classification (funded by a DuPont S&E award)
    Support Vector Machines
    Hidden Markov models
    Graph theoretic methods
–Probabilistic modeling for BioSequence (funded by NIH)
    HMMs, and beyond
    Motif finding
    Secondary structure
–Comparative Genomics
    Identify genome features for diagnostic and therapeutic purposes (funded by an Army grant)
    Evolution of metabolic pathways
    Tree and graph comparisons
4
Detect remote homologues
Attributes to be looked at:
-Sequence similarity
-Aggregate statistics (e.g., protein families)
-Pattern/motif
-More attributes (presence in the phylogenetic tree)
How can domain-specific knowledge be incorporated into the model so that a classifier becomes more accurate?
Results:
-Quasi-consensus based comparison of profile HMM for protein sequences (submitted to Bioinformatics)
-Using extended phylogenetic profiles and support vector machines for protein family classification (SNPD 04)
-Combining Pairwise Sequence Similarity and Support Vector Machines for Detecting Remote Protein Evolutionary and Structural Relationships (JCB 2003)
5
Support Vector Machines
6
Data: phylogenetic profiles
[Figure: binary phylogenetic profiles x, y and z compared under Hamming distance and under a tree-based distance. The figure shows the values 3 and 0.5 for one comparison and 3 and 0.1 for the other.]
-How to account for correlations among profile components? Profile extension (Narra & Liao, SNPD 04)
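As a point of reference for the figure, here is a minimal sketch of the Hamming distance between two binary phylogenetic profiles. The profile values are made up for illustration and are not taken from the slide. Hamming distance counts mismatching positions and treats every component as independent, which is the limitation the question about correlations points at; the tree-based distance from the slide is not reproduced here because its definition is not given.

```python
def hamming_distance(x, y):
    """Count the positions at which two equal-length profiles differ."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    return sum(a != b for a, b in zip(x, y))


# Hypothetical profiles: 1 = protein present in that genome, 0 = absent.
x = [1, 1, 1, 1, 0]
y = [1, 0, 1, 0, 1]

print(hamming_distance(x, y))
```

Two profile pairs can share the same Hamming distance even when the differing genomes are close relatives in one pair and distant in the other, which is why a distance that uses the phylogenetic tree can separate them.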
7
Quasi consensus based comparison of HMMs
–From MSA to profile HMMs using existing packages (SAM-T99 or HMMER)
–Generation of quasi consensus sequence from the model
–Alignment of consensus sequence of a model with the other model
–Extraction of two alignments in each direction
[Figure: seed alignments Seed 1 and Seed 2 give models M1 and M2, with quasi consensus sequences Consensus 1 (V G A N V A E H) and Consensus 2 (V K A T I A E H). Aligning each consensus against the other model yields the alignments Aln 12 and Aln 21 and the scores S(c1|M2) and S(c2|M1).]
9
Sequence Models (HMMs and beyond)
Motivations:
What is responsible for the function?
–Patterns/motifs
–Secondary structure
To capture long range correlations of bio sequences
–Transporter proteins
–RNA secondary structure
Methods: generative versus discriminative
–Linear dependent processes
–Stochastic grammars
–Model equivalence
10
TMMOD: An improved hidden Markov model for predicting transmembrane topology (to appear in IEEE ICTAI04)
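To illustrate how a hidden Markov model labels residues, here is a minimal Viterbi decoding sketch on a toy two-state model. This is not the TMMOD architecture: the states ("M" for membrane, "L" for loop), the two-letter alphabet ("h" for hydrophobic, "p" for polar) and all probabilities are assumptions chosen only to make the example small.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state path for the observations (log-space)."""
    if len(observations) == 0:
        raise ValueError("observations must not be empty")
    # best[t][s]: log-probability of the best path that ends in state s at position t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, best[t - 1][r] + math.log(trans_p[r][s])) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit_p[s][observations[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Toy parameters (hypothetical, for illustration only).
states = ("M", "L")
start_p = {"M": 0.5, "L": 0.5}
trans_p = {"M": {"M": 0.9, "L": 0.1}, "L": {"M": 0.1, "L": 0.9}}
emit_p = {"M": {"h": 0.8, "p": 0.2}, "L": {"h": 0.2, "p": 0.8}}

print("".join(viterbi("hhhhpppp", states, start_p, trans_p, emit_p)))
```

Working in log-space avoids numerical underflow on long sequences. A real topology predictor would use many more states and a 20-letter amino acid alphabet.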
11
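Each cell lists the values for Reg. (a) / (b) / (c). TMHMM and PHDtm have a single value per column. Values missing from the source are marked as such.

Mod. | Data set | Correct topology | Correct location | Sensitivity | Specificity
TMMOD 1 | S-83 | 65 (78.3%) / 51 (61.4%) / 64 (77.1%) | 67 (80.7%) / 52 (62.7%) / 65 (78.3%) | 97.4% / 71.3% / 97.1% | 97.4% / 71.3% / 97.1%
TMMOD 2 | S-83 | 61 (73.5%) / 54 (65.1%) / 65 (78.3%) | 61 (73.5%), 66 (79.5%) (one of the three values is missing; position unclear) | 99.4% / 93.8% / 99.7% | 97.4% / 71.3% / 97.1%
TMMOD 3 | S-83 | 70 (84.3%) / 64 (77.1%) / 74 (89.2%) | 71 (85.5%) / 65 (78.3%) / 74 (89.2%) | 98.2% / 95.3% / 99.1% | 97.4% / 71.3% / 97.1%
TMHMM | S-83 | 64 (77.1%) | 69 (83.1%) | 96.2% (only one of sensitivity and specificity is given; the column is unclear) |
PHDtm | S-83 | (85.5%) (count missing) | (88.0%) (count missing) | 98.8% | 95.2%
TMMOD 1 | S-160 | 117 (73.1%) / 92 (57.5%) / 117 (73.1%) | 128 (80.0%) / 103 (64.4%) / 126 (78.8%) | 97.4% / 77.4% / 96.1% | 97.0% / 80.8% / 96.7%
TMMOD 2 | S-160 | 120 (75.0%) / 97 (60.6%) / 118 (73.8%) | 132 (82.5%) / 121 (75.6%) / 135 (84.4%) | 98.4% / 97.7% / 98.4% | 97.2% / 95.6% / 97.2%
TMMOD 3 | S-160 | 120 (75.0%) / 110 (68.8%) / 135 (84.4%) | 133 (83.1%) / 124 (77.5%) / 143 (89.4%) | 97.8% / 94.5% / 98.3% | 97.6%, 98.1% (one of the three values is missing; position unclear)
TMHMM | S-160 | 123 (76.9%) | 134 (83.8%) | 97.1% | 97.7%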
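The percentages in parentheses above are consistent with the count divided by the number of proteins in the data set (83 for S-83, 160 for S-160). A minimal sketch of that arithmetic, with the helper name `percent` chosen here for illustration:

```python
def percent(count, total):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


# Spot checks against entries reported above.
print(percent(65, 83))    # TMMOD 1 (a), S-83, correct topology
print(percent(128, 160))  # TMMOD 1 (a), S-160, correct location
```

The same check can be applied to any count in the results to confirm which data set size it refers to.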
12
Genomics study of enterobacterial BT agents (funded by the US Army via Center for Biological Defense, USF)
Goals:
–Identification of genes and sequence tags as targets for novel diagnosis and therapy
–BT agents: Yersinia pestis, Salmonella, Escherichia coli O157:H7
Methods:
–Various bioinformatics tools and databases
13
Comparative Genomics
Motivation:
–Evolution of metabolic pathways
–Gene functions
–De novo (alternative pathways)
    Genetic engineering
    Drug discovery
Methods:
–Put data into a context: knowledge/data representation
    Trees, graphs, etc.
–Learning models/methods
14
Profiling: pairs of attribute-value
[Figure: binary (0/1) profile matrix indexed by O1, O2, ..., Om and P1, ..., Pn.]
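The profiling idea on this slide can be sketched as a mapping from each item to a 0/1 vector over a fixed attribute list. The sketch below assumes that the O labels are organisms and the P labels are pathways; the slide does not spell this out, and the presence data is invented for illustration.

```python
def build_profiles(presence, attributes):
    """Map each organism to a 0/1 vector over the ordered attribute list."""
    return {
        organism: [1 if attribute in found else 0 for attribute in attributes]
        for organism, found in presence.items()
    }


# Hypothetical presence data: which pathways were found in which organism.
presence = {
    "O1": {"P1", "P3"},
    "O2": {"P2", "P3"},
    "O3": {"P1", "P2", "P3"},
}
attributes = ["P1", "P2", "P3"]

profiles = build_profiles(presence, attributes)
for organism, vector in profiles.items():
    print(organism, vector)
```

Fixing the attribute order once keeps every profile comparable position by position, so the vectors can be compared directly with a distance such as Hamming distance.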
15
What we found:
–An informative way to compare genomes
–The majority of pathways (or rather their enzyme components) evolve in congruence with species
16
What we do next:
–Database and search engine
–Off-line self-consistent iteration
–Pathways in a network
    Graph comparisons
–Identify key components of networks
–Small world topology
    Cross-level interactions with regulatory networks