NUS-KI Course on Bioinformatics, Nov 2005 Sequence Analysis and Function Prediction Limsoon Wong.

Slides:



Advertisements
Similar presentations
Original Figures for "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring"
Advertisements

Blast to Psi-Blast Blast makes use of Scoring Matrix derived from large number of proteins. What if you want to find homologs based upon a specific gene.
Using phylogenetic profiles to predict protein function and localization As discussed by Catherine Grasso.
Gapped Blast and PSI BLAST Basic Local Alignment Search Tool ~Sean Boyle Basic Local Alignment Search Tool ~Sean Boyle.
Basics of Comparative Genomics Dr G. P. S. Raghava.
Research Methodology of Biotechnology: Protein-Protein Interactions Yao-Te Huang Aug 16, 2011.
Profiles for Sequences
Molecular Evolution Revised 29/12/06
Structural bioinformatics
Optimatization of a New Score Function for the Detection of Remote Homologs Kann et al.
Machine Learning for Protein Classification Ashutosh Saxena CS 374 – Algorithms in Biology Thursday, Nov 16, 2006.
Identifying functional residues of proteins from sequence info Using MSA (multiple sequence alignment) - search for remote homologs using HMMs or profiles.
CISC667, F05, Lec23, Liao1 CISC 667 Intro to Bioinformatics (Fall 2005) Support Vector Machines (II) Bioinformatics Applications.
Sequence similarity.
Protein Homology Detection Using String Alignment Kernels Jean-Phillippe Vert, Tatsuya Akutsu.
Similar Sequence Similar Function Charles Yan Spring 2006.
Marcotte EM, Pellegrini M, Ng HL, Rice DW, Yeates TO, Eisenberg D. (1999). Detecting protein function and protein-protein interactions from genome sequences.
Remote Homology detection: A motif based approach CS 6890: Bioinformatics - Dr. Yan CS 6890: Bioinformatics - Dr. Yan Swati Adhau Swati Adhau 04/14/06.
Genome Evolution: Duplication (Paralogs) & Degradation (Pseudogenes)
Whole genome alignments Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas
TGCAAACTCAAACTCTTTTGTTGTTCTTACTGTATCATTGCCCAGAATAT TCTGCCTGTCTTTAGAGGCTAATACATTGATTAGTGAATTCCAATGGGCA GAATCGTGATGCATTAAAGAGATGCTAATATTTTCACTGCTCCTCAATTT.
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Inferring function by homology The fact that functionally important aspects of sequences are conserved across evolutionary time allows us to find, by homology.
Automatic methods for functional annotation of sequences Petri Törönen.
Protein Bioinformatics Course
Functional Linkages between Proteins. Introduction Piles of Information Flakes of Knowledge AGCATCCGACTAGCATCAGCTAGCAGCAGA CTCACGATGTGACTGCATGCGTCATTATCTA.
Functional Associations of Protein in Entire Genomes Sequences Bioinformatics Center of Shanghai Institutes for Biological Sciences Bingding.
Copyright © 2004, 2005 by Limsoon Wong Techniques & Applications of Sequence Comparison Limsoon Wong 31 August 2005.
SISAP’08 – Approximate Similarity Search in Genomic Sequence Databases using Landmark-Guided Embedding Ahmet Sacan and I. Hakki Toroslu
Classifier Evaluation Vasileios Hatzivassiloglou University of Texas at Dallas.
Sequence analysis: Macromolecular motif recognition Sylvia Nagl.
NUS-KI IMS 28 Nov 2005 Protein Function Prediction from Protein Interactions Limsoon Wong.
Multiple Alignment and Phylogenetic Trees Csc 487/687 Computing for Bioinformatics.
Sequence Alignment Goal: line up two or more sequences An alignment of two amino acid sequences: …. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Bioinformatics 2011 Molecular Evolution Revised 29/12/06.
Last lecture summary. Window size? Stringency? Color mapping? Frame shifts?
Comp. Genomics Recitation 3 The statistics of database searching.
Construction of Substitution Matrices
Copyright  2004 limsoon wong CS2220: Computation Foundation in Bioinformatics Limsoon Wong Institute for Infocomm Research Lecture slides for 10 February.
Bertinoro, Nov 2005 Some Data Mining Challenges Learned From Bioinformatics & Actions Taken Limsoon Wong National University of Singapore.
Protein Structure & Modeling Biology 224 Instructor: Tom Peavy Nov 18 & 23, 2009
Protein and RNA Families
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
Metabolic Network Inference from Multiple Types of Genomic Data Yoshihiro Yamanishi Centre de Bio-informatique, Ecole des Mines de Paris.
Using BLAST for Genomic Sequence Annotation Jeremy Buhler For HHMI / BIO4342 Tutorial Workshop.
Finding Patterns Gopalan Vivek Lee Teck Kwong Bernett.
Analysis and comparison of very large metagenomes with fast clustering and functional annotation Weizhong Li, BMC Bioinformatics 2009 Present by Chuan-Yih.
Orthology & Paralogy Alignment & Assembly Alastair Kerr Ph.D. WTCCB Bioinformatics Core [many slides borrowed from various sources]
Cluster validation Integration ICES Bioinformatics.
Construction of Substitution matrices
Ayesha M.Khan Spring Phylogenetic Basics 2 One central field in biology is to infer the relation between species. Do they possess a common ancestor?
Copyright  2003 limsoon wong Techniques & Applications of Sequence Comparison Limsoon Wong Institute for Infocomm Research 27 August 2003.
Combining Evolutionary Information Extracted From Frequency Profiles With Sequence-based Kernels For Protein Remote Homology Detection Name: ZhuFangzhi.
InterPro Sandra Orchard.
Bioinformatics Research Overview Li Liao Develop new algorithms and (statistical) learning methods > Capable of incorporating domain knowledge > Effective,
Copyright  2004 limsoon wong CS2220: Computation Foundation in Bioinformatics Limsoon Wong Institute for Infocomm Research Lecture slides for 13 January.
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong For written notes on this lecture, please read chapters 10 and 19 of The Practical Bioinformatician.
Detecting Protein Function and Protein-Protein Interactions from Genome Sequences TuyetLinh Nguyen.
Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.
Using the Fisher kernel method to detect remote protein homologies Tommi Jaakkola, Mark Diekhams, David Haussler ISMB’ 99 Talk by O, Jangmin (2001/01/16)
Sequence similarity, BLAST alignments & multiple sequence alignments
Demo: Protein Information Resource
Basics of Comparative Genomics
Techniques & Applications of Sequence Comparison
Logical Analysis and Invariants
Protein Bioinformatics Course
Dr Tan Tin Wee Director Bioinformatics Centre
Basics of Comparative Genomics
Basic Local Alignment Search Tool
Presentation transcript:

NUS-KI Course on Bioinformatics, Nov 2005 Sequence Analysis and Function Prediction Limsoon Wong

NUS-KI Course on Bioinformatics, Nov 2005 Very Brief Intro to Sequence Comparison/Alignment

NUS-KI Course on Bioinformatics, Nov 2005Copyright 2005 © Limsoon Wong Motivations for Sequence Comparison DNA is blue print for living organisms  Evolution is related to changes in DNA  By comparing DNA sequences we can infer evolutionary relationships between the sequences w/o knowledge of the evolutionary events themselves Foundation for inferring function, active site, and key mutations

NUS-KI Course on Bioinformatics, Nov 2005Copyright 2005 © Limsoon Wong Sequence Alignment Key aspect of sequence comparison is sequence alignment A sequence alignment maximizes the number of positions that are in agreement in two sequences Sequence U Sequence V mismatch match indel

NUS-KI Course on Bioinformatics, Nov 2005Copyright 2005 © Limsoon Wong Sequence Alignment: Poor Example Poor seq alignment shows few matched positions  The two proteins are not likely to be homologous No obvious match between Amicyanin and Ascorbate Oxidase

NUS-KI Course on Bioinformatics, Nov 2005Copyright 2005 © Limsoon Wong Sequence Alignment: Good Example Good alignment usually has clusters of extensive matched positions  The two proteins are likely to be homologous good match between Amicyanin and unknown M. loti protein

NUS-KI Course on Bioinformatics, Nov 2005Copyright 2005 © Limsoon Wong Multiple Alignment: An Example Multiple seq alignment maximizes number of positions in agreement across several seqs Seqs belonging to same “family” usually have more conserved positions in a multiple seq alignment Conserved sites

NUS-KI Course on Bioinformatics, Nov 2005 Application of Sequence Comparison: Guilt-by-Association

NUS-KI Course on Bioinformatics, Nov 2005Copyright 2005 © Limsoon Wong SPSTNRKYPPLPVDKLEEEINRRMADDNKLFREEFNALPACPIQATCEAASKEENKEKNR YVNILPYDHSRVHLTPVEGVPDSDYINASFINGYQEKNKFIAAQGPKEETVNDFWRMIWE QNTATIVMVTNLKERKECKCAQYWPDQGCWTYGNVRVSVEDVTVLVDYTVRKFCIQQVGD VTNRKPQRLITQFHFTSWPDFGVPFTPIGMLKFLKKVKACNPQYAGAIVVHCSAGVGRTG TFVVIDAMLDMMHSERKVDVYGFVSRIRAQRCQMVQTDMQYVFIYQALLEHYLYGDTELE VT Function Assignment to Protein Sequence How do we attempt to assign a function to a new protein sequence?

NUS-KI Course on Bioinformatics, Nov 2005Copyright 2005 © Limsoon Wong Guilt-by-Association Compare T with seqs of known function in a db Assign to T same function as homologs Confirm with suitable wet experiments Discard this function as a candidate

NUS-KI Course on Bioinformatics, Nov 2005Copyright 2005 © Limsoon Wong BLAST: How it works Altschul et al., JMB, 215: , 1990 BLAST is one of the most popular tool for doing “guilt-by-association” sequence homology search find from db seqs with short perfect matches to query seq (Exercise: Why do we need this step?) find seqs with good flanking alignment

NUS-KI Course on Bioinformatics, Nov 2005Copyright 2005 © Limsoon Wong Homologs Obtained by BLAST Thus our example sequence could be a protein tyrosine phosphatase  (PTP  )

NUS-KI Course on Bioinformatics, Nov 2005Copyright 2005 © Limsoon Wong Example Alignment with PTP 

NUS-KI Course on Bioinformatics, Nov 2005Copyright 2005 © Limsoon Wong Guilt-by-Association: Caveats Ensure that the effect of database size has been accounted for Ensure that the function of the homology is not derived via invalid “transitive assignment’’ Ensure that the target sequence has all the key features associated with the function, e.g., active site and/or domain

NUS-KI Course on Bioinformatics, Nov 2005Copyright 2005 © Limsoon Wong Interpretation of P-value Seq. comparison progs, e.g. BLAST, often associate a P-value to each hit P-value is interpreted as prob. that a random seq. has an equally good alignment Suppose the P-value of an alignment is  If database has 10 7 seqs, then you expect 10 7 * = 10 seqs in it that give an equally good alignment  Need to correct for database size if your seq. comparison prog does not do that!

NUS-KI Course on Bioinformatics, Nov 2005Copyright 2005 © Limsoon Wong Examples of Invalid Function Assignment: The IMP dehydrogenases (IMPDH) A partial list of IMPdehydrogenase misnomers in complete genomes remaining in some public databases

NUS-KI Course on Bioinformatics, Nov 2005Copyright 2005 © Limsoon Wong IMPDH Domain Structure Typical IMPDHs have 2 IMPDH domains that form the catalytic core and 2 CBS domains. A less common but functional IMPDH (E70218) lacks the CBS domains. Misnomers show similarity to the CBS domains

NUS-KI Course on Bioinformatics, Nov 2005Copyright 2005 © Limsoon Wong Invalid Transitive Assignment Mis-assignment of function A B C Root of invalid transitive assignment No IMPDH domain

NUS-KI Course on Bioinformatics, Nov 2005 Application of Sequence Comparison: Active Site/Domain Discovery

NUS-KI Course on Bioinformatics, Nov 2005Copyright 2005 © Limsoon Wong Discover Active Site and/or Domain How to discover the active site and/or domain of a function in the first place? –Multiple alignment of homologous seqs –Determine conserved positions Easier if sequences of distance homologs are used

NUS-KI Course on Bioinformatics, Nov 2005Copyright 2005 © Limsoon Wong Multiple Alignment of PTPs Notice the PTPs agree with each other on some positions more than other positions These positions are more impt wrt PTPs Else they wouldn’t be conserved by evolution  They are candidate active sites

NUS-KI Course on Bioinformatics, Nov 2005 Even Non-Homologous Sequences Help: The SVM Pairwise Approach

NUS-KI Course on Bioinformatics, Nov 2005Copyright 2005 © Limsoon Wong SVM-Pairwise Framework Training Data S1 S2 S3 … Testing Data T1 T2 T3 … Training Features S 1 S 2 S 3 … S 1 f 11 f 12 f 13 … S 2 f 21 f 22 f 23 … S 3 f 31 f 32 f 33 … … … … … … Feature Generation Trained SVM Model (Feature Weights) Training Testing Features S 1 S 2 S 3 … T 1 f 11 f 12 f 13 … T 2 f 21 f 22 f 23 … T 3 f 31 f 32 f 33 … … … … … … Feature Generation Support Vectors Machine (Radial Basis Function Kernel) Classification Discriminant Scores RBF Kernel f 31 is the local alignment score between S 3 and S 1 f 31 is the local alignment score between T 3 and S 1 Image credit: Kenny Chua

NUS-KI Course on Bioinformatics, Nov 2005Copyright 2005 © Limsoon Wong Performance of SVM-Pairwise Receiver Operating Characteristic (ROC) –The area under the curve derived from plotting true positives as a function of false positives for various thresholds. Rate of median False Positives (RFP) –The fraction of negative test examples with a score better or equals to the median of the scores of positive test examples.

NUS-KI Course on Bioinformatics, Nov 2005 What if no homolog of known function is found? Try Genome Phylogenetic Profiles!

NUS-KI Course on Bioinformatics, Nov 2005Copyright 2005 © Limsoon Wong Phylogenetic Profiling Pellegrini et al., PNAS, 96: , 1999 Gene (and hence proteins) with identical patterns of occurrence across phyla tend to function together  Even if no homolog with known function is available, it is still possible to infer function of a protein

NUS-KI Course on Bioinformatics, Nov 2005Copyright 2005 © Limsoon Wong Phylogenetic Profiling: How it Works

NUS-KI Course on Bioinformatics, Nov 2005Copyright 2005 © Limsoon Wong Phylogenetic Profiles: Evidence Pellegrini et al., PNAS, 96: , 1999 Proteins grouped based on similar keywords in SWISS-PROT have more similar phylogenetic profiles

NUS-KI Course on Bioinformatics, Nov 2005Copyright 2005 © Limsoon Wong Phylogenetic Profiling: Evidence Wu et al., Bioinformatics, 19: , 2003 Proteins having low hamming distance (thus similar phylogenetic profiles) tend to share c’mon pathways Exercise: Why do proteins having high hamming distance also have this behaviour?  KEGG COG fraction of gene pairs having hamming distance D and share a common pathway in KEGG/COG hamming distance (D) hamming distance X,Y = #lineages X occurs + #lineages Y occurs – 2 * #lineages X, Y occur

NUS-KI Course on Bioinformatics, Nov 2005 What if no homolog of known function is found? Try Neighbours of Protein Interactions! Level-1 neighbour Level-2 neighbour

NUS-KI Course on Bioinformatics, Nov 2005Copyright 2005 © Limsoon Wong An illustrative Case of Indirect Functional Association? Is indirect functional association plausible? Is it found often in real interaction data? Can it be used to improve protein function prediction from protein interaction data? SH3 Proteins SH3-Binding Proteins

NUS-KI Course on Bioinformatics, Nov 2005Copyright 2005 © Limsoon Wong YBR055C | YDR158W | |1.1.9 YJR091C | | YMR101C |42.1 YPL149W |14.4 | |42.25 | YPL088W |2.16 |1.1.9 YMR300C |1.3.1 YBL072C | YOR312C | YBL061C |1.5.4 | | | |42.1 | | YBR023C | | | |42.1 | | | YKL006W | | YPL193W | YAL012W | |1.1.9 YBR293W | |42.25 |1.1.3 |1.1.9 YLR330W |1.5.4 | | | | YLR140W YDL081C | YDR091C |1.4.1 | | | YPL013C | |42.16 YMR047C | |14.4 |16.7 | | | Freq of Indirect Functional Association 59.2% proteins in dataset share some function with level-1 neighbours 27.9% share some function with level-2 neighbours but share no function with level-1 neighbours

NUS-KI Course on Bioinformatics, Nov 2005Copyright 2005 © Limsoon Wong Over-Rep of Functions in L1 & L2 Neighbours Sensitivity vs Precision Precision Sensitivity L1 - L2 L2 - L1 L1 ∩ L2

NUS-KI Course on Bioinformatics, Nov 2005Copyright 2005 © Limsoon Wong Use L1 & L2 Neighbours for Prediction Weighted Average –Over-rep of functions in L1 and L2 neighbours –Each observation of L1 or L2 neighbour is summed S(u,v) is an “index” for function xfer betw u and v,  (k, x) = 1 if k has function x, 0 otherwise N k is the set of interacting partners of k  x is freq of function x in the dataset

NUS-KI Course on Bioinformatics, Nov 2005Copyright 2005 © Limsoon Wong Reliability of Expt Sources Diff Expt Sources have diff reliabilities –Assign reliability to an interaction based on its expt sources (Nabieva et al, 2004) Reliability betw u and v computed by: r i is reliability of expt source i, E u,v is the set of expt sources in which interaction betw u and v is observed SourceReliability Affinity Chromatography Affinity Precipitation Biochemical Assay Dosage Lethality0.5 Purified Complex Reconstituted Complex0.5 Synthetic Lethality Synthetic Rescue1 Two Hybrid

NUS-KI Course on Bioinformatics, Nov 2005Copyright 2005 © Limsoon Wong An “Index” for Function Transfer Based on Reliability of Interactions Take reliability into consideration when computing Equiv Measure: N k is the set of interacting partners of k r u,w is reliability weight of interaction betw u and v

NUS-KI Course on Bioinformatics, Nov 2005Copyright 2005 © Limsoon Wong Performance Evaluation Prediction performance improves after incorporation of L1, L2, & interaction reliability info Informative FCs Precision Sensitivity NC Chi² PRODISTIN Weighted Avg Weighted Avg R

NUS-KI Course on Bioinformatics, Nov 2005 Application of Sequence Comparison: Key Mutation Site Discovery

NUS-KI Course on Bioinformatics, Nov 2005Copyright 2005 © Limsoon Wong Identifying Key Mutation Sites K.L.Lim et al., JBC, 273: , 1998 Some PTPs have 2 PTP domains PTP domain D1 is has much more activity than PTP domain D2 Why? And how do you figure that out? Sequence from a typical PTP domain D2

NUS-KI Course on Bioinformatics, Nov 2005Copyright 2005 © Limsoon Wong Emerging Patterns of PTP D1 vs D2 Collect example PTP D1 sequences Collect example PTP D2 sequences Make multiple alignment A1 of PTP D1 Make multiple alignment A2 of PTP D2 Are there positions conserved in A1 that are violated in A2?  These are candidate mutations that cause PTP activity to weaken Confirm by wet experiments

NUS-KI Course on Bioinformatics, Nov 2005Copyright 2005 © Limsoon Wong present absent D1 D2 This site is consistently conserved in D1, but is consistently missing in D2  possible cause of D2’s loss of function This site is consistently conserved in D1, but is not consistently missing in D2  not a likely cause of D2’s loss of function Emerging Patterns of PTP D1 vs D2

NUS-KI Course on Bioinformatics, Nov 2005Copyright 2005 © Limsoon Wong D1 D2 Key Mutation Site: PTP D1 vs D2 Positions marked by “!” and “?” are likely places responsible for reduced PTP activity –All PTP D1 agree on them –All PTP D2 disagree on them

NUS-KI Course on Bioinformatics, Nov 2005Copyright 2005 © Limsoon Wong Key Mutation Site: PTP D1 vs D2 Positions marked by “!” are even more likely as 3D modeling predicts they induce large distortion to structure D1 D2

NUS-KI Course on Bioinformatics, Nov 2005Copyright 2005 © Limsoon Wong Confirmation by Mutagenesis Expt What wet experiments are needed to confirm the prediction?  Mutate E  D in D2 and see if there is gain in PTP activity  Mutate D  E in D1 and see if there is loss in PTP activity Exercise: Why do you need this 2-way expt?

NUS-KI Course on Bioinformatics, Nov 2005 Suggested Readings

NUS-KI Course on Bioinformatics, Nov 2005Copyright 2005 © Limsoon Wong References S.E.Brenner. “Errors in genome annotation”, TIG, 15: , 1999 T.F.Smith & X.Zhang. “The challenges of genome sequence annotation or `The devil is in the details’”, Nature Biotech, 15: , 1997 D. Devos & A.Valencia. “Intrinsic errors in genome annotation”, TIG, 17: , K.L.Lim et al. “Interconversion of kinetic identities of the tandem catalytic domains of receptor-like protein tyrosine phosphatase PTP- alpha by two point mutations is synergist and substrate dependent”, JBC, 273: , 1998.

NUS-KI Course on Bioinformatics, Nov 2005Copyright 2005 © Limsoon Wong References J. Park et al. “Sequence comparisons using multiple sequences detect three times as many remote homologs as pairwise methods”, JMB, 284(4): , 1998 J. Park et al. “Intermediate sequences increase the detection of homology between sequences”, JMB, 273: , 1997 S.F.Altshcul et al. “Basic local alignment search tool”, JMB, 215: , 1990 S.F.Altschul et al. “Gapped BLAST and PSI- BLAST: A new generation of protein database search programs”, NAR, 25(17): , 1997

NUS-KI Course on Bioinformatics, Nov 2005Copyright 2005 © Limsoon Wong References M. Pellegrini et al. “Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles”, PNAS, 96: , 1999 J. Wu et al. “Identification of functional links between genes using phylogenetic profiles”, Bioinformatics, 19: , 2003