Dr Tan Tin Wee Director Bioinformatics Centre

Presentation transcript:

Basic Overview of Bioinformatics Tools and Biocomputing Applications III. Dr Tan Tin Wee, Director, Bioinformatics Centre.

More BioComputational Tools: phylogenetic analysis; multiple sequence alignment; profile searching; sensitivity, specificity and probabilities in the prediction of function.

Phylogenetic Analysis. Assumption: evolutionary descent with divergence, represented as a phylogenetic tree, which may be rooted or unrooted. [Figure: tree diagram with species X, Y, A and B]

Rooted and Unrooted Trees. Rooted: the ancestral state of the evolved organism or gene is known; the tree branches at bifurcation points until it reaches the terminal branches (tips or leaves). Unrooted: the tree represents branching order but does not indicate the position of the last common ancestor.

Phylogenetic inference for genes. The field is in its infancy and is still an inexact science. The computational tools are based on general mathematical and statistical principles. Phylogenetic reconstructions may conflict with common sense, for example because of incorrect sequence alignments or inadequate models. Sites within a sequence do not all evolve at the same rate (unequal rate effects).

Some algorithms: maximum parsimony; maximum likelihood; distance methods, e.g. UPGMA and paralinear (LogDet) distances. Software packages: PAUP (Phylogenetic Analysis Using Parsimony), PHYLIP (Phylogeny Inference Package), MacClade, GAMBIT, MEGA/METREE.
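As an illustration of the distance methods listed above, here is a minimal UPGMA sketch in Python. It is not taken from the slides or from any of the named packages: the function name `upgma`, the nested-tuple tree representation and the toy distances are all assumptions made for the example, and branch lengths are left out for brevity.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster taxa by UPGMA and return the topology as nested tuples.

    names: list of taxon labels (hypothetical example input)
    dist:  dict mapping frozenset({a, b}) to the distance between a and b
    """
    # Each current cluster is keyed by its subtree; the value is its leaf count.
    sizes = {name: 1 for name in names}
    d = dict(dist)
    while len(sizes) > 1:
        # Join the two clusters that are currently closest.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        na, nb = sizes.pop(a), sizes.pop(b)
        # Distance from the new cluster to every other cluster is the
        # average of the old distances, weighted by cluster size.
        for c in sizes:
            d[frozenset((merged, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        sizes[merged] = na + nb
    return next(iter(sizes))


# Toy distances (invented): A and B are closest, then C, then D.
toy = {
    frozenset("AB"): 2, frozenset("AC"): 6, frozenset("BC"): 6,
    frozenset("AD"): 10, frozenset("BD"): 10, frozenset("CD"): 10,
}
tree = upgma(["A", "B", "C", "D"], toy)
print(tree)
```

UPGMA assumes a constant rate of change along all lineages, which is one reason the unequal rate effects mentioned earlier can mislead it.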

Limitations. Results depend on inspection of the sequence alignments and on removal of deviant sequences from the phylogenetic inference. Different genes, when analysed, can produce different trees. "Bootstrapping" is used to estimate statistical significance, but the results may still be misinterpreted.
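The resampling step behind "bootstrapping" can be sketched as follows. This is only an assumed, simplified illustration: the helper name `bootstrap_alignment` and the toy alignment are invented, and a real analysis would rebuild a tree from each replicate and count how often each grouping reappears.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate of an alignment.

    alignment: list of equal-length aligned sequences (strings)
    rng:       a random.Random instance, passed in for reproducibility
    """
    length = len(alignment[0])
    # Draw column indices with replacement, as many as there are columns.
    cols = [rng.randrange(length) for _ in range(length)]
    # Apply the same column sample to every sequence.
    return ["".join(seq[i] for i in cols) for seq in alignment]


aln = ["ACDE", "ACDF", "AC-G"]  # toy alignment, invented for the example
replicate = bootstrap_alignment(aln, random.Random(0))
print(replicate)
```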

Uses. Molecular taxonomy: 16S and 23S rRNA analysis for bacterial classification; 18S rRNA analysis of nematodes and Drosophila; epidemiological analysis of strain variation, e.g. in infectious pathogens.

Multiple Sequence Analysis. Gather a set of sequences of putative similarity or homology. Perform pairwise comparisons within the set. Build a "tree" of similarity. Realign all sequences based on the "ancestral" sequence, padding with gaps where needed. The resulting alignment is used for generating "profiles".
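The pairwise comparison step can be sketched as a simple distance calculation. This is a simplification under a stated assumption: each pair of sequences has already been aligned to equal length, whereas a real tool such as CLUSTALW performs the pairwise alignment itself. The names `p_distance` and `distance_matrix` and the toy sequences are invented for the example.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences.

    Positions where either sequence has a gap ('-') are ignored.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable positions")
    return sum(1 for x, y in pairs if x != y) / len(pairs)


def distance_matrix(seqs):
    """Map each unordered pair of names to its p-distance."""
    return {
        frozenset((m, n)): p_distance(seqs[m], seqs[n])
        for m, n in combinations(seqs, 2)
    }


seqs = {"s1": "ACDE", "s2": "ACDF", "s3": "AC-G"}  # toy data
matrix = distance_matrix(seqs)
print(matrix[frozenset(("s1", "s2"))])
```

A matrix of this kind is the usual input for building the guide "tree" of similarity.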

Use. Detection of conserved and variable regions. Inference of gene function. Variable segments may be inferred to be dispensable to function, or to be antigenic variants. Motifs can be used to analyse an unknown sequence and infer its possible function or relatedness, and can serve as a basis for annotating genome project sequences.
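One simple way to flag conserved and variable regions is a per-column conservation score. The sketch below is an assumed illustration, not the method used by any package named in the slides: `column_conservation` and the toy alignment are invented, and the score is just the fraction of sequences that share the most common residue in a column.

```python
from collections import Counter


def column_conservation(alignment):
    """Score each alignment column from 0.0 (variable) to 1.0 (conserved)."""
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]  # ignore gaps
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        # Dividing by the full column height penalises gapped columns.
        scores.append(top_count / len(column))
    return scores


aln = ["ACDE", "ACDF", "AC-G"]  # toy alignment
print(column_conservation(aln))
```

In this toy alignment the first two columns are fully conserved, while the last column is variable.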

Software. CLUSTALW. Profile software based on statistical models such as hidden Markov models (HMMs), e.g. HMMer, HMMPro, META-MEME, PROBE, BLOCKS.

Example. The C. elegans genome project found several large gene families related by sequence homology but of unknown function. They are now classified as putative G-protein coupled receptors (GPCRs). This required detecting significant similarity between the putative worm GPCRs and experimentally known GPCRs in other species.

Process. Select a typical unknown sequence. Run a BLAST search against the nr database. Inspect the hits and E-values. The top-scoring hit is mitochondrial ribosomal protein L11, with E = 0.002 (not low enough to be trusted for annotation). The rest of the top scorers are all nematode-specific unknown sequences. Compare with PSI-BLAST iterative searching at NCBI: is the similarity with mammalian GPCRs or with the high-scoring mt rL11 protein?

Further analysis. Gather all nematode-specific sequences from the WormPep database of non-redundant sequences. Discard abnormally long or short sequences. Perform a multiple sequence alignment using CLUSTALW. Generate a profile of the multiple alignment using HMMer. Use the profile to search the database again.
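The step of discarding abnormally long or short sequences could look something like the sketch below. The slides do not say what counts as "abnormal", so the rule used here (keep sequences within 50% of the median length) is purely an assumption, as are the function name `filter_by_length` and the toy records.

```python
from statistics import median


def filter_by_length(seqs, tolerance=0.5):
    """Keep sequences whose length is within `tolerance` of the median.

    seqs:      dict mapping sequence name to sequence string
    tolerance: assumed cutoff, as a fraction of the median length
    """
    med = median(len(s) for s in seqs.values())
    low, high = med * (1 - tolerance), med * (1 + tolerance)
    return {name: s for name, s in seqs.items() if low <= len(s) <= high}


# Toy records with invented names and lengths.
records = {
    "w1": "M" * 100,
    "w2": "M" * 110,
    "w3": "M" * 90,
    "fragment": "M" * 10,
    "fusion": "M" * 400,
}
kept = filter_by_length(records)
print(sorted(kept))
```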

Results. Similarity at a significant level is detected with mammalian GPCRs. The L11 protein is also found to have a very significant score, E = 5 × 10^-49. This shows a pitfall of PSI-BLAST: the significance of a match to the training set during iteration. Finally, the L11 protein may be wrongly annotated, with an annotation that is not based on experimental results.

A. Sensitivity and Specificity of a Fairly Good Test (N = 100)

                        Known gold standard
Exptal test result      +ve      -ve
+ve                     70       2
-ve                     3        25

Total real +ve = 73; total real -ve = 27.
Sensitivity = 70/(70+3) = 0.96: picks up 70 of the 73 known positives, so quite sensitive, with few false negatives.
Specificity = 25/(2+25) = 0.93: picks up 25 of the 27 negatives, so very specific, with few false positives.

B. Increase Sensitivity but Lower Specificity of a Test (N = 100)

                        Known gold standard
Exptal test result      +ve      -ve
+ve                     72       13
-ve                     1        14

Total real +ve = 73; total real -ve = 27.
Sensitivity = 72/(72+1) = 0.99: picks up 72 of the 73 known positives, so very sensitive, with few false negatives.
Specificity = 14/(13+14) = 0.52: picks up only 14 of the 27 negatives, so not very specific, with many false positives.

C. Increase Specificity of a Test but Sensitivity May Drop (N = 100)

                        Known gold standard
Exptal test result      +ve      -ve
+ve                     50       0
-ve                     23       27

Total real +ve = 73; total real -ve = 27.
Specificity = 27/(0+27) = 1.0: picks up 27 of the 27 negatives, so completely specific. Raising the threshold until there are zero false positives makes the true positives drop.
Sensitivity = 50/(50+23) = 0.68: picks up only 50 of the 73 known positives, so not very sensitive, with many false negatives.
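The arithmetic in cases A, B and C can be checked with a few lines of Python. The counts come from the three tables; the zero false positives in case C is inferred from the "(0+27)" in its specificity formula. The function and variable names are just choices made for this sketch.

```python
def sensitivity(tp, fn):
    """Fraction of real positives that the test picks up."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of real negatives that the test correctly rejects."""
    return tn / (tn + fp)


# tp/fp/fn/tn counts for each case, N = 100.
cases = {
    "A": {"tp": 70, "fp": 2, "fn": 3, "tn": 25},
    "B": {"tp": 72, "fp": 13, "fn": 1, "tn": 14},
    "C": {"tp": 50, "fp": 0, "fn": 23, "tn": 27},  # fp = 0 inferred
}

for name, c in cases.items():
    sens = sensitivity(c["tp"], c["fn"])
    spec = specificity(c["tn"], c["fp"])
    print(f"Case {name}: sensitivity={sens:.2f} specificity={spec:.2f}")
```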

Trade-off involved. If the threshold of the test is set high, so that all the noise disappears, you may also miss some true positives and get many false negatives, so the test is less sensitive (case C). If the threshold is set low, so that you capture as many of the positives as you can (i.e. high sensitivity), non-specific false positive hits start appearing (case B).
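The trade-off can be seen by sweeping a cutoff over a set of scored items. The scores and labels below are invented toy data, not results from the slides, and `rates_at_threshold` is a hypothetical helper; an item is called positive when its score is at or above the threshold.

```python
def rates_at_threshold(scored, threshold):
    """Return (sensitivity, specificity) for a given score cutoff.

    scored: list of (score, is_real_positive) pairs
    """
    tp = sum(1 for s, pos in scored if pos and s >= threshold)
    fn = sum(1 for s, pos in scored if pos and s < threshold)
    tn = sum(1 for s, pos in scored if not pos and s < threshold)
    fp = sum(1 for s, pos in scored if not pos and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Invented toy data: four real positives and four real negatives.
toy_scores = [
    (0.9, True), (0.8, True), (0.6, True), (0.4, True),
    (0.7, False), (0.5, False), (0.3, False), (0.1, False),
]

for cutoff in (0.2, 0.45, 0.75):
    sens, spec = rates_at_threshold(toy_scores, cutoff)
    print(f"cutoff={cutoff}: sensitivity={sens:.2f} specificity={spec:.2f}")
```

With the low cutoff every real positive is caught but most negatives slip through, as in case B; with the high cutoff no negatives slip through but half of the positives are missed, as in case C.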

Computational Predictions of Gene Function. Sensitivity and specificity have similar trade-offs here. Cutoff threshold values have to be determined empirically or chosen arbitrarily, depending on the situation.