Multiply Aligning RNA Sequences

Presentation transcript:

Multiply Aligning RNA Sequences -Phylogeny -SAR -Re-Sequencing Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics Program

Open Questions in Multiple Sequence Alignments
- Aligning Protein Sequences
- Aligning RNA Sequences

Accurately Aligning Protein Sequences
- Remains challenging for sequences with less than 20% identity
- Such sequences can still be structural homologues
- Correct alignments can help discover functional sites
- Expresso/3D-Coffee is currently the most accurate way of combining sequence and structural information
- Available at www.tcoffee.org

Comparing ncRNAs

ncRNAs Comparison
And ENCODE said: “nearly the entire genome may be represented in primary transcripts that extensively overlap and include many non-protein-coding regions”
- Who are they? tRNA, rRNA, snoRNAs, microRNAs, siRNAs, piRNAs, long ncRNAs (Xist, Evf, Air, CTN, PINK…)
- How many of them? An open question; 30,000 is a common guess
- Harder to detect than proteins

Detecting ncRNAs in silico: a long way to go… RNase P (not in ENCODE)

Figure (two prediction routes starting from the genome). Labels: Genome, UCSC, RNAalifold, RFAM, Search (CMsearch), prediction
Lizard        ---GG--TGGAGACTAGTCTGAATTGGGTTATGAAG--CCA--
Rat           GGCGG--GGGAGAGTAGTCTGAATTGGGTTATGAGG--CCC--
Hedgehog      GACGG--GGGAGAGTAGTCTGAATTAGGTTATGGGG--CCC--
Shrew         GACGG-CGGGAGAGTAGTCTGAATTGGGTTATGAGG--CCC--
Medaka        GTGAG--TGGAGAGTAGTCTGAATTGGGT---------TCT--
X.tropicalis  AGCGG-CGGGAGAGTAGTCTGACTTGGGTTATGAGG--TGC--
Cat           GACGG--GGGAGAGTAGTCTGAATTGGGTTATGAGGCCCCC--
Dog           -------------------------------------------
Rhesus        GGCGG--GGGAGAGTAGTCTGAATTGGGTTATGAGG--TCC--
Mouse         GGCGG--GGGAGAGTAGTCTGAATTGGGTTATGAGG--CCC--
Chimp         GGCGG--AGGAGAGTAGTCTGAATTGGGTTATGAGG--TCC--
Human         GGCGG--AGGAGAGTAGTCTGAATTGGGTTATGAGG--TCC--
TreeShrew     GCGCG--GGGAGAGTAGTCTGAATTGGGTTATGAGG--CCC--

Results for RNase P
- UCSC: predicted nothing
- RFAM: OK
- Figure panels: mammalian alignment, vertebrate alignment, structure
- Credit: Matthias Zytneki

Results for RNase P
- Better Alignments = Better Predictions
- Figure panels: Qualitative Improvement, Quantitative Improvement
- Credits: Matthias Zytneki, Thomas Derrien, Roderic Guigo, Ramin Shiekhattar

ncRNAs can have different sequences and Similar Structures

ncRNAs Can Evolve Rapidly
Figure: two hairpins with the same shape but different stem sequences.
CCAGGCAAGACGGGACGAGAGTTGCCTGG
CCTCCGTTCAGAGGTGCATAGAACGGAGG
**-------*--**---*-**------**
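The point of this slide can be checked numerically. The sketch below is an illustration, not part of the original deck: it computes the percent identity of the two sequences shown and tests whether the two ends of each sequence can pair into a stem. It assumes plain Watson-Crick pairing on the DNA alphabet used on the slide (no G-U wobble pairs), and the helper names and stem lengths (8 and 9) are chosen for this example.

```python
# Complement table for the DNA alphabet used on the slide (assumption: no wobble pairs).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def percent_identity(a: str, b: str) -> float:
    """Percentage of columns in which two equal-length sequences carry the same base."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def reverse_complement(seq: str) -> str:
    """Reverse the sequence and complement every base."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def closes_hairpin(seq: str, stem_len: int) -> bool:
    """True if the first stem_len bases can pair with the last stem_len bases."""
    return reverse_complement(seq[:stem_len]) == seq[-stem_len:]


SEQ1 = "CCAGGCAAGACGGGACGAGAGTTGCCTGG"
SEQ2 = "CCTCCGTTCAGAGGTGCATAGAACGGAGG"

print(round(percent_identity(SEQ1, SEQ2), 1))  # 10 identical columns out of 29
print(closes_hairpin(SEQ1, 8), closes_hairpin(SEQ2, 9))
```

Both sequences close a stem of similar length even though only about a third of their columns are identical, which is the situation the slide describes.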

ncRNAs are Difficult to Align
- Same structure, low sequence identity
- Small alphabet, short sequences → alignments often non-significant

Obtaining the Structure of a ncRNA is Difficult
- Hard to align the sequences without the structure
- Hard to predict the structures without an alignment

The Holy Grail of RNA Comparison: Sankoff’s Algorithm

The Holy Grail of RNA Comparison: Sankoff’s Algorithm
- Simultaneous folding and alignment
- Time complexity: O(L^{2n})
- Space complexity: O(L^{3n})
In practice, for two sequences:
Length (nt)   Time      Memory
50            1 min     6 M
100           16 min    256 M
200           4 hours   4 G
400           3 days    3 T
- Forget about multiple sequence alignments and database searches
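The running times quoted on this slide can be reproduced with simple scaling arithmetic. The sketch below is an illustration only: it takes the slide's time complexity O(L^{2n}) at face value, sets n = 2 sequences, and extrapolates from the 1 minute quoted for 50 nucleotides. The function name is made up for this example, and the memory column is not modelled.

```python
def extrapolate(base_value: float, base_len: int, new_len: int, exponent: int) -> float:
    """Scale a measured cost from base_len to new_len under a power law L**exponent."""
    return base_value * (new_len / base_len) ** exponent


N_SEQS = 2                    # pairwise case, as on the slide
TIME_EXPONENT = 2 * N_SEQS    # O(L^(2n)) as stated on the slide

for length in (50, 100, 200, 400):
    minutes = extrapolate(1.0, 50, length, TIME_EXPONENT)
    print(f"{length:4d} nt: {minutes:7.0f} min = {minutes / 60:5.1f} h = {minutes / 1440:4.2f} days")
```

Each doubling of the length multiplies the time by 16, which gives 16 minutes, roughly 4 hours and roughly 3 days, in line with the figures on the slide.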

The Next Best Thing: Consan
- Consan = Sankoff + a few constraints
- Uses Stochastic Context-Free Grammars (tree-shaped HMMs), made sparse with constraints
- The constraints are derived from the most confident positions of the alignment
- Equivalent of banded DP
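To make the "banded DP" analogy concrete, here is a minimal sketch of a banded global alignment score for plain sequences. It is not Consan and involves no grammar or folding: it only shows how restricting the dynamic programming table to cells with |i - j| <= band removes most of the work. The fixed-width band, the function name and the scoring values are assumptions for this example; Consan derives its constraints from confident alignment positions instead.

```python
def banded_alignment_score(a: str, b: str, band: int,
                           match: int = 1, mismatch: int = -1, gap: int = -2) -> float:
    """Global alignment score computed only on cells with |i - j| <= band."""
    n, m = len(a), len(b)
    if abs(n - m) > band:
        raise ValueError("band too narrow to reach the final cell")
    neg_inf = float("-inf")
    # Row 0: only leading gaps in sequence a, limited to the band.
    prev = {j: float(j * gap) for j in range(0, min(m, band) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - band), min(m, i + band) + 1):
            if j == 0:
                cur[j] = float(i * gap)  # only leading gaps in sequence b
                continue
            # Cells outside the band are treated as unreachable (-inf).
            diagonal = prev.get(j - 1, neg_inf) + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev.get(j, neg_inf) + gap
            left = cur.get(j - 1, neg_inf) + gap
            cur[j] = max(diagonal, up, left)
        prev = cur
    return prev[m]


print(banded_alignment_score("GATTACA", "GATTACA", band=1))
print(banded_alignment_score("GATTACA", "GATACA", band=1))
```

Each row stores at most 2 * band + 1 cells, so time and memory grow linearly with the sequence length for a fixed band rather than quadratically.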

Going Multiple…. Structural Aligners

Game Rules
- Using structural predictions produces better alignments but is computationally expensive
- Use as much structural information as possible while doing as little computation as possible…

Adapting T-Coffee To RNA Alignments

T-Coffee and Consistency…

Consistency: Conflicts and Information
Figure: pairwise alignments among four sequences X, Y, Z and W. In the conflicting case, "Y-Z is unhappy" and "X-W is unhappy".
- Partly consistent → less reliable
- Fully consistent → more reliable
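The consistency idea can be sketched in a few lines. The code below is an illustration under stated assumptions, not the T-Coffee implementation: it assumes the commonly described triplet rule, in which a residue pair gains the minimum of the two weights that link it through a residue of a third sequence. The function names, the residue encoding as (sequence, position) tuples and all weights are hypothetical.

```python
from collections import defaultdict


def pair_key(a, b):
    """Order-independent key for a pair of residues, each a (sequence, position) tuple."""
    return tuple(sorted((a, b)))


def extend_library(library):
    """Add triplet support to every pair (assumes each pair is listed once in the input)."""
    neighbours = defaultdict(dict)
    extended = defaultdict(float)
    for (a, b), weight in library.items():
        neighbours[a][b] = weight
        neighbours[b][a] = weight
        extended[pair_key(a, b)] += weight
    # Every residue acts as a bridge between any two of its partners.
    for _via, partners in neighbours.items():
        items = sorted(partners.items())
        for idx, (a, w_a) in enumerate(items):
            for b, w_b in items[idx + 1:]:
                if a[0] == b[0]:
                    continue  # two residues of the same sequence are never aligned
                extended[pair_key(a, b)] += min(w_a, w_b)
    return dict(extended)


X1, Y1, Y2, Z1 = ("X", 1), ("Y", 1), ("Y", 2), ("Z", 1)
library = {
    (X1, Y1): 80.0,  # supported through Z1
    (X1, Z1): 70.0,
    (Z1, Y1): 90.0,
    (X1, Y2): 20.0,  # conflicting pair, no support from Z
}
extended = extend_library(library)
print(extended[pair_key(X1, Y1)], extended[pair_key(X1, Y2)])
```

The pair confirmed by the third sequence ends up with a much higher weight than the conflicting one, which is what "fully consistent, more reliable" means in practice.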

R-Coffee: Modifying T-Coffee at the Right Place
- Incorporation of secondary structure information within the library
- Two extra components for the T-Coffee scoring scheme: a new library and a new scoring scheme

Figure (R-Coffee pipeline):
- RNA Sequences → RNAplfold → Secondary Structures
- RNA Sequences → Consan or Mafft / Muscle / ProbCons → Primary Library
- Primary Library + Secondary Structures → R-Coffee Extension → R-Coffee Extended Primary Library
- R-Coffee Extended Primary Library → R-Score → Progressive Alignment Using the R-Score

R-Coffee Extension
Figure: a TC Library entry with two C-G base pairs, one match carrying Score X and the other Score Y.
- Goal: embedding RNA structures within the T-Coffee libraries
- The R-extension can be added on top of any existing method

R-Coffee Scoring Scheme
R-Score(CC) = MAX(TC-Score(CC), TC-Score(GG))
Figure: two C-G base pairs, one per sequence.
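The formula on this slide can be written out as a short function. This is a minimal sketch, assuming each residue has at most one predicted pairing partner and that TC scores are available as a dictionary keyed by position pairs. The names `r_score`, `partner_a`, `partner_b` and all values are hypothetical.

```python
def r_score(tc_scores, partner_a, partner_b, i, j):
    """R-Score of matching position i (sequence A) with position j (sequence B).

    If both positions are base-paired, the match also inherits the TC score
    of their partners, and the larger of the two values is kept.
    """
    direct = tc_scores.get((i, j), 0.0)
    pi = partner_a.get(i)
    pj = partner_b.get(j)
    if pi is None or pj is None:
        return direct  # at least one position is unpaired: plain TC score
    return max(direct, tc_scores.get((pi, pj), 0.0))


# Sequence A: C at position 0 pairs with G at position 5.
# Sequence B: C at position 1 pairs with G at position 7.
partner_a = {0: 5, 5: 0}
partner_b = {1: 7, 7: 1}
tc_scores = {(0, 1): 30.0, (5, 7): 90.0}  # TC-Score(CC) = 30, TC-Score(GG) = 90

print(r_score(tc_scores, partner_a, partner_b, 0, 1))  # C-C match boosted by the G-G score
print(r_score(tc_scores, partner_a, partner_b, 5, 7))
```

A weakly supported C-C match is rescued when the partners of the two residues are themselves confidently aligned.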

Validating R-Coffee

RNA Alignments are Harder to Validate than Protein Alignments
- Protein alignments → use of structure-based reference alignments
- RNA alignments → no real structure-based reference alignments
- The structures are mostly predicted from sequences: circularity

BraliBase and the BraliScore
- Database of reference alignments: 388 multiple sequence alignments
- Evenly distributed between 35 and 95 percent average sequence identity
- Each contains 5 sequences selected from the RNA family database Rfam
- The reference alignment is based on an SCFG model built from the full Rfam seed dataset (~100 sequences)

BraliBase SPS Score
SPS = (Number of Identically Aligned Pairs) / (Number of Aligned Pairs)
Figure: the MSA under evaluation is compared with the RFam alignment.
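A sum-of-pairs score of this kind can be sketched as follows. The code is illustrative: it assumes that the denominator counts the aligned residue pairs of the reference alignment, which is the usual convention but is not spelled out on the slide. Alignments are given as dictionaries of gapped strings, and the function names and toy sequences are made up.

```python
from itertools import combinations


def aligned_pairs(msa):
    """Set of residue pairs placed in the same column of an alignment.

    msa maps sequence names to gapped strings of equal length; a residue is
    identified by (name, index in the ungapped sequence).
    """
    names = sorted(msa)
    seen = {name: 0 for name in names}
    pairs = set()
    length = len(msa[names[0]])
    for col in range(length):
        present = []
        for name in names:
            if msa[name][col] != "-":
                present.append((name, seen[name]))
                seen[name] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sps(test, reference):
    """Fraction of reference pairs that are aligned identically in the test alignment."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        raise ValueError("reference alignment contains no aligned pairs")
    return len(aligned_pairs(test) & ref_pairs) / len(ref_pairs)


reference = {"s1": "ACGU", "s2": "ACGU"}
test = {"s1": "ACG-U", "s2": "AC-GU"}  # the two G residues are no longer aligned
print(sps(test, reference))
```

Here three of the four reference pairs survive in the test alignment, so the score is 0.75.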

BraliBase: SCI Score
- RNApfold: each sequence is folded on its own, giving one structure such as (((…)))…((..)) and one ΔG per sequence (ΔG Seq1 … ΔG Seq6)
- RNAalifold: the alignment is folded as a whole, giving ΔG ALN from the average ΔG of the sequences and the covariance (Average ΔG Seq × Cov)
- SCI = ΔG ALN / Average ΔG Seq

BraliScore
BraliScore = SCI × SPS
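As a worked example of the product above, the sketch below combines an SCI value with an SPS value. It assumes the usual reading of the SCI as the consensus folding energy of the alignment divided by the mean folding energy of the individual sequences. The energies and the SPS value are invented numbers, not BraliBase results.

```python
def sci(consensus_energy, single_energies):
    """Structure conservation index: alignment energy over mean single-sequence energy."""
    mean_single = sum(single_energies) / len(single_energies)
    if mean_single == 0:
        raise ValueError("mean single-sequence energy is zero")
    return consensus_energy / mean_single


def braliscore(sci_value, sps_value):
    """BraliScore as defined on the slide: SCI multiplied by SPS."""
    return sci_value * sps_value


# Hypothetical folding energies in kcal/mol (negative means more stable).
single_energies = [-20.0, -25.0, -15.0]
consensus_energy = -18.0

sci_value = sci(consensus_energy, single_energies)
print(round(sci_value, 2))
print(round(braliscore(sci_value, 0.8), 2))
```

Because the two factors are multiplied, an alignment must both agree with the reference and preserve a stable consensus structure to score well.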

R-Coffee + Regular Aligners
Method         Avg Braliscore          Net Improv.
               direct   +T     +R       +T     +R
--------------------------------------------------
Poa            0.62     0.65   0.70      48    154
Pcma           0.62     0.64   0.67      34    120
Prrn           0.64     0.61   0.66     -63     45
ClustalW       0.65     0.65   0.69      -7     83
Mafft_fftns    0.68     0.68   0.72      17     68
ProbConsRNA    0.69     0.67   0.71     -49     39
Muscle         0.69     0.69   0.73     -17     42
Mafft_ginsi    0.70     0.68   0.72     -49     39
Improvement = # R-Coffee wins - # R-Coffee losses
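The "Net Improv." columns follow the wins-minus-losses rule given at the bottom of the slide. A minimal sketch of that count is shown below, assuming one score per test alignment for each method and that ties count for neither side. The function name and the per-alignment scores are invented for illustration.

```python
def net_improvement(baseline_scores, rcoffee_scores):
    """Number of datasets where R-Coffee scores higher minus those where it scores lower."""
    if len(baseline_scores) != len(rcoffee_scores):
        raise ValueError("both methods must be scored on the same datasets")
    wins = sum(r > b for b, r in zip(baseline_scores, rcoffee_scores))
    losses = sum(r < b for b, r in zip(baseline_scores, rcoffee_scores))
    return wins - losses


# Hypothetical per-alignment Braliscores for a baseline aligner and its +R variant.
baseline = [0.60, 0.72, 0.55, 0.80, 0.66]
with_r = [0.65, 0.70, 0.61, 0.80, 0.71]

print(net_improvement(baseline, with_r))  # 3 wins, 1 loss, 1 tie
```

A positive value means the R-Coffee variant wins on more alignments than it loses, regardless of how large the individual differences are.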

RM-Coffee + Regular Aligners
Method         Avg Braliscore          Net Improv.
               direct   +T     +R       +T     +R
--------------------------------------------------
Poa            0.62     0.65   0.70      48    154
Pcma           0.62     0.64   0.67      34    120
Prrn           0.64     0.61   0.66     -63     45
ClustalW       0.65     0.65   0.69      -7     83
Mafft_fftns    0.68     0.68   0.72      17     68
ProbConsRNA    0.69     0.67   0.71     -49     39
Muscle         0.69     0.69   0.73     -17     42
Mafft_ginsi    0.70     0.68   0.72     -49     39
RM-Coffee4     0.71     /      0.74       /     84

R-Coffee + Structural Aligners
Method         Avg Braliscore          Net Improv.
               direct   +T     +R       +T     +R
--------------------------------------------------
Stemloc        0.62     0.75   0.76     104    113
Mlocarna       0.66     0.69   0.71     101    133
Murlet         0.73     0.70   0.72    -132    -73
Pmcomp         0.73     0.73   0.73     142    145
T-Lara         0.74     0.74   0.69     -36     -8
Foldalign      0.75     0.77   0.77      72     73
Dynalign       ---      0.63   0.62     ---    ---
Consan         ---      0.79   0.79     ---    ---
RM-Coffee4     0.71     /      0.74       /     84

How Best is the Best…
Method vs.     R-Coffee-Consan   RM-Coffee4
Poa            241 ***           217 ***
T-Coffee       199 ***
Prrn           232 ***           198 ***
Pcma           218 ***           151 ***
Proalign       216 ***           150 **
Mafft fftns    206 ***           148 *
ClustalW       203 ***           136 ***
Probcons       192 ***           128 *
Mafft ginsi    170 ***           115
Muscle         169 ***           111
M-Locarna      234 ***           183 **
Stral          169 ***           62
FoldalignM     146               61
Murlet         130 *             -12
Rnasampler     129 *             -27
T-Lara         125 *             -30

Range of Performances Effect of Compensated Mutations

Split Alignments and RNA
- Few of the new long RNAs are reported with a secondary structure
- Two explanations: they do not have a secondary structure, or it is hard to predict the structure
- To predict the structure, one needs homologues to build an MSA
- To find homologues, one needs to find them

Split Alignments and RNA
- Protein split alignments
- Guided by primary structure
Figure labels: Transcript, genome

Split Alignments and RNA
CCAGGCAAGACGGGACGAGAGTTGCCTGG
CCTCCGTTC AGAGGTGCATA GAACGGAGG

Split Alignments and RNA
- Homology appears through secondary structures
- One needs to evaluate all possible secondary structures
- Very computationally intensive

Conclusion/Future Directions
- T-Coffee/Consan is currently the best MSA protocol for ncRNAs
- Testing how important the accuracy of the secondary structure prediction is
- Going deeper into Sankoff’s territory: predicting and aligning simultaneously
- Solving the split alignment problem

Credits and Web Servers
- Andreas Wilm (UCD), Des Higgins (UCD)
- Sebastien Moretti (SIB), Ioannis Xenarios (SIB)
- Matthias Zytneki (CRG), Thomas Derrien (CRG), Roderic Guigo (CRG), Ramin Shiekhattar (CRG)
- CRG, SIB, UCD
- www.tcoffee.org