ncRNA Multiple Alignments with R-Coffee

Presentation transcript:

ncRNA Multiple Alignments with R-Coffee
Laundering the Genome Dark Matter
Cédric Notredame
Comparative Bioinformatics Group, Bioinformatics and Genomics Program

No Plane Today…

ncRNAs Comparison
And ENCODE said… “nearly the entire genome may be represented in primary transcripts that extensively overlap and include many non-protein-coding regions”
Who are they? tRNA, rRNA, snoRNAs, microRNAs, siRNAs, piRNAs, long ncRNAs (Xist, Evf, Air, CTN, PINK…)
How many of them? Open question; 30,000 is a common guess
Harder to detect than proteins

ncRNAs can have different sequences and Similar Structures

ncRNAs Can Evolve Rapidly
[Figure: two hairpins, one with stem GAACGGACC / CTTGCCTGG and one with stem CTTGCCTCC / GAACGGAGG]
CCAGGCAAGACGGGACGAGAGTTGCCTGG
CCTCCGTTCAGAGGTGCATAGAACGGAGG
**-------*--**---*-**------**
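The two sequences on this slide can be checked by hand, but a short script makes the point explicit. The sketch below recomputes the conservation line and counts stem positions where both sides of a base pair changed while pairing was preserved. The function names and the stem length of 8 are illustrative assumptions, not part of the slide, and only Watson-Crick pairs are considered.

```python
# Minimal sketch: conservation line and compensated base pairs for the slide's two sequences.
# Assumptions (not from the slide): helper names, stem length of 8, Watson-Crick pairs only.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def conservation_line(a: str, b: str) -> str:
    """Return '*' for identical columns and '-' for differing columns."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return "".join("*" if x == y else "-" for x, y in zip(a, b))

def compensated_pairs(a: str, b: str, stem_len: int) -> int:
    """Count stem pairs (i, n-1-i) that pair in both sequences although both columns differ."""
    n = len(a)
    count = 0
    for i in range(stem_len):
        j = n - 1 - i
        both_paired = COMPLEMENT[a[i]] == a[j] and COMPLEMENT[b[i]] == b[j]
        both_changed = a[i] != b[i] and a[j] != b[j]
        if both_paired and both_changed:
            count += 1
    return count

seq1 = "CCAGGCAAGACGGGACGAGAGTTGCCTGG"
seq2 = "CCTCCGTTCAGAGGTGCATAGAACGGAGG"

line = conservation_line(seq1, seq2)
identity = line.count("*") / len(line)
compensated = compensated_pairs(seq1, seq2, stem_len=8)
print(line)
print(f"identity: {identity:.0%}, compensated stem pairs: {compensated}")
```

Only about a third of the columns are identical, yet most of the stem pairs survive through paired substitutions, which is why sequence identity alone understates the relationship.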

ncRNAs are Difficult to Align
Same structure, low sequence identity
Small alphabet, short sequences → alignments often non-significant

Obtaining the Structure of a ncRNA is Difficult
Hard to align the sequences without the structure
Hard to predict the structures without an alignment

The Holy Grail of RNA Comparison: Sankoff's Algorithm
Simultaneous folding and alignment
Time complexity: O(L^(2n))
Space complexity: O(L^(3n))
In practice, for two sequences:
  Length (nt)   Time      Memory
  50            1 min     6 M
  100           16 min    256 M
  200           4 hours   4 G
  400           3 days    3 T
Forget about: multiple sequence alignments, database searches
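The running times quoted on this slide can be sanity-checked against the stated time exponent. The sketch below takes the slide's time complexity at face value (exponent 2n, so L^4 for two sequences) and extrapolates from the 50-nucleotide timing; the function name and parameters are illustrative assumptions.

```python
# Minimal sketch: extrapolate run time from the slide's stated time exponent.
# Assumptions (not from the slide): the helper name and the idea of scaling
# from the 50-nucleotide data point.

def scaling_factor(length_ratio: float, n_sequences: int, exponent_per_sequence: int) -> float:
    """Factor by which the cost grows when sequence length grows by length_ratio."""
    return length_ratio ** (exponent_per_sequence * n_sequences)

base_length = 50      # nucleotides
base_minutes = 1.0    # quoted time for two 50-nucleotide sequences

predicted = {}
for length in (50, 100, 200, 400):
    factor = scaling_factor(length / base_length, n_sequences=2, exponent_per_sequence=2)
    predicted[length] = base_minutes * factor
    print(f"{length:4d} nt: {predicted[length]:7.0f} min")
```

Each doubling of the length multiplies the time by 16, giving 16 min, 256 min (a little over 4 hours) and 4096 min (just under 3 days), in line with the quoted figures.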

The Next Best Thing: Consan
Consan = Sankoff + a few constraints
Use of Stochastic Context-Free Grammars (tree-shaped HMMs), made sparse with constraints
The constraints are derived from the most confident positions of the alignment
Equivalent of banded DP
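To make the banded-DP analogy concrete, here is a minimal sketch of banding applied to ordinary pairwise global alignment. It is not Consan and involves no grammar or structure; it only shows how restricting the computation to cells near the diagonal prunes the search space. The scoring values and the function name are assumptions chosen for illustration.

```python
# Minimal sketch of banded dynamic programming on plain sequence alignment.
# This illustrates the banding idea only; it is NOT Consan's constrained SCFG.
# Assumptions: function name, unit scores, and a fixed band around the main diagonal.

def banded_global_score(a: str, b: str, band: int,
                        match: int = 1, mismatch: int = -1, gap: int = -1) -> float:
    """Global alignment score computed only for cells with |i - j| <= band."""
    n, m = len(a), len(b)
    if abs(n - m) > band:
        raise ValueError("band too narrow to reach the final cell")
    neg_inf = float("-inf")
    # Full-length rows are allocated for clarity; only cells inside the band are filled.
    prev = [neg_inf] * (m + 1)
    for j in range(min(m, band) + 1):
        prev[j] = j * gap
    for i in range(1, n + 1):
        cur = [neg_inf] * (m + 1)
        for j in range(max(0, i - band), min(m, i + band) + 1):
            if j == 0:
                cur[j] = i * gap
                continue
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = cur[j - 1] + gap   # gap in a
            cur[j] = max(diag, up, left)
        prev = cur
    return prev[m]

narrow = banded_global_score("ACGT", "AGT", band=1)
wide = banded_global_score("ACGT", "AGT", band=4)
print(narrow, wide)
```

When the optimal path stays inside the band, the narrow and wide computations agree, while the narrow one visits far fewer cells on long sequences.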

Going Multiple…. Structural Aligners

Game Rules
Using structural predictions produces better alignments but is computationally expensive
Use as much structural information as possible while doing as little computation as possible…

Adapting T-Coffee To RNA Alignments

T-Coffee and Consistency…

Consistency: Conflicts and Information
[Figure: pairwise alignments among residues X, Y, Z and W; in the conflicting cases "Y is unhappy" or "X is unhappy"]
Partly consistent → less reliable
Fully consistent → more reliable
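The consistency idea can be sketched as a re-weighting of residue pairs by the support they receive through third sequences. The code below is a simplified illustration in the spirit of T-Coffee's library extension: each pair keeps its own weight and gains the minimum of the two weights along every supporting path through another sequence. The data layout, names and weights are hypothetical, and unlike a full extension this sketch only re-scores pairs already present in the library.

```python
# Minimal sketch of consistency-based re-weighting (triplet support).
# Assumptions: residues are (sequence_name, position) tuples, the library is a dict of
# pair -> weight, and all weights below are made-up illustration values.
from collections import defaultdict

def extend_library(library):
    """Add, for each pair (u, v), min(W(u, z), W(z, v)) over residues z of other sequences."""
    neighbours = defaultdict(dict)
    for (u, v), weight in library.items():
        neighbours[u][v] = weight
        neighbours[v][u] = weight
    extended = {}
    for (u, v), weight in library.items():
        total = weight
        for z, w_uz in neighbours[u].items():
            if z[0] in (u[0], v[0]):
                continue  # the intermediate residue must come from a third sequence
            w_zv = neighbours[z].get(v)
            if w_zv is not None:
                total += min(w_uz, w_zv)
        extended[(u, v)] = total
    return extended

library = {
    (("X", 1), ("Y", 1)): 80,   # supported through Z
    (("X", 1), ("Z", 1)): 60,
    (("Z", 1), ("Y", 1)): 70,
    (("X", 1), ("Y", 2)): 80,   # conflicting pair with no third-sequence support
}
extended = extend_library(library)
for pair, weight in extended.items():
    print(pair, weight)
```

The two X-to-Y pairs start with the same weight, but only the one that agrees with the alignments through Z is reinforced, which is the sense in which fully consistent pairs are more reliable.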

R-Coffee: Modifying T-Coffee at the Right Place
Incorporation of secondary structure information within the library
Two extra components for the T-Coffee scoring scheme:
  A new library
  A new scoring scheme

[Figure: R-Coffee flowchart. RNA Sequences → RNAplfold → Secondary Structures. RNA Sequences → Consan or Mafft / Muscle / ProbCons → Primary Library. Primary Library + Secondary Structures → R-Coffee Extension → R-Coffee Extended Primary Library → R-Score → Progressive Alignment Using The R-Score.]

R-Coffee Extension
[Figure: TC Library entries for two C-G base pairs; the aligned G residues carry Score X and the aligned C residues carry Score Y]
Goal: embedding RNA structures within the T-Coffee libraries
The R-extension can be added on top of any existing method.

R-Coffee Scoring Scheme
R-Score(CC) = MAX(TC-Score(CC), TC-Score(GG))
[Figure: two C-G base pairs, one in each sequence]
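The scoring rule on this slide can be written as a small function: the score for aligning two paired residues is the larger of their own T-Coffee score and the T-Coffee score of their base-pairing partners. The sketch below assumes a hypothetical `tc_score` lookup and per-sequence partner maps; falling back to the plain T-Coffee score for unpaired residues is also an assumption, not something the slide states.

```python
# Minimal sketch of the R-Score rule: max of the direct score and the partners' score.
# Assumptions: tc_score is any callable (i, j) -> number; partner_a / partner_b map a
# residue index to the index of its predicted base-pairing partner; unpaired residues
# fall back to the direct T-Coffee score.

def r_score(i, j, partner_a, partner_b, tc_score):
    """Score for aligning residue i of sequence A with residue j of sequence B."""
    direct = tc_score(i, j)
    pi, pj = partner_a.get(i), partner_b.get(j)
    if pi is None or pj is None:
        return direct
    return max(direct, tc_score(pi, pj))

# Toy example: C at index 0 pairs with G at index 5 in both sequences.
partner_a = {0: 5, 5: 0}
partner_b = {0: 5, 5: 0}
toy_scores = {(0, 0): 30, (5, 5): 90, (2, 2): 50}

def tc_score(i, j):
    return toy_scores.get((i, j), 0)

score_cc = r_score(0, 0, partner_a, partner_b, tc_score)   # C-C lifted by its G-G partners
score_unpaired = r_score(2, 2, partner_a, partner_b, tc_score)
print(score_cc, score_unpaired)
```

A weakly supported C-C match inherits the stronger support of its G-G partners, so evidence from one side of a stem propagates to the other.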

Validating R-Coffee

RNA Alignments are Harder to Validate than Protein Alignments
Protein alignments → use of structure-based reference alignments
RNA alignments → no real structure-based reference alignments
The structures are mostly predicted from sequences → circularity

BraliBase and the BraliScore
Database of reference alignments
388 multiple sequence alignments
Evenly distributed between 35 and 95 percent average sequence identity
Each contains 5 sequences selected from the RNA family database Rfam
The reference alignment is based on a SCFG model built from the full Rfam seed dataset (~100 sequences)

BraliBase SPS Score
[Figure: the Rfam reference alignment compared with the evaluated MSA]
SPS = Number of Identically Aligned Pairs / Number of Aligned Pairs
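A sum-of-pairs comparison of this kind can be sketched in a few lines. The code below collects the residue pairs aligned in each alignment and divides the number shared with the reference by a total. Taking the denominator to be the number of aligned pairs in the reference alignment is an assumption, as are the function names and the toy alignments.

```python
# Minimal sketch of a sum-of-pairs score (SPS) between a test MSA and a reference MSA.
# Assumptions: alignments are lists of equal-length gapped strings in the same sequence
# order, '-' is the gap symbol, and the denominator counts pairs in the reference.
from itertools import combinations

def aligned_pairs(alignment):
    """Return the set of residue pairs ((seq, pos), (seq, pos)) that share a column."""
    pairs = set()
    next_residue = [0] * len(alignment)
    for col in range(len(alignment[0])):
        present = []
        for seq_index, row in enumerate(alignment):
            if row[col] != "-":
                present.append((seq_index, next_residue[seq_index]))
                next_residue[seq_index] += 1
        pairs.update(combinations(present, 2))
    return pairs

def sps(test, reference):
    """Fraction of reference residue pairs that are aligned identically in the test MSA."""
    reference_pairs = aligned_pairs(reference)
    return len(aligned_pairs(test) & reference_pairs) / len(reference_pairs)

reference = ["AC-G", "ACTG"]
test = ["ACG-", "ACTG"]
score = sps(test, reference)
print(f"SPS = {score:.2f}")
```

In the toy case the test alignment misplaces one G, so two of the three reference pairs are recovered.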

BraliBase: SCI Score
[Figure: each sequence (Seq1 to Seq6) is folded on its own with RNApfold, giving one ΔG per sequence; the alignment (ALN) is folded with RNAlifold, which combines the average ΔG of the sequences with a covariance (Cov) term]
SCI = ΔG(ALN) / average ΔG(Seq X)

BraliScore
Braliscore = SCI * SPS
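Combining the two measures is a one-line product, sketched below together with an SCI helper. The SCI helper assumes the usual reading of the structure conservation index, namely the folding energy of the alignment divided by the mean folding energy of the individual sequences; the energies are made-up numbers supplied by hand, and no folding program is called.

```python
# Minimal sketch: SCI from precomputed folding energies, then Braliscore = SCI * SPS.
# Assumptions: energies (kcal/mol) are computed elsewhere by a folding program;
# the values below are invented for illustration only.

def sci(alignment_energy: float, sequence_energies: list) -> float:
    """Alignment (consensus) folding energy divided by the mean single-sequence energy."""
    mean_energy = sum(sequence_energies) / len(sequence_energies)
    if mean_energy == 0:
        raise ValueError("mean single-sequence energy is zero; SCI is undefined")
    return alignment_energy / mean_energy

def braliscore(sci_value: float, sps_value: float) -> float:
    """Product of structural conservation (SCI) and sequence-level agreement (SPS)."""
    return sci_value * sps_value

example_sci = sci(-9.0, [-10.0, -12.0, -8.0])
example_score = braliscore(example_sci, 0.8)
print(f"SCI = {example_sci:.2f}, Braliscore = {example_score:.2f}")
```

Because the two factors are multiplied, an alignment scores well only if it both agrees with the reference and supports a conserved fold.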

R-Coffee + Regular Aligners
Method          Avg Braliscore           Net Improv.
                direct   +T     +R       +T     +R
-----------------------------------------------------
Poa             0.62     0.65   0.70      48    154
Pcma            0.62     0.64   0.67      34    120
Prrn            0.64     0.61   0.66     -63     45
ClustalW        0.65     0.65   0.69      -7     83
Mafft_fftnts    0.68     0.68   0.72      17     68
ProbConsRNA     0.69     0.67   0.71     -49     39
Muscle          0.69     0.69   0.73     -17     42
Mafft_ginsi     0.70     0.68   0.72     -49     39
Improvement = # R-Coffee wins - # R-Coffee losses
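The net improvement measure defined on this slide is a simple win/loss tally across test sets. The sketch below counts the datasets where the combined method scores higher than the direct method, subtracts those where it scores lower, and ignores ties. The function name and the per-dataset scores are hypothetical.

```python
# Minimal sketch of "Improvement = wins - losses" across a benchmark.
# Assumptions: one score per dataset for each method, listed in the same dataset order;
# the example scores are invented for illustration.

def net_improvement(direct_scores, combined_scores):
    """Number of datasets where the combined method wins minus those where it loses."""
    if len(direct_scores) != len(combined_scores):
        raise ValueError("both methods need one score per dataset")
    wins = sum(c > d for d, c in zip(direct_scores, combined_scores))
    losses = sum(c < d for d, c in zip(direct_scores, combined_scores))
    return wins - losses

direct = [0.60, 0.70, 0.80, 0.50]
combined = [0.65, 0.70, 0.75, 0.60]
improvement = net_improvement(direct, combined)
print(improvement)
```

A tally like this complements the average score: a method can raise the average through a few large gains while still losing on most datasets, and the net count exposes that.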

RM-Coffee + Regular Aligners
Method          Avg Braliscore           Net Improv.
                direct   +T     +R       +T     +R
-----------------------------------------------------
Poa             0.62     0.65   0.70      48    154
Pcma            0.62     0.64   0.67      34    120
Prrn            0.64     0.61   0.66     -63     45
ClustalW        0.65     0.65   0.69      -7     83
Mafft_fftnts    0.68     0.68   0.72      17     68
ProbConsRNA     0.69     0.67   0.71     -49     39
Muscle          0.69     0.69   0.73     -17     42
Mafft_ginsi     0.70     0.68   0.72     -49     39
RM-Coffee4      0.71     /      0.74       /     84

R-Coffee + Structural Aligners
Method          Avg Braliscore           Net Improv.
                direct   +T     +R       +T     +R
-----------------------------------------------------
Stemloc         0.62     0.75   0.76     104    113
Mlocarna        0.66     0.69   0.71     101    133
Murlet          0.73     0.70   0.72    -132    -73
Pmcomp          0.73     0.73   0.73     142    145
T-Lara          0.74     0.74   0.69     -36     -8
Foldalign       0.75     0.77   0.77      72     73
Dyalign         ---      0.63   0.62     ---    ---
Consan          ---      0.79   0.79     ---    ---
RM-Coffee4      0.71     /      0.74       /     84

How Best is the Best…
Method (vs. R-Coffee-Consan, vs. RM-Coffee4)
Poa: 241 ***, 217 ***
T-Coffee: 199 ***
Prrn: 232 ***, 198 ***
Pcma: 218 ***, 151 ***
Proalign: 216 ***, 150 **
Mafft fftns: 206 ***, 148 *
ClustalW: 203 ***, 136 ***
Probcons: 192 ***, 128 *
Mafft ginsi: 170 ***, 115
Muscle: 169 ***, 111
M-Locarna: 234 ***, 183 **
Stral: 169 ***, 62
FoldalignM: 146, 61
Murlet: 130 *, -12
Rnasampler: 129 *, -27
T-Lara: 125 *, -30

Range of Performances Effect of Compensated Mutations

Conclusion/Future Directions
T-Coffee/Consan is currently the best MSA protocol for ncRNAs
Testing how important the accuracy of the secondary structure prediction is
Going deeper into Sankoff's territory: predicting and aligning simultaneously

Credits and Web Servers
Andreas Wilm, Des Higgins, Sebastien Moretti, Ioannis Xenarios, Cedric Notredame
CGR, SIB, UCD
www.tcoffee.org
cedric.notredame@europe.com