Finding, Aligning and Analyzing Non Coding RNAs Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics Program.


1 Finding, Aligning and Analyzing Non Coding RNAs Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics Program

2 They are Everywhere…
And ENCODE said… “nearly the entire genome may be represented in primary transcripts that extensively overlap and include many non-protein-coding regions”
Who Are They?
– tRNA, rRNA, snoRNAs
– microRNAs, siRNAs
– piRNAs
– long ncRNAs (Xist, Evf, Air, CTN, PINK…)
How Many of Them?
– Open question
– 30,000 is a common guess
– Harder to detect than proteins

3 Searching “…When Looking for a Needle in a Haystack, the Optimist Wears Gloves…”

4 ncRNAs can have different sequences and Similar Structures

5 ncRNAs Can Evolve Rapidly
CCAGGCAAGACGGGACGAGAGTTGCCTGG
CCTCCGTTCAGAGGTGCATAGAACGGAGG
**-------*--**---*-**------**
[Figure: hairpin drawings of the two sequences, with stems labelled GAACGGACC / CTTGCCTGG and CTTGCCTCC / GAACGGAGG]

6 ncRNAs are Difficult to Align
--CCAGGCAAGACGGGACGAGAGTTGCCTGG
CCTCCGTTCAGAGGTGCATAGAACGGAGG--
   * *   *** * *   ***   *
CCAGGCAAGACGGGACGAGAGTTGCCTGG
CCTCCGTTCAGAGGTGCATAGAACGGAGG
**-------*--**---*-**------**
Regular Alignment
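To make the point of this slide concrete, the short sketch below counts identical columns in the two alignments shown, using the two sequences exactly as they appear on the slide. The function name `identical_columns` is introduced here for illustration only. It suggests that sequence identity alone barely separates the two alignments.

```python
def identical_columns(row_a: str, row_b: str) -> int:
    """Count alignment columns where both rows carry the same nucleotide."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")


# The two sequences from the slide.
SEQ_A = "CCAGGCAAGACGGGACGAGAGTTGCCTGG"
SEQ_B = "CCTCCGTTCAGAGGTGCATAGAACGGAGG"

# Ungapped alignment (the one marked with '*' and '-' on the slide).
ungapped = identical_columns(SEQ_A, SEQ_B)

# Alignment shifted by two positions, with terminal gaps.
shifted = identical_columns("--" + SEQ_A, SEQ_B + "--")

print(ungapped, shifted)
```

The shifted alignment has one more identical column than the ungapped one, so a score based on identity gives little reason to prefer either.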

7 ncRNAs are Difficult to Align
Same Structure → Low Sequence Identity
Small Alphabet, Short Sequences → Alignments often Non-Significant

8 Obtaining the Structure of an ncRNA is Difficult
Hard to Align the Sequences Without the Structure
Hard to Predict the Structures Without an Alignment

9 The Holy Grail of RNA Comparison: Sankoff's Algorithm

10 The Holy Grail of RNA Comparison: Sankoff's Algorithm
Simultaneous Folding and Alignment
– Time Complexity: O(L^{2n})
– Space Complexity: O(L^{3n})
In Practice, for Two Sequences:
– 50 nucleotides: 1 min., 6 M
– 100 nucleotides: 16 min., 256 M
– 200 nucleotides: 4 hours, 4 G
– 400 nucleotides: 3 days, 3 T
Forget about
– Multiple sequence alignments
– Database searches
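As a rough check on these numbers, the sketch below compares the growth predicted by the slide's time bound with the timings the slide quotes. It reads L as the sequence length and n as the number of sequences, which is an interpretation since the slide does not define them. The function name `time_scaling_factor` is hypothetical.

```python
def time_scaling_factor(length_ratio: float, n_sequences: int) -> float:
    """Growth in running time under the slide's time bound, L to the power 2n."""
    return length_ratio ** (2 * n_sequences)


# Timings quoted on the slide for two sequences, converted to minutes.
slide_minutes = {50: 1, 100: 16, 200: 4 * 60, 400: 3 * 24 * 60}

# Predicted slowdown when the length doubles and n = 2.
predicted = time_scaling_factor(2, 2)

# Slowdown actually quoted on the slide for each doubling of the length.
observed = {
    longer: slide_minutes[longer] / slide_minutes[longer // 2]
    for longer in (100, 200, 400)
}

print(predicted, observed)
```

Each doubling of the length costs roughly a factor of 16 in the quoted timings, in line with the predicted factor for two sequences.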

11 The next best Thing: Consan
Consan = Sankoff + a few constraints
Use of Stochastic Context Free Grammars
– Tree-shaped HMMs
– Made sparse with constraints
The constraints are derived from the most confident positions of the alignment
Equivalent of Banded DP
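The banded DP analogy can be illustrated on plain sequence alignment. The sketch below is not Consan, which constrains a stochastic context free grammar; it is a minimal banded Needleman-Wunsch score that only fills cells within a fixed distance of the diagonal. The function name, the fixed band and the scoring values are all assumptions made for this illustration.

```python
def banded_alignment_score(a, b, band, match=1, mismatch=-1, gap=-2):
    """Global alignment score using only cells with |i - j| <= band."""
    if abs(len(a) - len(b)) > band:
        raise ValueError("band too narrow to reach the final cell")
    # Row 0: aligning an empty prefix of `a` against prefixes of `b`.
    prev = {j: j * gap for j in range(min(len(b), band) + 1)}
    for i in range(1, len(a) + 1):
        cur = {}
        for j in range(max(0, i - band), min(len(b), i + band) + 1):
            best = float("-inf")
            if j == 0:
                best = i * gap                      # first column: gaps only
            else:
                if j - 1 in prev:                   # diagonal move
                    s = match if a[i - 1] == b[j - 1] else mismatch
                    best = max(best, prev[j - 1] + s)
                if j - 1 in cur:                    # gap in `a`
                    best = max(best, cur[j - 1] + gap)
            if j in prev:                           # gap in `b`
                best = max(best, prev[j] + gap)
            cur[j] = best
        prev = cur                                  # cells outside the band are never stored
    return prev[len(b)]


print(banded_alignment_score("ACGT", "AGT", band=1))
```

With a band of width w, the work per row drops from the full sequence length to about 2w + 1 cells, which is the kind of saving that sparsity constraints aim for.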

12 Consan for Databases: Infernal
Infernal is a Faster version of Consan for Database Search
Still Very Slow
Receiver operating characteristic (ROC) Comparison of Infernal with BLAST

13 Consan for Databases: Infernal
BLAST: 360 s.
Fast Infernal: 182 000 s.
Slow Infernal: 5 320 000 s.

14 Searching Databases for New RNAs

15 Rfam: In practice
Rfam contains RNA families
– Families → Multiple Sequence Alignment → Models
– Models are like Pfam Profiles
Use Consan or Cmsearch rather than HMMer
Much Slower
– Too expensive to search the models
Models are used to build Rfam
People usually BLAST Rfam

16 Where do Rfam Families Come From?
Infernal Requires a Model
Models require an MSA
The MSA requires a Family
It all starts with a BlastN
Rfam, Gardner et al. NAR 2008

17 Can we make BlastN more accurate?
BlastN is not very accurate because:
– Poor substitution models for Nucleic Acids
– Low information density (4 symbols)
BlastN assumes
– Equal evolution rates for all nucleotides
– Independence from Neighbors

18 Love Thy Neighbor
Measured Nearest Neighbor Dependencies on Rfam sequences
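The slide does not say how the dependencies were measured. One common way, sketched below as an assumption, is to compare how often each dinucleotide is observed with how often it would be expected if neighbouring nucleotides were independent. The function name `dinucleotide_odds` is hypothetical.

```python
from collections import Counter


def dinucleotide_odds(sequences):
    """Observed / expected frequency for each dinucleotide seen in the input."""
    mono, di = Counter(), Counter()
    for seq in sequences:
        seq = seq.upper()
        mono.update(seq)                                        # single-nucleotide counts
        di.update(seq[i:i + 2] for i in range(len(seq) - 1))    # overlapping pairs
    n_mono, n_di = sum(mono.values()), sum(di.values())
    return {
        pair: (count / n_di) / ((mono[pair[0]] / n_mono) * (mono[pair[1]] / n_mono))
        for pair, count in di.items()
    }


odds = dinucleotide_odds(["ACAC"])
print(odds)
```

A ratio close to 1 means the two positions look independent; values far from 1 indicate a dependency between a nucleotide and its neighbour.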

19 High Rate of CpG mutations

20 Measuring Di-Nucleotide Evolution
Each Nucleotide can be made more informative
It can incorporate the “name” of its Neighbor
– AA => a
– AG => b
– AC => c
– AT => d
– …
A 16 Letter alphabet can be used to recode all nucleotide sequences
We name these extended Nucleotides
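The recoding described on this slide can be sketched as follows. Only the first four codes (AA, AG, AC, AT) come from the slide. The remaining twelve, the choice of the right-hand neighbour, and the handling of the last nucleotide are assumptions made so the example runs. The names `ENCODING` and `to_extended` are hypothetical.

```python
from itertools import product
from string import ascii_lowercase

# Only AA=>a, AG=>b, AC=>c, AT=>d come from the slide. Continuing in the same
# A, G, C, T order for the remaining 12 pairs is an assumption of this sketch.
ORDER = "AGCT"
ENCODING = {
    first + second: letter
    for (first, second), letter in zip(product(ORDER, repeat=2), ascii_lowercase)
}


def to_extended(seq: str) -> str:
    """Recode each nucleotide together with its right-hand neighbour.

    The last nucleotide has no right-hand neighbour and is dropped here,
    which is another assumption. Symbols other than A, C, G, T raise KeyError.
    """
    seq = seq.upper()
    return "".join(ENCODING[seq[i:i + 2]] for i in range(len(seq) - 1))


print(to_extended("AAGC"))
```

Once recoded, a nucleotide sequence is a string over 16 letters and can be handled by tools written for larger alphabets.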

21

22 Blosum-R and eRNA

23 Substitutions??
How much does it cost to turn one nucleotide into another one?
Blosum/Pam style matrix
Matrices estimated on Rfam families
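The general idea behind a Blosum-style matrix can be sketched as log-odds scores computed from aligned pairs. The code below is a simplified illustration, not the procedure used to build the matrices in this talk: it omits the sequence clustering and integer scaling of the real BLOSUM method, and the function name `log_odds_matrix` is hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2


def log_odds_matrix(alignment):
    """Log-odds score for every symbol pair seen in the columns of one alignment."""
    pair_counts = Counter()
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]       # ignore gaps
        for x, y in combinations(residues, 2):
            pair_counts[(x, y)] += 1
            pair_counts[(y, x)] += 1                     # both orders keep the table symmetric
    total = sum(pair_counts.values())
    # Background frequency of each symbol, derived from the pair counts.
    background = Counter()
    for (x, _), count in pair_counts.items():
        background[x] += count / total
    return {
        (x, y): log2((count / total) / (background[x] * background[y]))
        for (x, y), count in pair_counts.items()
    }


scores = log_odds_matrix(["AC", "AC"])
print(scores)
```

A positive score means the pair is aligned more often than chance would predict, a negative score less often. The same function works unchanged on sequences written in a 16-letter alphabet.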

24 Blosum-R and eRNA

25 Using BlastR
When Nucleic Acids look like Proteins
They can be aligned with Protein Methods
– BlastN → BlastP
– BlastP with eRNA is BlastR

26 Validating Blast-R

27 Benchmarking BlastR
[Figure labels: Query, Blast, Rfam, PP, PN, EVALUES]

28 Benchmarking BlastR
[Figure labels: Rfam 001, Rfam 002, Rfam …, Blast, ROC]

29 Benchmarking BlastR
[Figure labels: False Positives, True Positive, Good, Bad]

30 Benchmarking BlastR
[Figure labels: False Positives, True Positive, Good, Bad, Area Under Curve]
Small AUC → Better
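A minimal sketch of this kind of benchmark is given below, assuming each hit carries an E-value and a flag saying whether it belongs to the query's family. The function names and the input format are hypothetical. The code puts false positives on the horizontal axis and true positives on the vertical axis, so here a larger area is better. With the axes swapped the same ranking gives one minus this value, which may be the orientation behind the slide's remark that a small AUC is better.

```python
def roc_points(hits):
    """hits: iterable of (e_value, is_true_member); returns [(fp, tp), ...]."""
    ranked = sorted(hits, key=lambda hit: hit[0])    # best (smallest) E-value first
    fp = tp = 0
    points = [(0, 0)]
    for _, is_true in ranked:
        if is_true:
            tp += 1
        else:
            fp += 1
        points.append((fp, tp))
    return points


def area_under_curve(points):
    """Trapezoid area under the curve, normalised to the range 0..1."""
    max_fp, max_tp = points[-1]
    if max_fp == 0 or max_tp == 0:
        raise ValueError("need at least one true and one false hit")
    area = sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
    return area / (max_fp * max_tp)


good = [(1e-30, True), (1e-20, True), (1e-5, False), (1.0, False)]
bad = [(1e-30, False), (1e-20, False), (1e-5, True), (1.0, True)]
print(area_under_curve(roc_points(good)), area_under_curve(roc_points(bad)))
```

Tied E-values are not treated specially in this sketch; they are simply taken in sorted order.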

31 BlastR vs The World

32 The 3 Components of BlastR
BlastP is better than BlastN
BlosumR makes BlastP a little bit better
Blast: wuBlast

33 The 3 Components of BlastR
BlastP is better than BlastN
BlosumR makes BlastP a little bit better
And Faster

34 BlastR and Clustering
Given all Rfam in Bulk
How good is BlastR at reconstituting all the families?
[Figure axes: Sensitivity, 1-Specificity]

35 BlastR and Clustering
Given all Rfam in Bulk
How good is BlastR at reconstituting all the families?
[Figure axes: Sensitivity, 1-Specificity]

36 BlastR: In Practice

37 E-Value Threshold: 10^{-20}
[Figure labels: BlastN, BlastR]

38 Take Home
Searching Nucleotides is Difficult
BlastN is not a very good algorithm
Simple Adaptations can improve the situation
– Changing the algorithm (BlastP)
– Changing the Scoring Scheme (BlastP-Nuc)
– Changing the alphabet (BlastR)

