RNA folding & ncRNA discovery I519 Introduction to Bioinformatics, Fall, 2012.

Slides:



Advertisements
Similar presentations
B. Knudsen and J. Hein Department of Genetics and Ecology
Advertisements

RNA Secondary Structure Prediction
RNA structure prediction. RNA functions RNA functions as –mRNA –rRNA –tRNA –Nuclear export –Spliceosome –Regulatory molecules (RNAi) –Enzymes –Virus –Retrotransposons.
MiRNA in computational biology 1 The Nobel Prize in Physiology or Medicine for 2006 Andrew Z. Fire and Craig C. Mello for their discovery of "RNA interference.
RNA Structure Prediction
Predicting the 3D Structure of RNA motifs Ali Mokdad – UCSF May 28, 2007.
Predicting RNA Structure and Function. Non coding DNA (98.5% human genome) Intergenic Repetitive elements Promoters Introns mRNA untranslated region (UTR)
Predicting RNA Structure and Function
RNA structure prediction. RNA functions RNA functions as –mRNA –rRNA –tRNA –Nuclear export –Spliceosome –Regulatory molecules (RNAi) –Enzymes –Virus –Retrotransposons.
Non-coding RNA William Liu CS374: Algorithms in Biology November 23, 2004.
RNA secondary structure prediction and runtime optimization Greg Goldgof October 5, 2006 CS374 Presentation Stanford University.
RNA Secondary Structure Prediction
Predicting RNA Structure and Function. Nobel prize 1989Nobel prize 2009 Ribozyme Ribosome RNA has many biological functions The function of the RNA molecule.
Presenting: Asher Malka Supervisor: Prof. Hermona Soreq.
RNA structure analysis Jurgen Mourik & Richard Vogelaars Utrecht University.
Predicting RNA Structure and Function. Following the human genome sequencing there is a high interest in RNA “Just when scientists thought they had deciphered.
[Bejerano Fall10/11] 1.
. Class 5: RNA Structure Prediction. RNA types u Messenger RNA (mRNA) l Encodes protein sequences u Transfer RNA (tRNA) l Adaptor between mRNA molecules.
CISC667, F05, Lec19, Liao1 CISC 467/667 Intro to Bioinformatics (Fall 2005) RNA secondary structure.
Phylogenetic Shadowing Daniel L. Ong. March 9, 2005RUGS, UC Berkeley2 Abstract The human genome contains about 3 billion base pairs! Algorithms to analyze.
Predicting RNA Structure and Function
CISC667, F05, Lec27, Liao1 CISC 667 Intro to Bioinformatics (Fall 2005) Review Session.
Predicting RNA Structure and Function. Nobel prize 1989 Nobel prize 2009 Ribozyme Ribosome.
Dynamic Programming (cont’d) CS 466 Saurabh Sinha.
RNA-Seq and RNA Structure Prediction
RNA: Secondary Structure Prediction and Analysis
RNA multiple sequence alignment Craig L. Zirbel October 14, 2010.
Introduction to RNA Bioinformatics Craig L. Zirbel October 5, 2010 Based on a talk originally given by Anton Petrov.
RNA informatics Unit 12 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.
Non-coding RNA gene finding problems. Outline Introduction RNA secondary structure prediction RNA sequence-structure alignment.
MicroRNA Targets Prediction and Analysis. Small RNAs play important roles The Nobel Prize in Physiology or Medicine for 2006 Andrew Z. Fire and Craig.
Transfection. What is transfection? Broadly defined, transfection is the process of artificially introducing nucleic acids (DNA or RNA) into cells, utilizing.
Genomics and Personalized Care in Health Systems Lecture 9 RNA and Protein Structure Leming Zhou, PhD School of Health and Rehabilitation Sciences Department.
RNA folding & ncRNA discovery I519 Introduction to Bioinformatics, Fall, 2012 Adapted from Haixu Tang.
From Structure to Function. Given a protein structure can we predict the function of a protein when we do not have a known homolog in the database ?
1 RNA Bioinformatics Genes and Secondary Structure Anne Haake Rhys Price Jones & Tex Thompson.
RNA Secondary Structure Prediction. 16s rRNA RNA Secondary Structure Hairpin loop Junction (Multiloop)Bulge Single- Stranded Interior Loop Stem Image–
© Wiley Publishing All Rights Reserved. RNA Analysis.
Bioinformatics Multiple Alignment. Overview Introduction Multiple Alignments Global multiple alignment –Introduction –Scoring –Algorithms.
Lecture 9 CS5661 RNA – The “REAL nucleic acid” Motivation Concepts Structural prediction –Dot-matrix –Dynamic programming Simple cost model Energy cost.
RNA Structure Prediction
HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.
Exploiting Conserved Structure for Faster Annotation of Non-Coding RNAs without loss of Accuracy Zasha Weinberg, and Walter L. Ruzzo Presented by: Jeff.
CS5263 Bioinformatics RNA Secondary Structure Prediction.
Progress toward Predicting Viral RNA Structure from Sequence: How Parallel Computing can Help Solve the RNA Folding Problem Susan J. Schroeder University.
Questions?. Novel ncRNAs are abundant: Ex: miRNAs miRNAs were the second major story in 2001 (after the genome). Subsequently, many other non-coding genes.
[BejeranoFall15/16] 1 MW 1:30-2:50pm in Clark S361* (behind Peet’s) Profs: Serafim Batzoglou & Gill Bejerano CAs: Karthik Jagadeesh.
RNA folding & ncRNA discovery
RNA Structure Prediction RNA Structure Basics The RNA ‘Rules’ Programs and Predictions BIO520 BioinformaticsJim Lund Assigned reading: Ch. 6 from Bioinformatics:
This seems highly unlikely.
Motif Search and RNA Structure Prediction Lesson 9.
Tracking down ncRNAs in the genomes. How to find ncRNA gene The stability of ncRNA secondary structure is not sufficiently different from the predicted.
Finding, Aligning and Analyzing Non Coding RNAs Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics Program.
RNA Structure Prediction
Rapid ab initio RNA Folding Including Pseudoknots via Graph Tree Decomposition Jizhen Zhao, Liming Cai Russell Malmberg Computer Science Plant Biology.
Lecture 8.21 Lecture 8.2: RNA Jennifer Gardy Centre for Microbial Diseases and Immunity Research University of British Columbia
RNAs. RNA Basics transfer RNA (tRNA) transfer RNA (tRNA) messenger RNA (mRNA) messenger RNA (mRNA) ribosomal RNA (rRNA) ribosomal RNA (rRNA) small interfering.
AAA AAAU AAUUC AUUC UUCCG UCCG CCGG G G Karen M. Pickard CISC889 Spring 2002 RNA Secondary Structure Prediction.
Stochastic Context-Free Grammars for Modeling RNA
Vienna RNA web servers
Lecture 21 RNA Secondary Structure Prediction
Lab 8.3: RNA Secondary Structure
Predicting RNA Structure and Function
RNA Secondary Structure Prediction
RNA Secondary Structure Prediction
Stochastic Context-Free Grammars for Modeling RNA
Comparative RNA Structural Analysis
RNA Secondary Structure Prediction
RNA folding & ncRNA discovery
RNA 2D and 3D Structure Craig L. Zirbel October 7, 2010.
Presentation transcript:

RNA folding & ncRNA discovery I519 Introduction to Bioinformatics, Fall, 2012

Contents  Non-coding RNAs and their functions  RNA structures  RNA folding –Nussinov algorithm –Energy minimization methods  microRNA target identification

 ncRNAs have important and diverse functional and regulatory roles that impact gene transcription, translation, localization, replication, and degradation –Protein synthesis (rRNA and tRNA) –RNA processing (snoRNA) –Gene regulation RNA interference (RNAi) Andrew Fire and Craig Mello (2006 Nobel prize) –DNA-like function Virus –RNA world RNAs have diverse functions

Non-coding RNAs  A non-coding RNA (ncRNA) is a functional RNA molecule that is not translated into a protein; small RNA (sRNA) is often used for bacterial ncRNAs.  tRNA (transfer RNA), rRNA (ribosomal RNA), snoRNA (small RNA molecules that guide chemical modifications of other RNAs)  microRNAs (miRNA, μRNA, single-stranded RNA molecules of nucleotides in length, regulate gene expression)  siRNAs (short interfering RNA or silencing RNA, double-stranded, nucleotides in length, involved in the RNA interference (RNAi) pathway, where it interferes with the expression of a specific gene. )  piRNAs (expressed in animal cells, forms RNA-protein complexes through interactions with Piwi proteins, which have been linked to transcriptional gene silencing of retrotransposons and other genetic elements in germ line cells)  long ncRNAs (non-protein coding transcripts longer than 200 nucleotides)

Riboswitch  What’s riboswitch  Riboswitch mechanism Image source: Curr Opin Struct Biol. 2005, 15(3):

Structures are more conserved  Structure information is important for alignment (and therefore gene finding) CGAGCUCGAGCU CAAGUUCAAGUU

Features of RNA  RNA typically produced as a single stranded molecule (unlike DNA)  Strand folds upon itself to form base pairs & secondary structures  Structure conservation is important  RNA sequence analysis is different from DNA sequence

Canonical base pairing NN N O H H N N N O H H H N N NN O O H N N N N N HH Watson-Crick base pairing Non-Watson-Crick base pairing G/U (Wobble)

tRNA structure

RNA secondary structure Hairpin loop Junction (Multiloop) Bulge Loop Single-Stranded Interior Loop Stem Pseudoknot

Complex folds

Pseudoknots i j j’j’ i’i’ i jj’j’i’i’ ?

RNA secondary structure representation  2D  Circle plot  Dot plot  Mountain  Parentheses  Tree model (((…)))..((….))

Main approaches to RNA secondary structure prediction  Energy minimization –dynamic programming approach –does not require prior sequence alignment –require estimation of energy terms contributing to secondary structure  Comparative sequence analysis –using sequence alignment to find conserved residues and covariant base pairs. –most trusted  Simultaneous folding and alignment (structural alignment)

Assumptions in energy minimization approaches  Most likely structure similar to energetically most stable structure  Energy associated with any position is only influenced by local sequence and structure  Neglect pseudoknots

Base-pair maximization  Find structure with the most base pairs –Only consider A-U and G-C and do not distinguish them  Nussinov algorithm (1970s) –Too simple to be accurate, but stepping-stone for later algorithms

 Problem definition –Given sequence X=x 1 x 2 …x L, compute a structure that has maximum (weighted) number of base pairings  How can we solve this problem? –Remember: RNA folds back to itself! –S(i,j) is the maximum score when x i..x j folds optimally –S(1,L)? –S(i,i)? Nussinov algorithm 1Li j S(i,j)

“Grow” from substructures (1)(2) (4) (3) 1 Liji+1j-1 k

Dynamic programming  Compute S(i,j) recursively (dynamic programming) –Compares a sequence against itself in a dynamic programming matrix  Three steps

Initialization Example: GGGAAAUCC GGGAAAUCC G0 G00 G00 A00 A00 A00 U00 C00 C00 the main diagonal the diagonal below L: the length of input sequence

Recursion GGGAAAUCC G0000 G00000 G00000 A0000? A0001 A0011 U0000 C000 C00 Fill up the table (DP matrix) -- diagonal by diagonal j i

Traceback GGGAAAUCC G G G A A A00111 U0000 C000 C00 The structure is: What are the other “optimal” structures?

An exercise  Input: AUGACAU  Fill up the table  Trace back  Give the optimal structure  What’s the size of the hairpin loop AUGACAU A U G A C A U

Energy minimization methods  Nussinov algorithm (base pair maximization) is too simple to be accurate  Energy minimization algorithm predicts secondary structure by minimizing the free energy (  G)   G calculated as sum of individual contributions of: –loops –stacking

Free energy computation U U A A G C A G C U A A U C G A U A 3’ A 5’ mismatch of hairpin -2.9 stacking nt bulge -2.9 stacking -1.8 stacking 5’ dangling -0.9 stacking -1.8 stacking -2.1 stacking  G = -4.6 KCAL/MOL nt loop

Loop parameters (from Mfold) Unit: Kcal/mol DESTABILIZING ENERGIES BY SIZE OF LOOP SIZE INTERNAL BULGE HAIRPIN

Stacking energy (from Vienna package) # stack_energies /* CG GC GU UG AU */

Mfold versus Vienna package  Mfold – – form1.cgihttp://frontend.bioinfo.rpi.edu/applications/mfold/cgi-bin/rna- form1.cgi –Suboptimal structures The correct structure is not necessarily structure with optimal free energy Within a certain threshold of the calculated minimum energy  Vienna -- calculate the probability of base pairings –

Mfold energy dot plot

Mfold algorithm (Zuker & Stiegler, NAR (1):133)

Inferring structure by comparative sequence analysis  Need a multiple sequence alignment as input  Requires sequences be similar enough (so that they can be initially aligned)  Sequences should be dissimilar enough for covarying substitutions to be detected “Given an accurate multiple alignment, a large number of sequences, and sufficient sequence diversity, comparative analysis alone is sufficient to produce accurate structure predictions” (Gutell RR et al. Curr Opin Struct Biol 2002, 12: )

RNA variations  Variations in RNA sequence maintain base-pairing patterns for secondary structures ( conserved patterns of base-pairing)  When a nucleotide in one base changes, the base it pairs to must also change to maintain the same structure  Such variation is referred to as covariation. CGAGCUCGAGCU CAAGUUCAAGUU

If neglect covariation  In usual alignment algorithms they are doubly penalized …GA…UC… …GC…GC… …GA…UA…

Covariance measurements  Mutual information (desirable for large datasets) –Most common measurement –Used in CM (Covariance Model) for structure prediction  Covariance score (better for small datasets)

Mutual information  : frequency of a base in column i  : joint (pairwise) frequency of a base pair between columns i and j  Information ranges from 0 and ? bits  If i and j are uncorrelated (independent), mutual information is 0

Mutual information plot

Structure prediction using MI  S(i,j) = Score at indices i and j; M(i,j) is the mutual information between i and j  The goal is to maximize the total mutual information of input RNA  The recursion is just like the one in Nussinov Algorithm, just to replace w(i,j) (1 or 0) with the mutual information M(i,j)

Covariance-like score  RNAalifold –Hofacker et al. JMB 2002, 319:  Desirable for small datasets  Combination of covariance score and thermodynamics energy

Covariance-like score calculation The score between two columns i and j of an input multiple alignment is computed as following:

Covariance model  A formal covariance model, CM, devised by Eddy and Durbin –A probabilistic model –≈ A Stochastic Context-Free Grammer –Generalized HMM model  A CM is like a sequence profile, but it scores a combination of sequence consensus and RNA secondary structure consensus  Provides very accurate results  Very slow and unsuitable for searching large genomes

CM training algorithm Unaligned sequence Modeling construction EM Multiple alignment alignment Parameter re-estimation Covariance model

Binary tree representation of RNA secondary structure  Representation of RNA structure using Binary tree  Nodes represent –Base pair if two bases are shown –Loop if base and “gap” (dash) are shown  Pseudoknots still not represented  Tree does not permit varying sequences –Mismatches –Insertions & Deletions Images – Eddy et al.

Overall CM architecture MATP emits pairs of bases: modeling of base pairing BIF allows multiple helices (bifurcation)

Covariance model drawbacks  Needs to be well trained (large datasets)  Not suitable for searches of large RNA –Structural complexity of large RNA cannot be modeled –Runtime –Memory requirements

ncRNA gene finding  De novo ncRNA gene finding –Folding energy –Number of sub-optimal RNA structures  Homology ncRNA gene searching –Sequence-based –Structure-based –Sequence and structure-based

Rfam & Infernal  Rfam 9.1 contains 1379 families (December 2008)  Rfam 10.0 contains 1446 families (January 2010)  Rfam is a collection of multiple sequence alignments and covariance models covering many common non-coding RNA families  Infernal searches Rfam covariance models (CMs) in genomes or other DNA sequence databases for homologs to known structural RNA families

An example of Rfam families  TPP (a riboswitch; THI element) –RF00059 –is a riboswitch that directly binds to TPP (active form of VB, thiamin pyrophosphate) to regulate gene expression through a variety of mechanisms in archaea, bacteria and eukaryotes

Simultaneous structure prediction and alignment of ncRNAs The grammar emits two correlated sequences, x and y

References  How Do RNA Folding Algorithms Work? Eddy. Nature Biotechnology, 22: , 2004 (a short nice review)  Biological Sequence Analysis: Probabilistic models of proteins and nucleic acids. Durbin, Eddy, Krogh and Mitchison Chapter 10, pages  Secondary Structure Prediction for Aligned RNA Sequences. Hofacker et al. JMB, 319: , 2002 (RNAalifold; covariance-like score calculation)  Optimal Computer Folding of Large RNA Sequences Using Thermodynamics and Auxiliary Information. Zuker and Stiegler. NAR, 9(1): , 1981 (Mfold)  A computational pipeline for high throughput discovery of cis- regulatory noncoding RNAs in Bacteria, PLoS CB 3(7):e126 –Riboswitches in Eubacteria Sense the Second Messenger Cyclic Di-GMP, Science, 321:411 – 413, 2008 –Identification of 22 candidate structured RNAs in bacteria using the CMfinder comparative genomics pipeline, Nucl. Acids Res. (2007) 35 (14): –CMfinder—a covariance model based RNA motif finding algorithm. Bioinformatics 2006;22:

Understanding the transcriptome through RNA structure  'RNA structurome’  Genome-wide measurements of RNA structure by high-throughput sequencing  Nat Rev Genet Aug 18;12(9): Nat Rev Genet.