Lecture 11. RNA Secondary Structure Prediction The Chinese University of Hong Kong CSCI3220 Algorithms for Bioinformatics
Lecture outline From sequences to functions RNA secondary structures Last update: 21-Nov-2015 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2015
Part 1: From Sequences to Functions
From sequences to functions. One of the biggest questions in biology: can one tell the function of a molecule (DNA/RNA/protein) from its sequence alone? Sometimes, but usually not. It is easier if we also know the structure. Common belief: sequence → structure → function. Of course, function also depends on the environment.
Molecular structures. Four levels: (1) primary structures: the sequence; (2) secondary structures: formed first, local; (3) tertiary structures: global, "folds" and "domains"; (4) quaternary structures: multiple molecules. Image credit: http://www.personal.psu.edu/jms5704/blogs/simmons/levels_of_protein_s_c_la_784.jpg
Primary structures. Connections (strong covalent bonds vs. weak hydrogen bonds): which molecules are connected, and which atoms are connected. These are the first-level constraints on the possible structures. Example: molecules close in primary structure must also be close in secondary, tertiary and quaternary structures. Image credit: Wikibooks
Primary structures. Orientation: DNA and RNA run 5’ to 3’; amino acid chains run from the amino (N) terminus to the carboxyl (C) terminus. "Residue": what remains after a water molecule is expelled. Image credit: http://bealbio.wikispaces.com/file/view/dsDNA.jpg, http://attentionmanagement.ca/userfiles/image/DNA-RNA%20directions.gif, http://www.phschool.com/science/biology_place/biocoach/images/translation/peptbond.gif, http://www.cystinuria.org/resources/education/aminoacids/peptide.gif
DNA secondary structures. Double helix. A-DNA (dehydrated samples): right-handed, 11 bp per turn. B-DNA (most common): 10.5 bp per turn. Z-DNA (some methylated DNA): left-handed, 12 bp per turn. Image credit: Wikipedia
DNA secondary structures. [Figure: A-DNA, B-DNA and Z-DNA.] Image credit: Wikipedia
RNA secondary structures. They can largely be projected onto a 2D plane. Elements labelled in the figure: stem/hairpin loop, stacking pairs, bulge, internal loop, multi-loop, exterior loop, dangling nucleotides, less stable pair, coaxial stacking. Image credit: http://www.clcbio.com/scienceimages/rna_prediction/RNA_structure_prediction_web.png
RNA secondary structures. Pseudoknots: complex structures. Image credit: Wikipedia; Sperschneider and Datta, RNA 14(4):630-640 (2008)
Protein secondary structures. Three main types: α-helices, β-sheets, and coils (connectors). Image credit: http://calcium.uhnres.utoronto.ca/cadherin/images/pub_pages/general/ribbon.jpg, http://www.mun.ca/biology/scarr/MGA2-03-25.jpg
DNA tertiary structures. DNA is wrapped around nucleosomes formed by histone proteins. It takes a condensed form at the beginning of mitosis and meiosis. Image credit: http://micro.magnet.fsu.edu/cells/nucleus/images/chromatinstructurefigure1.jpg, Wikipedia
RNA tertiary structures. The overall structure of an RNA. More studied for RNAs that are not translated into proteins ("non-coding" RNAs). Example: tRNA. Image credit: Wikipedia
Protein tertiary structures. Complex structures, mainly caused by weak forces (hydrogen bonds and hydrophobic interactions) and occasionally by stronger forces (disulfide bonds between cysteines). The CATH hierarchy: Class: composition of secondary structures; Architecture: overall shape; Topology: connection of secondary structures; Homologous superfamily: with a common ancestor. Image credit: CATH
Quaternary structures. Types: protein subunit-protein subunit, protein-protein, protein-DNA, protein-RNA, (protein-small molecules), RNA-RNA, ... Examples shown: protein-subunit interaction (hemoglobin) and protein-DNA interaction. Image credit: Wikipedia, http://serrano.crg.es/images/protein_dna1.jpg
Structure and function. Why does function depend on structure? The structure itself can be the function (e.g., tubulins). Binding: complementarity of interacting structures, and formation of special bonds. Image credit: http://www.nigms.nih.gov/NR/rdonlyres/54BEAC37-47A9-454A-BC4F-B94EA127FA1E/0/fig1a_large.jpg, http://upload.wikimedia.org/wikimedia/en-labs/7/7f/Protein_Protein_Docking.JPG
Structure and function. Why does function depend on structure? (cont’d) Functional group (e.g., catalytic site). Determining localization (e.g., transporter membrane proteins). Image credit: http://www.catalysis-ed.org.uk/principles/images/enzyme_substrate.gif; Spudich, Science 288(5470):1358-1359, 2000
Part 2: RNA Secondary Structures
Important RNA classes. Coding: messenger RNAs (mRNAs), for translating into proteins. Non-coding: ribosomal RNAs (rRNAs), parts of the ribosome complex; transfer RNAs (tRNAs), delivering free amino acids during translation; micro RNAs (miRNAs), binding mRNA targets to promote RNA degradation or repress translation; small nucleolar RNAs (snoRNAs), guiding chemical modifications of other RNAs; small nuclear RNAs (snRNAs), involved in mRNA splicing; long non-coding RNAs (lncRNAs), some involved in gene regulation; ... Image source: http://legacy.hopkinsville.kctcs.edu/sitecore/instructors/Jason-Arnold/VLI/Module%201/m1DNAfunction/m1DNAfunction3.html
Importance of RNA structures. Structure is important to many classes of RNA. Examples: tRNA, snoRNA. Image sources: http://www.bio.miami.edu/dana/pix/tRNA.jpg, http://lowelab.ucsc.edu/images/CDBox.jpg
Representing RNA secondary structures. Formats (see http://projects.binf.ku.dk/pgardner/bralibase/RNAformats.html): dot-bracket format, Stockholm format, ...
Dot-bracket format. Each position of the structure line describes the base at the same position of the sequence line (in the original slide, nucleotides 10, 20, 30, etc. are marked in red).
Sequence:  GUGAAUGAUGAAUUUAAUUCUUUGGUCCGUGUUUAUGAUGGGAAGUAAGACCCCCGAUAUGAGUGACAAAAGAGAUGUGGUUGACUAUCACAGUAUCUGACG
Structure: ......((((.......((((((.(((....((((((.((((..........)))).)))))).))).)))))).((((((.....)))))).)))).....
Image credit: Xihao Hu
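Reading a dot-bracket string back into base pairs only needs a stack of unmatched opening brackets. The following is a minimal Python sketch; the function name parse_dot_bracket and the 1-based numbering are choices made here for illustration, not part of the slides.

```python
def parse_dot_bracket(structure):
    """Return the sorted list of 1-based base pairs (i, j) in a dot-bracket string."""
    stack = []   # positions of '(' that are still waiting for a partner
    pairs = []
    for pos, symbol in enumerate(structure, start=1):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((stack.pop(), pos))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


print(parse_dot_bracket("((..)).(...)"))  # [(1, 6), (2, 5), (8, 12)]
```

Applied to the structure line of the example above, the same function should return pairs such as (7, 97), (18, 74) and (81, 87).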
Predicting RNA secondary structures. A basic assumption in structure prediction: the real structure has the lowest free energy. In a simplified view, more stable bonds mean lower free energy. In the case of RNA secondary structures, it is good to form more pairs (A-U, C-G, and sometimes G-U, a "wobble base pair"), good to form more stable pairs (C-G > A-U > G-U), and good to have stable sub-structures (e.g., stacking pairs).
Predicting RNA secondary structures. We will assume there are no pseudoknots. With pseudoknots, there is currently no known algorithm that can find the optimal solution efficiently. We need two things: a thermodynamic model for computing the free energy of a structure, and a method for finding the structure with the minimum free energy. Does this setting sound familiar? [Figure: a pseudoknot.] Image credit: Wikipedia
Further assumptions. The free energy of a secondary structure is the sum of the free energies of its sub-structures (not the sum over individual bases or base pairs, as one base pair can participate in multiple sub-structures). The free energies of the sub-structures are independent.
Problem definition. Given an RNA sequence, find a set of base pairs so that each base is paired at most once and the resulting structure has the minimum free energy. Example: input sequence: GUGAAUGAUGAAUUU...ACG; output set of base pairs: (7, 97), (8, 96), ..., (18, 74), (81, 87). Image credit: Xihao Hu
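The two structural constraints used in this lecture (each base paired at most once, and no pseudoknots) can be checked directly on a candidate set of pairs. This is a small sketch under stated assumptions: positions are 1-based, and the helper names is_valid_pair_set and has_crossing_pairs are invented here for illustration.

```python
def is_valid_pair_set(pairs, n):
    """True if every pair satisfies 1 <= i < j <= n and no base is used twice."""
    used = set()
    for i, j in pairs:
        if not (1 <= i < j <= n):
            return False
        if i in used or j in used:
            return False
        used.update((i, j))
    return True


def has_crossing_pairs(pairs):
    """True if two pairs (i, j) and (i2, j2) cross, i.e. i < i2 < j < j2."""
    ordered = sorted(pairs)
    for a, (i, j) in enumerate(ordered):
        for i2, j2 in ordered[a + 1:]:
            if i < i2 < j < j2:
                return True
    return False


example = [(7, 97), (8, 96), (18, 74), (81, 87)]
print(is_valid_pair_set(example, 102))   # True
print(has_crossing_pairs(example))       # False
```

The quadratic crossing check is chosen for readability; it is enough for hand-sized examples.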
Linear view. [Figure: positions 7 to 25 and 67 to 97 of the example laid out on a line, each with its dot-bracket symbol: 7-10 "(", 11-17 ".", 18-23 "(", 24 ".", 25 "(", ..., 67 ")", 68 ".", 69-74 ")", 75 ".", 76-81 "(", 82-86 ".", 87-92 ")", 93 ".", 94-97 ")".]
Thermodynamic model. We will consider four types of sub-structures here. Stacking pairs: both (i, j) and (i+1, j-1) are in the set. Hairpin loop: there is a pair (i, j), where all bases from i+1 to j-1 are unpaired. Bulge/internal loop: there are two pairs (i, j) and (i_1, j_1), where i < i_1 < j_1 < j, and all bases from i+1 to i_1-1 and from j_1+1 to j-1 are unpaired. Multi-loop: there are pairs (i, j), (i_1, j_1), ..., (i_k, j_k), where i < i_1 < j_1 < ... < i_k < j_k < j, and all bases from i+1 to i_1-1, from j_1+1 to i_2-1, ..., from j_{k-1}+1 to i_k-1 and from j_k+1 to j-1 are unpaired. One base pair can participate in multiple sub-structures.
Stacking pairs. Both (i, j) and (i+1, j-1) are in the set. E.g., i = 20, j = 72, so that i+1 = 21 and j-1 = 71.
Hairpin loop. There is a pair (i, j), where all bases from i+1 to j-1 are unpaired. E.g., i = 81, j = 87. Image source: http://img.ehowcdn.com/article-new/ds-photo/getty/article/151/226/87820768_XS.jpg
Bulge/internal loop. Internal loop: there are two pairs (i, j) and (i_1, j_1), where i < i_1 < j_1 < j, and all bases from i+1 to i_1-1 and from j_1+1 to j-1 are unpaired. It is called a bulge if only one side has unpaired bases. E.g., i = 23, j = 69, i_1 = 25, j_1 = 67.
Multi-loop. There are pairs (i, j), (i_1, j_1), ..., (i_k, j_k), where i < i_1 < j_1 < ... < i_k < j_k < j, and all bases from i+1 to i_1-1, from j_1+1 to i_2-1, ..., from j_{k-1}+1 to i_k-1 and from j_k+1 to j-1 are unpaired. E.g., k = 2, i = 10, j = 94, i_1 = 18, j_1 = 74, i_2 = 76, j_2 = 92.
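The four definitions can be turned into a small classifier that, for each pair (i, j), looks at the pairs directly enclosed by it. This is an illustrative sketch only: the function name classify_loops and the label strings are assumptions made here, and the input is assumed to be a pseudoknot-free set of 1-based pairs with i < j.

```python
def classify_loops(pairs):
    """Map each pair (i, j) to the type of sub-structure it closes."""
    partner = {i: j for i, j in pairs}   # opening position -> closing position
    kinds = {}
    for i, j in pairs:
        inner = []                       # pairs directly enclosed by (i, j)
        pos = i + 1
        while pos < j:
            if pos in partner:           # pos opens an inner pair: record it, jump past it
                inner.append((pos, partner[pos]))
                pos = partner[pos] + 1
            else:                        # unpaired base inside the loop
                pos += 1
        if not inner:
            kinds[(i, j)] = "hairpin loop"
        elif len(inner) > 1:
            kinds[(i, j)] = "multi-loop"
        elif inner[0] == (i + 1, j - 1):
            kinds[(i, j)] = "stacking pair"
        elif inner[0][0] == i + 1 or inner[0][1] == j - 1:
            kinds[(i, j)] = "bulge"      # unpaired bases on one side only
        else:
            kinds[(i, j)] = "internal loop"
    return kinds


# A subset of the pairs in the lecture example (not the full structure).
pairs = [(10, 94), (18, 74), (76, 92), (20, 72), (21, 71), (23, 69), (25, 67), (81, 87)]
kinds = classify_loops(pairs)
print(kinds[(20, 72)], "|", kinds[(81, 87)], "|", kinds[(23, 69)], "|", kinds[(10, 94)])
# stacking pair | hairpin loop | internal loop | multi-loop
```

The four printed labels match the examples given on the individual sub-structure slides.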
One possible thermodynamic model. Unpaired bases have 0 free energy, and all the terms below have negative free energy. eS(i, j): for the stacking pairs (i, j) and (i+1, j-1). eH(i, j): for the hairpin loop closed at (i, j). eBI(i, j, i_1, j_1): for a bulge or internal loop enclosed by the pairs (i, j) and (i_1, j_1). eM(i, j, i_1, j_1, ..., i_k, j_k): for a multi-loop that consists of the pairs (i, j), (i_1, j_1), ..., (i_k, j_k) satisfying i < i_1 < j_1 < ... < i_k < j_k < j.
Finding the optimal structure. Dynamic programming: let s be the RNA sequence with n nucleotides. Tables: V(j): free energy of the optimal structure for s[1..j] (the final answer is based on V(n)). VP(i, j): free energy of the optimal structure for s[i..j] with i and j forming a pair. VBI(i, j): free energy of the optimal structure for s[i..j] with i and j forming a pair that closes a bulge or internal loop. VM(i, j): free energy of the optimal structure for s[i..j] with i and j forming a pair that closes a multi-loop.
Update formulas. V(j): free energy of the optimal structure for s[1..j]. For j > 1, there are two cases: (a) j is unpaired, which leaves the sub-problem s[1..j-1]; (b) j pairs with some i < j, which leaves the sub-problem s[1..i-1] together with the paired region s[i..j].
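The formula itself did not survive in this text version of the slide. A recurrence consistent with the two cases for j, written with the tables defined earlier, would be the following; the base values V(0) = V(1) = 0 are an assumption made here (an empty prefix or a single base has no pairs).

```latex
V(j) = \min\Bigl\{\, V(j-1),\;\; \min_{1 \le i < j} \bigl[\, V(i-1) + VP(i, j) \,\bigr] \Bigr\},
\qquad V(0) = V(1) = 0
```

The inner minimum ranges over O(n) choices of i, in line with the O(n) time per entry stated for V.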
Update formulas. VP(i, j): free energy of the optimal structure for s[i..j] with i and j forming a pair. We require that i < j. Cases illustrated: stacking pairs, where (i+1, j-1) is also a pair; hairpin loop, where all bases between i and j are unpaired.
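The formula is missing from this text version. One reconstruction that agrees with the table definitions and with the constant time per VP entry is given below; including VBI and VM as the remaining two cases is an inference from those definitions, not something visible in the surviving slide text.

```latex
VP(i, j) = \min\bigl\{\, eS(i, j) + VP(i+1, j-1),\;\; eH(i, j),\;\; VBI(i, j),\;\; VM(i, j) \,\bigr\}
```

Each of the four terms is a single table lookup or energy term, so one entry costs constant time once VBI(i, j) and VM(i, j) are known.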
Update formulas. VBI(i, j): free energy of the optimal structure for s[i..j] with i and j forming a pair that closes a bulge or internal loop (i.e., i and j take the roles of i1 and j1). Bulge or internal loop: there is an inner pair (i1, j1) with i < i1 < j1 < j, and all bases between i and i1 and between j1 and j are unpaired.
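The formula is missing from this text version. A form consistent with the definition of eBI and with the O(n^2) time per entry is sketched below; excluding the stacked position (i+1, j-1) from the minimum is an assumption made here so that this case does not overlap with the stacking-pair case.

```latex
VBI(i, j) = \min_{\substack{i < i_1 < j_1 < j \\ (i_1, j_1) \neq (i+1,\, j-1)}}
\bigl[\, eBI(i, j, i_1, j_1) + VP(i_1, j_1) \,\bigr]
```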
Update formulas. VM(i, j): free energy of the optimal structure for s[i..j] with i and j forming a pair that closes a multi-loop.
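The formula is missing from this text version. A form consistent with the definition of eM and with the O(n^{2k}) time per entry is sketched below; the lower bound k >= 2 is an assumption made here, since k = 1 is already covered by the stacking and bulge/internal loop cases.

```latex
VM(i, j) = \min_{\substack{k \ge 2 \\ i < i_1 < j_1 < \dots < i_k < j_k < j}}
\Bigl[\, eM(i, j, i_1, j_1, \dots, i_k, j_k) + \sum_{h=1}^{k} VP(i_h, j_h) \,\Bigr]
```

Choosing the 2k inner indices is what makes each entry cost O(n^{2k}) time.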
Time and space requirements. V: n entries, each takes O(n) time. VP: O(n^2) entries, each takes constant time.
Time and space requirements. VBI: O(n^2) entries, each takes O(n^2) time. VM: O(n^2) entries, each takes O(n^{2k}) time.
Time and space requirements. Summary: V: n entries, each takes O(n) time. VP: O(n^2) entries, each takes constant time. VBI: O(n^2) entries, each takes O(n^2) time. VM: O(n^2) entries, each takes O(n^{2k}) time. Total: O(n^2) space and O(n^{2k+2}) time, which is exponential if k is unbounded. Some approximations could bring the time down to O(n^4), which is still huge for large n but feasible for small or medium n.
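To make the tables concrete, here is a deliberately simplified Python sketch of the dynamic program. It is not the lecture's full model: the multi-loop table VM is left out entirely, the energies e_stack, e_hairpin and e_loop are hypothetical constants rather than measured values, and the minimum hairpin size min_loop is an extra assumption. With VM omitted, the naive loops below cost O(n^4) time.

```python
import math
from functools import lru_cache

# Pairs allowed in the lecture: A-U, C-G and the G-U wobble pair.
ALLOWED = {"AU", "UA", "CG", "GC", "GU", "UG"}


def fold_energy(s, e_stack=-2.0, e_hairpin=-1.0, e_loop=-0.5, min_loop=3):
    """Minimum free energy of s under a toy model without multi-loops.

    All energy constants are hypothetical. Positions are 1-based, and
    recursion depth limits this sketch to short sequences.
    """
    n = len(s)

    @lru_cache(maxsize=None)
    def VP(i, j):
        # Energy of the best structure on s[i..j] in which i pairs with j.
        if j - i - 1 < min_loop or s[i - 1] + s[j - 1] not in ALLOWED:
            return math.inf
        best = e_hairpin                                 # hairpin loop
        best = min(best, e_stack + VP(i + 1, j - 1))     # stacking pair
        for i1 in range(i + 1, j):                       # bulge / internal loop (VBI)
            for j1 in range(i1 + 1, j):
                if (i1, j1) != (i + 1, j - 1):
                    best = min(best, e_loop + VP(i1, j1))
        return best

    V = [0.0] * (n + 1)                                  # V[0] = 0: empty prefix
    for j in range(1, n + 1):
        V[j] = V[j - 1]                                  # j is unpaired
        for i in range(1, j):                            # j pairs with i
            V[j] = min(V[j], V[i - 1] + VP(i, j))
    return V[n]


print(fold_energy("GGGAAACCC"))  # -5.0: two stacks plus one hairpin
```

A full implementation would add the multi-loop case and a traceback that recovers the base pairs rather than only the energy.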
Some remarks. If we allow general pseudoknots, there is currently no efficient way to find the RNA secondary structure with the minimum free energy. Other methods to predict RNA secondary structures: conservation and covariation; experimental methods (e.g., RNA footprinting). Example alignment, where columns 2 and 4 show high conservation and columns 1 and 5 show strong covariation:
12345
ACGGU
ACUGU
CCAGG
UCCGA
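The conservation and covariation signals in the small alignment can be computed column by column. The sketch below is illustrative only: the function names are invented here, and "covariation" is taken to mean that a column pair is complementary (including G-U) in every sequence while the bases themselves differ between sequences, which is a simplification of how covariation is scored in practice.

```python
COMPLEMENTARY = {"AU", "UA", "CG", "GC", "GU", "UG"}


def conserved_columns(alignment):
    """1-based columns in which every sequence has the same base."""
    return [c + 1 for c, column in enumerate(zip(*alignment))
            if len(set(column)) == 1]


def covarying_column_pairs(alignment):
    """1-based column pairs that stay complementary in every sequence
    although the bases change from sequence to sequence."""
    columns = list(zip(*alignment))
    result = []
    for a in range(len(columns)):
        for b in range(a + 1, len(columns)):
            observed = {x + y for x, y in zip(columns[a], columns[b])}
            if observed <= COMPLEMENTARY and len(observed) > 1:
                result.append((a + 1, b + 1))
    return result


alignment = ["ACGGU", "ACUGU", "CCAGG", "UCCGA"]
print(conserved_columns(alignment))        # [2, 4]
print(covarying_column_pairs(alignment))   # [(1, 5)]
```

Columns 2 and 4 are complementary in every row as well, but they never change, so they count as conserved rather than covarying.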
Representing pseudoknots. Without pseudoknots, RNA secondary structures can be unambiguously represented by dots (single bases) and brackets (base pairs). What if there are pseudoknots? We need more types of brackets. Example:
GAAGUACAAUAUGUAACCG
.{.((((.....))}))..
Image source: http://ultrastudio.org/upload/RNAPseudoKnot-25005810.jpg
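With several bracket types, parsing works the same way as for plain dot-bracket strings, except that each bracket type gets its own stack. In this sketch the slide's example only uses "(" and "{"; support for "[" and "<" and the function name parse_extended_dot_bracket are assumptions added here for illustration.

```python
CLOSERS = {")": "(", "]": "[", "}": "{", ">": "<"}


def parse_extended_dot_bracket(structure):
    """Return sorted 1-based pairs; each bracket type is matched independently."""
    stacks = {opener: [] for opener in CLOSERS.values()}
    pairs = []
    for pos, symbol in enumerate(structure, start=1):
        if symbol in stacks:
            stacks[symbol].append(pos)
        elif symbol in CLOSERS:
            stack = stacks[CLOSERS[symbol]]
            if not stack:
                raise ValueError(f"unmatched {symbol!r} at position {pos}")
            pairs.append((stack.pop(), pos))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    if any(stacks.values()):
        raise ValueError("unmatched opening bracket")
    return sorted(pairs)


print(parse_extended_dot_bracket(".{.((((.....))}))..")) 
# [(2, 15), (4, 17), (5, 16), (6, 14), (7, 13)]
```

The pairs (2, 15) and (4, 17) cross (2 < 4 < 15 < 17), which is exactly what a single bracket type cannot express.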
Epilogue: Case Study, Summary and Further Readings
Case study: Drug finding/design. Drugs are mostly chemicals with a specific structure that interacts with some biological objects. Examples: inhibiting the activities of an important protein of bacteria; blocking the interaction between a virus and the receptors of a host cell; stimulating the production of a hormone.
Case study: Drug finding/design. To identify or design a chemical that targets a particular object (e.g., a protein), we need to make sure that the two bind tightly, which is assessed through a process called docking. Image source: http://vds.cm.utexas.edu/
Case study: Drug finding/design. Computational problem. Input: a target protein and a list of chemicals. Goal: find a chemical that binds the target well, by trying different locations and orientations; binding depends on structure and chemistry. Output: one or more chemicals that bind the target well. Difficulties: computational complexity (a large search space for each protein-chemical combination, and many chemicals to try); the need to ensure specificity (not targeting other proteins and causing side effects).
Case study: Drug finding/design. There is a game called FoldIt (http://fold.it/) in which players try to fold proteins. Scores are based on free energy, with real-time updates of scores and ranks. Players can discuss and share solutions. It has resulted in some amazingly good folds compared with automatic predictions by computer programs. Image source: http://fold.it/portal/site_files/theme/science/competition.png
Summary. Functions depend on structures. Different levels of structures: primary (sequence), secondary (local), tertiary (global), quaternary (interactions). RNA secondary structures can be predicted by dynamic programming based on a thermodynamic model. Important sub-structures: stacking pairs, hairpin loops, internal loops/bulges, multi-loops, pseudoknots.
Further readings. Chapter 11 of Algorithms in Bioinformatics: A Practical Introduction: speed-up of the algorithm; algorithm for RNA structure prediction with pseudoknots; free slides available. Parts VII and VIII of Fundamental Concepts of Bioinformatics: protein folding and protein structure prediction; docking.