Non-coding RNA William Liu CS374: Algorithms in Biology November 23, 2004
Non-Coding RNA: Background Basics; Biology Overview; Why ncRNA - Central Dogma?; Problem Space; HMM/sCFG Solution; Paper; Pair HMMs on Tree Structures; Alignment of Trees, Structural Alignment; Experimental Evaluation; Conclusion
Central Dogma of Molec. Bio.
Biology Overview: under the traditional view of the central dogma, RNA merely plays an accessory role; complexity is defined by the proteins encoded in the genome
Biology Overview: non-coding RNA (ncRNA) is an RNA molecule that functions without being translated into a protein; most prominent examples: transfer RNA (tRNA), ribosomal RNA (rRNA)
Why Non-coding RNA: protein-coding genes can’t account for all complexity; ncRNA is important; ncRNAs act as gene regulators (source: Beyond The Proteome: Non-coding Regulatory RNAs, Genome Biol. 2002)
Non-coding RNA Problems: (1) finding ncRNA genes in the genome, i.e. locating these genes; (2) finding homologs of ncRNA, i.e. figuring out what they do
Finding ncRNA Genes: Protein approaches: statistical bias (codon triplets); open reading frames. ncRNA approaches: high GC content (hyperthermophiles); promoter/terminator identification (E. coli); comparative genome analysis
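The GC-content approach listed above can be sketched as a sliding-window scan. This is a minimal illustration, not a real gene finder: the function names `gc_content` and `gc_rich_windows`, the window size, the threshold, and the toy genome string are all assumptions chosen for the example.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_rich_windows(genome: str, window: int, threshold: float):
    """Yield (start, gc) for every window whose GC fraction >= threshold."""
    for start in range(len(genome) - window + 1):
        gc = gc_content(genome[start:start + window])
        if gc >= threshold:
            yield start, gc


# Toy AT-rich "genome" with one GC-rich island (made-up data).
genome = "ATATATATGCGCGGCCGCATATATAT"
hits = list(gc_rich_windows(genome, window=8, threshold=0.9))
print(hits)
```

In an AT-rich genome such as that of a hyperthermophile, windows flagged this way would only be candidates; the threshold would have to be tuned to the background composition.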
Genetic Code
Similarity Searching: Proteins: BLAST, sequence alignment (DP); genes that code for proteins are conserved across genomes (i.e. low rate of mutation). ncRNA: secondary structure is usually conserved; alignment scoring based on structure is imperative
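For reference, the sequence-only dynamic programming mentioned above can be sketched as a global alignment score in the Needleman-Wunsch style. The scoring values (`match`, `mismatch`, `gap`) and the function name `global_alignment_score` are illustrative assumptions, not parameters from the talk.

```python
def global_alignment_score(a: str, b: str,
                           match: int = 1, mismatch: int = -1,
                           gap: int = -2) -> int:
    """Best global alignment score of a and b under a linear gap penalty."""
    n, m = len(a), len(b)
    # score[i][j] = best score aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # (mis)match
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    return score[n][m]


print(global_alignment_score("GAUC", "GAUC"))
print(global_alignment_score("GAUC", "GAC"))
```

A scheme like this sees only the letters. Two RNAs whose stems carry compensating base changes can keep the same fold while scoring poorly here, which is why structure-aware scoring is needed for ncRNA.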
ncRNA: Sequence vs Structure
Alignment Approaches: sCFGs (stochastic context-free grammars) for modeling secondary structure and scoring sequences; HMMs for scoring sequence and secondary-structure alignments
Pair HMMs on Tree Structures: Outline: Alignment on Trees; Structural Alignment; Secondary Structure Representation; Hidden Markov Model; Recurrence Relations; Experimental Evaluation; Future Work
Alignment on Trees (figure: two trees, each with nodes labeled a through i)
Structural Alignment Problem: given an RNA sequence with known secondary structure and an RNA sequence of unknown structure, obtain the optimal alignment of the two (figure: example RNA sequence AUCGAAAGAU and a folded secondary structure)
Structural Representation: Skeletal Tree; ( , ): branch structure; (X, , Y): base pairs; (X, ) or ( , Y): unpaired bases; X, Y ∈ {A, U, G, C}
Hidden Markov Model: M: match state, I: insertion state, D: deletion state; transition probability (subscript XY): probability of moving from state X to state Y; initial probability (subscript X); emission probability for the symbol pair x, y; X, Y ∈ {M, I, D}
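The three-state model above can be illustrated on plain sequences first. The sketch below is the standard pair HMM Viterbi recursion for two unfolded sequences, not the tree-structured version from the paper. All probability values and the names `viterbi_pair_hmm`, `emit_match`, `emit_gap`, `INIT`, and `TRANS` are assumptions made up for illustration.

```python
import math

STATES = ("M", "I", "D")
# Hypothetical parameters (not taken from the paper).
INIT = {"M": 0.8, "I": 0.1, "D": 0.1}
TRANS = {"M": {"M": 0.8, "I": 0.1, "D": 0.1},
         "I": {"M": 0.7, "I": 0.2, "D": 0.1},
         "D": {"M": 0.7, "I": 0.1, "D": 0.2}}


def emit_match(x: str, y: str) -> float:
    """Joint emission of the pair (x, y) in state M; sums to 1 over 16 pairs."""
    return 0.19 if x == y else 0.02


def emit_gap(x: str) -> float:
    """Emission of a single symbol against a gap in state I or D."""
    return 0.25


def viterbi_pair_hmm(x, y, init, trans, emit_match, emit_gap):
    """Log-probability of the best M/I/D state path aligning x and y.

    M consumes one symbol from each sequence, I one from y, D one from x.
    """
    neg = float("-inf")
    n, m = len(x), len(y)
    # V[s][i][j]: best log-prob of aligning x[:i], y[:j] ending in state s
    V = {s: [[neg] * (m + 1) for _ in range(n + 1)] for s in STATES}
    step = {"M": (1, 1), "I": (0, 1), "D": (1, 0)}
    for i in range(n + 1):
        for j in range(m + 1):
            for s in STATES:
                pi, pj = i - step[s][0], j - step[s][1]
                if pi < 0 or pj < 0:
                    continue  # state s cannot end at this cell
                if s == "M":
                    e = emit_match(x[i - 1], y[j - 1])
                elif s == "I":
                    e = emit_gap(y[j - 1])
                else:
                    e = emit_gap(x[i - 1])
                if pi == 0 and pj == 0:
                    prev = math.log(init[s])  # first state on the path
                else:
                    prev = max(V[r][pi][pj] + math.log(trans[r][s])
                               for r in STATES)
                V[s][i][j] = prev + math.log(e)
    return max(V[s][n][m] for s in STATES)


best = viterbi_pair_hmm("AC", "AC", INIT, TRANS, emit_match, emit_gap)
print(best)
```

The tree version replaces the second sequence index with a node of the skeletal tree, so the table is indexed by a state, a tree node, and a substring of the unfolded sequence rather than by two prefixes.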
Notation: let w = a_1 a_2 … a_n be an unfolded RNA sequence of length n; let w[i] denote the i-th symbol in w; let w[i,j] denote the substring a_i a_{i+1} … a_j of w
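The sequence notation is 1-indexed and inclusive at both ends, which differs from Python slicing. A small sketch of the mapping, using the example sequence from the structural alignment slide; the helper names `symbol` and `substring` are hypothetical.

```python
def symbol(w: str, i: int) -> str:
    """w[i]: the i-th symbol of w, counting from 1."""
    return w[i - 1]


def substring(w: str, i: int, j: int) -> str:
    """w[i, j]: the symbols a_i ... a_j, inclusive, counting from 1."""
    return w[i - 1:j]


w = "AUCGAAAGAU"
print(symbol(w, 4))        # the 4th symbol
print(substring(w, 2, 4))  # a_2 a_3 a_4
```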
Notation: let T be a skeletal tree representing a folded RNA sequence (known structure); let v(j) denote the label of node j in tree T; let T[j] denote the subtree rooted at node j in tree T; let j_n denote the n-th child of node j in tree T
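The tree notation can be mirrored with a small node class. This is only a sketch: the `Node` class, the helper names, and the tuple encoding of labels (with `None` standing in for an empty position) are assumptions for illustration, and the toy tree is not taken from the paper. A node object plays the role of the index j, so the subtree T[j] is the node together with its descendants.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: tuple                      # hypothetical encoding, e.g. ("G", "C")
    children: list = field(default_factory=list)


def v(node: Node) -> tuple:
    """v(j): the label of node j."""
    return node.label


def child(node: Node, n: int) -> Node:
    """j_n: the n-th child of node j, counting from 1."""
    return node.children[n - 1]


def subtree_size(node: Node) -> int:
    """Number of nodes in T[j], the subtree rooted at node j."""
    return 1 + sum(subtree_size(c) for c in node.children)


# Toy tree: a branch node with two short stems (made-up structure).
stem1 = Node(("G", "C"), [Node(("A", None))])
stem2 = Node(("A", "U"), [Node((None, "G"))])
root = Node((None, None), [stem1, stem2])
print(v(child(root, 2)), subtree_size(root))
```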
Recurrence Relation (Match)
Recurrence Relation (Delete)
Recurrence Relation (Insert)
Structural Alignment: Intuition: given an ncRNA sequence b with unknown structure, generate a predicted folded structure for b and align the resulting tree with the ncRNA a, whose secondary structure is known. Complexity: O(K M N^3), where K = number of states in the pair HMM, M = size of the skeletal tree, N = length of the unfolded sequence
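To get a feel for the cubic term in the complexity bound, the product K · M · N^3 can be evaluated for a few sizes. The numbers below are hypothetical inputs, and the function `dp_work` is an illustrative name; the result is an order-of-magnitude count with constant factors ignored, not a measured running time.

```python
def dp_work(k: int, m: int, n: int) -> int:
    """Order-of-magnitude operation count K * M * N^3 (constants ignored)."""
    return k * m * n ** 3


# Assumed sizes: 3 states (M, I, D), a 30-node tree, sequences of 100 and 200.
small = dp_work(3, 30, 100)
large = dp_work(3, 30, 200)
print(small, large, large // small)
```

Doubling the sequence length multiplies the work by eight, while doubling the tree size or the number of states only doubles it, so N dominates the cost.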
Experimental Evaluation: dynamic programming computes the recurrence relations; a prototype system executes the algorithm; experiments on 2 families of RNA: transfer RNAs and hammerhead ribozymes
Parameters Gorodkin et al. (1997)
Results: tRNA
Results: Hammerhead Ribozyme
Future Work: because the method is based on dynamic programming (as in pairwise alignment), many DP techniques can be applied; refine the emission probabilities and the related score matrix (for reliable alignment of RNA families)
Conclusions: the ncRNA space is quite open, with no really great techniques yet; how many ncRNA genes are there?; absence of evidence ≠ evidence of absence; Eddy’s call to arms: “it is time for RNA computational biologists to step up”
Thanks!
References: Sakakibara, Y., “Pair Hidden Markov Models on Tree Structures”, Bioinformatics, 19: , 2003; Eddy, S., “Computational Genomics of Noncoding RNA Genes”, Cell, 109: , 2002; Szymanski, M., Barciszewski, J., “Beyond The Proteome: Non-coding Regulatory RNAs”, Genome Biol., 2002