Published by Merry Marsh. Modified over 9 years ago.
1
3/14/08, UMKC
TCGR: A Novel DNA/RNA Visualization Technique
Margaret H. Dunham and Donya Quick
Southern Methodist University, Dallas, Texas 75275
mhd@engr.smu.edu
Some slides presented at IEEE BIBE 2006
2
Outline
- Introduction
- TCGR
- EMM
- miRNA Prediction using TCGR/EMM
- Conclusion / Future Work
3
Outline
- Introduction
  - Background
  - CGR/FCGR
  - Motivation
  - Research Objective
- TCGR
- EMM
- miRNA Prediction using TCGR/EMM
- Conclusion / Future Work
4
DNA
- Deoxyribonucleic acid
- Basic building blocks of organisms
- Located in the nucleus of cells
- Composed of 4 nucleotides
  - Adenine (A)
  - Cytosine (C)
  - Guanine (G)
  - Thymine (T)
- Two strands bound together
- Contains genetic information
Image source: http://www.visionlearning.com/library/module_viewer.php?mid=63
5
Nucleotide Bases
[Figure: http://www.people.virginia.edu/~rjh9u/gif/bases.gif]
6
Transcription
- During transcription, DNA is converted into mRNA
- The RNA is processed and noncoding regions are removed
- Coding regions are converted into protein
- RNA polymerase is the enzyme that starts transcription by binding to the DNA
7
Transcription
[Figure: http://ghs.gresham.k12.or.us/science/ps/sci/ibbio/chem/nucleic/chpt15/transcription.gif]
8
RNA
- Ribonucleic acid
- Contains A, C, G, but U (uracil) instead of T
- Single stranded
- May fold back on itself
- Needed to create proteins
- Moves around cells and can act like a messenger
- mRNA moves out of the nucleus to other parts of the cell
9
Translation
- Synthesis of proteins from mRNA
- The nucleotide sequence of the mRNA is converted into the amino acid sequence of the protein
- Four nucleotides
- Twenty amino acids
- Codon: a group of 3 nucleotides
- Most amino acids are encoded by more than one codon
10
Central Dogma: DNA -> RNA -> Protein
DNA:     CCTGAGCCAACTATTGATGAA
  (transcription)
RNA:     CCUGAGCCAACUAUUGAUGAA
  (translation)
Protein: PEPTIDE
Source: www.bioalgorithms.info; chapter 6; Gene Prediction
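The DNA -> RNA -> Protein example on this slide can be checked with a short sketch. It assumes the standard genetic code and treats transcription as a simple T-to-U substitution on the coding strand, as the slide does; the function names `transcribe` and `translate` are illustrative, not from the talk.

```python
# Standard genetic code; codons are enumerated with bases in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(dna: str) -> str:
    """Simplified transcription: coding-strand DNA to mRNA (T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read the mRNA three bases at a time and look up each codon."""
    dna = mrna.upper().replace("U", "T")
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


dna = "CCTGAGCCAACTATTGATGAA"
mrna = transcribe(dna)
protein = translate(mrna)
print(mrna)     # CCUGAGCCAACUAUUGAUGAA
print(protein)  # PEPTIDE
```

The table also shows why codings are redundant: 64 codons map onto 20 amino acids plus stop signals.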
11
[Figure: http://www.time.com/time/magazine/article/0,9171,1541283,00.html]
12
Human Genome
- Scientists originally thought there would be about 100,000 genes
- There appear to be about 20,000
- Why?
- Almost identical to that of chimps. What makes the difference?
- Answers appear to lie in the noncoding regions of the DNA (formerly thought to be junk)
13
More Questions
- If each cell in an organism contains the same DNA:
  - How does each cell behave differently?
  - Why do cells behave differently during childhood development?
  - What causes some cells to act differently, such as during disease?
- DNA contains many genes, but only a few are being transcribed. Why?
- One answer: miRNA
14
miRNA
- Short (20-25 nt) sequence of noncoding RNA
- Single strand
- Previously assumed to be garbage
- Impacts or prevents translation of mRNA
- Binds to target areas in mRNA; the problem is that this binding is not perfect (particularly in animals)
- An mRNA may have multiple (nonoverlapping) binding sites for one miRNA
15
miRNA Functions
- Causes some cancers
- Embryo development
- Cell differentiation
- Cell death
- Prevents the production of a protein that causes lung cancer
- Controls brain development in zebrafish
- Associated with HIV
16
Outline
- Introduction
  - Background
  - CGR/FCGR
  - Motivation
  - Research Objective
- TCGR
- EMM
- miRNA Prediction using TCGR/EMM
- Conclusion / Future Work
17
Chaos Game Representation (CGR)
Scatter plot showing occurrence of patterns of nucleotides.
Source: University of the Basque Country, http://insilico.ehu.es/genomics/my_words/
18
Frequency CGR (FCGR)
Shows the frequencies of oligonucleotides using a color scheme normalized to the distribution of frequency of occurrence of associated patterns.
19
Chaos Game Representation (CGR)
- 2D technique to visualize the distribution of subpatterns
- Our technique is based on the following:
  - Generate totals for each subpattern
  - Scale totals to a [0,1] range (note: scaling can be a problem)
  - Convert range to red/blue
    - 0-0.5: white to blue
    - 0.5-1: blue to red
[Figure: FCGR grids with quadrants labeled A, C, G, U]
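The three steps above (totals, scaling, color) can be sketched as follows. This is a minimal illustration, not the authors' implementation: dividing by the largest total is only one possible scaling (the slide itself warns that scaling can be a problem), the linear color ramp is an assumption, and the names `subpattern_totals`, `scale_totals`, and `to_rgb` are hypothetical.

```python
from itertools import product


def subpattern_totals(seq, k, alphabet="ACGU"):
    """Count every overlapping subpattern of length k in seq."""
    seq = seq.upper()
    totals = {"".join(p): 0 for p in product(alphabet, repeat=k)}
    for i in range(len(seq) - k + 1):
        pattern = seq[i:i + k]
        if pattern in totals:  # skip patterns with unexpected characters
            totals[pattern] += 1
    return totals


def scale_totals(totals):
    """Scale totals to [0, 1] by dividing by the largest total (assumed rule)."""
    top = max(totals.values())
    return {p: (c / top if top else 0.0) for p, c in totals.items()}


def to_rgb(x):
    """Map 0 -> white, 0.5 -> blue, 1 -> red, linearly in between."""
    if x <= 0.5:
        t = x / 0.5
        return (round(255 * (1 - t)), round(255 * (1 - t)), 255)
    t = (x - 0.5) / 0.5
    return (round(255 * t), 0, round(255 * (1 - t)))


totals = subpattern_totals("UUCGUGUUC", k=3)
scaled = scale_totals(totals)
colors = {p: to_rgb(x) for p, x in scaled.items()}
print(totals["UUC"], scaled["UUC"], colors["UUC"])
```

The toy sequence is made up for illustration; patterns that never occur stay white, and the most frequent pattern is drawn in red.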
20
FCGR
a) Nucleotides
A C
G T
b) Dinucleotides
AA AC CA CC
AG AT CG CT
GA GC TA TC
GG GT TG TT
c) Trinucleotides
AAA AAC ACA ACC CAA CAC CCA CCC
AAG AAT ACG ACT CAG CAT CCG CCT
AGA AGC ATA ATC CGA CGC CTA CTC
AGG AGT ATG ATT CGG CGT CTG CTT
GAA GAC GCA GCC TAA TAC TCA TCC
GAG GAT GCG GCT TAG TAT TCG TCT
GGA GGC GTA GTC TGA TGC TTA TTC
GGG GGT GTG GTT TGG TGT TTG TTT
Figures courtesy of Eamonn Keogh, UCR
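The grid layout on this slide appears to follow a recursive quadrant rule: the first letter picks a quadrant (A top-left, C top-right, G bottom-left, T bottom-right), the second letter picks a quadrant within it, and so on. The sketch below encodes that reading of the figure; `cell` and `fcgr_grid` are illustrative names.

```python
from itertools import product

# (row, column) of each nucleotide's quadrant, as read from panel a).
QUADRANT = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}


def cell(pattern):
    """Return the (row, column) of a pattern in the 2^k x 2^k grid."""
    row = col = 0
    for ch in pattern.upper():
        r, c = QUADRANT[ch]
        row = 2 * row + r  # earlier letters select coarser quadrants
        col = 2 * col + c
    return row, col


def fcgr_grid(k):
    """Build the grid of pattern labels for subpattern length k."""
    size = 2 ** k
    grid = [[None] * size for _ in range(size)]
    for p in product("ACGT", repeat=k):
        pattern = "".join(p)
        row, col = cell(pattern)
        grid[row][col] = pattern
    return grid


for row in fcgr_grid(2):
    print(" ".join(row))
```

For k = 2 this prints the same four rows as panel b).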
21
FCGR Example
Homo sapiens: all mature miRNA, patterns of length 3
[Figure: FCGR; labeled patterns: UUC, GUG]
22
Outline
- Introduction
  - Background
  - CGR/FCGR
  - Motivation
  - Research Objective
- TCGR
- EMM
- miRNA Prediction using TCGR/EMM
- Conclusion / Future Work
23
Motivation
2000 bp flanking upstream region of mir-258.2 in C. elegans
[Figure panels: a) all 2000 bp; b) first 240 bp; c) last 240 bp]
24
Research Objectives
- Identify, develop, and implement algorithms that can be used to identify potential miRNA functions.
- Create an online tool that other researchers can use to apply our algorithms to new data.
25
Outline
- Introduction
  - CGR/FCGR
  - miRNA
  - Motivation
  - Research Objective
- TCGR
- EMM
- miRNA Prediction using TCGR/EMM
- Conclusion / Future Work
26
Temporal CGR (TCGR)
- Temporal version of Frequency CGR
  - In our context, temporal means the starting location of a window
- 2D array
  - Each row represents counts for a particular window in the sequence
    - First row: first window
    - Last row: last window
    - Successive windows start at the next character location
  - Each column represents the counts for the associated pattern in that window
    - Initially we have assumed the order of patterns is alphabetic
  - Size of the TCGR depends on sequence length and subpattern length
- As sequence lengths vary, we only examine complete windows
- We only count patterns completely contained in each window
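A minimal sketch of the array described above, assuming overlapping subpatterns are counted inside each window. The function name `tcgr` and the toy sequence are illustrative only.

```python
from itertools import product


def tcgr(seq, window, k, alphabet="ACGT"):
    """Return (patterns, rows): one row of subpattern counts per window.

    Windows start at every character position; only complete windows are
    used, and only patterns lying entirely inside a window are counted.
    """
    seq = seq.upper()
    # product() over a sorted alphabet yields patterns in alphabetical order.
    patterns = ["".join(p) for p in product(alphabet, repeat=k)]
    column = {p: j for j, p in enumerate(patterns)}
    rows = []
    for start in range(len(seq) - window + 1):
        win = seq[start:start + window]
        counts = [0] * len(patterns)
        for i in range(window - k + 1):
            pattern = win[i:i + k]
            if pattern in column:
                counts[column[pattern]] += 1
        rows.append(counts)
    return patterns, rows


patterns, rows = tcgr("ACGTACGTAC", window=5, k=2)
print(len(rows), "rows x", len(patterns), "columns")
```

For a sequence of length L, window length w, and subpattern length k, the array has L - w + 1 rows and 4^k columns, and each row sums to w - k + 1.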
27
TCGR Example
Sequence: acgtgcacgtaactgattccggaaccaaatgtgcccacgtcga

Moving window counts:
Window      A    C    G    T
Pos 0-8     2    3    3    1
Pos 1-9     1    3    3    2
…
Pos 34-42   2    4    2    1

Scaled values:
Window      A    C    G    T
Pos 0-8     0.4  0.6  0.6  0.2
Pos 1-9     0.2  0.6  0.6  0.4
…
Pos 34-42   0.4  0.8  0.4  0.2
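The numbers on this slide can be reproduced with a few lines. The counts follow directly from a window of 9 characters; the scaled values are consistent with dividing every count by the largest count in the whole array (5 for this sequence), which is an inference from the slide's numbers rather than a stated rule.

```python
SEQ = "acgtgcacgtaactgattccggaaccaaatgtgcccacgtcga"
WINDOW = 9


def window_counts(seq, window, alphabet="acgt"):
    """Single-nucleotide counts for every complete window."""
    return [
        [seq[start:start + window].count(base) for base in alphabet]
        for start in range(len(seq) - window + 1)
    ]


counts = window_counts(SEQ, WINDOW)
peak = max(max(row) for row in counts)  # largest count anywhere in the array
scaled = [[c / peak for c in row] for row in counts]

print(len(counts))                       # number of windows
print(counts[0], counts[1], counts[34])  # Pos 0-8, Pos 1-9, Pos 34-42
print(scaled[0], scaled[1], scaled[34])
```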
28
TCGR Example (cont’d)
[Figure: TCGRs for sub-patterns of length 1, 2, and 3]
29
TCGR Example (cont’d)
Window 0:  Pos 0-8    acgtgcacg
Window 1:  Pos 1-9    cgtgcacgt
Window 17: Pos 17-25  tccggaacc
Window 18: Pos 18-26  ccggaacca
Window 34: Pos 34-42  ccacgtcga
[Figure: TCGR rows for these windows; columns A, C, G, T]
30
TCGR – Viruses miRNA (Window=9; Pattern=1, 2, 3)
[Figure panels: Epstein Barr; Human Cytomegalovirus; Kaposi sarc Herpesvirus; Mouse Gammaherpesvirus; each shown for Pattern=1, Pattern=2, and Pattern=3]
31
TCGR – Mature miRNA (Window=5; Pattern=3)
[Figure panels: All Mature; Mus musculus; Homo sapiens; C. elegans; labeled patterns: ACG, CGC, GCG, UCG]
32
Outline
- Introduction
  - CGR/FCGR
  - miRNA
  - Motivation
  - Research Objective
- TCGR
- EMM
- miRNA Prediction using TCGR/EMM
- Conclusion / Future Work
33
EMM Overview
- Time-varying discrete first-order Markov model
- Nodes are clusters of real-world states
- Learning continues during the prediction phase
- Learning:
  - Transition probabilities between nodes
  - Node labels (centroid of cluster)
  - Nodes are added and removed as data arrives
34
EMM Definition
Extensible Markov Model (EMM): at any time t, an EMM consists of a Markov chain (MC) with a designated current node, Nn, and algorithms to modify it, where the algorithms include:
- EMMCluster, which defines a technique for matching between input data at time t + 1 and existing states in the MC at time t.
- EMMIncrement, which updates the MC at time t + 1 given the MC at time t and the clustering measure result at time t + 1.
- EMMDecrement, which removes nodes from the EMM when needed.
35
EMM Cluster
- Find the closest node to the incoming event
- If none is “close”, create a new node
- The label of a cluster is the centroid of its members
- O(n)
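A minimal sketch of the clustering step above, under stated assumptions: Euclidean distance as the closeness measure and a fixed distance threshold for "close" are both choices made here for illustration, and the names `Node` and `emm_cluster` are hypothetical. The slides do not specify the measure used.

```python
import math


class Node:
    """A cluster labeled by the centroid of its member vectors."""

    def __init__(self, vector):
        self.centroid = [float(v) for v in vector]
        self.size = 1

    def absorb(self, vector):
        """Add a member and update the centroid as a running mean."""
        self.size += 1
        self.centroid = [
            c + (v - c) / self.size for c, v in zip(self.centroid, vector)
        ]


def emm_cluster(nodes, vector, threshold):
    """Return the index of the node matched to vector.

    Scans all n nodes (the O(n) step); if the nearest one is farther than
    threshold, a new node is created instead.
    """
    best, best_dist = None, math.inf
    for i, node in enumerate(nodes):
        dist = math.dist(node.centroid, vector)
        if dist < best_dist:
            best, best_dist = i, dist
    if best is None or best_dist > threshold:
        nodes.append(Node(vector))
        return len(nodes) - 1
    nodes[best].absorb(vector)
    return best


nodes = []
path = [emm_cluster(nodes, v, threshold=2.0) for v in [(0, 0), (1, 0), (10, 10)]]
print(path, [n.centroid for n in nodes])
```

The three toy vectors are made up; the first two fall into one node and the third opens a new one.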
36
EMM Increment
Input vectors: <18,10,3,3,1,0,0>, <17,10,2,3,1,0,0>, <16,9,2,3,1,0,0>, <14,8,2,3,1,0,0>, <14,8,2,3,0,0,0>, <18,10,3,3,1,1,0>
[Figure: the Markov chain after each input, with nodes N1, N2, N3 and transition probabilities such as 1/3, 2/3, 1/2, 1/1, 2/2]
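One way to picture the increment step is as bookkeeping of transition counts, from which the fractions in the figure arise. The sketch below is an assumption-laden illustration: the class `MarkovChain`, its counters, and the node path are hypothetical and do not reproduce the exact sequence in the figure.

```python
from collections import defaultdict


class MarkovChain:
    """Transition counts between node labels, updated one event at a time."""

    def __init__(self):
        self.departures = defaultdict(int)  # times each node was left
        self.links = defaultdict(int)       # times node i was followed by j
        self.current = None

    def increment(self, node):
        """Record a move from the current node to node, then make it current."""
        if self.current is not None:
            self.departures[self.current] += 1
            self.links[(self.current, node)] += 1
        self.current = node

    def probability(self, i, j):
        """Estimated transition probability from i to j."""
        left = self.departures.get(i, 0)
        return self.links.get((i, j), 0) / left if left else 0.0


mc = MarkovChain()
for node in ["N1", "N2", "N1", "N3", "N1", "N2"]:  # hypothetical cluster path
    mc.increment(node)

print(mc.probability("N1", "N2"), mc.probability("N1", "N3"))
```

With this path, N1 is left three times: twice to N2 and once to N3, giving probabilities 2/3 and 1/3.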
37
Outline
- Introduction
  - CGR/FCGR
  - miRNA
  - Motivation
  - Research Objective
- TCGR
- EMM
- miRNA Prediction using TCGR/EMM
- Conclusion / Future Work
38
Research Approach
1. Represent a potential miRNA sequence with a TCGR sequence of count vectors
2. Create an EMM using count vectors for known miRNA (miRNA stem loops, miRNA targets)
3. Predict an unknown sequence to be miRNA (miRNA stem loop, miRNA target) based on the normalized product of transition probabilities along the clustering path in the EMM
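Step 3 can be sketched as a scoring function over the path of nodes that a sequence's count vectors are clustered into. The slides do not say how the product is normalized; the geometric mean used below is an assumption chosen so that long and short paths are comparable, and `path_score` and the probabilities are hypothetical.

```python
import math


def path_score(path, transition_prob):
    """Geometric mean of the transition probabilities along a node path.

    path is a list of node labels; transition_prob maps (from, to) pairs to
    probabilities. A transition never seen in training scores 0.
    """
    steps = list(zip(path, path[1:]))
    if not steps:
        return 0.0
    probs = [transition_prob.get(step, 0.0) for step in steps]
    if min(probs) == 0.0:
        return 0.0  # one unseen transition zeroes the whole product
    return math.exp(sum(math.log(p) for p in probs) / len(probs))


# Hypothetical transition probabilities learned from known miRNA.
learned = {("N1", "N2"): 0.5, ("N2", "N3"): 0.5, ("N3", "N1"): 1.0}

print(path_score(["N1", "N2", "N3"], learned))
print(path_score(["N1", "N3", "N2"], learned))  # uses unseen transitions
```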
39
Related Work [1]
- Predicted the occurrence of pre-miRNA segments from a set of hairpin sequences
- No assumptions about biological function or conservation across species
- Used SVMs to differentiate the structure of hairpin segments that contained pre-miRNAs from those that did not
- Sensitivity of 93.3%
- Specificity of 88.1%
[1] C. Xue, F. Li, T. He, G. Liu, Y. Li, and X. Zhang, “Classification of Real and Pseudo MicroRNA Precursors using Local Structure-Sequence Features and Support Vector Machine,” BMC Bioinformatics, vol 6, no 310.
40
Preliminary Test Data [1]
- Positive Training: 163 human pre-miRNAs with lengths of 62-119.
- Negative Training: obtained from protein coding regions of human RefSeq genes. As these are from coding regions, it is likely that there are no true pre-miRNAs in this data. This dataset contains 168 sequences with lengths between 63 and 110 characters.
- Positive Test: 30 pre-miRNAs.
- Negative Test: 1000 randomly chosen sequences from coding regions.
[1] C. Xue, F. Li, T. He, G. Liu, Y. Li, and X. Zhang, “Classification of Real and Pseudo MicroRNA Precursors using Local Structure-Sequence Features and Support Vector Machine,” BMC Bioinformatics, vol 6, no 310.
41
TCGRs for Xue Training Data
[Figure panels: Positive; Negative]
42
TCGRs for Xue Test Data
[Figure panels: Positive; Negative]
43
Predictive Probabilities with Xue’s Data
EMM       Test Data   Mean     Std Dev   Max      Min
Negative  Test-Neg    0        0         0        0
Negative  Test-Pos    0        0         0        0
Negative  Train-Neg   0.37963  0.050085  0.91256  0.2945
Negative  Train-Pos   0        0         0        0
Positive  Test-Neg    0        0         0        0
Positive  Test-Pos    0.25894  0.18701   0.42075  0
Positive  Train-Neg   0        0         0        0
Positive  Train-Pos   0.38926  0.048439  0.91155  0.32209
44
Preliminary Test Results
- Positive EMM
  - Cutoff probability = 0.3
  - False positive rate = 0%
  - True positive rate = 66%
- Test results could be improved by meta-classifiers that combine multiple positive and negative classifiers.
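How a cutoff turns predictive probabilities into true and false positive rates can be sketched as below. The scores are made-up placeholders, not the study's data, and treating a score at or above the cutoff as a positive call is an assumption; `rates` is an illustrative name.

```python
def rates(positive_scores, negative_scores, cutoff):
    """Return (true positive rate, false positive rate) at a given cutoff."""
    true_pos = sum(score >= cutoff for score in positive_scores)
    false_pos = sum(score >= cutoff for score in negative_scores)
    return true_pos / len(positive_scores), false_pos / len(negative_scores)


# Hypothetical scores from a positive EMM, for illustration only.
positive_test = [0.42, 0.35, 0.0]
negative_test = [0.0, 0.0]

tpr, fpr = rates(positive_test, negative_test, cutoff=0.3)
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```

The true positive rate is the same quantity as sensitivity, and the false positive rate equals 1 minus specificity, which allows comparison with the figures reported in the related work.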
45
Outline
- Introduction
  - CGR/FCGR
  - miRNA
  - Motivation
  - Research Objective
- TCGR
- EMM
- miRNA Prediction using TCGR/EMM
- Conclusion / Future Work
46
Future Research
1. Obtain all known mature miRNA sequences for a species, initially the 119 C. elegans miRNAs.
2. Create TCGR count vectors for each sequence and each sub-pattern length (1, 2, 3, 4, 5).
3. Train EMMs using this data for each sub-pattern length. Thus five EMMs will be created.
4. Obtain negative data (much as Xue did in his research) from coding regions for C. elegans.
5. Train EMMs using this negative data for each sub-pattern length. Thus five more EMMs will be created.
6. Construct a meta-classifier based on the combined results of prediction from each of these ten EMMs.
7. Apply the EMM classifier to the existing ~75 x 10^6 base pairs of non-exonic sequence in the C. elegans genome to search for miRNAs. Note: all 119 validated C. elegans miRNAs are contained in the non-exonic part of the genome, so the first pass of the algorithm will be tested for its ability to detect all 119 validated miRNAs.
8. Validate the prediction of novel miRNAs using molecular biology.