Published by Edwina James. Modified over 9 years ago.
1
Doug Raiford Lesson 3
2
We have a fully sequenced genome. How do we identify the genes? What do we know so far?
10/13/2015 Gene Prediction
3
Remember: the start codon (ATG) codes for methionine, and the stop codons (TAG, TAA, or TGA) do not code for an amino acid. Does every ATG mark the beginning of a gene? Does every TAG, TAA, or TGA mark the end?
4
The start and stop codons must be “in frame”: a whole set of codons must fit between them, so the length is evenly divisible by three. Open reading frame (ORF): a series of codons bracketed by a start codon and a stop codon in the same frame.
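The in-frame rule can be sketched in a few lines of Python. This is a minimal illustration, not a real gene finder: the function name `find_orfs` and the `min_codons` parameter are hypothetical, only the given strand is scanned, and each ORF is taken from the first ATG in a frame to the next in-frame stop.

```python
STOPS = {"TAG", "TAA", "TGA"}

def find_orfs(dna, min_codons=1):
    """Return (start, end) index pairs; end is exclusive and includes the stop codon."""
    orfs = []
    for frame in range(3):                      # three reading frames on this strand
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # open an ORF at the first ATG
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                    # close it at the first in-frame stop
    return orfs

example = "CCATGAAATTTTAGGG"
orfs = find_orfs(example)
```

Here the TGA at position 3 is ignored because no ORF is open in that frame, which shows why not every stop codon marks the end of a gene.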
5
In real genes, the distance between the start and stop codons tends to be longer than expected by chance. How long would we expect that distance to be?
6
There are 64 different codons. A given codon should show up randomly around once every 64 codons, or 192 nts (64*3). There are 3 stop codons, so expect 3 in every 64 codons, or one every 21 1/3 codons (21 1/3 * 3 = 64 nts).
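The arithmetic can be checked with exact fractions. This small sketch assumes, as the slide implicitly does, that all 64 codons are equally likely and independent, so the wait for the first stop codon follows a geometric distribution with mean 1/p.

```python
from fractions import Fraction

p_stop = Fraction(3, 64)             # 3 of the 64 codons are stops
expected_codons = 1 / p_stop         # mean wait for the first stop, in codons
expected_nts = expected_codons * 3   # same distance in nucleotides
```

Real genomes are not uniform in nucleotide composition, so this is only a baseline for comparison.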
7
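Escherichia coli: the number of genes is 4356. Gene lengths range from a minimum of 44 nts to a maximum of 8621 nts. Only 8 are shorter than 64 nts, and 143 are shorter than 128 nts (3%). ORF length is a good start, but there must be more to it: approximately 77,000 ORFs longer than 2x the expected length occur on each strand.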
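To see why length alone leaves so many candidates, one can estimate how often a random reading frame runs past twice the expected distance without hitting a stop. This sketch again assumes uniform, independent codons; `prob_no_stop` is a hypothetical helper name.

```python
def prob_no_stop(n_codons, p_stop=3 / 64):
    """Chance that n_codons random codons in a row contain no stop codon."""
    return (1 - p_stop) ** n_codons

# 128 nts is about 43 codons, i.e. twice the expected 64 nts
p_long = prob_no_stop(43)
```

Under these assumptions the value comes out at roughly 0.13, so long stop-free stretches are common by chance in a genome of millions of nucleotides.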
8
To “find” a gene, we would look for nt sequences that look like the parts of a gene: the promoter region (where RNA polymerase binds), the coding region, and the terminator region. The coding region begins with the start codon ‘ATG’ (methionine) and ends with a non-coding stop codon: ‘TAA’, ‘TAG’, or ‘TGA’.
9
Promoters attract the polymerase. They consist of specific sequences (motifs) and play a role in gene regulation. Each promoter has a unique pattern.
[Figure: promoter layout upstream of the coding region: polymerase binding at the -35 sequence (TTGACA) and the -10 sequence (TATAAT), followed by the transcription start site, the ribosomal binding site, and the start codon.]
10
Slightly different -35 and -10 motifs attract different sigma factors. Genes with similar upstream regions tend to be related: they are expressed similarly.
11
Termination signal: a hairpin followed by a U-run (an A-run in the DNA).
12
Weak uracil bonds, coupled with the hairpin binding to the NusA protein bound to the polymerase, release the transcript.
[Figure: polymerase on the DNA at an A-run (AAAAAAAA), with the matching U-run (UUUUUUU) in the mRNA.]
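A crude way to look for such a signal is to search for an inverted repeat (the hairpin stem) followed by a run of T's. This is only a sketch under several assumptions that are not from the slides: it scans the coding strand (so the U-run appears as T's), it requires a perfect stem, and the `stem`, `loop`, and `t_run` lengths are arbitrary illustrative values.

```python
def reverse_complement(dna):
    comp = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(comp[b] for b in reversed(dna))

def find_terminators(dna, stem=5, loop=4, t_run=4):
    """Return start indices of stem-loop-stem structures followed by a T-run."""
    hits = []
    span = 2 * stem + loop
    for i in range(len(dna) - span - t_run + 1):
        left = dna[i:i + stem]
        right = dna[i + stem + loop:i + span]
        tail = dna[i + span:i + span + t_run]
        # The right arm must pair with the left arm, then the T-run must follow.
        if right == reverse_complement(left) and tail == "T" * t_run:
            hits.append(i)
    return hits

example = "AC" + "GCCGC" + "TTAA" + "GCGGC" + "TTTTTT" + "GA"
hits = find_terminators(example)
```

Real terminators tolerate mismatches and variable stem and loop lengths, which is part of why exact matching falls short.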
13
How do we find these regions? It is difficult: the motifs are fuzzy, not carved in stone.
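One simple way to cope with fuzziness is to allow a limited number of mismatches against the consensus. The sketch below is an assumption-laden illustration (the names `mismatches` and `scan` and the one-mismatch threshold are hypothetical), not the method the slides go on to use.

```python
def mismatches(window, consensus):
    """Count positions where the window differs from the consensus."""
    return sum(a != b for a, b in zip(window, consensus))

def scan(dna, consensus="TATAAT", max_mismatches=1):
    """Return (position, window) pairs within max_mismatches of the consensus."""
    k = len(consensus)
    return [(i, dna[i:i + k])
            for i in range(len(dna) - k + 1)
            if mismatches(dna[i:i + k], consensus) <= max_mismatches]

hits = scan("GGCTATACTGGC")
```

A fixed mismatch count treats every position as equally important, which is one motivation for a probabilistic model instead.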
14
Hidden Markov models (HMMs) are often used. They are all about the statistics. Markov chain: a series of events along with their probabilities.
[Figure: state-machine diagram for matching TATAAT, with the labels “Start”, “G or C”, “T or A”, and “Yay! I found one”.]
15
The previous diagram was a “state machine” representation. An HMM should have states and observations. The states are “hidden”.
[Figure: HMM state diagram for the TATAAT motif, showing numbered states, the nucleotides each state emits, and the transition probabilities.]
16
Each state has a probability of “emitting” any given observation. Each state has a probability of “transitioning” to any given next state.
17
Transition probability matrix (TRANS): rows represent the current state, columns represent the state to which a transition will occur, and each entry is the probability associated with that transition.
Emission probability matrix (EMIS): rows represent states, columns represent which observation is emitted, and each entry is the probability associated with that emission.
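As a concrete picture of the two matrices, here is a Python sketch that uses the values from the Matlab slide later in this deck, with states renumbered from 0. The helper `is_row_stochastic` is a hypothetical name; it checks that every row is a probability distribution.

```python
SYMBOLS = "ACGT"  # column order of the emission matrix

# EMIS[state][symbol]: probability that a state emits a symbol
EMIS = [
    [0.25, 0.25, 0.25, 0.25],  # background before the motif
    [0, 0, 0, 1],              # T
    [1, 0, 0, 0],              # A
    [0, 0, 0, 1],              # T
    [1, 0, 0, 0],              # A
    [0.25, 0.25, 0.25, 0.25],  # background after the motif
]

# TRANS[current][next]: probability of moving from one state to another
TRANS = [
    [16 / 17, 1 / 17, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1],
]

def is_row_stochastic(matrix, tol=1e-9):
    """Every row must sum to 1."""
    return all(abs(sum(row) - 1.0) <= tol for row in matrix)

valid = is_row_stochastic(EMIS) and is_row_stochastic(TRANS)
```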
18
Building a model requires a subject matter expert. Often we start with a state for each position in a possible match. Example: looking for something similar to TATAAT. It might not have both A’s, it might have an extra one in the first slot, and it never has G’s or C’s.
19
We also need a state for non-participating regions.
20
Make a first guess as to the probabilities. Maybe from the state associated with the first T to A it is 100%, then 50%/50% whether A or T, then 100% T.
21
Baum-Welch algorithm or Viterbi training (re-estimation based on the Viterbi algorithm). Pass the algorithm a sequence of observations and a first guess as to the probabilities; it refines the probability matrices. It assumes that the sequence adheres to the underlying probabilities, traverses the states keeping track of the actual frequency of emissions and transitions, and adjusts the matrices accordingly.
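The "keep track of frequencies, then adjust" step can be sketched as simple counting. This is only the re-estimation half of one Viterbi-training iteration, assuming a state path has already been assigned to the observations; it is not a full Baum-Welch implementation, and `reestimate` is a hypothetical name.

```python
def reestimate(states, obs, n_states, n_symbols):
    """Turn observed transition and emission counts into probability matrices."""
    trans_counts = [[0] * n_states for _ in range(n_states)]
    emis_counts = [[0] * n_symbols for _ in range(n_states)]
    for s, o in zip(states, obs):
        emis_counts[s][o] += 1          # state s emitted symbol o
    for a, b in zip(states, states[1:]):
        trans_counts[a][b] += 1         # state a was followed by state b

    def normalize(rows):
        out = []
        for row in rows:
            total = sum(row)
            out.append([c / total if total else 0.0 for c in row])
        return out

    return normalize(trans_counts), normalize(emis_counts)

trans, emis = reestimate(states=[0, 0, 1, 1, 0], obs=[0, 1, 1, 1, 0],
                         n_states=2, n_symbols=2)
```

In practice the path assignment and this counting step are repeated until the matrices stop changing.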
22
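This is called checking the posterior probabilities. Given a sequence, check all possible paths through the model and multiply the associated probabilities along each path. The path with the highest probability is likely the path through the hidden states. Dynamic programming cuts down the number of paths to examine: the “forward algorithm” sums over paths, and the closely related Viterbi algorithm keeps only the best path into each state. A location in the sequence where the most probable states are the “TATAAT” motif states is a match.
[Figure: simplified HMM with a background state emitting A, C, G, T at .25 each (stay with probability 16/17, leave with 1/17), four motif states emitting T, A, T, A with transition probability 1, and a final background state.]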
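The saving from dynamic programming can be illustrated by computing the same sequence likelihood two ways: by enumerating every path and by the forward recursion. The two-state model and its numbers below are hypothetical values chosen only for illustration, not taken from the slides.

```python
from itertools import product

def path_probability(path, obs, start, trans, emis):
    """Multiply start, transition, and emission probabilities along one path."""
    p = start[path[0]] * emis[path[0]][obs[0]]
    for prev, cur, o in zip(path, path[1:], obs[1:]):
        p *= trans[prev][cur] * emis[cur][o]
    return p

def brute_force(obs, start, trans, emis):
    """Enumerate all paths: returns (total likelihood, single best path)."""
    paths = product(range(len(start)), repeat=len(obs))
    scored = [(path_probability(p, obs, start, trans, emis), p) for p in paths]
    return sum(prob for prob, _ in scored), max(scored)[1]

def forward(obs, start, trans, emis):
    """Forward algorithm: same total likelihood without enumerating paths."""
    n = len(start)
    alpha = [start[s] * emis[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[r] * trans[r][s] for r in range(n)) * emis[s][o]
                 for s in range(n)]
    return sum(alpha)

# Hypothetical model: state 0 = background, state 1 = AT-rich region
start = [0.9, 0.1]
trans = [[0.8, 0.2], [0.3, 0.7]]
emis = [[0.25, 0.25, 0.25, 0.25], [0.4, 0.1, 0.1, 0.4]]  # columns A, C, G, T
obs = ["ACGT".index(c) for c in "TATA"]

total, best_path = brute_force(obs, start, trans, emis)
likelihood = forward(obs, start, trans, emis)
```

Enumeration grows as (number of states) to the power of the sequence length, while the forward pass grows only linearly with the length.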
23
Matlab is very useful for matrix operations.
    % The sequence as characters, then encoded as A=1, C=2, G=3, T=4
    seq = ['a','g','c','g','a','t','a','c','g','c','g','a','t','c','g','a','t','a','t','a','g','t','g','c']
    seq = [1,3,2,3,1,4,1,2,3,2,3,1,4,2,3,1,4,1,4,1,3,4,3,2]
    % Emission matrix: one row per state, columns in the order A C G T
    EMIS = [.25,.25,.25,.25; 0,0,0,1; 1,0,0,0; 0,0,0,1; 1,0,0,0; .25,.25,.25,.25]
    % Transition matrix: rows are the current state, columns the next state
    TRANS = [16/17,1/17,0,0,0,0; 0,0,1,0,0,0; 0,0,0,1,0,0; 0,0,0,0,1,0; 0,0,0,0,0,1; 0,0,0,0,0,1]
[Figure: the same simplified HMM, with a background state, four motif states emitting T, A, T, A, and a final background state.]
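To show what decoding with these matrices looks like, here is a minimal Viterbi sketch in Python rather than Matlab. It is not the toolbox call the slide may have relied on. Assumptions: states and symbols are renumbered from 0, the chain is taken to start in the first background state and emit the first symbol from it, and log probabilities are used to avoid underflow.

```python
import math

EMIS = [[0.25, 0.25, 0.25, 0.25], [0, 0, 0, 1], [1, 0, 0, 0],
        [0, 0, 0, 1], [1, 0, 0, 0], [0.25, 0.25, 0.25, 0.25]]
TRANS = [[16 / 17, 1 / 17, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0], [0, 0, 0, 1, 0, 0],
         [0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 0, 1]]

def log(p):
    return math.log(p) if p > 0 else float("-inf")

def viterbi(obs, trans, emis, start_state=0):
    """Return the most probable state path for the observations."""
    n = len(trans)
    score = [float("-inf")] * n            # best log-probability ending in each state
    score[start_state] = log(emis[start_state][obs[0]])
    back = []                              # back-pointers, one list per step
    for o in obs[1:]:
        new_score, ptr = [], []
        for s in range(n):
            prev = max(range(n), key=lambda r: score[r] + log(trans[r][s]))
            ptr.append(prev)
            new_score.append(score[prev] + log(trans[prev][s]) + log(emis[s][o]))
        back.append(ptr)
        score = new_score
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):             # trace the best path backwards
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

seq = "agcgatacgcgatcgatatagtgc"           # the sequence from the slide
obs = ["acgt".index(c) for c in seq]
path = viterbi(obs, TRANS, EMIS)
motif_start = path.index(1)                # first position assigned to a motif state
```

The motif states line up with the `tata` in the sequence, and every position after it is assigned to the final background state.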
24
GeneMark (Georgia Institute of Technology): http://exon.biology.gatech.edu/
Genscan: http://genes.mit.edu/GENSCAN.html
Genie (Berkeley): http://www.fruitfly.org/seq_tools/genie.html
Glimmer (University of Maryland): http://www.cbcb.umd.edu/software/GlimmerHMM/
25
The model can include all regions, with states for each position in each region. The coding region could be a simple set of three regions.
[Figure: full gene layout: polymerase binding at the -35 sequence (TTGACA) and the -10 sequence (TATAAT), transcription start site, ribosomal binding site, start codon, coding region, and termination region.]
26
Classic example: the states are rainy or sunny. If we know whether someone is walking, shopping, or cleaning, we can predict the state.
[Figure labels: states, emissions, observations.]
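For a single day, predicting the state from the activity is just Bayes' rule. The probabilities below are illustrative values often used with this textbook example; they are an assumption here, not numbers recovered from the slides.

```python
# Hypothetical prior over the hidden state and emission probabilities
prior = {"Rainy": 0.6, "Sunny": 0.4}
emis = {
    "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}

def posterior(activity):
    """P(state | activity), proportional to P(state) * P(activity | state)."""
    joint = {s: prior[s] * emis[s][activity] for s in prior}
    total = sum(joint.values())
    return {s: p / total for s, p in joint.items()}

post = posterior("walk")
guess = max(post, key=post.get)
```

With these numbers, seeing someone walking makes a sunny day the more likely hidden state even though rain has the higher prior.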
28
If something observable depends on an underlying state, an HMM can be used. In motifs, the sequence is visible; whether or not a region is a promoter site is not.
29
Each state has a probability of emitting any given observation. Each state has a probability of transitioning to any given next state.
[Figure: probabilistic parameters of a hidden Markov model (example). x: states; y: possible observations; a: state transition probabilities; b: output probabilities.]