Doug Raiford, Lesson 3: Gene Prediction (10/13/2015)


1 Doug Raiford Lesson 3

2
• We have a fully sequenced genome
• How do we identify the genes?
• What do we know so far?
10/13/2015 Gene Prediction

3
• Remember:
  • The start codon codes for methionine
  • Stop codons do not code for an amino acid
• Does every ATG mark the beginning of a gene?
• Does every TAG, TAA, or TGA mark the end?
Start codon: ATG
Stop codons: TAG, TAA, or TGA

4
• The start and stop codons must be “in frame”
  • A set of codons must fit between them
  • The length between them is evenly divisible by three
• Open reading frame (ORF)
  • A series of codons bracketed by start and stop codons (in frame)
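The in-frame requirement can be sketched in a few lines of Python. This is only an illustration, not the method used in the lesson: the function name `find_orfs` and the `min_codons` parameter are hypothetical, and the scan covers one strand only.

```python
# Minimal sketch: find open reading frames (ORFs) on one strand.
# Assumptions (not from the slides): the name find_orfs, the min_codons
# parameter, and reporting ORFs as (start, end) index pairs.
START_CODON = "ATG"
STOP_CODONS = {"TAG", "TAA", "TGA"}


def find_orfs(seq, min_codons=1):
    """Return (start, end) pairs; end is the index just past the stop codon."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):  # three possible reading frames
        start = None
        for i in range(frame, len(seq) - 2, 3):  # step codon by codon
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # first in-frame ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                # number of codons before the stop, including the ATG
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("ATGAAATAG"))  # ATG and TAG are in the same frame
print(find_orfs("ATGAATAG"))   # TAG is out of frame with ATG
```

The second call shows why the length must be divisible by three: the TAG is present but does not line up with the ATG, so no ORF is reported.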

5
• In real genes, the distance between start and stop codons tends to be longer than expected by chance
• How long would we expect that distance to be?

6
• There are 64 different codons
• A given codon should show up randomly about once every 64 codons, or 192 nts (64 * 3)
• There are 3 stop codons
• Expect 3 in every 64 codons, or one every 21 1/3 codons (21 1/3 * 3 = 64 nts)
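The arithmetic above can be checked with a short Python sketch. It assumes, as the slide implicitly does, that all 64 codons are equally likely and independent; the function name `stop_free_probability` is a hypothetical helper added to show how quickly long stop-free runs become unlikely by chance.

```python
# Sketch of the expected spacing between stop codons in random sequence.
# Assumption: codons are uniform and independent (an idealization).
from fractions import Fraction

n_codons = 4 ** 3                    # 64 possible codons
p_stop = Fraction(3, n_codons)       # 3 of the 64 are stop codons

expected_codons = 1 / p_stop         # mean gap between stops: 64/3 codons
expected_nts = expected_codons * 3   # the same gap in nucleotides

print(expected_codons, expected_nts)


def stop_free_probability(k):
    """Chance that k consecutive random codons contain no stop codon."""
    return float(1 - p_stop) ** k


print(stop_free_probability(100))
```

Under this idealized model a run of 100 codons with no stop is already rare, which is why unusually long ORFs are candidate genes.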

7
• Number of genes in Escherichia coli (E. coli) is 4356
  • Min 44 nts, max 8621
  • 8 are < 64 nts
  • 143 are < 128 nts (3%)
• Good start, but there must be more
  • Approximately 77,000 ORFs > 2 * expected length on each strand

8
• To “find” a gene, look for nt sequences that look like the parts of a gene
[Figure: gene layout with promoter region (where RNA polymerase binds), coding region, and terminator region. Start codon ‘ATG’ = methionine; stop codon (non-coding) ‘TAA’, ‘TAG’, or ‘TGA’]

9
• Promoters attract polymerase
  • Specific sequences
• Gene regulation
  • Each promoter has a unique pattern
  • Motifs
[Figure: promoter layout upstream of the coding region: -35 sequence (TTGACA) and -10 sequence (TATAAT) where polymerase binds, transcription start site, ribosomal binding site, start codon]

10
• Slightly different -35 and -10 motifs attract different sigma factors
• Genes with similar upstream regions tend to be related: they are expressed similarly

11
• Terminator: a hairpin
• Followed by a U-run (A-run in the DNA)

12
• Weak uracil-adenine base pairing, coupled with the hairpin binding to the NusA protein bound to the polymerase, ends transcription
[Figure: polymerase on the DNA at an A-run (AAAAAAAA), with the mRNA ending in a U-run (UUUUUUU)]

13
• How do we find these regions?
• Difficult: the patterns are fuzzy, not carved in stone
[Figure: promoter layout, repeated from slide 9]
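One simple way to see what "fuzzy" means is to scan for the -10 consensus while tolerating a few mismatches. The sketch below is an illustration only, not the approach the lesson goes on to use (that is the HMM); the function name `scan_motif` and the `max_mismatches` threshold are assumptions.

```python
# Sketch: report windows within a given Hamming distance of a consensus motif.
# Assumptions (not from the slides): the name scan_motif and the default
# threshold of one mismatch.
def scan_motif(seq, motif="TATAAT", max_mismatches=1):
    """Return (position, window, mismatches) for every near match."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - len(motif) + 1):
        window = seq[i:i + len(motif)]
        # count positions where the window differs from the consensus
        mismatches = sum(a != b for a, b in zip(window, motif))
        if mismatches <= max_mismatches:
            hits.append((i, window, mismatches))
    return hits


print(scan_motif("GGTATAATGG"))  # exact copy of the consensus
print(scan_motif("GGTATGATGG"))  # one substitution, still reported
```

A fixed mismatch threshold treats every position as equally important. A probabilistic model can instead weight positions differently, which is the motivation for the approach on the following slides.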

14
• Hidden Markov Models (HMMs) are often used
  • All about the statistics
• Markov chain: a series of events along with their probabilities
[Figure: state machine for recognizing TATAAT, beginning at "Start" and ending at "Yay! I found one", with edge labels such as "A", "G or C", and "T or A"]

15
• The previous slide was a “state machine” representation
• An HMM should have states and observations
• The states are “hidden”
[Figure: HMM state diagram for a TATAAT-like motif: numbered states, each emitting from A, C, G, T, with transition probabilities on the arrows]

16
• Each state has a probability of “emitting” any given observation
• Each state has a probability of “transitioning” to any given next state
[Figure: same HMM state diagram as slide 15]

17
• Transition probability matrix (TRANS)
  • Rows represent the current state
  • Columns represent the state to which a transition will occur
  • Each entry is the probability associated with that transition
• Emission probability matrix (EMIS)
  • Rows represent states
  • Columns represent which observation is emitted
  • Each entry is the probability associated with that emission
[Figure: TRANS table (from state by to state, entries are probabilities) and EMIS table (state by observation, entries are probabilities)]
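The row and column conventions above can be illustrated with a toy two-state model in Python. All numbers and state names here are hypothetical, chosen only to show the layout; they are not the model from the slides.

```python
# Sketch of the TRANS / EMIS layout with a hypothetical two-state model.
# Assumptions: state names, symbol order, and every probability below.
STATES = ["background", "motif"]
SYMBOLS = ["A", "C", "G", "T"]

# TRANS[i][j]: probability of moving from state i (row) to state j (column)
TRANS = [
    [0.9, 0.1],
    [0.5, 0.5],
]

# EMIS[i][k]: probability that state i (row) emits symbol k (column)
EMIS = [
    [0.25, 0.25, 0.25, 0.25],
    [0.45, 0.05, 0.05, 0.45],
]


def is_row_stochastic(matrix, tol=1e-9):
    """Every row must be a probability distribution (non-negative, sums to 1)."""
    return all(
        all(p >= 0 for p in row) and abs(sum(row) - 1.0) <= tol
        for row in matrix
    )


def transition_prob(src, dst):
    return TRANS[STATES.index(src)][STATES.index(dst)]


def emission_prob(state, symbol):
    return EMIS[STATES.index(state)][SYMBOLS.index(symbol)]


print(is_row_stochastic(TRANS), is_row_stochastic(EMIS))
print(transition_prob("background", "motif"), emission_prob("motif", "T"))
```

Checking that each row sums to 1 is a cheap sanity test before handing matrices to any training or decoding routine.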

18
• Requires a subject matter expert to build a model
• Often start with a state for each position in a possible match
• Example: looking for something similar to TATAAT
  • Might not have both A’s
  • Might have an extra one in the first slot
  • Never has G’s or C’s
[Figure: same HMM state diagram as slide 15]

19
• Also need a state for non-participating regions
[Figure: same HMM state diagram as slide 15]

20
• Make a first guess as to the probabilities
  • Maybe from the state associated with the first T to A: 100%
  • Then 50%/50% whether A or T
  • Then 100% T
[Figure: same HMM state diagram as slide 15]

21
• Training: Baum-Welch or Viterbi algorithm
• Pass the algorithm a sequence of observations and a first guess as to the probabilities
• It refines the probability matrices
Assumes that the sequence adheres to the underlying probabilities. Traverses the states, keeping track of the actual frequency of emissions and transitions, and adjusts the matrices accordingly.
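The "keep track of frequencies, then adjust the matrices" step can be sketched as follows. This is only the counting and normalizing part of one Viterbi-training iteration, given a state path that has already been decoded; Baum-Welch uses expected (fractional) counts over all paths instead. The function name `reestimate` and the `pseudocount` parameter are assumptions.

```python
# Sketch: re-estimate TRANS and EMIS from one decoded state path.
# Assumptions (not from the slides): the name reestimate, 0-based integer
# codes for states and symbols, and the optional pseudocount.
def reestimate(path, observations, n_states, n_symbols, pseudocount=0.0):
    trans_counts = [[pseudocount] * n_states for _ in range(n_states)]
    emis_counts = [[pseudocount] * n_symbols for _ in range(n_states)]

    # count how often each state emitted each symbol
    for state, obs in zip(path, observations):
        emis_counts[state][obs] += 1

    # count how often each state was followed by each other state
    for src, dst in zip(path, path[1:]):
        trans_counts[src][dst] += 1

    def normalize(rows):
        # turn each row of counts into probabilities; leave empty rows as-is
        out = []
        for row in rows:
            total = sum(row)
            out.append([c / total for c in row] if total else list(row))
        return out

    return normalize(trans_counts), normalize(emis_counts)


trans, emis = reestimate([0, 0, 1, 0], [0, 1, 3, 0], n_states=2, n_symbols=4)
print(trans)
print(emis)
```

In practice the decode and re-estimate steps are repeated until the matrices stop changing. A small pseudocount keeps unseen transitions from being fixed at zero.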

22
• Called checking the posterior probabilities
• Given a sequence, check all possible paths through the model
• Multiply the associated probabilities along each path
• The path with the highest probability is likely the path through the hidden states
• Dynamic programming avoids enumerating every path: the Viterbi algorithm finds the most probable path, and the “forward algorithm” sums over all paths
• A location in the sequence where the most probable states are “TATAAT” is a match
[Figure: six-state HMM: a background state emitting A, C, G, T at 0.25 each (stays with probability 16/17, leaves with 1/17), four states emitting T, A, T, A, and a final background state]
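To make "check all possible paths" concrete, the Python sketch below enumerates every state path for a toy two-state model, multiplies the probabilities along each, and compares the brute-force total with the forward algorithm. The model values, the start distribution `START`, and all function names are hypothetical.

```python
# Sketch: brute-force path enumeration versus the forward algorithm.
# Assumptions: a hypothetical two-state model with symbols coded
# A=0, C=1, G=2, T=3, and an assumed start distribution.
from itertools import product

START = [0.5, 0.5]
TRANS = [[0.9, 0.1], [0.5, 0.5]]
EMIS = [[0.25, 0.25, 0.25, 0.25], [0.45, 0.05, 0.05, 0.45]]


def path_probability(path, obs):
    """Joint probability of one state path and the observations."""
    p = START[path[0]] * EMIS[path[0]][obs[0]]
    for prev, cur, o in zip(path, path[1:], obs[1:]):
        p *= TRANS[prev][cur] * EMIS[cur][o]
    return p


def best_path_brute_force(obs):
    """Try every path; the number of paths grows as n_states ** len(obs)."""
    n = len(START)
    return max(product(range(n), repeat=len(obs)),
               key=lambda path: path_probability(path, obs))


def forward_probability(obs):
    """Total probability of obs over all paths, by dynamic programming."""
    n = len(START)
    alpha = [START[s] * EMIS[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[r] * TRANS[r][s] for r in range(n)) * EMIS[s][o]
                 for s in range(n)]
    return sum(alpha)


obs = [3, 0, 3, 2]  # T, A, T, G
print(best_path_brute_force(obs))
print(forward_probability(obs))
```

Enumeration is only feasible for very short sequences, which is why the dynamic programming versions are used in practice.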

23
• Matlab is very useful for matrix operations

% Observed sequence, then the same sequence coded as a=1, c=2, g=3, t=4
seq = ['a','g','c','g','a','t','a','c','g','c','g','a','t','c','g','a','t','a','t','a','g','t','g','c'];
seq = [1,3,2,3,1,4,1,2,3,2,3,1,4,2,3,1,4,1,4,1,3,4,3,2];

% Emission probabilities: one row per state, columns in the order A, C, G, T
EMIS = [.25,.25,.25,.25;
        0,0,0,1;
        1,0,0,0;
        0,0,0,1;
        1,0,0,0;
        .25,.25,.25,.25];

% Transition probabilities: row = current state, column = next state
TRANS = [16/17,1/17,0,0,0,0;
         0,0,1,0,0,0;
         0,0,0,1,0,0;
         0,0,0,0,1,0;
         0,0,0,0,0,1;
         0,0,0,0,0,1];

[Figure: same six-state HMM diagram as slide 22]
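As a rough illustration of how these matrices could be used, here is a plain-Python Viterbi decoder run on the sequence and the TRANS and EMIS values from slide 23. It is a sketch, not the Matlab routine the lesson may have had in mind. It uses 0-based state numbers and assumes the first observation is emitted by the background state (state 0); that start convention is an assumption, not something the slides specify.

```python
# Sketch: Viterbi decoding with the slide-23 matrices.
# Assumptions: 0-based states, symbols coded A=0, C=1, G=2, T=3, and the
# first observation being emitted by state 0 (the background state).
SEQ = "agcgatacgcgatcgatatagtgc"
OBS = ["acgt".index(ch) for ch in SEQ]

EMIS = [
    [0.25, 0.25, 0.25, 0.25],  # state 0: background
    [0, 0, 0, 1],              # state 1: T
    [1, 0, 0, 0],              # state 2: A
    [0, 0, 0, 1],              # state 3: T
    [1, 0, 0, 0],              # state 4: A
    [0.25, 0.25, 0.25, 0.25],  # state 5: background after the motif
]
TRANS = [
    [16 / 17, 1 / 17, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1],
]


def viterbi(obs, trans, emis, start_state=0):
    """Return the most probable state path for the observations."""
    n = len(trans)
    prob = [0.0] * n                       # best-path probability per end state
    prob[start_state] = emis[start_state][obs[0]]
    back = []                              # back[t][s]: best predecessor of s
    for o in obs[1:]:
        new_prob, pointers = [], []
        for s in range(n):
            best_r = max(range(n), key=lambda r: prob[r] * trans[r][s])
            new_prob.append(prob[best_r] * trans[best_r][s] * emis[s][o])
            pointers.append(best_r)
        prob = new_prob
        back.append(pointers)
    state = max(range(n), key=lambda s: prob[s])
    path = [state]
    for pointers in reversed(back):        # trace the pointers backwards
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


path = viterbi(OBS, TRANS, EMIS)
motif_positions = [i for i, s in enumerate(path) if 1 <= s <= 4]
print(path)
print(motif_positions, SEQ[motif_positions[0]:motif_positions[-1] + 1])
```

Under these assumptions the four motif states line up with the only `tata` in the sequence (0-based positions 16 to 19). For long sequences the products underflow, so a practical version would work with log probabilities.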

24
• GeneMark (Georgia Institute of Technology): http://exon.biology.gatech.edu/
• Genscan: http://genes.mit.edu/GENSCAN.html
• Genie (Berkeley): http://www.fruitfly.org/seq_tools/genie.html
• Glimmer (University of Maryland): http://www.cbcb.umd.edu/software/GlimmerHMM/

25
• Can include all regions in the model
• States for each position in each region
• The coding region could be a simple set of three regions
[Figure: promoter layout from slide 9, extended with a termination region]

26
• Classic example: the states are rainy or sunny
• If we know whether someone is walking, shopping, or cleaning, we can predict the state
[Figure: diagram labeled with states, emissions, and observations]


28
• If something observable depends on an underlying state, an HMM can be used
• In motifs, the sequence is visible; whether or not a region is a promoter site is not

29
• Each state has a probability of emitting any given observation
• Each state has a probability of transitioning to any given next state
[Figure: probabilistic parameters of a hidden Markov model (example). x: states; y: possible observations; a: state transition probabilities; b: output probabilities]

