Gene prediction and HMM Computational Genomics 2005/6 Lecture 9b Slides taken from (and rapidly mixed) Larry Hunter, Tom Madej, William Stafford Noble,


1 Gene prediction and HMM Computational Genomics 2005/6 Lecture 9b Slides taken from (and rapidly mixed) Larry Hunter, Tom Madej, William Stafford Noble, and Eyal Pribman. Partially modified by Benny Chor.

2 Annotation of Genomic Sequence
Given the sequence of an organism's genome, we would like to be able to identify:
–Genes
–Exon boundaries & splice sites
–Beginning and end of translation
–Alternative splicings
–Regulatory elements (e.g. promoters)
(On the slide, these items are bracketed as "primary goals" and "secondary goals".)
The only certain way to do this is experimentally, but it is time consuming and expensive. Computational methods can achieve reasonable accuracy quickly, and help direct experimental approaches.

3 Prokaryotic Gene Structure
[Diagram: genomic DNA with promoter, CDS, and terminator; transcription produces mRNA.]
–Most bacterial promoters contain the Pribnow box, a signal at about position -10 with the consensus sequence 5'-TATAAT-3'. (The Shine-Dalgarno sequence is a different signal: the ribosome binding site on the mRNA, just upstream of the start codon.)
–The terminator: a signal at the end of the coding sequence that terminates the transcription of RNA.
–The coding sequence is composed of nucleotide triplets (codons). Each triplet codes for an amino acid or a stop signal. The amino acids are the building blocks of proteins.

4 Pieces of a (Eukaryotic) Gene (on the genome)
[Diagram: a double-stranded genomic region (5' to 3'), ~1-100 Mbp, with a zoomed view of ~1-1000 kbp. Labeled pieces and approximate sizes:]
–Exons (CDS & UTR): ~10^2-10^3 bp
–Introns: ~10^2-10^5 bp
–Promoter: ~10^3 bp
–Enhancers: ~10^1-10^2 bp
–Other regulatory sequences: ~10^1-10^2 bp
–Polyadenylation site

5 What is it about genes that we can measure (and model)?
Most of our knowledge is biased towards protein-coding characteristics
–ORF (Open Reading Frame): a sequence defined by an in-frame AUG and stop codon, which in turn defines a putative amino acid sequence.
–Codon Usage: most frequently measured by CAI (Codon Adaptation Index)
Other phenomena
–Nucleotide frequencies and correlations: value and structure
–Functional sites: splice sites, promoters, UTRs, polyadenylation sites
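The ORF definition above can be sketched in code. This is a minimal illustration, not part of the original slides: it scans only the forward strand of a DNA string (so the start codon appears as ATG rather than AUG), and the function name find_orfs and the min_codons parameter are assumptions made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=1):
    """Return (start, end, frame) for each forward-strand ORF.

    start is the index of the A in ATG; end is one past the stop codon.
    min_codons counts codons before the stop (hypothetical filter).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the first ATG seen since the last stop
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs

print(find_orfs("CCATGAAATAGCC"))
```

Raising min_codons gives the simple "ORF length" filter discussed on the next slide; a real gene finder would also scan the reverse strand.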

6 A simple measure: ORF length Comparison of Annotation and Spurious ORFs in S. cerevisiae Basrai MA, Hieter P, and Boeke J Genome Research 1997 7:768-771

7 Codon Adaptation Index (CAI) Parameters are empirically determined by examining a “large” set of example genes This is not perfect –Genes sometimes have unusual codons for a reason –The predictive power is dependent on length of sequence
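As a rough sketch of how such empirically determined parameters might be used: in the usual formulation of CAI, each codon gets a weight equal to its count in the reference genes divided by the count of the most frequent synonymous codon, and the CAI of a gene is the geometric mean of those weights. The code below follows that formulation; the pseudocount, the function names, and the choice to skip Met, Trp, and stop codons are assumptions of this example rather than details taken from the slides.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def codons(seq):
    """Split a coding sequence into complete codons (trailing bases dropped)."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]

def relative_adaptiveness(reference_genes, pseudocount=0.5):
    """Weight of each codon relative to its most used synonym.

    pseudocount (an assumption) avoids zero weights for unseen codons.
    """
    counts = Counter(c for gene in reference_genes for c in codons(gene))
    weights = {}
    for aa in set(AMINO):
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        best = max(counts[c] + pseudocount for c in family)
        for c in family:
            weights[c] = (counts[c] + pseudocount) / best
    return weights

def cai(gene, weights):
    """Geometric mean of codon weights, skipping Met, Trp, and stop codons."""
    informative = [c for c in codons(gene)
                   if CODON_TABLE.get(c) not in ("M", "W", "*", None)]
    if not informative:
        raise ValueError("no informative codons in sequence")
    return math.exp(sum(math.log(weights[c]) for c in informative)
                    / len(informative))

w = relative_adaptiveness(["ATGGCTGCTGCTGCC"])  # toy reference set
print(cai("GCTGCT", w), cai("GCCGCC", w))
```

With so small a reference set the weights are dominated by the pseudocount, which illustrates the caveat on the slide: the estimate is only as good as the example genes, and short sequences give noisy scores.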

8 CAI Example: Counts per 1000 codons

9 Splice signals (mice): GT, AG

10 General Things to Remember about (Protein-coding) Gene Prediction Software
–It is, in general, organism-specific
–It works best on genes that are reasonably similar to something seen previously
–It finds protein coding regions far better than non-coding regions
–In the absence of external (direct) information, alternative forms will not be identified
–It is imperfect! (It’s biology, after all…)

11 Rest of Lecture Outline
–Eukaryotic gene structure
–Modeling gene structure
–Using the model to make predictions
–Improving the model topology
–Modeling fixed-length signals

12 A eukaryotic gene This is the human p53 tumor suppressor gene on chromosome 17. Genscan is one of the most popular gene prediction algorithms.

13 A eukaryotic gene
[Figure labels: 3’ untranslated region, final exon, internal exons, introns, initial exon.]
This particular gene lies on the reverse strand.

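14 An Intron
[Figure labels: 5’ splice site, 3’ splice site.]
GT: signals start of intron
AG: signals end of intron
Because this gene lies on the reverse strand, the signals appear reverse-complemented in the forward-strand sequence: revcomp(AC)=GT and revcomp(CT)=AG.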
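The revcomp operation used on this slide can be sketched in a few lines. The function name follows the slide's notation; the translation table is an illustration covering only the four unambiguous bases.

```python
# Map each base to its complement (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def revcomp(seq):
    """Reverse complement: complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

print(revcomp("CT"), revcomp("AC"))
```

So, for a gene on the reverse strand, an intron that starts with GT and ends with AG shows up in the forward-strand sequence as a stretch that starts with CT and ends with AC.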

15 Signals vs contents In gene finding, a small pattern within the genomic DNA is referred to as a signal, whereas a region of genomic DNA is a content. Examples of signals: splice sites, starts and ends of transcription or translation, branch points, transcription factor binding sites Examples of contents: exons, introns, UTRs, promoter regions

16 Prior knowledge We want to build a probabilistic model of a gene that incorporates our prior knowledge. E.g., the translated region must have a length that is a multiple of 3.

17 Prior knowledge The translated region must have a length that is a multiple of 3. Some codons are more common than others. Exons are usually shorter than introns. The translated region begins with a start signal and ends with a stop codon. 5’ splice sites (exon to intron) are usually GT; 3’ splice sites (intron to exon) are usually AG. The distribution of nucleotides and dinucleotides is usually different in introns and exons.

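18 A simple gene model
[Diagram: states Start, Intergenic, Gene, and End; the arrow from Intergenic to Gene is labeled "Transcription start" and the arrow from Gene to Intergenic is labeled "Transcription stop".]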

19 A probabilistic gene model
[Diagram: the same model (Start, Intergenic, Gene, End; Transcription start, Transcription stop), with arrows annotated with the transition probabilities 0.67, 0.33, 1.00, 0.25, 0.75.]
Every box stores transition probabilities for outgoing arrows. Every arrow stores emission probabilities for emitted nucleotides.
Example emission probabilities:
Pr(TACAGTAGATATGA) = 0.0001
Pr(AACAGT) = 0.001
Pr(AACAGTAC) = 0.002
…

20 Parse
For a given sequence, a parse is an assignment of gene structure to that sequence. In a parse, every base is labeled with the content it is predicted to belong to. In our simple model, the parse contains only “I” (intergenic) and “G” (gene). A more complete model would contain, e.g., “-” for intergenic, “E” for exon and “I” for intron.
S = ACTGACTACTACGACTACGATCTACTACGGGCGCGACCTATGCGTATGTTTTGAACTGACTATGCGATCTACGACTCGACTAGCTAC
P = IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGGGGGGGGGGGGGGGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
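A parse string is easier to work with once it is collapsed into labeled segments. The sketch below does that with run-length grouping; the function name parse_segments and the (label, start, end) tuple layout are choices made for this illustration. S is assembled from the three segments of the slide's example sequence.

```python
from itertools import groupby

def parse_segments(parse):
    """Collapse a per-base label string into (label, start, end) runs.

    Coordinates are 0-based and end-exclusive, so seq[start:end] is the run.
    """
    segments, pos = [], 0
    for label, run in groupby(parse):
        length = len(list(run))
        segments.append((label, pos, pos + length))
        pos += length
    return segments

# The example from the slide: intergenic, gene, intergenic.
S = ("ACTGACTACTACGACTACGATCTACTACGGGCGCGACCT"
     "ATGCGTATGTTTTGA"
     "ACTGACTATGCGATCTACGACTCGACTAGCTAC")
P = "I" * 39 + "G" * 15 + "I" * 33

for label, start, end in parse_segments(P):
    print(label, start, end, S[start:end])
```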

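21 The probability of a parse
[Diagram: the probabilistic gene model (Start, Intergenic, Gene, End; Transcription start, Transcription stop) with transition probabilities 0.67, 0.33, 1.00, 0.25, 0.75.]
Emission probabilities of the three segments:
Pr(ACTGACTACTACGACTACGATCTACTACGGGCGCGACCT) = 0.0000543
Pr(ATGCGTATGTTTTGA) = 0.00000000142
Pr(ACTGACTATGCGATCTACGACTCGACTAGCTAC) = 0.0000789
Multiplying the transition and emission probabilities along the parse gives the joint probability of the parse and the sequence:
Pr(parse P, sequence S | model M) = 0.67 × 0.0000543 × 1.00 × 0.00000000142 × 0.75 × 0.0000789 = 3.057 × 10^-18
For a fixed sequence S, Pr(P | S, M) is proportional to this joint probability, so ranking parses by either quantity gives the same best parse.
S = ACTGACTACTACGACTACGATCTACTACGGGCGCGACCTATGCGTATGTTTTGAACTGACTATGCGATCTACGACTCGACTAGCTAC
P = IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGGGGGGGGGGGGGGGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII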
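The product on this slide can be checked numerically. The sketch below multiplies the six factors in the order they appear on the slide and also sums their logarithms, which is how such products are usually computed in practice because long sequences underflow floating point. The variable names are illustrative.

```python
import math

# Factors in slide order: transition, emission, transition, emission, ...
factors = [0.67, 0.0000543, 1.00, 0.00000000142, 0.75, 0.0000789]

prob = math.prod(factors)                         # direct product
log10_prob = sum(math.log10(f) for f in factors)  # same value in log space

print(f"{prob:.4g}")        # about 3.057e-18
print(f"{log10_prob:.3f}")  # about -17.515
```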

22 Finding the best parse For a given sequence S, the model M assigns a probability Pr(P|S,M) to every parse P. We want to find the parse P* that receives the highest probability.

23 An analogy
Given two sequences, S1 and S2, and a scoring system M, find the alignment A* that maximizes the score.
Solution: dynamic programming

24 Dynamic programming
[Diagram: the gene model (Start, End; Transcription start, Transcription stop) with states Intergenic 1, Intergenic 2, Intergenic 3, Intergenic 4, and Gene, drawn against the example sequence A C A C C T T G T A C G T.]

25 DP matrix
[Matrix: one column per base of ACAATGTGAC; one row per state I1, I2, I3, I4, G.]
S = ACAATGTGAC
P = IIIGGGGGGI
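The dynamic programming idea can be sketched as a log-space Viterbi recursion over a small two-state model. This is an illustration under stated assumptions, not the lecture's model: it uses only two states ("I" and "G") rather than the five rows of the matrix above, and every probability in the example is invented for the demonstration.

```python
import math

def viterbi(seq, states, start_p, trans_p, emit_p):
    """Most probable state path for seq, computed in log space.

    V[t][s] is the best log probability of any path that ends in state s
    after emitting seq[:t + 1]; back[t][s] is the predecessor on that path.
    All probabilities must be nonzero.
    """
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][seq[0]])
          for s in states}]
    back = [{}]
    for t in range(1, len(seq)):
        V.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, V[t - 1][r] + math.log(trans_p[r][s])) for r in states),
                key=lambda pair: pair[1])
            V[t][s] = score + math.log(emit_p[s][seq[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(seq) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return "".join(reversed(path)), V[-1][last]

# Hypothetical parameters: intergenic is AT-rich, gene is GC-rich.
states = ("I", "G")
start_p = {"I": 0.5, "G": 0.5}
trans_p = {"I": {"I": 0.9, "G": 0.1}, "G": {"I": 0.1, "G": 0.9}}
emit_p = {"I": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
          "G": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4}}

parse, log_prob = viterbi("AAAAGGGGGGAAAA", states, start_p, trans_p, emit_p)
print(parse)
```

Each cell of the matrix is filled once from the previous column, so the running time is proportional to the sequence length times the square of the number of states, the same flavor of recurrence as in sequence alignment.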

26 Beyond Simplest Model
Improving the gene model topology
Fixed-length signals
–PSSMs
–Dependencies between positions
Variable-length contents
–Using HMMs
–Semi-Markov models
Parsing algorithms
–Viterbi
–Posterior decoding
Including other types of data
–Expressed sequence tags
–Orthology

27 Improved model topology
Draw a model that includes introns.
[Diagram: the previous model, with Start, End, Transcription start, Transcription stop, Gene, and Intergenic 1 to Intergenic 4.]

28 Improved model topology
[Diagram labels: Start, End, Transcription start, Transcription stop, 5’ splice site, 3’ splice site.]

29 Improved model topology
[Diagram labels: Start, End, Transcription start, Transcription stop, 5’ splice site, 3’ splice site; 4 intergenics, 1 intron, 4 exons.]

30 Improved model topology
[Diagram labels: Start, End, Transcription start, Transcription stop, 5’ splice site, 3’ splice site; exon states: single exon, initial exon, internal exon, final exon; intron.]

31 Modeling the 5’ splice site
Most introns begin with the letters “GT.” We can add this signal to the model.
[Diagram: 5’ splice site, GT, Intron, 3’ splice site.]

32 Modeling the 5’ splice site
Most introns begin with the letters “GT.” We can add this signal to the model. Indeed, we can model each nucleotide with its own arrow.
[Diagram: 5’ splice site, GT, Intron, 3’ splice site.]
First arrow (G): Pr(A)=0 Pr(C)=0 Pr(G)=1 Pr(T)=0
Second arrow (T): Pr(A)=0 Pr(C)=0 Pr(G)=0 Pr(T)=1

33 Modeling the 5’ splice site
Like most biological phenomena, the splice site signal admits exceptions. The resulting model of the 5’ splice site is a length-2 PSSM.
[Diagram: 5’ splice site, GT, Intron, 3’ splice site.]
First arrow (G): Pr(A)=0.01 Pr(C)=0.01 Pr(G)=0.97 Pr(T)=0.01
Second arrow (T): Pr(A)=0.01 Pr(C)=0.01 Pr(G)=0.01 Pr(T)=0.97
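The length-2 PSSM can be written down directly and used to score candidate sites. The sketch below uses the probabilities from the slide; the uniform background of 0.25 per base, the log-odds scoring, the zero threshold, and the function names are assumptions added for illustration.

```python
import math

# One column per position of the signal, probabilities from the slide.
DONOR_PSSM = [
    {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},  # position 1: usually G
    {"A": 0.01, "C": 0.01, "G": 0.01, "T": 0.97},  # position 2: usually T
]
BACKGROUND = 0.25  # assumed uniform background frequency per base

def pssm_log_odds(site, pssm=DONOR_PSSM, background=BACKGROUND):
    """Log2 odds of site under the PSSM versus the background model."""
    return sum(math.log2(column[base] / background)
               for column, base in zip(pssm, site))

def scan(seq, pssm=DONOR_PSSM, threshold=0.0):
    """Return (position, score) for every window scoring above threshold."""
    width = len(pssm)
    hits = []
    for i in range(len(seq) - width + 1):
        score = pssm_log_odds(seq[i:i + width], pssm)
        if score > threshold:
            hits.append((i, score))
    return hits

print(scan("AAGTAA"))
```

Adding more columns to DONOR_PSSM extends the same scoring to the longer, partially conserved signal seen in real splice sites, as long as positions are treated as independent.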

34 Real splice sites
Real splice sites show some conservation at positions beyond the first two. We can add additional arrows to model these states.
[Sequence logo: weblogo.berkeley.edu]

35 Modeling the 5’ splice site
[Diagram labels: 5’ splice site, Intron, 3’ splice site.]

