3. Genome Annotation: Gene Prediction

Gene: a sequence of nucleotides coding for a protein.
Gene Prediction Problem: determine the beginning and end positions of genes in a genome.

Gene Prediction: Computational Challenge

aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaa tgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgc taagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatt taccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaa tggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctggga tccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcc gatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctgggat ccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcct gcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatcc gatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatat gctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctggga tccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcc gatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgc ggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaat gcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgct aagctgcggctatgctaatgcatgcggctatgctaagctcggctatgctaatgaat ggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggc tatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctat gctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgc ggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctg cggctatgctaatgcatgcggctatgctaagctcatgcgg

[The slide repeats the same sequence twice more, the last time with one region highlighted and labeled "Gene!"; the highlighted span is not preserved in the transcript.]

Central Dogma: DNA -> RNA -> Protein
DNA:     CCTGAGCCAACTATTGATGAA
  (transcription)
RNA:     CCUGAGCCAACUAUUGAUGAA
  (translation)
Protein: PEPTIDE
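The two steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names are made up for the example, the codon table is the standard genetic code, and the given DNA string is assumed to be the coding strand (so transcription is just T -> U).

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def transcribe(dna):
    """Coding-strand DNA -> mRNA (assumption: input is the coding strand)."""
    return dna.upper().replace("T", "U")

def translate(rna):
    """mRNA -> one-letter amino acids, reading from position 0; '*' marks a stop."""
    dna = rna.upper().replace("U", "T")
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)

rna = transcribe("CCTGAGCCAACTATTGATGAA")
print(rna)
print(translate(rna))
```

Run on the slide's sequence, the translation spells out the slide's protein string.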

Prokaryotic and eukaryotic organisms

Translating Nucleotides into Amino Acids
- Codon: 3 consecutive nucleotides
- 4^3 = 64 possible codons
- The genetic code is degenerate (redundant)
  - Includes start and stop codons
  - An amino acid may be coded by more than one codon

Codons
- In 1961 Sydney Brenner and Francis Crick discovered frameshift mutations
- They systematically deleted nucleotides from DNA
  - Single and double deletions dramatically altered the protein product
  - Effects of triple deletions were minor
  - Conclusion: every triplet of nucleotides (each codon) codes for exactly one amino acid in a protein

Genetic Code and Stop Codons
UAA, UAG and UGA are the 3 stop codons; together with the start codon AUG (ATG in DNA) they delineate Open Reading Frames.

Six Frames in a DNA Sequence
- stop codons: TAA, TAG, TGA
- start codon: ATG

The sequence and its complementary strand (the slide shows the pair twice):
GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG
CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC
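To make the six frames concrete, the sketch below splits a sequence into codons for the three forward frames and the three frames of the reverse complement, then reports where stop codons fall in each. The frame labels (+1 to -3) and function names are illustrative choices, not taken from the slides.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")
STOPS = {"TAA", "TAG", "TGA"}

def reverse_complement(dna):
    """Complement every base, then reverse, so the result reads 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]

def six_frames(dna):
    """Return {frame label: list of codons} for 3 forward and 3 reverse frames."""
    frames = {}
    for sign, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            codons = [strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)]
            frames[f"{sign}{offset + 1}"] = codons
    return frames

seq = "GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG"
for label, codons in six_frames(seq).items():
    stop_positions = [i for i, c in enumerate(codons) if c in STOPS]
    print(label, "stop codons at codon indices", stop_positions)
```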

Open Reading Frames (ORFs)
- Detect potential coding regions by looking at ORFs
  - A genome of length n comprises about n/3 codons in each reading frame
  - Stop codons break the genome into segments between consecutive stop codons
  - The subsegments of these that start from the start codon (ATG) are ORFs
- ORFs in different frames may overlap

[Figure: genomic sequence with an open reading frame running from ATG to TGA]
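The procedure described here (cut each frame at stop codons, then keep the part of each segment that begins at the first ATG) can be sketched as follows. This is an illustrative implementation under simplifying assumptions: only ATG counts as a start, the stop codon is included in the reported ORF, and coordinates on the "-" strand refer to the reverse-complemented string rather than the original one.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def orfs_in_strand(strand):
    """Yield (start, end) half-open index pairs of ATG...stop ORFs in 3 frames."""
    for offset in range(3):
        start = None
        for i in range(offset, len(strand) - 2, 3):
            codon = strand[i:i + 3]
            if codon == "ATG" and start is None:
                start = i                  # first ATG after the previous stop
            elif codon in STOPS:
                if start is not None:
                    yield start, i + 3     # include the stop codon
                start = None               # a new segment begins after a stop

def find_orfs(dna, min_codons=0):
    """Scan both strands; return (strand, start, end, sequence) tuples."""
    dna = dna.upper()
    results = []
    for sign, strand in (("+", dna), ("-", dna.translate(COMPLEMENT)[::-1])):
        for start, end in orfs_in_strand(strand):
            if (end - start) // 3 >= min_codons:
                results.append((sign, start, end, strand[start:end]))
    return results

print(find_orfs("CCATGAAATGATT"))
```

The `min_codons` parameter is a hypothetical hook for a length threshold.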

Long vs. Short ORFs
- A long open reading frame may be a gene
  - At random, we should expect one stop codon every 64/3 ≈ 21 codons
  - However, genes are usually much longer than this
- A basic approach is to scan for ORFs whose length exceeds a certain threshold
  - This is naive because some genes (e.g. some neural and immune system genes) are relatively short
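The "one stop every 21 codons" figure follows from treating codons as independent and uniformly distributed, which is an idealization. Under that assumption the chance that a stretch of k codons contains no stop is (61/64)^k, which the short sketch below tabulates; the function name is illustrative.

```python
# Assumption: every codon is drawn independently and uniformly from the 64.
P_STOP = 3 / 64

def prob_no_stop(k):
    """Probability that k consecutive random codons contain no stop codon."""
    return (1 - P_STOP) ** k

print("expected codons per stop:", 1 / P_STOP)
for k in (21, 50, 100, 300):
    print(k, prob_no_stop(k))
```

A stop-free run of 100 random codons already has probability below 1%, which is why ORF length is informative, and also why a fixed threshold misses genuinely short genes.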

Testing ORFs: Codon Usage
- Create a 64-element hash table and count the frequencies of codons in an ORF
- Amino acids typically have more than one codon, but in nature certain codons are used more often
- Uneven use of the codons may characterize a real gene
- This compensates for the pitfalls of the ORF length test
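The hash table mentioned here maps naturally onto a Python `Counter`. The sketch below counts codons in an ORF read from position 0 and converts the counts to relative frequencies; the function names are illustrative.

```python
from collections import Counter

def codon_counts(orf):
    """Count codons in an ORF, reading in frame from position 0."""
    orf = orf.upper()
    return Counter(orf[i:i + 3] for i in range(0, len(orf) - 2, 3))

def codon_frequencies(orf):
    """Relative frequency of each codon that occurs in the ORF."""
    counts = codon_counts(orf)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

print(codon_frequencies("ATGGCCGCCTAA"))
```

Comparing such a frequency profile with the organism's known codon usage is one way to judge whether an ORF looks like a real gene.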

Codon Usage in Human Genome

Codon Usage in Mouse Genome
[Table with columns AA, codon, /1000, frac. Rows shown: Ser (TCG, TCA, TCT, TCC, AGT, AGC), Pro (CCG, CCA, CCT, CCC), Leu (CTG, CTA, CTT, CTC), Ala (GCG, GCA, GCT, GCC), Gln (CAG, CAA). The numeric values are not preserved in the transcript.]

Transcription in prokaryotes
[Figure: a gene drawn 5' to 3' (upstream to downstream): promoter, transcription start site, transcribed region (untranslated region, coding region from start codon to stop codon, untranslated region), transcription stop site]
-k denotes the kth base before the transcription start; +k denotes the kth transcribed base.

Microbial gene finding
- Microbial genomes tend to be gene rich (80%-90% of the sequence is coding)
- The most reliable method: homology searches (e.g. using BLAST and/or FASTA)
- Major problem: finding genes without a known homologue

Open Reading Frame
- An Open Reading Frame (ORF) is a sequence of codons that starts with a start codon, ends with a stop codon and has no stop codons in between
- Searching for ORFs: consider all 6 possible reading frames, 3 forward and 3 reverse
- Is the ORF a coding sequence?
  1. Must be long enough (roughly 300 bp or more)
  2. Should have the average amino-acid composition specific to the given organism
  3. Should have the codon usage specific to the given organism

Recognition of gene related signals
- Promoter: in bacteria, TATAAT (Pribnow box) at about -10
- Terminator: Rho-independent (intrinsic) transcription termination, a G-C rich inverted repeat
- Other signal recognition, e.g. binding motifs

Codon frequency
[Figure: the input sequence is compared against the codon frequency in coding regions and the codon frequency in non-coding regions to decide: coding region or non-coding region]

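Example

codon position         A     C     T     G
1                     28%   33%   18%   21%
2                     32%   16%   21%   32%
3                     33%   15%   14%   38%
frequency in genome   31%   18%   19%   31%

Assume: the bases making up a codon are independent.

$$\frac{P(x \mid \text{in coding})}{P(x \mid \text{random})} = \prod_i \frac{P(A_i \text{ at } i\text{th position})}{P(A_i \text{ in the sequence})}$$

Score of AAAGAT:

$$\frac{.28 \cdot .32 \cdot .33 \cdot .21 \cdot .26 \cdot .14}{.31 \cdot .31 \cdot .31 \cdot .31 \cdot .31 \cdot .19}$$

(The fifth factor is .26 on the slide; the table lists 32% for A at codon position 2.)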
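The score can be computed mechanically from the table. The sketch below is illustrative and takes every factor directly from the table, so for AAAGAT its fifth factor is .32 (A at codon position 2) rather than the .26 printed in the slide's arithmetic; the dictionary and function names are assumptions for the example.

```python
# Base frequencies by codon position (rows 1-3 of the table) and genome-wide.
POSITION_FREQ = [
    {"A": 0.28, "C": 0.33, "T": 0.18, "G": 0.21},
    {"A": 0.32, "C": 0.16, "T": 0.21, "G": 0.32},
    {"A": 0.33, "C": 0.15, "T": 0.14, "G": 0.38},
]
GENOME_FREQ = {"A": 0.31, "C": 0.18, "T": 0.19, "G": 0.31}

def coding_score(seq):
    """Likelihood ratio P(x | coding) / P(x | random), assuming independent bases
    and that seq starts at codon position 1."""
    score = 1.0
    for i, base in enumerate(seq.upper()):
        score *= POSITION_FREQ[i % 3][base] / GENOME_FREQ[base]
    return score

print(coding_score("AAAGAT"))
```

A ratio above 1 favors the coding model and a ratio below 1 favors the background model.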

Using codon frequency to find the correct reading frame
Consider a sequence $x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9 \ldots$ where $x_i$ is a nucleotide, and let $p_{abc}$ denote the frequency of codon $abc$. Let

$$p_1 = p_{x_1 x_2 x_3}\, p_{x_4 x_5 x_6} \cdots$$
$$p_2 = p_{x_2 x_3 x_4}\, p_{x_5 x_6 x_7} \cdots$$
$$p_3 = p_{x_3 x_4 x_5}\, p_{x_6 x_7 x_8} \cdots$$

Then the probability that the $i$th reading frame is the coding frame is

$$P_i = \frac{p_i}{p_1 + p_2 + p_3}$$

Algorithm: slide a window along the sequence, compute $P_i$ for each window, and plot the results.
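A sliding-window version of this calculation might look like the sketch below. It is illustrative only: the codon frequency table `TOY` is made up for the demonstration, unseen codons fall back to a hypothetical `default` frequency, and the default window width is chosen as 2 mod 3 so that all three frames contain the same number of codons.

```python
def frame_probabilities(window, codon_prob, default=1e-4):
    """Return [P_1, P_2, P_3] for one window, with P_i = p_i / (p_1 + p_2 + p_3)."""
    scores = []
    for offset in range(3):
        p = 1.0
        for i in range(offset, len(window) - 2, 3):
            p *= codon_prob.get(window[i:i + 3], default)
        scores.append(p)
    total = sum(scores)
    return [s / total for s in scores]

def sliding_frame_probabilities(seq, codon_prob, width=98, step=3):
    """Yield (window start, [P_1, P_2, P_3]); step=3 keeps the frames aligned."""
    for start in range(0, len(seq) - width + 1, step):
        yield start, frame_probabilities(seq[start:start + width], codon_prob)

# Toy codon frequencies (hypothetical values, for demonstration only).
TOY = {"ATG": 0.03, "GCC": 0.03, "AAA": 0.03}
probs = frame_probabilities("ATGGCCAAAGC", TOY, default=0.01)
print(probs)
```

For long windows the raw products become very small, so a practical version would accumulate log frequencies instead.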

First order Markov chain: formal definition
$M = (Q, \pi, a)$
- $Q$: finite set of states, say $|Q| = n$
- $a$: $n \times n$ transition probability matrix, $a(i,j) = \Pr[q_{t+1} = j \mid q_t = i]$
- $\pi$: $n$-vector of starting probabilities, $\pi(i) = \Pr[q_0 = i]$
- For any row of $a$ the sum of entries is 1, and $\sum_i \pi(i) = 1$
Comment: frequently extra start and end states are added and the vector $\pi$ is not needed.
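As a small illustration of the definition, the sketch below represents the chain with plain lists, checks the two normalization conditions, and computes the probability of a state path. The two-state chain and the function names are made up for the example.

```python
def is_valid_chain(pi, a, tol=1e-9):
    """Check that pi and every row of a are probability distributions."""
    n = len(pi)
    if abs(sum(pi) - 1.0) > tol:
        return False
    return len(a) == n and all(len(row) == n and abs(sum(row) - 1.0) <= tol
                               for row in a)

def path_probability(pi, a, path):
    """P(q_0, ..., q_n) = pi(q_0) * a(q_0, q_1) * ... * a(q_{n-1}, q_n)."""
    p = pi[path[0]]
    for i, j in zip(path, path[1:]):
        p *= a[i][j]
    return p

# Hypothetical two-state chain.
pi = [0.5, 0.5]
a = [[0.9, 0.1],
     [0.2, 0.8]]
print(is_valid_chain(pi, a), path_probability(pi, a, [0, 0, 1, 1]))
```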

Hidden Markov Model
A Hidden Markov Model is a Markov model in which one does not observe the sequence of states but the results of a function defined on the states; in our case this is the emission of a symbol (an amino acid or a nucleotide). The states are hidden from the observer.

Introducing emission probabilities
- Assume that at each state a Markov process emits (with some distribution) a symbol from an alphabet $\Sigma$.
- Rather than observing a sequence of states, we observe a sequence of emitted symbols.
Example: $\Sigma = \{A, C, T, G\}$. Generate a sequence where A, C, T, G have frequencies p(A) = .33, p(C) = .2, p(T) = .27, p(G) = .2 respectively.
[Figure: one state with emission probabilities A .33, T .27, C .2, G .2]
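The one-state example can be simulated directly: each position is drawn independently from the emission distribution. The sketch below uses the standard library's `random.choices`; the function name and the `seed` parameter are illustrative additions.

```python
import random

# Emission probabilities of the single state, as given in the example.
EMISSION = {"A": 0.33, "G": 0.2, "C": 0.2, "T": 0.27}

def generate(length, emission=EMISSION, seed=None):
    """Emit `length` symbols independently from one state's distribution."""
    rng = random.Random(seed)
    symbols = list(emission)
    weights = list(emission.values())
    return "".join(rng.choices(symbols, weights=weights, k=length))

sample = generate(60, seed=0)
print(sample)
```

With more than one state, the same draw would be made from whichever state the chain currently occupies.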

HMM: formal definition
An HMM is a Markov process that at each time step generates a symbol from some alphabet $\Sigma$, according to an emission probability that depends on the state.
$M = (Q, \Sigma, \pi, a, e)$
- $Q$: finite set of states, say $n$ states $\{q_0, q_1, \ldots\}$
- $a$: $n \times n$ transition probability matrix, $a(i,j) = \Pr[q_{t+1} = j \mid q_t = i]$
- $\pi$: $n$-vector of start probabilities, $\pi(i) = \Pr[q_0 = i]$
- $\Sigma = \{\sigma_1, \ldots, \sigma_k\}$: the alphabet
- $e(i,j) = \Pr[o_t = \sigma_j \mid q_t = i]$, where $o_t$ is the $t$th element of the generated sequence: the probability of generating $\sigma_j$ in state $q_i$
- $S = o_0, \ldots, o_T$ is the output sequence

Algorithmic problems related to HMM
- Given an HMM M and a sequence S, compute Pr(S|M): the probability of generating S by M.
- Given an HMM M and a sequence S, compute the most likely sequence of states generating S.
- What is the most likely state at position i?
- Given a representative sample (training set) of a sequence property, construct an HMM that best models this property.

Probability of generating a sequence S by HMM
Recall: the probability of a sequence of states $S = q_0, \ldots, q_n$ in a Markov chain is
$$P(S \mid M) = a(q_0, q_1)\, a(q_1, q_2)\, a(q_2, q_3) \cdots a(q_{n-1}, q_n)$$
Now: our sequence $S = \sigma_0, \ldots, \sigma_T$ is a word over $\Sigma$, and there can be many paths leading to the generation of this sequence. We need to find all these paths and sum their probabilities.

Formal definition of P(S|M)
$M = (Q, \Sigma, a, e)$ ($\pi$ is dropped: $q_0$ is the starting state); $S = \sigma_0 \ldots \sigma_n$

$$P[S \mid M] = \sum_{p \in Q^{n+1}} P[p \mid M]\, P[S \mid p, M]$$

(a sum over all possible paths $p = q_0, \ldots, q_n$ of the product: the probability of the path times the probability of generating the given sequence assuming the given path)

$$P[p \mid M] = a(q_0, q_1)\, a(q_1, q_2)\, a(q_2, q_3) \cdots a(q_{n-1}, q_n)$$
$$P[S \mid p, M] = e(q_0, \sigma_0)\, e(q_1, \sigma_1)\, e(q_2, \sigma_2) \cdots e(q_n, \sigma_n)$$

Computing $P[S \mid M]$ from the definition (enumerating all paths $p$) would take exponential time!
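The exponential enumeration can be avoided with the forward algorithm, a dynamic program that is not spelled out on this slide. The sketch below implements both the brute-force sum over paths and the forward recursion so they can be compared. It uses a start vector `pi` (the general form of the definition; a fixed start state is the special case of a one-hot `pi`), and the two-state model is a made-up example.

```python
from itertools import product

def brute_force_likelihood(pi, a, e, obs):
    """Sum P[p|M] * P[S|p,M] over all |Q|^len(obs) state paths."""
    n, total = len(pi), 0.0
    for path in product(range(n), repeat=len(obs)):
        p = pi[path[0]] * e[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= a[path[t - 1]][path[t]] * e[path[t]][obs[t]]
        total += p
    return total

def forward_likelihood(pi, a, e, obs):
    """Forward algorithm: f[j] = P(obs so far, current state = j); O(T * n^2)."""
    n = len(pi)
    f = [pi[i] * e[i][obs[0]] for i in range(n)]
    for sym in obs[1:]:
        f = [e[j][sym] * sum(f[i] * a[i][j] for i in range(n)) for j in range(n)]
    return sum(f)

# Hypothetical two-state model: state 0 is uniform, state 1 is G/C-rich.
pi = [0.5, 0.5]
a = [[0.9, 0.1], [0.2, 0.8]]
e = [{"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
     {"A": 0.10, "C": 0.40, "G": 0.40, "T": 0.10}]
print(brute_force_likelihood(pi, a, e, "ATCACAT"))
print(forward_likelihood(pi, a, e, "ATCACAT"))
```

Both functions return the same value; only the forward version remains practical as the sequence grows.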

Example
[Figure: HMM with states 0 to 4; the emission tables shown on the diagram are (A .5, T .25, C .25), (A .25, T .25, C .25, G .25), (A .3, T .3, C .3, G .1), (A .3, T .3, C .1, G .3) and (A .5, T .5)]
Sample sequence: S = ATCACAT
Assume: 0 is the start state and 4 is the end state.
Two possible paths: p1 = 0,1,3,3,3,3,4 and p2 = 0,1,2,2,2,4
P(p1|M) = 1*.5*.4*.4*.4*.6 = .0192
P(S|p1) = .5*.25*.3*.3*.3*.3*.5 = .00050625
P(p1|M) * P(S|p1) = .00000972
P(p2|M) = .0192
P(S|p2) = 1*.5*.25*.1*.3*.1*.3*.5 = .00005625
P(p2|M) * P(S|p2) = .00000108
P(S|M) = .00000972 + .00000108 = .0000108
Observe how small the numbers are: we need to switch from probabilities to logarithms.
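Working in log space turns the long products into sums, and the sum over paths can be taken with the usual log-sum-exp trick. The sketch below redoes the example's arithmetic that way, using the factors listed for the two paths; the helper names are illustrative.

```python
import math

def log_path_score(transitions, emissions):
    """log(P[p|M] * P[S|p,M]) as a sum of logs of the individual factors."""
    return sum(math.log(x) for x in transitions) + sum(math.log(x) for x in emissions)

def log_sum_exp(values):
    """log(sum(exp(v))) computed stably by factoring out the largest term."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

# Factors as listed in the example for paths p1 and p2.
log_p1 = log_path_score([1, .5, .4, .4, .4, .6], [.5, .25, .3, .3, .3, .3, .5])
log_p2 = log_path_score([.0192], [1, .5, .25, .1, .3, .1, .3, .5])
log_total = log_sum_exp([log_p1, log_p2])
print(log_p1, log_p2, log_total)
```

For sequences of realistic length the plain products underflow to zero in floating point, while the log scores stay well within range.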

Most likely path
The most likely path for a sequence S is a path p in M that maximizes P(p|M) * P(S|p,M), the joint probability of the path and the sequence.
Why it may be interesting to know the most likely path:
- Frequently HMMs are designed in such a way that one can assign biological relevance to a state: e.g. a position in a protein sequence, the beginning of a coding region, or the beginning/end of an intron.
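The most likely path is usually found with the Viterbi algorithm, which the slide does not name. The sketch below is a minimal log-space version that maximizes the joint probability of path and sequence; it assumes a start vector `pi`, and the two-state model at the bottom is a made-up example (state 0 as background, state 1 as a G/C-rich region).

```python
import math

def _log(x):
    """log that maps probability 0 to -infinity instead of raising."""
    return math.log(x) if x > 0 else float("-inf")

def viterbi(pi, a, e, obs):
    """Return (most likely state path, its log joint probability with obs)."""
    n = len(pi)
    v = [_log(pi[i]) + _log(e[i][obs[0]]) for i in range(n)]
    back = []                                  # back[t][j] = best predecessor of j
    for sym in obs[1:]:
        prev, v, ptr = v, [], []
        for j in range(n):
            best = max(range(n), key=lambda i: prev[i] + _log(a[i][j]))
            ptr.append(best)
            v.append(prev[best] + _log(a[best][j]) + _log(e[j][sym]))
        back.append(ptr)
    state = max(range(n), key=lambda i: v[i])
    score = v[state]
    path = [state]
    for ptr in reversed(back):                 # follow back-pointers to the start
        state = ptr[state]
        path.append(state)
    return path[::-1], score

# Hypothetical two-state model.
pi = [0.5, 0.5]
a = [[0.9, 0.1], [0.2, 0.8]]
e = [{"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
     {"A": 0.05, "C": 0.45, "G": 0.45, "T": 0.05}]
print(viterbi(pi, a, e, "ATCAGTGCGCAAATAC"))
```

Reading off where the path sits in state 1 gives an estimate of where the G/C-rich region lies.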

Example: CpG island
[Figure: a model for general sequence and a model for a CpG island, with transitions between any pair of states, shown with the sequence ATCAGTGCGCAAATAC]
If this is the most likely path, we have estimated the position of the CpG island.

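Simple HMM for prokaryotic genes
[Figure: an HMM with a non-coding region state and three coding states for codon positions 1, 2 and 3; the transition labels on the diagram are 1, 1, p, 1-p, q and 1-q]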

HMM-based gene finding programs for microbial genes
- GeneMark [Borodovsky, McIninch 1993, Comp. Chem., 17, ]: 5th order HMM
- GLIMMER [Salzberg, Delcher, Kasif, White 1998, Nucleic Acids Research, Vol 26, ]: Interpolated Markov Model (IMM)

Eukaryotic gene finding
- On average, a vertebrate gene is about 30 kb long
- The coding region takes about 1 kb
- Exon sizes vary from double-digit numbers of bases to kilobases
- An average 5' UTR is about 750 bp
- An average 3' UTR is about 450 bp, but both can be much longer

Exons and Introns
- In eukaryotes, a gene is a combination of coding segments (exons) that are interrupted by non-coding segments (introns)
- This makes computational gene prediction in eukaryotes even more difficult
- Prokaryotes don't have introns: genes in prokaryotes are continuous

Central Dogma and Splicing
[Figure: a gene laid out as exon1, intron1, exon2, intron2, exon3 passing through transcription, splicing and translation; exon = coding, intron = non-coding]

Gene Structure

Splicing Signals
Exons are interspersed with introns; introns typically begin with GT and end with AG.

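Splice site detection
[Figure: donor site drawn 5' to 3' with the percentage of each base at each position; the values are not preserved in the transcript]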

Consensus splice sites Donor: 7.9 bits Acceptor: 9.4 bits

Promoters
- Promoters are DNA segments upstream of transcripts that initiate transcription
- The promoter attracts RNA polymerase to the transcription start site
[Figure: promoter positioned at the 5' end of the gene, which runs 5' to 3']

Splicing mechanism
- An adenine recognition site marks the intron
- snRNPs bind around the adenine recognition site
- The spliceosome thus forms
- The spliceosome excises introns from the transcript (pre-mRNA)