Gene Structure Prediction (Gene Finding) I519 Introduction to Bioinformatics, 2012.

Slides:



Advertisements
Similar presentations
Gene Prediction: Similarity-Based Approaches
Advertisements

GS 540 week 5. What discussion topics would you like? Past topics: General programming tips C/C++ tips and standard library BLAST Frequentist vs. Bayesian.
An Introduction to Bioinformatics Finding genes in prokaryotes.
An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Gene Prediction: Statistical Approaches.
Ab initio gene prediction Genome 559, Winter 2011.
 CpG is a pair of nucleotides C and G, appearing successively, in this order, along one DNA strand.  CpG islands are particular short subsequences in.
Ka-Lok Ng Dept. of Bioinformatics Asia University
Hidden Markov Models in Bioinformatics
Profiles for Sequences
Hidden Markov Models in Bioinformatics Example Domain: Gene Finding Colin Cherry
Gene Prediction Methods G P S Raghava. Prokaryotic gene structure ORF (open reading frame) Start codon Stop codon TATA box ATGACAGATTACAGATTACAGATTACAGGATAG.
CISC667, F05, Lec18, Liao1 CISC 467/667 Intro to Bioinformatics (Fall 2005) Gene Prediction and Regulation.
Gene prediction and HMM Computational Genomics 2005/6 Lecture 9b Slides taken from (and rapidly mixed) Larry Hunter, Tom Madej, William Stafford Noble,
Hidden Markov Models I Biology 162 Computational Genetics Todd Vision 14 Sep 2004.
CSE182-L10 Gene Finding.
Comparative ab initio prediction of gene structures using pair HMMs
Gene Prediction: Statistical Approaches Lecture 22.
Eukaryotic Gene Finding
Lecture 12 Splicing and gene prediction in eukaryotes
Eukaryotic Gene Finding
Gene Finding Genome Annotation. Gene finding is a cornerstone of genomic analysis Genome content and organization Differential expression analysis Epigenomics.
Biological Motivation Gene Finding in Eukaryotic Genomes
Dynamic Programming (cont’d) CS 466 Saurabh Sinha.
Dynamic Programming II
An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Gene Prediction: Similarity-Based Approaches.
Hidden Markov Models In BioInformatics
3. Genome Annotation: Gene Prediction. Gene: A sequence of nucleotides coding for protein Gene Prediction Problem: Determine the beginning and end positions.
An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Gene Prediction: Statistical Approaches.
Applications of HMMs Yves Moreau Overview Profile HMMs Estimation Database search Alignment Gene finding Elements of gene prediction Prokaryotes.
Gene finding with GeneMark.HMM (Lukashin & Borodovsky, 1997 ) CS 466 Saurabh Sinha.
Gene Prediction in silico Nita Parekh BIRC, IIIT, Hyderabad.
Gene prediction. Gene Prediction: Computational Challenge aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatg ctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgggatccgatgacaatgcatgc.
More on translation. How DNA codes proteins The primary structure of each protein (the sequence of amino acids in the polypeptide chains that make up.
Doug Raiford Lesson 3.  Have a fully sequenced genome  How identify the genes?  What do we know so far? 10/13/20152Gene Prediction.
Regular Expression ^ beginning of string $ end of string. any character except newline * match 0 or more times + match 1 or more times ? match 0 or 1 times;
DNA sequencing. Dideoxy analogs of normal nucleotide triphosphates (ddNTP) cause premature termination of a growing chain of nucleotides. ACAGTCGATTG ACAddG.
Genomics: Gene prediction and Annotations Kishor K. Shende Information Officer Bioinformatics Center, Barkatullah University Bhopal.
Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI Computational Bioinformatics 2012.
Lecture 10, CS5671 Neural Network Applications Problems Input transformation Network Architectures Assessing Performance.
Gene Prediction: Statistical Methods (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 20, 2005 ChengXiang Zhai Department of Computer Science.
Computational Genomics and Proteomics Lecture 8 Motif Discovery C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.
Gene Prediction: Similarity-Based Methods (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 15, 2005 ChengXiang Zhai Department of Computer Science.
Mark D. Adams Dept. of Genetics 9/10/04
Comp. Genomics Recitation 9 11/3/06 Gene finding using HMMs & Conservation.
From Genomes to Genes Rui Alves.
Genome Annotation Haixu Tang School of Informatics.
Gene, Proteins, and Genetic Code. Protein Synthesis in a Cell.
Basic Overview of Bioinformatics Tools and Biocomputing Applications II Dr Tan Tin Wee Director Bioinformatics Centre.
Genes and Genomes. Genome On Line Database (GOLD) 243 Published complete genomes 536 Prokaryotic ongoing genomes 434 Eukaryotic ongoing genomes December.
An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Gene Prediction: Statistical Approaches.
(H)MMs in gene prediction and similarity searches.
Finding genes in the genome
1 Applications of Hidden Markov Models (Lecture for CS498-CXZ Algorithms in Bioinformatics) Nov. 12, 2005 ChengXiang Zhai Department of Computer Science.
BIOINFORMATICS Ayesha M. Khan Spring 2013 Lec-8.
Dynamic Programming (cont’d) CS 466 Saurabh Sinha.
Biological Motivation Gene Finding in Eukaryotic Genomes Rhys Price Jones Anne R. Haake.
1 Gene Finding. 2 “The Central Dogma” TranscriptionTranslation RNA Protein.
ORF Calling.
bacteria and eukaryotes
Genome Annotation (protein coding genes)
What is a Hidden Markov Model?
Interpolated Markov Models for Gene Finding
Eukaryotic Gene Finding
Ab initio gene prediction
Recitation 7 2/4/09 PSSMs+Gene finding
More on translation.
Introduction to Bioinformatics II
Gene Prediction: Statistical Approaches
The Toy Exon Finder.
Presentation transcript:

Gene Structure Prediction (Gene Finding) I519 Introduction to Bioinformatics, 2012

Gene prediction aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgat gacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccg atgactatgctaagctgggatccgatgacaatgcatgcggctatgctaatg aatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatg catgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaa tgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgc taatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcgg ctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatg cggctatgctaatgcatgcggctatgcaagctgggatcctgcggctatgct aatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgac aatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatg ctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagct gggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaag ctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggcta tgctaagctcatgcggctatgctaagctgggaatgc

What’s a gene  What is a gene, post-ENCODE? –Gerstein et al. Genome Res : –ENCODE consortium: characterization of 1% of the human genome by experimental and computational techniques  Definitions: –Definition 1970s–1980s: Gene as open reading frame (ORF) sequence pattern –Definition 1990s–2000s: Annotated genomic entity, enumerated in the databanks –The gene is a union of genomic sequences encoding a coherent set of potentially overlapping functional products

Mark B. Gerstein et al. Genome Res. 2007; 17: The gene is a union of genomic sequences encoding a coherent set of potentially overlapping functional products Post-ENCODE definition

Can we still do gene prediction?  Prokaryotic genes  Eukaryotic genes gene promoter startstop terminator exon promoter startstop donoracceptor intron Open reading frame (ORF)

 Similarity based predictors  Statistical approaches (ab initio gene predictors) –Predictions depend on analysis of a variety of sequence patterns that are characteristic of exons, intron-exon boundaries, upstream regulatory regions, etc. –Hidden Markov models (a probabilistic sequence model), Neural networks (NN) and other methods are used to combine sequence content and signal classifiers into a consistent gene structure Gene predictors (for protein-coding genes)

 Microbial genome tends to be gene rich (80%- 90% of the sequence is coding)  Often no introns  Highly conserved patterns in the promoter region, transcription and translation start site Gene prediction is easier in microbial genomes

Coding region Promoter Transcription start side Untranslated regions Transcribed region start codon stop codon 5’ 3’ upstream downstream Transcription stop side -k denotes k th base before transcription, +k denotes k th transcribed base Prokaryote gene structure

ORF is a sequence of codons which starts with start codon, ends with an end codon and has no end codons in- between. Searching for ORFs – consider all 6 possible reading frames: 3 forward and 3 reverse (next slide) Is the ORF a coding sequence? 1.Must be long enough (roughly 300 bp or more) 2.Should have average amino-acid composition specific for the given organism. 3.Should have codon use specific for the given organism. Open reading frame (ORF)

Six frame translation of a DNA sequence  stop codons – TAA, TAG, TGA  start codons - ATG GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC

UAA, UAG and UGA correspond to 3 Stop codons that (together with Start codon ATG) delineate Open Reading Frames The Genetic Code

Codon usage  Codon: 3 consecutive nucleotides  4 3 = 64 possible codons  Genetic code is degenerative and redundant –Includes start and stop codons –An amino acid may be coded by more than one codon (degeneracy & wobbling pairing) –Certain codons are more in use  Uneven use of the codons may characterize a real gene

AA codon /1000 frac Ser TCG Ser TCA Ser TCT Ser TCC Ser AGT Ser AGC Pro CCG Pro CCA Pro CCT Pro CCC AA codon /1000 frac Leu CTG Leu CTA Leu CTT Leu CTC Ala GCG Ala GCA Ala GCT Ala GCC Gln CAG Gln CAA Codon usage in Mouse genome

Codon frequency frequency in coding regionfrequency in non-coding region Input sequence Compare Coding region or non-coding region

codon position ACTG 128%33%18%21% 232%16%21%32% 333%15%14%38% frequency in genome 31%18%19%31% P(x|ORF) P(x|random) =  P(Ai at ith position) P(Ai in the sequence) i Score of AAAGAT:.28*.32*.33*.21*.26*.14.31*.31*.31*.31*.31*.19 A simple calculation assuming independence of nucleotides

Coding region prediction using hexmer frequencies Coding potential –hexmer frequencies in coding versus in non-coding regions  log (C i (X)/N i (X))

Position 1Position 3Position 2 Non-Coding Region 11 p 1-p q 1-q Simple HMM for prokaryotic genes A: p A C: p C G: p G T: p T States Transition probability Emission probability

M = (Q, , a) Q – finite set of states, say |Q|= n a – n x n transition probability matrix a(i,j)= P[q t+1 =j|q t =i]  – n-vector, starting probability vector  (i) = Pr[q 0 =i] For any row of a the sum of entries = 1  (i) = 1 First order Markov chain

 Hidden Markov Model is a Markov model in which one does not observe a sequence of states but results of a function prescribed on states  States are hidden to the observers  In the simple gene HMM model, states are coding-region and non-coding region. But we observe the sequences of nucleotides. Hidden Markov Model

 Assume that at each state a Markov process emits (with some distribution) a symbol from alphabet .  Rather than observing a sequence of states we observe a sequence of emitted symbols. Example:  ={A,C,T,G}. Generate a sequence where A,C,T,G have frequency p(A) =.33, p(G)=.2, p(C)=.2, p(T) =.27 respectively A.33 T.27 C.2 G.2 one state emission probabilities Emission probability

HMM is a Markov process that at each time step generates a symbol from some alphabet, , according to emission probability that depends on state. M = (Q, , , a,e) Q – finite set of states, say n states ={q 0,q 1,…} a – n x n transition probability matrix: a(i,j) = Pr[q t+1 =j|g t =i]  – n-vector, start probability vector:  (i) = Pr[q 0 =i]  = {  1, …,  k }-alphabet e(i,j) = P[o t =  j |q t = i]; o t –t th element of generated sequences = probability of generating  j in state q i (S=o 0,…o T the output sequence) HMM: formal definition

 Given HMM M and a sequence S compute Pr(S|M) – probability of generating S by M.  Given HMM M and a sequence S compute most likely sequence of states generating S.  What is the most likely state at position i.  Given a representative sample (training set) of a sequence property construct HMM that best models this property. Algorithmic problems related to HMM

 GenMark- [Borodovsky, McInnich – 1993, Comp. Chem., 17, ] 5 th order HMM  GLIMMER[Salzberg, Delcher, Kasif,White 1998, Nucleic Acids Research, Vol 26, ] – Interpolated Markov Model (IMM) HMM based gene predictors for microbial genomes

Additional information for gene prediction  Upstream regions of genes often contain motifs that can be used for gene prediction -10 STOP ATG TATACT Pribnow Box TTCCAAGGAGG Ribosomal binding site Transcription start site

Promoter structure in prokaryotes Transcription starts at offset 0. Pribnow Box (-10) Gilbert Box (-30) Ribosomal Binding Site (+10)

Ribosomal binding Site

 In eukaryotes, the gene is a combination of coding segments (exons) that are interrupted by non-coding segments (introns)  And all the complicating factors as mentioned in the “What is a gene” paper  More non-coding regions than coding regions  This makes computational gene prediction in eukaryotes even more difficult Eukaryotic gene prediction is more difficult

Donor and acceptor sites: GT and AG dinucleotides  The beginning and end of exons are signaled by donor and acceptor sites that usually have GT and AC dinucleotides  Detecting these sites is difficult, because GT and AC appear very often exon 1exon 2 GTAC Acceptor Site Donor Site

5’5’ 3’3’ Donor site Position % Splicing site signal

 Try to recognize location of splicing signals at exon-intron junctions –This has yielded a weakly conserved donor splice site and acceptor splice site  Profiles for sites are still weak, and lends the problem to the Hidden Markov Model (HMM) approaches, which capture the statistical dependencies between sites Splice site prediction

Similarity-based gene finding  Alignment of –Genomic sequence and (assembled) EST sequences –Genomic sequence and known (similar) protein sequences (spliced alignment) –Two or more similar genomic sequences

 Human EST (mRNA) sequence is aligned to different locations in the human genome  Find the “best” path to reveal the exon structure of human gene EST sequence Human Genome Using EST to find exon structure

Using similarities to find the exon structure  The known frog gene is aligned to different locations in the human genome  Find the “best” path to reveal the exon structure of human gene Frog Gene (known) Human Genome

Finding local alignments Use local alignments to find all islands of similarity Human Genome Frog Genes (known)

Exon chaining problem  Locate the beginning and end of each interval (2n points)  Find the “best” path: Given a set of putative exons, find a maximum set of non-overlapping putative exons

 Goal: Find a chain of blocks in a genomic sequence that best fits a target sequence  Input: Genomic sequences G, target sequence T, and a set of candidate exons B.  Output: A chain of exons such that the global alignment score between the chain of exon and T is maximum among all chains of blocks from B. Spliced alignment for gene prediction

Spliced alignment can be solved by DP algorithm

 GLIMMER: uses interpolated Markov model (for prokaryotic gene prediction)  GeneMark: uses Markov model; geneMark-P & geneMark-E  Eukaryotic genes predictors –GENSCAN: uses Hidden Markov Models (HMMs) –Grail: uses pattern recognition and EST (neural network) Mount p380 –TwinScan (GenomeScan) Uses both HMM and similarity (e.g., between human and mouse genomes) Gene predictors

Glimmer  Made of 2 programs –BuildIMM Takes sequences as input and outputs the Interpolated Markov Models (IMMs) –Glimmer Takes IMMs and outputs all candidate genes Automatically resolves overlapping genes by choosing one, hence limited Marks “suspected to truly overlap” genes for closer inspection by user

GenMark  Based on non-stationary Markov chain models  Results displayed graphically with coding vs. noncoding probability dependent on position in nucleotide sequence

GENSCAN States -- functional units “Phase” three frames. Two symmetric sub modules for forward and backward strands Performance: 80% exon detecting (but if a gene has more than one exon probability of detection decrease rapidly).

TwinScan  Aligns two sequences and marks each base as gap ( - ), mismatch (:), match (|), resulting in a new alphabet of 12 letters: Σ {A-, A:, A |, C-, C:, C |, G-, G:, G |, T-, T:, T|}.  Run Viterbi algorithm using emissions ek(b) where b ∊ {A-, A:, A|, …, T|}.  The emission probabilities are estimated from from human/mouse gene pairs. –Ex. eI(x|) eE(x-) since gaps (as well as mismatches) are favored in introns. –Compensates for dominant occurrence of poly-A region in introns

What’s new  What is a gene? (Gerstein et al. Genome Res : )  Prediction of gene fragments in short reads (metagenomic sequences)  ncRNA gene prediction (we will cover it briefly)  …