Lecture 12 Splicing and gene prediction in eukaryotes

Presentation transcript:

Bioinformatics, Lecture 12: Splicing and gene prediction in eukaryotes
Topics: critical splice signals; coding statistics (DNA differences between exons and introns); discriminant function and combined approach.

Splicing and gene prediction in eukaryotes
Any type of gene prediction, and ab initio prediction in particular, is greatly complicated in eukaryotes by splicing. The difficult task is to predict the positions of exon-intron boundaries in genes that have multiple introns, and to predict the absence of introns in intronless genes. Eukaryotic genomes also differ significantly from one another in a number of ways, which requires species-specific prediction programs. The major differences include: a) variation in GC content (e.g. mammalian genomes show large regional variation in GC content, referred to as isochores), b) variation in codon usage frequencies. If these factors are not taken into consideration, the quality of prediction decreases.

AT/GC ratios in coding regions in some eukaryotes

The number of correct and incorrect (in parentheses) whole-gene model predictions shared among the three programs, GeneMark.hmm (GM), Genscan+ (GS) and GlimmerM (GA), on a test set of 1783 genes. "Incorrect gene" refers to cases in which all coding exons in the gene are in perfect agreement among the gene finders but not with the true gene.

mRNA splicing

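Critical splice signals
[Figure: EXON 1, INTRON, EXON 2, with the donor site (5' splice junction), the branch site and the acceptor site (3' splice junction) marked. Sequence labels in extraction order: G U A/G A G U U U A/G A U/C U/C A G; A G; G/A. Frequency labels in extraction order: 100%, 62–68%, 100%.]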

Frequencies of nucleotides at the ends of exons
[Figure: the first 10 nucleotides of exons (5' end) and the last 10 nucleotides of exons (3' end) in C. elegans, D. melanogaster and H. sapiens.]

Recognition of variable splice sites and gene prediction
At least three critical signals (motifs), the donor, acceptor and branch sites, must be recognised in order to predict the position of an intron and both of its splice junctions. Significant sequence variation in these sites between species and between genes reduces the quality of predictions. The best average error rate (false positives + false negatives) for either donor or acceptor site prediction is about 5%. This may be acceptable if the search is restricted to a short region. A search of a large region, however, gives an unacceptable number of false positives, because for every true site there are hundreds of pseudo-sites. For example, if a large region has 40 true sites and 4000 pseudo-sites, one true site would be missed (2.5% false negatives) and 100 pseudo-sites would be predicted as true sites (2.5% false positives), so false predictions would outnumber the 39 correct ones by more than two to one.
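The arithmetic of this example can be checked with a few lines of Python. This is only a sketch: the function name and the use of precision (the fraction of predicted sites that are true sites) as a summary are illustrative choices, not part of the lecture.

```python
def site_prediction_outcome(true_sites, pseudo_sites, fn_rate, fp_rate):
    """Expected outcome of scanning a region for splice sites.

    fn_rate: fraction of true sites that are missed.
    fp_rate: fraction of pseudo-sites wrongly reported as true sites.
    """
    missed = true_sites * fn_rate            # false negatives
    false_positives = pseudo_sites * fp_rate
    true_positives = true_sites - missed
    # Precision: share of reported sites that are real (illustrative summary).
    precision = true_positives / (true_positives + false_positives)
    return missed, false_positives, precision


# The example from the slide: 40 true sites, 4000 pseudo-sites, 2.5% each way.
missed, false_positives, precision = site_prediction_outcome(40, 4000, 0.025, 0.025)
print(missed, false_positives, round(precision, 2))
```

With the slide's numbers, 39 true sites are found alongside 100 false ones, so only about 28% of the reported sites are real.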

Recognition of variable splice sites and gene prediction
Since an adjacent donor site and acceptor site are not independent, this correlation can be exploited to eliminate further false positives. For short introns, which occur mostly in lower eukaryotes, an intron is recognized through the interaction of splicing factors bound across the intron ends (hence the 5'ss – 3'ss correlation). In vertebrates, where exons are much shorter than introns, the key is recognition of exons through the interaction of splicing factors bound across the exon ends (hence the 3'ss – 5'ss correlation). Mammalian functional splice sites can therefore be identified effectively only together, through exon recognition. There are also several additional signals (motifs) essential for correct splicing, which are recognized by certain proteins involved in splicing. Identifying such sites and using them in prediction programs should increase the quality of eukaryotic gene predictions.

Coding statistics: DNA differences between exons and introns
Besides splicing signals and ORFs, there are several additional characteristics that may help to discriminate between exons and introns. These features include DNA periodicity in exons, codon preference, hexamer usage, codon prototype, and compositional bias between codon positions.

DNA periodicity in exons

Periodic structure in DNA sequences. The absolute frequency of the pair A...A with k = 0 to 5 nucleotides between the two A's, in the first 200 base pairs of the sequences in a set of 1761 human exons and 1753 human introns. A clear period-3 pattern appears in coding regions and is absent in non-coding regions. A similar periodic pattern appears in coding regions for the other fifteen possible pairs of nucleotides.
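The counting behind this figure can be sketched as follows. The function name and the toy sequence are hypothetical; the lecture's own data set (human exons and introns) is not reproduced here.

```python
def pair_counts(seq, first="A", second="A", max_gap=5):
    """Count occurrences of `first` ... `second` with k nucleotides in between.

    Returns a list whose element k (k = 0..max_gap) is the number of
    positions i where seq[i] == first and seq[i + k + 1] == second.
    """
    seq = seq.upper()
    counts = []
    for k in range(max_gap + 1):
        offset = k + 1  # k nucleotides between the two bases
        n = sum(
            1
            for i in range(len(seq) - offset)
            if seq[i] == first and seq[i + offset] == second
        )
        counts.append(n)
    return counts


# Toy sequence with an exact period of 3 (not real exon data).
print(pair_counts("ACGACGACG"))
```

In the toy sequence the A...A pair occurs only at gaps of 2 and 5, i.e. at distances that are multiples of 3, which is the kind of signal the figure shows for real exons in a much noisier form.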

Codon Preference
This coding statistic was introduced to measure solely the uneven usage of synonymous codons. From a codon usage table, we can compute the relative probability with which each synonymous codon is used to code for a given amino acid. For instance, GAG and GAA, the two codons for glutamic acid, are used in coding regions with probabilities 0.03882 and 0.02751, which gives relative probabilities of 0.59 and 0.41, respectively.
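The normalisation in the glutamic acid example is simply each codon's frequency divided by the total over its synonymous codons. A minimal sketch, with a hypothetical function name:

```python
def relative_synonymous_probabilities(codon_freqs):
    """Normalise the frequencies of codons that encode the same amino acid.

    codon_freqs: dict mapping each synonymous codon to its frequency
    in coding regions (taken from a codon usage table).
    """
    total = sum(codon_freqs.values())
    return {codon: freq / total for codon, freq in codon_freqs.items()}


# Glutamic acid, using the two frequencies quoted on the slide.
glu = {"GAG": 0.03882, "GAA": 0.02751}
for codon, p in relative_synonymous_probabilities(glu).items():
    print(codon, round(p, 2))
```

This reproduces the slide's values: 0.03882 / (0.03882 + 0.02751) is about 0.59 for GAG, leaving about 0.41 for GAA.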

Hexamer usage correlation
Bias in the distribution of oligonucleotides longer than codons can also be used to discriminate between coding and non-coding regions. Bias in the usage of hexamers may be the most discriminating (probably because of the dependence between adjacent amino acids in proteins). Bias in hexamer usage can be computed in exactly the same way as bias in codon usage: the background information for codon frequencies is known, and the frequency of each of the $64^2 = 4096$ hexamers can be found. There are several ways to construct a frame-specific hexamer score, including the log-odds score $L_E(w,i) = \log [f_E(w,i) / f_I(w)]$ and the preference score $P_E(w,i) = f_E(w,i) / [f_E(w,i) + f_I(w)]$, where $f_E(w,i)$ is the frequency of hexamer $w$ in frame $i$, calculated from known exon training data, and $f_I(w)$ is the frequency of $w$ in known introns.

Table: probabilities of the four nucleotides at the different codon positions, conditioned on the nucleotide at the preceding codon position. Estimated from a set of human exon and intron sequences. Each column sums to approximately 1, which suggests that columns correspond to the preceding nucleotide and rows to the nucleotide at the given codon position.

Codon position 1
      A     C     G     T
A    .36   .27   .35   .18
C    .21   .23   .24   .27
G    .19   .14   .23   .23
T    .24   .35   .19   .31

Codon position 2
      A     C     G     T
A    .16   .19   .15   .07
C    .28   .44   .41   .33
G    .40   .12   .27   .45
T    .16   .25   .17   .16

Codon position 3
      A     C     G     T
A    .22   .33   .24   .13
C    .21   .29   .27   .21
G    .44   .15   .37   .53
T    .13   .22   .12   .13
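A minimal sketch of the two hexamer scores follows. It assumes that the training exons are given in frame (each sequence starts at codon position 1), that frames are numbered 0 to 2, and that a small pseudocount is added to avoid taking the logarithm of zero. All of these, and the function names, are illustrative assumptions rather than details from the lecture.

```python
import math
from collections import Counter


def hexamer_freqs(seqs, frame=None):
    """Relative hexamer frequencies in a list of sequences.

    If frame is 0, 1 or 2, only hexamers starting at positions i with
    i % 3 == frame are counted (assumes each sequence starts in frame).
    If frame is None, hexamers at all positions are counted (introns).
    """
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - 5):
            if frame is None or i % 3 == frame:
                counts[s[i:i + 6]] += 1
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def log_odds(f_exon, f_intron, w, pseudo=1e-6):
    """L_E(w,i) = log[f_E(w,i) / f_I(w)], with an assumed pseudocount."""
    return math.log((f_exon.get(w, 0.0) + pseudo) / (f_intron.get(w, 0.0) + pseudo))


def preference(f_exon, f_intron, w):
    """P_E(w,i) = f_E(w,i) / [f_E(w,i) + f_I(w)]; 0.5 if w was never seen."""
    fe, fi = f_exon.get(w, 0.0), f_intron.get(w, 0.0)
    return fe / (fe + fi) if fe + fi > 0 else 0.5


# Toy training data (not real sequences).
f_e = hexamer_freqs(["ATGGCCATGGCC"], frame=0)
f_i = hexamer_freqs(["TTTTTTTTTTTT"])
print(preference(f_e, f_i, "ATGGCC"), log_odds(f_e, f_i, "ATGGCC") > 0)
```

A candidate region would then be scored by summing the log-odds of its hexamers in each of the three frames; a real implementation would need far larger training sets than the toy sequences used here.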

Codon Prototype, Markov model measure and Average Mutual Information
A measure can be introduced that shows how similar the observed distribution of base frequencies at the three codon positions in a sequence (exon or intron) is to the prototypical distribution (see the table). Dependencies between nucleotide positions in coding regions can be described explicitly by means of Markov models. Average Mutual Information measures the dependence between nucleotides i and j at a distance of k nucleotides, based on the probability of that pair occurring in the sequence.

Table: prototypical base frequencies at the three codon positions (two of the three values in the C row are missing from the extracted slide).

Nucleotide   Codon position
             1      2      3
A            0.27   0.31   0.18
C            0.24
G            0.32   0.20   0.29
T            0.17   0.26   0.22
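The slide does not give a formula for Average Mutual Information. The sketch below assumes the standard definition, I(k) = Σ p_ij(k) · log2[p_ij(k) / (p_i · p_j)], where p_ij(k) is the probability of finding nucleotide i followed by nucleotide j at positions n and n + k. The function name and the toy inputs are hypothetical.

```python
import math
from collections import Counter


def average_mutual_information(seq, k):
    """Mutual information (in bits) between positions n and n + k of seq.

    Assumes the standard definition; k >= 1 is the offset between the
    two positions. Returns 0.0 if the sequence is too short.
    """
    seq = seq.upper()
    pairs = Counter((seq[n], seq[n + k]) for n in range(len(seq) - k))
    total = sum(pairs.values())
    if total == 0:
        return 0.0
    # Marginal counts of the first and second member of each pair.
    left, right = Counter(), Counter()
    for (a, b), c in pairs.items():
        left[a] += c
        right[b] += c
    ami = 0.0
    for (a, b), c in pairs.items():
        p_ab = c / total
        ami += p_ab * math.log2(p_ab / ((left[a] / total) * (right[b] / total)))
    return ami


# Toy inputs: a strictly repeating sequence versus a homopolymer.
print(average_mutual_information("ACGT" * 10, 1))
print(average_mutual_information("A" * 20, 1))
```

In the repeating toy sequence each base fully determines the next, so the value is close to 2 bits; in the homopolymer there is nothing to learn and the value is 0.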

Values of different coding statistics in the 223 bp long 2nd coding exon of the human β-globin gene, and in a 223 bp long sequence from the middle of the 2nd intron of the same gene

Frame-dependent statistics
                          Exon sequence                  Intron sequence
                          Coding    Non-coding frames
Statistic                 frame                          Frame 1   Frame 2   Frame 3
Codon Usage                24.06    -16.13     -3.16     -14.36    -23.74    -19.67
Hexamer Usage              27.62    -11.64     -6.51     -20.90    -27.56    -22.07
(label missing in source)  39.98    -14.58     -8.46     -26.73    -27.81    -25.87
Codon Preference           15.97     -1.32      7.24      -7.96    -12.70    -14.93
Amino Acid Usage            8.17    -14.87    -10.17      -6.15    -10.69     -4.57
Codon Prototype             9.87    -11.23    -10.30     -11.45    -17.44    -14.49
Markov Model, order 1      29.92     -2.69     -3.31     -35.44    -42.40    -41.73
Markov Model, order 2      34.73    -18.26     -7.77     -29.61    -41.76    -40.05
Markov Model, order 5      72.69    -21.38     13.56     -37.63    -30.99    -36.40

Frame-independent statistics
Statistic                     Exon sequence   Intron sequence
Position Asymmetry            0.0957          0.0211
Periodic Asymmetry Index      1.159           1.009
Average Mutual Information    0.00681         0.000344
Fourier Spectrum              2.278           0.892

Pattern discriminant analysis
A number of different pattern features of sequences are used to discriminate between coding (exon) and non-coding sequences. A linear and a quadratic analysis are shown, the latter being more efficient. In the example, EPS is the 6-mer exon preference score and 3'SS is the 3' splice site.

COMBINER: computational gene prediction using multiple sources of evidence
The next generation of computational methods for constructing gene models is currently being developed. These methods take as input, and combine, a genomic sequence and the locations of gene predictions from ab initio gene finders, protein sequence alignments, expressed sequence tag (EST) and cDNA alignments, splice site predictions, and other evidence. An example of such a program is COMBINER, which uses rigorous statistical assessment to evaluate candidate gene models and estimates probabilities using so-called decision trees.