Gene Finding Charles Yan.

Gene Finding
- Content sensors
  - Extrinsic content sensors
  - Intrinsic content sensors
- Signal sensors
  - Splice site prediction
  - Promoter prediction
  - Poly(A) site prediction
  - Translation initiation codon prediction
- Combining the evidence to predict gene structures

Combining Evidence
Since 1990, programs are no longer limited to searching for independent exons but instead try to identify the whole complex structure of a gene. Given a sequence and a set of signal sensors, one can accumulate evidence on the occurrence of signals. Translation starts, translation stops and splice sites are the most important signals, since they define the boundaries of coding regions.

Combining Evidence
In theory, each consistent pair of detected signals defines a potential gene region (intron, exon or coding part of an exon). If all these potential gene regions can be used to build a gene model, the number of potential gene models grows exponentially with the number of predicted exons.

Combining Evidence
In practice, this number is slightly reduced because "correct" gene structures must satisfy a set of properties:
- There are no overlapping exons.
- Coding exons must be frame compatible.
- Merging two successive coding exons must not generate an in-frame stop at the junction.
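
The three properties above can be sketched as simple checks. This is only an illustration under stated assumptions: exons are given as coding-strand DNA strings or (start, end) pairs, "phase" here means the number of leading bases of an exon that complete a codon begun in the previous exon, and the function names are hypothetical rather than taken from any gene finder.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def exons_overlap(a, b):
    """a and b are (start, end) pairs with inclusive coordinates."""
    return a[0] <= b[1] and b[0] <= a[1]

def frame_compatible(len_a, phase_a, phase_b):
    """True if exon B can continue the reading frame left open by exon A.

    phase_x (assumed convention): number of leading bases of exon x that
    complete a codon started in the previous exon (0, 1 or 2).
    """
    leftover = (len_a - phase_a) % 3          # bases of an unfinished codon at the end of A
    return phase_b == (3 - leftover) % 3      # B must supply the missing bases

def creates_junction_stop(exon_a, exon_b, phase_a=0):
    """True if the codon straddling the A/B junction is a stop codon."""
    joined = exon_a + exon_b
    junction = len(exon_a)
    for start in range(phase_a, len(joined) - 2, 3):
        if start < junction < start + 3:      # codon uses bases from both exons
            return joined[start:start + 3] in STOP_CODONS
    return False
```

For instance, joining "ATGGCCT" and "AGGCC" in phase 0 yields the straddling codon TAG, so that pair would be rejected, while the same upstream exon followed by "TGGCC" would pass this check.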

Combining Evidence
The number of candidates remains, however, exponential. Almost all existing approaches cope with this exponential number in reasonable time by using dynamic programming techniques.

Combining Evidence
- Extrinsic approaches: the content of exon/intron regions is assessed using extrinsic sensors
- Intrinsic approaches
- Integrated approaches: combine evidence coming from both intrinsic and extrinsic content sensors

Extrinsic Approaches
The principle of most of these programs is to combine similarity information with signal information obtained by signal sensors.

Extrinsic Approaches
Very briefly, all the programs in this class may be seen as refinements of the traditional Smith-Waterman local alignment algorithm, in which the presence of a signal allows a gap to be opened (donor site) or closed (acceptor site) with an essentially free extension cost. They are often referred to as "spliced alignment" programs.
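
A minimal sketch of the spliced alignment idea follows. It is not the algorithm of any particular program: it assumes a cDNA aligned in full against a genomic sequence (a simplification of Smith-Waterman's fully local alignment), canonical GT/AG intron boundaries, and illustrative scoring values. An intron is charged a fixed `intron_cost` regardless of its length, which is the "free extension" described above.

```python
NEG_INF = float("-inf")

def spliced_alignment_score(genome, cdna, match=1, mismatch=-1, gap=-2, intron_cost=3):
    """Best score for aligning all of cdna to some region of genome.

    Introns may open only where the genome reads GT and close only just
    after an AG; their cost does not depend on their length.
    """
    m = len(cdna)
    # prev[j]: best score of aligning cdna[:j] with a genomic region ending at the previous row
    prev = [0] + [gap * j for j in range(1, m + 1)]
    # best_donor[j]: best score for cdna[:j] at any earlier row followed by a GT donor
    best_donor = list(prev) if genome[0:2] == "GT" else [NEG_INF] * (m + 1)
    best = prev[m]
    for i in range(1, len(genome) + 1):
        at_acceptor = i >= 2 and genome[i - 2:i] == "AG"
        cur = [0] * (m + 1)                   # cur[0] = 0: the alignment may start anywhere
        for j in range(1, m + 1):
            sub = match if genome[i - 1] == cdna[j - 1] else mismatch
            score = max(prev[j - 1] + sub,    # align genome[i-1] with cdna[j-1]
                        prev[j] + gap,        # genomic base left unaligned
                        cur[j - 1] + gap)     # cDNA base left unaligned
            if at_acceptor:                   # close an intron opened at an earlier donor
                score = max(score, best_donor[j] - intron_cost)
            cur[j] = score
        if genome[i:i + 2] == "GT":           # row i can serve as an intron start
            best_donor = [max(d, c) for d, c in zip(best_donor, cur)]
        best = max(best, cur[m])
        prev = cur
    return best
```

Keeping only the best donor score per cDNA position is what makes intron length free: the acceptor step never needs to know how far back the donor was.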

Extrinsic Approaches
Existing software may be further divided according to the type of similarity exploited: genomic DNA/protein, genomic DNA/cDNA or genomic DNA/genomic DNA. Some of these methods can deal with more than one type and take into account possible frameshifts in the genomic DNA or cDNA sequences.

Extrinsic Approaches
Procrustes aligns a genomic sequence with a protein. It considers all potential exons from the query DNA sequence, initially with the only constraint that they must be bordered by donor and acceptor sites. All possible exon assemblies are explored by translating the exons and aligning them with the target protein. Other programs performing the same task are GeneWise, PredictGenes, ORFgene and ALN.

Extrinsic Approaches
Some programs, like INFO and ICE, use a dictionary-based approach: they first build dictionaries of segments of length k from a protein or EST database and then, using a look-up procedure, find all segments in the query DNA sequence that have a match in the dictionary.
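
The dictionary look-up can be illustrated with a small k-mer index. This sketch assumes a DNA (EST-like) database held in a plain dict and exact matching; the function names are hypothetical, and it does not reproduce how INFO or ICE are implemented (a protein dictionary would also require translating the query first).

```python
from collections import defaultdict

def build_dictionary(database, k):
    """Map every length-k segment to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index

def find_hits(query, index, k):
    """Return (query position, sequence name, database position) for every shared k-mer."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for name, dpos in index.get(query[qpos:qpos + k], ()):
            hits.append((qpos, name, dpos))
    return hits

# Toy usage with a one-entry database (hypothetical data).
index = build_dictionary({"est1": "ACGTAC"}, k=4)
hits = find_hits("TTCGTATT", index, k=4)
```

Building the index costs time proportional to the database size, after which each query segment is found with a single hash look-up.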

Combining Evidence
- Extrinsic approaches: the content of exon/intron regions is assessed using extrinsic sensors
- Intrinsic approaches
- Integrated approaches: combine evidence coming from both intrinsic and extrinsic content sensors

Intrinsic Approaches
In the exon-based category, gene assembly is separated from the coding segment prediction step. The goal is to find the highest-scoring genes, the gene score being a simple function (usually the sum) of the scores of the assembled segments. The segment assembly process can be defined as the search for an optimal path in a directed acyclic graph where vertices represent exons and edges represent compatibility between exons. This is the approach adopted by the GeneId, GenView2, GAP3, FGENE and DAGGER programs.
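
The optimal-path formulation can be sketched with a short dynamic program. This is an illustration only: exons are assumed to be (start, end, score) tuples, the gene score is the sum of exon scores, and "compatible" is reduced to "does not overlap and lies downstream", whereas real programs also check frame and signal compatibility. The function name is hypothetical.

```python
def best_exon_chain(exons):
    """Return (best score, chain) for the highest-scoring chain of compatible exons.

    exons: list of (start, end, score) with inclusive coordinates.
    An edge j -> i exists when exon j ends before exon i starts.
    """
    if not exons:
        return 0, []
    # Sorting by start guarantees every possible predecessor is processed first.
    order = sorted(range(len(exons)), key=lambda i: exons[i][0])
    best_score, predecessor, processed = {}, {}, []
    for i in order:
        start_i, _, score_i = exons[i]
        best_score[i], predecessor[i] = score_i, None
        for j in processed:
            candidate = best_score[j] + score_i
            if exons[j][1] < start_i and candidate > best_score[i]:
                best_score[i], predecessor[i] = candidate, j
        processed.append(i)
    # Trace back from the best final exon.
    node = max(best_score, key=best_score.get)
    top = best_score[node]
    chain = []
    while node is not None:
        chain.append(exons[node])
        node = predecessor[node]
    chain.reverse()
    return top, chain
```

With n candidate exons this runs in O(n^2) time, even though the number of possible chains is exponential in n.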

Intrinsic Approaches
In the signal-based methods, the gene assembly is produced directly from the set of detected signals.

Intrinsic Approaches
To deal efficiently with the exponential number of possible gene structures defined by potential signals, almost all intrinsic gene finders use dynamic programming (DP) to identify the most likely gene structures according to the evidence provided by both content and signal sensors.

Integrated Approaches
- Combine both intrinsic and extrinsic evidence.
- Combine the predictions of several programs in order to obtain a sort of consensus.


Pitfalls and Issues
Several issues make the problem of eukaryotic gene finding extremely difficult.
- Very long genes: for example, the largest human gene, the dystrophin gene, is composed of 79 exons spanning nearly 2.3 Mb.
- Very long introns: again, in the human dystrophin gene, some introns are >100 kb long and >99% of the gene is composed of introns.

Pitfalls and Issues
- Very conserved introns: this is particularly a problem when gene prediction is addressed through similarity searches.

Pitfalls and Issues
- Very short exons: some exons are only 3 bp long in Arabidopsis genes, and probably even 1 bp for the coding part of exons at either end of the coding sequence, meaning that start or stop codons can be interrupted by an intron. Such small exons are easily missed by all content sensors, especially if bordered by large introns. The most difficult cases are those where the length of a coding exon is a multiple of three (typically 3, 6 or 9 bp), because missing such exons causes no problem in the exon assembly: they do not introduce any change in the frame.
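
The arithmetic behind the multiple-of-three case can be checked directly. In this sketch (exon lengths are made-up values, and "phase" is assumed to mean the number of codon bases already read when an exon begins), dropping a 6 bp exon leaves the phase of the downstream exon unchanged, so the assembly gives no hint that an exon is missing, whereas dropping a 7 bp exon shifts it.

```python
def exon_phases(lengths):
    """Phase at the start of each coding exon: coding bases consumed so far, modulo 3."""
    phases, consumed = [], 0
    for length in lengths:
        phases.append(consumed % 3)
        consumed += length
    return phases

# Hypothetical coding exon lengths; the middle exon is the one at risk of being missed.
with_short_exon = exon_phases([100, 6, 50])
without_short_exon = exon_phases([100, 50])
# The last exon starts in the same phase in both lists, so the frame check cannot reveal the omission.
same_downstream_phase = with_short_exon[-1] == without_short_exon[-1]
```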

Pitfalls and Issues
- Overlapping genes: though very rare in eukaryotic genomes, there are some documented cases in animals as well as in plants.
- Polycistronic gene arrangement: one gene and one mRNA, but two or more proteins.

Pitfalls and Issues
- Frameshifts: some sequences stored in databases may contain errors (either sequencing errors or simply errors made when editing the sequence) resulting in the introduction of artificial frameshifts (deletion or insertion of one base). Such frameshifts greatly increase the difficulty of the computational gene finding problem by producing erroneous statistics and masking true solutions.
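
A toy example shows why a single-base error is so damaging. The sequence below is invented for illustration: deleting one base changes every downstream codon and hides the stop codon, which is how an artificial frameshift corrupts codon statistics and masks the true structure.

```python
def codons(seq):
    """Split a sequence into complete codons, reading from position 0."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]

def delete_base(seq, pos):
    """Simulate a sequencing or editing error that drops one base."""
    return seq[:pos] + seq[pos + 1:]

original = "ATGGCCAAATGA"             # reads ATG GCC AAA TGA
shifted = delete_base(original, 4)    # one C lost from the second codon
original_codons = codons(original)
shifted_codons = codons(shifted)      # downstream codons differ and TGA is no longer in frame
```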

Pitfalls and Issues
- Introns in non-coding regions: there are genes for which the genomic region corresponding to the 5'- and/or 3'-UTR in the mature mRNA is interrupted by one or more introns.
- Alternative transcription start: e.g. three alternative promoters regulate the transcription of the 14 kb full-length dystrophin mRNAs, and four "intragenic" promoters control that of smaller isoforms.

Pitfalls and Issues
- Alternative splicing.
- Alternative polyadenylation: 20% of human transcripts show evidence of alternative polyadenylation.

Pitfalls and Issues
- Alternative initiation of translation: finding the right AUG initiator is still a major concern for gene prediction methods. The rule stating that the first AUG in the mRNA is the initiator codon can be escaped through three mechanisms: context-dependent leaky scanning, re-initiation and direct internal initiation. Non-AUG triplets can sometimes act as the functional codon for translation initiation, such as ACG in Arabidopsis or CUG in human sequences.
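
To make the difficulty concrete, the sketch below lists every candidate start codon that is followed by an in-frame stop, instead of trusting the first one. It is a simplified illustration: the sequence is written in the DNA alphabet (ATG for AUG), no sequence context is scored, and the function name and the example sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def candidate_starts(mrna, start_codons=("ATG",)):
    """Return (start, end) of every open reading frame opened by a candidate start codon.

    end is the position just past the in-frame stop codon; candidates
    without an in-frame stop are skipped.
    """
    found = []
    for i in range(len(mrna) - 2):
        if mrna[i:i + 3] in start_codons:
            for j in range(i + 3, len(mrna) - 2, 3):
                if mrna[j:j + 3] in STOP_CODONS:
                    found.append((i, j + 3))
                    break
    return found

# In this invented transcript the first ATG is followed immediately by a stop,
# and a downstream ATG opens a longer reading frame.
orfs = candidate_starts("CCATGTAAGGCATGGCCTGAAA")
```

Passing `start_codons=("ATG", "ACG", "CTG")` would extend the scan to the non-AUG initiators mentioned above, at the price of many more candidates.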