Gene Prediction Chengwei Luo, Amanda McCook, Nadeem Bulsara, Phillip Lee, Neha Gupta, and Divya Anjan Kumar.

Presentation transcript:

Gene Prediction outline: Introduction; Protein-coding gene prediction; RNA gene prediction; Modification and finishing; Project schema

Gene Prediction: Introduction

Why gene prediction? What about the experimental way?

Why gene prediction? Exponential growth of sequence data. Metagenomics: only ~1% of microbes grow in the lab. New sequencing technology.

How to do it? It is a complicated task, so let's break it into parts. [Diagram label: Genome]

How to do it? Protein-coding gene prediction: homology search (Phillip Lee & Divya Anjan Kumar); ab initio approach (Nadeem Bulsara & Neha Gupta)

How to do it? RNA gene prediction (Amanda McCook & Chengwei Luo): tRNA, rRNA, sRNA

Gene Prediction: Protein-coding gene prediction

Homology Search

Strategy

Open reading frame (ORF)

How/Why find ORF?

Protein Database Searches

Domain searches

Limits of Extrinsic Prediction

ab initio Prediction

Homology search is not enough! Databases are biased and incomplete. Sequenced genomes are not evenly distributed on the tree of life, so they do not reflect its diversity. Figure annotation: sequenced genomes cluster here.

ab initio Gene Prediction

Features

ORFs (6 frames)
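
The six-frame idea can be illustrated with a minimal sketch in Python. It is not the method of any tool named in these slides; the function names, the start codon set, and the `min_codons` cutoff are all assumptions chosen for illustration. It reads the three frames of the forward strand and the three frames of the reverse complement, and reports start-to-stop stretches.

```python
# Illustrative six-frame ORF scan; names and thresholds are hypothetical.
COMPLEMENT = str.maketrans("ACGT", "TGCA")
START_CODONS = {"ATG", "GTG", "TTG"}   # assumed start codon set
STOP_CODONS = {"TAA", "TAG", "TGA"}

def reverse_complement(seq):
    """Return the reverse complement of an uppercase ACGT string."""
    return seq.translate(COMPLEMENT)[::-1]

def orfs_in_frame(seq, frame, min_codons):
    """Yield (start, end) of start-to-stop ORFs in one frame.

    `end` is exclusive and includes the stop codon. Only the first
    start codon after the previous stop is reported.
    """
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon in START_CODONS:
            start = i
        elif start is not None and codon in STOP_CODONS:
            if (i - start) // 3 >= min_codons:
                yield start, i + 3
            start = None

def six_frame_orfs(seq, min_codons=2):
    """Return (strand, frame, start, end, orf_sequence) for all six frames.

    Coordinates on the '-' strand refer to the reverse-complemented sequence.
    """
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame, min_codons):
                results.append((strand, frame, start, end, s[start:end]))
    return results

print(six_frame_orfs("CCATGAAATTTTAGCC"))
# [('+', 2, 2, 14, 'ATGAAATTTTAG')]
```

A real pipeline would use a much larger `min_codons` value and would map reverse-strand coordinates back to the forward strand.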

Codon Statistics

Features (Contd.)

Probabilistic View

Supervised Techniques

Unsupervised Techniques

Commonly used tools: GeneMark, GLIMMER, EasyGene, PRODIGAL

GeneMark: Developed in 1993 at the Georgia Institute of Technology as one of the first gene-finding tools. It uses Markov chains, built from dicodon statistics, to represent coding and noncoding reading frames. Shortcoming: inability to find exact gene boundaries.
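
To make the Markov chain idea concrete, here is a rough sketch that scores a sequence under a coding model and a noncoding model and compares the two. It is a simplification under stated assumptions, not GeneMark's implementation: the class name, the order-2 default, the add-one smoothing, and the use of `period=3` to keep one table per codon position are all illustrative choices.

```python
import math
from collections import defaultdict

class MarkovChain:
    """Order-k Markov chain over ACGT with add-one smoothing.

    period=3 keeps a separate table per codon position (an assumed
    simplification of frame-dependent coding models); period=1 is an
    ordinary homogeneous chain for noncoding sequence.
    """

    def __init__(self, order=2, period=1):
        self.order = order
        self.period = period
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, sequences):
        # Count each base given its preceding `order` bases and its phase.
        for seq in sequences:
            for i in range(self.order, len(seq)):
                key = (i % self.period, seq[i - self.order:i])
                self.counts[key][seq[i]] += 1

    def log_prob(self, seq):
        # Sum of smoothed log transition probabilities along the sequence.
        total = 0.0
        for i in range(self.order, len(seq)):
            key = (i % self.period, seq[i - self.order:i])
            table = self.counts.get(key, {})
            numerator = table.get(seq[i], 0) + 1
            denominator = sum(table.values()) + 4
            total += math.log(numerator / denominator)
        return total

def coding_log_odds(seq, coding_model, noncoding_model):
    """Positive values favour the coding model, negative the noncoding one."""
    return coding_model.log_prob(seq) - noncoding_model.log_prob(seq)

# Toy training data, made up for illustration only.
coding = MarkovChain(order=2, period=3)
coding.train(["ATGGCTGCTGCTGCTTAA"])
noncoding = MarkovChain(order=2, period=1)
noncoding.train(["ATATATATATATATATAT"])
print(coding_log_odds("GCTGCTGCT", coding, noncoding) > 0)
```

Because the score is computed over a whole window, a model of this kind says little about exactly where a gene starts, which matches the shortcoming noted on the slide.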

GeneMark.hmm

The probability that a functional sequence X underlies the nucleotide sequence S is calculated as $P(X \mid S) = P(x_1, x_2, \ldots, x_L \mid b_1, b_2, \ldots, b_L)$. The Viterbi algorithm then finds the functional sequence $X^*$ for which $P(X^* \mid S)$ is the largest among all possible values of X. A ribosome binding site (RBS) model was also added to improve the prediction of translational start sites.
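
The Viterbi step can be sketched on a toy two-state HMM. This is a generic textbook Viterbi in log space, not GeneMark.hmm's model: the state names (`N` for noncoding, `C` for coding) and every probability below are made-up values for illustration, and the function assumes a non-empty observation string.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state path for the observations (log space)."""
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            # Best predecessor state for being in state s at time t.
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit_p[s][observations[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Hypothetical toy parameters: the coding state is GC-rich.
states = ("N", "C")
start_p = {"N": 0.5, "C": 0.5}
trans_p = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
emit_p = {
    "N": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
    "C": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4},
}
print("".join(viterbi("AAAAGGGGGG", states, start_p, trans_p, emit_p)))
# NNNNCCCCCC
```

In the notation of the slide, the returned path plays the role of $X^*$ and the observation string plays the role of S.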

GeneMark RBS feature: overcomes the gene-boundary problem by defining a positional nucleotide frequency (%) matrix based on an alignment of 325 E. coli genes whose RBS signals have already been annotated. It uses the consensus sequence AGGAG to search upstream of any alternative start codons for genes predicted by the HMM. GeneMarkS: Considered the best gene prediction tool. Based on unsupervised learning. Even in prokaryotic genomes, gene overlaps are quite common.
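
The upstream consensus search can be sketched as a simple mismatch count against AGGAG. This is only an illustration of the idea, not the GeneMark RBS model, which the slide describes as a frequency matrix; the function name and the 20-base `window` are assumptions.

```python
def best_rbs_match(seq, start, consensus="AGGAG", window=20):
    """Return (matches, position) of the best consensus match upstream of `start`.

    `start` is the index of the first base of a candidate start codon.
    Only sites lying entirely within `window` bases upstream are considered.
    Returns (0, None) when no position matches the consensus at any base.
    """
    upstream_begin = max(0, start - window)
    best = (0, None)
    for pos in range(upstream_begin, start - len(consensus) + 1):
        site = seq[pos:pos + len(consensus)]
        matches = sum(a == b for a, b in zip(site, consensus))
        if matches > best[0]:
            best = (matches, pos)
    return best

# Toy sequence: AGGAG at index 4, candidate start codon ATG at index 15.
print(best_rbs_match("TTTTAGGAGTTTTTTATGAAA", 15))
# (5, 4)
```

Running the same check for each alternative start codon of a predicted gene gives one simple way to rank the candidates.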

GLIMMER: Used interpolated Markov models (IMMs) for the first time. Predictions are based on variable context (oligomers of variable lengths), which is more flexible than fixed-order Markov models. Principle: an IMM combines probabilities based on the 0, 1, ..., k previous bases; here k = 8. The highest order is used only for oligomers that occur frequently; for rarely occurring oligomers, 5th order or lower may be used. Maintained by Steven Salzberg and Art Delcher at the University of Maryland, College Park.
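
The interpolation principle can be sketched as follows. The recursion blends the estimate from each context length with the estimate from the next shorter one, so frequent contexts lean on long oligomers and rare ones fall back to short ones. The weighting rule here (`total / (total + pseudo)`) is an assumption made for illustration and is not Glimmer's actual weighting scheme; the class and parameter names are hypothetical.

```python
from collections import defaultdict

class InterpolatedMarkovModel:
    """Toy interpolated Markov model over ACGT."""

    def __init__(self, max_order=8, pseudo=10.0):
        self.max_order = max_order
        self.pseudo = pseudo  # assumed smoothing constant, not Glimmer's rule
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, sequences):
        # Count each base after every context of length 0..max_order.
        for seq in sequences:
            for i in range(len(seq)):
                for k in range(min(i, self.max_order) + 1):
                    self.counts[seq[i - k:i]][seq[i]] += 1

    def prob(self, context, base):
        """Interpolated probability of `base` following `context`."""
        context = context[-self.max_order:] if self.max_order else ""
        p = 0.25  # uniform prior over the four bases
        for k in range(len(context) + 1):
            ctx = context[len(context) - k:]
            table = self.counts.get(ctx)
            if not table:
                break  # longer contexts were never observed either
            total = sum(table.values())
            weight = total / (total + self.pseudo)
            p = weight * table.get(base, 0) / total + (1 - weight) * p
        return p

model = InterpolatedMarkovModel(max_order=8)
model.train(["ACGACGACGACG"])  # made-up training sequence
print(model.prob("AC", "G") > model.prob("AC", "T"))
```

For a context that never occurred in training, such as `"TT"` here, the loop stops early and the result equals the order-0 estimate.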

Glimmer development. Glimmer 2 (1999): increased the sensitivity of prediction by adding the concept of the ICM (interpolated context model). Glimmer 3 (2007): overcomes the shortcomings of previous versions by taking into account the sum of the RBS score, the IMM coding potential, and a start codon score that depends on the relative frequency of each possible start codon in the same training set used for RBS determination. The algorithm scores every ORF (open reading frame) with the IMM in reverse, from the stop codon to the start codon; the score is the sum of the log-likelihoods of the bases contained in the ORF.

Glimmer3.02

PRODIGAL: Prokaryotic Dynamic Programming Gene-finding Algorithm. Developed at Oak Ridge National Laboratory and the University of Tennessee.

PRODIGAL-Features

EasyGene: Developed at the University of Copenhagen. Statistical significance is the measure for gene prediction.

Comparison of Different Tools

Gene Prediction: RNA gene prediction

RNA Gene Prediction

Why Predict RNA?

Regulatory sRNA

sRNA Challenges

Fundamental Methodology

RFAM

What Is Covariance? Fig: Christian Weile et al. BMC Genomics (2007) 8:244

Noncomparative Prediction Fig: James A. Goodrich & Jennifer F. Kugel, Nature Rev. Mol. Cell Biol. (2006) 7:612

Noncomparative Prediction *Rolf Backofen & Wolfgang R. Hess, RNA Biol. (2010) 7:1

Comparative + Noncomparative: effective sRNA prediction in V. cholerae (non-enterobacteria). sRNAPredict2: 32 novel sRNAs predicted, 9 tested, 6 confirmed. Jonathan Livny et al. Nucleic Acids Res. (2005) 33:4096

Software *Rolf Backofen & Wolfgang R. Hess, RNA Biol. (2010) 7:1 Eva K. Freyhult et al. Genome Res. (2007) 17:117

Gene Prediction: Modification and finishing

Modification & Finishing: consensus strategy to integrate ab initio results; broken gene recruiting; TIS (translation initiation site) correcting; IS (insertion sequence) calling; operon annotating; gene presence/absence analysis.

Modification & Finishing: consensus strategy and broken gene recruiting. [Flowchart; recoverable labels: ab initio results, consensus strategy, pass, fail, homology search, candidate fragments, broken gene recruiting]
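
One possible reading of the consensus step is a vote across the ab initio tools, with predictions that gather enough votes passing and the rest failing. The sketch below is a guess at that idea, not the group's actual pipeline: the function name, the `min_votes` threshold, and the choice to match predictions on strand and stop coordinate are all assumptions.

```python
from collections import defaultdict

def consensus_genes(predictions, min_votes=2):
    """Split predictions into passed and failed by how many tools agree.

    predictions: dict mapping tool name -> list of (strand, start, stop).
    Predictions are matched on (strand, stop); this matching key is an
    assumption, chosen because tools often disagree on the start only.
    Each returned record is (strand, stop, sorted list of supporting tools).
    """
    votes = defaultdict(dict)
    for tool, genes in predictions.items():
        for strand, start, stop in genes:
            votes[(strand, stop)][tool] = start
    passed, failed = [], []
    for (strand, stop), starts in sorted(votes.items()):
        record = (strand, stop, sorted(starts))
        (passed if len(starts) >= min_votes else failed).append(record)
    return passed, failed

# Hypothetical tool names and coordinates.
example = {
    "toolA": [("+", 10, 100), ("-", 300, 200)],
    "toolB": [("+", 13, 100)],
    "toolC": [("+", 10, 100), ("+", 500, 650)],
}
passed, failed = consensus_genes(example)
print(passed)
# [('+', 100, ['toolA', 'toolB', 'toolC'])]
```

The `failed` list would then be the natural input for a follow-up check such as the homology search shown in the flowchart.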

Modification & Finishing: TIS correcting. Start codon redundancy: ATG, GTG, TTG, CTG. Markov iteration; experimentally verified data. Leaderless genes.
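
Start codon redundancy means a single ORF can offer several possible translation initiation sites. A small sketch, with a hypothetical function name, that lists every in-frame candidate between the previous in-frame stop and the gene's own stop:

```python
START_CODONS = ("ATG", "GTG", "TTG", "CTG")
STOP_CODONS = ("TAA", "TAG", "TGA")

def candidate_starts(seq, stop_pos):
    """List (position, codon) for in-frame start codons upstream of a stop.

    stop_pos is the index of the first base of the stop codon. The scan
    walks upstream codon by codon and ends at the previous in-frame stop
    or at the beginning of the sequence.
    """
    candidates = []
    for pos in range(stop_pos - 3, -1, -3):
        codon = seq[pos:pos + 3]
        if codon in STOP_CODONS:
            break
        if codon in START_CODONS:
            candidates.append((pos, codon))
    return candidates[::-1]  # report in 5' to 3' order

# Toy sequence: TAA GTG AAA ATG CCC TTG GGG TAA
print(candidate_starts("TAAGTGAAAATGCCCTTGGGGTAA", 21))
# [(3, 'GTG'), (9, 'ATG'), (15, 'TTG')]
```

Choosing among these candidates is the TIS correction problem itself; signals such as an upstream RBS can help, although leaderless genes lack that signal.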

Modification & Finishing: IS calling; operon annotating. IS Finder DB.

Modification & Finishing: gene presence/absence analysis

Gene Prediction: Project schema

Schema (proposed) assembly group