Gene Finding BIO337 Systems Biology / Bioinformatics – Spring 2014 Edward Marcotte, Univ of Texas at Austin Edward Marcotte/Univ. of Texas/BIO337/Spring.

Slides:



Advertisements
Similar presentations
Gene Prediction: Similarity-Based Approaches
Advertisements

Application to find Eukaryotic Open reading frames. Lab.
GS 540 week 5. What discussion topics would you like? Past topics: General programming tips C/C++ tips and standard library BLAST Frequentist vs. Bayesian.
© Wiley Publishing All Rights Reserved. Using Nucleotide Sequence Databases.
Genomics: READING genome sequences ASSEMBLY of the sequence ANNOTATION of the sequence carry out dideoxy sequencing connect seqs. to make whole chromosomes.
Ab initio gene prediction Genome 559, Winter 2011.
Edward Marcotte/Univ. of Texas/BIO337/Spring 2014
Profiles for Sequences
Gene Finding BCH364C/391L Systems Biology / Bioinformatics – Spring 2015 Edward Marcotte, Univ of Texas at Austin Edward Marcotte/Univ. of Texas/BCH364C-391L/Spring.
McPromoter – an ancient tool to predict transcription start sites
Finding Eukaryotic Open reading frames.
CISC667, F05, Lec18, Liao1 CISC 467/667 Intro to Bioinformatics (Fall 2005) Gene Prediction and Regulation.
Gene prediction and HMM Computational Genomics 2005/6 Lecture 9b Slides taken from (and rapidly mixed) Larry Hunter, Tom Madej, William Stafford Noble,
© 2006 W.W. Norton & Company, Inc. DISCOVER BIOLOGY 3/e
Comparative ab initio prediction of gene structures using pair HMMs
Eukaryotic Gene Finding
Markov models and applications Sushmita Roy BMI/CS 576 Oct 7 th, 2014.
Lecture 12 Splicing and gene prediction in eukaryotes
Eukaryotic Gene Finding
Gene Finding Genome Annotation. Gene finding is a cornerstone of genomic analysis Genome content and organization Differential expression analysis Epigenomics.
Biological Motivation Gene Finding in Eukaryotic Genomes
Gene Structure and Identification
Chapter 6 Gene Prediction: Finding Genes in the Human Genome.
Gene Structure and Function
International Livestock Research Institute, Nairobi, Kenya. Introduction to Bioinformatics: NOV David Lynn (M.Sc., Ph.D.) Trinity College Dublin.
Applications of HMMs Yves Moreau Overview Profile HMMs Estimation Database search Alignment Gene finding Elements of gene prediction Prokaryotes.
Comparative Genomics & Annotation The Foundation of Comparative Genomics The main methodological tasks of CG Annotation: Protein Gene Finding RNA Structure.
Transcription Transcription is the synthesis of mRNA from a section of DNA. Transcription of a gene starts from a region of DNA known as the promoter.
Gene finding with GeneMark.HMM (Lukashin & Borodovsky, 1997 ) CS 466 Saurabh Sinha.
© 2012 Pearson Education, Inc. Lecture by Edward J. Zalisko PowerPoint Lectures for Campbell Biology: Concepts & Connections, Seventh Edition Reece, Taylor,
Genome Organization and Evolution. Assignment For 2/24/04 Read: Lesk, Chapter 2 Exercises 2.1, 2.5, 2.7, p 110 Problem 2.2, p 112 Weblems 2.4, 2.7, pp.
= stochastic, generative models
DNA sequencing. Dideoxy analogs of normal nucleotide triphosphates (ddNTP) cause premature termination of a growing chain of nucleotides. ACAGTCGATTG ACAddG.
Pattern Matching Rhys Price Jones Anne R. Haake. What is pattern matching? Pattern matching is the procedure of scanning a nucleic acid or protein sequence.
The Havana-Gencode annotation GENCODE CONSORTIUM.
Mark D. Adams Dept. of Genetics 9/10/04
Comp. Genomics Recitation 9 11/3/06 Gene finding using HMMs & Conservation.
From Genomes to Genes Rui Alves.
Interpolated Markov Models for Gene Finding BMI/CS 776 Mark Craven February 2002.
Gene, Proteins, and Genetic Code. Protein Synthesis in a Cell.
PROTEIN SYNTHESIS HOW GENES ARE EXPRESSED. BEADLE AND TATUM-1930’S One Gene-One Enzyme Hypothesis.
Genes and Genomes. Genome On Line Database (GOLD) 243 Published complete genomes 536 Prokaryotic ongoing genomes 434 Eukaryotic ongoing genomes December.
How can we find genes? Search for them Look them up.
JIGSAW: a better way to combine predictions J.E. Allen, W.H. Majoros, M. Pertea, and S.L. Salzberg. JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the.
Applications of HMMs in Computational Biology BMI/CS 576 Colin Dewey Fall 2010.
Gene Structure and Identification III BIO520 BioinformaticsJim Lund Previous reading: 1.3, , 10.4,
DNA, chromosomes, genes What is a gene? Triplet code? Compare prokaryotic and eukaryotic DNA.
(H)MMs in gene prediction and similarity searches.
Finding genes in the genome
CFE Higher Biology DNA and the Genome Transcription.
BIOINFORMATICS Ayesha M. Khan Spring 2013 Lec-8.
A knowledge-based approach to integrated genome annotation Michael Brent Washington University.
Unit-II Synthetic Biology: Protein Synthesis Synthetic Biology is - A) the design and construction of new biological parts, devices, and systems, and B)
Gene Expression : Transcription and Translation 3.4 & 7.3.
Ch. 11: DNA Replication, Transcription, & Translation Mrs. Geist Biology, Fall Swansboro High School.
Biological Motivation Gene Finding in Eukaryotic Genomes Rhys Price Jones Anne R. Haake.
1 Gene Finding. 2 “The Central Dogma” TranscriptionTranslation RNA Protein.
bacteria and eukaryotes
”Gene Finding in Eukaryotic Genomes”
What is a Hidden Markov Model?
Protein synthesis DNA is the genetic code for all life. DNA literally holds the instructions that make all life possible. Even so, DNA does not directly.
Interpolated Markov Models for Gene Finding
Gene Expression : Transcription and Translation
Eukaryotic Gene Finding
Ab initio gene prediction
Recitation 7 2/4/09 PSSMs+Gene finding
Introduction to Bioinformatics II
Generalizations of Markov model to characterize biological sequences
Reading Frames and ORF’s
From DNA to Protein Class 4 02/11/04 RBIO-0002-U1.
Presentation transcript:

Gene Finding BIO337 Systems Biology / Bioinformatics – Spring 2014 Edward Marcotte, Univ of Texas at Austin Edward Marcotte/Univ. of Texas/BIO337/Spring 2014

Lots of genes in every genome Edward Marcotte/Univ. of Texas/BIO337/Spring 2014 Nature Reviews Genetics 13: (2012) Do humans really have the biggest genomes?

Lots of genes in every genome Edward Marcotte/Univ. of Texas/BIO337/Spring 2014 animal 6.5 Gb (2X human) 17K genes gigabases (not gigabytes)

GATCACTTGATAAATGGGCTGAAGTAACTCGCCCAGATGAGGAGTGTGCTGCCTCCAGAAT CCAAACAGGCCCACTAGGCCCGAGACACCTTGTCTCAGATGAAACTTTGGACTCGGAATT TTGAGTTAATGCCGGAATGAGTTCAGACTTTGGGGGACTGTTGGGAAGGCATGATTGGTT TCAAAATGTGAGAAGGACATGAGATTTGGGAGGGGCTGGGGGCAGAATGATATAGTTTG GCTCTGCGTCCCCACCCAATCTCATGTCAAATTGTAATCCTCATGTGTCAGGGGAGAGGCCT GGTGGGATGTGATTGGATCATGGGAGTGGATTTCCCTCTTGCAGTTCTCGTGATAGTGAGT GAGTTCTCACGAGATCTGGTTGTTTGAAAGTGTGCAGCTCCTCCCCCTTCGCGCTCTCTCTC TCCCCTGCTCCACCATGGTGAGACGTGCTTGCGTCCCCTTTGCCTTCTGCCATGATTGTAAG CTTCCTCAGGCGTCCTAGCCACGCTTCCTGTACAGCCTGAGGAACTGGGAGTCAATGAAA CCTCTTCTCTTCATAAATTACCCAGTTTCAGGTAGTTCTTTCTAGCAGTGTGATAATGGACGA TACAAGTAGAGACTGAGATCAATAGCATTTGCACTGGGCCTGGAACACACTGTTAAGAAC GTAAGAGCTATTGCTGTCATTAGTAATATTCTGTATTATTGGCAACATCATCACAATACACTGC TGTGGGAGGGTCTGAGATACTTCTTTGCAGACTCCAATATTTGTCAAAACATAAAATCAGG AGCCTCATGAATAGTGTTTAAATTTTTACATAATAATACATTGCACCATTTGGTATATGAGTCT TTTTGAAATGGTATATGCAGGACGGTTTCCTAATATACAGAATCAGGTACACCTCCTCTTCCA TCAGTGCGTGAGTGTGAGGGATTGAATTCCTCTGGTTAGGAGTTAGCTGGCTGGGGGTTC TACTGCTGTTGTTACCCACAGTGCACCTCAGACTCACGTTTCTCCAGCAATGAGCTCCTGTT CCCTGCACTTAGAGAAGTCAGCCCGGGGACCAGACGGTTCTCTCCTCTTGCCTGCTCCAG CCTTGGCCTTCAGCAGTCTGGATGCCTATGACACAGAGGGCATCCTCCCCAAGCCCTGGTC CTTCTGTGAGTGGTGAGTTGCTGTTAATCCAAAAGGACAGGTGAAAACATGAAAGCC… Where are the genes? How can we find them? Edward Marcotte/Univ. of Texas/BIO337/Spring 2014

Edward Marcotte/Univ. of Texas/BIO337/Spring 2014cc A toy HMM for 5′ splice site recognition (from Sean Eddy’s NBT primer linked on the course web page) Remember this?

What elements should we build into an HMM to find bacterial genes? Edward Marcotte/Univ. of Texas/BIO337/Spring 2014 Let’s start with prokaryotic genes

Edward Marcotte/Univ. of Texas/BIO337/Spring 2014 Can be polycistronic:

A CpG island model might look like: p(C  G) is higher A C T G A C T G p(C  G) is lower P( X | CpG island) P( X | not CpG island) CpG island model Not CpG island model Could calculate (or log ratio) along a sliding window, just like the fair/biased coin test ( of course, need the parameters, but maybe these are the most important….) Edward Marcotte/Univ. of Texas/BIO337/Spring 2014 Remember this?

One way to build a minimal gene finding Markov model Transition probabilities reflect codons A C T G A C T G Transition probabilities reflect intergenic DNA P( X | coding) P( X | not coding) Coding DNA model Intergenic DNA model Could calculate (or log ratio) along a sliding window, just like the fair/biased coin test Edward Marcotte/Univ. of Texas/BIO337/Spring 2014

Really, we’ll want to detect codons. The usual trick is to use a higher-order Markov process. A standard Markov process only considers the current position in calculating transition probabilities. An n th -order Markov process takes into account the past n nucleotides, e.g. as for a 5 th order: Image from Curr Op Struct Biol 8: (1998) Codon 1Codon 2 Edward Marcotte/Univ. of Texas/BIO337/Spring 2014

5 th order Markov chain, using models of coding vs. non-coding using the classic algorithm GenMark 1 st reading frame 2 nd reading frame 3 rd reading frame 1 st reading frame 2 nd reading frame 3 rd reading frame Direct strand Complementary (reverse) strand Edward Marcotte/Univ. of Texas/BIO337/Spring 2014

An HMM version of GenMark

Edward Marcotte/Univ. of Texas/BIO337/Spring 2014 For example, accounting for variation in start codons…

Length distributions (in # of nucleotides) Coding (ORFs)Non-coding (intergenic) Edward Marcotte/Univ. of Texas/BIO337/Spring 2014 … and variation in gene lengths

Coding (ORFs) Non-coding (intergenic) Edward Marcotte/Univ. of Texas/BIO337/Spring 2014 (Placing these curves on top of each other) Long ORFS tend to be real protein coding genes Short ORFS occur often by chance Protein-coding genes <100 aa’s are hard to find

Model for a ribosome binding site (based on ~300 known RBS’s) Edward Marcotte/Univ. of Texas/BIO337/Spring 2014

How well does it do on well-characterized genomes? Edward Marcotte/Univ. of Texas/BIO337/Spring 2014 But this was a long time ago!

What elements should we build into an HMM to find eukaryotic genes? Edward Marcotte/Univ. of Texas/BIO337/Spring 2014 Eukaryotic genes

Edward Marcotte/Univ. of Texas/BIO337/Spring

We’ll look at the GenScan eukaryotic gene annotation model: Edward Marcotte/Univ. of Texas/BIO337/Spring 2014

We’ll look at the GenScan eukaryotic gene annotation model: Edward Marcotte/Univ. of Texas/BIO337/Spring 2014 Zoomed in on the forward strand model…

Introns Initial exons Internal exons Terminal exons Introns and different flavors of exons all have different typical lengths Edward Marcotte/Univ. of Texas/BIO337/Spring 2014

Taking into account donor splice sites Edward Marcotte/Univ. of Texas/BIO337/Spring 2014

An example of an annotated gene… Edward Marcotte/Univ. of Texas/BIO337/Spring 2014

Nature Reviews Genetics 13: (2012) How well do these programs work? We can measure how well an algorithm works using these: Algorithm predicts: True answer: PositiveNegative Positive Negative True positive False positive False negative True negative Specificity = TP / (TP + FP) Sensitivity = TP / (TP + FN)

Edward Marcotte/Univ. of Texas/BIO337/Spring 2014 Nature Reviews Genetics 13: (2012) How well do these programs work? How good are our current gene models?

Edward Marcotte/Univ. of Texas/BIO337/Spring 2014 GENSCAN, when it was first developed…. Accuracy per base Accuracy per exon

Edward Marcotte/Univ. of Texas/BIO337/Spring 2014 Nature Reviews Genetics 13: (2012) In general, we can do better with more data, such as mRNA and conservation

Edward Marcotte/Univ. of Texas/BIO337/Spring 2014 How well do we know the genes now? In the year 2000 = scientists from around the world held a contest (“GASP”) to predict genes in part of the fly genome, then compare them to experimentally determined “truth” Genome Research 10:483–501 (2000)

Edward Marcotte/Univ. of Texas/BIO337/Spring 2014 How well do we know the genes now? In the year 2000 Genome Research 10:483–501 (2000) “Over 95% of the coding nucleotides … were correctly identified by the majority of the gene finders.” “…the correct intron/exon structures were predicted for >40% of the genes.” Most promoters were missed; many were wrong. “Integrating gene finding and cDNA/EST alignments with promoter predictions decreases the number of false-positive classifications but discovers less than one-third of the promoters in the region.”

Edward Marcotte/Univ. of Texas/BIO337/Spring 2014 How well do we know the genes now? In the year 2006 = scientists from around the world held a contest (“EGASP”) to predict genes in part of the human genome, then compare them to experimentally determined “truth” 18 groups 36 programs We discussed these earlier

Edward Marcotte/Univ. of Texas/BIO337/Spring 2014

Transcripts vs. genes

Edward Marcotte/Univ. of Texas/BIO337/Spring 2014 So how did they do? In the year 2006 “The best methods had at least one gene transcript correctly predicted for close to 70% of the annotated genes.” “…taking into account alternative splicing, … only approximately 40% to 50% accuracy. At the coding nucleotide level, the best programs reached an accuracy of 90% in both sensitivity and specificity.”

Edward Marcotte/Univ. of Texas/BIO337/Spring 2014 At the gene level, most genes have errors In the year 2006

Edward Marcotte/Univ. of Texas/BIO337/Spring 2014 How well do we know the genes now? In the year 2008 = scientists from around the world held a contest (“NGASP”) to predict genes in part of the worm genome, then compare them to experimentally determined “truth” 17 groups from around the world competed “Median gene level sensitivity … was 78%” “their specificity was 42%”, comparable to human

Edward Marcotte/Univ. of Texas/BIO337/Spring 2014 For example: In the year 2008 Confirmed ????

Edward Marcotte/Univ. of Texas/BIO337/Spring 2014 How well do we know the genes now? 2012 In the year 2012 = a large consortium of scientists trying to annotate the human genome using a combination of experiment and prediction. Best estimate of the current state of human genes.

Edward Marcotte/Univ. of Texas/BIO337/Spring 2014 How well do we know the genes now? 2012 In the year 2012 Quality of evidence used to support automatic, manually, and merged annotated transcripts (probably reflective of transcript quality) 23,855 transcripts 89,669 transcripts 22,535 transcripts

Edward Marcotte/Univ. of Texas/BIO337/Spring 2014 How well do we know the genes now? 2014 In the year 2014 The bottom line: Gene prediction and annotation are hard Annotations for all organisms are still buggy Few genes are 100% correct; expect multiple errors per gene Most organisms’ gene annotations are probably much worse than for humans

Edward Marcotte/Univ. of Texas/BIO337/Spring 2014 The Univ of California Santa Cruz genome browser

Edward Marcotte/Univ. of Texas/BIO337/Spring 2014 The Univ of California Santa Cruz genome browser