Gene discovery using combined signals from genome sequence and natural selection Michael Brent Washington University The mouse genome analysis group
GENSIPS10/7/ Genes are read out via mRNA & processing
GENSIPS10/7/ RNA Processing
GENSIPS10/7/ A typical human gene structure
GENSIPS10/7/ In a mammalian genome Finding all the genes is hard Mammalian genomes are large –5,051 miles of 10pt type –Raleigh to Tripoli, Libya Only about 1.5% protein coding –Raleigh to Winston-Salem
GENSIPS10/7/ Genes are fairly unconstrained Intron length is highly variable ~5% are nt long ~3% are longer than 30,000 nt Distance between genes is highly variable From 10^3 to 10^6 nt or more (probably)
GENSIPS10/7/ Exons per gene (RefSeq)
GENSIPS10/7/ Background is not random Segmental duplications Entire regions duplicate, then diverge slowly Processed pseudogenes Spliced transcripts integrate back into the genome –Sequence is similar to source genes –Generally not functional
GENSIPS10/7/ Gene prediction: two approaches 1. Transcript-based (e.g., GeneWise) A. Map experimentally determined sequences of spliced transcripts to their genomic source B. Map transcript sequences to genomic regions that could produce similar transcripts 2. De novo (genome only) Model DNA patterns characteristic of gene components –Splice donor and acceptor –Protein coding sequence –Translation start and stop
GENSIPS10/7/ Advantages and disadvantages Transcript-based Advantage: conservative –Evidence of transcription for every exon Disadvantage: conservative –Can’t find “truly novel” genes Still subject to error
GENSIPS10/7/ Advantages and disadvantages De novo Advantage 1: Less biased toward –Known transcripts –Transcripts that can be sequenced easily Advantage 2: Genome sequencing is easy Disadvantages –No direct evidence of transcription –Presumably, more false positives
GENSIPS10/7/ Single-genome de novo: Genscan Strengths For mammalian sequence, one of the best single-genome, de novo gene predictors Widely used to great practical advantage De facto standard for mammalian sequence Limitations Predicts >45K genes (best est.: 25-30K) Predicts >315K exons (best est. 200K-250K) Gets only 9% of known genes exactly right*
GENSIPS10/7/ Dual genome de novo We developed algorithms that use two genomes to Reduce the number of false positives Refine the details of the structures
GENSIPS10/7/ Single-genome de novo method Probability model Assigns probability to annotated DNA sequences: 5’TAGCCTACTGAAATGGACCGCTTCAGCGTGGTAT3’ (figure labels: Exon, 5’ UTR, Intron) Optimization algorithm Given a DNA sequence, find the most probable annotation, according to the model
GENSIPS10/7/ CCATGGCGTCTTCAGGCAGTGACTC Genscan’s generative model Intron Exon Intron
GENSIPS10/7/ Genscan’s generative model Generalized HMM States correspond to gene features Model generates DNA sequence by passing through states The probability of annotated DNA sequence is the probability of –generating the DNA sequence –by passing through states corresponding to the annotation.
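Finding the most probable annotation can be illustrated with a plain HMM and the Viterbi algorithm. This is a minimal sketch, not Genscan's model: a generalized HMM also models feature lengths explicitly, and the two states and every probability below are invented for illustration.

```python
import math

# Hypothetical two-state model; all numbers are made up for illustration.
STATES = ("exon", "intron")
START = {"exon": 0.5, "intron": 0.5}
TRANS = {
    "exon": {"exon": 0.9, "intron": 0.1},
    "intron": {"exon": 0.1, "intron": 0.9},
}
EMIT = {
    "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def viterbi(seq):
    """Return (log probability, state path) of the most probable annotation."""
    # best[s] is the best log probability of any path that ends in state s.
    best = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_best, pointers = {}, {}
        for s in STATES:
            # Pick the predecessor state that maximizes the path probability.
            prev = max(STATES, key=lambda p: best[p] + math.log(TRANS[p][s]))
            new_best[s] = (best[prev] + math.log(TRANS[prev][s])
                           + math.log(EMIT[s][base]))
            pointers[s] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final state to recover the annotation.
    last = max(STATES, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return best[last], path
```

With these toy parameters, a GC-rich stretch followed by an AT-rich stretch is labeled exon and then intron, because the gain from matching emissions outweighs the one-time cost of switching states.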
GENSIPS10/7/ Dual genome prediction Input Target and informant genomes Idea Patterns of evolution since the last common ancestor may reveal gene structure
GENSIPS10/7/ Two conservation signals 1. Local alignment signal Selective pressures differ by feature This leaves a characteristic signature 2. Structural signal Locations of introns tend to be conserved
GENSIPS10/7/ Characteristic local alignments
Coding exon
human TTATCCACCAGACCAGATAGATACTTGTCTGCCACCCTC
      |||||||||||||||||||| || ||||| || || |||
mouse TTATCCACCAGACCAGATAGGTATTTGTCAGCTACTCTC
Intron (non-coding)
human CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
      ||||||     ||  | |||||||||  || ||   ||
mouse CTAGAGC----AAGAAGACAGGTACCATAGGGCTCTCCT
GENSIPS10/7/ Conservation of intron location
GENSIPS10/7/ Align→predict→filter→test Pipeline (figure): WU-BLAST alignment → TWINSCAN prediction → Aligned Intron Filter → Validation (RT-PCR)
GENSIPS10/7/ TWINSCAN Representation change: alignment (TCTGCCACC, || || ||, TCAGCTACT) → target plus conservation sequence (TCTGCCACC, ||:||:||) gHMM decoding
GENSIPS10/7/ BLAST Alignments Target Informant
GENSIPS10/7/ Projecting BLAST Alignments Target Informant
GENSIPS10/7/ Conservation sequence CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC Synthetic (projected) local alignment human mouse |||||| | ||||||||| || || || CTAGAG AGACAGGTACCATAGGGCTCTCCT Pair each nucleotide of the target with “|” if it is aligned and identical
GENSIPS10/7/ Conservation sequence CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC Synthetic (projected) local alignment human mouse |||||| |:|||||||||::||:|| ||: CTAGAG AGACAGGTACCATAGGGCTCTCCT Pair each nucleotide of the target with “|” if it is aligned and identical “:” if it is aligned to mismatch or gap
GENSIPS10/7/ Conservation sequence CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC Synthetic (projected) local alignment human mouse |||||| |:|||||||||::||:|| ||: CTAGAG AGACAGGTACCATAGGGCTCTCCT Pair each nucleotide of the target with “|” if it is aligned and identical “:” if it is aligned to mismatch or gap “.” if it is unaligned
GENSIPS10/7/ Conservation sequence CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC Conservation sequence human |||||| |:|||||||||::||:|| ||: Pair each nucleotide of the target with “|” if it is aligned and identical “:” if it is aligned to mismatch or gap “.” if it is unaligned
GENSIPS10/7/ Conservation sequence CTAGAGATGCAAAAGAAACAGGTACCGCAGTGCCCC Conservation sequence human |||||| |:|||||||||::||:||||: Pair each nucleotide of the target with “|” if it is aligned and identical “:” if it is aligned to mismatch or gap “.” if it is unaligned
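The three pairing rules above can be sketched in a few lines. This is an illustration only, not Twinscan's code: the function name and the input format (a list of projected local alignments, each given as a 0-based start in the ungapped target plus the two gapped alignment rows) are assumptions made for the example.

```python
def conservation_sequence(target, blocks):
    """Build the conservation sequence for `target`.

    `blocks` is a list of (start, target_row, informant_row) tuples: a
    0-based start position in the ungapped target plus the two gapped
    rows of one local alignment. This input format is an assumption.
    """
    cons = ["."] * len(target)           # default: unaligned
    for start, target_row, informant_row in blocks:
        pos = start
        for t, i in zip(target_row, informant_row):
            if t == "-":                 # gap in target: no target base to label
                continue
            # Identical bases get "|"; mismatches and informant gaps get ":".
            cons[pos] = "|" if t == i else ":"
            pos += 1
    return "".join(cons)


# The human/mouse example from the slides, as two projected local alignments.
target = "CTAGAGATGCAAAAGAAACAGGTACCGCAGTGCCCC"
blocks = [
    (0, "CTAGAG", "CTAGAG"),
    (15, "AAACAGGTACCGCAGTGC---CCC", "AGACAGGTACCATAGGGCTCTCCT"),
]
print(conservation_sequence(target, blocks))
```

For this input the result is `||||||.........|:|||||||||::||:||||:`, one symbol per target nucleotide, with the nine unaligned bases between the two alignments marked ".".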
GENSIPS10/7/ Twinscan: Extending the model Probability model Assigns probability to annotated DNA: 5’TAGCCTACTGAAATGGACCGCTTCAGCGTGGTAT3’ ||| |:||||:|||||||||:||::|| (figure labels: Exon, 5’ UTR, Intron) Optimization Given DNA and conservation sequence, find the most probable annotation, according to the model
GENSIPS10/7/ Twinscan Each state “generates” DNA and conservation sequence independently Probability of annotated DNA and conservation sequence is the probability of generating the DNA and conservation sequence by passing through the corresponding states
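Independent generation means the joint probability factors into a DNA term and a conservation term at each position. The sketch below scores one annotation under a toy per-base HMM. It is a simplification under stated assumptions: the real model is a generalized HMM, and the states, the function name and all probabilities here are hypothetical.

```python
import math

# Hypothetical parameters, invented for illustration only.
START = {"exon": 0.5, "intron": 0.5}
TRANS = {
    "exon": {"exon": 0.9, "intron": 0.1},
    "intron": {"exon": 0.1, "intron": 0.9},
}
DNA_EMIT = {
    "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}
CONS_EMIT = {
    "exon": {"|": 0.8, ":": 0.15, ".": 0.05},
    "intron": {"|": 0.4, ":": 0.3, ".": 0.3},
}


def joint_log_prob(dna, cons, path):
    """Log probability of generating `dna` and `cons` along the state `path`."""
    logp = math.log(START[path[0]])
    for k, state in enumerate(path):
        if k > 0:
            logp += math.log(TRANS[path[k - 1]][state])
        # DNA and conservation symbols are emitted independently given the
        # state, so their log probabilities simply add.
        logp += math.log(DNA_EMIT[state][dna[k]])
        logp += math.log(CONS_EMIT[state][cons[k]])
    return logp
```

Under these made-up numbers, a fully conserved stretch such as `cons = "|||"` scores higher when annotated as exon than as intron, which is the effect the conservation sequence is meant to contribute.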
GENSIPS10/7/ Performance Evaluation RefSeq A set of ~13,000 “Known” mRNAs Represents ~40-50% of human genes –Usually, only one of several splice variants Mapping to genome is imperfect Best available gold standard
GENSIPS10/7/ Short term goal All multi-exon human genes Predict accurately –Integrate information from more genomes Verify at least one intron experimentally Follow up with full-length verification
GENSIPS10/7/ Acknowledgments Funding agencies National Institutes of Health (NHGRI) National Science Foundation (DBI) Sequencing centers Sanger, Whitehead, Wash. U. My group Ian Korf, Paul Flicek, Evan Keibler, Ping Hu Collaborators Roderic Guigo, Josep Abril, Genis Parra –Pankaj Agarwal Stylianos Antonarakis, Alexandre Reymond, Manolis Dermitzakis
GENSIPS10/7/ Other clades Plants Arabidopsis thaliana, cabbage, rice Nematodes C. elegans, C. briggsae Fungi Cryptococcus neoformans (JEC21, H99)
GENSIPS10/7/ Pair HMM algorithms (SLAM,…) Input is orthologous sequences. Aligns and predicts simultaneously, using a joint probability model Predicts orthologous genes in 2 sequences All predicted CDS is aligned Some aligned regions are not predicted CDS –Labeled conserved non-coding sequence
GENSIPS10/7/ The algorithms (SLAM,…) sgp2 Alignment before prediction (tblastx) Predicts genes in target sequence only Doesn’t need orthologous input sequences –Paralogs & low-coverage shotgun can help Modifies scores of all potential exons, by –At each base, adding the tblastx score of the best overlapping local alignment (roughly) –To the geneid score of that potential exon
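The exon rescoring idea can be sketched as follows. The slide itself says the description is rough, so this is only an illustration: the function name, the coordinate convention, the `weight` parameter and the use of one alignment score per base are all assumptions, not sgp2's actual scoring.

```python
def rescore_exon(exon_start, exon_end, base_score, hsps, weight=1.0):
    """Add conservation evidence to a candidate exon's score.

    exon_start, exon_end: 0-based, end-exclusive target coordinates.
    hsps: list of (start, end, score) local alignments projected onto the
          target, using the same coordinate convention (hypothetical format).
    """
    bonus = 0.0
    for pos in range(exon_start, exon_end):
        # Scores of every local alignment that covers this base.
        overlapping = [score for start, end, score in hsps if start <= pos < end]
        if overlapping:
            bonus += max(overlapping)    # best overlapping alignment wins
    return base_score + weight * bonus


hsps = [(0, 10, 2.0), (5, 15, 3.0)]
print(rescore_exon(3, 8, 1.0, hsps))
```

An exon with no overlapping alignment keeps its original score, so conservation can only raise a candidate relative to unaligned competitors in this sketch.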
GENSIPS10/7/ The algorithms TWINSCAN Alignment before prediction (blastn) Predicts in target sequence only Modifies scores of all potential exons, UTRs, splice sites, start and stop models, by –At each base, applying a feature-specific scoring model (estimated for this purpose) to the best overlapping local alignment –Adding the result to the Genscan score for that feature
GENSIPS10/7/ % Aligned, CDS vs. other
GENSIPS10/7/ Syntenic Gene Prediction (sgp2) Figure tracks: Query Sequence, tblastx HSPs, geneid Exons, HSP Projections, SGP Exons
GENSIPS10/7/ Why work on gene finding? Genes are Components responsible for biological function Variations cause human disease / susceptibility Controls for modifying biological function –Human gene therapy –Agriculture –Nanotechnology, etc.