Finding genes in human using the mouse
Finding genes in mouse using the human
Lior Pachter
Department of Mathematics, U.C. Berkeley
The Gene Finding Problem
[Figure: gene structure along the DNA, 5’ to 3’. Labels: Promoter (TATA); Translation Initiation (ATG); Exons 1-4 separated by Introns 1-3; Splice site (GGTGAG); Branchpoint (CTGAC); Pyrimidine tract; Splice site (CAG); Stop codon (TAG/TGA/TAA); polyA signal.]
Approaches to Gene Recognition
Naïve (mid-80s to mid-90s)
– ORFfinder, BLAST..
Statistical de novo
– Genie (96), Genscan (97), FGENESH..
Systems
– Ensembl..
“Ask not what mathematics can do for biology, ask what biology can do for mathematics” - Stanislaw Ulam
Difficulty of naïve approaches
n = number of acceptor splice sites
m = number of donor splice sites
Number of gene structures = F_{n+m+1} (Fibonacci #)
1, 1, 2, 3, 5, 8, 13, 21, 34, …
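To give a feel for how fast this count grows, here is a minimal Python sketch that evaluates F_{n+m+1} using the slide's indexing (F_1 = F_2 = 1). The function names are illustrative, not from the talk, and the sketch only evaluates the formula; it does not enumerate the gene structures themselves.

```python
def fibonacci(k: int) -> int:
    """Return F_k with F_1 = F_2 = 1, i.e. the sequence 1, 1, 2, 3, 5, 8, ..."""
    if k < 1:
        raise ValueError("k must be >= 1")
    a, b = 1, 1  # F_1, F_2
    for _ in range(k - 2):
        a, b = b, a + b
    return b if k > 1 else a


def count_gene_structures(n_acceptors: int, m_donors: int) -> int:
    """Number of gene structures as stated on the slide: F_{n+m+1}."""
    return fibonacci(n_acceptors + m_donors + 1)


if __name__ == "__main__":
    # Hypothetical site counts, chosen only to show the growth rate.
    for sites in (2, 4, 10, 20):
        print(sites, sites, count_gene_structures(sites, sites))
```

Because the count grows exponentially in n + m, exhaustive enumeration quickly becomes impractical, which is the slide's point.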
Statistical gene finding
TAATATGTCCACGGTTGTACACGGCAG GTATTG AG GTATTG AG ATGTAACTG AA
Using GHMMs for ab-initio gene finding
In practice, have observed sequence
Predict genes by estimating hidden state sequence
Usual solution: single most likely sequence of hidden states (Viterbi)
TAATATGTCCACGGTTGTACACGGCAG GTATTG AG GTATTG AG ATGTAACTG AA
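The Viterbi step can be sketched in a few lines of Python. This is a simplification: it is an ordinary HMM in which each state emits one base, not a generalized HMM with explicit state length distributions. The two states ("E" for exon-like, "I" for intron-like) and all probabilities are made-up illustrative values, not parameters from any gene finder.

```python
import math


def log_table(table):
    """Convert a dict of probabilities into a dict of log-probabilities."""
    return {key: math.log(value) for key, value in table.items()}


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return (most likely state path, its log-probability) for a plain HMM."""
    if not observations:
        return [], 0.0
    # best[i][s] = log-probability of the best path that ends in state s at position i
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]
    for obs in observations[1:]:
        prev = best[-1]
        column, pointers = {}, {}
        for s in states:
            # Best predecessor state for s at this position.
            p = max(states, key=lambda r: prev[r] + log_trans[r][s])
            column[s] = prev[p] + log_trans[p][s] + log_emit[s][obs]
            pointers[s] = p
        best.append(column)
        back.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]


# Toy model (assumed values): "E" favors G/C, "I" favors A/T, states are sticky.
STATES = ["E", "I"]
LOG_START = log_table({"E": 0.5, "I": 0.5})
LOG_TRANS = {"E": log_table({"E": 0.9, "I": 0.1}),
             "I": log_table({"E": 0.1, "I": 0.9})}
LOG_EMIT = {"E": log_table({"A": 0.05, "C": 0.45, "G": 0.45, "T": 0.05}),
            "I": log_table({"A": 0.45, "C": 0.05, "G": 0.05, "T": 0.45})}

if __name__ == "__main__":
    path, score = viterbi("GCGCATAT", STATES, LOG_START, LOG_TRANS, LOG_EMIT)
    print("".join(path), score)
```

Working in log space avoids numerical underflow on long sequences. A GHMM version would additionally maximize over segment lengths for each state.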
Results
High sensitivity / low specificity
Exon / intron length distributions
Identification of GC isochore - gene richness dependence
Splice site models
Comparative Gene Finding
Comparison of 1196 orthologous mRNAs (Makalowski et al., 1996)
Sequence identity:
– exons: 84.6%
– protein: 85.4%
– 5’ UTRs: 67%
– 3’ UTRs: 69%
27 proteins were 100% identical.
Comparison of 117 complete genes (Batzoglou/Pachter et al)
% of genes with equal number of coding exons
– Exceptions: Spermidine Synthase, Lymphotoxin Beta
73% of coding exons have equal length
95% of coding exons have length equal mod 3
Intron conservation: 35%
Intron length ratio (longer/shorter): 1.5
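As an illustration of how the two exon-length statistics are defined, the sketch below computes the fraction of orthologous exon pairs with equal length and the fraction whose lengths agree mod 3 (so the reading frame is preserved). The function name and the toy exon lengths are invented for the example; they are not data from the study.

```python
def exon_length_agreement(pairs):
    """Given (human_len, mouse_len) pairs for orthologous coding exons,
    return (fraction with equal length, fraction with equal length mod 3)."""
    if not pairs:
        raise ValueError("need at least one exon pair")
    equal = sum(1 for h, m in pairs if h == m)
    equal_mod3 = sum(1 for h, m in pairs if (h - m) % 3 == 0)
    return equal / len(pairs), equal_mod3 / len(pairs)


# Made-up lengths: two identical pairs, one differing by a whole codon,
# and one differing by a single base (which shifts the frame).
TOY_PAIRS = [(120, 120), (87, 90), (201, 201), (150, 151)]

if __name__ == "__main__":
    frac_equal, frac_mod3 = exon_length_agreement(TOY_PAIRS)
    print(frac_equal, frac_mod3)
```

Any pair of equal length also agrees mod 3, so the second fraction is always at least as large as the first, consistent with the 73% and 95% figures on the slide.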
SLAM - alignment & gene finding
Input:
– Pair of syntenic sequences (FASTA).
Output:
– CDS and CNS predictions in both sequences.
– Protein predictions.
– Protein and CNS alignment.
SLAM components
Splice site detector
– VLMM
Intron and intergenic regions
– 2nd order Markov chain
– independent geometric lengths
Coding sequence
– PHMM on protein level
– generalized length distribution
Conserved non-coding sequence
– PHMM on DNA level
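To make the intron/intergenic component concrete, here is a minimal sketch of a 2nd order Markov chain over DNA together with a geometric length distribution. It is not SLAM's code: the function names, the add-one pseudocount, and the toy training sequence are assumptions made for illustration.

```python
import math
from collections import defaultdict


def train_second_order(sequences, alphabet="ACGT", pseudocount=1.0):
    """Estimate P(base | previous two bases) from training sequences.
    Returns a function log_prob(context, base) with additive smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(2, len(seq)):
            counts[seq[i - 2:i]][seq[i]] += 1.0

    def log_prob(context, base):
        row = counts.get(context, {})
        total = sum(row.values()) + pseudocount * len(alphabet)
        return math.log((row.get(base, 0.0) + pseudocount) / total)

    return log_prob


def log_likelihood(seq, log_prob):
    """Log-probability of seq[2:] given its first two bases."""
    return sum(log_prob(seq[i - 2:i], seq[i]) for i in range(2, len(seq)))


def geometric_log_prob(length, p):
    """Log of P(L = length) = (1 - p)^(length - 1) * p, for length >= 1."""
    if length < 1 or not 0.0 < p < 1.0:
        raise ValueError("need length >= 1 and 0 < p < 1")
    return (length - 1) * math.log(1.0 - p) + math.log(p)


if __name__ == "__main__":
    # Toy training data (assumed), just to exercise the functions.
    model = train_second_order(["ATATATATAT"])
    print(log_likelihood("ATATAT", model), log_likelihood("AGCTGA", model))
    print(geometric_log_prob(500, 0.002))
```

A geometric length distribution is what a single self-looping HMM state produces implicitly. Coding exons are listed with a generalized length distribution because their lengths do not fit that shape well.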
Input:
Output:
What have we learned from comparative gene finding?
conservation is a stronger splice site indicator than consensus
intron lengths have diverged
gene structure conservation is more powerful than sequence conservation for prediction
consensus for GC splice sites
SLAM whole genome run
Align the genomes
Construct a synteny map
Chop up into SLAMable pieces
Run SLAM
Collate results
Alignment project:
Linux cluster with GHz PC, 750 MB of RAM
Three days to align the entire mouse genome against the human genome
Finding regulatory regions
Godzilla
Gene name: Enolase
- Experimentally defined enhancer (beta-enolase)
Experimental gene verification with RT-PCR
[Figure labels: predicted intron, primer]
Intron > 1000 bp
Aligning human/mouse
Exons > 60 bp
SLAM CNS data
Single exon data
Acknowledgments
Marina Alexandersson - Gothenburg, Sweden (SLAM)
Nick Bray - LBNL/UCB math (Avid alignment program)
Simon Cawley - Affymetrix (SLAM)
Olivier Couronne - LBNL (Godzilla)
Colin Dewey - Berkeley (SLAM)
Alex Poliakov - LBNL (Godzilla, VISTA)
Chuck Sugnet - UCSC (SLAM)
Inna Dubchak - LBNL
Eddy Rubin - LBNL