Multiple Species Gene Finding Sourav Chatterji

Multiple Species Gene Finding Sourav Chatterji souravc@eecs.berkeley.edu

Predicting Replication Origins in Yeast. Breier AM, Chatterji S, Cozzarelli NR. Genome Biol. 2004;5(4):R22. Comparative GeneFinding using SLAM. Paired Splice Site Detection in SLAM. Zhao X, Huang H, Speed TP. Proceedings of RECOMB 2004; 68-75. Rat Genome Sequencing Consortium. Nature. 2004 Apr 1;428(6982):493-521. Multiple Species GeneFinding. Chatterji S, Pachter L. Proceedings of RECOMB 2004; 187-193. Evidence Based GeneFinding - Work in Progress.

State of the Genomes
  - 5 roundworm genomes: C. elegans and C. briggsae completed; 3 other worm genomes in progress.
  - 11 fruitfly genomes: D. melanogaster completed; sequencing of 7 genomes in progress; 3 more genomes in the pipeline.
  - 18 mammalian genomes: human, mouse, and rat genomes published; sequencing of 6 genomes in progress; 9 other genomes in the pipeline.
  - 4 primate genomes: orangutan, macaque, chimpanzee, and human.

Outline
  - GeneFinding by Gibbs Sampling
      - Ab-Initio GeneFinding in Vertebrates
      - Gibbs Sampling in HMMs
      - Gene finding by Gibbs sampling
      - Results
  - Evidence Based Multiple Species GeneFinding
      - Evidence based GeneFinding
      - ExonAligner: An Exon Alignment Program
      - Initial Results
  - Proposals for Future Work


Gene Finding in Vertebrates
  - Single organism gene finding: GENSCAN/GENIE/SNAP…
      - Based on generalized HMMs
      - Viterbi Sequence (Gene Annotation)
      - High Sensitivity/Low Specificity
  - Conserved regions among related species are more likely to be functional than divergent regions.
  - IDEA: Comparative-based gene finding

Comparative (Pairwise) Gene Finding
  - ROSETTA: Global alignment followed by gene finding [Batzoglou and Pachter et al., 1999]
  - SLAM: Simultaneous global alignment and gene finding [Alexandersson et al. 2001]
  - TWINSCAN: Blast alignment followed by gene finding [Korf et al. 2001]
  - AGENDA: Global alignment followed by gene finding [Rinner and Morgenstern, 2002]
  - DOUBLESCAN: Simultaneous alignment and gene finding [Meyer and Durbin, 2002]
  - SGP2: Blast alignment followed by gene finding [Parra et al. 2003]

The Good News
  - Gene structure
      - Number of exons conserved (86% human/mouse)
      - Exons have similar lengths (91% identical, remainder almost all differ by a multiple of 3)
      - Intron lengths are divergent (~1% identical length)
  - Sequence similarity
      - Exons highly conserved (both amino acids & DNA)
      - Intron sequences dissimilar
  [Waterston et al., 2003]

The Bad News
  - Difficult to generalize many pairwise methods to multiple sequence methods
  - Alignment
      - Exons may be misaligned (much shorter than introns)
      - Multiple sequence alignment is much harder than pairwise sequence alignment
  - Long Conserved Non Coding Sequences
      - Confuse methods that rely on conservation in a naive way
  - Missing Sequence

Multiple Species Comparative Gene Finding (with Alignment)
  - McAuliffe et al. (2004), Siepel et al. (2004)

Multiple Species Comparative Gene Finding (without Alignment)

Gibbs Sampling for Biological Sequence Analysis
  - Introduced by Lawrence et al. 1993
      - Motif Detection
  - Extensions
      - Multiple Motifs in a Sequence
      - Multiple Types of Motifs
      - Phylogenetic Relationships between Sequences
  - Applications
      - Alignment
      - Linkage Analysis

Gibbs Sampling
  - Aim: To sample from the joint distribution $p(x_1, x_2, \dots, x_n)$ when it is easy to sample from the conditional distributions $p(x_i \mid x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n)$ but not from the joint distribution.
  - Method: Iteratively sample $x_i^t$ from the conditional distribution $p(x_i \mid x_1^t, \dots, x_{i-1}^t, x_{i+1}^{t-1}, \dots, x_n^{t-1})$.
  - Theorem: For discrete distributions, the distribution of $(x_1^t, x_2^t, \dots, x_n^t)$ converges to $p(x_1, x_2, \dots, x_n)$.
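The method on this slide can be illustrated with a minimal sketch. The two-variable joint distribution `JOINT` and the function names are made up for illustration and are not from the talk; the conditionals are obtained by renormalising the joint, which is only practical for a toy case like this one.

```python
import random
from collections import Counter

# Hypothetical joint distribution over two binary variables (x1, x2).
JOINT = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def sample_conditional(i, state, rng):
    """Draw x_i from p(x_i | all other coordinates) by renormalising the joint."""
    weights = []
    for value in (0, 1):
        candidate = list(state)
        candidate[i] = value
        weights.append(JOINT[tuple(candidate)])
    return rng.choices((0, 1), weights=weights)[0]

def gibbs(n_sweeps, seed=0):
    """Run n_sweeps full sweeps and return the empirical frequency of each state."""
    rng = random.Random(seed)
    state = [0, 0]
    counts = Counter()
    for _ in range(n_sweeps):
        # One sweep: update every coordinate in turn from its conditional.
        for i in range(len(state)):
            state[i] = sample_conditional(i, state, rng)
        counts[tuple(state)] += 1
    return {k: counts[k] / n_sweeps for k in JOINT}

freqs = gibbs(50_000)
for k in sorted(JOINT):
    print(k, JOINT[k], round(freqs[k], 3))
```

After enough sweeps the empirical frequencies should be close to the entries of `JOINT`, which is what the convergence theorem predicts for this chain.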

tt ss Connection to HMMs Z1Z1 Y1Y1 Z2Z2 YmYm ZmZm Y2Y2 ss ss tt tt   t = output probabilities   s = transition probabilities  Difficult to sample from P(  Z | Y)  Easy to sample  from P(  | Z,Y)  Easy to sample Z from P(Z | ,Y)

The Motif Finding Problem
  - Fixed width unknown motif.
  - 1 motif per sequence, unknown location.
  [Figure: PSSM with unknown entries; columns P1 to P5, rows A, C, G, T]

The Motif Finding HMM
  - $\theta$: PSSM parameters
  - $Y$: Observed sequences
  - $Z$: Alignment
  [Figure: PSSM with unknown entries (columns P1 to P5, rows A, C, G, T); HMM with a background state BG and motif states P1 to P5, hidden states $Z_1, \dots, Z_m$ and observations $Y_1, \dots, Y_m$]

Gibbs Sampling for Motif Detection
  - Sample $\theta$ from $P(\theta \mid Z, Y)$ [sample PSSM from alignment]
  - Sample $Z$ from $P(Z \mid \theta, Y)$ [find positions from PSSM]
  - Samples from $P(Z, \theta \mid Y)$
  [Figure: PSSM with unknown entries (columns P1 to P5, rows A, C, G, T); HMM with a background state BG and motif states P1 to P5]
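The two alternating steps above can be sketched as follows. This is an illustrative toy, not the implementation behind the talk: it assumes a uniform background, a symmetric Dirichlet prior (`pseudocount`) on each PSSM column, and a uniform prior over start positions. The function names and the example sequences are hypothetical.

```python
import random

ALPHABET = "ACGT"

def sample_pssm(seqs, positions, width, rng, pseudocount=1.0):
    """Sample theta ~ P(theta | Z, Y): one Dirichlet draw per motif column."""
    pssm = []
    for j in range(width):
        counts = {a: pseudocount for a in ALPHABET}
        for seq, pos in zip(seqs, positions):
            counts[seq[pos + j]] += 1
        # A Dirichlet draw is a vector of normalised Gamma draws.
        draws = {a: rng.gammavariate(counts[a], 1.0) for a in ALPHABET}
        total = sum(draws.values())
        pssm.append({a: draws[a] / total for a in ALPHABET})
    return pssm

def sample_positions(seqs, pssm, rng, background=0.25):
    """Sample Z ~ P(Z | theta, Y): weight each start by its likelihood ratio."""
    width = len(pssm)
    positions = []
    for seq in seqs:
        weights = []
        for start in range(len(seq) - width + 1):
            w = 1.0
            for j in range(width):
                w *= pssm[j][seq[start + j]] / background
            weights.append(w)
        positions.append(rng.choices(range(len(weights)), weights=weights)[0])
    return positions

def motif_gibbs(seqs, width, n_iter=200, seed=0):
    """Alternate the two conditional draws; return the last sample of (Z, theta)."""
    rng = random.Random(seed)
    positions = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(n_iter):
        pssm = sample_pssm(seqs, positions, width, rng)
        positions = sample_positions(seqs, pssm, rng)
    return positions, pssm

# Hypothetical toy input: four short sequences sharing the substring TGACGT.
seqs = ["ACGTTGACGTAC", "TTGACGTACGGA", "GGATGACGTCCA", "CATTTGACGTGG"]
positions, pssm = motif_gibbs(seqs, width=6)
print(positions)
```

The returned values are a single sample from the chain, so repeated runs with different seeds can give different start positions.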


Gibbs Sampling for HMMs
  - N sequences independently generated by an HMM.
  - Three types of random variables
      - $\theta$: Parameters
      - $Z = Z_1, Z_2, \dots, Z_i, \dots, Z_N$: hidden variables, with $Z_i = Z_i^1, Z_i^2, \dots, Z_i^m$
      - $Y = Y_1, Y_2, \dots, Y_i, \dots, Y_N$: observed variables, with $Y_i = Y_i^1, Y_i^2, \dots, Y_i^m$

Gibbs Sampling for HMMs
  - N sequences independently generated by an HMM.
  - Three types of random variables
      - $\theta$: Parameters
      - $Z = Z_1, Z_2, \dots, Z_i, \dots, Z_N$: hidden variables
      - $Y = Y_1, Y_2, \dots, Y_i, \dots, Y_N$: observed variables
  - Aim: To sample from the distribution $P(Z, \theta \mid Y)$
  - Iterations of a Gibbs Sampler
      - Sample $Z_i$ from $p(Z_i \mid Y, \theta)$
      - Sample $\theta$ from $p(\theta \mid Y, Z)$
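One way to realise these two steps for a small discrete HMM is sketched below. The details are assumptions made for illustration and are not stated in the talk: hidden paths are drawn by forward filtering followed by backward sampling, parameters get symmetric Dirichlet priors, sequences are non-empty, and symbols are encoded as integers. All function names are hypothetical.

```python
import random

def dirichlet(alphas, rng):
    """Draw from a Dirichlet distribution via normalised Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def sample_path(y, init, trans, emit, rng):
    """Sample Z_i ~ p(Z_i | Y_i, theta): forward filtering, backward sampling."""
    K = len(init)
    fwd, prev = [], None
    for t, sym in enumerate(y):
        if t == 0:
            row = [init[k] * emit[k][sym] for k in range(K)]
        else:
            row = [emit[k][sym] * sum(prev[j] * trans[j][k] for j in range(K))
                   for k in range(K)]
        norm = sum(row)
        prev = [v / norm for v in row]  # normalise to avoid underflow
        fwd.append(prev)
    z = [0] * len(y)
    z[-1] = rng.choices(range(K), weights=fwd[-1])[0]
    for t in range(len(y) - 2, -1, -1):
        weights = [fwd[t][j] * trans[j][z[t + 1]] for j in range(K)]
        z[t] = rng.choices(range(K), weights=weights)[0]
    return z

def sample_params(ys, zs, K, M, rng, prior=1.0):
    """Sample theta ~ p(theta | Y, Z) from Dirichlet posteriors over counts."""
    init_c = [prior] * K
    trans_c = [[prior] * K for _ in range(K)]
    emit_c = [[prior] * M for _ in range(K)]
    for y, z in zip(ys, zs):
        init_c[z[0]] += 1
        for t, sym in enumerate(y):
            emit_c[z[t]][sym] += 1
            if t > 0:
                trans_c[z[t - 1]][z[t]] += 1
    return (dirichlet(init_c, rng),
            [dirichlet(r, rng) for r in trans_c],
            [dirichlet(r, rng) for r in emit_c])

def hmm_gibbs(ys, K, M, n_iter=100, seed=0):
    """Alternate parameter and path draws; return the last sample."""
    rng = random.Random(seed)
    zs = [[rng.randrange(K) for _ in y] for y in ys]
    for _ in range(n_iter):
        init, trans, emit = sample_params(ys, zs, K, M, rng)
        zs = [sample_path(y, init, trans, emit, rng) for y in ys]
    return zs, (init, trans, emit)

# Hypothetical toy data: two sequences over a 3-symbol alphabet, 2 hidden states.
ys = [[0, 0, 1, 1, 0, 2, 2, 2], [2, 2, 2, 0, 0, 1]]
zs, (init, trans, emit) = hmm_gibbs(ys, K=2, M=3, n_iter=20)
print(zs)
```

Because each sequence is conditionally independent given the parameters, the path draws inside one iteration could be done in any order or in parallel.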

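[Figure: gene-finding HMM state diagram. States: internal exons $E_{0,0}$ to $E_{2,2}$; Intron 0, Intron 1, Intron 2; initial exons $EI_0$, $EI_1$, $EI_2$; terminal exons $ET_0$, $ET_1$, $ET_2$; Single Exon; intergenic (IG); exon classes $E^1_{0,0}, E^2_{0,0}, \dots, E^k_{0,0}$]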

Gibbs Sampling for Gene Finding

Initial Predictions

Gibbs Sampling for Gene Finding
  - Sample $Z_1$ from $P(Z_1 \mid Z_{[-1]}, Y)$

Gibbs Sampling for Gene Finding
  - Sample $Z_2$ from $P(Z_2 \mid Z_{[-2]}, Y)$

Learning the Number of Exon Classes

  - Find Significant Hits Among Peptides

Learning the Number of Exon Classes
  - Each Connected Component forms a Class of Genes
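The grouping step can be sketched with a small union-find. This is only an illustration: it assumes the significant hits have already been computed and are given as pairs of peptide indices. The function name `gene_classes` and the example hits are hypothetical.

```python
def gene_classes(n_peptides, hits):
    """Group peptides into classes, one per connected component of the hit graph."""
    parent = list(range(n_peptides))

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hits:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra  # merge the two components

    classes = {}
    for p in range(n_peptides):
        classes.setdefault(find(p), []).append(p)
    return sorted(classes.values())

# Hypothetical example: 6 predicted peptides with three significant pairwise hits.
classes = gene_classes(6, [(0, 1), (1, 2), (3, 4)])
print(classes)       # [[0, 1, 2], [3, 4], [5]]
print(len(classes))  # 3
```

In this toy case the number of classes is 3. A peptide with no significant hit forms a class of its own.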

Testing
  - 1.6 Mb of data from the NISC Comparative Sequencing Project
  - Divided into large genomic regions (100-200 kb), some of which contained multiple genes
  - Selection Criteria
      - 4 mammals roughly equidistant from each other: Human, Mouse/Rat, Dog/Cat, Pig/Cow.
      - Available RefSeq annotations with no obvious alternative splicing

Results

  Method     Nucl. Sn   Nucl. Sp   Exon Sn   Exon Sp
  Gibbs      0.897      0.886      0.714     0.628
  Genscan    0.911      0.548      0.777     0.518
  Twinscan   0.692      0.856      0.440     0.513
  SLAM       0.791      0.881      0.632     0.527
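The slide does not define the four measures. The sketch below assumes the convention that is usual in gene-finding evaluations: sensitivity is TP / (TP + FN) and "specificity" is TP / (TP + FP), computed per coding nucleotide and per exactly matching exon. The function names and example coordinates are hypothetical.

```python
def covered(intervals):
    """Set of positions covered by half-open (start, end) intervals."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end))
    return bases

def nucleotide_sn_sp(annotated, predicted):
    """Nucleotide-level (Sn, Sp): shared coding bases over annotated / predicted."""
    truth, pred = covered(annotated), covered(predicted)
    tp = len(truth & pred)
    sn = tp / len(truth) if truth else 0.0
    sp = tp / len(pred) if pred else 0.0
    return sn, sp

def exon_sn_sp(annotated, predicted):
    """Exon-level (Sn, Sp): an exon counts only if both boundaries match exactly."""
    exact = len(set(annotated) & set(predicted))
    sn = exact / len(annotated) if annotated else 0.0
    sp = exact / len(predicted) if predicted else 0.0
    return sn, sp

# Hypothetical exons as half-open coordinate pairs.
annotated = [(100, 200), (300, 400)]
predicted = [(100, 200), (310, 420)]
nuc_sn, nuc_sp = nucleotide_sn_sp(annotated, predicted)
ex_sn, ex_sp = exon_sn_sp(annotated, predicted)
print(nuc_sn, round(nuc_sp, 3), ex_sn, ex_sp)
```

In this example 190 of the 200 annotated bases are predicted, out of 210 predicted bases, and one of the two exons matches exactly.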

Robustness Results

  Method             Nucl. Sn   Nucl. Sp   Exon Sn   Exon Sp
  Gibbs (before)     0.939      0.950      0.763     0.735
  Gibbs (after)      0.885      0.910      0.740     0.703
  Genscan (before)   0.911      0.680      0.771     0.612
  Genscan (after)    0.866      0.652      0.748     0.594
  Twinscan (before)  0.694      0.895      0.465     0.604
  Twinscan (after)   0.665      0.853      0.465     0.598
  SLAM (before)      0.927      0.911      0.718     0.566
  SLAM (after)       0.438      0.936      0.250     0.646

Conclusions
  - Efficient
      - Running time O(kNL), memory requirements O(L)
      - k = #iterations, N = #sequences, L = max. length
      - Converges rapidly.
  - No alignment required!
  - Symmetric prediction for all species
      - Application: rapid comparative based annotation of newly sequenced genomes
  - Robust
      - Rearrangements
      - Draft quality sequence

Outline
  - GeneFinding by Gibbs Sampling
      - Ab-Initio GeneFinding in Vertebrates
      - Overview of Gibbs Sampling
      - Gene finding by Gibbs sampling
      - Results
  - Evidence Based Multiple Species GeneFinding
      - Evidence based GeneFinding
      - ExonAligner: An Exon Alignment Program
      - Initial Results
  - Proposals for Future Work

Evidence Based GeneFinding
  - Procrustes: cDNA evidence, DP based spliced alignment [Gelfand et al. 1996]
  - Genewise: Protein evidence, combines a genefinding HMM with a protein profile HMM; part of the ENSEMBL pipeline. [Birney et al. 1996]
  - Projector: Evidence from orthologous genes in related species, uses a pair HMM based model. [Meyer and Durbin 2004]
  [Figure: evidence mapped to annotation]

Evidence Based GeneFinding
  - Large scale sequencing efforts.
      - 8 Drosophila genomes very soon; D. melanogaster well annotated.
      - 9 mammalian genomes by early 2005; human genome well annotated.
  - Aim: Rapid annotation of newly sequenced genomes.
      - Use well annotated genomes as evidence.
      - Draft quality genomes: robustness to sequencing errors.
  - Using sequences from multiple species will result in more accurate annotations.
      - Will also give us high quality multiple alignments.
      - Data to study the evolution of genomes.

Evidence Based Multiple Species GeneFinding
  - Basic Idea: Use annotations from a reference genome (e.g. D. melanogaster or H. sapiens) as evidence to annotate newly sequenced genomes.
      - Use Whole Genome Homology Maps (courtesy Colin Dewey)
      - Project exons from the reference genome into every other genome.
      - Join projections to get multiple alignments.
  - Use orthologous sequences from multiple species to get more accurate annotations.
      - Produce annotations with all supporting evidence.
      - Exploit phylogenetic relationships among the species.

Projecting Annotated Exons
  [Figure labels: Annotation, Homology Map, ExonAligner]

ExonAligner: An Exon Alignment Program
  - Mixture of global and local alignment.
      - Penalize overhanging ends in the Evidence.
      - Overhanging ends in the Target are OK.
  - Exploit the property that they code for homologous proteins.
      - Special Dynamic Programming Matrix
  - Robust.
      - Sequencing Errors.
      - Phase Shifts.
  - Chaining Algorithm for large sequences.
  [Figure labels: Evidence, Target]

The Dynamic Programming Matrix
  - The figure shows only the edges into the black node.
  - The red edges represent non-codon gaps, i.e. gaps caused by phase shifts or sequencing errors, whose length is not a multiple of 3. They are heavily penalized.
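The sketch below illustrates the two ideas described for ExonAligner. It is not ExonAligner's actual recurrence or scoring: the scores `match`, `mismatch`, `codon_gap` and `frame_gap` are made-up values, and the function name is hypothetical. It assumes that target bases before and after the aligned region are free, that the evidence must be consumed end to end, that gaps of exactly three bases get a mild penalty, and that single-base (non-codon) gaps get a heavy one.

```python
def exon_align_score(evidence, target, match=2, mismatch=-1,
                     codon_gap=-3, frame_gap=-10):
    """Best score aligning all of `evidence` to some stretch of `target`."""
    n, m = len(evidence), len(target)
    neg_inf = float("-inf")
    S = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        S[0][j] = 0  # leading overhang in the target is free
    for i in range(1, n + 1):
        for j in range(m + 1):
            # One evidence base left unaligned: a non-codon gap, heavily penalized.
            best = S[i - 1][j] + frame_gap
            if i >= 3:
                # Three evidence bases skipped together: a codon gap.
                best = max(best, S[i - 3][j] + codon_gap)
            if j >= 1:
                sub = match if evidence[i - 1] == target[j - 1] else mismatch
                best = max(best, S[i - 1][j - 1] + sub, S[i][j - 1] + frame_gap)
            if j >= 3:
                # Three target bases skipped together: a codon gap.
                best = max(best, S[i][j - 3] + codon_gap)
            S[i][j] = best
    return max(S[n])  # trailing overhang in the target is free

# Evidence embedded exactly in a longer target: six matches, free overhangs.
print(exon_align_score("ATGGCC", "TTATGGCCAA"))  # 12
# Target lacks one codon (AAA) of the evidence: six matches plus one codon gap.
print(exon_align_score("ATGAAAGCC", "ATGGCC"))   # 9
```

With these toy scores a three-base gap costs far less than three single-base gaps, so the alignment prefers to stay in frame.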

Chaining Algorithms
  - Widely used in large scale alignment algorithms.
      - MUMmer [Salzberg et al. 1999]
      - AVID [Bray et al. 2002]
      - LAGAN [Brudno et al. 2003]
  - Step 1: Find good local alignments, or fragments.
  - Step 2: Select a consistent subset of fragments for chaining and call these fragments anchors.
  - Step 3: Join the anchors to get an alignment.

The ExonAligner Chaining Algorithm
  - Construction of Fragments.
      - Translate the target sequence in the 3 frames.
      - Find significant hits with the (translated) evidence and use them as fragments.
  - Selection of Anchors.
      - Construct a weighted DAG from the fragments.
      - Weigh edges by using dynamic programming.
      - Use nodes in the shortest path in the DAG as anchors.
  - Use dynamic programming to join anchors together.
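The anchor-selection step can be sketched as a path search in a DAG of fragments. This is a simplification and not the ExonAligner algorithm itself: it scores a chain by the sum of hypothetical fragment scores and picks the highest-scoring chain, whereas the slide describes edge weights computed by dynamic programming and a shortest path. The tuple layout and the function name `chain_fragments` are assumptions.

```python
def chain_fragments(fragments):
    """Return the highest-scoring consistent chain of fragments.

    Each fragment is (ev_start, ev_end, tg_start, tg_end, score), with half-open
    coordinates on the evidence and the target. Fragment a may precede fragment b
    only if a ends before b starts on both sequences.
    """
    frags = sorted(fragments)          # by evidence start, a topological order
    if not frags:
        return []
    best = [f[4] for f in frags]       # best chain score ending at each fragment
    back = [None] * len(frags)         # predecessor on that best chain
    for j, (es, ee, ts, te, score) in enumerate(frags):
        for i in range(j):
            _, prev_ee, _, prev_te, _ = frags[i]
            if prev_ee <= es and prev_te <= ts and best[i] + score > best[j]:
                best[j] = best[i] + score
                back[j] = i
    # Trace back from the best end point to recover the anchors in order.
    j = max(range(len(frags)), key=best.__getitem__)
    chain = []
    while j is not None:
        chain.append(frags[j])
        j = back[j]
    return chain[::-1]

# Hypothetical fragments; the third one conflicts with the first on both sequences.
fragments = [(0, 10, 5, 15, 10.0), (12, 20, 18, 26, 8.0),
             (8, 18, 2, 12, 9.0), (22, 30, 30, 38, 7.0)]
anchors = chain_fragments(fragments)
print(anchors)
```

Maximising a sum of scores is equivalent to a shortest-path search with negated weights, which is why the two formulations select anchors in the same way once the weights are fixed.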

Exploiting Phylogeny

Recoverable Exons/Genes ?

Project Using Exon Aligner

Preliminary Results
  - Created a homology map of the Human, Chimp, Rat, Mouse and Chicken genomes
      - 266836 exon cliques.
      - 45543 non-convex exon cliques.
      - 27502 of these recoverable.
  - Used ExonAligner to map 3300 human RefSeq genes into the chimp genome.
      - Robustness of the algorithm is critical.
      - 500 of the 42662 exons had non-codon gaps.
  - These alignments will in turn be used to learn parameters for ExonAligner.
      - Extrapolate parameters for other species.

An Illustrative Example
  - RefSeq Gene NM_030575
      - Single exon gene, 221 a.a. protein.
      - No orthologous gene found by Genewise.
  - Potential orthologous gene in chimp
      - 2 non-codon gaps in alignment (1 insertion and 1 deletion separated by 60 nt).
      - 212 out of 221 amino acids are matches.
  - Is this a real ortholog? Phase Shift/Sequencing Error?
      - Find orthologous genes in other species and use multiple alignment.

Future Work
  - Extend ExonAligner for Multiple Species
      - Robust realignment
      - Take into account codon structure
      - Robust for phase shifts/sequencing errors
  - Annotation with supporting evidence.
      - Basic Evidence, e.g. RefSeq Gene Annotation
      - Multiple Alignment with Orthologous Features
      - Score: Statistical Significance of the Feature

Future Work
  - Comprehensive Annotation Program
      - Put Evidence Based and Ab initio methods together
      - Try to use alignment/homology in Gibbs Sampler
  - Rapid annotation of Drosophila and Mammalian genomes
      - Berkeley AAA group for Drosophila genomes.
  - Study the evolution of genes
      - Find human specific genes
