Annotation and Alignment of the Drosophila Genomes
Genes or Regulation?
“10,516 putative orthologs have been identified as a core gene set conserved over 25–55 million years (Myr) since the pseudoobscura/melanogaster divergence”
“Cis-regulatory sequences are more conserved than random and nearby sequences between the species—but the difference is slight, suggesting that the evolution of cis-regulatory elements is flexible”
Richards et al., Comparative genome sequencing of Drosophila pseudoobscura: Chromosomal, gene, and cis-element evolution, Genome Res., Jan 2005.
Genes or Regulatory Elements?
“10,516 10,867 putative orthologs have been identified as a core gene set conserved over 25–55 million years (Myr) since the pseudoobscura/melanogaster divergence”
“Cis-regulatory sequences are more conserved than random and nearby sequences between the species—but the difference is slight, suggesting that the evolution of cis-regulatory elements is flexible”
Richards et al., Comparative genome sequencing of Drosophila pseudoobscura: Chromosomal, gene, and cis-element evolution, Genome Res., Jan 2005.
BP England, U Heberlein, R Tjian. Purified Drosophila transcription factor, Adh distal factor-1 (Adf-1), binds to sites in several Drosophila promoters and activates transcription, J Biol Chem 1990.
S. Chatterji and L. Pachter, GeneMapper: Reference based annotation with GeneMapper, 2005.
Alignment of coding sequence

DroAna_ _ GTCGCTCAACCAGCATTTGCAAAAGTCGCAGAACTTGCGCTCATTGGATTTCCAGTACTC
DroMel_4_ GTCGCTCAGCCAGCATTTGCAGAAGTCGCAGAACTTGCGCTCGTTTGATTTCCAGTACTC
DroMoj_ _ GTCGCTTAACCAGCATTTACAGAAATCGCAATACTTGCGTTCATTGGATTTCCAGTACTC
DroPse_1_ GTCGCTCAGCCAGCACTTGCAGAAGTCGCAGTACTTGCGCTCGTTTGATTTCCAGAATTC
DroSim_ _ GTCGCTCAGCCAGCATTTGCAGAAGTCGCAGAACTTGCGCTCGTTTGATTTCCAGTACTC
DroVir_ _ GTCGCTCAACCAGCATTTGCAGAAGTCGCAATACTTGCGTTCATTCGACTTCCAGTACTC
DroYak_1_ GTCGCTCAGCCAGCATTTGCAGAAGTCGCAGAACTTCCGCTCGTTTGACTTCCAGTACTC
          ****** * ****** ** ** ** *****  **** ** ** ** ** ****** * **

Alignment of non-coding sequence

DroAna_ _ CTGAAGGAAT TCTATATT AAAGAAGATTTCTCATCATTGGTTG
DroMel_4_ CTGCGGGATTAGGGGTCATTAGAGT GCCGAAAAGCGA GTTT
DroMoj_ _ CTGGAATAGTTAATTTCATTGTAACACATAAACGTTTTAAATTCTATTGAAA
DroPse_1_ CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG----
DroSim_ _ CTGCGGGATTAGGAGTCATTAGAGT GCGGAAAAGCGG GTT-
DroVir_ _ CTGCAGCAGTTAAATA-ATTGTAATAAACAATTCTCT--AATTTGGTCCAAA
DroYak_1_ CTGCGGGATTAGCGGTCATTGGTGT GAAGAATAGATC CTTT
          *** * * *

DroAna_ _ AATC-----ACTTAC
DroMel_4_ ATTCTATGGACTCAC
DroMoj_ _ ----TATTTACTCAC
DroPse_1_ TGTACTTAC
DroSim_ _ ATTCTATGGACTCAC
DroVir_ _ ----TATTTACTCAC
DroYak_1_ ATTTCATAAACTCAC
          *** **
Alignment of coding sequence

DroAna_ _ GTCGCTCAACCAGCATTTGCAAAAGTCGCAGAACTTGCGCTCATTGGATTTCCAGTACTC
DroMel_4_ GTCGCTCAGCCAGCATTTGCAGAAGTCGCAGAACTTGCGCTCGTTTGATTTCCAGTACTC
DroMoj_ _ GTCGCTTAACCAGCATTTACAGAAATCGCAATACTTGCGTTCATTGGATTTCCAGTACTC
DroPse_1_ GTCGCTCAGCCAGCACTTGCAGAAGTCGCAGTACTTGCGCTCGTTTGATTTCCAGAATTC
DroSim_ _ GTCGCTCAGCCAGCATTTGCAGAAGTCGCAGAACTTGCGCTCGTTTGATTTCCAGTACTC
DroVir_ _ GTCGCTCAACCAGCATTTGCAGAAGTCGCAATACTTGCGTTCATTCGACTTCCAGTACTC
DroYak_1_ GTCGCTCAGCCAGCATTTGCAGAAGTCGCAGAACTTCCGCTCGTTTGACTTCCAGTACTC
          ****** * ****** ** ** ** *****  **** ** ** ** ** ****** * **

Alignment of non-coding sequence

droAna              CTGAAGGAATTCTA--TATTAAAG
dm2.chr2L           CTGCGGGATTAGGGGTCATTAGAG TGCCGAAAAGCGAGT-TTATTC
droMoj1.contig_2959 CTGGAATAGTTAATTTCATTGTAA CACATAAA--CGTTTTAAATTC
dp3.chr4_group3     CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG
droSim1.chr2L       CTGCGGGATTAGGAGTCATTAGAG TGCGGAAAAGCGGG--TTATTC
droVir1.scaffold_6  CTGCAGCAGTTAA-ATAATTGTAA TAAACAA----TTCTCTAATTT
droYak1.chr2L       CTGCGGGATTAGCGGTCATTGGTG TGAAGAATAGATCCT-TTATTT
                    *** * * * *

droAna              AAGATTTCTCATCATTGGTTGAATC ACTTAC
dm2.chr2L           TATGGACTCAC
droMoj1.contig_     AAATATTT TATTGACTCAC
dp3.chr4_group      TGT--ACTTAC
droSim1.chr2L       TATGGACTCAC
droVir1.scaffold_   AAATATTTGGTCCACTCAC
droYak1.chr2L       CATAAACTCAC
                    *** **
Per site analysis
Group 1 mean per site % identity           51.3%   51.3%   47.8%
Group 2 mean per site % identity           47.8%   42.9%   42.9%
Difference of means (group 1 – group 2)    3.6%    8.4%    4.9%
Difference of means resampling p-value     E-5
Distribution comparison KS p-value         E-6

Per base analysis
Group 1 mean per base % identity           47.8%   47.8%   46.3%
Group 2 mean per base % identity           46.3%   42.4%   42.4%
Difference of means (group 1 – group 2)    1.5%    5.4%    3.9%

Richards et al., Comparative genome sequencing of Drosophila pseudoobscura: Chromosomal, gene, and cis-element evolution, Genome Res., Jan 2005.
dm2.chr2L       CTGCGGGATTAGGGGTCATTAGAG TGCCGAAAAGCGAGT-TTATTC
dp3.chr4_group3 CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG

dm2.chr2L       TATGGACTCAC
dp3.chr4_group3 TGT--ACTTAC

How is an alignment made from two sequences?

>dm2.chr2L
CTGCGGGATTAGGGGTCATTAGAGTGCCGAAAAGCGAGTTTATTCTATGGACTCAC
>dp3.chr4_group3
CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCGTGTACTTAC

Given two sequences of lengths n, m: n = 56, m = 64
First alignment:

dm2.chr2L       CTGCGGGATTAGGGGTCATTAGAG TGCCGAAAAGCGAGT-TTATTC
dp3.chr4_group3 CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG

dm2.chr2L       TATGGACTCAC
dp3.chr4_group3 TGT--ACTTAC

Second alignment:

DroMel_4_ CTGCGGGATTAGGGGTCATTAGAGT GCCGAAAAGCGA GTTT
DroPse_1_ CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG----

DroMel_4_ ATTCTATGGACTCAC
DroPse_1_ TGTACTTAC

Each alignment can be summarized by counting the number of matches (#M), mismatches (#X), gaps (#G), and spaces (#S).

First alignment:  #M=31, #X=22, #G=3, #S=12
Second alignment: #M=27, #X=18, #G=3, #S=28

Every base is either in an aligned pair or opposite a space, so 2(#M + #X) + #S = n + m. Hence #X, #G and #S suffice to specify a summary.
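The counting rule can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the slides: the function name `summarize` is hypothetical, and it assumes that a gap is a maximal run of spaces in one row and that both `-` and a blank denote a space. The example uses only the short trailing block of the first alignment, where the dashes are explicit.

```python
def summarize(row_a: str, row_b: str, space_chars: str = "- ") -> dict:
    """Count matches (#M), mismatches (#X), gaps (#G) and spaces (#S).

    Assumption: a gap is a maximal run of spaces within one row, so a run
    in row_a followed directly by a run in row_b counts as two gaps.
    """
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    m = x = g = s = 0
    prev_side = None  # row holding the space in the previous column, if any
    for a, b in zip(row_a, row_b):
        a_space, b_space = a in space_chars, b in space_chars
        if a_space and b_space:
            raise ValueError("a column cannot contain two spaces")
        if a_space or b_space:
            side = "a" if a_space else "b"
            s += 1
            if side != prev_side:  # a new run of spaces starts here
                g += 1
            prev_side = side
        else:
            prev_side = None
            if a == b:
                m += 1
            else:
                x += 1
    return {"M": m, "X": x, "G": g, "S": s}


# Trailing block of the dm2 / dp3 alignment shown above.
block = summarize("TATGGACTCAC", "TGT--ACTTAC")
print(block)
```

For this block the two rows hold 11 and 9 bases, and the counts satisfy 2(#M + #X) + #S = 11 + 9.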
The summary of an alignment is a point in 3-dimensional space. For example, the two alignments just shown correspond to the points (22, 3, 12) and (18, 3, 28), written as (#X, #G, #S).
For our two sequences there is a very large number of different alignments, but far fewer different summaries, so we don’t need to plot that many points. The number of summaries is still quite large. Fortunately, there are only 69 vertices on the convex hull of the points. These are the interesting ones, and we can even draw them…
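The gap between the number of alignments and the number of summaries can be seen on toy inputs. The sketch below is illustrative only: `count_alignments` and `summaries` are hypothetical names, the count assumes every column is either an aligned pair or a single space (so `A-/-A` and `-A/A-` are distinct alignments), and the enumeration is exponential, so it is only usable for very short sequences. It does not compute the convex hull.

```python
def count_alignments(n: int, m: int) -> int:
    """Number of alignments of sequences of lengths n and m.

    Each alignment ends in a pair, a space in row 1, or a space in row 2,
    which gives d[i][j] = d[i-1][j] + d[i][j-1] + d[i-1][j-1].
    """
    d = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = d[i - 1][j] + d[i][j - 1] + d[i - 1][j - 1]
    return d[n][m]


def summaries(a: str, b: str) -> set:
    """All distinct (#X, #G, #S) triples over every alignment of a and b."""
    found = set()

    def walk(i, j, x, g, s, prev_side):
        if i == len(a) and j == len(b):
            found.add((x, g, s))
            return
        if i < len(a) and j < len(b):  # aligned pair
            walk(i + 1, j + 1, x + (a[i] != b[j]), g, s, None)
        if i < len(a):                 # a[i] opposite a space in row b
            walk(i + 1, j, x, g + (prev_side != "b"), s + 1, "b")
        if j < len(b):                 # b[j] opposite a space in row a
            walk(i, j + 1, x, g + (prev_side != "a"), s + 1, "a")

    walk(0, 0, 0, 0, 0, None)
    return found


toy_a, toy_b = "ACG", "AG"
print(count_alignments(len(toy_a), len(toy_b)), len(summaries(toy_a, toy_b)))
```

Under the same convention, `count_alignments(56, 64)` gives the number of alignments of the two example sequences.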
For the sequences:

>mel
CTGCGGGATTAGGGGTCATTAGAGTGCCGAAAAGCGAGTTTATTCTATGGAC
>pse
CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCGTGTAC

the alignment polytope is:

49: #X=24, #S=10, #G=2
There are eight alignments that have this summary.
mel CTGCGGGATTAGGGGTCATTAGAGT GCCGAAAAGCGAGTTTATTCTATGGAC
pse CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATC-GTGTAC

mel CTGCGGGATTAGGGGTCATTAGAGT GCCGAAAAGCGAGTTTATTCTATGGAC
pse CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG-TGTAC

mel CTGCGGGATTAGGGGTCATTAGAG TGCCGAAAAGCGAGTTTATTCTATGGAC
pse CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATC-GTGTAC

mel CTGCGGGATTAGGGGTCATTAGAG TGCCGAAAAGCGAGTTTATTCTATGGAC
pse CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG-TGTAC

mel CTGCGGGATTAGGGGTCATTAGA GTGCCGAAAAGCGAGTTTATTCTATGGAC
pse CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATC-GTGTAC

mel CTGCGGGATTAGGGGTCATTAGA GTGCCGAAAAGCGAGTTTATTCTATGGAC
pse CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG-TGTAC

mel CTGCGGGATTAGGGGTCATTAG AGTGCCGAAAAGCGAGTTTATTCTATGGAC
pse CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATC-GTGTAC

mel CTGCGGGATTAGGGGTCATTAG AGTGCCGAAAAGCGAGTTTATTCTATGGAC
pse CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG-TGTAC
Consensus at a vertex

mel CTGCGGGATTAGGGGTCATTAGAGT===------===GCCGAAAAGCGAGTTTATTCTA=TGGAC
pse CTGGAAGAGTTTTGATTAGTAG===GGGATCCATGGGGGCGAGGAGAGGCCATCATC==GTGTAC
The vertices of the polytope have special significance. Given parameters for a model, e.g. the default parameters for MULTIZ:

M = 100, X = -100, S = -30, G = -400

the summary of an optimal alignment is the result of maximizing the linear form

-200*(#X) - 400*(#G) - 80*(#S)

over the polytope. A linear form attains its maximum over a polytope at a vertex, so the vertices of the polytope correspond to optimal alignments.

49: #X=24, #S=10, #G=2
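The coefficients of the linear form can be checked by hand. The derivation below is a sketch under two assumptions that the slide does not spell out: the score of an alignment is M·#M + X·#X + S·#S + G·#G (each gap pays G once, plus S for every space in it), and #M is eliminated using 2(#M + #X) + #S = n + m, where n and m are the sequence lengths.

```latex
\begin{align*}
\text{score} &= M\,\#M + X\,\#X + S\,\#S + G\,\#G,
  \qquad \#M = \tfrac{n+m-\#S}{2} - \#X \\
 &= \tfrac{M}{2}\,(n+m) + (X - M)\,\#X + \bigl(S - \tfrac{M}{2}\bigr)\,\#S + G\,\#G \\
 &= 50\,(n+m) - 200\,\#X - 80\,\#S - 400\,\#G
  \qquad (M=100,\ X=-100,\ S=-30,\ G=-400).
\end{align*}
```

The term 50(n + m) is the same for every alignment of the two sequences, so it does not affect which summary is optimal.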
Needleman-Wunsch Alignment

What is usually done is that a single set of parameters is specified (M = 100, X = -100, S = -30, G = -400 is a standard default) and then the optimal vertex is identified using dynamic programming. An alignment optimal for the vertex is then selected. The running time of the algorithm is O(nm) [Needleman-Wunsch 1970, Smith-Waterman 1981] and it requires O(n+m) space [Hirschberg 1975].

Standard scoring schemes are:

Parameters             Model
M, X, S                Jukes-Cantor with linear gap penalty
M, X, S, G             Jukes-Cantor with affine gap penalty
M, X_TS, X_TV, S, G    Kimura 2-parameter with affine gap penalty
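Needleman-Wunsch algorithm

W[i,j] = S*W[i-1,j] + S*W[i,j-1] + (X or M)*W[i-1,j-1]

Max-plus: read + as max and * as +, which gives

W[i,j] = max( W[i-1,j] + S, W[i,j-1] + S, W[i-1,j-1] + (X or M) )

W[i,j] is the score of the best alignment of positions [1,i] and [1,j] in each sequence.

[Figure: dynamic programming grid for the sequences AACATTAGA and AGATTACCACA]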
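The max-plus recurrence can be sketched as follows. This is a minimal illustration of the linear gap model (parameters M, X, S only, with the values quoted above); it ignores the gap-open parameter G, uses O(nm) space rather than Hirschberg's linear-space method, and the function name `needleman_wunsch` is just a label for this sketch.

```python
def needleman_wunsch(a: str, b: str, M: int = 100, X: int = -100, S: int = -30):
    """Global alignment with a linear gap penalty (no gap-open term G).

    W[i][j] = max(W[i-1][j] + S, W[i][j-1] + S, W[i-1][j-1] + (M or X))
    Returns (score, aligned_a, aligned_b) for one optimal alignment.
    """
    n, m = len(a), len(b)
    W = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        W[i][0] = i * S          # a[:i] aligned entirely against spaces
    for j in range(1, m + 1):
        W[0][j] = j * S          # b[:j] aligned entirely against spaces
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = M if a[i - 1] == b[j - 1] else X
            W[i][j] = max(W[i - 1][j] + S, W[i][j - 1] + S, W[i - 1][j - 1] + pair)

    # Trace back from the corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = (M if a[i - 1] == b[j - 1] else X) if i > 0 and j > 0 else None
        if pair is not None and W[i][j] == W[i - 1][j - 1] + pair:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and W[i][j] == W[i - 1][j] + S:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return W[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


score, top, bottom = needleman_wunsch("AACATTAGA", "AGATTACCACA")
print(score)
print(top)
print(bottom)
```

Changing M, X and S and re-running shows how the reported alignment depends on the parameters, which is the motivation for the parametric view.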
Building Drosophila whole genome multiple alignments
MAVID
MULTIZ (currently no D. erecta)
MAVID

DroAna_ _ CTGAAGGAAT TCTATATT AAAGAAGATTTCTCATCATTGGTTG
DroMel_4_ CTGCGGGATTAGGGGTCATTAGAGT GCCGAAAAGCGA GTTT
DroMoj_ _ CTGGAATAGTTAATTTCATTGTAACACATAAACGTTTTAAATTCTATTGAAA
DroPse_1_ CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG----
DroSim_ _ CTGCGGGATTAGGAGTCATTAGAGT GCGGAAAAGCGG GTT-
DroVir_ _ CTGCAGCAGTTAAATA-ATTGTAATAAACAATTCTCT--AATTTGGTCCAAA
DroYak_1_ CTGCGGGATTAGCGGTCATTGGTGT GAAGAATAGATC CTTT
          *** * * *

DroAna_ _ AATC-----ACTTAC
DroMel_4_ ATTCTATGGACTCAC
DroMoj_ _ ----TATTTACTCAC
DroPse_1_ TGTACTTAC
DroSim_ _ ATTCTATGGACTCAC
DroVir_ _ ----TATTTACTCAC
DroYak_1_ ATTTCATAAACTCAC
          *** **

N. Bray and L. Pachter, MAVID: Constrained ancestral alignment of multiple sequences, Genome Research 14 (2004).
Needleman-Wunsch
MULTIZ

droAna              CTGAAGGAATTCTA--TATTAAAG
dm2.chr2L           CTGCGGGATTAGGGGTCATTAGAG TGCCGAAAAGCGAGT-TTATTC
droMoj1.contig_2959 CTGGAATAGTTAATTTCATTGTAA CACATAAA--CGTTTTAAATTC
dp3.chr4_group3     CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG
droSim1.chr2L       CTGCGGGATTAGGAGTCATTAGAG TGCGGAAAAGCGGG--TTATTC
droVir1.scaffold_6  CTGCAGCAGTTAA-ATAATTGTAA TAAACAA----TTCTCTAATTT
droYak1.chr2L       CTGCGGGATTAGCGGTCATTGGTG TGAAGAATAGATCCT-TTATTT
                    *** * * * *

droAna              AAGATTTCTCATCATTGGTTGAATC ACTTAC
dm2.chr2L           TATGGACTCAC
droMoj1.contig_     AAATATTT TATTGACTCAC
dp3.chr4_group      TGT--ACTTAC
droSim1.chr2L       TATGGACTCAC
droVir1.scaffold_   AAATATTTGGTCCACTCAC
droYak1.chr2L       CATAAACTCAC
                    *** **

Blanchette et al., Aligning multiple sequences with the threaded blockset aligner, Genome Research 14 (2004).
Needleman-Wunsch
One (possibly wrong) alignment is not enough: the history of parametric inference

1992: Waterman, M., Eggert, M. & Lander, E. Parametric sequence comparisons, Proc. Natl. Acad. Sci. USA 89.
Gusfield, D., Balasubramanian, K. & Naor, D. Parametric optimization of sequence alignment, Algorithmica 12.
Wang, L., Zhao, J. Parametric alignment of ordered trees, Bioinformatics.
Fernández-Baca, D., Seppäläinen, T. & Slutzki, G. Parametric Multiple Sequence Alignment and Phylogeny Construction, Journal of Discrete Algorithms.
XPARAL by Kristian Stevens and Dan Gusfield
Whole Genome Parametric Alignment
Colin Dewey, Peter Huggins, Lior Pachter, Bernd Sturmfels and Kevin Woods

Mathematics and Computer Science
  Parametric alignment in higher dimensions.
  Faster new algorithms.
  Deeper understanding of alignment polytopes.

Biology
  Whole genome parametric alignment.
  Biological implications of alignment parameters.
  Alignment with biology rather than for biology.