1
Annotation and Alignment of the Drosophila Genomes
4
Genes or Regulation?
“10,516 putative orthologs have been identified as a core gene set conserved over 25–55 million years (Myr) since the pseudoobscura/melanogaster divergence”
“Cis-regulatory sequences are more conserved than random and nearby sequences between the species—but the difference is slight, suggesting that the evolution of cis-regulatory elements is flexible”
Richards et al., Comparative genome sequencing of Drosophila pseudoobscura: Chromosomal, gene, and cis-element evolution, Genome Res., Jan 2005.
5
Genes or Regulatory Elements?
“10,516 10,867 putative orthologs have been identified as a core gene set conserved over 25–55 million years (Myr) since the pseudoobscura/melanogaster divergence”
“Cis-regulatory sequences are more conserved than random and nearby sequences between the species—but the difference is slight, suggesting that the evolution of cis-regulatory elements is flexible”
Richards et al., Comparative genome sequencing of Drosophila pseudoobscura: Chromosomal, gene, and cis-element evolution, Genome Res., Jan 2005.
6
BP England, U Heberlein, R Tjian. Purified Drosophila transcription factor, Adh distal factor-1 (Adf-1), binds to sites in several Drosophila promoters and activates transcription, J Biol Chem 1990.
7
S. Chatterji and L. Pachter, GeneMapper: Reference based annotation with GeneMapper, 2005.
8
Genes or Regulatory Elements?
“10,516 10,867 putative orthologs have been identified as a core gene set conserved over 25–55 million years (Myr) since the pseudoobscura/melanogaster divergence”
“Cis-regulatory sequences are more conserved than random and nearby sequences between the species—but the difference is slight, suggesting that the evolution of cis-regulatory elements is flexible”
Richards et al., Comparative genome sequencing of Drosophila pseudoobscura: Chromosomal, gene, and cis-element evolution, Genome Res., Jan 2005.
9
http://rana.lbl.gov/drosophila/
10
Alignment of coding sequence
DroAna_20041206_  GTCGCTCAACCAGCATTTGCAAAAGTCGCAGAACTTGCGCTCATTGGATTTCCAGTACTC
DroMel_4_         GTCGCTCAGCCAGCATTTGCAGAAGTCGCAGAACTTGCGCTCGTTTGATTTCCAGTACTC
DroMoj_20041206_  GTCGCTTAACCAGCATTTACAGAAATCGCAATACTTGCGTTCATTGGATTTCCAGTACTC
DroPse_1_         GTCGCTCAGCCAGCACTTGCAGAAGTCGCAGTACTTGCGCTCGTTTGATTTCCAGAATTC
DroSim_20040829_  GTCGCTCAGCCAGCATTTGCAGAAGTCGCAGAACTTGCGCTCGTTTGATTTCCAGTACTC
DroVir_20041029_  GTCGCTCAACCAGCATTTGCAGAAGTCGCAATACTTGCGTTCATTCGACTTCCAGTACTC
DroYak_1_         GTCGCTCAGCCAGCATTTGCAGAAGTCGCAGAACTTCCGCTCGTTTGACTTCCAGTACTC
                  ****** * ****** ** ** ** *****  **** ** ** ** ** ****** * **

Alignment of non-coding sequence
DroAna_20041206_  CTGAAGGAAT-------TCTATATT---------AAAGAAGATTTCTCATCATTGGTTG
DroMel_4_         CTGCGGGATTAGGGGTCATTAGAGT---------GCCGAAAAGCGA---------GTTT
DroMoj_20041206_  CTGGAATAGTTAATTTCATTGTAACACATAAACGTTTTAAATTCTATTGAAA-------
DroPse_1_         CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG----
DroSim_20040829_  CTGCGGGATTAGGAGTCATTAGAGT---------GCGGAAAAGCGG---------GTT-
DroVir_20041029_  CTGCAGCAGTTAAATA-ATTGTAATAAACAATTCTCT--AATTTGGTCCAAA-------
DroYak_1_         CTGCGGGATTAGCGGTCATTGGTGT---------GAAGAATAGATC---------CTTT
                  ***    * *         *

DroAna_20041206_  AATC-----ACTTAC
DroMel_4_         ATTCTATGGACTCAC
DroMoj_20041206_  ----TATTTACTCAC
DroPse_1_         ------TGTACTTAC
DroSim_20040829_  ATTCTATGGACTCAC
DroVir_20041029_  ----TATTTACTCAC
DroYak_1_         ATTTCATAAACTCAC
                           *** **
11
Alignment of coding sequence
DroAna_20041206_  GTCGCTCAACCAGCATTTGCAAAAGTCGCAGAACTTGCGCTCATTGGATTTCCAGTACTC
DroMel_4_         GTCGCTCAGCCAGCATTTGCAGAAGTCGCAGAACTTGCGCTCGTTTGATTTCCAGTACTC
DroMoj_20041206_  GTCGCTTAACCAGCATTTACAGAAATCGCAATACTTGCGTTCATTGGATTTCCAGTACTC
DroPse_1_         GTCGCTCAGCCAGCACTTGCAGAAGTCGCAGTACTTGCGCTCGTTTGATTTCCAGAATTC
DroSim_20040829_  GTCGCTCAGCCAGCATTTGCAGAAGTCGCAGAACTTGCGCTCGTTTGATTTCCAGTACTC
DroVir_20041029_  GTCGCTCAACCAGCATTTGCAGAAGTCGCAATACTTGCGTTCATTCGACTTCCAGTACTC
DroYak_1_         GTCGCTCAGCCAGCATTTGCAGAAGTCGCAGAACTTCCGCTCGTTTGACTTCCAGTACTC
                  ****** * ****** ** ** ** *****  **** ** ** ** ** ****** * **

Alignment of non-coding sequence
droAna1.2448876      CTGAAGGAATTCTA--TATTAAAG-------------------------------
dm2.chr2L            CTGCGGGATTAGGGGTCATTAGAG---------TGCCGAAAAGCGAGT-TTATTC
droMoj1.contig_2959  CTGGAATAGTTAATTTCATTGTAA---------CACATAAA--CGTTTTAAATTC
dp3.chr4_group3      CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG
droSim1.chr2L        CTGCGGGATTAGGAGTCATTAGAG---------TGCGGAAAAGCGGG--TTATTC
droVir1.scaffold_6   CTGCAGCAGTTAA-ATAATTGTAA---------TAAACAA----TTCTCTAATTT
droYak1.chr2L        CTGCGGGATTAGCGGTCATTGGTG---------TGAAGAATAGATCCT-TTATTT
                     ***    * *       * *

droAna1.2448876      AAGATTTCTCATCATTGGTTGAATC---------------------ACTTAC
dm2.chr2L            -----------------------------------------TATGGACTCAC
droMoj1.contig_2959  -------------------------AAATATTT--------TATTGACTCAC
dp3.chr4_group3      -----------------------------------------TGT--ACTTAC
droSim1.chr2L        -----------------------------------------TATGGACTCAC
droVir1.scaffold_6   ---------------------------------AAATATTTGGTCCACTCAC
droYak1.chr2L        -----------------------------------------CATAAACTCAC
                                                                   *** **
12
Per site analysis
Group 1 mean per site % identity: 51.3%, 47.8%
Group 2 mean per site % identity: 47.8%, 42.9%
Difference of means (group 1 – group 2): 3.6%, 8.4%, 4.9%
Difference of means resampling p-value: 0.05, 0.003, 1E-5
Distribution comparison KS p-value: 0.026, 0.0016, 2E-6

Per base analysis
Group 1 mean per base % identity: 47.8%, 46.3%
Group 2 mean per base % identity: 46.3%, 42.4%
Difference of means (group 1 – group 2): 1.5%, 5.4%, 3.9%

Richards et al., Comparative genome sequencing of Drosophila pseudoobscura: Chromosomal, gene, and cis-element evolution, Genome Res., Jan 2005.
13
How is an alignment made from two sequences?
Given two sequences of lengths n, m (n=56, m=64):
>dm2.chr2L
CTGCGGGATTAGGGGTCATTAGAGTGCCGAAAAGCGAGTTTATTCTATGGACTCAC
>dp3.chr4_group3
CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCGTGTACTTAC

dm2.chr2L        CTGCGGGATTAGGGGTCATTAGAG---------TGCCGAAAAGCGAGT-TTATTC
dp3.chr4_group3  CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG
dm2.chr2L        TATGGACTCAC
dp3.chr4_group3  TGT--ACTTAC
15
Each alignment can be summarized by counting the number of matches (#M), mismatches (#X), gaps (#G), and spaces (#S).

dm2.chr2L        CTGCGGGATTAGGGGTCATTAGAG---------TGCCGAAAAGCGAGT-TTATTC
dp3.chr4_group3  CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG
dm2.chr2L        TATGGACTCAC
dp3.chr4_group3  TGT--ACTTAC
#M=31, #X=22, #G=3, #S=12

DroMel_4_  CTGCGGGATTAGGGGTCATTAGAGT---------GCCGAAAAGCGA---------GTTT
DroPse_1_  CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG----
DroMel_4_  ATTCTATGGACTCAC
DroPse_1_  ------TGTACTTAC
#M=27, #X=18, #G=3, #S=28

2(#M+#X)+#S=112, so #X, #G and #S suffice to specify a summary.
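A minimal Python sketch of how such a summary could be counted from two aligned rows. It assumes that a "space" is a single `-` character and a "gap" is a maximal run of `-` within one row, which is consistent with the #G and #S values quoted above. The function name `summarize` is hypothetical and not from the slides.

```python
def summarize(row_a: str, row_b: str) -> dict:
    """Count matches (M), mismatches (X), gaps (G) and spaces (S)
    for one pairwise alignment given as two equal-length rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    counts = {"M": 0, "X": 0, "G": 0, "S": 0}
    prev_gap_row = None  # row that held '-' in the previous column, if any
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two '-' characters")
        if a == "-" or b == "-":
            gap_row = "a" if a == "-" else "b"
            counts["S"] += 1              # every '-' is one space
            if gap_row != prev_gap_row:   # a new run of '-' opens a gap
                counts["G"] += 1
            prev_gap_row = gap_row
        else:
            counts["M" if a == b else "X"] += 1
            prev_gap_row = None
    return counts


# Toy input: the short second block of the first alignment on this slide.
toy = summarize("TATGGACTCAC", "TGT--ACTTAC")
print(toy)
```

On this toy block the two rows contain 11 and 9 letters, and the counts satisfy 2(#M+#X)+#S = 11+9, the same kind of identity the slide uses to drop #M from the summary.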
17
The summary of an alignment is a point in 3 dimensional space. For example, the two alignments just shown correspond to the points: (22,3,12)(18,3,28) In the example of our two sequences there are 434615666279134990029695804618937526970374145 different alignments.
19
The summary of an alignment is a point in 3 dimensional space. For example, the two alignments just shown correspond to the points: (22,3,12)(18,3,28) In the example of our two sequences there are 379522884096444556699773447791552717765633 different alignments, but only 53890 different summaries. So we don’t need to plot that many points. But 53890 is still quite a large number. Fortunately, there are only 69 vertices on the convex hull of the 53890 points. These are the interesting ones, and we can even draw them…
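To give a feel for why the number of alignments is so large, here is a hedged sketch that counts alignments under one common convention: every column is a match/mismatch, a space in the first row, or a space in the second row, and every distinct column sequence counts once (these are the Delannoy numbers). The slides do not state which counting convention produced the figure quoted above, so this sketch is not claimed to reproduce it. The name `count_alignments` is hypothetical.

```python
def count_alignments(n: int, m: int) -> int:
    """Number of pairwise alignments of sequences of lengths n and m,
    counting every sequence of columns (diagonal, down, right) once."""
    row = [1] * (m + 1)          # aligning an empty prefix: one way
    for _ in range(n):
        new = [1]
        for j in range(1, m + 1):
            # last column is: space in row b, space in row a, or a pair
            new.append(row[j] + new[j - 1] + row[j - 1])
        row = new
    return row[m]


for size in (1, 2, 3, 10, 20):
    print(size, count_alignments(size, size))
```

The count grows exponentially with sequence length, which is why the slides work with summaries and the vertices of their convex hull instead of individual alignments.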
20
For the sequences:
>mel
CTGCGGGATTAGGGGTCATTAGAGTGCCGAAAAGCGAGTTTATTCTATGGAC
>pse
CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCGTGTAC
the alignment polytope is:
49 #x=24, #S=10, #G=2
There are eight alignments that have this summary.
21
mel  CTGCGGGATTAGGGGTCATTAGAGT---------GCCGAAAAGCGAGTTTATTCTATGGAC
pse  CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATC-GTGTAC

mel  CTGCGGGATTAGGGGTCATTAGAGT---------GCCGAAAAGCGAGTTTATTCTATGGAC
pse  CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG-TGTAC

mel  CTGCGGGATTAGGGGTCATTAGAG---------TGCCGAAAAGCGAGTTTATTCTATGGAC
pse  CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATC-GTGTAC

mel  CTGCGGGATTAGGGGTCATTAGAG---------TGCCGAAAAGCGAGTTTATTCTATGGAC
pse  CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG-TGTAC

mel  CTGCGGGATTAGGGGTCATTAGA---------GTGCCGAAAAGCGAGTTTATTCTATGGAC
pse  CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATC-GTGTAC

mel  CTGCGGGATTAGGGGTCATTAGA---------GTGCCGAAAAGCGAGTTTATTCTATGGAC
pse  CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG-TGTAC

mel  CTGCGGGATTAGGGGTCATTAG---------AGTGCCGAAAAGCGAGTTTATTCTATGGAC
pse  CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATC-GTGTAC

mel  CTGCGGGATTAGGGGTCATTAG---------AGTGCCGAAAAGCGAGTTTATTCTATGGAC
pse  CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG-TGTAC
22
Consensus at a vertex
mel  CTGCGGGATTAGGGGTCATTAGAGT===------===GCCGAAAAGCGAGTTTATTCTA=TGGAC
pse  CTGGAAGAGTTTTGATTAGTAG===GGGATCCATGGGGGCGAGGAGAGGCCATCATC==GTGTAC
23
The vertices of the polytope have special significance. Given parameters for a model, e.g. the default parameters for MULTIZ (M = 100, X = -100, S = -30, G = -400), the optimal summary is found by maximizing the linear form -200*(#X) - 400*(#G) - 80*(#S) over the polytope. Thus, the vertices of the polytope correspond to optimal alignments.
49 #x=24, #S=10, #G=2
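The coefficients of this linear form can be traced back to the four parameters. The sketch below assumes the alignment score is the parameter-weighted sum of the four counts, and uses the identity 2(#M+#X)+#S = n+m (each match or mismatch column uses two letters, each space column one) to eliminate #M. Here n and m stand for the two sequence lengths.

```latex
\begin{align*}
\text{score} &= M\,\#M + X\,\#X + S\,\#S + G\,\#G, \\
\#M &= \tfrac{1}{2}\,(n + m - \#S) - \#X
      \quad\text{since } 2(\#M + \#X) + \#S = n + m, \\
\text{score} &= \tfrac{M}{2}\,(n + m) - (M - X)\,\#X
      - \left(\tfrac{M}{2} - S\right)\#S + G\,\#G \\
  &= 50\,(n + m) - 200\,\#X - 80\,\#S - 400\,\#G
      \quad\text{for } M = 100,\ X = -100,\ S = -30,\ G = -400 .
\end{align*}
```

The first term is the same for every alignment of the two sequences, so maximizing the score is equivalent to maximizing -200*(#X) - 400*(#G) - 80*(#S).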
24
Needleman-Wunsch Alignment
Usually a single set of parameters is specified (M = 100, X = -100, S = -30, G = -400 is a standard default), and the optimal vertex is then identified using dynamic programming. An alignment that is optimal for that vertex is then selected. The running time of the algorithm is O(nm) [Needleman-Wunsch 1970; Smith-Waterman 1981] and it requires O(n+m) space [Hirschberg 1975].
Standard scoring schemes are:
Parameters            Model
M, X, S               Jukes-Cantor with linear gap penalty
M, X, S, G            Jukes-Cantor with affine gap penalty
M, X_TS, X_TV, S, G   Kimura 2-parameter with affine gap penalty
25
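Needleman-Wunsch algorithm
$W_{i,j} = S \cdot W_{i-1,j} + S \cdot W_{i,j-1} + (X \text{ or } M) \cdot W_{i-1,j-1}$
Here $W_{i,j}$ is the score of the best alignment of positions [1,i] and [1,j] in each sequence. The recursion is read in the max-plus algebra, where + means max and $\cdot$ means ordinary addition: $W_{i,j} = \max\bigl(S + W_{i-1,j},\; S + W_{i,j-1},\; (X \text{ or } M) + W_{i-1,j-1}\bigr)$, taking M when the two letters match and X otherwise.
[Figure: dynamic programming matrix for the sequences AACATTAGA and AGATTACCACA]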
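The recursion on this slide can be sketched in Python as follows. This is a minimal illustration assuming the linear gap model (parameters M, X, S only, with the values quoted earlier as defaults) and a full O(nm) table; it does not include the affine gap parameter G or Hirschberg's linear-space refinement. The function name `needleman_wunsch` and the tie-breaking order in the traceback are choices made for this sketch.

```python
def needleman_wunsch(a: str, b: str, M: int = 100, X: int = -100, S: int = -30):
    """Global alignment with a linear gap penalty.
    Returns (best score, aligned row for a, aligned row for b)."""
    n, m = len(a), len(b)
    # W[i][j] = best score for aligning a[:i] with b[:j]
    W = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        W[i][0] = i * S
    for j in range(1, m + 1):
        W[0][j] = j * S
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = M if a[i - 1] == b[j - 1] else X
            W[i][j] = max(W[i - 1][j] + S,         # space in row b
                          W[i][j - 1] + S,         # space in row a
                          W[i - 1][j - 1] + pair)  # match or mismatch
    # Trace back one optimal path (diagonal moves preferred on ties).
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = M if i > 0 and j > 0 and a[i - 1] == b[j - 1] else X
        if i > 0 and j > 0 and W[i][j] == W[i - 1][j - 1] + pair:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and W[i][j] == W[i - 1][j] + S:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return W[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


score, top, bottom = needleman_wunsch("AACATTAGA", "AGATTACCACA")
print(score)
print(top)
print(bottom)
```

Only one optimal alignment is returned; as the later slides stress, other alignments may be optimal for the same or for nearby parameter values.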
26
Building Drosophila whole genome multiple alignments
MAVID: http://hanuman.math.berkeley.edu/kbrowser
MULTIZ: http://genome.ucsc.edu/ (currently no D. erecta)
27
MAVID
DroAna_20041206_  CTGAAGGAAT-------TCTATATT---------AAAGAAGATTTCTCATCATTGGTTG
DroMel_4_         CTGCGGGATTAGGGGTCATTAGAGT---------GCCGAAAAGCGA---------GTTT
DroMoj_20041206_  CTGGAATAGTTAATTTCATTGTAACACATAAACGTTTTAAATTCTATTGAAA-------
DroPse_1_         CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG----
DroSim_20040829_  CTGCGGGATTAGGAGTCATTAGAGT---------GCGGAAAAGCGG---------GTT-
DroVir_20041029_  CTGCAGCAGTTAAATA-ATTGTAATAAACAATTCTCT--AATTTGGTCCAAA-------
DroYak_1_         CTGCGGGATTAGCGGTCATTGGTGT---------GAAGAATAGATC---------CTTT
                  ***    * *         *

DroAna_20041206_  AATC-----ACTTAC
DroMel_4_         ATTCTATGGACTCAC
DroMoj_20041206_  ----TATTTACTCAC
DroPse_1_         ------TGTACTTAC
DroSim_20040829_  ATTCTATGGACTCAC
DroVir_20041029_  ----TATTTACTCAC
DroYak_1_         ATTTCATAAACTCAC
                           *** **

N. Bray and L. Pachter, MAVID: Constrained ancestral alignment of multiple sequences, Genome Research 14 (2004) p 693--699
28
Needleman-Wunsch
29
MULTIZ
droAna1.2448876      CTGAAGGAATTCTA--TATTAAAG-------------------------------
dm2.chr2L            CTGCGGGATTAGGGGTCATTAGAG---------TGCCGAAAAGCGAGT-TTATTC
droMoj1.contig_2959  CTGGAATAGTTAATTTCATTGTAA---------CACATAAA--CGTTTTAAATTC
dp3.chr4_group3      CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG
droSim1.chr2L        CTGCGGGATTAGGAGTCATTAGAG---------TGCGGAAAAGCGGG--TTATTC
droVir1.scaffold_6   CTGCAGCAGTTAA-ATAATTGTAA---------TAAACAA----TTCTCTAATTT
droYak1.chr2L        CTGCGGGATTAGCGGTCATTGGTG---------TGAAGAATAGATCCT-TTATTT
                     ***    * *       * *

droAna1.2448876      AAGATTTCTCATCATTGGTTGAATC---------------------ACTTAC
dm2.chr2L            -----------------------------------------TATGGACTCAC
droMoj1.contig_2959  -------------------------AAATATTT--------TATTGACTCAC
dp3.chr4_group3      -----------------------------------------TGT--ACTTAC
droSim1.chr2L        -----------------------------------------TATGGACTCAC
droVir1.scaffold_6   ---------------------------------AAATATTTGGTCCACTCAC
droYak1.chr2L        -----------------------------------------CATAAACTCAC
                                                                   *** **

Blanchette et al., Aligning multiple sequences with the threaded blockset aligner, Genome Research 14 (2004) p 708--715
30
Needleman-Wunsch
31
One (possibly wrong) alignment is not enough: the history of parametric inference
1992: Waterman, M., Eggert, M. & Lander, E. Parametric sequence comparisons, Proc. Natl. Acad. Sci. USA 89, 6090-6093.
1994: Gusfield, D., Balasubramanian, K. & Naor, D. Parametric optimization of sequence alignment, Algorithmica 12, 312-326.
2003: Wang, L. & Zhao, J. Parametric alignment of ordered trees, Bioinformatics 19, 2237-2245.
2004: Fernández-Baca, D., Seppäläinen, T. & Slutzki, G. Parametric Multiple Sequence Alignment and Phylogeny Construction, Journal of Discrete Algorithms 2, 271-287.
XPARAL by Kristian Stevens and Dan Gusfield
32
Whole Genome Parametric Alignment
Colin Dewey, Peter Huggins, Lior Pachter, Bernd Sturmfels and Kevin Woods
Mathematics and Computer Science
Parametric alignment in higher dimensions.
Faster new algorithms.
Deeper understanding of alignment polytopes.
Biology
Whole genome parametric alignment.
Biological implications of alignment parameters.
Alignment with biology rather than for biology.