Published by Colleen Bruce. Modified over 9 years ago.
1 Comparative Sequence Analysis in Molecular Biology
Martin Tompa
Computer Science & Engineering, Genome Sciences
University of Washington, Seattle, Washington, U.S.A.
3 Outline
- What genome data is available?
- What is phylogenetic footprinting?
- Phylogenetic footprinting by multiple sequence alignment
- Which parts of multiple sequence alignments are trustworthy?
- FootPrinter: phylogenetic footprinting without alignment
5 How Many Genomes Are Available?
- 46 vertebrate genomes sequenced (primates, rodents, marsupials, birds, fishes)
- 1766 bacterial genomes sequenced (as of 2/12/2012)
- Insects, fungi, worms, plants, …
- Many more will be finished very soon
- Fertile ground for comparative genomics
6 1982-2003: the number of nucleotides in GenBank doubled every 18 months.
Since 2003: it has doubled every 3 years.
8 Phylogenetic Footprinting (Tagle et al. 1988)
Functional regions of DNA (regions under “purifying constraint”) evolve more slowly than nonfunctional ones.
1. Consider a set of corresponding DNA sequences from related species.
2. Identify unusually well conserved subsequences (i.e., ones that have not mutated much over the course of evolution): “motifs”.
11 How to Find Conserved Motifs
ACTAACCGGGAGATTTCAGA   human
AAGTTCCGGGAGATTTCCA    chimp
TAGTTATCCGGGAGATTAGA   mouse
AAAACCGGTAGATTTCAGG    rat
12 Multiple Sequence Alignment
AC--TAACCGGGAGATTTCAGA   human
AAGTT--CCGGGAGATTTCC-A   chimp
TAGTTATCCGGGAGATT--AGA   mouse
AA---AACCGGTAGATTTCAGG   rat
(Finding the optimal alignment is NP-complete.)
13 Phylogenetic Footprinting
1. Use a whole-genome multiple alignment, such as the one provided by the UCSC Genome Browser.
2. Search for regions of well conserved alignment.
   - Regulatory elements [Cliften; Kellis; Kolbe; Prakash; Woolfe; Xie (2)]
   - RNA elements [Pedersen; Washietl]
   - General conservation & constraint [Bejerano; Boffelli; Cooper; Margulies (4); Pollard; Prabhakar; Siepel]
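The "search for well conserved regions" step can be illustrated with a minimal sketch on the four-species alignment from the "Multiple Sequence Alignment" slide. The function name `conserved_runs`, the `min_len` parameter, and the strict criterion (every species identical, no gaps) are assumptions made for illustration; real footprinting pipelines use statistical conservation scores rather than exact identity.

```python
# Toy alignment copied from the "Multiple Sequence Alignment" slide.
alignment = {
    "human": "AC--TAACCGGGAGATTTCAGA",
    "chimp": "AAGTT--CCGGGAGATTTCC-A",
    "mouse": "TAGTTATCCGGGAGATT--AGA",
    "rat":   "AA---AACCGGTAGATTTCAGG",
}

def conserved_runs(seqs, min_len=4):
    """Return (start, end, consensus) for each run of at least min_len
    consecutive columns in which all rows carry the same base (no gaps).
    Coordinates are 0-based alignment columns, end exclusive."""
    rows = list(seqs.values())
    width = len(rows[0])
    assert all(len(r) == width for r in rows), "rows must have equal length"
    runs, start = [], None
    # Iterate one column past the end so a run touching the last column is closed.
    for col in range(width + 1):
        column = {r[col] for r in rows} if col < width else set()
        identical = len(column) == 1 and "-" not in column
        if identical and start is None:
            start = col
        elif not identical and start is not None:
            if col - start >= min_len:
                runs.append((start, col, rows[0][start:col]))
            start = None
    return runs

runs = conserved_runs(alignment)
for start, end, motif in runs:
    print(start, end, motif)
```

On this toy input the single mismatch in rat (T instead of G) splits the conserved block in two, which is one reason exact identity is too strict a criterion in practice.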
15 Why Doubt Alignments?
- Multiple sequence alignment of short sequences (proteins, promoters) is difficult (NP-complete)
- Aligning whole genomes adds the complications of huge sequences and genomic rearrangements
- Vertebrate alignment has 3.8 billion columns
- Automatically generated
16 Assessing 4 Genome-Size Alignments (with Xiaoyu Chen)
- Alignments: MLAGAN [Brudno 2003], MAVID [Bray 2003], TBA [Blanchette 2003], Pecan [Paten 2008]
- Target ENCODE regions: 30 Mbp covering 1% of the human genome (ENCODE targets)
- Total input: 554 Mbp over 28 vertebrates
- Rich resource for comparing and assessing genome-size alignments
- Margulies et al. 2007, Genome Research
17-21 Coverage of each alignment
[Figure: coverage of each alignment]
Alignment coverage: number of human bases aligned to a given species.
- In noncoding regions, coverage decreases as species distance from human increases
- MAVID has the lowest coverage
- The other 3 have comparable coverage in placental mammals
- MLAGAN has the highest coverage in distant species, in intronic and intergenic regions
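As a small worked example of the coverage definition (number of human bases aligned to a given species), the sketch below counts columns where both the human row and the other species' row hold a base. It reuses the toy alignment from the "Multiple Sequence Alignment" slide; the function name `coverage` is hypothetical, and counting a column only when neither row has a gap is an assumed reading of "aligned".

```python
# Toy alignment rows from the "Multiple Sequence Alignment" slide.
human = "AC--TAACCGGGAGATTTCAGA"
others = {
    "chimp": "AAGTT--CCGGGAGATTTCC-A",
    "mouse": "TAGTTATCCGGGAGATT--AGA",
    "rat":   "AA---AACCGGTAGATTTCAGG",
}

def coverage(reference_row, other_row):
    """Count reference bases that sit in a column where the other
    species also has a base (i.e. neither row has a gap there)."""
    return sum(1 for a, b in zip(reference_row, other_row)
               if a != "-" and b != "-")

per_species = {name: coverage(human, row) for name, row in others.items()}
print(per_species)
```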
22-28 Level of agreement among alignments
[Figure: Agree%, Disagree%, and Unique% for TBA (T), MAVID (V), and MLAGAN (L); panels for coding, UTR, intronic, and intergenic bases]
- Agree%: Coding > UTR > Int.
- Unique%: Coding < UTR < Int.
- As species distance from human increases, Agree% decreases and Unique% increases
- Primates: high Agree%
- Placental nonprimates: Agree% > 0.5
- Distant species, Int.: low Agree%, high Unique%
29 Alignment agreement for mouse
[Figure: Agree%, Disagree%, and Unique% for intronic and intergenic bases]
- Intronic and intergenic bases account for 95% of mouse bases aligned to human
- Agree% in those categories: 44% to 62%
- Much worse for more distant species
- Building a reliable MSA remains challenging
30 Which Alignment Columns to Trust? (with Amol Prakash, generalizing Karlin and Altschul 1990)
Goal: label each alignment column with a confidence measure of alignment correctness
   - Identify sequences that do not belong
- Users are forewarned about regions of interest
- Genome browser designers can consider realigning
- Alignment tool designers get feedback for possible improvements
31 Sample Suspicious Alignment
Human     -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC
Chimp     -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC
Rhesus    -----------GTTGCCATGC-AAAAATATTATGTCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC
Mouse     -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CGTGTCAA----------TTAACAC
Rat       -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CGTGTCAA----------TTAACAC
Dog       -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC
Cow       -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC
Elephant  -----------GTTGCTATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC
Tenrec    -----------GTTGCCATAC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC
Opossum   -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATATCAA----------TTAACAC
Chicken   -----------GTTGCCATGCAAAAAATAATATGGCTTTACTAAAATTTACACAAC---CCTGACAA----------TTAACAC
Zebrafish GAACATATCCGAGTGCTGTAA-AATACTACTGGGA----ACCAGAAATG--ACAAGTTCCATGACAGCTTTGCCTTTTTGGCTC
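32 Scoring Function
Pairwise:
$$\mathrm{score}(\sigma_1, \sigma_2) = \log \frac{\Pr(\sigma_1, \sigma_2)}{\Pr(\sigma_1)\,\Pr(\sigma_2)}$$
Multiple (species 1-5: Human, Chimp, Mouse, Rat, Chicken), for the branch that separates Mouse and Rat from the other three:
$$\mathrm{sc}(\sigma_1 \sigma_2 \sigma_3 \sigma_4 \sigma_5 \mid T) = \log \frac{\Pr(\sigma_1 \sigma_2 \sigma_3 \sigma_4 \sigma_5 \mid T)}{\Pr(\sigma_1 \sigma_2 \sigma_5 \mid T)\,\Pr(\sigma_3 \sigma_4 \mid T)}$$
Notation assumed here: $\sigma_i$ is the residue of species $i$ in an alignment column, and $T$ is the phylogenetic tree.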
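To make the pairwise log-odds score concrete, here is a minimal numeric sketch. The joint distribution `joint` is entirely hypothetical (identical pairs are given probability 0.2 each and the 12 mismatched pairs share the remaining 0.2); it is not taken from the talk. The point is only that pairs seen together more often than chance score positive and the rest score negative.

```python
import math

BASES = "ACGT"

# Hypothetical joint distribution over aligned residue pairs:
# each identical pair gets 0.2, the 12 mismatched pairs share the other 0.2.
joint = {(a, b): (0.2 if a == b else 0.2 / 12) for a in BASES for b in BASES}

# Background (marginal) probabilities follow from the joint distribution.
marginal = {a: sum(joint[(a, b)] for b in BASES) for a in BASES}

def pair_score(a, b):
    """Log-odds score: log( Pr(a, b) / (Pr(a) * Pr(b)) )."""
    return math.log(joint[(a, b)] / (marginal[a] * marginal[b]))

match_score = pair_score("A", "A")     # pair seen more often than chance
mismatch_score = pair_score("A", "C")  # pair seen less often than chance
print(round(match_score, 3), round(mismatch_score, 3))
```

With these numbers each marginal is 0.25, so a match scores log(0.2 / 0.0625) = log 3.2.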
33 Outline of Computation
Input: multiple sequence alignment A
Output: discordance, max_k p_k
For each branch k of the tree {
    Compute scoring function sc_k (Felsenstein)
    Find all maximally scoring segments of A using sc_k (Ruzzo & Tompa)
    Compute K, λ using sc_k (Karlin & Altschul)
    Compute p-value p_k of each segment score using K, λ (Karlin & Altschul)
}
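Two steps of this outline can be sketched in a few lines, with simplifications that should be stated plainly. First, instead of the Ruzzo & Tompa algorithm for all maximal scoring segments, the sketch uses the simpler Kadane scan, which finds only the single best segment. Second, the p-value uses the Karlin & Altschul approximation P(S ≥ x) ≈ 1 − exp(−K n e^{−λx}) for a sequence of n columns; the values of K and λ passed in are placeholders, since computing them from the scoring function is not shown. The per-column scores are made up.

```python
import math

def best_segment(scores):
    """Kadane scan: return (best_sum, start, end) of the highest-scoring
    contiguous segment, with end exclusive. A stand-in for Ruzzo & Tompa,
    which would return every maximal segment, not just the best one."""
    best = (float("-inf"), 0, 0)
    run_sum, run_start = 0.0, 0
    for i, s in enumerate(scores):
        if run_sum <= 0:               # a non-positive prefix never helps
            run_sum, run_start = 0.0, i
        run_sum += s
        if run_sum > best[0]:
            best = (run_sum, run_start, i + 1)
    return best

def segment_p_value(score, n_columns, K, lam):
    """Karlin-Altschul style approximation: 1 - exp(-K * n * exp(-lam * score))."""
    return 1.0 - math.exp(-K * n_columns * math.exp(-lam * score))

# Hypothetical per-column scores sc_k for one branch k.
column_scores = [-1, 2, 3, -4, 2, 2, 2, -5, 1]
best = best_segment(column_scores)

# Placeholder parameters; real values of K and lam depend on sc_k.
p_k = segment_p_value(best[0], n_columns=len(column_scores), K=0.1, lam=1.0)

# Discordance is the maximum p-value over branches (values here are made up).
discordance = max({"branch_1": p_k, "branch_2": 0.002}.values())
print(best, p_k, discordance)
```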
34 Suspicious Alignment Regions
Back to the four ENCODE alignments spanning 30 Mbp of human aligned to 27 other vertebrates (with Xiaoyu Chen)
- Identify suspicious alignment regions:
   - Length ≥ 50 bp
   - Discordance ≥ 0.1 at each position, all with respect to the same worst species
   - Fewer than 50% gapped sites
- Suspicious%
   - Percentage of aligned bases in suspicious regions
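The three criteria can be sketched as a single pass over per-column annotations. Several things here are assumptions: the comparison operators for length and discordance (read as "at least 50 bp" and "at least 0.1"), the input layout (one `(discordance, worst_species, gap_fraction)` tuple per column), and the reading of "fewer than 50% gapped sites" as the mean per-column gap fraction over the region. The function name `suspicious_regions` is hypothetical.

```python
from itertools import groupby

def suspicious_regions(columns, min_len=50, min_disc=0.1, max_gap=0.5):
    """columns: list of (discordance, worst_species, gap_fraction), one per
    alignment column. Returns (start, end, species) with end exclusive."""
    def key(col):
        disc, species, _ = col
        # Columns below the discordance threshold never join a region.
        return species if disc >= min_disc else None

    regions, pos = [], 0
    # groupby yields maximal runs of consecutive columns with the same key,
    # i.e. runs that are discordant with respect to the same worst species.
    for species, group in groupby(columns, key=key):
        group = list(group)
        end = pos + len(group)
        if species is not None and len(group) >= min_len:
            mean_gap = sum(g for _, _, g in group) / len(group)
            if mean_gap < max_gap:
                regions.append((pos, end, species))
        pos = end
    return regions

# Made-up annotations: only the baboon run meets all three criteria.
columns = (
    [(0.01, "rat", 0.0)] * 10        # discordance too low
    + [(0.5, "baboon", 0.1)] * 60    # suspicious
    + [(0.5, "mouse", 0.1)] * 30     # too short
    + [(0.5, "chicken", 0.9)] * 70   # too many gaps
)
regions = suspicious_regions(columns)
print(regions)
```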
35-38 Alignment accuracy
[Figure: alignment accuracy; panels for coding, UTR, intronic, and intergenic bases]
39-40 Can suspicious alignments be improved?
[Figure: scatter plot for baboon and MLAGAN (for example), showing all points (x, y), where]
- x = human-baboon alignment score of an MLAGAN region suspicious for baboon
- y = human-baboon alignment score of an alternative alignment for the same human region that is not suspicious for baboon
Reference lines:
- y = x
- y - x = μ, where μ = average of y - x over all points
- y - x = μ ± σ, where σ = standard deviation of y - x over all points
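The reference lines in this plot reduce to the mean and standard deviation of the differences y − x. A minimal sketch follows; the points are invented for illustration, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the slide does not say which estimator was used.

```python
import statistics

def improvement_stats(points):
    """Return (mu, sigma) of y - x over all (x, y) points."""
    diffs = [y - x for x, y in points]
    return statistics.mean(diffs), statistics.pstdev(diffs)

# Hypothetical (x, y) alignment-score pairs; not data from the talk.
points = [(10, 12), (20, 25), (30, 31)]
mu, sigma = improvement_stats(points)
print(mu, sigma)
```

A positive μ would mean that, on average, the alternative (non-suspicious) alignment scores higher than the suspicious one for the same human region.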
41 Summary of comparisons (all categories)
[Figure: TBA, MAVID, MLAGAN, and Pecan compared for primates, other placental mammals, and distant species; panels are labeled “High is better” or “Low is better”]
42 Conclusions
1. Disturbing lack of agreement among alignments: alignment is still a hard problem
2. Performance of the aligners varies significantly by species group and region type, particularly for distant species and noncoding regions
44 DNA, Genes, and Proteins
- DNA: program for cell processes
- Proteins: execute cell processes
[Figure: a DNA sequence, with labels Gene, Protein, DNA]
45 Regulation of Genes
- What turns genes on and off?
- When is a gene turned on or off?
- Where (in which cells) is a gene turned on?
- How many copies of the gene product are produced?
46-47 Regulation of Genes
[Figure: DNA with labels Gene, Regulatory Element, Transcription Factor, RNA polymerase]
48 Goal
Identify regulatory elements in DNA sequences. These are:
- Binding sites for proteins
- Short subsequences (5-25 nucleotides)
- Up to 1000 nucleotides (or farther) from the gene
- Inexactly repeating patterns (“motifs”)
49 CLUSTALW multiple sequence alignment (rbcS gene)
Cotton    ACGGTT-TCCATTGGATGA---AATGAGATAAGAT---CACTGTGC---TTCTTCCACGTG--GCAGGTTGCCAAAGATA-------AGGCTTTACCATT
Pea       GTTTTT-TCAGTTAGCTTA---GTGGGCATCTTA----CACGTGGC---ATTATTATCCTA--TT-GGTGGCTAATGATA-------AGG--TTAGCACA
Tobacco   TAGGAT-GAGATAAGATTA---CTGAGGTGCTTTA---CACGTGGC---ACCTCCATTGTG--GT-GACTTAAATGAAGA-------ATGGCTTAGCACC
Ice-plant TCCCAT-ACATTGACATAT---ATGGCCCGCCTGCGGCAACAAAAA---AACTAAAGGATA--GCTAGTTGCTACTACAATTC--CCATAACTCACCACC
Turnip    ATTCAT-ATAAATAGAAGG---TCCGCGAACATTG--AAATGTAGATCATGCGTCAGAATT--GTCCTCTCTTAATAGGA-------A-------GGAGC
Wheat     TATGAT-AAAATGAAATAT---TTTGCCCAGCCA-----ACTCAGTCGCATCCTCGGACAA--TTTGTTATCAAGGAACTCAC--CCAAAAACAAGCAAA
Duckweed  TCGGAT-GGGGGGGCATGAACACTTGCAATCATT-----TCATGACTCATTTCTGAACATGT-GCCCTTGGCAACGTGTAGACTGCCAACATTAATTAAA
Larch     TAACAT-ATGATATAACAC---CGGGCACACATTCCTAAACAAAGAGTGATTTCAAATATATCGTTAATTACGACTAACAAAA--TGAAAGTACAAGACC

Cotton    CAAGAAAAGTTTCCACCCTC------TTTGTGGTCATAATG-GTT-GTAATGTC-ATCTGATTT----AGGATCCAACGTCACCCTTTCTCCCA-----A
Pea       C---AAAACTTTTCAATCT-------TGTGTGGTTAATATG-ACT-GCAAAGTTTATCATTTTC----ACAATCCAACAA-ACTGGTTCT---------A
Tobacco   AAAAATAATTTTCCAACCTTT---CATGTGTGGATATTAAG-ATTTGTATAATGTATCAAGAACC-ACATAATCCAATGGTTAGCTTTATTCCAAGATGA
Ice-plant ATCACACATTCTTCCATTTCATCCCCTTTTTCTTGGATGAG-ATAAGATATGGGTTCCTGCCAC----GTGGCACCATACCATGGTTTGTTA-ACGATAA
Turnip    CAAAAGCATTGGCTCAAGTTG-----AGACGAGTAACCATACACATTCATACGTTTTCTTACAAG-ATAAGATAAGATAATGTTATTTCT---------A
Wheat     GCTAGAAAAAGGTTGTGTGGCAGCCACCTAATGACATGAAGGACT-GAAATTTCCAGCACACACA-A-TGTATCCGACGGCAATGCTTCTTC--------
Duckweed  ATATAATATTAGAAAAAAATC-----TCCCATAGTATTTAGTATTTACCAAAAGTCACACGACCA-CTAGACTCCAATTTACCCAAATCACTAACCAATT
Larch     TTCTCGTATAAGGCCACCA-------TTGGTAGACACGTAGTATGCTAAATATGCACCACACACA-CTATCAGATATGGTAGTGGGATCTG--ACGGTCA

Cotton    ACCAATCTCT---AAATGTT----GTGAGCT---TAG-GCCAAATTT-TATGACTATA--TAT----AGGGGATTGCACC----AAGGCAGTG-ACACTA
Pea       GGCAGTGGCC---AACTAC--------------------CACAATTT-TAAGACCATAA-TAT----TGGAAATAGAA------AAATCAAT--ACATTA
Tobacco   GGGGGTTGTT---GATTTTT----GTCCGTTAGATAT-GCGAAATATGTAAAACCTTAT-CAT----TATATATAGAG------TGGTGGGCA-ACGATG
Ice-plant GGCTCTTAATCAAAAGTTTTAGGTGTGAATTTAGTTT-GATGAGTTTTAAGGTCCTTAT-TATA---TATAGGAAGGGGG----TGCTATGGA-GCAAGG
Turnip    CACCTTTCTTTAATCCTGTGGCAGTTAACGACGATATCATGAAATCTTGATCCTTCGAT-CATTAGGGCTTCATACCTCT----TGCGCTTCTCACTATA
Wheat     CACTGATCCGGAGAAGATAAGGAAACGAGGCAACCAGCGAACGTGAGCCATCCCAACCA-CATCTGTACCAAAGAAACGG----GGCTATATATACCGTG
Duckweed  TTAGGTTGAATGGAAAATAG---AACGCAATAATGTCCGACATATTTCCTATATTTCCG-TTTTTCGAGAGAAGGCCTGTGTACCGATAAGGATGTAATC
Larch     CGCTTCTCCTCTGGAGTTATCCGATTGTAATCCTTGCAGTCCAATTTCTCTGGTCTGGC-CCA----ACCTTAGAGATTG----GGGCTTATA-TCTATA

Cotton    T-TAAGGGATCAGTGAGAC-TCTTTTGTATAACTGTAGCAT--ATAGTAC
Pea       TATAAAGCAAGTTTTAGTA-CAAGCTTTGCAATTCAACCAC--A-AGAAC
Tobacco   CATAGACCATCTTGGAAGT-TTAAAGGGAAAAAAGGAAAAG--GGAGAAA
Ice-plant TCCTCATCAAAAGGGAAGTGTTTTTTCTCTAACTATATTACTAAGAGTAC
Larch     TCTTCTTCACAC---AATCCATTTGTGTAGAGCCGCTGGAAGGTAAATCA
Turnip    TATAGATAACCA---AAGCAATAGACAGACAAGTAAGTTAAG-AGAAAAG
Wheat     GTGACCCGGCAATGGGGTCCTCAACTGTAGCCGGCATCCTCCTCTCCTCC
Duckweed  CATGGGGCGACG---CAGTGTGTGGAGGAGCAGGCTCAGTCTCCTTCTCG
50 Finding Short Motifs
AGTCGTACGTGAC...   (Human)
AGTAGACGTGCCG...   (Chimp)
ACGTGAGATACGT...   (Rabbit)
GAACGGAGTACGT...   (Mouse)
TCGTGACGGTGAT...   (Rat)
Size of motif sought: k = 4
51 Most Parsimonious Solution
“Parsimony score”: 1 mutation
AGTCGTACGTGAC...
AGTAGACGTGCCG...
ACGTGAGATACGT...
GAACGGAGTACGT...
TCGTGACGGTGAT...
[Figure: tree with nodes labeled ACGG and ACGT]
52 Substring Parsimony Problem
Given:
- phylogenetic tree T,
- set of orthologous sequences at the leaves of T,
- length k of motif,
- threshold d
Problem: Find each set S of k-mers, one k-mer from each leaf, such that the parsimony score of S in T is at most d.
This problem is NP-complete.
53 FootPrinter’s Exact Algorithm (with Mathieu Blanchette, generalizing Sankoff and Rousseau 1975)
W_u[s] = best parsimony score for the subtree rooted at node u, if u is labeled with string s.
Leaf sequences: AGTCGTACGTG, ACGGGACGTGC, ACGTGAGATAC, GAACGGAGTAC, TCGTGACGGTG
[Figure: tree with a table of 4^k entries at each node; sample entries for ACGG and ACGT, in the order extracted]
   … ACGG: 2, ACGT: 1 …
   … ACGG: 0, ACGT: 2 …
   … ACGG: 1, ACGT: 1 …
   … ACGG: +∞, ACGT: 0 …
   … ACGG: 1, ACGT: 0 …
   … ACGG: 0, ACGT: +∞ …
   … ACGG: +∞, ACGT: 0 …
54 Running Time
$$W_u[s] = \sum_{v:\ \text{child of } u} \min_t \bigl( W_v[t] + d(s, t) \bigr)$$
With n = number of species, l = average sequence length, k = motif length:
Total time O(n k (4^{2k} + l))
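The recurrence can be sketched as a direct, unoptimized dynamic program over all 4^k strings. This is an illustrative sketch, not FootPrinter's implementation. Assumptions: the five truncated sequences come from the "Finding Short Motifs" slide, the tree topology ((Human, Chimp), Rabbit, (Mouse, Rat)) is guessed because the slide's tree did not survive, d is Hamming distance, and a leaf table holds 0 for k-mers that occur in the leaf's sequence and infinity otherwise. The names `table`, `KMERS`, and `tree` are hypothetical.

```python
from itertools import product

K = 4                                   # motif length from the slide
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
INF = float("inf")

def hamming(s, t):
    """d(s, t): number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))

def table(node):
    """Return W_u as a dict over all 4^K strings. A node is either a leaf
    sequence (str) or a tuple of child nodes."""
    if isinstance(node, str):
        present = {node[i:i + K] for i in range(len(node) - K + 1)}
        return {s: (0 if s in present else INF) for s in KMERS}
    child_tables = [table(child) for child in node]
    # W_u[s] = sum over children v of min over t of (W_v[t] + d(s, t))
    return {
        s: sum(min(Wv[t] + hamming(s, t) for t in KMERS) for Wv in child_tables)
        for s in KMERS
    }

# Assumed topology: ((Human, Chimp), Rabbit, (Mouse, Rat)).
tree = (
    ("AGTCGTACGTGAC", "AGTAGACGTGCCG"),
    "ACGTGAGATACGT",
    ("GAACGGAGTACGT", "TCGTGACGGTGAT"),
)
root = table(tree)
best_score = min(root.values())
best_labels = sorted(s for s, w in root.items() if w == best_score)
print(best_score, best_labels)
```

Each internal node costs on the order of 4^{2k} distance evaluations per child, which is where the 4^{2k} term in the running time comes from. On this input the best score is 1, matching the "1 mutation" solution on the earlier slide: Rat contains ACGG but not ACGT, while the other four sequences contain ACGT.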
55 Improvements
- A better algorithm reduces the time from O(n k (4^{2k} + l)) to O(n k (4^k + l))
- By restricting to motifs with parsimony score at most d, greatly reduce the number of table entries computed (exponential in d, polynomial in k)
- Amenable to many useful extensions (e.g., allow insertions and deletions)
56 Application to β-actin Gene
- Gilthead sea bream (678 bp)
- Medaka fish (1016 bp)
- Common carp (696 bp)
- Grass carp (917 bp)
- Chicken (871 bp)
- Human (646 bp)
- Rabbit (636 bp)
- Rat (966 bp)
- Mouse (684 bp)
- Hamster (1107 bp)
57
57 Common carp ACGGACTGTTACCACTTCACGCCGACTCAACTGCGCAGAGAAAAACTTCAAACGACAAC A TTGGCATGGCTT TTGTTATTTTTGGCGC TTGACTCAGG AT C T AAAAACTGGAAC G GCGAAGGTGACGGCAATGTTTTGGCAAATAAGCATCCCCGAAGTTCTACAATGCATCTGAGGACTCAATGTTTTTTTTTTTTTTT TTTCTTT AGTCATTCCAAAT GTTTGTTAAATGCATTGTTCCGAAACTTATTTGCCTCTATGAAGGCTGCCCAGTAATTGGGAGCATACTTAACATTGTAGTATTGTA T GTAAATTATGT AACAAAACAATGACTGGGTTTTTGTACTTTCAGCCTTAATCTTGGGTTTTTTTTTTTTTTTGGTTCCAAAAAACTAAGCTTTACCATTCAAGATGTAAA GGTTTCATTCCCCCTGGCATATTGAAAAAGCTGTGTGGAACGTGGCGGTGCAGACATTTGGTGGGGCCA A CCTGTACACTGAC T AATTCAAATAAAAGT GCACATGTAAGACATCCTACTCTGTGTGATTTTTCTGTTTGTGCTGAGTGAACTTGCTATGAAGTCTTTTAGTGCACTCTTTAATAAAAGTAGTCTTCCCTTAAAGTGTCC CTTCCCTTATGGCCTTCACATTTCTCAACTAGCGCTTCAACTAGAAAGCACTTTAGGGACTGGGATGC Chicken ACCGGACTGTTACCAACACCCACACCCCTGTGATGAAACAAAACCCATAAATGCGCATAAAACAAGACGAG A TTGGCATGGCTT TATTTGTTTTTTCTTTTGGC GC TTGACTCAGGAT T A AAAAACTGGAAT G GTGAAGGTGTCAGCAGCAGTCTTAAAATGAAACATGTTGGAGCGAACGCCCCCAAAGTTCTACAATG CATCTGAGGACTTTGATTGTACATTTGTTTCTTTTTTAAT AGTCATTCCAAAT ATTGTTATAATGCATTGTTACAGGAAGTTACTCGCCTCTGTGAAGGCAACAGCCCA GCTGGGAGGAGCCGGTACCAATTACTGGTGTTAGATGATAATTGCTTGTC TGTAAATTATGT AACCCAACAAGTGTCTTTTTGTATCTTCCGCCTTAAAAACAAAACAC ACTTGATCCTTTTTGGTTTGTCAAGCAAGCGGGCTGTGTTCCCCAGTGATAGATGTGAATGAAGGCTTTACAGTCCCCCACAGTCTAGGAGTAAAGTGCCAGTATGTGGG GGAGGGAGGGGCT A CCTGTACACTGAC T TAAGACCAGTTCAAATAAAAGTGCACACAATAGAGGCTTGACTGGTGTTGGTTTTTATTTCTGTGCTGCGC TGCTTGGCCGTTGGTAGCTGTTCTCATCTAGCCTTGCCAGCCTGTGTGGGTCAGCTATCTGCATGGGCTGCGTGCTGGTGCTGTCTGGTGCAGAGGTTGGATAAACCGT GATGATATTTCAGCAAGTGGGAGTTGGCTCTGATTCCATCCTGAGCTGCCATCAGTGTGTTCTGAAGGAAGCTGTTGGATGAGGGTGGGCTGAGTGCTGGGGGACAGCT GGGCTCAGTGGGACTGCAGCTGTGCT Human GCGGACTATGACTTAGTTGCGTTACACCCTTTCTTGACAAAACCTAACTTGCGCAGAAAACAAGATGAG A TTGGCATGGCTT TATTTGTTTTTTTTGTTTTGTT TTGGTTTTTTTTTTTTTTTTGGC TTGACTCAGGAT T T AAAAACTGGAAC G GTGAAGGTGACAGCAGTCGGTTGGAGCGAGCATCCCCCAAAGTTCA CAATGTGGCCGAGGACTTTGATTGCATTGTTGTTTTTTTAAT AGTCATTCCAAAT ATGAGATGCATTGTTACAGGAAGTCCCTTGCCATCCTAAAAGCCACCCCACTTC 
TCTCTAAGGAGAATGGCCCAGTCCTCTCCCAAGTCCACACAGGGGAGGTGATAGCATTGCTTTCG TGTAAATTATGT AATGCAAAATTTTTTTAATCTTCGCCTTAATA CTTTTTTATTTTGTTTTATTTTGAATGATGAGCCTTCGTGCCCCCCCTTCCCCCTTTTTGTCCCCCAACTTGAGATGTATGAAGGCTTTTGGTCTCCCTGGGAGTGGGTGG AGGCAGCCAGGGCTT A CCTGTACACTGAC T TGAGACCAGTTGAATAAAAGTGCACACCTTAAAAATGAGGCCAAGTGTGACTTTGTGGTGTGGCTGGGT TGGGGGCAGCAGAGGGTG Parsimony score over 10 vertebrates: 0 1 2
58 Motifs Absent from Some Species
Find motifs
   - with small parsimony score
   - that span a large part of the tree
Example: in a tree of 10 species spanning 760 Myrs, find all motifs with
   - score 0 spanning at least 250 Myrs
   - score 1 spanning at least 350 Myrs
   - score 2 spanning at least 450 Myrs
   - score 3 spanning at least 550 Myrs
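The example thresholds amount to a lookup from parsimony score to minimum required span. A minimal sketch of that filter follows; the names `MIN_SPAN_MYR` and `keep_motif` and the candidate motifs are hypothetical, and only the four score/span pairs come from the slide.

```python
# Minimum span (in Myrs) required for each parsimony score, from the slide.
MIN_SPAN_MYR = {0: 250, 1: 350, 2: 450, 3: 550}

def keep_motif(score, span_myr):
    """Accept a motif only if its score has a threshold and its span meets it."""
    return score in MIN_SPAN_MYR and span_myr >= MIN_SPAN_MYR[score]

# Hypothetical candidates: (motif, parsimony score, span in Myrs).
candidates = [
    ("motif_a", 0, 250),   # kept: exactly at the threshold
    ("motif_b", 1, 300),   # rejected: span too small for score 1
    ("motif_c", 3, 760),   # kept: spans the whole tree
    ("motif_d", 4, 760),   # rejected: no threshold defined for score 4
]
kept = [name for name, score, span in candidates if keep_motif(score, span)]
print(kept)
```

The design point is the trade-off itself: the more mutations a motif is allowed, the larger the part of the tree it must span to be reported.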
59 Application to c-fos Gene
Asked for motifs of length 10, with
- 0 mutations over tree of size 6
- 1 mutation over tree of size 11
- 2 mutations over tree of size 16
- 3 mutations over tree of size 21
- 4 mutations over tree of size 26
[Figure: tree of Puffer fish, Chicken, Pig, Mouse, Hamster, Human; branch labels 10, 2, 7, 2, 2, 2, 1, 0, 1, 1]
Found:
- 0 mutations over tree of size 8
- 1 mutation over tree of size 16
- 3 mutations over tree of size 21
- 4 mutations over tree of size 28
60 Application to c-fos Gene
Motif               Score  Conserved in         Known?
CAGGTGCGAATGTTC     0      4 mammals
TTCCCGCCTCCCCTCCCC  0      4 mammals            yes
GAGTTGGCTGcagcc     3      puffer + 4 mammals
GTTCCCGTCAATCcct    1      chicken + 4 mammals  yes
CACAGGATGTcc        4      all 6                yes
AGGACATCTG          1      chicken + 4 mammals  yes
GTCAGCAGGTTTCCACG   0      4 mammals            yes
TACTCCAACCGC        0      4 mammals            metK in B. subtilis
61 Microbial Footprinting
- 1889 prokaryotes with genomes completely sequenced (as of 2/12/2012)
   - For any prokaryotic gene of interest, plenty of close genes in other species are available
   - Relatively simple genomes
- MicroFootPrinter (with Shane Neph)
   - Designed specifically for phylogenetic footprinting in microbial genomes
   - Undergraduate Computational Biology Capstone project
   - User specifies species and gene of interest
   - Automates collection of orthologous genes, cis-regulatory sequences, gene tree, parameters
62 Demo
MicroFootPrinter home
Examples: Agrobacterium tumefaciens genes regulated by ChvI (with Eugene Nester)
   - chvI (two component response regulator)
   - ropB (outer membrane protein)
63 Sample chvI motif
Parsimony score: 2   Span: 41.10   Significance score: 4.22
B. henselae       -151  GCTACAATTT
R. etli            -90  GCCACAATTT
R. leguminosarum  -106  GCCACAATTT
S. meliloti       -119  GCCACAATTT
S. medicae        -118  GCCACAATTT
A. tumefaciens    -105  GCCACAATTT
M. loti            -80  GCCACATTTT
M. sp.             -87  GCCACATTTT
O. anthropi       -158  GCCACATTTT
B. suis            -38  GCCACATTTT
B. melitensis     -156  GCCACATTTT
B. abortus        -156  GCCACATTTT
B. ovis           -156  GCCACATTTT
B. canis           -38  GCCACATTTT
64 Sample ropB motif
Parsimony score: 1   Span: 20.70   Significance score: 1.34
Jannaschia sp.    -151  CACATTTTGG
R. etli           -134  CACAATTTGG
R. leguminosarum  -135  CACAATTTGG
A. tumefaciens    -131  CACATTTTGG
S. meliloti       -128  CACATTTTGG
S. medicae        -128  CACATTTTGG
65 Combined ChvI Motif
ropB:         CACATTTTGG
chvI:       GCCACAATTT
Atu1221:  TTGTCACAAT
ultimate:   GYCACAWTTTGG
Y = {C,T}   W = {A,T}
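The combined motif uses IUPAC ambiguity codes, which translate directly into a regular expression. The sketch below checks that each of the three motifs agrees with the consensus wherever they overlap. The offsets (ropB at +2, chvI at 0, Atu1221 at -2 relative to the consensus start) are inferred by lining up the shared CACA core and are an assumption, as are the names `iupac_to_regex` and `consistent`; only the two codes used on the slide are mapped.

```python
import re

# Only the symbols used on the slide: plain bases plus Y = {C,T}, W = {A,T}.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "Y": "[CT]", "W": "[AT]"}

def iupac_to_regex(pattern):
    """Translate an IUPAC consensus string into a regular expression."""
    return "".join(IUPAC[ch] for ch in pattern)

consensus = "GYCACAWTTTGG"

# (motif, offset of its first base relative to the consensus start);
# offsets are inferred from the shared CACA core, not given on the slide.
placements = {
    "ropB":    ("CACATTTTGG", 2),
    "chvI":    ("GCCACAATTT", 0),
    "Atu1221": ("TTGTCACAAT", -2),
}

def consistent(motif, offset):
    """True if the motif matches the consensus at every overlapping position."""
    for i, base in enumerate(motif):
        j = i + offset
        if 0 <= j < len(consensus) and not re.fullmatch(IUPAC[consensus[j]], base):
            return False
    return True

pattern = iupac_to_regex(consensus)
results = {name: consistent(m, off) for name, (m, off) in placements.items()}
print(pattern, results)
```

The B. henselae variant of the chvI motif (GCTACAATTT) does not fit the consensus at its third position, which is consistent with that motif having a nonzero parsimony score.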
66 References and Acknowledgments
- Amol Prakash & Martin Tompa, Measuring the Accuracy of Genome-Size Multiple Alignments. Genome Biology, June 2007, R124.
- Xiaoyu Chen & Martin Tompa, Comparative Assessment of Methods for Aligning Multiple Genome Sequences. Nature Biotechnology, June 2010, 567-572.
- Mathieu Blanchette & Martin Tompa, Discovery of Regulatory Elements by a Computational Method for Phylogenetic Footprinting. Genome Research, May 2002, 739-748.
- Shane Neph & Martin Tompa, MicroFootPrinter: a Tool for Phylogenetic Footprinting in Prokaryotic Genomes. Nucleic Acids Research, July 2006, W366-W368.
All software available at bio.cs.washington.edu/software