Comparative Sequence Analysis in Molecular Biology Martin Tompa Computer Science & Engineering Genome Sciences University of Washington Seattle, Washington,


1 Comparative Sequence Analysis in Molecular Biology Martin Tompa Computer Science & Engineering Genome Sciences University of Washington Seattle, Washington, U.S.A.


3 3 Outline What genome data is available? What is phylogenetic footprinting? Phylogenetic footprinting by multiple sequence alignment Which parts of multiple sequence alignments are trustworthy? FootPrinter: phylogenetic footprinting without alignment


5 5 How Many Genomes Are Available?
- 46 vertebrate genomes sequenced (primates to rodents to marsupials to birds to fishes)
- 1766 bacterial genomes sequenced (as of 2/12/2012)
- Insects, fungi, worms, plants, …
- Many more will be finished very soon
- Fertile ground for comparative genomics

6 6 Growth of GenBank
- 1982-2003: number of nucleotides in GenBank doubled every 18 months
- Since 2003: doubled every 3 years
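
As a rough worked example of what these doubling times imply, assuming clean exponential growth with doubling time $T$ (an idealization; the slide only reports the doubling periods, and $N_0$ is just a placeholder for the starting size):

```latex
N(t) = N_0 \cdot 2^{t/T}
\qquad\Rightarrow\qquad
\frac{N(2003)}{N(1982)} = 2^{21/1.5} = 2^{14} = 16{,}384
```

Under the same idealization, a 3-year doubling time gives only $2^{9/3} = 8$-fold growth over nine years.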

7 7 Outline What genome data is available? What is phylogenetic footprinting? Phylogenetic footprinting by multiple sequence alignment Which parts of multiple sequence alignments are trustworthy? FootPrinter: phylogenetic footprinting without alignment

8 8 Phylogenetic Footprinting (Tagle et al. 1988)
Functional regions of DNA (regions under “purifying constraint”) evolve more slowly than nonfunctional ones.
1. Consider a set of corresponding DNA sequences from related species.
2. Identify unusually well conserved subsequences (i.e., ones that have not mutated much over the course of evolution): “motifs”


10 10 Outline What genome data is available? What is phylogenetic footprinting? Phylogenetic footprinting by multiple sequence alignment Which parts of multiple sequence alignments are trustworthy? FootPrinter: phylogenetic footprinting without alignment

11 11 How to Find Conserved Motifs
ACTAACCGGGAGATTTCAGA    human
AAGTTCCGGGAGATTTCCA     chimp
TAGTTATCCGGGAGATTAGA    mouse
AAAACCGGTAGATTTCAGG     rat

12 12 Multiple Sequence Alignment
AC--TAACCGGGAGATTTCAGA  human
AAGTT--CCGGGAGATTTCC-A  chimp
TAGTTATCCGGGAGATT--AGA  mouse
AA---AACCGGTAGATTTCAGG  rat
(Finding the optimal alignment is NP-complete.)
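
Once the sequences are aligned, conserved motifs show up as runs of identical columns. A minimal sketch of that scan, using the four rows from the slide; the function name `conserved_blocks` and the `min_len` parameter are illustrative choices, not part of the talk:

```python
from typing import List, Tuple

# Alignment rows copied from the slide (human, chimp, mouse, rat).
ALIGNMENT = [
    "AC--TAACCGGGAGATTTCAGA",
    "AAGTT--CCGGGAGATTTCC-A",
    "TAGTTATCCGGGAGATT--AGA",
    "AA---AACCGGTAGATTTCAGG",
]


def conserved_blocks(rows: List[str], min_len: int = 1) -> List[Tuple[int, int, str]]:
    """Return (start, end, text) for maximal runs of gap-free, identical columns.

    Coordinates are 0-based alignment columns; `end` is exclusive.
    """
    width = len(rows[0])
    if any(len(r) != width for r in rows):
        raise ValueError("all alignment rows must have the same length")
    blocks = []
    start = None
    # Iterate one column past the end so a run touching the right edge is closed.
    for col in range(width + 1):
        conserved = (
            col < width
            and rows[0][col] != "-"
            and all(r[col] == rows[0][col] for r in rows)
        )
        if conserved and start is None:
            start = col
        elif not conserved and start is not None:
            if col - start >= min_len:
                blocks.append((start, col, rows[0][start:col]))
            start = None
    return blocks


print(conserved_blocks(ALIGNMENT, min_len=4))
# prints [(7, 11, 'CCGG'), (12, 17, 'AGATT')]
```

Requiring exact identity in every species is the strictest possible notion of conservation; real footprinting tools tolerate some mismatches and weigh them by the phylogeny.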

13 13 Phylogenetic Footprinting
1. Use a whole-genome multiple alignment such as that provided by the UCSC Genome Browser.
2. Search for regions of well conserved alignment.
– Regulatory elements [Cliften; Kellis; Kolbe; Prakash; Woolfe; Xie (2)]
– RNA elements [Pedersen; Washietl]
– General conservation & constraint [Bejerano; Boffelli; Cooper; Margulies (4); Pollard; Prabhakar; Siepel]

14 14 Outline What genome data is available? What is phylogenetic footprinting? Phylogenetic footprinting by multiple sequence alignment Which parts of multiple sequence alignments are trustworthy? FootPrinter: phylogenetic footprinting without alignment

15 15 Why Doubt Alignments?
- Multiple sequence alignment of short sequences (proteins, promoters) is difficult (NP-complete)
- Aligning whole genomes adds the complications of huge sequences and genomic rearrangements
- Vertebrate alignment has 3.8 billion columns
- Automatically generated

16 16 Assessing 4 Genome-Size Alignments (with Xiaoyu Chen)
- Alignments: MLAGAN [Brudno 2003], MAVID [Bray 2003], TBA [Blanchette 2003], Pecan [Paten 2008]
- Target ENCODE regions: 30 Mbp covering 1% of the human genome (ENCODE targets)
- Total input: 554 Mbp over 28 vertebrates
- Rich resource for comparing and assessing genome-size alignments
- Margulies et al. 2007, Genome Research

17 17 Coverage of each alignment Alignment coverage: number of human bases aligned to a given species
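
The coverage definition on this slide can be sketched for a single pair of alignment rows. The rows below are borrowed from the small four-species example earlier in the talk, not from the ENCODE data, and the function name `coverage` is an illustrative choice:

```python
def coverage(reference_row: str, other_row: str) -> int:
    """Count reference (human) bases that sit in a column where the other
    species also has a base, i.e. neither row has a gap there."""
    if len(reference_row) != len(other_row):
        raise ValueError("alignment rows must have the same length")
    return sum(1 for a, b in zip(reference_row, other_row) if a != "-" and b != "-")


# Rows from the toy alignment shown earlier (human is the reference).
HUMAN = "AC--TAACCGGGAGATTTCAGA"
OTHERS = {
    "chimp": "AAGTT--CCGGGAGATTTCC-A",
    "mouse": "TAGTTATCCGGGAGATT--AGA",
    "rat":   "AA---AACCGGTAGATTTCAGG",
}

for species, row in OTHERS.items():
    print(species, coverage(HUMAN, row))
# prints: chimp 17, mouse 18, rat 19 (one per line)
```

Note that coverage says nothing about whether the aligned bases are aligned correctly, which is the question the following slides take up.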

18 18 Coverage of each alignment In noncoding regions, as species distance from human↑, coverage↓

19 19 Coverage of each alignment MAVID has lowest coverage

20 20 Coverage of each alignment Other 3 have comparable coverage in placental mammals

21 21 Coverage of each alignment MLAGAN has highest coverage in distant species, intronic and intergenic

22 22 Level of agreement among alignments
[Figure: Agree%, Disagree%, and Unique% for TBA (T), MAVID (V), and MLAGAN (L), in four panels: coding, UTR, intronic, and intergenic bases]

23 23 Level of agreement among alignments: Agree%: Coding > UTR > Int.

24 24 Level of agreement among alignments: Unique%: Coding < UTR < Int.

25 25 Level of agreement among alignments: As species distance from human↑, Agree%↓ and Unique%↑

26 26 Level of agreement among alignments: Primates: high Agree%

27 27 Level of agreement among alignments: Placental nonprimates: Agree% > 0.5

28 28 Level of agreement among alignments: Distant species, Int: low Agree%, high Unique%

29 29 Alignment agreement for mouse
[Figure: Agree%, Disagree%, and Unique% for intronic and intergenic bases]
- Intronic & intergenic account for 95% of mouse bases aligned to human
- Agree% in those categories: 44% to 62%
- Much worse for more distant species
- Building reliable MSA remains challenging

30 30 Which Alignment Columns to Trust? (with Amol Prakash, generalizing Karlin and Altschul 1990)
Goal: label each alignment column with a confidence measure of alignment correctness
– Identify sequences that do not belong
- Users forewarned about regions of interest
- Genome browser designers consider realigning
- Alignment tool designers get feedback for possible improvements

31 31 Sample Suspicious Alignment
Human     -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC
Chimp     -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC
Rhesus    -----------GTTGCCATGC-AAAAATATTATGTCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC
Mouse     -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CGTGTCAA----------TTAACAC
Rat       -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CGTGTCAA----------TTAACAC
Dog       -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC
Cow       -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC
Elephant  -----------GTTGCTATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC
Tenrec    -----------GTTGCCATAC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC
Opossum   -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATATCAA----------TTAACAC
Chicken   -----------GTTGCCATGCAAAAAATAATATGGCTTTACTAAAATTTACACAAC---CCTGACAA----------TTAACAC
Zebrafish GAACATATCCGAGTGCTGTAA-AATACTACTGGGA----ACCAGAAATG--ACAAGTTCCATGACAGCTTTGCCTTTTTGGCTC

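32 32 Scoring Function
Pairwise: $\mathrm{score}(\sigma_1, \sigma_2) = \log \dfrac{\Pr(\sigma_1, \sigma_2)}{\Pr(\sigma_1)\,\Pr(\sigma_2)}$
Multiple (species 1-5: Human, Chimp, Mouse, Rat, Chicken): $\mathrm{sc}(\sigma_1 \sigma_2 \sigma_3 \sigma_4 \sigma_5 \mid T) = \log \dfrac{\Pr(\sigma_1 \sigma_2 \sigma_3 \sigma_4 \sigma_5 \mid T)}{\Pr(\sigma_1 \sigma_2 \sigma_5 \mid T_1)\,\Pr(\sigma_3 \sigma_4 \mid T_2)}$
Here $\sigma_i$ is the residue of species $i$ in an alignment column, $T$ is the phylogenetic tree, and $T_1$, $T_2$ are the two subtrees on either side of one branch (in this example, the branch separating mouse and rat from the other species).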
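
The pairwise score on this slide is a log-odds ratio: the joint probability of seeing two residues aligned, divided by the probability of seeing them independently. A minimal sketch, in which every probability value is a hypothetical placeholder (the talk derives the real ones from an evolutionary model) and the name `log_odds` is illustrative:

```python
import math


def log_odds(joint: float, p_a: float, p_b: float) -> float:
    """log( Pr(a, b) / (Pr(a) * Pr(b)) ) for one aligned pair of residues."""
    if min(joint, p_a, p_b) <= 0:
        raise ValueError("probabilities must be positive")
    return math.log(joint / (p_a * p_b))


# Hypothetical numbers: uniform background of 0.25 per nucleotide.
BACKGROUND = 0.25
match = log_odds(0.20, BACKGROUND, BACKGROUND)     # pair seen more often than chance
mismatch = log_odds(0.01, BACKGROUND, BACKGROUND)  # pair seen less often than chance
neutral = log_odds(BACKGROUND * BACKGROUND, BACKGROUND, BACKGROUND)

print(round(match, 3), round(mismatch, 3), round(neutral, 3))
```

Positive scores favor the hypothesis that the residues are related, negative scores favor independence, and the multiple-sequence score applies the same ratio to the two sides of a tree branch.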

33 33 Outline of Computation
Input: multiple sequence alignment A
Output: discordance, $\max_k p_k$
For each branch k of the tree {
    Compute scoring function $sc_k$ (Felsenstein)
    Find all maximally scoring segments of A using $sc_k$ (Ruzzo & Tompa)
    Compute K, λ using $sc_k$ (Karlin & Altschul)
    Compute p-value $p_k$ of each segment score using K, λ (Karlin & Altschul)
}
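
The "maximally scoring segments" step can be illustrated on a plain list of per-column scores. The sketch below follows the segment-merging idea of Ruzzo & Tompa but omits the bookkeeping that makes the published algorithm linear-time, so it should be read as an illustration, not as the authors' implementation. The function name and the example scores are assumptions:

```python
from typing import List, Tuple


def maximal_scoring_segments(scores: List[float]) -> List[Tuple[int, int, float]]:
    """Return all maximal scoring segments as (start, end, score), end exclusive."""
    # Each candidate is [start, end, L, R]: L is the running total just before
    # the segment starts, R is the running total at its end.
    segs: List[list] = []
    total = 0.0
    for i, s in enumerate(scores):
        before = total
        total += s
        if s <= 0:
            continue  # non-positive scores never start a segment
        cur = [i, i + 1, before, total]
        while True:
            # Find the rightmost earlier segment whose L is strictly smaller.
            j = len(segs) - 1
            while j >= 0 and segs[j][2] >= cur[2]:
                j -= 1
            if j < 0 or segs[j][3] >= cur[3]:
                segs.append(cur)  # cannot be profitably extended leftward
                break
            # Otherwise absorb segments j..end into the candidate and retry.
            cur = [segs[j][0], cur[1], segs[j][2], cur[3]]
            del segs[j:]
    return [(a, b, r - l) for a, b, l, r in segs]


print(maximal_scoring_segments([4, -5, 3, -3, 1, 2, -2, 2, -2, 1, 5]))
# prints [(0, 1, 4.0), (2, 3, 3.0), (4, 11, 7.0)]
```

In the pipeline on the slide, each such segment's score would then be converted to a p-value using the Karlin & Altschul parameters K and λ.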

34 34 Suspicious Alignment Regions
Back to four ENCODE alignments spanning 30 Mbp of human aligned to 27 other vertebrates (with Xiaoyu Chen)
- Identify suspicious alignment regions:
– Length ≥ 50 bp
– Discordance ≥ 0.1 at each position, all with respect to the same worst species
– Fewer than 50% gapped sites
- Suspicious%
– Percentage of aligned bases in suspicious regions

35 35 Alignment accuracy [Figure, shown on slides 35-38: four panels for coding, UTR, intronic, and intergenic bases]

39 39 Can suspicious alignments be improved?
Baboon and MLAGAN (for example): all points (x, y), where
- x = human-baboon alignment score of MLAGAN region suspicious for baboon
- y = human-baboon alignment score of alternative alignment for same human region but not suspicious for baboon
Reference lines:
- y = x
- y - x = μ, where μ = average of y - x over all points
- y - x = μ ± σ, where σ = standard deviation of y - x over all points

40 40 Can suspicious alignments be improved?

41 41 Summary of comparisons (all categories) [Table: TBA, MAVID, MLAGAN, and Pecan compared across primates, other placental mammals, and distant species. Legend: high is better; low is better.]

42 42 Conclusions
1. Disturbing lack of agreement among alignments: alignment still a hard problem
2. Performance of the aligners varies significantly by species group and region type, particularly distant species and noncoding regions

43 43 Outline What genome data is available? What is phylogenetic footprinting? Phylogenetic footprinting by multiple sequence alignment Which parts of multiple sequence alignments are trustworthy? FootPrinter: phylogenetic footprinting without alignment

44 44 DNA, Genes, and Proteins
- DNA: program for cell processes
- Proteins: execute cell processes
[Figure: a DNA sequence with a gene and its protein product labeled]

45 45 Regulation of Genes What turns genes on and off? When is a gene turned on or off? Where (in which cells) is a gene turned on? How many copies of the gene product are produced?

46 46 Regulation of Genes [Figure, continued on slide 47: DNA with a gene, its regulatory element, a transcription factor, and RNA polymerase]

48 48 Goal
Identify regulatory elements in DNA sequences. These are:
- Binding sites for proteins
- Short subsequences (5-25 nucleotides)
- Up to 1000 nucleotides (or farther) from gene
- Inexactly repeating patterns (“motifs”)

49 49 CLUSTALW multiple sequence alignment (rbcS gene)
Cotton    ACGGTT-TCCATTGGATGA---AATGAGATAAGAT---CACTGTGC---TTCTTCCACGTG--GCAGGTTGCCAAAGATA-------AGGCTTTACCATT
Pea       GTTTTT-TCAGTTAGCTTA---GTGGGCATCTTA----CACGTGGC---ATTATTATCCTA--TT-GGTGGCTAATGATA-------AGG--TTAGCACA
Tobacco   TAGGAT-GAGATAAGATTA---CTGAGGTGCTTTA---CACGTGGC---ACCTCCATTGTG--GT-GACTTAAATGAAGA-------ATGGCTTAGCACC
Ice-plant TCCCAT-ACATTGACATAT---ATGGCCCGCCTGCGGCAACAAAAA---AACTAAAGGATA--GCTAGTTGCTACTACAATTC--CCATAACTCACCACC
Turnip    ATTCAT-ATAAATAGAAGG---TCCGCGAACATTG--AAATGTAGATCATGCGTCAGAATT--GTCCTCTCTTAATAGGA-------A-------GGAGC
Wheat     TATGAT-AAAATGAAATAT---TTTGCCCAGCCA-----ACTCAGTCGCATCCTCGGACAA--TTTGTTATCAAGGAACTCAC--CCAAAAACAAGCAAA
Duckweed  TCGGAT-GGGGGGGCATGAACACTTGCAATCATT-----TCATGACTCATTTCTGAACATGT-GCCCTTGGCAACGTGTAGACTGCCAACATTAATTAAA
Larch     TAACAT-ATGATATAACAC---CGGGCACACATTCCTAAACAAAGAGTGATTTCAAATATATCGTTAATTACGACTAACAAAA--TGAAAGTACAAGACC

Cotton    CAAGAAAAGTTTCCACCCTC------TTTGTGGTCATAATG-GTT-GTAATGTC-ATCTGATTT----AGGATCCAACGTCACCCTTTCTCCCA-----A
Pea       C---AAAACTTTTCAATCT-------TGTGTGGTTAATATG-ACT-GCAAAGTTTATCATTTTC----ACAATCCAACAA-ACTGGTTCT---------A
Tobacco   AAAAATAATTTTCCAACCTTT---CATGTGTGGATATTAAG-ATTTGTATAATGTATCAAGAACC-ACATAATCCAATGGTTAGCTTTATTCCAAGATGA
Ice-plant ATCACACATTCTTCCATTTCATCCCCTTTTTCTTGGATGAG-ATAAGATATGGGTTCCTGCCAC----GTGGCACCATACCATGGTTTGTTA-ACGATAA
Turnip    CAAAAGCATTGGCTCAAGTTG-----AGACGAGTAACCATACACATTCATACGTTTTCTTACAAG-ATAAGATAAGATAATGTTATTTCT---------A
Wheat     GCTAGAAAAAGGTTGTGTGGCAGCCACCTAATGACATGAAGGACT-GAAATTTCCAGCACACACA-A-TGTATCCGACGGCAATGCTTCTTC--------
Duckweed  ATATAATATTAGAAAAAAATC-----TCCCATAGTATTTAGTATTTACCAAAAGTCACACGACCA-CTAGACTCCAATTTACCCAAATCACTAACCAATT
Larch     TTCTCGTATAAGGCCACCA-------TTGGTAGACACGTAGTATGCTAAATATGCACCACACACA-CTATCAGATATGGTAGTGGGATCTG--ACGGTCA

Cotton    ACCAATCTCT---AAATGTT----GTGAGCT---TAG-GCCAAATTT-TATGACTATA--TAT----AGGGGATTGCACC----AAGGCAGTG-ACACTA
Pea       GGCAGTGGCC---AACTAC--------------------CACAATTT-TAAGACCATAA-TAT----TGGAAATAGAA------AAATCAAT--ACATTA
Tobacco   GGGGGTTGTT---GATTTTT----GTCCGTTAGATAT-GCGAAATATGTAAAACCTTAT-CAT----TATATATAGAG------TGGTGGGCA-ACGATG
Ice-plant GGCTCTTAATCAAAAGTTTTAGGTGTGAATTTAGTTT-GATGAGTTTTAAGGTCCTTAT-TATA---TATAGGAAGGGGG----TGCTATGGA-GCAAGG
Turnip    CACCTTTCTTTAATCCTGTGGCAGTTAACGACGATATCATGAAATCTTGATCCTTCGAT-CATTAGGGCTTCATACCTCT----TGCGCTTCTCACTATA
Wheat     CACTGATCCGGAGAAGATAAGGAAACGAGGCAACCAGCGAACGTGAGCCATCCCAACCA-CATCTGTACCAAAGAAACGG----GGCTATATATACCGTG
Duckweed  TTAGGTTGAATGGAAAATAG---AACGCAATAATGTCCGACATATTTCCTATATTTCCG-TTTTTCGAGAGAAGGCCTGTGTACCGATAAGGATGTAATC
Larch     CGCTTCTCCTCTGGAGTTATCCGATTGTAATCCTTGCAGTCCAATTTCTCTGGTCTGGC-CCA----ACCTTAGAGATTG----GGGCTTATA-TCTATA

Cotton    T-TAAGGGATCAGTGAGAC-TCTTTTGTATAACTGTAGCAT--ATAGTAC
Pea       TATAAAGCAAGTTTTAGTA-CAAGCTTTGCAATTCAACCAC--A-AGAAC
Tobacco   CATAGACCATCTTGGAAGT-TTAAAGGGAAAAAAGGAAAAG--GGAGAAA
Ice-plant TCCTCATCAAAAGGGAAGTGTTTTTTCTCTAACTATATTACTAAGAGTAC
Larch     TCTTCTTCACAC---AATCCATTTGTGTAGAGCCGCTGGAAGGTAAATCA
Turnip    TATAGATAACCA---AAGCAATAGACAGACAAGTAAGTTAAG-AGAAAAG
Wheat     GTGACCCGGCAATGGGGTCCTCAACTGTAGCCGGCATCCTCCTCTCCTCC
Duckweed  CATGGGGCGACG---CAGTGTGTGGAGGAGCAGGCTCAGTCTCCTTCTCG

50 50 Finding Short Motifs
AGTCGTACGTGAC... (Human)
AGTAGACGTGCCG... (Chimp)
ACGTGAGATACGT... (Rabbit)
GAACGGAGTACGT... (Mouse)
TCGTGACGGTGAT... (Rat)
Size of motif sought: k = 4

51 51 Most Parsimonious Solution
“Parsimony score”: 1 mutation
AGTCGTACGTGAC...
AGTAGACGTGCCG...
ACGTGAGATACGT...
GAACGGAGTACGT...
TCGTGACGGTGAT...
[Figure: tree over these sequences with ancestral labels ACGG and ACGT]

52 52 Substring Parsimony Problem
Given: phylogenetic tree T, set of orthologous sequences at leaves of T, length k of motif, threshold d
Problem: Find each set S of k-mers, one k-mer from each leaf, such that the parsimony score of S in T is at most d.
This problem is NP-complete.

53 53 FootPrinter’s Exact Algorithm (with Mathieu Blanchette, generalizing Sankoff and Rousseau 1975)
$W_u[s]$ = best parsimony score for the subtree rooted at node u, if u is labeled with string s.
Leaf sequences: AGTCGTACGTG, ACGGGACGTGC, ACGTGAGATAC, GAACGGAGTAC, TCGTGACGGTG
[Figure: the tree over these sequences with a table of $4^k$ entries at each node. Sample (ACGG, ACGT) entries shown at the nodes: (2, 1), (0, 2), (1, 1), (+∞, 0), (1, 0), (0, +∞), (∞, 0).]

54 54 $W_u[s] = \sum_{v:\ \text{child of } u} \min_t \left( W_v[t] + d(s, t) \right)$
Running Time
- n = number of species
- l = average sequence length
- k = motif length
- Total time: $O(n k (4^{2k} + l))$
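
The recurrence above can be sketched directly as a bottom-up table computation. This is the straightforward version, not FootPrinter's optimized one. The sequences are the prefixes printed on the "Finding Short Motifs" slide; the tree topology is an assumption, since the transcript does not preserve the tree, and all identifiers are illustrative:

```python
from itertools import product
from math import inf

K = 4  # motif length from the slide
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]

# Sequence prefixes as printed on the "Finding Short Motifs" slide.
SEQS = {
    "human": "AGTCGTACGTGAC",
    "chimp": "AGTAGACGTGCCG",
    "rabbit": "ACGTGAGATACGT",
    "mouse": "GAACGGAGTACGT",
    "rat": "TCGTGACGGTGAT",
}

# ASSUMED topology (nested tuples); the real tree is not in the transcript.
TREE = (("human", "chimp"), "rabbit", ("mouse", "rat"))


def hamming(s: str, t: str) -> int:
    """d(s, t): number of mismatching positions between two k-mers."""
    return sum(a != b for a, b in zip(s, t))


def table(node) -> dict:
    """Return W_u as a dict mapping each k-mer s to its best parsimony score."""
    if isinstance(node, str):
        # Leaf: score 0 for k-mers that occur in the sequence, infinity otherwise.
        seq = SEQS[node]
        present = {seq[i:i + K] for i in range(len(seq) - K + 1)}
        return {s: (0 if s in present else inf) for s in KMERS}
    # Internal node: sum over children of the cheapest child label plus mutations.
    child_tables = [
        [(t, cost) for t, cost in table(child).items() if cost != inf]
        for child in node
    ]
    return {
        s: sum(min(cost + hamming(s, t) for t, cost in entries)
               for entries in child_tables)
        for s in KMERS
    }


root = table(TREE)
best = min(root.values())
print(best, sorted(s for s, w in root.items() if w == best))
```

On these prefixes the best score is 1, attained by labeling the root ACGT, which agrees with the "1 mutation" solution on the earlier slide. No 4-mer occurs in all five sequences, so a score of 0 is impossible under any topology.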

55 55 Improvements
- Better algorithm reduces time from $O(n k (4^{2k} + l))$ to $O(n k (4^k + l))$
- By restricting to motifs with parsimony score at most d, greatly reduce the number of table entries computed (exponential in d, polynomial in k)
- Amenable to many useful extensions (e.g., allow insertions and deletions)

56 56 Application to β-actin Gene
- Gilthead sea bream (678 bp)
- Medaka fish (1016 bp)
- Common carp (696 bp)
- Grass carp (917 bp)
- Chicken (871 bp)
- Human (646 bp)
- Rabbit (636 bp)
- Rat (966 bp)
- Mouse (684 bp)
- Hamster (1107 bp)

57 57 [Sequences with conserved motifs set off by spaces; legend: parsimony score over 10 vertebrates: 0, 1, 2]

Common carp
ACGGACTGTTACCACTTCACGCCGACTCAACTGCGCAGAGAAAAACTTCAAACGACAAC A TTGGCATGGCTT TTGTTATTTTTGGCGC TTGACTCAGG AT C T AAAAACTGGAAC G GCGAAGGTGACGGCAATGTTTTGGCAAATAAGCATCCCCGAAGTTCTACAATGCATCTGAGGACTCAATGTTTTTTTTTTTTTTT TTTCTTT AGTCATTCCAAAT GTTTGTTAAATGCATTGTTCCGAAACTTATTTGCCTCTATGAAGGCTGCCCAGTAATTGGGAGCATACTTAACATTGTAGTATTGTA T GTAAATTATGT AACAAAACAATGACTGGGTTTTTGTACTTTCAGCCTTAATCTTGGGTTTTTTTTTTTTTTTGGTTCCAAAAAACTAAGCTTTACCATTCAAGATGTAAA GGTTTCATTCCCCCTGGCATATTGAAAAAGCTGTGTGGAACGTGGCGGTGCAGACATTTGGTGGGGCCA A CCTGTACACTGAC T AATTCAAATAAAAGT GCACATGTAAGACATCCTACTCTGTGTGATTTTTCTGTTTGTGCTGAGTGAACTTGCTATGAAGTCTTTTAGTGCACTCTTTAATAAAAGTAGTCTTCCCTTAAAGTGTCC CTTCCCTTATGGCCTTCACATTTCTCAACTAGCGCTTCAACTAGAAAGCACTTTAGGGACTGGGATGC

Chicken
ACCGGACTGTTACCAACACCCACACCCCTGTGATGAAACAAAACCCATAAATGCGCATAAAACAAGACGAG A TTGGCATGGCTT TATTTGTTTTTTCTTTTGGC GC TTGACTCAGGAT T A AAAAACTGGAAT G GTGAAGGTGTCAGCAGCAGTCTTAAAATGAAACATGTTGGAGCGAACGCCCCCAAAGTTCTACAATG CATCTGAGGACTTTGATTGTACATTTGTTTCTTTTTTAAT AGTCATTCCAAAT ATTGTTATAATGCATTGTTACAGGAAGTTACTCGCCTCTGTGAAGGCAACAGCCCA GCTGGGAGGAGCCGGTACCAATTACTGGTGTTAGATGATAATTGCTTGTC TGTAAATTATGT AACCCAACAAGTGTCTTTTTGTATCTTCCGCCTTAAAAACAAAACAC ACTTGATCCTTTTTGGTTTGTCAAGCAAGCGGGCTGTGTTCCCCAGTGATAGATGTGAATGAAGGCTTTACAGTCCCCCACAGTCTAGGAGTAAAGTGCCAGTATGTGGG GGAGGGAGGGGCT A CCTGTACACTGAC T TAAGACCAGTTCAAATAAAAGTGCACACAATAGAGGCTTGACTGGTGTTGGTTTTTATTTCTGTGCTGCGC TGCTTGGCCGTTGGTAGCTGTTCTCATCTAGCCTTGCCAGCCTGTGTGGGTCAGCTATCTGCATGGGCTGCGTGCTGGTGCTGTCTGGTGCAGAGGTTGGATAAACCGT GATGATATTTCAGCAAGTGGGAGTTGGCTCTGATTCCATCCTGAGCTGCCATCAGTGTGTTCTGAAGGAAGCTGTTGGATGAGGGTGGGCTGAGTGCTGGGGGACAGCT GGGCTCAGTGGGACTGCAGCTGTGCT

Human
GCGGACTATGACTTAGTTGCGTTACACCCTTTCTTGACAAAACCTAACTTGCGCAGAAAACAAGATGAG A TTGGCATGGCTT TATTTGTTTTTTTTGTTTTGTT TTGGTTTTTTTTTTTTTTTTGGC TTGACTCAGGAT T T AAAAACTGGAAC G GTGAAGGTGACAGCAGTCGGTTGGAGCGAGCATCCCCCAAAGTTCA CAATGTGGCCGAGGACTTTGATTGCATTGTTGTTTTTTTAAT AGTCATTCCAAAT ATGAGATGCATTGTTACAGGAAGTCCCTTGCCATCCTAAAAGCCACCCCACTTC TCTCTAAGGAGAATGGCCCAGTCCTCTCCCAAGTCCACACAGGGGAGGTGATAGCATTGCTTTCG TGTAAATTATGT AATGCAAAATTTTTTTAATCTTCGCCTTAATA CTTTTTTATTTTGTTTTATTTTGAATGATGAGCCTTCGTGCCCCCCCTTCCCCCTTTTTGTCCCCCAACTTGAGATGTATGAAGGCTTTTGGTCTCCCTGGGAGTGGGTGG AGGCAGCCAGGGCTT A CCTGTACACTGAC T TGAGACCAGTTGAATAAAAGTGCACACCTTAAAAATGAGGCCAAGTGTGACTTTGTGGTGTGGCTGGGT TGGGGGCAGCAGAGGGTG

58 58 Motifs Absent from Some Species
Find motifs
– with small parsimony score
– that span a large part of the tree
Example: in tree of 10 species spanning 760 Myrs, find all motifs with
– score 0 spanning at least 250 Myrs
– score 1 spanning at least 350 Myrs
– score 2 spanning at least 450 Myrs
– score 3 spanning at least 550 Myrs
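
The score-versus-span trade-off in this example amounts to a small lookup table: the more mutations a motif has, the more of the tree it must span to be reported. A minimal sketch using the thresholds from the slide; the names `MIN_SPAN_MYR` and `keep_motif` and the candidate values in the usage lines are hypothetical:

```python
# Minimum span (in Myrs) required for each parsimony score, from the slide.
MIN_SPAN_MYR = {0: 250, 1: 350, 2: 450, 3: 550}


def keep_motif(score: int, span_myr: float) -> bool:
    """Report a motif only if its span meets the threshold for its score.

    Scores with no listed threshold (here, anything above 3) are rejected.
    """
    threshold = MIN_SPAN_MYR.get(score)
    return threshold is not None and span_myr >= threshold


# Hypothetical candidates: (parsimony score, span in Myrs).
print(keep_motif(0, 300))  # True: perfectly conserved over 300 Myrs
print(keep_motif(2, 400))  # False: two mutations need at least 450 Myrs
print(keep_motif(4, 760))  # False: no threshold listed for score 4
```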

59 59 Application to c-fos Gene
Asked for motifs of length 10, with
- 0 mutations over tree of size 6
- 1 mutation over tree of size 11
- 2 mutations over tree of size 16
- 3 mutations over tree of size 21
- 4 mutations over tree of size 26
[Figure: phylogenetic tree of puffer fish, chicken, pig, mouse, hamster, and human, with branch lengths 10, 2, 7, 2, 2, 2, 1, 0, 1, 1]
Found:
- 0 mutations over tree of size 8
- 1 mutation over tree of size 16
- 3 mutations over tree of size 21
- 4 mutations over tree of size 28

60 60 Application to c-fos Gene
Motif               Score  Conserved in         Known?
CAGGTGCGAATGTTC     0      4 mammals
TTCCCGCCTCCCCTCCCC  0      4 mammals            yes
GAGTTGGCTGcagcc     3      puffer + 4 mammals
GTTCCCGTCAATCcct    1      chicken + 4 mammals  yes
CACAGGATGTcc        4      all 6                yes
AGGACATCTG          1      chicken + 4 mammals  yes
GTCAGCAGGTTTCCACG   0      4 mammals            yes
TACTCCAACCGC        0      4 mammals
metK in B. subtilis

61 61 Microbial Footprinting
1889 prokaryotes with genomes completely sequenced (as of 2/12/2012)
– For any prokaryotic gene of interest, plenty of close genes in other species available
– Relatively simple genomes
MicroFootPrinter (with Shane Neph)
– Designed specifically for phylogenetic footprinting in microbial genomes
– Undergraduate Computational Biology Capstone project
– User specifies species and gene of interest
– Automates collection of orthologous genes, cis-regulatory sequences, gene tree, parameters

62 62 Demo
MicroFootPrinter home
Examples: Agrobacterium tumefaciens genes regulated by ChvI (with Eugene Nester)
– chvI (two-component response regulator)
– ropB (outer membrane protein)

63 63 Sample chvI motif
Parsimony score: 2   Span: 41.10   Significance score: 4.22
B. henselae       -151  GCTACAATTT
R. etli            -90  GCCACAATTT
R. leguminosarum  -106  GCCACAATTT
S. meliloti       -119  GCCACAATTT
S. medicae        -118  GCCACAATTT
A. tumefaciens    -105  GCCACAATTT
M. loti            -80  GCCACATTTT
M. sp.             -87  GCCACATTTT
O. anthropi       -158  GCCACATTTT
B. suis            -38  GCCACATTTT
B. melitensis     -156  GCCACATTTT
B. abortus        -156  GCCACATTTT
B. ovis           -156  GCCACATTTT
B. canis           -38  GCCACATTTT

64 64 Sample ropB motif
Parsimony score: 1   Span: 20.70   Significance score: 1.34
Jannaschia sp.    -151  CACATTTTGG
R. etli           -134  CACAATTTGG
R. leguminosarum  -135  CACAATTTGG
A. tumefaciens    -131  CACATTTTGG
S. meliloti       -128  CACATTTTGG
S. medicae        -128  CACATTTTGG

65 65 Combined ChvI Motif
ropB:     CACATTTTGG
chvI:     GCCACAATTT
Atu1221:  TTGTCACAAT
ultimate: GYCACAWTTTGG
Y = {C,T}, W = {A,T}
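
To see how the three sites fit the combined motif, one can slide each site along the degenerate consensus and keep the longest mismatch-free overlap. This is only a consistency check on the slide's strings, not the procedure used to derive the consensus; `best_offset` is a hypothetical helper, and the table covers just the ambiguity codes that appear here:

```python
# Subset of the IUPAC nucleotide codes needed for this consensus.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "Y": "CT", "W": "AT"}

CONSENSUS = "GYCACAWTTTGG"
SITES = {
    "ropB": "CACATTTTGG",
    "chvI": "GCCACAATTT",
    "Atu1221": "TTGTCACAAT",
}


def best_offset(consensus: str, site: str):
    """Return (offset, overlap_length) of the longest mismatch-free overlap.

    `offset` is where site[0] falls in the consensus; it is negative when the
    site starts before the consensus does. Returns None if nothing matches.
    """
    best = None
    for offset in range(-(len(site) - 1), len(consensus)):
        pairs = [
            (consensus[offset + i], base)
            for i, base in enumerate(site)
            if 0 <= offset + i < len(consensus)
        ]
        if all(base in IUPAC[code] for code, base in pairs):
            if best is None or len(pairs) > best[1]:
                best = (offset, len(pairs))
    return best


for name, site in SITES.items():
    print(name, best_offset(CONSENSUS, site))
# prints: ropB (2, 10), chvI (0, 10), Atu1221 (-2, 8) (one per line)
```

So chvI covers the first ten consensus positions, ropB the last ten, and Atu1221 overlaps the first eight after two leading bases.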

66 66 References and Acknowledgments
- Amol Prakash & Martin Tompa, Measuring the Accuracy of Genome-Size Multiple Alignments. Genome Biology, June 2007, R124.
- Xiaoyu Chen & Martin Tompa, Comparative Assessment of Methods for Aligning Multiple Genome Sequences. Nature Biotechnology, June 2010, 567-572.
- Mathieu Blanchette & Martin Tompa, Discovery of Regulatory Elements by a Computational Method for Phylogenetic Footprinting. Genome Research, May 2002, 739-748.
- Shane Neph & Martin Tompa, MicroFootPrinter: a Tool for Phylogenetic Footprinting in Prokaryotic Genomes. Nucleic Acids Research, July 2006, W366-W368.
All software available at bio.cs.washington.edu/software

