Comparative Sequence Analysis in Molecular Biology


1 Comparative Sequence Analysis in Molecular Biology
Martin Tompa, Computer Science & Engineering, Genome Sciences, University of Washington, Seattle, Washington, U.S.A.
Title: Comparative Sequence Analysis in Molecular Biology
Lecturer: Martin Tompa, Professor of Computer Science & Engineering and Adjunct Professor of Genome Sciences, University of Washington, Seattle, WA, U.S.A.
Outline:
What is phylogenetic footprinting?
How are genes regulated?
Algorithms for phylogenetic footprinting
Whole-genome multiple alignments
How reliable are whole-genome multiple alignments?
References:
Blanchette M, Tompa M. Discovery of Regulatory Elements by a Computational Method for Phylogenetic Footprinting. Genome Research, vol. 12, no. 5, May 2002.
Karlin S, Altschul SF. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proceedings of the National Academy of Sciences USA, vol. 87, 1990.
Kuhn RM, Karolchik D, Zweig AS, Trumbower H, Thomas DJ, Thakkapallayil A, Sugnet CW, Stanke M, Smith KE, Siepel A, Rosenbloom KR, Rhead B, Raney BJ, Pohl A, Pedersen JS, Hsu F, Hinrichs AS, Harte RA, Diekhans M, Clawson H, Bejerano G, Barber GP, Baertsch R, Haussler D, Kent WJ. The UCSC Genome Browser database: update. Nucleic Acids Research, vol. 35, January 2007.
Kumar S, Filipski A. Multiple sequence alignment: In pursuit of homologous DNA positions. Genome Research, vol. 17, February 2007.
Neph S, Tompa M. MicroFootPrinter: a Tool for Phylogenetic Footprinting in Prokaryotic Genomes. Nucleic Acids Research, vol. 34, July 2006, W366-W368.
Prakash A, Tompa M. Measuring the Accuracy of Genome-Size Multiple Alignments. Genome Biology, vol. 8, June 2007, R124.
Abstract: In computational molecular biology, "phylogenetic footprinting" is a standard idea that is used to predict functional regions within a biological sequence (DNA, RNA, or protein). The procedure is to find corresponding sequences from several related species, and within these to identify those regions that have mutated less than expected over the course of evolution, suggesting that these regions are under selective pressure due to biological functionality. We will discuss various algorithms for and applications of phylogenetic footprinting and demonstrate some of these using software available on the web. We will then turn our attention to the larger problem of doing phylogenetic footprinting on a whole-genome scale, demonstrating the use of a genome browser available on the web and discussing the issue of assessing its reliability. These lectures will be self-contained. No prior knowledge of molecular biology is necessary.

2

3 Outline
What genome data is available?
What is phylogenetic footprinting?
Phylogenetic footprinting by multiple sequence alignment
Which parts of multiple sequence alignments are trustworthy?
FootPrinter: phylogenetic footprinting without alignment

4 Outline
What genome data is available?
What is phylogenetic footprinting?
Phylogenetic footprinting by multiple sequence alignment
Which parts of multiple sequence alignments are trustworthy?
FootPrinter: phylogenetic footprinting without alignment

5 DNA: the cell’s program
Nucleotide (A, C, G, or T)

6 DNA, Genes, and Proteins
[Figure: DNA sequence TCCAACGGTGCTGAGGTGCAC, with labels DNA, Gene, and Protein]
DNA: program for cell processes
Proteins (and RNA): execute cell processes

7 How Much DNA in a Cell?
An organism’s genome is the total DNA in one of its cells.
How many nucleotides in a genome?
M. tuberculosis (bacterium): 4,000,000
D. melanogaster (fruit fly): 200,000,000
H. sapiens (human): 3,000,000,000
P. nudum (whisk fern): 250,000,000,000
How can we understand the genome’s program? Lab benchwork is costly and time-consuming. We will return to this question.

8 How Many Genomes Are Available?
46 vertebrate genomes sequenced (primates to rodents to marsupials to birds to fishes)
1025 bacterial genomes sequenced (as of 4/6/2010)
Insects, fungi, worms, plants, …
Many more will be finished very soon
Fertile ground for comparative genomics

9 1982-2003: number of nucleotides in GenBank doubled every 18 months
Since 2003: doubled every 3 years

10 Outline
What genome data is available?
What is phylogenetic footprinting?
Phylogenetic footprinting by multiple sequence alignment
Which parts of multiple sequence alignments are trustworthy?
FootPrinter: phylogenetic footprinting without alignment

11 Phylogenetic Footprinting (Tagle et al. 1988)
Functional regions of DNA (regions under “purifying constraint”) evolve more slowly than nonfunctional ones.
Consider a set of corresponding DNA sequences from related species.
Identify unusually well-conserved subsequences (i.e., ones that have not mutated much over the course of evolution): “motifs”

12

13 Outline
What genome data is available?
What is phylogenetic footprinting?
Phylogenetic footprinting by multiple sequence alignment
Which parts of multiple sequence alignments are trustworthy?
FootPrinter: phylogenetic footprinting without alignment

14 How to Find Conserved Motifs
ACTAACCGGGAGATTTCAGA human
AAGTTCCGGGAGATTTCCA chimp
TAGTTATCCGGGAGATTAGA mouse
AAAACCGGTAGATTTCAGG rat

15 Multiple Sequence Alignment
AC--TAACCGGGAGATTTCAGA human
AAGTT--CCGGGAGATTTCC-A chimp
TAGTTATCCGGGAGATT--AGA mouse
AA---AACCGGTAGATTTCAGG rat
(Finding the optimal alignment is NP-complete.)

16 Phylogenetic Footprinting
Use a whole-genome multiple alignment, such as those provided by the UCSC Genome Browser.
Search for regions of well-conserved alignment.
Regulatory elements [Cliften; Kellis; Kolbe; Prakash; Woolfe; Xie (2)]
RNA elements [Pedersen; Washietl]
General conservation & constraint [Bejerano; Boffelli; Cooper; Margulies (4); Pollard; Prabhakar; Siepel]
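
To make "search for regions of well-conserved alignment" concrete, here is a minimal sketch that scores each column of the four-species toy alignment from slide 15 by the fraction of rows sharing the most common nucleotide, then reports runs of fully conserved columns. The function names, the raw-identity measure, and the thresholds are illustrative assumptions, not the method used by any particular genome browser track.

```python
# Toy alignment copied from slide 15 (human, chimp, mouse, rat).
ALIGNMENT = [
    "AC--TAACCGGGAGATTTCAGA",
    "AAGTT--CCGGGAGATTTCC-A",
    "TAGTTATCCGGGAGATT--AGA",
    "AA---AACCGGTAGATTTCAGG",
]


def column_identity(rows):
    """For each column, the fraction of rows carrying the most common
    nucleotide. Gap characters never count as agreement."""
    scores = []
    for col in zip(*rows):
        bases = [c for c in col if c != "-"]
        top = max((bases.count(b) for b in set(bases)), default=0)
        scores.append(top / len(col))
    return scores


def conserved_runs(rows, min_identity=1.0, min_len=4):
    """Return (start, end_exclusive, first_row_segment) for every maximal
    run of columns whose identity is at least min_identity."""
    runs, start = [], None
    # The trailing -1.0 is a sentinel that closes a run ending at the last column.
    for i, score in enumerate(column_identity(rows) + [-1.0]):
        if score >= min_identity:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i, rows[0][start:i]))
            start = None
    return runs


runs = conserved_runs(ALIGNMENT)
```

On this toy alignment the two fully conserved runs are CCGG and AGATT; they are separated by the single column where rat has T and the other three species have G.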

17 Outline
What genome data is available?
What is phylogenetic footprinting?
Phylogenetic footprinting by multiple sequence alignment
Which parts of multiple sequence alignments are trustworthy?
FootPrinter: phylogenetic footprinting without alignment

18 Which Alignment Columns to Trust?
Vertebrate alignment has 3.8 billion columns
Automatically generated
Recent comparison (Margulies et al., 2007) of 4 whole-mammal alignment methods revealed widespread disagreement

19 Which Alignment Columns to Trust? (with Amol Prakash, generalizing Karlin and Altschul 1990)
Goal: label each alignment column with confidence measure of alignment correctness
Identify sequences that do not belong
Users forewarned about regions of interest
Genome browser designers consider realigning
Alignment tool designers get feedback for possible improvements

20 Sample Suspicious Alignment
Human     GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA TTAACAC
Chimp     GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA TTAACAC
Rhesus    GTTGCCATGC-AAAAATATTATGTCTTTACTAAAATTTATACAAG---CATGTCAA TTAACAC
Mouse     GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CGTGTCAA TTAACAC
Rat       GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CGTGTCAA TTAACAC
Dog       GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA TTAACAC
Cow       GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA TTAACAC
Elephant  GTTGCTATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA TTAACAC
Tenrec    GTTGCCATAC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA TTAACAC
Opossum   GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATATCAA TTAACAC
Chicken   GTTGCCATGCAAAAAATAATATGGCTTTACTAAAATTTACACAAC---CCTGACAA TTAACAC
Zebrafish GAACATATCCGAGTGCTGTAA-AATACTACTGGGA----ACCAGAAATG---ACAAGTTCCATGACAGCTTTGCCTTTTTGGCTC

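21 Scoring Function
Pairwise: $\mathrm{score}(1, 2) = \log \frac{\Pr(1, 2)}{\Pr(1)\,\Pr(2)}$
Multiple: $\mathrm{sc}(12345 \mid \text{tree}) = \log \frac{\Pr(12345 \mid \text{tree})}{\Pr(125 \mid \text{tree})\,\Pr(34 \mid \text{tree})}$, where each conditioning tree is shown as a diagram on the original slide
Species: 1 = Human, 2 = Chimp, 3 = Mouse, 4 = Rat, 5 = Chicken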

22 Outline of Computation
Input: multiple sequence alignment A
For each branch k of the tree {
  Compute scoring function sc_k (Felsenstein)
  Find all maximally scoring segments of A using sc_k (Ruzzo & Tompa)
  Compute K, λ using sc_k (Karlin & Altschul)
  Compute p-value p_k of each segment score using K, λ (Karlin & Altschul)
}
Output: Discordance = max_k p_k
Speaker notes: Taking a look at the big picture: the input to our system is the multiple sequence alignment and the phylogeny relating the sequences. As presented earlier, we have a null hypothesis for each branch of the tree. For a particular branch k, we first build the scoring matrix; say we have sequences from Human/Chimp/Mouse/Rat/Chicken, we build scores for … Next we simulate many datasets, align them, and use their score distribution to estimate K and λ. With these parameters and the scoring matrix, we can compute the score of the given local alignment by adding the scores of the individual columns. Then, using the BLAST theory, we can convert this score into a p-value. Finally, we report the p-value of the weakest branch as the output. This is a slightly different version from the paper, and we think it is a better measure of significance.
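
Two steps of this outline can be sketched in a few lines, under stated assumptions. The first function below is a simplification: it finds only the single highest-scoring segment of a list of per-column scores with a linear scan, whereas the Ruzzo & Tompa algorithm cited on the slide returns all maximal scoring segments. The second function applies the Karlin & Altschul extreme-value approximation, p ≈ 1 − exp(−K·n·e^(−λ·S)), for a segment score S in a sequence of n columns. Function names and the example values of K and λ are hypothetical; in the procedure described in the notes, K and λ are estimated by simulation.

```python
import math


def max_scoring_segment(scores):
    """Linear scan for the single best segment.
    Returns (best_score, start, end_exclusive); (0.0, 0, 0) if no score is positive."""
    best, best_start, best_end = 0.0, 0, 0
    run, run_start = 0.0, 0
    for i, x in enumerate(scores):
        if run <= 0:              # a non-positive prefix can never help; restart here
            run, run_start = 0.0, i
        run += x
        if run > best:
            best, best_start, best_end = run, run_start, i + 1
    return best, best_start, best_end


def segment_p_value(score, n, K, lam):
    """Karlin-Altschul approximation to P(max segment score >= score)
    for n columns, given parameters K and lambda (assumed already estimated)."""
    expected = K * n * math.exp(-lam * score)
    return -math.expm1(-expected)  # numerically stable 1 - exp(-expected)


# Hypothetical per-column scores for one branch k.
best = max_scoring_segment([1, -2, 3, 2, -4, 1])
# Hypothetical parameters: n = 1000 columns, K = 0.1, lambda = 1.0.
p_high_score = segment_p_value(10, 1000, 0.1, 1.0)
p_low_score = segment_p_value(5, 1000, 0.1, 1.0)
```

Repeating this for every branch k and taking the largest p-value gives the discordance reported as output.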

23 Suspicious Alignment Regions
Case study: human chromosome 1 alignment to 16 other vertebrates in UCSC Genome Browser
Identify suspicious alignment regions:
Length ≥ 50 bp
p-value ≥ 0.1 at each position, all with respect to the same branch k
At most 50% gapped columns
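
The three criteria can be read as a simple filter over per-column annotations. The sketch below assumes the thresholds mean a length of at least 50 columns and a p-value of at least 0.1 (the comparison symbols are missing from the transcript, so the direction is an assumption based on discordance being the largest branch p-value). It also assumes the per-column p-values for one fixed branch k and the per-column gap flags have already been computed; the function and parameter names are hypothetical.

```python
def suspicious_regions(p_values, gapped, min_len=50, min_p=0.1, max_gap_frac=0.5):
    """Return (start, end_exclusive) for each maximal run of columns whose
    p-value (for one fixed branch) stays >= min_p, provided the run is at
    least min_len columns long and at most max_gap_frac of it is gapped."""
    regions, start = [], None
    # The trailing -1.0 is a sentinel that closes a run ending at the last column.
    for i, p in enumerate(list(p_values) + [-1.0]):
        if p >= min_p:
            if start is None:
                start = i
        elif start is not None:
            length = i - start
            gap_frac = sum(gapped[start:i]) / length
            if length >= min_len and gap_frac <= max_gap_frac:
                regions.append((start, i))
            start = None
    return regions


# Synthetic example: a 60-column weak stretch and a 20-column weak stretch.
p_vals = [0.01] * 10 + [0.5] * 60 + [0.01] * 5 + [0.5] * 20
no_gaps = [False] * len(p_vals)
mostly_gapped = [False] * 10 + [True] * 40 + [False] * 45

found = suspicious_regions(p_vals, no_gaps)
found_gapped = suspicious_regions(p_vals, mostly_gapped)
```

Only the 60-column stretch qualifies in the ungapped case; when 40 of its 60 columns are gapped it is rejected by the 50% rule.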

24 Proposed Track on the UCSC Browser

25 247,000,000 9.7% 15% 3.3% 2.3% 26% 1.3% 24% 29% .004%

26 Genomic Locations of Suspicious Regions
6% of chromosome 1 alignments containing mouse are exonic
35% of chromosome 1 alignments containing zebrafish are exonic

27 Outline
What genome data is available?
What is phylogenetic footprinting?
Phylogenetic footprinting by multiple sequence alignment
Which parts of multiple sequence alignments are trustworthy?
FootPrinter: phylogenetic footprinting without alignment

28 DNA, Genes, and Proteins
[Figure: DNA sequence TCCAACGGTGCTGAGGTGCAC, with labels DNA, Gene, and Protein]
DNA: program for cell processes
Proteins: execute cell processes

29 Regulation of Genes
What turns genes on and off?
When is a gene turned on or off?
Where (in which cells) is a gene turned on?
How many copies of the gene product are produced?

30 Regulation of Genes
[Figure labels: Transcription Factor, RNA polymerase, DNA, Gene, Regulatory Element]

31 Regulation of Genes
[Figure labels: Transcription Factor, RNA polymerase, DNA, Gene, Regulatory Element]

32 Goal
Identify regulatory elements in DNA sequences. These are:
Binding sites for proteins
Short subsequences (5-25 nucleotides)
Up to 1000 nucleotides (or farther) from gene
Inexactly repeating patterns (“motifs”)

33 CLUSTALW multiple sequence alignment (rbcS gene)
Cotton    ACGGTT-TCCATTGGATGA---AATGAGATAAGAT---CACTGTGC---TTCTTCCACGTG--GCAGGTTGCCAAAGATA AGGCTTTACCATT
Pea       GTTTTT-TCAGTTAGCTTA---GTGGGCATCTTA----CACGTGGC---ATTATTATCCTA--TT-GGTGGCTAATGATA AGG--TTAGCACA
Tobacco   TAGGAT-GAGATAAGATTA---CTGAGGTGCTTTA---CACGTGGC---ACCTCCATTGTG--GT-GACTTAAATGAAGA ATGGCTTAGCACC
Ice-plant TCCCAT-ACATTGACATAT---ATGGCCCGCCTGCGGCAACAAAAA---AACTAAAGGATA--GCTAGTTGCTACTACAATTC--CCATAACTCACCACC
Turnip    ATTCAT-ATAAATAGAAGG---TCCGCGAACATTG--AAATGTAGATCATGCGTCAGAATT--GTCCTCTCTTAATAGGA A GGAGC
Wheat     TATGAT-AAAATGAAATAT---TTTGCCCAGCCA-----ACTCAGTCGCATCCTCGGACAA--TTTGTTATCAAGGAACTCAC--CCAAAAACAAGCAAA
Duckweed  TCGGAT-GGGGGGGCATGAACACTTGCAATCATT-----TCATGACTCATTTCTGAACATGT-GCCCTTGGCAACGTGTAGACTGCCAACATTAATTAAA
Larch     TAACAT-ATGATATAACAC---CGGGCACACATTCCTAAACAAAGAGTGATTTCAAATATATCGTTAATTACGACTAACAAAA--TGAAAGTACAAGACC

Cotton    CAAGAAAAGTTTCCACCCTC------TTTGTGGTCATAATG-GTT-GTAATGTC-ATCTGATTT----AGGATCCAACGTCACCCTTTCTCCCA-----A
Pea       C---AAAACTTTTCAATCT TGTGTGGTTAATATG-ACT-GCAAAGTTTATCATTTTC----ACAATCCAACAA-ACTGGTTCT A
Tobacco   AAAAATAATTTTCCAACCTTT---CATGTGTGGATATTAAG-ATTTGTATAATGTATCAAGAACC-ACATAATCCAATGGTTAGCTTTATTCCAAGATGA
Ice-plant ATCACACATTCTTCCATTTCATCCCCTTTTTCTTGGATGAG-ATAAGATATGGGTTCCTGCCAC----GTGGCACCATACCATGGTTTGTTA-ACGATAA
Turnip    CAAAAGCATTGGCTCAAGTTG-----AGACGAGTAACCATACACATTCATACGTTTTCTTACAAG-ATAAGATAAGATAATGTTATTTCT A
Wheat     GCTAGAAAAAGGTTGTGTGGCAGCCACCTAATGACATGAAGGACT-GAAATTTCCAGCACACACA-A-TGTATCCGACGGCAATGCTTCTTC
Duckweed  ATATAATATTAGAAAAAAATC-----TCCCATAGTATTTAGTATTTACCAAAAGTCACACGACCA-CTAGACTCCAATTTACCCAAATCACTAACCAATT
Larch     TTCTCGTATAAGGCCACCA TTGGTAGACACGTAGTATGCTAAATATGCACCACACACA-CTATCAGATATGGTAGTGGGATCTG--ACGGTCA

Cotton    ACCAATCTCT---AAATGTT----GTGAGCT---TAG-GCCAAATTT-TATGACTATA--TAT----AGGGGATTGCACC----AAGGCAGTG-ACACTA
Pea       GGCAGTGGCC---AACTAC CACAATTT-TAAGACCATAA-TAT----TGGAAATAGAA------AAATCAAT--ACATTA
Tobacco   GGGGGTTGTT---GATTTTT----GTCCGTTAGATAT-GCGAAATATGTAAAACCTTAT-CAT----TATATATAGAG------TGGTGGGCA-ACGATG
Ice-plant GGCTCTTAATCAAAAGTTTTAGGTGTGAATTTAGTTT-GATGAGTTTTAAGGTCCTTAT-TATA---TATAGGAAGGGGG----TGCTATGGA-GCAAGG
Turnip    CACCTTTCTTTAATCCTGTGGCAGTTAACGACGATATCATGAAATCTTGATCCTTCGAT-CATTAGGGCTTCATACCTCT----TGCGCTTCTCACTATA
Wheat     CACTGATCCGGAGAAGATAAGGAAACGAGGCAACCAGCGAACGTGAGCCATCCCAACCA-CATCTGTACCAAAGAAACGG----GGCTATATATACCGTG
Duckweed  TTAGGTTGAATGGAAAATAG---AACGCAATAATGTCCGACATATTTCCTATATTTCCG-TTTTTCGAGAGAAGGCCTGTGTACCGATAAGGATGTAATC
Larch     CGCTTCTCCTCTGGAGTTATCCGATTGTAATCCTTGCAGTCCAATTTCTCTGGTCTGGC-CCA----ACCTTAGAGATTG----GGGCTTATA-TCTATA

Cotton    T-TAAGGGATCAGTGAGAC-TCTTTTGTATAACTGTAGCAT--ATAGTAC
Pea       TATAAAGCAAGTTTTAGTA-CAAGCTTTGCAATTCAACCAC--A-AGAAC
Tobacco   CATAGACCATCTTGGAAGT-TTAAAGGGAAAAAAGGAAAAG--GGAGAAA
Ice-plant TCCTCATCAAAAGGGAAGTGTTTTTTCTCTAACTATATTACTAAGAGTAC
Larch     TCTTCTTCACAC---AATCCATTTGTGTAGAGCCGCTGGAAGGTAAATCA
Turnip    TATAGATAACCA---AAGCAATAGACAGACAAGTAAGTTAAG-AGAAAAG
Wheat     GTGACCCGGCAATGGGGTCCTCAACTGTAGCCGGCATCCTCCTCTCCTCC
Duckweed  CATGGGGCGACG---CAGTGTGTGGAGGAGCAGGCTCAGTCTCCTTCTCG

34 Finding Short Motifs
Size of motif sought: k = 4
AGTCGTACGTGAC... (Human)
AGTAGACGTGCCG... (Chimp)
ACGTGAGATACGT... (Rabbit)
GAACGGAGTACGT... (Mouse)
TCGTGACGGTGAT... (Rat)

35 Most Parsimonious Solution
AGTCGTACGTGAC...
AGTAGACGTGCCG...
ACGTGAGATACGT...
GAACGGAGTACGT...
TCGTGACGGTGAT...
[Figure: phylogenetic tree over these five sequences, with node labels ACGT, ACGT, ACGT, and ACGG]
“Parsimony score”: 1 mutation
(Finding the most parsimonious motif is NP-complete.)

36 Substring Parsimony Problem
Given: phylogenetic tree T, set of orthologous sequences at leaves of T, length k of motif, threshold d
Problem: Find each set S of k-mers, one k-mer from each leaf, such that the parsimony score of S in T is at most d.
This problem is NP-hard.

37 FootPrinter’s Exact Algorithm (with Mathieu Blanchette, generalizing Sankoff and Rousseau 1975)
$W_u[s]$ = best parsimony score for subtree rooted at node u, if u is labeled with string s.
[Figure: phylogenetic tree with leaf sequences AGTCGTACGTG, ACGGGACGTGC, ACGTGAGATAC, GAACGGAGTAC, TCGTGACGGTG. Each node carries a table of $4^k$ entries, one per k-mer. Entries shown for (ACGG, ACGT): (2, 1), (0, 2), (1, 1), (+∞, 0), (1, 0), (0, +∞).]

38 Running Time
$W_u[s] = \sum_{v:\ \text{child of } u} \min_t \left( W_v[t] + d(s, t) \right)$
n = number of species, l = average sequence length, k = motif length
Total time: $O(n k (4^k + l))$
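
The recurrence (for each node u and label s, sum over the children v of u the minimum over t of W_v[t] plus the distance d(s, t)) can be sketched directly on the five toy sequences from slide 34. This is a naive illustration that evaluates every pair of k-mers on every edge, so it corresponds to the slower 4^(2k) variant rather than FootPrinter's optimized algorithm. The tree topology, the use of Hamming distance for d, and all names are assumptions made for the example; leaf tables hold score 0 for k-mers present in the leaf sequence and are treated as infinite elsewhere.

```python
from itertools import product

# Toy sequences from slide 34 (truncated on the slide with "...").
SEQS = {
    "Human": "AGTCGTACGTGAC",
    "Chimp": "AGTAGACGTGCCG",
    "Rabbit": "ACGTGAGATACGT",
    "Mouse": "GAACGGAGTACGT",
    "Rat": "TCGTGACGGTGAT",
}
# Assumed topology, written as nested tuples; leaves are species names.
TREE = (("Human", "Chimp"), ("Rabbit", ("Mouse", "Rat")))


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def kmers(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parsimony_table(node, seqs, k):
    """Return {s: W_node[s]} containing only the finite entries."""
    if isinstance(node, str):
        # Leaf: score 0 for every k-mer that occurs in the sequence.
        return {s: 0 for s in kmers(seqs[node], k)}
    child_tables = [parsimony_table(child, seqs, k) for child in node]
    table = {}
    for s in map("".join, product("ACGT", repeat=k)):
        # Sum over children of the cheapest child label t plus mutations s -> t.
        table[s] = sum(
            min(cost + hamming(s, t) for t, cost in child.items())
            for child in child_tables
        )
    return table


root = parsimony_table(TREE, SEQS, 4)
best_score = min(root.values())
best_root_labels = sorted(s for s, w in root.items() if w == best_score)
```

With k = 4 the best score is 1, and ACGT is among the optimal root labels, matching the "1 mutation" solution on slide 35: every sequence contains ACGT except the rat sequence, which contains ACGG.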

39 Improvements
Better algorithm reduces time from $O(n k (4^{2k} + l))$ to $O(n k (4^k + l))$
By restricting to motifs with parsimony score at most d, greatly reduce the number of table entries computed (exponential in d, polynomial in k)
Amenable to many useful extensions (e.g., allow insertions and deletions)

40 Application to -actin Gene
Gilthead sea bream (678 bp)
Medaka fish (1016 bp)
Common carp (696 bp)
Grass carp (917 bp)
Chicken (871 bp)
Human (646 bp)
Rabbit (636 bp)
Rat (966 bp)
Mouse (684 bp)
Hamster (1107 bp)

41 Parsimony score over 10 vertebrates: 0 1 2
Common carp ACGGACTGTTACCACTTCACGCCGACTCAACTGCGCAGAGAAAAACTTCAAACGACAACATTGGCATGGCTTTTGTTATTTTTGGCGCTTGACTCAGGATCTAAAAACTGGAACGGCGAAGGTGACGGCAATGTTTTGGCAAATAAGCATCCCCGAAGTTCTACAATGCATCTGAGGACTCAATGTTTTTTTTTTTTTTTTTTCTTTAGTCATTCCAAATGTTTGTTAAATGCATTGTTCCGAAACTTATTTGCCTCTATGAAGGCTGCCCAGTAATTGGGAGCATACTTAACATTGTAGTATTGTATGTAAATTATGTAACAAAACAATGACTGGGTTTTTGTACTTTCAGCCTTAATCTTGGGTTTTTTTTTTTTTTTGGTTCCAAAAAACTAAGCTTTACCATTCAAGATGTAAAGGTTTCATTCCCCCTGGCATATTGAAAAAGCTGTGTGGAACGTGGCGGTGCAGACATTTGGTGGGGCCAACCTGTACACTGACTAATTCAAATAAAAGTGCACATGTAAGACATCCTACTCTGTGTGATTTTTCTGTTTGTGCTGAGTGAACTTGCTATGAAGTCTTTTAGTGCACTCTTTAATAAAAGTAGTCTTCCCTTAAAGTGTCCCTTCCCTTATGGCCTTCACATTTCTCAACTAGCGCTTCAACTAGAAAGCACTTTAGGGACTGGGATGC Chicken ACCGGACTGTTACCAACACCCACACCCCTGTGATGAAACAAAACCCATAAATGCGCATAAAACAAGACGAGATTGGCATGGCTTTATTTGTTTTTTCTTTTGGCGCTTGACTCAGGATTAAAAAACTGGAATGGTGAAGGTGTCAGCAGCAGTCTTAAAATGAAACATGTTGGAGCGAACGCCCCCAAAGTTCTACAATGCATCTGAGGACTTTGATTGTACATTTGTTTCTTTTTTAATAGTCATTCCAAATATTGTTATAATGCATTGTTACAGGAAGTTACTCGCCTCTGTGAAGGCAACAGCCCAGCTGGGAGGAGCCGGTACCAATTACTGGTGTTAGATGATAATTGCTTGTCTGTAAATTATGTAACCCAACAAGTGTCTTTTTGTATCTTCCGCCTTAAAAACAAAACACACTTGATCCTTTTTGGTTTGTCAAGCAAGCGGGCTGTGTTCCCCAGTGATAGATGTGAATGAAGGCTTTACAGTCCCCCACAGTCTAGGAGTAAAGTGCCAGTATGTGGGGGAGGGAGGGGCTACCTGTACACTGACTTAAGACCAGTTCAAATAAAAGTGCACACAATAGAGGCTTGACTGGTGTTGGTTTTTATTTCTGTGCTGCGCTGCTTGGCCGTTGGTAGCTGTTCTCATCTAGCCTTGCCAGCCTGTGTGGGTCAGCTATCTGCATGGGCTGCGTGCTGGTGCTGTCTGGTGCAGAGGTTGGATAAACCGTGATGATATTTCAGCAAGTGGGAGTTGGCTCTGATTCCATCCTGAGCTGCCATCAGTGTGTTCTGAAGGAAGCTGTTGGATGAGGGTGGGCTGAGTGCTGGGGGACAGCTGGGCTCAGTGGGACTGCAGCTGTGCT Human 
GCGGACTATGACTTAGTTGCGTTACACCCTTTCTTGACAAAACCTAACTTGCGCAGAAAACAAGATGAGATTGGCATGGCTTTATTTGTTTTTTTTGTTTTGTTTTGGTTTTTTTTTTTTTTTTGGCTTGACTCAGGATTTAAAAACTGGAACGGTGAAGGTGACAGCAGTCGGTTGGAGCGAGCATCCCCCAAAGTTCACAATGTGGCCGAGGACTTTGATTGCATTGTTGTTTTTTTAATAGTCATTCCAAATATGAGATGCATTGTTACAGGAAGTCCCTTGCCATCCTAAAAGCCACCCCACTTCTCTCTAAGGAGAATGGCCCAGTCCTCTCCCAAGTCCACACAGGGGAGGTGATAGCATTGCTTTCGTGTAAATTATGTAATGCAAAATTTTTTTAATCTTCGCCTTAATACTTTTTTATTTTGTTTTATTTTGAATGATGAGCCTTCGTGCCCCCCCTTCCCCCTTTTTGTCCCCCAACTTGAGATGTATGAAGGCTTTTGGTCTCCCTGGGAGTGGGTGGAGGCAGCCAGGGCTTACCTGTACACTGACTTGAGACCAGTTGAATAAAAGTGCACACCTTAAAAATGAGGCCAAGTGTGACTTTGTGGTGTGGCTGGGTTGGGGGCAGCAGAGGGTG Parsimony score over 10 vertebrates:

42 Motifs Absent from Some Species
Find motifs with small parsimony score that span a large part of the tree
Example: in tree of 10 species spanning 760 Myrs, find all motifs with
score 0 spanning at least 250 Myrs
score 1 spanning at least 350 Myrs
score 2 spanning at least 450 Myrs
score 3 spanning at least 550 Myrs

43 Application to c-fos Gene
[Figure: phylogenetic tree of Puffer fish, Chicken, Pig, Mouse, Hamster, and Human; numbers shown on the tree: 10, 7, 2, 2, 1, 2, 2, 1, 1]
Asked for motifs of length 10, with a mutation limit for each of five tree sizes (the limits and all but the last tree size, 26, are missing from the transcript)
Found:
0 mutations over tree of size 8
1 mutation over tree of size 16
3 mutations over tree of size 21
4 mutations over tree of size 28

44 Application to c-fos Gene
Motif               Score  Conserved in         Known?
CAGGTGCGAATGTTC     0      4 mammals
TTCCCGCCTCCCCTCCCC  0      4 mammals            yes
GAGTTGGCTGcagcc     3      puffer + 4 mammals
GTTCCCGTCAATCcct    1      chicken + 4 mammals  yes
CACAGGATGTcc        4      all 6                yes
AGGACATCTG          1      chicken + 4 mammals  yes
GTCAGCAGGTTTCCACG   0      4 mammals            yes
TACTCCAACCGC        0      4 mammals
metK in B. subtilis

45 Microbial Footprinting
1105 prokaryotes with genomes completely sequenced (as of 4/6/2010)
For any prokaryotic gene of interest, plenty of close genes in other species available
Relatively simple genomes
MicroFootPrinter (with Shane Neph)
Designed specifically for phylogenetic footprinting in microbial genomes
Undergraduate Computational Biology Capstone project
User specifies species and gene of interest
Automates collection of orthologous genes, cis-regulatory sequences, gene tree, parameters

46 Demo MicroFootPrinter home
Examples:
Agrobacterium tumefaciens genes regulated by ChvI (with Eugene Nester)
chvI (two-component response regulator)
ropB (outer membrane protein)
Bacillus cereus: food poisoning
Listeria monocytogenes: listeriosis, a food-borne illness
Staphylococcus aureus: toxic-shock syndrome
Staphylococcus saprophyticus: urinary tract infections

47 Sample chvI motif
Parsimony score: 2
Span:
Significance score: 4.22
B. henselae            GCTACAATTT
R. etli           -90  GCCACAATTT
R. leguminosarum       GCCACAATTT
S. meliloti            GCCACAATTT
S. medicae             GCCACAATTT
A. tumefaciens         GCCACAATTT
M. loti           -80  GCCACATTTT
M. sp                  GCCACATTTT
O. anthropi            GCCACATTTT
B. suis           -38  GCCACATTTT
B. melitensis          GCCACATTTT
B. abortus             GCCACATTTT
B. ovis                GCCACATTTT
B. canis          -38  GCCACATTTT

48 Sample ropB motif
Parsimony score: 1
Span:
Significance score: 1.34
Jannaschia sp           CACATTTTGG
R. etli           -134  CACAATTTGG
R. leguminosarum  -135  CACAATTTGG
A. tumefaciens    -131  CACATTTTGG
S. meliloti       -128  CACATTTTGG
S. medicae        -128  CACATTTTGG

49 Combined ChvI Motif
ropB: CACATTTTGG
chvI: GCCACAATTT
Atu1221: TTGTCACAAT
ultimate: GYCACAWTTTGG
Y={C,T} W={A,T}
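
As a quick check on how the three sites relate to the combined motif, the sketch below tests whether each site agrees with GYCACAWTTTGG at every position where the two overlap, using the degenerate codes given on the slide (Y = C or T, W = A or T). The offsets at which the sites are placed against the consensus are an assumption inferred from the shared CACA core; the slide's own alignment did not survive in the transcript. Function and variable names are hypothetical.

```python
# Only the symbols that appear in the slide's consensus are needed.
CODES = {"A": "A", "C": "C", "G": "G", "T": "T", "Y": "CT", "W": "AT"}

CONSENSUS = "GYCACAWTTTGG"

# site sequence and assumed offset of its first base relative to the consensus
SITES = {
    "ropB": ("CACATTTTGG", 2),
    "chvI": ("GCCACAATTT", 0),
    "Atu1221": ("TTGTCACAAT", -2),
}


def consistent(consensus, site, offset):
    """True if site, placed with site[0] at consensus[offset], matches the
    consensus at every overlapping position. Overhanging bases are ignored."""
    for i, base in enumerate(site):
        j = offset + i
        if 0 <= j < len(consensus) and base not in CODES[consensus[j]]:
            return False
    return True


results = {
    name: consistent(CONSENSUS, site, offset)
    for name, (site, offset) in SITES.items()
}
```

Under these offsets all three sites are consistent with the consensus: chvI covers its first ten positions, ropB its last ten, and Atu1221 overhangs the left end by two bases.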

50 References and Acknowledgments
Amol Prakash & Martin Tompa, Measuring the Accuracy of Genome-Size Multiple Alignments. Genome Biology, June 2007, R124.
Mathieu Blanchette & Martin Tompa, Discovery of Regulatory Elements by a Computational Method for Phylogenetic Footprinting. Genome Research, May 2002.
Shane Neph & Martin Tompa, MicroFootPrinter: a Tool for Phylogenetic Footprinting in Prokaryotic Genomes. Nucleic Acids Research, July 2006, W366-W368.
All software available at bio.cs.washington.edu/software.html

51 Extra Material

52 BLASTX e-values for mouse alignments to human coding regions

53 Synthetic experiments
Same species Any species Specificity Sensitivity

