Comparative Sequence Analysis in Molecular Biology Martin Tompa Computer Science & Engineering Genome Sciences University of Washington Seattle, Washington,

Presentation transcript:

Comparative Sequence Analysis in Molecular Biology Martin Tompa Computer Science & Engineering Genome Sciences University of Washington Seattle, Washington, U.S.A.

3 Outline
– What genome data is available?
– What is phylogenetic footprinting?
– Phylogenetic footprinting by multiple sequence alignment
– Which parts of multiple sequence alignments are trustworthy?
– FootPrinter: phylogenetic footprinting without alignment

5 How Many Genomes Are Available?
– 46 vertebrate genomes sequenced (primates to rodents to marsupials to birds to fishes)
– 1766 bacterial genomes sequenced (as of 2/12/2012)
– Insects, fungi, worms, plants, …
– Many more will be finished very soon
– Fertile ground for comparative genomics

6 : number of nucleotides in GenBank doubled every 18 months. Since 2003: doubled every 3 years.

8 Phylogenetic Footprinting (Tagle et al. 1988)
Functional regions of DNA (regions under “purifying constraint”) evolve more slowly than nonfunctional ones.
1. Consider a set of corresponding DNA sequences from related species.
2. Identify unusually well conserved subsequences (i.e., ones that have not mutated much over the course of evolution): “motifs”.

11 How to Find Conserved Motifs
ACTAACCGGGAGATTTCAGA   human
AAGTTCCGGGAGATTTCCA    chimp
TAGTTATCCGGGAGATTAGA   mouse
AAAACCGGTAGATTTCAGG    rat

12 Multiple Sequence Alignment
AC--TAACCGGGAGATTTCAGA   human
AAGTT--CCGGGAGATTTCC-A   chimp
TAGTTATCCGGGAGATT--AGA   mouse
AA---AACCGGTAGATTTCAGG   rat
(Finding the optimal alignment is NP-complete.)
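As a rough illustration of footprinting by alignment, the sketch below scans the four-row alignment from this slide for columns in which every species carries the same base, then reports the longest run of such columns. The helper names `conserved_columns` and `longest_run` are made up for this example, and "identical in all rows" is a deliberately strict stand-in for "well conserved"; real tools use statistical conservation scores.

```python
# Alignment rows copied from the slide; the column scan is an illustrative sketch.
alignment = {
    "human": "AC--TAACCGGGAGATTTCAGA",
    "chimp": "AAGTT--CCGGGAGATTTCC-A",
    "mouse": "TAGTTATCCGGGAGATT--AGA",
    "rat":   "AA---AACCGGTAGATTTCAGG",
}

def conserved_columns(rows):
    """Return indices of columns where every row has the same non-gap base."""
    cols = []
    for i, column in enumerate(zip(*rows)):
        if "-" not in column and len(set(column)) == 1:
            cols.append(i)
    return cols

def longest_run(indices):
    """Return (start, end) of the longest run of consecutive indices, end exclusive."""
    best = (0, 0)
    start = prev = None
    for i in indices:
        if prev is None or i != prev + 1:
            start = i  # a new run begins here
        prev = i
        if i + 1 - start > best[1] - best[0]:
            best = (start, i + 1)
    return best

cols = conserved_columns(list(alignment.values()))
start, end = longest_run(cols)
motif = alignment["human"][start:end]
print(cols, (start, end), motif)
```

The single rat mismatch (T instead of G) splits what would otherwise be one ten-column conserved block, which is why tolerance for a few mutations matters in practice.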

13 Phylogenetic Footprinting
1. Use a whole-genome multiple alignment, such as the one provided by the UCSC Genome Browser.
2. Search for regions of well conserved alignment.
– Regulatory elements [Cliften; Kellis; Kolbe; Prakash; Woolfe; Xie (2)]
– RNA elements [Pedersen; Washietl]
– General conservation & constraint [Bejerano; Boffelli; Cooper; Margulies (4); Pollard; Prabhakar; Siepel]

15 Why Doubt Alignments?
– Multiple sequence alignment of short sequences (proteins, promoters) is difficult (NP-complete)
– Aligning whole genomes adds the complications of huge sequences and genomic rearrangements
– Vertebrate alignment has 3.8 billion columns
– Automatically generated

16 Assessing 4 Genome-Size Alignments (with Xiaoyu Chen)
– Alignments: MLAGAN [Brudno 2003], MAVID [Bray 2003], TBA [Blanchette 2003], Pecan [Paten 2008]
– Target ENCODE regions: 30 Mbp covering 1% of the human genome (ENCODE targets)
– Total input: 554 Mbp over 28 vertebrates
– Rich resource for comparing and assessing genome-size alignments
Margulies et al. 2007, Genome Research

17 Coverage of each alignment Alignment coverage: number of human bases aligned to a given species
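To make the coverage definition concrete, here is a minimal sketch that counts, for each non-human row, how many human bases sit in a column where that species also has a base. It reuses the toy four-species alignment from slide 12 rather than the ENCODE data, and the function name `coverage` is an assumption of this example.

```python
# Toy alignment from slide 12, not the ENCODE alignments discussed here.
alignment = {
    "human": "AC--TAACCGGGAGATTTCAGA",
    "chimp": "AAGTT--CCGGGAGATTTCC-A",
    "mouse": "TAGTTATCCGGGAGATT--AGA",
    "rat":   "AA---AACCGGTAGATTTCAGG",
}

def coverage(reference, other):
    """Count reference bases that are aligned to a base (not a gap) in the other row."""
    return sum(1 for r, o in zip(reference, other) if r != "-" and o != "-")

human = alignment["human"]
cov = {sp: coverage(human, row) for sp, row in alignment.items() if sp != "human"}
print(cov)
```

Note that this counts aligned positions whether or not the bases match; coverage measures how much is aligned, not how well.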

18 Coverage of each alignment In noncoding regions, as species distance from human↑, coverage↓

19 Coverage of each alignment MAVID has lowest coverage

20 Coverage of each alignment Other 3 have comparable coverage in placental mammals

21 Coverage of each alignment MLAGAN has highest coverage in distant species, intronic and intergenic

22 Level of agreement among alignments
[Figure: Agree%, Disagree%, and Unique% for TBA (T), MLAGAN (L), and MAVID (V); bars labeled T, V, L; panels: coding, UTR, intronic, and intergenic bases]

23 Level of agreement among alignments (same figure). Agree%: Coding > UTR > Int.

24 Level of agreement among alignments (same figure). Unique%: Coding < UTR < Int.

25 Level of agreement among alignments (same figure). As species distance from human↑, Agree%↓, Unique%↑

26 Level of agreement among alignments (same figure). Primates: high Agree%

27 Level of agreement among alignments (same figure). Placental nonprimates: Agree% > 0.5

28 Level of agreement among alignments (same figure). Distant species, Int: low Agree%, high Unique%

29 Alignment agreement for mouse
[Figure: Agree%, Disagree%, and Unique% for intronic and intergenic bases]
– Intronic & intergenic account for 95% of mouse bases aligned to human
– Agree% in those categories: 44% to 62%
– Much worse for more distant species
– Building reliable MSA remains challenging

30 Which Alignment Columns to Trust? (with Amol Prakash, generalizing Karlin and Altschul 1990)
Goal: label each alignment column with a confidence measure of alignment correctness
  – Identify sequences that do not belong
– Users forewarned about regions of interest
– Genome browser designers consider realigning
– Alignment tool designers get feedback for possible improvements

31 Sample Suspicious Alignment
Human      GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA TTAACAC
Chimp      GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA TTAACAC
Rhesus     GTTGCCATGC-AAAAATATTATGTCTTTACTAAAATTTATACAAG---CATGTCAA TTAACAC
Mouse      GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CGTGTCAA TTAACAC
Rat        GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CGTGTCAA TTAACAC
Dog        GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA TTAACAC
Cow        GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA TTAACAC
Elephant   GTTGCTATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA TTAACAC
Tenrec     GTTGCCATAC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA TTAACAC
Opossum    GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATATCAA TTAACAC
Chicken    GTTGCCATGCAAAAAATAATATGGCTTTACTAAAATTTACACAAC---CCTGACAA TTAACAC
Zebrafish  GAACATATCCGAGTGCTGTAA-AATACTACTGGGA----ACCAGAAATG---ACAAGTTCCATGACAGCTTTGCCTTTTTGGCTC

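32 Scoring Function
Pairwise: $\mathrm{score}(\sigma_1, \sigma_2) = \log \frac{\Pr(\sigma_1, \sigma_2)}{\Pr(\sigma_1)\,\Pr(\sigma_2)}$
Multiple: $\mathrm{sc}(\sigma_1 \sigma_2 \sigma_3 \sigma_4 \sigma_5 \mid T) = \log \frac{\Pr(\sigma_1 \sigma_2 \sigma_3 \sigma_4 \sigma_5 \mid T)}{\Pr(\sigma_1 \sigma_2 \sigma_5 \mid T_1)\,\Pr(\sigma_3 \sigma_4 \mid T_2)}$
[Figure: tree with leaves 1 to 5 labeled Human, Chimp, Mouse, Rat, Chicken. Here $\sigma_i$ is the character at leaf $i$ in an alignment column, $T$ is the tree, and $T_1$, $T_2$ are the two subtrees on either side of the branch being tested.]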
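The pairwise score above is a log-odds ratio: the joint probability of seeing the two characters together, divided by the probability of seeing them independently. The sketch below evaluates it under the Jukes-Cantor substitution model with uniform base frequencies. That model choice, the branch length `t`, and the function names are assumptions made for illustration; the slides do not say which substitution model was used.

```python
import math

def jc69_prob(a, b, t):
    """Jukes-Cantor probability that base a is replaced by base b over branch length t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay

def pairwise_score(a, b, t):
    """log( Pr(a, b) / (Pr(a) * Pr(b)) ) with uniform base frequencies of 0.25."""
    joint = 0.25 * jc69_prob(a, b, t)  # Pr(a) * Pr(b | a, t)
    return math.log(joint / (0.25 * 0.25))

match = pairwise_score("A", "A", 0.1)
mismatch = pairwise_score("A", "G", 0.1)
saturated = pairwise_score("A", "G", 50.0)
print(match, mismatch, saturated)
```

Matches score positive and mismatches negative, and both shrink toward zero as the branch grows long, since distant sequences look independent. Negative expected score with some positive scores is the regime Karlin-Altschul statistics require.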

33 Outline of Computation
Input: multiple sequence alignment A
Output: discordance = max_k p_k
For each branch k of the tree {
    Compute scoring function sc_k (Felsenstein)
    Find all maximally scoring segments of A using sc_k (Ruzzo & Tompa)
    Compute K, λ using sc_k (Karlin & Altschul)
    Compute p-value p_k of each segment score using K, λ (Karlin & Altschul)
}
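The "maximally scoring segments" step can be sketched as follows. This is a compact version of the Ruzzo-Tompa procedure for finding all maximal scoring subsequences of a list of per-column scores. It uses a plain backward scan instead of the pointer trick that makes the published algorithm linear time, and the input scores are invented for the example, so treat it as a sketch rather than the authors' implementation.

```python
def maximal_segments(scores):
    """Return all maximal scoring segments as (start, end, score), end exclusive."""
    # Each candidate is [start, end, L, R]: cumulative score before start and at end.
    segs = []
    cum = 0
    for i, s in enumerate(scores):
        before = cum
        cum += s
        if s <= 0:
            continue  # only positive scores can start a segment
        cur = [i, i + 1, before, cum]
        while True:
            # Find the rightmost earlier candidate whose left cumulative is smaller.
            j = len(segs) - 1
            while j >= 0 and segs[j][2] >= cur[2]:
                j -= 1
            if j < 0 or segs[j][3] >= cur[3]:
                segs.append(cur)
                break
            # Otherwise merge candidates j.. into cur and re-examine the result.
            cur = [segs[j][0], cur[1], segs[j][2], cur[3]]
            del segs[j:]
    return [(a, b, r - l) for a, b, l, r in segs]

example = [4, -5, 3, -3, 1, 2, -2, 2, -2, 1, 5]
result = maximal_segments(example)
print(result)
```

In the outline above, each returned segment score would then be converted to a p-value using K and λ.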

34 Suspicious Alignment Regions
Back to four ENCODE alignments spanning 30 Mbp of human aligned to 27 other vertebrates (with Xiaoyu Chen)
Identify suspicious alignment regions:
– Length ≥ 50 bp
– Discordance ≥ 0.1 at each position, all with respect to the same worst species
– Fewer than 50% gapped sites
Suspicious%
– Percentage of aligned bases in suspicious regions
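The three criteria for a suspicious region translate into a simple run filter. The sketch below reads the thresholds as "at least 50 bp" and "discordance at least 0.1", and treats "gapped site" as a per-column boolean; both readings are interpretations, and the input arrays are invented. The function name `suspicious_regions` is hypothetical.

```python
from itertools import groupby

def suspicious_regions(discordance, worst_species, gapped,
                       min_len=50, min_disc=0.1, max_gap_frac=0.5):
    """Return (start, end, species) for runs meeting all three criteria, end exclusive."""
    regions = []
    pos = 0
    # A run is a stretch of positions that pass the threshold with the same worst species.
    keys = [(d >= min_disc, s) for d, s in zip(discordance, worst_species)]
    for (passes, species), run in groupby(keys):
        length = len(list(run))
        start, end = pos, pos + length
        pos = end
        if not passes or length < min_len:
            continue
        gap_frac = sum(gapped[start:end]) / length
        if gap_frac < max_gap_frac:
            regions.append((start, end, species))
    return regions

# Invented inputs: 60 bad baboon columns, 10 good columns, 95 bad but gappy mouse columns.
disc = [0.5] * 60 + [0.01] * 10 + [0.3] * 40 + [0.4] * 55
worst = ["baboon"] * 70 + ["mouse"] * 95
gaps = [False] * 70 + [True] * 60 + [False] * 35
found = suspicious_regions(disc, worst, gaps)
print(found)
```

Only the baboon run qualifies: the mouse run is long enough and discordant enough, but more than half of its columns are gapped.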

35–38 Alignment accuracy [Figure: panels for coding, UTR, intronic, and intergenic bases]

39 Can suspicious alignments be improved?
Baboon and MLAGAN (for example): all points (x, y), where
– x = human-baboon alignment score of MLAGAN region suspicious for baboon
– y = human-baboon alignment score of alternative alignment for the same human region but not suspicious for baboon
Lines shown:
– y = x
– y - x = μ, where μ = average of y - x over all points
– y - x = μ ± σ, where σ = standard deviation of y - x over all points

40 Can suspicious alignments be improved?

41 Summary of comparisons (all categories)
[Figure labels: primates, other placental mammals, distant species; TBA, MAVID, MLAGAN, Pecan; legend: high is better, low is better]

42 Conclusions
1. Disturbing lack of agreement among alignments: alignment is still a hard problem
2. Performance of the aligners varies significantly by species group and region type, particularly for distant species and noncoding regions

44 DNA, Genes, and Proteins
– DNA: program for cell processes
– Proteins: execute cell processes
[Figure: a DNA sequence, a gene within it, and the protein it encodes]

45 Regulation of Genes
– What turns genes on and off?
– When is a gene turned on or off?
– Where (in which cells) is a gene turned on?
– How many copies of the gene product are produced?

46 Regulation of Genes [Figure labels: gene, regulatory element, RNA polymerase, transcription factor, DNA]

47 Regulation of Genes [Figure labels: RNA polymerase, transcription factor, DNA, regulatory element, gene]

48 Goal
Identify regulatory elements in DNA sequences. These are:
– Binding sites for proteins
– Short subsequences (5-25 nucleotides)
– Up to 1000 nucleotides (or farther) from gene
– Inexactly repeating patterns (“motifs”)

49 CLUSTALW multiple sequence alignment (rbcS gene)
Cotton     ACGGTT-TCCATTGGATGA---AATGAGATAAGAT---CACTGTGC---TTCTTCCACGTG--GCAGGTTGCCAAAGATA AGGCTTTACCATT
Pea        GTTTTT-TCAGTTAGCTTA---GTGGGCATCTTA----CACGTGGC---ATTATTATCCTA--TT-GGTGGCTAATGATA AGG--TTAGCACA
Tobacco    TAGGAT-GAGATAAGATTA---CTGAGGTGCTTTA---CACGTGGC---ACCTCCATTGTG--GT-GACTTAAATGAAGA ATGGCTTAGCACC
Ice-plant  TCCCAT-ACATTGACATAT---ATGGCCCGCCTGCGGCAACAAAAA---AACTAAAGGATA--GCTAGTTGCTACTACAATTC--CCATAACTCACCACC
Turnip     ATTCAT-ATAAATAGAAGG---TCCGCGAACATTG--AAATGTAGATCATGCGTCAGAATT--GTCCTCTCTTAATAGGA A GGAGC
Wheat      TATGAT-AAAATGAAATAT---TTTGCCCAGCCA-----ACTCAGTCGCATCCTCGGACAA--TTTGTTATCAAGGAACTCAC--CCAAAAACAAGCAAA
Duckweed   TCGGAT-GGGGGGGCATGAACACTTGCAATCATT-----TCATGACTCATTTCTGAACATGT-GCCCTTGGCAACGTGTAGACTGCCAACATTAATTAAA
Larch      TAACAT-ATGATATAACAC---CGGGCACACATTCCTAAACAAAGAGTGATTTCAAATATATCGTTAATTACGACTAACAAAA--TGAAAGTACAAGACC
Cotton     CAAGAAAAGTTTCCACCCTC------TTTGTGGTCATAATG-GTT-GTAATGTC-ATCTGATTT----AGGATCCAACGTCACCCTTTCTCCCA-----A
Pea        C---AAAACTTTTCAATCT TGTGTGGTTAATATG-ACT-GCAAAGTTTATCATTTTC----ACAATCCAACAA-ACTGGTTCT A
Tobacco    AAAAATAATTTTCCAACCTTT---CATGTGTGGATATTAAG-ATTTGTATAATGTATCAAGAACC-ACATAATCCAATGGTTAGCTTTATTCCAAGATGA
Ice-plant  ATCACACATTCTTCCATTTCATCCCCTTTTTCTTGGATGAG-ATAAGATATGGGTTCCTGCCAC----GTGGCACCATACCATGGTTTGTTA-ACGATAA
Turnip     CAAAAGCATTGGCTCAAGTTG-----AGACGAGTAACCATACACATTCATACGTTTTCTTACAAG-ATAAGATAAGATAATGTTATTTCT A
Wheat      GCTAGAAAAAGGTTGTGTGGCAGCCACCTAATGACATGAAGGACT-GAAATTTCCAGCACACACA-A-TGTATCCGACGGCAATGCTTCTTC
Duckweed   ATATAATATTAGAAAAAAATC-----TCCCATAGTATTTAGTATTTACCAAAAGTCACACGACCA-CTAGACTCCAATTTACCCAAATCACTAACCAATT
Larch      TTCTCGTATAAGGCCACCA TTGGTAGACACGTAGTATGCTAAATATGCACCACACACA-CTATCAGATATGGTAGTGGGATCTG--ACGGTCA
Cotton     ACCAATCTCT---AAATGTT----GTGAGCT---TAG-GCCAAATTT-TATGACTATA--TAT----AGGGGATTGCACC----AAGGCAGTG-ACACTA
Pea        GGCAGTGGCC---AACTAC CACAATTT-TAAGACCATAA-TAT----TGGAAATAGAA------AAATCAAT--ACATTA
Tobacco    GGGGGTTGTT---GATTTTT----GTCCGTTAGATAT-GCGAAATATGTAAAACCTTAT-CAT----TATATATAGAG------TGGTGGGCA-ACGATG
Ice-plant  GGCTCTTAATCAAAAGTTTTAGGTGTGAATTTAGTTT-GATGAGTTTTAAGGTCCTTAT-TATA---TATAGGAAGGGGG----TGCTATGGA-GCAAGG
Turnip     CACCTTTCTTTAATCCTGTGGCAGTTAACGACGATATCATGAAATCTTGATCCTTCGAT-CATTAGGGCTTCATACCTCT----TGCGCTTCTCACTATA
Wheat      CACTGATCCGGAGAAGATAAGGAAACGAGGCAACCAGCGAACGTGAGCCATCCCAACCA-CATCTGTACCAAAGAAACGG----GGCTATATATACCGTG
Duckweed   TTAGGTTGAATGGAAAATAG---AACGCAATAATGTCCGACATATTTCCTATATTTCCG-TTTTTCGAGAGAAGGCCTGTGTACCGATAAGGATGTAATC
Larch      CGCTTCTCCTCTGGAGTTATCCGATTGTAATCCTTGCAGTCCAATTTCTCTGGTCTGGC-CCA----ACCTTAGAGATTG----GGGCTTATA-TCTATA
Cotton     T-TAAGGGATCAGTGAGAC-TCTTTTGTATAACTGTAGCAT--ATAGTAC
Pea        TATAAAGCAAGTTTTAGTA-CAAGCTTTGCAATTCAACCAC--A-AGAAC
Tobacco    CATAGACCATCTTGGAAGT-TTAAAGGGAAAAAAGGAAAAG--GGAGAAA
Ice-plant  TCCTCATCAAAAGGGAAGTGTTTTTTCTCTAACTATATTACTAAGAGTAC
Larch      TCTTCTTCACAC---AATCCATTTGTGTAGAGCCGCTGGAAGGTAAATCA
Turnip     TATAGATAACCA---AAGCAATAGACAGACAAGTAAGTTAAG-AGAAAAG
Wheat      GTGACCCGGCAATGGGGTCCTCAACTGTAGCCGGCATCCTCCTCTCCTCC
Duckweed   CATGGGGCGACG---CAGTGTGTGGAGGAGCAGGCTCAGTCTCCTTCTCG

50 Finding Short Motifs
AGTCGTACGTGAC... (Human)
AGTAGACGTGCCG... (Chimp)
ACGTGAGATACGT... (Rabbit)
GAACGGAGTACGT... (Mouse)
TCGTGACGGTGAT... (Rat)
Size of motif sought: k = 4

51 Most Parsimonious Solution
“Parsimony score”: 1 mutation
AGTCGTACGTGAC...
AGTAGACGTGCCG...
ACGTGAGATACGT...
GAACGGAGTACGT...
TCGTGACGGTGAT...
[Tree figure: ancestral labels ACGG and ACGT]

52 Substring Parsimony Problem
Given: phylogenetic tree T, set of orthologous sequences at leaves of T, length k of motif, threshold d
Problem: Find each set S of k-mers, one k-mer from each leaf, such that the parsimony score of S in T is at most d.
This problem is NP-complete.
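For a fixed choice S of one k-mer per leaf, the parsimony score can be computed column by column with Fitch's algorithm, counting one mutation whenever two child sets do not intersect. The sketch below does this for the k = 4 example. The tree topology `TREE` is an assumption (the slides show the tree only as a figure), and unit cost per substitution is assumed.

```python
# Hypothetical binary tree as nested tuples; leaves are species names.
TREE = ((("human", "chimp"), "rabbit"), ("mouse", "rat"))

def fitch_column(node, bases):
    """Return (candidate base set, mutation count) for one column via Fitch's algorithm."""
    if isinstance(node, str):
        return {bases[node]}, 0
    left_set, left_cost = fitch_column(node[0], bases)
    right_set, right_cost = fitch_column(node[1], bases)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # Disjoint child sets force one substitution on an edge below this node.
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, kmers):
    """Sum the Fitch cost over the k columns of the chosen k-mers."""
    k = len(next(iter(kmers.values())))
    return sum(
        fitch_column(tree, {sp: s[i] for sp, s in kmers.items()})[1]
        for i in range(k)
    )

chosen = {"human": "ACGT", "chimp": "ACGT", "rabbit": "ACGT",
          "mouse": "ACGT", "rat": "ACGG"}
score = parsimony_score(TREE, chosen)
print(score)
```

The hard part of the substring parsimony problem is not this scoring step but the choice of S, since each leaf offers many candidate k-mers.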

53 FootPrinter’s Exact Algorithm (with Mathieu Blanchette, generalizing Sankoff and Rousseau 1975)
$W_u[s]$ = best parsimony score for subtree rooted at node u, if u is labeled with string s.
[Figure: tree with leaf sequences AGTCGTACGTG, ACGGGACGTGC, ACGTGAGATAC, GAACGGAGTAC, TCGTGACGGTG; each node u carries a table of $4^k$ entries $W_u[s]$ (entries shown for ACGG and ACGT), with $+\infty$ at a leaf for k-mers that do not occur in its sequence]

54 Running Time
$W_u[s] = \sum_{v:\ \text{child of } u} \min_t \left( W_v[t] + d(s, t) \right)$
n = number of species, l = average sequence length, k = motif length
Total time: $O(n k (4^{2k} + l))$
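The recurrence for W_u[s] can be sketched directly as a dynamic program over all 4^k strings at every node. The version below is the naive one that tries every pair (s, t) on each edge, matching the 4^{2k} term in the running time. It assumes d(s, t) is Hamming distance and uses the same hypothetical tree topology as a stand-in for the figure; the sequences are the five from the k = 4 example.

```python
from itertools import product

TREE = ((("human", "chimp"), "rabbit"), ("mouse", "rat"))  # assumed topology
SEQS = {
    "human": "AGTCGTACGTGAC",
    "chimp": "AGTAGACGTGCCG",
    "rabbit": "ACGTGAGATACGT",
    "mouse": "GAACGGAGTACGT",
    "rat": "TCGTGACGGTGAT",
}
K = 4
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
INF = float("inf")

def hamming(s, t):
    """Number of positions at which s and t differ (assumed meaning of d(s, t))."""
    return sum(a != b for a, b in zip(s, t))

def table(node):
    """Return W_u as a dict mapping each k-mer s to the best score of the subtree."""
    if isinstance(node, str):
        seq = SEQS[node]
        present = {seq[i:i + K] for i in range(len(seq) - K + 1)}
        # A leaf can only be labeled with a k-mer that occurs in its sequence.
        return {s: (0 if s in present else INF) for s in KMERS}
    child_tables = [table(child) for child in node]
    W = {}
    for s in KMERS:
        W[s] = sum(min(Wv[t] + hamming(s, t) for t in KMERS) for Wv in child_tables)
    return W

root = table(TREE)
best = min(root.values())
best_labels = sorted(s for s, w in root.items() if w == best)
print(best, best_labels)
```

No 4-mer occurs in all five sequences, so a score of 0 is impossible; the minimum of 1 agrees with the most parsimonious solution shown on slide 51.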

55 Improvements
– Better algorithm reduces time from $O(n k (4^{2k} + l))$ to $O(n k (4^k + l))$
– By restricting to motifs with parsimony score at most d, greatly reduce the number of table entries computed (exponential in d, polynomial in k)
– Amenable to many useful extensions (e.g., allow insertions and deletions)

56 Application to -actin Gene
– Gilthead sea bream (678 bp)
– Medaka fish (1016 bp)
– Common carp (696 bp)
– Grass carp (917 bp)
– Chicken (871 bp)
– Human (646 bp)
– Rabbit (636 bp)
– Rat (966 bp)
– Mouse (684 bp)
– Hamster (1107 bp)

57 [Figure: sequences with conserved motifs set off by spaces; legend: parsimony score over 10 vertebrates: 0, 1, 2]
Common carp: ACGGACTGTTACCACTTCACGCCGACTCAACTGCGCAGAGAAAAACTTCAAACGACAAC A TTGGCATGGCTT TTGTTATTTTTGGCGC TTGACTCAGG AT C T AAAAACTGGAAC G GCGAAGGTGACGGCAATGTTTTGGCAAATAAGCATCCCCGAAGTTCTACAATGCATCTGAGGACTCAATGTTTTTTTTTTTTTTT TTTCTTT AGTCATTCCAAAT GTTTGTTAAATGCATTGTTCCGAAACTTATTTGCCTCTATGAAGGCTGCCCAGTAATTGGGAGCATACTTAACATTGTAGTATTGTA T GTAAATTATGT AACAAAACAATGACTGGGTTTTTGTACTTTCAGCCTTAATCTTGGGTTTTTTTTTTTTTTTGGTTCCAAAAAACTAAGCTTTACCATTCAAGATGTAAA GGTTTCATTCCCCCTGGCATATTGAAAAAGCTGTGTGGAACGTGGCGGTGCAGACATTTGGTGGGGCCA A CCTGTACACTGAC T AATTCAAATAAAAGT GCACATGTAAGACATCCTACTCTGTGTGATTTTTCTGTTTGTGCTGAGTGAACTTGCTATGAAGTCTTTTAGTGCACTCTTTAATAAAAGTAGTCTTCCCTTAAAGTGTCC CTTCCCTTATGGCCTTCACATTTCTCAACTAGCGCTTCAACTAGAAAGCACTTTAGGGACTGGGATGC
Chicken: ACCGGACTGTTACCAACACCCACACCCCTGTGATGAAACAAAACCCATAAATGCGCATAAAACAAGACGAG A TTGGCATGGCTT TATTTGTTTTTTCTTTTGGC GC TTGACTCAGGAT T A AAAAACTGGAAT G GTGAAGGTGTCAGCAGCAGTCTTAAAATGAAACATGTTGGAGCGAACGCCCCCAAAGTTCTACAATG CATCTGAGGACTTTGATTGTACATTTGTTTCTTTTTTAAT AGTCATTCCAAAT ATTGTTATAATGCATTGTTACAGGAAGTTACTCGCCTCTGTGAAGGCAACAGCCCA GCTGGGAGGAGCCGGTACCAATTACTGGTGTTAGATGATAATTGCTTGTC TGTAAATTATGT AACCCAACAAGTGTCTTTTTGTATCTTCCGCCTTAAAAACAAAACAC ACTTGATCCTTTTTGGTTTGTCAAGCAAGCGGGCTGTGTTCCCCAGTGATAGATGTGAATGAAGGCTTTACAGTCCCCCACAGTCTAGGAGTAAAGTGCCAGTATGTGGG GGAGGGAGGGGCT A CCTGTACACTGAC T TAAGACCAGTTCAAATAAAAGTGCACACAATAGAGGCTTGACTGGTGTTGGTTTTTATTTCTGTGCTGCGC TGCTTGGCCGTTGGTAGCTGTTCTCATCTAGCCTTGCCAGCCTGTGTGGGTCAGCTATCTGCATGGGCTGCGTGCTGGTGCTGTCTGGTGCAGAGGTTGGATAAACCGT GATGATATTTCAGCAAGTGGGAGTTGGCTCTGATTCCATCCTGAGCTGCCATCAGTGTGTTCTGAAGGAAGCTGTTGGATGAGGGTGGGCTGAGTGCTGGGGGACAGCT GGGCTCAGTGGGACTGCAGCTGTGCT
Human: GCGGACTATGACTTAGTTGCGTTACACCCTTTCTTGACAAAACCTAACTTGCGCAGAAAACAAGATGAG A TTGGCATGGCTT TATTTGTTTTTTTTGTTTTGTT TTGGTTTTTTTTTTTTTTTTGGC TTGACTCAGGAT T T AAAAACTGGAAC G GTGAAGGTGACAGCAGTCGGTTGGAGCGAGCATCCCCCAAAGTTCA CAATGTGGCCGAGGACTTTGATTGCATTGTTGTTTTTTTAAT AGTCATTCCAAAT ATGAGATGCATTGTTACAGGAAGTCCCTTGCCATCCTAAAAGCCACCCCACTTC TCTCTAAGGAGAATGGCCCAGTCCTCTCCCAAGTCCACACAGGGGAGGTGATAGCATTGCTTTCG TGTAAATTATGT AATGCAAAATTTTTTTAATCTTCGCCTTAATA CTTTTTTATTTTGTTTTATTTTGAATGATGAGCCTTCGTGCCCCCCCTTCCCCCTTTTTGTCCCCCAACTTGAGATGTATGAAGGCTTTTGGTCTCCCTGGGAGTGGGTGG AGGCAGCCAGGGCTT A CCTGTACACTGAC T TGAGACCAGTTGAATAAAAGTGCACACCTTAAAAATGAGGCCAAGTGTGACTTTGTGGTGTGGCTGGGT TGGGGGCAGCAGAGGGTG

58 Motifs Absent from Some Species
Find motifs
– with small parsimony score
– that span a large part of the tree
Example: in tree of 10 species spanning 760 Myrs, find all motifs with
– score 0 spanning at least 250 Myrs
– score 1 spanning at least 350 Myrs
– score 2 spanning at least 450 Myrs
– score 3 spanning at least 550 Myrs

59 Application to c-fos Gene
[Tree figure: Puffer fish, Chicken, Pig, Mouse, Hamster, Human]
Asked for motifs of length 10, with
– 0 mutations over tree of size 6
– 1 mutation over tree of size 11
– 2 mutations over tree of size 16
– 3 mutations over tree of size 21
– 4 mutations over tree of size 26
Found:
– 0 mutations over tree of size 8
– 1 mutation over tree of size 16
– 3 mutations over tree of size 21
– 4 mutations over tree of size 28

60 Application to c-fos Gene
Motif               Score  Conserved in         Known?
CAGGTGCGAATGTTC     0      4 mammals
TTCCCGCCTCCCCTCCCC  0      4 mammals            yes
GAGTTGGCTGcagcc     3      puffer + 4 mammals
GTTCCCGTCAATCcct    1      chicken + 4 mammals  yes
CACAGGATGTcc        4      all 6                yes
AGGACATCTG          1      chicken + 4 mammals  yes
GTCAGCAGGTTTCCACG   0      4 mammals            yes
TACTCCAACCGC        0      4 mammals
metK in B. subtilis

61 Microbial Footprinting
1889 prokaryotes with genomes completely sequenced (as of 2/12/2012)
– For any prokaryotic gene of interest, plenty of close genes in other species available
– Relatively simple genomes
MicroFootPrinter (with Shane Neph)
– Designed specifically for phylogenetic footprinting in microbial genomes
– Undergraduate Computational Biology Capstone project
– User specifies species and gene of interest
– Automates collection of orthologous genes, cis-regulatory sequences, gene tree, parameters

62 Demo
MicroFootPrinter home
Examples: Agrobacterium tumefaciens genes regulated by ChvI (with Eugene Nester)
– chvI (two-component response regulator)
– ropB (outer membrane protein)

63 Sample chvI motif
Parsimony score: 2   Span:   Significance score: 4.22
B. henselae             GCTACAATTT
R. etli           -90   GCCACAATTT
R. leguminosarum  -106  GCCACAATTT
S. meliloti       -119  GCCACAATTT
S. medicae        -118  GCCACAATTT
A. tumefaciens    -105  GCCACAATTT
M. loti           -80   GCCACATTTT
M. sp.            -87   GCCACATTTT
O. anthropi       -158  GCCACATTTT
B. suis           -38   GCCACATTTT
B. melitensis     -156  GCCACATTTT
B. abortus        -156  GCCACATTTT
B. ovis           -156  GCCACATTTT
B. canis          -38   GCCACATTTT

64 Sample ropB motif
Parsimony score: 1   Span: 20.70   Significance score: 1.34
Jannaschia sp.    -151  CACATTTTGG
R. etli           -134  CACAATTTGG
R. leguminosarum  -135  CACAATTTGG
A. tumefaciens    -131  CACATTTTGG
S. meliloti       -128  CACATTTTGG
S. medicae        -128  CACATTTTGG

65 Combined ChvI Motif
ropB:         CACATTTTGG
chvI:       GCCACAATTT
Atu1221:  TTGTCACAAT
ultimate:   GYCACAWTTTGG
Y = {C,T}, W = {A,T}
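One way to check that the combined motif is consistent with the three individual motifs is to expand the ambiguity codes Y and W into character classes and match each motif against the overlapping part of the combined pattern. The sketch below does that; the `IUPAC` table covers only the codes used on this slide, and the offsets (chvI at the start, ropB two positions in, Atu1221 overhanging by two bases on the left) are inferred from the sequences themselves.

```python
import re

# Only the ambiguity codes that appear on the slide; the full IUPAC table has more.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "Y": "[CT]", "W": "[AT]"}

def iupac_to_regex(motif):
    """Translate a motif with ambiguity codes into a regular expression."""
    return "".join(IUPAC[c] for c in motif)

ultimate = "GYCACAWTTTGG"
chvI = "GCCACAATTT"
ropB = "CACATTTTGG"
atu1221 = "TTGTCACAAT"

# chvI lines up with the first 10 positions, ropB with the last 10.
chvi_ok = re.fullmatch(iupac_to_regex(ultimate[:10]), chvI) is not None
ropb_ok = re.fullmatch(iupac_to_regex(ultimate[2:]), ropB) is not None
# Atu1221 overlaps the first 8 positions once its two leading bases are skipped.
atu_ok = re.fullmatch(iupac_to_regex(ultimate[:8]), atu1221[2:]) is not None
print(chvi_ok, ropb_ok, atu_ok)
```

The same regular expression could be used to scan other upstream regions for candidate ChvI sites, with the caveat that a 12-base degenerate pattern will also match by chance in a large genome.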

66 References and Acknowledgments
– Amol Prakash & Martin Tompa, Measuring the Accuracy of Genome-Size Multiple Alignments. Genome Biology, June 2007, R124.
– Xiaoyu Chen & Martin Tompa, Comparative Assessment of Methods for Aligning Multiple Genome Sequences. Nature Biotechnology, June 2010.
– Mathieu Blanchette & Martin Tompa, Discovery of Regulatory Elements by a Computational Method for Phylogenetic Footprinting. Genome Research, May 2002.
– Shane Neph & Martin Tompa, MicroFootPrinter: a Tool for Phylogenetic Footprinting in Prokaryotic Genomes. Nucleic Acids Research, July 2006, W366-W368.
All software available at bio.cs.washington.edu/software