
1 Building synteny maps
Recommended local aligners
- BLASTZ
  - Most accurate, especially for genes
  - Chains local alignments
- WU-BLAST
  - Good tradeoff of efficiency/sensitivity
  - Best command-line options
- BLAT
  - Fast, less sensitive
  - Good for comparing very similar sequences
  - Good for finding a rough homology map

2 Index-based local alignment
- Dictionary: all words of length k (~10)
- Alignment initiated between words of alignment score ≥ T (typically T = k)
- Alignment: ungapped extensions until the score falls below a statistical threshold
- Output: all local alignments with score > statistical threshold
Figure labels: query, DB, scan
Question: Using an idea from overlap detection, is there a better way to find all local alignments between two genomes?
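The seed-and-extend idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: seeds are exact k-mer matches (the T = k case), scoring is a hypothetical +1/-1 scheme, extension stops with a simple drop-off rule, and a fixed score cutoff stands in for the statistical threshold. The names `build_index`, `extend_ungapped`, and `local_hits` are made up for this example and do not come from any aligner.

```python
from collections import defaultdict

def build_index(db, k):
    """Dictionary step: map every k-mer of db to its start positions."""
    index = defaultdict(list)
    for i in range(len(db) - k + 1):
        index[db[i:i + k]].append(i)
    return index

def extend_ungapped(query, db, qi, di, k, match=1, mismatch=-1, drop=3):
    """Extend an exact k-mer hit left and right without gaps.

    Each direction stops once the running score falls more than `drop`
    below the best score seen (an assumed stand-in for the threshold rule).
    Returns (score, query_start, db_start, length).
    """
    def walk(q, d, step):
        best = score = 0
        best_len = n = 0
        while 0 <= q < len(query) and 0 <= d < len(db):
            score += match if query[q] == db[d] else mismatch
            n += 1
            if score > best:
                best, best_len = score, n
            elif best - score > drop:
                break
            q += step
            d += step
        return best, best_len

    right, rlen = walk(qi + k, di + k, +1)
    left, llen = walk(qi - 1, di - 1, -1)
    return (k * match + left + right, qi - llen, di - llen, k + llen + rlen)

def local_hits(query, db, k=4, threshold=6):
    """Scan the query against the index and keep extensions above threshold."""
    index = build_index(db, k)
    hits = set()
    for qi in range(len(query) - k + 1):
        for di in index.get(query[qi:qi + k], ()):
            hit = extend_ungapped(query, db, qi, di, k)
            if hit[0] > threshold:
                hits.add(hit)
    return sorted(hits, reverse=True)

hits = local_hits("CCACGTACGAA", "GGGACGTACGTTT", k=4, threshold=6)
```

Several overlapping seeds extend to the same segment here, which is why the hits are collected in a set.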

3 Local Alignments

4 After chaining

5 Chaining local alignments
1. Find local alignments
2. Chain: O(N log N) longest increasing subsequence (L.I.S.)
3. Restricted DP
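Step 2 can be illustrated with a standard longest-increasing-subsequence routine. The sketch below assumes each local alignment is reduced to a single anchor point (x, y), one coordinate per genome, and that all anchors count equally; practical chaining tools usually also weight anchors by alignment score, which is not shown. The function name `chain_anchors` is hypothetical.

```python
from bisect import bisect_left

def chain_anchors(anchors):
    """Longest chain of (x, y) anchors strictly increasing in both coordinates.

    Sort by x, then run the O(N log N) patience-style LIS on y.
    """
    # Ties in x are ordered by descending y so two anchors with equal x
    # can never both end up in the same chain.
    pts = sorted(anchors, key=lambda p: (p[0], -p[1]))
    tails_y = []    # tails_y[L]: smallest final y of any chain of length L+1
    tails_idx = []  # index into pts of the anchor that ends that chain
    prev = [-1] * len(pts)
    for i, (_, y) in enumerate(pts):
        pos = bisect_left(tails_y, y)
        if pos > 0:
            prev[i] = tails_idx[pos - 1]
        if pos == len(tails_y):
            tails_y.append(y)
            tails_idx.append(i)
        else:
            tails_y[pos] = y
            tails_idx[pos] = i
    # Walk the predecessor links back from the end of the longest chain.
    chain = []
    i = tails_idx[-1] if tails_idx else -1
    while i != -1:
        chain.append(pts[i])
        i = prev[i]
    return chain[::-1]

chain = chain_anchors([(1, 10), (2, 3), (3, 4), (4, 1), (5, 6), (6, 20)])
```

The restricted DP of step 3 would then be run only in the gaps between consecutive anchors of the returned chain.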

6 Progressive Alignment
When the evolutionary tree is known:
- Align closest first, in the order of the tree
- In each step, align two sequences x, y, or profiles p_x, p_y, to generate a new alignment with associated profile p_result
Weighted version:
- Tree edges have weights, proportional to the divergence in that edge
- New profile is a weighted average of the two old profiles
Example (figure labels: x, w, y, z)
Profile: (A, C, G, T, -)
p_x = (0.8, 0.2, 0, 0, 0)
p_y = (0.6, 0, 0, 0, 0.4)
s(p_x, p_y) = 0.8*0.6*s(A, A) + 0.2*0.6*s(C, A) + 0.8*0.4*s(A, -) + 0.2*0.4*s(C, -)
Result: p_xy = (0.7, 0.1, 0, 0, 0.2)
s(p_x, -) = 0.8*1.0*s(A, -) + 0.2*1.0*s(C, -)
Result: p_x- = (0.4, 0.1, 0, 0, 0.5)
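The profile arithmetic in this example can be checked with a short Python sketch. The slide does not give a substitution score s(a, b), so `toy_score` below is an assumed placeholder (+1 match, -1 mismatch, -2 against a gap), and the equal weights in `merge_profiles` correspond to the unweighted case.

```python
ALPHABET = "ACGT-"

def profile_score(px, py, s):
    """Expected pairwise score between two profile columns."""
    return sum(px[i] * py[j] * s(a, b)
               for i, a in enumerate(ALPHABET)
               for j, b in enumerate(ALPHABET))

def merge_profiles(px, py, wx=0.5, wy=0.5):
    """Weighted average of two profile columns (weights should sum to 1)."""
    return tuple(wx * a + wy * b for a, b in zip(px, py))

def toy_score(a, b):
    """Assumed scoring scheme, for illustration only."""
    if a == "-" and b == "-":
        return 0.0
    if a == "-" or b == "-":
        return -2.0
    return 1.0 if a == b else -1.0

p_x = (0.8, 0.2, 0.0, 0.0, 0.0)
p_y = (0.6, 0.0, 0.0, 0.0, 0.4)
gap = (0.0, 0.0, 0.0, 0.0, 1.0)   # a column that is all gap

p_xy = merge_profiles(p_x, p_y)    # aligning p_x against p_y
p_xgap = merge_profiles(p_x, gap)  # aligning p_x against a gap
score_xy = profile_score(p_x, p_y, toy_score)
```

With equal weights the merged columns come out as (0.7, 0.1, 0, 0, 0.2) and (0.4, 0.1, 0, 0, 0.5), the two results on the slide. The weighted version would pass weights derived from the tree edges instead of 0.5 and 0.5.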

7 Threaded Blockset Aligner
Figure labels: Human–Cow; HMR – CD; restricted area; profile alignment

8 Reconstructing the Ancestral Mammalian Genome
Observed bases: Human: C, Baboon: C, Cat: C, Dog: G
Ancestral node labels in the figure: C; C or G; G
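One common way to reason about a column like this is Fitch's small-parsimony rule, sketched below in Python. The slide text does not say which reconstruction method the figure used, and the tree topology ((Human, Baboon), (Cat, Dog)) is an assumption made for this example because the figure did not survive.

```python
def fitch(tree, leaves):
    """Fitch's bottom-up pass for one alignment column.

    tree   : a leaf name (str) or a pair (left_subtree, right_subtree)
    leaves : dict mapping leaf name to its observed base
    Returns (candidate ancestral bases at this node, minimum substitutions).
    """
    if isinstance(tree, str):
        return {leaves[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, leaves)
    right_set, right_cost = fitch(right, leaves)
    common = left_set & right_set
    if common:
        # The children agree on at least one base: no extra substitution.
        return common, left_cost + right_cost
    # The children disagree: keep both options and count one substitution.
    return left_set | right_set, left_cost + right_cost + 1

column = {"Human": "C", "Baboon": "C", "Cat": "C", "Dog": "G"}
assumed_tree = (("Human", "Baboon"), ("Cat", "Dog"))  # hypothetical topology

root_states, substitutions = fitch(assumed_tree, column)
carnivore_states, _ = fitch(("Cat", "Dog"), column)
```

Under this assumed tree the Cat/Dog ancestor is ambiguous (C or G) while the root resolves to C with one substitution. The same substitution count is the per-column quantity a parsimony-based conservation score would use.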

9 Neutral Substitution Rates Dataset 3: 4-D sites

10 Finding Conserved Elements (1)
Binomial method
- 25-bp window in the human genome
- Binomial distribution of k matches in N bases given the neutral probability of substitution
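A minimal Python sketch of the windowed binomial test follows. It assumes the alignment has already been reduced to a 0/1 flag per human base (1 if the aligned base matches) and that a single neutral match probability `p_match` applies to every column. Both are simplifications, and the function names are hypothetical.

```python
from math import comb

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def window_pvalues(match_flags, p_match, window=25):
    """Tail probability of the observed match count in each sliding window.

    A small value means the window has more matches than the neutral
    model would typically produce.
    """
    pvalues = []
    for start in range(len(match_flags) - window + 1):
        k = sum(match_flags[start:start + window])
        pvalues.append(binomial_tail(k, window, p_match))
    return pvalues

# A perfectly conserved 25-bp window under an assumed neutral match rate of 0.5
perfect = window_pvalues([1] * 25, p_match=0.5)
```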

11 Finding Conserved Elements (2)
Parsimony method
- Count the minimum number of mutations explaining each column
- Assign a probability to this parsimony score given the neutral model
- Multiply probabilities across a 25-bp window of the human genome
Example column: A C A A G

12 Finding Conserved Elements

13 Finding Conserved Elements (3) GERP

14 Phylo HMMs HMM Phylogenetic Tree Model Phylo HMM

15 Finding Conserved Elements (3)

16 How do the methods agree/disagree?

17 Statistical Power to Detect Constraint
- L, N
- C: cutoff number of mutations
- D: neutral mutation rate
- Constraint mutation rate, expressed relative to neutral

18 Statistical Power to Detect Constraint
- L, N
- C: cutoff number of mutations
- D: neutral mutation rate
- Constraint mutation rate, expressed relative to neutral

19 TTATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATA CATATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTC AGTAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTC CGTGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACT AGCTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATG ATAATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAA AAGCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAAT TGTTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA TTCTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGG ATTTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGAT TTTGATATGCTTTGCGCCGTCAAAGTTTTGAACGATGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAAT CTTTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATG AACGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATC ATATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAA AAGAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCA GCATTGGGCAGCTGTCTATATGAATTAGTCAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAA CTTTAGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGA TAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTT GGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAG...TTGCGAA GTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAA TGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGA TACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGT 
TCTTGGCAAGTTGCCAACTGACGAGATGCAGTTTCCTACGCATAATAAGAATAGGAGGGAATATCAAGCCAGACAATCTATCATTACAT TTAAGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAA AGAGTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAAT ACAGCTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTAC AACCAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATAT CAACACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCG TTGGTGCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTC TTCTCTTTATGGCCCGTTATTAACAGAGTCGTCATGGCCATCGTTTGGTATAGTGTCCAAGCTTATATTGCGGCAACTCCCGTATCATT AATGCTGAAATCTATCTTTGGAAAAGATTTACAATGATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGT TCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATG TTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATA CCTATTCTTGACATGATATGACTACCATTTTGTTATTGTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATG TTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTA AGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGA TTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATA GTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATG CTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACT TAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGAT TGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAAT

20 The raw sequence from slide 19, now annotated.
Annotation labels: Promoter motifs; 3' UTR motifs; Exons; Introns

21 Comparing genomes reveals functional elements
- Ultra-conserved elements
- Protein-coding genes
- Short regulatory motifs

22 Regulatory Motif Discovery
Figure labels: GAL1; ATGACTAAATCTCATTCAGAAGAAGTGA; CCCCWCGGCCG; Gal4, Mig1; CGGCCG; Gal4
Gene regulation
- Genes are turned on / off in response to changing environments
- Gene regulatory logic is controlled by sequence motifs
- Specialized proteins (transcription factors) recognize motifs
What makes motif discovery hard?
- Motifs are short (6-8 bp) and usually degenerate
- They act at variable distances upstream (or downstream) of the target gene

23 Overview of Motif Discovery Algorithms

24 Motif Representation
Example sites: GTATAA CTATAA GTCTTA ATATAC GTAATA TTGTAC GTATTA GTATTC ATCTAA
- PSSM
- Consensus: GTATAA
- IUPAC: GTATAM
- Complex dependency graphical models
- Nonparametric: graph or bag of words
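The first two representations can be derived directly from the nine example sites. The Python sketch below builds a per-position count matrix, reads off the majority consensus, and converts counts into a log-odds PSSM. The pseudocount of 1 and the uniform 0.25 background are assumptions, not values taken from the slide.

```python
from collections import Counter
from math import log2

SITES = ["GTATAA", "CTATAA", "GTCTTA", "ATATAC", "GTAATA",
         "TTGTAC", "GTATTA", "GTATTC", "ATCTAA"]

def count_matrix(sites):
    """One Counter of base counts per motif position."""
    return [Counter(column) for column in zip(*sites)]

def consensus(sites):
    """Most frequent base at each position."""
    return "".join(c.most_common(1)[0][0] for c in count_matrix(sites))

def pssm(sites, background=0.25, pseudocount=1.0):
    """Log-odds score of each base at each position versus the background."""
    n = len(sites)
    return [
        {b: log2((c[b] + pseudocount) / (n + 4 * pseudocount) / background)
         for b in "ACGT"}
        for c in count_matrix(sites)
    ]

def score(matrix, kmer):
    """Sum of per-position log-odds for a candidate site."""
    return sum(column[base] for column, base in zip(matrix, kmer))

matrix = pssm(SITES)
```

The consensus computed this way is GTATAA, matching the slide. An IUPAC string such as GTATAM additionally needs a rule for when to emit a degenerate code, which the slide does not specify.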

25 Motif Representation – Pairwise Dependencies
Complex dependency graphical models

26 Motif Representation – MotifScan
Example sites: GTATAA CTATAA TTGTAC GTCTTA GTAATA ATATAC GTATTA GTATTC ATCTAA

27 Motif Finding
Given a set of promoter sequences
- For example, from genes that share a common expression pattern in microarrays
ACCGAGAGTATAAGCTTACGTGACTTGCATGATCTTGCGATGTGTGTTCAGCT
ATCGTACGTTGAGGAGAGGCGGTAATAGAAGTACGTCGATGTCGTCGTACAT
TTCCTATAAGATCGACTGTAGGGAGAGTCTCTGAGAGTATTGCTGGCATGTG
ACTTCGAGGAGAGATTCTCTAGATCTATGCTGTGGTATTAAGAGATCTCTAG
ATCGATGCGCTGATCGCTATAATATATCGGCGGTATCTGGTTGATCTGGTGT
GACTGATGTATCGTATCTGATCTGTCGGTATAATATAGCTGTCTGATTAGTTG
TCTCTAGATGCTGTGCTGATGGTCTTATCGATGTGCGACGGTAATAGTATCCT
Find a common motif that they share
GTATAA GTAATA CTATAA GTATTA CTATAA GTATAA GTAATA

28 Most Popular Approaches
Expectation Maximization: MEME
- Sequences are mixtures of
  - Motif model M, e.g., a motif PSSM
  - Background model B, e.g., a 3rd-order model of promoters
- Learn the model by
  - Starting from a random M and a B learned from promoters
  - Assigning each position in the input to M or B, accordingly
  - Re-estimating M and B based on the current assignments
Gibbs Sampling: AlignACE, BioProspector
- Update one sequence x at a time
  - Remove x from M
  - Pick a new location in x based on M
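The Gibbs sampling loop can be sketched compactly in Python. This is a bare-bones illustration rather than the AlignACE or BioProspector algorithm: it assumes exactly one motif occurrence per sequence, a fixed width `w`, a uniform 0.25 background in place of a learned model B, and sequences over A, C, G, T only. The name `gibbs_motif_search` is hypothetical.

```python
import random

def gibbs_motif_search(seqs, w, iters=200, seed=0, pseudo=1.0):
    """Return one sampled motif start position per sequence."""
    rng = random.Random(seed)
    pos = [rng.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(iters):
        for x in range(len(seqs)):
            # Remove sequence x: build the model M from all other sites.
            counts = [{b: pseudo for b in "ACGT"} for _ in range(w)]
            for j, s in enumerate(seqs):
                if j == x:
                    continue
                for col, base in enumerate(s[pos[j]:pos[j] + w]):
                    counts[col][base] += 1
            total = len(seqs) - 1 + 4 * pseudo
            # Weight every window of x by its likelihood ratio M / background.
            weights = []
            for start in range(len(seqs[x]) - w + 1):
                ratio = 1.0
                for col, base in enumerate(seqs[x][start:start + w]):
                    ratio *= (counts[col][base] / total) / 0.25
                weights.append(ratio)
            # Pick a new location in x in proportion to those weights.
            pos[x] = rng.choices(range(len(weights)), weights=weights)[0]
    return pos

toy_seqs = ["ACCGAGAGTATAAGCTTACG", "GGCGGTAATAGAAGTACGTC", "TTCCTATAAGATCGACTGTA"]
starts = gibbs_motif_search(toy_seqs, w=6, iters=50, seed=1)
sites = [s[p:p + 6] for s, p in zip(toy_seqs, starts)]
```

Because positions are sampled rather than chosen greedily, a single run can settle on a weak motif. Running several seeds and comparing the resulting sites is a common safeguard.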

29 Whole-genome motif discovery

30 Regulatory Motif Discovery
- Study known motifs
- Derive conservation rules
- Discover novel motifs

31 Known motifs are preferentially conserved Is this enough to discover motifs? No.

32 Known motifs are preferentially conserved
Four-species promoter alignment (* marks columns identical in all four species):

human CTCTTAATGGTACACGTTCTGCCT----AAGTAGCCTAGACGCTCCCGTGCGCCC-GGGG
dog   CTCTTA-CGGGGCACATTCTGCTTTCAACAGTGGGGCAGACGGTCCCGCGCGCCCCAAGG
mouse GTCTTAGGAGGCT-CGATCGCC---------------------GCCTGCATTATT-----
rat   GTCTTAGTTGGCCACGACCTGC---------------------TCATGCATAATT-----
       *****   *    *   *  *                      *  *

human CGGGTAGGCCTGGCCGAAAATCTCTCCCGCGCGCCTGACCTTGGGTTGCCCCAGCCAGGC
dog   CAGGC---CCGGGCTGCAGACCTGCCCTGAGGGAATGACCTTGGGCGGCCGCAGCGGGGC
mouse --------------CACAAGCCTGTGGCGCGC-CGTGACCTTGGGCTGCCCCAGGCGGGC
rat   --------------CACAAGTTTCTC---TGC-CCTGACCTTGGGTTGCCCCAGGCGAG-
                       *    *       *    **********  *** ***    *

human TGCGGGCCCGAGACCCCCG-------------------GGCCTCCCTGCCCCCCGCGCCG
dog   CGCGGGCCCAGGCCCCCCTCCCTCCCTCCCTCCCTCCCTCCCTCCCTGCCCCCCGGACCG
mouse TGCAGGCTCACCACCCCGTCTTTTCT---------------------GCTTTTCGAGTCG
rat   -GCATACACCCCGCCTTTTTTTTTTTTTT---------TTTTTTTTTGCCGTTCAAG-AG
       **   * *    **                                **    *     *

Motif sites highlighted in the figure: Gabpa, Errα
Is this enough to discover motifs? No

33 Known motifs are frequently conserved
Across the human promoter regions, the Errα motif:
- appears 434 times
- is conserved 162 times
Figure labels: Human, Dog, Mouse, Rat
Errα conservation rate: 37%
Compare to random control motifs
– Conservation rate of control motifs: 6.8%
– Errα enrichment: 5.4-fold
– Errα p-value < 10^-50 (25 standard deviations under binomial)
Motif Conservation Score (MCS)
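The numbers on this slide can be reproduced with the normal approximation to the binomial. The Python sketch below treats the score as a z-score of the conserved count against the control conservation rate, which is consistent with the "25 standard deviations" quoted above; the function name is hypothetical.

```python
from math import sqrt

def motif_conservation_score(conserved, total, control_rate):
    """Standard deviations by which the conserved count exceeds the
    expectation under a binomial with the control conservation rate."""
    expected = total * control_rate
    sd = sqrt(total * control_rate * (1 - control_rate))
    return (conserved - expected) / sd

total, conserved, control_rate = 434, 162, 0.068

rate = conserved / total            # about 0.37
enrichment = rate / control_rate    # roughly 5.4 to 5.5 fold
mcs = motif_conservation_score(conserved, total, control_rate)  # about 25
```

Unrounded inputs give an enrichment slightly above the slide's 5.4-fold figure, which appears to use the rounded 37% rate.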

34 MCS distribution of all 6-mers shows excess conservation
- High scoring patterns include known motifs
- Excess specific to promoters and 3'-UTRs (not introns)
- For MCS > 6, estimate 97% specificity
Plot axes: motif density vs. Motif Conservation Score (MCS)
Select motifs with MCS > 6.0, cluster

35 Hill-climbing in sequence space
Seed selection
- Three mini-motif conservation criteria (CC1, CC2, CC3)
Motif extension
- Non-random conservation of neighbors
Motif collapsing
- Merge neighbors using hierarchical clustering, avg-max-linkage
Re-scoring complex motifs
- Motif conservation score for full motifs (MCS)

36 Test 1: Intergenic conservation
Plot axes: total count vs. conserved count; highlighted motif: CGG-11-CCG

37 Test 1: Selecting mini-motifs
Estimate basal rate of conservation
- Expected conservation rate at the evolutionary distances observed
- Average conservation rate of non-outlier mini-motifs
Score conservation of mini-motif
- k: conserved motif occurrences
- n: total motif occurrences
- r: basal conservation rate
- Evaluate binomial probability of observing k successes out of n trials
Assign z-score to each mini-motif
- Bulk of distribution is symmetric
- Estimate specificity as (R-L)/R
- Select cutoff: 5.0 sigma
- 1190 mini-motifs, 97.5% non-random
Figure labels: conservation rate r; N; binomial score; right tail; left tail; specificity; cutoff
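The specificity estimate (R-L)/R uses the left tail of the z-score distribution as an empirical estimate of how many right-tail motifs are there by chance. A small Python sketch follows, assuming R and L are the counts of mini-motifs with z at or beyond +cutoff and -cutoff respectively; the function name and the toy scores are illustrative only.

```python
def tail_specificity(z_scores, cutoff=5.0):
    """Estimate the non-random fraction of motifs above the cutoff.

    If the null distribution is symmetric, about as many chance motifs
    land above +cutoff as land below -cutoff, so (R - L) / R estimates
    the fraction of right-tail motifs that are not chance.
    """
    right = sum(z >= cutoff for z in z_scores)
    left = sum(z <= -cutoff for z in z_scores)
    if right == 0:
        return None  # nothing selected, specificity undefined
    return (right - left) / right

toy_scores = [6.0, 7.5, 8.1, 9.3, -6.2, 0.4, 1.1, -0.8]
specificity = tail_specificity(toy_scores)
```

With 1190 selected mini-motifs, a specificity of 97.5% would correspond to roughly 30 motifs in the left tail.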

38 Test 2: Intergenic vs. Coding
Plot axes: coding conservation vs. intergenic conservation; highlighted motif: CGG-11-CCG
Higher conservation in genes

39 Test 3: Upstream vs. Downstream
Plot axes: downstream conservation vs. upstream conservation; highlighted motif: CGG-11-CCG
Plot annotations: Most patterns; Downstream motifs?

40 Constructing full motifs
Figure labels: Test 1, Test 2, Test 3; 2,000 mini-motifs; Extend; Collapse; Merge; 72 full motifs
Example patterns in the figure: CTA, CGA, R, R, CTGRC, CGAA, ACCTGCGAACTGRCCGAACTRAY, CGAA, Y

41 Extending mini-motifs
Separate conserved and non-conserved instances
- Causal set: CTACGA (6) extends to CTACGARGW
- Random set: CTxxGA (6) extends to CTxxGAYHS
Find maximally discriminating neighborhood
- Counts: N1, N2, M1, M2
Evaluate non-randomness of neighborhood
– chi-square contingency test on [N1,M1], [N2,M2]
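The contingency test can be written out without any statistics library. The Python sketch below computes the Pearson chi-square statistic for the 2x2 table [[N1, M1], [N2, M2]]. The slide does not define the four counts, so the reading used in the comments (N = instances in the causal set, M = instances in the random set, split by whether the candidate neighbouring base is present) is an assumption, as are the example numbers.

```python
def chi_square_2x2(n1, m1, n2, m2):
    """Pearson chi-square statistic for the table [[n1, m1], [n2, m2]].

    Assumed reading: n1/m1 count causal-set/random-set instances with the
    candidate neighbour, n2/m2 count those without it.
    """
    observed = ((n1, m1), (n2, m2))
    row_totals = (n1 + m1, n2 + m2)
    col_totals = (n1 + n2, m1 + m2)
    total = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("a row or column of the table is empty")
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat

# Hypothetical counts: the neighbour is common in the causal set only.
stat = chi_square_2x2(40, 10, 20, 50)
```

With one degree of freedom, a statistic above about 3.84 corresponds to p < 0.05. The neighbourhood with the largest statistic would be the "maximally discriminating" one.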

42 Systematically test candidate patterns
Pipeline: all potential motifs, evaluate MCS, cluster similar motifs
Figure labels: GTC, AGT, R, R, Y, gap, S, W
Enumerate
– Length between 6 and 15 nt, allow central gap
– 11 letter alphabet (A C G T, 2-fold codes, N)
Score
– Compute binomial score (conserved vs. total)
– Select MCS > 6.0 (specificity 97%)
Cluster
– Sequence similarity
– Overlapping occurrences
Result: 174 motifs in promoters, 106 motifs in 3' UTRs
Are these real?
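The 11-letter alphabet is the four bases, the six standard two-fold IUPAC codes (R, Y, M, K, S, W), and N. A minimal Python sketch of matching such a pattern against a sequence follows; it translates each letter into a regular-expression character class and reports overlapping occurrences. The helper names are hypothetical, and a central gap would simply be written as a run of N.

```python
import re

# A, C, G, T + six two-fold degenerate codes + N = 11 letters
IUPAC11 = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "M": "[AC]",
    "K": "[GT]", "S": "[CG]", "W": "[AT]",
    "N": "[ACGT]",
}

def pattern_to_regex(pattern):
    """Compile a degenerate motif into a regex; the lookahead makes
    overlapping occurrences visible to finditer."""
    body = "".join(IUPAC11[letter] for letter in pattern.upper())
    return re.compile("(?=(" + body + "))")

def find_sites(pattern, seq):
    """Start positions of every (possibly overlapping) occurrence."""
    return [m.start() for m in pattern_to_regex(pattern).finditer(seq)]

sites = find_sites("TGAYRTCA", "AATGACGTCATT")
gapped = find_sites("TCAnnTGAY", "GGTCACGTGACAA")
```

Counting such occurrences in human promoters, and then counting how many are also present at the aligned positions in the other species, gives the "conserved vs. total" pair that the binomial score needs.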

43 Functions of discovered motifs

44 Evidence of motif function
Promoter motifs:
(1) Comparison to known motifs
(2) Distance from TSS
(3) Expression enrichment
Figure labels: Promoter (174 motifs); ATG; Stop; 3'-UTR (106 motifs)

45 (1) Promoter motifs match known TF binding sites
Compare discovered motifs to the TRANSFAC database of 125 known motifs
- 55% of TRANSFAC motifs match discovered motifs
- 45% of discovered motifs match TRANSFAC motifs
- (only 2% of control sequences match TRANSFAC motifs)

46 (2) Promoter motifs show preferred distance to TSS
- 32% of discovered motifs show strong positional bias
- Discovered motifs occur preferentially within 200 bp of the Transcription Start Site
- Individual motifs show strong peaks, regardless of conservation
Figure labels: conserved motif sites in all four species; motif instances in human; each of 174 discovered motifs; Motif 8, Motif 4; -81, -63; x-axis: distance from TSS

47 (3) Promoter motifs enriched in specific tissues
70% of motifs show significant enrichment in at least one tissue
Figure labels: New motifs; Known TFs

48 Summary for promoter motifs

Rank  Discovered motif  Known TF motif  Tissue enrichment  Distance bias
1     RCGCAnGCGY        NRF-1           Yes
2     CACGTG            MYC             Yes
3     SCGGAAGY          ELK-1           Yes
4     ACTAYRnnnCCCR     -               Yes
5     GATTGGY           NF-Y            Yes
6     GGGCGGR           SP1             Yes
7     TGAnTCA           AP-1            Yes
8     TMTCGCGAnR        -               Yes
9     TGAYRTCA          ATF3            Yes
10    GCCATnTTG         YY1             Yes
11    MGGAAGTG          GABP            Yes
12    CAGGTG            E12             Yes
13    CTTTGT            LEF1            Yes
14    TGACGTCA          ATF3            Yes
15    CAGCTG            AP-4            Yes
16    RYTTCCTG          C-ETS-2         Yes
17    AACTTT            IRF1(*)         Yes
18    TCAnnTGAY         SREBP-1         Yes
19    GKCGCn(7)TGAYG    -               Yes
20    GTGACGY           E4F1            Yes
21    GGAAnCGGAAnY      -               Yes
22    TGCGCAnK          -               Yes
23    TAATTA            CHX10           Yes
24    GGGAGGRR          MAZ             Yes
25    TGACCTY           ERRA            Yes

Figure label: New (rows with no known TF motif are shown as -)

174 promoter motifs
- 70 match known TF motifs
- 115 expression enrichment
- 60 show positional bias
- 75% have evidence
Control sequences
- < 2% match known TF motifs
- < 5% expression enrichment
- < 3% show positional bias
- < 7% false positives
Most discovered motifs are likely to be functional

49 Summary of Promoter Motifs

50 Similar analysis in 5% most conserved regions in human
12-22 bp long motifs

51 Similar analysis in 5% most conserved regions in human

