Published by Rosamund Watts. Modified over 9 years ago.
1
CS273a, Spring 2007, Lecture 11 Whole-genome motif discovery
2
Challenges in Computational Biology
[Diagram of course topics: DNA, genome assembly, gene finding, regulatory motif discovery, database lookup, gene expression analysis, RNA transcript, sequence alignment, evolutionary theory, cluster discovery, Gibbs sampling, protein network analysis, emerging network properties, regulatory network inference, comparative genomics, RNA folding. Example aligned sites: TCATGCTAT, TCGTGATAA, TGAGGATAT, TTATCATAT, TTATGATTT.]
3
[Slide shows a long stretch of raw DNA sequence with no annotation.]
4
[Slide shows a long stretch of raw DNA sequence, labeled with: promoter motifs, 3' UTR motifs, exons, introns.]
5
Comparing genomes reveals functional elements
– Ultra-conserved elements
– Protein-coding genes
– Short regulatory motifs
6
Regulatory Motif Discovery
[Figure: the GAL1 gene (sequence shown: ATGACTAAATCTCATTCAGAAGAAGTGA) with upstream binding sites labeled Gal4 and Mig1; site sequences shown: CCCCWCGGCCG, CGGCCG.]
Gene regulation
– Genes are turned on / off in response to changing environments
– Gene regulatory logic is controlled by sequence motifs
– Specialized proteins (transcription factors) recognize motifs
What makes motif discovery hard?
– Motifs are short (6-8 bp) and usually degenerate
– They act at variable distances upstream (or downstream) of the target gene
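To make "short and degenerate" concrete, here is a minimal sketch of scanning a sequence for a degenerate motif written in IUPAC codes. The function names and the test sequence are made up for illustration; the pattern TGACCTY is the ERRA entry from the summary table later in the lecture.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def iupac_to_regex(motif):
    """Translate an IUPAC motif string into a regular expression."""
    return "".join(IUPAC[base] for base in motif.upper())

def find_sites(motif, sequence):
    """Return 0-based start positions of all matches, including overlapping ones."""
    # Wrapping the pattern in a lookahead lets overlapping matches be reported.
    pattern = re.compile("(?=(" + iupac_to_regex(motif) + "))")
    return [m.start() for m in pattern.finditer(sequence.upper())]

# Made-up sequence, for illustration only.
sequence = "AATGACCTTGGCATGACCTCAA"
print(find_sites("TGACCTY", sequence))  # two sites: one ends in T, one in C
```

Only the forward strand is scanned here; a real scan would presumably also check the reverse complement.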
7
Regulatory Motif Discovery
– Study known motifs
– Derive conservation rules
– Discover novel motifs
8
Known motifs are preferentially conserved
Is this enough to discover motifs? No.
9
Known motifs are preferentially conserved
[Four-species alignment (human, dog, mouse, rat). On the slide, asterisks mark columns identical in all four species, and the Gabpa and Err motif instances are highlighted.]
human CTCTTAATGGTACACGTTCTGCCT----AAGTAGCCTAGACGCTCCCGTGCGCCC-GGGG
dog   CTCTTA-CGGGGCACATTCTGCTTTCAACAGTGGGGCAGACGGTCCCGCGCGCCCCAAGG
mouse GTCTTAGGAGGCT-CGATCGCC---------------------GCCTGCATTATT-----
rat   GTCTTAGTTGGCCACGACCTGC---------------------TCATGCATAATT-----

human CGGGTAGGCCTGGCCGAAAATCTCTCCCGCGCGCCTGACCTTGGGTTGCCCCAGCCAGGC
dog   CAGGC---CCGGGCTGCAGACCTGCCCTGAGGGAATGACCTTGGGCGGCCGCAGCGGGGC
mouse --------------CACAAGCCTGTGGCGCGC-CGTGACCTTGGGCTGCCCCAGGCGGGC
rat   --------------CACAAGTTTCTC---TGC-CCTGACCTTGGGTTGCCCCAGGCGAG-

human TGCGGGCCCGAGACCCCCG-------------------GGCCTCCCTGCCCCCCGCGCCG
dog   CGCGGGCCCAGGCCCCCCTCCCTCCCTCCCTCCCTCCCTCCCTCCCTGCCCCCCGGACCG
mouse TGCAGGCTCACCACCCCGTCTTTTCT---------------------GCTTTTCGAGTCG
rat   -GCATACACCCCGCCTTTTTTTTTTTTTT---------TTTTTTTTTGCCGTTCAAG-AG
Is this enough to discover motifs? No
10
Known motifs are frequently conserved
Across the human promoter regions, the Err motif:
– appears 434 times
– is conserved 162 times
[Figure: an Err motif instance aligned across human, dog, mouse, and rat]
Conservation rate: 37%
Compare to random control motifs
– Conservation rate of control motifs: 6.8%
– Err enrichment: 5.4-fold
– Err p-value < 10^-50 (25 standard deviations under binomial)
Motif Conservation Score (MCS)
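The Err numbers can be reproduced with a normal approximation to the binomial. This is a minimal sketch that assumes the "standard deviations under binomial" figure is the usual z-score, (k - n*r) / sqrt(n*r*(1-r)); the function name is illustrative.

```python
import math

def conservation_stats(n, k, r):
    """Conservation rate, fold enrichment over the control rate r,
    and binomial z-score for k conserved sites out of n occurrences."""
    rate = k / n
    enrichment = rate / r
    expected = n * r                      # conserved sites expected by chance
    sd = math.sqrt(n * r * (1 - r))       # binomial standard deviation
    z = (k - expected) / sd
    return rate, enrichment, z

# Err motif: 434 occurrences, 162 conserved, control conservation rate 6.8%.
rate, enrichment, z = conservation_stats(n=434, k=162, r=0.068)
print(f"rate = {rate:.1%}, enrichment = {enrichment:.1f}-fold, z = {z:.1f}")
```

With unrounded inputs the enrichment comes out slightly above the slide's 5.4-fold, which is what dividing the rounded 37% by 6.8% gives; the z-score lands at about 25, matching the slide.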
11
MCS distribution of all 6-mers shows excess conservation
– High-scoring patterns include known motifs
– Excess is specific to promoters and 3' UTRs (not introns)
– For MCS > 6, estimated specificity is 97%
[Plot: motif density vs. Motif Conservation Score (MCS)]
Select motifs with MCS > 6.0, then cluster
12
Hill-climbing in sequence space
Seed selection
– Three mini-motif conservation criteria (CC1, CC2, CC3)
Motif extension
– Non-random conservation of neighbors
Motif collapsing
– Merge neighbors using hierarchical clustering, avg-max-linkage
Re-scoring complex motifs
– Motif conservation score for full motifs (MCS)
13
Test 1: Intergenic conservation
[Plot axes: total count, conserved count. Labeled pattern: CGG-11-CCG]
14
Test 1: Selecting mini-motifs
Estimate basal rate of conservation
– Expected conservation rate at the evolutionary distances observed
– Average conservation rate of non-outlier mini-motifs
Score conservation of mini-motif
– k: conserved motif occurrences
– n: total motif occurrences
– r: basal conservation rate
– Evaluate binomial probability of observing k successes out of n trials
Assign z-score to each mini-motif
– Bulk of distribution is symmetric
– Estimate specificity as (R-L)/R, where R and L are the right-tail and left-tail counts beyond the cutoff
– Select cutoff: 5.0 sigma
– 1190 mini-motifs, 97.5% non-random
[Plot labels: conservation rate r, N, binomial score, right tail, left tail, specificity, cutoff]
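The specificity estimate rests on the symmetry of the bulk distribution: if chance alone pushes as many scores below -cutoff as above +cutoff, the left-tail count L estimates the false positives hiding among the R right-tail hits. A minimal sketch, assuming R and L are simple counts beyond a symmetric cutoff (the function name and the toy scores are made up):

```python
def tail_specificity(z_scores, cutoff):
    """Count scores beyond +cutoff (R) and beyond -cutoff (L),
    and estimate specificity as (R - L) / R."""
    right = sum(1 for z in z_scores if z >= cutoff)
    left = sum(1 for z in z_scores if z <= -cutoff)
    if right == 0:
        return right, left, None          # nothing selected, specificity undefined
    return right, left, (right - left) / right

# Toy z-scores, not real data.
toy_scores = [-6.1, -2.0, -0.5, 0.1, 0.4, 1.2,
              5.3, 5.9, 6.4, 7.7, 8.0, 9.5, 11.2, 12.0]
print(tail_specificity(toy_scores, cutoff=5.0))
```

In this toy set eight scores pass the cutoff and one falls in the left tail, so roughly one of the eight selected patterns is expected to be a chance hit.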
15
Test 2: Intergenic vs. Coding
[Plot axes: coding conservation, intergenic conservation. Labels: CGG-11-CCG; higher conservation in genes]
16
Test 3: Upstream vs. Downstream
[Plot axes: downstream conservation, upstream conservation. Labels: CGG-11-CCG; most patterns; downstream motifs?]
17
Constructing full motifs
[Diagram labels: Test 1, Test 2, Test 3; 2,000 mini-motifs; Extend, Collapse, Merge; 72 full motifs. Example sequence labels: CTA, CGA, R, R, CTGRC, CGAA, ACCTGCGAACTGRCCGAACTRAY, CGAA, Y, and the numbers 6 and 5.]
18
Extending mini-motifs
Separate conserved and non-conserved instances
[Figure: instances split into a causal set and a random set. Labels: CTACGA, 6, CTxxGA, 6, CTACGARGW, CTxxGAYHS, and the counts N1, N2, M1, M2]
Find maximally discriminating neighborhood
Evaluate non-randomness of neighborhood
– chi-square contingency test on [N1,M1], [N2,M2]
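The contingency test can be sketched as a plain Pearson chi-square on a 2x2 table. The slide does not define the four counts, so the reading used here is an assumption: row 1 is the conserved set, row 2 the non-conserved set, and the columns count instances that do or do not carry a candidate neighboring base. The function name and the example counts are hypothetical.

```python
def chi_square_2x2(n1, m1, n2, m2):
    """Pearson chi-square statistic for the table [[n1, m1], [n2, m2]]."""
    table = [[n1, m1], [n2, m2]]
    row_totals = [n1 + m1, n2 + m2]
    col_totals = [n1 + n2, m1 + m2]
    total = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            if expected == 0:
                raise ValueError("a row or column of the table is empty")
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

# Hypothetical counts: the neighbor base is seen in 30 of 40 conserved
# instances but only 20 of 60 non-conserved ones.
print(chi_square_2x2(30, 10, 20, 40))
```

With one degree of freedom, a statistic above 3.84 corresponds to p < 0.05, so a neighbor this skewed toward the conserved set would count as non-random.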
19
Systematically test candidate patterns
[Diagram: all potential motifs, evaluate MCS, cluster similar motifs. Example pattern letters: GTC AGT R R Y gap S W. Result: 174 motifs in promoters, 106 motifs in 3' UTRs]
Enumerate
– Length between 6 and 15 nt, allow central gap
– 11-letter alphabet (A C G T, 2-fold codes, N)
Score
– Compute binomial score (conserved vs. total)
– Select MCS > 6.0 (specificity 97%)
Cluster
– Sequence similarity
– Overlapping occurrences
Are these real?
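A rough sketch of the enumeration step follows. The 11-letter alphabet is the four bases, the six two-fold IUPAC codes (R, Y, S, W, K, M), and N. The slide does not say how the central gap is laid out, so the scheme below (two equal halves around a run of Ns) and the function name are assumptions made for illustration.

```python
from itertools import product

# 4 bases + 6 two-fold degenerate codes + N = 11 letters.
ALPHABET = "ACGTRYSWKMN"

def enumerate_patterns(half_len, gap_lengths, alphabet=ALPHABET):
    """Yield candidate patterns: two half_len-letter halves around a central gap of Ns."""
    for left in product(alphabet, repeat=half_len):
        for right in product(alphabet, repeat=half_len):
            for gap in gap_lengths:
                yield "".join(left) + "N" * gap + "".join(right)

# Size of the ungapped search space at the shortest length (6 letters).
print(len(ALPHABET) ** 6)

# A tiny run over a two-letter alphabet, just to show the output shape.
print(list(enumerate_patterns(1, [0, 2], alphabet="AC")))
```

Even the shortest ungapped length already gives well over a million candidates, which is why each one is scored with a cheap binomial statistic before clustering.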
20
Functions of discovered motifs
21
Evidence of motif function
Promoter motifs:
(1) Comparison to known motifs
(2) Distance from TSS
(3) Expression enrichment
[Diagram: a gene from ATG to Stop, with the promoter (174 motifs) and the 3' UTR (106 motifs)]
22
(1) Promoter motifs match known TF binding sites
Compare discovered motifs to TRANSFAC database of 125 known motifs
– 55% of TRANSFAC motifs match discovered motifs
– 45% of discovered motifs match TRANSFAC motifs
– (only 2% of control sequences match TRANSFAC motifs)
23
(2) Promoter motifs show preferred distance to TSS
32% of discovered motifs show strong positional bias
Discovered motifs occur preferentially within 200 bp of the Transcription Start Site
Individual motifs show strong peaks, regardless of conservation
[Plot: distance from TSS for each of the 174 discovered motifs. Series: conserved motif sites in all four species; motif instances in human. Labeled examples: Motif 8, Motif 4; positions -81, -63]
24
(3) Promoter motifs enriched in specific tissues
70% of motifs show significant enrichment in at least one tissue
[Figure labels: new motifs, known TFs]
25
Summary for promoter motifs
Rank  Discovered motif  Known TF motif  Tissue enrichment / distance bias
1     RCGCAnGCGY        NRF-1           Yes
2     CACGTG            MYC             Yes
3     SCGGAAGY          ELK-1           Yes
4     ACTAYRnnnCCCR     -               Yes
5     GATTGGY           NF-Y            Yes
6     GGGCGGR           SP1             Yes
7     TGAnTCA           AP-1            Yes
8     TMTCGCGAnR        -               Yes
9     TGAYRTCA          ATF3            Yes
10    GCCATnTTG         YY1             Yes
11    MGGAAGTG          GABP            Yes
12    CAGGTG            E12             Yes
13    CTTTGT            LEF1            Yes
14    TGACGTCA          ATF3            Yes
15    CAGCTG            AP-4            Yes
16    RYTTCCTG          C-ETS-2         Yes
17    AACTTT            IRF1(*)         Yes
18    TCAnnTGAY         SREBP-1         Yes
19    GKCGCn(7)TGAYG    -               Yes
20    GTGACGY           E4F1            Yes
21    GGAAnCGGAAnY      -               Yes
22    TGCGCAnK          -               Yes
23    TAATTA            CHX10           Yes
24    GGGAGGRR          MAZ             Yes
25    TGACCTY           ERRA            Yes
Rows with no known TF motif (-) are new motifs.
174 promoter motifs
– 70 match known TF motifs
– 115 expression enrichment
– 60 show positional bias
– 75% have evidence
Control sequences
– < 2% match known TF motifs
– < 5% expression enrichment
– < 3% show positional bias
– < 7% false positives
Most discovered motifs are likely to be functional
26
Summary of Promoter Motifs
27
Similar analysis in 5% most conserved regions in human
12-22 bp long motifs
28
Similar analysis in 5% most conserved regions in human
29
Overview of Motif Discovery Algorithms
30
Motif Representation
Motif instances: GTATAA, CTATAA, GTCTTA, ATATAC, GTAATA, TTGTAC, GTATTA, GTATTC, ATCTAA
Representations
– PSSM
– Consensus: GTATAA
– IUPAC: GTATAM
– Complex dependency graphical models
– Nonparametric: graph or bag of words
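The first two representations can be derived directly from the nine instances on this slide. The sketch below builds a count matrix, reads off the consensus, and turns the counts into a log-odds PSSM; the pseudocount of 1 and the uniform 0.25 background are assumptions, and the function names are illustrative.

```python
import math

SITES = ["GTATAA", "CTATAA", "GTCTTA", "ATATAC", "GTAATA",
         "TTGTAC", "GTATTA", "GTATTC", "ATCTAA"]

def count_matrix(sites):
    """Per-position base counts for equal-length motif instances."""
    counts = [{b: 0 for b in "ACGT"} for _ in range(len(sites[0]))]
    for site in sites:
        for i, base in enumerate(site):
            counts[i][base] += 1
    return counts

def consensus(counts):
    """Most frequent base at each position."""
    return "".join(max("ACGT", key=lambda b: col[b]) for col in counts)

def pssm(counts, pseudocount=1.0, background=0.25):
    """Log2-odds scores of each base at each position against the background."""
    matrix = []
    for col in counts:
        total = sum(col.values()) + 4 * pseudocount
        matrix.append({b: math.log2((col[b] + pseudocount) / total / background)
                       for b in "ACGT"})
    return matrix

def score(matrix, kmer):
    """Sum of per-position log-odds scores for one candidate site."""
    return sum(col[base] for col, base in zip(matrix, kmer))

counts = count_matrix(SITES)
matrix = pssm(counts)
print(consensus(counts))
print(score(matrix, "GTATAA"), score(matrix, "CCCCCC"))
```

The consensus keeps one letter per column and the IUPAC string keeps a set of letters, while the PSSM keeps the full frequencies, so a site like GTATTA scores well even though it differs from the consensus.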
31
Motif Representation – Pairwise Dependencies
[Figure: complex dependency graphical models]
32
Motif Representation – MotifScan
GTATAA CTATAA TTGTAC GTCTTA GTAATA ATATAC GTATTA GTATTC ATCTAA
33
Motif Finding
Given a set of promoter sequences
– For example, promoters of genes that share a common expression pattern in microarrays
ACCGAGAGTATAAGCTTACGTGACTTGCATGATCTTGCGATGTGTGTTCAGCT
ATCGTACGTTGAGGAGAGGCGGTAATAGAAGTACGTCGATGTCGTCGTACAT
TTCCTATAAGATCGACTGTAGGGAGAGTCTCTGAGAGTATTGCTGGCATGTG
ACTTCGAGGAGAGATTCTCTAGATCTATGCTGTGGTATTAAGAGATCTCTAG
ATCGATGCGCTGATCGCTATAATATATCGGCGGTATCTGGTTGATCTGGTGT
GACTGATGTATCGTATCTGATCTGTCGGTATAATATAGCTGTCTGATTAGTTG
TCTCTAGATGCTGTGCTGATGGTCTTATCGATGTGCGACGGTAATAGTATCCT
Find a common motif that they share
GTATAA GTAATA CTATAA GTATTA CTATAA GTATAA GTAATA
34
Most Popular Approaches
Expectation Maximization – MEME
– Sequences are mixtures of
   Motif model M, e.g., a motif PSSM
   Background model B, e.g., a 3rd-order model of promoters
– Learn the model by
   Starting from a random M and a B learned from promoters
   Assigning each position in the input to M or B, accordingly
   Re-estimating M and B based on the current assignments
Gibbs Sampling – AlignACE, BioProspector
– Update one sequence x at a time
   Remove x from M
   Pick a new location in x based on M
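The Gibbs sampling loop can be sketched in a few lines. This is a simplified illustration, not the AlignACE or BioProspector implementation: it assumes exactly one site per sequence, a fixed motif width, a uniform background in place of a higher-order model, and made-up toy sequences. All function names are illustrative.

```python
import random

def profile(sites, width, pseudocount=1.0):
    """Column-wise base frequencies (with pseudocounts) for the given sites."""
    columns = []
    for i in range(width):
        counts = {b: pseudocount for b in "ACGT"}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        columns.append({b: c / total for b, c in counts.items()})
    return columns

def gibbs_sample(seqs, width, iterations=500, seed=0):
    """Return one motif start position per sequence after Gibbs sampling."""
    rng = random.Random(seed)
    # Start from a random site in every sequence.
    pos = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        x = rng.randrange(len(seqs))            # sequence to update
        # Build M from every site except the one in sequence x.
        others = [s[p:p + width]
                  for i, (s, p) in enumerate(zip(seqs, pos)) if i != x]
        m = profile(others, width)
        # Weight each candidate start in x by its likelihood ratio, M vs. background.
        weights = []
        for start in range(len(seqs[x]) - width + 1):
            w = 1.0
            for i, base in enumerate(seqs[x][start:start + width]):
                w *= m[i][base] / 0.25
            weights.append(w)
        # Sample (rather than maximize) the new location in x.
        pos[x] = rng.choices(range(len(weights)), weights=weights)[0]
    return pos

# Made-up toy promoters, each containing a TATA-like site.
toy = ["ACCGTATAAGC", "TTGTATAACGA", "CAGGTATAATC", "GGCTGTATAAT"]
positions = gibbs_sample(toy, width=6)
print([s[p:p + 6] for s, p in zip(toy, positions)])
```

Sampling the new location instead of always taking the best one is what lets the procedure escape poor starting alignments; EM, by contrast, updates soft assignments for all positions at once.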
35
MotifCut
Construct a graph of all promoters
– Each k-mer in each promoter is a node
– Nodes are connected with edges of weight proportional to sequence similarity
Find maximum density subgraph
Example promoters:
ACAGGATCACTGATGCAGCATGCATGCATCG
CTAGTCGTAGTCTCGATCTAGCTGTGTGTC
CATGATGCGCGATCTTGCTGTGGTCATTAGC
ATCGAGGCGAGAGAGATCTCTCTAGTGTACT
Example k-mers (k = 7): ACAGGAT, CAGGATC, AGGATCA, GGATCAC, …
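The graph formulation can be illustrated with a small sketch. The slide does not say how the maximum density subgraph is found, so this example swaps in a greedy peeling heuristic (repeatedly remove the lowest-degree node and keep the densest intermediate set), which is a common approximation and not necessarily what MotifCut does. The edge weight (fraction of matching positions), the 0.7 threshold, and the toy sequences are all assumptions.

```python
def kmers(seqs, k):
    """One node per k-mer occurrence: (sequence index, offset, k-mer)."""
    return [(i, j, s[j:j + k])
            for i, s in enumerate(seqs) for j in range(len(s) - k + 1)]

def similarity(a, b):
    """Fraction of matching positions, used here as the edge weight."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def densest_subgraph(nodes, min_sim=0.7):
    """Greedy peeling; density = total edge weight / number of nodes."""
    n = len(nodes)
    w = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            s = similarity(nodes[a][2], nodes[b][2])
            if s >= min_sim:                   # drop weak edges
                w[a][b] = w[b][a] = s
    alive = set(range(n))
    degree = [sum(row) for row in w]           # weighted degree among alive nodes
    total = sum(degree) / 2                    # total edge weight among alive nodes
    best, best_density = set(alive), total / n
    while len(alive) > 1:
        v = min(alive, key=lambda u: degree[u])
        alive.remove(v)
        total -= degree[v]
        for u in alive:
            degree[u] -= w[v][u]
        density = total / len(alive)
        if density > best_density:
            best, best_density = set(alive), density
    return [nodes[i] for i in sorted(best)], best_density

# Made-up toy promoters sharing the site GTATAA.
toy = ["AAGTATAACC", "CCGTATAAGG", "TTGTATAATT"]
motif_nodes, density = densest_subgraph(kmers(toy, 6))
print(motif_nodes, density)
```

On the toy input the densest set is the three identical GTATAA occurrences, one per promoter. Unlike a PSSM, this representation makes no assumption that motif positions are independent.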