Published by Loreen Simpson. Modified over 9 years ago.
1
Discovery of transcription networks
Lecture 3, Nov 2012
Regulatory Genomics, Weizmann Institute
Prof. Yitzhak Pilpel
2
Hierarchical clustering
3
Promoter Motifs and expression profiles
Motifs shown: CGGCCCCGCGGA, CTCCTCCCCCCCTTCTGGCCAATCA, ATGTACGGGTG
4
AlignACE Example
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3
300-600 bp of upstream sequence per gene are searched in Saccharomyces cerevisiae.
http://statgen.ncsu.edu/~dahlia/journalclub/S01/jmb1205.pdf
5
A cluster of genes may contain a common motif in their promoters
In the seven upstream sequences of the AlignACE example (HIS7, ARO4, ILV6, THR4, ARO1, HOM2, PRO3), the aligned sites are:
AAAAGAGTCA
AAATGACTCA
AAGTGAGTCA
AAAAGAGTCA
GGATGAGTCA
AAATGAGTCA
GAATGAGTCA
AAAAGAGTCA
**********
Find a needle in a haystack
6
Computational Identification of Cis-regulatory Elements Associated with Groups of Functionally Related Genes in Saccharomyces cerevisiae J.D. Hughes, P.W. Estep, S. Tavazoie, G.M. Church Journal of Molecular Biology (2000)
7
Example
http://www.cifn.unam.mx/Computational_Genomics/old_research/BIOL2.html
GAL4 is one of the yeast genes required for growth on galactose.
Aligned sites:
G1  A G A A G A
G2  A A A T G A
G3  G A A T G A
G4  A G A A G A
G5  A G A A G A
Motif Representation
     1    2    3    4    5    6
A   0.8  0.4  1    0.6  0    1
C   0    0    0    0    0    0
G   0.2  0.6  0    0    1    0
T   0    0    0    0.4  0    0
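The motif representation on this slide can be recomputed from the five aligned sites. The sketch below is illustrative only: the function name `frequency_matrix` is made up here, and no pseudocounts are applied, which matches the zeros shown in the slide's matrix.

```python
from collections import Counter

def frequency_matrix(sites, alphabet="ACGT"):
    """Return {base: [frequency at each position]} for equal-length sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    matrix = {base: [] for base in alphabet}
    # zip(*sites) walks the alignment column by column.
    for column in zip(*sites):
        counts = Counter(column)
        for base in alphabet:
            matrix[base].append(counts[base] / len(sites))
    return matrix

# The five sites G1..G5 from the slide.
sites = ["AGAAGA", "AAATGA", "GAATGA", "AGAAGA", "AGAAGA"]
pwm = frequency_matrix(sites)
for base in "ACGT":
    print(base, pwm[base])
```

Each row of the printed matrix corresponds to one base and each entry to one motif position, as in the table on the slide.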
8
Finding New Motifs
By lab work
By comparison to known motifs in other species
By searching upstream regions of a set of potentially co-regulated genes
9
The genes bound by the TF Abf1 can be clustered into several groups, some of which contain a motif (sporulation experiment):
NCGTNNNNARTGAT
CGATGAGMTK
NCGTNNNNARTGAT & CGATGAGMTK
10
Search Space Size of search space: L=600, W = 15, N = 10 : Exact search methods are not feasible
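The expression for the size of the search space did not survive on this slide. One common way to count it, assuming exactly one site of width W in each of N sequences of length L, is (L - W + 1)^N. That assumption is mine, not the slide's, and AlignACE itself allows zero or several sites per sequence, so the true space is larger still. A minimal sketch:

```python
def alignment_search_space(L, W, N):
    """Number of alignments with exactly one width-W site per sequence.

    Assumes N sequences of equal length L; each has L - W + 1 start positions.
    """
    return (L - W + 1) ** N

size = alignment_search_space(L=600, W=15, N=10)
print(f"{size:.2e} candidate alignments")
```

Even under this simplified count, the number of alignments is far beyond what exhaustive enumeration can handle, which is the slide's point.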
11
AlignACE Example: Input Data Set
The same seven upstream sequences as in the earlier AlignACE example (HIS7, ARO4, ILV6, THR4, ARO1, HOM2, PRO3).
300-600 bp of upstream sequence per gene are searched in Saccharomyces cerevisiae.
Based on slides from G. Church Computational Biology course at Harvard
12
K-means
Start with random positions of centroids.
Assign data points to centroids.
Move centroids to center of assigned points.
Iterate till minimal cost.
(Figure: iteration = 3)
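The four K-means steps listed on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in the lecture: the function names, the choice to draw the starting centroids from the data points, and the toy points are all assumptions made here.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=100, seed=0):
    rng = random.Random(seed)
    # Step 1: random starting centroids (here: k distinct data points).
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(iterations):
        # Step 2: assign every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # Step 4: assignments stable, cost can no longer decrease.
        assignment = new_assignment
        # Step 3: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    cost = sum(sq_dist(p, centroids[a]) for p, a in zip(points, assignment))
    return centroids, assignment, cost

# Toy data (made up): two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignment, cost = kmeans(points, k=2)
print(assignment, round(cost, 3))
```

The returned cost is the sum of squared distances from each point to its centroid, which is the quantity the iteration drives down.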
14
AlignACE Example: Initial Seeding
Initial sites chosen in the upstream sequences (HIS7, ARO4, ILV6, THR4, ARO1, HOM2, PRO3):
TGAAAAATTC
GACATCGAAA
GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
TGTGAAGCAC
MAP score = -10.0
15
AlignACE Example: Sampling
Current alignment:
TGAAAAATTC
GACATCGAAA
GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
TGTGAAGCAC
Add? Candidate site: TCTCTCTCCA
How much better is the alignment with this site as opposed to without?
16
AlignACE Example: Sampling
Current alignment:
TGAAAAATTC
GACATCGAAA
GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
TGTGAAGCAC
Site under consideration: TGAAAAAATG
Add? How much better is the alignment with this site as opposed to without?
Remove.
17
AlignACE Example: Column Sampling
Current alignment (10 columns):
GACATCGAAA
GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
TGTGAAGCAC
Candidate alignment (one column added):
GACATCGAAAC
GCACTTCGGCG
GAGTCATTACA
GTAAATTGTCA
CCACAGTCCGC
TGTGAAGCACA
How much better is the alignment with this new column structure?
18
AlignACE Example: The Best Motif
Aligned sites in the upstream sequences (HIS7, ARO4, ILV6, THR4, ARO1, HOM2, PRO3):
AAAAGAGTCA
AAATGACTCA
AAGTGAGTCA
AAAAGAGTCA
GGATGAGTCA
AAATGAGTCA
GAATGAGTCA
AAAAGAGTCA
MAP score = 20.37
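The seed, sample and re-score loop walked through in the AlignACE example slides is a form of Gibbs sampling. The sketch below is a deliberately simplified site sampler and not AlignACE itself. It assumes exactly one site per sequence, a single strand, a fixed motif width, a uniform background and a simple likelihood-ratio weight instead of the MAP score. The function name, the parameters and the toy sequences are all made up for illustration.

```python
import math
import random

def gibbs_motif_search(seqs, width, iterations=200, seed=0, pseudocount=0.5):
    """Return one sampled site start per sequence (simplified Gibbs sampler)."""
    rng = random.Random(seed)
    # Initial seeding: one random site per sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for held_out, seq in enumerate(seqs):
            # Column counts from every current site except the held-out one.
            counts = [{b: pseudocount for b in "ACGT"} for _ in range(width)]
            for i, (s, start) in enumerate(zip(seqs, starts)):
                if i != held_out:
                    for j, base in enumerate(s[start:start + width]):
                        counts[j][base] += 1
            total = len(seqs) - 1 + 4 * pseudocount
            # Weight each candidate site by its likelihood ratio against
            # a uniform background (0.25 per base).
            weights = []
            for pos in range(len(seq) - width + 1):
                log_ratio = sum(
                    math.log(counts[j][seq[pos + j]] / total / 0.25)
                    for j in range(width)
                )
                weights.append(math.exp(log_ratio))
            # Sampling step: pick the new site in proportion to its weight.
            starts[held_out] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

# Toy sequences (made up), each containing TGACTCA somewhere.
seqs = ["ACGTACGTTGACTCAACGT", "TTTGACTCATTTACGATCG", "GGCATGACTCAGGCTAGCT"]
starts = gibbs_motif_search(seqs, width=7)
for seq, start in zip(seqs, starts):
    print(start, seq[start:start + 7])
```

Because the sampler is stochastic, different seeds can settle on different sites. In practice a sampler of this kind is run several times and the resulting alignments are compared by score.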
19
The MAP Score
MAP: maximum a priori log-likelihood score.
This is what the algorithm tries to optimize.
It measures the degree of over-representation of the motif in the input sequence relative to expectation in a random sequence.
20
The MAP Score
B, Γ = standard Beta and Gamma functions
N = number of aligned sites; T = number of total possible sites
F_jb = number of occurrences of base b at position j (F = sum)
G_b = background genomic frequency for base b
β_b = n × G_b for n pseudocounts (β = sum)
W = width of motif; C = number of columns in motif (W >= C)
21
The MAP Score
N = number of aligned sites
exp = expected number of sites in the input sequence under a random model
P = 1 site every 16,000 bases
For a 64,000-base sequence, exp = 4
22
Some examples
Motif (width) | Number of genes (each with a 1,000 bp promoter) | Number of times found | Expected number of times | MAP score
AGGGTAA (7) | 16 | 10 | ~1 | 10
GTAGATG (7) | 16 | 2 | ~1 | 0.60206
CCGTGAG (7) | 160 | 10 | ~10 | 0
GATGTA (6) | 16 | 2 | ~4 | -0.60206
AGGGTA (6) | 16 | 10 | 4 | 4.089354
A (1) | 16 | 2504 | ~2500 | 1.73
AAAAAAA (7) | 16 | 5 | ~1.5 | 2.614394
GGGGGGG (7) | 16 | 5 | ~0.5 | 5
Very intuitive: anything that is long, that occurs many times, and that differs from the background will score highly.
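The simplified scoring formula itself did not survive on these slides, but the values in the table are consistent with MAP ≈ N × log10(N / exp), where N is the number of sites found and exp is the expected number. That reading is inferred here from the table and is not the full MAP score, which involves Beta and Gamma functions. The sketch below checks it against several rows; the function names are made up.

```python
import math

def simplified_map_score(n_found, n_expected):
    """N * log10(N / exp): an approximation inferred from the slide's table."""
    if n_found == 0:
        return 0.0
    return n_found * math.log10(n_found / n_expected)

def expected_sites(motif_width, total_bases):
    """Expected occurrences of one fixed word in random, equiprobable sequence."""
    return total_bases / 4 ** motif_width

# (sites found, expected sites) pairs taken from rows of the table.
rows = {
    "AGGGTAA": (10, 1),
    "GTAGATG": (2, 1),
    "CCGTGAG": (10, 10),
    "GATGTA": (2, 4),
    "AAAAAAA": (5, 1.5),
    "GGGGGGG": (5, 0.5),
}
for motif, (found, expected) in rows.items():
    print(motif, round(simplified_map_score(found, expected), 6))

# A 7-mer is expected about once per 4**7 = 16,384 bases, i.e. roughly
# once in the 16,000 bases covered by 16 promoters of 1,000 bp each.
print(round(expected_sites(7, 16 * 1000), 3))
```

The AGGGTA row (4.089354) corresponds to an expected count of about 3.9 rather than the rounded 4 shown in the table, which is why it is left out of the check.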
23
The MAP Score Properties
a) Motif should be “strong”
b) Input sequence can’t be too long
P = 1 site every 16,000 bases
Genome length ~12 Mb: a motif needs more than 1,500 sites to get a positive MAP score
Problem: most transcription factor binding sites will only occur in dozens to hundreds of genes
24
Solution: Cluster genes before searching for motifs
(Figure: genes plotted on axes Time-point 1, Time-point 2 and Time-point 3)
25
Group Specificity Score
How well does a motif target the genes used to find it, compared with the whole genome? What is the probability of having such a large intersection?
N = total number of ORFs in the genome (6226)
S1 = number of ORFs used to align the motif
S2 = number of targets in the genome (~100 ORFs with best ScanACE scores)
X = size of the intersection of S1 and S2
(Figure: Venn diagram of the whole genome (N), the motif ORF group (S1) and the ORFs with best sites (S2), overlapping in X)
Based on slides from G. Church Computational Biology course at Harvard
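The formula for the group specificity score is not preserved on this slide. The question it poses, the probability of an intersection at least as large as X, is answered by a hypergeometric tail, so the sketch below assumes that form. The function name and the example numbers for S1 and X are made up; only N = 6226 and S2 = 100 come from the slide.

```python
from math import comb

def group_specificity(N, s1, s2, x):
    """P(intersection >= x) when s2 ORFs are drawn at random from N,
    of which s1 belong to the group used to find the motif."""
    upper = min(s1, s2)
    total = comb(N, s2)
    tail = sum(comb(s1, i) * comb(N - s1, s2 - i) for i in range(x, upper + 1))
    return tail / total

# Hypothetical example: a 30-ORF group, 10 of which are among the 100
# best-scoring ORFs in a genome of 6226 ORFs.
p = group_specificity(N=6226, s1=30, s2=100, x=10)
print(f"{p:.3e}")
```

A small value means the motif's best sites are concentrated in the group it was derived from, far more than expected for a random set of ORFs.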
27
Positional Bias Score
Measures the degree of preference for positioning in a particular range upstream of the translational start.
(Figure: number of ORFs by site position, from -600 bp to Start, in 50 bp windows)
Based on slides from G. Church Computational Biology course at Harvard
28
Positional Bias Score
Find the best 200 sites in the genome.
Restrict sites to a segment of length s = 600 bp from the translation start.
t = number of sites in the segment
Choose window size w = 50 bp.
m = number of sites in the most enriched window
What is the probability of having m or more sites in a window of size w?
Based on slides from G. Church Computational Biology course at Harvard
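The probability asked for on this slide is not spelled out in the surviving text. One natural reading, assumed here, is a binomial tail: if each of the t sites falls into a given window with probability w/s, the chance of m or more sites in that window is the upper tail of Binomial(t, w/s). The function name and example counts are made up.

```python
from math import comb

def positional_bias(t, m, s=600, w=50):
    """P(at least m of t sites fall in one window of width w within s bp),
    assuming sites are placed uniformly and independently."""
    p = w / s
    return sum(comb(t, i) * p ** i * (1 - p) ** (t - i) for i in range(m, t + 1))

# Hypothetical example: 40 sites lie within 600 bp of the start codon and
# 15 of them fall in the most enriched 50 bp window.
print(f"{positional_bias(t=40, m=15):.3e}")
```

This sketch scores a single window and makes no correction for having picked the most enriched window out of several.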
30
Lecture Topics
Introduction to DNA regulatory motifs
AlignACE - A motif finding algorithm
Assessment of motifs
AlignACE results on yeast genome
Summary & Conclusions
31
Comparisons of motifs
The CompareACE program finds the best alignment between two motifs and calculates the correlation between the two position-specific scoring matrices.
Similar motifs: CompareACE score > 0.7
Based on slides from G. Church Computational Biology course at Harvard
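The idea of scoring two motifs by the correlation of their matrices at the best alignment can be sketched as follows. This is not the CompareACE program: the minimum overlap of four columns, the use of a Pearson correlation over the flattened overlapping columns, and the omission of the reverse-complement orientation are simplifying assumptions, and all names are made up.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def best_alignment_score(m1, m2, min_overlap=4):
    """m1, m2: lists of columns, each column a list of [A, C, G, T] frequencies.
    Slides m2 along m1 and returns the highest correlation over the overlap."""
    best = -1.0
    for offset in range(-(len(m2) - min_overlap), len(m1) - min_overlap + 1):
        flat1, flat2 = [], []
        for i, col in enumerate(m1):
            j = i - offset  # column of m2 aligned with column i of m1
            if 0 <= j < len(m2):
                flat1.extend(col)
                flat2.extend(m2[j])
        if len(flat1) >= 4 * min_overlap:
            best = max(best, pearson(flat1, flat2))
    return best

# Columns of a six-position frequency matrix, each in A, C, G, T order.
motif = [[0.8, 0, 0.2, 0], [0.4, 0, 0.6, 0], [1, 0, 0, 0],
         [0.6, 0, 0, 0.4], [0, 0, 1, 0], [1, 0, 0, 0]]
# The same motif shifted by one position, padded with an uninformative column.
shifted = motif[1:] + [[0.25, 0.25, 0.25, 0.25]]
print(round(best_alignment_score(motif, shifted), 3))
```

Scanning over offsets is what lets two matrices that describe the same motif with different start positions still receive a high score.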
32
Clustering motifs by similarity
Pairwise CompareACE scores for motif A, motif B, motif C and motif D:
     A    B    C    D
A   1.0  0.9  0.1  0.0
B        1.0  0.2  0.1
C             1.0  0.8
D                  1.0
Hierarchical clustering of the CompareACE scores:
cluster 1: A, B
cluster 2: C, D
(Figure: position weight matrices of two motifs being compared)
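The grouping on this slide can be reproduced from the pairwise scores. The sketch below uses a plain single-linkage cut at the 0.7 similarity threshold, implemented with a small union-find, as a stand-in for the hierarchical clustering named on the slide. The function name and that choice of linkage are assumptions.

```python
def cluster_motifs(names, scores, threshold=0.7):
    """Merge motifs whose pairwise score exceeds the threshold (single linkage)."""
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for (a, b), score in scores.items():
        if score > threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values())

# Pairwise CompareACE scores from the slide.
scores = {
    ("A", "B"): 0.9, ("A", "C"): 0.1, ("A", "D"): 0.0,
    ("B", "C"): 0.2, ("B", "D"): 0.1, ("C", "D"): 0.8,
}
print(cluster_motifs(["A", "B", "C", "D"], scores))
```

With these scores only the A-B and C-D pairs exceed 0.7, so the result is the two clusters shown on the slide.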
33
Most Group Specific Motifs
34
Most Positional Biased Motifs
35
Negative Controls
250 AlignACE runs on randomly created groups of ORFs, of size 20, 40, 60, 80, and 100 ORFs.
(Figure: MAP scores for random versus real groups)
Based on slides from G. Church Computational Biology course at Harvard
36
Negative Controls MAP cut off of 10, Group Specificity cutoff of : False Positives = 10-20%
37
Positive Controls
29 listed TFs with five or more known binding sites were chosen.
AlignACE was run on the upstream regions of the corresponding regulated genes.
An appropriate motif was found in 21/29 cases.
False negative rate = ~10-30%
Based on slides from G. Church Computational Biology course at Harvard
40
The data
Organism: Saccharomyces cerevisiae
Microarray experiment: Affymetrix microarrays of 6,220 mRNA
Data: gathered by Cho et al.; 15 time points, spanning about 4 hours across two cell cycles
Genome sequence
41
Typical clusters of genes in the data
42
Variance normalization and clustering of expression time series
The 3,000 most variable ORFs were chosen, based on the normalized dispersion in expression level of each gene across the time points (s.d./mean).
The 15 time points were used to construct a 3,000 by 15 data matrix.
The variance of each gene was normalized across the 15 conditions by subtracting the mean across the time points from the expression level of each gene and dividing by the standard deviation across the time points.
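The selection and normalization steps described on this slide amount to ranking genes by s.d./mean and then z-scoring each profile. A minimal sketch, assuming the population standard deviation (the slide does not say which estimator was used) and using made-up gene names and values:

```python
from statistics import mean, pstdev

def dispersion(profile):
    """Normalized dispersion of one gene across time points: s.d. / mean."""
    return pstdev(profile) / mean(profile)

def normalize(profile):
    """Subtract the gene's mean and divide by its standard deviation."""
    mu, sd = mean(profile), pstdev(profile)
    return [(x - mu) / sd for x in profile]

def most_variable(profiles, n):
    """Names of the n genes with the highest normalized dispersion."""
    ranked = sorted(profiles, key=lambda gene: dispersion(profiles[gene]), reverse=True)
    return ranked[:n]

# Toy expression profiles (made up) for three hypothetical genes.
profiles = {
    "geneA": [10.0, 10.5, 9.5, 10.0],
    "geneB": [1.0, 10.0, 1.0, 10.0],
    "geneC": [5.0, 6.0, 5.0, 4.0],
}
selected = most_variable(profiles, 2)
matrix = {gene: normalize(profiles[gene]) for gene in selected}
print(selected)
```

After normalization every selected gene has mean 0 and standard deviation 1, so clustering compares the shape of the profiles rather than their absolute levels.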
43
Before and after mean - variance normalization Before normalization After normalization
44
Representation of expression data
Normalized expression data from microarrays
(Figure: Gene 1 and Gene 2 plotted against time-point axes, separated by their Euclidean distance)
45
K-means
Start with random positions of centroids.
(Figure legend: position of data point X_i; position of data centroid C; iteration = 0)
46
Choosing K
Since we don’t know the number of clusters in advance, we need a way to estimate it.
In order to choose the number of clusters K, the sum of squared errors is calculated for different K values.
A clear break point indicates the “natural” number of clusters in the data.
(Figure: sum of squared errors against K)
47
Significant enrichment of functional categories within clusters
Each gene was mapped into one of 199 functional categories (according to the MIPS database).
For each cluster, a P-value was calculated for observing the frequencies of genes from particular functional categories.
There was significant grouping of genes within the same cluster.
50
The hypergeometric score
P-values were calculated for finding at least k ORFs from a particular functional category within a cluster of size n, where f is the total number of genes within the functional category and g is the total number of genes within the genome (6,220).
P-values greater than 3×10^-4 are not reported, as their total expectation within the cluster would be higher than 0.05, since 199 MIPS functional categories were tested (ref. 15).
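The formula on this slide is not preserved. With the symbols defined above (k, n, f, g), the standard hypergeometric tail for "at least k" would read as follows. This is offered as the conventional form under that assumption, together with the arithmetic behind the reporting cutoff.

```latex
P(\text{at least } k) \;=\; 1 - \sum_{i=0}^{k-1}
  \frac{\binom{f}{i}\,\binom{g-f}{\,n-i\,}}{\binom{g}{n}}

% Multiple testing: with 199 categories tested per cluster, a cutoff of
% 3 \times 10^{-4} gives an expected number of chance hits of
199 \times 3\times10^{-4} \approx 0.06
```

The second line is why the cutoff sits near 0.05 / 199: it keeps the expected number of false positives per cluster at roughly 0.05.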
51
Challenge: generalize the hypergeometric score to more than two sets
(Figure: overlap of three sets: Chr V, expression cluster, functional group)
53
Sequence: MCB element
Consensus
This motif was later mapped to the literature and confirmed to be the well-known MCB element, which is known to control the periodicity of the genes that peak at G1-S.
(Figure: sequence logo; axis: nucleotides)
54
MCB element clusters
Presence of the motif in the ORFs of each cluster
55
Location of the motif: MCB element
(Figure axis: distance from ATG, in bp)
56
SCB element
This motif (later found to be the SCB element) was the second-highest-scoring motif within this cluster.
The SCB element is also a very well-known cis-regulatory element that contributes to the periodicity of the genes within the G1-S regulon.
57
ribonucleotide reductase
58
Determining the cell-cycle periodicity of clusters
Fourier analysis allows the genes to be ranked according to their cell-cycle periodicity.
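One simple way to turn Fourier analysis into a ranking is to measure the strength of each expression profile at the cell-cycle frequency. The sketch below computes the magnitude of that single Fourier component. The function name, the unit spacing of time points and the assumed period of 7.5 sampling intervals (two cycles over 15 points) are illustrative assumptions, not details taken from the lecture.

```python
import cmath
import math

def periodicity_score(profile, times, period):
    """Magnitude of the Fourier component of a profile at the given period."""
    mu = sum(profile) / len(profile)
    omega = 2 * math.pi / period
    component = sum(
        (x - mu) * cmath.exp(-1j * omega * t) for x, t in zip(profile, times)
    )
    return abs(component) / len(profile)

times = list(range(15))   # 15 equally spaced time points
period = 7.5              # assumed: two full cell cycles across the series

# Toy profiles (made up): one oscillating with the cell cycle, one flat.
periodic = [math.sin(2 * math.pi * t / period) for t in times]
flat = [3.0] * 15

ranked = sorted(
    {"periodic": periodic, "flat": flat}.items(),
    key=lambda item: periodicity_score(item[1], times, period),
    reverse=True,
)
print([name for name, _ in ranked])
```

Genes, or cluster mean profiles, can then be sorted by this score, with the most strongly cycling ones first.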
59
Explain FFT… (including ORs variability)
60
Periodic clusters
61
Non periodic clusters
63
And this was just the beginning…
64
In case of two motifs derived from a cluster
Collaboration?
Co-occurrence (AND)
Redundancy (OR)
http://longitude.weizmann.ac.il/publications/PilpelNatGent01.pdf
65
Logic of interaction of motifs
(Figure: expression level of genes with only M1, only M2, and M1 AND M2; G2 phase marked)
66
Synergistic motifs
A combination of two motifs is called ‘synergistic’ if the expression coherence score of the genes that have both motifs is significantly higher than the scores of the genes that have either of the motifs alone.
Example: SFF and Mcm1
67
A global map of combinatorial expression control
* High connectivity
* Hubs
* Alternative partners in various conditions
Pilpel et al. Nature Genetics 2001
68
The human cell cycle G1-Phase S-Phase G2-Phase M-Phase
69
The proliferation cluster genes are cell cycle periodic
(Figure: gene expression across samples, with G2/M, G1/S and CHR marked; proportion of all genes versus proliferation genes)
70
The cell cycle motifs are enriched among the proliferation cluster genes
(Figure: positions of the NFY, E2F, ELK1, CDE and CHR motifs within 200 bp of the TSS; legend: not in the cluster, mutated in cancer)
71
Regulation of the proliferation cluster: significant motifs
1000 bp upstream; 326 MatInspector motifs
Motif | P-value
NFY | 3.74×10^-11
CDE | 5.31×10^-10
E2F | 2.37×10^-09
ELK1 | 3.10×10^-06
CHR | 1.42×10^-05
72
Potential regulatory motifs in 3’ UTRs
Finding 3’ UTR elements associated with high/low transcript stability (in yeast)
AAGCTTCCCCTACAAC
Entire genome