Download presentation
Presentation is loading. Please wait.
Published byAmberly Atkins Modified over 9 years ago
1
Transcription Regulation Transcription Factor Motif Finding Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520
2
Outline Biology of transcription regulation and challenges of computational motif findingBiology of transcription regulation and challenges of computational motif finding Scan for known TF motif sites –TRASFAC and JASPAR, Sequence Logo De novo method –Regular expression enumeration: w-mer enumerate –Position weight matrix update: EM and Gibbs Motif finding in different organisms –Motif clusters and conservation 2
3
Imagine a Chef Restaurant DinnerHome Lunch Certain recipes used to make certain dishes 3
4
Each Cell Is Like a Chef 4
5
Infant Skin Adult Liver Glucose, Oxygen, Amino Acid Fat, Alcohol Nicotine Healthy Skin Cell State Disease Liver Cell State Certain genes expressed to make certain proteins 5
6
Understanding a Genome Get the complete sequence (encoded cook book) Observe gene expressions at different cell states (meals prepared at different situations) Decode gene regulation (decode the book, understand the rules) 6
7
ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC ATTTACCACATCGCATCACAGTTCAGGACTAGACACGGACG GCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTA TCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAG CGCTAGCCCGTCATCGATCTTGTTCGAATCGCGAATTGCCT Information in DNA Milk->Yogurt Beef->Burger Egg->Omelet Fish->Sushi Flour->Cake Coding region 2% What is to be made 7
8
Information in DNA Non-coding region 98% Regulation: When, Where, Amount, Other Conditions, etc ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC ATTTACCACATCGCATCACTACGACGGACTAGACACGGACG GCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTA TCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAG CGCTAGGTCATCCCAGATCTTGTTCGAATCGCGAATTGCCT Milk->Yogurt Beef->Burger Egg->Omelet Fish->Sushi Flour->Cake Morning Japanese Restaurant 5 Oz 9 Oz Butter Coding region 2% 8
9
Measure Gene Expression Microarray or SAGE detects the expression of every gene at a certain cell state Clustering find genes that are co-expressed (potentially share regulation) 9
10
STAT115, 04/01/2008 Decode Gene Regulation GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC CACATCGCATATTTACCACCAGTTCAGACACGGACGGC GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAA TCTCGTTAGGACCATATTTACCACCCACATCGAGAGCG CGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT Scrambled Egg Bacon Cereal Hash Brown Orange Juice Look at genes always expressed together: Upstream Regions Co-expressed Genes
11
STAT115, 04/01/2008 Decode Gene Regulation GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC CACATCGCATATTTACCACCAGTTCAGACACGGACGGC GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAA TCTCGTTAGGACCATATTTACCACCCACATCGAGAGCG CGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT Scrambled Egg Bacon Cereal Hash Brown Orange Juice Look at genes always expressed together: Upstream Regions Co-expressed Genes
12
STAT115, 04/01/2008 Decode Gene Regulation GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC CACATCGCATATTTACCACCAGTTCAGACACGGACGGC GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAA TCTCGTTAGGACCATATTTACCACCCACATCGAGAGCG CGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT Scrambled Egg Bacon Cereal Hash Brown Orange Juice Look at genes always expressed together: Upstream Regions Co-expressed Genes Morning
13
Biology of Transcription Regulation...acatttgcttctgacacaactgtgttcactagcaacctca...aacagacaccATGGTGCACCTGACTCCTGAGGAGAAGTCT......agcaggcccaactccagtgcagctgcaacctgcccactcc...ggcagcgcacATGTCTCTGACCAAGACTGAGAGTGCCGTC......cgctcgcgggccggcactcttctggtccccacagactcag...gatacccaccgATGGTGCTGTCTCCTGCCGACAAGACCAA......gccccgccagcgccgctaccgccctgcccccgggcgagcg...gatgcgcgagtATGGTGCTGTCTCCTGCCGACAAGACCAA... atttgctt ttcact gcaacct aactccagt actca gcaacct ccagcgccg gcaacct Transcription Factor (TF) TF Binding Motif Hemoglobin Beta Hemoglobin Zeta Hemoglobin Alpha Hemoglobin Gamma Motif can only be computational discovered when there are enough cases for machine learning 13
14
Computational Motif Finding Input data: –Upstream sequences of gene expression profile cluster –20-800 sequences, each 300-5000 bps long Output: enriched sequence patterns (motifs) Ultimate goals: –Which TFs are involved and their binding motifs and effects (enhance / repress gene expression)? –Which genes are regulated by this TF, why is there disease when a TF goes wrong? –Are there binding partner / competitor for a TF? 14
15
Challenges: Where/what the signal The motif should be abundant GAAATATGCACATTTACCTATGCCCTACGACCTCTCGC CACATCGCATATTTACCACCAAATAAGACACGGACGGC GCCTCGAAATAGCCATTTACCGTTCAAACCTGACTAAA TCTCGTATTTACCATATTAAATACCCACATCGAGAGCG CGCTAGCAAATATACGATTTACCTCGAGAATTGCCTAT Water 15
16
The motif should be abundant And Abundant with significance GAAATATGCACATTTACCTATGCCCTACGACCTCTCGC CACATCGCATATTTACCACCAAATAAGACACGGACGGC GCCTCGAAATAGCCATTTACCGTTCAAACCTGACTAAA TCTCGTATTTACCATATTAAATACCCACATCGAGAGCG CGCTAGCAAATATACGATTTACCTCGAGAATTGCCTAT Coconut Challenges: Where/what the signal 16
17
Challenges: Double stranded DNA Motif appears in both strands GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC CACATCGCATGGTAAATACCAGTTCAGACACGGACGGC TCTCAGGTAAATCAGTCATACTACCCACATCGAGAGCG ||||||||||||||||||||||||||||| GTGTAGCGTACCATTTATGGTCAAGTCTG ||||||||||||||||||||||||||||| AGAGTCCATTTAGTCAGTATGATGGGTGT 17
18
Challenges: Base substitutions Sequences do not have to match the motif perfectly, base substitutions are allowed GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC CACATCGCATATGTACCACCAGTTCAGACACGGACGGC GCCTCGATTTGCCGTGGTACAGTTCAAACCTGACTAAA TCTCGTTAGGACCATATTTATCACCCACATCGAGAGCG CGCTAGCCAATTACCGATCTTGTTCGAGAATTGCCTAT 18
19
Challenges: Variable motif copies Some sequences do not have the motif Some have multiple copies of the motif GATGATGCCTCGGACGGATATGCCCTACGACCTCTCGC CACATCGCAATGCAGCAATGCGTTCAGACACGGACGGC TCATGCTAATGCCAGTCATGCTACATGCATCGAGAGCG GCCTCTAGCTAGGCCGGTGAACATCAGACCTGACTAAA CGCAATATAGCATTAGCAGACAGACGAGAATTGCCTAT 19
20
Challenges: Variable motif copies Some sequences do not have the motif Some have multiple copies of the motif GATGATGCCTCGGACGGATATGCCCTACGACCTCTCGC CACATCGCAATGCAGCAATGCGTTCAGACACGGACGGC TCATGCTAATGCCAGTCATGCTACATGCATCGAGAGCG GCCTCTAGCTAGGCCGGTGAACATCAGACCTGACTAAA CGCAATATAGCATTAGCAGACAGACGAGAATTGCCTAT Sushi Hand Roll Sashimi Tempura Sake Fish 20
21
Challenges: Two-block motifs Some motifs have two parts GACACATTTACCTATGC TGGCCCTACGACCTCTCGC CACAATTTACCACCA TGGCGTGATCTCAGACACGGACGGC GCCTCGATTTACCGTGGTATGGCTAGTTCTCAAACCTGACTAAA TCTCGTTAGATTTACCACCCA TGGCCGTATCGAGAGCG CGCTAGCCATTTACCGAT TGGCGTTCTCGAGAATTGCCTAT AATGCG GCGTAA or palindromic patterns CoconutMilk 21
22
Scan for Known TF Motif Sites Experimental TF sites: TRANSFAC, JASPAR TRANSFACJASPAR Motif representation: –Regular expression: Consensus CACAAAA binary decision Degenerate CRCAAAW IUPAC A/TA/G 22
23
Scan for Known TF Motif Sites Experimental TF sites: TRANSFAC, JASPAR Motif representation: –Regular expression: Consensus CACAAAA binary decision Degenerate CRCAAAW –Position weight matrix (PWM): need score cutoff Pos 12345678 ATGGCATG AGGGTGCG ATCGCATG TTGCCACG ATGGTATT ATTGCACG AGGGCGTT ATGACATG ATGGCATG ACTGGATG Motif Matrix Sites Segment ATGCAGCT score = p(generate ATGCAGCT from motif matrix) p(generate ATGCAGCT from background) p 0 A p 0 T p 0 G p 0 C p 0 A p 0 G p 0 C p 0 T 23
24
IUPAC for DNA A adenosine Ccytidine Gguanine T thymidine U uridine R G A (purine) Y T C (pyrimidine) K G T (keto) M A C (amino) S G C (strong) W A T (weak) B C G T (not A) D A G T (not C) H A C T (not G) VA C G (not T) NA C G T (any) 24
25
Protein Binding Microarrays In vitro protein- DNA interactions Better capture motifs 25
26
JASPAR User defined cutoff to scan for a particular motif 26
27
A Word on Sequence Logo SeqLogo consists of stacks of symbols, one stack for each position in the sequence The overall height of the stack indicates the sequence conservation at that position The height of symbols within the stack indicates the relative frequency of nucleic acid at that position ATGGCATG AGGGTGCG ATCGCATG TTGCCACG ATGGTATT ATTGCACG AGGGCGTT ATGACATG ATGGCATG ACTGGATG 27
28
Scan Known TF Motifs Drawbacks: –Limited number of motifs –Limited number of sites to represent each motif Low sensitivity and specificity –Poor description of motif Binding site borders not clear Binding site many mismatches –Many motifs look very similar E.g. GC-rich motif, E-box (CACGTG) 28
29
De novo Sequence Motif Finding Goal: look for common sequence patterns enriched in the input data (compared to the genome background) Regular expression enumeration –Pattern driven approach –Enumerate patterns, check significance in dataset –Oligonucleotide analysis, MobyDick Position weight matrix update –Data driven approach, use data to refine motifs –Consensus, EM & Gibbs samplingConsensusEMGibbs sampling –Motif score and Markov backgroundMotif score Markov background 29
30
Regular Expression Enumeration Oligonucleotide Analysis: check over- representation for every w-mer: –Expected w occurrence in data Consider genome sequence + current data size –Observed w occurrence in data –Over-represented w is potential TF binding motif Observed occurrence of w in the data p w from genome background size of sequence data Expected occurrence of w in the data 30
31
MobyDick A sequence data and a dictionary of motif words ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC ATGCTTCACATCGCATCACCAGTTCAGGATAGACACGGACG GCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTA TCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAG CGCTAGCCCGTCATCGATCTTGTTCGAATCGCGAATTGCCT D = {A, C, G, T} P w = {0.22, 0.28, 0.28, 0.22} 31
32
MobyDick A sequence data and a dictionary of motif words Check over-representation of every word-pair ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC ATGCTTCACATCGCATCACCAGTTCAGGATAGACACGGACG GCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTA TCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAG CGCTAGCCCGTCATCGATCTTGTTCGAATCGCGAATTGCCT D = {A, C, G, T} P w = {0.28, 0.22, 0.22, 0.28} ACGT AAAACAGAT CCACCCGCT GGAGCGGGT TTATCTGTT 32
33
MobyDick A sequence data and a dictionary of motif words Check over-representation of every word-pair ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC ATGCTTCACATCGCATCACCAGTTCAGGATAGACACGGACG GCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTA TCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAG CGCTAGCCCGTCATCGATCTTGTTCGAATCGCGAATTGCCT D = {A, C, G, T} P w = {0.28, 0.28, 0.22, 0.22} ACGT AAAACAGAT CCACCCGCT GGAGCGGGT TTATCTGTT D = {A,C,G,T,AA,GA,TA,GG} P w = {?} 33
34
MobyDick D = {A,C,G,T,AA,GA,TA,GG} Seq: AAGATAA Possible partitions: A A G A T A Ap A p A p G p A p T p A p A AA G A T A Ap AA p G p A p T p A p A AA GA T A Ap AA p GA p T p A p A AA GA TA Ap AA p GA p TA p A A A GA T AAp AA p GA p T p AA … Assign probabilities as to maximize total probability of generating the sequence 34
35
MobyDick A sequence data and a dictionary of motif words Check over-representation of every word-pair Reassign word probability and consider every new word-pair to build even longer words ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC ATGCTTCACATCGCATCACCAGTTCAGGATAGACACGGACG GCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTA TCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAG CGCTAGCCCGTCATCGATCTTGTTCGAATCGCGAATTGCCT D = {A, C, G, T} P w = {0.28, 0.28, 0.22, 0.22} ACGT AAAACAGAT CCACCCGCT GGAGCGGGT TTATCTGTT D = {A,C,G,T,AA,GA,TA,GG} P w = {?} 35
36
Regular Expression Enumeration RE Enumeration Derivatives: –oligo-analysis, spaced dyads w 1.n s.w 2 –IUPAC alphabetIUPAC alphabet –Markov background (later)Markov background –2-bit encoding, fast index access –Enumerate limited RE patterns known for a TF protein structure or interaction theme Exhaustive, guaranteed to find global optimum, and can find multiple motifs Not as flexible with base substitutions, long list of similar good motifs, and limited with motif width 36
37
Consensus Starting from the 1 st sequence, add one sequence at a time to look for the best motifs obtained with the additional sequence Seq1 Seq2 … … Good Motifs CACGTGC GTCAGTC CACGTTC GTCAGTC Bad Motifs CACGTGC GTCAGTC GTGACAT TGGAAAT 37
38
Consensus Starting from the 1 st sequence, add one sequence at a time to look for the best motifs obtained with the additional sequence Seq3 … … … Good Motifs CACGTGC GTCAGTC CACGTTC GTCAGTC CTCGTGC GACAGTC Bad Motifs CACGTGC GTCAGTC CACGTTC GTCAGTC TTCAAAG AGACTCA Remaining good motifs 38
39
Consensus Starting from the 1 st sequence, add one sequence at a time to look for the best motifs obtained with the additional sequence G Stormo, algorithm runs very fast Sequence order plays a big role in performance –First two sequences better contain the motif –Sites stop accumulating at the first bad sequence –Newer version allowing [0-n] is much slower 39
40
Expectation Maximization and Gibbs Sampling Model Objects: –Seq: sequence data to search for motif – 0 : non-motif (genome background) probability – : motif probability matrix parameter – : motif site locations Problem: P( , | seq, 0 ) Approach: alternately estimate – by P( | , seq, 0 ) – by P( | , seq, 0 ) –EM and Gibbs differ in the estimation methods 40
41
Expectation Maximization E step: | , seq, 0 TTGACGACTGCACGT TTGACp 1 TGACGp 2 GACGAp 3 ACGACp 4 CGACTp 5 GACTGp 6 ACTGCp 7 CTGCAp 8... P 1 = likelihood ratio = P(TTGAC| ) P(TTGAC| 0 ) p 0 T p 0 T p 0 G p 0 A p 0 C = 0.3 0.3 0.2 0.3 0.2 41
42
Expectation Maximization E step: | , seq, 0 TTGACGACTGCACGT TTGACp 1 TGACGp 2 GACGAp 3 ACGACp 4 CGACTp 5 GACTGp 6 ACTGCp 7 CTGCAp 8... M step: | , seq, 0 p 1 TTGAC p 2 TGACG p 3 GACGA p 4 ACGAC... Scale ACGT at each position, reflects weighted average of 42
43
EM Derivatives First EM motif finder (C Lawrence) –Deterministic algorithm, guarantee local optimum MEME (TL Bailey) –Prior probability allows 0-n site / sequence –Parallel running multiple EM with different seed –User friendly results 43
44
Gibbs Sampling Stochastic process, although still may need multiple initializations –Sample from P( | , seq, 0 ) –Sample from P( | , seq, 0 ) Collapsed form: – estimated with counts, not sampling from Dirichlet –Sample site from one seq based on sites from other seqs Converged motif matrix and converged motif sites represent stationary distribution of a Markov Chain 44
45
11 22 33 44 55 Gibbs Sampler Initial 1 3131 4141 5151 2121 11 Randomly initialize a probability matrixRandomly initialize a probability matrix n A1 + s A n A1 + s A + n C1 + s C + n G1 + s G + n T1 + s T estimated with counts p A1 = 45
46
Gibbs Sampler 1 Without 11 Segment Take out one sequence with its sites from current motifTake out one sequence with its sites from current motif 3131 4141 5151 2121 11 46
47
Segment (1-8) Sequence 1 Gibbs Sampler Score each possible segment of this sequenceScore each possible segment of this sequence 1 Without 11 Segment 3131 4141 5151 2121 47
48
Segment (2-9) Sequence 1 Gibbs Sampler Score each possible segment of this sequenceScore each possible segment of this sequence 3131 4141 5151 2121 1 Without 11 Segment 48
49
Segment Score Use current motif matrix to score a segment Pos 12345678 ATGGCATG AGGGTGCG ATCGCATG TTGCCACG ATGGTATT ATTGCACG AGGGCGTT ATGACATG ATGGCATG ACTGGATG Motif Matrix Sites Segment ATGCAGCT score = p(generate ATGCAGCT from motif matrix) p(generate ATGCAGCT from background) p 0 A p 0 T p 0 G p 0 C p 0 A p 0 G p 0 C p 0 T 49
50
Scoring Segments Motif12345bg A0.40.10.30.40.20.3 T0.20.50.10.20.20.3 G0.20.20.20.30.40.2 C0.20.20.40.10.20.2 Ignore pseudo counts for now… Sequence: TTCCATATTAATCAGATTCCG… score TAATC… AATCA0.4/0.3 x 0.1/0.3 x 0.1/0.3 x 0.1/0.2 x 0.2/0.3 = 0.049383 ATCAG0.4/0.3 x 0.5/0.3 x 0.4/0.2 x 0.4/0.3 x 0.4/0.2 = 11.85185 TCAGA0.2/0.3 x 0.2/0.3 x 0.3/0.3 x 0.3/0.2 x 0.2/0.3 = 0.444444 CAGAT… 50
51
12 Gibbs Sampler Sample site from one seq based on sites from other seqs 3131 4141 5151 2121 Modified 1 estimated with counts 51
52
How to Sample? 52 Pos123456789 Score3112589126 SubT3416212938394147 Rand(subtotal) = X Find the first position with subtotal larger than X Pos123456789 Score311258950026 SubT3416212938538540546
53
Gibbs Sampler Repeat the process until motif convergesRepeat the process until motif converges 1 Without 21 Segment 3131 4141 5151 12 2121 53
54
Gibbs Sampler Intuition Beginning: –Randomly initialized motif –No preference towards any segment 54
55
Gibbs Sampler Intuition Motif appears: –Motif should have enriched signal (more sites) –By chance some correct sites come to alignment –Sites bias motif to attract other similar sites 55
56
Gibbs Sampler Intuition Motif converges: –All sites come to alignment –Motif totally biased to sample sites every time 56
57
11 22 33 44 55 Gibbs Sampler 3i 4i4i 5i 2i 1i Column shift Metropolis algorithm: –Propose * as shifted 1 column to left or right –Calculate motif score u( ) and u( *) –Accept * with prob = min(1, u( *) / u( )) 57
58
Gibbs Sampling Derivatives Gibbs Motif Sampler (JS Liu) –Add prior probability to allow 0-n site / seq –Sample motif positions to consider AlignACE (F Roth) –Look for motifs from both strands –Mask out one motif to find more different motifs BioProspector (XS Liu) –Use background model with Markov dependencies –Sampling with threshold (0-n sites / seq), new scoring function –Can find two-block motifs with variable gap 58
59
Scoring Motifs Information Content (also known as relative entropy) –Suppose you have x aligned segments for the motif –pb(s 1 from mtf) / pb(s 1 from bg) * pb(s 2 from mtf) / pb(s 2 from bg) *… pb(s x from mtf) / pb(s x from bg) Pos 12345678 ATGGCATG AGGGTGCG ATCGCATG TTGCCACG ATGGTATT ATTGCACG AGGGCGTT ATGACATG ATGGCATG ACTGGATG Motif Matrix Sites Segment ATGCAGCT score = p(generate ATGCAGCT from motif matrix) p(generate ATGCAGCT from background) p 0 A p 0 T p 0 G p 0 C p 0 A p 0 G p 0 C p 0 T 59
60
Scoring Motifs Information Content (also known as relative entropy) –Suppose you have x aligned segments for the motif –pb(s 1 from mtf) / pb(s 1 from bg) * pb(s 2 from mtf) / pb(s 2 from bg) *… pb(s x from mtf) / pb(s x from bg) Pos 12345678 ATGGCATG AGGGTGCG ATCGCATG TTGCCACG ATGGTATT ATTGCACG AGGGCGTT ATGACATG ATGGCATG ACTGGATG Motif Matrix Sites Segment ATGCAGCT score = p(generate ATGCAGCT from motif matrix) p(generate ATGCAGCT from background) p 0 A p 0 T p 0 G p 0 C p 0 A p 0 G p 0 C p 0 T 60
61
Scoring Motifs pb(s 1 from mtf) / pb(s 1 from bg) * pb(s 2 from mtf) / pb(s 2 from bg) *… pb(s x from mtf) / pb(s x from bg) = (p A1 /p A0 ) A1 (p T1 /p T0 ) T1 (p T2 /p T0 ) T2 (p G2 /p G0 ) G2 (p C2 /p C0 ) C2 … Take log of this: = A1 log (p A1 /p A0 ) + T1 log (p T1 /p T0 ) + T2 log (p T2 /p T0 ) + G2 log (p G2 /p G0 ) + … Divide by the number of segments (if all the motifs have same number of segments) = p A1 log (p A1 /p A0 ) + p T1 log (p T1 /p T0 ) + p T2 log (p T2 /p T0 )… Pos 12345678 ATGGCATG AGGGTGCG ATCGCATG TTGCCACG ATGGTATT ATTGCACG AGGGCGTT ATGACATG ATGGCATG ACTGGATG 61
62
Scoring Motifs Original function: Information Content = Motif Conservedness: How likely to see the current aligned segments from this motif model Good ATGCA ATGCC ATGCA TTGCA ATGGA ATGCA Bad AGGCA ATCCC GCGCA CGGTA TGCCA ATGGT TTGAA 62
63
Scoring Motifs Original function: Information Content Motif Specificity: How likely to see the current aligned segments from background = Good AGTCC Bad ATAAA 63
64
Scoring Motifs Original function: Information Content Which is better? (data = 8 seqs) = Motif 1 AGGCTAAC Motif 2 AGGCTAAC AGGCTACC AGGCTAAC AGCCTAAC AGGCCAAC AGGCTAAC TGGCTAAC AGGCTTAC AGGCTAAC AGGGTAAC 64
65
Scoring Motifs Motif scoring function: Prefer: conserved motifs with many sites, but are not often seen in the genome background Motif Signal Abundant Positions Conserved Specific (unlikely in genome background) 65
66
Markov Background Increases Motif Specificity Prefers motif segments enriched only in data, but not so likely to occur in the background Segment ATGTA score = p(generate ATGTA from ) p(generate ATGTA from 0 ) 3rd order Markov dependency p( ) TCAGC =.25 .25 .25 .25 .25.3 .18 .16 .22 .24 ATATA =.25 .25 .25 .25 .25.3 .41 .38 .42 .30 66
67
Position Weight Matrix Update Advantage –Can look for motifs of any widths –Flexible with base substitutions Disadvantage: –EM and Gibbs sampling: no guaranteed convergence time –No guaranteed global optimum 67
68
Motif Finding in Bacteria Promoter sequences are short (200-300 bp) Motif are usually long (10-20 bases) –Some have two blocks with a gap, some are palindromes –Long motifs are usually very degenerate Single microarray experiment sometimes already provides enough information to search for TF motifs 68
69
Motif Finding in Lower Eukaryotes Upstream sequences longer (500-1000 bp), with some simple repeats Motif width varies (5 – 17 bases) Expression clusters provide decent input sequences quality for TF motif finding Motif combination and redundancy appears, although single motifs are usually significant enough for identification 69
70
Yeast Promoter Architecture Co-occurring regulators suggest physical interaction between the regulators 70
71
Motif Finding in Higher Eukaryotes Upstream sequences very long (3KB-20KB) with repeats, TF motif could appear downstream Motifs can be short or long (6-20 bases), and appear in combination and clusters Gene expression cluster not good enough input Need: –Comparative Genomics: phastcons score –Motif modules: motif clusters –ChIP-chip/seq 71
72
72 Yeast Regulatory Sequence Conservation
73
73 UCSC PhastCons Conservation Functional regulatory sequences are under stronger evolutionary constraint Align orthologous sequences together PhastCons conservation score (0 – 1) for each nucleotide in the genome can be downloaded from UCSC
74
74 Conserved Motif Clusters First find conserved regions in the genome Then look for repeated transcription factors (TF) binding sites They form transcription factor modules
75
Summary Biology and challenge of transcription regulation Scan for known TF motif sites: TRANSFAC & JASPAR De novo method –Regular expression enumeration Oligonucleotide analysis MobyDick: build long motifs from short ones –Position weight matrix update CONSENSUS (sequence order) EM (iterate , ; ~ weighted average) Gibbs Sampler (sample , ; Markov chain convergence) Motif score and Markov background Motif cluster and motif conservation 75
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.