Published by Melvin Robinson. Modified over 9 years ago.
1
Motif Finding: PSSMs, Expectation Maximization, Gibbs Sampling
2
Complexity of Transcription
3
Representing Binding Sites for a TF
A single site: AAGTTAATGA
A set of sites represented as a consensus: VDRTWRWWSHD (IUPAC degenerate DNA)
A matrix describing a set of sites:
A  14 16  4  0  1 19 20  1  4 13  4  4 13 12  3
C   3  0  0  0  0  0  0  0  7  3  1  0  3  1 12
G   4  3 17  0  0  2  0  0  9  1  3  0  5  2  2
T   0  2  0 21 20  0  1 20  1  4 13 17  0  6  4
Set of binding sites:
AAGTTAATGA CAGTTAATAA GAGTTAAACA CAGTTAATTA GAGTTAATAA CAGTTATTCA
GAGTTAATAA CAGTTAATCA AGATTAAAGA AAGTTAACGA AGGTTAACGA ATGTTGATGA
AAGTTAATGA AAGTTAACGA AAATTAATGA GAGTTAATGA AAGTTAATCA AAGTTGATGA
AAATTAATGA ATGTTAATGA AAGTAAATGA AAGTTAATGA AAATTAATGA AAGTTAATGA
4
Nucleic acid codes
Code  Description
A     Adenine
C     Cytosine
G     Guanine
T     Thymine
U     Uracil
R     Purine (A or G)
Y     Pyrimidine (C, T, or U)
M     C or A
K     T, U, or G
W     T, U, or A
S     C or G
B     C, T, U, or G (not A)
D     A, T, U, or G (not C)
H     A, T, U, or C (not G)
V     A, C, or G (not T, not U)
N     Any base (A, C, G, T, or U)
5
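From frequencies to log scores
f matrix:
    1    2    3    4    5
A   5    0    1    0    0
C   0    2    2    4    0
G   0    3    1    0    4
T   0    0    1    1    1
w matrix:
    1    2    3    4    5
A   1.6 -1.7 -0.2 -1.7 -1.7
C  -1.7  0.5  0.5  1.3 -1.7
G  -1.7  1.0 -0.2 -1.7  1.3
T  -1.7 -1.7 -0.2 -0.2 -0.2
Conversion from f to w: $w(b,i) = \log\left(\frac{f(b,i) + s(N)}{p(b)}\right)$
Example: score of TGCTG = -1.7 + 1.0 + 0.5 - 0.2 + 1.3 = 0.9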
6
TFs do not act alone http://www.bioinformatics.ca/
7
PSSMs for Liver TFs… HNF1 C/EBP HNF3 HNF4
8
PSSMs for Helix-Turn-Helix Motif
9
Promoter…
10
Promoter Weight Matrices (PWM)
11
E. coli PWMs
12
Motif Logo
- Motifs can mutate on less important bases.
- The five motifs shown here have mutations in positions 3 and 5.
- Representations called motif logos illustrate the conserved regions of a motif.
Position: 1234567
          TGGGGGA
          TGAGAGA
          TGGGGGA
          TGAGAGA
          TGAGGGA
http://weblogo.berkeley.edu
http://fold.stanford.edu/eblocks/acsearch.html
13
Example: Calmodulin-Binding Motif (calcium-binding proteins)
14
Sequence Motifs http://webcourse.cs.technion.ac.il/236523/Winter2005-2006/en/ho_Lectures.html
15
Regulatory Motifs
- Transcription factors bind to regulatory motifs.
- Motifs are 6 to 20 nucleotides long.
- Transcription factors act as activators and repressors.
- Motifs are usually located near the target gene, mostly upstream.
16
Challenges
- How to recognize a regulatory motif?
- Can we identify new occurrences of known motifs in genome sequences?
- Can we discover new motifs within upstream sequences of genes?
17
Motif Representation
- Exact motif: CGGATATA
- Consensus: represent only deterministic nucleotides.
- Example: HAP1 binding sites in 5 sequences.
CGGATATACCGG
CGGTGATAGCGG
CGGTACTAACGG
CGGCGGTAACGG
CGGCCCTAACGG
------------
CGGNNNTANCGG
- Consensus motif: CGGNNNTANCGG, where N stands for any nucleotide.
- Representing only the consensus loses information. How can this be avoided?
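The consensus rule above can be sketched in a few lines of Python (the deck itself uses MATLAB later on; Python is used here only for illustration). The function name `consensus` is hypothetical, and the rule assumed is the one on the slide: keep a base only when every site agrees, otherwise write N.

```python
def consensus(sites):
    """Return the consensus of equal-length sites, with N at variable columns."""
    out = []
    for column in zip(*sites):          # iterate over alignment columns
        bases = set(column)
        out.append(column[0] if len(bases) == 1 else "N")
    return "".join(out)

# The five HAP1 sites from the slide
hap1_sites = [
    "CGGATATACCGG",
    "CGGTGATAGCGG",
    "CGGTACTAACGG",
    "CGGCGGTAACGG",
    "CGGCCCTAACGG",
]
print(consensus(hap1_sites))  # CGGNNNTANCGG
```

Every variable column collapses to N regardless of how skewed its base counts are, which is the information loss the slide points out.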
18
PSPM – Position Specific Probability Matrix
- Represents a motif of length k (here k = 5).
- Count the number of occurrences of each nucleotide in each position.
     1    2    3    4    5
A   10   25    5   70   60
C   30   25   80   10   15
T   50   25    5   10    5
G   10   25   10   10   20
19
PSPM – Position Specific Probability Matrix
- Defines Pi{A,C,G,T} for i={1,..,k}.
- Pi(A): frequency of nucleotide A in position i.
     1     2     3     4     5
A   0.1   0.25  0.05  0.7   0.6
C   0.3   0.25  0.8   0.1   0.15
T   0.5   0.25  0.05  0.1   0.05
G   0.1   0.25  0.1   0.1   0.2
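Going from the count matrix to the probability matrix is a column-wise normalization. A minimal Python sketch follows; the name `to_frequencies` is illustrative, and the G row of the counts is an assumption, inferred so that every column sums to 100.

```python
# Counts per base and position; the G row is inferred (assumption) so that
# each column sums to 100.
counts = {
    "A": [10, 25, 5, 70, 60],
    "C": [30, 25, 80, 10, 15],
    "T": [50, 25, 5, 10, 5],
    "G": [10, 25, 10, 10, 20],
}

def to_frequencies(counts):
    """Divide every count by the total of its column."""
    k = len(next(iter(counts.values())))
    totals = [sum(counts[b][i] for b in counts) for i in range(k)]
    return {b: [counts[b][i] / totals[i] for i in range(k)] for b in counts}

pspm = to_frequencies(counts)
print(pspm["A"])  # [0.1, 0.25, 0.05, 0.7, 0.6]
```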
20
Identification of Known Motifs within Genomic Sequences
Motivation:
- Identify new genes controlled by the same TF.
- Infer the function of these genes.
- Enable better understanding of the regulation mechanism.
21
PSPM – Position Specific Probability Matrix
     1     2     3     4     5
A   0.1   0.25  0.05  0.7   0.6
C   0.3   0.25  0.8   0.1   0.15
T   0.5   0.25  0.05  0.1   0.05
G   0.1   0.25  0.1   0.1   0.2
- Each k-mer is assigned a probability.
- Example: P(TCCAG) = 0.5*0.25*0.8*0.7*0.2 = 0.014
22
Detecting a Known Motif within a Sequence using PSPM
     1     2     3     4     5
A   0.1   0.25  0.05  0.7   0.6
C   0.3   0.25  0.8   0.1   0.15
T   0.5   0.25  0.05  0.1   0.05
G   0.1   0.25  0.1   0.1   0.2
- The PSPM is moved along the query sequence.
- At each position the sub-sequence is scored for a match to the PSPM.
- Example: sequence = ATGCAAGTCT…
23
Detecting a Known Motif within a Sequence using PSPM
     1     2     3     4     5
A   0.1   0.25  0.05  0.7   0.6
C   0.3   0.25  0.8   0.1   0.15
T   0.5   0.25  0.05  0.1   0.05
G   0.1   0.25  0.1   0.1   0.2
- The PSPM is moved along the query sequence.
- At each position the sub-sequence is scored for a match to the PSPM.
- Example: sequence = ATGCAAGTCT…
- Position 1: ATGCA: 0.1*0.25*0.1*0.1*0.6 = 1.5*10^-4
24
Detecting a Known Motif within a Sequence using PSPM
     1     2     3     4     5
A   0.1   0.25  0.05  0.7   0.6
C   0.3   0.25  0.8   0.1   0.15
T   0.5   0.25  0.05  0.1   0.05
G   0.1   0.25  0.1   0.1   0.2
- The PSPM is moved along the query sequence.
- At each position the sub-sequence is scored for a match to the PSPM.
- Example: sequence = ATGCAAGTCT…
- Position 1: ATGCA: 0.1*0.25*0.1*0.1*0.6 = 1.5*10^-4
- Position 2: TGCAA: 0.5*0.25*0.8*0.7*0.6 = 0.042
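The sliding-window scan described above can be sketched in Python as follows. The function name `window_probabilities` is illustrative, the G entry at position 4 (0.1) is an assumption chosen so the column sums to 1, and only the ten bases shown on the slide are scanned because the rest of the sequence is elided there.

```python
from math import prod

PSPM = {
    "A": [0.1, 0.25, 0.05, 0.7, 0.6],
    "C": [0.3, 0.25, 0.8, 0.1, 0.15],
    "T": [0.5, 0.25, 0.05, 0.1, 0.05],
    "G": [0.1, 0.25, 0.1, 0.1, 0.2],   # position 4 inferred (assumption)
}

def window_probabilities(sequence, pspm):
    """Probability of every length-k window under the PSPM."""
    k = len(pspm["A"])
    return [
        prod(pspm[base][i] for i, base in enumerate(sequence[start:start + k]))
        for start in range(len(sequence) - k + 1)
    ]

probs = window_probabilities("ATGCAAGTCT", PSPM)
for position, p in enumerate(probs, start=1):
    print(position, p)
```

The first two values reproduce the slide's positions 1 and 2 (about 1.5*10^-4 and 0.042).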
25
Detecting a Known Motif within a Sequence using PSSM
- Is it a random match, or is it indeed an occurrence of the motif?
- PSPM -> PSSM (Position Specific Scoring Matrix)
- Odds score matrix: Oi(n), where n ∈ {A,C,G,T} and i={1,..,k}, is defined as Pi(n)/P(n), where P(n) is the background frequency of n.
- As Oi(n) increases, the odds that n at position i is part of a real motif are higher.
26
PSSM as Odds Score Matrix
Assumption: the background frequency of each nucleotide is 0.25.
Original PSPM (Pi):
     1      2      3      4      5
A   0.1    0.25   0.05   0.7    0.6
Odds Matrix (Oi):
     1      2      3      4      5
A   0.4    1      0.2    2.8    2.4
Going to log scale we get an additive score, the Log odds Matrix (log2 Oi):
     1      2      3      4      5
A  -1.322  0     -2.322  1.485  1.263
27
Calculating using Log Odds Matrix
     1      2      3      4      5
A  -1.32   0     -2.32   1.48   1.26
C   0.26   0      1.68  -1.32  -0.74
T   1      0     -2.32  -1.32  -2.32
G  -1.32   0     -1.32  -1.32  -0.32
A log-odds score of 0 implies a random match; a score > 0 implies a real match (?).
Example: sequence = ATGCAAGTCT…
Position 1: ATGCA: -1.32 + 0 - 1.32 - 1.32 + 1.26 = -2.7; odds = 2^-2.7 = 0.15
Position 2: TGCAA: 1 + 0 + 1.68 + 1.48 + 1.26 = 5.42; odds = 2^5.42 = 42.8
28
Calculating the probability of a match
Sequence: ATGCAAG
Position 1: ATGCA = 0.15
Position 2: TGCAA = 42.8
Position 3: GCAAG = 0.18
P(i) = S / (∑ S), where S is the odds score at position i
Example: 0.15 / (0.15 + 42.8 + 0.18) = 0.003
P(1) = 0.003
P(2) = 0.992
P(3) = 0.004
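The whole chain (PSPM, log-odds matrix, per-window odds, normalized probabilities) can be sketched in Python as below. The names are illustrative, the uniform background of 0.25 follows the slides, and the G value at position 4 is an assumption. Because the code works with unrounded log-odds values, its results differ slightly from the hand calculation, which rounds each matrix entry to two decimals.

```python
from math import log2

PSPM = {
    "A": [0.1, 0.25, 0.05, 0.7, 0.6],
    "C": [0.3, 0.25, 0.8, 0.1, 0.15],
    "T": [0.5, 0.25, 0.05, 0.1, 0.05],
    "G": [0.1, 0.25, 0.1, 0.1, 0.2],   # position 4 inferred (assumption)
}
BACKGROUND = 0.25  # uniform background, as assumed on the slides

# Log-odds matrix: log2(Pi(n) / P(n))
LOG_ODDS = {b: [log2(p / BACKGROUND) for p in row] for b, row in PSPM.items()}

def site_probabilities(sequence, k=5):
    """Normalize the odds of each window so they sum to 1 over the sequence."""
    scores = [
        sum(LOG_ODDS[base][i] for i, base in enumerate(sequence[s:s + k]))
        for s in range(len(sequence) - k + 1)
    ]
    odds = [2 ** s for s in scores]      # back from log2 scale to odds
    total = sum(odds)
    return [o / total for o in odds]

probs = site_probabilities("ATGCAAG")
print([round(p, 3) for p in probs])
```

Position 2 (TGCAA) takes more than 99% of the probability mass, in line with the slide.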
29
Building a PSSM
- Collect all known sequences that bind a certain TF.
- Align all sequences (using multiple sequence alignment).
- Compute the frequency of each nucleotide in each position (PSPM).
- Incorporate the background frequency for each nucleotide (PSSM).
30
Finding new Motifs
- We are given a group of genes, which presumably contain a common regulatory motif.
- We know nothing of the TF that binds to the putative motif.
- The problem: discover the motif.
31
Example Predicting the cAMP Receptor Protein (CRP) binding site motif
32
Extract experimentally defined CRP Binding Sites
GGATAACAATTTCACA
AGTGTGTGAGCGGATAACAA
AAGGTGTGAGTTAGCTCACTCCCC
TGTGATCTCTGTTACATAG
ACGTGCGAGGATGAGAACACA
ATGTGTGTGCTCGGTTTAGTTCACC
TGTGACACAGTGCAAACGCG
CCTGACGGAGTTCACA
AATTGTGAGTGTCTATAATCACG
ATCGATTTGGAATATCCATCACA
TGCAAAGGACGTCACGATTTGGG
AGCTGGCGACCTGGGTCATG
TGTGATGTGTATCGAACCGTGT
ATTTATTTGAACCACATCGCA
GGTGAGAGCCATCACAG
GAGTGTGTAAGCTGTGCCACG
TTTATTCCATGTCACGAGTGT
TGTTATACACATCACTAGTG
AAACGTGCTCCCACTCGCA
TGTGATTCGATTCACA
33
Create a Multiple Sequence Alignment
GGATAACAATTTCACA
TGTGAGCGGATAACAA
TGTGAGTTAGCTCACT
TGTGATCTCTGTTACA
CGAGGATGAGAACACA
CTCGGTTTAGTTCACC
TGTGACACAGTGCAAA
CCTGACGGAGTTCACA
AGTGTCTATAATCACG
TGGAATATCCATCACA
TGCAAAGGACGTCACG
GGCGACCTGGGTCATG
TGTGATGTGTATCGAA
TTTGAACCACATCGCA
GGTGAGAGCCATCACA
TGTAAGCTGTGCCACG
TTTATTCCATGTCACG
TGTTATACACATCACT
CGTGCTCCCACTCGCA
TGTGATTCGATTCACA
34
Generate a PSSM
Pos      A       C       G       T
1      -0.43    0.1    -0.46    0.55
2       1.37    0.12   -1.59  -11.2
3       1.69   -1.28  -11.2    -1.43
4      -1.28    0.12  -11.2     1.32
5       0.91  -11.2    -0.46    0.47
6       1.53   -1.38   -1.48   -1.43
7       0.9    -0.48  -11.2     0.12
8      -1.37   -1.28  -11.2     1.68
9     -11.2     …       1.73   -0.56
10    -11.2    -0.51  -11.2     1.72
11     -0.48  -11.2     1.72  -11.2
12      1.56   -1.59  -11.2    -0.46
13     -0.51   -0.38   -0.55    0.88
14    -11.2     0.5     0.57    0.13
15      0.17   -0.51    0.12    …
16      0.9   -11.2     0.5    -0.48
17      0.17    0.16    0.06   -0.48
18     -0.4    -0.38    0.82   -0.48
19     -1.38   -1.28  -11.2     1.68
20     -1.48    1.7   -11.2    -1.38
21      1.5    -1.38   -1.43   -1.28
35
Shannon Entropy
- The expected variation per column can be calculated.
- Low entropy means higher conservation.
36
Entropy
The entropy (H) for a column is:
$H = -\sum_a f_a \log_2 p_a$
where a is a residue, f_a is the frequency of residue a in the column, and p_a is the probability of residue a in that column.
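A minimal Python sketch of per-column Shannon entropy follows. It assumes the probabilities are estimated directly from the observed frequencies in the column (so the frequency and probability of a residue coincide) and that logs are taken base 2, giving bits; `column_entropy` is an illustrative name.

```python
from collections import Counter
from math import log2

def column_entropy(column):
    """Shannon entropy (in bits) of one alignment column."""
    n = len(column)
    counts = Counter(column)
    # Sum of -p * log2(p) over the residues that actually occur
    return sum(-(c / n) * log2(c / n) for c in counts.values())

print(column_entropy("AAAA"))  # fully conserved column: 0.0
print(column_entropy("AACC"))  # two equally likely residues: 1.0
print(column_entropy("ACGT"))  # all four bases equally likely: 2.0
```

Lower values indicate stronger conservation, as stated on the Shannon entropy slide.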
37
Entropy
- Entropy measures can determine which evolutionary distance (PAM250, BLOSUM80, etc.) should be used.
- Entropy yields the amount of information per column (discussed with sequence logos in a bit).
38
Log-odds score
- Profiles can also indicate a log-odds score: log2(observed:expected)
- The result is a bit score.
39
Matlab multialign
1. Enter an array of sequences.
seqs = {'CACGTAACATCTC','ACGACGTAACATCTTCT','AAACGTAACATCTCGC'};
2. Promote terminations with gaps in the alignment.
multialign(seqs,'terminalGapAdjust',true)
ans =
--CACGTAACATCTC--
ACGACGTAACATCTTCT
-AAACGTAACATCTCGC
40
Matlab multialign
3. Compare alignment without termination gap adjustment.
multialign(seqs)
ans =
CA--CGTAACATCT--C
ACGACGTAACATCTTCT
AA-ACGTAACATCTCGC
41
Matlab
>> a={'ATATAGGAG','AATTATAGA','TTAGAGAAA'}
a =
'ATATAGGAG' 'AATTATAGA' 'TTAGAGAAA'
42
Char function
>> cseq=char(a)
cseq =
ATATAGGAG
AATTATAGA
TTAGAGAAA
43
Double function
>> intseq=double(cseq)
intseq =
65 84 65 84 65 71 71 65 71
65 65 84 84 65 84 65 71 65
84 84 65 71 65 71 65 65 65
44
double
>> double('A')
ans = 65
>> double('C')
ans = 67
>> double('G')
ans = 71
>> double('T')
ans = 84
45
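Initiate PSPM matrix
>> % length returns the largest dimension (9 columns here); size(intseq,2) is safer when there are more sequences than columns
>> Pspm=zeros(4,length(intseq))
Pspm =
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0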
46
Use a for loop to count each nucleotide at each position
>> for i = 1:length(intseq)
Pspm(1,i)=length(find(intseq(:,i)==65)); % A
Pspm(2,i)=length(find(intseq(:,i)==67)); % C
Pspm(3,i)=length(find(intseq(:,i)==71)); % G
Pspm(4,i)=length(find(intseq(:,i)==84)); % T
end
>> Pspm
Pspm =
2 1 2 0 3 0 2 2 2
0 0 0 0 0 0 0 0 0
0 0 0 1 0 2 1 1 1
1 2 1 2 0 1 0 0 0
47
Add pseudocounts
>> Pspmp=Pspm+1
Pspmp =
3 2 3 1 4 1 3 3 3
1 1 1 1 1 1 1 1 1
1 1 1 2 1 3 2 2 2
2 3 2 3 1 2 1 1 1
48
Normalize to get frequencies
>> Pspmnorm=Pspmp./repmat(sum(Pspmp),4,1)
Pspmnorm =
Columns 1 through 7
0.4286 0.2857 0.4286 0.1429 0.5714 0.1429 0.4286
0.1429 0.1429 0.1429 0.1429 0.1429 0.1429 0.1429
0.1429 0.1429 0.1429 0.2857 0.1429 0.4286 0.2857
0.2857 0.4286 0.2857 0.4286 0.1429 0.2857 0.1429
Columns 8 through 9
0.4286 0.4286
0.1429 0.1429
0.2857 0.2857
0.1429 0.1429
49
Calculate odds score
>> Pswm=Pspmnorm/0.25
Pswm =
Columns 1 through 7
1.7143 1.1429 1.7143 0.5714 2.2857 0.5714 1.7143
0.5714 0.5714 0.5714 0.5714 0.5714 0.5714 0.5714
0.5714 0.5714 0.5714 1.1429 0.5714 1.7143 1.1429
1.1429 1.7143 1.1429 1.7143 0.5714 1.1429 0.5714
Columns 8 through 9
1.7143 1.7143
0.5714 0.5714
1.1429 1.1429
0.5714 0.5714
50
Log odds ratio
>> logPswm=log2(Pswm)
logPswm =
Columns 1 through 7
0.7776 0.1926 0.7776 -0.8074 1.1926 -0.8074 0.7776
-0.8074 -0.8074 -0.8074 -0.8074 -0.8074 -0.8074 -0.8074
-0.8074 -0.8074 -0.8074 0.1926 -0.8074 0.7776 0.1926
0.1926 0.7776 0.1926 0.7776 -0.8074 0.1926 -0.8074
Columns 8 through 9
0.7776 0.7776
-0.8074 -0.8074
0.1926 0.1926
-0.8074 -0.8074
51
Estimate the probability that the given sequence belongs to the defined PSWM
>> Unknown='TTAAGAAGG'
Unknown =
TTAAGAAGG
>> intunknown=double(Unknown)
intunknown =
84 84 65 65 71 65 65 71 71
52
Get the row index in the PSWM for each base of the unknown sequence
>> % logical indexing works on the whole vector at once, so no for loop is needed
>> intunknown(intunknown==65)=1; % A -> row 1
>> intunknown(intunknown==67)=2; % C -> row 2
>> intunknown(intunknown==71)=3; % G -> row 3
>> intunknown(intunknown==84)=4; % T -> row 4
>> intunknown
intunknown =
4 4 1 1 3 1 1 3 3
53
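Calculate the log odds-ratio of the Unknown 'TTAAGAAGG'
>> % logPswm(intunknown) alone uses linear indexing and reads only column 1;
>> % pair each row index with its own column position instead
>> idx=sub2ind(size(logPswm),intunknown,1:length(intunknown));
>> logunknown=logPswm(idx)
logunknown =
Columns 1 through 7
0.1926 0.7776 0.7776 -0.8074 -0.8074 -0.8074 0.7776
Columns 8 through 9
0.1926 0.1926
>> Punknown=sum(logunknown)
Punknown =
0.4887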
54
Is this a significant score or just random similarity?
>> cseq
cseq =
ATATAGGAG
AATTATAGA
TTAGAGAAA
>> Unknown
Unknown =
TTAAGAAGG
55
What would be the maximum score?
>> logPswm
logPswm =
Columns 1 through 7
0.7776 0.1926 0.7776 -0.8074 1.1926 -0.8074 0.7776
-0.8074 -0.8074 -0.8074 -0.8074 -0.8074 -0.8074 -0.8074
-0.8074 -0.8074 -0.8074 0.1926 -0.8074 0.7776 0.1926
0.1926 0.7776 0.1926 0.7776 -0.8074 0.1926 -0.8074
Columns 8 through 9
0.7776 0.7776
-0.8074 -0.8074
0.1926 0.1926
-0.8074 -0.8074
>> maxscore=max(logPswm)
maxscore =
Columns 1 through 7
0.7776 0.7776 0.7776 0.7776 1.1926 0.7776 0.7776
Columns 8 through 9
0.7776 0.7776
>> totalmaxscore=sum(maxscore)
totalmaxscore =
7.4135
56
Write a function using the above statements to scan a sequence
- Write a function named 'logodds' that calculates the log-odds ratio matrix of a given alignment.
- Write a function named 'scanmotif' that calls 'logodds' and searches through a sequence with a sliding window, calculating the log-odds score of each subsequence and storing these scores.
- The function should allow selection of a maximum number of locations that are likely to contain the motif, based on the scores obtained.
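One possible shape for these two functions, sketched in Python rather than MATLAB so that it can be run anywhere. This is an illustration under stated assumptions, not the reference solution: a pseudocount of 1 per base and a uniform background of 0.25 are taken from the earlier steps, and the parameter names (`pseudocount`, `background`, `max_hits`) are hypothetical.

```python
from math import log2

BASES = "ACGT"

def logodds(alignment, pseudocount=1, background=0.25):
    """Return one {base: log2 odds} dict per alignment column."""
    matrix = []
    for column in zip(*alignment):
        total = len(column) + pseudocount * len(BASES)
        matrix.append({
            b: log2((column.count(b) + pseudocount) / total / background)
            for b in BASES
        })
    return matrix

def scanmotif(alignment, sequence, max_hits=3):
    """Score every window of `sequence`; return the best (start, score) pairs."""
    matrix = logodds(alignment)
    k = len(matrix)
    hits = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        hits.append((start, sum(col[b] for col, b in zip(matrix, window))))
    hits.sort(key=lambda hit: hit[1], reverse=True)  # best scores first
    return hits[:max_hits]

alignment = ["ATATAGGAG", "AATTATAGA", "TTAGAGAAA"]
print(scanmotif(alignment, "TTAAGAAGG", max_hits=1))
```

Each base is looked up in its own column, so the score of TTAAGAAGG is the sum of nine position-specific entries; start positions are 0-based here, unlike MATLAB's 1-based indexing.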
57
Position Specific Scoring Matrix (PSSM)
- PSSMs incorporate information theory to indicate the information contained within each column of a multiple alignment.
- Information is a logarithmic transformation of the frequency of each residue in the motif.
58
PSSMs and Pseudocounts
Problem:
- PSSMs are only as good as the initial MSA.
- Some residues may be underrepresented.
- Other columns may be too conserved.
Solution:
- Introduce pseudocounts to get a better indication.
59
Pseudocounts
New estimated probability:
$P_{ca} = \frac{n_{ca} + b_{ca}}{N_c + B_c}$
- P_ca: probability of residue a in column c
- n_ca: count of a's in column c
- b_ca: pseudocount of a's in column c
- N_c: total count in column c
- B_c: total pseudocount in column c
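As a small numeric check, the sketch below assumes the usual form of this estimate, (n_ca + b_ca) / (N_c + B_c), and applies it to column 1 of the MATLAB example: two A's among three sequences, with a pseudocount of 1 for each of the four bases. The function name is illustrative.

```python
def pseudocount_probability(n_ca, b_ca, N_c, B_c):
    """Estimated probability of residue a in column c, with pseudocounts."""
    return (n_ca + b_ca) / (N_c + B_c)

# Column 1 of the MATLAB example: 2 A's, 0 C's, 3 sequences, pseudocount 1 per base
p_a = pseudocount_probability(n_ca=2, b_ca=1, N_c=3, B_c=4)
p_c = pseudocount_probability(n_ca=0, b_ca=1, N_c=3, B_c=4)
print(round(p_a, 4), round(p_c, 4))  # 0.4286 0.1429
```

The unseen base C gets a small nonzero probability instead of 0, which keeps the later log2 step finite.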
60
PSSMs and pseudocounts
Probabilities are converted into a log-odds form (usually log2, so the information can be reported in bits) and placed in the PSSM.
61
Searching PSSMs
- The value for the first residue in the sequence occurring in the first column is found by looking it up in the PSSM.
- The value for the residue occurring in each subsequent column is found in the same way.
62
Searching PSSMs
- The values are added (since they are logarithms) to produce a summed log-odds score, S.
- S can be converted to an odds score using the formula 2^S.
- Odds scores for each position can be summed together and normalized to produce a probability of the motif occurring at each location.
63
Information in PSSMs
- Information theory: the amount of information contained within each sequence.
- No information: the amount of uncertainty can be measured as log2 20 = 4.32 for amino acids, since there are 20 amino acids.
- For nucleic acid sequences, the amount of uncertainty can be measured as log2 4 = 2.
64
Information in PSSMs
- If a column is completely conserved, the uncertainty is 0: there is only one choice.
- If two residues occur with equal probability, there is log2 2 = 1 bit of uncertainty in deciding which residue it is.
65
Measure of Uncertainty
Measured as the entropy of column c:
$H_c = -\sum_a P_{ca} \log_2 P_{ca}$
66
Relative Entropy
- Relative entropy takes into account the overall composition of the organism being studied:
$R_c = \sum_a P_{ca} \log_2 \frac{P_{ca}}{B_a}$
- B_a is the background frequency of residue a in the organism.
67
PSSM Uncertainty
Uncertainty for the whole model is summed over all columns:
$H = -\sum_c \sum_a P_{ca} \log_2 P_{ca}$
68
Sequence Logos
- Information in PSSMs can be viewed visually.
- Sequence logos illustrate the information in each column of a motif.
- The height of the logo at each column is calculated as the amount by which uncertainty has been decreased.
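The logo height of a column can be sketched in Python as the maximum uncertainty for DNA (log2 4 = 2 bits) minus the column's entropy. This is a simplified illustration: it uses raw frequencies with no pseudocounts and no small-sample correction, which logo tools may apply, and `column_information` is a hypothetical name. The input is the five-motif example from the Motif Logo slide.

```python
from math import log2

def column_information(column, alphabet_size=4):
    """Bits of information in a column: log2(alphabet size) minus entropy."""
    n = len(column)
    entropy = sum(
        -(column.count(b) / n) * log2(column.count(b) / n) for b in set(column)
    )
    return log2(alphabet_size) - entropy

motifs = ["TGGGGGA", "TGAGAGA", "TGGGGGA", "TGAGAGA", "TGAGGGA"]
heights = [column_information(col) for col in zip(*motifs)]
for position, h in enumerate(heights, start=1):
    print(position, round(h, 3))
```

The fully conserved positions reach the full 2 bits, while positions 3 and 5, where G and A both occur, come out at roughly 1 bit.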
69
Sequence Logos
70
Statistical Methods
Commonly used methods for locating motifs:
- Expectation-Maximization (EM)
- Gibbs Sampling
71
Expectation-Maximization
- Begin with a set of sequences with an unknown signal in common.
- The signal may be subtle.
- The approximate length of the signal must be given.
- Randomly assign locations of this motif in each sequence.
72
Expectation-Maximization
Two steps:
- Expectation Step
- Maximization Step
73
Expectation-Maximization: Expectation Step
- Residue frequencies for each position are calculated.
- Residues not in a motif are treated as background.
- The frequencies are used to determine the probability of finding a site at any position in a sequence that fits the motif model.
74
Maximization Step
- Determine the location in each sequence that maximally aligns to the motif pattern.
- Once a new motif location is found for each sequence, the motif pattern is revised in the expectation step.
- E-M continues until the solution converges.
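The loop described on these slides can be sketched in Python as below. This is a simplified variant under stated assumptions, not MEME's algorithm: each sequence contributes exactly one site, the single best location is chosen rather than weighting all locations, the background is fixed at a uniform 0.25 instead of being estimated from non-motif residues, and a pseudocount of 1 is added. All function names and the toy sequences are hypothetical.

```python
import random
from math import log

BASES = "ACGT"

def build_profile(sequences, starts, width, pseudocount=1.0):
    """Column-wise base probabilities of the currently selected sites."""
    profile = []
    for j in range(width):
        column = [seq[s + j] for seq, s in zip(sequences, starts)]
        total = len(column) + pseudocount * len(BASES)
        profile.append({b: (column.count(b) + pseudocount) / total for b in BASES})
    return profile

def best_start(seq, profile, background=0.25):
    """Start position whose window has the highest log-odds score."""
    width = len(profile)

    def score(s):
        return sum(log(profile[j][seq[s + j]] / background) for j in range(width))

    return max(range(len(seq) - width + 1), key=score)

def hard_em(sequences, width, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Random initial motif location in each sequence
    starts = [rng.randrange(len(s) - width + 1) for s in sequences]
    for _ in range(max_iter):
        profile = build_profile(sequences, starts, width)          # frequencies
        new_starts = [best_start(s, profile) for s in sequences]   # best locations
        if new_starts == starts:                                   # converged
            break
        starts = new_starts
    return starts

# Toy sequences (hypothetical), each containing GATAAG
toy = ["TTTGATAAGCC", "CCGATAAGTTT", "AAGATAAGAAC"]
print(hard_em(toy, width=6))
```

Because the starting locations are random, the result can be a local optimum; rerunning with different `seed` values is a cheap way to check stability.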
75
TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT
CCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTG
TCGACCCTCTGAACCTATCAGGGACCACAGTCAGCCAGGCAAG
AAAACACTTGAGGGAGCAGATAACTGGGCCAACCATGACTC
GGGTGAATGGTACTGCTGATTACAACCTCTGGTGCTGC
AGCCTAGAGTGATGACTCCTATCTGGGTCCCCAGCAGGA
GCCTCAGGATCCAGCACACATTATCACAAACTTAGTGTCCA
CATTATCACAAACTTAGTGTCCATCCATCACTGCTGACCCT
TCGGAACAAGGCAAAGGCTATAAAAAAAATTAAGCAGC
GCCCCTTCCCCACACTATCTCAATGCAAATATCTGTCTGAAACGGTTCC
CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG
GATTGGTCACAGCATTTCAAGGGAGAGACCTCATTGTAAG
TCCCCAACTCCCAACTGACCTTATCTGTGGGGGAGGCTTTTGA
CCTTATCTGTGGGGGAGGCTTTTGAAAAGTAATTAGGTTTAGC
ATTATTTTCCTTATCAGAAGCAGAGAGACAAGCCATTTCTCTTTCCTCCCGGT
AGGCTATAAAAAAAATTAAGCAGCAGTATCCTCTTGGGGGCCCCTTC
CCAGCACACACACTTATCCAGTGGTAAATACACATCAT
TCAAATAGGTACGGATAAGTAGATATTGAAGTAAGGAT
ACTTGGGGTTCCAGTTTGATAAGAAAAGACTTCCTGTGGA
TGGCCGCAGGAAGGTGGGCCTGGAAGATAACAGCTAGTAGGCTAAGGCCAG
CAACCACAACCTCTGTATCCGGTAGTGGCAGATGGAAA
CTGTATCCGGTAGTGGCAGATGGAAAGAGAAACGGTTAGAA
GAAAAAAAATAAATGAAGTCTGCCTATCTCCGGGCCAGAGCCCCT
TGCCTTGTCTGTTGTAGATAATGAATCTATCCTCCAGTGACT
GGCCAGGCTGATGGGCCTTATCTCTTTACCCACCTGGCTGT
CAACAGCAGGTCCTACTATCGCCTCCCTCTAGTCTCTG
CCAACCGTTAATGCTAGAGTTATCACTTTCTGTTATCAAGTGGCTTCAGCTATGCA
GGGAGGGTGGGGCCCCTATCTCTCCTAGACTCTGTG
CTTTGTCACTGGATCTGATAAGAAACACCACCCCTGC
76
Residue Counts Given motif alignment, count for each location is calculated:
77
Residue Frequencies The counts are then converted to frequencies:
78
Example Maximization Step
Consider the first sequence:
TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT
There are 41 residues, so for a motif of width 6 there are 41-6+1 = 36 sites to consider.
79
MEME Software
One of three motif models:
- OOPS: One expected occurrence per sequence
- ZOOPS: Zero or one expected occurrence per sequence
- TCM: Any number of occurrences of the motif
80
Gibbs Sampling
- Similar to the E-M algorithm.
- Combines E-M and simulated annealing.
- Goal: find the most probable pattern by sampling from motif probabilities to maximize the ratio of model:background probabilities.
81
Predictive Update Step
- A random motif start position is chosen for all sequences except one.
- The initial alignment is used to calculate residue frequencies for motif and background.
- This is similar to the Expectation Step of EM.
82
Sampling Step
- The ratio of model:background probabilities is normalized and used as weights.
- A motif start position is chosen by random sampling with the given weights.
- This differs from the E-M algorithm.
83
Gibbs Sampling
- The process is repeated until the residue frequencies in each column do not change.
- The sampling step is then repeated for a different initial random alignment.
- Sampling allows escape from local maxima.
84
Gibbs Sampling
- Dirichlet priors (pseudocounts) are added into the nucleotide counts to improve performance.
- A shifting routine shifts the motif a few bases to the left or the right.
- A range of motif sizes is checked.
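The predictive update and sampling steps can be sketched in Python as follows. This is a bare-bones site sampler under stated assumptions, not the Wadsworth implementation: one site per sequence, a fixed motif width, a uniform background of 0.25 instead of one estimated from non-motif residues, a flat pseudocount in place of a tuned Dirichlet prior, a fixed number of iterations instead of a convergence check, and no shifting routine or restarts. The function name, parameters, and toy sequences are hypothetical.

```python
import random

BASES = "ACGT"

def gibbs_sampler(sequences, width, iterations=200, pseudocount=1.0, seed=0):
    rng = random.Random(seed)
    # Random initial motif start in every sequence
    starts = [rng.randrange(len(s) - width + 1) for s in sequences]
    for it in range(iterations):
        held_out = it % len(sequences)
        # Predictive update: profile from every sequence except the held-out one
        others = [
            (s, st)
            for i, (s, st) in enumerate(zip(sequences, starts))
            if i != held_out
        ]
        profile = []
        for j in range(width):
            column = [s[st + j] for s, st in others]
            total = len(column) + pseudocount * len(BASES)
            profile.append(
                {b: (column.count(b) + pseudocount) / total for b in BASES}
            )
        # Sampling step: weight each start by its model:background ratio
        seq = sequences[held_out]
        weights = []
        for st in range(len(seq) - width + 1):
            ratio = 1.0
            for j in range(width):
                ratio *= profile[j][seq[st + j]] / 0.25
            weights.append(ratio)
        # Draw the new start at random, in proportion to the weights
        starts[held_out] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

# Toy sequences (hypothetical), each containing GATAAG
toy = ["TTTGATAAGCC", "CCGATAAGTTT", "AAGATAAGAAC"]
print(gibbs_sampler(toy, width=6))
```

Drawing the start at random instead of always taking the best window is what lets the sampler leave a poor local alignment, the property highlighted on the slides.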
85
Gibbs Sampler Web Interface
http://bayesweb.wadsworth.org/gibbs/gibbs.html