Position Weight Matrices for Representing Signals in Sequences Triinu Tasa, Koke
Definitions
Sequence, string – ordered arrangement of letters {'A', 'C', 'G', 'T'}
Pattern – simplified regular expression over the alphabet {'A', 'C', 'G', 'T', '.'}, where '.' is a wild-card of length 1 (matches 'A', 'C', 'G' or 'T')
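A minimal Python sketch of these definitions, assuming a pattern can be turned into a standard regular expression by expanding '.' to [ACGT]. The function names `pattern_matches` and `find_occurrences` are hypothetical and chosen only for illustration.

```python
import re


def pattern_matches(pattern: str, sequence: str) -> bool:
    """Return True if the whole sequence matches the pattern."""
    # '.' in a pattern stands for exactly one of A, C, G, T.
    regex = pattern.replace(".", "[ACGT]")
    return re.fullmatch(regex, sequence) is not None


def find_occurrences(pattern: str, text: str) -> list[int]:
    """Return 0-based start positions of all matches, overlapping ones included."""
    # A lookahead makes every match zero-length, so overlaps are not skipped.
    regex = re.compile("(?=" + pattern.replace(".", "[ACGT]") + ")")
    return [m.start() for m in regex.finditer(text)]


print(pattern_matches("G.GATGAG.T", "GAGATGAGAT"))
print(find_occurrences("A.A", "AAAAA"))
```

The lookahead is a design choice for counting overlapping occurrences; a plain `re.finditer` on the pattern would report only non-overlapping ones.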
What is a weight matrix?
A set of sequences:
GATGAG
GATGAT
TGATAT
can be described by a consensus sequence, GATGAT, or by a regular expression, [GT][AG][TA][GT]A[GT].
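What is a weight matrix?
Better: describe the sequences
GATGAG
GATGAT
TGATAT
by a matrix.
Alignment matrix C, where C(i, j) is the number of times letter i occurs at position j (values computed from the three sequences above):
     1     2     3     4     5     6
A    0     2     1     0     3     0
C    0     0     0     0     0     0
G    2     1     0     2     0     1
T    1     0     2     1     0     2
Frequency matrix F, where F(i, j) = C(i, j) / N and N = 3 is the number of sequences:
     1     2     3     4     5     6
A    0     0.67  0.33  0     1     0
C    0     0     0     0     0     0
G    0.67  0.33  0     0.67  0     0.33
T    0.33  0     0.67  0.33  0     0.67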
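Both matrices follow directly from the example sequences. A minimal sketch, assuming C(i, j) counts how often letter i occurs at position j and F(i, j) = C(i, j) / N; the function names are illustrative, not taken from the slides.

```python
ALPHABET = "ACGT"


def alignment_matrix(sequences):
    """C[letter][j] = number of sequences that have `letter` at position j."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")
    counts = {letter: [0] * length for letter in ALPHABET}
    for seq in sequences:
        for j, letter in enumerate(seq):
            counts[letter][j] += 1
    return counts


def frequency_matrix(counts, n_sequences):
    """F[letter][j] = C[letter][j] / N."""
    return {letter: [c / n_sequences for c in row] for letter, row in counts.items()}


sequences = ["GATGAG", "GATGAT", "TGATAT"]
C = alignment_matrix(sequences)
F = frequency_matrix(C, len(sequences))
for letter in ALPHABET:
    print(letter, C[letter], [round(f, 2) for f in F[letter]])
```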
What is a weight matrix?
Or weight matrix W, where
N – number of sequences used
p_i – a priori probability of letter i
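The formula for W itself is not preserved on this slide. The symbols N and p_i are consistent with a commonly used log-odds form with a pseudocount, W(i, j) = ln(((C(i, j) + p_i) / (N + 1)) / p_i). The sketch below assumes that form and a uniform prior p_i = 0.25; neither is confirmed by the source, and `weight_matrix` is a hypothetical name.

```python
import math

ALPHABET = "ACGT"


def weight_matrix(sequences, prior=None):
    """Log-odds weights under the ASSUMED form ln(((C + p_i) / (N + 1)) / p_i)."""
    prior = prior or {letter: 0.25 for letter in ALPHABET}  # assumed uniform prior
    n = len(sequences)
    weights = {letter: [] for letter in ALPHABET}
    for j in range(len(sequences[0])):
        column = [seq[j] for seq in sequences]
        for letter in ALPHABET:
            count = column.count(letter)
            smoothed = (count + prior[letter]) / (n + 1)  # pseudocount avoids ln(0)
            weights[letter].append(math.log(smoothed / prior[letter]))
    return weights


W = weight_matrix(["GATGAG", "GATGAT", "TGATAT"])
for letter in ALPHABET:
    print(letter, [round(w, 2) for w in W[letter]])
```

Letters that are more frequent at a position than the prior predicts get positive weights, and letters that never occur there get a finite negative weight instead of minus infinity.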
What is a weight matrix?
Importance matrix I: I(i, j) = ... * ...
(matrix with rows A, C, G, T)
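The two factors of I(i, j) are not preserved on this slide. One plausible reading, assumed here and not confirmed by the source, is I(i, j) = F(i, j) * W(i, j), with F the letter frequency and W a pseudocount log-odds weight ln(((C(i, j) + p_i) / (N + 1)) / p_i). The name `importance_matrix` and the uniform prior are likewise assumptions.

```python
import math

ALPHABET = "ACGT"


def importance_matrix(sequences, prior=0.25):
    """ASSUMED definition: I(i, j) = F(i, j) * W(i, j)."""
    n = len(sequences)
    importance = {letter: [] for letter in ALPHABET}
    for j in range(len(sequences[0])):
        column = [seq[j] for seq in sequences]
        for letter in ALPHABET:
            count = column.count(letter)
            frequency = count / n                                   # F(i, j)
            weight = math.log(((count + prior) / (n + 1)) / prior)  # assumed W(i, j)
            importance[letter].append(frequency * weight)
    return importance


I = importance_matrix(["GATGAG", "GATGAT", "TGATAT"])
for letter in ALPHABET:
    print(letter, [round(v, 2) for v in I[letter]])
```

Under this reading a letter that never occurs at a position contributes 0, and a letter's importance grows with how often it is observed there.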
Applications - Clustering
Pattern clustering
1. G.GATGAG.T   62/75    1:39/49   2:23/26    R: BP: e
G.GATGAG        89/110   1:45/60   2:44/50    R: BP: e
GATGAG.T        124/148  1:52/70   2:72/78    R: BP: e
TG.AAA.TTT      132/145  1:53/61   2:79/84    R: BP: e
AAAATTTT        200/231  1:63/77   2:137/154  R: BP: e
TGAAAA.TTT      104/114  1:45/53   2:59/61    R: BP: e
AAA.TTTT        343/537  1:79/145  2:264/392  R: BP: e
G.AAA.TTTT      135/156  1:51/62   2:84/94    R: BP: e
TG.GATGAG       49/57    1:30/35   2:19/22    R: BP: e
TG.AAA.TTTT     86/91    1:40/43   2:46/48    R: BP:1.1124e
Applications - Clustering
G.GATGAG.T:
GAGATGAGAT
GTGATGAGAT
GAGATGAGGT
...
(matrix with rows A, C, G, T)
Applications - Clustering
Compare matrices with each other using the dynamic programming approach, where
A, B – matrices
i, j – columns
If D(m, n) > threshold => matrices are different
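The recurrence for D is not preserved on this slide. A minimal sketch of one way such a comparison could work, assuming an edit-distance style recurrence over columns: D(i, j) is the minimum of D(i-1, j-1) plus a column distance, D(i-1, j) plus a gap cost, and D(i, j-1) plus a gap cost. The Euclidean column distance, the gap cost of 1.0 and all function names are assumptions made for illustration.

```python
import math


def pattern_to_columns(pattern):
    """Turn a pattern into matrix columns (order A, C, G, T); '.' is 0.25 each."""
    columns = []
    for symbol in pattern:
        if symbol == ".":
            columns.append([0.25] * 4)
        else:
            columns.append([1.0 if symbol == letter else 0.0 for letter in "ACGT"])
    return columns


def column_distance(col_a, col_b):
    """Euclidean distance between two columns (an assumed choice)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(col_a, col_b)))


def matrix_distance(A, B, gap_cost=1.0):
    """D(m, n) for matrices A and B given as lists of columns."""
    m, n = len(A), len(B)
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i * gap_cost
    for j in range(1, n + 1):
        D[0][j] = j * gap_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + column_distance(A[i - 1], B[j - 1]),
                D[i - 1][j] + gap_cost,  # column of A left unmatched
                D[i][j - 1] + gap_cost,  # column of B left unmatched
            )
    return D[m][n]


a = pattern_to_columns("G.GATGAG.T")
b = pattern_to_columns("GATGAG.T")
c = pattern_to_columns("AAAATTTT")
print(matrix_distance(a, b), matrix_distance(a, c))
```

With these assumptions, related patterns such as G.GATGAG.T and GATGAG.T end up much closer to each other than to AAAATTTT, so a threshold on D(m, n) separates them.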
Applications - Clustering
Clusters:
G.GATGAG.T    TG.AAA.TTT     AAAATTTT
G.GATGAG      TGAAAA.TTT     AAA.TTTT
GATGAG.T      TG.AAA.TTTT
We want to represent the clusters by logos.
We need to align the patterns first – position the similar parts of the patterns above each other:
G.GATGAG.T
G.GATGAG--
--GATGAG.T
or the logo will look like this:
(logo of the unaligned patterns)
Applications – Multiple alignment
Importance matrix I – represents the aligned patterns.
Example:
G.GATGAG.T
GATGAG.T
G.GATGAG
1. Insert the first pattern into I ('.' gives 0.25 to each of A, C, G, T).
(matrix I with rows A, C, G, T)
2. Align the second pattern with I using a dynamic programming approach:
Applications – Multiple alignment
Dynamic programming matrix:
columns (matrix I): G . G A T G A G . T
rows (second pattern): G A T G A G . T
Resulting alignment:
G.GATGAG.T
--GATGAG.T
Applications – Multiple alignment
3. Add the pattern '--GATGAG.T' to I; if necessary, add columns to the matrix.
4. Repeat the procedure for every pattern.
Output:
G.GATGAG.T
G.GATGAG--
--GATGAG.T
Why importance matrix?
Applications – Multiple alignment
Example:
Pattern: GATG
So far aligned:
GATGATGTA
GATGTGG
We want: w(G, 4) > w(G, 1) > w(G, 9)
Solution – importance matrix
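The alignment procedure from the multiple alignment slides (insert the first pattern, align each further pattern with the matrix by dynamic programming, add it and insert columns if needed) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the matrix stores summed symbol contributions rather than importance values, the match score is an averaged dot product, the gap score is -0.5, and all names are hypothetical.

```python
ALPHABET = "ACGT"
GAP = -0.5  # assumed gap score; the slides do not state one


def column_of(symbol):
    """Contribution of one pattern symbol to a matrix column."""
    if symbol == ".":
        return [0.25] * 4  # '.' gives 0.25 to each letter
    if symbol == "-":
        return [0.0] * 4   # an alignment gap contributes nothing
    return [1.0 if symbol == letter else 0.0 for letter in ALPHABET]


def similarity(matrix_col, symbol, n_patterns):
    """Dot product of a matrix column and a symbol, averaged over patterns."""
    return sum(m * c for m, c in zip(matrix_col, column_of(symbol))) / n_patterns


def align(matrix, n_patterns, pattern):
    """Globally align `pattern` with `matrix`; return (uses_column, symbol) steps."""
    m, n = len(matrix), len(pattern)
    score = [[0.0] * (n + 1) for _ in range(m + 1)]
    move = [[""] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0], move[i][0] = i * GAP, "up"
    for j in range(1, n + 1):
        score[0][j], move[0][j] = j * GAP, "left"
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = similarity(matrix[i - 1], pattern[j - 1], n_patterns)
            candidates = [
                (score[i - 1][j - 1] + match, "diag"),  # symbol on existing column
                (score[i - 1][j] + GAP, "up"),          # gap in the pattern
                (score[i][j - 1] + GAP, "left"),        # new column for the symbol
            ]
            # max() keeps the first best candidate, so ties prefer "diag".
            score[i][j], move[i][j] = max(candidates, key=lambda c: c[0])
    steps, i, j = [], m, n
    while i > 0 or j > 0:  # trace back from the bottom-right cell
        if move[i][j] == "diag":
            steps.append((True, pattern[j - 1]))
            i, j = i - 1, j - 1
        elif move[i][j] == "up":
            steps.append((True, "-"))
            i -= 1
        else:
            steps.append((False, pattern[j - 1]))
            j -= 1
    return steps[::-1]


def add_pattern(matrix, aligned, pattern):
    """Align `pattern` with the matrix and add it, inserting columns if needed."""
    new_matrix, new_aligned, new_row = [], [""] * len(aligned), ""
    col = 0
    for uses_column, symbol in align(matrix, len(aligned), pattern):
        if uses_column:
            base = matrix[col]
            new_aligned = [row + old[col] for row, old in zip(new_aligned, aligned)]
            col += 1
        else:
            base = [0.0] * 4  # brand-new column; earlier patterns get a gap
            new_aligned = [row + "-" for row in new_aligned]
        new_matrix.append([b + c for b, c in zip(base, column_of(symbol))])
        new_row += symbol
    return new_matrix, new_aligned + [new_row]


patterns = ["G.GATGAG.T", "GATGAG.T", "G.GATGAG"]
matrix = [column_of(symbol) for symbol in patterns[0]]  # step 1
aligned = [patterns[0]]
for pattern in patterns[1:]:                            # steps 2-4
    matrix, aligned = add_pattern(matrix, aligned, pattern)
print("\n".join(aligned))
```

With these assumptions the sketch aligns the three example patterns as G.GATGAG.T, --GATGAG.T and G.GATGAG-- (rows in input order). Other scoring choices can break ties differently, which is one reason a frequency-aware matrix matters for the alignment.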
Applications – Weight matrix matching
● Weight Matrix Matching
Purpose: find the sequences that the weight matrix describes best in a given text file:
...CATAGGAAATTCCACCTCTTTGGCTTTGCCCAGTCTTCCCTTGAGGATGCCTACGTTC
1. Calculate the score for each position
2. if score > threshold => signal
Problem: finding a good threshold
● Threshold – 99.5% quantile
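The two matching steps and the quantile threshold can be sketched as follows. Several points are assumptions not stated on the slide: the score of a position is the sum of the weights of the letters in the window starting there, the weights use a pseudocount log-odds form, and the 99.5% quantile is taken over the scores of all 4^L possible words of the matrix length L (a uniform background, feasible only for short matrices). The text and all function names are hypothetical.

```python
import itertools
import math

ALPHABET = "ACGT"


def weight_matrix(sequences, prior=0.25):
    """One dict of letter weights per column; ASSUMED pseudocount log-odds form."""
    n = len(sequences)
    return [
        {letter: math.log(((column.count(letter) + prior) / (n + 1)) / prior)
         for letter in ALPHABET}
        for column in zip(*sequences)
    ]


def score(W, word):
    """Sum of the weights of the word's letters, one per matrix column."""
    return sum(col[letter] for col, letter in zip(W, word))


def quantile_threshold(W, q=0.995):
    """Empirical q-quantile of the scores of all words of length len(W)."""
    scores = sorted(score(W, word)
                    for word in itertools.product(ALPHABET, repeat=len(W)))
    return scores[math.ceil(q * len(scores)) - 1]


def find_signals(W, text, threshold):
    """Return (start, score) for every window scoring above the threshold."""
    width = len(W)
    hits = []
    for start in range(len(text) - width + 1):
        s = score(W, text[start:start + width])
        if s > threshold:
            hits.append((start, s))
    return hits


W = weight_matrix(["GATGAG", "GATGAT", "TGATAT"])
threshold = quantile_threshold(W)
hits = find_signals(W, "CCCGATGATCCCGATGAGCC", threshold)
print(round(threshold, 3), [(start, round(s, 3)) for start, s in hits])
```

Enumerating all words makes the threshold deterministic; for longer matrices the same quantile would have to be estimated, for example from scores of sampled background sequences.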
Questions?