Xuhua Xia xxia@uottawa.ca http://dambe.bio.uottawa.ca Motif characterization: Position weight matrix (PWM), Perceptron and their applications Xuhua Xia.


Motif characterization
Input: site-specific sequences
Approaches:
  Consensus sequence (the chance of having NNN… increases with increasing number of sequences)
  Frequency profile (the problem of mutation bias)
  PWM, sequence logo, perceptron and Gibbs sampler (cannot detect column association)
  Multiple correspondence analysis
Example sequences:
  A4GALT  ATACCATGTCCAA
  ACO2    ACAAAATGGCGCC
  ACR     GGAGTATGGTTGA
  ADM2    CCGCCATGGCCCG

Consensus sequence
Sequences flanking the initiation codon of 508 CDSs:
  A4GALT  AUACCAUGUCCAA
  ACO2    ACAAAAUGGCGCC
  ACR     GGAGUAUGGUUGA
  ADM2    CCGCCAUGGCCCG

Site     A     C     G     U   Cons
1       75   173   171    89   N
2      105   216   144    43
3      199    70   212    27
4      124   236    83    65
5       60   276   143    29
6      502     6     0     0
7        0     0     0   508
8        0     0   508     0
9       98    71   274    65
10     141   227    89    51
11      49   154   221    84
12      89   144   196    79
13     121   151   134   102
Sum   1563  1724  2175  1142

Two hypotheses:
  H0: site-specific pattern absent in flanking sequences
  HA: site-specific pattern present in flanking sequences
Our objective is to find out whether sites flanking AUG contribute to start codon recognition. The consensus sequence does not give us the answer.

Frequencies and the background F.
Sequences flanking the initiation codon of 508 CDSs:
  A4GALT  AUACCAUGUCCAA
  ACO2    ACAAAAUGGCGCC
  ACR     GGAGUAUGGUUGA
  ADM2    CCGCCAUGGCCCG

         Counts                    Frequencies
Site     A     C     G     U       A       C       G       U
1       75   173   171    89    0.1476  0.3406  0.3366  0.1752
2      105   216   144    43    0.2067  0.4252  0.2835  0.0846
3      199    70   212    27    0.3917  0.1378  0.4173  0.0531
4      124   236    83    65    0.2441  0.4646  0.1634  0.1280
5       60   276   143    29    0.1181  0.5433  0.2815  0.0571
6      502     6     0     0    0.9882  0.0118  0.0000  0.0000
7        0     0     0   508    0.0000  0.0000  0.0000  1.0000
8        0     0   508     0    0.0000  0.0000  1.0000  0.0000
9       98    71   274    65    0.1929  0.1398  0.5394  0.1280
10     141   227    89    51    0.2776  0.4469  0.1752  0.1004
11      49   154   221    84    0.0965  0.3031  0.4350  0.1654
12      89   144   196    79    0.1752  0.2835  0.3858  0.1555
13     121   151   134   102    0.2382  0.2972  0.2638  0.2008
Sum   1563  1724  2175  1142    0.2367  0.2611  0.3293  0.1729

In the table, the red numbers could be just red herrings. G is the most frequent base overall (0.3293), so we expect the G column to have the most red; U is the least frequent (0.1729), so we expect the U column to have the least red. This is exactly the case.
RCCaugGCGG: -3R, G?
What background frequencies should be used as the control?
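The frequency profile and one candidate background can be sketched in a few lines of Python. This is a minimal illustration using only the four example 13-mers shown on the slide (not the full set of 508 CDSs), and it assumes the background is obtained by pooling all 13 sites, which is what the Sum row does. The function names are illustrative, not from DAMBE.

```python
from collections import Counter

BASES = "ACGU"
# The four example 13-mers shown on the slide (gene names omitted).
seqs = ["AUACCAUGUCCAA", "ACAAAAUGGCGCC", "GGAGUAUGGUUGA", "CCGCCAUGGCCCG"]

def site_counts(seqs):
    """Return one Counter of base counts per aligned site."""
    return [Counter(column) for column in zip(*seqs)]

def site_frequencies(seqs):
    """Convert site-specific counts to proportions."""
    n = len(seqs)
    return [{b: c[b] / n for b in BASES} for c in site_counts(seqs)]

def pooled_background(seqs):
    """Assumed background: base frequencies pooled over all sites."""
    total = Counter("".join(seqs))
    size = sum(total[b] for b in BASES)
    return {b: total[b] / size for b in BASES}

freqs = site_frequencies(seqs)
background = pooled_background(seqs)
print(freqs[6])     # site 7, the U of AUG
print(background)   # pooled frequencies of A, C, G, U
```

Pooling the flanking sites is only one choice of control; frequencies from a larger upstream region or from whole transcripts are alternatives the slide's closing question leaves open.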

Two hypotheses in numbers
Site-specific counts and frequencies: the same table as on the preceding slide (508 CDSs, sites 1 to 13), with pooled frequencies A = 0.2367, C = 0.2611, G = 0.3293, U = 0.1729.
Sequences flanking the initiation codon of 508 CDSs:
  A4GALT  AUACCAUGUCCAA
  ACO2    ACAAAAUGGCGCC
  ACR     GGAGUAUGGUUGA
  ADM2    CCGCCAUGGCCCG
S = ACGGTACCACGTT
Likelihood, odds ratio, log-odds, PWMS
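The four terms can be written out as follows. This is the standard log-odds formulation, stated here as an assumption because the slide lists only the terms; the symbols $s_j$ (the base of S at site $j$), $p_i$ (background frequency of base $i$) and $p_{ij}$ (frequency of base $i$ at site $j$) are notation introduced for this sketch.

```latex
% Likelihood of S under H0 (background only) and under HA (site-specific pattern)
L_0 = \prod_{j=1}^{13} p_{s_j}, \qquad
L_A = \prod_{j=1}^{13} p_{s_j j}

% Odds ratio and its logarithm (the PWM score)
\frac{L_A}{L_0} = \prod_{j=1}^{13} \frac{p_{s_j j}}{p_{s_j}}, \qquad
\mathrm{PWMS} = \log_2 \frac{L_A}{L_0} = \sum_{j=1}^{13} \log_2 \frac{p_{s_j j}}{p_{s_j}}
```

A positive PWMS favours HA for that sequence and a negative PWMS favours H0.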

Position weight matrix (PWM)
Two major purposes of PWM:
  To characterize the sequence pattern (the motif)
  To facilitate the computation of log-odds (the PWM score, PWMS), e.g., computing the PWMS for S1 = ATACCATGTCCAA
Two benefits of working with log-odds:
  Multiplication becomes addition
  Reduction of rounding errors

Site      A        C        G        U       Std
1        <0     0.3844   0.0329   0.0198   0.4453
2        <0     0.7044     <0       <0     0.7076
3      0.7275     <0     0.3426     <0     1.1214
4      0.0457   0.8320     <0       <0     0.7786
5        <0     1.0578     <0       <0     1.1452
6      2.0617     <0       <0       <0     5.5288
7        <0       <0       <0     2.5313   6.0595
8        <0       <0     1.6025     <0     5.5952
9        <0       <0     0.7124     <0     0.6782
10     0.2309   0.7760     <0       <0     0.8112
11       <0     0.2167   0.4025     <0     0.7626
12       <0     0.1200   0.2295     <0     0.2961
13     0.0105   0.1884     <0     0.2163   0.2459
(<0 marks a negative entry whose value is not shown.)

Consensus from the large positive entries: RCCAUGG
PWMS(S1) = …
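A minimal sketch of building a PWM and scoring S1, using only the four example sequences rather than all 508 CDSs. The pseudocount scheme `(count + alpha * background) / (N + alpha)` and the pooled background are assumptions made for this illustration; the slide does not state the exact formula, so the numbers will not match the table.

```python
import math
from collections import Counter

BASES = "ACGU"
seqs = ["AUACCAUGUCCAA", "ACAAAAUGGCGCC", "GGAGUAUGGUUGA", "CCGCCAUGGCCCG"]

def build_pwm(seqs, alpha=0.01):
    """Return a list (one dict per site) of log2(p_ij / p_i) entries."""
    n = len(seqs)
    pooled = Counter("".join(seqs))
    total = sum(pooled[b] for b in BASES)
    bg = {b: pooled[b] / total for b in BASES}   # assumed background
    pwm = []
    for column in zip(*seqs):
        counts = Counter(column)
        row = {}
        for b in BASES:
            # Pseudocount keeps unobserved bases from producing log(0).
            p = (counts[b] + alpha * bg[b]) / (n + alpha)
            row[b] = math.log2(p / bg[b])
        pwm.append(row)
    return pwm

def pwms(seq, pwm):
    """Sum the PWM entries picked out by the sequence (DNA or RNA)."""
    rna = seq.upper().replace("T", "U")
    return sum(row[base] for base, row in zip(rna, pwm))

pwm = build_pwm(seqs)
print(round(pwms("ATACCATGTCCAA", pwm), 4))   # S1 from the slide
print(round(pwms("UUUUUUUUUUUUU", pwm), 4))   # a sequence lacking the motif
```

Because the score is a sum of logarithms, no product of 13 small probabilities is ever formed, which is the rounding-error benefit named above.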

PWMS over sites
GGACUGGCUGGGCGAGACUCUCCACCUGCUCCCUGGGACCAUCGCCCACCAUGGCUGUGGCCCAGCAGCUGCGGGCCGAG
Figure 5-1. Illustration of scanning the 5' end of the NCF4 gene (30 bases upstream of the initiation codon ATG and 27 bases downstream of ATG). The highest peak corresponds to the 13-mer with 5 bases flanking the ATG on each side (the peak's PWMS value is missing here). PWMS computed with α = 0.01.
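Scanning amounts to scoring every 13-base window along the sequence. The sketch below does this for the NCF4 fragment shown above, with a PWM trained on only the four example sequences from earlier slides, so its scores are illustrative and are not the values plotted in Figure 5-1. The pseudocount formula is the same assumption as before.

```python
import math
from collections import Counter

BASES = "ACGU"
TRAIN = ["AUACCAUGUCCAA", "ACAAAAUGGCGCC", "GGAGUAUGGUUGA", "CCGCCAUGGCCCG"]
NCF4 = ("GGACUGGCUGGGCGAGACUCUCCACCUGCUCCCUGGGACC"
        "AUCGCCCACCAUGGCUGUGGCCCAGCAGCUGCGGGCCGAG")

def build_pwm(seqs, alpha=0.01):
    """log2 odds of smoothed site frequency over pooled background."""
    pooled = Counter("".join(seqs))
    total = sum(pooled[b] for b in BASES)
    bg = {b: pooled[b] / total for b in BASES}
    n = len(seqs)
    return [{b: math.log2((col.count(b) + alpha * bg[b]) / (n + alpha) / bg[b])
             for b in BASES}
            for col in zip(*seqs)]

def scan(seq, pwm):
    """Return the PWMS of every window of len(pwm) bases along seq."""
    width = len(pwm)
    return [sum(row[b] for b, row in zip(seq[i:i + width], pwm))
            for i in range(len(seq) - width + 1)]

pwm = build_pwm(TRAIN)
scores = scan(NCF4, pwm)
best = max(range(len(scores)), key=scores.__getitem__)
print(best, NCF4[best:best + len(pwm)], round(scores[best], 3))
```

Plotting `scores` against window start position gives the kind of profile the figure describes.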

PWM: position weight matrix
Also called position-specific scoring matrix (PSSM)
Used in:
  Characterizing sequence motifs
    Eukaryotic translation initiation consensus
    Splicing sites
    Branchpoint sites
    Shine-Dalgarno sequences
  Database searches (PHI-BLAST, PSI-BLAST, RPS-BLAST)
What you can do in research

Yeast 5′ ss PWM
Table 3: Site-specific frequencies and position weight matrix (PWM) for 275 5′ ss. The consensus sequence (UAAAG|GUAUGUU UAAUU) can be obtained from the large site-specific PWM entries, with the most important sites in bold italics. The χ² test is performed for each site against the background frequencies (A = 0.3279, C = 0.1915, G = 0.2043, and U = 0.2763). The nucleotide sites are labeled with the five exon nucleotides as −5 to −1 and the 12 intron nucleotides as 1 to 12. The PWM is nearly identical when the introns in 5′ UTR were excluded.

Columns: Site; counts of A, C, G, U; χ²; p; PWM entries for A, C, G, U
−5   94 32 57 92   11.798   0.0641 −0.7117 0.0245 0.2792
−4   119 47 48 61   14.117   0.4032 −0.1599 −0.2225 −0.3115
−3   139 38 43 55   39.672   0.6268 −0.4651 −0.3805 −0.4601
−2   138 40 36   38.899   0.6164 −0.3915 −0.6355
−1   91 45 88 51   27.270   0.0174 −0.2223 0.6492 −0.5685
1    274   −8.1042 −5.4675 2.2855 −8.1044
2    9 266   −2.5200 −8.1048 1.8081
3    268 4   1.5723 −4.6732 −4.1523
4    17 29 228   −2.3805 −0.8528 −5.5454 1.5859
5    272   −5.2765 −8.1049 2.2750 −5.8967
6    10 8 255   −3.1271 −2.6862 1.7472
7    97 18 39 121   55.570   0.1092 −1.5351 −0.5206 0.6734
8    95 54 35   11.363   0.0793 0.0397 −0.6759 0.2635
9    123 34 73   22.172   0.4508 −0.7175 −0.0534
10   118 41 78   17.334   0.3911 −0.3560 −0.5579 0.0418
11   105 33   17.367   0.2232 −0.6676 0.3101
12   90 44 42 99   12.109   0.0015 −0.2546 −0.4142 0.3847
Ma and Xia 2011
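The per-site χ² test against a fixed background can be sketched with the standard library alone. The counts for site −5 and the background frequencies are taken from Table 3; treating the test as a goodness-of-fit test with 3 degrees of freedom is an assumption of this sketch, and the closed-form tail probability used below is valid only for 3 degrees of freedom.

```python
import math

# Background frequencies quoted in the Table 3 caption.
BACKGROUND = {"A": 0.3279, "C": 0.1915, "G": 0.2043, "U": 0.2763}

def chi_square_site(counts, background):
    """Goodness-of-fit statistic: sum of (observed - expected)^2 / expected."""
    n = sum(counts.values())
    return sum((counts[b] - n * background[b]) ** 2 / (n * background[b])
               for b in background)

def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

site_minus5 = {"A": 94, "C": 32, "G": 57, "U": 92}   # row -5 of Table 3
chi2 = chi_square_site(site_minus5, BACKGROUND)
print(round(chi2, 3), round(chi2_sf_df3(chi2), 4))
```

The statistic comes out close to, but not exactly equal to, the tabulated 11.798, presumably because the published background frequencies are rounded to four decimals.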

Yeast 3′ ss PWM
Table 4. Site-specific frequencies and position weight matrix (PWM) for 278 3′ ss. The consensus sequence (UUUUUUUUAYAG|GCUUC) can be obtained from the large site-specific PWM entries, with the most important sites in bold italics. The χ² test is performed for each site against the expected background frequencies. The sites are labeled with the first exon site as 1.

Columns: Site; counts of A, C, G, U; χ²; p; PWM entries for A, C, G, U
-12  61 53 34 130   56.0   0.0033 0.7648
-11  70 47 20 141   83.2   0.8813
-10  79 42 12 145   99.9   0.9214
-9   38 30 23 187   219.4   1.2867
-8   51 27 158   121.6   1.0447
-7   91 33 28 126   53.8   0.0093 0.7200
-6   95 35 106   22.0   0.0001 0.0707 0.4722
-5   93 129   63.3   0.0403 0.7537
-4   136 25   43.3   0.5842 0.0517
-3   121   272.3   1.1862
-2   277 1   563.7   1.6056
-1   278   1082.7   2.2900
1    37 73 75   9.7   0.0217 0.3691
2    72 64 54 88   8.0   0.0466 0.2729 0.2059
3    90 48 86   2.5   0.4771 0.0300 0.1730
4    83 43 98   8.7   0.0337 0.3599
5    65   10.6   0.0140 0.2951
Ma and Xia 2011

PWMS as a proxy of splicing strength
Table 6. Position weight matrix scores (PWMS, as a proxy for splicing strength) are significantly smaller for splice sites from intron-containing genes (ICGs) whose transcripts failed to recruit U1 snRNPs (NRG, for non-recruiting group) than for those from ICGs whose transcripts bind well to U1 snRNPs (RG, for recruiting group). The pattern is consistent for both 5' ss and 3' ss, based on two-sample t-tests assuming equal variances. Mann-Whitney tests yield the same conclusion.

Columns: 5' ss (NRG, RG); 3' ss (NRG, RG)
PWMS Mean   8.8138 5.3129 7.1762
PWMS Var.   4.8646 8.2077
N           44 202 49 229
t
p           0.0000 0.0001
Ma and Xia 2011

Highly expressed genes should have high splicing efficiency.
Predictions:
  (1) Highly transcribed genes should, on average, have introns with greater splicing efficiency.
  (2) Lowly transcribed genes should have greater variance in splicing efficiency than highly transcribed genes.
Lowly expressed genes could have their splice sites drifting to low efficiency.

PWMS and Gene Expression

PWMS and Splicing Mechanisms
Expected PWMS is 0 when there is no site-specific difference in nucleotide frequency distribution.
What does a strongly negative PWMS mean?
  5' ss: HAC1: HFM1: HOP2:
  3' ss: HAC1: REC102:

[Figure: exon-intron diagrams of USP4, panels (a), (b) and (c), showing exons E1 to E22, introns I5 to I8, and the signals 5'SS6, 3'SS6, 5'SS7, 3'SS7, BPS6 and BPS7]
Panel (b): Long flanking introns could lead to E7 exclusion.
Panel (c): Strong 5'SS6 and BPS7 coupled with weak BPS6 and 5'SS7 could lead to E7 exclusion.
Fig. 1. Exon-intron structure of USP4 (a), with exons represented as Ei and introns as Ii, together with branchpoint sites (BPSi) and 5' and 3' splice sites (5'SSi and 3'SSi). E7 could be excluded if it is flanked by two long introns (b) or if 5'SS6 is stronger than 5'SS7 and/or BPS7 is stronger than BPS6 (c). The font size of splicing sites and branchpoint sites indicates the strength of splicing signals. Not drawn to scale.
Caitlyn Vlasschaert et al., Scientific Reports (in press)

Perceptron
The perceptron, one of the simplest artificial neural networks, was invented in 1957 at the Cornell Aeronautical Laboratory by Frank Rosenblatt (Rosenblatt, 1958).
Perceptrons have been used in bioinformatics research since the 1980s:
  The identification of translational initiation sites in E. coli (Stormo et al., 1982a)
  Characterizing the ATP/GTP-binding motif (Hirst and Sternberg, 1991)
More recent publications use multi-layer perceptrons, which are more complicated than what we cover here.

What perceptron does
Positive sequences:
  POS1 ACGT
  POS2 GCGC
Negative sequences:
  NEG1 AGCT
  NEG2 GGCC
Objective: Find a scoring matrix that can distinguish between the two groups (positive and negative) of sequences.

Definitions
  POS1 ACGT
  POS2 GCGC
  NEG1 AGCT
  NEG2 GGCC
The weighting matrix (W) for the fictitious example with two sequences of length 4 in each group, initialized with values of 0. The first row designates sites 1-4.

Base   1   2   3   4
A      0   0   0   0
C      0   0   0   0
G      0   0   0   0
T      0   0   0   0

For amino acid sequences, the matrix would be 20 by 4.

Iterations and convergence
First round of the training process in the perceptron algorithm:
  (a) NEG1: AGCT, PS = 0, update
  (b) NEG2: GGCC, PS = -2, no update
  (c) POS1: ACGT, PS = -2, update
  (d) POS2: GCGC, PS = 2, no update

W after step (a):
Base   1   2   3   4
A     -1   0   0   0
C      0   0  -1   0
G      0  -1   0   0
T      0   0   0  -1

Let us start with the NEG sequences in the fictitious example. Starting with the POS group would waste computational time: every resulting PS would be 0 and, according to the rules for updating W, a POS sequence with PS = 0 triggers no update.

PS for NEG1 (S = AGCT) is 0. According to the rules for updating W, we should reduce the relevant Wij values by 1. The updated W is shown in Table 3-5a, with WA,1, WG,2, WC,3 and WT,4 in the original W (Table 3-4) reduced by 1. The next input sequence is NEG2 (= GGCC), which has PS = -2 based on the updated W in Table 3-5a. No update is made, so the W matrix in Table 3-5b is the same as that in Table 3-5a.

We then proceed with the POS1 and POS2 sequences. POS1 (= ACGT) has PS = -2, so we add 1 to the cells corresponding to ACGT. The updated matrix W is shown in Table 3-5c. POS2 (= GCGC), however, has PS = 2, so there is no updating and W in Table 3-5d is the same as that in Table 3-5c. This ends the first round of iteration.

Now we restart with NEG1, NEG2, POS1, and POS2. We find that NEG1 and NEG2 both have negative PS and POS1 and POS2 both have positive PS. So we declare convergence, with the two groups classified correctly.
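The walkthrough above can be reproduced with a short Python sketch. The update rule is inferred from the worked example rather than copied from the missing equation: a NEG sequence with PS >= 0 lowers its cells by 1, and a POS sequence with PS < 0 raises its cells by 1. The function names are illustrative.

```python
BASES = "ACGT"

def score(seq, W):
    """PS: sum of the W cells selected by each (base, site) of the sequence."""
    return sum(W[base][site] for site, base in enumerate(seq))

def train(neg, pos, max_rounds=100):
    """Train on NEG then POS sequences each round; return (W, rounds used)."""
    length = len(neg[0])
    W = {b: [0] * length for b in BASES}          # weighting matrix of zeros
    labelled = [(s, False) for s in neg] + [(s, True) for s in pos]
    for round_no in range(1, max_rounds + 1):
        for seq, is_pos in labelled:
            ps = score(seq, W)
            if is_pos and ps < 0:
                delta = 1                          # POS scored too low
            elif not is_pos and ps >= 0:
                delta = -1                         # NEG scored too high
            else:
                continue                           # no update needed
            for site, base in enumerate(seq):
                W[base][site] += delta
        # Converged when every NEG scores below 0 and every POS above 0.
        if all(score(s, W) < 0 for s in neg) and all(score(s, W) > 0 for s in pos):
            return W, round_no
    return W, None

W, rounds = train(neg=["AGCT", "GGCC"], pos=["ACGT", "GCGC"])
for base in BASES:
    print(base, W[base])
print("rounds:", rounds)
```

This sketch checks for convergence at the end of each round, so it stops after round 1, whereas the walkthrough detects the same state by running a second pass with no updates.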

Doublet perceptron
Doublet sites: 1 2 3 4 5 6 7 8 9
Doublets observed: AC CG GU UA AU UC CU UG UU AA CA GA AG CC GC GG
Positive sequences:
  P1 ACGUAUACGU
  P2 ACGUCUACGU
  P3 ACGUGUACGU
  P4 ACGUUAACGU
  P5 ACGUUCACGU
  P6 ACGUUGACGU
Negative sequences:
  N1  ACGUAAACGU
  N2  ACGUACACGU
  N3  ACGUAGACGU
  N4  ACGUCAACGU
  N5  ACGUCCACGU
  N6  ACGUCGACGU
  N7  ACGUGAACGU
  N8  ACGUGCACGU
  N9  ACGUGGACGU
  N10 ACGUUUACGU

Doublet Perceptron
Weights by doublet (rows) and doublet site 1 to 9 (columns); only non-zero entries are listed:
  AA  -6 -4.3
  AC  -4
  AG  -2
  AU  8.33
  CA  -1
  CC
  CG
  CU
  GA  -0.7
  GC
  GG
  GU  3.33
  UA  -3.7 6.67 5.67
  UC
  UG  0.33
  UU  -11
A large amount of data is needed to avoid the problem of overfitting.
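A doublet perceptron differs from the single-base version only in its features: each sequence of length 10 is read as 9 overlapping doublets, and W has 16 rows. The sketch below trains on the P and N sequences listed on the previous slide. It uses a plain +1/-1 update whenever a sequence is not strictly on the correct side of 0, which is an assumption of this sketch; it will not reproduce the fractional weights in the table, which evidently come from a different update scheme.

```python
from itertools import product

DOUBLETS = ["".join(p) for p in product("ACGU", repeat=2)]

def doublets(seq):
    """Overlapping doublets: a sequence of length L gives L - 1 of them."""
    return [seq[i:i + 2] for i in range(len(seq) - 1)]

def score(seq, W):
    """Sum of the W cells selected by each (doublet, site) of the sequence."""
    return sum(W[d][site] for site, d in enumerate(doublets(seq)))

def train(pos, neg, max_rounds=1000):
    """Perceptron training; stops after a full round with no update."""
    n_sites = len(pos[0]) - 1
    W = {d: [0] * n_sites for d in DOUBLETS}
    data = [(s, 1) for s in pos] + [(s, -1) for s in neg]
    for _ in range(max_rounds):
        updated = False
        for seq, label in data:
            if label * score(seq, W) <= 0:        # wrong side of 0, or on it
                for site, d in enumerate(doublets(seq)):
                    W[d][site] += label
                updated = True
        if not updated:
            return W
    raise RuntimeError("no convergence within max_rounds")

# Sequences differ only in the two central bases.
pos = [f"ACGU{m}ACGU" for m in ("AU", "CU", "GU", "UA", "UC", "UG")]
neg = [f"ACGU{m}ACGU" for m in
       ("AA", "AC", "AG", "CA", "CC", "CG", "GA", "GC", "GG", "UU")]

W = train(pos, neg)
print({s: score(s, W) for s in pos + neg})
```

With 16 sequences and 144 weights, many different W matrices separate the two groups perfectly, which illustrates why overfitting is a concern with so little data.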

Gene/Motif Prediction
Objective: given a molecular sequence, find its biological function (preferably in terms of gene ontology):
  Cellular localization
  Biological processes the gene (its product) participates in
  The biological reaction
Related terms:
  Motif: e.g., RccAUGG
  Fingerprint: a set of aligned sequences from which a position weight matrix or the like can be constructed to predict the motif effectively
Gene/Motif prediction methods:
  Position weight matrix
  Perceptrons
  Supervised learning
  Hidden Markov Models (HMMs)
  Neural networks (e.g., self-organizing map or SOM)
