MGMR progress report, 24/08/11
Likelihood Model
- We wish to estimate the relative expression values p_1,…,p_M of M genes in each of N mRNA samples.
- Sample k is composed of R_k reads (k ϵ [1,N]).
- Each read is aligned to a subset of the genes.
- We assume Pr(read | gene) is given and that Pr(G) ~ Dirichlet(α_1,…,α_M).
- Notation: i ϵ [1,M] indexes genes; j ϵ [1,R_k] indexes reads; k ϵ [1,N] indexes samples.
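A minimal sketch of the objective this slide implies for a single sample, assuming reads are independent draws from a mixture over genes weighted by p, with a Dirichlet prior on p. The function name log_posterior and the list-of-lists layout for Pr(read | gene) are illustrative assumptions, not the MGMR code.

```python
import math

def log_posterior(p, read_likelihoods, alpha):
    """Unnormalized log posterior of expression vector p for one sample.

    p                -- M relative expression values, strictly positive, summing to 1
    read_likelihoods -- one list per read; entry i is Pr(read | gene i),
                        zero for genes the read does not align to
    alpha            -- M Dirichlet parameters
    (Hypothetical layout chosen for illustration.)
    """
    # Mixture likelihood: a read comes from gene i with probability p[i].
    log_lik = 0.0
    for lik in read_likelihoods:
        log_lik += math.log(sum(p_i * l_i for p_i, l_i in zip(p, lik)))
    # Dirichlet log density, dropping the normalizing constant.
    log_prior = sum((a - 1.0) * math.log(p_i) for a, p_i in zip(alpha, p))
    return log_lik + log_prior
```

With a flat prior (all α_i = 1) the prior term vanishes, and the objective reduces to the plain mixture log-likelihood.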
Testing Strategy 1000 Mention we also did MGMR init
Testing strategy
- Estimate with SEQEM and MGMR on real samples.
- Simulate reads based on the 2 sets of estimates.
- Re-estimate on each initialization.
- Measure error compared with the gold standard.
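The simulate-then-re-estimate loop can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the SEQEM or MGMR implementation: a read is reduced to the tuple of genes it is compatible with, Pr(read | gene) is taken as equal across those genes, and the estimator is a plain EM for mixture proportions. The names simulate_reads, em_estimate, and the compatible table are hypothetical.

```python
import random

def simulate_reads(p, n_reads, compatible, rng):
    """Draw the source gene of each read from p; a read is recorded as the
    tuple of genes it aligns to (compatible[g] must include g itself)."""
    genes = rng.choices(range(len(p)), weights=p, k=n_reads)
    return [tuple(compatible[g]) for g in genes]

def em_estimate(reads, n_genes, n_iter=100):
    """Plain EM for mixture proportions, treating Pr(read | gene) as equal
    for every gene the read aligns to."""
    p = [1.0 / n_genes] * n_genes
    for _ in range(n_iter):
        counts = [0.0] * n_genes
        for genes in reads:
            # E-step: split the read among its compatible genes by current p.
            total = sum(p[g] for g in genes)
            for g in genes:
                counts[g] += p[g] / total
        # M-step: renormalize the expected counts.
        p = [c / len(reads) for c in counts]
    return p

# Gold standard, simulated reads, then a re-estimate to compare against it.
gold = [0.6, 0.3, 0.1]
compatible = [[0, 1], [1], [2]]  # reads from gene 0 also align to gene 1
reads = simulate_reads(gold, 2000, compatible, random.Random(0))
estimate = em_estimate(reads, len(gold))
```

The error measures are then computed between gold and estimate.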
Error measures used
- Mention the median as an alternative to err_rate.
Problems
- Initial results were bad.
- Improvement with iterations, but still worse than the SEQEM estimate.
- SEQEM noise level too high.
- “Table for illustration”; “we’ll talk about what given alpha refers to later…”
Questions
- Why is SEQEM error so high (compared with the paper)?
- Effect of dataset? Bug?
- Highlight actual numbers for tag 32 (EM and weighted), coverage level varied.
Given alpha, given priors testing strategy
- Back-track: check whether there is a bug or whether the model is flawed.
- Instead of real data, generate data (P vectors) with known characteristics and test the performance of SEQEM, MGMR, and MGMR given the alpha vector (i.e., known alpha with no updates).
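One way to generate P vectors with known characteristics is to draw them from a Dirichlet with a fixed alpha vector, which the "given alpha" run can then be handed directly. The sketch below is only an illustration: the helper names, the example alpha values, and the seed are assumptions, and the Dirichlet draw uses the standard construction of normalizing independent Gamma variates.

```python
import random

def sample_dirichlet(alpha, rng):
    """One Dirichlet(alpha) draw: normalize independent Gamma(alpha_i, 1) variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_samples(alpha, n_samples, seed=0):
    """P vectors for n_samples samples that share the same known alpha."""
    rng = random.Random(seed)
    return [sample_dirichlet(alpha, rng) for _ in range(n_samples)]

# Hypothetical alpha for three genes; a larger alpha_i gives gene i a larger expected share.
known_alpha = [0.5, 1.0, 2.0]
p_vectors = generate_samples(known_alpha, n_samples=10)
```

Scaling the whole alpha vector up makes the samples more similar to one another, which is one possible handle on population homogeneity.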
Control for everything
- Repeat SEQEM steps first
- Use HomoloGene genes
- Limit to chromosome 1 (has most genes)
- Filter overlapping genes
- Keep only homology groups appearing multiple times in human paralogs
- Sample from/map to exon sequences, sum over genes
- Net effect: lowered error, but it was still too high; for SEQEM, E ~= 0.60, C ~= 0.16
Given priors results
- On original set of 51; averaged at 10:10:50, 100,
Real samples results
- Colors are re-estimates; 1:1:10, 20:10:50, 100
- SEQEM (red) is the 100-iteration average (not changing with iterations)
Left edge
- High relative error average for low Chi-diff average values.
- Shows that relative error is very prone to outliers, Chi-diff less so.
- Example: True = .0001; Est = .0020; Relative error ~= 20; Chi-squared difference ~= .04
- Possible alternative: use median of error rates.
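The example numbers can be checked with a short sketch. The definitions are inferred from the example rather than taken from the slides: relative error as |est - true| / true and chi-squared difference as (est - true)^2 / true. With True = .0001 and Est = .0020 these give roughly 19 and 0.036, in line with the ~20 and ~.04 quoted. The function names and the per-gene error list are hypothetical.

```python
from statistics import mean, median

def relative_error(true, est):
    """Assumed definition: absolute difference scaled by the true value."""
    return abs(est - true) / true

def chi_sq_diff(true, est):
    """Assumed definition: squared difference scaled by the true value."""
    return (est - true) ** 2 / true

# A low-abundance gene dominates the relative error, much less so the chi-diff.
rel = relative_error(0.0001, 0.0020)
chi = chi_sq_diff(0.0001, 0.0020)

# Hypothetical per-gene relative errors: one outlier drags the mean, not the median.
errors = [0.1, 0.2, rel]
mean_error = mean(errors)
median_error = median(errors)
```

Here the mean is pulled above 6 by the single outlier while the median stays at 0.2, which is the case for reporting the median of error rates.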
51 works... now what?
- Repeat with a bigger set of genes: all HomoloGene paralogs (that survive filtering)
- Has proven difficult: high SEQEM error again in spite of filtering (E: 2.4, C: 1.1, KL: 0.5)
- Questions again: is average error a good measure? Does effective length normalization help accuracy?
- Test a priori for mappability (with k-mer counts)
- Vary population homogeneity
- About 300 paralogs survive
RSEM figure
- MPE, 1 TPM or 1 NPM
- % sig difference
- Testing on all genes instead of only paralogs, but also dealing with isoforms