1
MGMR progress report, 24/08/11
2
Likelihood Model
We wish to estimate the relative expression values p_1,…,p_M of M genes in each of N mRNA samples.
Sample k is composed of R_k reads (kϵ[1,N]).
Each read is aligned to a subset of the genes.
We assume Pr(read | gene) is given and that Pr(G) ~ Dirichlet(α_1,…,α_M).
Notation: iϵ[1,M] indexes genes; jϵ[1,R_k] indexes reads; kϵ[1,N] indexes samples.
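A minimal sketch of how a model of this shape could be fit for a single sample, assuming a standard MAP-EM update for mixture proportions under a Dirichlet prior. It is not taken from the MGMR or SEQEM code; the function name `map_em`, the toy alignment matrix, and the restriction to α_i ≥ 1 are illustrative assumptions.

```python
import numpy as np

def map_em(read_gene_prob, alpha, n_iter=200):
    """MAP-EM estimate of gene proportions p for one sample (illustrative).

    read_gene_prob: (R, M) array; entry [j, i] is Pr(read j | gene i),
                    zero where read j does not align to gene i.
    alpha:          (M,) Dirichlet parameters, assumed >= 1 in this sketch.
    """
    n_reads, n_genes = read_gene_prob.shape
    p = np.full(n_genes, 1.0 / n_genes)  # uniform starting point
    for _ in range(n_iter):
        # E-step: responsibility of each gene for each read
        w = read_gene_prob * p
        w /= w.sum(axis=1, keepdims=True)
        # M-step: expected read counts plus prior pseudo-counts (alpha - 1)
        counts = w.sum(axis=0) + (alpha - 1.0)
        p = counts / counts.sum()
    return p

# Toy data (hypothetical): 4 reads, 3 genes.
# Reads 0 and 1 map only to gene 0, read 2 is ambiguous between
# genes 0 and 1, and read 3 maps only to gene 2.
toy = np.array([
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0],
])
p_hat = map_em(toy, alpha=np.ones(3))
```

With a flat prior (all α_i = 1) the update reduces to plain maximum-likelihood EM, so the ambiguous read is progressively attributed to gene 0, which also has uniquely mapped reads.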
3
Testing Strategy 1000 Mention we also did MGMR init
4
Testing strategy
Estimate with SEQEM and MGMR on real samples
Simulate reads based on the 2 sets of estimates
Re-estimate on each initialization
Measure error compared with the gold standard
5
Error measures used
Mention alternative to err_rate: the median
6
Problems
Initial results were bad
Improvement with iterations, but still worse than the SEQEM estimate
SEQEM noise level too high
“table for illustration”; “we’ll talk about what given alpha refers to later…”
7
Questions
Why is the SEQEM error so high (compared with the paper)?
Effect of dataset? Bug?
Highlight actual numbers for tag 32 (EM and weighted), coverage level varied
8
Given alpha, given priors testing strategy
Back-track: check whether there is a bug or whether the model is flawed
Instead of real data, generate data (P vectors) with known characteristics and test the performance of SEQEM, MGMR, and MGMR given the alpha vector (i.e., known alpha with no updates)
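One way the synthetic data could be generated, as a rough sketch: draw each sample's P vector from a Dirichlet with a known alpha, then draw per-gene read counts from a multinomial. The sizes, the seed, the range used for alpha, and the simplification that every read maps to exactly one gene are all assumptions made for illustration, not details from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Illustrative sizes (hypothetical): genes, samples, reads per sample
n_genes, n_samples, n_reads = 50, 10, 100_000

# The "known" alpha vector that a given-alpha run would be handed
alpha = rng.uniform(0.5, 5.0, size=n_genes)

# One P vector per sample, each row summing to 1
P = rng.dirichlet(alpha, size=n_samples)  # shape (n_samples, n_genes)

# Read counts per gene for each sample; this ignores multi-mapping,
# which a fuller simulation would have to model
counts = np.stack([rng.multinomial(n_reads, P[k]) for k in range(n_samples)])

# Naive per-sample estimate, usable as a sanity baseline
P_naive = counts / counts.sum(axis=1, keepdims=True)
```

Because the true P and alpha are known here, any estimator can be scored directly against them, which separates implementation bugs from weaknesses of the model.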
9
Control for everything
Repeat SEQEM steps first:
Use HomoloGene genes
Limit to chromosome 1 (has most genes)
Filter overlapping genes
Keep only homology groups appearing multiple times in human paralogs
Sample from/map to exon sequences, sum over genes
Net effect: lowered error, but it was still too high. For SEQEM, E ~= 0.60, C ~= 0.16
10
Given priors results On original set of 51; averaged at 10:10:50, 100,
11
Real samples results
Colors are re-estimates (1:1:10, 20:10:50, 100); SEQEM (red) is the 100-iteration average (it does not change with iterations)
12
Left edge
High average relative error for low average Chi-diff values
Shows that relative error is very prone to outliers, Chi-diff less so
Example: True = .0001; Est = .0020
Relative error ~= 20
Chi-squared difference ~= .04
Possible alternative: use median of error rates
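The example can be reproduced with a short sketch. The formulas are assumptions inferred from the numbers on the slide: relative error as |est − true| / true and chi-squared difference as (est − true)² / true. The four-gene vectors in the second half are made up to show how a single low-abundance gene affects the mean but not the median.

```python
import numpy as np

def relative_error(true, est):
    # Assumed definition: |est - true| / true, per gene
    return np.abs(est - true) / true

def chi_sq_diff(true, est):
    # Assumed definition: (est - true)^2 / true, per gene
    return (est - true) ** 2 / true

# The slide's example: one low-abundance gene
true_one = np.array([0.0001])
est_one = np.array([0.0020])
rel_one = relative_error(true_one, est_one)[0]  # about 19
chi_one = chi_sq_diff(true_one, est_one)[0]     # about 0.036

# Hypothetical 4-gene sample: one badly estimated rare gene,
# three well estimated abundant genes
true_vec = np.array([0.0001, 0.3, 0.3, 0.3999])
est_vec = np.array([0.0020, 0.29, 0.31, 0.3981])
rel = relative_error(true_vec, est_vec)
mean_rel = rel.mean()        # dominated by the rare gene
median_rel = np.median(rel)  # reflects the typical gene
```

In this toy case the mean relative error is above 4 while the median stays near 0.03, which is the behaviour that motivates the median alternative.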
13
51 Works...now what?
Repeat with a bigger set of genes: all HomoloGene paralogs (that survive filtering)
Has proven difficult: high SEQEM error again in spite of filtering (E: 2.4, C: 1.1, KL: 0.5)
Questions again:
Is average error a good measure?
Does effective length normalization help accuracy?
Test a priori for mappability (with k-mer counts)
Vary population homogeneity
About 300 paralogs survive
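One possible reading of the k-mer mappability test, sketched under assumptions: for each gene, compute the fraction of its distinct k-mers that occur in no other gene in the set. The slides do not say how mappability would be scored, so the function names, the scoring rule, and the toy sequences are all hypothetical.

```python
from collections import Counter

def kmers(seq, k):
    # Set of distinct k-mers in a sequence
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def unique_kmer_fraction(genes, k):
    """genes: dict mapping gene name to sequence.

    Returns, per gene, the fraction of its distinct k-mers that are
    found in no other gene (1.0 means fully distinguishable at this k).
    """
    sets = {name: kmers(seq, k) for name, seq in genes.items()}
    # Number of genes in which each k-mer appears
    owners = Counter(km for s in sets.values() for km in s)
    return {
        name: (sum(owners[km] == 1 for km in s) / len(s)) if s else 0.0
        for name, s in sets.items()
    }

# Toy sequences (hypothetical): A and B share a prefix, C is unrelated
toy_genes = {
    "A": "ACGTACGTTT",
    "B": "ACGTACGAAA",
    "C": "GGGGCCCCGG",
}
frac = unique_kmer_fraction(toy_genes, k=4)
```

Genes with a low fraction share most of their sequence with a paralog, so reads from them are more likely to be ambiguous; in practice k would be set near the read length rather than 4.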
15
RSEM figure MPE, 1 TPM or 1 NPM % sig difference
Testing on all genes instead of only paralogs, but also dealing with isoforms