MGMR progress report, 24/08/11

Likelihood Model
We wish to estimate the relative expression values p_1,…,p_N of M genes in N mRNA samples.
Each sample k is composed of R_k reads (k ∈ [1,N]).
Each read is aligned to a subset of the genes.
We assume Pr(read | gene) is given and that Pr(G) ~ Dirichlet(α_1,…,α_M).
Notation:
i ∈ [1,M] indexes genes
j ∈ [1,R_k] indexes reads
k ∈ [1,N] indexes samples
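The model above can be illustrated for a single sample with a small sketch. It assumes the reads are independent draws from a mixture over genes, so the sample likelihood is the product over reads of sum_i p_i · Pr(read | gene i), and it uses a plain maximum-likelihood EM update with no Dirichlet prior. This is a generic textbook EM step, not necessarily what SEQEM or MGMR implement; the names `log_likelihood`, `em_step`, and the toy `read_probs` matrix are hypothetical.

```python
import math

def log_likelihood(p, read_probs):
    """Log-likelihood of one sample's reads under a mixture over genes.

    p          -- relative expression of the M genes (sums to 1)
    read_probs -- read_probs[j][i] = Pr(read j | gene i); zero where
                  read j does not align to gene i
    """
    total = 0.0
    for row in read_probs:
        # Marginal probability of this read: sum over its candidate genes.
        total += math.log(sum(p_i * q for p_i, q in zip(p, row)))
    return total

def em_step(p, read_probs):
    """One EM update of p (assumption: no Dirichlet prior is applied)."""
    counts = [0.0] * len(p)
    for row in read_probs:
        # E-step: posterior probability that this read came from each gene.
        weights = [p_i * q for p_i, q in zip(p, row)]
        norm = sum(weights)
        for i, w in enumerate(weights):
            counts[i] += w / norm
    # M-step: expected read counts, normalized to proportions.
    n_reads = len(read_probs)
    return [c / n_reads for c in counts]

# Toy input (hypothetical): 4 reads, 2 genes; the second read is ambiguous.
read_probs = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 0.0]]
p0 = [0.5, 0.5]
p1 = em_step(p0, read_probs)
print(p1, log_likelihood(p0, read_probs), log_likelihood(p1, read_probs))
```

Each EM step should leave the log-likelihood no lower than before, which gives a cheap sanity check when hunting for bugs in an estimator of this kind.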

Testing Strategy 1000 Mention we also did MGMR init

Testing strategy
Estimate with SEQEM and MGMR on real samples
Simulate reads based on the 2 sets of estimates
Re-estimate on each initialization
Measure error compared with the gold standard

Error measures used
Mention alternative to err_rate with median

Problems
Initial results were bad
Improvement with iterations, but worse than SEQEM estimate
SEQEM noise level too high
“table for illustration”; “we’ll talk about what given alpha refers to later…”

Questions
Why is SEQEM error so high (compared with paper)?
Effect of dataset?
Bug?
Highlight actual numbers for tag 32 (EM and weighted), coverage level varied

Given alpha, given priors testing strategy
Back-track: check whether there’s a bug or whether the model is flawed
Instead of real data, generate data (P vectors) with known characteristics and test the performance of SEQEM, MGMR, and MGMR given the alpha vector (i.e., known alpha with no updates)
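A minimal sketch of generating P vectors with known characteristics, assuming they are drawn from the Dirichlet prior of the likelihood model with a fixed, known alpha. The function names and the alpha values are hypothetical; the sampler normalizes independent Gamma draws, which is a standard way to sample a Dirichlet.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one vector from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def simulate_p_vectors(alpha, n_samples, seed=0):
    """Return n_samples expression vectors, one per simulated sample."""
    rng = random.Random(seed)  # fixed seed so a run can be reproduced
    return [sample_dirichlet(alpha, rng) for _ in range(n_samples)]

# Hypothetical alpha for 3 genes; a real test would use one entry per gene.
alpha = [0.5, 1.0, 2.0]
for p in simulate_p_vectors(alpha, n_samples=4):
    print([round(x, 4) for x in p])
```

Because alpha is known here, the "MGMR given alpha" variant can be handed the same vector that produced the data, which separates errors in the alpha update from errors in the rest of the model.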

Control for everything
Repeat SEQEM steps first
Use HomoloGene genes
Limit to chromosome 1 (has most genes)
Filter overlapping genes
Keep only homology groups appearing multiple times in human (paralogs)
Sample from/map to exon sequences, sum over genes
Net effect: lowered error, but it was still too high; for SEQEM, E~=0.60, C~=0.16

Given priors results
On original set of 51; averaged at 10:10:50, 100,

Real samples results
Colors are re-estimates; 1:1:10, 20:10:50, 100; SEQEM (red) is the 100-iteration average (not changing with iterations)

Left edge
High relative error average for low Chi-diff average values
Shows relative error is very prone to outliers, Chi-diff less so
Example: True = .0001; Est = .0020
Relative error ~= 20
Chi-squared difference ~= .04
Possible alternative: use median of error rates
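The example numbers can be reproduced with a short sketch. The slides do not spell out the formulas, so the definitions here are assumptions inferred from the example: relative error as |est − true| / true and Chi-diff as (est − true)² / true, which give about 19 and about 0.036 for the values above. The per-gene pairs in the second half are hypothetical and only illustrate the mean versus median point.

```python
from statistics import mean, median

def relative_error(true, est):
    """|est - true| / true (assumed definition); large when true is tiny."""
    return abs(est - true) / true

def chi_diff(true, est):
    """(est - true)^2 / true (assumed definition); one chi-squared term."""
    return (est - true) ** 2 / true

# The slide's example: True = .0001, Est = .0020.
print(relative_error(0.0001, 0.0020), chi_diff(0.0001, 0.0020))

# Hypothetical (true, estimate) pairs: four well-estimated genes plus the
# low-expression outlier from the example.
pairs = [(0.30, 0.31), (0.25, 0.24), (0.25, 0.26),
         (0.1999, 0.1880), (0.0001, 0.0020)]
errors = [relative_error(t, e) for t, e in pairs]

# The mean is dominated by the single outlier; the median is not.
print(mean(errors), median(errors))
```

With these pairs the mean relative error is pulled well above 1 by one gene, while the median stays near the error of the typical gene, which is the case for reporting the median of the error rates.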

51 works... now what?
Repeat with a bigger set of genes: all HomoloGene paralogs (that survive filtering)
Has proven difficult: high SEQEM error again in spite of filtering (E: 2.4, C: 1.1, KL: 0.5)
Questions again:
Is average error a good measure?
Does effective length normalization help accuracy?
Test a priori for mappability (with k-mer counts)
Vary population homogeneity
About 300 paralogs survive

RSEM figure
MPE, 1 TPM or 1 NPM, % sig difference
Testing on all genes instead of only paralogs, but also dealing with isoforms
