Iterative resolution of multi-reads in multiple genomes
Main source for background material and slide backgrounds: Eran Halperin's "Accurate Estimation of Expression Levels of Homologous Genes In RNA-Seq Experiments"
2 SeqEm – Eran Halperin 1/12/11
Microarrays: Known Issues
-Background hybridization
-Genes with low expression levels
-Different hybridization properties
-Relative expression levels
-Limited set of probes
RNA-Seq Procedure
-Isolate Total RNA (e.g. by poly(A) binding)
-Sequence short reads (25-40bp)
-Map to reference genome (Eland, MAQ, BWA, Bowtie, etc.)
-QC, Splice Variants, etc.
-Estimate concentration of mRNA in sample
-Statistics/Analysis
Homologous Genes
http://jura.wi.mit.edu/cgi-bin/young_public/navframe.cgi?s=10&f=genepairs
MULTIREADS - Current Standard
-Discard
-Uniformly distribute
-Map according to unique read distribution (Erange)
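The three strategies named on this slide can be sketched as follows. This is a minimal illustration, assuming each multiread is given as the list of genes it maps to and that per-gene unique read counts are already known. The function name `assign_multireads` and the data layout are hypothetical, and the "proportional" branch is a simplified stand-in for the unique-read weighting idea, not Erange's actual implementation.

```python
from collections import defaultdict

def assign_multireads(unique_counts, multireads, strategy):
    """Return per-gene read counts after handling multireads.

    unique_counts: dict gene -> number of uniquely mapped reads
    multireads: list of lists; each inner list holds the genes a read maps to
    strategy: "discard", "uniform", or "proportional"
    """
    counts = defaultdict(float, unique_counts)
    for genes in multireads:
        if strategy == "discard":
            continue  # multireads contribute nothing
        if strategy == "uniform":
            weights = [1.0] * len(genes)
        elif strategy == "proportional":
            # weight each candidate gene by its unique read count
            weights = [unique_counts.get(g, 0) for g in genes]
            if sum(weights) == 0:
                weights = [1.0] * len(genes)  # no unique reads: fall back to uniform
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        total = sum(weights)
        for g, w in zip(genes, weights):
            counts[g] += w / total
    return dict(counts)

unique = {"A": 30, "B": 10}
multi = [["A", "B"]] * 4  # four reads that map to both genes
by_strategy = {s: assign_multireads(unique, multi, s)
               for s in ("discard", "uniform", "proportional")}
```

With these toy inputs, uniform assignment gives A = 32 and B = 12, while proportional assignment gives A = 33 and B = 11, because each multiread is split 3:1 in favor of the gene with more unique reads.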
Generative Model + Algorithm
Notation:
G = (G1, G2, . . . , Gn) genes
P = (P1, P2, . . . , Pn); ΣPi = 1
R = (r1, r2, . . . , rm) reads
Model for RNA-Seq:
-Choose Gi from distribution P
-Generate short reads: copy (with errors) a random substring of Gi
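The generative model on this slide can be simulated directly. The sketch below assumes a uniformly chosen start position, a fixed read length, and independent per-base substitution errors; none of these details are specified on the slide, and `simulate_reads` is a hypothetical name.

```python
import random

def simulate_reads(genes, p, n_reads, read_len, error_rate, rng):
    """Draw reads from the model: pick gene i with probability p[i],
    copy a random substring of it, and corrupt bases independently."""
    reads = []
    for _ in range(n_reads):
        i = rng.choices(range(len(genes)), weights=p)[0]
        start = rng.randrange(len(genes[i]) - read_len + 1)
        read = []
        for base in genes[i][start:start + read_len]:
            if rng.random() < error_rate:
                # assumed error model: substitute a different base
                base = rng.choice([b for b in "ACGT" if b != base])
            read.append(base)
        reads.append((i, "".join(read)))  # keep the true gene index for checking
    return reads

rng = random.Random(0)
genes = ["".join(rng.choice("ACGT") for _ in range(200)) for _ in range(3)]
reads = simulate_reads(genes, [0.6, 0.3, 0.1], n_reads=1000,
                       read_len=30, error_rate=0.01, rng=rng)
```

Keeping the true gene of origin alongside each read makes it possible to check an estimator's output against known proportions.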
SeqEm
[Figure labels: R, G, P1, P2, P3]
SeqEm: Problem 1
SeqEm: Likelihood
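A sketch of the likelihood implied by the generative model, assuming reads are independent given P. The notation Pr(r_j | G_i), the probability that read r_j arises from gene G_i, is introduced here for illustration and may differ from the notation used in SeqEm.

```latex
\[
L(P) = \prod_{j=1}^{m} \Pr(r_j \mid P)
     = \prod_{j=1}^{m} \sum_{i=1}^{n} P_i \, \Pr(r_j \mid G_i),
\qquad \sum_{i=1}^{n} P_i = 1,\; P_i \ge 0 .
\]
```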
Problem shown to be concave, so EM converges to a global maximum
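A minimal EM sketch for a mixture model of this kind, assuming the per-read probabilities Pr(read j | gene i) are precomputed and that every read has positive probability under at least one gene. The function name `em_expression` and the input layout are hypothetical; this illustrates the general technique rather than SeqEm's code.

```python
def em_expression(read_probs, n_iter=200):
    """read_probs[j][i] = Pr(read j | gene i). Returns estimated proportions P."""
    n = len(read_probs[0])
    p = [1.0 / n] * n  # start from uniform proportions
    for _ in range(n_iter):
        totals = [0.0] * n
        for row in read_probs:
            # E-step: posterior probability that this read came from each gene
            joint = [p[i] * row[i] for i in range(n)]
            z = sum(joint)
            for i in range(n):
                totals[i] += joint[i] / z
        # M-step: new proportions are the expected fraction of reads per gene
        p = [t / len(read_probs) for t in totals]
    return p

# Toy data: 60 reads unique to gene 0, 20 unique to gene 1,
# and 20 multireads that match both genes equally well.
read_probs = [[1.0, 0.0]] * 60 + [[0.0, 1.0]] * 20 + [[1.0, 1.0]] * 20
p_hat = em_expression(read_probs)
```

In this toy case the multireads carry no information of their own, so the estimate converges to the unique-read ratio, P = (0.75, 0.25).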
MGMR motivation Cartoon: http://bulbapedia.bulbagarden.net/wiki/File:126Magmar.png
MGMR intuition
-Assume same gene structures
-Most expression levels expected to be similar...
Cartoon: http://bulbapedia.bulbagarden.net/wiki/File:126Magmar.png
New Generative Model
Notation:
G = (G_1, G_2, . . . , G_M) genes
S = (S_1, S_2, . . . , S_N) samples (i.e., genomes)
P = (P_11, P_12, . . . , P_MN); for each sample k, Σ_i P_ik = 1 (sum over genes)
For the k-th sample, R_k = (r_1, r_2, . . . , r_Rk)
Model for RNA-Seq:
-Sample vector of Ps from Dirichlet distribution
-Ps define probability of sampling each gene
-Generate short reads: copy (with errors) a random substring of the chosen gene
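The first two steps of this model can be sketched in a few lines. The example assumes a single shared alpha vector across samples and stops at the gene of origin for each read, omitting the substring-copying step. `sample_dirichlet` and `simulate_samples` are hypothetical names, and the alpha values are arbitrary.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dir(alpha) via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def simulate_samples(alpha, n_samples, reads_per_sample, rng):
    """For each sample, draw its gene proportions from Dir(alpha), then
    draw the gene of origin for each read from those proportions."""
    result = []
    for _ in range(n_samples):
        p = sample_dirichlet(alpha, rng)
        origins = rng.choices(range(len(alpha)), weights=p, k=reads_per_sample)
        counts = [origins.count(g) for g in range(len(alpha))]
        result.append((p, counts))
    return result

rng = random.Random(1)
samples = simulate_samples([20.0, 10.0, 5.0], n_samples=4,
                           reads_per_sample=500, rng=rng)
```

Larger alpha values tie the per-sample proportions more tightly together, which matches the intuition that most expression levels are similar across samples.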
Why Dirichlet?
-Distribution's parameters (alphas) define a distribution over multinomial parameter vectors (e.g., the P_ik's you draw)
-Conjugate prior of the multinomial distribution, i.e., Mult(x|Θ) Dir(Θ|α) ∝ Dir(Θ|x+α)
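Conjugacy means the posterior update is just addition of counts to the prior parameters. A small worked example, with hypothetical helper names and made-up counts:

```python
def dirichlet_posterior(alpha, counts):
    """Posterior Dirichlet parameters after observing multinomial counts."""
    return [a + x for a, x in zip(alpha, counts)]

def dirichlet_mean(alpha):
    """Mean of Dir(alpha): each alpha_i divided by the sum of all alphas."""
    total = sum(alpha)
    return [a / total for a in alpha]

prior = [2.0, 2.0, 2.0]
counts = [8, 1, 1]
posterior = dirichlet_posterior(prior, counts)   # [10.0, 3.0, 3.0]
posterior_mean = dirichlet_mean(posterior)       # [0.625, 0.1875, 0.1875]
```

The prior acts like pseudo-counts: the posterior mean (0.625 for the first gene) is pulled from the raw frequency 0.8 toward the prior mean 1/3.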
Dirichlet distribution
(Spend time here: it gives intuition, and the math comes next)
-point out that each point is a probability mass function: entries sum to one and are all non-negative
-colors represent values of the pdf
-explain that Γ(n) = (n-1)! for positive integers n; explain where the density should be high/low in each case and why
-right side shows points drawn from this distribution in each case
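To make the "where is the density high or low" point concrete, the log-density can be evaluated at a central point and a near-corner point of the simplex. This sketch uses the standard Dirichlet density; the alpha values and test points are arbitrary choices for illustration.

```python
import math

def dirichlet_log_pdf(x, alpha):
    """Log-density of Dir(alpha) at probability vector x (all entries > 0)."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(xi) for a, xi in zip(alpha, x))

center = [1 / 3, 1 / 3, 1 / 3]
corner = [0.9, 0.05, 0.05]

flat = [1.0, 1.0, 1.0]     # uniform over the simplex
peaked = [5.0, 5.0, 5.0]   # mass concentrated near the center
sparse = [0.2, 0.2, 0.2]   # mass pushed toward the corners

flat_is_constant = math.isclose(dirichlet_log_pdf(center, flat),
                                dirichlet_log_pdf(corner, flat))
peaked_prefers_center = dirichlet_log_pdf(center, peaked) > dirichlet_log_pdf(corner, peaked)
sparse_prefers_corner = dirichlet_log_pdf(corner, sparse) > dirichlet_log_pdf(center, sparse)
```

All three comparisons hold: alphas equal to 1 give a flat density, alphas above 1 favor the center, and alphas below 1 favor the corners.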
Estimating alpha given P
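One simple way to estimate alpha from a set of observed proportion vectors is moment matching, sketched below. This is an assumption for illustration only: the slide does not say which estimator is used, and a maximum-likelihood fit would be an alternative. The name `estimate_alpha_moments` and the simulated data are hypothetical.

```python
import random

def estimate_alpha_moments(ps):
    """Moment-matching estimate of Dirichlet alpha from probability vectors."""
    n, k = len(ps), len(ps[0])
    means = [sum(p[i] for p in ps) / n for i in range(k)]
    variances = [sum((p[i] - means[i]) ** 2 for p in ps) / (n - 1)
                 for i in range(k)]
    # For Dir(alpha): Var[p_i] = m_i (1 - m_i) / (alpha_0 + 1), with alpha_0 = sum(alpha).
    # Solve for alpha_0 from each component and average the results.
    alpha0_estimates = [means[i] * (1 - means[i]) / variances[i] - 1
                        for i in range(k)]
    alpha0 = sum(alpha0_estimates) / k
    return [m * alpha0 for m in means]

# Simulated check: draw proportion vectors from a known Dirichlet and re-estimate.
rng = random.Random(2)
true_alpha = [8.0, 4.0, 2.0]
ps = []
for _ in range(5000):
    g = [rng.gammavariate(a, 1.0) for a in true_alpha]
    ps.append([x / sum(g) for x in g])
alpha_hat = estimate_alpha_moments(ps)
```

With only a handful of samples, as in a real multi-genome experiment, the variance estimates would be noisy, so this estimator is best seen as a starting point.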
Project status
Current status: math done (I hope!), coding...
Plans:
-Simulation: small in silico genomes having known percent of homologs, differential expression
-Compare results of method to discarding reads, uniform assignment, weighted assignment
-Test on real data
-Sanity check: multiple lanes of same subject
-Population studies, e.g. 1000 Genomes Project
-Issue: do more mixed pools lead to less accuracy?
-Deal with SNPs, transcripts instead of genes
-Your suggestions...