Published by Domenic Hardy. Modified over 8 years ago.
1
Multi-Genome Multi-read (MGMR) progress report
Main source for background material and slide backgrounds: Eran Halperin's "Accurate Estimation of Expression Levels of Homologous Genes in RNA-Seq Experiments"
2
RNA-Seq Procedure (1/12/11, SeqEm – Eran Halperin)
● Isolate total RNA (e.g. by poly(A) binding)
● Sequence short reads (25-40 bp)
● Map to reference genome (Eland, MAQ, BWA, Bowtie, etc.)
● Splicing, SNPs, etc.
● Estimate concentration of mRNA in sample
● Statistics/analysis
3
Homologous Genes
4
http://jura.wi.mit.edu/cgi-bin/young_public/navframe.cgi?s=10&f=genepairs
5
MULTIREADS - Current Standard
● Discard: count only unique regions
● Uniformly distribute
● Map according to unique read distribution
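The three strategies listed above can be compared side by side in a small sketch. This is an illustration only: the function name `allocate_multireads`, the strategy labels, and the input layout (one list of candidate gene ids per read) are hypothetical and not taken from the slides.

```python
from collections import Counter

def allocate_multireads(alignments, strategy="proportional"):
    """Return gene -> (possibly fractional) read count.

    alignments: one list of candidate gene ids per read (hypothetical layout).
    strategy: "discard", "uniform", or "proportional" (labels are assumptions).
    """
    if strategy not in ("discard", "uniform", "proportional"):
        raise ValueError("unknown strategy: " + strategy)

    counts = Counter()
    # Uniquely mapped reads count fully under every strategy.
    for genes in alignments:
        if len(genes) == 1:
            counts[genes[0]] += 1.0
    unique = dict(counts)  # snapshot used by the proportional rule

    if strategy == "discard":
        return dict(counts)

    for genes in alignments:
        if len(genes) < 2:
            continue
        if strategy == "uniform":
            weights = [1.0] * len(genes)
        else:
            # Weight each candidate gene by its unique-read count.
            weights = [unique.get(g, 0.0) for g in genes]
            if sum(weights) == 0:
                weights = [1.0] * len(genes)  # no unique evidence: fall back to uniform
        total = sum(weights)
        for g, w in zip(genes, weights):
            counts[g] += w / total
    return dict(counts)
```

With three reads unique to gene A, one unique to gene B, and one read shared by both, the shared read is dropped under "discard", split 0.5/0.5 under "uniform", and split 0.75/0.25 under "proportional".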
6
Generative Model + Algorithm
Notation:
● G = (G1, G2, ..., Gn)
● P = (P1, P2, ..., Pn); Σ Pi = 1
● R = (r1, r2, ..., rm)
Model for RNA-Seq:
● Choose Gi from distribution P
● Generate short reads: copy (with errors) a random substring of Gi
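Under this model each read is drawn from a mixture over genes, so the likelihood of the reads is the product over reads r_j of Σ_i P_i · Pr(r_j | G_i). A minimal EM sketch for estimating P from such a mixture follows. It is a generic mixture-proportion EM under stated assumptions, not necessarily the exact SeqEm implementation: `read_likelihoods[j][i]` is a hypothetical precomputed value of Pr(r_j | G_i), which in practice would come from the alignments.

```python
def em_proportions(read_likelihoods, n_iter=100):
    """Estimate mixture proportions P by EM.

    read_likelihoods[j][i] is assumed to hold Pr(read j | gene i);
    this input layout is hypothetical.
    """
    n_genes = len(read_likelihoods[0])
    p = [1.0 / n_genes] * n_genes  # start from uniform proportions
    for _ in range(n_iter):
        expected = [0.0] * n_genes
        # E-step: split each read across genes by posterior probability.
        for lik in read_likelihoods:
            weights = [p_i * l_i for p_i, l_i in zip(p, lik)]
            total = sum(weights)
            if total == 0:
                continue  # read is incompatible with every gene
            for i, w in enumerate(weights):
                expected[i] += w / total
        # M-step: proportions are the normalized expected read counts.
        norm = sum(expected)
        if norm == 0:
            break
        p = [e / norm for e in expected]
    return p
```

For example, with three reads compatible only with gene 1, one read compatible only with gene 2, and one read equally compatible with both, the estimate converges to P = (0.75, 0.25).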
7
SeqEm (figure labels: R, G, P1, P2, P3)
8
SeqEm
9
SeqEm: Likelihood
12
MGMR motivation
Cartoon: http://bulbapedia.bulbagarden.net/wiki/File:126Magmar.png
- Assume same gene structures
- Most expression levels expected to be similar...
13
New Generative Model
Notation:
● G = (G_1, G_2, ..., G_M): genes
● S = (S_1, S_2, ..., S_N): samples
● P = (P_11, P_12, ..., P_MN); within each sample, the P values sum to 1 over genes
● For the i-th sample, R = (r_1, r_2, ..., r_Ri): reads
Model for RNA-Seq:
● For each sample, draw its vector of Ps from a Dirichlet distribution
● P defines the probability of sampling each gene
● Generate short reads: copy (with errors) a random substring of the sampled gene
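The first step of this model, drawing one expression vector per sample from a shared Dirichlet, can be sketched with the standard library alone by normalizing independent gamma draws. The function names and the example alpha values are hypothetical.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_expression_profiles(alpha, n_samples, seed=0):
    """Draw one expression vector P per sample, all sharing the same alpha."""
    rng = random.Random(seed)
    return [sample_dirichlet(alpha, rng) for _ in range(n_samples)]

# Example: 4 samples over 3 genes (alpha values are illustrative only).
profiles = sample_expression_profiles([5.0, 2.0, 1.0], n_samples=4)
```

Each profile sums to 1, and the profiles scatter around alpha / sum(alpha). A larger sum(alpha) makes the samples more alike, which is how the shared alpha encodes the expectation that most expression levels are similar across samples.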
14
Dirichlet distribution
https://www.ee.washington.edu/techsite/papers/documents/UWEETR-2010-0006.pdf
16
Alternating iterations
● Initialize with a few EM iterations, with alpha a vector of 1s
● Holding P constant, update the alpha vector (1 iteration only; otherwise it explodes)
● Holding the alpha vector constant, update the Ps
● Halt on convergence or after a fixed number of iterations
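The loop structure above can be sketched as follows. Two pieces are stand-ins, since the slides do not give the update formulas: the alpha step uses closed-form moment matching instead of an iterative update, and the P step is a MAP-style EM update with Dirichlet pseudo-counts (alpha - 1), which reduces to plain EM when alpha is all 1s, up to a small floor. All function names and the `read_likelihoods` layout (Pr(read | gene) per read) are hypothetical, and the sketch halts on iteration count only.

```python
def expected_counts(p, read_likelihoods):
    """E-step: expected number of reads per gene under proportions p."""
    counts = [0.0] * len(p)
    for lik in read_likelihoods:
        w = [pi * li for pi, li in zip(p, lik)]
        total = sum(w)
        if total > 0:
            for i, wi in enumerate(w):
                counts[i] += wi / total
    return counts

def update_p(p, read_likelihoods, alpha, eps=1e-9):
    """One MAP-style EM step for one sample, holding alpha fixed."""
    counts = expected_counts(p, read_likelihoods)
    pseudo = [max(c + a - 1.0, eps) for c, a in zip(counts, alpha)]
    total = sum(pseudo)
    return [x / total for x in pseudo]

def update_alpha(profiles, var_floor=1e-12, eps=1e-9):
    """Moment-matching estimate of alpha from per-sample P vectors (stand-in)."""
    n, m = len(profiles), len(profiles[0])
    mean = [sum(p[k] for p in profiles) / n for k in range(m)]
    var = [sum((p[k] - mean[k]) ** 2 for p in profiles) / n for k in range(m)]
    # For a Dirichlet, var_k = mean_k (1 - mean_k) / (s + 1) with s = sum(alpha),
    # so each gene yields an estimate of s; average the usable ones.
    est = [mean[k] * (1 - mean[k]) / var[k] - 1.0
           for k in range(m) if var[k] > var_floor]
    s = max(sum(est) / len(est), eps) if est else float(m)
    return [max(mean[k] * s, eps) for k in range(m)]

def mgmr_sketch(samples, n_init=5, n_outer=10):
    """samples: one read_likelihoods matrix per sample. Returns (alpha, profiles)."""
    m = len(samples[0][0])
    alpha = [1.0] * m
    profiles = [[1.0 / m] * m for _ in samples]
    for _ in range(n_init):  # alpha is all 1s here, so this is plain EM
        profiles = [update_p(p, r, alpha) for p, r in zip(profiles, samples)]
    for _ in range(n_outer):
        alpha = update_alpha(profiles)  # hold P fixed, update alpha once
        profiles = [update_p(p, r, alpha) for p, r in zip(profiles, samples)]
    return alpha, profiles
```

A convergence test, for example on the change in the profiles between outer iterations, could replace the fixed `n_outer`.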
17
Testing Strategy
● Want to see that the method improves upon EM alone (treating lanes independently)
● Estimate expression for each lane with EM alone for 100 iterations: our "gold standard"
● Use expression estimates as lane read distributions; simulate/sample a fixed number of reads from each lane
● Align reads, re-estimate expression with EM and with MGMR, and compare correlations to the initial estimate
18
Testing Strategy
● 60 samples from the Yoruba population, 1000 Genomes Project RNA-Seq data (Pickrell et al., Nature 2010)
● Bowtie alignment, SeqEm estimation => "gold standard"
● Sample reads based on distributions
● Bowtie alignment, re-estimate expression
● Compare correlation to gold standard: MGMR vs. EM
(Figure labels: R, G. Gloves image from http://www.artclips.com)
19
Read Simulation
● Inputs: transcript sequences, expression estimate, error rate, # of reads wanted, read length
● For each gene, expected number of reads is # of reads * EM estimate
● Start positions and orientation are sampled uniformly
● Error positions are sampled uniformly without replacement from the set of all read positions
● Errors are predefined substitutions
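The simulation steps above can be sketched as below. Several details are assumptions: the substitution table is an arbitrary example of "predefined substitutions", "all read positions" is read as every base position across all simulated reads, the per-gene read count is the rounded expected value, and reverse-strand reads are emitted as reverse complements. The function and parameter names are hypothetical.

```python
import random

# Hypothetical fixed substitution table; the slides do not list the actual one.
SUBSTITUTION = {"A": "C", "C": "G", "G": "T", "T": "A"}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq):
    return "".join(COMPLEMENT.get(b, b) for b in reversed(seq))

def simulate_reads(transcripts, expression, n_reads, read_len, error_rate, seed=0):
    """transcripts: name -> sequence; expression: name -> proportion (sums to 1)."""
    rng = random.Random(seed)
    reads = []
    for name, seq in transcripts.items():
        if len(seq) < read_len:
            continue  # transcript too short to yield a full-length read
        # Expected number of reads for this gene, rounded to an integer.
        n_gene = round(n_reads * expression[name])
        for _ in range(n_gene):
            start = rng.randrange(len(seq) - read_len + 1)  # uniform start position
            read = seq[start:start + read_len]
            if rng.random() < 0.5:  # uniform orientation
                read = reverse_complement(read)
            reads.append(list(read))
    # Sample error positions without replacement over all bases of all reads.
    n_positions = len(reads) * read_len
    n_errors = round(error_rate * n_positions)
    for pos in rng.sample(range(n_positions), n_errors):
        r, offset = divmod(pos, read_len)
        base = reads[r][offset]
        reads[r][offset] = SUBSTITUTION.get(base, base)
    return ["".join(r) for r in reads]
```

Because positions are drawn without replacement, the number of substituted bases is fixed at the rounded value of error_rate times the total number of bases, instead of varying from run to run.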
20
Test 1
● Tested size limitations
● Initially tried large input: 1M simulated reads from 5 samples
– Problem: large memory required in order to store q values for all alignments
– Tried to address by decomposing the problem into connected components; found that in each case there exists one large component with ~50% of nodes (reads or edges), so the problem is not easy to decompose without impacting the estimate
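The decomposition attempt can be illustrated with a union-find pass that links any two genes sharing a multiread, so that each resulting component could in principle be estimated independently. This is a sketch under assumptions: the input layout (one list of candidate gene ids per read) and the function name are hypothetical, and only genes are tracked as nodes.

```python
def connected_components(alignments):
    """Group genes that are linked, directly or transitively, by shared reads.

    alignments: one list of candidate gene ids per read (hypothetical layout).
    Returns a list of gene sets, largest first.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for genes in alignments:
        for g in genes:
            find(g)  # register every gene, including uniquely hit ones
        for g in genes[1:]:
            union(genes[0], g)  # a multiread ties its candidate genes together

    groups = {}
    for g in list(parent):
        groups.setdefault(find(g), set()).add(g)
    return sorted(groups.values(), key=len, reverse=True)
```

Comparing the size of the first component against the total number of genes gives a quick check of whether a data set decomposes well.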
21
Test 2
● Designed to test effectiveness of the model before addressing memory issues
● Advantage from the alpha estimate is expected to be greater with more samples
● Sought large-sample, small-genome input: restricted to chromosome 1 transcripts, up to 20 samples
22
Next steps
● Comparisons: EM1 vs. EM2-5; EM1 vs. EM2-100; EM1 vs. MGMR-5,10,15,20; effect of different initial conditions
● Improve implementation speed: currently 5 samples, 10 iterations takes ~12 hrs
● Overcome memory hurdle to handle many samples
● Can merge transcripts to genes
● Break up problem arbitrarily (i.e., to chromosomes if necessary)
28
Estimating alpha given P
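The body of this slide did not survive extraction. As a hedged reminder of one standard formulation, and not necessarily the one used in the talk, the per-sample vectors can be treated as independent draws from Dirichlet(α). Writing P_{gs} for gene g in sample s, with M genes and N samples, the log-likelihood and a commonly used fixed-point update (Ψ is the digamma function) are:

```latex
\log p(P \mid \alpha)
  = N \left[ \log \Gamma\!\Big( \sum_{g=1}^{M} \alpha_g \Big)
  - \sum_{g=1}^{M} \log \Gamma(\alpha_g)
  + \sum_{g=1}^{M} (\alpha_g - 1)\, \overline{\log P_g} \right],
\qquad
\overline{\log P_g} = \frac{1}{N} \sum_{s=1}^{N} \log P_{gs}

\Psi\big(\alpha_g^{\text{new}}\big)
  = \Psi\Big( \sum_{h=1}^{M} \alpha_h^{\text{old}} \Big) + \overline{\log P_g}
```

Each update requires inverting Ψ numerically, for example with a few Newton steps.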
29
Microarrays: Known Issues
● Background hybridization
● Genes with low expression levels
● Different hybridization properties
● Relative expression levels
● Limited set of probes