Multi-Genome Multi-read (MGMR) progress report. Main source for background material and slide backgrounds: Eran Halperin's Accurate Estimation of Expression Levels of Homologous Genes in RNA-Seq Experiments.


1 Multi-Genome Multi-read (MGMR) progress report. Main source for background material and slide backgrounds: Eran Halperin's Accurate Estimation of Expression Levels of Homologous Genes in RNA-Seq Experiments

2 RNA-Seq Procedure 1/12/11 SeqEm – Eran Halperin
● Isolate total RNA (e.g. by poly(A) binding), sequence short reads (25-40 bp)
● Map to reference genome (Eland, MAQ, BWA, Bowtie, etc.)
● Splicing, SNPs, etc.
● Estimate concentration of mRNA in sample
● Statistics/analysis

3 Homologous Genes

4 http://jura.wi.mit.edu/cgi-bin/young_public/navframe.cgi?s=10&f=genepairs

5 Multireads - Current Standard
● Discard: count only unique regions
● Distribute uniformly across the mapped locations
● Map according to the distribution of unique reads
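
The three conventions above can be compared on a toy input. The following is a minimal sketch, not code from the presentation: the function name `count_reads`, the strategy labels, and the fallback to a uniform split when no candidate gene has unique reads are all assumptions made for illustration.

```python
from collections import Counter

STRATEGIES = ("discard", "uniform", "unique_proportional")  # hypothetical labels

def count_reads(alignments, strategy="unique_proportional"):
    """Return per-gene read counts.

    alignments: one list per read, holding the gene IDs that read maps to.
    """
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy}")
    # Reads with exactly one hit are unique; every strategy counts them fully.
    unique = Counter(hits[0] for hits in alignments if len(hits) == 1)
    counts = Counter({gene: float(n) for gene, n in unique.items()})
    for hits in alignments:
        if len(hits) <= 1 or strategy == "discard":
            continue  # unique reads are already counted; "discard" drops multireads
        if strategy == "uniform":
            weights = [1.0] * len(hits)
        else:
            # Weight each candidate gene by its number of unique reads.
            weights = [float(unique[gene]) for gene in hits]
            if sum(weights) == 0:
                weights = [1.0] * len(hits)  # assumption: fall back to a uniform split
        total = sum(weights)
        for gene, weight in zip(hits, weights):
            counts[gene] += weight / total
    return dict(counts)
```

With three unique reads for gene A, one for gene B, and one read mapping to both, the multiread is dropped, split half and half, or split 3:1, depending on the strategy.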

6 Generative Model + Algorithm
● Notation:
– G = (G1, G2, ..., Gn)
– P = (P1, P2, ..., Pn); Σ Pi = 1
– R = (r1, r2, ..., rm)
● Model for RNA-Seq:
– Choose Gi from distribution P
– Generate short reads: copy (with errors) a random substring of Gi
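
The generative model can be written as a short simulation. This is a sketch under stated assumptions rather than the SeqEm implementation: the function name `simulate_reads`, the uniform choice of start position, and the independent per-base substitution errors are illustrative choices not specified on the slide.

```python
import random

def simulate_reads(genes, probs, n_reads, read_len, error_rate=0.01, seed=0):
    """Draw reads from the model: pick gene i with probability probs[i],
    then copy a random substring of it with per-base errors.

    Assumes every gene is at least read_len long and uses the alphabet ACGT.
    Returns a list of (gene_index, read_sequence) pairs.
    """
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        # Choose G_i from distribution P.
        i = rng.choices(range(len(genes)), weights=probs, k=1)[0]
        seq = genes[i]
        # Random substring: start position drawn uniformly (assumption).
        start = rng.randrange(len(seq) - read_len + 1)
        read = list(seq[start:start + read_len])
        # Copy with errors: each base is substituted independently (assumption).
        for pos in range(read_len):
            if rng.random() < error_rate:
                read[pos] = rng.choice([b for b in "ACGT" if b != read[pos]])
        reads.append((i, "".join(read)))
    return reads
```

Setting `error_rate=0.0` returns exact substrings, which is a convenient sanity check.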

7 SeqEm
[Figure; labels: R, G, P1, P2, P3]

8 SeqEm

9 SeqEm: Likelihood
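
The slide body did not survive extraction. Under the generative model on slide 6, the likelihood of the reads is the product over reads j of Σ_i P_i · q_ij, where q_ij is the probability of read j given gene G_i. The sketch below evaluates that log-likelihood and maximizes it with a textbook EM for a mixture with known components. The names `log_likelihood` and `em`, the sparse dictionary layout for q, and the uniform starting point are assumptions for illustration, not the presenter's code.

```python
import math

def log_likelihood(q, p):
    """q: one dict per read, mapping gene index -> q_ij for that read's alignments."""
    return sum(math.log(sum(p[i] * qij for i, qij in row.items())) for row in q)

def em(q, n_genes, n_iter=100):
    """Estimate P by EM, starting from the uniform distribution (assumption)."""
    p = [1.0 / n_genes] * n_genes
    for _ in range(n_iter):
        expected = [0.0] * n_genes
        for row in q:
            # E-step: posterior probability that this read came from each gene.
            denom = sum(p[i] * qij for i, qij in row.items())
            for i, qij in row.items():
                expected[i] += p[i] * qij / denom
        # M-step: P_i is the expected fraction of reads assigned to gene i.
        total = sum(expected)
        p = [count / total for count in expected]
    return p
```

Each EM iteration cannot decrease the log-likelihood, so comparing `log_likelihood` before and after is a cheap correctness check.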

10

11

12 MGMR motivation
Cartoon: http://bulbapedia.bulbagarden.net/wiki/File:126Magmar.png
- Assume same gene structures
- Most expression levels expected to be similar...

13 New Generative Model
● Notation:
– G = (G_1, G_2, ..., G_M) genes
– S = (S_1, S_2, ..., S_N) samples
– P = (P_11, P_12, ..., P_MN); for each sample, the Ps sum to 1 over genes
– For the i-th sample, R = (r_1, r_2, ..., r_Ri)
● Model for RNA-Seq:
– For each sample, draw a vector of Ps from a Dirichlet distribution
– P defines the probability of sampling each gene
– Generate short reads: copy (with errors) a random substring of G
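
A minimal sketch of the multi-sample model follows, assuming a single Dirichlet parameter vector `alpha` shared by all samples. The function names are hypothetical. The Dirichlet draw uses the standard construction of normalizing independent Gamma variates, and read generation is reduced to recording which gene each read came from.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def simulate_samples(alpha, n_samples, reads_per_sample, seed=0):
    """Return one (P vector, gene index per read) pair for each sample."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        # Each sample gets its own expression vector from the shared Dirichlet.
        p = sample_dirichlet(alpha, rng)
        # P defines the probability of sampling each gene for every read.
        gene_ids = rng.choices(range(len(alpha)), weights=p, k=reads_per_sample)
        samples.append((p, gene_ids))
    return samples
```

Larger entries of `alpha` make the per-sample vectors cluster more tightly around a common mean, which matches the motivation that most expression levels are expected to be similar across samples.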

14 Dirichlet distribution https://www.ee.washington.edu/techsite/papers/documents/UWEETR-2010-0006.pdf

15

16 Alternating iterations
● Initialize with a few EM iterations, with alpha a vector of 1s
● Holding P constant, update the alpha vector (1 iteration; otherwise it explodes)
● Holding the alpha vector constant, update the Ps
● Halt on convergence or after a fixed number of iterations
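
The step "holding the alpha vector constant, update the Ps" could look like the sketch below for one sample. The slides do not give the update rule, so this assumes a MAP-style EM step in which the Dirichlet parameters act as pseudo-counts (expected count + alpha - 1, clipped at zero). The name `update_p` and the sparse layout of q (one dict per read, gene index -> probability of the read given that gene) are illustrative.

```python
def update_p(q, p, alpha):
    """One EM step for a single sample with a fixed Dirichlet(alpha) prior.

    Assumes at least one gene ends up with a positive pseudo-count.
    """
    n_genes = len(p)
    expected = [0.0] * n_genes
    for row in q:
        # E-step: split each read among its candidate genes.
        denom = sum(p[i] * qij for i, qij in row.items())
        for i, qij in row.items():
            expected[i] += p[i] * qij / denom
    # M-step (assumed MAP form): add alpha - 1 pseudo-counts, clip negatives.
    pseudo = [max(count + a - 1.0, 0.0) for count, a in zip(expected, alpha)]
    total = sum(pseudo)
    return [x / total for x in pseudo]
```

With alpha set to a vector of 1s the pseudo-counts vanish and the step reduces to plain EM, which is consistent with initializing by a few EM iterations.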

17 Testing Strategy
● Want to see that the method improves upon EM alone (treating lanes independently)
● Estimate expression for each lane with EM alone for 100 iterations: our "gold standard"
● Use the expression estimates as lane read distributions; simulate/sample a fixed number of reads from each lane
● Align reads, re-estimate expression with EM and with MGMR, and compare correlations to the initial estimate
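
The final comparison step can be sketched as a correlation between each re-estimate and the gold-standard vector. The slides do not say which correlation coefficient is used; Pearson correlation is assumed here, and the function name is illustrative.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length expression vectors.

    Assumes neither vector is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)
```

A method improves on EM alone if its re-estimate correlates more strongly with the gold standard than the EM re-estimate does on the same simulated reads.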

18 Testing Strategy
● 60 samples from the Yoruba population, 1000 Genomes Project RNA-Seq data (Pickrell et al., Nature 2010)
● Bowtie alignment, SeqEm estimation => "gold standard"
● Sample reads based on distributions
● Bowtie alignment, re-estimate expression
● Compare correlation to gold standard: MGMR vs. EM
[Figure; labels: R, G. Gloves from http://www.artclips.com]

19 Read Simulation
● Inputs: transcript sequences, expression estimate, error rate, # of reads wanted, read length
● For each gene, the expected number of reads is # of reads * EM estimate
● Start positions and orientation are sampled uniformly
● Error positions are sampled uniformly without replacement from the set of all read positions
● Errors are predefined substitutions
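
The simulation steps above can be sketched as follows. This is not the presenter's simulator. The substitution table `SUBSTITUTION` is a made-up stand-in for the "predefined substitutions", the per-gene read count is rounded rather than sampled, and the function names are hypothetical.

```python
import random

SUBSTITUTION = {"A": "C", "C": "G", "G": "T", "T": "A"}  # hypothetical predefined errors
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def simulate_from_expression(transcripts, expression, n_reads, read_len,
                             error_rate, seed=0):
    """transcripts: list of sequences; expression: matching list of fractions."""
    rng = random.Random(seed)
    reads = []
    for seq, fraction in zip(transcripts, expression):
        # Expected number of reads for this gene (rounded here; an assumption).
        for _ in range(round(n_reads * fraction)):
            start = rng.randrange(len(seq) - read_len + 1)  # uniform start
            read = seq[start:start + read_len]
            if rng.random() < 0.5:                          # uniform orientation
                read = reverse_complement(read)
            reads.append(list(read))
    # Error positions: drawn without replacement from all positions of all reads.
    total_positions = len(reads) * read_len
    n_errors = round(error_rate * total_positions)
    for flat_index in rng.sample(range(total_positions), n_errors):
        read_index, pos = divmod(flat_index, read_len)
        reads[read_index][pos] = SUBSTITUTION[reads[read_index][pos]]
    return ["".join(read) for read in reads]
```

Sampling error positions without replacement fixes the total number of errors exactly, unlike an independent per-base error model.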

20 Test 1
● Tested size limitations
● Initially tried a large input: 1M simulated reads from 5 samples
– Problem: large memory is required to store the q values for all alignments
– Tried to address this by decomposing the problem into connected components, but found that in each case there is one large component with ~50% of nodes (reads or edges); the problem is not easy to decompose without impacting the estimate
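
The decomposition attempt can be illustrated with a union-find pass over the read-to-gene alignment graph, where two reads fall in the same component whenever they are linked through shared genes. This is a sketch assuming a bipartite graph of reads and genes; the function name and input layout are illustrative, not taken from the presentation.

```python
def connected_components(alignments):
    """alignments: one list of gene IDs per read.

    Returns lists of read indices, one per component, largest first.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Link every read to each gene it aligns to.
    for read_index, hits in enumerate(alignments):
        for gene in hits:
            union(("read", read_index), ("gene", gene))

    groups = {}
    for read_index, hits in enumerate(alignments):
        if hits:  # unaligned reads belong to no component
            groups.setdefault(find(("read", read_index)), []).append(read_index)
    return sorted(groups.values(), key=len, reverse=True)
```

Components can be estimated independently, so memory use is bounded by the largest one; a single component holding about half the nodes limits how much this helps.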

21 Test 2
● Designed to test the effectiveness of the model before addressing memory issues
● The advantage from the alpha estimate is expected to be greater with more samples
● Sought a large-sample, small-genome input: restricted to chromosome 1 transcripts, up to 20 samples

22 Next steps
● Comparisons: EM1 vs. EM2-5; EM1 vs. EM2-100; EM1 vs. MGMR-5,10,15,20; effect of different initial conditions
● Improve implementation speed: currently 5 samples, 10 iterations takes ~12 hrs
● Overcome the memory hurdle to handle many samples
● Can merge transcripts to genes
● Break up the problem arbitrarily (i.e., to chromosomes if necessary)

23

24

25

26

27

28 Estimating alpha given P
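
The slide body is missing, so the update actually used is unknown. One standard way to estimate Dirichlet parameters from observed probability vectors is Minka's fixed-point iteration, psi(alpha_k_new) = psi(sum of alpha_old) + mean over samples of log P_k, where psi is the digamma function. The sketch below performs a single such step in pure Python. All function names are illustrative, and the choice of this particular update is an assumption.

```python
import math

def digamma(x):
    """psi(x) for x > 0: recurrence up to x >= 6, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - f * (1.0 / 12 - f * (1.0 / 120 - f / 252))

def trigamma(x):
    """Derivative of digamma for x > 0, computed the same way."""
    result = 0.0
    while x < 6.0:
        result += 1.0 / (x * x)
        x += 1.0
    f = 1.0 / (x * x)
    return result + 1.0 / x + f / 2 + (f / x) * (1.0 / 6 - f * (1.0 / 30 - f / 42))

def inverse_digamma(y, n_newton=5):
    """Solve digamma(x) = y by Newton's method from Minka's starting point."""
    x = math.exp(y) + 0.5 if y >= -2.22 else -1.0 / (y + 0.5772156649015329)
    for _ in range(n_newton):
        x -= (digamma(x) - y) / trigamma(x)
    return x

def alpha_step(alpha, p_samples):
    """One fixed-point update of alpha, holding the per-sample P vectors fixed.

    p_samples: one probability vector per sample; all entries must be > 0.
    """
    n = len(p_samples)
    psi_sum = digamma(sum(alpha))
    new_alpha = []
    for k in range(len(alpha)):
        mean_log = sum(math.log(p[k]) for p in p_samples) / n
        new_alpha.append(inverse_digamma(psi_sum + mean_log))
    return new_alpha
```

Genes whose estimated P is exactly zero in some sample would need a small floor before taking logarithms.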

29 Microarrays: Known Issues
● Background hybridization
● Genes with low expression levels
● Different hybridization properties
● Relative expression levels
● Limited set of probes

