Iterative resolution of multi-reads in multiple genomes


Iterative resolution of multi-reads in multiple genomes
Main source for background material and slide backgrounds: Eran Halperin's "Accurate Estimation of Expression Levels of Homologous Genes in RNA-Seq Experiments"

SeqEm – Eran Halperin, 1/12/11

Microarrays: Known Issues
- Background hybridization
- Genes with low expression levels
- Different hybridization properties
- Relative expression levels
- Limited set of probes

RNA-Seq Procedure
1. Isolate total RNA (e.g., by poly(A) binding)
2. Sequence short reads (25-40 bp)
3. Map to a reference genome (Eland, MAQ, BWA, Bowtie, etc.)
4. QC, splice variants, etc.
5. Estimate the concentration of mRNA in the sample
6. Statistics/analysis


Homologous Genes

http://jura.wi.mit.edu/cgi-bin/young_public/navframe.cgi?s=10&f=genepairs

Multireads: Current Standard
- Discard
- Distribute uniformly
- Map according to the unique-read distribution (Erange)

Generative Model + Algorithm
Notation:
- G = (G1, G2, . . . , Gn) genes
- P = (P1, P2, . . . , Pn); ΣPi = 1
- R = (r1, r2, . . . , rm) reads
Model for RNA-Seq:
- Choose a gene Gi from the distribution P
- Generate a short read: copy (with errors) a random substring of Gi
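The generative model above can be sketched in a few lines of Python. This is a toy illustration only: the function name, read length, and uniform per-base error model are assumptions, not taken from the slides.

```python
import random

def simulate_reads(genes, probs, num_reads, read_len=30, error_rate=0.01):
    """Toy RNA-Seq generator: draw a gene from P, then copy a random
    substring of that gene, flipping each base with probability error_rate."""
    alphabet = "ACGT"
    reads = []
    for _ in range(num_reads):
        # Choose gene Gi with probability Pi
        gene = random.choices(genes, weights=probs, k=1)[0]
        # Copy a random substring of Gi
        start = random.randrange(len(gene) - read_len + 1)
        read = list(gene[start:start + read_len])
        # Introduce sequencing errors
        for i in range(read_len):
            if random.random() < error_rate:
                read[i] = random.choice(alphabet)
        reads.append("".join(read))
    return reads
```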

SeqEm
[Diagram: reads R assigned to genes G with abundance parameters P1, P2, P3]

SeqEm: Problem 1

SeqEm: Likelihood
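The likelihood formula appears only as an image on this slide. For the model above, in which read r_j is drawn from gene G_i with probability Pr(r_j | G_i), the likelihood of the abundances P takes the standard mixture form (a reconstruction consistent with the model, not copied from the slide):

```latex
L(P; R) = \prod_{j=1}^{m} \sum_{i=1}^{n} P_i \, \Pr(r_j \mid G_i)
```

Taking the log gives a sum of logarithms of linear functions of P, which is concave in P, consistent with the next slide's claim that EM reaches the global maximum.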

The problem is shown to be concave – EM converges to the global maximum.
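A minimal EM sketch for this model (hypothetical function and variable names; the mapping likelihoods Pr(r_j | G_i) are assumed to be precomputed, e.g. from alignment mismatch counts):

```python
import numpy as np

def seqem(read_probs, num_iters=100, tol=1e-9):
    """EM for relative abundances P.
    read_probs[j][i] = Pr(read j | gene i), 0 where read j does not map to gene i.
    Returns the estimated abundance vector P."""
    L = np.asarray(read_probs, dtype=float)   # shape: (m reads, n genes)
    m, n = L.shape
    P = np.full(n, 1.0 / n)                   # start from uniform abundances
    for _ in range(num_iters):
        # E-step: posterior responsibility of each gene for each read
        weighted = L * P                      # broadcasts P across reads
        resp = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: abundances proportional to expected read counts
        P_new = resp.sum(axis=0) / m
        if np.abs(P_new - P).max() < tol:
            return P_new
        P = P_new
    return P
```

Unique reads pull abundance toward their gene, and each multiread splits its weight in proportion to the current estimates, which is what makes the iteration self-consistent.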


MGMR motivation
Cartoon: http://bulbapedia.bulbagarden.net/wiki/File:126Magmar.png

MGMR intuition
- Assume the same gene structures
- Most expression levels are expected to be similar...
http://bulbapedia.bulbagarden.net/wiki/File:126Magmar.png

New Generative Model
Notation:
- G = (G_1, G_2, . . . , G_M) genes
- S = (S_1, S_2, . . . , S_N) samples (i.e., genomes)
- P = (P_11, P_12, . . . , P_MN); for each sample, Σ(genes) P = 1
- For the i-th sample, R = (r_1, r_2, . . . , r_Ri) reads
Model for RNA-Seq:
- Sample a vector of Ps from a Dirichlet distribution
- The Ps define the probability of sampling each gene
- Generate short reads: copy (with errors) a random substring of the chosen gene
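The sampling side of this model can be sketched with NumPy (illustrative only; the alpha vector, seed, and read counts are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_abundances(alpha, num_samples):
    """Draw one abundance vector per sample from Dirichlet(alpha).
    Returns an array of shape (N samples, M genes); each row sums to 1."""
    return rng.dirichlet(alpha, size=num_samples)

def reads_per_gene(P, reads_per_sample):
    """For each sample, draw how many of its reads come from each gene
    (a multinomial draw with that sample's abundance vector)."""
    return np.array([rng.multinomial(reads_per_sample, p) for p in P])
```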

Why Dirichlet?
- The distribution's parameters (alphas) define a distribution over multinomials (e.g., the P_ik's you draw)
- It is the conjugate prior of the multinomial distribution, i.e., Mult(x|Θ) Dir(Θ|α) ∝ Dir(Θ|x+α)
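Conjugacy means the posterior update is just pseudo-count addition. A small NumPy check (the alpha and count values below are illustrative):

```python
import numpy as np

# Dirichlet-multinomial conjugacy: observing counts x under Mult(Θ)
# with prior Dir(α) gives posterior Dir(α + x).
alpha = np.array([1.0, 2.0, 3.0])   # prior pseudo-counts (illustrative)
x = np.array([10, 0, 5])            # observed gene counts (illustrative)
posterior_alpha = alpha + x
posterior_mean = posterior_alpha / posterior_alpha.sum()
```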

Dirichlet distribution
Speaker notes: spend time here, because this slide builds intuition before the math.
- Point out that each point on the simplex is a probability mass function: it sums to one and all components are positive.
- The colors represent values of the pdf.
- Explain that Γ(n) is the factorial of n-1, and where the density should be high/low in each case and why.
- The right side shows points drawn from the distribution in each case.

Estimating alpha given P
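One simple starting point is a method-of-moments estimate of alpha from observed abundance vectors; this sketch (hypothetical function name) is the kind of estimate commonly used to initialize a maximum-likelihood fixed-point iteration:

```python
import numpy as np

def dirichlet_mom(P):
    """Method-of-moments estimate of the Dirichlet alpha vector.
    P: array whose rows are abundance vectors, each summing to 1.
    Uses E[p_k] = alpha_k / alpha0 and the variance of the first
    component to recover the concentration alpha0."""
    P = np.asarray(P, dtype=float)
    m = P.mean(axis=0)
    v = P.var(axis=0)
    # Concentration from the first component: Var(p) = m(1-m)/(alpha0+1)
    alpha0 = m[0] * (1.0 - m[0]) / v[0] - 1.0
    return alpha0 * m
```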

Project status
Current status: math done (I hope!), coding...
Plans:
- Simulation: small in silico genomes with a known percentage of homologs and differential expression
- Compare the method's results to discarding reads, uniform assignment, and weighted assignment
- Test on real data:
  - Sanity check: multiple lanes from the same subject
  - Population studies, e.g., the 1000 Genomes project
- Issue: do more mixed pools lead to less accuracy?
- Handle SNPs; work with transcripts instead of genes
- Your suggestions...