MGMR progress report, 24/08/11



Likelihood model
We wish to estimate the relative expression values p_1,…,p_M of M genes in each of N mRNA samples.
Sample k is composed of R_k reads (k ∈ [1,N]).
Each read is aligned to a subset of the genes.
We assume Pr(read | gene) is given and that Pr(G) ~ Dirichlet(α_1,…,α_M).
Notation: i ∈ [1,M] indexes genes; j ∈ [1,R_k] indexes reads; k ∈ [1,N] indexes samples.
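To make the model concrete, here is a minimal sketch of one EM update for a single sample under the assumptions stated on this slide: Pr(read | gene) is given and the gene proportions have a Dirichlet prior. It is a generic MAP-EM illustration, not necessarily the update used by SEQEM or MGMR. The function name `em_step`, the toy read likelihoods, and the restriction to α_i >= 1 are assumptions made for the example.

```python
def em_step(p, read_likelihoods, alpha):
    """One MAP-EM update of the gene proportions p for a single sample.

    read_likelihoods[j][i] is Pr(read j | gene i); it is 0 for genes the
    read does not align to. alpha[i] is the Dirichlet hyperparameter for
    gene i (assumed >= 1 here so the update stays non-negative).
    """
    m = len(p)
    counts = [0.0] * m
    for lik in read_likelihoods:
        # E-step: posterior probability that this read came from each gene.
        weights = [p[i] * lik[i] for i in range(m)]
        total = sum(weights)
        if total == 0.0:
            continue  # read is incompatible with every gene under current p
        for i in range(m):
            counts[i] += weights[i] / total
    # M-step: expected read counts plus Dirichlet pseudo-counts, normalized.
    post = [counts[i] + alpha[i] - 1.0 for i in range(m)]
    norm = sum(post)
    return [x / norm for x in post]


# Toy data (hypothetical): 3 genes, 4 reads, the last read is a multi-read.
reads = [
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.0],
]
alpha = [1.0, 1.0, 1.0]  # flat prior, so this reduces to plain maximum likelihood
p_hat = [1.0 / 3.0] * 3
for _ in range(50):
    p_hat = em_step(p_hat, reads, alpha)
```

With a flat prior the multi-read is split between genes 1 and 2 in proportion to their current estimates, so the iteration settles near 2/3 and 1/3 for the first two genes and 0 for the third.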

Testing strategy
1000
Mention that we also did MGMR init

Testing strategy
Estimate expression on real samples with SEQEM and MGMR
Simulate reads based on the 2 sets of estimates
Re-estimate on each initialization
Measure error compared with the gold standard

Error measures used
Mention the median as an alternative to err_rate

Problems
Initial results were bad
Improvement with iterations, but worse than the SEQEM estimate
SEQEM noise level too high
Notes: "table for illustration"; "we'll talk about what given alpha refers to later…"

Questions
Why is the SEQEM error so high (compared with the paper)?
Effect of dataset?
Bug?
Highlight actual numbers for tag 32 (EM and weighted), coverage level varied

Given alpha, given priors testing strategy
Back-track: check whether there is a bug or the model is flawed
Instead of real data, generate data (P vectors) with known characteristics and test the performance of SEQEM, MGMR, and MGMR given the alpha vector (i.e., known alpha with no updates)
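One way to generate P vectors with known characteristics is to draw them from a Dirichlet distribution with a chosen alpha and then sample each read's gene of origin from the drawn vector. The sketch below is an assumption about how such a test could be set up, not the procedure actually used for these slides. The alpha values, the number of samples, the read count, and the function names are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one proportion vector from Dirichlet(alpha) via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def simulate_gene_of_origin(p, n_reads, rng):
    """Pick the gene each simulated read comes from, with probabilities p."""
    return rng.choices(range(len(p)), weights=p, k=n_reads)


rng = random.Random(0)          # fixed seed so the run is repeatable
alpha = [5.0, 2.0, 1.0, 0.5]    # hypothetical "known alpha"
p_vectors = [sample_dirichlet(alpha, rng) for _ in range(3)]   # one P vector per sample
origins = simulate_gene_of_origin(p_vectors[0], 1000, rng)     # reads for the first sample
```

Because alpha and every P vector are known, an estimator's output can be compared directly against them, which separates a bug in the code from a flaw in the model.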

Control for everything
Repeat SEQEM steps first
Use HomoloGene genes
Limit to chromosome 1 (has most genes)
Filter overlapping genes
Keep only homology groups appearing multiple times in human (i.e., paralogs)
Sample from/map to exon sequences, sum over genes
Net effect: error was lowered but was still too high; for SEQEM, E ≈ 0.60, C ≈ 0.16

Given priors results
On original set of 51; averaged at 10:10:50, 100,

Real samples results
Colors are re-estimates; 1:1:10, 20:10:50, 100
SEQEM (red) is the 100-iteration average (not changing with iterations)

Left edge
High relative error average for low Chi-diff average values
Shows that relative error is very prone to outliers, Chi-diff less so
Example: True = .0001; Est = .0020
Relative error ~= 20
Chi-squared difference ~= .04
Possible alternative: use median of error rates
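The example numbers can be reproduced with a short sketch. The formulas are assumptions inferred from the figures on this slide: relative error as |est - true| / true and chi-squared difference as (est - true)^2 / true. Only the first true/estimate pair comes from the slide; the other two pairs are hypothetical and are there to show how one low-expression gene dominates the mean but not the median.

```python
from statistics import mean, median


def relative_error(true, est):
    """Assumed definition: absolute deviation scaled by the true value."""
    return abs(est - true) / true


def chi_sq_diff(true, est):
    """Assumed definition: squared deviation scaled by the true value."""
    return (est - true) ** 2 / true


# First pair is the slide's example; the remaining pairs are hypothetical.
true_vals = [0.0001, 0.2999, 0.7000]
est_vals = [0.0020, 0.2990, 0.6990]

rel = [relative_error(t, e) for t, e in zip(true_vals, est_vals)]
chi = [chi_sq_diff(t, e) for t, e in zip(true_vals, est_vals)]

print("relative errors:", rel)
print("chi-squared differences:", chi)
print("mean relative error:", mean(rel))
print("median relative error:", median(rel))
```

For the slide's pair these definitions give a relative error of about 19 and a chi-squared difference of about 0.036, in line with the ~20 and ~.04 quoted above. The mean relative error is pulled up by that single gene, while the median stays close to the error of the well-estimated genes.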

51 works... now what?
Repeat with a bigger set of genes: all HomoloGene paralogs (that survive filtering)
Has proven difficult: high SEQEM error again in spite of filtering (E: 2.4, C: 1.1, KL: 0.5)
Questions again: is average error a good measure? Does effective length normalization help accuracy?
Test a priori for mappability (with k-mer counts)
Vary population homogeneity
About 300 paralogs survive
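One plausible reading of "test a priori for mappability (with k-mer counts)" is to measure, for each gene, what fraction of its k-mers occur in no other gene in the set. The sketch below illustrates that idea only; it is not the test used for these slides. The sequences, the choice k = 4, and the function names are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """All overlapping substrings of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def unique_kmer_fraction(sequences, k):
    """For each gene, the fraction of its distinct k-mers found in no other gene."""
    gene_sets = {name: set(kmers(seq, k)) for name, seq in sequences.items()}
    owners = Counter()  # number of genes containing each k-mer
    for ks in gene_sets.values():
        owners.update(ks)
    fractions = {}
    for name, ks in gene_sets.items():
        if not ks:
            fractions[name] = 0.0  # sequence shorter than k
            continue
        fractions[name] = sum(1 for km in ks if owners[km] == 1) / len(ks)
    return fractions


# Toy sequences (hypothetical): geneA and geneB share most of their k-mers.
toy = {
    "geneA": "ACGTACGTTT",
    "geneB": "ACGTACGTAA",
    "geneC": "GGGCCCGGGA",
}
fractions = unique_kmer_fraction(toy, 4)
```

Genes with a low unique fraction, such as the two near-identical toy paralogs, are the ones where reads are most likely to be multi-reads, so they could be flagged before any estimation is run.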

RSEM figure
MPE, 1 TPM or 1 NPM
% sig difference
Testing on all genes instead of only paralogs, but also dealing with isoforms