Statistical Concepts and Methodologies for Data Analyses Benilton Carvalho Computational Biology and Statistics Group Department of Oncology University.

Statistical Concepts and Methodologies for Data Analyses Benilton Carvalho Computational Biology and Statistics Group Department of Oncology University of Cambridge

FROM RANDOM VARIABLES TO HYPOTHESIS TESTING

Random Variables Function that associates probability to: – Countable items (discrete random variable): Tumor vs. Normal; Yes vs. No; Head vs. Tail; – Uncountable items (continuous random variable): Log-expression; weight; height; Characterized by a distribution function: – Bernoulli; Binomial; Geometric; Negative-Binomial; Poisson; – Normal; Student’s t; Gamma;

Examples – Discrete Distributions

Examples – Continuous Distributions

Common Uses of Different Distributions Bernoulli: probability of success in a single trial; Binomial: probability of K successes in N trials; Geometric: probability of K failures before the 1st success; Negative-Binomial: probability of K failures before the R-th success; Poisson: probability of K rare events in a fixed interval;
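The probability functions behind these uses can be sketched with the Python standard library. This is a minimal illustration, not part of the original slides; the function names and the parameter names (`n`, `p`, `r`, `lam`) are assumptions chosen for the example.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """P(K = k successes in n independent trials, success probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def geometric_pmf(k, p):
    """P(K = k failures before the 1st success)."""
    return (1 - p)**k * p

def negative_binomial_pmf(k, r, p):
    """P(K = k failures before the r-th success)."""
    return comb(k + r - 1, k) * (1 - p)**k * p**r

def poisson_pmf(k, lam):
    """P(K = k events) when events occur at average rate lam."""
    return lam**k * exp(-lam) / factorial(k)

# Bernoulli is the Binomial with a single trial (n = 1).
bernoulli_success = binomial_pmf(1, 1, 0.3)
```

Two special cases follow directly from the definitions: the Bernoulli is the Binomial with one trial, and the Geometric is the Negative-Binomial with R = 1.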

The Questions Investigation of populations or groups within a population leads to questions: – How does BRCA1 behave across groups? – Can genotype predict drug response? – Does transcript abundance change as a function of time?

The Experiment A procedure used to answer the questions; Comprised of multiple items: – Population; – Sample; – Hypotheses; – Test statistic; – Rejection criteria;

Population Superset of subjects of interest; Ideally, every subject in the population is surveyed; Issues with the “census approach”;

Sample Select some subjects from the population; We refer to this subset as the sample; A subject in a sample can be called a replicate; Replicates: technical vs. biological;

Hypotheses Sets that define the “underlying truth”; Null Hypothesis (H0): default situation. – Cannot be proven; – Reject (in favor of H1) vs. fail to reject; Alternative Hypothesis (H1): alternative (duh!) – Complements H0 in the parameter space; – Assists in defining the rejection criteria.

Examples of Hypotheses – P1 Comparing expression: Tumor vs. Normal: – H0: Expression in tumor is at most as high as in normal; – H1: Expression in tumor is higher than in normal;

Examples of Hypotheses – P2 Comparing expression: Tumor vs. Normal: – H0: Expression in tumor is at least as high as in normal; – H1: Expression in tumor is lower than in normal;

Examples of Hypotheses – P3 Comparing expression: Tumor vs. Normal: – H0: Expression in tumor and in normal is the same; – H1: Expression in tumor and in normal is different;

Test Statistic Summary of the data; Built “under H0”; Its distribution under H0 is known and does not depend on unknown parameters; Measures the compatibility between the data and H0;

Test Statistic What the statistician sees…

Rejection Criteria Function of three factors: – Test statistic; – Hypotheses; – Type I Error (False Positive), α; Determines thresholds used to reject H0: – One threshold: one-sided tests; – Two thresholds: two-sided tests; Defines what is “extreme” for the experiment;

Rejection Criteria

From Rejection Criteria to P-value!

Sampling and testing Box with 10% red balls and 90% blue balls; Random sample of 10 balls from the box; Discrete observations; When do I think that I am not sampling from this box anymore? How many reds could I expect to get by chance alone? #red = 3

Discrete observations Box with 10% red balls and 90% blue balls; Sample: random sample of 10 balls from the box; Null hypothesis: about the population that is being sampled; Test statistic: #red = 3; Rejection criteria: based on your observed sample, do you have evidence to reject the hypothesis that you sampled from the null population?
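The question of how many reds to expect by chance can be made concrete with a short calculation. The sketch below assumes the draws are independent (a large box, or sampling with replacement), so that the number of reds is Binomial(10, 0.1); the function name `prob_at_least` is hypothetical.

```python
from math import comb

def prob_at_least(k_obs, n, p):
    """P(X >= k_obs) for X ~ Binomial(n, p): the one-sided tail probability."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_obs, n + 1))

# Null hypothesis: the box holds 10% red balls; we drew 10 balls and saw 3 reds.
p_value = prob_at_least(3, 10, 0.10)
print(f"P(#red >= 3 | 10% red) = {p_value:.4f}")  # prints 0.0702
```

Under these assumptions, seeing 3 or more reds happens about 7% of the time by chance alone, so at a 5% Type I error level this sample would not lead to rejecting the null box.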

Continuous observations Sample: 4, 2.3, 5.2, 4.7, 2.1, 3.5, …; Null hypothesis: about the population that is being sampled; Test statistic: mean = 3, sd = 0.6; Rejection criteria: based on your observed sample, do you have evidence to reject the hypothesis that you sampled from the null population?

Summary of the Experiment 1) hypotheses; 2) sample; 3) test statistic; 4) decision

Useful Facts The Law of Large Numbers guarantees that the sample average converges to the population mean as the sample size grows; The Central Limit Theorem states that the sample average is asymptotically Normal; As a result, the Normality assumption matters less with large sample sizes;
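Both facts can be checked by simulation. The sketch below is an illustration only: it draws from an Exponential(1) population (a skewed, clearly non-Normal choice made for this example), and the seed, sample sizes, and number of repetitions are arbitrary assumptions.

```python
import random
from math import sqrt

rng = random.Random(2024)  # fixed seed so the sketch is reproducible

def sample_mean(n):
    """Average of n draws from an Exponential(1) population (true mean = 1)."""
    return sum(rng.expovariate(1.0) for _ in range(n)) / n

# Law of Large Numbers: the sample average approaches 1 as n grows.
lln_means = {n: sample_mean(n) for n in (10, 100, 10_000, 100_000)}

# Central Limit Theorem: for n = 50, roughly 95% of sample averages should
# fall within 1.96 standard errors of the true mean, as for a Normal variable.
n, reps = 50, 2000
std_error = 1.0 / sqrt(n)  # Exponential(1) has standard deviation 1
inside = sum(abs(sample_mean(n) - 1.0) < 1.96 * std_error for _ in range(reps))
coverage = inside / reps

for size, avg in lln_means.items():
    print(f"n = {size:>7}: sample mean = {avg:.4f}")
print(f"share of n=50 averages within 1.96 SE: {coverage:.3f}")
```

The individual observations are far from Normal, yet the averages behave approximately Normally, which is why large samples relax the Normality assumption.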

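Useful Facts The Z-score depends on precise knowledge of the variance term: $Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}} \sim N(0, 1)$; Estimating the variance changes the distribution of the test statistic: $T = \frac{\bar{X} - \mu_0}{S / \sqrt{n}} \sim t_{n-1}$;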

Useful Facts The Student’s t distribution is similar to the Normal distribution, but has heavier tails; Larger sample size, more d.f.; More d.f., closer to Normal;
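The difference between the Z-score and the t statistic is only in which standard deviation goes in the denominator. The sketch below uses a hypothetical sample, a hypothetical null mean `mu0`, and a hypothetical known standard deviation `sigma`; none of these values come from the slides.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

sample = [3.4, 2.9, 3.8, 3.1, 3.6, 2.7, 3.9, 3.3]  # hypothetical observations
mu0 = 3.0     # hypothetical mean under H0
sigma = 0.6   # hypothetical known population standard deviation

n = len(sample)
x_bar = mean(sample)
s = stdev(sample)  # sample standard deviation (n - 1 in the denominator)

z = (x_bar - mu0) / (sigma / sqrt(n))   # variance known: compare with N(0, 1)
t_stat = (x_bar - mu0) / (s / sqrt(n))  # variance estimated: compare with t, n - 1 d.f.

# Two-sided p-value for the Z-score, using the standard Normal distribution.
p_z = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"z = {z:.3f}, two-sided p (Normal) = {p_z:.4f}")
print(f"t = {t_stat:.3f} on {n - 1} d.f.")
```

The standard library has no Student's t distribution, so the t statistic is reported without a p-value here; with the heavier tails of the t distribution, the same value of the statistic gives a larger p-value than under the Normal, especially with few degrees of freedom.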

Multiple Testing We are doing high-throughput experiments; Comparing thousands of units simultaneously; At this scale, we can observe several instances of rare events just by chance: – Event A: 1 in 1000 chance of happening; – Event B: 999 in 1000 chance of happening; – And the experiment is tried 20,000 times; – We expect 20 occurrences of Event A to be observed, although Event B is much more likely;
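The arithmetic behind this example is short enough to write out. The sketch below uses the numbers from the slide (a 1 in 1000 event, 20,000 tries); the variable names are hypothetical, and the tries are assumed to be independent.

```python
n_tries = 20_000
p_event_a = 1 / 1000

# Expected number of times the rare Event A shows up across all tries.
expected_a = n_tries * p_event_a

# Probability that Event A is observed at least once (independent tries assumed).
p_at_least_one = 1 - (1 - p_event_a) ** n_tries

print(f"expected occurrences of Event A: {expected_a:.0f}")  # prints 20
print(f"P(at least one Event A) = {p_at_least_one:.6f}")
```

The same reasoning applies to testing: with a Type I error rate α per test and thousands of tests, α times the number of true nulls false positives are expected by chance alone.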

Multiple Testing Similar scenario, for example, with differential expression (DE); Most genes are not differentially expressed; High-throughput experiments; Differential expression is tested for 20K genes; Need to protect against false positives; Suggestion: use non-specific filtering;

DATA MODELING

What is a model?

Statistical Models There is no “correct model”; Models are approximations of the truth; There is a “useful model”; Understand the mechanisms of the system for better choices of model alternatives;

Revisiting Microarrays Scanned images; Fluorescence intensities; Proportional to target abundances; Restricted dynamic range; Asymmetrical distribution; Log-Intensities behave better;

Revisiting Microarrays

Intensities

Log-Intensities

Back to Data Modeling Linear Regression / ANOVA Nature of the data: continuous; Linear regression often used; For subject i, known factors/covariates are candidates to predict log-intensities of a gene: $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$; Residuals $\varepsilon_i$ expected to be Normal;

Interpreting Coefficients Statisticians indicate that a parameter is estimated by using a “hat” on top of it: $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$; Assuming that X = 0 for normal tissue: $\hat{Y} = \hat{\beta}_0$; Assuming that X = 1 for tumor tissue: $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1$;

Interpreting Coefficients $\hat{\beta}_0$: average log-intensity for normal tissue; $\hat{\beta}_1$: change in average log-intensity associated with the tumor tissue; $\hat{\beta}_0 + \hat{\beta}_1$: average log-intensity for tumor tissue;
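This interpretation can be verified numerically with ordinary least squares on a 0/1 covariate. The log-intensities below are hypothetical values invented for illustration, and the names `b0` and `b1` stand for the estimated intercept and slope.

```python
from statistics import mean

# Hypothetical data: X = 0 for normal tissue, X = 1 for tumor tissue.
x = [0, 0, 0, 0, 1, 1, 1, 1]
y = [7.1, 6.8, 7.4, 7.0, 8.9, 9.3, 8.7, 9.1]  # hypothetical log-intensities

x_bar, y_bar = mean(x), mean(y)

# Least-squares estimates for the simple linear regression of y on x.
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

mean_normal = mean(yi for xi, yi in zip(x, y) if xi == 0)
mean_tumor = mean(yi for xi, yi in zip(x, y) if xi == 1)

print(f"b0 = {b0:.3f} (average for normal tissue: {mean_normal:.3f})")
print(f"b0 + b1 = {b0 + b1:.3f} (average for tumor tissue: {mean_tumor:.3f})")
```

With a single 0/1 covariate, the intercept reproduces the normal-tissue average and the slope reproduces the difference between the two group averages.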

GLM Generalized Linear Models; Generic framework; Accommodates different types of data; Special cases: Linear regressions and ANOVAs;

Example – GLM Binomial Family Responses: yes/no; dead/alive; sick/healthy; Predictors: Gene expression / genotype / age; Example: – Response: Cytogenetic abnormalities (Yes/No); – Predictors: Log-expression of probeset 1059_at;

Log-Expression vs. Abnormalities

Modeling a Binary Response Response in the previous example: – Observed cytogenetic abnormalities; – Did not observe cytogenetic abnormalities; Linear regression does not work:

Modeling a Binary Response Instead of modeling the actual response, we model the probability of that response; Linear regression still fails: fitted values can fall outside the range of valid results;

Logistic Regression - Rationale Probability is restricted to the [0, 1] interval; Linear regression isn’t; Need to transform probability;

Logistic Regression - Rationale Instead of probability, model the odds: $\text{odds} = \frac{p}{1 - p}$; Odds range from 0 to Infinity; A linear regression approach would still fail;

Logistic Regression - Rationale Instead of odds, model the log-odds: $\log\left(\frac{p}{1 - p}\right)$; Log-odds range from -Infinity to Infinity; An approach like linear regression, using the log-odds scale, would work fine;
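The chain from probability to odds to log-odds, and back, can be sketched in a few lines. The function names `odds`, `logit`, and `inv_logit` are chosen for this example.

```python
from math import exp, log

def odds(p):
    """Odds of an event with probability p; ranges over (0, infinity)."""
    return p / (1 - p)

def logit(p):
    """Log-odds of an event with probability p; ranges over the whole real line."""
    return log(p / (1 - p))

def inv_logit(eta):
    """Map a value on the log-odds scale back to a probability in (0, 1)."""
    return 1 / (1 + exp(-eta))

for p in (0.05, 0.5, 0.95):
    print(f"p = {p:.2f}  odds = {odds(p):.3f}  log-odds = {logit(p):+.3f}")
```

Because `inv_logit` maps any real number into (0, 1), a linear predictor on the log-odds scale can never produce an invalid probability.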

Back to GLM In the previous example: Link function: logit; Linear predictor: logit(p) = b0 + b1 X

Interpreting Coefficients on a Logistic Model b0: average log-odds for normal tissue; b1: average change in log-odds on tumor; Suppose b0 = and b1 = -3.46: – How do we interpret?
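One way to read b1 is through the odds ratio: a change of b1 on the log-odds scale multiplies the odds by exp(b1). The sketch below uses only the value b1 = -3.46 given on the slide; the variable names are hypothetical.

```python
from math import exp

b1 = -3.46  # change in log-odds, as given on the slide

# A shift of b1 on the log-odds scale multiplies the odds by exp(b1).
odds_ratio = exp(b1)

print(f"odds ratio = exp({b1}) = {odds_ratio:.3f}")  # prints 0.031
```

An odds ratio near 0.03 means the odds of the response drop to about 3% of their baseline value. Turning this into probabilities would also require b0, which sets the baseline log-odds.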

Model Selection Likelihood measures how probable the observed data are under a certain model; Given two nested models, M1 and M2 (M2 ⊃ M1): – Get L1: likelihood of the data under M1; – Get L2: likelihood of the data under M2; The distribution of LRT = -2 log(L1/L2) is known (asymptotically chi-squared, with d.f. equal to the number of extra parameters in M2); – Small LRT: choose M1; – Large LRT: choose M2;
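The likelihood ratio test can be sketched for the simplest case, where M2 has exactly one more parameter than M1, so the reference distribution is chi-squared with 1 d.f. The log-likelihood values below are hypothetical, and the closed-form tail probability used here holds only for 1 d.f.

```python
from math import erfc, sqrt

def lrt_statistic(loglik_m1, loglik_m2):
    """LRT = -2 log(L1 / L2), computed from the two log-likelihoods."""
    return -2.0 * (loglik_m1 - loglik_m2)

def chi2_sf_1df(x):
    """P(chi-squared with 1 d.f. > x), using the identity with the Normal tail."""
    return erfc(sqrt(x / 2.0))

# Hypothetical log-likelihoods: M2 nests M1 and has one extra parameter.
loglik_m1, loglik_m2 = -120.0, -115.0

lrt = lrt_statistic(loglik_m1, loglik_m2)
p_value = chi2_sf_1df(lrt)

print(f"LRT = {lrt:.2f}, p-value (1 d.f.) = {p_value:.4f}")
```

A small p-value says the extra parameter in M2 improves the fit more than chance would explain, which corresponds to the "large LRT: choose M2" rule.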

MODELING STRATEGIES FOR SEQUENCING DATA

Sequencing – Rationale Technical Replicate For sample j, transcript i is generated at rate λ_ij; A fragment attaches to the flow cell with a (low) probability p_ij; Number of observed tags, y_ij, is Poisson distributed with rate proportional to λ_ij p_ij; Adapted from notes by Tom Hardcastle

Poisson Probability function: $P(Y = y) = \frac{\lambda^{y} e^{-\lambda}}{y!}$, for y = 0, 1, 2, …

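Analysis method: GLM Noise part: distribution of the observed count around its expected value; Deterministic part: expected count of region i in sample j, modeled from the library size effect and the design matrix times the (differential) effect for region i;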

Technical replicates: consistent with Poisson; Biological replicates: not consistent with Poisson; Need to account for extra variability; Based on the data of Nagalakshmi et al., Science 2008; slide adapted from Huber

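Sequencing – Rationale Biological Replicates For subject j, on transcript i: given its rate λ_ij, the count y_ij is Poisson; Different subjects have different rates, which we can model through a Gamma distribution on λ_ij; This hierarchy changes the distribution of Y: marginally, y_ij follows a Negative Binomial distribution;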

Negative Binomial Probability function:
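The Negative Binomial can be written in several equivalent ways. The sketch below assumes a mean-dispersion parameterization (mean `mu`, dispersion `alpha`, variance mu + alpha * mu^2), which is common for count data but is an assumption here, not necessarily the form shown on the slide.

```python
from math import exp, lgamma, log

def nb_pmf(y, mu, alpha):
    """Negative Binomial probability of count y, with mean mu and dispersion alpha.

    Assumed parameterization: size r = 1 / alpha, variance = mu + alpha * mu**2.
    """
    r = 1.0 / alpha
    log_p = (lgamma(y + r) - lgamma(r) - lgamma(y + 1)
             + r * log(r / (r + mu)) + y * log(mu / (r + mu)))
    return exp(log_p)

mu, alpha = 10.0, 0.5  # hypothetical mean and dispersion

# Numerically recover the mean and variance from the probability function.
support = range(0, 2000)
nb_mean = sum(y * nb_pmf(y, mu, alpha) for y in support)
nb_var = sum((y - nb_mean) ** 2 * nb_pmf(y, mu, alpha) for y in support)

print(f"mean = {nb_mean:.3f}, variance = {nb_var:.3f}")
print(f"a Poisson with the same mean would have variance {mu:.3f}")
```

The variance exceeds the mean by alpha * mu^2, which is the extra variability that biological replicates show relative to the Poisson.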

Adding an additional source of variation Smooth dispersion-mean relation for the dispersion α

CONSIDERATIONS ON EXPERIMENT DESIGN

Consideration Sample size is crucial. The larger, the better; With differential expression, one can observe this more easily; Is RNA-Seq really worth it when we consider: – Cost, – Strategies for analysis, and – Technical requirements?

Differential Expression Across Groups Flow Cell Confounded With Group: Group A on Flow Cell 1; Group B on Flow Cell 2; Group C on Flow Cell 3; Group D on Flow Cell 4

Differential Expression Across Groups Randomize Samples with respect to Flow Cell: Flow Cell 1; Flow Cell 2; Flow Cell 3; Flow Cell 4
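A randomized layout of this kind can be generated in a few lines. The sketch below assumes four samples per group (the slides do not state the number) and uses hypothetical sample labels; it spreads each group across the flow cells so that group and flow cell are no longer confounded.

```python
import random

def randomize_to_flow_cells(samples_by_group, n_flow_cells, seed=0):
    """Assign samples to flow cells so each group is spread across all cells."""
    rng = random.Random(seed)  # fixed seed keeps the layout reproducible
    flow_cells = [[] for _ in range(n_flow_cells)]
    for group, samples in samples_by_group.items():
        shuffled = list(samples)
        rng.shuffle(shuffled)  # random order within the group
        for position, sample in enumerate(shuffled):
            flow_cells[position % n_flow_cells].append((group, sample))
    for cell in flow_cells:
        rng.shuffle(cell)  # random lane order within each flow cell
    return flow_cells

# Hypothetical design: groups A-D, four samples each, four flow cells.
design = {g: [f"{g}{k}" for k in range(1, 5)] for g in "ABCD"}
layout = randomize_to_flow_cells(design, n_flow_cells=4)

for number, cell in enumerate(layout, start=1):
    print(f"Flow Cell {number}: {[sample for _, sample in cell]}")
```

With every group represented on every flow cell, a flow cell effect shifts all groups alike and can be separated from the group differences of interest.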

Differential Expression Across Groups Barcoding vs. Lane Effect: Flow Cell 1; Flow Cell 2; Flow Cell 3; Flow Cell

CONSIDERATIONS ON DATA PROCESSING

Normalization Samples are sequenced at different depths; Genes with higher expression in Sample 2; Adjusting by total reads can be misleading; Gene | Sample 1 | Sample 2; Gene 1: 500,000; …; Gene N: 0, 500,000; Total Reads: 15,000, ,000,00 0

Normalization Length can affect relative inference of expression across genes; At equal expression, Gene A, K times longer than Gene B, is expected to have K times as many reads as B: Gene A Gene B
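A simple way to remove the length effect is to express counts per unit of gene length. The sketch below uses reads per kilobase as one such adjustment; the gene lengths and read counts are hypothetical, and this is an illustration of the idea, not necessarily the method used in the original slides.

```python
def reads_per_kilobase(read_count, gene_length_bp):
    """Scale a read count by gene length, expressed in kilobases."""
    return read_count / (gene_length_bp / 1000.0)

# Hypothetical genes with equal expression: Gene A is K = 3 times longer than
# Gene B, so it collects 3 times as many reads.
gene_a = {"length_bp": 3000, "reads": 300}
gene_b = {"length_bp": 1000, "reads": 100}

rpk_a = reads_per_kilobase(gene_a["reads"], gene_a["length_bp"])
rpk_b = reads_per_kilobase(gene_b["reads"], gene_b["length_bp"])

print(f"raw reads:          A = {gene_a['reads']}, B = {gene_b['reads']}")
print(f"reads per kilobase: A = {rpk_a:.1f}, B = {rpk_b:.1f}")
```

After the adjustment the two genes are on the same scale, so comparisons across genes reflect expression rather than length.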