Statistics and Art: Sampling, Response Error, Mixed Models, Missing Data, and Inference
Ed Stanek
And others: Recai Yucel, Julio Singer, and others on the Cluster Team
11/11/2018
Anne Stanek, Viviana Lencina, Alice Singer, Silvia San Martino, Wenjun Li, Luz Mery Gonzalas, Julio Singer, Ed Stanek, Maria Lucia Singer
Outline
Example: Dose-response Models in Toxicology: Threshold vs Hormetic Models
What is truth? Predict what?
Subsets: sampling
Prediction
Results on Predictor of Realized Subject True Value
Illustration and Dilemma
Extension to two-stage problems
Missing data framework
Conclusions
And others: Recai Yucel, Bo Xu, Ruitao Zhang, and others on the Cluster Team
1. Example: Dose-response Models: Threshold vs Hormetic Models
Yeast data: 2189 chemicals, 13 yeast strains, 5 doses x 2 replications. Focus on doses below the benchmark dose (BMD).
These plots are of hypothetical ‘true’ responses. Response is represented as percent of control; 100% is the response when the dose = 0.
Question: Is there evidence of hormesis?
The point where the true response drops below 100% is the zero effect point. In practice, a ‘benchmark dose’ is estimated as a dose where the observed response drops below 95%.
Indices: i = chemical, j = dose, k = replication.
A mixed model is fit to the response for doses in the hormetic range.
Only 5 doses are available. Identify BMD(5), the 5% benchmark dose (the dose above which the response is less than (100-5)% = 95%), and the doses below the BMD.
When 3 doses are below the BMD, predict the average response over the below-BMD range.
Results: order the predicted responses for the realized chemicals from low to high, with equal response error and with unequal response error.
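The selection step described above can be sketched in Python. This is a minimal illustration, not the authors' procedure: the function name `doses_below_bmd`, the example doses and responses, and the rule of taking the first tested dose whose mean response falls below 95% of control are all assumptions made for the sketch.

```python
from statistics import mean

def doses_below_bmd(doses, responses, threshold=95.0):
    """Return (bmd, doses_below, mean_response_below) for one chemical.

    doses: tested doses in increasing order.
    responses: for each dose, the replicate responses (percent of control).
    Assumption of this sketch: the BMD is the first tested dose whose
    mean response falls below `threshold`.
    """
    dose_means = [mean(r) for r in responses]
    # Index of the first dose whose mean response is below the threshold.
    bmd_index = next(
        (j for j, m in enumerate(dose_means) if m < threshold), len(doses)
    )
    bmd = doses[bmd_index] if bmd_index < len(doses) else None
    below = list(doses[:bmd_index])
    below_means = dose_means[:bmd_index]
    avg_below = mean(below_means) if below_means else None
    return bmd, below, avg_below

# Hypothetical chemical: 5 doses x 2 replications, percent of control.
doses = [1, 2, 4, 8, 16]
responses = [[104, 100], [108, 102], [101, 97], [90, 86], [60, 64]]
bmd, below, avg_below = doses_below_bmd(doses, responses)
print(bmd, below, avg_below)
```

In this made-up example three doses fall below the BMD, so the chemical would qualify for the below-BMD average.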
Plot of predicted response for the strain ‘wild type yeast’ for 253 chemicals with 3 doses below a benchmark dose of 95%, using a pooled (equal) response error variance based on a mixed model. The black line is the expected distribution of the mean response if the threshold model held.
Similar plot with unequal response error. This was constructed by fitting a mixed model to each chemical and estimating the response variance.
Which results should be used? Does it depend on whether the model has heterogeneous response error?
No: theoretically, a derivation with heterogeneous response error pools the response error variances. However, in a simple example, we can show that better results occur if response error is separated.
The theory doesn’t match: we don’t understand the theory for the ‘better results’.
Next steps: Review what we do understand. Keep the context simple.
2. What is truth? Predict what?
Population, subjects, true response
Subject labels
True response
Population parameters: mean, variance
Subject deviation
Subjects == chemicals
True response == average response in hormetic range
Need to define parameters to represent the problem
Non-stochastic model
Index for response
Response error
Response error model, for each subject
Response process: in the hormetic range, pick a dose at random and measure the response.
Assumptions: unbiased response error, heteroskedastic.
The response error model is a stochastic model; response error is a random effect.
The sum of subject effects is zero (over the population).
Information: (subject label, response)
Subsequently, take r=1 (one measure per subject).
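One compact way to write the quantities named on these two slides is sketched below. The symbols are assumed notation, not taken from the slides: y_s for the true response, \mu and \sigma^2 for the population mean and variance, \beta_s for the subject deviation, W_{sk} for the response error, and \sigma_s^2 for its subject-specific variance. The N-1 divisor in the variance is one common convention and may differ from the authors' choice.

```latex
\begin{gather*}
  s = 1, \dots, N, \qquad
  \mu = \frac{1}{N}\sum_{s=1}^{N} y_s, \qquad
  \sigma^2 = \frac{1}{N-1}\sum_{s=1}^{N} (y_s - \mu)^2, \qquad
  \beta_s = y_s - \mu \\
  Y_{sk} = y_s + W_{sk} = \mu + \beta_s + W_{sk}, \qquad k = 1, \dots, r \\
  E(W_{sk}) = 0, \qquad
  \operatorname{var}(W_{sk}) = \sigma_s^2, \qquad
  \sum_{s=1}^{N} \beta_s = 0
\end{gather*}
```

The first line is the non-stochastic part (fixed population values); the second and third lines add the stochastic response error, unbiased and heteroskedastic as assumed on the slide.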
3. Subsets, Sampling
Select n of N subjects (a subset, “sample”).
Let all subsets be equally likely.
Sample mean
Note the difference: the sample is a set (unordered) of different subjects.
Usually represent Common
Sample as a Sequence (part of a Permutation)
Represent positions in a permutation.
Assume all permutations are equally likely.
Define: sample = positions
Sample mean
The random variable Y(ik) is not clearly defined.
The sample is now a sequence (order matters)!
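The two views of a sample, as a set and as a sequence, can be compared by brute-force enumeration. The sketch below assumes a hypothetical population of N = 3 true responses (the values 10, 14, 18 are made up) and checks that the sample mean has the same expected value under both views.

```python
from fractions import Fraction
from itertools import combinations, permutations

# Hypothetical true responses for subjects s = 1, 2, 3.
y = {1: Fraction(10), 2: Fraction(14), 3: Fraction(18)}
N, n = len(y), 2
mu = sum(y.values()) / N

# Sample as a set: every subset of n distinct subjects is equally likely.
subsets = list(combinations(y, n))
subset_means = [sum(y[s] for s in sub) / n for sub in subsets]
expected_mean_sets = sum(subset_means) / len(subsets)

# Sample as a sequence: the first n positions of an equally likely permutation.
perms = list(permutations(y))
sequence_means = [sum(y[s] for s in p[:n]) / n for p in perms]
expected_mean_sequences = sum(sequence_means) / len(perms)

print(len(subsets), len(perms))  # 3 subsets, 6 permutations
print(expected_mean_sets == mu, expected_mean_sequences == mu)
```

Exact fractions are used so that the equality checks are not affected by floating-point rounding.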
[Figure: population of N=3 subjects, labeled s=2 Ed, s=3 Wenjun, s=1 Julio. The sample is the first two subjects on the left.]
[Figure: population of N=3 subjects with positions in the permutation i=1, i=2, i=3 and labels s=2, s=3, s=1. Note labels and positions.]
[Figure: a different permutation: Ed, Julio, and Wenjun (s=2, s=1, s=3) in positions i=1, i=2, i=3.]
[Figure: a different permutation: Ed, Wenjun, Julio.]
[Figure: a different permutation: Julio, Wenjun, Ed.]
[Figure: a different permutation, split into sample and remainder: Wenjun, Julio, and Ed.]
[Figure: Wenjun, Ed, and Julio, using sample and remainder.]
Population size (N) is most likely > 3. We only see “n” subjects in the sample.
For example: suppose n=3 and N=7. We may see …
[Figure: positions i=1, i=2, i=3 form the sample; i=4, … form the remainder. Luzmery, Wenjun, and Viviana in the sample; labels shown: s=3, s=4, s=5.]
[Figure: Viviana, Ed, and Silvina in a sample; labels shown: s=2, s=4, s=7.]
Traditional Sampling Approach
1 2 … N
Horvitz-Thompson estimator
First-order inclusion probabilities = Prob(subject included in a sample)
Bold y is a vector of population values.
Missing data
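The Horvitz-Thompson estimator weights each sampled value by the inverse of its first-order inclusion probability. A minimal sketch follows, assuming simple random sampling without replacement (so every inclusion probability is n/N) and made-up sample values; the function name is hypothetical.

```python
from fractions import Fraction

def horvitz_thompson_total(sample_values, inclusion_probs):
    """Estimate the population total as the sum of y_s / pi_s over the sample."""
    return sum(y / p for y, p in zip(sample_values, inclusion_probs))

# Simple random sampling without replacement: pi_s = n / N for every subject.
N, n = 7, 3
pi = Fraction(n, N)
sample_y = [Fraction(10), Fraction(14), Fraction(18)]  # hypothetical observed values

total_hat = horvitz_thompson_total(sample_y, [pi] * n)
mean_hat = total_hat / N  # under SRS this equals the sample mean
print(total_hat, mean_hat)
```

The unsampled subjects never appear in the estimator: only the inclusion probabilities account for them, which is the sense in which the rest of the population is treated as missing data.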
With Response Error Model
Sample mean: sample is a set vs. sample is a sequence.
To represent positions: U(is) is an indicator variable that has a value of 1 if subject s is in position i, and 0 otherwise.
[Figure: position in permutation i=1, i=2, i=3; sample; labels s=1, s=2, s=3.]
First Position in Permutation
Suppose s=1,…,3=N. Then: formal expression of response for position i=1 in a permutation.
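The indicator representation can be made concrete for one realized permutation. The sketch below assumes N = 3, made-up true responses, no response error, and the ordering s=2, s=3, s=1; the response in position i is then the sum over s of U(is) times the value for subject s.

```python
# Hypothetical true responses for subjects s = 1, 2, 3 (N = 3).
y = [10, 14, 18]
N = len(y)

# One realized permutation: positions i = 1, 2, 3 hold subjects s = 2, 3, 1.
subject_in_position = [2, 3, 1]

# U[i][s] = 1 if subject s+1 is in position i+1, else 0 (0-based indices).
U = [[1 if subject_in_position[i] == s + 1 else 0 for s in range(N)]
     for i in range(N)]

# Response in position i: Y_i = sum over s of U_is * y_s.
Y = [sum(U[i][s] * y[s] for s in range(N)) for i in range(N)]

print(U)
print(Y)
```

Each row and each column of U contains exactly one 1, so every position holds one subject and every subject occupies one position.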
Positions in Sample Sequences: sample and remainder representation
Basic Random Variables: sample, remainder, population
Finite Population Mixed Model
Response error model
Combine the response error model with the permutation to get the mixed model.
Mixed Model
Alpha = fixed effects
B = random effects
W* = response error
Note that the subscript is POSITION, not SUBJECT.
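How the permutation turns the response error model into a mixed model can be sketched in a few lines. The notation is assumed rather than taken from the slides: U_{is} is the position indicator, y_s = \mu + \beta_s is the true response of subject s with \sum_s \beta_s = 0, W_{sk} is the response error for subject s, and the only fixed effect in this simple case is \mu (playing the role of Alpha). The step uses \sum_s U_{is} = 1 for every position i.

```latex
\begin{align*}
  Y_{ik} &= \sum_{s=1}^{N} U_{is}\, Y_{sk}
          = \sum_{s=1}^{N} U_{is}\,(\mu + \beta_s + W_{sk}) \\
         &= \mu + B_i + W^{*}_{ik},
  \qquad B_i = \sum_{s=1}^{N} U_{is}\,\beta_s,
  \qquad W^{*}_{ik} = \sum_{s=1}^{N} U_{is}\, W_{sk}
\end{align*}
```

Here B_i is random only because the subject occupying position i is random, which is why the subscript refers to the position and not to the subject.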
Properties of Basic Random Variables (N=3)
[Table: basic random variables with row and column sums and averages, and their expected values.]
Sample Random Variables (n=2)
[Table: sample random variables with row and column sums and their expected values.]
Sum over rows: get the usual random variable, with expected value mu.
Sum over columns: get a random variable with different expected values.
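The row and column sums can be checked by enumerating all permutations. The sketch below assumes the table has positions as rows and subjects as columns, with entries U(is) times the true response of subject s; the values 10, 14, 18 are made up, with N = 3 and n = 2.

```python
from fractions import Fraction
from itertools import permutations

y = [Fraction(10), Fraction(14), Fraction(18)]  # hypothetical, subjects s = 1, 2, 3
N, n = len(y), 2
mu = sum(y) / N
perms = list(permutations(range(N)))  # p[i] = subject (0-based) in position i

# Row sums: Y_i = sum_s U_is * y_s, the response in sample position i.
expected_row = [sum(y[p[i]] for p in perms) / len(perms) for i in range(n)]

# Column sums over the sample positions: y_s if subject s is sampled, else 0.
expected_col = [
    sum(y[s] for p in perms if s in p[:n]) / len(perms) for s in range(N)
]

print(expected_row)  # the same value, mu, for every position
print(expected_col)  # (n / N) * y_s, a different value for each subject
```

The row sums behave like the random variables of a usual model, while the column sums keep the subject labels and so have subject-specific expectations.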
Prediction of Mean in a Simple Case: No Response Error (N=3, n=2)
Sample, remainder
Note:
Criteria:
Linear function of the sample
Unbiased
Smallest mean squared error
Need to predict a function of the remainder.
Called the Best Linear Unbiased Predictor (note that we use the term “Predictor” here for a parameter, not a random variable).
Prediction of Mean: No Response Error (N=3, n=2)
Target Sample Data Realized
We predict the unobserved values in the population.
Best Linear Unbiased Predictor:
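A small enumeration illustrates the idea of predicting the unobserved part of the population. The sketch assumes that, with no response error, the remainder sum is predicted by (N - n) times the sample mean, so the predictor of the population mean reduces to the sample mean. The data values are made up and the function name is hypothetical.

```python
from fractions import Fraction
from itertools import permutations

y = [Fraction(10), Fraction(14), Fraction(18)]  # hypothetical true responses
N, n = len(y), 2
mu = sum(y) / N

def predict_population_mean(sample):
    """Observed sample sum plus predicted remainder sum, divided by N."""
    sample_mean = sum(sample) / len(sample)
    predicted_remainder_sum = (N - n) * sample_mean
    return (sum(sample) + predicted_remainder_sum) / N

# Evaluate the predictor over every equally likely permutation.
predictions = [
    predict_population_mean([y[s] for s in p[:n]])
    for p in permutations(range(N))
]
bias = sum(predictions) / len(predictions) - mu
mse = sum((pred - mu) ** 2 for pred in predictions) / len(predictions)
print(bias, mse)
```

The enumeration confirms unbiasedness for these values; it does not by itself show that the mean squared error is the smallest among linear unbiased predictors.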
Prediction of a Subject’s Mean in Position i with No Response Error (N=3, n=2)
Target Sample Data Realized
We predict the unobserved values in the population.
Best Linear Unbiased Predictor:
Prediction of a Subject’s Mean in Position i with Response Error
Target Sample Data Realized
We predict the unobserved values in the population.
Best Linear Unbiased Predictor:
Prediction of Realized Random Effect: Other Examples
SRS + subject response error
SRS + position response error
Cluster sampling, balanced
Cluster sampling, unbalanced: similar form, more complicated
Return to the basic question: which predictor should be used?
Common response error: optimal via the theory.
Allowing K to depend on the realized subject: had smaller MSE.
Plot of predicted response for the strain ‘wild type yeast’ for 253 chemicals with 3 doses below a benchmark dose of 95%, using a pooled (equal) response error variance based on a mixed model.
Plot with unequal response error.
Which results should be used? Does it depend on whether the model has heterogeneous response error?
No: theoretically, a derivation with heterogeneous response error pools the response error variances. However, in a simple example, we can show that better results occur if response error is separated.
The theory doesn’t match: we don’t understand the theory for the ‘better results’.
Review what we do understand. Keep the context simple.
Dilemma
Pooled response error variance should be used for K (using theoretical results).
An empirical example illustrates smaller MSE with K depending on the realized subject, but there is no theory!
What should we do? Is there a ‘gap’ in the framework?
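The two choices of K can be placed side by side in a short sketch. It assumes the familiar shrinkage form, sample mean plus K times (response minus sample mean), with K equal to the subject variance divided by the sum of the subject variance and a response error variance. That form, the function name, and all numbers are assumptions for illustration. The variances are treated as known and are pooled over the sample here, whereas in practice they would be estimated.

```python
def shrinkage_predictors(responses, subject_variance, response_error_variances):
    """Predict the true value of the subject in each sample position.

    Returns (pooled, separate): predictions using one pooled K for all
    positions, and predictions using a K computed from the response error
    variance of the subject realized in each position.
    """
    n = len(responses)
    ybar = sum(responses) / n
    pooled_error = sum(response_error_variances) / n
    k_pooled = subject_variance / (subject_variance + pooled_error)
    pooled = [ybar + k_pooled * (yi - ybar) for yi in responses]
    separate = [
        ybar + subject_variance / (subject_variance + v) * (yi - ybar)
        for yi, v in zip(responses, response_error_variances)
    ]
    return pooled, separate

# Hypothetical values: three sampled subjects, one measure each (r = 1).
responses = [96.0, 104.0, 112.0]
pooled, separate = shrinkage_predictors(
    responses, subject_variance=16.0, response_error_variances=[4.0, 16.0, 28.0]
)
print(pooled)
print(separate)
```

With a subject-specific K, a precisely measured subject is shrunk less toward the sample mean and a noisily measured subject is shrunk more. The sketch only shows how the two predictors differ; it makes no claim about which has the smaller MSE.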
Basic Sample Random Variables
Usual modelling approach (work with the right column): the properties of these random variables (exchangeable) are a natural lead-in to Bayesian inference.
Traditional sampling (and missing data) approach (work with the bottom row): don’t use explicit notation for the sample; use inclusion probabilities. Some values are missing.
Super-population models: use the bottom row, but rearrange the elements so that those in the sample are first. Assume the random variables are exchangeable (as for the right column). This really doesn’t make sense.
Basic Random Variables: Sample and Remainder
What is potentially observable? What is observed?
Thanks
More work is needed!