Sampling Research Questions

Presentation transcript:

Sampling Research Questions
Bruce D. Spencer
Statistics Department and Institute for Policy Research, Northwestern University
SAMSI Workshop, 10/21/10

Introduction
At the end of the opening workshop, the group in Sampling, Modeling, and Inference raised a number of open questions related to sampling. Today I will discuss those questions, most of which are still unsolved.

Goal of Sample-Based Inference
What is the target of the inference?
- a stochastic model that generated a network or set of networks
- a population of networks, e.g., dynamic networks, or multiple networks on a single population of edges
- a single network

Various Network Sampling Designs
- Conventional sample design to learn about the network
  - probabilities do not depend on observed data
  - E.g., Current Population Survey
- Adaptive sample design using the network
  - probabilities may depend on observed data
  - E.g., RDS; ego-centric samples; link-tracing designs
- Two-phase sampling
  - to target further investigation of missing data or measurement error
- Subsampling (?)
  - to reduce computational burden at possible loss of efficiency

Conventional Sampling Design to Learn about the Network(s)
Samples of nodes or of edges, used for
- description of network(s)
- prediction of future state of network
- prediction of links/gaps/nodes
- fitting a model to the graph

Limitations from Sampling
- Sampling introduces random error into the estimates (and possibly bias, since E f(X) ≠ f(EX) for nonlinear f).
- Sampling variance needs to be estimated, and maybe bias does too; this may be problematic for small samples.
- Some population characteristics may not be “estimable” from a sample.
  - E.g., maximum path length between any two nodes? Number of components in a general graph?
  - What does “estimable” mean?
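The remark that E f(X) ≠ f(EX) for nonlinear f can be checked exactly on a tiny example. The sketch below is illustrative only: the four-value population and the choice f(x) = x² are assumptions made here, not taken from the talk. It enumerates every simple random sample of size 2 and compares the exact expectations.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical toy population (not from the slides); its mean is exactly 3.
population = [1, 2, 3, 6]
n = 2  # simple random sample size, without replacement

def f(x):
    """A nonlinear function of the mean; squaring is just one example."""
    return x * x

# Every size-n subset is equally likely under simple random sampling.
sample_means = [Fraction(sum(s), n) for s in combinations(population, n)]

pop_mean = Fraction(sum(population), len(population))
expected_mean = sum(sample_means) / len(sample_means)             # E[X]
expected_f = sum(f(m) for m in sample_means) / len(sample_means)  # E[f(X)]

print("E[X]    =", expected_mean, "(population mean:", pop_mean, ")")
print("E[f(X)] =", expected_f, "vs f(E[X]) =", f(expected_mean))
```

In this toy case the sample mean is unbiased for the population mean, yet its square has expectation 61/6 rather than 9. The gap of 7/6 is exactly the sampling variance of the mean, which is why plug-in estimates of nonlinear quantities can be biased even when every ingredient is unbiased.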

Limitations from Sampling
If elements of interest (edges/non-edges, stars, motifs, etc.) have unequal probabilities of being observed, then we
- need to know the probabilities and adjust for them,
- or need to have a model that explains the population,
- or, sometimes, both.

E.g.: Induced Graph Sampling
- Undirected parent graph (V, G)
- Sample nodes S ⊂ V
- Observe G(S) ⊂ G: observe edge/non-edge between u, v iff u, v ∈ S
- Conventional sampling with possibly unequal probabilities (including multiple-frame stratified multi-stage): probability of including u_1, u_2, ..., u_j and excluding v_1, v_2, ..., v_k is knowable for any j, k
- Denote inclusion probabilities by [symbols not preserved in the transcript]

Horvitz-Thompson Estimators of Totals
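The formulas on this slide did not survive in the transcript. As a hedged illustration of the general idea (weight each observed element by the inverse of its probability of being observed), the sketch below estimates the number of edges in a parent graph from an induced subgraph. It assumes a simple random sample of nodes without replacement, so that a pair of nodes is jointly included with probability n(n-1)/(N(N-1)); the function names and the 5-node graph are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def pair_inclusion_prob(N, n):
    """P(both endpoints sampled) under SRS without replacement of n of N nodes."""
    return Fraction(n * (n - 1), N * (N - 1))

def ht_edge_total(edges, sample, N):
    """Horvitz-Thompson estimate of the parent graph's edge count.

    Only edges with both endpoints in the sample are observed (induced
    subgraph); each observed edge is weighted by 1 / pi_uv.
    """
    s = set(sample)
    observed = sum(1 for u, v in edges if u in s and v in s)
    return observed / pair_inclusion_prob(N, len(s))

# Hypothetical 5-node parent graph, used only for illustration.
N = 5
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 4), (0, 2)]

# Averaging over every possible sample of size 3 checks unbiasedness.
estimates = [ht_edge_total(edges, s, N) for s in combinations(range(N), 3)]
average_estimate = sum(estimates) / len(estimates)
print(average_estimate, "vs true edge count", len(edges))
```

With unequal-probability designs the same weighting applies, but each edge needs its own joint inclusion probability for its two endpoints instead of the single constant used here.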

H-T Estimators of Triad Distribution
- Define T_{k,u,v,w} = 1 if u, v, w are distinct vertices sharing k edges, and = 0 otherwise
- T_k: number of triads in the parent graph with k edges, 0 ≤ k ≤ 3
- Other totals are estimated similarly, e.g., number of stars or other motifs.
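A minimal sketch of the triad totals, again assuming a simple random sample of nodes so that any three distinct nodes are jointly included with probability n(n-1)(n-2)/(N(N-1)(N-2)). The helper names and the toy graph are hypothetical; a real design would supply its own three-way inclusion probabilities.

```python
from fractions import Fraction
from itertools import combinations

def triad_census(nodes, edges):
    """Return [T_0, T_1, T_2, T_3]: triads among `nodes` by number of edges."""
    edge_set = {frozenset(e) for e in edges}
    counts = [0, 0, 0, 0]
    for u, v, w in combinations(sorted(nodes), 3):
        k = sum(frozenset(p) in edge_set for p in ((u, v), (u, w), (v, w)))
        counts[k] += 1
    return counts

def ht_triad_census(sample, edges, N):
    """H-T estimates of T_0..T_3 from the induced subgraph on `sample`."""
    s = set(sample)
    n = len(s)  # needs n >= 3, otherwise no triad can be observed
    pi_uvw = Fraction(n * (n - 1) * (n - 2), N * (N - 1) * (N - 2))
    # Only pairs inside the sample are inspected, i.e. the induced subgraph.
    return [c / pi_uvw for c in triad_census(s, edges)]

# Hypothetical 5-node parent graph, used only for illustration.
N = 5
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 4), (0, 2)]
true_census = triad_census(range(N), edges)

# Average the estimates over all samples of size 4 to check unbiasedness.
all_estimates = [ht_triad_census(s, edges, N) for s in combinations(range(N), 4)]
average_census = [sum(col) / len(all_estimates) for col in zip(*all_estimates)]
print(average_census, "vs", true_census)
```

The triple enumeration is cubic in the sample size, which is fine for a demonstration but would need a smarter counting routine on large samples.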

Degree Distribution
- d_u: degree of node u (its number of edges)
- M: maximum degree in (V, G)
- N_r: number of nodes of degree r, 0 ≤ r ≤ M
- (F_0, F_1, …, F_M) is the degree distribution, with F_r = N_r / N
- Degree distribution of the sample can differ from degree distribution of the population. “Subnets of Scale-Free Networks are Not Scale-Free: Sampling Properties of Networks”, Stumpf, Wiuf, May (PNAS, 2005)

Estimation of Degree Distribution
- Induced subgraph from SRS of size n from (V, G)
- N_r: number of nodes of degree r in parent graph
- N_r(S): number of nodes of degree r in subgraph
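The relationship between N_r and N_r(S) can be sketched as follows; this is one standard argument and not necessarily the formula shown on the original slide. Under an SRS of n of N nodes, a node of parent degree r is sampled with probability n/N, and given that it is sampled, the other n-1 sampled nodes are an SRS of the remaining N-1 nodes, r of which are its neighbours, so its degree in the subgraph is hypergeometric. Function names and the toy graph are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

def prob_sample_degree(N, n, r, j):
    """P(node is sampled and has degree j in G(S)) for parent degree r."""
    if j < 0 or j > n - 1:
        return Fraction(0)
    hyper = Fraction(comb(r, j) * comb(N - 1 - r, n - 1 - j), comb(N - 1, n - 1))
    return Fraction(n, N) * hyper

def expected_sample_counts(pop_counts, N, n):
    """E[N_j(S)] for j = 0..n-1, given parent counts N_r for r = 0..M."""
    return [sum(N_r * prob_sample_degree(N, n, r, j)
                for r, N_r in enumerate(pop_counts))
            for j in range(n)]

def brute_force_counts(edges, N, n):
    """Average N_j(S) over all samples by enumeration (tiny graphs only)."""
    samples = [set(s) for s in combinations(range(N), n)]
    totals = [0] * n
    for s in samples:
        deg = {u: 0 for u in s}
        for u, v in edges:
            if u in s and v in s:
                deg[u] += 1
                deg[v] += 1
        for d in deg.values():
            totals[d] += 1
    return [Fraction(t, len(samples)) for t in totals]

# Hypothetical 5-node parent graph: three nodes of degree 2, two of degree 3.
N, n = 5, 3
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 4), (0, 2)]
pop_counts = [0, 0, 3, 2]  # N_0, N_1, N_2, N_3

expected = expected_sample_counts(pop_counts, N, n)
enumerated = brute_force_counts(edges, N, n)
print(expected, enumerated)
```

Because the expected sample counts are linear in the N_r, one could in principle solve the linear system for the N_r given observed N_j(S); such inversions tend to be unstable for small samples, so the sketch stops at the forward relationship.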

Estimation of Degree Distribution

Estimation of Mean and Variance of Degree Distribution
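The slide's formulas are not in the transcript, so the following is only a sketch of one way to get these moments from totals. The mean degree is 2E/N, where E is the number of edges, and the sum of squared degrees equals 2E + 2S, where S is the number of 2-stars (paths of length two), because each node of degree d is the centre of d(d-1)/2 of them. Both totals can be estimated by inverse-probability weighting. Simple random sampling of nodes is assumed, and all names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def ht_degree_moments(sample, edges, N):
    """Estimate (mean degree, mean squared degree, variance) of the parent graph."""
    s = set(sample)
    n = len(s)  # needs n >= 3 so that 2-stars can be observed
    pi2 = Fraction(n * (n - 1), N * (N - 1))
    pi3 = Fraction(n * (n - 1) * (n - 2), N * (N - 1) * (N - 2))

    # Degrees within the induced subgraph G(S).
    deg = {u: 0 for u in s}
    for u, v in edges:
        if u in s and v in s:
            deg[u] += 1
            deg[v] += 1

    edges_hat = (sum(deg.values()) // 2) / pi2                     # estimates E
    stars_hat = sum(d * (d - 1) // 2 for d in deg.values()) / pi3  # estimates S

    mean_hat = 2 * edges_hat / N
    second_moment_hat = (2 * edges_hat + 2 * stars_hat) / N
    # Plug-in variance: nonlinear in mean_hat, so not exactly unbiased.
    var_hat = second_moment_hat - mean_hat ** 2
    return mean_hat, second_moment_hat, var_hat

# Hypothetical 5-node parent graph with degrees 3, 2, 3, 2, 2.
N = 5
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 4), (0, 2)]

full = ht_degree_moments(range(N), edges, N)  # census: exact values
runs = [ht_degree_moments(s, edges, N) for s in combinations(range(N), 3)]
avg_mean = sum(r[0] for r in runs) / len(runs)
avg_second = sum(r[1] for r in runs) / len(runs)
print(full, avg_mean, avg_second)
```

The mean and the second moment are linear in estimated totals and so are unbiased under this design, while the variance estimate subtracts a squared estimate and inherits the nonlinearity bias discussed earlier in the talk.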

Partial Recap
Using induced graph subsamples from conventional samples where joint inclusion probabilities are known, we can estimate population values of
- descriptive statistics based on totals
- the degree distribution.
(Only undirected graphs at one point in time discussed.)
What about
- other descriptive statistics
- model fitting
- large variances when sample size is small
- adaptive samples?

Approaches to Model Fitting
- You trust* your model.
- Under certain conditions** on the sample design and the model, you can ignore the way the sample was selected and treat the sample as having been generated from the model.
- The sampling mechanism needs to be carefully examined to make sure it meets the requirements, which depend on the model being used.
* Reagan and others, “trust but verify”
** Handcock and Gile (2010 AoAS) call the condition “amenability” and relate it to “ignorability” (Rubin 1976).

Approaches to Model Fitting
- “Model as descriptive statistic”. You do not necessarily believe the model, but you want to fit the model the way you would if you completely observed the population. Anathema to many social scientists...
- E.g., in ERGMs, model fitting for the population depends on sufficient statistics that are population totals. One can estimate them with H-T estimates (or alternatives) and then fit the model. (Pavel Krivitsky poster)
- I have not investigated how to implement this for other models.
- If both approaches are tried, “large” differences in fits can indicate model misspecification.

Adaptive Sampling
- Probabilities of observations depend on data from sampled units.
- Provides more information about the network than conventional samples (Frank). Note: variances may be too large when the sample is conventional but sparse.
- Probabilities of observing triads and larger structures are typically unavailable; even probabilities for dyads are known for ego-centric designs but not for link-tracing designs. (H-G 2010)
- In order to use the full data, one either needs to estimate unknown probabilities (hard!!) or rely on a model, if the amenability condition can be verified and the model validated.
- E.g., when using conventional unequal probability samples to estimate a population total, the amenability condition typically does not hold.

Model Validation
- Model validation is important, but challenging when sampling probabilities are unknown.
- At the heart of every adaptive sample is a conventional sample.
- Use the conventional sample to fit the model as a descriptive statistic.
- Compare the result to the model fitted under the assumption of ignorability/amenability for (i) the conventional sample and (ii) the larger and more informative adaptive sample.

Recap
- What is the population (network, or set of networks) from which the sample is selected?
- Sample design (and inference) to learn about the network
  - Static
  - Over time
  - Description of network
  - Prediction of future state of network and prediction of links/gaps/nodes

Recap
- Sample design (and inference) using the network to learn about a population
  - Respondent Driven Sampling
  - Adaptive Sampling
  - Others
  - Static and over time

Recap
- Subsampling design (and inference) to
  - Ease computational burden
  - Target further investigation to learn about measurement error
- When can inferences be made based on sample design information to provide approximate unbiasedness whether or not the model is valid?

Recap
- How can model inferences be made?
- What models?
  - Exponential random graph models
  - Mixed membership stochastic block models
  - Latent space models
  - Agent based models
- What network characteristics (what summary statistics)?

Recap
- What is the effect of measurement error (and missing data, non-response) on inferences about the network?
  - RDS samples
  - Others
- How to design and analyze randomized experiments when subjects are part of a static network? Dynamic?
  - Google experiments
  - Experiments on adolescents in schools (e.g., drug counseling, obesity “treatment”) – effects on peers