Log-linear modeling and missing data A short course Frans Willekens Boulder, July-August 1999.


1 Log-linear modeling and missing data A short course Frans Willekens Boulder, July-August 1999

2 Content The approach adopted in the course –a probabilistic perspective –a process perspective Data types and observations From observations to variables: the role of uncertainty Uncertainty and risk: risk set and exposure

3 Introduction to probability theory and statistical inference –Observations and random experiments –Random variables and probability distributions Continuous random variables Discrete random variables Plausible observations and plausible models: the maximum likelihood method

4 Analysis of count data: introduction to log-linear models –The Poisson probability model –The log-linear model The log-rate model: statistical analysis of occurrence-exposure rates

5 Logit model, logistic regression, and log-linear model: a comparison –Models of counts: log-linear model –Logit model and logistic regression Data on political attitudes (Payne) Data on leaving home –Construct your own logistic regression model Incomplete data: indirect estimation of migration flows. Summary References: books and web sites

6 The approach Focus on data analysis description (counts, proportions, rates, odds, relative risks) modeling (descriptive statistics frequently turn into dependent variables): models are probability models

7 The approach Focus on underlying process –Data are representations or manifestations of underlying processes: substantive mechanism → causal model / regression model (systematic component); random mechanism → probability model (random component) –Process generates an event or an event sequence (event history; life history). Measurement issues: observability, etc.

8 The approach Focus on model specification or selection of model rather than on parameter estimation (1) – Model specification is determined by research question and type of measurement – Creativity (thinking) at least as important as technique

9 The approach Focus on model specification (2) – Distinguish probability model and regression model [Clayton, D. and M. Hills (1993) Statistical models in epidemiology. Oxford University Press, Oxford] Probability theory (+ stochastic processes) Statistical inference – 1. Probability distribution of observations versus probability distribution of error terms – 2. Parameters of probability model are the dependent variables of regression model(s)

10 The approach Parameter estimation and model checking (fit): likelihood method – Alternatives: least squares method, entropy method. A model is plausible when the probability that it predicts the given observation(s) is sufficiently high.

11 The approach The likelihood method – Probability function (density, mass): probability that a model with given parameter value(s) predicts a given outcome (parameter fixed; outcome varies) – Likelihood function: the same expression read as a function of the parameter(s) for a given, observed outcome; it measures how plausible each parameter value is in the light of that outcome (outcome fixed; parameter varies) (J.S. Long, 1997, p. 26)
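
The two readings of the same formula can be illustrated with a small sketch. It assumes a binomial model with n = 10 trials; the values p = 0.3 and k = 3 and the helper name `binom_pmf` are illustrative choices, not taken from the course.

```python
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of k events in n trials when the event probability is p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n = 10  # assumed number of trials

# Probability function: parameter fixed (p = 0.3), outcome k varies.
pmf = [binom_pmf(k, n, 0.3) for k in range(n + 1)]

# Likelihood function: outcome fixed (k = 3), parameter p varies over a grid.
grid = [i / 100 for i in range(1, 100)]
likelihood = [binom_pmf(3, n, p) for p in grid]

# The maximum likelihood estimate is the grid value with the highest likelihood.
p_hat = grid[likelihood.index(max(likelihood))]
```

The probabilities in `pmf` sum to one because they range over all outcomes; the values in `likelihood` need not, because they range over parameter values.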

12 The approach Focus on prediction –Prediction vs explanation –Key question: how well does the model predict outcomes (observations)? Predictive performance: how well does the model fit each case (compare predicted values with observed values; analysis of residuals) Measurement of predictive performance: likelihood Comparison of predictive performance of different models: likelihood ratio –Apply model to predict outcomes

13 The approach The likelihood method in case of incomplete data: the EM algorithm –Basic principle: predict the missing data, construct the ‘complete-data likelihood function’, and get the ‘best’ parameter values: – Prediction (Expectation) – Estimation (Maximization)
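
The prediction and estimation steps can be sketched for one simple case: a two-way table in which some cases are fully classified and others are classified by row only. The counts, the function name `em_two_way` and the assumption that the missing column information is ignorable are all hypothetical and serve only as an illustration.

```python
def em_two_way(full, row_only, iterations=50):
    """EM estimate of cell probabilities for a two-way table.

    full[i][j]  : count of cases observed in row i and column j
    row_only[i] : count of cases known to be in row i, column missing
    """
    n_rows, n_cols = len(full), len(full[0])
    total = sum(map(sum, full)) + sum(row_only)
    # Start from equal cell probabilities.
    p = [[1.0 / (n_rows * n_cols)] * n_cols for _ in range(n_rows)]
    for _ in range(iterations):
        # Prediction (E-step): spread each row-only count over the columns
        # in proportion to the current cell probabilities of that row.
        completed = []
        for i in range(n_rows):
            row_p = sum(p[i])
            completed.append(
                [full[i][j] + row_only[i] * p[i][j] / row_p for j in range(n_cols)]
            )
        # Estimation (M-step): complete-data maximum likelihood estimate.
        p = [[c / total for c in row] for row in completed]
    return p

# Hypothetical data: 50 fully classified cases, 20 with the column missing.
p = em_two_way(full=[[20, 10], [5, 15]], row_only=[10, 10])
```

In this saturated example the iterations converge to the row share of all cases multiplied by the column distribution observed among the fully classified cases of that row.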

14 The approach Focus on interaction (association) –Origin-destination interaction –Origin-age interaction; destination-age interaction –Other variables Levels of interaction/association: odds ratio, model parameters

15 The approach Unified perspective on seemingly different models: generalized linear models (GLMs) The idea: A proper transformation (link function) of the expected value of the dependent variable yields a regression model that (i) is linear in the parameters, and (ii) belongs to a family of related models.

16 The approach Link with risk analysis –Being at risk (exposure): duration at risk –Risk indicators: probability, rate, relative risk, odds –Risk levels –Risk factors → Risk analysis → Exposure analysis

17 The approach Focus on data analysis; Focus on underlying process; Focus on model specification; Focus on likelihood approach; Focus on ‘complete-data likelihood’; Focus on prediction; Focus on interaction/association; Link with risk analysis; Unified perspective on different models: GLM

18 Data types and observations

19 Observations - data types Level of measurement: – Individual actor (subject): micro-data – Group of actors: grouped data (aggregate data, tabulated data, contingency tables) Status or event: – Attribute or status: attribute data (status data) – Event: event = change in status. Event data

20 Observations - data types Micro-data: data on individual respondents – Attribute or status: attribute data (status data), cross-section – Occurrence or non-occurrence of event during period (Y/N) OR number of occurrences: event data – Sequence of states or sequence of (repeatable) events: longitudinal data / life history data → Individual records. Two types of data files: person file; episode file

21 Observations - data types Grouped data / aggregate data: data on groups of respondents – Number of persons with given attribute or having experienced given event: count data – Number of events during period: count data → Tabulations / cross-classifications (by covariate class) Micro-data and tabulated data may give same results (parameters of probability models and regression models). Key: weights.

22 Longitudinal data Observations over time on a number of individuals (careers or life trajectories) –subjects are observed backward or forward in time: retrospectively (survivors recollect the past) or prospectively –subjects are observed continuously or at discrete intervals: continuously for a given period (continuous observation) or at several points in time during a given period (discrete observation; discrete intervals) Multiple observations on the same subject

23 Continuous: attributes and changes in attributes (events) e.g. places of residence and migrations [movement approach] Discrete: attributes at several points in time (discrete intervals) e.g. places of residence at two, three, or more points in time [transition approach]

24 Observation window Only part of life trajectory is observed. Incomplete data: censoring. Right censoring: observation (study) is terminated before all respondents experience the event. Censoring due to termination of study: – Type I censoring: study is terminated after a fixed time period – Type II censoring: study is terminated after a given number of occurrences. Censoring due to occurrence of competing event leading to attrition (e.g. death, withdrawal). Left censoring: time of entry into risk set is either before observation started or during observation

25 Data types applied to migration Micro-data: data on individuals or households – Status data: Current status: – migrant status (e.g. ever migrated / never migrated in given period) – current place (region) of residence Place of residence at two points in time: transition data (migrant data) – Time interval of fixed length: e.g. census and 5 years prior → “Where did you live 5 years ago?” – Time interval variable: e.g. census and place of birth → “Place of birth” Place of residence at 3 or more points in time

26 Data types applied to migration Micro-data: – Event data: migration data (movement data) Migration during given period (yes/no): migrant status Ever migrated? Number of migrations (quantum) Timing of migration (tempo) – Time scale: calendar time, age, process time (time since event-origin) – Measurement of time: exact time, time interval (discrete time, e.g. month, year) – Timing of all migrations vs timing of last migration

27 Data types applied to migration Grouped data: data on groups of individuals or households (actors) – Status data: Current status: number of actors (subjects) in given status Number of actors by place of residence at two points in time: transition data (migrant data) CENSUS Number of actors by place of residence at 3 or more points in time –Event data: Number of events during given period POP. REGISTER

28 Incomplete data: migration Migration data –Complete data: complete migration history –Incomplete data: Number of migrations (occurrences) only; Last migration only (e.g. previous place of residence); Migrations vs migrants – Indirect measurement of migration by comparison of places of residence at two consecutive points in time; Place of residence at previous point in time

29 Incomplete data: migration Timing of migration incompletely recorded – Censoring (right, left) – Discrete time Direction of migration incomplete – Origin missing – Net migration only Attributes of migrants (persons) or migrations (events) incomplete – Covariates partly missing

30 Incomplete data: migration Non-response in case of survey. Areal units for which data are available are not the units for which data are required (areal interpolation and extrapolation)

31 From observations to variables How to capture uncertainty and chance? How to distinguish systematic effects from random effects?

32 Variables Continuous (metric): infinite number of values Discrete (categorical, qualitative): finite number of values

33 Continuous variables Range is not restricted. Range is restricted: limited-dependent variables. Truncated: values outside acceptable range are impossible or disregarded/omitted; independent variables remain unknown. Censored: values outside acceptable range are considered (not the specific value is considered, but only the fact that the value is outside the range); independent variables are known. Measurement equation: y = y* if y* > τ; y = 0 if y* ≤ τ, or y = τ if y* ≤ τ. NOTE: a variable is truncated (censored) vs an observation is truncated (censored)
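
The difference between censoring and truncation can be shown in a few lines. This is a minimal sketch: the threshold name `tau`, the function names and the latent values are assumptions made for illustration.

```python
def censor(latent, tau, censored_value):
    """Keep every case; values at or below tau are recorded as censored_value."""
    return [y if y > tau else censored_value for y in latent]

def truncate(latent, tau):
    """Drop every case whose value is at or below tau."""
    return [y for y in latent if y > tau]

latent = [-1.5, 0.4, 2.0, -0.2, 3.1]   # hypothetical latent values y*

censored = censor(latent, tau=0.0, censored_value=0.0)
truncated = truncate(latent, tau=0.0)
```

The censored sample keeps all five cases, so their covariates remain available; the truncated sample keeps only three, and nothing is known about the omitted cases.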

34 Discrete variables: the measurement scale consists of a set of categories – A. Type of measurement scale Ordinal: categorical variable having ordered categories – Attribute or opinion: e.g. poor … excellent, low … high, approve … disapprove, military rank, level of education – Frequency of occurrence: never … always Nominal: categorical variable not having ordered categories. The values cannot be ordered (e.g. region of residence, marital status, religion, mode of transportation, political party, brand, profession) Count: frequency of occurrence of attribute or event (integer-valued data)

35 Discrete variables: the measurement scale consists of a set of categories – B. Number of values Binary, dichotomous: two categories only – Presence/absence of attribute – Occurrence/non-occurrence of event Polytomous: multiple categories

36 Level of measurement of a variable depends on the research question Education: Binary (yes/no): any education? Ordinal: primary, junior high, high school, college Count: number of years of schooling Migration: Binary: migrant status (ever migrated/never migrated) Ordinal: stayer, mover, frequent mover Count: number of migrations during given period

37 Uncertainty and risk

38 Risk analysis Events cannot be predicted with certainty Potential variation in outcome = RISK Williams et al., 1995, Risk management and insurance, McGraw-Hill, p. 5

39 Risk measures Count: number of events during given period (observation window) Probability: probability of an outcome: proportion of risk set experiencing a given outcome (event) at least once. Risk set = all persons at risk at given point in time. Rate: number of events per time unit of exposure (per unit of any measure of size, e.g. time, space, miles travelled)
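
The three measures can be computed from the same micro-data. The sketch below assumes five hypothetical persons, all at risk at the start of the observation window, each recorded as (years of exposure, number of events).

```python
# Hypothetical person records: (years at risk in the window, number of events).
records = [(5.0, 0), (2.5, 1), (5.0, 2), (1.0, 1), (5.0, 0)]

# Count: number of events during the observation window.
count = sum(events for _, events in records)

# Probability: proportion of the initial risk set with at least one event.
probability = sum(1 for _, events in records if events >= 1) / len(records)

# Rate: events per unit of exposure time (occurrence-exposure rate).
exposure = sum(years for years, _ in records)
rate = count / exposure
```

The probability counts persons and ignores repeated events, while the rate counts events and takes the duration at risk into account.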

40 Risk measures Difference of probabilities: p1 - p2. Relative risk: ratio of probabilities (focus: risk factor): prob. of event in presence of risk factor / prob. of event in absence of risk factor (control group; reference category): p1 / p2. Odds: odds on an outcome: ratio of favourable outcomes to unfavourable outcomes. Chance of one outcome rather than another: p1 / (1 - p1). The odds are what matter when placing a bet on a given outcome, i.e. when something is at stake. Odds reflect the degree of belief in a given outcome. Relation odds and relative risk: Agresti, 1996, p. 25
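
The relation between the odds ratio and the relative risk follows directly from the definitions above. Written out, with p_1 the probability in the presence of the risk factor and p_2 the probability in the reference category:

```latex
\[
\text{odds ratio}
  = \frac{p_1 / (1 - p_1)}{p_2 / (1 - p_2)}
  = \underbrace{\frac{p_1}{p_2}}_{\text{relative risk}}
    \times \frac{1 - p_2}{1 - p_1}
\]
```

When both probabilities are small, the second factor is close to one and the odds ratio approximates the relative risk.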

41 Risk measures Odds. Odds ratio: ratio of odds (focus: risk indicator, covariate): odds in target group / odds in control group [reference category]: ratio of favourable outcomes in target group over ratio in control group. The odds ratio measures the ‘belief’ in a given outcome in two different populations or under two different conditions. If the odds ratio is one, the two populations or conditions are similar.

42 Reference categories: 20, Males. Odds on leaving home early (rather than late) - Males: 74/178 = 0.42 - Females: 135/143 = 0.94. Odds ratio (θ): 0.944/0.416 = 2.27 (if we bet that a person leaves home early, we should bet on females; they are the ‘winners’ - leave home early). Var(θ) = θ² [1/135 + 1/143 + 1/74 + 1/178] = 0.1725; ln θ = 0.819; Var(ln θ) = 1/135 + 1/143 + 1/74 + 1/178 = 0.0335. Selvin, 1991, p. 345
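
The figures on this slide can be reproduced from the four counts. The sketch below uses the counts given above; the variable names are illustrative. Computed without intermediate rounding, the variance of the odds ratio and its logarithm come out at about 0.173 and 0.820, slightly different in the third decimal from the values quoted on the slide.

```python
from math import log

# Counts from the slide: persons leaving home early / late, by sex.
early_f, late_f = 135, 143
early_m, late_m = 74, 178

odds_f = early_f / late_f          # odds on leaving early, females
odds_m = early_m / late_m          # odds on leaving early, males (reference)
odds_ratio = odds_f / odds_m

log_or = log(odds_ratio)

# Large-sample variance of the log odds ratio: sum of reciprocal counts.
var_log_or = 1 / early_f + 1 / late_f + 1 / early_m + 1 / late_m

# Approximate variance of the odds ratio itself (delta method).
var_or = odds_ratio**2 * var_log_or
```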

43 Risk analysis: probability models – Counts → Poisson r.v. → Poisson distribution → Poisson regression / log-linear model – Probabilities → binomial and multinomial r.v. → binomial and multinomial distribution → logistic regression / logit model (parameter p, probability of occurrence, is also called risk; e.g. Clayton and Hills, 1993, p. 7) – Rates → occurrences/exposure → Poisson r.v. → log-rate model
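
For the last line, a log-rate model with a single two-category covariate can be sketched without an iterative fit, because the maximum likelihood estimates of a saturated model equal the observed occurrence-exposure rates. The counts, exposures and group labels below are hypothetical.

```python
from math import exp, log

# Hypothetical occurrences and exposure (person-years) in two groups.
events = {"reference": 30, "exposed": 45}
exposure = {"reference": 1500.0, "exposed": 1000.0}

# Occurrence-exposure rates.
rates = {group: events[group] / exposure[group] for group in events}

# Log-rate model: ln(rate) = b0 + b1 * x, with x = 1 for the exposed group.
b0 = log(rates["reference"])
b1 = log(rates["exposed"] / rates["reference"])   # log of the rate ratio

# Predicted number of events = exposure * exp(linear predictor).
predicted_exposed = exposure["exposed"] * exp(b0 + b1)
```

Here `exp(b1)` is the rate ratio between the two groups, and the exposure plays the role of an offset in the equivalent Poisson regression.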

