Model-based lifestyle behaviour estimates Dr Jennifer Mindell Clinical senior lecturer, UCL Contributors: Shelley Bradley
Why are these needed? Demand for detailed information at a range of smaller geographical levels –eg MSOAs, LAs, PCOs National surveys designed to provide reliable estimates: –national level –sometimes regional levels Sample size usually too small for direct estimates with adequate precision for smaller geographical areas
Why are these needed? Prevalence estimates of health behaviours based on survey data can only be computed for those areas covered by the sample For small areas covered by the survey: –sample size usually small –estimates have low precision – i.e. very wide CIs for the survey estimates E.g. for a percentage of 25% –sample size of 15: 95% CI of around 4%-46% –sample size of 50: 95% CI of 13%-37% Most MSOAs have no sample respondents
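The widths quoted above can be roughly reproduced with the usual normal-approximation (Wald) interval for a proportion. This is a minimal sketch that assumes simple random sampling and ignores the survey's complex design, so it gives figures close to, but not necessarily identical to, those on the slide. The function name `wald_ci` is illustrative only.

```python
import math


def wald_ci(p, n, z=1.96):
    """Normal-approximation CI for a proportion p from a sample of size n.

    Assumes simple random sampling (no design effect), which is a
    simplification relative to a real survey such as the HSfE.
    """
    se = math.sqrt(p * (1 - p) / n)  # standard error of the proportion
    lower = max(0.0, p - z * se)     # clip to the valid range [0, 1]
    upper = min(1.0, p + z * se)
    return lower, upper


for n in (15, 50):
    lower, upper = wald_ci(0.25, n)
    print(f"n={n}: {lower:.0%} to {upper:.0%}")
```

With a sample of 50 the interval is about 13% to 37%; with 15 it is wider still, which is why direct estimates for small areas are of little practical use.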
Basic idea behind the model-based method Find a relationship between: –estimate, as measured by the national survey (e.g. smoking in HSE) and –other information in the sampled MSOAs (e.g. Census and administrative data). Can use this relationship to generalise and produce reliable estimates for all MSOAs
Steps in deriving model-based estimates 1. Investigate and choose data sources to be used. (Two sets of information). 2. Build statistical model relating the survey variable to the covariate information for MSOAs (or LAs or PCOs) covered by the survey. –E.g. examine whether the tendency for a person to be a current smoker varies significantly between regions or between LAs with varying proportions of residents aged 16+ who were living as a couple, claiming Job Seekers Allowance etc.
Steps in deriving model-based estimates 3. Use the model and covariate data (available for all MSOAs) to create ‘expected’ prevalence estimates for all MSOAs, given the characteristics of each area 4. If required, ensure the model-based estimates are constrained to higher-level geographies
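Steps 3 and 4 can be sketched as follows. Everything numeric here is hypothetical: the intercept, the coefficients, the covariate values, the populations and the higher-level target are invented for illustration and are not taken from the project. The constraint step uses simple ratio scaling, which is an assumption; the project reports describe the method actually used.

```python
import math


def expected_prevalence(intercept, coefs, covariates):
    """Inverse-logit of the linear predictor: 'expected' prevalence for one area."""
    eta = intercept + sum(coefs[name] * covariates[name] for name in coefs)
    return 1.0 / (1.0 + math.exp(-eta))


def constrain(estimates, populations, target):
    """Scale area estimates so their population-weighted mean equals `target`.

    Simple ratio scaling (an assumed method). It does not guard against
    scaled values above 1, which a production method would need to handle.
    """
    total_pop = sum(populations[a] for a in estimates)
    pooled = sum(estimates[a] * populations[a] for a in estimates) / total_pop
    factor = target / pooled
    return {a: est * factor for a, est in estimates.items()}


# Hypothetical fitted model and area-level covariates (percentages).
INTERCEPT = -0.5
COEFS = {"pct_couple": -0.02, "pct_jsa": 0.10}
AREAS = {
    "MSOA A": {"pct_couple": 60, "pct_jsa": 2},
    "MSOA B": {"pct_couple": 40, "pct_jsa": 6},
}
POPULATIONS = {"MSOA A": 6000, "MSOA B": 8000}
HIGHER_LEVEL_DIRECT_ESTIMATE = 0.24  # hypothetical direct survey estimate

initial = {a: expected_prevalence(INTERCEPT, COEFS, cov) for a, cov in AREAS.items()}
final = constrain(initial, POPULATIONS, HIGHER_LEVEL_DIRECT_ESTIMATE)
for area in AREAS:
    print(f"{area}: initial {initial[area]:.3f}, constrained {final[area]:.3f}")
```

The point of the sketch is that every area receives an estimate from its covariates alone, whether or not any survey respondents live there.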
Healthy lifestyle behaviours The Information Centre commissioned NatCen to produce model-based estimates for the prevalence of healthy lifestyle behaviours using HSE data The estimates cover the time period and are for 6,781 MSOAs, 352 LAs, and 152 PCOs in England.
Examples Model-based estimates and 95% CIs produced using data from the HSfE covering the prevalence of lifestyle indicators among adults 16+: –smoking –binge drinking –obesity –consumption of 5+ portions per day of fruit and vegetables
Examples Model-based estimates with 95% CIs have been produced for MSOAs in England and Wales for: –total household weekly income, –net household weekly income, –net household weekly income before housing costs, –net household weekly income after housing costs
The survey dataset for model-based lifestyles estimates Core interview questions/measurements included each year 3 years of HSfE data (2003, 2004, 2005) combined to maximise sample size Only the general population samples in each year used
Current cigarette smoking Adult respondents (aged 16 +) to the HSfE: Defined to be current smokers if they reported that they were a “current cigarette smoker” Defined as not a current smoker if they reported that they: –had “never smoked cigarettes at all”, –“used to smoke cigarettes occasionally”, or –“used to smoke cigarettes regularly”.
Fruit and vegetable consumption (adults aged 16+) Generated from data collected in the HSfE about the quantities of different types of fruit and vegetables consumed on the previous day: –Includes fresh / frozen / tinned –Vegetables / salads / pulses –Fruit / juices –Some elements capped max 1/d - guidelines. Measures summed to give total number of portions of fruit and vegetables consumed.
Binge drinking Generated from data collected about the quantities of all the different types of alcoholic drinks (beer, wine, spirits, sherry and alcopops) consumed on a respondent’s heaviest drinking day in the previous week; measures summed to give the number of units of alcohol consumed on the heaviest drinking day. Binge drinking defined separately for men and women: –Men: ≥ 8 units of alcohol on the heaviest drinking day in the previous seven days; –Women: ≥ 6 units of alcohol
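The sex-specific definition above amounts to a threshold test on the summed units. A minimal sketch, in which the function name and the `"male"`/`"female"` codes are illustrative rather than taken from the HSfE dataset:

```python
# Thresholds from the definition above: units on the heaviest drinking day.
BINGE_THRESHOLD_UNITS = {"male": 8, "female": 6}


def is_binge_drinker(units_by_drink, sex):
    """Sum units across drink types and compare with the sex-specific threshold.

    `units_by_drink` maps a drink type (e.g. "beer", "wine") to the units of
    alcohol consumed on the respondent's heaviest drinking day last week.
    """
    total_units = sum(units_by_drink.values())
    return total_units >= BINGE_THRESHOLD_UNITS[sex]


print(is_binge_drinker({"beer": 4.0, "wine": 3.0}, "male"))    # 7 units
print(is_binge_drinker({"beer": 4.0, "wine": 3.0}, "female"))  # 7 units
```

The same seven units fall below the threshold for a man and above it for a woman.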
Obesity Obesity generated from the height and weight of respondents, as measured by the HSfE interviewers. BMI is the weight in kilograms divided by the square of the height in metres. Defined as obese if BMI ≥ 30 kg/m²
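The BMI definition translates directly into code. A minimal sketch with illustrative function names:

```python
OBESITY_THRESHOLD = 30.0  # kg/m², from the definition above


def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2


def is_obese(weight_kg, height_m):
    """True when BMI is at or above the obesity threshold."""
    return bmi(weight_kg, height_m) >= OBESITY_THRESHOLD


print(round(bmi(95, 1.75), 1), is_obese(95, 1.75))
print(round(bmi(90, 1.75), 1), is_obese(90, 1.75))
```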
The covariate dataset The term ‘covariate’ describes area-level characteristics potentially related to the four healthy lifestyle indicators, e.g. –% of residents aged 16 years + residing as a couple –life expectancy –SHA –urban/rural indicator These covariates are generally average values or proportions relating to all individuals or households in the area Census provided the main source for demographic and social covariate data –because of its total geographical and population coverage
CASE STUDY: CURRENT SMOKING The process of creating the model-based estimates of healthy lifestyle behaviours for 352 LAs in England involved three main stages Stage 1 Fitting the relationship between current smoking and area-level characteristics Stage 2 Producing an initial estimate of expected prevalence Stage 3 Adjusting the LA estimate to the direct HSfE estimate for that SHA
CASE STUDY: CURRENT SMOKING Stage 1 Fitting the relationship between current smoking and area-level characteristics Using the combined HSfE data, those area-level characteristics most strongly related to whether an individual was a current smoker were identified. This was done using a technique called ‘logistic regression’.
Odds Ratio Odds Ratio = (a/c) / (b/d) Odds of having disease given exposure = a/c Odds of having disease given no exposure = b/d Odds ratio < 1 : Lower odds in exposed group Odds ratio = 1 : Same odds Odds ratio > 1 : Higher odds in exposed group
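A small worked example of the odds ratio. The 2×2 cell layout is an assumption inferred from the formula: a = exposed with disease, b = unexposed with disease, c = exposed without disease, d = unexposed without disease. The counts are invented for illustration.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio (a/c) / (b/d) for a 2x2 table.

    Assumed layout: a = exposed, diseased; b = unexposed, diseased;
    c = exposed, not diseased; d = unexposed, not diseased.
    """
    odds_exposed = a / c    # odds of disease in the exposed group
    odds_unexposed = b / d  # odds of disease in the unexposed group
    return odds_exposed / odds_unexposed


# Hypothetical counts: 30 of 100 exposed and 20 of 200 unexposed have the disease.
print(round(odds_ratio(a=30, b=20, c=70, d=180), 2))
```

Here the odds of disease are higher in the exposed group, so the ratio is greater than 1.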
CASE STUDY: CURRENT SMOKING Logistic regression estimates are displayed on the odds scale and sometimes on the log-odds scale. Using the log-odds scale: 1. A log-odds estimate of 0 means that the covariate has no effect on current smoking, after adjusting for the other variables in the model. 2. A log-odds estimate < 0 indicates a decrease (that is, an increase in the covariate is associated with a decrease in the odds of being a current smoker). 3. A log-odds estimate > 0 indicates an increase (that is, an increase in the covariate is associated with an increase in the odds of being a current smoker).
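The link between the two scales is exponentiation: the odds ratio is exp(log-odds estimate). A minimal sketch, using invented coefficient values and a hypothetical baseline prevalence of 25%, showing how each sign moves the predicted probability:

```python
import math


def odds_ratio_from_log_odds(beta):
    """Convert a log-odds estimate to the odds scale."""
    return math.exp(beta)


def shifted_probability(baseline_p, beta):
    """Probability after multiplying the baseline odds by exp(beta)."""
    odds = baseline_p / (1 - baseline_p) * math.exp(beta)
    return odds / (1 + odds)


for beta in (-0.2, 0.0, 0.2):  # hypothetical log-odds estimates
    print(
        f"log-odds {beta:+.1f}: odds ratio {odds_ratio_from_log_odds(beta):.2f}, "
        f"probability {shifted_probability(0.25, beta):.3f}"
    )
```

A log-odds estimate of 0 corresponds to an odds ratio of 1 and leaves the probability unchanged.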
Does the model make sense? A number of diagnostic checks used: –to assess the appropriateness of the models developed –to show that the models are well specified and the assumptions sound These processes ensure that: –the methodology and its application are valid, –the models developed are the best possible for the data available, and –the model-based estimates are credible.
Validating the model Provides confidence in the accuracy of the estimates and the associated CIs Need to validate: –the process of making the estimates –the estimates themselves Comparison of the model-based estimates with other sources to establish the credibility of the model-based estimates
Confidence intervals Confidence intervals produced to make the margin of error around the estimates clear. The interval reflects the range within which the true value is likely to lie. The CIs represent the uncertainty in the modelling process. At the 95% confidence level, assuming that the model is a good representation of reality, each CI would be expected to contain the true value 95 times out of 100.
The survey context Two key issues to consider when using the estimates are sampling error and non-sampling error. Sampling error arises as a result of drawing a sample rather than conducting a complete census.
Non-sampling errors Defined as errors arising during the course of survey activities Unlike sampling errors, there is no simple and direct method of estimating the size of non-sampling errors. Despite our best efforts to avoid them, non-sampling errors are inevitable, particularly in large-scale data collections.
Non-sampling errors Sources of non-sampling error include: 1. The respondent –may not want to reveal their true amount of alcohol consumption or may unintentionally provide incorrect information (measurement error) 2. The interviewer –may make mistakes when measuring the height and weight of respondents 3. Refusals to participate –Adults contacted for the survey may refuse to have their height and weight measured or refuse to participate in the survey.
Limitations of the estimates A standard direct estimate for a particular area based solely on sample respondents located within the area represents an estimate of the actual prevalence of health behaviours, such as current smoking or obesity, for the area in question. A model-based estimate for a particular area is the expected prevalence for that area based on its population characteristics (as measured by the census/administrative data) and does not represent an estimate of the actual prevalence.
Limitations of the estimates To interpret the estimates you should use statements such as: “Given the characteristics of the local population, we would expect approximately x% of adults within LA/PCO Y to smoke/be obese.” Model-based estimates cannot take account of any additional local factors that may impact on the true prevalence rate –e.g. local interventions –subtle differences in population demographics The estimates cannot be used to monitor performance
Limitations of the estimates Cannot usually compare between two sets of model-based estimates in two different time periods Users warned not to interpret the difference between the point estimates as a measure of change. Typically: –The models have been fitted separately –Built on a different set of geographies –The covariates are not the same –Each estimate is given with a 95% confidence interval The prevalence for an area should be viewed in light of its CI, not just the point estimate. –To disregard the CIs ignores the uncertainty that surrounds estimates derived from survey data
BBC news online October 2006
Limitations of the estimates As with any ranking based on estimates, care must be taken in interpreting the ranking of the model-based estimates. –The estimates are expected prevalences, not measured actual prevalence –Assigning the areas to bands would still require the uncertainty in the ranking/banding to be represented
Examples of data use
Model-based estimates of smoking prevalence (%) Columns: Area, Lower, Upper, Estimate Rows: LB Harrow, LB Redbridge, LB Brent, RBKC, LB Tower Hamlets, LB Lambeth, LB Barking & Dagenham, London, England
DON’T TURN THE PAGE! Exercise A Which LAs have higher and which have lower smoking prevalence than London?
An area can be described as statistically significantly different from the regional or national average if the CIs for those estimates do not overlap. Barking & Dagenham PCT has a significantly higher (model-based) current smoking rate than England (and than London). Redbridge has a significantly lower (model-based) current smoking rate than England and London. Tower Hamlets PCT cannot be said to have a significantly higher estimate than England as a whole since the CIs overlap –NB Sex and ethnicity! CIs depend on no. of areas with at least one data point
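The non-overlap rule can be expressed as a short check. The intervals below are hypothetical placeholders, not the published figures for any area; the function names are illustrative.

```python
def intervals_overlap(ci_a, ci_b):
    """True when two (lower, upper) intervals share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]


def compare_with_reference(area_ci, reference_ci):
    """Classify an area against a reference using the CI non-overlap rule."""
    if intervals_overlap(area_ci, reference_ci):
        return "not significantly different"
    if area_ci[0] > reference_ci[1]:
        return "significantly higher"
    return "significantly lower"


# Hypothetical 95% CIs in percentage points (invented for illustration).
reference = (23.0, 25.0)
examples = {
    "Area P": (28.0, 35.0),  # lies wholly above the reference interval
    "Area Q": (14.0, 20.0),  # lies wholly below
    "Area R": (24.0, 31.0),  # overlaps
}
for name, ci in examples.items():
    print(name, "is", compare_with_reference(ci, reference))
```

Non-overlap is a conservative rule: two intervals can overlap slightly even when a formal test of the difference would be significant.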
Columns: LA, Lower, Upper, Estimate Rows: Harrow, Redbridge, Brent, K & C, Tower Hamlets, Lambeth, Barking & Dagenham, London, England
Examples of data use Supporting indicators Model-based estimates of healthy lifestyle behaviours can be used in conjunction with other data sources to build up an area profile. E.g. the 2007 IMD, Council Tax data, Urban/Rural classifications, HES, ONS area classifications, and commercial geodemographic classifications such as ACORN. Both LA and PCO estimates should also be viewed in relation to the direct estimates derived from the HSfE data over the same time period. E.g. Health profiles, NWPHO alcohol profiles
Limitations of the estimates Care must be taken in interpreting the ranking of the model-based estimates. E.g. the confidence interval around the highest ranked MSOA suggests that the estimate lies among the group of MSOAs with the highest income levels rather than being the MSOA with the highest average income.
Source: Neighbourhood Statistics Using model-based estimates Maps The model-based MSOA-level estimates of average weekly household income can be displayed on maps to show broad trends.
Exercise B Would using model-based estimates be appropriate in the following situations: 1. In setting a baseline for a target for the Local area agreement? 2. Predicting need and planning for service provision? 3. Monitoring change over time? 4. Creating a profile of an area?
Synthetic estimates Use a statistical model to express the relationship between individual healthy lifestyle behaviour and area-level information Outputs from that model used to generate a model-based estimate for all areas Estimates represent the expected prevalence for an area based on its population characteristics So cannot be used to monitor local interventions
Fuller technical description of the methodology See the project reports on the NHS Information Centre website: collections/population-and-geography/neighbourhoodstatistics/neighbourhood-statistics:-healthy-lifestyle-behaviours:-model-based-estimates