Data quality/usability and population-based biobanks
Paul Burton
Dept of Health Sciences, Dept of Genetics, University of Leicester
Structure of talk
- Why does data quality/usability matter?
- UK Biobank as an illustration
- Statistical power of nested case-control studies
- Expected event rates in UK Biobank
- Biobank harmonisation
- Conclusions
Why does data quality/usability matter?
Epidemiological analysis at its simplest
- Odds ratio (OR) = (120*240)/(200*100) = 1.44 [1.04 – 2.0]
- May also adjust for a confounder, e.g. high saturated fat intake [y/n]
- What is the impact of error in an outcome, in an explanatory variable or in a confounder?
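The odds ratio and its interval on this slide can be reproduced with a few lines of Python. This is a minimal sketch under two assumptions. First, the 2x2 table itself did not survive extraction, so the cell layout (120 exposed cases, 100 unexposed cases, 200 exposed controls, 240 unexposed controls) is inferred from the arithmetic shown. Second, the slide does not say how the interval was computed, so the usual Woolf (log odds ratio) method is assumed. The function name `odds_ratio_ci` is illustrative.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with a Woolf-type confidence interval.

    a = exposed cases, b = unexposed cases,
    c = exposed controls, d = unexposed controls.
    The cell layout is inferred from the slide's arithmetic (assumption).
    """
    or_hat = (a * d) / (b * c)
    # Standard error of log(OR): square root of the summed reciprocal cell counts
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_hat) - z * se)
    upper = math.exp(math.log(or_hat) + z * se)
    return or_hat, lower, upper

or_hat, lower, upper = odds_ratio_ci(120, 100, 200, 240)
print(f"OR = {or_hat:.2f} [{lower:.2f} - {upper:.2f}]")
```

Under these assumptions the result is an OR of 1.44 with an interval of about 1.04 to 1.99, which agrees with the slide's figures after rounding.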
Systematic error
- Some disease-free smokers deny smoking
- Odds ratio (OR) = (120*250)/(190*100) = 1.58
Random error
- At random, 10% of subjects state their exposure incorrectly
- Odds ratio (OR) = (118*236)/(204*102) = 1.34
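The cell counts behind the systematic and random error examples can be derived from the original table by moving the expected number of subjects between the exposed and unexposed cells. The sketch below assumes the same inferred table as before (cases 120 exposed / 100 unexposed, controls 200 exposed / 240 unexposed). The 5% denial rate among exposed controls is also inferred, from the change 200 to 190; it is not stated on the slide. Function names are illustrative.

```python
def odds_ratio(a, b, c, d):
    """a, b = exposed/unexposed cases; c, d = exposed/unexposed controls."""
    return (a * d) / (b * c)

def misclassify(exposed, unexposed, p_exp_to_unexp, p_unexp_to_exp):
    """Expected counts after a given proportion in each cell is misreported."""
    new_exposed = exposed * (1 - p_exp_to_unexp) + unexposed * p_unexp_to_exp
    new_unexposed = unexposed * (1 - p_unexp_to_exp) + exposed * p_exp_to_unexp
    return new_exposed, new_unexposed

cases = (120, 100)      # inferred from the slide's arithmetic (assumption)
controls = (200, 240)   # inferred from the slide's arithmetic (assumption)

# Random error: 10% of all subjects misreport, in both directions
rand_cases = misclassify(*cases, 0.10, 0.10)        # about (118, 102)
rand_controls = misclassify(*controls, 0.10, 0.10)  # about (204, 236)
or_random = odds_ratio(*rand_cases, *rand_controls)

# Systematic error: only exposed controls under-report (5% is inferred)
sys_controls = misclassify(*controls, 0.05, 0.0)    # about (190, 250)
or_systematic = odds_ratio(*cases, *sys_controls)

print(f"random error OR     = {or_random:.2f}")
print(f"systematic error OR = {or_systematic:.2f}")
```

The random error pulls the estimate towards 1 (shrinkage), while the one-directional denial among controls pushes it away from 1.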
The impact of errors
- Systematic errors in outcome or explanatory variables: systematic bias in either direction (true OR = 2, estimated OR = e.g. 1.5 or 2.7)
- Random errors in binary outcomes or any explanatory variables: shrinkage bias (true OR = 2, estimated OR = e.g. 1.5)
- Random errors in confounding variables: systematic bias in either direction (true OR = 2, estimated OR = e.g. 1.5 or 2.7)
Errors in biobanks: random errors
- Loss of power is the primary problem
- Biobank sample sizes are very large, so why is there a problem?
Errors in biobanks: random errors
- But why are biobank sample sizes so large?
- NB: it is the biobanks that are very large, not the nested case-control studies
- Need to detect small relative risks (e.g. OR = 1.3)
- Power generally limited (see later)
- Even small errors can have catastrophic effects
- Apparent causal effects easily created or destroyed
Errors in biobanks: systematic errors
- Small real effects are a major issue again
- Must understand data collection protocols, and must attempt to optimise those protocols
  - UK Biobank
  - P3G Observatory
What is UK Biobank?
Basic design features
- A prospective cohort study
- 500,000 adults across the UK
- Middle aged (40-69 years)
- A population-based biobank
- Not disease or exposure based
- Recruitment via electronic GP lists
- “Broad spectrum” not “fully representative”
- Individuals, not families
- MRC, Wellcome Trust, DH, Scottish Executive: £61M
Basic design features
- Longitudinal health tracking
- Nested case-control studies
- Long time-horizon
- Owned by the Nation
- Central Administration: Manchester
- PI: Prof Rory Collins, Oxford
- 6 collaborating groups (RCCs) of university scientists
Statistical power and sample size
Focus on power of nested case-control analyses
- Likely to be very common analyses
- Power limiting
Issues that are often ignored in standard power calculations
- Multiple testing/low prior probability of association*
- Interactions*
- Unobserved frailty
- Misclassification*
  - Genotype
  - Environmental determinant
  - Case-control status
- Subgroup analyses*
- Population substructure
Power calculations
- Work with least powerful setting
- Binary disease, binary genotype, binary environmental exposure
- Logistic regression analysis; interactions = departure from a multiplicative model
- Complexity (arbitrary but reasonable)
Summarise power using “Minimum Detectable Odds Ratios” (MDORs) calculated by ‘iterative simulation’
- Estimate minimum ORs detectable with 80% power at stated level of statistical significance under specified scenario
Genetic main effects
Whole genome scan: genetic main effect, p < 10^-7
Gene:environment interaction 20,000 cases
Summary: rule of thumb
80% power for genotype frequency = 0.1 (allele frequency 0.05 under dominant model)

Effect                  Detectable OR   p        Cases
Genetic main effect     1.5             10^-4    5,000
Genetic main effect     1.3             10^-4    10,000
Genetic main effect     1.2             10^-4    20,000
Genetic main effect     1.4             10^-7    10,000
Genetic main effect     1.3             10^-7    20,000
G:E interaction*        2.0             10^-4    20,000

* Environmental exposure prevalence = 0.2
Effect of realistic data errors
Expected event rates in UK Biobank
Taking account of
- Age range at recruitment (years)
- Recruitment over 5 years
- All cause mortality
- Disease incidence (“healthy cohort effect”)
- Migration overseas
- Comprehensive withdrawal (max 1/500 p.a.)
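How these factors combine can be illustrated with a very simple year-by-year projection. This is only a sketch: the incidence, mortality and migration rates below are hypothetical placeholders, not figures from the talk, and they are held constant over time, so ageing and the healthy cohort effect are ignored. Only the cohort size (500,000), the 5-year recruitment period and the withdrawal ceiling (1/500 per year) come from the slides. The function name `expected_cases` is illustrative.

```python
def expected_cases(n_recruits, years, incidence, mortality, migration,
                   withdrawal=1 / 500, recruit_years=5):
    """Cumulative expected incident cases at the end of each year.

    All rates are annual proportions. Recruitment is spread evenly over
    `recruit_years`; rates are constant over time (a simplifying assumption).
    """
    at_risk = 0.0
    total = 0.0
    cumulative = []
    for year in range(1, years + 1):
        if year <= recruit_years:
            at_risk += n_recruits / recruit_years  # new recruits this year
        new_cases = at_risk * incidence
        total += new_cases
        # Remove new cases, then losses to death, migration and withdrawal
        loss = mortality + migration + withdrawal
        at_risk = (at_risk - new_cases) * (1 - loss)
        cumulative.append(total)
    return cumulative

# Hypothetical rates, chosen only to make the example run
cum = expected_cases(500_000, years=30, incidence=0.002,
                     mortality=0.01, migration=0.002)
for year in (10, 20, 30):
    print(f"year {year}: about {cum[year - 1]:,.0f} expected cases")
```

A realistic projection would replace the constant rates with age- and sex-specific ones, which is why the expected time to a given number of cases depends so strongly on the condition.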
No need to contact subjects
Smaller sample sizes
Interim conclusions
- Having taken account of realistic bioclinical complexity, UK Biobank is just large enough to be of great value as a stand-alone research infrastructure
- Data quality is crucial, in particular errors in outcome or explanatory variables, or in confounders
- Its value will be greatly augmented if it proves possible to set up a coherent and scientifically harmonised international network of biobanks and large cohort studies
Harmonising biobanks internationally
Why harmonise? Basic aim is to enable and promote data pooling, in a manner that recognises and takes appropriate account of systematic differences between studies.
Why harmonise?
- Investigate less common (but not rare) conditions
  - UKBB: Ca stomach, 2,500 cases in 29 years
  - 6 UKBB equivalents: 10,000 cases in 20 years
- Investigate smaller ORs
  - Genetic main effect 1.5 to 1.2 requires 5,000 to 20,000 cases: 4 UKBB equivalents
- Analysis based on subsets: homogeneous classes of phenotype, or e.g. by sex
Why harmonise?
- Earlier analyses
  - UKBB: Alzheimer's disease, 10,000 cases in 18 years
  - 5 UKBB equivalents: 9 years
- Events at younger ages
- Broad range of environmental exposures
- Aim for 4-6 UKBB equivalents: 2M – 3M recruits
Harmonisation initiatives
- Public Population Project in Genomics (P3G): Canada + Europe
  - Tom Hudson, Bartha Knoppers, Leena Peltonen, Isabel Fortier …
- Population Biobanks: FP6 Co-ordination Action (PHOEBE: Promoting Harmonisation Of Epidemiological Biobanks in Europe)
  - Camilla Stoltenberg, Paul Burton, Leena Peltonen, George Davey Smith …
Harmonisation in the P3G Observatory (from Isabel Fortier)
- Description
- Comparison
- Harmonisation
- Data quality crucial at every stage
Final conclusions
- Power of individual biobanks is limited
- Minimisation of measurement error is crucial
- Harmonisation is crucial if we are to optimise the value of biobanks internationally
- Harmonisation depends on a full understanding of all aspects of data quality
Extra slides
Rarer genotypes Genetic main effects
Gene:environment interaction 10,000 cases
Hattersley AT, McCarthy MI. A question of standards: what makes a good genetic association study? Lancet 2005; in press.
Summarise power using MDORs calculated by ‘iterative simulation’
Want minimum ORs detectable with 80% power at stated level of statistical significance
1. Guess starting values for ORs
2. Simulate population under specified scenario
3. Sample required number of cases and controls
4. Analyse resultant case-control study in standard way
5. Repeat steps 2-4 1,000 times
6. Use empirical statistical power results from the 1,000 analyses to update ORs to new values expected to generate a power of 80%
Repeat steps 2-6 until all ORs have 80% power
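The iterative scheme above can be sketched for the simplest case, a single binary genotype with no misclassification or interaction. This is not the simulation used for the talk, and it makes several assumptions. Genotype counts are drawn directly for cases and controls rather than by simulating a whole population and sampling from it. Each study is analysed with a Wald test on the log odds ratio, which is equivalent to a logistic regression with one binary covariate. The OR update rule is a normal-approximation guess, since the slide does not say how the update is done. The 1:1 case-control ratio in the usage example is also assumed. All function names are illustrative.

```python
import math
from statistics import NormalDist

import numpy as np

NORMAL = NormalDist()

def empirical_power(or_true, n_cases, n_controls, geno_freq, alpha, n_sims, rng):
    """Share of simulated case-control studies with a significant Wald test."""
    # Genotype frequency among cases implied by the OR (controls ~ population)
    p_case = geno_freq * or_true / (1 - geno_freq + geno_freq * or_true)
    a = rng.binomial(n_cases, p_case, size=n_sims).astype(float)        # carrier cases
    c = rng.binomial(n_controls, geno_freq, size=n_sims).astype(float)  # carrier controls
    b, d = n_cases - a, n_controls - c
    a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5  # guard against empty cells
    log_or = np.log(a * d / (b * c))
    se = np.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z_crit = NORMAL.inv_cdf(1 - alpha / 2)
    return float(np.mean(np.abs(log_or / se) > z_crit))

def mdor(n_cases, n_controls, geno_freq, alpha, target=0.80,
         n_sims=1000, start_or=1.5, max_iter=20, tol=0.02, seed=0):
    """Iterate the OR until the empirical power is close to the target."""
    rng = np.random.default_rng(seed)
    z_alpha = NORMAL.inv_cdf(1 - alpha / 2)
    z_target = NORMAL.inv_cdf(target)
    or_now, power = start_or, float("nan")
    for _ in range(max_iter):
        power = empirical_power(or_now, n_cases, n_controls,
                                geno_freq, alpha, n_sims, rng)
        if abs(power - target) <= tol:
            break
        # Assumed update: rescale log(OR) by the ratio of required to
        # achieved non-centrality, clamped to avoid wild jumps
        p = min(max(power, 0.5 / n_sims), 1 - 0.5 / n_sims)
        scale = (z_alpha + z_target) / (z_alpha + NORMAL.inv_cdf(p))
        scale = min(max(scale, 0.5), 2.0)
        or_now = math.exp(math.log(or_now) * scale)
    return or_now, power

or_est, achieved = mdor(n_cases=5000, n_controls=5000, geno_freq=0.1, alpha=1e-4)
print(f"MDOR about {or_est:.2f} at empirical power {achieved:.2f}")
```

Because this idealised version leaves out misclassification, frailty and the other complexities listed earlier, it should give a smaller detectable OR than the rule-of-thumb figures in the talk.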
Taking account of
- Age range at recruitment (years)
- Recruitment over 5 years
- All cause mortality
- Disease incidence (“healthy cohort effect”)
- Migration overseas
- Comprehensive withdrawal (max 1/500 p.a.)
- Partial withdrawal (cf. Birth Cohort)
Necessary to contact subjects