Data quality/usability and population-based biobanks. Paul Burton, Dept of Health Sciences, Dept of Genetics, University of Leicester.

1 Data quality/usability and population-based biobanks
Paul Burton, Dept of Health Sciences, Dept of Genetics, University of Leicester

2 Structure of talk
- Why does data quality/usability matter?
- UK Biobank as an illustration
- Statistical power of nested case-control studies
- Expected event rates in UK Biobank
- Biobank harmonisation
- Conclusions

3 Why does data quality/usability matter?

4 Epidemiological analysis at its simplest
- Odds ratio (OR) = (120*240)/(200*100) = 1.44 [1.04 – 2.0]
- May also adjust for a confounder, e.g. high saturated fat intake [y/n]
- What is the impact of error in an outcome, in an explanatory variable or in a confounder?
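The odds ratio and interval on this slide can be reproduced with a short calculation. The sketch below assumes the bracketed range is a 95% confidence interval obtained with the usual log-scale (Woolf) method, and that the 2x2 table is laid out as exposed cases = 120, unexposed cases = 100, exposed controls = 200, unexposed controls = 240; the slide itself gives only the cross-product, so the function name and cell labels are illustrative.

```python
import math


def odds_ratio_ci(exp_cases, unexp_cases, exp_controls, unexp_controls, z=1.96):
    """Cross-product odds ratio with a log-scale (Woolf) confidence interval."""
    or_hat = (exp_cases * unexp_controls) / (exp_controls * unexp_cases)
    # Standard error of log(OR): square root of the summed reciprocal cell counts.
    se = math.sqrt(1 / exp_cases + 1 / unexp_cases
                   + 1 / exp_controls + 1 / unexp_controls)
    lower = math.exp(math.log(or_hat) - z * se)
    upper = math.exp(math.log(or_hat) + z * se)
    return or_hat, lower, upper


# Cell counts assumed from the slide's cross-product (120*240)/(200*100).
or_hat, lower, upper = odds_ratio_ci(120, 100, 200, 240)
print(f"OR = {or_hat:.2f} [{lower:.2f}, {upper:.2f}]")  # OR = 1.44 [1.04, 1.99]
```

Under these assumptions the interval comes out as 1.04 to 1.99, consistent with the rounded range shown on the slide.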

5 Systematic error
- Some disease-free smokers deny smoking
- Odds ratio (OR) = (120*250)/(190*100) = 1.58

6 Random error
- At random, 10% of subjects state their exposure incorrectly
- Odds ratio (OR) = (118*236)/(204*102) = 1.34
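The two error scenarios can be checked numerically against the error-free table (OR = 1.44). The sketch below assumes the exposure is smoking, that the table is 120/100 exposed/unexposed cases and 200/240 exposed/unexposed controls, and that the systematic error amounts to 10 disease-free smokers (the number implied by the counts 190 and 250). The helper names are illustrative, not taken from the talk.

```python
def odds_ratio(exp_cases, unexp_cases, exp_controls, unexp_controls):
    """Cross-product odds ratio for a 2x2 exposure-by-disease table."""
    return (exp_cases * unexp_controls) / (exp_controls * unexp_cases)


def misreport(exposed, unexposed, rate):
    """Expected counts when a fraction `rate` of each group reports the opposite exposure."""
    return (exposed * (1 - rate) + unexposed * rate,
            unexposed * (1 - rate) + exposed * rate)


cases = (120, 100)     # (smokers, non-smokers) among cases, assumed layout
controls = (200, 240)  # (smokers, non-smokers) among disease-free subjects

or_true = odds_ratio(*cases, *controls)

# Systematic error: 10 disease-free smokers report themselves as non-smokers.
or_systematic = odds_ratio(cases[0], cases[1], controls[0] - 10, controls[1] + 10)

# Random error: 10% of all subjects, cases and controls alike, misreport exposure.
or_random = odds_ratio(*misreport(*cases, rate=0.10),
                       *misreport(*controls, rate=0.10))

print(f"true {or_true:.2f}, systematic {or_systematic:.2f}, random {or_random:.2f}")
# true 1.44, systematic 1.58, random 1.34
```

The systematic error pushes the estimate away from the true value (here upward), while non-differential random error shrinks it towards 1.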

7 The impact of errors
- Systematic errors in outcome or explanatory variables → systematic bias in either direction
  True OR = 2 → estimated OR = e.g. 1.5 or 2.7
- Random errors in binary outcomes or any explanatory variables → shrinkage bias
  True OR = 2 → estimated OR = e.g. 1.5
- Random errors in confounding variables → systematic bias in either direction
  True OR = 2 → estimated OR = e.g. 1.5 or 2.7

8 Errors in biobanks
- Random errors
  Loss of power is primary problem
  Biobank sample sizes very large, so why is there a problem?

9 Errors in biobanks
- Random errors
  But: why are biobank sample sizes so large?
- NB: biobanks are very large, but the nested case-control studies within them are not
  Need to detect small relative risks (e.g. OR=1.3)
  Power generally limited (see later)
  Small error effects catastrophic
- Apparent causal effects easily created or destroyed

10 Errors in biobanks
- Systematic errors
  Small real effects a major issue again
  Must understand data collection protocols, and must attempt to optimise those protocols
  UK Biobank
  P3G Observatory

11 What is UK Biobank?

12 Basic design features
- A prospective cohort study
- 500,000 adults across UK
- Middle aged (40-69 years)
- A population-based biobank
  Not disease or exposure based
  Recruitment via electronic GP lists
- “Broad spectrum” not “fully representative”
- Individuals not families
- MRC, Wellcome Trust, DH, Scottish Executive
  £61M

13 Basic design features
- Longitudinal health tracking
- Nested case-control studies
- Long time-horizon
- Owned by the Nation
- Central Administration – Manchester
  PI: Prof Rory Collins - Oxford
- 6 collaborating groups (RCCs) of university scientists

14 Statistical power and sample size

15 Focus on power of nested case-control analyses
- Likely to be very common analyses
- Power limiting

16 Issues that are often ignored in standard power calculations
- Multiple testing/low prior probability of association*
- Interactions*
- Unobserved frailty
- Misclassification*
  Genotype
  Environmental determinant
  Case-control status
- Subgroup analyses*
- Population substructure

17 Power calculations
- Work with least powerful setting
  Binary disease, binary genotype, binary environmental exposure
- Logistic regression analysis; interactions = departure from a multiplicative model
- Complexity (arbitrary but reasonable)

18 Summarise power using “Minimum Detectable Odds Ratios” (MDORs) calculated by ‘iterative simulation’
- Estimate minimum ORs detectable with 80% power at stated level of statistical significance under specified scenario

19 Genetic main effects

20 Whole genome scan
- Genetic main effect, p < 10^-7

21 Gene:environment interaction
- 20,000 cases

22 Summary – rule of thumb
- 80% power for genotype frequency = 0.1 (allele frequency ≈ 0.05 under dominant model)
  Genetic main effect 1.5, p = 10^-4 → 5,000 cases
  Genetic main effect 1.3, p = 10^-4 → 10,000 cases
  Genetic main effect 1.2, p = 10^-4 → 20,000 cases
  Genetic main effect 1.4, p = 10^-7 → 10,000 cases
  Genetic main effect 1.3, p = 10^-7 → 20,000 cases
  G:E interaction (environmental exposure prevalence = 0.2) 2.0, p = 10^-4 → 20,000 cases

23 Effect of realistic data errors

24 Expected event rates in UK Biobank

25 Taking account of
- Age range at recruitment 40-69 years
- Recruitment over 5 years
- All cause mortality
- Disease incidence (“healthy cohort effect”)
- Migration overseas
- Comprehensive withdrawal (max 1/500 p.a.)

26 No need to contact subjects

27 Smaller sample sizes

28 Interim conclusions
- Having taken account of realistic bioclinical complexity, UK Biobank is just large enough to be of great value as a stand-alone research infrastructure
- Data quality, in particular errors in outcome or explanatory variables or in confounders, is crucial
- Its value will be greatly augmented if it proves possible to set up a coherent and scientifically harmonised international network of biobanks and large cohort studies

29 Harmonising biobanks internationally

30 Why harmonise?
- Basic aim is to enable and promote data pooling, in a manner that recognises and takes appropriate account of systematic differences between studies.

31 Why harmonise?
- Investigate less common (but not rare) conditions
  UKBB: Ca stomach 2,500 cases in 29 years → 6 UKBB equivalents: 10,000 cases in 20 years
- Investigate smaller ORs
  GME 1.5 → 1.2 requires 5,000 → 20,000 cases → 4 UKBB equivalents
- Analysis based on subsets – homogeneous classes of phenotype, or e.g. by sex

32 Why harmonise?
- Earlier analyses
  UKBB: Alzheimer's disease, 10,000 cases in 18 yrs → 5 UKBB equivalents → 9 years
- Events at younger ages
- Broad range of environmental exposures
- Aim for 4-6 UKBB equivalents
  2M – 3M recruits

33 Harmonisation initiatives
- Public Population Project in Genomics (P3G): Canada + Europe
  Tom Hudson, Bartha Knoppers, Leena Peltonen, Isabel Fortier …
- Population Biobanks FP6 Co-ordination Action (PHOEBE – Promoting Harmonisation Of Epidemiological Biobanks in Europe)
  Camilla Stoltenberg, Paul Burton, Leena Peltonen, George Davey Smith …

34 Harmonisation in the P3G Observatory (from Isabel Fortier)
- Description
- Comparison
- Harmonisation
- Data quality crucial at every stage

35 Final conclusions
- Power of individual biobanks is limited
- Minimisation of measurement error is crucial
- Harmonisation is crucial if we are to optimise the value of biobanks internationally
- Harmonisation depends on a full understanding of all aspects of data quality

36 Extra slides

37 Rarer genotypes
- Genetic main effects

38 Gene:environment interaction
- 10,000 cases

39 Hattersley AT, McCarthy MI. A question of standards: what makes a good genetic association study? Lancet 2005; in press.

40 Summarise power using MDORs calculated by ‘iterative simulation’
- Want minimum ORs detectable with 80% power at stated level of statistical significance
1. Guess starting values for ORs
2. Simulate population under specified scenario
3. Sample required number of cases and controls
4. Analyse resultant case-control study in standard way
5. Repeat 2, 3, 4 1,000 times
6. Use empirical statistical power results from the 1,000 analyses to update ORs to new values expected to generate a power of 80%
Repeat 2-6 till all ORs have 80% power
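The loop described on this slide can be sketched for a single genetic main effect. This is a minimal illustration, not the simulation used in the talk, and it makes several labelled assumptions: genotype carriage is drawn directly for cases and controls under a rare-disease approximation rather than from a simulated population; the analysis is a Wald test on the 2x2 log odds ratio (comparable to logistic regression with one binary covariate); the update in step 6 rescales log(OR) using a normal approximation; and the sample sizes and replicate count are kept small so the example runs quickly. All function and parameter names are hypothetical.

```python
import math
import random
from statistics import NormalDist

NORM = NormalDist()


def simulate_power(odds_ratio, n_cases, n_controls, geno_freq, alpha, reps, rng):
    """Steps 2-5: share of simulated case-control studies in which a two-sided
    Wald test on log(OR) rejects OR = 1."""
    # Assumption: controls carry the genotype at the population frequency.
    odds_in_cases = odds_ratio * geno_freq / (1 - geno_freq)
    p_cases = odds_in_cases / (1 + odds_in_cases)
    z_crit = NORM.inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(reps):
        a = sum(rng.random() < p_cases for _ in range(n_cases))       # carrier cases
        b = sum(rng.random() < geno_freq for _ in range(n_controls))  # carrier controls
        # Add 0.5 to every cell so an empty cell cannot cause a division by zero.
        cells = (a + 0.5, n_cases - a + 0.5, b + 0.5, n_controls - b + 0.5)
        log_or = math.log(cells[0] * cells[3] / (cells[1] * cells[2]))
        se = math.sqrt(sum(1 / x for x in cells))
        if abs(log_or / se) > z_crit:
            rejections += 1
    return rejections / reps


def minimum_detectable_or(n_cases, n_controls, geno_freq, alpha, reps=300,
                          target=0.80, start_or=1.5, tol=0.02, max_iter=10, seed=1):
    """Steps 1 and 6: iterate the OR until empirical power is close to `target`."""
    rng = random.Random(seed)
    z_alpha = NORM.inv_cdf(1 - alpha / 2)
    z_target = NORM.inv_cdf(target)
    log_or = math.log(start_or)  # step 1: starting guess
    for _ in range(max_iter):
        power = simulate_power(math.exp(log_or), n_cases, n_controls,
                               geno_freq, alpha, reps, rng)
        if abs(power - target) <= tol:
            break
        # Assumed update rule: under a normal approximation,
        # power = Phi(log_or / se - z_alpha), so rescale log_or to hit the target.
        clipped = min(max(power, 0.10), 0.99)
        log_or *= (z_alpha + z_target) / (z_alpha + NORM.inv_cdf(clipped))
    return math.exp(log_or)


# Small illustrative scenario; the slides use far more cases and stricter p-values.
mdor = minimum_detectable_or(n_cases=500, n_controls=500, geno_freq=0.1, alpha=0.05)
print(f"Approximate MDOR: {mdor:.2f}")
```

Because the power estimates are themselves noisy, the tolerance `tol` should not be set much tighter than the Monte Carlo error implied by `reps`; with the 1,000 replicates mentioned on the slide, that error is roughly 0.013 at 80% power.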

41 Taking account of
- Age range at recruitment 40-69 years
- Recruitment over 5 years
- All cause mortality
- Disease incidence (“healthy cohort effect”)
- Migration overseas
- Comprehensive withdrawal (max 1/500 p.a.)
- Partial withdrawal (cf. 1958 Birth Cohort)

43 Necessary to contact subjects


