Data quality/usability and population-based biobanks. Paul Burton, Dept of Health Sciences, Dept of Genetics, University of Leicester.

Structure of talk
- Why does data quality/usability matter?
- UK Biobank as an illustration
- Statistical power of nested case-control studies
- Expected event rates in UK Biobank
- Biobank harmonisation
- Conclusions

Why does data quality/usability matter?

Epidemiological analysis at its simplest
- Odds ratio (OR) = (120*240)/(200*100) = 1.44 [95% CI 1.04 – 2.0]
- May also adjust for a confounder, e.g. high saturated fat intake [y/n]
- What is the impact of error in an outcome, an explanatory variable or a confounder?
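The odds ratio and interval on this slide can be reproduced with a few lines of Python. This is a minimal sketch, not the author's calculation: the slide does not show the 2x2 table, so the cell layout below is an assumption inferred from the formula, and the interval is assumed to be a 95% Woolf (log-based) confidence interval, which matches the quoted bounds.

```python
import math

def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Odds ratio and Woolf (log-based) confidence interval for a 2x2 table.

    Assumed cell layout (not given on the slide):
    a = exposed cases, b = exposed controls,
    c = unexposed cases, d = unexposed controls.
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) is the root of the summed reciprocal cell counts
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

or_hat, ci_low, ci_high = odds_ratio_with_ci(120, 200, 100, 240)
print(f"OR = {or_hat:.2f} [{ci_low:.2f}, {ci_high:.2f}]")  # OR = 1.44 [1.04, 1.99]
```

The lower bound sits only just above 1, which is why even small measurement errors can change the conclusion.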

Systematic error
- Some disease-free smokers deny smoking
- Odds ratio (OR) = (120*250)/(190*100) = 1.58

Random error
- At random, 10% of subjects state their exposure incorrectly
- Odds ratio (OR) = (118*236)/(204*102) = 1.34
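The two error scenarios can be checked against the error-free table. The sketch below assumes the same cell layout as the slide formulas imply (120 exposed and 100 unexposed cases, 200 exposed and 240 unexposed controls); the helper names are illustrative only. It works with expected counts rather than simulating individual subjects.

```python
def odds_ratio(exposed_cases, exposed_controls, unexposed_cases, unexposed_controls):
    """Cross-product odds ratio for a 2x2 table."""
    return (exposed_cases * unexposed_controls) / (exposed_controls * unexposed_cases)

def misreport(exposed, unexposed, rate):
    """Expected counts when a fraction `rate` of each group reports the wrong exposure."""
    new_exposed = exposed * (1 - rate) + unexposed * rate
    new_unexposed = unexposed * (1 - rate) + exposed * rate
    return new_exposed, new_unexposed

# Random (non-differential) error: 10% of cases and of controls misreport exposure
case_exp, case_unexp = misreport(120, 100, 0.10)   # about 118 and 102
ctrl_exp, ctrl_unexp = misreport(200, 240, 0.10)   # about 204 and 236
or_random = odds_ratio(case_exp, ctrl_exp, case_unexp, ctrl_unexp)

# Systematic (differential) error: 10 exposed controls report being unexposed
or_systematic = odds_ratio(120, 200 - 10, 100, 240 + 10)

print(f"true OR       = {odds_ratio(120, 200, 100, 240):.2f}")  # 1.44
print(f"random error  = {or_random:.2f}")                       # 1.34
print(f"systematic    = {or_systematic:.2f}")                   # 1.58
```

Random error pulls the estimate towards 1, while the systematic error here pushes it away from 1; a different pattern of systematic error could push it in the other direction.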

The impact of errors
- Systematic errors in outcome or explanatory variables → systematic bias in either direction
  - True OR = 2 → estimated OR = e.g. 1.5 or 2.7
- Random errors in binary outcomes or any explanatory variables → shrinkage bias
  - True OR = 2 → estimated OR = e.g. 1.5
- Random errors in confounding variables → systematic bias in either direction
  - True OR = 2 → estimated OR = e.g. 1.5 or 2.7

Errors in biobanks
- Random errors
  - Loss of power is the primary problem
  - Biobank sample sizes are very large, so why is there a problem?

Errors in biobanks
- Random errors
  - But: why are biobank sample sizes so large?
  - NB: biobanks are very large, but the nested case-control studies within them are not
  - Need to detect small relative risks (e.g. OR = 1.3)
  - Power generally limited (see later)
  - Small error effects catastrophic
- Apparent causal effects easily created or destroyed

Errors in biobanks
- Systematic errors
  - Small real effects a major issue again
  - Must understand data collection protocols, and must attempt to optimise those protocols
  - UK Biobank; P3G Observatory

What is UK Biobank?

Basic design features
- A prospective cohort study
- 500,000 adults across the UK
- Middle-aged (40-69 years)
- A population-based biobank
  - Not disease or exposure based
  - Recruitment via electronic GP lists
- “Broad spectrum” not “fully representative”
- Individuals not families
- MRC, Wellcome Trust, DH, Scottish Executive: £61M

Basic design features
- Longitudinal health tracking
- Nested case-control studies
- Long time-horizon
- Owned by the Nation
- Central Administration: Manchester
  - PI: Prof Rory Collins, Oxford
- 6 collaborating groups (RCCs) of university scientists

Statistical power and sample size

Focus on power of nested case-control analyses
- Likely to be very common analyses
- Power limiting

Issues that are often ignored in standard power calculations
- Multiple testing/low prior probability of association*
- Interactions*
- Unobserved frailty
- Misclassification*
  - Genotype
  - Environmental determinant
  - Case-control status
- Subgroup analyses*
- Population substructure

Power calculations
- Work with least powerful setting
  - Binary disease, binary genotype, binary environmental exposure
- Logistic regression analysis; interactions = departure from a multiplicative model
- Complexity (arbitrary but reasonable)

Summarise power using “Minimum Detectable Odds Ratios” (MDORs) calculated by ‘iterative simulation’
- Estimate the minimum ORs detectable with 80% power at a stated level of statistical significance under a specified scenario

Genetic main effects

Whole genome scan
- Genetic main effect, p < 10^-7

Gene:environment interaction
- 20,000 cases

Summary – rule of thumb
- 80% power for genotype frequency = 0.1 (allele frequency ≈ 0.05 under a dominant model)
  - Genetic main effect OR 1.5, p = 10^-4: 5,000 cases
  - Genetic main effect OR 1.3, p = 10^-4: 10,000 cases
  - Genetic main effect OR 1.2, p = 10^-4: 20,000 cases
  - Genetic main effect OR 1.4, p = 10^-7: 10,000 cases
  - Genetic main effect OR 1.3, p = 10^-7: 20,000 cases
- G:E interaction with environmental exposure prevalence = 0.2
  - Interaction OR 2.0, p = 10^-4: 20,000 cases
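The link between a genotype frequency of 0.1 and an allele frequency of about 0.05 can be checked directly. This short sketch assumes Hardy-Weinberg equilibrium, which the slide does not state: under a dominant model the "at risk" genotype is any carrier of the allele, so the carrier frequency is 1 - (1 - q)^2 for allele frequency q.

```python
import math

def allele_freq_dominant(carrier_freq):
    """Allele frequency q solving 1 - (1 - q)**2 = carrier_freq.

    Assumes Hardy-Weinberg equilibrium (an assumption, not stated in the slides).
    """
    return 1 - math.sqrt(1 - carrier_freq)

q = allele_freq_dominant(0.1)
print(round(q, 3))  # 0.051
```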

Effect of realistic data errors

Expected event rates in UK Biobank

Taking account of
- Age range at recruitment: 40-69 years
- Recruitment over 5 years
- All-cause mortality
- Disease incidence (“healthy cohort effect”)
- Migration overseas
- Comprehensive withdrawal (max 1/500 p.a.)

No need to contact subjects

Smaller sample sizes

Interim conclusions
- Having taken account of realistic bioclinical complexity, UK Biobank is just large enough to be of great value as a stand-alone research infrastructure
- Data quality is crucial, in particular the avoidance of errors in outcome variables, explanatory variables and confounders
- Its value will be greatly augmented if it proves possible to set up a coherent and scientifically harmonised international network of biobanks and large cohort studies

Harmonising biobanks internationally

Why harmonise?
- Basic aim is to enable and promote data pooling, in a manner that recognises and takes appropriate account of systematic differences between studies.

Why harmonise?
- Investigate less common (but not rare) conditions
  - UKBB alone: stomach cancer, 2,500 cases in 29 years
  - 6 UKBB equivalents: 10,000 cases in 20 years
- Investigate smaller ORs
  - Reducing the detectable genetic main effect (GME) from 1.5 to 1.2 raises the required number of cases from 5,000 to 20,000, i.e. 4 UKBB equivalents
- Analysis based on subsets: homogeneous classes of phenotype, or e.g. by sex

Why harmonise?
- Earlier analyses
  - UKBB alone: Alzheimer's disease, 10,000 cases in 18 years
  - 5 UKBB equivalents: 9 years
- Events at younger ages
- Broad range of environmental exposures
- Aim for 4-6 UKBB equivalents: 2M – 3M recruits

Harmonisation initiatives
- Public Population Project in Genomics (P3G): Canada + Europe
  - Tom Hudson, Bartha Knoppers, Leena Peltonen, Isabel Fortier …..
- Population Biobanks FP6 Co-ordination Action (PHOEBE: Promoting Harmonisation Of Epidemiological Biobanks in Europe)
  - Camilla Stoltenberg, Paul Burton, Leena Peltonen, George Davey Smith …..

Harmonisation in the P3G Observatory (from Isabel Fortier)
- Description
- Comparison
- Harmonisation
- Data quality crucial at every stage

Final conclusions
- Power of individual biobanks is limited
- Minimisation of measurement error is crucial
- Harmonisation is crucial if we are to optimise the value of biobanks internationally
- Harmonisation depends on a full understanding of all aspects of data quality

Extra slides

Rarer genotypes
- Genetic main effects

Gene:environment interaction
- 10,000 cases

Hattersley AT, McCarthy MI. A question of standards: what makes a good genetic association study? Lancet 2005; in press.

Summarise power using MDORs calculated by ‘iterative simulation’
- Want the minimum ORs detectable with 80% power at a stated level of statistical significance
  1. Guess starting values for ORs
  2. Simulate population under specified scenario
  3. Sample required number of cases and controls
  4. Analyse resultant case-control study in standard way
  5. Repeat steps 2, 3 and 4 1,000 times
  6. Use empirical statistical power results from the 1,000 analyses to update ORs to new values expected to generate a power of 80%
- Repeat steps 2-6 until all ORs have 80% power
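The loop described above can be sketched in Python for the simplest case: a single binary exposure with no interaction, misclassification or frailty. This is an illustrative simplification, not the author's simulation code. The assumptions are: the exposure frequency in controls equals the population frequency (rare disease), the analysis is a two-sided Wald test on the log odds ratio (equivalent to logistic regression with one binary covariate), and the update in step 6 is replaced by plain bisection on the log OR scale. All function and parameter names are hypothetical.

```python
import math
import random
from statistics import NormalDist

def empirical_power(odds_ratio, n_cases, n_controls, exposure_freq, z_crit, n_sims, rng):
    """Steps 2-5: simulate case-control studies and return the fraction that reach significance."""
    # Exposure frequency among cases implied by the odds ratio
    odds_in_cases = odds_ratio * exposure_freq / (1 - exposure_freq)
    p_cases = odds_in_cases / (1 + odds_in_cases)
    detected = 0
    for _ in range(n_sims):
        a = sum(rng.random() < p_cases for _ in range(n_cases))           # exposed cases
        b = sum(rng.random() < exposure_freq for _ in range(n_controls))  # exposed controls
        c, d = n_cases - a, n_controls - b
        if min(a, b, c, d) == 0:
            continue  # an empty cell is counted as a non-detection
        log_or = math.log((a * d) / (b * c))
        se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
        if abs(log_or / se) > z_crit:
            detected += 1
    return detected / n_sims

def minimum_detectable_or(n_cases, n_controls, exposure_freq,
                          alpha=1e-4, target_power=0.8, n_sims=1000, seed=1):
    """Step 6 by bisection: smallest OR in [1, 5] with empirical power >= target."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    low, high = 1.0, 5.0
    for _ in range(12):
        mid = math.sqrt(low * high)  # midpoint on the log scale
        power = empirical_power(mid, n_cases, n_controls, exposure_freq, z_crit, n_sims, rng)
        if power < target_power:
            low = mid
        else:
            high = mid
    return high

# Small illustrative run; the slides use far larger studies and 1,000 replicates
print(minimum_detectable_or(n_cases=500, n_controls=500, exposure_freq=0.1, n_sims=200))
```

Because each power estimate is itself noisy, the bisection only brackets the MDOR approximately; more replicates per step tighten the answer at the cost of run time. The scenarios in the talk add misclassification, interactions and other complexities that this sketch leaves out.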

Taking account of
- Age range at recruitment: 40-69 years
- Recruitment over 5 years
- All-cause mortality
- Disease incidence (“healthy cohort effect”)
- Migration overseas
- Comprehensive withdrawal (max 1/500 p.a.)
- Partial withdrawal (cf. Birth Cohort)

Necessary to contact subjects