Big Data in UK Biobank: Opportunities and Challenges Funders: Wellcome Trust and Medical Research Council, with Department of Health, Scottish & Welsh.


1 Big Data in UK Biobank: Opportunities and Challenges
Funders: Wellcome Trust and Medical Research Council, with Department of Health, Scottish & Welsh Governments, British Heart Foundation and Diabetes UK
Rory Collins, UK Biobank Principal Investigator
BHF Professor of Medicine & Epidemiology, Nuffield Department of Population Health, University of Oxford, UK

2 UK Biobank Prospective Cohort
500,000 UK men and women aged 40-69 years when recruited and assessed during 2007-10
Extensive baseline questions and measurements, with stored biological samples (and opportunities to add enhanced assessments in large subsets)
Repeat assessments over time in subsets of the participants to allow for sources of variation
General consent for follow-up through all health records and for all types of health research
Sufficiently large numbers of people developing different conditions to assess causes reliably

3 Need for prospective studies to be LARGE: CHD versus SBP for 5K vs 50K vs 500K people in the Prospective Studies Collaboration (PSC)
[Figure: three panels (5000 people, 50,000 people, 500,000 people), each plotting CHD risk on a doubling scale (1 to 256) against usual SBP (120 to 180 mmHg), with one line per age-at-risk group (40-49, 50-59, 60-69, 70-79, 80-89).]
The data are taken from the Prospective Studies Collaboration meta-analysis, co-ordinated by CTSU, Oxford, and illustrate graphically the need for very large sample sizes to study interactions reliably (in this example, the interaction between BP and age). The first graph (All) shows the relationship of systolic BP and age with ischaemic heart disease (IHD; floating absolute risk) in 1 million people, published in the Lancet in 2002, and the other two graphs show the same relationship in randomly selected samples of 500,000 and 50,000. The tight 95% confidence intervals illustrate the reliability of the large sample size in the first two graphs, but the sample of only 50,000 (which is still orders of magnitude larger than many studies) has very wide confidence intervals, particularly in younger people.
Key message: studies at least as large as UK Biobank are required to study gene-environment interactions, and for many interactions reliable evidence will only be generated by combining data from 4-5 biobanks worldwide. This emphasises the need for harmonisation of biobanks at an international level.
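The effect of cohort size on precision in the 5K vs 50K vs 500K comparison can be illustrated with a rough sketch. It assumes that the standard error of an estimate shrinks in proportion to 1/sqrt(n) and that the number of events scales with cohort size; both are simplifying assumptions for illustration and are not taken from the PSC analysis. The function name `relative_ci_width` is hypothetical.

```python
import math


def relative_ci_width(n_people: int, reference_n: int = 500_000) -> float:
    """Width of a confidence interval relative to its width at reference_n.

    Assumes the standard error scales as 1 / sqrt(n), so the ratio of
    widths is sqrt(reference_n / n_people).
    """
    if n_people <= 0 or reference_n <= 0:
        raise ValueError("cohort sizes must be positive")
    return math.sqrt(reference_n / n_people)


if __name__ == "__main__":
    for n in (5_000, 50_000, 500_000):
        ratio = relative_ci_width(n)
        print(f"{n:>7,} people: interval about {ratio:.1f}x as wide as at 500,000")
```

Under these assumptions a tenfold smaller cohort gives intervals roughly three times as wide, and a hundredfold smaller cohort roughly ten times as wide.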

4 Locations of UK Biobank assessment centres around the UK (with people recruited from urban and rural areas)

5 UK Biobank: 500,000 participants aged 40-69 recruited in 2007-10
Age: 40-49: 119,000; 50-59: 168,000; 60-69: 213,000
Gender: Male: 228,000; Female: 270,000
Deprivation: More: 92,000; Average: 166,000; Less: 241,000
Generalisability (not representativeness): Heterogeneity of study population allows associations with disease to be studied reliably

6 Production line baseline assessment visit (improved throughput; efficient staffing)

7 Baseline assessment: Questionnaire content
Self-completion: topics and median time (minutes)
Socio-demographics
Ethnicity
Work-employment
Physical activity
Smoking (non-smokers) (past/current smokers)
Diet (food frequency)*
Alcohol
Sleep
Sun exposure
Environmental exposures
Early life factors
Family history of common diseases: 1.6
Reproductive history & screening (women): 2.4; (men): 0.8
Sexual history
General health
Past medical history & medications: 1.6
Noise exposure
Psychological status
Cognitive function tests: 10.0
Hearing speech-in-noise test
Total time
Interview: topics and median time (minutes)
Medical history/medication: 3.1
Occupation
Other
Total time
*Subset of 200,000 participants: repeated daily diet diaries conducted via the internet
Touchscreen and interview questions (plus extra enhancement questions) available at

8 Baseline assessment: Physical measurements (with enhanced measures in large subsets)
All 500,000 participants: Blood pressure & heart rate; Height (standing/seated); Waist/hip circumference; Weight/impedance; Spirometry; Heel ultrasound
Subset of 175,000 participants: Hearing test; Vascular reactivity
Subset of 120,000 participants: Visual acuity, refractive index & intraocular pressure
Subset of 85,000 participants: Retinal images & optical coherence tomograms; Fitness test & ECG limb leads

9 UK Biobank different types of biological sample: allowing a wide range of different assays
Sample collection tube: fractions collected; potential assays
Na+ EDTA: Plasma, Buffy coat, Red cells; Plasma proteome and metabonome, Assays of genomic DNA, Membrane lipids and heavy metals
Lithium Heparin (PST): Plasma proteome and metabonome (without haemolysis)
Silica clot accelerator (SST): Serum; Serum proteome and metabonome (without haemolysis)
Acid citrate dextrose: Whole blood; Assays of DNA extracted from EBV immortalised cell lines (B-cell transcriptome)
EDTA: Standard haematological parameters
Tempus RNA stabilisation: Whole blood with lysis reagent; Blood transcriptome, Representative transcriptomes of other tissues
Urine: Urine proteome and metabonome, Gut microbiome
Saliva: Mixed saliva sample; Salivary proteome and metabonome, Salivary microbiome, (Mucosal proteome and metabonome)

10 Further enhancements of the phenotyping of UK Biobank participants currently being conducted
Web-based assessments of diet completed

11 Web-based dietary assessment: 24-hr recall
Design considerations:
Easy and quick: takes only minutes
Automated data collection and coding
Repeatable (capturing seasonal variation)
Detailed enough to estimate nutrient intake
Over 200,000 participants completed the questionnaire at least once, and about 90,000 did so more than once

12 Future web-based assessments for exposures
Cognitive function
  Repeat assessment of baseline measures
  Broaden cognitive phenotyping with new measures
  Complements enhanced cognitive function assessment that is planned for the imaging assessment visit
Occupational history
  Information about all previous occupations (not just latest)
  Greater detail on type of work and duration
Physical activity questionnaire (RPAQ)
  Complement data from activity monitor

13 Further enhancements of the phenotyping of UK Biobank participants currently being conducted
Web-based assessments of diet completed; and next to be cognition/mental health (2014)
Wrist-worn accelerometers to be mailed to all participants who agree to wear one

14 UK Biobank wrist-worn accelerometer
~45% of participants agree to wear one
Willing participants sent device by mail
It is to be worn continuously for 7 days
Returned by mail and data downloaded
Device cleaned and sent to next participant
100K participants from mid-2013 to mid-2015 (50,000 complete data-sets already obtained)

15 Further enhancements of the phenotyping of UK Biobank participants currently being conducted
Web-based assessments of diet completed; and next to be cognition/mental health (2014)
Wrist-worn accelerometers to be mailed to all participants who agree to wear one
Biobank chip to genotype (GWAS; candidate SNPs; exome) all participants

16 Genotyping of all UK Biobank participants
820K bespoke UK Biobank Affymetrix genotyping chip:
  250,000 SNPs in a whole-genome array
  200,000 markers for known risk factor or disease associations, copy number variation, loss of function, and insertions/deletions
  150,000 exome markers for high proportion of non-synonymous coding variants with allele frequency over 0.02%
Estimate (“impute”) additional genotypes by combining measured genotypes with reference sequence data
Researchers can study associations of genotype data with biochemical risk factors and detailed phenotyping from baseline assessment, along with health outcomes
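To give a feel for what an allele frequency of 0.02% means in a cohort of 500,000, the sketch below estimates the expected number of carriers. It assumes each person carries two independent copies drawn at the stated allele frequency (Hardy-Weinberg proportions); that assumption and the function name `expected_carriers` are illustrative and do not come from the slides.

```python
def expected_carriers(n_people: int, allele_frequency: float) -> float:
    """Expected number of people carrying at least one copy of an allele.

    Assumes two independent allele draws per person (Hardy-Weinberg
    proportions), so P(carrier) = 1 - (1 - p)^2.
    """
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele_frequency must be between 0 and 1")
    prob_carrier = 1.0 - (1.0 - allele_frequency) ** 2
    return n_people * prob_carrier


if __name__ == "__main__":
    # 0.02% written as a proportion is 0.0002.
    print(round(expected_carriers(500_000, 0.0002)))
```

At that frequency only a few hundred carriers would be expected even among all 500,000 participants, which is one reason rare coding variants need a very large cohort.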

17 Further enhancements of the phenotyping of UK Biobank participants currently being conducted
Web-based assessments of diet completed; and next to be cognition/mental health (2014)
Wrist-worn accelerometers to be mailed to all participants who agree to wear one
Biobank chip to genotype (GWAS; candidate SNPs; exome) all participants
Standard panel of assays (e.g. lipids; clotting) on samples from all participants

18 Rationale for assaying many standard markers in baseline samples from all 500,000 participants  
Cost-effective way of increasing the usability of the resource for researchers, by providing data for:
  Cross-sectional analyses with prevalent disease
  Identification of subsets based on assay values
Conducting these assays in all of the participants at the same time should facilitate good quality control
Lower cost for conducting all of these assays at one time rather than in multiple retrievals and assays
Facilitates management of depletable samples

19 Consideration of a proposal to conduct assays of biomarkers of infectious disease in all participants
Request from the international research community to facilitate studies of the associations of infectious agents with disease (in particular, different types of cancer)
Plan would be to assay a panel of infectious agents (e.g. HPV, Hepatitis B & C, HBV, EBV, H. pylori) in the baseline sample collected from all 500,000 participants
As with the biochemical and genetic assays that are being conducted, assays of a wide range of infectious agents would increase the efficient use of the resource
Detailed proposal for funding is now being developed

20 Further enhancements of the phenotyping of UK Biobank participants currently being conducted
Web-based assessments of diet completed; and next to be cognition/mental health (2014)
Wrist-worn accelerometers to be mailed to all participants who agree to wear one
Biobank chip to genotype (GWAS; candidate SNPs; exome) all participants
Standard panel of assays (e.g. lipids; clotting) on samples from all participants
Information from multiple imaging modalities (e.g. brain/heart/body MRI; bone/joint DEXA)

21 Imaging of 100,000 UK Biobank participants
MRI of brain, heart and abdomen
DEXA of bones, joints and body
Ultrasound of carotid arteries
Shortened baseline assessment plus more detailed cognitive function tests and ECG to detect rhythm disturbances
Pilot phase: 4-6,000 people in 1 centre
Main phase: 95,000 people in 3 centres
Opportunities for repeat imaging in sub-sets (e.g. as part of MRC’s focus on dementia)

22 Body Mass Index (BMI) vs Heart Disease and Stroke (PSC: 1M people followed for 12 years; Lancet 2009)
[Figure: annual deaths per 1000 (floated so mean = PSC rates at age 65-69), on a doubling scale, against baseline BMI (kg/m2) from 15 to 50, with separate curves for heart disease and for stroke (6122 deaths).]
At BMI >25: 5 units higher BMI associated with ~40% higher IHD & stroke mortality
At BMI <25: positive association continues for IHD, but not for stroke
Adjusted for age, sex, smoking & study; first 5 years of follow-up excluded
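The "~40% higher mortality per 5 units of BMI above 25" summary can be turned into a small worked calculation. The sketch below assumes the association is log-linear above BMI 25, so that each further 5 units multiplies mortality by the same factor of about 1.4; that functional form, the function name `relative_mortality` and its default parameters are assumptions for illustration rather than the published PSC model.

```python
def relative_mortality(bmi: float,
                       reference_bmi: float = 25.0,
                       ratio_per_5_units: float = 1.4) -> float:
    """Mortality relative to reference_bmi under a log-linear assumption.

    Each 5 kg/m2 above the reference multiplies mortality by
    ratio_per_5_units. Only defined at or above the reference, because the
    slide reports a different pattern below BMI 25.
    """
    if bmi < reference_bmi:
        raise ValueError("sketch only covers BMI at or above the reference")
    return ratio_per_5_units ** ((bmi - reference_bmi) / 5.0)


if __name__ == "__main__":
    for bmi in (25, 30, 35):
        print(f"BMI {bmi}: about {relative_mortality(bmi):.2f}x the mortality at BMI 25")
```

Under this assumption a BMI of 35 corresponds to roughly twice the IHD and stroke mortality seen at a BMI of 25.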

23 Similar age, gender, BMI & % body fat, but different amounts of INTERNAL FAT
5.86 litres of internal fat
1.65 litres of internal fat

24 Atrial fibrillation (AF): prevalence and mortality during the period between 1993 and 2007
Prevalence: increasing
Mortality: little change
Piccini et al. Circulation: Cardiovascular Quality and Outcomes. 2012

25 Consideration of prolonged cardiac monitoring
Cardiac arrhythmias (especially AF):
  can indicate significant underlying cardiac disease
  can directly cause significant morbidity and mortality
  are important risk factors for cardio-embolic events (esp. stroke)
Detection requires prolonged monitoring:
  many are intermittent (e.g. paroxysmal AF)
  substantial under-detection with standard 12 lead ECG
AF increases with age (<50 years: <1%; >80 years: 10%+)
No large-scale population-based prospective studies with prolonged monitoring, so the full extent/impact of AF on health outcomes is likely to have been underestimated

26 Example of device for prolonged arrhythmia detection
iRhythm Zio Patch
Has been used in 18,000 people
Non-invasive stick-on patch
Comfortable (median wear 12 days)
Can be applied in clinic or at home
Beat-to-beat ECG recording
Validated against reference Holter
Potentially recyclable device chip which stores data for downloading
Planning to pilot feasibility and acceptability during imaging pilot

27 UK Biobank: Centralised follow-up of health
Death and cancer registries
In-patient and out-patient hospital episodes (including psychiatric) and related procedure registries
Primary care records of health conditions, prescriptions, diagnostic tests and other investigations
Other health-related: disease registries; dispensing records; imaging; screening; dental records
Direct to participants: self-reported medical conditions; treatments actually being taken; degree of functional impairment; cognitive and psychological scores

28 Health outcome data-linkage challenges
Regulation, bureaucracy, and permissions (despite explicit consent from participants)
Data transfer, matching and coding queries
Understanding different data structures
  Mapping between coding systems
  Mapping between different countries
Presenting outcome data to researchers
  Original outcome codes
  Post-adjudication outcomes
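The "mapping between coding systems" challenge can be sketched as a lookup that harmonises codes from different classifications into one outcome label while the original code is kept alongside it. The table below is a deliberately tiny, illustrative mapping built from a few well-known ICD-9 and ICD-10 categories; it is an assumption for this sketch and not UK Biobank's actual code lists. The names `OUTCOME_BY_CODE_PREFIX` and `harmonise` are hypothetical.

```python
from typing import Optional

# Illustrative and incomplete: (coding system, 3-character category) -> outcome.
# An assumption for this sketch, not UK Biobank's real mapping tables.
OUTCOME_BY_CODE_PREFIX = {
    ("ICD10", "I21"): "myocardial infarction",  # acute myocardial infarction
    ("ICD9", "410"): "myocardial infarction",   # acute myocardial infarction
    ("ICD10", "I61"): "stroke",                 # intracerebral haemorrhage
    ("ICD10", "I63"): "stroke",                 # cerebral infarction
    ("ICD9", "431"): "stroke",                  # intracerebral haemorrhage
    ("ICD9", "434"): "stroke",                  # occlusion of cerebral arteries
}


def harmonise(system: str, code: str) -> Optional[str]:
    """Map a source code to a common outcome label, or None if unmapped.

    Matching is on the first three characters after removing the dot, so
    "I21.9" and "I219" are both treated as category "I21".
    """
    prefix = code.replace(".", "").upper()[:3]
    return OUTCOME_BY_CODE_PREFIX.get((system.upper(), prefix))


if __name__ == "__main__":
    records = [("ICD10", "I21.9"), ("ICD9", "434.1"), ("ICD10", "J44.1")]
    for system, code in records:
        # Keep the original code next to the harmonised label.
        print(system, code, "->", harmonise(system, code))
```

Returning `None` for unmapped codes, rather than guessing, leaves those records visible for manual review.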

29 Progress with UK-wide linkage to outcome data (both before and after baseline assessment)
Key messages: The slide shows the data types, countries and data providers from the perspective of a UK cohort study with participants recruited in England (89%), Scotland (7%) and Wales (4%). It demonstrates just some of the complexity of the processes required to link to, incorporate and make available to researchers what might superficially seem to be very straightforward data from these major sources.
Deaths and cancers: It has been possible for many years to obtain routine coded data on deaths by cause and on cancer registrations using similar systems across England, Scotland and Wales. Currently, flagging of a cohort is done by the Medical Research Information Service at the NHS Information Centre for England and Wales, and by the NHS Central Register in Scotland. Data formats are different for Scotland but the type of information is essentially the same.
Hospital discharge data: It is also possible to obtain nationwide hospital inpatient and outpatient coded data from Scotland, Wales and England from separate sources for each country, as shown. Data formats vary by country but the type of information is similar.
Primary care data: Scotland and Wales are now able to link to coded primary care data for around half of their populations. England is following with the development of the GP extraction system.
In all three countries, national linkage ‘one stop shops’ are being developed to pull all of these data (and a range of other country-specific datasets) together for easier access for research purposes. The most comprehensive and accessible is currently in Wales: SAIL (Secure Anonymised Information Linkage system), developed collaboratively between the Health Information Research Unit, Swansea University, and NHS Wales. The Scottish system (Scottish Health Informatics Programme, SHIP) is similar although not so readily accessible. A new English system (CPRD) aims to do something similar to SAIL and SHIP; coverage is currently limited, especially for primary care (currently 10-20%), but there are ambitious plans for wider population coverage.
The whole field is made over-complicated by:
- Frequent developments of new initiatives and systems, not all of which survive
- Frequent relabelling of existing systems
- Different regulatory mechanisms for accessing data for each data provider
- Differences between countries in the structure of the NHS and legislation/regulation
The other datasets that initiatives such as SAIL, SHIP and CPRD either have currently on a country-wide basis or are moving towards include laboratory reports, imaging reports, and disease registry and audit systems. At present, these tend to be patchy in coverage. In general, Wales and Scotland have more country-wide datasets available than England (although patches of England are good for various different types of data), and accessibility of data from a one stop shop is easiest for Wales.

30 Meaning of coded data from health records
What do the coded data actually tell us?
Characteristics of coded data:
  How accurate?
  How detailed?
  How complete?
Do we need to go beyond the coded data?

31 UK Biobank: Expected numbers of participants developing diseases during long-term follow-up
Condition: 2012; 2017; 2022
Diabetes: 10,000; 25,000; 40,000
MI/CHD death: 7,000; 17,000; 28,000
Stroke: 2,000; 5,000; 9,000
COPD: 3,000; 8,000; 14,000
Breast cancer: 2,500; 6,000
Colorectal cancer: 1,500; 3,500
Prostate cancer
Lung cancer: 800; 4,000
Hip fracture
Rh. arthritis
Alzheimer’s

32 General strategy for outcome adjudication
Avoid false positive cases (but tolerate some false negatives)
Geographical generalisability
Cost-effectiveness
Future-proofed
Scalability
Staged approach:
  Ascertain
  Confirm
  Classify

33 Staged approach to outcome adjudication
ASCERTAINMENT of suspected cases
  Characteristics: Cost-effective; Feasible; Scalable
  Possible data sources: Death registers; Cancer registers; Hospital episodes; Primary care records; Web-based questionnaires

34 Staged approach to outcome adjudication
ASCERTAINMENT of suspected cases
  Characteristics: Cost-effective; Feasible; Scalable
  Possible data sources: Death registers; Cancer registers; Hospital episodes; Primary care records; Web-based questionnaires
CONFIRMATION of “case-ness”
  Characteristics: As above, but greater cost/lower feasibility
  Possible data sources: Cross-referencing e-records; Disease registers

35 Staged approach to outcome adjudication
ASCERTAINMENT of suspected cases
  Characteristics: Cost-effective; Feasible; Scalable
  Possible data sources: Death registers; Cancer registers; Hospital episodes; Primary care records; Web-based questionnaires
CONFIRMATION of “case-ness”
  Characteristics: As above, but greater cost/lower feasibility
  Possible data sources: Cross-referencing e-records; Disease registers
CLASSIFICATION of disease cases
  Characteristics: More involved and costly per case
  Possible data sources: Review of clinical records; Tumour collections/assays; Specialised databases (e.g. imaging)
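The ascertain, confirm, classify sequence can be sketched as a simple staging function. The confirmation rule used here (agreement between at least two independent data sources) is a hypothetical stand-in chosen to reflect the stated preference for avoiding false positives; the real protocols are developed by the expert working groups, and the names `CaseEvidence` and `adjudication_stage` are illustrative only.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class CaseEvidence:
    """Evidence held for one participant and one condition (hypothetical)."""
    # Data sources that flag the condition, e.g. {"hospital_episodes"}.
    sources: Set[str] = field(default_factory=set)
    # Disease subtype, filled in only after review of clinical records.
    reviewed_subtype: Optional[str] = None


def adjudication_stage(evidence: CaseEvidence,
                       min_sources_to_confirm: int = 2) -> str:
    """Return the furthest stage reached under the assumed rules.

    Stage 1 (ascertain): any source flags the condition.
    Stage 2 (confirm): at least min_sources_to_confirm sources agree.
    Stage 3 (classify): a confirmed case also has a reviewed subtype.
    """
    if not evidence.sources:
        return "not ascertained"
    if len(evidence.sources) < min_sources_to_confirm:
        return "suspected"
    if evidence.reviewed_subtype is None:
        return "confirmed"
    return f"classified: {evidence.reviewed_subtype}"


if __name__ == "__main__":
    print(adjudication_stage(CaseEvidence({"primary_care"})))
    print(adjudication_stage(CaseEvidence({"primary_care", "hospital_episodes"})))
    print(adjudication_stage(
        CaseEvidence({"death_register", "hospital_episodes"}, "ischaemic stroke")))
```

Because a single-source case stays "suspected" even when a subtype is recorded, the sketch tolerates some false negatives in exchange for fewer false positives.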

36 Expert Working Groups developing protocols for ascertainment, confirmation and classification
Working groups: Cancer; Diabetes; Cardiac outcomes; Stroke; Mental health outcomes; Ocular outcomes; Neurodegenerative outcomes; Respiratory outcomes; Musculoskeletal outcomes
Status categories:
  Pilots progressing well; preparing for scaling up of algorithms and then for web adjudication
  Pilots commencing
  Pilots being developed

37 UK Biobank: Principles of Access
UK Biobank is available to all bona fide researchers for all types of health-related research that is in the public interest
No preferential or exclusive access (and, in particular, access does not involve “collaboration” with UK Biobank)
Researchers have to pay for access to the Resource for their proposed research on a cost-recovery basis only
Access to the biological samples that are limited and depletable will be carefully controlled and coordinated
Researchers are required to publish their findings and return the data so that other researchers can use them

38 “Showcase”: e-catalogue of data items currently in the UK Biobank Resource (www.ukbiobank.ac.uk)

39 Showcase supports search strategies for data items in the UK Biobank Resource

40 Body Composition: % Body Fat

41 Preliminary applications subdivided by type of researcher, location and type of research

42 What makes UK Biobank special?
PROSPECTIVE: It can assess the full effects of a particular exposure (such as smoking) on all types of health outcome (such as cancer, vascular disease, lung disease, dementia)
DETAILED: The wide range of questions, measures and samples at baseline allows good assessment of exposures, and outcome adjudication allows good disease classification
BIG: Inclusion of large number of participants allows reliable assessment of the causes of a wide range of diseases, and of the combined impact of many different exposures
Unique combination of BREADTH and DEPTH

