Future directions in genetic epidemiology, impact on IT and Data requirements Nic Timpson MRC CAiTE Centre Department of Social Medicine.


Step change Larsen

Why?
- Technology
- Paradigm shift
- Genomic properties
EUCCONET Data Management Workshop

Raw data Clinical meaning ???????

Two of the driving technologies:
- Chip-based genotyping
- Next Generation Sequencing (NGS)

Basic flat Illumina output…

Derivation of flat file data from image-based intensity reads:

HAPMAP - Illumina - Affymetrix
Files: CHR16_HAPMAP.recode.map, CHR16_HAPMAP.recode.ped, red_test_run_assoc.txt, genetic_map_chr16.txt
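The .map and .ped file names on this slide look like PLINK-style text genotype files. As a minimal sketch, assuming the standard layout (four columns per marker in .map; six sample fields followed by two alleles per marker in .ped), one line of each could be parsed as below. The marker identifiers, positions, and sample values are made up for illustration and are not taken from the talk.

```python
from typing import List, NamedTuple, Tuple


class Marker(NamedTuple):
    chrom: str
    snp_id: str
    genetic_distance: float
    bp_position: int


def parse_map_line(line: str) -> Marker:
    """Parse one line of a 4-column PLINK-style .map file (assumed layout)."""
    chrom, snp_id, dist, pos = line.split()
    return Marker(chrom, snp_id, float(dist), int(pos))


def parse_ped_line(line: str, n_markers: int) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Parse one line of a PLINK-style .ped file (assumed layout).

    Returns the six leading sample fields (family ID, individual ID,
    father, mother, sex, phenotype) and one (allele1, allele2) pair per marker.
    """
    fields = line.split()
    sample, alleles = fields[:6], fields[6:]
    if len(alleles) != 2 * n_markers:
        raise ValueError("expected two alleles per marker")
    genotypes = list(zip(alleles[0::2], alleles[1::2]))
    return sample, genotypes


# Hypothetical two-marker example (made-up identifiers, not real data).
map_lines = ["16 rs_example1 0 50000000", "16 rs_example2 0 50010000"]
ped_line = "FAM1 IND1 0 0 1 2 A G C C"

markers = [parse_map_line(line) for line in map_lines]
sample, genotypes = parse_ped_line(ped_line, len(markers))
```

The point of the sketch is only that each genotyped individual is one long text row, so file size grows with samples x markers.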

NOD2 Crohn's association (plot; x-axis: position, Mb)

Table columns: Individual; Platform; Read Length; Base coverage; Genomic coverage; Cost ($US)
J. Craig Venter; Automated Sanger; N/A; 70,000,000
James D. Watson; Roche/ ; ,000,000
Yoruban male; Illumina/Solexa; ,000
Yoruban male; Life/APG; ,000

Consequent shifting budgets…
Cost per genome (HGP; Venter & Watson; NGS): ~$5+ billion; ~$70 million; ~$1 million; ~$
Data (bytes), based on n~5000: 1- Candidate; 2- CHIP (designer); 3- Affy; Intensity data; 5- NGS data (*LC). Scale marks: ~2Mb, ~10Gb, ~20Tb

Based on the storage of re-sequence data, one can consider storage requirements for a next generation sequencing effort:
Assuming a storage cost of about 1.5 bytes per bp of sequence reads at low coverage: ~2000 samples (as per UK10K, for example) x 3 billion bp x 1.5 bytes = ~9 x 10^12 bytes, i.e. roughly 10 terabytes.
That doesn't include any subsequent parsed data. Double this just to have the data in all formats one might be able to use meaningfully. Yields ~20 Tb.
20 Tb is pretty small these days; if buying new storage capacity just to do this alone, one may therefore be better accounting for up to Tb if buying bespoke.
Cost: service costs can be as high as £1500 per Tb. An NGS project on some 2000 individuals can cost as much as 40-50k on computing alone.
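The back-of-envelope arithmetic above can be sketched in a few lines. The inputs are the figures quoted on the slide (1.5 bytes per bp, 3 billion bp, ~2000 samples, a doubling for derived formats, £1500 per terabyte); the function and constant names are illustrative, and a terabyte is assumed to be 10^12 bytes.

```python
BYTES_PER_BP = 1.5          # slide's assumed storage cost per sequenced base
GENOME_BP = 3_000_000_000   # ~3 billion bp per genome
FORMAT_FACTOR = 2           # "double this" to hold the data in all useful formats
COST_PER_TB_GBP = 1500      # upper-end service cost quoted on the slide
BYTES_PER_TB = 10**12       # assumption: decimal terabytes


def storage_estimate_tb(n_samples: int) -> tuple:
    """Return (raw, total) storage in terabytes for n_samples low-coverage genomes."""
    raw_tb = n_samples * GENOME_BP * BYTES_PER_BP / BYTES_PER_TB
    return raw_tb, raw_tb * FORMAT_FACTOR


raw_tb, total_tb = storage_estimate_tb(2000)
storage_cost_gbp = total_tb * COST_PER_TB_GBP

print(f"raw reads: {raw_tb:.1f} TB, all formats: {total_tb:.1f} TB, "
      f"storage cost at the quoted rate: £{storage_cost_gbp:,.0f}")
```

With these inputs the raw figure comes to 9 terabytes and the doubled figure to 18, which the slide rounds to ~10 and ~20.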

Also receiving data on:
- Copy number variation across the genome
- Expression data (e.g. records of messenger RNA to track gene activity)
- Methylome (markers of the epigenome)
- Not to mention phenotype data (a retrospective effort and an ever increasing pool)
Raises the issue of linkage and data USE…

Not just storage…

Varying matrix properties and overlaid ribbon plots (here MAF): Male vs Female; D vs r^2
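For readers unfamiliar with the two quantities compared in these plots, the following is a minimal sketch of the usual pairwise linkage disequilibrium measures computed from haplotype and allele frequencies. It assumes the slide's "D" refers to the standard D (or its normalised form D'); the function name and the example frequencies are illustrative only.

```python
def ld_measures(p_ab: float, p_a: float, p_b: float) -> tuple:
    """Return (D, |D'|, r^2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a, p_b: frequencies of allele A and allele B
    """
    d = p_ab - p_a * p_b
    # D' scales D by its maximum attainable magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


# Made-up frequencies: complete LD, then no LD.
complete = ld_measures(p_ab=0.5, p_a=0.5, p_b=0.5)
independent = ld_measures(p_ab=0.25, p_a=0.5, p_b=0.5)
```

Computing such a matrix for every marker pair in a region is quadratic in the number of markers, which is part of why the processing load grows faster than the raw data.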

CDKAL
Combinations of data processing/visualisation methods: e.g. follow-up of the dissection of the TCF2 locus and the counter results for T2D and prostate cancer - other T2D loci? See: Amundadottir et al., Nature Genetics 2007

Not to mention iterative approaches!
- Generation of empirical distributions for the purpose of comparison, e.g. expression data
- Gene x gene (and possibly environment) interaction analysis, which may span the genome
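One common way of generating an empirical distribution for comparison is a permutation test. The sketch below is an assumed illustration of that general technique, not the method used in the talk: group labels are shuffled repeatedly and the observed difference in means is compared against the shuffled differences. The function name and the example values are made up.

```python
import random


def permutation_p_value(group_a, group_b, n_perm=2000, seed=0):
    """Empirical two-sided p-value for a difference in group means."""
    rng = random.Random(seed)
    n_a = len(group_a)
    pooled = list(group_a) + list(group_b)

    def mean_diff(values):
        a, b = values[:n_a], values[n_a:]
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = mean_diff(pooled)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # break any real link between label and value
        if mean_diff(pooled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Made-up expression-like values for two groups.
p_separated = permutation_p_value([10, 11, 12, 13, 14], [0, 1, 2, 3, 4])
p_identical = permutation_p_value([1, 2, 3], [1, 2, 3])
```

Each test repeats the full calculation n_perm times, and a genome-wide gene x gene scan over m markers involves m(m-1)/2 pairs, so the compute cost of iterative approaches scales far faster than storage.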

Overall
- As one would expect, data requirements are increasing
- Genetic epidemiology has been boosted into a realm of real findings and exciting capability by the existence of new technology
- Increases may (or may not) be more rapid than once thought
- Storage and manipulation of large data sets present new challenges
- A new breed of analysts is emerging: the computer scientist with a passion for biology
- Perhaps Windows is dead…