Published by Felicity Rodgers. Modified over 8 years ago.
1 Finding disease genes: A challenge for Medicine, Mathematics and Computer Science
Andrew Collins, Professor of Genetic Epidemiology and Bioinformatics
arc@soton.ac.uk
2 The human genome
22 pairs of autosomes + the sex chromosomes X and Y
Sequence of 3,200 million base pairs (of A, T, G, C)
Codes for ~30,000 genes
1000s (?) of genes contain mutations contributing to disease ‘phenotypes’
3 Single nucleotide polymorphisms (SNPs or ‘snips’)
DNA sequence variation at a single nucleotide: A, T, C or G
~15 million+ SNPs, accounting for ~90% of genetic variation
Two forms (alleles): a C/T SNP has ‘genotypes’ C,C or C,T or T,T
How to link genotype(s) with disease phenotype(s)?
Look for shared SNP mutations in families, or compare cases vs controls
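One simple way to compare cases with controls at a single C/T SNP is to count alleles in each group and compute a chi-square statistic on the resulting 2x2 table. The sketch below is illustrative only: the genotype lists are made-up data, the function names are hypothetical, and real association studies also need quality control, multiple-testing correction and adjustment for population structure.

```python
from collections import Counter


def allele_counts(genotypes):
    """Count alleles in a list of genotypes such as "CC", "CT" or "TT"."""
    counts = Counter()
    for genotype in genotypes:
        counts.update(genotype)  # each genotype contributes two alleles
    return counts


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (1 degree of freedom) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # a row or column is empty, so no association can be measured
    return n * (a * d - b * c) ** 2 / denominator


# Hypothetical genotypes at one C/T SNP (made-up data for illustration)
cases = ["CT", "TT", "CT", "TT", "CC", "TT", "CT", "TT"]
controls = ["CC", "CT", "CC", "CC", "CT", "CC", "TT", "CC"]

case_alleles = allele_counts(cases)
control_alleles = allele_counts(controls)

statistic = chi_square_2x2(
    case_alleles["C"], case_alleles["T"],
    control_alleles["C"], control_alleles["T"],
)
print(f"cases: {dict(case_alleles)}  controls: {dict(control_alleles)}")
print(f"chi-square = {statistic:.2f}")
```

A statistic above 3.84 corresponds to p < 0.05 for a single test with 1 degree of freedom; with ~1 million SNPs tested, a far stricter threshold is needed.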
4 Disease gene mapping - timeline
1990-1998: ‘linkage mapping’ - rare genes causing severe disease in families (Cystic Fibrosis, Huntington’s disease)
1998-2010: ‘association mapping’ - common genes involved in common disease (asthma, heart disease, diabetes); case-control studies (~1 million SNPs)
2010 onwards: ‘next generation sequencing’ - test all 15 million+ SNPs; low frequency variants with intermediate effect on common disease
5 Human genome project - timeline
1990: Start of ‘Human Genome Project’ (to generate one genome sequence)
2003: One sequence completed: cost $300 million
2010: 3,000 sequences now completed
2011: 30,000 sequences expected: cost ~$5000 each
7 Breast cancer genetics
Rare genes (clear inheritance in families): <25% of inherited risk
Common genes (low-risk, association mapping): <5% of inherited risk
~70% of risk not explained by all breast cancer genes found so far, so many genes are ‘missing’…
8 Susceptibility genes in breast cancer: more is less?
“The large number of anticipated susceptibility factors, their low predictive value and the high frequency of these variants … make these findings of limited use in clinical practice”
Ref: Willems PJ (2007) Clin Genet 72:493-496
9 183,000 samples: found 180 ‘height’ genes
– enriched for genes in shared biological pathways
– and genes involved in skeletal growth defects
Genes found explain only 10% of variation in height
Many genes missing…
10 Linking genes with disease
11 “1000 Genomes” - a deep catalog of human variation
July 2010
– sequenced 6 people (two families: parents and a daughter)
– sequenced genomes of 179 people
– sequenced exons of 700 people (‘exomes’: the protein-coding genome)
Ongoing
– 2,500 DNA samples from 27 populations around the world
Next generation sequencing
12 Next generation sequence data analysis
Copyright ©2009 American Association for Clinical Chemistry
13 Exome data
Sequence of protein-coding exons: one ‘exome’ contains the coding regions of all ~30,000 genes
Exome contains 30 megabases of DNA (whole genome has 3,200 megabases)
Detect all SNP variation in a person
Align ‘short reads’ (millions of sequences of ~100 bases) against the reference genome
Requires 40X ‘depth’ to reliably identify all DNA variation
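The figures on this slide imply a rough read count: average depth is (number of reads × read length) / target size, so the reads needed are target size × depth / read length. The sketch below works this through for the exome and whole genome, assuming idealised uniform coverage with every read aligning to the target; real capture and alignment are less efficient, so actual requirements are higher. The function name is hypothetical.

```python
def reads_needed(target_bases, depth, read_length):
    """Reads required for a given average depth, assuming uniform coverage
    and that every read aligns to the target (an idealisation)."""
    total_bases = target_bases * depth
    return -(-total_bases // read_length)  # integer ceiling division


EXOME_BASES = 30_000_000        # 30 megabases, from the slide
GENOME_BASES = 3_200_000_000    # 3,200 megabases, from the slide
DEPTH = 40                      # 40X
READ_LENGTH = 100               # ~100 bases per short read

exome_reads = reads_needed(EXOME_BASES, DEPTH, READ_LENGTH)
genome_reads = reads_needed(GENOME_BASES, DEPTH, READ_LENGTH)

print(f"Exome:  {exome_reads:,} reads")
print(f"Genome: {genome_reads:,} reads")
```

Under these assumptions an exome at 40X needs about 12 million reads, and a whole genome roughly 107 times as many.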
14 Sequence data - the ‘filtering’ problem
Each person has 250-300 mutations that could affect protein function and 50-100 mutations implicated in inherited disorders
Most variants have no effect on health
To find disease gene(s), filter out ‘normal’ variation (reference data: 1000 Genomes, web databases)
Common disease may involve complex interactions between networks of 100s of genes
Machine learning and other mathematical tools are required to interpret complex phenotype/sequence data
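The filtering step can be sketched as a two-part test: drop variants that are common in a reference panel, then keep only those predicted to alter a protein. Everything in this example is hypothetical, including the `Variant` fields, the variant and gene names, the frequencies and the 1% threshold; it is not the format of any real database or annotation tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Variant:
    name: str
    gene: str
    reference_frequency: Optional[float]  # None if absent from the reference panel
    alters_protein: bool


def filter_candidates(variants, max_frequency=0.01):
    """Keep variants that are rare (or unseen) in the reference data
    and predicted to change a protein."""
    kept = []
    for variant in variants:
        is_rare = (
            variant.reference_frequency is None
            or variant.reference_frequency <= max_frequency
        )
        if is_rare and variant.alters_protein:
            kept.append(variant)
    return kept


# Hypothetical variants from one person's exome (made-up values)
variants = [
    Variant("v1", "GENE_A", 0.35, True),    # common: treated as 'normal' variation
    Variant("v2", "GENE_B", 0.002, True),   # rare and protein-altering: kept
    Variant("v3", "GENE_C", None, True),    # not in the reference panel: kept
    Variant("v4", "GENE_D", 0.001, False),  # rare but no predicted protein change
]

candidates = filter_candidates(variants)
print([v.name for v in candidates])
```

A frequency cut-off like this suits rare, severe disorders; for common disease, where many variants of small effect interact, simple filtering is not enough, which is the slide's point about machine learning.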
17 “The production of billions of NGS reads has also challenged the infrastructure of existing information technology systems in terms of data transfer, storage and quality control, computational analysis to align or assemble read data…”
“Advances in bioinformatics are ongoing, and improvements are needed if these systems are to keep pace with the continuing developments in NGS technologies. It is possible that the costs associated with downstream data handling and analysis could match or surpass the data-production costs…”
(Metzker, Nat Rev Genet 2010, 11, 31-46.)
18 Some applications of DNA sequence data
Disease gene mapping
Disease diagnosis/disease sub-types
Differences between populations, migration patterns
Biotechnology (bacterial genomes, genetic engineering)
Infectious disease control
Evolution/Taxonomy/classification
Archaeology
Forensic science
19 Machine learning to identify genetic factors in breast cancer
3000 cases with early-onset breast cancer (Southampton data), genotyped with 1000s of SNPs
Identify new breast cancer genes: integrate phenotypic data (tumour sub-types, survival, response to treatment) with genotypes/sequence and gene functional information (web databases)
Machine learning models: test gene:gene and gene:phenotype interactions
New genes? Groups of genes distinguishing sub-types of disease?
20 Web-based tools to improve diagnosis of ‘dosage’ diseases
‘Dosage’: number of copies of a gene (more or fewer than 2 due to duplication or deletion)
Gene(s) in the duplicated/deleted region might cause disease if present at an abnormal ‘dose’. Which gene(s)?
Identification influences patient treatments
Data-mine for known gene function in literature/databases
Prediction of disease-causing genes: machine learning models (integrate gene function, expression, known ‘dosage genes’)
Web-based tools for mining/querying/presentation of data for clinicians to improve diagnosis
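The first step in asking "which gene(s)?" is listing the genes that overlap the duplicated or deleted region, together with the resulting copy number. The sketch below shows only that step, with made-up gene names and coordinates on a single hypothetical chromosome; ranking the listed genes by likely disease involvement is the harder prediction problem described on the slide and is not attempted here.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    name: str
    start: int  # start coordinate on the chromosome (hypothetical)
    end: int    # end coordinate on the chromosome (hypothetical)


def genes_in_region(genes, region_start, region_end):
    """Names of genes that overlap the region, fully or partially."""
    return [
        gene.name
        for gene in genes
        if gene.start <= region_end and gene.end >= region_start
    ]


def copies_present(copy_change):
    """Gene 'dose': the normal 2 copies plus a gain (+1) or loss (-1)."""
    return 2 + copy_change


# Hypothetical genes on one chromosome (made-up coordinates)
genes = [
    Gene("GENE_A", 1_000, 5_000),
    Gene("GENE_B", 8_000, 12_000),
    Gene("GENE_C", 20_000, 26_000),
]

# A hypothetical deletion of one copy spanning positions 4,000-10,000
affected = genes_in_region(genes, 4_000, 10_000)
dose = copies_present(-1)

print(f"Genes at dose {dose}: {affected}")
```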
21 Conclusions
Majority of genetic variation underlying human disease is unknown
Next-generation sequencing will, in time, reveal all of these genes
But… finding missing disease genes in DNA sequence presents huge challenges for medicine, mathematics and computer/web science
Exome and whole genome sequence analysis will transform all ‘bioscience’ research fields
NGS is now generating vast data sets: novel and multidisciplinary approaches to management, visualisation, analysis and interpretation are urgently needed