1
Disease risk prediction
Usman Roshan
2
Disease risk prediction
Prediction of disease risk with genome-wide association studies (GWAS) has yielded low accuracy for most diseases. Family history is competitive in most cases except for cancer (Do et al., PLoS Genetics, 2012).
3
Disease risk prediction
Our own studies have shown limited accuracy with various machine learning methods:
Univariate and multivariate feature selection
Multiple kernel learning
What accuracy can we achieve with machine learning methods applied to variants detected from whole exome data?
4
Chronic lymphocytic leukemia prediction with exome sequences and machine learning
We selected exome sequences of chronic lymphocytic leukemia (CLL) from dbGaP.
Largest at the time of download in August cases and 169 controls
Case and control prediction accuracy with genetic variants was unknown.
The same dataset was previously studied in Wang et al., NEJM, 2011, where new associated genes are reported but no risk prediction.
5
What is whole exome data?
Figure: the human genome sequence, showing introns and exons (coding regions), with Illumina 76 bp short reads (exome data).
In practice, flanking regions are also sequenced, so some intronic regions are included.
6
Obtain structural variants (1)
Data of size 3.2 terabytes at 140X coverage.
Mapped to the human genome reference sequence with BWA-MEM (a popular short read mapper).
Short reads are aligned to the human genome.
7
Figure: short reads from a single individual aligned to the human genome reference.
Homozygous SNP encoded as 2 (0 if same as reference).
Heterozygous SNP encoded as 1.
Heterozygous indel encoded as 1.
At one site no variant is reported for this individual, but we detected it in a different individual, so we assign it a value of 0 for this individual.
The individual is encoded into a feature vector of four dimensions: (2, 1, 0, 1).
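The encoding on this slide can be sketched in a few lines of Python. This is only an illustration: the call labels (`hom_snp`, `het_snp`, `het_indel`), the site names, and the function `encode_individual` are hypothetical and not taken from the study's pipeline.

```python
# Hypothetical representation: every variant site detected in ANY individual
# is listed in `sites`; `calls` holds only the sites where THIS individual
# has a variant. A site with no call gets 0 (same as reference).
ENCODING = {"hom_snp": 2, "het_snp": 1, "het_indel": 1, None: 0}


def encode_individual(calls, sites):
    """Return one feature value per site, in the order given by `sites`."""
    return [ENCODING[calls.get(site)] for site in sites]


sites = ["site1", "site2", "site3", "site4"]
# site3 was detected in a different individual, so it is absent here.
calls = {"site1": "hom_snp", "site2": "het_snp", "site4": "het_indel"}
vector = encode_individual(calls, sites)
print(vector)  # [2, 1, 0, 1]
```

The result matches the four-dimensional vector (2, 1, 0, 1) given on the slide.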
8
Obtain structural variants (2)
Obtained SNPs and indels from the alignments for each individual.
Figure: short reads from a single individual aligned to the human genome reference.
Homozygous SNP encoded as 2 (0 if same as reference).
Heterozygous SNP encoded as 1.
Heterozygous indel encoded as 1.
9
Obtain structural variants (3)
Figure: example genotypes at two variants (A/C and C/G) for cases (C) and controls (Co).
Combine variants from different individuals to form a data matrix.
Each row is a case or control and each column is a variant.
153 cases and 144 controls after excluding very large files and problematic datasets.
SNPs and 2200 indels
Numerically encoded
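A minimal sketch of how per-individual variant calls might be combined into one data matrix. The variant identifiers, the input layout, and the function `build_matrix` are assumptions made for illustration; the slides do not describe the actual file formats or tooling.

```python
def build_matrix(individuals):
    """Combine per-individual calls into a matrix.

    individuals: list of (label, {variant_id: encoded_value}) pairs.
    Columns are the union of all variants seen in any individual;
    a variant missing for an individual is encoded as 0.
    """
    variants = sorted({v for _, calls in individuals for v in calls})
    X = [[calls.get(v, 0) for v in variants] for _, calls in individuals]
    y = [label for label, _ in individuals]
    return variants, X, y


# Hypothetical toy input: two cases and one control.
individuals = [
    ("case", {"chr1:100:A>C": 2, "chr2:50:C>G": 1}),
    ("case", {"chr1:100:A>C": 1}),
    ("control", {"chr2:50:C>G": 1, "chr3:7:delGA": 1}),
]
variants, X, y = build_matrix(individuals)
for label, row in zip(y, X):
    print(label, row)
```

Each row of `X` is one individual and each column is one variant, as described on the slide.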
10
Perform cross-validation study
Split rows randomly into training and validation sets (90:10 ratio).
Rank all variants on the training data.
Learn a support vector machine classifier on the training data with the top k ranked variants.
Predict case and control on the validation data.
Compute the error and repeat 100 times.
Figure: the full dataset, split into training data and validation data. Each row is a case or control individual and each column is a variant (SNP or indel).
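The loop described above can be sketched as follows. To stay dependency-free, this sketch swaps in two stand-ins that are NOT the study's methods: a ranker based on the difference of class means, and a nearest-centroid classifier in place of the support vector machine. All function names and the toy data are hypothetical. The point it illustrates is that ranking is done on the training rows only in each round.

```python
import random


def rank_by_mean_difference(X, y):
    """Stand-in ranker: order columns by |mean(cases) - mean(controls)|."""
    def col_mean(j, label):
        vals = [row[j] for row, lab in zip(X, y) if lab == label]
        return sum(vals) / len(vals) if vals else 0.0
    scores = [abs(col_mean(j, 1) - col_mean(j, 0)) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: -scores[j])


def nearest_centroid_predict(X_train, y_train, X_test):
    """Stand-in classifier (not an SVM): pick the closer class centroid."""
    centroids = {}
    for label in set(y_train):
        rows = [r for r, lab in zip(X_train, y_train) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return [min(sorted(centroids), key=lambda lab: dist(row, centroids[lab]))
            for row in X_test]


def cross_validate(X, y, k, rounds=100, val_fraction=0.1, seed=0):
    """Mean validation error over repeated random 90:10 splits."""
    rng = random.Random(seed)
    n = len(X)
    n_val = max(1, round(n * val_fraction))
    errors = []
    for _ in range(rounds):
        idx = list(range(n))
        rng.shuffle(idx)
        val, train = idx[:n_val], idx[n_val:]
        X_tr = [X[i] for i in train]
        y_tr = [y[i] for i in train]
        top = rank_by_mean_difference(X_tr, y_tr)[:k]  # rank on training only
        X_tr_k = [[row[j] for j in top] for row in X_tr]
        X_val_k = [[X[i][j] for j in top] for i in val]
        pred = nearest_centroid_predict(X_tr_k, y_tr, X_val_k)
        wrong = sum(p != y[i] for p, i in zip(pred, val))
        errors.append(wrong / len(val))
    return sum(errors) / len(errors)


# Toy data: column 0 separates cases (1) from controls (0); column 1 is noise.
X = [[2, i % 2] for i in range(10)] + [[0, i % 2] for i in range(10)]
y = [1] * 10 + [0] * 10
err = cross_validate(X, y, k=1, rounds=20)
print(err)
```

Ranking inside the loop, rather than once on the full dataset, keeps the validation rows from influencing which variants are selected.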
11
Rank features.
Figure: the columns F0, F1, F2 of the case (C) and control (Co) data matrix are reordered by rank to F1, F2, F0.
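The results slides refer to Pearson-ranked variants. A minimal sketch of one way to do such a ranking, assuming cases are labeled 1 and controls 0 and that variants are ordered by the absolute value of their Pearson correlation with the label. Both of those choices, and the function names, are assumptions rather than details given in the slides.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no signal
    return cov / math.sqrt(vx * vy)


def pearson_rank(X, y):
    """Return column indices sorted by decreasing |correlation| with y."""
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(len(X[0]))]
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    return order, scores


# Toy matrix: column 0 tracks the label, column 1 does not, column 2 is constant.
X = [[2, 1, 0], [2, 0, 0], [0, 1, 0], [0, 0, 0]]
y = [1, 1, 0, 0]
order, scores = pearson_rank(X, y)
print(order, scores)
```

Taking the first k entries of `order` gives the top k ranked variants used by the classifier.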
12
Risk prediction with Pearson ranked SNPs
Similar curves with Pearson.
Small k is better than large k (k = number of variants).
SNPs are better than indels.
The top 60 SNPs are mostly on chromosome 14.
13
Prediction with GWAS
14
Cross-study validation
15
Prediction on external samples
16
Prediction on external samples
17
Pearson ranking of genes associated with CLL
18
Analysis of top ranked Pearson genes
19
Conclusion
Encouraging results with exome data.
No known risk prediction study with exome data.
Limitations:
Small sample size
Ancestry of some data unknown