Disease risk prediction
Usman Roshan

Prediction of disease risk with genome-wide association studies has yielded low accuracy for most diseases. Family history is competitive in most cases except for cancer (Do et al., PLoS Genetics, 2012).

Our own studies have shown limited accuracy with various machine learning methods:
Univariate and multivariate feature selection
Multiple kernel learning
What accuracy can we achieve with machine learning methods applied to variants detected from whole exome data?

Chronic lymphocytic leukemia prediction with exome sequences and machine learning
We selected exome sequences of chronic lymphocytic leukemia from dbGaP.
Largest at the time of download in August cases and 169 controls
Case and control prediction accuracy with genetic variants unknown
Same dataset previously studied in Wang et al., NEJM, 2011, where new associated genes are reported but no risk prediction

What is whole exome data?
Figure: human genome sequence with introns and coding regions (exons), covered by Illumina 76 bp short reads (exome data).
In practice, flanking regions are also sequenced, so some intronic regions are included.

Obtain structural variants (1)
Data of size 3.2 terabytes at 140X coverage
Short reads mapped (aligned) to the human genome reference sequence with BWA-MEM (a popular short read mapper)

Obtained SNPs and indels from the alignments for each individual
Figure: short reads from a single individual (fragments such as ATTGA, ACCAG, ACCCG, ATT--A) aligned to the human genome reference at four variant sites.
Homozygous SNP encoded as 2 (0 if same as reference)
Heterozygous SNP encoded as 1
Heterozygous indel encoded as 1
Where no variant is reported for this individual but one was detected at that site in a different individual, we assign a value of 0 for this individual.
The individual in the figure is encoded into a feature vector of four dimensions: (2, 1, 0, 1)
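The encoding above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the study: the function name `encode_genotype`, the allele strings, and the representation of a genotype as a pair of alleles are all assumptions made for the example. It counts non-reference alleles, which reproduces the stated codes; note that this would give a homozygous indel the value 2, a case the slides do not specify.

```python
def encode_genotype(ref, genotype):
    """Count non-reference alleles in a diploid genotype: 0, 1, or 2.

    genotype is a pair of alleles, or None when no variant was reported
    for this individual at a site that is variable in someone else.
    """
    if genotype is None:
        return 0
    a, b = genotype
    return (a != ref) + (b != ref)

# Four hypothetical sites, chosen to mirror the example on the slide.
sites = [
    ("A", ("G", "G")),    # homozygous SNP
    ("A", ("A", "C")),    # heterozygous SNP
    ("A", None),          # variant seen only in a different individual
    ("TG", ("TG", "-")),  # heterozygous deletion (indel)
]

vector = [encode_genotype(ref, gt) for ref, gt in sites]
print(vector)  # [2, 1, 0, 1]
```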

Combine variants from different individuals to form a data matrix
Each row is a case or control and each column is a variant
Figure: example matrix with variant columns A/C and C/G; case rows (C) with genotypes AA CC, AC CG, AA GG and control rows (Co1, Co2) with genotypes AC CG, CC CG.
153 cases and 144 controls after excluding very large files and problematic datasets
SNPs and 2200 indels
Numerically encoded
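One way to combine per-individual variant calls into such a matrix is sketched below. The function `build_matrix`, the individual names, and the variant identifiers are hypothetical; the only behavior taken from the slides is that a variant absent from an individual's calls gets the value 0.

```python
def build_matrix(calls):
    """Turn {individual: {variant_id: code}} into (row_ids, variant_ids, matrix).

    Columns are the union of all variants seen in any individual; a variant
    that was not reported for an individual is filled in as 0.
    """
    variant_ids = sorted({v for per_ind in calls.values() for v in per_ind})
    row_ids = sorted(calls)
    matrix = [[calls[r].get(v, 0) for v in variant_ids] for r in row_ids]
    return row_ids, variant_ids, matrix

# Hypothetical encoded calls for two cases and one control.
calls = {
    "case1": {"chr14:100:A>C": 2, "chr14:250:C>G": 1},
    "case2": {"chr14:100:A>C": 1},
    "control1": {"chr2:77:delTG": 1},
}

rows, variants, matrix = build_matrix(calls)
for name, row in zip(rows, matrix):
    print(name, row)
```

Sorting the row and column identifiers keeps the layout reproducible between runs, which matters when the same columns must later be selected in training and validation data.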

Perform cross-validation study
Full dataset: each row is a case or control individual and each column is a variant (SNP or indel)
Split rows randomly into training and validation sets (90:10 ratio)
Rank all variants on the training data
Learn a support vector machine classifier on the training data with the top k ranked variants
Predict case and control on the validation data
Compute the error and repeat 100 times
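The procedure can be sketched as follows, using only the Python standard library. Two pieces are stand-ins and not the study's methods: `rank_variants` ranks columns by the absolute difference of class means, and `centroid_predict` is a nearest-class-mean classifier used in place of the support vector machine so the example needs no external packages. All function names and the toy data are hypothetical. The point of the sketch is the structure: ranking uses training rows only, so the validation rows never influence which variants are selected.

```python
import random

def class_means(X, y, cols):
    """Mean of each selected column within controls (0) and cases (1)."""
    means = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        means[label] = [sum(r[c] for r in rows) / len(rows) for c in cols]
    return means

def rank_variants(X, y):
    """Placeholder ranking: columns sorted by |case mean - control mean|."""
    cols = list(range(len(X[0])))
    m = class_means(X, y, cols)
    return sorted(cols, key=lambda c: -abs(m[1][c] - m[0][c]))

def centroid_predict(X_tr, y_tr, X_va, cols):
    """Stand-in classifier (not an SVM): label of the nearer class mean."""
    m = class_means(X_tr, y_tr, cols)
    def dist(x, label):
        return sum((x[c] - mu) ** 2 for c, mu in zip(cols, m[label]))
    return [1 if dist(x, 1) < dist(x, 0) else 0 for x in X_va]

def repeated_validation(X, y, k, repeats=100, seed=0):
    """Mean validation error over random 90:10 splits."""
    rng = random.Random(seed)
    n = len(X)
    n_val = max(1, n // 10)
    errors = []
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        val, tr = idx[:n_val], idx[n_val:]
        X_tr, y_tr = [X[i] for i in tr], [y[i] for i in tr]
        top_k = rank_variants(X_tr, y_tr)[:k]  # ranked on training rows only
        pred = centroid_predict(X_tr, y_tr, [X[i] for i in val], top_k)
        errors.append(sum(p != y[i] for p, i in zip(pred, val)) / n_val)
    return sum(errors) / len(errors)

# Toy data: column 0 tracks the label exactly, columns 1 and 2 are noise.
rng = random.Random(1)
y = [i % 2 for i in range(40)]
X = [[2 * t, rng.randint(0, 2), rng.randint(0, 2)] for t in y]
err = repeated_validation(X, y, k=1)
print(err)
```

The sketch assumes every training split contains both cases and controls; with very small or unbalanced data a stratified split would be needed to guarantee that.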

Variant ranking
Figure: features F0, F1, F2 for cases (C) and controls (Co) are reordered by rank, here to F1, F2, F0 (rank features).
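As an illustration of feature ranking, the sketch below assumes that the Pearson ranking named on the next slide means ordering variants by the absolute Pearson correlation between each variant column and the case/control label. That reading, the function names `pearson` and `pearson_rank`, and the toy matrix are assumptions, not details confirmed by the slides.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # constant column or label: treat as uninformative
    return sxy / math.sqrt(sxx * syy)

def pearson_rank(X, y):
    """Column indices ordered from strongest to weakest |correlation| with y."""
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: -scores[j])

# Toy matrix with columns F0, F1, F2; labels: 1 = case, 0 = control.
X = [
    [1, 2, 1],
    [0, 2, 1],
    [1, 2, 0],
    [0, 0, 0],
    [1, 0, 0],
    [0, 0, 0],
]
y = [1, 1, 1, 0, 0, 0]

order = pearson_rank(X, y)
print(order)  # [1, 2, 0], i.e. F1, F2, F0
```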

Risk prediction with Pearson ranked SNPs
Similar curves with Pearson
Small k better than large k (k = number of variants)
SNPs better than indels
Top 60 SNPs mostly in chromosome 14

Prediction with GWAS

Cross-study validation

Prediction on external samples

Pearson ranking of genes associated with CLL

Analysis of top ranked Pearson genes

Conclusion
Encouraging results with exome data
No known risk prediction study with exome data
Limitations:
Small sample size
Ancestry of some data unknown
