Disease risk prediction

Presentation transcript:

Disease risk prediction
Usman Roshan

Disease risk prediction
- Predicting disease risk from genome-wide association studies has yielded low accuracy for most diseases.
- Family history is competitive in most cases except for cancer (Do et al., PLoS Genetics, 2012).

Disease risk prediction
- Our own studies have shown limited accuracy with various machine learning methods:
  - Univariate and multivariate feature selection
  - Multiple kernel learning
- What accuracy can we achieve with machine learning methods applied to variants detected from whole exome data?

Chronic lymphocytic leukemia prediction with exome sequences and machine learning
- We selected exome sequences of chronic lymphocytic leukemia (CLL) from dbGaP, the largest available at the time of download in August 2013.
- 186 cases and 169 controls
- Case and control prediction accuracy with genetic variants is unknown.
- The same dataset was previously studied in Wang et al., NEJM, 2011, where new associated genes are reported but no risk prediction.

What is whole exome data?
[Figure labels: human genome sequence; introns; coding regions (exons); Illumina 76 bp short reads (exome data).]
- In practice, flanking regions are also sequenced, so some intronic regions are included.

Obtain structural variants (1)
- Data of size 3.2 terabytes at 140X coverage
- Short reads are mapped to the human genome reference sequence with BWA-MEM (a popular short read mapper).
[Figure: short reads aligned to the human genome reference sequence.]

[Figure: short reads from a single individual aligned to the human genome reference.]
- Homozygous SNP: encoded as 2 (0 if same as reference)
- Heterozygous SNP: encoded as 1
- Site where no variant is reported in this individual but one was detected in a different individual: encoded as 0 for this individual
- Heterozygous indel: encoded as 1
- The individual is encoded as a feature vector of four dimensions: (2, 1, 0, 1)
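The encoding on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the site names, the `calls` dictionary, and the helper `encode_individual` are hypothetical, and the order of the four sites is one assignment consistent with the slide's vector.

```python
# Hypothetical lookup: zygosity of a reported variant -> numeric feature value.
ENCODING = {"het": 1, "hom": 2}


def encode_individual(all_sites, calls):
    """Return one feature per site in `all_sites`.

    `calls` maps a site name to "het" or "hom" for variants reported in this
    individual. Sites with no reported variant are encoded as 0.
    """
    return [ENCODING[calls[site]] if site in calls else 0 for site in all_sites]


# Four variant sites seen across the cohort (names are made up).
sites = ["snp_a", "snp_b", "snp_c", "indel_d"]

# This individual: homozygous SNP, heterozygous SNP, nothing reported at
# snp_c (it was detected in someone else), and a heterozygous indel.
calls = {"snp_a": "hom", "snp_b": "het", "indel_d": "het"}

vector = encode_individual(sites, calls)
print(vector)  # [2, 1, 0, 1]
```

Fixing the site order once for the whole cohort is what lets the per-individual vectors later be stacked into a single matrix.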

Obtain structural variants (2)
- SNPs and indels are obtained from the alignments for each individual.
[Figure: short reads from a single individual aligned to the human genome reference.]
- Homozygous SNP: encoded as 2 (0 if same as reference)
- Heterozygous SNP: encoded as 1
- Heterozygous indel: encoded as 1

Obtain structural variants (3)
- Combine variants from different individuals to form a data matrix.
- Each row is a case or control and each column is a variant.

         Genotypes       Numerically encoded
         A/C   C/G       A/C   C/G
  C0     AA    CC        0     0
  C1     AC    CG        1     1
  C2     AA    GG        0     2
  Co1    AC    CG        1     1
  Co2    CC    CG        2     1

- 153 cases and 144 controls after excluding very large files and problematic datasets
- 122,392 SNPs and 2,200 indels
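As a rough sketch of how the slide's toy genotypes become the encoded matrix, the snippet below counts copies of the alternate allele in each genotype. It assumes that "A/C" means reference A and alternate C, and that rows C0 to C2 are cases and Co1, Co2 are controls; the helper `count_alt` and the 1/0 labels are illustrative choices, not taken from the slides.

```python
def count_alt(genotype, alt):
    """Number of copies of the alternate allele in a two-letter genotype."""
    return sum(1 for allele in genotype if allele == alt)


# Variant columns as (reference, alternate) pairs, read off "A/C" and "C/G".
variants = [("A", "C"), ("C", "G")]

# Toy genotypes from the slide, one row per individual.
genotypes = {
    "C0": ["AA", "CC"],
    "C1": ["AC", "CG"],
    "C2": ["AA", "GG"],
    "Co1": ["AC", "CG"],
    "Co2": ["CC", "CG"],
}

# Assumed labels: 1 for the case rows, 0 for the control rows.
labels = [1, 1, 1, 0, 0]

# Rows are individuals, columns are variants.
matrix = [
    [count_alt(g, alt) for g, (_ref, alt) in zip(row, variants)]
    for row in genotypes.values()
]

for name, row in zip(genotypes, matrix):
    print(name, row)
```

Counting alternate alleles gives 0, 1, or 2 directly, which matches the homozygous-reference, heterozygous, and homozygous-variant codes used on the earlier slides.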

Perform cross-validation study
- Full dataset: each row is a case or control individual and each column is a variant (SNP or indel).
- Split rows randomly into training and validation sets (90:10 ratio).
- Rank all variants on the training data.
- Learn a support vector machine classifier on the training data with the top k ranked variants.
- Predict case and control on the validation data.
- Compute the error and repeat 100 times.
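The loop described on this slide can be sketched as follows. To keep the example dependency-free, it swaps in a nearest-centroid classifier and a mean-gap ranking as stand-ins; neither is the support vector machine or the ranking used in the study, and a real run would plug in an SVM implementation (for example scikit-learn's `LinearSVC`). All function names and the synthetic data are hypothetical.

```python
import random


def rank_by_mean_gap(X, y):
    """Stand-in ranking: column indices ordered by |mean(cases) - mean(controls)|."""
    cases = [row for row, lab in zip(X, y) if lab == 1]
    controls = [row for row, lab in zip(X, y) if lab == 0]

    def gap(j):
        return abs(sum(r[j] for r in cases) / len(cases)
                   - sum(r[j] for r in controls) / len(controls))

    return sorted(range(len(X[0])), key=gap, reverse=True)


def nearest_centroid(X_train, y_train, X_val):
    """Stand-in classifier (not an SVM): assign each row to the closer class mean."""
    centroids = {}
    for lab in set(y_train):
        rows = [row for row, l in zip(X_train, y_train) if l == lab]
        centroids[lab] = [sum(col) / len(rows) for col in zip(*rows)]

    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    return [min(centroids, key=lambda lab: dist(row, centroids[lab]))
            for row in X_val]


def repeated_holdout(X, y, rank, fit_predict, k, n_repeats=100, val_frac=0.1, seed=0):
    """Mean validation error over repeated random 90:10 splits."""
    rng = random.Random(seed)
    n = len(X)
    n_val = max(1, round(val_frac * n))
    errors = []
    for _ in range(n_repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        val, train = idx[:n_val], idx[n_val:]
        X_tr = [X[i] for i in train]
        y_tr = [y[i] for i in train]
        # Rank on training rows only, so validation rows never influence selection.
        top = rank(X_tr, y_tr)[:k]
        X_tr_k = [[row[j] for j in top] for row in X_tr]
        X_val_k = [[X[i][j] for j in top] for i in val]
        preds = fit_predict(X_tr_k, y_tr, X_val_k)
        wrong = sum(p != y[i] for p, i in zip(preds, val))
        errors.append(wrong / len(val))
    return sum(errors) / len(errors)


# Synthetic data: column 0 separates cases from controls, columns 1-4 are noise.
X = [[2 if i < 10 else 0] + [(i + j) % 3 for j in range(1, 5)] for i in range(20)]
y = [1 if i < 10 else 0 for i in range(20)]

error = repeated_holdout(X, y, rank_by_mean_gap, nearest_centroid, k=1)
print(error)  # 0.0 on this deliberately easy toy data
```

Ranking inside the loop, on the training rows of each split, matters: ranking once on the full dataset would let validation individuals influence which variants are selected and would make the reported error optimistic.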

Variant ranking
Rank features (columns are reordered by rank):

  Before ranking            After ranking
         F0   F1   F2              F1   F2   F0
  C0     1    2    0        C0     2    0    1
  C1     1    2    1        C1     2    1    1
  C2     1    2    2        C2     2    2    1
  Co1    1    0    1        Co1    0    1    1
  Co2    2    0    0        Co2    0    0    2
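One way to rank features, sketched below, is by the absolute Pearson correlation between each column and the case/control label. This is an assumption about the score: the slides name Pearson ranking but do not give the exact formula. The labels (1 for the C rows, 0 for the Co rows) are also assumed. The data are the slide's toy matrix.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no signal
    return sxy / math.sqrt(sxx * syy)


names = ["F0", "F1", "F2"]
X = [
    [1, 2, 0],  # C0
    [1, 2, 1],  # C1
    [1, 2, 2],  # C2
    [1, 0, 1],  # Co1
    [2, 0, 0],  # Co2
]
y = [1, 1, 1, 0, 0]  # assumed labels: 1 = case, 0 = control

# Score each column by |r| against the labels, then sort best-first.
scores = {
    name: abs(pearson([row[j] for row in X], y))
    for j, name in enumerate(names)
}
ranking = sorted(names, key=scores.get, reverse=True)
print(ranking)
```

F1 separates cases from controls perfectly and ranks first. Under this particular score F0 comes ahead of F2, so the order of the remaining columns on the slide appears to be illustrative or to come from a different score.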

Risk prediction with Pearson ranked SNPs
- Similar curves with Pearson
- Small k better than large k (k = number of variants)
- SNPs better than indels
- Top 60 SNPs mostly on chromosome 14

Prediction with GWAS

Cross-study validation

Prediction on external samples

Prediction on external samples

Pearson ranking of genes associated with CLL

Analysis of top ranked Pearson genes

Conclusion
- Encouraging results with exome data
- No known risk prediction study with exome data
- Limitations:
  - Small sample size
  - Ancestry of some data unknown