Which Phenotypes Can Be Predicted from a Genome-Wide Scan of Single Nucleotide Polymorphisms (SNPs): Ethnicity vs. Breast Cancer

Mohsen Hajiloo, Russell Greiner, Sambasivarao Damaraju

Please direct all correspondence to the first author.

Background

- The DNA in germ cells (egg and sperm cells), which is the source of DNA for all other cells in the body, is called the germline genome.
- Single Nucleotide Polymorphisms (SNPs) are substitutions of a single nucleotide at a specific position in the genome, observed at frequencies above 1% in a particular human population.
- The allele with the higher frequency of occurrence within a population is called the major allele ("A"), while the one occurring less frequently is called the minor allele ("B").
- For each SNP, the two allelic variants ("A" and "B") can give rise to three possible genotypes:
  1. When both parents contribute an "A" allele (the major allele), the genotype is referred to as wild-type homozygous (AA).
  2. When both parents contribute a "B" allele (the minor allele), the genotype is referred to as variant (mutant) homozygous (BB).
  3. When the two alleles are different, the genotype is referred to as heterozygous (AB).

[Figure: Genome-Wide Association Study schema]
[Figure: Genome-Wide Predictive Study schema (labeled "Germline SNPs")]

Breast Cancer Prediction

- Training dataset
  - 623 Caucasian samples (302 breast cancer cases and 321 controls)
  - 906,600 SNP probes on the Affymetrix (Affy) 6.0 SNP array
- Data preprocessing removed the following SNPs, leaving 506,836 SNPs:
  - SNPs that had any missing calls
  - SNPs whose genotype frequencies deviated from Hardy-Weinberg equilibrium (nominal p-value < 0.001 in controls)
  - SNPs whose minor allele frequency was less than 5%
- Biological feature selection methods did not produce a classification model with accuracy better than the baseline. The SNP sets tried were:
  - 28 SNPs reported in the literature to be highly associated with breast cancer
  - 12,858 SNPs associated with genes of KEGG's cancer pathways
  - 1,661 SNPs associated with breast cancer in the F-SNP database
- Statistical feature selection methods combined with learning methods

  10-fold cross-validation accuracy of various combinations of statistical feature selection and learning methods. Zero R always predicts the majority class, so its column is the baseline (321/623 = 51.52%).

  Feature selection    Decision Tree    KNN       SVM       Zero R
  Information Gain     50.88%           56.17%    55.37%    51.52%
  MeanDiff             52.06%           58.71%    57.30%    51.52%
  mRMR                 51.20%           57.78%    56.18%    51.52%
  PCA                  51.69%           51.36%    51.84%    51.52%

- External validation using the CGEMS breast cancer dataset
  - 2,287 Caucasian samples (1,145 breast cancer cases and 1,142 controls)
  - 555,352 SNP probes on the Illumina HumanHap 550 (I5) array
  - The model trained with the MeanDiff+KNN algorithm achieved 60.25% accuracy on this test dataset.
- A permutation test showed that the 59.55% leave-one-out cross-validation (LOOCV) accuracy achieved by the MeanDiff+KNN algorithm is better than the baseline accuracy achieved by Zero R.

Ethnicity Prediction

- Training dataset: International HapMap Project phase 2 genotyping data
  - 270 samples (90 Caucasian, 90 African, and 90 Chinese/Japanese)
  - 906,600 SNP probes on the Affymetrix (Affy) 6.0 SNP array
- Data preprocessing removed the following SNPs, leaving 150,000 SNPs:
  - SNPs that had any missing calls
  - SNPs whose minor allele frequency was less than 35%
  - SNPs located on the X, Y, or MT chromosome, or on an unspecified chromosome
- Learning algorithms
  - CART decision tree, which uses 3 SNPs: 10-fold cross-validation accuracy of 97.41%
  - Ensemble of disjoint CART trees, which uses 149 SNPs: 10-fold cross-validation accuracy of 100%
- External validation using 321 Caucasian subjects
  - The model trained with the ensemble of disjoint CART decision trees achieved 96.8% accuracy on this test dataset.

Reasons for Poor Breast Cancer Prediction

- Biological perspective: many factors contribute to developing breast cancer, and germline SNPs are only one of them.
  - Other heritable/genetic DNA changes
    - Copy Number Variations (CNVs) are not included in this study
    - Structural DNA changes are not included in this study
    - etc.
  - Environmental and lifestyle factors are not reflected in germline DNA
  - Somatic DNA changes
    - Point mutations are not included in this study
    - etc.
- Machine learning perspective:
  - Breast cancer is a heterogeneous disease, and its subtypes are not reflected in the class labels:
    Disease = F1(x1, x2, x3) ∨ F2(x1, x4, x5) ∨ F3(x2, x5, x6, x7, x8) ∨ F4(x1, x10) ∨ …
  - Curse of dimensionality: the large number of SNPs relative to the small number of samples made learning impossible.

Future Directions of Research

- Designing a predictive model for breast cancer using a combination of different types of germline and somatic DNA changes together with environmental and lifestyle factors
- Assessing the predictability of the 7 phenotypes of the Wellcome Trust Case Control Consortium (WTCCC) using germline SNPs
  - Type 1 Diabetes
  - Type 2 Diabetes
  - Rheumatoid Arthritis
  - Inflammatory Bowel Disease
  - Bipolar Disorder
  - Hypertension
  - Coronary Artery Disease
- Extending the ethnicity prediction project using International HapMap Project phase 3 genotyping data
- Computational modeling of the effect of the different factors that make it hard to learn predictive models for complex diseases

Acknowledgments

- This research was made possible by an interdisciplinary graduate studentship award from the Faculty of Medicine and Dentistry and the Faculty of Science of the University of Alberta.
- The datasets used in this project were provided by Damaraju's Lab at the Cross Cancer Institute, dbGaP's Breast Cancer CGEMS project, and the HapMap Project.
- We would like to acknowledge members of the Alberta Innovates Centre for Machine Learning (AICML) and members of the Damaraju Lab at the Cross Cancer Institute of Alberta Health Services (AHS) for their constructive feedback.
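
Sketch: SNP preprocessing filters

The three preprocessing filters listed under Breast Cancer Prediction (missing calls, Hardy-Weinberg deviation in controls, minor allele frequency) can be sketched as below. This is only an illustration under stated assumptions, not the authors' pipeline. Genotypes are assumed to be coded as the number of "B" alleles (AA = 0, AB = 1, BB = 2) with None for a missing call, and the Hardy-Weinberg check is assumed to be a 1-degree-of-freedom chi-square test, since the poster does not say which test was used. All function names are hypothetical.

```python
import math


def minor_allele_frequency(genotypes):
    """Genotypes are B-allele counts per sample: 0 (AA), 1 (AB), 2 (BB)."""
    freq_b = sum(genotypes) / (2 * len(genotypes))
    return min(freq_b, 1.0 - freq_b)


def hwe_p_value(genotypes):
    """Chi-square (1 d.f.) test of Hardy-Weinberg proportions (assumed test)."""
    n = len(genotypes)
    if n == 0:
        return 1.0
    observed = [genotypes.count(g) for g in (0, 1, 2)]
    q = (observed[1] + 2 * observed[2]) / (2 * n)  # frequency of allele B
    p = 1.0 - q
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    if min(expected) == 0:
        return 1.0  # monomorphic SNP: no deviation can be measured
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 d.f.
    return math.erfc(math.sqrt(chi2 / 2))


def keep_snp(genotypes, is_control, maf_min=0.05, hwe_alpha=0.001):
    """Apply the three filters to one SNP; True means the SNP is retained."""
    if any(g is None for g in genotypes):
        return False  # filter 1: any missing call
    if minor_allele_frequency(genotypes) < maf_min:
        return False  # filter 3: rare minor allele
    controls = [g for g, c in zip(genotypes, is_control) if c]
    return hwe_p_value(controls) >= hwe_alpha  # filter 2: HWE in controls


if __name__ == "__main__":
    snp = [0] * 36 + [1] * 48 + [2] * 16  # exactly in HWE with q = 0.4
    print(keep_snp(snp, [True] * len(snp)))
```

The thresholds default to the values quoted for the breast cancer dataset (5% and 0.001); the ethnicity dataset would use maf_min=0.35 and, as listed, no Hardy-Weinberg filter.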
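
Sketch: MeanDiff feature selection with KNN

The MeanDiff+KNN combination from the cross-validation table can be illustrated with a small pure-Python sketch. The poster does not define MeanDiff, so the score used here (absolute difference between the mean genotype code in cases and in controls) is an assumption, as are the Manhattan distance, the parameter values, and every function name. SNPs are selected inside each training fold, which is a common precaution against optimistic bias; the poster does not state whether this was done.

```python
import random
from collections import Counter


def mean_diff_scores(X, y):
    """Assumed MeanDiff score: |mean genotype in cases - mean in controls|."""
    cases = [row for row, label in zip(X, y) if label == 1]
    controls = [row for row, label in zip(X, y) if label == 0]
    scores = []
    for j in range(len(X[0])):
        mean_case = sum(r[j] for r in cases) / len(cases)
        mean_ctrl = sum(r[j] for r in controls) / len(controls)
        scores.append(abs(mean_case - mean_ctrl))
    return scores


def top_k_features(X, y, k):
    """Indices of the k SNPs with the largest score."""
    scores = mean_diff_scores(X, y)
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:k]


def knn_predict(train_X, train_y, x, n_neighbors=3):
    """Majority vote among the nearest training samples (Manhattan distance)."""
    dists = sorted(
        (sum(abs(a - b) for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:n_neighbors])
    return votes.most_common(1)[0][0]


def cross_validated_accuracy(X, y, n_folds=10, n_features=5,
                             n_neighbors=3, seed=0):
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    folds = [order[i::n_folds] for i in range(n_folds)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in order if i not in held_out]
        tr_X, tr_y = [X[i] for i in train], [y[i] for i in train]
        # Select SNPs on the training folds only, so the held-out fold
        # never influences which features are used.
        cols = top_k_features(tr_X, tr_y, n_features)
        tr_sel = [[row[j] for j in cols] for row in tr_X]
        for i in fold:
            x_sel = [X[i][j] for j in cols]
            correct += knn_predict(tr_sel, tr_y, x_sel, n_neighbors) == y[i]
    return correct / len(X)
```

Calling cross_validated_accuracy(X, y) on a genotype matrix X (rows are samples, entries are 0/1/2) with labels y (1 = case, 0 = control) returns the fraction of held-out samples classified correctly.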
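
Sketch: Zero R baseline and permutation test

The comparison against the Zero R baseline by permutation test can be sketched as follows. This is a generic label-permutation test, not necessarily the exact procedure used for the poster. The evaluate argument is a hypothetical callable that returns an accuracy for a feature matrix and a label vector, for example a cross-validated accuracy. The idea is that shuffling the labels breaks any real link between SNPs and disease, so the accuracies obtained on shuffled labels show what chance alone can produce.

```python
import random


def zero_r_accuracy(y):
    """Accuracy of always predicting the most frequent label."""
    return max(y.count(label) for label in set(y)) / len(y)


def permutation_p_value(evaluate, X, y, n_permutations=1000, seed=0):
    """Share of label shufflings that score at least as well as the real labels."""
    rng = random.Random(seed)
    observed = evaluate(X, y)
    shuffled = list(y)
    at_least_as_good = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if evaluate(X, shuffled) >= observed:
            at_least_as_good += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_good + 1) / (n_permutations + 1)


if __name__ == "__main__":
    labels = [1] * 302 + [0] * 321  # cases and controls in the training set
    print(round(zero_r_accuracy(labels), 4))  # 321/623, about 0.5152
```

A small p-value indicates that the observed accuracy is unlikely to arise from label noise alone, which is a weaker statement than saying the model is clinically useful.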
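
Sketch: ensemble of disjoint decision trees

The poster names an "ensemble of disjoint CART trees" for ethnicity prediction but does not describe how it is built. One plausible reading, shown here purely as an assumption, is that each tree is grown only on SNPs that no earlier tree has used, and the trees then vote. The tree grower below is a small Gini-based stand-in for CART on 0/1/2 genotype codes, and all names and parameters are hypothetical.

```python
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(X, y, allowed):
    """Return (snp, threshold) with the lowest weighted Gini, or None."""
    best, best_score = None, gini(y)
    for j in sorted(allowed):
        for t in (0, 1):  # genotype codes 0/1/2 allow the splits <=0 and <=1
            left = [lab for row, lab in zip(X, y) if row[j] <= t]
            right = [lab for row, lab in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score - 1e-12:
                best, best_score = (j, t), score
    return best


def grow_tree(X, y, allowed, max_depth=2):
    split = best_split(X, y, allowed) if max_depth > 0 else None
    if split is None:
        return Counter(y).most_common(1)[0][0]  # leaf: majority class
    j, t = split
    left = [(r, lab) for r, lab in zip(X, y) if r[j] <= t]
    right = [(r, lab) for r, lab in zip(X, y) if r[j] > t]
    return (j, t,
            grow_tree([r for r, _ in left], [lab for _, lab in left],
                      allowed, max_depth - 1),
            grow_tree([r for r, _ in right], [lab for _, lab in right],
                      allowed, max_depth - 1))


def tree_features(tree):
    """Set of SNP indices used anywhere in the tree."""
    if not isinstance(tree, tuple):
        return set()
    return {tree[0]} | tree_features(tree[2]) | tree_features(tree[3])


def tree_predict(tree, x):
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if x[j] <= t else right
    return tree


def disjoint_ensemble(X, y, n_trees, max_depth=2):
    """Grow trees one after another, never reusing a SNP across trees."""
    allowed = set(range(len(X[0])))
    trees = []
    for _ in range(n_trees):
        tree = grow_tree(X, y, allowed, max_depth)
        used = tree_features(tree)
        if not used:
            break  # no informative SNP is left
        trees.append(tree)
        allowed -= used
    return trees


def ensemble_predict(trees, x):
    """Majority vote over the trees."""
    return Counter(tree_predict(t, x) for t in trees).most_common(1)[0][0]
```

Under this reading, the ensemble is more robust than a single 3-SNP tree because a genotyping error in one SNP can affect at most one vote.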