Predicting Deleterious Mutations
Capstone Project Presentation
Stuart Young, Friday 17th December 2004
Young SP, Radivojac P, Mooney SD

Presentation transcript:

Deleterious
  “Hurtful or injurious to life or health; noxious” (Oxford English Dictionary)
  “’Tis pity wine should be so deleterious, For tea and coffee leave us much more serious.” (Byron, Don Juan IV, 1821)

SNPs
  What is an SNP (single nucleotide polymorphism)?
  Why are SNPs important?
  Some SNPs are nonsynonymous
  The molecular effects of SNPs vary widely

Motivation
  Improve on the existing methods for predicting deleterious mutations
  Use protein sequence, evolution and structure data combined with machine learning to identify potentially disease-causing SNPs

SNP data is increasingly available
  Over 40 major online databases
  dbSNP is the primary SNP database (contains 5,000,000+ validated human SNPs)
  Many databases contain potentially disease-causing SNPs related to a particular disease

Deleterious effects of mutations on proteins
  Function
  Stability
  Expression
  Protein-protein interactions

Current Classification Tools: Sequence Approaches
  BLOSUM62
    An amino acid substitution score matrix
  SIFT
    Collects sequence homologues in multiple alignments and identifies non-conservative changes in amino acids
    Ng P & Henikoff S, 'Predicting Deleterious Amino Acid Substitutions'. Genome Research, 2001, 11:
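
As a rough illustration of how a substitution score matrix can be consulted for a single amino acid change, here is a minimal Python sketch. It holds only a handful of BLOSUM62 entries rather than the full 20 x 20 matrix, and the `is_conservative` helper and its threshold of 0 are assumptions made for illustration. It is a plain matrix lookup, not the SIFT method.

```python
# A few BLOSUM62 entries only; the real matrix covers all 20 x 20 amino acid pairs.
# Keys are unordered pairs because the matrix is symmetric.
BLOSUM62_SUBSET = {
    frozenset("IL"): 2,
    frozenset("DE"): 2,
    frozenset("KR"): 2,
    frozenset("FY"): 3,
    frozenset("DL"): -4,
    frozenset("GW"): -2,
}


def substitution_score(wild_type, mutant):
    """Look up the score for replacing wild_type with mutant (one-letter codes)."""
    return BLOSUM62_SUBSET[frozenset((wild_type, mutant))]


def is_conservative(wild_type, mutant, threshold=0):
    """Hypothetical rule: call a substitution conservative if its score exceeds threshold."""
    return substitution_score(wild_type, mutant) > threshold


print(substitution_score("L", "I"))  # similar hydrophobic residues score positively
print(is_conservative("D", "L"))     # acidic to hydrophobic scores negatively
```

A score like this can serve as one input feature among several; on its own it ignores the position of the mutation within the protein.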

Current Classification Tools: Structural Approaches
  Expert rules
    Uses evolutionary and structural data
    Sunyaev et al, 'Prediction of deleterious human alleles'. Human Molecular Genetics, 2001, Vol. 10, No. 6, 593.
  Decision trees
    Improved performance based on sequence and structural data
    Produces intuitive rules

Our foundation for the project
  Saunders CT & Baker D, 'Evaluation of Structural and Evolutionary Contributions to Deleterious Mutation Prediction'. J. Mol. Biol. (2002) 322, 891–901
  Structural and evolutionary features
  Trained classifiers based on two data sets: experimental mutations and human alleles

Saunders & Baker (S&B) Training Sets
  Experimental mutations (~5,000)
    HIV-1 protease
    E. coli Lac repressor
    T4 lysozyme
  Human alleles (~350 mutations)
    103 'hot' human genes

Why two training sets?
  Unbiased human data is hard to get:
    Many disease-associated mutations are discovered through genetic association studies and may not be causative (i.e., only linked with the causative allele)
    The effect of mutations is hard to measure
  Experimental 'whole gene mutagenesis' data is considered 'unbiased'

Features used in S&B Study
  SIFT
  SIFT + solvent accessibility (SA)
  SIFT + normalized B-factor
  SIFT + Sunyaev expert rules
  SIFT + SA + B-factor

Hypothesis
  Can we improve on the results of Saunders and Baker by using more structural and sequence properties?

Experimental Design
  Classification algorithms
    Decision trees
    Support vector machines
    Neural nets
  Additional features
    Amino acid relative frequencies
    Additional structural properties

Structural Property Values
  Russ Altman (Stanford) developed a vector representation of protein structural sites
  Spheres (1.875 Å → 7.5 Å) centered on the C-alpha atom of the mutation position
  66 features
  Atom/residue counts within sphere and other features, e.g.:
    Solubility
    Solvent accessibility
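
To make the idea of counting atoms in spheres around the mutation site concrete, here is a minimal Python sketch. It assumes four equal-width shells of 1.875 Å out to 7.5 Å and plain (x, y, z) tuples for coordinates; the function name, the shell-boundary convention and the toy coordinates are illustrative assumptions, not the S-BLEST implementation, which uses many more properties than raw counts.

```python
import math

SHELL_WIDTH = 1.875  # Å; assumed equal-width shells
N_SHELLS = 4         # 4 x 1.875 Å = 7.5 Å outer radius


def shell_counts(center, atoms):
    """Count atoms in each concentric shell around center.

    center: (x, y, z) of the C-alpha atom at the mutation position.
    atoms:  iterable of (x, y, z) coordinates for surrounding atoms.
    Atoms at or beyond the outer radius are ignored.
    """
    counts = [0] * N_SHELLS
    for atom in atoms:
        distance = math.dist(center, atom)
        shell = int(distance // SHELL_WIDTH)
        if shell < N_SHELLS:
            counts[shell] += 1
    return counts


# Toy coordinates (hypothetical), at distances 1, 2, 4, 5 and 10 Å from the origin.
toy_atoms = [(1, 0, 0), (0, 2, 0), (0, 0, 4), (3, 4, 0), (10, 0, 0)]
print(shell_counts((0, 0, 0), toy_atoms))  # → [1, 1, 2, 0]
```

Counting per shell rather than per sphere keeps the features from overlapping; cumulative sphere counts can be recovered by summing the shells.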

Amino Acid Windows
  AA frequencies within a window on either side of the mutation position
  20 AAs = 20 features
  LEFT and RIGHT → 40 features
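
The window features described above can be sketched in Python as follows. The function name, the 0-based position convention, the handling of windows truncated at the sequence ends and the example sequence are assumptions for illustration; the slides do not specify them.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def window_features(sequence, position, width):
    """Return 40 features: relative AA frequencies left, then right, of position.

    position is a 0-based index of the mutated residue, which is itself excluded.
    Windows are truncated at the ends of the sequence.
    """
    left = sequence[max(0, position - width):position]
    right = sequence[position + 1:position + 1 + width]
    features = []
    for segment in (left, right):
        counts = Counter(segment)
        size = len(segment)
        features.extend(counts[aa] / size if size else 0.0 for aa in AMINO_ACIDS)
    return features


# Hypothetical 10-residue sequence; the mutation site is the Y at index 4.
feats = window_features("MKTAYIAKQR", position=4, width=3)
print(len(feats))  # → 40
```

Varying `width` gives the windows of different sizes mentioned later in the feature list.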

[Figure: Amino Acid Windows]

Tools
  Databases
    PDB: protein structure data
    S-BLEST: structural features
  Software
    Perl
    Matlab (NN, PRTools (DT), SVC)

List of Features Used
  BLOSUM62, disorder, secondary structure, molecular weight
  Grouped amino acid frequency windows of varying widths
  SIFT
  S-BLEST (vector contains four sub-shells spreading outward from site)
  Solvent accessibility (C-beta density, i.e., the number of C-beta atoms around the site)

[Figure: Comparison with S&B Results]

1. Human Data Set
  Human allele dataset as train and test set
  Ensembles of decision trees for classification
  20-fold cross-validation
  Progressively added features to see their effect on performance
  Because structural data was not available for all mutation sites, we used a subset of the original Saunders and Baker training set
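
The 20-fold cross-validation procedure can be sketched in plain Python as below. The original work used Matlab with ensembles of decision trees; here a trivial majority-class predictor stands in for the classifier so the example stays self-contained, and all function names and the toy labels are assumptions for illustration.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and deal them into k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(features, labels, train_fn, k=20, seed=0):
    """Train on k-1 folds, test on the held-out fold, and return overall accuracy."""
    correct = 0
    for held_out in k_fold_indices(len(labels), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(labels)) if i not in held]
        model = train_fn([features[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct += sum(model(features[i]) == labels[i] for i in held_out)
    return correct / len(labels)


def majority_class_trainer(train_x, train_y):
    """Stand-in classifier (not a decision tree): always predict the commonest label."""
    majority = max(set(train_y), key=train_y.count)
    return lambda x: majority


# Toy labels (hypothetical): 70 deleterious (1) and 30 neutral (0) mutations.
toy_labels = [1] * 70 + [0] * 30
toy_features = [[0.0]] * 100
print(cross_validate(toy_features, toy_labels, majority_class_trainer))  # → 0.7
```

Any real classifier can be dropped in by supplying a `train_fn` that returns a callable predictor, and features can be added progressively by widening each row of `features` and re-running the loop.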

[Figure: Best Features]

2. Experimental Data Set
  Same as human data set but using experimental mutations for training and testing

[Figure: Evaluation of S-BLEST Using a Random Subset of the Experimental Training Set]

3. Cross-classification
  Used the same features described above
  Trained on one dataset and tested on the other:
    Human to experimental
    Experimental to human
    Experimental gene to exp. gene
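
Cross-classification, training on one dataset and testing on another, can be sketched as a loop over ordered dataset pairs. The Python below is only an illustration: the single-feature threshold classifier, the function names and the toy numbers are all hypothetical and are not the features, classifiers or data used in the project.

```python
from itertools import permutations


def accuracy(model, xs, ys):
    """Fraction of samples for which the model predicts the true label."""
    return sum(model(x) == y for x, y in zip(xs, ys)) / len(ys)


def cross_classify(datasets, train_fn):
    """Train on each dataset in turn and test on every other dataset."""
    results = {}
    for train_name, test_name in permutations(datasets, 2):
        train_x, train_y = datasets[train_name]
        test_x, test_y = datasets[test_name]
        model = train_fn(train_x, train_y)
        results[(train_name, test_name)] = accuracy(model, test_x, test_y)
    return results


def threshold_trainer(xs, ys):
    """Stand-in classifier: cut halfway between the two class means.

    Assumes one numeric feature per sample and that class 1 has the higher mean.
    """
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x > cut else 0


# Toy data (hypothetical): one score per mutation, label 1 = deleterious.
toy = {
    "human": ([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]),
    "experimental": ([0.3, 0.4, 0.6, 0.7], [0, 0, 1, 1]),
}
for pair, acc in cross_classify(toy, threshold_trainer).items():
    print(pair, acc)
```

The gene-to-gene case within the experimental set follows the same pattern, with one entry in `datasets` per gene.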

Summary of Results
  Human data set: 80% accuracy (up from 70%)
  Experimental data set: 87% accuracy (up from 79.5%)

Conclusion
  Prediction tools CAN identify deleterious mutations
  We believe that further study is warranted to identify over-fitted classifiers and so further improve classification accuracy on real-world data

Acknowledgements
  People
    Andrew Campen (CCBB IT, IUPUI)
    Brandon Peters (CCBB, IUPUI)
    Haixu Tang (Capstone Coordinator, IUB)
  Funding
    This work was funded by a grant from the Showalter Trust (Sean Mooney, PI), INGEN, and an IUPUI McNair Scholarship.
    The Indiana Genomics Initiative (INGEN) of Indiana University is supported in part by Lilly Endowment Inc.

Thank You
