Using a Mixture of Probabilistic Decision Trees for Direct Prediction of Protein Functions
Paper by Umar Syed and Golan Yona, Department of Computer Science, Cornell University


Using a Mixture of Probabilistic Decision Trees for Direct Prediction of Protein Functions
Paper by Umar Syed and Golan Yona, Department of Computer Science, Cornell University
Presentation by Andrejus Parfionovas, Department of Mathematics and Statistics, USU

2. Classical methods to predict the function of a new protein:
- Sequence comparison against known proteins in search of similarities. Sequences often diverge and become unrecognizable.
- Structure comparison against known structures in the PDB database. Structural data are sparse and not available for newly sequenced genes.

3. What other features can be used to improve prediction?
- Domain content
- Subcellular location
- Tissue specificity
- Species type
- Pairwise interactions
- Enzyme cofactors
- Catalytic activity
- Expression profiles, etc.

4. With so many features, it is important:
- To extract relevant information
  - Directly from the sequence
  - From the predicted secondary structure
  - From features extracted from databases
- To combine the data in a feasible model
  - A mixture model of Probabilistic Decision Trees (PDTs) was used

5. Features extracted directly from the sequence (as percentages):
- Composition of the 20 individual amino acids
- Composition of 16 amino acid groups (positively or negatively charged, polar, aromatic, hydrophobic, acidic, etc.)
- Frequencies of the 20 most informative dipeptides
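A minimal Python sketch of how such composition features could be computed from a raw sequence. The group definitions and the dipeptide list below are placeholders for illustration, not the paper's exact lists:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(seq):
    """Percentage of each of the 20 standard amino acids."""
    n = len(seq)
    return {aa: 100.0 * seq.count(aa) / n for aa in AMINO_ACIDS}

def group_composition(seq, groups):
    """Percentage of residues that fall into each amino-acid group."""
    n = len(seq)
    return {name: 100.0 * sum(seq.count(aa) for aa in members) / n
            for name, members in groups.items()}

def dipeptide_frequencies(seq, dipeptides):
    """Frequency of selected dipeptides among all overlapping residue pairs."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    return {dp: pairs.count(dp) / len(pairs) for dp in dipeptides}

# Example with hypothetical groups and dipeptides:
groups = {"positive": "KRH", "negative": "DE", "aromatic": "FWY"}
seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
features = {**aa_composition(seq),
            **group_composition(seq, groups),
            **dipeptide_frequencies(seq, ["AL", "LE", "KQ"])}
```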

6. Features predicted from the sequence:
- Secondary structure predicted by PSIPRED:
  - Coil
  - Helix
  - Strand

7. Features extracted from the SWISS-PROT database:
- Binary features (presence/absence)
  - Alternative products
  - Enzyme cofactors
  - Catalytic activity
- Nominal features
  - Tissue specificity (two different definitions)
  - Subcellular location
  - Organism and species classification
- Continuous features
  - Number of patterns exhibited by each protein (the "complexity" of the protein)
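As a small illustration (with hypothetical values, not taken from the paper), one protein's database features could be encoded as a single record in which None marks missing data, which is the form the later tree sketches consume:

```python
protein_features = {
    # binary features (presence/absence)
    "alternative_products": True,
    "enzyme_cofactors": False,
    "catalytic_activity": True,
    # nominal features
    "tissue_specificity": "liver",
    "subcellular_location": "cytoplasm",
    "organism": "Homo sapiens",
    # continuous feature
    "num_patterns": 4,
    # missing value
    "species_class": None,
}
```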

8. Mixture model of PDTs (Probabilistic Decision Trees):
- Can handle nominal data
- Robust to errors
- Missing data are allowed
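A minimal sketch of one way a tree can tolerate missing data, in the spirit of C4.5-style fractional instances; this is an assumption about the mechanism, not the paper's exact procedure. When an attribute value is missing, the example is passed down every branch, weighted by how often each branch was taken during training, and the resulting class distributions are mixed:

```python
class Node:
    def __init__(self, attribute=None, branches=None, branch_weights=None,
                 class_distribution=None):
        self.attribute = attribute                    # None for a leaf
        self.branches = branches or {}                # attribute value -> child Node
        self.branch_weights = branch_weights or {}    # value -> fraction of training data
        self.class_distribution = class_distribution or {}  # leaf: label -> probability

    def predict_proba(self, features):
        if self.attribute is None:                    # leaf node
            return dict(self.class_distribution)
        value = features.get(self.attribute)
        if value in self.branches:                    # value observed: follow one branch
            return self.branches[value].predict_proba(features)
        # Missing or unseen value: mix the predictions of all branches.
        mixed = {}
        for v, child in self.branches.items():
            w = self.branch_weights.get(v, 0.0)       # branch weights assumed to sum to 1
            for label, p in child.predict_proba(features).items():
                mixed[label] = mixed.get(label, 0.0) + w * p
        return mixed
```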

9. How do we select an attribute for a decision node?
- Use entropy to measure the impurity of a node
- The impurity must decrease after the split (information gain)
- Alternative measure: the López de Mántaras distance metric (less biased toward attributes with low split information)
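A minimal sketch (not the authors' implementation) of entropy-based attribute selection; `rows` are (feature-dict, class-label) pairs. The de Mántaras distance is not implemented here; it normalizes the shared information between the split and the class labels by their joint entropy:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(rows, attribute):
    """Reduction in class entropy obtained by splitting on `attribute`."""
    labels = [label for _, label in rows]
    before = entropy(labels)
    # Partition the rows by the attribute's value.
    partitions = {}
    for features, label in rows:
        partitions.setdefault(features[attribute], []).append(label)
    after = sum(len(part) / len(rows) * entropy(part)
                for part in partitions.values())
    return before - after

# The attribute with the highest gain (or, alternatively, the smallest
# de Mantaras distance) is chosen for the decision node.
```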

10. Enhancements of the algorithm:
- Dynamic attribute filtering
- Discretization of numerical features
- Multiple values for attributes
- Handling of missing attributes
- Binary splitting
- Leaf weighting
- Post-pruning
- 10-fold cross-validation (sketched below)
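A minimal sketch (an assumption, not the authors' code) of the 10-fold cross-validation step; `train` and `predict` stand in for whatever learner is being evaluated:

```python
import random

def cross_validate(rows, train, predict, k=10, seed=0):
    """Average accuracy over k train/test splits."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]          # k roughly equal folds
    accuracies = []
    for i in range(k):
        test = folds[i]
        training = [r for j, f in enumerate(folds) if j != i for r in f]
        model = train(training)
        correct = sum(predict(model, features) == label
                      for features, label in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k
```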

11. The probabilistic framework:
- At each node, the split attribute is selected at random with a probability that depends on its information gain
- The trees in the mixture are weighted by their performance
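A minimal sketch of one possible reading of these two points, not the paper's exact procedure: sample the split attribute with probability proportional to its information gain, and combine the trees' predictions weighted by their measured performance (reusing the `predict_proba` interface from the earlier node sketch):

```python
import random

def choose_attribute(gains, rng=random):
    """Sample an attribute with probability proportional to its information gain."""
    attrs, weights = zip(*[(a, max(g, 0.0)) for a, g in gains.items()])
    total = sum(weights)
    if total == 0:                      # no informative attribute: pick uniformly
        return rng.choice(attrs)
    return rng.choices(attrs, weights=weights, k=1)[0]

def mixture_predict(trees, tree_weights, features):
    """Weighted vote of several probabilistic decision trees."""
    scores = {}
    for tree, w in zip(trees, tree_weights):
        for label, p in tree.predict_proba(features).items():
            scores[label] = scores.get(label, 0.0) + w * p
    return max(scores, key=scores.get)
```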

12. Evaluation of the decision trees:
- Accuracy = (tp + tn) / total
- Sensitivity = tp / (tp + fn)
- Selectivity = tp / (tp + fp)
- Jensen-Shannon divergence score
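A minimal sketch of these measures. The Jensen-Shannon divergence below compares two discrete distributions; how the paper turns it into a per-prediction score is not shown on the slide:

```python
from math import log2

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def sensitivity(tp, fn):
    return tp / (tp + fn)

def selectivity(tp, fp):
    return tp / (tp + fp)              # also known as precision

def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions given as dicts."""
    def kl(a, b):
        return sum(a[k] * log2(a[k] / b[k]) for k in a if a[k] > 0)
    keys = set(p) | set(q)
    p = {k: p.get(k, 0.0) for k in keys}
    q = {k: q.get(k, 0.0) for k in keys}
    m = {k: 0.5 * (p[k] + q[k]) for k in keys}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```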

13. Handling skewed distributions (unequal class sizes):
- Re-weight cases by 1/(class size)
  - Increases impurity and the number of false positives
- Mixed entropy
  - Uses the average of the weighted and unweighted information gain to split and prune trees
- Interlaced entropy
  - Start with weighted samples, then switch to the unweighted entropy
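A minimal sketch, as an interpretation of the slide rather than the paper's exact formulas, of class-weighted entropy and the "mixed" gain. Each example carries a weight, e.g. 1/(size of its class):

```python
from collections import defaultdict
from math import log2

def weighted_entropy(labels, weights):
    """Entropy computed over weighted mass instead of raw counts."""
    mass = defaultdict(float)
    for label, w in zip(labels, weights):
        mass[label] += w
    total = sum(mass.values())
    return -sum((m / total) * log2(m / total) for m in mass.values() if m > 0)

def weighted_information_gain(rows, attribute, weights):
    """Information gain where each example counts with its weight."""
    labels = [label for _, label in rows]
    before = weighted_entropy(labels, weights)
    partitions = defaultdict(lambda: ([], []))
    for (features, label), w in zip(rows, weights):
        part = partitions[features[attribute]]
        part[0].append(label)
        part[1].append(w)
    total_w = sum(weights)
    after = sum(sum(ws) / total_w * weighted_entropy(ls, ws)
                for ls, ws in partitions.values())
    return before - after

def mixed_gain(rows, attribute, weights):
    """'Mixed entropy': average of the unweighted and weighted gains."""
    uniform = [1.0] * len(rows)
    return 0.5 * (weighted_information_gain(rows, attribute, uniform)
                  + weighted_information_gain(rows, attribute, weights))
```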

14. Model selection (simplification):
- Occam's razor: of two models with the same performance, choose the simpler one
- Bayesian approach: the most probable model is the one with the maximum posterior probability

15. Learning strategy optimization

Configuration                 Sensitivity (selectivity)   Accepted/rejected
Basic (C4.5)                  0.35                        initial
de Mantaras metric            0.36                        accepted
Binary branching              0.45                        accepted
Weighted entropy              0.56                        accepted
10-fold cross-validation      0.65                        accepted
20-fold cross-validation      0.64                        rejected
JS-based post-pruning         0.70                        accepted
Sen/sel post-pruning          0.68                        rejected
Weighted leaves               0.68                        rejected
Mixed entropy                 0.63                        rejected
Dipeptide information         0.73                        accepted
Probabilistic trees           0.81                        accepted

16. Pfam classification test (comparison to BLAST):
- PDT performance: 81%
- BLAST performance: 86%
- Main reasons for the gap:
  - Nodes remain impure because weighted entropy stops learning too early
  - Important branches are eliminated by post-pruning when the validation set is small

17. EC classification test (comparison to BLAST):
- PDT performance: 71% on average
- BLAST performance was often lower

18. Conclusions:
- Many protein families cannot be defined by sequence similarity alone
- The new method makes use of other features (structure, dipeptides, etc.)
- Besides classification, PDTs allow feature selection for further use
- Results are comparable to BLAST

19. Modifications and improvements:
- Use global optimization for pruning
- Use probabilities for attribute values
- Use boosting techniques (combine weighted trees)
- Use the Gini index to measure node impurity (see the sketch below)
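A minimal sketch of the Gini index mentioned as a possible alternative impurity measure (it is a suggested improvement, not part of the paper's method):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    counts = Counter(labels)
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

# Example: a pure node has impurity 0, a 50/50 binary node has 0.5.
assert gini(["a", "a", "a"]) == 0.0
assert gini(["a", "a", "b", "b"]) == 0.5
```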