Biological Data Mining (Predicting Post-synaptic Activity in Proteins)

Presentation transcript:

Biological Data Mining (Predicting Post-synaptic Activity in Proteins) Rashmi Raj (rashmi@cs.stanford.edu)

Protein Types: Enzymatic Proteins, Transport Proteins, Regulatory Proteins, Storage Proteins, Hormonal Proteins, Receptor Proteins # Proteins play crucial roles in almost every biological process. # A transport protein is a protein involved in the movement of a chemical across a biological membrane.

Synaptic Activity

Pre-Synaptic and Post-Synaptic Activity The cells are held together by cell adhesion. Neurotransmitters are stored in sacs called synaptic vesicles. Synaptic vesicles fuse with the pre-synaptic membrane and release their content into the synaptic cleft. Post-synaptic receptors recognize the neurotransmitters as a signal and become activated, then transmit the signal on to other signaling components. # Multiple types of proteins are expected to be found at these sites for the reception and propagation of signals. # Within the post-synaptic cell, the signaling apparatus is organized by various scaffolding proteins. Source - http://www.mun.ca/biology/desmid/brian/BIOL2060/BIOL2060-13/1319.jpg

Problem Predicting post-synaptic activity in proteins

Why Post-Synaptic Proteins? They are implicated in neurological disorders: Alzheimer's disease, Wilson's disease, etc.

Solution

What is Data Mining? Data mining is the process of searching large volumes of data for patterns, correlations and trends. Database → Data Mining → Patterns or Knowledge → Decision Support (science, business, web, government, etc.)

Market Basket Analysis An example of market basket transactions:

TID  Items
1    {Bread, Milk}
2    {Bread, Diapers, Beer, Eggs}
3    {Milk, Diapers, Beer, Cola}
4    {Bread, Milk, Diapers, Beer}
5    {Bread, Milk, Diapers, Cola}

Why is Data Mining Popular? Data Flood Bank, telecom, other business transactions ... Scientific Data: astronomy, biology, etc Web, text, and e-commerce

Why is Data Mining Popular? Limitations of human analysis: the inadequacy of the human brain when searching for complex multifactor dependencies in data

Tasks Solved by Data Mining Learning association rules Learning sequential patterns Classification Numeric prediction Clustering etc.

Decision Trees A decision tree is a predictive model. It takes as input an object or situation described by a set of properties (predictor attributes), and outputs a decision: the predicted class (here, yes/no)

Decision Tree Example Attributes: Outlook {sunny, overcast, rain}, Temperature {hot, mild, cool}, Humidity {high, normal}, Windy {true, false}; class attribute: Play? {Yes, No}. * The manager of a golf club wants to understand why people decide to play golf. He assumes that the weather must be an important underlying factor. For two weeks he recorded the weather conditions and whether people played golf or not. Now he wants to predict whether people will play golf tomorrow based on the weather forecast. He might use this data for optimizing staff availability.

Decision Tree Example

Outlook = sunny:
  Humidity = high → No
  Humidity = normal → Yes
Outlook = overcast → Yes
Outlook = rain:
  Windy = true → No
  Windy = false → Yes
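The tree above can be written directly as nested conditions; a minimal sketch (Windy is taken as a boolean, all other names follow the example):

```python
# A minimal sketch of the weather decision tree above as nested conditions.
def predict_play(outlook, humidity, windy):
    """Return 'Yes' or 'No': will people play golf in this weather?"""
    if outlook == "sunny":
        return "No" if humidity == "high" else "Yes"
    if outlook == "overcast":
        return "Yes"   # the overcast branch is pure: always Yes
    # outlook == "rain": the decision depends on the wind
    return "No" if windy else "Yes"

print(predict_play("sunny", "high", False))   # → No
print(predict_play("overcast", "hot", True))  # → Yes
print(predict_play("rain", "normal", True))   # → No
```

Note how each path from the root to a leaf becomes one chain of conditions; this is exactly the correspondence C4.5 exploits when it converts a tree into rules.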

Choosing the Splitting Attribute At each node, the available attributes are evaluated on the basis of how well they separate the classes of the training examples. A goodness function is used for this purpose. Typical goodness functions: information gain (ID3/C4.5), information gain ratio, gini index.
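One of the goodness functions listed above, the gini index, can be sketched on class-count lists (a minimal sketch, not tied to any particular dataset):

```python
# Gini index of a node, from its class counts (e.g. [number of Yes, number of No]).
# 0 means a pure node; higher values mean more class mixing.
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

print(gini([4, 0]))            # → 0.0  (pure node, all one class)
print(round(gini([2, 3]), 2))  # → 0.48 (mixed node: 1 - (0.4**2 + 0.6**2))
```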

Which Attributes to Select?

A Criterion for Attribute Selection Which is the best attribute? The one that will result in the smallest tree. Heuristic: choose the attribute that produces the “purest” nodes (e.g. the Outlook = overcast branch is pure: all Yes).

Information Gain Popular impurity criterion: information gain. Information gain increases with the average purity of the subsets that an attribute produces. Strategy: choose the attribute that results in the greatest information gain.

Computing Information Information is measured in bits. Given a probability distribution, the info required to predict an event is the distribution’s entropy. Formula for computing the entropy: entropy(p1, p2, …, pn) = −p1 log2(p1) − p2 log2(p2) − … − pn log2(pn)
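The entropy formula above translates line for line into code (a minimal sketch; probabilities of zero are skipped, since 0 · log2 0 is taken as 0):

```python
from math import log2

# entropy(p1, ..., pn) = -p1*log2(p1) - ... - pn*log2(pn), measured in bits
def entropy(probabilities):
    return -sum(p * log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))              # → 1.0 (a fair coin takes one bit to predict)
print(round(entropy([9/14, 5/14]), 2))  # → 0.94 (the Play? class of the weather data)
```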

Computing Information Gain gain = (information before split) − (information after split). Information gain for the attributes of the weather data: gain(Outlook) = 0.247 bits, gain(Humidity) = 0.152 bits, gain(Windy) = 0.048 bits, gain(Temperature) = 0.029 bits.
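The Outlook gain can be reproduced from the class counts of the standard 14-instance weather data (9 Yes / 5 No overall); a sketch:

```python
from math import log2

def entropy(counts):
    """Entropy in bits, from a list of class counts."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

info_before = entropy([9, 5])  # class distribution before any split
# Outlook subsets: sunny → [2 Yes, 3 No], overcast → [4 Yes, 0 No], rain → [3 Yes, 2 No]
subsets = [[2, 3], [4, 0], [3, 2]]
info_after = sum(sum(s) / 14 * entropy(s) for s in subsets)

print(round(info_before - info_after, 3))  # → 0.247
```

The pure overcast subset contributes zero entropy, which is why Outlook wins over the other three attributes.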

Continuing to Split

The Final Decision Tree Splitting stops when data can’t be split any further

UniProt [the Universal Protein Resource] Central repository of protein sequence and function data, created by joining the information contained in Swiss-Prot, TrEMBL and PIR.

PROSITE Database of protein families and domains. Prosite consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs.

Predicting Post-synaptic Activity in Proteins Classes: positive examples = {proteins with post-synaptic activity}; negative examples = {proteins without post-synaptic activity}. Predictor attributes: Prosite patterns.

First Phase Select positive and negative examples: carefully select relevant proteins from the UniProt database.

Positive Examples Query-1: post-synaptic AND !toxin. All species – to maximize the number of examples in the data to be mined. !toxin – several entries in UniProt/SwissProt refer to the toxin alpha-latrotoxin.

Negative Examples
Query-2: heart AND !(results of query 1)
Query-3: cardiac AND !(results of queries 1, 2)
Query-4: liver AND !(results of queries 1, 2, 3)
Query-5: hepatic AND !(results of queries 1, 2, 3, 4)
Query-6: kidney AND !(results of queries 1, 2, 3, 4, 5)

Second Phase Generate the predictor attributes: use the links from the UniProt entries to the Prosite database to create the predictor attributes.

Predictor Attributes Should facilitate easy interpretation by biologists and must have good predictive power.

Generating the Predictor Attributes Keep only Prosite patterns (NOT profiles) that occurred in at least two proteins. Proteins that do not have a Prosite entry are removed from the dataset.

Data Mining Algorithm C4.5 rules induction algorithm (Quinlan, 1993). Generates easily interpretable classification rules of the form IF (condition) THEN (class), applied as an ordered IF … THEN … ELSE IF … list. This kind of rule has an intuitive meaning.
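Applying such rules amounts to walking an ordered list until a condition fires; a hypothetical sketch (the Prosite pattern IDs and the rules themselves are made up for illustration, not the paper's actual output):

```python
# Each rule: (condition over a protein's set of Prosite patterns, predicted class).
# The pattern IDs and rules below are illustrative placeholders.
rules = [
    (lambda patterns: "PS00001" in patterns, "post-synaptic"),
    (lambda patterns: "PS00002" in patterns and "PS00003" in patterns, "post-synaptic"),
]
DEFAULT = "non-post-synaptic"

def classify(patterns):
    for condition, label in rules:
        if condition(patterns):   # first matching rule wins
            return label
    return DEFAULT                # no rule fired: fall back to the default class

print(classify({"PS00001", "PS00009"}))  # → post-synaptic
print(classify({"PS00002"}))             # → non-post-synaptic
```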

Classification Rules

Predictive Accuracy Well-known 10-fold cross-validation procedure (Witten and Frank 2000). Divide the dataset into 10 partitions with approximately the same number of examples (proteins). In the i-th iteration, i = 1, 2, …, 10, use the i-th partition as the test set and the other 9 partitions as the training set.
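The procedure above can be sketched as a plain loop (a minimal sketch on a toy dataset; the real study's examples are proteins):

```python
# 10-fold cross-validation: each example is used for testing exactly once.
def k_fold_splits(examples, k=10):
    folds = [examples[i::k] for i in range(k)]  # k partitions of ~equal size
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

data = list(range(50))  # toy stand-in for the protein dataset
tested = []
for train, test in k_fold_splits(data):
    assert len(train) + len(test) == len(data)  # the folds partition the data
    tested.extend(test)
print(sorted(tested) == data)  # → True: every example tested exactly once
```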

Results Sensitivity: TPR = TP / (TP + FN); average value over the 10 iterations of the cross-validation procedure = 0.85. Specificity: TNR = TN / (TN + FP); average value over the 10 iterations of the cross-validation procedure = 0.98.
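Both metrics are simple confusion-matrix ratios; a sketch with illustrative counts chosen to reproduce the reported averages (not the paper's actual confusion matrix):

```python
def sensitivity(tp, fn):
    """True positive rate: fraction of actual positives correctly predicted."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: fraction of actual negatives correctly predicted."""
    return tn / (tn + fp)

# Illustrative counts only: 17 of 20 positives and 98 of 100 negatives recovered.
print(sensitivity(tp=17, fn=3))  # → 0.85
print(specificity(tn=98, fp=2))  # → 0.98
```

A high specificity with a lower sensitivity is the expected shape here, since the negative class (non-post-synaptic proteins) dominates the dataset.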

Questions?