Biological Data Mining (Predicting Post-synaptic Activity in Proteins)
Rashmi Raj
Protein Types
Enzymatic proteins, transport proteins, regulatory proteins, storage proteins, hormonal proteins, receptor proteins.
Proteins play crucial roles in almost every biological process. A transport protein is a protein involved in the movement of a chemical across a biological membrane.
Synaptic Activity
Pre-Synaptic and Post-Synaptic Activity
The cells are held together by cell adhesion. Neurotransmitters are stored in sacs called synaptic vesicles. The synaptic vesicles fuse with the pre-synaptic membrane and release their contents into the synaptic cleft. Post-synaptic receptors recognize the neurotransmitters as a signal and become activated; they then transmit the signal on to other signaling components. Multiple types of proteins are expected to be found at these sites for the reception and propagation of signals. Within the post-synaptic cell, the signaling apparatus is organized by various scaffolding proteins.
Problem: predicting post-synaptic activity in proteins
Why Post-Synaptic Proteins?
Neurological disorders: Alzheimer's disease, Wilson's disease, etc.
Solution
What is Data Mining?
Data mining is the process of searching large volumes of data for patterns, correlations, and trends.
Database → Data Mining → Patterns or Knowledge → Decision Support (science, business, web, government, etc.)
Market Basket Analysis
An example of market basket transactions:
TID 1: {Bread, Milk}
TID 2: {Bread, Diapers, Beer, Eggs}
TID 3: {Milk, Diapers, Beer, Cola}
TID 4: {Bread, Milk, Diapers, Beer}
TID 5: {Bread, Milk, Diapers, Cola}
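The two standard measures over such transactions are support and confidence. A minimal sketch, computed directly on the five example transactions above:

```python
# Support and confidence for an association rule, computed over the
# five example transactions from the slide.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of the combined itemset divided by support of the antecedent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# Rule {Diapers} -> {Beer}: appears together in 3 of 5 transactions,
# and Diapers appears in 4, so confidence is 3/4.
print(support({"Diapers", "Beer"}, transactions))       # 0.6
print(confidence({"Diapers"}, {"Beer"}, transactions))  # ≈ 0.75
```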
Why is Data Mining Popular?
Data flood: bank, telecom, and other business transactions; scientific data (astronomy, biology, etc.); web, text, and e-commerce data.
Why is Data Mining Popular?
Limitations of human analysis: the inadequacy of the human brain when searching for complex multifactor dependencies in data.
Tasks Solved by Data Mining
Learning association rules, learning sequential patterns, classification, numeric prediction, clustering, etc.
Decision Trees
A decision tree is a predictive model: it takes as input an object or situation described by a set of properties (predictive attributes) and outputs a yes/no decision (class).
Decision Tree Example
Attributes: Outlook (sunny, overcast, rain), Temperature (hot, mild, cool), Humidity (high, normal), Windy (true, false); class: Play? (Yes/No).
The manager of a golf club wants to understand why people decide to play golf. He assumes that the weather must be an important underlying factor. For two weeks he recorded the weather conditions and whether people played golf. Now he wants to predict whether people will play golf tomorrow based on the weather forecast; he might use this data to optimize staff availability.
Decision Tree Example
Outlook = sunny → Humidity: high → No; normal → Yes
Outlook = overcast → Yes
Outlook = rain → Windy: true → No; false → Yes
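The learned tree can be written down directly as a nested conditional. A minimal sketch, with attribute values as plain strings:

```python
def play_golf(outlook, humidity, windy):
    """Classify a day as 'Yes'/'No' with the tree from the slide:
    the root tests Outlook; sunny days are decided by Humidity,
    rainy days by Windy, and overcast days are always 'Yes'."""
    if outlook == "sunny":
        return "Yes" if humidity == "normal" else "No"
    if outlook == "overcast":
        return "Yes"
    if outlook == "rain":
        return "No" if windy else "Yes"
    raise ValueError(f"unknown outlook: {outlook}")

print(play_golf("sunny", "high", False))    # No
print(play_golf("overcast", "high", True))  # Yes
print(play_golf("rain", "normal", True))    # No
```

Note that Temperature never appears in the tree: the split criterion found it unhelpful for separating the classes.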
Choosing the Splitting Attribute
At each node, the available attributes are evaluated on how well they separate the classes of the training examples. A goodness function is used for this purpose. Typical goodness functions: information gain (ID3/C4.5), information gain ratio, and the Gini index.
Which Attributes to Select?
A Criterion for Attribute Selection
Which is the best attribute? The one that will result in the smallest tree. Heuristic: choose the attribute that produces the "purest" nodes.
Information Gain
A popular impurity criterion is information gain: the increase in the average purity of the subsets that an attribute produces. Strategy: choose the attribute that results in the greatest information gain.
Computing Information
Information is measured in bits. Given a probability distribution, the information required to predict an event is the distribution's entropy. Formula for computing the entropy: entropy(p1, p2, …, pn) = −p1 log2(p1) − p2 log2(p2) − … − pn log2(pn)
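The entropy formula can be computed from raw class counts. A minimal sketch, using the 9-Yes/5-No class distribution of the standard 14-day weather data:

```python
import math

def entropy(counts):
    """Entropy in bits of a distribution given as raw class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Weather data: 9 'Yes' days and 5 'No' days before any split.
print(f"{entropy([9, 5]):.3f}")  # 0.940
```

A pure node (all examples in one class) has entropy 0; a 50/50 split has the maximum entropy of 1 bit.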
Computing Information Gain
gain = (information before split) − (information after split)
Information gain for the attributes of the weather data: gain(Outlook) = 0.247 bits, gain(Humidity) = 0.152 bits, gain(Windy) = 0.048 bits, gain(Temperature) = 0.029 bits.
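The Outlook gain, for instance, follows from the entropy before the split and the size-weighted entropies of the subsets it produces. A minimal sketch on the 14-day weather data:

```python
import math

def entropy(counts):
    """Entropy in bits of a distribution given as raw class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def info_gain(before, subsets):
    """Entropy of the parent node minus the size-weighted entropy of its children."""
    total = sum(before)
    after = sum(sum(s) / total * entropy(s) for s in subsets)
    return entropy(before) - after

# Splitting the 9-Yes/5-No weather data on Outlook gives the subsets
# sunny = [2 Yes, 3 No], overcast = [4 Yes, 0 No], rain = [3 Yes, 2 No].
gain = info_gain([9, 5], [[2, 3], [4, 0], [3, 2]])
print(f"{gain:.3f}")  # 0.247
```

Outlook wins because its overcast branch is completely pure, which no other attribute achieves.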
Continuing to Split
The Final Decision Tree
Splitting stops when data can’t be split any further
UniProt (the Universal Protein Resource)
A central repository of protein sequence and function, created by joining the information contained in Swiss-Prot, TrEMBL, and PIR.
PROSITE
A database of protein families and domains. PROSITE consists of biologically significant sites, patterns, and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs.
Predicting post-synaptic activity in proteins
Classes: positive examples = {proteins with post-synaptic activity}; negative examples = {proteins without post-synaptic activity}. Predictor attributes: PROSITE patterns.
First Phase: Select Positive and Negative Examples
Carefully select relevant proteins from the UniProt database as positive and negative examples.
Positive Examples
Query 1: post-synaptic AND !toxin
All species: to maximize the number of examples in the data to be mined. !toxin: several entries in UniProt/Swiss-Prot refer to the toxin alpha-latrotoxin.
Negative Examples
Query 2: heart AND !(result of query 1)
Query 3: cardiac AND !(results of queries 1, 2)
Query 4: liver AND !(results of queries 1, 2, 3)
Query 5: hepatic AND !(results of queries 1, 2, 3, 4)
Query 6: kidney AND !(results of queries 1, 2, 3, 4, 5)
Second Phase: Generate the Predictor Attributes
Use the cross-references from UniProt to the PROSITE database to create the predictor attributes.
Predictor Attributes
Should facilitate easy interpretation by biologists and must have good predictive power.
Generating the Predictor Attributes
For each protein: does it have a PROSITE entry? If no, remove it from the dataset. If yes, keep its PROSITE patterns (not profiles) that occur in at least two proteins.
Data Mining Algorithm
C4.5 rules induction algorithm (Quinlan, 1993). Generates easily interpretable classification rules of the form IF (condition) THEN (class). This kind of rule has the intuitive meaning: if a protein satisfies the condition, predict the class.
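A rules-induction algorithm like C4.5rules yields an ordered list of IF-THEN rules plus a default class; applying such a list to a protein's set of PROSITE patterns can be sketched as below. The patterns and rules here are illustrative placeholders, not the rules actually induced in this work:

```python
# Applying an ordered IF (condition) THEN (class) rule list.
# The patterns below are hypothetical placeholders, not the rules
# actually induced from the UniProt/PROSITE dataset.
rules = [
    (lambda patterns: "PS00237" in patterns, "positive"),  # hypothetical pattern
    (lambda patterns: "PS00001" in patterns, "negative"),  # hypothetical pattern
]
DEFAULT_CLASS = "negative"

def classify(patterns):
    """Return the class of the first rule whose condition the protein satisfies,
    or the default class if no rule fires."""
    for condition, cls in rules:
        if condition(patterns):
            return cls
    return DEFAULT_CLASS

print(classify({"PS00237", "PS00342"}))  # positive
print(classify(set()))                   # negative
```

Because the rules are ordered, a protein matching several conditions is classified by the first rule only, which keeps the rule list easy for biologists to read.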
Classification Rules
Predictive Accuracy
Well-known 10-fold cross-validation procedure (Witten and Frank, 2000): divide the dataset into 10 partitions with approximately the same number of examples (proteins); in the i-th iteration, i = 1, 2, …, 10, use the i-th partition as the test set and the other 9 partitions as the training set.
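The partitioning step can be written with plain index arithmetic. A minimal sketch, not the implementation actually used (the dataset size below is a made-up example):

```python
def ten_fold_splits(n_examples, k=10):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation:
    each example lands in exactly one test fold, and folds have
    approximately equal size."""
    indices = list(range(n_examples))
    folds = [indices[i::k] for i in range(k)]  # round-robin assignment
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(ten_fold_splits(200))  # hypothetical dataset of 200 proteins
print(len(splits))                   # 10
print(sum(len(test) for _, test in splits))  # 200: every protein tested once
```

Averaging the accuracy over the 10 test folds gives an estimate of performance on unseen proteins.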
Results
Sensitivity: TPR = TP / (TP + FN); average value over the 10 iterations of the cross-validation procedure = 0.85
Specificity: TNR = TN / (TN + FP); average value over the 10 iterations of the cross-validation procedure = 0.98
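Both measures come straight from the confusion-matrix counts of one test fold. A minimal sketch; the counts below are illustrative, not the paper's actual numbers:

```python
def sensitivity(tp, fn):
    """TPR: fraction of truly positive proteins predicted positive."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """TNR: fraction of truly negative proteins predicted negative."""
    return tn / (tn + fp)

# Illustrative counts for one cross-validation iteration (not from the paper).
tp, fn, tn, fp = 17, 3, 98, 2
print(sensitivity(tp, fn))  # 0.85
print(specificity(tn, fp))  # 0.98
```

The high specificity reflects that negative proteins (heart, liver, kidney, etc.) rarely match the post-synaptic rules; sensitivity is lower because some post-synaptic proteins carry no discriminating PROSITE pattern.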
Questions?