Biological Data Mining (Predicting Post-synaptic Activity in Proteins)


1 Biological Data Mining (Predicting Post-synaptic Activity in Proteins)
Rashmi Raj

2 Proteins
Enzymatic Proteins
Transport Proteins
Regulatory Proteins
Storage Proteins
Hormonal Proteins
Receptor Proteins
# Proteins play crucial roles in almost every biological process. # A transport protein is a protein involved in the movement of a chemical across a biological membrane.

3 Synaptic Activity

4 Pre-Synaptic and Post-Synaptic Activity
The cells are held together by cell adhesion.
Neurotransmitters are stored in sacs called synaptic vesicles.
Synaptic vesicles fuse with the pre-synaptic membrane and release their contents into the synaptic cleft.
Post-synaptic receptors recognize them as a signal and become activated, then transmit the signal on to other signaling components.
# Multiple types of proteins are expected to be found at these sites for reception and propagation of signals. # Within the post-synaptic cell, the signaling apparatus is organized by various scaffolding proteins.

5 Problem Predicting post-synaptic activity in proteins

6 Why Post-Synaptic Proteins?
Post-synaptic proteins are implicated in neurological disorders: Alzheimer's disease, Wilson's disease, etc.

7 Solution

8 What is Data Mining?
Data mining is the process of searching large volumes of data for patterns, correlations, and trends.
Database → Data Mining → Patterns or Knowledge → Decision Support (science, business, web, government, etc.)

9 Market Basket Analysis
An example of market basket transactions:
TID  Items
1    {Bread, Milk}
2    {Bread, Diapers, Beer, Eggs}
3    {Milk, Diapers, Beer, Cola}
4    {Bread, Milk, Diapers, Beer}
5    {Bread, Milk, Diapers, Cola}
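Association-rule mining over baskets like these starts from itemset support. A minimal sketch of computing support and confidence for one rule, using the five transactions from the slide (the rule {Diapers} → {Beer} is chosen here purely for illustration):

```python
# The five transactions from the slide, keyed by TID.
transactions = {
    1: {"Bread", "Milk"},
    2: {"Bread", "Diapers", "Beer", "Eggs"},
    3: {"Milk", "Diapers", "Beer", "Cola"},
    4: {"Bread", "Milk", "Diapers", "Beer"},
    5: {"Bread", "Milk", "Diapers", "Cola"},
}

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions.values() if itemset <= t)
    return hits / len(transactions)

# Rule {Diapers} -> {Beer}: support of the whole itemset,
# divided by support of the antecedent, gives the confidence.
sup_rule = support({"Diapers", "Beer"}, transactions)      # 3/5 = 0.6
conf_rule = sup_rule / support({"Diapers"}, transactions)  # 3 of 4 diaper baskets also contain beer
```

Rule-learning algorithms such as Apriori enumerate itemsets like this, pruning those below a minimum support threshold.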

10 Why is Data Mining Popular?
Data Flood Bank, telecom, other business transactions ... Scientific Data: astronomy, biology, etc Web, text, and e-commerce

11 Why is Data Mining Popular?
Limitations of human analysis: the inadequacy of the human brain when searching for complex multifactor dependencies in data.

12 Tasks Solved by Data Mining
Learning association rules Learning sequential patterns Classification Numeric prediction Clustering etc.

13 Decision Trees
A decision tree is a predictive model.
It takes as input an object or situation described by a set of attributes (predictor attributes) and outputs a decision (class), e.g. yes/no.

14 Decision Tree Example
Attributes: Outlook {sunny, overcast, rain}, Temperature {hot, mild, cool}, Humidity {high, normal}, Windy {true, false}; class: Play? {Yes, No}.
* The manager of a golf club wants to understand why people decide to play golf. He assumes that weather must be an important underlying factor. For two weeks he recorded the weather conditions and whether people played golf or not. Now he wants to predict whether people will play golf tomorrow based on the weather forecast; he might use this data to optimize staff availability.

15 Decision Tree Example
Outlook = sunny → test Humidity: high → No; normal → Yes
Outlook = overcast → Yes
Outlook = rain → test Windy: true → No; false → Yes
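The fitted tree is just nested conditionals; a sketch of the tree above written out directly as code:

```python
def play_golf(outlook, humidity, windy):
    """The slide's decision tree as nested conditionals.
    outlook in {"sunny", "overcast", "rain"},
    humidity in {"high", "normal"}, windy is a bool."""
    if outlook == "sunny":
        return "No" if humidity == "high" else "Yes"
    if outlook == "overcast":
        return "Yes"
    # outlook == "rain": decision depends only on wind
    return "No" if windy else "Yes"
```

Note that Temperature does not appear: the induction algorithm found it carried no additional information once Outlook, Humidity, and Windy were tested.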

16 Choosing the Splitting Attribute
At each node, the available attributes are evaluated on the basis of how well they separate the classes of the training examples. A goodness function is used for this purpose. Typical goodness functions: information gain (ID3/C4.5), information gain ratio, Gini index.

17 Which Attributes to Select?

18 A Criterion for Attribute Selection
Which is the best attribute? The one that will result in the smallest tree. Heuristic: choose the attribute that produces the "purest" nodes.

19 Information Gain
Popular impurity criterion: information gain. Strategy: choose the attribute that results in the greatest information gain. Information gain increases with the average purity of the subsets that an attribute produces.

20 Computing Information
Information is measured in bits. Given a probability distribution, the information required to predict an event is the distribution's entropy. Formula for computing the entropy:
entropy(p₁, …, pₙ) = −p₁ log₂ p₁ − p₂ log₂ p₂ − … − pₙ log₂ pₙ
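The entropy formula translates directly into a few lines of code; a minimal sketch (zero-probability terms are skipped, since lim p→0 of p log p is 0):

```python
import math

def entropy(probs):
    """Entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

entropy([0.5, 0.5])   # a fair coin needs exactly 1 bit
entropy([1.0])        # a certain outcome carries no information: 0 bits
```

A skewed distribution such as [0.9, 0.1] falls between these extremes, at about 0.47 bits.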

21 Computing Information Gain
gain = (information before split) − (information after split)
Information gain for the attributes of the weather data:
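The "before minus after" computation can be sketched as follows, assuming the standard 14-example weather dataset (Witten and Frank) that these slides draw on; only the (Outlook, Play?) columns are needed to score the Outlook attribute:

```python
import math
from collections import Counter

# (outlook, play) pairs from the standard 14-example weather dataset.
data = [
    ("sunny", "No"), ("sunny", "No"), ("overcast", "Yes"), ("rain", "Yes"),
    ("rain", "Yes"), ("rain", "No"), ("overcast", "Yes"), ("sunny", "No"),
    ("sunny", "Yes"), ("rain", "Yes"), ("sunny", "Yes"), ("overcast", "Yes"),
    ("overcast", "Yes"), ("rain", "No"),
]

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def information_gain(data):
    """(info before split) minus the size-weighted info of each subset."""
    before = entropy([label for _, label in data])
    after = 0.0
    for value in {v for v, _ in data}:
        subset = [label for v, label in data if v == value]
        after += len(subset) / len(data) * entropy(subset)
    return before - after

round(information_gain(data), 3)   # gain(Outlook) ≈ 0.247 bits
```

Outlook scores highest among the four attributes on this dataset, which is why it becomes the root of the tree on the next slides.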

22 Continuing to Split

23 The Final Decision Tree
Splitting stops when data can’t be split any further

24 UniProt [the Universal Protein Resource]
Central repository of protein sequence and function data, created by joining the information contained in Swiss-Prot, TrEMBL, and PIR.

25 PROSITE
Database of protein families and domains. PROSITE consists of biologically significant sites, patterns, and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs.

26 Predicting Post-synaptic Activity in Proteins
Classes: positive examples = {proteins with post-synaptic activity}; negative examples = {proteins without post-synaptic activity}.
Predictor attributes: Prosite patterns.

27 First Phase: Selecting Positive and Negative Examples
Carefully select relevant proteins from the UniProt database.

28 Positive Examples
Query 1: post-synaptic AND !toxin
All species – to maximize the number of examples in the data to be mined.
!toxin – several entries in UniProt/Swiss-Prot refer to the toxin alpha-latrotoxin.

29 Negative Examples
Query 2: heart !(result of query 1)
Query 3: cardiac !(results of queries 1, 2)
Query 4: liver !(results of queries 1, 2, 3)
Query 5: hepatic !(results of queries 1, 2, 3, 4)
Query 6: kidney !(results of queries 1, 2, 3, 4, 5)
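The successive "!(results of previous queries)" exclusions amount to set differences over accession IDs. A sketch with hypothetical accession sets standing in for the real query results (the actual UniProt queries return far more entries):

```python
# Hypothetical accession IDs: positives from query 1, then the raw
# hits of queries 2-6 before any exclusion is applied.
positives = {"P1", "P2"}                 # query 1: post-synaptic AND !toxin
raw_queries = [
    {"P2", "Q1", "Q2"},                  # query 2: heart
    {"Q2", "Q3"},                        # query 3: cardiac
    {"P1", "Q4"},                        # query 4: liver
    {"Q5"},                              # query 5: hepatic
    {"Q3", "Q6"},                        # query 6: kidney
]

# Each query excludes everything matched by query 1 and by the queries
# before it, so the negative set never overlaps the positives and no
# protein is counted twice.
negatives = set()
seen = set(positives)
for result in raw_queries:
    negatives |= result - seen
    seen |= result
```

The ordering matters only for bookkeeping; the union of negatives is the same however the keyword queries are sequenced, as long as the positives are always excluded.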

30 Second Phase: Generating the Predictor Attributes
Use the cross-references from UniProt to the Prosite database to create the predictor attributes.

31 Predictor Attributes
Should facilitate easy interpretation by biologists and must have good predictive power.

32 Generating the Predictor Attributes
For each protein, check: does it have a Prosite entry? If NO, remove it from the dataset.
Keep only Prosite patterns (NOT profiles) that occur in at least two proteins.
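The two filtering steps above can be sketched as code; the protein-to-Prosite mapping here is hypothetical, standing in for the UniProt cross-references:

```python
from collections import Counter

# Hypothetical mapping from protein accession to its Prosite matches;
# each match is an (id, kind) pair, kind being "pattern" or "profile".
prosite_hits = {
    "P1": [("PS001", "pattern"), ("PS900", "profile")],
    "P2": [("PS001", "pattern"), ("PS002", "pattern")],
    "P3": [],                                  # no Prosite entry at all
    "P4": [("PS002", "pattern")],
}

# Step 1: remove proteins that have no Prosite entry.
with_entry = {p: hits for p, hits in prosite_hits.items() if hits}

# Step 2: keep only patterns (not profiles) occurring in >= 2 proteins.
counts = Counter(ps for hits in with_entry.values()
                 for ps, kind in hits if kind == "pattern")
predictor_attributes = sorted(ps for ps, n in counts.items() if n >= 2)
```

Each surviving pattern becomes one boolean predictor attribute: does this protein match the pattern or not.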

33 Data Mining Algorithm
C4.5 rule induction algorithm (Quinlan, 1993).
Generates easily interpretable classification rules of the form IF (condition) THEN (class); an ordered rule list has the intuitive meaning IF … THEN … ELSE IF … THEN …
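Applying such an ordered rule list is a first-match search; a minimal sketch with hypothetical rules (the pattern IDs and classes are illustrative, not the rules the study actually induced):

```python
# Hypothetical ordered rule list in the IF/ELSE-IF form that rule
# induction produces: (required Prosite patterns, predicted class),
# with a default class when no rule fires.
rules = [
    ({"PS001"}, "post-synaptic"),
    ({"PS002", "PS003"}, "post-synaptic"),
]
DEFAULT = "non-post-synaptic"

def classify(protein_patterns, rules, default=DEFAULT):
    """Return the class of the first rule whose conditions all hold."""
    for required, cls in rules:
        if required <= protein_patterns:   # all required patterns present
            return cls
    return default
```

Because each rule is a readable conjunction of Prosite patterns, a biologist can inspect exactly why a protein was classified as post-synaptic.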

34 Classification Rules

35 Predictive Accuracy
Well-known 10-fold cross-validation procedure (Witten and Frank, 2000):
Divide the dataset into 10 partitions with approximately the same number of examples (proteins).
In the i-th iteration, i = 1, 2, …, 10, use the i-th partition as the test set and the other 9 partitions as the training set.
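The partitioning scheme can be sketched in a few lines (round-robin assignment here is one simple way to get near-equal folds; the study may well have shuffled or stratified instead):

```python
def k_fold_splits(examples, k=10):
    """Deal examples into k near-equal folds round-robin, then yield
    (train, test) pairs: fold i is the test set, the rest the train set."""
    folds = [examples[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# 100 dummy examples -> 10 iterations, each testing on a distinct tenth.
splits = list(k_fold_splits(list(range(100)), k=10))
```

Every example appears in exactly one test set across the 10 iterations, so the averaged accuracy uses each protein for evaluation exactly once.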

36 Results
Sensitivity: TPR = TP/(TP+FN); average value over the 10 iterations of the cross-validation procedure = 0.85.
Specificity: TNR = TN/(TN+FP); average value over the 10 iterations of the cross-validation procedure = 0.98.
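These two rates come straight from the confusion-matrix counts; a sketch with hypothetical per-fold counts (chosen only to illustrate the arithmetic, not taken from the paper):

```python
def sensitivity(tp, fn):
    """True positive rate: fraction of post-synaptic proteins found."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: fraction of non-post-synaptic proteins rejected."""
    return tn / (tn + fp)

# Hypothetical counts from one fold: 17 of 20 positives recovered,
# 98 of 100 negatives correctly rejected.
sensitivity(17, 3)    # 0.85
specificity(98, 2)    # 0.98
```

High specificity with somewhat lower sensitivity, as reported here, means the rules rarely mislabel a non-post-synaptic protein but miss a minority of true post-synaptic ones.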

37 Questions?

