1
Techniques for Finding Patterns in Large Amounts of Data: Applications in Biology
Vipin Kumar, William Norris Professor and Head, Department of Computer Science
2
What is Data Mining? Many Definitions
Non-trivial extraction of implicit, previously unknown, and potentially useful information from data. Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns.
3
Why Data Mining? Commercial Viewpoint
Lots of data is being collected and warehoused: web data (Yahoo! collects 10 GB/hour), purchases at department/grocery stores (Walmart records 20 million transactions per day), bank/credit card transactions. Computers have become cheaper and more powerful. Competitive pressure is strong: provide better, customized services for an edge (e.g., in Customer Relationship Management).
4
Why Data Mining? Scientific Viewpoint
Data is collected and stored at enormous speeds (GB/hour): remote sensors on satellites (NASA EOSDIS archives over 1 petabyte of earth science data per year), telescopes scanning the skies (sky survey data), gene expression data, scientific simulations (terabytes of data generated in a few hours). Data mining may help scientists in automated analysis of massive data sets and in hypothesis formation.
5
Data Mining for Biomedical Informatics
Recent technological advances are helping to generate large amounts of both medical and genomic data: high-throughput experiments/techniques, gene and protein sequences, gene-expression data, biological networks and phylogenetic profiles, Electronic Medical Records (an IBM-Mayo Clinic partnership has created a database of 5 million patients), and Single Nucleotide Polymorphisms (SNPs). Data mining offers a potential solution for the analysis of large-scale data: automated analysis of patient histories for customized treatment, prediction of the functions of anonymous genes, and identification of putative binding sites in protein structures for drug/chemical discovery. Figure: a protein interaction network.
6
Origins of Data Mining
Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems. Traditional techniques may be unsuitable due to the enormity of the data, its high dimensionality, and its heterogeneous, distributed nature. Figure: data mining shown at the intersection of statistics/AI, machine learning/pattern recognition, and database systems.
7
Data Mining Tasks...
Figure: four data mining tasks arranged around a central table of data: Clustering, Predictive Modeling, Anomaly Detection, and Association Rules.
8
How can data mining help biologists?
Data mining is particularly effective when the pattern/format of the final knowledge can be presumed, which is common in biology: protein complexes (clustering and association patterns), gene regulatory networks (predictive models), protein structure/function (predictive models), motifs (association patterns). We will look at two examples: clustering of ESTs and microarray data, and identifying protein functional modules from protein complexes.
9
Predictive Modeling: Classification
Find a model for the class attribute as a function of the values of the other attributes. Figure: a decision-tree model for predicting credit worthiness that tests Employed (Yes/No), Education (Graduate vs. High school/Undergrad), and Number of years (> 7 yrs vs. < 7 yrs) to assign the class.
10
General Approach for Building Classification Model
Figure: a labeled training set (with categorical and quantitative attributes plus a class attribute) is used to learn a classifier; the induced model is then applied to a test set.
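As a rough illustration of this workflow (not part of the original slides), here is a minimal scikit-learn sketch that learns a model from a labeled training set and applies it to a held-out test set; the tiny dataset and attribute encoding are invented for the example.

```python
# Illustrative sketch of the train -> learn -> test workflow (scikit-learn).
# The tiny "credit worthiness" dataset below is invented for illustration.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Columns: employed (0/1), graduate (0/1), years of employment (quantitative)
X = np.array([[1, 1, 10], [1, 0, 3], [0, 1, 0], [1, 1, 8],
              [0, 0, 0], [1, 0, 9], [0, 1, 0], [1, 1, 2]])
y = np.array([1, 0, 0, 1, 0, 1, 0, 0])   # class: 1 = worthy, 0 = not worthy

# Split the labeled data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

model = DecisionTreeClassifier(random_state=0)   # choose a classifier
model.fit(X_train, y_train)                      # learn the model on the training set
y_pred = model.predict(X_test)                   # apply the model to the test set
print("test accuracy:", accuracy_score(y_test, y_pred))
```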
11
Classification Techniques
Base Classifiers Decision Tree based Methods Rule-based Methods Nearest-neighbor Neural Networks Naïve Bayes and Bayesian Belief Networks Support Vector Machines Ensemble Classifiers Boosting, Bagging, Random Forests
12
Examples of Classification Task
Predicting tumor cells as benign or malignant; classifying the secondary structures of proteins as alpha-helix, beta-sheet, or random coil; classifying credit card transactions as legitimate or fraudulent; categorizing news stories as finance, weather, entertainment, sports, etc.; identifying intruders in cyberspace.
13
Classification for Protein Function Prediction
Figure: a table of proteins (Protein 1 … Protein n) described by features (Feature 1, Feature 2, Feature 3, …) with an Annotation column; some proteins are annotated, others are not. Features could be motifs, structure, domains, etc. Important features (motifs, conserved domains, etc.) can be extracted using automated feature-extraction techniques and domain knowledge. A classification model can then be built on the extracted features and trained using several annotated proteins. Two proteins sharing significantly common features have a greater probability of having the same function. Several supervised and unsupervised schemes can be applied for this purpose.
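As a purely hypothetical sketch of this idea (the feature encoding, the toy data, and the choice of a random-forest classifier are invented, not taken from the slides), one could represent each protein as a binary vector of motif/domain occurrences and train a classifier on the annotated proteins:

```python
# Hypothetical sketch: represent proteins by binary motif/domain features and
# train a classifier on annotated proteins to predict the function of
# unannotated ones. All feature values and labels are invented.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# rows = proteins, columns = presence/absence of motifs / conserved domains
X_annotated = np.array([[1, 0, 1, 0],
                        [1, 1, 1, 0],
                        [0, 0, 0, 1],
                        [0, 1, 0, 1]])
y_annotated = np.array([1, 1, 0, 0])   # 1 = has the function of interest

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_annotated, y_annotated)

# An unannotated protein sharing most of its features with the first group
# is predicted to share that group's function.
X_unannotated = np.array([[1, 0, 1, 1]])
print("predicted annotation:", clf.predict(X_unannotated))
```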
14
Constructing a Decision Tree
Key computation: the class counts produced by each candidate split. Splitting on Employed: Yes gives Worthy: 4, Not Worthy: 3; No gives Worthy: 0, Not Worthy: 3. Splitting on Education: Graduate gives Worthy: 2, Not Worthy: 2; High School/Undergrad gives Worthy: 2, Not Worthy: 4.
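To make the key computation concrete, here is a short sketch that compares the two candidate splits, assuming the class counts recovered from the slide figure above and using Gini as the impurity measure (one common choice):

```python
# Worked example of the key computation: weighted Gini impurity of the class
# counts for each candidate split (counts taken from the slide figure).
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def weighted_gini(partitions):
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * gini(p) for p in partitions)

# Split on Employed:  Yes -> (Worthy 4, Not Worthy 3), No -> (0, 3)
# Split on Education: Graduate -> (2, 2), High School/Undergrad -> (2, 4)
print("Employed :", round(weighted_gini([(4, 3), (0, 3)]), 3))   # ~0.343
print("Education:", round(weighted_gini([(2, 2), (2, 4)]), 3))   # ~0.467
# The Employed split yields the lower impurity, so it is chosen first.
```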
15
Constructing a Decision Tree
Figure: after the first split on Employed (Employed = Yes vs. Employed = No), the same procedure is applied recursively to each resulting subset.
16
Design Issues of Decision Tree Induction
How should training records be split? Method for specifying test condition depending on attribute types Measure for evaluating the goodness of a test condition How should the splitting procedure stop? Stop splitting if all the records belong to the same class or have identical attribute values Early termination
17
Rule-based Classifier (Example)
18
Application of Rule-Based Classifier
A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule. Rule r1 covers the hawk, so the hawk is classified as a Bird; rule r3 covers the grizzly bear, so it is classified as a Mammal.
19
Ordered Rule Set Rules are rank ordered according to their priority
An ordered rule set is known as a decision list. When a test record is presented to the classifier, it is assigned the class label of the highest-ranked rule it triggers; if no rule fires, it is assigned to the default class.
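A minimal sketch of such a decision list follows; the rule conditions and attribute names are illustrative stand-ins loosely modeled on the hawk/grizzly-bear example, not the actual rules from the slide:

```python
# Minimal decision list: rules are tried in rank order; the first rule whose
# condition covers the record assigns the class, otherwise a default is used.
# The rule conditions below are illustrative, not the exact slide rules.
rules = [
    ("r1", lambda x: x["gives_birth"] == "no" and x["can_fly"] == "yes", "Bird"),
    ("r2", lambda x: x["gives_birth"] == "no" and x["lives_in_water"] == "yes", "Fish"),
    ("r3", lambda x: x["gives_birth"] == "yes" and x["blood"] == "warm", "Mammal"),
]
DEFAULT_CLASS = "Unknown"

def classify(record):
    for name, condition, label in rules:
        if condition(record):          # the rule "covers" the record
            return name, label         # highest-ranked triggered rule wins
    return None, DEFAULT_CLASS         # no rule fired -> default class

hawk = {"gives_birth": "no", "can_fly": "yes", "lives_in_water": "no", "blood": "warm"}
grizzly = {"gives_birth": "yes", "can_fly": "no", "lives_in_water": "no", "blood": "warm"}
print(classify(hawk))     # ('r1', 'Bird')
print(classify(grizzly))  # ('r3', 'Mammal')
```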
20
Nearest Neighbor Classifiers
Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck. Figure: given a test record, compute its distance to the training records and choose the k "nearest" records.
21
Nearest-Neighbor Classifiers
Requires three things The set of stored records Distance metric to compute distance between records The value of k, the number of nearest neighbors to retrieve To classify an unknown record: Compute distance to other training records Identify k nearest neighbors Use class labels of nearest neighbors to determine the class label of unknown record (e.g., by taking majority vote)
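The three ingredients map directly onto library implementations; a minimal sketch with toy data (invented here) using scikit-learn's k-nearest-neighbor classifier:

```python
# Sketch of the three ingredients of a nearest-neighbor classifier:
# the stored records, a distance metric, and the value of k (toy data).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],   # class 0 cluster
                    [3.0, 3.0], [3.2, 2.9], [2.8, 3.1]])  # class 1 cluster
y_train = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X_train, y_train)                 # simply stores the training records

test_record = np.array([[2.9, 3.0]])
print(knn.predict(test_record))           # majority vote of the 3 nearest -> [1]
print(knn.kneighbors(test_record))        # distances and indices of those neighbors
```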
22
Nearest Neighbor Classification…
Choosing the value of k: If k is too small, sensitive to noise points If k is too large, neighborhood may include points from other classes
23
Bayes Classifier
A probabilistic framework for solving classification problems. Conditional probability: P(Y | X) = P(X, Y) / P(X). Bayes theorem: P(Y | X) = P(X | Y) P(Y) / P(X).
24
Using Bayes Theorem for Classification
Approach: compute the posterior probability P(Y | X1, X2, …, Xd) using Bayes theorem. Maximum a posteriori: choose the Y that maximizes P(Y | X1, X2, …, Xd), which is equivalent to choosing the value of Y that maximizes P(X1, X2, …, Xd | Y) P(Y). How to estimate P(X1, X2, …, Xd | Y)?
25
Naïve Bayes Classifier
Assume independence among the attributes Xi when the class is given: P(X1, X2, …, Xd | Yj) = P(X1 | Yj) P(X2 | Yj) … P(Xd | Yj). P(Xi | Yj) can be estimated for all Xi and Yj from the data. A new point is classified to Yj if P(Yj) ∏ P(Xi | Yj) is maximal.
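A bare-bones sketch of this decision rule on made-up categorical records (no smoothing, purely for illustration):

```python
# Minimal naive Bayes sketch: choose the class Yj maximizing
# P(Yj) * prod_i P(Xi | Yj), with probabilities estimated from invented counts.
from collections import Counter, defaultdict

train = [  # (attributes, class) -- toy, invented records
    ({"employed": "yes", "education": "graduate"}, "worthy"),
    ({"employed": "yes", "education": "undergrad"}, "worthy"),
    ({"employed": "no",  "education": "graduate"}, "not_worthy"),
    ({"employed": "no",  "education": "undergrad"}, "not_worthy"),
    ({"employed": "yes", "education": "undergrad"}, "not_worthy"),
]

prior = Counter(label for _, label in train)
cond = defaultdict(Counter)              # cond[(attr, label)][value] = count
for attrs, label in train:
    for attr, value in attrs.items():
        cond[(attr, label)][value] += 1

def posterior_score(attrs, label):
    score = prior[label] / len(train)                     # P(Yj)
    for attr, value in attrs.items():                     # times prod_i P(Xi | Yj)
        score *= cond[(attr, label)][value] / prior[label]
    return score

new_point = {"employed": "yes", "education": "graduate"}
print(max(prior, key=lambda label: posterior_score(new_point, label)))  # 'worthy'
```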
26
Artificial Neural Networks (ANN)
Training an ANN means learning the weights of the neurons. Figure: examples of activation functions.
27
Perceptron
A single-layer network that contains only input and output nodes. Activation function: g = sign(w·x). Learning: initialize the weights (w0, w1, …, wd); then repeat, for each training example (xi, yi), compute f(w, xi) and update the weights as wj ← wj + λ (yi - f(w, xi)) xij, until a stopping condition is met.
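A short self-contained sketch of this learning loop on a toy linearly separable dataset (the data is invented; the update follows the standard perceptron rule stated above):

```python
# Sketch of perceptron learning on a toy linearly separable problem:
# w_j <- w_j + lr * (y_i - f(w, x_i)) * x_ij, with f(w, x) = sign(w . x).
import numpy as np

def sign(v):
    return 1 if v >= 0 else -1

X = np.array([[1.0, 2.0, 1.0], [1.0, 3.0, 3.0],       # first column = bias input
              [1.0, -1.0, -2.0], [1.0, -2.0, -1.0]])
y = np.array([1, 1, -1, -1])

w = np.zeros(X.shape[1])          # initialize the weights (w0, w1, ..., wd)
lr = 0.1                          # learning rate (lambda)

for epoch in range(20):           # repeat until the stopping condition is met
    errors = 0
    for xi, yi in zip(X, y):
        prediction = sign(np.dot(w, xi))      # compute f(w, xi)
        if prediction != yi:
            w += lr * (yi - prediction) * xi  # update the weights
            errors += 1
    if errors == 0:               # converged: every training example is correct
        break

print("learned weights:", w)
```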
28
Example of Perceptron Learning
29
Perceptron Learning Rule
Since f(w, x) is a linear combination of the input variables, the decision boundary is linear. For nonlinearly separable problems, the perceptron learning algorithm will fail because no linear hyperplane can separate the data perfectly; more powerful learning techniques are needed.
30
Characteristics of ANN
Multilayer ANNs are universal approximators. They can handle redundant attributes because the weights are learned automatically. Gradient descent may converge to a local minimum. Model building can be very time consuming, but testing can be very fast.
31
Support Vector Machines
Find a linear hyperplane (decision boundary) that will separate the data
32
Support Vector Machines
Which one is better? B1 or B2? How do you define better?
33
Support Vector Machines
Find hyperplane that maximizes the margin => B1 is better than B2
34
Support Vector Machines
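As an illustration of the maximum-margin idea, here is a sketch with invented toy data, using scikit-learn's SVC with a linear kernel and a large C to approximate a hard margin:

```python
# Sketch: fit a linear maximum-margin boundary with a support vector machine.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],      # class -1
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])     # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

# A large C approximates a hard margin; the learned hyperplane is the one
# with the largest margin between the two classes.
svm = SVC(kernel="linear", C=1e6)
svm.fit(X, y)

print("w:", svm.coef_, "b:", svm.intercept_)      # hyperplane w.x + b = 0
print("support vectors:", svm.support_vectors_)   # the points that define the margin
print(svm.predict([[4.8, 4.2], [1.2, 1.8]]))      # -> [ 1 -1]
```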
35
Evaluating Classifiers
Confusion Matrix (rows = ACTUAL CLASS, columns = PREDICTED CLASS):
                    Predicted Class=Yes       Predicted Class=No
Actual Class=Yes    a (TP, true positive)     b (FN, false negative)
Actual Class=No     c (FP, false positive)    d (TN, true negative)
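For reference (not from the slides), the same cells can be computed from vectors of actual and predicted labels; a small sketch:

```python
# Sketch: computing the confusion matrix cells (a, b, c, d) from predictions.
from sklearn.metrics import confusion_matrix

y_actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # 1 = Class=Yes, 0 = Class=No
y_predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# With labels=[1, 0], the rows/columns follow the slide's layout:
# [[a (TP), b (FN)],
#  [c (FP), d (TN)]]
print(confusion_matrix(y_actual, y_predicted, labels=[1, 0]))
```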
36
Accuracy
Most widely used metric: Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN), using the same confusion matrix cells (a = TP, b = FN, c = FP, d = TN).
37
Problem with Accuracy Consider a 2-class problem
Number of Class 0 examples = 9990; number of Class 1 examples = 10. If a model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%. This is misleading because the model does not detect any class 1 example. Detecting the rare class is usually more interesting (e.g., frauds, intrusions, defects, etc.).
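The same numbers, reproduced in a short sketch to show the gap between accuracy and detection of the rare class:

```python
# Sketch of the imbalance problem: a model that predicts everything as class 0
# reaches 99.9% accuracy yet never detects the rare class 1.
from sklearn.metrics import accuracy_score, recall_score

y_true = [0] * 9990 + [1] * 10        # 9990 class-0 and 10 class-1 examples
y_pred = [0] * 10000                  # trivial model: always predict class 0

print("accuracy:", accuracy_score(y_true, y_pred))                      # 0.999
print("recall on class 1:", recall_score(y_true, y_pred, pos_label=1))  # 0.0
```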
38
Alternative Measures
From the confusion matrix (a = TP, b = FN, c = FP, d = TN): Precision p = a / (a + c); Recall r = a / (a + b); F-measure F = 2rp / (r + p) = 2a / (2a + b + c).
39
ROC (Receiver Operating Characteristic)
A graphical approach for displaying trade-off between detection rate and false alarm rate Developed in 1950s for signal detection theory to analyze noisy signals ROC curve plots TPR against FPR Performance of a model represented as a point in an ROC curve Changing the threshold parameter of classifier changes the location of the point
40
ROC Curve (TPR,FPR): (0,0): declare everything to be negative class
(1,1): declare everything to be positive class (1,0): ideal Diagonal line: Random guessing Below diagonal line: prediction is opposite of the true class
41
ROC (Receiver Operating Characteristic)
To draw an ROC curve, the classifier must produce a continuous-valued output, which is used to rank the test records from the most likely positive to the least likely positive. Many classifiers produce only discrete outputs (i.e., the predicted class). How to get continuous-valued outputs? Most classifier families can be adapted to produce a score or class probability, e.g., decision trees, rule-based classifiers, neural networks, Bayesian classifiers, k-nearest neighbors, and SVMs.
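A minimal sketch of turning class probabilities into an ROC curve (the synthetic data and the choice of a naive Bayes classifier are illustrative; any classifier with a continuous score would do):

```python
# Sketch: rank test records by a continuous score (class probabilities from a
# Bayesian classifier) and compute the resulting ROC curve and AUC.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),    # class 0
               rng.normal(1.5, 1.0, size=(50, 2))])   # class 1
y = np.array([0] * 50 + [1] * 50)

clf = GaussianNB().fit(X, y)
scores = clf.predict_proba(X)[:, 1]   # continuous-valued outputs, P(class 1)
                                      # (for brevity, computed on the training data)

# Each threshold on the scores yields one (FPR, TPR) point of the ROC curve.
fpr, tpr, thresholds = roc_curve(y, scores)
print("number of ROC points:", len(fpr))
print("AUC:", roc_auc_score(y, scores))
```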
42
Methods for Classifier Evaluation
Holdout: reserve k% of the data for training and (100 - k)% for testing. Random subsampling: repeated holdout. Cross-validation: partition the data into k disjoint subsets; k-fold: train on k - 1 partitions, test on the remaining one; leave-one-out: k = n. Bootstrap: sampling with replacement; .632 bootstrap: averages, over the bootstrap runs, the accuracy on the records left out of each bootstrap sample (weight 0.632) with the accuracy on the full training data (weight 0.368).
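A short sketch of k-fold and leave-one-out cross-validation with scikit-learn (the iris data and the decision tree are placeholders for whatever model is being evaluated):

```python
# Sketch of k-fold cross-validation, and leave-one-out as the k = n case.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# k-fold: train on k - 1 partitions, test on the remaining one, rotating k times
kfold_scores = cross_val_score(
    clf, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0))
print("10-fold mean accuracy:", kfold_scores.mean())

# leave-one-out: k = n, one test record per fold
loo_scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
print("leave-one-out accuracy:", loo_scores.mean())
```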
43
Methods for Comparing Classifiers
Given two models: Model M1: accuracy = 85%, tested on 30 instances Model M2: accuracy = 75%, tested on 5000 instances Can we say M1 is better than M2? How much confidence can we place on accuracy of M1 and M2? Can the difference in performance measure be explained as a result of random fluctuations in the test set?
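One standard way to quantify this (a sketch, not necessarily the exact procedure the course uses) is a normal-approximation confidence interval for accuracy, which widens as the test set shrinks:

```python
# Sketch: a normal-approximation 95% confidence interval for accuracy,
# acc +/- z * sqrt(acc * (1 - acc) / N), applied to M1 and M2 from the slide.
import math

def accuracy_interval(acc, n, z=1.96):
    half_width = z * math.sqrt(acc * (1.0 - acc) / n)
    return acc - half_width, acc + half_width

print("M1 (85% on 30):  ", accuracy_interval(0.85, 30))     # roughly (0.72, 0.98)
print("M2 (75% on 5000):", accuracy_interval(0.75, 5000))   # roughly (0.74, 0.76)
# M1's interval is far wider, so its higher point estimate is less trustworthy.
```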
44
Data Mining Book For further details and sample chapters see