Introduction to data mining
G. Marcou
Laboratoire d'infochimie, Université de Strasbourg, 4, rue Blaise Pascal, 67000 Strasbourg
Motivation of data mining
Discover automatically useful information in large data repositories.
Extract patterns from experience.
Predict the outcome of future observations.
Learning: for a given set of tasks, as experience increases, the performance measure on that set of tasks increases.
Organisation of data
Datasets are organized as instances and attributes.
Synonyms for instances: data points, entries, samples...
Synonyms for attributes: factors, variables, measures...
Nature of data
Attributes can be: numeric or nominal, continuous or categorical, ordered, ranges, hierarchical.
Examples:
Atom counts (numeric): O=1, Cl=4, N=6, S=3
Molecule name (nominal): (1-methyl)(1,1,1-tributyl)azanium, tetrahexylammonium
Molecular surface (continuous)
Phase state (categorical): solid, amorphous, liquid, gas, ionized
Intestinal absorption (ordered): not absorbed, mildly absorbed, completely absorbed
Spectral domains (ranges): visible, UV, IR
EC numbers (hierarchical): EC 1. Oxidoreductases; EC 2. Transferases; EC 3. Hydrolases; EC 4. Lyases; EC 5. Isomerases; EC 6. Ligases, e.g. EC 6.1 Forming Carbon-Oxygen Bonds, EC 6.2 Forming Carbon-Sulfur Bonds, EC 6.3 Forming Carbon-Nitrogen Bonds, EC 6.4 Forming Carbon-Carbon Bonds, EC 6.5 Forming Phosphoric Ester Bonds, EC 6.6 Forming Nitrogen-Metal Bonds
Nature of learning
Unsupervised learning: clustering, rules.
Supervised learning: classification, regression.
Other: reinforcement learning, first-order logic.
Concept in data mining
A concept is the target function to be learned.
A concept is learned from attribute-value pairs, relations, sequences, or spatial data.
Machine Learning and Statistics
Statistician's point of view (deduction): datasets are the expression of underlying probability distributions; datasets validate or invalidate prior hypotheses.
Data miner's point of view (induction): any hypothesis compatible with the dataset is useful; search for all hypotheses compatible with the dataset.
Validation in Data Mining
Validation means that a model is built on a training set of data, then applied on a test set of data.
Success and failure on the test set must be estimated.
The estimate is supposed to be representative of any new situation.
Every model must be validated.
Training/Test
Split the dataset into two parts:
One part is the training set.
The other is the test set.
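A minimal Python sketch of this split (the function name, the 30% test fraction and the fixed seed are illustrative choices, not part of the slides):

```python
import random

def train_test_split(data, test_fraction=0.3, seed=42):
    """Randomly split a dataset into a training set and a test set."""
    rng = random.Random(seed)
    shuffled = data[:]           # copy, so the original order is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (training, test)

train, test = train_test_split(list(range(10)))
```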
Bootstrapping
Draw N instances with replacement from the dataset.
Create a training set with these instances.
Use the whole dataset as the test set.
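The bootstrap draw above can be sketched as follows (function name and seed are illustrative; the test set is the full dataset, exactly as the slide states):

```python
import random

def bootstrap_sample(data, seed=0):
    """Draw len(data) instances with replacement for training;
    the whole dataset serves as the test set (per the slide)."""
    rng = random.Random(seed)
    training = [rng.choice(data) for _ in data]
    test = data
    return training, test
```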
Cross-Validation
Split the dataset into N subsets.
Use each subset in turn as a test set, while all the others form the training set.
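A sketch of the N-fold procedure in Python (splitting by stride is one arbitrary way to form the subsets; the function name is illustrative):

```python
def cross_validation_splits(data, n_folds=5):
    """Split data into n_folds subsets; yield each (training, test) pair."""
    folds = [data[i::n_folds] for i in range(n_folds)]  # one subset per stride
    for i, test in enumerate(folds):
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, test
```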
Scrambling
Reassign the classes to the instances at random.
Success and failure are estimated on the scrambled data.
The goal is to estimate how good a success measure can be obtained by pure chance.
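The random reassignment (often called y-scrambling) amounts to permuting the class labels while keeping the attributes fixed; a minimal sketch, with an illustrative function name and seed:

```python
import random

def scramble_classes(labels, seed=0):
    """Randomly reassign the class labels to the instances (y-scrambling)."""
    rng = random.Random(seed)
    shuffled = labels[:]
    rng.shuffle(shuffled)
    return shuffled
```

A model retrained on scrambled labels should score close to chance; a much higher score on the real labels is evidence the model learned something genuine.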
Clustering
Search for an internal organization of the data.
Optimizes relations between instances relative to an objective function.
Typical objective functions: separation, coherence, density, contiguity, concept.
Cluster Evaluation
Essential, because any dataset can be clustered, but not every clustering is meaningful.
Evaluation can be unsupervised, supervised, or relative.
Unsupervised Cluster Evaluation
Cohesion and separation.
Silhouette coefficient: for an instance, s = (b - a) / max(a, b), where a is the mean distance to the other members of its own cluster and b is the mean distance to the members of the nearest other cluster.
Proximity matrix and cophenetic correlation.
Clustering tendency: compare, for p sampled points, the nearest-neighbour distances between instances (ω_i) with the nearest-neighbour distances of random points (u_i).
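The silhouette coefficient described above can be sketched in Python (the function signature, taking an explicit distance function, is an illustrative choice):

```python
def silhouette(point, own_cluster, other_clusters, dist):
    """Silhouette s = (b - a) / max(a, b):
    a = mean distance to the other members of point's own cluster,
    b = mean distance to the members of the nearest other cluster."""
    a = sum(dist(point, q) for q in own_cluster if q != point) / max(len(own_cluster) - 1, 1)
    b = min(sum(dist(point, q) for q in cl) / len(cl) for cl in other_clusters)
    return (b - a) / max(a, b)
```

Values near +1 indicate a well-placed instance; values near 0 or below suggest it sits between clusters.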
Supervised Cluster Evaluation
Let N_i be the number of members of cluster i, and p_ij the fraction of the members of cluster i that belong to class j.
Precision(i, j): the fraction of the members of cluster i that belong to class j.
Recall(i, j): the fraction of the members of class j that are assigned to cluster i.
Example: Precision(3, 1) and Recall(3, 1) relate cluster 3 to class 1.
Relative analysis
Compare two clusterings. Supervised cluster analysis is a special case of relative analysis in which the reference clustering is the set of classes.
N_00: number of instance couples in different clusters for both clusterings.
N_11: number of instance couples in the same cluster for both clusterings.
N_01: number of instance couples in different clusters for the first clustering and in the same cluster for the second.
N_10: number of instance couples in the same cluster for the first clustering and in different clusters for the second.
Rand statistic: (N_00 + N_11) / (N_00 + N_01 + N_10 + N_11).
Jaccard statistic: N_11 / (N_01 + N_10 + N_11).
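The pair counts and the two statistics can be computed directly from the cluster labels of each instance; a minimal sketch with illustrative function names:

```python
from itertools import combinations

def pair_counts(labels1, labels2):
    """Count instance couples by agreement between two clusterings."""
    n00 = n01 = n10 = n11 = 0
    for i, j in combinations(range(len(labels1)), 2):
        same1 = labels1[i] == labels1[j]
        same2 = labels2[i] == labels2[j]
        if same1 and same2:
            n11 += 1            # same cluster in both clusterings
        elif not same1 and not same2:
            n00 += 1            # different clusters in both clusterings
        elif not same1 and same2:
            n01 += 1
        else:
            n10 += 1
    return n00, n01, n10, n11

def rand_statistic(labels1, labels2):
    n00, n01, n10, n11 = pair_counts(labels1, labels2)
    return (n00 + n11) / (n00 + n01 + n10 + n11)

def jaccard_statistic(labels1, labels2):
    n00, n01, n10, n11 = pair_counts(labels1, labels2)
    return n11 / (n01 + n10 + n11)
```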
A simple clustering algorithm: k-means
1. Select k points as centroids.
2. Form k clusters: each point is assigned to its closest centroid.
3. Reset each centroid to the (geometric) center of its cluster.
4. Repeat from step 2 until no change is observed.
5. Repeat from step 1 until stable average clusters are obtained.
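Steps 1-4 above can be sketched in Python for points given as coordinate tuples (random initial centroids, squared Euclidean distance, and the iteration cap are illustrative choices; step 5, restarting, is omitted for brevity):

```python
import random

def kmeans(points, k, n_iter=100, seed=0):
    """k-means: assign points to nearest centroid, recenter, repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # step 1: pick k points
    clusters = [[] for _ in range(k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                       # step 2: nearest centroid
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        new_centroids = [                      # step 3: geometric center
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:         # step 4: stop when stable
            break
        centroids = new_centroids
    return centroids, clusters
```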
Classification
Definition: assign one or several objects to predefined categories; the target function maps a set of attributes x to a set of classes y.
Learning scheme: supervised learning on attribute-value data.
Goal: predict the outcome of future observations.
Probabilities basics
Conditional probability: the probability of the realization of event A knowing that B has occurred, P(A|B) = P(A and B) / P(B).
Independence of random events: P(A and B) = P(A) P(B).
The Bayes equation for independent events x_i: P(A | x_1, ..., x_n) = P(A) P(x_1|A) ... P(x_n|A) / P(x_1, ..., x_n).
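The conditional probability definition can be checked on counted outcomes; a small sketch, where each observation is represented as the set of events that occurred in it (the representation is an illustrative choice):

```python
def conditional_probability(observations, a, b):
    """Estimate P(a | b) = P(a and b) / P(b) from a list of event sets."""
    n = len(observations)
    p_b = sum(1 for obs in observations if b in obs) / n
    p_ab = sum(1 for obs in observations if a in obs and b in obs) / n
    return p_ab / p_b
```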
Statistical approach to classification
Estimate the probability of an instance {x_1, x_2} being of class 1 or of class 2.
The Naive Bayes assumption
The probability that an instance {x_1, x_2, ...} belongs to class A is difficult to estimate directly: poor statistics.
Consider the Bayes equation:
P(A | x_1, x_2, ...) = P(x_1, x_2, ... | A) P(A) / P(x_1, x_2, ...)
that is, posterior probability = likelihood × prior probability / evidence.
With the naive assumption that the x_i are independent, the likelihood factorizes: P(x_1, x_2, ... | A) = P(x_1|A) P(x_2|A) ...
The prior probability, the evidence and the likelihood have better estimates: good statistics.
The Naive Bayes Classifier
1. Estimate the prior probability, P(A), for each class.
2. Estimate the likelihood, P(x|A), of each attribute for each class.
3. For a new instance, estimate the Bayes score for each class: Score(A) = P(A) × Π_i P(x_i|A).
4. Assign the instance to the class which possesses the highest score.
The value of C can be optimized.
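Steps 1-4 can be sketched for nominal attributes as follows (a minimal version, assuming instances are tuples of attribute values; no smoothing of zero counts is applied, which a real implementation would add):

```python
from collections import Counter, defaultdict

def train_naive_bayes(instances, labels):
    """Return a predict function built from priors P(A) and likelihoods P(x_i|A)."""
    n = len(labels)
    class_sizes = Counter(labels)
    priors = {c: k / n for c, k in class_sizes.items()}       # step 1
    counts = defaultdict(Counter)  # (class, attribute index) -> value counts
    for x, y in zip(instances, labels):                       # step 2
        for i, v in enumerate(x):
            counts[(y, i)][v] += 1

    def score(x, c):                                          # step 3
        s = priors[c]
        for i, v in enumerate(x):
            s *= counts[(c, i)][v] / class_sizes[c]
        return s

    def predict(x):                                           # step 4
        return max(priors, key=lambda c: score(x, c))

    return predict
```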
Success and failure
For N instances and a given classifier, for each class i:
N_TP(i), true positives: number of instances of class i correctly classified.
N_FP(i), false positives: number of instances incorrectly assigned to class i.
N_TN(i), true negatives: number of instances of other classes correctly classified.
N_FN(i), false negatives: number of instances of class i incorrectly assigned to other classes.
Confusion Matrix
For N instances, K classes and a classifier, N_ij is the number of instances of class i classified as j:

          Class1  Class2  ...  ClassK
Class1    N_11    N_12    ...  N_1K
Class2    N_21    N_22    ...  N_2K
...       ...     ...     ...  ...
ClassK    N_K1    N_K2    ...  N_KK
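Building the matrix from actual and predicted labels is a direct count; a minimal sketch with an illustrative function name:

```python
def confusion_matrix(actual, predicted, classes):
    """Return N, where N[i][j] = number of instances of class i classified as j."""
    index = {c: k for k, c in enumerate(classes)}
    n = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        n[index[a]][index[p]] += 1
    return n
```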
Classification Evaluation
Global measures of success: measures estimated over all classes.
Local measures of success: measures estimated for each class.
Ranking success evaluation
Receiver Operating Characteristic (ROC) curve: recall (true positive rate) plotted against 1 - specificity (false positive rate).
Area Under the Curve (ROC AUC).
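The ROC AUC can be computed without tracing the curve, using the equivalent rank interpretation: it is the probability that a randomly chosen positive instance is scored above a randomly chosen negative one (ties counting one half). A small sketch, assuming binary labels coded 0/1:

```python
def roc_auc(scores, labels):
    """AUC as P(score of random positive > score of random negative),
    with ties counting for one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```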
Losses and Risks
Errors on different class predictions have different costs: what does it cost to mistakenly assign an instance of one class to another?
Cost matrix C_ij (an asymmetric matrix):

          Class1  Class2  ...  ClassK
Class1    0       C_12    ...  C_1K
Class2    C_21    0       ...  C_2K
...       ...     ...     ...  ...
ClassK    C_K1    C_K2    ...  0

From the cost matrix one derives the normalized expected cost as a function of the probability cost function.
Cost Curve
Plot of the normalized expected cost against the probability cost function, showing the ideal classifier, the accept-all and reject-all classifiers, an actual classifier (built from its N_FP and N_TP), and worse classifiers.
Conclusion
Data mining extracts useful information from datasets.
Clustering: unsupervised; provides information about the data.
Classification: supervised; builds models in order to predict the outcome of future observations.
Multi-Linear Regression
Fit a model y = ax + b by choosing the coefficients a and b that minimize the Sum of Squared Errors (SSE).
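For the single-variable case y = ax + b, the SSE-minimizing coefficients have a closed form in terms of the means of x and y; a minimal sketch (function name illustrative):

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b, minimizing the SSE."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # a = covariance(x, y) / variance(x); b places the line through the means
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b
```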