Introduction to data mining
G. Marcou
Laboratoire d'infochimie, Université de Strasbourg, 4, rue Blaise Pascal, 67000 Strasbourg
Motivation of data mining
Discover automatically useful information in large data repositories.
Extract patterns from experience.
Predict the outcome of future observations.
Learning: for a given set of tasks, as experience increases, the performance measure on that set of tasks increases.
Organisation of data
Datasets are organized as instances and attributes.
Synonyms for instances: data points, entries, samples...
Synonyms for attributes: factors, variables, measures...
Nature of data
Attributes can be: numeric or nominal, continuous or categorical, ordered, ranges, hierarchical.
Examples:
Atom counts (numeric): O=1, Cl=4, N=6, S=3
Molecule name (nominal): (1-methyl)(1,1,1-tributyl)azanium, tetrahexylammonium
Molecular surface (continuous)
Phase state (categorical): solid, amorphous, liquid, gas, ionized
Intestinal absorption (ordered): not absorbed, mildly absorbed, completely absorbed
Spectral domains (ranges): visible, UV, IR
EC numbers (hierarchical): EC 1. Oxidoreductases; EC 2. Transferases; EC 3. Hydrolases; EC 4. Lyases; EC 5. Isomerases; EC 6. Ligases, e.g. EC 6.1 Forming Carbon-Oxygen Bonds, EC 6.2 Forming Carbon-Sulfur Bonds, EC 6.3 Forming Carbon-Nitrogen Bonds, EC 6.4 Forming Carbon-Carbon Bonds, EC 6.5 Forming Phosphoric Ester Bonds, EC 6.6 Forming Nitrogen-Metal Bonds
Nature of learning
Unsupervised learning: clustering, rules.
Supervised learning: classification, regression.
Other: reinforcement learning, first-order logic.
Concept in data mining
A concept is the target function to be learned.
A concept is learned from attribute-value pairs, relations, sequences, or spatial data.
Machine Learning and Statistics
Statistician's point of view (deduction): datasets are the expression of underlying probability distributions; datasets validate or invalidate prior hypotheses.
Data miner's point of view (induction): any hypothesis compatible with the dataset is useful; search for all hypotheses compatible with the dataset.
Validation in Data Mining
Validation means that a model is built on a training set of data, then applied on a test set of data.
Success and failure on the test set must be estimated.
The estimate is supposed to be representative of any new situation.
Every model must be validated.
Training/Test
Split the dataset into two parts:
One part is the training set.
The other is the test set.
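A minimal Python sketch of this split (the function name, the 30% test fraction and the fixed seed are illustrative choices, not part of the slides):

```python
import random

def train_test_split(data, test_fraction=0.3, seed=42):
    """Randomly split a dataset into a training set and a test set."""
    rng = random.Random(seed)
    shuffled = data[:]           # copy, so the original order is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (training, test)

train, test = train_test_split(list(range(10)))
```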
Bootstrapping
Draw N instances with replacement from the dataset.
Create a training set with these instances.
Use the whole dataset as the test set.
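The bootstrap draw above can be sketched as follows (function name and seed are illustrative; the test set is the full dataset, exactly as the slide states):

```python
import random

def bootstrap_sample(data, seed=0):
    """Draw len(data) instances with replacement for training;
    the whole dataset serves as the test set (per the slide)."""
    rng = random.Random(seed)
    training = [rng.choice(data) for _ in data]
    test = data
    return training, test
```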
Cross-Validation
Split the dataset into N subsets.
Use each subset in turn as a test set, while all the others form the training set.
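A sketch of the N-fold procedure in Python (splitting by stride is one arbitrary way to form the subsets; the function name is illustrative):

```python
def cross_validation_splits(data, n_folds=5):
    """Split data into n_folds subsets; yield each (training, test) pair."""
    folds = [data[i::n_folds] for i in range(n_folds)]  # one subset per stride
    for i, test in enumerate(folds):
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, test
```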
Scrambling
Reassign the classes to the instances at random.
Success and failure are estimated on the scrambled data.
The goal is to estimate how good a success measure can be obtained by pure chance.
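The random reassignment (often called y-scrambling) amounts to permuting the class labels while keeping the attributes fixed; a minimal sketch, with an illustrative function name and seed:

```python
import random

def scramble_classes(labels, seed=0):
    """Randomly reassign the class labels to the instances (y-scrambling)."""
    rng = random.Random(seed)
    shuffled = labels[:]
    rng.shuffle(shuffled)
    return shuffled
```

A model retrained on scrambled labels should score close to chance; a much higher score on the real labels is evidence the model learned something genuine.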
Clustering
Search for an internal organization of the data.
Optimizes relations between instances relative to an objective function.
Typical objective functions: separation, coherence, density, contiguity, concept.
Cluster Evaluation
Essential, because any dataset can be clustered, but not every clustering is meaningful.
Evaluation can be unsupervised, supervised, or relative.
Unsupervised Cluster Evaluation
Cohesion and separation.
Silhouette coefficient: for an instance, s = (b - a) / max(a, b), where a is the mean distance to the other members of its own cluster and b is the mean distance to the members of the nearest other cluster.
Proximity matrix and cophenetic correlation.
Clustering tendency: compare, for p sampled points, the nearest-neighbour distances between instances (ω_i) with the nearest-neighbour distances of random points (u_i).
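The silhouette coefficient described above can be sketched in Python (the function signature, taking an explicit distance function, is an illustrative choice):

```python
def silhouette(point, own_cluster, other_clusters, dist):
    """Silhouette s = (b - a) / max(a, b):
    a = mean distance to the other members of point's own cluster,
    b = mean distance to the members of the nearest other cluster."""
    a = sum(dist(point, q) for q in own_cluster if q != point) / max(len(own_cluster) - 1, 1)
    b = min(sum(dist(point, q) for q in cl) / len(cl) for cl in other_clusters)
    return (b - a) / max(a, b)
```

Values near +1 indicate a well-placed instance; values near 0 or below suggest it sits between clusters.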
Supervised Cluster Evaluation
Let N_i be the number of members of cluster i, and p_ij the fraction of the members of cluster i that belong to class j.
Precision(i, j): the fraction of the members of cluster i that belong to class j.
Recall(i, j): the fraction of the members of class j that are assigned to cluster i.
Example: Precision(3, 1) and Recall(3, 1) relate cluster 3 to class 1.
Relative analysis
Compare two clusterings. Supervised cluster analysis is a special case of relative analysis in which the reference clustering is the set of classes.
N_00: number of instance couples in different clusters for both clusterings.
N_11: number of instance couples in the same cluster for both clusterings.
N_01: number of instance couples in different clusters for the first clustering and in the same cluster for the second.
N_10: number of instance couples in the same cluster for the first clustering and in different clusters for the second.
Rand statistic: (N_00 + N_11) / (N_00 + N_01 + N_10 + N_11).
Jaccard statistic: N_11 / (N_01 + N_10 + N_11).
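The pair counts and the two statistics can be computed directly from the cluster labels of each instance; a minimal sketch with illustrative function names:

```python
from itertools import combinations

def pair_counts(labels1, labels2):
    """Count instance couples by agreement between two clusterings."""
    n00 = n01 = n10 = n11 = 0
    for i, j in combinations(range(len(labels1)), 2):
        same1 = labels1[i] == labels1[j]
        same2 = labels2[i] == labels2[j]
        if same1 and same2:
            n11 += 1            # same cluster in both clusterings
        elif not same1 and not same2:
            n00 += 1            # different clusters in both clusterings
        elif not same1 and same2:
            n01 += 1
        else:
            n10 += 1
    return n00, n01, n10, n11

def rand_statistic(labels1, labels2):
    n00, n01, n10, n11 = pair_counts(labels1, labels2)
    return (n00 + n11) / (n00 + n01 + n10 + n11)

def jaccard_statistic(labels1, labels2):
    n00, n01, n10, n11 = pair_counts(labels1, labels2)
    return n11 / (n01 + n10 + n11)
```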
A simple clustering algorithm: k-means
1. Select k points as centroids.
2. Form k clusters: each point is assigned to its closest centroid.
3. Reset each centroid to the (geometric) center of its cluster.
4. Repeat from step 2 until no change is observed.
5. Repeat from step 1 until stable average clusters are obtained.
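Steps 1-4 above can be sketched in Python for points given as coordinate tuples (random initial centroids, squared Euclidean distance, and the iteration cap are illustrative choices; step 5, restarting, is omitted for brevity):

```python
import random

def kmeans(points, k, n_iter=100, seed=0):
    """k-means: assign points to nearest centroid, recenter, repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # step 1: pick k points
    clusters = [[] for _ in range(k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                       # step 2: nearest centroid
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        new_centroids = [                      # step 3: geometric center
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:         # step 4: stop when stable
            break
        centroids = new_centroids
    return centroids, clusters
```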
Classification
Definition: assign one or several objects to predefined categories; the target function maps a set of attributes x to a set of classes y.
Learning scheme: supervised learning on attribute-value data.
Goal: predict the outcome of future observations.
Probabilities basics
Conditional probability: the probability of the realization of event A knowing that B has occurred, P(A|B) = P(A and B) / P(B).
Independence of random events: P(A and B) = P(A) P(B).
The Bayes equation for independent events x_i: P(A | x_1, ..., x_n) = P(A) P(x_1|A) ... P(x_n|A) / P(x_1, ..., x_n).
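The conditional probability definition can be checked on counted outcomes; a small sketch, where each observation is represented as the set of events that occurred in it (the representation is an illustrative choice):

```python
def conditional_probability(observations, a, b):
    """Estimate P(a | b) = P(a and b) / P(b) from a list of event sets."""
    n = len(observations)
    p_b = sum(1 for obs in observations if b in obs) / n
    p_ab = sum(1 for obs in observations if a in obs and b in obs) / n
    return p_ab / p_b
```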
Statistical approach to classification
Estimate the probability of an instance {x_1, x_2} being of class 1 or of class 2.
The Naive Bayes assumption
The probability that an instance {x_1, x_2, ...} belongs to class A is difficult to estimate directly: poor statistics.
Consider the Bayes equation:
P(A | x_1, x_2, ...) = P(x_1, x_2, ... | A) P(A) / P(x_1, x_2, ...)
that is, posterior probability = likelihood × prior probability / evidence.
With the naive assumption that the x_i are independent, the likelihood factorizes: P(x_1, x_2, ... | A) = P(x_1|A) P(x_2|A) ...
The prior probability, the evidence and the likelihood have better estimates: good statistics.
The Naive Bayes Classifier
1. Estimate the prior probability, P(A), for each class.
2. Estimate the likelihood, P(x|A), of each attribute for each class.
3. For a new instance, estimate the Bayes score for each class: Score(A) = P(A) × Π_i P(x_i|A).
4. Assign the instance to the class which possesses the highest score.
The value of C can be optimized.
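Steps 1-4 can be sketched for nominal attributes as follows (a minimal version, assuming instances are tuples of attribute values; no smoothing of zero counts is applied, which a real implementation would add):

```python
from collections import Counter, defaultdict

def train_naive_bayes(instances, labels):
    """Return a predict function built from priors P(A) and likelihoods P(x_i|A)."""
    n = len(labels)
    class_sizes = Counter(labels)
    priors = {c: k / n for c, k in class_sizes.items()}       # step 1
    counts = defaultdict(Counter)  # (class, attribute index) -> value counts
    for x, y in zip(instances, labels):                       # step 2
        for i, v in enumerate(x):
            counts[(y, i)][v] += 1

    def score(x, c):                                          # step 3
        s = priors[c]
        for i, v in enumerate(x):
            s *= counts[(c, i)][v] / class_sizes[c]
        return s

    def predict(x):                                           # step 4
        return max(priors, key=lambda c: score(x, c))

    return predict
```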
Success and failure
For N instances and a given classifier, for each class i:
N_TP(i), true positives: number of instances of class i correctly classified.
N_FP(i), false positives: number of instances incorrectly assigned to class i.
N_TN(i), true negatives: number of instances of other classes correctly classified.
N_FN(i), false negatives: number of instances of class i incorrectly assigned to other classes.
Confusion Matrix
For N instances, K classes and a classifier, N_ij is the number of instances of class i classified as j:

          Class1  Class2  ...  ClassK
Class1    N_11    N_12    ...  N_1K
Class2    N_21    N_22    ...  N_2K
...       ...     ...     ...  ...
ClassK    N_K1    N_K2    ...  N_KK
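Building the matrix from actual and predicted labels is a direct count; a minimal sketch with an illustrative function name:

```python
def confusion_matrix(actual, predicted, classes):
    """Return N, where N[i][j] = number of instances of class i classified as j."""
    index = {c: k for k, c in enumerate(classes)}
    n = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        n[index[a]][index[p]] += 1
    return n
```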
Classification Evaluation
Global measures of success: measures estimated over all classes.
Local measures of success: measures estimated for each class.
Ranking success evaluation
Receiver Operating Characteristic (ROC) curve: recall (true positive rate) plotted against 1 - specificity (false positive rate).
Area Under the Curve (ROC AUC).
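The ROC AUC can be computed without tracing the curve, using the equivalent rank interpretation: it is the probability that a randomly chosen positive instance is scored above a randomly chosen negative one (ties counting one half). A small sketch, assuming binary labels coded 0/1:

```python
def roc_auc(scores, labels):
    """AUC as P(score of random positive > score of random negative),
    with ties counting for one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```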
Losses and Risks
Errors on different class predictions have different costs: what does it cost to mistakenly assign an instance of one class to another?
Cost matrix C_ij (an asymmetric matrix):

          Class1  Class2  ...  ClassK
Class1    0       C_12    ...  C_1K
Class2    C_21    0       ...  C_2K
...       ...     ...     ...  ...
ClassK    C_K1    C_K2    ...  0

From the cost matrix one derives the normalized expected cost as a function of the probability cost function.
Cost Curve
Plot of the normalized expected cost against the probability cost function, showing the ideal classifier, the accept-all and reject-all classifiers, an actual classifier (built from its N_FP and N_TP), and worse classifiers.
Conclusion
Data mining extracts useful information from datasets.
Clustering: unsupervised; provides information about the data.
Classification: supervised; builds models in order to predict the outcome of future observations.
Multi-Linear Regression
Fit a model y = ax + b by choosing the coefficients a and b that minimize the Sum of Squared Errors (SSE).
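For the single-variable case y = ax + b, the SSE-minimizing coefficients have a closed form in terms of the means of x and y; a minimal sketch (function name illustrative):

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b, minimizing the SSE."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # a = covariance(x, y) / variance(x); b places the line through the means
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b
```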