Knowledge Discovery via Data Mining
Enrico Tronci
Dipartimento di Informatica, Università di Roma “La Sapienza”, Via Salaria 113, Roma, Italy
Workshop ENEA: I Sistemi di Supporto alle Decisioni, Centro Ricerche ENEA Casaccia, Roma, October 28, 2003
2 Data Mining Data mining is the extraction of implicit, previously unknown, and potentially useful information from data. A data miner is a computer program that sifts through data seeking regularities or patterns. Obstructions: noise and computational complexity.
3 Some Applications Decisions involving judgments, e.g. loans. Screening images. Example: detection of oil slicks from satellite images, warning of ecological disasters, illegal dumping. Load forecasting in the electricity supply industry. Diagnosis, e.g. for preventive maintenance of electromechanical devices. Marketing and Sales. … On Thursday customers often purchase beer and diapers together … Stock Market Analysis. Anomaly Detection.
4 Data
Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lens
young           myope                   no           reduced               none
young           myope                   no           normal                soft
young           myope                   yes          reduced               none
young           myope                   yes          normal                hard
young           hypermetrope            no           reduced               none
young           hypermetrope            no           normal                soft
young           hypermetrope            yes          reduced               none
young           hypermetrope            yes          normal                hard
pre-presbyopic  myope                   no           reduced               none
pre-presbyopic  myope                   no           normal                soft
Each column is an attribute, each row is an instance, and the last column (Recommended lens) is the goal.
5 Classification Assume instances have n attributes A_1, ..., A_{n-1}, A_n. Let attribute A_n be our goal. A classifier is a function f from (A_1 x ... x A_{n-1}) to A_n. That is, f looks at the values of the first (n-1) attributes and returns the (estimated) value of the goal. In other words, f classifies each instance w.r.t. the goal attribute. The problem of computing a classifier from a set of instances is called the classification problem. Note that in a classification problem the set of classes (i.e. the possible goal values) is known in advance. Note that a classifier works on any possible instance, that is, also on instances that were not present in our data set. This is why classification is a form of machine learning.
6 Clustering Assume instances have n attributes A_1, ..., A_n. A clustering function is a function f from the set (A_1 x ... x A_n) to some small subset of the natural numbers. That is, f splits the set of instances into a small number of classes. The problem of computing a clustering function from our data set is called the clustering problem. Note that, unlike in a classification problem, in a clustering problem the set of classes is not known in advance. Note that a clustering function works on any possible instance, that is, also on instances that were not present in our data set. This is why clustering is a form of machine learning. In the following we will focus on classification.
7 Rules for Contact Lens Data (An example of classification) if ( = ) then = ; if ( = and = and = ) then = ; if ( = and = and = ) then = ; ... The attribute recommendation is the attribute we would like to predict. Such an attribute is usually called the goal and is typically written in the last column. A possible way of defining a classifier is by using a set of rules as above.
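A rule-based classifier can be sketched as an ordered list of tests in which the first matching rule decides the goal value. The three rules below are hypothetical: they were chosen only because they agree with the ten instances listed on the Data slide, and they are not claimed to be the rules shown on this slide. The function name `recommend` and the tuple layout are likewise assumptions made for illustration.

```python
# Ten instances from the Data slide:
# (age, spectacle prescription, astigmatism, tear production rate, recommended lens)
ROWS = [
    ("young", "myope", "no", "reduced", "none"),
    ("young", "myope", "no", "normal", "soft"),
    ("young", "myope", "yes", "reduced", "none"),
    ("young", "myope", "yes", "normal", "hard"),
    ("young", "hypermetrope", "no", "reduced", "none"),
    ("young", "hypermetrope", "no", "normal", "soft"),
    ("young", "hypermetrope", "yes", "reduced", "none"),
    ("young", "hypermetrope", "yes", "normal", "hard"),
    ("pre-presbyopic", "myope", "no", "reduced", "none"),
    ("pre-presbyopic", "myope", "no", "normal", "soft"),
]


def recommend(age, prescription, astigmatism, tear_rate):
    """Hypothetical rule set: rules are tried in order, first match wins."""
    if tear_rate == "reduced":
        return "none"
    if astigmatism == "no":
        return "soft"
    return "hard"


# Count how many of the listed instances the rules classify correctly.
correct = sum(recommend(*row[:4]) == row[4] for row in ROWS)
print(f"{correct} of {len(ROWS)} instances classified correctly")
```

These rules ignore age and spectacle prescription, which is enough for the ten rows shown but may well fail on instances outside that excerpt.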
8 Labor Negotiations Data
Attribute                  Type                   Values
Duration                   years
Wage increase first year   percentage             2%  4%  4.3%  ...  4.5%
Wage increase second year  percentage             ?  ?  ?  ...  ?
Working hours per week     number of hours
Pension                    {none, r, c}           none  ?  ?  ...  ?
Education allowance        {yes, no}              yes  ?  ?  ...  ?
Statutory holidays         number of days
Vacation                   {below-avg, avg, gen}  avg  gen  ...  avg
...
Acceptability of contract  {good, bad}            bad  good  ...  good
9 Classification using Decision Trees (The Labor Negotiations Data Example (1))
Wage increase first year
  <= 2.5: bad
  > 2.5: Statutory holidays
    > 10: good
    <= 10: Wage increase first year
      <= 4: bad
      > 4: good
10 Classification using Decision Trees (The Labor Negotiations Data Example (2))
Wage increase first year
  <= 2.5: Working hours per week
    <= 36: bad
    > 36: Health plan contribution
      none: bad
      half: good
      full: bad
  > 2.5: Statutory holidays
    > 10: good
    <= 10: Wage increase first year
      <= 4: bad
      > 4: good
11 Which Classifier is Good for Me? From the same data set we may get many classifiers with different properties. Here are some of the properties usually considered for a classifier. Note that, depending on the problem under consideration, a given property may or may not be relevant. Success rate: the percentage of instances classified correctly. Ease of computation. Readability: there are cases in which the definition of the classifier must be read by a human being; in such cases the readability of the classifier definition is an important criterion for judging the quality of a classifier. Finally, we should note that, starting from the same data set, different classification algorithms may return different classifiers. Deciding which one to use usually requires running some testing experiments.
12 A Classification Algorithm: Decision Trees Decision trees are among the most widely used and most effective classifiers. We will show the decision tree classification algorithm with an example: the weather data.
13 Weather Data
Outlook   Temperature  Humidity  Windy  Play
sunny     hot          high      false  no
sunny     hot          high      true   no
overcast  hot          high      false  yes
rainy     mild         high      false  yes
rainy     cool         normal    false  yes
rainy     cool         normal    true   no
overcast  cool         normal    true   yes
sunny     mild         high      false  no
sunny     cool         normal    false  yes
rainy     mild         normal    false  yes
sunny     mild         normal    true   yes
overcast  mild         high      true   yes
overcast  hot          normal    false  yes
rainy     mild         high      true   no
14 Constructing a decision tree for the weather data (1)
Class labels (y = yes, n = no) in each branch when splitting on a single attribute:
Outlook:      sunny: y y n n n | overcast: y y y y | rainy: y y y n n      Gain: 0.247
Temperature:  hot: y y n n | mild: y y y y n n | cool: y y y n             Gain: 0.029
Humidity:     high: y y y n n n n | normal: y y y y y y n                  Gain: 0.152
Windy:        false: y y y y y y n n | true: y y y n n n                   Gain: 0.048
Entropy (log in base 2, result in bits): H(p_1, ..., p_n) = -p_1 log p_1 - ... - p_n log p_n
Multistage property: H(p, q, r) = H(p, q + r) + (q + r)*H(q/(q + r), r/(q + r))
H([a, b]) is shorthand for H(a/(a + b), b/(a + b)), where a and b are the yes and no counts.
For Outlook:
H([2, 3]) = -(2/5)*log(2/5) - (3/5)*log(3/5) = 0.971 bits; H([4, 0]) = 0 bits; H([3, 2]) = 0.971 bits
H([2, 3], [4, 0], [3, 2]) = (5/14)*H([2, 3]) + (4/14)*H([4, 0]) + (5/14)*H([3, 2]) = 0.693 bits
Info before any decision tree was created (9 yes, 5 no): H([9, 5]) = 0.940 bits
Gain(outlook) = H([9, 5]) - H([2, 3], [4, 0], [3, 2]) = 0.940 - 0.693 = 0.247 bits
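The gain computation on this slide can be checked with a short sketch. It assumes the 14 weather instances from the Weather Data slide, encoded as tuples with Play last; the names `WEATHER`, `entropy` and `information_gain` are chosen here for illustration.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]

# (outlook, temperature, humidity, windy, play)
WEATHER = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attr_index):
    """Entropy before the split minus the size-weighted entropy of the branches."""
    branches = {}
    for row in rows:
        branches.setdefault(row[attr_index], []).append(row[-1])
    remainder = sum(len(b) / len(rows) * entropy(b) for b in branches.values())
    return entropy([row[-1] for row in rows]) - remainder


for i, name in enumerate(ATTRIBUTES):
    print(f"Gain({name}) = {information_gain(WEATHER, i):.3f}")
```

Outlook has the largest gain, which is why it is a natural choice for the root of the tree.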
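15 Constructing a decision tree for the weather data (2)
Expanding the sunny branch of Outlook; class labels (y = yes, n = no) of the five sunny instances under each candidate split:
Temperature:  hot: n n | mild: y n | cool: y
Humidity:     high: n n n | normal: y y
Windy:        false: y n n | true: y n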
16 Constructing a decision tree for the weather data (3)
Outlook
  sunny: Humidity
    high: no
    normal: yes
  overcast: yes
  rainy: Windy
    false: yes
    true: no
Computational cost of decision tree construction for a data set with m attributes and n instances: O(m n log n) + O(n (log n)^2)
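The construction walked through on these three slides can be sketched as a greedy recursion: pick the attribute with the highest information gain, split, and repeat on each branch until a branch is pure. This is a minimal sketch in the style of ID3, assuming nominal attributes only and no pruning; it is not claimed to be the exact algorithm behind the cost bound above. All names (`build_tree`, `classify`, the tuple-based tree encoding) are assumptions.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]

# (outlook, temperature, humidity, windy, play)
WEATHER = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def split(rows, i):
    """Group rows by their value for attribute i."""
    branches = {}
    for row in rows:
        branches.setdefault(row[i], []).append(row)
    return branches


def gain(rows, i):
    """Information gain of splitting rows on attribute i."""
    remainder = sum(
        len(b) / len(rows) * entropy([r[-1] for r in b])
        for b in split(rows, i).values()
    )
    return entropy([r[-1] for r in rows]) - remainder


def build_tree(rows, candidates):
    """Return a class label (leaf) or a pair (attribute name, {value: subtree})."""
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1 or not candidates:
        return Counter(labels).most_common(1)[0][0]
    best = max(candidates, key=lambda i: gain(rows, i))
    rest = [i for i in candidates if i != best]
    return (ATTRIBUTES[best],
            {value: build_tree(b, rest) for value, b in split(rows, best).items()})


def classify(tree, instance):
    """Follow the branches selected by the instance (a dict) down to a leaf."""
    while isinstance(tree, tuple):
        name, branches = tree
        tree = branches[instance[name]]
    return tree


TREE = build_tree(WEATHER, list(range(len(ATTRIBUTES))))
print(TREE)
```

On the weather data this recovers a tree with Outlook at the root, Humidity under the sunny branch and Windy under the rainy branch. `classify` raises `KeyError` for an attribute value that never occurred in the corresponding branch of the training data; a fuller implementation would need a fallback for that case.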
17 Naive Bayes
Counts and relative frequencies of each attribute value, per class (Play = yes / no):
Attribute    Value     Count yes  Count no  Fraction yes  Fraction no
Outlook      sunny     2          3         2/9           3/5
Outlook      overcast  4          0         4/9           0/5
Outlook      rainy     3          2         3/9           2/5
Temperature  hot       2          2         2/9           2/5
Temperature  mild      4          2         4/9           2/5
Temperature  cool      3          1         3/9           1/5
Humidity     high      3          4         3/9           4/5
Humidity     normal    6          1         6/9           1/5
Windy        false     6          2         6/9           2/5
Windy        true      3          3         3/9           3/5
Play                   9          5         9/14          5/14
18 Naive Bayes (2) A new day: Outlook = sunny, Temperature = cool, Humidity = high, Windy = true, Play = ? E = (sunny and cool and high and true). Bayes: P(yes | E) = (P(E | yes) P(yes)) / P(E). Assuming the attributes are statistically independent given the class: P(yes | E) = (P(sunny | yes) * P(cool | yes) * P(high | yes) * P(true | yes) * P(yes)) / P(E) = (2/9)*(3/9)*(3/9)*(3/9)*(9/14) / P(E) = 0.0053 / P(E). P(no | E) = (3/5)*(1/5)*(4/5)*(3/5)*(5/14) / P(E) = 0.0206 / P(E). Since P(yes | E) + P(no | E) = 1 we have that P(E) = 0.0053 + 0.0206 = 0.0259. Thus: P(yes | E) = 0.205; P(no | E) = 0.795. Thus we answer: NO. Obstruction: usually attributes are not statistically independent. However, naive Bayes works quite well in practice.
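The calculation for the new day can be reproduced with exact fractions. The sketch below assumes the 14 weather instances from the Weather Data slide and estimates every probability by a relative frequency, as on the previous slide; the function name `naive_bayes_posterior` is an assumption, and no smoothing is applied.

```python
from collections import Counter
from fractions import Fraction

# (outlook, temperature, humidity, windy, play)
WEATHER = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]


def naive_bayes_posterior(rows, evidence):
    """Return {class: P(class | evidence)} under the naive independence assumption."""
    class_counts = Counter(row[-1] for row in rows)
    scores = {}
    for cls, n_cls in class_counts.items():
        score = Fraction(n_cls, len(rows))  # prior P(class)
        for i, value in enumerate(evidence):
            matches = sum(1 for row in rows if row[-1] == cls and row[i] == value)
            score *= Fraction(matches, n_cls)  # P(value | class)
        scores[cls] = score
    total = sum(scores.values())  # plays the role of P(E)
    return {cls: score / total for cls, score in scores.items()}


posterior = naive_bayes_posterior(WEATHER, ("sunny", "cool", "high", "true"))
for cls, p in posterior.items():
    print(f"P({cls} | E) = {float(p):.3f}")
```

Because plain relative frequencies are used, a value that never occurs with a class (such as overcast with no) drives that class's score to zero, and evidence unseen for every class would make `total` zero.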
19 Performance Evaluation Split the data set into two parts: training set and test set. Use the training set to compute the classifier. Use the test set to evaluate the classifier. Note: test set data have not been used in the training process. This allows us to compute the following quantities (on the test set). For the sake of simplicity we refer to a two-class prediction.
             Predicted yes        Predicted no
Actual yes   TP (true positive)   FN (false negative)
Actual no    FP (false positive)  TN (true negative)
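The four quantities can be counted directly from the actual and predicted classes of the test instances. The sketch below uses a made-up list of six test labels purely for illustration; the function names are assumptions, and the success rate is taken to be the fraction of instances classified correctly.

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Count TP, FN, FP, TN for a two-class prediction."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if a == positive:
            counts["TP" if p == positive else "FN"] += 1
        else:
            counts["FP" if p == positive else "TN"] += 1
    return counts


def success_rate(counts):
    """Fraction of instances classified correctly."""
    return (counts["TP"] + counts["TN"]) / sum(counts.values())


# Hypothetical test-set labels, not taken from the slides.
actual = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no", "no", "yes", "yes", "no"]

counts = confusion_counts(actual, predicted)
print(counts, f"success rate = {success_rate(counts):.2f}")
```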
20 Lift Chart Horizontal axis: predicted positive subset size = (TP + FP)/(TP + FP + TN + FN), up to 100%. Vertical axis: number of true positives = TP (axis marked up to 1000 in the figure). Lift charts are typically used in marketing applications.
21 Receiver Operating Characteristic (ROC) Curve Horizontal axis: FP rate = FP/(FP + TN). Vertical axis: TP rate = TP/(TP + FN). Both rates range up to 100%. ROC curves are typically used in communication applications.
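One common way to obtain the points of such a curve, sketched here under assumptions that go beyond the slide, is to let the classifier output a score for the positive class, rank the test instances by decreasing score, and record (FP rate, TP rate) after each instance is moved to the predicted-positive side. The scores and labels below are made up for illustration, and tied scores are not treated specially.

```python
def roc_points(actual, scores, positive="yes"):
    """Return the list of (FP rate, TP rate) points, starting at (0, 0)."""
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(1 for a in actual if a == positive)
    n_neg = len(actual) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label == positive:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical test set: actual classes and classifier scores for "yes".
actual = ["yes", "no", "yes", "yes", "no"]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]

points = roc_points(actual, scores)
for fp_rate, tp_rate in points:
    print(f"FP rate = {fp_rate:.2f}, TP rate = {tp_rate:.2f}")
```

The function assumes the test set contains at least one positive and one negative instance; otherwise one of the rates is undefined.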
22 A glimpse of data mining in Safeguard We outline our use of data mining techniques in the Safeguard project.
23 On-line schema
Diagram components:
Input: tcpdump, TCP packets on port 2506, Format Filter, preprocessed TCP payload.
Format Filter outputs: sequence of payload bytes; distribution of payload bytes; conditional probabilities of chars and words in payload; statistics info (avg, var, dev) on payload bytes.
Analysis: Classifier 1 (hash table based), Classifier 2 (Hidden Markov Models), Cluster Analyzer.
Output: Supervisor, alarm level.
24 Training schema
Diagram components:
Input: tcpdump, TCP packets on port 2506, Format Filter, preprocessed TCP payload, log.
Format Filter outputs: sequence of payload bytes; distribution of payload bytes; conditional probabilities of chars and words in payload; statistics info (avg, var, dev) on payload bytes.
Training: WEKA (data mining tool), HT Classifier Synthesizer, HMM Synthesizer.
Results: Classifier 1 (hash table based), Classifier 2 (Hidden Markov Models), Cluster Analyzer.