Slide 1: Classifiers in ATLaS (CS240B Class Notes, UCLA)
Slide 2: Data Mining
- Classifiers:
  - Bayesian classifiers
  - Decision trees
- The Apriori Algorithm
- DBSCAN Clustering: http://wis.cs.ucla.edu/atlas/examples.html
Slide 3: The Classification Task
- Input: a training set of tuples, each labelled with one class label
- Output: a model (classifier) that assigns a class label to each tuple based on the other attributes
- The model can be used to predict the class of new tuples, for which the class label is missing or unknown
- Some natural applications:
  - credit approval
  - medical diagnosis
  - treatment effectiveness analysis
Slide 4: Train & Test
- The tuples (observations, samples) are partitioned into a training set and a test set.
- Classification is performed in two steps:
  1. Training: build the model from the training set
  2. Testing: evaluate the model (for accuracy, etc.)
Slide 5: Classical example: play tennis?
Training set from Quinlan's book. A sequence number (Seq) could have been used to generate the RID column.
Slide 6: Bayesian classification
- The classification problem may be formalized using a-posteriori probabilities:
- P(C|X) = probability that the sample tuple X = <x1, ..., xk> is of class C
- E.g., P(class=N | outlook=sunny, windy=true, ...)
- Idea: assign to sample X the class label C such that P(C|X) is maximal
Slide 7: Estimating a-posteriori probabilities
- Bayes' theorem: P(C|X) = P(X|C) * P(C) / P(X)
- P(X) is constant for all classes
- P(C) = relative frequency of class-C samples
- The C that maximizes P(C|X) is the C that maximizes P(X|C) * P(C)
Slide 8: Naive Bayesian Classification
- Naive assumption: attribute independence, so P(x1, ..., xk|C) = P(x1|C) * ... * P(xk|C)
- For categorical attributes, P(xi|C) is estimated as the relative frequency of samples having value xi as the i-th attribute in class C
- Computationally, this is a count with grouping
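The "count with grouping" view can be sketched in a few lines of Python. This is a minimal illustration, not the ATLaS/SQL formulation; the data below is a list of (outlook, class) pairs constructed to match the per-class counts shown on the next slide (sunny 2p/3n, overcast 4p/0n, rain 3p/2n):

```python
from collections import Counter
from fractions import Fraction

# (outlook, class) pairs matching the play-tennis counts on the next slide
data = ([("sunny", "p")] * 2 + [("overcast", "p")] * 4 + [("rain", "p")] * 3 +
        [("sunny", "n")] * 3 + [("rain", "n")] * 2)

class_counts = Counter(c for _, c in data)  # group by class
pair_counts = Counter(data)                 # group by (value, class)

def p_given(value, cls):
    # relative frequency of this attribute value within class cls
    return Fraction(pair_counts[(value, cls)], class_counts[cls])

print(p_given("sunny", "p"))     # 2/9
print(p_given("overcast", "n"))  # 0
```

In SQL terms, `class_counts` and `pair_counts` correspond to two GROUP BY aggregations over the training table.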
Slide 9: Play-tennis example: estimating P(xi|C)

outlook      P(sunny|p)    = 2/9    P(sunny|n)    = 3/5
             P(overcast|p) = 4/9    P(overcast|n) = 0
             P(rain|p)     = 3/9    P(rain|n)     = 2/5
temperature  P(hot|p)      = 2/9    P(hot|n)      = 2/5
             P(mild|p)     = 4/9    P(mild|n)     = 2/5
             P(cool|p)     = 3/9    P(cool|n)     = 1/5
humidity     P(high|p)     = 3/9    P(high|n)     = 4/5
             P(normal|p)   = 6/9    P(normal|n)   = 2/5
windy        P(true|p)     = 3/9    P(true|n)     = 3/5
             P(false|p)    = 6/9    P(false|n)    = 2/5

Priors: P(p) = 9/14, P(n) = 5/14
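These estimates are all a classifier needs at prediction time. As a small worked sketch (using exact fractions so the arithmetic is transparent), we can score the classic unseen tuple X = <sunny, cool, high, true> against both classes with the probabilities from the table above:

```python
from fractions import Fraction as F

# priors and conditional probabilities copied from the table above
P = {"p": F(9, 14), "n": F(5, 14)}
cond = {
    "p": {"sunny": F(2, 9), "cool": F(3, 9), "high": F(3, 9), "true": F(3, 9)},
    "n": {"sunny": F(3, 5), "cool": F(1, 5), "high": F(4, 5), "true": F(3, 5)},
}

x = ["sunny", "cool", "high", "true"]  # the unseen sample X
score = {}
for c in ("p", "n"):
    s = P[c]
    for v in x:
        s *= cond[c][v]  # naive independence: multiply the P(xi|C) factors
    score[c] = s

print(score["p"], score["n"])  # 1/189 vs 18/875: class n wins
```

Since 18/875 (about 0.021) exceeds 1/189 (about 0.005), the sample is classified as n (don't play). P(X) is never needed, as it cancels out of the comparison.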
Slide 10: Bayesian Classifiers
- The training can be done by SQL count and grouping sets (but that might require many passes through the data); the results are stored in a table called SUMMARY
- The testing is then a simple SQL query on SUMMARY
- The first operation is to verticalize the table
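"Verticalizing" means turning each wide training tuple into one (rid, attribute, value, class) row per non-class attribute, so all the per-attribute counts become a single GROUP BY over one long, narrow table. A minimal sketch of the transformation, using two made-up tuples for illustration:

```python
# Two illustrative wide tuples (hypothetical values, only two attributes shown)
rows = [
    {"RID": 1, "outlook": "sunny", "windy": "true", "play": "n"},
    {"RID": 2, "outlook": "rain", "windy": "false", "play": "p"},
]

# verticalize: one (rid, attribute, value, class) row per non-class attribute
vertical = [
    (r["RID"], a, r[a], r["play"])
    for r in rows
    for a in r
    if a not in ("RID", "play")
]

for t in vertical:
    print(t)
```

In SQL, the counts needed for the SUMMARY table then reduce to GROUP BY attribute, value, class over this vertical table.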
Slide 11: Decision tree obtained with ID3 (Quinlan 86)
[Figure: decision tree. Root tests outlook: sunny branches to a humidity test (high -> N, normal -> P); overcast is a P leaf; rain branches to a windy test (strong -> N, weak -> P).]
Slide 12: Decision Tree Classifiers
- Computed in a recursive fashion
- There are various ways to split and to compute the splitting function
- The first operation is to verticalize the table
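The recursive computation can be sketched compactly. This is an illustrative Python sketch, not the ATLaS formulation: it uses the Gini index as the splitting function and assumes Quinlan's standard 14-tuple play-tennis training set (which is consistent with the probabilities on the earlier slide):

```python
from collections import Counter
from fractions import Fraction as F

# Quinlan's 14-tuple play-tennis training set; last field is the class
DATA = [
    ("sunny", "hot", "high", "false", "n"), ("sunny", "hot", "high", "true", "n"),
    ("overcast", "hot", "high", "false", "p"), ("rain", "mild", "high", "false", "p"),
    ("rain", "cool", "normal", "false", "p"), ("rain", "cool", "normal", "true", "n"),
    ("overcast", "cool", "normal", "true", "p"), ("sunny", "mild", "high", "false", "n"),
    ("sunny", "cool", "normal", "false", "p"), ("rain", "mild", "normal", "false", "p"),
    ("sunny", "mild", "normal", "true", "p"), ("overcast", "mild", "high", "true", "p"),
    ("overcast", "hot", "normal", "false", "p"), ("rain", "mild", "high", "true", "n"),
]
ATTRS = ("outlook", "temperature", "humidity", "windy")

def gini(labels):
    n = len(labels)
    return 1 - sum(F(c, n) ** 2 for c in Counter(labels).values())

def split_gini(rows, attr):
    # weighted Gini of partitioning rows on attr
    i = ATTRS.index(attr)
    groups = {}
    for r in rows:
        groups.setdefault(r[i], []).append(r[-1])
    n = len(rows)
    return sum(F(len(g), n) * gini(g) for g in groups.values())

def build(rows, attrs):
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    a = min(sorted(attrs), key=lambda t: split_gini(rows, t))
    i = ATTRS.index(a)
    return (a, {v: build([r for r in rows if r[i] == v], attrs - {a})
                for v in sorted(set(r[i] for r in rows))})

tree = build(DATA, set(ATTRS))
print(tree[0])  # outlook: the root split, matching the ID3 tree shown earlier
```

Each recursive call re-partitions its subset of tuples, which is exactly the pattern the verticalized table supports with grouped counts.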
Slide 13: Classical example
Slide 14: Initial state: the node column
Training set from Quinlan's book.
Slide 15: First level (Outlook will then be deleted)
Slide 16: Gini index
- E.g., two classes, Pos and Neg, and a dataset S with p Pos-elements and n Neg-elements.
- fp = p/(p+n), fn = n/(p+n)
- gini(S) = 1 - fp^2 - fn^2
- If dataset S is split into S1, S2, S3, then
  gini_split(S1, S2, S3) = gini(S1) * (p1+n1)/(p+n) + gini(S2) * (p2+n2)/(p+n) + gini(S3) * (p3+n3)/(p+n)
- These computations can be easily expressed in ATLaS
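The two formulas above translate directly into code. A minimal Python sketch (exact fractions, two-class case as on the slide), evaluated on the play-tennis counts from the earlier slide (9 Pos, 5 Neg overall; outlook partitions them as sunny 2/3, overcast 4/0, rain 3/2):

```python
from fractions import Fraction as F

def gini(p, n):
    # gini(S) = 1 - fp^2 - fn^2, with fp = p/(p+n) and fn = n/(p+n)
    fp, fn = F(p, p + n), F(n, p + n)
    return 1 - fp ** 2 - fn ** 2

def gini_split(parts):
    # parts: (p_i, n_i) pairs for the subsets S_1..S_k;
    # weighted sum gini(S_i) * (p_i + n_i)/(p + n)
    p = sum(pi for pi, _ in parts)
    n = sum(ni for _, ni in parts)
    return sum(gini(pi, ni) * F(pi + ni, p + n) for pi, ni in parts)

print(gini(9, 5))                            # whole set: 45/98
print(gini_split([(2, 3), (4, 0), (3, 2)]))  # split on outlook: 12/35
```

The pure overcast subset contributes zero impurity, which is why outlook is such a good first split.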
Slide 17: Programming in ATLaS
- Table-based programming is powerful and natural for data-intensive applications
- SQL can be awkward, and many extensions are possible
- But even SQL "as is" is adequate
Slide 18: The ATLaS System
- The system compiles ATLaS programs into C programs, which execute on the Berkeley DB record manager
- The 100-line Apriori program compiles into 2,800 lines of C
- Other data structures (R-trees, in-memory tables) have been added using the same API
- The system is now 54,000 lines of C++ code
Slide 19: ATLaS: Conclusions
- A native extensibility mechanism for SQL, and a simple one; more efficient than Java or PL/SQL
- Effective for data mining applications
- Also OLAP applications, recursive queries, and temporal database applications
- Complements current mechanisms based on UDFs and Data Blades
- Supports and favors streaming aggregates (SQL's implicit default is blocking)
- A good basis for determining program properties, e.g. (non)monotonic and blocking behavior
- These are lessons that future QLs cannot easily ignore
Slide 20: The Future
- Continuous queries on Data Streams
- Other extensions and improvements
- Stay tuned: www.wis.ucla.edu