Slide 1: Classifiers in ATLaS (CS240B Class Notes, UCLA)
Slide 2: Data Mining
- Classifiers:
  - Bayesian classifiers
  - Decision trees
- The Apriori Algorithm
- DBSCAN Clustering: http://wis.cs.ucla.edu/atlas/examples.html
Slide 3: The Classification Task
- Input: a training set of tuples, each labelled with one class label
- Output: a model (classifier) that assigns a class label to each tuple based on the other attributes
- The model can be used to predict the class of new tuples, for which the class label is missing or unknown
- Some natural applications:
  - credit approval
  - medical diagnosis
  - treatment effectiveness analysis
Slide 4: Train & Test
- The tuples (observations, samples) are partitioned into a training set and a test set.
- Classification is performed in two steps:
  1. Training: build the model from the training set
  2. Testing: evaluate the model (for accuracy, etc.)
Slide 5: Classical example: play tennis?
Training set from Quinlan's book. A sequence number (Seq) could have been used to generate the RID column.
Slide 6: Bayesian classification
- The classification problem may be formalized using a-posteriori probabilities:
- P(C|X) = probability that the sample tuple X = <x1, ..., xk> is of class C
- E.g., P(class=N | outlook=sunny, windy=true, ...)
- Idea: assign to sample X the class label C such that P(C|X) is maximal
Slide 7: Estimating a-posteriori probabilities
- Bayes' theorem: P(C|X) = P(X|C) * P(C) / P(X)
- P(X) is constant for all classes
- P(C) = relative frequency of class-C samples
- The C that maximizes P(C|X) is the C that maximizes P(X|C) * P(C)
Slide 8: Naive Bayesian Classification
- Naive assumption: attribute independence, so P(x1, ..., xk|C) = P(x1|C) * ... * P(xk|C)
- For categorical attributes, P(xi|C) is estimated as the relative frequency of samples having value xi as the i-th attribute in class C
- Computationally, this is a count with grouping
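The "count with grouping" view can be sketched in a few lines of Python. This is a minimal illustration, not the ATLaS/SQL formulation; the data below is a list of (outlook, class) pairs constructed to match the per-class counts shown on the next slide (sunny 2p/3n, overcast 4p/0n, rain 3p/2n):

```python
from collections import Counter
from fractions import Fraction

# (outlook, class) pairs matching the play-tennis counts on the next slide
data = ([("sunny", "p")] * 2 + [("overcast", "p")] * 4 + [("rain", "p")] * 3 +
        [("sunny", "n")] * 3 + [("rain", "n")] * 2)

class_counts = Counter(c for _, c in data)  # group by class
pair_counts = Counter(data)                 # group by (value, class)

def p_given(value, cls):
    # relative frequency of this attribute value within class cls
    return Fraction(pair_counts[(value, cls)], class_counts[cls])

print(p_given("sunny", "p"))     # 2/9
print(p_given("overcast", "n"))  # 0
```

In SQL terms, `class_counts` and `pair_counts` correspond to two GROUP BY aggregations over the training table.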
Slide 9: Play-tennis example: estimating P(xi|C)

outlook      P(sunny|p)    = 2/9    P(sunny|n)    = 3/5
             P(overcast|p) = 4/9    P(overcast|n) = 0
             P(rain|p)     = 3/9    P(rain|n)     = 2/5
temperature  P(hot|p)      = 2/9    P(hot|n)      = 2/5
             P(mild|p)     = 4/9    P(mild|n)     = 2/5
             P(cool|p)     = 3/9    P(cool|n)     = 1/5
humidity     P(high|p)     = 3/9    P(high|n)     = 4/5
             P(normal|p)   = 6/9    P(normal|n)   = 2/5
windy        P(true|p)     = 3/9    P(true|n)     = 3/5
             P(false|p)    = 6/9    P(false|n)    = 2/5

Priors: P(p) = 9/14, P(n) = 5/14
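These estimates are all a classifier needs at prediction time. As a small worked sketch (using exact fractions so the arithmetic is transparent), we can score the classic unseen tuple X = <sunny, cool, high, true> against both classes with the probabilities from the table above:

```python
from fractions import Fraction as F

# priors and conditional probabilities copied from the table above
P = {"p": F(9, 14), "n": F(5, 14)}
cond = {
    "p": {"sunny": F(2, 9), "cool": F(3, 9), "high": F(3, 9), "true": F(3, 9)},
    "n": {"sunny": F(3, 5), "cool": F(1, 5), "high": F(4, 5), "true": F(3, 5)},
}

x = ["sunny", "cool", "high", "true"]  # the unseen sample X
score = {}
for c in ("p", "n"):
    s = P[c]
    for v in x:
        s *= cond[c][v]  # naive independence: multiply the P(xi|C) factors
    score[c] = s

print(score["p"], score["n"])  # 1/189 vs 18/875: class n wins
```

Since 18/875 (about 0.021) exceeds 1/189 (about 0.005), the sample is classified as n (don't play). P(X) is never needed, as it cancels out of the comparison.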
Slide 10: Bayesian Classifiers
- The training can be done by SQL count and grouping sets (but that might require many passes through the data); the results are stored in a table called SUMMARY
- The testing is then a simple SQL query on SUMMARY
- The first operation is to verticalize the table
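"Verticalizing" means turning each wide training tuple into one (rid, attribute, value, class) row per non-class attribute, so all the per-attribute counts become a single GROUP BY over one long, narrow table. A minimal sketch of the transformation, using two made-up tuples for illustration:

```python
# Two illustrative wide tuples (hypothetical values, only two attributes shown)
rows = [
    {"RID": 1, "outlook": "sunny", "windy": "true", "play": "n"},
    {"RID": 2, "outlook": "rain", "windy": "false", "play": "p"},
]

# verticalize: one (rid, attribute, value, class) row per non-class attribute
vertical = [
    (r["RID"], a, r[a], r["play"])
    for r in rows
    for a in r
    if a not in ("RID", "play")
]

for t in vertical:
    print(t)
```

In SQL, the counts needed for the SUMMARY table then reduce to GROUP BY attribute, value, class over this vertical table.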
Slide 11: Decision tree obtained with ID3 (Quinlan 86)
[Figure: decision tree. Root tests outlook: sunny branches to a humidity test (high -> N, normal -> P); overcast is a P leaf; rain branches to a windy test (strong -> N, weak -> P).]
Slide 12: Decision Tree Classifiers
- Computed in a recursive fashion
- There are various ways to split and to compute the splitting function
- The first operation is to verticalize the table
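The recursive computation can be sketched compactly. This is an illustrative Python sketch, not the ATLaS formulation: it uses the Gini index as the splitting function and assumes Quinlan's standard 14-tuple play-tennis training set (which is consistent with the probabilities on the earlier slide):

```python
from collections import Counter
from fractions import Fraction as F

# Quinlan's 14-tuple play-tennis training set; last field is the class
DATA = [
    ("sunny", "hot", "high", "false", "n"), ("sunny", "hot", "high", "true", "n"),
    ("overcast", "hot", "high", "false", "p"), ("rain", "mild", "high", "false", "p"),
    ("rain", "cool", "normal", "false", "p"), ("rain", "cool", "normal", "true", "n"),
    ("overcast", "cool", "normal", "true", "p"), ("sunny", "mild", "high", "false", "n"),
    ("sunny", "cool", "normal", "false", "p"), ("rain", "mild", "normal", "false", "p"),
    ("sunny", "mild", "normal", "true", "p"), ("overcast", "mild", "high", "true", "p"),
    ("overcast", "hot", "normal", "false", "p"), ("rain", "mild", "high", "true", "n"),
]
ATTRS = ("outlook", "temperature", "humidity", "windy")

def gini(labels):
    n = len(labels)
    return 1 - sum(F(c, n) ** 2 for c in Counter(labels).values())

def split_gini(rows, attr):
    # weighted Gini of partitioning rows on attr
    i = ATTRS.index(attr)
    groups = {}
    for r in rows:
        groups.setdefault(r[i], []).append(r[-1])
    n = len(rows)
    return sum(F(len(g), n) * gini(g) for g in groups.values())

def build(rows, attrs):
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    a = min(sorted(attrs), key=lambda t: split_gini(rows, t))
    i = ATTRS.index(a)
    return (a, {v: build([r for r in rows if r[i] == v], attrs - {a})
                for v in sorted(set(r[i] for r in rows))})

tree = build(DATA, set(ATTRS))
print(tree[0])  # outlook: the root split, matching the ID3 tree shown earlier
```

Each recursive call re-partitions its subset of tuples, which is exactly the pattern the verticalized table supports with grouped counts.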
Slide 13: Classical example
Slide 14: Initial state: the node column
Training set from Quinlan's book.
Slide 15: First level (Outlook will then be deleted)
Slide 16: Gini index
- E.g., two classes, Pos and Neg, and a dataset S with p Pos-elements and n Neg-elements.
- fp = p/(p+n), fn = n/(p+n)
- gini(S) = 1 - fp^2 - fn^2
- If dataset S is split into S1, S2, S3, then
  gini_split(S1, S2, S3) = gini(S1) * (p1+n1)/(p+n) + gini(S2) * (p2+n2)/(p+n) + gini(S3) * (p3+n3)/(p+n)
- These computations can be easily expressed in ATLaS
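The two formulas above translate directly into code. A minimal Python sketch (exact fractions, two-class case as on the slide), evaluated on the play-tennis counts from the earlier slide (9 Pos, 5 Neg overall; outlook partitions them as sunny 2/3, overcast 4/0, rain 3/2):

```python
from fractions import Fraction as F

def gini(p, n):
    # gini(S) = 1 - fp^2 - fn^2, with fp = p/(p+n) and fn = n/(p+n)
    fp, fn = F(p, p + n), F(n, p + n)
    return 1 - fp ** 2 - fn ** 2

def gini_split(parts):
    # parts: (p_i, n_i) pairs for the subsets S_1..S_k;
    # weighted sum gini(S_i) * (p_i + n_i)/(p + n)
    p = sum(pi for pi, _ in parts)
    n = sum(ni for _, ni in parts)
    return sum(gini(pi, ni) * F(pi + ni, p + n) for pi, ni in parts)

print(gini(9, 5))                            # whole set: 45/98
print(gini_split([(2, 3), (4, 0), (3, 2)]))  # split on outlook: 12/35
```

The pure overcast subset contributes zero impurity, which is why outlook is such a good first split.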
Slide 17: Programming in ATLaS
- Table-based programming is powerful and natural for data-intensive applications
- SQL can be awkward, and many extensions are possible
- But even SQL "as is" is adequate
Slide 18: The ATLaS System
- The system compiles ATLaS programs into C programs, which execute on the Berkeley DB record manager
- The 100-line Apriori program compiles into 2,800 lines of C
- Other data structures (R-trees, in-memory tables) have been added using the same API
- The system is now 54,000 lines of C++ code
Slide 19: ATLaS: Conclusions
- A native extensibility mechanism for SQL, and a simple one; more efficient than Java or PL/SQL
- Effective for data mining applications
- Also OLAP applications, recursive queries, and temporal database applications
- Complements current mechanisms based on UDFs and Data Blades
- Supports and favors streaming aggregates (SQL's implicit default is blocking)
- A good basis for determining program properties, e.g. (non)monotonic and blocking behavior
- These are lessons that future QLs cannot easily ignore
Slide 20: The Future
- Continuous queries on Data Streams
- Other extensions and improvements
- Stay tuned: www.wis.ucla.edu