Database Implementation of a Model-Free Classifier Konstantinos Morfonios ADBIS 2007 University of Athens.


Outline: Introduction, Motivation, LOCUS, Parallel Execution, Experimental Evaluation, Conclusions & Future Work

Introduction Classification: given an object x and classes ω1, ω2, find its class ω = f(x)

Introduction “Lazy” (Nearest Neighbors) vs. “Eager” (Decision Trees). Eager: (+) faster decisions; (-) large/complex datasets; (-) dynamic datasets; (-) dynamic models

Motivation Large/complex datasets; dynamic datasets; dynamic models; lazy (model-free); Nearest Neighbors; disk-based

Motivation Nearest Neighbors suffers from the “curse of dimensionality”: not reliable [Beyer et al., ICDT 1999]; not indexable [Shaft et al., ICDT 2005]. LOCUS (Lazy Optimal Classifier of Unlimited Scalability)

Motivation LOCUS (Lazy Optimal Classifier of Unlimited Scalability). Category? Lazy. Scaling? Based on simple SQL queries. Accuracy? Converges to the optimal Bayes classifier. Other features? Parallelizable.

LOCUS Example: a query object x and two classes ω1, ω2 in a two-feature space (f1 ∈ [0, 20], f2 ∈ [0, 10])

LOCUS Ideally: dense space. ω(x) = ? [answer shown in figure]

LOCUS Reality: many features, large domains ⇒ sparse space. ω(x) = ?

LOCUS 3-NN: ω1: 2, ω2: 1 ⇒ ω(x) = ω1

LOCUS ω(x) = ? Counting the objects around x: ω1: 7, ω2: 3 ⇒ ω(x) = ω1

LOCUS Disk-based implementation

2δ12δ1 2δ22δ2 SELECT ω, count(*) FROM R WHERE f 1 ≥x 1 -δ 1 AND f 1 ≤x 1 +δ 1 AND f 2 ≥x 2 -δ 2 AND f 2 ≤x 2 +δ 2 GROUP BY ω R(f 1, f 2, ω) ω 1 : 7 ω 2 : 3 ω( ) = 

LOCUS What if R is large? The range-count query over R(f1, f2, ω) is a well-known type of aggregate query, so classical optimization techniques apply: indexing, presorting, materialized views
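
As one illustration of the indexing option, the sketch below adds a composite index to a SQLite table and checks that the query result is unchanged. The index name, the synthetic data, and the choice of a three-column index are assumptions made for this example; the slides do not say which index structure the authors used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (f1 REAL, f2 REAL, w TEXT)")
# Synthetic tuples (illustrative only): f1 in 0..20, f2 in 0..10
conn.executemany(
    "INSERT INTO R VALUES (?, ?, ?)",
    [(i % 21, i % 11, "w1" if i % 3 else "w2") for i in range(1000)],
)

QUERY = (
    "SELECT w, count(*) FROM R "
    "WHERE f1 >= ? AND f1 <= ? AND f2 >= ? AND f2 <= ? GROUP BY w"
)
params = (8, 12, 3, 7)

before = sorted(conn.execute(QUERY, params).fetchall())

# The index contains every column the query touches, so in principle
# the query can be answered without reading the base table.
conn.execute("CREATE INDEX idx_r_f1_f2_w ON R (f1, f2, w)")
after = sorted(conn.execute(QUERY, params).fetchall())

# Show the plan; whether the index is chosen, and the exact wording,
# depends on the SQLite version and its planner.
for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY, params):
    print(row)
```

The index changes only how the answer is computed, never the counts themselves, which is why the same query text can be kept while the physical design is tuned.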

LOCUS Method reliability? LOCUS converges to the optimal Bayes classifier as the size of the dataset increases (proof in the paper)
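
For reference, the decision rule implied by the query and the Bayes-optimal rule it is said to approach can be written side by side. The symbols N_ω(x), \hat{ω} and ω* are notation introduced here for illustration, and the majority vote is assumed from the count examples on the slides; the precise convergence conditions are in the paper and are not reproduced.

```latex
N_\omega(x) = \bigl|\{\, r \in R : r.\omega = \omega,\ |r.f_i - x_i| \le \delta_i \ \text{for all } i \,\}\bigr|,
\qquad
\hat{\omega}(x) = \arg\max_{\omega} N_\omega(x),
\qquad
\omega^{*}(x) = \arg\max_{\omega} P(\omega \mid x)
```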

LOCUS What if a feature, say f2, is categorical (e.g. sex)? Use an equality predicate for it:
SELECT ω, count(*) FROM R WHERE f1 ≥ x1 - δ1 AND f1 ≤ x1 + δ1 AND f2 = x2 GROUP BY ω
Not a problem, since generally in practice datasets combine categorical and numeric features, and categorical features have small domains; hence, they do not contribute to sparsity
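
The mixed numeric/categorical variant of the query can be sketched the same way with SQLite. The table contents, the function name `classify_mixed`, and the column name `w` (for ω) are hypothetical; only the shape of the WHERE clause comes from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# f1 is numeric (say, age); f2 is categorical (say, sex)
conn.execute("CREATE TABLE R (f1 REAL, f2 TEXT, w TEXT)")
conn.executemany(
    "INSERT INTO R VALUES (?, ?, ?)",
    [(30, "F", "w1"), (32, "F", "w1"), (31, "M", "w2"),
     (29, "F", "w2"), (55, "F", "w2")],
)

def classify_mixed(conn, x1, delta1, x2):
    """Range predicate on the numeric feature, equality on the categorical one."""
    rows = conn.execute(
        "SELECT w, count(*) FROM R "
        "WHERE f1 >= ? AND f1 <= ? AND f2 = ? GROUP BY w",
        (x1 - delta1, x1 + delta1, x2),
    ).fetchall()
    counts = dict(rows)
    label = max(counts, key=counts.get) if counts else None
    return label, counts

label, counts = classify_mixed(conn, x1=31, delta1=2, x2="F")
print(label, counts)
```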

Parallel Execution R is partitioned into R1, R2, R3, R4 with R = R1 ∪ R2 ∪ R3 ∪ R4; the SELECT is sent to every partition

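Parallel Execution Partial counts returned by the four partitions: (ω1: 5, ω2: 2), (ω1: 7, ω2: 1), (ω1: 5, ω2: 1), (ω1: 6, ω2: 0). Count is a distributive function, so the partial counts are simply added up; running totals as results arrive: (5, 2), (12, 3), (18, 3), (23, 4). Final result: ω1: 23, ω2: 4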
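
Because count is distributive, the main server only has to add up the small per-partition results. The sketch below merges the four partial counts shown on this slide; the function name `merge_counts` and the labels `w1`/`w2` (for ω1/ω2) are illustrative, and the final majority vote is assumed from the earlier examples.

```python
from collections import Counter

# Per-partition results of the GROUP BY query, taken from the slide
partials = [
    {"w1": 5, "w2": 2},
    {"w1": 7, "w2": 1},
    {"w1": 5, "w2": 1},
    {"w1": 6, "w2": 0},
]

def merge_counts(partials):
    """Add up per-class counts; the order of arrival does not matter."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts, it does not overwrite
    return total

total = merge_counts(partials)
label = max(total, key=total.get)
print(dict(total), label)
```

Each partition ships one small row per class regardless of how many tuples it scanned, which is consistent with the low network traffic claimed for the parallel scheme.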

Parallel Execution Small network traffic; load balancing; lightweight operations on the main server

Experimental Evaluation LOCUS vs. DTs and NNs (Weka). Synthetic datasets: ten functions [Agrawal et al., IEEE TKDE 1993]; D = 9; N ∈ [5 × 10^3, 5 × 10^6]. Real-world datasets: UCI Repository

Experimental Evaluation Classification error rate (synthetic datasets, N = 5 × 10^4)

Experimental Evaluation Effect of dataset size on classification error rate of LOCUS (synthetic datasets, N ∈ [5 × 10^3, 5 × 10^6])

Experimental Evaluation Effect of dataset size on time scalability of LOCUS (synthetic datasets, N ∈ [5 × 10^3, 5 × 10^6])

Experimental Evaluation Classification error rate (real-world datasets)

Experimental Evaluation Effect of dataset size on classification error rate (dataset CovType, N ∈ [5 × 10^3, 5 × 10^5])

Conclusions & Future Work LOCUS: lazy (complex/dynamic datasets and models); efficient (based on simple SQL queries); reliable (converging to optimal); parallelizable

Conclusions & Future Work Similar techniques for feature selection and regression; implementation of a parallel version

Questions?

Thank you!
