1
An Interval Classifier for Database Mining Applications
Rakesh Agrawal, Sakti Ghosh, Tomasz Imielinski, Bala Iyer, Arun Swami
Proceedings of the 18th VLDB Conference, Vancouver, Canada, 1992
Presentation by: Vladan Radosavljevic
2
Outline
- Introduction
- Motivation
- Interval Classifier
- Example
- Results
- Conclusion
3
Introduction
- Given a small set of labeled examples, find a classifier that can efficiently classify a large unlabeled population stored in a database
- Equivalently: retrieve all examples from the database that belong to a desired class
- Assumptions: the labeled examples are representative of the entire population, and the number of classes m is known in advance
4
Motivation
Why an Interval Classifier?
- Neural networks are not database oriented: tuples have to be retrieved into memory one at a time before classification
- Decision trees (ID3, CART): binary splits increase computation time, and pruning the tree after it is built makes tree generation more expensive
5
Interval Classifier (IC)
Key features:
- Tree classifier
- Categorical attributes: one branch per value
- Numerical attributes: the range is decomposed into k intervals, with k determined algorithmically at each node
- IC generates SQL queries as its final classification functions!
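Because the output is just a SQL predicate over attribute intervals, all tuples of a class can be retrieved with an ordinary query. Below is a minimal sketch of what such an emitted query might look like; the table name "people" and the query layout are assumptions for illustration, while the conditions mirror class A of the example later in this deck.

```python
# Hypothetical illustration of IC's output: the final classifier for one
# class is a SQL retrieval query. The table name "people" is assumed;
# the predicate mirrors class A of the example presented later.
QUERY_CLASS_A = """
SELECT *
FROM people
WHERE (age < 40 AND elevel IN (0, 1))
   OR (age >= 40 AND age < 60 AND elevel IN (0, 1, 2, 3))
   OR (age >= 60 AND elevel = 0);
"""
```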
6
Interval Classifier - Algorithm
- Partition the domain of each numerical attribute into a predefined number of intervals, and for each interval determine the winning class (the class with the largest frequency in that interval)
- For each attribute, compute the value of a goodness function, the information gain ratio (or the resubstitution error rate), and select the winning attribute A
- For each partition of attribute A, set the strength of the winning class (weak or strong) based on its frequency and a predefined threshold
[Figure: a row of intervals with winning classes R, R, R, G, G, G and strengths W, W, S, S, S, S]
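As a concrete illustration of the first and third steps, here is a minimal Python sketch that histograms a numerical attribute into k equal-width intervals and labels each interval with its winning class and strength. The function name, the equal-width split, and the strong_threshold parameter are assumptions for illustration, not the paper's exact procedure.

```python
from collections import Counter

def label_intervals(values, classes, k, strong_threshold=0.9):
    """Histogram a numerical attribute into k equal-width intervals and
    label each interval with (winning class, strength)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0               # avoid zero width if hi == lo
    hist = [Counter() for _ in range(k)]
    for v, c in zip(values, classes):
        i = min(int((v - lo) / width), k - 1)  # clamp max value into last bin
        hist[i][c] += 1
    out = []
    for h in hist:
        if not h:                              # empty interval: no winner
            out.append((None, "weak"))
            continue
        winner, freq = h.most_common(1)[0]
        strength = "strong" if freq / sum(h.values()) >= strong_threshold else "weak"
        out.append((winner, strength))
    return out
```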
7
Interval Classifier - Algorithm (continued)
- Merge adjacent intervals that have the same winning class with equal strength
- Divide the training examples among the resulting intervals
- Strong intervals become leaves labeled with their winning class
- Recurse on the weak intervals; stop when all intervals are strong or the specified maximum tree depth is reached
[Figure: three merged intervals labeled W, S, S; the two strong (S) intervals become leaves]
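The merge step is a simple run-length pass over the interval labels; a sketch under the same illustrative assumptions as above:

```python
def merge_intervals(interval_labels):
    """Merge runs of adjacent intervals that share the same
    (winning class, strength) pair. Each merged run is returned as
    (first_index, last_index, winner, strength)."""
    merged = []
    for i, (winner, strength) in enumerate(interval_labels):
        if merged and merged[-1][2:] == (winner, strength):
            first, _, w, s = merged[-1]
            merged[-1] = (first, i, w, s)       # extend the current run
        else:
            merged.append((i, i, winner, strength))
    return merged
```

Strong merged runs would then become leaves labeled with their winning class, while weak runs are pushed onto the recursion.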
8
Interval Classifier - Pruning
- Dynamic: pruning happens while the tree is generated
- Compute the accuracy of a node on the training set
- Expand the node only if its classification error is below a threshold that depends on the number of leaves and the overall accuracy
- The aim is to check whether the expansion will actually reduce the error
- To avoid pruning too aggressively, each node inherits a certain number of credits from its parent
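The slides do not spell out the threshold formula or the credit accounting, so the following is only an illustrative sketch of the control flow as described above; the function name, parameters, and the exact credit rule are guesses at the paper's intent.

```python
def should_expand(error_after_expansion, threshold, credits):
    """Illustrative dynamic-pruning decision; the exact threshold and
    credit rules come from the paper, not these slides."""
    if error_after_expansion < threshold:
        return True, credits          # expansion demonstrably reduces error
    if credits > 0:
        return True, credits - 1      # speculative expansion paid for by a credit
    return False, credits             # prune: this node becomes a leaf
```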
9
Example
- age: numerical, uniformly distributed over 20-80
- zipcode: categorical, uniformly distributed
- elevel (level of education): categorical, uniformly distributed
- Two classes:
  A: (age < 40 and elevel 0 to 1) OR (40 <= age < 60 and elevel 0 to 3) OR (age >= 60 and elevel 0)
  B: otherwise
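For concreteness, the labeling rule above written as a Python predicate (the function name is ours):

```python
def true_class(age, elevel):
    """Ground-truth labeling rule of the synthetic example."""
    if age < 40:
        return "A" if elevel in (0, 1) else "B"
    if age < 60:
        return "A" if elevel in (0, 1, 2, 3) else "B"
    return "A" if elevel == 0 else "B"
```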
10
Example
- 1000 training tuples
- Compute the class histogram for the numerical attribute age by choosing 100 equi-distant intervals, and determine the winning class for each interval
- Find the best attribute using the resubstitution error rate: 1 - sum(win_freq(interval)) / total_freq
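The error-rate formula from this slide, with a toy illustration on made-up counts:

```python
def resubstitution_error(interval_histograms):
    """1 - sum(win_freq(interval)) / total_freq, per the slide."""
    win = sum(max(h.values()) for h in interval_histograms if h)
    total = sum(sum(h.values()) for h in interval_histograms)
    return 1 - win / total

# Toy illustration (made-up counts): two intervals with class counts
# {'A': 8, 'B': 2} and {'A': 1, 'B': 9} give 1 - (8 + 9) / 20 = 0.15.
print(resubstitution_error([{"A": 8, "B": 2}, {"A": 1, "B": 9}]))
```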
11
Example
- Choose age (it has the smallest error rate) and partition its domain by merging adjacent intervals that have the same winning class with equal strength
[Figure: the age axis partitioned into merged intervals, one of them labeled with winning class B]
12
Example
- Proceed with the weak nodes and repeat the same procedure
- The final tree recovers the classes defined at the beginning:
  A: (age < 40 and elevel 0 to 1) OR (40 <= age < 60 and elevel 0 to 3) OR (age >= 60 and elevel 0)
  B: otherwise
13
Results
- Examples are generated with smooth boundaries between the groups
- Training set: 2500 tuples; test set: 10000 tuples
- Fixed precision: threshold 0.9
- Adaptive precision: adaptive threshold
- Error pruning: credits
- Function 5: nonlinear
14
Results
Comparison with ID3:
15
Conclusion
- IC interfaces efficiently with database systems
- Careful treatment of numerical attributes
- Dynamic pruning
- Open questions: too many user-defined parameters? Scalability? Are k-ary trees less accurate than binary trees in practice?
16
References
[1] R. Agrawal, S. Ghosh, T. Imielinski, B. Iyer, A. Swami: "An Interval Classifier for Database Mining Applications", in Proceedings of the 18th VLDB Conference, Vancouver, BC, Canada, 1992, pp. 560-573.
17
THANK YOU!