Efficient Rule-Based Attribute-Oriented Induction for Data Mining Authors: Cheung et al. Graduate: Yu-Wei Su Advisor: Dr. Hsu
Outline Motivation Objective Introduction Basic attribute-oriented induction Rule-based concept generalization Rule-based attribute-oriented induction Path relation An efficient rule-based AOI Performance study Conclusion Opinion
Motivation AOI induction capability is limited by the unconditional concept generalization Basic AOI has the problem of time complexity
Objective Extending the concept generalization to rule-based concept hierarchy Developing an efficient algorithm to facilitate induction
Introduction The growth in size of existing databases has far exceeded the human abilities to analyze AOI has been implemented in DBMiner The engine for concept generalization in basic AOI is the concept ascension To further enhance the capability of AOI, need to replace concept ascension
Introduction( cont.) Rule-based Concept Graph can be generalized to more than one higher level concept Induction anomaly would occur, when a concept tree is applied directly to a concept graph Instead multi-dimensional data cube to generalized relation can improve performance
Basic attribute-oriented induction The purpose of AOI is to discover rules from relations The primary is concept generalization
Basic attribute-oriented induction( cont.) By merging the records which have the same generalized values, characteristics of the data can be captured Primitives in AOI Task-relevant data Data that want to be analyzed, called initial relation Background knowledge To support concept hierarchies
Basic attribute-oriented induction( cont.) Representation of learning results Concept generalization Notation Desirable level: an attribute contain a small number of distinct values Desirable attribute threshold: to control the number of distinct values of an attribute Minimum desirable level: when generalized to a level lower than the current one, the distinct number would more than desirable attribute threshold
Basic attribute-oriented induction( cont.) Prime relation: every attribute is at minimum desirable level R ’ 1. Inducing the prime relation R ’ from initial relation 2. Progressive generalization
Basic attribute-oriented induction( cont.)
Rule-based concept generalization Rule-based concept generalization is a more general scheme and a concept can ascend via more than one path Generalization rule can be determine whether a concept can be generalized along one path
Rule-based concept generalization( cont.) Deductive rule generalization The rule associated with a generalization path Concept hierarchy associated with deduction generalization rules is called deduction-rule- based concept graph
Rule-based attribute-oriented induction General model definition (DB, CH, DS, KR, t a ) DB: underlying database CH: a set of rule-based concept hierarchies DS: deduction system supporting CH KR: representation scheme of learned result Ta: desirable attribute threshold The generalization and rule creation process is fundamentally the same as basic AOI
Rule-based attribute-oriented induction( cont.) Concept ascension require additional information and it may not available in the prime relation, called Induction anomaly An depending attribute has been removed An depending attribute can be generalized too high to match the condition An depending condition can only be evaluated against the initial relation
Rule-based attribute-oriented induction( cont.)
Path relation Further generalization in the rule-based case has to be started again from the initial relation and it is costly and wasteful Using a path relation to capture the generalization result from one of the rules on the initial relation A tuple in initial relation can only be generalized via a unique path to the root
Path relation( cont.) Before induction starts, a preprocessing is used to identify and label the generalization paths of all the attributes Every tuple can be transformed into a tuple of ids of the associated generalization paths Path relation has captured completely the generalization result of the initial relation
Path relation( cont.)
Another issue in rule-based generalization is the cyclic dependency Generalization is depended on another attribute, if dependency is cyclic, the deadlock will happened
Path relation( cont.)
Data structure for generalization
An efficient rule-based AOI
An efficient rule-based AOI( cont.)
Performance study The time complexity of relation path algorithm is O(n) and backtracking algorithm is O(n log n) The difference between two algorithms Backtracking uses generalized relation as data structure Further generalization after prime relation has been generated is based in the prime relation
Performance study( cont.) Experiment data is a synthesized student data Attribute name: name, status, sex, age, GPA Conditions Graduate students are at least 22 years Ph.D. student are at least 25 years Graduate student ’ s GPA are at least 3 Generalization order
Performance study( cont.)
Conclusion Concept hierarchies may contain unconditional and conditional rules, which enlarges the application domain Path relation algorithm has an improved complexity of O(n) Rule-based induction can extend the capability of DBMiner
Opinion The numerical attribute problem still exist Experiments are to few and can ’ t see the improved results