ADBIS 2007
A Clustering Approach to Generalized Pattern Identification Based on Multi-instanced Objects with DARA
Rayner Alfred, Dimitar Kazakov
Artificial Intelligence Group, Computer Science Department, York University
(1st October, 2007)
1st October 2007, ADBIS 2007, Varna, Bulgaria
Overview
Introduction
The Multi-relational Setting
The Data Summarization Approach
  – Dynamic Aggregation of Relational Attributes
Experimental Evaluations
Experimental Results
Conclusions
Introduction
Clustering is the process of grouping data that share similar characteristics.
Despite the growing volume of data stored in relational databases, few studies address clustering across multiple relations.
In a relational database with one-to-many associations between records, each table record (or object) can form numerous patterns of association with records from other tables.
  – instance(id1 = 1) = {(X,111), (X,112), (X,113), (Z,117)}.
Introduction
Clustering in a multi-relational environment has been studied in Relational Distance-Based Clustering
  – the similarity between two objects is defined on the basis of the tuples that can be joined to each of them
  – relatively expensive
  – not able to generate interpretable rules
Our approach: a data summarization approach, borrowed from information retrieval theory, to cluster such multi-instance data
  – Scalable
  – Able to generate interpretable rules
The Multi-Relational Setting
Let DB be a database consisting of n objects.
Let $R := \{R_1, \dots, R_m\}$ be the set of different representations in DB. Each object may have zero or more instances of each representation $R_k$, such that $|R_k| \geq 0$, where $k = 1, \dots, m$.
Each object $O_i$ in DB, where $i = 1, \dots, n$, can be described by at most m different representations, each with its frequency:
$O_i := \{R_1(O_i) : |R_1(O_i)| : |Ob(R_1)|,\ \dots,\ R_m(O_i) : |R_m(O_i)| : |Ob(R_m)|\}$
where $R_k(O_i)$ is the k-th representation in the i-th object, $|R_k(O_i)|$ is the frequency of the k-th representation in the i-th object, and $|Ob(R_k)|$ is the number of objects that have the k-th representation.
If all the different representations exist for $O_i$, then the total number of different representations for $O_i$ is $|O_i| = m$; otherwise $|O_i| < m$.
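The representation of an object by its frequencies can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `describe_objects`, the `(object_id, representation)` row format, and the sample rows are hypothetical and do not come from the slides.

```python
from collections import Counter

def describe_objects(rows):
    """rows: iterable of (object_id, representation) pairs, e.g. the result
    of joining a target table with a one-to-many related table (assumed)."""
    # rf[obj][rep] plays the role of |R_k(O_i)|: how often a representation
    # occurs within one object.
    rf = {}
    for obj_id, rep in rows:
        rf.setdefault(obj_id, Counter())[rep] += 1

    # of[rep] plays the role of |Ob(R_k)|: how many objects contain it.
    of = Counter()
    for counts in rf.values():
        of.update(counts.keys())

    # O_i := {R_k : |R_k(O_i)| : |Ob(R_k)|}, stored as rep -> (rf, of).
    return {
        obj_id: {rep: (count, of[rep]) for rep, count in counts.items()}
        for obj_id, counts in rf.items()
    }

# Hypothetical rows: object 1 has representation X three times and Z once.
rows = [(1, "X"), (1, "X"), (1, "X"), (1, "Z"), (2, "X"), (2, "Y")]
objects = describe_objects(rows)
```

Here object 1 is described by two of the three representations, so in the notation of the slide $|O_1| = 2 < m = 3$.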
Data Summarization Approach
We apply the vector-space model to represent an object.
We employ the rf-iof term weighting model, borrowed from the tf-idf model of information retrieval theory, in which each object $O_i$, $i = 1, \dots, n$, is represented as
$(rf_1 \cdot \log(n/of_1),\ rf_2 \cdot \log(n/of_2),\ \dots,\ rf_m \cdot \log(n/of_m))$
where $rf_j$ is the frequency of the j-th representation in the object, $of_j$ is the number of objects that contain the j-th representation, and n is the number of objects.
In the Dynamic Aggregation of Relational Attributes (DARA) algorithm, we convert the data representation from a relational model into a vector space model
  – Based on contents
  – Based on structured contents
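The rf-iof weighting above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `rf_iof_vectors`, the input format (a dict from object id to a list of representation terms), the sample data, and the use of the natural logarithm are all assumptions, since the slides do not state the log base.

```python
import math
from collections import Counter

def rf_iof_vectors(bags):
    """bags: dict mapping object id -> list of representation terms."""
    n = len(bags)  # number of objects

    # of[t]: number of objects that contain term t at least once.
    of = Counter()
    for terms in bags.values():
        of.update(set(terms))

    vocab = sorted(of)  # fixed term order shared by all vectors
    vectors = {}
    for obj_id, terms in bags.items():
        rf = Counter(terms)  # rf[t]: frequency of t in this object
        # Weight of term t: rf_t * log(n / of_t); absent terms get 0.
        vectors[obj_id] = [rf[t] * math.log(n / of[t]) for t in vocab]
    return vocab, vectors

# Hypothetical data: three objects over the terms X, Y and Z.
bags = {1: ["X", "X", "X", "Z"], 2: ["X", "Y"], 3: ["Y", "Y"]}
vocab, vectors = rf_iof_vectors(bags)
```

A term that occurs in every object gets weight 0 because log(n/n) = 0, which is the usual tf-idf behaviour of discounting uninformative terms.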
Dynamic Aggregation of Relational Attributes
Let $F = (F_1, F_2, F_3, \dots, F_k)$ denote k attributes, and let $dom(F_i)$ denote the domain of the i-th attribute.
An instance may have the values $(F_{1,a}, F_{2,b}, F_{3,c}, F_{4,d}, \dots, F_{k-1,b}, F_{k,n})$, where $a \in dom(F_1)$, $b \in dom(F_2)$, …, $n \in dom(F_k)$.
Contents Based Data Summarization
  – No attributes are concatenated to represent each object (p = 1, where p is the number of attributes concatenated)
  – If p = 1, we have $1{:}F_{1,a},\ 2{:}F_{2,b},\ 3{:}F_{3,c},\ 4{:}F_{4,d},\ \dots,\ (k-1){:}F_{k-1,b},\ k{:}F_{k,n}$
Structured Contents Based Data Summarization
  – Attributes are concatenated based on the value of p, where p > 1.
  – If p = 2, we have (provided an even number of fields) $1{:}F_{1,a}F_{2,b},\ 2{:}F_{3,c}F_{4,d},\ \dots,\ (k/2){:}F_{k-1,b}F_{k,n}$
  – If p = k, we have $1{:}F_{1,a}F_{2,b}F_{3,c}F_{4,d} \dots F_{k-1,b}F_{k,n}$, produced as a single term.
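The effect of p on the terms produced for one instance can be sketched as below. The function name `summarize_instance`, the space separator inside a term, and the sample values are assumptions made for illustration. When k is not a multiple of p, this sketch places the leftover attributes in a shorter final term.

```python
def summarize_instance(values, p):
    """Concatenate every p consecutive attribute values into one term.

    values: the attribute values of a single instance, in attribute order.
    p: number of attributes concatenated per term (p = 1 means none).
    """
    if p < 1:
        raise ValueError("p must be at least 1")
    terms = []
    for start in range(0, len(values), p):
        chunk = values[start:start + p]
        # Prefix each term with its 1-based position, as in "1:F1,a F2,b".
        terms.append(f"{start // p + 1}:" + " ".join(chunk))
    return terms

# Hypothetical instance with k = 4 attribute values.
instance = ["a", "b", "c", "d"]
contents_based = summarize_instance(instance, 1)  # one term per attribute
structured_p2 = summarize_instance(instance, 2)   # pairs of attributes
structured_pk = summarize_instance(instance, 4)   # a single term
```

Each resulting term is then treated like a word in a document, so the rf-iof weights can be computed over the terms of all instances belonging to an object.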
Experimental Evaluations
The DARA algorithm can also be seen as an aggregation function for multiple instances of an object,
  – coupled with the C4.5 classifier (J48 in WEKA) [20], as an induction algorithm that is run on DARA's transformed data representation.
All experiments with DARA and C4.5 were performed using leave-one-out cross-validation estimation with different values of p, where p denotes the number of attributes being concatenated.
We chose a well-known dataset, Mutagenesis.
Experimental Evaluations
Three different sets of background knowledge are used (referred to as experiments B1, B2 and B3).
  – B1: The atoms in the molecule are given, as well as the bonds between them, the type of each bond, and the element and type of each atom.
  – B2: Besides B1, the charges of the atoms are added.
  – B3: Besides B2, the log of the compound's octanol/water partition coefficient (logP) and the energy of the compound's lowest unoccupied molecular orbital ($\epsilon_{LUMO}$) are added.
We perform a leave-one-out cross-validation using C4.5 for different numbers of bins, b, tested for B1, B2 and B3.
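The leave-one-out estimation used in these experiments can be sketched generically. The experiments themselves use C4.5 (J48 in WEKA); the sketch below swaps in a simple 1-nearest-neighbour classifier purely so that the example is self-contained, and the function names and toy data are hypothetical.

```python
def leave_one_out_accuracy(vectors, labels, classify):
    """Hold out each object once, train on the rest, and score the guess."""
    correct = 0
    for i in range(len(vectors)):
        train_x = vectors[:i] + vectors[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if classify(train_x, train_y, vectors[i]) == labels[i]:
            correct += 1
    return correct / len(vectors)

def nearest_neighbour(train_x, train_y, x):
    """Stand-in classifier (NOT C4.5): label of the closest training vector."""
    def squared_distance(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    best = min(range(len(train_x)),
               key=lambda j: squared_distance(train_x[j], x))
    return train_y[best]

# Toy vectors standing in for DARA's transformed representation.
vectors = [[0.0, 0.1], [0.1, 0.0], [1.0, 0.9], [0.9, 1.0]]
labels = ["neg", "neg", "pos", "pos"]
accuracy = leave_one_out_accuracy(vectors, labels, nearest_neighbour)
```

Any classifier with the same `(train_x, train_y, x)` signature could be passed in place of `nearest_neighbour`.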
Experimental Results
B1 has the schema Molecule(ID, ATOM1, ATOM2, TYPE_ATOM1, TYPE_ATOM2, BOND_TYPE).
We performed a leave-one-out cross-validation estimation for p = 1, 2, 3, 4, 5.
For B1, the predictive accuracy of the learned decision tree is highest when p is 2 or 5.
Using correlation-based feature selection (CFS in WEKA), we found that two attributes, the first element's type and the second element's type, are highly correlated with the class membership, yet uncorrelated with each other.
This means that an attribute combining these two would be relevant to the learning task and would split the instance space in a suitable manner.
The data contains this composite attribute when p = 2, 4 and 5, but not when p = 1 or 3.
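The claim about which values of p place the two type attributes in the same term can be checked with a short sketch. It assumes that ID is the object key and is not itself concatenated, and that attributes are grouped consecutively in schema order; both are assumptions about how the grouping is applied, and the helper names are hypothetical.

```python
def group_attributes(names, p):
    """Split the attribute names into consecutive groups of size p."""
    return [names[i:i + p] for i in range(0, len(names), p)]

# B1 attributes in schema order, with ID left out as the object key (assumed).
B1_ATTRIBUTES = ["ATOM1", "ATOM2", "TYPE_ATOM1", "TYPE_ATOM2", "BOND_TYPE"]

def types_share_a_term(p):
    """True if TYPE_ATOM1 and TYPE_ATOM2 land in the same group for this p."""
    wanted = {"TYPE_ATOM1", "TYPE_ATOM2"}
    return any(wanted <= set(group)
               for group in group_attributes(B1_ATTRIBUTES, p))

together = [p for p in range(1, 6) if types_share_a_term(p)]
```

Under these assumptions the two type attributes share a term for p = 2, 4 and 5: with p = 3 the groups are (ATOM1, ATOM2, TYPE_ATOM1) and (TYPE_ATOM2, BOND_TYPE), which separates them.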
Experimental Results
In B2, two attributes are added to B1: the charges of both atoms.
We performed a leave-one-out cross-validation estimation using the C4.5 classifier for p ∈ {1, 2, 3, 4, 5, 6, 7}.
Higher prediction accuracy is obtained when p = 5, compared to learning from B1 when p = 5.
When p = 5, we have two compound attributes, [ID, ATOM1, ATOM2, TYPE_ATOM1, TYPE_ATOM2, BOND_TYPE] and [ATOM1_CHARGE, ATOM2_CHARGE].
There is a drop in performance when p = 1, 2 and 7.
Testing with the correlation-based feature selection function provides a possible explanation of these results.
Experimental Results
Leave-One-Out CV Estimation Accuracy on Mutagenesis (B1, B2, B3)
Experimental Results
Comparison of performance accuracy on the Mutagenesis dataset
The results show that (1) no other algorithm outperformed ours on all datasets, and (2) for each of the other algorithms listed in the table, there is a dataset on which our algorithm performed better.
Conclusions
We presented an algorithm that transforms relational datasets into a vector space model suitable for clustering operations, as a means of summarizing multiple instances.
Varying the number of concatenated attributes p for clustering influences the predictive accuracy.
An increase in accuracy coincides with grouping together attributes that are highly correlated with the class membership.
The prediction accuracy degrades when the number of concatenated attributes is increased further.
Data summarization performed by DARA can be beneficial in summarizing datasets in a complex multi-relational environment, in which datasets are stored in multiple levels of one-to-many relationships.
Thank You
A Clustering Approach to Generalized Pattern Identification Based on Multi-instanced Objects with DARA