ADBIS 2007
A Clustering Approach to Generalized Pattern Identification Based on Multi-instanced Objects with DARA
Rayner Alfred, Dimitar Kazakov
Artificial Intelligence Group, Computer Science Department, University of York
(1st October 2007)

Overview
Introduction
The Multi-relational Setting
The Data Summarization Approach
– Dynamic Aggregation of Relational Attributes
Experimental Evaluations
Experimental Results
Conclusions

Introduction
Clustering is the process of grouping data that share similar characteristics.
Despite the growing volume of datasets stored in relational databases, few studies handle clustering across multiple relations.
In a dataset stored in a relational database with one-to-many associations between records, each table record (or object) can form numerous patterns of association with records from other tables, e.g.
– instance(id1 = 1) = {(X,111), (X,112), (X,113), (Z,117)}
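As an illustration of how such a multi-instance pattern arises, the following minimal Python sketch joins a hypothetical target record to its one-to-many tuples; the table layout and column names are invented for the example.

```python
# Hypothetical illustration of the one-to-many setting: a target record
# (id1 = 1) is associated with four tuples in a secondary table, giving
# the instance set shown above.
target = {"id1": 1}

secondary = [  # (id1, category, code) rows in a related table
    (1, "X", 111),
    (1, "X", 112),
    (1, "X", 113),
    (1, "Z", 117),
    (2, "Y", 120),
]

# The multi-instance "pattern" formed by object id1 = 1:
instance = {(cat, code) for (fk, cat, code) in secondary if fk == target["id1"]}
print(instance)  # {('X', 111), ('X', 112), ('X', 113), ('Z', 117)}
```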

Introduction
Clustering in a multi-relational environment has been studied in Relational Distance-Based Clustering, in which the similarity between two objects is defined on the basis of the tuples that can be joined to each of them. This approach is
– relatively expensive
– not able to generate interpretable rules
Our approach: a data summarization approach, borrowed from information retrieval theory, to cluster such multi-instance data. It is
– scalable
– able to generate interpretable rules
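To make the contrast concrete, here is a minimal sketch of the idea behind tuple-based relational similarity; the Jaccard overlap used here is a simple stand-in, not the actual measure used in Relational Distance-Based Clustering, and the data are hypothetical.

```python
# Similarity between two objects computed from the sets of tuples that
# can be joined to each of them. Jaccard overlap is a simple stand-in
# for the more elaborate relational distance measures.
def tuple_set_similarity(tuples_a, tuples_b):
    a, b = set(tuples_a), set(tuples_b)
    return len(a & b) / len(a | b) if a | b else 1.0

obj1 = [("X", 111), ("X", 112), ("Z", 117)]
obj2 = [("X", 111), ("Y", 120)]
print(tuple_set_similarity(obj1, obj2))  # 0.25
```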

The Multi-Relational Setting
Let DB be a database consisting of n objects. Let R := {R_1, …, R_m} be the set of different representations in DB; each object may have zero or more representations of each R_i, such that |R_i| ≥ 0, where i = 1, …, m.
Each object O_k in DB, where k = 1, …, n, can be described by at most m different representations, each with its frequency:
O_i := {R_1(O_i) : |R_1(O_i)| : |Ob(R_1)|, …, R_m(O_i) : |R_m(O_i)| : |Ob(R_m)|}
where R_k(O_i) denotes the k-th representation of the i-th object, |R_k(O_i)| denotes the frequency of the k-th representation in the i-th object, and |Ob(R_k)| denotes the number of objects having the k-th representation.
If all the different representations exist for O_i, then |O_i| = m; otherwise |O_i| < m.
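A minimal sketch of how these frequency counts could be computed, assuming each object is given as a list of its representations (the data are hypothetical):

```python
from collections import Counter

# Hypothetical multi-instance data: object id -> list of representations
# (each representation here is a tuple of attribute values).
objects = {
    1: [("X", 111), ("X", 111), ("Z", 117)],
    2: [("X", 111), ("Y", 120)],
}

# |R_k(O_i)|: frequency of each representation within each object
rep_freq = {oid: Counter(reps) for oid, reps in objects.items()}

# |Ob(R_k)|: number of objects containing each representation
obj_freq = Counter(r for reps in objects.values() for r in set(reps))

print(rep_freq[1][("X", 111)])  # 2 -> frequency within object 1
print(obj_freq[("X", 111)])     # 2 -> objects containing this representation
```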

Data Summarization Approach
We apply the vector-space model to represent an object, employing the rf-iof term weighting scheme, borrowed from the tf-idf scheme in information retrieval theory, in which each object O_i, i = 1, …, n, is represented as
(rf_1·log(n/of_1), rf_2·log(n/of_2), …, rf_m·log(n/of_m))
where rf_j is the frequency of the j-th representation in the object, of_j is the number of objects that contain the j-th representation, and n is the number of objects.
In the Dynamic Aggregation of Relational Attributes (DARA) algorithm, we convert the data representation from a relational model into a vector-space model
– based on contents
– based on structured contents
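The rf-iof weighting above maps directly to code. The following sketch builds the weighted vectors, assuming each object is given as a list of its representations (terms); the input data are placeholders.

```python
import math
from collections import Counter

def rf_iof_vectors(objects):
    """Build rf-iof vectors: rf_j * log(n / of_j) per representation j.

    `objects` maps object id -> list of representations (terms).
    A minimal sketch of the weighting described above, not DARA itself.
    """
    n = len(objects)
    # of_j: number of objects containing representation j
    of = Counter(r for reps in objects.values() for r in set(reps))
    vocab = sorted(of)  # fix an ordering of representations
    vectors = {}
    for oid, reps in objects.items():
        rf = Counter(reps)  # rf_j: frequency of j within this object
        vectors[oid] = [rf[r] * math.log(n / of[r]) for r in vocab]
    return vocab, vectors

vocab, vectors = rf_iof_vectors({
    1: ["Xa", "Xa", "Zb"],
    2: ["Xa", "Yc"],
})
print(vocab, vectors)
```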

Dynamic Aggregation of Relational Attributes
Let F = (F_1, F_2, F_3, …, F_k) denote k attributes, and let dom(F_i) denote the domain of the i-th attribute. An instance may have the values (F_1,a, F_2,b, F_3,c, F_4,d, …, F_k-1,b, F_k,n), where a ∈ dom(F_1), b ∈ dom(F_2), …, n ∈ dom(F_k).
Contents-Based Data Summarization
– None of the attributes are concatenated to represent each object (p = 1, where p is the number of attributes concatenated)
– If p = 1, we have 1:F_1,a, 2:F_2,b, 3:F_3,c, 4:F_4,d, …, k-1:F_k-1,b, k:F_k,n
Structured-Contents-Based Data Summarization
– Attributes are concatenated based on the value of p, where p > 1
– If p = 2 (provided an even number of fields), we have 1:F_1,a F_2,b, 2:F_3,c F_4,d, …, (k/2):F_k-1,b F_k,n
– If p = k, then a single term 1:F_1,a F_2,b F_3,c F_4,d … F_k-1,b F_k,n is produced
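A minimal sketch of this position-tagged concatenation, assuming the number of attributes is divisible by p; the attribute values are hypothetical strings:

```python
def make_terms(instance, p):
    """Concatenate attribute values in groups of p, producing position-tagged
    terms as on the slide: p = 1 keeps attributes separate; p = k yields a
    single term. Assumes len(instance) is divisible by p.
    """
    return [
        f"{i // p + 1}:" + "".join(instance[i:i + p])
        for i in range(0, len(instance), p)
    ]

instance = ["F1a", "F2b", "F3c", "F4d"]
print(make_terms(instance, 1))  # ['1:F1a', '2:F2b', '3:F3c', '4:F4d']
print(make_terms(instance, 2))  # ['1:F1aF2b', '2:F3cF4d']
print(make_terms(instance, 4))  # ['1:F1aF2bF3cF4d']
```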

Experimental Evaluations
The DARA algorithm can also be seen as an aggregation function for multiple instances of an object,
– coupled with the C4.5 classifier (J48 in WEKA) [20] as an induction algorithm that is run on DARA's transformed data representation.
All experiments with DARA and C4.5 were performed using leave-one-out cross-validation with different values of p, where p denotes the number of attributes being concatenated.
We chose the well-known Mutagenesis dataset.
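A hedged sketch of this evaluation loop, using scikit-learn's CART decision tree as a stand-in for C4.5 (J48 in WEKA, which the paper used); the feature matrix here is a random placeholder for DARA's rf-iof output:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# X would be the rf-iof vectors produced by DARA for a given p,
# y the class labels; both are random placeholders here.
rng = np.random.default_rng(0)
X = rng.random((30, 8))          # placeholder feature matrix
y = rng.integers(0, 2, size=30)  # placeholder binary labels

# Leave-one-out cross-validation, as in the experiments.
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=LeaveOneOut())
print(f"leave-one-out accuracy: {scores.mean():.3f}")
```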

Experimental Evaluations
Three different sets of background knowledge were used (referred to as experiments B1, B2 and B3):
– B1: the atoms in the molecule are given, as well as the bonds between them, the type of each bond, and the element and type of each atom
– B2: besides B1, the charges of the atoms are added
– B3: besides B2, the log of the compound's octanol/water partition coefficient (logP) and the energy of the compound's lowest unoccupied molecular orbital (εLUMO) are added
We perform a leave-one-out cross-validation using C4.5 for different numbers of bins, b, tested on B1, B2 and B3.

Experimental Results
B1 has the schema Molecule(ID, ATOM1, ATOM2, TYPE_ATOM1, TYPE_ATOM2, BOND_TYPE).
We performed a leave-one-out cross-validation estimation for p = 1, 2, 3, 4, 5.
For B1, the predictive accuracy of the decision tree learned is the highest when p is 2 or 5.
We found that the attributes first element's type and second element's type are highly correlated with the class membership, yet uncorrelated with each other (using correlation-based feature selection, CFS, in WEKA).
This means that an attribute combining these two would be relevant to the learning task and would split the instance space in a suitable manner.
The data contain this composite attribute when p = 2, 4 and 5, but not when p = 1 or 3.
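A rough sketch of the CFS intuition invoked here: two attributes, each correlated with the class yet uncorrelated with each other, so that only their combination splits the instance space well. The data below are synthetic placeholders, not the Mutagenesis attributes.

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.random(500)                 # stand-in for first element's type
b = rng.random(500)                 # stand-in for second element's type
cls = (a + b > 1.0).astype(float)   # class depends on both attributes

def corr(x, y):
    # absolute Pearson correlation between two 1-D arrays
    return abs(np.corrcoef(x, y)[0, 1])

print(corr(a, cls))  # ~0.58: correlated with the class
print(corr(b, cls))  # ~0.58: correlated with the class
print(corr(a, b))    # ~0.0: uncorrelated with each other
```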

Experimental Results
In B2, two attributes are added to B1: the charges of both atoms.
We performed a leave-one-out cross-validation estimation using the C4.5 classifier for p ∈ {1, 2, 3, 4, 5, 6, 7}.
A higher prediction accuracy is obtained when p = 5, compared to learning from B1 with p = 5.
When p = 5, we have two compound attributes, [ID, ATOM1, ATOM2, TYPE_ATOM1, TYPE_ATOM2, BOND_TYPE] and [ATOM1_CHARGE, ATOM2_CHARGE].
There is a drop in performance when p = 1, 2 and 7.
Testing using the correlation-based feature selection function provides a possible explanation of these results.

Experimental Results
Leave-one-out CV estimation accuracy on the Mutagenesis dataset (B1, B2, B3)

Experimental Results
Comparison of performance accuracy on the Mutagenesis dataset
The results show that (1) no other algorithm outperformed ours on all datasets, and (2) for each of the other algorithms listed in the table, there is a dataset on which our algorithm performed better.

Conclusions
We presented an algorithm that transforms relational datasets into a vector-space model suitable for clustering operations, as a means of summarizing multiple instances.
Varying the number of concatenated attributes p for clustering has an influence on the predictive accuracy.
An increase in accuracy coincides with the cases of grouping together attributes that are highly correlated with the class membership.
The prediction accuracy is degraded when the number of attributes concatenated is increased further.
The data summarization performed by DARA can be beneficial in summarizing datasets in a complex multi-relational environment, in which datasets are stored in multiple levels of one-to-many relationships.

Thank You
A Clustering Approach to Generalized Pattern Identification Based on Multi-instanced Objects with DARA