Matic Perovšek, Anže Vavpetič, Nada Lavrač
Jožef Stefan Institute, Slovenia
A Wordification Approach to Relational Data Mining: Early Results
Overview
- Introduction
- Methodology
- Experimental results
- Conclusion
Introduction
- Relational data mining algorithms aim to induce models and/or relational patterns from multiple tables
- Individual-centered relational databases can be transformed to a single-table form (propositionalization)
Motivation
- Wordification is inspired by text mining techniques
- Large number of simple, easy-to-understand features
- Greater scalability, handling large datasets
- Can be used as a preprocessing step to propositional learners, as well as to declarative modeling / constraint solving (De Raedt et al., today's invited talk)
Methodology
1. Transformation from relational database to a textual corpus
2. TF-IDF weight calculation
Transformation from relational database to a textual corpus
- One individual of the initial relational database -> one text document
- Features -> the words of this document
- Words constructed as a combination:
Transformation from relational database to a textual corpus
- For each individual, the words generated for the main table are concatenated with words generated from the secondary (BK) tables
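The transformation step can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the slide does not preserve the exact word format, so the `table_attribute_value` pattern used by `make_word` is an assumption, and the function names, the foreign-key layout and the toy movie/actor rows are all hypothetical.

```python
def make_word(table, attribute, value):
    # ASSUMPTION: a word combines table name, attribute name and value.
    # The exact format is not given in the slides.
    return f"{table}_{attribute}_{value}".lower().replace(" ", "_")


def wordify(main_table, main_rows, secondary_tables, key="id"):
    """Build one 'document' (list of words) per individual.

    secondary_tables maps a table name to (foreign_key, rows), where
    foreign_key is the column that points to the main table's key.
    """
    documents = {}
    for row in main_rows:
        individual = row[key]
        # Words generated from the main table.
        words = [make_word(main_table, a, v) for a, v in row.items() if a != key]
        # Concatenate words generated from the linked secondary tables.
        for name, (foreign_key, rows) in secondary_tables.items():
            for linked in rows:
                if linked[foreign_key] == individual:
                    words.extend(
                        make_word(name, a, v)
                        for a, v in linked.items()
                        if a != foreign_key
                    )
        documents[individual] = words
    return documents


# Hypothetical toy data, for illustration only.
movies = [{"id": 1, "genre": "drama"}, {"id": 2, "genre": "comedy"}]
actors = [
    {"movie_id": 1, "gender": "f"},
    {"movie_id": 1, "gender": "m"},
    {"movie_id": 2, "gender": "m"},
]
docs = wordify("movie", movies, {"actor": ("movie_id", actors)})
```

Each individual (here, a movie) ends up with a bag of words that a standard text mining pipeline can consume.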
Example
TF-IDF weights
- No explicit use of existential variables in our features; TF-IDF weights are used instead
- The weight of a word gives a strong indication of how relevant the feature is for the given individual
- The TF-IDF weights can then be used either to filter out words of low importance or directly by a propositional learner
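A minimal sketch of the weighting and filtering steps, assuming the common variant tf = (word count / document length) and idf = log(N / document frequency); the slides do not state which TF-IDF variant was used. The function names, the threshold and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(documents):
    """Compute TF-IDF weights for a dict {individual: list of words}.

    ASSUMPTION: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter()
    for words in documents.values():
        doc_freq.update(set(words))

    weights = {}
    for individual, words in documents.items():
        counts = Counter(words)
        total = len(words)
        weights[individual] = {
            w: (c / total) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        }
    return weights


def filter_words(weights, threshold):
    # Keep only words whose weight reaches the (hypothetical) threshold.
    return {
        individual: {w: x for w, x in ws.items() if x >= threshold}
        for individual, ws in weights.items()
    }


# Hypothetical toy documents, for illustration only.
toy_docs = {1: ["a", "b", "b"], 2: ["a", "c"]}
toy_weights = tfidf(toy_docs)
kept = filter_words(toy_weights, 0.1)
```

Under this variant, a word that occurs in every document (here `a`) gets weight zero and is removed by any positive threshold, while the remaining weights can be passed on as feature values to a propositional learner.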
Experimental results
- Slovenian traffic accidents database
- IMDB database
  - Top 250 and bottom 100 movies
  - Movies, actors, movie genres, directors, director genres
- Applied the wordification methodology
- Performed association rule learning
Experimental results
Conclusion
- Novel propositionalization technique called Wordification
- Greater scalability
- Easy-to-understand features
Further work:
- Test on larger databases
- Experimental comparison with other propositionalization techniques
- Combine with a propositionalization-like approach to mining heterogeneous information networks (Grčar et al. 2012), applicable to CLP in data preprocessing
Reference: Grčar, Trdin, Lavrač: A Methodology for Mining Document-Enriched Heterogeneous Information Networks, Computer Journal 2012