
Anna Atramentov and Vasant Honavar* Artificial Intelligence Laboratory Department of Computer Science Iowa State University Ames, IA 50011, USA www.cs.iastate.edu/~honavar/aigroup.html.


1 Speeding Up Multi-Relational Data Mining
Anna Atramentov and Vasant Honavar*
Artificial Intelligence Laboratory, Department of Computer Science, Iowa State University, Ames, IA 50011, USA
www.cs.iastate.edu/~honavar/aigroup.html
* Support provided in part by National Science Foundation, Carver Foundation, and Pioneer Hi-Bred, Inc.

2 Motivation
Importance of relational learning:
  Growth of data stored in multi-relational databases (MRDB)
  Techniques for learning from unstructured data often extract the data into an MRDB
One of the promising approaches to relational learning:
  MRDM (Multi-Relational Data Mining) framework developed by Knobbe et al. (1999)
  MRDTL (Multi-Relational Decision Tree Learning) algorithm implemented by Leiva et al. (2002)
Goal:
  Speed up the MRDM framework, and in particular the MRDTL algorithm

3 Problem Formulation
Given: Data stored in a relational database
Goal: Learn a predictive model for the instances in the target table

Example of a multi-relational database (schema and instances):

Department
ID   Specialization     #Students
d1   Math               1000
d2   Physics            300
d3   Computer Science   400

Staff
ID   Name     Department   Position            Salary
p1   Dale     d1           Professor           70-80k
p2   Martin   d3           Postdoc             30-40k
p3   Victor   d2           Visitor Scientist   40-50k
p4   David    d3           Professor           80-100k

Grad.Student
ID   Name     GPA   #Publications   Advisor   Department
s1   John     2.0   4               p1        d3
s2   Lisa     3.5   10              p4        d3
s3   Michel   3.9   3               p4        d4

4 MRDM overview. Selection graphs
Nodes correspond to the tables from the database
Edges correspond to the associations between tables
A selection graph corresponds to the subset of the instances from the target table having some property
It is a way of specifying attributes in the relational setting

Figure: selection graph with nodes Staff, Grad.Student (GPA > 3.9), Grad.Student, Department (Specialization = math)

Staff
ID   Name     Department   Position            Salary
p1   Dale     d1           Professor           70-80k
p2   Martin   d3           Postdoc             30-40k
p3   Victor   d2           Visitor Scientist   40-50k
p4   David    d3           Professor           80-100k

Subset of Staff
ID   Name     Department   Position            Salary
p2   Martin   d3           Postdoc             30-40k
p3   Victor   d2           Visitor Scientist   40-50k

5 MRDM overview. Transforming selection graphs into SQL queries

Staff joined with Grad.Student:
  select distinct T0.id
  from Staff T0, Graduate_Student T1
  where T0.id = T1.Advisor

Staff with no matching Grad.Student (negated subquery):
  select distinct T0.id
  from Staff T0
  where T0.id not in
    ( select T1.Advisor
      from Graduate_Student T1 )

Staff joined with Grad.Student, excluding staff matched by a Grad.Student with GPA > 3.9:
  select distinct T0.id
  from Staff T0, Graduate_Student T1
  where T0.id = T1.Advisor
    and T0.id not in
    ( select T1.Advisor
      from Graduate_Student T1
      where T1.GPA > 3.9 )

Generic query:
  select distinct T0.primary_key
  from table_list
  where join_list and condition_list
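The three queries on this slide can be tried against the example database from slide 3. The following is a minimal sketch using Python's standard sqlite3 module, not the authors' implementation. The lower-case column names, the helper names `build_example_db` and `selected_ids`, and the choice to have each negated subquery return the `advisor` foreign key (so that it is comparable with `T0.id`) are assumptions made for illustration.

```python
import sqlite3


def build_example_db():
    """Create an in-memory copy of the Staff and Graduate_Student tables from slide 3."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Staff (
            id TEXT PRIMARY KEY, name TEXT, department TEXT,
            position TEXT, salary TEXT);
        CREATE TABLE Graduate_Student (
            id TEXT PRIMARY KEY, name TEXT, gpa REAL,
            publications INTEGER, advisor TEXT, department TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO Staff VALUES (?, ?, ?, ?, ?)",
        [
            ("p1", "Dale", "d1", "Professor", "70-80k"),
            ("p2", "Martin", "d3", "Postdoc", "30-40k"),
            ("p3", "Victor", "d2", "Visitor Scientist", "40-50k"),
            ("p4", "David", "d3", "Professor", "80-100k"),
        ],
    )
    conn.executemany(
        "INSERT INTO Graduate_Student VALUES (?, ?, ?, ?, ?, ?)",
        [
            ("s1", "John", 2.0, 4, "p1", "d3"),
            ("s2", "Lisa", 3.5, 10, "p4", "d3"),
            ("s3", "Michel", 3.9, 3, "p4", "d4"),
        ],
    )
    return conn


# Staff members who advise at least one graduate student.
HAS_STUDENT = """
    SELECT DISTINCT T0.id
    FROM Staff T0, Graduate_Student T1
    WHERE T0.id = T1.advisor
"""

# Staff members who advise no graduate student.
# Assumption: the subquery returns the advisor foreign key.
NO_STUDENT = """
    SELECT DISTINCT T0.id
    FROM Staff T0
    WHERE T0.id NOT IN (SELECT T1.advisor FROM Graduate_Student T1)
"""

# Staff members who advise a student but no student with GPA > 3.9.
HAS_STUDENT_NO_HIGH_GPA = """
    SELECT DISTINCT T0.id
    FROM Staff T0, Graduate_Student T1
    WHERE T0.id = T1.advisor
      AND T0.id NOT IN (
          SELECT T1.advisor FROM Graduate_Student T1 WHERE T1.gpa > 3.9)
"""


def selected_ids(conn, sql):
    """Return the sorted target-table ids selected by a query."""
    return sorted(row[0] for row in conn.execute(sql))


if __name__ == "__main__":
    conn = build_example_db()
    for label, sql in [("has student", HAS_STUDENT),
                       ("no student", NO_STUDENT),
                       ("has student, none above 3.9", HAS_STUDENT_NO_HIGH_GPA)]:
        print(label, selected_ids(conn, sql))
```

On the example data the third query selects the same staff as the first, because no student has a GPA strictly greater than 3.9.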

6 MRDM overview. Refinements of selection graphs
Figure: a selection graph with nodes Staff, Grad.Student (GPA > 3.9), Grad.Student, Department (Specialization = math), shown together with a refinement and a complement refinement that involve the condition GPA > 2.0 on Grad.Student
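One way to read a refinement and its complement refinement is as a split of the instances covered by the parent selection graph into two disjoint subsets. The sketch below checks this on a reduced copy of the slide 3 data using Python's sqlite3 module. It assumes that the complement is expressed as a `NOT IN` subquery, in the style of the queries on slide 5; the reduced schema and the names `PARENT`, `REFINEMENT`, `COMPLEMENT` and `ids` are hypothetical.

```python
import sqlite3


def build_db():
    """Reduced example: only the columns needed for the refinement queries."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Staff (id TEXT PRIMARY KEY);
        CREATE TABLE Graduate_Student (id TEXT PRIMARY KEY, gpa REAL, advisor TEXT);
        INSERT INTO Staff VALUES ('p1'), ('p2'), ('p3'), ('p4');
        INSERT INTO Graduate_Student VALUES
            ('s1', 2.0, 'p1'), ('s2', 3.5, 'p4'), ('s3', 3.9, 'p4');
        """
    )
    return conn


# Parent selection graph: staff who advise some graduate student.
PARENT = (
    "SELECT DISTINCT T0.id FROM Staff T0, Graduate_Student T1 "
    "WHERE T0.id = T1.advisor"
)

# Refinement: add the condition GPA > 2.0 to the Grad.Student node.
REFINEMENT = PARENT + " AND T1.gpa > 2.0"

# Complement refinement (assumed encoding): instances of the parent
# that are not covered by the refinement.
COMPLEMENT = PARENT + " AND T0.id NOT IN (" + REFINEMENT + ")"


def ids(conn, sql):
    """Return the set of Staff ids selected by a query."""
    return {row[0] for row in conn.execute(sql)}


if __name__ == "__main__":
    conn = build_db()
    print("parent:    ", sorted(ids(conn, PARENT)))
    print("refinement:", sorted(ids(conn, REFINEMENT)))
    print("complement:", sorted(ids(conn, COMPLEMENT)))
```

Because the two subsets are disjoint and together cover the parent, they can serve as the two branches of a decision tree node.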

7 The most time consuming operations of MRDTL
Figure: selection graph with nodes Staff, Grad.Student (GPA > 3.9), Grad.Student, Department (Specialization = math)

Staff
ID   Name     Dep   Position
p1   Dale     d1    Postdoc
p2   Martin   d1    Postdoc
p3   David    d4    Postdoc
p4   Peter    d3    Postdoc
p5   John     d2    Professor
p6   Susan    d3    Professor
…    …        …     …

Query associated with the selection graph:
  select distinct Staff.Salary, count(distinct Staff.ID)
  from Staff, Grad.Student, Department
  where join_list and condition_list
  group by Staff.Salary

8 A way to speed up: eliminate redundant calculations
Figure: selection graph with nodes Staff, Grad.Student (GPA > 3.9), Grad.Student, Department (Specialization = math)
Problem: For a selection graph with 160 nodes, the time to execute a query is more than 3 minutes!
Redundancy in calculation: Tables Staff and Grad.Student will be joined for all the children refinements
A way to fix: make the join only once and save the necessary information for all further calculations

9 Speed Up Method. Sufficient tables
Figure: selection graph with nodes Staff, Grad.Student (GPA > 3.9), Grad.Student, Department (Specialization = math)

Sufficient table
Staff_ID   Grad.Student_ID   Dep_ID
p1         s1                d1
p2         s1                d1
p3         s6                d4
p4         s3                d3
p5         s1                d2
p6         s9                d3
…          …                 …

10 Speed Up Method. Sufficient tables
Figure: selection graph with nodes Staff, Grad.Student (GPA > 3.9), Grad.Student, Department (Specialization = math)

Sufficient table S
Staff_ID   Grad.Student_ID   Dep_ID
p1         s1                d1
p2         s1                d1
p3         s6                d4
p4         s3                d3
p5         s1                d2
p6         s9                d3
…          …                 …

Query associated with the selection graph:
  select S.Salary, count(distinct S.Staff_ID)
  from S
  group by S.Salary
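The idea of joining once and reusing the result can be sketched with Python's sqlite3 module on the slide 3 data. This is an illustration under stated assumptions, not the MRDTL implementation: the join conditions (Staff to Grad.Student via `advisor`, Grad.Student to Department via `department`), the extra `Salary` column stored in `S`, and all function names are hypothetical.

```python
import sqlite3


def build_db():
    """In-memory copy of the three example tables from slide 3."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Department (id TEXT PRIMARY KEY, specialization TEXT, students INTEGER);
        CREATE TABLE Staff (id TEXT PRIMARY KEY, name TEXT, department TEXT,
                            position TEXT, salary TEXT);
        CREATE TABLE Graduate_Student (id TEXT PRIMARY KEY, name TEXT, gpa REAL,
                                       publications INTEGER, advisor TEXT, department TEXT);
        INSERT INTO Department VALUES
            ('d1', 'Math', 1000), ('d2', 'Physics', 300), ('d3', 'Computer Science', 400);
        INSERT INTO Staff VALUES
            ('p1', 'Dale', 'd1', 'Professor', '70-80k'),
            ('p2', 'Martin', 'd3', 'Postdoc', '30-40k'),
            ('p3', 'Victor', 'd2', 'Visitor Scientist', '40-50k'),
            ('p4', 'David', 'd3', 'Professor', '80-100k');
        INSERT INTO Graduate_Student VALUES
            ('s1', 'John', 2.0, 4, 'p1', 'd3'),
            ('s2', 'Lisa', 3.5, 10, 'p4', 'd3'),
            ('s3', 'Michel', 3.9, 3, 'p4', 'd4');
        """
    )
    return conn


# Assumed join for the selection graph Staff - Grad.Student - Department.
JOIN_FROM_WHERE = """
    FROM Staff T0, Graduate_Student T1, Department T2
    WHERE T0.id = T1.advisor AND T1.department = T2.id
"""


def counts_with_full_join(conn):
    """Slow path: repeat the multi-table join for every counting query."""
    sql = ("SELECT T0.salary, COUNT(DISTINCT T0.id)" + JOIN_FROM_WHERE
           + " GROUP BY T0.salary")
    return dict(conn.execute(sql).fetchall())


def materialize_sufficient_table(conn):
    """Do the join once and keep the ids (plus Salary, an assumption) in table S."""
    conn.execute(
        "CREATE TABLE S AS SELECT T0.id AS Staff_ID, T1.id AS Student_ID, "
        "T2.id AS Dep_ID, T0.salary AS Salary" + JOIN_FROM_WHERE
    )


def counts_from_sufficient_table(conn):
    """Fast path: the counting query touches only the saved table S."""
    sql = "SELECT S.Salary, COUNT(DISTINCT S.Staff_ID) FROM S GROUP BY S.Salary"
    return dict(conn.execute(sql).fetchall())


if __name__ == "__main__":
    conn = build_db()
    materialize_sufficient_table(conn)
    print(counts_with_full_join(conn))
    print(counts_from_sufficient_table(conn))
```

Both paths return the same counts; the difference is that later queries read one precomputed table instead of repeating the join.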

11 Experimental results

Data set                        Accuracy   Time with speed up   Time w/o speed up   Best-known accuracy
Mutagenesis                     87.5 %     28.45 secs           52.15 secs          86 %
KDD Cup 2001, localization      76.11 %    202.9 secs           1256.38 secs        72 %
KDD Cup 2001, function          91.44 %    151.19 secs          307.83 secs         93.6 %
PKDD 2001 Discovery Challenge   98.1 %     127.75 secs          198.22 secs         99.28 %
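The timings on this slide can be turned into speed-up factors by dividing the time without the speed-up by the time with it. The short Python sketch below does only that arithmetic on the reported numbers; the names `TIMINGS` and `speedup_factor` are illustrative.

```python
# (time with speed up, time without speed up), in seconds, as reported on the slide.
TIMINGS = {
    "Mutagenesis": (28.45, 52.15),
    "KDD Cup 2001, localization": (202.9, 1256.38),
    "KDD Cup 2001, function": (151.19, 307.83),
    "PKDD 2001 Discovery Challenge": (127.75, 198.22),
}


def speedup_factor(time_with, time_without):
    """Ratio of the run time without the speed-up to the run time with it."""
    return time_without / time_with


if __name__ == "__main__":
    for name, (time_with, time_without) in TIMINGS.items():
        print(f"{name}: {speedup_factor(time_with, time_without):.2f}x")
```

The factor is largest for the KDD Cup 2001 localization task, at roughly six times, and between about 1.5 and 2 times for the other three data sets.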

12 Summary
A general approach for speeding up the MRDM framework
MRDTL is a competitive algorithm for learning from relational databases in terms of both accuracy and time

Future work
  techniques for handling missing values
  pruning techniques or complexity regularization
  use of aggregates for the attribute values
  more extensive evaluation of MRDTL on real-world data sets

