Anna Atramentov and Vasant Honavar* Artificial Intelligence Laboratory Department of Computer Science Iowa State University Ames, IA 50011, USA www.cs.iastate.edu/~honavar/aigroup.html.

Presentation transcript:

Speeding Up Multi-Relational Data Mining
Anna Atramentov and Vasant Honavar*
Artificial Intelligence Laboratory, Department of Computer Science, Iowa State University, Ames, IA 50011, USA
* Support provided in part by National Science Foundation, Carver Foundation, and Pioneer Hi-Bred, Inc.

Motivation

Importance of relational learning:
  - Growth of data stored in multi-relational databases (MRDB)
  - Techniques for learning from unstructured data often extract the data into MRDB

One of the promising approaches to relational learning:
  - MRDM (Multi-Relational Data Mining) framework developed by Knobbe et al. (1999)
  - MRDTL (Multi-Relational Decision Tree Learning) algorithm implemented by Leiva et al. (2002)

Goal: speed up the MRDM framework and in particular the MRDTL algorithm

Problem Formulation

Given: data stored in a relational database
Goal: learn a predictive model for the instances in the target table

Example of a multi-relational database (schema and instances):

Department
  ID   Specialization     #Students
  d1   Math               1000
  d2   Physics            300
  d3   Computer Science   400

Staff
  ID   Name     Department   Position            Salary
  p1   Dale     d1           Professor           [...]k
  p2   Martin   d3           Postdoc             30-40k
  p3   Victor   d2           Visitor Scientist   40-50k
  p4   David    d3           Professor           80-100k

Graduate Student
  ID   Name     GPA   #Publications   Advisor   Department
  s1   John     2.0   4               p1        d3
  s2   Lisa     3.5   10              p4        d3
  s3   Michel   3.9   3               p4        d4

MRDM overview. Selection graphs

[Figure: selection graph over the nodes Staff, Grad.Student (GPA > 3.9), Grad.Student, and Department (Specialization = math)]

  - Nodes correspond to the tables from the database
  - Edges correspond to the associations between tables
  - A selection graph corresponds to the subset of the instances from the target table having some property
  - It is a way of specifying attributes in the relational setting

Staff (all instances)
  ID   Name     Department   Position            Salary
  p1   Dale     d1           Professor           [...]k
  p2   Martin   d3           Postdoc             30-40k
  p3   Victor   d2           Visitor Scientist   40-50k
  p4   David    d3           Professor           80-100k

Staff (subset corresponding to the selection graph)
  ID   Name     Department   Position            Salary
  p2   Martin   d3           Postdoc             30-40k
  p3   Victor   d2           Visitor Scientist   40-50k

MRDM overview. Transforming selection graphs into SQL queries

Selection graph: Staff joined with Grad.Student
  select distinct T0.id
  from Staff T0, Graduate_Student T1
  where T0.id = T1.Advisor

Selection graph: Staff with no associated Grad.Student
  select distinct T0.id
  from Staff T0
  where T0.id not in
    (select T1.Advisor
     from Graduate_Student T1)

Selection graph: Staff joined with Grad.Student, and with no associated Grad.Student having GPA > 3.9
  select distinct T0.id
  from Staff T0, Graduate_Student T1
  where T0.id = T1.Advisor
    and T0.id not in
      (select T1.Advisor
       from Graduate_Student T1
       where T1.GPA > 3.9)

Generic query:
  select distinct T0.primary_key
  from table_list
  where join_list and condition_list
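The translation above can be sketched in Python as a small query builder. This is a minimal illustration, not the MRDM or MRDTL implementation: the `Child` class and `to_sql` function are hypothetical names, only one-level selection graphs (a target table with directly attached nodes) are handled, and the Salary column is left out of the example data because one of its values is incomplete in the slides.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class Child:
    """A node attached to the target table (hypothetical helper type)."""
    table: str                 # table associated with the target table
    foreign_key: str           # column in `table` referencing the target key
    conditions: list = field(default_factory=list)  # e.g. ["GPA > 3.9"]
    present: bool = True       # False: no matching row may exist


def to_sql(target: str, key: str, children: list) -> str:
    """Fill in the generic query for a one-level selection graph."""
    tables = [f"{target} T0"]
    where = []
    for i, child in enumerate(children, start=1):
        alias = f"T{i}"
        conds = [f"{alias}.{c}" for c in child.conditions]
        if child.present:
            # Present edge: add the table, its join, and its conditions.
            tables.append(f"{child.table} {alias}")
            where.append(f"T0.{key} = {alias}.{child.foreign_key}")
            where.extend(conds)
        else:
            # Absent edge: exclude target rows referenced by a matching row.
            sub = f"select {alias}.{child.foreign_key} from {child.table} {alias}"
            if conds:
                sub += " where " + " and ".join(conds)
            where.append(f"T0.{key} not in ({sub})")
    sql = f"select distinct T0.{key} from " + ", ".join(tables)
    if where:
        sql += " where " + " and ".join(where)
    return sql


# Example data from the slides (Salary omitted).
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table Staff (id text primary key, Name text, Department text, Position text);
create table Graduate_Student (id text primary key, Name text, GPA real,
                               Publications integer, Advisor text, Department text);
insert into Staff values
  ('p1', 'Dale', 'd1', 'Professor'), ('p2', 'Martin', 'd3', 'Postdoc'),
  ('p3', 'Victor', 'd2', 'Visitor Scientist'), ('p4', 'David', 'd3', 'Professor');
insert into Graduate_Student values
  ('s1', 'John', 2.0, 4, 'p1', 'd3'), ('s2', 'Lisa', 3.5, 10, 'p4', 'd3'),
  ('s3', 'Michel', 3.9, 3, 'p4', 'd4');
""")

# Staff who advise a student, but no student with GPA > 3.9.
graph = [
    Child("Graduate_Student", "Advisor"),
    Child("Graduate_Student", "Advisor", ["GPA > 3.9"], present=False),
]
sql = to_sql("Staff", "id", graph)
selected = sorted(row[0] for row in conn.execute(sql))
print(sql)
print(selected)
```

Each absent node gets its own alias here (T2 rather than a reused T1), which keeps the subquery independent of the outer join; that naming is a choice of this sketch.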

MRDM overview. Refinements of selection graphs

[Figure: a selection graph over Staff, Grad.Student (GPA > 3.9), Grad.Student, and Department (Specialization = math), shown with its refinement, which adds the condition GPA > 2.0 to a Grad.Student node, and with the corresponding complement refinement, which adds a further Grad.Student node carrying GPA > 2.0]
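A refinement and its complement are meant to split the instances covered by the parent selection graph into two disjoint groups, much like the two branches of a decision-tree test. The sketch below checks this on the example data for a simplified parent graph (Staff joined with Grad.Student) and the condition GPA > 2.0. The query strings and the `select_ids` helper are illustrative assumptions, and the Salary column is omitted.

```python
import sqlite3

# Example data from the slides (Salary omitted).
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table Staff (id text primary key, Name text, Department text, Position text);
create table Graduate_Student (id text primary key, Name text, GPA real,
                               Publications integer, Advisor text, Department text);
insert into Staff values
  ('p1', 'Dale', 'd1', 'Professor'), ('p2', 'Martin', 'd3', 'Postdoc'),
  ('p3', 'Victor', 'd2', 'Visitor Scientist'), ('p4', 'David', 'd3', 'Professor');
insert into Graduate_Student values
  ('s1', 'John', 2.0, 4, 'p1', 'd3'), ('s2', 'Lisa', 3.5, 10, 'p4', 'd3'),
  ('s3', 'Michel', 3.9, 3, 'p4', 'd4');
""")


def select_ids(sql):
    """Return the set of target-table keys selected by a query."""
    return {row[0] for row in conn.execute(sql)}


# Parent selection graph: staff who advise at least one graduate student.
parent = ("select distinct T0.id from Staff T0, Graduate_Student T1 "
          "where T0.id = T1.Advisor")

# Refinement: add the condition GPA > 2.0 to the Grad.Student node.
refinement = parent + " and T1.GPA > 2.0"

# Complement: covered by the parent, but with no advised student of GPA > 2.0.
complement = parent + (" and T0.id not in (select T2.Advisor "
                       "from Graduate_Student T2 where T2.GPA > 2.0)")

covered = select_ids(parent)
yes_branch = select_ids(refinement)
no_branch = select_ids(complement)

# The two branches partition the instances covered by the parent graph.
print(sorted(covered), sorted(yes_branch), sorted(no_branch))
```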

The most time consuming operations of MRDTL

[Figure: selection graph over Staff, Grad.Student (GPA > 3.9), Grad.Student, and Department (Specialization = math)]

Staff
  ID   Name     Dep   Position
  p1   Dale     d1    Postdoc
  p2   Martin   d1    Postdoc
  p3   David    d4    Postdoc
  p4   Peter    d3    Postdoc
  p5   John     d2    Professor
  p6   Susan    d3    Professor
  ...

Query associated with the selection graph:
  select distinct Staff.Salary, count(distinct Staff.ID)
  from Staff, Grad.Student, Department
  where join_list and condition_list
  group by Staff.Salary

A way to speed up: eliminate redundant calculations

[Figure: selection graph over Staff, Grad.Student (GPA > 3.9), Grad.Student, and Department (Specialization = math)]

Problem: for a selection graph with 160 nodes, the time to execute a query is more than 3 minutes!

Redundancy in calculation: the tables Staff and Grad.Student will be joined again for all the children refinements

A way to fix: perform the join only once and save the information necessary for all further calculations

Speed Up Method. Sufficient tables

[Figure: selection graph over Staff, Grad.Student (GPA > 3.9), Grad.Student, and Department (Specialization = math)]

Sufficient table for the selection graph:
  Staff_ID   Grad.Student_ID   Dep_ID
  p1         s1                d1
  p2         s1                d1
  p3         s6                d4
  p4         s3                d3
  p5         s1                d2
  p6         s9                d3
  ...

Speed Up Method. Sufficient tables (continued)

Query associated with the selection graph, evaluated against the sufficient table S (columns Staff_ID, Grad.Student_ID, Dep_ID, as on the previous slide):
  select S.Salary, count(distinct S.Staff_ID)
  from S
  group by S.Salary
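The idea of joining once and reusing the result can be sketched with SQLite from Python. This is an illustration under stated assumptions rather than the authors' implementation: the sufficient table S keeps the joined keys plus the class attribute, Position stands in for the class attribute because the Salary column is incomplete in the example data, and each candidate condition is evaluated by joining S with the single table that holds the tested attribute. `DIRECT`, `VIA_S`, and `class_counts` are hypothetical names.

```python
import sqlite3

# Example data from the slides (Salary omitted).
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table Staff (id text primary key, Name text, Department text, Position text);
create table Graduate_Student (id text primary key, Name text, GPA real,
                               Publications integer, Advisor text, Department text);
insert into Staff values
  ('p1', 'Dale', 'd1', 'Professor'), ('p2', 'Martin', 'd3', 'Postdoc'),
  ('p3', 'Victor', 'd2', 'Visitor Scientist'), ('p4', 'David', 'd3', 'Professor');
insert into Graduate_Student values
  ('s1', 'John', 2.0, 4, 'p1', 'd3'), ('s2', 'Lisa', 3.5, 10, 'p4', 'd3'),
  ('s3', 'Michel', 3.9, 3, 'p4', 'd4');
""")

# Without speed up: the selection graph's join is repeated for every candidate.
DIRECT = """
select T0.Position, count(distinct T0.id)
from Staff T0, Graduate_Student T1
where T0.id = T1.Advisor and T1.GPA > ?
group by T0.Position
"""

# With speed up: perform the join once and store keys plus class attribute.
conn.execute("""
create table S as
select T0.id as Staff_ID, T1.id as Student_ID, T0.Position as Position
from Staff T0, Graduate_Student T1
where T0.id = T1.Advisor
""")

# Each candidate now touches only S and the table holding the tested attribute.
VIA_S = """
select S.Position, count(distinct S.Staff_ID)
from S, Graduate_Student T1
where S.Student_ID = T1.id and T1.GPA > ?
group by S.Position
"""


def class_counts(query, threshold):
    """Class distribution of the instances covered by one candidate condition."""
    return dict(conn.execute(query, (threshold,)).fetchall())


thresholds = [1.0, 3.0]
direct = {t: class_counts(DIRECT, t) for t in thresholds}
via_s = {t: class_counts(VIA_S, t) for t in thresholds}
print(direct)
print(via_s)
```

On a two-table graph the saving is negligible; the benefit described in the slides comes from large selection graphs, where the full multi-table join would otherwise be recomputed for every child refinement.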

Experimental results

  Data set                        Accuracy   Time with speed up   Time w/o speed up   Best-known accuracy
  Mutagenesis                     87.5 %     28.45 secs           52.15 secs          86 %
  KDD Cup 2001, localization      76.11 %    202.9 secs           [...] secs          72 %
  KDD Cup 2001, function          91.44 %    [...] secs           [...] secs          93.6 %
  PKDD 2001 Discovery Challenge   98.1 %     [...] secs           [...] secs          99.28 %

Summary
  - A general approach for speeding up the MRDM framework
  - MRDTL is a competitive algorithm for learning from RDB in terms of both accuracy and time

Future work
  - techniques for handling missing values
  - pruning techniques or complexity regularization
  - use of aggregates for the attribute values
  - more extensive evaluation of MRDTL on real-world data sets