Data Mining Query Languages Kristen LeFevre April 19, 2004 With Thanks to Zheng Huang and Lei Chen.



Outline
  Introduce the problem of querying data mining models
  Overview of three different solutions and their contributions
  Topic for discussion: What would an ideal solution support?

Problem Description
  You guys are armed with two powerful tools:
    Database management systems
    Efficient and effective data mining algorithms and frameworks
  Generally, this work asks:
    “How can we merge the two?”
    “How can we integrate data mining more closely with traditional database systems, particularly querying?”

Three Different Answers
  DMQL: A Data Mining Query Language for Relational Databases (Han et al., Simon Fraser University)
  Integrating Data Mining with SQL Databases: OLE DB for Data Mining (Netz et al., Microsoft)
  MSQL: A Query Language for Database Mining (Imielinski & Virmani, Rutgers University)

Some Common Ground
  Create and manipulate data mining models through a SQL-based interface (“command-driven” data mining)
  Abstract away the data mining particulars
  Data mining should be performed on data in the database (there should be no need to export to a special-purpose environment)
  Approaches differ on what kinds of models should be created, and what operations we should be able to perform

DMQL
  Commands specify the following:
    The set of data relevant to the data mining task (the training set)
    The kinds of knowledge to be discovered:
      Generalized relation
      Characteristic rules
      Discriminant rules
      Classification rules
      Association rules

DMQL
  Commands specify the following:
    Background knowledge: concept hierarchies based on attribute relationships, etc.
    Various thresholds: minimum support, confidence, etc.

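DMQL Syntax
    use database <database_name>
    {use hierarchy <hierarchy_name> for <attribute>}
    <rule_spec>
    related to <attr_or_agg_list>
    from <relation(s)>
    [where <condition>]
    [order by <order_list>]
    {with [<kinds_of>] threshold = <threshold_value> [for <attribute(s)>]}
  What each clause does:
    use hierarchy: specify background knowledge
    <rule_spec>: specify rules to be discovered
    related to: relevant attributes or aggregations
    from / where / order by: collect the set of relevant data to mine
    with ... threshold: specify threshold parameters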

DMQL Syntax
  Rule specifications (the rules to be discovered):
    find classification rules [as <rule_name>] [according to <attributes>]
    find association rules [as <rule_name>]
    generalize data [into <relation_name>]
    others

DMQL
    use database Hospital
    find association rules as Heart_Health
    related to Salary, Age, Smoker, Heart_Disease
    from Patient_Financial f, Patient_Medical m
    where f.ID = m.ID and m.age >= 18
    with support threshold = .05
    with confidence threshold = .7
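
To make the two thresholds in the query above concrete, here is a minimal Python sketch of how support and confidence could be computed for one candidate rule. It is an illustration only, not how DMQL evaluates queries: the toy records, the helper names `matches` and `support_and_confidence`, and the choice of `>=` for the threshold comparison are all assumptions.

```python
# Toy patient records; the attribute values are invented for illustration.
patients = [
    {"Smoker": "yes", "Heart_Disease": "yes"},
    {"Smoker": "yes", "Heart_Disease": "yes"},
    {"Smoker": "yes", "Heart_Disease": "no"},
    {"Smoker": "no", "Heart_Disease": "no"},
    {"Smoker": "no", "Heart_Disease": "no"},
]


def matches(record, descriptors):
    """True if the record satisfies every attribute = value descriptor."""
    return all(record.get(attr) == value for attr, value in descriptors.items())


def support_and_confidence(records, body, consequent):
    """Support: fraction of records matching body and consequent.
    Confidence: that same count divided by the count matching the body."""
    body_count = sum(matches(r, body) for r in records)
    both_count = sum(
        matches(r, body) and matches(r, consequent) for r in records
    )
    support = both_count / len(records)
    confidence = both_count / body_count if body_count else 0.0
    return support, confidence


# Candidate rule: Smoker = yes  =>  Heart_Disease = yes
support, confidence = support_and_confidence(
    patients, {"Smoker": "yes"}, {"Heart_Disease": "yes"}
)

# Thresholds taken from the DMQL example (.05 and .7); ">=" is an assumption.
keep_rule = support >= 0.05 and confidence >= 0.7
```

On this toy data the rule clears the support threshold but falls just short of the confidence threshold, so it would be dropped.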

DMQL
  DMQL provides a display in command to view the resulting rules, but no advanced way to query them
  Suggests that a GUI might aid in presenting these results in different forms (charts, graphs, etc.)

MSQL
  Focus on association rules
  Seeks to provide a language both to selectively generate rules and, separately, to query the rule base
  Expressive rule generation language, and techniques for optimizing some commands

MSQL
  Get-Rules and Select-Rules queries
  The Get-Rules operator generates rules over elements of argument class C that satisfy the conditions described in the “where” clause
    [Project Body, Consequent, confidence, support]
    GetRules(C) [as R1]
    [into <rulebase_name>]
    [where <conds>]
    [sql-group-by clause]
    [using-clause]

MSQL
  The “where” clause may contain a number of conditions, including:
    Restrictions on the attributes in the body or consequent:
      rule.body HAS {(Job = 'Doctor')}
      rule1.consequent IN rule2.body
      rule.consequent IS {Age = *}
    Pruning conditions (restrict by support, confidence, or size)
    Stratified or correlated subqueries
  IN, HAS, and IS are rule subset, superset, and equality respectively
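
The subset, superset, and equality reading of IN, HAS, and IS can be sketched in Python as follows. This is a hedged illustration rather than MSQL's actual implementation: descriptors are modeled as `(attribute, value)` tuples, the function names `has`, `is_in`, and `is_equal` are hypothetical, and the wildcard handling (a `*` value matches any value of that attribute) is an assumption.

```python
WILDCARD = "*"


def descriptor_matches(pattern, descriptor):
    """A pattern matches a descriptor when the attributes agree and the
    pattern value is either the wildcard or the same value."""
    p_attr, p_value = pattern
    d_attr, d_value = descriptor
    return p_attr == d_attr and p_value in (WILDCARD, d_value)


def has(rule_part, patterns):
    """rule_part HAS patterns (superset): every pattern is matched by
    some descriptor in rule_part."""
    return all(
        any(descriptor_matches(p, d) for d in rule_part) for p in patterns
    )


def is_in(rule_part, patterns):
    """rule_part IN patterns (subset): every descriptor in rule_part
    matches some pattern."""
    return all(
        any(descriptor_matches(p, d) for p in patterns) for d in rule_part
    )


def is_equal(rule_part, patterns):
    """rule_part IS patterns (equality): subset and superset at once."""
    return has(rule_part, patterns) and is_in(rule_part, patterns)


body = {("Age", "30-40"), ("Job", "Doctor")}
consequent = {("Age", "30-40")}

body_has_doctor = has(body, {("Job", "Doctor")})        # superset test
consequent_in_body = is_in(consequent, body)             # subset test
consequent_is_age = is_equal(consequent, {("Age", WILDCARD)})
```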

MSQL
    GetRules(Patients) R1
    where Body has {Age = *}
    and Support > .05 and Confidence > .7
    and not exists (
      GetRules(Patients) R2
      where Support > .05 and Confidence > .7
      and R2.Body HAS R1.Body)
  Retrieve all rules with descriptors of the form “Age = x” in the body, except when another rule that also meets the support and confidence thresholds has a body containing a superset of those descriptors

MSQL
  Correlated:
    GetRules(C) R1
    where <conds>
    and not exists (
      GetRules(C) R2
      where <conds> and R2.Body HAS R1.Body)
  Stratified:
    GetRules(C) R1
    where <conds>
    and consequent is {(X=*)}
    and consequent in (
      SelectRules(R2) where consequent is {(X=*)})

MSQL
  Nested Get-Rules queries and their optimization
  Stratified (non-correlated) queries are evaluated “bottom-up”: the subquery is evaluated first and replaced with its results in the outer query
  Correlated queries are evaluated either top-down or bottom-up (like “loop-unfolding”), and there are rules for choosing between the two options

MSQL
    GetRules(Patients) R1
    where Body has {Age = *}
    and Support > .05 and Confidence > .7
    and not exists (
      GetRules(Patients) R2
      where Support > .05 and Confidence > .7
      and R2.Body HAS R1.Body)

MSQL
  Top-Down Evaluation: for each rule produced by the outer query, evaluate the inner query
  Outer:
    GetRules(Patients) R1
    where Body has {Age = *}
    and Support > .05 and Confidence > .7
  Inner:
    not exists (
      GetRules(Patients) R2
      where Support > .05 and Confidence > .7
      and R2.Body HAS R1.Body)

MSQL
  Bottom-Up Evaluation: for each rule produced by the inner query, evaluate the outer query
  Inner:
    not exists (
      GetRules(Patients) R2
      where Support > .05 and Confidence > .7
      and R2.Body HAS R1.Body)
  Outer:
    GetRules(Patients) R1
    where Body has {Age = *}
    and Support > .05 and Confidence > .7
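
The two evaluation orders for the correlated query can be sketched in Python over an in-memory rule base. This is a simplified illustration, not the MSQL optimizer: the rules and their numbers are invented, the names `top_down` and `bottom_up` are hypothetical, and `R2.Body HAS R1.Body` is read here as a strict superset (an assumption, since otherwise every rule would eliminate itself).

```python
from collections import namedtuple

Rule = namedtuple("Rule", ["body", "consequent", "support", "confidence"])

# Invented rule base for illustration.
rules = [
    Rule(frozenset({("Age", "40-50")}), ("Heart_Disease", "yes"), 0.20, 0.75),
    Rule(frozenset({("Age", "40-50"), ("Smoker", "yes")}),
         ("Heart_Disease", "yes"), 0.10, 0.90),
    Rule(frozenset({("Age", "20-30")}), ("Heart_Disease", "no"), 0.30, 0.80),
    Rule(frozenset({("Smoker", "no")}), ("Heart_Disease", "no"), 0.40, 0.85),
    Rule(frozenset({("Age", "60-70")}), ("Heart_Disease", "yes"), 0.02, 0.95),
]


def passes_thresholds(rule):
    """Support > .05 and Confidence > .7, as in the example query."""
    return rule.support > 0.05 and rule.confidence > 0.7


def outer_condition(rule):
    """Body has {Age = *} plus the thresholds."""
    has_age = any(attr == "Age" for attr, _ in rule.body)
    return has_age and passes_thresholds(rule)


def top_down(rule_base):
    """For each outer rule R1, scan the inner query for an eliminating R2."""
    result = []
    for r1 in rule_base:
        if not outer_condition(r1):
            continue
        eliminated = any(
            passes_thresholds(r2) and r2.body > r1.body for r2 in rule_base
        )
        if not eliminated:
            result.append(r1)
    return result


def bottom_up(rule_base):
    """For each inner rule R2, strike the outer rules it eliminates."""
    candidates = [r for r in rule_base if outer_condition(r)]
    eliminated = set()
    for r2 in rule_base:
        if not passes_thresholds(r2):
            continue
        for r1 in candidates:
            if r2.body > r1.body:  # strict superset of R1's body
                eliminated.add(r1)
    return [r for r in candidates if r not in eliminated]


top_down_result = top_down(rules)
bottom_up_result = bottom_up(rules)
```

Both orders return the same rules; they differ only in which side drives the loop, which is why the relative restrictiveness of the two sides matters for cost.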

MSQL
  Choosing between the two
  In general, evaluate the expression with more restrictive conditions first
  Heuristic rules:
    Evaluate the query with the higher support threshold first
    Next consider the confidence threshold
    A (length = x) expression is in general more restrictive than (length > x), which is more restrictive than (length < x)
    “Body IS (constant expression)” is more restrictive than “Body HAS”, which is more restrictive than “Body IN”
    Next consider “Consequent IN” expressions
    Descriptors of the form (A = a) are more restrictive than wildcards such as (A = *)
  Meant to prevent unconstrained queries from being evaluated first
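
As a rough illustration of the first two heuristics only (support threshold first, then confidence threshold), the choice could be sketched in Python like this. The class `SubqueryThresholds` and the function `evaluate_first` are hypothetical names, and the remaining heuristics (length, IS/HAS/IN, wildcards) are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubqueryThresholds:
    """Hypothetical summary of one side of a nested query."""
    name: str
    min_support: float
    min_confidence: float


def evaluate_first(a, b):
    """Pick the side to evaluate first: the higher support threshold wins,
    and ties are broken by the higher confidence threshold.
    If both thresholds are equal, the first argument is returned."""
    def restrictiveness(q):
        return (q.min_support, q.min_confidence)

    return a if restrictiveness(a) >= restrictiveness(b) else b


outer = SubqueryThresholds("outer", min_support=0.05, min_confidence=0.7)
inner = SubqueryThresholds("inner", min_support=0.10, min_confidence=0.6)
first = evaluate_first(outer, inner)
```

Here the inner query has the higher support threshold, so this sketch would start from it, which corresponds to bottom-up evaluation.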

OLE DB for DM
  An extension to the OLE DB interface for Microsoft SQL Server
  Seeks to support the following ideas:
    Define a model by specifying the set of attributes to be predicted, the attributes used for the prediction, and the algorithm
    Populate the model using the training data
    Predict attributes for new data using the populated model
    Browse the mining model (not fully addressed because it varies a lot by model type)
  None of the others seemed to support this

OLE DB for DM
  Defining a Mining Model
    Identify the set of data attributes to be predicted, the set of attributes to be used for prediction, and the algorithm to be used for building the model
  Populating the Model
    Pull the information into a single rowset using views, and train the model using the data and algorithm specified
    Supports complex objects, so the rowset may be hierarchical (see paper for more complex examples)

OLE DB for DM
  Using the mining model to predict
    Defines a new operator, the prediction join
    A model may be used to make predictions on a data set by taking the prediction join of the mining model and the data set

OLE DB for DM
    CREATE MINING MODEL [Heart_Health Prediction]
    (
      [ID] Int Key,
      [Age] Int,
      [Smoker] Int,
      [Salary] Double discretized,
      [HeartAttack] Int PREDICT    %Prediction column
    )
    USING [Decision_Trees_101]
  Identifies the source columns for the training data, the column to be predicted, and the data mining algorithm

OLE DB for DM
    INSERT INTO [Heart_Health Prediction]
      ([ID], [Age], [Smoker], [Salary])
    SELECT [ID], [Age], [Smoker], [Salary]
    FROM Patient_Medical M, Patient_Financial F
    WHERE M.ID = F.ID
  The INSERT means each tuple is used to train the model (it is not actually inserted into the rowset)

OLE DB for DM
    SELECT t.[ID], [Heart_Health Prediction].[HeartAttack]
    FROM [Heart_Health Prediction]
    PREDICTION JOIN (
      SELECT [ID], [Age], [Smoker], [Salary]
      FROM Patient_Medical M, Patient_Financial F
      WHERE M.ID = F.ID) as t
    ON [Heart_Health Prediction].Age = t.Age
    AND [Heart_Health Prediction].Smoker = t.Smoker
    AND [Heart_Health Prediction].Salary = t.Salary
  The prediction join connects the model and an actual data table to make predictions
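
The define, populate, and predict steps can be mimicked in a few lines of Python. This is only a sketch of the workflow, not the OLE DB for DM API: the class `ToyMiningModel` and its methods are hypothetical, the data is invented, and a majority-vote lookup table stands in for the `Decision_Trees_101` algorithm. Note that the training rows here carry the predicted column, which any supervised model needs.

```python
from collections import Counter, defaultdict


class ToyMiningModel:
    """Hypothetical stand-in for a mining model: predicts the most common
    label seen in training for each combination of input values."""

    def __init__(self, input_columns, predict_column):
        # "Define" step: which columns are inputs, which one is predicted.
        self.input_columns = input_columns
        self.predict_column = predict_column
        self._votes = defaultdict(Counter)

    def insert(self, rows):
        # "Populate" step: rows train the model; they are not stored as-is.
        for row in rows:
            key = tuple(row[c] for c in self.input_columns)
            self._votes[key][row[self.predict_column]] += 1

    def prediction_join(self, rows, id_column):
        # "Predict" step: match each new row to the model on the inputs.
        results = []
        for row in rows:
            key = tuple(row[c] for c in self.input_columns)
            votes = self._votes.get(key)
            label = votes.most_common(1)[0][0] if votes else None
            results.append((row[id_column], label))
        return results


model = ToyMiningModel(input_columns=["Age", "Smoker"],
                       predict_column="HeartAttack")
model.insert([
    {"ID": 1, "Age": 60, "Smoker": 1, "HeartAttack": 1},
    {"ID": 2, "Age": 60, "Smoker": 1, "HeartAttack": 1},
    {"ID": 3, "Age": 60, "Smoker": 1, "HeartAttack": 0},
    {"ID": 4, "Age": 30, "Smoker": 0, "HeartAttack": 0},
])

new_patients = [
    {"ID": 10, "Age": 60, "Smoker": 1},
    {"ID": 11, "Age": 30, "Smoker": 0},
    {"ID": 12, "Age": 45, "Smoker": 1},  # combination never seen in training
]
predictions = model.prediction_join(new_patients, id_column="ID")
```

A real algorithm would generalize to unseen input combinations; this lookup table simply returns `None` for them, which keeps the sketch short.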

Key Ideas
  Important to have an API for creating and manipulating data mining models
  The data is already in the DBMS, so it makes sense to do the data mining where the data is
  Applications already use SQL, so a SQL extension seems logical

Key Ideas
  Need a method for defining data mining models, including algorithm specification, specification of various parameters, and training set specification (DMQL, MSQL, OLE DB for DM)
  Need a method of querying the models (MSQL)
  Need a way of using the data mining model to interact with other data in the database, for purposes such as prediction (OLE DB for DM)

Discussion Topic: What Functionality Would an Ideal Solution Support?