Integrating Query Processing and Data Mining in Relational DBMSs


Qiang Ding (North Dakota State University), William Perrizo (North Dakota State University), Victor Shi (North Dakota State University), Kirk Scott (University of Alaska)

Introduction
Our Goal
- To optimize data mining and query processing together
- A unified approach
- To minimize I/O
- To reduce disk storage (compression)
Integrating Query Processing and Data Mining in Relational DBMSs 5/22/2019 5:23 AM

Introduction (Cont.)
Vertical Partitioning
- Decomposition Storage Model (DSM, Copeland et al.)
- Attribute Transposed File (ATF)
- Band Sequential (BSQ)
- Bit Transposed File (BTF, Wang et al.)
- bSQ & P-tree

P-trees
- Represent data bit-by-bit in a recursive quadrant-by-quadrant arrangement
- Lossless representations of the original data
- Facilitate compression and fast ANDing

bSQ, 2-D Peano order, and P-trees
1111110011111000111111001111111011110000111100001111000001110000
[Figure: the bit sequence above arranged in 2-D Peano order, with the corresponding P-tree]
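The quadrant-by-quadrant construction can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the 64-bit sequence on the slide is an 8x8 image in raster (row-major) order, and the function name `peano_count_tree` and the tuple layout are hypothetical.

```python
def peano_count_tree(grid, row=0, col=0, size=None):
    """Return (count, children) for the square quadrant starting at (row, col).

    count is the number of 1-bits in the quadrant. children is None for a
    pure quadrant (all 0s or all 1s); otherwise it is a list of four
    subtrees in Peano order: upper-left, upper-right, lower-left, lower-right.
    """
    if size is None:
        size = len(grid)
    count = sum(grid[r][c]
                for r in range(row, row + size)
                for c in range(col, col + size))
    if count == 0 or count == size * size:
        return (count, None)  # pure quadrant: no need to descend further
    half = size // 2
    children = [peano_count_tree(grid, row + dr, col + dc, half)
                for dr in (0, half) for dc in (0, half)]
    return (count, children)


# The slide's 64-bit sequence, split into rows of 8 (assumed raster order).
ROWS = ["11111100", "11111000", "11111100", "11111110",
        "11110000", "11110000", "11110000", "01110000"]
GRID = [[int(ch) for ch in row] for row in ROWS]

root_count, quadrants = peano_count_tree(GRID)
quadrant_counts = [count for count, _ in quadrants]
```

Pure quadrants are stored as a single count, which is where the compression comes from: under the raster-order assumption, the all-1 upper-left quadrant and the all-0 lower-right quadrant each collapse to one node.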

SPJ Queries
- Consider an SPJ query involving more than one join
- Constellation model
- Our strategy
  - Selection masks
  - Semi-joins
  - Full elimination of all non-participants

An Example

SELECT DISTINCT C.c, R.capacity
FROM S, C, E, O, R
WHERE S.s=E.s AND C.c=O.c AND O.o=E.o AND O.r=R.r
  AND C.cred>1 AND (E.grade='B' OR E.grade='A')
  AND R.capacity>10 AND S.gen='F'
ORDER BY C.c DESC;

S
s     | n | gen
0 000 | A | M 0
1 001 | T | M 0
2 010 | S | F 1
3 011 | B | F 1
4 100 | C | F 1
5 101 | J | F 1

C
c    | n | cred
0 00 | B | 1 01
1 01 | D | 3 11
2 10 | M | 3 11
3 11 | S | 2 10

E
s     | o     | grade
0 000 | 1 001 | B 10
0 000 | 0 000 | A 11
3 011 | 1 001 | A 11
3 011 | 3 011 | D 00
1 001 | 3 011 | D 00
1 001 | 0 000 | B 10
2 010 | 2 010 | B 10
2 010 | 3 011 | A 11
4 100 | 4 100 | B 10
5 101 | 5 101 | B 10

O
o     | c    | r
0 000 | 0 00 | 0 01
1 001 | 0 00 | 1 01
2 010 | 1 01 | 0 00
3 011 | 1 01 | 1 01
4 100 | 2 10 | 0 00
5 101 | 2 10 | 2 10
6 110 | 2 10 | 3 11
7 111 | 3 11 | 2 10

R
r    | capacity
0 00 | 30 11
1 01 | 20 10
2 10 | 30 11
3 11 | 10 01

Full Vertical Partitioning

Ss1  Ss2  Ss3  Sgen  Sn
0011 0000 0101 0011  ATSBCJ
00   11   01   11

Es1  Es2  Es3  Eo1  Eo2  Eo3  Egrade1  Egrade2
0000 0000 0011 0000 0010 1010 1101     0100
0000 1111 1100 0000 0111 1101 1011     1001
11   00   01   11   00   01   11       00

Cc1  Cc2  Ccred1  Ccred2  Cn
00   01   01      11      BDMS
11   01   11      10

Oo1  Oo2  Oo3  Oc1  Oc2  Or1  Or2
0011 0000 0101 0011 0000 0001 1100
0011 1111 0101 0011 1101 0011 0110

Rr1  Rr2  Rcap1  Rcap2
00   01   11     10
11   01   10     11
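As a small illustration of full vertical partitioning, the sketch below splits the columns of relation R into one bit vector per bit position and then reassembles them. The helper names `bit_slices` and `reassemble` are hypothetical; the 2-bit capacity codes (30 as 11, 20 as 10, 10 as 01) are the ones shown in the example tables.

```python
def bit_slices(values, width):
    """Split a column of non-negative ints into `width` bit vectors.

    Slice 0 holds the most significant bit of every row, slice 1 the next
    bit, and so on. Each slice is returned as a string of '0'/'1'.
    """
    return ["".join(str((v >> (width - 1 - i)) & 1) for v in values)
            for i in range(width)]


def reassemble(slices):
    """Rebuild the original column from its bit vectors (lossless)."""
    return [int("".join(bits), 2) for bits in zip(*slices)]


# Relation R from the example: r = 0..3, capacity stored as a 2-bit code.
R_r = [0, 1, 2, 3]
R_cap_code = [0b11, 0b10, 0b11, 0b01]  # 30, 20, 30, 10

Rr1, Rr2 = bit_slices(R_r, 2)
Rcap1, Rcap2 = bit_slices(R_cap_code, 2)
```

Here `Rr1`, `Rr2`, `Rcap1` and `Rcap2` come out as `0011`, `0101`, `1110` and `1011`, which are the R bit vectors on the slide read top to bottom, and `reassemble` recovers the original columns.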

Applying Selection Masks

mE = Egrade1   mR = Rcap1   mC = Ccred1   mS = Sgen
1101           11           01            0011
1011           10           11            11
11

results in,

Es1  Es2  Es3  Eo1  Eo2  Eo3
00∙0 00∙0 00∙1 00∙0 00∙0 10∙0
0∙00 1∙11 1∙00 0∙00 0∙11 1∙01
11   00   01   11   00   01

Ss1  Ss2  Ss3
∙∙11 ∙∙00 ∙∙01
00   11   01

Rr1  Rr2      Cc1  Cc2
00   01       ∙0   ∙1
1∙   0∙       11   01
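A selection mask is just another bit vector: rows where the mask is 0 drop out of every column of that relation. The sketch below reproduces the dotted notation for two E columns and one R column, using the bit vectors as they appear on the slides; `apply_mask` is a hypothetical helper name.

```python
DOT = "\u2219"  # the placeholder the slides use for an eliminated row


def apply_mask(column, mask):
    """Keep a column's bit where the mask bit is 1, otherwise show a dot."""
    return "".join(bit if m == "1" else DOT for bit, m in zip(column, mask))


# E columns as listed on the slides (10 rows, in the slides' row order).
Egrade1 = "1101101111"  # 1 for grades A (11) and B (10), 0 for D (00)
Es3 = "0011110001"
Eo3 = "1010110101"

mE = Egrade1            # E.grade='A' OR E.grade='B' is simply the high grade bit
masked_Es3 = apply_mask(Es3, mE)
masked_Eo3 = apply_mask(Eo3, mE)

# R columns: capacity > 10 is the high capacity bit.
mR = "1110"
masked_Rr1 = apply_mask("0011", mR)
```

Because each predicate in the example maps to a single bit column, no comparison against decoded values is needed; a more general predicate would be built by ANDing and complementing several bit columns.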

Semijoining Toward Center
SE (on s=2,3,4,5), EO (on o=0,1,2,3,4,5), RO (on r=0,1,2), CO (on c=1,2,3)

Oo1  Oo2  Oo3  Oc1  Oc2  Or1  Or2
0011 0000 0101 0011 0000 0001 1100
0011 1111 0101 0011 1101 0011 0110

After the semijoins:

Oo1  Oo2  Oo3  Oc1  Oc2  Or1  Or2
0011 0000 0101 ∙∙11 ∙∙00 0001 1100
00∙∙ 11∙∙ 01∙∙ 0011 1101 00∙1 01∙0

Thus, the participants are o=2,3,4,5.

Semijoining Back
Semijoining back again produces:

Cc1  Cc2      Rr1  Rr2
∙0   ∙1       00   01
1∙   0∙       1∙   0∙

Es1  Es2  Es3  Eo1  Eo2  Eo3
∙∙∙∙ ∙∙∙∙ ∙∙∙∙ ∙∙∙∙ ∙∙∙∙ ∙∙∙∙
∙∙00 ∙∙11 ∙∙00 ∙∙00 ∙∙11 ∙∙01
11   00   01   11   00   01

Ss1  Ss2  Ss3
∙∙11 ∙∙00 ∙∙01
0∙   1∙   0∙

Thus the participants are c=1,2; r=0,1,2; s=2,4,5.
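The two semijoin passes can be checked with a tuple-level sketch. It works on plain Python sets rather than P-trees, so it illustrates only the elimination logic, not the bit-vector implementation; all variable names are hypothetical, and the data are the decimal values from the example tables.

```python
# Example relations (decimal values from the tables).
S = {0: "M", 1: "M", 2: "F", 3: "F", 4: "F", 5: "F"}   # s -> gen
C = {0: 1, 1: 3, 2: 3, 3: 2}                           # c -> cred
R = {0: 30, 1: 20, 2: 30, 3: 10}                       # r -> capacity
E = [(0, 1, "B"), (0, 0, "A"), (3, 1, "A"), (3, 3, "D"), (1, 3, "D"),
     (1, 0, "B"), (2, 2, "B"), (2, 3, "A"), (4, 4, "B"), (5, 5, "B")]  # (s, o, grade)
# (o, c, r); r for o=0 is taken as the decimal value printed in the table.
O = [(0, 0, 0), (1, 0, 1), (2, 1, 0), (3, 1, 1),
     (4, 2, 0), (5, 2, 2), (6, 2, 3), (7, 3, 2)]

# Selection masks, expressed as sets of surviving keys or rows.
s_ok = {s for s, gen in S.items() if gen == "F"}
c_ok = {c for c, cred in C.items() if cred > 1}
r_ok = {r for r, cap in R.items() if cap > 10}
E_sel = [row for row in E if row[2] in ("A", "B")]

# Semijoin toward the center relation O.
E_sel = [row for row in E_sel if row[0] in s_ok]        # S -> E
o_ok = {o for _, o, _ in E_sel}                         # E -> O
O_part = [row for row in O
          if row[0] in o_ok and row[1] in c_ok and row[2] in r_ok]
o_part = {o for o, _, _ in O_part}

# Semijoin back out from O.
c_part = {c for _, c, _ in O_part}
r_part = {r for _, _, r in O_part}
s_part = {s for s, o, _ in E_sel if o in o_part}
```

This sketch reduces E by S before passing o values on to O, so its intermediate o set is {1,...,5} rather than the {0,...,5} listed on the slide (which matches the grade mask alone); the participants come out the same either way.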

Generating Output

C.c = 2 (Oc1 ^ Oc2’):
∙∙11 ^ ∙∙11 = ∙∙11
00∙∙ ^ 00∙∙ = 00∙∙
O.r = 0, 2
Semijoin to R: R.capacity = 30

C.c = 1 (Oc1’ ^ Oc2):
∙∙00 ^ ∙∙00 = ∙∙00
11∙∙ ^ 11∙∙ = 11∙∙
O.r = 0, 1
Semijoin to R: R.capacity = 30, 20

Final output:
c | capacity
2 | 30
1 | 30
1 | 20
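The output step can be sketched directly on the O bit vectors. This is an illustrative reading of the slide, assuming the bit vectors are listed most significant bit first and using flat Python lists in place of P-trees; the helper names are hypothetical.

```python
def bits(text):
    """Parse a bit vector written with optional spaces, e.g. '0011 0011'."""
    return [int(ch) for ch in text if ch in "01"]


def AND(*vectors):
    return [int(all(column)) for column in zip(*vectors)]


def NOT(vector):
    return [1 - b for b in vector]


# O bit vectors as listed on the slides (8 rows, in the slides' row order).
Oo1, Oo2, Oo3 = bits("0011 0011"), bits("0000 1111"), bits("0101 0101")
Oc1, Oc2 = bits("0011 0011"), bits("0000 1101")
Or1, Or2 = bits("0001 0011"), bits("1100 0110")
capacity = {0: 30, 1: 20, 2: 30, 3: 10}  # relation R: r -> capacity

# Rows of O that survived the semijoins: o = 2, 3, 4, 5.
o_values = [4 * a + 2 * b + c for a, b, c in zip(Oo1, Oo2, Oo3)]
participants = [int(o in {2, 3, 4, 5}) for o in o_values]


def capacities_for(c_mask):
    """r values and capacities of the participating rows selected by c_mask."""
    rows = [i for i, b in enumerate(AND(c_mask, participants)) if b]
    r_values = {2 * Or1[i] + Or2[i] for i in rows}
    return r_values, {capacity[r] for r in r_values}


r_for_c2, cap_for_c2 = capacities_for(AND(Oc1, NOT(Oc2)))  # c = 10 (binary)
r_for_c1, cap_for_c1 = capacities_for(AND(NOT(Oc1), Oc2))  # c = 01 (binary)

# DISTINCT comes from the sets; ORDER BY C.c DESC from the reverse sort.
output = sorted([(2, cap) for cap in cap_for_c2] +
                [(1, cap) for cap in cap_for_c1], reverse=True)
```

Each distinct value of `C.c` is located with one AND of a bit vector and a complemented bit vector, so only the O columns that take part in the query are read.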

Data Mining Operations
- P-tree-based mining algorithms
- Association, Classification, and Clustering
- Faster and/or more accurate
- P-trees: data-mining ready compressed data structures
- P-ARM, Closed P-KNN

Data Mining Using P-trees: P-ARM

Data Mining Using P-trees: P-KNN

Integrating Query Processing and Data Mining
- Without necessitating the creation of a massive universal relation
- Full vertical partitioning
- Saving space
- Efficiently and directly (Boolean operations)

Conclusion
- SPJ strategies can be combined with proven data mining strategies in a unified way
- Achieved by using P-trees
- Complete vertical decomposition
- Only participating fields are retrieved
- Fast and accurate
- I/O minimized
- Indexes eliminated