Yue (Jenny) Cui and William Perrizo North Dakota State University

Slides:

Advertisements

Similar presentations

Garfield AP Computer Science

Advertisements

Bitmap Index Buddhika Madduma 22/03/2010 Web and Document Databases - ACS-7102.

1 An Empirical Study on Large-Scale Content-Based Image Retrieval Group Meeting Presented by Wyman

Concepts of Database Management Sixth Edition

Data Mining on Streams  We should use runlists for stream data mining (unless there is some spatial structure to the data, of course, then we need to.

Concepts of Database Management, Fifth Edition

Xin  Syntax ◦ SELECT field1 AS title1, field2 AS title2,... ◦ FROM table1, table2 ◦ WHERE conditions  Make a query that returns all records.

Performance Improvement for Bayesian Classification on Spatial Data with P-Trees Amal S. Perera Masum H. Serazi William Perrizo Dept. of Computer Science.

Vertical Set Square Distance: A Fast and Scalable Technique to Compute Total Variation in Large Datasets Taufik Abidin, Amal Perera, Masum Serazi, William.

MULTI-LAYERED SOFTWARE SYSTEM FRAMEWORK FOR DISTRIBUTED DATA MINING

Clustering Analysis of Spatial Data Using Peano Count Trees Qiang Ding William Perrizo Department of Computer Science North Dakota State University, USA.

Chapter 13 Query Processing Melissa Jamili CS 157B November 11, 2004.

Concepts of Database Management Seventh Edition

Ptree * -based Approach to Mining Gene Expression Data Fei Pan 1, Xin Hu 2, William Perrizo 1 1. Dept. Computer Science, 2. Dept. Pharmaceutical Science,

Association Rule Mining on Remotely Sensed Imagery Using Peano-trees (P-trees) Qin Ding, Qiang Ding, and William Perrizo Computer Science Department North.

Concepts of Database Management Eighth Edition Chapter 3 The Relational Model 2: SQL.

Efficient OLAP Operations for Spatial Data Using P-Trees Baoying Wang, Fei Pan, Dongmei Ren, Yue Cui, Qiang Ding William Perrizo North Dakota State University.

TEMPLATE DESIGN © Predicate-Tree based Pretty Good Protection of Data William Perrizo, Arjun G. Roy Department of Computer.

A Fast and Scalable Nearest Neighbor Based Classification Taufik Abidin and William Perrizo Department of Computer Science North Dakota State University.

Variant Indexes. Specialized Indexes? Data warehouses are large databases with data integrated from many independent sources. Queries are often complex.

Introduction.  Administration  Simple DBMS  CMPT 454 Topics John Edgar2.

1 Searching and Sorting Searching algorithms with simple arrays Sorting algorithms with simple arrays –Selection Sort –Insertion Sort –Bubble Sort –Quick.

Our Approach  Vertical, horizontally horizontal data vertically)  Vertical, compressed data structures, variously called either Predicate-trees or Peano-trees.

Mining real world data RDBMS and SQL. Index RDBMS introduction SQL (Structured Query language)

Aggregate Function Computation and Iceberg Querying in Vertical Databases Yue (Jenny) Cui Advisor: Dr. William Perrizo Master Thesis Oral Defense Department.

Fast and Scalable Nearest Neighbor Based Classification Taufik Abidin and William Perrizo Department of Computer Science North Dakota State University.

Copyright 2008 Koren ECE666/Koren Part.7b.1 Israel Koren Spring 2008 UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering Digital Computer.

Knowledge Discovery in Protected Vertical Information Dr. William Perrizo University Distinguished Professor of Computer Science North Dakota State University,

จัดทำโดย นายชนากานต์ สันติคุณาภรณ์ นายธฤษพงศ์ ศิริบูรณ์ นางสาวศุภาภรณ์ ถ่านคำ.

Indexing Structures Database System Implementation CSE 507 Some slides adapted from Silberschatz, Korth and Sudarshan Database System Concepts – 6 th Edition.

Clustering Microarray Data based on Density and Shared Nearest Neighbor Measure CATA’06, March 23-25, 2006 Seattle, WA, USA Ranapratap Syamala, Taufik.

DATABASE OPERATORS AND SOLID STATE DRIVES Geetali Tyagi ( ) Mahima Malik ( ) Shrey Gupta ( ) Vedanshi Kataria ( )

Efficient Quantitative Frequent Pattern Mining Using Predicate Trees Baoying Wang, Fei Pan, Yue Cui William Perrizo North Dakota State University.

Vertical Set Square Distance Based Clustering without Prior Knowledge of K Amal Perera,Taufik Abidin, Masum Serazi, Dept. of CS, North Dakota State University.

Aggregate Function Computation and Iceberg Querying in Vertical Databases Yue (Jenny) Cui Advisor: Dr. William Perrizo Master Thesis Oral Defense Department.

P Left half of rt half ? false  Left half pure1? false  Whole is pure1? false  0 5. Rt half of right half? true  1.

Searching and Sorting Searching algorithms with simple arrays

Item-Based P-Tree Collaborative Filtering applied to the Netflix Data

Decision Tree Classification of Spatial Data Streams Using Peano Count Trees Qiang Ding Qin Ding * William Perrizo Department of Computer Science.

Indexing Structures for Files and Physical Database Design

Database System Implementation CSE 507

Efficient Image Classification on Vertically Decomposed Data

Decision Tree Induction for High-Dimensional Data Using P-Trees

Efficient Ranking of Keyword Queries Using P-trees

Efficient Ranking of Keyword Queries Using P-trees

Yue (Jenny) Cui and William Perrizo North Dakota State University

Proximal Support Vector Machine for Spatial Data Using P-trees1

North Dakota State University Fargo, ND USA

PTrees (predicate Trees) fast, accurate , DM-ready horizontal processing of compressed, vertical data structures Project onto each attribute (4 files)

Efficient Image Classification on Vertically Decomposed Data

A Fast and Scalable Nearest Neighbor Based Classification

GROUP BY & Subset Data Analysis

Vertical K Median Clustering

A Fast and Scalable Nearest Neighbor Based Classification

Incremental Interactive Mining of Constrained Association Rules from Biological Annotation Data Imad Rahal, Dongmei Ren, Amal Perera, Hassan Najadat and.

Vertical K Median Clustering

North Dakota State University Fargo, ND USA

SQL: Structured Query Language

Functional Analytic Unsupervised and Supervised data mining Technology

The Multi-hop closure theorem for the Rolodex Model using pTrees

Vertical K Median Clustering

Query Functions.

North Dakota State University Fargo, ND USA

Algorithm of Aggregate Function SUM

Algorithm for the Aggregate Function SUM

The P-tree Structure and its Algebra Qin Ding Maleq Khan Amalendu Roy

Fraction-Score: A New Support Measure for Co-location Pattern Mining

LINQ to SQL Part 3.

Introduction to SQL Server and the Structure Query Language

CS 405G: Introduction to Database Systems

Presentation transcript:

Yue (Jenny) Cui and William Perrizo North Dakota State University Aggregate Function Computation and Iceberg Querying in Vertical Databases Yue (Jenny) Cui and William Perrizo North Dakota State University

Outline Introduction Review of Aggregate Functions Review P-tree vertical, compressed datamining-ready data structures Algorithms of Aggregate Function Computation Using P-trees SUM, COUNT, and AVERAGE. MAX, MIN, MEDIAN, RANK, and TOP-K. Performance Analysis Conclusion

Introduction Commonly used aggregation functions include COUNT, SUM, AVERAGE, MIN, MAX, MEDIAN, RANK, and TOP-K. Iceberg queries perform aggregate functions across attributes and then eliminate aggregate values that are below some specified threshold. example iceberg query. SELECT Location, Product Type, Sum (# Product) FROM Relation Sales GROUPBY Location, Product Type HAVING Sum (# Product) >= T

But it is pure (pure0) so this branch ends A file, R(A1..An), contains horizontal structures (horizontal records) Ptrees: vertically partition; then compress each vertical bit slice into a basic Ptree; horizontally process these basic Ptrees using one multi-operand logical AND. processed vertically (vertical scans) 010 111 110 001 011 111 110 000 010 110 101 001 010 111 101 111 101 010 001 100 010 010 001 101 111 000 001 100 R( A1 A2 A3 A4) R[A1] R[A2] R[A3] R[A4] 010 111 110 001 011 111 110 000 010 110 101 001 010 111 101 111 101 010 001 100 010 010 001 101 111 000 001 100 Horizontal structures (records) Scanned vertically R11 1 0 1 0 1 1 1 1 1 0 0 0 1 0 1 1 1 1 1 1 1 0 0 0 0 0 1 0 1 1 0 1 0 1 0 0 1 0 1 0 1 1 1 1 0 1 1 1 1 1 0 1 0 1 0 0 0 1 1 0 0 0 1 0 0 1 0 0 0 1 1 0 1 1 1 1 0 0 0 0 0 1 1 0 0 R11 R12 R13 R21 R22 R23 R31 R32 R33 R41 R42 R43 1-Dimensional Ptrees are built by recording the truth of the predicate “pure 1” recursively on halves, until there is purity, P11: 1. Whole file is not pure1 0 2. 1st half is not pure1  0 3. 2nd half is not pure1  0 0 0 P11 P12 P13 P21 P22 P23 P31 P32 P33 P41 P42 P43 0 0 0 1 10 1 0 01 0 0 0 1 01 10 1 0 0 1 0 1 0 0 10 01 5. 2nd half of 2nd half is  1 0 0 0 1 6. 1st half of 1st of 2nd is  1 0 0 0 1 1 7. 2nd half of 1st of 2nd not 0 0 0 0 1 10 4. 1st half of 2nd half not  0 0 0 But it is pure (pure0) so this branch ends Eg, to count, 111 000 001 100s, use “pure111000001100”: 0 23-level P11^P12^P13^P’21^P’22^P’23^P’31^P’32^P33^P41^P’42^P’43 = 0 0 22-level =2 01 21-level

Algorithms of Aggregate Function Computation Using P-trees The dataset we used in our example. We use the data in relation Sales to illustrate algorithms of aggregate function. Id Mon Loc Type On line # Product 1 Jan New York Notebook Y 10 2 Minneapolis Desktop N 5 3 Feb Printer 6 4 Mar 7 11 Chicago 9 Apr Fax Table 1. Relation Sales.

Algorithms of Aggregate Function Computation Using P-trees (Cont.) Table 2 shows the binary representation of data in relation Sales. Id Mon Loc Type On line # Product P0,3 P0,2 P0,1 P0,0 P1,4 P1,3 P1,2 P1,1 P1,0 P2,2 P2,1 P2,0 P3,0 P4,3 P4,2 P4,1 P4,0 1 0001 00001 001 1010 2 00101 010 0101 3 0010 100 0110 4 0011 0111 5 1011 6 00110 1001 7 0100 101 Table 2. Binary Form of Sales.

Algorithm of Aggregate Function COUNT COUNT function: It is not necessary to write special function for COUNT because P-tree RootCount function has already provided the mechanism to implement it. Given a P-tree Pi, RootCount(Pi) returns the number of 1s in Pi. Id Mon Loc Type On line # Product P0,3 P0,2 P0,1 P0,0 P1,4 P1,3 P1,2 P1,1 P1,0 P2,2 P2,1 P2,0 P3,0 P4,3 P4,2 P4,1 P4,0 1 0001 00001 001 1010 2 00101 010 0101 3 0010 100 0110 4 0011 0111 5 1011 6 00110 1001 7 0100 101 Table 1. Relation Sales.

Algorithm of Aggregate Function SUM SUM function: Sum function can total a field of numerical values. Algorithm 4.1 Evaluating sum () with P-tree. total = 0.00; For i = 0 to n { total = total + 2i * RootCount (Pi); } Return total Algorithm 4. 1. Sum Aggregate

Algorithm of Aggregate Function SUM P4,3 P4,2 P4,1 P4,0 10 5 6 7 11 9 3 1 1 1 1 For example, if we want to know the total number of products which were sold out in relation Sales, the procedure is showed on left {3} {3} {5} {5} 23 * + 22 * + 21 * + 20 * = 51

Algorithm of Aggregate Function AVERAGE Average function: Average function will show the average value in a field. It can be calculated from function COUNT and SUM. Average () = Sum ()/Count ().

Algorithm of Aggregate Function MAX Max function: Max function returns the largest value in a field. Algorithm 4.2 Evaluating max () with P-tree. max = 0.00; c = 0; Pc is set all 1s For i = n to 0 { c = RootCount (Pc AND Pi); If (c >= 1) Pc = Pc AND Pi; max = max + 2i; } Return max; Algorithm 4. 2. Max Aggregate.

Algorithm of Aggregate Function MAX Steps IF Pos Bits P4,3 P4,2 P4,1 P4,0 1. Pc = P4,3 RootCount (Pc) = 3 >= 1 10 5 6 7 11 9 3 1 1 1 1 {1} 2. RootCount (Pc AND P4,2) = 0 < 1 Pc = Pc AND P’4,2 {0} 3. RootCount (Pc AND P4,1 ) = 2 >= 1 Pc = Pc AND P4,1 {1} 4. RootCount (Pc AND P4,0 ) = 1 >= 1 {1} 23 * + 22 * + 21 * + 20 * = {1} {0} {1} {1} 11

Algorithm of Aggregate Function MIN Min function: Min function returns the smallest value in a field. Algorithm 4.3. Evaluating Min () with P-tree. min = 0.00; c = 0; Pc is set all 1s For i = n to 0 { c = RootCount (Pc AND NOT (Pi)); If (c >= 1) Pc = Pc AND NOT (Pi); Else min = min + 2i; } Return min; Algorithm 4. 2. Max Aggregate.

Algorithm of Aggregate Function MIN Steps IF Pos Bits P4,3 P4,2 P4,1 P4,0 1. Pc = P’4,3 RootCount (Pc) = 4 >= 1 10 5 6 7 11 9 3 1 1 1 1 {0} 2. RootCount (Pc AND P’4,2) = 1 >= 1 Pc = Pc AND P’4,2 {0} 3. RootCount (Pc AND P’4,1 ) = 0 < 1 Pc = Pc AND P4,1 {1} 4. RootCount (Pc AND P’4,0 ) = 0 < 1 {1} 23 * + 22 * + 21 * + 20 * = {0} {0} {1} {1} 3

Performance Analysis Figure 15. Iceberg Query with multi-attributes aggregation Performance Time Comparison

Performance Analysis Our experiments are implemented in the C++ language on a 1GHz Pentium PC machine with 1GB main memory running on Red Hat Linux. In figure 15, we compare the running time of P-tree method and bitmap method on calculating multi-attribute iceberg query. In this case P-trees are proved to be substantially faster.

Conclusion we believe our study confirms that the P-tree approach is superior to the bitmap approach for aggregation of all types and multi-attribute iceberg queries. It also proves that the advantages of basic P-tree representations of files are: First, there is no need for redundant, auxiliary structures. Second basic P-trees are good at calculating multi-attribute aggregations, numeric value, and fair to all attributes.