Partitioning – A Uniform Model for Data Mining
Anne Denton, Qin Ding, William Jockheck, Qiang Ding, and William Perrizo

Motivation
Databases and data warehouses are currently separate systems. Why?
Standard answer: Details, details, details …
Our answer: a fundamental issue of representation

Relations Revisited
R(A1, A2, …, AN) is a set of tuples.
Any choices at a fundamental level? Yes! There is a duality between
  the element-based representation and
  the space-based representation.

Duality
Element-based representation: the standard representation of tuples with all their attributes
Space-based representation: the existence (count?) of a tuple is represented in its attribute space

Similar Dualities in Physics
Particle picture: particles are represented by their positions
Field picture: particles are 1 values in a grid of locations
The field picture is the more fundamental level.

Space-Based Representation
Consider standard tuples as vectors in the space of the attribute domains.
Represent each possible attribute combination as one bit:
  1 if the data item is present
  0 if it isn't
Allowing counts could be useful for projections (?)
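For concreteness, here is a minimal Python sketch of this representation; the attribute names, domains, and tuples are made up for illustration and are not from the paper. The relation is stored as one bit per point of attribute space, and value-based access is direct array addressing.

```python
# Hypothetical sketch of the space-based representation of a relation.
# Attribute names, domains, and tuples are invented examples.
import numpy as np

domains = {
    "color": ["red", "green", "blue"],   # categorical domain
    "size":  [0, 1, 2, 3],               # small integer domain
}

# One bit per point of the attribute space (Cartesian product of domains).
space = np.zeros([len(d) for d in domains.values()], dtype=np.uint8)

def insert(tup):
    """Set the bit at the tuple's position in attribute space."""
    idx = tuple(domains[a].index(v) for a, v in zip(domains, tup))
    space[idx] = 1

for t in [("red", 2), ("blue", 0), ("green", 3)]:
    insert(t)

# Value-based access needs no separate index: address the space directly.
print(space[domains["color"].index("red"), domains["size"].index(2)])   # 1
print(space[domains["color"].index("blue"), domains["size"].index(3)])  # 0
```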

Space-Based Representation as a Partition
Partitions are mutually exclusive and collectively exhaustive sets of elements.
The space-based representation partitions attribute space into two sets:
  data items present in the database (1)
  data items not present (0)

Usefulness of the Space-Based Representation
No indexes needed: instant value-based access
Index locking becomes dimensional locking
Aggregation is very easy due to value-based ordering
Selections become "and"s
What experience do we have with space-based representations?
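To illustrate "selections become ANDs", the hedged sketch below reuses the made-up 3-color × 4-size space from the previous example and evaluates a conjunctive selection purely with bitwise ANDs over attribute space.

```python
# Selections as bitwise ANDs over the (made-up) space-based layout.
import numpy as np

# 3 colors x 4 sizes, with three tuples present:
# ("red", 2) -> (0, 2), ("blue", 0) -> (2, 0), ("green", 3) -> (1, 3)
space = np.zeros((3, 4), dtype=np.uint8)
space[0, 2] = space[2, 0] = space[1, 3] = 1

# Mask for color == "red" (row 0) and mask for size >= 2 (columns 2..3).
color_is_red = np.zeros_like(space); color_is_red[0, :] = 1
size_ge_2    = np.zeros_like(space); size_ge_2[:, 2:] = 1

# The selection result is again a bit pattern in attribute space.
result = space & color_is_red & size_ge_2
print(result.nonzero())   # (array([0]), array([2]))  i.e. ("red", 2)
```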

Data Cube Representation
One value (e.g., sales) is given in the space of the key attributes:
  space-based with respect to the key attributes
  element-based with respect to the non-key attributes

Properties of the Domain Space
Ideally the space should have a distance, a norm, etc.; this is especially important for data mining.
Does that make sense for all domains? Can any domain be mapped to the integers?

Can All Domains Be Mapped to Integers?
Simplistic answer: yes! All information in a computer is saved as bits, and any sequence of bits can be interpreted as an integer.
Problems:
  The order may be irrelevant, e.g., hair color
  The order may be wrong, e.g., the sign bit of an int
  Even if the order is correct, the spacing may vary, e.g., float (solution in the paper: intervalization)
  Domains may be very large, e.g., movies

Categorical Attributes (irrelevant order)
We need more than one attribute for an appropriate representation.
Common data mining solution: 1 attribute per domain value
Our solution: 1 attribute per bit slice
  Values are corners of a hypercube in log(domain size) dimensions
  Distances are given through the MAX metric
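The sketch below illustrates the bit-slice encoding of a categorical domain; the hair-color values and the helper names encode and max_metric are invented for illustration. Each value becomes a corner of a hypercube with ⌈log2(domain size)⌉ dimensions, and under the MAX (Chebyshev) metric all distinct values are equidistant, which is what an order-free attribute should look like.

```python
# Hypothetical encoding of a categorical attribute as bit slices.
import math

domain = ["black", "blond", "brown", "red"]          # made-up hair colors
bits = max(1, math.ceil(math.log2(len(domain))))     # one attribute per bit slice

def encode(value):
    """Corner of the {0,1}^bits hypercube for this category (bit 0 = MSB)."""
    code = domain.index(value)
    return tuple((code >> (bits - 1 - i)) & 1 for i in range(bits))

def max_metric(p, q):
    """Chebyshev (MAX) distance between two corners."""
    return max(abs(a - b) for a, b in zip(p, q))

print(encode("black"), encode("red"))                # (0, 0) (1, 1)
print(max_metric(encode("black"), encode("red")))    # 1
print(max_metric(encode("blond"), encode("blond")))  # 0
# Under the MAX metric every pair of distinct categories has distance 1.
```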

Fundamental Partition (Space-Based Representation)
# of dimensions = number of attributes
# of represented points = product of all domain sizes
⇒ Exponential in the number of dimensions!
⇒ We badly need compression!

How Do We Handle Size?
The problem is exponential in the # of attributes ⇒ how can we reduce the # of attributes?
Review normalization: we can decompose a relation into a set of relations, each of which contains the entire key and one other attribute.
This decomposition is lossless and dependency preserving (for BCNF relations only).

Compression for Non-Key Attributes
The fundamental partition contains only one non-zero data point along any non-key dimension.
Represent the number by bit slices.
Note: this works for numerical and categorical attributes.
Original values can be regained by ANDing.
Example: 5 (binary 101) is bit 0 & bit 1' & bit 2
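A small sketch of this bit-slice trick, assuming 3-bit values and the slide's convention that bit 0 is the most significant bit; the column of numbers is made up.

```python
# Bit-slicing a non-key column and regaining values by ANDing slices.
import numpy as np

values = np.array([5, 3, 7, 5, 0])            # made-up non-key column
bits = 3

# One bit slice per position: slices[i][r] is bit i of values[r] (bit 0 = MSB).
slices = [(values >> (bits - 1 - i)) & 1 for i in range(bits)]

# "5 is bit 0 & bit 1' & bit 2": AND slice 0, the complement of slice 1, slice 2.
rows_equal_5 = slices[0] & (1 - slices[1]) & slices[2]
print(rows_equal_5)            # [1 0 0 1 0] -> rows 0 and 3 hold the value 5

# Any row's original value can be regained from its bits in the slices.
row = 3
print(sum(slices[i][row] << (bits - 1 - i) for i in range(bits)))   # 5
```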

Concept Hierarchies
Bit-sliced representations have significant benefits beyond compression: bit slices can be combined into concept hierarchies:
  Highest level: bit 0
  Next level: bit 0 & bit 1
  Next level: bit 0 & bit 1 & bit 2
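The following sketch (same assumptions as above: 3-bit values, bit 0 = most significant bit, made-up data) shows how keeping only the first k bit slices yields progressively coarser levels of a concept hierarchy.

```python
# Concept-hierarchy levels from a bit-sliced attribute.
import numpy as np

values = np.array([5, 3, 7, 5, 0, 2])                       # made-up column
bits = 3
slices = [(values >> (bits - 1 - i)) & 1 for i in range(bits)]

def level(k):
    """Level k of the hierarchy: keep only the first k bit slices."""
    return sum(slices[i] << (k - 1 - i) for i in range(k))

print(level(1))   # [1 0 1 1 0 0]   highest level: bit 0 only (two buckets)
print(level(2))   # [2 1 3 2 0 1]   next level: bit 0 & bit 1 (four buckets)
print(level(3))   # [5 3 7 5 0 2]   full resolution: all slices
```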

Compression for Key Attributes
Database-state-independent compression could lead to information loss (counts > 1).
Database-state-dependent compression: a tree structure that eliminates pure subtrees ⇒ P-trees
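As a rough illustration of the pure-subtree idea, here is a simplified one-dimensional sketch, not the paper's quadrant-based Peano layout: a bit vector is split recursively, and any subtree that is all 0s or all 1s is collapsed into a single count node.

```python
# Simplified sketch of pure-subtree elimination (the principle behind P-trees).
def build(bits):
    n = len(bits)
    ones = sum(bits)
    if ones == 0 or ones == n:          # pure subtree: keep the count, no children
        return {"count": ones, "size": n}
    mid = n // 2                        # mixed subtree: keep the count and recurse
    return {"count": ones, "size": n,
            "children": [build(bits[:mid]), build(bits[mid:])]}

tree = build([1, 1, 1, 1, 0, 0, 1, 0])
print(tree["count"])         # 5 ones in total
print(tree["children"][0])   # {'count': 4, 'size': 4}  pure half, fully collapsed
```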

Other Ideas
Compression is better if attribute values are dense within their domain, so we could use the extent domain:
  Compression: good
  Problems with insertion: reorganization of storage; index locking has to be reintroduced; …

How Good Is Compression So Far?
If all domains are "dense", i.e., all values occur: the size can easily be smaller than the original relation.
If non-key attributes are "sparse": not usually a problem (good compression); problems only in extreme cases, e.g., movies as attribute values!
If key attributes are "sparse": larger potential for problems, but also large potential for benefit (see data cubes).

Are Key Attributes Usually Sparse?
Many key attributes are dense ("structure" attributes as keys):
  Automatically generated IDs are usually sequential
  x and y in spatial data mining
  Time in data streams
Keys in tables that represent relationships tend to be sparse ("feature" attributes as keys):
  Student / course offering / grade
  Data cubes!

What Have We Gained? (Database Aspects)
The data simultaneously acts as the index
No separate index locking (unless the extent domain is used)
All information is saved as bit patterns: easy "select"
Other database operations discussed in class

What Have We Gained? (Feature Attribute Keys)
Direct mining is possible on relations with feature attribute keys, e.g., student / course offering / grade
Rollup can be defined, etc.
Clustering, classification, and ARM can make use of the proximity inherent in the representation
The bit-wise representation provides a concept hierarchy for non-key attributes
The tree structure provides a concept hierarchy for key attributes

What Have We Gained? (Structure Attribute Keys)
For relations with structure attribute keys, mining requires "and"ing, which produces counts for the feature attributes
The bit-wise representation provides a concept hierarchy for non-key attributes
Duality: concept hierarchies in this representation map exactly to the tree structure when the attribute is a key

Mapping Concept Hierarchies: Bit Slices ↔ Tree
P-tree: take the key attributes, e.g., x and y, and bit-interleave them: x = …, y = …
Any two of these digits form a level in the P-tree, or equivalently a level in a concept hierarchy.
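A hedged sketch of the bit-interleaving step; the x and y values and the 3-bit width are assumptions, since the slide's own example digits are elided.

```python
# Bit-interleaving two key attributes (Peano/Z-order), most significant bit first.
def interleave(x, y, bits=3):
    """Return the interleaved bit sequence x0 y0 x1 y1 ... (bit 0 = MSB)."""
    digits = []
    for i in range(bits - 1, -1, -1):
        digits.append((x >> i) & 1)
        digits.append((y >> i) & 1)
    return digits

d = interleave(x=5, y=3)        # x = 101, y = 011 in binary
print(d)                        # [1, 0, 0, 1, 1, 1]

# Each consecutive (x-bit, y-bit) pair selects one of four quadrants, i.e.
# one level of the P-tree and one level of the concept hierarchy.
print([tuple(d[i:i + 2]) for i in range(0, len(d), 2)])   # [(1, 0), (0, 1), (1, 1)]
```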

How Could We Use That Duality?
Join with other relations and project off the key attributes (Meta P-trees). Can we do that? We lose uniqueness:
  We can use 1 to represent one or more tuples (equivalent to relational algebra)
  Or we can introduce counts, which can be useful for data mining
The need for non-duplicate-eliminating counts also exists in other applications.

How Do Hierarchies Benefit Us in Databases?
Multi-granularity locking
Subtrees form suitable units for storage in a block
Fast access! Proportional to
  the # of levels in the tree
  the # of bits for bit slices

Summary
The space-based representation has many benefits:
  Value-based access and storage
  No separate index needed
  Rollups are easy
P-trees:
  Follow from systematic compression
  Benefit from concept hierarchies