High-dimensional Similarity Join


High-dimensional Similarity Join Presented by Yang Xia Wongsodihardjo, Hariyanto Wang Hao

Agenda -kdb tree join Introduction Motivation R*-tree based join Epsilon grid order join Summary

Introduction Extracting knowledge from large multi-dimensional databases. Many data mining algorithms need to process all pairs of points whose distance does not exceed a user-given parameter ε. Generating all such pairs is in essence a similarity join, so data mining algorithms can be run directly on top of a similarity join.
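
As a baseline for the definition above, here is a minimal sketch of a brute-force similarity self-join. The function name `similarity_self_join` and the sample points are illustrative assumptions, and Euclidean distance is assumed as the metric. It tests every pair, which is exactly the quadratic cost the algorithms in this talk try to avoid.

```python
import math
from itertools import combinations

def similarity_self_join(points, eps):
    """Return all index pairs (i, j), i < j, whose Euclidean distance is <= eps."""
    result = []
    for (i, p), (j, q) in combinations(enumerate(points), 2):
        if math.dist(p, q) <= eps:
            result.append((i, j))
    return result

# Hypothetical sample data: two tight clusters of two points each.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.0, 1.05)]
pairs = similarity_self_join(points, eps=0.15)
```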

Motivation Conventional join algorithms, such as nested-loop join, sort-merge join, and hash-based join, cannot be applied directly to high-D similarity joins. Instead, make use of an index built on the high-D data.

Efficient Processing of Spatial Joins Using R-trees by T. Brinkhoff, H. P. Kriegel, and B. Seeger SIGMOD 1993 Presented by Hariyanto Wongsodihardjo 6 September 2001

Efficient Processing of Spatial Joins Using R-trees Presents a study of spatial join processing using R-trees, particularly R*-trees, which are among the most efficient members of the R-tree family. Presents several techniques for improving spatial join execution time with respect to both CPU and I/O time.

R-tree Basic Algorithms Let S be the query rectangle of a window query. The query starts at the root and computes all entries whose rectangles intersect S. For these entries, the corresponding child nodes are read into main memory and the query is processed in the same way as at the root node. Query efficiency depends on how well the R-tree assigns rectangles to nodes.
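
The window query described above can be sketched as a short recursive function. This is a simplified in-memory illustration, not the paper's implementation: the `Node` class, the `(xmin, ymin, xmax, ymax)` rectangle layout, and the tiny hand-built tree are all assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    is_leaf: bool
    # Each entry is (rect, child): child is a Node in inner nodes, an object id in leaves.
    entries: list = field(default_factory=list)

def intersects(a, b):
    """Rectangles are (xmin, ymin, xmax, ymax); touching edges count as intersecting."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def window_query(node, s):
    """Collect the ids of all leaf entries whose rectangle intersects window s."""
    hits = []
    for rect, child in node.entries:
        if intersects(rect, s):
            if node.is_leaf:
                hits.append(child)
            else:
                # Only subtrees whose bounding rectangle intersects s are visited.
                hits.extend(window_query(child, s))
    return hits

# Hypothetical two-level tree.
leaf1 = Node(True, [((0, 0, 1, 1), "a"), ((2, 2, 3, 3), "b")])
leaf2 = Node(True, [((5, 5, 6, 6), "c")])
root = Node(False, [((0, 0, 3, 3), leaf1), ((5, 5, 6, 6), leaf2)])
```

A call such as `window_query(root, (4, 4, 7, 7))` never descends into `leaf1`, which is the pruning the slide refers to.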

A First Approach of a Spatial Join for R-trees CPU Time Tuning The consumption of CPU time is proportional to the number of floating-point comparisons required to evaluate the join condition (i.e., the test whether two rectangles intersect). Several constraints should be considered: The storage utilization and query performance of the original R*-tree should not be affected. Expensive preprocessing steps for the nodes of the R*-tree should be avoided. The algorithm should be robust and easy to implement.
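
To make the cost argument concrete, the following sketch joins two R-trees by descending both in lockstep and counts the rectangle tests performed. It assumes both trees have the same height and uses plain dictionaries for nodes; the helper names (`leaf`, `inner`, `spatial_join`) and the sample trees are hypothetical.

```python
def intersects(a, b):
    """Join condition: do rectangles (xmin, ymin, xmax, ymax) a and b overlap?"""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def leaf(*entries):
    return {"leaf": True, "entries": list(entries)}

def inner(*entries):
    return {"leaf": False, "entries": list(entries)}

def spatial_join(node_r, node_s, out, stats):
    """Test every entry pair of the two nodes and recurse into intersecting children."""
    for rect_r, child_r in node_r["entries"]:
        for rect_s, child_s in node_s["entries"]:
            stats["tests"] += 1  # each test costs a handful of float comparisons
            if not intersects(rect_r, rect_s):
                continue
            if node_r["leaf"]:
                out.append((child_r, child_s))
            else:
                spatial_join(child_r, child_s, out, stats)

tree_r = inner(((0, 0, 3, 3), leaf(((0, 0, 1, 1), "r1"), ((2, 2, 3, 3), "r2"))),
               ((6, 6, 9, 9), leaf(((6, 6, 7, 7), "r3"))))
tree_s = inner(((2, 2, 5, 5), leaf(((2.5, 2.5, 4, 4), "s1"), ((4.5, 4.5, 5, 5), "s2"))),
               ((8, 0, 9, 1), leaf(((8, 0, 9, 1), "s3"))))
pairs, stats = [], {"tests": 0}
spatial_join(tree_r, tree_s, pairs, stats)
```

The nested loop over entry pairs is what the CPU-time tuning techniques on the following slides aim to cut down.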

CPU-Time and I/O-Time Tuning CPU-time tuning: restricting the search space; spatial sorting and plane sweep. I/O-time tuning: local plane-sweep order with pinning; local z-order.

Restricting the search space
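
A minimal sketch of the idea behind this slide, as it is usually described for R-tree joins: two entries can only intersect each other inside the overlap of their parent nodes' bounding rectangles, so entries that miss that overlap can be dropped before any pairwise test. The function names and sample rectangles are assumptions for illustration.

```python
def intersects(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def intersection(a, b):
    """Return the overlap rectangle of a and b, or None if they are disjoint."""
    xmin, ymin = max(a[0], b[0]), max(a[1], b[1])
    xmax, ymax = min(a[2], b[2]), min(a[3], b[3])
    return (xmin, ymin, xmax, ymax) if xmin <= xmax and ymin <= ymax else None

def restrict(entries, window):
    """Keep only the (rect, id) entries that touch the overlap window."""
    return [e for e in entries if intersects(e[0], window)]

# Hypothetical bounding rectangles and entries of one node from R and one from S.
mbr_r, mbr_s = (0, 0, 10, 10), (8, 0, 18, 10)
entries_r = [((0, 0, 2, 2), "r1"), ((7, 1, 9, 3), "r2")]
entries_s = [((8, 2, 9, 4), "s1"), ((15, 5, 17, 7), "s2")]

window = intersection(mbr_r, mbr_s)
kept_r = restrict(entries_r, window) if window else []
kept_s = restrict(entries_s, window) if window else []
```

Here one linear pass over each node leaves a single candidate on each side, so only one pairwise test remains instead of four.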

Spatial sorting and plane sweep
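
The slides for this step survive only as titles, so the following is a sketch of a standard plane sweep over two rectangle sequences sorted by their lower x-coordinate, under the assumption that this matches the technique shown. Entries are `(rect, id)` pairs with `rect = (xmin, ymin, xmax, ymax)`; all names and data are hypothetical.

```python
def _scan(t, others, start, out, swap):
    """Pair t with every rectangle in others[start:] whose xmin lies within t's x-extent."""
    rect_t, id_t = t
    k = start
    while k < len(others) and others[k][0][0] <= rect_t[2]:
        rect_o, id_o = others[k]
        # The x-extents overlap by construction, so only the y-extents are tested.
        if rect_t[1] <= rect_o[3] and rect_o[1] <= rect_t[3]:
            out.append((id_o, id_t) if swap else (id_t, id_o))
        k += 1

def plane_sweep_pairs(rs, ss):
    """Return all intersecting (r_id, s_id) pairs without testing every combination."""
    rs = sorted(rs, key=lambda e: e[0][0])
    ss = sorted(ss, key=lambda e: e[0][0])
    out, i, j = [], 0, 0
    while i < len(rs) and j < len(ss):
        if rs[i][0][0] <= ss[j][0][0]:
            _scan(rs[i], ss, j, out, swap=False)
            i += 1
        else:
            _scan(ss[j], rs, i, out, swap=True)
            j += 1
    return out

rs = [((0, 0, 2, 2), "r1"), ((3, 0, 5, 2), "r2")]
ss = [((1, 1, 4, 3), "s1"), ((6, 0, 7, 1), "s2")]
pairs = plane_sweep_pairs(rs, ss)
```

No extra data structure is built: sorting plus two cursors is enough, which fits the constraint that node preprocessing should stay cheap.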

Local plane-sweep order

Local plane-sweep order with pinning (SJ4) The sequence for local plane-sweep order on example 2 is II, I, IV, III, and the read schedule is <r1, s2, s1, r2, s2, r4, r3>. The pinning algorithm is based on the degree of the rectangles of both entries. The degree of a rectangle E is the number of intersections between E and the rectangles belonging to entries of the other tree that have not yet been processed. The page whose rectangle has the maximum degree is pinned, and the join is performed for the pinned page. Thus, for example 2, the read schedule becomes <r1, s2, r4, r3, s1, r2>.
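
The pinning rule can be sketched as follows. This is a simplified reading of the slide, not the paper's code: after a pair of pages is joined, the page with the larger degree (number of still-pending pairs it takes part in) is pinned and all of its pending pairs are processed before moving on. The function name and the list of intersecting page pairs are hypothetical and do not reproduce the slide's example 2.

```python
def schedule_with_pinning(pairs):
    """pairs: (r_page, s_page) tuples in plane-sweep order. Returns the join order."""
    pending = list(pairs)
    order = []
    while pending:
        r, s = pending.pop(0)
        order.append((r, s))
        # Degree = number of not-yet-processed pairs each page still participates in.
        deg_r = sum(1 for p in pending if p[0] == r)
        deg_s = sum(1 for p in pending if p[1] == s)
        if deg_r == 0 and deg_s == 0:
            continue
        pin_r = deg_r >= deg_s
        pinned = [p for p in pending if (p[0] == r if pin_r else p[1] == s)]
        order.extend(pinned)  # finish everything involving the pinned page first
        pending = [p for p in pending if p not in pinned]
    return order

sweep_order = [("r1", "s1"), ("r2", "s1"), ("r1", "s2"), ("r3", "s1")]
order = schedule_with_pinning(sweep_order)
```

In this toy input page `s1` has degree 2 after the first pair, so it stays in the buffer while `r2` and `r3` are read, instead of being evicted and fetched again later.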

Local z-order (SJ5)

Local z-order (SJ5) Compute the intersections between each rectangle of R and all rectangles of S. Sort the resulting rectangles by the spatial location of their centers, using z-ordering. Then pin pages as before. The sequence for Figure 7 is I, II, III, V, IV, and the read schedule is <s1, r2, r1, s2, r4, r3, s3>.
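
Sorting rectangles by the z-order of their centers can be sketched with a bit-interleaving (Morton code) key. The grid resolution, the unit data space, and the choice of putting x in the even bit positions are assumptions of this example, as are the function names.

```python
def interleave_bits(x, y, bits=16):
    """Morton code: bit b of x goes to position 2b, bit b of y to position 2b + 1."""
    code = 0
    for b in range(bits):
        code |= ((x >> b) & 1) << (2 * b)
        code |= ((y >> b) & 1) << (2 * b + 1)
    return code

def z_value(rect, grid=1024, extent=1.0):
    """Z-value of a rectangle's center; rect = (xmin, ymin, xmax, ymax) in [0, extent]."""
    cx = (rect[0] + rect[2]) / 2
    cy = (rect[1] + rect[3]) / 2
    gx = min(int(cx / extent * grid), grid - 1)
    gy = min(int(cy / extent * grid), grid - 1)
    return interleave_bits(gx, gy)

# Hypothetical intersection rectangles, one per quadrant of the unit square.
rects = [(0.6, 0.6, 0.8, 0.8), (0.0, 0.0, 0.2, 0.2),
         (0.6, 0.0, 0.8, 0.2), (0.0, 0.6, 0.2, 0.8)]
ordered = sorted(rects, key=z_value)
```

With this bit layout the quadrants are visited lower-left, lower-right, upper-left, upper-right, so rectangles that are close in space tend to be close in the read schedule.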

I/O Performance Comparison

Conclusion The R*-tree join algorithm is straightforward. It reduces CPU time by applying spatial sorting and restricting the search space. It reduces I/O time by applying local plane-sweep order with pinning or local z-order.

High-dimensional similarity joins (ε-kdb tree) Presented by Yang Xia Reference: K. Shim, R. Srikant, and R. Agrawal, High-dimensional similarity joins, Proc. 13th IEEE Internat. Conf. on Data Engineering, 1997, pp. 301-311.

Introduction The ε-kdb tree is a main-memory data structure optimized for performing similarity joins. It uses the similarity distance limit ε as a parameter in building the tree. Problem Definition -Self-join -Non-self-join -Distance metric:

Problems with Current Indices Number of Neighboring Leaf Nodes Storage Utilization Traversal Cost Build Time Skewed Data

ε-kdb tree Definition The coordinates of the points in each dimension lie between 0 and +1. Start with a single leaf node. Whenever the number of points in a leaf node exceeds a threshold, the leaf node is split. If the leaf node was at level i, the i-th dimension is used for splitting. The node is split into ⌊1/ε⌋ parts.
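
The construction rule above can be sketched in a few lines. This is an illustrative simplification, not the authors' implementation: it assumes the root is level 0 and splits on dimension 0, that a leaf splits into ⌊1/ε⌋ stripes of width ε with the last stripe absorbing any remainder, and that leaves at depth d (the number of dimensions) are never split. The class and function names are hypothetical.

```python
import math

class EKdbNode:
    def __init__(self, level):
        self.level = level      # the split dimension if this node is ever split
        self.points = []        # points stored while the node is a leaf
        self.children = None    # dict: stripe index -> EKdbNode once split

def stripe(p, level, eps):
    """Index of the eps-wide stripe that p falls into along dimension `level`."""
    return min(int(p[level] / eps), math.floor(1 / eps) - 1)

def insert(node, p, eps, threshold):
    # Descend to the leaf responsible for p, creating children on demand.
    while node.children is not None:
        node = node.children.setdefault(stripe(p, node.level, eps),
                                        EKdbNode(node.level + 1))
    node.points.append(p)
    # Split an overfull leaf as long as an unused dimension remains.
    if len(node.points) > threshold and node.level < len(p):
        pts, node.points = node.points, []
        node.children = {}
        for q in pts:
            insert(node, q, eps, threshold)

root = EKdbNode(level=0)
for pt in [(0.1, 0.1), (0.2, 0.9), (0.6, 0.5), (0.9, 0.2)]:
    insert(root, pt, eps=0.25, threshold=2)
```

Because each stripe is at least ε wide, a join partner of a point in stripe k can only lie in stripes k-1, k, or k+1 of the same node.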

Example of ε-kdb tree

Similarity Join using the ε-kdb tree

Memory Management Main memory can hold all points within a 2ε distance on the first dimension.

Memory Management Main memory cannot hold all points within a 2ε distance on the first dimension.

Design Rationale Biased Splitting: The dimension used in the previous split is selected again for splitting as long as the length of that dimension in the bounding rectangle of each resulting leaf node is at least ε. ε-Sized Splitting: When we split a node, we split it into ε-sized chunks.

Design Rationale Number of Neighboring Leaf Nodes. Space Requirements. Traversal Cost. Build time. Skewed data.

An example

Experiments Synthetic Data Parameters

Experiments(1)

Experiments(2)

Experiments(3)

Conclusions The ε-kdb tree reduces the number of neighboring leaf nodes that are considered for the join test. The ε-kdb tree reduces the traversal cost of finding appropriate branches in the internal nodes. The storage cost for internal nodes is independent of the number of dimensions.

Epsilon Grid Order: An Algorithm for the Similarity Join on Massive High-Dimensional Data Christian Böhm, Bernhard Braunmüller, Florian Krebs, and Hans-Peter Kriegel SIGMOD 2001 Presented by Wang Hao 6 September 2001

Motivation Index-based join: R-tree family, MuX (Multipage Index) tree, etc. Optimization conflict between CPU and I/O [BK01]. Optimizing CPU: fine-grained partitioning with page capacities of a few points. Optimizing I/O: large block sizes require fewer I/O operations. Join without index: seeded tree, spatial hash join, ε-kdb tree, etc. Not scalable to large data sets. ε-kdb tree: cache size can be from 36% to 60% of the database size.

Design Objectives Join without index. Optimize both CPU and I/O. Scalable to large data sets of size well beyond 1 GB.

Basic Ideas Define a sort order of the data: epsilon grid order. Lay an equidistant grid with cell length ε over the data space and compare the cells lexicographically. Use external sort to sort the data. Schedule the I/O carefully during the join phase.

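Epsilon Grid Order For two vectors p, q the predicate $p <_{ego} q$ is true iff there exists a dimension $d_i$ such that $\lfloor p_i/\varepsilon \rfloor < \lfloor q_i/\varepsilon \rfloor$ and $\lfloor p_j/\varepsilon \rfloor = \lfloor q_j/\varepsilon \rfloor$ for all $j < i$. Epsilon grid order is a strict order: irreflexive, asymmetric, and transitive.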
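
A minimal sketch of the order, assuming it compares the grid-cell coordinates floor(p_i / ε) lexicographically, as the Basic Ideas slide describes. Python tuples already compare lexicographically, so the cell tuple can serve directly as a sort key. The names `ego_key` and `ego_less` are illustrative.

```python
import math

def ego_key(p, eps):
    """Grid cell of p: one integer cell coordinate per dimension."""
    return tuple(math.floor(x / eps) for x in p)

def ego_less(p, q, eps):
    """True iff p precedes q in epsilon grid order (strict, so never for equal cells)."""
    return ego_key(p, eps) < ego_key(q, eps)

eps = 0.5
p, q, r = (0.1, 0.9), (0.4, 0.2), (0.6, 0.0)   # cells (0, 1), (0, 0), (1, 0)
ordered = sorted([p, q, r], key=lambda v: ego_key(v, eps))
```

Two points in the same cell are not ordered relative to each other, which is consistent with the order being strict rather than total.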

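Epsilon Grid Order (Cont.) A point q with $q <_{ego} p - [\varepsilon, \ldots, \varepsilon]^T$ cannot be a join mate of p, or of any point p' which is not $p' <_{ego} p$. A point q with $p + [\varepsilon, \ldots, \varepsilon]^T <_{ego} q$ cannot be a join mate of p, or of any point p' which is not $p <_{ego} p'$.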
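
One way to use this property, sketched under the assumption that the data is already sorted in epsilon grid order: the possible join mates of p lie in a contiguous interval of the sorted file, bounded by the grid cells of p shifted by -ε and +ε in every dimension. The function names and sample data are hypothetical, and the interval is only a candidate set; each candidate still needs an exact distance test.

```python
import bisect
import math

def ego_key(p, eps):
    return tuple(math.floor(x / eps) for x in p)

def candidate_range(sorted_points, p, eps):
    """Slice of the ego-sorted list that can contain join mates of p."""
    keys = [ego_key(q, eps) for q in sorted_points]
    lo_key = ego_key([x - eps for x in p], eps)   # cell of p - [eps, ..., eps]
    hi_key = ego_key([x + eps for x in p], eps)   # cell of p + [eps, ..., eps]
    lo = bisect.bisect_left(keys, lo_key)         # skip points ordered before lo_key
    hi = bisect.bisect_right(keys, hi_key)        # skip points ordered after hi_key
    return sorted_points[lo:hi]

eps = 0.25
pts = sorted([(0.95, 0.95), (0.05, 0.05), (0.55, 0.10), (0.30, 0.30)],
             key=lambda v: ego_key(v, eps))
cands = candidate_range(pts, (0.30, 0.30), eps)
```

Everything outside the slice is excluded by two binary searches, without computing a single distance.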

I/O Scheduling Using the ε Grid Order Unbuffered I/O operations. Example: I/O units in a 2-D data space

I/O Scheduling (Cont.) Illustration: Pairs of IO units that must be considered for join. In the picture, each entry in the matrix stands for one pair of IO Units. IO thrashing effects

Scheduling Mode

Scheduling Algorithm

Joining Two IO Units Active dimensions Minlen: minimum of length of sequences for join.
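
The recursive join of two sorted sequences can be sketched as below. This is a simplified illustration rather than the paper's `join_sequence` routine: it ignores active dimensions, interprets `minlen` as the sequence length at or below which the recursion falls back to a nested loop, and prunes a pair of sequences only when one lies entirely before the other's first point shifted by -ε in epsilon grid order. Both inputs are assumed to be sorted in that order, and `minlen` must be at least 1.

```python
import math

def ego_key(p, eps):
    return tuple(math.floor(x / eps) for x in p)

def shifted_key(p, delta, eps):
    return ego_key([x + delta for x in p], eps)

def join_sequences(s, t, eps, minlen, out):
    """Append all pairs (p, q), p in s and q in t, with Euclidean distance <= eps."""
    if not s or not t:
        return
    # Prune: all of s precedes t[0] - [eps, ..., eps], or the mirrored case.
    if ego_key(s[-1], eps) < shifted_key(t[0], -eps, eps):
        return
    if ego_key(t[-1], eps) < shifted_key(s[0], -eps, eps):
        return
    if len(s) <= minlen and len(t) <= minlen:
        out.extend((p, q) for p in s for q in t if math.dist(p, q) <= eps)
        return
    # Halve the longer sequence; the recursion depth stays logarithmic.
    if len(s) >= len(t):
        mid = len(s) // 2
        join_sequences(s[:mid], t, eps, minlen, out)
        join_sequences(s[mid:], t, eps, minlen, out)
    else:
        mid = len(t) // 2
        join_sequences(s, t[:mid], eps, minlen, out)
        join_sequences(s, t[mid:], eps, minlen, out)

eps = 0.25
s = sorted([(0.05, 0.05), (0.30, 0.30), (0.95, 0.95)], key=lambda v: ego_key(v, eps))
t = sorted([(0.10, 0.10), (0.90, 0.90)], key=lambda v: ego_key(v, eps))
result = []
join_sequences(s, t, eps, minlen=1, out=result)
```

A larger `minlen` means fewer recursive calls but bigger nested loops, which is the CPU trade-off mentioned on the Optimization Potentials slide.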

Optimization Potentials Use larger sequences to optimize I/O. Optimize minlen for minimal CPU processing time. Unlike the ε-kdb tree and the MuX tree, no directory is constructed. The only space overhead is the recursion stack: O(log n). Other possible optimizations: modification of the sort order; optimization of the recursion in join_sequence.

Experiments Settings: Buffer memory: 10% of database size. Euclidean distance is used. Distance parameter ε: determined using the algorithm in [SEKX98] so that it is suitable for clustering. Compared with nested-loop join, Z-ordering R-tree based join, and MuX tree based join.

Experiments on Uniformly Distributed 8-D Data.

Experiments on Real 16-D Data from CAD Database.

Conclusions and Future work Defines a strict order: epsilon grid order. A sophisticated scheduling algorithm. Several optimization techniques. Experiments show it outperforms competing algorithms for data sets of up to 1.2 GB. Future work: a parallel version of the join algorithm; extending the cost model for use in a query optimizer.

Overall Summary We have covered three join algorithms: R*-tree based join, ε-kdb tree join, and epsilon grid order join. Specific algorithms have been proposed to perform similarity joins for each of the following cases: both data sets have an index, only one data set has an index, neither data set has an index. High-D similarity joins can be applied in data mining algorithms such as clustering.

Resource Links Readings on High-dimensional Similarity Join http://www.comp.nus.edu.sg/~wanghao/cs6203/join.htm