Using Trees to Depict a Forest

Presentation transcript:

Using Trees to Depict a Forest
Proceedings of the VLDB Endowment, Volume 2, 2009
Bin Liu, H. V. Jagadish
Department of EECS, University of Michigan, Ann Arbor, USA

Introduction and Motivation: The Many-Answers Problem
A common approach to displaying many results is to batch them into pages, yet it has been shown that over 85% of users look at only the first page of results returned by a search engine. Similar results are distributed across many pages: a car costing $8,500 with 49,000 miles may be very similar to another costing $8,200 with 55,000 miles, but there could be many intervening cars between them in the sort order, say by price. Meanwhile, users often start by exploring the dataset and become increasingly clear about their needs as they browse.
Running example: Ann wants to buy a car. The Cars table has attributes ID, Model, Price, and Mileage.
Query: SELECT * FROM Cars WHERE Model = 'Civic' AND Price < 15,000 AND Mileage < 80,000

Contents
Introduction
Set of representatives
Cover-tree-based clustering
Query refinement
Experimental details

Challenges
What does it mean to represent a large dataset?
Representation modeling: use a small set of data points to represent the whole dataset.
Representative finding: find the representative data points that best fit the chosen model; the waiting time perceived by the user should not be significant.
Query refinement: queries will frequently be modified and reissued based on results seen so far, so we should permit users to ask for "more like this", an operation we call zooming in.

MusiqLens Framework: Zooming In
At the user's request, MusiqLens displays more representatives similar to a particular tuple. A hyperlink is provided for the user to browse those items. Suppose the user chooses to see more cars like the first one; since they cannot all fit on one screen, MusiqLens shows representatives drawn from that subset of cars. We call this operation "zooming in", by analogy to zooming into a finer level of detail when viewing an image.

What is a good set of representatives?
Given a large dataset, our problem is to find a small number of tuples that best represent the whole dataset. Candidate methods:
Random selection: draw random row numbers between 0 and the dataset cardinality. This is a baseline against which to compare other techniques.
Density-based sampling: probabilistically under-sample dense regions and over-sample sparse regions.
k-medoids: a medoid of a cluster is the point whose average (or maximum) dissimilarity to all other points in the cluster is smallest; select the k cluster medoids.
Sort by attribute: sort on one attribute at a time in a multi-attribute scenario.
Sort by typicality: treat the data as distributed samples of a continuous random variable, and select the points where the probability density function has the highest values.
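To make the k-medoid criterion concrete, here is a minimal Python sketch (illustrative only, not the paper's implementation) that picks the medoid of one cluster as the point with the smallest average distance to the others:

```python
import numpy as np

def medoid(points):
    """Return the point whose average distance to all others is smallest.

    points: (n, d) array of n points in d dimensions.
    """
    # Pairwise Euclidean distances (n x n matrix).
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # The medoid minimizes the average (equivalently, total) distance.
    return points[dists.sum(axis=1).argmin()]

# Example: the medoid of a small 2-D cluster.
cluster = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0]])
print(medoid(cluster))  # -> [1.2  0.9], the roughly central point
```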

User Study
The goal of choosing representative points is to give users a good sense of what else to expect in the data. Ten subjects were presented with only the representative points; for each set of representatives, the study elicited from the users what they thought the rest of the dataset looked like, and measured the prediction error as a distance. Conclusion: k-medoid (average) cluster centers constitute the representative set of choice.

Cover-Tree-Based Clustering Algorithm
Finding the medoids is the challenge. Many clever clustering techniques exist, but none of them addresses the query-refinement challenge, and even their support for incremental computation is limited. The cover tree helps reduce the problem of finding medoids in the original dataset to finding medoids in a sample.

Cover Tree Properties
1. Each node of the tree is associated with one of the data points s_j.
2. If a node is associated with data point s_j, then one of its children must also be associated with s_j (nesting).
3. All nodes at level i are separated by at least D(i) (separation).
4. Each node at level i is within distance D(i) of its children at level i+1 (covering).
Here D(i) = 1/2^i, where i is the level of the node and the root is at level 0. The explicit cover tree has O(n) space cost and can be constructed in O(n^2) time.
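To make the invariants concrete, the following sketch (points are plain coordinate tuples here, not the paper's data structures) encodes D(i) and checks the separation and covering properties at one level:

```python
import math

def D(i):
    """Radius at level i: D(i) = 1 / 2**i (root is level 0)."""
    return 1.0 / (2 ** i)

def check_covering(parent, children, level):
    """Covering: each child at level+1 lies within D(level) of its parent."""
    return all(math.dist(parent, c) <= D(level) for c in children)

def check_separation(nodes_at_level, level):
    """Separation: distinct nodes at `level` are at least D(level) apart."""
    return all(math.dist(p, q) >= D(level)
               for idx, p in enumerate(nodes_at_level)
               for q in nodes_at_level[idx + 1:])
```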

Naive Nodes and the Span of a Node
Every explicit node either has a parent other than itself or a child other than a self-child; we call the remaining nodes naive nodes. A very important property of the cover tree is that the subtree under a node spans a distance of at most 2 * D(i), where i is the level at which the node appears.

Statistics on Cover-Tree Nodes
1) Density: the total number of data points in the subtree rooted at node s_i. A larger density indicates that the region covered by the node is more densely populated with data points.
2) Centroid: the average value of all data points in the subtree. If the subtree of node s_i contains T points and we denote the N-dimensional points as X_j, j = 1..T, then centroid(s_i) = (1/T) * sum_{j=1..T} X_j.
As each point is inserted, we update the statistics of all its ancestors. If the newly inserted point is X, then for each node i along its insertion path we update:
  centroid_i <- (centroid_i * density_i + X) / (density_i + 1)
  density_i <- density_i + 1
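A minimal sketch of this incremental update, assuming a hypothetical Node object carrying .density and .centroid (a numpy array):

```python
import numpy as np

def update_ancestor(node, x):
    """Fold a newly inserted point x into a node's running statistics.

    Called for every node on the insertion path of x. Note the centroid
    must be recomputed before the density counter is incremented.
    """
    node.centroid = (node.centroid * node.density + x) / (node.density + 1)
    node.density += 1
```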

Distance Cost Estimation for Candidate k-Medoids
Without reading the whole dataset, we can estimate the average distance cost of a candidate set of k medoids using the density and centroid information. To estimate the total distance from all data points under node s_1 to its medoid m_1, we compute the distance from the centroid of s_1 to m_1 and multiply it by the density of s_1. We do the same for all other nodes, sum up the distances, and average over the total number of points, obtaining an estimate of the average distance cost.
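The estimate fits in a few lines. The sketch below assumes working-level nodes with .centroid and .density as above, and assigns each subtree to its nearest candidate medoid (one plausible reading of the assignment implied in the text):

```python
import numpy as np

def estimate_avg_distance(nodes, medoids):
    """Estimate average point-to-medoid distance from per-node statistics.

    nodes:   working-level cover-tree nodes (.centroid, .density)
    medoids: candidate medoid coordinates (list of numpy arrays)
    Each subtree contributes dist(centroid, nearest medoid) * density.
    """
    total_dist, total_points = 0.0, 0
    for node in nodes:
        nearest = min(np.linalg.norm(node.centroid - m) for m in medoids)
        total_dist += nearest * node.density
        total_points += node.density
    return total_dist / total_points
```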

Average Medoid Computation
Traverse the cover tree down to the first level that has more than k nodes; call this the working level C_m. Each subtree rooted there can be viewed as a small cluster whose centroid and density are maintained at its root. Since k-medoid clustering is NP-hard, we settle for a local minimum of the distance cost from data points to medoids.
Seeding with space-filling curves: the Hilbert space-filling curve has been shown to preserve the locality of multidimensional objects when they are mapped to linear space, and this has been exploited in R-tree-based clustering. Applying the same idea to the cover tree, the nodes in C_m can be sorted by their Hilbert values and k seeds chosen evenly from the sorted list (see the sketch below).
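A sketch of the even-spacing seed selection, assuming a hilbert_index function is available (hypothetical here; in practice it would come from a space-filling-curve library):

```python
def hilbert_seeds(nodes, k, hilbert_index):
    """Pick k seeds spread evenly along the Hilbert order of the nodes.

    hilbert_index: maps a node's centroid to its Hilbert-curve value
    (assumed provided by a space-filling-curve library).
    """
    ordered = sorted(nodes, key=lambda n: hilbert_index(n.centroid))
    step = len(ordered) / k
    # Take every (n/k)-th node in Hilbert order as a seed.
    return [ordered[int(i * step)] for i in range(k)]
```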

Choosing Seeds Better than Hilbert Ordering
Level m-1 of the cover tree, which contains fewer than k nodes, provides hints for seeding because of the clustering property of the tree: nodes in C_m that share a common parent in C_{m-1} form a small cluster themselves, so we should avoid picking several seeds from among them. As a heuristic, more seeds are chosen from the children of a node whose descendants span a larger area, while nodes with relatively small descendant sets get lower priority in becoming seeds. The contribution of a subtree to the distance cost is proportional to the product of its density and span; we call this value the weight of a node.

A priority queue keyed on node weight is constructed. Initially, all non-naive nodes in C_{m-1} are pushed onto the queue. We pop the head node and fetch all its children, first making sure the queue reaches k nodes by adding children of the node with the largest weight. Afterwards, if any child has a larger weight than the minimum weight in the queue, it is pushed as well. This repeats until no more children can be pushed; the first k nodes in the queue are the seeds (see the sketch below). The remaining nodes are assigned to their closest seed to form k initial clusters. Using each cluster centroid as input, we find the corresponding medoid with a nearest-neighbor query (supported by the cover tree). For each final medoid o, the nodes in the working level that are closest to o form its CloseSet.
We can also improve the clusters before computing the final medoids: the centroid of each cluster is updated and nodes are re-assigned to the updated centroids, repeating until the centroids are stable. As another refinement step, we can use cover-tree-directed swaps: instead of swapping a medoid with a random node, we swap with nodes assigned to the closest neighboring medoid. Both measures significantly cut the computational cost.
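The following is a simplified sketch of this weight-guided expansion (not the paper's exact procedure). It assumes hypothetical Node objects with .weight = density * span and .children, and replaces the heaviest nodes with their children until at least k candidates are available:

```python
import heapq

def select_seeds(parents, k):
    """Pick k seeds, favoring children of heavy (dense, wide-span) subtrees.

    parents: non-naive nodes of level C_{m-1}.
    """
    # heapq is a min-heap, so negate weights to pop the heaviest first;
    # id() breaks ties between nodes with equal weight.
    heap = [(-n.weight, id(n), n) for n in parents]
    heapq.heapify(heap)
    candidates = []
    while heap:
        _, _, node = heapq.heappop(heap)
        if node.children and len(candidates) + len(heap) < k:
            # Still short of k seeds: replace this heavy node with its
            # children, so heavy subtrees contribute more seeds.
            for child in node.children:
                heapq.heappush(heap, (-child.weight, id(child), child))
        else:
            candidates.append(node)
    # Candidates drain in decreasing weight order; keep the k heaviest.
    return candidates[:k]
```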

Query Refinement
In practical dataset browsing and searching scenarios, users often find it necessary to add filtering conditions or remove attributes, often based on what they see in the data. Expensive re-computation that causes a long delay (e.g., seconds) for the user severely damages the usability of the system. MusiqLens therefore dynamically changes the representatives to match the new conditions with minimal cost.

Zooming In on a Representative
This operation is efficiently supported by the cover-tree structure. Every final medoid is associated with a CloseSet of nodes in the working level. Once a medoid s is chosen by the user, we generate more representatives around s: fetch all nodes in the CloseSet of s, descend the cover tree to fetch all their children into a list L, and treat the nodes in L as the new working level.
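A sketch of the zoom-in step, assuming each medoid carries its CloseSet and that compute_k_medoids stands for the medoid routine from the previous slides (both names are hypothetical):

```python
def zoom_in(medoid, k):
    """Generate k finer-grained representatives around a chosen medoid.

    medoid.close_set: the working-level nodes closest to this medoid.
    """
    new_level = []
    for node in medoid.close_set:
        new_level.extend(node.children)  # descend one cover-tree level
    # Re-run the usual medoid computation over the finer working level.
    return compute_k_medoids(new_level, k)  # assumed helper, see earlier
```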

Selection
When a user applies a selection condition, nodes in the working level may become 1) completely invalidated, 2) partially invalidated, or 3) remain completely valid. For a partially invalidated child node s in the 2-D case, we estimate the valid percentage of its subtree: we take the area around s within distance s.span and compute the fraction of that area that still satisfies the selection condition. After applying the selection, if fewer than k valid or partially valid nodes remain in the working level, we descend the cover tree to a level that has more than k nodes.
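One simple way to estimate the surviving percentage of a partially invalidated node is Monte-Carlo sampling over the node's disk. The paper computes the area directly, so this is only an illustrative stand-in:

```python
import math
import random

def valid_fraction_2d(center, span, predicate, samples=1000):
    """Estimate what fraction of a node's 2-D region survives a selection.

    center:    (x, y) of the node
    span:      radius of the disk the node's subtree covers
    predicate: the user's selection condition, evaluated on a point
    """
    hits = 0
    for _ in range(samples):
        # Sample uniformly from the disk of radius `span` around the node.
        r = span * math.sqrt(random.random())
        theta = random.uniform(0, 2 * math.pi)
        p = (center[0] + r * math.cos(theta), center[1] + r * math.sin(theta))
        hits += predicate(p)
    return hits / samples
```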

Projection
We assume the user removes one attribute at a time; the goal is to refresh the representatives without incurring much additional waiting for the user. Once an attribute is removed, the cover-tree index is no longer a valid reflection of the actual distances among data points, so using the cover tree for direction is no longer viable: after removing a dimension, nodes that were previously far apart can become very close, and weight and span become less accurate. Instead, we Hilbert-sort to re-order all nodes, find seeds as outlined earlier, and follow the average medoid computation.

Experiments
[Figure: MusiqLens system architecture]

Comparison with R-Tree-Based Methods
We compare the time to compute the medoids and the average Euclidean distance from each data point to its respective medoid, along with other parameters.

Conclusion
Goals: 1) find the representatives efficiently; 2) adapt efficiently when the user refines the query. Toward the first goal, we presented cover-tree-based algorithms for efficiently computing the medoids. Toward the second goal, we gave algorithms to efficiently regenerate the representatives when users add a selection condition, remove attributes, or zoom in on a representative.