Using Trees to Depict a Forest
  Bin Liu, H. V. Jagadish
  EECS, University of Michigan, Ann Arbor
  Presented by Sergey Shepshelvich
Motivation
  In interactive database querying, we often get more results than we can comprehend immediately
  When do you actually click past the first 2-3 pages of results?
    85% of users never go to the second page!
  What to display on the first page?
Standard solutions
  Sorting by attributes
    Computationally expensive
    Similar results can be distributed many pages apart
  Ranking
    Hard to estimate the user's preference
    In database queries, all tuples are equally relevant!
  What to do when there are millions of results?
Make the First Page Count
  Human beings are very capable of learning from examples
  Show the most “representative” results
    Best help users learn what is in the result set
    User can decide further actions based on representatives
The Proposal: MusiqLens Experience
  (Model-driven Usable Systems for Information Querying)
Suppose a user wants a 2005 Civic, but there are too many of them…
MusiqLens on the Car Data
  Id   Model  Price  Year  Mileage  Condition
  872  Civic  $12, ,000  Good       (122 more like this)
  901  Civic  $16, ,000  Excellent  (345 more like this)
  725  Civic  $18, ,000  Excellent  (86 more like this)
  423  Civic  $17, ,000  Good       (201 more like this)
  132  Civic  $9, ,000   Fair       (185 more like this)
  322  Civic  $14, ,000  Good       (55 more like this)
After Zooming in: 2005 Honda Civics ~ ID 132
  Id   Model  Price  Year  Mileage  Condition
  342  Civic  $9, ,000   Good  (25 more like this)
  768  Civic  $10, ,000  Good  (10 more like this)
  132  Civic  $9, ,000   Fair  (63 more like this)
  122  Civic  $9, ,000   Good  (5 more like this)
  123  Civic  $9, ,000   Fair  (40 more like this)
  898  Civic  $9, ,000   Fair  (42 more like this)
After Filtering by “Price < 9,500”
  Id   Model  Price  Year  Mileage  Condition
  123  Civic  $9, ,000  Fair  (40 more like this)
  898  Civic  $9, ,000  Fair  (42 more like this)
  133  Civic  $9, ,000  Fair  (33 more like this)
  126  Civic  $9, ,000  Good  (3 more like this)
  129  Civic  $8, ,000  Fair  (20 more like this)
  999  Civic  $9, ,000  Fair  (12 more like this)
Challenges
  Representation Modeling: finding a suitable metric
    What is the best set of representatives?
  Representative finding
    How to find them efficiently?
  Query Refinement
    How to efficiently adapt to user’s query operations?
Finding a Suitable Metric
  Users should be the ultimate judge
    Which metric generates the representatives that I can learn the most from?
  User study to evaluate different representation models
    Use a set of candidate metrics
    Users observe the representatives
    Users estimate more data points in the data
    The candidate whose representatives lead to the best estimation wins
Metric Candidates
  Sort by attributes
  Uniform random sampling
    Small clusters are missed
  Density-biased sampling
    Sample more from sparse regions, less from dense regions
  Sort by typicality
    Based on probabilistic modeling
  K-medoids
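The density-biased candidate can be illustrated with a small sketch. Everything specific here is an assumption for illustration and not taken from the slides: points are binned on a uniform grid, each point gets weight 1 / (cell count ** exponent), and a weighted sample without replacement is drawn with the Efraimidis-Spirakis key trick. The names `cell_of`, `density_biased_sample`, `cell_size`, and `exponent` are hypothetical.

```python
import random
from collections import Counter


def cell_of(point, cell_size):
    """Map a point to the grid cell that contains it."""
    return tuple(int(c // cell_size) for c in point)


def density_biased_sample(points, k, cell_size=0.25, exponent=1.0, seed=0):
    """Draw k distinct points, favoring points in sparsely populated cells."""
    counts = Counter(cell_of(p, cell_size) for p in points)
    # Assumed weighting: the denser the cell, the smaller the weight.
    weights = [counts[cell_of(p, cell_size)] ** -exponent for p in points]
    rng = random.Random(seed)
    # Weighted sampling without replacement: keep the k largest u ** (1 / w).
    keys = [rng.random() ** (1.0 / w) for w in weights]
    order = sorted(range(len(points)), key=lambda i: keys[i], reverse=True)
    return [points[i] for i in order[:k]], weights


# Four points in one dense cell and one isolated point (made-up data).
points = [(0.10, 0.10), (0.12, 0.11), (0.11, 0.14), (0.13, 0.12), (0.90, 0.90)]
sample, weights = density_biased_sample(points, k=2)
```

With `exponent=1.0`, every occupied cell carries the same total weight, so the isolated point is as likely to be picked as the whole dense cluster combined.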
Sort by Typicality
  Proposed by Ming Hua, Jian Pei, et al. [4]
  Figure source: slides from Ming Hua
Metric Candidates - K-medoids
  A medoid of a cluster is the object whose dissimilarity to the other objects is smallest
    Average medoid (smallest average dissimilarity) and max medoid (smallest maximum dissimilarity)
  K-medoids are k objects, each the medoid of a different cluster
  Why not K-means?
    K-means cluster centers do not exist in the database
    We must present real objects to users
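The two medoid variants, and the reason k-means centers are unsuitable, can be seen on a tiny cluster. This is a minimal sketch assuming Euclidean distance as the dissimilarity; the function names and the data are illustrative, not from the paper.

```python
from math import dist


def average_medoid(points):
    """Point with the smallest total (hence average) distance to all points."""
    return min(points, key=lambda p: sum(dist(p, q) for q in points))


def max_medoid(points):
    """Point with the smallest maximum distance to any point in the cluster."""
    return min(points, key=lambda p: max(dist(p, q) for q in points))


cluster = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
# The k-means style center is the coordinate-wise mean of the cluster.
mean = tuple(sum(c) / len(cluster) for c in zip(*cluster))

print(average_medoid(cluster))  # (2, 0)
print(max_medoid(cluster))      # (3, 0)
print(mean)                     # (3.2, 0.0), not a point in the data
```

Both medoids are real tuples that could be shown to a user, while the mean (3.2, 0.0) corresponds to no record in the cluster.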
Plotting the Candidates
  Data: Yahoo! Autos, 3922 data points
  Price and mileage are normalized to 0..1
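The slides do not say how the normalization was done. One common choice is min-max scaling, sketched below as an assumption; the helper name `min_max_normalize` and the sample values are hypothetical.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


prices = [9000, 12000, 18000]      # hypothetical prices
mileages = [30000, 50000, 80000]   # hypothetical mileages
points = list(zip(min_max_normalize(prices), min_max_normalize(mileages)))
```

Scaling both attributes to the same range keeps distance computations from being dominated by whichever attribute has the larger raw magnitude.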
Plotting the Candidates - Typicality
Plotting the Candidates – k-medoids
User Study Procedure
  Users are given:
    7 sets of data, generated using the 7 candidate methods
    Each set consists of 8 representative points
  Users predict 4 more data points
    That are most likely in the data set
    Should not pick those already given
  Measure the prediction error
Verdict
  K-medoids is the winner
  In this paper, the authors choose average k-medoids
  The proposed algorithm can be extended to max-medoids with small changes
Challenges
  Representation Modeling: finding a suitable metric
    What is the best set of representatives?
  Representative finding
    How to find them efficiently?
  Query Refinement
    How to efficiently adapt to user’s query operations?
Cover Tree Based Algorithm
  Cover Tree was proposed by Beygelzimer, Kakade, and Langford in 2006
  Briefly discuss Cover Tree properties
  See Cover Tree based algorithms for computing k-medoids
Cover Tree Properties (1)
  Points in the Data (One Dimension)
Cover Tree Properties (2)
Cover Tree Properties (3)
Additional Stats for Cover Tree (2D Example)
  Density (DS): number of points in the subtree
    Figure labels: DS = 10, DS = 3
  Centroid (CT): geometric center of points in the subtree
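Both statistics can be aggregated bottom-up, since a parent's density is the sum of its children's densities and its centroid is their density-weighted average. The sketch below uses a simplified tree of nested lists with (x, y) tuples as leaves; that representation is an assumption for illustration, not the cover tree layout used by the authors, and it assumes every inner list is non-empty.

```python
def subtree_stats(node):
    """Return (density, centroid) for a tree of nested lists with (x, y) leaves."""
    if isinstance(node, tuple):          # a leaf is a single data point
        return 1, (float(node[0]), float(node[1]))
    density, sum_x, sum_y = 0, 0.0, 0.0
    for child in node:
        d, (cx, cy) = subtree_stats(child)
        density += d
        sum_x += d * cx                  # weight each child centroid by its density
        sum_y += d * cy
    return density, (sum_x / density, sum_y / density)


tree = [[(0, 0), (2, 0)], [(4, 6)]]      # made-up points
density, centroid = subtree_stats(tree)
```

Storing the pair on each node means the statistics of any subtree can be read off without revisiting its points.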
k-medoid Algorithm Outline
Cover Tree Based Seeding
A Simple Example: k = 4
  Tree levels: Span = 2, Span = 1, Span = 1/2, Span = 1/4
  Priority queue on node weight (density * span):
    S3 (5), S8 (3), S5 (2)
    S8 (3/2), S5 (1), S3 (1), S7 (1), S2 (1/2)
  Final set of seeds
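The queue contents in this example can be reproduced with a simplified sketch: descend the tree one level at a time, halving the span, until a level holds at least k nodes, then rank that level by density * span and keep the k heaviest. This is one reading of the slide, not the authors' exact seeding procedure. The tree shape below (which node is a child of which) and the names `Node` and `pick_seeds` are assumptions chosen so that the weights match the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    density: int
    children: list = field(default_factory=list)


def pick_seeds(root, root_span, k):
    """Return the k heaviest (name, weight) pairs from the first level with >= k nodes."""
    level, span = [root], root_span
    while len(level) < k and any(n.children for n in level):
        next_level = []
        for n in level:
            # A node without children is carried down unchanged.
            next_level.extend(n.children if n.children else [n])
        level, span = next_level, span / 2
    ranked = sorted(level, key=lambda n: n.density * span, reverse=True)
    return [(n.name, n.density * span) for n in ranked[:k]]


# Assumed tree: densities follow the slide, parent-child links are hypothetical.
s3 = Node("S3", 5, [Node("S3", 2), Node("S7", 2), Node("S2", 1)])
s8 = Node("S8", 3, [Node("S8", 3)])
s5 = Node("S5", 2, [Node("S5", 2)])
root = Node("root", 10, [s3, s8, s5])

seeds = pick_seeds(root, root_span=2, k=4)
```

At span 1 the level holds only three nodes (weights 5, 3, 2), so the sketch descends to span 1/2, where S8 has weight 3/2, S3, S7 and S5 have weight 1, and S2 with weight 1/2 is left out.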
Update Process
Challenges
  Representation Modeling: finding a suitable metric
    What is the best set of representatives?
  Representative finding
    How to find them efficiently?
  Query Refinement
    How to efficiently adapt to user’s query operations?
Query Adaptation
  Handle user actions
    Zooming
    Selection (filtering)
Zooming
  Expand all nodes assigned to the medoid
  Run k-medoid algorithm on the new set of nodes
Selection
  Effect of selection on a node
    Completely invalid
    Fully valid
    Partially valid
  Estimate the validity percentage (VG) of each node
  Multiply the weight of each node by its VG
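The weight adjustment can be sketched as follows. The slides do not describe how VG is estimated, so this example simply computes the exact fraction of a node's points that pass the filter, as a stand-in for whatever estimate the system uses. The names `validity` and `adjusted_weight`, the node weights, and the prices are hypothetical.

```python
def validity(points, predicate):
    """Fraction of a node's points that satisfy the selection predicate."""
    return sum(1 for p in points if predicate(p)) / len(points)


def adjusted_weight(weight, points, predicate):
    """Scale a node's weight by its validity percentage (VG)."""
    return weight * validity(points, predicate)


def cheap(price):
    """Selection predicate corresponding to the filter Price < 9,500."""
    return price < 9500


# Each entry is (node weight, prices of the points under the node); made-up data.
nodes = {
    "fully valid": (3.0, [8900, 9100, 9400]),
    "partially valid": (4.0, [9000, 9200, 9800, 10000]),
    "completely invalid": (2.0, [12000, 14000]),
}
adjusted = {name: adjusted_weight(w, pts, cheap) for name, (w, pts) in nodes.items()}
```

A fully valid node keeps its weight, a completely invalid node drops to zero, and a partially valid node is discounted in proportion to the share of its points that survive the filter.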
System Architecture
Experiments – Initial Medoid Quality
  Compare with R-tree based method by M. Ester, H. Kriegel, and X. Xu
  Data sets
    Synthetic dataset: 2D points with Zipf distribution
    Real dataset: LA data set from R-tree Portal, 130k points
  Measurement
    Time to compute the medoids
    Average distance from a data point to its medoid
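The distance measurement can be written down in a few lines. This sketch assumes Euclidean distance and that each point is assigned to its nearest medoid; the function name and the data are illustrative only.

```python
from math import dist


def average_distance_to_medoid(points, medoids):
    """Mean distance from each point to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points) / len(points)


points = [(0, 0), (1, 0), (10, 0), (11, 0)]   # made-up data
medoids = [(0, 0), (10, 0)]
quality = average_distance_to_medoid(points, medoids)
```

A lower value means the chosen medoids sit closer to the data they represent, which is why it serves as the quality measure alongside running time.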
Results on Synthetic Data
  For various sizes of data, Cover-tree based method outperforms R-tree based method
  Figure panels: Time, Distance
Results on Real Data
  For various k values, Cover-tree based method outperforms R-tree based method on real data
Query Adaptation
  Figure panels: Synthetic Data, Real Data
  Compare with re-building the cover tree and running the k-medoid algorithm from scratch
  Time cost of re-building is orders of magnitude higher than incremental computation
Conclusion
  Authors proposed the MusiqLens framework for solving the many-answer problem
  Authors conducted a user study to select a metric for choosing representatives
  Authors proposed an efficient method for computing and maintaining the representatives under user actions
  Part of the database usability project at Univ. of Michigan
    Led by Prof. H.V. Jagadish