Using Trees to Depict a Forest
Bin Liu, H. V. Jagadish
EECS, University of Michigan, Ann Arbor
Presented by Sergey Shepshelvich

Presentation transcript:


Motivation
- In interactive database querying, we often get more results than we can comprehend immediately
- When do you actually click past the first 2-3 pages of results?
- 85% of users never go to the second page!
- What should be displayed on the first page?

Standard solutions
- Sorting by attributes
  - Computationally expensive
  - Similar results can be distributed many pages apart
- Ranking
  - Hard to estimate the user's preference
  - In database queries, all tuples are equally relevant!
- What to do when there are millions of results?

Make the First Page Count
- Human beings are very capable of learning from examples
- Show the most "representative" results
  - Best help users learn what is in the result set
  - User can decide further actions based on representatives

The Proposal: MusiqLens Experience
(Model-driven Usable Systems for Information Querying)

Suppose a user wants a 2005 Civic, but there are too many of them…

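MusiqLens on the Car Data

| Id  | Model | Price | Year | Mileage | Condition | Similar results    |
|-----|-------|-------|------|---------|-----------|--------------------|
| 872 | Civic | $12,  |      | ,000    | Good      | 122 more like this |
| 901 | Civic | $16,  |      | ,000    | Excellent | 345 more like this |
| 725 | Civic | $18,  |      | ,000    | Excellent | 86 more like this  |
| 423 | Civic | $17,  |      | ,000    | Good      | 201 more like this |
| 132 | Civic | $9,   |      | ,000    | Fair      | 185 more like this |
| 322 | Civic | $14,  |      | ,000    | Good      | 55 more like this  |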


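After Zooming in: 2005 Honda Civics ~ ID 132

| Id  | Model | Price | Year | Mileage | Condition | Similar results   |
|-----|-------|-------|------|---------|-----------|-------------------|
| 342 | Civic | $9,   |      | ,000    | Good      | 25 more like this |
| 768 | Civic | $10,  |      | ,000    | Good      | 10 more like this |
| 132 | Civic | $9,   |      | ,000    | Fair      | 63 more like this |
| 122 | Civic | $9,   |      | ,000    | Good      | 5 more like this  |
| 123 | Civic | $9,   |      | ,000    | Fair      | 40 more like this |
| 898 | Civic | $9,   |      | ,000    | Fair      | 42 more like this |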

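After Filtering by "Price < 9,500"

| Id  | Model | Price | Year | Mileage | Condition | Similar results   |
|-----|-------|-------|------|---------|-----------|-------------------|
| 123 | Civic | $9,   |      | ,000    | Fair      | 40 more like this |
| 898 | Civic | $9,   |      | ,000    | Fair      | 42 more like this |
| 133 | Civic | $9,   |      | ,000    | Fair      | 33 more like this |
| 126 | Civic | $9,   |      | ,000    | Good      | 3 more like this  |
| 129 | Civic | $8,   |      | ,000    | Fair      | 20 more like this |
| 999 | Civic | $9,   |      | ,000    | Fair      | 12 more like this |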

Challenges
- Representation modeling: finding a suitable metric
  - What is the best set of representatives?
- Representative finding
  - How to find them efficiently?
- Query refinement
  - How to efficiently adapt to the user's query operations?

Finding a Suitable Metric
- Users should be the ultimate judge
  - Which metric generates the representatives that I can learn the most from?
- User study to evaluate different representation models
  - Use a set of candidate metrics
  - Users observe the representatives
  - Users estimate additional data points in the data set
  - The metric whose representatives lead to the best estimates wins

Metric Candidates
- Sort by attributes
- Uniform random sampling
  - Small clusters are missed
- Density-biased sampling
  - Sample more from sparse regions, less from dense regions
- Sort by typicality
  - Based on probabilistic modeling
- K-medoids
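The density-biased sampling candidate can be sketched as follows. This is a minimal illustration, assuming local density is approximated by counting points per grid cell. The function names, the `cell` parameter, and the weighted-sampling scheme (Efraimidis-Spirakis keys) are choices made for this sketch, not details taken from the talk.

```python
import random
from collections import Counter


def density_weights(points, cell=0.1):
    """Weight each point by 1 / (number of points sharing its grid cell).

    Assumption of this sketch: a uniform grid with side `cell` stands in
    for a real density estimator.
    """
    def cell_of(p):
        return tuple(int(c // cell) for c in p)

    counts = Counter(cell_of(p) for p in points)
    return [1.0 / counts[cell_of(p)] for p in points]


def density_biased_sample(points, n_samples, cell=0.1, seed=0):
    """Draw n_samples distinct points, favoring points in sparse cells."""
    rng = random.Random(seed)
    weights = density_weights(points, cell)
    # Weighted sampling without replacement: give each point the key
    # u ** (1 / w) with u uniform in [0, 1), then keep the largest keys.
    keyed = [(rng.random() ** (1.0 / w), i) for i, w in enumerate(weights)]
    keyed.sort(reverse=True)
    return [points[i] for _, i in keyed[:n_samples]]
```

With these weights every occupied cell carries the same total weight, so a lone point in a sparse cell is as likely to be drawn as any one point from a crowded cell taken as a group.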

Sort by Typicality
- Proposed by Ming Hua, Jian Pei, et al. [4]
- Figure source: slides from Ming Hua

Metric Candidates - K-medoids
- A medoid of a cluster is the object whose dissimilarity to the other objects in the cluster is smallest
  - Average medoid and max medoid
- K-medoids are k objects, each from a different cluster, where each object is the medoid of its cluster
- Why not K-means?
  - K-means cluster centers generally do not exist in the database
  - We must present real objects to users
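A small sketch of the two medoid variants may help. It assumes Euclidean distance and reads "average medoid" as the member minimizing the average distance to the cluster and "max medoid" as the member minimizing the maximum distance. Both the interpretation and the `medoid` function name are assumptions of this example.

```python
import math


def medoid(cluster, mode="average"):
    """Return the cluster member with the smallest dissimilarity to the rest.

    mode="average": minimize the mean distance to all members.
    mode="max":     minimize the largest distance to any member.
    The distance of a point to itself is 0, so including it changes
    neither ranking.
    """
    def cost(candidate):
        dists = [math.dist(candidate, other) for other in cluster]
        return max(dists) if mode == "max" else sum(dists) / len(dists)

    return min(cluster, key=cost)


cluster = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
print(medoid(cluster, "average"))  # (2, 0)
print(medoid(cluster, "max"))      # (3, 0)
```

Unlike a k-means center, the result is always one of the input objects, which is why it can be shown to the user as a real tuple.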

Plotting the Candidates
- Data: Yahoo! Autos, 3922 data points
- Price and mileage are normalized to 0..1

Plotting the Candidates - Typicality

Plotting the Candidates - k-medoids

User Study Procedure
- Users are given:
  - 7 sets of data, generated using the 7 candidate methods
  - Each set consists of 8 representative points
- Users predict 4 more data points
  - That are most likely in the data set
  - Should not pick those already given
- Measure the prediction error

Verdict
- K-medoids is the winner
- In this paper, the authors choose average k-medoids
- The proposed algorithm can be extended to max-medoids with small changes


Cover Tree Based Algorithm
- Cover Tree was proposed by Beygelzimer, Kakade, and Langford in 2006
- Briefly discuss Cover Tree properties
- See Cover Tree based algorithms for computing k-medoids

Cover Tree Properties (1)
[Figure: Points in the Data (One Dimension)]

Cover Tree Properties (2)

Cover Tree Properties (3)

Additional Stats for Cover Tree (2D Example)
- Density (DS): number of points in the subtree
- Centroid (CT): geometric center of points in the subtree
[Figure labels: DS = 10, DS = 3, p]
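The two per-node statistics can be computed in one bottom-up pass. The sketch below assumes a simplified tree in which every data point is stored in exactly one node. The `Node` class and the `annotate` function are hypothetical names for illustration, not the authors' data structure.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    point: tuple                      # the data point stored at this node
    children: list = field(default_factory=list)
    density: int = 0                  # DS: number of points in the subtree
    centroid: tuple = ()              # CT: geometric center of the subtree


def annotate(node):
    """Fill in density and centroid for every node of the subtree.

    Returns (point count, per-coordinate sums) so a parent can combine
    its children's results without revisiting their points.
    """
    count = 1
    sums = list(node.point)
    for child in node.children:
        child_count, child_sums = annotate(child)
        count += child_count
        sums = [a + b for a, b in zip(sums, child_sums)]
    node.density = count
    node.centroid = tuple(s / count for s in sums)
    return count, sums


inner = Node((4, 6), [Node((2, 2))])
root = Node((0, 0), [Node((2, 0)), inner])
annotate(root)
print(root.density, root.centroid)    # 4 (2.0, 2.0)
print(inner.density, inner.centroid)  # 2 (3.0, 4.0)
```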

k-medoid Algorithm Outline

Cover Tree Based Seeding

A Simple Example: k = 4
- Levels: Span = 2, Span = 1, Span = 1/2, Span = 1/4
- Priority queue on node weight (density * span):
  - S3 (5), S8 (3), S5 (2)
  - S8 (3/2), S5 (1), S3 (1), S7 (1), S2 (1/2)
- Final set of seeds
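The idea of a priority queue keyed on density * span can be illustrated with a simplified greedy variant: repeatedly split the heaviest node into its children until at least k candidates are available. This is a sketch under that assumption and is not claimed to reproduce the exact procedure or the queue states from the slide. `Node` and `pick_seeds` are hypothetical names.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    density: int                      # number of points in the subtree
    span: float                       # span of the node's tree level
    children: list = field(default_factory=list)


def weight(node):
    return node.density * node.span


def pick_seeds(root, k):
    """Greedily split the heaviest node until k candidate nodes exist."""
    # heapq is a min-heap, so weights are negated; the counter breaks
    # ties so that Node objects themselves are never compared.
    heap = [(-weight(root), 0, root)]
    counter = 1
    unsplittable = []                 # popped nodes that have no children
    while heap and len(heap) + len(unsplittable) < k:
        _, _, node = heapq.heappop(heap)
        if not node.children:
            unsplittable.append(node)
            continue
        for child in node.children:
            heapq.heappush(heap, (-weight(child), counter, child))
            counter += 1
    candidates = unsplittable + [n for _, _, n in heap]
    candidates.sort(key=weight, reverse=True)
    return [n.name for n in candidates[:k]]


a = Node("A", 6, 1.0, [Node("A1", 4, 0.5), Node("A2", 2, 0.5)])
b = Node("B", 4, 1.0, [Node("B1", 3, 0.5), Node("B2", 1, 0.5)])
root = Node("R", 10, 2.0, [a, b])
print(pick_seeds(root, 3))            # ['B', 'A1', 'A2']
```

Weighting by density * span favors splitting nodes that are both populous and spread out, which is where an extra seed is most likely to reduce the distance from points to their nearest seed.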

Update Process


Query Adaptation
- Handle user actions
  - Zooming
  - Selection (filtering)

Zooming
- Expand all nodes assigned to the medoid
- Run the k-medoid algorithm on the new set of nodes

Selection
- Effect of selection on a node
  - Completely invalid
  - Fully valid
  - Partially valid
- Estimate the validity percentage (VG) of each node
- Multiply the VG with the weight of each node
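The weight adjustment under selection can be sketched as below. The example assumes VG is estimated as the fraction of a node's sampled points that satisfy the selection predicate. How the estimate is actually obtained is not stated on the slide, and the function names are hypothetical.

```python
def validity(sample_points, predicate):
    """Estimate VG: the fraction of a node's points that pass the filter.

    0.0 means completely invalid, 1.0 fully valid, and anything in
    between partially valid.
    """
    if not sample_points:
        return 0.0
    passed = sum(1 for p in sample_points if predicate(p))
    return passed / len(sample_points)


def adjusted_weight(node_weight, sample_points, predicate):
    """Scale a node's weight by its estimated validity percentage."""
    return validity(sample_points, predicate) * node_weight


prices = [9000, 9400, 9600, 9900]     # prices of the points under one node
print(adjusted_weight(8.0, prices, lambda price: price < 9500))  # 4.0
```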

System Architecture

Experiments - Initial Medoid Quality
- Compare with the R-tree based method by M. Ester, H. Kriegel, and X. Xu
- Data sets
  - Synthetic dataset: 2D points with Zipf distribution
  - Real dataset: LA data set from R-tree Portal, 130k points
- Measurement
  - Time to compute the medoids
  - Average distance from a data point to its medoid
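The quality measure in the last bullet can be written down directly. This sketch assumes Euclidean distance and that each data point is assigned to its nearest medoid. The function name is illustrative.

```python
import math


def average_distance_to_medoid(points, medoids):
    """Mean distance from each point to the closest of the given medoids."""
    total = sum(min(math.dist(p, m) for m in medoids) for p in points)
    return total / len(points)


points = [(0, 0), (1, 0), (4, 0), (5, 0)]
medoids = [(0, 0), (5, 0)]
print(average_distance_to_medoid(points, medoids))  # 0.5
```

A lower value means the chosen representatives sit closer to the data they stand for.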

Results on Synthetic Data
- For various data sizes, the cover-tree based method outperforms the R-tree based method
[Figure panels: Time, Distance]

Results on Real Data
- For various k values, the cover-tree based method outperforms the R-tree based method on real data

Query Adaptation
- Compared with re-building the cover tree and running the k-medoid algorithm from scratch
- The time cost of re-building is orders of magnitude higher than that of incremental computation
[Figure panels: Synthetic Data, Real Data]

Conclusion
- The authors proposed the MusiqLens framework for solving the many-answer problem
- The authors conducted a user study to select a metric for choosing representatives
- The authors proposed an efficient method for computing and maintaining the representatives under user actions
- Part of the database usability project at Univ. of Michigan
  - Led by Prof. H.V. Jagadish