Dynamic Faceted Search for Discovery-driven Analysis. Debabrata Dash, Jun Rao, Nimrod Megiddo, Anastasia Ailamaki, Guy Lohman. CIKM’08. Speaker: Li, Huei-Jyun.

Presentation transcript:

Dynamic Faceted Search for Discovery-driven Analysis
Debabrata Dash, Jun Rao, Nimrod Megiddo, Anastasia Ailamaki, Guy Lohman
CIKM’08
Speaker: Li, Huei-Jyun
Advisor: Dr. Koh, Jia-Ling
Date: 2008/12/18

Outline
Introduction
Terminology and Problem Statement
Measure of “Interestingness”
Implementing Dynamic Faceted Search
Evaluation
Conclusion and Future Work

Introduction
Today’s faceted search systems are designed for browsing catalog data and are not directly suitable for discovery-driven exploration
To preserve browsing consistency, facets selected for navigation tend to be “static”
When browsing online catalogs, the navigational facets are single-dimensional only

Introduction
The paper proposes a dynamic faceted search system for the kind of discovery-driven analysis that is often performed in On-Line Analytical Processing (OLAP) systems
From a potentially large search result, the goal is to automatically and dynamically discover a small set of facets and values that are deemed most “interesting” to a user

Terminology and Problem Statement
Defn 1. A repository D is a collection of documents
Each document is composed of some free text and one or more (facet, value) pairs
Given a value f in facet F, we call (F, f) an instance of F
All unique values associated with a facet F form the domain of F

Terminology and Problem Statement
Defn 2. Organize the domain of these facets into a facet hierarchy
Each node in the hierarchy stores a (facet, value) pair
A node (F_1, f_1) is the parent of another node (F_2, f_2) if, for each document, F_2 = f_2 implies F_1 = f_1

Terminology and Problem Statement
Defn 3. Assume a query q on the repository has the form “keywords && F_1 = f_1 && F_2 = f_2 …”
The result of q is denoted by D_q
D_q includes the set of documents that contain the specified keywords and satisfy all constraints on the selected facets

Terminology and Problem Statement
Defn 4. Given a query q, define a facet summary for a facet set F_1, …, F_m as a list of tuples (f_1, …, f_m, A(f_1, …, f_m)) over D_q
f_i is an instance of facet F_i
A(f_1, …, f_m) is an aggregate of the documents in D_q that contain all these facet instances
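Definitions 3 and 4 can be illustrated with a small sketch. It assumes the aggregate A is a plain document count and uses a made-up three-document repository; the names `docs`, `run_query` and `facet_summary` are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import product

# Toy repository (hypothetical data): free text plus facet -> set of values.
docs = [
    {"text": "distributed query processing",
     "facets": {"venue": {"VLDB"}, "year": {"2001"}}},
    {"text": "mining frequent patterns",
     "facets": {"venue": {"SIGMOD"}, "year": {"2001"}}},
    {"text": "distributed data mining",
     "facets": {"venue": {"VLDB"}, "year": {"2002"}}},
]

def run_query(docs, keywords, constraints):
    """Return D_q: documents containing every keyword and satisfying
    every F = f constraint."""
    result = []
    for d in docs:
        words = d["text"].split()
        has_keywords = all(k in words for k in keywords)
        has_facets = all(f in d["facets"].get(F, set())
                         for F, f in constraints.items())
        if has_keywords and has_facets:
            result.append(d)
    return result

def facet_summary(d_q, facet_set):
    """Map each tuple (f_1, ..., f_m) to the number of documents in D_q
    that contain all of those facet instances (A is assumed to be a count)."""
    counts = Counter()
    for d in d_q:
        value_sets = [sorted(d["facets"].get(F, set())) for F in facet_set]
        for combo in product(*value_sets):
            counts[combo] += 1
    return counts

d_q = run_query(docs, ["distributed"], {})
summary = facet_summary(d_q, ["venue", "year"])
```

Here `summary` holds one entry per (venue, year) combination that occurs in the result of the keyword query "distributed".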

Terminology and Problem Statement
Problem Definition: Given a repository of documents with n facets, a query q, and two integers K_1 and K_2, select K_1 facet sets and, for each, a facet summary with up to K_2 tuples that are the most “interesting” to a user

Measure of “Interestingness”
Interestingness: how surprising an actual aggregated value is, given a certain expectation

Measure of “Interestingness”: Setting the Expectation
For a given set of facet values f_1, …, f_m from F_1, …, F_m:
C_D(f_1, …, f_m): the number of documents in D with all those facet values
C_q(f_1, …, f_m): the number of documents in D_q with all those facet values
E[C_q(f_1, …, f_m)]: an “expected” value for C_q(f_1, …, f_m)
Three ways of setting the expectation: natural, navigational, ad hoc

Measure of “Interestingness”: Setting the Expectation
Natural:
For an individual facet instance, the expectation is set under a uniformity assumption
For an instance f_1, …, f_m of a facet set, the expectation is set under an independence assumption

Measure of “Interestingness”: Setting the Expectation
Navigational:
Ad hoc:
The user can tell the system to set the expectation based on an arbitrary query q of the user’s choice
The count for each facet value is set proportionally, based on the distribution of the result of q

Measure of “Interestingness”: Measuring Degree of Interestingness
Single facet instance:
Evaluate it with respect to a scenario in which its associated count is generated by random sampling
The smaller the probability of observing the count under random sampling, the more interesting the facet instance

Measure of “Interestingness”: Measuring Degree of Interestingness
p-value:
Suppose that a certain facet value occurs in r out of R documents in the repository and in q out of Q documents in the output of a certain query
Also suppose
The interestingness of that facet value vis-à-vis the query: the probability that in a random sample of size Q there will be at least q documents with that facet value
This probability follows a hypergeometric distribution, which can be approximated by a normal distribution or a Poisson distribution
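The p-value described on this slide is an upper tail of the hypergeometric distribution, and for small inputs it can be computed exactly. The sketch below uses only the standard library; the function name `hypergeom_p_value` is illustrative, and the normal or Poisson approximations mentioned on the slide are not shown.

```python
from math import comb

def hypergeom_p_value(r, R, q, Q):
    """P(X >= q), where X is the number of documents carrying the facet
    value in a random sample of Q documents drawn without replacement
    from a repository of R documents, r of which carry the value."""
    upper = min(r, Q)
    total = comb(R, Q)
    tail = sum(comb(r, k) * comb(R - r, Q - k) for k in range(q, upper + 1))
    return tail / total

# A value found in 3 of 10 repository documents, and in both of the
# 2 documents returned by a query.
p = hypergeom_p_value(r=3, R=10, q=2, Q=2)
```

A smaller `p` means the observed count is less likely under random sampling, so the facet value is more interesting for that query.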

Measure of “Interestingness”: Measuring Degree of Interestingness
The whole facet:
For each facet F, we consider the p-values of only the k most interesting values in F, replace
The final measure:
MaxWeight: assign 1 to w_1 and 0 to the rest
AvgWeight: assign each w_i an equal weight
HybridWeight: average the interestingness computed by MaxWeight and AvgWeight
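The three weighting schemes can be sketched as follows. The transcript does not preserve the final formula, so this assumes the measure is a weighted sum of per-value scores, where a higher score means a more interesting value (for example, some transform of the p-value). That assumption, the function name `facet_interestingness`, and the scheme labels are all illustrative.

```python
def facet_interestingness(scores, k, scheme):
    """Aggregate the k highest per-value scores of one facet.

    scores: per-value scores, higher = more interesting (assumed input).
    scheme: "max", "avg" or "hybrid" (labels chosen for this sketch).
    """
    top = sorted(scores, reverse=True)[:k]
    if not top:
        return 0.0
    if scheme == "max":
        # MaxWeight: w_1 = 1 and every other weight is 0.
        weights = [1.0] + [0.0] * (len(top) - 1)
    elif scheme == "avg":
        # AvgWeight: every w_i gets the same weight.
        weights = [1.0 / len(top)] * len(top)
    elif scheme == "hybrid":
        # HybridWeight: average of the MaxWeight and AvgWeight results.
        return (facet_interestingness(scores, k, "max")
                + facet_interestingness(scores, k, "avg")) / 2
    else:
        raise ValueError("unknown scheme: " + scheme)
    return sum(w * s for w, s in zip(weights, top))

example = [3.0, 1.0, 2.0]
by_max = facet_interestingness(example, 2, "max")
by_avg = facet_interestingness(example, 2, "avg")
by_hybrid = facet_interestingness(example, 2, "hybrid")
```

Under this reading, MaxWeight rewards a facet with one very surprising value, AvgWeight rewards a facet whose top values are all surprising, and HybridWeight sits between the two.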

Implementing Dynamic Faceted Search
Solr: indexes facets without storing them
Enumerates every facet instance from the index and intersects its posting list with D_q
From the intersected set, it derives the count for facet value f
Caches each posting list as a bitset
If the bitset is dense: a bitmap
Otherwise: a hash map of document IDs
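The counting step described above can be sketched in a few lines. This is not Solr's code: it is a minimal illustration that uses Python integers as bitsets, and the names `to_bitset` and `facet_counts` are hypothetical.

```python
def to_bitset(doc_ids):
    """Pack a collection of document IDs into an integer used as a bitset."""
    bits = 0
    for doc_id in doc_ids:
        bits |= 1 << doc_id
    return bits

def facet_counts(posting_bitsets, d_q_bits):
    """Intersect every facet value's posting bitset with D_q and count hits."""
    counts = {}
    for value, bits in posting_bitsets.items():
        hits = bin(bits & d_q_bits).count("1")
        if hits:
            counts[value] = hits
    return counts

# Hypothetical posting lists for one facet ("venue").
postings = {
    "SIGMOD": to_bitset([0, 3, 4]),
    "VLDB": to_bitset([1, 2]),
    "TODS": to_bitset([5]),
}
d_q = to_bitset([0, 1, 3])
counts = facet_counts(postings, d_q)
```

Every facet value costs one intersection here, which is the cost that the second limitation discussed later in the talk tries to reduce.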

Implementing Dynamic Faceted Search
Improving Solr:
Solr limitation 1: it has to choose a threshold that decides the representation of the bitset
Improvement: represent a bitset as a compressed bitmap using the Word-Aligned Hybrid (WAH) code

Implementing Dynamic Faceted Search
WAH
There are 2 types of words:
Literal words: a verbatim representation of 31 bits
Fill words: encode, in 30 bits, the length of a run of all 0’s or all 1’s
A bitmap is first broken into groups of 31 bits and then converted into a sequence of literal and fill words
Operations on bitmaps, such as intersection, can be performed on the WAH code directly without decoding
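A minimal encoder and decoder make the word layout concrete. The sketch assumes 32-bit words in which the top bit marks a fill word, the next bit gives the fill value, and the low 30 bits count how many 31-bit groups the run covers; the slide states only the 31-bit and 30-bit sizes, so the exact flag positions are an assumption. Intersection directly on the compressed form is not shown.

```python
GROUP = 31                    # bits carried by one literal word
ALL_ONES = (1 << GROUP) - 1   # a group made of 31 one-bits
MAX_RUN = (1 << 30) - 1       # largest run length a fill word can hold

def wah_encode(bits):
    """Encode a list of 0/1 values as a list of 32-bit WAH-style words."""
    padded = list(bits) + [0] * (-len(bits) % GROUP)
    words = []
    for start in range(0, len(padded), GROUP):
        group = 0
        for b in padded[start:start + GROUP]:
            group = (group << 1) | b
        if group in (0, ALL_ONES):
            fill_bit = 1 if group else 0
            last = words[-1] if words else 0
            same_fill = (last >> 31) == 1 and ((last >> 30) & 1) == fill_bit
            if same_fill and (last & MAX_RUN) < MAX_RUN:
                words[-1] += 1    # extend the current run by one group
            else:
                words.append((1 << 31) | (fill_bit << 30) | 1)
        else:
            words.append(group)   # literal word: top bit stays 0
    return words

def wah_decode(words):
    """Expand WAH-style words back into a list of 0/1 values."""
    bits = []
    for w in words:
        if w >> 31:
            bits.extend([(w >> 30) & 1] * (GROUP * (w & MAX_RUN)))
        else:
            bits.extend((w >> i) & 1 for i in range(GROUP - 1, -1, -1))
    return bits

# Two all-zero groups, one mixed group, one all-one group.
bitmap = [0] * 62 + [1] + [0] * 30 + [1] * 31
encoded = wah_encode(bitmap)
```

Long runs of identical bits collapse into a single fill word, so sparse and dense bitsets share one representation and no density threshold has to be chosen.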

Implementing Dynamic Faceted Search
Improving Solr:
Solr limitation 2: it has to intersect the matching document set D_q with the bitset of every facet instance
Improvement: reduce the number of intersections by building a directory structure, called a bitset tree, on top of the bitsets of a facet

Implementing Dynamic Faceted Search
Building and Using a Bitset Tree
Starting with the leaf nodes, for each bitset b corresponding to a facet instance, we create an entry
Then divide all entries into groups of size s
For each group, we generate a leaf node holding all entries in that group
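The transcript describes only how leaf nodes are formed, so the sketch below fills in the rest with assumptions: every node entry carries the union of the bitsets beneath it, upper levels are built by grouping nodes the same way, and a subtree is skipped when its union does not intersect D_q. The names `build_bitset_tree` and `count_with_tree` are hypothetical, and Python integers stand in for bitsets.

```python
def build_bitset_tree(facet_bitsets, s):
    """Build a tree over {facet value: bitset}; each node groups s entries.

    An entry is (union_bitset, payload). At a leaf the payload is a facet
    value; at an inner node it is a child node (assumed structure).
    """
    if s < 2:
        raise ValueError("group size s must be at least 2")
    entries = [(bits, value) for value, bits in sorted(facet_bitsets.items())]
    if not entries:
        return {"leaf": True, "entries": []}
    is_leaf = True
    while True:
        nodes = []
        for i in range(0, len(entries), s):
            group = entries[i:i + s]
            union = 0
            for bits, _ in group:
                union |= bits
            nodes.append((union, {"leaf": is_leaf, "entries": group}))
        if len(nodes) == 1:
            return nodes[0][1]
        entries, is_leaf = nodes, False

def count_with_tree(node, d_q_bits, stats):
    """Count facet values over D_q, skipping subtrees that cannot match."""
    counts = {}
    for bits, payload in node["entries"]:
        stats["intersections"] += 1
        hit = bits & d_q_bits
        if not hit:
            continue  # nothing below this entry intersects D_q
        if node["leaf"]:
            counts[payload] = bin(hit).count("1")
        else:
            counts.update(count_with_tree(payload, d_q_bits, stats))
    return counts

def bitset(doc_ids):
    return sum(1 << i for i in set(doc_ids))

facet = {"a": bitset([0, 1]), "b": bitset([2]), "c": bitset([3]),
         "d": bitset([5]), "e": bitset([6]), "f": bitset([7])}
tree = build_bitset_tree(facet, s=3)
stats = {"intersections": 0}
counts = count_with_tree(tree, bitset([0, 2]), stats)
```

In this toy case the tree needs 5 intersections instead of the 6 a flat scan would perform; the saving grows when D_q touches only a few of the groups.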

Evaluation: Setup
DBLP
Contains about 13,000 papers published in 26 venues (e.g., SIGMOD, VLDB, TODS) in the past 30 years
It has 14 facets organized in 6 hierarchies, including author, venue, time (e.g., decade, year), location (e.g., country, city), number of authors per paper, and number of citations per paper
The title of each paper is used as text for keyword searches
Used for the user survey

Evaluation: Setup
Patent
Has about 1.8 million U.S. patents from the past 30 years
16 facets organized into 10 hierarchies
Used for the performance evaluation

Evaluation: Result from a User Survey
Performed tests on 3 keyword queries
2 are provided by the authors: “distributed”, “mining”
Users pick the third keyword
1 based on natural
2 based on navigational
1 used the complete repository
1 used the previous query

Evaluation: Result from a User Survey

Evaluation: Result from a User Survey
Our dynamic approach also received some negative feedback
Overall, the feedback for the natural expectation is neutral
Different ways of aggregating the degree of interestingness:
HybridWeight(7) > MaxWeight(6) > AvgWeight(2)

Evaluation: Performance Results
Environment:
Implemented in Java
3GHz P4 desktop machine with 1GB memory
A single disk drive, running Linux
Versions:
1. simple: inverted index
2. Solr
3. compressed: improves Solr with the WAH code
4. tree: improves Solr with bitset trees
5. compressed-tree: both WAH and bitset trees on Solr

Evaluation: Performance Results
Scaling with Data Size
Run a query that matches 25,000 docs using tree
Break the total time into search time and summary computation time

Evaluation: Performance Results

Evaluation: Performance Results

Conclusion and Future Work
Developed a novel dynamic faceted search system that supports OLAP-style discovery-driven analysis on a large set of structured and unstructured data
Proposed an intuitive and effective way of measuring “interestingness”
Proposed a novel navigational method of setting a user’s expectation

Conclusion and Future Work
Incorporate user feedback in facet selection
Extend the aggregates to functions other than count, such as sum or average on some numerical measures
Support dynamic faceted search in a distributed environment