Ziyang Liu, Peng Sun, Yi Chen, Arizona State University: Structured Query Result Differentiation

Presentation transcript:

Keyword Search on Structured Data
- Effective techniques have been developed to help users find relevant results:
  - Ranking: sort the results in order of estimated relevance.
  - Snippet: provide a summary of each result to help users judge relevance.
- 50% of keyword searches are information exploration queries, which inherently have multiple relevant results (Broder, SIGIR 02).
- Users intend to investigate and compare multiple relevant results.
- How can we help users compare relevant results?
[Diagram: keywords go to a search engine over structured data, which returns relevant data fragments as results. Web search: 50% navigation, 50% information exploration.]

Results and Snippets
Query: "Phoenix, camera, store"

Result 1 (store):
  city: Phoenix
  name: BHPhoto
  merchandises:
    camera: category DSLR, brand Canon, megapixel 12
    camera: category DSLR, brand Sony, megapixel 12
    ……
Result 2 (store):
  city: Phoenix
  name: Adorama
  merchandises:
    camera: category Compact, brand HP, megapixel 14
    camera: category Compact, brand Canon, megapixel 12
    ……

Snippet of result 1 (store):
  city: Phoenix
  name: BHPhoto
  merchandises:
    camera: brand Canon, megapixel 12
    camera: brand Canon
Snippet of result 2 (store):
  city: Phoenix
  name: Adorama
  merchandises:
    camera: category Compact, brand Canon, megapixel 12

Snippets are unhelpful in differentiating query results. (Huang et al. SIGMOD 09)

Differentiation Feature Sets (DFS)
Result 1 (store):
  city: Phoenix
  name: BHPhoto
  merchandises:
    camera: category DSLR, brand Canon, megapixel 12
    camera: category DSLR, brand Sony, megapixel 12
    ……
Result 2 (store):
  city: Phoenix
  name: Adorama
  merchandises:
    camera: category Compact, brand HP, megapixel 14
    camera: category Compact, brand Canon, megapixel 12
    ……

DFS of result 1:
  Feature type       Value
  store: name        BHPhoto
  camera: brand      Canon, Sony
  camera: category   DSLR
DFS of result 2:
  Feature type       Value
  store: name        Adorama
  camera: brand      Canon, HP
  camera: category   Compact

Feature: (entity, attribute, value)
Bank websites usually allow users to compare selected credit cards, but only with a pre-defined feature set.
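The (entity, attribute, value) triple above maps naturally onto a small record type. The following is a minimal sketch, not the XRed implementation: the names `Feature` and `group_by_type` are hypothetical, and the feature list only covers the two BHPhoto cameras visible on the slide.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Feature(NamedTuple):
    """One feature of a result: (entity, attribute, value)."""
    entity: str
    attribute: str
    value: str


def group_by_type(features):
    """Group features by feature type (entity, attribute) and count each value."""
    grouped = defaultdict(Counter)
    for f in features:
        grouped[(f.entity, f.attribute)][f.value] += 1
    return grouped


# Features read off the visible part of the BHPhoto result (illustrative only).
bhphoto = [
    Feature("store", "city", "Phoenix"),
    Feature("store", "name", "BHPhoto"),
    Feature("camera", "category", "DSLR"),
    Feature("camera", "brand", "Canon"),
    Feature("camera", "megapixel", "12"),
    Feature("camera", "category", "DSLR"),
    Feature("camera", "brand", "Sony"),
    Feature("camera", "megapixel", "12"),
]
grouped = group_by_type(bhphoto)
```

Grouping by feature type gives the per-type value counts that a DFS draws from, for example the brand and category rows of the tables above.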

Challenges of Result Differentiation
- How to automatically generate DFSs that highlight the differences among results?
- How to measure the quality of a set of DFSs?
  - DFSs should obviously maximize the difference among results. How to quantify it?
  - What are other desirable properties?
- Can DFSs be efficiently generated from results?

Contributions
- First work on automatically differentiating structured search results
  - Application domains: online shopping, employee hiring, job/institution hunting, etc.
- Identifying 3 desiderata for good DFSs
- Quantifying the differentiation power of a set of DFSs
- Proving the NP-hardness of DFS generation
- Tackling the problem using two local optimality criteria
  - Single-swap / multi-swap optimality
- Implemented XRed: XML Result Differentiation
- Empirically verified the effectiveness & efficiency of XRed

Roadmap
- Desiderata for good DFSs
- Problem definition
- Local optimality and algorithms
- Experiments

Desideratum 1: Being Small
- A small DFS is easy for the user to go through and compare with other DFSs.
- The size of each DFS, |D|, cannot exceed a user-specified upper bound L: |D| ≤ L

Desideratum 2: Summarizing Query Results
DFSs that do not summarize the results show useless & misleading differences.

Result 1 (store):
  city: Phoenix
  name: BHPhoto
  merchandises:
    camera: category DSLR, brand Canon, megapixel 12
    camera: category DSLR, brand Sony, megapixel 12
    ……
Result 2 (store):
  city: Phoenix
  name: Adorama
  merchandises:
    camera: category Compact, brand HP, megapixel 14
    camera: category Compact, brand Canon, megapixel 12
    ……

First pair of DFSs shown:
  DFS 1: camera:brand = HP
  DFS 2: camera:brand = Canon
Second pair of DFSs shown:
  DFS 1: camera:brand = Canon, HP
  DFS 2: camera:brand = Canon, HP
Annotation on both slides: "This store sells only a few HP cameras."

- A DFS is valid only if it summarizes the corresponding result.
- Dominance ordered: features of the same type should be included in order of occurrences.
- Distribution preserved: ratios of two features in the DFS should be roughly the same as in the result.
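The two validity conditions can be checked mechanically for a single feature type. This is a sketch under stated assumptions, not the authors' definition: the slides say ratios should be "roughly" the same without giving a threshold, so the relative tolerance `tol` is hypothetical, as are the function names. The brand counts are the Store 1 figures quoted later in this deck; all counts are assumed to be positive.

```python
from itertools import combinations


def is_dominance_ordered(result_counts, dfs_counts):
    """No value left out of the DFS may occur more often in the result
    than a value that the DFS includes."""
    if any(v not in result_counts for v in dfs_counts):
        return False
    included = [result_counts[v] for v in dfs_counts]
    excluded = [c for v, c in result_counts.items() if v not in dfs_counts]
    return not included or not excluded or min(included) >= max(excluded)


def is_distribution_preserved(result_counts, dfs_counts, tol=0.5):
    """Every pairwise ratio in the DFS stays within a relative tolerance
    (hypothetical parameter `tol`) of the same ratio in the result."""
    for a, b in combinations(dfs_counts, 2):
        want = result_counts[a] / result_counts[b]
        got = dfs_counts[a] / dfs_counts[b]
        if abs(got - want) > tol * want:
            return False
    return True


def is_valid(result_counts, dfs_counts, tol=0.5):
    """Validity for one feature type: both conditions must hold."""
    return (is_dominance_ordered(result_counts, dfs_counts)
            and is_distribution_preserved(result_counts, dfs_counts, tol))


# camera:brand occurrences in one result (Store 1 figures from this deck).
brand = {"Canon": 103, "Sony": 50, "Nikon": 25, "HP": 22}
ok = is_valid(brand, {"Canon": 2, "Sony": 1})         # top two brands, about 2:1
skips_dominant = is_valid(brand, {"HP": 1})           # omits more frequent brands
wrong_ratio = is_valid(brand, {"Canon": 1, "Sony": 3})
```

Listing only HP fails the dominance check, which is the kind of misleading DFS the HP example on the previous slides warns about.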

Desideratum 3: Differentiating Query Results
- Differentiation unit: feature type.
- A feature type t in two DFSs D1 and D2 is differentiable if
  - the order of the features of type t is different, or
  - the ratio of two features of type t is different.

Example pairs shown:
  D1: Camera: brand: Canon
  D2: Camera: brand: HP

  D1: Camera: brand: Canon
  D2: Camera: brand: Canon; Camera: brand: HP

  D1: Camera: brand: Canon; Camera: brand: HP
  D2: Camera: brand: Canon; Camera: brand: Canon; Camera: brand: HP

Desideratum 3: Differentiating Query Results
Degree of Differentiation (DoD) of two DFSs = number of differentiable feature types.

DFS 1:
  Feature type       Value
  store: name        BHPhoto
  camera: brand      Canon, Sony
  camera: category   DSLR
DFS 2:
  Feature type       Value
  store: name        Adorama
  camera: brand      Canon, HP
  camera: category   Compact
DoD = 3

DoD of multiple DFSs = the sum of the DoD of every pair of DFSs.
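The DoD computation can be sketched in a few lines. The differentiability test below is one reading of the slides, chosen because it reproduces the DoD values shown in this deck, and should be treated as an assumption: a type counts only if both DFSs contain it, and the two ordered lists are compared over their common length, so a list that is a prefix of the other (Canon versus Canon, HP) is not counted as different. Each DFS is modeled as a dict from feature type to an ordered list of (value, count) pairs; all names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def signature(entries):
    """Ordered values with their share of the listed occurrences."""
    total = sum(count for _, count in entries)
    return tuple((value, Fraction(count, total)) for value, count in entries)


def is_differentiable(entries_a, entries_b):
    """Compare order and ratios over the common length of both lists."""
    k = min(len(entries_a), len(entries_b))
    return k > 0 and signature(entries_a[:k]) != signature(entries_b[:k])


def dod_pair(dfs_a, dfs_b):
    """Number of feature types present in both DFSs that are differentiable."""
    shared = dfs_a.keys() & dfs_b.keys()
    return sum(is_differentiable(dfs_a[t], dfs_b[t]) for t in shared)


def dod(dfss):
    """DoD of a set of DFSs: the sum over every pair."""
    return sum(dod_pair(a, b) for a, b in combinations(dfss, 2))


# The two DFSs from the slide; the counts of 1 are illustrative.
bhphoto = {
    "store:name": [("BHPhoto", 1)],
    "camera:brand": [("Canon", 1), ("Sony", 1)],
    "camera:category": [("DSLR", 1)],
}
adorama = {
    "store:name": [("Adorama", 1)],
    "camera:brand": [("Canon", 1), ("HP", 1)],
    "camera:category": [("Compact", 1)],
}
total = dod([bhphoto, adorama])  # 3, matching the slide
```

Using `Fraction` keeps the ratio comparison exact, so Canon, HP at 1:1 differs from Canon, Canon, HP at 2:1, while 2:2 and 1:1 compare equal.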

DFS Generation Problem
- Given a set of results and a size limit L, generate a DFS for each result such that
  - their DoD is maximized,
  - every DFS is valid (a good summary), and
  - every DFS's size does not exceed L.
- We proved the NP-hardness of this problem by reduction from X3C (exact cover by 3-sets).
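Written as an optimization problem, the statement above reads as follows. The symbols are assumptions introduced here for illustration: R_1, ..., R_n are the query results, D_i is the DFS generated for R_i, and valid(D_i, R_i) stands for the summarization conditions of Desideratum 2.

```latex
\max_{D_1,\dots,D_n} \; \sum_{1 \le i < j \le n} \mathrm{DoD}(D_i, D_j)
\qquad \text{subject to} \qquad
\mathrm{valid}(D_i, R_i) \;\text{ and }\; |D_i| \le L
\quad \text{for } i = 1,\dots,n .
```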

Local Optimality
- To tackle this hard problem, instead of achieving global optimality, we propose two local optimality criteria:
  - Single-swap optimality
  - Multi-swap optimality

Single Swap
- A set of DFSs is single-swap optimal if adding / changing a single feature in a single DFS (subject to validity and size limit) cannot increase the DoD.

Store 1: # of cameras: 200
  Category: DSLR: 188, Others: 12
  Brand: Canon: 103, Sony: 50, Nikon: 25, HP: 22
  Megapixel: 12: : 15 14: 20
Store 2: # of cameras: 150
  Category: Compact: 140, Others: 10
  Brand: Canon: 80, HP: 70
  Megapixel: 12: : 5 14: 19

DFS of store 1:
  Feature type       Value
  store: name        BHPhoto
  store: city        Phoenix
  camera: megapixel  12
  camera: category   DSLR
DFS of store 2 (before):
  Feature type       Value
  store: name        Adorama
  camera: brand      Canon, HP
  camera: megapixel  12
DoD = 1

DFS of store 2 (after changing one feature):
  Feature type       Value
  store: name        Adorama
  camera: brand      Canon, HP
  camera: category   Compact
DoD increases to 2. Single-swap optimality is achieved.

Algorithm for Single-Swap Optimality
- Start from a randomly generated DFS for each result.
- Repeatedly add a feature / change a feature in a DFS.
- Stop when the DoD no longer increases.

Does this algorithm terminate in polynomial time? Yes:
- The maximum possible DoD for a set of DFSs is polynomial (at most one per feature type for each pair of DFSs).
- Each iteration increases the DoD by at least 1.
- Each iteration takes polynomial time.
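The loop above can be sketched as a first-improvement local search. This is a simplified model rather than the XRed algorithm, and every name in it is hypothetical: each result maps a feature type to its values in dominance order, a DFS stores how many top values of each type it shows (so dominance ordering holds by construction), ratios are ignored, and a type counts as differentiable when the two value lists differ over their common length. The start state is the slide's example instead of a random one, and the value lists are illustrative.

```python
from itertools import combinations


def dod(results, dfss):
    """Sum over all pairs of the types whose shown prefixes differ."""
    total = 0
    for i, j in combinations(range(len(dfss)), 2):
        for t in results[i].keys() & results[j].keys():
            a = results[i][t][:dfss[i].get(t, 0)]
            b = results[j][t][:dfss[j].get(t, 0)]
            k = min(len(a), len(b))
            if k > 0 and a[:k] != b[:k]:
                total += 1
    return total


def neighbors(result, dfs, limit):
    """DFSs reachable by adding one feature or changing one feature."""
    can_grow = [t for t in result if dfs.get(t, 0) < len(result[t])]
    if sum(dfs.values()) < limit:
        for t in can_grow:
            yield {**dfs, t: dfs.get(t, 0) + 1}
    for t_out in [t for t in dfs if dfs[t] > 0]:
        for t_in in can_grow:
            if t_in != t_out:
                yield {**dfs, t_out: dfs[t_out] - 1, t_in: dfs.get(t_in, 0) + 1}


def single_swap(results, dfss, limit):
    """Apply improving single moves until none remains."""
    dfss = [dict(d) for d in dfss]
    best = dod(results, dfss)
    improved = True
    while improved:
        improved = False
        for i in range(len(dfss)):
            for cand in neighbors(results[i], dfss[i], limit):
                trial = dfss[:i] + [cand] + dfss[i + 1:]
                score = dod(results, trial)
                if score > best:  # each accepted move raises DoD by >= 1
                    dfss, best, improved = trial, score, True
                    break
            if improved:
                break
    return dfss, best


results = [
    {"store:name": ["BHPhoto"], "store:city": ["Phoenix"],
     "camera:category": ["DSLR", "Others"],
     "camera:brand": ["Canon", "Sony", "Nikon", "HP"],
     "camera:megapixel": ["12", "14"]},
    {"store:name": ["Adorama"], "store:city": ["Phoenix"],
     "camera:category": ["Compact", "Others"],
     "camera:brand": ["Canon", "HP"],
     "camera:megapixel": ["12", "14"]},
]
start = [
    {"store:name": 1, "store:city": 1, "camera:megapixel": 1, "camera:category": 1},
    {"store:name": 1, "camera:brand": 2, "camera:megapixel": 1},
]
start_dod = dod(results, start)
final, final_dod = single_swap(results, start, limit=4)
```

Because every accepted move raises the DoD by at least 1 and the DoD is bounded, the loop terminates, which mirrors the polynomial-time argument on the slide.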

Multi-Swap Optimality
- A set of DFSs is multi-swap optimal if adding / changing any number of features in a single DFS (subject to validity and size limit) cannot increase the DoD.

Store 1: # of cameras: 200
  Category: DSLR: 188, Others: 12
  Brand: Canon: 103, Sony: 50, Nikon: 25, HP: 22
  Megapixel: 12: : 15 14: 20
Store 2: # of cameras: 150
  Category: Compact: 140, Others: 10
  Brand: Canon: 80, HP: 70
  Megapixel: 12: : 5 14: 19

DFS of store 1 (before):
  Feature type       Value
  store: name        BHPhoto
  store: city        Phoenix
  camera: megapixel  12
  camera: category   DSLR
DFS of store 2:
  Feature type       Value
  store: name        Adorama
  camera: brand      Canon, HP
  camera: category   Compact
DoD = 2

DFS of store 1 (after changing multiple features):
  Feature type       Value
  store: name        BHPhoto
  camera: brand      Canon, Sony
  camera: category   DSLR
DoD increases to 3
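A multi-swap step can be illustrated by brute force: rebuild one DFS from scratch and keep the best candidate within the size limit. This sketch enumerates every candidate, so it is exponential in the number of feature types; it is not the dynamic programming algorithm the authors describe. It uses the same simplified model as before (a DFS stores how many top values of each type it shows, ratios are ignored), and all names and value lists are hypothetical.

```python
from itertools import combinations, product


def dod(results, dfss):
    """Sum over all pairs of the types whose shown prefixes differ."""
    total = 0
    for i, j in combinations(range(len(dfss)), 2):
        for t in results[i].keys() & results[j].keys():
            a = results[i][t][:dfss[i].get(t, 0)]
            b = results[j][t][:dfss[j].get(t, 0)]
            k = min(len(a), len(b))
            if k > 0 and a[:k] != b[:k]:
                total += 1
    return total


def best_multi_swap(results, dfss, i, limit):
    """Try every DFS for result i within the size limit (naive enumeration)."""
    types = list(results[i])
    best_dfs, best = dfss[i], dod(results, dfss)
    for ks in product(*(range(len(results[i][t]) + 1) for t in types)):
        if sum(ks) > limit:
            continue
        cand = {t: k for t, k in zip(types, ks) if k}
        score = dod(results, dfss[:i] + [cand] + dfss[i + 1:])
        if score > best:
            best_dfs, best = cand, score
    return best_dfs, best


results = [
    {"store:name": ["BHPhoto"], "store:city": ["Phoenix"],
     "camera:category": ["DSLR", "Others"],
     "camera:brand": ["Canon", "Sony", "Nikon", "HP"],
     "camera:megapixel": ["12", "14"]},
    {"store:name": ["Adorama"], "store:city": ["Phoenix"],
     "camera:category": ["Compact", "Others"],
     "camera:brand": ["Canon", "HP"],
     "camera:megapixel": ["12", "14"]},
]
# The single-swap optimal state from the slide, with DoD = 2.
state = [
    {"store:name": 1, "store:city": 1, "camera:megapixel": 1, "camera:category": 1},
    {"store:name": 1, "camera:brand": 2, "camera:category": 1},
]
before = dod(results, state)
new_dfs, after = best_multi_swap(results, state, i=0, limit=4)
```

In this model the improvement needs two simultaneous changes (city and megapixel give way to the brands Canon and Sony), which no single add or change can reach.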

Algorithm for Multi-Swap Optimality
- Start from a randomly generated DFS for each result.
- Repeatedly add / change multiple features in a DFS.
- Stop when the DoD no longer increases.

Done naively, this algorithm has exponential time complexity! We designed a novel dynamic programming algorithm, which takes pseudo-polynomial time.

Evaluation
- We have implemented XRed (XML Result Differentiation) and evaluated it empirically.
- Data sets:
  - Film (
  - Camera Retailer (synthetic)
- Result generation: XSeek (
- DFS size limit: 10% of # of feature types
- Metrics:
  - Quality (DoD)
  - Efficiency
- Comparison system: exponential algorithm that generates the optimal solution.

DFS Quality
[Figure panels: Film; Camera Retailer]

Efficiency
- Result size: 1KB ~ 9KB
- # of results: 2 ~ 52
[Figure panels: Film; Camera Retailer]

Scalability

Conclusions
- We initiate the problem of automatically differentiating structured query results, which is useful for information exploration queries.
- We define a Differentiation Feature Set (DFS) for each result, and identify three desiderata for DFSs.
- We formalize the DFS generation problem, and prove its NP-hardness.
- We propose two local optimality criteria, single-swap and multi-swap, and design algorithms to efficiently achieve them.
- We implemented the XRed system, and verified its effectiveness and efficiency through experiments.

Future Work
Result differentiation is a new area and opens opportunities for new research topics.
- Is there a better way of selecting feature types, e.g., by considering users' interests?
- Is there a better way of measuring the quality of DFSs besides DoD?
- Are there approximation / randomized algorithms for the DFS generation problem that achieve better quality / efficiency?
