Topic-Sensitive SourceRank: Extending SourceRank for Performing Context-Sensitive Search over Deep-Web. MS Thesis Defense, Manishkumar Jha. Committee Members: Dr. Subbarao Kambhampati, Dr. Huan Liu, Dr. Hasan Davulcu
Deep Web Integration Scenario: Millions of sources containing structured tuples; an autonomous, uncontrolled collection; contains information spanning multiple topics; access is limited to query-forms. [Figure: a mediator forwards user queries to deep-web sources (Web DBs) and collects the answer tuples they return.]
Source quality and SourceRank: The deep web is adversarial, so source quality is a major issue. SourceRank [1] provides a measure for assessing source quality based on source trustworthiness and result importance. [1] SourceRank: Relevance and Trust Assessment for Deep Web Sources Based on Inter-Source Agreement, WWW, 2011
… But source quality is topic-sensitive: Sources might have data corresponding to multiple topics, and their importance may vary across topics. Example: Barnes & Noble might be quite good as a book source but not as good a movie source; SourceRank will fail to capture this fact. Similar issues were noted for the surface web, but they are much more critical for the deep web, as sources are even more likely to cross topics.
Deep Web Integration Scenario (multi-topic): [Figure: the mediator forwards queries to deep-web sources (Web DBs) spanning multiple topics such as Movie, Music, Camera, and Books, and collects the answer tuples.]
Problem Definition: Performing effective multi-topic source selection sensitive to trustworthiness for the deep web
Our solution – Topic-sensitive SourceRank: Compute multiple topic-sensitive SourceRanks; at query-time, use the query-topic to combine these rankings into a composite importance ranking. Challenges: computing topic-sensitive SourceRanks, identifying the query-topic, and combining the topic-sensitive SourceRanks.
Agenda SourceRank Topic-sensitive SourceRank Experiments and Results Conclusion
SourceRank Computation: Assesses source quality based on trustworthiness and result importance. Introduces a domain-agnostic, agreement-based technique for implicitly creating an endorsement structure between deep-web sources: agreement among the answer sets returned in response to the same queries manifests as a form of implicit endorsement.
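As a rough illustration of this agreement idea, the sketch below scores how strongly one source endorses another by the overlap of their answer sets for the same query. The thesis computes agreement via soft matching of tuples, so plain set overlap and the sample answer lists here are simplifying assumptions.

```python
# Minimal sketch of agreement-based endorsement between two sources.
# The real SourceRank agreement uses soft tuple matching; here we
# approximate it with simple set overlap over returned answer strings.

def agreement(answers_a, answers_b):
    """Fraction of source A's answers that source B agrees with."""
    if not answers_a:
        return 0.0
    return len(set(answers_a) & set(answers_b)) / len(answers_a)

# Hypothetical answer sets returned by two sources for the same query.
src_a = ["godfather dvd", "godfather part ii dvd", "goodfellas dvd"]
src_b = ["godfather dvd", "godfather part ii dvd", "scarface dvd"]

print(agreement(src_a, src_b))  # 2/3: B endorses two of A's three answers
```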
SourceRank Computation contd.: Endorsement is modeled as a directed, weighted agreement graph; nodes represent sources and edge weights represent the agreement between sources. The SourceRank of a source is computed as its stationary visit probability under a Markov random walk performed on this agreement graph.
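A minimal sketch of that random walk follows: power iteration over a row-normalized agreement matrix yields stationary visit probabilities. The 3x3 matrix, the reset probability of 0.15 and the iteration count are illustrative assumptions, not the thesis settings.

```python
import numpy as np

# Sketch: SourceRank as the stationary visit probability of a Markov random
# walk on the agreement graph.
def source_rank(agreement, reset=0.15, iters=100):
    n = agreement.shape[0]
    row_sums = agreement.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0          # guard against dangling sources
    transition = agreement / row_sums      # row-normalize into a random walk
    rank = np.full(n, 1.0 / n)
    for _ in range(iters):
        rank = reset / n + (1 - reset) * rank @ transition
    return rank

# Toy agreement graph: entry [i, j] is how strongly source i endorses source j.
A = np.array([[0.0, 0.8, 0.1],
              [0.7, 0.0, 0.2],
              [0.1, 0.3, 0.0]])
print(source_rank(A))   # higher values = more heavily endorsed sources
```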
Agenda SourceRank Topic-sensitive SourceRank Experiments and Results Conclusion
Trust-based measure for the multi-topic deep web: Issues with SourceRank for the multi-topic deep web: it produces a single importance ranking and is query-agnostic. We propose Topic-sensitive SourceRank (TSR) for effectively performing multi-topic source selection sensitive to trustworthiness; TSR overcomes these drawbacks of SourceRank.
Topic-sensitive SourceRank Overview: Instead of creating a single importance ranking, multiple importance rankings are created, each biased towards a particular topic. At query-time, the query information and the individual topic-specific importance rankings are used to compute a composite importance ranking biased towards the query.
Challenges for TSR: Computing topic-specific importance rankings is not trivial; inferring query information (identifying the query-topic); computing the composite importance ranking.
Computing topic-specific SourceRank: For a deep-web source, its SourceRank score for a topic will depend on its answers to queries of that topic. Using topic-specific sampling queries for a topic results in an endorsement structure biased towards that topic. Example: if movie-related sampling queries are used, then movie sources are more likely to agree on the answer sets than sources from other topics, yielding an endorsement structure biased towards the movie topic.
Computing topic-specific SourceRank contd.: SourceRank computed on the agreement graph biased towards a topic captures the topic-specific source importance ranking for that topic.
Topic-specific sampling queries: Publicly available online directories such as ODP and the Yahoo! Directory provide hand-constructed topic hierarchies. These directories, along with the links posted under each topic, are a good source for obtaining topic-specific sampling queries.
Computing Topic-specific SourceRanks: A subset of the topic-specific sampling queries is used to obtain source crawls. The topic-specific source crawls are used to compute the biased agreement graphs, and the topic-specific SourceRanks (TSRs) are obtained by performing a weighted random walk on each biased agreement graph.
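Putting the pieces together, a hedged sketch of the per-topic computation follows: one biased agreement graph per topic, one stationary distribution per graph. The topic names match the experiments, but the random placeholder matrices and the stationary() helper stand in for the real per-topic crawls and the random-walk machinery sketched earlier.

```python
import numpy as np

# Sketch: one topic-specific SourceRank (TSR) vector per topic. Each topic's
# biased agreement graph would come from crawls driven by that topic's
# sampling queries; random matrices are used here only as placeholders.
def stationary(graph, reset=0.15, iters=100):
    n = graph.shape[0]
    row_sums = graph.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0
    transition = graph / row_sums
    rank = np.full(n, 1.0 / n)
    for _ in range(iters):
        rank = reset / n + (1 - reset) * rank @ transition
    return rank

topics = ["camera", "book", "movie", "music"]
biased_graph = {t: np.random.rand(6, 6) for t in topics}   # placeholder graphs
tsr = {t: stationary(biased_graph[t]) for t in topics}     # one ranking per topic
```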
Query Processing: Query processing involves computing the query-topic, computing query-topic sensitive importance scores, and source selection.
Computing query-topic: the likelihood of the query belonging to each topic. This is a soft classification problem: for a user query q and a set of topics c_i ∈ C, the goal is to find the fractional topic membership of q in each topic c_i. Example: for the query "godfather", the query-topic vector over the topics {Camera, Book, Movie, Music} assigns fractional memberships such as 0.3, 0.6 and 0.1 to the book, movie and music topics.
Computing query-topic – Training Data: Topic descriptions serve as the training data. The complete set of topic-specific sampling queries is used to obtain topic-specific source crawls, and the resulting topic descriptions are treated as bags of words.
Computing query-topic – Classifier: A Naïve Bayes Classifier (NBC) is used, with parameters set to maximum likelihood estimates. For a user query q, the NBC uses the topic descriptions to estimate the topic probability conditioned on q, i.e. for topic c_i, the NBC uses the topic description of c_i to estimate P(c_i|q).
Computing query-topic – Classifier contd.: Computing P(c_i|q): $P(c_i \mid q) \propto P(c_i) \prod_j P(q_j \mid c_i)$, where q_j is the jth term of query q and the scores are normalized over the topics. P(c_i) can be set based on domain knowledge, but for our computations we use uniform probabilities for the topics.
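A small, self-contained sketch of this classifier under simplifying assumptions: the topic descriptions are tiny illustrative word bags rather than real crawls, Laplace smoothing is added so unseen terms do not zero out a topic (the thesis uses plain maximum-likelihood estimates), and the prior P(c_i) is uniform as stated above.

```python
import math
from collections import Counter

# Sketch of the Naive Bayes query-topic classifier over bag-of-words topic
# descriptions. The word bags below are illustrative placeholders only.
topic_desc = {
    "book":   "godfather puzo novel paperback author pages isbn".split(),
    "movie":  "godfather coppola dvd film cast director runtime".split(),
    "music":  "godfather soundtrack album tracks audio cd".split(),
    "camera": "canon nikon lens megapixel zoom shutter".split(),
}

def query_topic(query, topic_desc):
    """Return P(c_i | q) for each topic c_i, with a uniform prior P(c_i)."""
    scores = {}
    for topic, words in topic_desc.items():
        counts = Counter(words)
        vocab = len(set(words))
        log_p = 0.0  # uniform prior adds the same constant to every topic
        for term in query.lower().split():
            # Laplace-smoothed term likelihood (assumption, see lead-in).
            log_p += math.log((counts[term] + 1) / (len(words) + vocab))
        scores[topic] = math.exp(log_p)
    z = sum(scores.values())
    return {t: s / z for t, s in scores.items()}

print(query_topic("godfather", topic_desc))
```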
Computing query-topic sensitive importance scores: The topic-specific SourceRanks are linearly combined using the query-topic as weights. The query-topic sensitive (composite) SourceRank score of source s_k is computed as $CSR_k = \sum_{c_i \in C} P(c_i \mid q) \cdot TSR_{ki}$, where $TSR_{ki}$ is the topic-specific SourceRank score of source s_k for topic c_i.
Source selection: Linearly combines relevance scores with importance scores. The overall score of a source s_k is computed as $Score_k = \alpha \cdot R_k + (1 - \alpha) \cdot CSR_k$, where $R_k$ is the relevance score of s_k and $CSR_k$ is its query-topic sensitive (composite) SourceRank score.
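The sketch below walks through this query-time scoring end to end under illustrative assumptions: the per-source TSR and relevance numbers, the query-topic vector and the two source names are made up, and alpha = 0.9 mirrors the best-performing CORI/TSR split reported later in the experiments.

```python
# Sketch of query-time scoring: topic-specific SourceRanks are combined
# using the query-topic weights, then blended with a relevance score
# (CORI in the thesis). All numbers below are illustrative placeholders.
def composite_source_rank(tsr_scores, query_topic):
    """CSR_k = sum_i P(c_i | q) * TSR_ki."""
    return sum(query_topic[t] * tsr_scores[t] for t in query_topic)

def overall_score(relevance, csr, alpha=0.9):
    """Blend relevance (weight alpha) with the composite SourceRank."""
    return alpha * relevance + (1 - alpha) * csr

# Hypothetical per-source data for the query "godfather".
sources = {
    "s1": {"tsr": {"book": 0.02, "movie": 0.10, "music": 0.01, "camera": 0.0},
           "relevance": 0.6},
    "s2": {"tsr": {"book": 0.08, "movie": 0.01, "music": 0.02, "camera": 0.0},
           "relevance": 0.5},
}
q_topic = {"book": 0.3, "movie": 0.6, "music": 0.1, "camera": 0.0}

ranked = sorted(
    sources,
    key=lambda s: overall_score(sources[s]["relevance"],
                                composite_source_rank(sources[s]["tsr"], q_topic)),
    reverse=True,
)
print(ranked[:10])   # top-k sources (k = 10 in the experiments)
```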
Agenda SourceRank Topic-sensitive SourceRank Experiments and Results Conclusion
Experimental setup: Experiments were conducted on a multi-topic deep-web environment consisting of four representative topics – camera, book, movie and music. Source dataset: sources were collected via Google Base, which was probed with 40 queries containing a mix of camera names and book, movie and music album titles. A total of 1440 sources were collected: 276 camera, 556 book, 572 movie and 281 music sources.
Sampling queries: Generated using publicly available online listings, with 200 titles or names per topic: randomly selected cameras from pbase.com, books from the New York Times best sellers, movies from ODP, and music albums from Wikipedia's top-100 lists for 1986-2010.
Test queries: Contained a mix of queries from all four topics and do not overlap with the sampling queries. They were generated by randomly removing words from camera names and book, movie and music album titles with probability 0.5. The number of test queries was varied across topics to obtain the required (0.95) statistical significance.
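As a small illustration of this generation step, the sketch below drops each word of a title independently with probability 0.5; the example title and the fallback to the first word when every word is dropped are assumptions made only to keep the snippet runnable.

```python
import random

# Sketch of test-query generation: each word of a title is removed
# independently with probability 0.5, as described above.
def make_test_query(title, drop_prob=0.5, rng=random):
    kept = [w for w in title.split() if rng.random() > drop_prob]
    return " ".join(kept) if kept else title.split()[0]  # fallback: keep one word

print(make_test_query("Pirates of the Caribbean Dead Man's Chest"))
```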
Query similarity based measure – CORI: Source statistics were collected using the highest document-frequency terms, and sources were selected using the same parameter settings found optimal in the CORI paper.
Query similarity based measure – Google Base: Two versions of Google Base were used. Gbase on dataset: Google Base search restricted to our crawled sources. Gbase: Google Base search with no restrictions, i.e. considering all sources in Google Base.
Agreement based measures – USR: Undifferentiated SourceRank (USR) is SourceRank extended to the multi-topic deep web: a single agreement graph is computed using the entire set of sampling queries, and the USR of the sources is computed from a random walk on this graph.
Agreement based measures – DSR: Oracular source selection (DSR) assumes that a perfect classification of sources and user queries is available, i.e. each source and test query is manually labeled with its domain association. Agreement graphs and SourceRanks are created per domain, including only the sources in that domain. For each test query, the sources ranking highest in the domain corresponding to the test query are used.
Result merging, ranking and relevance evaluation: The top-k sources are selected and Google Base is made to query only these top-k sources. We experimented with different values of k and found k=10 to be optimal. Google Base's tuple ranking was used to rank the resulting tuples, and the top-5 results are returned in response to each test query.
Result merging, ranking and relevance evaluation contd.: The top-5 results returned were manually classified as relevant or irrelevant; the classification was rule based. Example: if the test query is "pirates caribbean chest" and the original movie name is "Pirates of the Caribbean: Dead Man's Chest", then a result entity referring to the same movie (DVD, Blu-ray, etc.) is classified as relevant, and otherwise irrelevant. To avoid author bias, results from the different source selection methods were merged into a single file so that the evaluator does not know which method each result came from during classification.
Results: TSR was compared with the baseline source selection methods. The agreement-based measures (TSR, USR and DSR) were combined with the query-similarity based CORI measure; a combination is denoted by the agreement-based measure's name followed by the weight 1-α assigned to it. Example: TSR(0.1) represents 0.9×CORI + 0.1×TSR. We experimented with different values of α and found that α=0.9 gives the best precision for TSR-based source selection, i.e. TSR(0.1). The higher weight on CORI compared to TSR compensates for the fact that TSR scores have higher dispersion than CORI scores.
Comparison of top-5 precision of TSR(0.1) and the query-similarity based methods CORI and Google Base: TSR precision exceeds that of the similarity-based measures by 85%.
Comparison of topic-wise top-5 precision of TSR(0.1) and the query-similarity based methods CORI and Google Base: TSR significantly outperforms all query-similarity based measures for all topics.
Comparison of top-5 precision of TSR(0.1) and the agreement-based methods USR(0.1) and USR(1.0): TSR precision exceeds USR(0.1) by 18% and USR(1.0) by 40%.
Comparison of topic-wise top-5 precision of TSR(0.1) and the agreement-based methods USR(0.1) and USR(1.0): for three out of the four topics, TSR(0.1) outperforms USR(0.1) and USR(1.0) with confidence levels of 0.95 or more.
Comparison of top-5 precision of TSR(0.1) and the oracular DSR(0.1): TSR(0.1) is able to match DSR(0.1)'s performance.
Comparison of topic-wise top-5 precision of TSR(0.1) and the oracular DSR(0.1): TSR(0.1) matches DSR(0.1)'s performance across all topics, indicating its effectiveness in identifying important sources across all topics.
Agenda SourceRank Topic-sensitive SourceRank Experiments and Results Conclusion
Conclusion: We addressed multi-topic source selection sensitive to trustworthiness and importance for the deep web, and introduced topic-sensitive SourceRank (TSR). Our experiments on more than a thousand deep-web sources show that the TSR-based approach is highly effective in extending SourceRank to the multi-topic deep web.
Conclusion contd.: TSR outperforms query-similarity based measures by around 85% in precision and yields statistically significant precision improvements over the other baseline agreement-based methods. The comparison with the oracular DSR approach reveals the effectiveness of TSR for topic-specific query and source classification and the subsequent source selection.
Paper submitted to COMAD'11
Questions?