Trust and Profit Sensitive Ranking for Web Databases and On-line Advertisements Raju Balakrishnan (Arizona State University)

Agenda  Problem 1: Ranking the Deep Web – Need for New Ranking. – SourceRank: Agreement Analysis. – Computing Agreement and Collusion. – Results & System Implementation.  Problem 2: Ad-Ranking sensitive to Mutual Influences. – Browsing model & Nature of Influence. – Ranking Function and Generalizations. – Results. 2

Deep Web Integration Scenario
A mediator sits between the user and millions of deep web databases containing structured tuples, an uncontrolled collection of redundant information. The mediator forwards the user query to the web databases and combines the answer tuples they return. Search engines have only nominal access to the deep web; we don't Google for a “Honda Civic 2008 Tampa”.

Why Another Ranking?
Example query: “Godfather Trilogy” on Google Base.
Importance: the search matches titles against the query, yet none of the results is the classic Godfather.
Trustworthiness (bait and switch): the titles and cover images match exactly, and the prices are low. Amazing deal! But when you proceed to checkout you realize that the product is a different one (or when you open the mail package, if you are really unlucky).
Current rankings are oblivious to result importance and trustworthiness.

Source Selection in the Deep Web
Problem: given a user query, select a subset of sources that provide important and trustworthy answers.
Surface web search combines link analysis with query relevance to account for the trustworthiness and relevance of the results. Unfortunately, deep web records do not have hyperlinks.

Observations  Many sources return answers to the same query.  Comparison of semantics of the answers is facilitated by structure of the tuples. Idea: Compute importance and trustworthiness of sources based on the agreement of answers returned by different sources. Source Agreement 6

Agreement Implies Trust and Importance
- Important results are likely to be returned by a large number of sources. For example, for the query “Godfather”, hundreds of sources return the classic “The Godfather”, while only a few return the little-known movie “Little Godfather”.
- Two independent sources are unlikely to agree on corrupt or untrustworthy answers. For example, a wrong author for the book (say, “Nino Rota” as the Godfather author) would not be agreed upon by other sources.
As we know, truth is one (or a few), but lies are many.

Which tire? Agreement is not just for search.

Agreement Implies Trust and Relevance
The probability that two independently selected irrelevant or false tuples agree is very low, whereas the probability that two independently picked relevant and true tuples agree is much higher.

Method: Sampling-Based Agreement
An agreement link from S_i to S_j with weight w means that S_i acknowledges a fraction w of the tuples in S_j. Since the weight is a fraction of the target's result set, the links are asymmetric. A smoothing factor induces small smoothing links to account for unseen samples; R_1 and R_2 denote the result sets of S_1 and S_2.
- Agreement is computed using keyword queries; partial titles of movies and books are used as the queries.
- The mean agreement over all queries is used as the final agreement.
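As an illustration only (not the authors' implementation), the following Python sketch shows how the sampling-based agreement links could be computed from sampled result sets; the smoothing value beta, the fetch and agreement callables, and all names are assumptions.

```python
# Sketch of sampling-based agreement links (illustrative; not the authors' code).

def link_weight(r_i, r_j, agreement, beta=0.1):
    """Weight of the agreement link S_i -> S_j for one sampling query.

    r_i, r_j: result sets sampled from S_i and S_j for the query.
    agreement: callable returning how many tuples of r_j are acknowledged
               (agreed upon) by r_i -- see the agreement-computation slide.
    beta: assumed smoothing factor accounting for unseen samples.
    """
    if not r_j:
        return beta
    return beta + (1.0 - beta) * agreement(r_i, r_j) / len(r_j)

def mean_link_weight(src_i, src_j, queries, fetch, agreement, beta=0.1):
    """Final S_i -> S_j weight: mean link weight over all sampling queries
    (partial movie/book titles). fetch(src, q) returns src's results for q."""
    weights = [link_weight(fetch(src_i, q), fetch(src_j, q), agreement, beta)
               for q in queries]
    return sum(weights) / len(weights)
```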

Method: Calculating SourceRank
How can the agreement graph be used for improved search? The source graph is viewed as a Markov chain, with the edges as transition probabilities between sources. The prestige of sources, accounting for the transitive nature of agreement, can be computed by a Markov random walk: SourceRank is the stationary visit probability of this random walk on the database vertex. This static SourceRank can be combined with a query-specific source-relevance measure for the final ranking.
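A minimal sketch of the random-walk computation, assuming the agreement graph is given as a dense matrix; the restart (teleport) probability and iteration count are assumed parameters, not values from the dissertation.

```python
import numpy as np

def source_rank(agreement_matrix, restart=0.15, iterations=100):
    """Stationary visit probabilities of a random walk on the agreement graph.

    agreement_matrix[i][j]: weight of the agreement link S_i -> S_j.
    restart: assumed teleport probability (PageRank-style damping).
    """
    A = np.asarray(agreement_matrix, dtype=float)
    n = A.shape[0]
    # Row-normalise agreement weights into transition probabilities.
    row_sums = A.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0.0] = 1.0
    P = A / row_sums
    rank = np.full(n, 1.0 / n)
    for _ in range(iterations):
        rank = restart / n + (1.0 - restart) * (rank @ P)
    return rank / rank.sum()
```

As the slide notes, the resulting static scores would then be combined with a query-specific relevance measure for the final ranking.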

Computing Agreement is Hard
Computing semantic agreement between two records is the record linkage problem, which is known to be hard. Semantically identical entities may be represented syntactically differently by two databases (non-common domains). Example “Godfather” tuples from two web sources:
- Godfather, The: The Coppola Restoration | James Caan / Marlon Brando more | $9.99
- The Godfather - The Coppola Restoration Giftset [Blu-ray] | Marlon Brando, Al Pacino | USD
Note that the titles and castings are denoted differently.

Method: Computing Agreement
Agreement computation has three levels:
1. Comparing attribute values: SoftTF-IDF with Jaro-Winkler is used as the similarity measure.
2. Comparing records: no predefined schema matching is assumed. This is an instance of a bipartite matching problem; since optimal matching is expensive, greedy matching is used, with each value greedily matched against the most similar value in the other record. Attribute importance is weighted by IDF (for example, matching titles such as “Godfather” is more important than matching a common format value such as “paperback”).
3. Comparing result sets: using the record similarities computed above, result-set similarities are computed with the same greedy approach.
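The sketch below illustrates the greedy, IDF-weighted value matching between two records; the token-overlap similarity is only a stand-in for the SoftTF-IDF/Jaro-Winkler measure named on the slide, and all function names are hypothetical.

```python
def value_similarity(v1, v2):
    """Stand-in attribute-value similarity (token Jaccard). The slide's
    measure is SoftTF-IDF with Jaro-Winkler; substitute it if available."""
    t1, t2 = set(str(v1).lower().split()), set(str(v2).lower().split())
    if not t1 or not t2:
        return 0.0
    return len(t1 & t2) / len(t1 | t2)

def record_agreement(rec1, rec2, idf):
    """Greedy bipartite matching of attribute values, weighted by IDF.

    rec1, rec2: lists of attribute values (no schema matching assumed).
    idf: dict mapping a value to its IDF weight; rare values such as titles
         weigh more than common ones such as 'paperback'.
    """
    remaining = list(rec2)
    score, total_weight = 0.0, 0.0
    for v1 in rec1:
        if not remaining:
            break
        best = max(remaining, key=lambda v2: value_similarity(v1, v2))
        weight = idf.get(v1, 1.0)
        score += weight * value_similarity(v1, best)
        total_weight += weight
        remaining.remove(best)
    return score / total_weight if total_weight else 0.0
```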

Detecting Source Collusion
Observation 1: even non-colluding sources in the same domain may contain the same data. For example, movie databases may all contain all Hollywood movies.
Observation 2: the top-k answers of even non-colluding sources may be similar. For example, answers to the query “Godfather” may contain all three movies in the Godfather trilogy.
On the other hand, sources may copy data from each other or create mirrors, boosting the SourceRank of the group.

Source Collusion (continued)
- Basic method: if two sources return the same top-k answers to queries with a large number of answers (for example, queries like “the” or “DVD”), they are likely to be colluding.
- The degree of collusion of two sources is computed as their agreement on these large-answer queries; the words with the highest DF in the crawl are used as the queries.
- The agreement between two databases is adjusted for collusion by multiplying it by (1 - collusion).
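A hedged sketch of the collusion adjustment described above; the estimate simply averages the top-k agreement over the large-answer (high-DF) queries, and the names are assumptions.

```python
def collusion_estimate(topk_agreements_on_high_df_queries):
    """Degree of collusion between two sources: mean top-k agreement on
    large-answer queries (highest-DF words in the crawl, e.g. 'the', 'DVD')."""
    vals = topk_agreements_on_high_df_queries
    return sum(vals) / len(vals) if vals else 0.0

def collusion_adjusted_agreement(raw_agreement, collusion):
    """Agreement between two databases discounted for collusion."""
    return raw_agreement * (1.0 - collusion)
```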

Factal: Search Based on SourceRank
“I personally ran a handful of test queries this way and got much better results [than Google Products] using Factal” --- Anonymous WWW’11 reviewer.

Evaluation
Precision and DCG are compared with the following baseline methods:
1. CORI: adapted from text database selection. A union of sample documents from the sources is indexed, and the sources with the highest number of term hits are selected [Callan et al. 1995].
2. Coverage: adapted from relational databases. Mean relevance of the top-5 results to the sampling queries [Nie et al. 2004].
3. Google Products: the product search that runs over Google Base.
In all experiments SourceRank is distinguished from the baseline methods at a 0.95 confidence level.
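For reference, a small sketch of the two reported metrics with their standard definitions (the exact variants used in the experiments are not specified on the slide).

```python
import math

def precision_at_k(relevance, k=5):
    """Fraction of the top-k results judged relevant (binary judgements)."""
    top = relevance[:k]
    return sum(1 for r in top if r > 0) / float(k)

def dcg_at_k(relevance, k=5):
    """Discounted cumulative gain of the top-k results, log2 position discount."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(relevance[:k]))
```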

Online Top-4 Sources, Movies: 29% improvement.
Though combinations with query similarity are not our competitors, note that they are not better:
1. SourceRank implicitly considers query relevance, as the selected sources fetch answers by query similarity; combining again with query similarity may be an “overweighting”.
2. The search is vertical.

Online Top-4 Sources, Books: 48% improvement.

Google Base Top-5 Precision, Books: 24% improvement.
- 675 Google Base sources responding to a set of book queries are used as the book-domain sources.
- GBase-Domain is Google Base searching only these 675 domain sources.
- Source selection by SourceRank (or by coverage) is followed by ranking by Google Base.

Google Base Top-5 Precision, Movies: 25% improvement.

Trustworthiness of Source Selection (Google Base Movies)
1. The results in the sample crawl are corrupted by replacing attribute values not specified in the queries with random strings (since partial titles are the queries, all attributes except the titles are corrupted).
2. If the source selection is sensitive to corruption, source ranks should decrease as the corruption level increases.
Every relevance measure based on query similarity is oblivious to corruption of the attributes unspecified in the queries.
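A minimal sketch of the corruption step, assuming each crawled tuple is a dict of attribute to value; the attribute names and the random-string scheme are illustrative, not taken from the experiments.

```python
import random
import string

def corrupt_tuple(record, query_attributes=('title',), corruption_level=0.5):
    """Replace attribute values not specified in the queries with random
    strings; the query attributes (partial titles) are left untouched."""
    corrupted = {}
    for attr, value in record.items():
        if attr not in query_attributes and random.random() < corruption_level:
            length = max(len(str(value)), 5)
            corrupted[attr] = ''.join(random.choices(string.ascii_lowercase, k=length))
        else:
            corrupted[attr] = value
    return corrupted
```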

Trustworthiness: Google Base Books.

Collusion: Ablation Study
Two databases with the same one million tuples from IMDB are created, and the correlation between their ranking functions is reduced progressively. Natural agreement should be preserved while near-mirrors are caught.
Observations:
1. At high correlation, the adjusted agreement is very low.
2. At low correlations, the adjusted agreement is almost the same as the pure agreement.

Computation Time
- The random walk is known to be feasible at large scale.
- The time to compute the agreements is evaluated against the number of sources.
- Note that the computation is offline and easy to parallelize.

Publications and Recognition
- SourceRank: Relevance and Trust Assessment for Deep Web Sources Based on Inter-Source Agreement. Raju Balakrishnan, Subbarao Kambhampati. WWW 2011 (full paper).
- Factal: Integrating Deep Web Based on Trust and Relevance. Raju Balakrishnan, Subbarao Kambhampati. WWW 2011 (demonstration).
- SourceRank: Relevance and Trust Assessment for Deep Web Sources Based on Inter-Source Agreement. Raju Balakrishnan, Subbarao Kambhampati. WWW 2010, pages 1055-1056 (Best Poster Award, WWW 2010).

Contributions
1. Agreement-based trust assessment for the deep web.
2. Agreement-based relevance assessment for the deep web.
3. Collusion detection between web sources.
4. Evaluations on Google Base sources and online web databases.

Agenda  Ranking the Deep Web – Need for new ranking. – SourceRank: Agreement Analysis. – Computing Agreement and Collusion. – Results & System Implementation.  Proposed Work: Ranking the Deep Web Results.  Problem 2: Ad-Ranking sensitive to Mutual Influences. – Motivation and Problem Definition. – Browsing model & Nature of Influence – Ranking Function & Generalization – Results.  Proposed Work: Mechanism Design & Evaluations. Search engines generate their multi- billion dollar revenue by textual ads. Related problem of ranking of ads is as important as the ranking of results. A different aspect of ranking 28

Ad Ranking: State of the Art
Current approaches consider ads in isolation, as both ignore mutual influences:
- Sort by bid amount.
- Sort by bid amount x relevance.
We instead consider the ads as a set, and base the ranking on the user's browsing model.

User's Browsing Model
The user browses down, starting at the first ad. At every ad a_i he may:
- Click the ad with relevance probability r_i.
- Abandon browsing with probability γ_i.
- Go down to the next ad with probability 1 - r_i - γ_i.
The process repeats for the ads below, with a correspondingly reduced view probability. If an ad is similar to an ad placed above it, its residual relevance decreases and its abandonment probability increases.
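To make the model concrete, here is a small Monte Carlo sketch of this browsing behaviour (ignoring the similarity effect); the (bid, relevance, abandonment) tuple layout and the stop-after-click assumption are mine, chosen to match the expected-profit formula on the later slide.

```python
import random

def simulate_browsing(ads, trials=100_000):
    """Average profit per results-page visit under the browsing model.

    ads: list of (bid, r, gamma) in display order, with r + gamma <= 1.
    At each ad the user clicks with probability r (profit = bid), abandons
    with probability gamma, or moves down to the next ad otherwise.
    Browsing is assumed to stop after a click or an abandonment.
    """
    total = 0.0
    for _ in range(trials):
        for bid, r, gamma in ads:
            u = random.random()
            if u < r:                 # click
                total += bid
                break
            if u < r + gamma:         # abandon the page
                break
            # otherwise continue to the next ad
    return total / trials
```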

Mutual Influences
Three manifestations of mutual influences on an ad a_i are:
1. Similar ads placed above a_i: they reduce the user's residual relevance of a_i.
2. Relevance of other ads placed above a_i: the user may click on an ad above and never view a_i.
3. Abandonment probability of other ads placed above a_i: the user may abandon the search and never view a_i.

Expected Profit Considering Ad Similarities
Considering the bids (b_i), the residual relevances, the abandonment probabilities (γ_i), and the similarities between ads, the expected profit of a set of n results sums, over the positions, the bid of each ad times its residual relevance, discounted by the probability that the user browses down to that position.

THEOREM: Ranking to maximize the expected profit considering similarities between the results is NP-hard.
The proof is a reduction from the independent set problem to choosing the top-k ads considering similarities. Even worse, constant-ratio approximation algorithms are hard (unless NP = ZPP) for the diversity-ranking problem.

Expected Profit Considering the Other Two Mutual Influences (2 and 3)
Dropping similarity, and hence replacing residual relevance by absolute relevance r_i, the expected profit becomes

  Expected Profit = Σ_{i=1..n} b_i r_i Π_{j=1..i-1} (1 - r_j - γ_j)

Ranking to maximize this expected utility is a sorting problem.
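The closed form above translates directly into code; a sketch under the same (bid, relevance, abandonment) representation as in the earlier browsing-model sketch.

```python
def expected_profit(ads):
    """Expected profit of ads in display order.

    ads: list of (bid, r, gamma); r = absolute relevance (click probability),
    gamma = abandonment probability. Implements
    sum_i bid_i * r_i * prod_{j<i} (1 - r_j - gamma_j).
    """
    profit, view_prob = 0.0, 1.0
    for bid, r, gamma in ads:
        profit += view_prob * r * bid
        view_prob *= (1.0 - r - gamma)   # probability of reaching the next ad
    return profit
```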

Optimal Ranking
Rank the ads in descending order of the ranking function

  RF(a_i) = b_i r_i / (r_i + γ_i)

- The physical meaning of RF is the profit generated per unit of consumed view probability of the ads.
- Higher positions have more view probability; placing the ads that produce more profit per unit of consumed view probability higher up is intuitive.
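Since the optimal ranking is a sort on RF, the implementation is a one-liner; the guard for r + gamma = 0 is an assumption for degenerate inputs.

```python
def ranking_score(ad):
    """RF(a) = bid * r / (r + gamma): profit per unit of consumed view probability."""
    bid, r, gamma = ad
    denom = r + gamma
    return bid * r / denom if denom > 0 else 0.0

def optimal_ranking(ads):
    """Rank ads in descending order of the ranking function RF."""
    return sorted(ads, key=ranking_score, reverse=True)
```

As a quick check, expected_profit (from the previous sketch) applied to optimal_ranking(ads) should be at least as high as for any other ordering of the same ads.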

Comparison to Current Ad Rankings
- Assume the abandonment probability is zero (γ_i = 0), i.e. the user has infinite patience to go down the results until he finds the ad he wants: the ranking reduces to Sort by Bid Amount.
- Assume r_i + γ_i = c, where c is a constant for all ads, i.e. the abandonment probability decreases linearly with relevance: the ranking reduces to Bid Amount x Relevance.

Generality of the Proposed Ranking
The ranking generalizes to utilities: rank by the utility generated per unit of consumed view probability. For ads the utility is the bid amount; for documents the utility is relevance, in which case the ranking reduces to the popular relevance ranking.

Quantifying Expected Profit (Simulation)
Setup: the number of clicks is Zipf random with exponent 1.5; the abandonment probabilities, relevances, and bid amounts are uniform random.
- The proposed strategy gives the maximum profit over the entire range, with improvements of 45.7% and 35.9% over the two competing strategies.
- The difference in profit between RF and the competing strategies is significant; the bid-amount-only strategy becomes optimal only at the extreme of the range.

Publication and Recognition
- Optimal Ad-Ranking for Profit Maximization. Raju Balakrishnan, Subbarao Kambhampati. WebDB 2008.
- Yahoo! Research Key Scientific Challenges Award for Computational Advertising.

Overall Contributions
1. SourceRank-based source selection sensitive to the trustworthiness and importance of deep web sources.
2. A method to assess the collusion of deep web sources.
3. An optimal generalized ranking for ads and search results.
4. A ranking framework optimal with respect to the perceived relevance of search snippets and the abandonment probability.