
CS 533 Information Retrieval Systems

 Introduction  Connectivity Analysis  Kleinberg’s Algorithm  Problems Encountered  Improved Connectivity Analysis  Combining Connectivity and Content Analysis  Computing Relevance Weights for Nodes  Pruning Nodes from the Neighborhood Graph  Regulating the Influence of a Node  Evaluation  Partial Content Analysis  Degree Based Pruning  Iterative Pruning  Conclusion

 This paper addresses the problem of topic distillation on the World Wide Web: given a typical user query, finding quality documents related to the query topic.
 Connectivity analysis has been shown to be useful in identifying high-quality pages within a topic-specific graph of hyperlinked documents.
 The essence of the proposed approach is to augment a previous connectivity-analysis-based algorithm with content analysis.

 The situation on the World Wide Web differs from the setting of conventional information retrieval systems for several reasons:
 Users tend to use very short queries (1 to 3 words per query) and are very reluctant to give feedback.
 The collection changes continuously.
 The quality and usefulness of documents vary widely. Some documents are very focused; others involve a patchwork of subjects. Many are not intended to be sources of information at all.
 Preprocessing all the documents in the corpus requires a massive effort and is usually not feasible.
 Determining relevance accurately under these circumstances is hard. Most search services are content to return exact query matches, which may or may not satisfy the user's actual information need.

 This paper describes a system that takes a different approach in the same context: given a typical user query on the Web, the system attempts to find quality documents related to the topic of the query.
 This is more general than finding a precise query match, but not as ambitious as trying to exactly satisfy the user's information need.
 A simple approach to finding quality documents is to assume that if document A has a hyperlink to document B, then the author of A thinks that B contains valuable information.
 Transitively, if A is seen to point to a lot of good documents, then A's opinion becomes more valuable, and the fact that A points to B suggests that B is a good document as well.

 Given an initial set of results from a search service, the connectivity analysis algorithm extracts a subgraph from the Web containing the result set and its neighboring documents.
 This subgraph is used as the basis for an iterative computation that estimates the value of each document as a source of relevant links and as a source of useful content.
 The goal of connectivity analysis is to exploit the linkage information between documents:
 Assumption 1: A link between two documents implies that the documents contain related content.
 Assumption 2: If the documents were authored by different people, then the first author found the second document valuable.

 Compute two scores for each document: a hub score and an authority score.
 Documents with high authority scores are expected to have relevant content; documents with high hub scores are expected to contain links to relevant content.
 A document that points to many others is a good hub, and a document that many documents point to is a good authority.
 Transitively, a document that points to many good authorities is an even better hub, and similarly a document pointed to by many good hubs is an even better authority.
 In the evaluation of the different algorithms, this algorithm is used as the baseline.

 A start set of documents matching the query is fetched from a search engine (say, the top 200 matches).
 This set is augmented by its neighborhood: the set of documents that either point to or are pointed to by documents in the start set.
 The documents in the start set and its neighborhood together form the nodes of the neighborhood graph.
 Nodes are documents; hyperlinks between documents not on the same host are directed edges.
 The algorithm iteratively computes the hub and authority scores for the nodes:
 (1) Let N be the set of nodes in the neighborhood graph, with directed edge set E.
 (2) For every node n in N, let H[n] be its hub score and A[n] its authority score.
 (3) Initialize H[n] and A[n] to 1 for all n in N.
 (4) While the vectors H and A have not converged:
 (5) For all n in N: A[n] := ∑ (n′, n) ∈ E H[n′]
 (6) For all n in N: H[n] := ∑ (n, n′) ∈ E A[n′]
 (7) Normalize the H and A vectors.
 The iteration is proven to converge (in practice, in about 10 iterations).
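
The iteration above can be sketched in Python as follows. This is a minimal illustration, not the paper's implementation: node names are placeholders, and a fixed iteration count stands in for the convergence test.

```python
# Minimal sketch of Kleinberg's hub/authority iteration. The neighborhood
# graph is given as a node list and a list of directed (source, target) edges.
import math

def hits(nodes, edges, iterations=10):
    """Compute hub and authority scores by iterative mutual reinforcement."""
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):  # converges in about 10 iterations in practice
        # Authority score: sum of hub scores of nodes pointing at n.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub score: sum of authority scores of nodes n points at.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        # Normalize both vectors (Euclidean norm).
        for vec in (new_auth, new_hub):
            norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
            for n in vec:
                vec[n] /= norm
        hub, auth = new_hub, new_auth
    return hub, auth
```

For example, in a graph where three documents all link to a fourth, the fourth receives the top authority score and the three pointers get equal hub scores.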

 If there are very few edges in the neighborhood graph, not much can be inferred from the connectivity.
 Mutually Reinforcing Relationships Between Hosts: Sometimes a set of documents on one host points to a single document on a second host. This drives up the hub scores of the documents on the first host and the authority score of the document on the second host. The reverse case, where one document on a first host points to multiple documents on a second host, creates the same problem.
 Automatically Generated Links: Web documents generated by tools often contain links that were inserted by the tool.
 Non-relevant Nodes: The neighborhood graph often contains documents not relevant to the query topic. If these nodes are well connected, the topic drift problem arises: the most highly ranked authorities and hubs tend not to be about the original topic.

 Mutually reinforcing relationships between hosts give undue weight to the opinion of a single person.
 It is desirable for all the documents on a single host to have the same influence on the document they are connected to as a single document would.
 To solve the problem, fractional weights are given to edges in such cases:
 If there are k edges from documents on a first host to a single document on a second host, each edge is given an authority weight of 1/k.
 If there are l edges from a single document on a first host to a set of documents on a second host, each edge is given a hub weight of 1/l.
 Modified algorithm:
 (4) While the vectors H and A have not converged:
 (5) For all n in N: A[n] := ∑ (n′, n) ∈ E H[n′] × auth_wt(n′, n)
 (6) For all n in N: H[n] := ∑ (n, n′) ∈ E A[n′] × hub_wt(n, n′)
 (7) Normalize the H and A vectors.
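
The fractional edge weights can be computed in one pass over the edges. In this sketch, edges carry full URLs and the host is extracted with `urllib.parse` — an assumption for illustration; the slides only say weights are assigned per host.

```python
# Sketch of the auth_wt / hub_wt edge weights used by the improved algorithm.
from collections import defaultdict
from urllib.parse import urlparse

def edge_weights(edges):
    """Assign fractional weights to directed edges (src_url, dst_url)."""
    host = lambda url: urlparse(url).netloc
    # k: number of documents on one host pointing at a single target document.
    k_count = defaultdict(int)  # (src_host, dst_url) -> k
    # l: number of documents on one host pointed at by a single source document.
    l_count = defaultdict(int)  # (src_url, dst_host) -> l
    for src, dst in edges:
        k_count[(host(src), dst)] += 1
        l_count[(src, host(dst))] += 1
    auth_wt, hub_wt = {}, {}
    for src, dst in edges:
        auth_wt[(src, dst)] = 1.0 / k_count[(host(src), dst)]  # 1/k
        hub_wt[(src, dst)] = 1.0 / l_count[(src, host(dst))]   # 1/l
    return auth_wt, hub_wt
```

With two pages on `a.com` both linking to one page on `b.com`, each edge gets authority weight 1/2 but keeps hub weight 1, so the host as a whole counts like a single endorsement.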

 Two basic approaches:
 Eliminating non-relevant nodes from the graph
 Regulating the influence of a node based on its relevance

 The relevance weight of a node equals the similarity of its document to the query topic.
 The query topic is broader than the query itself, so matching the query against the document is usually not sufficient.
 Instead, the documents in the start set are used to define a broader query, and every document in the graph is matched against this query.
 The concatenation of the first 1000 words from each start-set document is taken as the query Q, and similarity(Q, D) is computed.
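
A minimal sketch of this relevance weight, assuming similarity(Q, D) is cosine similarity over raw term frequencies. The whitespace tokenizer and the use of plain term counts (rather than tf-idf) are simplifying assumptions, not details from the slides.

```python
# Sketch: broad query construction and relevance weighting for graph nodes.
import math
from collections import Counter

def make_broad_query(start_set_texts, words_per_doc=1000):
    """Concatenate the first 1000 words of every start-set document."""
    terms = []
    for text in start_set_texts:
        terms.extend(text.lower().split()[:words_per_doc])
    return terms

def relevance_weight(query_terms, doc_text):
    """Cosine similarity between term-frequency vectors of Q and D."""
    q = Counter(query_terms)
    d = Counter(doc_text.lower().split())
    dot = sum(q[t] * d[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0
```

A document identical to the broad query scores 1.0; a document sharing no terms with it scores 0.0.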

 There are many approaches one can take to using the relevance weight of a node to decide whether it should be eliminated from the graph.
 Here, thresholding is used: all nodes whose weights are below a threshold are pruned.
 Thresholds are picked in 3 ways:
 Median Weight: the threshold is the median of all the relevance weights
 Start Set Median Weight: the threshold is the median of the relevance weights of the nodes in the start set
 Fraction of Maximum Weight: the threshold is a fixed fraction of the maximum weight (max/10 is used)
 The imp algorithm is then run on the pruned graph. The corresponding algorithms are called med, startmed, and maxby10.
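
The three threshold choices can be sketched directly. The dict-based interface (relevance weights keyed by node id) is an assumption made for illustration.

```python
# Sketch of the three pruning thresholds and the pruning step itself.
import statistics

def pruning_threshold(weights, start_set_weights, method):
    """Return the relevance-weight threshold for med / startmed / maxby10."""
    if method == "med":        # median of all relevance weights
        return statistics.median(weights.values())
    if method == "startmed":   # median over the start-set nodes only
        return statistics.median(start_set_weights.values())
    if method == "maxby10":    # fixed fraction (1/10) of the maximum weight
        return max(weights.values()) / 10.0
    raise ValueError(f"unknown method: {method}")

def prune(nodes, weights, threshold):
    """Keep only nodes whose relevance weight reaches the threshold."""
    return {n for n in nodes if weights[n] >= threshold}
```

After pruning, the weighted connectivity analysis (imp) is rerun on the surviving subgraph.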

 Modulate how much a node influences its neighbors based on its relevance weight, i.e., reduce the influence of less relevant nodes on the scores of their neighbors:
 If W[n] is the relevance weight of a node n and A[n] its authority score, use W[n] × A[n] instead of A[n] in computing the hub scores of the nodes that point to it.
 If H[n] is its hub score, use W[n] × H[n] instead of H[n] in computing the authority scores of the nodes it points to.
 Combining the previous four approaches with this strategy gives four more algorithms: impr, medr, startmedr, and maxby10r.
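
One regulated update step might look like the sketch below. The fractional edge weights of the imp algorithm are omitted here for brevity, and normalization is left to the caller; only the W[n]-scaling described above is shown.

```python
# Sketch of one regulated (unnormalized) update: each neighbor sees
# W[n] * H[n] or W[n] * A[n] instead of the raw score, so less relevant
# nodes influence their neighbors less.
def regulated_step(nodes, edges, hub, W):
    """One regulated update of authority and hub scores."""
    new_auth = {n: 0.0 for n in nodes}
    for src, dst in edges:
        new_auth[dst] += W[src] * hub[src]        # regulated hub contribution
    new_hub = {n: 0.0 for n in nodes}
    for src, dst in edges:
        new_hub[src] += W[dst] * new_auth[dst]    # regulated authority contribution
    return new_hub, new_auth
```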

 Authority Rankings: imp improves precision by at least 26% over base; regulation and pruning each improve precision by a further ~10%, but combining them does not seem to give any additional improvement.
 Hub Rankings: imp improves precision by at least 23% over base; med improves on imp by a further 10%. Regulation slightly improves imp and maxby10 but not the others.
 Due to the distribution of t_a and t_h, no algorithm can have a relative recall better than 0.65 for authorities and 0.6 for hubs. base achieved a relative recall at 10 of 0.27 for authorities and 0.29 for hubs. Their best algorithm for authorities gave a relative recall of 0.41, and similarly for hubs; i.e., they achieved roughly half the potential improvement by this measure.

 Content analysis based algorithms improve precision at the expense of response time.
 Two algorithms are described that involve content pruning but analyze only a part of the graph (less than 10% of the nodes), making them a factor of 10 faster than the previous content analysis based algorithms.
 The new algorithms attempt to selectively analyze, and prune if needed, the nodes that are most influential in the outcome. Two heuristics are used to select the nodes to be analyzed:
 Degree based pruning
 Iterative pruning
 Their performance is comparable to the best of the previous algorithms.

 The in- and out-degrees of the nodes are used to select nodes that might be influential.
 4 × in_degree + out_degree is used as the measure of influence.
 The top 100 nodes by this measure are fetched, scored against Q, and pruned if their score falls below the pruning threshold.
 Connectivity analysis as in imp is then run for 10 iterations on the pruned graph.
 The ranking of hubs and authorities computed by imp is returned as the final ranking. This algorithm is called pca0.
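
The influence measure is a simple degree count over the neighborhood graph, sketched below; the function name is illustrative.

```python
# Sketch of degree-based node selection: rank nodes by 4*in_degree + out_degree
# and keep the top_k (the slides use top_k = 100) for content analysis.
from collections import defaultdict

def select_influential(edges, top_k=100):
    """Return the top_k nodes by the influence measure 4*in + out degree."""
    in_deg, out_deg = defaultdict(int), defaultdict(int)
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    nodes = set(in_deg) | set(out_deg)
    score = {n: 4 * in_deg[n] + out_deg[n] for n in nodes}
    return sorted(nodes, key=lambda n: -score[n])[:top_k]
```

Weighting in-degree 4x reflects that potential authorities (heavily linked-to nodes) matter more to the final ranking than potential hubs.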

 Connectivity analysis itself (specifically the imp algorithm) is used to select the nodes to prune.
 Pruning happens over a sequence of rounds; in each round, imp is run for 10 iterations. This algorithm is called pca1.
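
A hedged skeleton of this round structure is sketched below. The slides do not say how many nodes are examined per round or how many rounds are run, so `batch` and `rounds` are illustrative assumptions, and `run_imp` / `relevance_weight` stand in for the routines described earlier.

```python
# Skeleton of iterative pruning (pca1-style): each round runs the weighted
# connectivity analysis, then fetches and scores the top-ranked nodes not yet
# analyzed, pruning those below the relevance threshold. Batch size and round
# count are assumptions, not values from the slides.
def iterative_pruning(nodes, edges, run_imp, relevance_weight, threshold,
                      rounds=3, batch=25):
    """nodes: set of node ids; edges: list of (src, dst) pairs."""
    analyzed = set()
    for _ in range(rounds):
        hub, auth = run_imp(nodes, edges)  # 10 iterations of imp per round
        ranked = sorted(nodes, key=lambda n: -(hub[n] + auth[n]))
        for n in [m for m in ranked if m not in analyzed][:batch]:
            analyzed.add(n)
            if relevance_weight(n) < threshold:  # prune non-relevant node
                nodes = nodes - {n}
                edges = [(s, d) for s, d in edges if s != n and d != n]
    return nodes, edges
```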

 Kleinberg's connectivity analysis is shown to have three problems, and various algorithms are presented to address them.
 The simple modification suggested in algorithm imp achieved a considerable improvement in precision. Precision was further improved by adding content analysis.
 medr, pca0, and pca1 are the most promising algorithms:
 For authorities, pca1 seems to be the best algorithm overall.
 For hubs, medr is the best general-purpose algorithm.
 If term vectors are not available for the documents in the collection, imp is suggested.

 Krishna Bharat and Monika R. Henzinger. Improved algorithms for topic distillation in a hyperlinked environment. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, August 24–28, 1998, Melbourne, Australia.