Web Search – Summer Term 2006 VII. Selected Topics - The Hilltop Algorithm (c) Wolfgang Hürst, Albert-Ludwigs-University

Recap (PageRank and HITS)
PageRank and HITS both rank web pages based on
a) relevance (content, anchor text, ...)
b) quality, importance, authority, ...
The latter is derived from the link structure.
- PageRank: global, query-independent, recursive calculation over all pages
- HITS: local subgraph containing relevant documents; distinguishes between hubs and authorities

Hilltop [1]: Basic Idea
Observation: Many web authors
- create web pages with link lists about topics they are very familiar with (experts),
- maintain these pages well / try to keep them up-to-date,
- link to good, high-quality pages.
Idea: Try to find such pages automatically and use their link structure for ranking.
Compare HITS: similar to hubs, but with a more explicit description of "expert" sources and a global view (i.e. query-independent identification).

Hilltop [1]: Basic Idea (cont.)
An expert page is a page that is about a certain topic and links to many non-affiliated pages on that topic.
- "Non-affiliated" means authored by people from non-affiliated organizations (modeled, e.g., by URL processing).
- "Links to many ... pages" can be modeled, e.g., by a threshold.
A page is an authority on a query topic if, and only if, some of the best experts on that query topic point to it.

Hilltop [1]: Basic Idea (cont.)
General approach:
1. Identify experts (in advance, i.e. query-independent)
2. Select experts for a particular topic (depending on a specific query)
3. Use these experts to find and rank authorities for this topic

Identifying good expert pages
What makes a good expert, and how can experts be found?
A good expert is objective, diverse, and unbiased, and points to numerous non-affiliated pages.
Two hosts are defined as affiliated if
- they share the same first 3 octets of the IP address, OR
- the rightmost non-generic token in the hostname is the same (token = substring of a hostname delimited by ".").
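A minimal sketch of this affiliation test in Python; the list of generic tokens and the helper names are illustrative assumptions, not taken from the paper:

# Tokens like "com" or "co" are treated as generic; this list is an assumption.
GENERIC_TOKENS = {"com", "org", "net", "edu", "co", "ac", "uk", "de"}

def rightmost_non_generic_token(hostname):
    # Tokens are the "."-delimited substrings of the hostname,
    # scanned right to left until a non-generic one is found.
    for token in reversed(hostname.lower().split(".")):
        if token not in GENERIC_TOKENS:
            return token
    return hostname.lower()

def affiliated(ip_a, ip_b, host_a, host_b):
    # Condition 1: same first 3 octets of the IP address.
    if ip_a.split(".")[:3] == ip_b.split(".")[:3]:
        return True
    # Condition 2: same rightmost non-generic hostname token.
    return rightmost_non_generic_token(host_a) == rightmost_non_generic_token(host_b)

# e.g. affiliated("10.1.2.3", "10.1.2.99", "www.a.com", "www.b.org") -> True (same /24)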

Identifying good expert pages
1st: Divide all (indexed) web pages into groups of affiliated ones.
2nd: Get experts (i.e. pages pointing to lots of non-affiliated pages) based on their number of links to different groups (e.g. using a threshold); see the sketch below.
Note: This is all topic-independent!
Possible extensions:
- Consider topic-related clusters (if available)
- Consider special characteristics of a page (e.g. similar formatting, etc.)
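A sketch of this second step, assuming the grouping from the first step is available as a mapping from page URL to group id; the threshold value of 5 is illustrative, not from the paper:

def find_experts(pages, group_of, threshold=5):
    """pages: dict mapping page URL -> list of out-link URLs;
    group_of: dict mapping page URL -> affiliation group id."""
    experts = []
    for url, outlinks in pages.items():
        # Count how many distinct non-affiliated groups this page links to.
        target_groups = {group_of[t] for t in outlinks if t in group_of}
        target_groups.discard(group_of.get(url))  # ignore links within the own group
        if len(target_groups) >= threshold:
            experts.append(url)
    return experts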

Indexing experts
Identification of experts: done in advance, i.e. topic-/query-independent.
Selection of experts for a particular topic: done during the search process, i.e. query-dependent.
Therefore: create an inverted file for all pages that have been identified as experts.
Only index so-called key phrases, i.e.
- take all words in the title, in headings (<h1>, <h2>, ... tags), and in the anchor text of a URL,
- associate these phrases with the respective URLs.
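A sketch of such an expert index; the KeyPhrase structure and field names are assumptions for illustration:

from collections import namedtuple

# kind: "title" | "heading" | "anchor"; urls: the URLs the phrase is associated
# with (e.g. the title covers all links on the page, an anchor text its own link).
KeyPhrase = namedtuple("KeyPhrase", ["kind", "terms", "urls"])

def build_expert_index(expert_pages):
    """expert_pages: dict expert URL -> list of KeyPhrase. Returns an
    inverted file: term -> list of (expert URL, key phrase)."""
    index = {}
    for expert, phrases in expert_pages.items():
        for phrase in phrases:
            for term in phrase.terms:
                index.setdefault(term.lower(), []).append((expert, phrase))
    return index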

Search: Get and rank authorities
With this, we have:
- experts for different topics,
- all the information we need to select all experts for a particular topic given the query terms q_i.
Query processing is now done in two steps:
1. Select & rate experts (based on the query)
2. Select & rate authorities (based on the experts)

1. Select & rate experts
Select a page as an expert (e.g.) if all query terms q_i are associated with at least one URL.
Rate the selected experts by calculating an expert score for each expert p. For this, we define:
- LevelScore(p) = weighting of the type of key phrase (e.g. title: 16, heading: 6, anchor: 1)
- FullnessFactor(p, q) = measure for the number of terms in p covered by the query terms q (with plen = number of terms in p, m = number of terms in p not covered by q [1]):
  IF m <= 2 THEN FullnessFactor(p, q) = 1
  ELSE FullnessFactor(p, q) = 1 - (m - 2) / plen
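A direct transcription of these two measures as a sketch; query_terms is assumed to be a set of lowercased terms:

# Level weights as quoted above (title 16, heading 6, anchor 1).
LEVEL_SCORE = {"title": 16, "heading": 6, "anchor": 1}

def fullness_factor(phrase_terms, query_terms):
    # plen = length of the key phrase in terms,
    # m = number of terms in the key phrase not covered by the query.
    plen = len(phrase_terms)
    m = sum(1 for t in phrase_terms if t.lower() not in query_terms)
    return 1.0 if m <= 2 else 1.0 - (m - 2) / plen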

1. Select & rate experts (cont.)
Based on the LevelScore and FullnessFactor, three measures S_i (i = 0, 1, 2) are calculated as follows:
S_i = Σ LevelScore(p) × FullnessFactor(p, q)
(with Σ being the sum over all key phrases p that match k - i of the k query terms)
The expert score is finally calculated as
Expert_Score = 2^32 · S_0 + 2^16 · S_1 + S_2
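To make the weighting concrete, a minimal sketch reusing LEVEL_SCORE, fullness_factor, and the KeyPhrase structure from the sketches above (all helper names are illustrative); the 2^32 and 2^16 factors ensure that phrases matching all query terms dominate the score:

def expert_score(phrases, query_terms):
    k = len(query_terms)
    s = [0.0, 0.0, 0.0]                      # S_0, S_1, S_2
    for phrase in phrases:
        terms = {t.lower() for t in phrase.terms}
        matched = sum(1 for q in query_terms if q in terms)
        i = k - matched                      # phrase matches k - i query terms
        if 0 <= i <= 2:
            s[i] += LEVEL_SCORE[phrase.kind] * fullness_factor(phrase.terms, query_terms)
    return (2**32) * s[0] + (2**16) * s[1] + s[2]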

2. Select & rate authorities
Select pages as targets if they are referenced by at least two of these experts.
Rate them by calculating a target score:
1. Calculate an Edge_Score(E, T) for each edge (i.e., link) from any expert E to a target page T:
Edge_Score(E, T) = Expert_Score(E) × Σ_q occ(q, T)
(sum over all query terms q, with occ(q, T) = number of distinct key phrases for T containing q)

2. Select & rate authorities (cont.)
1. Calculate an Edge_Score(E, T) for each edge (i.e., link) from any expert E to a target page T.
2. Check all experts pointing to the same target; for affiliated experts, remove all edges but the one with the highest Edge_Score.
3. The Target_Score is now calculated as the sum of all remaining Edge_Scores.
Possible extension: Combine the Target_Scores with a page-dependent Match_Score (depending on the appearance of the search terms on the page).
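A sketch of these three steps; the input layout and the affiliated(url_a, url_b) predicate (in the spirit of the earlier affiliation sketch) are assumptions for illustration:

def target_scores(experts, query_terms, affiliated):
    """experts: list of (expert_url, expert_score, edges) tuples, where edges
    maps a target URL T to the list of key phrases qualifying the link to T."""
    incoming = {}   # target -> list of (expert_url, edge_score)
    for expert_url, score, edges in experts:
        for target, phrases in edges.items():
            # Edge_Score(E, T) = Expert_Score(E) * sum_q occ(q, T)
            occ_sum = sum(
                sum(1 for p in phrases if q in {t.lower() for t in p.terms})
                for q in query_terms)
            incoming.setdefault(target, []).append((expert_url, score * occ_sum))
    scores = {}
    for target, edge_list in incoming.items():
        if len(edge_list) < 2:               # need at least two pointing experts
            continue
        # Among affiliated experts, keep only the highest-scoring edge.
        edge_list.sort(key=lambda e: e[1], reverse=True)
        kept = []
        for url, es in edge_list:
            if not any(affiliated(url, kept_url) for kept_url, _ in kept):
                kept.append((url, es))
        # Target_Score = sum of the remaining edge scores.
        scores[target] = sum(es for _, es in kept)
    return scores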

Hilltop: Summary
Preprocessing:
- Divide the web into groups of affiliated pages (based on their authors / URLs)
- Select experts (based on linkage and groups)
Searching: Select and rate
1. experts referencing pages about a particular topic (represented by the query),
2. authorities for this particular topic.

Hilltop: Discussion
Main properties (compared to PageRank and HITS):
- Topic-/query-dependent (unlike PageRank)
- Pre-selection of experts (unlike HITS), i.e.
  - all experts are considered (no subgraph),
  - an efficient online calculation is possible
- Page content and structure are considered
Potential problems / criticism:
- Uses lots of intuitive assumptions that are modeled by heuristics

References
[1] Bharat, K., Mihaila, G. A.: When Experts Agree: Using Non-Affiliated Experts to Rank Popular Topics. ACM Transactions on Information Systems, Vol. 20, No. 1, Jan. 2002