LSIR All rights reserved. © 2003, Jie Wu, EPFL-I&C-IIF-LSIR, Laboratoire de systèmes d'informations répartis, Swiss Federal Institute of Technology, Lausanne.

Presentation transcript:

Incrementally Ranking Ephemeral Web Documents in Search Engines
Jie Wu, , Fri., Toronto, Canada
Road Map:
What are ephemeral documents?
What is the problem to be solved?
Experiments with Google
Generations of rankings
Properties of ephemeral documents
Solution to rank computation
Future work in a big framework

Ephemeral Documents: What Are Ephemeral Web Documents?
Definition: (Highly demanded) documents that newly appear (and die) between two consecutive crawls.
Significance of the study: It addresses the freshness, similarity, accuracy, personalization, etc. (semantic issues) of search engines.
Cause of the problem: Latency of crawling cycles. For example, ca. 1 month for Google, 2 weeks for MSN (1/3 to 1/2 the size of Google), 3 weeks for Alltheweb.
Examples: Everyday news pages (not really ephemeral), web sites for events (e.g. Olympics, projects like Alvis, short-term programs, unexpected big events like a war), deep web, etc.
Question: How can ephemeral documents be made available in a search engine (SE) as soon as possible?

Google Example (done at ca. 21:45, , Thu.)
Search for "sars" on Google: Top 3

Google Example cont.
Search for "sars" on Google: No. 4-6

Content of No. 2

Content of No. 4

Results from Google News (ca. 10:15, , Fri.)

Results from MSN (ca. 23:05, , Thu.)

Google vs MSN
Result Analysis
1. All of MSN's top 15 results are about the disease SARS.
2. MSN's collection size is only a bit more than 1/3 of Google's.
3. MSN might adjust the weights of SARS-related documents.
4. How can that be done in a systematic and uniform way for a SE with a huge document collection like Google's?
Google's Problems
1. Ephemeral documents are not included in the collection.
2. Public information needs are reflected with a delay.
3. The weights given to ephemeral documents are not high enough.

My Notions: 3 Generations of Rankings
Generation 1:
Factors: on-page ones, such as keywords/terms
Algorithm: Boolean model, vector space similarity, latent semantic indexing, fuzzy set model, probabilistic models, etc.
Generation 2:
Factors: on-page ones + link structure
Algorithm: G1 + link structure analysis, e.g. PageRank (importance of a page in a general sense), HITS
Generation 3:
Factors: on-page ones + link structure + semantic factors
Algorithm: G1 + G2 + Alvis

Normal vs. Ephemeral Web Documents I
Ranking Life Cycle of Normal Documents
Viewpoints of PageRank and Human Mind

Normal vs. Ephemeral Web Documents II
Ranking Life Cycle of Ephemeral Documents
Viewpoints of PageRank and Human Mind

Current Work on Ephemeral Web Documents
Basically nothing.
1. Google continues its trilogy: roughly monthly crawling of the whole web, PageRank computation, and adding other factors in.
2. People may not consider it really important to solve this problem: the current centralized, colossal and complete strategy is seen as good enough.
3. Separate solutions and systems are provided to address the problem, for example news.google.com.

Analysis by Matrix Computation
$P = cG + (1-c)E$, $A = P^{T}$: the principal eigenvector of $A$.
$G' = G + N + G2N + N2G$, $P' = cG' + (1-c)E'$, $A' = (P')^{T}$: the principal eigenvector of $A'$.
Continuously compute the new eigenvectors given the old ones and the minor change.
Heavier weights have to be given to the links pointing to the new ephemeral documents.
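
The matrix formulation above can be illustrated with a minimal sketch. It assumes the usual PageRank reading of the symbols, which the slide itself does not spell out: G is the row-normalized link matrix, E is the uniform matrix with every entry 1/n, and c is a damping factor (0.85 here is an assumption). The function names, the toy graphs, and the handling of pages without out-links are all hypothetical. The warm start from the old vector is one simple way to "compute the new eigenvector given the old one"; the heavier weighting of links to ephemeral documents is not shown.

```python
def transition_matrix(links, n, c=0.85):
    """Build P = c*G + (1-c)*E as a dense row-stochastic list of lists.

    links maps a page index to the list of pages it links to.
    G is the row-normalized link matrix; E has every entry equal to 1/n.
    """
    P = []
    for i in range(n):
        out = links.get(i, [])
        if out:
            g_row = [0.0] * n
            for j in out:
                g_row[j] += 1.0 / len(out)
        else:
            # Assumption: a page without out-links jumps uniformly at random.
            g_row = [1.0 / n] * n
        P.append([c * g + (1.0 - c) / n for g in g_row])
    return P


def principal_eigenvector(P, start=None, tol=1e-10, max_iter=1000):
    """Power iteration on A = P^T; returns (vector, iterations used)."""
    n = len(P)
    x = list(start) if start is not None else [1.0 / n] * n
    for iteration in range(1, max_iter + 1):
        # (A x)[j] = sum over i of P[i][j] * x[i]
        x_next = [sum(P[i][j] * x[i] for i in range(n)) for j in range(n)]
        delta = sum(abs(a - b) for a, b in zip(x, x_next))
        x = x_next
        if delta < tol:
            break
    return x, iteration


# Old graph G: three pages.
old_links = {0: [1, 2], 1: [2], 2: [0]}
old_rank, _ = principal_eigenvector(transition_matrix(old_links, 3))

# New graph G': page 3 is a new (ephemeral) document, linked from page 0
# and linking to page 2.
new_links = {0: [1, 2, 3], 1: [2], 2: [0], 3: [2]}
P_new = transition_matrix(new_links, 4)

# Warm start: reuse the old vector and give the new page zero initial mass.
new_rank, warm_iters = principal_eigenvector(P_new, start=old_rank + [0.0])
```

Because every row of P sums to 1, the iteration keeps the vector a probability distribution, so the warm-started and cold-started runs converge to the same ranking.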

New Matrix
After including ephemeral documents

Computation Based on the New Matrix
1. Aperiodic: the matrix is induced by the web graph.
2. Irreducible: strongly connected.
The Ergodic Theorem applies: the Markov chain defined by Q has a unique stationary probability distribution.
The computation converges.
How to Compute
1. Adaptive methods for PageRank computation.
2. k = 400 × (4,500 to 35,000) = 1,800,000 to 14,000,000, i.e. 0.06% to 0.47% of 3 billion.
3. Make use of the block structure.
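
The estimate of k can be checked with a few lines of arithmetic. The slide does not say what the factor 400 and the range 4,500 to 35,000 stand for, so the variable names below are neutral placeholders; only the numbers come from the slide.

```python
INDEXED_PAGES = 3_000_000_000   # "3 billion" from the slide
FACTOR = 400                    # first factor on the slide
LOW, HIGH = 4_500, 35_000       # range given on the slide

k_low = FACTOR * LOW
k_high = FACTOR * HIGH
share_low = 100.0 * k_low / INDEXED_PAGES
share_high = 100.0 * k_high / INDEXED_PAGES

print(f"k ranges from {k_low:,} to {k_high:,}")
print(f"share of the index: {share_low:.2f}% to {share_high:.2f}%")
```

The result matches the slide: the changed part is well under one percent of the collection, which is what makes an incremental update attractive.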

Decentralized Solution: Algebra + Distribution + Local
1. Aggregation of Local External Ranking
2. Aggregation of Local Internal Ranking
3. Obtaining the Composite Ranking

Big Picture of ALVIS

Distributed IR in ALVIS

Algorithms & Protocols
Algorithms for distributed computation of SiteRank
1. Exchange of information for SiteRank: naive broadcasting, randomized gossiping, etc.
2. Approximation of the global SiteRank: when to stop
Protocols for exchanging semantic information
1. Refer to Z39.50, digital library initiatives, etc.
2. Based on mature open standards like XML
3. Embrace new technologies such as the Semantic Web
4. Personalization
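
To give a feel for randomized gossiping and the "when to stop" question, here is a minimal single-process simulation of pairwise gossip averaging. It is a generic stand-in, not the SiteRank protocol: the function name, the scores, and the stopping rule are all assumptions, and the stopping test looks at every peer's value at once, which a real distributed system could only approximate.

```python
import random


def gossip_average(values, tol=1e-9, max_rounds=100_000, seed=0):
    """Randomized pairwise gossip: two random peers average their values.

    Returns (final values, rounds used). Stops once all peers agree
    within tol or max_rounds is reached.
    """
    rng = random.Random(seed)
    state = list(values)
    n = len(state)
    rounds = 0
    while max(state) - min(state) > tol and rounds < max_rounds:
        i, j = rng.sample(range(n), 2)    # pick two distinct peers
        mean = (state[i] + state[j]) / 2.0
        state[i] = state[j] = mean         # both keep the pair average
        rounds += 1
    return state, rounds


# Hypothetical local scores held by five peers (made-up numbers).
local_scores = [0.10, 0.40, 0.05, 0.30, 0.15]
estimates, rounds = gossip_average(local_scores)
```

Each exchange preserves the total, so every peer ends up close to the global mean without any peer ever seeing all the values. Naive broadcasting would reach the same answer with one message from every peer to every other peer.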

Applications in Search Engines
After including ranking of ephemeral documents
1. Ranking of normal and ephemeral documents can be unified seamlessly.
2. Strong support for a decentralized architecture for Web and peer-to-peer search engines.
3. No contradiction with using separate solutions. For example, news.google.com can easily be built upon a unified ranking scheme.

3 Challenges
Revisiting the challenges by Dr. Andrei Broder
A web graph model that takes into account information content.
A method to compare graph-derived query-independent factors.
Methods to create graphs where none exist.

Future Work
Real Computation on a Web-Scale Data Set
Where is the data set?
Taking Into Account More Semantic Information
Semantic information of the documents and the content

Questions?