1
LSIR All rights reserved. © 2003, Jie Wu, EPFL-I&C-IIF-LSIR, http://lsirwww.epfl.ch/ Laboratoire de systèmes d'informations répartis, Swiss Federal Institute of Technology, Lausanne
Incrementally Ranking Ephemeral Web Documents in Search Engines
Jie Wu, 1.8.2003, Fri., Toronto, Canada
Road Map:
What are ephemeral documents?
What is the problem to be solved?
Experiments with Google
Generations of rankings
Properties of ephemeral documents
Solution to rank computation
Future work in a big framework
2
Ephemeral Documents
What Are Ephemeral Web Documents?
Definition: (Highly demanded) documents that newly appear (and die) between two consecutive crawls.
Significance of the study: It addresses the freshness, similarity, accuracy, personalization, etc. (semantic issues) of search engines.
Cause of the problem: Latency of crawling cycles. For example, ca. 1 month for Google, 2 weeks for MSN (1/3 to 1/2 the size of Google), 3 weeks for Alltheweb.
Examples: Everyday news pages (not really ephemeral), web sites for events (e.g. Olympics, projects like Alvis, short-term programs, unexpected big events like a war, etc.), deep web, etc.
Question: How can ephemeral documents be made available in a search engine (SE) as soon as possible?
3
Google Example
Search for "sars" on Google: Top 3
(Done at ca. 21:45, 1.5.2003, Thu.)
4
Google Example cont.
Search for "sars" on Google: No. 4-6
5
Content of No. 2
6
Content of No. 4
7
Results from Google News (ca. 10:15, 2.5.2003, Fri.)
8
Results from MSN (ca. 23:05, 1.5.2003, Thu.)
9
Google vs MSN
Result Analysis
1. Actually, all top 15 results of MSN are about the disease SARS.
2. MSN's collection size is only a bit more than 1/3 of that of Google.
3. MSN might adjust the weights of SARS-related documents.
4. How can that be done in a systematic and uniform way for an SE with a huge collection of documents like Google?
Google's Problems
1. Ephemeral documents are not included in the collection.
2. Public information needs are reflected with a delay.
3. Weights given to ephemeral documents are not enough.
10
My Notions: 3 Generations of Rankings
Generation 1:
Factors: on-page ones, such as keywords/terms
Algorithm: Boolean model, vector space similarity, latent semantic indexing, fuzzy set model, probabilistic models, etc.
Generation 2:
Factors: on-page ones + link structure
Algorithm: G1 + link structure analysis, e.g. PageRank (importance of a page in a general sense), HITS
Generation 3:
Factors: on-page ones + link structure + semantic factors
Algorithm: G1 + G2 + Alvis
11
Normal vs. Ephemeral Web Documents I
Ranking Life Cycle of Normal Documents
Viewpoints of PageRank and Human-Mind
12
Normal vs. Ephemeral Web Documents II
Ranking Life Cycle of Ephemeral Documents
Viewpoints of PageRank and Human-Mind
13
Current Work on Ephemeral Web Documents
Nothing basically.
1. Google continues its trilogy: roughly monthly crawling of the whole web, PageRank computation, and adding other factors in.
2. People may not consider it really important to solve this problem. The current centralized, colossal and complete strategy is good and sufficient.
3. Separate solutions and systems are provided to address the problem, for example, news.google.com.
14
Analysis by Matrix Computation
$P = cG + (1-c)E$, $A = P^{T}$. Compute the principal eigenvector of $A$.
$G' = G + N + \mathrm{G2N} + \mathrm{N2G}$, $P' = cG' + (1-c)E'$, $A' = (P')^{T}$. Compute the principal eigenvector of $A'$.
Continuously compute the new eigenvectors given the old ones and the minor change.
Heavier weights have to be given to the links pointing to the new ephemeral documents.
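The idea of computing the new eigenvector from the old one can be illustrated with a small power-iteration sketch. This is not the author's implementation: the function name `pagerank`, the damping value c = 0.85, the uniform matrix E, the handling of pages without out-links, and the two toy graphs are all assumptions made for illustration. The heavier weighting of links to ephemeral documents is not modeled here.

```python
def pagerank(links, n, c=0.85, start=None, tol=1e-10, max_iter=1000):
    """Power iteration for the principal eigenvector of A = P^T,
    with P = c*G + (1-c)*E, where G is the row-normalized link matrix
    and E is assumed to be the uniform matrix with entries 1/n.

    links: dict mapping a page index to the list of pages it links to.
    start: optional starting vector (e.g. an old rank vector).
    Returns (rank_vector, iterations_used).
    """
    x = list(start) if start is not None else [1.0 / n] * n
    total = sum(x)
    x = [v / total for v in x]  # make sure the start vector sums to 1
    for it in range(1, max_iter + 1):
        nxt = [(1.0 - c) / n] * n
        for i in range(n):
            out = links.get(i, [])
            if out:
                share = c * x[i] / len(out)
                for j in out:
                    nxt[j] += share
            else:
                # Page without out-links: spread its mass uniformly
                # (an assumption, not stated on the slide).
                for j in range(n):
                    nxt[j] += c * x[i] / n
        delta = sum(abs(a - b) for a, b in zip(nxt, x))
        x = nxt
        if delta < tol:
            return x, it
    return x, max_iter


# Hypothetical old graph G with 4 pages.
old_links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
old_rank, _ = pagerank(old_links, 4)

# Hypothetical new graph G': page 4 is a new ephemeral document,
# linked from page 0 and linking back to page 0.
new_links = {0: [1, 2, 4], 1: [2], 2: [0], 3: [0, 2], 4: [0]}
cold_rank, cold_iters = pagerank(new_links, 5)
warm_rank, warm_iters = pagerank(new_links, 5, start=old_rank + [1.0 / 5])

print("cold start iterations:", cold_iters)
print("warm start iterations:", warm_iters)
print("new ranks:", [round(v, 4) for v in warm_rank])
```

Both runs reach the same rank vector. Starting from the old vector is the simplest way to reuse earlier work when the change to the graph is minor, although how many iterations it saves depends on the graph.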
15
New Matrix
After including ephemeral documents
16
Computation Based on the New Matrix
1. Aperiodic: the matrix is induced by the web graph.
2. Irreducible: strongly connected.
Ergodic Theorem applies: the Markov chain defined by Q has a unique stationary probability distribution. The computation converges.
How to Compute
1. Adaptive methods for PageRank computation.
2. k = 400 × (4,500 ∼ 35,000) = 1,800,000 ∼ 14,000,000 (0.06% ∼ 0.47%) of 3 billion.
3. Make use of the block structure.
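The arithmetic behind the estimate of k can be checked with a few lines. The slide does not say what the two factors stand for, so the variable names below are neutral placeholders chosen for this sketch.

```python
# Figures taken from the slide; names are placeholders.
factor = 400
per_unit_low, per_unit_high = 4_500, 35_000
web_size = 3_000_000_000  # "3 billion"

k_low = factor * per_unit_low
k_high = factor * per_unit_high
share_low = 100 * k_low / web_size    # percentage of the collection
share_high = 100 * k_high / web_size

print(f"k ranges from {k_low:,} to {k_high:,}")
print(f"share of collection: {share_low:.2f}% to {share_high:.2f}%")
```

The result matches the slide: k lies between 1,800,000 and 14,000,000, or about 0.06% to 0.47% of the collection.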
17
Decentralized Solution: Algebra+Distribution+Local
1. Aggregation of Local External Ranking
2. Aggregation of Local Internal Ranking
3. Obtaining the Composite Ranking
18
Big Picture of ALVIS
19
Distributed IR in ALVIS
20
Algorithms & Protocols
Algorithms for distributed computation of SiteRank
1. Exchange of information for SiteRank: naive broadcasting, randomized gossiping, etc.
2. Approximation of the global SiteRank: when to stop
Protocols for exchanging semantic information
1. Refer to Z39.50, digital library initiatives, etc.
2. Based on mature open standards like XML
3. Embrace new technologies such as the Semantic Web
4. Personalization
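To give a feel for randomized gossiping and the "when to stop" question, here is a generic pairwise-averaging gossip simulation. It is a standard textbook technique swapped in for illustration, not the SiteRank protocol itself: the function name `gossip_average`, the scalar local values, and the stopping threshold are all assumptions.

```python
import random


def gossip_average(values, eps=1e-6, max_rounds=100_000, seed=0):
    """Simulate randomized pairwise gossip.

    In each round two random peers exchange their current estimates and
    both adopt the average. The sum of all estimates is preserved, so
    every peer drifts toward the global mean.
    Returns (estimates, rounds_used).
    """
    rng = random.Random(seed)
    est = list(values)
    for rounds in range(max_rounds):
        # Stop criterion for the simulation only: a real peer cannot see
        # the global spread and would need a local test instead.
        if max(est) - min(est) < eps:
            return est, rounds
        i, j = rng.sample(range(len(est)), 2)
        est[i] = est[j] = (est[i] + est[j]) / 2.0
    return est, max_rounds


# Hypothetical local scores held by five peers.
local = [0.40, 0.10, 0.25, 0.05, 0.20]
estimates, rounds = gossip_average(local)
print("rounds used:", rounds)
print("estimates:", [round(e, 6) for e in estimates])
```

Compared with naive broadcasting, each round involves only two peers, at the cost of needing more rounds before all estimates agree to within the chosen tolerance.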
21
Applications in Search Engines
After including ranking of ephemeral documents
1. Ranking of normal and ephemeral documents can be unified seamlessly.
2. Strong support of a decentralized architecture for Web and peer-to-peer search engines.
3. No contradiction to using separate solutions. For example, news.google.com can easily be built upon a unified ranking scheme.
22
Revisiting the Challenges by Dr. Andrei Broder
3 Challenges
A web graph model that takes into account information content.
A method to compare graph-derived query-independent factors.
Methods to create graphs where none exists.
23
Future Work
Real Computation on a Web-Scale Data Set: Where is the data set?
Taking Into Account More Semantic Information: Semantic information of the documents and the content
24
Questions?