Published by Kerrie Allen. Modified over 9 years ago.
Graph-based Algorithms in Large Scale Information Retrieval
Fatemeh Kaveh-Yazdy
Computer Engineering Department, School of Electrical and Computer Engineering, Yazd University
Graduate Research Assistant @ Web Lab. & IPKD Lab., YU
Senior Research Fellow @ Parsijoo
External Research Member of MSC Lab., DUT
Slide 2 Outline
– Information Retrieval Systems: Search Engines
– Graphs in Information Retrieval
  – Connection-based Ranking
  – Spamming
  – Spam Detection
– A Real World Case
Sections: Introduction to Information Retrieval; Graphs in Information Retrieval; A Real World Case…
Slide 3 Introduction to IR
Information Retrieval Systems: Search Engines
[Figure labels: Enterprise Document Retrieval, Web]
Web Retrieval vs. Document Retrieval
– Structure of documents
– Scale
– Domain
– Users
– Query Specificity
– Determination
Section topics: Search Engines; Trust in Web
Slide 4 Architecture of Search Engines
[Figure components: Web, Crawler(s), Page Repository, Indexer Module, Collection Analysis Module, Indexes (Text, Structure, Utility), Query Engine, Ranking, Client, Queries]
Slide 5 Cont.
Web Structure
– Meta Data
– Linkage
Applications of Web Structure
– Crawling
– Indexing
– Ranking
[Figure labels: www.sharif.edu, math.sharif.edu, Math Dept.]
Slide 6 Trust in Web Structure
Cite / Link
– Use / Quote / Express favor
– Trust / Applicability
Assumption
– A link from page A to page B is a recommendation of page B by the author of A (we say B is a successor of A)
Recursion: the quality of a page is related to
– its in-degree,
– the quality of the pages linking to it
[Figure labels: A, B]
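The link assumption can be made concrete with a small sketch. It builds successor lists and in-degree counts from a list of (source, target) links. The function name and the toy graph are hypothetical and only illustrate the definitions on this slide.

```python
from collections import defaultdict


def successors_and_in_degree(links):
    """Build successor lists and in-degree counts from (source, target) links."""
    successors = defaultdict(list)
    in_degree = defaultdict(int)
    for source, target in links:
        successors[source].append(target)  # target is a successor of source
        in_degree[target] += 1             # one more recommendation for target
        in_degree.setdefault(source, 0)    # every page gets an in-degree entry
        successors.setdefault(target, [])  # every page gets a successor list
    return dict(successors), dict(in_degree)


# Hypothetical toy graph: A and C both link to B, and B links to C.
links = [("A", "B"), ("C", "B"), ("B", "C")]
successors, in_degree = successors_and_in_degree(links)
```

In-degree alone treats every recommendation as equal; the recursive part of the assumption (quality of the linking pages) is what the later ranking slides address.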
Slide 7 Random Surfer on the Web
Page and Brin [1] introduce the random surfer model
Definition
– The random surfer starts from a random page
– The surfer proceeds to a randomly chosen successor of the current page (each successor with probability 1/outdegree)
[Figure labels: s, Surfer]
Section topics: Random Surfer Model; PageRank; PageRank with Teleport; Spamming
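The definition above can be simulated directly. The following is a minimal sketch, assuming the web is given as a dictionary from each page to its list of successors; the function name, the seed parameter, and the three-page graph are hypothetical.

```python
import random


def random_surfer(successors, steps, seed=0):
    """Simulate the plain random surfer and count visits per page."""
    rng = random.Random(seed)
    pages = sorted(successors)
    current = rng.choice(pages)              # start from a random page
    visits = {page: 0 for page in pages}
    visits[current] += 1
    path = [current]
    for _ in range(steps):
        out_links = successors[current]
        if not out_links:
            break                            # no successor to move to
        current = rng.choice(out_links)      # each successor: prob. 1/outdegree
        visits[current] += 1
        path.append(current)
    return path, visits


# Hypothetical toy web: A links to B and C, B links to C, C links back to A.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
path, visits = random_surfer(web, steps=1000)
```

On a long walk, the share of visits a page receives is a rough estimate of its rank under this model.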
Slide 8 to Slide 16 Random Surfer on the Web (II) and (III)
[Figure-only animation frames; only the labels "Surfer" and "s" survive]
Slide 17 PageRank
Each page inherits its rank from its ancestors.
Issues
– Web graph is not strongly connected
– Convergence of PageRank is not guaranteed
– Effects of sink nodes
  – Pages without out-links
  – Trapping pages
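The sink-node issue can be seen in a few lines. This is a minimal sketch of rank propagation without teleport, assuming the graph is a dictionary from each page to its successors; the function name and both toy graphs are hypothetical.

```python
def pagerank_no_teleport(successors, iterations=50):
    """Propagate rank along links only; no teleport, no sink handling."""
    pages = list(successors)
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: 0.0 for page in pages}
        for page, out_links in successors.items():
            for target in out_links:
                # Each page splits its rank evenly among its successors.
                new_rank[target] += rank[page] / len(out_links)
        rank = new_rank
    return rank


# Hypothetical graphs: a closed cycle, and a chain that ends in a sink (C).
cycle = {"A": ["B"], "B": ["C"], "C": ["A"]}
with_sink = {"A": ["B"], "B": ["C"], "C": []}

cycle_total = sum(pagerank_no_teleport(cycle).values())
sink_total = sum(pagerank_no_teleport(with_sink).values())
```

In the cycle the total rank stays at 1, while in the chain the rank that reaches the sink is never passed on, so the total drains to 0.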
Slide 18 Cont.
[Figure only; the labels "s" and "Sink" survive]
Slide 19 PageRank with Teleport
Teleport
– The random surfer jumps from a node to any other node
– The destination is chosen uniformly from all nodes; the probability of selecting each node is 1/n
– At each node, the surfer has the option of jumping; the probability of jumping is α (0 ≤ α ≤ 1)
– Damping factor: d = 1 − α
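The teleport rule can be sketched as a power iteration. This is an illustration under stated assumptions, not the presenter's implementation: the value α = 0.15 is assumed (the slide only requires 0 ≤ α ≤ 1), sink pages are assumed to always teleport, and the function name and toy graph are hypothetical.

```python
def pagerank_with_teleport(successors, alpha=0.15, iterations=100):
    """PageRank where the surfer jumps to a uniform random page with prob. alpha."""
    pages = list(successors)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Teleport share: every page receives alpha * (1/n).
        new_rank = {page: alpha / n for page in pages}
        for page, out_links in successors.items():
            follow = (1.0 - alpha) * rank[page]  # d = 1 - alpha
            if out_links:
                for target in out_links:
                    new_rank[target] += follow / len(out_links)
            else:
                # Assumption: a sink spreads its rank uniformly (always jumps).
                for target in pages:
                    new_rank[target] += follow / n
        rank = new_rank
    return rank


# Hypothetical chain ending in a sink: A -> B -> C.
chain = {"A": ["B"], "B": ["C"], "C": []}
ranks = pagerank_with_teleport(chain)
```

With teleport, the total rank stays at 1 and every page keeps a non-zero rank, even when the graph contains a sink.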
Slide 20 Spamming
Spam
– The manipulation of web page content for the purpose of appearing high up in search results.
Spamming Techniques
– Text content manipulation (tags, comments, invisible text blocks)
– Structural content manipulation (mimicking important websites)
Slide 21 Spam Detection
Spam Detection Methods
– Text Spam
  – Comparing word probability
– Link-farm Spam
  – Trust/Anti-trust Rank
  – Community Detection
Slide 22 Link-farm Spam Detection
Link-farm Spam
– Trust Rank
– Anti-trust
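The slide names Trust Rank without giving details. The sketch below follows the commonly described form of TrustRank: PageRank in which teleport lands only on a hand-picked set of trusted seed pages, so trust flows outward from the seeds along links. The function name, α = 0.15, and the toy graph are assumptions, and this is not necessarily the variant used by the presenter or by Parsijoo.

```python
def trust_rank(successors, trusted_seeds, alpha=0.15, iterations=100):
    """Propagate trust from seed pages; teleport goes only to trusted seeds."""
    pages = list(successors)
    seeds = [page for page in pages if page in trusted_seeds]
    # Teleport distribution: uniform over the seeds, zero elsewhere.
    jump = {p: (1.0 / len(seeds) if p in trusted_seeds else 0.0) for p in pages}
    trust = dict(jump)
    for _ in range(iterations):
        new_trust = {page: alpha * jump[page] for page in pages}
        for page, out_links in successors.items():
            for target in out_links:
                # A page passes its trust evenly to the pages it links to.
                new_trust[target] += (1.0 - alpha) * trust[page] / len(out_links)
        trust = new_trust
    return trust


# Hypothetical web: two good pages, plus a link farm boosting "target".
web = {
    "good1": ["good2"],
    "good2": ["good1"],
    "farm1": ["farm2", "target"],
    "farm2": ["farm1", "target"],
    "target": ["farm1"],
}
scores = trust_rank(web, trusted_seeds={"good1"})
```

Because no trusted page links into the farm, the farm pages and their target receive zero trust however densely they link to each other. Anti-trust ranking is usually described as the mirror image: distrust is propagated from known spam pages along reversed links.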
Slide 23 Parsijoo
Section topics: Parsijoo
Slide 24 A Real World Case…
Parsijoo Facts
– Crawled pages: (1 × 10^9/month) rem. 500 × 10^6
  – Crawling rate: 2,000 pages/sec
– Cached URLs: 10 × 10^9
  – 80,000 URLs/sec
  – 10 × 10^6 unique hosts (each host needs one queue)
– Unique URLs: 800 × 10^6
– Unique words: 80 × 10^6
– Unique requests: 200 × 10^3/day
Slide 25 A Real World Case…
Parsijoo Facts
– Requests (per day)
  – Web: 100 K
  – Image: 35 K
  – News: 10 K
  – Music: 10 K
  – Scholar: 1 K
  – Video: 5 K
  – SADANA, etc.: 35 K
– Unique requests: 200 × 10^3/day