Graph-based Algorithms in Large Scale Information Retrieval Fatemeh Kaveh-Yazdy Computer Engineering Department School of Electrical and Computer Engineering Yazd University Graduate Research Web Lab. & IPKD Lab., YU Senior Research Parsijoo External Research Member of MSC Lab., DUT
Slide 2 Outline
– Information Retrieval Systems: Search Engines
– Graphs in Information Retrieval
  – Connection-based Ranking
  – Spamming
  – Spam Detection
– A Real World Case
Sections: Introduction to Information Retrieval, Graphs in Information Retrieval, A Real World Case
Slide 3 Introduction to IR
Enterprise Document Retrieval
Web Information Retrieval Systems: Search Engines
Web Retrieval vs. Document Retrieval
– Structure of documents
– Scale
– Domain
– Users
– Query Specificity
– Determination
Slide 4 Architecture of Search Engines
[Diagram components: Web, Crawler(s), Page Repository, Indexer Module, Collection Analysis Module, Indexes (Text, Structure, Utility), Query Engine, Ranking, Queries, Client]
Slide 5 (Cont.)
Web Structure
– Meta Data
– Linkage
Applications of Web Structure
– Crawling
– Indexing
– Ranking
[Figure labels: math.sharif.edu, Math Dept.]
Slide 6 Trust in Web Structure
Cite / Link
– Use / Quote / Express favor
– Trust / Applicability
Assumption
– A link from page A to page B is a recommendation of page B by the author of A (we say B is a successor of A)
Recursion: the quality of a page is related to
– its in-degree,
– the quality of the pages linking to it
[Figure: link from A to B]
Slide 7 Random Surfer on the Web
Page and Brin [1] introduce the random surfer model
Definition
– The random surfer starts from a random page
– The surfer proceeds to a randomly chosen successor of the current page (each successor is chosen with probability 1/outdegree)
[Figure: surfer (s) on the web graph]
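The definition above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the graph `toy_graph`, the function name `random_surf`, and the step count are hypothetical and do not come from the slides. The surfer stops if it reaches a page without successors, because the basic model does not say what happens there.

```python
import random
from collections import Counter


def random_surf(graph, steps, seed=0):
    """Simulate one random surfer and count how often each page is visited."""
    rng = random.Random(seed)
    pages = sorted(graph)
    current = rng.choice(pages)  # the surfer starts from a random page
    visits = Counter()
    for _ in range(steps):
        visits[current] += 1
        successors = graph[current]
        if not successors:
            # No out-links: the basic model is undefined here, so stop.
            break
        # Each successor is chosen with probability 1/outdegree.
        current = rng.choice(successors)
    return visits


# Hypothetical toy graph: page -> list of successors.
toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

visits = random_surf(toy_graph, steps=10_000)
total = sum(visits.values())
freq = {page: visits[page] / total for page in toy_graph}
```

In this toy graph the long-run visit frequencies settle near 0.4 for A and C and 0.2 for B, which is the kind of visit share that PageRank turns into a score.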
Slides 8–16 Random Surfer on the Web (II)–(III)
[Animation frames: surfers (s) moving across the web graph]
Slide 17 PageRank
Each page inherits its rank from its ancestors (the pages that link to it).
Issues
– The web graph is not strongly connected
– Convergence of PageRank is not guaranteed
– Effects of sink nodes
  – Pages without out-links
  – Trapping pages
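A minimal sketch of the rank inheritance described above, and of what goes wrong with sink nodes. The graphs, function names, and iteration count are hypothetical, chosen only to make the two issues visible: a page without out-links loses its rank at every step, and a page that links only to itself traps all of it.

```python
def pagerank_step(graph, rank):
    """One update: each page passes rank/outdegree to every successor."""
    new_rank = {page: 0.0 for page in graph}
    for page, successors in graph.items():
        for succ in successors:
            new_rank[succ] += rank[page] / len(successors)
    return new_rank


def iterate(graph, steps):
    """Start from a uniform rank and apply the update `steps` times."""
    rank = {page: 1.0 / len(graph) for page in graph}
    for _ in range(steps):
        rank = pagerank_step(graph, rank)
    return rank


# Hypothetical graphs: page -> list of successors.
sink_graph = {"A": ["B"], "B": ["C"], "C": []}     # C has no out-links
trap_graph = {"A": ["B"], "B": ["C"], "C": ["C"]}  # C links only to itself

leaked = iterate(sink_graph, 5)   # total rank drains away through C
trapped = iterate(trap_graph, 5)  # C ends up holding all of the rank
```

Neither result is a useful ranking, which is the motivation for adding a teleport step.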
Slide 18 PageRank (Cont.)
[Figure: web graph with surfers (s) and a sink node]
Slide 19 PageRank with Teleport
Teleport
– The random surfer jumps from a node to any other node
– The destination is chosen uniformly from all nodes
  – The probability of selecting each node is 1/n
– At each node, the surfer has the option of jumping
  – The probability of jumping is α (0 ≤ α ≤ 1)
  – Damping factor: d = 1 − α
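The teleport rule above can be sketched as a power iteration. Several details are assumptions rather than content from the slides: the value α = 0.15, the convergence tolerance, the graph `toy_graph`, and the choice to let a surfer on a page without out-links always teleport (its rank is spread uniformly over all n nodes).

```python
def pagerank_teleport(graph, alpha=0.15, tol=1e-12, max_iter=1000):
    """PageRank where the surfer jumps to a uniform random node with prob. alpha."""
    n = len(graph)
    rank = {page: 1.0 / n for page in graph}
    for _ in range(max_iter):
        # Assumption: rank sitting on pages without out-links is spread uniformly.
        sink_mass = sum(rank[p] for p, succ in graph.items() if not succ)
        base = alpha / n + (1.0 - alpha) * sink_mass / n
        new_rank = {page: base for page in graph}
        # With probability d = 1 - alpha the surfer follows a random out-link.
        for page, successors in graph.items():
            for succ in successors:
                new_rank[succ] += (1.0 - alpha) * rank[page] / len(successors)
        change = sum(abs(new_rank[p] - rank[p]) for p in graph)
        rank = new_rank
        if change < tol:
            break
    return rank


# Hypothetical toy graph: page -> list of successors; E has no out-links.
toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"], "E": []}
ranks = pagerank_teleport(toy_graph, alpha=0.15)
```

With teleport every page keeps a positive score and the scores sum to 1, even though the graph is not strongly connected and contains a page without out-links.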
Slide 20 Spamming
Spam
– The manipulation of web page content for the purpose of appearing high up in search results
Spamming Techniques
– Text content manipulation (tags, comments, invisible text blocks)
– Structural content manipulation (mimicking important websites)
Slide 21 Spam Detection
Spam Detection Methods
– Text Spam
  – Comparing word probability
– Link-farm Spam
  – Trust/Anti-trust Rank
  – Community Detection
Slide 22 Link-farm Spam Detection
Link-farm Spam
– Trust Rank
– Anti-trust Rank
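The slide only names the two methods, so the following is a hedged sketch of the general idea rather than the deck's own implementation. It assumes that trust is propagated as a PageRank whose teleport jumps land only on a seed set of trusted pages, and that anti-trust is the same computation on the reversed graph, seeded with known spam pages. The graph `web`, the seed choices, α = 0.15, and all function names are hypothetical.

```python
def biased_pagerank(graph, seeds, alpha=0.15, iterations=100):
    """PageRank whose teleport jumps land only on the seed pages."""
    jump = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in graph}
    score = dict(jump)
    for _ in range(iterations):
        # Assumption: score on pages without out-links is returned to the seeds.
        sink_mass = sum(score[p] for p, succ in graph.items() if not succ)
        new_score = {p: (alpha + (1.0 - alpha) * sink_mass) * jump[p] for p in graph}
        for page, successors in graph.items():
            for succ in successors:
                new_score[succ] += (1.0 - alpha) * score[page] / len(successors)
        score = new_score
    return score


def reverse(graph):
    """Flip every link so that scores flow from a page to the pages linking to it."""
    rev = {p: [] for p in graph}
    for page, successors in graph.items():
        for succ in successors:
            rev[succ].append(page)
    return rev


# Hypothetical web: two good pages, and a small link farm boosting "target".
web = {
    "good1": ["good2"],
    "good2": ["good1"],
    "farm1": ["target"],
    "farm2": ["target"],
    "target": ["farm1"],
}

trust = biased_pagerank(web, seeds={"good1"})               # trusted seed
distrust = biased_pagerank(reverse(web), seeds={"target"})  # known spam seed
```

In this toy example no trust reaches the link farm, while distrust flows backwards from `target` to the farm pages that link to it and never reaches the good pages.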
Slide 23 Parsijoo
Slide 24 A Real World Case…
Parsijoo Facts
– Crawled pages: 1 × 10^9 per month (rem. 500 × 10^6)
– Crawling rate: 2000 pages/sec
– Cached URLs: 10 x ,000 URL/sec
– Unique hosts: 10 × 10^6 (each host needs one queue)
– Unique URLs: 800 × 10^6
– Unique words: 80 × 10^6
– Unique requests: 200 × 10^3 per day
Slide 25 A Real World Case…
Parsijoo Facts
– Requests (per day)
  – Web: 100 K
  – Image: 35 K
  – News: 10 K
  – Music: 10 K
  – Scholar: 1 K
  – Video: 5 K
  – SADANA, etc.: 35 K
– Unique requests: 200 × 10^3 per day