“In the beginning -- before Google -- a darkness was upon the land.” Joel Achenbach Washington Post.


1 “In the beginning -- before Google -- a darkness was upon the land.” Joel Achenbach Washington Post

2 Papers: 'The PageRank Citation Ranking: Bringing Order to the Web', Page, Brin, Motwani and Winograd. Technical report, Stanford University Database Group, 1998. http://dbpubs.stanford.edu:8090/pub/showDoc.Fulltext?lang=en&doc=1999-66&format=pdf&compression= 'When Experts Agree: Using Non-Affiliated Experts to Rank Popular Topics', Bharat and Mihaila. ACM Transactions on Information Systems (TOIS), 2002. http://delivery.acm.org/10.1145/510000/503107/p47-bharat.pdf?key1=503107&key2=1461826701&coll=portal&dl=ACM&CFID=16490664&CFTOKEN=21998154

3 Comparative Analysis: Agenda. PageRank Citation Ranking Algorithm; Hilltop Ranking Algorithm; Comparison; Conclusion.

4 Need for Improved Search: For popular queries, there are potentially thousands of matching pages. Users only bother with the first 10-20 in the list. Problem: How can the most relevant pages be ranked at the top? Answer: Utilize the link structure of hypertext, with links counting as votes for the page. Relevance is subjective, but link analysis brings objectivity to the ranking process.

5 Before PageRank Algorithm: In 1998, many search engines began to use link analysis … naively. All “backlinks” were created equal. Figure: “Important Page” (rank = 1) and “Joker Page” (rank = 2), with backlinks labeled “Yahoo!”, “Some yahoo”, and “Some other yahoo”.

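6 PageRank Algorithm: The rank of a page is high if the sum of the ranks of its backlinks is high. Simple PageRank: $R(u) = c \sum_{v \in B_u} R(v) / N_v$, where $F_u$ is the set of links from u, $B_u$ is the set of links to u, $N_u = |F_u|$, c is a normalization constant, and $R(u)$ is the rank of u. Rank “sink” problem: a loop of pages that link only to each other accumulates rank but never distributes it. Figure: page A (R = 1) feeds pages B and C, whose ranks keep growing (R = 1, 2, …).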
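The simple PageRank rule, in which each page splits its rank evenly among its forward links, can be sketched as a short iteration. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `simple_pagerank`, the dictionary graph format, and the fixed iteration count are all hypothetical choices, and the constant c is applied by renormalizing the total rank to 1 after each pass.

```python
def simple_pagerank(links, iterations=50):
    """Iterate R(u) = c * sum over backlinks v of R(v) / N_v.

    `links` maps each page u to the list F_u of pages it links to
    (a hypothetical input format chosen for this sketch).
    """
    pages = sorted(set(links) | {v for targets in links.values() for v in targets})
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for u in pages:
            targets = links.get(u, [])
            for v in targets:
                # u passes an equal share of its rank along each forward link
                new_rank[v] += rank[u] / len(targets)
        total = sum(new_rank.values())
        if total == 0:
            return new_rank
        # c is chosen so that the total rank stays equal to 1
        rank = {p: r / total for p, r in new_rank.items()}
    return rank


# A three-page cycle keeps its rank evenly spread.
cycle = simple_pagerank({"A": ["B"], "B": ["C"], "C": ["A"]})

# Rank sink: A feeds a loop between B and C that never links back out,
# so the loop ends up holding all of the rank and A is left with none.
sink = simple_pagerank({"A": ["B"], "B": ["C"], "C": ["B"]})
```

In the second graph the loop between B and C absorbs everything that A contributes, which is the sink behavior the slide describes.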

7 PageRank Con’t: Introduce the notion of the Random Surfer. When the surfer gets bored, he jumps to another page. New equation: $R'(u) = c \sum_{v \in B_u} R'(v) / N_v + c E(u)$, where E(u) is a vector over web pages that the Random Surfer jumps to when he is stuck in a loop. The PageRank equation can be reduced to an eigenvalue problem, which can be solved efficiently.
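The random-surfer idea can be sketched with the widely used damping-factor form of PageRank, which is a swapped-in variant rather than the exact formulation on the slide. Everything named here is an assumption made for illustration: the function name `random_surfer_pagerank`, the damping value 0.85, a uniform jump vector E, and the choice to redistribute the rank of pages with no forward links according to E.

```python
def random_surfer_pagerank(links, damping=0.85, tol=1e-10, max_iter=1000):
    """Power iteration for PageRank with a random jump.

    `links` maps each page to the list of pages it links to.
    The jump vector E is uniform here, which is an assumption.
    """
    pages = sorted(set(links) | {v for targets in links.values() for v in targets})
    jump = {p: 1.0 / len(pages) for p in pages}  # E(u), assumed uniform
    rank = dict(jump)
    for _ in range(max_iter):
        new_rank = {p: 0.0 for p in pages}
        dangling = 0.0
        for u in pages:
            targets = links.get(u, [])
            if targets:
                share = rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
            else:
                # pages without forward links hand their rank to the jump vector
                dangling += rank[u]
        for p in pages:
            followed = new_rank[p] + dangling * jump[p]
            # with probability `damping` follow a link, otherwise jump via E
            new_rank[p] = damping * followed + (1.0 - damping) * jump[p]
        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta < tol:
            break
    return rank


# The loop between B and C no longer absorbs all of the rank:
# page A keeps a small share from random jumps.
ranks = random_surfer_pagerank({"A": ["B"], "B": ["C"], "C": ["B"]})
```

Because the surfer occasionally jumps out of the loop, every page keeps a nonzero rank and the iteration settles on a single fixed point.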

8 PageRank Con’t: In 1998, PageRank ran on 75M URLs in about 5 hours. That time is insignificant compared to building a full-text index. So PageRank is great … but: “Manipulation by Commercial Interests”. Page et al. thought that it wouldn’t be a problem because of the cost of buying 1) a link from an important page or 2) links from many non-important pages. Brings us to Hilltop …

9 Hilltop Algorithm: Original problem: ranking thousands of pages to find the most relevant. New problem: ranking to eliminate “Manipulation by Commercial Interests”. Can’t rank solely based on the content of the page (manipulation by authors) or on backlinks (“link spamming”). Who can be trusted …?

10 Hilltop Algorithm Con’t: … the Experts can be trusted! Algorithm: 1) User query. 2) Compute list of relevant experts. 3) Find most relevant links from relevant experts. 4) Merge links (a page must be linked by at least two experts) and rank. 5) If there are no relevant experts, return nothing. Designed for high precision, not recall. To be used only for popular queries.
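The merge-and-rank step of the outline above can be sketched as follows. This is a simplified illustration, not the scoring used in the Hilltop paper: the function name `hilltop_rank`, the input dictionaries, and the rule of ranking a target by the sum of the relevance scores of the experts linking to it are all assumptions. How experts are matched to the query is left out, so the expert scores are taken as given.

```python
def hilltop_rank(expert_scores, expert_links, min_experts=2):
    """Rank target pages by the experts that link to them.

    expert_scores: expert page -> relevance score for the query (assumed given)
    expert_links:  expert page -> list of target pages it links to
    """
    if not expert_scores:
        # no relevant experts: return nothing rather than a low-quality list
        return []
    votes = {}
    for expert, score in expert_scores.items():
        # set() so an expert counts once per target even if it links twice
        for target in set(expert_links.get(expert, [])):
            votes.setdefault(target, []).append(score)
    ranked = [
        (sum(scores), target)
        for target, scores in votes.items()
        if len(scores) >= min_experts  # must be linked by at least two experts
    ]
    ranked.sort(key=lambda item: (-item[0], item[1]))
    return [target for _, target in ranked]


results = hilltop_rank(
    {"e1": 0.9, "e2": 0.5, "e3": 0.4},
    {"e1": ["a", "b"], "e2": ["a", "c"], "e3": ["b", "a"]},
)
# "c" is dropped because only one expert links to it.
```

Returning an empty list when no expert matches reflects the design goal stated on the slide: precision is favored over recall.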

11 Hilltop Con’t: Who are the Experts? An expert is a page about a certain topic that links to many non-affiliated pages on the same topic. Experts are found as a preprocessing step. In 1999, about 2.5M of 140M indexed pages were considered experts for the Hilltop experiments. Now, the fraction of expert pages is probably much smaller.

12 Hilltop Con’t: Affiliation Defined. Hosts are affiliated if: 1) they share the same first three octets of the IP address, e.g. “205.232.1.1” and “205.232.1.203”; or 2) the rightmost non-generic token in the hostname is the same, e.g. “ibm.com” and “ibm.co.uk”. Experts have k (k = some threshold) or more non-affiliated links.
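The two affiliation rules can be sketched as simple string checks. This is an illustrative sketch only: the helper names and the `GENERIC_TOKENS` set are hypothetical, and the set shown is a small sample, not the list of generic suffixes used in the Hilltop experiments.

```python
# Hypothetical sample of generic hostname tokens; a real list would be longer.
GENERIC_TOKENS = {"com", "co", "org", "net", "edu", "gov", "uk", "www"}


def same_first_three_octets(ip_a, ip_b):
    """True if two dotted-quad IPv4 addresses share their first three octets."""
    return ip_a.split(".")[:3] == ip_b.split(".")[:3]


def rightmost_non_generic_token(hostname):
    """Scan the hostname from right to left and return the first non-generic token."""
    for token in reversed(hostname.lower().split(".")):
        if token and token not in GENERIC_TOKENS:
            return token
    return None


def affiliated(host_a, ip_a, host_b, ip_b):
    """Apply both rules: shared IP prefix or shared rightmost non-generic token."""
    if same_first_three_octets(ip_a, ip_b):
        return True
    token_a = rightmost_non_generic_token(host_a)
    token_b = rightmost_non_generic_token(host_b)
    return token_a is not None and token_a == token_b
```

With these helpers, “ibm.com” and “ibm.co.uk” both reduce to the token “ibm” and are treated as affiliated even when their IP addresses differ.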

13 Hilltop Evaluation (8/1999): Recall in locating specific popular pages. Source: http://www.cs.toronto.edu/~georgem/hilltop/

14 Hilltop Evaluation Con’t: Precision on popular/broad topics. Source: http://www.cs.toronto.edu/~georgem/hilltop/

15 Hilltop versus Google: A Comparison. Google uses citation analysis to compute a PageRank once a boolean query is fulfilled. It is great in general, but not great when “link spammers” or “Commercial Interests” use it to draw hits to non-relevant sites. Hilltop is not great for non-popular queries (no experts), great for popular queries (many non-affiliated experts exist), and great at reducing the effect of spammers.

16 Conclusion: “Before Google -- a darkness was upon the land.” Google is great but is vulnerable to “link spammers”. Hilltop provides a way to disregard “link spammers” for popular queries. Wouldn’t it be nice to marry the Hilltop and PageRank algorithms?

17 Hilltop versus Google Revisited: Google bought the patent to Hilltop in 2003. Hilltop was part of the “Florida Update” deployed by Google on 11/16/2003. For popular queries in Google, Hilltop comprises 40% of the ranking (PR = 40% and “On Page” data is 20%). Bharat now works for Google. Link spammers (“Search Engine Optimizers”) are up in arms.

18 Thank you Questions?

