Published by Peregrine Hoover. Modified over 9 years ago.
1
Web Spam Detection: Link-based and Content-based Techniques
Reporter: 鄭志欣
Advisor: Hsing-Kuo Pao
2010/11/8
2
Outline
– Introduction
– Web Spam: a debatable problem
– Characterizing Spam Pages
– DataSets
– Method
– Combined Classifier
– Conclusion
3
Introduction
Characteristics of Web spam pages [1][2]:
– Inclusion of many unrelated keywords and links.
– Use of many keywords in the URL.
– Redirection of the user to another page.
– Creation of many copies with substantially duplicate content.
– Insertion of hidden text by writing it in the same color as the background of the page.
4
(Figure from [3])
5
Web Spam: a debatable problem
Some definitions:
– All deceptive actions that try to increase the ranking of a page in search engines are generally referred to as Web spam or spamdexing.
– An unjustifiably favorable relevance or importance score for some web page, considering the page's true value. [4]
– Any attempt to deceive a search engine's relevancy algorithm.
Search Engine Optimization (SEO)
6
Characterizing Spam Pages
Content spam
– Inserting a large number of keywords.
– It is shown that 82-86% of spam pages of this type can be detected by an automatic classifier. [5]
Link spam
– A link farm is a densely connected set of pages, created explicitly with the purpose of deceiving a link-based ranking algorithm.
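For the content-spam case above, a minimal sketch of two signals that tend to rise when a page is stuffed with keywords: the share of the text taken by its single most frequent word, and how well the text compresses. The function name `content_features` and the choice of these two features are assumptions made for illustration; the slide does not say which features the classifier in [5] uses.

```python
import zlib
from collections import Counter


def content_features(text: str) -> dict:
    """Two illustrative keyword-stuffing signals for a page's visible text."""
    words = text.lower().split()
    counts = Counter(words)
    # Share of all words taken by the single most frequent word.
    top_word_fraction = counts.most_common(1)[0][1] / len(words) if words else 0.0
    raw = text.encode("utf-8")
    # Highly repetitive text compresses well, so this ratio grows with repetition.
    compression_ratio = len(raw) / len(zlib.compress(raw)) if raw else 0.0
    return {
        "top_word_fraction": top_word_fraction,
        "compression_ratio": compression_ratio,
    }


stuffed = "cheap pills " * 200
ordinary = "The quick brown fox jumps over the lazy dog near the river bank"
print(content_features(stuffed))
print(content_features(ordinary))
```

Features like these would be inputs to a classifier rather than a decision rule on their own; any threshold would have to be learned from labeled data.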
7
Link Farm [6]
"manipulation of the link structure by a group of users with the intent of improving the rating of one or more users in the group".
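Since a link farm is described as a densely connected set of pages, one simple way to quantify "densely connected" is the fraction of possible directed links inside a candidate group that actually exist. This is only a sketch under that assumption; `subgraph_density` is a hypothetical helper, not the dense-subgraph method of [6].

```python
def subgraph_density(edges, nodes):
    """Fraction of possible directed links among `nodes` that are present.

    edges: iterable of (source, target) pairs for the whole graph.
    nodes: the candidate group of pages (for example, a suspected link farm).
    """
    group = set(nodes)
    n = len(group)
    if n < 2:
        return 0.0
    # Keep only distinct links that start and end inside the group.
    internal = {(u, v) for u, v in edges if u in group and v in group and u != v}
    return len(internal) / (n * (n - 1))


edges = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "a"),
         ("b", "c"), ("c", "b"), ("c", "d")]
print(subgraph_density(edges, {"a", "b", "c"}))
print(subgraph_density(edges, {"c", "d"}))
```

A group with density close to 1 links to itself almost exhaustively, which is unusual for organically created pages.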
8
9
High- and low-ranked pages are different
10
DataSet [7]
WEBSPAM-UK2006
– .uk domain: 77.9 million pages, over 3 billion links, 11,400 hosts, May 2006.
http://barcelona.research.yahoo.net/webspam/
11
TrustRank [4]
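Only the title of this slide survives in the transcript. As a rough illustration of the idea behind TrustRank [4], the sketch below runs a PageRank-style iteration in which the random jump returns only to a hand-picked set of trusted seed pages, so trust flows outward along links from the seeds. The function name, the toy graph, and the defaults for `alpha` and `iterations` are assumptions for illustration; seed selection and other details of [4] are not reproduced.

```python
def trustrank(out_links, trusted_seeds, alpha=0.85, iterations=20):
    """Propagate trust from seed pages along out-links (biased PageRank).

    out_links: dict mapping each page to the list of pages it links to.
    trusted_seeds: pages assumed to have been judged non-spam by hand.
    """
    nodes = set(out_links) | {v for targets in out_links.values() for v in targets}
    seeds = set(trusted_seeds) & nodes
    if not seeds:
        raise ValueError("at least one trusted seed must be in the graph")
    # Static distribution: uniform over the seeds, zero everywhere else.
    d = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    trust = dict(d)
    for _ in range(iterations):
        nxt = {n: (1 - alpha) * d[n] for n in nodes}
        for u in nodes:
            targets = out_links.get(u, [])
            if not targets:
                continue  # Pages without out-links pass on no trust in this sketch.
            share = alpha * trust[u] / len(targets)
            for v in targets:
                nxt[v] += share
        trust = nxt
    return trust


graph = {
    "good1": ["good2"],
    "good2": ["good1"],
    "spam1": ["spam2", "good1"],
    "spam2": ["spam1"],
}
print(trustrank(graph, ["good1"]))
```

In the toy graph no trusted page links to the spam pages, so they receive no trust even though one of them links to a trusted page.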
12
Truncated PageRank (1/2) [2]
13
Truncated PageRank (2/2)
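The formulas on these two slides did not survive extraction. The sketch below follows the general idea of Truncated PageRank as described in [2]: PageRank can be written as a sum of contributions over link paths of length t, each weighted by a damping factor, and the truncated variant discards the contributions from paths of length T or less, so that direct and near-direct supporters are ignored. The function name, the finite cutoff `max_t`, and the toy graph are assumptions for illustration.

```python
def truncated_pagerank(out_links, T=2, alpha=0.85, max_t=50):
    """Sum rank contributions arriving over paths longer than T links.

    out_links: dict mapping each page to the list of pages it links to.
    max_t: where this sketch stops the otherwise infinite sum (an assumption).
    """
    nodes = set(out_links) | {v for targets in out_links.values() for v in targets}
    n = len(nodes)
    # Normalization so the damping weights for t > T sum to 1.
    C = (1 - alpha) / alpha ** (T + 1)
    r = {v: C / n for v in nodes}      # contribution vector for paths of length 0
    score = {v: 0.0 for v in nodes}
    for t in range(1, max_t + 1):
        nxt = {v: 0.0 for v in nodes}
        for u in nodes:
            targets = out_links.get(u, [])
            for v in targets:
                nxt[v] += alpha * r[u] / len(targets)
        r = nxt                        # contribution vector for paths of length t
        if t > T:                      # skip the first T levels of supporters
            for v in nodes:
                score[v] += r[v]
    return score


# A page supported only by direct in-links from pages nobody links to.
star = {"s1": ["target"], "s2": ["target"], "s3": ["target"]}
print(truncated_pagerank(star, T=0)["target"])
print(truncated_pagerank(star, T=1)["target"])
```

With T=1 the target's score drops to zero because all of its support arrives over paths of length 1, which is the pattern the truncation is meant to discount.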
14
Estimation of Supporters [2]
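The content of this slide is not in the transcript. Taking a supporter of a page at distance d to mean a page that can reach it by following at most d links, the sketch below counts supporters exactly with a breadth-first search over reversed links. This is an assumption-laden stand-in: [2] estimates these counts with probabilistic counting so that the computation scales to a full web graph, whereas an exact search like this one is only practical on small graphs.

```python
from collections import deque


def supporters_within(out_links, target, max_distance):
    """Return a list where entry d is the number of pages that can reach
    `target` in at most d links (entry 0 is always 0)."""
    # Reverse the graph so the search can walk from the target to its supporters.
    in_links = {}
    for u, targets in out_links.items():
        for v in targets:
            in_links.setdefault(v, set()).add(u)
    seen = {target}
    frontier = deque([(target, 0)])
    new_at = [0] * (max_distance + 1)  # supporters first reached at each distance
    while frontier:
        node, dist = frontier.popleft()
        if dist == max_distance:
            continue
        for pred in in_links.get(node, ()):
            if pred not in seen:
                seen.add(pred)
                new_at[dist + 1] += 1
                frontier.append((pred, dist + 1))
    totals, running = [], 0
    for count in new_at:
        running += count
        totals.append(running)
    return totals


graph = {"a": ["b"], "b": ["c"], "c": ["x"], "d": ["x"]}
print(supporters_within(graph, "x", 3))
```

How the count grows from one distance to the next is the kind of link-based feature that can be handed to a classifier.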
15
Link and Content features
16
Topological dependencies: in-links [6]
17
Topological dependencies: out-links
18
Conclusion
– The current precision and recall of Web spam detection algorithms can be improved using a combination of factors already used by search engines.
– User interaction features (e.g., data collected via a toolbar or by observing clicks on search engine results).
19
References
[1] Luca Becchetti, Carlos Castillo, Debora Donato, Stefano Leonardi, and Ricardo Baeza-Yates. Link-based characterization and detection of Web Spam. In Second International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), Seattle, USA, August 2006. (cited 57 times)
[2] Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006). Using rank propagation and probabilistic counting for link-based spam detection. In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press. (cited 49 times)
[3] Carlos Castillo, Debora Donato, Aristides Gionis, Vanessa Murdock, and Fabrizio Silvestri. Know your neighbors: Web spam detection using the web topology. In Proceedings of the 30th Annual International ACM SIGIR Conference (SIGIR), pages 423–430, Amsterdam, Netherlands, 2007. ACM Press. (cited 90 times)
[4] Gyöngyi, Z., Garcia-Molina, H., and Pedersen, J. (2004). Combating Web spam with TrustRank. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB), pages 576–587, Toronto, Canada. Morgan Kaufmann. (cited 455 times)
[5] Alexandros Ntoulas, Marc Najork, Mark Manasse, and Dennis Fetterly. Detecting spam web pages through content analysis. In Proceedings of the World Wide Web conference, pages 83–92, Edinburgh, Scotland, May 2006. (cited 196 times)
[6] Gibson, D., Kumar, R., and Tomkins, A. (2005). Discovering large dense subgraphs in massive graphs. In VLDB '05: Proceedings of the 31st International Conference on Very Large Data Bases, pages 721–732. VLDB Endowment. (cited 96 times)
[7] http://barcelona.research.yahoo.net/webspam/