Web Spam Detection: link-based and content-based techniques
Reporter: 鄭志欣
Advisor: Hsing-Kuo Pao
2010/11/8

Presentation transcript:


Outline
– Introduction
– Web Spam: a debatable problem
– Characterizing Spam Pages
– DataSets
– Method
– Combined Classifier
– Conclusion

Introduction
Characteristics of Web spam pages [1][2]
– Inclusion of many unrelated keywords and links.
– Use of many keywords in the URL.
– Redirection of the user to another page.
– Creation of many copies with substantially duplicate content.
– Insertion of hidden text by writing in the same color as the background of the page.
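Two of the signals listed above (many keywords in the URL, and text in the same color as the background) can be approximated with simple page-level features. The following is only a minimal sketch: the function names, the regex tokenization, and the restriction to inline CSS declarations are assumptions made for illustration and are not taken from the cited papers.

```python
import re
from urllib.parse import urlparse


def url_keyword_count(url):
    """Count alphabetic tokens in the host and path of a URL (hypothetical feature)."""
    parsed = urlparse(url)
    return len(re.findall(r"[a-z]+", (parsed.netloc + parsed.path).lower()))


def has_hidden_text(style):
    """Return True when an inline CSS style sets identical text and background colors.

    Assumption: only literal 'color' and 'background-color' declarations are compared;
    equivalent colors written differently (e.g. 'white' vs '#fff') are not detected.
    """
    props = {}
    for declaration in style.split(";"):
        if ":" in declaration:
            name, value = declaration.split(":", 1)
            props[name.strip().lower()] = value.strip().lower()
    color = props.get("color")
    return color is not None and color == props.get("background-color")
```

Features of this kind would normally be fed to a classifier together with many others rather than used as a spam verdict on their own.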

[Figure omitted in transcript; source: [3]]

Web Spam: a debatable problem
Some definitions:
– All deceptive actions which try to increase the ranking of a page in search engines are generally referred to as Web spam or spamdexing.
– An unjustifiably favorable relevance or importance score for some web page, considering the page’s true value. [4]
– Any attempt to deceive a search engine’s relevancy algorithm.
Search Engine Optimization (SEO)

Characterizing Spam Pages
Content spam
– Inserting a large number of keywords.
– It is shown that 82-86% of spam pages of this type can be detected by an automatic classifier. [5]
Link spam
– A link farm is a densely connected set of pages, created explicitly with the purpose of deceiving a link-based ranking algorithm.

Link Farm [6]
“manipulation of the link structure by a group of users with the intent of improving the rating of one or more users in the group”.
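One way to make "densely connected" concrete is to measure the edge density of a candidate set of pages. This is a minimal sketch, assuming the link graph is given as a list of directed (source, target) pairs; the function name and the density definition (internal edges divided by the maximum possible number of directed edges) are illustrative choices, not the detection method of [6].

```python
def subgraph_density(edges, nodes):
    """Fraction of possible directed edges that exist among `nodes`.

    edges: iterable of (source, target) pairs.
    nodes: the candidate set of pages (hypothetical link-farm suspects).
    """
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0
    # Keep only distinct links whose endpoints are both inside the set, ignoring self-links.
    internal = {(u, v) for u, v in edges if u in nodes and v in nodes and u != v}
    return len(internal) / (n * (n - 1))
```

A density close to 1 means almost every page in the set links to almost every other page, which is unusual for independently authored pages.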

High and low-ranked pages are different

DataSet [7]
– WEBSPAM-UK2006
– .uk domain: 77.9 million pages, over 3 billion links, 11,400 hosts, May

TrustRank [4]
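The idea behind TrustRank can be sketched as a PageRank-style propagation in which the random jump lands only on a hand-picked set of trusted seed pages, so trust flows outward from the seeds along links. The code below is a minimal sketch under stated assumptions: the graph is a dict mapping each page to a list of pages it links to, the function and parameter names are hypothetical, the damping value 0.85 is a conventional default, and trust reaching pages without out-links is simply dropped. It is not the implementation from [4].

```python
def trustrank(out_links, trusted_seeds, damping=0.85, iterations=50):
    """Propagate trust from seed pages along out-links (biased PageRank sketch)."""
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    seeds = set(trusted_seeds) & nodes
    if not seeds:
        raise ValueError("at least one trusted seed must be in the graph")

    # The jump distribution is uniform over the seeds and zero elsewhere.
    seed_mass = {node: (1.0 / len(seeds) if node in seeds else 0.0) for node in nodes}
    trust = dict(seed_mass)

    for _ in range(iterations):
        updated = {node: (1.0 - damping) * seed_mass[node] for node in nodes}
        for node in nodes:
            targets = out_links.get(node, [])
            if targets:
                # Each page splits its damped trust evenly among its out-links.
                share = damping * trust[node] / len(targets)
                for target in targets:
                    updated[target] += share
        trust = updated
    return trust
```

In this sketch, pages that no trusted page can reach through links keep a trust score of exactly zero, which is the property that makes low trust a spam signal.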

Truncated PageRank (1/2) [2]

Truncated PageRank (2/2)
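PageRank can be viewed as a sum of contributions arriving over link paths of length 0, 1, 2, and so on, each damped by a factor that shrinks with path length. Truncated PageRank, as described in [2], ignores the contributions of the shortest paths, so a page supported only by its immediate neighborhood scores low. The code below is a rough sketch of that idea, not the authors' code: the function and parameter names are hypothetical, the normalization constant is assumed to be (1 - damping), and pages without out-links pass nothing on.

```python
def truncated_pagerank(out_links, truncation=2, damping=0.85, iterations=50):
    """Sum damped path contributions, skipping path lengths <= `truncation`.

    With truncation=-1 every path length is counted, which gives an
    ordinary PageRank-like score (without dangling-node redistribution).
    """
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    n = len(nodes)

    # level[node] holds the contribution arriving over paths of the current length t.
    level = {node: (1.0 - damping) / n for node in nodes}
    score = {node: 0.0 for node in nodes}

    for t in range(iterations + 1):
        if t > truncation:
            for node in nodes:
                score[node] += level[node]
        following = {node: 0.0 for node in nodes}
        for node in nodes:
            targets = out_links.get(node, [])
            for target in targets:
                following[target] += damping * level[node] / len(targets)
        level = following
    return score
```

For a page whose only in-links come from pages that themselves have no in-links, the truncated score with truncation=1 is zero while the untruncated score is positive.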

Estimation of Supporters [2]
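The quantity being estimated can be illustrated with an exact count on a small graph. Here a supporter of a page at distance d is taken to mean a page whose shortest link path to that page has length d; this reading, the function name, and the dict-of-lists graph format are assumptions of the sketch. The cited work [2] estimates such counts with probabilistic counting because an exact breadth-first search per page is too expensive on a Web-scale graph; that estimator is not reproduced here.

```python
from collections import deque


def supporters_by_distance(out_links, target, max_distance):
    """Exact number of supporters of `target` at distances 1..max_distance."""
    # Reverse the graph so we can walk from the target back to the pages linking to it.
    in_links = {}
    for source, targets in out_links.items():
        for destination in targets:
            in_links.setdefault(destination, set()).add(source)

    counts = [0] * (max_distance + 1)
    seen = {target}
    queue = deque([(target, 0)])
    while queue:
        node, distance = queue.popleft()
        if distance == max_distance:
            continue
        for predecessor in in_links.get(node, ()):
            if predecessor not in seen:
                seen.add(predecessor)
                counts[distance + 1] += 1
                queue.append((predecessor, distance + 1))
    return counts[1:]
```

Comparing how these counts grow with distance is one way to contrast a page with broad support against one whose support stops after a few hops.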

Link and Content features

Topological dependencies: in-links [6]

Topological dependencies: out-links

Conclusion
– The current precision and recall of Web spam detection algorithms can be improved using a combination of factors already used by search engines.
– User interaction features (e.g. data collected via a toolbar or by observing clicks in search engine results).

References
[1] Luca Becchetti, Carlos Castillo, Debora Donato, Stefano Leonardi, and Ricardo Baeza-Yates. Link-based characterization and detection of Web Spam. In Second International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), Seattle, USA, August 2006.
[2] Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006). Using rank propagation and probabilistic counting for link-based spam detection. In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press.
[3] Carlos Castillo, Debora Donato, Aristides Gionis, Vanessa Murdock, and Fabrizio Silvestri. Know your neighbors: Web spam detection using the web topology. In Proceedings of the 30th Annual International ACM SIGIR Conference (SIGIR), pages 423–430, Amsterdam, Netherlands. ACM Press.
[4] Gyöngyi, Z., Garcia-Molina, H., and Pedersen, J. (2004). Combating Web spam with TrustRank. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB), pages 576–587, Toronto, Canada. Morgan Kaufmann.
[5] Alexandros Ntoulas, Marc Najork, Mark Manasse, and Dennis Fetterly. Detecting spam web pages through content analysis. In Proceedings of the World Wide Web conference, pages 83–92, Edinburgh, Scotland, May 2006.
[6] Gibson, D., Kumar, R., and Tomkins, A. (2005). Discovering large dense subgraphs in massive graphs. In VLDB ’05: Proceedings of the 31st international conference on Very large data bases, pages 721–732. VLDB Endowment.
[7]