Published by Shon Brown. Modified over 9 years ago.
1
Web spamming
Detecting Spam Web Pages through Content Analysis
Alexandros Ntoulas et al., 2006, International World Wide Web Conference
2
Link stuffing: for link-based ranking, black hat SEO techniques include the creation of extraneous pages which link to a target page.
Keyword stuffing: the content of other pages may be “engineered” so as to appear relevant to popular searches.
3
Figure 1: An example spam page; although it contains popular keywords, the overall content is useless to a human user
4
Web spam The practices of crafting web pages for the sole purpose of increasing the ranking of these or some affiliated pages, without improving the utility to the viewer, are called “web spam”.
5
Why do people engage in web spamming?
–First, to get search engines to rank spam sites highly, drawing web searchers to those sites for financial gain.
–Second, to get the search engine to expose spam sites so that users stop trusting its quality; in other words, an attack on the search engine.
–Finally, spam pages make a search engine waste storage space, time, and network resources.
–1/7 of English-language pages
6
Importance of detecting web spam
Creating an effective spam detection method is a challenging problem.
–Given the size of the web, such a method has to be automated.
–However, while detecting spam, we have to ensure that we identify spam pages alone, and that we do not mistakenly consider legitimate pages to be spam.
–At the same time, it is most useful if we can detect that a page is spam as early as possible, and certainly prior to query processing. In this way, we can allocate our crawling, processing, and indexing efforts to non-spam pages, thus making more efficient use of our resources.
7
Web spamming techniques
8
Web Spam Taxonomy
By Zoltán Gyöngyi and Hector Garcia-Molina, Stanford University. First International Workshop on Adversarial Information Retrieval on the Web, May 2005
9
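Term Spamming
p: page, q: query words
TF(t) = the number of occurrences of term t in the document
IDF(t) = a weight inversely related to the number of documents that contain term t
Term spamming targets search engines whose ranking algorithms are based on the TFIDF score.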
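To make the effect of term spamming on a TFIDF-based ranker concrete, here is a minimal Python sketch. It assumes the usual form of the score, a sum of TF(t) * IDF(t) over the terms t that appear in both page p and query q, and it assumes IDF(t) = log(N / DF(t)), which is only one common choice. The function name, the toy pages, and the document-frequency numbers are all hypothetical.

```python
import math
from collections import Counter


def tfidf_score(page_terms, query_terms, doc_freq, num_docs):
    """Sum TF(t) * IDF(t) over terms that occur in both the page and the query.

    Assumption: IDF(t) = log(num_docs / doc_freq[t]); real engines differ.
    """
    tf = Counter(page_terms)
    score = 0.0
    for t in set(query_terms):
        df = doc_freq.get(t, 0)
        if tf[t] == 0 or df == 0:
            continue  # term missing from the page or from the collection stats
        score += tf[t] * math.log(num_docs / df)
    return score


# Hypothetical collection statistics: 1000 documents in total.
doc_freq = {"cheap": 100, "camera": 10}
query = ["cheap", "camera"]

honest_page = "this camera review compares lens quality".split()
# Repetition: the spammer appends the query terms twenty times.
spam_page = honest_page + ["cheap", "camera"] * 20

honest_score = tfidf_score(honest_page, query, doc_freq, 1000)
spam_score = tfidf_score(spam_page, query, doc_freq, 1000)
```

Because TF grows with every repeated term, the spam page scores far above the honest page for the same query, which is exactly what the repetition technique exploits.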
10
Term Spamming
Body/title/meta tag/Anchor text
–Meta tag: <meta name="keywords" content="buy, cheap, cameras, lens, accessories, nikon, canon">
–Anchor text: free, great deals, cheap, inexpensive, cheap, free
URL spam
–buy-canon-rebel-20d-lens-case.camerasx.com,
–buy-nikon-d100-d70-lens-case.camerasx.com,
11
How term spamming is done
–Repetition of one or a few specific terms
–Dumping of a large number of unrelated terms
–Weaving of spam terms into copied contents
–Phrase stitching is also used by spammers to create content quickly
12
Link Spamming
A technique that exploits the characteristics of the PageRank algorithm by manipulating outgoing links and incoming links.
13
Outgoing links
–A spammer might manually add a number of outgoing links to well-known pages, hoping to increase the page's hub score.
–At the same time, the most widespread method for creating a massive number of outgoing links is directory cloning: one can find on the World Wide Web a number of directory sites, some larger and better known (e.g., the DMOZ Open Directory, dmoz.org, or the Yahoo! directory, dir.yahoo.com).
14
Incoming links
–Create a honey pot, a set of pages that provide some useful resource (e.g., copies of some Unix documentation pages), but that also have (hidden) links to the target spam page(s).
–Post links on blogs, unmoderated message boards, guest books, or wikis. Spammers may include URLs to their spam pages as part of the seemingly innocent comments/messages they post.
15
Hiding Techniques-Content Hiding
16
Hiding Techniques-Cloaking
If spammers can clearly identify web crawler clients, they can adopt the following strategy, called cloaking: given a URL, spam web servers return one specific HTML document to a regular web browser, while they return a different document to a web crawler. This way, spammers can present the ultimately intended content to the web users (without traces of spam on the page), and, at the same time, send a spammed document to the search engine for indexing.
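The cloaking strategy can be illustrated with a small Python sketch: a simulated server that picks a document based on the client's user agent, plus one plausible check that fetches the same URL under two identities and compares the results. Nothing here comes from the slides: `fake_cloaking_server`, `looks_cloaked`, the user-agent strings, and the word-overlap threshold are hypothetical, and a real check would also have to tolerate legitimate per-client differences.

```python
def fake_cloaking_server(url, user_agent):
    """Hypothetical spam server: crawlers get the keyword-stuffed document."""
    if "bot" in user_agent.lower():
        return "cheap cameras buy cheap lens cheap canon nikon deals"
    return "Welcome to our online pharmacy store"


def looks_cloaked(url, fetch, threshold=0.5):
    """Fetch `url` as a browser and as a crawler and compare the word sets.

    `fetch(url, user_agent)` is an assumed callable returning the page text.
    Returns True when the Jaccard similarity of the two word sets falls
    below `threshold` (an arbitrary illustrative cut-off).
    """
    browser_words = set(fetch(url, "Mozilla/5.0").lower().split())
    crawler_words = set(fetch(url, "ExampleBot/1.0").lower().split())
    union = browser_words | crawler_words
    if not union:
        return False  # two empty documents are trivially identical
    similarity = len(browser_words & crawler_words) / len(union)
    return similarity < threshold


def honest_server(url, user_agent):
    """Hypothetical honest server: every client gets the same document."""
    return "camera reviews and lens comparisons"


cloaked = looks_cloaked("http://example.com/", fake_cloaking_server)
not_cloaked = looks_cloaked("http://example.com/", honest_server)
```

Passing `fetch` as a parameter keeps the sketch free of network access, so the comparison logic can be exercised with simulated servers.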
17
Hiding Techniques-Redirection
18
Spam occurrence per top-level domain
Data set: 105,484,446 web pages, collected by the MSN Search crawler during August 2004.
19
Spam occurrence per language in our data set.
20
Prevalence of spam - number of words on page
21
Prevalence of spam - number of words in title
22
Prevalence of spam - average word-length of page
23
Prevalence of spam - visible content on page
24
Prevalence of spam - compressibility of page
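Several of the content features named on these slides are cheap to compute from the page text. The Python sketch below computes three of them (number of words, average word length, and compressibility), assuming compressibility is measured as uncompressed size divided by compressed size. It uses `zlib` as a stand-in compressor and whitespace splitting as a stand-in tokenizer; both are simplifying assumptions, and `content_features` is a hypothetical name.

```python
import zlib


def content_features(text):
    """Return simple content-based features for one page's text.

    Assumptions: words are whitespace-separated tokens, and the
    compression ratio is len(raw bytes) / len(zlib-compressed bytes).
    """
    words = text.split()
    raw = text.encode("utf-8")
    compressed = zlib.compress(raw)
    return {
        "num_words": len(words),
        "avg_word_length": (sum(len(w) for w in words) / len(words)) if words else 0.0,
        "compression_ratio": (len(raw) / len(compressed)) if raw else 0.0,
    }


# A page that repeats the same keywords compresses far better than ordinary prose.
repetitive = content_features("cheap camera " * 200)
ordinary = content_features(
    "The review compares autofocus speed, battery life and image noise "
    "across three mid-range bodies released this year."
)
```

Highly repetitive pages yield a much larger compression ratio than ordinary prose, which is why compressibility can serve as a signal for keyword repetition.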
25
Classification model to detect spam
26
Given the training set DS, we generate N training sets by sampling n random items with replacement. For each of the N training sets, we now create a classifier, thus obtaining N classifiers. In order to classify a page, we have each of the N classifiers provide a class prediction, which is considered as a vote for that particular class. The eventual class of the page is the class with the majority of the votes.
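The bagging procedure described on this slide can be sketched in a few lines of Python. The base learner here is a one-feature threshold rule (a decision stump), chosen only to keep the example short; it is an assumption and not the classifier used in the paper. The feature values and labels are hypothetical toy data (label 1 = spam, 0 = non-spam).

```python
import random
from collections import Counter


def train_stump(samples):
    """Fit a threshold rule on (x, label) pairs; returns (threshold, label_if_above)."""
    best = None
    for thr, _ in samples:
        for above in (0, 1):
            errors = sum(
                1 for x, y in samples
                if (above if x >= thr else 1 - above) != y
            )
            if best is None or errors < best[0]:
                best = (errors, thr, above)
    return best[1], best[2]


def stump_predict(stump, x):
    thr, above = stump
    return above if x >= thr else 1 - above


def bagging_train(data, num_classifiers, sample_size, seed=0):
    """Build N classifiers, each on n items sampled with replacement from data."""
    rng = random.Random(seed)
    classifiers = []
    for _ in range(num_classifiers):
        bootstrap = [rng.choice(data) for _ in range(sample_size)]
        classifiers.append(train_stump(bootstrap))
    return classifiers


def bagging_predict(classifiers, x):
    """Each classifier casts one vote; the majority class wins."""
    votes = Counter(stump_predict(c, x) for c in classifiers)
    return votes.most_common(1)[0][0]


# Toy data: (feature value, label). Higher values are spam in this example.
data = [(1.5, 0), (2.0, 0), (2.2, 0), (2.5, 0),
        (5.0, 1), (6.0, 1), (7.5, 1), (9.0, 1)]
ensemble = bagging_train(data, num_classifiers=11, sample_size=len(data))
```

An odd number of classifiers avoids tied votes in the two-class case. Calling `bagging_predict(ensemble, x)` then classifies a new feature value `x`.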
27
Bagging & Boosting
                   Predicted spam   Predicted non-spam
Actual spam              A                  B
Actual non-spam          C                  D
28
Challenges in Web Information Retrieval Mehran Sahami Vibhu Mittal Shumeet Baluja Henry Rowley Google Inc.
29
Information Retrieval on the Web
Goal: identify which pages are of high quality and relevance to a user’s query.
–PageRank, HITS
Two Challenges
–Adversarial classification: detecting Web spamming
–Evaluating Search results
30
PageRank
Assume four web pages: A, B, C and D.
The initial values of PageRank
–PR(A) = PR(B) = PR(C) = PR(D) = 0.25.
PageRank for any page u: PR(u) = sum over all v in Bu of PR(v)/Nv
–Bu = {v | v links to page u}
–Nv = the number of links from page v.
31
PR(A) = PR(C)/1
PR(B) = PR(A)/2
PR(C) = PR(A)/2 + PR(B)/1 + PR(D)/1
PR(D) = 0
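The four equations on this slide imply a particular link graph: A links to B and C, B links to C, C links to A, and D links to C. The Python sketch below repeats the update on that graph, starting from 0.25 for every page. It deliberately uses the simplified rule without a damping factor and assumes every page has at least one outgoing link; `pagerank_simple` is a hypothetical name.

```python
def pagerank_simple(out_links, iterations=100):
    """Repeat PR(u) = sum of PR(v)/Nv over pages v linking to u.

    Assumptions: no damping factor, and every page has at least one
    outgoing link (true for the four-page example below).
    """
    pages = list(out_links)
    pr = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_pr = {p: 0.0 for p in pages}
        for v, targets in out_links.items():
            share = pr[v] / len(targets)  # PR(v) / Nv
            for u in targets:
                new_pr[u] += share
        pr = new_pr
    return pr


# Link graph read off the equations on the slide.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

first_step = pagerank_simple(links, iterations=1)
ranks = pagerank_simple(links)
```

After one update the values are PR(A) = 0.25, PR(B) = 0.125, PR(C) = 0.625 and PR(D) = 0. With further iterations they settle near PR(A) = PR(C) = 0.4, PR(B) = 0.2 and PR(D) = 0, which satisfies all four equations.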
32
Determining the relatedness of fragments of text
e.g.:
–“Captain Kirk” & “Star Trek” are more similar than
–“Captain Kirk” & “Fried Chicken”.
How to measure the closeness between two phrases.
K(x,y) =
34
Retrieval of UseNet Articles at least 800 million documents
35
Retrieval of Images and Sounds non-textual “documents” –from digital still and video cameras, camera phones, audio recording devices, and mp3 music.