What is Web Spam?
Many slides are from a lecture by Marc Najork: “Detecting Spam Web Pages”


1 What is Web Spam?
Many slides are from a lecture by Marc Najork: “Detecting Spam Web Pages”

2 What do Web Spammers do?
Web spammers target the last step.
[Diagram labels: THE WEB; Inverted Index; Search Engine Servers; Query; Document IDs; Index the documents; Get indices for relevant documents; Retrieve full text of relevant documents; Display results on a web page; Rank Result]

3 Web spam (you know it when you see it)

4 Defining Web Spam
A spam web page is…
- A page created for the sole purpose of attracting search engine referrals (to this page or some other “target” page)
- Ultimately a judgment call
- Some web pages are borderline useless
- Some pages look fine in isolation, but in context are clearly “spam”

5 Spamming Techniques
Boosting rank:
- Term spamming
- Link spamming
Hiding spam:
- Content hiding
- Cloaking
- Redirecting

6 Boosting Rank by Term Spamming
- Editing the textual content
- The search engine looks for relevant terms in various fields
- Different fields are weighted differently
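To make the idea of differently weighted fields concrete, here is a minimal sketch of a field-weighted term count. The field names, the weights, and the two sample pages are all hypothetical; real search engines do not publish their weights and use far richer scoring.

```python
from collections import Counter

# Hypothetical field weights, chosen only for illustration.
FIELD_WEIGHTS = {"title": 3.0, "anchor": 2.0, "body": 1.0}

def field_score(page, query):
    """Sum the weighted counts of query terms across a page's fields."""
    terms = query.lower().split()
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        counts = Counter(page.get(field, "").lower().split())
        score += weight * sum(counts[t] for t in terms)
    return score

# An ordinary page and a term-spammed page (both made up).
honest = {
    "title": "Cheap flights to Rome",
    "body": "We compare fares for flights to Rome.",
}
stuffed = {
    "title": "cheap flights cheap flights cheap flights",
    "body": "cheap flights " * 20,
}

honest_score = field_score(honest, "cheap flights")
stuffed_score = field_score(stuffed, "cheap flights")
```

Under this naive scheme the stuffed page outscores the honest one simply by repeating the query terms in a heavily weighted field, which is the incentive behind term spamming.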

7 Term Spam: Keyword stuffing
- Search engines return pages that contain query terms (Certain caveats and provisos apply …)
- One way to get more SE referrals: create pages containing popular query terms (“keyword stuffing”)
- Three variants:
  - Hand-crafted pages
  - Completely synthetic pages
  - Assembling pages from “repurposed” content
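One way to see why keyword-stuffed pages can stand out: heavily repeated text compresses much better than ordinary prose. The sketch below uses the zlib compression ratio as a rough signal. This heuristic and both sample strings are illustrative additions, not something stated on the slides, and a real detector would combine many such features.

```python
import zlib

def compression_ratio(text):
    """Return uncompressed size divided by zlib-compressed size."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))

# Made-up samples: ordinary prose versus a keyword-stuffed page body.
normal_text = (
    "Our family visited Rome last spring and spent a week walking "
    "between museums, small cafes, and quiet neighbourhood markets."
)
stuffed_text = "cheap flights " * 200

normal_ratio = compression_ratio(normal_text)
stuffed_ratio = compression_ratio(stuffed_text)
```

A page whose ratio is far above that of typical prose is only a candidate for closer inspection; legitimate pages such as long tables can also be highly repetitive.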

8 Synthetic content for keyword stuffing
- Monetization
- Random words
- Well-formed sentences stitched together
- Links to keep crawlers going

9 More examples of synthetic content
- Someone’s wedding site!

10 Really good synthetic content
- Links to keep crawlers going
- Grammatically well-formed but meaningless sentences
- “Nigritude Ultramarine”: an SEO competition

11 Spamming Techniques
Boosting rank:
- Term spamming
- Link spamming
Hiding spam:
- Content hiding
- Cloaking
- Redirecting

12 Boosting Rank by Link Spamming
- Link structure ⇒ importance
- Outgoing links
- Incoming links
- Use directories
- Link exchange and spam farms

13 Link Spam
Inflating the rank of a page by creating nepotistic links to it
- From own sites: link farms
- From partner sites: link exchanges
- From unaffiliated sites (e.g. blogs, guest books, web forums)
The more links, the better
- Generate links automatically
- Use scripts to post to blogs
- Synthesize entire web sites
- Synthesize many web sites (DNS spam)
The more important the linking page, the better
- Buy expired highly-ranked domains
- Post links to high-quality blogs
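The effect of "the more links, the better" can be sketched with a toy link-based ranking. The code below is a plain power-iteration PageRank over a made-up graph; the slides do not name a specific ranking algorithm, so PageRank, the damping value, and the page names are assumptions used only for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a dict {page: [outlinks]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # Split this page's rank evenly among its outlinks.
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for t in pages:
                    new[t] += damping * rank[p] / n
        rank = new
    return rank

# A small web: three pages in a cycle, plus a "target" nobody links to.
honest_web = {"a": ["b"], "b": ["c"], "c": ["a"], "target": ["a"]}

# The same web plus a hypothetical link farm of 20 pages pointing at "target".
farm = {f"farm{i}": ["target"] for i in range(20)}
spammed_web = {**honest_web, **farm}

before = pagerank(honest_web)["target"]
after = pagerank(spammed_web)["target"]
```

In this toy graph the target's score rises once the farm exists, even though the farm pages themselves have no incoming links and the total score is shared among many more pages.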

14 Inflate rank: Link farms, link exchanges

15 Inflate rank: Expired domains

16 Inflate rank: Web forum and blog spam

17 Spamming Techniques
Boosting rank:
- Term spamming
- Link spamming
Hiding spam:
- Content hiding
- Cloaking
- Redirecting

18 Hiding Spam
- Invisible content
- Cloaking: serve a different page to a crawler than to a browser
Techniques:
- Recognize that a page request is from a search engine (based on “user-agent” info or on IP address)
- Make some text invisible (e.g. black on black)
- Use CSS to hide text
- Use JavaScript to rewrite the page (dynamically created)
- Use “meta-refresh” to redirect the user to another page
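User-agent-based cloaking suggests a simple check: request the same URL once as a crawler and once as a browser, then compare what comes back. The sketch below assumes a caller-supplied `fetch(url, user_agent)` function (hypothetical, so no network access is needed here), illustrative user-agent strings, and an arbitrary similarity threshold. It would not catch cloaking keyed on IP address.

```python
# Illustrative user-agent strings; real crawlers and browsers send longer ones.
CRAWLER_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64)"

def jaccard(a, b):
    """Jaccard similarity of the word sets of two texts."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)

def looks_cloaked(fetch, url, threshold=0.5):
    """Flag a URL whose crawler view and browser view differ strongly."""
    crawler_view = fetch(url, CRAWLER_UA)
    browser_view = fetch(url, BROWSER_UA)
    return jaccard(crawler_view, browser_view) < threshold

# Two stand-in sites for demonstration (both hypothetical).
def fake_cloaking_site(url, user_agent):
    if "Googlebot" in user_agent:
        return "cheap flights hotel deals cheap flights best prices"
    return "click here to win a prize"

def fake_honest_site(url, user_agent):
    return "welcome to our travel blog about rome"

cloaked = looks_cloaked(fake_cloaking_site, "http://example.test/")
honest = looks_cloaked(fake_honest_site, "http://example.test/")
```

Legitimate pages also vary between requests (ads, timestamps, personalization), so a difference between the two views is a hint rather than proof of cloaking.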

19 Why should we care about Web spam?
- We depend on search engines and trust them
- Web spam undermines the reputation of a trusted information source

