Detecting Spam Web Pages Marc Najork Microsoft Research, Silicon Valley.

Slides:



Advertisements
Similar presentations
Improvements and extras Paul Thomas CSIRO. Overview of the lectures 1.Introduction to information retrieval (IR) 2.Ranked retrieval 3.Probabilistic retrieval.
Advertisements

Crawling, Ranking and Indexing. Organizing the Web The Web is big. Really big. –Over 3 billion pages, just in the indexable Web The Web is dynamic Problems:
What is WEB SPAM Many slides from a lecture by Marc Najork, Microsoft: “Detecting Spam Web Pages”
Search Engines and Information Retrieval
CS345 Data Mining Web Spam Detection. Economic considerations  Search has become the default gateway to the web  Very high premium to appear on the.
Presented by Li-Tal Mashiach Learning to Rank: A Machine Learning Approach to Static Ranking Algorithms for Large Data Sets Student Symposium.
1 CS/INFO 430 Information Retrieval Lecture 18 Web Search 4.
Search Engines: Technology, Society, and Business Prof. Marti Hearst Oct 22, 2007.
Search Engine Optimization Strategies The Basics of SEO Offered to you by Mary Anne Donovan Vice President, SEO Literacy Consultants.
Overview of Search Engines
WEB SCIENCE: SEARCHING THE WEB. Basic Terms Search engine Software that finds information on the Internet or World Wide Web Web crawler An automated program.
CIS 1310 – HTML & CSS 13 Web Promotion. CIS 1310 – HTML & CSS Learning Outcomes  Identify Commonly Used Search Engines  Describe Components of a Search.
Search Engine Optimization. What is SEO? Search engine optimization (SEO) is the process of improving the visibility of a website or a web page in search.
WEB SPAM A By-Product Of The Search Engine Era Web Enhanced Information Management Aniruddha Dutta Department of Computer Science Columbia University.
IDK0040 Võrgurakendused I Building a site: Publicising Deniss Kumlander.
Todd Friesen April, 2007 SEO Workshop Web 2.0 Expo San Francisco.
1 SOCIAL BOOKMARKING 101. HIBA KHALID BILAL SAEED KHAN FARID ALIANI ASKARI HASAN SOCIAL BOOKMARKING.
SEO PLAN Presented By Mangesh Dolse. Lead Management Tool( Sample)
Search Engine Optimization
Web Spam Detection: link-based and content-based techniques Reporter : 鄭志欣 Advisor : Hsing-Kuo Pao 2010/11/8 1.
Search Engine Optimization. Introduction SEO is a technique used to optimize a web site for search engines like Google, Yahoo, etc. It improves the volume.
1 A Quantitative Study of Forum Spamming Using Context-Based Analysis Yi-Min Wang^ Ming Ma^ Yuan Niu* Hao Chen* Francis Hsu* *UC Davis, ^Microsoft Research.
Search Engines and Information Retrieval Chapter 1.
Adversarial Information Retrieval on the Web or How I spammed Google and lost Dr. Frank McCown Search Engine Development – COMP 475 Mar. 24, 2009.
© 2006 Stephan M Spencer Netconcepts Search Engine Marketing by Stephan Spencer President, Netconcepts.
Web Spam Detection with Anti- Trust Rank Vijay Krishnan Rashmi Raj Computer Science Department Stanford University.
Know your Neighbors: Web Spam Detection Using the Web Topology Presented By, SOUMO GORAI Carlos Castillo(1), Debora Donato(1), Aristides Gionis(1), Vanessa.
Improving Web Spam Classification using Rank-time Features September 25, 2008 TaeSeob,Yun KAIST DATABASE & MULTIMEDIA LAB.
Detecting Semantic Cloaking on the Web Baoning Wu and Brian D. Davison Lehigh University, USA WWW 2006.
WHAT IS A SEARCH ENGINE A search engine is not a physical engine, instead its an electronic code or a software programme that searches and indexes millions.
Heuristics for Detecting Spam Web Pages Marc Najork Microsoft Research, Silicon Valley Joint work with Fetterly, Manasse, Ntoulas.
When Experts Agree: Using Non-Affiliated Experts To Rank Popular Topics Meital Aizen.
Search - on the Web and Locally Related directly to Web Search Engines: Part 1 and Part 2. IEEE Computer. June & August 2006.
The Road to Online Marketing. A Magic Voyage Begins!!!
윤언근 DataMining lab.  The Web has grown exponentially in size but this growth has not been isolated to good-quality pages.  spamming and.
1 Search Engine Optimization An introduction to optimizing your web site for best possible search engine results.
Gregor Gisler-Merz How to hit in google The anatomy of a modern web search engine.
The Internet 8th Edition Tutorial 4 Searching the Web.
Search Engine Optimization 101 What is SEM? SEO? How can I use SEO on my blogs and/or my personal web space?
Improving Cloaking Detection Using Search Query Popularity and Monetizability Kumar Chellapilla and David M Chickering Live Labs, Microsoft.
Amy Dai Machine learning techniques for detecting topics in research papers.
Math Information Retrieval Zhao Jin. Zhao Jin. Math Information Retrieval Examples: –Looking for formulas –Collect teaching resources –Keeping updated.
Know your Neighbors: Web Spam Detection using the Web Topology By Carlos Castillo, Debora Donato, Aristides Gionis, Vanessa Murdock and Fabrizio Silvestri.
What Is SEO? Search engine optimization (SEO) is the art and science of publishing and marketing information that ranks well for valuable keywords in.
Carlos Castillo, Debora Donato, Aristides Gionis, Vanessa Murdock,
By Gianluca Stringhini, Christopher Kruegel and Giovanni Vigna Presented By Awrad Mohammed Ali 1.
1 Proposal Presentation On Search Engine Optimization.
Search Engines By: Faruq Hasan.
Search Engine Optimization Information Systems 337 Prof. Harry Plantinga.
Search Engines, SEO and Web Search By Alessandro Ballarin.
The effects of Web Spam on The Evolution of Search Engines CS315-Web Search and Mining.
KAIST TS & IS Lab. CS710 Know your Neighbors: Web Spam Detection using the Web Topology SIGIR 2007, Carlos Castillo et al., Yahoo! 이 승 민.
What is WEB SPAM Many slides are from a lecture by Marc Najork: “Detecting Spam Web Pages”
Search Engine Optimization Presented By:- ARKA Softwares Effective! Affordable! Time Groove
John Paul Aguiar | BrainyMarketer.com WE HAVE THE MARKETING HELP YOU NEED!
A large-scale study of the evolution of Web pages D. Fetterly, M. Manasse, M. Najork and L. Wiener SPE Vol.34 No.2 pages , Feb Apr
Lecture-6 Bscshelp.com. Todays Lecture  Which Kinds of Applications Are Targeted?  Business intelligence  Search engines.
Opinion spam and Analysis 소프트웨어공학 연구실 G 최효린 1 / 35.
SEO Tactics Search Engines Optimization is the best process which helps to improve your business in search engine mediums and social mediums such as Facebook,
Web Spam Taxonomy Zoltán Gyöngyi, Hector Garcia-Molina Stanford Digital Library Technologies Project, 2004 presented by Lorenzo Marcon 1/25.
Search Engine Optimization
Web Marketing Relationship Management – Existing Customers
PRESENTATION ON SEO.
CCT356: Online Advertising and Marketing
WEB SPAM.
PRESENTATION ON SEO.
SEARCH ENGINE OPTIMIZATION SEO. What is SEO? It is the process of optimizing structure, design and content of your website in order to increase traffic.
Detecting Spam Web Pages through Content Analysis
Web Spam
Blog SEO Tips: How to Write SEO Friendly Blog Posts
Presentation transcript:

Detecting Spam Web Pages Marc Najork Microsoft Research, Silicon Valley

About me : UIUC (home of NCSA Mosaic) : Digital Equipment/Compaq Started working on web search in 1997 Mercator web crawler (used by AltaVista) 2001-now: Microsoft Research Measuring web evolution Link-based ranking (algorithms and infrastructure) Web spam detection

About MSR Silicon Valley One of six MSR labs (founded in 2001) Located in Mountain View About 30 full-time researchers Areas Algorithms & Theory Distributed Systems Security & Privacy Software Tools Web Search & Data Mining We’re growing & hiring!

There’s gold in those hills E-Commerce is big business Total US e-Commerce sales in 2004: $69.2 billion (1.9% of total US sales) (US Census Bureau) Grow rate: 7.8% per year (well ahead of GDP growth) Forrester Research predicts that online US B2C sales (incl. auctions & travel) will grow to $329 by 2010 (13% of all US retail sales)

Search engines direct traffic Significant amount of traffic results from Search Engine (SE) referrals E.g. Jacob Nielsen’s site “HyperTextNow” receives one third of its traffic through SE referrals Only sites that are highly placed in SE results (for some queries) benefit from SE referrals

Ways to increase SE referrals Buy keyword-based advertisements Improve the ranking of your pages Provide genuinely better content, or “Game” the system “Search Engine Optimization” is a thriving business Some SEOs are ethical Some are not …

Web spam (you know it when you see it)

Defining web spam Working Definition Spam web page: A page created for the sole purpose of attracting search engine referrals (to this page or some other “target” page) Ultimately a judgment call Some web pages are borderline useless Some pages look fine in isolation, but in context are clearly “spam”

Why web spam is bad Bad for users Makes it harder to satisfy information need Leads to frustrating search experience Bad for search engines Burns crawling bandwidth Pollutes corpus (infinite number of spam pages!) Distorts ranking of results

Detecting Web Spam Spam detection: A classification problem Given salient features, decide whether a web page (or web site) is spam Can use automatic classifiers Plethora of existing algorithms (Bayes, C4.5, SVM, …) Use data sets tagged by human judges to train and evaluate classifiers (this is expensive!) But what are the “salient features”? Need to understand spamming techniques to decide on features Finding the right features is “alchemy”, not science Spammers adapt – it’s an arms race!

Taxonomy of web spam techniques “Keyword stuffing” “Link spam” “Cloaking”

Keyword stuffing Search engines return pages that contain query terms (Certain caveats and provisos apply …) One way to get more SE referrals: Create pages containing popular query terms (“keyword stuffing”) Three variants: Hand-crafted pages (ignored in this talk) Completely synthetic pages Assembling pages from “repurposed” content

Examples of synthetic content Monetization Random words Well-formed sentences stitched together Links to keep crawlers going

Examples of synthetic content Someone’s wedding site!

Features identifying synthetic content Average word length The mean word length for English prose is about 5 characters Word frequency distribution Certain words (“the”, “a”, …) appear more often than others N-gram frequency distribution Some words are more likely to occur next to each other than others Grammatical well-formedness Alas, natural-language parsing is expensive

Really good synthetic content Links to keep crawlers going Grammatically well-formed but meaningless sentences “Nigritude Ultramarine”: An SEO competition

Content “repurposing” Content repurposing: The practice of incorporating all or portions of other (unaffiliated) web pages A “convenient” way to machine generate pages that contain human-authored content Not even necessarily illegal … Two flavors: Imporporate large portions of a single page Incoporate snippets of multiple pages

Example of page-level content “repurposing”

Example of phrase-level content “repurposing”

Techniques for detecting content repurposing Single-page flavor: Cluster pages into equivalence classes of very similar pages If most pages on a site are very similar to pages on other sites, raise a red flag (There are legitimate replicated sites; e.g. mirrors of Linux man pages) Many-snippets flavor: Test if page consists mostly of phrases that also occur somewhere else Computationally hard problem Have probabilistic technique that makes it tractable

Detour: Link-based ranking Most search engines use hyperlink information for ranking Basic idea: Peer endorsement Web page authors endorse their peers by linking to them Prototypical link-based ranking algorithm: PageRank Page is important if linked to (endorsed) by many other pages More so if other pages are themselves important

Link spam Link spam: Inflating the rank of a page by creating nepotistic links to it From own sites: Link farms From partner sites: Link exchanges From unaffiliated sites (e.g. blogs, guest books, web forums, etc.) The more links, the better Generate links automatically Use scripts to post to blogs Synthesize entire web sites Synthesize many web sites (DNS spam) The more important the linking page, the better Buy expired highly-ranked domains Post links to high-quality blogs

Link farms and link exchanges

The trade in expired domains

Web forum and blog spam

Features identifying link spam Large number of links from low-ranked pages Discrepancy between number of links (peer endorsement) and number of visitors (user endorsement) Links mostly from affiliated pages Same web site; same domain Same IP address Same owner (according to WHOIS record) Evidence that linking pages are machine-generated …

Cloaking Cloaking: The practice of sending different content to search engines than to users Techniques: Recognize page request is from search engine (based on “user-agent” info or IP address) Make some text invisible (i.e. black on black) Use CSS to hide text Use JavaScript to rewrite page Use “meta-refresh” to redirect user to other page Hard (but not impossible) for SE to detect

How well does web spam detection work? Experiment done at MSR-SVC: (joint work with Fetterly, Manasse, Ntoulas) using a number of the features described earlier fed into C4.5 decision-tree classifier corpus of about 100 million web pages judged set of pages (2364 spam, non-spam) 10-fold cross-validation Our results are not indicative of spam detection effectiveness of MSN Search!

How well does web spam detection work? Confusion matrix: Expressed as precision-recall matrix: classified as  spamnon-spam spam1, non-spam36714,439 classrecallprecision spam81.1%83.9% non-spam97.5%97.0%

Questions