Improving Cloaking Detection Using Search Query Popularity and Monetizability Kumar Chellapilla and David M Chickering Live Labs, Microsoft.


Cloaking - Example (screenshots omitted)
  – Example 1, "Want to lose weight?": browser view, then browser view vs. crawler view
  – Example 2, "Want to buy blinds for your windows?": browser view, then browser view vs. crawler view
  – Example 3: browser view, then browser view vs. crawler view

Cloaking
A hiding technique
  – Browser: serve the true intended content
  – Crawler: serve content that will rank the page high on the search engine
Web spam
  – Actions intended to mislead search engines into ranking certain pages higher than they deserve
Cloaking reduces information reliability; as a result, search engines take strict measures against sites that cloak

How do servers cloak?
Cloaking techniques
  – User-Agent string
      Crawlers
        – msnbot/1.0 (+
        – Mozilla/5.0 (compatible; Googlebot/2.1; +
      Browsers
        – Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
  – IP based
      Easily available lists of crawler IPs/ranges
      IP techniques are quite successful
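To make the User-Agent technique concrete, here is a minimal sketch of the branching decision a cloaking server makes. The marker list and both function names are hypothetical and chosen for illustration only; a real server would match against a much longer list and often combine it with crawler IP ranges.

```python
# Hypothetical marker substrings; real crawler lists are much longer.
CRAWLER_MARKERS = ("msnbot", "googlebot")


def looks_like_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent string contains a known crawler marker."""
    ua = user_agent.lower()
    return any(marker in ua for marker in CRAWLER_MARKERS)


def choose_view(user_agent: str) -> str:
    """Return which version of the page a cloaking server would serve."""
    return "crawler view" if looks_like_crawler(user_agent) else "browser view"
```

A detector exploits the same branch from the outside: it requests one URL under both kinds of User-Agent string and compares the responses.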

Distribution of Cloaking
We study the distribution of cloaking spam over two sets of queries
  – Popular queries
  – Monetizable queries
Assumption: spammers are economically motivated
Hypothesis: the more monetizable a query is, the more likely it is to be spammed

Motivation behind Web Spam
Profitability of online businesses
  – Conversion ratios
      Impression-to-click
      Click-to-sale
  – Quality of website (features, usefulness, etc.)
Usually, increasing raw traffic to a site increases revenue and improves profitability
  – Search engine optimization (white hat and black hat)
  – Web spam
Online advertising
  – Advertising keywords
  – Sponsored links are presented separately from organic results
  – Web spam results are intermixed with organic results
Other (non-economic) motivations do exist
  – Google bombs (Negritude-Ultramarine)
Economic motivations are well known for spam

Query classes
Popular queries
  – Search engine query logs
  – Frequency → Popularity
Monetizable queries
  – Search engine ad logs (sponsored links)
  – Frequency of clicks → Monetizability
  – Revenue generated → Monetizability
Not disjoint sets!
Top-5000 queries for study

Popular Query Set
Queries
  – List of the top-5000 queries that generated the most traffic
  – MSN Search user query logs
  – Only query ranks are used; their frequencies were discarded
URLs
  – Top-200 search results from MSN Search, Google, and Ask.com
  – 5000 * 200 * 3 = 3 million URLs (not unique)

Monetizable Query Set (5000 Queries)
Queries
  – List of the top-5000 queries that generated the most revenue (PPC) from sponsored ads on a single day
  – MSN Search advertisement logs
  – Only query ranks are used; their raw monetization values were discarded
URLs
  – Top-200 search results from MSN Search, Google, and Ask.com
  – 5000 * 200 * 3 = 3 million URLs (not unique)

Data sets
Queries
  – 5000 popular, 5000 monetizable
  – Overlap between the two sets: 826 queries (17%)
Popular URLs
  – 3 million produced 1.49 million unique URLs
Monetizable URLs
  – 3 million produced 1.28 million unique URLs
Each URL was processed once for cloaking
Assumption: search engines apply anti-spam and URL editing techniques uniformly over the set of queries and URLs

Cloaking Detection
Extension of the technique proposed by Wu and Davison (2005; 2006)
Download up to 4 copies of each URL
  – Browser: IE user-agent string, up to 2 copies (B1, B2)
  – Crawler: msnbot user-agent string, up to 2 copies (C1, C2)
URLs crawled in random order
Over 2 days
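The download step can be sketched as follows. This is an illustration under stated assumptions, not the authors' crawler: `http_fetch` and `first_pass` are hypothetical names, the crawler string is only the prefix that survives in the slides, and the fetcher is passed in as a parameter so the logic can be exercised without network access.

```python
import random
import urllib.request
from typing import Callable, Dict, Iterable, Optional

BROWSER_UA = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
CRAWLER_UA = "msnbot/1.0"  # prefix only; the slide truncates the full string


def http_fetch(url: str, user_agent: str) -> bytes:
    """Download one copy of `url` while presenting the given User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()


def first_pass(
    urls: Iterable[str],
    fetch: Callable[[str, str], bytes] = http_fetch,
    seed: Optional[int] = None,
) -> Dict[str, Dict[str, bytes]]:
    """Fetch the first crawler copy (C1) and browser copy (B1) of each URL.

    URLs are visited in random order, as on the slide. The second pair
    (B2, C2) would be fetched later, only for URLs left unresolved.
    """
    order = list(urls)
    random.Random(seed).shuffle(order)
    return {
        url: {"C1": fetch(url, CRAWLER_UA), "B1": fetch(url, BROWSER_UA)}
        for url in order
    }
```

Injecting `fetch` also makes it easy to add politeness delays or retries without touching the crawl-order logic.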

Cloaking Score
Comparing a pair of documents
Normalized term frequency difference
  – T1 and T2 are sets of terms
  – (T1 \ T2) = set of terms in 1 but not in 2
  – Sets can contain repeats
  – Normalization by (T1 ∪ T2) reduces any bias that stems from the size of the web page
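The formula itself did not survive in the transcript. One reading consistent with the definitions above is D(T1, T2) = |(T1 \ T2) ∪ (T2 \ T1)| / |T1 ∪ T2|, with every set operation applied to multisets. The sketch below implements that assumed form with `collections.Counter`; the function name `term_difference` and the choice of max-count multiset union are assumptions rather than details confirmed by the slides.

```python
from collections import Counter
from typing import Iterable


def term_difference(terms1: Iterable[str], terms2: Iterable[str]) -> float:
    """Normalized term frequency difference between two term multisets.

    Assumed form: |(T1 \\ T2) U (T2 \\ T1)| / |T1 U T2|.
    """
    t1, t2 = Counter(terms1), Counter(terms2)
    only_in_1 = t1 - t2  # multiset difference; drops non-positive counts
    only_in_2 = t2 - t1
    union = t1 | t2      # multiset union; keeps the max count per term
    total = sum(union.values())
    if total == 0:
        return 0.0       # two empty documents are treated as identical
    # The two differences never share a term, so their union is their sum.
    return (sum(only_in_1.values()) + sum(only_in_2.values())) / total
```

Under this reading the score is 0 for documents with identical term counts and 1 for documents with no terms in common, regardless of page length.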

Cloaking Test Procedure

Cloaking Test
Processing stages (popular, monetizable)
  – (C1,B1): resolved as not cloaking (91.8%, 90.2%)
      74.7%, 73.1% resolved (not cloaking): same HTML
      13.6%, 13.4% resolved (not cloaking): same text
      0.46%, 0.67% resolved (not cloaking): same words (incl. frequency)
  – 8.2%, 9.8% remain, for which (B2,C2) are downloaded
Normalized term frequency differences
  – Cloaking: D(C1,B1), D(C2,B2)
  – Dynamic: D(C1,C2), D(B1,B2)
Simple measure of cloaking (threshold t)
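The three early-exit checks on the (C1,B1) pair can be sketched as a cascade from cheapest to most expensive. This is an assumption-laden illustration: the names `visible_text` and `resolve_stage` are hypothetical, and the text extraction here simply collects every text node with the standard-library `HTMLParser` (including script and style content), which is cruder than a production extractor.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from typing import Optional


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def visible_text(html: str) -> str:
    """Return the document's text with whitespace runs collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def resolve_stage(c1: str, b1: str) -> Optional[str]:
    """Name the stage at which (C1, B1) is resolved as not cloaking.

    Returns None when the pair stays unresolved, in which case the
    second pair (B2, C2) would be downloaded.
    """
    if c1 == b1:
        return "same HTML"
    text_c, text_b = visible_text(c1), visible_text(b1)
    if text_c == text_b:
        return "same text"

    def words(text: str) -> Counter:
        return Counter(re.findall(r"\w+", text.lower()))

    if words(text_c) == words(text_b):
        return "same words"
    return None
```

Ordering the checks this way means most pairs are settled by a plain string comparison, so the more expensive term-difference computation is only needed for the small remainder.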

Threshold (t)
Dynamic URLs
  – 8.2% of popular URLs = 122,180 URLs
  – 9.8% of monetizable URLs = 125,440 URLs
4000 URLs were randomly chosen
  – 2000 from the Popular set (8.2%)
  – 2000 from the Monetizable set (9.8%)
Manually labeled for cloaking spam
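The counts on this slide follow from the rounded percentages and the unique-URL totals reported earlier (1.49 million popular, 1.28 million monetizable). The sketch below reproduces that arithmetic and draws a uniform sample of 2000 items; the helper name `sample_for_labeling` and the fixed seed are hypothetical, added only so the example is repeatable.

```python
import random
from typing import List, Sequence

POPULAR_UNIQUE = 1_490_000      # unique popular URLs, from the data sets slide
MONETIZABLE_UNIQUE = 1_280_000  # unique monetizable URLs

dynamic_popular = round(0.082 * POPULAR_UNIQUE)
dynamic_monetizable = round(0.098 * MONETIZABLE_UNIQUE)


def sample_for_labeling(urls: Sequence[str], k: int = 2000, seed: int = 0) -> List[str]:
    """Draw k distinct URLs uniformly at random for manual labeling."""
    return random.Random(seed).sample(list(urls), k)
```

Sampling 2000 URLs from each set means roughly 1.6% of the unresolved URLs in either set were labeled by hand.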

Precision and Recall (plots omitted)
  – Annotated precision values: 98.5% and 74.0%
  – Annotated cloaked-URL percentages: 9.7% and 6.0%
  – Plot labels: Overall; Mean over 5000 Queries

Amount of cloaking
F1, F0.5, and F2 give best t = 0 (100% recall)
Cloaking detection algorithm
  – 98.5% precision (Monetizable)
  – 74.0% precision (Popular)
% Cloaked URLs
  – 9.7% (Monetizable)
  – 6.0% (Popular)
It is much easier to detect cloaking in monetizable query results
Monetizable queries are 62% more likely to produce cloaking spam results
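Two pieces of arithmetic sit behind this slide: the F-measure used to pick the threshold, and the 62% figure, which is the ratio of the two cloaked-URL percentages minus one. The sketch below uses the standard F-beta definition; the function name `f_beta` and the variable names are illustrative, not taken from the slides.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta score; beta > 1 weights recall more, beta < 1 precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision at t = 0 (100% recall), as reported on the slide.
f1_monetizable = f_beta(0.985, 1.0)
f1_popular = f_beta(0.740, 1.0)

# Relative increase in cloaked URLs: 9.7% (monetizable) vs 6.0% (popular).
relative_increase = 9.7 / 6.0 - 1
```

The same function evaluates F0.5 and F2 by changing `beta`, which is how one would confirm that a single threshold is best under all three weightings.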

Distribution of Cloaked URLs (plots omitted)
  – Queries independently sorted
  – Annotations mark the top 2% of queries and the remaining 98%

Distribution of cloaking
Top 100 (2%) most cloaked queries
  – have 10x as many cloaked URLs as the bottom 4900 queries (98%)
Very skewed distribution
An effective way of monitoring and detecting cloaked URLs
  – Start with the most cloaked queries (found in this study) and work towards the least cloaked queries
  – True for both Popular and Monetizable queries
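The monitoring strategy amounts to ranking queries by their cloaked-URL count and working down the list. A minimal sketch, assuming the counts are already available as a mapping from query to number of cloaked URLs (the function name `prioritize` and the input shape are hypothetical):

```python
from typing import Dict, List, Tuple


def prioritize(
    cloaked_counts: Dict[str, int], top_fraction: float = 0.02
) -> Tuple[List[str], float]:
    """Rank queries from most to least cloaked.

    Returns the ranked query list and the share of all cloaked URLs
    that falls within the top `top_fraction` of queries.
    """
    ranked = sorted(cloaked_counts.items(), key=lambda kv: kv[1], reverse=True)
    cutoff = max(1, round(len(ranked) * top_fraction))
    total = sum(count for _, count in ranked)
    top_total = sum(count for _, count in ranked[:cutoff])
    share = top_total / total if total else 0.0
    return [query for query, _ in ranked], share
```

With a skewed distribution like the one described, the returned share shows how much cloaking is covered by checking only the first few queries in the ranked list.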

Summary
Amount of cloaking in search results depends on query properties such as popularity and monetizability
Improved cloaking detection algorithm
  – High precision for monetizable queries
  – Moderate precision for popular queries
Focusing on the most popular and monetizable queries can produce a significant reduction in cloaking spam with minimal effort

Questions?