Cloak & Dagger: Dynamics of Web Search Cloaking David Y. Wang, Stefan Savage, Geoffrey M. Voelker University of California, San Diego 1.

Presentation transcript:

Cloak & Dagger: Dynamics of Web Search Cloaking David Y. Wang, Stefan Savage, Geoffrey M. Voelker University of California, San Diego 1

What is Cloaking? 2

Bethenny Frankel? 3

How Does Cloaking Work?
Googlebot visits twitter&page=2
Request: GET … HTTP/1.1 … User-Agent: Googlebot/2.1
Scammer: “Hi Googlebot, I’ve got some content for you”
4

Customized Content for Crawler Googlebot receives content related to “bethenny frankel twitter” 5

Google Indexes Content 6

Poisoned Search Results
User clicks on the search result linking to twitter&page=2
Request: GET … HTTP/1.1 … User-Agent: Firefox Referer:
Scammer: “It’s traffic! … I mean a user… $$$”
7

Scam Content for User 8

User gets 0wned 9

What is Cloaking?
Blackhat search engine optimization (SEO) technique
– Delivers different content to different types of users (search crawler, visitor, site owner)
    SEO-ed page → search crawler
    Scam page → visitor
    Benign page → site owner of compromised host
Used to obtain search traffic illegitimately by gaming search results
– Users click on search result, taken to scams
– Clicks “monetized” by scams: fake A/V, pay-per-click, etc.
10

Why is this a problem?
From the user's perspective
– Bad experience
– Yet another vector for scams
– Compromised hosts
From the search engine's perspective
– Poisoned search results impact quality
– Increased complexity to detect and defend against cloaking
11

Repeat Cloaking
Scammer returns the scam first time, then benign content afterwards
[Diagram: first visit? yes / no]
12

User-Agent Cloaking
Scammer examines the HTTP header for User-Agent [Gyöngyi05]
Request: GET … HTTP/1.1 … User-Agent: Firefox
[Diagram: User-Agent is firefox? yes / no]
13

Referer Cloaking
Scammer examines the HTTP header for Referer [Wang06]
Request: GET … HTTP/1.1 … Referer:
[Diagram: clicked thru google.com? yes / no]
14

IP Cloaking
Scammer maps request IP address to known range [Gyöngyi05]
[Diagram: IP: Google IP? no / yes]
15
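The four cloaking signals above (repeat visit, User-Agent, Referer, IP range) can be pictured as one decision function on the server side. This is only a toy model to show what a measurement crawler is up against, not code from the talk. The `Request` class, the `select_page` function, the network range, and the referer markers are all hypothetical placeholders.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network

# Hypothetical values for illustration; none of them come from the slides.
# 192.0.2.0/24 is a documentation-only range standing in for a crawler range.
CRAWLER_NETWORKS = [ip_network("192.0.2.0/24")]
SEARCH_REFERER_MARKERS = ("google.", "bing.", "yahoo.")


@dataclass
class Request:
    ip: str
    user_agent: str = ""
    referer: str = ""


def user_agent_is_crawler(req):
    """User-Agent cloaking: inspect the User-Agent header."""
    return "googlebot" in req.user_agent.lower()


def ip_is_crawler(req):
    """IP cloaking: map the request IP to a known range."""
    return any(ip_address(req.ip) in net for net in CRAWLER_NETWORKS)


def came_from_search(req):
    """Referer cloaking: did the visitor click through a search engine?"""
    return any(marker in req.referer.lower() for marker in SEARCH_REFERER_MARKERS)


def select_page(req, seen_ips):
    """Return which page a cloaking site would serve: 'seo', 'scam' or 'benign'."""
    if user_agent_is_crawler(req) or ip_is_crawler(req):
        return "seo"
    # Repeat cloaking: only the first visit from a search click gets the scam.
    if came_from_search(req) and req.ip not in seen_ips:
        seen_ips.add(req.ip)
        return "scam"
    return "benign"
```

In this model a site owner who types the URL directly (no Referer) sees the benign page, which matches the "Benign page → site owner" case earlier in the deck.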

Goals
Systematic measurement over time to capture dynamics and trends in cloaking as SEO
– Contemporary picture of cloaking as seen from search engines (Google, Yahoo, Bing)
– Characterize differences based on search term classes
    Trends: dynamic, broad categories
    Pharmacy: static, domain specific
– Time dynamics: lifetime of cloaked pages and search engine response
    Difficult to observe using a snapshot
16

Approach
We built Dagger, a customized crawler system
– Collects search terms
– Crawls pages from search results
– Cloaking detection
– Repeated measurement over time
Ran for 5 months (March 1, 2011 – August 1, 2011)
Study results from Google, Yahoo, Bing
17

What Search Terms to Study?
Selected terms represent portion of search index
Use terms cloakers target
– Past work led us to Trends and Pharmacy
– Differences allow us to understand utilization
Trends (dynamic)
– Large set of search terms that change constantly
– Search terms come from various categories
Pharmacy (static)
– Limited set of terms
– One category, pharmacy
18

Collecting Search Terms
Maintain feeds for trends and pharmacy sources
Google Suggest adds long tail search terms
[Diagram: Terms (volcano, viagra 50mg, olympics, dallas mavericks); suggested long tail terms (viagra 50mg canada, dallas mavericks roster)]
19
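The long tail expansion described on this slide could be sketched roughly as follows. The `suggest` callable is a hypothetical stand-in for a query-suggestion source; the stub below simply hard-codes the two expansions visible in the slide's diagram and does not call any real service.

```python
def expand_terms(seed_terms, suggest):
    """Return the seed terms plus suggested long tail terms, without duplicates.

    `suggest` is an assumed callable mapping a term to a list of suggestions.
    """
    expanded = []
    seen = set()
    for term in seed_terms:
        for candidate in [term, *suggest(term)]:
            if candidate not in seen:
                seen.add(candidate)
                expanded.append(candidate)
    return expanded


# Stub suggestion source built from the examples shown on the slide.
STUB_SUGGESTIONS = {
    "viagra 50mg": ["viagra 50mg canada"],
    "dallas mavericks": ["dallas mavericks roster"],
}


def stub_suggest(term):
    return STUB_SUGGESTIONS.get(term, [])


terms = expand_terms(["volcano", "viagra 50mg", "dallas mavericks"], stub_suggest)
```

Keeping the original order and dropping duplicates means each term is submitted to the search engines only once per collection round.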

Crawling Search Results
Submit search terms to search engines (Google, Yahoo, Bing)
Collect the top 100 search results per search term
Crawl each unique URL twice:
– Browser (Microsoft Internet Explorer)
– Crawler (Googlebot)
[Diagram: Terms (volcano, viagra 50mg, olympics) → URLs → Web Pages]
20
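Crawling each URL twice, once as a browser and once as a crawler, amounts to sending two requests that differ in the User-Agent header. A minimal sketch using Python's standard `urllib.request` follows. The Internet Explorer string is an assumed example (the slides only name the browser), and `build_requests` and `fetch_both` are hypothetical helper names, not Dagger's actual interface.

```python
import urllib.request

# "Googlebot/2.1" appears on an earlier slide; the browser string is an
# assumed example of an Internet Explorer User-Agent.
USER_AGENTS = {
    "browser": "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)",
    "crawler": "Googlebot/2.1",
}


def build_requests(url):
    """Build one request per role, differing only in the User-Agent header."""
    return {
        role: urllib.request.Request(url, headers={"User-Agent": agent})
        for role, agent in USER_AGENTS.items()
    }


def fetch_both(url, timeout=10):
    """Fetch the URL once per role and return the decoded bodies by role."""
    pages = {}
    for role, request in build_requests(url).items():
        with urllib.request.urlopen(request, timeout=timeout) as response:
            pages[role] = response.read().decode("utf-8", errors="replace")
    return pages
```

A crawl that arrives by clicking a search result would also carry a Referer header; that is left out here to keep the sketch small.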

Detecting Cloaked Pages
Text Shingling
– Remove near duplicate HTML
Snippet analysis
– Remove pages whose browser HTML matches the search result snippet
DOM analysis
– Compare HTML structure of browser against crawler
[Diagram: Web Pages → Text Shingling → Snippet Analysis → DOM Analysis; labels: 90%, 56%]
21
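The three filtering stages can be illustrated with a small sketch. It is an approximation under stated assumptions, not Dagger's implementation: the shingle size `k=4`, the `text_threshold` of 0.9, the regex-based tag stripping, and the use of start-tag sequences as the "DOM structure" are all choices made here for illustration.

```python
import re
from html.parser import HTMLParser


def strip_tags(html):
    """Crude tag removal; enough for a sketch, not a real HTML-to-text step."""
    return re.sub(r"<[^>]+>", " ", html)


def shingles(text, k=4):
    """Return the set of k-word shingles of the text (k is an assumed value)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class TagCollector(HTMLParser):
    """Records the sequence of start tags as a rough structural signature."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    parser = TagCollector()
    parser.feed(html)
    return parser.tags


def looks_cloaked(browser_html, crawler_html, snippet, text_threshold=0.9):
    # Stage 1, text shingling: near-duplicate pages are not cloaked.
    similarity = jaccard(shingles(strip_tags(browser_html)),
                         shingles(strip_tags(crawler_html)))
    if similarity >= text_threshold:
        return False
    # Stage 2, snippet analysis: the browser page matches the search snippet.
    browser_text = " ".join(re.findall(r"\w+", strip_tags(browser_html).lower()))
    snippet_text = " ".join(re.findall(r"\w+", snippet.lower()))
    if snippet_text and snippet_text in browser_text:
        return False
    # Stage 3, DOM analysis: compare the structure of the two versions.
    return tag_sequence(browser_html) != tag_sequence(crawler_html)
```

The cheap text comparison runs first so that the structural comparison only has to look at the pages that survive the earlier filters, which is the ordering the slide's pipeline suggests.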

Data Set
Ran for 5 months (March 1, 2011 – August 1, 2011)
– Trends: 110 search terms collected every hour (dynamic)
    14K unique URLs crawled every 4 hours per search engine
– Pharmacy: 230 search terms in total (static)
    16K unique URLs crawled every day per search engine
In total, we crawled 43M search results
– 200K cloaked search results for trends
– 500K cloaked search results for pharmacy
22

How Much Cloaking?
Google has the most cloaked search results
– Economies of scale, Google has the larger market
Trends vs Pharmacy
– Pharmacy 10x volume, less volatility
23

Which Terms Poisoned?
Google Suggest has 2.5+ times more cloaked pages
High variance in % cloaked search results
– Terms selected can introduce bias into results
Rank   Search Term             % Cloaked
1      viagra 50mg canada      61.2%
2      viagra 25mg online      48.5%
3      viagra 50mg online      41.8%
4      cialis 100mg            40.4%
5      generic cialis 100mg    37.7%
…      …
50%    tramadol 50mg           7.0%
24

Rate of Search Engine Response?
A search result counts as cleaned when the cloaked result no longer appears in the top 100
– 40% (trends), 20% (pharmacy) cleaned after 1st day
– Cloaked search results churn more rapidly than overall
25
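Under the definition on this slide, the cleaning rate can be computed by comparing the cloaked results seen on one day against a later top-100 snapshot. The sketch below assumes each result is identified by a hypothetical `(search term, URL)` pair; the function name and the sample data are illustrative only.

```python
def fraction_cleaned(cloaked_results, later_top100):
    """Fraction of cloaked results that are absent from a later top-100 snapshot.

    Both arguments are sets of (search term, URL) pairs (an assumed encoding).
    """
    if not cloaked_results:
        return 0.0
    cleaned = cloaked_results - later_top100
    return len(cleaned) / len(cloaked_results)


# Made-up example: five cloaked results on day 0, three still ranked on day 1.
day0_cloaked = {("t1", "u1"), ("t1", "u2"), ("t2", "u3"), ("t2", "u4"), ("t3", "u5")}
day1_top100 = {("t1", "u1"), ("t2", "u3"), ("t3", "u5"), ("t3", "u9")}
rate = fraction_cleaned(day0_cloaked, day1_top100)
```

With this sample data two of the five cloaked results have dropped out, so `rate` is 0.4.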

How Long are Pages Cloaked?
Over 80% of cloaked pages remain cloaked past seven days
– Cloakers have little incentive to stop
– Pages often not well maintained
– Also pages are hidden from site owner
26

What is Cloaked?
Focus on trends
Cluster based on DOM structure of browser, then manually label
– Top 62 / 7671 clusters, representing 61% of cloaked search results
– March 1 – May 1
Traffic sales suggest specialization + sophistication
Category           % Cloaked Pages
Traffic Sales      81.5%
Error              7.3%
Legitimate         3.5%
Software           2.2%
SEO-ed business    2.0%
PPC                1.3%
Fake-AV            1.2%
CPALead            0.6%
Insurance          0.3%
Link farm          0.1%
27

What is Cloaked?
Classify the HTML using file size + content as features
Cloaked content is highly dynamic
– Redirects surge
– Errors rise
Matches general timeframe of Fake-AV takedowns
28

Conclusion
Cloaking remains an active vector for scams
– Fake A/V, pay-per-click, malware
Search engines respond, but not fast enough to prevent monetization
– Majority of cloaked search results persist > 1 day
Clear differences in how search terms can be poisoned
– Trends: < 2% results poisoned, but spread broadly, undifferentiated traffic
– Pharmacy: up to 60% results poisoned, highly focused
Signs of increasing specialization + sophistication in blackhat SEO w/ traffic sales
29

Thank You! Questions? 30

IP Cloaking
Return SEO-ed page only to search engine
Dagger can still detect that cloaking occurs:
– The user must receive the scam for monetization
– If we are detected as a false googlebot, what do we receive?
    Surely not the page that the real googlebot receives
    If we receive the scam, then scammers are vulnerable to security crawlers (blacklist) and the site owner (clean up)
    In practice we receive a benign page (index.html)
– Anything other than the scam will result in a delta, which we can use for comparison and detection
31