Cloak and Dagger: Dynamics of Web Search Cloaking David Y. Wang, Stefan Savage, and Geoffrey M. Voelker University of California, San Diego 左昌國 Seminar.


1 Cloak and Dagger: Dynamics of Web Search Cloaking
David Y. Wang, Stefan Savage, and Geoffrey M. Voelker, University of California, San Diego
左昌國, Seminar @ ADLab, NCU-CSIE
18th ACM Conference on Computer and Communications Security (CCS 2011)

2 Outline
- Introduction
- Methodology
- Results
- Related Work
- Conclusion

3 Introduction
- Search Engine Optimization (SEO)
  - "Search engine optimization (SEO) is the process of improving the visibility of a website or a web page in search engines via the "natural" or un-paid ("organic" or "algorithmic") search results." (Wikipedia)
  - SEO techniques can be benign
- Cloaking
  - Dates back to at least 1999
  - One of the most notorious blackhat SEO techniques
  - Delivers different content to different user segments, i.e., search engine crawlers and normal users

4 Introduction (figure: the page as seen by a Normal User vs. a Search Engine Crawler)

5 Introduction
Types of cloaking:
- Repeat Cloaking: cookies or IP tracking
- User Agent Cloaking: User-Agent field in the HTTP request header
- Referrer Cloaking: Referer field in the HTTP header
- IP Cloaking
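To make the four types concrete, the following is a minimal sketch of which request attribute each type inspects. It is an illustration only, not code from the paper: the function name `cloaking_signals`, the marker lists, and the `crawler_ips` and `seen_ips` parameters are all assumptions made for this example.

```python
# Illustrative marker lists (assumptions); real cloaking kits use much longer ones.
CRAWLER_UA_MARKERS = ("googlebot", "bingbot", "slurp")
SEARCH_REFERER_MARKERS = ("google.", "bing.", "search.yahoo.")


def cloaking_signals(headers, client_ip, crawler_ips=frozenset(), seen_ips=frozenset()):
    """Return the cloaking types whose trigger condition this request meets.

    headers     -- dict of HTTP request headers (exact header-name casing assumed)
    client_ip   -- source address of the request
    crawler_ips -- hypothetical list of known search engine crawler addresses
    seen_ips    -- hypothetical record of addresses that visited before
    """
    user_agent = headers.get("User-Agent", "").lower()
    referer = headers.get("Referer", "").lower()
    signals = set()

    # Repeat cloaking: the visitor is recognized via a cookie or a tracked IP.
    if "Cookie" in headers or client_ip in seen_ips:
        signals.add("repeat")
    # User agent cloaking: the User-Agent field names a crawler.
    if any(marker in user_agent for marker in CRAWLER_UA_MARKERS):
        signals.add("user-agent")
    # Referrer cloaking: the Referer field shows a click-through from a search engine.
    if any(marker in referer for marker in SEARCH_REFERER_MARKERS):
        signals.add("referrer")
    # IP cloaking: the source address belongs to a known crawler.
    if client_ip in crawler_ips:
        signals.add("ip")
    return signals
```

A detector has to vary exactly these attributes between visits, which is what the crawling step in the Methodology slides does.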

6 Introduction
This paper:
- Designs a system, Dagger, to identify cloaking in near real time
- Uses this system to
  - Provide a picture of cloaking activity as seen through three search engines (Google, Bing, and Yahoo)
  - Characterize the differences in cloaking behavior between undifferentiated "trending" keywords and targeted keywords
  - Characterize the dynamic behavior of cloaking activity

7 Methodology
Dagger consists of five functional components:
- Collecting search terms
- Fetching search results from search engines
- Crawling the pages linked from the search results
- Analyzing the pages crawled
- Repeating measurements over time

8 Methodology: Collecting Search Terms
- Collecting popular search terms from
  - Google Hot Searches
  - Alexa
  - Twitter
- Constructing another source of search terms using keyword suggestions from "Google Suggest"
  - Example: the user enters "viagra 50mg"; the suggestions are "viagra 50mg cost", "viagra 50mg canada", …
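The suggestion step can be sketched as a small expansion function. This is a hedged illustration, not Dagger's code: `expand_terms` is a hypothetical name, the `suggest` callable stands in for whatever suggestion service is queried (no real API is called here), and the default of three suggestions per term is taken from the "80 * 3" figure on the next slide.

```python
def expand_terms(seed_terms, suggest, per_term=3):
    """Expand each seed term with up to `per_term` keyword suggestions.

    suggest -- any callable mapping a term to a sequence of suggested
               queries (hypothetical stand-in for a suggestion service)
    """
    expanded = []
    for term in seed_terms:
        for suggestion in list(suggest(term))[:per_term]:
            # Skip echoes of the seed term and duplicates across seeds.
            if suggestion != term and suggestion not in expanded:
                expanded.append(suggestion)
    return expanded


# Usage with canned data; the last two suggestions are made up for the example.
canned = {
    "viagra 50mg": [
        "viagra 50mg cost",
        "viagra 50mg canada",
        "viagra 50mg placeholder-a",
        "viagra 50mg placeholder-b",
    ]
}
terms = expand_terms(["viagra 50mg"], lambda t: canned.get(t, []))
```

Passing the suggestion source as a parameter keeps the sketch runnable offline and avoids assuming anything about the real service's interface.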

9 Methodology: Querying Search Results
- Submitting the search terms to search engines (Google, Yahoo, and Bing)
  - Google Hot Searches and Alexa each supply 80 terms per 4-hour period
  - Twitter supplies 40
  - Together with 240 additional suggestions based on Google Hot Searches (80 * 3)
  - Total: 440 terms
- Extracting the top 100 search results for each search term (44,000)
- Removing whitelisted URLs
- Grouping similar entries (same URL, source, and search term)
  - On average, roughly 15,000 unique URLs in each measurement period
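The counts on this slide follow from simple arithmetic, and the filtering and grouping step can be sketched in a few lines. The sketch below is an assumption-laden illustration: `filter_and_group` and its host-based whitelist check are hypothetical, since the slide only says that whitelisted URLs are removed and that entries sharing URL, source, and search term are grouped.

```python
from urllib.parse import urlsplit

# Per-period term counts as stated on the slide.
TERMS_PER_PERIOD = {"google_hot_searches": 80, "alexa": 80, "twitter": 40}
SUGGESTIONS_PER_HOT_SEARCH = 3
RESULTS_PER_TERM = 100

suggested_terms = TERMS_PER_PERIOD["google_hot_searches"] * SUGGESTIONS_PER_HOT_SEARCH
total_terms = sum(TERMS_PER_PERIOD.values()) + suggested_terms  # 200 + 240
raw_results = total_terms * RESULTS_PER_TERM


def filter_and_group(results, whitelist_hosts):
    """Drop whitelisted hosts, then group identical (url, source, term) entries.

    results         -- iterable of (url, source, term) tuples
    whitelist_hosts -- set of hostnames to skip (matching by host is an assumption)
    Returns a dict mapping each surviving (url, source, term) to its count.
    """
    groups = {}
    for url, source, term in results:
        host = urlsplit(url).hostname or ""
        if host in whitelist_hosts:
            continue
        key = (url, source, term)
        groups[key] = groups.get(key, 0) + 1
    return groups
```

With the slide's numbers, `total_terms` is 440 and `raw_results` is 44,000; the number of keys returned by `filter_and_group` corresponds to the unique URLs crawled in a measurement period.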

10 Methodology: Crawling Search Results
- Web crawler
  - A Java web crawler using the HttpClient 3.x package from Apache
- Crawling 3 times for each URL
  - Disguised as a normal user using Internet Explorer, clicking through the search result
  - Disguised as the Googlebot web crawler
  - Disguised as a normal user again, NOT clicking through the search result
- Dealing with IP cloaking?
  - A fourth crawl using Google Translate
  - More than half of cloaked results do IP cloaking
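The three visits differ only in the request headers they present. The sketch below shows one way to express those profiles; it is written in Python for illustration even though the crawler described on the slide is written in Java. The function name `crawl_profiles` and the Internet Explorer User-Agent string are assumptions, and the fourth crawl through Google Translate is not modeled.

```python
# Illustrative User-Agent strings. The IE string is an assumed example;
# the Googlebot string follows the form Google documents for its crawler.
IE_USER_AGENT = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)"
GOOGLEBOT_USER_AGENT = (
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)


def crawl_profiles(search_results_url):
    """Return (name, headers) pairs for the three visits made to each URL."""
    return [
        # 1. Normal user who clicked through: browser UA plus a search Referer.
        ("user_click_through",
         {"User-Agent": IE_USER_AGENT, "Referer": search_results_url}),
        # 2. Search engine crawler: crawler UA, no Referer.
        ("crawler", {"User-Agent": GOOGLEBOT_USER_AGENT}),
        # 3. Normal user visiting directly: browser UA, no Referer.
        ("user_direct", {"User-Agent": IE_USER_AGENT}),
    ]
```

Comparing visit 1 with visit 2 exposes user agent cloaking, while comparing visit 1 with visit 3 exposes referrer cloaking, since only the Referer field differs between them.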

11 Methodology: Detecting Cloaking
- Removing HTTP error responses (on average 4% of URLs)
- Using text shingling to filter out nearly identical pages
  - 90% of URLs are near duplicates ("near duplicates" means a difference of 10% or less between the two sets of signatures)
- Measuring the similarity between the snippet of the search result and the user view of the page
  - Removing noise from both the snippet and the body of the user view
  - Searching the user view for substrings from the snippet
  - Score: number of words in unmatched substrings divided by the total number of words in all substrings
  - 1.0 means no match; 0.0 means a full match
  - Threshold: 0.33, which filters out 56% of the remaining URLs
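Both text filters can be sketched briefly. This is one possible reading of the slide, not the authors' implementation: the 4-word shingle size, the use of one minus Jaccard similarity as the "difference", the ASCII-only noise removal, and splitting the snippet on "..." are all assumptions, as are the function names.

```python
import re

NEAR_DUPLICATE_MAX_DIFFERENCE = 0.10  # from the slide
SNIPPET_THRESHOLD = 0.33              # from the slide


def normalize(text):
    """Crude noise removal: lowercase, keep ASCII letters and digits only."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def shingles(text, k=4):
    """Set of k-word shingles; word tuples stand in for hashed signatures."""
    words = normalize(text).split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def shingle_difference(text_a, text_b, k=4):
    """1 - Jaccard similarity of the two shingle sets (0.0 means identical)."""
    set_a, set_b = shingles(text_a, k), shingles(text_b, k)
    if not set_a and not set_b:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(set_a | set_b)


def snippet_score(snippet, user_view_text):
    """Words in unmatched snippet substrings / words in all snippet substrings."""
    body = " " + normalize(user_view_text) + " "
    parts = [normalize(part) for part in snippet.split("...")]
    parts = [part for part in parts if part]
    total_words = sum(len(part.split()) for part in parts)
    if total_words == 0:
        return 0.0
    # Pad with spaces so a substring only matches on whole-word boundaries.
    unmatched_words = sum(
        len(part.split()) for part in parts if " " + part + " " not in body
    )
    return unmatched_words / total_words
```

Under this reading, a pair of views whose `shingle_difference` is at most 0.10 is dropped as a near duplicate, and a page whose `snippet_score` stays under 0.33 is dropped because the user view still contains most of what the search engine indexed.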

12 Methodology: Detecting Cloaking (cont.)
- False positives may still exist
- Examining the DOMs as the final test
  - Computing the sum of an overall comparison and a hierarchical comparison
  - Overall comparison: unmatched tags from the entire page divided by the total number of tags
  - Hierarchical comparison: the sum of the unmatched tags from each level of the DOM hierarchy divided by the total number of tags
  - 2.0 means no match; 0.0 means a full match
  - Threshold: 0.66
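The DOM score can be approximated with the standard-library HTML tokenizer. The sketch below is an interpretation under stated assumptions: "unmatched tags" is taken to mean the multiset difference of tag names between the two views, "total number of tags" is taken as the tag count of both views combined, and nesting depth is tracked from start and end tags without browser-style error recovery. The names `TagCollector` and `dom_score` are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Void elements never get an end tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagCollector(HTMLParser):
    """Record every start tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.tags = []  # list of (depth, tag_name)

    def handle_starttag(self, tag, attrs):
        self.tags.append((self.depth, tag))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def collect_tags(html):
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return parser.tags


def unmatched(counter_a, counter_b):
    """Number of items present in one multiset but not matched in the other."""
    return sum(((counter_a - counter_b) + (counter_b - counter_a)).values())


def dom_score(html_a, html_b):
    """Overall plus hierarchical comparison: 0.0 is a full match, 2.0 is none."""
    tags_a, tags_b = collect_tags(html_a), collect_tags(html_b)
    total = len(tags_a) + len(tags_b)
    if total == 0:
        return 0.0
    # Overall: compare tag names across the whole page, ignoring depth.
    overall = unmatched(Counter(t for _, t in tags_a),
                        Counter(t for _, t in tags_b)) / total
    # Hierarchical: compare tag names level by level, summed over all levels.
    hierarchical = unmatched(Counter(tags_a), Counter(tags_b)) / total
    return overall + hierarchical
```

Two pages built from the same tags in a different nesting order score 0.0 on the overall comparison but up to 1.0 on the hierarchical one, which is why the two are summed before applying the 0.66 threshold.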

13 Methodology: Detecting Cloaking (cont.)
- Manual inspection
  - False positives: 9.1% (29 of 317) in Google search, 12% (9 of 75) in Yahoo
  - These are benign websites that nonetheless deliver different content to search engines
  - Advanced browser detection
- Temporal remeasurement
  - Dagger remeasures every 4 hours for up to 7 days
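The remeasurement schedule implies a fixed number of follow-up visits per URL. A minimal sketch, assuming the first remeasurement happens one interval after detection and that the 7-day endpoint is included (neither detail is stated on the slide); `remeasurement_times` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def remeasurement_times(first_seen, interval_hours=4, max_days=7):
    """Timestamps at which a URL would be crawled again after `first_seen`."""
    step = timedelta(hours=interval_hours)
    end = first_seen + timedelta(days=max_days)
    times = []
    current = first_seen + step
    while current <= end:  # endpoint inclusive (assumption)
        times.append(current)
        current += step
    return times


schedule = remeasurement_times(datetime(2011, 1, 1))
```

Under these assumptions each URL is revisited 7 * 24 / 4 = 42 times.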

14 Results: Cloaking Over Time

15 Results

16 Results: Sources of Search Terms

17 Results

18 Results

19 Results

20 Results: Search Engine Response

21 Results

22 Results

23 Results

24 Results

25 Results: Cloaking Duration

26 Results: Cloaked Content

27 Results

28 Results: Domain Infrastructure

29 Results: SEO

30 Conclusion
- Cloaking is a standard technique for constructing scam pages.
- This paper examined the current state of search engine cloaking as used to support Web spam.
- It introduced new techniques for identifying cloaking (via the search engine snippets, which identify keyword-related content found at the time of crawling).
- It explored the dynamics of cloaked search results and sites over time.

