Presentation transcript:

1 DSpin: Detecting Automatically Spun Content on the Web
Speaker: Ting Luo, 2014/05/26
Authors: Qing Zhang, David Y. Wang, Geoffrey M. Voelker, University of California, San Diego
Venue: Network and Distributed System Security Symposium (NDSS 2014)

2 Outline
1. Introduction
2. Background And Previous Work
3. The Best Spinner
4. Similarity
5. Methodology
6. Spinning In The Wild
7. Discussion
8. Conclusion

3 Introduction
Search Engine Optimization (SEO)
– Black hat SEO: techniques used to get higher search rankings in an unethical manner
Spinning
– Used to generate and post Web spam
What is spinning?
– It replaces words and restructures original content to create new versions with similar meaning but different appearance

4 Introduction
Using spinning in SEO to increase page ranks
1. Create many different versions of a single seed article
2. Post those versions on multiple Web sites with links pointing to a site being promoted
[Figure labels: Original, A, B, C, D, Target Site]

5 Introduction
Goal: detect automatically spun content on the Web
Input: a set of article pages crawled from various Web sites
Output: a set of pages flagged as automatically spun content

6 Introduction
Contributions
1. Spinning characterization: The Best Spinner
2. Spun content detection: detecting automatically spun content based upon immutables
3. Behavior of article spammers

8 Background And Previous Work A. Spinning Overview

9 Background And Previous Work A. Spinning Overview
Example
– Both examples link to adult webcam sites
– The spun content is in English, but has been posted to German and Japanese wikis
– Version 1: "You have actually seen the feared demon-eye impact that occurs when the camera flash bounces off the eye of a person or animal"
– Version 2: "You’ve seen the dreaded demon-eye impact that happens when the camera flash bounces off the eye of an individual or animal"

10 Background And Previous Work A. Spinning Overview
[Figure label: (6) SPAM Content]

11 Background And Previous Work B. Article Spam Detection
Web spam taxonomies
– Content spam: quilted pages, keyword stuffing
– Link spam: page hijacking, link farms

12 Background And Previous Work C. Near-duplicate Document Detection
Near-duplicate documents
– Two such documents differ from each other only in a very small portion, such as the part that displays advertisements
Fingerprinting algorithm
– A procedure that maps an arbitrarily large data item (such as a computer file) to a much shorter bit string
– Reduces storage and computation costs

13 Background And Previous Work C. Near-duplicate Document Detection From :

14 Background And Previous Work C. Near-duplicate Document Detection
The classic approach: shingles [1]
– A shingle is the hash value of a k-gram, which is a sub-sequence of k successive words
– The set of shingles constitutes the set of features of a document
Enables a graph representation for similarity among pages
– Pages are nodes
– An edge connects two pages that share shingles above a threshold
[1] Gurmeet Singh Manku, Arvind Jain, Anish Das Sarma, "Detecting Near-Duplicates for Web Crawling," 2007

16 The Best Spinner(TBS) A. TBS

17 The Best Spinner (TBS) A. TBS
A popular spinning tool
– $77 per year
– Requires registration with a username and password
Synonym dictionary
– Requires credentials at runtime to allow the tool to download an updated version
Spintax
– {Home|House|Residence|Household}

18 The Best Spinner (TBS) A. TBS
Parameters
– Frequency: spin every word, or one in every second, third, or fourth word
– Remove original: removes the original word from the spintax alternatives, e.g. {Home|House|Residence|Household} → {House|Residence|Household}
– Auto-select inside spun text: when selected, spins already spun text
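
The spintax notation and the "Remove original" parameter above can be illustrated with a minimal Python sketch. This is not TBS code: the function name `expand_spintax` is hypothetical, the sketch handles only flat (non-nested) groups, and it assumes the first alternative in each group is the original word, as the slide example suggests.

```python
import itertools
import re

# Matches one flat spintax group such as {Home|House|Residence}.
SPINTAX_GROUP = re.compile(r"\{([^{}]*)\}")


def expand_spintax(text, remove_original=False):
    """Return every variant of a flat (non-nested) spintax string."""
    parts = []
    last = 0
    for match in SPINTAX_GROUP.finditer(text):
        # Literal text before the group contributes a single choice.
        parts.append([text[last:match.start()]])
        options = match.group(1).split("|")
        if remove_original and len(options) > 1:
            # Assumption: the first alternative is the original word.
            options = options[1:]
        parts.append(options)
        last = match.end()
    parts.append([text[last:]])
    # The Cartesian product of all choices yields all spun variants.
    return ["".join(combo) for combo in itertools.product(*parts)]


if __name__ == "__main__":
    for variant in expand_spintax("My {Home|House|Residence|Household} is big"):
        print(variant)
```

A single seed sentence with several groups multiplies quickly, which is why one article can produce many distinct-looking versions.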

19 The Best Spinner (TBS) A. TBS
{You can|You are able to|It is possible to|You’ll be able to|You possibly can}

20 The Best Spinner (TBS) B. Reverse Engineering TBS
During every startup, TBS
– Downloads the latest version of the synonym dictionary
– Saves it as the file tbssf.dat in an encrypted format (base64 encoding)
After reverse engineering TBS
– Use an authentication key to download the synonym dictionary
Synonym dictionary
– 8.4 MB in size
– Has a total of 750,114 synonyms grouped into 92,386 lines

21 The Best Spinner(TBS) B. Reverse Engineering TBS Authentication key

22 The Best Spinner (TBS) C. Controlled Experiments
[Figure labels: 5-12%, 6-14%]

24 Similarity
Similarity score: the classic Jaccard coefficient
– Take all the words from the two documents, A and B
– Compute the set intersection over the set union across all the words
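
As a rough illustration of the Jaccard coefficient described above, the following Python sketch computes |A ∩ B| / |A ∪ B| over word sets. The function name `jaccard` and the choice to return 0.0 for two empty documents are assumptions made for this example.

```python
def jaccard(words_a, words_b):
    """Jaccard coefficient of two word collections: |A & B| / |A | B|."""
    set_a, set_b = set(words_a), set(words_b)
    union = set_a | set_b
    if not union:
        # Assumption: two empty documents are treated as dissimilar.
        return 0.0
    return len(set_a & set_b) / len(union)


if __name__ == "__main__":
    doc_a = "a b c".split()
    doc_b = "b c d".split()
    print(jaccard(doc_a, doc_b))  # 2 shared words out of 4 distinct words
```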

25 Similarity
How to compute the intersection and size of two documents?
Extension
A. Methods Explored
B. The Immutable Method
C. Verification Process

26 Similarity A. Methods Explored
(1) Shingling
– Compute shingles, or n-grams, over the entire text with a shingle size of four
– For example, the sentence "a b c d e f" becomes the set of three elements "a b c d", "b c d e", and "c d e f"
– The intersection is the overlap of shingles between two documents
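
The shingling step above can be sketched in a few lines of Python. The helper names `shingles` and `shingle_similarity` are hypothetical, and whitespace tokenization is an assumption; the slides do not say how the text is split into words.

```python
def shingles(text, size=4):
    """Return the set of word n-grams (shingles) of the given size."""
    words = text.split()  # assumption: whitespace tokenization
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def shingle_similarity(text_a, text_b, size=4):
    """Jaccard coefficient computed over the two shingle sets."""
    a, b = shingles(text_a, size), shingles(text_b, size)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    print(sorted(shingles("a b c d e f")))
```

Because a single replaced word changes every shingle that contains it, spun versions of the same article share few shingles.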

27 Similarity A. Methods Explored
– Shingling yields low similarity scores for spun content, between 21.1% and 60.7%
– Although useful for document similarity in general, shingling is not useful for identifying spun content given these low similarity scores

28 Similarity A. Methods Explored
(2) Parts-of-speech
Stanford NLP package
– For each sentence, the NLP parser returns the original sentence with parts-of-speech tags for every word
– Use the parts-of-speech lists as the comparison unit

29 Similarity A. Methods Explored
TBS can replace single words with phrases, and phrases made up of multiple words can be spun into a single word, so the parts-of-speech lists of spun versions do not line up

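30 Similarity B. The Immutable Method
Separate each article’s words into
– Mutables: words that appear in the synonym dictionary and can therefore be spun
– Immutables: words that do not appear in the synonym dictionary
Focus entirely on the list of immutable words from two articles to determine if they are similar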

31 Similarity A. Methods Explored
– Ratios are above 90% for most spun content
– This provides a clear separation between spun and non-spun content

32 Similarity B. The Immutable Method
Benefit
– It also greatly decreases the number of bytes needed for comparison by reducing the representation of each article by an order of magnitude
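
A minimal sketch of the immutable method, under stated assumptions: the synonym dictionary is modeled as a plain set of spinnable words, tokenization is lowercase whitespace splitting, and the score is a Jaccard coefficient over the immutable sets. The slides do not give the exact formula, and the names `immutables` and `immutable_similarity` are hypothetical.

```python
def immutables(text, synonym_dictionary):
    """Words of the text that the spinner cannot replace."""
    # Assumption: a word is immutable if it is absent from the dictionary.
    return {w for w in text.lower().split() if w not in synonym_dictionary}


def immutable_similarity(text_a, text_b, synonym_dictionary):
    """Jaccard coefficient over the immutable words of two articles."""
    a = immutables(text_a, synonym_dictionary)
    b = immutables(text_b, synonym_dictionary)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    # Toy dictionary for illustration only; the real one is far larger.
    toy_dictionary = {"home", "house", "big", "large"}
    original = "The big home in Paris"
    spun = "The large house in Paris"
    print(immutable_similarity(original, spun, toy_dictionary))
```

Only the immutable words are kept per article, which is what shrinks the representation that has to be compared.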

33 Similarity C. Verification Process
Mutable verifier
Steps
1. Sum all the words that are common between the two pages, and add the result to the total overlap count
2. Compute the synonyms of the remaining words from one page and determine if they match the words of the other page
3. Take the synonyms of the synonyms of the remaining words and compare them in a similar fashion to step two
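
The three verifier steps above might be sketched as follows. This is an illustration, not the DSpin implementation: the synonym dictionary is modeled as a hypothetical mapping from a word to a set of synonyms, pages are reduced to word sets, and normalizing the overlap by the size of the first page is an assumption.

```python
def expand(words, synonyms):
    """Union of the synonym sets of all given words."""
    result = set()
    for word in words:
        result |= synonyms.get(word, set())
    return result


def mutable_match_ratio(words_a, words_b, synonyms):
    """Fraction of page A's words matched in page B, directly or via synonyms."""
    set_a, set_b = set(words_a), set(words_b)
    if not set_a:
        return 0.0
    # Step 1: words common to both pages count toward the overlap.
    common = set_a & set_b
    overlap = len(common)
    remaining_a = set_a - common
    remaining_b = set_b - common
    for word in remaining_a:
        # Step 2: direct synonyms of the remaining word.
        first_level = synonyms.get(word, set())
        # Step 3: synonyms of those synonyms.
        second_level = expand(first_level, synonyms)
        if (first_level | second_level) & remaining_b:
            overlap += 1
    # Assumption: normalize by the number of distinct words in page A.
    return overlap / len(set_a)


if __name__ == "__main__":
    toy_synonyms = {
        "home": {"house"},
        "house": {"home", "residence"},
        "residence": {"house"},
    }
    page_a = "my home is big".split()
    page_b = "my residence is big".split()
    print(mutable_match_ratio(page_a, page_b, toy_synonyms))
```

The nested synonym lookups hint at why this check costs more than comparing immutables alone.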

34 Similarity A. Methods Explored Has a much higher overhead

36 Methodology A. Data Sets
Wikis
– Purchased a Fiverr job offering to create 15,000 legitimate backlinks
– Crawled the recent posts on each of the wikis: 37M pages for December 2012
GoArticles
– Allows users to build backlinks as "dofollow" links that can affect search engine page rankings
– Crawled over 1M articles posted between January 2012 and May 2013

37 Methodology B. Filters
Visible text
– Remove all pages that do not contain any visible text
Content tag
– Wiki: div labeled "bodyContent"
– GoArticles: div with "class=article"
– Remove any page that lacks this tag

38 Methodology B. Filters
Word count
– Discard small pages
– Threshold of 50 words
Link density
– Discard pages with an unusually high link density
Foreign text
– Only evaluate the immutable method on pages with mostly English text
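
The word count, link density, and foreign text filters above could look roughly like the Python sketch below. Only the 50-word threshold comes from the slides. The link density definition (links per word), its cutoff of 0.2, and the use of an ASCII-word ratio as a stand-in for "mostly English" are all assumptions chosen for illustration, as is the function name `passes_filters`.

```python
def passes_filters(visible_text, link_count,
                   min_words=50,            # threshold stated on the slide
                   max_link_density=0.2,    # hypothetical cutoff
                   min_english_ratio=0.8):  # hypothetical cutoff
    """Return True if a page survives the three content filters."""
    words = visible_text.split()
    # Word count: discard small pages.
    if len(words) < min_words:
        return False
    # Link density: assumed here to mean links per word.
    if link_count / len(words) > max_link_density:
        return False
    # Foreign text: crude proxy that treats ASCII-only words as English.
    ascii_words = sum(1 for w in words if w.isascii())
    return ascii_words / len(words) >= min_english_ratio


if __name__ == "__main__":
    page = "word " * 60
    print(passes_filters(page, link_count=2))
```

A real pipeline would more likely use a language detector than an ASCII check; the proxy just keeps the sketch self-contained.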

39 Methodology C. Inverted Indexing
Definition
– id: a unique index corresponding to an article
– immu: an immutable that occurs in id
– Each group represents all document ids that contain the immutable
– The total number of immutables that overlap between id i and id j

40 Methodology C. Inverted Indexing
– Calculate the similarity score between every pair of pages
– Set the threshold to 75%
– > 2 articles
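
A small in-memory sketch of the inverted-index idea: group document ids by immutable, count shared immutables for every pair of ids that co-occur in a group, and keep pairs whose score reaches the 75% threshold. The talk reports running on Hadoop and Pig; this single-machine Python version, the Jaccard-style score, and the names `build_inverted_index` and `similar_pairs` are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_inverted_index(doc_immutables):
    """Map each immutable to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, immus in doc_immutables.items():
        for immu in immus:
            index[immu].add(doc_id)
    return index


def similar_pairs(doc_immutables, threshold=0.75):
    """Return sorted (id_i, id_j) pairs whose immutable overlap meets the threshold."""
    index = build_inverted_index(doc_immutables)
    overlap = defaultdict(int)
    # Only pairs that share at least one immutable are ever touched.
    for doc_ids in index.values():
        for i, j in combinations(sorted(doc_ids), 2):
            overlap[(i, j)] += 1
    pairs = []
    for (i, j), shared in overlap.items():
        union = len(doc_immutables[i]) + len(doc_immutables[j]) - shared
        # Assumption: score is shared / union, compared with >=.
        if shared / union >= threshold:
            pairs.append((i, j))
    return sorted(pairs)


if __name__ == "__main__":
    docs = {
        1: {"paris", "2012", "acme", "xyz"},
        2: {"paris", "2012", "acme", "xyz"},
        3: {"paris", "rome", "berlin", "oslo"},
    }
    print(similar_pairs(docs))
```

The index avoids comparing pages that share no immutables at all, which matters when the crawl holds millions of pages.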

41 Methodology D. Clustering
Graph representation
– Each page (id) is a node
– Each pair of pages with a similarity score above the threshold has an edge
Each connected subgraph represents a cluster
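
The clustering step amounts to finding connected components in the similarity graph. The sketch below uses an iterative depth-first search; the function name `clusters` and the input format (a list of id pairs that passed the similarity threshold) are assumptions for this example.

```python
from collections import defaultdict


def clusters(pairs):
    """Group page ids into connected components of the similarity graph."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    seen = set()
    result = []
    for start in sorted(graph):
        if start in seen:
            continue
        # Iterative depth-first search from an unvisited node.
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(graph[node] - seen)
        result.append(sorted(component))
    return result


if __name__ == "__main__":
    print(clusters([(1, 2), (2, 3), (4, 5)]))
```

Pages 1 and 3 end up in the same cluster even without a direct edge, because both are similar to page 2.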

42 Methodology E. Exact Duplicates and Near Duplicates
Exact duplicates
– Use a hash over each page (MD5 sum)
– Two articles are identical if their MD5 sums match
Near duplicates
– Use the mutable verifier
– 100% mutable match, but with mismatching MD5 sums
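
Exact-duplicate detection by MD5 sum can be sketched with Python's standard `hashlib` module. The function name `exact_duplicate_groups` and the choice to hash UTF-8 encoded page text are assumptions; the slides do not say which bytes of the page are hashed.

```python
import hashlib
from collections import defaultdict


def exact_duplicate_groups(pages):
    """Group page ids whose text has the same MD5 digest."""
    by_digest = defaultdict(list)
    for page_id, text in pages.items():
        # Assumption: the hash is taken over the UTF-8 encoded page text.
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        by_digest[digest].append(page_id)
    # Only digests shared by more than one page indicate duplicates.
    return [sorted(ids) for ids in by_digest.values() if len(ids) > 1]


if __name__ == "__main__":
    sample = {1: "same article", 2: "same article", 3: "another article"}
    print(exact_duplicate_groups(sample))
```

Pages that pass the mutable verifier at 100% but fall into different digest groups would then be the near duplicates.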

43 Methodology E. Exact Duplicates and Near Duplicates
For example
– The English professor
Synonym dictionary
– {The|…} {The English professor|…}
Ideal
– is a mutable phrase
In fact
– will be marked as mutable
– will be marked as immutable

44 Methodology F. Hardware
– 24 physical nodes running Fedora Core 14
– Each node has a single Xeon X3470 quad-core 2.93 GHz CPU and 24 GB of memory
– Runs Hadoop and Pig jobs

46 Spinning In The Wild A. Volume
Wiki
– 68.0% is SEO spam
– 35.6% is spun content
GoArticles
– Drastically less spun content (7.0%) than the wiki data set

47 Spinning In The Wild B. False Positives
False positives
– Two articles that appear in the same cluster but are unrelated
Randomly sampled 99 clusters and chose 2 pages from each one
– Found no evidence of false positives

48 Spinning In The Wild C. Cluster Sizes
[Figure panels: Wiki data set, GoArticles data set]

49 Spinning In The Wild D. Content
Most of the popular words appear to relate to sales and services

50 Spinning In The Wild E. Domains
1. Spun content across domains
– The average cluster spans 12 ± 27 domains
– Spammers target multiple domains when posting spun content, instead of a single site

51 Spinning In The Wild E. Domains
The results indicate a strong, positive correlation between larger-scale spinning campaigns and a larger number of targeted domains

52 Spinning In The Wild E. Domains
2. Spun content per domain
– The bulk of the distribution falls where domains have 15% to 65% spun content

53 Spinning In The Wild F. Timing
Wiki
– 75% of duration <= 1 day
– 50% of duration <= 3 days

54 Spinning In The Wild G. Backlinks
Wiki
– Links occur on 99.97% ± 1.41% of pages per cluster on average

55 Spinning In The Wild G. Backlinks
GoArticles
– Larger spinning campaigns generally target a smaller set of unique backlinks and domains than the number of pages

56 Spinning In The Wild H. GoArticles as Seed Pages
The majority of cross-domain clusters contain many wiki pages (31.6 on average), compared with just 1.2 GoArticles pages on average

57 Spinning In The Wild H. GoArticles as Seed Pages

59 Discussion
Response of spammers
– Change the dictionary frequently
– Tools could compute spun content remotely
Future work
– Other spinning tools or human-generated spun content

61 Conclusion
– Proposed a method for detecting automatically spun content on the Web
– Implemented a tool, DSpin
– DSpin operates on sets of crawled Web pages to identify spun content

62 Q & A