DSPIN: Detecting Automatically Spun Content on the Web Qing Zhang, David Y. Wang, Geoffrey M. Voelker University of California, San Diego 1.

Slides:



Advertisements
Similar presentations
Automatic Timeline Generation from News Articles Josh Taylor and Jessica Jenkins.
Advertisements

Google News Personalization: Scalable Online Collaborative Filtering
Crawling, Ranking and Indexing. Organizing the Web The Web is big. Really big. –Over 3 billion pages, just in the indexable Web The Web is dynamic Problems:
Pete Bohman Adam Kunk.  Introduction  Related Work  System Overview  Indexing Scheme  Ranking  Evaluation  Conclusion.
1 Presented By Avinash Gutte Under The Guidance of Mrs. Hemangi Kulkarni Department of Computer Engineering Pimpri-Chinchwad College of Engineering, Pune.
Ao-Jan Su † Y. Charlie Hu ‡ Aleksandar Kuzmanovic † Cheng-Kok Koh ‡ † Northwestern University ‡ Purdue University How to Improve Your Google Ranking: Myths.
Explorations in Tag Suggestion and Query Expansion Jian Wang and Brian D. Davison Lehigh University, USA SSM 2008 (Workshop on Search in Social Media)
Information Retrieval in Practice
Search Engines and Information Retrieval
Xyleme A Dynamic Warehouse for XML Data of the Web.
Aki Hecht Seminar in Databases (236826) January 2009
Mastering the Internet, XHTML, and JavaScript Chapter 7 Searching the Internet.
Detecting Near Duplicates for Web Crawling Authors : Gurmeet Singh Mank Arvind Jain Anish Das Sarma Presented by Chintan Udeshi 6/28/ Udeshi-CS572.
INFO 624 Week 3 Retrieval System Evaluation
Tools and Services for the Long Term Preservation and Access of Digital Archives Joseph JaJa, Mike Smorul, and Sangchul Song Institute for Advanced Computer.
What is adaptive web technology?  There is an increasingly large demand for software systems which are able to operate effectively in dynamic environments.
distributed web crawlers1 Implementation All following experiments were conducted with 40M web pages downloaded with Stanford’s webBase crawler in Dec.
Overview of Search Engines
SEO PACKAGES. Types of Plans Starter Plan Business Plan Enterprises Plan.
WEB SCIENCE: SEARCHING THE WEB. Basic Terms Search engine Software that finds information on the Internet or World Wide Web Web crawler An automated program.
 Search engines are programs that search documents for specified keywords and returns a list of the documents where the keywords were found.  A search.
COMPUTER TERMS PART 1. COOKIE A cookie is a small amount of data generated by a website and saved by your web browser. Its purpose is to remember information.
Search Engine Optimization. What is SEO? Search engine optimization (SEO) is the process of improving the visibility of a website or a web page in search.
An innovative platform to allow translation and indexing of internet sites Localization World
SEO Webinar - With Neil Palmer of IM3.co.uk In Partnership with Huddlebuy How do I improve my website traffic with SEO? Covering: What is SEO? Why is SEO.
SEO PLAN Presented By Mangesh Dolse. Lead Management Tool( Sample)
Search Engine Optimization. Introduction SEO is a technique used to optimize a web site for search engines like Google, Yahoo, etc. It improves the volume.
DETECTING NEAR-DUPLICATES FOR WEB CRAWLING Authors: Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma Presentation By: Fernando Arreola.
Detecting Near-Duplicates for Web Crawling Manku, Jain, Sarma
Search Engines and Information Retrieval Chapter 1.
CS621 : Seminar-2008 DEEP WEB Shubhangi Agrawal ( )‏ Jayalekshmy S. Nair ( )‏
1 DSpin: Detecting Automatically Spun Content on the Web Speaker : Ting Luo 2014/05/26 Qing Zhang, David Y. Wang, Geoffrey M. Voelker University of California,
Basic Web Applications 2. Search Engine Why we need search ensigns? Why we need search ensigns? –because there are hundreds of millions of pages available.
1 A Static Analysis Approach for Automatically Generating Test Cases for Web Applications Presented by: Beverly Leung Fahim Rahman.
Web Categorization Crawler Mohammed Agabaria Adam Shobash Supervisor: Victor Kulikov Winter 2009/10 Design & Architecture Dec
Our Process Project Summary Website Audit This option is an update and one time optimization of your existing website. We will modify and update your.
Detecting Semantic Cloaking on the Web Baoning Wu and Brian D. Davison Lehigh University, USA WWW 2006.
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
UOS 1 Ontology Based Personalized Search Zhang Tao The University of Seoul.
When Experts Agree: Using Non-Affiliated Experts To Rank Popular Topics Meital Aizen.
Cloak and Dagger: Dynamics of Web Search Cloaking David Y. Wang, Stefan Savage, and Geoffrey M. Voelker University of California, San Diego 左昌國 Seminar.
Collusion-Resistance Misbehaving User Detection Schemes Speaker: Jing-Kai Lou 2015/10/131.
Internet Information Retrieval Sun Wu. Course Goal To learn the basic concepts and techniques of internet search engines –How to use and evaluate search.
DC 2004 Metadata Generation and Accessibility Auditing Liddy Nevile La Trobe University, Australia Mail
Q2Semantic: A Lightweight Keyword Interface to Semantic Search Haofen Wang 1, Kang Zhang 1, Qiaoling Liu 1, Thanh Tran 2, and Yong Yu 1 1 Apex Lab, Shanghai.
GUIDED BY DR. A. J. AGRAWAL Search Engine By Chetan R. Rathod.
Search Worms, ACM Workshop on Recurring Malcode (WORM) 2006 N Provos, J McClain, K Wang Dhruv Sharma
2015/12/121 Extracting Key Terms From Noisy and Multi-theme Documents Maria Grineva, Maxim Grinev and Dmitry Lizorkin Proceeding of the 18th International.
Search Engine using Web Mining COMS E Web Enhanced Information Mgmt Prof. Gail Kaiser Presented By: Rupal Shah (UNI: rrs2146)
Chapter. 3: Retrieval Evaluation 1/2/2016Dr. Almetwally Mostafa 1.
Don’t Follow me : Spam Detection in Twitter January 12, 2011 In-seok An SNU Internet Database Lab. Alex Hai Wang The Pensylvania State University International.
GROUP PresentsPresents. WEB CRAWLER A visualization of links in the World Wide Web Software Engineering C Semester Two Massey University - Palmerston.
1 Discovering Web Communities in the Blogspace Ying Zhou, Joseph Davis (HICSS 2007)
Heat-seeking Honeypots: Design and Experience John P. John, Fang Yu, Yinglian Xie, Arvind Krishnamurthy and Martin Abadi WWW 2011 Presented by Elias P.
Why You Should Optimize Your Website Content. Optimizing a website's content, in order to obtain a high search engine ranking is what Search Engine Optimization.
Search Engine Optimization Miami (SEO Services Miami in affordable budget)
Internet Marketing Company How To Find Diamond From Clutter Internet marketing is the usage of the World Wide Web to provide online shopping or advertisement.
Search Engine Optimization (SEO) Presentation By Celina Jonesi Small Business Seo – KG Tech.
Data mining in web applications
How To Market Disaster Restoration Services in The Internet Era
Information Retrieval in Practice
Search Engine Optimization(S.E.O)
Evaluation Anisio Lacerda.
Search Engine Architecture
Joseph JaJa, Mike Smorul, and Sangchul Song
IDENTIFICATION OF DENSE SUBGRAPHS FROM MASSIVE SPARSE GRAPHS
Prepared by Rao Umar Anwar For Detail information Visit my blog:
Best SEO Tips to Make Your Website Stand Out. SEARCH ENGINE OPTIMIZATION It is essential that you implement Search Engine Optimization strategies to make.
湖南大学-信息科学与工程学院-计算机与科学系
Agenda What is SEO ? How Do Search Engines Work? Measuring SEO success ? On Page SEO – Basic Practices? Technical SEO - Source Code. Off Page SEO – Social.
Presentation transcript:

DSPIN: Detecting Automatically Spun Content on the Web Qing Zhang, David Y. Wang, Geoffrey M. Voelker University of California, San Diego 1

What is Spinning? A Black Hat Search Engine Optimization (BHSEO) technique that rewords original content to avoid duplicate detection Typically an article (seed) is spun multiple times creating N versions of the article that will be posted on N different sites Artificially generate interest to increase search result rankings of targeted site 2

Spinning Example 3

Spinning Approaches Human Spinning Hire a real person from an online marketplace (i.e. Fiverr, Freelancer) to spin manually Pros: – Reasonable text readability Cons: – Expensive ($2-8 / hr) – Not scalable (humans) Automated Spinning Run software to spin automatically Pros: – Fast – Cheap ($5) – Scalable (500 articles / job) – Minimal human interaction Cons: – Can read awkwardly 4

Spinning in BHSEO 5 SEO Software Start with a seed article and SEO Software

Spinning in BHSEO 6 SEO Software SEO Software submits the article to spinner (TBS)

Spinning in BHSEO 7 SEO Software TBS spins the article and verifies plagiarism detection fails

Spinning in BHSEO 8 SEO Software SEO Software receives spun article

Spinning in BHSEO 9 SEO Software Proxies User Generated Content SEO Software posts articles on User Generated Content through proxies

Spinning in BHSEO 10 SEO Software Proxies User Generated Content Search Engine Search Engine consumes user generated content

Goals Understand the current state of automated spinning software using one of the most popular spinners (The Best Spinner) Develop techniques to detect spinning using immutables + mutables Examine spinning on the Web using Dspin, our system to identify automatically spun content 11

The Best Spinner (TBS) TBS consists of two parts – Program (binary): provides the user interface – Synonym dictionary: a homemade, curated list of synonyms that are updated weekly Replaces text with synonyms from dictionary We extract the synonym dictionary through reverse engineering the binary 12

TBS Example 13

Immutables + Mutables An article is composed of immutables (NOT IN dictionary) and mutables (IN dictionary) 14

Spinning Detection Algorithm Immutables detection computes the ratio of shared immutables between two pages Works well in practice except in corner case where there are few immutables to compare Mutables detection computes the ratio of all shared words after two levels of recursively expanding synonyms Also works well and handles corner case, but expensive 15

Other Approaches Duplicate content detection is a well known problem for Search Engines Explored other approaches: – Hashes of substrings [Shingling] – Parts of speech [Natural Language Processing] Spinning is designed to circumvent these approaches (i.e. replace every Nth word, synonym phrases) 16

Validation Setup controlled experiment using TBS 600 article test data set – Started with 30 seed articles 5 articles from 5 different article directories 5 articles randomly chosen from Google News – Each article spun 20 times w/ bulk spin option Immutables detects all spun content and matches with the source 17

DSpin Detection from Search Engine POV – Input: set of article pages crawled from the Web – Output: set of pages flagged as auto spun Build graph of clusters of “similar” pages using immutables + mutables approach – Each page represents a node – Create edges between pairs of nodes using immutables, verify edges using mutables – Each connected components is cluster 18

Results Ran DSpin on a real life data set – Set of 797 abused wikis – Crawl each wiki daily for newly posted articles – Collected 1.23M Articles from Dec 2012 Address the following questions: – Is spinning a problem in the wild? – Can we characterize spinning behavior? 19

Filtering 20 Filter out pages that are: non-English, exact duplicates, < 50 words, or primarily links 225K spun pages remaining. Spinning is for real.

Wiki Content 21 Spinning campaigns target business + marketing terms

Cluster Size 12.7K clusters from 225K spun pages 22 Moderate clusters of spun articles in abused wikis

Timing Duration 23 Duration reveals how long a campaign lasts Compute by extracting dates, max – min Most campaigns occur in bursts.

Conclusion Proposed + evaluated a spinning detection algorithm based on immutables + mutables that Search Engines can implement Demonstrated the algorithm's applicability on a real life data set (abused wikis) Characterized the behavior of at least one slice of the Web where spun articles thrive 24

Thank You! Q&A 25

TBS Coverage Only one synonym dictionary was used to implement DSpin, is this system still applicable widely (i.e. for other spinners)? – We had no prior knowledge about how articles from abused wikis were spun – Yet we still detected spun articles 26

Synonym Dictionary Churn How much does the synonym dictionary change over time? – We re-fetched synonym dictionary four months after the initial study and found that 94% of terms remain the same – Furthermore, DSpin detected spun articles posted months prior 27

Synonyms in the Cloud What if the spinner stores the synonym dictionary in the cloud? – There is an operational cost for the spinner (network bandwidth == $$$) – Can still reconstruct synonym dictionary through controlled experiments (i.e. submitting our own articles for spinning) 28

Scalability How can Search Engines implement the immutables algorithm? – Assume Search Engines already perform duplicate content detection – Can think of immutables approach as performing duplicate content detection on the immutables portion of the pages (a subset of what is already currently done) 29