The Splog Detection Task and A Solution Based on Temporal and Link Properties. Yu-Ru Lin, Wen-Yen Chen, Xiaolin Shi, Richard Sia, Xiaodan Song, Yun Chi, Koji Hino, Hari Sundaram, Jun Tatemura and Belle Tseng. Presenter: Belle Tseng. NEC Laboratories America, Cupertino, CA.
Problem statement
Goal: combat spam in the blogosphere. What are splogs? How to detect splogs? How to evaluate anti-splog techniques?
Approach: splog detection task and solution. (1) Identify the unique characteristics of splogs. (2) Propose a time-sensitive online detection task: a comparative evaluation framework on the TREC dataset that also captures the unique characteristics of splogs. (3) Propose a splog detection technique based on temporal and link properties.
WWW 2006, May 26, 2018
Outline of the talk: Introduction; Splog detection task; Our detection method; Data pre-processing & annotation; Experiment results; Concluding remarks
Introduction: Motivation; Related work; What are splogs?
Motivation
Splog (spam + blog): a new and serious problem in the blogosphere!
Splogs are polluting the blogosphere: 10-20% of blogs are splogs [1]; an average of 44 of the top 100 blog search results in three popular blog search engines came from splogs [1]; 75% of new pings came from splogs, and more than 50% of claimed blogs pinging weblogs.com are splogs [2].
These statistics show the serious problems caused by splogs, including the degradation of information retrieval quality and a tremendous waste of network and storage resources.
Research issues: What are splogs? (No concrete definition!) How to detect splogs? (Splogs are different from web spam!) How to evaluate anti-splog techniques? (A comparative evaluation framework on the TREC dataset that captures the unique characteristics of splogs.)
[1] Umbria (2006) SPAM in the blogosphere
[2] P. Kolari (2005) Welcome to the Splogosphere
Related work
Web spam detection. Content analysis [Ntoulas06]: statistical properties in content. Link analysis [Gyongyi05]: spam mass estimation.
Splog detection [Kolari06]: applies web spam detection and topic identification techniques to splog detection.
However, splogs are different…
[Ntoulas06] Detecting Spam Web Pages through Content Analysis. Observation: spam and non-spam pages have different statistical properties in content. Method: use the statistical properties that differentiate spam from non-spam as features, such as the number of words in the page or page title, the amount of anchor text, and the fraction of visible content; combine multiple features with a well-known classification technique (a decision tree classifier); improve by bagging and boosting (to determine the better features).
[Kolari06] follows a similar intuition but applies it to blogs instead of websites. They first define several content features such as anchor text, select features by mutual information, and classify with an SVM.
[Gyongyi04] Combating Web Spam with TrustRank. Intuition: good pages seldom point to bad ones, so they receive more score from trusted sites through PageRank. Method (builds on PageRank): select a set of seeds (e.g. with high PageRank); manually label the seeds as spam or non-spam and give them (uniform) trust scores; propagate trust scores from the seeds to all sites with the PageRank algorithm. TrustRank can be used to (1) filter out spam or (2) demote spam in the ranking. The experiment shows that TrustRank (1) removes most of the spam from the top-scored sites but (2) does not remove spam from low-scored sites.
Example (1): keyword stuffing
Example (2): stolen content. Traditional content analysis is not enough!
Example (3): link farm
Example (4): via trackback links. Traditional link analysis is not enough!
What are splogs?
Splog: a blog created by an author who has the intention of spamming. NOTE: a blog having comment spam or trackback spam is not considered a splog.
Figure legend: S: splog; W: affiliate website; Ads/ppc: profitable mechanism.
Key points: (1) motive: a profitable mechanism; (2) schemes: increase visibility via search engines by boosting relevancy, popularity, or recency, or attack regular blogs; (3) the increased visibility is unjustifiable.
Figure 1 illustrates the overall scheme taken by splog creators. Their motive is to drive visitors to affiliated sites (including the splog itself) that have some profitable mechanism. By profitable mechanism, we refer to web-based business methods, such as search engine advertising programs (e.g. Google AdSense) or pay-per-click (ppc) affiliate programs. Spammers use several schemes to increase the visibility of splogs by getting them indexed with high ranks on popular search engines. To deceive the search engine, the spammer may boost (1) relevancy (e.g. via keyword stuffing), (2) popularity (e.g. via a link farm), or (3) recency (e.g. via frequent posts), based on the ranking criteria used by search engines. The increased visibility is unjustifiable since the content in splogs is often nonsense or stolen from other sites [1]. The spammer also attacks regular blogs through comments and trackbacks to boost the splog ranking.
This is the working definition. It is still vague, so let's discuss the characteristics of splogs.
Characteristics of splogs
Typical characteristics: machine-generated content; no value-addition; hidden agenda, usually an economic goal.
Uniqueness of splogs: dynamic content; non-endorsement links.
Machine-generated content: splog entries are generated automatically and are usually nonsense, gibberish, repetitive, or copied from other blogs or websites.
No value-addition: splogs provide useless or no unique information to their readers. Some blogs use automatic content aggregation techniques to provide a useful service such as podcasting; these are legitimate blogs because of their value addition.
Hidden agenda, usually an economic goal: splogs have a commercial intention that is revealed by affiliate ads or out-going links to affiliate sites.
Dynamic content: blog readers are mostly interested in recent entries. Unlike web spam, where the content is static, a splog continuously generates fresh content to drive traffic.
Non-endorsement link: a hyperlink is often interpreted as an endorsement of another page. A web spam page is unlikely to get endorsements from normal sites. However, since spammers can create hyperlinks (comment links or trackbacks) in normal blogs, links in blogs cannot simply be treated as endorsements.
Splog detection is different from web spam detection!
Task Definition: Framework; Traditional IR-based evaluation; Proposed online evaluation
Framework
A splog detector for blog search engines. Blog search engines differ from web search engines in their growing content (feeds), so time is crucial.
Entries become available gradually, so there is a time delay to gather enough evidence.
A splog persists in the index with growing content, so it should be detected as soon as possible.
How fast is the detector? It has to make a decision with less evidence.
b1, b2, b3…: downloaded blogs; e1, e2, e3…: downloaded entries.
Detection tasks
Traditional IR-based evaluation.
With ground truth: K-fold cross-validation. Performance measures: precision/recall, AUC, ROC plot, etc.
Without ground truth: the performance measure is average precision at top N of the ranked list, based on pooling of multiple detection lists.
Task type by dataset:
Dataset | Offline (Traditional) | Online (Time-Sensitive)
With Ground Truth | TASK 1 | TASK 3
Without Ground Truth | TASK 2 | TASK 4
Online evaluation
A framework to evaluate time-sensitive detection performance.
B(t_i): a partition consisting of the blogs discovered during t_{i-1} to t_i.
p_jk: detection performance at time t_j on the partition at t_k, B(t_k).
P_i: average performance for each delay i = j - k.
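The delay-averaged measure on this slide can be sketched in a few lines. This is a minimal illustration, assuming the per-pair scores p_jk are already available as a dictionary keyed by (j, k); the function name `average_by_delay` and the sample scores are hypothetical, and the slide does not say which performance measure fills p_jk.

```python
from collections import defaultdict

def average_by_delay(p):
    """Average p[(j, k)] over all pairs that share the same delay i = j - k.

    p maps (j, k) to the detection performance measured at time t_j on the
    partition B(t_k). Only j >= k makes sense: a partition cannot be tested
    before its blogs have been discovered.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for (j, k), score in p.items():
        if j < k:
            raise ValueError(f"time index {j} precedes partition index {k}")
        sums[j - k] += score
        counts[j - k] += 1
    return {delay: sums[delay] / counts[delay] for delay in sorted(sums)}

# Hypothetical scores: two partitions, each tested with zero delay,
# and the first partition tested again one step later.
scores = {(1, 1): 0.5, (2, 2): 0.75, (2, 1): 0.875}
print(average_by_delay(scores))
```

Reading P_i as a function of i then shows how much extra evidence (delay) a detector needs before its performance levels off.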
Detection Method: Baseline features; Temporal regularity; Link regularity
Baseline features
A subset of the content features presented in [Ntoulas06]. These features are widely used in content analysis.
In practice: extract features from 5 parts of a blog (tokenized URLs, blog and post titles, anchor text, blog homepage content, and post (entry) content); vectorize by word count, average word length, and a tf-idf vector; prune rarely used words; select features using Fisher linear discriminant analysis (LDA) to avoid over-fitting.
New features
Challenges: content-based methods suffer from more sophisticated content generation schemes; link-based methods suffer from the different semantics of links, and the link graph is more dynamic and incomplete.
Observation: splogs' motivation is different from that of normal, human-generated blogs!
Content: machine-generated posts. How to capture the characteristics of machine-generated content? Temporal regularity estimation.
Link: links drive traffic to a specific set of affiliate websites. How to capture the characteristics of specific linking targets? Link regularity estimation.
Temporal regularity (TCR)
Temporal content regularity (TCR) captures the similarity between growing contents. It is estimated by the autocorrelation of the content.
Similarity measure: histogram intersection distance, i.e. the distance between two posts (with k posts in between), based on the amount of common content of the two posts.
TCR: the autocorrelation.
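A rough sketch of the idea, not the authors' exact formula (the slide's equations did not survive extraction): each post becomes a normalized word histogram, two posts are compared by histogram intersection, and the autocorrelation at lag k is the mean similarity of posts that are k positions apart. The whitespace tokenization, the normalization, the reading of the lag as "k positions apart", and the names `histogram_intersection` and `tcr_autocorrelation` are all assumptions made for illustration.

```python
from collections import Counter

def histogram_intersection(post_a, post_b):
    """Shared mass of two normalized word histograms, in [0, 1].

    1 means identical word distributions and 0 means no common words.
    The corresponding distance is 1 minus this value.
    """
    hist_a = Counter(post_a.lower().split())
    hist_b = Counter(post_b.lower().split())
    total_a, total_b = sum(hist_a.values()), sum(hist_b.values())
    if total_a == 0 or total_b == 0:
        return 0.0
    common = hist_a.keys() & hist_b.keys()
    return sum(min(hist_a[w] / total_a, hist_b[w] / total_b) for w in common)

def tcr_autocorrelation(posts, lag):
    """Mean similarity between posts that are `lag` positions apart (assumed form)."""
    if lag < 1:
        raise ValueError("lag must be at least 1")
    pairs = list(zip(posts, posts[lag:]))
    if not pairs:
        return 0.0
    return sum(histogram_intersection(a, b) for a, b in pairs) / len(pairs)

# Hypothetical posts: a repetitive blog versus one with unrelated entries.
repetitive = ["buy cheap pills", "buy cheap pills now", "buy cheap pills"]
varied = ["more school", "things about me", "fun stuff as usual"]
print(tcr_autocorrelation(repetitive, 1), tcr_autocorrelation(varied, 1))
```

Under these assumptions, a blog whose posts keep reusing the same words scores close to 1 at every lag, while a blog with varied entries scores close to 0.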
TCR examples
Temporal regularity (TSR)
Temporal structural regularity (TSR) captures consistency in the timing of content creation. It is estimated by the entropy of the post-time difference distribution.
A hierarchical clustering method is used to obtain the distribution from which the blog entropy of post time is computed.
The blog entropy is normalized by the maximum observed blog entropy.
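A minimal sketch of an entropy-based regularity score follows. Two things here are assumptions rather than the slide's method: fixed-width binning of the post-time differences stands in for the hierarchical clustering step, and the score is taken as 1 minus the normalized entropy, chosen so that a perfectly regular poster scores 1. The names `interval_entropy` and `tsr`, and the bin width, are hypothetical.

```python
import math
from collections import Counter

def interval_entropy(post_times, bin_seconds=600):
    """Entropy of the distribution of gaps between consecutive posts.

    post_times: ascending timestamps in seconds.
    bin_seconds: width of the fixed bins that group similar gaps
    (a simple stand-in for hierarchical clustering).
    """
    gaps = [later - earlier for earlier, later in zip(post_times, post_times[1:])]
    if not gaps:
        return 0.0
    bins = Counter(int(gap // bin_seconds) for gap in gaps)
    n = len(gaps)
    return sum(-(count / n) * math.log(count / n) for count in bins.values())

def tsr(entropy, max_entropy):
    """Regularity score in [0, 1]: 1 minus entropy over the maximum observed entropy."""
    if max_entropy <= 0:
        return 1.0
    return 1.0 - entropy / max_entropy

# Hypothetical timestamps: one blog posts every 20 minutes, the other irregularly.
regular = [0, 1200, 2400, 3600, 4800]
irregular = [0, 500, 5000, 5100, 90000]
entropies = [interval_entropy(regular), interval_entropy(irregular)]
max_entropy = max(entropies)
print([tsr(e, max_entropy) for e in entropies])
```

With this convention the 20-minute poster gets a score of 1 and the blog with the most scattered gaps gets 0.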
TSR examples
A splog whose TSR=1 (title | post time):
buy viagra cheap | 9/27/2005 12:30
viagra substitute | 9/27/2005 12:50
buy viagra uk | 9/27/2005 13:10
viagra story | 9/27/2005 13:30
viagra levitra | 9/27/2005 13:50
viagra shop | 9/27/2005 14:10
viagra online pharmacy | 9/27/2005 14:30
where to buy viagra | 9/27/2005 14:50
viagra hgh | 9/27/2005 15:10
viagra picture | 9/27/2005 15:30
 | 12/15/2005 19:49
 | 12/15/2005 20:09
 | 12/15/2005 20:29
A normal blog whose TSR=0.0615 (title | post time):
shoot | 8/25/2005 18:34
more school | 8/30/2005 15:38
things that happened over the week | 9/3/2005 6:08
parteeeeee!! | 9/10/2005 7:14
haven't done this in a while... | 9/15/2005 17:06
I heart shoes. | 9/17/2005 7:09
things about me | 9/19/2005 17:35
mweep. | 9/23/2005 15:57
notes from Sarah to me. | 9/28/2005 18:39
ummmm... | 10/1/2005 18:00
this is what I've wanted to hear all season... | 10/7/2005 6:41
we got 3rd, but we're still cool | 10/11/2005 16:50
fun stuff. as usual. | 10/13/2005 19:25
……
Link regularity (LR)
LR captures consistency in the websites that blogs target. Splogs behave more consistently because their main intention is to drive traffic to affiliate websites, and affiliate websites are not authoritative to normal bloggers.
Linking behavior is analyzed using the HITS algorithm. LR: compute hub scores with out-link normalization.
Splogs target a focused set of websites, while normal blogs usually have more diverse targets.
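The hub-score idea can be illustrated with a small pure-Python HITS iteration. This is a sketch under stated assumptions, not the authors' implementation: "out-link normalization" is interpreted here as giving each blog a total link weight of 1 spread evenly over its distinct targets, scores are L2-normalized after every step, and the function name `link_regularity` and the toy graph are hypothetical.

```python
import math

def link_regularity(out_links, iterations=50):
    """Hub scores from HITS on a blog -> website graph.

    out_links maps each blog to the list of websites it links to.
    Assumption: each blog's outgoing weight sums to 1, spread evenly
    over its distinct targets (one reading of out-link normalization).
    """
    weights = {
        blog: {target: 1.0 / len(set(targets)) for target in set(targets)}
        for blog, targets in out_links.items() if targets
    }
    hubs = {blog: 1.0 for blog in out_links}
    for _ in range(iterations):
        # Authority of a website: weighted sum of the hub scores pointing at it.
        authority = {}
        for blog, row in weights.items():
            for target, weight in row.items():
                authority[target] = authority.get(target, 0.0) + weight * hubs[blog]
        norm = math.sqrt(sum(v * v for v in authority.values())) or 1.0
        authority = {t: v / norm for t, v in authority.items()}
        # Hub score of a blog: weighted sum of the authorities it points to.
        hubs = {
            blog: sum(w * authority[t] for t, w in weights.get(blog, {}).items())
            for blog in out_links
        }
        norm = math.sqrt(sum(v * v for v in hubs.values())) or 1.0
        hubs = {blog: v / norm for blog, v in hubs.items()}
    return hubs

# Toy graph: two splogs push the same affiliate site, one normal blog links widely.
graph = {
    "splog1": ["affiliate.example"],
    "splog2": ["affiliate.example"],
    "normal": ["news.example", "friend.example", "photos.example"],
}
print(link_regularity(graph))
```

In this toy graph the two blogs that concentrate on one shared target end up with the highest hub scores, while the blog with diverse targets drifts toward zero, which matches the slide's intuition about focused versus diverse linking.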
Classification
Binary classification: splog or normal blog. Use an SVM classifier with a radial basis function kernel.
Combine the baseline features (base-n) with the regularity features R (TCR, TSR, LR); the SVM outputs splog/non-splog.
Data-Preprocessing & Ground Truth: Annotation tool; Disagreement among annotators; Ground truth
Data. TREC dataset: 100,649 feeds. Removing duplicate feeds and feeds without a homepage or permalinks leaves 43.6K unique blogs. Most blogs are discovered in the first week, so the online experiment uses the blogs discovered in the first week.
Annotation (1). An interface for annotators. Five labels: (N) Normal, (S) Splog, (B) Borderline, (U) Undecided, (F) Foreign.
Annotation (2)
Disagreement among annotators: they agree more on normal blogs but less on near-splog blogs (S/B/U). Pooling? Splog recognition: conservative vs. aggressive.
Annotator N S B U F Total: Mr. C1 45 3 4 1 7 60; Ms. S1 37 16 6; Ms. S2 36 10; Mr. S 47 8; Mr. C2 44; Ms. L 48
Ground truth: 9240 blogs labeled (random and stratified sampling); 7905 labeled as normal, 525 labeled as splogs. Low splog percentage: some known splogs are pre-filtered. Focus on the 43.6K subset of blogs having both a homepage and entries.
Experimental Results: Offline detection; Online detection
Offline evaluation. base-n: n-dimensional baseline features; R+base-n: with temporal and link regularity features. AUC accuracy precision recall base-253 0.966 0.915 0.923 0.907 R+base-253 0.974 0.919 0.918 0.920 base-127 0.957 0.893 0.899 0.886 R+base-127 0.968 0.925 0.931 base-64 0.938 0.874 0.885 0.861 R+base-64 0.948 0.908 0.895 base-32 0.834 0.837 0.831 R+base-32 0.921 0.870 0.883 0.851 R 0.814 0.696 0.860 0.469
Online experiment [Figure: timeline with the labels Week 1, Week 2, Week 7, testing period, and linking graph]
Online evaluation. Without sufficient content data, the regularity features provide a significant boost to the performance.
Summary
Splogs are a new and serious problem in the blogosphere. Detection of splogs is different from web spam detection.
Identifying new detection tasks: the online evaluation measures how quickly a detector can identify splogs.
Introducing useful and unique features of blogs/splogs: temporal and link regularity measures.
Annotation: the guideline and tool help reduce annotation effort.