The Splog Detection Task and A Solution Based on Temporal and Link Properties
Yu-Ru Lin, Wen-Yen Chen, Xiaolin Shi, Richard Sia, Xiaodan Song, Yun Chi, Koji Hino, Hari Sundaram, Jun Tatemura and Belle Tseng
Presenter: Belle Tseng, NEC Laboratories America, Cupertino, CA

Problem statement
Goal: combat spam in the blogosphere
- What are splogs?
- How do we detect splogs?
- How do we evaluate anti-splog techniques?
Approach: a splog detection task & solution
- Identify the unique characteristics of splogs
- Propose a time-sensitive online detection task: a comparative evaluation framework on the TREC dataset that also captures the unique characteristics of splogs
- Propose a splog detection technique based on temporal & link properties

Outline of the talk
- Introduction
- Splog detection task
- Our detection method
- Data pre-processing & annotation
- Experimental results
- Concluding remarks

Introduction
- Motivation
- Related work
- What are splogs?

Splog (spam + blog): a new and serious problem in the blogosphere!

Motivation: splogs are polluting the blogosphere
- 10-20% of blogs are splogs [1]
- On average, 44 of the top 100 blog search results in three popular blog search engines came from splogs [1]
- 75% of new pings came from splogs; more than 50% of claimed blogs pinging weblogs.com are splogs [2]
These statistics show the serious problems caused by splogs, including the degradation of information retrieval quality and the tremendous waste of network and storage resources.

Research issues
- What are splogs? There is no concrete definition yet.
- How do we detect splogs? Splogs are different from web spam!
- How do we evaluate anti-splog techniques? We need a comparative evaluation framework, on the TREC dataset, that captures the unique characteristics of splogs.

[1] Umbria (2006) SPAM in the blogosphere
[2] P. Kolari (2005) Welcome to the Splogosphere

Related work
Web spam detection:
- Content analysis [Ntoulas06]: statistical properties of page content
- Link analysis [Gyongyi05]: spam mass estimation
Splog detection:
- [Kolari06]: applies web spam detection and topic identification techniques to splog detection
(Speaker's note: I don't want to mention [Kolari06] because I don't think it is significant, but the UMBC team will be there. Should we mention this work?)

[Ntoulas06] Detecting Spam Web Pages through Content Analysis
- Observation: spam and non-spam pages have different statistical properties in their content.
- Method: use statistical properties that differentiate spam from non-spam as features (e.g., number of words in the page or page title, amount of anchor text, fraction of visible content); combine multiple features with a well-known classification technique (a decision tree classifier); improve with bagging and boosting (to determine the better features).
- The intuition of [Kolari06] is similar, but applied to blogs instead of websites: define content features such as anchor text, select features by mutual information, and classify with an SVM.

[Gyongyi04] Combating Web Spam with TrustRank
- Intuition: good pages seldom point to bad ones, so pages accumulate more score from trusted sites through PageRank.
- Method (builds on PageRank): select a set of seeds (e.g., pages with high PageRank); manually label the seeds as spam or non-spam and give them (uniform) trust scores; propagate the trust scores from the seeds to all sites with the PageRank algorithm.
- TrustRank can be used to (1) filter out spam or (2) demote spam rankings. Experiments show TrustRank removes most of the spam from the top-scored sites, but does not remove spam from low-scored sites.

However, splogs are different…
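Since the note above spells out the TrustRank steps, a minimal sketch may help: biased PageRank over a toy adjacency list. All names, the toy graph, and the parameters are illustrative assumptions, not taken from [Gyongyi04].

```python
import numpy as np

def trustrank(adj, seed_trust, beta=0.85, iters=50):
    """Propagate trust from manually labeled seed pages via biased PageRank.

    adj[i] lists the pages that page i links to; seed_trust is nonzero
    only on the trusted seeds and sums to 1.
    """
    t = seed_trust.copy()
    for _ in range(iters):
        nxt = (1.0 - beta) * seed_trust        # teleport back to the seed set
        for i, outs in enumerate(adj):
            if outs:                           # split page i's trust over its out-links
                share = beta * t[i] / len(outs)
                for j in outs:
                    nxt[j] += share
            else:                              # dangling page: trust returns to seeds
                nxt += beta * t[i] * seed_trust
        t = nxt
    return t

# Toy graph: pages 0 and 1 are trusted seeds; page 3 receives no in-links
# from trusted pages, so it ends up with almost no trust.
adj = [[1, 2], [2], [0], [0]]
seeds = np.array([0.5, 0.5, 0.0, 0.0])
print(trustrank(adj, seeds))
```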

Example (1): keyword stuffing

Example (2): stolen content
Traditional content analysis is not enough!

Example (3): link farm

Example (4): via trackback links
Traditional link analysis is not enough!

What are splogs?
Splog: a blog created by an author with the intention of spamming.
NOTE: a blog that merely receives comment spam or trackback spam is not considered a splog.

Figure 1 illustrates the overall scheme taken by splog creators (S: splog, W: affiliate website, Ads/ppc: profitable mechanism). Key points: (1) motive: a profitable mechanism; (2) schemes: increase visibility via search engines by boosting relevancy, popularity, or recency, or by attacking regular blogs; (3) the increased visibility is unjustifiable.

The spammers' motive is to drive visitors to affiliated sites (including the splog itself) that carry some profitable mechanism. By profitable mechanism, we refer to web-based business methods, such as search engine advertising programs (e.g., Google AdSense) or pay-per-click (ppc) affiliate programs. There are several schemes used by spammers to increase the visibility of splogs by getting them indexed with high ranks on popular search engines. To deceive the search engine, the spammer may boost (1) relevancy (e.g., via keyword stuffing), (2) popularity (e.g., via link farms), or (3) recency (e.g., via frequent posts), based on the ranking criteria used by search engines. The increased visibility is unjustifiable since the content of splogs is often nonsense or stolen from other sites [1]. The spammer also attacks regular blogs through comments and trackbacks to boost the splog's ranking.

This working definition is still vague, so let us discuss the characteristics of splogs.

Characteristics of splogs
Typical characteristics
- Machine-generated content: splog entries are generated automatically and are usually nonsense, gibberish, repetitive, or copied from other blogs or websites.
- No value-addition: splogs provide useless or no unique information to their readers. (Blogs that use automatic content aggregation to provide a useful service, such as podcasting, are legitimate because of their value addition.)
- Hidden agenda, usually an economic goal: splogs have a commercial intention that is revealed by affiliate ads or outgoing links to affiliate sites.
Uniqueness of splogs
- Dynamic content: blog readers are mostly interested in recent entries. Unlike web spam, whose content is static, a splog continuously generates fresh content to drive traffic.
- Non-endorsement links: a hyperlink is often interpreted as an endorsement of another page, and web spam rarely gets endorsements from normal sites. However, since spammers can create hyperlinks (comment links or trackbacks) in normal blogs, links in blogs cannot simply be treated as endorsements.
Splog detection is therefore different from web spam detection!

Task Definition
- Framework
- Traditional IR-based evaluation
- Proposed online evaluation

Framework
A splog detector for blog search engines, which differ from web search engines in their growing content (feeds). So, time is crucial:
- Entries become available gradually, so there is a time delay before enough evidence is gathered.
- A splog persists in the index with growing content, so it should be detected as soon as possible.
- How fast is the detector? It must make a decision with less evidence.
(b1, b2, b3, …: downloaded blogs; e1, e2, e3, …: downloaded entries)

Detection tasks
Traditional IR-based evaluation:
- With ground truth: K-fold cross-validation; performance measures: precision/recall, AUC, ROC plot, etc.
- Without ground truth: performance measure is the average precision at top N of the ranked list, based on pooling of multiple detection lists.

Task type / Dataset     Offline (Traditional)   Online (Time-Sensitive)
With ground truth       TASK 1                  TASK 3
Without ground truth    TASK 2                  TASK 4
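Task 1 (offline, with ground truth) amounts to standard k-fold cross-validation; a minimal sketch with scikit-learn, using synthetic placeholder features and labels (the real features come later in the talk):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32))        # placeholder: one feature row per blog
y = rng.integers(0, 2, size=200)      # placeholder labels: 1 = splog, 0 = normal

clf = SVC(kernel="rbf")               # the classifier used later in the talk
scores = cross_val_score(clf, X, y, cv=10, scoring="roc_auc")
print("10-fold mean AUC: %.3f" % scores.mean())
```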

Online evaluation
A framework to evaluate time-sensitive detection performance:
- B(t_i): the partition consisting of blogs discovered during (t_{i-1}, t_i]
- p_{jk}: detection performance at time t_j on the partition B(t_k)
- P_i: the average performance for each delay i = j - k
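Written out (a reconstruction from the definitions above; averaging uniformly over all pairs at delay i is an assumption consistent with the slide):

```latex
p_{jk} = \text{detection performance measured at } t_j \text{ on } B(t_k), \quad j \ge k
\qquad
P_i = \frac{1}{\lvert\{(j,k) : j - k = i\}\rvert} \sum_{j - k = i} p_{jk}
```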

Detection Method
- Baseline features
- Temporal regularity
- Link regularity

Baseline features
A subset of the content features presented in [Ntoulas06]. In practice:
- Extract features from five parts of a blog: tokenized URLs, blog and post titles, anchor text, blog homepage content, and post (entry) content.
- Vectorize by word count, average word length, and a tf-idf vector.
- Prune rarely used words.
- Select features with Fisher linear discriminant analysis (LDA), to avoid over-fitting.
These features are widely used in content analysis.
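A minimal sketch of such a pipeline with scikit-learn; the toy documents, the particular Fisher-score formula, and the cutoff of 100 features are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def fisher_score(X, y):
    """Fisher criterion per feature: squared separation of the two class
    means, relative to the summed within-class variances."""
    X0, X1 = X[y == 0], X[y == 1]
    num = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
    den = X0.var(axis=0) + X1.var(axis=0) + 1e-12   # avoid division by zero
    return num / den

# docs: text gathered from the five parts of each blog, one string per
# blog; labels: 1 = splog, 0 = normal.  Both are toy placeholders.
docs = ["buy cheap viagra pills now buy", "my trip to the coast last weekend"]
labels = np.array([1, 0])

vec = TfidfVectorizer(min_df=1)        # on real data, min_df prunes rare words
X = vec.fit_transform(docs).toarray()
top = np.argsort(fisher_score(X, labels))[::-1][:100]   # keep best dimensions
X_selected = X[:, top]
```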

New features
Challenges
- Content-based methods suffer from increasingly sophisticated content generation schemes.
- Link-based methods suffer from the different semantics of blog links; the link graph is more dynamic and incomplete.
Observation: splogs' motivation is different from normal, human-generated blogs!
- Content: machine-generated posts. How do we capture the characteristics of machine-generated content? → temporal regularity estimation
- Link: traffic is driven to a specific set of affiliate websites. How do we capture the characteristics of specific linking targets? → link regularity estimation

Temporal content regularity (TCR)
- Captures the similarity between growing contents, estimated by the autocorrelation of the content.
- Similarity measure: the histogram intersection distance between two posts with k posts in between, i.e., the amount of common content in the two posts.
- TCR is the autocorrelation of this similarity as a function of the lag k.
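A rough sketch of the TCR idea in Python: reduce posts to word histograms, take the normalized histogram intersection as the pairwise similarity, and average it over all pairs of posts k apart. The paper's exact normalization is not reproduced; the function names and toy posts are illustrative:

```python
import numpy as np
from collections import Counter

def hist_intersection(p, q):
    """Normalized histogram intersection of two word-count histograms;
    identical posts score 1, disjoint posts score 0."""
    common = sum(min(c, q[w]) for w, c in p.items() if w in q)
    return common / max(1, min(sum(p.values()), sum(q.values())))

def tcr(posts, max_lag=5):
    """Mean self-similarity of a blog's posts at each lag k.  For
    machine-generated splogs the values stay high as k grows."""
    hists = [Counter(p.split()) for p in posts]
    return {k: float(np.mean([hist_intersection(hists[i], hists[i + k])
                              for i in range(len(hists) - k)]))
            for k in range(1, min(max_lag, len(hists) - 1) + 1)}

posts = ["buy viagra cheap", "viagra substitute cheap", "buy viagra uk"]
print(tcr(posts))   # near-duplicate posts yield high values at every lag
```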

TCR examples

Temporal structural regularity (TSR)
- Captures consistency in the timing of content creation.
- Estimated by the entropy of the post-time difference distribution, using a hierarchical clustering method.
- The blog's post-time entropy is normalized by the maximum observed blog entropy.
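A simplified sketch, assuming TSR = 1 minus the normalized entropy of inter-post gaps (consistent with the examples on the next slide, where a perfectly periodic splog gets TSR = 1). The paper clusters the gaps hierarchically and normalizes by the maximum entropy observed over all blogs; this sketch uses fixed histogram bins instead:

```python
import numpy as np

def tsr(post_times, bins=10):
    """Temporal structural regularity: 1 - normalized entropy of the
    distribution of inter-post time differences.  Perfectly periodic
    posting (a common splog signature) has zero entropy, so TSR = 1.
    post_times: posting timestamps in seconds (at least three posts)."""
    gaps = np.diff(np.sort(np.asarray(post_times, dtype=float)))
    counts, _ = np.histogram(gaps, bins=bins)
    p = counts[counts > 0] / counts.sum()
    entropy = -(p * np.log2(p)).sum()
    return 1.0 - entropy / np.log2(bins)

print(tsr([0, 1200, 2400, 3600, 4800]))   # a post every 20 minutes -> 1.0
print(tsr([0, 500, 4100, 9000, 26000]))   # irregular posting -> well below 1
```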

TSR examples

A splog whose TSR = 1:
title                     post time
buy viagra cheap          9/27/2005 12:30
viagra substitute         9/27/2005 12:50
buy viagra uk             9/27/2005 13:10
viagra story              9/27/2005 13:30
viagra levitra            9/27/2005 13:50
viagra shop               9/27/2005 14:10
viagra online pharmacy    9/27/2005 14:30
where to buy viagra       9/27/2005 14:50
viagra hgh                9/27/2005 15:10
viagra picture            9/27/2005 15:30
                          12/15/2005 19:49
                          12/15/2005 20:09
                          12/15/2005 20:29
……

A normal blog whose TSR = 0.0615:
title                                             post time
shoot                                             8/25/2005 18:34
more school                                       8/30/2005 15:38
things that happened over the week                9/3/2005 6:08
parteeeeee!!                                      9/10/2005 7:14
haven't done this in a while...                   9/15/2005 17:06
I heart shoes.                                    9/17/2005 7:09
things about me                                   9/19/2005 17:35
mweep.                                            9/23/2005 15:57
notes from Sarah to me.                           9/28/2005 18:39
ummmm...                                          10/1/2005 18:00
this is what I've wanted to hear all season...    10/7/2005 6:41
we got 3rd, but we're still cool                  10/11/2005 16:50
fun stuff. as usual.                              10/13/2005 19:25
……

Link regularity (LR)
- Captures consistency in the websites a blog targets: splogs target a focused set of websites, while normal blogs usually have more diverse targets.
- Splogs show more consistent behavior because their main intention is to drive traffic to affiliate websites, and affiliate websites are not authoritative to normal bloggers.
- Linking behavior is analyzed with the HITS algorithm: LR is computed from hub scores with out-link normalization.
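A sketch of the hub-score computation, assuming a bipartite blog-to-website link graph. The out-link normalization used here (dividing each blog's contribution by its out-degree) is one plausible reading of the slide, and all names are illustrative:

```python
import numpy as np

def link_regularity(out_links, n_sites, iters=30):
    """HITS-style hub scores over the blog-to-website link graph, with
    each blog's contribution normalized by its out-degree so that merely
    linking a lot does not raise the score.

    out_links[b] lists the website ids blog b links to.  A blog's hub
    score is high when it consistently points at heavily co-targeted
    sites, as splogs pointing at a focused set of affiliate sites do.
    """
    hubs = np.ones(len(out_links))
    auth = np.ones(n_sites)
    for _ in range(iters):
        auth[:] = 0.0
        for b, outs in enumerate(out_links):       # authority update
            for s in outs:
                auth[s] += hubs[b] / len(outs)     # out-link normalization
        auth /= np.linalg.norm(auth) or 1.0
        for b, outs in enumerate(out_links):       # hub update
            hubs[b] = sum(auth[s] for s in outs)
        hubs /= np.linalg.norm(hubs) or 1.0
    return hubs

# Blogs 0 and 1 hammer the same two affiliate sites; blog 2 links elsewhere.
print(link_regularity([[0, 1], [0, 1], [2, 3]], n_sites=4))
```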

Classification
- Binary classification: splog or normal blog.
- Use an SVM classifier with a radial basis function (RBF) kernel.
- Combine the baseline features (base-n) with the regularity features R = (TCR, TSR, LR); the SVM outputs splog/non-splog.
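A minimal sketch of the combined classifier, assuming scikit-learn and random placeholder data. Feature scaling is added here as common practice for RBF SVMs; the slide only specifies the kernel:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_base = rng.random((100, 32))                  # placeholder base-32 content features
tcr, tsr, lr = rng.random((3, 100))             # placeholder regularity features R
y = rng.integers(0, 2, size=100)                # placeholder labels: 1 = splog

X = np.column_stack([X_base, tcr, tsr, lr])     # the "R + base-n" feature set
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, y)
print(clf.predict(X[:5]))                       # 1 = splog, 0 = normal blog
```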

Data Pre-processing & Ground Truth
- Annotation tool
- Disagreement among annotators
- Ground truth

Data
- TREC dataset: 100,649 feeds.
- After removing duplicate feeds and feeds without a homepage or permalinks: 43.6K unique blogs.
- Most blogs were discovered in the first week; the online experiment uses the blogs discovered in the first week.

Annotation (1)
An interface for annotators, with five labels: (N) Normal, (S) Splog, (B) Borderline, (U) Undecided, (F) Foreign.

Annotation (2)
Disagreement among annotators
- Annotators agree more on normal blogs, but less on near-splog blogs (S/B/U).
- Pooling? Splog recognition can be conservative vs. aggressive.

Annotator   N    S    B    U    F    Total
Mr. C1      45   3    4    1    7    60
Ms. S1      37   16   6
Ms. S2      36   10
Mr. S       47   8
Mr. C2      44
Ms. L       48

Ground truth
- 9,240 blogs labeled (random & stratified sampling): 7,905 labeled as normal, 525 labeled as splogs.
- The splog percentage is low because some known splogs were pre-filtered.
- We focus on the 43.6K subset of blogs having both a homepage and entries.

Experimental Results
- Offline detection
- Online detection

Offline evaluation
base-n: n-dimensional baseline features; R+base-n: baseline plus the temporal and link regularity features.

              AUC     accuracy  precision  recall
base-253      0.966   0.915     0.923      0.907
R+base-253    0.974   0.919     0.918      0.920
base-127      0.957   0.893     0.899      0.886
R+base-127    0.968   0.925     0.931
base-64       0.938   0.874     0.885      0.861
R+base-64     0.948   0.908     0.895
base-32       0.834   0.837     0.831
R+base-32     0.921   0.870     0.883      0.851
R             0.814   0.696     0.860      0.469

Online experiment
(Figure: experiment timeline, Week 1 through Week 7, showing the testing period and the growing linking graph.)

Online evaluation
Without sufficient content data, the regularity features provide a significant boost to detection performance.

Summary
- Splog: a new and serious problem in the blogosphere; detecting splogs is different from web spam detection.
- Identified new detection tasks: an online evaluation that measures how quickly a detector can identify splogs.
- Introduced useful and unique features of blogs/splogs: temporal and link regularity measures.
- Annotation: a guideline and tool that help reduce annotation effort.