Carlos Castillo, Debora Donato, Aristides Gionis, Vanessa Murdock,

1 Carlos Castillo, Debora Donato, Aristides Gionis, Vanessa Murdock, Fabrizio Silvestri
Presented by Anton Rodriguez-Dmitriev

2 Personal Background
- Graduated from FSU
- Working on an MSECE
- Specializing in Controls
- CS minor
- Work part-time at STW Technic, LP

3 Web Spam Consequences
- Damages the reputation of the search engine
- Weakens the trust of the users
- Eiron et al. ranked 100 million pages using PageRank: 11 out of the top 20 were pornographic pages
- PageRank alone cannot filter spam
- Cost incurred in crawling, indexing and storing spam pages


5 Some popular spamming techniques
- Link spam: creating a link structure, usually a tightly knit community of links, to try to affect the outcome of a link-based ranking algorithm
- Content spam: maliciously crafting the content of a web page using techniques such as keyword stuffing, i.e. inserting keywords that relate more to popular queries than to the page's own content
- Cloaking: sending different content to a search engine than to the regular visitors of a website

6 Topology of the Dataset Used
- WEBSPAM-UK2006 dataset: publicly available spam collection
- Undirected graph
- Pruned to contain only hosts that share more than 100 links
- Black nodes are spam and white nodes are non-spam
- Most spammers in the larger connected component are clustered together
- Other connected components are single-class

7 Evaluation of the process
Confusion matrix:
- a: the number of non-spam examples that were correctly classified
- b: the number of non-spam examples that were falsely classified as spam
- c: the number of spam examples that were falsely classified as non-spam
- d: the number of spam examples that were correctly classified

8 Success Measures
- True positive rate (or recall): $R = \frac{d}{c + d}$
- False positive rate: $\frac{b}{a + b}$
- Precision: $P = \frac{d}{b + d}$
- F-measure: $F = \frac{2PR}{P + R}$
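As a minimal sketch, the four measures can be computed directly from the confusion-matrix cells a, b, c and d defined on the previous slide. The function name `success_measures` and the choice to return 0.0 when a denominator is zero are assumptions made for this illustration, not something the slides specify.

```python
def success_measures(a, b, c, d):
    """Compute evaluation measures from confusion-matrix counts.

    a: non-spam correctly classified     b: non-spam classified as spam
    c: spam classified as non-spam       d: spam correctly classified
    """
    def ratio(num, den):
        # Assumption: report 0.0 instead of failing on an empty denominator.
        return num / den if den else 0.0

    recall = ratio(d, c + d)        # true positive rate
    fpr = ratio(b, a + b)           # false positive rate
    precision = ratio(d, b + d)
    f_measure = ratio(2 * precision * recall, precision + recall)
    return {
        "recall": recall,
        "false_positive_rate": fpr,
        "precision": precision,
        "f_measure": f_measure,
    }


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the formulas.
    print(success_measures(a=90, b=10, c=5, d=15))
```

With these hypothetical counts, 15 of the 20 spam hosts are caught (recall 0.75), while 10 of the 100 non-spam hosts are flagged by mistake (false positive rate 0.1).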

9 Link-based Features
- Degree-related measures: in-degree and out-degree of the hosts and their neighbors
- Edge-reciprocity: the number of links that are reciprocal
- Assortativity: the ratio between the degree of a particular page and the average degree of its neighbors
- PageRank
- TrustRank: uses a subset of hand-picked trusted nodes and propagates their labels through the Web graph
- Truncated PageRank: a variant of PageRank that diminishes the influence of a page on the PageRank of its neighbors
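The degree-related measures can be illustrated on a toy host graph. This is only a sketch: the graph representation (a dict mapping each host to the set of hosts it links to), the function names, and the use of the undirected degree (number of distinct neighbors) for assortativity are all assumptions, since the slide does not say which degree is used.

```python
def invert(out_links):
    """Build the in-link map from an out-link map."""
    in_links = {host: set() for host in out_links}
    for src, dsts in out_links.items():
        for dst in dsts:
            in_links.setdefault(dst, set()).add(src)
    return in_links


def degree_features(out_links, host):
    """In-degree, out-degree, reciprocity and assortativity of one host."""
    in_links = invert(out_links)
    outs = out_links.get(host, set())
    ins = in_links.get(host, set())

    def degree(h):
        # Assumption: degree = number of distinct neighbors, ignoring direction.
        return len(out_links.get(h, set()) | in_links.get(h, set()))

    neighbors = outs | ins
    avg_neighbor_degree = (
        sum(degree(n) for n in neighbors) / len(neighbors) if neighbors else 0.0
    )
    return {
        "in_degree": len(ins),
        "out_degree": len(outs),
        # Links from host that are answered by a link back to host.
        "reciprocal_links": len(outs & ins),
        "assortativity": (
            degree(host) / avg_neighbor_degree if avg_neighbor_degree else 0.0
        ),
    }


if __name__ == "__main__":
    # Hypothetical four-host graph.
    graph = {"a": {"b", "c"}, "b": {"a"}, "c": set(), "d": {"a"}}
    print(degree_features(graph, "a"))
```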

10 Link-based Features
Estimation of supporters:
- Given two nodes x and y, x is a d-supporter of y if the shortest path from x to y has length d
- $N_d(x)$ is the set of d-supporters of page x
- Bottleneck number: $b_d(x)$
- Spam pages have a smaller bottleneck number than non-spam pages
Figure: histogram of $b_4(x)$ for spam and non-spam
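The formula for the bottleneck number did not survive in the transcript. The sketch below assumes the definition $b_d(x) = \min_{j \le d} |N_j(x)| / |N_{j-1}(x)|$, i.e. the minimum growth rate of the supporter set up to distance d. It also assumes that supporters are counted cumulatively (shortest path of length at most j) and that $N_0(x)$ contains only x itself, so that the counts never shrink and the ratio is always defined. The function names and the in-link dict representation are hypothetical.

```python
from collections import deque


def supporter_counts(in_links, x, max_dist):
    """Return [|N_0(x)|, ..., |N_max_dist(x)|] using a breadth-first search.

    in_links maps each node to the nodes that link to it.
    Assumption: N_j(x) holds every node within distance j of x, x included.
    """
    dist = {x: 0}
    queue = deque([x])
    while queue:
        node = queue.popleft()
        if dist[node] == max_dist:
            continue  # do not explore beyond the requested distance
        for src in in_links.get(node, ()):
            if src not in dist:
                dist[src] = dist[node] + 1
                queue.append(src)

    counts = [0] * (max_dist + 1)
    for d in dist.values():
        counts[d] += 1
    # Turn per-distance counts into cumulative counts.
    for j in range(1, max_dist + 1):
        counts[j] += counts[j - 1]
    return counts


def bottleneck_number(in_links, x, d):
    """Minimum ratio |N_j(x)| / |N_{j-1}(x)| for 1 <= j <= d (assumed definition)."""
    counts = supporter_counts(in_links, x, d)
    return min(counts[j] / counts[j - 1] for j in range(1, d + 1))


if __name__ == "__main__":
    # Hypothetical graph: a and b link to y, and c links to both a and b.
    in_links = {"y": ["a", "b"], "a": ["c"], "b": ["c"], "c": []}
    print(supporter_counts(in_links, "y", 3))
    print(bottleneck_number(in_links, "y", 2))
```

Under these assumptions, a host whose supporter set stops growing after one or two hops gets a bottleneck number close to 1, which is the behaviour the slide attributes to spam pages.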

11 Content-based Features
Most interesting features presented:
- Finding the k most frequent words in the dataset, excluding stopwords:
  - Corpus precision: the fraction of words in a page that appear in the set of popular terms
  - Corpus recall: the fraction of popular terms that appear in the page
- Considering the set of q most popular terms in a query log:
  - Query precision and query recall: analogous to corpus precision and recall
- Used k and q = 100, 200, 500 and 1000
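A small sketch of corpus precision and recall follows. It assumes pages are already tokenized into lists of lowercase words and that precision is measured over word occurrences rather than distinct words; the slide does not settle either point. The function names are hypothetical. Query precision and recall would use the same two functions with the q most popular query-log terms as the `popular` set.

```python
from collections import Counter


def top_k_terms(docs, k, stopwords):
    """Return the k most frequent non-stopword terms across all documents."""
    counts = Counter(w for doc in docs for w in doc if w not in stopwords)
    return {word for word, _ in counts.most_common(k)}


def corpus_precision(page_words, popular):
    """Fraction of the page's word occurrences that are popular terms."""
    if not page_words:
        return 0.0
    return sum(w in popular for w in page_words) / len(page_words)


def corpus_recall(page_words, popular):
    """Fraction of the popular terms that occur somewhere in the page."""
    if not popular:
        return 0.0
    return len(popular & set(page_words)) / len(popular)


if __name__ == "__main__":
    # Hypothetical three-document corpus.
    docs = [["the", "cheap", "loans"], ["cheap", "the", "free"], ["cheap", "loans"]]
    popular = top_k_terms(docs, k=2, stopwords={"the"})
    page = ["cheap", "cheap", "loans", "today"]
    print(popular, corpus_precision(page, popular), corpus_recall(page, popular))
```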

12 Content-based Features
- The best features are the corpus precision and query precision
- All features were judged based only on histograms
Figure: histogram of the query precision in non-spam vs. spam pages for q = 500

13 Classifiers
- Cost-sensitive decision tree
- Cost of zero for correctly classifying an instance
- Misclassifying spam as normal is R times as costly as classifying a normal host as spam
- R can be used to tune the balance between the true-positive rate and the false-positive rate
- Used “bagging” to help reduce the false-positive rate
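To show how R shifts the balance between the two error types, the sketch below applies a minimum-expected-cost decision to a spam probability. This is not the cost-sensitive decision tree from the slides, only an illustration of the cost matrix it describes; the function name and the assumption that some classifier supplies `p_spam` are hypothetical.

```python
def predict_min_cost(p_spam, r):
    """Pick the label with the lower expected misclassification cost.

    Cost matrix from the slide: correct decisions cost 0, calling a normal
    host spam costs 1, and calling a spam host normal costs r.
    p_spam is an estimated spam probability (assumed to come from a classifier).
    """
    cost_if_predict_normal = p_spam * r            # wrong only if host is spam
    cost_if_predict_spam = (1.0 - p_spam) * 1.0    # wrong only if host is normal
    return "spam" if cost_if_predict_spam < cost_if_predict_normal else "normal"


if __name__ == "__main__":
    for r in (1, 5):
        print(r, predict_min_cost(0.3, r))
```

Under this rule a host is labeled spam once `p_spam` exceeds 1 / (1 + r), so a larger R lowers the threshold: more spam is caught, at the price of more false positives.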

14 Conclusion
Experimental evidence led to the hypotheses:
- Non-spam nodes tend to be linked by very few spam nodes, and usually link to no spam nodes
- Spam nodes are mainly linked by spam nodes
- These tendencies can be exploited to yield better spam detection
- Using multiple features, link-based and content-based, provided better detection
- Error rate can be tuned by adjusting the cost matrix
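One simple way to look at these tendencies in a labeled graph is to measure, for each host, what share of its in-links come from hosts labeled as spam. The sketch below is only an illustration of that idea, not the method used in the paper; the function name, the label strings and the choice to ignore unlabeled neighbors are assumptions.

```python
def spam_in_link_fraction(in_links, labels, host):
    """Fraction of a host's labeled in-neighbors that are labeled spam.

    in_links maps a host to the hosts linking to it; labels maps a host to
    "spam" or "normal". Unlabeled neighbors are ignored (an assumption).
    """
    sources = [s for s in in_links.get(host, ()) if s in labels]
    if not sources:
        return 0.0
    return sum(labels[s] == "spam" for s in sources) / len(sources)


if __name__ == "__main__":
    # Hypothetical host with two spam in-neighbors, one normal, one unlabeled.
    in_links = {"h": ["s1", "s2", "n1", "u"]}
    labels = {"s1": "spam", "s2": "spam", "n1": "normal"}
    print(spam_in_link_fraction(in_links, labels, "h"))
```

If the hypotheses hold, this fraction should be high for most spam hosts and close to zero for most non-spam hosts, which is what makes a neighbor-based signal useful as an extra feature.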

15 Critique
- The article presented many features, both link-based and content-based, that can be used for spam detection, as well as techniques for improving the results based on graph topology (smoothing)
- The results showed which features and optimizations were effective
- The dataset used is outdated, so there is no indication of how well the methods would work against newer or more sophisticated spamming techniques
- There was no direct comparison between prior research results and the results obtained
