Carlos Castillo, Debora Donato, Aristides Gionis, Vanessa Murdock,

Carlos Castillo (chato@yahoo-inc.com), Debora Donato (debora@yahoo-inc.com), Aristides Gionis (gionis@yahoo-inc.com), Vanessa Murdock (vmurdock@yahoo-inc.com), Fabrizio Silvestri (f.silvestri@isti.cnr.it)
Presented by Anton Rodriguez-Dmitriev

Personal Background
- Graduated from FSU
- Working on an MSECE
- Specializing in Controls
- CS minor
- Work part-time at STW Technic, LP

Web Spam Consequences
- Damages the reputation of the search engine
- Weakens the trust of the users
- Eiron et al. ranked 100 million pages using PageRank: 11 of the top 20 were pornographic pages
- PageRank alone cannot filter spam
- Costs are incurred in crawling, indexing, and storing spam pages

Some popular spamming techniques
- Link spam: creating a link structure, usually a tightly knit community of links, to try to affect the outcome of a link-based ranking algorithm
- Content spam: maliciously crafting the content of a Web page using techniques such as keyword stuffing, inserting keywords that are more related to popular queries
- Cloaking: sending different content to a search engine than to the regular visitors of a website

Topology of the Dataset Used
- WEBSPAM-UK2006 dataset: a publicly available spam collection
- Undirected graph
- Pruned to contain only hosts that share more than 100 links
- Black nodes are spam and white nodes are non-spam
- Most spammers in the larger connected component are clustered together
- The other connected components are single-class

Evaluation of the process
Confusion matrix:
- a: the number of non-spam examples that were correctly classified
- b: the number of non-spam examples that were falsely classified as spam
- c: the number of spam examples that were falsely classified as non-spam
- d: the number of spam examples that were correctly classified

Success Measures
- True positive rate (or recall): $R = \frac{d}{c + d}$
- False positive rate: $\frac{b}{a + b}$
- Precision: $P = \frac{d}{b + d}$
- F-measure: $F = \frac{2PR}{P + R}$
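The four measures follow directly from the confusion-matrix counts a, b, c, and d defined on the previous slide. The sketch below is a minimal illustration using the standard definitions of these rates; the function name `spam_metrics` and the example counts are made up for illustration and do not come from the paper.

```python
def spam_metrics(a, b, c, d):
    """Compute success measures from confusion-matrix counts.

    a: non-spam correctly classified   b: non-spam flagged as spam
    c: spam classified as non-spam     d: spam correctly classified
    Assumes every denominator is non-zero.
    """
    recall = d / (c + d)            # true positive rate
    false_positive_rate = b / (a + b)
    precision = d / (b + d)
    f_measure = 2 * precision * recall / (precision + recall)
    return {
        "recall": recall,
        "false_positive_rate": false_positive_rate,
        "precision": precision,
        "f_measure": f_measure,
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = spam_metrics(a=90, b=10, c=5, d=15)
print(metrics)
```

With these example counts, recall is 15/20, the false positive rate is 10/100, and precision is 15/25.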

Link-based Features
- Degree-related measures: in-degree and out-degree of the hosts and their neighbors
- Edge reciprocity: the number of links that are reciprocal
- Assortativity: the ratio between the degree of a particular page and the average degree of its neighbors
- PageRank
- TrustRank: uses a subset of hand-picked trusted nodes and propagates their labels through the Web graph
- Truncated PageRank: a variant of PageRank that diminishes the influence of a page on the PageRank of its neighbors
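The degree-related measures, edge reciprocity, and assortativity can be computed from a plain edge list. The sketch below is one possible reading of those definitions, not the authors' implementation: it assumes "degree" for assortativity means the number of distinct neighbors in the undirected sense, and the function name `link_features` and the toy graph are hypothetical.

```python
from collections import defaultdict


def link_features(edges):
    """Per-node link features from a list of directed (source, target) edges."""
    out_nbrs = defaultdict(set)
    in_nbrs = defaultdict(set)
    for src, dst in edges:
        out_nbrs[src].add(dst)
        in_nbrs[dst].add(src)
    nodes = set(out_nbrs) | set(in_nbrs)

    def degree(node):
        # Assumption: undirected degree = number of distinct neighbors.
        return len(in_nbrs[node] | out_nbrs[node])

    features = {}
    for node in nodes:
        neighbors = in_nbrs[node] | out_nbrs[node]
        avg_nbr_degree = sum(degree(n) for n in neighbors) / len(neighbors)
        features[node] = {
            "in_degree": len(in_nbrs[node]),
            "out_degree": len(out_nbrs[node]),
            # Out-links that are answered by a link in the other direction.
            "reciprocal_links": len(out_nbrs[node] & in_nbrs[node]),
            # Own degree divided by the average degree of the neighbors.
            "assortativity": degree(node) / avg_nbr_degree,
        }
    return features


# Toy host graph: a and b link to each other, a links to c, d links to a.
features = link_features([("a", "b"), ("b", "a"), ("a", "c"), ("d", "a")])
print(features["a"])
```

Every node in the result has at least one neighbor because nodes are collected from the edge list itself, so the average neighbor degree is always defined.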

Link-based Features
Estimation of supporters:
- Given two nodes x and y, x is a d-supporter of y if the shortest path from x to y has length d
- $N_d(x)$ is the set of d-supporters of page x
- Bottleneck number: $b_d(x) = \min_{j \le d} \frac{|N_j(x)|}{|N_{j-1}(x)|}$
- Spam pages tend to have a smaller bottleneck number than non-spam pages
[Figure: histogram of $b_4(x)$ for spam and non-spam pages]
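The supporter counts can be obtained with a breadth-first search that follows links backwards from x. The sketch below makes several assumptions that are not stated on the slide: the bottleneck number is taken to be the minimum ratio |N_j(x)| / |N_{j-1}(x)| over j ≤ d, N_0(x) is taken to be {x}, and the counts are exact rather than estimated. The function names and the toy graph are hypothetical.

```python
from collections import defaultdict, deque


def supporter_counts(edges, x, max_d):
    """Return [|N_0(x)|, ..., |N_max_d(x)|] using BFS over incoming links."""
    in_nbrs = defaultdict(set)
    for src, dst in edges:
        in_nbrs[dst].add(src)

    dist = {x: 0}                 # shortest distance from each supporter to x
    queue = deque([x])
    while queue:
        node = queue.popleft()
        if dist[node] == max_d:   # do not explore beyond distance max_d
            continue
        for pred in in_nbrs[node]:
            if pred not in dist:
                dist[pred] = dist[node] + 1
                queue.append(pred)

    counts = [0] * (max_d + 1)
    for d in dist.values():
        counts[d] += 1
    return counts


def bottleneck_number(edges, x, max_d):
    """Minimum growth rate of the supporter sets of x up to distance max_d (>= 1)."""
    counts = supporter_counts(edges, x, max_d)
    ratios = []
    for j in range(1, max_d + 1):
        if counts[j - 1] == 0:    # no supporters left; further ratios undefined
            break
        ratios.append(counts[j] / counts[j - 1])
    return min(ratios)


# Toy graph: 2 supporters at distance 1, 3 at distance 2, 1 at distance 3.
toy_edges = [("b", "x"), ("c", "x"), ("d", "b"), ("e", "b"), ("f", "c"), ("g", "d")]
print(supporter_counts(toy_edges, "x", 3))
print(bottleneck_number(toy_edges, "x", 3))
```

On a Web-scale graph an exact search like this from every node would be expensive, so this version is only meant to make the definition concrete on small inputs.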

Content-based Features
Most interesting features presented:
- Based on the k most frequent words in the dataset, excluding stopwords:
  - Corpus precision: the fraction of words in a page that appear in the set of popular terms
  - Corpus recall: the fraction of popular terms that appear in the page
- Based on the set of the q most popular terms in a query log:
  - Query precision and query recall: analogous to corpus precision and recall
- Used k and q = 100, 200, 500, and 1000
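Corpus precision and recall reduce to simple set and count operations once the popular-term set is fixed. The sketch below is an illustration under assumptions: it tokenizes by whitespace, and it counts every word occurrence on the page for precision, which is one possible reading of "fraction of words in a page". The function names, the tiny stopword list, and the sample texts are hypothetical. Query precision and recall would use the same second function with a term set taken from a query log.

```python
from collections import Counter


def top_k_terms(documents, k, stopwords):
    """Set of the k most frequent non-stopword terms across all documents."""
    counts = Counter(
        word
        for doc in documents
        for word in doc.lower().split()
        if word not in stopwords
    )
    return {word for word, _ in counts.most_common(k)}


def precision_recall(page_text, popular_terms):
    """(precision, recall) of a page against a set of popular terms."""
    words = page_text.lower().split()
    if not words or not popular_terms:
        return 0.0, 0.0
    # Precision: share of word occurrences on the page that are popular terms.
    precision = sum(word in popular_terms for word in words) / len(words)
    # Recall: share of popular terms that occur at least once on the page.
    recall = len(set(words) & popular_terms) / len(popular_terms)
    return precision, recall


corpus = ["buy cheap pills", "cheap pills", "cheap the"]
popular = top_k_terms(corpus, k=2, stopwords={"the"})
print(popular)
print(precision_recall("cheap cheap pills online", {"cheap", "pills", "casino"}))
```

In the last call, three of the four words on the page are popular terms, and two of the three popular terms appear on the page.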

Content-based Features
- The best features are corpus precision and query precision
- All features were judged based only on histograms
[Figure: histogram of the query precision in non-spam vs. spam pages for q = 500]

Classifiers
- Cost-sensitive decision tree
- The cost of correctly classifying an instance is zero
- Misclassifying spam as normal is R times more costly than classifying a normal host as spam
- R can be used to tune the balance between the true-positive rate and the false-positive rate
- Used "bagging" to help reduce the false-positive rate
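The effect of R can be seen with a minimum-expected-cost decision rule applied to a predicted spam probability. This is not the cost-sensitive decision tree from the paper, only an illustrative sketch of how such a cost matrix shifts the decision: it assumes the classifier outputs a probability `p_spam`, that flagging a normal host as spam costs 1, and that missing spam costs R. The function name is hypothetical.

```python
def min_cost_label(p_spam, R):
    """Pick the label with the lower expected misclassification cost.

    Correct classifications cost 0. Calling spam "normal" costs R;
    calling a normal host "spam" costs 1 (assumed unit cost).
    """
    cost_if_predict_normal = p_spam * R          # wrong only if it is spam
    cost_if_predict_spam = (1.0 - p_spam) * 1.0  # wrong only if it is normal
    return "spam" if cost_if_predict_spam < cost_if_predict_normal else "normal"


# The same host, scored at a 30% spam probability, under two cost settings.
print(min_cost_label(0.3, R=1))  # costs 0.3 vs 0.7
print(min_cost_label(0.3, R=4))  # costs 1.2 vs 0.7
```

Under this rule a host is labeled spam when `p_spam` exceeds 1 / (1 + R), so raising R catches more spam at the price of more false positives.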

Conclusion
Experimental evidence led to the hypotheses:
- Non-spam nodes tend to be linked by very few spam nodes, and usually link to no spam nodes
- Spam nodes are mainly linked by spam nodes
These tendencies can be exploited to yield better spam detection
- Using multiple features, both link-based and content-based, provided better detection
- The error rate can be tuned by adjusting the cost matrix

Critique
- The article presented many features, both link-based and content-based, that can be used for spam detection, as well as techniques that exploit the graph topology (smoothing) to improve the results
- The results showed which features and optimizations were effective
- The dataset used is outdated, so there is no indication of how well the methods would work against newer or more sophisticated spamming techniques
- There was no direct comparison between the results of prior research and the results obtained
