Web Spam Detection with Anti-Trust Rank Vijay Krishnan, Rashmi Raj Computer Science Department Stanford University
The World Wide Web Huge Distributed content creation, linking (no coordination) Structured databases, unstructured text, semi-structured data. Content includes truth, lies, obsolete information, contradictions, …
PageRank Intuition: “a page is important if important pages link to it.” In high-falutin’ terms: importance = the principal eigenvector of the stochastic matrix of the Web. (A few fixups needed.)
PageRank
Web graph encoded by matrix M
–N×N matrix (N = number of web pages)
–Mij = 1/|O(j)| iff there is a link from j to i
–Mij = 0 otherwise
–O(j) = set of pages that node j links to
Define matrix A as follows
–Aij = βMij + (1-β)/N, where 0<β<1
–1-β is the “tax” discussed in prior lecture
Page rank r is the principal eigenvector of A
–Ar = r
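A minimal sketch of how r could be computed by power iteration, using the definitions on this slide. The function name `pagerank`, the toy graph, β = 0.85 and the iteration count are illustrative assumptions, and the sketch assumes every page has at least one out-link.

```python
def pagerank(links, beta=0.85, iters=100):
    """Power iteration for r = A r, with A_ij = beta * M_ij + (1 - beta) / N.

    links maps each page to the list of pages it links to.
    Assumption: every page appears as a key and has at least one out-link.
    """
    pages = sorted(links)
    n = len(pages)
    r = {p: 1.0 / n for p in pages}  # start from the uniform vector
    for _ in range(iters):
        # The (1 - beta) / N term: r sums to 1, so each page gets this share.
        nxt = {p: (1.0 - beta) / n for p in pages}
        for j in pages:
            # M_ij = 1 / |O(j)|: page j splits its rank over its out-links.
            share = beta * r[j] / len(links[j])
            for i in links[j]:
                nxt[i] += share
        r = nxt
    return r


# Hypothetical three-page graph: a -> b, a -> c, b -> c, c -> a.
toy = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy)
```

In this toy graph, page c is linked from both a and b, so it ends up with a higher rank than b.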
Many Random Walkers Model Imagine a large number M of independent, identical random walkers (M ≫ N). At any point in time, let M(p) be the number of random walkers at page p. The page rank of p is the fraction of random walkers that are expected to be at page p, i.e., E[M(p)]/M.
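The walker model can be illustrated with a small simulation. This is a sketch under assumptions: each walker follows a random out-link with probability β and otherwise jumps to a uniformly random page (the “tax” from the PageRank slide), and the function name, toy graph and parameter values are hypothetical.

```python
import random


def simulate_walkers(links, num_walkers=5000, steps=30, beta=0.85, seed=0):
    """Estimate E[M(p)] / M by moving many independent walkers.

    links maps each page to the list of pages it links to.
    Assumption: every page has at least one out-link.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    pages = sorted(links)
    positions = [rng.choice(pages) for _ in range(num_walkers)]
    for _ in range(steps):
        for w in range(num_walkers):
            if rng.random() < beta:
                # Follow a random out-link of the current page.
                positions[w] = rng.choice(links[positions[w]])
            else:
                # Teleport to a uniformly random page.
                positions[w] = rng.choice(pages)
    counts = {p: 0 for p in pages}
    for p in positions:
        counts[p] += 1
    # Fraction of walkers at each page after the last step.
    return {p: counts[p] / num_walkers for p in pages}


# Hypothetical three-page graph: a -> b, a -> c, b -> c, c -> a.
toy = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
fractions = simulate_walkers(toy)
```

With enough walkers and steps, the fractions should approach the page rank values obtained from the eigenvector formulation.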
Economic Considerations Search has become the default gateway to the web Very high premium to appear on the first page of search results –e.g., e-commerce sites –advertising-driven sites
What is Web Spam? Spamming = any deliberate action solely in order to boost a web page’s position in search engine results, incommensurate with the page’s real value Spam = web pages that are the result of spamming This is a very broad definition –SEO industry might disagree! –SEO = search engine optimization Approximately 10-15% of web pages are spam
Types of Spamming Techniques Term spamming –Manipulating the text of web pages in order to appear relevant to queries Link spamming –Creating link structures that boost page rank or hubs and authorities scores
Link Spam Three kinds of web pages from a spammer’s point of view –Inaccessible pages –Accessible pages e.g., blog comment pages, where the spammer can post links to his own pages –Own pages Completely controlled by the spammer May span multiple domain names
Link Spam Detection Open research area One approach: TrustRank
Trust Rank Basic principle: approximate isolation –It is rare for a “good” page to point to a “bad” (spam) page Sample a set of “seed pages” from the web. Set the trust of each trusted seed page to 1 Propagate trust through links Each page gets a trust value between 0 and 1 Use a threshold value and mark all pages below the trust threshold as spam
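A minimal sketch of the propagation and threshold steps. The slide does not give the propagation rule, so this assumes a PageRank-style iteration whose teleport goes only to the trusted seeds. The names `trust_rank`, `toy` and `flagged`, the toy graph, and the values of β and the threshold are hypothetical.

```python
def trust_rank(links, trusted_seeds, beta=0.85, iters=50):
    """Propagate trust forward along links from a set of trusted seeds.

    Assumed rule: t = beta * M t + (1 - beta) * d, where d is uniform
    over the trusted seeds and zero elsewhere.
    """
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    d = {p: (1.0 / len(trusted_seeds) if p in trusted_seeds else 0.0)
         for p in pages}
    t = dict(d)
    for _ in range(iters):
        nxt = {p: (1.0 - beta) * d[p] for p in pages}
        for j in pages:
            outs = links.get(j, [])
            for i in outs:
                # Page j splits its trust evenly over the pages it links to.
                nxt[i] += beta * t[j] / len(outs)
        t = nxt
    return t


# Hypothetical graph: g* are good pages, s* are spam pages.
# Spam may link to good pages, but no good page links to spam.
toy = {
    "g1": ["g2"],
    "g2": ["g1", "g3"],
    "g3": ["g1"],
    "s1": ["s2", "g1"],
    "s2": ["s1"],
}
trust = trust_rank(toy, trusted_seeds={"g1"})
# Pages below the (assumed) threshold are marked as spam.
flagged = {p for p, score in trust.items() if score < 0.01}
```

Because no good page links to s1 or s2 in this toy graph, they receive no trust and fall below the threshold.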
Anti-Trust Approach Broadly based on the same “approximate isolation” principle. This principle also implies that the pages pointing to spam pages are very likely to be spam pages themselves. Anti-Trust is propagated in the reverse direction along incoming links, starting from a seed set of spam pages. A page can be classified as a spam page if it has an Anti-Trust Rank value above a chosen threshold value.
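A minimal sketch of reverse propagation from a spam seed set. As an assumption, it uses a PageRank-style iteration on the reversed graph with teleport restricted to the spam seeds; the slide does not specify the exact rule. The names `anti_trust_rank`, `toy` and `flagged`, the toy graph, and the values of β and the threshold are hypothetical.

```python
def anti_trust_rank(links, spam_seeds, beta=0.85, iters=50):
    """Propagate anti-trust backwards along links from spam seeds.

    Assumed rule: a biased PageRank on the graph with every link reversed,
    teleporting only to the spam seeds.
    """
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    # incoming[i] = pages that link to i; anti-trust flows from i to them.
    incoming = {p: [] for p in pages}
    for j, outs in links.items():
        for i in outs:
            incoming[i].append(j)
    d = {p: (1.0 / len(spam_seeds) if p in spam_seeds else 0.0)
         for p in pages}
    a = dict(d)
    for _ in range(iters):
        nxt = {p: (1.0 - beta) * d[p] for p in pages}
        for i in pages:
            for j in incoming[i]:
                # Page i splits its anti-trust over the pages pointing to it.
                nxt[j] += beta * a[i] / len(incoming[i])
        a = nxt
    return a


# Hypothetical graph: g* are good pages, s* are spam pages.
toy = {
    "g1": ["g2"],
    "g2": ["g1"],
    "s1": ["s3", "g1"],
    "s2": ["s3"],
    "s3": ["s1"],
}
anti = anti_trust_rank(toy, spam_seeds={"s3"})
# Pages above the (assumed) threshold are classified as spam.
flagged = {p for p, score in anti.items() if score > 0.01}
```

Here s1 and s2 point to the seed s3 and so inherit anti-trust, while g1 and g2 link to no spam page and stay at zero.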
Seed Set selection Seed spam set chosen from pages with high page rank. Nearly 100% of URLs containing certain terms such as {viagra, gambling, hardporn} as substrings are spam. Use these for evaluation. Also, some seed pages were chosen by an oracle (human expert).
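A minimal sketch of the URL-substring heuristic used for evaluation labels. The function name, the case-insensitive match and the example URLs are assumptions; the term list is the one given on the slide.

```python
# Terms from the slide; URLs containing any of them are treated as spam.
SPAM_TERMS = ("viagra", "gambling", "hardporn")


def looks_like_spam_url(url, terms=SPAM_TERMS):
    """Return True if any spam term occurs as a substring of the URL.

    Matching is case-insensitive here, which is an assumption.
    """
    lowered = url.lower()
    return any(term in lowered for term in terms)


# Hypothetical URLs for illustration only.
urls = [
    "http://cheap-VIAGRA.example.com/",
    "http://www.example.edu/courses/",
    "http://example.net/online-gambling/index.html",
]
labels = {u: looks_like_spam_url(u) for u in urls}
```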
Results
Overall percentage of “spam” pages = 0.28%.
Average page rank of “spam” pages / average page rank = 2.6.
% of “spam” pages in:
–top 1000 Anti-Trust Rank pages = 25.3%
–bottom 1000 Trust Rank pages = 0.68%
Ratio of average page ranks of spam pages returned by ATR vs. TR is roughly 6.
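A minimal sketch of how a “% of spam pages in the top k” figure could be computed, followed by the ratio of the two reported percentages to the overall spam rate. The function name and the toy scores and labels are hypothetical. The ratios are arithmetic on the numbers above and are not figures reported in the slides.

```python
def spam_fraction_in_top_k(scores, is_spam, k):
    """Fraction of spam pages among the k highest-scoring pages (k >= 1)."""
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return sum(1 for p in top if is_spam(p)) / len(top)


# Hypothetical scores and labels, for illustration only.
toy_scores = {"p1": 0.9, "p2": 0.5, "p3": 0.4, "p4": 0.1}
toy_spam = {"p1", "p3"}
toy_fraction = spam_fraction_in_top_k(
    toy_scores, lambda p: p in toy_spam, k=2
)

# Percentages reported on the Results slide.
overall_spam_pct = 0.28
atr_top_1000_pct = 25.3
tr_bottom_1000_pct = 0.68

# How much denser in spam each list is than the collection as a whole.
atr_lift = atr_top_1000_pct / overall_spam_pct
tr_lift = tr_bottom_1000_pct / overall_spam_pct
```

On these numbers the top 1000 Anti-Trust Rank pages are roughly 90 times denser in spam than the collection overall, against roughly 2.4 times for the bottom 1000 Trust Rank pages.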
References
The PageRank citation ranking: Bringing order to the web. L. Page, S. Brin, R. Motwani and T. Winograd. Technical Report, Stanford University.
Combating Web Spam with Trust Rank. Zoltan Gyongyi, Hector Garcia-Molina and Jan Pedersen. In VLDB.
Topic-sensitive PageRank. Taher Haveliwala. In WWW.
The WebGraph dataset. Online at: