Web Spam
Yonatan Ariel, SDBI 2005, The Hebrew University of Jerusalem. Based on the work of Gyongyi, Zoltan; Berkhin, Pavel; Garcia-Molina, Hector; Pedersen, Jan (Stanford University).
Contents: What is web spam; Combating web spam – TrustRank; Combating web spam – Mass Estimation; Conclusion
Web Spam: actions intended to mislead search engines into ranking some pages higher than they deserve. Search engines are the entryways to the web, so manipulating them promises financial gains.
Consequences: decreased quality of search results (e.g., the query "Kaiser pharmacy" returned techdictionary.com); increased cost of each processed query, since search engine indexes are inflated with useless pages. The first step in combating spam is understanding it.
Search Engines aim for high quality results, i.e. pages that are both relevant for a specific query (textual similarity) and important (popularity). Search engines combine relevance and importance to compute ranking.
Definition revised: any deliberate human action that is meant to trigger an unjustifiably favorable relevance or importance for some web page, considering the page's true value.
Search Engine Optimizers: some engage in spamming (according to our definition), while others stick to ethical methods: finding relevant directories to which a site can be submitted, using a reasonably sized description meta tag, and using a short and relevant page title to name each page.
Spamming Techniques fall into two groups: boosting techniques (achieving high relevance / importance) and hiding techniques (hiding the boosting techniques). We'll cover both.
Techniques: Boosting Techniques (Term Spamming, Link Spamming); Hiding Techniques.
TF (term frequency): a measure of the importance of a term in a specific page, computed as the number of occurrences of the considered term divided by the number of occurrences of all terms.
IDF (inverse document frequency): a measure of the general importance of a term in a collection of pages, computed (in the standard formulation) as the logarithm of the total number of documents in the corpus divided by the number of documents where the term t appears.
TF-IDF: a high tf-idf weight is reached by a high term frequency in the given document and a low document frequency of the term in the whole collection of documents. Spammers exploit this in two ways: making a page relevant for a large number of queries, or making a page very relevant for a specific query.
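To make the weighting concrete, here is a minimal Python sketch of tf-idf (the toy corpus, the tokenization, and the logarithm in idf are illustrative assumptions, not from the slides):

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    # Term frequency: occurrences of `term` over the total number of terms in the page.
    return Counter(doc_tokens)[term] / len(doc_tokens)

def idf(term, corpus):
    # Inverse document frequency: log of (corpus size / documents containing the term).
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing) if containing else 0.0

def tfidf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

# Toy corpus: repeating "cheap" drives up tf on the spam page, but if many pages
# in the collection contain the term, idf (and hence the tf-idf weight) shrinks.
corpus = [["cheap", "tickets", "cheap", "cheap"],
          ["camera", "lens", "review"],
          ["travel", "deals", "cheap"]]
print(tfidf("cheap", corpus[0], corpus))
```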
Term Spamming Techniques: body spam (the simplest, oldest, and most popular); title spam (titles receive higher weights); meta tag spam (given low priority by search engines, precisely because it is so easily abused).
Term Spamming Techniques (cont'd): anchor text spam (e.g. "free, great deals, cheap, cheap, free"); URL spam (e.g. buy-canon-rebel-20d-lens-case.camerasx.com).
Grouping Term Spamming Techniques: repetition (increased relevance for a few specific queries); dumping of a large number of unrelated terms (effective against rare, obscure term queries); weaving of spam terms into copied contents (the rare, original topic serves as dilution, concealing the spam terms within the text), for example: "Remember not only *airfare* to say the right *plane tickets* thing in the right place, but far *cheap travel* more difficult still, to leave *hotel rooms* unsaid the wrong thing at *vacation* the tempting moment."; phrase stitching (gluing together sentences from different sources to create content quickly).
Techniques: Boosting Techniques (Term Spamming, Link Spamming); Hiding Techniques.
Three Types Of Pages On The Web: inaccessible (spammers cannot modify them); accessible (can be modified in a limited way); own pages (we call a group of own pages a spam farm).
First Algorithm – HITS: assigns global hub and authority scores to each page, with a circular definition: important hub pages are those that point to many important authority pages, and important authority pages are those pointed to by many hubs. Hub scores are easily spammed by adding outgoing links to a large number of well-known, reputable pages. Spamming the authority score is more complicated: the more incoming hub links, the better.
Second Algorithm – PageRank: a family of algorithms for assigning numerical weightings to hyperlinked documents. The PageRank value of a page reflects the frequency of hits on that page by a random surfer: it is the probability of being at that page after many clicks. From a sink page (one with no outgoing links), the surfer continues at a random page.
PageRank spam farms: all n own pages are part of the farm; all m accessible pages point to the spam farm; links pointing outside the spam farm are suppressed; no vote gets lost (each page has an outgoing link); all accessible and own pages point to the target page t; all pages within the farm are reachable.
Techniques – Outgoing Links: manually adding outgoing links to well-known hosts, for an increased hub score. Directory sites (dmoz.org, the Yahoo! Directory) make it possible to create a massive outgoing link structure quickly.
Techniques – Incoming Links: creating a honey pot (a useful resource that attracts links); infiltrating a web directory; placing links on blogs, guest books, and wikis (Google's nofollow tag makes search engines discount such links); link exchange; buying expired domains; creating one's own spam farm.
Techniques: Boosting Techniques (Term Spamming, Link Spamming); Hiding Techniques.
Content Hiding: color schemes (the font's color is the same as the background's color); tiny anchor images as links (1x1 pixel); using scripts, e.g. setting an element's visibility style attribute to false.
Cloaking: spam web servers return a different document to a web crawler than to regular visitors. Web crawlers are identified via a list of known IP addresses or the 'user-agent' field in the HTTP request. Identifying crawlers also has legitimate uses: allowing web masters to block some content, legitimate optimizations (e.g. removing ads), or delivering content that search engines otherwise can't read (such as Flash).
Redirection: automatically redirecting the browser to another URL, either via the refresh meta tag in the header of an HTML document (simple to identify) or via scripts, which are much harder to analyze, e.g. location.replace("target.html").
How can we fight it? IDENTIFY instances of spam, and stop crawling / indexing such pages. PREVENT spamming, e.g. avoid being cloaked by identifying ourselves as regular web browsers. COUNTERBALANCE the effect of spamming by using variations of the ranking methods.
Some Statistics: the data sets studied are the results of a single breadth-first search starting at the Yahoo! home page, and a complete set of pages crawled and indexed by AltaVista.
Some More Statistics. [Chart omitted: it contrasted sophisticated spammers with average spammers.]
Contents: What is web spam; Combating web spam – TrustRank; Combating web spam – Mass Estimation; Conclusion
Motivation: the spam detection process is very expensive and slow, but is critical to the success of search engines. We'd like to assist the human experts who detect web spam.
Getting dirty: G = (V, E), where V is a set of N pages (vertices) and E is a set of directed links (edges) that connect pages. We collapse multiple hyperlinks into a single link and remove self hyperlinks. i(p) denotes the number of in-links to a page p; w(p) denotes the number of out-links from a page p.
Our Example: V = {1, 2, 3, 4}; E = {(1,2), (2,3), (3,2), (3,4)}; N = 4; i(2) = 2; w(2) = 1.
A Transition Matrix T: T(p, q) = 1/w(q) if (q, p) ∈ E (that is, if q links to p), and T(p, q) = 0 otherwise.
In our example:
T = [ 0  0  0    0
      1  0  1/2  0
      0  1  0    0
      0  0  1/2  0 ]
(column 3 reflects the out edges of page 3; row 4 reflects the in edges of page 4)
An Inverse Transition Matrix U: U(p, q) = 1/i(q) if (p, q) ∈ E (that is, if p links to q), and U(p, q) = 0 otherwise.
In our example:
U = [ 0  1/2  0  0
      0  0    1  0
      0  1/2  0  1
      0  0    0  0 ]
(column 2 reflects the in edges of page 2; row 2 reflects the out edges of page 2)
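Both matrices can be built mechanically from the edge list. A small sketch for the four-page example above (numpy and 0-based indexing are implementation choices, not from the slides):

```python
import numpy as np

N = 4
edges = [(1, 2), (2, 3), (3, 2), (3, 4)]  # (source, target), pages numbered 1..4

w = [sum(1 for s, _ in edges if s == p) for p in range(1, N + 1)]  # out-link counts
i = [sum(1 for _, t in edges if t == p) for p in range(1, N + 1)]  # in-link counts

T = np.zeros((N, N))  # T[p, q] = 1/w(q) if q links to p
U = np.zeros((N, N))  # U[p, q] = 1/i(q) if p links to q
for s, t in edges:
    T[t - 1, s - 1] = 1.0 / w[s - 1]
    U[s - 1, t - 1] = 1.0 / i[t - 1]

print(T)  # column 3 holds the out edges of page 3; row 4 the in edges of page 4
print(U)  # column 2 holds the in edges of page 2; row 2 the out edges of page 2
```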
PageRank is based on mutual reinforcement between pages: the importance of a certain page influences and is influenced by the importance of some other pages. Its ingredients: in-links act as votes, a decay factor attenuates them, and every page receives some start-off authority.
Equivalent Matrix Equation: r = α · T · r + (1 − α) · (1/N) · 1_N, where α is a scalar (the decay factor), 1_N is the N-vector of all ones, α · T · r is the dynamic component, and (1 − α) · (1/N) · 1_N is the static component.
A Biased PageRank: r = α · T · r + (1 − α) · d, where d is a static score distribution (summing up to one). Only pages that are reachable from some page i with d[i] > 0 will have a positive PageRank.
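A minimal power-iteration sketch of this equation; the decay factor α = 0.85 and the fixed iteration count are conventional assumptions rather than values from the slides:

```python
import numpy as np

def pagerank(T, d, alpha=0.85, iters=100):
    """Iterate r = alpha * T @ r + (1 - alpha) * d.

    With d uniform this is ordinary PageRank; any other static distribution d
    (summing to one) biases the scores toward the pages where d is positive."""
    r = np.copy(d)
    for _ in range(iters):
        r = alpha * (T @ r) + (1 - alpha) * d
    return r

# The four-page example graph from the earlier slides.
T = np.array([[0, 0, 0,   0],
              [1, 0, 0.5, 0],
              [0, 1, 0,   0],
              [0, 0, 0.5, 0]])
print(pagerank(T, np.full(4, 0.25)))            # uniform jump vector
print(pagerank(T, np.array([0.5, 0, 0.5, 0])))  # jump only to pages 1 and 3
```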
Oracle Function: a binary oracle function O over all pages p in V, with O(p) = 1 if p is good and O(p) = 0 if p is bad. In the running example graph (good pages 1–4, bad pages 5–7): O(3) = 1, O(6) = 0.
Oracle invocations are expensive and time consuming, so we CAN'T call the function for all pages. We rely on the approximate isolation of the good set: good pages seldom point to bad ones (although, as we've seen, good pages *can* point to bad ones), while bad pages often point to bad ones.
Trust Function: we need to evaluate pages without relying on O. We define, for any page p, a trust function with the ideal trust property: T(p) = Pr[O(p) = 1]. It is very hard to come up with such a function, but it would be useful in ordering search results.
Ordered Trust Property: T(p) < T(q) if and only if Pr[O(p) = 1] < Pr[O(q) = 1].
First Evaluation Metric – Pairwise Orderedness. Given a trust function T, the oracle function O, and a set P of ordered pairs of pages, a violation is a pair on which T contradicts the ordered trust property. Pairwise orderedness is the fraction of the pairs for which T did not make a mistake: pairord(T, O, P) = (|P| − number of violating pairs) / |P|.
Threshold Trust Property: T(p) > δ implies O(p) = 1. Such a function doesn't necessarily provide an ordering of pages based on their likelihood of being good, so we'll describe two further evaluation metrics: precision and recall.
Threshold Evaluation Metrics, over a sample set X of pages: precision = (number of correct 'good' estimations) / (total number of 'good' estimations) = |{p ∈ X : T(p) > δ and O(p) = 1}| / |{p ∈ X : T(p) > δ}|; recall = (number of correct 'good' estimations) / (total number of good pages in X) = |{p ∈ X : T(p) > δ and O(p) = 1}| / |{p ∈ X : O(p) = 1}|.
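The metrics are easy to state in code. A sketch, with trust and oracle values held in plain dicts (an encoding assumption); on the worked example that follows, it reproduces the 34/42, precision 1, and recall 0.5 figures:

```python
from itertools import permutations

def pairwise_orderedness(T, O, pages):
    # Fraction of ordered pairs on which T does not contradict the oracle.
    pairs = list(permutations(pages, 2))
    violations = sum(1 for p, q in pairs
                     if (T[p] > T[q] and O[p] < O[q])
                     or (T[p] < T[q] and O[p] > O[q])
                     or (T[p] == T[q] and O[p] != O[q]))
    return (len(pairs) - violations) / len(pairs)

def precision(T, O, pages, delta):
    labeled_good = [p for p in pages if T[p] > delta]
    return sum(O[p] for p in labeled_good) / len(labeled_good) if labeled_good else 1.0

def recall(T, O, pages, delta):
    good = [p for p in pages if O[p] == 1]
    return sum(1 for p in good if T[p] > delta) / len(good) if good else 1.0

T0 = {1: 1, 2: 0.5, 3: 1, 4: 0.5, 5: 0.5, 6: 0, 7: 0.5}
O = {1: 1, 2: 1, 3: 1, 4: 1, 5: 0, 6: 0, 7: 0}
pages = range(1, 8)
print(pairwise_orderedness(T0, O, pages))                       # 34/42 ~ 0.81
print(precision(T0, O, pages, 0.5), recall(T0, O, pages, 0.5))  # 1.0 0.5
```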
Computing Trust: we have a limited budget L of O-invocations. We select at random a seed set S of L pages and call the oracle on its elements. The ignorant trust function assigns T_0(p) = O(p) if p ∈ S, and 1/2 (not checked by human experts) otherwise.
For example, let L = 3 and S = {1, 3, 6} in the running graph (good pages 1–4, bad pages 5–7). Oracle (actual) values: O = [1, 1, 1, 1, 0, 0, 0]. Ignorant function values: T_0 = [1, 1/2, 1, 1/2, 1/2, 0, 1/2]. We choose X to be all 7 pages, giving 42 ordered pairs; pairwise orderedness = 34/42. For threshold δ = 1/2: precision = 1 (pages 1 and 3 are labeled good, and both really are) and recall = 0.5 (only 2 of the 4 good pages are found).
Trust Propagation: remember approximate isolation? We generalize the ignorant function to the M-step trust function: T_M(p) = O(p) if p ∈ S; 1 if p ∉ S and there exists a path of length at most M, from some good seed page to p, that doesn't include bad seed pages; 1/2 otherwise.
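A sketch of T_M as a breadth-first expansion from the good seeds (the adjacency-dict graph encoding and the tiny demo graph are assumptions for illustration):

```python
from collections import deque

def m_step_trust(graph, seeds, oracle, M):
    """graph: {page: [pages it links to]}; seeds: pages on which the oracle was invoked."""
    seed_set = set(seeds)
    bad_seeds = {s for s in seeds if oracle[s] == 0}
    trust = {p: 0.5 for p in graph}        # default: not checked, neutral 1/2
    for s in seeds:
        trust[s] = oracle[s]               # seed pages keep their oracle value
    frontier = deque((s, 0) for s in seeds if oracle[s] == 1)
    seen = {s for s in seeds if oracle[s] == 1}
    while frontier:
        page, dist = frontier.popleft()
        if dist == M:
            continue                       # paths may have length at most M
        for nxt in graph.get(page, []):
            if nxt in bad_seeds or nxt in seen:
                continue                   # never pass through a bad seed
            seen.add(nxt)
            if nxt not in seed_set:
                trust[nxt] = 1.0           # reachable from a good seed within M steps
            frontier.append((nxt, dist + 1))
    return trust

# Hypothetical toy graph, not the slides' example: seeds are pages 1 (good) and 6 (bad).
graph = {1: [2], 2: [3], 3: [4], 4: [], 5: [6], 6: [5]}
print(m_step_trust(graph, [1, 6], {1: 1, 6: 0}, M=2))
# pages 2 and 3 get trust 1; page 4 (three steps away) and page 5 stay at 1/2
```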
[Figures omitted: four snapshots of trust propagation on the example graph; as M grows, trust spreads outward from the good seeds until, at the final step, it also reaches a bad page, a mistake.]
Results: performance drops as M grows. The further away we are from the good seed pages, the less certain we are that a page is good!
Trust Attenuation via Trust Dampening: a page directly pointed to by a good seed receives a dampened trust β < 1, a page two links away receives β · β, and so on. When trust arrives along several paths, we could assign, e.g., maximum(β, β·β) or average(β, β·β).
Trust Attenuation via Trust Splitting: the care with which people add links to their pages is often inversely proportional to the number of links on the page, so a page's trust is split among its out-links: each of the w(p) out-links of page p propagates trust T(p)/w(p).
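A toy sketch combining the two attenuation ideas (the graph encoding, the β value, the round count, and the max-aggregation are all assumptions; this is an illustration, not the paper's exact scheme):

```python
def attenuated_trust(graph, seeds, beta=0.85, rounds=3, split=True):
    """Propagate trust from good seeds, dampening it by beta per hop and,
    if split is set, dividing a page's trust equally among its out-links."""
    trust = {p: 0.0 for p in graph}
    for s in seeds:
        trust[s] = 1.0
    for _ in range(rounds):
        incoming = {p: 0.0 for p in graph}
        for p, links in graph.items():
            if not links:
                continue
            share = trust[p] / len(links) if split else trust[p]
            for q in links:
                incoming[q] += beta * share        # dampened (and possibly split) trust
        for p in graph:
            trust[p] = max(trust[p], incoming[p])  # keep the strongest evidence seen
    return trust

# A seed pointing to two pages passes beta/2 to each when splitting is on.
print(attenuated_trust({1: [2, 3], 2: [], 3: []}, seeds=[1]))
```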
TrustRank Algorithm: 1. (Partially) evaluate the seed-desirability of pages. 2. Invoke the oracle function on the L most desirable seed pages and normalize the result into a vector d. 3. Evaluate TrustRank scores using a biased PageRank computation, with d replacing the uniform distribution.
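A compact sketch of the whole pipeline (α, the iteration counts, the dict-based oracle, and the use of the inverse-PageRank desirability described on the later slides are assumptions):

```python
import numpy as np

def power_iterate(M, d, alpha=0.85, iters=100):
    # Iterate r = alpha * M @ r + (1 - alpha) * d.
    r = np.copy(d)
    for _ in range(iters):
        r = alpha * (M @ r) + (1 - alpha) * d
    return r

def trustrank(T, U, oracle, L, alpha=0.85):
    n = T.shape[0]
    uniform = np.full(n, 1.0 / n)
    # 1. Seed desirability, here via inverse PageRank on the matrix U.
    desirability = power_iterate(U, uniform, alpha)
    # 2. Oracle on the L most desirable pages; keep the good ones, normalize.
    candidates = np.argsort(-desirability)[:L]
    good = [p for p in candidates if oracle[p] == 1]
    d = np.zeros(n)
    d[good] = 1.0 / len(good)
    # 3. Biased PageRank with d replacing the uniform distribution.
    return power_iterate(T, d, alpha)
```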
For Example: desirability vector [0.08, 0.13, 0.08, 0.10, 0.09, 0.06, 0.02] over the seven example pages. Ordering the vertices accordingly gives [2, 4, 5, 1, 3, 6, 7].
For Example (cont'd): with L = 3 we invoke the oracle on pages 2, 4, and 5, and compute the good seeds vector (other seeds are considered bad): [0, 1, 0, 1, 0, 0, 0]. Normalizing gives d = [0, 1/2, 0, 1/2, 0, 0, 0], which will be used as the biased PageRank jump vector.
For Example (cont'd): the computed TrustRank scores are [0, 0.18, 0.12, 0.15, 0.13, 0.05, 0.05]. Page 2 gets the highest score, higher than seed p4, due to the extra link from p3; p1 is unreferenced, so its score is 0; p5 scores high due to a direct link from seed p4.
Selecting Seeds: we want to choose pages that are useful in identifying additional good pages, while keeping the seed set small. Two strategies: inverse PageRank and high PageRank.
I. Inverse PageRank: give preference to pages from which we can reach many other pages. We could select seed pages based on the number of out-links alone, but better, we choose the pages that point to many pages that point to many pages, and so on. This is actually PageRank where the importance of a page depends on its out-links: perform PageRank on the graph G' = (V, E') with all edges reversed, i.e. use the inverse transition matrix U instead of T.
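In code, inverse PageRank is just PageRank run on the graph with every edge flipped; a self-contained sketch (0-based node ids are an assumption):

```python
import numpy as np

def inverse_pagerank(edges, n, alpha=0.85, iters=100):
    reversed_edges = [(t, s) for s, t in edges]  # flip every link
    out_deg = np.zeros(n)
    for s, _ in reversed_edges:
        out_deg[s] += 1
    U = np.zeros((n, n))                         # transition matrix of the reversed graph
    for s, t in reversed_edges:
        U[t, s] = 1.0 / out_deg[s]
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = alpha * (U @ r) + (1 - alpha) / n
    return r                                     # high score = reaches many pages

# Page 0 points at both other pages, so it scores highest as a seed candidate.
print(inverse_pagerank([(0, 1), (0, 2), (1, 2)], n=3))
```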
II. High PageRank: since we're interested in accurate trust scores for high-PageRank pages, give preference to seed pages with high PageRank, as they are likely to point to other high-PageRank pages. This may identify the goodness of fewer pages, but they may be more important pages.
Statistics: |S| = 1250 seed sites (given by inverse PageRank); only 178 of them were selected to be used as good seeds, due to extremely rigorous selection criteria.
Statistics (cont'd). [Chart omitted: the number of bad sites per PageRank bucket versus per TrustRank bucket.]
Statistics (cont'd): bucket-level demotion in TrustRank, i.e. a site from a higher PageRank bucket appearing in a lower TrustRank bucket. Spam sites in PageRank bucket 2 got demoted 7 buckets on average.
Contents: What is web spam; Combating web spam – TrustRank; Combating web spam – Mass Estimation (turning the spammers' ingenuity against themselves); Conclusion
Spam Mass – Naïve Approach: given a page x, we'd like to know whether it gets most of its PageRank from spam pages or from reputable pages. Suppose we have a partition of the web into two sets: V(S), the spam pages, and V(R), the reputable pages.
First Labeling Scheme: look at the number of direct in-links; if most of them come from spam pages, declare x a spam page. [Figure omitted: page x with in-links from good pages G-0, G-1 and from spam pages S-0 … S-k.]
Second Labeling Scheme: if the largest part of x's PageRank comes from spam nodes, we label x as spam. [Figure omitted: the same style of example, now tracing PageRank contributions from good pages G-0 … G-3 and spam pages S-0 … S-6.]
Improved Labeling Scheme. [Figure only: the same example graph, with contributions traced along entire paths rather than from direct in-neighbors alone.]
Spam Mass Definition: the absolute spam mass M(x) of a page x is the part of its PageRank that is contributed by spam pages; the relative spam mass m(x) = M(x)/p(x) is the fraction of x's PageRank due to spam.
Estimating: we assumed a priori knowledge of whether nodes are good or bad, which is not realistic! What we'll actually have is a subset of the good nodes, the good core. It is not hard to construct, since bad pages are often abandoned.
Estimating (cont'd): we compute two sets of PageRank scores: p = PR(v), based on the uniform random jump distribution v (v[i] = 1/n for i = 1..n), and p' = PR(v'), based on the random jump distribution v' that is concentrated on the good core (v'[i] = 1/n if i belongs to the good core, 0 otherwise).
Spam Mass Definition (cont'd): the estimated absolute spam mass of x is then M̃(x) = p(x) − p'(x), and the estimated relative spam mass is m̃(x) = (p(x) − p'(x)) / p(x).
Spam Detection Algorithm: compute the PageRank scores p; compute the (good-core-biased) PageRank scores p'; compute the relative spam mass vector m̃ = (p − p') / p; then, for each node x with a high enough PageRank, declare x spam if its relative spam mass exceeds a given threshold.
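A sketch of this procedure end to end (numpy, α, the two thresholds, and the boolean good-core vector are assumptions):

```python
import numpy as np

def detect_spam(T, good_core, pr_threshold, mass_threshold, alpha=0.85, iters=100):
    n = T.shape[0]
    v = np.full(n, 1.0 / n)                    # uniform random-jump distribution
    v_core = np.where(good_core, 1.0 / n, 0.0) # jump only to good-core pages

    def pr(jump):
        r = np.copy(jump)
        for _ in range(iters):
            r = alpha * (T @ r) + (1 - alpha) * jump
        return r

    p = pr(v)
    p_core = pr(v_core)
    rel_mass = (p - p_core) / p                # estimated relative spam mass
    # Flag only nodes whose PageRank is high enough to matter.
    return [x for x in range(n) if p[x] >= pr_threshold and rel_mass[x] > mass_threshold]
```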
Statistics. [Charts omitted: evaluation of the mass-based detection.]
Contents: What is web spam; Combating web spam – TrustRank; Combating web spam – Mass Estimation; Conclusion
Conclusion: we introduced 'web spam' and presented two ways to combat spammers: TrustRank (spam demotion) and spam mass estimation (spam detection).
Questions? Thank you!
Bibliography:
Web Spam Taxonomy (2004) – Gyongyi, Zoltan; Garcia-Molina, Hector. Stanford University.
Combating Web Spam with TrustRank (2005) – Gyongyi, Zoltan; Garcia-Molina, Hector; Pedersen, Jan.
Link Spam Detection Based on Mass Estimation (2005) – Gyongyi, Zoltan; Berkhin, Pavel; Garcia-Molina, Hector; Pedersen, Jan.
http://www.firstmonday.org/issues/issue10_10/tatum/
http://en.wikipedia.org/wiki/TFIDF
http://en.wikipedia.org/wiki/Pagerank