Link Analysis and Anti-Spam Tie-Yan Liu Microsoft Research Asia.

"Web Search and Mining" USTC, Outline First Session ̵ Overview of Link Analysis Technologies ̵ PageRank and HITS Second Session ̵ More about Link Analysis Algorithms Third Session ̵ Spam and Anti-Spam Homework

First Session

"Web Search and Mining" USTC, Typical Search Engine Architecture

"Web Search and Mining" USTC, Ranking for the Search Results Todays search engines may return millions of pages for a certain query It is definitely not possible for the user to preview all these results An appropriate ranking will be very helpful. ̵ Ranking on relevance ̵ Ranking on importance

"Web Search and Mining" USTC, Traditional IR Ranking A ranking purely on relevance ̵ Term frequency (tf) ̵ Inverse Document Frequency (idf) ̵ Okapi … ̵ Many other aspects that Dr. Shuming Shi will mention in the next course.

"Web Search and Mining" USTC, Limitations of Traditional IR Text-based ranking function ̵ can hardly be recognized as one of the most authoritative pages for the query harvard, since many other web pages contain harvard more often. ̵ The number of pages with the same relevance is still too large for the users to preview. Pages are not sufficiently self-descriptive ̵ Usually the term search engine doesn't appear on the web pages of search engines.

"Web Search and Mining" USTC, Whats More for Web Search In order to solve these problems ̵ We must leverage other information on the Web ̵ We must distinguish those pages with the same amount of relevance Link Analysis ̵ The web is not just a collection of pure-text documents the hyperlinks are also very important! ̵ A link from page A to page B may indicate: A is related to B, or A is recommending, citing, voting for or endorsing B ̵ Links effect the ranking of web pages and thus have commercial value.

"Web Search and Mining" USTC, Famous Link Analysis Methods HITS PageRank

"Web Search and Mining" USTC, HITS - Kleinbergs Algorithm HITS – Hypertext Induced Topic Selection For each vertex v in a subgraph of interest: ̵ a(v) - the authority of v ̵ h(v) - the hubness of v A site is very authoritative if it receives many citations. Citation from important sites weight more than citations from less-important sites Hubness shows the importance of a site. A good hub is a site that links to many authoritative sites

"Web Search and Mining" USTC, Authority and Hubness a(1) = h(2) + h(3) + h(4) h(1) = a(5) + a(6) + a(7)

"Web Search and Mining" USTC, Convergence of Authority and Hubness Recursive dependency : a(v) Σ h(w) h(v) Σ a(w) Using Linear Algebra, we can prove: w pa[v] w ch[v] a(v) and h(v) converge

"Web Search and Mining" USTC, HITS Example {1, 2, 3, 4} - nodes relevant to the topic Expand the root set R to include all the children and a fixed number of parents of nodes in R A new set S (base subgraph) Start with a root set R {1, 2, 3, 4} Find a base subgraph:

"Web Search and Mining" USTC, HITS Example HubsAuthorities(G) 1 1 [1,…,1] R 2 a h 1 3 t 1 4 repeat 5 for each v in V 6 do a (v) Σ h (w) 7 h (v) Σ a (w) 8 a a / || a || 9 h h / || h || 10 t t until || a – a || + || h – h || < ε 12 return (a, h ) Hubs and authorities: two n-dimensional a and h 00 t t t t t t t t t t tt t -1 w pa[v] |V|

"Web Search and Mining" USTC, HITS Example Results Authority Hubness Authority and hubness weights

"Web Search and Mining" USTC, Matrix Denotion of HITS It is clear that the authority and hubness values calculated by the aforementioned algorithm is the left and right singular vector of the adjacency matrix of the base sub graph.

"Web Search and Mining" USTC, PageRank Introduced by Page et al (1998) ̵ The page rank is proportional to its parents rank, but inversely proportional to its parents outdegree

"Web Search and Mining" USTC, Matrix Notation Adjacent Matrix A =

"Web Search and Mining" USTC, Matrix Notation r = B r Pagerank is embedded in the eigenvector of B associated with the eigen value 1. B =

"Web Search and Mining" USTC, Matrix Notation

"Web Search and Mining" USTC, Markov Chain Notation Random surfer model ̵ Description of a random walk through the Web graph ̵ Interpreted as a transition matrix with asymptotic probability that a surfer is currently browsing that page Does it converge to some sensible solution (as t ) regardless of the initial ranks ? r t = M r t-1 M: transition matrix for a first-order Markov chain (stochastic)

"Web Search and Mining" USTC, Problem Rank Sink Problem ̵ In general, many Web pages have no inlinks/outlinks ̵ It results in dangling edges in the graph E.g. no parent rank 0 M T converges to a matrix whose last column is all zero no children no solution M T converges to zero matrix

"Web Search and Mining" USTC, Modification Surfer will restart browsing by picking a new Web page at random M = ( B + E ) E : escape matrix M : stochastic matrix Still problem? ̵ It is not guaranteed that M is primitive ̵ If M is stochastic and primitive, PageRank converges to corresponding stationary distribution of M

"Web Search and Mining" USTC, Distribution of the Mixture Model The probability distribution that results from combining the Markovian random walk distribution & the static rank source distribution r = εe + (1- ε)x ε: probability of selecting non-linked page PageRank Now, transition matrix [εH + (1- ε)M] is primitive and stochastic r t converges to the dominant eigenvector

"Web Search and Mining" USTC, PageRank v.s. HITS - Algorithm

"Web Search and Mining" USTC, PageRank v.s. HITS - Stability Whether the link analysis algorithms based on eigenvectors are stable in the sense that results dont change significantly? General Strategy for evaluating stability: ̵ 1. Start with original adjacency matrix, A ̵ 2. Perturb the matrix to get A*, Select k nodes in graph to add or delete ̵ 3. Compute distance, d(r(A),r(A*)), for some distance measure d and objective function r that measures the quality of results of A somehow ̵ 4. Compute amount of perturbation p( Α, Α *) for some distance function p that measures the amount of perturbation ̵ 5. Evaluate the conditions, if any, where small values for p generate large values for d

"Web Search and Mining" USTC, Stability of HITS Ng 2001 ̵ A bound on the number of hyperlinks k that can added or deleted from one page without affecting the authority or hubness weights ̵ Observations ̵ Stability determined by eigengap ̵ Eigengap: difference between 1 st and 2 nd eigenvalues A T A for authorities, AA T for hubs ̵ If eigengap is big, HITS will be insensitive to small perturbations, vice versa if small δ : eigengap λ 1 – λ 2 d: maximum outdegree of G

"Web Search and Mining" USTC, Stability of PageRank Looser bound ̵ Ng et al (2001) ̵ Bianchini et al (2001) Observations ̵ The parameter ε of the mixture model has a stabilization role ̵ If original k pages to be modified do not have high overall PR scores then perturbed scores will not be far from the original

Second Session

"Web Search and Mining" USTC, Pre-PageRank PageRank achieves great success in the industry, many people regarded it as a break-through in the research field as well. Actually the basic idea of PageRank has already appeared in many previous works ̵ Mark 1988 ̵ Bray 1996 ̵ Marchiori 1997 ̵ ……

"Web Search and Mining" USTC, Mark 1988 To calculate the score S of a document at vertex v S(v) = s(v) + 1 | ch[v] | Σ S(w) w |ch(v)| v: a vertex in the hypertext graph G = (V, E) S(v): the global score s(v): the score if the document is isolated ch(v): children of the document at vertex v Limitation: - Require G to be a directed acyclic graph (DAG) - If v has a single link to w, S(v) > S(w) - If v has a long path to w and s(v) S (w) Mark, D. M., (1988), "Network models in geomorphology," Chapter 4 in Modeling in Geomorphologic Systems, Edited by M. G. Anderson, John Wiley., p

"Web Search and Mining" USTC, Bray 1996 The visibility of a site is measured by the number of other sites pointing to it ̵ Authority? The luminosity of a site is measured by the number of other sites to which it points ̵ Hub?

"Web Search and Mining" USTC, Marchiori (1997) S(v) = s(v) + h(v) - S(v): overall information - s(v): textual information - h(v): hyper information Hyper information should complement textual information to obtain the overall information h(v) = Σ F S(w) w |ch[v]| r(v, w) - F: a fading constant, F Є (0, 1) - r(v, w): the rank of w after sorting the children of v by S(w)

"Web Search and Mining" USTC, Post PageRank And following the success of PageRank, a lot of new algorithms were also proposed. ̵ Fast PageRank calculation ( Haveliwala) ̵ Topic-sensitive PageRank ̵ Personalized PageRank ̵ LinkFusion ̵ ……

"Web Search and Mining" USTC, Fast PageRank calculation [ Haveliwala – 1999] Partition the destination vector into d blocks that each fit into main memory, and to compute one block at a time. This algorithm is quite similar in structure to the Block Nested-Loop Join algorithm in database systems. which also performs very well for data sets of moderate size but eventually loses out to more scalable approaches.

"Web Search and Mining" USTC, Fast PageRank calculation [ Haveliwala – 2003] Basic observation: ̵ the convergence rates of the PageRank values of individual pages during application of the Power Method is nonuniform. That is, many pages converge quickly, with a few pages taking much longer to converge. Furthermore, the pages that converge slowly are generally those pages with high PageRank.

"Web Search and Mining" USTC, Topic-Specific PageRank [Haveliwala - WWW02] Topic-specific PageRanks ̵ For each page precomputed PageRank values of the most relevant topics used for each query. ̵ 16 topics

"Web Search and Mining" USTC, Link Fusion – [Zeng, WWW04] In a more generalized scenario, suppose there are N data types. The importance attribute of one type of object can be reinforced by both inter and intra-type links as: Suppose w is the attribute vector of all the objects in the URM. Link Fusion can be represented as: w new =L urm T w old Such iterative calculation can be continued: w n =(L urm T ) n w 0 The result w is the prime eigenvector of L urm, which can be explained as the value of data objects regarding a specific attribute.

"Web Search and Mining" USTC, Limits of Link Analysis Pay-for-place ̵ Search engine bias : organizations pay search engines and page rank ̵ Advertisements: organizations pay high ranking pages for advertising space With a primary effect of increased visibility to end users and a secondary effect of increased respectability due to relevance to high ranking page

"Web Search and Mining" USTC, Limits of Link Analysis Stability ̵ Adding even a small number of nodes/edges to the graph has a significant impact Topic drift ̵ A top authority may be a hub of pages on a different topic resulting in increased rank of the authority page Content evolution ̵ Adding/removing links/content can affect the intuitive authority rank of a page requiring recalculation of page ranks

Third Session

"Web Search and Mining" USTC, What is Link Spam Since link analysis has played an important role in search engines, it has large commercial values Improving ones PageRank, can directly increase ones clicks thus earn more money. Link Spam is something trying to unfairly gain a high ranking on a search engine for a web page without improving the user experience, by mean of tricky modification / manipulation of the link graph.

"Web Search and Mining" USTC, Link Spamming Technologies Adding outlinks ̵ Replicate hub pages Adding inlinks ̵ Create a honey pot ̵ Infiltrate a web directory ̵ Post links on blog, wiki, etc ̵ Participate in-link exchange ̵ Buy expired domains ̵ Create own spam farm.

"Web Search and Mining" USTC, Case Study: Spam HITS Hub score can be increased by adding outlinks to the target page Authority score can be increased by creating hyperlinks from high-hub-score pages to the target page.

"Web Search and Mining" USTC, Case Study: Spam PageRank Factors that influence PageRank ̵ PR(t)=PR static (t)+PR in (t)-PR out (t)-PR sink (t) Strategies ̵ Own pages are part of the spam farm, maximizing PR static ̵ Accessible pages point to the spam farm, maximizing PR in ̵ Links pointing outside the spam farm are supressed, minimizing PR out (t) ̵ All pages within the farm have some outlinks, minimizing PR sink (t)

"Web Search and Mining" USTC, Anti-Spam Early approaches ̵ BHITS, SALSA, DOM, revised HITS, BadRank … State-of-the-art ̵ TrustRank (2004) ̵ Revised PageRank (VLDB2004) ̵ BadRank + (WWW2005) ̵ SpamRank (WWW2005, workshop) ̵ ……

"Web Search and Mining" USTC, TrustRank Basic assumption ̵ Good pages seldom point to spam pages, but spam pages may very likely point to good pages. Use TrustRank to denote the goodness of a webpage, and use Trust Propagation to label all the web pages starting from a small human-labeled seed set.

"Web Search and Mining" USTC, TrustRank Step 1: Initialization ̵ How to select seeds Inverse PageRank (Hub pages, since they have more influence) High PageRank (Important pages are more important to search applications) Step 2: Propagation

"Web Search and Mining" USTC, TrustRank Step 3: ̵ Trust Dampening ̵ Trust Splitting

"Web Search and Mining" USTC, BadRank+ Motivation ̵ Pages in the spam farm are densely connected, and many common pages exist in both the inlinks and outlinks of these pages. Propagate the badness of pages in the seed set to detect other the spam pages in the Web.

"Web Search and Mining" USTC, BadRank+ Step 1: Initialization ̵ At least 3 common nodes (approximately the same, i.e. with the same domain name) in the inlink and outlink sets Step 2: Expansion ̵ ParentPenalty: if a page links to many bad pages (larger than a threshold), it will also be labeled as bad. ̵ Delete all the links between detected bad pages before PageRank calculation.

"Web Search and Mining" USTC, Revised PageRank Assumption ̵ The spam farm have high correlation with each other. Approach ̵ Increase the probability of jumping from nodes with large correlation coefficients.

"Web Search and Mining" USTC, Revised PageRank Step 1: Collusion detection ̵ Calculate PageRank values for different ε ̵ Calculate the correlation coefficient between the curve of node xs PageRank and 1/ ε, denoted by co- co(x). Step 2: ε Personalization ̵ Use F( ε default, co-co(x)) to personalize the original matrix U. ̵ Recalculate PageRank.

"Web Search and Mining" USTC, SpamRank Key assumption ̵ Supporters of an honest page should not overly dependent on one another, i.e. they should be spread across different quality. ̵ Due to the self-similarity, the honest supporter set should have a power-law distribution of PageRank. ̵ Spammers have a limited budget, so they do not replicate the unimportant structures.

"Web Search and Mining" USTC, Summary The current works on anti-spam are very limited. Promising research directions ̵ Use more statistics and the properties of the transition probability matrix to detect spam ̵ Design a new spam-free ranking function

Homework

"Web Search and Mining" USTC, Technical Report Writing 1. HITS and PageRank are both based on simple linear algebra, can you design some other link analysis algorithm based on advanced linear algebra or matrix factorization? 2. The performance / sensitivity of PageRank with respect to the smoothing factor ε. 3. How to speed up the calculation of PageRank using matrix factorization, or some specific characteristics of the Markov chain? 4. PageRank is the eigenvector of a 2-D matrix, then can LinkFusion be the eigenvector of a 3-D tensor? 5. Stability analysis for other link analysis algorithms. 6. A survey on the state-of-the-art spam technologies. 7. How to design a search engine that is robust to spam? 8. Other novel research topics related to link analysis.

"Web Search and Mining" USTC, Requirements Send the report to before Dec 4 (within 1 The length should not be less than 8 pages, with the template at There must be something new and intersting in your report, and yous better use some experiments to support your idea. Never try to copy or steal already-published ideas as your technical report. We are sure we have read much more than you can find.

"Web Search and Mining" USTC, Other Information Slides can be found at