Ch 14. Link Analysis Padmini Srinivasan Computer Science Department

Ch 14. Link Analysis Padmini Srinivasan Computer Science Department http://cs.uiowa.edu/~psriniva padmini-srinivasan@uiowa.edu

Web Search Hard problem – Hats off to ‘information retrieval’ – Complex information needs Keywords Synonyms, polysemy (multiple meanings) – True homonyms: row (oar), row (argue); delta (Greek letter and of a river) – Polysemous homonyms: mouth (of a river), mouth (of an animal); right ‘hand’ person, ‘hand’ it to me – The age of intermediaries (BRS After Dark) – Diversity in writing + Diversity in queries + Diversity in indexing + Diversity in motivations – Controlled vocabularies vs free text – Majority rule? ‘Cornell’

Web Search Peculiarities Compared to the good old days Needle in a haystack problem; many needles in many haystacks! Which ones to look for? – How distinct is this from the “traditional” methods for IR? Libraries etc. – Can we do without libraries? Quality – a serious question? – Does redundancy promote quality? – Does collaboration promote quality? Scale – Retrieve and FILTER/ORGANIZE – Satisfying versus satisficing

Link Analysis In-links and out-links; in-degree and out-degree – A matter of endorsement! (directional) – Akin to citations – What are the differences? Must one out-link? – Power laws all the way through!
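In-degree and out-degree can be read straight off a list of directed links. The sketch below is a minimal illustration, assuming links are given as (source, target) pairs; the function name `degree_counts` and the toy graph are hypothetical and not part of the slides.

```python
from collections import Counter


def degree_counts(edges):
    """Return (in_degree, out_degree) Counters for a list of (source, target) links."""
    in_degree = Counter()
    out_degree = Counter()
    for source, target in edges:
        out_degree[source] += 1  # the source page casts an endorsement
        in_degree[target] += 1   # the target page receives it
    return in_degree, out_degree


# Hypothetical toy web: A and B both link to C, and C links back to A.
edges = [("A", "C"), ("B", "C"), ("C", "A")]
in_degree, out_degree = degree_counts(edges)
print(in_degree["C"], out_degree["C"])  # 2 1
```

Because links are directional, page B here has out-degree 1 but in-degree 0: it endorses without being endorsed.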

Some studies (Kumar et al. 99): Alexa web crawl from 1997 over 40 million nodes. Trawling the Web for cyber communities, Proc. 8th WWW, Apr 1999. Probability that a page has in-degree k is proportional to 1/k^2. Probability that a page has in-degree at least k is proportional to 1/k. Actual exponent slightly larger than 2. Barabasi and Albert 1999 – studied the U. Notre Dame web site with some extensions
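The two in-degree statements on this slide are consistent with each other. A rough sketch of why, treating the exponent as exactly 2 and approximating the sum by an integral (both are simplifications, since the slide notes the measured exponent is slightly larger than 2):

```latex
\Pr[\text{in-degree} = k] \propto \frac{1}{k^{2}}
\quad\Longrightarrow\quad
\Pr[\text{in-degree} \ge k] \propto \sum_{j \ge k} \frac{1}{j^{2}}
\approx \int_{k}^{\infty} \frac{dx}{x^{2}} = \frac{1}{k}
```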

Broder et al., Graph Structure in the Web. Note that the exponent is different. Note also the deviation in the low end of the out-degree.

Fractals? Broder et al., Graph structure in the web: “almost fractal like quality for the power law in-degree and out-degree distributions, in that it appears both as a macroscopic phenomenon on the entire web, as a microscopic phenomenon at the level of a single university website, and at intermediate levels between these two.”

Similar Studies Donato et al. ACM TOIT, 2007. The Web as a Graph: How Far We Are – In-degree: power law; exponent 2.1 (Fig. 4) – Out-degree: not so good (Fig. 5) – Check out Fig. 8: SCC distribution (number of SCCs versus Size of SCC). Power law; exponent 2.09 Webbase, 200 Million Stanford crawl (2001) – 39% OUT; 11% IN; 13% Tendrils; 33% SCC (48 million) next SCC: 10 thousand!

Hubs & Authorities In-links: votes. HITS algorithm: Hyperlink-Induced Topic Search. – A good hub is one that points to good authorities [lists; directories] – A good authority is one that is pointed to by good hubs – A good hub need not be an authority and vice versa. – Those who have knowledge; those who know well about those who have knowledge – Dynamic estimation; repeated application of update rules. Converges!

Algorithm First conduct retrieval. Compute hubs and authorities on the relevant set – Rank the retrieved set by a list of hubs and a list of authorities. Initialize hub and authority scores (say to all 1, or some other positive number) – Apply authority score update rule – Apply hub score update rule. Example: fig 14.15 and 14.18 (problem 3)
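The update rules can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: it assumes the retrieved set is already given as a list of (source, target) links, it normalizes scores to sum to 1 after each round (a common choice that the slide does not specify), and the names `hits` and `normalize` and the toy graph are hypothetical.

```python
def normalize(scores):
    """Rescale scores to sum to 1; leave them unchanged if they sum to 0."""
    total = sum(scores.values())
    if total == 0:
        return dict(scores)
    return {node: value / total for node, value in scores.items()}


def hits(edges, iterations=50):
    """Run the hub/authority update rules on a list of (source, target) links."""
    nodes = sorted({node for edge in edges for node in edge})
    hub = {node: 1.0 for node in nodes}   # initialize all scores to 1
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority update: sum the hub scores of the pages that point to it.
        new_auth = {node: 0.0 for node in nodes}
        for source, target in edges:
            new_auth[target] += hub[source]
        # Hub update: sum the (new) authority scores of the pages it points to.
        new_hub = {node: 0.0 for node in nodes}
        for source, target in edges:
            new_hub[source] += new_auth[target]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth


# Hypothetical toy graph: h1 points to both a1 and a2, h2 points only to a1.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1")]
hub, auth = hits(edges)
print(max(hub, key=hub.get), max(auth, key=auth.get))  # h1 a1
```

In this toy graph a1 ends up the stronger authority because both hubs point to it, and h1 the stronger hub because it points to both authorities.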

It's all about convergence. First show how the update rule works with matrices M and M^T. Then show the same using eigenvectors. Then show that the initialization of hub scores really does not matter, as long as it is a positive vector, i.e., all hub scores are initialized to a positive number.
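The three steps can be written out compactly. The sketch below assumes M is the n x n adjacency matrix with entry (i, j) equal to 1 when page i links to page j, writes h and a for the hub and authority vectors, and uses z_i and c_i (symbols introduced here for illustration) for the orthonormal eigenvectors and eigenvalues of the symmetric matrix M M^T, ordered so that c_1 > c_2 >= ... >= c_n >= 0. The strict gap between c_1 and c_2 is an added assumption.

```latex
\begin{align*}
a &\leftarrow M^{T} h  && \text{(authority update)} \\
h &\leftarrow M a      && \text{(hub update)} \\
h^{(k)} &= (M M^{T})^{k}\, h^{(0)}, \qquad a^{(k)} = (M^{T} M)^{k-1} M^{T} h^{(0)} \\
h^{(0)} &= \sum_{i=1}^{n} q_i z_i
  \;\Longrightarrow\;
  \frac{h^{(k)}}{c_1^{k}} = q_1 z_1 + \sum_{i=2}^{n} \Bigl(\frac{c_i}{c_1}\Bigr)^{k} q_i z_i
  \;\longrightarrow\; q_1 z_1 \quad (k \to \infty)
\end{align*}
```

Under these assumptions the limiting direction is z_1 whatever the starting vector, provided q_1 is nonzero, and a positive starting vector is what guarantees that.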

PageRank Endorsements repeatedly move through out-links. A → B. Principle of repeated improvement: – Weight of ‘current’ endorsement depends on ‘current’ estimate of A’s PageRank. – More important nodes convey higher endorsements. – Stabilize ~ till the network changes

Calculation Initialize: each node has a PageRank = 1/n, where n is the number of nodes. Basic PageRank Update Rule: – A node divides its PageRank equally over its out-links. If it has no out-links, it keeps its PageRank. – The PageRank of a node = sum of PageRanks it receives in that iteration. – Total PageRank stays constant, so no need for normalizing. Iterate until convergence or for a fixed number of iterations.
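The basic update rule translates almost line for line into code. The sketch below is illustrative only: it assumes the graph is a dict mapping every node to its list of out-links (each target must also appear as a key), runs a fixed number of iterations instead of testing for convergence, and uses a hypothetical function name and toy graph.

```python
def basic_pagerank(out_links, iterations=100):
    """Apply the basic PageRank update rule a fixed number of times."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}  # initialize to 1/n
    for _ in range(iterations):
        new_rank = {node: 0.0 for node in nodes}
        for node, targets in out_links.items():
            if targets:
                # Divide the node's PageRank equally over its out-links.
                share = rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # No out-links: the node keeps its own PageRank.
                new_rank[node] += rank[node]
        rank = new_rank
    return rank


# Hypothetical toy graph: A -> B, A -> C, B -> C, C -> A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rank = basic_pagerank(graph)
print({node: round(value, 3) for node, value in rank.items()})
# {'A': 0.4, 'B': 0.2, 'C': 0.4}
```

The equilibrium can be checked by hand: C passes everything to A, so A and C are equal; B receives half of A's value; and the three values sum to 1.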

Equilibrium: no further changes in PageRanks. Degenerate cases exist (Scaled PageRank Updates). Values need not be unique except where the network is strongly connected.

Slow leaks?

Scaled PageRank Update Rule Scaling factor s: between 0 and 1, generally between 0.8 and 0.9 – Apply the basic PageRank update rule. – Scale down every page's value by s (say 0.9), so each gets 0.9 * PageRank. – Total PageRank is now s. – Divide the remaining PageRank (1 - s) equally over all nodes. This gives a unique set of values for each setting of s. [shown later in proofs] Random walk model [browsing, not searching]: probability of reaching a page = prob(coming across an in-link) + prob(getting there at random)
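A minimal sketch of the scaled rule follows, using the same assumptions as a dict-of-out-links representation where every target also appears as a key. The function name, the default s = 0.9, the fixed iteration count, and the toy graph are illustrative choices, not taken from the slides.

```python
def scaled_pagerank(out_links, s=0.9, iterations=100):
    """Basic PageRank update, then scale by s and share (1 - s) equally."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Step 1: basic update rule.
        flowed = {node: 0.0 for node in nodes}
        for node, targets in out_links.items():
            if targets:
                share = rank[node] / len(targets)
                for target in targets:
                    flowed[target] += share
            else:
                flowed[node] += rank[node]
        # Step 2: scale every value by s, so the total drops to s.
        # Step 3: divide the remaining (1 - s) equally over all n nodes.
        rank = {node: s * flowed[node] + (1.0 - s) / n for node in nodes}
    return rank


# Hypothetical toy graph: A -> B -> C, with C and D linking only to each other.
graph = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["C"]}
rank = scaled_pagerank(graph)
print(rank["A"] > 0, rank["C"] > rank["A"])  # True True
```

Under the basic rule, C and D would gradually absorb all the PageRank in this graph and A would drop to zero. With scaling, every node keeps at least (1 - s)/n, which is the value A holds here because nothing links to it.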

Summary Link based analysis – Power laws: in-links, out-links etc. Hubs and Authorities – convergence PageRank – convergence
