
1 Web Searching & Ranking Zachary G. Ives University of Pennsylvania CIS 455/555 – Internet and Web Systems October 25, 2015 Some content based on slides by Marti Hearst, Ray Larson

2 Recall Where We Left Off. We were discussing information retrieval ranking models. The Boolean model captures some intuitions of what we want (AND, OR), but it's too restrictive and has no real ranking between returned answers.

3 Vector Model. sim(q, d_j) = cos(θ) = (d_j · q) / (|d_j| |q|) = (Σ_i w_ij · w_iq) / (|d_j| |q|). Since w_ij > 0 and w_iq > 0, 0 ≤ sim(q, d_j) ≤ 1. A document is retrieved even if it matches the query terms only partially. (Figure: query vector q and document vector d_j separated by angle θ.)
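A minimal Python sketch of this cosine measure (not from the slides; the sparse-dict representation and the toy term weights are illustrative assumptions):

```python
import math

def cosine_sim(doc_weights, query_weights):
    """Cosine of the angle between a document vector d_j and a query vector q.

    Both arguments are sparse dicts mapping term -> weight (w_ij and w_iq).
    """
    # Numerator: sum of w_ij * w_iq over the terms the two vectors share
    dot = sum(w * query_weights[t] for t, w in doc_weights.items() if t in query_weights)
    # Denominator: |d_j| * |q|
    doc_norm = math.sqrt(sum(w * w for w in doc_weights.values()))
    query_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    if doc_norm == 0 or query_norm == 0:
        return 0.0
    return dot / (doc_norm * query_norm)

# A partial match still gets a score strictly between 0 and 1
print(cosine_sim({"olympics": 0.8, "sports": 0.5}, {"olympics": 1.0, "tickets": 0.4}))
```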

4 Weights in the Vector Model. sim(q, d_j) = (Σ_i w_ij · w_iq) / (|d_j| |q|). How do we compute the weights w_ij and w_iq? A good weight must take into account two effects: quantification of intra-document content (similarity), the tf factor, i.e. the term frequency within a document; and quantification of inter-document separation (dissimilarity), the idf factor, i.e. the inverse document frequency. w_ij = tf(i,j) * idf(i)

5 TF and IDF Factors. Let N be the total number of docs in the collection, n_i the number of docs which contain term k_i, and freq(i,j) the raw frequency of k_i within d_j. A normalized tf factor is given by f(i,j) = freq(i,j) / max_l freq(l,j), where the maximum is computed over all terms which occur within the document d_j. The idf factor is computed as idf(i) = log(N / n_i); the log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term k_i.
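These tf and idf definitions translate directly into code. A small sketch, assuming a toy corpus of already-tokenized documents:

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """Compute w_ij = f(i,j) * idf(i) for a list of tokenized documents."""
    N = len(docs)                                   # total number of docs
    df = Counter()                                  # n_i: number of docs containing k_i
    for doc in docs:
        df.update(set(doc))
    idf = {t: math.log(N / n_i) for t, n_i in df.items()}

    weights = []
    for doc in docs:
        freq = Counter(doc)                         # raw frequency freq(i,j)
        max_freq = max(freq.values())               # max over all terms l in d_j
        weights.append({t: (f / max_freq) * idf[t] for t, f in freq.items()})
    return weights

docs = [["web", "search", "ranking"], ["web", "crawling"], ["vector", "model", "ranking"]]
print(tf_idf_weights(docs))
```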

6 Vector Model Example II (figure: documents d1–d7 plotted over index terms k1, k2, k3).

7 Vector Model Example III (figure: documents d1–d7 plotted over index terms k1, k2, k3).

8 Vector Model, Summarized. The best term-weighting schemes use tf-idf weights: w_ij = f(i,j) * log(N / n_i). For the query term weights, a suggestion is w_iq = (0.5 + 0.5 * freq(i,q) / max_l freq(l,q)) * log(N / n_i). This model is very good in practice: tf-idf works well with general collections, it is simple and fast to compute, and the vector model is usually as good as the known ranking alternatives.
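For the query-side weights, a hedged sketch of the suggested w_iq formula (the idf values below are made-up toy numbers, not from the slides):

```python
import math
from collections import Counter

def query_weights(query_terms, idf):
    """w_iq = (0.5 + 0.5 * freq(i,q) / max_l freq(l,q)) * log(N / n_i)."""
    freq = Counter(t for t in query_terms if t in idf)   # ignore unknown terms
    if not freq:
        return {}
    max_freq = max(freq.values())
    return {t: (0.5 + 0.5 * f / max_freq) * idf[t] for t, f in freq.items()}

# Made-up idf values for a tiny 3-document collection
idf = {"web": math.log(3 / 2), "ranking": math.log(3 / 2), "vector": math.log(3 / 1)}
print(query_weights(["vector", "ranking", "ranking"], idf))
```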

9 Pros & Cons of Vector Model. Advantages: term-weighting improves quality of the answer set; partial matching allows retrieval of docs that approximate the query conditions; the cosine ranking formula sorts documents according to degree of similarity to the query. Disadvantages: assumes independence of index terms; not clear if this is a good or bad assumption.

10 Comparison of Classic Models. The Boolean model does not provide for partial matches and is considered to be the weakest classic model. Experiments indicate that the vector model outperforms the third alternative, the probabilistic model, in general. Generally we use a variation of the vector model in most text search systems.

11 Switching Our Sights to the Web. Information retrieval is more heterogeneous in nature: no editor to control quality; deliberately misleading information ("web spam"); great variety in types of information (phone books, catalogs, technical reports, news, slide shows, …); many languages, partial duplication, jargon; diverse user goals. Very short queries: ~2.35 words on average (Aug 2000; Google results). And much larger scale!

12 Handling Short Queries & Mixed-Quality Information. Human processing: web directories (Yahoo, Open Directory, …) and human-created answers (about.com, Search Wikia); it's still not clear that automated question-answering works. Capitalism: "paid placement", where advertisers pay to be associated with certain keywords. Clicks / page popularity: pages visited most often. Link analysis: use link structure to determine credibility. … or a combination of all of these?

13 Link Analysis for Starting Points: HITS (Kleinberg), PageRank (Google). Assumptions: credible sources will mostly point to credible sources; names of hyperlinks suggest meaning. Ranking is a function of the query terms and of the hyperlink structure. An example of why this makes sense: the official Olympics site will be linked to by most high-quality sites about sports, the Olympics, etc., while a spammer who adds "Olympics" to his/her web site probably won't have many links to it. Caveat: "search engine optimization".

14 Google's PageRank (Brin/Page 98). Mine the structure of the web graph independently of the query! Each web page is a node, each hyperlink is a directed edge. Assumes a random walk (surf) through the web: start at a random page; at each step, the surfer proceeds to a randomly chosen successor of the current page with probability d, or to a randomly chosen web page with probability 1 - d. The PageRank of a page p is the fraction of steps the surfer spends at p in the limit.
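The random-surfer definition can be simulated directly. A rough Monte Carlo sketch, assuming a hypothetical three-page graph; it estimates PageRank as the fraction of steps spent on each page:

```python
import random
from collections import Counter

def random_surfer(graph, d=0.85, steps=100_000, seed=0):
    """Estimate PageRank as the fraction of steps a random surfer spends at each page.

    graph: dict mapping page -> list of successor pages.  With probability d the
    surfer follows a random out-link; otherwise (or at a dead end) it jumps to a
    page chosen uniformly at random.
    """
    rng = random.Random(seed)
    pages = list(graph)
    current = rng.choice(pages)                # start at a random page
    visits = Counter()
    for _ in range(steps):
        visits[current] += 1
        successors = graph[current]
        if successors and rng.random() < d:
            current = rng.choice(successors)   # follow a random out-link
        else:
            current = rng.choice(pages)        # random jump
    return {p: visits[p] / steps for p in pages}

# Hypothetical three-page web
graph = {"google": ["amazon"], "yahoo": ["google", "amazon"], "amazon": ["google", "yahoo"]}
print(random_surfer(graph))
```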

15 Link Counts Aren't Everything… (figure: an example link graph whose nodes include the "A-Team" page, a Hollywood "Series to Recycle" page, the Yahoo Directory, Wikipedia, Mr. T's page, a Team Sports page, and a Cheesy TV Shows page).

16 PageRank. The importance of page i is governed by the pages linking to it: Rank(i) = Σ_{j ∈ B(i)} Rank(j) / N(j), where B(i) is the set of pages j that link to i and N(j) is the number of links out from page j.

17 Computing PageRank (Simple version). Initialize so total rank sums to 1.0; iterate Rank(i) = Σ_{j ∈ B(i)} Rank(j) / N(j) until convergence.

18 Computing PageRank (Step 0). Initialize so total rank sums to 1.0 (figure: each page in a three-page example graph starts with rank 0.33).

19 Computing PageRank (Step 1). Propagate weights across out-edges (figure: values 0.17, 0.33, 0.17 flow along the out-edges).

20 Computing PageRank (Step 2). Compute weights based on in-edges (figure: the pages now hold 0.17, 0.50, and 0.33).

21 Computing PageRank (Convergence). The ranks converge to 0.2, 0.4, and 0.4.

22 Naïve PageRank Algorithm Restated. Let N(p) = the number of outgoing links from page p, and B(p) = the set of pages that link to p (its back-links). Each page b distributes its importance to all of the pages it points to (so we scale by N(b)). Page p's importance is increased by the importance of its back set: Rank(p) = Σ_{b ∈ B(p)} Rank(b) / N(b).
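A minimal sketch of this naïve algorithm in Python (the three-page graph is a hypothetical example, not the one drawn in the slides' figures):

```python
def naive_pagerank(graph, iterations=50):
    """Naive PageRank: rank(p) = sum over b in B(p) of rank(b) / N(b).

    graph: dict mapping page -> list of pages it links to.  No decay factor,
    so dead ends and rank sinks (next slides) are not handled.
    """
    pages = list(graph)
    rank = {p: 1.0 / len(pages) for p in pages}       # total rank sums to 1.0
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for b, successors in graph.items():           # b distributes rank(b) / N(b)
            if successors:
                share = rank[b] / len(successors)
                for p in successors:
                    new_rank[p] += share
        rank = new_rank
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}     # hypothetical 3-page graph
print(naive_pagerank(graph))                          # converges toward A=0.4, B=0.2, C=0.4
```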

23 In Linear Algebra Terms. Create an m x m matrix M to capture links: M(i, j) = 1 / n_j if page i is pointed to by page j and page j has n_j outgoing links, and M(i, j) = 0 otherwise. Initialize all PageRanks to 1 and multiply by M repeatedly until all values converge. (This computes the principal eigenvector via power iteration.)
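A small helper showing how such a matrix can be built from an adjacency list (a sketch; the dict-of-lists graph format and the three-page graph, reused from the random-surfer example above, are assumptions):

```python
import numpy as np

def link_matrix(graph):
    """Build M with M[i, j] = 1 / n_j if page j points to page i, else 0."""
    pages = list(graph)
    index = {p: i for i, p in enumerate(pages)}
    M = np.zeros((len(pages), len(pages)))
    for j, successors in graph.items():
        for p in successors:
            M[index[p], index[j]] = 1.0 / len(successors)
    return pages, M

pages, M = link_matrix({"google": ["amazon"],
                        "yahoo": ["google", "amazon"],
                        "amazon": ["google", "yahoo"]})
print(pages)
print(M)
```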

24 A Brief Example. Pages: Google, Yahoo, Amazon. Total rank sums to the number of pages:

    [g']   [ 0    0.5  0.5 ]   [g]
    [y'] = [ 0    0    0.5 ] * [y]
    [a']   [ 1    0.5  0   ]   [a]

Running for multiple iterations from (g, y, a) = (1, 1, 1): (1, 0.5, 1.5), (1, 0.75, 1.25), …
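Power iteration on this example can be replayed in a few lines; the matrix below is my reading of the garbled slide, reconstructed so that it reproduces the iterates shown:

```python
import numpy as np

# Reconstructed link matrix: Google -> Amazon,
# Yahoo -> Google & Amazon, Amazon -> Google & Yahoo.
M = np.array([[0.0, 0.5, 0.5],
              [0.0, 0.0, 0.5],
              [1.0, 0.5, 0.0]])

v = np.array([1.0, 1.0, 1.0])        # (g, y, a); total rank = number of pages
for step in range(1, 5):
    v = M @ v                        # one power-iteration step
    print(step, v)                   # 1: [1. 0.5 1.5], 2: [1. 0.75 1.25], ...
```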

25 Oops #1 – PageRank Sinks: Dead Ends. Yahoo now has no outgoing links, so its column of the matrix is all zeros:

    [g']   [ 0    0    0.5 ]   [g]
    [y'] = [ 0.5  0    0.5 ] * [y]
    [a']   [ 0.5  0    0   ]   [a]

Running for multiple iterations from (1, 1, 1): (0.5, 1, 0.5), (0.25, 0.5, 0.25), …, (0, 0, 0): the dead end leaks rank out of the system.

26 Oops #2 – Hogging all the PageRank. Yahoo now links only to itself:

    [g']   [ 0    0    0.5 ]   [g]
    [y'] = [ 0.5  1    0.5 ] * [y]
    [a']   [ 0.5  0    0   ]   [a]

Running for multiple iterations from (1, 1, 1): (0.5, 2, 0.5), (0.25, 2.5, 0.25), …, (0, 3, 0): the self-link accumulates all of the rank.

27 Improved PageRank. Remove out-degree-0 nodes (or consider them to refer back to the referrer). Add a decay factor to deal with sinks: PageRank(p) = d * Σ_{b ∈ B(p)} PageRank(b) / N(b) + (1 - d). The intuition is the idea of the "random surfer": the surfer occasionally stops following the link sequence and jumps to a new random page, with probability 1 - d.
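A hedged sketch of PageRank with the decay factor. The dead-end fix used here (distributing a dead end's rank to every page) is one of the options the slide mentions only in passing, and the example graph is the reconstructed rank-hog structure from Oops #2:

```python
def pagerank(graph, d=0.85, iterations=50):
    """PageRank(p) = d * sum_{b in B(p)} PageRank(b) / N(b) + (1 - d).

    graph: dict mapping page -> list of out-links.  Dead ends are treated as
    linking to every page (one possible fix; removing them or pointing them
    back at the referrer are the slide's alternatives).
    """
    pages = list(graph)
    rank = {p: 1.0 for p in pages}                       # total rank sums to len(pages)
    for _ in range(iterations):
        new_rank = {p: 1.0 - d for p in pages}           # the random-jump share
        for b, successors in graph.items():
            targets = successors if successors else pages
            share = d * rank[b] / len(targets)
            for p in targets:
                new_rank[p] += share
        rank = new_rank
    return rank

# The rank-hog structure from Oops #2 (as reconstructed): Yahoo links only to itself
graph = {"google": ["yahoo", "amazon"], "yahoo": ["yahoo"], "amazon": ["google", "yahoo"]}
print(pagerank(graph, d=0.8))        # settles near google=0.33, yahoo=2.33, amazon=0.33
```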

28 Stopping the Hog. Apply the decay factor (d = 0.8) to the same Google/Yahoo/Amazon graph:

    [g']         [ 0    0    0.5 ]   [g]
    [y'] = 0.8 * [ 0.5  1    0.5 ] * [y] + 0.2
    [a']         [ 0.5  0    0   ]   [a]

Running for multiple iterations from (1, 1, 1): (0.6, 1.8, 0.6), (0.44, 2.12, 0.44), (0.38, 2.25, 0.38), (0.35, 2.30, 0.35), … though does this seem right?
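The damped iteration can be replayed numerically; again, the matrix is my reconstruction of the slide's example:

```python
import numpy as np

M = np.array([[0.0, 0.0, 0.5],       # reconstructed link matrix from Oops #2
              [0.5, 1.0, 0.5],
              [0.5, 0.0, 0.0]])
v = np.ones(3)                       # (g, y, a)
for step in range(1, 5):
    v = 0.8 * (M @ v) + 0.2          # decay d = 0.8 plus the random-jump share
    print(step, np.round(v, 2))      # 1: [0.6 1.8 0.6], 2: [0.44 2.12 0.44], ...
```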

29 Summary of Link Analysis. Use back-links as a means of adjusting the "worthiness" or "importance" of a page. Use an iterative process over matrix/vector values to reach a convergence point. PageRank is query-independent and considered relatively stable, but vulnerable to SEO.

30 Can We Go Beyond? PageRank assumes a "random surfer" who starts at any node, and estimates the likelihood that the surfer will end up at a particular page. A more general notion is label propagation: take a set of start nodes, each with a different label, and estimate, for every node, the distribution of arrivals from each label. In essence, this captures the relatedness or influence of nodes. Used in YouTube video matching, schema matching, …
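One concrete (hypothetical) way to realize the label-propagation idea is to run a personalized-PageRank-style walk per label and then normalize each node's scores into a distribution. This is only a sketch under that assumption, not the algorithm used by the systems mentioned:

```python
def propagate_labels(graph, seeds, d=0.85, iterations=50):
    """For every node, estimate a distribution over labels propagated from seed nodes.

    graph: dict mapping node -> list of neighbors (out-links).
    seeds: dict mapping a few start nodes -> their label.
    Runs one walk per label, restarting at that label's seeds, then normalizes
    each node's per-label scores into a distribution.
    """
    labels = set(seeds.values())
    scores = {node: {} for node in graph}
    for label in labels:
        restart = {n: (1.0 if seeds.get(n) == label else 0.0) for n in graph}
        total = sum(restart.values())
        restart = {n: v / total for n, v in restart.items()}   # restart distribution
        rank = dict(restart)
        for _ in range(iterations):
            new_rank = {n: (1 - d) * restart[n] for n in graph}
            for b, successors in graph.items():
                targets = successors if successors else list(graph)
                share = d * rank[b] / len(targets)
                for p in targets:
                    new_rank[p] += share
            rank = new_rank
        for n, v in rank.items():
            scores[n][label] = v
    result = {}
    for n, s in scores.items():                   # normalize into a distribution per node
        total = sum(s.values())
        result[n] = {l: v / total for l, v in s.items()} if total else s
    return result

graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(propagate_labels(graph, seeds={"a": "sports", "d": "music"}))
```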

31 Overall Ranking Strategies in Web Search Engines. Everybody has their own "secret sauce" that uses: the vector model (TF/IDF); proximity of terms; where terms appear (title vs. body vs. link); link analysis; info from directories; page popularity. The gorank.com "search engine optimization site" compares these factors. Some alternative approaches: some new engines (Vivisimo, Teoma, Clusty) try to do clustering; a few engines (Dogpile, Mamma.com) try to do meta-search.

