1 ICS 215: Advances in Database Management System Technology
Spring 2004
Professor Chen Li
Information and Computer Science
University of California, Irvine

2 ICS215Notes 012 Course Web Server URL: –All course info will be posted online Instructor: Chen Li –ICS 424B, Course general info:

3 Topic today: Web Search
How did earlier search engines work?
How does Google work?
Readings:
–Lawrence and Giles, Searching the World Wide Web, Science, 1998.
–Brin and Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, WWW7/Computer Networks 30(1-7): 107-117, 1998.

4 Earlier Search Engines
Hotbot, Yahoo, Alta Vista, Northern Light, Excite, Infoseek, Lycos …
Main technique: “inverted index”
–Conceptually: use a matrix to represent how many times a term appears in one page
–# of columns = # of pages (huge!)
–# of rows = # of terms (also huge!)

            Page1  Page2  Page3  Page4  …
‘car’         1      0      1      0
‘toyota’      0      2      0      1      → page 2 mentions ‘toyota’ twice
‘honda’       2      1      0      0
…
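A minimal sketch of how the matrix above could be stored sparsely, keeping only the nonzero counts per term. The function name `build_inverted_index` and the toy page texts are hypothetical; they are chosen only so that the counts match the example matrix.

```python
from collections import Counter, defaultdict

def build_inverted_index(pages):
    """Map each term to {page_id: count}, storing only nonzero entries."""
    index = defaultdict(dict)
    for page_id, text in pages.items():
        for term, count in Counter(text.lower().split()).items():
            index[term][page_id] = count
    return index

# Hypothetical toy pages whose term counts match the matrix on the slide.
pages = {
    "page1": "car honda honda",
    "page2": "toyota toyota honda",
    "page3": "car",
    "page4": "toyota",
}
index = build_inverted_index(pages)
print(index["toyota"])  # {'page2': 2, 'page4': 1}
```

Each row of the conceptual matrix becomes one small dictionary, so zero entries cost nothing.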

5 Search by Keywords
If the query has one keyword, just return all the pages that have the word
–E.g., “toyota” → all pages containing “toyota”: page2, page4,…
–There could be many, many pages!
–Solution: return first the pages in which the word occurs most frequently
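As a rough illustration of the single-keyword case, the sketch below ranks pages by how often they contain the word. The `index` layout (term to page counts) and the name `search_one` are assumptions made for this example.

```python
def search_one(index, word):
    """Return pages containing `word`, most frequent first (ties broken by page id)."""
    postings = index.get(word, {})
    return sorted(postings, key=lambda page: (-postings[page], page))

# Hypothetical index: term -> {page: number of occurrences}
index = {
    "car": {"page1": 1, "page3": 1},
    "toyota": {"page2": 2, "page4": 1},
    "honda": {"page1": 2, "page2": 1},
}
print(search_one(index, "toyota"))  # ['page2', 'page4']
```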

6 Multi-keyword Search
For each keyword W, find the set of pages mentioning W
Intersect all the sets of pages
–Assuming an “AND” operation on those keywords
Example:
–A search “toyota honda” will return all the pages that mention both “toyota” and “honda”
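The intersection step could look like the sketch below. The `index` contents and the name `search_all` are hypothetical; intersecting the smallest sets first is a common optimization, not something the slide prescribes.

```python
def search_all(index, query):
    """Return the set of pages that mention every keyword in `query` (AND semantics)."""
    words = query.lower().split()
    if not words:
        return set()
    page_sets = sorted((set(index.get(w, {})) for w in words), key=len)
    result = page_sets[0]
    for pages in page_sets[1:]:
        result = result & pages  # keep only pages seen for every keyword so far
    return result

# Hypothetical index: term -> {page: number of occurrences}
index = {
    "car": {"page1": 1, "page3": 1},
    "toyota": {"page2": 2, "page4": 1},
    "honda": {"page1": 2, "page2": 1},
}
print(search_all(index, "toyota honda"))  # {'page2'}
```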

7 Observations
The “matrix” can be huge:
–Now the Web has 4.2 billion pages!
–There are many “terms” on the Web. Many of them are typos.
–It’s not easy to do the computation efficiently:
→ Given a word, find all the pages…
→ Intersect many sets of pages…
For these reasons, search engines never store this “matrix” so naively.

8 Problems
Spamming:
–People want their pages to be ranked at the very top of a word search (e.g., “toyota”) by repeating the word many, many times
–These pages may be unimportant compared to, even if the latter mentions “toyota” only once (or not at all)
Search engines can be easily “fooled”

9 Closer look at the problems
Lacking the concept of “importance” of each page on each topic
E.g.: Our ICS215 class page is not as “important” as Yahoo’s main page.
A link from Yahoo is more important than a link from our class page
But, how to capture the importance of a page?
–A guess: # of hits? → where to get that info?
–# of inlinks to a page → Google’s main idea.

10 Google’s History
Started at Stanford DB Group as a research project (Brin and Page)
Used to be at:
Very soon many people started liking it
Incorporated in 1998:
The “largest” search engine now
Started other businesses: froogle, gmail, …

11 PageRank
Intuition:
–The importance of each page should be decided by what other pages “say” about this page
–One naïve implementation: count the # of pages pointing to each page (i.e., # of inlinks)
Problem:
–We can easily fool this technique by generating many dummy pages that point to our class page

12 Details of PageRank
At the beginning, each page has weight 1
In each iteration, each page propagates its current weight W to all its N forward neighbors. Each of them gets weight W/N
Meanwhile, a page accumulates the weights from its backward neighbors
Iterate until all weights converge. Usually 6-7 iterations are good enough.
The final weight of each page is its importance.
NOTICE: currently Google is using many other techniques/heuristics to do search. Here we just cover some of the initial ideas.
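The propagation rule can be sketched in a few lines. This is only an illustration of the basic iteration described on this slide, not of what Google runs; the graph, the function names `propagate` and `simple_pagerank`, and the fixed iteration count are assumptions for the example.

```python
def propagate(links, weights):
    """One iteration: each page splits its weight W evenly among its N forward neighbors."""
    new_weights = {page: 0.0 for page in links}
    for page, targets in links.items():
        for target in targets:
            new_weights[target] += weights[page] / len(targets)
    return new_weights

def simple_pagerank(links, iterations=50):
    """Start every page at weight 1 and repeat the propagation step."""
    weights = {page: 1.0 for page in links}
    for _ in range(iterations):
        weights = propagate(links, weights)
    return weights

# Hypothetical three-page graph: a -> b, c; b -> c; c -> a
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
weights = simple_pagerank(links)
print({page: round(w, 3) for page, w in weights.items()})  # {'a': 1.2, 'b': 0.6, 'c': 1.2}
```

In this toy graph every page has at least one forward neighbor, so the total weight stays at 3 throughout.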

13 Example: MiniWeb
(Materials used by courtesy of Jeff Ullman)
Our “MiniWeb” has only three web sites: Netscape, Amazon, and Microsoft.
Their weights are represented as a vector [Ne, Am, MS].
For instance, in each iteration, half of the weight of Am goes to Ne, and half goes to MS.

14 Iterative computation
[Figure: successive weight vectors for Ne, Am, MS]
Final result: Netscape and Amazon have the same importance, and twice the importance of Microsoft.
Does it capture the intuition? Yes.
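The iteration can be replayed with a small matrix-vector loop. The slides state only that Am splits its weight between Ne and MS; the remaining links used below (Ne points to itself and Am, MS points to Am) are an assumption, chosen because they reproduce the stated final result.

```python
# Order of rows/columns: Ne, Am, MS.
# M[i][j] is the share of page j's weight that goes to page i.
# ASSUMED links: Ne -> Ne, Am; Am -> Ne, MS; MS -> Am.
M = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 1.0],
    [0.0, 0.5, 0.0],
]

def step(M, w):
    """Multiply the link matrix by the current weight vector."""
    n = len(w)
    return [sum(M[i][j] * w[j] for j in range(n)) for i in range(n)]

w = [1.0, 1.0, 1.0]  # every page starts with weight 1
for _ in range(100):
    w = step(M, w)
print([round(x, 3) for x in w])  # [1.2, 1.2, 0.6]
```

Under these assumed links, Ne and Am end at 1.2 and MS at 0.6, matching the 2:2:1 ratio stated above.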

15 Observations
We cannot get absolute weights:
–We can only know (and we are only interested in) the relative weights of the pages
The matrix is stochastic (sum of each column is 1). So the iterations converge, and compute the principal eigenvector w of the matrix equation w = M w, where M is the link matrix and w is the vector of page weights.
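For the MiniWeb, the matrix equation might be written as below. Only Am's column is given by the slides; the columns for Ne (linking to itself and Am) and MS (linking to Am) are assumptions consistent with the stated result that Ne and Am end up with twice the weight of MS.

```latex
\begin{bmatrix} ne \\ am \\ ms \end{bmatrix}
=
\begin{bmatrix}
\tfrac{1}{2} & \tfrac{1}{2} & 0 \\
\tfrac{1}{2} & 0 & 1 \\
0 & \tfrac{1}{2} & 0
\end{bmatrix}
\begin{bmatrix} ne \\ am \\ ms \end{bmatrix}
```

Each column sums to 1, and the vector (2, 2, 1) satisfies the equation, so any positive multiple of it is a solution.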

16 Problem 1 of algorithm: dead ends
[Figure: MiniWeb graph over Ne, Am, MS]
MS does not point to anybody
Result: weights of the Web “leak out”

17 Problem 2 of algorithm: spider traps
[Figure: MiniWeb graph over Ne, Am, MS]
MS only points to itself
Result: all weights go to MS!
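Both failure modes, the dead end and the spider trap, can be reproduced numerically. In this sketch the links of Ne and Am (Ne points to itself and Am, Am points to Ne and MS) are assumed; only the behavior of MS comes from the slides.

```python
def iterate(M, w, rounds):
    """Apply w <- M w the given number of times."""
    n = len(w)
    for _ in range(rounds):
        w = [sum(M[i][j] * w[j] for j in range(n)) for i in range(n)]
    return w

# Order of rows/columns: Ne, Am, MS. Column j describes where page j sends its weight.
# ASSUMED: Ne -> Ne, Am and Am -> Ne, MS in both variants.
dead_end = [      # third column all zeros: MS points to nobody
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 0.0],
]
spider_trap = [   # third column sends everything to the MS row: MS points only to itself
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 1.0],
]

start = [1.0, 1.0, 1.0]
leaked = iterate(dead_end, start, 200)
trapped = iterate(spider_trap, start, 200)
print(sum(leaked))                      # total weight shrinks toward 0
print([round(x, 3) for x in trapped])   # [0.0, 0.0, 3.0]
```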

18 Google’s solution: “tax each page”
Like people paying taxes, each page pays some weight into a public pool, which will be distributed to all pages.
Example: assume 20% tax rate in the “spider trap” example.
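A sketch of the taxed iteration on the spider-trap graph, assuming the pool is shared equally among all pages. As before, the links of Ne and Am are assumed (Ne points to itself and Am, Am points to Ne and MS); `taxed_step` is a hypothetical helper name.

```python
def taxed_step(M, w, tax=0.2):
    """Each page keeps (1 - tax) of what it receives; the taxed weight is split equally."""
    n = len(w)
    pool_share = tax * sum(w) / n
    return [(1 - tax) * sum(M[i][j] * w[j] for j in range(n)) + pool_share
            for i in range(n)]

# Order: Ne, Am, MS. Spider trap: MS points only to itself (third column).
spider_trap = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 1.0],
]

w = [1.0, 1.0, 1.0]
for _ in range(100):
    w = taxed_step(spider_trap, w)
print([round(x, 3) for x in w])  # [0.636, 0.455, 1.909]
```

Under these assumptions the weights settle at 7/11, 5/11 and 21/11: MS still collects the most weight, but Ne and Am no longer drop to zero.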

19 The War of Search Engines
More companies are realizing the importance of search engines
More competitors in the market: Microsoft, Yahoo!, etc.

20 Next: HITS / Web communities
Readings:
–Jon M. Kleinberg, Authoritative Sources in a Hyperlinked Environment, Journal of the ACM 46(5): 604-632, 1999.
–Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew Tomkins, Trawling the Web for emerging cyber-communities, WWW 1999.

21 Hubs and Authorities
Motivation: find web pages related to a topic
–E.g.: “find all web sites about automobiles”
“Authority”: a page that offers info about a topic
–E.g.: DBLP is a page about papers
–E.g.:,,,
“Hub”: a page that doesn’t provide much info itself, but tells us where to find pages about a topic
–E.g.: our ICS215 page linking to pages about papers
–E.g.: is a hub of search

22 Two values of a page
Each page has a hub value and an authority value.
–In PageRank, each page has one value: “weight”
Two vectors:
–H: hub values
–A: authority values

23 HITS algorithm: find hubs and authorities
First step: find pages related to the topic (e.g., “automobile”), and construct the corresponding “focused subgraph”
–Find the root set S of pages containing the keyword (“automobile”)
–Find all pages the pages in S point to, i.e., their forward neighbors
–Find all pages that point to pages in S, i.e., their backward neighbors
–Compute the subgraph induced by all these pages
[Figure: the root set inside the larger focused subgraph]
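The construction of the focused subgraph can be sketched as follows. The link graph, the page names, and the function name `focused_subgraph` are hypothetical; a real system would obtain the root set from a keyword index and the backward neighbors from a link index.

```python
def focused_subgraph(links, root):
    """links: page -> set of pages it points to; root: set S of pages matching the keyword."""
    pages = set(root)
    for page in root:
        pages |= links.get(page, set())      # forward neighbors of S
    for page, targets in links.items():
        if targets & root:
            pages.add(page)                  # backward neighbors of S
    # Keep only the links whose both endpoints are in the selected set.
    return {page: links.get(page, set()) & pages for page in pages}

# Hypothetical link graph; only "b" contains the keyword.
links = {"a": {"b"}, "b": {"c"}, "c": set(), "d": {"b"}, "e": {"a"}}
subgraph = focused_subgraph(links, {"b"})
print(sorted(subgraph))  # ['a', 'b', 'c', 'd']
```

Page "e" is left out because it neither matches the keyword nor is adjacent to a page in S.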

24 Step 2: computing H and A
Initially: set the hub and authority values of every page to 1
In each iteration, the hub value of a page is the total authority value of its forward neighbors (after normalization)
The authority value of each page is the total hub value of its backward neighbors (after normalization)
Iterate until convergence
[Figure: hubs pointing to authorities]
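A minimal sketch of the hub/authority iteration. The slide does not say which normalization is used; scaling so that the largest value is 1 is an assumption made here, as are the function names and the toy graph.

```python
def normalize(scores):
    """Scale so the largest score is 1 (ASSUMED normalization; the slide does not fix one)."""
    top = max(scores.values())
    return {p: v / top for p, v in scores.items()} if top > 0 else scores

def hits(links, iterations=50):
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority value: total hub value of the backward neighbors.
        auth = {p: 0.0 for p in pages}
        for page, targets in links.items():
            for target in targets:
                auth[target] += hub[page]
        auth = normalize(auth)
        # Hub value: total authority value of the forward neighbors.
        hub = normalize({p: sum(auth[t] for t in links.get(p, ())) for p in pages})
    return hub, auth

# Hypothetical graph: h1 points to a1 and a2, h2 points only to a1.
links = {"h1": ["a1", "a2"], "h2": ["a1"]}
hub, auth = hits(links)
print(round(auth["a1"], 3), round(auth["a2"], 3))  # 1.0 0.618
```

In this toy graph a1 is the stronger authority because both hubs point to it, and h1 is the stronger hub because it points to both authorities.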

25 Example: MiniWeb
[Figure: hub and authority computation for Ne, Am, MS, with normalization]

26 Example: MiniWeb
[Figure: Ne, Am, MS]

27 Trawling: finding online communities
Motivation: find groups of individuals who share a common interest, together with the Web pages most popular among them (similar to “hubs”)
Examples:
–Web pages of NBA fans
–Community of Turkish student organizations in the US
–Fans of movie star Jack Lemmon
Applications:
–Provide valuable and timely info for interested people
–Represent the sociology of the web
–Target advertising

28 How: analyzing web structure
These pages often do not reference each other
–Competition
–Different viewpoints
Main idea: “co-citations”
–Often these pages are pointed to by many of the same pages
–Example: the following two web sites share many pages

29 Bipartite subgraphs
Bipartite graph: two sets of nodes, F and C, with edges going from F to C
Dense bipartite graph: there are “enough” edges between F and C
Complete bipartite graph: there is an edge between each node in F and each node in C
(i,j)-Core: a complete bipartite graph with at least i nodes in F and j nodes in C
An (i,j)-core is a good signature for finding online communities
Usually i and j are between 3 and 9
F = “Fans”, C = “Centers”
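To make the definition concrete, the check below tests whether a given set of fans and centers forms a complete bipartite subgraph. The link graph and the name `is_core` are hypothetical.

```python
def is_core(links, fans, centers):
    """True if every fan links to every center, i.e. (fans, centers) is complete bipartite."""
    return all(centers <= links.get(fan, set()) for fan in fans)

# Hypothetical link graph: page -> set of pages it points to.
links = {
    "f1": {"c1", "c2", "c3"},
    "f2": {"c1", "c2", "c3", "other"},
    "f3": {"c1", "c2", "c3"},
    "f4": {"c1"},
}
centers = {"c1", "c2", "c3"}
print(is_core(links, {"f1", "f2", "f3"}, centers))        # True: a (3,3)-core
print(is_core(links, {"f1", "f2", "f3", "f4"}, centers))  # False: f4 misses c2 and c3
```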

30 “Trawling”: finding cores
Find all (i,j)-cores in the Web graph.
–In particular: find “fans” (or “hubs”) in the graph
–“centers” = “authorities”
–Challenge: Web is huge. How to find cores efficiently?
→ Experiments: 200M pages, 1 TB data
Main idea: pruning
Step 1: using out-degrees
–Rule: each fan must point to at least 6 different websites
–Pruning results: 12% of all pages (= 24M pages) are potential fans
–Retain only links, and ignore page contents

31 Step 2: eliminate mirroring pages
Many pages are mirrors (exactly the same page)
They can produce many spurious fans
Use a “shingling” method to identify and eliminate duplicates
Results:
–60% of the 24M potential-fan pages are removed
–# of potential centers is 30 times the # of potential fans
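One common form of shingling compares the sets of overlapping word windows of two pages. The sketch below is a generic illustration under that assumption; the window size `k=4`, the Jaccard-style `resemblance` score, and the sample texts are choices made here, not details taken from the slides.

```python
def shingles(text, k=4):
    """Return the set of all windows of k consecutive words in the text."""
    words = text.lower().split()
    return {tuple(words[n:n + k]) for n in range(max(1, len(words) - k + 1))}

def resemblance(a, b, k=4):
    """Fraction of shingles shared by the two texts (1.0 means identical shingle sets)."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)

page = "the quick brown fox jumps over the lazy dog"
mirror = "the quick brown fox jumps over the lazy dog"
variant = "the quick brown fox jumps over the sleepy cat"
print(resemblance(page, mirror))   # 1.0
print(resemblance(page, variant))  # 0.5
```

Pages whose resemblance exceeds some chosen threshold would be treated as duplicates and collapsed into one.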

32 Step 3: using in-degrees of pages
Delete highly referenced pages, e.g., yahoo, altavista
Reason: they are referenced for many reasons, and are unlikely to form an emerging community
Formally: remove all pages with more than k inlinks (k = 50, for instance)
Results:
–60M pages pointing to 20M pages
–2M potential fans
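The in-degree rule amounts to a simple filter over the edge list. In this sketch the edge list and the name `prune_popular` are hypothetical, and a small `k` is used so the effect is visible.

```python
from collections import Counter

def prune_popular(edges, k=50):
    """Drop every edge that points to a page with more than k in-links."""
    in_degree = Counter(dst for _, dst in edges)
    return [(src, dst) for src, dst in edges if in_degree[dst] <= k]

# Hypothetical edges (source page, destination page).
edges = [("p1", "portal"), ("p2", "portal"), ("p3", "portal"),
         ("p1", "c1"), ("p2", "c1")]
print(prune_popular(edges, k=2))  # [('p1', 'c1'), ('p2', 'c1')]
```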

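33 Step 4: iterative pruning
To find (i,j)-cores (i fans, each pointing to all j centers)
–Remove all potential fans whose # of out-links is < j
–Remove all potential centers whose # of in-links is < i
–Do it iteratively, since each removal can lower the degrees of other pages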
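A sketch of the iterative pruning loop. It uses the definition of an (i,j)-core as i fans that each point to the same j centers, so a fan needs at least j out-links and a center needs at least i in-links. The edge list and the name `prune_for_cores` are hypothetical.

```python
from collections import Counter

def prune_for_cores(edges, i, j):
    """Repeatedly drop edges whose source or destination cannot be part of an (i,j)-core."""
    edges = set(edges)
    while True:
        out_degree = Counter(src for src, _ in edges)
        in_degree = Counter(dst for _, dst in edges)
        kept = {(src, dst) for src, dst in edges
                if out_degree[src] >= j and in_degree[dst] >= i}
        if kept == edges:      # nothing removed in this pass: done
            return kept
        edges = kept

# Hypothetical graph: a (3,3)-core plus a page g that only looks promising at first.
core = {(f, c) for f in ("f1", "f2", "f3") for c in ("c1", "c2", "c3")}
edges = core | {("g", "c1"), ("g", "c2"), ("g", "z")}
print(len(prune_for_cores(edges, 3, 3)))  # 9
```

Here g starts with three out-links, but z is removed in the first pass, which drops g below the threshold in the second pass. This is why the pruning has to be repeated.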

34 Step 5: inclusion-exclusion pruning
Idea: in each step, we
–Either “include” a community
–Or “exclude” a page from further contention
Check a page x with out-degree j. x is a fan of an (i,j)-core if:
–There are i-1 other fans that point to all the forward neighbors of x
–This can be checked easily using the index on fans and centers
Result: for (3,3)-cores, 5M pages remained
Final step:
–Since the graph is much smaller, we can afford to “enumerate” the remaining cores
Result:
–(3,3)-cores: about 75 KB
–High-quality communities
–Check a few in the paper by yourself
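The check for a single page x can be sketched as an intersection of in-link sets. This is an illustration of the test described above, not the paper's implementation; the edge list and the name `check_fan` are hypothetical.

```python
def check_fan(edges, x, i, j):
    """For a page x with exactly j out-links, decide whether x belongs to an (i,j)-core."""
    centers = {dst for src, dst in edges if src == x}
    if len(centers) != j:
        raise ValueError("x must have out-degree exactly j")
    # Pages that point to ALL forward neighbors of x (x itself is always among them).
    fans = {src for src, _ in edges}
    for center in centers:
        fans &= {src for src, dst in edges if dst == center}
    if len(fans) >= i:          # x plus at least i-1 other fans
        return "include", fans
    return "exclude", {x}

# Hypothetical graph: a (3,3)-core plus a page g with three out-links.
edges = {(f, c) for f in ("f1", "f2", "f3") for c in ("c1", "c2", "c3")}
edges |= {("g", "c1"), ("g", "c2"), ("g", "z")}
print(check_fan(edges, "f1", 3, 3)[0])  # include
print(check_fan(edges, "g", 3, 3)[0])   # exclude
```

Either way, x needs no further consideration: it has been reported as part of a core or ruled out.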

