Greg Nilsen University of Pittsburgh April 2003


1 Greg Nilsen University of Pittsburgh April 2003
Analysis of Link Structures on the World Wide Web and Classified Improvements Greg Nilsen University of Pittsburgh April 2003

2 The Problem
The web is a complex, unorganized structure. Search engines can be fooled (search engine designers v. advertisers). User feedback is rarely used to quantify results.

3 Outline
Background (History of Web Searches; Kleinberg’s Algorithm; Classification and Support Vector Machines). The Idea. Implementation. Results and Conclusions. References.

4 Background - History Social Networks
1953 – Katz proposes a measure of standing for people based on the references to them by others 1965 – Hubbell proposes a similar measure through the study of the balance of weight-propagated schemes on the nodes of the network

5 Background - History Scientific Citations (Bibliometrics)
1972 – Garfield’s impact factor used for an assessment of journals in Journal Citation Reports of the Institute for Scientific Information 1976 – Pinski and Narin observe that not all citations are equal, and develop a citation-based measure of standing

6 Background - History WWW Search
Early 1990s – Text-Based/Keyword Searches 1992 – Botafogo et al. introduce the notions of index nodes (high out-degree) and reference nodes (high in-degree) to web searches 1997 – Carriere and Kazman give a “directionless” ranking measure by summing in-degrees and out-degrees

7 Background - History WWW Search
1998 – Brin and Page develop the PageRank algorithm which was used as one part of the Google search engine (query independent ranking that helped determine the order of pages) 1999 – Kleinberg…

8 Background – Kleinberg’s Algorithm
Basic Idea: (1) Create a focused subgraph of the web. (2) Iteratively compute hub and authority scores. (3) Filter out the top hubs and authorities. Extended Ideas: similar page queries; non-principal eigenvectors.

9 Background – Kleinberg’s Algorithm
Create a focused subgraph of the web (a base set of pages). Why? We need a set that is relatively small, is rich in relevant pages, and contains most of the strongest authorities.

10 Background – Kleinberg’s Algorithm
Start with a root set: in our case, we are using a data set that started with the first 200 results of a text-based search on AltaVista. Create our base set: add all pages that link to, or are linked from, any page in the root set.
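The root-to-base expansion described above can be sketched as follows. This is a minimal illustration, not the talk's implementation: the function name `build_base_set` and the edge-list representation `(source, target)` are assumptions made for the example.

```python
def build_base_set(root, links):
    """Expand a root set into a base set.

    root  -- iterable of page identifiers (the text-search results)
    links -- iterable of (source, target) pairs; hypothetical edge list
    """
    root = set(root)
    base = set(root)
    for src, dst in links:
        if src in root:
            base.add(dst)   # page that a root page links to
        if dst in root:
            base.add(src)   # page that links to a root page
    return base


# Toy link graph: page "d" -> "e" touches no root page, so it is left out.
links = [("a", "r1"), ("r1", "b"), ("r2", "c"), ("d", "e")]
print(sorted(build_base_set({"r1", "r2"}, links)))  # ['a', 'b', 'c', 'r1', 'r2']
```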

11-15 Background – Kleinberg’s Algorithm
[Figure, built up over five slides: the root set, then the base set that surrounds it.]

16 Background – Kleinberg’s Algorithm
Now that we have a focused subgraph, we need to compute hub and authority scores. Start by initializing all pages to have hub and authority weights of 1. Compute new hub and authority scores: Hub Score = Σ (Authority Scores of All Pages the Hub Points At); Authority Score = Σ (Hub Scores of All Pages That Point to the Authority).

17 Background – Kleinberg’s Algorithm
Normalize the new weights (hubs and authorities separately) so that the sum of their squares is equal to one. Repeat the computing of weights and their normalization until the scores converge (usually about 20 iterations). When we have completed computing the hub and authority scores, we then take the pages with the top authority scores as our top results.
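The update-and-normalize loop can be sketched as below. It is a minimal sketch under stated assumptions: the function names, the edge-list input, and the fixed iteration count are illustrative, and the hub update uses the freshly computed authority scores, which is one common ordering that the slides do not spell out.

```python
import math


def normalize(weights):
    """Scale weights so that the sum of their squares is 1."""
    norm = math.sqrt(sum(v * v for v in weights.values()))
    if norm == 0:
        return dict(weights)
    return {page: v / norm for page, v in weights.items()}


def hits(pages, links, iterations=20):
    """Return (hub, authority) score dicts for a small link graph.

    links is a list of (source, target) pairs; both ends must be in pages.
    """
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of pages pointing to the page.
        new_auth = {p: 0.0 for p in pages}
        for src, dst in links:
            new_auth[dst] += hub[src]
        # Hub score: sum of authority scores of pages the page points at.
        new_hub = {p: 0.0 for p in pages}
        for src, dst in links:
            new_hub[src] += new_auth[dst]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth


hub, auth = hits(["a", "b", "c"], [("a", "c"), ("b", "c")])
top_authorities = sorted(auth, key=auth.get, reverse=True)
print(top_authorities[0])  # c
```

In this toy graph, page "c" is the only page with incoming links, so it ends up as the sole authority, while "a" and "b" share the hub weight equally.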

18 Background – Kleinberg’s Algorithm
Similar page queries Once we produce results, a searcher may wish to find pages similar to a given result. In order to do this, we can use the algorithm that we have discussed above. This time, we build a root set of the pages that point to the given page. We then grow this into a base set and determine the hubs and authorities for the new set. This will result in pages similar to the initial page.
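The only step that differs for a similar-page query is how the root set is built. A minimal sketch, assuming an edge list of `(source, target)` pairs and a hypothetical cap `limit` on the root set size (the slides do not give one):

```python
def similar_page_root_set(page, links, limit=200):
    """Root set for a similar-page query: pages that point to `page`.

    limit is an assumed cap on the root set size, not a value from the talk.
    """
    root = []
    for src, dst in links:
        if dst == page and src not in root:
            root.append(src)
    return root[:limit]


links = [("x", "p"), ("y", "p"), ("x", "p"), ("p", "z")]
print(similar_page_root_set("p", links))  # ['x', 'y']
```

The resulting root set would then be grown into a base set and scored exactly as for an ordinary query.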

19 Background – Kleinberg’s Algorithm
Non-Principal Eigenvectors: In this setting, each eigenvector of the link matrix corresponds to a densely linked collection of hubs and authorities within the subgraph. In the Kleinberg algorithm, we produce the principal eigenvector by iteratively computing hub and authority scores until convergence. However, the principal eigenvector may not contain all of the information desired by the search.
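The link between the iteration and eigenvectors can be written out explicitly. The notation here is an assumption introduced for illustration: $A$ is the adjacency matrix of the base set ($A_{ij} = 1$ if page $i$ links to page $j$), $a$ and $h$ are the authority and hub vectors, $\mathbf{1}$ is the all-ones starting vector, and each round updates authorities first and then hubs.

```latex
\begin{aligned}
a &\leftarrow A^{T} h, \qquad h \leftarrow A\,a \\
\Rightarrow\quad a^{(k)} &\propto (A^{T}A)^{k-1} A^{T}\mathbf{1},
\qquad h^{(k)} \propto (AA^{T})^{k}\,\mathbf{1}
\end{aligned}
```

This is power iteration, so under the usual conditions the normalized authority vector converges to the principal eigenvector of $A^{T}A$ and the hub vector to the principal eigenvector of $AA^{T}$. The non-principal eigenvectors are the remaining eigenvectors of these two matrices.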

20 Background – Kleinberg’s Algorithm
Example: A search for “jaguar”. This search will produce 3 strong eigenvectors due to the different meanings of the word: Jaguar the car, jaguar the cat, and the Jacksonville Jaguars NFL team. Which one of these will be returned as the principal eigenvector depends heavily on the initial set of pages. We cannot determine which of the three meanings the searcher intended.

21 Background – Kleinberg’s Algorithm
Therefore, we can produce results that come from several “strong” eigenvectors. However, we can still miss relevant pages. For example, the search for “WWW conferences” produces the most pertinent results on the 11th non-principal eigenvector. How to determine which eigenvectors are relevant is still under research.

22 Background - Classification
Classification is a type of problem in machine learning where we begin with a set of discrete and continuous values and produce a discrete value. For example, in an analysis of handwriting: There are many methods of performing classification, but we will focus on just one in this talk.

23 Background – Support Vector Machines
A Support Vector Machine (SVM) is a method of binary classification. With an SVM, we want to find a hyperplane that divides the two classes that we want to differentiate. We can then determine which class a new point belongs to by which side of the hyperplane it lies on.

24 Background – Support Vector Machines

25 Background – Support Vector Machines
However, there can be many hyperplanes that divide the two sets. Because of this, we want to find the hyperplane that provides the maximum margin from each of the two classes. The examples that lie on the margin are known as the support vectors.
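The maximum-margin idea is usually stated as an optimization problem. The formulation below is the standard hard-margin form, given as a sketch under assumptions the slides do not state: the two classes are linearly separable, the labels are encoded as $y_i \in \{-1, +1\}$, and $\mathbf{x}_i$ are the training examples.

```latex
\min_{\mathbf{w},\, w_0} \; \tfrac{1}{2}\,\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i\left(\mathbf{w}^{T}\mathbf{x}_i + w_0\right) \ge 1 \quad \text{for all } i
```

Under this scaling the margin has width $2 / \lVert \mathbf{w} \rVert$, so minimizing $\lVert \mathbf{w} \rVert$ maximizes the margin, and the support vectors are the examples for which the constraint holds with equality.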

26 Background – Support Vector Machines
We can then take examples from a training set of the data and learn the weights that determine where to place this hyperplane: $\mathbf{w}^{T}\mathbf{x} + w_0 = 0$. Once we have this hyperplane in place, we can test it on testing data to make sure that it provides a “good” division. After testing the result, we can then plug new data in to determine the best classification according to the SVM. A higher SVM score signals a point farther from the hyperplane, and therefore a stronger candidate for the class.
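Scoring a new point against a learned hyperplane can be sketched as follows. The weights in the example are made-up values for illustration, not results from the talk, and treating a score of exactly zero as the positive class is an arbitrary choice.

```python
def svm_score(w, w0, x):
    """Signed score w^T x + w0 of point x relative to the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0


def classify(w, w0, x):
    """Return class label 1 if x is on the positive side, else 0."""
    return 1 if svm_score(w, w0, x) >= 0 else 0


# Hypothetical weights for a 2-D problem.
w, w0 = (1.0, 2.0), -1.0
print(classify(w, w0, (0.5, 0.5)))  # 1
print(classify(w, w0, (0.1, 0.2)))  # 0
```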

27 The Idea Kleinberg’s algorithm produces “good” results, but it is subject to a phenomenon known as “topic drift”. The hub weights of some sites such as yahoo.com or eBay.com cause irrelevant clusters to be identified as major eigenvectors. So, while structural information tells us much about a query, additional information seems necessary.

28 The Idea Kleinberg’s algorithm also uses only the top authority scores, but there may be useful pages that rank strongly as hubs. Since web queries are an application driven towards maximizing user satisfaction, we can use user feedback to try to weight hub and authority scores so that we can classify “better” results using SVMs.

29 The Idea [Figure: a plot of hub vs. authority scores; axes: Hubs, Authorities.]

30 The Idea [Figure: the hub vs. authority plot with the dividing hyperplane; axes: Hubs, Authorities.]

31 The Idea If we compile data from different types of searches, we may be able to generalize this hyperplane so that we pull more relevant results from the output of Kleinberg’s algorithm.

32 Implementation Start with data from the University of Toronto’s Link Analysis Ranking Algorithm repository, which contains data for 8 distinct types of searches. Getting results from a purely text-based search engine is now very difficult because search engines have become smarter.

33 Implementation Next, we implement Kleinberg’s algorithm in a C++ program that reads in the datasets and outputs a web page with the top 50 hubs and top 50 authorities. Compile a survey in which participants are asked whether each result is useful, using a mixture of the top 25 hubs and top 25 authorities for the searches on “abortion” (a search that tends to produce two distinct groups) and “genetic” (a search that is more generic in nature).

34 Implementation Using the results of the survey, determine a class label (1 or 0) for each result. With the resulting labels, perform learning via SVMs in Matlab using the hub and authority scores as input and the class label as output.
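The slides do not say how survey answers are turned into a class label. One plausible rule, shown purely as an assumption, is a strict majority vote over the participants' useful/not-useful answers:

```python
def label_from_votes(votes):
    """Return 1 if a strict majority of votes say 'useful', else 0.

    votes is a list of booleans (True = useful). The majority rule is a
    hypothetical choice; the talk does not specify the labeling rule.
    """
    useful = sum(1 for v in votes if v)
    return 1 if useful * 2 > len(votes) else 0


print(label_from_votes([True, True, False]))  # 1
print(label_from_votes([True, False]))        # 0
```

Each page's training example would then be its (hub score, authority score) pair together with this label.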

35 Implementation Take the weights resulting from the SVM learning and plug them into our initial program to compute SVM scores for all web pages. Sort the web pages by their SVM scores and output the top 50 results to a web page.
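The re-ranking step can be sketched as below. The function name, the dictionary layout, and the example weights are assumptions for illustration; the talk's own program is written in C++ and its learned weights are not given.

```python
def rank_by_svm_score(scores, w_hub, w_auth, w0, top_n=50):
    """Rank pages by w_hub * hub + w_auth * authority + w0.

    scores maps page -> (hub_score, authority_score).
    Returns the top_n page identifiers, best first.
    """
    scored = [
        (w_hub * hub + w_auth * auth + w0, page)
        for page, (hub, auth) in scores.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [page for _, page in scored[:top_n]]


# Hypothetical hub/authority scores and weights.
scores = {"p1": (0.9, 0.1), "p2": (0.1, 0.8), "p3": (0.05, 0.05)}
print(rank_by_svm_score(scores, w_hub=0.5, w_auth=1.0, w0=-0.3))  # ['p2', 'p1', 'p3']
```

Unlike taking only the top authorities, this ordering lets a page with a strong hub score rank highly when the learned hub weight is large enough.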

36 Results Although this is still an ongoing research project, we have results so far. Kleinberg’s Algorithm Results: Abortion, Computational Geometry, Net Censorship. The User-Feedback Classification Improvement: Abortion, Computational Geometry, Net Censorship.

37 Conclusions While the current results provide significant improvement on some searches in our datasets, for some searches the results are not much of an improvement. This may be due to the fact that user feedback was limited. It may also be because of the scaling of the hubs and authorities. We may have to normalize each set against the largest value to provide a better scale for our hyperplane.
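The rescaling proposed above, dividing each set of scores by its largest value, can be sketched as follows. This is one reading of "normalize each set against the largest value"; the function name is hypothetical.

```python
def scale_by_max(values):
    """Divide every score by the largest score so the maximum becomes 1."""
    largest = max(values.values(), default=0.0)
    if largest == 0:
        return dict(values)  # nothing to scale
    return {page: v / largest for page, v in values.items()}


hubs = {"a": 2.0, "b": 1.0}
print(scale_by_max(hubs))  # {'a': 1.0, 'b': 0.5}
```

Applying this separately to the hub scores and the authority scores would put both axes on a common 0 to 1 range before the hyperplane is learned.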

38 References J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM), 46, 1999. A. Borodin, G. Roberts, J. Rosenthal, P. Tsaparas. Finding authorities and hubs from link structures on the World Wide Web. Proceedings of the 10th International World Wide Web Conference, 2001. S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Proc. 7th WWW Conf., 1998. R.A. Botafogo, E. Rivlin, and B. Shneiderman. Structural analysis of hypertexts: Identifying hierarchies and useful metrics. ACM Transactions on Information Systems, Vol. 10, No. 2. ACM, pp J. Carrière and R. Kazman. WebQuery: Searching and visualizing the Web through connectivity. Proceedings of WWW6 (Santa Clara CA, April 1997). E. Garfield. Citation analysis as a tool in journal evaluation. Science, 178: , 1972.

39 References L. Katz. A new status index derived from sociometric analysis. Psychometrika, 18:39-43, 1953. G. Pinski and F. Narin. Citation influence for journal aggregates of scientific publications: Theory with application to literature of physics. Information Processing & Management, 12: , 1976. C.H. Hubbell. An input-output approach to clique identification. Sociometry 28, , 1965.

