
1 Sogang University A. I. Lab. Effective site finding using link anchor information
Sung Hae, Jun, Artificial Intelligence Lab. (URL: http://ailab.sogang.ac.kr), Dept. of Computer Science, Sogang University, Seoul, Korea
Nick Craswell and David Hawking, CSIRO Mathematical and Information Sciences, Canberra, Australia
Stephen Robertson, Microsoft Research, UK
SIGIR’01, to appear

2 Introduction (1/2)
Link-based ranking is popular:
  With search engines: Google, Fast
  With researchers: HITS, PageRank
This paper: “named site finding”, that is, finding the main entry point of a specific Web site.
In our experiments, ranking based on link anchor text is twice as effective as ranking based on document content.
It opens a rich new area for effectiveness improvement, where traditional methods fail.

3 Introduction (2/2)
Link methods: rank a page using its incoming and outgoing links.
Content methods: rank a page using its text content.
Past TREC experiments have found that link information does not enhance retrieval effectiveness. In particular, the TREC-8 Small and Large Web Tracks found link methods to be no better than non-link methods.

4 The site finding problem
The “topic” of a site might be quite broad. Yahoo! (http://www.yahoo.com/) covers a broad range of subject matter and provides a range of services.
A site finding task is one where the user wants to find a particular site, and their query is an attempt to specify which site that is.
A search system succeeds in the task if it returns the entry page of the required site: the “correct answer”.

5 Site finding examples
Named site finding:
  Where can I find Hotmail?
  Where is the official Michael Schumacher home page?
  Where can I find the web site for Toshiba?
  Where is the fun site dating patterns analyzer?
  Where is the official Star Wars site?
The user knows which site they want, but not its location (URL). Sometimes, the user types the name of a site in order to find its URL.
Not named site finding:
  How does a modem work?
  What should I consider when purchasing a PC for under $2,000?
  Who was Cleopatra?
  What is mp3 and where can i learn more about it?
  Why do dogs have wet noses?
  Where is the Taj Mahal?

6 Link-based ranking methods (1/2)
A hypertext link is a relationship between two documents or two parts of the same document.
On the Web, the source document would contain text such as: <A HREF="http://www.acm.org">ACM Site</A>
The target document: http://www.acm.org
The link’s anchor text: ACM Site
If the user selects the anchor, their browser will display the target document.
A ranking method, given a query and a set of documents, generates a ranked list of documents. In the site-finding task, the entry page of the described site should appear as close as possible to the top of the list.
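As an illustration of the link, target and anchor text terminology above, here is a minimal Python sketch that pulls (target, anchor text) pairs out of a source document. It is not the tooling used in the paper; the class name `AnchorExtractor` and the helper `extract_links` are hypothetical, and only the standard-library `html.parser` module is assumed.

```python
from html.parser import HTMLParser


class AnchorExtractor(HTMLParser):
    """Collect (target URL, anchor text) pairs from one source document."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, anchor_text) pairs
        self._href = None    # href of the <a> element currently open, if any
        self._text = []      # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the anchor text.
            anchor = " ".join("".join(self._text).split())
            self.links.append((self._href, anchor))
            self._href = None


def extract_links(html):
    """Hypothetical helper: return all (target, anchor text) pairs in html."""
    parser = AnchorExtractor()
    parser.feed(html)
    return parser.links
```

Calling `extract_links` on a page containing the ACM link from this slide would yield one pair whose target is http://www.acm.org and whose anchor text is "ACM Site".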

7 Link-based ranking methods (2/2)
[Figure: link source, targets, possible anchors, and the ranked list (link-based ranking methods)]
Link methods can be divided into three classes, depending on which of these alternate assumptions they rely on:
1. recommendation
2. topic locality
3. anchor description (this paper)

8 Three classes of link methods
The recommendation assumption is that by linking to a target, a page author is recommending it. Accordingly, a page with high in-degree is highly recommended, and should be ranked more highly.
The topic locality assumption is that pages connected by links are more likely to be about the same topic than those which are not.
The anchor description assumption is that the anchor text of a link describes its target. Using the example link mentioned previously, the anchor text “ACM Site” is describing http://www.acm.org/.
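The recommendation assumption can be sketched as a plain in-degree count. This is only an illustration of the idea, not HITS, PageRank, or any method evaluated in the paper; the function name `rank_by_in_degree` and the input format (a list of source/target URL pairs) are assumptions.

```python
from collections import Counter


def rank_by_in_degree(links):
    """Rank target pages by how many distinct source pages link to them.

    links: iterable of (source_url, target_url) pairs (assumed format).
    Returns (target_url, in_degree) pairs, highest in-degree first.
    """
    # set() makes repeated links from the same source count only once.
    in_degree = Counter(target for _source, target in set(links))
    # Sort by descending in-degree, then by URL so ties are deterministic.
    return sorted(in_degree.items(), key=lambda item: (-item[1], item[0]))
```

Note that such a ranking is independent of the query, which is one way it differs from the anchor description approach taken in this paper.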

9 TREC experiments with links
The Text Retrieval Conference (TREC) has primarily concentrated on subject searches, performed over news and government documents.
It has recently expanded to new search tasks, such as question answering, and new document sets, including several Web collections.
The Web collections are based on a 1997 Internet Archive (http://www.archive.org) crawl of over 50 million pages.
The 100 gigabyte VLC2 collection is an 18.5 million document subset.
The WT2g and WT10g collections are 0.25 and 1.25 million page subsets respectively.

10 Link-based ranking

11 Outline
A. Problem: Named site finding
B. Solution: Anchor text propagation [WWWW, Google]
C. Experiments: Link solution twice as effective

12 Site finding samples (this paper)

13 Different from TREC ad hoc
TREC ad hoc: Topical query → Relevant documents
Named site finding: Site name query → The site’s URL
Site finding is useful for forgotten URLs (known item search, I’m feeling lucky) or for visiting a new site (“suspected item search”).

14 Evaluation methodology
1. Choose corpus: Choose a fixed test corpus of hypertext documents.
2. Identify query pairs: Identify a set of pairs (for example ), numbering perhaps 100. Each represents a user typing a query, in order to find a particular site entry page.
3. Run methods: For each method being evaluated, run the queries over the corpus.
4. Examine results: For each query, examine pooled results to identify equivalent URLs (e.g. mirror sites).
5. Measure effectiveness: Apply some effectiveness measure. In case of multiple equivalent correct answers, measure according to the top ranked one.
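Step 5 can be sketched as follows. The slide only says "some effectiveness measure", so the choice of "fraction of queries answered within the top n" is an assumption here, as are the function names and the dictionary-based input format. What does come from the slide is the rule that, with several equivalent correct URLs, the top ranked one counts.

```python
def best_rank(ranked_urls, correct_urls):
    """Return the 1-based rank of the top-ranked correct URL, or None."""
    for rank, url in enumerate(ranked_urls, start=1):
        if url in correct_urls:
            return rank
    return None


def success_at(results, answers, n):
    """Fraction of queries with a correct answer in the top n results.

    results: {query: ranked list of URLs returned by one method}
    answers: {query: set of equivalent correct URLs (e.g. mirror sites)}
    """
    hits = 0
    for query, ranked in results.items():
        rank = best_rank(ranked, answers[query])
        if rank is not None and rank <= n:
            hits += 1
    return hits / len(results)
```

With `n=1` this gives the proportion of queries for which a method returns a correct entry page at rank one, which is the figure quoted on the results slides.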

15 Content method used here
Content document for http://www.excite.com/:
Excite Home World’s best news, FREE! Chocolates & Wine Cards & Music Flowers Gifts Excite Search Twice the power of the competition. Search the entire Web Search NewsTracker Search Excite Web Reviews Search Usenet newsgroups Search Tips Advanced Search Submit For info on destinations around the globe, visit City.net. Reference People Finder Yellow Pages Email Lookup Travel Search Shareware Resources Free Start Page Bookmark Excite Excite Direct New to the Net? Free Search Engine Information Help Feedback Advertising Add URL About Excite Jobs at Excite Excite Web Reviews Our insights into the...... [106 more words]
Okapi BM25 applied to e.g. 18.5 million documents in VLC2. The anchor method does not use any of this text.
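For readers unfamiliar with Okapi BM25, the sketch below scores tokenised documents against a query using a commonly cited simplified form of the formula. The parameter values `k1=1.2` and `b=0.75` are conventional defaults and an assumption here; the slides do not state the settings or the exact variant used in the experiments.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document.

    query: list of query terms
    docs:  list of documents, each a list of terms
    k1, b: assumed default parameters, not taken from the paper
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalisation: long documents are penalised when b > 0.
        norm = k1 * ((1 - b) + b * len(d) / avg_len)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Classic Robertson/Sparck Jones weight; it can go negative
            # for terms that occur in more than half of the documents.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores
```

The same scoring function can be applied to either kind of document: page text for the content method, or collected anchor text for the link method.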

16 Link method used here
Okapi BM25 applied to e.g. 44.1 million VLC2 anchor documents.
www.excite.com anchor doc:
  7332  excite
   910  excite netsearch
   294  http://www.excite.com/
   227  excite search
   200  excite!
   192  http://www.excite.com
   168  e xcite
   154  view
   140  excite home
    86  excite search engine
    66  excite search:
    49  exite
  ... [440 more lines]
Each anchor document contains all the anchor texts of a page’s incoming links. If 7332 pages link to http://www.excite.com/ with the anchor text “excite”, that word is added 7332 times to the anchor document.
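The construction of anchor documents described on this slide can be sketched in a few lines. The function name `build_anchor_documents`, the triple-based input format, and the lower-casing and whitespace tokenisation are all assumptions made for illustration.

```python
from collections import defaultdict


def build_anchor_documents(links):
    """Group the anchor text of all incoming links by target page.

    links: iterable of (source_url, target_url, anchor_text) triples.
    Returns {target_url: list of anchor terms}, where a term appears
    once for every incoming link that uses it.
    """
    anchor_docs = defaultdict(list)
    for _source, target, anchor_text in links:
        # Assumed tokenisation: lower-case, split on whitespace.
        anchor_docs[target].extend(anchor_text.lower().split())
    return dict(anchor_docs)
```

Because terms are repeated once per link, a word used by many linking pages gets a high term frequency in the anchor document, which is what a ranking function such as BM25 then picks up.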

17 VLC2: Random sites
VLC2 results for 100 site entry pages, chosen randomly through page selection and navigation.
For 35 of the 100 queries, the anchor method returned the correct answer at rank one, compared with 15 for the content method.
35/100 vs 15/100

18 VLC2: Yahoo!-listed sites
VLC2 results for 100 Yahoo!-listed sites.
62/100 vs 27/100

19 ANU: Directory-listed sites
University results for 100 sites within the institution.
68/100 vs 21/100
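The three result slides can be compared with a small calculation. The counts are copied from the slides; reading every "x/100 vs y/100" pair as anchor method versus content method at rank one is an assumption carried over from the random-sites slide, and the names below are hypothetical.

```python
# (anchor method, content method) correct answers out of 100 queries.
# Assumption: each pair is in the same order as on the random-sites slide.
results = {
    "VLC2 random sites": (35, 15),
    "VLC2 Yahoo!-listed sites": (62, 27),
    "ANU directory-listed sites": (68, 21),
}


def effectiveness_ratio(anchor_hits, content_hits):
    """How many times more queries the anchor method answered correctly."""
    return anchor_hits / content_hits


ratios = {
    name: round(effectiveness_ratio(anchor, content), 2)
    for name, (anchor, content) in results.items()
}

for name, ratio in ratios.items():
    print(f"{name}: anchor/content = {ratio}")
```

Under that reading, the anchor method answers more than twice as many queries correctly as the content method in each of the three experiments.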

20 Discussion
Anchor information is more useful than content on this site finding task.
The biggest difference between link and content methods is at rank one.
In future experiments it will be interesting to test other link and content methods, and combinations of methods.

21 Conclusion
Anchors: Good evidence for finding named sites.
The link anchor method was approximately twice as effective as the content method.
Future work: Can we improve on this, e.g. using anchor+content?
Using the methodology of the present study as a basis, there are a great many aspects of this important problem to be investigated in future work.

