Sogang University A. I. Lab. Effective site finding using link anchor information

Presentation transcript:

Effective site finding using link anchor information
Sung Hae, Jun
Artificial Intelligence Lab. (URL:
Dept. of Computer Science, Sogang University, Seoul, Korea
Nick Craswell and David Hawking, CSIRO Mathematical and Information Sciences, Canberra, Australia
Stephen Robertson, Microsoft Research, UK
SIGIR'01, to appear

Page 2 Introduction (1/2)
Link-based ranking is popular
  With search engines: Google, Fast
  With researchers: HITS, PageRank
To find the main entry point of a specific Web site
In our experiments, ranking based on link anchor text is twice as effective as ranking based on document content.
This paper: “named site finding”
  It opens a rich new area for effectiveness improvement, where traditional methods fail.

Page 3 Introduction (2/2)
Link methods: rank a page using its incoming and outgoing links
Content methods: rank a page using its text content
Past TREC experiments have found that link information does not enhance retrieval effectiveness.
In particular, the TREC-8 Small and Large Web Tracks found link methods to be no better than non-link methods.

Page 4 The site finding problem
The “topic” of a site might be quite broad.
  Yahoo! ( covers a broad range of subject matter and provides a range of services.
A site finding task is one where the user wants to find a particular site, and their query is an attempt to specify which site that is.
A search system succeeds in the task if it returns the entry page of the required site: the “correct answer”.

Page 5 Site finding examples
Named site finding
  Where can I find Hotmail?
  Where is the official Michael Schumacher home page?
  Where can I find the web site for Toshiba?
  Where is the fun site dating patterns analyzer?
  Where is the official Star Wars site?
  The user knows which site they want, but not its location (URL)
  Sometimes, the user types the name of a site in order to find its URL
Not named site finding
  How does a modem work?
  What should I consider when purchasing a PC for under $2,000?
  Who was Cleopatra?
  What is mp3 and where can I learn more about it?
  Why do dogs have wet noses?
  Where is the Taj Mahal?

Page 6 Link-based ranking methods (1/2)
A hypertext link is a relationship between two documents or two parts of the same document.
On the Web, the source document would contain text such as: ACM Site
  The target document :
  The link’s anchor text : ACM Site
If the user selects the anchor, their browser will display the target document.
A ranking method, given a query and a set of documents, generates a ranked list of documents.
In the site-finding task, the entry page of the described site should appear as close as possible to the top of the list.
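The relationship between a link's target and its anchor text can be sketched in Python with the standard-library HTML parser. This is only an illustration: the slide's actual markup and target URL were lost from the transcript, so the URL below is a placeholder, and the class name `AnchorExtractor` is hypothetical.

```python
from html.parser import HTMLParser


class AnchorExtractor(HTMLParser):
    """Collect (target URL, anchor text) pairs from one source document."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, anchor text) pairs
        self._href = None   # href of the <a> element currently open, if any
        self._text = []     # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the anchor text.
            anchor = " ".join("".join(self._text).split())
            self.links.append((self._href, anchor))
            self._href = None


# Hypothetical source document; the URL is a placeholder, not from the slides.
source_html = '<p>See the <a href="http://example.org/">ACM Site</a> for details.</p>'
extractor = AnchorExtractor()
extractor.feed(source_html)
print(extractor.links)  # [('http://example.org/', 'ACM Site')]
```

Here the target document is whatever the `href` points to, and the anchor text is the visible text the user would select.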

Page 7 Link-based ranking methods (2/2)
[Figure: link source, targets, possible anchors, and the ranked list (link-based ranking methods)]
Link methods can be divided into three classes, depending on which of these alternative assumptions they rely on:
  1. recommendation
  2. topic locality
  3. anchor description (this paper)

Page 8 Three classes of link methods
The recommendation assumption is that by linking to a target, a page author is recommending it. Accordingly, a page with high in-degree is highly recommended, and should be ranked more highly.
The topic locality assumption is that pages connected by links are more likely to be about the same topic than those which are not.
The anchor description assumption is that the anchor text of a link describes its target. Using the example link mentioned previously, the anchor text “ACM site” is describing
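As a rough illustration of the recommendation assumption only (it is not the anchor description method this paper studies), pages can be ordered by in-degree. The function name `rank_by_in_degree` and the toy link list are hypothetical, and counting each source-target pair once is an assumption made for this sketch.

```python
from collections import Counter


def rank_by_in_degree(links):
    """Rank targets by how many distinct sources link to them.

    links: iterable of (source, target) pairs.
    Returns a list of (target, in_degree), highest in-degree first.
    """
    distinct = set(links)  # count each source -> target link only once
    in_degree = Counter(target for _source, target in distinct)
    # Sort by descending in-degree, breaking ties by name for stable output.
    return sorted(in_degree.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical link graph.
links = [
    ("a", "home"), ("b", "home"), ("c", "home"),
    ("a", "about"), ("b", "about"),
    ("a", "home"),  # duplicate link, ignored
]
print(rank_by_in_degree(links))  # [('home', 3), ('about', 2)]
```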

Page 9 TREC experiments with links
The Text Retrieval Conference (TREC) has primarily concentrated on subject searches, performed over news and government documents.
It has recently expanded to new search tasks, such as question answering, and new document sets, including several Web collections.
The Web collections are based on a 1997 Internet Archive (http: // crawl of over 50 million pages.
  The 100 gigabyte VLC2 collection is an 18.5 million document subset.
  The WT2g and WT10g collections are 0.25 and 1.25 million page subsets respectively.

Page 10 Link-based ranking

Page 11 Outline
A. Problem: Named site finding
B. Solution: Anchor text propagation [wwww, Google]
C. Experiments: Link solution twice as effective

Page 12 Site finding samples (this paper)

Page 13 Different from TREC ad hoc
TREC ad hoc: Topical query -> Relevant documents
Named site finding: Site name query -> The site’s URL
Site finding is useful for forgotten URLs (known item search, I’m feeling lucky) or visiting a new site (“suspected item search”)

Page 14 Evaluation methodology
1. Choose corpus : Choose a fixed test corpus of hypertext documents.
2. Identify query pairs : Identify a set of pairs (for example ), numbering perhaps 100. Each represents a user typing a query, in order to find a particular site entry page.
3. Run methods : For each method being evaluated, run the queries over the corpus.
4. Examine results : For each query, examine pooled results to identify equivalent URLs (e.g. mirror sites).
5. Measure effectiveness : Apply some effectiveness measure. In case of multiple equivalent correct answers, measure according to the top ranked one.
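Step 5 can be sketched as follows. The slides do not fix a particular effectiveness measure here, so this minimal Python sketch assumes a simple "success at rank k" measure, and the function names and placeholder URLs are hypothetical. Equivalent correct answers (such as mirrors) are handled by scoring each query on its best-ranked correct URL.

```python
def first_correct_rank(ranked_urls, correct_urls):
    """Return the 1-based rank of the best-ranked correct answer, or None."""
    for rank, url in enumerate(ranked_urls, start=1):
        if url in correct_urls:
            return rank
    return None


def success_at_k(runs, k=1):
    """Fraction of queries whose best correct answer appears at rank <= k.

    runs: list of (ranked_urls, correct_urls) pairs, one per query.
    """
    hits = 0
    for ranked_urls, correct_urls in runs:
        rank = first_correct_rank(ranked_urls, correct_urls)
        if rank is not None and rank <= k:
            hits += 1
    return hits / len(runs)


# Hypothetical results for three queries; URLs are placeholders.
runs = [
    (["u1", "u2", "u3"], {"u1"}),        # correct answer at rank 1
    (["u4", "u5", "u6"], {"u5", "u6"}),  # two equivalent answers, best at rank 2
    (["u7", "u8"], {"u9"}),              # correct answer not returned
]
print(success_at_k(runs, k=1))
print(success_at_k(runs, k=2))
```

With these toy runs, one of three queries succeeds at rank one and two of three succeed within the top two.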

Page 15 Content method used here
Excite Home World’s best news, FREE! Chocolates & Wine Cards & Music Flowers Gifts Excite Search Twice the power of the competition. Search the entire Web Search NewsTracker Search Excite Web Reviews Search Usenet newsgroups Search Tips Advanced Search Submit For info on destinations around the globe, visit City.net. Reference People Finder Yellow Pages Lookup Travel Search Shareware Resources Free Start Page Bookmark Excite Excite Direct New to the Net? Free Search Engine Information Help Feedback Advertising Add URL About Excite Jobs at Excite Excite Web Reviews Our insights into the [106 more words]
Okapi BM25 applied to e.g. million documents in VLC2.
The anchor method does not use any of this text.
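For readers unfamiliar with Okapi BM25, a minimal Python sketch of one common formulation follows. The slides give no parameter settings, so `k1=1.2` and `b=0.75` are conventional defaults assumed here, and the IDF uses a variant with 1 added inside the logarithm so that it stays non-negative. The experiments may have used a different variant, and the toy documents are invented for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document (a list of tokens) against query (a list of tokens)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalisation: longer-than-average documents are penalised.
        norm = k1 * (1 - b + b * len(d) / avg_len)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            # Assumed IDF variant; the +1 keeps the weight non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores


# Toy collection, invented for illustration.
docs = [
    "excite search the entire web".split(),
    "travel search and yellow pages lookup".split(),
    "free start page".split(),
]
scores = bm25_scores(["excite", "search"], docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

The first toy document matches both query terms and scores highest; the third matches neither and scores zero.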

Page 16 Link method used here
Okapi BM25 applied to e.g. million VLC2 anchor documents.
anchor doc
  7332  excite
  910   excite netsearch
  294   http://
  227   excite search
  200   excite!
  192   http://
  168   e xcite
  154   view
  140   excite home
  86    excite search engine
  66    excite search:
  49    exite
  ... [440 more lines]
Each anchor document contains all the anchor texts of a page’s incoming links.
If 7332 pages link to a page with the anchor text “excite”, that word is added 7332 times to the page’s anchor document.
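The construction of anchor documents described above can be sketched in Python. The function name `build_anchor_documents`, the lowercasing and whitespace tokenisation, and the placeholder URLs are assumptions made for this illustration; the slides do not describe the actual preprocessing.

```python
from collections import defaultdict


def build_anchor_documents(links):
    """Group anchor text by link target.

    links: iterable of (source_url, target_url, anchor_text).
    Returns {target_url: list of anchor tokens}, with one copy of the
    anchor text added per incoming link.
    """
    anchor_docs = defaultdict(list)
    for _source, target, anchor_text in links:
        # Assumed preprocessing: lowercase and split on whitespace.
        anchor_docs[target].extend(anchor_text.lower().split())
    return dict(anchor_docs)


# Hypothetical link graph; all URLs are placeholders.
links = [
    ("http://a.example/", "http://portal.example/", "Excite"),
    ("http://b.example/", "http://portal.example/", "excite search"),
    ("http://c.example/", "http://portal.example/", "excite"),
    ("http://a.example/", "http://other.example/", "view"),
]
anchor_docs = build_anchor_documents(links)
print(anchor_docs["http://portal.example/"].count("excite"))  # 3
```

Each resulting token list can then be indexed and ranked like an ordinary document, for example with BM25, without looking at the target page's own content.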

Page 17 VLC2: Random sites
VLC2 results for 100 site entry pages, chosen randomly through page selection and navigation.
For 35 of the 100 queries, the anchor method returned the correct answer at rank one, compared with 15 for the content method.
35/100 vs 15/100

Page 18 VLC2: Yahoo!-listed sites
VLC2 results for 100 Yahoo!-listed sites.
62/100 vs 27/100

Page 19 ANU: Directory-listed sites
University results for 100 sites within the institution.
68/100 vs 21/100

Page 20 Discussion
Anchor information is more useful than content on this site finding task.
The biggest difference between link and content methods is at rank one.
In future experiments it will be interesting to test other link and content methods, and combinations of methods.

Page 21 Conclusion
Anchors: Good evidence for finding named sites
The link anchor method was approximately twice as effective as the content method.
Future work: Can we improve on this, e.g. using anchor+content?
Using the methodology of the present study as a basis, there are a great many aspects of this important problem to be investigated in future work.