CSM06 Information Retrieval Lecture 5: Web IR part 2 Dr Andrew Salway



Recap of Lecture 4
Various techniques that search engines can use to index and rank web pages
In particular, techniques that exploit hypertext structure:
–Use of anchor text
–Link analysis → PageRank
Plus, techniques that analyse the words in webpages

Recap of Lecture 4 (*ADDED) The ‘Random Surfer’ explanation of PageRank: A web surfer follows links at random: at a page with no outlinks they ‘teleport’ at random to another page… “the PageRank value of a web page is the long run probability that the surfer will visit that page” (Levene 2005, page 95)
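
The random surfer description can be sketched as a short power iteration. This is only an illustration: the function name `pagerank`, the toy link table and the damping factor of 0.85 are assumptions taken from the usual textbook formulation, not from the slide, which only mentions teleporting from pages with no outlinks.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate long-run visit probabilities for a random surfer.

    links maps each page to the list of pages it links to (toy data).
    damping is an assumed parameter: the chance of following a link
    rather than teleporting to a random page.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the random-teleport share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # No outlinks: the surfer teleports to a page chosen at random.
                share = damping * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank


toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_links)
```

The values in `ranks` sum to 1, so each can be read as the probability that the surfer is on that page in the long run.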

Recap of Lecture 4 (*ADDED) “Whether PageRank Leakage exists or not, is a question of semantics. The PageRank for a given page is solely determined by the inbound links. However, an outgoing link can drain the entire site for PageRank”. leak/Pagerank-leakage.htm

Past Exams
Previous exams and solutions for CSM06 are available from:
IMPORTANT
1) The content of the module is updated and revised each year, so some of the past questions refer to topics that were not part of the module in […].
2) There were some changes to the structure of the exam in 2004, e.g. each question is worth 50 marks. This will be the same in […]. Also, in 2004 more emphasis was put on current research and development of information retrieval systems (cf. some of the research papers given as Set Reading). As in 2004, the 2005 exam will include questions that ask you to write about some specified research. You do NOT need to go beyond the lecture content and the Set Reading to answer these questions.
3) The solutions that are provided are written in a style to help with the marking of the exams – this does not necessarily reflect how you would be expected to write your answers, e.g. solutions are sometimes given in note form, whereas you would normally be expected to write full sentences.

Lecture 5: OVERVIEW
Retrieving ‘similar pages’ based on link analysis: companion and cocitation algorithms (Dean and Henzinger 1999)
Transforming questions into queries: TRITUS system (Agichtein, Lawrence and Gravano 2001)
Evaluating web search engines

“Finding Related Pages in the World Wide Web” (Dean and Henzinger 1999)
Use a webpage (URL) as a query – may be an easier way for a user to express their information need
–The user is saying “I want more pages like this one” – maybe easier than thinking of good query words?
–e.g. the URL www.nytimes.com (New York Times newspaper) returns URLs for other newspapers and news organisations
Aim is for high precision with fast execution using minimal information
→ Two algorithms to find pages related to the query page using only connectivity information, i.e. link analysis (nothing about webpage content or usage):
–Companion Algorithm
–Cocitation Algorithm

What does ‘related’ mean? “A related web page is one that addresses the same topic as the original page, but is not necessarily semantically identical”

Companion Algorithm
Based on Kleinberg’s HITS algorithm – mutually reinforcing authorities and hubs
1. Build a vicinity graph for u
2. Contract duplicates and near-duplicates
3. Compute edge weights (i.e. links)
4. Compute hub and authority scores for each node (URL) in the graph → return highest ranked authorities as results set

Companion Algorithm
1. Build a vicinity graph for u
The graph is made up of the following nodes and edges between them:
–u
–Up to B parents of u, and for each parent up to BF of its children – if u has > B parents then choose randomly; if a parent has > BF children, then choose children closest to u
–Up to F children of u, and for each child up to FB of its parents
NB. Use of a ‘stop list’ of URLs with very high indegree
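
A rough sketch of the node selection in step 1 follows. Several details are assumptions made for illustration only: the link tables `parents_of` and `children_of` are hypothetical in-memory dictionaries, ‘closest to u’ is read as nearest in link order on the parent page, stop-list URLs are simply skipped, and where the slide does not say how to choose (children of u, parents of each child) the first entries are taken.

```python
import random


def closest_children(children, u, limit):
    """Up to `limit` children of a parent that sit nearest to u in link order."""
    if u not in children:
        return children[:limit]
    i = children.index(u)
    others = sorted((abs(j - i), j, c) for j, c in enumerate(children) if c != u)
    return [c for _, _, c in others[:limit]]


def vicinity_nodes(u, parents_of, children_of, B, BF, F, FB,
                   stop_list=frozenset(), seed=0):
    """Collect the nodes of the vicinity graph for query URL u."""
    rng = random.Random(seed)
    nodes = {u}

    # Up to B parents of u, chosen at random if there are more than B.
    parents = [p for p in parents_of.get(u, []) if p not in stop_list]
    if len(parents) > B:
        parents = rng.sample(parents, B)
    for p in parents:
        nodes.add(p)
        # For each parent, up to BF of its children closest to u.
        for c in closest_children(children_of.get(p, []), u, BF):
            if c not in stop_list:
                nodes.add(c)

    # Up to F children of u, and for each child up to FB of its parents.
    children = [c for c in children_of.get(u, []) if c not in stop_list][:F]
    for c in children:
        nodes.add(c)
        others = [p for p in parents_of.get(c, []) if p != u and p not in stop_list]
        nodes.update(others[:FB])
    return nodes
```

The edges of the vicinity graph are then the known links between any two nodes in the returned set.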

Companion Algorithm
2. Contract duplicates and near-duplicates: if two nodes each have > 10 links and > 95% are in common then make them into one node whose links are the union of the two
3. Compute edge weights (i.e. links)
–Edges between nodes on the same host are weighted 0
–Scaling to reduce the influence from any single host:
“If there are k edges from documents on a first host to a single document on a second host then each edge has authority weight 1/k”
“If there are l edges from a single document on a first host to a set of documents on a second host, we give each edge a hub weight of 1/l”
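
The edge-weighting rule in step 3 might be coded along these lines. The function names `host` and `edge_weights` and the example URLs are made up for illustration; hosts are compared using the network location part of the URL, which is an assumption about what counts as ‘the same host’.

```python
from collections import Counter
from urllib.parse import urlsplit


def host(url):
    """Network location of a URL, used here as its host (assumption)."""
    return urlsplit(url).netloc.lower()


def edge_weights(edges):
    """Return (authority_weight, hub_weight) dicts keyed by (source, target)."""
    edges = set(edges)
    cross_host = [(s, d) for s, d in edges if host(s) != host(d)]
    # k: edges from documents on one host to a single target document.
    k = Counter((host(s), d) for s, d in cross_host)
    # l: edges from a single source document to documents on one host.
    l = Counter((s, host(d)) for s, d in cross_host)

    authority_weight, hub_weight = {}, {}
    for s, d in edges:
        if host(s) == host(d):
            # Edges between nodes on the same host are weighted 0.
            authority_weight[(s, d)] = hub_weight[(s, d)] = 0.0
        else:
            authority_weight[(s, d)] = 1.0 / k[(host(s), d)]
            hub_weight[(s, d)] = 1.0 / l[(s, host(d))]
    return authority_weight, hub_weight


example_edges = [
    ("http://a.com/1", "http://b.com/x"),
    ("http://a.com/2", "http://b.com/x"),
    ("http://a.com/1", "http://b.com/y"),
    ("http://a.com/1", "http://a.com/2"),
]
auth_w, hub_w = edge_weights(example_edges)
```

In the example, two documents on a.com point to http://b.com/x, so each of those edges gets authority weight 1/2, while the edge within a.com gets weight 0.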

Companion Algorithm
4. Compute hub and authority scores for each node (URL) in the graph → return highest ranked authorities as results set
“a document that points to many others is a good hub, and a document that many documents point to is a good authority”

Companion Algorithm 4. continued… H = hub vector with one element for the Hub value of each node A = authority vector with one element for the Authority value of each node Initially all values set to 1

Companion Algorithm
4. continued…
Until H and A converge:
  For all nodes n in the graph N:
    A[n] = Σ H[n´] * authority_weight(n´, n), summed over all edges (n´, n)
  For all nodes n in the graph N:
    H[n] = Σ A[n´] * hub_weight(n, n´), summed over all edges (n, n´)
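
The update loop can be written out as a small runnable sketch. Two things here are assumptions not shown on the slide: the scores are normalised after every pass (as in Kleinberg’s HITS, otherwise they grow without bound and never converge), and a fixed number of passes stands in for a proper convergence test. The function names and the toy weights are illustrative only.

```python
import math


def normalise(scores):
    """Scale scores so that their squares sum to 1."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {n: v / norm for n, v in scores.items()} if norm else dict(scores)


def companion_scores(authority_weight, hub_weight, iterations=20):
    """Weighted hub/authority iteration over edges keyed by (source, target)."""
    nodes = sorted({n for edge in authority_weight for n in edge}
                   | {n for edge in hub_weight for n in edge})
    hub = {n: 1.0 for n in nodes}   # initially all values set to 1
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # A[n] = sum of H[n'] * authority_weight(n', n) over edges into n.
        new_auth = {n: 0.0 for n in nodes}
        for (src, dst), w in authority_weight.items():
            new_auth[dst] += hub[src] * w
        # H[n] = sum of A[n'] * hub_weight(n, n') over edges out of n.
        new_hub = {n: 0.0 for n in nodes}
        for (src, dst), w in hub_weight.items():
            new_hub[src] += new_auth[dst] * w
        auth, hub = normalise(new_auth), normalise(new_hub)
    return hub, auth


toy_weights = {("p1", "a"): 1.0, ("p2", "a"): 1.0, ("p1", "b"): 1.0}
hubs, authorities = companion_scores(toy_weights, dict(toy_weights))
best_authorities = sorted(authorities, key=authorities.get, reverse=True)
```

Sorting the nodes by authority score, as in the last line, gives the ranked results set that the algorithm returns.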

Cocitation Algorithm Finds pages that are frequently cocited with the query web page u – “it finds other pages that are pointed to by many other pages that also point to u” Two nodes are co-cited if they have a common parent: the number of common parents is their degree of co-citation

Cocitation Algorithm
1. Select up to B parents of u
2. For each parent add up to BF of its children to the set of u’s siblings S
3. Return nodes in S with highest degrees of cocitation with u
NB. If < 15 nodes in S are cocited with u at least twice then restart using u’s URL with one path element removed, e.g. aaa.com/X/Y/Z → aaa.com/X/Y
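
The three steps and the restart rule can be sketched as below. The link tables `parents_of` and `children_of` are hypothetical dictionaries, parents and children are simply taken in list order, and URLs are written without a scheme (as in the slide’s aaa.com example) so that removing a path element is a plain string split; all of these are simplifying assumptions.

```python
from collections import Counter


def cocitation_degrees(u, parents_of, children_of, B, BF):
    """Count, for each sibling of u, how many parents it shares with u."""
    degree = Counter()
    for parent in parents_of.get(u, [])[:B]:
        siblings = [c for c in children_of.get(parent, []) if c != u][:BF]
        degree.update(set(siblings))  # each parent counts once per sibling
    return degree


def related_by_cocitation(u, parents_of, children_of, B, BF,
                          min_results=15, top_n=10):
    """Return (url, degree) pairs, most cocited first."""
    while True:
        degree = cocitation_degrees(u, parents_of, children_of, B, BF)
        cocited_twice = sum(1 for d in degree.values() if d >= 2)
        if cocited_twice >= min_results or "/" not in u:
            return degree.most_common(top_n)
        # Too few strongly cocited nodes: drop one path element and retry.
        u = u.rsplit("/", 1)[0]


toy_parents = {"aaa.com/X": ["p1", "p2", "p3"]}
toy_children = {
    "p1": ["aaa.com/X", "s1", "s2"],
    "p2": ["s1", "aaa.com/X"],
    "p3": ["s1", "s2", "aaa.com/X", "s3"],
}
related = related_by_cocitation("aaa.com/X/Y", toy_parents, toy_children,
                                B=10, BF=10, min_results=2)
```

In the toy data nothing links to aaa.com/X/Y, so the search restarts from aaa.com/X, where s1 shares three parents with the query page and s2 shares two.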

Evaluation of companion and cocitation algorithms
59 input URLs chosen by 18 volunteers (mainly computing professionals)
The volunteers were shown results for each URL they chose and had to judge each result ‘1’ for valuable or ‘0’ for not valuable
→ Various calculations of precision, e.g. ‘precision at 10’ for the intersection group (those query URLs that all 3 algorithms returned results for)
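
As a worked example of ‘precision at 10’, the sketch below takes the 1/0 judgements for one query URL in rank order and averages over several query URLs. The judgement values are invented for illustration, and dividing by k even when fewer than k results were returned is one common convention, assumed here.

```python
def precision_at_k(judgements, k=10):
    """Fraction of the top k results judged valuable (1) rather than not (0)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(judgements[:k]) / k


def mean_precision_at_k(judgements_by_query, k=10):
    """Average precision at k over a group of query URLs."""
    values = [precision_at_k(j, k) for j in judgements_by_query.values()]
    return sum(values) / len(values) if values else 0.0


# Invented judgements for a single query URL, best-ranked result first.
example = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
p10 = precision_at_k(example)  # 4 valuable results in the top 10 -> 0.4
```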

Evaluation of companion and cocitation algorithms Authors suggest that their algorithms perform better than an algorithm (Netscape’s) that incorporates content and usage information, as well as connectivity information – “This is surprising” – IS IT?? Perhaps it is because they had more connectivity information??

Transforming Questions into Queries…
Users of IR systems might prefer to express their information needs directly as questions, rather than as keywords, e.g.
–“What is a hard disk?” – rather than the query “hard disk”
–What the user wants is a specific answer to their question, rather than web-pages selling hard disks, or web-pages reviewing different kinds of hard disks
But, web search engines may treat the query as a ‘bag of words’ and not recognise questions as such; documents are returned that are similar to the ‘bag of words’

Transforming Questions into Queries…
The challenge then is to automatically transform the question into a suitable query for which search engines will return more pages that do answer the user’s question
Here we consider the work of Agichtein, Lawrence and Gravano (2001), who developed the Tritus system to try to solve this problem…
Cf. AskJeeves

Tritus: premise A good answer to the question “What is a hard disk?” might be “magnetic rotating disk used to store data” So maybe the query “What is a hard disk?” should be transformed into the query – “hard disk” NEAR “used to”

Tritus: aim To automatically learn how to transform natural language questions into queries that contain terms and phrases which are expected to appear in documents containing answers to these questions.

Tritus: learning algorithm Step 1 Select question phrases from a set of questions by extracting frequent n-grams that don’t contain domain specific nouns, e.g. “who was”, “what is a”, “how do I”
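
Step 1 could be approximated as follows. This is a simplified sketch, not the Tritus implementation: it only counts n-grams at the start of each question, and the filter for domain-specific nouns is a hypothetical hand-supplied word set (`domain_nouns`) rather than anything learned or tagged.

```python
from collections import Counter


def question_phrases(questions, min_n=2, max_n=3, min_count=2,
                     domain_nouns=frozenset()):
    """Frequent question-opening n-grams, most frequent first."""
    counts = Counter()
    for question in questions:
        tokens = question.lower().rstrip("?").split()
        for n in range(min_n, max_n + 1):
            if len(tokens) > n:  # leave at least one word for the topic
                prefix = tokens[:n]
                if not domain_nouns.intersection(prefix):
                    counts[" ".join(prefix)] += 1
    return [phrase for phrase, c in counts.most_common() if c >= min_count]


sample_questions = [
    "What is a hard disk?",
    "What is a router?",
    "How do I install Linux?",
    "How do I format a disk?",
    "Who was Alan Turing?",
]
phrases = question_phrases(sample_questions)
```

With these five invented questions, “what is a” and “how do i” are kept because they open two questions each, while “who was” occurs only once and is dropped.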

Tritus: learning algorithm Step 2 For each question type, select candidate transformations from set of good answers for each question, e.g. “what is a” → {“is used to”, “is a”, “used”}
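
Once candidate transformations are known, applying them to a new question might look like this. The table `TRANSFORMS` just copies the slide’s example, the function name `transform_question` is made up, and the `NEAR` operator follows the earlier “hard disk” NEAR “used to” example; whether a given search engine supports it is not something this sketch checks.

```python
# Candidate transformations per question phrase (the slide's example only).
TRANSFORMS = {"what is a": ["is used to", "is a", "used"]}


def transform_question(question, transforms):
    """Rewrite a natural language question into candidate queries."""
    text = question.strip().rstrip("?").lower()
    # Try longer question phrases first so the most specific one matches.
    for phrase in sorted(transforms, key=len, reverse=True):
        if text.startswith(phrase + " "):
            topic = text[len(phrase):].strip()
            return [f'"{topic}" NEAR "{t}"' for t in transforms[phrase]]
    return [text]  # no known question phrase: fall back to the plain words


queries = transform_question("What is a hard disk?", TRANSFORMS)
```

Each query in `queries` pairs the topic “hard disk” with one phrase expected to appear in a good answer.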

Tritus: learning algorithm Step 3 Weight and re-rank transformations using results from web search engines

Tritus: in use Trained to learn the best query transformations for specific web search engines, e.g. Google and AltaVista Evaluation conducted to compare the effect of query transforms, and to compare with AskJeeves

Evaluation of Web Search Engines
Precision may be applicable to evaluate a web search engine, but it may be the precision in the first page of results that is most important
Recall, as traditionally defined, may not be applicable because it is difficult or impossible to identify all the relevant web-pages for a given query

Four strategies for evaluation of web search engines
(1) Use precision and recall in the traditional way for a very tightly defined topic: only applicable if all relevant web pages are known in advance
(2) Use ‘relative recall’ – estimate total number of relevant documents by doing a number of searches and adding the total number of relevant documents returned
(3) Statistically sample the web in order to estimate number of relevant pages
(4) Avoid recall altogether
SEE: Oppenheim, Morris and McKnight (2000), p. 194
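
Strategy (2) can be illustrated with a few lines of code. The engine names and document identifiers are invented, and the pooled total is taken as the union of the relevant documents found, so that a page returned by several searches is counted once; that de-duplication is an assumption about how the totals are added.

```python
def relative_recall(relevant_by_engine):
    """Relative recall per engine against the pool of all relevant pages found."""
    pool = set().union(*relevant_by_engine.values())
    if not pool:
        return {engine: 0.0 for engine in relevant_by_engine}
    return {engine: len(set(found)) / len(pool)
            for engine, found in relevant_by_engine.items()}


# Invented judgements: relevant pages each engine returned for one query.
found = {
    "engine_A": {"d1", "d2", "d3"},
    "engine_B": {"d2", "d4"},
}
scores = relative_recall(found)  # pool has 4 pages: A -> 0.75, B -> 0.5
```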

Alternative Evaluation Criteria
Number of web-pages covered, and coverage: Are more pages covered better? Might it be more important that certain domains are included in coverage?
Freshness / broken links: Web-page content is frequently updated so the index also needs to be updated; broken links frustrate users. Should be relatively straightforward to quantify.

continued… Search Syntax: More experienced users may like the option of ‘advanced searches’, e.g. phrases, Boolean operators, and field searching. Human Factors and Interface Issues: Evaluation from a user’s perspective is a more subjective criterion, however it is an important one – it can be argued that an intuitive interface for formulating queries and interpreting results helps a user to get better results from the system.

continued… Quality of Abstracts: related to interface issues are the ‘abstracts’ of web-pages that a web search engine displays – if good then these help a user to quickly identify more promising pages

Set Reading for Lecture 5
Dean and Henzinger (1999), ‘Finding Related Pages in the World Wide Web’. Pages […].
Agichtein, Lawrence and Gravano (2001), ‘Learning Search Engine Specific Query Transformations for Question Answering’, Procs. 10th International WWW Conference. **Section 1 and Section 3**
Oppenheim, Morris and McKnight (2000), ‘The Evaluation of WWW Search Engines’, Journal of Documentation, 56(2), pp […]. Pages […]. In Library Article Collection.

Exercise: Google’s ‘Similar Pages’ It is suggested that Google’s ‘Similar Pages’ feature is based in part on the work of Dean and Henzinger. By making a variety of queries to Google and choosing ‘Similar Pages’ see what you can find out about how this works.

Exercise: web search engine evaluation
Compare three web-search engines by making the same queries to each. How do they compare in terms of:
–Advanced query options?
–Coverage?
–Quality of highest ranked results?
–Ease of querying and understanding results?
–Ranking factors that they appear to be using?

Further Reading The other parts of the papers given for Set Reading

Lecture 5: LEARNING OUTCOMES
For both (Dean and Henzinger 1999) and (Agichtein, Lawrence and Gravano 2001), you should be able to:
-Explain how they were trying to make web search better for users
-Outline their proposed solution
-Discuss their evaluation of their solution and make your own comments
You should be able to explain and apply various techniques to compare and evaluate web search engines

Reading ahead for LECTURE 6
If you want to prepare for next week’s lecture then take a look at…
The visual interface of the KartOO search engine:
Use and read about the clustering of web pages done by Vivisimo:
Recent developments in Google Labs, especially Google Sets and Google Suggest: