信息检索与搜索引擎 (Information Retrieval and Search Engines) — Introduction to Information Retrieval, GESC1007
Philippe Fournier-Viger, Full Professor, School of Natural Sciences and Humanities. Spring 2018
Last week
We discussed a complete search system.
QQ Group: 623881278
Website: PPTs…
Course schedule (日程安排)
Lecture 1: Introduction; Boolean retrieval (布尔检索模型)
Lecture 2: Term vocabulary and postings lists
Lecture 3: Dictionaries and tolerant retrieval
Lecture 4: Index construction and compression
Lecture 5: Scoring, weighting, and the vector space model
Lecture 6: Computing scores, and a complete search system
Lecture 7: Evaluation of information retrieval systems; Web search engines
6th May: Final exam
Evaluation in an INFORMATION RETRIEVAL SYSTEM
Chapter 8, pdf p. 188
Introduction In previous chapters, we have discussed many techniques.
Which techniques should be used in an IR system? Should we use stop lists? Should we use stemming? Should we use TF-IDF? …
Different search engines will show different results
For the same query, Baidu and Bing may return different results. How can we measure the effectiveness of an IR system?
User utility
We discussed the concept of document relevance (文件关联) for a query. But relevance is not the only important measure. User utility: what makes the user happy? Speed of response, size of the index, relevance of the results, user interface design (用户界面设计): clarity (清晰), layout (布局), and responsiveness (响应能力) of the user interface, and the generation of high-quality snippets (片段).
How to evaluate an IR system?
8.1 To evaluate an IR system, we can use testing data (测试数据) consisting of:
a collection of documents,
a set of test queries,
a set of relevance judgments indicating which documents are relevant for each query.
1) Tuning the IR system: run it on training data (训练数据), check whether the results are good, and adjust the parameters.
2) Testing the IR system: run it on separate testing data (测试数据) and check whether the results are good.
Evaluation of unranked retrieval results
8.3 There exist many measures to evaluate whether the results of an IR system are good or not. Some popular measures: Precision (准确率), Recall (召回), Accuracy, …
Precision (准确率)
Precision: what fraction of the returned results are relevant to the user's query? Example: a person searches for webpages about Beijing. The search engine returns 5 relevant webpages and 5 irrelevant webpages. Precision = 5 / 10 = 0.5 (50%). Precision = P(relevant | retrieved).
Recall (召回)
Recall: what fraction of the relevant documents in the collection were returned by the system? Example: a database contains 1000 documents about HITSZ. The user searches for documents about HITSZ, but only 100 of them are retrieved. Recall = 100 / 1000 = 0.1 (10%). Recall = P(retrieved | relevant).
Accuracy (精确)
Accuracy: the fraction of documents correctly classified (as relevant or non-relevant). Example: there are 1000 documents. The IR system correctly classifies 600 documents and incorrectly classifies 400 documents. Accuracy = 600 / 1000 = 0.6 (60%).
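The three measures above can be made concrete with a small Python sketch (not from the textbook); the function name `evaluate` and the set-of-document-IDs representation are illustrative assumptions:

```python
def evaluate(retrieved, relevant, total_docs):
    """Compute precision, recall, and accuracy from sets of document IDs."""
    tp = len(retrieved & relevant)       # relevant documents that were retrieved
    fp = len(retrieved - relevant)       # irrelevant documents that were retrieved
    fn = len(relevant - retrieved)       # relevant documents that were missed
    tn = total_docs - tp - fp - fn       # irrelevant documents correctly left out
    precision = tp / len(retrieved)      # P(relevant | retrieved)
    recall = tp / len(relevant)          # P(retrieved | relevant)
    accuracy = (tp + tn) / total_docs
    return precision, recall, accuracy
```

With 5 relevant results among 10 retrieved, this returns a precision of 0.5, matching the Beijing example.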
Limitations of accuracy
Accuracy has a problem: the distribution is skewed (偏态分布). Generally, over 99.9% of documents are irrelevant. Thus, an IR system that labels ALL documents as irrelevant has a high accuracy! But such a system would be useless for the user.
A user can tolerate seeing irrelevant documents in the results, as long as there are some relevant ones. For a Web surfer, precision is the most important: every result on the first page should be relevant (high precision). It is OK if some relevant documents are missing (low recall).
For a professional searcher, precision can be low, but recall should be high (all relevant documents should be found). Precision and recall are generally inversely related: if precision increases, recall decreases, and vice versa.
8.5 Assessing relevance
To assess the relevance of results, we need testing data (documents, queries, relevance judgments). Appropriate queries for the test documents may be selected by domain experts. Providing relevance judgments for all documents is time-consuming. Solution: we can use only a subset of all documents for evaluating each query.
Other problems
Some documents may be neither totally relevant nor totally irrelevant (there can be an in-between). The same query may have many meanings. The relevance of a document may differ between people. Even if the system works well for some queries, it may not work well for others.
8.6 User satisfaction
To measure user satisfaction, we need to do user studies (用户研究): real humans evaluate the IR system. We may use objective measures (客观的措施): the time to complete a task, how many pages of results the user looks at. We may use subjective measures (主观的措施): a user-satisfaction score, user comments on the search interface. Both qualitative (定性的措施) and quantitative measures (定量的措施) are useful.
User satisfaction
User studies are very useful (e.g. for evaluating the user interface), but they are expensive and time-consuming. It is also difficult to do a good user study: one must design the study well and interpret the results carefully.
User utility
For e-commerce, we may measure the time to purchase, or the fraction of searchers who buy something (e.g. 50% of searchers bought some product). User happiness may not be the most important measure: the store owner's happiness (how much money is made) may matter more.
User utility
For an enterprise, school, or government, the most important metric is probably user productivity: how much time do users spend finding the information they need? Information security also matters.
Improving an IR system: example
Suppose we want to improve the scoring function of an IR system. We can ask two groups of users to use different versions of the IR system (A/B testing), and compare the number of clicks on the top search results for the two versions. This can help us choose the best scoring function.
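The click comparison used in such an A/B test can be sketched in Python; the log format (pairs of a version label and a 0/1 click indicator) and the function name are illustrative assumptions, not from the lecture:

```python
def click_through_rate(events):
    """events: (version, clicked) pairs from a hypothetical search log,
    where version is "A" or "B" and clicked is 1 (user clicked a top
    result) or 0 (user did not)."""
    shown = {"A": 0, "B": 0}
    clicks = {"A": 0, "B": 0}
    for version, clicked in events:
        shown[version] += 1
        clicks[version] += clicked
    # Fraction of result pages on which the user clicked a top result
    return {v: clicks[v] / shown[v] for v in shown}
```

The version with the higher click-through rate would then be kept (in practice one would also check that the difference is statistically significant).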
Web Search Engines
The Web
19.1 What is special about the Web? The number of documents (very large), the lack of coordination in the creation of the documents, and the diversity of backgrounds and motives of the participants.
The Web
The Web is a set of webpages (网页). Webpages are created using a language called HTML.
The Web
Webpages are stored on servers (服务器). To access a webpage, one must use a software program called a Web browser (浏览器). (Diagram: a browser accesses the server of HITSZ through the Internet.)
The Web
The idea of the Web: each webpage contains links to other webpages (hyperlinks, 超链接). Each webpage has an address (URL). Creating a webpage is not difficult. Webpages have become one of the best ways to supply and consume information.
The Web
Billions of webpages contain information. But if we cannot search this information, it is useless. Historically, there were two ways of searching for information: search engines (Baidu, Bing, etc.) and directories (Yahoo!, etc.).
Web directories (网络目录)
Web directory: a list of websites, organized by category.
A Web directory contains only the “best” webpages for each category. Problems: Web directories are created by humans, which takes a lot of time. They are not convenient for searching: a user needs to know how to find information within the categories, and there can be thousands of categories. The information in the categories is often old. For these reasons, Web directories have mostly disappeared.
Web search engines
Baidu, Bing, etc. adapt information retrieval techniques to search billions of documents. They are adapted in terms of indexing, query processing, and document ranking.
Web search engines: why are they popular?
They can answer queries quickly, index millions of documents, and are almost always up to date. Fifteen years ago, the results returned by Web search engines were not very good. Novel ranking techniques (排序技术) and spam-fighting techniques (反垃圾邮件技术) have since been proposed to obtain better results.
Web characteristics
19.2 The Web is mainly decentralized (分散): many languages, many different types of content. Some webpages contain only pictures and no text. The Web contains a lot of unreliable information. How can a search engine know which websites can be trusted?
Size of the Web
1995: 30 million webpages indexed by AltaVista. 2017: billions of webpages. Note: only static webpages are counted. Dynamic webpage: the content is generated in real-time for the user.
The Web graph
The Web can be viewed as a graph (图): each webpage is a vertex (顶点), and a link between two webpages is an edge (图的边). The Web is a directed graph (有向图). (Example: a graph with webpages A, B, C, …, F.)
The Web graph
Two types of links: in-links, the links that point to a page, and out-links, the links that leave a page. In the example, node B has 3 in-links and 1 out-link.
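Counting in-links and out-links over an edge list can be sketched in Python; the edge list below, reproducing the example where B has 3 in-links and 1 out-link, is illustrative:

```python
from collections import Counter

def degree_counts(edges):
    """edges: (source, target) hyperlink pairs of a directed Web graph.
    Returns (in_links, out_links): how many links point to / leave each page."""
    out_links = Counter(src for src, _ in edges)
    in_links = Counter(dst for _, dst in edges)
    return in_links, out_links

# A, C and D link to B; B links to E.
edges = [("A", "B"), ("C", "B"), ("D", "B"), ("B", "E")]
```

Here `degree_counts(edges)` gives B an in-link count of 3 and an out-link count of 1.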
The Web graph
Not all web pages are equally popular: many web pages have few in-links, and few web pages have many in-links. The number of in-links per webpage follows a power-law distribution (幂律分布). (Plot: number of webpages vs. number of in-links.)
Spam
For some queries, e.g. Beijing real-estate (房地产), there is huge competition to appear high in the results of search engines. Thus, many people modify their websites to try to appear first in the search results, e.g. by writing “Beijing real-estate” many times in a webpage to increase the term frequency, or by writing invisible text using the background color of the webpage (e.g. white).
Spam detection
Nowadays, search engines use many sophisticated methods to detect spam (repeated keywords, etc.). Websites that try to cheat may be blocked from search engines. Thus, some people have developed new techniques to cheat search engines (欺骗搜索引擎).
Cloaking (伪装)
One such technique is cloaking: some websites try to cheat by showing different content to search engines and to users. This is a problem that did not exist in traditional IR.
Paid inclusion
Paid inclusion: a website can also pay a search engine to appear high in the results. Some search engines do not allow paid inclusion.
Doorway page
Doorway page: a page containing carefully chosen text to rank highly in search engines for some keywords. The page then links to another page containing commercial content. A website may have many doorway pages.
Link analysis
To reduce the problem of spam on the Web, many search engines perform link analysis. Basic idea: rank a page higher, or treat it as more reliable, if it has many in-links, e.g. the PageRank algorithm. But some people create fake links to increase the popularity of their webpages, so there is a continuing battle between spammers and search engines.
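Since the slides mention PageRank, here is a minimal power-iteration sketch in Python over a small hand-made edge list; the damping factor 0.85 and the iteration count are conventional choices, not values from the lecture:

```python
def pagerank(edges, d=0.85, iters=50):
    """Power-iteration PageRank over a directed graph given as (src, dst) edges."""
    nodes = {n for edge in edges for n in edge}
    out = {n: [dst for src, dst in edges if src == n] for n in nodes}
    rank = {n: 1 / len(nodes) for n in nodes}          # start uniform
    for _ in range(iters):
        new = {n: (1 - d) / len(nodes) for n in nodes}  # teleport share
        for src in nodes:
            if out[src]:                                # split rank over out-links
                share = rank[src] / len(out[src])
                for dst in out[src]:
                    new[dst] += d * share
            else:                                       # dangling page: spread evenly
                for n in nodes:
                    new[n] += d * rank[src] / len(nodes)
        rank = new
    return rank
```

On the graph A→B, C→B, B→A, page B (two in-links) ends up with the highest rank and C (no in-links) with the lowest, illustrating the "in-links as endorsements" idea.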
Advertising (广告)
19.3 Two main advertisement models. 1) Cost per view: the goal is to show some content to the user (branding). An image is typically used. A company may pay to display the image 1000 times.
2) Cost per click: the goal is for people to click on an advertisement to visit the website of the advertiser (initiate a transaction). The website may then ask the person to buy something. An image or text with a link may be used. A company may pay for 1000 clicks.
Today, many search engines earn money from advertising. Some display search results and sponsored (advertisement) results separately; other search engines combine search results and advertisements.
Click-spam
Click spam: a company clicks on the advertisements of its competitors to waste their advertising budget. This may be done using automatic software. A search engine must use techniques to block click spam.
Search user experience (用户体验)
19.4 It is also important to understand the users of search engines. For traditional IR systems, users often received training on how to search and write queries. Users of Web search engines may not know or care how to write queries: people usually use 2 or 3 keywords per query, and usually do not use special operators (wildcard queries, Boolean operators, …).
Three types of user queries
1) Informational queries: seek general information on a broad topic, e.g. information about playing the piano. There is no single webpage that contains all the information the user wants; the user generally wants to combine information from several webpages.
2) Navigational queries: seek the website or home page of a given entity, e.g. find the webpage of Huawei (华为). The user expects the first result to be the webpage of the entity. The user only needs one document and wants very high precision (1).
3) Transactional queries: the user wants to make a transaction, e.g. reserve a hotel room in Guangzhou, buy train tickets… The search engine should provide links to service providers.
For a given query, it can be difficult to identify the type of the query. Identifying the type of a query is useful for selecting the most relevant results and for displaying relevant advertisements (e.g. advertisements about train tickets).
Components of a Web search engine
(Diagram of the components of a Web search engine, including the Web crawler (网络爬虫).)
Near-duplicates (近似重复)
19.6 Another issue: the Web may contain multiple copies of the same webpage. Up to 40% of webpages are duplicates (重复) of other pages. Some of these copies are legitimate (合法的); others are not. Search engines try to avoid indexing duplicates to reduce the size of their indexes.
Detecting duplicates
How can we detect duplicates? We do not want to compare billions of webpages with each other. Simple approach: calculate a fingerprint (a hash value) for each webpage. If two pages have the same fingerprint, they may be duplicates, so we compare them. If they are duplicates, only one of them is indexed.
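The fingerprint idea can be sketched in Python; hashing the whitespace-normalized, lower-cased text with MD5 is an illustrative choice (real engines use more robust schemes such as shingling, which are not shown here), and pages sharing a fingerprint are only candidate duplicates that would still be compared:

```python
import hashlib

def fingerprint(page_text):
    """Hash a page's whitespace-normalized, lower-cased text to a fixed-size value."""
    normalized = " ".join(page_text.split()).lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def candidate_duplicates(pages):
    """pages: {url: text}. Group URLs whose pages share a fingerprint;
    each returned group should then be verified by direct comparison."""
    groups = {}
    for url, text in pages.items():
        groups.setdefault(fingerprint(text), []).append(url)
    return [urls for urls in groups.values() if len(urls) > 1]
```

Only one fingerprint comparison per page pair in the same bucket is needed, instead of comparing every page against every other page.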
Web crawling (Web信息发现)
20 Web crawling: the process by which a search engine gathers pages from the Web to index them. Goals: collect information about webpages, collect information about the links between webpages, and do this quickly!
Web crawler (网络爬虫)
A web crawler must have the following features (特征): 1) Robustness: some websites try to cheat and may generate an infinite number of pages to mislead web crawlers. Web crawlers must be able to avoid these « traps » (陷阱).
2) Politeness (礼貌): a Web crawler should be polite. It should not visit a website too often; otherwise, the owner of the website may not be happy. 3) Efficiency: the Web crawler should be able to efficiently index a huge number of webpages.
4) Quality: the Web crawler should try to index the highest-quality or most useful webpages first, so it must be able to assign different priority levels to different webpages. 5) Extensibility: a Web crawler should work with different technologies, different languages, different data formats, etc.
Crawling
How does a Web crawler index websites? The crawler begins with one or more URLs (webpage addresses). It visits one of these webpages, extracts the text and the links, and indexes the text. The links are used to find more webpages, and the crawler then continues visiting other webpages.
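The crawling loop just described can be sketched as a breadth-first traversal in Python; `get_links` is a stand-in for fetching a page and extracting its links (here it is just a function over a toy link graph), and real crawlers add politeness delays and priorities on top of this:

```python
from collections import deque

def crawl(seed_urls, get_links):
    """Breadth-first crawl starting from seed URLs.
    get_links(url) returns the out-links of the page at url."""
    frontier = deque(seed_urls)
    visited = set(seed_urls)
    order = []
    while frontier:
        url = frontier.popleft()
        order.append(url)                  # here the page text would be indexed
        for link in get_links(url):
            if link not in visited:        # never visit the same webpage twice
                visited.add(link)
                frontier.append(link)
    return order
```

The `visited` set is what prevents the crawler from fetching the same page twice, which the next slide points out as a requirement.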
Crawling
A Web crawler should not visit the same webpage twice. How fast must crawling be? 4 billion webpages in 1 month ≈ 1,500 webpages per second! A Web crawler may be designed to visit popular websites more often than less popular websites.
Crawling
Generally, a search engine has many computers working as Web crawlers, possibly located in different places: China, Europe, America, etc. These Web crawlers must work together: they must split the work and avoid visiting the same websites multiple times. This can be challenging!
Link analysis
Many search engines consider the links between websites as important information for ranking webpages. Link analysis: analyzing the links between websites to derive useful information. A link from a website A to another website B is considered an endorsement (认可) of B by A.
Link analysis
When analyzing links, we can also analyze the context of each link in a webpage (the text of the link), e.g. “The real-estate market in Shenzhen (…)”. This is useful because a webpage B may not provide an accurate description of itself.
Link analysis
In fact, there is often a gap between the terms in a webpage and how web users would describe that page. The text used in a link is useful, but some of its terms may not be, e.g. “Click here for information about Shenzhen”. We can use the TF-IDF measure to filter out unimportant words.
Link analysis
Thanks to the analysis of link text, if we search for « big blue », we may find the webpage of IBM. This is great, but there can be side-effects: for example, searching for « miserable failure » could return the page of George W. Bush.
This is because many people have purposely linked to the page of George W. Bush with the text « miserable failure » to fool the search engines.
Link analysis
Search engines use various techniques to avoid this problem. Some search engines consider not only the text of links, but also the text before and after a link.
Conclusion
This was the last lecture of the course: evaluation of information retrieval systems, and Web search engines. Wish you a good preparation for the final exam! 再见! (Goodbye!)
References
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge: Cambridge University Press.