信息检索与搜索引擎 (Information Retrieval and Search Engines)
Introduction to Information Retrieval, GESC1007
Philippe Fournier-Viger, Full Professor, School of Natural Sciences and Humanities
philfv@hitsz.edu.cn
Spring 2017
Last time, we discussed: evaluation of an information retrieval system. QQ Group: 479395945
Course schedule (日程安排)
Week 1: Introduction; Boolean retrieval (Chapter 1)
Week 2: Term vocabulary and postings lists (Chapter 2)
Week 3: Dictionaries and tolerant retrieval (Chapter 3)
Week 4: Index construction (Chapter 4)
Week 5: Scoring, term weighting, the vector space model (Chapter 6)
Week 6:
Week 7: A complete search system (Chapter 7)
Week 8: Evaluation in information retrieval
Week 9: Web search engines, conclusion
Final exam
Duration: 2 hours. It is a closed-book exam. Answers must be written in English. 10 questions.
Some typical questions in my exams:
What are the advantages/disadvantages of using X instead of Y?
When should X be used?
How does X work? Why is X designed like that?
Some questions may be similar to the assignments and require you to draw an index, answer a query, etc.
Final exam (continued) During the exam, if you are unsure about the English wording of a question, you may ask for clarification. No electronic devices are allowed. A pen/pencil/eraser can be used during the exam. Bring your student ID card.
Web search engines
Introduction In previous chapters, we mostly discussed traditional information retrieval (searching for documents). Today, we will discuss Web search engines (Chapters 19, 20 and 21).
The Web (19.1) What is special about the Web? The number of documents (very large), the lack of coordination in the creation of the documents, and the diversity of backgrounds and motives of the participants.
The Web The Web is a set of webpages (网页). Webpages are created using a language called HTML. [Figure: a simple webpage and its HTML source, from http://www.wikihow.com/Create-a-Simple-Web-Page-with-HTML]
The Web Webpages are stored on servers (服务器). To access a webpage, one must use a software application called a Web browser (浏览器). Webpages are sent over the Internet using the HTTP protocol (HTTP协议). [Figure: a browser at home requesting a page from the HITSZ server over the Internet]
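To make the HTTP exchange concrete, here is a minimal sketch of fetching a webpage over HTTP with Python's standard library, the way a browser would request a page; the URL is just the example address from this slide.

```python
# Minimal sketch: retrieving a webpage over HTTP, as a browser does.
import urllib.request

url = "http://www.hitsz.edu.cn"  # example address from the slide
with urllib.request.urlopen(url) as response:
    print(response.status)  # HTTP status code, e.g. 200 (OK)
    html = response.read().decode("utf-8", errors="replace")
print(html[:200])  # the beginning of the page's HTML source
```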
The Web The idea of the Web: each webpage contains links to other webpages (hyperlinks, 超链接). Each webpage has an address (URL), e.g. http://www.hitsz.edu.cn. Creating a webpage does not require advanced technical skills: anyone can create one. Even if a webpage contains errors, it can often still be used. Webpages have become one of the best ways to supply and consume information.
The Web Billions of webpages containing information. But if we cannot search this information, it is useless. Historically, two ways of searching for information: Search engines (Baidu, Bing, etc.) Directories (Yahoo!, etc.)
Web directories (网络目录) Web directory: a list of websites, organized by categories.
Web directories (网络目录) A Web directory contains only the “best” webpages for each category. Problems: Web directories are created by humans, which takes a lot of time. They are not convenient for searching: a user needs to know how to search within the categories, and there can be thousands of categories. Information in categories is often outdated (过时). For these reasons, Web directories are not very popular nowadays.
Web search engines Baidu, Bing, etc. They adapted information retrieval techniques so that they work well with billions of documents, in terms of indexing, query processing, and document ranking.
Web search engines Why are they popular? The ability to answer queries quickly (in a few milliseconds), the ability to index millions of documents, and results that are almost always up-to-date. Fifteen years ago, the results returned by Web search engines were not very good. Novel ranking techniques (排序技术) and spam-fighting techniques (反垃圾邮件技术) have since been proposed to obtain better results.
Web characteristics (19.2) The Web is mainly decentralized (分散). There are many languages (which must be treated differently). Webpages contain many different types of content; some webpages contain only pictures and no text. The Web contains a lot of unreliable information. How can a search engine know which websites can be trusted?
Size of the Web 1995: 30 million webpages indexed by AltaVista. 2017: 4.48 billion webpages (http://www.worldwidewebsize.com/). Note: only static webpages are counted. Dynamic webpage: the content is generated in real-time for the user.
The Web graph The Web can be viewed as a graph (图). Each webpage is a vertex (顶点), and a link between two webpages is an edge (图的边). The Web is a directed graph (有向图). The Web is not a strongly connected graph (强连通图): there is no directed path between some pairs of vertices, such as from A to C. [Figure: a small Web graph with webpages A, B, C, …, F]
The Web graph Two types of links: in-links are links that point to a page; out-links are links that leave a page. Node B has 3 in-links and 1 out-link.
The Web graph Not all webpages are equally popular: many webpages have few in-links, and few webpages have many in-links. The number of in-links per webpage follows a power law distribution (幂律分布). [Figure: number of webpages plotted against number of in-links]
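As an illustration, here is a toy sketch of a Web graph stored as an adjacency list, with in-link and out-link counts computed from it; the pages A to F mirror the slide's example, but the links themselves are made up.

```python
# Toy Web graph: page -> list of pages it links to (out-links).
from collections import Counter

out_links = {
    "A": ["B"],
    "B": ["C"],
    "C": ["B", "D"],
    "D": ["E"],
    "E": ["B", "F"],
    "F": [],
}

# In-links: count how many pages link TO each page.
in_links = Counter(t for targets in out_links.values() for t in targets)
for page in out_links:
    print(page, "in-links:", in_links[page], "out-links:", len(out_links[page]))
# With these made-up links, B has 3 in-links and 1 out-link, as on the slide.
```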
Spam For some queries, there is huge competition to appear high in the results of search engines, e.g. Beijing real-estate (房地产). Thus, many people modify their websites to try to appear high in the search results: e.g. writing "Beijing real-estate" multiple times in a webpage to increase the term frequency, or writing invisible text by using the same color as the background.
Spam detection Nowadays, search engines use many sophisticated methods to detect spam (repeated keywords, etc.). Websites that try to cheat may be blocked from search engines. Thus, some people have developed new techniques to deceive search engines (欺骗搜索引擎).
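As a hedged illustration of the "repeated keywords" signal, here is a naive sketch that flags a page when a single term makes up an unusually large share of its text; the 20% threshold is arbitrary, and real spam detectors are far more sophisticated.

```python
# Naive keyword-stuffing signal: does one term dominate the page?
from collections import Counter

def looks_stuffed(text, threshold=0.2):
    terms = text.lower().split()
    if not terms:
        return False
    top_count = Counter(terms).most_common(1)[0][1]
    return top_count / len(terms) > threshold  # arbitrary threshold

page = "Beijing real-estate cheap real-estate best real-estate real-estate deals"
print(looks_stuffed(page))  # True: 'real-estate' dominates this page
```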
Cloaking (伪装) One such technique is cloaking: a website shows different content to search engines than to users. This is a problem that did not exist in traditional IR.
Paid inclusion Paid inclusion: a website can also pay a search engine to appear high in the results. Some search engines do not allow paid inclusion.
Doorway page Doorway page: a page containing carefully chosen text so as to rank highly in search engines for some keywords. The page then links to another page containing commercial content. A website may have many doorway pages. [Figure: several doorway pages linking to another webpage]
Link analysis To reduce the problem of spam on the Web, many search engines perform link analysis. Basic idea: rank a page higher, or treat it as more reliable, if it has many in-links (e.g. the PageRank algorithm). But some people create fake links to increase the popularity of their webpages. There is thus a continuing battle between spammers and search engines.
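As a sketch of the idea behind PageRank, here is a minimal power-iteration implementation on a made-up three-page graph; the damping factor 0.85 is the commonly cited value, and a real implementation must also handle pages without out-links.

```python
# Minimal PageRank sketch: a page ranks higher when many (highly
# ranked) pages link to it. Toy graph; every page here has out-links.
out_links = {"A": ["B"], "B": ["C"], "C": ["A", "B"]}
pages = list(out_links)
damping = 0.85
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):  # iterate until the ranks stabilize
    new_rank = {p: (1 - damping) / len(pages) for p in pages}
    for page, targets in out_links.items():
        share = damping * rank[page] / len(targets)
        for target in targets:
            new_rank[target] += share
    rank = new_rank

print({p: round(r, 3) for p, r in rank.items()})
# B ends up with the highest rank: it is linked from both A and C.
```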
Advertising (广告) (19.3) Two main advertisement models: 1) Cost per view: the goal is to show some content to the user (branding). An image is typically used. A company may pay to display the image 1000 times.
Advertising (广告) 2) Cost per click: the goal is that people click on an advertisement to visit the website of the advertiser (initiate a transaction), where they may be asked to buy something. An image or text may be used with a link. A company may pay for 1000 clicks.
Advertising (广告) Today, many search engines earn money from advertising. Some display search results and sponsored results separately; other search engines combine search results and advertisements. A website may also show advertisements from other companies to earn money through advertising.
Click-spam Click spam: a company clicks on the advertisements of its competitors to waste their money. This may be done using automated software. A search engine must use techniques to block click spam.
Search user experience (用户体验) (19.4) It is also important to understand the users of search engines. For traditional IR systems: users often received training on how to search and write queries. For Web search engines: users may not know, or care, how to write queries. Usually, people use 2 or 3 keywords per query, and do not use special operators (wildcard queries, Boolean operators, …).
Search user experience (用户体验) The more people use a search engine, the more money it can earn. How can a search engine get more users? By increasing the precision of the first few results, and by offering a website that is simple, easy to use, and very fast, so that users can quickly find what they are looking for.
Three types of user queries 1) Informational queries: seek general information on a broad topic, e.g. information about playing piano. There is not a single webpage that contains all the information that the user wants; the user generally wants to combine information from several webpages.
Three types of user queries 2) Navigational queries: seek the website or home page of a given entity, e.g. the webpage of Huawei (华为). The user expects the first result to be the webpage of the entity (e.g. Huawei). The user only needs one document and wants a very high precision (a precision of 1).
Three types of user queries 3) Transactional queries: the user wants to make a transaction, e.g. reserve a hotel room in Guangzhou, buy train tickets… The search engine should provide links to service providers.
Three types of user queries For a given query, it can be difficult for a search engine to identify the type of the query. Identifying the type of a query is useful for selecting the most relevant results and for displaying relevant advertisements (e.g. advertisements about train tickets). A toy heuristic is sketched below.
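The following toy heuristic, using hypothetical keyword lists, hints at how a query's type might be guessed; real search engines rely on much richer signals (query logs, click data, machine-learned models).

```python
# Hypothetical toy classifier for the three query types above.
def guess_query_type(query):
    words = set(query.lower().split())
    if words & {"buy", "reserve", "book", "download", "price"}:
        return "transactional"
    if words & {"homepage", "website", "login"} or len(words) == 1:
        return "navigational"
    return "informational"

print(guess_query_type("buy train tickets"))       # transactional
print(guess_query_type("Huawei homepage"))         # navigational
print(guess_query_type("learning to play piano"))  # informational
```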
Components of a Web search engine [Figure: the components of a Web search engine, including the Web crawler (网络爬虫)]
Index size How can we compare the sizes of the indexes of two search engines (e.g. Baidu vs Bing)? This can be difficult to evaluate: a search engine may only index the first few thousand words of a page; a search engine may display a page in its results that is not in its index (because some other page in its index links to that page); search engines may organize their indexes in tiers (tiered indexes); and for general queries, only the main page of a website may be shown while its other pages are not.
Index size Some techniques have been developed to compare the sizes of search engines’ indexes. Hypothesis: each search engine indexes only one part of the Web, chosen randomly. The “capture-recapture” method is described next.
Capture-recapture method Two search engines, E1 and E2. Take a random page from E1 and check whether it is in E2: this gives a ratio x. Take a random page from E2 and check whether it is in E1: this gives a ratio y. If E1 and E2 are independent, uniform random subsets of the Web, both x|E1| and y|E2| estimate the size of the overlap, so we should have x|E1| ≈ y|E2|, i.e. |E1| / |E2| ≈ y / x. More details in the book…
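A sketch of the resulting estimate, with made-up sampling fractions for illustration:

```python
# Capture-recapture estimate of relative index sizes.
#   x: fraction of pages sampled from E1 that are also found in E2
#   y: fraction of pages sampled from E2 that are also found in E1
# Both x*|E1| and y*|E2| estimate |E1 ∩ E2|, hence |E1|/|E2| ≈ y/x.
def index_size_ratio(x, y):
    return y / x

# Made-up numbers: 30% of E1's sample is in E2, 60% of E2's sample
# is in E1, so E1's index is estimated to be twice as large as E2's.
print(index_size_ratio(x=0.3, y=0.6))  # 2.0
```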
Near-duplicates (近似重复) (19.6) Another issue: the Web may contain multiple copies of the same webpage. Up to 40% of webpages are duplicates (重复) of other pages. Some of these copies are legitimate (合法的); others are not. Search engines try to avoid indexing duplicates to reduce the size of their indexes.
Detecting duplicates How to detect duplicates? We do not want to compare billions of webpages with each other. Simple approach: calculate a fingerprint (hash), which is a number, for each webpage. If two pages have the same fingerprint, they may be duplicates, so we compare them. If they are duplicates, only one of them is indexed.
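A minimal sketch of this fingerprinting approach is shown below; note that an exact hash, as used here, only catches identical pages, while real engines use techniques such as shingling to catch near-duplicates.

```python
# Fingerprint-based duplicate detection (exact duplicates only).
import hashlib

def fingerprint(page_text):
    return hashlib.md5(page_text.encode("utf-8")).hexdigest()

pages = {"url1": "Hello Web", "url2": "Hello Web", "url3": "Other text"}
seen = {}  # fingerprint -> URL of the first page with that fingerprint
for url, text in pages.items():
    fp = fingerprint(text)
    # Same fingerprint: compare the pages to confirm they are duplicates.
    if fp in seen and pages[seen[fp]] == text:
        print(url, "is a duplicate of", seen[fp], "- not indexed")
    else:
        seen[fp] = url  # index this page
```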
Web crawling (Web信息发现) (20) Web crawling: the process by which a search engine gathers pages from the Web in order to index them. Goals: collect information about webpages, collect information about the links between webpages, and do this quickly!
Web crawler (网络爬虫) A web crawler must have the following features (特征): 1) Robustness: several websites try to cheat and may generate an infinite number of pages to mislead web crawlers. Web crawlers must be able to avoid these «traps» (陷阱).
Web crawler (网络爬虫) 2) Politeness (礼貌): a Web crawler should be polite. It should not visit a website too often; otherwise, the owner of the website may not be happy. 3) Efficiency: the Web crawler should be able to efficiently index a huge number of webpages.
Web crawler (网络爬虫) 4) Quality: the Web crawler should try to index the highest-quality or most useful webpages first; it must be able to assign different priority levels to different webpages. 5) Extensibility: a Web crawler should work with different technologies, different languages, different data formats, etc.
Crawling How does a Web crawler index websites? The crawler begins with one or more URLs (webpage addresses). It visits one of these webpages and extracts the text and links. The text is indexed; the links are used to find more webpages. The crawler then continues visiting other webpages.
Crawling A Web crawler should not visit the same webpage twice. How fast must crawling be? 4 billion webpages in 1 month means about 1,540 webpages per second! A Web crawler may be designed to visit popular websites more often than less popular ones. A minimal sketch of the crawling loop is shown below.
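In the sketch below, fetch() and extract_links() are assumed stand-ins for the real downloading and HTML-parsing steps, and politeness (rate limiting per site) is omitted for brevity.

```python
# Minimal breadth-first crawling loop. fetch() and extract_links()
# are assumed helpers: download a page, and parse out its links.
from collections import deque

def crawl(seed_urls, fetch, extract_links, max_pages=100):
    """Yield (url, page_text) pairs for the indexer to process."""
    frontier = deque(seed_urls)   # URLs waiting to be visited
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue              # never visit the same page twice
        visited.add(url)
        text = fetch(url)         # download the page
        yield url, text           # hand the text to the indexer
        for link in extract_links(text):
            if link not in visited:
                frontier.append(link)  # found more pages to visit
```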
Robot exclusion Some people do not want Web crawlers to index their website. To achieve this, one can put a file named robots.txt on the website to tell Web crawlers to ignore it. [Figure: a robots.txt file; the User-agent line names a search engine]
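Python's standard library can read robots.txt directly; here is a sketch, where the URLs are placeholders and "MyCrawler" is a hypothetical crawler name.

```python
# Checking robots.txt before crawling a page (Python standard library).
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")  # placeholder site
rp.read()  # download and parse the robots.txt file

if rp.can_fetch("MyCrawler", "http://www.example.com/some/page.html"):
    print("allowed to crawl this page")
else:
    print("robots.txt asks us to skip this page")
```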
Crawling Generally, a search engine will have many computers working as Web crawlers. These Web crawlers could be located in different locations: China, Europe, America, etc. These Web crawlers must work together. They must split the work and avoid visiting the same websites multiple times. This can be challenging!
Distributed index For a Web search engine, the index may be very large. Moreover, many users may want to access the index at the same time. Thus, the index will be stored on several computers.
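One common way to do this, sketched below under the assumption of a document-partitioned index, is to assign each document to a machine by hashing its URL; a query is then sent to all machines and their results are merged.

```python
# Toy document partitioning: each URL is assigned to one index machine.
import hashlib

NUM_MACHINES = 4  # illustrative number of index servers

def machine_for(url):
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_MACHINES  # stable machine assignment

for url in ["http://a.example", "http://b.example", "http://c.example"]:
    print(url, "-> machine", machine_for(url))
```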
Link analysis Many search engines consider the links between websites as an important source of information for ranking webpages. Link analysis: analyzing the links between websites to derive useful information. A link from a website A to another website B is considered an endorsement (认可) of website B by A. [Figure: a link from page A to page B]
Link analysis When analyzing links, we can also analyze the context of each link in a webpage (the text of the link), e.g. "The real-estate market in Shenzhen (…)". This is useful because webpage B may not provide an accurate description of itself.
Link analysis In fact, there is often a gap between the terms in a webpage and how web users would describe that page. The text used in a link is useful, but some of its terms may not be, e.g. "Click here for information about Shenzhen." We can use the TF-IDF measure to filter out unimportant words, as sketched below.
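As a small illustration, the sketch below computes IDF over a made-up collection of anchor texts: words like "click" and "here", which occur in many anchors, receive low IDF, while a descriptive word like "Shenzhen" scores higher.

```python
# Using IDF to down-weight uninformative anchor-text words.
import math

anchor_texts = [  # made-up anchor texts
    "click here for information about Shenzhen",
    "click here",
    "click here for train tickets",
    "Shenzhen real-estate market",
]

def idf(term, docs):
    df = sum(1 for d in docs if term in d.lower().split())
    return math.log(len(docs) / df) if df else 0.0

for term in ["click", "here", "shenzhen"]:
    print(term, round(idf(term, anchor_texts), 2))
# 'click' and 'here' get IDF 0.29; 'shenzhen' gets 0.69 and is kept.
```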
Link analysis Thanks to the analysis of link text, if we search for « big blue », we may find the webpage of IBM. This is great, but there can be side effects: for example, searching for « miserable failure » could return the page of George W. Bush.
This is because many people purposely linked to the page of George W. Bush with the text « miserable failure » to fool the search engines.
Link analysis Search engines try to use various techniques to avoid this problem. Some search engines will not only consider the text of links, but also the text before and after a link.
Conclusion Today, we discussed Web search engines. This was the last lecture of this course. I wish all of you good preparation for the final exam! 再见!
References Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge: Cambridge University Press.