信息检索与搜索引擎 Introduction to Information Retrieval GESC1007 Philippe Fournier-Viger Full professor School of Natural Sciences and Humanities philfv8@yahoo.com Xin4xi1 jian3suo3 yu3 sou1 suo3 yin3qing2 Spring 2018
Last week What is Information Retrieval (信息检索)? We discussed the « Boolean retrieval model (布尔检索模型) »: searching documents using terms and Boolean operators (e.g. AND, OR, NOT) QQ Group: 623881278 Website: PPTs…
Course schedule (日程安排) Lecture 1 Introduction Boolean retrieval (布尔检索模型) Lecture 2 Term vocabulary and posting lists Lecture 3 Dictionaries and tolerant retrieval Lecture 4 Index construction and compression Lecture 5 Scoring, weighting, and the vector space model Lecture 6 Computer scores, and a complete search system Lecture 7 Evaluation in information retrieval Lecture 8 Web search engines, advanced topics, and conclusion
An exercise This is an exercise that you can do at home if you want to review what we learned last week b. Draw the dictionary (also called the inverted index representation) for this collection c. What are the returned results for these queries? - schizophrenia AND drug - for AND NOT (drug OR approach)
Introduction To be able to search for documents quickly, we need to create an index (索引). What kind of index? Term-document matrix (关联矩阵) Dictionary (词典) (also called “inverted index” 倒排索引) Four steps to create an index
How to create an index? Step 1: collect the documents to be indexed Book1 Book2 Book3 Book100 …
How to create an index? Step 1: collect the documents to be indexed Step 2: tokenize the text (标记文本): separate it into words Book1 Book2 Book3 Book100 … Book1 « The city of Shenzhen is located in China… » token1 token2 … … token7 token8
How to create an index? Step 3: Linguistic preprocessing (语言的预处理) Keep only the terms that are useful for indexing documents. « The city of Shenzhen is located in China… » token1 token2 … … token7 token8 During that step, words can also be transformed if necessary: friends → friend wolves → wolf eaten → eat
How to create an index? Step 4: Create the dictionary Dictionary:
City → Book1, Book2, Book10, Book7, …
Shenzhen → Book1, Book3, Book5, Book9, …
Located → …
China → Book1, Book20, Book34, …
How to create an index? The index has been created! It can then be used to search documents. Dictionary:
China → Book1, Book20, Book34, …
City → Book1, Book2, Book7, Book20, …
Located → …
Shenzhen → Book1, Book3, Book5, Book9, …
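The four indexing steps above can be sketched in a few lines of Python (a minimal illustration, not part of the original slides; the document names and texts are made up, and linguistic preprocessing is omitted):

```python
from collections import defaultdict

def build_index(docs):
    """Tokenize each document (step 2) and map each term to the
    sorted list of documents containing it (step 4)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():  # naive tokenization
            index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {  # step 1: the collected documents
    "Book1": "the city of Shenzhen is located in China",
    "Book2": "a city guide",
}
index = build_index(docs)
print(index["city"])  # → ['Book1', 'Book2']
```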
Chapter 2 – Term vocabulary and POSTING LISTS
In Chapter 2 We will discuss: Reading documents (2.1) Tokenization (标记化) and linguistic processing (2.2) Posting lists (2.3) An extended model to handle phrase and proximity queries (2.4), e.g. “City of Shenzhen”
Reading digital documents 2.1 Reading digital documents Data (数据) stored in computers are represented as bits (比特). To read documents, an IR system must convert these bits into characters. 01001000 01100101 01101100 01101100 01101111 Hello http://www.binaryhexconverter.com/ascii-text-to-binary-converter
Reading documents (2) How to convert from bits to characters? There exist several encodings (文本编码) such as ASCII, UTF-8…: 01001000 → H 01100101 → e 01101100 → l 01101100 → l 01101111 → o
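The bits-to-characters conversion can be tried directly in Python (the bit string is the one from the previous slide; decoding assumes ASCII/UTF-8):

```python
# The bit string shown on the slide, one byte per character.
bits = "01001000 01100101 01101100 01101100 01101111"
data = bytes(int(b, 2) for b in bits.split())  # bits → bytes
text = data.decode("utf-8")                    # bytes → characters
print(text)  # → Hello
```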
Reading documents (3) An IR system will only extract the relevant content (相关内容) from a document (e.g. the text). e.g. in a webpage (网页), pictures (图片) can be ignored.
Reading documents (4) In this course, we consider English documents English is read from left to right. Some other languages are more complex to read. e.g. Arabic (阿拉伯语) mixes both left-to-right and right-to-left Also, some vowels (元音) are not written Creating an index is difficult for such languages!
Reading documents (5) Some IR systems process each document individually e.g. Indexing each e-mail individually Some IR systems process documents as groups. e.g. Indexing all e-mails for a given day, together
Reading documents (6) It is also important to choose the granularity (粒度) carefully. should we index a book as a single document? It can be a bad idea! For example, if we search for books about “Food from China” but “Food” appears only in the first chapter and “China” appears only in the last chapter… Then this book is not about food from China… or should we index each chapter of the book separately?
Tokenization (1) 2.2 After reading a document, the next step is tokenization (标记化). This means to split a text into pieces called tokens (标记) while throwing away some characters such as punctuation (标点符号). A text Tokenization (标记化) Token1 Token2 Token3 Token4 Token5 Token6 Token7
Tokenization (2) “This house is close to my house.” A token is a sequence of characters(字符) appearing at a specific location in a document. Two tokens that are identical are said to be of the same type. “This house is close to my house.” These two tokens are of the same type (“house”).
Tokenization (3) Naive approach for tokenization (幼稚的方法): Remove punctuation. Split the text on whitespace (空格) A text Tokenization (标记化) Token1 Token2 Token3 Token4 Token5 Token6 Token7
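The naive approach can be sketched as follows (a minimal illustration; note how it already mishandles words such as « O’Neill » and « aren’t »):

```python
import string

def naive_tokenize(text):
    """Naive tokenization: remove punctuation, then split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table).split()

print(naive_tokenize("Mr. O'Neill and his friends aren't..."))
# → ['Mr', 'ONeill', 'and', 'his', 'friends', 'arent']
```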
Tokenization (3) This approach has some problems… e.g. “Mr. O’Neill and his friends aren’t…” How should “O’Neill” and “aren’t” be tokenized? Which tokenization is better?
Tokenization (4) In general, choosing how to tokenize a text influences how we can search for documents e.g. “Mr. O’Neill and his friends aren’t…” If “aren’t” is considered to be a token, then if a person searches for the term “are”, he may not find the document.
Tokenization (5) e.g. “Mr. O’Neill and his friends aren’t…” If “aren’t” is considered to be two tokens (“are” and “n’t”), then if a person searches for “aren’t”, he may not find the document. Solution: 1 - Tokenize the documents 2 - Tokenize the queries of users in the same way.
Tokenization (6) In general, tokenization is different for each language. For this reason, it is useful to first identify the language of a document before performing tokenization and indexing. In Chinese, a difficulty is that there are no whitespaces (空格) between words e.g. “ 我喜欢这节课。"
Tokenization (6) Word segmentation (分词) is the process of dividing a text into words. In Chinese, there are some ambiguities (歧义): e.g. « 和尚 » can mean « monk », or « 和 » (« and ») + « 尚 » (« still ») Simple solution: find the longest matching words… Other solutions: use Markov models, and other techniques…
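The « find the longest words » idea (forward maximum matching) can be sketched as follows (the vocabulary here is a toy example; a real segmenter would use a large dictionary or statistical models):

```python
def segment_longest_match(text, vocab, max_len=4):
    """Greedy forward maximum-match segmentation: at each position,
    take the longest vocabulary word; fall back to a single character."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:  # single chars always accepted
                words.append(piece)
                i += size
                break
    return words

vocab = {"我", "喜欢", "这", "节课"}
print(segment_longest_match("我喜欢这节课", vocab))
# → ['我', '喜欢', '这', '节课']
```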
Tokenization (7) In English, there are whitespaces between words. But splitting a text using whitespaces may cause problems. “San Francisco” is the name of a city (it should not be considered as two tokens) “1st January 2016” is a date “Hunan University” should be considered as a single token
Tokenization (8) A solution: for a given query such as « Hunan University », a search engine can retrieve documents using all the different tokenizations: « Hunan University » (two tokens) and « HunanUniversity » (one token), and combine the results.
Tokenization (9) In many languages, there are some unusual tokens. e.g. B-52 is an aircraft C++ and C# are programming languages (编程语言) M*A*S*H is the name of a TV show (电视节目) http://www.hitsz.edu.cn is a web address It is important to consider these special tokens.
Tokenization (10) Some tokens can be ignored because it is unlikely that someone will search for them: amounts of money e.g. 56 元, numbers e.g. 56.7869 Advantage: this reduces the size of the dictionary Disadvantage: we cannot search for the tokens that are ignored.
Removing common words In text documents, there are some words that are very common and may not be useful for retrieving documents. In English, 25 common words are: a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with Such words are called « stop words » (停用词)
Removing common words (2) These words can be ignored when indexing documents. In general, this will not cause problems when searching for documents. However, stop words are useful when searching for phrases (短语) (consecutive words) e.g. « Airplane tickets to Beijing » is more precise than: « Airplane AND tickets AND Beijing »
Removing common words (3) Removing stop words results in a smaller index, but does not make a big difference in terms of performance (speed…). Most Web search engines do not remove stop words; instead they use other strategies to cope with common words, based on statistics about words.
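Stop word removal can be sketched as follows (the stop word list here is a small illustrative sample, not a complete list):

```python
# A small sample stop word list (illustrative only).
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "for"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["The", "city", "of", "Shenzhen"]))
# → ['city', 'Shenzhen']
```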
Normalization -规范化 When a person enters a query in a search engine: User (用户) cars shenzhen Query (查询) An IR system will also « tokenize » the query.
Normalization -规范化 (2) When a person enters a query in a search engine: User (用户) cars shenzhen Query (查询) It is possible that the tokens obtained from the query do not match the tokens from documents
Normalization -规范化 (3) Example: « cars » is used instead of « car » but these two tokens refer to the same object. « cars » is used instead of « automobile » but these two tokens have the same meaning (they are synonyms - 同义词)
Normalization -规范化 (4) Normalization (规范化): it is the process of converting tokens to a standard form so that matches will occur despite small differences.
cars → car
car ↔ automobile (synonyms)
windows → window, but « Windows » (the operating system) remains a distinct term
Normalization: accents and diacritics Diacritic (变音符 ): a sign written above or below a letter that indicates a difference in pronunciation à é ê Should we just ignore them? In some languages, they are important. In Spanish: peña = a cliff pena = sorrow
Normalization: accents and diacritics But it is possible that users will not use the diacritics because they may be lazy or may not know how to type them on the computer. Thus, a strategy is to remove them: peña → pena (the distinction between « a cliff » and « sorrow » is lost)
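Diacritic removal can be implemented with Unicode normalization (a sketch using Python's standard unicodedata module):

```python
import unicodedata

def remove_diacritics(text):
    """Decompose characters (NFD), then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(remove_diacritics("peña"))  # → pena
```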
Capitalization Lower-case letters : a,b,c,d…. (小写) Upper-case letters: A,B,C,D…. (大写) A common strategy is to transform everything to lower-case letters: Ferrari ferrari Australia australia This can be a good idea because often people will not type upper-case letters when searching for documents.
Capitalization But sometimes capitalization is important. Bush: a person named « Bush » (布什) bush: a bush (灌木) C.A.T: a company cat: an animal (猫)
Capitalization A good solution for English: convert the first letter of each sentence to a lower-case letter. “Saturday, Jim went out to eat something.” → “saturday, Jim went out to eat something.” This is not a perfect solution, but it works most of the time. However, as mentioned, users may not type the upper-case letters anyway. Thus, transforming everything to lower-case is often the best solution.
Other issues in English British spelling vs American spelling colour color Dates 3/12/16 3rd March 2016 Mar. 3, 2016
Lemmatization Sometimes the same word may have different forms: organize, organizes, organizing… Lemmatization: converting a word to a common base form called a “lemma” am, are, is ⇒ be car, cars, car’s, cars’ ⇒ car (« car » is the lemma for « car, cars, … »)
Lemmatization (2) How to perform lemmatization? A simple way, called stemming, consists of removing the end of words: cars → car airplanes → airplane But it may give some incorrect results: saw → s The result should be « see »!
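A crude suffix-stripping stemmer can be sketched as follows (illustrative rules only; it is far simpler than the Porter Stemmer discussed next and does not reproduce the « saw → s » behaviour):

```python
def naive_stem(word):
    """Strip plural suffixes by rule (a very crude stemmer)."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"          # ponies → pony
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]                # cars → car, airplanes → airplane
    return word

print(naive_stem("cars"))       # → car
print(naive_stem("airplanes"))  # → airplane
```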
Lemmatization (2) If we want to perform lemmatization in a better way, it is necessary to analyze how the words are used in the text. This can be quite complicated. There exist some software tools (free or commercial) to analyze texts and perform stemming for different languages. For English: Porter Stemmer http://www.tartarus.org/~martin/PorterStemmer/
Example – Porter Stemmer Applying the Porter Stemmer
Lemmatization (3) In some cases, lemmatization can help to provide better results when searching for documents. But in some other cases, it does not help and leads to worse results. Thus, lemmatization may not always be used in practice. Example of a problem:
Lemmatization (5) Example: the Porter Stemmer converts all of these words to « oper »: operate, operating, operates, operation, operative, operatives, operational. But these words have different meanings.
Lemmatization (6) In general, applying lemmatization allows users to find more documents using an information retrieval system, but these documents may be less relevant. In other words, lemmatization may: decrease precision, but increase recall.
Precision (准确率) Precision: What fraction of the returned results are relevant to the information need? Example: A person searches for webpages about Beijing The search engine returns: 5 relevant webpages 5 irrelevant webpages. Precision = 5 / 10 = 0.5 (50 %)
Recall (召回) Recall: What fraction of the relevant documents in a collection were returned by the system? Example: A database contains 1000 documents about HITSZ. The user searches for documents about HITSZ. Only 100 documents about HITSZ are retrieved. Recall = 100 / 1000 = 0.1 (10 %)
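Both measures can be computed with simple set operations (a sketch; the document sets are made up to match the numbers in the slides):

```python
def precision(returned, relevant):
    """Fraction of the returned documents that are relevant."""
    return len(returned & relevant) / len(returned)

def recall(returned, relevant):
    """Fraction of the relevant documents that were returned."""
    return len(returned & relevant) / len(relevant)

returned = {f"doc{i}" for i in range(10)}  # 10 returned results
relevant = {f"doc{i}" for i in range(5)}   # 5 of them are relevant
print(precision(returned, relevant))  # → 0.5
```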
2.3 How to SEARCH FASTER USING A DICTIONARY
Introduction Last week, we saw how we can use a dictionary to search for documents. Example
Example QUERY: CITY AND CHINA Dictionary:
City → Book1, Book2, Book10, …, Book20, …
Shenzhen → Book1, Book3, …
Located → …
China → Book1, Book20, …
Example QUERY: CITY AND CHINA We need to compute the intersection (交集) of the two posting lists (for « City » and « China »). To do that, we compare both lists, posting by posting.
Example QUERY: CITY AND CHINA RESULT: Book1, Book20
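The posting-by-posting comparison used above is the classic merge intersection of two sorted lists; it can be sketched as follows (doc IDs are illustrative integers):

```python
def intersect(p1, p2):
    """Merge-intersect two sorted posting lists, posting by posting."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:          # same document: keep it
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:         # advance the list that is behind
            i += 1
        else:
            j += 1
    return answer

city  = [1, 2, 10, 20]  # doc IDs containing "city"
china = [1, 20, 34]     # doc IDs containing "china"
print(intersect(city, china))  # → [1, 20]
```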
How to search faster? There are some techniques to allow faster search. One such technique is to use “skip pointers”. We will see the main idea (without the details)
Example QUERY: CITY AND CHINA Dictionary:
City → Book1, Book2, Book10, …, Book20, …
Shenzhen → Book1, Book3, …
Located → …
China → Book1, Book20, …
Example QUERY: CITY AND CHINA We need to compute the intersection (交集) of the two posting lists, this time following skip pointers to avoid reading the lists completely.
Example QUERY: CITY AND CHINA RESULT: Book1, Book20
Skip-pointers The idea is to use some « shortcuts » (arrows) to skip some entries when comparing lists. By doing this, we can compare lists of documents faster (we don’t need to completely read the lists). This is just the main idea. We will not discuss technical details! This idea only works for queries using the AND operator (it does not work for OR).
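The skip-pointer idea can be sketched as follows (here the « shortcuts » are simulated by jumping ahead a fixed step of about √(list length), a common heuristic; a real index stores the pointers explicitly):

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted posting lists, skipping ahead when safe."""
    skip1 = max(1, int(math.sqrt(len(p1))))  # skip step for p1
    skip2 = max(1, int(math.sqrt(len(p2))))  # skip step for p2
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Follow the skip pointer only if it does not overshoot p2[j].
            if i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer

city  = [1, 2, 10, 20, 34, 56, 78, 90]
china = [1, 20, 90]
print(intersect_with_skips(city, china))  # → [1, 20, 90]
```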
2.4 PHRASE QUERIES
Phrase query (精确查询) Phrase query: a query where words must appear consecutively (one after the other) in documents e.g.: « Harbin Institute of Technology » This query is written with quotes (“ “). It will find all documents containing these words one after the other. This type of query is not supported by all Web search engines.
Phrase query (2) Some Web search engines will instead consider the proximity between words in documents. Documents where words from a query appear closer will be preferred to other documents. How to answer a phrase query?
Biword indexes A solution is to consider each pair of consecutive terms in a document as a single term. I walked in Beijing → « I walked » « walked in » « in Beijing » These terms are called « biwords » Each biword can be used to create an index that we call a « biword index ».
Illustration of a biword index Dictionary:
I walked → Book1, Book5, Book10, …, Book20, …
walked in → Book1, Book7, …
in Beijing → Book1, Book12, …
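Building a biword index can be sketched as follows (a minimal illustration; document names are made up, and tokens are lower-cased):

```python
from collections import defaultdict

def build_biword_index(docs):
    """Index every pair of consecutive tokens as a single term."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for first, second in zip(tokens, tokens[1:]):
            index[f"{first} {second}"].add(doc_id)
    return index

docs = {"Book1": "I walked in Beijing", "Book2": "I walked home"}
index = build_biword_index(docs)
print(sorted(index["i walked"]))  # → ['Book1', 'Book2']
```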
Biword indexes Using a biword index, we can search using the « biwords »: A query: « Harbin Institute » AND « Institute of » AND « of Technology » This query would work pretty well. But it could still find documents where the phrase « Harbin Institute of Technology » does not appear consecutively.
Biword indexes How to solve this problem? A solution is to generalize the concept of biword index to more than two words (e.g. three words). Then, we may find more relevant documents. But a problem is that the index would become much larger (there will be more entries in the dictionary).
Positional indexes (位置索引) A better solution is to use another type of index, called a positional index. Positional index: a dictionary where the positions of terms in documents are stored. Dictionary:
City → Book1 (3, 25, 38), Book20 (4, 100, 1000)
Shenzhen → Book1 (2, 24, 35), Book20 (3, 500)
Located → …
China → …
This indicates, for example, that « Shenzhen » appears as the 2nd, 24th and 35th word in “Book1”, and as the 3rd and 500th word in “Book20”.
Positional indexes Positional indexes can be used to answer phrase queries.
Example Phrase query: « Shenzhen City » Dictionary:
City → Book1 (3, 25, 38), Book20 (4, 100, 1000)
Shenzhen → Book1 (2, 24, 35), Book20 (3, 500)
Result: Book1 and Book20
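Answering a two-word phrase query with a positional index can be sketched as follows (the index layout is an assumption: term → {document → sorted list of positions}; the positions are the ones from the slide's example):

```python
def phrase_query(index, word1, word2):
    """Find documents where word2 appears immediately after word1."""
    results = []
    postings1, postings2 = index[word1], index[word2]
    for doc in postings1:
        if doc in postings2:
            positions2 = set(postings2[doc])
            # word1 at position p must be followed by word2 at p + 1
            if any(p + 1 in positions2 for p in postings1[doc]):
                results.append(doc)
    return sorted(results)

index = {
    "shenzhen": {"Book1": [2, 24, 35], "Book20": [3, 500]},
    "city":     {"Book1": [3, 25, 38], "Book20": [4, 100, 1000]},
}
print(phrase_query(index, "shenzhen", "city"))  # → ['Book1', 'Book20']
```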
Positional indexes Positional indexes can also be used to answer proximity queries. « Shenzhen (within five words of) City »
Conclusion Today, we have discussed in more detail how indexes are created: tokenization, normalization, lemmatization… The PPT slides are on the website. QQ Group: ____________
References Manning, C. D., Raghavan, P., & Schütze, H. Introduction to Information Retrieval. Cambridge: Cambridge University Press, 2008.