信息检索与搜索引擎 Introduction to Information Retrieval GESC1007 Philippe Fournier-Viger Full professor School of Natural Sciences and Humanities philfv8@yahoo.com Xin4xi1 jian3suo3 yu3 sou1 suo3 yin3qing2 Spring 2018

Last week What is Information Retrieval (信息检索)? We discussed the « Boolean retrieval model (布尔检索模型) »: searching documents using terms and Boolean operators (e.g. AND, OR, NOT) QQ Group: 623881278 Website: PPTs…

Course schedule (日程安排) Lecture 1 Introduction Boolean retrieval (布尔检索模型) Lecture 2 Term vocabulary and posting lists Lecture 3 Dictionaries and tolerant retrieval Lecture 4 Index construction and compression Lecture 5 Scoring, weighting, and the vector space model Lecture 6 Computing scores, and a complete search system Lecture 7 Evaluation in information retrieval Lecture 8 Web search engines, advanced topics, and conclusion

An exercise This is an exercise that you can do at home if you want to review what we have learnt last week. b. Draw the dictionary (also called the inverted index representation) for this collection. c. What are the returned results for these queries? - schizophrenia AND drug - for AND NOT (drug OR approach)

Introduction To be able to search for documents quickly, we need to create an index (索引). What kind of index? Term-document matrix (关联矩阵) Dictionary (词典) (also called an “inverted index” 倒排索引) Four steps to create an index

How to create an index? Step 1: collect the documents to be indexed Book1 Book2 Book3 Book100 …

How to create an index? Step 1: collect the documents to be indexed Step 2: tokenize the text (标记文本): separate it into words. e.g. Book1 contains « The city of Shenzhen is located in China… », which is split into tokens (token1, token2, …, token8)

How to create an index? Step 3: Linguistic preprocessing (语言的预处理): keep only the terms that are useful for indexing documents. e.g. from « The city of Shenzhen is located in China… », only some tokens are kept. During that step, words can also be transformed if necessary: friends → friend, wolves → wolf, eaten → eat

How to create an index? Step 4: Create the dictionary from the remaining terms (City, Shenzhen, Located, China). Dictionary: City → Book1, Book2, Book10, Book7, …; Shenzhen → Book1, Book3, Book5, Book9, …; Located → …; China → Book1, Book20, Book34, …

How to create an index? The index has been created! It can then be used to search documents. Dictionary: China → Book1, Book20, Book34, …; City → Book1, Book2, Book7, Book20, …; Located → …; Shenzhen → Book1, Book3, Book5, Book9, …
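
The four steps can be sketched in a few lines of Python. This is a minimal illustration under simplified assumptions (whitespace tokenization and lower-casing only; the book titles and texts are made-up examples):

```python
# A minimal sketch of building an inverted index (dictionary).
# The documents below are hypothetical examples.
documents = {
    "Book1": "The city of Shenzhen is located in China",
    "Book2": "Shenzhen is a city",
}

inverted_index = {}                        # term -> set of document IDs
for doc_id, text in documents.items():
    for token in text.lower().split():     # Step 2: naive tokenization
        inverted_index.setdefault(token, set()).add(doc_id)

# Sort each posting list so that it can be intersected efficiently later.
inverted_index = {term: sorted(docs) for term, docs in inverted_index.items()}

print(inverted_index["city"])    # ['Book1', 'Book2']
print(inverted_index["china"])   # ['Book1']
```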

Chapter 2 – Term vocabulary and posting lists

In Chapter 2 We will discuss: Reading documents (2.1) Tokenization (标记化) and linguistic processing (2.2) Posting lists (2.3) An extended model to handle phrase and proximity queries (2.4). e.g. “City of Shenzhen”

2.1 Reading digital documents Data (数据) stored in computers are represented as bits (比特). To read documents, an IR system must convert these bits into characters. 01001000 01100101 01101100 01101100 01101111 → Hello (http://www.binaryhexconverter.com/ascii-text-to-binary-converter)

Reading documents (2) How to convert from bits to characters? There exist several encodings (文本编码) such as ASCII, UTF-8, …: 01001000 → H, 01100101 → e, 01101100 → l, 01101100 → l, 01101111 → o, …
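
The conversion from bytes to characters can be done with Python's built-in codec support; a minimal sketch using the byte values shown above:

```python
# Decode the five bytes from the slide using ASCII and UTF-8
# (ASCII is a subset of UTF-8, so both give the same result here).
raw_bits = "01001000 01100101 01101100 01101100 01101111"
data = bytes(int(b, 2) for b in raw_bits.split())
print(data.decode("ascii"))    # Hello
print(data.decode("utf-8"))    # Hello
```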

Reading documents (3) An IR system will only extract the relevant content (相关内容) from a document (e.g. the text). e.g. in a webpage (网页), pictures (图片) can be ignored; only the text (文本) is kept.

Reading documents (4) In this course, we consider English documents. English is read from left to right. Some other languages are more complex to read. e.g. Arabic (阿拉伯语) mixes both left-to-right and right-to-left text. Also, some vowels (元音) are not written. Creating an index is difficult for such languages!

Reading documents (5) Some IR systems process each document individually e.g. Indexing each e-mail individually Some IR systems process documents as groups. e.g. Indexing all e-mails for a given day, together

Reading documents (6) It is also important to choose the granularity (粒度) carefully. Should we index a book as a single document, or should we index each chapter of the book separately? Indexing the whole book can be a bad idea! For example, if we search for books about “Food from China” but “Food” appears only in the first chapter and “China” appears only in the last chapter, then this book is probably not about food from China…

2.2 Tokenization (1) After reading a document, the next step is tokenization (标记化). This means to split a text into pieces called tokens (标记) while throwing away some characters such as punctuation (标点符号). A text → Tokenization (标记化) → Token1, Token2, …, Token7

Tokenization (2) “This house is close to my house.” A token is a sequence of characters (字符) appearing at a specific location in a document. Two tokens that are identical are said to be of the same type. In “This house is close to my house.”, the two occurrences of “house” are tokens of the same type.

Tokenization (3) Naive approach for tokenization (幼稚的方法): remove punctuation, then split the text according to the whitespaces (空格). A text → Tokenization (标记化) → Token1, Token2, …, Token7
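
A minimal sketch of this naive approach in Python, using a rough regular-expression heuristic to drop punctuation (the example sentence is the one from the earlier slides):

```python
import re

def naive_tokenize(text):
    """Naive tokenization: remove punctuation, then split on whitespace."""
    text = re.sub(r"[^\w\s]", " ", text)   # crude punctuation removal
    return text.split()

print(naive_tokenize("The city of Shenzhen is located in China."))
# ['The', 'city', 'of', 'Shenzhen', 'is', 'located', 'in', 'China']
```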

Tokenization (3) This approach has some problems… e.g. “Mr. O’Neill and his friends aren’t…” How should “O’Neill” and “aren’t” be tokenized? Which option is better?

Tokenization (4) In general, choosing how to tokenize a text influences how we can search for documents e.g. “Mr. O’Neill and his friends aren’t…” If “aren’t” is considered to be a token, then if a person searches for the term “are”, he may not find the document.

Tokenization (5) e.g. “Mr. O’Neill and his friends aren’t…” If “aren’t” is considered to be two tokens (“are” and “n’t”), then if a person searches for “aren’t”, he may not find the document. Solution: 1 - Tokenize the documents 2 - Tokenize the queries of users in the same way.

Tokenization (6) In general, tokenization is different for each language. For this reason, it is useful to first identify the language of a document before performing tokenization and indexing. In Chinese, a difficulty is that there are no whitespaces (空格) between words. e.g. “我喜欢这节课。” (“I like this class.”)

Tokenization (6) Word segmentation (分词) is the process of dividing a text into words. In Chinese, there are some ambiguities (歧义): e.g. 和尚 can be read as « monk », or as 和 + 尚 (« and » + « still »). Simple solution: find the longest words… Other solutions: use Markov models, and other techniques…
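
A minimal sketch of the “find the longest words” idea (forward maximum matching), assuming a small made-up word list; real segmenters use much larger dictionaries and statistical models:

```python
def forward_max_match(text, words, max_len=4):
    """Greedily match the longest dictionary word at each position."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + j] in words or j == 1:
                tokens.append(text[i:i + j])
                i += j
                break
    return tokens

words = {"我", "喜欢", "这", "节课", "这节课"}     # hypothetical word list
print(forward_max_match("我喜欢这节课", words))   # ['我', '喜欢', '这节课']
```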

Tokenization (7) In English, there are whitespaces between words. But splitting a text using whitespaces may cause problems. “San Francisco” is the name of a city (it should not be considered as two tokens) “1st January 2016” is a date “Hunan University” should be considered as a single token

Tokenization (8) A solution: For a given query such as: « Hunan University » a search engine can retrieve documents using all the different tokenizations: Hunan University HunanUniversity and combine the results.

Tokenization (9) In many languages, there are some unusual tokens. e.g. B-52 is an aircraft C++ and C# are programming languages (编程语言) M*A*S*H* is the name of a TV show (电视节目) http://www.hitsz.edu.cn is a web page It is important to consider these special tokens.

Tokenization (10) Some tokens can be ignored because it is unlikely that someone will search for them: amounts of money e.g. 56 元, numbers e.g. 56.7869 Advantage: this reduces the size of the dictionary Disadvantage: we cannot search for the tokens that are ignored.

Removing common words In text documents, there are some words that are very common and may not be useful for retrieving documents. In English, 25 common words are: a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with. Such words are called « stop words » (停用词)

Removing common words (2) These words can be ignored when indexing documents. In general, this will not cause problems when searching for documents. However, stop words are useful when searching for phrases (短语) (consecutive words) e.g. « Airplane tickets to Beijing » is more precise than: « Airplane AND tickets AND Beijing »
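
A minimal sketch of dropping stop words at indexing time; the stop list here is a small made-up subset, not the full list from the previous slide:

```python
STOP_WORDS = {"the", "of", "is", "in", "to", "and", "a"}

def remove_stop_words(tokens):
    """Keep only the tokens that are not stop words."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = ["The", "city", "of", "Shenzhen", "is", "located", "in", "China"]
print(remove_stop_words(tokens))
# ['city', 'Shenzhen', 'located', 'China']
```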

Removing common words (3) In terms of performance, removing stop words results in a smaller index, but does not make a big difference in terms of speed. Most Web search engines do not remove stop words; instead, they use other strategies to cope with common words, based on statistics about words.

Normalization - 规范化 When a person (用户) enters a query (查询) in a search engine, e.g. « cars shenzhen », an IR system will also « tokenize » the query.

Normalization - 规范化 (2) When a person enters a query in a search engine, e.g. « cars shenzhen », it is possible that the tokens obtained from the query do not match the tokens from documents.

Normalization -规范化 (3) Example: « cars » is used instead of « car » but these two tokens refer to the same object. « cars » is used instead of « automobile » but these two tokens have the same meaning (they are synonyms - 同义词)

Normalization - 规范化 (4) Normalization (规范化): it is the process of converting tokens to a standard form so that matches will occur despite small differences. cars → car; car → automobile; windows → window (but Windows, the operating system, is a different term)
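
One common way to implement normalization is with an equivalence map applied to every token; the mappings below are illustrative assumptions, not a standard list:

```python
# Map variant tokens to a canonical form (equivalence classes).
EQUIVALENCE = {
    "cars": "car",
    "automobile": "car",
    "colour": "color",
}

def normalize(token):
    token = token.lower()                  # case-folding (see the next slides)
    return EQUIVALENCE.get(token, token)   # canonical form if one is known

print([normalize(t) for t in ["Cars", "automobile", "Colour", "Shenzhen"]])
# ['car', 'car', 'color', 'shenzhen']
```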

Normalization: accents and diacritics Diacritic (变音符 ): a sign written above or below a letter that indicates a difference in pronunciation à é ê Should we just ignore them? In some languages, they are important. In Spanish: peña = a cliff pena = sorrow

Normalization: accents and diacritics But it is possible that users will not use the diacritics because they may be lazy or may not know how to type them on the computer. Thus, a strategy is to remove them: peña = a cliff pena = sorrow

Capitalization Lower-case letters: a, b, c, d, … (小写) Upper-case letters: A, B, C, D, … (大写) A common strategy is to transform everything to lower-case letters: Ferrari → ferrari, Australia → australia This can be a good idea because often people will not type upper-case letters when searching for documents.

Capitalization But sometimes capitalization is important. Bush: a person named « Bush » (布什) bush: a bush (灌木) C.A.T : a company cat : an animal (猫)

Capitalization A good solution for English: convert the first letter of a sentence to a lower-case letter. “Saturday, Jim went out to eat something.” → “saturday Jim went out to eat something” This is not a perfect solution but works most of the time. However, as mentioned, users may not type the upper-case letters anyway. Thus, transforming everything to lower-case is often the best solution.

Other issues in English British spelling vs American spelling: colour / color Dates: 3/12/16, 3rd March 2016, Mar. 3, 2016

Lemmatization Sometimes the same word may have different forms: organize, organizes, organizing… Lemmatization: converting a word to a common base form called the “lemma” am, are, is ⇒ be car, cars, car’s, cars’ ⇒ car (« car » is the lemma for car, cars, …)

Lemmatization (2) How to perform lemmatization? A simple way called stemming consists of removing the end of words: cars → car, airplanes → airplane But it may give some incorrect results: saw → s (the result should be « see »!)

Lemmatization (2) If we want to perform lemmatization in a better way, it is necessary to analyze how the words are used in the text. This can be quite complicated. There are software tools (free or commercial) that analyze texts and perform stemming for different languages. For English: Porter Stemmer http://www.tartarus.org/~martin/PorterStemmer/

Example – applying the Porter Stemmer

Lemmatization (3) In some cases, lemmatization can help to provide better results when searching for documents. But in some other cases, it does not help and leads to worse results. Thus, lemmatization may not always be used in practice. Example of a problem →

Lemmatization (5) Example The Porter Stemmer converts all these words: operate, operating, operates, operation, operative, operatives, operational to « oper ». But these words have different meanings.
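
This behaviour can be reproduced with NLTK's implementation of the Porter stemmer (a sketch, assuming the nltk package is installed):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["operate", "operating", "operates", "operation", "operational"]:
    print(word, "->", stemmer.stem(word))
# Every word above is reduced to the same stem: "oper".
```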

Lemmatization (6) In general, applying lemmatization allows users to find more documents using an information retrieval system. But these documents may be less relevant. In other words, lemmatization may: decrease precision, but increase recall.

Precision (准确率) Precision: What fraction of the returned results are relevant to the information need? Example: A person searches for webpages about Beijing The search engine returns: 5 relevant webpages 5 irrelevant webpages. Precision = 5 / 10 = 0.5 (50 %)

Recall (召回) Recall: What fraction of the relevant documents in a collection were returned by the system? Example: A database contains 1000 documents about HITSZ. The user searches for documents about HITSZ. Only 100 documents about HITSZ are retrieved. Recall = 100 / 1000 = 0.1 (10 %)
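
Both measures can be computed from sets of document identifiers; the sets below are made-up examples:

```python
def precision(returned, relevant):
    """Fraction of the returned results that are relevant."""
    return len(returned & relevant) / len(returned)

def recall(returned, relevant):
    """Fraction of the relevant documents that were returned."""
    return len(returned & relevant) / len(relevant)

returned = {"d1", "d2", "d3", "d4"}          # hypothetical search results
relevant = {"d1", "d2", "d5", "d6", "d7"}    # hypothetical relevant documents
print(precision(returned, relevant))   # 2 / 4 = 0.5
print(recall(returned, relevant))      # 2 / 5 = 0.4
```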

2.3 How to search faster using a dictionary

Introduction Last week, we saw how we can use a dictionary to search for documents. Example →

Example QUERY: CITY AND CHINA Dictionary: City → Book1, Book2, Book10, …, Book20, …; Shenzhen → Book1, Book3, …; Located → …; China → Book1, Book20 We need to compute the intersection (交集) of the two posting lists (City and China). To do that, we compare both lists, posting by posting.

Example QUERY: CITY AND CHINA RESULT: Book1, Book20
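
A minimal sketch of this posting-by-posting intersection; the posting lists are sorted and use made-up numeric book IDs:

```python
def intersect(p1, p2):
    """Merge two sorted posting lists, keeping only common documents."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

city  = [1, 2, 10, 15, 20]      # hypothetical posting list for "City"
china = [1, 20]                 # hypothetical posting list for "China"
print(intersect(city, china))   # [1, 20]
```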

How to search faster? There are some techniques to allow faster search. One such technique is to use “skip pointers”. We will see the main idea (without the details).

Skip-pointers The idea is to use some « shortcuts » (arrows) to skip some entries when comparing lists. By doing this, we can compare lists of documents faster (we don’t need to completely read the lists). This is just the main idea. We will not discuss technical details! This idea only works for queries using the AND operator (it does not work for OR).
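
A hedged sketch of the idea: each list stores a few precomputed « shortcuts » (here one every √n postings), and during the merge we may jump ahead when the posting we would skip to is still not larger than the other list's current posting. The skip placement and the data are assumptions; real implementations differ in the details.

```python
import math

def build_skips(postings):
    """Precompute skip pointers: position -> position it can jump to."""
    step = int(math.sqrt(len(postings))) or 1
    return {i: i + step for i in range(0, len(postings) - step, step)}

def intersect_with_skips(p1, p2):
    s1, s2 = build_skips(p1), build_skips(p2)
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            if i in s1 and p1[s1[i]] <= p2[j]:
                i = s1[i]          # follow the skip pointer
            else:
                i += 1
        else:
            if j in s2 and p2[s2[j]] <= p1[i]:
                j = s2[j]          # follow the skip pointer
            else:
                j += 1
    return answer

print(intersect_with_skips([1, 2, 10, 15, 20, 25, 30, 40, 50], [1, 20, 50]))
# [1, 20, 50]
```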

2.4 Phrase queries

Phrase query (精确查询) Phrase query: a query where words must appear consecutively (one after the other) in documents e.g.: « Harbin Institute of Technology » This query is written with quotes (“ “). It will find all documents containing these words one after the other. This type of query is not supported by all Web search engines.

Phrase query (2) Some Web search engines will instead consider the proximity between words in documents. Documents where the words from a query appear closer together will be preferred to other documents. How to answer a phrase query?

Biword indexes A solution is to consider each pair of consecutive terms in a document as a term. I walked in Beijing → « I walked », « walked in », « in Beijing » Those terms are called « biwords ». Each biword can be used to create an index that we call a « biword index ».
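
A minimal sketch of generating the biwords from a token sequence:

```python
def biwords(tokens):
    """Pair each token with the one that follows it."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

print(biwords(["I", "walked", "in", "Beijing"]))
# ['I walked', 'walked in', 'in Beijing']
```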

Illustration of a biword index The dictionary now contains entries such as « I », « I walked », « walked », « walked in », « in Beijing » and « Beijing », each with its own posting list (e.g. Book1, Book5, Book10, …, Book20, …).

Biword indexes Using a biword index, we can search using the « biwords ». A query: « Harbin Institute » AND « Institute of » AND « of Technology » This query would work pretty well. But it could still find documents where the phrase « Harbin Institute of Technology » does not appear consecutively.

Biword indexes How to solve this problem? A solution is to generalize the concept of biword index to more than two words (e.g. three words). Then, we may find more relevant documents. But a problem is that the index would become much larger (there will be more entries in the dictionary).

Positional indexes (位置索引) A better solution is to use another type of index called a positional index. Positional index: a dictionary where the positions of terms in documents are stored. Dictionary: City → Book1 (3, 25, 38), Book20 (4, 100, 1000); Shenzhen → Book1 (2, 24, 35), Book20 (3, 500); Located → …; China → … For example, this indicates that « Shenzhen » appears as the 2nd, 24th and 35th word in “Book1”, and as the 3rd and 500th word in “Book20”.

Positional indexes Positional indexes can be used to answer phrase queries.

Example Phrase query: « Shenzhen City » Dictionary: City → Book1 (3, 25, 38), Book20 (4, 100, 1000); Shenzhen → Book1 (2, 24, 35), Book20 (3, 500) In Book1, « Shenzhen » appears at position 2 and « City » right after it at position 3; in Book20, « Shenzhen » is at position 3 and « City » at position 4. Result: Book1 and Book20
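
A minimal sketch of answering this two-word phrase query with a positional index; the index below simply mirrors the example postings above:

```python
# term -> {document -> sorted list of positions}
index = {
    "shenzhen": {"Book1": [2, 24, 35], "Book20": [3, 500]},
    "city":     {"Book1": [3, 25, 38], "Book20": [4, 100, 1000]},
}

def phrase_query(term1, term2, index):
    """Return documents where term2 appears immediately after term1."""
    results = []
    common_docs = index.get(term1, {}).keys() & index.get(term2, {}).keys()
    for doc in sorted(common_docs):
        positions2 = set(index[term2][doc])
        if any(p + 1 in positions2 for p in index[term1][doc]):
            results.append(doc)
    return results

print(phrase_query("shenzhen", "city", index))   # ['Book1', 'Book20']
```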

Positional indexes Positional indexes can also be used to answer proximity queries. « Shenzhen (within five words of) City »

Conclusion Today, we have discussed in more detail how indexes are created: tokenization, normalization, lemmatization… The PPT slides are on the website. QQ Group: ____________

References Manning, C. D., Raghavan, P., Schütze, H. Introduction to information retrieval. Cambridge: Cambridge University Press, 2008