Gertjan van Noord, 2014. Zoekmachines (Search Engines), Lecture 2: vocabulary, posting lists.

Agenda for today
- Questions on Chapter 1
- Chapter 2: Term vocabulary & posting lists
- Chapter 2: Posting lists with positions
- Homework/lab assignment

Questions chapter 1

Chapter 2 Overview
- Preprocessing of documents
  - choose the unit of indexing (granularity)
  - tokenization (removing punctuation, splitting into words)
  - stop list?
  - normalization: case folding, stemming versus lemmatizing, ...
- Extensions to postings lists

Tokens, types and terms
- token: each separate word in the text
- type: same words belong to one type
- (index) term: finally included in the index
An index term is an equivalence class of tokens and/or types.

Tokens, types and terms
Example: The Lord of the Rings
- Number of tokens? 5
- Number of types? 4
- Number of terms? 4? 2? 1?
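The counts in this example can be reproduced with a small sketch. It assumes a regex tokenizer, case folding to merge "The" and "the" into one type, and a hypothetical two-word stop list; none of these choices come from the slides, and a different stop list or normalization would give a different number of terms.

```python
import re

def tokenize(text):
    # Split on anything that is not a word character (drops punctuation).
    return re.findall(r"\w+", text)

title = "The Lord of the Rings"

tokens = tokenize(title)               # every separate word occurrence
types = {t.lower() for t in tokens}    # case folding merges "The" and "the"

STOP_WORDS = {"the", "of"}             # hypothetical stop list, for illustration only
terms = types - STOP_WORDS             # what would end up in the index

print(len(tokens), len(types), len(terms))
```

With this stop list the index keeps two terms (lord, rings); without a stop list it keeps all four types.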

Equivalence classes
- Case folding
- Diacritics
- Stemming/lemmatisation
- Decompounding
- Synonym lists
- Variant spellings

Equivalence classes
- Implicit: mapping rules
- Relational: query expansion
- Relational: double indexing
The mapping should be done both when:
- Indexing
- Querying

Diacritics
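As an illustration of an implicit mapping rule, the sketch below folds case and strips diacritics using Python's standard `unicodedata` module. The function name `normalize_term` is an assumption made for this example. The same function has to be applied to both the indexed text and the query, otherwise the two will not meet in the same equivalence class.

```python
import unicodedata

def normalize_term(token):
    # Decompose accented characters into base character + combining mark.
    decomposed = unicodedata.normalize("NFKD", token)
    # Drop the combining marks (the diacritics) and fold case.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

# Indexing side and query side go through the same mapping.
indexed = normalize_term("Café")
query = normalize_term("cafe")
print(indexed == query)
```

Stripping diacritics is lossy: in some languages two different words differ only by an accent, so this rule trades precision for recall.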

Words and word forms
Inflection (D: verbuiging/vervoeging)
- changing a word to express person, case, aspect, ...
- for determiners, nouns, pronouns, adjectives: declension (D: verbuiging)
- for verbs: conjugation (D: vervoeging)
Derivation (D: afleiding)
- formation of a new word from another word (e.g. by adding an affix (prefix or suffix) or changing the grammatical category)

Inflection examples
Determiners
- E: the
- D: de, het
- G: der, des, dem, den, die, das
Adjectives
- E: young
- D: jong, jonge
- G: junger, junge, junges, jungen
Nouns
- E: man, men
- D: man, mannen
- G: mann, mannes, ...
Verbs
- E: write / writes / wrote / written
- D: schrijf / schrijft / schrijven / schreef / schreven / geschreven
- G: schreibe / schreibst / schreibt / schreiben / schrieben / geschrieben

Derivation examples
- to browse -> a browser
- red -> to redden, reddish
- Google -> to google
- arm(s) -> to arm, to disarm -> disarmament, disarming

Stemming and lemmatizing
Example: inform
- verb forms: inform, informs, informed, informing
- derivations: information, informative, informal??
- stem: inform
- lemma: inform, information, informative, informal
Example: sing
- verb forms: sing: sings, sang, sung, singing
- derivations: singer, singers, song, songs
- stem: sing, sang, sung, song, ...
- lemma: sing, singer, song
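The difference can be sketched with two toy functions. Both are illustrative assumptions: `toy_stem` is a crude suffix stripper (not the Porter stemmer or any real library), and `toy_lemmatize` relies on a tiny hand-written lookup table standing in for the vocabulary a real lemmatizer would need.

```python
SUFFIXES = ("ing", "ed", "s")  # hypothetical rule set, far smaller than a real stemmer's

def toy_stem(word):
    # Strip the first matching suffix, keeping at least three characters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical vocabulary: a lemmatizer needs to know irregular forms.
LEMMAS = {"sings": "sing", "sang": "sing", "sung": "sing",
          "singing": "sing", "singers": "singer", "songs": "song"}

def toy_lemmatize(word):
    return LEMMAS.get(word, word)

for w in ["informs", "informed", "informing", "sings", "sang", "sung"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

The rule-based stemmer handles the regular forms of inform, but leaves sang and sung untouched; only the vocabulary-based lookup maps them to sing.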

Discussion
Why is stemming used when lemmatizing is much more precise?
Lemmatizing is a more complex process; it needs:
- a vocabulary (problem: new words)
- morphological analysis (knowledge of inflection rules)
- syntactic analysis, parsing (noun or verb?)

Compound splitting
- Marketingjargon -> marketing AND jargon
- Increased recall
- Decreased precision
- Must be applied to both query and index!
- But what to do with the query "marketing jargon"?
- And with "spreekwoord appel boom" (Dutch: "proverb apple tree", where appelboom is normally written as one compound)?
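A minimal sketch of dictionary-based compound splitting follows. It assumes a known vocabulary of word parts and only tries two-part splits; the function name, the `min_len` parameter and the tiny vocabulary are hypothetical, and real decompounders also handle linking elements and more than two parts.

```python
def split_compound(word, vocabulary, min_len=3):
    # Try every split point; accept the first one where both halves are known words.
    word = word.lower()
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        if head in vocabulary and tail in vocabulary:
            return [head, tail]
    return [word]  # no split found: keep the word as a single term

vocabulary = {"marketing", "jargon", "appel", "boom"}  # illustrative only

print(split_compound("Marketingjargon", vocabulary))
print(split_compound("appelboom", vocabulary))
print(split_compound("jargon", vocabulary))
```

Applying the same function to documents and queries keeps both sides consistent, but it does not by itself answer the slide's question about queries that already arrive split.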

Chapter 2 Overview
- Preprocessing of documents
  - choose the unit of indexing (granularity)
  - tokenization (removing punctuation, splitting into words)
  - stop list?
  - normalization: case folding, stemming versus lemmatizing, ...
- Extensions to postings lists

Efficient merging of postings
- For X AND Y, we have to intersect 2 lists
- Most documents will contain only one of the two terms

Recall basic intersection algorithm
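The basic intersection walks both sorted postings lists in step. The sketch below assumes each postings list is a Python list of document IDs in ascending order; the function name and the sample lists are made up for illustration.

```python
def intersect(p1, p2):
    # Linear merge of two sorted postings lists (document IDs, ascending).
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # document contains both terms
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the list with the smaller doc ID
        else:
            j += 1
    return answer

print(intersect([1, 2, 4, 11, 31, 45, 173, 174], [2, 31, 54, 101]))
```

The cost is linear in the combined length of the two lists, which is why long lists motivate the skip pointers discussed next.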

Skip pointers

- Make the intersection of 2 lists more efficient (think of millions of list items)
- How many skip pointers, and where?
- Trade-off: more pointers are often useful, but the skips are small. Fewer pointers …
- Heuristic: distance √n, evenly distributed
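A sketch of intersection with skip pointers, using the √n heuristic. As a simplifying assumption, skip pointers are not stored explicitly: every position that is a multiple of the skip distance is treated as having a pointer to the position one skip distance further on.

```python
import math

def intersect_with_skips(p1, p2):
    # Skip distance follows the sqrt(n) heuristic, with a minimum of 1.
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Follow the skip only if its target does not overshoot p2[j].
            if i % skip1 == 0 and i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j % skip2 == 0 and j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer

print(intersect_with_skips([2, 4, 8, 16, 19, 23, 28, 43],
                           [1, 2, 3, 5, 8, 41, 51, 60, 71]))
```

A skip is safe because every entry jumped over is smaller than the skip target, which in turn is no larger than the current entry of the other list, so no match can be missed.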

Skip pointers: useful?
- Yes, certainly in the past
- With very fast CPUs, they are less important
- Especially useful in a rather static index
- If a list keeps changing, they are less effective

Extensions of the simple term index
To support phrase queries:
- “information retrieval”
- “retrieval of information”
Different approaches:
- biword indexes
- phrase indexes
- positional indexes
- combinations

Biword and phrase indexes
- Holding terms together in the index
- Simple biword index: retrieval of, of information
- Sophisticated: a POS tagger selects nouns, pattern N x* N, e.g. retrieval of this information
- Phrase index: includes word sequences of variable length; terms of 1 and 2 words are both included
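The simple biword variant can be sketched as follows. The whitespace tokenizer, the function names and the three sample documents are assumptions for illustration; the POS-based variant is not shown.

```python
from collections import defaultdict

def build_biword_index(docs):
    # Map every pair of adjacent words to the set of documents containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for first, second in zip(tokens, tokens[1:]):
            index[f"{first} {second}"].add(doc_id)
    return index

def biword_phrase_query(index, phrase):
    # A longer phrase becomes an AND over its consecutive biwords.
    tokens = phrase.lower().split()
    biwords = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    if not biwords:
        return set()
    result = set(index.get(biwords[0], set()))
    for biword in biwords[1:]:
        result &= index.get(biword, set())
    return result

docs = {
    1: "retrieval of information",
    2: "information retrieval",
    3: "retrieval of documents of information",
}
index = build_biword_index(docs)
print(biword_phrase_query(index, "retrieval of information"))
```

Document 3 matches even though it does not contain the exact phrase: it contains both biwords, just not adjacent to each other. Such false positives are one reason to prefer a positional index for longer phrases.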

Positional index
Add to the postings lists, for each doc, the list of positions of the term
- for phrase queries
- for proximity search
Example:
[information, 4] : [1:, 2:, …]
[retrieval, 2] : [1:, 2: ]
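A minimal positional index and phrase lookup might look like the sketch below. It assumes whitespace tokenization and zero-based word positions, and stores postings as a dictionary from term to {doc ID: [positions]}; the function names and sample documents are hypothetical.

```python
from collections import defaultdict

def build_positional_index(docs):
    # term -> {doc_id: [positions of the term in that document]}
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token].setdefault(doc_id, []).append(pos)
    return index

def positional_phrase_query(index, phrase):
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return []
    # Candidate documents must contain every word of the phrase.
    candidates = set(index[words[0]])
    for w in words[1:]:
        candidates &= set(index[w])
    hits = []
    for doc_id in sorted(candidates):
        # The k-th word of the phrase must occur k positions after the first word.
        starts = index[words[0]][doc_id]
        if any(all(start + k in index[w][doc_id] for k, w in enumerate(words))
               for start in starts):
            hits.append(doc_id)
    return hits

docs = {1: "information retrieval is fun", 2: "retrieval of information"}
index = build_positional_index(docs)
print(positional_phrase_query(index, "information retrieval"))
```

Replacing the exact offset check with a distance bound would turn the same structure into a proximity search.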

Combination schemes
- Often-queried combinations: phrase index
  - names of persons and organizations
  - esp. combinations of common terms (!)
  - find out from the query log
- For other phrases: a positional index
- Williams et al.: next word index added

H.E. Williams, J. Zobel, and D. Bahle (2004), Fast Phrase Querying With Combined Indexes (ACM Digital Library):
Phrase querying with a combination of three approaches (next word index, phrase index and inverted file)
- ... is more than 60% faster on average than using an inverted index alone
- ... requires structures that total only 20% of the size of the collection.

A nextword index (Williams et al.)
Entry layout (figure not recoverable in full): docfreq, (<doc, freq, [..]>, ...)
- docfreq: number of matching docs
- doc: doc ID
- freq: number of occurrences in doc
- [..]: positions
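As a rough sketch only, a nextword index can be modelled as a two-level dictionary: first word, then next word, then positional postings for that word pair. The layout below is a simplification assumed for illustration and is not the on-disk structure from the paper; the function name and documents are hypothetical.

```python
from collections import defaultdict

def build_nextword_index(docs):
    # first word -> next word -> {doc_id: [positions where the pair starts]}
    index = defaultdict(lambda: defaultdict(dict))
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for pos, (first, nxt) in enumerate(zip(tokens, tokens[1:])):
            index[first][nxt].setdefault(doc_id, []).append(pos)
    return index

docs = {
    1: "fast phrase querying",
    2: "phrase querying is fast phrase querying",
}
index = build_nextword_index(docs)

postings = index["phrase"]["querying"]
docfreq = len(postings)                                        # number of matching docs
freqs = {doc: len(positions) for doc, positions in postings.items()}  # occurrences per doc
print(docfreq, freqs, postings)
```

The three values printed correspond to the docfreq, freq and position fields of the entry layout.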