Information Retrieval. CSC 9010: Special Topics, Natural Language Processing. Spring, 2005. Matuszek & Papalaskari


Information Retrieval
CSC 9010: Special Topics. Natural Language Processing.
Paula Matuszek, Mary-Angela Papalaskari
Spring, 2005

Finding Out About
There are many large corpora of information that people use. The web is the obvious example. Others include:
– scientific journals
– patent databases
– Medline
– Usenet groups
People interact with all that information because they want to KNOW something; there is a question they are trying to answer or a piece of information they want.
Information Retrieval, or IR, is the process of answering that information need.
Simplest approach:
– Knowledge is organized into chunks (pages or documents)
– Goal is to return appropriate chunks

Information Retrieval Systems
Goal of an information retrieval system is to return appropriate chunks
Steps involved include:
– asking a question
– finding answers
– evaluating answers
– presenting answers
Value of an IR tool depends on how well it does on all of these.
Web search engines are the IR tools most familiar to most people.

Asking a Question
Reflects some information need
Query syntax needs to allow the information need to be expressed
– Keywords
– Combining terms
  » Simple: “required”, NOT (+ and -)
  » Boolean expressions with and/or/not and nested parentheses
  » Variations: strings, NEAR, capitalization
– Simplest syntax that works
– Typically more acceptable if predictable
Another set of problems arises when the information isn’t text: graphics, music

Finding the Information
Goal is to retrieve all relevant chunks. Too time-consuming to do in real time, so IR systems index pages.
Two basic approaches
– Index and classify by hand
– Automate
For BOTH approaches, deciding what to index on (e.g., what is a keyword) is a significant issue.
Many IR tools, such as search engines, provide both

IR Basics
A retriever collects a page or chunk. This may involve spidering web pages, extracting documents from a DB, etc.
A parser processes each chunk and extracts individual words.
An indexer creates/updates a hash table which connects words with documents.
A searcher uses the hash table to retrieve documents based on words.
A ranking system decides the order in which to present the documents: their relevance.

How Good Is the IR?
Information retrieval systems are evaluated with two basic metrics:
– Precision: what percent of documents returned are actually relevant to the information need
– Recall: what percent of documents relevant to the information need are returned
Can’t typically measure these exactly; usually based on test sets.
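The two metrics can be written out directly. The following is a minimal Python sketch, assuming documents are identified by IDs and that a hand-labeled set of relevant IDs exists for a test query. The function name `precision_recall` and the sample IDs are illustrative, not part of the course material.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one query.

    returned: IDs the system returned; relevant: IDs judged relevant
    in a test set (both are assumptions made for this sketch).
    """
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant  # documents that are both returned and relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents returned, two of them among the three relevant ones.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(p, r)
```

Returning 0.0 for an empty result set or an empty relevant set is a convention chosen here to avoid division by zero; other conventions are possible.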

Selecting Relevant Documents
Assume:
– we already have a corpus of documents defined.
– goal is to return a subset of those documents.
– individual documents have been separated into individual files
Remaining components must parse, index, find, and rank documents.
Traditional approach is based on the words in the documents (predates the web)

Extracting Lexical Features
Process a string of characters
– assemble characters into tokens (tokenizer)
– choose tokens to index
Standard lexical analysis problem
Lexical analyser generator, such as lex

Lexical Analyser
Basic idea is a finite state machine
Triples of input state, transition token, output state
Must be very efficient; gets used a LOT
[Figure: state diagram with transitions labeled "blank", "A-Z", and "blank, EOF"]

Design Issues for Lexical Analyser
Punctuation
– treat as whitespace?
– treat as characters?
– treat specially?
Case
– fold?
Digits
– assemble into numbers?
– treat as characters?
– treat as punctuation?

Lexical Analyser
Output of lexical analyser is a string of tokens
Remaining operations are all on these tokens
We have already thrown away some information; this makes processing more efficient, but somewhat limits the power of our search
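The finite state machine idea can be sketched as a table of (input state, transition label) -> output state triples. The Python below is a toy illustration, not the output of lex: the state names, the `tokenize` function, and the design choices (fold case, treat punctuation and digits as blanks) are assumptions picked from the design issues listed above.

```python
# Two illustrative states: between tokens, or inside a word.
BETWEEN, IN_WORD = "between", "in_word"

# (input state, transition label) -> output state
TRANSITIONS = {
    (BETWEEN, "blank"): BETWEEN,
    (BETWEEN, "letter"): IN_WORD,
    (IN_WORD, "letter"): IN_WORD,
    (IN_WORD, "blank"): BETWEEN,
}


def char_class(ch):
    """Assumed design choice: anything that is not a letter acts as a blank."""
    return "letter" if ch.isalpha() else "blank"


def tokenize(text):
    """Run the state machine over text and return lower-cased word tokens."""
    tokens, current, state = [], [], BETWEEN
    for ch in text:
        label = char_class(ch)
        if label == "letter":
            current.append(ch.lower())  # case folding
        elif state == IN_WORD:
            tokens.append("".join(current))  # a blank ends the current token
            current = []
        state = TRANSITIONS[(state, label)]
    if state == IN_WORD:  # end of input also ends a token
        tokens.append("".join(current))
    return tokens


print(tokenize("Man bites dog, twice!"))
```

Because digits and punctuation are discarded here, a query for "CS336" or "C++" could never match. That is the kind of information loss the slide refers to.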

Stemming
Additional processing at the token level
– We covered this earlier this semester
Turn words into a canonical form:
– “cars” into “car”
– “children” into “child”
– “walked” into “walk”
Decreases the total number of different tokens to be processed
Decreases the precision of a search, but increases its recall
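As a rough illustration of the three examples above, here is a deliberately crude Python sketch. It is not the Porter stemmer or any other published algorithm: the suffix list, the irregular-form table, and the minimum stem length of three characters are all assumptions made for this example.

```python
# Hypothetical lookup for irregular forms; a real stemmer needs far more.
IRREGULAR = {"children": "child"}
# Suffixes tried in order (an assumption, chosen to cover the slide's examples).
SUFFIXES = ("ed", "s")


def stem(token):
    """Map a token to a crude canonical form."""
    token = token.lower()
    if token in IRREGULAR:
        return IRREGULAR[token]
    for suffix in SUFFIXES:
        # Keep at least three characters so short words like "bus" survive.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


print([stem(w) for w in ["cars", "children", "walked"]])
```

The precision cost is easy to see with a rule this blunt: "glass" would be cut to "glas", and unrelated words can collapse to the same stem.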

Noise Words (Stop Words)
Function words that contribute little or nothing to meaning
Very frequent words
– If a word occurs in every document, it is not useful in choosing among documents
– However, need to be careful, because this is corpus-dependent
Often implemented as a discrete list
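Both ideas on this slide, a fixed list and a corpus-dependent check, can be sketched in a few lines of Python. The word list, the function names, and the `min_doc_fraction` threshold are assumptions for illustration only.

```python
from collections import Counter

# A tiny illustrative stop list; real lists are longer and language-specific.
STOP_WORDS = frozenset({"a", "an", "the", "of", "and", "to", "in", "is"})


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the discrete stop list."""
    return [t for t in tokens if t not in stop_words]


def corpus_stop_words(tokenized_docs, min_doc_fraction=1.0):
    """Find words occurring in at least min_doc_fraction of the documents."""
    if not tokenized_docs:
        return set()
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each word once per document
    n_docs = len(tokenized_docs)
    return {w for w, c in doc_freq.items() if c / n_docs >= min_doc_fraction}


print(remove_stop_words(["the", "cat", "is", "in", "the", "hat"]))
print(corpus_stop_words([["patient", "aspirin"],
                         ["patient", "fever"],
                         ["patient", "dose"]]))
```

In the second call "patient" appears in every document, so it cannot help choose among them, even though it would be a meaningful keyword in a general-purpose corpus.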

Example Corpora
We are assuming a fixed corpus. Some sample corpora:
– Medline abstracts
– Email. Anyone’s email.
– Reuters corpus
– Brown corpus
Will contain textual fields, maybe structured attributes
– Textual: free, unformatted, no meta-information. NLP mostly needed here
– Structured: additional information beyond the content

Structured Attributes for Medline
Pubmed ID
Author
Year
Keywords
Journal

Textual Fields for Medline
Abstract
– Reasonably complete standard academic English
– Capturing the basic meaning of document
Title
– Short, formalized
– Captures most critical part of meaning
– Proxy for abstract

Structured Fields for Email
To, From, Cc, Bcc
Dates
Content type
Status
Content length
Subject (partially)

Text Fields for Email
Subject
– Format is structured, content is arbitrary.
– Captures most critical part of content.
– Proxy for content -- but may be inaccurate.
Body of email
– Highly irregular, informal English.
– Entire document, not summary.
– Spelling and grammar irregularities.
– Structure and length vary.

Indexing
We have a tokenized, stemmed sequence of words
Next step is to parse document, extracting index terms
– Assume that each token is a word and we don’t want to recognize any more complex structures than single words.
When all documents are processed, create index

Basic Indexing Algorithm
For each document in the corpus
– Get the next token
– Create or update an entry in a list: doc ID, frequency
For each token found in the corpus
– Calculate #docs, total frequency
– Sort by frequency
– Often called a “reverse index” (more commonly, an “inverted index”), because it reverses the “words in a document” index to be a “documents containing words” index
– May be built on the fly or created after indexing
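The algorithm above can be sketched with a Python dictionary standing in for the hash table. This is a minimal in-memory version under the assumption that the corpus is already tokenized and small enough to hold in a dict; `build_index` and `summarize` are names invented for this example.

```python
from collections import Counter, defaultdict


def build_index(corpus):
    """corpus: dict of doc ID -> token list. Returns token -> {doc ID: frequency}."""
    index = defaultdict(dict)
    for doc_id, tokens in corpus.items():
        for token, freq in Counter(tokens).items():
            index[token][doc_id] = freq  # create or update the posting
    return dict(index)


def summarize(index):
    """Return (token, #docs, total frequency) rows, most frequent first."""
    rows = [(tok, len(post), sum(post.values())) for tok, post in index.items()]
    return sorted(rows, key=lambda row: (-row[2], row[0]))


corpus = {
    "d1": ["dog", "bites", "man"],
    "d2": ["man", "bites", "dog", "dog"],
}
index = build_index(corpus)
print(index["dog"])
print(summarize(index))
```

A searcher then needs only a dictionary lookup per query word: `index.get(word, {})` yields the documents containing it along with their counts.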

Fine Points
Dynamic corpora (e.g., the web): require incremental algorithms
Higher-resolution data (e.g., char position)
– Supports highlighting
– Supports phrase searching
– Useful in relevance ranking
Giving extra weight to proxy text (typically by doubling or tripling frequency count)
Document-type-specific processing
– In HTML, want to ignore tags
– In email, maybe want to ignore quoted material
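To show why higher-resolution data supports phrase searching, the sketch below stores positions instead of bare counts. It is an assumption-laden simplification: it records token positions rather than the character positions mentioned on the slide, and the function names are hypothetical.

```python
from collections import defaultdict


def build_positional_index(corpus):
    """corpus: dict of doc ID -> token list. Returns token -> doc ID -> positions."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in corpus.items():
        for position, token in enumerate(tokens):
            index[token][doc_id].append(position)
    return index


def phrase_search(index, phrase):
    """Return IDs of documents containing the tokens of phrase consecutively."""
    if not phrase:
        return set()
    first, rest = phrase[0], phrase[1:]
    matches = set()
    for doc_id, positions in index.get(first, {}).items():
        for start in positions:
            # Every later word must sit at the expected offset from the first.
            if all(start + offset in index.get(tok, {}).get(doc_id, ())
                   for offset, tok in enumerate(rest, 1)):
                matches.add(doc_id)
                break
    return matches


corpus = {"d1": ["man", "bites", "dog"], "d2": ["dog", "bites", "man"]}
index = build_positional_index(corpus)
print(phrase_search(index, ["man", "bites"]))
```

The same position lists could drive highlighting, or a proximity score for ranking.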

Choosing Keywords
Don’t necessarily want to index on every word
– Takes more space for index
– Takes more processing time
– May not improve our resolving power
How do we choose keywords?
– Manually
– Statistically
Exhaustivity vs specificity

Manually Choosing Keywords
Unconstrained vocabulary: allow creator of document to choose whatever he/she wants
– “best” match
– captures new terms easily
– easiest for person choosing keywords
Constrained vocabulary: hand-crafted ontologies
– can include hierarchical and other relations
– more consistent
– easier for searching; possible “magic bullet” search

Examples of Constrained Vocabularies
ACM headings
H: Information Retrieval
– H3: Information Storage and Retrieval
  – H3.3: Information Search and Retrieval
    » Clustering
    » Query formulation
    » Relevance feedback
    » Search process
    etc.
Medline Headings
L: Information Science
– L01: Information Science
  – L01.700: Medical Informatics
  – L : Medical Informatics Applications
  – L : Information Storage and Retrieval
    » Grateful Med [L ]

Automated Vocabulary Selection
Frequency: Zipf’s Law.
– $P_n = 1/n^a$, where $P_n$ is the frequency of occurrence of the $n$th ranked item and $a$ is close to 1
– Within one corpus, words with middle frequencies are typically “best”
Document-oriented representation bias: lots of keywords/document
Query-oriented representation bias: only the “most typical” words. Assumes that we are comparing across documents.
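One way to look at Zipf's Law on a real token stream is to rank words by count and compare the observed relative frequencies with the 1/n^a curve. The Python below is a small sketch under that reading; `rank_frequency` and `zipf_expected` are illustrative names, and the seven-token sample is far too small to show the law convincingly.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, relative frequency) rows, most frequent word first."""
    counts = Counter(tokens)
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, count / total)
            for rank, (word, count) in enumerate(ranked, 1)]


def zipf_expected(rank, a=1.0):
    """Unnormalized Zipf value 1 / rank**a for comparison with observed data."""
    return 1.0 / rank ** a


tokens = ["the"] * 4 + ["of"] * 2 + ["zipf"]
for rank, word, freq in rank_frequency(tokens):
    print(rank, word, round(freq, 3), round(zipf_expected(rank), 3))
```

Picking the "middle frequency" words then amounts to keeping ranks between two cutoffs. Where those cutoffs fall is corpus-specific and not stated in the slides.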

Choosing Keywords
“Best” depends on actual use; a word that occurs in only one document may be very good for retrieving that document, but it is not very effective overall.
Words which have no resolving power within a corpus may be the best choices across corpora
Not very important for web searching; more relevant for some text mining.

Keyword Choice for WWW
We don’t have a fixed corpus of documents
New terms appear fairly regularly, and are likely to be common search terms
Queries that people want to make are wide-ranging and unpredictable
Therefore: can’t limit keywords, except possibly to eliminate stop words.
Even stop words are language-dependent. So determine language first.

Comparing and Ranking Documents
Once our IR system has retrieved a set of documents, we may want to
Rank them by relevance
– Which are the best fit to my query?
– This involves determining what the query is about and how well the document answers it
Compare them
– Show me more like this.
– This involves determining what the document is about.

Determining Relevance by Keyword
The typical document retrieval query consists entirely of keywords.
Retrieval can be binary: present or absent
More sophisticated is to look for degree of relatedness: how much does this document reflect what the query is about?
Simple strategies:
– How many times does word occur in document?
– How close to head of document?
– If multiple keywords, how close together?

Keywords for Relevance Ranking
Count: repetition is an indication of emphasis
– Very fast (usually in the index)
– Reasonable heuristic
– Unduly influenced by document length
– Can be "stuffed" by web designers
Position: lead paragraphs summarize content
– Requires more computation
– Also a reasonable heuristic
– Less influenced by document length

Keywords for Relevance Ranking
Proximity for multiple keywords
– Requires even more computation
– Obviously relevant only if we have multiple keywords
– Effectiveness of heuristic varies with information need; typically either excellent or not very helpful at all
All keyword methods
– Are computationally simple and adequately fast
– Are effective heuristics
– Typically perform as well as in-depth natural language methods for standard IR
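The three keyword heuristics (count, position, proximity) are each only a few lines of code. The Python below is one possible reading of them; the scoring formulas, in particular the linear position score, are assumptions chosen for simplicity rather than anything prescribed in the slides.

```python
def count_score(tokens, keyword):
    """Count heuristic: number of times the keyword occurs."""
    return tokens.count(keyword)


def position_score(tokens, keyword):
    """Position heuristic: 1.0 at the head, falling toward 0; 0.0 if absent."""
    if keyword not in tokens:
        return 0.0
    return 1.0 - tokens.index(keyword) / len(tokens)


def proximity(tokens, first, second):
    """Proximity heuristic: smallest token distance, or None if either is missing."""
    pos_first = [i for i, t in enumerate(tokens) if t == first]
    pos_second = [i for i, t in enumerate(tokens) if t == second]
    if not pos_first or not pos_second:
        return None
    return min(abs(i - j) for i in pos_first for j in pos_second)


doc = "aspirin reduces fever and aspirin thins blood".split()
print(count_score(doc, "aspirin"))
print(position_score(doc, "blood"))
print(proximity(doc, "aspirin", "blood"))
```

Dividing the position by document length is one way to make the score less sensitive to length. The raw count has no such normalization, which is the weakness the slide points out.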

Comparing Documents
"Find me more like this one" really means that we are using the document as a query.
This requires that we have some conception of what a document is about overall.
Depends on context of query. We need to
– Characterize the entire content of this document
– Discriminate between this document and others in the corpus

Characterizing a Document: Term Frequency
A document can be treated as a sequence of words.
Each word characterizes that document to some extent.
When we have eliminated stop words, the most frequent words tend to be what the document is about
Therefore: $f_{kd}$ (# of occurrences of word $k$ in document $d$) will be an important measure.
Also called the term frequency

Characterizing a Document: Document Frequency
What makes this document distinct from others in the corpus?
The terms which discriminate best are not those which occur with high frequency!
Therefore: $D_k$ (# of documents in which word $k$ occurs) will also be an important measure.
Also called the document frequency

TF*IDF
This can all be summarized as:
– Words are the best discriminators when they
  » occur often in this document (term frequency)
  » don’t occur in a lot of documents (document frequency)
One very common measure of the importance of a word to a document is TF*IDF: term frequency * inverse document frequency
There are multiple formulas for actually computing this. The underlying concept is the same in all of them.

Describing an Entire Document
So what is a document about?
TF*IDF: can simply list keywords in order of their TF*IDF values
Document is about all of them to some degree: it is at some point in some vector space of meaning
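Since the slides note that several TF*IDF formulas exist, the sketch below picks one common variant as an assumption: raw term frequency times log(N / D_k), where N is the number of documents. The names `tf_idf` and `top_keywords` are illustrative.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """corpus: dict of doc ID -> token list. Returns doc ID -> {term: weight}.

    Assumed variant: weight = f_kd * log(N / D_k).
    """
    n_docs = len(corpus)
    doc_freq = Counter()
    for tokens in corpus.values():
        doc_freq.update(set(tokens))  # D_k: documents containing each term
    return {
        doc_id: {term: freq * math.log(n_docs / doc_freq[term])
                 for term, freq in Counter(tokens).items()}
        for doc_id, tokens in corpus.items()
    }


def top_keywords(weights, n=3):
    """List a document's terms in decreasing order of TF*IDF weight."""
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:n]]


corpus = {
    "d1": ["dog", "bites", "man"],
    "d2": ["dog", "chases", "cat", "cat"],
    "d3": ["man", "walks", "dog"],
}
weights = tf_idf(corpus)
print(weights["d1"]["dog"])
print(top_keywords(weights["d2"], 2))
```

With this variant "dog" gets weight zero because it occurs in every document, which matches the point that terms occurring everywhere do not discriminate.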

Vector Space
Any corpus has a defined set of terms (index)
These terms define a knowledge space
Every document is somewhere in that knowledge space -- it is or is not about each of those terms.
Consider each term as a basis vector (one axis). Then
– We have an n-dimensional vector space
– Where n is the number of terms (very large!)
– Each document is a point in that vector space
The document position in this vector space can be treated as what the document is about.

Similarity Between Documents
How similar are two documents?
– Measures of association
  » How much do the feature sets overlap?
  » Modified for length: DICE coefficient
    DICE(x,y) = 2 f(x,y) / ( f(x) + f(y) )
    # terms compared to intersection
  » Simple Matching coefficient: take into account exclusions
– Cosine similarity
  » similarity of angle of the two document vectors
  » not sensitive to vector length
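Both measures can be sketched briefly in Python. The assumptions here are that f(x) is the number of distinct terms in document x, that f(x,y) is the number of terms the two documents share, and that cosine similarity operates on term -> weight dictionaries (for example, TF*IDF weights).

```python
import math


def dice(x_terms, y_terms):
    """DICE(x, y) = 2 f(x,y) / (f(x) + f(y)) over sets of distinct terms."""
    x, y = set(x_terms), set(y_terms)
    if not x and not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (term -> weight dicts)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


print(dice({"dog", "bites", "man"}, {"dog", "chases", "cat"}))
# Doubling every weight changes the vector's length but not its direction.
print(cosine({"dog": 1, "man": 1}, {"dog": 2, "man": 2}))
```

The second call illustrates "not sensitive to vector length": a document repeated twice over still points in the same direction as the original.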

Bag of Words
All of these techniques are what is known as bag of words approaches.
Keywords treated in isolation
Difference between "man bites dog" and "dog bites man" non-existent
If better discrimination is needed, IR systems can add semantic tools
– Use POS
– Parse into basic NP VP structure
– Requires that query be more complex.
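The "man bites dog" point can be checked directly: once a text is reduced to word counts, word order is gone. A minimal Python sketch, assuming whitespace tokenization and case folding (`bag_of_words` is an illustrative name):

```python
from collections import Counter


def bag_of_words(text):
    """Reduce a text to an unordered multiset of lower-cased words."""
    return Counter(text.lower().split())


same = bag_of_words("man bites dog") == bag_of_words("dog bites man")
print(same)  # the two sentences have identical bags
```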

Improvements
The two big problems with short queries are:
– Synonymy: poor recall results from missing documents that contain synonyms of search terms, but not the terms themselves
– Polysemy/Homonymy: poor precision results from search terms that have multiple meanings, leading to the retrieval of non-relevant documents.
Source: Martin

Query Expansion
Find a way to expand a user’s query to automatically include relevant terms (that they should have included themselves), in an effort to improve recall
– Use a dictionary/thesaurus
– Use relevance feedback
Source: Martin

Dictionary/Thesaurus Example
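The original example on this slide is not preserved in the transcript. As a stand-in, here is a hypothetical Python sketch of thesaurus-based expansion; the `THESAURUS` entries and the function name are invented for illustration and do not come from the course.

```python
# Hypothetical synonym table; a real system would use a curated thesaurus.
THESAURUS = {
    "car": ["automobile", "auto"],
    "doctor": ["physician"],
}


def expand_with_thesaurus(query_terms, thesaurus=THESAURUS):
    """Add each term's listed synonyms to the query, without duplicates."""
    expanded, seen = [], set()
    for term in query_terms:
        for candidate in [term, *thesaurus.get(term, [])]:
            if candidate not in seen:
                seen.add(candidate)
                expanded.append(candidate)
    return expanded


print(expand_with_thesaurus(["car", "price"]))
```

Adding synonyms addresses the synonymy problem (recall), but it can also pull in unintended senses of a polysemous word, so precision may suffer.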

Relevance Feedback
Ask user to identify a few documents which appear to be related to their information need
Extract terms from those documents and add them to the original query.
Run the new query and present those results to the user.
Typically converges quickly
Based on Martin

Blind Feedback
Assume that first few documents returned are most relevant rather than having users identify them
Proceed as for relevance feedback
Tends to improve recall at the expense of precision
Based on Martin
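Relevance feedback and blind feedback differ only in where the feedback documents come from, which the Python sketch below tries to make explicit. Selecting new terms by raw frequency is an assumption made for brevity (practical systems typically weight candidate terms), and the function names and parameters are hypothetical.

```python
from collections import Counter


def expand_query(query_terms, feedback_docs, n_new_terms=2):
    """Add the most frequent non-query terms from the feedback documents."""
    counts = Counter(term
                     for doc in feedback_docs
                     for term in doc
                     if term not in query_terms)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return list(query_terms) + [term for term, _ in ranked[:n_new_terms]]


def blind_feedback(query_terms, ranked_docs, top_k=2, n_new_terms=2):
    """Treat the first top_k returned documents as if the user had picked them."""
    return expand_query(query_terms, ranked_docs[:top_k], n_new_terms)


ranked_docs = [
    ["car", "automobile", "engine"],
    ["automobile", "car", "dealer"],
    ["cat", "food"],
]
# Relevance feedback: the user marks the first two documents as relevant.
print(expand_query(["car"], ranked_docs[:2]))
# Blind feedback: the system assumes the top two are relevant.
print(blind_feedback(["car"], ranked_docs, top_k=2))
```

If the top-ranked documents are not actually relevant, blind feedback adds off-topic terms, which is one way to see the precision cost noted on the slide.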

Post-Hoc Analyses
When a set of documents has been returned, they can be analyzed to improve usefulness in addressing the information need
– Grouped by meaning for polysemous queries (using N-gram-type approaches)
– Grouped by extracted information (named entities, for instance)
– Grouped into an existing hierarchy if structured fields are available
– Filtering (e.g., eliminate spam)

Additional IR Issues
In addition to improved relevance, overall information retrieval can be improved by some other factors:
– Eliminate duplicate documents
– Provide good context
– Use ontologies to provide synonym lists
For the web:
– Eliminate multiple documents from one site
– Clearly identify paid links

Summary
Information Retrieval is the process of returning documents to meet a user’s information need based on a query
Typical methods are BOW (bag of words) methods, which rely on keyword indexing with little semantic processing
NLP techniques used include tokenizing, stemming, and some parsing.
Results can be improved by adding semantic information (such as thesauri) and by filtering and other post-hoc analyses.