Selecting Relevant Documents

Presentation transcript:

Selecting Relevant Documents
Assume:
  – we already have a corpus of documents defined
  – the goal is to return a subset of those documents
  – individual documents have been separated into individual files
Remaining components must parse, index, find, and rank documents.
The traditional approach is based on the words in the documents.

Extracting Lexical Features
Process a string of characters:
  – assemble characters into tokens
  – choose tokens to index
Done in place (a problem for the WWW)
Standard lexical analysis problem
Lexical analyser generator, such as lex

Lexical Analyser
Basic idea is a finite state machine
Triples of input state, transition token, output state
Must be very efficient; gets used a LOT
[State diagram; transition labels: blank; A-Z; blank, EOF]
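The finite state machine idea can be sketched in a few lines of Python. This is a minimal illustration, not the analyser used in the course text: the state names, the `tokenize` function, and the decisions to fold case and to treat every non-letter (punctuation, digits, blanks) as a separator are all assumptions made for the example.

```python
from enum import Enum, auto


class State(Enum):
    BETWEEN = auto()  # between tokens, consuming separators
    IN_WORD = auto()  # inside a run of letters


def tokenize(text):
    """Split text into lowercase word tokens using a two-state machine."""
    tokens = []
    current = []
    state = State.BETWEEN
    for ch in text:
        if ch.isalpha():
            # letter: start or continue a token (case is folded here)
            current.append(ch.lower())
            state = State.IN_WORD
        else:
            # any non-letter acts as a blank and closes a pending token
            if state is State.IN_WORD:
                tokens.append("".join(current))
                current = []
            state = State.BETWEEN
    if state is State.IN_WORD:
        # end of input also closes a pending token
        tokens.append("".join(current))
    return tokens
```

Calling `tokenize("The cat, the HAT")` yields the tokens `the`, `cat`, `the`, `hat`. A production analyser would usually be table-driven or generated by a tool such as lex for speed.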

Design Issues for Lexical Analyser
Punctuation
  – treat as whitespace?
  – treat as characters?
  – treat specially?
Case
  – fold?
Digits
  – assemble into numbers?
  – treat as characters?
  – treat as punctuation?

Lexical Analyser
Output of the lexical analyser is a string of tokens
Remaining operations are all on these tokens
We have already thrown away some information; this makes processing more efficient, but limits the power of our search

Stemming
Additional processing at the token level
Turn words into a canonical form:
  – “cars” into “car”
  – “children” into “child”
  – “walked” into “walk”
Decreases the total number of different tokens to be processed
Tends to decrease the precision of a search, but increase its recall

Stemming -- How?
Plurals to singulars (e.g., children to child)
Verbs to infinitives (e.g., talked to talk)
Clearly non-trivial in English!
Typical stemmers use a context-sensitive transformation grammar:
  – rules such as (.*)SSES -> /1SS are typical
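A rule of the form (.*)SSES -> /1SS reads as "keep whatever precedes SSES and append SS". The sketch below applies a small ordered rule list in that spirit using Python regular expressions. The four rules are those of step 1a of the Porter stemmer; the names `RULES` and `strip_plural` are assumptions for this example, and a full stemmer has many more rules. Irregular forms such as "children" are not handled by suffix rules at all.

```python
import re

# (pattern, replacement) pairs, tried in order; the first match wins.
RULES = [
    (re.compile(r"^(.*)sses$"), r"\1ss"),  # caresses -> caress
    (re.compile(r"^(.*)ies$"), r"\1i"),    # ponies   -> poni
    (re.compile(r"^(.*)ss$"), r"\1ss"),    # caress   -> caress (unchanged)
    (re.compile(r"^(.*)s$"), r"\1"),       # cats     -> cat
]


def strip_plural(token):
    """Apply the first matching suffix rule to a lowercase token."""
    for pattern, replacement in RULES:
        if pattern.match(token):
            return pattern.sub(replacement, token)
    return token  # no rule applies
```

Rule order matters: the SS rule sits before the bare S rule so that "caress" is not cut down to "cares".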

Noise Words (Stop Words)
Function words that contribute little or nothing to meaning
Very frequent words
  – If a word occurs in every document, it is not useful in choosing among documents
  – However, we need to be careful, because this is corpus-dependent
Often implemented as a discrete list (stop.wrd on CD)
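Because the set of uninformative words depends on the corpus, one option is to derive candidate stop words from document frequency instead of relying only on a fixed list. The following is a rough sketch under that assumption; `corpus_stop_words`, `remove_stop_words`, and the `threshold` parameter are hypothetical names, not part of the course software.

```python
from collections import Counter


def corpus_stop_words(docs, threshold=1.0):
    """Return tokens that occur in at least `threshold` (a fraction) of the documents.

    docs is a list of token lists, one list per document.
    """
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each token once per document
    n = len(docs)
    return {tok for tok, df in doc_freq.items() if df / n >= threshold}


def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop list."""
    return [tok for tok in tokens if tok not in stop_words]
```

With the default threshold of 1.0 only words that appear in every document are flagged, which matches the observation above that such words cannot help choose among documents.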

Example Corpora
We are assuming a fixed corpus. The text uses two sample corpora:
  – AI Abstracts
  – . Anyone’s .
Textual fields, structured attributes
  – Textual: free, unformatted, no meta-information
  – Structured: additional information beyond the content

Structured Attributes for AI Theses
Thesis #, Author, Year, University, Advisor, Language

Textual Fields for AIT
Abstract
  – Reasonably complete standard academic English capturing the basic meaning of the document
Title
  – Short, formalized, captures the most critical part of the meaning
  – (proxy for abstract)

Indexing
We have a tokenized, stemmed sequence of words
Next step is to parse the document, extracting index terms
  – Assume that each token is a word and we don’t want to recognize any more complex structures than single words.
When all documents are processed, create the index

Basic Indexing Algorithm
For each document in the corpus
  – get the next token
  – save the posting in a list (doc ID, frequency)
For each token found in the corpus
  – calculate #doc, total frequency
  – sort by frequency
(pp. 53-54)
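The two passes of the basic indexing algorithm can be sketched as follows. This is an in-memory illustration only; the function names `build_index` and `summarize` and the dictionary-based corpus layout are assumptions for the example, and the input is taken to be already tokenized and stemmed.

```python
from collections import Counter, defaultdict


def build_index(corpus):
    """First pass: map each token to its postings, a list of (doc_id, frequency).

    corpus is a dict of doc_id -> list of tokens.
    """
    index = defaultdict(list)
    for doc_id, tokens in corpus.items():
        for token, freq in Counter(tokens).items():
            index[token].append((doc_id, freq))
    return dict(index)


def summarize(index):
    """Second pass: (token, number of docs, total frequency), most frequent first."""
    rows = [
        (token, len(postings), sum(freq for _, freq in postings))
        for token, postings in index.items()
    ]
    rows.sort(key=lambda row: (-row[2], row[0]))  # ties broken alphabetically
    return rows
```

For a corpus `{1: ["car", "walk", "car"], 2: ["car", "child"]}`, the posting list for "car" is `[(1, 2), (2, 1)]`, and it appears first in the summary with 2 documents and a total frequency of 3.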

Fine Points
Dynamic corpora
Higher-resolution data (e.g., character position)
Giving extra weight to proxy text (typically by doubling or tripling the frequency count)
Document-type-specific processing
  – In HTML, want to ignore tags
  – In , maybe want to ignore quoted material
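Extra weight for proxy text can be added at counting time. In the sketch below the title is treated as the proxy text, and each title occurrence is counted `title_weight` times on top of the body counts; the function name, the parameter, and the choice to add the weighted title counts to the body counts are assumptions for illustration.

```python
from collections import Counter


def weighted_frequencies(title_tokens, body_tokens, title_weight=2):
    """Count body tokens once and title tokens `title_weight` times each."""
    counts = Counter(body_tokens)
    for token, n in Counter(title_tokens).items():
        counts[token] += n * title_weight
    return counts
```

With a weight of 3, a word that occurs once in the title and twice in the body gets a frequency of 5 instead of 3.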

Choosing Keywords
Don’t necessarily want to index on every word
  – Takes more space for index
  – Takes more processing time
  – May not improve our resolving power
How do we choose keywords?
  – Manually
  – Statistically
Exhaustivity vs specificity

Manually Choosing Keywords
Unconstrained vocabulary: allow the creator of the document to choose whatever he/she wants
  – “best” match
  – captures new terms easily
  – easiest for person choosing keywords
Constrained vocabulary: hand-crafted ontologies
  – can include hierarchical and other relations
  – more consistent
  – easier for searching; possible “magic bullet” search

Examples of Constrained Vocabularies
ACM headings
Medline Subject Headings

Automated Vocabulary Selection
Frequency: Zipf’s Law
  – Within one corpus, words with middle frequencies are typically “best”
Document-oriented representation bias: lots of keywords per document
Query-oriented representation bias: only the “most typical” words. Assumes that we are comparing across documents.
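Zipf's Law says, roughly, that a word's frequency in a corpus is inversely proportional to its frequency rank, so a few words are extremely common and most are rare. A simple way to act on the "middle frequencies" heuristic is to rank tokens by count and keep those between two cutoffs. The sketch below assumes hypothetical function names and leaves the cutoffs to the caller, since suitable values depend on the corpus.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, token, frequency) triples, most frequent first."""
    counts = Counter(tokens)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, tok, freq) for rank, (tok, freq) in enumerate(ranked, start=1)]


def middle_frequency_terms(tokens, min_count, max_count):
    """Keep tokens whose corpus frequency lies in [min_count, max_count]."""
    counts = Counter(tokens)
    return sorted(tok for tok, c in counts.items() if min_count <= c <= max_count)
```

Under Zipf's Law the product of rank and frequency stays roughly constant across the table returned by `rank_frequency`, which gives a quick check on a real corpus before choosing cutoffs.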

Choosing Keywords
“Best” depends on actual use; a word that occurs in only one document may be very good for retrieving that document, but is not very effective overall.
Words that have no resolving power within a corpus may be the best choices across corpora.
Not very important for web searching; will be more relevant for text mining.

Keyword Choice for WWW
We don’t have a fixed corpus of documents
New terms appear fairly regularly, and are likely to be common search terms
Queries that people want to make are wide-ranging and unpredictable
Therefore: can’t limit keywords, except possibly to eliminate stop words.
Even stop words are language-dependent, so determine the language first.