Information Retrieval: Document Parsing

Basic indexing pipeline
Documents to be indexed: "Friends, Romans, countrymen."
→ Tokenizer → token stream: Friends Romans Countrymen
→ Linguistic modules → modified tokens: friend roman countryman
→ Indexer → inverted index.
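As a concrete illustration of the pipeline above, here is a minimal Python sketch (my own, not from the lecture; all names are illustrative) that wires a tokenizer, a simple normalization step, and an indexer into a dictionary-and-postings inverted index:

```python
import re
from collections import defaultdict

def tokenize(text):
    # Tokenizer: turn the character sequence into a token stream.
    return re.findall(r"[A-Za-z]+", text)

def normalize(token):
    # Linguistic modules: case folding plus a toy stemming step that
    # strips a trailing 's' (a real system would use a Porter stemmer).
    token = token.lower()
    return token[:-1] if token.endswith("s") else token

def build_index(docs):
    # Indexer: map each term to the sorted list of docIDs that contain it.
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for tok in tokenize(text):
            postings[normalize(tok)].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

docs = ["Friends, Romans, countrymen.", "So let it be with Caesar."]
index = build_index(docs)
print(index["friend"])   # -> [0], the posting list of the normalized term
```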

Parsing a document
What format is it in? pdf / word / excel / html?
What language is it in?
What character set is in use? Plain ASCII, UTF-8, UTF-16, …
Each of these is a classification problem, with many complications…
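For the character-set question in particular, a crude heuristic sketch (my own, not the lecture's method) is to try candidate encodings from strictest to most permissive; real systems treat this, like format and language detection, as a trained classification problem:

```python
def guess_encoding(raw: bytes):
    # Try a few candidate character sets, strictest first; a production
    # system would instead use a statistical classifier over the bytes.
    for enc in ("ascii", "utf-8", "utf-16", "latin-1"):
        try:
            raw.decode(enc)
            return enc
        except (UnicodeDecodeError, UnicodeError):
            continue
    return None

print(guess_encoding("Università di Pisa".encode("utf-8")))   # -> 'utf-8'
```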

Tokenization: Issues
Chinese and Japanese have no spaces between words, so a unique tokenization is not always guaranteed.
Dates and amounts appear in multiple formats.
Japanese example, mixing Katakana, Hiragana, Kanji and "Romaji": フォーチュン 500 社は情報不足のため時間あた… $500K( 約 6,000 万円 ) (roughly: "Fortune 500 companies, due to lack of information, per hour…"; $500K ≈ 60 million yen).
What about DNA sequences? ACCCGGTACGCAC...
Definition of tokens → what you can search!!
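The closing point, that the definition of tokens determines what you can search, can be made concrete with a small sketch of my own: the same string produces different searchable units under different tokenizers (the regular expression and the q-gram length are illustrative choices).

```python
import re

text = "Fortune-500 firms spent $500K (about 6,000 man-hours)."

# Whitespace tokenizer: punctuation sticks to the words.
print(text.split())

# Pattern tokenizer: keeps amounts like $500K and 6,000 as single tokens.
pattern = r"\$\d+[KM]?|\d+(?:,\d{3})*|\w+"
print(re.findall(pattern, text))

# For DNA, fixed-length q-grams are a common token definition.
dna = "ACCCGGTACGCAC"
print([dna[i:i+4] for i in range(len(dna) - 3)])
```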

Case folding
Reduce all letters to lower case.
Possible exception: tokens in upper case in mid-sentence, e.g., General Motors, or USA vs. usa.
"Morgen will ich in MIT …" — after lowercasing, is "mit" the institute MIT or the German word "mit" ("with")?
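A minimal sketch of case folding with the mid-sentence exception (the heuristic and its handling of the sample sentence are my own simplification, not the lecture's rule):

```python
def case_fold(tokens):
    # Lowercase every token except acronyms/proper nouns that appear
    # upper-case in mid-sentence (a crude heuristic for the exception
    # mentioned on the slide).
    folded = []
    for i, tok in enumerate(tokens):
        mid_sentence = i > 0 and not tokens[i - 1].endswith((".", "!", "?"))
        if tok.isupper() and mid_sentence:
            folded.append(tok)          # keep e.g. "USA", "MIT"
        else:
            folded.append(tok.lower())
    return folded

print(case_fold("Morgen will ich in MIT bleiben .".split()))
# -> ['morgen', 'will', 'ich', 'in', 'MIT', 'bleiben', '.']
```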

Stemming
Reduce terms to their "roots"; language dependent.
English example: automate(s), automatic, automation are all reduced to automat.
Italian example: casa, casalinga, casata, casamatta, casolare, casamento, casale, rincasare, case are all reduced to cas.

Porter's algorithm
The commonest algorithm for stemming English: conventions plus 5 phases of reductions, applied sequentially, where each phase consists of a set of commands.
Sample convention: of the rules in a compound command, select the one that applies to the longest suffix.
Sample rules: sses → ss, ies → i, ational → ate, tional → tion.
Full morphological analysis brings only a modest extra benefit!!
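To give the flavor of such commands, here is a sketch restricted to the four suffix rules quoted above, together with the longest-suffix convention (the real Porter stemmer has five full phases and measure-based conditions):

```python
# Suffix rules from the slide, as (suffix, replacement) pairs.
RULES = [("sses", "ss"), ("ies", "i"), ("ational", "ate"), ("tional", "tion")]

def apply_rules(word):
    # Convention: among the rules whose suffix matches, pick the longest one.
    matching = [(s, r) for s, r in RULES if word.endswith(s)]
    if not matching:
        return word
    suffix, repl = max(matching, key=lambda sr: len(sr[0]))
    return word[: -len(suffix)] + repl

for w in ("caresses", "ponies", "relational", "conditional"):
    print(w, "->", apply_rules(w))
# caresses -> caress, ponies -> poni, relational -> relate, conditional -> condition
```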

Thesauri
Handle synonyms and homonyms through hand-constructed equivalence classes, e.g., car = automobile; (Italian) macchina = automobile = spider.
A thesaurus is a list of words important for a given domain; for each word it specifies a list of correlated words (usually synonyms, polysemous words, or phrases for complex concepts).
Co-occurrence pattern: BT (broader term) and NT (narrower term), e.g., Vehicle (BT) → Car → Fiat 500 (NT).
How to use it in a search engine??
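One possible answer to "how to use it in a search engine" is query expansion; the following sketch, with a tiny hand-built thesaurus of my own, adds synonyms and (optionally) narrower terms to the query before it is matched against the index:

```python
# Hand-constructed thesaurus: synonym classes plus narrower terms (NT).
SYNONYMS = {"car": {"car", "automobile"}, "automobile": {"car", "automobile"}}
NARROWER = {"vehicle": {"car"}, "car": {"fiat 500"}}

def expand_query(terms, use_narrower=False):
    expanded = set()
    for t in terms:
        expanded |= SYNONYMS.get(t, {t})        # synonym expansion
        if use_narrower:
            expanded |= NARROWER.get(t, set())  # drill down to narrower terms
    return expanded

print(expand_query(["car"]))                          # {'car', 'automobile'}
print(expand_query(["vehicle"], use_narrower=True))   # {'vehicle', 'car'}
```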

Dmoz Directory

Yahoo! Directory

Information Retrieval: Statistical Properties of Documents

Statistical properties of texts
Tokens are not distributed uniformly: they follow the so-called "Zipf Law".
Few tokens are very frequent, a middle-sized set has medium frequency, and many are rare.
The 100 most frequent tokens account for about 50% of the text, and many of them are stopwords.

The Zipf Law, in detail
The k-th most frequent term has frequency approximately proportional to 1/k; equivalently, the product of the frequency f of a token and its rank r is almost a constant:
f = c · |T| / r, i.e. r · f = c · |T|.
General law: f = c · |T| / r^z, for some exponent z > 1.
Under the general law, the sum of the frequencies after the k-th element is ≤ f_k · k / (z − 1), while for the initial top elements the frequency is roughly a constant.
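The claim that r · f is almost constant can be checked empirically with a short sketch of my own (corpus.txt stands for any large plain-text file):

```python
import re
from collections import Counter

text = open("corpus.txt", encoding="utf-8").read()   # corpus.txt: any large text file
freqs = Counter(re.findall(r"[a-z]+", text.lower()))

# Zipf predicts rank * frequency to be roughly constant for the top terms.
for rank, (term, f) in enumerate(freqs.most_common(10), start=1):
    print(rank, term, f, rank * f)
```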

An example of “Zipf curve”

Zipf’s law log-log plot

Consequences of the Zipf Law
There exist a few very frequent tokens that do not discriminate: these are the so-called "stop words". English: to, from, on, and, the, ... Italian: a, per, il, in, un, …
There exist many tokens that occur only once in a text and are therefore poor discriminators (and may simply be errors). English: Calpurnia. Italian: Precipitevolissimevolmente (or a typo such as "paklo").
The words with medium frequency are the ones that discriminate.
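These consequences suggest pruning the vocabulary by frequency band, as in the sketch below (the thresholds stop_k and min_freq are illustrative assumptions, not values from the lecture): drop the most frequent terms as stop-word candidates and the terms occurring once, keeping the medium-frequency words.

```python
from collections import Counter

def medium_frequency_terms(tokens, stop_k=100, min_freq=2):
    # Keep the discriminating band: drop the stop_k most frequent terms
    # (stop-word candidates) and the terms occurring fewer than min_freq
    # times (hapax legomena, possibly typos).
    freqs = Counter(tokens)
    stop_words = {t for t, _ in freqs.most_common(stop_k)}
    return {t for t, f in freqs.items() if f >= min_freq and t not in stop_words}
```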

Other statistical properties of texts
The number of distinct tokens grows according to the so-called "Heaps Law": it is proportional to |T|^β, where β < 1 (typically β ≈ 0.5).
Hence the token length is Θ(log |T|).
The interesting words are the ones with medium frequency (Luhn).
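The Heaps exponent β can be estimated from a corpus by sampling the vocabulary size V(n) after the first n tokens and fitting log V = log k + β · log n by least squares; the sketch below is my own illustration (corpus.txt is a placeholder for any large text file):

```python
import math, re

def heaps_fit(tokens, step=1000):
    # Sample (log n, log V(n)) pairs while scanning the token stream;
    # assumes the stream is long enough to yield at least two samples.
    seen, points = set(), []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if n % step == 0:
            points.append((math.log(n), math.log(len(seen))))
    # Least-squares slope of log V = log k + beta * log n.
    m = len(points)
    sx = sum(x for x, _ in points); sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points); sxy = sum(x * y for x, y in points)
    return (m * sxy - sx * sy) / (m * sxx - sx * sx)

text = open("corpus.txt", encoding="utf-8").read()       # hypothetical corpus file
print(heaps_fit(re.findall(r"[a-z]+", text.lower())))    # typically ~0.4-0.6
```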

Frequency vs. Term significance (Luhn)