Modern Information Retrieval Chapter 7: Text Processing

Overview

1. Document pre-processing
   1. Lexical analysis
   2. Stopword elimination
   3. Stemming
   4. Index-term selection
   5. Thesauri
2. Text compression
   1. Statistical methods
   2. Huffman coding
   3. Dictionary methods
   4. Ziv-Lempel compression

Document Preprocessing

- Document pre-processing is the process of incorporating a new document into an information retrieval system.
- The goal is to
  – Represent the document efficiently in terms of both space (for storing the document) and time (for processing retrieval requests) requirements.
  – Maintain good retrieval performance (precision and recall).
- Document pre-processing is a complex process that leads to the representation of each document by a select set of index terms.
- However, some Web search engines are giving up on much of this process and index all (or virtually all) of the words in a document.

Document Preprocessing (cont.)

- Document pre-processing includes 5 stages:
  1. Lexical analysis
  2. Stopword elimination
  3. Stemming
  4. Index-term selection
  5. Construction of thesauri

Lexical analysis

- Objective: Determine the words of the document.
- Lexical analysis separates the input alphabet into
  – Word characters (e.g., the letters a-z)
  – Word separators (e.g., space, newline, tab)
- The following decisions may have an impact on retrieval:
  – Digits: Used to be ignored, but the trend now is to identify numbers (e.g., telephone numbers) and mixed strings as words.
  – Punctuation marks: Usually treated as word separators.
  – Hyphens: Should we interpret "pre-processing" as "pre processing" or as "preprocessing"?
  – Letter case: Often ignored, but then a search for "First Bank" (a specific bank) would retrieve a document with the phrase "Bank of America was the first bank to offer its customers…".
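
The following minimal sketch (not from the original slides) illustrates this kind of lexical analysis; the character classes, the decision to keep digits, and the case folding are assumptions made for the example.

```python
import re

def tokenize(text):
    """Split text into words: letters and digits count as word characters,
    everything else (space, newline, punctuation) acts as a separator.
    Case is folded, so "First Bank" and "first bank" become identical;
    hyphens separate, so "pre-processing" yields "pre" and "processing"."""
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenize("Pre-processing the First Bank report, 2024 edition."))
# ['pre', 'processing', 'the', 'first', 'bank', 'report', '2024', 'edition']
```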

Stopword elimination

- Objective: Filter out words that occur in most of the documents.
- Such words have no value for retrieval purposes.
- These words are referred to as stopwords. They include
  – Articles (a, an, the, …)
  – Prepositions (in, on, of, …)
  – Conjunctions (and, or, but, if, …)
  – Pronouns (I, you, them, it, …)
  – Possibly some verbs, nouns, adverbs, adjectives (make, thing, similar, …)
- A typical stopword list may include several hundred words.
- As seen earlier, the 100 most frequent words add up to about 50% of the words in a document.
- Hence, stopword elimination reduces the size of the indexing structures.
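
A minimal stopword-elimination sketch, assuming a toy stopword list (real lists contain several hundred entries):

```python
STOPWORDS = {"a", "an", "the", "in", "on", "of", "and", "or", "but",
             "if", "i", "you", "them", "it"}   # toy list, illustrative only

def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords(["the", "bank", "of", "america", "was",
                        "the", "first", "bank"]))
# ['bank', 'america', 'was', 'first', 'bank']
```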

Stemming

- Objective: Replace all the variants of a word with the single stem of the word.
- Variants include plurals, gerund forms (ing-forms), third-person suffixes, past-tense suffixes, etc.
- Example: connect: connects, connected, connecting, connection, …
- All have similar semantics and relate to a single concept.
- In parallel, stemming must be performed on the user query.

Stemming (cont.)

- Stemming improves
  – Storage and search efficiency: fewer terms are stored.
  – Recall: without stemming, a query about "connection" matches only documents that contain "connection". With stemming, the query is about "connect" and additionally matches documents that originally had "connects", "connected", "connecting", etc.
- However, stemming may hurt precision, because users can no longer target just a particular form.
- Stemming may be performed using
  – Algorithms that strip off suffixes according to substitution rules.
  – Large dictionaries that provide the stem of each word.
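
A sketch of the suffix-substitution approach; the rule list is illustrative (a real stemmer such as Porter's algorithm uses many more rules plus conditions on the remaining stem):

```python
# Illustrative substitution rules, checked longest-first.
RULES = [("ections", "ect"), ("ection", "ect"), ("ecting", "ect"),
         ("ected", "ect"), ("ects", "ect"),
         ("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]

def stem(word):
    """Return the first rule-based reduction of the word, or the word itself."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[:len(word) - len(suffix)] + replacement
    return word

print([stem(w) for w in ["connects", "connected", "connecting", "connection"]])
# ['connect', 'connect', 'connect', 'connect']
```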

Index term selection (indexing)

- Objective: Increase efficiency by extracting from the resulting document a selected set of terms to be used for indexing the document.
  – If a full-text representation is adopted, then all words are used for indexing.
- Indexing is a critical process: a user's ability to find documents on a particular subject is limited by the indexing process having created index terms for this subject.
- Indexing can be done manually or automatically.
- Historically, manual indexing was performed by professional indexers associated with library organizations.
- However, automatic indexing is more common now (or, with full-text representations, indexing is altogether avoided).

Indexing (cont.)

- Relative advantages of manual indexing:
  – Ability to perform abstractions (conclude what the subject is) and determine additional related terms.
  – Ability to judge the value of concepts.
- Relative advantages of automatic indexing:
  – Reduced cost: once the initial hardware cost is amortized, the operational cost is cheaper than wages for human indexers.
  – Reduced processing time.
  – Improved consistency.
- Controlled vocabulary: Index terms must be selected from a predefined set of terms (the domain of the index).
  – Use of a controlled vocabulary helps standardize the choice of terms.
  – Searching is improved, because users know the vocabulary being used.
  – Thesauri can compensate for the lack of a controlled vocabulary.

Indexing (cont.)

- Index exhaustivity: the extent to which concepts are indexed.
  – Should we index only the most important concepts, or also more minor concepts?
- Index specificity: the preciseness of the index terms used.
  – Should we use general indexing terms or more specific terms?
  – Should we use the term "computer", "personal computer", or "Gateway E-3400"?
- Main effect:
  – High exhaustivity improves recall (decreases precision).
  – High specificity improves precision (decreases recall).
- Related issues:
  – Index the title and abstract only, or the entire document?
  – Should index terms be weighted?

Indexing (cont.)

- Reducing the size of the index:
  – Recall that articles, prepositions, conjunctions, and pronouns have already been removed through a stopword list.
  – Recall that the 100 most frequent words account for 50% of all word occurrences.
  – Words that are very infrequent (occur only a few times in a collection) are often removed, under the assumption that they would probably not be in the user's vocabulary.
  – Reduction not based on probabilistic arguments: nouns are often preferred over verbs, adjectives, or adverbs.
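
A hedged sketch of this kind of frequency-based term selection; the thresholds (min_df, max_df_ratio) and the function name are assumptions for illustration:

```python
def select_index_terms(doc_tokens, doc_freq, num_docs,
                       min_df=2, max_df_ratio=0.5):
    """Keep terms that are neither extremely rare in the collection
    (document frequency below min_df) nor extremely common
    (present in more than max_df_ratio of all documents)."""
    selected = set()
    for term in set(doc_tokens):
        df = doc_freq.get(term, 0)
        if df >= min_df and df / num_docs <= max_df_ratio:
            selected.add(term)
    return selected

# doc_freq maps each term to the number of documents containing it.
print(select_index_terms(["bank", "merger", "zyzzyva", "the"],
                         {"bank": 120, "merger": 35, "zyzzyva": 1, "the": 990},
                         num_docs=1000))
# {'bank', 'merger'}   ('zyzzyva' is too rare, 'the' too common)
```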

Indexing (cont.)

- Indexing may also assign weights to terms.
- Non-weighted indexing:
  – No attempt to determine the value of the different terms assigned to a document.
  – Not possible to distinguish between major topics and casual references.
  – All retrieved documents are equal in value.
  – Typical of commercial systems through the 1980s.
- Weighted indexing:
  – An attempt is made to place a value on each term as a description of the document.
  – This value is related to the frequency of occurrence of the term in the document (higher is better), but also to the number of collection documents that use this term (lower is better).
  – Query weights and document weights are combined into a value describing the likelihood that a document matches a query.
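
The weighting described above (higher for terms frequent within the document, lower for terms used across many documents) is the idea behind tf-idf; a minimal sketch, assuming the standard tf × log(N/df) form rather than any specific system's variant:

```python
import math

def tf_idf(tf, df, num_docs):
    """tf: occurrences of the term in the document,
    df: number of collection documents containing the term,
    num_docs: collection size N.  Weight = tf * log(N / df)."""
    if df == 0:
        return 0.0
    return tf * math.log(num_docs / df)

# A term occurring 5 times in a document, in 100 of 10,000 documents:
print(round(tf_idf(5, 100, 10_000), 2))   # 23.03
```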

Thesauri

- Objective: Standardize the index terms that were selected.
- In its simplest form, a thesaurus is
  – A list of "important" words (concepts).
  – For each word, an associated list of synonyms.
- A thesaurus may be generic (cover all of English) or concentrate on a particular domain of knowledge.
- The role of a thesaurus in information retrieval:
  – Provide a standard vocabulary for indexing.
  – Help users locate proper query terms.
  – Provide hierarchies for automatic broadening or narrowing of queries.
- Here, our interest is in providing a standard vocabulary (a controlled vocabulary).
- Essentially, in this final stage, each indexing term is replaced by the concept that defines its thesaurus class.
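
A minimal sketch of this final normalization step, using a made-up two-concept thesaurus fragment:

```python
# Made-up thesaurus fragment: each concept (thesaurus class) lists its synonyms.
THESAURUS = {
    "automobile": ["car", "auto", "motorcar"],
    "physician":  ["doctor", "medic"],
}

# Invert it so every synonym maps to the concept naming its class.
TERM_TO_CONCEPT = {syn: concept
                   for concept, syns in THESAURUS.items()
                   for syn in syns + [concept]}

def normalize(index_terms):
    """Replace each index term by its thesaurus concept, if it has one."""
    return [TERM_TO_CONCEPT.get(t, t) for t in index_terms]

print(normalize(["car", "doctor", "bank"]))
# ['automobile', 'physician', 'bank']
```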

Text Compression

- Data encoding: transform encoding units (characters, words, etc.) into code values.
  – The objective is either to reduce size (compression) or to hide contents (encryption).
- Lossless encoding: the transformation is reversible; the original file can be recovered from the encoded (compressed, encrypted) file.
- Compression ratio:
  – S: size of the uncompressed file.
  – C: size of the compressed file.
  – Compression ratio = C/S.
  – Example: S = 300,000 bytes, C = 100,000 bytes. Compression ratio: 100,000/300,000 = 0.33.

Text Compression (cont.)

- Advantages of compression:
  – Reduction in storage size.
  – Reduction in transmission time.
  – Reduction in processing times (e.g., searching).
- Disadvantages:
  – Requires time for compression/decompression.
  – Processing of compressed text is more complex.
- Specific to information retrieval:
  – Decompression time is often more critical than compression time. Unlike transmission-motivated compression (modems), documents in an information retrieval system are encoded once and decoded many times.
  – Prefer compression techniques that allow searching in the compressed file (without decompressing it).

Text compression (cont.)

Basic methods:
- Statistical methods:
  – Estimate the probability of occurrence of each encoding unit (character or word) in the alphabet.
  – Assign codes to units: more frequent units are assigned shorter codes.
  – In information retrieval, word encoding is preferred over character encoding.
- Dictionary methods:
  – Substitute a phrase (a string of units) by a pointer to a dictionary or to a previous occurrence of the phrase.
  – Compression is achieved because the pointer is shorter than the phrase.

Statistical methods

- Recall from the discussion of information theory:
  – Assume a message from an alphabet of n symbols.
  – Assume that the probability of the i-th symbol is p_i.
  – The average information content (entropy) is:
    E = Σ_i p_i · log2(1/p_i) = −Σ_i p_i · log2(p_i)
  – Optimal encoding is achieved when a symbol with probability p_i is assigned a code whose length is log2(1/p_i) = −log2(p_i).
  – Hence, E also represents the optimal average code length (measured in bits per character).
  – Therefore, E is the lower bound on compression.
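
A small sketch that evaluates this lower bound for a given distribution (the example distributions are made up):

```python
import math

def entropy(probabilities):
    """E = sum_i p_i * log2(1 / p_i), in bits per symbol."""
    return sum(p * math.log2(1.0 / p) for p in probabilities if p > 0)

# A uniform 10-symbol alphabet needs log2(10) ≈ 3.32 bits per symbol;
# a skewed distribution can be coded in fewer bits on average.
print(round(entropy([0.1] * 10), 2))          # 3.32
print(entropy([0.5, 0.25, 0.125, 0.125]))     # 1.75
```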

Statistical methods (cont.)

- Statistical methods must first estimate the frequencies of the encoding units, and then assign codes based on these frequencies.
- Approaches:
  – Static: Use a single distribution for all texts. Fast, but not optimal, because different texts exhibit different distributions. The encoding table is stored in the application (not in the text). Decompression can start at any point in the file.
  – Dynamic: Determine the frequencies in a preliminary pass. Excellent compression, but a total of two passes is required. The encoding table is stored at the beginning of the text. Decompression can start at any point in the file.
  – Adaptive: Progressively learn the distribution of the text while compressing; each character is encoded on the basis of the preceding characters in the text. Fast, and close to optimal compression. Decompression must start from the beginning.

Huffman coding

General:
- Huffman coding is one of the best-known compression techniques (1952).
- It is used in the Unix programs pack/unpack.
- It is a statistical method based on variable-length codes.
- Compression is achieved by assigning shorter codes to more frequent units.
- Decoding is unambiguous because no code is the prefix of another.
- Encoding units may be either bytes or words.
- Does not exploit the dependencies between the encoding units.
- Yields the optimum average code length when these units are independent.
- Can be used with the static, dynamic, and adaptive approaches.

Huffman coding (cont.)

Method:
1. Build a table of the encoding units and their frequencies (probabilities).
2. Combine the two least frequent units into a new unit whose probability is the sum of the two probabilities.
3. Repeat this process until the entire dictionary is represented by a root whose probability is 1.0.
4. When there is a tie for the two least frequent units, any tie-breaking procedure is acceptable.

(Diagram on the original slide: Unit1: p1 and Unit2: p2 are merged into a new unit with probability p1+p2.)
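
A compact sketch of this procedure using a priority queue; the function name and the output format (a {symbol: bit-string} table rather than an explicit tree) are choices made for the example:

```python
import heapq
from collections import Counter

def huffman_code(freqs):
    """Build a Huffman code table from {symbol: frequency} by repeatedly
    merging the two least frequent nodes until a single root remains."""
    # Heap entries: (frequency, tie_breaker, [(symbol, code_so_far), ...])
    heap = [(f, i, [(sym, "")]) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                       # degenerate one-symbol alphabet
        return {heap[0][2][0][0]: "0"}
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # least frequent
        f2, _, right = heapq.heappop(heap)   # second least frequent
        merged = [(s, "0" + c) for s, c in left] + \
                 [(s, "1" + c) for s, c in right]
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return dict(heap[0][2])

text = "this is an example of huffman coding"
freqs = Counter(text)
codes = huffman_code(freqs)
avg = sum(freqs[s] * len(c) for s, c in codes.items()) / len(text)
print(codes, round(avg, 2))    # frequent symbols receive the shortest codes
```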

Huffman coding (cont.)

Example: (worked example shown as a figure on the original slide)

Huffman coding (cont.)

Example (cont.):
- The resulting code: (code table shown on the original slide).
- Average code length: 3.05 bits.
- The entropy (compression lower bound) is given on the original slide.
- A fixed code length would have required log2 10 = 3.32 bits (which, in practice, would require 4 bits).
- Compression ratio: C/S = 3.05/3.32 = 0.92.

Huffman coding (cont.)

- Example: when the letters A-Z are encoded this way:
  – Code lengths are between 3 and 10 bits.
  – The average code length is 4.12 bits.
  – A fixed code would have required log2 26 = 4.70 bits (i.e., 5 bits).
- More compression is obtained by encoding words:
  – When the 800 most frequent English words (a small table!) are encoded with this method (all other words remain in plain ASCII), 40-50% compression has been reported.
- Huffman codes are prefix codes:
  – No code is the beginning (prefix) of another code.
  – Hence, a left-to-right decoding operation is unique.
  – It is possible to search the compressed text.
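
A short sketch of that left-to-right decoding, using an illustrative prefix-free code table (not the one from the slides):

```python
def prefix_decode(bits, code_table):
    """Decode a bit string left to right; because no code is a prefix of
    another, the first code matched at each position is the only match."""
    decode_table = {code: sym for sym, code in code_table.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in decode_table:
            out.append(decode_table[current])
            current = ""
    return "".join(out)

table = {"a": "0", "b": "10", "c": "11"}      # illustrative prefix-free code
print(prefix_decode("0101100", table))        # 'abcaa'
```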

Dictionary methods

- Dictionary methods construct a dictionary of phrases, and replace their occurrences with dictionary pointers.
- The choice of phrases may be static, dynamic, or adaptive.
- A simple method (digrams):
  – Construct a dictionary of pairs of letters that occur together frequently (e.g., ou, ea, ch, …).
  – If n such pairs are used, a pointer (location in the dictionary) requires log2 n bits.
  – At each step in the encoding, the next pair is examined. If it corresponds to a dictionary pair, it is replaced by its encoding, and the encoding position moves by 2 characters. Otherwise, the single-character encoding is kept, and the position moves by one character.
  – To assure that decoding is unambiguous, an extra bit is needed to indicate whether the next unit is a single-character code or a digram code.
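
A hedged sketch of the digram scheme just described; the digram list and the tagged-tuple output (standing in for the flag bit plus a fixed-width code) are assumptions for illustration:

```python
DIGRAMS = ["ou", "ea", "ch", "th", "er", "in"]        # illustrative pairs
INDEX = {d: i for i, d in enumerate(DIGRAMS)}

def digram_encode(text):
    """Emit ('D', dictionary_index) when the next two characters form a
    known digram (advance by 2), otherwise ('C', character) (advance by 1).
    The 'D'/'C' tag plays the role of the extra disambiguating bit."""
    out, i = [], 0
    while i < len(text):
        if text[i:i + 2] in INDEX:
            out.append(("D", INDEX[text[i:i + 2]]))
            i += 2
        else:
            out.append(("C", text[i]))
            i += 1
    return out

print(digram_encode("the weather"))
# [('D', 3), ('C', 'e'), ('C', ' '), ('C', 'w'), ('D', 1), ('D', 3), ('D', 4)]
```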

Ziv-Lempel compression

General:
- The Ziv-Lempel method (1977) uses a single-pass adaptive scheme.
- While compressing, it constructs a dictionary from phrases encountered so far.
- Many popular programs (Unix compress/uncompress, GNU gzip/gunzip, and Windows WinZip) are based on the Ziv-Lempel algorithm.
- Compression is slightly better than Huffman codes (C/S of 45% vs. 55%).
- Disadvantage for information retrieval: the compressed file cannot be searched, and decoding cannot start at a random place in the file.

Ziv-Lempel compression (cont.)

Compression:
1. Initialize the dictionary to contain all "phrases" of length one.
2. Examine the input stream and search for the longest phrase which has appeared in the dictionary.
3. Encode this phrase by its index in the dictionary.
4. Add the phrase followed by the next symbol in the input stream to the dictionary.
5. Go to Step 2.
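
A minimal sketch of these steps (an LZW-style variant of the Ziv-Lempel idea; emitting plain integer indices is an assumption, real coders pack them into fixed- or variable-width bit fields):

```python
def lz_compress(data):
    """Start the dictionary with all length-one phrases, then repeatedly
    emit the index of the longest phrase already in the dictionary and
    add that phrase plus the next symbol as a new dictionary entry."""
    dictionary = {sym: i for i, sym in enumerate(sorted(set(data)))}
    out, phrase = [], ""
    for ch in data:
        if phrase + ch in dictionary:
            phrase += ch                               # extend the match
        else:
            out.append(dictionary[phrase])             # emit longest match
            dictionary[phrase + ch] = len(dictionary)  # learn a new phrase
            phrase = ch
    if phrase:
        out.append(dictionary[phrase])
    return out

print(lz_compress("abababababababababab"))
# [0, 1, 2, 4, 3, 6, 5, 5]  -- 8 codes for 20 characters; longer and
# longer phrases are reused as they are learned
```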

Ziv-Lempel compression (cont.)

Example: (the step-by-step trace is shown on the original slide)
- Assume a dictionary of 16 phrases (4-bit index).
- This case does not result in compression:
  – Source: 25 characters in a 2-character alphabet require a total of 25 bits.
  – Output: 13 pointers of 4 bits require a total of 52 bits.
- This is because the length of the input data in this example is too short.
- In practice, the Lempel-Ziv algorithm works well only when the input data is sufficiently large and there is sufficient redundancy in the data.