
Text Operations: Preprocessing and Compression

Introduction
Document preprocessing
–to improve the precision of the documents retrieved
–lexical analysis, stopword elimination, stemming, index term selection, thesauri
–building a thesaurus
Text compression
–to improve the efficiency of the retrieval process
–statistical methods vs. dictionary methods
–inverted file compression

Document Preprocessing
Lexical analysis of the text
–digits, hyphens, punctuation marks, the case of letters
Elimination of stopwords
–filtering out words that are useless for retrieval purposes
Stemming
–dealing with the syntactic variations of query terms
Index term selection
–determining the terms to be used as index terms
Thesauri
–expansion of the original query with related terms

The Process of Preprocessing
[Diagram: documents flow through structure recognition, accent and spacing normalization, stopword removal, noun-group detection, stemming, and manual indexing, yielding the document structure, the full text, and the index terms.]

Lexical Analysis of the Text
Four particular cases
Numbers
–usually not good index terms because of their vagueness
–need some advanced lexical analysis procedure
–ex) 510B.C., 2000/2/12, ...
Hyphens
–breaking up hyphenated words might be useful
–ex) state-of-the-art → state of the art (Good!)
–but B-49 → B 49 (???)
–need to adopt a general rule and to specify exceptions on a case-by-case basis

Lexical Analysis of the Text
Punctuation marks
–removed entirely
–ex) 510B.C → 510BC: if the query contains '510B.C', removing the dot in both the query term and the documents will not affect retrieval performance
–require the preparation of a list of exceptions
–ex) val.id → valid (???)
The case of letters
–convert all the text to either lower or upper case
–part of the semantics might be lost
–ex) Northwestern University → northwestern university (???)
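
As a rough illustration of these cases, here is a toy normalization function in Python; the rules and the regular expression are illustrative only, since real systems rely on per-case exception lists as noted above.

    import re

    # Toy lexical-normalization sketch covering case folding, hyphens, and
    # punctuation; illustrative only (no exception lists).
    def normalize(token):
        token = token.lower()                  # case folding
        token = token.replace("-", " ")        # break up hyphenated words
        token = re.sub(r"[^\w\s]", "", token)  # strip punctuation marks
        return token

    print(normalize("State-of-the-Art"))  # 'state of the art'
    print(normalize("510B.C."))           # '510bc'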

Elimination of Stopwords
Basic concept
–filtering out words with very low discrimination values
–ex) a, the, this, that, where, when, ...
Advantage
–reduce the size of the indexing structure considerably
Disadvantage
–might reduce recall as well
–ex) "to be or not to be"
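
A minimal stopword-filtering sketch in Python; the stopword list is illustrative, and the last line shows how the query "to be or not to be" vanishes entirely.

    # Minimal stopword-elimination sketch; the stopword list is illustrative only.
    STOPWORDS = {"a", "the", "this", "that", "where", "when", "to", "be", "or", "not"}

    def remove_stopwords(tokens):
        # Keep only tokens that do not appear in the stopword list.
        return [t for t in tokens if t.lower() not in STOPWORDS]

    print(remove_stopwords(["the", "rose", "is", "a", "rose"]))  # ['rose', 'is', 'rose']
    print(remove_stopwords("to be or not to be".split()))        # [] -- recall is lost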

Stemming
What is the "stem"?
–the portion of a word which is left after the removal of its affixes (i.e., prefixes and suffixes)
–ex) 'connect' is the stem for the variants 'connected', 'connecting', 'connection', 'connections'
Effect of stemming
–reduce variants of the same root to a common concept
–reduce the size of the indexing structure
–controversy about the benefits of stemming
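
A naive suffix-stripping sketch; real stemmers such as the Porter stemmer use far more careful rule sets, and the suffix list here is illustrative only.

    # Naive suffix-stripping sketch; the suffix list is illustrative only.
    SUFFIXES = ("ions", "ion", "ing", "ed", "s")

    def stem(word):
        for suffix in SUFFIXES:
            # Strip the first matching suffix, keeping at least a 3-letter stem.
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[: -len(suffix)]
        return word

    print([stem(w) for w in ["connected", "connecting", "connection", "connections"]])
    # ['connect', 'connect', 'connect', 'connect']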

Index Term Selection
Index term selection
–not all words are equally significant for representing the semantics of a document
Manual selection
–selection of index terms is usually done by a specialist
Automatic selection of index terms
–most of the semantics is carried by the nouns
–clustering nouns which appear nearby in the text into a single indexing component (or concept)
–ex) computer science

Thesauri
What is the "thesaurus"?
–a list of important words in a given domain of knowledge
–a set of related words derived from a synonymity relationship
–a controlled vocabulary for indexing and searching
Main purposes
–provide a standard vocabulary for indexing and searching
–assist users with locating terms for proper query formulation
–provide classified hierarchies that allow broadening and narrowing of the current query request

Thesauri
Thesaurus index terms
–denote a concept, which is the basic semantic unit
–can be individual words, groups of words, or phrases
–ex) building, teaching, ballistic missiles, body temperature
–frequently, it is necessary to complement a thesaurus entry with a definition or an explanation
–ex) seal (marine animals), seal (documents)
Thesaurus term relationships
–mostly composed of synonyms and near-synonyms
–BT (Broader Term), NT (Narrower Term), RT (Related Term)
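
To make the BT/NT/RT relationships concrete, here is a tiny thesaurus sketch in Python; the entries and the broaden() helper are made up for illustration.

    # Tiny thesaurus sketch with broader (BT), narrower (NT), and related (RT)
    # term links; entries and helper are illustrative only.
    THESAURUS = {
        "ballistic missiles": {"BT": ["missiles"], "NT": [], "RT": ["rockets"]},
        "missiles": {"BT": ["weapons"], "NT": ["ballistic missiles"], "RT": []},
    }

    def broaden(term):
        # Replace a query term by its broader terms, if the thesaurus has any.
        return THESAURUS.get(term, {}).get("BT") or [term]

    print(broaden("ballistic missiles"))  # ['missiles'] -- broadens the query
    print(broaden("body temperature"))    # ['body temperature'] -- no entry, unchanged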

Text Operations: Coding / Compression Methods

Text Compression
Motivation
–finding ways to represent the text in fewer bits
–reducing costs associated with space requirements, I/O overhead, communication delays
–obstacle: the need for IR systems to access text randomly; to access a given word in some forms of compressed text, the entire text must be decoded from the beginning until the desired word is reached
Two strategies
–statistical methods
–dictionary methods

Statistical Methods
Basic concepts
–Modeling: a probability is estimated for each symbol
–Coding: a code is assigned to each symbol based on the model
–shorter codes are assigned to the most likely symbols
Relationship between probabilities and codes
–Source code theorem (Claude Shannon): a symbol that occurs with probability p should be assigned a code of length log2(1/p) bits
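
A quick numeric illustration of the source code theorem, using an illustrative probability distribution:

    import math

    # Ideal code lengths log2(1/p) for an illustrative symbol distribution.
    probabilities = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
    for symbol, p in probabilities.items():
        print(symbol, math.log2(1 / p), "bits")
    # a 1.0 bits, b 2.0 bits, c 3.0 bits, d 3.0 bits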

Statistical Methods
Compression models
–adaptive model: progressively learns the statistical distribution as the compression process goes on; decompression of a file has to start from its beginning
–static model: assumes an average distribution for all input texts; poor compression ratios when the data deviates from the initial distribution assumptions
–semi-static model: learns a distribution in a first pass, then compresses the data in a second pass using a fixed code derived from the distribution learned; information on the data distribution must be stored

Statistical Methods
Word-based compression model
–takes words instead of characters as symbols
Reasons to use this model in an IR context
–much better compression rates
–words carry a lot of meaning in natural languages, and their distribution is much more related to the semantic structure of the text than the distribution of individual letters is
–words are the atoms on which most IR systems are built
–word frequencies are useful in answering queries involving combinations of words
–the best strategy is to start with the least frequent words first

Statistical Methods
Coding
–the task of obtaining the representation of a symbol based on a probability distribution given by a model
–main goal: assign short codes to likely symbols and long codes to unlikely ones
Two statistical coding strategies
–Huffman coding: a variable-length encoding in bits for each symbol; relatively fast, allows random access
–Arithmetic coding: uses an interval of real numbers between 0 and 1; much slower, does not allow random access

Huffman Coding
Building a Huffman tree
–for each symbol of the alphabet, create a node containing the symbol and its probability
–the two nodes with the smallest probabilities become children of a newly created parent node
–the parent node is assigned a probability equal to the sum of the probabilities of its two children
–the operation is repeated, ignoring nodes that are already children, until only one node remains
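
A minimal Python sketch of this construction over word symbols, using the phrase from the next slide; tie-breaking is arbitrary, so the resulting bit strings may differ from the author's tree.

    import heapq
    from collections import Counter

    # Minimal Huffman-tree construction following the steps above; illustrative only.
    def huffman_codes(symbols):
        counts = Counter(symbols)
        # Heap entries are (probability, tie-breaker, tree); a tree is either a
        # symbol or a pair (left, right).
        heap = [(c / len(symbols), i, s) for i, (s, c) in enumerate(counts.items())]
        heapq.heapify(heap)
        tick = len(heap)
        while len(heap) > 1:
            p1, _, left = heapq.heappop(heap)                     # two smallest probabilities...
            p2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (p1 + p2, tick, (left, right)))  # ...become children of a new node
            tick += 1

        codes = {}
        def walk(tree, prefix=""):
            if isinstance(tree, tuple):   # internal node: recurse into both children
                walk(tree[0], prefix + "0")
                walk(tree[1], prefix + "1")
            else:                         # leaf: record the accumulated bit string
                codes[tree] = prefix or "0"
        walk(heap[0][2])
        return codes

    print(huffman_codes("for each rose , a rose is a rose".split()))
    # 'rose' (the most frequent symbol) receives one of the shortest codes;
    # exact bit strings depend on how ties are broken.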

Author's Huffman Coding
Example: "for each rose, a rose is a rose"
[Tree diagram: a Huffman tree with leaves for the symbols rose, is, ", ", a, each, and for, together with the encoded message.]

Better Huffman Coding?
Example: "for each rose, a rose is a rose"
[Tree diagram: an alternative Huffman tree for the same symbols.]

Author's Canonical Huffman Coding
–the height of the left subtree is never shorter than that of the right subtree
–S: an ordered sequence of pairs (x_i, y_i), one for each level in the tree, where x_i = the number of symbols at that level and y_i = the numerical value of the first symbol at that level
[Tree diagram: the canonical Huffman tree for "for each rose, a rose is a rose"]
S = ((1, 1), (1, 1), (0, ∞), (4, 0))

Byte-Oriented Huffman Coding
–the tree has a branching factor of 256
–ensure no empty nodes in the higher levels of the tree
–number of bottom-level elements = 1 + ((v - 256) mod 255); for example, a vocabulary of v = 1000 symbols gives 1 + (744 mod 255) = 235 bottom-level elements
Characteristics
–decompression is faster than for plain Huffman coding
–compression ratios are better than for the Ziv-Lempel family of codings
–allows direct searching on compressed text

Dictionary Methods
Basic concepts
–replace groups of consecutive symbols with a pointer to an entry in a dictionary
–the pointer representations are references to entries in a dictionary composed of a list of symbols that are expected to occur frequently
–pointers to the dictionary entries are chosen so that they need less space than the phrases they replace
–there is no separate modeling and coding step
–there are no explicit probabilities associated with phrases

Dictionary Methods
Static dictionary methods
–selected pairs of letters are replaced with codewords
–ex) Digram coding: at each step the next two characters are inspected to verify whether they correspond to a digram in the dictionary; if so, they are coded together and the coding position is shifted by two characters; otherwise, the single character is represented by its normal code and the position is shifted by one character
–main problem: a dictionary might be suitable for one text but unsuitable for another
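
A small digram-coding sketch in Python; the digram dictionary and the output format are illustrative only.

    # Digram-coding sketch: pairs of characters found in a small static dictionary
    # are replaced by a single codeword.
    DIGRAMS = {"th": 0, "he": 1, "in": 2, "er": 3}

    def digram_encode(text):
        out, i = [], 0
        while i < len(text):
            pair = text[i:i + 2]
            if pair in DIGRAMS:
                out.append(("D", DIGRAMS[pair]))  # digram coded together, shift by two
                i += 2
            else:
                out.append(("C", text[i]))        # single character, shift by one
                i += 1
        return out

    print(digram_encode("the winter"))
    # [('D', 0), ('C', 'e'), ('C', ' '), ('C', 'w'), ('D', 2), ('C', 't'), ('D', 3)]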

Dictionary Methods
Adaptive dictionary methods
–Ziv-Lempel: replaces strings of characters with a reference to a previous occurrence of the string
–if the pointer to an earlier occurrence of a string can be stored in fewer bits than the string it replaces, compression is achieved

Ziv-Lempel Code
Characteristics
–identifying each text segment the first time it appears and then simply pointing back to this first occurrence rather than repeating the segment
–an adaptive model of coding, with increasingly long text segments encoded as the text is scanned
–require less space than the repeated text segments
–higher compression than the Huffman codes
–codes of roughly 4 bits per character

Ziv-Lempel Code
LZ77 (Gzip encoding)
–the code consists of a set of triples (a, b, c)
–a: identifies how far back in the decoded text to look for the upcoming text segment
–b: tells how many characters to copy for the upcoming segment
–c: a new character to add to complete the next segment

Ziv-Lempel Code
An example (decoding)
p → pe → pet → peter → peter_ → peter_pi → peter_piper → peter_piper_pic → peter_piper_pick → peter_piper_picked → ...
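
A sketch of decoding LZ77-style triples in Python; the triples below reproduce the step-by-step progression shown above, one step per triple, but they are a reconstruction rather than the encoding on the original slide.

    # LZ77 decoding sketch: each triple (a, b, c) means "go back a characters,
    # copy b characters, then append the new character c".  The triples are an
    # illustrative encoding of the example text, not the author's.
    def lz77_decode(triples):
        text = ""
        for back, length, char in triples:
            start = len(text) - back
            for k in range(length):        # character-by-character copy (may overlap)
                text += text[start + k]
            text += char
        return text

    triples = [(0, 0, "p"), (0, 0, "e"), (0, 0, "t"), (2, 1, "r"), (0, 0, "_"),
               (6, 1, "i"), (8, 2, "r"), (6, 3, "c"), (0, 0, "k"), (7, 1, "d")]
    print(lz77_decode(triples))  # peter_piper_picked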

Dictionary Methods
Adaptive dictionary methods
–disadvantage compared to the Huffman method: no random access; decoding cannot start in the middle of a compressed file
Dictionary schemes are popular for their speed and low memory use, but statistical methods are more common in an IR environment

Comparing Text Compression Techniques

                              Arithmetic   Character Huffman   Word Huffman   Ziv-Lempel
Compression ratio             very good    poor                very good      good
Compression speed             slow         fast                fast           very fast
Decompression speed           slow         fast                very fast      very fast
Memory space                  low          low                 high           moderate
Compressed pattern matching   no           yes                 yes            yes
Random access                 no           yes                 yes            no