Text Operations
Prepared by: Loay Alayadhi (425121605)
Supervised by: Dr. Mourad Ykhlef

Document Preprocessing
- Lexical analysis
- Elimination of stopwords
- Stemming of the remaining words
- Selection of index terms
- Construction of term categorization structures (thesaurus)

Logical View of a Document
[Figure] A document passes through text operations in stages: structure recognition (text + structure), handling of accents, spacing, etc., stopword elimination, noun-group detection, and stemming, followed by automatic or manual indexing. The logical view of the document ranges from the full text (with structure) down to a set of index terms.

1) Lexical Analysis of the Text
Lexical analysis converts an input stream of characters into a stream of words. Its major objective is the identification of the words in the text. How?
- Digits: ignoring numbers is a common approach.
- Hyphens: e.g., "state-of-the-art".
- Punctuation marks: remove them. Exception: "510B.C".
- Case: usually not significant; fold to lower case.
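
To make these rules concrete, here is a minimal tokenizer sketch in Python; the regular expression and the example sentence are illustrative assumptions, not part of the original slides:

```python
import re

def tokenize(text):
    """Minimal lexical analyzer: keeps hyphenated words and B.C.-style
    abbreviations, drops bare numbers, strips other punctuation, folds case."""
    # A word is a run of letters/digits, optionally joined by '-' or '.'
    candidates = re.findall(r"[A-Za-z0-9]+(?:[-.][A-Za-z0-9]+)*", text)
    tokens = []
    for tok in candidates:
        if tok.isdigit():           # rule: ignore bare numbers
            continue
        tokens.append(tok.lower())  # rule: fold case
    return tokens

print(tokenize("The state-of-the-art method, built in 510B.C, used 300 stones."))
# ['the', 'state-of-the-art', 'method', 'built', 'in', '510b.c', 'used', 'stones']
```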

2) Elimination of Stopwords
- Words that appear too often are not useful for IR.
- Stopwords: words that appear in more than 80% of the documents in the collection are considered stopwords and are filtered out as potential index words.
- Problem: how do you search for "to be or not to be"?
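
Continuing the sketch, stopword elimination is a simple set-membership filter; the stopword list below is a tiny assumed sample, not a standard list:

```python
STOPWORDS = {"the", "a", "an", "of", "in", "to", "be", "or", "not"}  # assumed sample

def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords(["to", "be", "or", "not", "to", "be"]))
# [] -- the whole query vanishes, which is exactly the "to be or not to be" problem
```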

3) Stemming
- Stem: the portion of a word that is left after the removal of its affixes (i.e., prefixes and suffixes).
- Example: "connect" is the stem of connected, connecting, connection, connections.
- Removal strategies:
  - affix removal: intuitive, simple
  - table lookup
  - successor variety
  - n-gram
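
As a sketch of the affix-removal strategy, here is a toy suffix-stripping stemmer in Python; the suffix list is an assumption for illustration (real systems use, e.g., the Porter stemmer):

```python
SUFFIXES = ["ions", "ion", "ing", "ed", "s"]  # assumed sample, longest first

def stem(word):
    """Naive affix removal: strip the longest matching suffix,
    keeping at least a 3-character stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["connected", "connecting", "connection", "connections"]:
    print(w, "->", stem(w))   # all four reduce to "connect"
```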

4) Index Terms Selection
- Motivation: a sentence is usually composed of nouns, pronouns, articles, verbs, adjectives, adverbs, and connectives. Most of the semantics is carried by the nouns.
- Identification of noun groups: a noun group is a set of nouns whose syntactic distance in the text does not exceed a predefined threshold (see the sketch below).
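
A sketch of noun-group identification under this definition, assuming the text has already been part-of-speech tagged (the tag set, the threshold value, and the sample sentence are assumptions):

```python
def noun_groups(tagged_tokens, threshold=2):
    """Group nouns whose distance in the token stream is <= threshold.

    tagged_tokens: list of (word, pos) pairs; pos == 'N' marks a noun.
    Returns a list of noun groups (lists of words).
    """
    groups, current, last_pos = [], [], None
    for i, (word, pos) in enumerate(tagged_tokens):
        if pos != "N":
            continue
        if last_pos is not None and i - last_pos > threshold:
            groups.append(current)   # gap too large: close the group
            current = []
        current.append(word)
        last_pos = i
    if current:
        groups.append(current)
    return groups

tagged = [("the", "D"), ("computer", "N"), ("science", "N"),
          ("department", "N"), ("is", "V"), ("very", "A"),
          ("large", "A"), ("campus", "N")]
print(noun_groups(tagged))
# [['computer', 'science', 'department'], ['campus']]
```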

5) Thesaurus Construction
- Thesaurus: a precompiled list of important words in a given domain of knowledge; for each word in this list, there is a set of related words.
- A controlled vocabulary for indexing and searching.
- Why? Normalization of indexing concepts, reduction of noise, identification of terms, etc.

The Purpose of a Thesaurus
- To provide a standard vocabulary for indexing and searching
- To assist users with locating terms for proper query formulation
- To provide classified hierarchies that allow the broadening and narrowing of the current query request

Thesaurus (cont.)
- Not like a common dictionary, which lists words with their explanations.
- May contain all words in a language, or only words in a specific domain.
- Carries a lot of other information, especially the relationships between words:
  - classification of words in the language
  - word relationships such as synonyms and antonyms

Roget's Thesaurus example:
cowardly, adjective (Arabic gloss: جبان, "coward")
Ignobly lacking in courage: "cowardly turncoats"
Synonyms: chicken (slang), chicken-hearted, craven, dastardly, faint-hearted, gutless, lily-livered, pusillanimous, unmanly, yellow (slang), yellow-bellied (slang)
http://www.thesaurus.com
http://www.dictionary.com/

Thesaurus Term Relationships
- BT: broader term
- NT: narrower term
- RT: related term (non-hierarchical, but related)

Use of a Thesaurus
- Indexing: select the most appropriate thesaurus entries for representing the document.
- Searching: design the most appropriate search strategy.
  - If the search does not retrieve enough documents, the thesaurus can be used to expand the query.
  - If the search retrieves too many items, the thesaurus can suggest more specific search vocabulary.
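
A minimal sketch of thesaurus-based query expansion and narrowing; the relation names BT/NT/RT follow the slides, while the toy thesaurus data is assumed for illustration:

```python
# Toy thesaurus: term -> {relation: [terms]}  (data is illustrative only)
THESAURUS = {
    "car": {"BT": ["vehicle"], "NT": ["sedan", "coupe"], "RT": ["automobile"]},
}

def expand_query(terms, relations=("RT", "BT")):
    """Broaden a query by adding related (RT) and broader (BT) terms."""
    expanded = list(terms)
    for t in terms:
        for rel in relations:
            expanded.extend(THESAURUS.get(t, {}).get(rel, []))
    return expanded

def narrow_query(terms):
    """Suggest more specific vocabulary via narrower terms (NT)."""
    return [nt for t in terms for nt in THESAURUS.get(t, {}).get("NT", [])]

print(expand_query(["car"]))   # ['car', 'automobile', 'vehicle']
print(narrow_query(["car"]))   # ['sedan', 'coupe']
```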

Document Clustering
- Document clustering: the operation of grouping together similar documents into classes.
- Global: clusters the whole collection; a one-time operation performed at compile time.
- Local: clusters the results of a specific query; performed at runtime, with each query.

Text Compression
Why is text compression important?
- Less storage space
- Less time for data transmission
- Less time to search (if the compression method allows direct search without decompression)

Terminology
- Symbol: the smallest unit for compression (e.g., a character, a word, or a fixed number of characters)
- Alphabet: the set of all possible symbols
- Compression ratio: the size of the compressed file as a fraction of the uncompressed file

Types of compression models
Static models:
- Assume some data properties in advance (e.g., relative frequencies of symbols) for all input text
- Allow direct (or random) access
- Poor compression ratios when the input text deviates from the assumption

Types of compression models
Semi-static models:
- Learn the data properties in a first pass; compress the input data in a second pass
- Allow direct (or random) access
- Good compression ratio
- Must store the learned data properties for decoding
- Must have the whole data at hand

Types of compression models
Adaptive models:
- Start with no information and progressively learn the data properties as the compression process goes on
- Need only one pass for compression
- Do not allow random access: decompression cannot start in the middle of the file

General approaches to text compression
- Dictionary methods:
  - (basic) dictionary method
  - Ziv-Lempel's adaptive method
- Statistical methods:
  - arithmetic coding
  - Huffman coding

Dictionary methods
Replace a sequence of symbols with a pointer to a dictionary entry. A fixed dictionary may be suitable for one text but unsuitable for another.
[Figure] The slide's example compresses the input "aaababbbaaabaaaaaaabaabb" to the output "babbabaa" using a dictionary with the entries "aaa" and "bb".
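
A minimal sketch of greedy dictionary compression in Python; the dictionary entries follow the slide's figure, with single symbols added as a fallback, and the output here is a list of dictionary indices rather than the slide's exact encoding:

```python
DICTIONARY = ["aaa", "bb", "a", "b"]  # assumed entries; single symbols as fallback

def dict_compress(text):
    """Greedy dictionary compression: at each position, replace the longest
    matching dictionary entry with its index."""
    out, i = [], 0
    while i < len(text):
        best = max((e for e in DICTIONARY if text.startswith(e, i)), key=len)
        out.append(DICTIONARY.index(best))
        i += len(best)
    return out

def dict_decompress(codes):
    return "".join(DICTIONARY[c] for c in codes)

codes = dict_compress("aaababbbaaabaaaaaaabaabb")
assert dict_decompress(codes) == "aaababbbaaabaaaaaaabaabb"
print(codes)  # 14 codes for the 24 input characters
```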

Adaptive Ziv-Lempel coding
Instead of dictionary entries, pointers point to the previous occurrences of symbols in the text itself. Running example input: "aaababbbaaabaaaaaaabaabb" (parsed on the next slide).

Adaptive Ziv-Lempel coding
The input "aaababbbaaabaaaaaaabaabb" is parsed into numbered phrases, each being the longest previously seen phrase extended by one new symbol:

Phrase:  a  | aa | b  | ab | bb | aaa | ba | aaaa | aab | aabb
Number:  1  | 2  | 3  | 4  | 5  | 6   | 7  | 8    | 9   | 10
Output:  0a | 1a | 0b | 1b | 3b | 2a  | 3a | 6a   | 2b  | 9b

Each output pair "ic" means: phrase number i followed by the new symbol c (0 denotes the empty phrase).
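
This parsing (LZ78 style) is easy to sketch in Python; the function below reproduces the pairs shown above and is a minimal illustration, not a production compressor:

```python
def lz78_encode(text):
    """LZ78 parsing: each phrase = longest previously seen phrase + one new
    symbol. Emits (phrase_index, symbol) pairs; index 0 is the empty phrase."""
    phrases = {"": 0}            # phrase -> index
    out, current = [], ""
    for ch in text:
        if current + ch in phrases:
            current += ch        # keep extending the match
        else:
            out.append((phrases[current], ch))
            phrases[current + ch] = len(phrases)  # register the new phrase
            current = ""
    if current:                  # flush a final, already-seen phrase
        out.append((phrases[current[:-1]], current[-1]))
    return out

print(lz78_encode("aaababbbaaabaaaaaaabaabb"))
# [(0,'a'), (1,'a'), (0,'b'), (1,'b'), (3,'b'), (2,'a'), (3,'a'), (6,'a'), (2,'b'), (9,'b')]
```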

Adaptive Ziv-Lempel coding
- Good compression ratio (about 4 bits/character)
- Suitable for general data compression and widely used (e.g., zip, compress)
- Does not allow decoding to start in the middle of a compressed file; direct access is impossible without decompressing from the beginning

Arithmetic coding
- The input text (data) is converted to a real number between 0 and 1, such as 0.328701
- Good compression ratio (about 2 bits/character)
- Slow
- Cannot start decoding in the middle of a file
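
A toy sketch of the interval-narrowing idea behind arithmetic coding, using exact fractions; the three-symbol model is an assumption, and real coders use incremental integer arithmetic rather than unbounded fractions:

```python
from fractions import Fraction

# Assumed model: symbol -> (cumulative_low, cumulative_high)
MODEL = {"a": (Fraction(0), Fraction(1, 2)),
         "b": (Fraction(1, 2), Fraction(3, 4)),
         "c": (Fraction(3, 4), Fraction(1))}

def arithmetic_encode(text):
    """Narrow [low, high) once per symbol; any number inside the
    final interval identifies the whole message."""
    low, high = Fraction(0), Fraction(1)
    for ch in text:
        s_low, s_high = MODEL[ch]
        width = high - low
        low, high = low + width * s_low, low + width * s_high
    return low, high

low, high = arithmetic_encode("aba")
print(float(low), float(high))  # any real number in [0.25, 0.3125) encodes "aba"
```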

Symbols and alphabet for textual data
- Words are more appropriate symbols for natural language text.
- Example: "for each rose, a rose is a rose"
  - Alphabet: {a, each, for, is, rose, space, ","}
- If we always assume a single space after a word (unless there is another separator), the alphabet reduces to {a, each, for, is, rose, ","}.

Huffman coding
Assign shorter codes (fewer bits) to more frequent symbols and longer codes (more bits) to less frequent symbols.
Example: "for each rose, a rose is a rose"

Example
Symbol frequencies for "for each rose, a rose is a rose":

  symbol  freq
  each    1
  ,       1
  for     1
  is      1
  a       2
  rose    3

[Figure] The Huffman tree is built by repeatedly merging the two nodes of lowest frequency: the four frequency-1 leaves pair up into two internal nodes of weight 2, which merge into a node of weight 4; "a" (2) and "rose" (3) merge into a node of weight 5; the final merge joins these two into the root (weight 9).

Reading the codes off the tree:

  symbol  freq  code
  each    1     100
  ,       1     101
  for     1     110
  is      1     111
  a       2     00
  rose    3     01

Total encoding length: 3×2 + 2×2 + 4×(1×3) = 22 bits.
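
The construction takes only a few lines with a priority queue. Here is a sketch in Python for the same frequencies; ties may be broken differently than in the slide, giving different but equally optimal codes:

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build a Huffman tree with a min-heap and return {symbol: code}."""
    tick = count()  # tie-breaker so heapq never compares tree nodes
    heap = [(f, next(tick), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # two lowest-frequency nodes
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tick), (left, right)))
    _, _, tree = heap[0]

    codes = {}
    def walk(node, code):
        if isinstance(node, tuple):          # internal node: recurse
            walk(node[0], code + "0")
            walk(node[1], code + "1")
        else:
            codes[node] = code or "0"        # single-symbol edge case
        return codes
    return walk(tree, "")

freqs = {"each": 1, ",": 1, "for": 1, "is": 1, "a": 2, "rose": 3}
codes = huffman_codes(freqs)
print(codes)
print(sum(freqs[s] * len(c) for s, c in codes.items()), "bits total")  # 22
```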

Canonical tree
- The height of the left subtree of any node is never smaller than the height of the right subtree.
- All leaves are in increasing order of probabilities (frequencies) from left to right.
[Figure] The example Huffman tree redrawn in canonical form.

Advantages of canonical tree
- Smaller data for decoding:
  - A non-canonical tree needs a mapping table between symbols and codes.
  - A canonical tree needs only a (sorted) list of symbols plus, for each level, a pair: the number of symbols at that level and the numerical value of the first code on that level. E.g., {(0, NA), (2, 2), (4, 0)}.
- More efficient encoding/decoding.
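
A sketch of assigning canonical codes from code lengths alone, following the slide's convention that the deepest level starts at code 0 (the (count, first code) pairs then match the slide's example for the occupied levels); the symbol ordering within a level is an assumption:

```python
def canonical_codes(lengths):
    """Assign canonical codes from code lengths: the deepest level starts
    at code 0; each shallower level starts at
    (first_code_below + count_below) >> 1."""
    max_len = max(lengths.values())
    # Symbols per level, sorted for a deterministic ordering
    levels = {l: sorted(s for s, L in lengths.items() if L == l)
              for l in range(1, max_len + 1)}
    first = {max_len: 0}
    for l in range(max_len - 1, 0, -1):
        first[l] = (first[l + 1] + len(levels[l + 1])) >> 1
    codes = {}
    for l, syms in levels.items():
        for i, sym in enumerate(syms):
            codes[sym] = format(first[l] + i, f"0{l}b")
    return codes

lengths = {"a": 2, "rose": 2, "each": 3, ",": 3, "for": 3, "is": 3}
print(canonical_codes(lengths))
# {'a': '10', 'rose': '11', ',': '000', 'each': '001', 'for': '010', 'is': '011'}
```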

Byte-oriented Huffman coding
Use whole bytes instead of single bits for the code, i.e., a tree of degree 256.
[Figure] The slide contrasts, for a 512-symbol alphabet, a non-optimal tree that leaves 254 empty slots at the root and puts two full groups of 256 symbols at the second level, against an optimal tree that places 254 symbols directly below the root and only 258 symbols at the second level (one group of 256 and one of 2, alongside 254 empty nodes).

Comparison of methods
Summarizing the figures quoted in the preceding slides: arithmetic coding compresses best (about 2 bits/character) but is slow; Ziv-Lempel achieves about 4 bits/character; neither allows decoding to start in the middle of a file, while semi-static (byte-oriented) Huffman coding allows direct access and searching of the compressed text.

Compression of inverted files
An inverted file is composed of:
- a vector containing all distinct words in the text collection (the vocabulary), and
- for each word, a list of the documents in which it occurs.
Types of code for compressing the document lists: unary, Elias-γ (gamma), Elias-δ (delta), Golomb.
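
A sketch of unary and Elias-γ coding applied to the document-gap representation of an inverted list; the sample document list is an assumption:

```python
def unary(n):
    """Unary code: n-1 ones followed by a zero (n >= 1)."""
    return "1" * (n - 1) + "0"

def elias_gamma(n):
    """Elias-gamma code: unary code of the binary length of n, followed by
    n's binary digits without the leading 1 (n >= 1)."""
    binary = bin(n)[2:]
    return unary(len(binary)) + binary[1:]

# Assumed inverted list for some word: store gaps between document IDs,
# which are small and therefore cheap to code.
docs = [3, 5, 20, 21, 23, 76]
gaps = [docs[0]] + [b - a for a, b in zip(docs, docs[1:])]  # [3, 2, 15, 1, 2, 53]
print("".join(elias_gamma(g) for g in gaps))
```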

Conclusions
- Text transformation: meaning instead of strings
  - lexical analysis, stopwords, stemming
- Text compression
  - searchable, with random access
  - model + coding
- Inverted files

Thanks! Any questions?