Text Operations
Prepared by: Loay Alayadhi 425121605
Supervised by: Dr. Mourad Ykhlef
Document Preprocessing
- Lexical analysis
- Elimination of stopwords
- Stemming of the remaining words
- Selection of index terms
- Construction of term categorization structures (thesaurus)
Logical View of a Document
[Figure: pipeline from full text to index terms — text structure recognition, lexical analysis (accents, spacing, etc.), stopword removal, noun groups, stemming, and automatic or manual indexing.]
1) Lexical Analysis of the Text
Lexical analysis converts the input stream of characters into a stream of words; its major objective is the identification of the words in the text. How?
- Digits: ignoring numbers is a common approach.
- Hyphens: e.g., state-of-the-art.
- Punctuation marks: remove them. Exception: 510B.C.
- Case of letters: usually folded to a single case.
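The rules above can be sketched as a minimal tokenizer. This is an illustration only, not the lexical analyzer of any particular IR system; the regex-based rules (drop digits, split on hyphens and punctuation, fold to lower case) are assumptions chosen to match the bullets:

```python
import re

def tokenize(text):
    """Minimal lexical analysis: lower-case, drop digits, split on
    hyphens/punctuation, keep alphabetic words only."""
    text = text.lower()                   # case folding
    text = re.sub(r"\d+", " ", text)      # ignore numbers
    tokens = re.findall(r"[a-z]+", text)  # hyphens and punctuation act as separators
    return tokens
```

For example, `tokenize("State-of-the-art systems, in 2024!")` yields the words state, of, the, art, systems, in. A real system would need extra rules for exceptions such as 510B.C.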
2) Elimination of Stopwords
Words that appear too often are not useful for IR. Words that appear in more than 80% of the documents in the collection are treated as stopwords and filtered out as potential index terms.
Problem: how do we search for "to be or not to be"?
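Stopword filtering itself is a simple set lookup. The stopword list below is a tiny invented sample (real lists contain hundreds of words); note how it makes the query "to be or not to be" disappear entirely, which is exactly the problem the slide raises:

```python
# Toy stopword list (illustration only; real lists are much larger)
STOPWORDS = {"a", "an", "and", "be", "in", "is", "not", "of", "or", "the", "to"}

def remove_stopwords(tokens):
    """Filter out tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]
```

`remove_stopwords("to be or not to be".split())` returns an empty list: every term of the query is a stopword, so nothing is left to search on.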
3) Stemming
A stem is the portion of a word that is left after the removal of its affixes (i.e., prefixes or suffixes).
Example: connect, connected, connecting, connection, connections.
Removal strategies:
- affix removal: intuitive, simple
- table lookup
- successor variety
- n-gram
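A toy affix-removal stemmer, just to make the idea concrete. This is not the Porter algorithm; the suffix list and the minimum-stem-length rule are invented for illustration:

```python
# Suffixes tried longest-first so "connections" loses "ions", not just "s"
SUFFIXES = ["ions", "ing", "ion", "ed", "s"]

def stem(word):
    """Toy suffix-stripping stemmer (illustration only, not Porter's algorithm).
    A suffix is removed only if a stem of at least 3 letters remains."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[:-len(suf)]
    return word
```

All five example words (connect, connected, connecting, connection, connections) map to the single stem "connect", while a short word like "sing" is left untouched by the length guard.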
4) Index Term Selection
Motivation: a sentence is usually composed of nouns, pronouns, articles, verbs, adjectives, adverbs, and connectives, but most of the semantics is carried by the nouns.
Identification of noun groups: a noun group is a set of nouns whose syntactic distance in the text does not exceed a predefined threshold.
5) Thesaurus Construction
A thesaurus is a precompiled list of important words in a given domain of knowledge, together with a set of related words for each word in the list. It provides a controlled vocabulary for indexing and searching.
Why? Normalization, concept indexing, noise reduction, identification, etc.
The Purpose of a Thesaurus
- To provide a standard vocabulary for indexing and searching
- To assist users with locating terms for proper query formulation
- To provide classified hierarchies that allow broadening and narrowing of the current query request
Thesaurus (cont.)
A thesaurus is not like a common dictionary, which lists words with their explanations. It may contain all the words of a language, or only the words of a specific domain, together with additional information, especially the relationships between words: a classification of words in the language, and relations such as synonyms and antonyms.
Roget's Thesaurus example
cowardly, adjective (Arabic gloss: "coward"): ignobly lacking in courage: cowardly turncoats.
Syns: chicken (slang), chicken-hearted, craven, dastardly, faint-hearted, gutless, lily-livered, pusillanimous, unmanly, yellow (slang), yellow-bellied (slang)
http://www.thesaurus.com
http://www.dictionary.com/
Thesaurus Term Relationships
- BT: broader term
- NT: narrower term
- RT: related term (non-hierarchical)
Use of a Thesaurus
Indexing: select the most appropriate thesaurus entries for representing the document.
Searching: design the most appropriate search strategy.
- If the search does not retrieve enough documents, the thesaurus can be used to expand the query.
- If the search retrieves too many items, the thesaurus can suggest more specific search vocabulary.
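Query expansion with related terms can be sketched in a few lines. The thesaurus below is a tiny invented sample, and `expand_query` is a hypothetical helper, not part of any real IR library:

```python
# Toy thesaurus: term -> related terms (RT) usable to broaden a query
THESAURUS = {
    "cowardly": ["craven", "faint-hearted", "gutless"],
    "rose": ["flower"],
}

def expand_query(terms, thesaurus):
    """Broaden a query by appending related thesaurus terms (no duplicates)."""
    expanded = list(terms)
    for t in terms:
        for related in thesaurus.get(t, []):
            if related not in expanded:
                expanded.append(related)
    return expanded
```

A query for ["cowardly", "turncoats"] would be broadened with craven, faint-hearted, and gutless; the narrowing direction would instead replace a term with a more specific (NT) one.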
Document Clustering
Document clustering is the operation of grouping together similar documents in classes.
Global vs. local:
- Global: clusters the whole collection; a one-time operation performed at compile time.
- Local: clusters the results of a specific query; performed at runtime, with each query.
Text Compression
Why is text compression important?
- Less storage space
- Less time for data transmission
- Less time to search (if the compression method allows direct search without decompression)
Terminology
- Symbol: the smallest unit for compression (e.g., a character, a word, or a fixed number of characters)
- Alphabet: the set of all possible symbols
- Compression ratio: the size of the compressed file as a fraction of the uncompressed file
Types of compression models
Static models:
- Assume some data properties in advance (e.g., relative frequencies of symbols) for all input texts
- Allow direct (or random) access
- Give poor compression ratios when the input text deviates from the assumptions
Types of compression models
Semi-static models:
- Learn the data properties in a first pass and compress the input data in a second pass
- Allow direct (or random) access
- Good compression ratio
- Must store the learned data properties for decoding
- Must have the whole data at hand
Types of compression models
Adaptive models:
- Start with no information and progressively learn the data properties as the compression process goes on
- Need only one pass for compression
- Do not allow random access: decompression cannot start in the middle
General approaches to text compression
Dictionary methods:
- (Basic) dictionary method
- Ziv-Lempel's adaptive method
Statistical methods:
- Arithmetic coding
- Huffman coding
Dictionary methods
Replace a sequence of symbols with a pointer to a dictionary entry.
Example: with dictionary entries aaa and bb, the input aaababbbaaabaaaaaaabaabb compresses to babbabaa.
A fixed dictionary may be suitable for one text but unsuitable for another.
Adaptive Ziv-Lempel coding
Instead of dictionary entries, pointers point to previous occurrences of symbols: the input is parsed into phrases, each extending a previously seen phrase by one symbol.
Example: aaababbbaaabaaaaaaabaabb is parsed as
  a | aa | b | ab | bb | aaa | ba | aaaa | aab | aabb   (phrases 1-10)
and encoded as
  0a | 1a | 0b | 1b | 3b | 2a | 3a | 6a | 2b | 9b
where each pair is (index of the longest previously seen phrase, next symbol), and index 0 denotes the empty phrase.
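The parse above is the LZ78 variant of Ziv-Lempel coding; a minimal sketch of the compressor and decompressor:

```python
def lz78_compress(text):
    """LZ78: parse text into phrases, emitting (phrase index, next char) pairs.
    Index 0 is the empty phrase."""
    dictionary = {"": 0}   # phrase -> index
    output = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch                        # extend the current match
        else:
            output.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)  # new phrase
            phrase = ""
    if phrase:  # leftover input that matched an existing phrase exactly
        output.append((dictionary[phrase[:-1]], phrase[-1]))
    return output

def lz78_decompress(pairs):
    """Rebuild the text by replaying the phrase dictionary."""
    phrases = [""]
    out = []
    for idx, ch in pairs:
        phrase = phrases[idx] + ch
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)
```

Running `lz78_compress` on the example string reproduces exactly the ten pairs of the slide, 0a through 9b.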
Adaptive Ziv-Lempel coding (cont.)
- Good compression ratio (about 4 bits/character)
- Suitable for general data compression and widely used (e.g., zip, compress)
- Does not allow decoding to start in the middle of a compressed file: direct access is impossible without decompressing from the beginning
Arithmetic coding
- The input text is converted to a real number between 0 and 1, such as 0.328701
- Good compression ratio (about 2 bits/character)
- Slow
- Cannot start decoding in the middle of a file
Symbols and alphabet for textual data
Words are more appropriate symbols for natural language text.
Example: "for each rose, a rose is a rose"
Alphabet: {a, each, for, is, rose, ','}
A single space is assumed after each word unless another separator is present, so spaces need not be coded as symbols.
Huffman coding
Assign shorter codes (fewer bits) to more frequent symbols and longer codes to less frequent symbols.
Example: "for each rose, a rose is a rose"
Example
Symbol frequencies: each 1, ',' 1, for 1, is 1, a 2, rose 3.
The Huffman tree is built by repeatedly merging the two lowest-frequency nodes: (each, ',') -> 2 and (for, is) -> 2, which merge into a node of weight 4; (a, rose) -> 5; finally 4 + 5 = 9, the root.
Resulting codes:
  symbol  freq  code
  each    1     100
  ,       1     101
  for     1     110
  is      1     111
  a       2     00
  rose    3     01
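A compact sketch of Huffman code construction using a heap (the resulting tree may differ from the slide's in tie-breaking, but any Huffman tree yields the same minimal total code length, here 22 bits for the example sentence):

```python
import heapq
from collections import Counter

def huffman_codes(symbols):
    """Build a Huffman code (symbol -> bit string) from a list of symbols."""
    freq = Counter(symbols)
    # Heap entries: (frequency, tiebreak, tree); tree is a symbol or a pair
    heap = [(f, i, s) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)   # two least frequent subtrees
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (t1, t2)))
        count += 1
    codes = {}
    def walk(tree, prefix):               # assign 0/1 along the tree paths
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"
    walk(heap[0][2], "")
    return codes
```

The resulting code is prefix-free, so the concatenated bit stream can be decoded unambiguously.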
Canonical tree
- The height of the left subtree of any node is never smaller than that of the right subtree
- All leaves are in increasing order of probabilities (frequencies) from left to right
Advantages of a canonical tree
Smaller data for decoding:
- A non-canonical tree needs a mapping table between symbols and codes
- A canonical tree needs only a (sorted) list of symbols plus, for each level, a pair (number of symbols, numerical value of the first code), e.g., {(0, NA), (2, 2), (4, 0)}
More efficient encoding/decoding.
Byte-oriented Huffman coding
Use whole bytes instead of binary coding, i.e., a Huffman tree of degree 256.
[Figure: a non-optimal vs. an optimal byte-oriented tree over 256 symbols; optimality depends on how the 254 empty nodes and the symbols are distributed across the levels.]
Comparison of methods
Compression of inverted files
An inverted file is composed of:
- a vector containing all distinct words in the text collection
- for each word, a list of the documents in which that word occurs
Types of codes for the document lists: unary, Elias gamma, Elias delta, Golomb.
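Document lists are usually stored as gaps between consecutive document numbers, which these codes then compress well because small gaps are frequent. A minimal sketch of the unary and Elias gamma codes and of gap-encoding a posting list (`encode_postings` is an illustrative helper, not a standard API):

```python
def unary(n):
    """Unary code for n >= 1: n-1 ones followed by a zero (one common convention)."""
    return "1" * (n - 1) + "0"

def elias_gamma(n):
    """Elias gamma code for n >= 1: the binary length in unary (as leading
    zeros), then the binary representation itself."""
    b = bin(n)[2:]                 # binary form, always starts with 1
    return "0" * (len(b) - 1) + b

def encode_postings(doc_ids):
    """Gamma-encode a sorted posting list as gaps between document numbers."""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return "".join(elias_gamma(g) for g in gaps)
```

For the posting list [3, 7, 8, 20], the gaps are 3, 4, 1, 12, and the gamma bit stream is 011 00100 1 0001100.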
Conclusions
- Text transformation: meaning instead of strings (lexical analysis, stopwords, stemming)
- Text compression: searchable, random access; model + coding
- Inverted files
Thanks! Any questions?