Alexander Gelbukh www.Gelbukh.com Special Topics in Computer Science The Art of Information Retrieval Chapter 7: Text Operations
Previous chapter: Conclusions. Modeling of text helps predict the behavior of systems: Zipf's law, Heaps' law. Formally describing the structure of documents makes it possible to treat a part of their meaning automatically, e.g., for search. Languages to describe document syntax: SGML (too expensive), HTML (too simple), XML (a good combination).
Text operations: Linguistic operations; Document clustering; Compression; Encryption (not discussed here)
Linguistic operations. Purpose: convert words to "meanings". Synonyms or related words: different words, same meaning. Morphology: foot / feet, woman / female. Homonyms: same word, different meanings; word senses: river bank / financial bank. Stopwords: words with no meaning of their own; functional words: the.
For good or for bad? For: more exact matching; less noise, better recall. Against: unexpected behavior, difficult for users to grasp; harms if it introduces errors; more expensive (adds a whole new technology; maintenance; language dependence); slows processing down. Good if done well, harmful if done badly.
Document preprocessing. Lexical analysis (punctuation, case): simple, but one must be careful. Stopwords: reduce index size and processing time. Stemming: connected, connection, connections, ... Multiword expressions: hot dog, B-52; here all the power of linguistic analysis can be used. Selection of index terms: often nouns; noun groups: computer science. Construction of a thesaurus: a network of related concepts (words or phrases) linked by synonymy.
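A minimal sketch of the first preprocessing steps in Python; the tiny stopword list is an illustrative assumption, not the book's list:

    import re

    STOPWORDS = {"the", "a", "an", "of", "and", "or", "in", "to", "is"}  # tiny illustrative list

    def preprocess(text):
        # Lexical analysis: lowercase, strip punctuation, split into tokens
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        # Stopword elimination: reduces index size and processing time
        return [t for t in tokens if t not in STOPWORDS]

    print(preprocess("The connection to the computer science lab is down."))
    # -> ['connection', 'computer', 'science', 'lab', 'down']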
Stemming. Methods: linguistic analysis: complex, expensive maintenance. Table lookup: simple, but needs data. Statistical (Avetisyan): needs no data, but imprecise. Suffix removal: the Porter algorithm (Martin Porter; ready-made code on his website). Substitution rules: sses → ss, s → ∅; e.g., stresses → stress.
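A toy illustration of suffix-removal rules in the spirit of the Porter algorithm; only two rules are shown, while the real algorithm on Porter's website has many more, with conditions on the stem:

    def toy_stem(word):
        # Rule "sses -> ss": stresses -> stress
        if word.endswith("sses"):
            return word[:-2]
        # Rule "s -> (empty)": connections -> connection (very rough; real Porter is conditional)
        if word.endswith("s") and not word.endswith("ss"):
            return word[:-1]
        return word

    for w in ["stresses", "connections", "connected"]:
        print(w, "->", toy_stem(w))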
Better stemming: the whole problem area of computational linguistics applies. POS disambiguation: is "well" an adverb or a noun? (oil well); statistical methods, the Brill tagger. Syntactic analysis, syntactic disambiguation. Word sense disambiguation: bank1 and bank2 should be different stems; statistical methods; dictionary-based methods (the Lesk algorithm). Semantic analysis.
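A much simplified Lesk-style disambiguator as a sketch: the two "bank" definitions below are paraphrased for illustration, and real systems use full dictionary glosses.

    def lesk(context_words, senses):
        # Pick the sense whose gloss shares the most words with the context
        def overlap(gloss):
            return len(set(gloss.lower().split()) & set(context_words))
        return max(senses, key=lambda s: overlap(senses[s]))

    senses = {
        "bank1": "sloping land beside a river or lake",
        "bank2": "financial institution that accepts deposits and lends money",
    }
    context = {"he", "deposited", "money", "in", "the", "financial", "institution"}
    print(lesk(context, senses))   # -> bank2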
Thesaurus: terms (a controlled vocabulary) and relationships. Terms used for indexing represent a concept; one word or a phrase, usually nouns. A term stands for one sense; a definition or notes distinguish the senses: key (door). Relationships: paradigmatic: synonymy, hierarchical (is-a, part), non-hierarchical; syntagmatic: collocations, co-occurrences. WordNet, EuroWordNet: synsets.
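WordNet synsets and their relationships can be inspected with NLTK, for example; this assumes the nltk package and its WordNet data are installed, and the printed senses are abridged:

    import nltk
    nltk.download("wordnet", quiet=True)       # fetch the WordNet data once
    from nltk.corpus import wordnet as wn

    for synset in wn.synsets("bank")[:3]:      # first few senses of "bank"
        print(synset.name(), "-", synset.definition())
        print("  hypernyms:", [h.name() for h in synset.hypernyms()])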
Use of a thesaurus. To help the user formulate the query: navigation in the hierarchy of words; Yahoo! For the program, to collate related terms: woman ≈ female; fuzzy comparison: woman = 0.8 × female, with the weight derived from the path length in the hierarchy.
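A sketch of path-length similarity over a tiny hand-made is-a hierarchy; the hierarchy, the word pairs, and the 1/(1+distance) weighting are illustrative assumptions, not the book's formula:

    # parent links in a toy is-a hierarchy
    PARENT = {"woman": "female", "girl": "female", "female": "person",
              "man": "male", "male": "person"}

    def path_to_root(word):
        path = [word]
        while path[-1] in PARENT:
            path.append(PARENT[path[-1]])
        return path

    def similarity(a, b):
        # shortest path length through the lowest common ancestor
        pa, pb = path_to_root(a), path_to_root(b)
        for d_a, node in enumerate(pa):
            if node in pb:
                return 1.0 / (1 + d_a + pb.index(node))
        return 0.0

    print(similarity("woman", "female"))   # 0.5: one step apart
    print(similarity("woman", "man"))      # 0.2: related only via "person"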
Yahoo! vs. thesaurus. The book says Yahoo! is based on a thesaurus; I disagree. A thesaurus is the words of a language organized in a hierarchy; a document hierarchy attaches documents to such a hierarchy, and this is word sense disambiguation. I claim that Yahoo! is based on (manual) WSD. It also uses a thesaurus for navigation.
Text operations: Linguistic operations; Document clustering; Compression; Encryption (not discussed here)
Document clustering: an operation on the whole collection. Global vs. local. Global: the whole collection; done at compile time, a one-time operation. Local: cluster the results of a specific query; done at runtime, with each query; it is more of a query transformation operation, already discussed in Chapter 5.
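A toy sketch of local (query-time) clustering: group the result documents by the cosine similarity of their term sets with a simple threshold; the documents, the threshold, and the greedy single-pass strategy are made up for illustration.

    from math import sqrt

    def cosine(a, b):
        # cosine similarity of two binary bag-of-words term sets
        return len(a & b) / sqrt(len(a) * len(b)) if a and b else 0.0

    # toy query results, each reduced to its set of index terms
    results = {
        "d1": {"river", "bank", "water", "flood"},
        "d2": {"bank", "water", "river", "fish"},
        "d3": {"bank", "loan", "interest", "money"},
    }

    # greedy single-pass clustering: attach each doc to the first similar-enough cluster
    clusters = []
    for doc, terms in results.items():
        for cluster in clusters:
            if cosine(terms, results[cluster[0]]) > 0.5:
                cluster.append(doc)
                break
        else:
            clusters.append([doc])
    print(clusters)   # [['d1', 'd2'], ['d3']]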
Text operations: Linguistic operations; Document clustering; Compression; Encryption (not discussed here)
Compression. Gain: storage, transmission, search. Loss: time spent on compressing/decompressing. In IR: need for random access; blocks do not work. Also of interest: pattern matching on compressed text.
Compression methods. Statistical: Huffman: a fixed code per symbol, more frequent symbols get shorter codes; allows starting decompression from any symbol. Arithmetic: dynamic coding; needs to decompress from the beginning; not for IR. Dictionary-based: pointers to previous occurrences (Lempel-Ziv); again not for IR.
Compression ratio: compressed size / uncompressed size. Huffman with words as symbols: down to about 2 bits per character, close to the limit (the entropy), but only for large texts! Other methods: similar ratio, but no random access. Shannon: the optimal code length for a symbol with probability p is -log2 p. Entropy: the limit of compression, i.e., the average code length under optimal coding; it is a property of the model.
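A small sketch that computes the Shannon code lengths -log2 p and the entropy (in bits per symbol) from the symbol frequencies of a text; the sample string is arbitrary.

    from collections import Counter
    from math import log2

    def entropy(text):
        counts = Counter(text)
        total = sum(counts.values())
        h = 0.0
        for symbol, n in counts.items():
            p = n / total
            print(f"{symbol!r}: p={p:.3f}, optimal length = {-log2(p):.2f} bits")
            h += p * -log2(p)
        return h

    print("entropy:", entropy("abracadabra"), "bits/symbol")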
Modeling: find the probability of the next symbol. Adaptive, static, semi-static. Adaptive: good compression, but decompression must start from the beginning. Static (for the language in general): poor compression, but random access. Semi-static (for the specific text; two passes): both are OK. Word-based vs. character-based: word-based gives better compression and search.
Huffman coding. Each symbol is encoded sequentially; more frequent symbols have shorter codes; no code is a prefix of another one. How to build the tree: see the book. Byte-oriented codes are better: they allow sequential search on the compressed text.
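A compact sketch of Huffman code construction over a word-based, semi-static model, built with Python's heapq; the word frequencies are illustrative, and tie-breaking (hence the exact codes) may differ from the book's examples.

    import heapq

    def huffman_codes(freqs):
        # each heap entry: (weight, tie-breaker, {symbol: code-so-far})
        heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freqs.items())]
        heapq.heapify(heap)
        tie = len(heap)
        while len(heap) > 1:
            w1, _, left = heapq.heappop(heap)
            w2, _, right = heapq.heappop(heap)
            # prepend 0 to codes in the left subtree, 1 in the right subtree
            merged = {s: "0" + c for s, c in left.items()}
            merged.update({s: "1" + c for s, c in right.items()})
            heapq.heappush(heap, (w1 + w2, tie, merged))
            tie += 1
        return heap[0][2]

    # illustrative word frequencies; a semi-static coder collects them in a first pass
    freqs = {"the": 10, "of": 6, "retrieval": 2, "huffman": 1, "text": 1}
    for word, code in huffman_codes(freqs).items():
        print(word, code)        # "the" and "of" get the shortest codes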
Dictionary-based methods: static (simple, poor compression), dynamic, semi-static. Lempel-Ziv: references to previous occurrences; adaptive. Disadvantages for IR: need to decode from the very beginning; newer statistical methods perform better.
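A minimal LZ78-style sketch, where phrases already seen are collected in a dictionary and the output is a list of (index of longest known phrase, next character) pairs; this is for illustration only, and real Ziv-Lempel variants differ in detail.

    def lz78_compress(text):
        dictionary = {"": 0}          # phrase -> index
        output, phrase = [], ""
        for ch in text:
            if phrase + ch in dictionary:
                phrase += ch          # keep extending the current phrase
            else:
                output.append((dictionary[phrase], ch))   # reference to a previous phrase + new char
                dictionary[phrase + ch] = len(dictionary)
                phrase = ""
        if phrase:
            output.append((dictionary[phrase], ""))
        return output

    print(lz78_compress("abababab"))
    # -> [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'a'), (2, '')]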
Comparison of methods
Compression of inverted files. An inverted file: words + lists of the docs where they occur. The lists of docs are ordered and can be compressed: seen as lists of gaps, short gaps occur more frequently, which suits statistical compression. Our work: reorder the docs for better compression; we code runs of docs and minimize the number of runs; the distance between docs is the number of differing words; this reduces to the TSP.
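A sketch of the gap idea for posting lists, with a simple variable-byte code so that small gaps take fewer bytes; the posting list and the particular byte layout are one common textbook convention used here for illustration.

    def gaps(postings):
        # sorted doc ids -> first id followed by differences (small, frequent numbers)
        return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

    def vbyte(number):
        # variable-byte code: 7 data bits per byte, high bit marks the last byte
        out = []
        while True:
            out.append(number % 128)
            if number < 128:
                break
            number //= 128
        out.reverse()
        out[-1] += 128          # set the stop bit on the final byte
        return bytes(out)

    postings = [3, 5, 20, 21, 23, 76, 77, 78]
    encoded = b"".join(vbyte(g) for g in gaps(postings))
    print(gaps(postings))       # [3, 2, 15, 1, 2, 53, 1, 1]
    print(len(encoded), "bytes instead of", 4 * len(postings))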
Research topics. All of computational linguistics: improved POS tagging, improved WSD. Uses of a thesaurus: for user navigation, for collating similar terms. Better compression methods: searchable compression, random access.
Conclusions. Text transformation: meaning instead of strings; lexical analysis; stopwords; stemming (POS, WSD, syntax, semantics); ontologies to collate similar stems. Text compression: searchable, with random access; word-based statistical methods (Huffman); index compression.
Thank you! Till the compensation lecture.