1
Alexander Gelbukh www.Gelbukh.com
Special Topics in Computer Science The Art of Information Retrieval Chapter 7: Text Operations Alexander Gelbukh
2
Previous chapter: Conclusions
Modeling of text helps predict the behavior of systems: Zipf's law, Heaps' law. Describing the structure of documents formally allows treating a part of their meaning automatically, e.g., for search. Languages to describe document syntax: SGML (too expensive), HTML (too simple), XML (a good combination).
3
Text operations Linguistic operations Document clustering Compression
Encryption (not discussed here)
4
Linguistic operations
Purpose: convert words to "meanings". Synonyms or related words: different words, same meaning; morphology: foot / feet, woman / female. Homonyms: same word, different meanings; word senses: river bank / financial bank. Stopwords: words with no meaning; functional words: the.
5
For good or for bad?
More exact matching: less noise, better recall. But unexpected behavior: difficult for users to grasp; harms if it introduces errors. More expensive: adds a whole new technology; maintenance; language-dependent; slows down. Good if done well, harmful if done badly.
6
Document preprocessing
Lexical analysis (punctuation, case): simple, but one must be careful. Stopwords: reduce index size and processing time. Stemming: connected, connection, connections, ... Multiword expressions: hot dog, B-52. Here, all the power of linguistic analysis can be used. Selection of index terms: often nouns; noun groups: computer science. Construction of a thesaurus: synonymy, a network of related concepts (words or phrases).
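The preprocessing steps above (lexical analysis plus stopword removal) can be sketched as a small pipeline; the stopword list here is a toy illustration, not a realistic one:

```python
import re

# Toy stopword list for illustration; real systems use lists of 100+ function words.
STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "is"}

def preprocess(text):
    """Lexical analysis: lowercase, split on non-letters, then drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("The structure of the documents"))  # ['structure', 'documents']
```

Case folding and punctuation handling happen in one pass; stopword filtering then shrinks the token stream that goes into the index.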
7
Stemming Methods
Linguistic analysis: complex, expensive maintenance. Table lookup: simple, but needs data. Statistical (Avetisyan): no data needed, but imprecise. Suffix removal: Porter algorithm (Martin Porter; ready code on his website). Substitution rules, e.g. sses → ss: stresses → stress.
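A minimal sketch of suffix-removal stemming in the spirit of Porter's substitution rules; the real Porter algorithm applies ordered rule phases with conditions on the word's measure, so this rule table is a simplified stand-in:

```python
# Ordered toy rules: first matching suffix wins (longest-match rules first).
RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def stem(word):
    """Apply the first matching suffix-substitution rule, Porter-style."""
    for suffix, repl in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + repl
    return word

print(stem("stresses"))     # stress
print(stem("connections"))  # connection
```

Note the rule ("ss", "ss") is a deliberate no-op: it stops "stress" from falling through to the bare "s"-removal rule, which is exactly why rule order matters.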
8
Better stemming
The whole problematics of computational linguistics. POS disambiguation: well, adverb or noun? Oil well. Statistical methods: Brill tagger. Syntactic analysis: syntactic disambiguation. Word sense disambiguation: bank1 and bank2 should be different stems; statistical methods; dictionary-based methods: Lesk algorithm. Semantic analysis.
9
Thesaurus
Terms (controlled vocabulary) and relationships. Terms used for indexing represent a concept; one word or a phrase, usually nouns. Sense: a definition or notes to distinguish senses: key (door). Relationships: paradigmatic: synonymy, hierarchical (is-a, part-of), non-hierarchical; syntagmatic: collocations, co-occurrences. WordNet, EuroWordNet: synsets.
10
Use of thesaurus
To help the user formulate the query: navigation in the hierarchy of words (Yahoo!). For the program, to collate related terms: woman → female; fuzzy comparison: woman = 0.8 * female; path length.
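Collating related terms with a fuzzy weight, as in "woman = 0.8 * female", can be sketched as a lookup in a symmetric similarity table; the thesaurus entries and weights here are purely illustrative:

```python
# Illustrative thesaurus with similarity weights (not real lexical data).
THESAURUS = {("woman", "female"): 0.8, ("car", "automobile"): 0.9}

def similarity(a, b):
    """Fuzzy term comparison: 1.0 for identity, thesaurus weight otherwise."""
    if a == b:
        return 1.0
    return THESAURUS.get((a, b)) or THESAURUS.get((b, a)) or 0.0

print(similarity("woman", "female"))  # 0.8
```

A path-length variant would instead compute the weight from the distance between the two terms in the thesaurus hierarchy, e.g. decaying with each edge traversed.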
11
Yahoo! vs. thesaurus
The book says Yahoo! is based on a thesaurus. I disagree. A thesaurus: words of a language organized in a hierarchy. A document hierarchy: documents attached to a hierarchy; this is word sense disambiguation. I claim that Yahoo! is based on (manual) WSD. It also uses a thesaurus for navigation.
12
Text operations Linguistic operations Document clustering Compression
Encryption (not discussed here)
13
Document clustering
An operation on the whole collection. Global vs. local. Global: the whole collection; at compile time, a one-time operation. Local: cluster the results of a specific query; at runtime, with each query. This is more a query transformation operation, already discussed in Chapter 5.
14
Text operations Linguistic operations Document clustering Compression
Encryption (not discussed here)
15
Compression
Gain: storage, transmission, search. Loss: time spent compressing/decompressing. In IR: need for random access; blocks do not work. Also: pattern matching on compressed text.
16
Compression methods
Statistical. Huffman: a fixed code per symbol; more frequent symbols get shorter codes; allows starting decompression at any symbol. Arithmetic: dynamic coding; need to decompress from the beginning; not for IR. Dictionary: pointers to previous occurrences; Lempel-Ziv; again, not for IR.
17
Compression ratio
Size compressed / size uncompressed. Huffman with words as units: up to 2 bits per character, close to the limit (entropy); only for large texts! Other methods: similar ratio, but no random access. Shannon: the optimal code length for a symbol with probability p is -log2 p. Entropy: the limit of compression; the average code length under optimal coding; a property of the model.
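Shannon's formula above gives the entropy directly: sum each symbol's -log2 p weighted by its probability. A minimal character-based sketch:

```python
import math
from collections import Counter

def entropy(text):
    """Average optimal code length in bits/symbol: -sum(p * log2 p), Shannon."""
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

print(entropy("aabb"))  # 1.0 -- two equiprobable symbols need 1 bit each
```

No lossless coder for this model can average fewer bits per symbol than this value, which is why it serves as the yardstick for compression ratios.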
18
Modeling
Find the probability of the next symbol. Adaptive, static, semi-static. Adaptive: good compression, but must start from the beginning. Static (for the language): poor compression, but random access. Semi-static (for the specific text; two passes): both OK. Word-based vs. character-based: word-based gives better compression and search.
19
Huffman coding
Each symbol is encoded sequentially. More frequent symbols have shorter codes. No code is a prefix of another. How to build the tree: see the book. Byte codes are better: they allow sequential search.
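The tree construction works by repeatedly merging the two least frequent nodes; a compact sketch that tracks each symbol's growing codeword instead of building explicit tree nodes:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build prefix-free codes: frequent symbols get shorter codewords."""
    # Heap entries: (frequency, tiebreaker, {symbol: code-so-far}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)   # two least frequent subtrees
        f2, i, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, i, merged))
    return heap[0][2]

codes = huffman_codes("aaabbc")
print(codes)  # 'a' (most frequent) gets the shortest code
```

The integer tiebreaker keeps tuple comparisons away from the dicts and makes the output deterministic; the prefix-free property follows because each symbol's code is the path of merge decisions from the root.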
20
Dictionary-based methods
Static (simple, poor compression), dynamic, semi-static. Lempel-Ziv: references to previous occurrences; adaptive. Disadvantages for IR: need to decode from the very beginning, and new statistical methods perform better.
21
Comparison of methods
22
Compression of inverted files
An inverted file: words + lists of the docs where they occur. The lists of docs are ordered, so they can be compressed: seen as lists of gaps, where short gaps occur more frequently; statistical compression. Our work: reorder the docs for better compression; we code runs of docs and minimize the number of runs; distance = the number of differing words; TSP.
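Gap encoding of a postings list, followed by a standard variable-byte code, illustrates why ordered doc lists compress well; the doc IDs here are made up for the example:

```python
def vbyte_encode(gaps):
    """Variable-byte code: 7 data bits per byte, high bit marks the last byte."""
    out = bytearray()
    for g in gaps:
        chunk = []
        while True:
            chunk.insert(0, g & 0x7F)  # low 7 bits first in the list
            g >>= 7
            if g == 0:
                break
        chunk[-1] |= 0x80              # terminator flag on the final byte
        out.extend(chunk)
    return bytes(out)

docs = [3, 7, 11, 23]                          # sorted postings list (illustrative)
gaps = [docs[0]] + [b - a for a, b in zip(docs, docs[1:])]
print(gaps)                                    # [3, 4, 4, 12]
print(len(vbyte_encode(gaps)))                 # 4 bytes instead of four full integers
```

Reordering the documents so that each word's postings cluster together makes the gaps even smaller, which is exactly the optimization the slide's TSP formulation targets.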
23
Research topics
All of computational linguistics. Improved POS tagging; improved WSD. Uses of a thesaurus: for user navigation; for collating similar terms. Better compression methods: searchable compression; random access.
24
Conclusions
Text transformation: meaning instead of strings. Lexical analysis; stopwords; stemming; POS, WSD, syntax, semantics; ontologies to collate similar stems. Text compression: searchable; random access; word-based statistical methods (Huffman); index compression.
25
Thank you! Till the compensation lecture.