Presentation on theme: "Data Structure. Two segments of data structure –Storage –Retrieval."— Presentation transcript:

1 Data Structure

2 Two segments of data structure
–Storage
–Retrieval

3 Overview (diagram): item normalization feeds document file creation; the document manager produces the original document file, and the document search manager produces the processing token search file.

4 –Stemming
–Inverted file system
–N-gram
–PAT trees and arrays
–Signature
–Hypertext

5 Inverted file system
–The most common data structure
–Minimizes secondary storage access when using multiple search terms
–Components: document file, inversion lists (posting files), dictionary
–Stores an inversion of the documents
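As a minimal sketch of the structure on this slide: a dictionary maps each term to its inversion (posting) list of document identifiers, and a multi-term query intersects those lists. The document texts below are hypothetical.

```python
# Build a tiny inverted file: term -> sorted inversion list of doc IDs.
from collections import defaultdict

docs = {
    1: "bit byte computer",
    2: "byte memory",
    3: "bit computer memory",
}

index = defaultdict(list)
for doc_id in sorted(docs):
    for term in set(docs[doc_id].split()):
        index[term].append(doc_id)

def search_and(*terms):
    """Intersect the inversion lists for a multi-term (AND) query,
    so only the lists for the query terms are read from storage."""
    lists = [set(index.get(t, [])) for t in terms]
    return sorted(set.intersection(*lists)) if lists else []

print(search_and("bit", "computer"))   # -> [1, 3]
```

Because only the inversion lists for the query terms are touched, secondary storage access stays proportional to the lists involved rather than to the whole collection.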

6 N-gram
–A fixed-length, consecutive series of 'n' characters
–Algorithmically based upon a fixed number of characters
–The searchable text is transformed into overlapping n-grams to create the searchable database (fig. 4.7)
–Does not involve semantics or concepts

7 N-gram
–The symbol # represents an inter-word symbol (fig. 4.7): blank, period, semicolon, colon, etc.
–Produces word fragments
–Uses: spelling error detection and correction (fig. 4.8), text compression
–Ignores words and treats the input as continuous data
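The transformation above can be sketched as follows, assuming trigrams (n = 3) and '#' as the inter-word symbol; the sample text is only illustrative.

```python
# A sketch of overlapping n-gram generation with '#' as the inter-word symbol.
import re

def ngrams(text, n=3, interword="#"):
    """Replace inter-word symbols (blank, period, semicolon, colon, comma)
    with '#', then emit every overlapping n-gram, treating the input as a
    continuous character stream rather than as words."""
    stream = re.sub(r"[ .;:,]+", interword, text.strip())
    return [stream[i:i + n] for i in range(len(stream) - n + 1)]

print(ngrams("sea colony"))
# ['sea', 'ea#', 'a#c', '#co', 'col', 'olo', 'lon', 'ony']
```

Note that 'a#c' and '#co' span the word boundary; without the '#' marker those trigrams would look identical to trigrams drawn from inside a single word, which is the source of the false hits mentioned on the next slide.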

8 N-gram
–False hits can occur when the # symbol is not used
–The longer the n-gram, the less likely the error
–Problems: increased size of inversion lists; no semantic meaning or concept relationships
–Can achieve high recall

9 PAT trees
–PATRICIA trees: Practical Algorithm To Retrieve Information Coded In Alphanumerics
–Each position in the input string is the anchor point for a substring that starts at that point and includes all new text up to the end of the input
–These substrings are termed sistrings (Figures 4.9 - 4.11)
–Best for string searching, but not widely used commercially
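The anchoring described above can be sketched directly; the PAT tree itself is then a binary digital (Patricia) trie built over these sistrings, which is not shown here.

```python
# A sketch of sistring generation: each character position is an anchor
# point, and the sistring runs from that point to the end of the input.
def sistrings(text):
    return [text[i:] for i in range(len(text))]

for s in sistrings("home"):
    print(s)
# home
# ome
# me
# e
```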

10 Signature
–Provides a fast test to eliminate the majority of items that are not related to a query
–A linear scan of a compressed version of the items
–Coding is based upon the words in the item
–Words are mapped onto a word signature: a fixed-length code with a fixed number of bits set to 1; which bits are set is determined by a hash function; word signatures are ORed to create the signature of an item (fig. 4.13)
–Words in the query are mapped to word signatures; search is via template matching
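A sketch of this superimposed coding follows. The code length, number of bits set, and the MD5-based hash are illustrative assumptions, not values from the text.

```python
# Superimposed coding sketch: each word hashes to a fixed-length code with
# a fixed number of bits set to 1; word codes are ORed into the item
# signature, and a query word "matches" when all of its bits are present.
import hashlib

CODE_LEN, BITS_SET = 16, 3   # illustrative sizes only

def word_signature(word):
    sig = 0
    for i in range(BITS_SET):
        h = hashlib.md5(f"{word}/{i}".encode()).digest()
        sig |= 1 << (int.from_bytes(h, "big") % CODE_LEN)
    return sig

def item_signature(words):
    sig = 0
    for w in words:
        sig |= word_signature(w)   # OR the word signatures together
    return sig

def maybe_contains(item_sig, query_word):
    q = word_signature(query_word)
    return item_sig & q == q       # template match; false hits are possible

item = item_signature(["computer", "science"])
print(maybe_contains(item, "computer"))   # True
```

The template match can say "maybe present" for a word that is absent (a false hit), which is why the next slide discusses code length and bits per code: both parameters trade signature size against the false-hit rate.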

11 Signature
–A longer code length reduces the probability of collisions (different words hashing to the same code)
–Fewer bits per code reduce the chance that a code word's pattern is present in the final signature block when the word is not actually in the item

12 Hypertext (HTML and XML)
–Allows one item to reference another item via an embedded pointer
–A node is a separate item; a link is a reference pointer, to the same or a different data type than the original
–Navigation manages the loosely structured information
–Issue: linkage integrity (links to removed or deleted items are not updated)

13 Hypertext (HTML and XML)
Dynamic HTML
–A combination of the latest HTML tags and options, style sheets and programming
–Creates animated Web pages that are responsive to user interaction
Dynamic HTML Object Model
–An object-oriented view of Web pages and their elements
–Cascading style sheets
–Programming that addresses the page elements, with dynamic fonts

14 Example:
DOCUMENTS: Doc #1 contains "bit, byte"
DICTIONARY: bit (2), computer
INVERSION LISTS: bit - 1, 3

15 Inversion list
–May carry weights and words with special characteristics, e.g. dates
Searching
–Locate the inversion lists for the query terms
–Apply the appropriate logic between the lists
–The final hit list of items is the result

16 B-trees, e.g. of order m
–The root node has between 2 and 2m keys
–All other internal nodes have between m and 2m keys
–All keys are kept in order, from smaller to larger
–All leaves are at the same level, or differ by at most one level
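The properties listed above can be expressed as a small checker (a verifier, not an insertion algorithm); the dict-based node representation and the sample tree are assumptions for illustration.

```python
# A sketch checking the order-m B-tree properties on nodes represented as
# {"keys": [...], "children": [...]}; leaf nodes simply omit "children".
def check_btree(node, m, is_root=True, depth=0, leaf_depths=None):
    if leaf_depths is None:
        leaf_depths = []
    keys = node["keys"]
    children = node.get("children", [])
    low = 2 if is_root else m            # root: 2..2m keys; others: m..2m
    assert low <= len(keys) <= 2 * m, "key count out of range"
    assert keys == sorted(keys), "keys must be in order, smaller to larger"
    if not children:
        leaf_depths.append(depth)
    for child in children:
        check_btree(child, m, is_root=False, depth=depth + 1,
                    leaf_depths=leaf_depths)
    if is_root:                          # leaves level, or within one level
        assert max(leaf_depths) - min(leaf_depths) <= 1
    return True

tree = {"keys": [10, 20],
        "children": [{"keys": [2, 5]},
                     {"keys": [12, 15]},
                     {"keys": [25, 30, 35]}]}
print(check_btree(tree, m=2))   # True
```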

17 Inversion list structures
–Provide optimum performance in searching large databases
–Minimize data flow: only directly related data is involved
–Good for storing concepts and their relationships: each list represents a concept and is a concordance of all of the items containing that concept, along with the locations of the concept
–Do not, by themselves, work for natural language processing

18 Stemming algorithm
–Goal: to improve performance and require fewer system resources by reducing the number of unique words the system has to contain
–Currently reviewed for potential improvements in recall and the associated decline in precision
–Trade-off: increased overhead when processing tokens vs. reduced search-time overhead when processing query terms with trailing "don't cares" to include all the variants
–Creates a larger index for the stem vs. term masking (ORing the variants)

19 Stemming algorithm
–Conflation: mapping multiple morphological variants to a single representation (the stem)
–The stem carries the meaning of the concept associated with the word
–Affixes (endings) introduce subtle modifications to the concept or are used for syntactical purposes
–Language grammars define usage and evolve with human usage; exceptions and inconsistent variants exist, so exception look-up tables are required besides the normal reduction rules

20 Stemming algorithm
–Compression: savings in storage and processing?
–Savings come in the dictionary, but weighted positional information is required in both the stemmed and un-stemmed inversion lists
–Affects the size of the inversion list
–Compression does not significantly reduce storage requirements, whether for a small or a large collection

21 Stemming algorithm
–Improve recall? Yes, as long as a semantically consistent stem can be identified for a set of words; stemming is a generalization process
–Improve precision? Only if the expansion guarantees that every item retrieved by the expansion is relevant

22 Stemming algorithm
–The system must recognize a word before stemming it
–Proper names and acronyms: no stemming applied, since there is no common core concept
–Problems for natural language processing systems: loss of information needed at aggregate levels of processing, e.g. tenses needed to determine a particular concept
–Time is important in natural language processing

23 Stemming algorithm
–Removal of suffixes and prefixes
–Table look-up: requires a large data structure (e.g. RetrievalWare, due to its large thesaurus/concept network)
–Successor stemming: determines prefix overlap as the length of a stem is increased

24 Stemming algorithm
–Porter stemming algorithm
–Dictionary look-up stemmers
–Successor stemmers

25 Porter stemming algorithm
–Based upon a set of conditions on the stem, suffix and prefix, and associated actions given the condition
–The measure m of a stem is a function of sequences of vowels followed by a consonant: if V is a sequence of vowels and C is a sequence of consonants, a word has the form [C](VC)^m[V], where the leading C and trailing V are optional and m is the number of VC repeats
–Additional conditions: *S (the stem ends with S), *v* (the stem contains a vowel), *d (the stem ends with a double consonant), *o (the stem ends cvc, where the second c is not W, X or Y)
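The measure m can be computed as a sketch by mapping letters to consonant/vowel classes and counting the VC repeats; this simplification treats only a, e, i, o, u as vowels, whereas the full algorithm also treats 'y' after a consonant as a vowel.

```python
# A sketch of Porter's measure m: strip the optional leading consonant run
# and trailing vowel run of the C/V form, then count the VC repetitions.
import re

def measure(stem):
    # Map letters to 'c'/'v' (simplified: 'y' is always a consonant here).
    form = "".join("v" if ch in "aeiou" else "c" for ch in stem.lower())
    form = re.sub(r"^c+", "", form)   # drop optional leading [C]
    form = re.sub(r"v+$", "", form)   # drop optional trailing [V]
    return len(re.findall(r"v+c+", form))

print(measure("tree"))     # m = 0
print(measure("trouble"))  # m = 1
print(measure("oaten"))    # m = 2
```

The conditions on a rule (such as "m > 1") gate whether a suffix is removed; for example a rule removing "-ement" applies to "replacement" (m of "replac" is 2) but not to "cement".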

26 Dictionary look-up stemmers
–Use simple stemming rules with the fewest exceptions (e.g. plurals)
–The original term or its stemmed version is looked up in a dictionary and replaced by the stem that best represents it
–e.g. Kstem: a morphological analyzer that conflates word variants to a root form while avoiding collapsing words with different meanings into the same root
–Six major data files: dictionary of words, supplemental list of words, exception list for words that should retain an "e" at the end, direct conflation, country nationality
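A minimal sketch of the look-up idea follows. This is not Kstem itself: the rules, the exception table, and the word lists are hypothetical, but the control flow (exception table first, then accept a candidate only if the dictionary confirms it is a root) matches the approach described above.

```python
# A toy dictionary look-up stemmer: simple plural rules plus an exception
# table; a candidate stem is accepted only if it appears in the dictionary,
# which prevents collapsing unrelated words into the same root.
exceptions = {"geese": "goose", "mice": "mouse"}        # hypothetical entries
dictionary = {"box", "church", "dog", "goose", "mouse"} # hypothetical roots

def lookup_stem(word):
    word = word.lower()
    if word in exceptions:                 # exception look-up table first
        return exceptions[word]
    if word in dictionary:                 # already a root form
        return word
    for suffix, repl in (("ches", "ch"), ("xes", "x"), ("s", "")):
        if word.endswith(suffix):
            candidate = word[:-len(suffix)] + repl
            if candidate in dictionary:    # dictionary must confirm the root
                return candidate
    return word

print(lookup_stem("churches"))  # church
print(lookup_stem("geese"))     # goose
```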

27 Successor stemmers
–Based upon the length of the prefix that optimally stems expansions of additional suffixes
–Based upon an analogy to structural linguistics, which investigated word and morpheme boundaries based upon the distribution of phonemes
–e.g. bag, barn, bring, both, box, bottle (Fig. 4.2)

28 Successor stemmers
–Methods: cut-off, peak and plateau, complete word, and entropy
–Cut-off method: a cut-off value defines the stem length; the value varies for each possible set of words
–Peak and plateau: a segment break is made after a character whose successor variety exceeds that of the character immediately preceding it and the character immediately following it (no cut-off needed)
–Complete word method: breaks on boundaries of complete words (no cut-off needed)
–Entropy method: uses the distribution of successor variety letters (Figure 4.3)
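The successor variety counts that all four methods operate on can be sketched as follows, using the word list from the slide above as the corpus; the test word "boxer" is an illustrative assumption.

```python
# A sketch of successor variety: for each prefix of the word, count the
# distinct letters that follow that prefix anywhere in the corpus. The
# peak-and-plateau method would then break after a character whose count
# exceeds both its neighbours' counts.
corpus = ["bag", "barn", "bring", "both", "box", "bottle"]

def successor_variety(word, corpus):
    counts = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        succ = {w[i] for w in corpus if w.startswith(prefix) and len(w) > i}
        counts.append(len(succ))
    return counts

print(successor_variety("boxer", corpus))   # [3, 2, 0, 0, 0]
```

Here 'b' has three distinct successors in the corpus (a, r, o) and 'bo' has two (t, x); once the prefix reaches "box" no corpus word extends it, so the variety drops to zero, which is the kind of boundary signal the break methods exploit.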

29 Stemming algorithm
–Stemming affected recall positively in one study, though this was not proven in many studies; it reduces precision, which can be minimized via ranking items, categorization of terms, and selective exclusion of some terms from stemming
–Stemming performance is dependent upon the nature of the vocabulary
–Performance measure: error rate relative to truncation (the distance from the origin to the coordinate of the stemmer being evaluated vs. the distance from the origin to the worst-case intersection with the line generated by pure truncation), Fig. 4.4
–Another measure: the ability to partition terms semantically and morphologically related to each other into "concept groups"
–Understemming index: concept groups with multiple stems
–Overstemming index: the same stem is found in multiple groups

