Document and Query Forms (Chapter 2)
Document & Query Forms
Q 1. What is a document? A document is a stored data record in any form, such as a book, a letter, a computer graphic, or a voice recording.
Q. What is a query? A query expresses the information need. If a query is itself treated as a document, the retrieval process becomes a problem of document matching; otherwise it is a problem of mapping documents onto queries, i.e., identifying the documents that satisfy the mapping criteria.
Document Structures
Q 2. What is a fully formatted document? A document consisting of a predefined number of fields, each with a predefined size and position within the document.
Q. What is an unformatted document? A document with no imposed structure, e.g., sound and image data.
Q. What is an intermediately structured document? A document stored in a mixed form that is largely unformatted but includes some formatted portions.
Document Structures
Consider a two-stage retrieval process for intermediately structured documents:
1) A rough retrieval based on the formatted portion.
2) Refinement of the results of step 1 to locate the desired items.
Q. What is a document with imposed structure? A document that comes with an identifier or with conceptual relationships among its various portions.
Document Surrogates
A document surrogate is a limited representation of the full document. Examples of surrogates:
- Document identifiers (e.g., the ISBN of a book)
- Titles & names (e.g., author, corporate, or publisher names)
- Keywords/phrases (e.g., from introductions, summaries, reviews, abstracts)
- Extracts: artificially constructed sentences/phrases taken from a document
- Review: a critique
- Abstract: a summary/content description
Using surrogates carries the risk of making firm decisions based on incomplete (often misleading) information, e.g., the document title alone.
Vocabulary Control
The vocabulary used in IR queries can be controlled.
(Pro) Enforces uniformity throughout the IR system, which makes information retrieval more efficient: similar concepts are treated as one concept, so a query on a particular concept returns all the documents indexed by the corresponding vocabulary term.
(Con) Eliminates the user's ability to describe the information need in fine detail; while retrieval is efficient, many of the retrieved documents may be irrelevant.
Tradeoff: even though a free, uncontrolled vocabulary adds complexity, that complexity is a small price to pay for the added precision in document retrieval.
Data Structures
ASCII & EBCDIC are the two major encoding systems developed to meet text-processing needs. Both represent a character with a single byte, so 2^8 = 256 characters are possible.
ASCII has become the standard for text encoding; it uses only the codes 0 to 127.
As the sophistication of document processing has increased, ASCII has become inadequate for documents that include non-English characters or mathematical equations.
Unicode: an encoding system that provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.
Data Structures
What is Unicode? A standard for character representation designed to accommodate every character in every (computer) language.
Why use Unicode? In Unicode's original design, every character is 16 bits long, which eliminates the need to distinguish between 1-byte and 2-byte characters. Unicode is a sensible way to mix languages within a document and to interchange documents between people in different locales/countries.
Unicode & internationalization: internationalization provides the functionality for adapting programs to various languages and localities and allows strings to be manipulated using the internationally recognized Unicode standard.
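As a concrete illustration of the points above, here is a minimal Python 3 sketch (the sample string is an arbitrary choice, not from the text): every character has a single Unicode code point, ASCII cannot encode the non-English characters, and Unicode encodings handle the mixed-language string directly.

    # Minimal sketch: Unicode vs. ASCII for a mixed-language string.
    mixed = "Grüße, 世界!"            # German and Chinese in one document

    # Every character has a unique Unicode code point.
    for ch in mixed:
        print(ch, hex(ord(ch)))

    # ASCII covers only code points 0-127, so it cannot encode this string...
    try:
        mixed.encode("ascii")
    except UnicodeEncodeError as err:
        print("ASCII fails:", err)

    # ...but Unicode encodings can: 16-bit code units (as on the slide) or UTF-8.
    print(mixed.encode("utf-16"))
    print(mixed.encode("utf-8"))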
Data Compression
What is a static lossless compression model? A static model examines a sample of text and constructs statistical tables (of words/characters) representing the sample. The tables are then used for the entire body of text to be compressed. Distinguishing features: less computation, faster (de)compression.
What is an adaptive lossless compression model? An adaptive model begins with an initial statistical distribution for the text symbols but modifies the distribution as each character or word is encoded. Distinguishing feature: a higher compression ratio (# of bits in the original data representation / # of bits in the compressed data).
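The compression ratio defined above can be measured directly. Here is a small Python sketch using the standard zlib module purely as a convenient lossless compressor for the measurement (zlib's DEFLATE combines LZ77-style and Huffman coding, so it only illustrates the ratio, not either model above in isolation):

    import zlib

    # Repetitive text compresses well under dictionary-based lossless schemes.
    text = ("information retrieval " * 100).encode("utf-8")
    compressed = zlib.compress(text)

    # Compression ratio as defined on the slide:
    # bits of the original representation / bits of the compressed data.
    ratio = (len(text) * 8) / (len(compressed) * 8)
    print(f"original: {len(text)} bytes, "
          f"compressed: {len(compressed)} bytes, ratio: {ratio:.1f}")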
Huffman Code (Algorithm)
1. Initially, each symbol is considered a separate (trivial) tree.
2. The two trees with the lowest frequencies are chosen and combined into a single binary tree whose assigned frequency is the sum of the two given frequencies.
3. This process is repeated until only a single binary tree remains.
4. The two branches from each node in the tree are assigned the values 0 and 1.
5. The code of each symbol can be read by following the branches from the root to that symbol. Note that the original symbols occupy the leaves of the tree.
A small sketch of this procedure follows the list.
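A minimal Python sketch of the algorithm above (not the textbook's implementation): a heap serves as the priority queue, frequencies are counted from a sample string rather than taken from a precomputed table, and ties are broken arbitrarily, so the exact code words may differ from other valid Huffman codes.

    import heapq
    from collections import Counter

    def huffman_codes(text):
        """Build Huffman code words by repeatedly merging the two
        lowest-frequency trees, then labeling branches 0/1 and reading
        each code from the root down to the symbol's leaf."""
        freq = Counter(text)
        # Heap entries: (frequency, tie-breaker, tree). A tree is either a
        # single symbol (leaf) or a (left, right) pair (internal node).
        heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
        heapq.heapify(heap)
        if len(heap) == 1:                       # degenerate one-symbol text
            return {heap[0][2]: "0"}
        counter = len(heap)
        while len(heap) > 1:
            f1, _, t1 = heapq.heappop(heap)      # two lowest-frequency trees
            f2, _, t2 = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, counter, (t1, t2)))
            counter += 1
        codes = {}
        def walk(tree, prefix):
            if isinstance(tree, tuple):          # internal node: branches 0, 1
                walk(tree[0], prefix + "0")
                walk(tree[1], prefix + "1")
            else:                                # leaf: an original symbol
                codes[tree] = prefix
        walk(heap[0][2], "")
        return codes

    print(huffman_codes("abracadabra"))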
Huffman Code
The Huffman coding technique is less efficient than adaptive coding techniques, since Huffman coding often results in a code with more bits per character.
A Huffman code is a prefix code: a prefix code does not include any code word that is a prefix of another code word.
A Huffman code is an optimal prefix code, which has two properties:
1) Symbols that occur more frequently, i.e., have a higher probability of occurrence, have shorter code words than symbols that occur less frequently.
2) The two symbols that occur least frequently have code words of the same length.
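To make the two properties concrete, here is a small self-contained Python check on a hand-built example: the frequencies are those of the string "abracadabra", and the code shown is one valid Huffman code for them (both are illustrative choices, not from the text).

    from itertools import permutations

    # Symbol frequencies of "abracadabra" and one possible Huffman code.
    freq  = {"a": 5, "b": 2, "r": 2, "c": 1, "d": 1}
    codes = {"a": "0", "b": "100", "r": "101", "c": "110", "d": "111"}

    # Prefix property: no code word is a prefix of another code word.
    prefix_free = all(not w2.startswith(w1)
                      for w1, w2 in permutations(codes.values(), 2))
    print("prefix-free:", prefix_free)

    # Property 1 (allowing ties): a more frequent symbol never has a longer
    # code word than a less frequent one.
    ordered = sorted(freq, key=freq.get, reverse=True)
    lengths = [len(codes[s]) for s in ordered]
    print("lengths by decreasing frequency:", lengths,
          "monotone:", lengths == sorted(lengths))

    # Property 2: the two least frequent symbols ("c" and "d") have code
    # words of the same length.
    print("least-frequent equal length:", len(codes["c"]) == len(codes["d"]))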
Huffman Code (Algorithm)
What is a greedy algorithm? A greedy algorithm arrives at a solution by making a sequence of choices, each of which simply looks best at the moment, i.e., each choice made is locally optimal.
When the algorithm stops, the local optimum may coincide with the global optimum. If so, the algorithm is correct; if not, the algorithm yields a sub-optimal solution (see the sketch below).
A greedy algorithm sacrifices completeness for efficiency. Although this technique is applied to optimization problems, it is still considered a general design technique.
Sample greedy algorithms: minimum spanning tree, Dijkstra's shortest path, etc.
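A tiny Python illustration of a greedy choice missing the global optimum (change-making is an illustrative example, not one of the algorithms named on the slide):

    def greedy_coin_change(amount, coins):
        """Greedy change-making: always take the largest coin that still
        fits. Each choice is locally optimal, but the result may not be."""
        result = []
        for coin in sorted(coins, reverse=True):
            while amount >= coin:
                result.append(coin)
                amount -= coin
        return result

    # With coins {1, 3, 4} and amount 6, the greedy choice gives 4 + 1 + 1
    # (three coins), while the global optimum is 3 + 3 (two coins).
    print(greedy_coin_change(6, [1, 3, 4]))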
Ziv-Lempel Code
An adaptive encoding model.
Procedure:
1) Identify a text segment the first time it appears.
2) Point back to the first occurrence of the segment for each subsequent, identical segment.
Properties: as a text is scanned, increasingly longer segments are encoded.
(LZ77) Encoding (see Example 2.2 on page 34): the code consists of a sequence of triples <a, b, c>, where
a: the backward (character) distance required to locate the next (text) segment,
b: the number of characters to copy for the next segment,
c: the new character added to complete the next segment.
A toy sketch of this triple encoding follows.