1
Recuperação de Informação B (Information Retrieval B)
Chapter 6: Text and Multimedia Languages and Properties (Introduction, Metadata and Text)
Sections 6.1, 6.2, 6.3
November 1, 1999
2
Introduction
- Text
  - The main form of communicating knowledge.
- Document
  - Loosely defined; denotes a single unit of information.
  - Can be any physical unit:
    - a file
    - an email
    - a Web page
3
Introduction
- Document
  - Syntax and structure
  - Semantics
  - Information about itself
4
Introduction
- Document syntax
  - Implicit, or expressed in a language (e.g., TeX).
  - Powerful languages: easier to parse, but difficult to convert to other formats.
  - Open languages are better (interchange).
  - The semantics of natural-language text is not easy for a computer to understand.
  - Trend: languages that provide information on structure, format and semantics while being readable by both humans and computers.
5
Introduction
- New applications are pushing for formats in which information can be represented independently of style.
- Style: defined by the author, but the reader may decide part of it.
- Style can include the treatment of other media.
6
Metadata
- "Data about the data"
  - e.g., in a DBMS, the schema specifies the names of the relations, attributes, domains, etc.
- Descriptive metadata
  - Author, source, length
  - Dublin Core Metadata Element Set
- Semantic metadata
  - Characterizes the subject matter of the document's contents
  - MEDLINE
7
Metadata
- MARC
    100 00201 $aHagler, Ronald.
    245 007414$aThe bibliographic...
    250 0012$a3rd. Ed.
    260 0052$aChicago :$bALA, $c1997
8
Metadata
- Metadata information on Web documents
  - Cataloging, content rating, property rights, digital signatures
- New standard: Resource Description Framework (RDF)
  - Description of Web resources to facilitate automated processing of information
  - Nodes with attached attribute/value pairs
- Metadescription of non-textual objects
  - Keywords can be used to search for the objects
9
Metadata
- RDF example: a resource description with the values "John Smith" and "John's Home Page"
10
Metadata
- RDF Schema example
11
Text
- Text coding in bits
  - EBCDIC, ASCII
    - Initially 7 bits; later 8 bits.
  - Unicode
    - 16 bits, to accommodate oriental languages.
12
Text
- Formats
  - No single format exists.
  - An IR system should retrieve information from different formats.
  - Past: IR systems converted the documents.
  - Today: IR systems use filters.
13
Text
- Formats
  - Formats for document interchange (RTF)
  - Formats for displaying (PDF, PostScript)
  - Formats for encoding email (MIME)
  - Compressed files
    - uuencode/uudecode, binhex
14
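Text
- Information theory
  - The amount of information is related to the distribution of symbols in the document.
  - Entropy: $E = -\sum_{i=1}^{\sigma} p_i \log_2 p_i$, where $\sigma$ is the number of symbols in the alphabet and $p_i$ is the probability of the $i$-th symbol.
  - The definition of entropy depends on the probabilities of each symbol.
  - Text models are used to obtain those probabilities.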
15
Text
- Example: entropy
  - 001001011011
16
Text
- Example: entropy
  - 111111111111
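The two bit strings in the entropy examples can be checked with a short sketch. It assumes a zero-order model, that is, each symbol's probability is simply its relative frequency in the string, and it uses the standard Shannon entropy in bits per symbol. The function name `entropy` is illustrative and does not come from the slides.

```python
from collections import Counter
from math import log2


def entropy(text):
    """Shannon entropy in bits per symbol, using relative frequencies."""
    counts = Counter(text)
    total = len(text)
    # Sum of p * log2(1/p) over the distinct symbols, with p = count / total.
    return sum((c / total) * log2(total / c) for c in counts.values())


print(entropy("001001011011"))  # six 0s and six 1s
print(entropy("111111111111"))  # a single repeated symbol
```

Under this assumption the first string, with equally frequent 0s and 1s, gives 1 bit per symbol, while the second string, which never varies, gives 0.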
17
Text
- Modeling natural language
  - Symbols: either separate words or belong to words.
  - Symbols are not uniformly distributed
    - binomial model
  - Dependency on previous symbols
    - k-order Markovian model
  - We can also take words as symbols.
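As a rough illustration of a k-order Markovian model, the sketch below estimates the probability of each symbol given the k symbols before it, using raw counts from a sample string. The function name `markov_model` and the sample text are hypothetical; practical text models usually add smoothing, which is omitted here.

```python
from collections import Counter, defaultdict


def markov_model(text, k):
    """Estimate P(symbol | previous k symbols) from counts in `text`."""
    counts = defaultdict(Counter)
    for i in range(k, len(text)):
        context = text[i - k:i]          # the k symbols preceding position i
        counts[context][text[i]] += 1
    return {
        context: {sym: c / sum(followers.values())
                  for sym, c in followers.items()}
        for context, followers in counts.items()
    }


model = markov_model("abababac", 1)
print(model["a"])  # distribution of the symbol that follows "a"
print(model["b"])  # distribution of the symbol that follows "b"
```

With k = 0 the context is always the empty string, so the model reduces to plain symbol frequencies.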
18
Text
- Modeling natural language
  - Distribution of words inside documents
  - Zipf's Law: the $i$-th most frequent word appears $1/i^{\theta}$ times as often as the most frequent word.
  - Real data fits better with $\theta$ between 1.5 and 2.0.
19
Text
- Modeling natural language
  - Example: word distribution (Zipf's Law)
    - V = 1000, $\theta$ = 2
    - most frequent word: n = 300
    - 2nd most frequent: n = 76
    - 3rd most frequent: n = 33
    - 4th most frequent: n = 19
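The example can be approximated with a few lines of code. This sketch assumes the simple form of Zipf's Law in which the word of rank i occurs 1/i^θ times as often as the top-ranked word, and takes the slide's values (300 occurrences for the top word, θ = 2) as inputs. The function name `zipf_frequencies` is illustrative.

```python
def zipf_frequencies(top_count, theta, ranks):
    """Expected counts for ranks 1..ranks, given the count of the top word."""
    return [top_count / (i ** theta) for i in range(1, ranks + 1)]


freqs = zipf_frequencies(300, 2, 4)
for rank, f in enumerate(freqs, start=1):
    print(rank, round(f, 2))
```

The values for ranks 3 and 4 round to the slide's 33 and 19. Rank 2 comes out at 75 rather than the listed 76, so the slide's figures were presumably derived or rounded slightly differently.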
20
Text
- Modeling natural language
  - Skewed distribution: stopwords
  - Distribution of words in the documents
    - binomial distribution
    - Poisson distribution
21
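Text
- Modeling natural language
  - Number of distinct words
  - Heaps' Law: $V = K n^{\beta}$, where $n$ is the text size in words and $V$ is the vocabulary size.
  - The set of different words is bounded by a constant, but the limit is so high that vocabulary growth is better modeled as a function of text size.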
22
Text
- Modeling natural language
  - Heaps' Law example
    - K between 10 and 100; $\beta$ is less than 1
    - Example: n = 400000, $\beta$ = 0.5
      - K = 25: V = 15811
      - K = 35: V = 22135
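The figures in this example follow from Heaps' Law in its usual form, V = K · n^β. The sketch below recomputes them. Truncating the result to an integer is an assumption made here because it reproduces the slide's numbers, and the function name `heaps_vocabulary` is illustrative.

```python
def heaps_vocabulary(n, K, beta):
    """Estimated number of distinct words in a text of n words (truncated)."""
    return int(K * n ** beta)


for K in (25, 35):
    print(K, heaps_vocabulary(400000, K, 0.5))
```

Because β is less than 1, the vocabulary grows sublinearly: multiplying the text size by 4 with β = 0.5 only doubles the estimated vocabulary.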
23
Text
- Modeling natural language
  - Length of the words
    - Defines the total space needed for the vocabulary.
  - Heaps' Law: length increases logarithmically with text size.
  - In practice, a finite-state model is used:
    - a space has probability p = 0.2
    - a space cannot appear twice in a row
    - there are 26 letters
24
Text
- Similarity models
  - Distance function
    - Should be symmetric and satisfy the triangle inequality.
  - Hamming distance
    - Number of positions that have different characters.
    - Example: reverse, receive
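A minimal sketch of the Hamming distance, applied to the slide's pair of words. It assumes both strings have the same length, which is how the distance is normally defined; the function name `hamming` is illustrative.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


print(hamming("reverse", "receive"))  # differs at positions 3, 5 and 6
```

For "reverse" and "receive" the mismatches are v/c, r/i and s/v, so the distance is 3.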
25
Text
- Similarity models
  - Edit (Levenshtein) distance
    - Minimum number of operations needed to make the strings equal.
    - Example: survey, surgery
    - Superior for modeling syntactic errors.
    - Extensions: weights, transpositions, etc.
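The edit distance can be computed with the classic dynamic-programming recurrence. The sketch below assumes unit cost for each insertion, deletion and substitution, with none of the weighted or transposition extensions; the function name `edit_distance` is illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))       # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                       # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


print(edit_distance("survey", "surgery"))
```

For "survey" and "surgery" the result is 2: substitute "v" with "g", then insert "r" before the final "y".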
26
Text
- Similarity models
  - Longest Common Subsequence (LCS)
    - Example: survey, surgery; LCS: surey
  - Documents: lines as symbols (diff in Unix)
    - time consuming
    - similar lines
  - Fingerprints
  - Visual tools
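The LCS example can be reproduced with a standard dynamic-programming table followed by a backtracking pass. This is a sketch over characters; treating whole lines as symbols, as diff does, would use the same recurrence on lists of lines. The function name `lcs` is illustrative.

```python
def lcs(a, b):
    """Return one longest common subsequence of strings a and b."""
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])
    # Walk back from the bottom-right corner to recover the subsequence.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(lcs("survey", "surgery"))
```

The table needs time and space proportional to the product of the two lengths, which is one reason the slide calls the line-based comparison of documents time consuming.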
27
Conclusions
- Text is the main form of communicating knowledge.
- Documents have syntax, structure and semantics.
- Metadata: information about data
- Formats of text
- Modeling natural language
  - Entropy
  - Distribution of symbols
- Similarity