Recuperação de Informação B Cap. 06: Text and Multimedia Languages and Properties (Introduction, Metadata and Text) 6.1, 6.2, 6.3 November 01, 1999.


1 Recuperação de Informação B Cap. 06: Text and Multimedia Languages and Properties (Introduction, Metadata and Text) 6.1, 6.2, 6.3 November 01, 1999

2 Introduction
  - Text
    - Main form of communicating knowledge.
  - Document
    - Loosely defined; denotes a single unit of information.
    - Can be any physical unit:
      - a file
      - an email
      - a Web page

3 Introduction
  - Document
    - Syntax and structure
    - Semantics
    - Information about itself

4 Introduction
  - Document Syntax
    - Implicit, or expressed in a language (e.g., TeX)
    - Powerful languages: easier to parse, but harder to convert to other formats
    - Open languages are better (interchange)
    - The semantics of natural-language text are not easy for a computer to understand
    - Trend: languages that provide information on structure, format and semantics while remaining readable by both humans and computers

5 Introduction
  - New applications are pushing for formats in which information can be represented independently of style.
  - Style: defined by the author, but the reader may decide part of it
  - Style can include the treatment of other media

6 Metadata
  - "Data about the data"
    - e.g., in a DBMS, the schema specifies the names of the relations, attributes, domains, etc.
  - Descriptive Metadata
    - Author, source, length
    - Dublin Core Metadata Element Set
  - Semantic Metadata
    - Characterizes the subject matter within the document contents
    - MEDLINE

7 Metadata
  - MARC
      100 00201 $aHagler, Ronald.
      245 007414$aThe bibliographic...
      250 0012$a3rd. Ed.
      260 0052$aChicago :$bALA, $c1997
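As a rough illustration of how the subfields in a MARC-style line could be read by a program, here is a minimal sketch. The function name `parse_subfields` is hypothetical, and the parsing rule (a `$`, then a one-letter code, then the value) is an assumption based only on the record shown on the slide, not on the full MARC specification.

```python
def parse_subfields(field_body: str) -> dict:
    """Split a MARC-style field body such as '$aChicago :$bALA, $c1997'.

    Assumption: each subfield starts with '$' followed by a one-character code.
    """
    subfields = {}
    for chunk in field_body.split("$")[1:]:  # text before the first '$' is ignored
        if chunk:
            code, value = chunk[0], chunk[1:].strip()
            subfields[code] = value
    return subfields


# Field 260 from the slide: place of publication, publisher, year.
imprint = parse_subfields("$aChicago :$bALA, $c1997")
print(imprint)
```

The punctuation that MARC keeps inside the values (the trailing colon and comma) is left untouched here; a real cataloguing tool would normally handle it separately.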

8 Metadata
  - Metadata information on Web documents
    - cataloging, content rating, property rights, digital signatures
  - New standard: Resource Description Framework (RDF)
    - description of Web resources to facilitate automated processing of information
    - nodes and attached attribute/value pairs
  - Metadescription of non-textual objects
    - keywords can be used to search the objects

9 Metadata
  - RDF Example (markup not preserved in the transcript; only the literal values remain)
    - John Smith
    - John's Home Page

10 Metadata
  - RDF Schema Example

11 Text
  - Text coding in bits
    - EBCDIC, ASCII
      - Initially 7 bits; later, 8 bits
    - Unicode
      - 16 bits, to accommodate oriental languages
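The difference in code width can be seen by encoding the same string both ways. This is only a sketch: the sample word is arbitrary, and UTF-16 (big-endian, without a byte-order mark) is used here as a stand-in for the "16 bits" Unicode coding mentioned on the slide.

```python
text = "information"  # arbitrary sample word

ascii_bytes = text.encode("ascii")      # one byte per character, only 7 bits used
utf16_bytes = text.encode("utf-16-be")  # two bytes per character for these letters

bits_ascii = 8 * len(ascii_bytes)
bits_utf16 = 8 * len(utf16_bytes)

print(len(text), "characters")
print(bits_ascii, "bits as ASCII stored in 8-bit bytes")
print(bits_utf16, "bits as UTF-16")
print("highest ASCII code used:", max(ascii_bytes))  # always below 128
```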

12 Text
  - Formats
    - No single format exists
    - An IR system should retrieve information from different formats
    - Past: IR systems converted the documents
    - Today: IR systems use filters

13 Text
  - Formats
    - Formats for document interchange (RTF)
    - Formats for displaying (PDF, PostScript)
    - Formats for encoding email (MIME)
    - Compressed or encoded files
      - uuencode/uudecode, BinHex

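14 Text
  - Information Theory
    - The amount of information is related to the distribution of symbols in the document.
    - Entropy: $E = -\sum_{i=1}^{\sigma} p_i \log_2 p_i$, where $p_i$ is the probability of the $i$-th of the $\sigma$ symbols of the alphabet
    - The definition of entropy depends on the probabilities of each symbol.
    - Text models are used to obtain those probabilities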

15 Text
  - Example - Entropy
    - 001001011011

16 Text
  - Example - Entropy
    - 111111111111
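The two example strings can be checked with a short script. This is a minimal sketch that assumes the simplest text model: each symbol's probability is its relative frequency in the string itself, and entropy is the sum of p * log2(1/p) over the distinct symbols. The function name `entropy` is just for illustration.

```python
from collections import Counter
from math import log2


def entropy(symbols: str) -> float:
    """Entropy in bits per symbol, with probabilities taken from the string itself."""
    counts = Counter(symbols)
    total = len(symbols)
    # p * log2(1/p) with p = c / total
    return sum((c / total) * log2(total / c) for c in counts.values())


mixed = entropy("001001011011")     # six 0s and six 1s
constant = entropy("111111111111")  # a single repeated symbol

print(mixed, constant)
```

Under this model the first string, with both symbols equally frequent, carries one bit per symbol, while the second, which is fully predictable, carries none.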

17 Text
  - Modeling Natural Language
    - Symbols: some separate words, the others belong to words
    - Symbols are not uniformly distributed
      - binomial model
    - Dependency on previous symbols
      - k-order Markov model
    - We can take words as symbols
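A k-order model of this kind can be estimated by counting which symbol follows each context of k symbols. The sketch below is one straightforward way to do that; the function name `markov_model` and the sample string are illustrative assumptions, not taken from the slides.

```python
from collections import Counter, defaultdict


def markov_model(text: str, k: int) -> dict:
    """Estimate P(next symbol | previous k symbols) from raw counts."""
    counts = defaultdict(Counter)
    for i in range(k, len(text)):
        context = text[i - k:i]  # the k symbols just before position i
        counts[context][text[i]] += 1
    return {
        context: {sym: c / sum(followers.values()) for sym, c in followers.items()}
        for context, followers in counts.items()
    }


# Order 1: each symbol depends only on the one before it.
model = markov_model("abracadabra", k=1)
print(model["a"])  # distribution of the symbols that follow "a"

# Order 0 reduces to plain symbol frequencies (the context is the empty string).
frequencies = markov_model("abracadabra", k=0)[""]
```

The same function works with words as symbols if `text` is replaced by a list of words and the context is built as a tuple; that variation is left out to keep the sketch short.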

18 Text
  - Modeling Natural Language
    - Distribution of words inside documents
    - Zipf's Law: the i-th most frequent word appears $1/i^{\theta}$ times as often as the most frequent word
    - Real data fits better with θ between 1.5 and 2.0

19 Text
  - Modeling Natural Language
    - Example - word distribution (Zipf's Law)
      - V = 1000, θ = 2
      - most frequent word: n = 300
      - 2nd most frequent: n = 76
      - 3rd most frequent: n = 33
      - 4th most frequent: n = 19
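The figures in this example can be approximated directly from the ratio in Zipf's Law. The sketch below assumes the count of the i-th word is simply the top count divided by i^θ; the function name `zipf_counts` is hypothetical. With a top count of exactly 300 the second word comes out at 75 rather than the slide's 76, so the slide's numbers presumably come from an unrounded or normalized top count.

```python
def zipf_counts(top_count: float, theta: float, ranks: int) -> list:
    """Expected occurrences of the 1st..ranks-th most frequent words."""
    return [top_count / i ** theta for i in range(1, ranks + 1)]


counts = zipf_counts(top_count=300, theta=2, ranks=4)
for rank, n in enumerate(counts, start=1):
    print(rank, round(n, 2))
```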

20 Text
  - Modeling Natural Language
    - Skewed distribution - stopwords
    - Distribution of words in the documents
      - binomial distribution
      - Poisson distribution

21 Text
  - Modeling Natural Language
    - Number of distinct words
    - Heaps' Law: $V = K n^{\beta}$, where n is the text size in words and V is the vocabulary size
    - The set of distinct words is bounded by a constant, but the bound is very high

22 Text
  - Modeling Natural Language
    - Heaps' Law example
      - K between 10 and 100; β is less than 1
      - example: n = 400000, β = 0.5
          K = 25, V = 15811
          K = 35, V = 22135
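The two vocabulary sizes in the example follow from evaluating V = K * n^β with the stated parameters. A minimal sketch, with the hypothetical function name `heaps_vocabulary` and the result truncated to an integer as the slide's figures appear to be:

```python
def heaps_vocabulary(n: int, K: float, beta: float) -> int:
    """Vocabulary size predicted by Heaps' Law, V = K * n**beta, truncated."""
    return int(K * n ** beta)


v_low = heaps_vocabulary(n=400_000, K=25, beta=0.5)
v_high = heaps_vocabulary(n=400_000, K=35, beta=0.5)
print(v_low, v_high)
```

Because β is below 1, the vocabulary grows sublinearly: multiplying the text size by 100 with β = 0.5 multiplies the predicted vocabulary only by 10.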

23 Text
  - Modeling Natural Language
    - Length of the words
      - defines the total space needed for the vocabulary
    - Heaps' Law: length increases logarithmically with text size.
    - In practice, a finite-state model is used
      - a space has probability p = 0.2
      - a space cannot appear twice in a row
      - there are 26 letters
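The finite-state model can be simulated to see what word lengths it produces. This sketch makes two assumptions beyond the slide: the 26 letters are drawn uniformly, and the text starts with a letter. The function name `generate_text` and the fixed seed are illustrative only.

```python
import random
import string


def generate_text(length: int, p_space: float = 0.2, seed: int = 0) -> str:
    """Random text: a space with probability p_space, never two spaces in a row."""
    rng = random.Random(seed)
    out = []
    prev_space = True  # treat the start like "just after a space": begin with a letter
    for _ in range(length):
        if not prev_space and rng.random() < p_space:
            out.append(" ")
            prev_space = True
        else:
            out.append(rng.choice(string.ascii_lowercase))  # assumption: uniform letters
            prev_space = False
    return "".join(out)


text = generate_text(100_000)
words = text.split()
mean_length = sum(map(len, words)) / len(words)
print(round(mean_length, 2))
```

Under these assumptions each word ends after any letter with probability 0.2, so word lengths follow a geometric distribution with a mean of about 1/0.2 = 5 letters.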

24 Text
  - Similarity Models
    - Distance Function
      - Should be symmetric and satisfy the triangle inequality
    - Hamming Distance
      - number of positions that have different characters
      - example: reverse, receive
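For the slide's word pair, the Hamming distance can be computed by comparing the strings position by position. A minimal sketch (the function name `hamming` is illustrative; the distance is only defined here for strings of equal length):

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


distance = hamming("reverse", "receive")
print(distance)  # positions 3, 5 and 6 differ: v/c, r/i, s/v
```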

25 Text
  - Similarity Models
    - Edit (Levenshtein) Distance
      - minimum number of operations (character insertions, deletions and substitutions) needed to make the strings equal
      - example: survey, surgery
      - superior for modeling syntactic errors
      - extensions: weights, transpositions, etc.
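The edit distance for the slide's word pair can be computed with the usual dynamic-programming table, kept here one row at a time. This is a sketch of the plain unit-cost version (insertions, deletions and substitutions each cost 1), without the weights or transpositions mentioned as extensions; the function name `edit_distance` is illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance between a and b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free match
            ))
        prev = curr
    return prev[-1]


distance = edit_distance("survey", "surgery")
print(distance)  # substitute v -> g, then insert r before y
```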

26 Text
  - Similarity Models
    - Longest Common Subsequence (LCS)
      - example: survey - surgery, LCS: surey
    - Documents: lines as symbols (diff in Unix)
      - time consuming
      - similar lines
    - Fingerprints
    - Visual tools
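The LCS of the two example words can be recovered with a standard dynamic-programming table followed by a walk back through it. A minimal sketch, with the illustrative function name `lcs`; when several subsequences share the maximum length it returns just one of them.

```python
def lcs(a: str, b: str) -> str:
    """One longest common subsequence of a and b."""
    m, n = len(a), len(b)
    # table[i][j] = length of the LCS of a[:i] and b[:j]
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])

    # Walk back from the bottom-right corner, collecting matched characters.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


common = lcs("survey", "surgery")
print(common)
```

The same function accepts lists of lines instead of strings if the final join is replaced by returning the list, which is the idea behind comparing documents line by line.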

27 Conclusions
  - Text is the main form of communicating knowledge.
  - Documents have syntax, structure and semantics
  - Metadata: information about data
  - Formats of text
  - Modeling Natural Language
    - Entropy
    - Distribution of symbols
  - Similarity

