Recuperação de Informação B


1 Recuperação de Informação B (Information Retrieval B)
Chapter 6: Text and Multimedia Languages and Properties (Introduction, Metadata and Text), Sections 6.1, 6.2, 6.3. November 01, 1999

2 Introduction
Text: the main form of communicating knowledge.
Document: loosely defined; denotes a single unit of information. It can be any physical unit, such as a file or a Web page.

3 Introduction
Document
  Syntax and structure
  Semantics
  Information about itself

4 Introduction
Document syntax
  Implicit, or expressed in a language (e.g., TeX)
  Powerful languages: easier to parse, but difficult to convert to other formats
  Open languages are better (interchange)
Semantics of texts in natural language are not easy for a computer to understand
Trend: languages that provide information on structure, format and semantics while being readable by both humans and computers

5 Introduction
New applications are pushing for formats in which information can be represented independently of style.
Style: defined by the author, but the reader may decide part of it
Style can include treatment of other media

6 Metadata
"Data about the data"
  e.g., in a DBMS, the schema specifies the names of the relations, attributes, domains, etc.
Descriptive metadata
  Author, source, length
  Dublin Core Metadata Element Set
Semantic metadata
  Characterizes the subject matter within the document contents
  MEDLINE

7 Metadata MARC
100 0020 1 $aHagler, Ronald.
$aThe bibliographic...
$a3rd. Ed.
$aChicago :$bALA, $c1997

8 Metadata
Metadata information on Web documents
  cataloging, content rating, property rights, digital signatures
New standard: Resource Description Framework (RDF)
  description of Web resources to facilitate automated processing of information
  nodes and attached attribute/value pairs
Metadescription of non-textual objects
  keywords can be used to search the objects

9 Metadata RDF Example
<RDF:RDF>
  <RDF:Description RDF:HREF="page.html">
    <DC:Creator> John Smith </DC:Creator>
    <DC:Title> John’s Home Page </DC:Title>
  </RDF:Description>
</RDF:RDF>

10 Metadata RDF Schema Example

11 Text
Text coding in bits
EBCDIC, ASCII: initially 7 bits, later 8 bits
Unicode: 16 bits, to accommodate oriental languages
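As a rough illustration of these code widths, the sketch below counts the bytes a string occupies under a few of Python's standard codecs. The helper name `encoded_sizes` and the sample strings are assumptions made for illustration. `cp037` is only one of several EBCDIC code pages, and `utf-16-be` stands in for the 16-bit Unicode coding mentioned on the slide.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes `text` needs under several encodings.

    A value of None means the codec cannot represent the text at all.
    """
    sizes = {}
    for codec in ("ascii", "cp037", "utf-16-be"):
        try:
            sizes[codec] = len(text.encode(codec))
        except UnicodeEncodeError:
            # The character set of this codec does not cover the text.
            sizes[codec] = None
    return sizes


if __name__ == "__main__":
    print(encoded_sizes("text"))  # Latin letters: one byte each in ASCII/EBCDIC
    print(encoded_sizes("文"))    # a CJK character: only the Unicode codec works
```

The single-byte codes cannot represent the CJK character at all, which is the motivation the slide gives for a wider code.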

12 Text Formats
No single format exists
IR systems should retrieve information from different formats
  Past: IR systems converted the documents
  Today: IR systems use filters

13 Text Formats
Formats for document interchange (RTF)
Formats for displaying (PDF, PostScript)
Formats for encoding (MIME)
Compressed files
uuencode/uudecode, BinHex

14 Text Information Theory
The amount of information is related to the distribution of symbols in the document.
Entropy: its definition depends on the probability of each symbol.
Text models are used to obtain those probabilities.
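A minimal sketch of the entropy computation, assuming the standard Shannon definition E = -Σ p_i log2 p_i and the simplest possible text model, in which each symbol's probability is its relative frequency in the text itself. The function name `entropy` and the choice of characters as symbols are assumptions for illustration, not taken from the slides.

```python
import math
from collections import Counter


def entropy(text: str) -> float:
    """Shannon entropy, in bits per symbol, of the characters in `text`.

    Probabilities are estimated as relative frequencies (a zero-order model).
    """
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    for sample in ("aaaa", "abab", "abcd"):
        print(sample, entropy(sample))
```

A text made of one repeated symbol carries no information under this model, while a text whose symbols are all equally likely reaches the maximum of log2 of the alphabet size.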

15 Text Example - Entropy

16 Text Example - Entropy

17 Text Modeling Natural Language
Symbols: either separate words or belong to words
Symbols are not uniformly distributed: binomial model
Dependency on previous symbols: k-order Markovian model
We can take words as symbols

18 Text Modeling Natural Language
Distribution of words inside documents
Zipf’s Law: the i-th most frequent word appears 1/i^θ times as often as the most frequent word
Real data fit better with θ between 1.5 and 2.0

19 Text Modeling Natural Language
Example - word distribution (Zipf’s Law)
V=1000, θ = 2
  most frequent word: n=300
  2nd most frequent: n=76
  3rd most frequent: n=33
  4th most frequent: n=19
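The rank-frequency relation can be sketched in a few lines, assuming the form in which the i-th most frequent word occurs 1/i^θ times as often as the most frequent one. The function name `zipf_counts` is hypothetical. With a top count of 300 and θ = 2, the computed values are close to, but not identical with, the figures listed on the slide, which may have been rounded or normalized differently.

```python
def zipf_counts(top_count: float, theta: float, ranks: int) -> list:
    """Expected occurrence counts for ranks 1..ranks under Zipf's Law.

    The word at rank i is assumed to occur top_count / i**theta times.
    """
    return [top_count / (i ** theta) for i in range(1, ranks + 1)]


if __name__ == "__main__":
    for rank, count in enumerate(zipf_counts(300, 2, 4), start=1):
        print(rank, round(count, 2))
```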

20 Text Modeling Natural Language
Skewed distribution - stopwords
Distribution of words in the documents
  binomial distribution
  Poisson distribution

21 Text Modeling Natural Language
Number of distinct words
Heaps’ Law: V = K n^β
The set of different words is bounded by a constant, but the limit is too high to matter in practice

22 Text Modeling Natural Language
Heaps’ Law example
K between 10 and 100, β less than 1
Example: n=400000, β = 0.5
  K=25, V=15811
  K=35, V=22135
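The example values are consistent with the usual statement of Heaps' Law, V = K · n^β, where n is the text size in words and V the vocabulary size. The sketch below assumes that form. The function name `heaps_vocabulary` is hypothetical.

```python
def heaps_vocabulary(n: int, k: float, beta: float) -> float:
    """Estimated number of distinct words in a text of n words (V = K * n**beta)."""
    return k * n ** beta


if __name__ == "__main__":
    for k in (25, 35):
        # Truncate to a whole number of words, as the slide's figures appear to do.
        print(k, int(heaps_vocabulary(400000, k, 0.5)))
```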

23 Text Modeling Natural Language
Length of the words
  defines the total space needed for the vocabulary
  Heaps’ Law: length increases logarithmically with text size
In practice, a finite-state model is used
  the space character has p=0.2
  a space cannot appear twice in a row
  there are 26 letters

24 Text Similarity Models
Distance function
  should be symmetric and satisfy the triangle inequality
Hamming distance
  number of positions that have different characters
  example: reverse, receive
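A minimal sketch of the Hamming distance, applied to the slide's pair of words. The function name `hamming_distance` is an assumption, as is the choice to reject strings of different lengths, for which the position-by-position comparison is not defined.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    print(hamming_distance("reverse", "receive"))
```

For "reverse" and "receive" the strings differ in the third, fifth and sixth positions, so the distance is 3.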

25 Text Similarity Models
Edit (Levenshtein) distance
  minimum number of operations needed to make two strings equal
  example: survey, surgery
  superior for modeling syntactic errors
  extensions: weights, transpositions, etc.
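The edit distance is commonly computed with dynamic programming. The sketch below assumes the basic unweighted variant, in which insertions, deletions and substitutions each cost 1. The function name `edit_distance` is hypothetical, and the weighted and transposition extensions are not covered.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b with unit-cost operations."""
    # previous[j] holds the distance between the prefix of a seen so far and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(edit_distance("survey", "surgery"))
```

For "survey" and "surgery" the result is 2: substitute "v" with "g" and insert an "r". Only two rows of the table are kept, so memory use is proportional to the length of the second string.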

26 Text Similarity Models
Longest Common Subsequence (LCS)
  survey - surgery: LCS is surey
Documents: lines as symbols (diff in Unix)
  time consuming
  similar lines
Fingerprints
Visual tools
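The LCS example can be reproduced with the classic dynamic-programming table followed by a backward walk that recovers one longest subsequence. This is a sketch under the assumption that characters are the symbols; the function name `longest_common_subsequence` is hypothetical. It is not the algorithm of any particular diff tool.

```python
def longest_common_subsequence(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b."""
    rows, cols = len(a), len(b)
    # table[i][j] = length of the LCS of a[:i] and b[:j]
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back from the bottom-right corner to collect the matched symbols.
    result = []
    i, j = rows, cols
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            result.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(result))


if __name__ == "__main__":
    print(longest_common_subsequence("survey", "surgery"))
```

The table has one cell per pair of positions, which is why the approach becomes time consuming when whole documents are compared line by line.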

27 Conclusions
Text is the main form of communicating knowledge.
Documents have syntax, structure and semantics
Metadata: information about data
Formats of text
Modeling natural language
  Entropy
  Distribution of symbols
Similarity

