Recuperação de Informação B


Recuperação de Informação B (Information Retrieval B)
Chapter 6: Text and Multimedia Languages and Properties (Introduction, Metadata and Text)
Sections 6.1, 6.2, 6.3
November 01, 1999

Introduction
- Text: the main form of communicating knowledge.
- Document: loosely defined; denotes a single unit of information.
- A document can be any physical unit: a file, an email, a Web page.

Introduction
A document has:
- Syntax and structure
- Semantics
- Information about itself

Introduction
Document syntax:
- Implicit, or expressed in a language (e.g., TeX).
- Powerful languages are easier to parse but difficult to convert to other formats.
- Open languages are better for interchange.
- The semantics of natural-language text is not easy for a computer to understand.
- Trend: languages that provide information on structure, format, and semantics while being readable by both humans and computers.

Introduction
- New applications are pushing for formats in which information can be represented independently of style.
- Style is defined by the author, but the reader may decide part of it.
- Style can include the treatment of other media.

Metadata
- “Data about the data”; e.g., in a DBMS, the schema specifies the names of the relations, attributes, domains, etc.
- Two kinds: descriptive metadata and semantic metadata.
- Descriptive metadata: author, source, length. Example: Dublin Core Metadata Element Set.
- Semantic metadata: characterizes the subject matter within the document contents. Example: MEDLINE.

Metadata
MARC record example:
100 0020 1  $aHagler, Ronald.
245 0074 14 $aThe bibliographic...
250 0012    $a3rd. Ed.
260 0052    $aChicago :$bALA, $c1997

Metadata
- Metadata information on Web documents: cataloging, content rating, property rights, digital signatures.
- New standard: Resource Description Framework (RDF).
  - Description of Web resources to facilitate automated processing of information.
  - Nodes and attached attribute/value pairs.
- Metadescription of non-textual objects: keywords can be used to search the objects.

Metadata: RDF Example
<RDF:RDF>
  <RDF:Description RDF:HREF = "page.html">
    <DC:Creator> John Smith </DC:Creator>
    <DC:Title> John's Home Page </DC:Title>
  </RDF:Description>
</RDF:RDF>
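As a rough illustration of the "nodes and attached attribute/value pairs" view, the RDF description above can be held as plain Python data and flattened into (resource, property, value) triples. The names `description`, `to_triples`, and `find_by_keyword` are hypothetical and do not come from any RDF library; this is a sketch of the data model, not an RDF parser.

```python
# Hypothetical in-memory form of the slide's RDF description:
# one node (the resource) with attached attribute/value pairs.
description = {
    "resource": "page.html",
    "properties": {
        "DC:Creator": "John Smith",
        "DC:Title": "John's Home Page",
    },
}


def to_triples(desc):
    """Flatten a description into (resource, property, value) triples."""
    return [(desc["resource"], prop, value)
            for prop, value in desc["properties"].items()]


def find_by_keyword(triples, keyword):
    """Return the resources whose metadata values mention the keyword."""
    keyword = keyword.lower()
    return sorted({res for res, _, value in triples if keyword in value.lower()})


triples = to_triples(description)
print(triples)
print(find_by_keyword(triples, "smith"))
```

The keyword lookup mirrors the idea that metadata values can be searched even when the described object itself is not text.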

Metadata: RDF Schema Example

Text
Text coding in bits:
- Coding schemes: EBCDIC, ASCII, Unicode.
- Initially 7 bits per character (ASCII); later 8 bits.
- Unicode: 16 bits, to accommodate oriental languages.
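The code sizes above can be checked with Python's built-in codecs. This is a small sketch under two assumptions: the `cp037` codec (one EBCDIC code page) stands in for EBCDIC in general, and UTF-16 stands in for the slide's 16-bit Unicode (characters outside the Basic Multilingual Plane take two 16-bit units in UTF-16). The helper name `code_sizes` is made up for illustration.

```python
def code_sizes(ch):
    """Return the encoded bytes of one character under three codings."""
    sizes = {}
    for name, codec in (("ASCII", "ascii"),
                        ("EBCDIC", "cp037"),
                        ("Unicode", "utf-16-be")):
        try:
            sizes[name] = ch.encode(codec)
        except UnicodeEncodeError:
            sizes[name] = None  # character not representable in this coding
    return sizes


latin = code_sizes("A")
kanji = code_sizes("漢")

print(latin["ASCII"].hex(), latin["EBCDIC"].hex(), latin["Unicode"].hex())
print(kanji)
```

The letter A fits in 7 bits in ASCII (0x41), needs the eighth bit in this EBCDIC code page (0xC1), and takes two bytes in UTF-16; the Kanji character has no ASCII code at all.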

Text: Formats
- No single format exists.
- An IR system should retrieve information from different formats.
- Past: IR systems converted the documents.
- Today: IR systems use filters.

Text: Formats
- Formats for document interchange (RTF)
- Formats for displaying (PDF, PostScript)
- Formats for encoding email (MIME)
- Compressed files
- uuencode/uudecode, binhex

Text: Information Theory
- The amount of information is related to the distribution of symbols in the document.
- Entropy: its definition depends on the probability of each symbol.
- Text models are used to obtain those probabilities.

Text Example - Entropy 001001011011

Text Example - Entropy 111111111111
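The two example strings can be compared with a short computation. This sketch assumes the usual Shannon entropy, the sum over symbols of p_i * log2(1 / p_i), with each probability p_i estimated from the symbol's relative frequency in the string itself (a zero-order model); the function name `entropy` is illustrative.

```python
import math
from collections import Counter


def entropy(symbols):
    """Shannon entropy in bits per symbol, using relative frequencies."""
    counts = Counter(symbols)
    total = len(symbols)
    # p * log2(1 / p) with p = count / total
    return sum((c / total) * math.log2(total / c) for c in counts.values())


print(entropy("001001011011"))  # six 0s and six 1s -> 1.0 bit per symbol
print(entropy("111111111111"))  # a single repeated symbol -> 0.0
```

Under this model the balanced string carries one bit per symbol, while the constant string carries none, since every symbol is fully predictable.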

Text: Modeling Natural Language
- Symbols either separate words or belong to words.
- Symbols are not uniformly distributed: binomial model.
- Symbols depend on the previous symbols: k-order Markovian model.
- Words can also be taken as symbols.
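A k-order Markovian model can be estimated by counting which symbol follows each context of k symbols. The sketch below is one simple way to do that with plain frequency counts (no smoothing); the function name `markov_model` and the sample string are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def markov_model(text, k):
    """Estimate P(next symbol | previous k symbols) from frequency counts."""
    counts = defaultdict(Counter)
    for i in range(len(text) - k):
        context = text[i:i + k]
        counts[context][text[i + k]] += 1
    model = {}
    for context, followers in counts.items():
        total = sum(followers.values())
        model[context] = {sym: c / total for sym, c in followers.items()}
    return model


model = markov_model("abracadabra", k=1)
print(model["a"])  # 'a' is followed by b twice, c once, d once
print(model["b"])  # 'b' is always followed by r
```

Passing a list of words instead of a string is not supported by this sketch as written, but the same counting idea applies when words are taken as symbols.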

Text: Modeling Natural Language
- Distribution of words inside documents.
- Zipf's Law: the i-th most frequent word appears 1/i^θ times as often as the most frequent word (θ = 1 in the simplest form).
- Real data fits better with θ between 1.5 and 2.0.

Text: Modeling Natural Language
Example: word distribution (Zipf's Law) with V = 1000, θ = 2
- Most frequent word: n = 300
- 2nd most frequent: n = 76
- 3rd most frequent: n = 33
- 4th most frequent: n = 19
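The example can be approximated by dividing the count of the most frequent word by i^θ for each rank i. This is a sketch assuming that form of Zipf's Law; the function name `zipf_counts` is illustrative. With a top count of 300 and θ = 2 it gives 75 for the second rank (the slide lists 76) and matches the third and fourth ranks after rounding.

```python
def zipf_counts(top_count, theta, ranks):
    """Expected count of the i-th most frequent word for i = 1..ranks."""
    return [top_count / (i ** theta) for i in range(1, ranks + 1)]


counts = zipf_counts(top_count=300, theta=2, ranks=4)
print(counts)
print([round(c) for c in counts])
```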

Text: Modeling Natural Language
- Skewed distribution: stopwords.
- Distribution of words in the documents:
  - binomial distribution
  - Poisson distribution

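Text: Modeling Natural Language
- Number of distinct words (vocabulary size).
- Heaps' Law: a text of n words has a vocabulary of size V = K n^β.
- The set of different words of a language is bounded by a constant, but the bound is too high to matter in practice.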

Text: Modeling Natural Language
Heaps' Law example
- K is between 10 and 100; β is less than 1.
- Example: n = 400000, β = 0.5
  - K = 25: V = 15811
  - K = 35: V = 22135
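The figures in this example follow from Heaps' Law in the form V = K n^β. The sketch below assumes that form and truncates the result to an integer, which is what reproduces the slide's values; the function name `heaps_vocabulary` is illustrative.

```python
def heaps_vocabulary(n, K, beta):
    """Estimated vocabulary size for a text of n words: V = K * n**beta."""
    return K * n ** beta


for K in (25, 35):
    v = heaps_vocabulary(n=400000, K=K, beta=0.5)
    print(K, int(v))  # truncated, as in the slide's example
```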

Text: Modeling Natural Language
- The length of the words defines the total space needed for the vocabulary.
- Heaps' Law: word length increases logarithmically with text size.
- In practice, a finite-state model is used:
  - the space has probability p = 0.2;
  - the space cannot appear twice in a row;
  - there are 26 letters.
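The finite-state model can be simulated to see what word lengths it produces. This sketch rests on assumptions the slide does not state: the 26 letters are equally likely, and after any letter the next symbol is a space with probability 0.2. The names `generate_text` and `mean_word_length` are illustrative, and the fixed seed only makes the run repeatable.

```python
import random
import string


def generate_text(n_chars, p_space=0.2, seed=0):
    """Generate text from a two-state model: after a space, always a letter."""
    rng = random.Random(seed)
    out = []
    after_space = True  # start as if a space came first, so text begins with a letter
    for _ in range(n_chars):
        if not after_space and rng.random() < p_space:
            out.append(" ")
            after_space = True
        else:
            out.append(rng.choice(string.ascii_lowercase))
            after_space = False
    return "".join(out)


def mean_word_length(text):
    """Average length of the space-separated words in the text."""
    words = text.split()
    return sum(len(w) for w in words) / len(words)


sample = generate_text(100_000)
print(mean_word_length(sample))
```

Under these assumptions a word ends after each letter with probability 0.2, so the average word length should come out close to 1 / 0.2 = 5 letters.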

Text: Similarity Models
- Distance function: should be symmetric and satisfy the triangle inequality.
- Hamming distance: the number of positions that have different characters in two strings of the same length.
- Example: reverse, receive
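The Hamming distance for the example pair can be computed directly by comparing the strings position by position. A minimal sketch; the function name `hamming_distance` is illustrative.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of the same length")
    return sum(x != y for x, y in zip(a, b))


# v/c, r/i and s/v differ, so the distance is 3
print(hamming_distance("reverse", "receive"))
```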

Text: Similarity Models
- Edit (Levenshtein) distance: the minimum number of operations needed to make the strings equal.
- Example: survey, surgery
- Superior for modeling syntactic errors.
- Extensions: weights, transpositions, etc.
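The edit distance is commonly computed with dynamic programming over prefixes of the two strings. The sketch below assumes the basic variant with unit-cost insertions, deletions, and substitutions (none of the weighted or transposition extensions); the function name `edit_distance` is illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# substitute v -> g, then insert r: two operations
print(edit_distance("survey", "surgery"))
```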

Text: Similarity Models
- Longest Common Subsequence (LCS). Example: survey, surgery; LCS: surey
- Documents: take lines as symbols (as diff does in Unix).
  - Time consuming.
  - Does not consider lines that are similar.
- Fingerprints
- Visual tools
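The LCS of the example pair can be recovered with the standard dynamic-programming table followed by a backward walk. A minimal sketch for strings; the function name `lcs` is illustrative, and when several subsequences tie for the longest length it returns just one of them.

```python
def lcs(a, b):
    """Return one longest common subsequence of two strings."""
    m, n = len(a), len(b)
    # table[i][j] = length of the LCS of a[:i] and b[:j]
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])
    # Walk back from the bottom-right corner to collect the subsequence
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(lcs("survey", "surgery"))  # surey
```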

Conclusions
- Text is the main form of communicating knowledge.
- Documents have syntax, structure, and semantics.
- Metadata: information about data.
- Formats of text.
- Modeling natural language:
  - Entropy
  - Distribution of symbols
- Similarity