

Content Types: Text and Metadata

Introduction
Text documents come in many forms
  – Article (news, conference, journal, etc.)
  – Email, memo, …
  – Book, manual, manuscript, transcript, …
  – Any part of one of the above
Syntax can express
  – Structure
  – Presentation style
  – Semantics (e.g. software code)

Metadata
Metadata: data about data
Descriptive metadata
  – External to the meaning of the document
  – Author, publication date, document source, document length, document genre, file type, bits per second, frame rate, etc.
Semantic metadata
  – Characterizes the semantic content of the document
  – Library of Congress (LoC) subject headings, keywords, subject headings from ontologies (e.g. MeSH), etc.

Metadata Formats
Machine Readable Cataloging Record (MARC)
  – Used by most libraries
  – Fields include title, author, etc.
Resource Description Framework (RDF)
  – Used for Web resources
  – Nodes with attribute / value pairs
  – A node ID is any Uniform Resource Identifier (URI), which could be a URL

Metadata Sets
Dublin Core Metadata Elements
  1. Contributor – entities contributing to the content
  2. Coverage – extent or scope of content (spatial area, temporal period, …)
  3. Creator – entity primarily responsible for making the content
  4. Date – date associated with an event (e.g. publication) for the resource
  5. Description – abstract, table of contents, …
  6. Format – media (file) type, dimensions (size, duration), hardware needed
  7. Identifier – unique identifier
  8. Language – language of content
  9. Publisher – entity responsible for making the resource available
  10. Relation – reference to related resource(s)
  11. Rights – information about rights held in/over the resource
  12. Source – resource from which the content is derived
  13. Subject – keywords, key phrases, classification code, etc.
  14. Title – name of the resource
  15. Type – nature or genre of content
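
As a rough illustration of how a few of these elements might be attached to a resource, the sketch below builds a small XML record with Python's standard library. The wrapper element name `record`, the helper `dublin_core_record`, and the sample values for `type` and `language` are assumptions made for this example; only the element names and the Dublin Core namespace URI come from the element set itself.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"


def dublin_core_record(fields):
    """Build an XML element holding one child per Dublin Core field.

    `fields` maps lower-case element names (e.g. "title") to string values.
    The wrapper element name "record" is a hypothetical choice.
    """
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("record")
    for name, value in fields.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return root


record = dublin_core_record({
    "title": "Content Types: Text and Metadata",
    "type": "Presentation",  # illustrative value
    "language": "en",        # illustrative value
})
print(ET.tostring(record, encoding="unicode"))
```

Each element is optional and repeatable in Dublin Core, so a dict with one value per name is a simplification.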

Text Formats
Coding schemes
  – EBCDIC (8 bit, one of the first coding schemes)
  – ASCII (initially 7 bit, extended to 8 bit)
  – Unicode (originally 16 bit, for large alphabets; now usually stored as UTF-8 or UTF-16)
Additional formats
  – RTF (format-oriented document exchange)
  – PDF and PostScript (display-oriented representation)
  – Multipurpose Internet Mail Extensions (MIME) (multiple character sets, languages, media)
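
To make the difference between coding schemes concrete, the following sketch encodes the same text under several codecs and reports the number of bytes each needs. The helper name `encoded_sizes` is hypothetical, and `cp037` is used here as one representative EBCDIC code page among many.

```python
def encoded_sizes(text, codecs=("ascii", "cp037", "utf-8", "utf-16-le")):
    """Return the encoded length in bytes of `text` under each codec.

    A value of None means the codec cannot represent the text at all.
    """
    sizes = {}
    for codec in codecs:
        try:
            sizes[codec] = len(text.encode(codec))
        except UnicodeEncodeError:
            sizes[codec] = None
    return sizes


# The same letter has different byte values in ASCII and EBCDIC.
print("A".encode("ascii"), "A".encode("cp037"))
print(encoded_sizes("A"))
print(encoded_sizes("€"))
```

A character outside the ASCII range, such as the euro sign, cannot be encoded in ASCII and takes more than one byte in UTF-8.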

Information Theory
How can we predict the information value of the components of a document?
Entropy attempts to model information content (information uncertainty):
  $E = -\sum_{i} p_i \log_2 p_i$
  – The sum runs over all symbols in the alphabet
  – $p_i$ is the probability of symbol $i$ (its frequency divided by the total number of symbols)
Need a text model for real language
Also important for compression, as $E$ acts as a limit on how much a text can be compressed (in bits per symbol, under the chosen model)
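
The entropy formula can be sketched in a few lines, assuming the simplest possible text model in which each character is a symbol and its probability is estimated from its frequency in the text itself. The function name `entropy` is chosen for this example.

```python
import math
from collections import Counter


def entropy(text):
    """Shannon entropy of `text` in bits per symbol.

    Probabilities are estimated as symbol frequency over text length,
    i.e. a 0-order model fitted to the text itself.
    """
    total = len(text)
    counts = Counter(text)
    # p * log2(1/p) is the same term as -p * log2(p).
    return sum((c / total) * math.log2(total / c) for c in counts.values())


print(entropy("aaaa"))  # a single repeated symbol carries no uncertainty
print(entropy("aabb"))  # two equally likely symbols
print(entropy("abcd"))  # four equally likely symbols
```

Multiplying the result by the text length gives a rough estimate of the minimum number of bits a symbol-by-symbol coder would need under this model.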

Modeling Character Strings
Symbols in natural language (NL) are not evenly distributed
  – Some symbols are not part of words (often used for syntax)
  – Symbols in words are not evenly distributed
Models
  – The binomial model uses the distribution of symbols in the language
      But previous symbols influence the probabilities of later symbols (what letter will appear after a q?)
  – Finite-context or Markovian models are used for this dependency
      k-order, where k is the number of previous characters taken into account by the model
      Thus, the binomial model is a 0-order model
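
A k-order model can be sketched as a table that maps each k-character context to counts of the character that follows it. The function names and the sample sentence below are assumptions made for illustration; with k = 0 every character shares the empty context, which reduces to plain symbol frequencies.

```python
from collections import Counter, defaultdict


def train_char_model(text, k):
    """Count, for every k-character context, which character follows it."""
    model = defaultdict(Counter)
    for i in range(k, len(text)):
        context = text[i - k:i]  # empty string when k == 0
        model[context][text[i]] += 1
    return model


def next_char_probs(model, context):
    """Estimated probability of each next character after `context`."""
    counts = model.get(context, Counter())
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}


sample = "the queen quietly quit"
order1 = train_char_model(sample, 1)
print(next_char_probs(order1, "q"))  # which letter follows a q in the sample
order0 = train_char_model(sample, 0)
print(next_char_probs(order0, "")["q"])  # overall share of q in the sample
```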

Word Distribution in Documents
How frequent are words within documents?
Zipf's Law
  – The frequency of the i-th most frequent word is $1/i^{\theta}$ times the frequency of the most frequent word: $f_i = f_1 / i^{\theta}$
  – The value of θ depends on the text (for θ = 1 the sum of $1/i$ over the vocabulary grows logarithmically with vocabulary size)
  – θ values of 1.5 to 2.0 best model real texts
In practice, a few hundred words make up 50% of most texts
  – Frequent words provide less information
  – Thus, many search strategies involve ignoring stopwords (a, an, the, is, of, by, …)
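
The Zipf relation can be compared against observed counts with a short sketch like the one below. The function names, the top frequency of 1000, and the choice θ = 1.5 are assumptions for illustration only.

```python
from collections import Counter


def rank_frequencies(words):
    """Observed word counts sorted from most to least frequent."""
    return [count for _, count in Counter(words).most_common()]


def zipf_prediction(rank, top_frequency, theta):
    """Frequency Zipf's Law predicts for the word at `rank` (1-based)."""
    return top_frequency / (rank ** theta)


observed = rank_frequencies("a b a c a b".split())
print(observed)

# Predicted counts for ranks 1 to 4 when the top word occurs 1000 times.
for rank in range(1, 5):
    print(rank, zipf_prediction(rank, 1000, theta=1.5))
```

Fitting θ to a real collection would typically be done by regression on the log of rank against the log of frequency; that step is left out here.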

Word Distribution in Collections
It is simplest to assume that words are uniformly distributed across documents
  – But this is not true in practice
Better models are built on negative binomial distributions or Poisson distributions

Vocabulary Size for Documents and Collections
Heaps' Law
  – Vocabulary size (V) grows with the number of words (n): $V = K n^{b}$
Experimentally,
  – K is between 10 and 100
  – b is between 0.4 and 0.6
  – So vocabulary grows roughly in proportion to the square root of the size of the document or collection in words
  – Works best for large documents and collections
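
A minimal sketch of Heaps' Law follows, together with a helper that measures actual vocabulary growth so the two can be compared. The default values K = 50 and b = 0.5 are assumptions picked from inside the experimental ranges above, and both function names are hypothetical.

```python
def heaps_vocabulary(n, K=50.0, b=0.5):
    """Vocabulary size predicted by Heaps' Law for a text of n words."""
    return K * n ** b


def vocabulary_growth(words):
    """Observed number of distinct words after each word is read."""
    seen = set()
    growth = []
    for word in words:
        seen.add(word)
        growth.append(len(seen))
    return growth


print(heaps_vocabulary(1_000_000))           # prediction for a million words
print(vocabulary_growth("a b a c".split()))  # distinct words seen so far
```

With b = 0.5, quadrupling the text length only doubles the predicted vocabulary.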

String Similarity Models
Similarity is measured by a distance function
Hamming distance: the number of positions at which two strings of equal length differ
Levenshtein distance: the minimum number of insertions, deletions, and substitutions needed to make two strings equal
  – color to colour is 1
  – survey to surgery is 2
Can be extended to documents
  – UNIX diff treats each line as a character
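
Both distances can be sketched as follows. The Levenshtein function uses the usual row-by-row dynamic programming table; because it only compares items for equality, it also accepts lists of lines, which mirrors the document-level extension. This is an illustration and not the algorithm that diff itself uses.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn sequence `a` into sequence `b`."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, item_a in enumerate(a, 1):
        curr = [i]
        for j, item_b in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                        # delete item_a
                curr[j - 1] + 1,                    # insert item_b
                prev[j - 1] + (item_a != item_b),   # substitute or match
            ))
        prev = curr
    return prev[-1]


print(levenshtein("color", "colour"))
print(levenshtein("survey", "surgery"))
print(levenshtein(["line 1", "line 2"], ["line 1", "line two"]))
```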

Content Types: Markup and Multimedia

Introduction
Markup languages use extra textual syntax to encode:
  – Formatting / display information
  – Structure information
  – Descriptive metadata
  – Semantic metadata
Marks are often called tags
  – The act of adding markup is called tagging
  – Most markup languages use initial and ending tags surrounding the marked text

Standard Generalized Markup Language (SGML)
Metalanguage for markup
  – Includes rules for defining a markup language
  – Use of SGML includes
      A description of the structure of the markup
      Text marked with tags
Document Type Definition (DTD)
  – Describes and names the tags and how they are related
  – Comments are used to express the interpretation of tags (meaning, presentation, …)

SGML DTD Example
<!ATTLIST id ID #REQUIRED date_sent DATE #REQUIRED status (secret | public) public>
<!ATTLIST ref id IDREF #REQUIRED>
<!ATTLIST (image | audio) id IDREF #REQUIRED>

SGML Example Pablo Neruda Federico Garcia Lorca Ernest Hemingway Picture of my house in Isla Gabriel Garcia Marquez Here are two photos. One is of the view (photo ). “photo1.gif” “photo2.jpg”

SGML Characteristics
A DTD makes it possible to check whether a given document is valid, i.e. whether it conforms to the declared structure
SGML generally does not specify presentation/appearance
Output specification standards:
  – DSSSL (Document Style Semantics and Specification Language)
  – FOSI (Formatting Output Specification Instance)

HyperText Markup Language (HTML)
Based on SGML
  – The HTML DTD is not explicitly referenced by documents
HTML documents can have other content embedded within them
  – Images or audio
  – Frames with other HTML documents
When programs (scripts) are included, it is referred to as Dynamic HTML
Strict HTML includes only non-presentational markup
  – Cascading Style Sheets (CSS) are used to define presentation
In reality, presentational and structural markup are blended by HTML authoring applications

(Original) HTML Limitations
In contrast to SGML:
  – Users cannot specify their own tags or attributes
  – No support for nested structures that can represent database schemas or object-oriented hierarchies
  – No support for validation of a document by consuming applications

eXtensible Markup Language (XML)
XML is a simplified subset of SGML
  – XML is a meta-language
  – XML is designed for semantic markup that is both human and machine readable
  – No DTD is required
  – All tags must be closed
Extensible Stylesheet Language (XSL)
  – The XML counterpart of CSS
  – Can be used to convert XML into HTML and CSS
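
The rule that all tags must be closed can be checked without any DTD by simply trying to parse the text. The sketch below uses Python's standard XML parser; the helper name `is_well_formed` and the sample `note` documents are made up for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if `xml_text` parses as XML, False otherwise.

    This checks syntax only (matching, properly nested tags and so on);
    it does not validate the document against a DTD.
    """
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<note><to>Ana</to></note>"))  # every tag is closed
print(is_well_formed("<note><to>Ana</note>"))       # <to> is never closed
```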

Multimedia
Lots of data file formats for non-textual data
  – Images: BMP, GIF, JPEG (JPG), TIFF
  – Audio: AU, MIDI, WAVE, MP3
  – Video: MPEG, AVI, QuickTime
  – Graphics / Virtual Environments: CGM, VRML, OpenGL (OpenGL is a rendering API rather than a file format)

Audio and Video
Data files often have:
  – Header
      Indicates time granularity, number of channels, bits per channel
      Somewhat like a DTD
  – Data
      The signal
Data may be compressed
  – Data may be in the frequency domain rather than the time domain
  – Data may be encoded as a sequence of differences between consecutive time segments
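
Encoding a signal as differences between consecutive values can be sketched as below, assuming integer samples. The function names and sample values are hypothetical; real audio and video codecs combine this idea with quantization and entropy coding, which are not shown.

```python
def delta_encode(samples):
    """Keep the first sample, then store each later sample as the
    difference from the one before it."""
    if not samples:
        return []
    deltas = [samples[0]]
    for previous, current in zip(samples, samples[1:]):
        deltas.append(current - previous)
    return deltas


def delta_decode(deltas):
    """Rebuild the original samples by accumulating the differences."""
    samples = []
    total = 0
    for delta in deltas:
        total += delta
        samples.append(total)
    return samples


signal = [100, 102, 101, 101]  # illustrative sample values
encoded = delta_encode(signal)
print(encoded)
print(delta_decode(encoded) == signal)
```

When neighboring samples are close in value, the differences are small numbers that a later compression step can store in fewer bits.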