Content Types: Text and Metadata
Introduction
  Text documents come in many forms
    – Article (news, conference, journal, etc.)
    – , memo, …
    – Book, manual, manuscript, transcript, …
    – Any part of one of the above
  Syntax can express
    – Structure
    – Presentation style
    – Semantics (e.g. software code)
Metadata
  Metadata: data about data
  Descriptive metadata
    – External to the meaning of the document
    – Author, publication date, document source, document length, document genre, file type, bits per second, frame rate, etc.
  Semantic metadata
    – Characterizes the semantic content of the document
    – LoC subject headings, keywords, subject headings from ontologies (e.g. MeSH), etc.
Metadata Formats
  Machine-Readable Cataloging Record (MARC)
    – Used by most libraries
    – Fields include title, author, etc.
  Resource Description Framework (RDF)
    – Used for Web resources
    – Nodes with attribute/value pairs
    – Node ID is any Uniform Resource Identifier (URI), which could be a URL
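The RDF idea of a node identified by a URI and described by attribute/value pairs can be sketched in a few lines. This is only an illustration of the data model, not an RDF library: the URI, the attribute names, and the `describe` helper are all hypothetical.

```python
from collections import defaultdict

# Each statement is (node ID, attribute, value); the node ID is a URI.
# The URI and the attribute names are made up for illustration.
statements = [
    ("http://example.org/report.html", "title", "Annual Report"),
    ("http://example.org/report.html", "creator", "A. Author"),
    ("http://example.org/report.html", "language", "en"),
]


def describe(statements):
    """Group attribute/value pairs by the node they describe."""
    nodes = defaultdict(dict)
    for node_id, attribute, value in statements:
        nodes[node_id][attribute] = value
    return dict(nodes)


description = describe(statements)
print(description["http://example.org/report.html"])
```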
Metadata Sets
  Dublin Core Metadata Elements
    1. Contributor – entities contributing to the content
    2. Coverage – extent or scope of content (spatial area, temporal period, …)
    3. Creator – entity primarily responsible for making the content
    4. Date – date associated with an event (e.g. publication) for the resource
    5. Description – abstract, table of contents, …
    6. Format – media (file) type, dimensions (size, duration), hardware needed
    7. Identifier – unique identifier
    8. Language – language of content
    9. Publisher – entity responsible for making the resource available
    10. Relation – reference to related resource(s)
    11. Rights – information about rights held in/over the resource
    12. Source – resource from which content is derived
    13. Subject – keywords, key phrases, classification code, etc.
    14. Title – name of the resource
    15. Type – nature or genre of content
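As a rough sketch of how the fifteen elements might be used, a record can be held as a dictionary and checked against the element names. The record values and the `unknown_elements` helper are hypothetical; real Dublin Core records are usually serialized in XML or RDF rather than as Python dictionaries.

```python
# The fifteen element names listed above, in lower case.
DUBLIN_CORE_ELEMENTS = {
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
}


def unknown_elements(record):
    """Return the keys of a record that are not Dublin Core elements."""
    return sorted(set(record) - DUBLIN_CORE_ELEMENTS)


# Hypothetical record; the values are placeholders.
record = {
    "title": "Content Types: Text and Metadata",
    "type": "lecture slides",
    "language": "en",
    "format": "text/plain",
    "pages": "23",  # deliberately not a Dublin Core element
}

print(unknown_elements(record))
```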
Text Formats
  Coding schemes
    – EBCDIC (8 bit, one of the first coding schemes)
    – ASCII (initially 7 bit, extended to 8 bit)
    – Unicode (originally 16 bit, designed for large alphabets)
  Additional formats
    – RTF (format-oriented document exchange)
    – PDF and PostScript (display-oriented representation)
    – Multipurpose Internet Mail Extensions (MIME) (multiple character sets, languages, media)
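To make the coding schemes concrete, the sketch below encodes the same string with three of Python's standard codecs. It assumes `cp037` as a representative EBCDIC code page and `utf-16-be` as a 16-bit Unicode encoding; the `byte_lengths` helper is hypothetical.

```python
def byte_lengths(text, codecs=("ascii", "cp037", "utf-16-be")):
    """Encode text with each codec and report the encoded length in bytes.

    A value of None means the codec cannot represent the text.
    """
    lengths = {}
    for codec in codecs:
        try:
            lengths[codec] = len(text.encode(codec))
        except UnicodeEncodeError:
            lengths[codec] = None
    return lengths


# The same letter has different byte values in ASCII and EBCDIC.
print("A".encode("ascii"), "A".encode("cp037"))

# Plain Latin text: one byte per character in ASCII and EBCDIC, two in UTF-16.
print(byte_lengths("Neruda"))

# A Greek letter is outside both ASCII and this EBCDIC code page.
print(byte_lengths("λ"))
```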
Information Theory
  How can we predict the information value of components of a document?
  Entropy attempts to model information content (information uncertainty)
    $E = -\sum_{i} p_i \log_2 p_i$, summed over all symbols $i$ in the alphabet
    $p_i$ is the probability of symbol $i$ (its frequency divided by the total number of symbols)
  Need a text model for real language
  Also important for compression, as $E$ acts as a limit on how much a text can be compressed
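The entropy definition can be sketched directly, estimating each symbol probability from its frequency in the text itself. The function name `entropy` is an assumption, and the result is only the entropy under this simple frequency model, in bits per symbol.

```python
import math
from collections import Counter


def entropy(text):
    """Entropy in bits per symbol, using symbol frequencies as probabilities."""
    counts = Counter(text)
    total = len(text)
    # E = -sum(p_i * log2(p_i)) over the symbols that occur in the text.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


for sample in ("aaaa", "abab", "abcd"):
    bits_per_symbol = entropy(sample)
    # Under this model, bits_per_symbol * len(sample) estimates the
    # smallest number of bits needed to encode the sample.
    print(sample, bits_per_symbol, bits_per_symbol * len(sample))
```

A string with a single repeated symbol carries no uncertainty, while a string whose symbols are all equally likely reaches the maximum for its alphabet size.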
Modeling Character Strings
  Symbols in natural language (NL) are not evenly distributed
    – Some symbols are not part of words (often used for syntax)
    – Symbols in words are not evenly distributed
  Models
    – Binomial model uses the distribution of symbols in the language
      But previous symbols influence the probabilities of later symbols (what letter will appear after a q?)
    – Finite-context or Markovian models are used for this dependency
      k-order, where k is the number of previous characters taken into account by the model
      Thus, the binomial model is a 0-order model
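A minimal sketch of a k-order character model follows: it counts which symbol follows each k-character context, so k = 0 reduces to plain symbol frequencies. The names `train` and `probability` and the sample sentence are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train(text, k):
    """Count which symbol follows each k-character context in the text."""
    model = defaultdict(Counter)
    for i in range(k, len(text)):
        context = text[i - k:i]  # empty string when k == 0
        model[context][text[i]] += 1
    return model


def probability(model, context, symbol):
    """Estimated probability of symbol given the preceding context."""
    counts = model.get(context)
    if not counts:
        return 0.0  # context never seen in training
    return counts[symbol] / sum(counts.values())


text = "the queen quietly quit the quiz"
order0 = train(text, 0)
order1 = train(text, 1)

# Order 0 ignores context: "u" is just one symbol among many.
print(probability(order0, "", "u"))
# Order 1 conditions on the previous character: what follows a "q"?
print(probability(order1, "q", "u"))
```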
Word Distribution in Documents
  How frequent are words within documents?
  Zipf's Law
    – Frequency of the $i$-th most frequent word is $1/i^{\theta}$ times the frequency of the most frequent word
    – The value of $\theta$ depends on the text (a value of 1 is a logarithmic distribution)
    – $\theta$ values of 1.5 to 2.0 best model real texts
  In practice, a few hundred words make up 50% of most texts
    – Frequent words provide less information
    – Thus, many search strategies involve ignoring stopwords (a, an, the, is, of, by, …)
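The relationship stated by Zipf's Law can be sketched by comparing observed word counts with the predicted frequency at each rank. The helper names and the tiny sample sentence are assumptions; a sample this small only shows the mechanics and is far too short to exhibit the law.

```python
from collections import Counter


def zipf_frequency(rank, top_frequency, theta):
    """Predicted frequency of the word at a given rank (1 = most frequent)."""
    return top_frequency / rank ** theta


def ranked_counts(text):
    """Observed word counts, most frequent first."""
    return Counter(text.lower().split()).most_common()


sample = "the cat and the dog and the bird"
observed = ranked_counts(sample)
top = observed[0][1]  # frequency of the most frequent word

for rank, (word, count) in enumerate(observed, start=1):
    predicted = zipf_frequency(rank, top, theta=1.0)
    print(rank, word, count, round(predicted, 2))
```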
Word Distribution in Collections
  Simplest to assume a uniform distribution of words across documents
    – But this is not true
  Better models are built on negative binomial or Poisson distributions
Vocabulary Size for Documents and Collections
  Heaps' Law
    – Vocabulary size ($V$) grows with the number of words ($n$): $V = K n^{b}$
  Experimentally,
    – $K$ is between 10 and 100
    – $b$ is between 0.4 and 0.6
    – So vocabulary grows roughly in proportion to the square root of the size of the document or collection in words
    – Works best for large documents and collections
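A small sketch of the vocabulary-growth formula, alongside a helper that measures actual vocabulary growth so the two can be compared. The function names are assumptions, and the values K = 10 and b = 0.5 are just picked from the experimental ranges quoted above.

```python
def heaps_vocabulary(n, K, b):
    """Predicted vocabulary size for a text of n words: V = K * n**b."""
    return K * n ** b


def vocabulary_growth(words):
    """Observed vocabulary size after each successive word."""
    seen = set()
    sizes = []
    for word in words:
        seen.add(word)
        sizes.append(len(seen))
    return sizes


# Predicted vocabulary for texts of increasing length.
for n in (10_000, 1_000_000, 100_000_000):
    print(n, round(heaps_vocabulary(n, K=10, b=0.5)))

# Observed growth on a toy word sequence.
print(vocabulary_growth("a b a c b d".split()))
```

With b = 0.5, multiplying the text length by 100 multiplies the predicted vocabulary by only 10.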
String Similarity Models
  Similarity is measured by a distance function
  Hamming distance: number of positions at which the characters of two strings differ
  Levenshtein distance: minimum number of insertions, deletions, and substitutions needed to make the strings equal
    – color to colour is 1
    – survey to surgery is 2
  Can be extended to documents
    – UNIX diff treats each line as a character
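Both distances can be sketched briefly. The Levenshtein version below uses the standard dynamic-programming table, kept one row at a time; the function names are assumptions. Because it only compares items for equality, the same function also accepts lists of lines, which illustrates the document-level extension but is not a reimplementation of diff.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


print(levenshtein("color", "colour"))
print(levenshtein("survey", "surgery"))
print(hamming("karolin", "kathrin"))
# Treating each line as a single symbol:
print(levenshtein(["a", "b"], ["a", "c", "b"]))
```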
Content Types: Markup and Multimedia
Introduction
  Markup languages use extra textual syntax to encode:
    – Formatting / display information
    – Structure information
    – Descriptive metadata
    – Semantic metadata
  Marks are often called tags
    – The act of adding markup is called tagging
    – Most markup languages use initial and ending tags surrounding the marked text
Standard Generalized Markup Language (SGML)
  Metalanguage for markup
    – Includes rules for defining a markup language
    – Use of SGML includes
      Description of the structure of the markup
      Text marked with tags
  Document Type Definition (DTD)
    – Describes and names tags and how they are related
    – Comments used to express interpretation of tags (meaning, presentation, …)
SGML DTD Example
  <!ATTLIST id ID #REQUIRED date_sent DATE #REQUIRED status (secret | public) public >
  <!ATTLIST ref id IDREF #REQUIRED >
  <!ATTLIST (image | audio) id IDREF #REQUIRED >
SGML Example
  Pablo Neruda
  Federico Garcia Lorca
  Ernest Hemingway
  Picture of my house in Isla
  Gabriel Garcia Marquez
  Here are two photos. One is of the view (photo ).
  “photo1.gif”
  “photo2.jpg”
SGML Characteristics
  DTD makes it possible to check whether a given document is valid, i.e. conforms to the declared structure
  SGML generally does not specify presentation/appearance
  Output specification standards:
    – DSSSL (Document Style Semantics and Specification Language)
    – FOSI (Formatted Output Specification Instance)
HyperText Markup Language (HTML)
  Based on SGML
    – HTML DTD not explicitly referenced by documents
  HTML documents can have documents embedded within them
    – Images or audio
    – Frames with other HTML documents
  When programs are included, it is referred to as Dynamic HTML
  Strict HTML includes only non-presentational markup
    – Cascading Style Sheets (CSS) used to define presentation
  In reality, presentational and structural markup are blended by HTML authoring applications
(Original) HTML Limitations
  In contrast to SGML:
    – Users cannot specify their own tags or attributes
    – No support for nested structures that can represent database schemas or object-oriented hierarchies
    – No support for validation of the document by consuming applications
eXtensible Markup Language (XML)
  XML is a simplified subset of SGML
    – XML is a meta-language
    – XML is designed for semantic markup that is both human and machine readable
    – No DTD is required
    – All tags must be closed
  Extensible Stylesheet Language (XSL)
    – XML equivalent of CSS
    – Can be used to convert XML into HTML and CSS
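The rule that all tags must be closed can be checked without any DTD. The sketch below uses Python's standard `xml.etree.ElementTree` parser; the `is_well_formed` helper and the tag names `message` and `sender` are hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Every opening tag has a matching closing tag.
good = "<message><sender>Pablo Neruda</sender></message>"
# The <sender> tag is never closed.
bad = "<message><sender>Pablo Neruda</message>"

print(is_well_formed(good), is_well_formed(bad))

# User-defined tags carry the semantics; no DTD is consulted here.
root = ET.fromstring(good)
print(root.find("sender").text)
```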
Multimedia
  Lots of data file formats for non-textual data
    – Images
      BMP, GIF, JPEG (JPG), TIFF
    – Audio
      AU, MIDI, WAVE, MP3
    – Video
      MPEG, AVI, QuickTime
    – Graphics / Virtual Environments
      CGM, VRML, OpenGL
Audio and Video
  Data files often have:
    – Header
      Indicates time granularity, number of channels, bits per channel
      Somewhat like a DTD
    – Data
      The signal
  Data may be compressed
    – Data may be in the frequency domain rather than the time domain
    – Data may be encoded as a sequence of differences between consecutive time segments
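Encoding a signal as differences between consecutive values can be sketched as follows. This is a simplified illustration on a list of integer samples, not the scheme of any particular audio or video format; the function names and sample values are assumptions.

```python
def delta_encode(samples):
    """Keep the first sample, then store each change from the previous one."""
    if not samples:
        return []
    deltas = [samples[0]]
    for previous, current in zip(samples, samples[1:]):
        deltas.append(current - previous)
    return deltas


def delta_decode(deltas):
    """Rebuild the original samples by accumulating the differences."""
    samples = []
    total = 0
    for delta in deltas:
        total += delta
        samples.append(total)
    return samples


signal = [100, 102, 101, 101, 105]
encoded = delta_encode(signal)
print(encoded)
print(delta_decode(encoded) == signal)
```

When consecutive samples are close, the differences are small numbers that a later compression step can store in fewer bits than the raw values.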