CIS 702 Communication/Information Technologies (CIT) Philip Robbins – March 7, 2013 Dr. Luz Quiroga, Ph.D. Chapter 6 Documents: Language & Properties Communication.

CIS 702 Communication/Information Technologies (CIT) Philip Robbins – March 7, 2013 Dr. Luz Quiroga, Ph.D. Chapter 6 Documents: Language & Properties Communication & Information Sciences Ph.D. Program University of Hawai'i at Mānoa Teaching Session #9 1

Documents: Language & Properties Chapter Contents Metadata Document Formats Markup Languages Text Properties Document Preprocessing Organizing Documents Text Compression 2

Introduction Document: denotes a single unit of information. A document has a structure and a syntax, semantics specified by the author, and a presentation style. 3

Introduction 4

Document Syntax: expresses structure, presentation style, and semantics. It can be implicit in the content, expressed in a simple declarative language, or expressed in a programming language. Text can be written in natural language (hard to process). 5

Introduction Document Style: how a document is visualized or printed. Can be embedded in the document (e.g., RTF files). Can be complemented by macros. 6

Introduction Queries Short pieces of text Differ from normal text Semantics often ambiguous due to polysemy User intent behind a query is not easy to infer 7

Metadata Data about data Information on the organization of the data, various data domains, and their relationship Metadata is associated with most documents 8

Metadata Descriptive Metadata: external to the meaning of the document; pertains more to how it was created. Author of the text. Date of publication. Source of the publication. Document length. 9

Metadata Semantic Metadata Characterizes the subject matter within the document contents Associated with a wide number of documents Availability is increasing 10

Metadata Metadata Format Machine Readable Cataloging Record (MARC) Format used for most library records Includes fields for distinct attributes of a bibliographic entry such as: title, author, publication venue. 11

Metadata Metadata in Web Documents Increase in web data has led to adding metadata information to web pages. Cataloging and content rating Intellectual property rights and digital signatures Electronic Commerce 12

Metadata Resource Description Framework (RDF) New standard for Web metadata Allows describing Web resources to facilitate automated processing. Does not assume any particular application or semantic domain. Consists of a description of nodes and attached attribute/value pairs. 13

Text Computers represent characters in binary, which is done through coding schemes: ASCII (7 bits, usually stored in an 8-bit byte), EBCDIC (8 bits), Unicode (originally 16 bits). IR systems should be able to retrieve information from many text formats (doc, pdf, html, txt). IR systems have filters to handle most documents (this might not be possible with proprietary formats). 14
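As a small illustration of coding schemes, the sketch below encodes the same character under ASCII, one EBCDIC code page, and a 16-bit Unicode encoding. The choice of `cp037` is an assumption made for illustration; it is only one of several EBCDIC code pages that Python ships with.

```python
# Encode the same character under three coding schemes.
# cp037 is one EBCDIC code page bundled with Python (an illustrative choice).
text = "A"

ascii_bytes = text.encode("ascii")       # 7-bit code, stored in one byte
ebcdic_bytes = text.encode("cp037")      # 8-bit EBCDIC code page
utf16_bytes = text.encode("utf-16-be")   # one 16-bit code unit, big-endian, no BOM

for name, raw in [("ASCII", ascii_bytes), ("EBCDIC", ebcdic_bytes), ("UTF-16", utf16_bytes)]:
    print(f"{name:7s} {raw.hex()}  ({len(raw) * 8} bits stored)")
```

The same letter maps to different bit patterns in each scheme, which is why an IR system needs to know (or detect) the encoding before it can index the text.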

Text Text Formats For document exchange: Rich Text Format (RTF) For printing and displaying: Portable Document Format (PDF) For printing and displaying: Postscript (PS) 15

Text Interchange Formats For encoding email: Multipurpose Internet Mail Extensions (MIME) For compressing text: ZIP 16

Multimedia For applications that handle different types of data: Text Sounds Images Video Different types of formats are necessary for storing each media 17

Images Image Formats Simplest image formats are direct representations of a bit-mapped display: XBM, BMP, PCX These formats have lots of redundancy and can be compressed efficiently: GIF 18

Images Lossy Compression Used to improve compression ratios: uncompressing a compressed image does not yield exactly the original image. Joint Photographic Experts Group (JPEG): eliminates parts of the image that have less impact on the human eye. Parametric format: the loss can be tuned. 19

Images Interchange Formats for Images Tagged Image File Format (TIFF): provides for metadata, compression, and a varying number of colors. De facto standard for images on the Web: Portable Network Graphics (PNG) 20

Audio Audio Formats Audio is digitized. MIDI is the standard format to interchange music between electronic instruments and computers. Other formats: AU, WAVE 21

Movies Movie Formats Works by coding changes in consecutive frames Takes advantage of temporal image redundancy Includes audio signal associated with the video Audio: MP3, Video: MP4 AVI, FLI, Quicktime 22

Graphics Format for 3-D Graphics Computer Graphics Metafile (CGM) Virtual Reality Modeling Language (VRML) VRML is the universal interchange format for 3-D graphics and multimedia. 23

Markup Markup Languages Defined as extra syntax used to describe formatting actions, structure information, text semantics, attributes XML:eXtensible Markup Language HTML:Hyper Text Markup Language SGML:Standard Generalized Markup Language 24

Markup Standard Generalized Markup Language (SGML) ISO 8879 Meta-language for tagging text Provides rules for defining a markup language based on tags Includes a description of the document structure: the “document type definition” An SGML document is defined by a document type definition together with the text itself, marked with tags describing the structure 25

Markup SGML Document Type Definition Describes the pieces that a document is composed of Defines how those pieces relate to each other Part of the definition can be specified by an SGML Document Type Declaration (DTD) Other parts (e.g., semantics of elements & attributes) cannot be expressed formally in SGML 26

Markup SGML Document Type Definition 27

Markup SGML Document Type Definition 28

Markup SGML Tags are denoted by angle brackets Used to identify the beginning and ending of an element Ending tags include a slash before the tag name Attributes are specified inside the beginning tag 29

Markup SGML Document description does not specify how a document is printed Output specifications are added to SGML documents: DSSSL: Document Style Semantics and Specification Language FOSI: Formatting Output Specification Instance These standards define mechanisms for associating style information with SGML document instances They allow specifying, for example, that data identified by a tag should be typeset in a particular font 30

Markup HyperText Markup Language (HTML) Instance of SGML Created in 1992 Latest Version is 4.0 (HTML5 under development) Includes support for style sheets, frames, tables, forms, etc. Backwards compatible Most documents on the Web are stored and transmitted in HTML HTML tags follow all SGML conventions and include formatting directives. 31

Markup HyperText Markup Language (HTML) Can have media embedded within, such as images or audio Has fields for metadata Adding programs (e.g., JavaScript) inside a web page makes it dynamic (hence dynamic HTML). 32
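To make the "fields for metadata" point concrete, here is a minimal sketch that collects `<meta name=... content=...>` pairs from a page using Python's standard `html.parser`. The sample page, the `MetaCollector` class name, and the metadata values are made up for illustration.

```python
from html.parser import HTMLParser

class MetaCollector(HTMLParser):
    """Collects name/content pairs from <meta> tags (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag == "meta":
            fields = dict(attrs)
            if "name" in fields and "content" in fields:
                self.meta[fields["name"]] = fields["content"]

# Hypothetical page used only for this example.
page = """<html><head>
<title>Sample</title>
<meta name="author" content="Jane Doe">
<meta name="keywords" content="retrieval, metadata">
</head><body><p>Hello</p></body></html>"""

collector = MetaCollector()
collector.feed(page)
print(collector.meta)
```

An indexer could store such fields alongside the page text, treating them as descriptive or semantic metadata.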

Markup HyperText Markup Language (HTML) 33

Markup HyperText Markup Language (HTML) 34

Markup Cascading Style Sheets (CSS) Because HTML does not fix a presentation style, CSS was introduced (1997). Way for authors to improve the aesthetics of HTML pages Information about presentation is separate from document content Support for CSS in current browsers is still modest 35

Markup eXtensible Markup Language (XML) Is a simplified subset of SGML Not a markup language (like HTML) but a meta-language (like SGML) Allows human-readable semantic markup, which is also machine-readable Does not have the restrictions of HTML Allows any user to define new tags Imposes a more rigid syntax: Ending tags cannot be omitted Distinguishes upper and lower case Attribute values must be in quotes 36
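The three syntax rules listed for XML can be checked with any conforming parser. The sketch below uses Python's standard `xml.etree.ElementTree`; the `is_well_formed` helper and the sample documents are invented for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(source):
    """Return True if the parser accepts the string as well-formed XML."""
    try:
        ET.fromstring(source)
        return True
    except ET.ParseError:
        return False

# Hypothetical samples, one per rule.
samples = {
    "valid": '<note lang="en"><to>Ana</to></note>',
    "missing end tag": '<note lang="en"><to>Ana</note>',
    "case mismatch": '<note lang="en"><To>Ana</to></note>',
    "unquoted attribute": '<note lang=en><to>Ana</to></note>',
}

for label, source in samples.items():
    print(f"{label:20s} {is_well_formed(source)}")
```

Only the first sample parses; an HTML parser would typically tolerate all four.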

Markup eXtensible Stylesheet Language (XSL) The XML counterpart of Cascading Style Sheets (CSS) Syntax based on XML Designed to transform and style highly-structured, data-rich documents written in XML e.g., with XSL it would be possible to automatically extract a table of contents from a document 37

Markup Hypermedia/Time-based Structuring Language (HyTime) SGML architecture that specifies the generic hypermedia structure of documents Includes complex locating of document objects Includes relationships (hyperlinks) between document objects Includes numeric, measured associations between document objects Does not specify graphical interfaces, user navigation or user interaction. 38

Theory Information Theory It is difficult to formally capture how much information there is in a given text However, distribution of symbols is related to it A text where one symbol appears almost all the time does not convey much information Information Theory defines a special concept, entropy, to capture information content 39

Theory Entropy 40

Theory Entropy 41
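As a sketch of the idea, the standard Shannon definition of entropy, E = sum over symbols of p_i * log2(1 / p_i), can be computed directly from symbol counts. Estimating each probability p_i from the text itself is an assumption made here for illustration.

```python
import math
from collections import Counter

def entropy(text):
    """Shannon entropy in bits per symbol, with probabilities estimated from the text."""
    counts = Counter(text)
    total = len(text)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

print(entropy("aaaa"))   # one symbol all the time: no information
print(entropy("abab"))   # two equally likely symbols: 1 bit per symbol
print(entropy("abcd"))   # four equally likely symbols: 2 bits per symbol
```

A text dominated by a single symbol scores close to zero, matching the observation that such a text conveys little information.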

Theory Modeling Natural Language We can divide the symbols of a text into two disjoint subsets: Symbols that separate words; Symbols that belong to words. Symbols are not uniformly distributed in a text, e.g., in English the vowels are usually more frequent than most consonants. 42

Theory Modeling Natural Language A simple model to generate text is the Binomial model, in which each symbol is generated with a certain probability. In natural language, however, the probability of a symbol depends on the previous symbols, e.g., in English the letter f cannot appear after the letter c. A finite-context or Markovian model can be used to reflect this dependency. Second issue: how the different words are distributed inside each document. 43
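A finite-context model of order one can be sketched in a few lines: record which symbols follow each symbol in a training text, then generate by sampling only from the observed followers. The function names, the training string, and the fixed random seed are assumptions for illustration.

```python
import random
from collections import defaultdict

def build_model(text):
    """Map each symbol to the list of symbols seen immediately after it."""
    followers = defaultdict(list)
    for current, nxt in zip(text, text[1:]):
        followers[current].append(nxt)
    return followers

def generate(model, start, length, seed=0):
    """Generate up to `length` symbols, each drawn from the followers of the previous one."""
    rng = random.Random(seed)
    out = [start]
    while len(out) < length and model.get(out[-1]):
        out.append(rng.choice(model[out[-1]]))
    return "".join(out)

sample = "the theory of the thing"
generated = generate(build_model(sample), "t", 20)
print(generated)
```

Every adjacent pair in the generated string also occurs in the training text, so a pair never seen in training can never be produced.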

Theory Zipf’s Law 44

Theory 45

Theory 46 Modeling Natural Language Words arranged in decreasing order of their frequencies

Theory 47 Modeling Natural Language Words arranged in decreasing order of their frequencies Distribution of words is very skewed Words that are too frequent (“stopwords”) can be disregarded. A stopword is a word that does not carry meaning by itself in natural language, e.g., stopwords in English: a, the, by, and. Because stopwords are so frequent, roughly half of the word occurrences in a text do not need to be considered
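The skew can be observed by ranking words by frequency. In its usual statement, Zipf's Law says the i-th most frequent word occurs roughly 1/i^α times as often as the most frequent one, for some α near 1. The sketch below only produces the ranking; the sample sentence and function name are made up, and a real check of the law would need a large corpus.

```python
from collections import Counter

def rank_frequencies(text):
    """Return (word, frequency) pairs in decreasing order of frequency."""
    words = text.lower().split()
    return Counter(words).most_common()

sample = "the cat and the dog and the bird"
for rank, (word, freq) in enumerate(rank_frequencies(sample), start=1):
    print(rank, word, freq)
```

Even in this tiny sample the top ranks are taken by stopwords ("the", "and").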

Theory 48 Modeling Natural Language Third Issue: Distribution of words in the documents of a collection. Simple Model: Consider that each word appears the same number of times in every document (Not True) Better Model: Use a binomial distribution

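Theory 49 Heaps’ Law Fourth Issue: Number of distinct words in a document (document vocabulary) To predict the growth of vocabulary size in natural language text, Heaps’ Law states that a text of n words has a vocabulary of size V = K n^β, where 0 < β < 1 and both K and β depend on the text.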

Theory 50 Modeling Natural Language Vocabulary size grows sub-linearly with text size
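Sub-linear growth can be seen by tracking the number of distinct words after each word read, and compared with a Heaps-style estimate of the form K * n^β. The values `k=30.0` and `beta=0.5` below are illustrative assumptions, not figures from the slides.

```python
def vocabulary_growth(words):
    """Vocabulary size after reading each successive word."""
    seen = set()
    sizes = []
    for word in words:
        seen.add(word)
        sizes.append(len(seen))
    return sizes

def heaps_estimate(n, k=30.0, beta=0.5):
    """Heaps-style estimate K * n**beta; k and beta are illustrative, text-dependent values."""
    return k * n ** beta

print(vocabulary_growth("a b a c b d".split()))
print(heaps_estimate(10_000))
```

With β = 0.5, multiplying the text size by 100 multiplies the estimated vocabulary by only 10.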

Theory 51 Modeling Natural Language The set of different words of a language is bounded by a constant. However, the limit is so high that it is common to assume the size of the vocabulary is O(n^β), with 0 < β < 1, instead of O(1). Many argue that the number keeps growing anyway because of typing and spelling errors. As the total text size grows, the predictions of the model become more accurate.

Theory 52 Text Similarity Similarity is measured by a distance function Hamming distance: For strings of the same length, the distance between them is the number of positions with different characters (distance is 0 if equal). A distance function should be symmetric and satisfy the triangle inequality: distance(a, c) ≤ distance(a, b) + distance(b, c)
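A minimal sketch of the Hamming distance as described, counting mismatching positions in two equal-length strings. The example words are arbitrary.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))

print(hamming("karolin", "kathrin"))
print(hamming("karolin", "karolin"))
```

Because it compares position by position, the function is symmetric by construction.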

Theory 53 Text Similarity Levenshtein “edit” distance: the minimal number of char insertions, deletions, and substitutions needed to make two strings equal. Edit distance between color and colour is 1 Edit distance between survey and surgery is 2
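The edit distance is commonly computed with dynamic programming. The sketch below is one standard row-by-row formulation, with unit cost for each insertion, deletion, and substitution, and reproduces the two examples on the slide.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit costs, computed one row at a time."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]                   # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]

print(edit_distance("color", "colour"))
print(edit_distance("survey", "surgery"))
```

Keeping only two rows holds memory proportional to the shorter dimension of the table while still visiting every cell.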

Theory 54 Text Similarity Longest Common Subsequence (LCS): All non-common characters of two (or more) strings are eliminated; the remaining sequence of characters is the LCS of both strings. LCS of survey and surgery is surey.
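The LCS can be found with a dynamic-programming table followed by a walk back through it. This is a textbook formulation offered as a sketch; when several subsequences share the maximal length it returns just one of them.

```python
def lcs(a, b):
    """Return one longest common subsequence of strings a and b."""
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, start=1):
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back from the bottom-right corner to recover the characters.
    out = []
    i, j = len(a), len(b)
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))

print(lcs("survey", "surgery"))
```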

Theory 55 Text Similarity Similarity can be extended to documents: compute the longest common sequence of lines between two files, as the ‘diff’ command in Unix does
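Python's standard `difflib` gives a quick way to try line-level comparison. Note that `difflib` uses its own matching heuristic rather than a strict longest-common-subsequence computation, so this is an analogue of `diff`, not a reimplementation. The file names and line contents are made up.

```python
import difflib

# Hypothetical file contents, one list entry per line.
old = ["alpha", "beta", "gamma"]
new = ["alpha", "gamma", "delta"]

changes = list(difflib.unified_diff(old, new, fromfile="old.txt", tofile="new.txt", lineterm=""))
for line in changes:
    print(line)
```

Lines common to both files appear with a leading space, removed lines with `-`, and added lines with `+`.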

Theory 56 Resemblance Measure

Theory 57 Resemblance Measure

Model 58 Document Preprocessing Operations Lexical analysis of the text Elimination of stopwords Stemming of the remaining words Selection of index terms or keywords Construction of term categorization structures (thesaurus)

Model 59 Logical View of a Document

Document Preprocessing 60 Lexical Analysis Process of converting a stream of chars into a stream of words Major Objective: Identify words in the text Word Separators: - Space: most common separator - Numbers: inherently vague, context required - Hyphens: break up hyphenated words - Punctuation marks - Case of letters: A vs. a
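One possible set of choices for the separators listed above can be sketched with a single regular expression: lowercase everything, break on hyphens and punctuation, and drop numbers. These policies are assumptions for illustration; a real system might keep numbers or hyphenated terms depending on context.

```python
import re

def tokenize(text):
    """Lowercase the text and return maximal runs of letters as words."""
    return re.findall(r"[a-z]+", text.lower())

print(tokenize("State-of-the-art IR systems, built in 2013."))
```

Here "State-of-the-art" becomes four words and "2013" disappears, which shows how much these early decisions shape the index.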

Document Preprocessing 61 Elimination of Stopwords Words that appear too frequently Usually, not good discriminators Filtered out as potential index terms Reduces size of index by 40% or more At expense of reducing recall: not able to retrieve documents that contain “to be or not to be”
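A minimal sketch of stopword elimination, using a tiny hand-picked stopword set (an assumption; production lists contain hundreds of words). It also reproduces the recall problem mentioned on the slide.

```python
# Tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"a", "the", "by", "and", "to", "be", "or", "not"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only tokens that are not in the stopword set."""
    return [t for t in tokens if t not in stopwords]

print(remove_stopwords(["the", "index", "and", "query"]))
print(remove_stopwords("to be or not to be".split()))
```

The second call returns an empty list: once every word of the phrase is filtered out, no document can be retrieved for it.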

Document Preprocessing 62 Stemming Stem: portion of a word left after removal of prefixes/suffixes A user specifies a query word, but only a variant of it is present in a relevant document; this is partially solved by the adoption of stems Stemming reduces the size of the index Its benefits are controversial: many search engines do not adopt any stemming
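To show how stems let variants of a word match, here is a deliberately naive suffix stripper. It is not the Porter algorithm or any other published stemmer; the suffix list and the minimum stem length of three characters are assumptions chosen for the example.

```python
# Naive, illustrative suffix list, tried in this order.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["connect", "connected", "connecting", "connects"]:
    print(word, "->", naive_stem(word))
```

All four forms map to "connect", so a query for any one of them would match documents containing the others. Crude rules like these also produce errors on irregular words, which is part of why stemming is debated.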

Document Preprocessing 63 Keyword Selection Full text representation: all words in the text are used as index terms (or keywords). Alternative to full text representation: –Not all words in the text are used as index terms –Use just nouns as index terms –Group nouns that appear nearby in the text into a single indexing component (a concept)

Document Preprocessing 64 Thesaurus The word refers to a treasury of words. Consists of a precompiled list of important words in a knowledge domain and, for each word in this list, a set of related words derived from a synonymy relationship


Document Preprocessing 66 Thesaurus Query formulation process (for IR): –User forms a query –Query terms might be erroneous and improper –Solution: reformulate the original query –Usually, this implies expanding original query with related terms –Thus, it is natural to use a thesaurus for finding related terms
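Query expansion with a thesaurus can be sketched as a dictionary lookup that appends related terms after each original term. The thesaurus entries and the function name are hypothetical.

```python
# Hypothetical thesaurus: each word maps to related words.
THESAURUS = {
    "car": ["automobile", "vehicle"],
    "fast": ["quick", "rapid"],
}

def expand_query(terms, thesaurus=THESAURUS):
    """Return the original terms plus their related terms, without duplicates."""
    expanded = []
    for term in terms:
        for candidate in [term, *thesaurus.get(term, [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

print(expand_query(["fast", "car", "rental"]))
```

Terms without a thesaurus entry, such as "rental" here, pass through unchanged.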

Taxonomies 67

Folksonomies 68 Folksonomy Collaborative flat vocabulary Terms are selected by a population of users Each term is called a tag

Folksonomies 69

References Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition. Chapter 6, Documents: Languages & Properties. Retrieved from http://grupoweb.upf.es/WRG/mir2ed/pdf/slides_chap06.pdf 70

Questions? probbins@hawaii.edu www2.hawaii.edu/~probbins 71
