WIRED Week 5 Readings Overview - Text & Multimedia Languages & Properties - Text Operations - Multimedia IR Finalize Topic Discussions Schedule Projects and Papers Ideas - Searchable Personal Digital Library - Browser hacks for searching
Text & Multimedia 6 Metadata Text Markup Languages Multimedia Trends
It all comes down to Text Main form of knowledge communication, storage, retrieval - Picture = 1K words? Clusters of (hopefully related) text are documents Documents have syntax, structure and semantics - Styles - Formats - Uses - Languages
Metadata Information that describes a document that is not (necessarily) in the document Describes the document in relation to other documents Context about the Content Document semantics Internally consistent descriptions of content for individual documents, document sets or a specified set of content. For collections or individual documents
Metadata Types Dublin Core elements MARC (machine readable cataloging) - What isn’t machine readable? Semantic Web elements Bottom-up, derived data Format-based - ASCII, EBCDIC - RTF - PostScript PDF - MIME
Information Theory & Text Text conveys an amount of information related to the distribution of symbols (in a document). Entropy (Ribeiro-Neto 1999): $E = -\sum_{i=1}^{\sigma} p_i \log_2 p_i$ The more unique the symbols, the more information contained. Alphabet has σ symbols, each with probability $p_i$; entropy is measured in bits. Text models have these probabilities. Text models can be unique themselves.
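A minimal sketch of the entropy measure on this slide, assuming the symbol probabilities are estimated from character counts in the text itself (the simplest possible text model). The function name `entropy_bits` is hypothetical.

```python
import math
from collections import Counter


def entropy_bits(text: str) -> float:
    """Shannon entropy of the symbol distribution, in bits per symbol."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # sum over symbols of p_i * log2(1 / p_i), with p_i = count / total
    return sum((c / total) * math.log2(total / c) for c in counts.values())


if __name__ == "__main__":
    for sample in ("aaaa", "abab", "abcd"):
        print(sample, entropy_bits(sample))
```

A text with one repeated symbol has zero entropy, while a text that spreads evenly over more distinct symbols has more bits per symbol, which is the "more unique symbols, more information" point above.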
Natural Language and Text We have two kinds of symbols - Words - Separators, Differentiators Conventions of language have probabilities - “th”, “qu”, “ff”, “ing” - “fn”, “qi”, “aa”, “en” Etaoin Shrdlu Grammar models have structure Languages have structure too
Models of Entropy & Frequency Entropy is related to Frequency (Uniqueness) Zipf’s Law is a model of the distribution of words in a text, document or language. - “Principle of Least Effort” - Known, predictable models of language use - The i-th most frequent word appears 1/i times as often as the most frequent word - Vocabulary, Harmonic number
Zipf’s Law The distribution, applied to word frequency in a text, states that the nth-ranking word will appear k/n times, where k is a constant for that text. It is easier to choose and use familiar words, therefore the probability of occurrence of familiar words is higher. $r \cdot f = C$ (rank × frequency = constant) Count all of the words in a document (minus a stop list); the most frequent occurrences represent the subject matter of the document. Relative frequency (more often than expected) instead of absolute frequency is possible.
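The counting procedure on this slide can be sketched as follows. This is an illustration only: the function name `rank_frequency`, the regex tokenizer, and the optional stop list are assumptions, not part of the slide. Each row carries rank × frequency so the (roughly) constant product can be inspected on a real text.

```python
import re
from collections import Counter


def rank_frequency(text, stop_words=frozenset()):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    # Crude tokenizer (assumption): lowercase runs of letters and apostrophes.
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in stop_words]
    counts = Counter(words)
    # most_common() sorts by descending count; ties keep first-seen order.
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(counts.most_common(), start=1)]


if __name__ == "__main__":
    sample = "the cat and the dog and the bird"
    for row in rank_frequency(sample):
        print(row)
```

On a short sample the product is noisy; the law is a statement about large texts, so try it on a long document before drawing conclusions.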
Wyllys on Zipf’s Law Surprisingly constrained relationship between rank and frequency in natural language. Zipf said the fundamental reason for human behavior is the striving to minimize effort. Mandelbrot - further refinement of Zipf’s law: $(r+m)^B f = c$, where r is the rank of a word, f is its frequency, and m, B and c are constants dependent on the corpus. m has the greatest effect when r is small.
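A small sketch of Mandelbrot's refinement, rearranged as f = c / (r + m)^B. The constant values used below are made up for illustration; in practice they are fitted to a corpus. With m = 0 and B = 1 the formula reduces to the plain Zipf form.

```python
def mandelbrot_frequency(rank, c, m, B):
    """Predicted frequency of the word at a given rank: f = c / (rank + m) ** B."""
    return c / (rank + m) ** B


if __name__ == "__main__":
    c, B = 100.0, 1.0  # illustrative constants, not fitted values
    for rank in (1, 10, 100):
        plain = mandelbrot_frequency(rank, c, 0.0, B)    # m = 0: Zipf
        shifted = mandelbrot_frequency(rank, c, 2.0, B)  # m = 2
        print(rank, plain, shifted, plain / shifted)
```

The last column shows why m matters most at small r: adding m to a rank of 1 changes the prediction by a large factor, while adding it to a rank of 100 barely moves it.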
Heaps’ Law Predicts the growth of a vocabulary in a normal (natural language) text A text can also be a collection of documents - Papers for this class? - The Web? Length of words in the vocabulary increases logarithmically with text size - The longer the text (documents), the longer the words that appear
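Vocabulary growth can be measured directly and compared with the usual statement of Heaps' law, V = K · n^β. That formula is not spelled out on the slide; it is the standard form, and the K and β values below are placeholders rather than fitted constants. Function names are hypothetical.

```python
def vocabulary_growth(tokens, step=1):
    """Return (tokens_read, distinct_words_seen) pairs, sampled every `step` tokens."""
    seen = set()
    growth = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        if n % step == 0:
            growth.append((n, len(seen)))
    return growth


def heaps_estimate(n, K, beta):
    """Vocabulary size predicted by Heaps' law for a text of n tokens."""
    return K * n ** beta


if __name__ == "__main__":
    tokens = "a b a c b d".split()
    print(vocabulary_growth(tokens))
    print(heaps_estimate(10_000, K=10, beta=0.5))  # placeholder constants
```

Running `vocabulary_growth` over a concatenation of documents (papers for a class, a crawl of pages) treats the whole collection as one text, as the slide suggests.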
Text Document Properties More text: - equals less overall entropy - More overall predictability - At the vocabulary level (Zipf) - At the document level (Heaps) Users will be searching over similar texts a lot in a document set. Documents have similarity - Measured by a distance function - Edit distance is the number of transforms needed to make two things equal (entropy)
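One common distance function of this kind is Levenshtein edit distance, where the allowed transforms are single-character insertions, deletions and substitutions. The sketch below assumes that definition; the slide does not say which set of transforms it has in mind.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest insertions, deletions and substitutions turning a into b."""
    # prev[j] holds the distance between the first i-1 chars of a and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
    print(edit_distance("color", "colour"))
```

The same routine works on lists of words instead of characters if document-level rather than string-level similarity is wanted.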
Markup Languages Additional structure applied to text Formats for presentation or content description SGML - DTD - HTML XML - MathML - SMIL - RDF Prescribed by authors with tools Automated for higher machine readability
Trends for Text More Markup languages (finer details) Automated markup & conversion - Based on “Laws” - From CMS Semantic Web text representation Multi-lingual text representation - Global measures Laws of your language use and search term preferences
Text Operations 7 Document Preprocessing Document Clustering Text Compression Comparing Text Compression Techniques Trends
Document Preprocessing 1. Lexical Analysis - Characters, digits, punctuation - Sentences, paragraphs 2. Stopword filtering - Eliminate redundant words & phrases - Reduce entropy 3. Word Stemming - Prefix, suffix, variations 4. Index term selection - Syntax, frequency, structure (markup) 5. Term category structures - Thesaurus, estimated queries, metadata use
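Steps 1 through 4 above can be sketched as a small pipeline. Everything here is a simplifying assumption: the stop list is a tiny illustrative sample, the stemmer is crude suffix stripping (not a real algorithm such as Porter's), and index term selection is reduced to "keep each distinct stem". Step 5 (thesaurus and category structures) is not covered.

```python
import re

# Tiny illustrative stop list (assumption); real lists hold hundreds of words.
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "in", "is", "are", "to"})
# Crude suffix stripping (assumption), checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Step 1, lexical analysis: lowercase and split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stopwords(tokens):
    """Step 2, stopword filtering."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(word):
    """Step 3, stemming: strip one known suffix if at least 3 letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def index_terms(text):
    """Step 4, index term selection: the sorted set of distinct stems."""
    return sorted({stem(t) for t in remove_stopwords(tokenize(text))})


if __name__ == "__main__":
    print(index_terms("The cats are chasing the dogs in the garden"))
```

The stems need not be dictionary words ("chasing" becomes "chas"); they only need to map variants of a word to the same index term.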
Document Clustering Grouping together similar or related documents in classes. P 173 Global – with whole collection - Collections on the Web? - Sites? Domains? Versions? Local – in context of the query - Multiple queries? - Many contexts? - Links?
Text Compression Is compression always good? - Less space may mean less functionality. - Open standards for compression, immediately (machine) recognizable Taking advantage of document preprocessing and the Laws to reduce size. (fewer bytes) Random access to text is difficult enough, compressed text more so
Compression Methods Statistical Methods - Probability (chain of codings) - Words and NLP Dictionary Methods - Symbols and substitution - Inverted File Compression Vocabulary Lists with pointers Which is best? - Speed of compression & compressed size - Memory, access & pattern match
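As one example of a statistical method, Huffman coding assigns shorter bit strings to more probable symbols. The sketch below works on characters for brevity (word-based variants follow the same idea) and is an illustration, not a description of any particular compressor; the function names are hypothetical.

```python
import heapq
from collections import Counter


def huffman_codes(text):
    """Build a prefix-free code table {symbol: bit string} from symbol counts."""
    counts = Counter(text)
    if not counts:
        return {}
    if len(counts) == 1:
        return {next(iter(counts)): "0"}
    # Heap entries are (weight, tie_breaker, partial code table); the unique
    # tie_breaker keeps the tuple comparison from ever reaching the dicts.
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merge the two lightest subtrees, prefixing their codes with 0 and 1.
        merged = {s: "0" + code for s, code in left.items()}
        merged.update({s: "1" + code for s, code in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def encoded_bits(text):
    """Total number of bits needed to encode text with its own Huffman table."""
    codes = huffman_codes(text)
    return sum(len(codes[ch]) for ch in text)


if __name__ == "__main__":
    sample = "aaaabbc"
    print(huffman_codes(sample))
    print(encoded_bits(sample), "bits vs", 8 * len(sample), "at 8 bits per character")
```

This count ignores the cost of storing the code table, and decoding must start from a known boundary, which is part of why random access and pattern matching on compressed text are harder than on plain text.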
Compression is new again Web is making all these issues important again Distributed indexing Meta tags Multiple authors Versioning
Multimedia IR (11 & 12) Data Modeling Query Languages XML & SQL Indexing - Text track Feature Extraction - Keystone frames, transitions Speed, dynamic identification - Machine Learning - Feature Extraction (ownership, subject)
Finalize Topic Discussions Leading WIRED Topic Discussions - Week 6, 8 (1), About 20 minutes reviewing issues from the week’s readings Key ideas from the readings Questions you have about the readings Concepts from readings to expand on - PowerPoint slides - Handouts - Extra readings (at least a few days before class) – send to wired listserv
Web Information Retrieval System Evaluation - 5 page written evaluation of a Web IR System - technology overview (how it works) - a brief overview of the development of this type of system (why it works better) - intended uses for the system (who, when, why) - (your) examples or case studies of the system in use and its overall effectiveness
How can (Web) IR be better? - Better IR models - Better User Interfaces More to find vs. easier to find Scriptable applications New interfaces for applications New datasets for applications Projects and/or Papers Overview