IN350: Document Management and Information Steering. Class 5: Text Properties and Processing, File Organization and Indexes. Judith Molka-Danielsen, Høgskolen i Molde.


Slide 1: IN350: Document Management and Information Steering. Class 5: Text properties and processing, File Organization and Indexes. Judith A. Molka-Danielsen, September 10, 2001. Notes are based on Chapter 6 and Appendix A of the Article Collection.

Slide 2: Review: Guest Lectures by Michael Spring
• The Document Processing Revolution (MBS):
  – How do you define a document?
  – Revolutions: reprographics, communications.
  – Transition: create, compose, render; WWW/XML.
  – New processing model for e-documents, future document forms, changes.
• XML, a first look (see MBS notes page):
  – Namespace rules allow for a modular, reusable document.
  – DTDs are historical; Schema is the future.
  – XPath and pointers: used for accessing the document as a tree of nodes.
  – XSL style for rendering: XSLT uses XSL and FO to give a document any desired look.
  – XSLT transformations of documents to multiple forms.
• E-business: B2B applications will use concepts, specifications, and tools based on the XML family.

Slide 3: Modeling Natural Languages and Compression
• Information theory: the amount of information in a text is quantified by its entropy, a measure of information uncertainty. If one symbol appears all the time, it does not convey much information. Higher-entropy text cannot be compressed as much as lower-entropy text.
• Symbols: there are 26 in the English alphabet and 29 in the Norwegian alphabet.
• Frequency of occurrence: the frequency of symbols in text differs across languages. In English, 'e' has the highest frequency of occurrence. Variable-length coding schemes such as Huffman coding represent symbols based on their frequency of occurrence, giving frequent symbols shorter codes.
• Compression for transferring data can be based on the frequency of symbols. More on this in another lecture.
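
A minimal Python sketch of both ideas (not from the slides; the sample string and function names are mine): the entropy H = -sum_i p_i * log2(p_i) of a symbol source, and a Huffman code that assigns shorter bit strings to more frequent symbols.

    import heapq
    import math
    from collections import Counter

    def entropy(text):
        """Shannon entropy in bits per symbol: H = -sum(p_i * log2(p_i))."""
        counts = Counter(text)
        n = len(text)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    def huffman_codes(text):
        """Build a Huffman code: frequent symbols get shorter bit strings.
        (Degenerate case: a text with one distinct symbol gets code "".)"""
        counts = Counter(text)
        # Heap entries: (frequency, unique tie-breaker, {symbol: code-so-far}).
        heap = [(freq, i, {sym: ""}) for i, (sym, freq) in enumerate(counts.items())]
        heapq.heapify(heap)
        tie = len(heap)
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)
            f2, _, right = heapq.heappop(heap)
            # Prefix the two cheapest subtrees with 0 and 1, then merge them.
            merged = {s: "0" + c for s, c in left.items()}
            merged.update({s: "1" + c for s, c in right.items()})
            heapq.heappush(heap, (f1 + f2, tie, merged))
            tie += 1
        return heap[0][2]

    text = "abracadabra"
    print(entropy(text))        # about 2.04 bits per symbol
    print(huffman_codes(text))  # 'a' (the most frequent symbol) gets the shortest code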

Slide 4: Modeling Natural Languages and Compression
• Creation of indices is often based on the frequency of occurrence of words within a text.
• Zipf's Law, named after the Harvard linguistics professor George Kingsley Zipf (1902–1950), is the observation that the frequency of occurrence of some event (P), as a function of its rank (i) when rank is determined by that frequency, is a power-law function P_i ~ 1/i^a, with the exponent a close to unity.
• Zipf's distribution: word frequencies in a document approximately follow this distribution. In the English language, the probability of encountering the i-th most common word is given roughly by Zipf's law for i up to 1000 or so. The law breaks down for less frequent words, since the harmonic series diverges.
• Stopwords: a few hundred words take up 50% of the text. These can be disregarded to reduce the space of indices.
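
A quick way to check Zipf's law on a real text (a sketch, not from the slides; "corpus.txt" stands in for any large plain-text file): if P_i ~ 1/i^a with a close to 1, then rank times frequency should be roughly constant across the top-ranked words.

    from collections import Counter

    def rank_frequency(text):
        """Return (rank, frequency) pairs, most frequent word first."""
        counts = Counter(text.lower().split())
        freqs = sorted(counts.values(), reverse=True)
        return list(enumerate(freqs, start=1))

    # Under Zipf's law, freq * rank stays roughly constant for small ranks.
    text = open("corpus.txt").read()   # hypothetical: any large plain-text file
    for rank, freq in rank_frequency(text)[:10]:
        print(rank, freq, rank * freq)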

Slide 5: Modeling Natural Languages and Compression
• Document vocabulary: the number of distinct (different) words within a document.
• Heaps' Law is shown in the right graph of Figure 6.2 in the readings. The number of new words found in a document grows sublinearly with text size, following the power law V = K·n^b with the exponent b typically between 0.4 and 0.6. So if you have an encyclopedia, most of the words are probably found in the first volume.
• Heaps' Law also implies that the length of the longest words grows logarithmically with text size, while the average word length stays roughly constant, because shorter words occur far more often.
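
A sketch of how the Heaps' law parameters K and b could be estimated from a text (assumptions: "corpus.txt" is a stand-in file of at least a few thousand words, and whitespace splitting is a crude tokenizer): fit log V = log K + b·log n by least squares on sampled points of the vocabulary-growth curve.

    import math

    def heaps_curve(words, step=1000):
        """Vocabulary size V(n) after the first n words, sampled every `step` words."""
        seen = set()
        points = []
        for n, w in enumerate(words, start=1):
            seen.add(w)
            if n % step == 0:
                points.append((n, len(seen)))
        return points

    def fit_heaps(points):
        """Least-squares fit of V = K * n^b in log-log space."""
        xs = [math.log(n) for n, _ in points]
        ys = [math.log(v) for _, v in points]
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
        k = math.exp(my - b * mx)
        return k, b

    words = open("corpus.txt").read().lower().split()  # hypothetical file
    k, b = fit_heaps(heaps_curve(words))
    print(f"V(n) ~ {k:.1f} * n^{b:.2f}")  # b around 0.4-0.6 for English text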

Slide 6: Text Size and Indexing
• Today we discuss in class: What is the purpose of an index, and how is it made? Based on "Appendix A: File Organization and Storage Structures".
• Text processing: large text collections are often indexed using inverted files. A vector of distinct words forms a vocabulary, with pointers to lists of all the documents that contain each word.
• Indexes can be compressed to allow quicker access. (We discuss this later with Chapter 7 in the collection.)
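
To make the inverted-file idea concrete, a toy Python sketch (the three sample documents are invented for illustration): the dictionary keys form the vocabulary, and each key points to the sorted list of document ids containing that word.

    from collections import defaultdict

    def build_inverted_index(docs):
        """Map each distinct word (the vocabulary) to the sorted list of
        ids of the documents that contain it."""
        index = defaultdict(set)
        for doc_id, text in enumerate(docs):
            for word in text.lower().split():
                index[word].add(doc_id)
        return {word: sorted(ids) for word, ids in index.items()}

    docs = [
        "the cat sat on the mat",
        "the dog chased the cat",
        "dogs and cats make poor herders",
    ]
    index = build_inverted_index(docs)
    print(index["cat"])   # [0, 1]: documents 0 and 1 contain "cat"
    print(index["dog"])   # [1]

A real system would store term frequencies or positions in each posting and compress the lists; this sketch only shows the vocabulary-to-postings structure the slide describes.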