CIS265/506 Files & Indexing CIS265/506: File Indexing.

Storage Basics
Hard disks come in several interfaces and formats.
Storage capacity is measured in gigabytes.
Bandwidth determines how fast data can be moved to or from storage. It is measured in MB/sec, with both sustained and burst rates for read and write.
Access time is measured in milliseconds (ms) and consists of seek time (the head moving across the platter), rotational latency (the time it takes the platter to rotate to the correct position), and block transfer time (the time to read or write a block). In general, higher RPMs, smaller platters, and more platters all make for faster access.
Mean Time Between Failures (MTBF) is the average number of hours of operation before a drive fails.
Interface is the protocol that the drive uses to communicate with the PC.

Terminology
Heads: the read/write "needles" that access the drive. In general there are two per platter.
Spindle: the shaft that the drive platters spin on.
Platter: a magnetically coated disk that resembles a record and stores numerous 0s and 1s. A drive may have multiple platters stacked on top of one another (typically 20 GB per platter for IDE and 18 GB per platter for SCSI).
Tracks and cylinders: positional descriptors. A track is one "ring" of a platter; a cylinder is the set of tracks at the same position across all platters.
Sector: another positional descriptor, a pie-shaped slice of the disk that cuts across many tracks.

Terminology
Blocks: a block is located by the combination of a sector number and a track number, and typically stores 512 to 4096 bytes. Blocks are separated by inter-block gaps, which serve as "speed bumps" so that the drive knows where blocks begin and end. Blocks can be combined into contiguous, logically addressable units called clusters.
Hardware address: consists of the block, sector, and track numbers.

Why do we care?
Hard drive performance is measured in milliseconds (ms), while your computer processes information in nanoseconds (ns). Since one millisecond is a million nanoseconds, a disk access is several orders of magnitude slower than a CPU operation. Any speedup in hard drive access therefore yields a serious speedup in machine performance.

From "Data Structures for Java", William H. Ford and William R. Topp, Chapter 23: File Compression. Bret Ford © 2005, Prentice Hall.

Binary Files
Files are either text files or binary files. Java deals with files by creating a byte stream that connects the file and the application. Binary files can be handled with the DataInputStream and DataOutputStream classes.

Binary Files (continued)
A data input stream lets an application read primitive Java data types from an underlying input stream in a machine-independent way. A data output stream lets an application write primitive Java data types to an output stream in a portable way. An application can then use a data input stream to read the data back in.
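The round trip described above can be sketched as follows. This is a minimal illustration rather than code from the slides: the class name DataStreamSketch, the temporary file name, and the three sample values are assumptions chosen for the example. The only library calls used are the standard java.io ones.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class DataStreamSketch {
    // Writes an int, a double, and a string to a temporary binary file,
    // then reads them back in the same order and returns them as one string.
    static String roundTrip() {
        try {
            File file = File.createTempFile("binary-demo", ".dat"); // hypothetical name
            file.deleteOnExit();

            try (DataOutputStream out = new DataOutputStream(new FileOutputStream(file))) {
                out.writeInt(42);
                out.writeDouble(2.5);
                out.writeUTF("disk");
            }

            // Values must be read in exactly the order they were written;
            // the file itself carries no type information.
            try (DataInputStream in = new DataInputStream(new FileInputStream(file))) {
                int i = in.readInt();
                double d = in.readDouble();
                String s = in.readUTF();
                return i + " " + d + " " + s;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip()); // prints "42 2.5 disk"
    }
}
```

The reader has to know the layout of the file in advance, which is the main practical difference from a text file.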

File Compression
Lossless compression loses no data and is used for data backup.

File Compression (continued)
Lossy compression is used for applications like sound and video compression and causes minor loss of data.

http://en.wikipedia.org/wiki/Lossy_data_compression
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. Subject to disclaimers.

File Compression (continued)
The compression ratio is the ratio of the number of bits in the original data to the number of bits in the compressed image. For instance, if a data file contains 500,000 bytes and the compressed data contains 100,000 bytes, the compression ratio is 5:1.

Huffman Compression
Huffman compression relies on counting the number of occurrences of each 8-bit byte in the data and generating a sequence of optimal binary codes called prefix codes. The Huffman algorithm is an example of a greedy algorithm. A greedy algorithm makes an optimal choice at each local step in the hope of creating an optimal solution to the entire problem.

Huffman Compression (continued)
The algorithm generates a table that contains the frequency of occurrence of each byte in the file. Using these frequencies, the algorithm assigns each byte a string of bits known as its bit code and writes the bit code to the compressed image in place of the original byte. Compression occurs if each 8-bit char in a file is replaced by a shorter bit sequence.
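The frequency table is the simplest part of the algorithm to sketch. The version below is an illustration under stated assumptions, not the textbook's code: the class and method names are hypothetical, and the input is taken as an in-memory byte array rather than a file.

```java
import java.nio.charset.StandardCharsets;

class ByteFrequencySketch {
    // Returns a 256-entry table; entry v holds how many times byte value v occurs.
    static int[] countFrequencies(byte[] data) {
        int[] freq = new int[256];
        for (byte b : data) {
            // Java bytes are signed, so mask to get an index in 0..255.
            freq[b & 0xFF]++;
        }
        return freq;
    }

    public static void main(String[] args) {
        byte[] data = "abracadabra".getBytes(StandardCharsets.US_ASCII);
        int[] freq = countFrequencies(data);
        // prints "a=5 b=2 r=2"
        System.out.println("a=" + freq['a'] + " b=" + freq['b'] + " r=" + freq['r']);
    }
}
```

Only the byte values with a nonzero count need to become leaves of the code tree.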

Huffman Compression (continued)

Character                   a    b    c    d    e    f
Frequency (in thousands)    16   4    8    6    20   3
Fixed-length code word      000  001  010  011  100  101

The file holds 57,000 characters, so it occupies 57,000 x 8 = 456,000 bits uncompressed and 57,000 x 3 = 171,000 bits with the fixed-length code.
Compression ratio = 456000/171000 = 2.67

Huffman Compression (continued)
Use a binary tree to represent bit codes. A left edge is a 0 and a right edge is a 1. Each interior node specifies a frequency count, and each leaf node holds a character and its frequency.

Huffman Compression (continued)
Each data byte occurs only in a leaf node, so no bit code is the prefix of another. Such codes are called prefix codes. A full binary tree is one in which each interior node has two children. By converting the tree to a full tree, we can generate better bit codes for our example.

Huffman Compression (continued)
Compression ratio = 456000/148000 = 3.08

Huffman Compression (continued)
To compress a file, replace each char by its prefix code. To uncompress, follow the bit code bit by bit from the root of the tree to the corresponding character, then write the character to the uncompressed file. Good compression involves choosing an optimal tree. It can be shown that the optimal bit codes for a file are always represented by a full tree.
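The uncompress step can be sketched as a walk down the code tree. This is an illustrative sketch, not the textbook's implementation: the Node class, the three-symbol example tree (a = 0, b = 10, c = 11), and the use of a String of '0' and '1' characters in place of a real bit stream are all assumptions made to keep the example short.

```java
class PrefixDecodeSketch {
    // A leaf holds a character; an interior node holds two children.
    static final class Node {
        final char symbol;
        final Node left, right;

        Node(char symbol) {
            this.symbol = symbol;
            this.left = null;
            this.right = null;
        }

        Node(Node left, Node right) {
            this.symbol = '\0';
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return left == null;
        }
    }

    // Follows each bit from the root (0 = left edge, 1 = right edge). On
    // reaching a leaf, emits its character and restarts at the root.
    static String decode(Node root, String bits) {
        StringBuilder out = new StringBuilder();
        Node current = root;
        for (int i = 0; i < bits.length(); i++) {
            current = (bits.charAt(i) == '0') ? current.left : current.right;
            if (current.isLeaf()) {
                out.append(current.symbol);
                current = root;
            }
        }
        return out.toString();
    }

    // Hypothetical full tree with codes a = 0, b = 10, c = 11.
    static Node exampleTree() {
        return new Node(new Node('a'), new Node(new Node('b'), new Node('c')));
    }

    public static void main(String[] args) {
        System.out.println(decode(exampleTree(), "010110")); // prints "abca"
    }
}
```

No separator is needed between codes: because every character sits in a leaf, reaching a leaf marks the end of a code unambiguously.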

Huffman Compression (continued)
A Huffman tree generates the minimum number of bits in the compressed image. It generates optimal prefix codes.

Building a Huffman Tree
For each of the n distinct bytes in a file, assign the byte and its frequency to a tree node, and insert the node into a minimum priority queue ordered by frequency.

Building a Huffman Tree (continued)
Remove the two elements with the smallest frequencies, x and y, from the priority queue, and attach them as children of a new node whose frequency is the sum of the frequencies of its children. Insert the resulting node into the priority queue. In a loop, perform this action n-1 times. Each loop iteration creates one of the n-1 interior nodes of the full tree.

Building a Huffman Tree (continued)
With a minimum priority queue, the least frequently occurring characters have longer bit codes, and the more frequently occurring characters have shorter bit codes.
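The construction described above can be sketched with java.util.PriorityQueue. This is a sketch under stated assumptions, not the code from the text: the class, method, and field names are hypothetical, the frequencies are the six-character example from the earlier table (in thousands), and the input is assumed to contain at least two distinct characters.

```java
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

class HuffmanSketch {
    static final class Node implements Comparable<Node> {
        final char symbol;       // meaningful only for leaves
        final long frequency;
        final Node left, right;

        Node(char symbol, long frequency) {
            this.symbol = symbol;
            this.frequency = frequency;
            this.left = null;
            this.right = null;
        }

        Node(Node left, Node right) {
            this.symbol = '\0';
            this.frequency = left.frequency + right.frequency;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return left == null;
        }

        @Override
        public int compareTo(Node other) {
            return Long.compare(frequency, other.frequency);
        }
    }

    // Builds the tree: one leaf per character, then n-1 merges of the two
    // lowest-frequency nodes. Assumes at least two distinct characters.
    static Node buildTree(Map<Character, Long> frequencies) {
        PriorityQueue<Node> queue = new PriorityQueue<>();
        for (Map.Entry<Character, Long> e : frequencies.entrySet()) {
            queue.add(new Node(e.getKey(), e.getValue()));
        }
        while (queue.size() > 1) {
            Node x = queue.poll();
            Node y = queue.poll();
            queue.add(new Node(x, y));
        }
        return queue.poll();
    }

    // Records the root-to-leaf path of each character (left = 0, right = 1).
    static void assignCodes(Node node, String prefix, Map<Character, String> codes) {
        if (node.isLeaf()) {
            codes.put(node.symbol, prefix);
            return;
        }
        assignCodes(node.left, prefix + "0", codes);
        assignCodes(node.right, prefix + "1", codes);
    }

    static Map<Character, String> codesFor(Map<Character, Long> frequencies) {
        Map<Character, String> codes = new TreeMap<>();
        assignCodes(buildTree(frequencies), "", codes);
        return codes;
    }

    // Frequencies (in thousands) from the six-character example.
    static Map<Character, Long> exampleFrequencies() {
        Map<Character, Long> f = new TreeMap<>();
        f.put('a', 16L); f.put('b', 4L); f.put('c', 8L);
        f.put('d', 6L);  f.put('e', 20L); f.put('f', 3L);
        return f;
    }

    // Sum over all characters of frequency times code length.
    static long totalBits(Map<Character, Long> frequencies, Map<Character, String> codes) {
        long total = 0;
        for (Map.Entry<Character, Long> e : frequencies.entrySet()) {
            total += e.getValue() * codes.get(e.getKey()).length();
        }
        return total;
    }

    public static void main(String[] args) {
        Map<Character, Long> f = exampleFrequencies();
        Map<Character, String> codes = codesFor(f);
        System.out.println(codes);
        System.out.println(totalBits(f, codes) + " thousand bits"); // prints "134 thousand bits"
    }
}
```

For these frequencies the code lengths come out as a = 2, b = 4, c = 2, d = 3, e = 2, f = 4, which gives 134 thousand bits in total. The exact bit patterns depend on which child is placed on the left, so only the lengths should be treated as fixed.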

Building a Huffman Tree (continued)
For the Huffman tree, the compressed file contains (16(2) + 4(4) + 8(2) + 6(3) + 20(2) + 3(4)) x 1000 = 134,000 bits, which corresponds to a compression ratio of 456000/134000 = 3.4.

Huffman Tree
Review pages 415-422 in your text for code and additional information.

Serialization
A persistent object can exist apart from the executing program and can be stored in a file. Serialization involves storing and retrieving objects from an external file. The classes ObjectOutputStream and ObjectInputStream are used for serialization.

Serialization (continued)
Assume anObject is an instance of a class that implements the Serializable interface.

    // the stream oos uses a FileOutputStream that is attached to
    // file "storeFile" for storage of an object
    // (both classes are in java.io)
    ObjectOutputStream oos =
        new ObjectOutputStream(new FileOutputStream("storeFile"));
    ...
    oos.writeObject(anObject); // write anObject to file "storeFile"

Serialization (continued)
Deserializing an object.

    // the stream ois uses a FileInputStream that is attached to
    // file "storeFile" to retrieve an object
    // (both classes are in java.io)
    ObjectInputStream ois =
        new ObjectInputStream(new FileInputStream("storeFile"));
    ClassName recallObj;
    // retrieve from "storeFile"; readObject() returns Object, so cast it
    recallObj = (ClassName) ois.readObject();
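Put together, the two fragments above amount to a write followed by a read. The sketch below shows one way a complete round trip might look. The Point class, the SerializationSketch wrapper, and the use of a temporary file in place of "storeFile" are assumptions made for illustration; only the java.io stream calls come from the slides.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class SerializationSketch {
    // Hypothetical persistent class; implementing Serializable is all that
    // writeObject() requires of it.
    static final class Point implements Serializable {
        private static final long serialVersionUID = 1L;
        final int x;
        final int y;

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }
    }

    // Serializes the object to a temporary file, then deserializes a copy.
    static Point roundTrip(Point original) {
        try {
            File file = File.createTempFile("storeFile", ".ser"); // stands in for "storeFile"
            file.deleteOnExit();

            try (ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream(file))) {
                oos.writeObject(original);
            }

            try (ObjectInputStream ois = new ObjectInputStream(new FileInputStream(file))) {
                return (Point) ois.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Point copy = roundTrip(new Point(3, 4));
        System.out.println(copy.x + "," + copy.y); // prints "3,4"
    }
}
```

The object that comes back is a new instance with the same field values, not the original reference. The try-with-resources blocks close each stream, which the slide fragments leave to the reader.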