CIS265/506 Files & Indexing CIS265/506: File Indexing
Storage Basics
Hard disks come in several interfaces and formats.
Storage capacity is measured in gigabytes (GB).
Bandwidth determines how fast data can be moved to or from storage. It is measured in MB/s, with both sustained and burst rates for reads and writes.
Access time is measured in milliseconds (ms) and consists of seek time (the head moving across the platter), rotational latency (the time it takes the platter to rotate to the correct position), and block transfer time (the time to read or write a block). In general, higher RPM, smaller platters, and more platters all make for faster access.
Mean Time Between Failures (MTBF) is the average number of hours of operation before a drive fails.
Interface is the protocol that the drive uses to communicate with the PC.
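The access-time breakdown above can be turned into a rough estimate. The sketch below is illustrative only: the class and method names are hypothetical, the seek and transfer figures are assumed values, and average rotational latency is approximated in the usual way as the time for half a revolution.

```java
// Hypothetical helper for illustration; names and sample figures are assumptions.
class DiskAccessTime {
    // On average the platter must turn half a revolution before the
    // requested block passes under the head.
    static double avgRotationalLatencyMs(int rpm) {
        double msPerRevolution = 60_000.0 / rpm;
        return msPerRevolution / 2.0;
    }

    // Access time = seek time + rotational latency + block transfer time.
    static double accessTimeMs(double seekMs, int rpm, double transferMs) {
        return seekMs + avgRotationalLatencyMs(rpm) + transferMs;
    }

    public static void main(String[] args) {
        // Assumed figures: 9 ms seek, 7200 RPM, 0.1 ms block transfer.
        System.out.printf("%.2f ms%n", accessTimeMs(9.0, 7200, 0.1));
    }
}
```

Raising the RPM shrinks only the rotational term, which is why seek time often dominates the total.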
Terminology
Heads: the read/write ‘needles’ that access the drive; in general, two per platter.
Spindle: the shaft that the drive platters spin on.
Platter: a magnetically coated disk that resembles a record and stores numerous 0s and 1s. A drive may have multiple platters stacked on top of one another (typically 20 GB per platter for IDE and 18 GB per platter for SCSI).
Tracks and cylinders: a track is a positional descriptor assigned to each “ring” of a platter; a cylinder is the set of matching tracks across multiple platters.
Sector: another positional descriptor of the disk; a pie-shaped slice of the platter that cuts across the tracks.
Terminology
Blocks: a block is identified by the combination of a sector number and a track number, and typically stores 512 to 4096 bytes. Blocks are separated by inter-block gaps, which serve as “speed bumps” so that the drive knows where blocks begin and end. Blocks can be combined into contiguous, logically addressable units called clusters.
Hardware address: consists of block, sector, and track numbers.
Why do we care?
Hard drive performance is measured in milliseconds (ms), while the CPU processes information in nanoseconds (ns). Hard drives are therefore usually thousands of times slower than the CPU, so any speedup in hard drive access can yield a significant speedup in machine performance.
From “Data Structures for Java” William H. Ford William R. Topp Chapter 23 File Compression Bret Ford © 2005, Prentice Hall
Binary Files
Files are either text files or binary files. Java deals with files by creating a byte stream that connects the file and the application. Binary files can be handled with the DataInputStream and DataOutputStream classes.
Binary Files (continued)
A data input stream lets an application read primitive Java data types from an underlying input stream in a machine-independent way. A data output stream lets an application write primitive Java data types to an output stream in a portable way. An application can then use a data input stream to read the data back in.
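A minimal sketch of that write-then-read cycle follows. The class name, the record layout (an int, a double, and a string), and the use of a temporary file are assumptions made for the example; the stream classes and their read/write methods are the standard java.io ones.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class; the record layout is an assumption for illustration.
class BinaryFileDemo {
    // Writes three values to a temporary binary file, reads them back in the
    // same order, and returns them joined into one string.
    static String roundTrip(int id, double price, String name) {
        try {
            Path file = Files.createTempFile("record", ".dat");
            try {
                try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(file))) {
                    out.writeInt(id);        // 4 bytes
                    out.writeDouble(price);  // 8 bytes
                    out.writeUTF(name);      // length-prefixed string
                }
                try (DataInputStream in = new DataInputStream(Files.newInputStream(file))) {
                    // Values must be read in the order they were written.
                    int readId = in.readInt();
                    double readPrice = in.readDouble();
                    String readName = in.readUTF();
                    return readId + " " + readPrice + " " + readName;
                }
            } finally {
                Files.delete(file);          // remove the temporary file
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(42, 9.5, "widget")); // prints "42 9.5 widget"
    }
}
```

The file stores no type information, so the reader has to know the layout; reading the fields in a different order would yield garbage or an exception.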
File Compression
Lossless compression loses no data and is used for purposes such as data backup.
File Compression (continued)
Lossy compression is used for applications such as sound and video and causes some loss of data; the original cannot be recovered exactly.
http://en.wikipedia.org/wiki/Lossy_data_compression Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. Subject to disclaimers.
File Compression (continued)
The compression ratio is the ratio of the number of bits in the original data to the number of bits in the compressed image. For instance, if a data file contains 500,000 bytes and the compressed data contains 100,000 bytes, the compression ratio is 5:1.
Huffman Compression
Huffman compression relies on counting the number of occurrences of each 8-bit byte in the data and generating a sequence of optimal binary codes called prefix codes. The Huffman algorithm is an example of a greedy algorithm. A greedy algorithm makes an optimal choice at each local step in the hope of creating an optimal solution to the entire problem.
Huffman Compression (continued)
The algorithm generates a table that contains the frequency of occurrence of each byte in the file. Using these frequencies, the algorithm assigns each byte a string of bits known as its bit code and writes the bit code to the compressed image in place of the original byte. Compression occurs when the 8-bit bytes in a file are, on average, replaced by shorter bit sequences.
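The frequency table described above can be sketched as a 256-entry array indexed by byte value. The class and method names here are hypothetical, and the input is an in-memory array rather than a file to keep the example short.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; counts how often each 8-bit byte value occurs.
class ByteFrequency {
    // Returns a 256-entry table; entry v holds the number of occurrences of byte value v.
    static int[] count(byte[] data) {
        int[] freq = new int[256];
        for (byte b : data) {
            freq[b & 0xFF]++;   // mask so negative Java bytes index 128..255
        }
        return freq;
    }

    public static void main(String[] args) {
        byte[] data = "abracadabra".getBytes(StandardCharsets.US_ASCII);
        int[] freq = count(data);
        // prints "a=5 b=2 r=2"
        System.out.println("a=" + freq['a'] + " b=" + freq['b'] + " r=" + freq['r']);
    }
}
```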
Huffman Compression (continued)
                             a    b    c    d    e    f
Frequency (in thousands)    16    4    8    6   20    3
Fixed-length code word     000  001  010  011  100  101
Compression ratio = 456000/171000 = 2.67
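The two bit counts in that ratio follow from the frequencies: 57,000 characters at 8 bits each versus 3 bits each. A small sketch of the arithmetic, with hypothetical class and method names:

```java
// Hypothetical helper; checks the fixed-length figures for the six-symbol example.
class FixedLengthRatio {
    // Total bits needed when every character is stored with the same number of bits.
    static long totalBits(int[] freqs, int bitsPerChar) {
        long chars = 0;
        for (int f : freqs) {
            chars += f;
        }
        return chars * bitsPerChar;
    }

    public static void main(String[] args) {
        // Frequencies from the table, six symbols, in actual counts.
        int[] freqs = {16_000, 4_000, 8_000, 6_000, 20_000, 3_000};
        long original = totalBits(freqs, 8);  // 8-bit bytes: 456000 bits
        long fixed = totalBits(freqs, 3);     // 3-bit code words: 171000 bits
        System.out.println(original + " / " + fixed); // prints "456000 / 171000"
    }
}
```

Three bits suffice because six distinct symbols fit within the eight patterns a 3-bit code word can represent.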
Huffman Compression (continued)
Use a binary tree to represent bit codes. A left edge is a 0 and a right edge is a 1. Each interior node specifies a frequency count, and each leaf node holds a character and its frequency.
Huffman Compression (continued)
Each data byte occurs only in a leaf node, so no bit code is a prefix of another. Such codes are called prefix codes. A full binary tree is one in which each interior node has two children. By converting the tree to a full tree, we can generate better bit codes for our example.
Huffman Compression (continued)
Compression ratio = 456000/148000 = 3.08
Huffman Compression (continued)
To compress a file, replace each char by its prefix code. To uncompress, follow the bit code bit by bit from the root of the tree to the corresponding character, write the character to the uncompressed file, and start again at the root for the next code. Good compression involves choosing an optimal tree. It can be shown that the optimal bit codes for a file are always represented by a full tree.
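The decoding walk can be sketched as follows. Everything named here is hypothetical: the Node class, the hand-built three-symbol tree (a=0, b=10, c=11), and the use of a String of '0'/'1' characters in place of packed bits, which a real compressed image would use.

```java
// Hypothetical sketch of prefix-code decoding; not the textbook's implementation.
class PrefixDecoder {
    // A leaf holds a character; an interior node has exactly two children.
    static final class Node {
        final char symbol;
        final Node left, right;

        Node(char symbol) {                 // leaf
            this.symbol = symbol;
            this.left = null;
            this.right = null;
        }

        Node(Node left, Node right) {       // interior node
            this.symbol = '\0';
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return left == null;
        }
    }

    // Follows each bit from the root (0 = left, 1 = right); on reaching a leaf,
    // emits its character and restarts at the root.
    // Assumes the root is an interior node and the bits form whole codes.
    static String decode(Node root, String bits) {
        StringBuilder out = new StringBuilder();
        Node cur = root;
        for (int i = 0; i < bits.length(); i++) {
            cur = (bits.charAt(i) == '0') ? cur.left : cur.right;
            if (cur.isLeaf()) {
                out.append(cur.symbol);
                cur = root;
            }
        }
        return out.toString();
    }

    // Hand-built tree for the assumed codes a=0, b=10, c=11.
    static Node sampleTree() {
        return new Node(new Node('a'), new Node(new Node('b'), new Node('c')));
    }

    public static void main(String[] args) {
        System.out.println(decode(sampleTree(), "010110")); // prints "abca"
    }
}
```

No separator between codes is needed: because no code is a prefix of another, reaching a leaf marks the end of a code unambiguously.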
Huffman Compression (continued)
Among all prefix codes for a given set of byte frequencies, a Huffman tree generates the minimum number of bits in the compressed image. In other words, it generates optimal prefix codes.
Building a Huffman Tree
For each of the n distinct bytes in a file, assign the byte and its frequency to a tree node, and insert the node into a minimum priority queue ordered by frequency.
Building a Huffman Tree (continued)
Remove the two elements with the smallest frequencies, x and y, from the priority queue, and attach them as children of a new node whose frequency is the sum of the frequencies of its children. Insert the resulting node into the priority queue. In a loop, perform this action n-1 times. Each loop iteration creates one of the n-1 interior nodes of the full tree.
Building a Huffman Tree (continued)
With a minimum priority queue, the least frequently occurring characters are merged first and end up deepest in the tree, so they have longer bit codes; the more frequently occurring characters have shorter bit codes.
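The construction steps can be sketched with java.util.PriorityQueue. This is a minimal illustration rather than the textbook's code: the Node class and method names are hypothetical, and the sample frequencies are the six-symbol example (in thousands) used earlier in these slides.

```java
import java.util.PriorityQueue;

// Hypothetical sketch of Huffman tree construction; not the textbook's implementation.
class HuffmanSketch {
    static final class Node {
        final char symbol;      // meaningful only for leaves
        final int freq;
        final Node left, right;

        Node(char symbol, int freq) {           // leaf
            this.symbol = symbol;
            this.freq = freq;
            this.left = null;
            this.right = null;
        }

        Node(Node left, Node right) {           // interior: frequency is the children's sum
            this.symbol = '\0';
            this.freq = left.freq + right.freq;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return left == null;
        }
    }

    // Builds the tree; assumes at least two symbols.
    static Node build(char[] symbols, int[] freqs) {
        PriorityQueue<Node> pq =
            new PriorityQueue<>((x, y) -> Integer.compare(x.freq, y.freq));
        for (int i = 0; i < symbols.length; i++) {
            pq.add(new Node(symbols[i], freqs[i]));
        }
        // n-1 iterations, each creating one interior node.
        while (pq.size() > 1) {
            Node x = pq.remove();   // smallest frequency
            Node y = pq.remove();   // next smallest
            pq.add(new Node(x, y));
        }
        return pq.remove();
    }

    // Sum over all leaves of frequency times code length (leaf depth).
    static long totalBits(Node node, int depth) {
        if (node.isLeaf()) {
            return (long) node.freq * depth;
        }
        return totalBits(node.left, depth + 1) + totalBits(node.right, depth + 1);
    }

    public static void main(String[] args) {
        char[] symbols = {'a', 'b', 'c', 'd', 'e', 'f'};
        int[] freqs = {16, 4, 8, 6, 20, 3};     // in thousands
        Node root = build(symbols, freqs);
        System.out.println(totalBits(root, 0)); // prints 134 (thousand bits)
    }
}
```

When two nodes tie on frequency, the queue may pick either one, so the tree shape can vary between runs of different implementations, but the total bit count stays the same.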
Building a Huffman Tree (continued)
For the Huffman tree, the compressed file contains (16(2) + 4(4) + 8(2) + 6(3) + 20(2) + 3(4)) × 1000 = 134,000 bits, which corresponds to a compression ratio of 456,000/134,000 ≈ 3.4.
Huffman Tree
Review pages 415-422 in your text for code and additional information.
Serialization
A persistent object can exist apart from the executing program and can be stored in a file. Serialization involves storing objects in, and retrieving them from, an external file. The classes ObjectOutputStream and ObjectInputStream are used for serialization.
Serialization (continued)
Assume anObject is an instance of a class that implements the Serializable interface.
    // the stream oos uses a FileOutputStream that is attached to
    // file "storeFile" for storage of an object
    ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream("storeFile"));
    ...
    oos.writeObject(anObject); // write anObject to file "storeFile"
Serialization (continued)
Deserializing an Object.
    // the stream ois uses a FileInputStream that is attached to
    // file "storeFile" to retrieve an object
    ObjectInputStream ois = new ObjectInputStream(new FileInputStream("storeFile"));
    ClassName recallObj;
    // retrieve from "storeFile"
    recallObj = (ClassName)ois.readObject();
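Putting the two halves together, a complete round trip might look like the sketch below. The Point class, the roundTrip method, and the temporary file are assumptions for illustration; the stream classes and the writeObject/readObject calls are the ones named above. Note that readObject can throw ClassNotFoundException in addition to IOException.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical demo; Point stands in for any class that implements Serializable.
class SerializationDemo {
    static final class Point implements Serializable {
        private static final long serialVersionUID = 1L;
        final int x, y;

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }
    }

    // Writes the object to a temporary file, then reads a copy back from it.
    static Point roundTrip(Point p) {
        try {
            File file = File.createTempFile("storeFile", ".ser");
            try {
                try (ObjectOutputStream oos =
                         new ObjectOutputStream(new FileOutputStream(file))) {
                    oos.writeObject(p);
                }
                try (ObjectInputStream ois =
                         new ObjectInputStream(new FileInputStream(file))) {
                    return (Point) ois.readObject();
                }
            } finally {
                file.delete();   // clean up the temporary file
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Point copy = roundTrip(new Point(3, 4));
        System.out.println(copy.x + "," + copy.y); // prints "3,4"
    }
}
```

The object that comes back is a new instance with the same field values, not the original reference.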