Unit 3 - Preeti Deshmukh
Data Compression
Data compression covers ways to make files smaller. Reasons to compress:
- uses less storage, resulting in cost savings;
- compressed files can be transmitted faster;
- access time decreases;
- the same access time can be achieved with lower, cheaper bandwidth; and
- files can be processed faster sequentially.
Data compression involves encoding the information in a file in such a way that it takes up less space. Many data compression techniques are available; some are general, and some are designed for specific kinds of data (speech, pictures, text, instrument data).
Using a Different Notation:
Redundancy reduction: decrease the number of bits used by finding a more compact notation.
What are the costs of this compression scheme?
- With pure binary encoding, the file becomes unreadable by humans.
- Encoding time is needed for every new addition, and decoding time whenever the data is needed.
- The encoding/decoding modules must be incorporated into all software that uses the file, increasing its complexity.
Is this kind of compression worth it?
- Bad idea when the file is fairly small, is accessed by many different pieces of software, or some software cannot deal with the encoded data.
- Good idea when the file contains several million records and is generally processed by one program.
Suppressing Repeating Sequences:
Run-length encoding: sparse arrays are good candidates for this kind of compression. Example: consider an image of the sky.
First we choose one special, unused byte value to indicate that a run-length code follows. The run-length encoding algorithm then goes like this:
- Read through the pixels that make up the image, copying the pixel values to the file in sequence, except where the same pixel value occurs more than once in succession.
- Where the same value occurs more than once in succession, substitute the following 3 bytes, in order:
  - the special run-length indicator;
  - the pixel value that is repeated; and
  - the number of times that the value is repeated (up to 256 times).
Example: Suppose we wish to compress an image using run-length encoding, and we find that we can omit the byte 0xff from the representation of the image. We choose 0xff as our run-length indicator and encode the image's sequence of hexadecimal byte values as follows:
- The first 3 pixels are copied in sequence.
- The runs of 24 and 26 are both run-length encoded, each being replaced by the indicator ff, the repeated value, and the length of the run.
- The remaining pixels are copied in sequence.
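To make the algorithm above concrete, here is a minimal C++ sketch of such a run-length encoder. The function name rleEncode, the default indicator value 0xff, and the cap of 255 on the single count byte are illustrative assumptions rather than details taken from the slides.

```cpp
#include <cstdint>
#include <vector>

// Minimal run-length encoder sketch (assumed helper, not from the slides).
// 'indicator' is the special, unused byte value chosen to mark a run;
// a pixel equal to the indicator is assumed never to occur in the data.
std::vector<std::uint8_t> rleEncode(const std::vector<std::uint8_t>& pixels,
                                    std::uint8_t indicator = 0xff)
{
    std::vector<std::uint8_t> out;
    std::size_t i = 0;
    while (i < pixels.size()) {
        // Measure the length of the run starting at position i.
        std::size_t run = 1;
        while (i + run < pixels.size() && pixels[i + run] == pixels[i]
               && run < 255)             // one count byte: cap the run length
            ++run;
        if (run > 1) {
            // Substitute the 3-byte code: indicator, value, count.
            out.push_back(indicator);
            out.push_back(pixels[i]);
            out.push_back(static_cast<std::uint8_t>(run));
        } else {
            // Single occurrence: copy the pixel value through unchanged.
            out.push_back(pixels[i]);
        }
        i += run;
    }
    return out;
}
```

Fed the pixel sequence from the example, this would replace each run of 24s and 26s with a 3-byte ff code while copying the other pixel values through unchanged.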
Assigning Variable-Length Codes
Suppose we have two different symbols to use in an encoding scheme: a dot (.) and a dash (-). We have to assign combinations of dots and dashes to the letters of the alphabet. We can give the shortest combinations to the most frequently occurring letters and use the other combinations for the remaining letters, so the common letters get by with fewer symbols. Morse code is the most common scheme of this kind.
Variable-length codes are based on the principle that some values occur more frequently than others, so the codes for those values should take the least amount of space. They can be implemented using a table lookup, where the table never changes. Variable-length codes are another form of redundancy reduction.
Modern variable-length coding techniques dynamically build the table that describes the encoding scheme. One of the most successful is the Huffman code:
- Determine the probability of each value occurring in the data set, then build a binary tree in which the search path for each value represents the code for that value.
- More frequently occurring values are given shorter search paths in the tree.
- The tree is then turned into a table, much like a Morse code table, that can be used for encoding and decoding the data.
Example: suppose we have a data set containing 7 letters.
Letter:      a  b  c  d  e  f  g
Probability:
Code:
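As an illustration of how such a code table might be built, here is a minimal C++ sketch that constructs Huffman codes from a set of symbol probabilities using a priority queue. The names Node and buildHuffmanCodes, and the queue-based construction, are assumptions made for this sketch; the slides do not prescribe an implementation.

```cpp
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Sketch of Huffman code construction (assumed implementation).
struct Node {
    char symbol;     // meaningful only for leaves
    double prob;     // probability (or frequency) of this subtree
    Node* left;
    Node* right;
};

static void collectCodes(const Node* n, const std::string& path,
                         std::map<char, std::string>& codes)
{
    if (!n) return;
    if (!n->left && !n->right) {          // leaf: the path is the code
        codes[n->symbol] = path.empty() ? "0" : path;
        return;
    }
    collectCodes(n->left,  path + "0", codes);
    collectCodes(n->right, path + "1", codes);
}

std::map<char, std::string>
buildHuffmanCodes(const std::vector<std::pair<char, double>>& probs)
{
    // Min-heap ordered by probability: the two least likely subtrees are
    // merged first, so frequent symbols end up near the root.
    auto cmp = [](const Node* a, const Node* b) { return a->prob > b->prob; };
    std::priority_queue<Node*, std::vector<Node*>, decltype(cmp)> heap(cmp);

    for (auto& p : probs)
        heap.push(new Node{p.first, p.second, nullptr, nullptr});

    while (heap.size() > 1) {
        Node* a = heap.top(); heap.pop();
        Node* b = heap.top(); heap.pop();
        heap.push(new Node{'\0', a->prob + b->prob, a, b});
    }

    std::map<char, std::string> codes;
    if (!heap.empty()) collectCodes(heap.top(), "", codes);
    return codes;        // note: tree nodes are deliberately leaked in this sketch
}
```

Applying this to the seven letters above would assign the shortest code to the most probable letter, mirroring the idea that frequent values get the shortest search paths in the tree.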
Irreversible Compression Techniques
Irreversible compression is based on the assumption that some information can be sacrificed (this is also called entropy reduction). Example: shrinking a raster image from 400-by-400 pixels to 100-by-100 pixels, so that the new image contains 1 pixel for every 16 pixels of the original. Irreversible compression is not very useful for data files. Speech compression is often done by voice coding, which transmits a parameterized description of the speech that can be synthesized at the receiving end with varying amounts of distortion.
Compression in Unix: both Berkeley and System V Unix provide compression routines that are heavily used and effective.
- System V has pack and unpack, which use Huffman codes on a byte-by-byte basis. pack typically achieves a 25 to 40 percent reduction on text files, and less on binary files. After compression it automatically appends ".z" to the end of the file name, signalling that the file has been compressed.
- Berkeley Unix has compress and uncompress, which use a dynamic method, the Lempel-Ziv algorithm. compress and uncompress behave much the same as pack and unpack.
Since these routines are readily available, it is wise to use them rather than write your own.
Reclaiming Space in Files
Variable-length record updating: suppose a record is modified in such a way that the new record is longer than the original one. Then what? We can either append just the extra data at the end of the file and link to it with a pointer, or append the entire rewritten record at the end.
In general, file modification takes three forms:
- record addition;
- record updating; and
- record deletion.
Record Deletion and Storage Compaction:
Storage compaction makes files smaller by looking for places in the file where there is no data at all and recovering this space. Such empty spaces occur when records are deleted.
Record deletion: a record deletion strategy must provide a way to recognize deleted records. A simple and workable approach is to place a special mark in each deleted record, for example an asterisk as the first field of the deleted record.
Reusing the space: once we can recognize deleted records, the next question is how to reuse the space they occupied. One approach is to simply mark records as deleted and not reclaim their space for some period of time. The programs that use the file contain logic to ignore such records, which is useful when a user may want to undelete a record. Reclamation of the space from deleted records then happens all at once: a special program is used to reconstruct the file with the deleted records squeezed out, as the following example shows.
(a) Before deleting the second record:
Ames|Mary|123 Maple|Stillwater|OK|74075|
Morrison|Sebastian|9035 South Hillcrest|Forest Village|OK|74820|
Brown|Martha|625 Kimbark|Des Moines|IA|50311|
(b) After deleting the second record:
Ames|Mary|123 Maple|Stillwater|OK|74075|
*|rrison|Sebastian|9035 South Hillcrest|Forest Village|OK|74820|
Brown|Martha|625 Kimbark|Des Moines|IA|50311|
(c) After compaction, the second record is gone:
Ames|Mary|123 Maple|Stillwater|OK|74075|
Brown|Martha|625 Kimbark|Des Moines|IA|50311|
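A minimal C++ sketch of the reclamation (compaction) step, under the assumption that each variable-length, '|'-delimited record occupies one line and that a deleted record is marked by an asterisk in its first byte, as in the figure above. The function name compactFile is made up for illustration.

```cpp
#include <fstream>
#include <string>

// Rewrite 'inName' as 'outName', dropping records whose first byte is '*'.
// Assumes one variable-length, '|'-delimited record per line.
bool compactFile(const char* inName, const char* outName)
{
    std::ifstream in(inName);
    std::ofstream out(outName);
    if (!in || !out) return false;

    std::string record;
    while (std::getline(in, record)) {
        if (!record.empty() && record[0] == '*')
            continue;                // skip records marked as deleted
        out << record << '\n';       // keep live records, in their original order
    }
    return true;
}
```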
Deleting Fixed-Length Records for Reclaiming Space Dynamically
Some applications are too volatile and interactive for storage compaction to be useful; in these situations we need to reuse the space of deleted records as soon as possible. This is dynamic storage reclamation for fixed-length record deletion.
The mechanism for record deletion must guarantee two things:
- deleted records are marked in some special way; and
- we can find the space that deleted records once occupied, so that we can reuse it when we add records.
With fixed-length records, the simplest form of space reuse is to search sequentially through the file before adding a record, looking record by record until a deleted record is found. If none is found, the new record is appended at the end of the file.
To Make It Quick
The sequential-search approach is slow. To make reuse quick we need:
- a way to know immediately if there are empty slots in the file; and
- a way to jump directly to one of those slots if one exists.
Two structures can be used: linked lists and stacks.
Linked Lists: the use of a linked list can meet both needs.
In this structure each node contains a reference to its successor. The list is made up of deleted records, that is, of the available space in the file, and is called the avail list. When inserting a new record, all available record slots are equally acceptable, so there is no reason to prefer any particular order for the list.
(Figure: a head pointer leading through a chain of available record slots, each holding a pointer to the next, with -1 marking the end of the list.)
Stacks: the simplest way to handle the avail list is as a stack.
The avail list is managed as a stack containing relative record numbers (RRNs). Example: if the stack contains RRN 5 and RRN 2 and we then push RRN 3, the head pointer changes from 5 to 3; record 3 now links to 5, record 5 links to 2, and record 2 carries -1 to mark the end of the list.
(Figure: the avail-list stack before and after RRN 3 is pushed.)
Linking and stacking deleted records:
Placing deleted records on a stack meets the two criteria for rapid access to reusable space:
- a way to know immediately if there are empty slots in the file; and
- a way to jump directly to one of those slots if one exists.
If the pointer to the top of the stack contains the end-of-list value, there are no empty slots and we must append the new record at the end of the file. If the pointer to the top of the stack is a valid node reference, a reusable slot is available, and the pointer tells us where to find it.
Stacking and linking are done by arranging and rearranging the links used to make one available record slot point to the next.
Because we are working with fixed-length records in disk files rather than with memory addresses, "pointing" is not done with memory pointers; it is done with RRNs.
Example: consider a fixed-length record file containing 7 records (RRNs 0-6), and suppose records 3 and 5 have been deleted. The first field of each deleted record is marked with an asterisk, and the second field of the deleted record holds the link to the next node on the avail list, as the following diagram shows.
Implementation of fixed-length record deletion
(a) List head (first available record) = 5:
RRN 0: Edwards...  RRN 1: Bates...  RRN 2: Wills...  RRN 3: * -1  RRN 4: Masters...  RRN 5: * 3  RRN 6: Chavez...
(b) After record 1 is also deleted, list head (first available record) = 1:
RRN 0: Edwards...  RRN 1: * 5  RRN 2: Wills...  RRN 3: * -1  RRN 4: Masters...  RRN 5: * 3  RRN 6: Chavez...
(c) After three new records are added, list head (first available record) = -1:
RRN 0: Edwards...  RRN 1: 1st new record  RRN 2: Wills...  RRN 3: 3rd new record  RRN 4: Masters...  RRN 5: 2nd new record  RRN 6: Chavez...
To implement fixed-length record deletion, we need a function that returns the RRN of a reusable record slot, or the RRN of the next record to be appended if no reusable slot is available.
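The following C++ sketch shows one way the push/pop logic just described might look on disk. The fixed record length of 64 bytes, the 4-byte avail-list head stored at the front of the file, and the function names are all assumptions made for this illustration.

```cpp
#include <cstring>
#include <fstream>

// Sketch of stack-style reuse of deleted fixed-length record slots.
// Assumptions (not from the slides): records are RECLEN bytes long, a 4-byte
// avail-list head is stored at the start of the file (initialized to -1 when
// the file is created), and a deleted record holds '*' followed by the RRN
// of the next free slot.
const int RECLEN = 64;
const int HEADER = sizeof(int);

long recOffset(int rrn) { return HEADER + static_cast<long>(rrn) * RECLEN; }

int readHead(std::fstream& f) {
    int head;
    f.seekg(0);
    f.read(reinterpret_cast<char*>(&head), sizeof head);
    return head;
}

void writeHead(std::fstream& f, int head) {
    f.seekp(0);
    f.write(reinterpret_cast<const char*>(&head), sizeof head);
}

// Delete record 'rrn': mark the slot and push it onto the avail list.
void pushAvail(std::fstream& f, int rrn) {
    char buf[RECLEN] = {0};
    buf[0] = '*';                              // deletion mark
    int next = readHead(f);
    std::memcpy(buf + 1, &next, sizeof next);  // link to the old head
    f.seekp(recOffset(rrn));
    f.write(buf, RECLEN);
    writeHead(f, rrn);                         // the deleted slot becomes the head
}

// Return the RRN of a reusable slot, or -1 if the record must be appended.
int popAvail(std::fstream& f) {
    int head = readHead(f);
    if (head == -1) return -1;                 // no free slots in the file
    char buf[RECLEN];
    f.seekg(recOffset(head));
    f.read(buf, RECLEN);
    int next;
    std::memcpy(&next, buf + 1, sizeof next);
    writeHead(f, next);                        // unlink the reused slot
    return head;
}
```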
Deleting Variable-Length Records
Reuse through an avail list requires:
- a way to link deleted records together into a list;
- an algorithm for adding newly deleted records to the avail list; and
- an algorithm for finding and removing records from the avail list when we are ready to reuse them.
The avail list for variable-length records: adding and removing records works much as it does for fixed-length records, but we cannot use RRNs for the links, because with variable-length records an RRN cannot be turned into a byte offset. The links must therefore contain byte offsets.
Adding and removing records:
(Figure: (a) an avail list of variable-length record slots of sizes 47, 38, 72, and 68, ending in -1; (b) the same list after the slot of size 72 is removed for reuse: a new link joins the size-38 slot directly to the size-68 slot, and the removed record's own link is set to -1.)
Placement Strategies: when we need to remove a record slot from the avail list, we look through the list, starting from the beginning, until we either find a record slot that is big enough or reach the end of the list. This is called a first-fit placement strategy.
One refinement is to develop a more orderly approach by keeping the avail list in ascending or descending order by size, which affects how closely the chosen slot fits the new record's needs.
Ascending order by size: the search proceeds sequentially until it encounters a record slot that is big enough. Because the list is sorted, the first such slot is also the smallest one that fits, so the fit between the available slot and the new record's needs is as close as we can make it. This is the best-fit placement strategy.
The best-fit placement strategy has to search through at least part of the list both when we get a record slot from the list and when we put a newly deleted record onto the list, so it costs extra processing time.
Descending order by size: the largest record slot is always at the top of the list. The retrieval procedure starts from the beginning and therefore always returns the largest available record slot; this is known as the worst-fit placement strategy.
Why the procedure works: the removal procedure needs to look only at the first record slot; if that one is not big enough to do the job, none of the others will be. And because we extract only the space we need from the chosen slot, the unused portion of the slot is left as large as possible, decreasing external fragmentation.
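A minimal C++ sketch of the three placement strategies described above, operating on an in-memory avail list of (offset, size) pairs. The function names and the std::list representation are assumptions for illustration; in a real file the avail list would be threaded through the deleted records themselves.

```cpp
#include <list>
#include <utility>

// Each avail-list entry: (byte offset of the slot, size of the slot in bytes).
using Slot = std::pair<long, int>;
using AvailList = std::list<Slot>;

// First fit: take the first slot that is big enough (list in arrival order).
AvailList::iterator firstFit(AvailList& avail, int need) {
    for (auto it = avail.begin(); it != avail.end(); ++it)
        if (it->second >= need) return it;
    return avail.end();               // no slot is big enough: append to the file
}

// Best fit: with the list kept in ascending order by size, the first slot
// that is big enough is also the smallest one that fits.
AvailList::iterator bestFit(AvailList& avail, int need) {
    return firstFit(avail, need);     // same scan, different list ordering
}

// Worst fit: with the list kept in descending order by size, only the first
// slot needs to be examined; if it is too small, none of the others will do.
AvailList::iterator worstFit(AvailList& avail, int need) {
    if (!avail.empty() && avail.front().second >= need) return avail.begin();
    return avail.end();
}
```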
Conclusion: No placement strategy is superior under all circumstances
Some suggestions:
- Placement strategies make sense only for volatile, variable-length record files; with fixed-length records they are simply not an issue.
- If space is being lost to internal fragmentation, the choice is between first fit and best fit; worst fit truly makes internal fragmentation worse.
- If space is being lost to external fragmentation, one should give careful consideration to the worst-fit strategy.
Finding things Quickly:
Topics: finding things in simple field and record files; search by guessing (binary search); binary search versus sequential search; and sorting a disk file in memory.
Limitations of binary searching and internal sorting:
- binary searching requires more than one or two accesses;
- keeping a file sorted is very expensive; and
- an internal sort works only on small files.
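For reference, here is a minimal sketch of binary search over a file of fixed-length records sorted by key, expressed against the kind of NumRecs/ReadByRRN interface used later in these notes. The template parameters and the assumption that RecType has a Key() method comparable with < and == are illustrative, not taken from the slides.

```cpp
// Binary search by relative record number (RRN) over any file-like class
// that provides NumRecs() and ReadByRRN(record, rrn), such as the
// FixedRecordFile class shown later in these notes.
template <class FileType, class RecType, class KeyType>
int BinarySearch(FileType& file, RecType& obj, const KeyType& key)
{
    int low = 0, high = file.NumRecs() - 1;
    while (low <= high) {
        int mid = low + (high - low) / 2;   // middle RRN of the current range
        file.ReadByRRN(obj, mid);           // one disk access per probe
        if (obj.Key() == key) return mid;   // found: return its RRN
        if (obj.Key() < key) low = mid + 1; // target is in the upper half
        else high = mid - 1;                // target is in the lower half
    }
    return -1;                              // not found
}
```

Each probe costs a disk access, which is why binary searching a disk file is so much slower than binary searching in memory.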
Keysorting
Keysort, also referred to as tag sort, rests on the idea that we need to sort only the record keys; there is no need to read the whole file into memory during the sorting process. We
- read the keys into memory,
- sort them, and
- then rearrange the records in the file according to the new order of the keys.
Because keysort never reads the complete set of records into memory, it can sort larger files than an internal sort can with the same amount of memory.
Method: assume a fixed-length record file, with a count of the number of records stored in a header record. The file class FixedRecordFile must support the methods NumRecs and ReadByRRN. To store the key-RRN pairs taken from the file we need a class KeyRRN with two data members, KEY and RRN.
The keysort algorithm works much like a normal internal sort, but with two important differences:
- rather than read an entire record into a memory array, we simply read each record into a temporary buffer, extract the key, and then discard the record; and
- when we write the records out in sorted order, we have to read them in a second time, since they are not all stored in memory.
class FixedRecordFile
{
public:
    int NumRecs();
    int ReadByRRN(RecType & record, int RRN);
    // additional methods required for keysort
    int Create(char * fileName);
    int Append(RecType & record);
};

class KeyRRN // contains a pair (KEY, RRN)
{
public:
    KeyType KEY;
    int RRN;
    KeyRRN();
    KeyRRN(KeyType key, int rrn);
};

int Sort(KeyRRN keys[], int numKeys); // sort the array by key

Minimal functionality required for classes used by the keysort algorithm.
int KeySort(FixedRecordFile & inFile, char * outFileName)
{
    RecType obj;
    KeyRRN * KEYNODES = new KeyRRN[inFile.NumRecs()];
    // read the file and load the keys
    for (int i = 0; i < inFile.NumRecs(); i++)
    {
        inFile.ReadByRRN(obj, i);           // read record i
        KEYNODES[i] = KeyRRN(obj.Key(), i); // put the key and RRN into KEYNODES
    }
    Sort(KEYNODES, inFile.NumRecs());       // sort the keys

    FixedRecordFile outFile;                // file to hold the records in key order
    outFile.Create(outFileName);            // create a new file

    // write the new file in key order
    for (int j = 0; j < inFile.NumRecs(); j++)
    {
        inFile.ReadByRRN(obj, KEYNODES[j].RRN); // read in key order
        outFile.Append(obj);                    // write in key order
    }
    delete [] KEYNODES;
    return 1;
}
Limitations of Keysort method
We need to read in the records a second time before we can write out the new sorted file, and doing everything twice is not desirable. Because each record is read again just before it is written to the new file, the input file is not being read sequentially: creating the sorted file requires many random seeks.
Index
What is an index? An index is a way to find things. It is a table containing a list of topics (keys) and references to the records that contain them (reference fields). The basic concepts, then, are keys and reference fields. The simplest type is the simple index.
Simple index for Entry-Sequenced file
Example: a collection of musical recordings. The contents of the data file include: identification number, title, composer or composers, artist or artists, and label (publisher).
Contents of a sample recording file:
Record address   Label   ID number   Title                      Composer    Artists
17               LON     2312        Romeo and Juliet           Prokofiev   Maazel
62               RCA     2626        Quartet in C sharp Minor   Beethoven   Julliard
117              WAR     23699       Touchstone                 Corea
152              ANG     3795        Symphony No. 9                         Giulini
Forming a primary key by combining fields: for the recording file, we combine the initials of the company label with the recording's ID number. The canonical form of this LabelID key consists of the uppercase form of the Label field followed immediately by the ASCII representation of the ID number, for example LON2312. To get rapid keyed access to individual records, we construct an index for the file, giving us a data file and an index file.
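A minimal sketch of what a simple in-memory index for the recording file might look like in C++: an array of (key, byte offset) entries kept sorted by canonical key and searched with binary search. The struct name IndexEntry, the field sizes, and the function SearchIndex are assumptions for illustration.

```cpp
#include <cstring>

// One entry of a simple index: a canonical key and the byte offset
// (reference field) of the corresponding record in the data file.
struct IndexEntry {
    char KEY[13];     // canonical LabelID, e.g. "LON2312"
    long RECADDR;     // byte offset of the record in the data file
};

// Binary search of an index kept sorted by KEY.
// Returns the record's byte offset, or -1 if the key is not present.
long SearchIndex(const IndexEntry index[], int numEntries, const char* key)
{
    int low = 0, high = numEntries - 1;
    while (low <= high) {
        int mid = low + (high - low) / 2;
        int cmp = std::strcmp(key, index[mid].KEY);
        if (cmp == 0) return index[mid].RECADDR;   // found
        if (cmp < 0)  high = mid - 1;              // look in the lower half
        else          low  = mid + 1;              // look in the upper half
    }
    return -1;
}
```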
Using Template Classes in C++ for Object I/O
A good object-oriented design for a file of objects should provide read and write operations that work directly on objects, letting the application skip the packing and unpacking steps. So far we have used class BufferFile together with the objects' Pack and Unpack methods. What we want is a class RecordFile that makes the following possible:
    Person p;    RecordFile pFile;  pFile.Read(p);
    Recording r; RecordFile rFile;  rFile.Read(r);
Template class RecordFile:

template <class RecType>
class RecordFile: public BufferFile
{
public:
    int Read(RecType & record, int recaddr = -1);
    int Write(const RecType & record, int recaddr = -1);
    int Append(const RecType & record);
    RecordFile(IOBuffer & buffer): BufferFile(buffer) {}
};
// The template parameter RecType must have the following methods:
//   int Pack(IOBuffer &);    pack the record into the buffer
//   int Unpack(IOBuffer &);  unpack the record from the buffer
Implementation of RecordFile::Read
Object-oriented support for indexed, entry-sequenced files of data objects begins with the implementation of RecordFile::Read:

template <class RecType>
int RecordFile<RecType>::Read(RecType & record, int recaddr)
{
    int readAddr, result;
    readAddr = BufferFile::Read(recaddr);  // read the record into the buffer
    if (!readAddr) return -1;
    result = record.Unpack(Buffer);        // RecType::Unpack
    if (!result) return -1;
    return readAddr;
}
Index Operations required to Maintain an Indexed File:
- Create the original empty index and data files.
- Load the index file into memory before using it.
- Rewrite the index file from memory after using it.
- Add data records to the data file.
- Delete records from the data file.
- Update records in the data file.
- Update the index to reflect changes in the data file.
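The sketch below shows one way these maintenance operations might be grouped into a C++ class, with the index loaded into memory when the files are opened and rewritten when they are closed. The class name IndexedFile, its members, and the plain-text index format are assumptions for illustration only, not the textbook's interface.

```cpp
#include <fstream>
#include <map>
#include <string>

// Sketch: a data file plus an in-memory index that is loaded at open time
// and rewritten at close time (assumed design, for illustration only).
class IndexedFile {
public:
    bool Open(const std::string& dataName, const std::string& indexName) {
        data.open(dataName, std::ios::in | std::ios::out | std::ios::binary);
        LoadIndex(indexName);                 // read (key, offset) pairs
        indexFileName = indexName;
        return data.is_open();
    }
    void Close() {
        RewriteIndex(indexFileName);          // persist the updated index
        data.close();
    }
    long Search(const std::string& key) {     // -1 if the key is absent
        auto it = index.find(key);
        return it == index.end() ? -1 : it->second;
    }
    void AddEntry(const std::string& key, long offset) { index[key] = offset; }
    void DeleteEntry(const std::string& key) { index.erase(key); }

private:
    void LoadIndex(const std::string& name) {
        std::ifstream in(name);
        std::string key; long offset;
        while (in >> key >> offset) index[key] = offset;
    }
    void RewriteIndex(const std::string& name) {
        std::ofstream out(name, std::ios::trunc);
        for (auto& e : index) out << e.first << ' ' << e.second << '\n';
    }
    std::fstream data;
    std::map<std::string, long> index;        // key -> byte offset
    std::string indexFileName;
};
```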
Definitions:
- Index: an index is a tool for finding records in a file. It consists of a key field, on which the index is searched, and a reference field that tells where to find the data file records associated with a key.
- Inverted list: refers to indexes in which a key may be associated with a list of reference fields pointing to documents that contain the key.
- Selective index: contains keys for only a portion of the records in the data file.
- Pinned record: a record is said to be pinned when some other record or file structure refers to it by its physical location.
- Binding: binding takes place either during the preparation of the data file and indexes or during program execution.
- Coalescence: the process of combining smaller available spaces into a larger one is known as coalescing holes. Coalescence is a way to counteract the problem of external fragmentation.
Indexes that are Too Large to hold in Memory
Our assumption so far has been that the index is small enough to be loaded into memory. If the index is too large, access and maintenance must be done on secondary storage, which has disadvantages:
- binary searching of the index requires several seeks instead of taking place entirely in memory, so it is not dramatically faster than binary searching of a sorted data file; and
- record addition and deletion require shifting and sorting on secondary storage, which is millions of times more expensive than performing the same operations in memory.
Whenever a simple index is too large to hold in memory:
- a hashed organization can be used if access speed is the top priority; or
- a tree-structured, multilevel index such as a B-tree can be used if we need the flexibility of both keyed access and ordered, sequential access.
Multiple Keys Indexing
Example: consider the analogy of our index as a library card catalog. The primary key, the Label ID, acts as a kind of catalog number. A library catalog also relates an author entry (a secondary key) to the card catalog number (the primary key); fields such as composer or title play this secondary key role in the recording file.
class SecondaryIndex // an index in which the record reference is a string
{
public:
    int Insert(char * secondaryKey, char * primaryKey);
    char * Search(char * secondaryKey); // returns the primary key
};

template <class RecType>
int SearchOnSecondary(char * composer, SecondaryIndex index,
                      IndexedFile<RecType> dataFile, RecType & rec)
{
    char * Key = index.Search(composer);
    // use the primary key index to read the record from the data file
    return dataFile.Read(Key, rec);
}
Composer Index
Secondary key      Primary key
Anand-Milind       ANG3795
Anil-Biswas        DG139201
Ankit-Tiwari       DG18807
Anu-Malik          RCA2626
ARRehman           WAR23699
Bhupen             COLS12809
Daboo Malik        LON2312
Himesh             MER75016
Karthik            COL38358
Karthik-Raja       FF245
Madan-Mohan        AZS7562
Mithoon            GRE87453
RD-Barman          GTE263
Vishal-Shekhar     FTE78421
Yuvan-Shankar      LON451239
Zakir-Husain       ERF56831
Record addition: whenever a record is added to the data file, an entry must also be added to the secondary index. Either existing entries must be shifted, or a vector of pointers to structures must be rearranged. As with primary indexes, the cost of doing this decreases greatly if the secondary index can be read into memory and changed there.
The difference between a secondary index and a primary index is that a secondary index can contain duplicate keys. Duplicate keys are grouped together, and within a group the keys are ordered according to the values of their reference fields (the primary keys).
Title Index (note the duplicate secondary key, whose entries are grouped together):
Secondary key      Primary key
Anand-Milind       ANG3795
Anil-Biswas        DG139201
Ankit-Tiwari       DG18807
Anu-Malik          RCA2626
ARRehman           WAR23699
Bhupen             COLS12809
Daboo Malik        LON2312
Himesh             MER75016
Karthik            COL38358
Karthik-Raja       FF245
Mithoon            AZS7562
                   GRE87453
                   GTE263
Vishal-Shekhar     FTE78421
Yuvan-Shankar      LON451239
Zakir-Husain       ERF56831
Record Deletion
Deleting a record usually implies removing all references to that record in the file system: removing the entry from the primary index as well as from each secondary index, which involves rearranging entries to close up the space left by the deletion.
Deleting all references would be necessary if the secondary indexes referenced the data file directly, because their reference fields would then hold byte offsets that, after a deletion, could end up associated with different records. We avoided referencing actual addresses in the secondary indexes: a secondary key search is followed by another search on the primary key. After a record is deleted, the primary key index already reflects the deletion, so the search simply returns "not found". The up-to-date primary key index thus acts as a final check, protecting us from retrieving deleted records.
Record updating: the primary key index serves as a protective buffer, insulating the secondary indexes from changes in the data file. This insulation extends to record updating. If the secondary indexes contained references directly to byte offsets, an update that changed a record's physical location in the file would require updating the secondary indexes too. With primary key references, updates to the data file affect a secondary index only when they change either the primary key or the secondary key. Three situations are possible:
- the update changes the secondary key;
- the update changes the primary key; or
- the update is confined to other fields.
Retrieval using combinations of Secondary Keys
Suppose we want to:
- find the recording with Label ID AZS7562 (primary key access);
- find all the recordings of Mithoon's work (secondary key: composer); and
- find all recordings titled "Kyon ki tum hi ho" (secondary key: title).
To respond to a request such as "find all recordings of Mithoon's Kyon ki tum hi ho" we need to combine retrieval on the composer index with retrieval on the title index. With sequential search this would be very expensive.
With secondary indexes the request is simple to handle. We rephrase it as a Boolean AND operation specifying the intersection of two subsets: find all data records with composer = "Mithoon" and title = "Kyon ki tum hi ho".
First search the composer index for composer = "Mithoon". Result: AZS7562, GRE87453, GTE263.
Next search the title index for title = "Kyon ki tum hi ho". Result: AZS7562, GTE263.
Now perform the Boolean AND (match) of the two lists:
Composers:    AZS7562  GRE87453  GTE263
Titles:       AZS7562  GTE263
Matched list: AZS7562  GTE263
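A minimal C++ sketch of the match (Boolean AND) step shown above: intersecting two lists of primary keys that are kept in sorted order. The function name matchLists is an assumption for illustration; the standard library's std::set_intersection would do the same job.

```cpp
#include <string>
#include <vector>

// Intersect two sorted lists of primary keys (the Boolean AND of two
// secondary-key retrievals). Both inputs must be in ascending order.
std::vector<std::string> matchLists(const std::vector<std::string>& a,
                                    const std::vector<std::string>& b)
{
    std::vector<std::string> out;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] == b[j]) { out.push_back(a[i]); ++i; ++j; }  // in both lists
        else if (a[i] < b[j]) ++i;    // advance the list with the smaller key
        else ++j;
    }
    return out;
}

// Example: matchLists({"AZS7562","GRE87453","GTE263"}, {"AZS7562","GTE263"})
// returns {"AZS7562","GTE263"}.
```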
Inverted Lists
The secondary index structure described so far presents two problems:
- we need to rearrange the index file every time a new record is added, even when the new record is for an existing secondary key; and
- if there are duplicate secondary keys, the secondary key field is repeated for each entry.
To overcome these problems we look first at an initial attempt at a solution, and then at a better solution: linking the list of references.
Improving the Secondary Index Structure
Secondary index structures result in two distinct difficulties:
- we have to rearrange the index file every time a new record is added to the data file, even if the secondary key already exists; and
- if there are duplicate secondary keys, the secondary key field is repeated for each entry. This wastes space and makes the file larger than necessary.
First attempt at a Solution
Change the secondary index structure so that each secondary key is associated with an array of references, for example:
Mithoon   AZS7562  GRE87453  GTE263
The major contribution of this change is that it helps solve the first difficulty, rearranging the secondary index file: adding a new data record no longer requires adding another record to the index; we need only modify the corresponding secondary index record.
It does have some problems:
- there is space for only a limited number of key references, so we need some way of keeping track of the extra keys that do not fit; and
- space usage: we are likely to lose more space to internal fragmentation.
(Fig.: First attempt at a solution - a secondary key index in which each secondary key (Ankit-Tiwari, Anu-Malik, ARRehman, Himesh, Mithoon, Vishal-Shekhar, ...) is followed by a fixed-size set of primary key references.)
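A minimal C++ sketch of what one record of this "first attempt" index might look like. The struct name, the limit of four references per key, and the field widths are assumptions for illustration; the fixed limit is exactly what leads to the overflow and internal-fragmentation problems noted above.

```cpp
#include <cstring>

const int MAX_REFS = 4;       // assumed fixed limit of references per key

// One record of the "first attempt" secondary index: a secondary key
// followed by a fixed-size array of primary key references.
struct SecondaryIndexEntry {
    char secondaryKey[20];            // e.g. "Mithoon"
    char primaryRefs[MAX_REFS][13];   // e.g. "AZS7562", "GRE87453", "GTE263"
    int  refCount;                    // how many of the slots are in use
};

// Add a reference for an existing key without rearranging the index file.
// Returns false when the fixed array is full (the overflow problem).
bool addReference(SecondaryIndexEntry& entry, const char* primaryKey)
{
    if (entry.refCount >= MAX_REFS) return false;   // no room left: overflow
    std::strncpy(entry.primaryRefs[entry.refCount], primaryKey, 12);
    entry.primaryRefs[entry.refCount][12] = '\0';
    ++entry.refCount;
    return true;
}
```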
An improvement over this approach should:
- retain the attractive feature of not requiring reorganization of the secondary indexes for every new entry to the data file;
- allow more than a fixed number of keys to be associated with each secondary key; and
- eliminate the waste of space due to internal fragmentation.
Better solution: Linking the list of references
Secondary index files in which a secondary key leads to a set of one or more primary keys are called inverted lists. The idea is to treat the primary key references as linked lists: each secondary key points to a different list of primary key references, and each list can grow to be just as long as it needs to be. The secondary index file then needs to be rearranged only when a new secondary key is added to the file.
(Figure: conceptual view of the primary key reference fields as a series of lists - each secondary key in the index (Ankit-Tiwari, Anu-Malik, ARRehman, Himesh, Mithoon, Vishal-Shekhar) points to its own list of primary key references, drawn from LabelIDs such as ANG3795, DG18807, RCA2626, WAR23699, COLS12809, LON2312, and MER75016.)
The secondary index now consists of records with two fields: a secondary key field, and the relative record number of the first corresponding primary key reference in a separate linked-list file of LabelIDs.
(Figure: the secondary index file - for example Ankit-Tiwari 3, ARRehman 4, Himesh 6, Mithoon 7, Vishal-Shekhar 10 - alongside the label linked-list file, whose numbered slots 1-10 hold LabelIDs such as WAR23699, RCA2626, ANG3795, DG18807, COLS12809, LON2312, MER75016, and LON451239, each with a link field giving the RRN of the next reference for the same secondary key, or -1 at the end of a list.)
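A minimal C++ sketch of the two-file inverted-list structure just described: a secondary index entry points to the head of a linked list of primary key references held in a separate reference file. The struct names, the in-memory std::vector standing in for the reference file, and the helper functions are assumptions for illustration.

```cpp
#include <string>
#include <vector>

// Entry in the secondary index file: a secondary key and the RRN of the
// first primary key reference for that key (-1 if there are none).
struct SecondaryEntry {
    std::string secondaryKey;   // e.g. "Mithoon"
    int firstRef;               // RRN of the head of its reference list
};

// Entry in the primary key reference (linked-list) file.
struct RefEntry {
    std::string primaryKey;     // e.g. "AZS7562"
    int next;                   // RRN of the next reference, or -1 at the end
};

// Collect all primary keys for one secondary index entry by walking its list.
std::vector<std::string> collectRefs(const SecondaryEntry& e,
                                     const std::vector<RefEntry>& refFile)
{
    std::vector<std::string> keys;
    for (int rrn = e.firstRef; rrn != -1; rrn = refFile[rrn].next)
        keys.push_back(refFile[rrn].primaryKey);
    return keys;
}

// Add a new reference for a key by pushing it onto the front of its list;
// the secondary index file itself does not have to be rearranged.
void addRef(SecondaryEntry& e, std::vector<RefEntry>& refFile,
            const std::string& primaryKey)
{
    refFile.push_back({primaryKey, e.firstRef});
    e.firstRef = static_cast<int>(refFile.size()) - 1;
}
```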
Selective Indexes: secondary indexes can be used to divide a file into parts and provide a selective view. Example: it is possible to build a selective index that contains only the titles of the classical recordings in the record collection. If the records carried additional information, such as the date each recording was released, we could build selective indexes such as "recordings released prior to 1970" and "recordings released since 1970". Such selective index information could also be combined with other keys in Boolean operations. Selective indexes are sometimes useful when the contents of a file fall naturally and logically into several broad categories.
Binding
A key design question for file systems that use indexes is: at what point is the key bound to the physical address of its associated record? In the scheme described here, the binding of primary keys to an address takes place at the time the files are constructed, whereas secondary keys are bound to an address at the time they are used. Binding at file-construction time results in faster access, and it keeps secondary key retrieval simple and fast. Both the primary index file and the secondary index files are better kept on secondary storage than held in primary memory.