Download presentation
Published byAnabel Franklin Modified over 7 years ago
1
Sequential Files Outline File operations Field and record organization
Sequential search and direct access Sequential Files Sorted Sequential Files Co-sequential processing CENG 351
2
Files and OS The files under an operating systems are maintained by a header. Header information may include all the information necessary to maintain the file, such as type, size, disk addresses of the first and last blocks or records, time created, time last updated, protection data, and other information based on the file type and OS. CENG 351
3
File types A file can be seen as A stream of bytes (no structure), or
Data semantics is lost: there is no way to get it apart again. or A collection of records with fields (more to come) CENG 351
4
Field and Record Organization
Definitions Record: a collection of related fields. Field: the smallest logically meaningful unit of information in a file. Key: a subset of the fields in a record used to identify (uniquely) the record. e.g. In the example file of books: Each line corresponds to a record. Fields in each record: ISBN, Author, Title CENG 351
5
Record Keys Primary key: a key that uniquely identifies a record.
Secondary key: other keys that may be used for search Author name Book title Author name + book title Note that in general not every field is a key (keys correspond to fields, or a combination of fields, that may be used in a search). CENG 351
6
Field Structures Fixed-length fields
87359Carroll Alice in wonderland 38180Folk File Structures Begin each field with a length indicator Carroll19Alice in wonderland Folk15File Structures Place a delimiter at the end of each field 87359|Carroll|Alice in wonderland| 38180|Folk|File Structures| Store field as keyword = value ISBN=87359|AU=Carroll|TI=Alice in wonderland| ISBN=38180|AU=Folk|TI=File Structures| CENG 351
7
Type Advantages Disadvantages
Fixed Easy to read/write Waste space with padding Length-based Easy to jump ahead to the end of the field Long fields require more than 1 byte Delimited May waste less space than with length-based Have to check every byte of field against the delimiter Keyword Fields are self describing. Allows for missing fields Waste space with keywords CENG 351
8
Record Structures Fixed-length records. Fixed number of fields.
Begin each record with a length indicator. Use an index to keep track of addresses. Place a delimiter at the end of the record. CENG 351
9
Fixed-length records Two ways of making fixed-length records:
Fixed-length records with fixed-length fields. Fixed-length records with variable-length fields. 87359 Carroll Alice in wonderland 03818 Folk File Structures 87359|Carroll|Alice in wonderland| unused 38180|Folk|File Structures| unused CENG 351
10
Variable-length records
Fixed number of fields: Record beginning with length indicator: Use an index file to keep track of record addresses: The index file keeps the byte offset for each record; this allows us to search the index (which have fixed length records) in order to discover the beginning of the record. Placing a delimiter: e.g. end-of-line char 87359|Carroll|Alice in wonderland|38180|Folk|File Structures| ... |Carroll|Alice in wonderland| |Folk|File Structures| .. CENG 351
11
Easy to jump to the i-th record Waste space with padding
Type Advantages Disadvantages Fixed length record Easy to jump to the i-th record Waste space with padding Variable-length record Saves space when record sizes are diverse Cannot jump to the i-th record, unless through an index file or sequential advance CENG 351
12
File Organization Four basic types of organization: Sequential Indexed
Direct Multi-key In all cases we view a file as a sequence of records. A record is a list of fields. Each field has a data type. today CENG 351
13
File Operations Typical Operations: Retrieve a record Insert a record
Delete a record Modify a field of a record Sort Merge Match CENG 351
14
Sequential files Records are stored contiguously on the storage device. Sequential files are read from beginning to end. Some operations are very efficient on sequential files (e.g. finding averages) Organization of records: Unordered sequential files (pile files) Sorted sequential files (records are ordered by some field or fields) CENG 351
15
Pile Files A pile file is a succession of records, simply placed one after another with no additional structure. Records may vary in length. Typical Request: Print all records with a given field value e.g. print all books by Folk. We must examine each record in the file, in order, starting from the first record. CENG 351
16
Timing computations for sequential files
The smallest data unit that can be read from the disk is a block. Question: knowing the disk address of the block containing the record, what is the time to fetch a record? Answer: seek(s)+ rotational latency(r) + block transfer time(btt). CENG 351
17
Fast sequential reading
In fast sequential reading once the starting block is reached, for the remaining of the blocks the rotational latency is assumed zero. This is possible if the blocks are arranged with correct staggering on the disk, on the same or consecutive cylinders. For a large file, with very large blocks (b), s and r can be ignored entirely. CENG 351
18
Searching Sequential Pile Files
To look-up a record, given the value of one or more of its fields, we must search the whole file. In general, (b is the total number of blocks in file): At least 1 block is accessed At most b blocks are accessed. Thus, the average number of blocks to be read for a randomly chosen target block to be reached will be: (1/b)*ib =(1/b)(b*(b+1)/2)=~b/2. Thus, time to find and read a record in a pile file is approximately : TF = (b/2) * btt Time to fetch one record CENG 351
19
Exhaustive Reading of the File
Time it takes to read and process all records sequentially TX = b * btt (approximately twice the time to fetch one record) Fetching the logical next record is the same as TF above. Why? At average, half the records have to be read to locate the next target record. CENG 351
20
Note on btt vs ebt Note that effective block transfer time (ebt) includes a gap and the block transfer time. Note that ebt>btt, but we will ignore the difference… CENG 351
21
Exhaustive Reading of the File
Random reading of b blocks, where s and r cannot be ignored, as they apply to all independent block reads: Tx=b(s+r+btt) CENG 351
22
Inserting a new record Just place the new record at the end of the file (assuming that we don’t worry about duplicates) Read the last block Put the record at the end. => TI = s + r + btt + (2r – btt ) + btt => TI = s+3r +btt Time to wait to re-write CENG 351
23
TU (variable length) = TD + TI
Updating a record For fixed length records: TU (fixed length) = TF + 2r For variable length records update is treated as a combination of delete and insert. TU (variable length) = TD + TI CENG 351
24
What happens if records are deleted?
Deleting Records What happens if records are deleted? There is no basic operation that allows us to “remove part of a sequential file” easily. We have to use combination of operations CENG 351
25
Record deletion and Storage compaction
Deletion can be done by “marking” a record as deleted. e.g. Place a delimiter such as ‘*’ at the beginning of the record The space for the record is not released, but the file processing program must include logic that checks if record is deleted or not. After a lot of records have been deleted, a special program is used to compact or squeeze the file – this is called Storage Compaction. CENG 351
26
Deleting fixed-length records and reclaiming space dynamically
How to use the space freed by deleted records for storing records that are added later? Use an “AVAIL LIST”, a linked list of available records. Where a header record (at the beginning of the file) stores the beginning of the AVAIL LIST When a record is deleted, it is marked as deleted and inserted into AVAIL LIST CENG 351
27
Deleting variable length records
Use AVAIL LIST as before, but take care of the variable-length storage allocation difficulties. The records in AVAIL LIST must store its size as a field. Exact byte offset must be used. Addition of records must find a large enough record in AVAIL LIST CENG 351
28
Record Placement Strategies
There are several placement strategies for selecting a record in AVAIL LIST when adding a new record: First-fit Strategy AVAIL LIST is not sorted by size; the first record large enough is used for the new record. Best-fit Strategy List is sorted by size. The smallest record large enough is used. Worst-fit strategy List is sorted by decreasing order of size; largest record is used; unused space is placed in AVAIL LIST again. CENG 351
29
Problem 1 Estimate the time required to reorganize a file for storage compaction. (Derive a formula in terms of the number of blocks b, btt, number of records n, and blocking factor Bfr) Reorganize the file with one disk drive, ie, read and write to the same disk Reorganize the file with two disk drives, ie, R from one disk W to the other, with possible R/W overlapping CENG 351
30
Problem 2 Given two pile files A and B with n=100,000 records and record size of 400 bytes each, We want to create an intersection file. Assume that 70% of the records are in common and the available memory for this operation is 10MB. Calculate a timing estimate for finding and writing the intersection file (Use s = 16 ms, r = 8.3ms, btt = 0.84ms, B=2400bytes.) You may ignore in memory processing time Hint: the logical cluster size depends on the available main memory; one may read enough consecutive number of blocks from a file to compare to find the intersection file… Compute number of blocks that fit in the memory, 4166(=10,000,000/2400) in total, 4166/2.7=1543 blocks for each read file (1543*0.7)=1080 blocks for write in the memory, (100,000*400)/(1543*2400)=11 (s+r+1543*btt) time required for each read file, (70000*400)/(1080*2400)*(s+r+btt)= for intersection file. The total time: 2*11(s+r+1543*btt) +11*(s+r+1080*btt) CENG 351
31
Sorted Sequential Files
Sorted files are usually read sequentially to produce lists, such as mailing lists, invoices.etc. A sorted file cannot easily stay sorted, if it requires additions Solution: A sorted file will have to have an overflow area for added records. Overflow area is not sorted, otherwise it would be very expensive! To find a record: Use binary search First look at sorted area Then search overflow area If there are too many overflows, the access time degenerates, approaching to that of a sequential file. CENG 351
32
Searching for a record in a sorted file
We can do binary search (assuming fixed-length records) in the sorted part. Sorted part overflow x blocks y blocks (x + y = b) Worst case to fetch an existing record from the sorted area :TF = log2 x * (s + r + btt). If the record is not found, search the overflow area too. Thus total worst case time is: TF = log2 x * (s + r + btt) + s + r + y * btt CENG 351
33
Searching for a record-average case for a record in the sorted part
If the record does exist in the sorted part.: Then the average search time is:- TF = (log2 x-1) * (s + r + btt)/2. CENG 351
34
Searching for an existing record: expected time-1
Total expected search time of an existing record: Use probabilistic record distribution, where b is the total number of blocks TF=(x/b)*(logx –1)*(s+r+btt) +(y/b)* ( logx*(s+r+btt)+ s+r+(y/2)*btt) CENG 351
35
Searching for an existing record: expected time-2
If x is very large compared to y, the second term can be ignored, assuming x is nearly equal to b: TF=(x/b)*(logx –1)*(s+r+btt ) or TF= (logx –1)*(s+r+btt) If y is very large, ignoring the rest, the expected search time will be: TF=[y2/(2b)]*btt= [y/2]*btt CENG 351
36
Problem 3 Given the following: Q1) Calculate TF for a certain record
Block size = 2400 Blocking factor=8 File size = 40M Block transfer time (btt) = 0.84ms Average Seek time s = 16ms Average rotational latency r = 8.3 ms Q1) Calculate TF for a certain record in a pile file in a sorted file (no overflow area) Q2) Calculate the time to look up a record in10000 records. CENG 351
37
Fetching the next record in the sorted order-1
The next record may or may not be in the same block as the current record. What is the probability of the next record in the same block: 1-1/m, where m is the blocking factor, i.e., number of records in a block. CENG 351
38
Fetching the next record in the sorted order-2
The probability that the current record is the last record in the block is 1/m. Thus, the estimated time for reading the next record, assuming that it is either in the same block or in the same cylinder: TN = (1-1/m)*0+(1/m)*(r+btt)= (1/m)*(r+btt) CENG 351
39
Deleting the next record in the order of sort key
If the record to be deleted or updated is any record, not involving the key field, then TD =TU =TF+2r If the update involves the key field , then a deletion and an insertion are required. Thus, TU =TD +TI, where TI is time required the record to be placed in the last block in the overflow area, in which case it requires reading the last block and full rotation to write it back: TI=s+r+btt+2r CENG 351
40
Matching or Merging Processing
Merging and matching (Co-sequential processing) involves the coordinated processing of two or more sequential files to produce a single output file. Matching takes intersection of the records in the files. Merging takes union of the records in the files. CENG 351
41
Example applications Matching: Finding the intersection file
Batch Update Master file – bank account info (account number, person name, balance) – sorted by account number Transaction file – used to update master file for certain accounts on accounts (account number, credit/debit info) – sorted by account number Merging: Merging two lists keeping alphabetic order. Sorting large files (break into small pieces, sort each piece and then merge them) CENG 351
42
Problem 4: Intersection of two sorted file
Compute the time required to take intersection of two sorted files, with the following characteristics. Given the following: Block size = 2400 Blocking factor=8 File size = 40M Block transfer time (btt) = 0.84ms Average Seek time s = 16ms Average rotational latency r = 8.3 ms CENG 351
43
Algorithm for Batch Update on a sorted file
Initialize pointers to the first records in the master file and the transaction file. Until pointers reach end of the file: Compare keys of the current records in both files Take appropriate action * Advance one (or both) of the pointes. * Match or merge CENG 351
44
Appropriate Action for master and transaction file merge
if master key < transaction key copy master file record to the end of the new master file. advance master file pointer. if master key > transaction key if transaction is an insert copy transaction file record to the end of new master file else log an error advance the transaction file pointer. if master key = transaction key If transaction is modify copy modified master file record to the end of the new file If transaction is insert log an error If transaction is delete, do nothing In all three cases, advance both the master and transaction file pointer CENG 351
45
Reorganization of a sorted file-1
Merging the sorted part of a file with its overflow area Read in the overflow area blocks, sort the records in memory, if possible merge the sorted records with sorted part of the file into another sorted file. CENG 351
46
Reorganization of a sorted file-2
Assume that there are x blocks in the sorted area and y blocks in the overflow area, and z records in the overflow area. TY=y*btt+ Tsort+x*btt+(n/m)*btt Where n is total number of records, m is the blocking factor Where Tsort is the time it takes to sort z records in the overflow area, say (zlogz)*t, where t is the time each iteration of the sort routine takes. This time is usually in microsecond domain and can be ignored. CENG 351
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.