CPSC 231 D.H.1 Learning Objectives
–Understand the performance gap between disk and RAM.
–Understand the definition, design goals and design problems of file structures.
–Know the history of file structure research.
–Understand and name the key terms used in file structures.
CPSC 231 D.H.2 Secondary Storage in Computer Systems
Data can be stored on:
–hard disks
–floppy disks
–tapes
–CD-ROMs
–ZIP and Jaz disks
–network servers
Most data is stored on hard disks.
CPSC 231 D.H.3 Disks
Disks provide enormous capacity to store information.
Disks are orders of magnitude slower than main memory (a single disk access can take a quarter of a million times longer than a single RAM access).
DISK = LARGE and SLOW and CHEAP
RAM = SMALL and FAST
CPSC 231 D.H.4 RAM versus Disk Performance Gap
Example:
–120 nanoseconds to access RAM (Main Memory)
–30 milliseconds to access disk
Analogy:
–if a RAM access took 20 seconds, a disk access would take about 58 days
CONCLUSION:
–Application programs have to spend a lot of time waiting for data to be read from the disk or to be written to the disk.
CPSC 231 D.H.5 Questions
What is a millisecond, microsecond and nanosecond?
–Millisecond = 1/1,000 s
–Microsecond = 1/1,000,000 s
–Nanosecond = 1/1,000,000,000 s
How many times is RAM access faster than disk access? Assume:
–120 nanoseconds to access RAM (Main Memory)
–30 milliseconds to access disk
CPSC 231 D.H.6 File Structure Definition:
–A file structure is a combination of a representation for data in files and operations for accessing the data.
–A file structure allows applications to read, write and modify data.
–A good file structure design gives an application efficient (fast) access to the data it needs.
CPSC 231 D.H.7 File Structure Design Goals
Minimize the total disk access time:
–by clustering related data together
–by keeping adjacent blocks close to each other on the disk
–ideally, by getting all the needed data in just ONE disk access
Maximize the total disk space utilization:
–disk de-fragmentation procedures
–data compression
CPSC 231 D.H.8 File structure design problems
One of the most difficult problems in meeting the design goals of a file structure is the fact that files are quite dynamic, i.e. they:
–grow
–shrink
–change their data
The design goals would be easier to meet if files were static. WHY?
CPSC 231 D.H.9 Historical view of file structure design
Early work presumed that:
–files were located on tapes
–access was sequential
Recent work:
–most files are stored on direct access devices (such as hard disks, floppy disks, CD-ROMs and ZIP disks)
–large files required indexing
–indexes and keys allowed for speedy searches of data on the disk
CPSC 231 D.H.10 File structure history cont. Indexed files grew and became slow to access => tree structures emerged. Unfortunately some trees grew very unevenly, resulting in slow (almost sequential) searches => AVL trees emerged (self-balancing binary trees). AVL trees grew large and required multiple disk accesses => B-trees emerged.
CPSC 231 D.H.11 Tree File (figure)
CPSC 231 D.H.12 AVL Trees (figure)
CPSC 231 D.H.13 B-Tree (figure)
CPSC 231 D.H.14 File structure history cont. B-trees provided excellent performance for non-sequential files, but sequential access was very slow => B+ trees emerged. B-trees and B+ trees became the basis for many commercial file systems, since they provide access times that grow in proportion to log_k N (the logarithm of N to base k), where N is the number of entries in the file and k is the number of entries indexed in a single block of the B-tree.
CPSC 231 D.H.15 B+ Trees (figure)
CPSC 231 D.H.16 Hashing Hashing is a data access mechanism that is based on converting the search key into a storage address. A good hashing algorithm can significantly reduce the number of disk accesses. Extendible hashing is a hashing technique that works well with files that undergo substantial changes in size over time.
CPSC 231 D.H.17 Hashing Function (figure)
CPSC 231 D.H.18 Key terms.
AVL tree - a self-balancing binary tree that can guarantee good access times for data stored in memory (but not on the disk).
B-tree - a tree structure that provides fast access to data stored in files. A B-tree does NOT have to be a binary tree.
B+ tree - a variation of the B-tree structure that provides fast sequential access to data as well as indexed access.
CPSC 231 D.H.19 Key Terms Cont.
File structure
–the organization of data on secondary storage devices such as disks, together with the operations defined for the data
Sequential access
–access of data that takes records in serial order, looking at the first, second, and so on.
Random access
–access of data that takes records in any order, not necessarily serial.
CPSC 231 D.H.20 Physical files and logical files. Files are collections of related information. Physical files exist on secondary storage devices. Operating systems are responsible for managing physical files. Logical files are visible to application programs. Application programs do not know about physical locations of the files (often they do not know if the data is coming from a file or from a keyboard)
CPSC 231 D.H.21 Association between physical and logical files Applications have to make an association between physical and logical file names. In C++ this can be done in the following way: ofstream outClientFile("clients.dat", ios::out); The application writes to the logical file outClientFile, while the operating system sees the physical file clients.dat.
CPSC 231 D.H.22 Special Characters in Files
Computer systems reserve a number of characters for specific system functions.
Examples:
–Control-Z often indicates end-of-file in MS-DOS programs
–Control-D often indicates end-of-file in Unix programs
–CR (Carriage Return) and LF (Line Feed) characters together indicate end-of-line in MS-DOS text files; Unix uses LF alone
CPSC 231 D.H.23 Directory Structures Files are stored in directories; thus directories are collections of files. Most modern systems maintain a tree directory structure. (WHY?)
CPSC 231 D.H.24 I/O Redirection
I/O redirection allows the source of input to be changed so that it comes from a file instead of the keyboard:
–program < file /* program reads input from a file instead of the keyboard */
I/O redirection also allows the output to be directed to a file instead of the screen:
–program > file /* program writes to a file instead of the screen */
< and > are the redirection operators.
CPSC 231 D.H.25 Pipes
The output of one program can be used as the input to another program by using pipes.
Example:
–program1 | program2
| is the pipe operator.
CPSC 231 D.H.26 Pipe Operator (figure)