Published by Denis McDowell. Modified over 9 years ago.
1
File Processing - Fundamental concepts (MVNC)
Fundamental File Structure Concepts, Chapter 4
2
Record and Field Structure
• A record is a collection of fields.
• A field stores information about one attribute.
• The question: when we write records, how do we organize the fields within them
  » so that the information can be recovered
  » so that we save space
  » so that we can process the records efficiently
  » so that the record structure is as flexible as possible
3
Field Structure Issues
• What if
  » field values vary greatly in length?
  » fields are optional?
4
Field Delineation Methods
• Fixed length fields
• Include the length with each field
• Separate fields with a delimiter
• Include a keyword expression to identify each field
5
Fixed length fields
• Easy to implement: use the language's record structures (no parsing)
• Each field must be declared at the maximum length needed

    Field:   last   first   address   city   state   zip
    Width:   10     10      15        15     2       9

    "Yeakus    Bill      123 Pine       Utica          OH43050    "
6
Include length with field
• Begin each field with a length indicator
• If the maximum field length is < 256, a single byte can hold the length

    Fields: last, first, address, city, state, zip
    Each field is stored as a length byte followed by the field's bytes (hex):

    06 59 65 61 6B 75 73             length 6, "Yeakus"
    04 42 69 6C 6C                   length 4, "Bill"
    08 31 32 33 20 50 69 6E 65 ..    length 8, "123 Pine"
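The length-byte layout above can be sketched in a few lines. This is only an illustration, not code from the slides: the function names appendLengthPrefixed and readLengthPrefixed are made up for this example, and a std::string stands in for the file buffer.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: append one field as a length byte followed by its bytes.
// Returns false (leaving `out` unchanged) if the field does not fit a one-byte length.
bool appendLengthPrefixed(std::string& out, const std::string& field) {
    if (field.size() > 255) return false;
    out.push_back(static_cast<char>(field.size()));
    out += field;
    return true;
}

// Hypothetical helper: read the field that starts at `pos` and advance `pos` past it.
// Returns false if the buffer ends before the field is complete.
bool readLengthPrefixed(const std::string& in, std::size_t& pos, std::string& field) {
    if (pos >= in.size()) return false;
    std::size_t len = static_cast<unsigned char>(in[pos]);
    if (pos + 1 + len > in.size()) return false;
    field = in.substr(pos + 1, len);
    pos += 1 + len;
    return true;
}
```

Appending "Yeakus" and then "Bill" to an empty buffer produces the byte sequence 06 59 65 61 6B 75 73 04 42 69 6C 6C shown on the slide. An optional field that is absent is simply written as a zero length byte.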
7
Separate fields with a delimiter
• Use a special character that does not occur in the data
  » space, comma, tab
  » also ASCII control characters reserved as separators, such as FS (hex 1C)
  » here we use "|"
• An end-of-record delimiter is also needed: "#"

    "Yeakus|Bill|123 Pine|Utica|OH|43050#"
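A minimal sketch of parsing such a record follows. The function name splitRecord is hypothetical; the delimiters default to the "|" and "#" used on the slide.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: split one record into its fields.
// Fields are separated by `fieldDelim`; the record ends at the first
// `recordDelim`, or at the end of the string if none is present.
std::vector<std::string> splitRecord(const std::string& record,
                                     char fieldDelim = '|',
                                     char recordDelim = '#') {
    std::vector<std::string> fields;
    std::string current;
    for (char c : record) {
        if (c == recordDelim) break;
        if (c == fieldDelim) {
            fields.push_back(current);  // a delimiter closes the current field
            current.clear();
        } else {
            current.push_back(c);
        }
    }
    fields.push_back(current);  // the last field has no trailing field delimiter
    return fields;
}
```

Two adjacent delimiters yield an empty string, which is how this scheme represents an optional field that was left out.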
8
Include keyword expression
• Keywords label each field
• A self-describing structure
• Allows LOTS of flexibility
• Uses lots of space

    "LAST=Yeakus|FIRST=Bill|ADDRESS=123 Pine|CITY=Utica|STATE=OH|ZIP=43050#"
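One way to read a keyword record is to collect the KEY=value pairs into a map. The sketch below assumes the "|", "#" and "=" conventions of the slide's example; the function name parseKeywordRecord is made up for illustration.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper: parse "KEY=value|KEY=value|...#" into a keyword -> value map.
// Items without an '=' are skipped.
std::map<std::string, std::string> parseKeywordRecord(const std::string& record,
                                                      char fieldDelim = '|',
                                                      char recordDelim = '#') {
    std::map<std::string, std::string> fields;
    std::size_t end = record.find(recordDelim);
    if (end == std::string::npos) end = record.size();
    std::size_t pos = 0;
    while (pos < end) {
        std::size_t next = record.find(fieldDelim, pos);
        if (next == std::string::npos || next > end) next = end;
        std::string item = record.substr(pos, next - pos);
        std::size_t eq = item.find('=');
        if (eq != std::string::npos)
            fields[item.substr(0, eq)] = item.substr(eq + 1);
        pos = next + 1;
    }
    return fields;
}
```

Because every value carries its own label, fields may appear in any order, and an optional field is represented by leaving its keyword out entirely.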
9
Optional Fields
• Fixed length
  » leave the field blank
• Field length
  » write a zero-length field
• Delimiter
  » write adjacent delimiters
• Keywords
  » just leave the field out
10
Reading a stream of fields
• A record must be broken into its fields
• Fixed length fields can simply be read into a record structure
• All other formats must be "parsed" with a parsing algorithm
11
Record Structures
• How do we organize records in a file?
• Records can be fixed length or variable length
  » Fixed length allows simple direct-access lookup
  » Fixed length may waste space
  » Variable length: how do we find a record's position?
12
Record Structures
• Fixed length records
• Fixed number of fields in each record
• Variable length
  » Prefix each record with a length
  » Use a second file to keep track of record start positions
  » Place a delimiter between records
13
Fixed Length Records
• All records have the same length
• Record positions can be calculated for direct-access reads
• Does not imply that the sizes or the number of fields are fixed
• Storing variable length data in fixed length records leaves unused space
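The position calculation is just record number times record size. The sketch below assumes 0-based record numbers and no header at the start of the file; readFixedRecord is a hypothetical name, and any std::istream (a file or an in-memory std::istringstream) can be passed in.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>  // std::istringstream, handy as an in-memory stand-in for a file
#include <string>

// Hypothetical helper: read record number `rrn` (0-based) from a stream of
// fixed length records by seeking to byte offset rrn * recordSize.
bool readFixedRecord(std::istream& in, std::size_t rrn, std::size_t recordSize,
                     std::string& record) {
    in.clear();  // drop any earlier EOF/fail state before seeking
    in.seekg(static_cast<std::streamoff>(rrn * recordSize), std::ios::beg);
    record.assign(recordSize, '\0');
    if (recordSize == 0) return in.good();
    in.read(&record[0], static_cast<std::streamsize>(recordSize));
    return in.gcount() == static_cast<std::streamsize>(recordSize);
}
```

With 10-byte records, asking for record 2 seeks straight to byte 20 without touching records 0 and 1.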
14
Fixed number of fields in records
• Field size can be fixed or variable
• Fixed size fields
  » result in fixed size records
  » simply read directly into a "struct"
• Variable size fields
  » use delimiters or field lengths
  » simply count fields while parsing
15
Variable length Records
• Prefix each record with a length
• Use a second file to keep track of record start positions
• Place a delimiter between records
16
Prefix records with a length
• Allows true variable length records
• Form of the prefix:
  » Character number (a fixed number of digit characters)
  » Binary number (write the integer without conversion)
  » The maximum record length must be considered in either case
• No direct access (great for sequential access)
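As an illustration of the character-number form, the sketch below writes each record behind a prefix of four ASCII digits. The four-digit width (and hence the 9999-byte maximum) is an assumption chosen for this example, and writeRecord and readRecord are hypothetical names.

```cpp
#include <cstddef>
#include <cstdio>
#include <istream>
#include <ostream>
#include <sstream>  // std::stringstream, handy as an in-memory stand-in for a file
#include <string>

const std::size_t kPrefixDigits = 4;   // assumption: length stored as 4 ASCII digits
const std::size_t kMaxRecord = 9999;   // largest length that fits in 4 digits

// Write the record length as 4 digits, then the record bytes.
bool writeRecord(std::ostream& out, const std::string& record) {
    if (record.size() > kMaxRecord) return false;
    char prefix[kPrefixDigits + 1];
    std::snprintf(prefix, sizeof prefix, "%04u", static_cast<unsigned>(record.size()));
    out.write(prefix, static_cast<std::streamsize>(kPrefixDigits));
    out.write(record.data(), static_cast<std::streamsize>(record.size()));
    return out.good();
}

// Read the 4-digit length, then exactly that many bytes.
bool readRecord(std::istream& in, std::string& record) {
    char prefix[kPrefixDigits];
    in.read(prefix, static_cast<std::streamsize>(kPrefixDigits));
    if (in.gcount() != static_cast<std::streamsize>(kPrefixDigits)) return false;
    std::size_t len = 0;
    for (char c : prefix) {
        if (c < '0' || c > '9') return false;  // prefix is not a number
        len = len * 10 + static_cast<std::size_t>(c - '0');
    }
    record.assign(len, '\0');
    if (len == 0) return true;
    in.read(&record[0], static_cast<std::streamsize>(len));
    return in.gcount() == static_cast<std::streamsize>(len);
}
```

Reaching the n-th record still means reading every prefix before it, which is why this layout suits sequential rather than direct access.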
17
Index of record start addresses
• A second file is simply a list of offsets to successive records
• Since the offsets are fixed length, the index file allows direct access, and thereby allows direct access to the main file
• Problems
  » Maintaining the index (adding and deleting records)
  » Cost of the index
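A small in-memory sketch of the idea: a std::string stands in for the main file and a std::vector of offsets stands in for the index file (on disk the offsets would be stored as fixed size integers). The names buildOffsetIndex and fetchRecord are hypothetical, and records are assumed to be separated by a delimiter character.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Build the index: the starting offset of every record in `data`.
std::vector<std::size_t> buildOffsetIndex(const std::string& data,
                                          char recordDelim = '\n') {
    std::vector<std::size_t> offsets;
    std::size_t start = 0;
    while (start < data.size()) {
        offsets.push_back(start);
        std::size_t end = data.find(recordDelim, start);
        if (end == std::string::npos) break;
        start = end + 1;
    }
    return offsets;
}

// Fetch record `n` through the index, without scanning the records before it.
bool fetchRecord(const std::string& data, const std::vector<std::size_t>& offsets,
                 std::size_t n, std::string& record, char recordDelim = '\n') {
    if (n >= offsets.size()) return false;
    std::size_t end = data.find(recordDelim, offsets[n]);
    if (end == std::string::npos) end = data.size();
    record = data.substr(offsets[n], end - offsets[n]);
    return true;
}
```

The maintenance problem is visible here too: inserting or deleting a record in the middle of the data changes every later offset, so the index has to be rebuilt or patched.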
18
Place delimiter between records
• Use a special character that does not occur in any record
• Allows efficient variable size records
• No direct access
• Bible files - use '\n' as the delimiter
19
Binary data in files
• Binary reals and integers can be written to, and read from, a file:
  » The byte size of the variables used must be known.
  » In C, the "sizeof" operator returns the size of a type or variable in bytes
20
Binary data in files

    int rsize;
    char rec_buf[MAX];
    ...
    strcpy(rec_buf, "this is a test record");
    rsize = strlen(rec_buf);
    write(my_fd, &rsize, sizeof(int));   // write the size
    write(my_fd, rec_buf, rsize);        // write the record
    ...
    read(my_fd, &rsize, sizeof(int));    // read the size
    read(my_fd, rec_buf, rsize);         // read the record
    rec_buf[rsize] = '\0';               // the terminator was not stored in the file
21
Viewing Binary file data
• Use the file dump utility (od - octal dump)
  » od -xc
  » x - hex output
  » c - character output
• Useful for viewing what is actually in a file
22
Using Classes to Manipulate Buffers
• Three classes
  » Delimited fields
  » Length-based fields
  » Fixed length fields
23
Class for Delimited fields
• Consider a class to manage delimited text buffers
  » Allows reading and writing of delimited records
  » Allows packing and unpacking of fields
24
Class for Delimited fields

    class Person
    {
      public:
        // fields (each array has room for the terminating zero)
        char LastName [11];
        char FirstName [11];
        char Address [16];
        char City [16];
        char State [3];
        char ZipCode [10];
        // Methods next...
    };
25
Class for Delimited fields

    class DelimTextBuffer
    {
      public:
        DelimTextBuffer (char Delim = '|', int maxBytes = 1000);
        void Clear (); // reset the buffer; called by Read and by Person::Pack
        int Read (istream &);
        int Write (ostream &) const;
        int Pack (const char *, int size = -1);
        int Unpack (char *);
      private:
        char Delim;
        char DelimStr[2]; // zero terminated string for Delim
        char * Buffer;    // character array to hold field values
        int BufferSize;   // size of packed fields
        int MaxBytes;     // maximum number of characters in the buffer
        int NextByte;     // packing/unpacking position in buffer
    };
26
Class for Delimited fields
• Packing a buffer

    Person Bill_Yeakus;
    DelimTextBuffer buffer;
    buffer.Pack(Bill_Yeakus.LastName);
    buffer.Pack(Bill_Yeakus.FirstName);
    …
    buffer.Pack(Bill_Yeakus.ZipCode);
    buffer.Write(stream);
27
Class for Delimited fields

    int DelimTextBuffer :: Pack (const char * str, int size)
    // set the value of the next field of the buffer;
    // if size = -1 (default) use strlen(str) as the length of the field
    {
        short len; // length of string to be packed
        if (size >= 0) len = size;
        else len = strlen (str);
        if (len > strlen(str)) // str is too short!
            return FALSE;
        int start = NextByte; // first character to be packed
        NextByte += len + 1;  // field bytes plus one delimiter
        if (NextByte > MaxBytes) return FALSE;
        memcpy (&Buffer[start], str, len);
        Buffer [start+len] = Delim; // add delimiter
        BufferSize = NextByte;
        return TRUE;
    }
28
Class for Delimited fields

    int DelimTextBuffer :: Write (ostream & stream) const
    {
        // write the buffer size in binary, then the packed fields
        stream . write ((char*)&BufferSize, sizeof(BufferSize));
        stream . write (Buffer, BufferSize);
        return stream . good ();
    }
29
Class for Delimited fields

    int DelimTextBuffer :: Read (istream & stream)
    {
        Clear ();
        // read the buffer size in binary, then that many bytes of packed fields
        stream . read ((char*)&BufferSize, sizeof(BufferSize));
        if (stream.fail()) return FALSE;
        if (BufferSize > MaxBytes) return FALSE; // buffer overflow
        stream . read (Buffer, BufferSize);
        return stream . good ();
    }
30
Class for Delimited fields

    int DelimTextBuffer :: Unpack (char * str)
    // extract the value of the next field of the buffer
    {
        int len = -1; // length of packed string
        int start = NextByte; // first character to be unpacked
        for (int i = start; i < BufferSize; i++)
            if (Buffer[i] == Delim)
                {len = i - start; break;}
        if (len == -1) return FALSE; // delimiter not found
        NextByte += len + 1;
        if (NextByte > BufferSize) return FALSE;
        strncpy (str, &Buffer[start], len);
        str [len] = 0; // zero termination for string
        return TRUE;
    }
31
Class for Delimited fields
• Class Person can be extended to provide specialized packing functions
32
Class for Delimited fields

    int Person::Pack (DelimTextBuffer & Buffer) const
    {// pack the fields into a DelimTextBuffer, return TRUE if all succeed, FALSE o/w
        int result;
        Buffer . Clear ();
        result = Buffer . Pack (LastName);
        result = result && Buffer . Pack (FirstName);
        result = result && Buffer . Pack (Address);
        result = result && Buffer . Pack (City);
        result = result && Buffer . Pack (State);
        result = result && Buffer . Pack (ZipCode);
        return result;
    }
33
Class for Delimited fields

    int Person::Unpack (DelimTextBuffer & Buffer)
    {// unpack the fields in the order they were packed
        int result;
        result = Buffer . Unpack (LastName);
        result = result && Buffer . Unpack (FirstName);
        result = result && Buffer . Unpack (Address);
        result = result && Buffer . Unpack (City);
        result = result && Buffer . Unpack (State);
        result = result && Buffer . Unpack (ZipCode);
        return result;
    }
34
Class for Fixed Length fields

    int FixedTextBuffer :: AddField (int fieldSize)
    {// declare the size of the next field of the record
        if (NumFields == MaxFields) return FALSE;
        if (BufferSize + fieldSize > MaxChars) return FALSE;
        FieldSize[NumFields] = fieldSize;
        NumFields ++;
        BufferSize += fieldSize;
        return TRUE;
    }
35
Class for Fixed Length fields

    int FixedTextBuffer :: Read (istream & stream)
    {
        stream . read (Buffer, BufferSize);
        return stream . good ();
    }
36
Class for Fixed Length fields

    int FixedTextBuffer :: Write (ostream & stream)
    {
        stream . write (Buffer, BufferSize);
        return stream . good ();
    }
37
Class for Fixed Length fields

    int FixedTextBuffer :: Pack (const char * str)
    // set the value of the next field of the buffer;
    {
        if (NextField == NumFields || !Packing) // buffer is full or not packing mode
            return FALSE;
        int len = strlen (str);
        int start = NextCharacter; // first byte to be packed
        int packSize = FieldSize[NextField]; // number bytes to be packed
        strncpy (&Buffer[start], str, packSize);
        NextCharacter += packSize;
        NextField ++;
        // if len < packSize, pad the rest of the field with blanks
        for (int i = start + len; i < NextCharacter; i ++)
            Buffer[i] = ' ';
        Buffer [NextCharacter] = 0; // make buffer look like a string
        if (NextField == NumFields) // buffer is full
        {
            Packing = FALSE;
            NextField = NextCharacter = 0;
        }
        return TRUE;
    }
38
Class for Fixed Length fields

    int FixedTextBuffer :: Unpack (char * str)
    // extract the value of the next field of the buffer
    {
        if (NextField == NumFields || Packing) // all fields unpacked or not unpacking mode
            return FALSE;
        int start = NextCharacter; // first byte to be unpacked
        int packSize = FieldSize[NextField]; // number bytes to be unpacked
        strncpy (str, &Buffer[start], packSize);
        str [packSize] = 0; // terminate string with zero
        NextCharacter += packSize;
        NextField ++;
        if (NextField == NumFields) Clear (); // all fields unpacked
        return TRUE;
    }
39
Class for Fixed Length fields

    void FixedTextBuffer :: Print (ostream & stream)
    {// print the buffer's configuration and contents, for debugging
        stream << "Buffer has max fields "<<MaxFields<<" and actual "<<NumFields<<endl
            <<"max bytes "<<MaxChars<<" and Buffer Size "<<BufferSize<<endl;
        for (int i = 0; i < NumFields; i++)
            stream <<"\tfield "<<i<<" size "<<FieldSize[i]<<endl;
        if (Packing) stream <<"\tPacking\n";
        else stream <<"\tnot Packing\n";
        stream <<"Contents: '"<<Buffer<<"'"<<endl;
    }
40
Class for Fixed Length fields

    class FixedTextBuffer
    {
      public:
        FixedTextBuffer (int maxFields, int maxChars = 1000);
        void Clear (); // used by Unpack and by Person::Pack
        int AddField (int fieldSize);
        int Read (istream &);
        int Write (ostream &);
        int Pack (const char *);
        int Unpack (char *);
        void Print (ostream &);
      private:
        char * Buffer;     // character array to hold field values
        int BufferSize;    // sum of the sizes of declared fields
        int * FieldSize;   // array to hold field sizes
        int MaxFields;     // maximum number of fields
        int MaxChars;      // maximum number of characters in the buffer
        int NumFields;     // actual number of fields declared so far
        int NextField;     // index of the next field to be packed/unpacked
        int Packing;       // TRUE if in packing mode, FALSE if unpacking
        int NextCharacter; // packing/unpacking position in buffer
    };
41
Class for Fixed Length fields

    int Person::Pack (FixedTextBuffer & Buffer) const
    {// pack the fields into a FixedTextBuffer, return TRUE if all succeed, FALSE o/w
        int result;
        Buffer . Clear ();
        result = Buffer . Pack (LastName);
        result = result && Buffer . Pack (FirstName);
        result = result && Buffer . Pack (Address);
        result = result && Buffer . Pack (City);
        result = result && Buffer . Pack (State);
        result = result && Buffer . Pack (ZipCode);
        return result;
    }
42
Class for Fixed Length fields

    int Person::Unpack (FixedTextBuffer & Buffer)
    {// unpack the fields in the order they were packed
        Clear ();
        int result;
        result = Buffer . Unpack (LastName);
        result = result && Buffer . Unpack (FirstName);
        result = result && Buffer . Unpack (Address);
        result = result && Buffer . Unpack (City);
        result = result && Buffer . Unpack (State);
        result = result && Buffer . Unpack (ZipCode);
        return result;
    }
43
Record Access - Keys
• A key is an attribute used to identify records
• Often used to find records
• Standard or canonical form
  » rules to which keys must conform
  » prevents missing a record because its key is in a different form
  » Examples:
    – all capitals
    – phone numbers in the form (nnn) nnn-nnnn
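The two example rules can be sketched as small conversion functions applied to a key both before it is stored and before it is searched for. The function names are hypothetical, and returning an empty string for an input that does not contain exactly ten digits is a choice made only for this sketch.

```cpp
#include <cctype>
#include <string>

// Canonical form "all capitals": convert every letter to upper case.
std::string canonicalName(const std::string& key) {
    std::string out;
    for (char c : key)
        out.push_back(static_cast<char>(std::toupper(static_cast<unsigned char>(c))));
    return out;
}

// Canonical form "(nnn) nnn-nnnn": keep only the digits, then re-punctuate.
// Returns an empty string if the input does not contain exactly 10 digits.
std::string canonicalPhone(const std::string& raw) {
    std::string digits;
    for (char c : raw)
        if (std::isdigit(static_cast<unsigned char>(c))) digits.push_back(c);
    if (digits.size() != 10) return "";
    return "(" + digits.substr(0, 3) + ") " + digits.substr(3, 3) + "-" +
           digits.substr(6, 4);
}
```

With these in place, "Yeakus", "yeakus" and "YEAKUS" all map to the same stored key, so a search cannot miss a record merely because of capitalization.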
44
Record Access - Keys
• Keys can be distinct - they uniquely identify records
  » Primary keys
  » one-to-one relationship between key values and the entities represented
  » Examples: SSN, student ID
• Keys can identify a collection of records
  » Secondary keys
  » one-to-many relationship
  » Examples: city, position, department
45
Record Access - Keys
• Desired characteristics of a primary key
  » unique among the collection of entities
  » dataless - what if some entities have no value of this type (e.g. SSN)?
  » unchanging
46
Record Access
• Performance of an access method
  » How do we compare techniques?
  » We must be careful about which events we count.
  » "Big-oh" notation gives us a way to factor out all but the most significant factors
47
Record Access - Timing
• Sequential searching
  » Consider a file of 4000 records
  » What if no blocking is done, with one record per block? (500-byte records, 512-byte blocks)
  » What if the cluster size is set to 8?
  » The search always requires O(n) reads, but clustering makes it faster by a constant factor
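The slide's numbers can be worked through with one line of arithmetic. The sketch assumes that each read operation fetches one cluster and that a cluster of 8 blocks therefore delivers 8 records per read; the function name worstCaseReads is made up for this example.

```cpp
#include <cstddef>

// Number of read operations needed to scan the whole file sequentially
// (the worst case for a search), given how many records each read delivers.
std::size_t worstCaseReads(std::size_t records, std::size_t recordsPerRead) {
    if (recordsPerRead == 0) return 0;  // no meaningful answer; avoid dividing by zero
    return (records + recordsPerRead - 1) / recordsPerRead;  // round up
}
```

For 4000 records this gives 4000 reads with one record per read and 500 reads with eight records per read. A successful search for a randomly chosen record needs about half as many on average. Both counts grow linearly with the file size, which is the O(n) behavior; clustering only improves the constant.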
48
Sequential searching
• Usually NOT the best method
• Sometimes it is best:
  » Searching for some ASCII pattern (grep)
  » Small files
  » Files that are rarely searched
  » Searching on a secondary key when a large percentage of the records match (say 25%)
49
Unix Tools for sequential file processing
• cat - display a file
• wc - count lines, words, and characters
• grep - find lines in file(s) that match a regular expression
50
Direct Access
• Move "directly" to a record without scanning the preceding data
• Different languages and operating systems support different models:
  » Byte offset model
    – The programmer must specify the offset of the record and the record size to read.
    – Supports variable size records and skip sequential processing
  » Relative Record Number (RRN) model
    – The file has a fixed record size (declared at creation time)
    – Records are specified by a record number
    – The file is modeled as a collection of components
    – Higher level of abstraction
51
Direct Access
• Different language support
  » RRN support
    – PL/I
    – COBOL
    – Pascal (files are modeled as a collection of components, i.e. records)
    – FORTRAN
  » Byte offset
    – C
52
Choosing Record Sizes for Direct Access
• Fixed length fields
  » Very easy to parse records - just read into a record structure!
  » Each field must be the maximum length needed!
    – Thus the record must be as long as the sum of all the maximum field lengths

    Field:   last   first   address   city   state   zip
    Width:   10     10      15        15     2       9

    "Yeakus    Bill      123 Pine       Utica          OH43050    "
53
Choosing Record Sizes for Direct Access
• Variable length fields
  » Each field can be any length
  » Since some fields are long and others short, the overall record size may be smaller.
  » This gives more flexibility in field lengths
  » Records must be parsed, and space is spent on delimiters or length bytes.

    Yeakus|Bill|123 Pine|Utica|OH|43050
    Snivenloppinsky|Helmut|12232 Galmentary Avenue|Spotsdale|NY|11232
54
Header Records
• The first record in a direct file may be used to store special information
  » Number of records in use
  » Location of the first record in key order sequence
  » Location of the first empty record
  » File record structure (metadata)
• In languages with the RRN model, such as Pascal, the variant record facility must be used
• In C, the header record can be a different size from the rest of the file's records
55
Header Records
• Consider "update.c" in the text.
• The header record contains a 2-byte record count.
• Header size is 32, record size is 64

    static struct {
        short rec_count;
        char fill[30];
    } head;
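With a header in front of the data, the byte offset of a record is the header size plus the record number times the record size. The sketch below reuses the 32 and 64 from the slide; the names Header, recordOffset and expectedFileSize are hypothetical, record numbers are assumed to start at 0, and the static_assert assumes a 2-byte short with no padding.

```cpp
#include <cstddef>

const std::size_t kHeaderSize = 32;  // from the slide
const std::size_t kRecordSize = 64;  // from the slide

// Same layout as the slide's header: a 2-byte count padded out to 32 bytes.
struct Header {
    short rec_count;
    char fill[30];
};
static_assert(sizeof(Header) == kHeaderSize,
              "assumes a 2-byte short and no padding in Header");

// Byte offset of record number `rrn` (0-based): skip the header first.
std::size_t recordOffset(std::size_t rrn) {
    return kHeaderSize + rrn * kRecordSize;
}

// Size the file should have if the header's count is accurate.
std::size_t expectedFileSize(const Header& head) {
    return kHeaderSize + static_cast<std::size_t>(head.rec_count) * kRecordSize;
}
```

Record 0 therefore starts at byte 32 and record 3 at byte 224. Appending a record means writing it at recordOffset(rec_count) and then rewriting the header with the incremented count.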
56
Header Records
• Must be written when the file is created
• Must be rewritten when the file is changed
• Must be read when the file is opened
57
File Access and Organization
• File organization
  » Variable length records
  » Fixed length records
  » Field structures (size bytes, delimiters, fixed)
• File access
  » Sequential access
  » Direct access
  » Indexed access
58
File Access and Organization
• Interaction between organization and access
  » Can the file be divided into fields?
  » Is there a higher level of organization to the file (metadata)?
  » Do all records have to have the same number of fields or bytes?
  » How do we distinguish one record from the next?
  » How do we recognize whether a fixed length record holds real data or not?
59
File Access and Organization
• There is often a trade-off between space and time
  » Fixed length records allow direct access but waste space
  » Variable length records save space but require a sequential search
• We must also consider the typical use of the file: what are the desired access patterns?
• The choice of a particular organization has implications for the types of access that are possible
60
Portability and Standardization
• Differences among languages
  » Fixed size records versus byte addressable access
• Differences among machine architectures
  » Byte order of binary data
  » May be high order byte first or low order byte first
61
Byte order of binary data
• High order first (Big Endian)
  » A long int, say 45, is stored in memory.
  » It is stored as: 00 00 00 2D
  » Suns, network protocols
• Low order first (Little Endian)
  » A long int, say 45, is stored in memory.
  » It is stored as: 2D 00 00 00
  » PCs, VAXes
62
Byte order of binary data
• If binary data is written to a file, it is written in the order in which it is stored in memory
• If the data is later read by a system with a different byte ordering, the number will be incorrect!
• For the sake of portability, files should be written in an agreed upon format (probably Big Endian)
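One way to get an agreed upon format is to build the bytes with shifts instead of copying the integer's memory image, so the result is the same on every machine. The sketch below does this for a 32-bit unsigned value; the names toBigEndian and fromBigEndian are made up for this example.

```cpp
#include <array>
#include <cstdint>

// Produce the 4 bytes of `v`, high order byte first, whatever the host's byte order.
std::array<unsigned char, 4> toBigEndian(std::uint32_t v) {
    return {{static_cast<unsigned char>((v >> 24) & 0xFF),
             static_cast<unsigned char>((v >> 16) & 0xFF),
             static_cast<unsigned char>((v >> 8) & 0xFF),
             static_cast<unsigned char>(v & 0xFF)}};
}

// Rebuild the value from 4 bytes stored high order byte first.
std::uint32_t fromBigEndian(const std::array<unsigned char, 4>& b) {
    return (static_cast<std::uint32_t>(b[0]) << 24) |
           (static_cast<std::uint32_t>(b[1]) << 16) |
           (static_cast<std::uint32_t>(b[2]) << 8) |
           static_cast<std::uint32_t>(b[3]);
}
```

For the value 45 this yields the bytes 00 00 00 2D on both big endian and little endian hosts. Writing those four bytes to the file, rather than the raw integer, keeps the file readable on either kind of machine.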