Using files Taken from notes by Dr. Neil Moore CS 115 Lecture Using files Taken from notes by Dr. Neil Moore
Sequential and random access Sequential access: reading or writing the file in order starting from the beginning and going to the end Similar to how a for loop works on a list or string Read the first line, then the second line, etc. No skipping, no backing up! Random access: reading or writing without regard to order “Go to byte 7563 and put a 7 there” Like lists: we can use mylist[5] without having to go through indexes 0 through 4 first Also called direct access
Sequential and random access Random access does not work that well with text files Bytes don’t always match up with characters And they definitely don’t match up with lines. At what byte number does line 10 start? You’d have to go through the lines sequentially and count Text files usually use sequential access Binary files can use either sequential or random access
Reading a line at a time Here’s one way to read a line at a time for line in file: This technique is very useful but a little inflexible: It only lets us use one line per iteration But what if our file looked like this? Student 1 Score 1 … The readline method lets you control reading more precisely. line = infile.readline()
Readline line = infile.readline() This reads a single line from the file Returns a string, including the newline at the end The next time you do input, you get the next line Even if you use a different input technique next time readfile-readline.py
Buffers and the file pointer When you open a file in a Python program, Python and the OS set up a buffer for that file. It’s a piece of memory that holds some of the file’s data before you need it Why? Disks are much much slower than RAM And disks are faster if you read from them in large chunks, not byte by byte. Where have you seen buffers before? YouTube! “Video is buffering..” For the same reason, memory is much faster than the network A keyboard buffer “type-ahead” When you call readline, etc., that gets data from the buffer. If the program asks for data that isn’t in the buffer yet, the OS refills the buffer with new data from the file.
The buffer The buffer holds data our program has already read… … and data it has not read yet How does it tell the difference? A file pointer indicates the beginning of the unread part. Starts out at the beginning of the file (except in append mode) When you do a sequential read, like readline: It starts reading at the location of the file pointer When it sees the newline character, it advances the pointer to the next byte So that every input will get a different line Sequential access means the file pointer does not skip anything and never moves backward.
Reading a whole file at once Python also gives us two methods that read in the whole file at once That’s much easier if we need to process the contents out of order Or if we need to process each line several times But it doesn’t work if the file is larger than RAM. readlines gives us the whole file as a list of strings (note the s on the end of the name) line_list = infile.readlines() infile.close() #close, the file is exhausted for line in line_list: The lines still contain one newline each, at the end After the readlines, there is nothing left to read so might as well close Technically, it gives you the rest of the file, from the file pointer to the end Maybe you read in the first line with readline, then called readlines. readfile-readlines.py
Reading a whole file another way read gives us the whole (rest of the) file as a single string newlines and all content = infile.read() infile.close() As with readlines, you might as well close it immediately What to do with the string? You can use split line_list = content.split(“\n”) Unlike other methods, this gives you a list of strings without newlines Because split removes the delimiter
Splitting the string you get from read Be careful! If the last line in the file ends in a newline character, there will be an extra empty string in the list of lines content = ‘Hello\nWorld\n’ line_list = content.split(‘\n’) gives line_list = [‘Hello’,’World’,’’] Not just in Python: some OSes and programs treat that as a blank line readfile-read.py
Output files It’s possible to write to files too First, open the file for writing: outfile = open(“out.txt”, “w”) # w for write mode If the file does not already exist, it creates it If the file already exists, truncates it to zero bytes!! Cuts everything out of the file to start over The old data in the file is lost forever! Opening for writing is both creative and destructive. You can also open a file for appending: logfile = open(“audit.log”, “a”) # a for append Adds to the end of an existing file – no truncation, no data lost If the file doesn’t exist, will still create it. Which to use? Do you want to add to the file or overwrite it?
Open modes summary Mode Letter File exists File doesn’t exist Read r OK IOError Write w Truncates the file Creates the file Append a OK (adds to the end)
Writing to an output file There are two ways to write to an output file: You can use the same print function that we’ve been using Just give an extra argument at the end, file = fileobj print(“Hello,”, name, file = outfile) You can use all the functionality of print Printing strings, numbers, lists, etc. sep = “ “, end = “ “ You can also write a string with the write method Takes one string argument (nothing else!) outfile.write(“Hello, world!\n”) Does not automatically add a newline character! outfile-write.py
Closing files You should close a file when you are finished with it outfile.close() Frees resources (like the file buffer!) Closing files is even more important for output files The buffer isn’t written to the disk until either it’s full or the file is closed. Until you close the file, the most recent buffer-full is not ‘committed’ to disk And what if your program crashes without closing? Closing allows the file to be marked as “closed” = “not busy” by the OS Ever had a file that had been created during an application that crashed? Lots of times you cannot do anything to the file because Windows says “it’s busy” outfile.py When to close a file? As soon as you know you don’t have to read / write it again Immediately after your loop With read or readlines, immediately after that statement
Files versus lists Files are Lists are Permanent “persistent” Bigger (although must fit on your secondary storage device) Slower to access Sequential-access (as we do it in this class) Lists are Temporary (life of the program) Smaller (must fit in memory) Faster to access Random or sequential access
Example: letter count Let’s see a program to count letters in a file 26 counters, 26 different counters? We don’t want to make 26 different variables if char == “A”: a_count += 1 elif char == “B”: b_count += 1 …. Ugh! Instead we’ll make a list of counters. Initialize: counts = [ 0 ] * 26 That gives you a list of 26 elements which are all zero. Read through each character in the file Find and increment the corresponding counter
Converting characters to numeric codes The last time we heard about ASCII and Unicode They assign a numeric code to each possible character Python has functions to covert between characters and the code ord takes a character and returns its numeric code code = ord(“A”) Argument is a single character as a string, returns an integer chr takes a numeric code and returns the corresponding character char = chr(65) Argument is an integer, returns a one-character string Codes below 32 are control characters (newline, tab, …)
Converting characters to numeric codes We can use these to convert letters into a useful list subscript We want to put the counter for “A” at position 0, the counter for “B” at position 1, etc. So the subscript can be ord(char) – ord(“A”) gives a number from 0 to 25 To convert back: chr(i + ord(“A”)) if i is an integer from 0 to 25 lettercount.py