Using a Simple Python Script to Download Data
Rob Letzler
Goldman School of Public Policy
July 2005
Overview
- Explain the problem
- Talk about the solution strategy
- Then walk through the code line by line, and explain the tools and ideas in the solution
What’s not here that we might want to discuss in the future
- High-speed numerical Python: a slow language with fast libraries
- Writing your own objects
- Good program structure
- Functional programming: the map, filter, lambda, and reduce commands. Good short overview at: (Stata generate / replace commands are roughly map, and Stata drop if ~X is roughly filter)
The Challenge
- Download more than 1,000 daily and monthly electricity market database files from the California Independent System Operator (CAISO) website.
Solution Strategy
- Research the location (URL) of each database
- Write Python code that executes once for each month t in the sample period:
  - Generate strings for the locations of the web page and the local disk file for month t
  - Open the web page
  - Create a local disk file
  - Read the web page and save it in the local disk file
Disclaimer
- This is my first Python program. I fear that I’ve reinvented a lot of wheels.
- This program uses lots of basic Python functions rather than tapping into libraries and extensions in ways that would create a shorter program.
- This program structure, which has a main loop that is not in a function or object, is fine for a simple program but is dangerous for large, complex programs.
Python Syntax We’ll Need
- Loops
- Conditional statements
- Functions
- File / web reading and writing
- Exception handling
For Loops in Python
- Python loops over the elements of a list, not by updating an integer.
- Python requires a colon (:) between a conditional / loop / function declaration and the block of statements it controls.
    for item in list:
        do stuff
- Other programming languages would approach this as:
    for integer i = start to stop {do stuff}
- Python’s range(start, stop+1) covers the same values as other languages’ "start to stop".
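A minimal sketch of the two loop idioms above, written for Python 3 (the slides themselves use Python 2 syntax). The variable names are illustrative only.

```python
# Loop directly over the elements of a list.
month_names = ["Jan", "Feb", "Mar"]
visited = []
for name in month_names:
    visited.append(name)

# range(start, stop + 1) covers start..stop inclusive,
# like "for i = start to stop" in other languages.
start, stop = 1, 12
months = []
for i in range(start, stop + 1):
    months.append(i)

print(visited)
print(months[0], months[-1], len(months))
```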
The Main Loop Part I
    month_length = [31,28,31,30,31,30,31,31,30,31,30,31]  # list of number of days in each month
    for year in range(2001,2005):  # years 2001 to 2004; notice ranges include the
                                   # first num, but are strictly less than the last num
        for month in range(1,13):
            if ((year in range(2002,2004)) or (year == 2001 and month > 3)
                    or (year == 2004 and month < 10)):
                # only begins executing the main block if we are in
                # the sample period
Points highlighted in red on the slide:
- Logical operators are the words and and or, not & and |
- To test whether a and b are the same, use a == b with two equal signs; to put b in a, use a = b with one equal sign.
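The sample-period test above can be pulled into a small function and checked on its own. This is a Python 3 sketch, not the original program; in_sample and sample_months are hypothetical names introduced for illustration.

```python
def in_sample(year, month):
    """True for the months the slide's condition accepts (April 2001 to September 2004)."""
    return (year in (2002, 2003)
            or (year == 2001 and month > 3)
            or (year == 2004 and month < 10))

sample_months = [(year, month)
                 for year in range(2001, 2005)  # 2001..2004; the stop value is excluded
                 for month in range(1, 13)      # 1..12
                 if in_sample(year, month)]

print(len(sample_months), sample_months[0], sample_months[-1])
```

Building the list first makes it easy to confirm the first and last month before any download is attempted.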
Functions
- Functions are groups of statements that other parts of the code can call
    def FunctionName(parameters):
        statements
        return optional_return_value
- Functions may return a value. If the function returns a value, you can call it in an assignment statement, like result = FunctionName(inputs)
- Functions and objects are crucial tools for designing large programs that are modular, flexible, and reliable. See McConnell, Code Complete, for more detail.
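As a concrete instance of the def / return pattern, here is a small Python 3 sketch. is_leap_year is a hypothetical helper chosen for illustration; it is not part of the original program.

```python
def is_leap_year(year):
    """Return True if year is a leap year in the Gregorian calendar."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

# Because the function returns a value, it can be used in an assignment.
result = is_leap_year(2004)
print(result)
```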
- Python passes every parameter as a reference to an object. Numbers and strings cannot be modified in place, so they behave as if they were passed by value; lists and other more complex objects can be modified in place.
- Each function works with its own copies of those references, which can protect values from being accidentally changed.
- If you create a new object in the function, the original will be unaffected:
    list_var = list_var + ["C", "D"]
- If you modify the original object without changing its memory address, the original will be changed:
    list_var.extend(["C", "D"]) or list_var[1] = "C"
- Any variable that is defined outside of a function or object is global and can be changed by any part of the code. Avoid using global variables because it can be difficult to find and fix errors involving changes to them.
Passing by Value and Reference
    import fpformat

    def python_copies_numbers_but_shares_lists_and_objects(list_input, integer_input):
        integer_input = integer_input*1000
        list_input.extend(["C","D"])
        return integer_input

    def main():
        test_list = ["A","B"]
        test_integer = 5
        updated_integer = python_copies_numbers_but_shares_lists_and_objects(test_list, test_integer)
        print "notice that test_list has changed to "
        print test_list
        print "but that test_integer is still " + fpformat.fix(test_integer,0) + " but the copy we returned has changed to " + fpformat.fix(updated_integer,0)
        return

    main()
Output:
    notice that test_list has changed to
    ['A', 'B', 'C', 'D']
    but that test_integer is still 5 but the copy we returned has changed to 5000
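The same experiment can be sketched in Python 3, with one extra function showing the "create a new object" case. The function names here are hypothetical and chosen only for this illustration.

```python
def rebind_and_mutate(list_input, integer_input):
    integer_input = integer_input * 1000  # rebinds the local name only
    list_input.extend(["C", "D"])         # changes the caller's list in place
    return integer_input

def rebind_only(list_input):
    list_input = list_input + ["C", "D"]  # builds a new list; the caller's list is untouched
    return list_input

test_list = ["A", "B"]
test_integer = 5
updated_integer = rebind_and_mutate(test_list, test_integer)

other_list = ["A", "B"]
new_list = rebind_only(other_list)

print(test_list, test_integer, updated_integer)
print(other_list, new_list)
```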
Main Loop Then Calls a Function
In the main loop:
    month_string = make_two_dgt_string(month)
The function it calls:
    import fpformat  # fpformat formats floating point numbers into strings

    def make_two_dgt_string(n):
        # takes a number and adds a leading zero if the number is less than 10
        # assumes that the input number is < 100
        if n > 9:  # check whether we need to pad the date with a leading zero
            n_string = fpformat.fix(n,0)  # if we don't need to pad, convert the number directly to a string
        else:  # pad low numbers with a leading zero
            n_string = "0"+fpformat.fix(n,0)  # otherwise convert to string and add a leading zero to the string.
        return n_string  # either way, return the results.
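For comparison, a shorter way to get the same padding is string formatting, which does the zero-padding in one step. This is a Python 3 sketch of an alternative, not the code from the talk; make_two_digit_string is a hypothetical name.

```python
def make_two_digit_string(n):
    """Pad a non-negative integer below 100 with a leading zero if needed."""
    return "%02d" % n

print(make_two_digit_string(4), make_two_digit_string(11))
```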
Main Loop Then Creates Strings and Calls More Functions
    # now, for each month in the sample, request a price data file
    # generate caiso URL
    load_url = " ear,0)+month_string…
    # generate file name for my hard disk
    load_file_name = "caiso_price_"+fpformat.fix(year,0)+"-"+month_string+"-"+"1-"+fpformat.fix(end_date,0)+".zip"
    # download and save the requested files.
    get_save_file(load_url,load_file_name)
    # continue looping until we go through every month in the sample...
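A hedged Python 3 sketch of the file-name step only. The slides do not show how end_date is computed, so this assumes it is the last day of the month and takes it from the standard calendar module, which also handles February in leap years. It also assumes the pieces are joined with plain hyphens. local_file_name is a hypothetical helper.

```python
import calendar

def local_file_name(year, month):
    """Build a name like caiso_price_YYYY-MM-1-DD.zip for one month."""
    # Assumption: end_date is the number of days in the month.
    end_date = calendar.monthrange(year, month)[1]
    return "caiso_price_%d-%02d-1-%d.zip" % (year, month, end_date)

print(local_file_name(2001, 4))
print(local_file_name(2004, 2))
```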
Solution Strategy
We have:
- Researched the location (URL) of each database
- Written Python code that executes once for each month t in the sample period
- Generated strings for the web page and local disk file for month t
We’ve called but not seen the code that:
- opens the web page
- creates a local disk file
- reads the web page and saves it in the local file
Connect to the webpage
    import urllib

    def get_save_file(url, file_name):
        # this function gets the file specified in URL from the web and then saves it in
        # location FILE_NAME
        # Designates the location in which to save the file
        path = "C:\\rjl\\ca_amp\\download\\price\\"+file_name
        try:
            web_data = urllib.urlopen(url)  # attempt to create a shortcut / handle to the desired web page / web file
        except IOError, msg:
            print "didn't open URL %s: %s" % (url, str(msg))
            return  # nothing to save if the page did not open
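In Python 3 the equivalent call lives in urllib.request. The sketch below is an assumption about how the open step might look today, not the original code; fetch_bytes and its timeout default are hypothetical.

```python
import urllib.error
import urllib.request

def fetch_bytes(url, timeout=30):
    """Return the body at url as bytes, or None if it could not be opened."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, OSError) as err:
        print("didn't open URL %s: %s" % (url, err))
        return None
```

Returning None on failure lets the caller skip the save step for that month and continue with the next one.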
Creating and Using Objects
- Many Python libraries are object oriented
- An object bundles a kind of data with “member functions” for manipulating that data.
- Steps: 1) create (“instantiate”) objects 2) use their functions.
    objectName = libName.constructor(initial values)
    objectName.doSomething(parameters)
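A small Python 3 illustration of the two steps, using the standard datetime library as the example object-oriented library. The choice of datetime is only for illustration; the original program does not use it.

```python
import datetime

# Step 1: create ("instantiate") an object from the library's constructor.
first_day = datetime.date(2001, 4, 1)

# Step 2: use one of its member functions.
label = first_day.strftime("%Y-%m")

print(label)
```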
Exceptions
- try/except sequences handle routine problems like file-not-found errors ("exceptions") gracefully rather than ending the whole program.
    try:
        SomethingThatMightNotWork  # this will either work, or fail and generate an exception of some failureType
    except failureType1:
        {If we get failure type 1, do this and continue from here}
- Dividing by zero or inverting a singular matrix might throw exceptions.
- An exception acts as a limited goto statement: if there is an exception, the program stops executing the try block and jumps immediately to the next except statement that handles that error.
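A minimal runnable Python 3 sketch of the try/except pattern, using the divide-by-zero case mentioned above. safe_ratio is a hypothetical function introduced for the example.

```python
def safe_ratio(numerator, denominator):
    """Divide two numbers, returning None instead of crashing on a zero denominator."""
    try:
        return numerator / denominator
    except ZeroDivisionError as err:
        # Execution jumps here only if the division raised ZeroDivisionError.
        print("could not divide: %s" % err)
        return None

print(safe_ratio(6, 3))
print(safe_ratio(1, 0))
```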
Create a local file and save the downloaded page
        # (body of get_save_file, continued)
        try:
            f = open(path, "wb")  # create a handle to a new file for "wb": _w_riting in _b_inary
            f.write(web_data.read())  # write into the new file the results from downloading the webpage
            f.close()  # complete writing process.
            print "saved %s" % path
        except IOError, msg:
            print "didn't save %s: %s" % (path, str(msg))
        return  # end the routine
File Manipulation in Python
- Details on files: Python Tutorial Section 7.2
- Start: construct a file object using the open command
    file_object_name = open(filename, mode)
- Read / write
    string_or_data = file_object_name.read()
    file_object_name.write(data_to_write)
- Finish using the file
    file_object_name.close()
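The open / write / read / close sequence can be sketched in Python 3 as follows. The with statement closes the file automatically, so no explicit close() call is needed. save_bytes and load_bytes are hypothetical helpers, not part of the original program.

```python
def save_bytes(path, data):
    """Write data to path in binary mode; return True on success, False on failure."""
    try:
        with open(path, "wb") as f:  # "wb": writing in binary
            f.write(data)
    except OSError as err:
        print("didn't save %s: %s" % (path, err))
        return False
    print("saved %s" % path)
    return True

def load_bytes(path):
    """Read the whole file at path in binary mode."""
    with open(path, "rb") as f:
        return f.read()
```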
Possible extensions
- Unzip the files that we downloaded (easy?)
    import os
    os.system('unzip ' + file_name)
  (See
- Test that downloaded data have expected characteristics (e.g. four fields per line) using regular expressions
- Read in and manipulate the XML databases (harder?)
- Enter these file names into a SAS or Stata import / analysis code and run SAS / Stata
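A hedged Python 3 sketch of the first two extensions. It swaps the external unzip command for the standard zipfile module, and it assumes comma-separated fields for the "four fields per line" check; the actual layout of the CAISO files is not stated in the slides. unzip_all and has_four_fields are hypothetical names.

```python
import re
import zipfile

# Assumption: fields are separated by commas and contain no embedded commas.
FOUR_FIELDS = re.compile(r"^[^,]*(,[^,]*){3}$")

def unzip_all(zip_path, target_dir):
    """Extract every member of zip_path into target_dir and return the member names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()

def has_four_fields(line):
    """True if the line splits into exactly four comma-separated fields."""
    return FOUR_FIELDS.match(line.rstrip("\n")) is not None
```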
Python can do far more with webpages
- Details on web: Its sample programs include:
  - Webchecker.py (checks for broken links on a website)
  - Websucker.py (downloads a whole website)
- I found their code a bit hard to follow.
- I used snippets of those programs as examples for this program.