Research Code Andrew Rosenberg with RA Manual: Notes on Writing Code by Matthew Gentzkow and Jesse Shapiro Chicago Booth and
Research Code has a Bad Reputation Research code is rarely written to be robust, reusable, or long-lived in development and versioning repositories. The consumer is usually the code’s writer, or in some cases a few others in the lab. make-research-software-accountable/
Mistakes (Research) Programmers Make –I just need to do this specific thing one time. –I’ll remember what I did, if I need to do it again. –No one is interested in this code. –No one will ever see this code.
What research code looks like This is not application development. Often research code involves: –A series of small scripts, –linking together existing open source toolkits, –reformatting input and output, –generating plots and graphs. Where is the “software”?
What research code looks like The contribution of the paper may be –Extension of an existing codebase –a set of small scripts and reformatting one-liners. –implemented in multiple languages.
A new way of doing business These are bad excuses. There is a movement to encourage and incentivize the distribution of source code with publications, and there are facilities to support it.
Source Code dissemination Host it yourself. (many, many more)
What is good enough? Right now: –ANYTHING. Ideally: –“production level” Code that can be run or compiled on a standard configuration. –Thorough documentation.
Intellectual Property and Licensing GPL (copyleft), Apache, and many, many more. You have copyright over your code. A license allows someone else to use it. Disclosures can limit your ability to patent.
Version Control Version control allows multiple users to edit the same content. It also allows for coding in the open. Subversion, Git, and many more.
Coding for the User Code for your future self. You are your most important user.
Don’t try to be clever Write simple, understandable code. Efficiency in number of lines is not important. Efficiency in number of operations or memory also might not be important.
There are many ways to skin a cat
print "Just another Perl hacker,";
$_='987;s/^(\d+)/$1-1/e;$1?eval:print"Just another Perl hacker,"';eval;
$_ = "wftedskaebjgdpjgidbsmnjgc"; tr/a-z/oh, turtleneck Phrase Jar!/; print;
Establish a coding style. –ClassName –nameMethodsUsingVerbs –underscored_lowercase_variable_names –CONSTANTS –Spacing: x_mean=x_total/n More than anything, be consistent.
Testing Unit tests. –Small pieces of code that test “atomic” functionality of a program.
void testAddWorksCorrectly() { assertEquals(4, add(2, 2)); }
void testConstructorInitializesNameFieldToDefault() { Person p = new Person(); assertEquals("John Smith", p.getName()); }
Why write tests? Identify problems. Easier Changes. Simple integration. Documentation.
Test Driven Development –Write a test. –Run the tests and confirm that the new test fails. –Write as little code as possible. –Make the tests pass (go green). –Refactor the code. –Repeat. [wikipedia]
Bug fixes and Testing When you find a bug in your code, write a test that “catches the bug”. –At first, it fails. The bug is fixed when the test passes. As long as the test stays in your suite, the bug cannot come back unnoticed.
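A minimal sketch of a bug-catching test, using Python's standard `unittest` module. The function `mean_height` and the empty-list bug are hypothetical, chosen only to illustrate the workflow.

```python
import unittest


def mean_height(heights):
    """Return the mean of a list of heights (hypothetical example function)."""
    # The fix: an earlier version divided by len(heights) unconditionally
    # and crashed with ZeroDivisionError on an empty list.
    if not heights:
        raise ValueError("heights must not be empty")
    return sum(heights) / len(heights)


class MeanHeightRegressionTest(unittest.TestCase):
    def test_empty_input_raises_value_error(self):
        # Written when the bug was found; it failed until the fix above.
        with self.assertRaises(ValueError):
            mean_height([])

    def test_mean_of_known_values(self):
        self.assertAlmostEqual(mean_height([1.5, 2.0, 2.5]), 2.0)
```

Saved in a file, tests like these can be run with `python -m unittest`. Keeping the first test in the suite is what guards against the bug quietly returning.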
Refactoring Just because code works doesn’t mean it’s done. Consolidate code to increase modularity –Eliminate code duplication. Some examples –Extract Class –Extract Method –Move/Rename Method
Code Review Give your code to another person for feedback. Companies do this to ensure consistent style and correctness. Research labs rarely do.
Some specific advice. Take an enormous amount of notes. –What did you do? –What did you learn? –What bugs did you fix? –What new issues did you find? –What questions did you come up with?
Specifics Copy and Paste is your enemy. –If you are copying and pasting in code, you have probably made a mistake.
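One common alternative to copying and pasting is to move the repeated lines into a function and call it. The sketch below assumes a hypothetical scaling step that would otherwise be pasted once per data column; the name `normalize` and the sample values are illustrative only.

```python
def normalize(values):
    """Scale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    low, high = min(values), max(values)
    if high == low:
        # All values identical: avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


# Instead of pasting the scaling arithmetic once per column,
# call the same function for each one.
heights = normalize([1.60, 1.70, 1.80])
scores = normalize([10, 20, 40])
```

If the scaling rule ever changes, it now changes in one place rather than in every pasted copy.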
Specifics Use CONSTANTS –Never encode constants inline in your code.
Inline:
mean_height = total_height / 15
Named:
num_people = 13
mean_height = total_height / num_people
Inline:
data[17] = 'Andrew'
data[18] = 1.78
Named:
name_idx = 17
score_idx = 18
data[name_idx] = 'Andrew'
data[score_idx] = 1.78
Specifics Don’t use global variables
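One reading of this advice, sketched in Python: a function that reads and modifies module-level state is hard to test and reuse, while a function that takes its state as an argument is not. The names `running_total`, `add_height_global`, and `add_height` are hypothetical.

```python
# Fragile: the function reads and mutates module-level state, so its
# result depends on everything that ran before it.
running_total = 0.0


def add_height_global(height):
    global running_total
    running_total += height
    return running_total


# Clearer: state comes in as an argument and goes out as a return value.
def add_height(current_total, height):
    return current_total + height


total = 0.0
for h in [1.5, 2.0, 2.5]:
    total = add_height(total, h)
```

The second version can be called from a test, or from another script, without first checking what the rest of the module has done to `running_total`.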
Specifics Use sensible function names –Not: start() step1() step2() step3() wrapup() –But: initializeParameters() setPaths() calculateRHS() calculateLHS() writeResults()
Specifics Use sensible variable names
x1 = income / population
ipc = income / population
income_per_capita = income / population
Specifics Serialize Frequently.
main() {
  preprocessData()
  extractFeatures()
  runBaselineExperiment()
  runNewExperiment()
  evaluateResults()
}
preprocess files.data > clean_files.data
extractFeatures clean_files.data > features.csv
runBaseline features.csv > baseline.results
runNewExperiment features.csv > new.results
evaluate baseline.results > baseline.report
evaluate new.results > new.report
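The same idea can be sketched inside a single Python program: each stage writes its result to disk, and a later run reuses that file instead of recomputing it. The helper `run_stage`, the stage functions, and the JSON file names are assumptions made for illustration, not part of the original pipeline.

```python
import json
from pathlib import Path


def run_stage(output_path, compute, *inputs):
    """Run compute(*inputs) unless output_path already holds its result."""
    output_path = Path(output_path)
    if output_path.exists():
        # Reuse the serialized result from an earlier run.
        return json.loads(output_path.read_text())
    result = compute(*inputs)
    output_path.write_text(json.dumps(result))
    return result


# Hypothetical stand-ins for the real pipeline stages.
def preprocess_data(raw):
    return [x for x in raw if x is not None]


def extract_features(clean):
    return {"n": len(clean), "total": sum(clean)}


def run_pipeline(raw, work_dir):
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    clean = run_stage(work_dir / "clean.json", preprocess_data, raw)
    features = run_stage(work_dir / "features.json", extract_features, clean)
    return features
```

Note that this sketch does not notice when the input changes: a stage reruns only after its output file is deleted. That is the price of the simplicity, and it mirrors how the shell version behaves when you rerun only the later commands.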
Specifics When things get slow, use a profiler. –Identify slow functions, and fix them. –Some code needs to do a lot, so it can be slow
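As a rough illustration, Python ships a profiler in its standard library (`cProfile`, with `pstats` for reporting). The function `slow_sum_of_squares` is a hypothetical stand-in for an expensive research step, and `profile_call` is a helper invented for this sketch.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Deliberately naive loop standing in for an expensive research step."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_call(func, *args):
    """Run func(*args) under cProfile; return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Show the five entries with the largest cumulative time.
    stats.sort_stats("cumulative").print_stats(5)
    return result, buffer.getvalue()


result, report = profile_call(slow_sum_of_squares, 100_000)
print(report)
```

The report lists time per function, which is usually enough to tell whether the slowness sits in one fixable function or is spread across work the code genuinely has to do.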
Recap Research Code should be released –This is becoming more common, expected and, sometimes, required. Research Code needs to be good code. –So you can reuse it. –So you can release it.