Introduction to Python for Biologists Lecture 2 This Lecture Stuart Brown Associate Professor NYU School of Medicine
Learning Objectives
– Flow control (if/else) and operators
– For loops
– Recursion
– Reading and writing files (file I/O)
– Creating custom functions with def
– Dictionaries
Flow control
Programs need to make decisions and to repeat operations in a controlled way, either a set number of times or until a condition changes.
Decision statements: if, elif, else
Looping statements: for x in my_list: and while a < 10:
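As a rough illustration of the two kinds of statement named above, here is a minimal sketch. The variable names and values (gc_percent, the thresholds 60 and 40) are invented for the example and are not part of the lecture.

```python
# Hypothetical value, chosen only for illustration.
gc_percent = 62

# Decision: exactly one of the three branches runs.
if gc_percent > 60:
    label = "GC-rich"
elif gc_percent < 40:
    label = "AT-rich"
else:
    label = "balanced"
print(label)

# Controlled looping: repeat while the condition stays true.
a = 0
steps = []
while a < 10:
    steps.append(a)
    a += 3
print(steps)
```

The while loop ends as soon as a reaches 10 or more, so the update to a inside the loop is what keeps it from running forever.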
For Loops
For loops iterate (step) through a list one element at a time. In Python, loops and decisions are set off by a colon and an indent. Python 'for' syntax is very simple, but the statements inside the loop must be indented correctly.
>>> my_list = ['G', 'A', 'hat', 'cat']
>>> concat = ""    # this is an empty string
>>> for i in my_list:
...     concat = concat + i
...
>>> print(concat)
GAhatcat
Loop through a String
For loops work on strings as if they were a list of characters.
>>> my_DNA = 'ATGCGTA'
>>> for i in my_DNA:
...     print(i)
...
A
T
G
C
G
T
A
if/else example
>>> my_DNA = "ATGCGTA"
>>> if my_DNA.find("GC") != -1:
...     print("GC is found")
... else:
...     print("No GC found")
...
GC is found
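A pitfall worth seeing once: str.find() returns the position of the first match, or -1 when there is no match, so its result should not be used directly as a true/false test. The sketch below prints the raw return value next to a membership test with the in operator. The helper name contains and the three example sequences are made up for illustration.

```python
def contains(dna, motif):
    """Return True if motif occurs anywhere in dna."""
    return motif in dna

# Hypothetical sequences: match in the middle, match at position 0, no match.
examples = ["ATGCGTA", "GCATTA", "ATTTTA"]
for seq in examples:
    # find() gives an index (0 counts as False) or -1 (counts as True)
    print(seq, seq.find("GC"), contains(seq, "GC"))
```

A match at position 0 makes find() return 0, which an if statement treats as false, while "not found" returns -1, which is treated as true.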
Operators
Operators include the basic math functions: +, -, /, *, ** (raise to power)
Comparisons: >, <, >=, <=, ==, !=
Boolean operators: and, or, not
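A short sketch of these operators in use, assuming Python 3 (where / always gives a float). The // and % operators are additions not listed on the slide, and the name is_orf_length and its rule are invented for the example.

```python
length = 21                 # hypothetical sequence length

print(length / 2)           # true division: a float in Python 3
print(length // 2)          # floor division (not on the slide)
print(length % 3)           # remainder (not on the slide)
print(2 ** 10)              # raise to a power

# Comparisons return True or False and can be combined with and/or/not.
is_orf_length = (length % 3 == 0) and (length >= 6)
print(is_orf_length)
print(not is_orf_length)
```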
Example
dna = 'GATCCGGTTACTACGACCTGA'
count_G = 0
count_A = 0
for base in dna:
    if base == 'G':
        count_G += 1
    elif base == 'A':
        count_A += 1
print('G= ' + str(count_G) + ' ' + 'A= ' + str(count_A))
Functions
Functions go beyond the basic operators: they can handle file I/O, more complex math, or other manipulations of data. A function uses parentheses to act on some data object and may take additional parameters. A method is a function attached to an object and is called with a dot after the object's name.
Functions: print(x)   len(my_dna)   open('filename', 'r')
Methods: filehandle.read()   filehandle.write(data)   my_list.append(42)
Range
range(start, stop, [step]) creates a sequence of integers
– Starts at zero by default
– A range does not include the stop number
– Step is optional (default is 1)
>>> list(range(10))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> list(range(4, 11, 2))    # from 4 up to, but not including, 11 with step of 2
[4, 6, 8, 10]
range() is often used as part of a for loop to step through a list while keeping track of what number item you are working on:
>>> a = ['Mary', 'had', 'a', 'little', 'lamb']
>>> for i in range(len(a)):
...     print(i, a[i])
...
0 Mary
1 had
2 a
3 little
4 lamb
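For comparison, the built-in enumerate() gives the same index-and-item pairs without indexing into the list. It is not covered on the slide, so treat this as an optional alternative; the list pairs is only there to collect the results.

```python
a = ['Mary', 'had', 'a', 'little', 'lamb']

pairs = []
for i, word in enumerate(a):
    # i is the position, word is the item at that position
    pairs.append((i, word))
    print(i, word)
```

Both forms print the same lines; enumerate() simply avoids writing a[i] by hand.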
List Comprehension
A list comprehension creates a list using an expression and a for loop. An optional if statement can be included.
# create a list of squares < 50
squares = []
for x in range(10):
    if (x**2) < 50:
        squares.append(x**2)
print(squares)
[0, 1, 4, 9, 16, 25, 36, 49]
# create a list of squares < 50 with a list comprehension
squares = [x**2 for x in range(10) if (x**2) < 50]
print(squares)
[0, 1, 4, 9, 16, 25, 36, 49]
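The same pattern works on sequence data. As a small sketch, the comprehension below collects the (zero-based) positions of every G in the DNA string used earlier in the lecture; the name g_positions is invented for the example.

```python
dna = 'GATCCGGTTACTACGACCTGA'

# Keep index i only when the base at that index is a G.
g_positions = [i for i in range(len(dna)) if dna[i] == 'G']
print(g_positions)
print(len(g_positions))   # the number of G bases
```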
Custom Functions
In Python, users can create their own functions, which act like subroutines, or use functions from code written by others (known as modules).
def g_count(dna):      # function takes a string as input
    count = 0
    for base in dna:
        if base == 'G':
            count += 1
    return count       # function returns an integer
ATG finder
>>> def find_ATG(dna):
...     if dna.find("ATG"):
...         return ("ATG is found")
...     else:
...         return ("No ATG found")
...
>>> my_dna = 'TATGCGTA'
>>> find_ATG(my_dna)
ATG is found
Bonus point if you find and fix some of the bugs in this code
Recursion
Now that you can make custom functions …
– what would happen if you wrote a function that called itself?
def countdown(n):
    if n <= 0:
        print("Blastoff!")
    else:
        print(n)
        countdown(n-1)
Of course, you should avoid creating an infinite loop …
def plustwo(n):
    print(n)
    plustwo(n+2)    # no stopping condition, so be careful running this
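In practice, standard Python (CPython 3.5 or later is assumed here) does not recurse forever: it stops a runaway recursion by raising RecursionError once a built-in depth limit is reached. The sketch below shows this with a version of plustwo that omits the print so the output stays short; the variable outcome is invented for the example.

```python
import sys

def plustwo(n):
    # No stopping condition: every call makes another call.
    return plustwo(n + 2)

try:
    plustwo(0)
    outcome = "finished"
except RecursionError:
    outcome = "RecursionError"

print(outcome)
print(sys.getrecursionlimit())   # the depth limit for this interpreter
```

The limit exists to protect the interpreter, so a recursive function still needs a base case, like the n <= 0 test in countdown.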
Fibonacci
Computer scientists use recursion often; it is less common in bioinformatics applications.
has several sections that explore algorithms in computational biology and beyond.
– There is a nice (fairly simple) problem about Fibonacci Numbers:
– Give it a try (in Python, of course).
def Fib(x):
    if x == 0:
        return 0
    elif x == 1:
        return 1
    elif x > 1:
        return Fib(x-1) + Fib(x-2)
Why is this program such a bad idea? How can you do it better using a simple list to store the Fib series?
This is also a good introduction to computational complexity. Bioinformatics often deals with large data and complex computations, so the speed of computing for a given task is an important issue.
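One possible answer to the list question, offered as a sketch rather than the intended solution: build the series from the bottom up, so each value is computed once from the two before it. The recursive version recomputes the same values over and over, and its number of calls grows roughly exponentially with x. The function name fib_list is made up for this example.

```python
def fib_list(n):
    """Return the list [F(0), F(1), ..., F(n)], built iteratively."""
    series = [0, 1]
    for i in range(2, n + 1):
        # Each new value is the sum of the previous two stored values.
        series.append(series[i - 1] + series[i - 2])
    return series[:n + 1]   # the slice handles n = 0 correctly

print(fib_list(10))
print(fib_list(50)[-1])     # quick here; very slow with the recursive Fib
```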
File I/O
Usually your programs will get input data from a text file, and you will want to write output to a file rather than dump it on the screen ("standard output", "stdout").
In Python, a file must be opened before reading or writing. The open file is assigned to a variable called a 'handle', and the program then reads from or writes to the handle.
The .read() method captures the whole contents of the file in a single string. .close() the file when you are done with it.
file1 = open('human_pep.fasta')
Hum_pep = file1.read()
gene_count = Hum_pep.count('>')
file1.close()
with open() as f
A nicer way to open a file is to use the with/as keywords and an indented block. This automatically closes the file when the indented block is completed.
>>> with open('human_pep.fasta') as file1:
...     Hum_pep = file1.read()
...
Write output to a file
To create an output file, open a file (give it any name you want) with the 'w' option and assign it to a variable name. Then use the write() method. write() takes a single string, which you can build with string methods, concatenation, etc. inside the parentheses. Unlike print(), write() does not add a newline at the end, and opening an existing file with 'w' overwrites it.
output = open('humpep_count.txt', 'w')
output.write('Gene Count: ' + str(gene_count))
output.close()
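A minimal round-trip sketch: write one line to a file, then read it back to check what was stored. To keep it self-contained, the file goes into a temporary directory and gene_count is set to a made-up value; neither detail comes from the lecture.

```python
import os
import tempfile

gene_count = 3   # hypothetical value standing in for a real count

# Write into a temporary directory so the example leaves no clutter.
out_path = os.path.join(tempfile.mkdtemp(), 'humpep_count.txt')
with open(out_path, 'w') as output:
    # write() needs one string; add '\n' yourself to end the line
    output.write('Gene Count: ' + str(gene_count) + '\n')

with open(out_path) as check:
    contents = check.read()
print(repr(contents))
```

Printing with repr() makes the trailing newline visible.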
Read a file line by line with a for loop
readlines() captures a file as a list of lines (rather than all in one big string), then you can loop over the list of lines.
my_file = open('human_dna.fasta')
human_seq = my_file.readlines()
for line in human_seq:
    print(len(line))
Or you can iterate over lines in the file directly with a for loop:
my_file = open('human_dna.fasta')
for line in my_file:
    print(len(line))
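Each line read this way still ends with its newline character, so len(line) counts one more than the number of bases. The sketch below strips the newline and separates the FASTA header from the sequence lines. The file contents, the file name example.fasta, and the variable names are all invented for illustration.

```python
import os
import tempfile

# Hypothetical two-line FASTA record, written to a temporary file.
fasta_text = ">seq1 hypothetical example\nGATCCGGTTA\nCTACGACCTGA\n"
path = os.path.join(tempfile.mkdtemp(), 'example.fasta')
with open(path, 'w') as handle:
    handle.write(fasta_text)

header = None
sequence = ""
with open(path) as handle:
    for line in handle:
        line = line.rstrip('\n')      # drop the trailing newline
        if line.startswith('>'):
            header = line[1:]         # header line, without the '>'
        else:
            sequence += line          # join the sequence lines together

print(header)
print(len(sequence))
```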
Dictionaries
Dictionaries contain key-value pairs. (Called a "hash" in Perl and Ruby, and a map or associative array in many other languages.)
my_dict1 = {'ATT' : 'I', 'CTT' : 'L', 'GTT' : 'V', 'TTT' : 'F'}
– Very useful for lookup lists of things like the amino acid codon table or k-mer lists
– Designed to give very fast lookup of a key and return the corresponding value
– Keys must be unique and immutable (strings, numbers, or tuples); values can be anything
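A brief sketch of the basic dictionary operations, using the four-codon dictionary from the slide. The placeholder 'X' for an unknown codon is an assumption made for the example.

```python
my_dict1 = {'ATT': 'I', 'CTT': 'L', 'GTT': 'V', 'TTT': 'F'}

print(my_dict1['GTT'])            # look up a value by its key
print('AAA' in my_dict1)          # test whether a key is present

# get() returns a default instead of raising KeyError for a missing key;
# 'X' is a hypothetical placeholder for "unknown".
print(my_dict1.get('AAA', 'X'))

my_dict1['ATG'] = 'M'             # add a new key-value pair
print(len(my_dict1))
```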
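Zip makes a dictionary
Rather than type a dictionary, you can build a dictionary from two lists using zip(). zip() pairs up the items, and dict() turns the pairs into key-value entries.
>>> list1 = ['GAT', 'CAT', 'TAT', 'AAT']
>>> list2 = [1, 2, 3, 4]
>>> list(zip(list1, list2))
[('GAT', 1), ('CAT', 2), ('TAT', 3), ('AAT', 4)]
>>> dict(zip(list1, list2))
{'GAT': 1, 'CAT': 2, 'TAT': 3, 'AAT': 4}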
Check and add to dictionary
Another useful application of a dictionary is to build a non-redundant list.
– For each item, check if it is in the dictionary; if not, add it to the dictionary.
– You can count occurrences at the same time.
Example: count DNA dimers
DNA = 'GATCCGGTTACTACGACCTGAGAT'
Dimers = {}    # create an empty dictionary
for x in range(len(DNA)):
    di = DNA[x:(x+2)]
    if di in Dimers:
        Dimers[di] += 1    # add one to count for di
    else:
        Dimers[di] = 1     # add di to Dimers dict
print Dimers
Bonus point if you find and fix the bugs in this code
Challenge Assignment
– Write a function that translates a DNA string into protein.
– In your function, use a dictionary of triplet codons as keys and amino acids as values.
– Begin translation at the first ATG codon.
– Write a program that uses your translate function to open and translate a file that contains a single DNA sequence as text, and write the output to another text file.
Zip a codon table (save yourself some typing)
codons = ['ttt', 'ttc', 'tta', 'ttg', 'tct', 'tcc', 'tca', 'tcg', 'tat', 'tac', 'taa', 'tag', 'tgt', 'tgc', 'tga', 'tgg', 'ctt', 'ctc', 'cta', 'ctg', 'cct', 'ccc', 'cca', 'ccg', 'cat', 'cac', 'caa', 'cag', 'cgt', 'cgc', 'cga', 'cgg', 'att', 'atc', 'ata', 'atg', 'act', 'acc', 'aca', 'acg', 'aat', 'aac', 'aaa', 'aag', 'agt', 'agc', 'aga', 'agg', 'gtt', 'gtc', 'gta', 'gtg', 'gct', 'gcc', 'gca', 'gcg', 'gat', 'gac', 'gaa', 'gag', 'ggt', 'ggc', 'gga', 'ggg']
amino_acids = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
>>> codon_table = dict(zip(codons, amino_acids))
Very nice Python code by Peter Collingridge
Re-use Code vs Write New
A little break for a philosophical debate: when should you find and re-use code written by others, and when should you write your own?
In bioinformatics, many of the problems you will encounter with data have been faced by other people.
– A great deal of code has been written and shared in public repositories.
– Some of this code has been published and cited in the literature.
– Don't try to re-write BLAST (unless you really, really have to).
If you can't find code to do exactly what you want, should you adapt existing code or write your own?
– There are challenges to figuring out someone else's code.
– New code that depends on programs written by others is very fragile.
– There are challenges to validating your own code when using it to analyze and publish scientific data.
– There is value in building your own repository of code elements from scratch that work and fit together in a way that is intuitive for you.
Some Statistics in Python
NumPy has some basic statistics functions that work on arrays.
>>> import numpy as np
>>> squares = [x**2 for x in range(10) if (x**2) < 50]
>>> sq = np.array(squares)
>>> np.mean(sq)
17.5
>>> np.median(sq)
12.5
>>> np.std(sq)
Other NumPy functions
NumPy has:
– linear algebra
– trigonometry
– logarithms
– polynomials
– Fourier transforms
– random sampling
– permutations
– sorting
– distributions (normal, Poisson, hypergeometric, logistic, gamma, negative binomial, etc.)
SciPy
SciPy builds on NumPy and provides many more complex mathematical, statistical, and scientific data analysis functions.
>>> import antigravity
Summary
– Flow control (if/else) and operators
– For loops
– Recursion
– Reading and writing files (file I/O)
– Creating custom functions with def
– Dictionaries
Next Lecture: Biopython