Introduction to Python for Biologists Lecture 2 This Lecture Stuart Brown Associate Professor NYU School of Medicine.



Learning Objectives
– Flow control (if/else) and operators
– For loops
– Recursion
– Reading and writing files (file I/O)
– Creating custom functions with def
– Dictionaries

Flow control
Programs need to make decisions and to loop in a controlled way (repeat an operation a specific number of times, or until a condition changes).
– Decision keywords: if, elif, else
– Looping keywords: for x in list: and while a < 10:
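
As a small illustration of both kinds of flow control, the sketch below uses if/elif/else to classify a sequence by length and a while loop to repeat an operation until a condition changes. The sequence and the length thresholds are made up for this example.

```python
# Hypothetical example: the sequence and the thresholds are arbitrary.
seq = "ATGCGTA"

# Decision: exactly one of the three branches runs.
if len(seq) < 5:
    size = "short"
elif len(seq) < 10:
    size = "medium"
else:
    size = "long"
print(size)

# Controlled looping: the body repeats until the condition becomes False.
a = 0
while a < 10:
    a += 1
print(a)
```

A for loop is usually the better choice when the number of repeats is known in advance; while is useful when the stopping point depends on something computed inside the loop.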

For Loops
For loops iterate (step) through a list one element at a time. In Python, loops and decisions are set off by a colon and an indent. The Python 'for' syntax is very simple, but the statements inside the loop must be indented correctly.
>>> my_list = ['G', 'A', 'hat', 'cat']
>>> concat = ""    # this is an empty string
>>> for i in my_list:
...     concat = concat + i
...
>>> print(concat)
GAhatcat

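Loop through a String
For loops work on strings as if they were a list of characters. Variable names are case-sensitive, so the name in the loop must match the name that was assigned.
>>> my_DNA = 'ATGCGTA'
>>> for i in my_DNA:
...     print(i)
...
A
T
G
C
G
T
A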

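if/else example
>>> my_DNA = "ATGCGTA"
>>> if "GC" in my_DNA:
...     print("GC is found")
... else:
...     print("No GC found")
...
GC is found
Use the in operator for this test. my_DNA.find("GC") returns the position of the match (0 if the match is at the very start) and -1 if there is no match, so its result cannot be used directly as a true/false condition.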

Operators
Operators include the basic math functions: +, -, /, *, ** (raise to a power)
Comparisons: >, <, >=, <=, ==, !=
Boolean operators: and, or, not
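
A short sketch of how these operators combine in practice. The values are hypothetical, and the % (remainder) operator is an addition that is not listed on the slide; it is handy for checking whether a sequence length is a whole number of codons.

```python
# Arithmetic
print(7 / 2)     # division
print(2 ** 10)   # raise to a power

# Comparisons evaluate to True or False.
gc_fraction = 0.62            # hypothetical value
is_gc_rich = gc_fraction >= 0.5

# Boolean operators combine comparisons.
length = 21                   # hypothetical value
keep = is_gc_rich and not length < 18

# % gives the remainder, so this is True for a whole number of codons.
whole_codons = (length % 3 == 0)

print(is_gc_rich, keep, whole_codons)
```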

Example
dna = 'GATCCGGTTACTACGACCTGA'
count_G = 0
count_A = 0
for base in dna:
    if base == 'G':
        count_G += 1
    elif base == 'A':
        count_A += 1
print('G= ' + str(count_G) + ' ' + 'A= ' + str(count_A))

Functions
More complex operations are carried out by functions. They can deal with file I/O, more complex math, or other manipulations of data. Functions use parentheses to act on some data object and may take additional parameters. A function that belongs to an object (a method) is called with a dot after the object's name.
print(x)
open('filename', 'r')
filehandle.read()
filehandle.write(data)
my_list.append(42)
len(my_dna)

Range
range(start, stop, [step]) produces a sequence of integers
– Starts at zero by default
– A range does not include the stop number
– Step is optional
In Python 3, wrap the range in list() to see its contents.
>>> list(range(10))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> list(range(4, 11, 2))    # from 4 up to (not including) 11 with a step of 2
[4, 6, 8, 10]
range() is often used as part of a for loop to step through a list while keeping track of which item number you are working on:
>>> a = ['Mary', 'had', 'a', 'little', 'lamb']
>>> for i in range(len(a)):
...     print(i, a[i])
...
0 Mary
1 had
2 a
3 little
4 lamb
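
The built-in enumerate() function is a common alternative to range(len(a)) when both the position and the item are needed. It is not covered on the slide; the sketch below only shows that it produces the same index/item pairs.

```python
a = ['Mary', 'had', 'a', 'little', 'lamb']

# enumerate() yields (index, item) pairs, so no a[i] lookup is needed.
pairs = []
for i, word in enumerate(a):
    print(i, word)
    pairs.append((i, word))
```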

List Comprehension
A list comprehension creates a list using an expression and a for loop. An optional if clause can be included.
squares = []    # create a list of squares < 50
for x in range(10):
    if (x**2) < 50:
        squares.append(x**2)
print(squares)
[0, 1, 4, 9, 16, 25, 36, 49]
# create a list of squares < 50 with a list comprehension
squares = [x**2 for x in range(10) if (x**2) < 50]
print(squares)
[0, 1, 4, 9, 16, 25, 36, 49]
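
The same pattern applies to sequence data. As a sketch, the comprehension below splits a made-up DNA string into codons, and a second one uses the optional if clause as a filter. The sequence is hypothetical and was chosen so that its length is a multiple of three.

```python
dna = 'ATGGAGGAAGAAGATTAA'    # hypothetical sequence, 18 bases

# Step through the string three bases at a time.
codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]

# The optional if clause filters: keep only codons that start with 'G'.
g_codons = [c for c in codons if c.startswith('G')]

print(codons)
print(g_codons)
```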

Custom Functions
In Python, users can create their own functions, which act like subroutines, or use functions within code written by others (known as modules).
def g_count(dna):       # function takes a string as input
    count = 0
    for base in dna:
        if base == 'G':
            count += 1
    return count        # function returns an integer
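
A function does nothing until it is called. The sketch below repeats the g_count definition so that it runs on its own, calls it on the example sequence used earlier in the lecture, and compares the result with the built-in string method count().

```python
def g_count(dna):
    """Return the number of G bases in a DNA string."""
    count = 0
    for base in dna:
        if base == 'G':
            count += 1
    return count

dna = 'GATCCGGTTACTACGACCTGA'
n = g_count(dna)
print(n)

# The built-in string method gives the same answer.
same = (n == dna.count('G'))
print(same)
```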

ATG finder
>>> def find_ATG(dna):
...     if dna.find("ATG"):
...         return ("ATG is found")
...     else:
...         return ("No ATG found")
...
>>> my_dna = 'TATGCGTA'
>>> find_ATG(my_dna)
'ATG is found'
Bonus point if you find and fix some of the bugs in this code.
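
One possible answer to the bonus question, offered as a sketch rather than the intended solution: the test on dna.find("ATG") gives the wrong branch when the match is at position 0 (find returns 0, which counts as false) and when there is no match (find returns -1, which counts as true). Testing membership with in avoids both problems.

```python
def find_ATG(dna):
    """Report whether the string contains an ATG start codon."""
    # str.find() returns -1 when nothing is found and 0 for a match at the
    # very start, so its result cannot be used directly as True/False.
    if "ATG" in dna:
        return "ATG is found"
    else:
        return "No ATG found"

print(find_ATG('TATGCGTA'))   # ATG in the middle
print(find_ATG('ATGCGTA'))    # ATG at position 0
print(find_ATG('CCCC'))       # no ATG
```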

Recursion
Now that you can make custom functions ...
– What would happen if you wrote a function that called itself?
def countdown(n):
    if n <= 0:
        print("Blastoff!")
    else:
        print(n)
        countdown(n-1)
Of course, you should avoid creating an infinite loop ...
def plustwo(n):
    print(n)
    plustwo(n+2)    # be careful running this: get ready to kill it
Python will eventually stop plustwo with a RecursionError when the recursion limit is reached.

Fibonacci
Computer scientists use recursion often; it is less common in bioinformatics applications. Online problem sets explore algorithms in computational biology and beyond.
– There is a nice (fairly simple) problem about Fibonacci numbers.
– Give it a try (in Python, of course).

def Fib(x):
    if x == 0:
        return 0
    elif x == 1:
        return 1
    elif x > 1:
        return Fib(x-1) + Fib(x-2)
Why is this program such a bad idea? How can you do it better using a simple list to store the Fib series? This is also a good introduction to computational complexity. Bioinformatics often deals with large data and complex computations, so the speed of computing for a given task is an important issue.
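
One way to answer the question above, as a sketch: the recursive Fib recomputes the same values over and over, so the number of calls grows roughly exponentially with x. Storing the series in a list means each number is computed once. The function name fib_series is made up for this example.

```python
def fib_series(n):
    """Return the Fibonacci numbers F(0) .. F(n) as a list."""
    series = [0, 1]
    for i in range(2, n + 1):
        # Each new value reuses the two values already stored in the list.
        series.append(series[i - 1] + series[i - 2])
    return series[:n + 1]

print(fib_series(10))
```

The loop does one addition per number, so the running time grows in proportion to n instead of exploding as the recursive version does.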

File I/O
Usually your programs will get input data from a text file, and you will want to write output to a file rather than dump it on the screen ("standard output", "stdout").
In Python, a file must be opened before reading or writing. The open file is assigned to a variable called a 'handle'; the program then reads from or writes to the handle.
The .read() method captures the whole contents of the file in a single string. .close() the file when you are done with it.
file1 = open('human_pep.fasta')
Hum_pep = file1.read()
gene_count = Hum_pep.count('>')
file1.close()

with open() as f
A nicer way to open a file is to use the with/as keywords and an indented block. This automatically closes the file when the indented block is completed.
>>> with open('human_pep.fasta') as file1:
...     Hum_pep = file1.read()
...

Write output to a file
To create an output file, open a file (give it any name you want) with the 'w' option and assign it to a variable name. Then use the write() method. As with print(), you can include string methods, concatenation, etc. inside the parentheses, but write() takes a single string and does not add a newline at the end.
output = open('humpep_count.txt', 'w')
output.write('Gene Count: ' + str(gene_count))
output.close()

Read a file line by line with a for loop
readlines() captures a file as a list of lines (rather than all in one big string); then you can loop over the list of lines.
my_file = open('human_dna.fasta')
human_seq = my_file.readlines()
for line in human_seq:
    print(len(line))
Or you can iterate over the lines in the file directly with a for loop:
my_file = open('human_dna.fasta')
for line in my_file:
    print(len(line))
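
Each line read from a file still ends with its newline character, so len(line) counts one more than the number of bases. The sketch below writes a tiny made-up FASTA file to a temporary directory (so it can run anywhere), then reads it line by line, skips the header, and strips the newline before counting. The file name and its contents are hypothetical.

```python
import os
import tempfile

# Write a tiny, made-up FASTA file so the example can run anywhere.
path = os.path.join(tempfile.mkdtemp(), 'example.fasta')
with open(path, 'w') as out:
    out.write('>seq1 hypothetical record\n')
    out.write('ATGCGTA\n')
    out.write('GGCC\n')

seq_length = 0
with open(path) as my_file:
    for line in my_file:
        line = line.rstrip('\n')     # drop the newline before counting
        if line.startswith('>'):     # FASTA header line, not sequence
            continue
        seq_length += len(line)

print(seq_length)
```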

Dictionaries
Dictionaries contain key-value pairs. (Called a "hash" in some other programming languages, such as Perl.)
my_dict1 = {'ATT' : 'I', 'CTT' : 'L', 'GTT' : 'V', 'TTT' : 'F'}
– Very useful for lookup lists of things like the amino acid codon table or k-mer lists
– Designed to give very fast random-access lookup of the key and return the corresponding value
– Keys must be unique and immutable (strings are the most common choice; numbers and tuples also work); values can be anything
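
A minimal sketch of the basic dictionary operations, using the my_dict1 example above: lookup by key, a membership test, a lookup with a fallback value, and adding a new pair. The '?' fallback is an arbitrary choice for this example.

```python
my_dict1 = {'ATT': 'I', 'CTT': 'L', 'GTT': 'V', 'TTT': 'F'}

print(my_dict1['GTT'])       # look up a value by its key
print('AAA' in my_dict1)     # membership test on the keys

# .get() returns a default instead of raising KeyError for a missing key.
aa = my_dict1.get('AAA', '?')
print(aa)

my_dict1['ATG'] = 'M'        # add a new key-value pair
print(len(my_dict1))
```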

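Zip makes a dictionary
Rather than type a dictionary, you can build a dictionary from two lists using zip(). zip() pairs up the items; dict() turns the pairs into a dictionary.
>>> list1 = ['GAT', 'CAT', 'TAT', 'AAT']
>>> list2 = [1, 2, 3, 4]
>>> list(zip(list1, list2))
[('GAT', 1), ('CAT', 2), ('TAT', 3), ('AAT', 4)]
>>> dict(zip(list1, list2))
{'GAT': 1, 'CAT': 2, 'TAT': 3, 'AAT': 4}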

Check and add to dictionary
Another useful application of a dictionary is to build a non-redundant list.
– For each item, check if it is in the dictionary; if not, then add it to the dictionary.
– You can count occurrences at the same time.
Example: count DNA dimers
DNA = 'GATCCGGTTACTACGACCTGAGAT'
Dimers = {}                    # create an empty dictionary
for x in range(len(DNA)):
    di = DNA[x:(x+2)]
    if di in Dimers:
        Dimers[di] += 1        # add one to count for di
    else:
        Dimers[di] = 1         # add di to Dimers dict
print(Dimers)
Bonus point if you find and fix the bugs in this code.
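
One possible fix for the bonus question, as a sketch: range(len(DNA)) runs one position too far, so the last slice holds a single base ('T') and is counted as if it were a dimer. Stopping one position early keeps every slice two bases long.

```python
DNA = 'GATCCGGTTACTACGACCTGAGAT'
Dimers = {}                        # create an empty dictionary

# Stop one position early so every slice is a full two-base dimer.
for x in range(len(DNA) - 1):
    di = DNA[x:x + 2]
    if di in Dimers:
        Dimers[di] += 1            # add one to the count for di
    else:
        Dimers[di] = 1             # add di to the Dimers dict
print(Dimers)
```

A quick sanity check is that the counts add up to len(DNA) - 1, the number of overlapping dimers in the sequence.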

Challenge Assignment
Write a function that translates a DNA string into protein.
– In your function, use a dictionary of triplet codons as keys and amino acids as values.
– Begin translation at the first ATG codon.
Write a program that uses your translate function to open and translate a file that contains a single DNA sequence as text, and write the output as another text file.

Zip a codon table (save yourself some typing)
codons = ['ttt', 'ttc', 'tta', 'ttg', 'tct', 'tcc', 'tca', 'tcg',
          'tat', 'tac', 'taa', 'tag', 'tgt', 'tgc', 'tga', 'tgg',
          'ctt', 'ctc', 'cta', 'ctg', 'cct', 'ccc', 'cca', 'ccg',
          'cat', 'cac', 'caa', 'cag', 'cgt', 'cgc', 'cga', 'cgg',
          'att', 'atc', 'ata', 'atg', 'act', 'acc', 'aca', 'acg',
          'aat', 'aac', 'aaa', 'aag', 'agt', 'agc', 'aga', 'agg',
          'gtt', 'gtc', 'gta', 'gtg', 'gct', 'gcc', 'gca', 'gcg',
          'gat', 'gac', 'gaa', 'gag', 'ggt', 'ggc', 'gga', 'ggg']
amino_acids = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
>>> codon_table = dict(zip(codons, amino_acids))
Very nice Python code by Peter Collingridge.
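
The codon list follows a regular pattern (t, c, a, g at each position, with the last position changing fastest), so it can also be generated rather than typed. The sketch below uses itertools.product from the standard library for that, which is an alternative not shown on the slide, and then runs a few lookups as a sanity check.

```python
from itertools import product

# Every triplet over t, c, a, g, with the last position changing fastest.
codons = [''.join(triplet) for triplet in product('tcag', repeat=3)]
amino_acids = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
codon_table = dict(zip(codons, amino_acids))

print(codon_table['atg'])     # the start codon

# Codons whose "amino acid" is '*' are the stop codons.
stops = sorted(c for c, aa in codon_table.items() if aa == '*')
print(stops)
```

The keys are lowercase, so a sequence written in uppercase needs .lower() before lookup.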

Re-use Code vs Write New
A little break for a philosophical debate: when should you find and re-use code written by others, and when should you write your own?
In bioinformatics, many of the problems you will encounter with data have been faced by other people.
– A great deal of code has been written and shared in public repositories.
– Some of this code has been published and cited in the literature.
– Don't try to re-write BLAST (unless you really, really have to).
If you can't find code to do exactly what you want, should you adapt existing code or write your own?
– There are challenges to figuring out someone else's code.
– New code that depends on programs written by others is very fragile.
– There are challenges to validating your own code when using it to analyze and publish scientific data.
– There is value in building your own repository of code elements from scratch that work and fit together in a way that is intuitive for you.

Some Statistics in Python
NumPy has some basic statistics functions that work on arrays.
>>> import numpy as np
>>> squares = [x**2 for x in range(10) if (x**2) < 50]
>>> sq = np.array(squares)
>>> np.mean(sq)
17.5
>>> np.median(sq)
12.5
>>> np.std(sq)

Other NumPy functions
NumPy has:
– linear algebra
– trigonometry
– logarithms
– polynomials
– Fourier transforms
– random sampling
– permutations
– sorting
– distributions (normal, Poisson, hypergeometric, logistic, gamma, negative binomial, etc.)

SciPy
SciPy is an extension of NumPy that provides many more complex mathematical, statistical, and scientific data analysis functions.
>>> import antigravity

Summary
– Flow control (if/else) and operators
– For loops
– Recursion
– Reading and writing files (file I/O)
– Creating custom functions with def
– Dictionaries

Next Lecture: Biopython