Programming for Engineers in Python

Biopython

Classes
    class <classname>:
        statement_1
        ...
        statement_n
The methods of a class get the instance as the first parameter, traditionally named self.
The method __init__ is called upon object construction (if available).

Classes
Reminder: type = data representation + behavior.
Classes are user-defined types.
    class <classname>:
        statement_1
        ...
        statement_n
Objects of a class are called class instances.
A class body is like a mini-program: it can contain variables, function definitions, and even arbitrary commands.

Classes – Attributes and Methods
    class Vector2D:
        def __init__(self, x, y):
            # Instance attributes (each instance has its own copy)
            self.x, self.y = x, y

        # A method
        def size(self):
            return (self.x ** 2 + self.y ** 2) ** 0.5

Classes – Instantiate and Use
    >>> v = Vector2D(3, 4)   # Make instance.
    >>> v
    <__main__.Vector2D object at 0x A2828>
    >>> v.size()             # Call method on instance.
    5.0

Example – Multimap
A multimap is a dictionary with more than one value for each key.
We already needed it once or twice and used:
    >>> lst = d.get(key, [])
    >>> lst.append(value)
    >>> d[key] = lst
We will now write a new class that is a wrapper around a dict. The class will have methods that allow us to keep multiple values for each key.

Multimap: partial code
    class Multimap:
        def __init__(self):
            '''Create an empty Multimap'''
            self.inner = {}

        def get(self, key):
            '''Return list of values associated with key'''
            return self.inner.get(key, [])

        def put(self, key, value):
            '''Adds value to the list of values associated with key'''
            value_list = self.get(key)
            if value not in value_list:
                value_list.append(value)
                self.inner[key] = value_list

Multimap put_all and remove
        def put_all(self, key, values):
            for v in values:
                self.put(key, v)

        def remove(self, key, value):
            value_list = self.get(key)
            if value in value_list:
                value_list.remove(value)
                self.inner[key] = value_list
                return True
            return False

Multimap: partial code
        def __len__(self):
            '''Returns the number of keys in the map'''
            return len(self.inner)

        def __str__(self):
            '''Converts the map to a string'''
            return str(self.inner)

        def __cmp__(self, other):
            '''Compares the map with another map'''
            return cmp(self.inner, other.inner)

        def __contains__(self, key):
            '''Returns True if key exists in the map'''
            return key in self.inner

Multimap
Use case: a dictionary of countries and their cities:
    >>> m = Multimap()
    >>> m.put('Israel', 'Tel-Aviv')
    >>> m.put('Israel', 'Jerusalem')
    >>> m.put('France', 'Paris')
    >>> m.put_all('England', ('London', 'Manchester', 'Moscow'))
    >>> m.remove('England', 'Moscow')
    >>> print m.get('Israel')
    ['Tel-Aviv', 'Jerusalem']
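The Multimap pieces from the slides can be assembled into one runnable sketch. This is an illustration rather than the course's reference solution: using dict.setdefault inside put is a style choice not found in the slides, __cmp__ is left out because it only exists in Python 2, and print() is written in call form so the block runs under both Python 2.7 and Python 3.

```python
class Multimap(object):
    """A thin wrapper around a dict that keeps a list of values per key."""

    def __init__(self):
        self.inner = {}

    def get(self, key):
        # Return the stored list, or an empty list for an unknown key.
        return self.inner.get(key, [])

    def put(self, key, value):
        # setdefault creates the list on first use (assumed style choice).
        values = self.inner.setdefault(key, [])
        if value not in values:
            values.append(value)

    def put_all(self, key, values):
        for value in values:
            self.put(key, value)

    def remove(self, key, value):
        values = self.inner.get(key, [])
        if value in values:
            values.remove(value)
            return True
        return False

    def __len__(self):
        return len(self.inner)

    def __contains__(self, key):
        return key in self.inner

    def __str__(self):
        return str(self.inner)


m = Multimap()
m.put('Israel', 'Tel-Aviv')
m.put('Israel', 'Jerusalem')
m.put('France', 'Paris')
m.put_all('England', ('London', 'Manchester', 'Moscow'))
removed = m.remove('England', 'Moscow')
print(m.get('Israel'))
```

Because __contains__ and __len__ are defined, expressions such as 'France' in m and len(m) work on the wrapper just as they do on a plain dict.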

Biopython
An international association of developers of freely available Python tools for computational molecular biology.
Provides tools for:
- Parsing files (FASTA, ClustalW, GenBank, …)
- Interfacing with common software
- Operations on sequences
- Simple machine learning applications
- BLAST
- And many more

Installing Biopython
Go to http://biopython.org/wiki/Download
- Windows
- Unix
Select Python 2.7.
NumPy is required.

SeqIO
The standard sequence input/output interface for Biopython.
- Provides a simple, uniform interface to input and output assorted sequence file formats
- Deals with sequences as SeqRecord objects
- There is a sister interface, Bio.AlignIO, for working directly with sequence alignment files as Alignment objects

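Parsing a FASTA file
    # Parse a simple FASTA file
    from Bio import SeqIO
    for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
        print seq_record.id
        print repr(seq_record.seq)
        print len(seq_record)
Why repr and not str? repr gives a short, truncated representation that also shows the alphabet, whereas str returns the full sequence.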

GenBank files
    # GenBank files
    from Bio import SeqIO
    for seq_record in SeqIO.parse("ls_orchid.gbk", "genbank"):
        print seq_record
        # added to print just one record example
        break

GenBank files
    from Bio import SeqIO
    for seq_record in SeqIO.parse("ls_orchid.gbk", "genbank"):
        print seq_record.id
        print repr(seq_record.seq)
        print len(seq_record)

Sequence objects
- Support similar methods as standard strings
- Provide additional methods: translate, reverse complement
- Support different alphabets: AGTAGTTAAA can be DNA or protein

Sequences and alphabets
Bio.Alphabet.IUPAC provides basic definitions for proteins, DNA and RNA, and additionally provides the ability to extend and customize the basic definitions. For example:
- Adding ambiguous symbols
- Adding special new characters

Example – generic alphabet
    >>> from Bio.Seq import Seq
    >>> my_seq = Seq("AGTACACTGGT")
    >>> my_seq
    Seq('AGTACACTGGT', Alphabet())
    >>> my_seq.alphabet
    Alphabet()
Alphabet() is the non-specific (generic) alphabet.

Example – specific sequences
    >>> from Bio.Seq import Seq
    >>> from Bio.Alphabet import IUPAC
    >>> my_seq = Seq("AGTACACTGGT", IUPAC.unambiguous_dna)
    >>> my_seq
    Seq('AGTACACTGGT', IUPACUnambiguousDNA())
    >>> my_seq.alphabet
    IUPACUnambiguousDNA()

    >>> from Bio.Seq import Seq
    >>> from Bio.Alphabet import IUPAC
    >>> my_prot = Seq("AGTACACTGGT", IUPAC.protein)
    >>> my_prot
    Seq('AGTACACTGGT', IUPACProtein())
    >>> my_prot.alphabet
    IUPACProtein()

Sequences act like strings
Access elements:
    >>> from Bio.Seq import Seq
    >>> from Bio.Alphabet import IUPAC
    >>> my_seq = Seq("GATCG", IUPAC.unambiguous_dna)
    >>> print my_seq[0]    # first letter
    G
    >>> print my_seq[2]    # third letter
    T
    >>> print my_seq[-1]   # last letter
    G
Count without overlaps:
    >>> "AAAA".count("AA")
    2
    >>> Seq("AAAA").count("AA")
    2

Calculate GC content
    >>> from Bio.Seq import Seq
    >>> from Bio.Alphabet import IUPAC
    >>> from Bio.SeqUtils import GC
    >>> my_seq = Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPAC.unambiguous_dna)
    >>> GC(my_seq)
    46.875
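To see what the GC percentage means, it can be recomputed by hand on a plain string. The helper name gc_content is hypothetical, and this sketch only counts the letters G and C; it makes no attempt to mirror how Bio.SeqUtils treats ambiguity codes.

```python
def gc_content(seq):
    """Return the percentage of G and C letters in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / len(seq)


value = gc_content("GATCGATGGGCCTATATAGGATCGAAAATCGC")
print(value)
```

The example sequence has 15 G/C letters out of 32, which gives 46.875.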

Slicing
Simple slicing, and slicing with start, stop, stride:
    >>> from Bio.Seq import Seq
    >>> from Bio.Alphabet import IUPAC
    >>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
    >>> my_seq[4:12]
    Seq('GATGGGCC', IUPACUnambiguousDNA())
    >>> my_seq[0::3]
    Seq('GCTGTAGTAAG', IUPACUnambiguousDNA())
    >>> my_seq[1::3]
    Seq('AGGCATGCATC', IUPACUnambiguousDNA())
    >>> my_seq[2::3]
    Seq('TAGCTAAGAC', IUPACUnambiguousDNA())

Concatenation
Simple addition, as in Python. But the alphabets must be compatible:
    >>> from Bio.Alphabet import IUPAC
    >>> from Bio.Seq import Seq
    >>> protein_seq = Seq("EVRNAK", IUPAC.protein)
    >>> dna_seq = Seq("ACGT", IUPAC.unambiguous_dna)
    >>> protein_seq + dna_seq
    Traceback (most recent call last):
    …

Changing case
    >>> from Bio.Seq import Seq
    >>> from Bio.Alphabet import generic_dna
    >>> dna_seq = Seq("acgtACGT", generic_dna)
    >>> dna_seq
    Seq('acgtACGT', DNAAlphabet())
    >>> dna_seq.upper()
    Seq('ACGTACGT', DNAAlphabet())
    >>> dna_seq.lower()
    Seq('acgtacgt', DNAAlphabet())

Changing case
Case is important for matching:
    >>> "GTAC" in dna_seq
    False
    >>> "GTAC" in dna_seq.upper()
    True
IUPAC names are upper case:
    >>> from Bio.Seq import Seq
    >>> from Bio.Alphabet import IUPAC
    >>> dna_seq = Seq("ACGT", IUPAC.unambiguous_dna)
    >>> dna_seq
    Seq('ACGT', IUPACUnambiguousDNA())
    >>> dna_seq.lower()
    Seq('acgt', DNAAlphabet())

Reverse complement
    >>> from Bio.Seq import Seq
    >>> from Bio.Alphabet import IUPAC
    >>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
    >>> my_seq.complement()
    Seq('CTAGCTACCCGGATATATCCTAGCTTTTAGCG', IUPACUnambiguousDNA())
    >>> my_seq.reverse_complement()
    Seq('GCGATTTTCGATCCTATATAGGCCCATCGATC', IUPACUnambiguousDNA())
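What the two methods do can be sketched on plain strings. The function names below are hypothetical stand-ins, and the sketch assumes an unambiguous, upper-case DNA string (only A, C, G, T).

```python
# Watson-Crick pairing for unambiguous DNA.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(seq):
    """Replace every base by its pairing partner, keeping the order."""
    return "".join(COMPLEMENT[base] for base in seq)


def reverse_complement(seq):
    """Complement the sequence and read it in the opposite direction."""
    return complement(seq)[::-1]


dna = "GATCGATGGGCCTATATAGGATCGAAAATCGC"
print(complement(dna))
print(reverse_complement(dna))
```

Applying reverse_complement twice returns the original string, which is a quick sanity check for this kind of helper.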

Transcription
    >>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
    >>> template_dna = coding_dna.reverse_complement()
    >>> messenger_rna = coding_dna.transcribe()
    >>> messenger_rna
    Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())
As you can see, all this does is switch T → U and adjust the alphabet.

Translation
Simple example:
    >>> from Bio.Seq import Seq
    >>> from Bio.Alphabet import IUPAC
    >>> messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", IUPAC.unambiguous_rna)
    >>> messenger_rna
    Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())
    >>> messenger_rna.translate()
    Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
Each '*' in the result is a stop codon!

Translation from the DNA
    >>> from Bio.Seq import Seq
    >>> from Bio.Alphabet import IUPAC
    >>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
    >>> coding_dna
    Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
    >>> coding_dna.translate()
    Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))

Using different translation tables
In several cases we may want to use different translation tables.
Translation tables are given IDs in GenBank (standard = 1).
Vertebrate Mitochondrial is table 2.
More details in

Using different translation tables
    >>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
    >>> coding_dna.translate()
    Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
    >>> coding_dna.translate(table="Vertebrate Mitochondrial")
    Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))
    >>> coding_dna.translate(table=2)
    Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))

Translation tables in biopython

Translate up to the first stop in frame
    >>> coding_dna.translate()
    Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
    >>> coding_dna.translate(to_stop=True)
    Seq('MAIVMGR', IUPACProtein())
    >>> coding_dna.translate(table=2)
    Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))
    >>> coding_dna.translate(table=2, to_stop=True)
    Seq('MAIVMGRWKGAR', IUPACProtein())
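A translation table is just a mapping from codons to amino acids, so the behaviour above can be reproduced with a plain dictionary. This is a sketch and not Biopython's implementation: the names STANDARD_TABLE, VERTEBRATE_MITO_TABLE and translate are hypothetical, the standard table is built from the usual TCAG-ordered amino-acid string, and the mitochondrial table is obtained by overriding the four codons in which table 2 differs from table 1.

```python
BASES = "TCAG"
# Amino acids of the standard genetic code, codons enumerated in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

STANDARD_TABLE = {}
index = 0
for first in BASES:
    for second in BASES:
        for third in BASES:
            STANDARD_TABLE[first + second + third] = AMINO_ACIDS[index]
            index += 1

# Table 2 (vertebrate mitochondrial) differs from the standard code in 4 codons.
VERTEBRATE_MITO_TABLE = dict(STANDARD_TABLE)
VERTEBRATE_MITO_TABLE.update({"TGA": "W", "ATA": "M", "AGA": "*", "AGG": "*"})


def translate(dna, table=STANDARD_TABLE, to_stop=False):
    """Translate DNA codon by codon; optionally stop at the first stop codon."""
    protein = []
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        amino_acid = table[dna[i:i + 3]]
        if amino_acid == "*" and to_stop:
            break
        protein.append(amino_acid)
    return "".join(protein)


coding_dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(translate(coding_dna))
print(translate(coding_dna, to_stop=True))
print(translate(coding_dna, table=VERTEBRATE_MITO_TABLE))
```

The only difference between the two translations of this sequence comes from the codon TGA, which is a stop in the standard code and tryptophan (W) in the vertebrate mitochondrial code.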

Comparing sequences
Standard "==" comparison is done by comparing the references (!), hence:
    >>> seq1 = Seq("ACGT", IUPAC.unambiguous_dna)
    >>> seq2 = Seq("ACGT", IUPAC.unambiguous_dna)
    >>> seq1 == seq2
    Warning (from warnings module): …
    FutureWarning: In future comparing Seq objects will use string comparison (not object comparison). Incompatible alphabets will trigger a warning (not an exception)… please use str(seq1)==str(seq2) to make your code explicit and to avoid this warning.
    False
    >>> seq1 == seq1
    True
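The same effect can be reproduced with any class that does not define its own equality. PlainSeq below is a hypothetical stand-in, not a Biopython class; it only illustrates why two objects with identical letters can compare as different while their string forms compare as equal.

```python
class PlainSeq(object):
    """Hypothetical sequence type that does not define __eq__."""

    def __init__(self, data):
        self.data = data

    def __str__(self):
        return self.data


seq1 = PlainSeq("ACGT")
seq2 = PlainSeq("ACGT")

# Without __eq__, == falls back to comparing object identity.
same_object = (seq1 == seq2)
# Converting to str first compares the letters themselves.
same_letters = (str(seq1) == str(seq2))
print(same_object)
print(same_letters)
```

This is why the warning text recommends writing str(seq1) == str(seq2): the intent is explicit regardless of how the class defines ==.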

Mutable vs. Immutable
Like strings, standard Seq objects are immutable.
If you want a mutable object, create it in one of two ways:
- Use the tomutable() method
- Use the MutableSeq constructor:
    from Bio.Seq import MutableSeq
    mutable_seq = MutableSeq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna)

Unknown sequences example
In many biological cases we deal with unknown sequences:
    >>> from Bio.Seq import UnknownSeq
    >>> from Bio.Alphabet import IUPAC
    >>> unk_dna = UnknownSeq(20, alphabet=IUPAC.ambiguous_dna)
    >>> my_seq = Seq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna)
    >>> unk_dna + my_seq
    Seq('NNNNNNNNNNNNNNNNNNNNGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACAmbiguousDNA())

Read MSA
Use Bio.AlignIO.read(file, format)
- file: the file path
- format: supported formats include "stockholm", "fasta", "clustal", …
Use help(AlignIO) for details.

Example We want to parse this file from PFAM

Example
    from Bio import AlignIO
    alignment = AlignIO.read("PF05371.sth", "stockholm")
    print alignment

Alignment object example
    >>> from Bio import AlignIO
    >>> alignment = AlignIO.read("PF05371_seed.sth", "stockholm")
    >>> print alignment[1]
    ID: Q9T0Q8_BPIKE/1-52
    Name: Q9T0Q8_BPIKE
    Description: Q9T0Q8_BPIKE/1-52
    Number of features: 0
    /start=1
    /end=52
    /accession=Q9T0Q8.1
    Seq('AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA', SingleLetterAlphabet())

Alignment object example
    >>> print "Alignment length %i" % alignment.get_alignment_length()
    Alignment length 52
    >>> for record in alignment:
    ...     print "%s - %s" % (record.seq, record.id)
    ...
    AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSSKA - COATB_BPIKE/30-81
    AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA - Q9T0Q8_BPIKE/1-52
    DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSSKA - COATB_BPI22/32-83
    AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA - COATB_BPM13/24-72
    AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFASKA - COATB_BPZJ2/1-49
    AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA - Q9T0Q9_BPFD/1-49
    FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVSRA - COATB_BPIF1/22-73

Cross-references example
Did you notice in the raw file above that several of the sequences include database cross-references to the PDB and the associated known secondary structure?
    >>> for record in alignment:
    ...     if record.dbxrefs:
    ...         print record.id, record.dbxrefs
    ...
    COATB_BPIKE/30-81 ['PDB; 1ifl ; 1-52;']
    COATB_BPM13/24-72 ['PDB; 2cpb ; 1-49;', 'PDB; 2cps ; 1-49;']
    Q9T0Q9_BPFD/1-49 ['PDB; 1nh4 A; 1-49;']
    COATB_BPIF1/22-73 ['PDB; 1ifk ; 1-50;']

Comments
Remember that almost all MSA formats are supported.
When you have more than one MSA in your file, use AlignIO.parse().
A common example is PHYLIP's output:
    AlignIO.parse("resampled.phy", "phylip")
The result is an iterator over all the MSAs in the file.

Write alignment to file
    from Bio.Alphabet import generic_dna
    from Bio.Seq import Seq
    from Bio.SeqRecord import SeqRecord
    from Bio.Align import MultipleSeqAlignment
    align1 = MultipleSeqAlignment([
        SeqRecord(Seq("ACTGCTAGCTAG", generic_dna), id="Alpha"),
        SeqRecord(Seq("ACT-CTAGCTAG", generic_dna), id="Beta"),
        SeqRecord(Seq("ACTGCTAGDTAG", generic_dna), id="Gamma"),
    ])
    from Bio import AlignIO
    AlignIO.write(align1, "my_example.phy", "phylip")
AlignIO.write also accepts a list of alignments. A PHYLIP file holding three alignments (the first one is align1) looks like this:
     3 12
    Alpha      ACTGCTAGCT AG
    Beta       ACT-CTAGCT AG
    Gamma      ACTGCTAGDT AG
     3 9
    Delta      GTCAGC-AG
    Epislon    GACAGCTAG
    Zeta       GTCAGCTAG
     3 13
    Eta        ACTAGTACAG CTG
    Theta      ACTAGTACAG CT-
    Iota       -CTACTACAG GTG

Slicing
Alignments work like numpy matrices:
    >>> print alignment[2,6]
    T
    # You can pull out a single column as a string like this:
    >>> print alignment[:,6]
    TTT---T
    >>> print alignment[3:6,:6]
    SingleLetterAlphabet() alignment with 3 rows and 6 columns
    AEGDDP COATB_BPM13/24-72
    AEGDDP COATB_BPZJ2/1-49
    AEGDDP Q9T0Q9_BPFD/1-49
    >>> print alignment[:,:6]
    SingleLetterAlphabet() alignment with 7 rows and 6 columns
    AEPNAA COATB_BPIKE/30-81
    AEPNAA Q9T0Q8_BPIKE/1-52
    DGTSTA COATB_BPI22/32-83
    AEGDDP COATB_BPM13/24-72
    AEGDDP COATB_BPZJ2/1-49
    AEGDDP Q9T0Q9_BPFD/1-49
    FAADDA COATB_BPIF1/22-73

External applications
How do we call MSA algorithms on an unaligned set of sequences? Biopython provides wrappers. The idea:
- Create a command line object with the algorithm options
- Invoke the command (Python uses subprocesses)
The Bio.Align.Applications module:
    >>> import Bio.Align.Applications
    >>> dir(Bio.Align.Applications)
    ['ClustalwCommandline', 'DialignCommandline', 'MafftCommandline', 'MuscleCommandline', 'PrankCommandline', 'ProbconsCommandline', 'TCoffeeCommandline']

ClustalW example
First step: download ClustalW from ftp://ftp.ebi.ac.uk/pub/software/clustalw2/2.1/
Second step: install it.
Third step: look for the ClustalW exe files.
Now you can run ClustalW from your Python code.

Run example
    >>> import os
    >>> from Bio.Align.Applications import ClustalwCommandline
    >>> clustalw_exe = r"C:\Program Files\new clustal\clustalw2.exe"
    >>> clustalw_cline = ClustalwCommandline(clustalw_exe, infile="opuntia.fasta")
    >>> assert os.path.isfile(clustalw_exe), "Clustal W executable missing"
    >>> stdout, stderr = clustalw_cline()
The command line object is actually a function we can run!

ClustalW
    >>> from Bio import AlignIO
    >>> align = AlignIO.read("opuntia.aln", "clustal")
    >>> print align
    SingleLetterAlphabet() alignment with 7 rows and 906 columns
    TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAGA gi| |gb|AF |AF191
    TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAGA gi| |gb|AF |AF191
    TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAGA gi| |gb|AF |AF191
    TATACATAAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAGA gi| |gb|AF |AF191
    TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAGAGA gi| |gb|AF |AF191
    TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAGAGA gi| |gb|AF |AF191
    TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAGAGA gi| |gb|AF |AF191

ClustalW - tree
In case you are interested, the opuntia.dnd file ClustalW creates is just a standard Newick tree file, and Bio.Phylo can parse these:
    >>> from Bio import Phylo
    >>> tree = Phylo.read("opuntia.dnd", "newick")
    >>> Phylo.draw_ascii(tree)

Running BLAST over the internet
We use the function qblast() in the Bio.Blast.NCBIWWW module. It has three non-optional arguments:
- The BLAST program to use for the search, as a lower-case string: blastn, blastp, blastx, tblastn or tblastx.
- The database to search against. The options for this are available on the NCBI web pages.
- A string containing your query sequence. This can either be the sequence itself, the sequence in FASTA format, or an identifier like a GI number.

qblast additional parameters
qblast can receive other parameters, analogous to the parameters of the actual server. Important examples:
- format_type: "HTML", "Text", "ASN.1", or "XML". The default is "XML", as that is the format expected by the parser (see next examples).
- expect: sets the expectation or e-value threshold.

Step 1: call BLAST
    >>> from Bio.Blast import NCBIWWW
    # Option 1 - use a GI number
    >>> result_handle = NCBIWWW.qblast("blastn", "nt", " ")
    # Option 2 - read a FASTA file
    >>> fasta_string = open("m_cold.fasta").read()
    >>> result_handle = NCBIWWW.qblast("blastn", "nt", fasta_string)
    # Option 3 - parse the file into a SeqRecord object
    >>> from Bio import SeqIO
    >>> record = SeqIO.read(open("m_cold.fasta"), format="fasta")
    >>> result_handle = NCBIWWW.qblast("blastn", "nt", record.seq)

Step 2: parse the results
    >>> from Bio.Blast import NCBIXML
    >>> blast_record = NCBIXML.read(result_handle)
The result handle can be read only once! The blast_record object keeps the actual results.

Remarks
Biopython also supports reading BLAST results from HTML and text files, but these parsers are not stable and sometimes fail because the servers change the format. XML is stable.
You can save XML files:
- On the server
- From result_handle objects (next slide)

Save results as XML
The result handle can be read only once!
    >>> save_file = open("my_blast.xml", "w")
    >>> save_file.write(result_handle.read())
    >>> save_file.close()
    >>> result_handle.close()

BLAST records
A BLAST Record contains everything you might ever want to extract from the BLAST output. Example:
    >>> E_VALUE_THRESH = 0.04
    >>> for alignment in blast_record.alignments:
    ...     for hsp in alignment.hsps:
    ...         if hsp.expect < E_VALUE_THRESH:
    ...             print '****Alignment****'
    ...             print 'sequence:', alignment.title
    ...             print 'length:', alignment.length
    ...             print 'e value:', hsp.expect
    ...             print hsp.query[0:75] + ''
    ...             print hsp.match[0:75] + ''
    ...             print hsp.sbjct[0:75] + ''

BLAST records

More functions
We cover here only very basic functions. To get more details use:
    >>> import Bio.Blast.Record
    >>> help(Bio.Blast.Record)
    Help on module Bio.Blast.Record in Bio.Blast:
    NAME
        Bio.Blast.Record - Record classes to hold BLAST output.
    FILE
        d:\python27\lib\site-packages\bio\blast\record.py
    DESCRIPTION
        Classes:
        Blast              Holds all the information from a blast search.
        PSIBlast           Holds all the information from a psi-blast search.
        Header             Holds information from the header.
        Description        Holds information about one hit description.
        Alignment          Holds information about one alignment hit.
        HSP                Holds information about one HSP.
        MultipleAlignment  Holds information about a multiple alignment.
        DatabaseReport     Holds information from the database report.
        Parameters         Holds information from the parameters.

Accessing NCBI’s Entrez Databases

Bio.Entrez
Module for programmatic access to Entrez.
- Example: search PubMed or download GenBank records from within a Python script
- Makes use of the Entrez Programming Utilities
- Makes sure that the correct URL is used for the queries, and that not more than one request is made every three seconds, as required by NCBI
Note! If the NCBI finds you are abusing their systems, they can and will ban your access!

ESearch example
    >>> from Bio import Entrez
    >>> handle = Entrez.esearch(db="nucleotide", term="Cypripedioideae[Orgn] AND matK[Gene]")
    >>> record = Entrez.read(handle)
    # Each of the IDs is a GenBank identifier.
    >>> print (record["IdList"])
    [' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']

Explanation
Entrez.read transforms the actual results (retrieved as XML) into a usable object of type Bio.Entrez.Parser.DictionaryElement:
    >>> record
    {u'Count': '158', u'RetMax': '20',
     u'IdList': [' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' '],
     u'TranslationStack': [{u'Count': '2482', u'Field': 'Organism', u'Term': '"Cypripedioideae"[Organism]', u'Explode': 'Y'},
                           {u'Count': '71514', u'Field': 'Gene', u'Term': 'matK[Gene]', u'Explode': 'N'}, 'AND'],
     u'TranslationSet': [{u'To': '"Cypripedioideae"[Organism]', u'From': 'Cypripedioideae[Orgn]'}],
     u'RetStart': '0',
     u'QueryTranslation': '"Cypripedioideae"[Organism] AND matK[Gene]'}

Database options 'pubmed', 'protein', 'nucleotide', 'nuccore', 'nucgss', 'nucest', 'structure', 'genome', 'books', 'cancerchromosomes', 'cdd', 'gap', 'domains', 'gene', 'genomeprj', 'gensat', 'geo', 'gds', 'homologene', 'journals', 'mesh', 'ncbisearch', 'nlmcatalog', 'omia', 'omim', 'pmc', 'popset', 'probe', 'proteinclusters', 'pcassay', 'pccompound', 'pcsubstance', 'snp', 'taxonomy', 'toolkit', 'unigene', 'unists'

Download a full record
    >>> from Bio import Entrez
    # Always tell NCBI who you are
    >>> Entrez.email = "your.name@example.com"   # placeholder: use your own address
    # rettype: get a GenBank record
    >>> handle = Entrez.efetch(db="nucleotide", id=" ", rettype="gb", retmode="text")
    >>> print handle.read()

Change ‘gb’ to ‘fasta’

Read directly into a SeqRecord with SeqIO
    >>> from Bio import Entrez, SeqIO
    >>> handle = Entrez.efetch(db="nucleotide", id=" ", rettype="gb", retmode="text")
    >>> record = SeqIO.read(handle, "genbank")
    >>> handle.close()
    >>> print record
    ID: EU
    Name: EU
    Description: Selenipedium aequinoctiale maturase K (matK) gene, partial cds; chloroplast.
    Number of features:
    Seq('ATTTTTTACGAACCTGTGGAAATTTTTGGTTATGACAATAAATCTAGTTTAGTA...GAA', IUPACAmbiguousDNA())

Download directly from a URL
Suppose we know what the database URLs look like.
Example: GEO (Gene Expression Omnibus)
    " c=GSE6609&format=file"

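Use the urllib2 module
    >>> import urllib2
    >>> u = urllib2.urlopen(url)   # url: the GEO address from the previous slide
    >>> localFile = open('gse6609_raw.tar', 'wb')   # binary mode, since this is a tar archive
    >>> for x in u:
    ...     localFile.write(x)
    ...
    >>> localFile.close()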

More details
We covered only a few concepts.
For more details on Biopython options, including dealing with specialized parsers, see ml#sec:parsing-blast (Chapter 9).
Look at the urllib2 manual.

Sequence Motifs

Gene expression regulation
- Transcription is regulated mainly by transcription factors (TFs): proteins that bind to DNA subsequences, called binding sites (BSs)
- TFBSs are located mainly in the gene's promoter, the DNA sequence upstream of the gene's transcription start site (TSS)
- TFs can promote or repress transcription
- Other regulators: micro-RNAs (miRNAs)

Ab-initio motif discovery
You are given a set of strings.
You want to find a motif that is significantly represented in the strings.
For example: a TF/miRNA binding site.

TFBS models
The BSs of a particular TF share a common pattern, or motif, which is often modeled using:
- A degenerate string, e.g. GGWATB (W = {A,T}, B = {C,G,T})
- A PWM (position weight matrix)
[Figure: example promoter sequences scanned with a 6-position PWM over A, C, G, T; cutoff = 0.009]
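Both models can be tried out on plain strings. In the sketch below the degenerate string GGWATB comes from the slide, but everything else is an assumption made for illustration: the helper names, the 3-column PWM values and the cutoff of 0.1 are hypothetical (the slide's own 6-column matrix is not recoverable), and a site's score is taken to be the product of its per-position probabilities.

```python
import re

# IUPAC codes needed for the slide's example pattern GGWATB.
IUPAC_CLASSES = {"A": "A", "C": "C", "G": "G", "T": "T",
                 "W": "[AT]", "B": "[CGT]"}


def degenerate_to_regex(pattern):
    """Turn a degenerate string into a compiled regular expression."""
    return re.compile("".join(IUPAC_CLASSES[ch] for ch in pattern))


def pwm_score(site, pwm):
    """Product of the per-position probabilities of the letters in site."""
    score = 1.0
    for position, base in enumerate(site):
        score *= pwm[position][base]
    return score


def scan(sequence, pwm, cutoff):
    """Return (start, site, score) for every window scoring at least cutoff."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        site = sequence[start:start + width]
        score = pwm_score(site, pwm)
        if score >= cutoff:
            hits.append((start, site, score))
    return hits


match = degenerate_to_regex("GGWATB").search("ATCGGAATTCTGCAG")

# Hypothetical PWM favouring the site TAC (each column sums to 1).
toy_pwm = [
    {"A": 0.125, "C": 0.125, "G": 0.25, "T": 0.5},
    {"A": 0.5, "C": 0.125, "G": 0.125, "T": 0.25},
    {"A": 0.125, "C": 0.5, "G": 0.25, "T": 0.125},
]
hits = scan("GATTACA", toy_pwm, cutoff=0.1)
print(match.group(), match.start())
print(hits)
```

The degenerate string gives a yes/no answer for each window, while the PWM assigns every window a score, so the cutoff decides how permissive the model is.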

Motif discovery: The typical two-step pipeline
Step 1: obtain a co-regulated gene set, for example by clustering gene expression microarrays (Cluster I, Cluster II, Cluster III), by location analysis (ChIP-chip, …), or from a functional group (e.g., GO term).
Step 2: run motif discovery on the promoter/3'UTR sequences of the gene set.

Motif discovery: Goals and challenges
Goal: reverse-engineer the transcriptional regulatory network.
Challenges:
- BSs are short and degenerate (non-specific)
- Promoters are long and complex (hard to model)
- Search space is huge (motif and sequence)
- Data is noisy
- What to look for? (enriched? localized? conserved?)
The problem is still considered very difficult despite extensive research.

Biopython motif objects
    from Bio import motifs
    from Bio.Seq import Seq
    instances = [Seq("TACAA"), Seq("TACGC"), Seq("TACAC"), Seq("TACCC"),
                 Seq("AACCC"), Seq("AATGC"), Seq("AATGC")]
    m = motifs.create(instances)
    print m
    TACAA
    TACGC
    TACAC
    TACCC
    AACCC
    AATGC
    AATGC

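    >>> print m.counts
            0      1      2      3      4
    A:   3.00   7.00   0.00   2.00   1.00
    C:   0.00   0.00   5.00   2.00   6.00
    G:   0.00   0.00   0.00   3.00   0.00
    T:   4.00   0.00   2.00   0.00   0.00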

    >>> m.consensus
    Seq('TACGC', IUPACUnambiguousDNA())
    # The anticonsensus sequence, corresponding to the smallest values in the columns of the .counts matrix:
    >>> m.anticonsensus
    Seq('GGGTG', IUPACUnambiguousDNA())
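The counts matrix and the consensus can be derived by hand from the seven instances, which shows what the motif object stores. The function names below are hypothetical, and the tie-breaking rule (first letter in alphabetical order wins) is an assumption of this sketch rather than a statement about Biopython.

```python
def count_matrix(instances, alphabet="ACGT"):
    """Count how often each letter occurs at each motif position."""
    length = len(instances[0])
    counts = dict((letter, [0] * length) for letter in alphabet)
    for instance in instances:
        for position, letter in enumerate(instance):
            counts[letter][position] += 1
    return counts


def consensus(counts):
    """Most frequent letter per column (ties broken alphabetically)."""
    letters = sorted(counts)
    length = len(counts[letters[0]])
    result = []
    for position in range(length):
        best = max(letters, key=lambda letter: counts[letter][position])
        result.append(best)
    return "".join(result)


instances = ["TACAA", "TACGC", "TACAC", "TACCC", "AACCC", "AATGC", "AATGC"]
counts = count_matrix(instances)
for letter in "ACGT":
    print(letter, counts[letter])
print(consensus(counts))
```

Every column of the matrix sums to the number of instances (7 here), which is a quick check that no instance was skipped.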

Motif database (http://jaspar.genereg.net/)

Read records
    from Bio import motifs
    arnt = motifs.read(open("Arnt.sites"), "sites")
    print arnt.counts
        0 1 2 3 4 5
    A:
    C:
    G:
    T:

MEME MEME is a tool for discovering motifs in a group of related DNA or protein sequences. It takes as input a group of DNA or protein sequences and outputs as many motifs as requested. Therefore, in contrast to JASPAR files, MEME output files typically contain multiple motifs.

Assumptions
- The number of motifs is known (assume this number is 1)
- The size of the motif is known (biologically, we have estimates of the size for TFs and miRNAs)
Missing information:
- PWM of the motif
- PWM of the background
- Motif locations

Assumptions
Given a sequence X and a PWM Y of the same length, we can calculate P(X|Y), assuming independence of the motif positions.
Given a PWM, we can now calculate, for each position K in each sequence J, the probability that the motif starts at K in sequence J.
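Under the independence assumption, P(X|Y) is simply the product of the PWM entries picked out by the letters of X. The sketch below uses hypothetical names and a made-up 2-column PWM; the start probabilities are obtained by normalizing the window scores, which is only exact under the additional assumption of a uniform background (the background factor is then the same for every start position and cancels).

```python
def sequence_probability(x, pwm):
    """P(X|Y): product over positions of the probability of each letter."""
    probability = 1.0
    for position, letter in enumerate(x):
        probability *= pwm[position][letter]
    return probability


def start_probabilities(sequence, pwm):
    """Probability that the motif starts at each position K of the sequence.

    Assumes exactly one motif occurrence and a uniform background.
    """
    width = len(pwm)
    scores = [sequence_probability(sequence[k:k + width], pwm)
              for k in range(len(sequence) - width + 1)]
    total = sum(scores)
    return [score / total for score in scores]


# Hypothetical PWM Y with two columns (each column sums to 1).
pwm = [
    {"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125},
    {"A": 0.125, "C": 0.5, "G": 0.25, "T": 0.125},
]
p_ac = sequence_probability("AC", pwm)
probs = start_probabilities("GACT", pwm)
print(p_ac)
print(probs)
```

For the sequence GACT the window AC at position 1 receives by far the largest share of the probability, as expected for a PWM that prefers A followed by C.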

Expectation Maximization (EM) Algorithm
Start with an initial guess for the PWMs. The EM algorithm consists of two steps, which are repeated consecutively.
Step 1, the expectation step: estimate the probability of finding the site at any position in each of the sequences. These probabilities are used to provide new information on the expected base or aa distribution for each column in the site.
Step 2, the maximization step: the new counts of bases or aa for each position in the site, found in step 1, are substituted for the previous set.

[Figure: sequences drawn as rows of O's, with the motif site in each marked XXXX]
Columns defined by a preliminary alignment of the sequences provide initial estimates of the frequencies of aa in each motif column. Columns not in the motif provide background frequencies.
[Table: frequencies per base (rows G, C, A, T, Total) in the columns Background, Site column 1, Site column 2, … Surviving values: G 0.27, 0.4, 0.1; C 0.25; A 0.2; T 0.23, 0.7; Total 1.00]

[Figure: the motif window XXXX slid along sequence 1, shown at position A (the start of the sequence) and at position B (one position to the right)]
Use the previous estimates of aa or nucleotide frequencies for each column in the motif to calculate the probability of the motif in this position, and multiply by the background frequencies in the remaining positions.
The resulting score gives the likelihood that the motif matches positions A, B or other in seq 1. Repeat for all other positions and find the most likely location. Then repeat for the remaining sequences.

EM Algorithm 2nd optimisation step: calculations
The site probabilities for each sequence calculated in the 1st step are then used to create a new table of expected base counts for each of the site positions, using the site probabilities as weights.
Suppose that P(site 1 in seq 1) = P_{site1,seq1} / (P_{site1,seq1} + P_{site2,seq1} + … + P_{site78,seq1}) = 0.01 and P(site 2 in seq 1) = 0.02. Then these values are added to the previous table as shown in the table below. This procedure is repeated for every other possible first column in seq 1, and then the process continues for all other sequences, resulting in a new version of the table. The expectation and maximization steps are repeated until the estimates of base frequencies do not change.
[Table: weighted base counts per column (Background, Site column 1, Site column 2, …). Surviving entries: G 0.4 + …; A 0.1 + …; T 0.2 + …; Total/weighted 1.00]
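The two steps can be put together in a compact sketch for the "one site per sequence" case. This is a teaching illustration under stated assumptions, not MEME itself: the function names, the toy sequences, the pseudocount of 0.1, the fixed uniform background and the fixed number of 20 iterations (instead of a convergence test) are all choices made here for brevity.

```python
BASES = "ACGT"


def window_likelihood(seq, start, pwm, background):
    """Likelihood of seq if the motif starts at `start`:
    PWM columns inside the site, background frequencies elsewhere."""
    width = len(pwm)
    likelihood = 1.0
    for i, base in enumerate(seq):
        if start <= i < start + width:
            likelihood *= pwm[i - start][base]
        else:
            likelihood *= background[base]
    return likelihood


def expectation(seqs, pwm, background):
    """Step 1: for each sequence, the probability of each start position."""
    width = len(pwm)
    site_probs = []
    for seq in seqs:
        scores = [window_likelihood(seq, k, pwm, background)
                  for k in range(len(seq) - width + 1)]
        total = sum(scores)
        site_probs.append([score / total for score in scores])
    return site_probs


def maximization(seqs, site_probs, width, pseudocount=0.1):
    """Step 2: rebuild the PWM from base counts weighted by site probability."""
    counts = [dict((b, pseudocount) for b in BASES) for _ in range(width)]
    for seq, probs in zip(seqs, site_probs):
        for start, weight in enumerate(probs):
            for offset in range(width):
                counts[offset][seq[start + offset]] += weight
    pwm = []
    for column in counts:
        total = sum(column.values())
        pwm.append(dict((b, column[b] / total) for b in BASES))
    return pwm


seqs = ["GGTACGTT", "ATACGCAA", "CCCTACGA"]   # toy data
width = 4
background = dict((b, 0.25) for b in BASES)   # kept fixed in this sketch
# Initial guess: mildly favour the first window of the first sequence.
pwm = [dict((b, 0.4 if b == seqs[0][j] else 0.2) for b in BASES)
       for j in range(width)]

for _ in range(20):
    site_probs = expectation(seqs, pwm, background)
    pwm = maximization(seqs, site_probs, width)

for probs in site_probs:
    print([round(p, 3) for p in probs])
```

As with any EM procedure, the result depends on the initial guess, so in practice the algorithm is usually restarted from several different starting PWMs.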

Run MEME (http://meme.nbcr.net/meme/cgi-bin/meme.cgi)

Results

Parse results
    >>> handle = open("meme.dna.oops.txt")
    >>> record = motifs.parse(handle, "meme")
    >>> handle.close()
    >>> len(record)
    2
    >>> motif = record[0]
    >>> print motif.consensus
    TTCACATGCCGC
    >>> print motif.degenerate_consensus
    TTCACATGSCNC

Motif attributes
    >>> motif.num_occurrences
    7
    >>> motif.length
    12
    >>> evalue = motif.evalue
    >>> print "%3.1g" % evalue
    0.2
    >>> motif.name
    'Motif 1'

Where the motif was found
    >>> motif = record['Motif 1']
    # Each motif has an attribute .instances with the sequence instances in which
    # the motif was found, providing some information on each instance
    >>> len(motif.instances)
    7
    >>> motif.instances[0]
    Instance('TTCACATGCCGC', IUPACUnambiguousDNA())
    >>> motif.instances[0].start
    620
    >>> motif.instances[0].strand
    '-'
    >>> motif.instances[0].length
    12
    >>> pvalue = motif.instances[0].pvalue
    >>> print "%5.3g" % pvalue
    1.85e-08

Amadeus
Advanced algorithms improve upon MEME.
- An algorithm for motif finding
- Appears to be one of the top algorithms in many tests
- Java-based tool
- Easy-to-use GUI
- Supports analysis of TFs and miRNAs
- Developed here at TAU

Amadeus
A Motif Algorithm for Detecting Enrichment in mUltiple Species
Supports diverse motif discovery tasks:
- Finding over-represented motifs in one or more given sets of genes
- Identifying motifs with global spatial features given only the genomic sequences
- Simultaneous inference of motifs and their associated expression profiles given genome-wide expression datasets
How?
- A general pipeline architecture for enumerating motifs
- Different statistical scoring schemes of motifs for different motif discovery tasks

Input: ~350 genes expressed in the human G2+M cell-cycle phases [Whitfield et al. ’02]
CHR Pairs analysis NF-Y (CCAAT-box)

Clustering analysis

Clustering - reminder
Cluster analysis is the grouping of items into clusters based on the similarity of the items to each other.
The Bio.Cluster module provides:
- K-means
- SOM
- Hierarchical clustering
- PCA

K-means clustering MacQueen, 65
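Input: a set of observations (x1, x2, …, xn). For example, each observation is a gene, and x is its vector of expression values.
Goal: partition the observations into K clusters S = {S1, S2, …, Sk}.
Objective function: $\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$, where $\mu_i$ is the mean of the observations in $S_i$.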

K-means clustering MacQueen, 65
Initialize an arbitrary partition P into k clusters C1, …, Ck.
For cluster Cj and element i ∉ Cj, let EP(i, Cj) = the cost of the solution if i is moved to cluster Cj.
Pick the move (r, Cs) with the minimum EP(r, Cs), and move r to Cs if the new partition is better.
Repeat until no improvement is possible.
Requires knowledge of k.

K-means variations
Compute a centroid cp for each cluster Cp, e.g., gravity center = average vector.
Solution cost: $\sum_{\text{clusters } p} \sum_{i \in \text{cluster } p} d(v_i, c_p)$
Parallel version: move each element to the cluster with the closest centroid, simultaneously.
Sequential version: one element at a time.
"Moving centers" approach.
Objective = homogeneity only (k fixed).
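The parallel version can be sketched in a few lines of plain Python. The function names and the toy points are hypothetical, squared Euclidean distance is assumed for d, and the initial centroids are simply the first k points so that the run is deterministic (real implementations usually start from a random partition).

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100):
    """Parallel k-means: reassign all points, then recompute all centroids."""
    centroids = [list(p) for p in points[:k]]   # deterministic initial guess
    assignment = None
    for _ in range(max_iter):
        # Move every point to the cluster with the closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break                               # no element moved: converged
        assignment = new_assignment
        # Recompute each centroid as the average vector of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / float(len(members))
                                for col in zip(*members)]
    cost = sum(squared_distance(p, centroids[a])
               for p, a in zip(points, assignment))
    return assignment, centroids, cost


points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
assignment, centroids, cost = kmeans(points, k=2)
print(assignment)
print(centroids)
```

On these six points the first pass puts (1.5, 2.0) in the wrong cluster because of the poor initial centroids; the second pass corrects it, which illustrates the "moving centers" idea.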

The centers of the clusters that were deserted should move too – this is not shown.

Data representation
The data to be clustered are represented by an n × m Numerical Python array data. Within the context of gene expression data clustering, typically the rows correspond to different genes whereas the columns correspond to different experimental conditions. The clustering algorithms in Bio.Cluster can be applied both to rows (genes) and to columns (experiments).

Distance/Similarity functions
- 'e': Euclidean distance
- 'c': Pearson correlation coefficient
- 'a': Absolute value of the Pearson correlation coefficient
- 'u': Cosine of the angle between two data vectors
- 'x': Absolute uncentered Pearson correlation
- 's': Spearman's rank correlation

Calculating distance matrices
    >>> from Bio.Cluster import distancematrix
    >>> matrix = distancematrix(data)
data is required. Additional options:
- transpose (default: 0): determines if the distances between the rows of data are to be calculated (transpose==0), or between the columns of data (transpose==1)
- dist (default: 'e', Euclidean distance)

Distancematrix
To save space, Biopython keeps only the lower/upper triangle of the matrix.
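The triangular storage idea can be illustrated without Bio.Cluster. In this sketch row i holds only the distances to rows 0..i-1, so the diagonal and the mirrored half are never stored. The function names are hypothetical, and plain Euclidean distance is used here, which may not match the exact scaling Bio.Cluster applies for dist='e'.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def lower_triangle_distances(data):
    """Row i contains the distances from item i to items 0 .. i-1."""
    return [[euclidean(data[i], data[j]) for j in range(i)]
            for i in range(len(data))]


def lookup(matrix, i, j):
    """Read distance (i, j) from the triangular storage."""
    if i == j:
        return 0.0
    if i < j:
        i, j = j, i        # the matrix is symmetric
    return matrix[i][j]


data = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
matrix = lower_triangle_distances(data)
print(matrix)
print(lookup(matrix, 0, 2))
```

For n items this stores n(n-1)/2 numbers instead of n², which roughly halves the memory needed for large data sets.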

Partitioning algorithms
Algorithms that receive the number of clusters K as an argument:
- K-means
- K-medians
Often referred to as EM variations.

Analysis example

Analysis example
    # Read the data
    import csv
    file = open('ge_data_example.txt', 'rb')
    data = csv.reader(file, delimiter='\t')
    table = [row for row in data]
    >>> len(table)
    100
    >>> table[1][1]
    '9.412'
    >>> table[0][0]
    'sample'
    >>> len(table[1])
    17

Analysis example
    # Transform the data to a numpy matrix
    # (skip the header row and the first, label column)
    from numpy import *
    mat = matrix([row[1:] for row in table[1:]], dtype='float')
    print len(mat)
    # Create the distance matrix
    from Bio.Cluster import distancematrix
    dist_matrix = distancematrix(mat)
    # Cluster
    from Bio.Cluster import kcluster
    clusterid, error, nfound = kcluster(mat)

Analysis example
    # Cluster
    from Bio.Cluster import kcluster
    clusterid, error, nfound = kcluster(mat)
- clusterid: array with cluster assignments
- error: the within-cluster sum of distances
- nfound: the number of times the returned solution was found

Analysis example >>> clusterid array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]) >>> error >>> nfound 1

Kcluster: other options
- nclusters (default: 2): the number of clusters k
- transpose (default: 0): determines if rows (transpose is 0) or columns (transpose is 1) are to be clustered
- npass (default: 1): the number of times the k-means/k-medians clustering algorithm is performed
- method (default: 'a'): describes how the center of a cluster is found: method=='a' is the arithmetic mean (k-means clustering); method=='m' is the median (k-medians clustering)
- dist (default: 'e', Euclidean distance)
- initialid (default: None): specifies the initial clustering to be used for the algorithm

Hierarchical clustering
    from Bio.Cluster import treecluster
    tree1 = treecluster(mat)
    # Can be applied to a precalculated distance matrix
    tree2 = treecluster(distancematrix=dist_matrix)
    # Get the cluster assignments
    clusterid = tree1.cut(3)

Hierarchical clustering using SciPy
Better visualizations!
    import numpy
    import pylab
    import scipy.cluster.hierarchy as sch
    # Create a distance matrix
    x = numpy.asarray(mat)
    D = numpy.zeros([len(x), len(x)])
    for i in range(len(x)):
        for j in range(len(x)):
            D[i, j] = sum(abs(x[i] - x[j]))

    # Compute and plot first dendrogram.
    fig = pylab.figure(figsize=(8, 8))
    # Add an axes at position rect [left, bottom, width, height], where all
    # quantities are in fractions of figure width and height.
    ax1 = fig.add_axes([0.09, 0.1, 0.2, 0.6])
    # Clustering analysis
    Y = sch.linkage(D, method='centroid')
    Z1 = sch.dendrogram(Y, orientation='right')
    ax1.set_xticks([])
    ax1.set_yticks([])

    # Plot distance matrix.
    axmatrix = fig.add_axes([0.3, 0.1, 0.6, 0.6])
    idx1 = Z1['leaves']
    D = D[idx1, :]
    im = axmatrix.matshow(D, aspect='auto', origin='lower', cmap=pylab.cm.YlGnBu)
    axmatrix.set_xticks([])
    axmatrix.set_yticks([])

    # Plot colorbar.
    axcolor = fig.add_axes([0.91, 0.1, 0.02, 0.6])
    pylab.colorbar(im, cax=axcolor)
    fig.show()

Phylogenetic trees

Remember the Newick format?
Simple example without branch length (((A,B),(C,D)),(E,F,G))
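To see how little structure the format carries, the example string can be parsed into nested lists with a small stack-based loop. This sketch is an assumption-laden illustration, not Bio.Phylo's parser: the function names are hypothetical, and it only handles leaf names, commas and parentheses (no branch lengths, no labels on internal nodes, no quoted names).

```python
def parse_newick(text):
    """Parse a branch-length-free Newick string into nested lists of names."""
    stack = [[]]          # stack[-1] is the clade currently being filled
    name = ""
    for ch in text.strip().rstrip(";"):
        if ch == "(":
            stack.append([])              # open a new clade
        elif ch in ",)":
            if name:                      # a leaf name just ended
                stack[-1].append(name)
                name = ""
            if ch == ")":                 # close the clade, attach to parent
                clade = stack.pop()
                stack[-1].append(clade)
        elif not ch.isspace():
            name += ch
    if name:
        stack[-1].append(name)
    return stack[0][0]


def leaves(clade):
    """List the leaf names of a parsed tree from left to right."""
    if not isinstance(clade, list):
        return [clade]
    result = []
    for child in clade:
        result.extend(leaves(child))
    return result


tree = parse_newick("(((A,B),(C,D)),(E,F,G))")
print(tree)
print(leaves(tree))
```

Each pair of parentheses becomes one inner list, so the nesting depth of the lists mirrors the depth of the clades in the tree.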

Visualizing trees
    >>> from Bio import Phylo
    >>> tree = Phylo.read("simple.dnd", "newick")
    >>> print tree
    Tree(weight=1.0, rooted=False)
        Clade(branch_length=1.0)
            Clade(branch_length=1.0)
                Clade(branch_length=1.0)
                    Clade(branch_length=1.0, name='A')
                    Clade(branch_length=1.0, name='B')
                Clade(branch_length=1.0)
                    Clade(branch_length=1.0, name='C')
                    Clade(branch_length=1.0, name='D')
            Clade(branch_length=1.0)
                Clade(branch_length=1.0, name='E')
                Clade(branch_length=1.0, name='F')
                Clade(branch_length=1.0, name='G')

Visualizing trees

Use matplotlib
    >>> import matplotlib
    >>> tree.rooted = True
    >>> Phylo.draw(tree)

Phylo IO
Phylo.read() reads a file containing exactly one tree.
If you have many trees, loop over the object returned by Phylo.parse().
Write to file using Phylo.write(treeObj, file, format).
Popular formats: "newick", "phyloxml".
Convert tree formats using Phylo.convert:
    Phylo.convert("tree1.xml", "phyloxml", "tree1.dnd", "newick")
