Published by Vernon Webster. Modified over 8 years ago.
1
MARC: Developing Bioinformatics Programs
Alex Ropelewski, PSC-NRBSC
Bienvenido Vélez, UPR Mayaguez
Essential BioPython: Manipulating Sequences with Seq
2
Python Classes
- Specify a common template for all objects in the class
- Declare three types of components:
  - constructors: generate new objects
  - properties: hold data about the object
  - methods: perform operations on objects
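The three kinds of components can be seen in a small class. This is only an illustrative sketch: the class name DnaRecord and its members are hypothetical and are not part of Biopython.

```python
class DnaRecord:
    """Hypothetical example class: a named DNA sequence."""

    def __init__(self, name, sequence):
        # Constructor: runs when a new object is created.
        self.name = name
        self.sequence = sequence.upper()

    @property
    def length(self):
        # Property: data about the object, read like an attribute.
        return len(self.sequence)

    def gc_fraction(self):
        # Method: performs an operation on the object.
        gc = self.sequence.count("G") + self.sequence.count("C")
        return gc / self.length if self.length else 0.0


rec = DnaRecord("demo", "atgcgc")
print(rec.name, rec.length, rec.gc_fraction())
```

Every DnaRecord object shares this template but holds its own name and sequence.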
3
Seq Objects
- Can be used to represent DNA and protein sequences
- Provide methods for carrying out basic operations on sequences:
  - finding patterns in sequences
  - reversing, complementing, and translating sequences
- To create a Seq object one provides:
  - a string object representing the DNA or protein sequence
  - an alphabet object specifying the type of sequence (Biopython 1.78 and later removed Bio.Alphabet; in those releases Seq takes only the string)
4
Creating a Seq object

    >>> from Bio.Seq import *
    >>> from Bio.Alphabet import *
    >>> pla2str='''CAAGAAGCCATACCACCATCCCATCCAAGAGAGCTGACAGCATGAAGGTCCTCCTGTTGCTAGC
    ... AGTTGTGATCATGGCCTTTGGCTCAATTCAGGTCCAGGGGAGCCTTCTGGAGTTTGGGCAAATG
    ... ATTCTGTTTAAGACAGGAAAGAGAGCTGATGTTAGCTATGGCTTCTACGGTTGCCATTGTGGTG
    ... TGGGTGGCAGAGGATCCCCCAAGGATGCCACAGATTGGTGCTGTGTGACTCATGACTGTTGTTA
    ... CAACCGTCTGGAGAAACGTGGATGTGGCACAAAGTTTCTGACCTACAAGTTCTCCTACCGAGGG
    ... GGCCAAATCTCCTGCTCTACAAACCAGGACTCCTGCCGGAAACAGCTGTGCCAGTGCGATAAAG
    ... CTGCCGCTGAATGTTTTGCCCGGAACAAGAAAAGCTACAGTTTAAAGTACCAGTTCTACCCCAA
    ... CAAGTTTTGCAAAGGGAAGACGCCCAGTTGCTGAAAGAGACATCTTCGGAAACATCCAGACATC
    ... CTCTAACACCTCTCCTAGCCCAACCAAGTTCCCCAGTGATCAAGAAAACACCCCTCTCCAACCC
    ... TAGAAGCAGGCGGGCCCTTCTGTCTTCACCCAGAAGGAGCCGCTGAAGCCTGATCTTTCCCCAA
    ... CACTCCACAGCCTTGGATCCGCCCACTTTTCCCTTGGCATCCAACTTCCTGCTGCGTAGTACCT
    ... AAGAGGGTCCTGAGAGGCTCTCGCAAGTAAAGCAATTCATCAAC'''
    >>> # Strings are immutable: assign the result of replace() back to the name.
    >>> pla2str=pla2str.replace('\n','')
    >>> pla2str
    'CAAGAAGCCATACCACCATCCCATCCAAGAGAGCTGACAGCATGAAGGTCCTCCTGTTGCTAGCAGTTGTGATCATGGCCTTTGGCTCAA
    TTCAGGTCCAGGGGAGCCTTCTGGAGTTTGGGCAAATGATTCTGTTTAAGACAGGAAAGAGAGCTGATGTTAGCTATGGCTTCTACGGTTG
    CCATTGTGGTGTGGGTGGCAGAGGATCCCCCAAGGATGCCACAGATTGGTGCTGTGTGACTCATGACTGTTGTTACAACCGTCTGGAGAAA
    CGTGGATGTGGCACAAAGTTTCTGACCTACAAGTTCTCCTACCGAGGGGGCCAAATCTCCTGCTCTACAAACCAGGACTCCTGCCGGAAAC
    AGCTGTGCCAGTGCGATAAAGCTGCCGCTGAATGTTTTGCCCGGAACAAGAAAAGCTACAGTTTAAAGTACCAGTTCTACCCCAACAAGTT
    TTGCAAAGGGAAGACGCCCAGTTGCTGAAAGAGACATCTTCGGAAACATCCAGACATCCTCTAACACCTCTCCTAGCCCAACCAAGTTCCC
    CAGTGATCAAGAAAACACCCCTCTCCAACCCTAGAAGCAGGCGGGCCCTTCTGTCTTCACCCAGAAGGAGCCGCTGAAGCCTGATCTTTCC
    CCAACACTCCACAGCCTTGGATCCGCCCACTTTTCCCTTGGCATCCAACTTCCTGCTGCGTAGTACCTAAGAGGGTCCTGAGAGGCTCTCG
    CAAGTAAAGCAATTCATCAAC'
    >>> # (the output above is a single string, wrapped here to fit the slide)
    >>> # Second argument: the alphabet. The sequence is DNA, so use generic_dna.
    >>> pla2=Seq(pla2str,generic_dna)
5
Combining Seq Objects

    >>> pla2=Seq(pla2str,generic_dna)
    >>> pla2
    Seq('CAAGAAGCCATACCACCATCCCATCCAAGAGAGCTGACAGCATGAAGGTCCTCC...AAC', DNAAlphabet())
    >>> 'GGGG' + pla2
    Seq('GGGGCAAGAAGCCATACCACCATCCCATCCAAGAGAGCTGACAGCATGAAGGTC...AAC', DNAAlphabet())
    >>> 'GGGG' + pla2 + 'TTTT'
    Seq('GGGGCAAGAAGCCATACCACCATCCCATCCAAGAGAGCTGACAGCATGAAGGTC...TTT', DNAAlphabet())
6
Finding and Counting Patterns

    >>> pla2.find('ATG')
    41
    >>> pla2.find('GAA')
    3
    >>> pla2.count('ATG')
    9
    >>> pla2.count('GAA')
    15
    >>> pla2.count(Seq('GAA',generic_dna))
    15
    >>> start_codon=Seq('ATG', generic_dna)
    >>> pla2.count(start_codon)
    9
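Seq.find and Seq.count follow the same conventions as Python's string methods: positions are 0-based, find returns -1 when the pattern is absent, and count tallies non-overlapping matches. The sketch below shows these conventions on plain strings so it runs without Biopython; count_overlapping is a hypothetical helper written for this illustration (recent Biopython releases provide a Seq.count_overlap method for the same purpose).

```python
seq = "CAAGAAGCCATACC"  # first 14 bases of the slide's sequence

first = seq.find("GAA")    # 0-based index of the first match
missing = seq.find("TTT")  # -1 when the pattern does not occur


def count_overlapping(text, pattern):
    """Count matches, allowing them to overlap (hypothetical helper)."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Resume one position later so overlapping matches are seen.
        start = text.find(pattern, start + 1)
    return count


print(first, missing)
print("AAAA".count("AA"), count_overlapping("AAAA", "AA"))
```

The difference matters for repetitive motifs: a non-overlapping count can be lower than the number of positions where the motif starts.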
7
Complementing Sequences

    >>> pla2
    Seq('CAAGAAGCCATACCACCATCCCATCCAAGAGAGCTGACAGCATGAAGGTCCTCC...AAC', DNAAlphabet())
    >>> pla2.complement()
    Seq('GTTCTTCGGTATGGTGGTAGGGTAGGTTCTCTCGACTGTCGTACTTCCAGGAGG...TTG', DNAAlphabet())
    >>> pla2.reverse_complement()
    Seq('GTTGATGAATTGCTTTACTTGCGAGAGCCTCTCAGGACCCTCTTAGGTACTAC...TTG', DNAAlphabet())
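What the two methods compute can be sketched on plain strings. This is not Biopython's implementation, only a stdlib illustration that handles the four unambiguous bases; the function names are chosen here for the example.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(dna):
    """Swap every base for its pairing partner, keeping the order."""
    return dna.translate(COMPLEMENT)


def reverse_complement(dna):
    """Complement the sequence and read it in the opposite direction."""
    return complement(dna)[::-1]


start = "CAAGAAGCC"  # first nine bases of the slide's sequence
print(complement(start))          # GTTCTTCGG
print(reverse_complement(start))  # GGCTTCTTG
```

The first result matches the start of the complement() output on the slide. The reverse complement is the sequence of the opposite strand read 5' to 3'.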
8
Translating Sequences

    >>> pla2.find('ATG')
    41
    >>> pla2.find('TGA',400)
    479
    >>> # The slice ends at 482 = 479 + 3 so that it includes the stop codon.
    >>> coderegion=pla2[41:482]
    >>> coderegion
    Seq('ATGAAGGTCCTCCTGTTGCTAGCAGTTGTGATCATGGCCTTTGGCTCAATTCAG...TGA', DNAAlphabet())
    >>> coderegion.translate()
    Seq('MKVLLLLAVVIMAFGSIQVQGSLLEFGQMILFKTGKRADVSYGFYGCHCGVGGR...SC*', HasStopCodon(ExtendedIUPACProtein(), '*'))
    >>> str(coderegion.translate())
    'MKVLLLLAVVIMAFGSIQVQGSLLEFGQMILFKTGKRADVSYGFYGCHCGVGGRGSPKDATDWCCVTHDCCYNRLEKRGCGTKFLTYKFSYRGGQISCSTNQDSCRKQLCQCDKAAAECFARNKKSYSLKYQFYPNKFCKGKTPSC*'
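Translation reads the coding region three bases at a time and looks each codon up in a genetic code table. The sketch below does this with the standard genetic code on plain strings. It is an illustration only, not Biopython's implementation; it assumes unambiguous upper- or lower-case A, C, G, T input, and the names are chosen here for the example.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate DNA codon by codon; '*' marks a stop codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


# First eight codons of the slide's coding region, then its last six.
print(translate("ATGAAGGTCCTCCTGTTGCTAGCA"))  # MKVLLLLA
print(translate("AAGACGCCCAGTTGCTGA"))        # KTPSC*
```

Both results agree with the start and end of the protein shown on the slide. A codon containing any other character (for example N) raises a KeyError in this sketch.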
9
THE END
10
Simple example #1
Read in FASTA sequence file

    from Bio import SeqIO
    handle = open("hemoglobin.fasta")
    for sr in SeqIO.parse(handle, "fasta"):
        print(sr.id)
        print(sr.seq)
    handle.close()
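To make the id and seq fields concrete, here is a minimal stdlib-only reader for the FASTA layout: a header line starting with ">", followed by one or more sequence lines. It is a sketch and not how SeqIO is implemented; read_fasta and the demo records are made up for this example.

```python
import io


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from a FASTA-formatted handle."""
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if identifier is not None:
                yield identifier, "".join(chunks)
            words = line[1:].split()
            # The identifier is the first word after '>'.
            identifier = words[0] if words else ""
            chunks = []
        elif identifier is not None:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


demo = io.StringIO(">seq1 first demo record\nATGAAG\nGTCCTC\n>seq2\nCTGTTG\n")
records = list(read_fasta(demo))
for name, sequence in records:
    print(name)
    print(sequence)
```

SeqIO.parse does the same job for many formats and returns SeqRecord objects that also carry descriptions and annotations, which is why the slides use it.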
11
Simple example #2
Read in GenBank sequence file

    from Bio import SeqIO
    handle = open("hemoglobin.gb")
    for sr in SeqIO.parse(handle, "genbank"):
        print(sr.id)
        print(sr.seq)
    handle.close()
12
Simple example #3
Read in UniProt (swiss/trembl) sequence file

    from Bio import SeqIO
    handle = open("hemoglobin.uniprot")
    for sr in SeqIO.parse(handle, "swiss"):
        print(sr.id)
        print(sr.seq)
    handle.close()
13
Simple example #4
Read in Clustal aln file

    from Bio import AlignIO
    handle = open("PA2.aln")
    for Almnt in AlignIO.parse(handle, "clustal"):
        for sr in Almnt:
            print(sr.id)
            print(sr.seq)
    handle.close()
14
Fetch Over the Network #1
Fetch GenBank Sequence from the Network

    from Bio import SeqIO
    from Bio import Entrez
    # Please use your REAL email address below:
    Entrez.email = "youremail@yourdomain.edu"
    handle = Entrez.efetch(db="nucleotide", rettype="gb", id="NM_000518")
    # Take the first record from the parser's iterator.
    sr = next(SeqIO.parse(handle, "genbank"))
    print(sr.id)
    print(sr.seq)
    handle.close()
15
Fetch Over the Network #2
Fetch GenBank Sequence from the Network

    from Bio import SeqIO
    from Bio import Entrez
    # Please use your REAL email address below:
    Entrez.email = "youremail@yourdomain.edu"
    handle = Entrez.efetch(db="nucleotide", rettype="gb", id="NM_000518")
    # When the result holds exactly one record, the commented-out line
    # and the line after it are equivalent:
    # sr = next(SeqIO.parse(handle, "genbank"))
    sr = SeqIO.read(handle, "genbank")
    print(sr.id)
    print(sr.seq)
    handle.close()
16
Fetch Over the Network #3
Fetch GenBank Sequence from the Network and Save

    from Bio import SeqIO
    from Bio import Entrez
    Entrez.email = "youremail@yourdomain.edu"
    InHandle = Entrez.efetch(db="nucleotide", rettype="gb", id="NM_000518")
    OutHandle = open("NM_000518.gb", "w")
    sr = SeqIO.parse(InHandle, "genbank")
    SeqIO.write(sr, OutHandle, "genbank")
    InHandle.close()
    OutHandle.close()
17
Fetch Over the Network #4
Fetch GenBank Sequence from the Network and Save as FASTA

    from Bio import SeqIO
    from Bio import Entrez
    Entrez.email = "youremail@yourdomain.edu"
    InHandle = Entrez.efetch(db="nucleotide", rettype="gb", id="NM_000518")
    OutHandle = open("NM_000518.fasta", "w")
    sr = SeqIO.parse(InHandle, "genbank")
    SeqIO.write(sr, OutHandle, "fasta")
    InHandle.close()
    OutHandle.close()
18
Homework Problem #1
Using BioPython, write a program that reads several sequences from a file in the UniProt/Swiss file format and saves them to a file in FASTA format. You may use the Hemoglobin.swiss test file from the supplemental materials section on Moodle.
19
BioPython vs Your Own Routines
- Use your own routine when:
  - The algorithm or coding is interesting to you
  - BioPython data structure mapping is too complex for your task
  - You want to "own" the source code from a copyright perspective
- Use Biopython when:
  - Routine fits your needs
  - Routine is unchallenging or boring. Why waste your time?
  - Routine will take you a lot of effort to write
- Extend Biopython routine when:
  - Routine almost does what you want but not quite
  - Challenging for the beginning programmer! Can you read and understand someone else's code?
20
Network Blast #1

    from Bio.Blast import NCBIWWW
    from Bio.Blast import NCBIXML
    from Bio import SeqIO
    query_file = open("hemoglobin.fasta")
    save_file = open("hemoglobin.xml", 'w')
    record = SeqIO.read(query_file, format="fasta")
    # Run blastp against Swiss-Prot on the NCBI servers.
    results_handle = NCBIWWW.qblast("blastp", "swissprot",
                                    record.seq, expect=1, matrix_name='BLOSUM62')
    # Save the raw XML report.
    blast_results = results_handle.read()
    save_file.write(blast_results)
    query_file.close()
    save_file.close()
21
Phylogenetic Trees

    from Bio import Phylo
    # Read in a phylogenetic tree in Newick format
    ConTree = Phylo.read("consensus.tre", "newick")
    print('\nHere is the tree in the native format used by Phylo:')
    print(ConTree)
    print('\nHere is the tree drawn using ASCII line representation:')
    Phylo.draw_ascii(ConTree)
    print('\nGet subtree that includes taxa LUCI_RENRE and Q6SH_9BACT:')
    A = ConTree.common_ancestor({"name": "LUCI_RENRE"},
                                {"name": "Q6SH_9BACT"})
    Phylo.draw_ascii(A)
    print('\nTrace the path between taxa LUCI_RENRE and Q6SH_9BACT:')
    print(ConTree.trace({"name": "LUCI_RENRE"}, {"name": "Q6SH_9BACT"}))
    print('\nCount the distance between taxa LUCI_RENRE and Q6SH_9BACT:')
    print(ConTree.distance({"name": "LUCI_RENRE"}, {"name": "Q6SH_9BACT"}))
    print('\nCount and print number of terminal nodes (taxa) in the tree:')
    print(ConTree.count_terminals())
    print(ConTree.get_terminals())
22
Network Blast #2

    from Bio.Blast import NCBIWWW
    from Bio.Blast import NCBIXML
    from Bio import SeqIO
    query_file = open('hemoglobin.fasta')
    save_file = open("hemoglobin.txt", 'w')
    record = SeqIO.read(query_file, format="fasta")
    results_handle = NCBIWWW.qblast("blastp", "swissprot", record.seq, expect=1,
                                    matrix_name='BLOSUM62', descriptions=2000,
                                    alignments=2000, hitlist_size=2000)
    # Parse the XML report and take the first (only) BLAST record.
    blast_results = next(NCBIXML.parse(results_handle))
    alncnt = 0
    for align in blast_results.alignments:
        alncnt = alncnt + 1
        ln = 'Alignment # ' + str(alncnt) + ' ' + align.accession
        save_file.write(ln + '\n')
        save_file.write(align.title[0:132] + '\n')
        # Write query, match line, and subject for each high-scoring pair.
        for hsp in align.hsps:
            save_file.write(hsp.query + '\n')
            save_file.write(hsp.match + '\n')
            save_file.write(hsp.sbjct + '\n')
            save_file.write('\n')
    query_file.close()
    save_file.close()
23
Network Blast/XML
24
Homework Problem #2
- A researcher downloaded a set of sequences from the UniProt database in FASTA format
- The sequence names in the FASTA file were very long
- The researcher performed a clustalw alignment with the set of sequences and has a tree and a clustalw alignment file
- The researcher now finds that some tools (such as Genedoc and Phylip) require shorter names!
25
Homework Problem #2
Solution: Write a Python program to read in all three files:
- A sequence file in FASTA format
- An alignment file in clustalw format
- A tree file in Newick format
Replace the names with shorter names (10 characters max) and write three new files.
Implementation Notes:
- Check that identifiers are the same in all three files
- If the input is a UniProt FASTA file, ask the user whether substituting accessions for IDs is desired
- Otherwise prompt the user to enter a new (shortened) ID
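One possible starting point for the renaming step is sketched below, with no file handling. It assumes UniProt identifiers of the form db|ACCESSION|ENTRY_NAME and falls back to truncation, with a numeric suffix to keep names unique, instead of prompting the user. All function names are hypothetical, and the identifiers are examples only.

```python
def uniprot_accession(identifier):
    """Return the accession from 'db|ACCESSION|ENTRY_NAME', else None."""
    parts = identifier.split("|")
    if len(parts) == 3 and parts[1]:
        return parts[1]
    return None


def build_short_names(identifiers, max_len=10, use_accessions=True):
    """Map each long identifier to a unique name of at most max_len chars."""
    mapping, used = {}, set()
    for ident in identifiers:
        base = (uniprot_accession(ident) if use_accessions else None) or ident
        candidate = base[:max_len]
        serial = 1
        while candidate in used:
            # Make room for a numeric suffix to resolve the clash.
            suffix = str(serial)
            candidate = base[:max_len - len(suffix)] + suffix
            serial += 1
        used.add(candidate)
        mapping[ident] = candidate
    return mapping


def same_identifiers(*id_lists):
    """True when every list holds the same set of identifiers."""
    first = set(id_lists[0])
    return all(set(ids) == first for ids in id_lists[1:])


ids = ["sp|P68871|HBB_HUMAN", "sp|P69905|HBA_HUMAN",
       "a_very_long_sequence_name_1", "a_very_long_sequence_name_2"]
short = build_short_names(ids)
for old, new in short.items():
    print(old, "->", new)
```

The same mapping would then be applied to the identifiers read from the sequence, alignment, and tree files, after same_identifiers confirms that the three files agree.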
26
Using BioPython
- First, download BioPython from the BioPython website: http://www.biopython.org
- Install on your computer
- Include the appropriate module. For a list of modules and descriptions see: http://www.biopython.org/DIST/docs/api/