MARC: Developing Bioinformatics Programs Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez Essential BioPython Manipulating Sequences with Seq 1.


1 MARC: Developing Bioinformatics Programs Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez Essential BioPython Manipulating Sequences with Seq 1

2 Python Classes
- Specify a common template for all objects in the class
- Declare three types of components:
  - constructors: generate new objects
  - properties: hold data about the object
  - methods: perform operations on objects
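As a minimal sketch of the three component types, consider the class below. The class name `DnaRecord` and its members are hypothetical, chosen only for illustration; they are not part of Biopython.

```python
class DnaRecord:
    """Hypothetical class illustrating a constructor, properties, and a method."""

    def __init__(self, name, bases):
        # Constructor: generates a new object.
        self.name = name             # property: holds data about the object
        self.bases = bases.upper()   # property: the sequence, stored in upper case

    def gc_fraction(self):
        # Method: performs an operation on the object.
        if not self.bases:
            return 0.0
        gc = self.bases.count("G") + self.bases.count("C")
        return gc / len(self.bases)


rec = DnaRecord("demo", "atgc")
print(rec.name, rec.bases, rec.gc_fraction())
```

Seq objects follow the same pattern: `Seq(...)` is the constructor, and `find`, `count`, `complement`, and `translate` are methods.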

3 Seq Objects
- Can be used to represent DNA and protein sequences
- Provide methods for carrying out basic operations on sequences:
  - finding patterns in sequences
  - reversing, complementing, and translating sequences
- To create a Seq object one must provide:
  - a string object representing the DNA or protein sequence
  - an alphabet object specifying the type of sequence (Biopython releases before 1.78 only; Bio.Alphabet was removed in 1.78, after which Seq takes just the string)

4 Creating a Seq object
>>> from Bio.Seq import *
>>> from Bio.Alphabet import *
>>> pla2str='''CAAGAAGCCATACCACCATCCCATCCAAGAGAGCTGACAGCATGAAGGTCCTCCTGTTGCTAGC
... AGTTGTGATCATGGCCTTTGGCTCAATTCAGGTCCAGGGGAGCCTTCTGGAGTTTGGGCAAATG
... ATTCTGTTTAAGACAGGAAAGAGAGCTGATGTTAGCTATGGCTTCTACGGTTGCCATTGTGGTG
... TGGGTGGCAGAGGATCCCCCAAGGATGCCACAGATTGGTGCTGTGTGACTCATGACTGTTGTTA
... CAACCGTCTGGAGAAACGTGGATGTGGCACAAAGTTTCTGACCTACAAGTTCTCCTACCGAGGG
... GGCCAAATCTCCTGCTCTACAAACCAGGACTCCTGCCGGAAACAGCTGTGCCAGTGCGATAAAG
... CTGCCGCTGAATGTTTTGCCCGGAACAAGAAAAGCTACAGTTTAAAGTACCAGTTCTACCCCAA
... CAAGTTTTGCAAAGGGAAGACGCCCAGTTGCTGAAAGAGACATCTTCGGAAACATCCAGACATC
... CTCTAACACCTCTCCTAGCCCAACCAAGTTCCCCAGTGATCAAGAAAACACCCCTCTCCAACCC
... TAGAAGCAGGCGGGCCCTTCTGTCTTCACCCAGAAGGAGCCGCTGAAGCCTGATCTTTCCCCAA
... CACTCCACAGCCTTGGATCCGCCCACTTTTCCCTTGGCATCCAACTTCCTGCTGCGTAGTACCT
... AAGAGGGTCCTGAGAGGCTCTCGCAAGTAAAGCAATTCATCAAC'''
>>> pla2str=pla2str.replace('\n','')    # remove the newlines and keep the result
>>> pla2str                             # output is wrapped here for display only
'CAAGAAGCCATACCACCATCCCATCCAAGAGAGCTGACAGCATGAAGGTCCTCCTGTTGCTAGCAGTTGTGATCATGGCCTTTGGCTCAA
TTCAGGTCCAGGGGAGCCTTCTGGAGTTTGGGCAAATGATTCTGTTTAAGACAGGAAAGAGAGCTGATGTTAGCTATGGCTTCTACGGTTG
CCATTGTGGTGTGGGTGGCAGAGGATCCCCCAAGGATGCCACAGATTGGTGCTGTGTGACTCATGACTGTTGTTACAACCGTCTGGAGAAA
CGTGGATGTGGCACAAAGTTTCTGACCTACAAGTTCTCCTACCGAGGGGGCCAAATCTCCTGCTCTACAAACCAGGACTCCTGCCGGAAAC
AGCTGTGCCAGTGCGATAAAGCTGCCGCTGAATGTTTTGCCCGGAACAAGAAAAGCTACAGTTTAAAGTACCAGTTCTACCCCAACAAGTT
TTGCAAAGGGAAGACGCCCAGTTGCTGAAAGAGACATCTTCGGAAACATCCAGACATCCTCTAACACCTCTCCTAGCCCAACCAAGTTCCC
CAGTGATCAAGAAAACACCCCTCTCCAACCCTAGAAGCAGGCGGGCCCTTCTGTCTTCACCCAGAAGGAGCCGCTGAAGCCTGATCTTTCC
CCAACACTCCACAGCCTTGGATCCGCCCACTTTTCCCTTGGCATCCAACTTCCTGCTGCGTAGTACCTAAGAGGGTCCTGAGAGGCTCTCG
CAAGTAAAGCAATTCATCAAC'
>>> pla2=Seq(pla2str,generic_dna)       # second argument is the alphabet

5 Combining Seq Objects
>>> pla2=Seq(pla2str,generic_dna)
>>> pla2
Seq('CAAGAAGCCATACCACCATCCCATCCAAGAGAGCTGACAGCATGAAGGTCCTCC...AAC', DNAAlphabet())
>>> 'GGGG' + pla2
Seq('GGGGCAAGAAGCCATACCACCATCCCATCCAAGAGAGCTGACAGCATGAAGGTC...AAC', DNAAlphabet())
>>> 'GGGG' + pla2 + 'TTTT'
Seq('GGGGCAAGAAGCCATACCACCATCCCATCCAAGAGAGCTGACAGCATGAAGGTC...TTT', DNAAlphabet())

6 Finding and Counting Patterns
>>> pla2.find('ATG')
41
>>> pla2.find('GAA')
3
>>> pla2.count('ATG')
9
>>> pla2.count('GAA')
15
>>> pla2.count(Seq('GAA',generic_dna))
15
>>> start_codon=Seq('ATG', generic_dna)
>>> pla2.count(start_codon)
9
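`find` and `count` on a Seq follow the conventions of the Python string methods of the same names: positions are 0-based, `find` returns -1 when the pattern is absent, and `count` counts non-overlapping matches. The sketch below shows those conventions on a plain `str` holding the first 17 bases of the sequence above; `count_overlapping` is a hypothetical helper, not a Biopython function.

```python
dna = "CAAGAAGCCATACCACC"   # first 17 bases of the example sequence

first_gaa = dna.find("GAA")   # 0-based index of the first match
missing = dna.find("ATG")     # -1: no ATG within these 17 bases


def count_overlapping(text, pattern):
    """Hypothetical helper: count matches, allowing them to overlap."""
    total = 0
    start = text.find(pattern)
    while start != -1:
        total += 1
        start = text.find(pattern, start + 1)
    return total


non_overlapping = "AAAA".count("AA")          # str.count skips past each match
overlapping = count_overlapping("AAAA", "AA")
print(first_gaa, missing, non_overlapping, overlapping)
```

The distinction matters when counting motifs that can overlap themselves, such as runs of a single base.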

7 Complementing Sequences
>>> pla2
Seq('CAAGAAGCCATACCACCATCCCATCCAAGAGAGCTGACAGCATGAAGGTCCTCC...AAC', DNAAlphabet())
>>> pla2.complement()
Seq('GTTCTTCGGTATGGTGGTAGGGTAGGTTCTCTCGACTGTCGTACTTCCAGGAGG...TTG', DNAAlphabet())
>>> pla2.reverse_complement()
Seq('GTTGATGAATTGCTTTACTTGCGAGAGCCTCTCAGGACCCTCTTAGGTACTAC...TTG', DNAAlphabet())
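To make the two operations concrete, here is a plain-Python sketch of what complementing and reverse-complementing do to an unambiguous DNA string. It is an illustration only, not Biopython's implementation, and it ignores ambiguity codes such as N.

```python
# Translation table mapping each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(dna):
    """Return the complementary strand, read in the same direction."""
    return dna.translate(COMPLEMENT)


def reverse_complement(dna):
    """Return the complementary strand, read 5' to 3'."""
    return complement(dna)[::-1]


start = "CAAGAAGCC"    # first bases of the example sequence
end = "ATTCATCAAC"     # last bases of the example sequence
print(complement(start))
print(reverse_complement(end))
```

The reverse complement of the last bases of a sequence gives the first bases of its reverse complement, which is why the output above begins with the reversed, complemented tail.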

8 Translating Sequences
>>> pla2.find('ATG')
41
>>> pla2.find('TGA',400)
479
>>> coderegion=pla2[41:482]
>>> coderegion
Seq('ATGAAGGTCCTCCTGTTGCTAGCAGTTGTGATCATGGCCTTTGGCTCAATTCAG...TGA', DNAAlphabet())
>>> coderegion.translate()
Seq('MKVLLLLAVVIMAFGSIQVQGSLLEFGQMILFKTGKRADVSYGFYGCHCGVGGR...SC*', HasStopCodon(ExtendedIUPACProtein(), '*'))
>>> str(coderegion.translate())
'MKVLLLLAVVIMAFGSIQVQGSLLEFGQMILFKTGKRADVSYGFYGCHCGVGGRGSPKDATDWCCVTHDCCYNRLEKRGCGTKFLTYKFSYRGGQISCSTNQDSCRKQLCQCDKAAAECFARNKKSYSLKYQFYPNKFCKGKTPSC*'
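The slice `pla2[41:482]` runs from the start codon at index 41 through the stop codon found at index 479 (479 + 3 = 482). As a rough sketch of what translation then does, the plain-Python function below reads a DNA string three bases at a time using the standard genetic code. It is an illustration under simplifying assumptions (upper-case, unambiguous bases, standard table only), not Biopython's implementation.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate complete codons of an upper-case DNA string; '*' marks a stop."""
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        protein.append(CODON_TABLE[dna[i:i + 3]])
    return "".join(protein)


print(translate("ATGAAGGTCCTCTGA"))
```

Trailing bases that do not form a complete codon are ignored in this sketch.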

9 THE END

10 Simple example #1: Read in Fasta sequence file
from Bio import SeqIO
handle = open("hemoglobin.fasta")
for sr in SeqIO.parse(handle,"fasta"):
    print(sr.id)
    print(sr.seq)
handle.close()
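To show what the loop above receives for each record, here is a plain-Python sketch of reading FASTA text: a header line starting with `>` opens a record, its first word serves as the identifier, and the following lines are joined into the sequence. The function `read_fasta` is a hypothetical stand-in for illustration; it is not how `SeqIO.parse` is implemented, and it returns tuples rather than SeqRecord objects.

```python
import io


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from an open FASTA-formatted handle."""
    seq_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # The identifier is the first whitespace-delimited word of the header.
            words = line[1:].split()
            seq_id = words[0] if words else ""
            chunks = []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


# In-memory stand-in for a file such as hemoglobin.fasta.
demo = io.StringIO(">seq1 demo record\nATGC\nGGTA\n>seq2\nTTAA\n")
records = list(read_fasta(demo))
for seq_id, seq in records:
    print(seq_id)
    print(seq)
```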

11 Simple example #2: Read in Genbank sequence file
from Bio import SeqIO
handle = open("hemoglobin.gb")
for sr in SeqIO.parse(handle,"genbank"):
    print(sr.id)
    print(sr.seq)
handle.close()

12 Simple example #3: Read in UniProt (swiss/trembl) sequence file
from Bio import SeqIO
handle = open("hemoglobin.uniprot")
for sr in SeqIO.parse(handle,"swiss"):
    print(sr.id)
    print(sr.seq)
handle.close()

13 Simple example #4: Read in clustal aln file
from Bio import AlignIO
handle = open("PA2.aln")
for Almnt in AlignIO.parse(handle,"clustal"):
    for sr in Almnt:
        print(sr.id)
        print(sr.seq)
handle.close()

14 Fetch Over the Network #1: Fetch Genbank Sequence from the Network
from Bio import SeqIO
from Bio import Entrez
# Please use your REAL email address below:
Entrez.email="youremail@yourdomain.edu"
handle = Entrez.efetch(db="nucleotide",rettype="gb",id="NM_000518")
sr = next(SeqIO.parse(handle,"genbank"))
print(sr.id)
print(sr.seq)
handle.close()

15 Fetch Over the Network #2: Fetch Genbank Sequence from the Network
from Bio import SeqIO
from Bio import Entrez
# Please use your REAL email address below:
Entrez.email="youremail@yourdomain.edu"
handle = Entrez.efetch(db="nucleotide",rettype="gb",id="NM_000518")
# For a single-record result, the commented-out line and the line after it are equivalent:
#sr = next(SeqIO.parse(handle,"genbank"))
sr = SeqIO.read(handle,"genbank")
print(sr.id)
print(sr.seq)
handle.close()

16 Fetch Over the Network #3: Fetch Genbank Sequence from the Network and Save
from Bio import SeqIO
from Bio import Entrez
Entrez.email="youremail@yourdomain.edu"
InHandle = Entrez.efetch(db="nucleotide",rettype="gb",id="NM_000518")
OutHandle = open("NM_000518.gb","w")
sr = SeqIO.parse(InHandle, "genbank")
SeqIO.write(sr,OutHandle,"genbank")
InHandle.close()
OutHandle.close()

17 Fetch Over the Network #4: Fetch Genbank Sequence from the Network and Save as Fasta
from Bio import SeqIO
from Bio import Entrez
Entrez.email="youremail@yourdomain.edu"
InHandle = Entrez.efetch(db="nucleotide",rettype="gb",id="NM_000518")
OutHandle = open("NM_000518.fasta","w")
sr = SeqIO.parse(InHandle, "genbank")
SeqIO.write(sr,OutHandle,"fasta")
InHandle.close()
OutHandle.close()

18 Homework Problem #1
- Using BioPython, write a program that reads several sequences from a file in UniProt/Swiss format and saves them to a file in FASTA format.
- You may use the Hemoglobin.swiss test file from the supplemental materials section on moodle.

19 BioPython vs Your Own Routines
- Use your own routine when:
  - The algorithm or coding is interesting to you
  - BioPython data structure mapping is too complex for your task
  - You want to “own” the source code from a copyright perspective
- Use Biopython when:
  - Routine fits your needs
  - Routine is unchallenging or boring. Why waste your time?
  - Routine will take you a lot of effort to write
- Extend Biopython routine when:
  - Routine almost does what you want but not quite
  - Challenging for the beginning programmer! Can you read and understand someone else’s code?

20 Network Blast #1
from Bio.Blast import NCBIWWW
from Bio import SeqIO
query_file = open("hemoglobin.fasta")
save_file = open("hemoglobin.xml", 'w')
record = SeqIO.read(query_file, format="fasta")
# Submit the query to NCBI and save the raw XML results
results_handle = NCBIWWW.qblast("blastp", "swissprot",
                                record.seq, expect=1, matrix_name='BLOSUM62')
blast_results = results_handle.read()
save_file.write(blast_results)
query_file.close()
save_file.close()

21 Phylogenetic Trees
from Bio import Phylo
# Read in a phylogenetic tree in Newick format
ConTree=Phylo.read("consensus.tre","newick")
print('\nHere is the tree in the native format used by Phylo:')
print(ConTree)
print('\nHere is the tree drawn using ASCII line representation:')
Phylo.draw_ascii(ConTree)
print('\nGet subtree that includes taxa LUCI_RENRE and Q6SH_9BACT:')
A=ConTree.common_ancestor({"name": "LUCI_RENRE"},
                          {"name": "Q6SH_9BACT"})
Phylo.draw_ascii(A)
print('\nTrace the path between taxa LUCI_RENRE and Q6SH_9BACT:')
print(ConTree.trace({"name": "LUCI_RENRE"},{"name": "Q6SH_9BACT"}))
print('\nCount the distance between taxa LUCI_RENRE and Q6SH_9BACT:')
print(ConTree.distance({"name": "LUCI_RENRE"},{"name": "Q6SH_9BACT"}))
print('\nCount and print number of terminal nodes (taxa) in the tree:')
print(ConTree.count_terminals())
print(ConTree.get_terminals())

22 Network Blast #2
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML
from Bio import SeqIO
query_file = open('hemoglobin.fasta')
save_file = open("hemoglobin.txt", 'w')
record = SeqIO.read(query_file, format="fasta")
results_handle = NCBIWWW.qblast("blastp", "swissprot", record.seq, expect=1,
                                matrix_name='BLOSUM62', descriptions=2000,
                                alignments=2000, hitlist_size=2000)
# Parse the XML results and take the first (only) BLAST record
blast_results = next(NCBIXML.parse(results_handle))
alncnt=0
for align in blast_results.alignments:
    alncnt = alncnt + 1
    ln='Alignment # ' + str(alncnt) + ' ' + align.accession
    save_file.write(ln + '\n')
    save_file.write(align.title[0:132] + '\n')
    # Write each high-scoring segment pair: query, match line, subject
    for hsp in align.hsps:
        save_file.write(hsp.query + '\n')
        save_file.write(hsp.match + '\n')
        save_file.write(hsp.sbjct + '\n')
        save_file.write('\n')
query_file.close()
save_file.close()

23 Network Blast/XML

24 Homework Problem #2
- A researcher downloaded a set of sequences from the UniProt database in Fasta format
- The sequence names in the Fasta file were very long
- The researcher ran a clustalw alignment on the set of sequences and now has a tree file and a clustalw alignment file
- The researcher now finds that some tools (such as Genedoc and Phylip) require shorter names!

25 Homework Problem #2
- Solution:
  - Write a Python program to read in all three files:
    - A sequence file in fasta format
    - An alignment file in clustalw format
    - A tree file in newick format
  - Replace the names with shorter names (10 characters max) and write three new files.
- Implementation Notes:
  - Check that the identifiers are the same in all three files
  - If the input is a UniProt Fasta file, ask the user whether to substitute accessions for ids
  - Otherwise prompt the user to enter a new (shortened) ID
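One piece of the homework can be sketched independently of the file handling: building a single old-name to new-name mapping that is then applied to all three files, so the identifiers stay consistent. The function `shorten_names` and the example names below are hypothetical; truncating and numbering is just one possible scheme and is an alternative to prompting the user or substituting accessions as described above.

```python
def shorten_names(names, max_len=10):
    """Map each long name to a unique name of at most max_len characters."""
    mapping = {}
    used = set()
    for name in names:
        if name in mapping:
            continue
        candidate = name[:max_len]
        counter = 1
        # On a collision, replace the tail of the name with a running number.
        while candidate in used:
            suffix = str(counter)
            candidate = name[:max_len - len(suffix)] + suffix
            counter += 1
        mapping[name] = candidate
        used.add(candidate)
    return mapping


# Hypothetical identifiers standing in for long UniProt-style names.
long_names = ["LONG_SEQUENCE_NAME_A", "LONG_SEQUENCE_NAME_B", "SHORT"]
name_map = shorten_names(long_names)
for old, new in name_map.items():
    print(old, "->", new)
```

Building the mapping once and reusing it for the sequence, alignment, and tree files avoids the three outputs drifting out of agreement.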

26 Using BioPython
- First, download BioPython from the BioPython website:
  - http://www.biopython.org
- Install on your computer
- Include the appropriate module. For a list of modules and descriptions see:
  - http://www.biopython.org/DIST/docs/api/

