MARC: Developing Bioinformatics Programs
Alex Ropelewski, PSC-NRBSC
Bienvenido Vélez, UPR Mayaguez
Essential BioPython: Manipulating Sequences with Seq



Slide 2: Python Classes
- Specify a common template for all objects in the class
- Declare three types of components:
  - constructors: generate new objects
  - properties: hold data about the object
  - methods: perform operations on objects
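The three kinds of components listed above can be seen side by side in a small class. This is only a minimal sketch: the class name `DnaSequence` and its members are hypothetical and are not part of BioPython.

```python
class DnaSequence:
    """A toy sequence class (hypothetical; not a BioPython class)."""

    def __init__(self, name, bases):
        # Constructor: generates a new object and stores its data.
        self.name = name
        self.bases = bases.upper()

    @property
    def length(self):
        # Property: exposes data about the object.
        return len(self.bases)

    def gc_content(self):
        # Method: performs an operation on the object.
        if not self.bases:
            return 0.0
        gc = self.bases.count("G") + self.bases.count("C")
        return gc / len(self.bases)


fragment = DnaSequence("fragment", "caagaagcca")
print(fragment.name, fragment.length, fragment.gc_content())
```

Every object created from the class shares this template but holds its own `name` and `bases`.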

Slide 3: Seq Objects
- Can be used to represent DNA and protein sequences
- Provide methods for carrying out basic operations on sequences:
  - finding patterns in sequences
  - reversing, complementing, and translating sequences
- To create a Seq object one must provide:
  - a string object representing the DNA or protein sequence
  - an alphabet object specifying the type of sequence (this applies to BioPython releases before 1.78; Bio.Alphabet was removed in 1.78, after which Seq takes only the string)

Slide 4: Creating a Seq object

>>> from Bio.Seq import *
>>> from Bio.Alphabet import *
>>> pla2str = '''CAAGAAGCCATACCACCATCCCATCCAAGAGAGCTGACAGCATGAAGGTCCTCCTGTTGCTAGC
... AGTTGTGATCATGGCCTTTGGCTCAATTCAGGTCCAGGGGAGCCTTCTGGAGTTTGGGCAAATG
... ATTCTGTTTAAGACAGGAAAGAGAGCTGATGTTAGCTATGGCTTCTACGGTTGCCATTGTGGTG
... TGGGTGGCAGAGGATCCCCCAAGGATGCCACAGATTGGTGCTGTGTGACTCATGACTGTTGTTA
... CAACCGTCTGGAGAAACGTGGATGTGGCACAAAGTTTCTGACCTACAAGTTCTCCTACCGAGGG
... GGCCAAATCTCCTGCTCTACAAACCAGGACTCCTGCCGGAAACAGCTGTGCCAGTGCGATAAAG
... CTGCCGCTGAATGTTTTGCCCGGAACAAGAAAAGCTACAGTTTAAAGTACCAGTTCTACCCCAA
... CAAGTTTTGCAAAGGGAAGACGCCCAGTTGCTGAAAGAGACATCTTCGGAAACATCCAGACATC
... CTCTAACACCTCTCCTAGCCCAACCAAGTTCCCCAGTGATCAAGAAAACACCCCTCTCCAACCC
... TAGAAGCAGGCGGGCCCTTCTGTCTTCACCCAGAAGGAGCCGCTGAAGCCTGATCTTTCCCCAA
... CACTCCACAGCCTTGGATCCGCCCACTTTTCCCTTGGCATCCAACTTCCTGCTGCGTAGTACCT
... AAGAGGGTCCTGAGAGGCTCTCGCAAGTAAAGCAATTCATCAAC'''
>>> # Strings are immutable, so keep the result of replace() to drop the newlines
>>> pla2str = pla2str.replace('\n', '')
>>> # The second argument is the alphabet; this sequence is DNA
>>> pla2 = Seq(pla2str, generic_dna)

Slide 5: Combining Seq Objects

>>> pla2 = Seq(pla2str, generic_dna)
>>> pla2
Seq('CAAGAAGCCATACCACCATCCCATCCAAGAGAGCTGACAGCATGAAGGTCCTCC...AAC', DNAAlphabet())
>>> 'GGGG' + pla2
Seq('GGGGCAAGAAGCCATACCACCATCCCATCCAAGAGAGCTGACAGCATGAAGGTC...AAC', DNAAlphabet())
>>> 'GGGG' + pla2 + 'TTTT'
Seq('GGGGCAAGAAGCCATACCACCATCCCATCCAAGAGAGCTGACAGCATGAAGGTC...TTT', DNAAlphabet())

Slide 6: Finding and Counting Patterns

>>> pla2.find('ATG')
41
>>> pla2.find('GAA')
3
>>> pla2.count('ATG')
9
>>> pla2.count('GAA')
15
>>> pla2.count(Seq('GAA', generic_dna))
15
>>> start_codon = Seq('ATG', generic_dna)
>>> pla2.count(start_codon)
9
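The `find` and `count` methods of Seq are documented as working like the Python string methods of the same names, so their conventions can be explored on a plain string. The sketch below uses the first 48 bases of the sequence above; `count_overlapping` is a hypothetical helper written for this illustration, not a BioPython function.

```python
def count_overlapping(sequence, pattern):
    """Count matches of pattern, allowing overlaps (hypothetical helper)."""
    count = 0
    start = sequence.find(pattern)
    while start != -1:
        count += 1
        # Resume one position after the previous match so overlaps are seen.
        start = sequence.find(pattern, start + 1)
    return count


dna = "CAAGAAGCCATACCACCATCCCATCCAAGAGAGCTGACAGCATGAAGG"

first_atg = dna.find("ATG")     # 0-based index of the first match
missing = dna.find("TTTT")      # -1 signals "not found"
plain = "AAAA".count("AA")      # str.count skips overlapping matches
overlapping = count_overlapping("AAAA", "AA")
print(first_atg, missing, plain, overlapping)
```

Indices are 0-based, a missing pattern gives -1, and `count` ignores overlapping occurrences, which matters for repetitive motifs.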

Slide 7: Complementing Sequences

>>> pla2
Seq('CAAGAAGCCATACCACCATCCCATCCAAGAGAGCTGACAGCATGAAGGTCCTCC...AAC', DNAAlphabet())
>>> pla2.complement()
Seq('GTTCTTCGGTATGGTGGTAGGGTAGGTTCTCTCGACTGTCGTACTTCCAGGAGG...TTG', DNAAlphabet())
>>> pla2.reverse_complement()
Seq('GTTGATGAATTGCTTTACTTGCGAGAGCCTCTCAGGACCCTCTTAGGTACTAC...TTG', DNAAlphabet())
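What the two calls above compute can be sketched in plain Python. This is not BioPython's implementation, and it assumes the sequence contains only the unambiguous bases A, C, G, and T; the function names are chosen for this illustration.

```python
# Translation table for unambiguous DNA bases only (an assumption of this sketch).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(dna):
    """Swap each base for its pairing partner, keeping the order."""
    return dna.translate(COMPLEMENT)


def reverse_complement(dna):
    """Complement the sequence and read it in the opposite direction."""
    return complement(dna)[::-1]


head = "CAAGAAGCC"   # first bases of the sequence above
tail = "TTCATCAAC"   # last bases of the sequence above
print(complement(head))
print(reverse_complement(tail))
```

The reverse complement of the tail reproduces the start of the `reverse_complement()` output shown on the slide, because the opposite strand is read from the other end.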

Slide 8: Translating Sequences

>>> pla2.find('ATG')
41
>>> pla2.find('TGA', 400)
479
>>> coderegion = pla2[41:482]
>>> coderegion
Seq('ATGAAGGTCCTCCTGTTGCTAGCAGTTGTGATCATGGCCTTTGGCTCAATTCAG...TGA', DNAAlphabet())
>>> coderegion.translate()
Seq('MKVLLLLAVVIMAFGSIQVQGSLLEFGQMILFKTGKRADVSYGFYGCHCGVGGR...SC*', HasStopCodon(ExtendedIUPACProtein(), '*'))
>>> str(coderegion.translate())
'MKVLLLLAVVIMAFGSIQVQGSLLEFGQMILFKTGKRADVSYGFYGCHCGVGGRGSPKDATDWCCVTHDCCYNRLEKRGCGTKFLTYKFSYRGGQISCSTNQDSCRKQLCQCDKAAAECFARNKKSYSLKYQFYPNKFCKGKTPSC*'
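In the session above, `find('TGA', 400)` returns 479, which is in frame with the start codon at 41 because 479 - 41 = 438 = 3 × 146. A plain `find` does not guarantee this in general. The sketch below, which is not BioPython's implementation, builds the standard genetic code table and searches for a stop codon codon by codon; the function names and the toy sequence are made up for illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons enumerated in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
STOP_CODONS = ("TAA", "TAG", "TGA")


def find_in_frame_stop(dna, start):
    """Return the index of the first stop codon in frame with start, or -1."""
    for pos in range(start, len(dna) - 2, 3):
        if dna[pos:pos + 3] in STOP_CODONS:
            return pos
    return -1


def translate(dna):
    """Translate complete codons of unambiguous DNA; '*' marks a stop."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[pos:pos + 3]] for pos in range(0, usable, 3))


toy = "ATGACTGAAGGTTGA"            # hypothetical sequence
naive_stop = toy.find("TGA")       # ignores the reading frame
stop = find_in_frame_stop(toy, 0)
print(naive_stop, stop, translate(toy[:stop + 3]))
```

Here the naive search lands inside the first two codons, while the in-frame search finds the real stop at the end of the toy reading frame.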

Slide 9: THE END

Slide 10: Simple example #1 (read in a FASTA sequence file)

from Bio import SeqIO

handle = open("hemoglobin.fasta")
# Each record yielded by the parser is a SeqRecord
for sr in SeqIO.parse(handle, "fasta"):
    print(sr.id)
    print(sr.seq)
handle.close()
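To make clear what the parser hands back for each record, here is a minimal plain-Python sketch of reading FASTA text. It is not how SeqIO is implemented and it skips the validation a real parser performs; `parse_fasta` and the toy records are assumptions made for this illustration.

```python
import io


def parse_fasta(handle):
    """Yield (identifier, sequence) pairs from a FASTA-formatted handle."""
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if identifier is not None:
                yield identifier, "".join(chunks)
            # The identifier is the first word after '>'.
            identifier = line[1:].split()[0] if len(line) > 1 else ""
            chunks = []
        elif identifier is not None:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


example = io.StringIO(">seq1 first toy record\nCAAGAAGCC\nATACCACC\n>seq2\nATGAAGG\n")
records = list(parse_fasta(example))
for identifier, sequence in records:
    print(identifier, sequence)
```

The pairs play the role of `sr.id` and `sr.seq` in the slide, with sequence lines joined across line breaks.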

Slide 11: Simple example #2 (read in a GenBank sequence file)

from Bio import SeqIO

handle = open("hemoglobin.gb")
for sr in SeqIO.parse(handle, "genbank"):
    print(sr.id)
    print(sr.seq)
handle.close()

Slide 12: Simple example #3 (read in a UniProt (Swiss-Prot/TrEMBL) sequence file)

from Bio import SeqIO

handle = open("hemoglobin.uniprot")
for sr in SeqIO.parse(handle, "swiss"):
    print(sr.id)
    print(sr.seq)
handle.close()

Slide 13: Simple example #4 (read in a Clustal .aln file)

from Bio import AlignIO

handle = open("PA2.aln")
# The outer loop walks over alignments, the inner loop over the records in each one
for Almnt in AlignIO.parse(handle, "clustal"):
    for sr in Almnt:
        print(sr.id)
        print(sr.seq)
handle.close()

Slide 14: Fetch Over the Network #1 (fetch a GenBank sequence from the network)

from Bio import SeqIO
from Bio import Entrez

# Please use your REAL address below (the value shown is a placeholder):
Entrez.email = "your.name@example.org"
handle = Entrez.efetch(db="nucleotide", rettype="gb", id="NM_000518")
# Take the first record from the parser
sr = next(SeqIO.parse(handle, "genbank"))
print(sr.id)
print(sr.seq)
handle.close()

Slide 15: Fetch Over the Network #2 (fetch a GenBank sequence from the network)

from Bio import SeqIO
from Bio import Entrez

# Please use your REAL address below (the value shown is a placeholder):
Entrez.email = "your.name@example.org"
handle = Entrez.efetch(db="nucleotide", rettype="gb", id="NM_000518")
# The next two lines are equivalent when the handle holds one record;
# the first is commented out:
# sr = next(SeqIO.parse(handle, "genbank"))
sr = SeqIO.read(handle, "genbank")
print(sr.id)
print(sr.seq)
handle.close()
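`Entrez.efetch` is a wrapper around the EFetch service of the NCBI E-utilities. As a rough sketch of the kind of request involved, the code below only assembles an EFetch URL and does not contact the network. The helper name `build_efetch_url` and the e-mail address are placeholders, and this is not a description of how BioPython builds its requests internally.

```python
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(db, accession, rettype, email):
    """Assemble an EFetch request URL (hypothetical helper; no network access)."""
    params = {
        "db": db,
        "id": accession,
        "rettype": rettype,
        "retmode": "text",
        "email": email,   # NCBI asks callers to identify themselves
    }
    return EFETCH_URL + "?" + urlencode(params)


url = build_efetch_url("nucleotide", "NM_000518", "gb", "your.name@example.org")
print(url)
```

Opening such a URL in a browser should return the same GenBank text that the handle in the slide delivers to `SeqIO`.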

Slide 16: Fetch Over the Network #3 (fetch a GenBank sequence from the network and save it)

from Bio import SeqIO
from Bio import Entrez

InHandle = Entrez.efetch(db="nucleotide", rettype="gb", id="NM_000518")
OutHandle = open("NM_ gb", "w")
# Parse the downloaded records and write them back out in GenBank format
sr = SeqIO.parse(InHandle, "genbank")
SeqIO.write(sr, OutHandle, "genbank")
InHandle.close()
OutHandle.close()

Slide 17: Fetch Over the Network #4 (fetch a GenBank sequence from the network and save it as FASTA)

from Bio import SeqIO
from Bio import Entrez

InHandle = Entrez.efetch(db="nucleotide", rettype="gb", id="NM_000518")
OutHandle = open("NM_ fasta", "w")
# Parse the GenBank records and write them out in FASTA format
sr = SeqIO.parse(InHandle, "genbank")
SeqIO.write(sr, OutHandle, "fasta")
InHandle.close()
OutHandle.close()

Slide 18: Homework Problem #1
- Using BioPython, write a program to read in several sequences from a file in the UniProt/Swiss-Prot file format and save them to a file in FASTA format.
- You may use the Hemoglobin.swiss test file from the supplemental materials section on Moodle.

Slide 19: BioPython vs Your Own Routines
- Use your own routine when:
  - The algorithm or coding is interesting to you
  - BioPython data structure mapping is too complex for your task
  - You want to "own" the source code from a copyright perspective
- Use BioPython when:
  - The routine fits your needs
  - The routine is unchallenging or boring (why waste your time?)
  - The routine would take you a lot of effort to write
- Extend a BioPython routine when:
  - The routine almost does what you want, but not quite
  - Challenging for the beginning programmer! Can you read and understand someone else's code?

Slide 20: Network Blast #1

from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML
from Bio import SeqIO

query_file = open("hemoglobin.fasta")
save_file = open("hemoglobin.xml", "w")
record = SeqIO.read(query_file, format="fasta")
# Run blastp against Swiss-Prot at NCBI; the results come back as XML
results_handle = NCBIWWW.qblast("blastp", "swissprot",
                                record.seq, expect=1, matrix_name="BLOSUM62")
blast_results = results_handle.read()
save_file.write(blast_results)
query_file.close()
save_file.close()

Slide 21: Phylogenetic Trees

from Bio import Phylo

# Read in a phylogenetic tree in Newick format
ConTree = Phylo.read("consensus.tre", "newick")

print('\nHere is the tree in the native format used by Phylo:')
print(ConTree)

print('\nHere is the tree drawn using ASCII line representation:')
Phylo.draw_ascii(ConTree)

print('\nGet subtree that includes taxa LUCI_RENRE and Q6SH_9BACT:')
A = ConTree.common_ancestor({"name": "LUCI_RENRE"},
                            {"name": "Q6SH_9BACT"})
Phylo.draw_ascii(A)

print('\nTrace the path between taxa LUCI_RENRE and Q6SH_9BACT:')
print(ConTree.trace({"name": "LUCI_RENRE"}, {"name": "Q6SH_9BACT"}))

print('\nCount the distance between taxa LUCI_RENRE and Q6SH_9BACT:')
print(ConTree.distance({"name": "LUCI_RENRE"}, {"name": "Q6SH_9BACT"}))

print('\nCount and print number of terminal nodes (taxa) in the tree:')
print(ConTree.count_terminals())
print(ConTree.get_terminals())

Slide 22: Network Blast #2

from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML
from Bio import SeqIO

query_file = open("hemoglobin.fasta")
save_file = open("hemoglobin.txt", "w")
record = SeqIO.read(query_file, format="fasta")
results_handle = NCBIWWW.qblast("blastp", "swissprot", record.seq, expect=1,
                                matrix_name="BLOSUM62", descriptions=2000,
                                alignments=2000, hitlist_size=2000)
# Parse the XML and take the first (only) BLAST record
blast_results = next(NCBIXML.parse(results_handle))
alncnt = 0
for align in blast_results.alignments:
    alncnt = alncnt + 1
    ln = 'Alignment # ' + str(alncnt) + ' ' + align.accession
    save_file.write(ln + '\n')
    save_file.write(align.title[0:132] + '\n')
    # Write query, match line, and subject for every high-scoring pair
    for hsp in align.hsps:
        save_file.write(hsp.query + '\n')
        save_file.write(hsp.match + '\n')
        save_file.write(hsp.sbjct + '\n')
        save_file.write('\n')
query_file.close()
save_file.close()

Slide 23: Network Blast/XML

Slide 24: Homework Problem #2
- A researcher downloaded a set of sequences from the UniProt database in FASTA format
- The sequence names in the FASTA file were very long
- The researcher performed a ClustalW alignment with the set of sequences and has a tree and a ClustalW alignment file
- The researcher now finds that some tools (such as GeneDoc and PHYLIP) require shorter names!

Slide 25: Homework Problem #2
- Solution:
  - Write a Python program to read in all three files:
    - A sequence file in FASTA format
    - An alignment file in ClustalW format
    - A tree file in Newick format
  - Replace the names with shorter names (10 characters max) and write three new files
- Implementation notes:
  - Check that the identifiers are the same in all three files
  - If the input is a UniProt FASTA file, ask the user whether substituting accessions for IDs is desired
  - Otherwise prompt the user to enter a new (shortened) ID
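One piece of this problem, building a table of unique short names, can be sketched without any file handling. The code below is a starting point under stated assumptions, not a full solution: the helper names are hypothetical, the identifiers are assumed to be already extracted from the three files, and UniProt identifiers are assumed to follow the `db|ACCESSION|ENTRY_NAME` layout.

```python
def uniprot_accession(identifier):
    """Return the accession from 'db|ACCESSION|ENTRY_NAME', else the input."""
    parts = identifier.split("|")
    return parts[1] if len(parts) == 3 else identifier


def shorten_ids(identifiers, max_len=10, use_accession=False):
    """Map each identifier to a unique name of at most max_len characters."""
    mapping, used = {}, set()
    for identifier in identifiers:
        base = uniprot_accession(identifier) if use_accession else identifier
        short = base[:max_len]
        suffix = 1
        # On a clash, replace the tail of the name with a running number.
        while short in used:
            tag = str(suffix)
            short = base[:max_len - len(tag)] + tag
            suffix += 1
        used.add(short)
        mapping[identifier] = short
    return mapping


fasta_ids = ["sp|P68871|HBB_HUMAN", "sp|P69905|HBA_HUMAN"]
tree_ids = ["sp|P69905|HBA_HUMAN", "sp|P68871|HBB_HUMAN"]
same_ids = set(fasta_ids) == set(tree_ids)   # identifier check across files
by_accession = shorten_ids(fasta_ids, use_accession=True)
clashing = shorten_ids(["LONGNAME_HUMAN_A", "LONGNAME_HUMAN_B"])
print(same_ids, by_accession, clashing)
```

The same mapping would then be applied to the sequence, alignment, and tree files so that all three stay consistent.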

Slide 26: Using BioPython
- First, download BioPython from the BioPython website
- Install it on your computer
- Include the appropriate module in your program. For a list of modules and descriptions, see the BioPython documentation.