BioPython
Website: http://biopython.org/wiki/Biopython
Download & Installation: http://biopython.org/wiki/Download
Documentation: http://biopython.org/wiki/Category%3AWiki_Documentation

BioPython
Key features:
- Sequences
- Sequence annotation
- I/O operations
- Accessing online databases
- Multiple sequence alignments
- BLAST
- and many more …

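quickstart: Sequence objects
Simple example:

    # Requires Biopython < 1.78 (Bio.Alphabet was removed in 1.78). On newer
    # versions, drop the IUPAC import, the second Seq() argument and .alphabet.
    from Bio.Seq import Seq
    from Bio.Alphabet import IUPAC

    dna_sequence = Seq('AGGCTTCTCGTA', IUPAC.unambiguous_dna)
    print(dna_sequence)
    print(dna_sequence.alphabet)
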
quickstart: parsing sequences
Simple example:

    from Bio import SeqIO

    # SeqIO.parse(file, format): the first argument is the file name (or handle),
    # the second is the file format.
    for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
        print(seq_record.id)
        print(repr(seq_record.seq))
        print(len(seq_record))

sequence objects
sequences work like strings

    # Requires Biopython < 1.78 (Bio.Alphabet was removed in 1.78).
    from Bio.Seq import Seq
    from Bio.Alphabet import IUPAC

    # Seq(sequence, alphabet)
    dna_sequence = Seq('AGGCTTCTCGTA', IUPAC.unambiguous_dna)

    # Sequences work like strings
    for index, letter in enumerate(dna_sequence):
        print("%i %s" % (index, letter))

    # Slicing of sequences
    print(dna_sequence[2:7])

    # Striding of sequences
    print(dna_sequence[0::3])
    print(dna_sequence[1::3])

    # Turning sequences into strings
    my_seq = str(dna_sequence) + "ATTAATTG"
    fasta_format_string = ">Name\n%s\n" % my_seq
    print(fasta_format_string)

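The slicing and striding calls above can be tried on a plain Python string, because Seq objects follow the same indexing rules. This is a minimal sketch without Biopython; `codon_positions` is a hypothetical helper name introduced here only for illustration.

```python
dna = "AGGCTTCTCGTA"


def codon_positions(sequence):
    """Return the first, second and third codon positions as three strings."""
    # A stride of 3 picks every third letter, starting at offset 0, 1 or 2.
    return sequence[0::3], sequence[1::3], sequence[2::3]


first, second, third = codon_positions(dna)
print(dna[2:7])  # GCTTC
print(first)     # ACCG
print(second)    # GTTT
print(third)     # GTCA
```

With a real Seq object the same slices return Seq objects rather than strings.
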
sequence objects
making complements, making mRNA

    # Requires Biopython < 1.78 (Bio.Alphabet was removed in 1.78).
    from Bio.Seq import Seq
    from Bio.Alphabet import IUPAC

    my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
    print(my_seq)

    # Making complements
    print(my_seq.complement())
    print(my_seq.reverse_complement())

    # Making mRNA: transcribe the coding strand (T is replaced by U)
    messenger_rna = my_seq.transcribe()
    print(messenger_rna)

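To make the three operations above concrete, here is a plain-Python sketch of what they compute for unambiguous DNA. It illustrates the results only and is not how Biopython implements them; the function names mirror the Seq methods purely for readability.

```python
# Translation table mapping each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(dna):
    """Swap every base for its partner, keeping the order."""
    return dna.translate(COMPLEMENT)


def reverse_complement(dna):
    """Complement the sequence, then read it backwards."""
    return complement(dna)[::-1]


def transcribe(dna):
    """Turn a coding-strand DNA string into mRNA by replacing T with U."""
    return dna.replace("T", "U")


my_seq = "GATCGATGGGCCTATATAGGATCGAAAATCGC"
print(complement(my_seq))          # CTAGCTACCCGGATATATCCTAGCTTTTAGCG
print(reverse_complement(my_seq))  # GCGATTTTCGATCCTATATAGGCCCATCGATC
print(transcribe(my_seq))          # GAUCGAUGGGCCUAUAUAGGAUCGAAAAUCGC
```

This sketch handles only upper-case A, C, G and T; ambiguity codes and lower-case letters are left unchanged by `complement`.
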
sequence objects
translation

    # Requires Biopython < 1.78 (Bio.Alphabet was removed in 1.78).
    from Bio.Seq import Seq
    from Bio.Alphabet import IUPAC

    # Translation from mRNA
    messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", IUPAC.unambiguous_rna)
    print(messenger_rna)
    print(messenger_rna.translate())

    # Translation directly from coding DNA
    coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
    print(coding_dna.translate())

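Translation reads the sequence three letters at a time and looks each codon up in a genetic code table. The sketch below does this in plain Python with the standard genetic code, written in the compact form used by NCBI (bases ordered T, C, A, G). It illustrates the idea and is not Biopython's implementation; stop codons are shown as `*` and translation continues past them.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TTT, TTC, TTA, TTG, TCT, ... order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence):
    """Translate unambiguous DNA or RNA into a protein string ('*' marks a stop)."""
    dna = sequence.upper().replace("U", "T")
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


print(translate("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG"))  # MAIVMGR*KGAR*
print(translate("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"))  # MAIVMGR*KGAR*
```

A codon containing anything other than A, C, G, T or U raises a KeyError in this sketch.
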
SeqRecord object
- .seq: the sequence itself, typically a Seq object
- .id: primary ID, a string
- .name: common name, a string
- .description: human-readable description, a string
- .letter_annotations: per-letter annotations, held in a (restricted) dictionary whose values are Python sequences with the same length as the sequence
- .annotations: additional information, a dictionary
- .features: a list of SeqFeature objects with more structured information about the features on a sequence (e.g. position of genes on a genome, or domains on a protein sequence)
- .dbxrefs: database cross-references, a list of strings

SeqRecord object
From scratch:

    from Bio.Seq import Seq
    from Bio.SeqRecord import SeqRecord

    simple_seq = Seq("GATC")
    simple_seq_r = SeqRecord(simple_seq)
    simple_seq_r.id = "1234"
    simple_seq_r.description = "Made up sequence"
    print(simple_seq_r)

Reading the information from a file:

    from Bio import SeqIO

    # SeqIO.read expects the file to contain exactly one record
    record = SeqIO.read("NC_005816.fna", "fasta")
    print(record)

Sequence I/O
Parsing from file

    from Bio import SeqIO

    # SeqIO.parse(handle, format): a file name or handle, then the file format
    for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
        print(seq_record.id)
        print(repr(seq_record.seq))
        print(len(seq_record))

Or, because SeqIO.parse returns an iterator, using a list comprehension:

    identifiers = [seq_record.id for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta")]
    print(identifiers)

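To show what parsing the "fasta" format involves, here is a minimal plain-Python sketch of a FASTA reader. It is not Biopython's parser: `parse_fasta` is a hypothetical name, it yields simple (id, sequence) tuples instead of SeqRecord objects, and the two records are made up for the example.

```python
import io


def parse_fasta(handle):
    """Yield (record_id, sequence) tuples from a FASTA-formatted text handle."""
    record_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if record_id is not None:
                yield record_id, "".join(chunks)
            words = line[1:].split()
            # The ID is the first word of the header line.
            record_id, chunks = (words[0] if words else ""), []
        elif record_id is not None:
            chunks.append(line)  # sequence lines may be wrapped over several lines
    if record_id is not None:
        yield record_id, "".join(chunks)


# Made-up example data standing in for a file such as ls_orchid.fasta.
example = io.StringIO(
    ">seq1 made-up record\nGATC\nGATC\n>seq2 another made-up record\nAGGCTT\n"
)
records = list(parse_fasta(example))
identifiers = [record_id for record_id, _ in records]
print(identifiers)  # ['seq1', 'seq2']
```

Like SeqIO.parse, the function is a generator, so records are produced one at a time instead of loading the whole file first.
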
Sequence I/O
Parsing from the web

    from Bio import Entrez
    from Bio import SeqIO

    # NCBI asks for a contact e-mail address with every Entrez request
    Entrez.email = "A.N.Other@example.com"
    handle = Entrez.efetch(db="nucleotide", rettype="fasta", retmode="text", id="6273291")
    seq_record = SeqIO.read(handle, "fasta")
    handle.close()
    print("%s with %i features" % (seq_record.id, len(seq_record.features)))

Sequence I/O
How to find sequence information

    from Bio import SeqIO

    orchid_dict = SeqIO.to_dict(SeqIO.parse("ls_orchid.fasta", "fasta"))

This creates a Python dictionary, keyed by record ID, with each entry held as a SeqRecord object in memory.

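The dictionary lookup can be sketched in plain Python as follows. `records_to_dict` is a hypothetical stand-in for the idea behind SeqIO.to_dict, working on made-up (id, sequence) tuples rather than SeqRecord objects. Like SeqIO.to_dict, it raises a ValueError on a duplicate ID instead of silently overwriting an entry.

```python
def records_to_dict(records):
    """Index (record_id, sequence) pairs by ID, refusing duplicate IDs."""
    indexed = {}
    for record_id, sequence in records:
        if record_id in indexed:
            raise ValueError("Duplicate key '%s'" % record_id)
        indexed[record_id] = sequence
    return indexed


# Made-up records for illustration.
example_records = [("seq1", "GATCGATC"), ("seq2", "AGGCTT")]
example_dict = records_to_dict(example_records)

print(sorted(example_dict))  # ['seq1', 'seq2']
print(example_dict["seq2"])  # AGGCTT
```

Because every record is held in memory, this approach suits small and medium-sized files.
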