Sequence I/O How to find sequence information from Bio import SeqIO

Sequence I/O How to find sequence information from Bio import SeqIO
orchid_dict = SeqIO.to_dict(SeqIO.parse("ls_orchid.fasta", ”fasta")) creates Python dictionary with each entry held as a SeqRecord object in memory

Sequence I/O Writing a file Creating seqRec writing to file
from Bio.Seq import Seq from Bio.SeqRecord import SeqRecord from Bio.Alphabet import generic_protein rec1 = SeqRecord(Seq("MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD" \ +"GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK" \ +"NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM" \ +"SSAC", generic_protein), id="gi| |gb|AAK |AF376133_1", description="chalcone synthase [Cucumis sativus]") rec2 = SeqRecord(Seq("YPDYYFRITNREHKAELKEKFQRMCDKSMIKKRYMYLTEEILKENPSMCEYMAPSLDARQ" \ +"DMVVVEIPKLGKEAAVKAIKEWGQ", generic_protein), id="gi| |gb|AAK |", description="chalcone synthase [Fragaria vesca subsp. bracteata]") my_records = [rec1, rec2, rec3] SeqIO.write(my_records, "my_example.faa", "fasta") Creating seqRec writing to file

Pairwise Alignment - DNA
Seq 1 A C T T G A Seq 2 A G T T A G A Seq 1 A C T T G A Seq 2 A G T T A G A Single Nucleotide Polymorphism (SNP) Seq 1 A C T T - G A Seq 2 A G T T A G A C G insertion/deletion (indel) - A

Protein sequences GDA P+H++ + GPV Seq 1 GDAFPLHKLEL-GPV
Seq 2 GDASPVHQMTMKGPV identity = the residues are identical similarity = the residues are physio-chemically similar eg hydrophobic A L I V F M or not = gap or different

Describing similarity
The sequences are identical 100% similarity The sequences are similar (protein) there is some level of similarity this can range from 99% to 45% even 20% use of adjectives: highly, significantly, somewhat The sequences are homologous implies an evolutionary relationship ie they have a common ancestor requires knowledge of protein or species phylogeny

Alignment algorithms GLOBAL ALIGNMENT
Considers similarity across the entire length of the sequences LOCAL ALIGNMENT Looks at regions of similarity within the sequences, substrings. These are usually biologically meaningful motifs or domains

Original manual method
Start with a 2D matrix on which you put the query sequence (I) on the vertical, and the test sequence (J) on the horizontal axis. Find the common k-words. G D I H F V GDIHIF GDVHIF

Dotplot of similarity G D I H F V
Finding the diagonal. Visualization of similarity

Doesn’t scale well GDIHAL GRVHIF
- Visualization of similarity not so good with distantly related sequences - No statistical/numerical measure of similarity - Slow +

Computational considerations
Sequence orientation which strand? Character substitutions variations Physio-chemical properties protein structure => Biological meaning? Need a system whereby substitutions and gaps are weighted. Different types of scoring systems can be used with different algorithms – The main problem is computational time.

Comparing Strings Called “edit distance” measurements. Hamming distance: the number of different characters between strings of equal lengths; only allows substitutions. Levenshtein distance: the strings do not have to be the same length; allows insertions, deletions, substitutions. Applied to automated spelling corrections…

Dynamic programming A method of solving a large problem by dividing it up into smaller subproblems and solving those, then putting them together to solve the larger problem. Can often have more than one way of aligning two sequences, dynamic programming allows backtracking and testing of various options. The path that most effectively links together the subproblems into an optimal solution is selected as the final alignment. Needs to be based on statistics of scoring system and probability thresholds set May still end up with a few solutions of equal scores – Ultimately need a biologically relevant answer.

Dynamic programing A recursive algorithm
To find an LCS of X and Y , we may need to find the LCSs of X and Yn-1 and of Xm-1 and Y . Furthermore, each of these subproblems has the subsubproblem of finding an LCS of Xm-1 and Yn-1. c[i,j] is the length of an LCS of the sequences Xi and Yj. The optimal substructure of the LCS problem gives the recursive formula if i = 0 or j = 0 if i, j > 0 and xi = yj if i, j > 0 and xi ≠ yj

Dynamic programing AB C BDAB BDCAB A BCBA

Backtracking AB C BDAB BDCAB A BCBA

Step 3: Computing the length of a LCS
AB C BDAB BDCAB A BCBA CSC317

AB C BDAB BCBA BDCAB A CSC317
Step 4: Constructing a LCS (Backtracking) AB C BDAB BDCAB A BCBA CSC317

Putting it all together
Create a matrix - as in dotplot but “virtual” Calculate the match/mismatch score step by step as you go along – how many are too much? Then backtrack and do it again – iterative, to find optimal score

Needleman and Wunsch Global alignment algorithm for sequence comparison. Biased towards finding similarity, rather than difference. Based on 2D matrix, gives a maximum match; ie the largest number of residues (n or aa) of one sequence that can be matched with another, allowing for all possible deletions and substitutions. Begins at beginning of sequence and ends at the end.

Smith-Waterman algorithm
Local alignment algorithm which finds shorter, locally similar regions (substrings). It is more sensitive to weaker sequence similarities, and to finding conserved motifs This is also a matrix-based approach, but each cell in the matrix defines the end of a potential alignment. It is the basis for many subsequent alignment algorithms, including BLAST.

Sequence alignments For pairwise alignments Biopython contains the Bio.pairwise2 module What you would do: 1.) Prepare an input file of your unaligned sequences, typically this will be a FASTA file which you might create using Bio.SeqIO. 2.) Call the command line tool to process this input file, typically via one of Biopython’s command line wrappers (which we’ll discuss here). 3.) Read the output from the tool, i.e. your aligned sequences, typically using Bio.AlignIO..

Pairwise sequence alignments
contains essentially the same algorithms as water (local) and needle (global) from Bio import pairwise2 from Bio import SeqIO seq1 = SeqIO.read("alpha.faa", "fasta”) seq2 = SeqIO.read("beta.faa", "fasta") alignments = pairwise2.align.globalxx(seq1.seq, seq2.seq) print(pairwise2.format_alignment(*alignments[0])) global alignment 2 letter code (scoring matches, gaps) x … match scores 1, x … no cost for gaps m ... general values, s ... costs for gaps

Pairwise sequence alignments
from Bio import pairwise2 from Bio import SeqIO rom Bio.SubsMat.MatrixInfo import blosum62 seq1 = SeqIO.read("alpha.faa", "fasta”) seq2 = SeqIO.read("beta.faa", "fasta") alignments = pairwise2.align.localds(seq1.seq, seq2.seq, blosum62, -10, -0.5) print(pairwise2.format_alignment(*alignments[0])) costs for opening, extending gaps substitution matrix, Blosum, PAM local alignment

Problem with N-W and S-W
They are exhaustive, with stringent statistical thresholds As the databases got larger, these began to take too long – computational constraints Need a program that runs an algorithm that is based on dynamic programming (substrings) but uses a heuristic approach (takes assumptions, quicker, not as refined). Therefore can deal with comparing a sequence against 100 million others in a relatively short time… the bigger the database, the more optimal the results.

Two heuristic approaches
Each is based on different assumptions and calculations of probability thresholds (ie the statistical significance of results) FASTA All proteins in the db are equally likely to be related to the query  probability multiplied by the number of sequences in database 2. BLAST Query more likely to be related to a sequence its own length or shorter  probability multiplied by N/n N: total number of residues in database n: length of subject sequence

Basic Local Alignment Search Tool
The BLAST algorithm Basic Local Alignment Search Tool Breaks down your sequence into smaller segments (words) Does the same for all sequences in the database Looks for exact matches, word by word, and expands those up- and down- stream one base at a time, allowing for a certain number of mismatches Stops when sequence ends or statistical significance becomes too low (too many mismatches) Can find more than one area of similarity between two sequences

BLAST Using online BLAST from Bio.Blast import NCBIWWW
database type from Bio.Blast import NCBIWWW result_handle = NCBIWWW.qblast("blastn", "nt", " ") Blast program sequence or from Bio.Blast import NCBIWWW from Bio import SeqIO record = SeqIO.read("m_cold.fasta", format="fasta") result_handle = NCBIWWW.qblast("blastn", "nt", record.seq) result_handle = NCBIWWW.qblast("blastn", "nt", record.format(“fasta”) Just sequence whole seq. information

BLAST Using local BLAST - First we create a command line prompt (for example): blastn -query opuntia.fasta -db nr -out opuntia.xml -evalue 0.001 - then we us the BLAST Python wrappers from BioPython: evalue threshold Blast program query sequence output file database from Bio.Blast.Applications import NcbiblastxCommandline blastn_exe = r”path to blastn" blastn_cline = NcbiblastnCommandline(query="opuntia.fasta", db="nr", evalue=0.001, out="opuntia.xml”) blastn_cline() use Bio.Blast.NCBIXML.parse() to parse results in the output file

Algorithm evolution Smith Waterman Local alignment algorithm – finds small, locally similar regions (substrings), matrix-based, each cell in the matrix defined the end of a potential alignment. BLAST – start with highest scoring short pairs and extend and down the sequence. Great, but when you’re talking about millions of reads…

Sequence I/O How to find sequence information from Bio import SeqIO

Similar presentations

Presentation on theme: "Sequence I/O How to find sequence information from Bio import SeqIO"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Sequence I/O How to find sequence information from Bio import SeqIO

Similar presentations

Presentation on theme: "Sequence I/O How to find sequence information from Bio import SeqIO"— Presentation transcript:

Similar presentations

About project

Feedback