Sequence I/O How to find sequence information from Bio import SeqIO

Slides:



Advertisements
Similar presentations
Sequence Alignments with Indels Evolution produces insertions and deletions (indels) – In addition to substitutions Good example: MHHNALQRRTVWVNAY MHHALQRRTVWVNAY-
Advertisements

Blast outputoutput. How to measure the similarity between two sequences Q: which one is a better match to the query ? Query: M A T W L Seq_A: M A T P.
Alignment methods Introduction to global and local sequence alignment methods Global : Needleman-Wunch Local : Smith-Waterman Database Search BLAST FASTA.
BLAST Sequence alignment, E-value & Extreme value distribution.
Local alignments Seq X: Seq Y:. Local alignment  What’s local? –Allow only parts of the sequence to match –Results in High Scoring Segments –Locally.
C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E Alignments 1 Sequence Analysis.
Sequence Similarity Searching Class 4 March 2010.
Sequence Alignment Storing, retrieving and comparing DNA sequences in Databases. Comparing two or more sequences for similarities. Searching databases.
Heuristic alignment algorithms and cost matrices
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez June 23, 2005.
Alignment methods and database searching April 14, 2005 Quiz#1 today Learning objectives- Finish Dotter Program analysis. Understand how to use the program.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez June 23, 2004.
Sequence similarity.
Alignment methods June 26, 2007 Learning objectives- Understand how Global alignment program works. Understand how Local alignment program works.
Similar Sequence Similar Function Charles Yan Spring 2006.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez May 20, 2003.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Bioinformatics Workshop, Fall 2003 Algorithms in Bioinformatics Lawrence D’Antonio Ramapo College of New Jersey.
Dynamic Programming. Pairwise Alignment Needleman - Wunsch Global Alignment Smith - Waterman Local Alignment.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez May 10, 2005.
Sequence alignment, E-value & Extreme value distribution
From Pairwise Alignment to Database Similarity Search.
Alignment methods II April 24, 2007 Learning objectives- 1) Understand how Global alignment program works using the longest common subsequence method.
TM Biological Sequence Comparison / Database Homology Searching Aoife McLysaght Summer Intern, Compaq Computer Corporation Ballybrit Business Park, Galway,
Developing Pairwise Sequence Alignment Algorithms
Pairwise Alignment How do we tell whether two sequences are similar? BIO520 BioinformaticsJim Lund Assigned reading: Ch , Ch 5.1, get what you can.
Pairwise alignments Introduction Introduction Why do alignments? Why do alignments? Definitions Definitions Scoring alignments Scoring alignments Alignment.
Pairwise & Multiple sequence alignments
Space-Efficient Sequence Alignment Space-Efficient Sequence Alignment Bioinformatics 202 University of California, San Diego Lecture Notes No. 7 Dr. Pavel.
An Introduction to Bioinformatics
. Sequence Alignment and Database Searching 2 Biological Motivation u Inference of Homology  Two genes are homologous if they share a common evolutionary.
Protein Sequence Alignment and Database Searching.
Sequence Alignment Goal: line up two or more sequences An alignment of two amino acid sequences: …. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Pairwise Sequence Alignment (II) (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 27, 2005 ChengXiang Zhai Department of Computer Science University.
Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.
BLAST Anders Gorm Pedersen & Rasmus Wernersson. Database searching Using pairwise alignments to search databases for similar sequences Database Query.
Function preserves sequences Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.
Chapter 3 Computational Molecular Biology Michael Smith
Applied Bioinformatics Week 3. Theory I Similarity Dot plot.
Biocomputation: Comparative Genomics Tanya Talkar Lolly Kruse Colleen O’Rourke.
Sequence Alignments with Indels Evolution produces insertions and deletions (indels) – In addition to substitutions Good example: MHHNALQRRTVWVNAY MHHALQRRTVWVNAY-
Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
Sequence Alignment.
Construction of Substitution matrices
DNA, RNA and protein are an alien language
Step 3: Tools Database Searching
Sequence Alignment. Assignment Read Lesk, Problem: Given two sequences R and S of length n, how many alignments of R and S are possible? If you.
Techniques for Protein Sequence Alignment and Database Searching G P S Raghava Scientist & Head Bioinformatics Centre, Institute of Microbial Technology,
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Database Scanning/Searching FASTA/BLAST/PSIBLAST G P S Raghava.
Bioinformatics for Research
Problem with N-W and S-W
The ideal approach is simultaneous alignment and tree estimation.
BLAST Anders Gorm Pedersen & Rasmus Wernersson.
Sequence comparison: Local alignment
Least common subsequence:
Biology 162 Computational Genetics Todd Vision Fall Aug 2004
Fast Sequence Alignments
Pairwise sequence Alignment.
Pairwise Sequence Alignment
BCB 444/544 Lecture 7 #7_Sept5 Global vs Local Alignment
Basic Local Alignment Search Tool (BLAST)
Pairwise Alignment Global & local alignment
Bioinformatics Lecture 2 By: Dr. Mehdi Mansouri
Basic Local Alignment Search Tool
Basic Local Alignment Search Tool (BLAST)
Sequence alignment, E-value & Extreme value distribution
Presentation transcript:

Sequence I/O How to find sequence information from Bio import SeqIO orchid_dict = SeqIO.to_dict(SeqIO.parse("ls_orchid.fasta", ”fasta")) creates Python dictionary with each entry held as a SeqRecord object in memory

Sequence I/O Writing a file Creating seqRec writing to file from Bio.Seq import Seq from Bio.SeqRecord import SeqRecord from Bio.Alphabet import generic_protein rec1 = SeqRecord(Seq("MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD" \ +"GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK" \ +"NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM" \ +"SSAC", generic_protein), id="gi|14150838|gb|AAK54648.1|AF376133_1", description="chalcone synthase [Cucumis sativus]") rec2 = SeqRecord(Seq("YPDYYFRITNREHKAELKEKFQRMCDKSMIKKRYMYLTEEILKENPSMCEYMAPSLDARQ" \ +"DMVVVEIPKLGKEAAVKAIKEWGQ", generic_protein), id="gi|13919613|gb|AAK33142.1|", description="chalcone synthase [Fragaria vesca subsp. bracteata]") my_records = [rec1, rec2, rec3] SeqIO.write(my_records, "my_example.faa", "fasta") Creating seqRec writing to file

Pairwise Alignment - DNA Seq 1 A C T T G A Seq 2 A G T T A G A Seq 1 A C T T G A Seq 2 A G T T A G A Single Nucleotide Polymorphism (SNP) Seq 1 A C T T - G A Seq 2 A G T T A G A C G insertion/deletion (indel) - A

Protein sequences GDA P+H++ + GPV Seq 1 GDAFPLHKLEL-GPV Seq 2 GDASPVHQMTMKGPV identity = the residues are identical similarity = the residues are physio-chemically similar eg hydrophobic A L I V F M or not = gap or different

Describing similarity The sequences are identical 100% similarity The sequences are similar (protein) there is some level of similarity this can range from 99% to 45% even 20% use of adjectives: highly, significantly, somewhat The sequences are homologous implies an evolutionary relationship ie they have a common ancestor requires knowledge of protein or species phylogeny

Alignment algorithms GLOBAL ALIGNMENT Considers similarity across the entire length of the sequences LOCAL ALIGNMENT Looks at regions of similarity within the sequences, substrings. These are usually biologically meaningful motifs or domains

Original manual method Start with a 2D matrix on which you put the query sequence (I) on the vertical, and the test sequence (J) on the horizontal axis. Find the common k-words. G D I H F V GDIHIF GDVHIF https://www.youtube.com/watch?v=_yczdPm3Kuw

Dotplot of similarity G D I H F V Finding the diagonal. Visualization of similarity

Doesn’t scale well GDIHAL GRVHIF - Visualization of similarity not so good with distantly related sequences - No statistical/numerical measure of similarity - Slow +

Computational considerations Sequence orientation which strand? Character substitutions variations Physio-chemical properties protein structure => Biological meaning? Need a system whereby substitutions and gaps are weighted. Different types of scoring systems can be used with different algorithms – The main problem is computational time.

Comparing Strings Called “edit distance” measurements. Hamming distance: the number of different characters between strings of equal lengths; only allows substitutions. Levenshtein distance: the strings do not have to be the same length; allows insertions, deletions, substitutions. Applied to automated spelling corrections…

Dynamic programming A method of solving a large problem by dividing it up into smaller subproblems and solving those, then putting them together to solve the larger problem. Can often have more than one way of aligning two sequences, dynamic programming allows backtracking and testing of various options. The path that most effectively links together the subproblems into an optimal solution is selected as the final alignment. Needs to be based on statistics of scoring system and probability thresholds set May still end up with a few solutions of equal scores – Ultimately need a biologically relevant answer.

Dynamic programing A recursive algorithm To find an LCS of X and Y , we may need to find the LCSs of X and Yn-1 and of Xm-1 and Y . Furthermore, each of these subproblems has the subsubproblem of finding an LCS of Xm-1 and Yn-1. c[i,j] is the length of an LCS of the sequences Xi and Yj. The optimal substructure of the LCS problem gives the recursive formula if i = 0 or j = 0 if i, j > 0 and xi = yj if i, j > 0 and xi ≠ yj

Dynamic programing AB C BDAB BDCAB A BCBA

Backtracking AB C BDAB BDCAB A BCBA

Step 3: Computing the length of a LCS AB C BDAB BDCAB A BCBA CSC317

AB C BDAB BCBA BDCAB A CSC317 Step 4: Constructing a LCS (Backtracking) AB C BDAB BDCAB A BCBA CSC317

Putting it all together Create a matrix - as in dotplot but “virtual” Calculate the match/mismatch score step by step as you go along – how many are too much? Then backtrack and do it again – iterative, to find optimal score

Needleman and Wunsch Global alignment algorithm for sequence comparison. Biased towards finding similarity, rather than difference. Based on 2D matrix, gives a maximum match; ie the largest number of residues (n or aa) of one sequence that can be matched with another, allowing for all possible deletions and substitutions. Begins at beginning of sequence and ends at the end. http://en.wikipedia.org/wiki/Needleman–Wunsch_algorithm

Smith-Waterman algorithm Local alignment algorithm which finds shorter, locally similar regions (substrings). It is more sensitive to weaker sequence similarities, and to finding conserved motifs This is also a matrix-based approach, but each cell in the matrix defines the end of a potential alignment. It is the basis for many subsequent alignment algorithms, including BLAST.

Sequence alignments For pairwise alignments Biopython contains the Bio.pairwise2 module What you would do: 1.) Prepare an input file of your unaligned sequences, typically this will be a FASTA file which you might create using Bio.SeqIO. 2.) Call the command line tool to process this input file, typically via one of Biopython’s command line wrappers (which we’ll discuss here). 3.) Read the output from the tool, i.e. your aligned sequences, typically using Bio.AlignIO..

Pairwise sequence alignments contains essentially the same algorithms as water (local) and needle (global) from Bio import pairwise2 from Bio import SeqIO seq1 = SeqIO.read("alpha.faa", "fasta”) seq2 = SeqIO.read("beta.faa", "fasta") alignments = pairwise2.align.globalxx(seq1.seq, seq2.seq) print(pairwise2.format_alignment(*alignments[0])) global alignment 2 letter code (scoring matches, gaps) x … match scores 1, x … no cost for gaps m ... general values, s ... costs for gaps

Pairwise sequence alignments from Bio import pairwise2 from Bio import SeqIO rom Bio.SubsMat.MatrixInfo import blosum62 seq1 = SeqIO.read("alpha.faa", "fasta”) seq2 = SeqIO.read("beta.faa", "fasta") alignments = pairwise2.align.localds(seq1.seq, seq2.seq, blosum62, -10, -0.5) print(pairwise2.format_alignment(*alignments[0])) costs for opening, extending gaps substitution matrix, Blosum, PAM local alignment

Problem with N-W and S-W They are exhaustive, with stringent statistical thresholds As the databases got larger, these began to take too long – computational constraints Need a program that runs an algorithm that is based on dynamic programming (substrings) but uses a heuristic approach (takes assumptions, quicker, not as refined). Therefore can deal with comparing a sequence against 100 million others in a relatively short time… the bigger the database, the more optimal the results.

Two heuristic approaches Each is based on different assumptions and calculations of probability thresholds (ie the statistical significance of results) FASTA All proteins in the db are equally likely to be related to the query  probability multiplied by the number of sequences in database 2. BLAST Query more likely to be related to a sequence its own length or shorter  probability multiplied by N/n N: total number of residues in database n: length of subject sequence

Basic Local Alignment Search Tool The BLAST algorithm Basic Local Alignment Search Tool Breaks down your sequence into smaller segments (words) Does the same for all sequences in the database Looks for exact matches, word by word, and expands those up- and down- stream one base at a time, allowing for a certain number of mismatches Stops when sequence ends or statistical significance becomes too low (too many mismatches) Can find more than one area of similarity between two sequences

BLAST Using online BLAST from Bio.Blast import NCBIWWW database type from Bio.Blast import NCBIWWW result_handle = NCBIWWW.qblast("blastn", "nt", "8332116") Blast program sequence or from Bio.Blast import NCBIWWW from Bio import SeqIO record = SeqIO.read("m_cold.fasta", format="fasta") result_handle = NCBIWWW.qblast("blastn", "nt", record.seq) result_handle = NCBIWWW.qblast("blastn", "nt", record.format(“fasta”) Just sequence whole seq. information

BLAST Using local BLAST http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download - First we create a command line prompt (for example): blastn -query opuntia.fasta -db nr -out opuntia.xml -evalue 0.001 - then we us the BLAST Python wrappers from BioPython: evalue threshold Blast program query sequence output file database from Bio.Blast.Applications import NcbiblastxCommandline blastn_exe = r”path to blastn" blastn_cline = NcbiblastnCommandline(query="opuntia.fasta", db="nr", evalue=0.001, out="opuntia.xml”) blastn_cline() use Bio.Blast.NCBIXML.parse() to parse results in the output file

Algorithm evolution Smith Waterman Local alignment algorithm – finds small, locally similar regions (substrings), matrix-based, each cell in the matrix defined the end of a potential alignment. BLAST – start with highest scoring short pairs and extend and down the sequence. Great, but when you’re talking about millions of reads…