Stand-alone tools 2. 1.Download the zip file to the GMS6014 folder. 2.Unzip the files to a folder named “clustalx”. 3.Edit the MDM2_isoforms_5.fasta file.

Presentation transcript:

Stand-alone tools 2. Practice – the ClustalX application.
1. Download the zip file to the GMS6014 folder.
2. Unzip the files to a folder named “clustalx”.
3. Edit the MDM2_isoforms_5.fasta file with WordPad and save.
4. Run the .exe file.
5. Load the sequence file, select sequences, and perform the alignment.
6. Write the alignment to a PS file.

Stand-alone tools 3. Command-line applications:
- Account for a large number of high-quality, sophisticated programs.
Practice – (install and) run standalone BLAST on your own computer.

Pet projects: searching for a potential ortholog of the oncogene MDM2 in the fruit fly genome.

Practice – Install the BLAST program (1)
1. Download the BLAST executable file and save it in a folder, such as c:\GMS6014\blast\
2. Run the installation program by double-clicking it. Inspect the folder following installation.
3. Add three more folders to your /blast directory: “/query”, “/dbs”, and “/out”.

Practice – Install the BLAST program (2)
5. Inspect the contents of the doc, data, and bin folders. Move the programs from blast\bin to the blast folder.
6. Open a command (cmd) window by typing “cmd” in the Start > Run box.
7. Go to the blast folder by typing “cd C:\GMS6014\blast”.
8. Try to run the program by typing “blastall”, and read the output.

Practice – BLAST search on your own computer
1. Download the data file from the course web page, or from Ensembl. Save it in the blast\dbs folder.
2. Start a CMD window and navigate to the C:\GMS6014\blast folder.
3. At the prompt “C:\GMS6014\blast >” type the command “formatdb -i dbs\Dm.P -p T” to format the dataset for the program.
4. Compose the query sequence and save it as “3TNF.txt” in the “blast\query\” folder.
5. Initiate the search by typing “blastall -p blastp -d dbs\Dm.P -i query\4_MMD2.fasta -o out\Mdm2_DmP.html -T T”

What’s in a command?
formatdb -i dbs\Dm.P -p T
- formatdb: the program; formats the database for searching.
- -i dbs\Dm.P: "feed me the input file name".
- -p T: "tell me, is it a protein sequence file?" (T = true).
For more info, refer to the “user manual” file in the blast\doc folder.
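The breakdown above can be mirrored in a short Python sketch that assembles the same `formatdb` call from its parts. The function name `formatdb_command` and its parameters are made up for this illustration; only the `-i` and `-p` flags come from the slide, and `-p F` for a nucleotide file is the usual counterpart of `-p T` in legacy BLAST.

```python
def formatdb_command(fasta_path, is_protein=True):
    """Build the argument list for a legacy formatdb call.

    fasta_path: input FASTA file, passed with -i
    is_protein: True gives "-p T" (protein), False gives "-p F" (nucleotide)
    """
    return ["formatdb", "-i", fasta_path, "-p", "T" if is_protein else "F"]


# Rebuild the command from the slide (Windows-style relative path).
command = formatdb_command(r"dbs\Dm.P")
print(" ".join(command))
```

Keeping the command as a list of arguments, rather than one string, makes it easy to hand to `subprocess.run` later without worrying about quoting.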

Advantages of Running BLAST on Your Own Machine
- Do it at any time, no waiting on the line.
- Search for multiple sequences at once.
- Search a defined data set.
- Automate BLAST analysis.
- Combine BLAST with other analysis.
- ...

BLAST is a program implemented in C/C++

    void BlastTickProc(Int4 sequence_number, BlastThrInfoPtr thr_info)
    {
        if (thr_info->tick_callback &&
            (sequence_number > (thr_info->last_db_seq + thr_info->db_incr))) {
            NlmMutexLockEx(&thr_info->callback_mutex);
            thr_info->last_db_seq += thr_info->db_incr;
            thr_info->tick_callback(sequence_number, thr_info->number_of_pos_hits);
            thr_info->last_tick = Nlm_GetSecs();
            NlmMutexUnlock(thr_info->callback_mutex);
        }
        return;
    }

    /* Sends out a message every PERIOD (i.e., 60 secs.) for the index.
       This function runs as a separate thread and only runs on a threaded platform. */

Should I care?

If you care: 1.) Data structure and algorithm
Data structure SEQ:
- char: name
- char: sequence
- int: seq_length
Algorithm: identify the best alignment for two sequences (p69-73)
Seq1: MA-DSV--WC..
Seq2: MALD-IHWS..
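As a rough illustration of the SEQ data structure, here is a minimal Python sketch. The field names follow the slide; the `read_fasta` helper and the choice to derive `seq_length` from the sequence are assumptions made for this example, not part of the original course material.

```python
from dataclasses import dataclass


@dataclass
class SEQ:
    """One sequence record: a name plus the residue string."""
    name: str
    sequence: str

    @property
    def seq_length(self) -> int:
        # Derived from the sequence so the two can never disagree.
        return len(self.sequence)


def read_fasta(text):
    """Parse FASTA-formatted text into a list of SEQ records."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line closes the previous record, if any.
            if name is not None:
                records.append(SEQ(name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append(SEQ(name, "".join(chunks)))
    return records


records = read_fasta(">Query\nMATWL\n>Seq_A\nMAT\nPP\n")
for rec in records:
    print(rec.name, rec.sequence, rec.seq_length)
```

In C the length is stored as a separate `int` because computing it means scanning the character array; in Python `len()` is cheap, so a computed property is the simpler choice.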

Programming language comparison

Translation: C

    /* TRANSLATION: 3 or 6 frame translate cDNA sequences */
    //
    #include "translation.hpp"

    int main(int argc, char **argv)
    {
        int num_seq = 0;
        char string[MAXLINE];
        DSEQ *dseq;

        infile.getline(string, MAXLINE);
        if (string[0] == '>')
            strncpy(dbname, string, MAXLINE);
        while (!infile.eof()) {
            dseq = Get_Lib_Seq();
            if (dseq->reverse == 0)
                Translation(&dseq->name[1], dseq->seq);
            else
                Translation(&dseq->name[1], dseq->r_seq);
            num_seq++;
            if (num_seq % 1000 == 0) {
                cout << num_seq << endl;
                cout << dseq->name << endl;
            }
            delete dseq;
        }
        infile.close();
        outfile.close();
        cout << num_seq << " translated" << endl;
        getch();
        return 0;
    }

    // Read the next sequence record from the library file.
    DSEQ* Get_Lib_Seq()
    {
        int i, n;
        char str[MAXLINE];
        DSEQ *dseq;

        n = 0;
        dseq = new DSEQ;
        strcpy(dseq->name, dbname);
        while (infile.getline(str, MAXLINE)) {
            if (str[0] == '>') {          // header of the next record
                strcpy(dbname, str);
                break;
            }
            for (i = 0; i < strlen(str); i++) {
                if (n == MAXSEQ) break;
                dseq->seq[n++] = str[i];
            }
        }
        dseq->seq[n] = '\0';
        if (n == MAXSEQ)
            cout << "WARNING: sequence " << dbname << " too long!" << endl;
        dseq->len = n;
        if (dseq->name[9] == '3')
            Reverse(dseq);
        else
            dseq->reverse = 0;
        return dseq;
    }

    // Build the reverse complement of dseq->seq in dseq->r_seq.
    void Reverse(DSEQ *dseq)
    {
        int i, j;
        j = 0;
        for (i = (dseq->len - 1); i >= 0; i--) {   // i >= 0 so the first base is included
            if (dseq->seq[i] == 'A' || dseq->seq[i] == 'a') dseq->r_seq[j++] = 'T';
            if (dseq->seq[i] == 'C' || dseq->seq[i] == 'c') dseq->r_seq[j++] = 'G';
            if (dseq->seq[i] == 'G' || dseq->seq[i] == 'g') dseq->r_seq[j++] = 'C';
            if (dseq->seq[i] == 'T' || dseq->seq[i] == 't') dseq->r_seq[j++] = 'A';
            if (dseq->seq[i] == 'N' || dseq->seq[i] == 'n') dseq->r_seq[j++] = 'N';
        }
        dseq->r_seq[j++] = '\0';
        dseq->reverse = 1;
    }

    // Translate seq in three forward frames and write FASTA records to outfile.
    void Translation(char name[], char seq[])
    {
        char ppseq[MAXSEQ/3 + 2];        // room for a partial last codon and the terminator
        for (int f = 0; f < 3; f++) {
            outfile << ">" << "F_" << f << name << endl;
            int j = 0;
            int len = strlen(seq);
            for (int i = f; i < len; i = i + 3)
                ppseq[j++] = Translate(&seq[i]);
            ppseq[j++] = '\0';
            int m = strlen(ppseq) / 50;  // output 50 aa per line
            for (int n = 0; n <= m; n++) {
                for (int i = n*50; i < 50*(n+1); i++) {
                    if (ppseq[i] == '\0') break;   // stop before writing the terminator
                    outfile << ppseq[i];
                }
                outfile << endl;
            }
        }
    }

    // Translate one codon into a one-letter amino acid code.
    char Translate(char s[])
    {
        int c1, c2, c3;
        char P, code[3];
        //***standard translation table, A(0), C(1), G(2), T(3)*****
        char table[4][4][4] =
            {{{'K','N','K','N'},{'T','T','T','T'},{'R','S','R','S'},{'I','I','M','I'}},
             {{'Q','H','Q','H'},{'P','P','P','P'},{'R','R','R','R'},{'L','L','L','L'}},
             {{'E','D','E','D'},{'A','A','A','A'},{'G','G','G','G'},{'V','V','V','V'}},
             {{'*','Y','*','Y'},{'S','S','S','S'},{'*','C','W','C'},{'L','F','L','F'}}};
        //*********** table2 for n at 3rd position********************
        char table2[4][4] = {{'X','T','X','X'},{'X','P','R','L'},
                             {'X','A','G','V'},{'X','S','X','X'}};
        strncpy(code, s, 3);
        c1 = Convert(code[0]);
        c2 = Convert(code[1]);
        c3 = Convert(code[2]);
        if (c1 >= 4 || c2 >= 4)
            P = 'X';    // can be optimized further here by considering....
        else {
            if (c3 >= 4) P = table2[c1][c2];
            else         P = table[c1][c2][c3];
        }
        return (P);
    }

    // Map a nucleotide letter to a table index; 4 = N, 5 = anything else.
    int Convert(char c)
    {
        char s = c;
        if (s == 'A' || s == 'a') return (0);
        if (s == 'C' || s == 'c') return (1);
        if (s == 'G' || s == 'g') return (2);
        if (s == 'T' || s == 't' || s == 'U' || s == 'u') return (3);
        if (s == 'N' || s == 'n') return (4);
        else return (5);
    }

Translation: Python

    #Translation -- read from fasta DNA file and translate into three frames
    #
    # Uses the Bio.Fasta and Bio.Tools.Translate modules from early Biopython releases.
    import string
    from Bio import Fasta
    from Bio.Tools import Translate
    from Bio.Alphabet import IUPAC
    from Bio.Seq import Seq

    ifile = "S:\\Seq\\test.fasta"
    parser = Fasta.RecordParser()
    file = open(ifile)
    iterator = Fasta.Iterator(file, parser)
    cur_rec = iterator.next()
    cur_seq = Seq(cur_rec.sequence, IUPAC.IUPACUnambiguousDNA())
    translator = Translate.unambiguous_dna_by_id[1]
    translator.translate(cur_seq)
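For readers who want to try the three-frame translation without any external library, the following is a minimal pure-Python sketch. It is not the course's script: the names `CODON_TABLE`, `translate`, and `three_frame_translation` are made up here, and unlike the C version it writes X for every codon that contains an ambiguous base instead of resolving an N in the third position.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, frame=0):
    """Translate one forward reading frame (0, 1, or 2); 'X' marks unknown codons."""
    dna = dna.upper().replace("U", "T")
    protein = []
    # Stop two bases before the end so only complete codons are read.
    for i in range(frame, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


def three_frame_translation(dna):
    """Return the translations of the three forward frames."""
    return [translate(dna, frame) for frame in range(3)]


for frame, protein in enumerate(three_frame_translation("ATGGCCACTTGGCTT")):
    print(f"F_{frame}", protein)
```

The example sequence was chosen so that frame 0 encodes M A T W L, the query used in the scoring slides later in the deck.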

Programming languages
- C/C++
- Java (BioJava)
- Python (Biopython)
- Perl (BioPerl)
Trade-off: efficiency and power at one end, simplicity and fast development at the other.

Observe: scripting is not that difficult
Example: Python and Biopython.
1. Simple Python scripts.
2. Batch BLAST with a Python script.
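A batch BLAST script could look roughly like the sketch below. It only reuses the `blastall` flags shown in the practice slide (`-p`, `-d`, `-i`, `-o`, `-T`); the function names, the second query file name, and the `out` folder default are assumptions for illustration. The part that actually launches BLAST is defined but not called, so the block runs even where BLAST is not installed.

```python
import subprocess
from pathlib import PureWindowsPath


def blastall_command(query_file, database, out_file, program="blastp"):
    """Build one legacy blastall call that writes HTML output (-T T)."""
    return ["blastall", "-p", program, "-d", database,
            "-i", query_file, "-o", out_file, "-T", "T"]


def batch_commands(query_files, database, out_dir="out"):
    """Build one command per query; each result goes to out_dir/<query name>.html."""
    commands = []
    for query in query_files:
        # PureWindowsPath handles the backslash paths used in the course folders.
        out_file = str(PureWindowsPath(out_dir) / (PureWindowsPath(query).stem + ".html"))
        commands.append(blastall_command(query, database, out_file))
    return commands


def run_batch(commands):
    """Run the searches one after another; needs blastall on the PATH."""
    for command in commands:
        subprocess.run(command, check=True)


# "query\another_query.fasta" is a hypothetical second query file.
queries = [r"query\4_MMD2.fasta", r"query\another_query.fasta"]
for command in batch_commands(queries, r"dbs\Dm.P"):
    print(" ".join(command))
```

Calling `run_batch(batch_commands(queries, r"dbs\Dm.P"))` from the blast folder would then perform the searches in sequence.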

BLAST output

Questions after the BLAST search
- Is this an expressed gene in the fruit fly? (Gene prediction and gene structure)
- Is this the true ortholog of MDM2? (Fundamentals of sequence comparison)
- What can we learn from the comparison of sequences? (Protein domains/motifs)

BLAST output

How to measure the similarity between two sequences
Q: Which one is a better match to the query?
Query: M A T W L
Seq_A: M A T P P
Seq_B: M P P W I

Judging the match using “Scoring Matrix”
Q: Which one is a better match to the query?

Query: M  A  T  W  L
Seq_A: M  A  T  P  P
Score: 5  4  5 -4 -3    Total: 7

Query: M  A  T  W  L
Seq_B: M  P  P  W  I
Score: 5 -1 -1 11  2    Total: 16

(Per-position scores are BLOSUM-62 values.)
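The totals on this slide can be reproduced with a few lines of Python. The sketch below hard-codes only the BLOSUM-62 entries needed for these two comparisons; the dictionary and function names are made up for the illustration, and a real script would load the full matrix instead.

```python
# The handful of BLOSUM-62 entries needed for the slide's example.
BLOSUM62_SUBSET = {
    ("M", "M"): 5, ("A", "A"): 4, ("T", "T"): 5, ("W", "W"): 11,
    ("A", "P"): -1, ("T", "P"): -1, ("W", "P"): -4, ("L", "P"): -3,
    ("L", "I"): 2,
}


def pair_score(a, b):
    """Look up a substitution score; the matrix is symmetric."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def alignment_score(query, subject):
    """Sum the per-position scores of two equal-length, ungapped sequences."""
    if len(query) != len(subject):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(query, subject))


print("Seq_A:", alignment_score("MATWL", "MATPP"))  # 7
print("Seq_B:", alignment_score("MATWL", "MPPWI"))  # 16
```

Seq_A matches the query at three of five positions and Seq_B at only two, yet Seq_B scores higher: the conserved tryptophan (W/W = 11) and the conservative L/I substitution outweigh the extra identity.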

“Scoring Matrix” assigns a score to each pair of amino acids (BLOSUM-62)

      A   S   T   L   I   V   K   D  ...
L    -1  -2  -1   4   2   1  -2  -4

BLOSUM - Blocks Substitution Matrices
Block: very well conserved region of a protein family that performs the same (or a similar) function.
ASLDEFL
SALEDFL
ASLDDYL
ASIDEFY
ASIDEFY
…
AA: 6, AS: 3, SS: 0

$\mathrm{Score}(a_1/a_2) = 2 \log_2 \dfrac{\text{observed frequency of } a_1/a_2}{\text{predicted frequency of } a_1/a_2}$

BLOSUM - Blocks Substitution Matrices
Block: very well conserved region of a protein family that performs the same (or a similar) function.
ASLDEFL ASLEDFL ASLDDYL SALEEFL ASLDDYL SALEEFL …
- Score (a1/a2) > 0 when observed frequency of a1/a2 > predicted frequency of a1/a2
- Score (a1/a2) = 0 when observed frequency of a1/a2 = predicted frequency of a1/a2
- Score (a1/a2) < 0 when observed frequency of a1/a2 < predicted frequency of a1/a2

BLOSUM - Blocks Substitution Matrices
Block: very well conserved region of a protein family that performs the same (or a similar) function.
ASLDEFL ASLEDFL ASLDDYL SALEEFL ASLDDYL SALEEFL …
Observed frequency of L / I (e.g. 0.03) > predicted frequency of L / I (e.g. 0.1*0.1 = 0.01)
Score (L/I) > 0
Substitution of L / I is common in conserved sequences.
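Plugging the example frequencies for L / I into the log-odds formula from the earlier BLOSUM slide gives a quick numeric check. This is a minimal sketch; the function name `log_odds_score` is made up for the illustration, and the frequencies are the slide's example numbers rather than real BLOSUM-62 counts.

```python
import math


def log_odds_score(observed, predicted):
    """Score = 2 * log2(observed frequency / predicted frequency)."""
    return 2 * math.log2(observed / predicted)


predicted_li = 0.1 * 0.1   # product of the two background frequencies
observed_li = 0.03         # how often the pair is seen in the blocks

score = log_odds_score(observed_li, predicted_li)
print(round(score, 2), "->", round(score))  # 3.17 -> 3
```

A pair observed exactly as often as predicted scores 0, and a pair observed less often than predicted gets a negative score, which is the situation the next slide describes for L / K.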

BLOSUM - Blocks Substitution Matrices
Block: very well conserved region of a protein family that performs the same (or a similar) function.
ASLDEFL ASLEDFL ASLDDYL SALEEFL ASLDDYL SALEEFL …
Observed frequency of L / K < predicted frequency of L / K (e.g. 0.1*0.1 = 0.01)
Score (L/K) < 0
Substitution of L / K is rare in conserved sequences.

“Scoring Matrix” assigns a score to each pair of amino acids (BLOSUM-62)

      A   S   T   L   I   V   K   D  ...
L    -1  -2  -1   4   2   1  -2  -4

Scoring matrix – BLOSUM-62