Sequence analysis using EMBOSS & wEMBOSS, by Martin Sarachu. Based on the EMBOSS tutorial by Nikos Drakos, Val Curwen, David Martin, Gary Williams and many more.


Find this tutorial at
Throughout this tutorial, we're going to look at members of the rhodopsin family of G-protein coupled receptors. The general principles are, of course, applicable to any sequences you would like to analyse.

Retrieving sequences from databases
Hands-on: Look at the databases available with showdb (Information>>showdb).
The output is a simple table displaying the names, contents and access methods for the databases.
ID allows programs to extract a single explicitly named entry from the database, for example: embl:x13776
Query indicates that programs can extract a set of entries matching a wildcard entry name, for example: sw:pax*_human
All allows programs to analyse all the entries in the database sequentially, for example: embl:*
Hands-on: Retrieve the sequence with identifier xlrhodop from the embl DB (Edit>>seqret).
Hands-on: Copy the sequence to your current project and include it in nucList.
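
To make the three access methods concrete, here is a minimal Python sketch that splits a database reference of the form db:entry and reports which access method it would need. It is only an illustration of the idea, not how EMBOSS parses these references; the function names and the example entry names are hypothetical.

```python
from fnmatch import fnmatchcase


def classify_usa(usa: str) -> tuple[str, str, str]:
    """Split 'db:entry' and report which access method the request needs."""
    db, _, entry = usa.partition(":")
    if entry == "*":
        method = "All"      # every entry, read sequentially
    elif "*" in entry or "?" in entry:
        method = "Query"    # a set of entries matching a wildcard
    else:
        method = "ID"       # one explicitly named entry
    return db, entry, method


def matching_entries(pattern: str, names: list[str]) -> list[str]:
    """Return the entry names that match a wildcard pattern, ignoring case."""
    return [n for n in names if fnmatchcase(n.lower(), pattern.lower())]


for usa in ("embl:x13776", "sw:pax*_human", "embl:*"):
    print(usa, "->", classify_usa(usa))

# Hypothetical entry names, for illustration only.
print(matching_entries("pax*_human", ["PAX6_HUMAN", "PAX6_MOUSE", "OPSD_XENLA"]))
```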

Getting information about sequences
infoseq is a small utility that lists a sequence's USA (Uniform Sequence Address), name, accession number, type (nucleic or protein), length, percentage G+C (for nucleic sequences), and/or description.
Hands-on: Run infoseq (Information>>infoseq) with sequence xlrhodop in your project.
This sequence corresponds to a sequence in SwissProt that has the identifier OPSD_XENLA.
Hands-on: Retrieve the information about all OPSD sequences in SwissProt (sw DB, use the opsd_* wildcard).
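
Two of the values infoseq reports, length and percentage G+C, are easy to compute by hand. The Python sketch below shows the arithmetic on a toy sequence; the function names and the sequence are made up for illustration, and the sketch makes no attempt to reproduce how infoseq treats ambiguity codes.

```python
def gc_percent(seq: str) -> float:
    """Percentage of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / len(seq)


def summarize(name: str, seq: str) -> dict:
    """Collect a few infoseq-style fields for one sequence."""
    return {"name": name, "length": len(seq), "pgc": round(gc_percent(seq), 2)}


# Toy sequence, not a real database entry.
print(summarize("toy", "ATGCGC"))
```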

Pairwise sequence alignment
An alignment is an arrangement of two sequences that shows where they are similar and where they differ. The most intuitive representation of the comparison between two sequences is a dot-plot. One sequence is represented on each axis, and significant matching regions appear as diagonals in the matrix.
Hands-on: Upload sequence xl23808 from your computer to the current project and add it to nucList.
Hands-on: Make a dotplot with dottup between xl23808 and xlrhodop (Alignment>>Dot Plots>>dottup).
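
The idea behind a word-based dot-plot can be sketched in a few lines of Python: index every word of a fixed size in one sequence, look up the words of the other sequence, and mark each hit. This is a simplified illustration with hypothetical function names and toy sequences, not the dottup implementation.

```python
def word_matches(a: str, b: str, wordsize: int = 3) -> list[tuple[int, int]]:
    """Return (i, j) pairs where a[i:i+wordsize] == b[j:j+wordsize]."""
    index: dict[str, list[int]] = {}
    for j in range(len(b) - wordsize + 1):
        index.setdefault(b[j:j + wordsize], []).append(j)
    hits = []
    for i in range(len(a) - wordsize + 1):
        for j in index.get(a[i:i + wordsize], []):
            hits.append((i, j))
    return hits


def render(a: str, b: str, hits: list[tuple[int, int]], wordsize: int) -> str:
    """Draw the hits as a text grid: rows follow a, columns follow b."""
    grid = [["." for _ in b] for _ in a]
    for i, j in hits:
        for k in range(wordsize):
            grid[i + k][j + k] = "\\"
    return "\n".join("".join(row) for row in grid)


a, b = "ACGTACGT", "TACGTT"   # toy sequences
hits = word_matches(a, b, wordsize=4)
print(hits)
print(render(a, b, hits, 4))
```

Hits that fall on the same diagonal, such as (3, 0) and (4, 1) here, merge into one longer line in the plot, which is what a region of similarity looks like.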

Global alignment
A global alignment compares the two sequences over their entire lengths, and is appropriate for sequences that are expected to share similarity over the whole length. needle is an implementation of the Needleman-Wunsch algorithm for global alignment. The computation is rigorous, and needle can be time-consuming to run if the sequences are long.
Hands-on: Do a global alignment between xlrhodop (1-470 region) and xl23808 ( region) (Alignment>>Global>>needle).
stretcher is another EMBOSS program for global alignment; it is less rigorous and therefore runs more quickly. Useful for DB searching.
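
As a rough illustration of the Needleman-Wunsch idea, the Python sketch below fills the dynamic-programming table row by row and returns the best global score. It assumes a simple match/mismatch score and a single linear gap penalty, chosen here only for illustration; needle itself works with a substitution matrix and separate gap opening and extension penalties, so its scores will differ.

```python
def needleman_wunsch_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Best global alignment score of a and b (linear gap penalty)."""
    # Row 0: aligning the empty prefix of a against prefixes of b costs gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a against the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("ACGT", "ACGT"))
print(needleman_wunsch_score("ACGT", "AGT"))
```

Every cell is computed, so the work grows with the product of the two sequence lengths. That is why a rigorous global alignment becomes slow for long sequences.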

Local alignment
Local alignment methods are very useful for scanning databases, or when you do not know whether the sequences are similar over their entire lengths. water is a rigorous implementation of the Smith-Waterman algorithm for local alignments.
Hands-on: Perform a local alignment between xlrhodop and xl23808 (Alignment>>Local>>water).
matcher is an EMBOSS program for local alignment; it is less rigorous and therefore runs more quickly. Useful for DB searching. supermatcher is designed for local alignments of very large sequences and is even less rigorous in its implementation. You can look at its documentation by clicking the “Manual” button on the program’s menu.
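
The Smith-Waterman recurrence differs from the global one in two ways: a cell's score is never allowed to drop below zero, and the result is the best cell anywhere in the table. The Python sketch below shows this with an assumed match/mismatch score and linear gap penalty; water uses a substitution matrix and gap opening and extension penalties instead.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between any substrings of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at zero lets an alignment restart anywhere.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# Toy sequences that share only the short region ACGT.
print(smith_waterman_score("TTTTACGTTTTT", "GGGACGTGGG"))
```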

Identifying the ORF
We can get a rapid visual overview of the distribution of ORFs in the six frames of our sequence using plotorf.
Hands-on: Run plotorf with sequence xlrhodop (Nucleic>>Translation>>plotorf).
Longest ORF is in frame 2 from around position 100 to
Hands-on: Identify the exact start and end points for translation with getorf (Nucleic>>Gene finding>>getorf). Look at output options! Translate your sequence between START and STOP codons.
We know from plotorf that our ORF will be in the region 100 to
Identify the actual start and end positions.
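
Searching for ORFs between START and STOP codons in all six frames can be sketched as follows. This is a simplified Python illustration, not the getorf algorithm: it assumes ATG as the only start codon, reports 0-based coordinates on the strand being scanned, and numbers frames 1 to 3 per strand, which may not match the frame numbering in the plotorf or getorf output.

```python
STOPS = {"TAA", "TAG", "TGA"}


def revcomp(seq: str) -> str:
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def orfs_in_strand(seq: str, min_codons: int = 1) -> list[tuple[int, int, int]]:
    """Return (frame, start, end) for ATG..STOP ORFs; end is exclusive
    and includes the stop codon."""
    found = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOPS:
                if (pos - start) // 3 >= min_codons:
                    found.append((frame + 1, start, pos + 3))
                start = None
    return found


def longest_orf(seq: str):
    """Longest ORF over both strands as (strand, frame, start, end), or None."""
    seq = seq.upper()
    candidates = [("+", *orf) for orf in orfs_in_strand(seq)]
    candidates += [("-", *orf) for orf in orfs_in_strand(revcomp(seq))]
    return max(candidates, key=lambda o: o[3] - o[2], default=None)


print(longest_orf("CCATGAAATTTTAGCC"))  # toy sequence
```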

Translating the sequence
Hands-on: You should have found that the region to be translated is from 110 to 1171 in our cDNA sequence. Use transeq to translate that region (Nucleic>>Translation>>transeq).
Hands-on: Copy xlrhodop.pep to your project and add it to protList.
pepinfo produces information on amino acid properties (size, polarity, aromaticity, charge, etc.).
Hands-on: Run pepinfo with xlrhodop.pep and examine the information it provides (Protein>>Composition>>pepinfo).
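
The region 110 to 1171 spans 1171 - 110 + 1 = 1062 bases, which is exactly 354 codons. The Python sketch below shows what translating such a region involves, using the standard genetic code and 1-based inclusive coordinates. The function name and the toy sequence are hypothetical, and transeq offers other genetic code tables that are not covered here.

```python
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")

# Standard genetic code: codons ordered T, C, A, G at each position.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_region(seq: str, start: int, end: int) -> str:
    """Translate the 1-based inclusive region start..end.

    Stop codons become '*'; codons with unknown bases become 'X'.
    """
    region = seq[start - 1:end].upper().replace("U", "T")
    return "".join(
        CODON_TABLE.get(region[i:i + 3], "X")
        for i in range(0, len(region) - 2, 3)
    )


print(translate_region("CCATGAAATTTTAGCC", 3, 14))  # toy sequence
```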

Pattern matching
In a number of cases, the active site of a protein can be recognized by a specific fingerprint or template, a fairly small set of residues that are unique to a family of proteins. An example is the sequence GXGXXG (where G=glycine and X=any amino acid) which defines a GTP binding site. Searching for a (rather loose) predefined string of characters in a sequence is called Pattern Matching.
Hands-on: Use patmatmotifs to search your protein sequence for motifs defined in the PROSITE DB of protein families and domains (Protein>>Motifs>>patmatmotifs). Look at output options! Specify a full documentation output.
In our case we already know that our sequence is a rhodopsin. However, if you had an unknown sequence, we hope you can see that identifying motifs might provide you with information to help you plan further experiments.
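
A loose pattern such as GXGXXG maps naturally onto a regular expression. The Python sketch below converts a small subset of PROSITE-style pattern syntax (single residues, x for any residue, [..] alternatives, {..} exclusions, repeat counts, and the < and > anchors) into a regular expression and searches a toy peptide. The converter is a hypothetical helper written for this illustration and is not how patmatmotifs is implemented.

```python
import re

ELEMENT = re.compile(
    r"(<?)(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern: str) -> str:
    """Convert a simple PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported element: {element!r}")
        start, core, lo, hi, end = m.groups()
        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # any residue except these
        repeat = ""
        if lo:
            repeat = "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(("^" if start else "") + core + repeat + ("$" if end else ""))
    return "".join(parts)


regex = prosite_to_regex("G-x-G-x(2)-G")   # the GXGXXG example
hit = re.search(regex, "MKGAGSLGTT")       # toy peptide
print(regex, hit.start(), hit.group())
```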

Protein fingerprints
PRINTS is a database that defines functional protein families, identifying each domain by a number of short, particularly well conserved sequences. A full match to one of these "fingerprints" will match all the relevant short sequences in the correct order. A partial match is recorded if some are missing or if they occur in an incorrect order.
Hands-on: Use pscan with your peptide sequence and examine the matches (Protein>>Motifs>>pscan).
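
The difference between a full and a partial fingerprint match can be illustrated with a toy Python sketch. It treats each motif as an exact string, which is a strong simplification made only for this example; real fingerprint scanning scores motifs rather than requiring exact strings, and the motifs and sequences below are made up.

```python
def fingerprint_match(seq: str, motifs: list[str]) -> str:
    """Classify a sequence as a 'full', 'partial' or 'none' match.

    full:    every motif occurs, in the listed order, without overlap
    partial: at least one motif occurs, but some are missing or out of order
    """
    pos = 0
    in_order = 0
    for motif in motifs:
        hit = seq.find(motif, pos)
        if hit >= 0:
            in_order += 1
            pos = hit + len(motif)   # the next motif must start after this one
    if in_order == len(motifs):
        return "full"
    if any(motif in seq for motif in motifs):
        return "partial"
    return "none"


motifs = ["AAA", "CCC", "GGG"]   # hypothetical motifs
print(fingerprint_match("TTAAATTCCCTTGGG", motifs))  # all present, in order
print(fingerprint_match("TTCCCTTAAATTGGG", motifs))  # wrong order
print(fingerprint_match("TTAAATTTTGGG", motifs))     # one motif missing
```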

Multiple Sequence Analysis
One of the most popular programs for performing multiple sequence alignments is clustalw. The EMBOSS interface to clustalw is emma. pscan has told us that our sequence belongs to the rhodopsin family. We will now retrieve some further members of the family from SwissProt and produce a multiple alignment; we'll then use this multiple alignment to produce a profile of this group of sequences and use that to align them all to our original sequence.
Hands-on: Use seqret to retrieve a set of sequences from the SwissProt DB; use the ops2_* wildcard to get all sequences whose identifiers begin with ops2_.
Hands-on: Copy the output file to your project, rename it to ops2.fasta and add it to protList.

Multiple Sequence Analysis
Hands-on: Align these sequences using emma (Alignment>>Multiple>>emma). It will produce an alignment and a dendrogram.
We have aligned ops2 sequences from two fruit fly species, two crab species, locust and scallop.
Hands-on: Copy the alignment to your project, and view it. The sequences are similar, but there are differences. Add the alignment to your protList.
Hands-on: prettyplot will give you a clearer view of the differences by aligning the sequences on top of one another (Alignment>>Multiple>>prettyplot). Identical residues are shown in red, and similar residues in green. This type of display can give you a first impression of regions of conservation.
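
Spotting conserved columns in a multiple alignment can be sketched in Python as below. The sketch only marks columns where every sequence has the same residue; it does not attempt the "similar residue" colouring, which needs a substitution matrix. The function name and the three aligned toy sequences are made up for illustration.

```python
def column_identity(alignment: list[str]) -> str:
    """Return a marker line: '*' under fully identical columns, ' ' elsewhere."""
    marks = []
    for column in zip(*alignment):
        residues = set(column)
        identical = len(residues) == 1 and "-" not in residues
        marks.append("*" if identical else " ")
    return "".join(marks)


# Toy alignment; '-' marks a gap.
alignment = ["MKT-A", "MRTLA", "MKTLA"]
for row in alignment:
    print(row)
print(column_identity(alignment))
```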

Profiles
Profile analysis is a sequence comparison method for finding and aligning distantly related sequences. The comparison allows a new sequence to be aligned optimally to a family of similar sequences.
Hands-on: prophecy is an EMBOSS program for creating a profile from a set of multiple aligned sequences. Create a profile from the ops2 alignment (Protein>>Profiles>>prophecy). Look at output options! Specify a Gribskov profile type. When prophecy finishes, copy the profile to your current project.
Hands-on: Use prophet to align xlrhodop.pep to the ops2 profile (Protein>>Profiles>>prophet).
The vertical bars (|) represent residues that are identical between the ops2 consensus and our rhodopsin, while the colons (:) represent conservative substitutions. We hope you can see that aligning members of a family can reveal conserved regions that may be important for structure and/or function.
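
The core idea of a profile, a position-specific summary of an alignment against which a new sequence is scored, can be sketched in Python as follows. This is a deliberately simplified illustration: it uses raw residue counts per column and slides the sequence along without gaps, whereas a Gribskov profile weights residues with a substitution matrix and allows gaps. All names and sequences here are made up.

```python
from collections import Counter


def build_profile(alignment: list[str]) -> list[Counter]:
    """One Counter of residue counts per alignment column (gaps ignored)."""
    return [Counter(r for r in column if r != "-") for column in zip(*alignment)]


def score_at(profile: list[Counter], seq: str, offset: int) -> int:
    """Sum of column counts for the residues of seq starting at offset."""
    return sum(col[seq[offset + i]] for i, col in enumerate(profile))


def best_offset(profile: list[Counter], seq: str) -> tuple[int, int]:
    """Best ungapped placement of the profile on seq as (offset, score)."""
    offsets = range(len(seq) - len(profile) + 1)
    best = max(offsets, key=lambda o: score_at(profile, seq, o))
    return best, score_at(profile, seq, best)


profile = build_profile(["MKTL", "MRTL", "MKSL"])   # toy alignment
print(best_offset(profile, "AAMKTLAA"))             # toy query sequence
```

Columns where the family agrees contribute the most to the score, so a new sequence is pulled towards the placement that matches the conserved positions.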

Conclusion
We have shown you some of the programs available within EMBOSS, and have introduced you to the way you can run these programs from wEMBOSS.
You can search for EMBOSS programs within wEMBOSS from the “Search for programs” frame.
You can examine individual program documentation from the program menu.
You can get a listing of all EMBOSS programs from wossname (Information>>wossname).
EMBOSS site:
wEMBOSS site: