EMBOSS – an application suite for Bioinformatics
Shahid Manzoor, Adnan Niazi
SLU Global Bioinformatics Centre
E – European
M – Molecular
B – Biology
O – Open
S – Software
S – Suite
All Information
EMBOSS info at
wEMBOSS info at
To get a username and password for wEMBOSS at
What is EMBOSS
Open Source molecular biology analysis package.
Handles a variety of common file formats.
Provides libraries for easy software development.
Licensed under GPL and LGPL.
wEMBOSS web interface developed by Martin Sarachu and Marc Colet.
Available at
Features of EMBOSS
A comprehensive set of sequence analysis programs.
All sequence formats and many alignment and structural formats are handled.
It runs on practically every UNIX you can think of (and likely some that you can't), plus Windows and OS X.
Each application has the same style of interface, so master one and you've mastered them all.
Uses for EMBOSS
Sequence alignment.
Protein motif identification (including domain analysis).
Nucleotide sequence pattern analysis (for example, to identify CpG islands or repeats).
Presentation tools for publications.
Programs in EMBOSS
Many small and large programs in the package (>140).
All programs share a common look and feel.
Easy to run from the command line.
Retrieval of sequence data from the web.
The One Argument
-help
The -help argument displays short help for any EMBOSS program.
The One Command
wossname
wossname searches the short descriptions of the other programs for keywords.
Large collection of gene and protein analysis tools
Sequence retrieval
Alignments
Primer design
Restriction mapping
Protein domain searching
Translation
[Workflow diagram: DNA Sequence 1, DNA Sequence 2 -> dotplot, translation -> protein Sequence 1, protein Sequence 2 -> local/global alignment, multiple sequence alignment, motif and domain searching, physico-chemical properties]
Dotplots
>SEQ1.fasta
AGTGGTCGTGAAG
AGAATGCTCCTCC
TTTGGAATCTTAA
>SEQ2.fasta
AGTGCTCCTCCCT
TAGAATCTTAG
For an exact match:
Unix% dottup SEQ1.fasta SEQ2.fasta -wordsize 10 &
For a similarity match:
Unix% dotmatcher SEQ1.fasta SEQ2.fasta -window 10 -threshold 17 &
Dotplots …
[Figure: identity matrix over A, T, G, C]
Window size is the number of bases in a sliding window that is moved along each sequence and compared to generate a single data point on the plot. In some dot plot programs the window size must be an odd number.
Mismatch limit determines how similar the two sequences in a window must be to "match". For example, if the window size is 9 and the mismatch limit is 2, then up to 2 mismatches in a 9-base window will still be classified as a match.
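The window and mismatch-limit rule can be sketched in a few lines of Python. This is a minimal illustration of the idea, not the code used by dottup or dotmatcher; the function name `window_matches` and its parameters are assumptions made for this example.

```python
def window_matches(seq1, seq2, window=9, mismatch_limit=2):
    """Return (i, j) start positions (0-based) where the two windows 'match'.

    A window pair matches when it contains at most `mismatch_limit`
    mismatching positions. Hypothetical helper for illustration only.
    """
    hits = []
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            w1 = seq1[i:i + window]
            w2 = seq2[j:j + window]
            mismatches = sum(a != b for a, b in zip(w1, w2))
            if mismatches <= mismatch_limit:
                hits.append((i, j))  # one dot on the plot
    return hits


# One mismatch in a 9-base window: a match with limit 2, no match with limit 0.
print(window_matches("ACGTACGTA", "ACGTTCGTA", window=9, mismatch_limit=2))
print(window_matches("ACGTACGTA", "ACGTTCGTA", window=9, mismatch_limit=0))
```

Each returned pair corresponds to one dot; runs of consecutive pairs (i, j), (i+1, j+1), ... form the diagonals seen in the plot.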
Dotplots …
[Figure: identity matrix over A, T, G, C, with two window comparisons]
CCTCCTTTGG vs CCTCCTTTGG: Score =
CCTCCTTTGG vs CCTCCCTTAG: Score = 32
Pro Leu / Pro Leu
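A score-based window comparison can be sketched as follows. The scoring values here (+5 for a match, -4 for a mismatch) are an assumption chosen because they reproduce the score of 32 shown for the second comparison; the slide does not state which matrix values were used, and the `>=` threshold test is likewise an assumption rather than documented dotmatcher behaviour.

```python
def window_score(window1, window2, match=5, mismatch=-4):
    """Sum per-position scores for two equal-length windows.

    match/mismatch values are assumptions for this sketch.
    """
    if len(window1) != len(window2):
        raise ValueError("windows must have the same length")
    return sum(match if a == b else mismatch for a, b in zip(window1, window2))


def is_dot(window1, window2, threshold=17):
    """Plot a dot when the window score reaches the threshold (assumed >=)."""
    return window_score(window1, window2) >= threshold


# Two mismatches in ten bases: 8 * 5 + 2 * (-4) = 32
print(window_score("CCTCCTTTGG", "CCTCCCTTAG"))
print(is_dot("CCTCCTTTGG", "CCTCCCTTAG", threshold=17))
```

Under these assumed values the two windows differ at two positions, both in the third base of a codon, which is why the encoded Pro and Leu residues are unchanged.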
Dotplots
A dot plot is a simple graphical representation of identical residues between two sequences.
The X axis represents the first sequence (PHO5); the Y axis represents the second sequence (PHO3).
A dot is plotted for each match between two residues of the sequences.
Diagonal lines reveal regions of identity between the two sequences.
Dotplots …
The dot plot can be adapted to display only word matches, which correspond to a diagonal of dots in the letter-based dot plot.
Example: alignment of PHO5 and PHO3 coding sequences, with different word sizes.
Detecting repeats with a dot plot
Sequence repeats are easily detected in a dot plot when a sequence is compared to itself.
The main diagonal is completely marked (by definition, since the sequence is identical to itself).
Repeats appear as segments of lines parallel to the diagonal.
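Repeat detection by self-comparison can be illustrated with a small word-match sketch. It indexes every word of a given size and reports pairs of positions that share the same word, skipping the main diagonal. The function name `self_word_matches` and the toy sequence are hypothetical, and this is not how any EMBOSS program is implemented.

```python
from collections import defaultdict


def self_word_matches(seq, word_size=4):
    """Return off-diagonal (i, j) pairs, i < j, where seq has the same word.

    Positions are 0-based. The main diagonal (i == j) is omitted because a
    sequence always matches itself there.
    """
    positions = defaultdict(list)
    for i in range(len(seq) - word_size + 1):
        positions[seq[i:i + word_size]].append(i)

    hits = []
    for starts in positions.values():
        for a in range(len(starts)):
            for b in range(a + 1, len(starts)):
                hits.append((starts[a], starts[b]))
    return sorted(hits)


# "ACGT" occurs at positions 0 and 7, so one off-diagonal dot is reported.
print(self_word_matches("ACGTTTTACGT", word_size=4))
```

Longer repeats produce runs of consecutive pairs, which is what appears as a line segment parallel to the diagonal.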
Plotorf
>SEQ1.fasta
ATGGGTCGTGAAG
AGAATGCTCCTCC
TTTGGAATCTTAA
>SEQ2.fasta
ATGGCTCCTCCCT
TAGAATCTTAG
Unix% plotorf SEQ1.fasta -stop TAA,TAG -out GA.plot &
Unix% getorf SEQ1.fasta -minsize 5 -table 0 -find 1 -out GA.getorf &
[Figure: plotorf output for both strands, with panels for Frame -3, Frame -2, Frame -1, Frame 1, Frame 2, Frame 3]
ATGGGTCGTGAAGAGAATGCTCCTCCTTTGGAATCTTAA
TACCCAGCACTTCTCTTACGAGGAGGAAACCTTAGAATT
Start and stop codons are located according to the instructions given to the program, and the area in between start and stop codons
Indication of full coding sequence?
Alternative splice form?
Using getorf:
>_1 [ ]
MLLLWNL
>_2 [1 - 36]
MGREENAPPLES*
[Figure labels: start methionine, stop codon]
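The idea of locating regions between a start codon and the next stop codon can be sketched as below. This is a simplified forward-strand scan under stated assumptions (ATG as the only start codon; TAA, TAG and TGA as stops; an ORF that reaches the end of the sequence without a stop is still reported). It is not getorf's implementation, and the coordinate convention is this sketch's own.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def forward_orfs(seq):
    """Return (frame, start, end) for ATG-initiated ORFs on the forward strand.

    frame is 1, 2 or 3; start and end are 1-based positions of the first and
    last coding base (the stop codon itself is excluded).
    """
    orfs = []
    for frame in range(3):
        pos = frame
        while pos + 3 <= len(seq):
            if seq[pos:pos + 3] == "ATG":
                end = pos
                # Extend codon by codon until a stop or the end of the sequence.
                while end + 3 <= len(seq) and seq[end:end + 3] not in STOP_CODONS:
                    end += 3
                orfs.append((frame + 1, pos + 1, end))
                pos = end + 3  # continue after the stop codon
            else:
                pos += 3
    return orfs


SEQ1 = "ATGGGTCGTGAAGAGAATGCTCCTCCTTTGGAATCTTAA"
print(forward_orfs(SEQ1))
```

For SEQ1 the scan finds the frame 1 region spanning bases 1 to 36 and a second ATG in frame 2 that runs to the end of the sequence without meeting a stop codon; these correspond to the two peptides listed in the getorf output.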
Unix% transeq SEQ1.fasta -frame 1 -table 0 -sbegin 4 -send 36 -out GA.fasta &
>GA.fasta
GREENAPPLES
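The translation step can be checked by hand with a short sketch using the standard genetic code. The function name `translate` is hypothetical and the code only illustrates the codon lookup; it is not transeq itself. Note that the eleven codons encoding GREENAPPLES span bases 4 to 36 of SEQ1.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate a DNA string codon by codon; trailing partial codons are ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


SEQ1 = "ATGGGTCGTGAAGAGAATGCTCCTCCTTTGGAATCTTAA"
print(translate(SEQ1))        # MGREENAPPLES*
print(translate(SEQ1[3:36]))  # bases 4 to 36 (1-based): GREENAPPLES
```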
Alignments
>GA.fasta
GREENAPPLES
>A.fasta
APPLES
For a global alignment:
Unix% needle GA.fasta A.fasta -gapopen 10 -gapextend 0.5 -datafile EPAM250 &
For a local alignment:
Unix% water GA.fasta A.fasta -gapopen 10 -gapextend 0.5 -datafile EPAM250 &
Alignments …
To align two or more sequences in a biologically significant way.
[Figure: GREENAPPLES aligned with APPLES, locally (water) and globally (needle); the shared region is APPLES]
Gap penalty = 10; Extension penalty = 0.5
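The local alignment idea behind water can be illustrated with a compact Smith-Waterman sketch. To keep it short, it uses simple assumed scores (match +2, mismatch -1, linear gap -2) instead of the EPAM250 matrix and the affine gap penalties (10 and 0.5) given above, so the numeric score will differ from what water reports. The function name `smith_waterman` is hypothetical.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for a local alignment.

    Scoring values are assumptions for this sketch (no substitution matrix,
    linear gap cost).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; local alignment never lets a cell drop below zero.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("GREENAPPLES", "APPLES"))
```

A global alignment in the style of needle differs mainly in that cells are not clamped at zero and the traceback runs from one corner of the matrix to the other, so the unmatched GREEN prefix is aligned against gaps instead of being left out.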
GREENAPPLES
APPLES
It looks like the "apples" motif may be part of a larger domain.
Physicochemical properties
Pattern searching
Physico-chemical properties
Isoelectric point:
Unix% iep GA.fasta -plot -step 0.5 -out GA.IEP &
General properties:
Unix% pepinfo GA.fasta -hwindow 8 -generalplot -hydropathyplot &
Physico-chemical properties
[Figure: Venn diagram of amino acid properties (Aliphatic, Aromatic, Hydrophobic, Tiny, Small, Charged, Positive, Polar) covering residues D, Y, F, W, H, K, R, E, Q, N, M, A, G, C, S, P, I, V, L, T]
The pepinfo graph of properties is based on this diagram.
Physico-chemical properties
[Figure annotations: non-polar region with small residues; polar region to one side of non-charged region]
Pattern searching
>GA.fasta
GREENAPPLES
>GL.fasta
GREENLEAVES
>RA.fasta
REDAPPLES
>RL.fasta
REDLEAVES
Alignment:
GREENAPPL---ES
-RE-DAPPL---ES
GREEN---LEAVES
-RE-D---LEAVES
Pattern:
[G](0,1)-R-[E](1,2)-[ND]-x(3)-L-x(3)-E-S
Pattern searching
Search a protein database:
Unix% fuzzpro sptr:* pattern.fruit -mismatch 0 -out GA.fuzzpro &
pattern.fruit:
[G](0,1)-[R]-[E](1,2)-[ND]-x(3)-[L]-x(3)-[E]-[S]
Nothing resembling this pattern is found in the database, but we could try scanning PRINTS (pscan) and PROSITE (patmatmotifs) with one of our sequences.
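The pattern can be checked against the four example peptides by rewriting it as a regular expression. This sketch uses Python's `re` module rather than fuzzpro, and the two regular expressions are this example's own translations. Read literally, x(3) demands exactly three residues at each spacer, which none of the four peptides satisfies; a relaxed variant with spacers of zero to three residues, reflecting the gap columns in the alignment, matches all four.

```python
import re

# The four example peptides from the slides.
PEPTIDES = {
    "GA": "GREENAPPLES",
    "GL": "GREENLEAVES",
    "RA": "REDAPPLES",
    "RL": "REDLEAVES",
}

# Literal reading: each x(3) is exactly three residues.
LITERAL = re.compile(r"G?RE{1,2}[ND].{3}L.{3}ES")
# Relaxed reading (an assumption): each spacer may be zero to three residues.
RELAXED = re.compile(r"G?RE{1,2}[ND].{0,3}L.{0,3}ES")


def matching_names(pattern):
    """Return the names of the peptides that the pattern matches in full."""
    return [name for name, seq in PEPTIDES.items() if pattern.fullmatch(seq)]


print(matching_names(LITERAL))  # []
print(matching_names(RELAXED))  # ['GA', 'GL', 'RA', 'RL']
```

This suggests that variable-length spacers may be worth trying before concluding that a database contains no similar sequences.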
Some Programs
Some Programs …
More Information