Bioinformatics and Protein Sequence Analysis

With a large number of proteins now sequenced and their data stored in databases, it has become easier for researchers to study proteins computationally. Such studies provide preliminary insights into the structural and functional aspects of proteins without conducting experiments.

This animation consists of 2 parts:
Part 1: Protein Sequence Alignment
Part 2: Alignment analysis and interpretations

Extract the newly determined amino acid sequence for your query peptide.
Select the relevant algorithm and its associated parameters: Pair-wise Sequence alignment or Multiple Sequence alignment.
Assess the significance of the result with its alignment score.

Definitions of the components Part 1 – Protein sequence alignment

Query Peptide: This refers to the unknown protein or peptide that is provided as an input to the sequence analysis server. The sequence of this protein is determined before it is analyzed for similarity matches with other proteins.

Relevant Algorithm: An algorithm is the sequence of logical steps used for comparing the query peptide with other given protein sequences. The nature of the query, such as “Local” or “Global” and “Pair-wise alignment” or “Multiple Sequence Alignment”, determines the algorithm that is used.

Local Alignment: “Local” alignment matches individual blocks of the protein sequences; the alignment ends where similarity breaks down rather than being forced to run to the ends of the sequences. The aim of such alignment studies is to find the best-matching blocks of similarity in the aligned protein sequences.

Global Alignment: “Global” alignment is an end-to-end alignment of two or more sequences, in which gaps are introduced where needed so that corresponding residues line up along the full length.

Pair-wise sequence alignment: This procedure compares and aligns two given sequences. The comparison can be either Global or Local, with the quality of the alignment judged by the alignment score.
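The difference between global and local alignment can be made concrete with a small dynamic-programming sketch. This is a simplified illustration, not the algorithm used by any particular server: the match, mismatch and gap values are arbitrary assumptions, and a single linear gap cost stands in for the separate gap opening and extension penalties.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-2):
    """Best alignment score by dynamic programming with a linear gap cost.

    The scoring values are illustrative assumptions, not a real scoring matrix.
    """
    cols = len(b) + 1
    # Global alignment pays for leading gaps; local alignment starts from zero.
    prev = [0 if local else j * gap for j in range(cols)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap] + [0] * (cols - 1)
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                # A local alignment may restart anywhere, so scores never drop below 0.
                score = max(score, 0)
                best = max(best, score)
            curr[j] = score
        prev = curr
    return best if local else prev[-1]


print(alignment_score("KKAFFGKK", "TTAFFGTT", local=True))   # 4
print(alignment_score("KKAFFGKK", "TTAFFGTT", local=False))  # 0
```

The local score counts only the shared block AFFG, while the global score also pays for the mismatched flanking residues.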

Multiple Sequence Alignment: This refers to the end-to-end alignment of several given sequences that are provided to the search engine. Multiple alignment tends to introduce as few gaps as possible and finds regions of similarity within all given sequences.

Word-length: The minimum length of an amino acid sequence that needs to match exactly in order to initiate an alignment process in either direction. Sensitivity and speed of alignment depend on the word length provided by the user.

Scoring Matrix: The matrix of values that is referred to when assigning a score to the alignment of pairs of residues. The matrix used for a BLAST search is selected depending on the type of sequences that one is searching with. The two common families are the PAM series and the BLOSUM series.

PAM: PAM stands for Point Accepted Mutation. It is a log-odds matrix scoring system that is constructed from the amino acid replacements in a set of closely related proteins. The PAM value indicates the percentage of mutations that have been accepted in a given set of proteins: 1 PAM corresponds to a change at an average of 1% of amino acid positions.

BLOSUM: This stands for “BLOcks SUbstitution Matrix” and is constructed from a set of distantly related proteins. BLOSUM provides a comprehensive biological insight into proteins when the evolutionary distance is not known beforehand. It is based on the relative frequency of amino acid residues and the probabilities of their substitution in a set of highly conserved blocks of residues in proteins that are evolutionarily distant.
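The role of the word length can be illustrated with a minimal seeding sketch. It follows the definition above and looks for exact matches of a fixed word size between two sequences; the function name is hypothetical. Real BLAST protein searches also accept similar (not only identical) words that score above a cutoff, which this sketch leaves out.

```python
def seed_matches(query, subject, word_size=3):
    """Return (query_pos, subject_pos) pairs where a word of `word_size` matches exactly."""
    # Index every word of the subject by its starting position.
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)
    # Look up every word of the query in that index.
    hits = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            hits.append((i, j))
    return hits


print(seed_matches("SLRKVAFFG", "SLRRIAFFG"))  # [(0, 0), (5, 5), (6, 6)]
```

A smaller word size yields more seeds (higher sensitivity, slower search); a larger one yields fewer seeds (faster, less sensitive).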

Threshold: The threshold provides a measure of the statistical significance of the results of an alignment study and represents the expected number of matches occurring by chance.

Gap Penalty and Gap Extension: In an alignment of two or more protein sequences, a gap is introduced where residues in one sequence have no counterpart in the other. In this context, “Gap penalty” refers to a deduction from the overall alignment score on introduction of a gap, while “Gap Extension” is the deduction for extending an already existing gap.

Alignment Score: Reported in normalized form as the Bit Score, this provides a comparative quantification of the quality of alignment. The score increases when a higher number of residue matches and a lower number of mismatches are encountered. The alignment with the higher bit score is the better match.

Percentage Identity: This indicates the percentage of amino acid residues that are an identical match to each other during the comparison of two sequences.

E-value: The E-value quantifies how likely it is that an alignment between two or more sequences arose by chance rather than being a biologically significant match. For a similarity match against a database, this value is dependent on the size of the database against which the sequence is compared. The closer the e-value is to zero, the higher the significance of the match.

Hit: Each result of a search is called a ‘Hit’, and the term ‘best Hit’ refers to the best result for that particular query.
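Two of these quantities are simple enough to compute by hand, as the sketch below shows. It assumes that a gap of a given length costs the existence penalty plus the extension penalty for each gapped position, and that percentage identity is taken over all aligned columns; tools differ on both conventions, so the function names and formulas here are illustrative assumptions.

```python
def gap_cost(length, existence=11, extension=1):
    """Penalty for one gap of `length` positions (assumed: existence + length * extension)."""
    if length <= 0:
        return 0
    return existence + length * extension


def percent_identity(row_a, row_b):
    """Identical columns as a percentage of all aligned columns (gap columns included)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    if not row_a:
        return 0.0
    identical = sum(1 for x, y in zip(row_a, row_b) if x == y and x != "-")
    return 100.0 * identical / len(row_a)


print(gap_cost(1), gap_cost(5))          # 12 16
print(percent_identity("AFFG", "AFYG"))  # 75.0
```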

Step 1: Pair-wise sequence alignment for two given sequences - INPUT

Enter sequence 1
>gi|39582116|ref|XP_363835.1| C. briggsae CBR-COL-186 protein [Caenorhabditis briggsae]
MKSTEKKSTELDLELEAQSLRRIAFFGVAMSTVATFVCIITVPLAYNKMQQMQSNMIDQYMASARGIRVA…

Enter sequence 2
>gi|6682|emb|CAA28836.1| collagen [Caenorhabditis elegans]
MSEDLKQIAQETESLRKVAFFGIAVSTIATLTAIIAVPMLYNYMQHVQSSLQSEVEFCQHRSNGLWDEYK…

Word Size 3
Length of the initial set of amino acids that needs to be matched before alignment begins

Threshold 10

Gap penalty Existence 11, Extension 1
Values deducted from the overall alignment score on introduction and extension of gaps

Scoring Matrix BLOSUM62
The reference matrix used to assign scores to matches of residues

ALIGNMENT ALGORITHM (BLAST)

Step 2: Pair-wise sequence alignment for two given sequences - OUTPUT

DOT-PLOT
A graphical visualization of the two given sequences, used to find approximate overlaps and identify regions of close similarity

E-VALUE
The statistical measure of the biological significance. The closer the e-value is to 0, the higher the biological significance

BIT SCORE
Normalized scores obtained from the raw scores, based on the scoring matrix used in the algorithm

PERCENTAGE IDENTITY
The percentage of residues which were identical in the two sequences

ALIGNMENT (shows the match or mismatch between each of the residues):
Sequence 1 LELEAQSLRRIAFFGVAMSTVATFVCIITVPLAYNKMQQMQSNMIDQYMASARGIRVARR
           + E  +SLR++AFFG+A+ST+AT     II VP+ YN MQ +QS++   +
Sequence 2 IAQETESLRKVAFFGIAVSTIATLTAIIAVPMLYNYMQHVQSSLQSE VEF

Gaps introduced in sequence 2 due to lack of similar residues in sequence 1

E-VALUE: 6e-19
BIT SCORE: 77.4 bits
PERCENTAGE IDENTITY: 34%

Pair-wise alignment using the BLOSUM62 matrix produces several kinds of results: the alignment itself, the alignment score, a dot-plot, the percentage identity and the e-value. In this example the raw score is reported as 189 with both the BLOSUM62 and the PAM30 matrix, and the bit score as 77.4 with both. Bit scores therefore give a uniform, normalized measure of the overall quality of alignment irrespective of the scoring system. The significance of this result is very high, as the e-value is very close to 0.
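The normalization from raw score to bit score can be sketched with the usual Karlin-Altschul relation, S' = (λS − ln K) / ln 2. The values of λ and K below are not given in this material; they are the statistical parameters commonly reported by BLAST for BLOSUM62 with gap costs of existence 11 and extension 1, and should be treated as assumptions.

```python
import math

# Assumed Karlin-Altschul parameters for gapped BLOSUM62 (existence 11, extension 1).
LAMBDA = 0.267
K = 0.041


def bit_score(raw_score, lam=LAMBDA, k=K):
    """Normalize a raw alignment score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


print(round(bit_score(189), 1))  # 77.4
```

With these assumed parameters, the raw score of 189 from the example maps to the reported 77.4 bits. Because λ and K depend on the scoring matrix and gap costs, bit scores from different scoring systems become comparable.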

Step 3: Pair-wise alignment of sequences against database- INPUT

Enter sequence 1
MSEDLKQIAQETESLRKVAFFGIAVSTIATLTAIIAVPMLYNYMQHVQSSLQSEVEFCQHRSNGLWDEYKRFQGVSGVEGRIKRDAYHRSLGVSGASRKARRQSYGNDAAVGGFGGSSGGSCCSCGSGAAGPAGSPGQDGAPGNDGAPGAPGNPGQDASEDQTAGPDSFCFDCPAGPPGPSGAPGQKGPSGAPGAPGQSGGAALPGPPGP

SELECT DATABASE
PROTEIN NUCLEOTIDE GENE PROTEOME GEO EST SNP

Word Size 3
Threshold 10
Gap penalty Existence 11, Extension 1
Scoring Matrix PAM30

ALIGNMENT ALGORITHM (BLAST)

Alignment can also be done by matching a sequence against a related database of sequences in order to identify it. Input the unknown sequence, and then select the database against which the sequence is to be matched. Fill in the parameter values according to the purpose of the search and the nature of the query sequence. In this case we study the hits using the PAM30 scoring matrix. Click on the BLAST tool once all parameters have been entered.

Step 4: Pair-wise alignment of sequences against database- OUTPUT

Percentage of residues exactly matching in the query sequence and the selected hit

The query is scanned to find domains from the Pfam database. If such a domain is identified, it is shown as part of the result

In the case of database searches, the E-value is found by multiplying the pair-wise e-value by the number of sequences in the database.

Identifies the protein sequence and the source organism for the unknown sequence

Alignment shows 100% matching with the identified sequence

Measure of the quality of the alignment when compared to bit scores of other hits of the search

Domain Identified (if any)
Pfam ID: pfam01484; Domain Name: Col_cuticle_N; Description: Nematode cuticle collagen N-terminal domain

ALIGNMENT:
Query    MSEDLKQIAQETESLRKVAFFGIAVSTIATLTAIIAVPMLYNYMQHVQSSLQSEVEFCQH
         MSEDLKQIAQETESLRKVAFFGIAVSTIATLTAIIAVPMLYNYMQHVQSSLQSEVEFCQH
Database MSEDLKQIAQETESLRKVAFFGIAVSTIATLTAIIAVPMLYNYMQHVQSSLQSEVEFCQH

IDENTIFICATION
GENE ID: col-13 | Collagen [Caenorhabditis elegans]

Percentage Identity: 100%
TOTAL SCORE: 624 bits
E-Value: 1e-176

Pair-wise alignment against a database gives various kinds of results. These include alignment views, the alignment score, a dot-plot, the e-value and the percentage identity, amongst many others. When compared to the bit scores of the other hits in the result, the bit score turns out to be the highest for collagen proteins in Caenorhabditis elegans.
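The dependence of the E-value on database size can be sketched with one commonly used form of the relationship, E = m · n · 2^(−S'), where S' is the bit score and m and n are the query and database lengths in residues. This is an approximation that ignores the length corrections real tools apply, and the lengths used below are made-up values for illustration only.

```python
def expected_hits(bits, query_length, database_length):
    """Approximate E-value: E = m * n * 2^(-S'), with lengths counted in residues."""
    return query_length * database_length * 2.0 ** (-bits)


# Hypothetical lengths, chosen only to show the scaling behaviour.
small_db = expected_hits(40, query_length=200, database_length=1_000_000)
large_db = expected_hits(40, query_length=200, database_length=10_000_000)
print(small_db, large_db)
```

For the same bit score, a tenfold larger database gives a tenfold larger E-value, which is why a hit must score higher to remain significant in a bigger search space.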

Step 5: Multiple Sequence Alignment - INPUT

The word-size is the length of the initial seed set of amino acids, which needs to match exactly to get the alignment extended in both directions

Window Length is the length of the residues on either side of the initial matched sequence, till which the alignment will be extended.

Enter sequence 1
>gi|39582116|ref|XP_363835.1| Hypothetical protein CBG18259 [Caenorhabditis briggsae]
MDEKQRLQAYRFVAYSAVTFSTVAVFSLCITLPLVYNYVDGIKTQINHEIKFCKHSARDIFAEVNHIRANPKNASRFARQAGYGTDEAVSGGS

Word Size 3

Users can choose to see absolute scores for comparing or percentage value of the scores

Window length 10

Enter sequence 2
>gi|17553051|ref|NP_502096.1| COLlagen family member (col-96) [Caenorhabditis elegans]
MDEITRRNAYRFVAYSAVTFSVVAVFSLCITLPMVYNYVHGIKSQINHQISFCKHSARDIFSEVNHIRASPNNATLREKRQAGDCSGCCL

Gap penalty Existence 11, Extension 1

Score type ABSOLUTE

Enter sequence 3
>gi|17553051|ref|NP_502096.1| COLlagen family member (col-13) [Caenorhabditis elegans]
MSEDLKQIAQETESLRKVAFFGIAVSTIATLTAIIAVPMLYNYMQHVQSSLQSEVEFCQHRSNGLWDEYKRFQGVSGVEGRIKRDAYH

MULTIPLE SEQUENCE ALIGNMENT (CLUSTAL-W)

Multiple Sequence Alignment tools are used to compare the amino acid sequences of more than two proteins. The word-size is the length of the seed set of amino acids, which needs to match exactly to get extended in both directions. Window Length is the length of the residues on either side, till which the alignment will be extended. The Gap penalty and extension hold the same meaning as in pair-wise alignment. In the scores, users can choose to see absolute scores for comparing or percentage value of the scores.

Step 6: Multiple Sequence Alignment - OUTPUT

Mapping of colors to amino acid groups

Text alignment of query sequences

Color coded alignment of query sequences

Alignment score, which can be compared with other scores to measure the quality of alignment

COLOR CODED ALIGNMENT

MULTIPLE SEQUENCE ALIGNMENT
sequence 1 MDE-----KQRLQAYRFVAYSAVTFSTVAVFSLCITLPLVYNYVDGIKTQ
sequence 2 MDE-----ITRRNAYRFVAYSAVTFSVVAVFSLCITLPMVYNYVHGIKSQ
sequence 3 MSEDLKQIAQETESLRKVAFFGIAVSTIATLTAIIAVPMLYNYMQHVQSS

ALIGNMENT SCORE
Sequence 1 Sequence 2 Sequence 3
5269

Multiple sequence alignment gives various kinds of results. The alignment view in text format displays the residue-wise matching for the input sequences. The color-coded alignment gives a better graphical picture, as the amino acid residues are assigned colors based on their physico-chemical properties. Here we depict one of the many color codings available. The alignment score is an absolute value, as selected previously. It can be compared with other scores to measure the quality of alignment. Users obtain an .output file with the summary of the result, .aln files which contain the text alignment and .dnd files which contain the distance-based information.
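The residue-wise matching in the text alignment can be inspected programmatically. The sketch below takes the three aligned rows shown in this step and lists the columns in which all sequences carry the same residue. It is a simple illustration of reading an alignment column by column, not how the alignment score above is calculated; the function name is hypothetical.

```python
def conserved_columns(rows):
    """Indices of columns where every row has the same residue and none has a gap."""
    if len({len(r) for r in rows}) > 1:
        raise ValueError("aligned rows must have equal length")
    return [i for i, col in enumerate(zip(*rows))
            if len(set(col)) == 1 and col[0] != "-"]


rows = [
    "MDE-----KQRLQAYRFVAYSAVTFSTVAVFSLCITLPLVYNYVDGIKTQ",
    "MDE-----ITRRNAYRFVAYSAVTFSVVAVFSLCITLPMVYNYVHGIKSQ",
    "MSEDLKQIAQETESLRKVAFFGIAVSTIATLTAIIAVPMLYNYMQHVQSS",
]
columns = conserved_columns(rows)
print(len(columns), "fully conserved columns at positions", columns)
```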

This animation consists of 2 parts:
Part 1: Protein Sequence Alignment
Part 2: Alignment analysis and interpretations

Phylogram representing evolutionary relationships
Structural features that decide function
Protein secondary structures

Definitions of the components Part 2 – Alignment analysis and interpretations

Computational Phylogenetic Predictions: Sequence alignment studies of proteins can reveal the conserved and variable residues between the sequences. Protein sequences derived from different organisms but having a high degree of similarity are assumed to come from the same ancestor. Such predictions, which can now be carried out computationally with the help of various algorithms, help in providing an insight into evolutionary processes.

Phylogram: A phylogram is a pictorial representation that provides a visualization of evolutionary relationships or phylogeny. In it, the lengths of the branches in the tree are considered to be proportional to the evolutionary distance.

Cladogram: A cladogram is another form of pictorial representation that also gives a visual insight into evolutionary relationships or phylogeny. Unlike the phylogram, the branches of a cladogram are of equal length irrespective of the evolutionary distance.

Maximum Parsimony: A method used for alignments which show very strong sequence similarity. It is usually applied to fewer than twelve sequences.

Distance methods: These predict the evolutionary distance when any sequence variation is present and can be used on a large number of sequences. As the distance between two sequences increases, the uncertainty of the alignment also increases.

Maximum likelihood: This method is useful for prediction of evolutionary distance when sequence variability is high. It can be used for alignments with any amount of variability.

Protein structure prediction: The three-dimensional structure of a protein is largely specified by its amino acid sequence. Protein structures can be predicted with an accuracy of 70-75% when provided with the sequence.

Functional annotation: Function(s) can be predicted for those proteins having a well-described homology. Gene Ontology terms (GO terms) provide a unique identification of the function that the gene is involved in. These functions are categorized at different levels of functional hierarchy.

Protein motif: A common pattern of residues in a set of protein sequences is known as a motif.

Step 1: Phylogenetic analysis from alignment- Input

Enter a sequence alignment for 2 or more sequences

Select a method

MAXIMUM PARSIMONY
USED FOR SEQUENCES WITH HIGHLY CONSERVED RESIDUES

DISTANCE METHODS
USED FOR SEQUENCES WITH MODERATELY CONSERVED RESIDUES

MAXIMUM LIKELIHOOD
USED FOR SEQUENCES WITH HIGHLY VARIABLE RESIDUES

Seq 1 LLFLFSSAYSRGVFRRDTHK
Seq 2 MKWVTFISLLFLFSSAYSRGVFRRDAH
Seq 3 MKWVTFLLLLFVSGSAFSRGVFRREA

PHYLOGENETIC ANALYSIS (PHYLIP)

Multiple sequence alignment produces alignment files (.aln), which can be used to determine the evolutionary distances of a set of given protein sequences. This can be achieved by many server-based and stand-alone programs. The user needs to select the method for calculating the distance. Here we depict the usage of alignment files for phylogenetic analysis.
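As a rough illustration of what a distance method starts from, the sketch below computes the simplest pairwise measure, the p-distance: the fraction of aligned positions at which two sequences differ. Phylogenetics packages typically apply further corrections and then build a tree from the matrix, so this is a simplified stand-in rather than the calculation used by any specific program. The rows passed in must already be aligned to equal length.

```python
def p_distance(row_a, row_b):
    """Fraction of differing residues over columns where neither row has a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    pairs = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(1 for x, y in pairs if x != y) / len(pairs)


def distance_matrix(rows):
    """Square matrix of pairwise p-distances for a list of aligned rows."""
    return [[p_distance(a, b) for b in rows] for a in rows]


# Toy aligned rows, invented for illustration.
for line in distance_matrix(["AFFGV", "AFYGV", "A-YGI"]):
    print(line)
```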

Step 2: Phylogenetic analysis from alignment- Output

DND files give the distance measure of the aligned sequences from their common ancestral node

A phylogram is a branching diagram depicting evolutionary relationships or phylogeny. In it, the lengths of the branches in the tree are considered to be proportional to the evolutionary distance.

PGFPPLVAPEPDALCAAFQDN
PNLPRLVRPEVDVMCTAFHDN
PKLK-PDPNTLCDEFKADEKKF

PHYLOGENETIC ANALYSIS (PHYLIP)

DND FILES
( seq 1:0.18964, Seq 2:0.09482, seq 3:0.28446);

PHYLOGRAM

CLADOGRAM

The outputs from the analysis are a distance file, known as the DND file, and a cladogram and a phylogram, which are evolutionary trees. In the DND file there is a common node, and the value against each sequence is its distance from that common ancestral node. Cladograms are graphical representations of the branching during the evolution of the proteins that were aligned; they do not represent the evolutionary distances or the common ancestral node. Phylograms also represent the evolutionary tree in a graphical format, but in these the branch lengths correspond to the evolutionary distance between the proteins, and all branches converge to a common ancestral root.
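The DND line shown in this step is a parenthesized list of name:distance pairs, so the distances can be read out with a few lines of code. The sketch below assumes that flat, single-level layout; trees with nested groups would need a proper parser. The function name is hypothetical.

```python
import re


def parse_leaf_distances(tree_text):
    """Map each sequence name to its distance in a flat '(name:distance, ...);' line."""
    pairs = re.findall(r"([^(),:;]+):([0-9.eE+-]+)", tree_text)
    return {name.strip(): float(distance) for name, distance in pairs}


tree = "( seq 1:0.18964, Seq 2:0.09482, seq 3:0.28446);"
distances = parse_leaf_distances(tree)
print(distances)
print(min(distances, key=distances.get))  # Seq 2
```

In this example, Seq 2 has the shortest branch and so lies closest to the common node.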

Structural and Functional prediction (MEME server)
Step 3: Structural and Functional prediction from alignment- Input

Enter a sequence alignment for 2 or more sequences

Range for width of the motifs to be found 6-50

Seq 1 PGFPPLVAPEPDALCAAFQDN
Seq 2 PNLPRLVRPEVDVMCTAFHDN
Seq 3 PKLK-PDPNTLCDEFKADEKKF

Maximum number of motifs to be found 3

Alignment files can also be used for a variety of structural and functional analysis. Here we represent the functioning of such programs and servers by taking a simple example of protein motif prediction. The range of the width and the maximum number of motifs to be found are defined by the user.

Step 4: Structural and Functional prediction from alignment- Output

The color coded diagram shows the positions of the motifs in the text alignment of the compared sequences

Block diagram of motif prediction is the schematic used to visualize the positions and kinds of motifs in the alignment of two or more sequences

PGFPPLVAPEPDALCAAFQDN
PNLPRLVRPEVDVMCTAFHDN
PKLK-PDPNTLCDEFKADEKKF

Residue-wise sites for motifs
Color coded block diagram for motifs

The outputs obtained are:
1. Block diagram of protein motifs, which is the schematic used to visualize the positions and kinds of motifs in the alignment of two or more sequences. The color coding varies from server to server.
2. Sites of the blocks on a residue-by-residue basis.
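The residue-by-residue sites output can be imitated with a simple pattern scan. The sketch below does not discover motifs statistically as a motif server does; it only reports where a pattern supplied by the user occurs. The pattern C..F (a cysteine and a phenylalanine separated by any two residues) is a hypothetical example chosen for illustration, not a motif reported in this material.

```python
import re


def motif_sites(sequences, pattern):
    """Return {name: [(start, matched_text), ...]} for a regular-expression motif."""
    compiled = re.compile(pattern)
    return {name: [(m.start(), m.group()) for m in compiled.finditer(seq)]
            for name, seq in sequences.items()}


sequences = {
    "Seq 1": "PGFPPLVAPEPDALCAAFQDN",
    "Seq 2": "PNLPRLVRPEVDVMCTAFHDN",
    "Seq 3": "PKLK-PDPNTLCDEFKADEKKF",
}
for name, sites in motif_sites(sequences, "C..F").items():
    print(name, sites)
# Seq 1 [(14, 'CAAF')]
# Seq 2 [(14, 'CTAF')]
# Seq 3 [(11, 'CDEF')]
```

Positions are counted from 0 and include the gap character in Seq 3.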

Step 5: Structural and Functional prediction from alignment- Further Analysis

Enzyme Active sites
Epitope prediction in antigens
Finding Trans-membrane domains
Identify DNA binding residues

Once the protein motifs are detected, they can be used for further analysis, such as:
1. Epitope prediction
2. Active site determination
3. Determination of trans-membrane domains
4. Identification of DNA binding residues

Functions that can be predicted from sequence data

Interactivity option 1: Find the evolutionary distance between insulin chain A of human and mouse

Input the term "insulin chain A" in the protein database of your choice
Check the source organism for the protein sequence
Choose the protein sequences corresponding to insulin A
Store the FASTA sequences listed for human and mouse in separate locations
Input the two sequences into a multiple alignment server
Run the server to obtain the output
Check for the .aln file and input it into a program for finding phylogenetic distances, such as PHYLIP
Check the .dnd file

Interactivity option 2.a: Match the following
Match the left column to the right. Match the meaning of the parameter on the right to the name of the parameter on the left. If the matching is correct, turn the tab green, else flash “Try Again”. Results are shown on the next slide.

Items to be matched (shown shuffled):
SIMILARITY BASED SCORING MATRIX
PAM MATRIX
EVOLUTIONARY TREE
DOMAIN IDENTIFICATION
MEASURE OF BIOLOGICAL SIGNIFICANCE
PHYLOGRAM
DISTANCE BASED SCORING MATRIX
BIT SCORE
MEASURE OF QUALITY OF ALIGNMENT, NORMALIZED ACCORDING TO SCORING MATRIX
E-VALUE
BLOSUM MATRIX
BLAST RESULT LINKED TO PFAM

Interactivity option 2.b: Match the following
Correct matching:
PAM MATRIX : SIMILARITY BASED SCORING MATRIX
BLAST RESULT LINKED TO PFAM : DOMAIN IDENTIFICATION
EVOLUTIONARY TREE : PHYLOGRAM
MEASURE OF QUALITY OF ALIGNMENT, NORMALIZED ACCORDING TO SCORING MATRIX : BIT SCORE
E-VALUE : MEASURE OF BIOLOGICAL SIGNIFICANCE
BLOSUM MATRIX : DISTANCE BASED SCORING MATRIX

Questionnaire
1. Which is a scoring matrix based on distantly related proteins?
Answers: a) PAM b) BLOSUM c) Both d) None
2. Which parameter signifies whether the match between two sequences is a chance alignment?
Answers: a) word-length b) e-value c) dot-plot d) none
3. Which evolutionary tree has the branch length corresponding to the evolutionary distances?
Answers: a) Phylogram b) Cladogram c) both d) none
4. Which is NOT a ClustalW output file extension?
Answers: a) .dnd b) .txt c) .aln d) .output
5. Which phylogenetic method is used for the most variable sequences?
Answers: a) Distance method b) Maximum Distance c) Maximum Parsimony d) Maximum Likelihood

Links for further reading
Books: Bioinformatics Sequence and Genome Analysis by David Mount

