1
Sequence Based Analysis Tutorial
NIH Proteomics Workshop
Cecilia Arighi, Ph.D.
Protein Information Resource at Georgetown University Medical Center
2
Retrieval, Sequence Search & Classification Methods
Retrieve protein info by text / UID
Sequence Similarity Search
  BLAST, FASTA, Dynamic Programming
Family Classification
  Patterns, Profiles, Hidden Markov Models, Sequence Alignments, Neural Networks
Integrated Search and Classification System
3
Sequence Similarity Search (I)
Based on Pair-Wise Comparisons
Dynamic Programming Algorithms
  Global Similarity: Needleman-Wunsch
  Local Similarity: Smith-Waterman
Heuristic Algorithms
  FASTA: Based on K-Tuples (2-Amino Acid)
  BLAST: Triples of Conserved Amino Acids
  Gapped-BLAST: Allow Gaps in Segment Pairs
  PHI-BLAST: Pattern-Hit Initiated Search
  PSI-BLAST: Position-Specific Iterated Search
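The two dynamic programming algorithms named above differ mainly in how the score table is initialized and whether scores may drop below zero. The following is a minimal sketch, assuming a simple match/mismatch scheme and a linear gap penalty (the parameter values are illustrative and are not taken from the slides); real searches would use a substitution matrix and affine gaps.

```python
def needleman_wunsch_score(a, b, match=2, mismatch=-1, gap=-2):
    """Global alignment score: both sequences are aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]  # first row: leading gaps
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)          # first column: leading gaps
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[len(b)]


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment score: best-scoring pair of subsequences."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The floor at 0 lets a local alignment restart anywhere.
            curr[j] = max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

With these defaults, two unrelated sequences such as "AAAA" and "TTTT" get a negative global score but a local score of zero, which is why local alignment is the usual choice for database searching.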
4
Sequence Similarity Search (II)
Similarity Search Parameters
  Scoring Matrices: Based on Conserved Amino Acid Substitution
    Dayhoff Mutation Matrix, e.g., PAM250 (~20% Identity)
    Henikoff Matrix from Ungapped Alignments, e.g., BLOSUM 62
  Gap Penalty
Search Time Comparisons
  Smith-Waterman: 10 Min
  FASTA: 2 Min
  BLAST: 20 Sec
5
Feature Representation
Features of Amino Acids: Physicochemical Properties, Context (Local & Global) Features, Evolutionary Features
Alternative Amino Acids: Classification of Amino Acids To Capture Different Features of Amino Acid Residues
6
Substitution Matrix
Likelihood of One Amino Acid Mutated into Another Over Evolutionary Time
  Negative Score: Unlikely to Happen (e.g., Gly/Trp, -7)
  Positive Score: Conservative Substitution (e.g., Lys/Arg, +3)
  High Score for Identical Matches: Rare Amino Acids (e.g., Trp, Cys)
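A substitution matrix is used as a symmetric lookup table when scoring aligned residue pairs. The sketch below holds only four entries: Gly/Trp and Lys/Arg come from the slide, while the Trp/Trp and Cys/Cys values are taken from the commonly tabulated PAM250 matrix and are an assumption here, since the slide gives no numbers for identical matches. The function names are hypothetical.

```python
# Tiny excerpt of a substitution matrix, keyed by residue pair.
# G/W and K/R are the slide's examples; W/W and C/C are assumed PAM250 values.
SCORES = {
    ("G", "W"): -7,   # unlikely substitution
    ("K", "R"): 3,    # conservative substitution
    ("W", "W"): 17,   # identical match, rare residue
    ("C", "C"): 12,   # identical match, rare residue
}


def pair_score(x, y, table=SCORES):
    """Look up a pair in either order, since the matrix is symmetric."""
    if (x, y) in table:
        return table[(x, y)]
    if (y, x) in table:
        return table[(y, x)]
    raise KeyError(f"no score for pair {x}/{y} in this excerpt")


def ungapped_score(a, b, table=SCORES):
    """Sum the pair scores of two equal-length, ungapped aligned segments."""
    if len(a) != len(b):
        raise ValueError("segments must have the same length")
    return sum(pair_score(x, y, table) for x, y in zip(a, b))
```

A full matrix would cover all 20 x 20 residue pairs; the excerpt is kept small so that every value can be checked against a published table.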
7
Secondary Structure Features
α Helix: Patterns of hydrophobic residue conservation at positions i, i+3, i+4, i+7 are highly indicative of an (amphipathic) α helix
β Strands: Strands that are half buried in the protein core tend to have hydrophobic residues at positions i, i+2, i+4, i+6
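The two periodicities above can be checked directly on a sequence. This is a minimal sketch: the set of residues treated as hydrophobic is an assumption (the slide does not define one), and a real predictor would look at conservation across an alignment, not at a single sequence.

```python
# Assumption: residues treated as hydrophobic for this illustration.
HYDROPHOBIC = set("AVLIMFWY")

HELIX_OFFSETS = (0, 3, 4, 7)    # i, i+3, i+4, i+7
STRAND_OFFSETS = (0, 2, 4, 6)   # i, i+2, i+4, i+6


def pattern_starts(seq, offsets, hydrophobic=HYDROPHOBIC):
    """Return every start index i where all offset positions are hydrophobic."""
    span = max(offsets)
    return [
        i
        for i in range(len(seq) - span)
        if all(seq[i + k] in hydrophobic for k in offsets)
    ]
```

For example, `pattern_starts("LKKLLKKL", HELIX_OFFSETS)` reports a helix-like pattern starting at position 0, while the same sequence gives no hits for `STRAND_OFFSETS`.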
8
BLAST
BLAST (Basic Local Alignment Search Tool)
  Extremely fast
  Robust
  Most frequently used
It finds very short segment pairs ("seeds") between the query and the database sequence.
These seeds are then extended in both directions until the maximum possible score for extensions of this particular seed is reached.
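The seed-and-extend idea can be sketched in a few lines. This is a simplified illustration, not the BLAST implementation: it assumes exact 3-letter word matches as seeds, a match/mismatch score instead of a substitution matrix, ungapped extension, and a drop-off threshold (`x_drop`) that stops an extension once the score falls too far below its best value. All names and default values are hypothetical.

```python
def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) for every shared exact word of length k."""
    index = {}
    for j in range(len(subject) - k + 1):
        index.setdefault(subject[j:j + k], []).append(j)
    return [
        (i, j)
        for i in range(len(query) - k + 1)
        for j in index.get(query[i:i + k], [])
    ]


def extend_seed(query, subject, i, j, k=3, match=1, mismatch=-1, x_drop=2):
    """Extend a seed without gaps in both directions.

    Returns (score, query_start, query_end) of the best-scoring extension.
    """
    def walk(qi, sj, step):
        score = best = 0
        length = best_len = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            score += match if query[qi] == subject[sj] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > x_drop:
                break  # score fell too far below the best seen so far
            qi += step
            sj += step
        return best, best_len

    right, right_len = walk(i + k, j + k, 1)
    left, left_len = walk(i - 1, j - 1, -1)
    return k * match + left + right, i - left_len, i + k + right_len
```

Indexing the database words once and looking up query words in that index is what makes the seeding step fast compared with filling a full dynamic programming table.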
9
BLAST Search From BLAST Search Interface
Table-Format Result with BLAST Output and SSEARCH (Smith-Waterman) Pair-Wise Alignment
Links to iProClass and UniProtKB reports
Link to NCBI taxonomy
Link to PIRSF report
Click to see SSearch alignment
Click to see alignment
10
Blast Result & Pairwise Alignment
BLAST Alignment
11
Classification
What is classification?
Why do we need protein classification?
Different levels of classification
Basis for functional protein classification
How to classify a protein of unknown function?
12
Classification Databases
Levels: Protein motif, Protein domain, 3-D structure, Whole-protein
Protein motif: A set of conserved amino acid residues that are important for protein function and located within a certain distance from one another
  Example: C - x(2,4) - C - x(3) - [LIVMFYWC] - x(8) - H - x(3,5) - H
  The 2 C's and the 2 H's are zinc ligands
Protein domain: Structurally compact, independently folded unit that forms a stable 3D structure and shows a certain level of evolutionary conservation
  Group proteins according to the presence of a common domain
3-D structure: Group proteins according to common 3D structure
Whole-protein: Group proteins according to common domain architecture and length
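A PROSITE-style pattern like the zinc finger example above maps closely onto a regular expression. The converter below is a minimal sketch that handles only the elements seen in this tutorial plus `{...}` exclusions: single residues, `x`, `[...]` alternatives and `(n)` or `(n,m)` repeat counts. The function name is hypothetical, and anchors such as `<` and `>` are deliberately not supported.

```python
import re

# One pattern element: a core (x, residue, [set] or {excluded set})
# followed by an optional repeat count "(n)" or "(n,m)".
_ELEMENT = re.compile(
    r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?"
)


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.replace(" ", "").rstrip(".").split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, lo, hi = m.groups()
        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # any residue except these
        if lo is not None:
            core += "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(core)
    return "".join(parts)
```

The zinc finger pattern becomes `C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H`, which can then be passed to `re.search` or `re.finditer` to locate matches in a protein sequence.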
13
Family Classification Methods
Based on Other Classification Information
Multiple Sequence Alignment (ClustalW)
ProSite Pattern Search
Profile Search
Hidden Markov Models (HMMs)
  Domain (Pfam); Whole protein (PIRSF)
Neural Networks
14
How do you build a tree?
Pick sequences to align
Align them
Verify the alignment
Keep the parts that are aligned correctly
Build and evaluate a phylogenetic tree
Integrated Analysis
15
Multiple Sequence Alignment: CLUSTALW
Pairwise alignment: Calculate distance matrix
  Mean number of differences per residue
Unrooted Neighbor-Joining Tree
  Branch length drawn to scale
Rooted NJ Tree (guide tree)
  Root placed at a position where the means of the branch lengths on either side of the root are equal
Progressive Alignment guided by the tree
  Alignment starts from the tips of the tree towards the root
Thompson et al., NAR 22, 4675 (1994).
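The first step, a distance matrix of mean differences per residue, can be sketched as follows. This is an illustration only: it assumes the sequences are already pairwise aligned to the same length, and that positions where either sequence has a gap are skipped (an assumption about gap handling, not a statement of what CLUSTALW does internally). The function names are hypothetical.

```python
def p_distance(a, b, gap="-"):
    """Mean number of differences per compared residue of two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    # Assumption: columns with a gap in either sequence are not compared.
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        raise ValueError("no comparable positions")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(aligned):
    """Build {(name1, name2): distance} from a dict of name -> aligned sequence."""
    names = list(aligned)
    return {(p, q): p_distance(aligned[p], aligned[q]) for p in names for q in names}
```

The resulting matrix is the input a neighbor-joining implementation would use to build the guide tree.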
16
PIR Multiple Alignment and Tree
From Text/Sequence Search Result or CLUSTAL W Alignment Interface
17
Here is an example of two different functions easily separated on a phylogenetic tree. Each functional group is used to build an HMM.
18
PIR Pattern Search
Signature Patterns for Functional Motifs
From Text/Sequence Search Result or Pattern Search Interface
Alignment of a region involved in catalytic activity
A. Create pattern and search in database: P-[IV]-[WY]-x(3)-H-[MR]-V-x(3,4)-Q-x(1,2)-D-x(4,5)-G-A-N
B. Test sequence against PROSITE database: O05689
19
Pattern Search Result (I)
One Query Pattern Against UniProtKB or UniRef100 DBs
Display the query pattern
Indicate pattern sequence region(s)
Links to iProClass and UniProtKB reports
Link to NCBI taxonomy
Link to PIRSF report
20
Pattern Search Result (II)
One Query Sequence Against PROSITE Pattern Database
21
Profile Method
Profile: A Table of Scores to Express Family Consensus Derived from Multiple Sequence Alignments
  Num of Rows = Num of Aligned Positions
  Each row contains a score for the alignment with each possible residue.
Profile Searching
  Summation of Scores for Each Amino Acid Residue along Query Sequence
  Higher Match Values at Conserved Positions
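The profile idea can be made concrete with a small sketch. It assumes an ungapped alignment block, a uniform background frequency of 1/20 and a pseudocount of 1, and scores each residue as a log-odds value; these choices are illustrative assumptions, not the scoring scheme of any particular profile database.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """One row per aligned position; each row maps residue -> log-odds score."""
    length = len(alignment[0])
    if any(len(s) != length for s in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1 / len(AMINO_ACIDS)  # assumption: uniform background
    profile = []
    for pos in range(length):
        column = [s[pos] for s in alignment]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((column.count(aa) + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score_window(profile, segment):
    """Sum the row scores of each residue along a query segment."""
    return sum(row[aa] for row, aa in zip(profile, segment))


def best_hit(profile, query):
    """Return (score, start) of the best-scoring window in the query."""
    width = len(profile)
    return max(
        (score_window(profile, query[i:i + width]), i)
        for i in range(len(query) - width + 1)
    )
```

Positions that are fully conserved in the alignment receive the highest scores for their residue, so windows that match the conserved columns dominate the sum.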
22
Prosite PS50157 profile for Zinc finger C2H2
23
PIRSF scan
Search one query protein against all the full-length and domain HMM models for the fully curated PIRSFs by HMMER. The matched regions and statistics will be displayed.
Shows the PIRSF that the query belongs to
Statistical data for all domains
Statistical data per domain
Alignment with consensus sequence
24
Creation and Curation of PIRSFs
25
Integrated Bioinformatics System for Function and Pathway Discovery
Data Integration Associative Analysis
26
Analytical Pipeline Family Classification & Functional Analysis
Query Sequence; UniProt; BLAST Search; Top-Matched Superfamilies/Domains; HMM Domain Search; HMM Motif Search; Pattern Search; SignalP/TMHMM; Predicted Superfamilies/Domains/Motifs/Sites/Signal Peptides/TMHs; SSEARCH; CLUSTALW; Superfamily/Domain/Motif Alignments; Family Relationships & Functional Features; Family Classification & Functional Analysis
27
Integrated Bioinformatics System
Global Bioinformatics Analysis of 1000’s of Genes and Proteins Pathway Discovery, Target Identification
28
Lab Section
29
Rat eye lens phosphoproteomics in normal and cataract
Kamei et al., Biol. Pharm. Bull., 2005.
[2D gel image: Normal and Cataract panels; pI axis from (-) to (+); Mw axis]
More phosphorylated spots in cataract sample.
Digestion and MS from Spot 16 gave these peptides:
  MDVTIQHPWFKR
  ALGPFYPSR
  CSLSADGMLTFSG
  YRLPSNVDQSALS
We want to identify the protein(s) that contain these peptides
Use Peptide Search
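Conceptually, a peptide search is an exact substring lookup of each peptide against every sequence in a database. The sketch below shows the idea on an in-memory dictionary; the function name and the toy database entries are hypothetical, and the toy sequences are built from the peptides plus filler residues, so they are not real protein sequences.

```python
def peptide_search(peptides, database):
    """Return {accession: [(peptide, position), ...]} for exact peptide matches."""
    hits = {}
    for accession, sequence in database.items():
        found = [(p, sequence.find(p)) for p in peptides if p in sequence]
        if found:
            hits[accession] = found
    return hits


# Peptides from Spot 16, as listed on the slide.
PEPTIDES = ["MDVTIQHPWFKR", "ALGPFYPSR", "CSLSADGMLTFSG", "YRLPSNVDQSALS"]

# Hypothetical toy database: filler residues around two of the peptides.
TOY_DB = {
    "TOY1": "GG" + "MDVTIQHPWFKR" + "GG" + "ALGPFYPSR",
    "TOY2": "GGGGGG",
}
```

Calling `peptide_search(PEPTIDES, TOY_DB)` reports both matching peptides and their positions for `TOY1` and omits `TOY2`, which mirrors how the search result highlights the matching peptide within each retrieved sequence.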
30
Peptide Search Restrict search to an organism
31
Peptide Search & Results
Species restricted search
Sorting arrows
Search in UniProtKB, 23 proteins
Links to iProClass and UniProtKB reports
Link to NCBI taxonomy
Link to PIRSF report
Matching peptide highlighted in the sequence
32
Retrieve more sequences
Batch Retrieval Results (I)
Retrieve multiple proteins from iProClass using a specific identifier or a combination of identifiers
Provides a means to easily retrieve and analyze proteins when the identifiers come from different databases
33
Blast Similarity Search
What proteins are related to rat CRYAA? Perform sequence similarity search >P24623
34
Blast Search Results
BLAST (partial) result for CRYAA_RAT in UniProtKB database
BLAST alignment with the human protein
35
Pairwise Alignment
36
PIR protein family classification
PIR Text Search: UniProtKB database and unique UniParc sequences
Let's search for human crystallins
PIR protein family classification database
37
Let’s look for crystallins which have 3D structure
Refine your search or start over Display PDB ID
38
Let’s perform a multiple alignment on the sequences containing PF00030
Domain Display allows simultaneous comparison of the Pfam domains present in multiple proteins
Share same domain architecture
39
Multiple Alignment
40
Interactive Phylogenetic Tree and Alignment
Beta B1 and gamma crystallins share the same domains and SCOP fold, and show significant sequence similarity, suggesting that they are related
41
Pattern Search (I) Select P07320 and perform a pattern search
Search for proteins containing this pattern (PS00225) in rat
42
Pattern Search Result Beta and gamma Crystallins have multiple copies of this pattern
43
Pfam domains assigned with high confidence
PIRSF provides a single platform where all the previous analyses have been done by curators
Validation tag: Represents extent of manual curation
Link to PIRSF report
44
Taxonomic Distribution Alpha-crystallin is exclusively found in metazoans Domain Architecture Multiple Alignment
45
PIRSF scan
46
PIRSF report (I): a single platform to study proteins
Subfamily level
47
PIRSF report (II)
Cross-links to other databases
Gene Ontology (http://www.geneontology.org/) provides a controlled vocabulary to describe gene and gene product attributes in any organism
48
alpha-Crystallin and Related Proteins
Alpha crystallin beta chain HSPs Alpha crystallin alpha chain