BIO-TRAC 25 (Proteomics: Principles and Methods) October 10, 2003 NIH, Bethesda, MD Zhang-Zhi Hu, M.D. Senior Bioinformatics Scientist, Protein Information Resource National Biomedical Research Foundation, GUMC Tutorial: Bioinformatics Resources
2 What is Bioinformatics? NIH Biomedical Information Science and Technology Initiative (BISTI) Working Definition (2002) - Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data. Bioinformatics is the application of information technology to the analysis, organization and distribution of biological data in order to answer complex biological questions.
3 Bioinformatics Resources
- The Molecular Biology Database Collection: An Online Compilation of Relevant Database Resources (2003 update)
- Nucleic Acids Research Database Issues (published annually in January)
- DBcat: A Catalog of > 500 Biological Databases
4 Molecular Biology Database Collection
5 The Molecular Biology Database Collection: 2003 update (Baxevanis, A.D.)
An online resource of 386 key databases in 18 categories:
- Major sequence repositories
- Comparative Genomics
- Gene Expression
- Gene Identification and Structure
- Genetic and Physical Maps
- Genomic Databases
- Intermolecular Interactions
- Metabolic Pathways and Cellular Regulation
- Mutation Databases
- Pathology
- Protein Sequence Motifs
- Proteome Resources
- Retrieval Systems and Database Structure
- RNA Sequences
- Structure
- Transgenics
- Varied Biomedical Content
6 Overview
- Protein Sequence Analysis
  I. Sequence Similarity Search and Alignment
  II. Family Classification Methods
  III. Structure Prediction Methods
- Molecular Biology Databases
  IV. Protein Family Databases
  V. Databases of Protein Functions
  VI. Databases of Protein Structures
- Proteomic Resources
  VII. 2D-gel databases
  VIII. Proteomic analyses
7 I. Sequence Similarity Search
- Find a protein sequence: text search
- Based on Pair-Wise Comparisons
  - BLOSUM scoring matrix
  - PAM scoring matrix
- Dynamic Programming Algorithms
  - Global Similarity: Needleman-Wunsch (GAP)
  - Local Similarity: Smith-Waterman (BestFit, SSEARCH)
- Heuristic Algorithms (Sequence Database Searching)
  - FASTA: Based on K-Tuples (2-Amino Acid)
  - BLAST: Triples of Conserved Amino Acids
  - Gapped-BLAST: Allow Gaps in Segment Pairs (NREF)
  - PHI-BLAST: Pattern-Hit Initiated Search (NCBI)
  - PSI-BLAST: Iterative Search (NCBI)
8 Sequence Search by Text or Unique ID Entrez; PIR text search ( n.edu/pirwww/search/textsearch.html)
9 Pair-Wise Comparisons Scoring matrix; global and local similarity by dynamic programming (Needleman-Wunsch, Smith-Waterman)
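The difference between global (Needleman-Wunsch) and local (Smith-Waterman) dynamic programming can be sketched in a few lines. This is a minimal illustration, not the implementation used by GAP or SSEARCH: it assumes a simple match/mismatch score in place of a BLOSUM or PAM matrix and a linear gap penalty, and the function names and default scores are hypothetical.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score: both sequences are aligned end to end."""
    # Row 0: aligning a prefix of b against nothing costs one gap per residue.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a against nothing
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + s,   # align a[i-1] with b[j-1]
                            prev[j] + gap,     # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]


def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Local alignment score: best-scoring pair of subsequences."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The floor at 0 lets an alignment restart anywhere.
            cell = max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    print(needleman_wunsch_score("MKVLA", "MKLA"))
    print(smith_waterman_score("GGGMKVLAGGG", "TTMKVLATT"))
```

The only structural differences are the boundary conditions and the floor at zero: the global score pays for every unaligned residue, while the local score ignores poorly matching flanks.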
10 FASTA Search ( ac.uk/fasta33/)
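The k-tuple idea behind FASTA can be illustrated by indexing every 2-residue word of a database sequence and counting the words a query shares with it on each diagonal. This sketch covers only that first lookup stage under simplifying assumptions (exact word matches, no rescoring or gapped refinement); the function names are hypothetical and are not part of the FASTA package.

```python
from collections import Counter, defaultdict


def ktuple_index(seq, k=2):
    """Map each k-residue word to the positions where it starts in seq."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def diagonal_hits(query, subject, k=2):
    """Count shared k-tuples per diagonal (query offset minus subject offset)."""
    index = ktuple_index(subject, k)
    diagonals = Counter()
    for i in range(len(query) - k + 1):
        # .get avoids creating empty entries for words absent from the subject.
        for j in index.get(query[i:i + k], ()):
            diagonals[i - j] += 1
    return diagonals


if __name__ == "__main__":
    hits = diagonal_hits("MKVLA", "GGMKVLA")
    # A diagonal with many hits marks a candidate region of similarity.
    print(hits.most_common(1))
```

Diagonals with many word hits are the regions a heuristic search would examine further, which is why this approach is much faster than filling a full dynamic programming matrix for every database sequence.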
11 Gapped-BLAST Search
12 A BLAST Result
13 PSI-BLAST Iterative Search
14 PSI-BLAST
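PSI-BLAST iterates by turning the hits from one round into a position-specific profile and searching again with that profile. The sketch below shows only the core idea of such a profile, as log-odds scores per alignment column. It assumes a uniform background frequency, a fixed pseudocount, and no sequence weighting, all of which are simplifications relative to the real program; the function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Build log2-odds scores per column from equal-length aligned sequences."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for col in range(length):
        # Gaps and non-standard symbols are simply skipped.
        counts = Counter(s[col] for s in aligned if s[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_segment(pssm, segment):
    """Sum the column scores for a segment; unknown residues score 0."""
    return sum(col.get(aa, 0.0) for col, aa in zip(pssm, segment))


if __name__ == "__main__":
    profile = build_pssm(["ACD", "ACE", "ACD"])
    print(score_segment(profile, "ACD"), score_segment(profile, "WWW"))
```

Residues that are conserved in a column receive positive scores and unobserved residues receive negative ones, so a profile can detect family members that a single query sequence would miss.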
15 II. Family Classification Methods
- Multiple Sequence Alignment and Phylogenetic Analysis
  - ClustalW Multiple Sequence Alignment
  - Alignment Editor & Phylogenetic Trees
- Searches Based on Family Information
  - PROSITE Pattern Search
  - Motif and Profile Search
  - Hidden Markov Models (HMMs)
16 Multiple Sequence Alignment: ClustalW
17 Alignment Editor (Jalview)
18 Alignment Editor (GeneDoc)
19 Phylogenetic Analysis Tree Programs: ( genetics.washington.edu/phylip.html) Tree Searches: ( mbu.iisc.ernet.in/~pali/index.html)
20 Phylogenetic Trees (IGFBP Superfamily): Radial Tree and Phylogram
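Distance-based tree methods start from a matrix of pairwise distances computed over a multiple sequence alignment. As a hedged illustration, the sketch below computes the simplest such measure, the p-distance (fraction of differing positions), skipping columns where either sequence has a gap. It is not how any particular tree program computes distances; the function names and the toy alignment are hypothetical.

```python
def p_distance(a, b, gap="-"):
    """Fraction of differing residues over columns where neither has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    compared = differing = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue
        compared += 1
        if x != y:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differing / compared


def distance_matrix(alignment):
    """Pairwise p-distances for a dict of name -> aligned sequence."""
    names = list(alignment)
    return {(m, n): p_distance(alignment[m], alignment[n])
            for m in names for n in names}


if __name__ == "__main__":
    toy = {"seq1": "ACDEFG", "seq2": "ACDEYG", "seq3": "AC-QYG"}
    for pair, d in distance_matrix(toy).items():
        print(pair, round(d, 3))
```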
21 PROSITE Pattern Search
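A PROSITE pattern search can be approximated locally by translating the pattern syntax into a regular expression. The sketch below handles the common elements: `x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, `(n)` or `(n,m)` for repetition, and `<` / `>` for the sequence termini. It is an illustration only and does not handle every corner of the syntax (for example, `>` inside square brackets); the function names are hypothetical. The example uses the N-glycosylation site pattern `N-{P}-[ST]-{P}`.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern string into a Python regex string."""
    pattern = pattern.strip().rstrip(".")  # patterns end with a period
    parts = []
    for element in pattern.split("-"):
        element = element.replace("<", "^").replace(">", "$")
        element = element.replace("x", ".")                      # any residue
        element = element.replace("{", "[^").replace("}", "]")   # forbidden set
        element = element.replace("(", "{").replace(")", "}")    # repetition
        parts.append(element)
    return "".join(parts)


def find_prosite_matches(pattern, sequence):
    """Return (0-based start, matched text) for non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    print(prosite_to_regex("N-{P}-[ST]-{P}."))
    print(find_prosite_matches("N-{P}-[ST]-{P}.", "MNGTA"))
```

Note that `re.finditer` reports non-overlapping matches, so overlapping sites would need a lookahead-based search instead.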
22 Profile Search
23 Hidden Markov Model Search ( -heidelberg.de)
24 III. Structural Prediction Methods
- Signal Peptide: SIGFIND, SignalP
- Transmembrane Helix: TMHMM, TMAP
- 2D Prediction (α-helix, β-sheet, Coiled-coils): PHD, JPred
- 3D Modeling: Homology Modeling (Modeller, SWISS-MODEL), Threading, Ab initio Prediction
25 Structure Prediction: A Guide ( heidelberg.de/gtsp/flowchart2.html)
26 Protein Prediction Server ( dtu.dk/services/)
27 Signal Peptide Prediction ( k/services/SignalP-2.0)
28 Transmembrane Helix
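To give a feel for what a transmembrane helix predictor looks for, the sketch below uses a much simpler technique than the hidden Markov model behind TMHMM: a sliding-window average of Kyte-Doolittle hydropathy values. The window of 19 residues and the threshold of 1.6 are conventional choices for this kind of plot and are assumptions here, as are the function names; the output is only a rough list of candidate windows.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Mean hydropathy of each full window; unknown residues count as 0."""
    scores = [KYTE_DOOLITTLE.get(aa, 0.0) for aa in seq]
    return [sum(scores[i:i + window]) / window
            for i in range(len(seq) - window + 1)]


def candidate_tm_windows(seq, window=19, threshold=1.6):
    """0-based start positions of windows at or above the threshold."""
    return [i for i, value in enumerate(hydropathy_profile(seq, window))
            if value >= threshold]


if __name__ == "__main__":
    toy = "DDDDDKKKKK" + "LLIVLLAVLIGLLAVVLIA" + "KKKKKDDDDD"
    print(candidate_tm_windows(toy))
```

A run of consecutive high-scoring windows suggests a hydrophobic stretch long enough to span a membrane; dedicated servers additionally model helix caps, loops, and topology.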
29 Protein Structure Prediction ( biotools/biotools9.html)
30 Structure Prediction Server ( dee.ac.uk/WWW_Servers/JPred/jpred.html)
31 3D-Modelling ( ch/swissmod/SWISS-MODEL.html)
32 IV. Protein Family Databases
- Whole Proteins
  - PIR: Superfamilies and Families
  - COG (Clusters of Orthologous Groups) of Complete Genomes
  - ProtoNet: Automated Hierarchical Classification of Proteins
- Protein Domains
  - Pfam: Alignments and HMM Models of Protein Domains
  - SMART: Protein Domain Families
- Protein Motifs
  - PROSITE: Protein Patterns and Profiles
  - BLOCKS: Protein Sequence Motifs and Alignments
  - PRINTS: Protein Sequence Motifs and Signatures
- Integrated Family Databases
  - iProClass: Superfamilies/Families, Domains, Motifs, Rich Links
  - InterPro: Integrates Pfam, PRINTS, PROSITE, ProDom, SMART
33 Protein Clustering
34 Protein Domains Pfam; SMART ( smart.embl-heidelberg.de/smart/show_motifs.pl)
35 Protein Motifs PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles.
36 Integrated Family Classification InterPro: An integrated resource unifying PROSITE, PRINTS, ProDom, Pfam, SMART, TIGRFAMs, and PIRSF.
37 V. Databases of Protein Functions
- Metabolic Pathways, Enzymes, and Compounds
  - Enzyme Classification: Classification and Nomenclature of Enzyme-Catalysed Reactions (EC-IUBMB)
  - KEGG (Kyoto Encyclopedia of Genes and Genomes): Metabolic Pathways
  - LIGAND (at KEGG): Chemical Compounds, Reactions and Enzymes
  - EcoCyc: Encyclopedia of E. coli Genes and Metabolism
  - MetaCyc: Metabolic Encyclopedia (Metabolic Pathways)
  - WIT: Functional Curation and Metabolic Models
  - BRENDA: Enzyme Database
  - UM-BBD: Microbial Biocatalytic Reactions and Biodegradation Pathways
  - Klotho: Collection and Categorization of Biological Compounds
- Cellular Regulation and Gene Networks
  - EpoDB: Genes Expressed during Human Erythropoiesis
  - BIND: Descriptions of interactions, molecular complexes and pathways
  - DIP: Catalogs experimentally determined interactions between proteins
  - RegulonDB: Escherichia coli Pathways and Regulation
38 KEGG Metabolic & Regulatory Pathways ( bin/show_pathway?hsa) KEGG is a suite of databases and associated software, integrating our current knowledge on molecular interaction networks, the information of genes and proteins, and of chemical compounds and reactions.
39 BioCyc (EcoCyc/MetaCyc Metabolic Pathways) The BioCyc Knowledge Library is a collection of Pathway/Genome Databases
40 Protein-Protein Interactions: DIP
41 Protein-Protein Interaction: BIND
42 BioCarta Cellular Pathways
43 VI. Databases of Protein Structures
- Protein Structure and Classification
  - PDB: Structure Determined by X-ray Crystallography and NMR
  - CATH: Hierarchical Classification of Protein Domain Structures
  - SCOP: Familial and Structural Protein Relationships
  - FSSP: Protein Fold Family Database
- Protein Sequence-Structure Relationship
  - PIR-NRL3D: Protein Sequence-Structure Database
  - PIR-RESID: Protein Structure/Post-Translational Modifications
  - HSSP: Families and Alignments of Structurally-Conserved Regions
44 PDB Structure Data
45 PDBsum: Summary and Analysis ( ac.uk/bsm/pdbsum)
46 Protein Structural Classification CATH: Hierarchical domain classification of protein structures ( ucl.ac.uk/bsm/cath_new/)
47 Protein Structural Classification ( cam.ac.uk/scop/) The SCOP database aims to provide a detailed and comprehensive description of the structural and evolutionary relationships between all proteins whose structure is known, including all entries in the PDB.
48 VII. Proteomic Resources
- GELBANK: 2D-gel patterns from completed genomes
- SWISS-2DPAGE
- PEP: Predictions for Entire Proteomes ( pep/): Summarized analyses of protein sequences
- Proteome BioKnowledge Library: Detailed information on human, mouse and rat proteomes
- Proteome Analysis Database: Online application of InterPro and CluSTr for the functional classification of proteins in whole genomes
- Expression Profiling databases:
  - GNF (http://expression.gnf.org/cgi-bin/index.cgi): human and mouse transcriptome
  - SMD (http://genome-www5.stanford.edu/MicroArray/SMD/): Stanford microarray data analysis
  - EBI Microarray Informatics ( index.html): managing, storing and analyzing microarray data
49 2D-Gel Image Databases (1)
50 2D-Gel Image Databases (2) ( bin/nice2dpage.pl?P06493)
51 VIII. Proteome Analysis
52 Expression Profiling Human and Mouse Transcriptome ( stanford.edu/serum/)
53 Lab: Visit selected websites and analyze some protein sequences of your own choice.
- A list of the bioinformatics resources covered in this tutorial is available.
- Try some of the following sequences for analysis:
  1) Well characterized proteins: PIR:A26366 (CYP17), JS0747 (Sp1)
  2) Less characterized proteins: PIR:A59000 (MATER), TrEMBL:Q9QY16 (GRTH)
  3) Hypothetical proteins: PIR:T12515, T00338, T47130; SWISS-PROT:Q9BWT7