Doug Brutlag Professor Emeritus Biochemistry & Medicine (by courtesy) Protein Sequence Databases Computational Molecular Biology Biochem 218 – BioMedical Informatics

2nd Homework: Transcripts and splice variants
- NCBI - Evidence Viewer
  - Full-length RefSeq cDNAs
  - UniGene EST assemblies
  - dbEST assemblies
- ENSEMBL Genes
  - Transcripts
  - Splice variants
- UCSC Genome Browser
  - RefSeq tracks
  - Human EST tracks
  - Spliced EST tracks
  - UniGene track

Protein Sequence and Classification Databases
Evolution
- Sequence similarity or distance
- Clustered into protein superfamilies and families
- Multiple sequence alignment profiles or Hidden Markov Models
- Phylogenies
Function
- Sequence motifs (consensus sequences), Position-Specific Scoring Matrices (PSSMs), profiles, Hidden Markov Models (HMMs)
  - Active sites
  - Binding sites for ligands, substrates, DNA
  - Protein-protein interaction sites
- Enzymes, structures, receptors, pores, signals, regulators
Structure
- Secondary structure
- Tertiary folds - Hidden Markov Models
- Clustered into protein domains, superfamilies and families
- Quaternary complexes - interaction databases

Protein Sequence Databases (Historical Order)
National Biomedical Research Foundation (NBRF) => Protein Identification Resource (PIR)
- Margaret Dayhoff
- Atlas of Protein Sequences
- Phylogenies, evolution, amino acid substitution matrices (PAM) and discovering active sites in enzymes
- PIRSF evolutionary families, iProClass functional site analysis and ontologies, iProLink to literature
- UniProt - Universal Protein Resource
SwissProt (ExPASy site)
- Amos Bairoch
- Manual curation and annotation
- Highly cross-referenced
- Many useful analytical tools (ExPASy Tools)
- 2D-PAGE and mass spectrometry databases
- Prosite functional motif database
- UniProt - Universal Protein Resource
TREMBL
- Translation of mRNAs (RefSeq), UniGene, open reading frames (ORFs) and predicted genes from genomes
- Automatic annotations
EMBL => EBI Protein databases
- Clusters
- InterPro linked to domain and motif databases (CATH, PANTHER, PRINTS, PROSITE, Pfam, PIRSF, ProDom, SCOP, SMART, SUPERFAMILY)
- Intron-exon structure and links to ORFs, coding regions
- UniProt - Universal Protein Resource
NCBI Protein Database
- Protein and nrPRO database: SwissProt, PIR and translated genes/genomes
- Protein Clusters Database (prokaryotic) and COGs and KOGs
- Linked to coding regions and intron/exon structure
- Linked to coding SNPs and variation databases
- Linked to MMDB structure database
- Linked to 3D domains
- Linked to CDD (Conserved Domain Database)
UCSC Proteome Browser

UniProt Sequence Databases
UniProt Archive (UniParc)
- Stable, comprehensive, non-redundant collection of all protein sequences ever published
- Merged from PIR, SwissProt, TREMBL, DDBJ/EMBL/GenBank proteins and proteomes, PDB, International Protein Index, RefSeq translations and other organism proteomes not yet in DDBJ/EMBL/GenBank
UniProt Reference Clusters (UniRef)
- Three non-redundant collections based on sequence similarity clusters
  - UniRef100 merges all identical sequences and identical overlapping subsequences into a single entry.
  - UniRef90 merges all protein sequence clusters with 90% sequence identity into a single entry.
  - UniRef50 merges all protein sequence clusters with 50% sequence identity into a single entry.
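The UniRef100 idea of folding identical sequences and contained subsequences into one entry can be illustrated with a small sketch. This is a toy illustration only, not the UniProt production procedure: the function name `cluster_uniref100_like`, the record identifiers and the sequences are all hypothetical, and containment is tested with a plain substring check.

```python
def cluster_uniref100_like(records):
    """Group sequences that are identical to, or fully contained in, a longer one.

    records: dict mapping a (hypothetical) accession to its amino acid sequence.
    Returns a dict mapping each representative accession to its member accessions.
    """
    # Visit the longest sequences first so that fragments meet their parent later.
    ordered = sorted(records, key=lambda acc: (-len(records[acc]), acc))
    clusters = {}
    for acc in ordered:
        seq = records[acc]
        for rep in clusters:
            # Identical sequences and exact sub-fragments join the existing entry.
            if seq in records[rep]:
                clusters[rep].append(acc)
                break
        else:
            # Nothing contains this sequence, so it starts a new entry.
            clusters[acc] = [acc]
    return clusters


if __name__ == "__main__":
    toy = {
        "A": "MKTAYIAKQR",  # full-length sequence
        "B": "MKTAYIAKQR",  # identical copy from another source database
        "C": "TAYIAK",      # fragment contained in A
        "D": "MSSSS",       # unrelated sequence
    }
    print(cluster_uniref100_like(toy))
```

UniRef90 and UniRef50 would additionally require a sequence identity calculation between cluster members, which this sketch deliberately leaves out.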

UniProt Sequence Databases (cont.)
UniProt Knowledgebase (UniProtKB)
- UniProt/SwissProt
  - Manually curated, highly annotated sequences from SwissProt and PIRSF, including descriptions, taxonomy, citations, GO terms, motifs, functional and structural classifications, and residue-specific annotations including variations.
  - Some automatic rule-based annotations, including InterPro domains and motifs, PROSITE, PRINTS, ProDom, SMART, Pfam, PIRSF, Superfamily and TIGRFAMs classifications.
- UniProt/TREMBL
  - Automatically translated from genomes, including predicted as well as RefSeq genes.
  - Automated rule-based annotations.

RuleBase Protein Annotations (Fleischmann et al.)
Learn rules from SwissProt
- Find all SwissProt sequences with a specific motif, profile or HMM (from InterPro)
- Examine all annotations (keywords, taxonomy, GO terms, etc.) of SwissProt proteins that share a motif or a domain profile
- Discover annotations common to all these proteins
- Record the rule that motif, profile or HMM => annotation
Apply rules to TREMBL
- Find TREMBL sequences which have the same domain or motif as in RuleBase
- Apply common annotations to TREMBL
- Flag annotations as "BY SIMILARITY"
Generates many false annotations
- Due to low-specificity domain or motif profiles
- Large size of TREMBL compared to the SwissProt training set
- Used eMOTIFs (higher-specificity motifs, < 1 error per 10^9 amino acids) to confirm patterns
- Build rules based on multiple motifs (Spearmint)
- Introduced confidence thresholds
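The learn-then-apply cycle described on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the RuleBase implementation: the entry layout (dicts with `id`, `motifs`, `annotations`), the function names and the `min_support` threshold are hypothetical, and the annotation strings are invented examples.

```python
from collections import defaultdict


def learn_rules(curated, min_support=2):
    """Learn motif => annotation rules from curated (SwissProt-like) entries.

    A rule keeps only the annotations shared by every curated entry that
    carries the motif. min_support is a crude stand-in for a confidence
    threshold: motifs seen in fewer entries yield no rule.
    """
    by_motif = defaultdict(list)
    for entry in curated:
        for motif in entry["motifs"]:
            by_motif[motif].append(set(entry["annotations"]))
    rules = {}
    for motif, annotation_sets in by_motif.items():
        if len(annotation_sets) < min_support:
            continue
        common = set.intersection(*annotation_sets)
        if common:
            rules[motif] = common
    return rules


def apply_rules(rules, uncurated):
    """Transfer rule annotations to uncurated (TREMBL-like) entries, flagged."""
    predictions = {}
    for entry in uncurated:
        inferred = set()
        for motif in entry["motifs"]:
            inferred |= rules.get(motif, set())
        predictions[entry["id"]] = {(a, "BY SIMILARITY") for a in inferred}
    return predictions


if __name__ == "__main__":
    swissprot_like = [
        {"id": "P1", "motifs": {"PS00136"},
         "annotations": {"Hydrolase", "Serine protease", "Secreted"}},
        {"id": "P2", "motifs": {"PS00136"},
         "annotations": {"Hydrolase", "Serine protease"}},
    ]
    trembl_like = [{"id": "Q1", "motifs": {"PS00136"}}]
    rules = learn_rules(swissprot_like)
    print(apply_rules(rules, trembl_like))
```

Requiring several motifs to agree before a rule fires, as the Spearmint bullet suggests, would amount to keying the rules on sets of motifs rather than on single motifs.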

Serine Protease Catalytic Triad
Three conserved motifs containing the three residues of the catalytic triad in the serine protease subtilisin BPN' of Bacillus amyloliquefaciens. The first conserved motif is shown in red (residues 135–146; Prosite PS00136) and contains the active site aspartate (red spheres). The second conserved motif is shown in blue (residues 171–181; Prosite PS00137) and contains the active site histidine (blue spheres). The third motif is shown in yellow (residues 326–336; Prosite PS00138) and contains the active site serine (yellow spheres).

Filtering erroneous annotations with negative rules (Xanthippe) (Wieser et al. 2004)
Examine taxonomic and other "core" annotations that lack a more specific annotation in SwissProt
- All bacterial proteins lack the "nuclear protein" annotation
- If RuleBase assigns "nuclear protein" to a bacterial protein in TREMBL, then remove it
- >4,000 exclusion rules based just on taxonomic class
- Build exclusion rules based on decision trees involving multiple motif annotations
Some exclusion rules are in error due to database bias
- All venomous snakes lacked ATP binding proteins! The only SwissProt proteins from venomous snakes came from venom, not from the rest of the organism!
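The negative-rule idea can be sketched as follows. This is a simplified illustration, not the Xanthippe decision-tree method: it only covers the taxonomy-based case, and the entry layout, function names and the `min_entries` guard are hypothetical. The guard is one possible way to reduce the database-bias problem noted above, since a taxon with very few curated entries gives weak evidence that an annotation is truly absent.

```python
from collections import Counter, defaultdict


def learn_exclusions(curated, min_entries=1):
    """For each taxon, list annotations never seen on its curated entries.

    Taxa with fewer than min_entries curated entries get no exclusion rules,
    because their annotation set is likely to be incomplete.
    """
    all_annotations = set()
    seen = defaultdict(set)
    counts = Counter()
    for entry in curated:
        annotations = set(entry["annotations"])
        all_annotations |= annotations
        seen[entry["taxon"]] |= annotations
        counts[entry["taxon"]] += 1
    return {
        taxon: all_annotations - seen[taxon]
        for taxon in seen
        if counts[taxon] >= min_entries
    }


def filter_predictions(taxon, predicted, exclusions):
    """Drop predicted annotations that an exclusion rule forbids for the taxon."""
    return set(predicted) - exclusions.get(taxon, set())


if __name__ == "__main__":
    swissprot_like = [
        {"taxon": "Bacteria", "annotations": {"Cytoplasm"}},
        {"taxon": "Eukaryota", "annotations": {"Nuclear protein", "Cytoplasm"}},
    ]
    exclusions = learn_exclusions(swissprot_like)
    print(filter_predictions("Bacteria", {"Nuclear protein", "Cytoplasm"}, exclusions))
```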

Performance of Xanthippe (Tested by Crossvalidation on SwissProt)

ExPASy Proteomics Server

ExPASy Proteomic Tools

UniProt Database

UniProt PCNA Search

UniProt PCNA Search

UniProt PCNA Reviewed Search

UniProt PCNA BLAST Search

UniProt PCNA Alignment

ExPASy Proteomics Server

Prosite Database

Prosite Zinc Finger C2H2 Entry

Prosite Zinc Finger C2H2 Profile Entry

Prosite Zinc Finger C2H2 Profile Entry

Prosite Zinc Finger Pattern

Prosite Zinc Finger Pattern
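A Prosite pattern is essentially a regular expression written in its own syntax, so a pattern scan can be sketched by translating it. The translator below is a minimal sketch that handles only the common elements (single residues, `x`, `[...]` classes, `{...}` exclusions, repeat counts and the `<`/`>` anchors). The pattern string used in the example is the form commonly quoted for the C2H2 zinc finger entry and should be checked against the Prosite entry itself; the test sequence is synthetic, and the function names are hypothetical.

```python
import re

_ELEMENT = re.compile(r"([A-Zx]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a Prosite-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError("unsupported pattern element: " + element)
        residue, low, high = match.groups()
        if residue == "x":               # any residue
            core = "."
        elif residue.startswith("{"):    # any residue except those listed
            core = "[^" + residue[1:-1] + "]"
        else:                            # literal residue or [..] class
            core = residue
        if low and high:
            core += "{" + low + "," + high + "}"
        elif low:
            core += "{" + low + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)


def find_motifs(pattern, sequence):
    """Return (1-based start, matched text) for every hit, overlaps included."""
    scanner = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in scanner.finditer(sequence)]


if __name__ == "__main__":
    c2h2 = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
    synthetic = "MK" + "C" + "AA" + "C" + "AAA" + "F" + "A" * 8 + "H" + "AAA" + "H" + "GG"
    print(prosite_to_regex(c2h2))
    print(find_motifs(c2h2, synthetic))
```

Profile entries, unlike patterns, are position-specific scoring matrices and cannot be expressed this way.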

ScanProsite Patterns and Profiles

ScanProsite Results

MyHits Local Motifs Search

MyHits Local Motifs Query

MyHits Local Motifs Search

MyHits Local Motifs Summary

MyHits Local Motif Hits

MyHits Local Motifs Hits (Cont.)

MyHits Local Motifs Hits (Cont.)

InterPro

InterPro Scan

InterPro Scan Hourglass

InterPro Scan PCNA

InterPro Scan PCNA Table View

InterPro pFAM Result for PCNA

InterPro PRINTS Result for PCNA bin/dbbrowser/sprint/searchprintss.cgi?display_opts=Prints&category=None&queryform=false&prints_accn=PR00339

InterPro TIGRFams Result for PCNA

EBI Protein Databases

EBI UniProt

EBI UniProt Search Tools

Protein Identification Resource

PIR PRO Database

PIR iProClass Database

NCBI Protein Database

Hemoglobin in Title Index

Hemoglobin Title & Human Organism

Hemoglobin Title & Human Organism Results
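The same fielded search (hemoglobin in the title, restricted to human) can also be expressed programmatically. The sketch below only builds a query URL for the NCBI E-utilities `esearch` endpoint and does not contact the server; the helper name `build_esearch_url` and the `retmax` default are choices made for this illustration, and the exact query string behind the slides is an assumption.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="protein", retmax=20):
    """Build an E-utilities esearch URL for a fielded Entrez query."""
    # urlencode percent-escapes the square brackets and spaces in the query.
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


if __name__ == "__main__":
    # Field tags in square brackets restrict each word to one index.
    print(build_esearch_url("hemoglobin[Title] AND human[Organism]"))
```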

UCSC Proteome Browser

UCSC Hemoglobin Protein

UCSC Hemoglobin Coding Region

UCSC Hemoglobin Protein Properties

UCSC Hemoglobin Gene Region

UCSC Hemoglobin Links

National Center for Biotechnology Information

European Bioinformatics Institute

DNA Database of Japan

DDBJ/EMBL/GenBank Feature Table Example: Protein-coding region
Prototypical eukaryotic gene: 5' UTR, signal peptide, mature peptide, 3' UTR

FEATURES           Location/Qualifiers
exon               /number
intron             /number=1
exon               /number=2
intron             /number=2
exon               /number=3
intron             /number=3
3'UTR
exon               /number=4
signal-peptide     join( , )
mature-peptide     join( , , )
                   /product="prototypical protein"
CDS                join( , , , )
                   /product="prototypical protein"
mRNA               join( , , , )
prim_transcript
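A `join(...)` location tells a reader which segments of the record to concatenate, for example the exons that make up a CDS. The sketch below shows that operation for simple forward-strand locations. The coordinates and sequence are invented for illustration, since the positions in the example above are not available, and the parser is hypothetical: it ignores `complement(...)`, remote references and other location operators.

```python
import re

_SPAN = re.compile(r"\s*<?(\d+)\.\.>?(\d+)\s*")


def parse_location(location):
    """Parse 'start..end' or 'join(start..end,...)' into 1-based inclusive spans."""
    inner = location.strip()
    if inner.startswith("join(") and inner.endswith(")"):
        inner = inner[len("join("):-1]
    spans = []
    for part in inner.split(","):
        match = _SPAN.fullmatch(part)
        if match is None:
            raise ValueError("unsupported location segment: " + part)
        spans.append((int(match.group(1)), int(match.group(2))))
    return spans


def splice(sequence, location):
    """Concatenate the segments named by a feature location."""
    # Feature-table coordinates are 1-based and inclusive; Python slices are not.
    return "".join(sequence[start - 1:end] for start, end in parse_location(location))


if __name__ == "__main__":
    record = "AAATGCCCGGGTAAA"  # made-up sequence
    print(parse_location("join(3..6,10..12)"))
    print(splice(record, "join(3..6,10..12)"))
```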

International Common Fields

National Center for Biotechnology Information

Entrez Databases

NCBI Handbook

NCBI Nucleotide Sequence Database

NCBI Indexes for Core DNA Sequences

NCBI Search for PCNA Gene Sequences

NCBI Protein Sequence Database

NCBI Protein Search for Human PCNA

BLAST Link for Human PCNA

Conserved Domain Link for PCNA

Online Mendelian Inheritance in Man

Online Mendelian Inheritance in Man

Entrez Structure Database

MMDB Structure for PCNA (1VYM)

VAST Neighbors of PCNA (1VYM)

Protein Data Bank (Structures)

PDB Contents

Growth of PDB

Growth of PDB by Fold

Protein Structure Initiative

Structure Summary for 1VYM

Jmol View of 1VYM

Ramachandran Plot of 1VYM

Swiss Model: Comparative Modeling Server

Protein Identification Resource - PIR

Controlled Vocabulary for p53

ProClass Summary for Human p53

Sequences Related to p53

Gene Ontology Consortium