Pfam, DAS and the future Rob Finn DAS Workshop 2009.

Presentation transcript:

Pfam, DAS and the future Rob Finn DAS Workshop 2009

What is Pfam?
– Protein families/domain database
    Complete and accurate classification of protein space
    Each family represented by alignments and profile HMMs
– Two distinct parts
    Pfam-A: high quality, curated, annotated
    Pfam-B: low quality, automated, unannotated
– Additional features
    Active site, coiled-coils, low complexity, transmembrane regions

Sequence Features Client

Motivation
– Include other annotations
    Identify where we are missing domains
– Reduce data duplication
– Enrich single-protein data in Pfam
– Allow tailored views

Updates from DAS registry Tailored Features Views

Features Request List of sources
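
A features request against each source in a list could be sketched as follows. This is a minimal sketch, assuming the standard DAS `features` command with a `segment` parameter; the base URLs and the accession used here are hypothetical placeholders, not real Pfam endpoints.

```python
from urllib.parse import urlencode


def features_url(source_base, segment):
    """Build a DAS features request URL for one source.

    source_base is assumed to look like PREFIX/das/SOURCE (hypothetical values below).
    segment is the sequence identifier to annotate, e.g. a protein accession.
    """
    return f"{source_base.rstrip('/')}/features?{urlencode({'segment': segment})}"


def features_requests(source_bases, segment):
    """Return one features URL per source, in the order the sources were listed."""
    return [features_url(base, segment) for base in source_bases]


if __name__ == "__main__":
    # Hypothetical sources, standing in for a list obtained from a DAS registry.
    sources = ["http://example.org/das/pfam", "http://example.org/das/other"]
    for url in features_requests(sources, "P12345"):
        print(url)
```

A client would fetch each URL and merge the returned features into a single view of the protein.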

DAS Alignments: The Next Step
– Multiple sequence alignments
– PREFIX/das/alignment?query=ID
[Diagram: DAS Client, DAS Alignment Server]

DAS Alignments: Dealing with large alignments
– PREFIX/das/alignment?query=ID[&subject=ID[RANGE]] and/or [&rows=START-END]
[Diagram: DAS Client X DAS Alignment Server]

DAS Alignments: Dealing with large alignments
– PREFIX/das/alignment?query=ID[&rows=START-END]
[Diagram: DAS Client, DAS Alignment Server, DAS Align Feature Server]
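
The alignment request patterns on these slides can be assembled programmatically. The following is a minimal sketch, assuming `rows` is written as `START-END` exactly as on the slide; the function name, the example prefix and the identifier are hypothetical, and the format of the optional subject RANGE is not specified in the slides, so the subject value is passed through as given.

```python
from urllib.parse import urlencode


def alignment_url(prefix, query, subject=None, rows=None):
    """Build PREFIX/das/alignment?query=ID[&subject=...][&rows=START-END].

    subject: optional subject identifier, passed through unchanged
             (any RANGE suffix is the caller's responsibility).
    rows:    optional (start, end) tuple of alignment rows to return.
    """
    params = [("query", query)]
    if subject is not None:
        params.append(("subject", subject))
    if rows is not None:
        start, end = rows
        if start > end:
            raise ValueError("rows start must not exceed rows end")
        params.append(("rows", f"{start}-{end}"))
    return f"{prefix.rstrip('/')}/das/alignment?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical prefix and family identifier, for illustration only.
    print(alignment_url("http://example.org", "PF00001"))
    print(alignment_url("http://example.org", "PF00001", rows=(1, 50)))
```

Restricting a request with `rows` is what lets a client fetch a large alignment in slices rather than all at once.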

Pfam Alignments: In Practice
– Pfam alignments vary in size ,000+ sequences
    Paging essential
– Simple DAS alignment client
    HTML, AJAX
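
Paging over a large alignment amounts to splitting its rows into consecutive START-END windows, one per request. This is a minimal sketch, assuming rows are numbered from 1 and ranges are inclusive (the slides do not state the numbering convention); the function name and the page size are hypothetical.

```python
def row_pages(total_rows, page_size):
    """Yield (start, end) row windows covering rows 1..total_rows.

    Assumes 1-based, inclusive row numbering. The last window is
    shortened when total_rows is not a multiple of page_size.
    """
    if total_rows < 0:
        raise ValueError("total_rows must be non-negative")
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    for start in range(1, total_rows + 1, page_size):
        yield start, min(start + page_size - 1, total_rows)


if __name__ == "__main__":
    # Each window maps onto one rows=START-END request parameter.
    for start, end in row_pages(120, 50):
        print(f"rows={start}-{end}")
```

A client would request one window at a time as the user pages through the alignment, so only a small part of the alignment is transferred per request.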

Future Directions
More alignment sources are on their way!
– Develop standalone, generic application
– Paging replaced by ‘Live Grid’
Issues
– Genomic alignments!
– Layering on features

HMMER3
Faster and more sensitive version of underlying software
– Make use of new features?
[Figure: Query Length; Pfam (140 X 11000)]
Real time DAS searches!

Hot Alignments Can we scale efficiently?

Bringing in other datasets
Pfam
– NCBI NR (GenPept)
– Metagenomics
COSMIC: Catalogue Of Somatic Mutations In Cancer

COSMIC
Data Sources
– Scientific literature
– Cancer Genome Project systematic screens
Advantages
– Prolong life of data
– Maintain integrity
– Genes continually updated
– Scientists explore data
– Ability to combine data sets
Features
– Manual curation
– Map reference sequence
Standards
– Mutation naming
– Tumour sample
– Phenotype

COSMIC/Pfam/UniProt
– Prototyped on 60 ‘classic’ proteins
– Automated update when COSMIC or UniProt is released

Linking COSMIC/Pfam/Spice Linking and State Maintenance

Acknowledgements
Pfam
– Prasad Gunasekaran
– John Tate
– Alex Bateman
– Penny Coggill
– Jaina Mistry
COSMIC
– Jon Teague
– COSMIC team
Questions?