Databases of homologous gene families: new developments and web interfaces. Equipe Bioinformatique et Génomique Evolutive Laboratoire de Biométrie et Biologie.


Databases of homologous gene families: new developments and web interfaces.
Simon Penel, Julien Grassot, Laurent Duret, Manolo Gouy, Guy Perrière.
Equipe Bioinformatique et Génomique Evolutive, Laboratoire de Biométrie et Biologie Evolutive, Université Claude Bernard - Lyon 1. Pôle Bio-Informatique Lyonnais.

Homologous Genes Databases
Research fields:
- Proteome/genome comparative analysis
- Phylogenetic studies
- Orthology/paralogy relationship assignments
Development of generalist and specialised databases:
- HOVERGEN: families of homologous vertebrate genes
- HOBACGEN: families of homologous bacterial genes
- NureBase, RTKdb, Hoppsigen, Mitalib, ...
Identification of important regions in genomic sequences. Evolution at the molecular level. Species phylogeny. Function prediction.

The HoGenom database: Homologous Gene Families of Fully Sequenced Organisms (European project TEMBLOR)
- Extension of HOVERGEN and HOBACGEN to all organisms for which the complete genome sequence has been determined
- Structured under the ACNUC (M. Gouy) retrieval system: flat files and index files
- Integrates:
  - Protein multiple alignments
  - Phylogenetic trees
  - Taxonomic data
  - Nucleic and protein sequences
  - Sequence annotations

Building of HoGenom
- Selection of the protein sequences of fully sequenced organisms from the EBI proteome site.
- Sequence comparison with BLAST on the whole sequence dataset.
- Clustering of the sequences into gene families on the basis of sequence similarity (transitive association).
- Addition of the gene family information to the protein sequence annotations.
- Calculation of EMBL cross-references and selection of the nucleotide sequences.
- Addition of the gene family information to the EMBL/GenBank nucleotide annotations.
Results: ACNUC protein database, ACNUC nucleotide database and, for each family, protein alignments and phylogenetic trees.
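The clustering step by transitive association can be sketched as single-linkage grouping: if A is similar to B and B is similar to C, all three land in the same family. The sketch below is only an illustration under assumptions. The slides do not give the similarity criteria or the implementation, so the list of similar pairs is assumed to come from BLAST hits that already passed some threshold, and the function name and the union-find approach are illustrative choices, not the authors' code.

```python
from collections import defaultdict


def cluster_by_transitive_similarity(sequence_ids, similar_pairs):
    """Group sequences into families by transitive (single-linkage) association.

    sequence_ids  -- iterable of sequence identifiers (hypothetical data)
    similar_pairs -- iterable of (id_a, id_b) pairs assumed to have passed
                     a similarity threshold in an all-against-all comparison
    """
    # Union-find: every sequence starts as the root of its own family.
    parent = {seq_id: seq_id for seq_id in sequence_ids}

    def find(seq_id):
        # Walk up to the root, shortening the path along the way.
        while parent[seq_id] != seq_id:
            parent[seq_id] = parent[parent[seq_id]]
            seq_id = parent[seq_id]
        return seq_id

    # Merge the families of every pair of similar sequences.
    for id_a, id_b in similar_pairs:
        root_a, root_b = find(id_a), find(id_b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect the members of each family under its root.
    families = defaultdict(list)
    for seq_id in sequence_ids:
        families[find(seq_id)].append(seq_id)
    return sorted(sorted(members) for members in families.values())


# Hypothetical identifiers: P1~P2 and P2~P3 are similar, P4 and P5 match nothing.
sequence_ids = ["P1", "P2", "P3", "P4", "P5"]
similar_pairs = [("P1", "P2"), ("P2", "P3")]
families = cluster_by_transitive_similarity(sequence_ids, similar_pairs)
orphans = [members[0] for members in families if len(members) == 1]
print(families)
print(orphans)
```

Sequences that end up alone in their cluster correspond to what the slides call orphan sequences. Single linkage is simple but can chain unrelated sequences through multi-domain proteins, which is why real pipelines usually add coverage or length constraints to the pair selection.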

Hogenprot: Q9DCD0 (protein sequence annotations)
ID Q9DCD0 PRELIMINARY; PRT; 483 AA.
AC Q9DCD0;
DT 01-JUN-2001 (TrEMBLrel. 17, Created)
DT 01-JUN-2001 (TrEMBLrel. 17, Last sequence update)
DT 01-MAR-2002 (TrEMBLrel. 20, Last annotation update)
DE A05RIK PROTEIN.
GN A05RIK.
OS Mus musculus (Mouse).
OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus.
OX NCBI_TaxID=10090
RN [1]
RP SEQUENCE FROM N.A.
RC STRAIN=C57BL/6J; TISSUE=KIDNEY;
RX MEDLINE= ; PubMed= ;
RA Kawai J., Shinagawa A., Shibata K., Yoshino M., Itoh M., Ishii Y.,
----
RA Hayashizaki Y.;
RT "Functional annotation of a full-length mouse cDNA collection.";
RL Nature 409: (2001).
CC -!- CATALYTIC ACTIVITY: 6-PHOSPHO-D-GLUCONATE + NADP(+) = D-RIBULOSE
CC 5-PHOSPHATE + CO(2) + NADPH.
CC -!- PATHWAY: HEXOSE MONOPHOSPHATE SHUNT.
CC -!- SIMILARITY: BELONGS TO THE 6-PHOSPHOGLUCONATE DEHYDROGENASE
CC FAMILY.
CC -!- GENE_FAMILY: HBG [ FAMILY / ALN / TREE ]
DR EMBL; AK002894; BAB ; -.
DR HSSP; P00349; 2PGD.
DR MGD; MGI: ; A05Rik.
DR InterPro; IPR001744; 6PGD.
DR Pfam; PF00393; 6PGD; 1.
DR PRINTS; PR00076; 6PGDHDRGNASE.
DR PROSITE; PS00461; 6PGD; 1.
DR PRODOM; Q9DCD0.
DR SWISS-2DPAGE; Q9DCD0.
KW NADP; Oxidoreductase; Pentose shunt.
FT DOMAIN 5 60 PRODOM:2001.3:PD
FT DOMAIN PRODOM:2001.3:PD
FT DOMAIN PRODOM:2001.3:PD
SQ SEQUENCE 483 AA; MW; CD0A3F72EEC2831E CRC64;
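Because the entry uses two-letter line codes, its cross-references can be read with a few lines of code. The sketch below collects the `DR` lines of a Swiss-Prot style entry into a dictionary keyed by database name. It is an illustration only: the function name is hypothetical, the sample reuses a few complete lines of the entry above, and it does not represent how ACNUC or WWW-Query parse entries.

```python
def parse_cross_references(entry_text):
    """Collect DR (database cross-reference) lines of a Swiss-Prot style entry.

    Returns a dict mapping each database name to a list of identifier lists,
    one list per DR line.
    """
    references = {}
    for line in entry_text.splitlines():
        if not line.startswith("DR "):
            continue  # only cross-reference lines are of interest here
        # Drop the line code and the final period, then split on ';'.
        body = line[2:].strip().rstrip(".")
        fields = [field.strip() for field in body.split(";")]
        database = fields[0]
        identifiers = [field for field in fields[1:] if field]
        references.setdefault(database, []).append(identifiers)
    return references


# A few complete lines taken from the Q9DCD0 entry shown on the slide.
sample_entry = """ID   Q9DCD0      PRELIMINARY;      PRT;   483 AA.
AC   Q9DCD0;
DR   HSSP; P00349; 2PGD.
DR   InterPro; IPR001744; 6PGD.
DR   Pfam; PF00393; 6PGD; 1.
KW   NADP; Oxidoreductase; Pentose shunt."""

references = parse_cross_references(sample_entry)
for database in sorted(references):
    print(database, references[database])
```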

Hogennucl: AK PE1 (nucleotide sequence annotations)
AK PE1 Location/Qualifiers
FT CDS_pept
FT /codon_start=1
FT /db_xref="MGD:MGI: "
FT /db_xref="SWISS-PROT:Q9DCD0"
FT /note="data source:SPTR, source key:P52209, evidence:ISS"
FT /note="homolog to 6-PHOSPHOGLUCONATE DEHYDROGENASE,
FT DECARBOXYLATING (EC )"
FT /note="putative"
FT /transl_table=1
FT /gene_family="HBG000005"
FT /protein_id="BAB "
FT /translation="MAQADIALIGLAVMGQNLILNMNDHGFVVCAFNRTVSKVDDFLAN
FT EAKGTKVVGAQSLKDMVSKLKKPRRVILLVKAGQAVDDFIEKLVPLLDTGDIIIDGGNS
FT EYRDTTRRCRDLKAKGILFVGSGVSGGEEGARYGPSLMPGGNKEAWPHIKAIFQAIAAK
FT VGTGEPCCDWVGDEGAGHFVKMVHNGIEYGDMQLICEAYHLMKDVLGMRHEEMAQAFEE
FT WNKTELDSFLIEITANILKYRDTDGKELLPKIRDSAGQKGTGKWTAISALEYGMPVTLI
FT GEAVFARCLSSLKEERVQASQKLKGPKVVQLEGSKKSFLEDIRKALYASKIISYAQGFM
FT LLRQAATEFGWTLNYGGIALMWRGGCIIRSVFLGKIKDAFERNPELQNLLLDDFFKSAV
FT DNCQDSWRRVISTGVQAGIPMPCFTTALSFYDGYRHEMLPANLIQAQRDYFGAHTYELL
FT TKPGEFIHTNWTGHGGSVSSSSYNA"
atggcccaag ctgacattgc actgatcgga ctggctgtca tgggccagaa cttaattttg 60
aacatgaatg atcatggatt tgtggtctgt gctttcaata ggacagtctc caaagtcgat 120
….
ccctgcttca ctactgccct ctccttctat gatgggtaca gacacgagat gctgccagca 1320
aacctcatcc aggctcaacg ggattacttt ggggctcaca cctatgaact cttaaccaaa 1380
ccgggagaat ttatccacac caactggacg ggccacgggg gcagtgtgtc atcctcttca 1440
tacaatgcct ag 1452
//
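The `/gene_family` qualifier is what links a CDS to its family, so a small reader for feature-table qualifiers shows how that link could be recovered from the flat file. The sketch below is an assumption-laden illustration: the function name is hypothetical, only single-line qualifiers are handled, and it is not the ACNUC indexing code.

```python
def parse_qualifiers(feature_lines):
    """Read single-line /name=value qualifiers from EMBL-style FT lines.

    Returns a dict mapping each qualifier name to the list of its values.
    Continuation lines (for example the body of /translation) are skipped,
    so a multi-line value keeps only its first line.
    """
    qualifiers = {}
    for line in feature_lines:
        # Remove the FT line code when present.
        text = line[2:].strip() if line.startswith("FT") else line.strip()
        if not text.startswith("/") or "=" not in text:
            continue  # feature keys and continuation lines carry no name=value
        name, value = text[1:].split("=", 1)
        qualifiers.setdefault(name, []).append(value.strip('"'))
    return qualifiers


# Lines copied from the CDS feature shown on the slide.
feature_lines = [
    "FT   CDS_pept",
    "FT   /codon_start=1",
    'FT   /db_xref="SWISS-PROT:Q9DCD0"',
    'FT   /note="putative"',
    "FT   /transl_table=1",
    'FT   /gene_family="HBG000005"',
]

qualifiers = parse_qualifiers(feature_lines)
print(qualifiers["gene_family"])
print(qualifiers["db_xref"])
```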

HoGenom ACNUC contents (8th September 2003)
- HoGenom Proteins: 423,577 sequences
- HoGenom Nucleotide Sequences: 448,582 CDS
- 117 fully sequenced organisms
Data source:
- Protein data from EBI: non-redundant complete proteome sets (SWISS-PROT, TrEMBL, TrEMBLnew), June 2003
- Genomic data from EMBL, June 2003

117 organisms, protein sequences
Organisms listed on the slide:
- Arabidopsis thaliana (plant)
- Caenorhabditis elegans (nematode)
- Drosophila melanogaster (fly)
- Encephalitozoon cuniculi (microsporidia)
- Guillardia theta (alga)
- Homo sapiens (man)
- Mus musculus (mouse)
- Rattus norvegicus (rat)
- Saccharomyces cerevisiae (yeast)
- Schizosaccharomyces pombe (fungus)
[Chart labels: 31%, 9%, 60%]

Families and protein sequences
- Sequences belonging to a family (72%)
- Orphan sequences (27%)

Access to HoGenom is available at the PBIL: Web page of HoGenom :

Database Access on the Web
Two main WWW interfaces:
- WWW Query
  - Multiple query on sequences (Guy Perrière)
  - Multiple query on families
- Cross Taxa
  - Search for families according to complex taxonomic criteria
  - Selection of families

Cross Taxa: Selection of Gene Families
Example: selecting families of animal-specific genes. The result is a list of families.
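A taxonomic selection of this kind can be sketched as a filter over the lineages of each family's members: a family counts as animal-specific if every member's lineage passes through Metazoa. The sketch below is only an illustration of the idea. The family identifiers and lineages are hypothetical, the function name is invented for the example, and the actual Cross Taxa criteria may be richer than a single required taxon.

```python
def select_families(family_lineages, required_taxon):
    """Return the ids of families whose members all belong to required_taxon.

    family_lineages -- dict mapping a family id to a list of lineages,
                       one lineage (list of taxon names) per member sequence
    required_taxon  -- taxon name that must appear in every member lineage
    """
    selected = []
    for family_id, lineages in sorted(family_lineages.items()):
        # Empty families are ignored; otherwise every member must match.
        if lineages and all(required_taxon in lineage for lineage in lineages):
            selected.append(family_id)
    return selected


# Hypothetical families; lineages are shortened for readability.
family_lineages = {
    "FAM_A": [["Eukaryota", "Metazoa", "Chordata"],
              ["Eukaryota", "Metazoa", "Arthropoda"]],
    "FAM_B": [["Eukaryota", "Metazoa", "Chordata"],
              ["Eukaryota", "Fungi"]],
    "FAM_C": [["Bacteria", "Proteobacteria"]],
}

animal_specific = select_families(family_lineages, "Metazoa")
print(animal_specific)
```

FAM_B is rejected because one of its members is fungal, which is the behaviour wanted for a "specific to animals" query. A query of the form "present in taxon X and absent from taxon Y" would combine two such tests.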


display family

Family Page

Application to other databases
- Any sequence database can be structured under ACNUC and queried with WWW-Query. Currently available: SWISS-PROT, EMBL, GenBank, etc.
- Any family database can be structured under ACNUC and queried with WWW-Query and Cross-Taxa. For example, an ACNUC version of the HAMAP database developed by SWISS-PROT is currently available at the PBIL.

Cross-references with external databases
1 sequence → associated family: display the family, alignment and phylogenetic tree associated with a sequence accession number via a URL link. http
Example: sequence Q8ZY16 in NiceProt, with cross-references to HAMAP-ACNUC and HOBACGEN.

Acknowledgements
People from BBE and the SWISS-PROT group: Laurent Duret, Alexandre Gattiker, Manolo Gouy, Julien Grassot, Simon Penel, Guy Perrière.
This project is supported by:
- the European Commission (TEMBLOR)
- the Rhône-Alpes region (Projet Thématiques Prioritaires)