WP 12 Contribution to Integr8 The HoGenom database : Families of homologous genes from complete genomes Work Package 12: Equipe Bioinformatique et Génomique.

1 WP 12 Contribution to Integr8. The HoGenom database: Families of homologous genes from complete genomes. Work Package 12: Equipe Bioinformatique et Génomique Evolutive, Laboratoire de Biométrie et Biologie Evolutive, Université Claude Bernard - Lyon 1. Simon Penel, Julien Grassot, Manolo Gouy, Guy Perrière, Laurent Duret. Pôle Bio-Informatique Lyonnais

2 Introduction : –Databases for phylogenomics Achievements of WP12 Milestones –Automatised updating procedure –Development of a database of homologous genes from complete genomes Perspectives

3 Databases of homologous gene families for comparative genomics Goal: –Provide easy access to all the information that can be drawn from the comparison of homologous sequences General approach: –Search for sequence similarities –Clustering of homologous sequences into families –Analysis of sequence families (multiple alignment, profile, phylogenetic tree,...) –Query software and user interface to retrieve and display relevant information

4 Domain vs. gene families Modular evolution of protein genes Families of homologous protein domains: - Evolution by domain shuffling (duplication, loss, translocation) Gene families: - Evolution of homologous genes by speciation or by gene duplication - Sequences are homologous over their entire length (or almost)

5 Different databases for different purposes Databases of protein domains (InterPro, etc.), WP5 –Prediction of the biochemical activity of proteins: Does this protein have a kinase catalytic site? Does it contain a DNA binding domain? … –Prediction of protein structures: Does this protein contain a domain homologous to an already known 3D structure? –…

6 Different databases for different purposes Databases of gene families (WP12): identify orthologues or paralogues within a given set of taxa. Examples of typical queries: –Identify all orthologues between human, mouse and zebrafish (prediction of gene function, phylogenetics, comparative mapping) –Identify all paralogous genes originating from a duplication in the last common ancestor of vertebrates (evolution of the function of duplicated genes, analysis of genome duplications) –Identify all the genes that are specific to a pathogenic strain of E. coli –…

7 Orthology/Paralogy Homology: two genes are homologous if they share a common ancestor. Orthologs: homologs that have diverged after a speciation. Paralogs: homologs that have diverged after a duplication. [Figure: tree of the ancestral insulin gene in primates (human INS) and rodents (rat and mouse INS1 and INS2), with the speciation and duplication nodes marked.]
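The two definitions amount to a simple rule: find the last common ancestor of two genes in the gene tree and check whether that node is a speciation or a duplication. The sketch below applies this rule to the insulin example. The nested-tuple tree encoding, the gene labels and the function names are assumptions made for illustration; they are not part of HoGenom.

```python
# Gene tree for the insulin example. Internal nodes are (event, left, right);
# leaves are gene labels. This encoding is an illustrative assumption.
INSULIN_TREE = (
    "speciation",                          # primates / rodents split
    "human_INS",
    ("duplication",                        # duplication in the rodent lineage
        ("speciation", "rat_Ins1", "mouse_Ins1"),
        ("speciation", "rat_Ins2", "mouse_Ins2")),
)


def leaves(node):
    """Return the set of leaf labels found under a node."""
    if isinstance(node, str):
        return {node}
    _, left, right = node
    return leaves(left) | leaves(right)


def relationship(tree, gene_a, gene_b):
    """Classify two genes by the event at their last common ancestor."""
    if gene_a == gene_b or not {gene_a, gene_b} <= leaves(tree):
        raise ValueError("expected two distinct genes from the tree")
    node = tree
    while True:
        event, left, right = node
        for child in (left, right):
            if {gene_a, gene_b} <= leaves(child):
                node = child          # both genes are deeper in this subtree
                break
        else:
            # The genes sit in different subtrees: this node is their
            # last common ancestor.
            return "orthologs" if event == "speciation" else "paralogs"


def orthologs_of(tree, gene):
    """List every gene in the tree that is orthologous to `gene`."""
    return sorted(g for g in leaves(tree)
                  if g != gene and relationship(tree, gene, g) == "orthologs")


print(relationship(INSULIN_TREE, "mouse_Ins1", "mouse_Ins2"))
print(orthologs_of(INSULIN_TREE, "human_INS"))
```

With this tree, the human gene has four rodent orthologs (two per species), which illustrates that orthology need not be a one-to-one relationship.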

8 Data for Phylogenomics Searching for homologues and interpreting homologies (orthology, paralogy) requires: –To find similarities. –To compute multiple alignments. –To build phylogenetic trees. –To have reference taxonomic data. –To access sequence databank annotations.

9 Databases content Database of homologous genes –HOVERGEN: vertebrates (Duret et al., 1994) –HOBACGEN: bacteria and archaea (Perrière et al., 2000) –HOGENOM: fully sequenced organisms. Protein sequences from SWISS-PROT/TrEMBL. Nucleotide sequences from EMBL. Taxonomic data (NCBI). Homologous genes classified into families. Multiple alignments. Phylogenetic trees.

10 WP 12 Milestones Milestone 12 Automatised updating procedure Milestone 24 Development of a database of homologous genes from complete genomes: HoGenom Milestone 36 Development of tools for automatic analysis of phylogenetic trees

11 Building of HoGenom: general view. Selection of protein sequences from fully sequenced organisms on the EBI proteome site. Sequence comparison with BLAST on the whole sequence dataset. Clustering of the sequences into gene families on the basis of sequence similarity (transitive association). Addition of the gene family information to the protein sequence annotations. For each family: protein alignments, phylogenetic trees.

12 Hogenprot: Q9DCD0
ID   Q9DCD0 PRELIMINARY; PRT; 483 AA.
AC   Q9DCD0;
DT   01-JUN-2001 (TrEMBLrel. 17, Created)
DT   01-JUN-2001 (TrEMBLrel. 17, Last sequence update)
DT   01-MAR-2002 (TrEMBLrel. 20, Last annotation update)
DE   0610042A05RIK PROTEIN.
GN   0610042A05RIK.
OS   Mus musculus (Mouse).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus.
OX   NCBI_TaxID=10090
RN   [1]
RP   SEQUENCE FROM N.A.
RC   STRAIN=C57BL/6J; TISSUE=KIDNEY;
RX   MEDLINE=21085660; PubMed=11217851;
RA   Kawai J., Shinagawa A., Shibata K., Yoshino M., Itoh M., Ishii Y.,
----
RA   Hayashizaki Y.;
RT   "Functional annotation of a full-length mouse cDNA collection.";
RL   Nature 409:685-690(2001).
CC   -!- CATALYTIC ACTIVITY: 6-PHOSPHO-D-GLUCONATE + NADP(+) = D-RIBULOSE
CC       5-PHOSPHATE + CO(2) + NADPH.
CC   -!- PATHWAY: HEXOSE MONOPHOSPHATE SHUNT.
CC   -!- SIMILARITY: BELONGS TO THE 6-PHOSPHOGLUCONATE DEHYDROGENASE
CC       FAMILY.
CC   -!- GENE_FAMILY: HBG000005 [ FAMILY / ALN / TREE ]
DR   EMBL; AK002894; BAB22439.1; -.
DR   HSSP; P00349; 2PGD.
DR   MGD; MGI:1914101; 0610042A05Rik.
DR   InterPro; IPR001744; 6PGD.
DR   Pfam; PF00393; 6PGD; 1.
DR   PRINTS; PR00076; 6PGDHDRGNASE.
DR   PROSITE; PS00461; 6PGD; 1.
DR   PRODOM; Q9DCD0.
DR   SWISS-2DPAGE; Q9DCD0.
KW   NADP; Oxidoreductase; Pentose shunt.
FT   DOMAIN 5 60 PRODOM:2001.3:PD001594 134
FT   DOMAIN 63 296 PRODOM:2001.3:PD001025 91
FT   DOMAIN 316 469 PRODOM:2001.3:PD001549 79
SQ   SEQUENCE 483 AA; 53247 MW; CD0A3F72EEC2831E CRC64;
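The family assignment is carried by the added GENE_FAMILY comment line of the entry. As a minimal sketch, assuming the usual flat-file layout with one field per line and a two-letter line code, the family identifier can be pulled out with a regular expression. The function name and the regular expression are illustrative assumptions, not part of the HoGenom tools.

```python
import re

# A few lines copied from the Q9DCD0 entry shown on the slide.
ENTRY_LINES = [
    "ID   Q9DCD0 PRELIMINARY; PRT; 483 AA.",
    "AC   Q9DCD0;",
    "CC   -!- PATHWAY: HEXOSE MONOPHOSPHATE SHUNT.",
    "CC   -!- GENE_FAMILY: HBG000005 [ FAMILY / ALN / TREE ]",
    "DR   EMBL; AK002894; BAB22439.1; -.",
]

# Matches the comment topic and captures the identifier that follows it.
GENE_FAMILY = re.compile(r"^CC\s+-!- GENE_FAMILY:\s*(\S+)")


def gene_family(lines):
    """Return the gene family identifier of an entry, or None if absent."""
    for line in lines:
        match = GENE_FAMILY.match(line)
        if match:
            return match.group(1)
    return None


print(gene_family(ENTRY_LINES))
```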

13 Building of HoGenom: general view. Selection of protein sequences from fully sequenced organisms on the EBI proteome site. Sequence comparison with BLAST on the whole sequence dataset. Clustering of the sequences into gene families on the basis of sequence similarity (transitive association). Addition of the gene family information to the protein sequence annotations. Computation of EMBL cross-references and selection of nucleotide sequences. Addition of the gene family information to the EMBL/GenBank nucleotide annotations. For each family: protein alignments, phylogenetic trees. Output: ACNUC protein database.

14 Hogennucl: AK002894.PE1
AK002894.PE1 Location/Qualifiers
FT   CDS_pept        76..1527
FT                   /codon_start=1
FT                   /db_xref="MGD:MGI:1914101"
FT                   /db_xref="SWISS-PROT:Q9DCD0"
FT                   /note="data source:SPTR, source key:P52209, evidence:ISS"
FT                   /note="homolog to 6-PHOSPHOGLUCONATE DEHYDROGENASE,
FT                   DECARBOXYLATING (EC 1.1.1.44)"
FT                   /note="putative"
FT                   /transl_table=1
FT                   /gene_family="HBG000005"
FT                   /protein_id="BAB22439.1"
FT                   /translation="MAQADIALIGLAVMGQNLILNMNDHGFVVCAFNRTVSKVDDFLAN
FT                   EAKGTKVVGAQSLKDMVSKLKKPRRVILLVKAGQAVDDFIEKLVPLLDTGDIIIDGGNS
FT                   EYRDTTRRCRDLKAKGILFVGSGVSGGEEGARYGPSLMPGGNKEAWPHIKAIFQAIAAK
FT                   VGTGEPCCDWVGDEGAGHFVKMVHNGIEYGDMQLICEAYHLMKDVLGMRHEEMAQAFEE
FT                   WNKTELDSFLIEITANILKYRDTDGKELLPKIRDSAGQKGTGKWTAISALEYGMPVTLI
FT                   GEAVFARCLSSLKEERVQASQKLKGPKVVQLEGSKKSFLEDIRKALYASKIISYAQGFM
FT                   LLRQAATEFGWTLNYGGIALMWRGGCIIRSVFLGKIKDAFERNPELQNLLLDDFFKSAV
FT                   DNCQDSWRRVISTGVQAGIPMPCFTTALSFYDGYRHEMLPANLIQAQRDYFGAHTYELL
FT                   TKPGEFIHTNWTGHGGSVSSSSYNA"
     atggcccaag ctgacattgc actgatcgga ctggctgtca tgggccagaa cttaattttg 60
     aacatgaatg atcatggatt tgtggtctgt gctttcaata ggacagtctc caaagtcgat 120
     ….
     ccctgcttca ctactgccct ctccttctat gatgggtaca gacacgagat gctgccagca 1320
     aacctcatcc aggctcaacg ggattacttt ggggctcaca cctatgaact cttaaccaaa 1380
     ccgggagaat ttatccacac caactggacg ggccacgggg gcagtgtgtc atcctcttca 1440
     tacaatgcct ag 1452
//
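The nucleotide entry can be cross-checked against the protein entry: the CDS spans positions 76..1527, and the protein Q9DCD0 is 483 amino acids long. The sketch below, which assumes one feature-table line per list item, reads the CDS location and the /gene_family qualifier and verifies that the CDS length corresponds to 483 codons plus a stop codon. The helper names are hypothetical.

```python
import re

# Feature-table lines copied from the AK002894.PE1 entry shown on the slide.
FEATURE_LINES = [
    "FT   CDS_pept        76..1527",
    "FT                   /codon_start=1",
    'FT                   /db_xref="SWISS-PROT:Q9DCD0"',
    'FT                   /gene_family="HBG000005"',
    'FT                   /protein_id="BAB22439.1"',
]


def cds_span(lines):
    """Return (start, end) of the first simple CDS_pept location, or None."""
    for line in lines:
        match = re.search(r"CDS_pept\s+(\d+)\.\.(\d+)", line)
        if match:
            return int(match.group(1)), int(match.group(2))
    return None


def qualifier(lines, name):
    """Return the quoted value of a /name="value" qualifier, or None."""
    pattern = re.compile(r'/%s="([^"]*)"' % re.escape(name))
    for line in lines:
        match = pattern.search(line)
        if match:
            return match.group(1)
    return None


start, end = cds_span(FEATURE_LINES)
cds_length = end - start + 1            # positions are inclusive
codons = cds_length // 3

print(cds_length, codons)               # 1452 nucleotides, 484 codons
print(qualifier(FEATURE_LINES, "gene_family"))
```

The 484 codons correspond to the 483 residues of the protein plus the terminal stop codon (the sequence on the slide ends in "tag"), and 1452 matches the last position counter printed in the sequence block.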

15 Building of HoGenom: general view. Selection of protein sequences from fully sequenced organisms on the EBI proteome site. Sequence comparison with BLAST on the whole sequence dataset. Clustering of the sequences into gene families on the basis of sequence similarity (transitive association). Addition of the gene family information to the protein sequence annotations. Computation of EMBL cross-references and selection of nucleotide sequences. Addition of the gene family information to the EMBL/GenBank nucleotide annotations. For each family: protein alignments, phylogenetic trees. Output: ACNUC protein database and ACNUC nucleotide database.

16 Automatised updating procedure: iterative sequence comparison with BLAST. Between release 1 and release 2, the sequences in the database are split into old and new: compare new sequences with themselves, and compare new sequences with old sequences.
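The point of the iterative scheme is that old sequences were already compared with each other in the previous release, so only the comparisons involving a new sequence have to be run. A minimal sketch, counting ordered (query, subject) pairs; the function names and the toy identifiers are assumptions for illustration:

```python
from itertools import product


def comparisons_for_update(old, new):
    """Yield the (query, subject) pairs needed by an incremental update."""
    yield from product(new, new)    # new x new (self-hits included)
    yield from product(new, old)    # new x old


def comparison_counts(n_old, n_new):
    """Return (all-against-all pairs, pairs needed by the update)."""
    full = (n_old + n_new) ** 2
    incremental = n_new * n_new + n_new * n_old
    return full, incremental


# Toy release: two sequences already in the database, one new sequence.
pairs = list(comparisons_for_update(old=["old1", "old2"], new=["new1"]))
print(pairs)
print(comparison_counts(n_old=2, n_new=1))
```

The saving grows with the share of old sequences: the old x old block, which is skipped, contains n_old squared pairs.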

17 Automatised updating procedure. Family construction 1: similarity search. SWISS-PROT + TrEMBL, New x New + New x Old → BLASTP (BLOSUM62, E ≤ 10^-4, SEG filtering) → local pairwise alignments
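The E-value threshold can be applied as a simple filter on the BLAST results. The sketch below assumes the hits are available in the common 12-column tab-separated BLAST format (query, subject, % identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, E-value, bit score); the slide does not say which output format the pipeline actually used, and the sequence identifiers are made up.

```python
E_VALUE_CUTOFF = 1e-4   # threshold given on the slide


def parse_hit(line):
    """Parse one line of 12-column tabular BLAST output into a dict."""
    f = line.rstrip("\n").split("\t")
    return {
        "query": f[0], "subject": f[1], "identity": float(f[2]),
        "q_start": int(f[6]), "q_end": int(f[7]),
        "s_start": int(f[8]), "s_end": int(f[9]),
        "evalue": float(f[10]), "bits": float(f[11]),
    }


def significant_hits(lines, cutoff=E_VALUE_CUTOFF):
    """Keep the hits whose E-value is at or below the cutoff."""
    hits = (parse_hit(l) for l in lines if l.strip() and not l.startswith("#"))
    return [h for h in hits if h["evalue"] <= cutoff]


# Hypothetical hits, for illustration only.
EXAMPLE = [
    "seqA\tseqB\t62.5\t480\t180\t0\t1\t480\t3\t482\t2e-35\t410.0",
    "seqA\tseqC\t28.0\t50\t36\t0\t10\t59\t200\t249\t0.5\t25.1",
    "seqC\tseqD\t45.0\t120\t66\t0\t5\t124\t8\t127\t1e-4\t48.3",
]

for hit in significant_hits(EXAMPLE):
    print(hit["query"], hit["subject"], hit["evalue"])
```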

18 Automatised updating procedure. Family construction 2: selection of consistent HSPs. [Figure: HSPs S1, S2, S3, S4 between Seq. A and Seq. B, then the selected HSPs S2 and S1’ with length labels ∆lg1, lgHSP1, ∆lg2, ∆lg3, lgHSP2.]
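Only the figure labels of this slide survive, so the exact selection rule is not recoverable here. One common reading of "consistent" is that the retained HSPs must be compatible with a single global alignment: they appear in the same order on both sequences and do not overlap. The sketch below implements that reading with a greedy, best-score-first selection. Both the rule and the coordinates are assumptions for illustration, not the documented HoGenom procedure.

```python
from typing import NamedTuple


class HSP(NamedTuple):
    q_start: int   # start on sequence A
    q_end: int     # end on sequence A
    s_start: int   # start on sequence B
    s_end: int     # end on sequence B
    score: float


def compatible(a, b):
    """True if two HSPs are ordered the same way on both sequences
    and do not overlap on either of them."""
    a_before_b = a.q_end < b.q_start and a.s_end < b.s_start
    b_before_a = b.q_end < a.q_start and b.s_end < a.s_start
    return a_before_b or b_before_a


def consistent_hsps(hsps):
    """Greedy selection: take HSPs by decreasing score and keep each one
    that is compatible with everything kept so far."""
    kept = []
    for hsp in sorted(hsps, key=lambda h: h.score, reverse=True):
        if all(compatible(hsp, k) for k in kept):
            kept.append(hsp)
    return sorted(kept, key=lambda h: h.q_start)


# Hypothetical HSPs: s3 lies after s2 on sequence A but overlaps s1 on
# sequence B, so it cannot belong to the same global alignment.
s1 = HSP(10, 100, 12, 102, 200.0)
s2 = HSP(120, 200, 125, 205, 150.0)
s3 = HSP(210, 260, 20, 70, 40.0)
s4 = HSP(270, 300, 280, 310, 30.0)

print(consistent_hsps([s1, s2, s3, s4]))
```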

19 Automatised updating procedure. Family construction 3: clustering into families. Criteria: HSPs ≥ 80 % of the length, similarity ≥ 50 %. Example: A matches B and A matches C, so A, B and C form one cluster. 1: Clustering of complete sequences into families. 2: Including partial sequences in the families defined previously.
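Clustering by transitive association means that two sequences end up in the same family as soon as a chain of accepted pairs connects them, even if they were never matched directly. A minimal sketch using a union-find structure follows. It covers only step 1 (complete sequences); how coverage is measured (relative to which sequence) is not stated on the slide, so the `coverage` and `similarity` fields, like the function name, are assumptions.

```python
MIN_COVERAGE = 0.80     # HSPs must cover at least 80 % of the length
MIN_SIMILARITY = 0.50   # similarity must be at least 50 %


def cluster(sequences, pairs):
    """Group sequences into families by transitive association.

    `pairs` holds (seq_a, seq_b, coverage, similarity) tuples, with
    coverage and similarity expressed as fractions between 0 and 1.
    """
    parent = {s: s for s in sequences}

    def find(x):
        # Follow parent links to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, coverage, similarity in pairs:
        if coverage >= MIN_COVERAGE and similarity >= MIN_SIMILARITY:
            parent[find(a)] = find(b)   # merge the two families

    families = {}
    for s in sequences:
        families.setdefault(find(s), []).append(s)
    return sorted(sorted(members) for members in families.values())


# A-B and A-C pass both thresholds; B-D fails on coverage.
PAIRS = [("A", "B", 0.90, 0.60), ("A", "C", 0.85, 0.55), ("B", "D", 0.50, 0.70)]
print(cluster(["A", "B", "C", "D"], PAIRS))
```

B and C are never compared directly in this example, yet they share a family through A, which is the situation drawn on the slide.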

20 Automatised updating procedure. Family construction 4: alignments and trees. Protein family (sequences A to G) → multiple alignment (CLUSTAL W, default parameters) → phylogenetic tree (BIONJ neighbor joining, observed divergence; partial sequences: distance matrix with missing values; rooting: mid-point).
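The distance step can be illustrated on its own. Observed divergence is the fraction of differing sites among the alignment columns where both sequences have a residue; two partial sequences that share no column have no defined distance, which is where the missing values in the matrix come from. The sketch below covers only this step, not BIONJ or mid-point rooting, and the toy alignment is hypothetical.

```python
GAP = "-"


def observed_divergence(seq_a, seq_b):
    """Fraction of differing sites over columns where both sequences
    have a residue; None when the sequences share no column."""
    compared = differences = 0
    for a, b in zip(seq_a, seq_b):
        if a == GAP or b == GAP:
            continue
        compared += 1
        differences += a != b
    return differences / compared if compared else None


def distance_matrix(alignment):
    """Pairwise observed divergence, keyed by sorted name pairs."""
    names = sorted(alignment)
    return {
        (x, y): observed_divergence(alignment[x], alignment[y])
        for i, x in enumerate(names)
        for y in names[i + 1:]
    }


# Toy alignment: C and D are partial sequences that do not overlap.
ALIGNMENT = {
    "A": "MKTAYIAK",
    "B": "MKSAYIAK",
    "C": "MKTA----",
    "D": "----YIAR",
}

for pair, distance in distance_matrix(ALIGNMENT).items():
    print(pair, distance)
```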

21 The HoGenom database: Families of homologous genes from complete genomes Month 24 Deliverable

22 Improvements in computation time. Collaboration with IN2P3 Computing Center (Lyon): –CPU: about 1000 processors (Sun, Linux) –Disk storage: about 700 TB –Batch queuing system (BQS). Building of HOGENOM (September 2003): –Total BLAST real time (800 Linux processors): 30h –310,000 new sequences –112,000 old sequences. Parallelisation: ~2 months with local resources, ~1 day with IN2P3 resources.

23 HoGenom ACNUC contents, 8th September 2003. HoGenom Proteins: 423,577 sequences. HoGenom Nucleotide Sequences: 448,582 CDS. 117 fully sequenced organisms. Data source: protein data from EBI, non-redundant complete proteome sets (SWISS-PROT, TrEMBL, TrEMBLnew), http://www.ebi.ac.uk/proteome, June 2003. Genomic data from EMBL, June 2003.

24 117 organisms, 423 577 protein sequences. Chart labels: 10, 16, 91; 31%, 9%, 60%. Organisms listed: Arabidopsis thaliana (plant), Caenorhabditis elegans (nematode), Drosophila melanogaster (fly), Encephalitozoon cuniculi (microsporidia), Guillardia theta (algae), Homo sapiens (man), Mus musculus (mouse), Rattus norvegicus (rat), Saccharomyces cerevisiae (yeast), Schizosaccharomyces pombe (fungus).

25 41 907 families, 423 577 protein sequences. Sequences belonging to a family: 305 514 (72%). Orphan sequences: 115 373 (27%).

26 Access to HoGenom is available at the PBIL: http://pbil.univ-lyon1.fr/ Web page of HoGenom : http://pbil.univ-lyon1.fr/databases/hogenom.html

27 Database access on the Web (Perrière et al. 2003). Two main WWW interfaces. WWW Query (Guy Perrière): –Multiple query on sequences –Multiple query on families –http://pbil.univ-lyon1.fr/search/query_fam.php Cross Taxa: –Search for families according to complex taxonomic criteria –Selection of families –http://pbil.univ-lyon1.fr/search/cross_fam.php Cross-references with external databases, integration into Integr8 (WP2)

28 Cross Taxa: Selection of gene families. Example: selecting families of animal-specific genes
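A taxonomic selection of this kind can be pictured as a filter over the lineages of the family members: a family is specific to a taxon when every member belongs to it. The sketch below is only an illustration of that idea, not the Cross Taxa implementation; the family identifiers, the lineages and the function name are hypothetical.

```python
# Hypothetical families: each member is represented by its taxonomic lineage.
FAMILIES = {
    "FAM_A": [["Eukaryota", "Metazoa", "Chordata"],
              ["Eukaryota", "Metazoa", "Arthropoda"]],
    "FAM_B": [["Eukaryota", "Metazoa", "Chordata"],
              ["Eukaryota", "Fungi"]],
    "FAM_C": [["Bacteria", "Proteobacteria"]],
}


def families_specific_to(families, taxon, min_members=1):
    """Return the families whose members all belong to `taxon`."""
    return sorted(
        family_id
        for family_id, lineages in families.items()
        if len(lineages) >= min_members
        and all(taxon in lineage for lineage in lineages)
    )


print(families_specific_to(FAMILIES, "Metazoa"))
```

Richer criteria (for instance "present in vertebrates and absent from fungi") can be expressed by combining such tests over the same lineages.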

29

30 display family

31 Family Page

32

33

34 Cross-references with external databases and integration (WP2). 1 sequence → associated family: display the family, alignment and phylogenetic tree associated with a sequence accession number via a URL link. Example: sequence Q8ZY16 in NiceProt, with cross-references to HAMAP-ACNUC and HOBACGEN: http://pbil.univ-lyon1.fr/cgi-bin/acnuc-link-ac2fam?db=HAMAPprot&query=Q8ZY16
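The link has a regular form: a fixed script address followed by a `db` parameter and a `query` parameter holding the accession number. A minimal sketch that rebuilds the URL shown on the slide; the function name is an assumption, and the address is taken from the slide without any guarantee that the service is still online.

```python
from urllib.parse import urlencode

# Script address as given on the slide.
BASE_URL = "http://pbil.univ-lyon1.fr/cgi-bin/acnuc-link-ac2fam"


def family_link(db, accession):
    """Build the cross-reference URL for a sequence accession number."""
    return BASE_URL + "?" + urlencode({"db": db, "query": accession})


print(family_link("HAMAPprot", "Q8ZY16"))
```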

35 Next steps Milestone 36 –Development of tools for phylogenetic tree analysis, automatic orthology and paralogy relationship assignment (J.F. Dufayard) –Phylogenetic profiles –Collaboration with WP3 : cross-references between genome CDS and complete proteome

36 Acknowledgements. People from BBE: Laurent Duret, Manolo Gouy, Julien Grassot, Simon Penel, Guy Perrière. SWISS-PROT group: Alexandre Gattiker (S, HAMAP). INRIA: Jean-François Dufayard.

37 Databases of homologous genes Databases of homologous genes at PBIL: –HOVERGEN (1994): vertebrates –HOBACGEN (2000): prokaryotes –HOGENOM: complete genomes –RTKdb: receptor tyrosine kinase (J. Grassot, G. Mouchiroud) –NuReBase: nuclear receptors (M. Robinson, V. Laudet) Goals Database content and updating Query software

38 Databases for comparative genomics Databases of homologous protein domains –PROSITE –PFAM –PRODOM –... –InterPro Databases of gene families –COG –HOBACGEN, HOVERGEN –...

39 Comparative genomics Functional genomics: –Prediction of gene function, protein structure –Identification of functional constraints –Identification of regulatory elements –... Molecular evolution studies: –Search for horizontal transfers –Species-specific metabolic pathways –Ancestral genome content –Gene, genome duplication and acquisition of novel functions –…

40 Gene duplication and evolution of function Gene duplication... Time Pseudogene Ancient paralogs Specific function e.g. expression pattern, subcellular localisation, biochemical activity,...

41 Phylogenomic approach for function prediction. 1) Identify homologs 2) Align sequences 3) Compute phylogenetic tree 4) Place known functions in the tree 5) Infer the likely function of other genes. [Figure: genes 1A, 2A, 3A, 1B, 2B, 3B from species 1, 2, 3, related by a gene duplication.]
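Steps 4 and 5 can be sketched on a toy tree. One simple inference rule, used here as an assumption rather than as the method of the slide, is to look at the smallest clade that contains the gene of interest and at least one annotated gene, and to propose the functions found in that clade. The tree topology (a duplication separating the A and B copies before the three species diverged) and the annotations are hypothetical.

```python
# Nested tuples encode the tree; leaves are gene labels (species + copy).
TREE = (("1A", ("2A", "3A")), ("1B", ("2B", "3B")))

# Step 4: known functions placed on the tree (hypothetical annotations).
KNOWN = {"1A": "function X", "2B": "function Y"}


def leaves(node):
    """Return the leaf labels under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]


def infer_function(tree, known, gene):
    """Step 5: functions annotated in the smallest clade that contains
    `gene` and at least one annotated gene."""
    if gene not in leaves(tree):
        raise ValueError("gene not found in tree")
    if gene in known:
        return {known[gene]}
    # Record every clade on the path from the root down to the gene.
    path, node = [], tree
    while not isinstance(node, str):
        path.append(node)
        node = next(child for child in node if gene in leaves(child))
    # Examine the clades from the smallest to the largest.
    for clade in reversed(path):
        found = {known[leaf] for leaf in leaves(clade) if leaf in known}
        if found:
            return found
    return set()


print(infer_function(TREE, KNOWN, "3A"))
print(infer_function(TREE, KNOWN, "3B"))
```

Gene 3A receives the function of its ortholog 1A rather than that of the paralogous B copies, which is the reason for reading functions off the tree instead of taking the best similarity hit.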

42 Hogennucl: AK002894.PE1, nucleotide sequence annotations (same entry as on slide 14).

43 Hogenprot: Q9DCD0, protein sequence annotations (same entry as on slide 12).

44 Improvements in computation time. Previous computation time (Sun Sparc Ultra 900 MHz, 2 GB RAM), updating the HOVERGEN database (April 2002): –137,000 old + 33,000 new sequences (51 × 10^6 aa) –BLAST comparison (new x old + new x new): 23 days –Multiple alignments (ClustalW): 4 days –Phylogenetic trees (BioNJ, no bootstrap): 0.5 day –Total: 28 days (1 processor). Calculation time was a bottleneck for frequent updates of several databases.

45 General overview on WP12 research –Phylogenomics –Databases of homologous gene families –Family construction The HOGENOM database –Building –Results Access to databases –Database query via a web server –Database cross-references via URLs

46 WP 12 research fields: Proteome/genome comparative analysis. Phylogenetic studies. Orthology/paralogy relationship assignments. Development of generalist and specialised databases: –HOVERGEN: families of homologous vertebrate genes –HOBACGEN: families of homologous bacterial genes –HOGENOM: families of homologous genes from complete genomes –NuReBase, RTKdb, Hoppsigen, Mitalib,.. Identification of important regions in genomic sequences. Evolution at the molecular level. Species phylogeny. Function prediction.

49

50 Application to other databases Any sequence database can be structured under ACNUC and queried with WWW-Query Currently available : SWISS-PROT, EMBL, GenBank, etc. Any family database can be structured under ACNUC and queried with WWW-Query and Cross-Taxa For example, an ACNUC version of the HAMAP database developed by SWISS-PROT is currently available at the PBIL

51 Ortholog ≠ Functional equivalent !! Orthology is not necessarily a one-to-one relationship (it can be one-to-many or many-to-many), e.g. the human INS gene has two orthologs in rodents (Ins1 and Ins2). The rodent Ins1 gene is more closely related to its paralog Ins2 than to its human ortholog INS. [Figure: tree of the ancestral insulin gene in primates (human INS) and rodents (rat and mouse INS1 and INS2), with the speciation and duplication nodes marked.]

