SPECIES AT THE GENOMIC LEVEL
DDH has been the gold standard (the equivalent of “sex” for higher eukaryotes)
Stackebrandt et al., 2002, Int J Syst Evol Microbiol. 52:
Rosselló-Móra & Amann 2001, FEMS Rev. 25:39-67
Gevers et al., 2005, Nature Rev. Microbiol. 3:

DDH (DNA-DNA hybridization):
► 70% similarity (50-70%)
► used since the 60’s
► strong influence
► non-cumulative DB
► needs to be substituted

MLSA (multilocus sequence analysis):
► 5-10 full/partial sequences of housekeeping genes
► primer design difficulties
► biases in the selection of genes
► time consuming
► ↓↓ number for stable topology

MLSA workflow:
► Amplify and sequence 5-10 housekeeping genes for each strain
► Concatenate gene sequences
► Reconstruct the phylogeny
[Figure: concatenated genes genA to genF for strains Str. 1 to Str. 4]
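The "concatenate gene sequences" step of the MLSA workflow can be sketched as follows. This is a minimal illustration, assuming each housekeeping gene has already been aligned across all strains; the function name `concatenate_loci` and the nested-dictionary input layout are hypothetical and not taken from the slides.

```python
from typing import Dict, List


def concatenate_loci(
    alignments: Dict[str, Dict[str, str]], gene_order: List[str]
) -> Dict[str, str]:
    """Join per-gene aligned sequences into one supersequence per strain.

    alignments[gene][strain] is the aligned sequence of that gene for that
    strain (hypothetical layout). Every gene must cover the same strains and
    every alignment must have equal-length rows.
    """
    strains = set(alignments[gene_order[0]])
    for gene in gene_order:
        if set(alignments[gene]) != strains:
            raise ValueError(f"{gene}: strain set differs from {gene_order[0]}")
        if len({len(seq) for seq in alignments[gene].values()}) != 1:
            raise ValueError(f"{gene}: aligned sequences differ in length")
    # Concatenate genes in a fixed order so columns stay comparable.
    return {
        strain: "".join(alignments[gene][strain] for gene in gene_order)
        for strain in sorted(strains)
    }


example = {
    "genA": {"Str. 1": "ACGT", "Str. 2": "ACGA"},
    "genB": {"Str. 1": "GG-T", "Str. 2": "GGCT"},
}
supermatrix = concatenate_loci(example, ["genA", "genB"])
```

The resulting supersequences would then be passed to whatever tree-building program is used to reconstruct the phylogeny; that step is outside this sketch.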
Alternative approaches: ANI

ANI from shared genes (Konstantinidis and Tiedje, 2005, PNAS. 102:)
► Genome a vs. genome b: BLASTN search of annotated ORFs
► sieve common orthologous genes
► ANI

ANI from genome fragments (Goris et al., 2007, IJSEM. 57:81-91)
► Cut the genome into fragments of 1020 nuc + BLASTN against the other genome
► < 30% identity or < 70% aligned seq: discard
► > 30% identity and > 70% aligned seq: ANI
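The fragment-based ANI calculation can be sketched as below. This is a simplified illustration under stated assumptions: the BLASTN search itself is not run here, and each fragment's best hit is assumed to be already summarized as a percent identity and a percent of the fragment aligned. The names `Hit`, `cut_into_fragments` and `ani` are hypothetical, and dropping the trailing partial fragment is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Iterable, List

FRAGMENT_LENGTH = 1020  # fragment size given on the slide


def cut_into_fragments(genome: str, size: int = FRAGMENT_LENGTH) -> List[str]:
    """Cut a genome into consecutive, non-overlapping fragments.

    Assumption of this sketch: a trailing piece shorter than `size` is dropped.
    """
    return [genome[i:i + size] for i in range(0, len(genome) - size + 1, size)]


@dataclass
class Hit:
    identity: float  # percent identity of the fragment's best BLASTN hit
    aligned: float   # percent of the fragment covered by the alignment


def ani(hits: Iterable[Hit], min_identity: float = 30.0,
        min_aligned: float = 70.0) -> float:
    """Average identity over hits passing both cutoffs; others are discarded."""
    kept = [h.identity for h in hits
            if h.identity > min_identity and h.aligned > min_aligned]
    if not kept:
        raise ValueError("no fragment passed the identity/alignment cutoffs")
    return sum(kept) / len(kept)


fragments = cut_into_fragments("A" * 2500)
# Made-up hits: the last two fail the aligned-length and identity cutoffs.
demo_hits = [Hit(95.0, 90.0), Hit(97.0, 80.0), Hit(99.0, 50.0), Hit(25.0, 90.0)]
demo_ani = ani(demo_hits)
```

The hit values in the example are invented for illustration only and do not correspond to any real genome pair.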
JSpecies
► Biologist oriented
► user friendly
► usable with multifasta data
ANI is the way to circumscribe species genomically in the future

ANIm vs DDH:
► 85 genospecies evaluated
► 94-96% a plausible borderline
► inconsistent results most probably due to wrong DDH values
► ANI thresholds of 94-96% for genomospecies
► 20% random sequences (i.e., 250 nuc) of two genomes is enough
► with a complete catalogue of type strain genomes, only 4% random genome sequence is enough
Richter & Rosselló-Móra 2009, PNAS 106:
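The 94-96% borderline can be read as a three-way decision rule. The sketch below is one possible reading, not the authors' procedure: the function name `classify_pair` and the labels it returns are hypothetical, and treating values inside the band as undecided is an assumption of this example.

```python
def classify_pair(ani_percent: float, lower: float = 94.0,
                  upper: float = 96.0) -> str:
    """Place an ANI value relative to the 94-96% boundary band."""
    if not 0.0 <= ani_percent <= 100.0:
        raise ValueError("ANI must be a percentage between 0 and 100")
    if ani_percent < lower:
        return "different species"
    if ani_percent > upper:
        return "same species"
    # Inside the band the slide only speaks of a "plausible borderline".
    return "borderline"


labels = [classify_pair(v) for v in (90.0, 95.0, 98.5)]
```

Pairs that fall inside the band would need additional evidence (for example a second parameter) before a decision is made.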
The best scenario ◄► all species genomes sequenced
► complete type strain genomes + < 20% random sequence genome coverage
► Perhaps 1000 reads would be enough (200€)
► STABLE ANI
► 1% of the genome will be enough for IDENTIFICATION purposes
► an effort is needed to fully sequence the species collection (GEBA; Wu et al Nature 24: )
► in the future it will be necessary to fully sequence any new type strain
► 94% - 96% ANI boundary
Genome database & Type strains
► Data analysis in summer 2009 => 938 genomes
► 10% of the entries tagged with the collection number (the rest with original strain number)
► 255 species names represented by their Type Strain
► 256 species names NOT represented by their Type Strain
► 50 species names NEVER validly published
► it is possible to circumscribe uncultured species (i.e. Buchnera & Wolbachia)
Richter & Rosselló-Móra 2009, PNAS 106:
TETRA
► Tetranucleotide variation: 4^4 = 256
► Genomes have an oligonucleotide usage (not yet understood, related to codon usage)
► Similar genomes might have similar usage
► ALIGNMENT-FREE PARAMETER
► may be useful in deciding whether a group of strains deserves species status
► Same species: > 0.999
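The idea of comparing tetranucleotide usage without an alignment can be sketched as follows. This is a deliberately simplified illustration: it correlates raw tetranucleotide frequencies on one strand, which is an assumption of this example; the TETRA calculation in JSpecies may normalize the usage differently (for instance against expected frequencies) before correlating. The names `tetra_frequencies` and `pearson` are hypothetical.

```python
from itertools import product
from math import sqrt
from typing import List

# All 4^4 = 256 possible tetranucleotides, in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetra_frequencies(seq: str) -> List[float]:
    """Relative frequency of each tetranucleotide (overlapping windows)."""
    seq = seq.upper()
    counts = dict.fromkeys(TETRAMERS, 0)
    for i in range(len(seq) - 3):
        word = seq[i:i + 4]
        if word in counts:  # skips windows containing N or other symbols
            counts[word] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no valid tetranucleotide")
    return [counts[t] / total for t in TETRAMERS]


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient r between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for a constant vector")
    return cov / sqrt(var_x * var_y)


toy = "ACGTTGCAAGCT" * 5  # toy sequence, far shorter than a real genome
freqs = tetra_frequencies(toy)
self_r = pearson(freqs, freqs)
```

Comparing a sequence with itself gives r = 1 up to floating-point rounding; for real genomes the two frequency vectors would come from the two genomes being compared, and r would be read against the > 0.999 value on the slide.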
► The case of the synthetic genome of M. mycoides strain GM12 transplanted to M. capricolum (Science (2010) 329: 52)
► 88.5 (66% aligned)
► 94.5 (78% aligned)
► 87.8 (76% aligned)
Genome transplantation experiments of Venter
► Only one of the several transplantations worked out!
► Different ways of reading the genome?

Table columns: organism, target, ANI, TETRA (r), Worked
organism                      target
M. hyopneumoniae 7448         M. hyopneumoniae J
M. mycoides LC                M. capricolum
M. genitalium                 M. capricolum
M. genitalium                 M. pneumoniae
M. genitalium                 M. gallisepticum
M. alligatoris (crocodyli)    M. capricolum
Same species
Worked: NO, NO, NO
► The phylogenetic (evolutionary) distance plays an important role in the recognition of how the genetic information is coded
► M. genitalium and M. pneumoniae, strange! Wrongly identified strain?
OTHER PARAMETERS
► Average Amino Acid Identity (AAI): Konstantinidis & Tiedje, 2005, J. Bacteriol. 187:
► Maximal Unique and Exact Matches (MUM): Deloger et al., 2009, J. Bacteriol. 191:
► High Scoring Segment Pairs (HSP): Auch et al., Std Gen Sci 2:
► And more to come
► Need full genome sequences
► The easiest is the best