Population Genetic Structure Analysis of The Emerging Marine Pathogen, Vibrio vulnificus Bisharat et al Analyses by RM Harding 6th July, 2006.



Here is where we started: Why two clusters? How are they related to each other? What is their origin? Gene order for MLST: glp-gyrB-mdh-metG-purM (large chromosome), dtdS-lysA-pntA-pyrC-tnaA (small chromosome).

Potential explanations
1. Historical structure. Cluster II isolates represent a clonal expansion of some ancient hybrid (e.g. an ancient horizontal transfer event from V. cholerae?) and so share some sequence similarity by descent. Free recombination between isolates has since been erasing this difference. Support comes from the association of human disease cases with cluster II and the prevalence of environmental origins among cluster I strains. Enriched, non-random sampling of human-disease-associated isolates predicts lower diversity and higher LD within cluster II, particularly surrounding genes associated with the pathogenicity phenotype.
2. Population structure due to isolation between geographic locations or hosts. Lack of recombinant hybrids because members of each cluster rarely meet. No evidence of geographic structure. Ubiquitous presence in marine and estuarine environments provides no clue indicating host structure.
3. Genetic structure due to adaptation and ongoing selection. Lack of recombinant hybrids because of their low fitness. What would be the nature of positive evidence?

Evidence against Hypothesis 1: Each individual MLST locus splits the isolates into the same two groups, more or less, which confirms that clustering of isolates is not due to proximity to a particular divergent locus. Splits trees also suggest that isolates have recombined with each other. But has recombination mainly occurred within cluster I? Perhaps the rate of recombination (within MLST genes) is so low for cluster II that clonal identity (LD) is being maintained across genes? So that would be why all genes split into two clusters.

Table 2. Estimates of rates of mutation and recombination, and their ratios, for Vibrio vulnificus.
Parameters: θ = 2N_e μ, ρ = 2N_e r (ML), and the ratio ρ/θ. ML estimates of ρ are given in parentheses.

Locus                 Length   Cluster I ρ   Cluster II ρ   ρ/θ
Large chrom., glp     480 bp   (28)          (15)           1.9
Large chrom., gyrB    459 bp   (23)          (2)            0.3
Large chrom., mdh     489 bp   (2)           (8)            1.4
Large chrom., metG    429 bp   (1)           (4)            0.7
Large chrom., purM    444 bp   (2)           (10)           1.2
Average               460 bp   (8*)          (8*)           1.1
Small chrom., dtdS    417 bp   (8)           (13)           1.5
Small chrom., lysA    465 bp   (10)          (10)           0.7
Small chrom., pntA    396 bp   (5)           (9)            1.4
Small chrom., pyrC    423 bp   (4)           (19)           2.6
Small chrom., tnaA    324 bp   (20)          (0)            0
Average               405 bp   (8.5*)        (11*)          1.2

*Maximum likelihood estimates from likelihoods combined over 5 genes.

Results so far
Although rates of recombination are variable between genes, averages over within-locus recombination rates provide no evidence for a lower rate of recombination for cluster II on either chromosome. Likewise, there is no evidence of lower diversity (estimated as θ) for cluster II on either chromosome. This observation is compatible with the assumptions of house-keeping function and neutrality for diversity within genes, for both clusters. But, using DnaSP, let’s check nucleotide diversity (estimated as π) and Tajima’s D (difference between π and θ) for any evidence of either selected differences between clusters or clonal expansion within clusters.

Tajima’s D
– LC, both clusters: TD: (ns) π:
– LC, cluster I: TD: (ns) π:
– LC, cluster II: TD: (ns) π:
– SC, both clusters: TD: (ns) π:
– SC, cluster I: TD: (ns) π:
– SC, cluster II: TD: (ns) π:
No evidence for either selected differences between clusters or clonal expansion within clusters.
If sequence divergence between the clusters was particularly deep (relative to the diversity expected for random assortment), then Tajima’s D between clusters would be large and positive (>1.5).
If diversity reflected clonal expansion, Tajima’s D within clusters would be large and negative (<-1.5).
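For anyone wanting to check the statistic outside DnaSP, here is a minimal Python sketch of Tajima's D. It uses the standard Tajima (1989) variance constants. The function name `tajimas_d` and the toy alignments are illustrative assumptions, not data from this analysis, and π and θ are computed per sequence rather than per site.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(seqs):
    """Return (pi, theta_w, D) for a list of equal-length aligned sequences.

    pi is the mean number of pairwise differences, theta_w is Watterson's
    estimator S / a1, and D is their standardised difference.
    """
    n = len(seqs)
    if n < 4:
        raise ValueError("need at least 4 sequences")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")

    # Number of segregating sites.
    seg = sum(1 for i in range(length) if len({s[i] for s in seqs}) > 1)
    if seg == 0:
        raise ValueError("no segregating sites, D is undefined")

    # Mean pairwise differences over all n*(n-1)/2 pairs.
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(x != y for x, y in zip(a, b)) for a, b in pairs) / len(pairs)

    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    theta_w = seg / a1
    d = (pi - theta_w) / sqrt(e1 * seg + e2 * seg * (seg - 1))
    return pi, theta_w, d


# Toy example (hypothetical data): every variant is a singleton,
# the pattern expected after a recent expansion, so D is negative.
singletons = ["AAAA", "TAAA", "ATAA", "AATA"]
# Toy example (hypothetical data): two deep, equally frequent lineages,
# so D is positive.
two_lineages = ["AA", "AA", "TT", "TT"]
```

Calling `tajimas_d(singletons)` and `tajimas_d(two_lineages)` shows the sign behaviour described on the slide: an excess of rare variants pulls D below zero, and deep divergence between two groups pushes it above zero.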

Linkage disequilibrium between loci and fit to an LDhat model
Given that the splits trees are more or less concordant in their splits between clusters, we have to expect chromosome-wide LD for isolates when combined across clusters.
The next slide shows the significant LD between SNP pairs (Fisher’s exact test from DnaSP) on the lower diagonal for the large chromosome (pale blue for significant against a yellow background). Yellow above the diagonal shows a good fit to a conversion tract model for recombination. The best fit was found for 300 bp tract lengths. Red indicates hotspots, i.e. more recombination than expected given the model, only observed within MLST loci. Blue indicates unexpected LD.
Keep in mind that the average length of MLST loci on the large chromosome is 460 bp. If the best-fitting tract length is much smaller than this, then the scale used for inter-locus distances is irrelevant. The inter-locus distances just have to be larger than any intra-locus distance. I used 1 kb between each pair of loci.
The fit isn’t great, but it’s not bad. The lack of fit, which is mainly due to an excess of apparent hotspots within loci, can be explained by model misspecification rather than anything of biological significance. They aren’t really hotspots. It’s the rate estimates that are too small, because of problems with applying the model.
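As a rough illustration of the pairwise test behind the lower diagonal, the sketch below computes a two-sided Fisher's exact p-value for a pair of biallelic SNPs using only the Python standard library. It is not DnaSP's implementation. The function names and the toy allele strings are assumptions made for illustration, and the two-sided rule used (summing all tables no more probable than the observed one) is one common convention.

```python
from collections import Counter
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more probable than the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


def ld_p_value(site_a, site_b):
    """p-value for association between two biallelic SNP columns.

    site_a and site_b hold one allele per isolate, in the same isolate order.
    """
    if len(site_a) != len(site_b):
        raise ValueError("sites must cover the same isolates")
    alleles_a, alleles_b = sorted(set(site_a)), sorted(set(site_b))
    if len(alleles_a) != 2 or len(alleles_b) != 2:
        raise ValueError("both sites must be biallelic")
    counts = Counter(zip(site_a, site_b))
    a0, a1 = alleles_a
    b0, b1 = alleles_b
    return fisher_exact_two_sided(
        counts[(a0, b0)], counts[(a0, b1)], counts[(a1, b0)], counts[(a1, b1)]
    )


# Hypothetical SNP columns for 8 isolates.
complete_ld = ld_p_value("AAAATTTT", "CCCCGGGG")  # alleles fully associated
no_ld = ld_p_value("AAAATTTT", "CCGGCCGG")        # alleles independent
```

With only 8 isolates, even complete association gives a p-value of 2/70, which is a reminder that significance in the LD matrix depends on sample size and allele frequencies as well as on the strength of association.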

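[Figure: LD (below diagonal) and LDhat model fit (above diagonal) for the large chromosome, both clusters combined. Tract length 300 bp, ML(ρ) = 5. Loci on both axes: mdh, gyrB, glp, purM, metG. 184 SNPs.]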

Context and interpretation: large chromosome
The best-fitting tract length is 300 bp and the ML recombination rate is ML(ρ) = 5 over all 5 loci concatenated with 1 kb intervals (5/6301 for the rate per bp), whether the average tract length is 300 bp or 500 bp. Compare with ML(ρ) = 8 for the average of the 5 loci given in the earlier Table (8/460 bp for the rate per bp), both for cluster I and cluster II. The estimated recombination rate over the chromosome is low to accommodate all the LD across clusters between loci.
The model fit highlights departures from expectations of rates based on ML(ρ) = 5. Since these expectations for the rate are low, recombination between some intra-locus SNP pairs is judged high (red). But it’s not too bad: the resulting poor fit (and model failure) is mainly within loci, not between.
For comparison, the best-fitting average tract length for analyses of individual loci within clusters (reported in the table) was estimated at ~500 bp, but keep in mind that this is a lower bound set by the physical intra-locus distances (average locus length is 460 bp). Given the average locus length of 460 bp, there must be a switch between roughly two haplotype blocks within each locus, and I think I can see that in the LD patterns below the diagonal.
Now, focusing on the SNPs that segregate between the clusters, take a look in the next slide at the haplotypes (judged by a small subset of SNPs sharing lots of LD). In the following slide the rows (haplotypes) are organised by a UPGMA tree on large chromosome loci. The same two clusters are observed as when all loci are combined, but the ordering within the clusters is different.

Moving on to the small chromosome
The next slide is the LD and model fit for the small chromosome, both clusters combined. I found the best fit for tract lengths of 110 bp. The LD within loci appears broken into even smaller blocks than on the large chromosome. Again there is a lot of inter-locus LD. Individual blocks align, repeating the segregation pattern between the two clusters. There must be somewhat more LD because the estimated recombination rate is lower than for the large chromosome: ML(ρ) = 2 over all 5 loci concatenated with 1 kb intervals (2/6025 for the rate per bp), whether the tract length is 110 bp or 500 bp. Compare with ML(ρ) = 8.5 for cluster I or ML(ρ) = 11 for cluster II, for the average of the 5 loci given in the earlier Table, assuming a tract length of 500 bp (divide by 405 bp for the rate per bp).
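The denominators 6301 and 6025 quoted for the concatenated analyses follow directly from the locus lengths in Table 2 plus the four 1 kb spacers. A short Python check, with the helper names chosen here purely for illustration:

```python
# Locus lengths in bp, as listed in Table 2.
LARGE = {"glp": 480, "gyrB": 459, "mdh": 489, "metG": 429, "purM": 444}
SMALL = {"dtdS": 417, "lysA": 465, "pntA": 396, "pyrC": 423, "tnaA": 324}
SPACER_BP = 1000  # the 1 kb inter-locus distance used in the LDhat runs


def concatenated_length(loci, spacer=SPACER_BP):
    """Total length of the loci joined by one spacer between each adjacent pair."""
    return sum(loci.values()) + spacer * (len(loci) - 1)


def mean_length(loci):
    """Average locus length in bp."""
    return sum(loci.values()) / len(loci)


def per_bp_rate(rho, length_bp):
    """Convert a per-region ML estimate of rho into a rate per bp."""
    return rho / length_bp


for name, loci, rho_concat, rho_within in [
    ("large", LARGE, 5, 8),     # ML(rho) values quoted on the slides
    ("small", SMALL, 2, 8.5),   # 8.5 is the cluster I within-locus average
]:
    concat = concatenated_length(loci)
    avg = mean_length(loci)
    print(
        f"{name}: concatenated {concat} bp, mean locus {avg:.0f} bp, "
        f"rho/bp concatenated {per_bp_rate(rho_concat, concat):.5f}, "
        f"rho/bp within loci {per_bp_rate(rho_within, avg):.5f}"
    )
```

On these numbers the per-bp rate from the concatenated analysis is more than an order of magnitude lower than the within-locus average on either chromosome, which is the gap the slides attribute to LD between clusters.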

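[Figure: LD (below diagonal) and LDhat model fit (above diagonal) for the small chromosome, both clusters combined. Tract length 110 bp, ML(ρ) = 2. Loci on both axes: pntA, tnaA, lysA, dtdS, pyrC. 252 SNPs.]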

Context and interpretation: small chromosome
Now the model fit is much worse, with so-called ‘hotspots’ dispersed between as well as within loci. The main reason for these ‘hotspots’ is model misspecification. The estimated ML recombination rates, within and between loci, are too low. Even judging against a model that sets the recombination rates too low, there is excess inter-locus LD between pyrC and other genes. The inter-locus LD on the small chromosome must be more substantial in magnitude (not just significance) than on the large chromosome, leading to particularly poor model fit.
Given that the recombination tract-length LDhat model worked reasonably well for the large chromosome, what is different about the small chromosome?
– Suppression of recombination between clusters for the small chromosome?
– Loss of hybrid isolates by selection?
The next slide shows the major segregating haplotypes. The haplotype blocks do look shorter than for the large chromosome.

Main observations on the large and small chromosomes
The estimated recombination tract length is short (300 bp for the large chromosome and 110 bp for the small chromosome). Although there is a lot of inter-locus LD due to chromosome-wide segregation into the same two clusters of isolates, the LD has been broken into short haplotype blocks within loci. For comparison, Jolley et al. reported an average tract length of 1.1 kb for a mixed set of disease-associated and carried Neisseria meningitidis. I think the tract length estimates are reasonable and more robust to model misspecification than the recombination rate estimates. (Gil agrees.)
Why are the blocks short when we focus on the sites segregating between the clusters?
– The SNP variation segregating between clusters is generally older than the variation within clusters, and there has been more time for recombination events to break up the LD into short blocks.
– The nature of the recombination process between clusters is different to the recombination process(es) acting within clusters, on the small chromosome in particular.

Large chromosome, keeping the clusters separate
For the large chromosome, both cluster I and cluster II isolates show LD within loci, and some but not lots of LD between loci. For cluster I, the model fit improves as the tract length is increased to 4000 bp, overlapping up to 3 loci. For cluster II, the model fit improves as the tract length is increased to 2000 bp, overlapping adjacent pairs of loci. The ML recombination rate is lower for cluster II compared with cluster I. There does appear to be more inter-locus LD in cluster II. I’ve got no way of testing whether 2000 and 4000 are significantly different fits (probably not).
What is most interesting is that there is information that recombination tract lengths within clusters on the large chromosome overlap adjacent pairs of MLST loci, which physically are a long way apart: between 0.35 Mb (mdh and gyrB) and 1.1 Mb (purM and metG).
The LD plots below the diagonal provide some information to explain why the best-fitting tract lengths overlap adjacent loci. For both clusters there is lots of LD within loci, and some LD, but not as much, between some pairs of loci, e.g. any locus with metG. For cluster II in particular, the inter-locus LD is as likely between more distant comparisons as between adjacent loci. We do need to remember that the chromosome is circular and metG is equally far from mdh and purM, and actually furthest from glp. Perhaps the recombination probability does continue to increase with increasing inter-locus distance up to 1.6 Mb, i.e. overlapping any 3 neighbouring loci.
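The point about circularity can be made concrete with a small Python sketch. The locus coordinates and the 3300 kb chromosome length below are hypothetical values chosen only to be consistent with the distances mentioned on the slide (0.35 Mb between mdh and gyrB, 1.1 Mb between purM and metG, metG equidistant from mdh and purM). They are not the real map positions.

```python
def circular_distance(pos_a, pos_b, chrom_length):
    """Shortest distance between two positions on a circular chromosome."""
    d = abs(pos_a - pos_b) % chrom_length
    return min(d, chrom_length - d)


# Hypothetical layout in kb, for illustration only (not real coordinates).
CHROM_LENGTH_KB = 3300
POSITIONS_KB = {"mdh": 0, "gyrB": 350, "glp": 600, "purM": 1100, "metG": 2200}

# Distance from metG to every other locus, taking the shorter way round.
from_metG = {
    locus: circular_distance(pos, POSITIONS_KB["metG"], CHROM_LENGTH_KB)
    for locus, pos in POSITIONS_KB.items()
    if locus != "metG"
}
furthest_from_metG = max(from_metG, key=from_metG.get)
```

Under this assumed layout metG is 1100 kb from both mdh and purM and 1600 kb from glp, even though mdh and metG sit at opposite ends of the linear ordering used in the LD plots.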

[Figure: LD and LDhat model fit for the large chromosome, clusters kept separate. Loci on both axes: mdh, gyrB, glp, purM, metG. ML(ρ) C4000: SNPs]

[Figure: LD and LDhat model fit for the large chromosome, clusters kept separate. ML(ρ) C2000: 32 (best fit). ML(ρ) C4000: SNPs]

Small chromosome, keeping the clusters separate
For the small chromosome, both cluster I and cluster II isolates show most LD within loci and some LD between loci, but there is more LD between pyrC and other loci for cluster II than for cluster I. For cluster I, the model fit improves as the tract length is increased to 800 bp, i.e. not extending between any loci. For cluster II, the model fit improves as the tract length is increased to 4500 bp, overlapping sets of 3 loci. The ML recombination rate is higher for cluster II than for cluster I (comparing estimates for tract lengths of 4000 bp). But, judging by eye, the higher ML rate leads to more excess inter-locus LD and a poorer model fit for cluster II than for cluster I. It is also poorer than the model fit for either cluster on the large chromosome. So, whatever it is that is different about the small chromosome compared with the large chromosome, cluster II on the small chromosome is the oddest.

[Figure: LD and LDhat model fit for the small chromosome, clusters kept separate. Loci on both axes: pntA, tnaA, lysA, dtdS, pyrC. ML(ρ) C800 = 21 (best fit). ML(ρ) C4000 = SNPs]

[Figure: LD and LDhat model fit for the small chromosome, clusters kept separate. Loci on both axes: pntA, tnaA, lysA, dtdS, pyrC. ML(ρ) C4000 = SNPs]
The best model fit is with 4500 bp, but the picture doesn’t look any different to this one.

A check on variable inter-locus distances for the small chromosome
The MLST loci on the small chromosome are very unevenly distributed; dtdS is close to pyrC (and a lot of inter-locus LD is evident). Also, tnaA is relatively close to lysA. Remembering that we have a circular chromosome, pyrC is as close, or as far, to pntA as it is to lysA. So I tried varying the inter-locus distances to roughly reflect these relationships.
The new results showed that for cluster II, the average tract length remained long at 4000 bp or more, potentially covering any three loci. Also, the fit improved a little, from 2058 for the picture shown (with 1 kb inter-locus distances) to 2022 (variable inter-locus distances), but this difference, spread over the whole matrix of SNP pairs, is too small to alter the highlighted pattern of highs (red) and lows (blue). The fits with tract lengths of 4000 bp (or more) are better than fits with tract lengths of 2000 bp, at 2068 (1 kb inter-locus distances) or 2031 (variable inter-locus distances of either 1 or 3 kb). But I doubt that any of the different tract length estimates varying from 2000 upwards are meaningful improvements.
However, it would be interesting to know if a best fit of 800 bp for cluster I on the small chromosome, rather than say bp, is meaningful. The observation of 800 bp is in the same ballpark as the 1.1 kb estimated by Gil for Jolley et al.’s paper on Neisseria. It indicates recombination tract lengths covering the extent of individual MLST loci, but not overlapping them.

Adjust the focus to cluster I on the small chromosome
Cluster I isolates for the small chromosome have the lowest diversity (π = compared with π > 0.2 for cluster II or either cluster on the large chromosome). Cluster I isolates for the small chromosome also show the least inter-locus LD, giving tract lengths of only 800 bp. Also, I think cluster I haplotypes on the small chromosome show the least intrusion of diversity from the other cluster. Perhaps selection, as adaptation to the environment, is acting more strongly on the small chromosome than on the large chromosome and eliminating diversity, including diversity that happens to increase the probability of pathogenesis.

Selection for adaptation to the environment
If selection for environmental adaptation is the important evolutionary process, then perhaps there are two basic adaptive phenotypes. The more common (and more fit?) basic phenotype is associated with cluster I MLSTs. On this background biotype 2 has evolved and the propensity to human pathogenicity has been lost. The less common basic phenotype found in the environment is associated with cluster II MLSTs. Despite its being less common, we have obtained a large sample because its frequency is enriched by sampling disease cases. The observation that diversity segregates between these clusters means that both types have higher fitness than most hybrids between them (until some unusual hybrid arises, like biotype 3).

Conclusions
The recombination rate for concatenated loci seems to be comparable and high within both clusters on both chromosomes. However, the model fit is poor for cluster II on the small chromosome, suggesting that, against model expectations, there is excess inter-locus LD, mainly extending from pyrC. Excess inter-locus LD extending from pyrC on the small chromosome is detectable for the full sample combining clusters, as well as for cluster II isolates separately.
On a related point, many of the theoretically expected recombinants (between clusters) for the small chromosome are missing from cluster I, and this looks to me like there has been selection against them.
Are the long recombination tracts detectable by the LDhat model for the large chromosome, within clusters, due to (a) recombination from the other cluster, or (b) recombination between similar, cluster-specific haplotypes (by conjugation?)?