1
Identification of Paralogs in RADseq data
Garrett McKinney, Ryan Waples, Lisa Seeb, Jim Seeb
University of Washington, Seattle
2
Paralogs in Salmon
Salmon whole genome duplication has resulted in a high proportion of paralogs
Up to 25% of the genome experienced delayed diploidization
Up to 10% of the genome is still inherited tetrasomically
Lien et al. 2016
3
Paralogs in Salmon
Paralogs are concentrated on the distal ends of eight pairs of homeologous chromosome arms (Brieuc et al. 2014, Kodama et al. 2015, Waples et al. 2015, Larson et al., McKinney et al. 2016)
Distal portions of these homeologs can still undergo tetrasomic inheritance
Waples et al. 2015
4
Paralogs in Salmon
Three broad locus classes result from the rediploidization process
Locus type*, inheritance, and allelic ratios and genotypes in heterozygous individuals:
Singleton: disomic; 1:1 (AB)
Duplicate: tetrasomic; 3:1 (AA | AB), 2:2 (AB | AB, AA | BB), 1:3 (AB | BB)
Diverged duplicates: AA || AB, AA || BB
* Based on nomenclature and definitions given in Allendorf and Thorgaard (1984) and Limborg et al. (2016)
5
Issues With Paralogs
It is difficult to accurately genotype paralogs
  AAAB vs AABB vs ABBB
Paralogs can’t always be analyzed using standard population genetic methods
  Many models assume a diploid system
Paralogs are often removed from analyses due to these considerations
  Must first be identified
6
Methods to Identify Paralogs
Genome/transcriptome alignments
  Genomes not available or incomplete for many species
Gene annotation
  Most RAD loci are outside of coding regions
Haploids/doubled haploids
Sequence depth
Heterozygosity/Hardy-Weinberg deviations
Number of alleles/polymorphic sites
7
How to better identify paralogs
Paralogs differ from singleton loci in two important characteristics
  Heterozygosity
  Allele ratios within heterozygotes
These can be combined to accurately identify paralogs in population datasets
HDplot
8
Heterozygosity
Duplicate loci have more heterozygotes than singleton loci
The increase in heterozygosity occurs at every allele frequency but is maximized when p=q=0.5
9
Difference is maximized when allele frequencies are even
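
The heterozygosity claim on the two slides above can be checked with a small calculation. The sketch below is not taken from the presentation; it assumes Hardy-Weinberg proportions for a singleton locus (2pq) and, for a duplicate locus, four independently drawn allele copies at the same allele frequency, so that an individual appears heterozygous whenever it carries both alleles (1 - p^4 - q^4). The function names are hypothetical.

```python
def het_singleton(p: float) -> float:
    """Expected heterozygote frequency at a disomic locus under Hardy-Weinberg: 2pq."""
    q = 1.0 - p
    return 2.0 * p * q


def het_duplicate(p: float) -> float:
    """Expected frequency of individuals carrying both alleles at a duplicated locus.

    Assumption: four allele copies drawn independently at frequency p,
    so only AAAA and BBBB individuals look homozygous.
    """
    q = 1.0 - p
    return 1.0 - p**4 - q**4


def het_excess(p: float) -> float:
    """Extra apparent heterozygosity of a duplicate locus relative to a singleton."""
    return het_duplicate(p) - het_singleton(p)


if __name__ == "__main__":
    for p in (0.1, 0.3, 0.5):
        print(f"p={p:.1f}  singleton={het_singleton(p):.3f}  "
              f"duplicate={het_duplicate(p):.3f}  excess={het_excess(p):.3f}")
```

Under these assumptions the excess is positive at every allele frequency and largest at p = q = 0.5, which matches the statement on the slides.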
10
Allele Ratio
Allele ratio is discrete within individuals but continuous at the population level
Singleton heterozygotes (AB) are always 1:1
Duplicates have three heterozygous classes
  AAAB, AABB, ABBB (3:1, 1:1, 1:3)
  Proportion of each heterozygous class depends on allele frequencies
Stochastic variation in RADseq can mask the true ratio in individuals
Reads for each allele can be summed over heterozygous individuals to increase power
13
Difference is maximized when one allele is rare
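
To see why the pooled allele ratio departs most from 1:1 when one allele is rare, the following sketch computes the expected mix of heterozygous classes at a duplicate locus. It is an illustration under stated assumptions, not the authors' code: four independently drawn allele copies per individual, equal read depth across individuals, and reads proportional to allele dosage. All names are hypothetical.

```python
# Expected fraction of A reads within each heterozygous class (dosage / 4).
A_FRACTION = {"AAAB": 0.75, "AABB": 0.50, "ABBB": 0.25}


def het_class_proportions(p: float) -> dict:
    """Proportion of each heterozygous class among heterozygotes at a duplicate locus.

    Assumption: four independent allele copies at frequency p (binomial, n=4).
    """
    q = 1.0 - p
    raw = {
        "AAAB": 4 * p**3 * q,
        "AABB": 6 * p**2 * q**2,
        "ABBB": 4 * p * q**3,
    }
    total = sum(raw.values())
    return {genotype: value / total for genotype, value in raw.items()}


def expected_a_read_fraction(p: float) -> float:
    """Expected fraction of A reads when reads are summed over all heterozygotes."""
    proportions = het_class_proportions(p)
    return sum(proportions[g] * A_FRACTION[g] for g in proportions)


if __name__ == "__main__":
    for p in (0.05, 0.2, 0.5):
        print(f"p={p:.2f}  pooled A fraction={expected_a_read_fraction(p):.3f}")
```

At p = 0.5 the three classes balance out and the pooled ratio is 1:1 even for a duplicate; as A becomes rare, ABBB dominates and the pooled fraction of A reads approaches 0.25.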
14
A statistical test can be done to see if the observed allelic ratio differs from 1:1
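
One way such a test could be run is an exact two-sided binomial test on read counts pooled over heterozygotes. The slide does not say which test was used, so this is only a sketch; the function name and the example counts are hypothetical.

```python
from math import comb


def binom_two_sided_p(a_reads: int, total_reads: int) -> float:
    """Exact two-sided binomial p-value for a_reads out of total_reads under a 1:1 ratio.

    Sums the probabilities of all outcomes that are no more likely than the
    observed one (small tolerance added for floating-point ties).
    """
    if total_reads <= 0 or not 0 <= a_reads <= total_reads:
        raise ValueError("need 0 <= a_reads <= total_reads and total_reads > 0")
    pmf = [comb(total_reads, k) / 2**total_reads for k in range(total_reads + 1)]
    observed = pmf[a_reads]
    return min(1.0, sum(x for x in pmf if x <= observed * (1 + 1e-9)))


if __name__ == "__main__":
    # Hypothetical pooled counts: 100 reads over all heterozygotes at a locus.
    print(binom_two_sided_p(50, 100))  # balanced reads
    print(binom_two_sided_p(75, 100))  # strong excess of allele A
```

Because reads are pooled over many individuals, even modest per-individual skews can yield very small p-values, so a threshold would need to be chosen with that in mind.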
15
Chinook salmon data
Larson et al. 2014, McKinney et al. 2016
Population dataset: 266 individuals, 5 populations
McKinney et al. 2016
Linkage map: 14,620 loci
Use copy-status from mapped loci to verify accuracy of HDplot
16
Data simulation
Simulated data based on Chinook salmon dataset
20,000 loci (17,000 singleton, 1,500 duplicate, 1,500 diverged duplicate)
Allele frequency for each locus randomly generated
2,000 genotypes created based on allele frequency
250 individuals randomly sampled from genotypes
Read depth per locus and read depth per allele modeled based on Chinook salmon dataset
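
The simulation steps listed above can be sketched for a single locus as follows. This is a simplified illustration, not the authors' simulation: the uniform allele-frequency range, the fixed read depth of 30, binomial read sampling, and the choice to model a diverged duplicate as one copy fixed for allele A are all assumptions made here, and `simulate_locus` is a hypothetical name.

```python
import random


def simulate_locus(locus_type, rng, n_pool=2000, n_sampled=250, depth=30):
    """Simulate genotypes and read counts for one locus.

    Returns (allele_frequency, sampled_genotypes, read_counts), where each
    genotype is (copies_of_A, total_copies) and each read count is (A, B).
    """
    p = rng.uniform(0.05, 0.95)  # assumption: uniform allele frequency

    def draw(copies, freq):
        # Number of A alleles among `copies` independent draws.
        return sum(rng.random() < freq for _ in range(copies))

    pool = []
    for _ in range(n_pool):
        if locus_type == "singleton":
            pool.append((draw(2, p), 2))
        elif locus_type == "duplicate":
            pool.append((draw(4, p), 4))
        elif locus_type == "diverged":
            # Assumption: one copy fixed for A, the other polymorphic.
            pool.append((2 + draw(2, p), 4))
        else:
            raise ValueError(f"unknown locus type: {locus_type}")

    sampled = rng.sample(pool, n_sampled)

    reads = []
    for a_copies, total in sampled:
        # Reads for allele A are sampled in proportion to its dosage.
        a_reads = sum(rng.random() < a_copies / total for _ in range(depth))
        reads.append((a_reads, depth - a_reads))
    return p, sampled, reads


if __name__ == "__main__":
    rng = random.Random(42)
    p, genotypes, reads = simulate_locus("duplicate", rng)
    print(f"allele frequency {p:.2f}, first individual {genotypes[0]}, reads {reads[0]}")
```

The presentation models read depth per locus and per allele on the empirical Chinook salmon data, which would add more variance than the fixed-depth binomial sampling used here.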
17
Data simulation
Calculating H and D
H is the heterozygosity per locus
D (read ratio deviation) is a z-score for each locus
  Measures deviation of observed allelic ratio from expected 1:1
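
A minimal sketch of how H and D might be computed for one locus is shown below. It assumes H is the proportion of genotyped individuals called heterozygous and D is a binomial z-score on reads pooled over heterozygotes, with an expected A fraction of 0.5 and standard deviation sqrt(N * 0.5 * 0.5). The input layout, the function name, and the sign convention (positive when allele A is in excess) are assumptions made for illustration.

```python
from math import sqrt


def h_and_d(calls):
    """Compute (H, D) for one locus.

    calls: list of (genotype, a_reads, b_reads), genotype in {"AA", "AB", "BB"}.
    H: proportion of individuals called heterozygous.
    D: z-score of pooled A reads in heterozygotes against a 1:1 expectation,
       or None if there are no heterozygote reads.
    """
    if not calls:
        raise ValueError("no genotype calls for this locus")
    hets = [(a, b) for genotype, a, b in calls if genotype == "AB"]
    h = len(hets) / len(calls)

    a_total = sum(a for a, _ in hets)
    n_total = sum(a + b for a, b in hets)
    if n_total == 0:
        return h, None

    sd = sqrt(n_total * 0.5 * 0.5)  # binomial standard deviation at p = 0.5
    d = (a_total - n_total / 2) / sd
    return h, d


if __name__ == "__main__":
    # Hypothetical calls for four individuals at one locus.
    example = [("AB", 10, 10), ("AB", 12, 8), ("AA", 20, 0), ("BB", 0, 20)]
    print(h_and_d(example))
```

Plotting H against D for every locus then gives the two-dimensional view in which the locus classes separate.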
18
Data Simulation Results
Singletons and duplicates are distinguished both by heterozygosity and read ratio deviation
Duplicates and diverged duplicates are distinguished primarily by heterozygosity
19
Real Data
Total loci: 19,299
Singleton: 16,019 (83%)
Duplicate: 3,280 (17%)
20
Figure panels: Singleton Loci; Duplicate Loci; Diverged Duplicate Loci
21
Accuracy of Other Methods
Haploid Mapping
Accurate
  97% agreement with HDplot for singletons
  95% agreement with HDplot for duplicates
Limited to identifying duplicates that are polymorphic in parents of haploids
  Could only assign copy-status to ~50% of the population loci
22
Accuracy of Other Methods
Read Depth
~30% of duplicate loci misidentified
23
Accuracy of Other Methods
Heterozygosity
Used an FIS threshold of < -0.5 to classify loci as duplicate
Only a subset of duplicates identified (~30%)
No singletons misidentified
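
The FIS filter described on this slide can be sketched as below, using the standard definition FIS = 1 - Ho/He with He = 2pq estimated from the observed genotype counts. The -0.5 threshold comes from the slide; the function names and the example counts are hypothetical.

```python
def fis(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Inbreeding coefficient FIS = 1 - Ho/He from genotype counts at one locus."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes")
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    expected_het = 2 * p * (1 - p)    # He under Hardy-Weinberg
    if expected_het == 0:
        raise ValueError("locus is monomorphic; FIS is undefined")
    observed_het = n_ab / n           # Ho
    return 1 - observed_het / expected_het


def is_duplicate_by_fis(n_aa: int, n_ab: int, n_bb: int, threshold: float = -0.5) -> bool:
    """Flag a locus as a putative duplicate when FIS falls below the threshold."""
    return fis(n_aa, n_ab, n_bb) < threshold


if __name__ == "__main__":
    print(fis(25, 50, 25), is_duplicate_by_fis(25, 50, 25))  # Hardy-Weinberg proportions
    print(fis(10, 80, 10), is_duplicate_by_fis(10, 80, 10))  # strong heterozygote excess
```

A strict threshold like this only catches loci with a large heterozygote excess, which is consistent with the slide's observation that no singletons were misidentified but only a subset of duplicates was found.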
24
Accuracy of Other Methods
Number of Polymorphic Sites/Alleles
Duplicate loci do have more alleles and polymorphic sites, but the overlap in distributions is too great to distinguish them from singletons
Figure legend: Singleton; Duplicate
25
How to use knowledge of copy-status
Filter duplicates
  Has been the default approach; removes regions of the genome that may be important
Incorporate into population analyses
  Estimate population allele frequencies (polyfreqs)
  Assign allele dosage
26
Genotyping Paralogs: RADseq
Figure: two panels showing genotype classes AAAB, AABB, ABBB on an allele-ratio axis (1:3, 1:1, 3:1)
Panel 1: Sequence reads too variable to assign allele dosage
Panel 2: Dosage may be assigned with few ambiguities
27
Genotyping Paralogs: RADseq
Genotype classes: AAAB, AABB, ABBB
Panel 1: 28 reads per individual
Panel 2: 119 reads per individual
*Difference is average read depth
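
The effect of read depth on dosage assignment can be illustrated with a simple calculation. The sketch below assigns a heterozygote to the class (AAAB, AABB, ABBB) whose expected A fraction is nearest to the observed one, then computes the probability of a wrong assignment under pure binomial read sampling. The nearest-ratio rule, the binomial model, and the restriction to heterozygous classes are assumptions for illustration; only the depths 28 and 119 come from the slide.

```python
from math import comb

# Expected fraction of A reads for each heterozygous dosage class.
EXPECTED_A_FRACTION = {"AAAB": 0.75, "AABB": 0.50, "ABBB": 0.25}


def assign_dosage(a_reads: int, total_reads: int) -> str:
    """Assign the class whose expected A fraction is closest to the observed one."""
    observed = a_reads / total_reads
    return min(EXPECTED_A_FRACTION, key=lambda g: abs(EXPECTED_A_FRACTION[g] - observed))


def misassignment_rate(true_genotype: str, depth: int) -> float:
    """Probability that binomial read sampling leads to the wrong class at this depth."""
    p = EXPECTED_A_FRACTION[true_genotype]
    error = 0.0
    for k in range(depth + 1):
        if assign_dosage(k, depth) != true_genotype:
            error += comb(depth, k) * p**k * (1 - p) ** (depth - k)
    return error


if __name__ == "__main__":
    for depth in (28, 119):
        rates = {g: misassignment_rate(g, depth) for g in EXPECTED_A_FRACTION}
        print(depth, {g: round(r, 4) for g, r in rates.items()})
```

Real RADseq reads are typically more variable than a binomial model predicts, so these rates are best read as a lower bound on the ambiguity at each depth.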
28
Genotyping Paralogs: Amplicon
Sample, haplotype, reads, dosage:
Sample 43: AAT, 468 reads, dosage 2; TGG, 407 reads
Sample 48: AAT, 246 reads, dosage 1; TAT, 242 reads; TGG, 430 reads, dosage 2
29
Acknowledgements
Seeb Lab
Wes Larson, Morten Limborg, Carita Pascal, Carolyn Tarpey
Funding