Inferring Ethnicity from Mitochondrial DNA Sequence Chih Lee 1, Ion Mandoiu 1 and Craig E. Nelson 2

1 Inferring Ethnicity from Mitochondrial DNA Sequence
Chih Lee (1), Ion Mandoiu (1) and Craig E. Nelson (2)
chih.lee@uconn.edu, ion@engr.uconn.edu, craig.nelson@uconn.edu
(1) Department of Computer Science and Engineering
(2) Department of Molecular and Cell Biology
University of Connecticut

2 Outline
• Introduction
• Methods
• Results and Discussions
• Conclusions

4 Ethnicity in Forensics
• Ethnicity information assists forensic investigators.
• Investigator-assigned ethnicity is based on genetic and non-genetic markers.
• Genetic information improves inference accuracy when access to the most informative markers (e.g. skin, hair) is limited.
• Autosomal markers:
  - Excellent accuracy in assigning samples to clades [Phi07, Shr97]
  - May not survive degradation

5 Mitochondrial DNA
• Circular
• 16,569 bp
• Maternally inherited
• High copy number
• Recoverable from degraded samples
• Coding region SNPs define haplogroups [Beh07]
• Hypervariable Region

6 Hypervariable Region
• High mutation rate compared to the coding region
• Haplogroup inference [Beh07]
  - 23 groups
  - 96.7% accuracy rate with 1NN
• Geographic origin inference [Ege04]
  - SE Africa, Germany and Iceland
  - 66.8% accuracy rate with PCA-QDA
[Figure: positions 16024-16569 and 1-576 of the mtDNA sequence, labeled HVR 1 and HVR 2]

7 Ethnicity Inference from HVR
• The problem:
  - Given a set of HVR sequences tagged with ethnicities
  - Predict the ethnicities of new HVR sequences
  - A classification problem
• Our contribution:
  - Assess the performance of 4 classification algorithms: SVM, LDA, QDA and 1NN.

8 Outline
• Introduction
• Methods
• Results and Discussions
• Conclusions

9 Encoding HVR
• Align to rCRS (revised Cambridge Reference Sequence)
• SNP profile: each SNP becomes a binary variable
• Examples of profile entries:
  - 16067T: C to T substitution
  - 315.1C: insertion
  - 523D: deletion
• Missing data (regions not typed), three strategies:
  - Assume rCRS
  - Use mutation probability
  - Common region
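
The binary encoding described on this slide can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes each profile is already a set of variant strings relative to rCRS, and the helper names `build_vocabulary` and `encode` are hypothetical. A variant absent from a profile is encoded as 0, which corresponds to the "Assume rCRS" strategy for untyped regions.

```python
def build_vocabulary(profiles):
    """Collect every variant seen in the profiles, in a stable sorted order."""
    return sorted({variant for profile in profiles for variant in profile})


def encode(profile, vocabulary):
    """Map one SNP profile to a 0/1 vector: 1 if the variant is present."""
    return [1 if variant in profile else 0 for variant in vocabulary]


# Toy profiles (illustrative only, not taken from the FBI database).
profiles = [
    {"16067T", "315.1C"},
    {"315.1C", "523D"},
    set(),  # identical to rCRS over the typed region
]

vocabulary = build_vocabulary(profiles)
vectors = [encode(p, vocabulary) for p in profiles]
```

Each sample becomes one row of a binary feature matrix, with one column per variant observed in the training data.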

10 Support Vector Machines
• Binary classification algorithm
• Map instances to a high-dimensional space (the feature space)
• Optimal separating hyperplane with maximum margin
• Kernel function $k(x_1, x_2)$: similarity between $x_1$ and $x_2$ in the feature space
• Radial basis kernel: $\exp(-\gamma \lVert x_1 - x_2 \rVert^2)$
• Software: LIBSVM [Cha01]
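
As a small illustration of the radial basis kernel above, the sketch below evaluates it directly on two feature vectors. It is a plain-Python stand-in, not LIBSVM code, and the function name `rbf_kernel` and the value of `gamma` are chosen only for the example.

```python
import math


def rbf_kernel(x1, x2, gamma):
    """Radial basis kernel exp(-gamma * ||x1 - x2||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-gamma * squared_distance)


same = rbf_kernel([1, 0, 1], [1, 0, 1], gamma=0.5)       # identical vectors
different = rbf_kernel([1, 1, 0], [0, 1, 1], gamma=0.5)  # differ at 2 positions
```

For 0/1 SNP vectors the squared Euclidean distance equals the number of differing positions, so the kernel value decays with the Hamming distance between two profiles.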

11 Linear/Quadratic Discriminant Analysis
• Find $\arg\max_g P(G=g \mid X=x)$
• Assumptions:
  - $X \mid G=g \sim N_p(\mu_g, \Sigma_g)$
  - The priors $P(G=g)$ are equal for all $g$
• Hence $P(G=g \mid X=x) \propto P(X=x \mid G=g)$
• $\mu_g$ and $\Sigma_g$ are estimated from the training data
• LDA: common dispersion matrix, $\Sigma_g = \Sigma$ for all $g$
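
Under the assumptions listed on this slide (Gaussian class densities and equal priors), taking the logarithm of $P(X=x \mid G=g)$ and dropping terms that do not depend on $g$ gives the standard discriminant functions. The symbol $\delta_g$ is introduced here for illustration and does not appear in the slides.

```latex
% QDA: class-specific dispersion matrices
\delta_g^{\mathrm{QDA}}(x)
  = -\tfrac{1}{2}\log\lvert\Sigma_g\rvert
    - \tfrac{1}{2}(x-\mu_g)^{\top}\Sigma_g^{-1}(x-\mu_g)

% LDA: common dispersion matrix \Sigma_g = \Sigma, so the terms
% -\tfrac{1}{2}\log\lvert\Sigma\rvert and -\tfrac{1}{2}x^{\top}\Sigma^{-1}x
% are shared by all classes and can be dropped
\delta_g^{\mathrm{LDA}}(x)
  = x^{\top}\Sigma^{-1}\mu_g - \tfrac{1}{2}\mu_g^{\top}\Sigma^{-1}\mu_g

% Prediction in both cases
\hat{g}(x) = \arg\max_g \delta_g(x)
```

The LDA rule is linear in $x$, while the QDA rule keeps the quadratic term because $\Sigma_g$ differs between classes.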

12 1-Nearest Neighbor
• Assign a new sample to the dominating ethnicity among the nearest samples in the training data
• Distance measure: the Hamming distance
• Used by Behar et al. (2007) for haplogroup assignment
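
The rule on this slide can be sketched as below, assuming samples are already encoded as equal-length 0/1 vectors. The function names are hypothetical. When several training samples are tied at the minimum distance, the sketch takes a majority vote among them; how the authors broke remaining ties is not stated, so here the label encountered first wins.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def predict_1nn(train_vectors, train_labels, query):
    """Return the dominating label among the training samples nearest to query."""
    distances = [hamming(v, query) for v in train_vectors]
    nearest = min(distances)
    votes = Counter(
        label for d, label in zip(distances, train_labels) if d == nearest
    )
    return votes.most_common(1)[0][0]


# Toy training set (illustrative labels only).
train_vectors = [[1, 1, 0], [0, 1, 1], [0, 0, 0], [0, 1, 1]]
train_labels = ["A", "B", "C", "B"]

exact_match = predict_1nn(train_vectors, train_labels, [0, 1, 1])
majority = predict_1nn(train_vectors, train_labels, [1, 1, 1])
```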

13 Principal Component Analysis
• A dimension reduction technique
• Used in conjunction with SVM, LDA and QDA
• Denoted as: PCA-SVM, PCA-LDA and PCA-QDA

14 Outline
• Introduction
• Methods
• Results and Discussions
• Conclusions

15 The FBI mtDNA Population Database
• Two tables:
  - forensic: typed by FBI
  - published: collected from literature
• Retain only Caucasian, African, Asian and Hispanic samples

# samples           All     Caucasian       African         Asian         Hispanic
forensic dataset    4,426   1,674 (37.8%)   1,305 (29.5%)   761 (17.2%)   686 (15.5%)
published dataset   3,976   2,807 (70.6%)   254 (6.4%)      915 (23%)     0

16 Data Coverage and Subsets
• Variable sequence lengths
• Trimmed forensic dataset (4,426): 16024-16365
• Trimmed published dataset (1,904): 16024-16365
• Full-length forensic dataset (2,540): 16024-16569, 1-576
[Figure: coverage of the forensic and published samples along positions 16024-16569 and 1-576 (HVR 1, HVR 2)]

17 5-fold Cross-Validation (trimmed forensic)
• Macro-accuracy: average of the ethnicity-wise accuracy rates
• Micro-accuracy: ethnicity-wise accuracy rates weighted by number of samples
• More accurate than Egeland et al. (2004)
• Matches the accuracy of human experts relying on skull and large-bone measurements [Dib83, isc83]
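
The two accuracy measures defined on this slide can be computed as in the sketch below. The function name and the toy labels are illustrative assumptions. Weighting the per-class accuracies by class size makes micro-accuracy equal to the overall fraction of correctly classified samples.

```python
from collections import defaultdict


def macro_micro_accuracy(true_labels, predicted_labels):
    """Return (macro, micro, per_class) accuracy for paired label lists."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, predicted in zip(true_labels, predicted_labels):
        total[truth] += 1
        correct[truth] += int(truth == predicted)
    per_class = {g: correct[g] / total[g] for g in total}
    macro = sum(per_class.values()) / len(per_class)   # unweighted class mean
    micro = sum(correct.values()) / sum(total.values())  # size-weighted mean
    return macro, micro, per_class


# Toy labels (not results from the study).
truth = ["Caucasian"] * 4 + ["Asian"] * 2
predicted = ["Caucasian"] * 4 + ["Asian", "Caucasian"]

macro, micro, per_class = macro_micro_accuracy(truth, predicted)
```

With unbalanced classes the two measures diverge: here the smaller class is classified less accurately, so the macro value falls below the micro value.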

18 Seq. Region Effect on Accuracy
• Different primers result in different coverage.
• PCA-LDA outperforms 1NN on long sequences.
• PCA-SVM is consistently the best.
[Figure: accuracy (axis 80%-100%) by sequence region along positions 16024-16569 and 1-576 (HVR 1, HVR 2), full-length forensic dataset]

19 Seq. Region Effect on Accuracy
• HVR 2 contains less information.
• PCA-SVM is consistently the best.
[Figure: accuracy (axis 80%-100%) by sequence region along positions 16024-16569 and 1-576 (HVR 1, HVR 2), full-length forensic dataset]

20 Twenty 10% Windows
• Accuracy varies with region.
• PCA-SVM remains the best.
• 1NN is as good as PCA-SVM for short regions.
[Figure: accuracy for each 10% window along positions 16024-16569 and 1-576 (HVR 1, HVR 2)]

21 Independent Validation (1/2)
• Training data: trimmed forensic dataset
• Test data: trimmed published dataset
• Classifier: PCA-SVM
• There are no Hispanic samples in the test data, but samples can still be misclassified as Hispanic.
• Asian: accuracy ~17% lower than in CV

22 Independent Validation (2/2)
• Composition of the Asian samples in the training data:
  - China (356 profiles), Japan (163), Korea (182), Pakistan (8), and Thailand (52)
  - Strong bias towards East Asia
• 145 misclassified Asian samples in the test data:
  - 10 samples of unknown country of origin
  - 90 samples from Kazakhstan and Kyrgyzstan
• Both countries have a significant Russian population.
• Evidence of admixture with Caucasians.

Assigned ethnicity of the test samples from the two countries:

             # Samples   Asian        Caucasian    African    Hispanic
Kazakhstan   107         56 (52.3%)   47 (44.0%)   3 (2.8%)   1 (0.9%)
Kyrgyzstan   95          56 (58.9%)   34 (35.8%)   1 (1.1%)   4 (4.2%)

23 Handling Missing Data
• Mimic real-world scenario
• Training: forensic dataset
• Test: published dataset
• rCRS and Probability are biased toward Caucasian.
• Common Region is the best overall.

24 Posterior Probability Calibration
• PCA-SVM on published dataset with "Common Region"
• Accuracy rates are slightly higher than the estimated posterior probabilities.
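
One common way to make the comparison on this slide is a reliability table: group predictions by their estimated posterior probability and compare the mean estimate in each group with the observed accuracy. The sketch below is an assumption about how such a check could be done, not the authors' procedure. The function name, the binning scheme and the toy inputs are all hypothetical.

```python
def reliability_table(confidences, correct, n_bins=5):
    """Bin predictions by estimated posterior; report mean estimate vs accuracy.

    Each row is (bin_low, bin_high, count, mean_confidence, accuracy).
    Empty bins are skipped.
    """
    bins = [[] for _ in range(n_bins)]
    for confidence, is_correct in zip(confidences, correct):
        index = min(int(confidence * n_bins), n_bins - 1)  # 1.0 goes in last bin
        bins[index].append((confidence, is_correct))
    rows = []
    for index, members in enumerate(bins):
        if not members:
            continue
        mean_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        rows.append((index / n_bins, (index + 1) / n_bins,
                     len(members), mean_confidence, accuracy))
    return rows


# Toy inputs: estimated posterior of the predicted class, and correctness.
confidences = [0.95, 0.90, 0.85, 0.55, 0.50]
correct = [True, True, True, True, False]

rows = reliability_table(confidences, correct)
```

A row whose accuracy exceeds its mean confidence indicates an under-confident classifier in that range, which is the direction of the mismatch reported on the slide.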

25 Conclusions
• SVM is the most accurate algorithm among those investigated, outperforming:
  - Discriminant analysis, as employed by Egeland et al. (2004)
  - 1NN, similar to that used by Behar et al. (2007)
• Overall accuracy of 80%-90% in CV and independent testing:
  - Matches the accuracy of human experts relying on measurements of skull and large bones [Dib83, isc83]
  - Approaches the accuracy achieved with ~60 autosomal loci [Bam04]

26 Questions?
• Thank you for your attention.

