Inferring Ethnicity from Mitochondrial DNA Sequence Chih Lee 1, Ion Mandoiu 1 and Craig E. Nelson 2 1 Department of Computer Science and Engineering 2 Department of Molecular and Cell Biology University of Connecticut

Outline
 Introduction
 Methods
 Results and Discussions
 Conclusions

Outline
 Introduction
 Methods
 Results and Discussions
 Conclusions

Ethnicity in Forensics
 Ethnicity information assists forensic investigators.
 Investigator-assigned ethnicity is based on both genetic and non-genetic markers.
 Genetic information improves inference accuracy when access to the most informative markers (e.g. skin and hair) is limited.
 Autosomal markers:
   - Excellent accuracy assigning samples to clades [Phi07, Shr97]
   - May not survive degradation

Mitochondrial DNA
 Circular genome, 16,569 bp
 Maternally inherited
 High copy number
 Recoverable from degraded samples
 Coding-region SNPs define haplogroups [Beh07]
 Hypervariable regions

Hypervariable Region
 Higher mutation rate than the coding region
 Haplogroup inference [Beh07]: 23 groups; 96.7% accuracy rate with 1NN
 Geographic origin inference [Ege04]: SE Africa, Germany, and Iceland; 66.8% accuracy rate with PCA-QDA
(Diagram: locations of HVR 1 and HVR 2)

Ethnicity Inference from HVR
 The problem:
   - Given a set of HVR sequences tagged with ethnicities
   - Predict the ethnicities of new HVR sequences
   - A classification problem
 Our contribution: assess the performance of 4 classification algorithms: SVM, LDA, QDA and 1NN.

Outline
 Introduction
 Methods
 Results and Discussions
 Conclusions

Encoding HVR
 Align to rCRS (the revised Cambridge Reference Sequence)
 SNP profile: each SNP becomes a binary variable
 Missing data (regions not typed):
   - Assume rCRS
   - Use mutation probability
   - Common region
 Example variants: a T/C substitution at 16067, the insertion 315.1C, and the deletion 523D
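The binary SNP-profile encoding above can be sketched in a few lines of Python. The variant labels and helper names here are illustrative, not taken from the authors' pipeline:

```python
# Sketch: encode HVR sequences as binary SNP-profile vectors.
# Each variant observed anywhere in the dataset (e.g. "16067C", "315.1C",
# "523D") becomes one 0/1 feature; a sample gets 1 if it carries that variant.

def build_feature_index(profiles):
    """Map every observed variant to a column index."""
    variants = sorted({v for p in profiles for v in p})
    return {v: i for i, v in enumerate(variants)}

def encode(profile, index):
    """Binary vector: 1 where the sample differs from rCRS."""
    vec = [0] * len(index)
    for v in profile:
        vec[index[v]] = 1
    return vec

# Toy profiles: sets of variants relative to rCRS.
profiles = [{"16067C", "315.1C"}, {"523D"}, {"16067C"}]
idx = build_feature_index(profiles)
X = [encode(p, idx) for p in profiles]
```

Under the "assume rCRS" strategy for missing data, untyped positions simply stay 0, i.e. identical to the reference.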

Support Vector Machines
 Binary classification algorithm
 Maps instances to a high-dimensional space (the feature space)
 Finds the optimal separating hyperplane with maximum margin
 Kernel function k(x1, x2): the similarity between x1 and x2 in the feature space
 Radial basis kernel: exp(-γ||x1 - x2||^2)
 Software: LIBSVM [Cha01]
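The radial basis kernel on this slide is straightforward to compute directly. A small sketch on binary SNP vectors (the γ value is an arbitrary illustrative choice):

```python
import numpy as np

def rbf_kernel(x1, x2, gamma=0.5):
    """Radial basis kernel: exp(-gamma * ||x1 - x2||^2)."""
    diff = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return np.exp(-gamma * (diff @ diff))

# Identical SNP profiles have similarity 1; similarity decays as the
# profiles differ at more positions.
a = np.array([1, 0, 1, 0])
b = np.array([1, 1, 1, 0])
print(rbf_kernel(a, a))  # 1.0
```

LIBSVM (and wrappers around it) accepts this kernel via its `gamma` parameter; the kernel is symmetric, which the test below also checks.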

Linear/Quadratic Discriminant Analysis
 Find argmax_g P(G = g | X = x)
 Assumptions:
   - X | G = g ~ N_p(μ_g, Σ_g)
   - P(G = g) is the same for all g
 P(G = g | X = x) is then proportional to P(X = x | G = g)
 μ_g and Σ_g are estimated from the training data
 LDA: common dispersion matrix, Σ_g = Σ for all g
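A minimal sketch of the discriminant rule above: estimate μ_g and Σ_g from training data, then pick argmax_g of the Gaussian log-likelihood under equal priors (the QDA case, with per-class covariances). This is an illustration on toy data, not the authors' implementation:

```python
import numpy as np

def fit_qda(X, y):
    """Estimate per-class mean and covariance from training data."""
    return {g: (X[y == g].mean(axis=0), np.cov(X[y == g], rowvar=False))
            for g in np.unique(y)}

def predict_qda(x, params):
    """argmax_g log P(X=x | G=g), assuming equal priors P(G=g)."""
    def log_lik(mu, cov):
        d = x - mu
        return -0.5 * (np.log(np.linalg.det(cov)) + d @ np.linalg.solve(cov, d))
    return max(params, key=lambda g: log_lik(*params[g]))

# Toy 2-D data: two well-separated classes.
X = np.array([[0., 0], [1, 0], [0, 1], [1, 1],
              [5, 5], [6, 5], [5, 6], [6, 6]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
params = fit_qda(X, y)
```

Forcing a single pooled covariance matrix in `fit_qda` would turn this into LDA, whose decision boundaries are linear rather than quadratic.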

1-Nearest Neighbor
 Assign a new sample the dominating ethnicity among the nearest samples in the training data
 Distance measure: the Hamming distance
 Used by Behar et al. (2007) for haplogroup assignment
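A minimal 1NN classifier over binary SNP profiles with Hamming distance might look like this (toy data and hypothetical labels, for illustration only):

```python
def hamming(a, b):
    """Number of positions at which two SNP profiles differ."""
    return sum(x != y for x, y in zip(a, b))

def predict_1nn(x, train_X, train_y):
    """Label of the training sample closest to x in Hamming distance."""
    dists = [hamming(x, t) for t in train_X]
    return train_y[dists.index(min(dists))]

# Toy training data: three binary SNP profiles with ethnicity tags.
train_X = [[1, 0, 1], [0, 1, 0], [1, 1, 1]]
train_y = ["African", "Asian", "Caucasian"]
print(predict_1nn([1, 0, 0], train_X, train_y))
```

With k = 1 the "dominating ethnicity" is simply the label of the single nearest neighbor; ties on distance are broken here by training-set order.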

Principal Component Analysis
 A dimensionality-reduction technique
 Used in conjunction with SVM, LDA and QDA
 Denoted PCA-SVM, PCA-LDA and PCA-QDA
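A sketch of PCA via the SVD of the centered data matrix, as it might be applied to the binary SNP profiles before SVM/LDA/QDA (the data and the number of retained components are arbitrary illustrative choices):

```python
import numpy as np

def pca_transform(X, n_components):
    """Project mean-centered data onto its top principal components."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by variance explained.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

X = np.array([[1., 0, 1, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 1, 0, 1]])
Z = pca_transform(X, 2)  # 4 samples reduced from 4 features to 2
```

The combined models are then simply the downstream classifier fit on `Z` instead of `X`.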

Outline
 Introduction
 Methods
 Results and Discussions
 Conclusions

The FBI mtDNA Population Database
 Two tables:
   - forensic: typed by the FBI
   - published: collected from the literature
 Retain only Caucasian, African, Asian and Hispanic samples

              # samples   Caucasian       African         Asian         Hispanic
  forensic    4,426       1,674 (37.8%)   1,305 (29.5%)   761 (17.2%)   686 (15.5%)
  published   3,976       2,807 (70.6%)   254 (6.4%)      915 (23.0%)   0

Data Coverage and Subsets
 Variable sequence lengths
 Trimmed forensic dataset (4,426 samples)
 Trimmed published dataset (1,904 samples)
 Full-length forensic dataset (2,540 samples)
(Chart: sequence coverage over HVR 1 and HVR 2 for the forensic and published datasets)

5-fold Cross-Validation (trimmed forensic)
 Macro-accuracy: average of the ethnicity-wise accuracy rates
 Micro-accuracy: overall accuracy, weighted by the number of samples per ethnicity
 More accurate than Egeland et al. (2004)
 Matches human experts relying on measurements of the skull and large bones [Dib83, Isc83]
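The two accuracy measures can be made concrete with a short sketch (toy labels, not the paper's evaluation code):

```python
def macro_micro_accuracy(y_true, y_pred):
    """Macro: mean of per-class accuracies. Micro: overall fraction correct."""
    per_class = []
    for c in set(y_true):
        idx = [i for i, y in enumerate(y_true) if y == c]
        per_class.append(sum(y_pred[i] == c for i in idx) / len(idx))
    macro = sum(per_class) / len(per_class)
    micro = sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)
    return macro, micro

# Hypothetical labels: class "A" is three times as frequent as class "B",
# so the two measures disagree.
macro, micro = macro_micro_accuracy(["A", "A", "A", "B"],
                                    ["A", "A", "B", "B"])
```

Macro-accuracy weights each ethnicity equally, so it is the fairer measure when class sizes are as imbalanced as in the FBI database.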

Seq. Region Effect on Accuracy
 Different primers result in different coverage.
 PCA-LDA outperforms 1NN on long sequences.
 PCA-SVM is consistently the best.
(Chart: accuracy at 100%, 90% and 80% coverage of HVR 1 and HVR 2; full-length forensic dataset)

Seq. Region Effect on Accuracy (cont.)
 HVR 2 contains less information.
 PCA-SVM is consistently the best.
(Chart: accuracy at 100%, 90% and 80% coverage of HVR 1 and HVR 2; full-length forensic dataset)

Twenty 10% Windows
 Accuracy varies with region.
 PCA-SVM remains the best.
 1NN is as good as PCA-SVM for short regions.
(Chart: accuracy across twenty 10% windows of HVR 1 and HVR 2)

Independent Validation (1/2)
 Training data: trimmed forensic dataset
 Test data: trimmed published dataset
 Classifier: PCA-SVM
 No Hispanic samples in the test data, but samples can still be mis-classified as Hispanic
 Asian accuracy: ~17% lower than in cross-validation

Independent Validation (2/2)
 Composition of the Asian samples in the training data: China (356 profiles), Japan (163), Korea (182), Pakistan (8) and Thailand (52); a strong bias towards East Asia
 Of the 145 mis-classified Asian samples in the test data:
   - 10 samples of unknown country of origin
   - 90 samples from Kazakhstan and Kyrgyzstan
 Both countries have significant Russian populations: evidence of admixture with Caucasians.

              # Samples   Asian        Caucasian    African    Hispanic
  Kazakhstan  107         56 (52.3%)   47 (44.0%)   3 (2.8%)   1 (0.9%)
  Kyrgyzstan  95          56 (58.9%)   34 (35.8%)   1 (1.1%)   4 (4.2%)

Handling Missing Data
 Mimics the real-world scenario: training on the forensic dataset, testing on the published dataset
 The rCRS and mutation-probability strategies are biased toward Caucasian.
 The common-region strategy is the best overall.

Posterior Probability Calibration
 PCA-SVM on the published dataset with the "common region" strategy
 Accuracy rates are slightly higher than the estimated posterior probabilities.

Conclusions
 SVM is the most accurate algorithm among those investigated, outperforming:
   - discriminant analysis, as employed by Egeland et al. (2004)
   - 1NN, similar to that used by Behar et al. (2007)
 Overall accuracy of 80%-90% in cross-validation and independent testing:
   - matches the accuracy of human experts relying on measurements of the skull and large bones [Dib83, Isc83]
   - approaches the accuracy achieved using ~60 autosomal loci [Bam04]

Questions?  Thank you for your attention.