From Genomic Sequence Data to Genotype: A Proposed Machine Learning Approach for Genotyping Hepatitis C Virus Genaro Hernandez Jr CMSC 601 Spring 2011.

Slides:



Advertisements
Similar presentations
...visualizing classifier performance in R Tobias Sing, Ph.D. (joint work with Oliver Sander) Modeling & Simulation Novartis Pharma AG 3 rd BaselR meeting.
Advertisements

CSCE555 Bioinformatics Lecture 15 classification for microarray data Meeting: MW 4:00PM-5:15PM SWGN2A21 Instructor: Dr. Jianjun Hu Course page:
Learning Algorithm Evaluation
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Other Classification Techniques 1.Nearest Neighbor Classifiers 2.Support Vector Machines.
HCV Evolution: Diversification and Convergence Yury Khudyakov Division of Viral Hepatitis Centers for Disease Control and Prevention, Atlanta, GA.
Carolyn A. Ragland August Hepatitis C *Paraenterally transmitted Contact with infected blood / blood products Blood transfusions – no longer the.
Assessing and Comparing Classification Algorithms Introduction Resampling and Cross Validation Measuring Error Interval Estimation and Hypothesis Testing.
Classification and risk prediction
Genetic algorithms applied to multi-class prediction for the analysis of gene expressions data C.H. Ooi & Patrick Tan Presentation by Tim Hamilton.
Supervised classification performance (prediction) assessment Dr. Huiru Zheng Dr. Franscisco Azuaje School of Computing and Mathematics Faculty of Engineering.
4 th NETTAB Workshop Camerino, 5 th -7 th September 2004 Alberto Bertoni, Raffaella Folgieri, Giorgio Valentini
Evaluation of kernel function modification in text classification using SVMs Yangzhe Xiao.
Darlene Goldstein 29 January 2003 Receiver Operating Characteristic Methodology.
CISC667, F05, Lec23, Liao1 CISC 667 Intro to Bioinformatics (Fall 2005) Support Vector Machines (II) Bioinformatics Applications.
Machine Learning techniques for biomarker discovery in proteomic pattern data Elena Marchiori Department of Computer Science Vrije Universiteit Amsterdam.
Classification II (continued) Model Evaluation
Evaluating Classifiers
Processing of large document collections Part 3 (Evaluation of text classifiers, applications of text categorization) Helena Ahonen-Myka Spring 2005.
Prediction model building and feature selection with SVM in breast cancer diagnosis Cheng-Lung Huang, Hung-Chang Liao, Mu- Chen Chen Expert Systems with.
HBV genotyping 12/21/07 Carrie Marshall. Received a send-out request for HBV genotyping on a 52y man.
Division of Population Health Sciences Royal College of Surgeons in Ireland Coláiste Ríoga na Máinleá in Éirinn Indices of Performances of CPRs Nicola.
Integration II Prediction. Kernel-based data integration SVMs and the kernel “trick” Multiple-kernel learning Applications – Protein function prediction.
Estimating fitness landscapes John Pinney
GA-Based Feature Selection and Parameter Optimization for Support Vector Machine Cheng-Lung Huang, Chieh-Jen Wang Expert Systems with Applications, Volume.
“PREDICTIVE MODELING” CoSBBI, July Jennifer Hu.
Evaluating What’s Been Learned. Cross-Validation Foundation is a simple idea – “ holdout ” – holds out a certain amount for testing and uses rest for.
The Broad Institute of MIT and Harvard Classification / Prediction.
Classification Performance Evaluation. How do you know that you have a good classifier? Is a feature contributing to overall performance? Is classifier.
Wang Y 1,2, Damaraju S 1,3,4, Cass CE 1,3,4, Murray D 3,4, Fallone G 3,4, Parliament M 3,4 and Greiner R 1,2 PolyomX Program 1, Department.
BAGGING ALGORITHM, ONLINE BOOSTING AND VISION Se – Hoon Park.
CISC Machine Learning for Solving Systems Problems Presented by: Ashwani Rao Dept of Computer & Information Sciences University of Delaware Learning.
Evaluating Results of Learning Blaž Zupan
Model Evaluation l Metrics for Performance Evaluation –How to evaluate the performance of a model? l Methods for Performance Evaluation –How to obtain.
Hepatitis C Analysis of Sequence Data from ARUP and NCBI databases By Ian Odell.
1 Wrap up SCREENING TESTS. 2 Screening test The basic tool of a screening program easy to use, rapid and inexpensive. 1.2.
Classification (slides adapted from Rob Schapire) Eran Segal Weizmann Institute.
Data Mining Practical Machine Learning Tools and Techniques By I. H. Witten, E. Frank and M. A. Hall Chapter 5: Credibility: Evaluating What’s Been Learned.
Introduction Hereditary predisposition (mutations in BRCA1 and BRCA2 genes) contribute to familial breast cancers. Eighty percent of the.
Machine Learning Tutorial-2. Recall, Precision, F-measure, Accuracy Ch. 5.
Applications of HMMs in Computational Biology BMI/CS 576 Colin Dewey Fall 2010.
Chapter 5: Credibility. Introduction Performance on the training set is not a good indicator of performance on an independent set. We need to predict.
Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine 朱林娇 14S
Locally Linear Support Vector Machines Ľubor Ladický Philip H.S. Torr.
Final Report (30% final score) Bin Liu, PhD, Associate Professor.
Hybrid Classifiers for Object Classification with a Rich Background M. Osadchy, D. Keren, and B. Fadida-Specktor, ECCV 2012 Computer Vision and Video Analysis.
Evaluation of Learning Models Evgueni Smirnov. Overview Motivation Metrics for Classifier’s Evaluation Methods for Classifier’s Evaluation Comparing Data.
Classifiers!!! BCH364C/391L Systems Biology / Bioinformatics – Spring 2015 Edward Marcotte, Univ of Texas at Austin.
Computational Characterization of Short Environmental DNA Fragments Jens Stoye 1, Lutz Krause 1, Robert A. Edwards 2, Forest Rohwer 2, Naryttza N. Diaz.
Canadian Bioinformatics Workshops
High resolution product by SVM. L’Aquila experience and prospects for the validation site R. Anniballe DIET- Sapienza University of Rome.
Learning to Detect and Classify Malicious Executables in the Wild by J
GraDe-SVM: Graph-Diffused Classification for the Analysis of Somatic Mutations in Cancer Morteza H.Chalabi, Fabio Vandin Hello.
Misha L. Rajaram and Karin S. Dorman Iowa State University
Disease risk prediction
Prevalence of Hepatitis C Virus Genotypes in Bangladesh
Can-CSC-GBE: Developing Cost-sensitive Classifier with Gentleboost Ensemble for breast cancer classification using protein amino acids and imbalanced data.
Classifiers!!! BCH339N Systems Biology / Bioinformatics – Spring 2016
Classifiers!!! BCH364C/394P Systems Biology / Bioinformatics
Evaluating Results of Learning
Introduction Feature Extraction Discussions Conclusions Results
Features & Decision regions
Accurate genotyping of hepatitis C virus through nucleotide sequencing and identification of new HCV subtypes in China population  Y.-Q. Tong, B. Liu,
Volume 28, Issue 5, Pages (November 2015)
Jean-Michel Pawlotsky, MD, PhD  Clinics in Liver Disease 
Learning Algorithm Evaluation
iSRD Spam Review Detection with Imbalanced Data Distributions
Accurate genotyping of hepatitis C virus through nucleotide sequencing and identification of new HCV subtypes in China population  Y.-Q. Tong, B. Liu,
Volume 28, Issue 5, Pages (November 2015)
Evaluating Models Part 1
L. Dubourg  Clinical Microbiology and Infection 
Presentation transcript:

From Genomic Sequence Data to Genotype: A Proposed Machine Learning Approach for Genotyping Hepatitis C Virus Genaro Hernandez Jr CMSC 601 Spring 2011

Hepatitis C virus (HCV) WHO – Estimates 3% of the world population is infected with HCV – 170 million people are chronic carriers at risk of developing cirrhosis and/or liver cancer

HCV Genotyping HCV is classified into 6 genotypes – Several subtypes per genotype – Example: 1a denotes genotype 1, subtype a Genotyping methods – Common diagnostic tools for HCV genotyping Molecular assays: Genome typing assays – Examine particular regions of the HCV genome. – Example: Amplification of HCV genome followed by digestion with restriction enzymes and restriction length polymorphism analysis

Computational Problem How do we represent HCV genome sequence data and apply machine learning methods to determine genotype?

Significance of Genotyping 1.Indicates the severity of infection 2.Informs clinical decision making — Different genotypes require particular treatments

Proposed Approach: Classifier GCCAGCCCCCGATTGGGGGCGACAC TCCACCATAGATCACTCCCCTGTGAG GAACTATTGTCTTCACGCAGAAAGCG TCTAGCCATGGCGTTAGTATGAGTGTC GTGCAACCTCCAGGACCCCCCCTCCC GGGAGAGCCATAGTGGTCTGCGGA String kernel + SVM (One-against-all) Genotype One-against-all (examples) C1C2C3C4C5C C1C2C3C4C5C

Preliminary data: Data collection

Related Work Research group: Qiu et al. (2009) Methodology: Position weight matrix, SVMs and random forest Results: Achieved 99% prediction accuracy. Strengths: High prediction accuracy Weaknesses: Several pre-processing steps (e.g. sequence alignment, creation of position weight matrices, concatenation of matrices) Research group: Hraber et al. (2008) Methodology: Branching index to combine distance and phylogeny methods Results: Characterization of ‘problematic’ sequences Strengths: Characterization of problematic sequences Weaknesses: Several pre-processing steps (e.g. creation of phylogenetic trees)

Evaluation of Results K-fold cross validation  E.g. 10-fold cross validation  To estimate the predictive ability or unknown samples Overall accuracy  Dataset not used for training  overall accuracy = (TP + TN)/(TP + TN + FP + FN) - TP = true positives - TN = true negatives - FP = false positives - FN = false negatives

Conclusion Problem How to represent HCV sequence data, apply machine learning to determine genotype? Significance Genotype indicates severity of infection, informs clinical decisions in treating infection. Proposed methodology String kernels to represent viral sequence data, SVMs for classification of the virus with one-against-all Related work 1.Qiu et al. (2009), position weight matrix, SVMs and random forest, achieved 99% prediction accuracy. 2.Hraber et al. (2008), branching index to combine distance and phylogeny methods Evaluation K-fold cross validation, overall prediction accuracy.