A Perspective on the Data Ajit Paul Singh M.Sc. Candidate Dept. of Computing Science University of Alberta.

Presentation transcript:

Machine Learning
Systems that use experience to improve at a given task.
• Data as experience
• Supervised vs. Unsupervised Learning
SNP focus: supervised learning

The Running Example

ID  Weight  Color   Dental  Class
1   2123    grey    good    Happy
2   4321    green   good    Sick
3   3321    grey    ok      Happy
4   3499    purple  v.good  Hungry, Hungry
5   2803    green   v.good  Sick
6   2599    grey    bad     Happy
7   4402    grey    ok      Happy
8   4506    green   bad     Sick

Data Assumptions
Samples are independent and identically distributed (IID)
Dealing with patients/tuples
– One set → complex distribution → more training data
– Split into subsets → many simpler distributions → less training data per problem
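The trade-off on this slide can be made concrete with the running example. The sketch below is only an illustration: splitting on the Color attribute is an assumed choice, and `subset_sizes` is a hypothetical helper name, not something from the slides. It shows how quickly the per-subset training data shrinks once the single set is split.

```python
from collections import Counter

# Running example from the slides: (id, weight, color, dental, label).
ROWS = [
    (1, 2123, "grey", "good", "Happy"),
    (2, 4321, "green", "good", "Sick"),
    (3, 3321, "grey", "ok", "Happy"),
    (4, 3499, "purple", "v.good", "Hungry, Hungry"),
    (5, 2803, "green", "v.good", "Sick"),
    (6, 2599, "grey", "bad", "Happy"),
    (7, 4402, "grey", "ok", "Happy"),
    (8, 4506, "green", "bad", "Sick"),
]

COLOR = 2  # index of the Color attribute in each row (assumed split attribute)


def subset_sizes(rows, column):
    """Count the training samples that fall into each subset when splitting on one column."""
    return Counter(row[column] for row in rows)


sizes = subset_sizes(ROWS, COLOR)
for value, count in sorted(sizes.items()):
    print(f"{value}: {count} of {len(ROWS)} samples")
```

Each subset may follow a simpler distribution, but a learner trained on the purple subset would see a single sample.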

Defining the Task
Predictive
– Diagnosing members of the public
  Rare class issue
– Diagnosing clinic referrals
  Is the training set representative of patients that will be tested?
– Subtyping cancer patients
Feature Selection
– Find interesting SNPs for further study
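The rare class issue can be illustrated with a small calculation. The numbers below are hypothetical (10 affected people in a population of 1000), chosen only to show why raw accuracy is misleading when screening members of the public: a classifier that never predicts the rare class still scores highly.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of truly positive samples that were predicted positive."""
    predictions_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in predictions_for_positives) / len(predictions_for_positives)


# Hypothetical screening population: 10 affected people out of 1000.
y_true = ["sick"] * 10 + ["healthy"] * 990

# A trivial classifier that always predicts the common class.
y_pred = ["healthy"] * len(y_true)

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred, "sick")
print(f"accuracy = {acc:.2f}, recall on the rare class = {rec:.2f}")
```

Reporting recall (or a similar per-class measure) alongside accuracy exposes the failure that accuracy alone hides.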

Measuring Improvement
Competitors
– Human experts using clinical data
– Diagnostic tests (e.g. BRCA1 truncations)
– Other learners using genetic markers
Benefits of PolyomX
– Accuracy, Cost, Speed
Need for a baseline to compare against
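One simple baseline, offered here as an assumption rather than the one used in the talk, is the majority-class predictor: always output the most common training label. A minimal sketch on the class labels of the running example follows; `majority_baseline` is a hypothetical helper name.

```python
from collections import Counter

# Class labels of the eight samples in the running example.
LABELS = ["Happy", "Sick", "Happy", "Hungry, Hungry", "Sick", "Happy", "Happy", "Sick"]


def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it.

    The accuracy is measured on the same labels it was derived from, so it is
    a resubstitution estimate, not a held-out one.
    """
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


majority_label, baseline_accuracy = majority_baseline(LABELS)
print(f"always predict {majority_label!r}: accuracy {baseline_accuracy:.2f}")
```

Any learner, human expert, or diagnostic test under comparison should at least beat this number before its accuracy is taken as meaningful.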

Issues to Consider
– Missing data
– Negative control features

Types of Missing Data
– Missing Completely At Random (MCAR)
– Missing At Random (MAR)
– Censored
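The three mechanisms can be simulated on the running example to see how they differ. This is a sketch under the usual definitions: MCAR means missingness is unrelated to any data value, MAR means missingness depends only on other observed attributes, and a censored value is known only to lie beyond a limit. The function names, the rates, and the 4000 weight limit are all hypothetical choices for illustration.

```python
import random

# Attributes of the running example (class labels omitted).
ROWS = [
    {"weight": 2123, "color": "grey", "dental": "good"},
    {"weight": 4321, "color": "green", "dental": "good"},
    {"weight": 3321, "color": "grey", "dental": "ok"},
    {"weight": 3499, "color": "purple", "dental": "v.good"},
    {"weight": 2803, "color": "green", "dental": "v.good"},
    {"weight": 2599, "color": "grey", "dental": "bad"},
    {"weight": 4402, "color": "grey", "dental": "ok"},
    {"weight": 4506, "color": "green", "dental": "bad"},
]


def make_mcar(rows, column, rate, rng):
    """MCAR: every value is dropped with the same probability, regardless of the data."""
    return [{**r, column: None if rng.random() < rate else r[column]} for r in rows]


def make_mar(rows, column, depends_on, rates, rng):
    """MAR: the drop probability depends only on another, fully observed attribute."""
    return [
        {**r, column: None if rng.random() < rates.get(r[depends_on], 0.0) else r[column]}
        for r in rows
    ]


def censor_above(rows, column, limit):
    """Right-censoring: values above the limit are recorded as the limit, plus a flag."""
    return [
        {**r, column: min(r[column], limit), column + "_censored": r[column] > limit}
        for r in rows
    ]


rng = random.Random(0)  # fixed seed so repeated runs give the same pattern
mcar = make_mcar(ROWS, "dental", 0.25, rng)
mar = make_mar(ROWS, "dental", "color", {"green": 0.75}, rng)
censored = censor_above(ROWS, "weight", 4000)
print(sum(r["weight_censored"] for r in censored), "weights censored at 4000")
```

Under MCAR the remaining values are still a fair sample of the attribute; under MAR they are not, unless the analysis conditions on the attribute that drives the missingness (Color in this sketch).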

Negative Control Features
SNPs were hand selected
Feature selection problem
– Measuring relevance of selected features
Prediction problem
– Ensuring the learner is robust
Add negative control features
– Features that are probably irrelevant
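One way to build features that are probably irrelevant is to shuffle a real feature, which keeps its value distribution but breaks any link to the class. The sketch below assumes that construction and uses a simple one-rule score (predict the majority class within each feature value) as the relevance measure. Both choices are illustrative assumptions, not the method described in the talk, and `one_rule_score` and `shuffled_control` are hypothetical names.

```python
import random
from collections import Counter, defaultdict

# Two categorical features and the class labels from the running example.
COLOR = ["grey", "green", "grey", "purple", "green", "grey", "grey", "green"]
DENTAL = ["good", "good", "ok", "v.good", "v.good", "bad", "ok", "bad"]
LABELS = ["Happy", "Sick", "Happy", "Hungry, Hungry", "Sick", "Happy", "Happy", "Sick"]


def one_rule_score(feature, labels):
    """Fraction of samples classified correctly when predicting, for each
    feature value, the majority label among samples with that value."""
    by_value = defaultdict(Counter)
    for value, label in zip(feature, labels):
        by_value[value][label] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in by_value.values())
    return correct / len(labels)


def shuffled_control(feature, rng):
    """Negative control: a shuffled copy of a real feature."""
    control = list(feature)
    rng.shuffle(control)
    return control


rng = random.Random(0)  # fixed seed for repeatability
controls = [shuffled_control(COLOR, rng) for _ in range(100)]
control_scores = [one_rule_score(c, LABELS) for c in controls]

color_score = one_rule_score(COLOR, LABELS)
dental_score = one_rule_score(DENTAL, LABELS)

for name, score in [("Color", color_score), ("Dental", dental_score)]:
    # Share of controls that look at least as relevant as the real feature.
    beaten_by = sum(s >= score for s in control_scores) / len(control_scores)
    print(f"{name}: score {score:.3f}, matched by {beaten_by:.0%} of controls")
```

A selected feature that many controls match or beat has not shown relevance beyond chance. With only eight samples even random controls can score well, which is itself a useful warning about the robustness of the learner.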