A Perspective on the Data
Ajit Paul Singh
M.Sc. Candidate
Dept. of Computing Science
University of Alberta

Machine Learning
Systems that use experience to improve at a given task.
Data as experience
Supervised vs. Unsupervised Learning
SNP focus: supervised learning

The Running Example
ID  Weight  Color   Dental  Class
1   2123    grey    good    Happy
2   4321    green   good    Sick
3   3321    grey    ok      Happy
4   3499    purple  v.good  Hungry, Hungry
5   2803    green   v.good  Sick
6   2599    grey    bad     Happy
7   4402    grey    ok      Happy
8   4506    green   bad     Sick

Data Assumptions
Samples are independent and identically distributed (IID)
Dealing with patients/tuples
–One set: complex distribution, more training data
–Split into subsets: many simpler distributions, less training data per problem

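The trade-off between one pooled set and several subsets can be made concrete with the running example. The sketch below is only an illustration: it assumes the eight rows of the running example and splits on Color, an arbitrary choice that the slides do not prescribe, then counts how many training rows each sub-problem is left with.

```python
from collections import defaultdict

# Rows of the running example: (id, weight, color, dental, class).
ROWS = [
    (1, 2123, "grey", "good", "Happy"),
    (2, 4321, "green", "good", "Sick"),
    (3, 3321, "grey", "ok", "Happy"),
    (4, 3499, "purple", "v.good", "Hungry, Hungry"),
    (5, 2803, "green", "v.good", "Sick"),
    (6, 2599, "grey", "bad", "Happy"),
    (7, 4402, "grey", "ok", "Happy"),
    (8, 4506, "green", "bad", "Sick"),
]


def split_by(rows, column):
    """Group rows into subsets keyed by the value at the given column index."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[column]].append(row)
    return dict(subsets)


# Splitting on Color (index 2) is an illustrative assumption, not from the slides.
subsets = split_by(ROWS, 2)
sizes = {color: len(rows) for color, rows in subsets.items()}

print("one set:", len(ROWS), "rows")
print("per subset:", sizes)
```

Each subset may follow a simpler distribution, but a learner trained on the purple subset would see a single row, which is the cost named on the slide.
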
Defining the Task
Predictive
–Diagnosing members of the public
    Rare class issue
–Diagnosing clinic referrals
    Is the training set representative of patients that will be tested?
–Subtyping cancer patients
Feature Selection
–Find interesting SNPs for further study

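The rare class issue can be illustrated with a small calculation. The sketch below assumes a hypothetical screening population in which 10 of 1000 people are sick (a made-up prevalence, not a figure from the slides) and evaluates a predictor that always answers "healthy".

```python
def accuracy_and_recall(labels, predictions, positive):
    """Return (accuracy, recall on the positive class) for paired label lists."""
    pairs = list(zip(labels, predictions))
    correct = sum(1 for y, p in pairs if y == p)
    true_pos = sum(1 for y, p in pairs if y == positive and p == positive)
    actual_pos = sum(1 for y in labels if y == positive)
    recall = true_pos / actual_pos if actual_pos else 0.0
    return correct / len(labels), recall


# Hypothetical population: 10 sick, 990 healthy (assumed for illustration).
labels = ["sick"] * 10 + ["healthy"] * 990
# A trivial predictor that never reports the rare class.
predictions = ["healthy"] * len(labels)

acc, rec = accuracy_and_recall(labels, predictions, positive="sick")
print("accuracy:", acc, "recall on rare class:", rec)
```

The trivial predictor is right on almost every sample yet never finds a sick patient, so accuracy alone is a poor measure when the class of interest is rare.
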
Measuring Improvement
Competitors
–Human experts using clinical data
–Diagnostic tests (e.g. BRCA1 truncations)
–Other learners using genetic markers
Benefits of Polyomx
–Accuracy, Cost, Speed
Need for a baseline to compare against

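One simple baseline, offered here as an assumption because the slides do not name a specific one, is to always predict the most common class. The sketch below computes it on the class column of the running example. The figure is measured on the same eight rows used to pick the majority class, so it is a reference point rather than an estimate of accuracy on new patients.

```python
from collections import Counter

# Class column of the running example, in row order.
CLASSES = [
    "Happy", "Sick", "Happy", "Hungry, Hungry",
    "Sick", "Happy", "Happy", "Sick",
]


def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


baseline_label, baseline_accuracy = majority_baseline(CLASSES)
print(baseline_label, baseline_accuracy)
```

A learner, expert, or diagnostic test is only interesting to the extent that it beats this number.
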
Issues to Consider
Missing data
Negative control features

Types of Missing Data
Missing Completely At Random (MCAR)
Missing At Random (MAR)
Censored

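The difference between the first two types can be sketched by simulation. Under MCAR the chance that a value is missing is the same for every sample; under MAR it may depend on other values that were observed. The code below assumes the Weight and Color columns of the running example; the drop probabilities and the choice of Color as the observed driver are hypothetical and chosen only for illustration.

```python
import random

# (weight, color) pairs from the running example.
ROWS = [
    (2123, "grey"), (4321, "green"), (3321, "grey"), (3499, "purple"),
    (2803, "green"), (2599, "grey"), (4402, "grey"), (4506, "green"),
]


def mask_mcar(rows, p, rng):
    """MCAR: every Weight is dropped with the same probability p."""
    return [(None if rng.random() < p else w, c) for w, c in rows]


def mask_mar(rows, p_by_color, default_p, rng):
    """MAR: the chance of dropping Weight depends only on the observed Color."""
    return [
        (None if rng.random() < p_by_color.get(c, default_p) else w, c)
        for w, c in rows
    ]


rng = random.Random(0)  # fixed seed so repeated runs give the same masks
mcar_rows = mask_mcar(ROWS, 0.25, rng)                # assumed drop rate
mar_rows = mask_mar(ROWS, {"green": 0.75}, 0.1, rng)  # assumed drop rates

print(mcar_rows)
print(mar_rows)
```

Censoring is a different situation and is not simulated here: the value is known only to lie beyond some bound rather than being absent altogether.
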
Negative Control Features
SNPs were hand selected
Feature selection problem
–Measuring relevance of selected features
Prediction problem
–Ensuring the learner is robust
Add negative control features
–Features that are probably irrelevant
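
One way to build a feature that is probably irrelevant is to append a shuffled copy of an existing column: it keeps the same set of values but loses its row-by-row link to the class. This is a sketch under that assumption; the slides do not say how the control features were constructed, and the helper name `add_permuted_control` is hypothetical.

```python
import random

# Running example without the ID column: (weight, color, dental, class).
ROWS = [
    (2123, "grey", "good", "Happy"),
    (4321, "green", "good", "Sick"),
    (3321, "grey", "ok", "Happy"),
    (3499, "purple", "v.good", "Hungry, Hungry"),
    (2803, "green", "v.good", "Sick"),
    (2599, "grey", "bad", "Happy"),
    (4402, "grey", "ok", "Happy"),
    (4506, "green", "bad", "Sick"),
]


def add_permuted_control(rows, column, rng):
    """Append a shuffled copy of one column to every row as a negative control."""
    values = [row[column] for row in rows]
    rng.shuffle(values)  # breaks the pairing between value and row
    return [row + (value,) for row, value in zip(rows, values)]


# Shuffle Weight (index 0); the seed is fixed only for repeatability.
augmented = add_permuted_control(ROWS, 0, random.Random(0))
for row in augmented:
    print(row)
```

If a feature selection method ranks the control column alongside the hand-selected features, or a learner's predictions change noticeably when it is added, that is a warning sign about relevance or robustness.
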