Choosing Inputs for Machine Learning

Choosing Inputs for Machine Learning Villanova University Machine Learning Project

The Inputs
We defined learning as changes in behavior based on experience. The nature of that experience is critical to the success of a learning system. In machine learning, that means we need to give careful attention to the examples we give the system to learn from.
There can be some interesting discussion here of when people learn the wrong things. Phobias can be looked at as learning gone awry, for instance. Punishment may teach a child to avoid the behavior that led to the punishment, or to avoid the person who administered it.
Villanova University Machine Learning Project Inputs to Machine Learning

Supervised Learning
- Provide the system with example training data and the desired result for those data.
- Each example, or training case, consists of a set of variables or features describing one case, including the decision that should be made.
- The system builds a model from the examples and uses the model to make a decision, the outcome.
- The critic compares the actual outcome with the desired outcome.
- The learner tweaks the model to make the actual and desired outcomes more similar.
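
The model, critic, and learner roles on this slide can be sketched as a tiny training loop. This is only an illustration, not the method used in the course: it assumes a simple perceptron-style model, and the house-sale data, the function names, and the learning rate are all hypothetical.

```python
# Hypothetical toy data: (price in $100k, age in decades) -> 1 if the house sold, else 0.
EXAMPLES = [((1.0, 1.0), 1), ((1.5, 2.0), 1), ((4.0, 5.0), 0), ((5.0, 4.0), 0)]


def predict(weights, bias, features):
    """Model: a linear threshold unit that outputs the decision (0 or 1)."""
    total = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if total > 0 else 0


def critic(actual, desired):
    """Critic: signed difference between the desired and the actual outcome."""
    return desired - actual


def train(examples, epochs=20, rate=1.0):
    """Learner: nudge the model whenever the critic reports a mismatch."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, desired in examples:
            error = critic(predict(weights, bias, features), desired)
            weights = [w + rate * error * x for w, x in zip(weights, features)]
            bias += rate * error
    return weights, bias


if __name__ == "__main__":
    weights, bias = train(EXAMPLES)
    for features, desired in EXAMPLES:
        print(features, "desired:", desired, "actual:", predict(weights, bias, features))
```

The loop only ever sees the training cases it is handed, which is why the choice of examples and features matters so much.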

Supervised Learning, continued
- The most commonly used machine learning methods are based on supervised learning.
- The success of a supervised learning system depends very heavily on the examples it is given. They must be typical or representative.
- It also depends on the data or features provided, the feature space. The feature space must reflect the domain or field.

Representative Examples
- The goal of our machine learning system is to act correctly over some set of inputs or cases. This is the population or corpus it will be applied to.
- The machine learning examples must accurately reflect the field or domain that we want to learn.
- The examples must be typical of the population for which we will eventually make decisions.

Representative Sample Examples
Spam or non-spam:
- Bad: random sample of email from the last week
- Better: random sample of email from the last year
- Bad: email received at an individual Gmail account
Need some good and bad examples from various fields here...
Individual Gmail is bad because Google has already screened out almost all spam, so you're getting a very non-representative sample. One week is bad because spam topics tend to go in batches over time, and this is too short a time period; you will miss some of the common topics.

Typical Mistakes in Choosing Examples
- The "convenience" sample: using the examples that are easy to get, whether or not they reflect the cases we will make decisions on later.
- The positive examples: using only examples with one of the two possible outcomes; only loans made, for instance, not those denied.
- The unbalanced examples: choosing different proportions of positive and negative examples than the population has. For instance, about 7% of mammograms show some abnormality. If you wanted to train a system to recognize abnormal mammogram films, you should NOT use 50% normal and 50% abnormal films as examples.
The problem with all of these is that the performance of the system on actual cases will not be as good as its performance on the sample.
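
One way to avoid the unbalanced-examples mistake is to draw the training set so that its share of positive cases matches the population rate. The sketch below assumes the positive and negative cases are already available as two separate lists; the function name, the film identifiers, and the pool sizes are hypothetical, and only the 7% rate comes from the slide.

```python
import random


def proportional_sample(positives, negatives, size, positive_rate, seed=0):
    """Draw `size` cases whose share of positives matches `positive_rate`."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n_pos = round(size * positive_rate)
    n_neg = size - n_pos
    return rng.sample(positives, n_pos) + rng.sample(negatives, n_neg)


if __name__ == "__main__":
    # Hypothetical pools of labeled films: (film id, label).
    abnormal = [(f"film-a{i}", "abnormal") for i in range(500)]
    normal = [(f"film-n{i}", "normal") for i in range(5000)]

    sample = proportional_sample(abnormal, normal, size=1000, positive_rate=0.07)
    share = sum(1 for _, label in sample if label == "abnormal") / len(sample)
    print(f"abnormal share in training sample: {share:.2%}")
```

Note that `random.sample` raises a `ValueError` if a pool is smaller than the number of cases requested from it, which is a useful signal that the available data cannot support a sample of that size at that rate.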

Are these good samples?

Feature Spaces
Which features to include in the examples is a major question in developing a supervised learning system:
- They should be relevant to the decision to be made.
- They should be (mostly) observable for every example.
- They should be, as much as possible, independent of one another.

Relevant Examples
We want our system to look at some features and some decision, and find the patterns which led to the decision. This will only work if the features we give the system are in fact related to the decision being made.
Example: to decide whether a house will sell
- Probably relevant: price, square footage, age
- Probably irrelevant: name of the owner, day of the week
Most supervised systems will accept a large number of features and successfully identify the relevant ones, but if the most relevant ones aren't included the system cannot perform well.

Which Are Relevant 1
To decide whether an illness is influenza:
1. presence of fever
2. last name of patient
3. color of shoes
4. presence of cough
5. date
1 and 4 definitely. 2 and 3 no. 5 can be argued either way. In discussion, it's almost impossible to completely rule something out as irrelevant; students can be adept at coming up with a way in which almost anything could be relevant. That's fine; it's still contributing to their understanding of the concept.

Which Are Relevant 2
To decide the gender of a skeleton:
1. shape of pelvis
2. gender of the examiner
3. length of femur
4. date
5. number of ribs
6. position of bones
1. Yes. 2. No. 3. Yes, although age and race affect it considerably. 4. No. 5. No. Eve notwithstanding, males and females have the same number of ribs. 6. No.

Unsupervised Learning
- In an unsupervised learning application, we do not give the system any a priori decisions.
- The task instead is to find similarities among the examples given and group them.
- The critic is some measure of how similar the cases within a group are compared with cases in a different group.
- The data we provide define the kind of similarities and groupings we will find.
- It is still important to have representative examples.
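
A critic of this kind can be sketched as a ratio of between-group distance to within-group distance. This is one illustrative choice among many, not a measure named in the slides: it assumes the cases are numeric points compared by Euclidean distance, that every group has at least two distinct cases, and the function names and points are hypothetical.

```python
from itertools import combinations, product
from math import dist  # Euclidean distance, available in Python 3.8+


def mean_within(groups):
    """Average distance between pairs of cases in the same group."""
    distances = [dist(a, b) for g in groups for a, b in combinations(g, 2)]
    return sum(distances) / len(distances)


def mean_between(groups):
    """Average distance between pairs of cases in different groups."""
    distances = [
        dist(a, b)
        for g1, g2 in combinations(groups, 2)
        for a, b in product(g1, g2)
    ]
    return sum(distances) / len(distances)


def critic_score(groups):
    """Higher is better: groups are tight internally and far from each other."""
    return mean_between(groups) / mean_within(groups)


if __name__ == "__main__":
    # Two candidate groupings of the same four hypothetical points.
    sensible = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]
    mixed_up = [[(0, 0), (10, 0)], [(0, 1), (10, 1)]]
    print("sensible grouping:", critic_score(sensible))
    print("mixed-up grouping:", critic_score(mixed_up))
```

Because the score depends entirely on the coordinates supplied, whatever features go into the points decide which groupings look good.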

Representative Examples
- If we are using unsupervised learning to examine all of a population or corpus, then the question of representative examples does not arise. For example, if we are clustering search results and we give the system all the results, there is no problem.
- If we are developing a model which will then be applied to future examples, then we still need the examples to be representative. For example, if I am creating some book categories and plan to assign future books to the same categories, I need a good set of books to start with.

Relevant Features
- Because we are not giving an unsupervised learning system specific correct answers, but only a metric for measuring similarity, giving it relevant features is critical.
- A supervised learning algorithm can use the expected answer to ignore irrelevant features. An unsupervised learning algorithm does not have this tool.
- To use these techniques successfully, you must know your domain and have some feel for what matters.

Irrelevant Features: A Painful Example
- There were about 23 million articles relevant to medicine indexed in Medline in 2013; about 450,000 of them included a mention of diabetes.
- Clearly, if you are doing research about diabetes, you are not going to read them all.
- Medline provides a reference and abstract for each.
- Can we cluster them by giving this information to a machine learning system, and get some idea of the overall pattern of the research?

Example Medline Abstract
Acta Diabetol. 2012 Nov 16. [Epub ahead of print]
Polymorphisms in the Selenoprotein S gene and subclinical cardiovascular disease in the Diabetes Heart Study.
Cox AJ, Lehtinen AB, Xu J, Langefeld CD, Freedman BI, Carr JJ, Bowden DW.
Source: Center for Human Genomics, Wake Forest School of Medicine, Winston-Salem, NC, USA.
Abstract: Selenoprotein S (SelS) has previously been associated with a range of inflammatory markers, particularly in the context of cardiovascular disease (CVD). The aim of this study was to examine the role of SELS genetic variants in risk for subclinical CVD and mortality in individuals with type 2 diabetes mellitus (T2DM). The association between 10 polymorphisms tagging SELS and coronary (CAC), carotid (CarCP) and abdominal aortic calcified plaque, carotid intima media thickness and other known CVD risk factors was examined in 1220 European Americans from the family-based Diabetes Heart Study. The strongest evidence of association for SELS SNPs was observed for CarCP; rs28665122 (5' region; β = 0.329, p = 0.044), rs4965814 (intron 5; β = 0.329, p = 0.036), rs28628459 (3' region; β = 0.331, p = 0.039) and rs7178239 (downstream; β = 0.375, p = 0.016) were all associated. In addition, rs12917258 (intron 5) was associated with CAC (β = -0.230, p = 0.032), and rs4965814, rs28628459 and rs9806366 were all associated with self-reported history of prior CVD (p = 0.020-0.043). These results suggest a potential role for the SELS region in the development subclinical CVD in this sample enriched for T2DM. Further understanding the mechanisms underpinning these relationships may prove important in predicting and managing CVD complications in T2DM.
PMID: 23161441 [PubMed - as supplied by publisher]

Diabetes Clustering
- Representation: the entire text of a set of Medline abstracts relevant to diabetes
- Actor: a tool to display clusters of related documents
- Critic: a measure of how much vocabulary two abstracts have in common
- Learner: a method which uses the critic to draw cluster boundaries such that the abstracts in a cluster have similar vocabularies
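
The slides do not say which vocabulary measure the critic used. As a minimal sketch, assuming Jaccard overlap on lowercase word sets stands in for it, the critic might look like the following; the function names and the tokenizing rule are hypothetical.

```python
import re


def vocabulary(text):
    """Set of lowercase word and number tokens that appear in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def vocabulary_overlap(text_a, text_b):
    """Jaccard overlap: shared vocabulary divided by combined vocabulary."""
    vocab_a, vocab_b = vocabulary(text_a), vocabulary(text_b)
    if not vocab_a and not vocab_b:
        return 0.0  # two empty texts share nothing worth clustering on
    return len(vocab_a & vocab_b) / len(vocab_a | vocab_b)


if __name__ == "__main__":
    a = "Diabetes Heart Study"
    b = "diabetes heart disease"
    print(vocabulary_overlap(a, b))  # 2 shared words out of 4 distinct words
```

A measure like this treats every token in the record the same way, so it cannot tell research vocabulary apart from citation details such as journal names.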

Result
- A good clustering tool created tight clusters.
- But the vocabulary they mostly had in common was the abbreviated journal titles, for example "Acta Diabetol".
- So all the clusters did was assign articles to journals. That is useful sometimes, but not here.
- It would have been better to omit the title, date, authors, etc., and just include the actual text.
The result is not useful here because the journals are an organization of articles that the scientists are already aware of.
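
The effect can be reproduced on a small scale. The sketch below assumes each record is a dictionary with separate `journal` and `abstract` fields and uses a simple word-overlap score; the field names, helper functions, and both abstract snippets are hypothetical, made up only to show how a shared journal name creates similarity between otherwise unrelated records.

```python
def tokens(text):
    """Lowercase whitespace-separated tokens as a set."""
    return set(text.lower().split())


def overlap(text_a, text_b):
    """Shared vocabulary divided by combined vocabulary."""
    a, b = tokens(text_a), tokens(text_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def text_for_clustering(record, fields=("abstract",)):
    """Build the clustering input from the chosen fields only."""
    return " ".join(record[field] for field in fields)


if __name__ == "__main__":
    # Hypothetical records from the same journal on unrelated topics.
    first = {"journal": "Acta Diabetol",
             "abstract": "selenoprotein variants and cardiovascular risk"}
    second = {"journal": "Acta Diabetol",
              "abstract": "insulin pump adherence in teenagers"}

    everything = ("journal", "abstract")
    print("with journal field:",
          overlap(text_for_clustering(first, everything),
                  text_for_clustering(second, everything)))
    print("abstract text only:",
          overlap(text_for_clustering(first), text_for_clustering(second)))
```

Choosing which fields reach the learner is a domain decision made before any clustering runs, and the clustering tool has no way to make it for you.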

Summary
In order for our machine learning system to produce good results, both with the example data and with the future data it is applied to, we need good examples:
- Representative or typical examples: the examples must reflect the cases we expect to apply the system to.
- Relevant features: the inputs or features we give the system must be related to the task we are trying to accomplish.
In order to make an effective machine learning system, you need to understand your field or domain well enough to judge both.
We said in the beginning that we are developing a model; this is basically looking at some of the limits of the model.