Choosing Inputs for Machine Learning

Villanova University Machine Learning Project

The Inputs
We defined learning as changes in behavior based on experience. The nature of that experience is critical to the success of a learning system. In machine learning, that means we need to give careful attention to the examples we give the system to learn from.
There can be some interesting discussion here of when people learn the wrong things. Phobias can be looked at as learning gone awry, for instance. Punishment may teach a child to avoid the behavior that led to the punishment, or to avoid the person who administered it.

Supervised Learning
Provide the system with example training data and the desired result from those data.
Each example, or training case, consists of a set of variables or features describing one case, including the decision that should be made.
The system builds a model from the examples and uses the model to make a decision, the outcome.
The critic compares the actual outcome and the desired outcome.
The learner tweaks the model to make the actual and desired outcomes more similar.
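The model, critic, and learner roles can be sketched in a few lines of Python. This is a minimal illustration using a simple perceptron-style update rule, which is only one of many possible learners and is not taken from the slides. The training cases, the feature meanings, and the names decide, critic, and train are all invented for the example.

```python
# Hypothetical training cases: (features, desired outcome).
# Features are (hours_studied, hours_slept); outcome 1 = pass, 0 = fail.
TRAINING_CASES = [
    ((1.0, 4.0), 0),
    ((2.0, 8.0), 0),
    ((6.0, 5.0), 1),
    ((8.0, 7.0), 1),
]


def decide(weights, bias, features):
    """The model: turn the features of one case into a decision (0 or 1)."""
    total = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if total > 0 else 0


def critic(actual, desired):
    """Compare the actual and desired outcomes; 0 means they agree."""
    return desired - actual


def train(cases, epochs=100, rate=1.0):
    """The learner: nudge the model whenever the critic reports an error."""
    weights = [0.0] * len(cases[0][0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for features, desired in cases:
            error = critic(decide(weights, bias, features), desired)
            if error != 0:
                mistakes += 1
                weights = [w + rate * error * x
                           for w, x in zip(weights, features)]
                bias += rate * error
        if mistakes == 0:  # every training case is now decided correctly
            break
    return weights, bias


weights, bias = train(TRAINING_CASES)
predictions = [decide(weights, bias, features)
               for features, _ in TRAINING_CASES]
print(predictions)
```

The loop stops as soon as the critic finds no disagreement on any training case. Agreement on the training cases says nothing yet about future cases, which is why the choice of examples matters so much.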

Supervised Learning, continued
Most commonly used machine learning methods are based on supervised learning. The success of a supervised learning system depends very heavily on the examples it is given; they must be typical or representative. It also depends on the data or features provided, the feature space, which must reflect the domain or field.

Representative Examples
The goal of our machine learning system is to act correctly over some set of inputs or cases. This is the population or corpus it will be applied to. The machine learning examples must accurately reflect the field or domain that we want to learn. The examples must be typical of the population for which we will eventually make decisions.

Representative Sample Examples
Spam or non-spam
Bad: random sample of email from the last week
Better: random sample of email from the last year
Bad: email received at an individual Gmail account
Note: need some good and bad examples from various fields here.
Individual Gmail is bad because Google has already screened out almost all spam, so you are getting a very non-representative sample. One week is bad because spam topics tend to go in batches over time, and this is too short a time period; you will miss some of the common topics.

Typical Mistakes in Choosing Examples
The “convenience” sample: using the examples that are easy to get, whether or not they reflect the cases we will make decisions on later.
The positive examples: using only examples with one of the two possible outcomes; only loans made, for instance, not those denied.
The unbalanced examples: choosing different proportions of positive and negative examples than the population. For instance, about 7% of mammograms show some abnormality. If you wanted to train a system to recognize abnormal mammogram films, you should NOT use 50% normal and 50% abnormal films as examples.
The problem with all of these is that the performance of the system on actual cases will not be as good as its performance on the sample.
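A little arithmetic shows one way the gap between sample performance and real performance can arise. The sketch below makes a simplifying assumption: the system's per-class hit rates stay the same and only the mix of cases changes. The sensitivity and specificity values are invented for illustration; only the 7% and 50% figures come from the mammogram example.

```python
def accuracy_and_precision(prevalence, sensitivity, specificity):
    """Expected accuracy and precision for a given share of abnormal cases.

    prevalence  -- fraction of cases that are truly abnormal
    sensitivity -- fraction of abnormal cases the system flags
    specificity -- fraction of normal cases the system leaves unflagged
    """
    true_pos = prevalence * sensitivity
    true_neg = (1 - prevalence) * specificity
    false_pos = (1 - prevalence) * (1 - specificity)
    accuracy = true_pos + true_neg
    precision = true_pos / (true_pos + false_pos)
    return accuracy, precision


SENSITIVITY = 0.90  # assumed value, not from the slides
SPECIFICITY = 0.80  # assumed value, not from the slides

# Balanced 50/50 sample versus a population with about 7% abnormal films.
sample_acc, sample_prec = accuracy_and_precision(0.50, SENSITIVITY, SPECIFICITY)
pop_acc, pop_prec = accuracy_and_precision(0.07, SENSITIVITY, SPECIFICITY)

print(f"50/50 sample:  accuracy {sample_acc:.3f}, precision {sample_prec:.3f}")
print(f"7% population: accuracy {pop_acc:.3f}, precision {pop_prec:.3f}")
```

Under these assumed rates, most of the films flagged in the balanced sample really are abnormal, while in the 7% population most flagged films are normal. The same system looks much better on the unrepresentative sample than it would in use.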

Are these good samples?

Feature Spaces
Which features to include in the examples is a major question in developing a supervised learning system:
They should be relevant to the decision to be made.
They should be (mostly) observable for every example.
They should be, as much as possible, independent of one another.
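One rough way to check the independence criterion is to look for pairs of features that move together. The sketch below computes the Pearson correlation for every pair of features and flags the strongly correlated ones. The house data, the feature names, and the 0.9 threshold are assumptions made for illustration, and correlation only detects linear relationships, so a low value does not prove independence.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length number lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical feature columns for five houses.
HOUSES = {
    "square_feet": [900, 1200, 1500, 2000, 2600],
    "rooms": [3, 4, 5, 7, 9],
    "age_years": [40, 5, 62, 18, 30],
}

THRESHOLD = 0.9  # assumed cut-off for "too closely related"

# Collect every pair of features whose correlation magnitude is above
# the threshold; such pairs carry largely the same information.
correlated_pairs = [
    (a, b)
    for a, b in combinations(HOUSES, 2)
    if abs(pearson(HOUSES[a], HOUSES[b])) > THRESHOLD
]
print(correlated_pairs)
```

In this made-up data, square footage and number of rooms rise together, so they are flagged as a pair, while age is only weakly related to either.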

Relevant Examples
We want our system to look at some features and some decision, and find the patterns which led to the decision. This will only work if the features we give the system are in fact related to the decision being made.
Example: to decide whether a house will sell
Probably relevant: price, square footage, age
Probably irrelevant: name of the owner, day of the week
Most supervised systems will accept a large number of features and successfully identify the relevant ones, but if the most relevant ones aren’t included the system cannot perform well.

Which Are Relevant 1
To decide whether an illness is influenza:
1. presence of fever
2. last name of patient
3. color of shoes
4. presence of cough
5. date
1 and 4 definitely. 2 and 3 no. 5 can be argued either way.
In discussion, it’s almost impossible to completely rule something out as irrelevant; students can be adept at coming up with a way in which almost anything could be relevant. That’s fine; it’s still contributing to their understanding of the concept.

Which Are Relevant 2
To decide the gender of a skeleton:
1. shape of pelvis
2. gender of the examiner
3. length of femur
4. date
5. number of ribs
6. position of bones
Answers: 1. Yes. 2. No. 3. Yes, although age and race affect it considerably. 4. No. 5. No; Eve notwithstanding, males and females have the same number of ribs. 6. No.

Unsupervised Learning
In an unsupervised learning application, we do not give the system any a priori decisions.
The task instead is to find similarities among the examples given and group them.
The critic is some measure of similarity among the cases in a group compared to those in a different group.
The data we provide define the kind of similarities and groupings we will find.
It is still important to have representative examples.

Representative Examples
If we are using unsupervised learning to examine all of a population or corpus, then the question of representative examples does not arise. For example, if we are clustering search results, and we give the system all the results, no problem.
If we are developing a model which will then be applied to future examples, then we still need the examples to be representative. For example, if I am creating some book categories and plan to assign future books to the same categories, I need a good set of books to start with.

Relevant Features
Because we are not giving an unsupervised learning system specific correct answers, but only a metric to somehow measure similarity, giving it relevant features is critical. A supervised learning algorithm can use the expected answer to ignore irrelevant features. An unsupervised learning algorithm does not have this tool. You must know your domain and have some feel for what matters to use these techniques successfully.
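A small distance calculation illustrates why an unsupervised system cannot shrug off an irrelevant feature. The sketch below uses plain Euclidean distance as the similarity measure and asks which case is closest to case A, first using only two plausibly relevant features and then adding an irrelevant one. The cases, feature values, and the nearest helper are invented for illustration.

```python
from math import dist  # Euclidean distance, available in Python 3.8+

# Hypothetical patients:
# (temperature in degrees C, cough severity 0-3, day of month seen).
# The day of the month is assumed to be irrelevant to the illness.
CASES = {
    "A": (39.2, 3, 2),
    "B": (39.0, 3, 27),
    "C": (36.7, 0, 3),
}


def nearest(name, cases, feature_indexes):
    """Return the case closest to `name`, using only the chosen features."""
    def project(values):
        return [values[i] for i in feature_indexes]

    target = project(cases[name])
    distances = {
        other: dist(target, project(values))
        for other, values in cases.items()
        if other != name
    }
    return min(distances, key=distances.get)


with_relevant_only = nearest("A", CASES, [0, 1])      # temperature, cough
with_irrelevant_too = nearest("A", CASES, [0, 1, 2])  # plus day of month
print(with_relevant_only, with_irrelevant_too)
```

With only temperature and cough, A is grouped with B, the other feverish, coughing patient. Once the day of the month is included, its large numeric range dominates the distance and A is grouped with the healthy patient C instead, simply because they were seen a day apart.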

Irrelevant Features: A Painful Example
There were about 23 million articles relevant to medicine indexed in Medline in 2013; about 450,000 of them included a mention of diabetes. Clearly, if you are doing research about diabetes, you are not going to read them all. Medline provides a reference and abstract for each. Can we cluster them by giving this information to a machine learning system, and get some idea of the overall pattern of the research?

Example Medline Abstract
Acta Diabetol Nov 16. [Epub ahead of print]
Polymorphisms in the Selenoprotein S gene and subclinical cardiovascular disease in the Diabetes Heart Study.
Cox AJ, Lehtinen AB, Xu J, Langefeld CD, Freedman BI, Carr JJ, Bowden DW.
Source: Center for Human Genomics, Wake Forest School of Medicine, Winston-Salem, NC, USA.
Abstract: Selenoprotein S (SelS) has previously been associated with a range of inflammatory markers, particularly in the context of cardiovascular disease (CVD). The aim of this study was to examine the role of SELS genetic variants in risk for subclinical CVD and mortality in individuals with type 2 diabetes mellitus (T2DM). The association between 10 polymorphisms tagging SELS and coronary (CAC), carotid (CarCP) and abdominal aortic calcified plaque, carotid intima media thickness and other known CVD risk factors was examined in 1220 European Americans from the family-based Diabetes Heart Study. The strongest evidence of association for SELS SNPs was observed for CarCP; rs (5' region; β = 0.329, p = 0.044), rs (intron 5; β = 0.329, p = 0.036), rs (3' region; β = 0.331, p = 0.039) and rs (downstream; β = 0.375, p = 0.016) were all associated. In addition, rs (intron 5) was associated with CAC (β = -0.230, p = 0.032), and rs , rs and rs were all associated with self-reported history of prior CVD (p = ). These results suggest a potential role for the SELS region in the development subclinical CVD in this sample enriched for T2DM. Further understanding the mechanisms underpinning these relationships may prove important in predicting and managing CVD complications in T2DM.
PMID: [PubMed - as supplied by publisher]

Diabetes clustering
Representation: the entire text of a set of Medline abstracts relevant to diabetes.
Actor: a tool to display clusters of related documents.
Critic: a measure of how much vocabulary is in common between two abstracts.
Learner: a method which uses the critic to draw cluster boundaries such that the abstracts in a cluster have similar vocabularies.

Result
A good clustering tool created tight clusters. But the vocabulary they mostly had in common was the abbreviated journal titles, such as “Acta Diabetol”. So all the clusters did was assign articles to journals.
That is useful sometimes, but not here, because the journals are an organization of articles that the scientists are already aware of.
It would have been better to omit title, date, authors, etc., and just include the actual text.
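The effect can be reproduced on a toy scale. The sketch below assumes a very simple critic, the Jaccard overlap between the word sets of two records, which stands in for whatever vocabulary measure the actual clustering tool used. The three records are invented, "Other J" is a placeholder journal name, and the function names are hypothetical.

```python
def vocabulary(text):
    """The set of distinct lower-cased words in a text."""
    return set(text.lower().split())


def overlap(text_a, text_b):
    """Critic: shared vocabulary as a fraction of all words used (Jaccard)."""
    a, b = vocabulary(text_a), vocabulary(text_b)
    return len(a & b) / len(a | b)


def most_similar_to_first(texts):
    """Index of the text that shares the most vocabulary with texts[0]."""
    scores = [overlap(texts[0], other) for other in texts[1:]]
    return 1 + scores.index(max(scores))


# Invented records: records 0 and 2 share a topic,
# records 0 and 1 share only a journal.
RECORDS = [
    {"journal": "Acta Diabetol", "abstract": "gene variants and plaque"},
    {"journal": "Acta Diabetol", "abstract": "insulin pump therapy"},
    {"journal": "Other J", "abstract": "gene variants in children"},
]

full_text = [r["journal"] + " " + r["abstract"] for r in RECORDS]
abstract_only = [r["abstract"] for r in RECORDS]

print(most_similar_to_first(full_text))      # journal name included
print(most_similar_to_first(abstract_only))  # journal name omitted
```

With the journal name included, record 0 is matched with record 1, which has nothing in common with it except the journal. With the abstract text alone, it is matched with record 2, the one on the same topic.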

Summary
In order for our machine learning system to produce good results both with the example data and with future data it is applied to, we need good examples:
Representative or typical examples: the examples must reflect the cases we expect to apply it to.
Relevant features: the inputs or features we give the system must be related to the task we are trying to accomplish.
In order to make an effective machine learning system, you need to understand your field or domain well enough to judge both.
We said in the beginning that we are developing a model; this is basically looking at some of the limits of the model.
