Applications of the I-score (using R)
Lydia Hsu
Agenda
- Flow of solving machine learning problems
- Implementation of the I-score
- Example 1 — Genetics (Breast Cancer Pathways)
- Example 2 — Text (Spam Detection)
- Example 3 — Recommendation System (Orange Juice Preference)
- Example 4 — Longitudinal Study (Mortality of Americans)
The Flow for Solving Supervised Learning Problems
- Feature selection: I-score
- Method: regression, classification, etc.
Implementation — Single I-score
- Generate partitions
- Count the number of cases and controls inside each partition
- Calculate the I-score (sketched in R below)
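A minimal R sketch of these three steps, assuming a binary response y (1 = case, 0 = control), a data frame X of discrete predictors, and one standard form of the statistic, I = sum_j n_j^2 (ybar_j - ybar)^2 / sum_i (y_i - ybar)^2, where the cells j are the observed combinations of the selected variables:

```r
# Sketch of a single I-score computation (standard partition form assumed):
# I = sum_j n_j^2 * (ybar_j - ybar)^2 / sum_i (y_i - ybar)^2
iscore <- function(X, y) {
  # Generate partitions: one cell per observed combination of X's values
  cell <- interaction(X, drop = TRUE)
  # Count cases and controls per cell via the cell sizes and cell means
  n_j    <- tapply(y, cell, length)
  ybar_j <- tapply(y, cell, mean)
  ybar   <- mean(y)
  # Calculate the I-score
  sum(n_j^2 * (ybar_j - ybar)^2) / sum((y - ybar)^2)
}

# Toy usage: two SNPs coded 0/1/2 and a binary disease status
set.seed(1)
X <- data.frame(snp1 = rbinom(100, 2, 0.3), snp2 = rbinom(100, 2, 0.3))
y <- rbinom(100, 1, 0.5)
iscore(X, y)
```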
Implementation — Backward Dropping
- Start with all variables
- Drop one variable at a time, and record the resulting I-score (see the sketch below)
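A sketch of one backward-dropping run, reusing the iscore() function above: each round drops the variable whose removal yields the largest I-score, and the best-scoring subset seen along the way is kept.

```r
# Sketch of the backward-dropping algorithm, assuming iscore() from above
backward_drop <- function(X, y) {
  vars <- names(X)
  best <- list(vars = vars, score = iscore(X, y))
  while (length(vars) > 1) {
    # I-score of each candidate subset with one variable removed
    scores <- sapply(vars, function(v) iscore(X[setdiff(vars, v)], y))
    # Drop the variable whose removal raises the I-score the most
    vars <- setdiff(vars, names(which.max(scores)))
    if (max(scores) > best$score)
      best <- list(vars = vars, score = max(scores))
  }
  best  # the highest-scoring variable subset found along the path
}
```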
Example 1 - Genetics Motivation:
370 individual genetic variants, or SNPs, have been identified as associated with breast cancer. However, these SNPs are not predictive of breast cancer risk and fail to explain the incidence of the disease among patients. This lack of predictive power is due to the enormous size of genetic data, which prohibits many of the computationally complex algorithms often relied on to identify complicated relationships between variables, like SNPs, and breast cancer. As a result, most researchers can only test whether individual SNPs are correlated with the disease and report the ones with high statistical significance. However, if gene groups, or SNP sets, in fact determine the incidence of breast cancer among patients, a SNP-by-SNP sweep of the genome will miss important information relevant to a patient's risk for the disease.
Example 1 - Genetics GWAS Data
Example 1 - Genetics
Example 1 - Genetics
Reference: www.synapse.org/#!Synapse:syn5605838/wiki/392024
Example 2 - Spam or Ham?
Example 2 - Spam or Ham? Document Classification Problems:
To assign a document to one or more classes or categories. The documents to be classified may be texts, images, music, etc., and each kind of document poses its own classification problems. When not otherwise specified, text classification is implied.
Example 2 - Spam or Ham? Features? Feature selection? Algorithm?
Example 2 - Spam or Ham? Features: Word counts — term frequency matrix
Feature Selection: We wish to find the “bag of words” that is predictive for classifying spam or ham; use the I-score to find the interactions among words.
Example 2 - Spam or Ham? Pre-process
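A sketch of a typical pre-processing pipeline for this step, assuming the tm package and a hypothetical character vector emails of raw message bodies (the talk's actual cleaning steps are not shown on the slide):

```r
# Sketch of text pre-processing with the tm package (assumed choice);
# `emails` is a hypothetical character vector of raw message bodies
library(tm)
corpus <- VCorpus(VectorSource(emails))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
# Term-frequency matrix: one row per message, one column per word
dtm <- DocumentTermMatrix(corpus)
tf  <- as.matrix(dtm)
```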
Example 2 - Spam or Ham? Data Background — Enron scandal
Enron was one of the world's leading energy companies; it declared bankruptcy in December 2001, which was followed by numerous investigations. During the investigation, the original Enron dataset, consisting of 619,446 messages, was posted to the Web by the Federal Energy Regulatory Commission in May 2002. Later, some duplicate mails were deleted, and others were deleted at the request of Enron employees. The version available today, known as the March 2, 2004 version, is widely used by researchers.
Example 2 - Spam or Ham?
- Convert the term frequency into 3 levels — rare, frequent, and very frequent
- Run 6-way interactions 10,000 times using the backward-dropping method
- Sort the results by I-score, as sketched below
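A sketch of these three steps, reusing tf, iscore(), and backward_drop() from above; y is a hypothetical 0/1 spam label vector, and the cut points for the three levels are illustrative, not the ones used in the talk.

```r
# Discretize term frequencies into 3 levels (illustrative cut points)
tf3 <- as.data.frame(lapply(as.data.frame(tf), function(x)
  cut(x, breaks = c(-Inf, 0, 2, Inf),
      labels = c("rare", "frequent", "very_frequent"))))

# 10,000 backward-dropping runs, each starting from 6 randomly drawn words
runs <- replicate(10000, {
  draw <- sample(names(tf3), 6)
  backward_drop(tf3[draw], y)  # y: 0/1 spam labels (assumed)
}, simplify = FALSE)

# Sort the returned word sets by their I-scores
ord <- order(sapply(runs, `[[`, "score"), decreasing = TRUE)
head(lapply(runs[ord], `[[`, "vars"))
```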
Example 2 - Spam or Ham? Naive Bayes
a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features.
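As a baseline classifier on the I-score-selected words, a sketch using naiveBayes() from the e1071 package; top_words is a hypothetical vector of words chosen by the runs above.

```r
# Naive Bayes on the selected words, assuming e1071 and tf3/y from above
library(e1071)
top_words <- runs[[ord[1]]]$vars      # hypothetical: best word set found
fit  <- naiveBayes(tf3[top_words], factor(y))
pred <- predict(fit, tf3[top_words])
mean(pred == factor(y))               # training accuracy
```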
Example 2 - Spam or Ham? Logistic Regression
Developed by statistician David Cox in 1958 to estimate the probability of a binary response based on one or more predictor (or independent) variables (features). Can be seen as a special case of the generalized linear model, and is thus analogous to linear regression.
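The corresponding sketch with base R's glm(), which fits the logistic model as a generalized linear model (binomial family); same assumed tf3, y, and top_words as above.

```r
# Logistic regression on the same selected words
dat  <- data.frame(tf3[top_words], spam = y)
fit  <- glm(spam ~ ., data = dat, family = binomial)
prob <- predict(fit, type = "response")  # estimated P(spam | words)
mean(as.integer(prob > 0.5) == y)        # training accuracy
```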
Example 2 - Spam or Ham? Top word groups, sorted by I-score:
1. recorded, xls, subject
2. star, bryan, enron
3. deal
4. owners, oasis, hplo
5. issue, meter
6.
7. htm, whether, free
8. apply, beaumont, determine, book, similar, gas
9. times, eastrans
10. exchange, growth, green
Example 3 - Citrus Hill or Minute Maid
Which one would you buy?
Example 3 - Citrus Hill or Minute Maid
Recommendation System Problems: to predict the 'rating' or 'preference' that a user would give to an item. Recommendations are typically produced in one of two ways:
- Collaborative filtering approaches build a model from a user's past behavior (items previously purchased or selected and/or numerical ratings given to those items) as well as similar decisions made by other users; this model is then used to predict items (or ratings for items) that the user may have an interest in.
- Content-based filtering approaches utilize a series of discrete characteristics of an item in order to recommend additional items with similar properties.
These approaches are often combined (Hybrid Recommender Systems).
Example 3 - Citrus Hill or Minute Maid
Features: week of purchase, store, price, discount, special, loyalty, price difference, sales price, list price
Example 3 - Citrus Hill or Minute Maid
Use the I-score statistic with the backward-dropping algorithm (sketched below) to find the most interactive feature sets (for prediction):
- price, discount, special, list price
- price, sales price
- store, loyalty, sales price
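A sketch of this analysis on a public stand-in, assuming the OJ data from the ISLR package (Citrus Hill vs. Minute Maid purchases); the discretization below is illustrative, and iscore()/backward_drop() are the functions defined earlier.

```r
# Sketch on ISLR's OJ data (assumed stand-in for the talk's dataset)
library(ISLR)
y <- as.integer(OJ$Purchase == "MM")
# Discretize the continuous features into a few levels each
X <- data.frame(
  store     = factor(OJ$StoreID),
  loyalty   = cut(OJ$LoyalCH, 3),
  price     = cut(OJ$PriceCH, 3),
  discount  = cut(OJ$DiscCH, 3),
  special   = factor(OJ$SpecialCH),
  saleprice = cut(OJ$SalePriceCH, 3)
)
backward_drop(X, y)  # returns the highest-I-score feature subset
```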
Example 4 - Mortality
Make a guess: which populations have the highest mortality rates? Location? Career? Income? Lifestyle factors?
Our goal is to use the I-score to identify predictive subsets of variables.
Example 4 - Mortality Data: National Longitudinal Mortality Study
1.8 million subjects, representative of the total US non-institutionalized population of 1990
Case: death within 10 years
Control: no death within 10 years
Example 4 - Mortality
Example 4 - Mortality
Example 4 - Mortality Cancer
Example 4 - Mortality Heart Disease