Applications of the I-score (using R)
Lydia Hsu
Agenda
- Flow of solving machine learning problems
- Implementation of the I-score
- Example 1 — Genetics (Breast Cancer Pathways)
- Example 2 — Text (Spam Detection)
- Example 3 — Recommendation System (Orange Juice Preference)
- Example 4 — Longitudinal Study (Mortality of Americans)
The Flow for Solving Supervised Learning Problems
- Feature selection: I-score
- Method: regression, classification, etc.
Implementation — Single I-score
- Generate partitions
- Count the number of cases and controls inside each partition
- Calculate the I-score (sketched in R below)
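A minimal R sketch of these three steps, assuming a binary response y (1 = case, 0 = control), a data frame X of discrete predictors, and one standard form of the statistic, I = sum_j n_j^2 (ybar_j - ybar)^2 / sum_i (y_i - ybar)^2, where the cells j are the observed combinations of the selected variables:

```r
# Sketch of a single I-score computation (standard partition form assumed):
# I = sum_j n_j^2 * (ybar_j - ybar)^2 / sum_i (y_i - ybar)^2
iscore <- function(X, y) {
  # Generate partitions: one cell per observed combination of X's values
  cell <- interaction(X, drop = TRUE)
  # Count cases and controls per cell via the cell sizes and cell means
  n_j    <- tapply(y, cell, length)
  ybar_j <- tapply(y, cell, mean)
  ybar   <- mean(y)
  # Calculate the I-score
  sum(n_j^2 * (ybar_j - ybar)^2) / sum((y - ybar)^2)
}

# Toy usage: two SNPs coded 0/1/2 and a binary disease status
set.seed(1)
X <- data.frame(snp1 = rbinom(100, 2, 0.3), snp2 = rbinom(100, 2, 0.3))
y <- rbinom(100, 1, 0.5)
iscore(X, y)
```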
Implementation — Backward Dropping
- Start with all variables
- Drop one variable at a time, and record the resulting I-score (see the sketch below)
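A sketch of one backward-dropping run, reusing the iscore() function above: each round drops the variable whose removal yields the largest I-score, and the best-scoring subset seen along the way is kept.

```r
# Sketch of the backward-dropping algorithm, assuming iscore() from above
backward_drop <- function(X, y) {
  vars <- names(X)
  best <- list(vars = vars, score = iscore(X, y))
  while (length(vars) > 1) {
    # I-score of each candidate subset with one variable removed
    scores <- sapply(vars, function(v) iscore(X[setdiff(vars, v)], y))
    # Drop the variable whose removal raises the I-score the most
    vars <- setdiff(vars, names(which.max(scores)))
    if (max(scores) > best$score)
      best <- list(vars = vars, score = max(scores))
  }
  best  # the highest-scoring variable subset found along the path
}
```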
Example 1 - Genetics Motivation:
370 individual genetic variants, or SNPs, have been identified as associated with breast cancer. However, these SNPs are not predictive of breast cancer risk and fail to explain the incidence of the disease among patients. This lack of predictive power is due to the enormous size of genetic data, which prohibits many of the computationally complex algorithms often relied on to identify complicated relationships between variables, like SNPs, and breast cancer. As a result, most researchers can only test whether individual SNPs are correlated with the disease and report the ones with high statistical significance. However, if gene groups, or SNP sets, in fact determine the incidence of breast cancer among patients, a SNP-by-SNP sweep of the genome will miss important information relevant to a patient's risk for the disease.
Example 1 - Genetics GWAS Data
Example 1 - Genetics
Example 1 - Genetics
Reference: www.synapse.org/#!Synapse:syn5605838/wiki/392024
Example 2 - Spam or Ham?
Example 2 - Spam or Ham? Document Classification Problems:
To assign a document to one or more classes or categories. The documents to be classified may be texts, images, music, etc., and each kind of document poses its own classification problems. When not otherwise specified, text classification is implied.
Example 2 - Spam or Ham? Features? Feature selection? Algorithm?
Example 2 - Spam or Ham? Features: Word counts — term frequency matrix
Feature Selection: We wish to find the “bag of words” that is predictive for classifying spam or ham; use the I-score to find the interactions among words.
Example 2 - Spam or Ham? Pre-process
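A sketch of a typical pre-processing pipeline for this step, assuming the tm package and a hypothetical character vector emails of raw message bodies (the talk's actual cleaning steps are not shown on the slide):

```r
# Sketch of text pre-processing with the tm package (assumed choice);
# `emails` is a hypothetical character vector of raw message bodies
library(tm)
corpus <- VCorpus(VectorSource(emails))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
# Term-frequency matrix: one row per message, one column per word
dtm <- DocumentTermMatrix(corpus)
tf  <- as.matrix(dtm)
```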
Example 2 - Spam or Ham? Data Background — Enron scandal
Enron was one of the world's leading energy companies; it declared bankruptcy in December 2001, which was followed by numerous investigations. During the investigation, the original Enron dataset, consisting of 619,446 messages, was posted to the Web by the Federal Energy Regulatory Commission in May 2002. Later, some duplicate mails were deleted, and others were deleted at the request of Enron employees. The version available today, known as the March 2, 2004 version, is widely used by researchers.
Example 2 - Spam or Ham?
- Convert the term frequency into 3 levels — rare, frequent, and very frequent
- Run 6-way interactions 10,000 times using the backward-dropping method
- Sort the results by I-score, as sketched below
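A sketch of these three steps, reusing tf, iscore(), and backward_drop() from above; y is a hypothetical 0/1 spam label vector, and the cut points for the three levels are illustrative, not the ones used in the talk.

```r
# Discretize term frequencies into 3 levels (illustrative cut points)
tf3 <- as.data.frame(lapply(as.data.frame(tf), function(x)
  cut(x, breaks = c(-Inf, 0, 2, Inf),
      labels = c("rare", "frequent", "very_frequent"))))

# 10,000 backward-dropping runs, each starting from 6 randomly drawn words
runs <- replicate(10000, {
  draw <- sample(names(tf3), 6)
  backward_drop(tf3[draw], y)  # y: 0/1 spam labels (assumed)
}, simplify = FALSE)

# Sort the returned word sets by their I-scores
ord <- order(sapply(runs, `[[`, "score"), decreasing = TRUE)
head(lapply(runs[ord], `[[`, "vars"))
```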
Example 2 - Spam or Ham? Naive Bayes
a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features.
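As a baseline classifier on the I-score-selected words, a sketch using naiveBayes() from the e1071 package; top_words is a hypothetical vector of words chosen by the runs above.

```r
# Naive Bayes on the selected words, assuming e1071 and tf3/y from above
library(e1071)
top_words <- runs[[ord[1]]]$vars      # hypothetical: best word set found
fit  <- naiveBayes(tf3[top_words], factor(y))
pred <- predict(fit, tf3[top_words])
mean(pred == factor(y))               # training accuracy
```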
Example 2 - Spam or Ham? Logistic Regression
Developed by statistician David Cox in 1958 to estimate the probability of a binary response based on one or more predictor (or independent) variables (features). Can be seen as a special case of the generalized linear model, and is thus analogous to linear regression.
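The corresponding sketch with base R's glm(), which fits the logistic model as a generalized linear model (binomial family); same assumed tf3, y, and top_words as above.

```r
# Logistic regression on the same selected words
dat  <- data.frame(tf3[top_words], spam = y)
fit  <- glm(spam ~ ., data = dat, family = binomial)
prob <- predict(fit, type = "response")  # estimated P(spam | words)
mean(as.integer(prob > 0.5) == y)        # training accuracy
```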
Example 2 - Spam or Ham? Top word groups, sorted by I-score:
1. recorded, xls, subject
2. star, bryan, enron
3. deal
4. owners, oasis, hplo
5. issue, meter
6.
7. htm, whether, free
8. apply, beaumont, determine, book, similar, gas
9. times, eastrans
10. exchange, growth, green
Example 3 - Citrus Hill or Minute Maid
Which one would you buy?
Example 3 - Citrus Hill or Minute Maid
Recommendation System Problems: to predict the 'rating' or 'preference' that a user would give to an item. Recommendations are typically produced in one of two ways:
- Collaborative filtering approaches build a model from a user's past behavior (items previously purchased or selected and/or numerical ratings given to those items) as well as similar decisions made by other users; this model is then used to predict items (or ratings for items) that the user may have an interest in.
- Content-based filtering approaches utilize a series of discrete characteristics of an item in order to recommend additional items with similar properties.
These approaches are often combined (Hybrid Recommender Systems).
Example 3 - Citrus Hill or Minute Maid
Features: week of purchase, store, price, discount, special, loyalty, price difference, sales price, list price
Example 3 - Citrus Hill or Minute Maid
Use the I-score statistic with the backward-dropping algorithm (sketched below) to find the most interactive feature sets (for prediction):
- price, discount, special, list price
- price, sales price
- store, loyalty, sales price
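A sketch of this analysis on a public stand-in, assuming the OJ data from the ISLR package (Citrus Hill vs. Minute Maid purchases); the discretization below is illustrative, and iscore()/backward_drop() are the functions defined earlier.

```r
# Sketch on ISLR's OJ data (assumed stand-in for the talk's dataset)
library(ISLR)
y <- as.integer(OJ$Purchase == "MM")
# Discretize the continuous features into a few levels each
X <- data.frame(
  store     = factor(OJ$StoreID),
  loyalty   = cut(OJ$LoyalCH, 3),
  price     = cut(OJ$PriceCH, 3),
  discount  = cut(OJ$DiscCH, 3),
  special   = factor(OJ$SpecialCH),
  saleprice = cut(OJ$SalePriceCH, 3)
)
backward_drop(X, y)  # returns the highest-I-score feature subset
```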
Example 4 - Mortality
Make a guess: which populations have the highest mortality rates? Location? Career? Income? Lifestyle factors?
Our goal is to use the I-score to identify predictive subsets of variables.
Example 4 - Mortality Data: National Longitudinal Mortality Study
1.8 million subjects, representative of the total US non-institutionalized population of 1990
Case: death within 10 years
Control: no death within 10 years
Example 4 - Mortality
Example 4 - Mortality
Example 4 - Mortality Cancer
Example 4 - Mortality Heart Disease