Applications of IScore (using R)


Lydia Hsu

Agenda
- Flow of solving machine learning problems
- Implementation of IScore
- Example 1 — Genetics (Breast Cancer Pathways)
- Example 2 — Text (Spam Detection)
- Example 3 — Recommendation System (Orange Juice Preference)
- Example 4 — Longitudinal Study (Mortality of Americans)

The Flow for Solving Supervised Learning Problems
- Feature selection: IScore
- Method: regression, classification, etc.

Implementation — Single IScore
- Generate partitions
- Count the number of cases and controls inside each partition
- Calculate the IScore
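
The three steps above can be sketched as follows. The talk uses R; this is a minimal Python illustration, and the final normalization is one common convention for the influence score, not necessarily the exact one used in the talk:

```python
import numpy as np

def i_score(X, y):
    """Influence score (IScore) of a set of discrete variables X
    (one column per variable) for a binary outcome y: partition the
    samples by the joint levels of X, use the cases/controls in each
    cell to get the cell mean, and sum squared deviations from the
    overall mean weighted by the squared cell size."""
    y = np.asarray(y, dtype=float)
    n, ybar = len(y), y.mean()
    if y.var() == 0:
        return 0.0
    # step 1: generate partitions -- one cell per distinct row of X
    _, cell = np.unique(np.asarray(X), axis=0, return_inverse=True)
    score = 0.0
    for j in np.unique(cell):
        yj = y[cell == j]          # step 2: outcomes in this cell
        score += len(yj) ** 2 * (yj.mean() - ybar) ** 2
    # step 3: normalize (one common convention; others exist)
    return score / (n * y.var())
```

A perfectly informative binary variable yields a high score, while a variable independent of the outcome scores near zero.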

Implementation — Backward Dropping
- Start with all variables
- Drop one variable at a time, recording the resulting IScore
- Return the variable subset with the highest IScore seen along the dropping path
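
A self-contained Python sketch of the backward-dropping loop (the talk uses R). The helper `i_score` is a minimal influence score, and the stopping rule — keep the best-scoring subset seen while dropping — is the usual convention:

```python
import numpy as np

def i_score(X, y):
    """Minimal influence score: partition by joint levels of X's columns."""
    y = np.asarray(y, dtype=float)
    if y.var() == 0:
        return 0.0
    _, cell = np.unique(np.asarray(X), axis=0, return_inverse=True)
    nj = np.bincount(cell)                         # cell sizes
    ybarj = np.bincount(cell, weights=y) / nj      # cell means
    return float(np.sum(nj ** 2 * (ybarj - y.mean()) ** 2) / (len(y) * y.var()))

def backward_drop(X, y):
    """Start with all variables; at each step drop the variable whose
    removal yields the highest IScore; return the best subset seen."""
    keep = list(range(X.shape[1]))
    best_set, best = list(keep), i_score(X, y)
    while len(keep) > 1:
        # score every candidate drop, take the best one
        s, d = max((i_score(X[:, [v for v in keep if v != c]], y), c)
                   for c in keep)
        keep.remove(d)
        if s >= best:
            best, best_set = s, list(keep)
    return best_set, best
```

On a pure two-variable interaction (e.g. XOR) with a noise variable, the procedure drops the noise variable first and keeps the interacting pair.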

Example 1 - Genetics Motivation: 370 individual genetic variants (SNPs) have been identified as associated with breast cancer. However, these SNPs are not predictive of breast cancer risk and fail to explain the incidence of the disease among patients. This lack of predictive power stems from the enormous size of genetic data, which rules out many of the computationally complex algorithms needed to identify complicated relationships between variables such as SNPs and breast cancer. As a result, most researchers can only test whether individual SNPs are correlated with the disease and report the ones with high statistical significance. But if gene groups, or SNP sets, in fact determine the incidence of breast cancer among patients, then a SNP-by-SNP sweep of the genome will miss important information relevant to a patient's risk for the disease.

Example 1 - Genetics GWAS Data

Example 1 - Genetics

Example 1 - Genetics Reference: www.synapse.org/#!Synapse:syn5605838/wiki/392024

Example 2 - Spam or Ham?

Example 2 - Spam or Ham? Document classification: the task of assigning a document to one or more classes or categories. The documents may be texts, images, music, etc., and each kind poses its own classification problems. When not otherwise specified, text classification is implied.

Example 2 - Spam or Ham? Features? Feature selection? Algorithm?

Example 2 - Spam or Ham? Features: word counts (a term frequency matrix). Feature selection: we wish to find the "bag of words" that is predictive for classifying spam vs. ham, using the IScore to find interactions of words.
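
For concreteness, a term frequency matrix can be built from tokenized documents like this (illustrative Python sketch; the tokenization and corpus are placeholders):

```python
from collections import Counter

def term_frequency_matrix(docs):
    """Build a dense term-frequency matrix from tokenized documents.
    Rows correspond to documents, columns to vocabulary words."""
    vocab = sorted({w for doc in docs for w in doc})
    matrix = []
    for doc in docs:
        counts = Counter(doc)               # word -> count in this doc
        matrix.append([counts[w] for w in vocab])
    return vocab, matrix
```

Each column of this matrix is then a candidate variable for IScore-based feature selection.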

Example 2 - Spam or Ham? Pre-process

Example 2 - Spam or Ham? Data Background — the Enron scandal. Enron was one of the world's leading energy companies; it declared bankruptcy in December 2001, which was followed by numerous investigations. During the investigation, the original Enron email dataset, consisting of 619,446 email messages, was posted to the Web by the Federal Energy Regulatory Commission in May 2002. Later, duplicated mails were deleted, and others were removed at the request of Enron employees. The version available today, known as the March 2, 2004 version, is widely used by researchers.

Example 2 - Spam or Ham?
- Convert the term frequencies into 3 levels — rare, frequent, and very frequent
- Run the 6-way interaction search 10,000 times using the backward-dropping method
- Sort the results by IScore
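
The first step, discretizing term frequencies into 3 levels, might look like this in Python (the quantile cutoffs here are an illustrative assumption, not necessarily the talk's actual choice):

```python
import numpy as np

def discretize_tf(tf, cuts=(0.5, 0.9)):
    """Map raw term frequencies to 3 levels: 0 = rare, 1 = frequent,
    2 = very frequent, using quantile cutoffs over the observed values."""
    tf = np.asarray(tf, dtype=float)
    lo, hi = np.quantile(tf, cuts[0]), np.quantile(tf, cuts[1])
    # values below lo -> 0, between lo and hi -> 1, at or above hi -> 2
    return np.digitize(tf, [lo, hi])
```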

Example 2 - Spam or Ham? Naive Bayes: a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features.
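
As a sketch of how such a classifier works, here is a multinomial naive Bayes with Laplace smoothing in Python (the tiny corpus in the test is made up; this assumes both classes appear in the training data):

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Train a multinomial naive Bayes classifier.
    docs: list of token lists; labels: 0 (ham) or 1 (spam)."""
    counts = {0: Counter(), 1: Counter()}   # per-class word counts
    priors = {0: 0, 1: 0}                   # per-class doc counts
    for doc, lab in zip(docs, labels):
        counts[lab].update(doc)
        priors[lab] += 1
    vocab = set(counts[0]) | set(counts[1])
    total = {c: sum(counts[c].values()) for c in (0, 1)}

    def predict(doc):
        best, best_score = None, -math.inf
        for c in (0, 1):
            # log prior + sum of log likelihoods (Laplace-smoothed)
            s = math.log(priors[c] / len(docs))
            for w in doc:
                s += math.log((counts[c][w] + 1) / (total[c] + len(vocab)))
            if s > best_score:
                best, best_score = c, s
        return best

    return predict
```

The independence assumption lets the likelihood factor into per-word terms, which is what makes the model cheap to train on large term frequency matrices.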

Example 2 - Spam or Ham? Logistic Regression: developed by statistician David Cox in 1958; it estimates the probability of a binary response based on one or more predictor (independent) variables (features), and can be seen as a special case of the generalized linear model, analogous to linear regression.
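
A bare-bones gradient-descent fit illustrates the model (a sketch only; in practice one would use R's `glm` or a library solver):

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Fit logistic regression by plain gradient descent on log-loss."""
    X = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))       # predicted P(y = 1)
        w -= lr * X.T @ (p - y) / len(y)       # gradient of mean log-loss
    return w

def predict_logistic(w, X):
    X = np.column_stack([np.ones(len(X)), X])
    return (1.0 / (1.0 + np.exp(-X @ w)) > 0.5).astype(int)
```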

Example 2 - Spam or Ham? 1 recorded xls subject 2 star bryan enron 3 deal 4 owners oasis hplo 5 issue meter 6 7 htm whether free 8 apply beaumont determine book similar gas 9 times eastrans 10 exchange growth green

Example 3 - Citrus Hill or Minute Maid Which one would you buy?

Example 3 - Citrus Hill or Minute Maid Recommendation systems predict the 'rating' or 'preference' that a user would give to an item. They typically produce a list of recommendations in one of two ways:
- Collaborative filtering builds a model from a user's past behavior (items previously purchased or selected and/or numerical ratings given to those items) as well as similar decisions made by other users; this model is then used to predict items (or ratings for items) that the user may have an interest in.
- Content-based filtering utilizes a series of discrete characteristics of an item in order to recommend additional items with similar properties.
These approaches are often combined (hybrid recommender systems).
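
The collaborative-filtering idea can be sketched with a toy user-based nearest-neighbor scorer (Python illustration; the cosine-similarity measure and neighborhood size `k` are assumptions for this sketch, not details from the talk):

```python
import numpy as np

def user_based_scores(ratings, user, k=2):
    """Score unrated items for `user` from the k most similar users.
    `ratings` is a users x items matrix with 0 meaning unrated."""
    R = np.asarray(ratings, dtype=float)
    norms = np.linalg.norm(R, axis=1)
    norms[norms == 0] = 1.0
    sims = (R @ R[user]) / (norms * norms[user])   # cosine similarity
    sims[user] = -1.0                              # exclude the user themself
    neighbors = np.argsort(sims)[-k:]              # k most similar users
    weights = sims[neighbors]
    # similarity-weighted average of the neighbors' ratings
    scores = weights @ R[neighbors] / max(weights.sum(), 1e-12)
    scores[R[user] > 0] = 0.0                      # only recommend unrated items
    return scores
```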

Example 3 - Citrus Hill or Minute Maid Features: week of purchase, store, price, discount, special, loyalty, price difference, sales price, list price

Example 3 - Citrus Hill or Minute Maid Use the IScore statistic with the backward-dropping algorithm to find the most highly interactive feature sets (for prediction):
- price, discount, special, list price
- price, sales price
- store, loyalty, sales price

Example 4 - Mortality MAKE A GUESS: which populations have the highest mortality rates? Location? Career? Income? Lifestyle factors? Our goal is to use the IScore to identify predictive subsets.

Example 4 - Mortality Data: National Longitudinal Mortality Study
- 1.8 million subjects, representative of the total US non-institutionalized population in 1990
- Case: death within 10 years
- Control: no death within 10 years

Example 4 - Mortality

Example 4 - Mortality

Example 4 - Mortality Cancer

Example 4 - Mortality Heart Disease