Influence in Classification via Cooperative Game Theory Amit Datta, Anupam Datta, Ariel D. Procaccia and Yair Zick (to appear in IJCAI’15)
Big Data Analysis and Transparency Big data is big business. It is “good”: it can identify trends, it produces accurate results, and it is impartial (algorithms are not inherently discriminatory). But it is not transparent! For a user (or even a data scientist!), it is hard to tell which factors determine classification outcomes.
Motivation We are given a classified dataset (flagged clients in a bank). The classifier is unknown. How important is a given feature to the classification outcome? (F,25-35,English,PA) (M,18-25,English,CO) (F,35-55,Spanish,NY) (M,25-35,English,PA) (F,18-25,Spanish,PA) (M,18-25,Spanish,PA) (M,25-35,Spanish,PA) (F,35-55,Spanish,PA) (M,18-25,Spanish,PA)
Methodology Feature selection: learn a classifier, see which features add the most information.
▫ Are we choosing the right classifier to learn? Can be very complex.
▫ Some classifiers have no intuitive notion of feature importance (e.g. decision trees).
▫ Requires a lot of knowledge about the dataset (what happens when features are removed).
Notation
Ideas from Game Theory
Causality
Axiomatic Approach A measure is state symmetric if relabeling of states does not change its value. A measure is feature symmetric if relabeling of features does not change their value.
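The two symmetry properties can be checked on a toy example. The sketch below assumes one simple counting measure (the number of ordered pairs of data points that differ only in the given feature and carry different labels). The function names `differs_only_in` and `influence`, the two-feature dataset, and its 0/1 labels are all hypothetical, chosen for illustration rather than taken from the slides.

```python
def differs_only_in(a, b, feature):
    """True if points a and b differ in `feature` and agree everywhere else."""
    return a[feature] != b[feature] and all(
        x == y for j, (x, y) in enumerate(zip(a, b)) if j != feature
    )


def influence(dataset, feature):
    """Sum |label(a) - label(b)| over ordered pairs differing only in `feature`."""
    return sum(
        abs(label_a - label_b)
        for a, label_a in dataset.items()
        for b, label_b in dataset.items()
        if differs_only_in(a, b, feature)
    )


# Hypothetical labeled dataset: (gender, language) -> flagged (1) or not (0).
data = {
    ("F", "English"): 0,
    ("M", "English"): 0,
    ("F", "Spanish"): 1,
    ("M", "Spanish"): 1,
}

# State relabeling: rename the language states without touching the labels.
rename = {"English": "EN", "Spanish": "ES"}
relabeled = {(g, rename[lang]): y for (g, lang), y in data.items()}

# Feature relabeling: swap the order of the two features.
swapped = {(lang, g): y for (g, lang), y in data.items()}

# State symmetry: the influence of each feature is unchanged by the renaming.
assert all(influence(data, i) == influence(relabeled, i) for i in range(2))
# Feature symmetry: each feature keeps its influence under its new index.
assert influence(data, 0) == influence(swapped, 1)
assert influence(data, 1) == influence(swapped, 0)
```

In this toy data the label depends only on language, so gender receives zero influence and language a positive one, regardless of how states or features are named.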
Axiomatic Approach Bad news… Standard notions will not immediately work.
Relation to Linear Classifiers High weight translates to high influence!
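A small experiment can illustrate the claim for one particular case. The sketch below assumes binary features, a threshold classifier over a weighted sum, and a simple flip-counting notion of influence (how many points change label when one feature is flipped). The weights, the threshold, and the names `classify` and `flip_influence` are hypothetical and not taken from the slides.

```python
from itertools import product


def classify(weights, threshold, x):
    """Linear threshold classifier: 1 if the weighted sum reaches the threshold."""
    return int(sum(w * xi for w, xi in zip(weights, x)) >= threshold)


def flip_influence(weights, threshold, feature):
    """Count points whose label changes when `feature` is flipped from 0 to 1."""
    count = 0
    for x in product((0, 1), repeat=len(weights)):
        if x[feature] == 1:
            continue  # visit each pair of neighbors once, from its 0-side
        flipped = x[:feature] + (1,) + x[feature + 1:]
        if classify(weights, threshold, x) != classify(weights, threshold, flipped):
            count += 1
    return count


# Hypothetical weights, listed from largest to smallest, and a hypothetical threshold.
weights = (3, 2, 1)
threshold = 3

influences = [flip_influence(weights, threshold, i) for i in range(len(weights))]

# A feature with a larger weight never has a smaller flip count here.
assert influences[0] >= influences[1] >= influences[2]
```

With these example numbers the feature with the largest weight flips the most labels, which matches the slide's statement for this instance; it is a sanity check on one classifier, not a proof.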
Extensions
Implementation To test our measure’s behavior, we measure influence on a generated dataset. We employ the AdFisher framework [Datta et al. 2014] to create fake Google user profiles and observe the ads they are shown.
Top Ads for Age
Title/Ad Description | Influence
Buy Home For Taxes Owed/Or Get 18-36% Interest! Watch 8min Video That Explains All |
Jim Rickards Project 2015/Economist, Jim Rickards explains the coming economic crash |
“My Insomnia Trick”/Naturally Fall Asleep Fast, Stay Asleep All Night – Wake Up Refreshed |
Get In Now With Graphene/Money-Making Mineral Set To Launch Can Shape The World And Your Wealth |
Sciatica Exercises?/Stop: What You MUST know Before attempting to Treat your Sciatica: |
Statistic | Value
Mean |
Median | 0.031
StdDev | 0.0144
Top Ads for Gender
Title/Ad Description | Influence
Jim Rickards Project 2015/Economist, Jim Rickards explains the coming economic crash |
Buy Home For Taxes Owed/Or Get 18-36% Interest! Watch 8min Video That Explains All |
Tech Gadgets/Daily Deals on Modern Gadgets. Exclusive Pricing - Up To 70% Off |
Get In Now With Graphene/Money-Making Mineral Set To Launch Can Shape The World And Your Wealth |
Elabore su Presupuesto/Nuestros Consejeros Certificados Están listos para ayudarlo |
Statistic | Value
Mean |
Median |
StdDev | 0.0161
Top Ads for Language
Title/Ad Description | Influence
Elabore su Presupuesto/Nuestros Consejeros Certificados Están listos para ayudarlo |
The Greatest Penny Stocks/Get free daily penny stock alerts. Join now. New pick out soon |
Business Leads CRM/Business Lead Manager, Dialer, CRM. 400% Boost in Conversion Rates |
Get In Now With Graphene/Money-Making Mineral Set To Launch Can Shape The World And Your Wealth |
Buy Home For Taxes Owed/Or Get 18-36% Interest! Watch 8min Video That Explains All |
Statistic | Value
Mean | 0.033
Median |
StdDev | 0.024
Findings The overall influence of specific features over ads is somewhat limited (except for language). Ads seem to be targeted at specific subsets (e.g. young men and elderly women). Further (more refined) measurements on a larger dataset are needed.
Future Work Beyond single state changes (what is the minimal number of changes to others’ states that we need in order to effect a change in value?); necessary if we want to use our measure on datasets where we cannot control the features. What happens when there are priors on data? White-box vs. black-box analysis. Thank you! Questions?