Free for All! Assessing User Data Exposure to Advertising Libraries on Android
John Ramirez

Introduction
Advertisers want to turn their ad impressions into ad conversions.
Ad networks help do this by matching ads to users.
To fully assess the risk of an ad library, all of its potential behaviour needs to be taken into account.
Four major attack channels: protected APIs, files generated at runtime, user input, and unprotected APIs.
Ad impression: an ad is presented to a user. Ad conversion: a user interacts with an ad, e.g. by clicking it.
If a user exercises often or has lots of exercise apps, companies show ads relating to that.
Protected APIs: accessed through permissions, e.g. the GET_ACCOUNTS permission.
Runtime files: store app data, e.g. a weight loss app could track your weight, eating habits, etc.
User input: personal data typed in by the user.
Unprotected APIs: need no permission, e.g. PackageManager.getInstalledApplications().

Introduction
Pluto framework:
Allows easier assessment of privacy risk.
Helps developers assess data exposure risk.
Uses Natural Language Processing (NLP) and machine learning.
Pluto is used to estimate how much data an app could potentially expose, as this is not something easily known.

Background - Mobile Advertising
Developers monetise their apps through ads.
Data brokers have their ad libraries incorporated into applications.
Data brokers collect user data and maintain profiles.
Profiles can be made more complete with user attributes and interests.
Data brokers collaborate with advertisers to create more suitable ads.
Applications are free, but the user is forced to watch an ad once in a while, e.g. Candy Crush.
These profiles can be used to sell to specific groups of users.
Selling these profiles lets advertisers create more tailored ads, which users are more likely to click, generating revenue.

Background - Android Protection Mechanisms
Apps are given a UID and a PID → this extends to their ad libraries.
A host app shares its permissions with its ad libraries.
The Linux DAC security system allows generated local files to be accessed by the host app.
Ad libraries have already been collating user info without user knowledge.
An ad library gets the same UID as its host app, and the same PID when the app is run.
To the OS, the ad library is the same as the host app because they share these identifiers.
This means that if the app is granted a permission, the ad library can use it too, e.g. ACCESS_COARSE_LOCATION to get the location.
The generated local files can also be accessed by the ad libraries.

Background - Natural Language Processing (NLP)
Data miners utilise NLP to determine if words are data points.
NLP helps determine what part of speech a word is.
Targeted data can be quite vague, so NLP is needed to determine the semantic meaning.
WordNet (an English semantic dictionary) is commonly used in NLP.
A similarity metric is used to determine if two words are associated.
Targeted data refers to the information wanted by ad libraries.
E.g. the word "exercise" can refer to a workout exercise or an educational exercise.
The graph shows the associations between words and their hierarchical nature.
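
To make the idea of a similarity metric over a word hierarchy concrete, here is a minimal sketch. It does not use WordNet itself: the tiny `HYPERNYM` table is made up for illustration, and `path_similarity` is a hypothetical helper that scores two words as 1 / (1 + number of edges between them), in the style of path-based similarity measures.

```python
# Toy hierarchy (an assumption for illustration, not real WordNet data):
# each word points to its more general parent concept.
HYPERNYM = {
    "jogging": "workout",
    "yoga": "workout",
    "workout": "activity",
    "homework": "lesson",
    "lesson": "activity",
}


def ancestors(word):
    """Return the chain from a word up to the root of the toy hierarchy."""
    chain = [word]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


def path_similarity(a, b):
    """Score two words as 1 / (1 + edges on the path through their closest shared ancestor)."""
    chain_a, chain_b = ancestors(a), ancestors(b)
    for steps_a, node in enumerate(chain_a):
        if node in chain_b:
            return 1 / (1 + steps_a + chain_b.index(node))
    return 0.0  # no shared ancestor in the toy hierarchy


print(path_similarity("jogging", "yoga"))      # siblings under "workout"
print(path_similarity("jogging", "homework"))  # only related via "activity"
```

In this toy hierarchy the two workout words score higher than the workout/education pair, which is the kind of signal a miner could use to decide whether two words are associated.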

Threat Model
Risk: potential compromise of an asset through the exploitation of a vulnerability by a threat.
Attack channels are divided into two categories: in-app and out-app.
In-app: the attack channels depend on the ad library’s host app.
Out-app: the attack channels are independent of the host app.
In this context, the asset is the targeted user data, the vulnerability is one of the attack channels, and the threat is an opportunistic ad library.
In-app channels: protected APIs, local files and user input.
Out-app channels: unprotected APIs.

Data Exposure through In-App Channels
A manual inspection of apps found that several user data points were exposed through in-app channels.
These data points and their values can be used to show specific ads.

Application (Downloads) | Data points exposed
I’m Pregnant (1-5 million) | weight, height, current pregnancy month and day
Diabetes Journal ( thousand) | birth date, gender, first/last name, weight/height, blood glucose levels
TalkLife (10-50 thousand) | address, birth date, first name, password in plain text

A data point is a category of targeted data. Data point: gender; data point value: Bob is male.
Mention that this exposure happens through protected APIs, local files (e.g. SharedPreferences) and user input.
The pregnancy app's data allows maternity clothes/baby clothes ads to be shown to its users.

Data Exposure through In-App Channels (cont.)
In this study, inspecting apps by examining local files and protected APIs is referred to as Level One Inspection (L1-I) (left graph).
Inspecting apps as above but also eavesdropping on user input is referred to as Level Two Inspection (L2-I) (right graph).
Mention how lots of data points were exposed through just these two kinds of inspection.
The graphs show how many applications exposed each specific data point when inspected at each level.
The LHS graph shows the phone, [...] and address data points exposed.
The RHS graph shows that first and last name, age, gender, [...], phone and address could be retrieved.
Graph captions: manual inspection of 262 applications; manual inspection of 35 applications.

Data Exposure through Out-App Channels
Channels independent of the host app can be utilised without permissions.
Public APIs such as getInstalledPackages() and getInstalledApplications() are not considered harmful by the Android Open Source Project.
App bundles can be retrieved and exploited by ad libraries.
12.54% of the apps examined (318/2535) incorporated ad libraries that called either method.
An app bundle is the list of apps installed on a device, e.g. if you have lots of food/cooking apps, your ads will be tailored to that.
App bundle retrieval is an issue because some app companies do not explicitly tell the user; children’s apps such as Radio Disney do not get the permission of parents.
This can create more personal ads.

Pluto Framework
Modular framework that estimates in-app and out-app targeted data exposure for a given app.
In-app: locally generated files, app layout and string resource files, manifest file.
Out-app: app bundles, machine learning.
In-app focuses on the items listed above.
Out-app focuses on other device data and makes inferences from it.

In-App Pluto
Dynamic Analysis Module (DAM):
Runs the app in an emulator.
Decompiles the app & extracts created/manifest/layout/resource/runtime-generated files.
Data Miners:
User attributes and interests serve as the matching goal.
The matching goal is reached when a data point is present in a file.
A Context Disambiguation Layer determines whether or not a match is valid (uses droidLESK).
In-app Pluto is meant to simulate what an ad library could do.
It can be configured for level 1 inspection (protected APIs, local files) or level 2 inspection (+ user input).
Talk about the monkey that simulates user input.
4 different types of miners: GMiner, MMiner, DBMiner and XMLMiner.
The Context Disambiguation Layer attaches the user’s interest to the application category.
An example for the Context Disambiguation Layer is the word “exercise”, which could refer to either a workout or education.
The Aggregator removes duplicates and returns the final set of data points.
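
The miner-and-aggregator flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Pluto's code: the file contents, the `DATA_POINTS` set and the function names are hypothetical, and plain key matching stands in for the NLP-based matching and droidLESK disambiguation.

```python
# Hypothetical matching goal: the data points an ad library might look for.
DATA_POINTS = {"gender", "weight", "age"}


def mine_file(entries, data_points):
    """Return the data points whose name appears as a key in one key-value file."""
    return [key.lower() for key in entries if key.lower() in data_points]


def aggregate(matches_per_file):
    """Merge the matches from all files and drop duplicates."""
    found = set()
    for matches in matches_per_file:
        found.update(matches)
    return sorted(found)


# Made-up contents of two app-generated files (e.g. preference files).
files = [
    {"gender": "male", "theme": "dark"},
    {"Weight": "80", "gender": "male"},
]

result = aggregate(mine_file(f, DATA_POINTS) for f in files)
print(result)  # ['gender', 'weight']
```

"gender" appears in both files but is reported once, which is the Aggregator's job in the description above.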

Out-App Pluto
Aims to estimate potential data exposure through the list of installed apps.
Co-Installation Patterns (CIP):
Discovers associations between apps.
Frequent Pattern Mining (FPM) algorithms are used to discover the associations.
Confidence is used to determine whether an association between two apps is worth presenting.
Classifiers:
Take in the output of the CIP module and present a set of learned attributes.
Predict whether or not apps are indicative of a user attribute/interest.
Confidence is the chance that one app is installed given that another app is installed: conf(x, y) = 0.8 means that if x is installed, there’s an 80% chance that y is installed.
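
As a small worked example of the confidence measure, the sketch below computes conf(x, y) as (devices with both x and y) / (devices with x). The app names and install lists are made up, and `confidence` is a hypothetical helper rather than part of Pluto.

```python
def confidence(installs, x, y):
    """Return conf(x, y): the fraction of devices with x installed that also have y."""
    with_x = [apps for apps in installs if x in apps]
    if not with_x:
        return 0.0  # x never appears, so no association can be measured
    with_both = sum(1 for apps in with_x if y in apps)
    return with_both / len(with_x)


# Hypothetical app bundles, one set of installed apps per device.
INSTALLS = [
    {"run_tracker", "calorie_counter"},
    {"run_tracker", "calorie_counter", "radio"},
    {"run_tracker", "calorie_counter"},
    {"run_tracker", "calorie_counter", "chess"},
    {"run_tracker", "chess"},
    {"radio", "chess"},
]

print(confidence(INSTALLS, "run_tracker", "calorie_counter"))  # 0.8
print(confidence(INSTALLS, "calorie_counter", "run_tracker"))  # 1.0
```

Five devices have run_tracker and four of those also have calorie_counter, giving 0.8. The second call shows that confidence is not symmetric: conf(x, y) and conf(y, x) can differ.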

Criticisms & Recommendations
The similarity metric used by the data miners only recognises conventions like snake_case and camelCase, i.e. it would recognise userProfile but not userprof.
More research into other types of naming conventions is needed.
Only four attack channels were looked into for this study.
Expand into other potential attack channels and extend Pluto to cover these.
Size of study was limited: 2535 unique apps for manual inspection, and a survey of 243 participants with distinct package names collected.
2.8 million apps on the Play Store (as of March 2017)1
Knowing this limitation of the similarity metric, developers can give their data points obscure names.
These four channels were the major channels, but there are more channels out there, such as camera, gyroscope and microphone.
The study is not representative of the whole Google Play Store, not to mention third-party stores.
1https://
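
The naming-convention limitation can be illustrated with a rough sketch. It assumes identifiers are split on underscores and camelCase boundaries before being compared with a target word; `split_identifier` and `matches` are hypothetical helpers, and exact word matching stands in for the similarity metric.

```python
import re


def split_identifier(name):
    """Split an identifier into lowercase words on underscores and camelCase boundaries."""
    words = []
    for chunk in name.split("_"):
        # Put a space wherever a lowercase letter or digit is followed by an uppercase letter.
        spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", chunk)
        words.extend(word.lower() for word in spaced.split())
    return words


def matches(name, target):
    """Return True if the target word is one of the words inside the identifier."""
    return target in split_identifier(name)


print(split_identifier("userProfile"))   # ['user', 'profile']
print(split_identifier("user_profile"))  # ['user', 'profile']
print(matches("userProfile", "profile"))  # True
print(matches("userprof", "profile"))     # False
```

The abbreviated, unseparated name userprof stays a single unknown token, so a word-based matcher never sees "profile" in it.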

Criticisms & Recommendations (cont.)
In this study, if an application crashed the emulator, it was removed from the study.
The authors didn’t try to determine the cause of the crash.
Maybe Pluto doesn’t work well with apps that hide their ad libraries.
More research is needed in this area, since obfuscated apps could pose an issue for Pluto.

Thank you! At this point just hope they don’t ask questions
