Tracing Information Flows Between Ad Exchanges Using Retargeted Ads Christo Wilson Assistant Professor Northeastern University To appear at Usenix Security.

Slides:



Advertisements
Similar presentations
Protecting Browser State from Web Privacy Attacks Collin Jackson, Andrew Bortz, Dan Boneh, John Mitchell Stanford University.
Advertisements

This work was supported by the TRUST Center (NSF award number CCF ) 1 Introduction Within the last few years, online advertisement companies have.
Unit 11 Using the Internet & Browsing the Web.  Define the Internet and the Web  Set up & troubleshoot an Internet connection  Categorize webs sites.
Back to Table of Contents
Georgios Kontaxis, Michalis Polychronakis Angelos D. Keromytis, Evangelos P. Markatos Siddhant Ujjain (2009cs10219) Deepak Sharma (2009cs10185)
On the Incoherencies in Web Browser Access Control Policies Authors: Kapil Singh, et al Presented by Yi Yang.
Creating Architectural Descriptions. Outline Standardizing architectural descriptions: The IEEE has published, “Recommended Practice for Architectural.
ASP.NET 2.0 Chapter 6 Securing the ASP.NET Application.
Lecture 21: Privacy and Online Advertising. References Challenges in Measuring Online Advertising Systems by Saikat Guha, Bin Cheng, and Paul Francis.
The Privacy Tug of War: Advertisers vs. Consumers Presented by Group F.
Chapter 9 Collecting Data with Forms. A form on a web page consists of form objects such as text boxes or radio buttons into which users type information.
HTTP: cookies and advertising Concepts to cover:  web page content (including ads) from multiple site: composition at client  cookies  third-party cookies:
By: Justin Mauss Privacy vs. Convenience. Agenda Finding the Balance: Privacy vs. Convenience Revisit Privacy vs. Convenience Overview of Online Tracking.
Presentation by Kathleen Stoeckle All Your iFRAMEs Point to Us 17th USENIX Security Symposium (Security'08), San Jose, CA, 2008 Google Technical Report.
FALL 2012 DSCI5240 Graduate Presentation By Xxxxxxx.
Pre-Programmatic automated buying model The Ad Network Model Aggregate and automated media buying across websites AD Advertising agency Advertiser.
I Do Not Know What You Visited Last Summer: Protecting users from stateful third-party web tracking with TrackingFree browser Xiang Pan §, Yinzhi Cao †,
WS-Security: SOAP Message Security Web-enhanced Information Management (WHIM) Justin R. Wang Professor Kaiser.
1 Tradedoubler & Mobile Mobile web & app tracking technical overview.
Creating a User ID (1) User makes any HTTP request
Cloak and Dagger: Dynamics of Web Search Cloaking David Y. Wang, Stefan Savage, and Geoffrey M. Voelker University of California, San Diego 左昌國 Seminar.
Chapter 8 Cookies And Security JavaScript, Third Edition.
Display & Remarketing What You Need to Know. PROPRIETARY AND CONFIDENTIAL / COPYRIGHT © 2013 BE FOUND ONLINE, LLC 2 WHAT IS DISPLAY?
AppNexus Classroom Training Demand-Side Setup Intended Audience: Buyers new to AppNexus Console v11. July 26, 2012.
By Elijah Redding CIS 150. A New Kind of Ad: Customized to Your Interests With the internet being one of the top choices for advertising, marketers are.
Georgios Kontaxis‡, Michalis Polychronakis‡, Angelos D. Keromytis‡, and Evangelos P.Markatos* ‡Columbia University and *FORTH-ICS USENIX-SEC (August, 2012)
BEHAVIORAL TARGETING IN ON-LINE ADVERTISING: AN EMPIRICAL STUDY AUTHORS: JOANNA JAWORSKA MARCIN SYDOW IN DEFENSE: XILING SUN & ARINDAM PAUL.
I Do Not Know What You Visited Last Summer: Protecting users from stateful third-party web tracking with TrackingFree browser Xiang Pan, Northwestern University.
Search Engine using Web Mining COMS E Web Enhanced Information Mgmt Prof. Gail Kaiser Presented By: Rupal Shah (UNI: rrs2146)
Xinyu Xing, Wei Meng, Dan Doozan, Georgia Institute of Technology Alex C. Snoeren, UC San Diego Nick Feamster, and Wenke Lee, Georgia Institute of Technology.
DIGITAL ADVERTISING Standard 4. THE ROLE OF DIGITAL ADVERTISING IS TO INCREASE SALES OR IMPROVE BRAND AWARENESS.
Bloom Cookies: Web Search Personalization without User Tracking Authors: Nitesh Mor, Oriana Riva, Suman Nath, and John Kubiatowicz Presented by Ben Summers.
ARE YOU SURE YOU WANT TO CONTACT US? On the privacy risks at website contact pages UISGCON, December 2015 Alex Starov.
Don’t Follow me : Spam Detection in Twitter January 12, 2011 In-seok An SNU Internet Database Lab. Alex Hai Wang The Pensylvania State University International.
Introduction Web analysis includes the study of users’ behavior on the web Traffic analysis – Usage analysis Behavior at particular website or across.
Programmatic Buying Simplified
1 DATA-DRIVEN SOLUTIONS. 2 KEYWORD-LEVEL SEARCH RETARGETING TARGET USERS BASED ON THEIR RECENT SEARCH HISTORY AND SEARCH QUERIES. A user performs a search.
Some from Chapter 11.9 – “Web” 4 th edition and SY306 Web and Databases for Cyber Operations Cookies and.
Bridging the Disconnect Between Web Privacy and User Perception Mike Perry W3C Identity May 24, 2011.
What mobile ads know about mobile users
What is Google Analytics?
CSCE 548 Student Presentation Ryan Labrador
The Invisible Trail: Third-Party Tracking on the Web
Information Systems for Managers Assignment FACEBOOK
Lesson # 9 HP UCMDB 8.0 Essentials
Automated Experiments on Ad Privacy Settings
Anonymous Communication
Unit 12 Using the Internet & Browsing the Web
Publishers Sell space on their website
Multi Rater Feedback Surveys FAQs for Participants
Multi Rater Feedback Surveys FAQs for Participants
Article Authors – Oleksii Starov & Nick Nikiforakas
PIXELS! 12/3/2015.
Are these Ads Safe: Detecting Hidden A4acks through Mobile App-Web Interfaces Vaibhav Rastogi, Rui Shao, Yan Chen, Xiang Pan, Shihong Zou, and Ryan Riley.
563.10: Bloom Cookies Web Search Personalization without User Tracking
EVOLVING IP ISSUES IN BRAND PROTECTION
What is Cookie? Cookie is small information stored in text file on user’s hard drive by web server. This information is later used by web browser to retrieve.
Distributed Systems CS
Anonymous Communication
Searching for Truth: Locating Information on the WWW
Behavioral retargeting (also known as behavioral remarketing, or simply, retargeting) is a form of online targeted advertising by which online advertising.
HTML5 and Local Storage.
Searching for Truth: Locating Information on the WWW
Searching for Truth: Locating Information on the WWW
HTML5 and Local Storage.
Recitation on AdFisher
Mobile Security Evangelos Markatos FORTH-ICS and University of Crete
Anonymous Communication
Cross Site Request Forgery (CSRF)
A Longitudinal Analysis of the ads.txt Standard
Presentation transcript:

Tracing Information Flows Between Ad Exchanges Using Retargeted Ads Christo Wilson Assistant Professor Northeastern University To appear at Usenix Security 2016

Your Privacy Footprint 2 Cookie matching

The Modern Ad Ecosystem Modern ad exchanges use Real-Time Bidding (RTB) auctions Advertisers bid on each impression Key requirement: advertisers must be able to identify users before bidding Why would you bid on user you’ve never seen before? Cookie matching allows ad exchanges to ID users in the auctions Cookie matching fundamentally alters the definition of “privacy footprint” Advertisers are not isolated islands of data By sharing identifiers, ad exchanges can share data on users Browsing history Inferred demographic attributes 3

Challenges Several studies have examined cookie matching Acar et al. found hundreds of domains passing identifiers to each other Olejnik et al. found 125 exchanges matching cookies Falahrastegar et al. analyzed clusters of exchanges that share the exact same cookies Limitation: these studies focus on specific mechanisms Ad exchanges that share exact cookies or follow known URL patterns However, DoubleClick hashes cookies before sharing (obfuscation) Limitation: limited ability to attribute matches HTTP Referer always appears to be the 1 st party domain… … even if a request is triggered by JavaScript from a 3 rd party Limitation: only examined information flows through the client-side What if ad exchanges trade information on the server-side? 4

Goals Develop a method to identify information flows (cookie matching) between ad exchanges Mechanism agnostic: resilient to obfuscation Platform agnostic: detect sharing on the client- and server-side Strong attribution: who initiated and received each flow Build a graph of the information sharing relationships between ad exchanges Superset of what is currently known about the ad ecosystem Gives a truer picture of users’ privacy footprint Ultimately, use the data to develop stronger privacy-preserving technologies for users 5

Using Retargets as a Detection Tool Retargeted ads are the most highly targeted for of online ads Very expensive: around $1 per impression vs. >$0.01 for contextual ads 6 Key insight: because retargets are so specific, they can be used to conduct controlled experiments Information must be shared between ad exchanges to serve retargeted ads

Contributions 1.Present a novel methodology for identifying information flows between ad exchanges Leverages controlled experiments, retargeted ads, and crowdsourcing Meets our requirements (e.g. being mechanism agnostic) 2.Demonstrate the limitations of prior work on cookie matching 31% of cookie matching partners cannot be identified using heuristics 3.Develop a method to categorize information sharing relationships Present conclusive evidence that Google shares tracking data between its different domains (e.g. YouTube and DoubleClick) 4.Use graph analysis to infer the roles of actors in the ad ecosystem Supply-side vs. Demand-side 7

The Online Ad Ecosystem Data Collection Classifying Ad Network Flows Results 8

How Do Ads Get Served? 9 GET, CNN’s Cookie GET, DoubleClick’s Cookie Advertisement301 Redirect GET, Criteo’s Cookie Advertisement User Publisher Ad Network Advertiser This is not how ads get served in 2016.

Real Time Bidding (RTB) 10 GET, CNN’s Cookie GET, Rubicon’s Cookie GET, DoubleClick’s Cookie User Publisher Ad Exchange Demand-Side Platforms (DSPs) Supply-Side Platform (SSP) Solicit bids, DoubleClick’s Cookie GET, RightMedia’s Cookie Advertisement Bid DSPs cannot read their cookie! How can they bid if they cannot identify the user?

Cookie Matching Key problem: DSPs cannot read their cookies in the RTB auction How can they submit reasonable bids if they cannot identify the user? Solution: cookie matching Also known as cookie synching Process of linking the identifiers used by two ad exchanges 11 GET, Cookie=12345 GET ?dblclk_id=12345, Cookie=ABCDE 301 Redirect, Location=

The Online Ad Ecosystem Data Collection Classifying Ad Network Flows Results 12

Goals, Revisited Uncover the hidden links between ad exchanges… … in a way that is mechanism agnostic Robust against obfuscation (e.g. cookie hashing) Works even if the information exchange occurs server-side 13

Retargeted Ads 14 RTB and cookie matching enable highly targeted advertisements The most specific ad type are retargeted ads Served to users who view a specific product on specific website Cost up to $1 per impression vs. >$0.01 for contextual ads

Using Retargets as an Experimental Tool This proves a causal flow of information from exchange  DSP Retargets are too expensive to be served to random, unknown users Flow could occur on the client- or server-side 15 Key observation: retargets are only served under very specific circumstances 1) 2) DSP observes the user at a shop DSP and the exchange must have matched cookies

Methodology Overview 1.Instrument a browser to record inclusion chains Need to know what resource caused each outgoing request E.g. was the request generated by a 3 rd party ad library? 2.Construct customer personas Each persona visits specific products at specific e-commerce sites Creates controlled opportunities to observe retargeted ads 3.Crawl ads Visit a broad swath of publishers to elicit retargeted advertisements 4.Isolate retargeted advertisements Ads will be a mix of untargeted, contextual, behavioral, and retargets Only the retargets allow us to infer causal relationships between exchanges 16

Inclusion Chains Instrumented Chromium binary that records the provenance of page elements Uses Information Flow Analysis techniques (IFA) Handles Flash, exec(), setTimeout(), cross-frame, inline scripts, etc. 17 DOM: a.com/index.htmlInclusion Chain a.com/index.html b.com/adlib.js c.net/adbox.html c.net/code.js d.org/flash.swf

Data Collection Created 90 personas to simulate shoppers Each visited 10 hand-chosen products from 10 online shops (100 total URLs) Top 10 shops in 90 Alexa online shopping categories Selected 150 publishers from the Alexa Top-1000 Randomly visited 15 URLs + homepage on each publisher Conducted 9 rounds of crawling in December 2015 Each persona visited the shops and then the publishers in each round Recorded inclusion trees, cookies, ad images, HTTP requests/responses Also included one control account Only crawls the publishers each round Doesn’t visit shops, clears cookies after each request Designed to collect untargeted and contextual ads 18

Filtering Images FilterTotal Unique Images All images from the crawlers571,636 Use EasyList to identify advertisements93,726 Remove ads that are shown to >1 persona31,850 Use crowdsourcing to locate retargets5, Personas visited non-overlapping retailers By definition, retargets should only be shown to a single persona Spent $415 uploading 1,142 HITs to Amazon Mechanical Turk Each HIT asked the worker to label 30 ad images 27 were unlabeled, 3 were known retargets (control images) All ads were labeled by 2 workers Any ad identified as targeted was also manually inspected by us

Example HIT 20

The Online Ad Ecosystem Data Collection Classifying Ad Network Flows Results 21

Final Dataset 5,102 unique retargeted ads From 281 distinct online retailers Publisher(s) on which it was observed 35,448 publisher-side chains that served the retargets Again, we observed some retargets multiple times 22 Next step: classifying each publisher-side chain Who served the retargeted ad? Was the ad served via an RTB auction? How did the ad exchange and the DSP share information about our persona?

Four Classifications Four possible ways for a retargeted ad to be served 1.Direct (Trivial) Matching 2.Cookie Matching 3.Indirect Matching 4.Latent (Server-side) Matching We encode each classification as a set of regex-like rules Rules for shopper- and publisher-side chains Rules are ordered by complexity Assume the simplest rule that matches is the correct classification We do not expect to observe all four classifications in the data No known ad exchanges use Indirect Matching However, we include these classes for sake of completeness 23

1) Direct (Trivial) Matching 24 Shopper-side Publisher-side Example Rule ^shop.*d.*$ ^pub d$ d is the DSP that serves the retarget d must appear on the shopper- side… … but other trackers may also appear No cookie matching necessary: the user was sent directly to the DSP

2) Cookie Matching 25 Shopper-side Publisher-side Example Rule ^shop.*d.*$ d must appear on the shopper-side ^pub.*e d$ ^*.*de.*$ ^*.*ed.*$ Anywhere e precedes d, which implies an RTB auction OR Transition e  d or d  e is where cookie match occurs

[^d] 3) Indirect Matching 26 Shopper-side Publisher-side Example Rule ^shop e[^d]$ Only the exchange e appears on the shopper-side… ^pub.*e d$ e must pass browsing history data to participants in the auction, thus no cookie matching is necessary We do not expect to find indirect matches in the data.

4) Latent (Server-side) Matching 27 Shopper-side Publisher-side Example Rule ^shop [^ed]$ Neither e nor d appears on the shopper-side ^pub.*e d$ d must receive information from some shopper-side tracker We find latent matches in practice!

The Online Ad Ecosystem Data Collection Classifying Ad Network Flows Results 28

Categorizing Chains TypeChains% % Direct (Trivial) Match Forward Cookie Match Backward Cookie Match Indirect Match Latent (Server-side) Match No Match Clustered As expected, most retargets are due to cookie matching Very small number of chains that cannot be categorized Suggests low false positive rate of AMT image labeling task Surprisingly large amount of indirect and latent matches… Raw Chains

Categorizing Chains TypeChains% % Direct (Trivial) Match Forward Cookie Match Backward Cookie Match Indirect Match Latent (Server-side) Match No Match Raw Chains Clustered Chains Cluster together domains by “owner” E.g. google.com, doubleclick.com, googlesyndication.com Indirect and latent matches essentially disappear The vast majority of these chains involve Google Suggests that Google does indeed use exotic techniques to identify users

Who Is Matching? Participant 1Participant 2ChainsAdsHeuristics criteo  googlesyndication  P criteo  doubleclick  E, P  DC, P criteo  adnxs  E, P criteo  rubiconproject  E, P criteo  servedbyopenx  P doubleclick  steelhousemedia36227  P  E, P mathtag  mediaforge  E, P netmng  scene  E  ? googlesyndication  adsrvr10729  P rubiconproject  steelhousemedia8630  E googlesyndication  steelhousemedia4722? adtechus  adacado3618? atwola  adacado326? adroll  adnxs318? Heuristics Key E – share exact cookies P – special URL parameters DC – DoubleClick URL parameters ? – Unknown sharing method 4.1% of chains would be entirely missed using heuristics 31% of cookie matching partners would be missed

Network Analysis DomainInOutRatiop2p2 p n-1 pnpn # of Shops# of Ads criteo mediaplex tellapart steelhousemedia DegreePosition in Chains (%) rubiconproject adnxs casalemedia advertising servedbyopenx googletagservices googleadservices googlesyndication doubleclick DSPs Exchanges Google

Conclusions and Future Work 33

Summary Existing approaches to measuring the ad ecosystem are incomplete Counting the frequency of trackers or cookies Co-occurrence graphs examining the overall footprints of trackers We develop a novel methodology to detect information flows between ad exchanges Controlled methodology enables causal inference Mechanism agnostic, detects client- and server-side flows Dataset illuminates a shady corner of the ad ecosystem Existing heuristic approaches miss 31% of matching partners Graph analysis enables automatic classification of actor roles Dataset will be open-sourced (in August, for Usenix) 34

Future Work Existing anti-trackers and ad blockers have issues Ghostery and Disconnect do not stop all trackers (e.g. 1 st parties like Facebook) Ad blockers cut off revenue from publishers Towards more balanced, privacy-preserving technology Users aren’t necessarily opposed to ads Ubiquitous tracking and profiling are the issue SafeAds: a new approach to preserving user privacy on the Web Browser extension that blocks trackers and cookie matching… … but allows compliant ad exchanges to still serve ads to users Uses our dataset of cookie matching partners to facilitate an Information Flow Control approach to protecting privacy 35

Questions? Christo Wilson 36

References Gunes Acar, Christian Eubank, Steven Englehardt, Marc Juarez, Arvind Narayanan, Claudia Diaz. “The web never forgets: Persistent tracking mechanisms in the wild.” CCS, Muhammad Ahmad Bashir, Sajjad Arshad, William Robertson, Christo Wilson. “Tracing Information Flows Between Ad Exchanges Using Retargeted Ads.” Usenix Security, Marjan Falahrastegar, Hamed Haddadi, Steve Uhlig, Richard Mortier. “Tracking personal identifiers across the web.” PAM, Lukasz Olejnik, Tran Minh-Dung, Claude Castelluccia. “Selling off privacy at auction.” NDSS,