Clickprints on the Web: Are there Signatures in Web Browsing Data?
Balaji Padmanabhan, The Wharton School, University of Pennsylvania
Yinghui (Catherine) Yang, Graduate School of Management, University of California, Davis
Signatures in technology mediated applications
- Unique typing patterns, or “keystroke dynamics” (Miller 1994; Monrose and Rubin 1997; Everitt and McOwan 2003)
  - In an experiment involving 42 user profiles, Monrose and Rubin (1997) show that, depending on the classifier used, between 80 and 90 percent of users can be automatically recognized using features such as the latency between keystrokes and the length of time different keys are pressed.
- Writeprints (Li, Zheng and Chen 2006)
  - Experiments involving 10 users on two different message boards suggest that “writeprints” could well exist, since the accuracies obtained were between 92 and 99 percent.
- Walkie Talkie? (Mäntyjärvi et al. 2005)
  - Individuals may have unique “gait” or walking patterns when they move with mobile devices.
Motivating Questions
- Do unique behavioral signatures exist in Web browsing data?
- How can behavioral signatures be learned?
- Why is this useful?
How to Decide Whether Signatures Exist
Two general methods:
- Build features and classify
  - Build features/variables to describe users’ activities
  - Learn a classifier (user ID as the dependent variable)
  - Check its accuracy on unseen data
  - Answer the question
- A patterns-based approach
  - Pick a pattern representation and search for distinguishing patterns
  - e.g. for user k, “total_time < 5 minutes and number of pages > 50” may be a unique clickprint since there is no other user for whom this is true.
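A minimal sketch of the patterns-based check, using the example pattern from the slide. The session layout (a dict with `total_time` in minutes and `num_pages`), the helper names, and the reading that a pattern "is true for a user" when at least one of that user's sessions matches are all assumptions made for illustration, not details given in the talk.

```python
from typing import Callable, Dict, List, Optional

Session = Dict[str, float]  # hypothetical layout: {"total_time": minutes, "num_pages": count}


def unique_user_for_pattern(
    sessions_by_user: Dict[str, List[Session]],
    pattern: Callable[[Session], bool],
) -> Optional[str]:
    """Return the single user with a matching session, or None if zero or several users match."""
    matching = [
        user
        for user, sessions in sessions_by_user.items()
        if any(pattern(s) for s in sessions)
    ]
    return matching[0] if len(matching) == 1 else None


def fast_heavy(session: Session) -> bool:
    # The slide's example: total_time < 5 minutes and number of pages > 50.
    return session["total_time"] < 5 and session["num_pages"] > 50


# Toy data, invented for this sketch.
toy = {
    "k": [{"total_time": 4.0, "num_pages": 60}, {"total_time": 20.0, "num_pages": 10}],
    "j": [{"total_time": 3.0, "num_pages": 12}],
}
owner = unique_user_for_pattern(toy, fast_heavy)
```

Here `owner` is the candidate owner of the clickprint. A pattern that matches several users, or none, yields `None` and would be discarded.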
The Aggregation Question
- Given a unit of analysis (click/session), how much aggregation is needed before there is enough information in each aggregation to uniquely identify a person?
- For some level of aggregation, agg, we’d like {c1, c2, …, c_agg} → user
  - Feature construction, F: {c1, c2, …, ck} → <v1, v2, …, vq, user>
  - Building a predictive model: {<v1, v2, …, vq, user>} → user = M(v1, v2, …, vq)
- Find the smallest level of aggregation agg at which unique clickprints (accuracy > threshold) exist.
- Key elements:
  - How features are constructed for a group of sessions
  - How much aggregation needs to be done
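The search for the smallest aggregation level can be sketched as a simple loop. The callback `accuracy_at` is a hypothetical stand-in for the whole pipeline (construct features at a given agg, fit the model M, measure hold-out accuracy); the toy accuracy values are invented for illustration and are not results from the talk.

```python
from typing import Callable, Optional


def smallest_agg(
    accuracy_at: Callable[[int], float],
    max_agg: int,
    threshold: float,
) -> Optional[int]:
    """Return the smallest agg in 1..max_agg whose accuracy exceeds threshold, else None."""
    for agg in range(1, max_agg + 1):
        # Strict inequality, matching "accuracy > threshold" on the slide.
        if accuracy_at(agg) > threshold:
            return agg
    return None


# Invented accuracies per aggregation level, standing in for real model evaluations.
TOY_ACCURACY = {1: 0.62, 2: 0.81, 3: 0.93, 4: 0.91, 5: 0.97}
result = smallest_agg(lambda agg: TOY_ACCURACY[agg], max_agg=5, threshold=0.90)
```

Because the loop stops at the first level that clears the threshold, it makes no assumption about how accuracy behaves at higher levels.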
An example of aggregations
Experiments and Design
- comScore Networks, 50,000 users, 1 year
- User-centric data: a session is a user’s activities across Web sites
- Created multiple data sets by combining sessions from 2, 3, 4, 5, 10, 15, 20 users (140 data sets in total)
- User selection:
  - Users with household size 1
  - Users with enough sessions for adequate out-of-sample testing: pick users with > 300 sessions in a year
  - First 2/3 sessions as training, last 1/3 sessions as hold-out
  - Same number of sessions for the selected users in each data set, to guarantee the same class prior before and after aggregation
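The split and balancing steps might look like the sketch below. It assumes each user's sessions are already in chronological order, and that balancing is done by keeping each user's earliest n sessions, where n is the smallest count among the selected users. The talk does not say how the equal counts were obtained, so that choice and both function names are assumptions.

```python
from typing import Dict, List, Tuple


def equalize_session_counts(sessions_by_user: Dict[str, List[int]]) -> Dict[str, List[int]]:
    """Keep the same number of sessions per user (assumption: truncate to the smallest count)."""
    n = min(len(sessions) for sessions in sessions_by_user.values())
    return {user: sessions[:n] for user, sessions in sessions_by_user.items()}


def temporal_split(ordered_sessions: List[int]) -> Tuple[List[int], List[int]]:
    """First 2/3 of the chronologically ordered sessions for training, last 1/3 held out."""
    cut = (2 * len(ordered_sessions)) // 3  # integer arithmetic avoids float rounding
    return ordered_sessions[:cut], ordered_sessions[cut:]


# Toy session ids standing in for real sessions; larger id means later in time.
toy = {"a": list(range(9)), "b": list(range(100, 112))}
balanced = equalize_session_counts(toy)
splits = {user: temporal_split(sessions) for user, sessions in balanced.items()}
```

Splitting by time rather than at random keeps every hold-out session later than every training session for the same user.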
Experiments and Design: The Features
- For a single session:
  (i) the duration
  (ii) the number of pages viewed
  (iii) the starting time (in seconds after 12:00 am)
  (iv) the number of sites visited
  (v) binary variables indicating whether each of the top k (= 5, 10) Web sites is visited
  Note: these top-k Web sites for each user are identified only from the training set
- For sets of sessions: create variables capturing distributions of these measures
  - Mean, median, variance, max and min for the continuous attributes
  - Frequency counts for the top Web sites
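A rough sketch of feature construction for a set of sessions, following the list above. The field names (`duration`, `pages`, `start_seconds`, `num_sites`, `sites`), the example site names, and the use of population variance (so that a group of one session still has a defined variance) are assumptions for illustration; the talk does not specify them.

```python
import statistics
from typing import Dict, List, Sequence

CONTINUOUS = ("duration", "pages", "start_seconds", "num_sites")  # hypothetical field names


def aggregate_features(group: List[dict], top_sites: Sequence[str]) -> Dict[str, float]:
    """Summarise a group of sessions as one feature vector."""
    features: Dict[str, float] = {}
    for name in CONTINUOUS:
        values = [session[name] for session in group]
        features[f"{name}_mean"] = statistics.mean(values)
        features[f"{name}_median"] = statistics.median(values)
        # Population variance is an assumption; it is defined even for a single session.
        features[f"{name}_var"] = statistics.pvariance(values)
        features[f"{name}_max"] = max(values)
        features[f"{name}_min"] = min(values)
    for site in top_sites:
        # Frequency count: number of sessions in the group that visited this top site.
        features[f"visits_{site}"] = sum(1 for session in group if site in session["sites"])
    return features


# Two invented sessions forming one aggregation of size 2.
group = [
    {"duration": 100, "pages": 5, "start_seconds": 3600, "num_sites": 2,
     "sites": {"a.example", "b.example"}},
    {"duration": 300, "pages": 15, "start_seconds": 7200, "num_sites": 4,
     "sites": {"a.example", "c.example", "d.example", "e.example"}},
]
feats = aggregate_features(group, top_sites=["a.example", "b.example"])
```

In keeping with the note on the slide, `top_sites` would be computed from the training sessions only and then reused unchanged on the hold-out sessions.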
Experiments and Design
- Classifier: J4.8 classification tree in Weka
- Model goodness: temporal hold-out samples (1/3 testing)
- Threshold accuracy: 90%; other levels were also used
- Increase the aggregation level and stop when accuracy is high enough or the stopping condition is reached
- Set agg=30 in these experiments
Results for one specific accuracy threshold
The optimal levels of aggregation averaged across 20 runs for 90% accuracy (top 10 Web sites).

# of users   Mean    % runs with agg<30
2            1.05    100%
3            1.26    95%
4            1.78    90%
5            2.16
10           4.24    85%
15           5.2     75%
20           8.9     50%
Heuristic for Large Problems: A Monotonicity Assumption
- accuracy(M | agg1) ≥ accuracy(M | agg2) whenever agg1 ≥ agg2
- In words: the goodness of the model when applied to “more aggregated” data is never worse than the goodness of the model applied to “less aggregated” data
- Can then use a binary search procedure to find the optimal agg
- Perhaps not very useful when useful agg values are much smaller, as in our problems/experiments
- Continuing to study when this may work and be useful
Conclusion
- Contribution: significance of the problem and initial results
- Challenges
  - Scale
  - What is a signature?
- On-going/future research
  - Pattern-based signature
  - Application-driven signature problems (e.g. fraud detection, personalization)
Thank you.
Related Work
- Learning user profiles online: Aggarwal et al. (1998), Adomavicius and Tuzhilin (2001), Mobasher et al. (2002)
- User profiles for fraud detection: Fawcett and Provost (1996), Cortes and Pregibon (2001)
- Data preprocessing: Cooley et al. (1999), Zheng et al. (2003)
- Online intrusion detection: Ellis et al. (2004)
Binary search for the optimal aggregation
- Start with N users’ Web sessions mixed together.
- Assume that the range of aggregations we wish to consider is 1, 2, 3, …, K sessions.
- Consider accuracy at agg = K/2
  - If this accuracy ≥ threshold, then recursively search in the lower half of the sequence
  - If this accuracy < threshold, then recursively search in the higher half of the sequence
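The procedure above can be sketched as follows. This version is iterative rather than recursive, and `accuracy_at` is a hypothetical callback standing in for training and evaluating the model at a given aggregation level. The search returns the smallest qualifying agg only if accuracy is non-decreasing in agg (the monotonicity assumption); the toy accuracies are invented and chosen to satisfy it.

```python
from typing import Callable, Optional


def binary_search_agg(
    accuracy_at: Callable[[int], float],
    max_agg: int,
    threshold: float,
) -> Optional[int]:
    """Smallest agg in 1..max_agg with accuracy >= threshold, assuming monotone accuracy."""
    lo, hi = 1, max_agg
    best: Optional[int] = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if accuracy_at(mid) >= threshold:
            best = mid       # mid qualifies; look for something smaller in the lower half
            hi = mid - 1
        else:
            lo = mid + 1     # mid fails; any qualifying agg must lie in the higher half
    return best


# Invented, non-decreasing accuracies for agg = 1..8.
TOY = [0.60, 0.70, 0.78, 0.85, 0.91, 0.94, 0.96, 0.97]
calls = []


def toy_accuracy(agg: int) -> float:
    calls.append(agg)  # record which levels were evaluated
    return TOY[agg - 1]


found = binary_search_agg(toy_accuracy, max_agg=8, threshold=0.90)
```

Each evaluation halves the remaining range, so roughly log2(K) models are built instead of up to K. If accuracy is not monotone, the search may skip over a smaller qualifying level.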
Histogram of number of sessions
Distribution of the agg values