Users’ fingerprinting techniques from TCP traffic

Slides:



Advertisements
Similar presentations
Inktomi Confidential and Proprietary The Inktomi Climate Lab: An Integrated Environment for Analyzing and Simulating Customer Network Traffic Stephane.
Advertisements

The Internet and the Web
Rob Smets A user centred approach IPv6 deployment monitoring.
Detecting Malicious Flux Service Networks through Passive Analysis of Recursive DNS Traces Roberto Perdisci, Igino Corona, David Dagon, Wenke Lee ACSAC.
Wide-scale Botnet Detection and Characterization Anestis Karasaridis, Brian Rexroad, David Hoeflin.
An Analysis of Internet Content Delivery Systems Stefan Saroiu, Krishna P. Gommadi, Richard J. Dunn, Steven D. Gribble, and Henry M. Levy Proceedings of.
The Internet Useful Definitions and Concepts About the Internet.
Web Servers How do our requests for resources on the Internet get handled? Can they be located anywhere? Global?
Unconstrained Endpoint Profiling (Googling the Internet)‏ Ionut Trestian Supranamaya Ranjan Aleksandar Kuzmanovic Antonio Nucci Northwestern University.
Licentiate Seminar: On Measurement and Analysis of Internet Backbone Traffic Wolfgang John Department of Computer Science and Engineering Chalmers University.
C HAPTER 4 W EB H OSTING. I. I NTRODUCTION To make your Web site visible to the world, it has to be hosted on a Web server. In this tutorial we will teach.
© December 1999 George Paliouras, All Rights Reserved1 Learning Communities of Users on the Internet George Paliouras Christos Papatheodorou Vangelis Karkaletsis.
WEB ANALYTICS Prof Sunil Wattal. Business questions How are people finding your website? What pages are the customers most interested in? Is your website.
EVERYWHERE: IMPACT OF DEVICE AND INFRASTRUCTURE SYNERGIES ON USER EXPERIENCE Cost TMA – Figaro - NSF Alessandro Finamore Marco Mellia Maurizio Munafò Sanjay.
Signatures As Threats to Privacy Brian Neil Levine Assistant Professor Dept. of Computer Science UMass Amherst.
1 10 THE INTERNET AND THE NEW INFORMATION TECHNOLOGY INFRASTRUCTURE.
What is FORENSICS? Why do we need Network Forensics?
Web Page Design I Basic Computer Terms “How the Internet & the World Wide Web (www) Works”
Hour 7 The Application Layer 1. What Is the Application Layer? The Application layer is the top layer in TCP/IP's protocol suite Some of the components.
1 Welcome to CSC 301 Web Programming Charles Frank.
Heuristics to Classify Internet Backbone Traffic based on Connection Patterns Wolfgang John and Sven Tafvelin Dept. of Computer Science and Engineering.
Wide-scale Botnet Detection and Characterization Anestis Karasaridis, Brian Rexroad, David Hoeflin In First Workshop on Hot Topics in Understanding Botnets,
NETWORK HARDWARE AND SOFTWARE MR ROSS UNIT 3 IT APPLICATIONS.
Greedy is not Enough: An Efficient Batch Mode Active Learning Algorithm Chen, Yi-wen( 陳憶文 ) Graduate Institute of Computer Science & Information Engineering.
User Fingerprinting Jeffrey Pang 1 Ben Greenstein 2 Ramakrishna Gummadi 3 Srinivasan Seshan 1 David Wetherall 2,4 Presenter: Nan Jiang Most Slides:
By Gianluca Stringhini, Christopher Kruegel and Giovanni Vigna Presented By Awrad Mohammed Ali 1.
Unconstrained Endpoint Profiling Googling the Internet Ionut Trestian, Supranamaya Ranjan, Alekandar Kuzmanovic, Antonio Nucci Reviewed by Lee Young Soo.
How the Web Works Building a Website – Lesson 1. How People Access the Web Browsers People access websites using software called a web browser. To view.
Empirical Quantification of Opportunities for Content Adaptation in Web Servers Michael Gopshtein and Dror Feitelson School of Engineering and Computer.
Web-Mining …searching for the knowledge on the Internet… Marko Grobelnik Institut Jožef Stefan.
IP addresses IPv4 and IPv6. IP addresses (IP=Internet Protocol) Each computer connected to the Internet must have a unique IP address.
Search Engine using Web Mining COMS E Web Enhanced Information Mgmt Prof. Gail Kaiser Presented By: Rupal Shah (UNI: rrs2146)
CISC 849 : Applications in Fintech Namami Shukla Dept of Computer & Information Sciences University of Delaware iCARE : A Framework for Big Data Based.
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
Chapter 6 Discovering the Scope of the Incident Spring Incident Response & Computer Forensics.
Introduction Web analysis includes the study of users’ behavior on the web Traffic analysis – Usage analysis Behavior at particular website or across.
WHAT'S THE DIFFERENCE BETWEEN A WEB APPLICATION STREAMING NETWORK AND A CDN? INSTART LOGIC.
Heat-seeking Honeypots: Design and Experience John P. John, Fang Yu, Yinglian Xie, Arvind Krishnamurthy and Martin Abadi WWW 2011 Presented by Elias P.
Lecture-6 Bscshelp.com. Todays Lecture  Which Kinds of Applications Are Targeted?  Business intelligence  Search engines.
Data mining in web applications
Experience Report: System Log Analysis for Anomaly Detection
What is Google Analytics?
Fundamentals of Information Systems, Sixth Edition
By Arijit Chatterjee Dr
DNS-sly: Avoiding Censorship through Network Complexity
The Internet Industry Week Two.
Vocabulary Prototype: A preliminary sketch of an idea or model for something new. It’s the original drawing from which something real might be built or.
Lightweight Application Classification for Network Management
Practical Censorship Evasion Leveraging Content Delivery Networks
1st Draft for Defining IoT (1)
An Overview of A Case Study of the Uses Supported by Higher Education Computer Networks and An Analysis of Application Traffic Mark Pisano dps2017.
ICT Communications Lesson 1: Using the Internet and the World Wide Web
Web Caching? Web Caching:.
Developing Web-Based Applications
Vocabulary Prototype: A preliminary sketch of an idea or model for something new. It’s the original drawing from which something real might be built or.
Firewalls.
What is Cookie? Cookie is small information stored in text file on user’s hard drive by web server. This information is later used by web browser to retrieve.
Application layer Lecture 7.
Web Design & Development
Configuring Internet-related services
Privacy-Preserving Dynamic Learning of Tor Network Traffic
Identifying Slow HTTP DoS/DDoS Attacks against Web Servers DEPARTMENT ANDDepartment of Computer Science & Information SPECIALIZATIONTechnology, University.
IP Control Gateway (IPCG)
The Internet and Electronic mail
Q/ Compare between HTTP & HTTPS? HTTP HTTPS
Helpful Things To Know For Successful Digital Marketing Strategy Presented By:- Abhinav Shashtri.
Unconstrained Endpoint Profiling (Googling the Internet)‏
Five Years at the Edge: Watching Internet from the ISP Network
When Machine Learning Meets Security – Secure ML or Use ML to Secure sth.? ECE 693.
Session 3.4 ITU-BDT Regional Network Planning Workshop
Presentation transcript:

Users’ fingerprinting techniques from TCP traffic LUCA VASSIO DANILO GIORDANO - MARTINO TREVISAN - MARCO MELLIA - ANA COUTO DA SILVA

Luca Vassio - About me Italian Mathematical engineer with a multidisciplinary profile PhD in Telecommunication Engineering – now postdoc Research interests Data science Big data analytics Data mining and machine learning Human behaviour analysis and modelling Recommendation systems Social networks behaviour Web browsing habits Smart cities design Optimization Metaheuristics

Are our interests private? Not at all! Huge number of trackers that record user web activities with different techniques

Can we be anonymous? From a network point of view Does our traffic characterize us? Users can change Application Device Network HTTPS limits access to third parties Can we still use network visible information for user fingerprinting?

Fingerprinting: user or device? What characterize the behaviour of the user when online The web-services the user access What characterize the behaviour of the used device The web-services the user access Services that support these web-services (e.g., CDNs) The installed applications (e.g., software updates)

Contributions Goal Study capabilities of tracking using only visited domains (server FQDN) Domains obtained from passive traces (TCP logs) What is novel? 1. Only consider the name of the contacted server 2. Compare different metrics for tracking 3. Propose a methodology to identify domains intentionally requested

Passive network measurements Process of measuring the traffic exchanged between devices interconnected by a network Traffic generated by others and observed on client/server/network Example: through Tstat ISP network

TSTAT: TCP Statistics and Analysis Tool Captures traffic and processes it in real-time Logs more than 100 statistics per TCP flow Logs information from every HTTP request and response Original server domain name retrieved by HTTP/TLS deep packet inspection Tracks DNS conversations to retrieve original server domain name (DN-Hunter) More information at tstat.polito.it Client IP Client port Client bytes Server IP Server port Server bytes Domain … 12.132.54.94 1197 18938 87.250.137.92 443 992221 Acme.com ... 3441 16541498 123.220.231.13 8080 78661 Example.com

Datasets University campus ISP Users with fixed IP addresses - used as client ID 4 weeks in 2017 Software: Apache Spark in a 20-machine Hadoop cluster Computation: reading and processing Campus dataset in ~20 minutes

Ethics & Privacy University ethical board: data collection reviewed and approved Protect leakages of private information Anonymize IP addresses - irreversible hash functions Save only data strictly needed i) Anonymized IP addresses ii) Name of domain iii) Timestamp of the TCP connection ISP dataset: reviewed and approved by ISP security board i) No information about ISP customers ii) Domain names never saved

Similarity computation Hypothesis: users are repetitive over time! Goal: identify user among profiles built in the past by checking visited domains Profile the users creating fingerprints Identify user in a later trace Performance metric: percentage of users correctly identified Select a fixed number of users in all tests  meaningful comparisons

Similarity computation Three methodologies for similarity among fingerprinting sets 1. JACCARD INDEX 2. MAXIMUM LIKELIHOOD ESTIMATION 3. COSINE SIMILARITY BASED ON TF-IDF

1. Jaccard index Size of intersection divided by size of the union of two sample sets Just depends on two domains sets Fast computation time

2. Maximum likelihood estimation Behavioral model: user’s likelihood of visiting a domain governed by (i) Domain overall popularity (ii) Whether domain already appeared in her previous domains set Each user has personalized factor of attraction towards past domains Identification: for each user, likelihood of generating the future set with the model Model proposed in: J. Su, A. Shukla, S. Goel, and A. Narayanan. 2017. De-anonymizing Web Browsing Data with Social Networks. Proocedings of the WWW 2017. 1261-1269. Future Past

3. Cosine similarity with TF-IDF Statistic per domain and per user Reflects How important a domain is for a user (TF – Term Frequency) With respect to all users (IDF – Inverse Document Frequency) Identification: Cosine distance of vectors of TF-IDFs of domains

Computational complexity M users each with ≈ N domains P is the total number of domains seen by all the M users, with N ≤ P ≤ M · N Jaccard index computation costs at most O (M · N^2) TFIDF and MLE methodologies cost at most O (M · N · P), using larger set of all domains Hashing methodology could be used on top of our computation to speed-up the process. Computation cost decrease to O (M · N) for Jaccard and O (M ·P) for MLE and TFIDF In case of very large population, it is therefore much faster to use the simpler Jaccard similarity

Core and support domains Are all domains equal? Core domain: are intentionally requested by a user to download the page HTML document e.g.: www.facebook.com and en.wikipedia.org Support domains: remaining ones triggered by website visits, or by background applications e.g.: static.xx.fbcdn.net and client.dropbox.com Core domains might be more important for fingerprinting Accuracy Interpretability Independence from device www.facebook.com Html static.xx.fbcdn.net image video dl-client.dropbox.com JavaScript CSS cdn.01.com static.xx.fbcdn.net

Identify core domains Classification task Machine learning methodology: decision tree classifier Build a labeled dataset for training and testing Get features: active crawling visiting home page of each domain (Selenium) Features selection: HTML document size and redirection to external domain Classifier training: C4.5 Results: accuracy 96% Execution time: 1 hour for classifying 404 k domains Bottleneck: Internet access speed (1Gbps)

Results

Is user browsing repetitive? Campus dataset One week for profiling and identification Users behave similar over time Users are different among themselves Discriminative power of the profiles

Number of domains Campus dataset Core/support domains for profiling and identification tasks The bigger the set, the better the identification Core domains better characterizing - 500 support ~ 70 core Jaccard index worst method TFIDF best results MLE slightly better with core domains

Observation time Campus dataset Users have different rates for discovering domains Keep constant observation time (profiling/identification) Two consecutive days Two consecutive weeks Much more support domains than core ones Quantity of support domains helps identification (personalized ads/tracking) Jaccard worst performance TFIDF overall best performance With few core domains MLE better domains domains

The ISP case Let’s apply what we have learned to the ISP case Residential access More heterogeneous users Devices multiplexed on same IP address of the access gateway All domains – no distinction between core/support Bigger amount of data More distinct and repetitive behavior: easier to fingerprint and identify domains

Longer profiling or identification? Campus dataset 2. Fixed 2 weeks identification, variable profiling time Fixed 2 weeks profiling, variable identification time Jaccard is symmetric TFIDF and MLE account for whole population when profiling TFIDF and MLE good even with few hours of identification Large profiling sets more important than identification

Conclusions Still no privacy and anonymity online, even from the network point of view Good similarity metrics and machine learning application can improve forensics applications: A simple TFIDF approach can solve identification problem Web-services intentionally requested (core) better characterize users But support domains are also important due to their quantity Future work: analyze point of view of tracker applications Foster new studies and permit results reproducibility! Data and models available at bigdata.polito.it/content/domains-web-users

Want to talk with me? Propose something interesting for my future? Luca Vassio, Danilo Giordano, Martino Trevisan, Marco Mellia, Ana Paula Couto da Silva, Users' Fingerprinting Techniques from TCP Traffic, ACM SIGCOMM Workshop on Big Data Analytics and Machine Learning for Data Communication Networks, 2017 Want to talk with me? Propose something interesting for my future? Luca Vassio luca.vassio@polito.it lucavassio.wordpress.com