The Web: Source of Big Data with a Measurement Perspective Marco Mellia Politecnico di Torino 7th Annual IMDEA Networks Workshop Big Data and Cloud Computing.

Slides:



Advertisements
Similar presentations
Detecting Malicious Flux Service Networks through Passive Analysis of Recursive DNS Traces Roberto Perdisci, Igino Corona, David Dagon, Wenke Lee ACSAC.
Advertisements

Networking Problems in Cloud Computing Projects. 2 Kickass: Implementation PROJECT 1.
© 2007 Cisco Systems, Inc. All rights reserved.Cisco Public ITE PC v4.0 Chapter 1 1 OSI Transport Layer Network Fundamentals – Chapter 4.
Greg Williams CS691 Summer Honeycomb  Introduction  Preceding Work  Important Points  Analysis  Future Work.
The development of Internet A cow was lost in Jan 14th If you know where it is, please contact with me. My QQ number is QQ is one of the.
Marios Iliofotou (UC Riverside) Brian Gallagher (LLNL)Tina Eliassi-Rad (Rutgers University) Guowu Xi (UC Riverside)Michalis Faloutsos (UC Riverside) ACM.
SESSION ID: #RSAC Chaz Lever Characterizing Malicious Traffic on Cellular Networks A Retrospective MBS-W01 Researcher Damballa,
BotMiner Guofei Gu, Roberto Perdisci, Junjie Zhang, and Wenke Lee College of Computing, Georgia Institute of Technology.
Wide-scale Botnet Detection and Characterization Anestis Karasaridis, Brian Rexroad, David Hoeflin.
Leveraging Personal Knowledge for Robust Authentication Systems Mentor: Danfeng Yao Anitra Babic Chestnut Hill College Computer Science Department.
Neural Technology and Fuzzy Systems in Network Security Project Progress 2 Group 2: Omar Ehtisham Anwar Aneela Laeeq
On the Feasibility of Large-Scale Infections of iOS Devices
Unconstrained Endpoint Profiling (Googling the Internet)‏ Ionut Trestian Supranamaya Ranjan Aleksandar Kuzmanovic Antonio Nucci Northwestern University.
Intrusion Detection Systems. Definitions Intrusion –A set of actions aimed to compromise the security goals, namely Integrity, confidentiality, or availability,
Intrusion Detection - Arun Hodigere. Intrusion and Intrusion Detection Intrusion : Attempting to break into or misuse your system. Intruders may be from.
Lab 3 Cookie Stealing using XSS Kara James, Chelsea Collins, Trevor Norwood, David Johnson.
INTRUSION DETECTION SYSTEMS Tristan Walters Rayce West.
1. Introduction The underground Internet economy Web-based malware The system analyzing the post-infection network behavior of web-based malware How do.
Security Guidelines and Management
Norman SecureSurf Protect your users when surfing the Internet.
PROJECT IN COMPUTER SECURITY MONITORING BOTNETS FROM WITHIN FINAL PRESENTATION – SPRING 2012 Students: Shir Degani, Yuval Degani Supervisor: Amichai Shulman.
Automated Tracking of Online Service Policies J. Trent Adams 1 Kevin Bauer 2 Asa Hardcastle 3 Dirk Grunwald 2 Douglas Sicker 2 1 The Internet Society 2.
WEB ANALYTICS Prof Sunil Wattal. Business questions How are people finding your website? What pages are the customers most interested in? Is your website.
Prof. Vishnuprasad Nagadevara Indian Institute of Management Bangalore
WhoWas: A Platform for Measuring Web Deployments on IaaS Clouds Liang Wang *, Antonio Nappa +, Juan Caballero +, Thomas Ristenpart *, Aditya Akella * *
Lucent Technologies – Proprietary Use pursuant to company instruction Learning Sequential Models for Detecting Anomalous Protocol Usage (work in progress)
1 Integrating a Network IDS into an Open Source Cloud Computing Environment 1st International Workshop on Security and Performance in Emerging Distributed.
Presentation by Kathleen Stoeckle All Your iFRAMEs Point to Us 17th USENIX Security Symposium (Security'08), San Jose, CA, 2008 Google Technical Report.
FALL 2012 DSCI5240 Graduate Presentation By Xxxxxxx.
IT 210 The Internet & World Wide Web introduction.
MPlane – Building an Intelligent Measurement Plane for the Internet Maurizio Dusi – NEC Laboratories Europe NSF Workshop on perfSONAR.
1 All Your iFRAMEs Point to Us Mike Burry. 2 Drive-by downloads Malicious code (typically Javascript) Downloaded without user interaction (automatic),
Honeypot and Intrusion Detection System
© 2007 Cisco Systems, Inc. All rights reserved.Cisco Public 1 Version 4.0 Network Services Networking for Home and Small Businesses – Chapter 6.
Intrusion Detection and Prevention. Objectives ● Purpose of IDS's ● Function of IDS's in a secure network design ● Install and use an IDS ● Customize.
Forensic and Investigative Accounting Chapter 14 Internet Forensics Analysis: Profiling the Cybercriminal © 2005, CCH INCORPORATED 4025 W. Peterson Ave.
Module 10: Monitoring ISA Server Overview Monitoring Overview Configuring Alerts Configuring Session Monitoring Configuring Logging Configuring.
 Two types of malware propagating through social networks, Cross Site Scripting (XSS) and Koobface worm.  How these two types of malware are propagated.
Cloak and Dagger: Dynamics of Web Search Cloaking David Y. Wang, Stefan Savage, and Geoffrey M. Voelker University of California, San Diego 左昌國 Seminar.
Packet Filtering & Firewalls. Stateless Packet Filtering Assume We can classify a “good” packet and/or a “bad packet” Each rule can examine that single.
CSCE 201 Web Browser Security Fall CSCE Farkas2 Web Evolution Web Evolution Past: Human usage – HTTP – Static Web pages (HTML) Current: Human.
Wide-scale Botnet Detection and Characterization Anestis Karasaridis, Brian Rexroad, David Hoeflin In First Workshop on Hot Topics in Understanding Botnets,
By Gianluca Stringhini, Christopher Kruegel and Giovanni Vigna Presented By Awrad Mohammed Ali 1.
ECEN “Internet Protocols and Modeling”, Spring 2012 Course Materials: Papers, Reference Texts: Bertsekas/Gallager, Stuber, Stallings, etc Class.
Qiang Xu†, Yong Liao‡, Stanislav Miskovic‡, Z. Morley Mao†, Mario Baldi‡, Antonio Nucci‡, Thomas Andrews† †University of Michigan, ‡Symantec, Inc.
Cryptography and Network Security Sixth Edition by William Stallings.
Chapter 8 Network Security Thanks and enjoy! JFK/KWR All material copyright J.F Kurose and K.W. Ross, All Rights Reserved Computer Networking:
Search Engine using Web Mining COMS E Web Enhanced Information Mgmt Prof. Gail Kaiser Presented By: Rupal Shah (UNI: rrs2146)
Intrusion Detection Systems Paper written detailing importance of audit data in detecting misuse + user behavior 1984-SRI int’l develop method of.
1 workshop Barcelona, April 22, 2015.
WebWatcher A Lightweight Tool for Analyzing Web Server Logs Hervé DEBAR IBM Zurich Research Laboratory Global Security Analysis Laboratory
A Multifaceted Approach to Understanding the Botnet Phenomenon Aurthors: Moheeb Abu Rajab, Jay Zarfoss, Fabian Monrose, Andreas Terzis Publication: Internet.
Role Of Network IDS in Network Perimeter Defense.
Introduction Web analysis includes the study of users’ behavior on the web Traffic analysis – Usage analysis Behavior at particular website or across.
COMPUTER NETWORKS Hwajung Lee. Image Source:
John S. Otto Mario A. Sánchez John P. Rula Fabián E. Bustamante Northwestern, EECS.
Kali Linux BY BLAZE STERLING. Roadmap  What is Kali Linux  Installing Kali Linux  Included Tools  In depth included tools  Conclusion.
Understanding Online Social Network Usage from a Network Perspective F. Schneider et al (T-Labs, AT&T) Internet Measurement Conference 2009 Networking.
Website Update and Use of Official accounts Dr.Lasantha Ranwala ( MBBS,MSc-Biomedical Informatics) Medical Officer - Health Informatics RDHS Office.
Victor PTSA Fall Forum Don’t Lose Touch With Your Teen Tuesday, October 22, 2013 – 7PM Social media is now an integral part of our every day lives. For.
Powerpoint presentation on Drive-by download attack -By Yogita Goyal.
Heat-seeking Honeypots: Design and Experience John P. John, Fang Yu, Yinglian Xie, Arvind Krishnamurthy and Martin Abadi WWW 2011 Presented by Elias P.
Application Layer Functionality and Protocols Abdul Hadi Alaidi
CompTIA Security+ Study Guide (SY0-401)
An Overview of A Case Study of the Uses Supported by Higher Education Computer Networks and An Analysis of Application Traffic Mark Pisano dps2017.
CompTIA Security+ Study Guide (SY0-401)
Communication Networks NETW 501 Tutorial 3
Home Internet Vulnerabilities
Five Years at the Edge: Watching Internet from the ISP Network
When Machine Learning Meets Security – Secure ML or Use ML to Secure sth.? ECE 693.
Presentation transcript:

The Web: Source of Big Data with a Measurement Perspective Marco Mellia Politecnico di Torino 7th Annual IMDEA Networks Workshop Big Data and Cloud Computing June 11, 2015, Madrid

2 Agenda Internet (passive) traffic measurements  A good source of big data And applications to  Knowledge of (benign) Internet traffic  Understanding of (malicious) traffic  Implications on your privacy  Understand how people browse the web  … I’m by far not a big data expert… … I may be a measurement expert…

3 The nowadays Internet “The Internet is the first thing that humanity has built that humanity doesn't understand, the largest experiment in anarchy that we have ever had.” Eric Schmidt – ex Google Exec. Chairman

4 mPlane architecture

5

6

7

8

Passive data analysis

10 Measurement scenario Captures traffic on the network interface and processes it in real-time Rebuilds TCP/UDP flows Compute 100+ statistics to print at the end of the flow Open-source:

11

12 Dataset 1 day of traffic from an European ISP  Vantage point, about 20,000 households Approx 4Gb/s => 43 TB/day Security Monitor ClassUsers (%)Records (%)UsersRecords HTTP16,217 (79.1)39.7 M (11.8)1,30842,007 3,640 (17.7)880.7 k (0.2)-- Chat 3,045 (14.8)100.8 k (0.03)71,467 P2P 3,163 (15.4)17.1 M (5.05)-- Other TCP18,806 (91.8)22.7 M (6.7)2476 DNS15,164 (74.1)30.7 M (9.3)-- VoIP 8,371 (40.8)80.5 k (0.02)-- Other UDP17,664 (86.2)224.6 M (66.8)-- total20, M1,32143,550

13 Questions you may answer How much traffic is TCP or UDP? How much traffic is HTTP or P2P? What is the best goodput observed by a user? What is the CDN being used to serve YouTube videos? … How much traffic is HTTPS?  And what are the “cost of the S”?

14 “S” usage trends D. Naylor, et al, “The Cost of the "S" in HTTPS”, CoNEXT '14

15 y = 2x Full handshake y = x Session resumption RTT TLS Handshake Duration [ms] Stragglers Long handshake no matter how close server is Handshake Latency

16 Questions you may answer How much traffic is TCP or UDP? How much traffic is HTTP or P2P? What is the best goodput observed by a user? What is the CDN being used to serve YouTube videos? How much traffic is HTTPS? How much traffic is “malicious”  And how can you detect it

17 THE RISE OF CYBERWARFARE +261% - Stealing personal information +23% - Threatening visitors’ security during web surfing + 28% - Exploiting legitimate websites Source: Symantec Internet Security Threat Report

18

19 Goal Provide a characterization of malicious traffic in a real network  Identify common malicious patterns in network traffic  Augment the information coming from Intrusion Detection System (IDS) software  Generalize the IDS knowledge to unknown threats With a “Big Data” approach  Directly extract knowledge from data A. Finamore, et al, “Macroscopic View of Malware in Home Networks”, IEEE CCNC'15

20 Methodology Look at network traffic as sequence of events… …to spot correlation some events are benign others are malicious (flagged) Traffic analyzer Security monitor Network traffic time Creates per-flow records [ timestamp, class, srcIP, dstIP, srcPort, dstPort, #bytesSent, …] Logs malicious activity [ srcIP, dstIP, srcPort, dstPort, threat-ID ]

21 Dataset 1 day of traffic from an European ISP  Vantage point, about 20,485 households  1,321 (6.4%) households have ≥ 1 connection flagged over a total of 151 threat-IDs Traffic AnalyzerSecurity Monitor ClassUsers (%)Records (%)UsersRecords HTTP16,217 (79.1)39.7 M (11.8)1,30842,007 3,640 (17.7)880.7 k (0.2)-- Chat 3,045 (14.8)100.8 k (0.03)71,467 P2P 3,163 (15.4)17.1 M (5.05)-- Other TCP18,806 (91.8)22.7 M (6.7)2476 DNS15,164 (74.1)30.7 M (9.3)-- VoIP 8,371 (40.8)80.5 k (0.02)-- Other UDP17,664 (86.2)224.6 M (66.8)-- total20, M1,32143,550

High-level characterization

23  Malware is generally stealthy  ~51% of flagged users have just 1 flag (in a day!)  Only 8% of flagged users shows more than 10 flags Malware characteristics – Flags per User

24  Threat popularity highlights a diversified scenario  The most popular infects 800 users  121 threats have less than 10 users Malware characteristics – Threat popularity

25 High-level characterization Take away message  Most of the threats infect few users, only a handful are widespread  Malware is generally silent and most users show a (very) limited number of flags It is like finding a needle in a haystack!

Getting your hand into the data

27 Users ≥ 1 flagUsers > 1 flag #NameUsersFlagsUsersAVG Flags 1Drive-by download [type 1]7811, DynDNS activity [type 1]26626, Blackhole EK [type 1] Skintrim [type 2] Skintrim [type 3] Facebook plugin attack Threat-A Blackhole EK [type 2] Toolbar activity [type 1] Threat-B Threat-C Toolbar activity [type 2] Drive-by download [type 2] Tidserv Threat-D Top -15 most diffused threats

28 HTTP objects requestedUsers# HTTP events ads.staticyonkis.com/www/delivery/afr.php5223 bloggasaurus.com/wp-content/instal/file.php rum.php%3E%3Cimgsrc= =forumsnowboardborder=1%3E%3C/a%3E15 club/%3Cahref= // 3E%3C/a%3E casariposo.org/banner//jquerymini.js neve/%3Cahref= // 3E%3C/a%3E11 total14263 Most recurrent objects Focus Threat-D – Flagged HTTP objects

29 Not flagged + Threat-D ▪ Other threats “file.php” requests are periodic (every ~20m) HTTP objects This User has a lot of flows flagged with other threat-IDs Temporal patterns – file.php, User 6

30 Temporal patterns – file.php, User 6 Not flagged + Threat-D ▪ Other threats “file.php” requests are periodic (every ~20m) HTTP objects This User has a lot of flows flagged with other threat-IDs Not flagged + Threat-D ▪ Other threats ip10|file.php ip9|file.php ip8|file.php ip7|file.php ip6|file.php ip5|file.php ip4|file.php ip3|file.php ip2|webhp ip1|webhp ip32|file.php ip31| ip30|webhp ip29|webhp ip28|file.php ip27|file.php Ip26|webhp ip25|file.php ip24|file.php ip23|file.php ip22|file.php ip21|file.php ip20|file.php ip19|file.php ip18|file.php ip17|file.php ip16 ip15 ip14|file.php ip13|file.php ip12|file.php ip11|file.php macroscopic pattern

31 HTTP objects requestedUsers# HTTP events ads.staticyonkis.com/www/delivery/afr.php5223 bloggasaurus.com/wp-content/instal/file.php rum.php%3E%3Cimgsrc= =forumsnowboardborder=1%3E%3C/a%3E15 club/%3Cahref= // 3E%3C/a%3E casariposo.org/banner//jquerymini.js neve/%3Cahref= // 3E%3C/a%3E11 total14263 Most recurrent objects Focus Threat-D – Flagged HTTP objects

32 Not flagged + Threat-D ▪ 66 event occurrences (mostly overlapped) no macroscopic pattern Temporal patterns – file.php, User 1

33 Lesson learned The picture is very fuzzy and complicated  Volume, velocity, variety, variability, … Repeating the experiment on other threats does not help  Most of the times there is not a clear pattern  In few cases there is a pattern recurrent over time

Going toward automated detection Via a connectivity graph

35 Connectivity graph Represent the activity of a client via a graph  Consider a client  Look at http requests  Build a graph

36 Building the Malicious Graph One HTTP request plays the role of seed and triggers the analysis Malicious seeds, flagged by the IDS Benign seed, no suspicious activities 1.Define an analysis snapshot surrounding the event 2.Collect multiple snapshots coming from different clients to find a common pattern 3.Leverage activities relationships detected on multiple layers (i.e., HTTP, DNS, TCP, …) 4.Join layers to have a unified connectivity graph tt Others DNS HTTP tt Client 1 tt Client N …

37 Building the Connectivity Graph 1. Select a seed, either malicious or benign 2. Process analysis snapshots coming from involved clients 3. Find a common pattern among snapshots and clients 4. Plus tons of “minor details”

38 Building the Connectivity Graph Consider all flags for the client under study related to the seed event FLAGGED HTTP ACTIVITY Context-aware and dynamically generated WHITELIST GENERATION Based on popular items Single HTTP request as white list entry Item = an HTTP request, domain + object path E.g.:

39 Step 1 – Flagged activity Client 1 being analyzed: 3 snapshots Focus on flagged activities detected in the first analysis snapshot Within the snapshot, clients may present other malicious events Malicious seed Red edges = flagged events

40 Building the Connectivity Graph s0 s1 sN s2 GENERATE ANALYSIS SNAPSHOTS Exploit time correlation between the seed and other activities in its neighborhood Consider all flags for the client under study related to the seed event FLAGGED HTTP ACTIVITY Add activities that are repetitive along snapshots EXTRACT COMMON PATTERNS Context-aware and dynamically generated WHITELIST GENERATION`

41 Step 2 - common patterns Step 2 – Common patterns Legitimate activity related to Facebook games Other legitimate activities Malicious activity according to BitDefender, Snort and Sophos

42 Building the Connectivity Graph JOIN LAYERS TOGETHER AND BUILD THE FINAL CONNECTIVIT Y GRAPH Examine ongoing activities in other layers (e.g., DNS, TCP, …) PROCESS OTHER LAYERS s0 s1 sN s2 GENERATE ANALYSIS SNAPSHOTS Exploit time correlation between the seed and other activities in its neighborhood Consider all flags for the client under study related to the seed event FLAGGED HTTP ACTIVITY Add activities that are repetitive along snapshots EXTRACT COMMON PATTERNS Context-aware and dynamically generated WHITELIST GENERATION

43 Step 3 – Per-client connectivity graph DNS failing requests Resource sharing with colocation of multiple malicious domains on the same server IP Resource sharing with colocation of multiple malicious domains on the same server IP

44 Aggregating multiple clients IDENTIFY COMMON SUSPICIOUS BEHAVIOR (e.g., overlap of per-client contributions) CLIENT 1 FLAGGED ACTIVITY COMMON PATTERNS OTHER LAYERS JOIN FINAL GRAP H … CLIENT 2 FLAGGED ACTIVITY COMMON PATTERNS OTHER LAYERS JOIN FINAL GRAP H CLIENT N FLAGGED ACTIVITY COMMON PATTERNS OTHER LAYERS JOIN FINAL GRAP H

45 Step 4 – Final connectivity graph Grafo finale del modello New malicious object identified Failing and recurrent DNS queries trying to resolve suspicious domain names

46 Conclusions It is possible to augment the knowledge of the malicious activity thanks to big data analysis Correlation and filtering are key Next step:  Extract signature from malicious and benign seeds  Instruct a machine learning classifier with the two classes  Design a self update system to automatically augment its knowledge

47 Agenda Passive traffic measurements  Source of big data And applications to  Knowledge of (benign) Internet traffic  Understanding of (malicious) traffic  Implications on your privacy

One more example… … about you…

49 Take 100 users, and observe which services they contact while on the Internet… Word count Google Third-Party sites Take 100 users, and observe which services they contact while on the Internet… … most of those are so called “third-party sites” - aka tracking services… Welcome ScorecardResearch, […] a leading global market research effort that studies and reports on Internet trends and behavior. ScorecardResearch conducts research by collecting Internet web browsing data and then uses that data to help show how people use the Internet, what they like about it, and what they don’t. ScorecardResearch collects data through […] web tagging.

50 Third-Party Trackers Third-party web tracking refers to the practice by which a service records user web activities often for profit Many techniques Cookies HTML5 LocalStorage Finger printing (browser/OS/IP) Some are visible Others are not acmeTrack.com acmeAds.com Metwalley, et al. "The Online Tracking Horde: A View from Passive Measurements.” TMA’15

51 Requests to first tracker First HTTP transaction to a tracker in 10% of cases Android users contact a tracker earlier than PC or iOS ones

52 Trackers per Service Un-Popular services host a lot of trackers Popular services host a lot of trackers

Going toward automated detection

54 Automatic Tracker Detection For a set of events in which hostname is different from referrer field 1. extract all keys from the URLs in the set that includes a query string 2. for each hostname and for each key, investigate one-to-one mapping between the client and the values taken by each of the keys If a key value assumes a different value for each client, and does not change over time  We found a “client identifier”  Mark the service as a “possible tracker” acmeAds.com

55 Automatic Tracker Detection

56 Automatic Tracker Detection - Results Third party sites in  Repubblica.it  YouTube  Facebook 56 CrowdSurf: Empowering Informed Choices in the Web WebsiteThird party sitesKeys Repubblica YouTube Facebook

57 Automatic Tracker Detection - Results Third party sites in  Repubblica.it  YouTube  Facebook Third party hostnames identified 57 CrowdSurf: Empowering Informed Choices in the Web WebsiteThird party sitesKeys Repubblica pix04.revsci.net su.addthis.com track.adform.net YouTube bh.ams.contextweb.com eu-jet-01.sociomantic.com ib.adnxs.com uip.semasio.net Facebook adadvisor.net data.bncnt.com go.flx1.com ira.spysomeone.com tags.bluekai.com ww1.collserve.com

58 Automatic Tracker Detection - Results Third party sites in  Repubblica.it  YouTube  Facebook Third party hostnames identified Third party hostnames confirmed 58 CrowdSurf: Empowering Informed Choices in the Web WebsiteThird party sitesKeys Repubblica pix04.revsci.net su.addthis.com track.adform.net YouTube bh.ams.contextweb.com eu-jet-01.sociomantic.com ib.adnxs.com uip.semasio.net Facebook adadvisor.net data.bncnt.com go.flx1.com ira.spysomeone.com tags.bluekai.com ww1.collserve.com

59 Automatic Tracker Detection - Results Third party sites in  Repubblica.it  YouTube  Facebook Third party hostnames identified Third party hostnames confirmed Keys suggest the exchange of client identifiers 59 CrowdSurf: Empowering Informed Choices in the Web WebsiteThird party sitesKeys Repubblica pix04.revsci.net id su.addthis.com puid track.adform.net icid YouTube bh.ams.contextweb.comvgd eu-jet-01.sociomantic.comfpc ib.adnxs.com uuid sExtCookieId uip.semasio.netinstall_timestamp Facebook adadvisor.net bk_uuid data.bncnt.com uid go.flx1.com anuid, euid ira.spysomeone.coms tags.bluekai.com google_gid ww1.collserve.com bk_uuid ksh_id

60 Collusions 22943A08072D61D9088D3D B c1.microsoft.com c.msn.com MUID MUID a4.bing.com CID

61 Collusion ?!??! 378a a f551433df35 beacon.krxd.net mmuid ad.doubleclick.net mt_uuid eu-u.openx.net y.one.impact-ad.jp pixel.mathtag.com s.kau.li check uid val uid mt_uuid check

62 Conclusions Internet is possibly the “best” source for big data  Volume, Variety, Velocity, Variability, Veracity, … With multiple applications  Characterization  Security  Privacy  User behaviour  … Lot of domain knowledge, and lot of ingenuity to better understand the “largest experiment in anarchy that we have ever had”