Structural Mining of Large-Scale Behavioral Data from the Internet Thesis Defense Mark Meiss April 30, 2010.

Slides:



Advertisements
Similar presentations
Wenke Lee and Nick Feamster Georgia Tech Botnet and Spam Detection in High-Speed Networks.
Advertisements

Web Usage Mining Web Usage Mining (Clickstream Analysis) Mark Levene (Follow the links to learn more!)
Peer-to-Peer and Social Networks An overview of Gnutella.
A Survey of Botnet Size Measurement PRESENTED: KAI-HSIANG YANG ( 楊凱翔 ) DATE: 2013/11/04 1/24.
Power Laws By Cameron Megaw 3/11/2013. What is a Power Law?
Analysis and Modeling of Social Networks Foudalis Ilias.
Predicting Tor Path Compromise by Exit Port IEEE WIDA 2009December 16, 2009 Kevin Bauer, Dirk Grunwald, and Douglas Sicker University of Colorado Client.
An Analysis of Sampling Effects on Graph Structures Derived from Network Flow Data Mark Meiss Advanced Network Management Laboratory Indiana University.
Ranking Web Sites with Real User Traffic Mark Meiss Filippo Menczer Santo Fortunato Alessandro Flammini Alessandro Vespignani Web Search and Data Mining.
Worm Origin Identification Using Random Moonwalks Yinglian Xie, V. Sekar, D. A. Maltz, M. K. Reiter, Hui Zhang 2005 IEEE Symposium on Security and Privacy.
Empirical Investigations of WWW Surfing Paths Jim Pitkow User Interface Research Xerox Palo Alto Research Center.
Using Structure Indices for Efficient Approximation of Network Properties Matthew J. Rattigan, Marc Maier, and David Jensen University of Massachusetts.
Web Caching Schemes1 A Survey of Web Caching Schemes for the Internet Jia Wang.
EEC-484/584 Computer Networks Discussion Session for HTTP and DNS Wenbing Zhao
Sampling from Large Graphs. Motivation Our purpose is to analyze and model social networks –An online social network graph is composed of millions of.
Detecting Fraudulent Clicks From BotNets 2.0 Adam Barth Joint work with Dan Boneh, Andrew Bortz, Collin Jackson, John Mitchell, Weidong Shao, and Elizabeth.
Network Traffic Measurement and Modeling CSCI 780, Fall 2005.
On Power-Law Relationships of the Internet Topology CSCI 780, Fall 2005.
Introduction to client/server architecture
Internet Basics.
Network Simulation Internet Technologies and Applications.
Delay Analysis of Large-scale Wireless Sensor Networks Jun Yin, Dominican University, River Forest, IL, USA, Yun Wang, Southern Illinois University Edwardsville,
BOTNETS & TARGETED MALWARE Fernando Uribe. INTRODUCTION  Fernando Uribe   IT trainer and Consultant for over 15 years specializing.
The PageRank Citation Ranking: Bringing Order to the Web Larry Page etc. Stanford University, Technical Report 1998 Presented by: Ratiya Komalarachun.
Λ14 Διαδικτυακά Κοινωνικά Δίκτυα και Μέσα
On the Anonymity of Anonymity Systems Andrei Serjantov (anonymous)
The PageRank Citation Ranking: Bringing Order to the Web Presented by Aishwarya Rengamannan Instructor: Dr. Gautam Das.
Mark Levene, An Introduction to Search Engines and Web Navigation © Pearson Education Limited 2005 Slide 9.1 Chapter 9 : Social Networks What is a social.
University of California at Santa Barbara Christo Wilson, Bryce Boe, Alessandra Sala, Krishna P. N. Puttaswamy, and Ben Zhao.
Lesson 2 — The Internet and the World Wide Web
Intrusion Prevention System. Module Objectives By the end of this module, participants will be able to: Use the FortiGate Intrusion Prevention System.
BotMiner: Clustering Analysis of Network Traffic for Protocol- and Structure-Independent Botnet Detection Guofei Gu, Roberto Perdisci, Junjie Zhang, and.
Chapter 8 The Internet: A Resource for All of Us.
TCP/IP Yang Wang Professor: M.ANVARI.
© 2007 Cisco Systems, Inc. All rights reserved.Cisco Public 1 Version 4.0 Application Layer Functionality and Protocols.
Scalable and Efficient Data Streaming Algorithms for Detecting Common Content in Internet Traffic Minho Sung Networking & Telecommunications Group College.
Mitsubishi Research Institute, Inc Analyses on Distribution of Malicious Packets and Threats over the Internet August 27-31, 2007 APAN Network Research.
Detection Unknown Worms Using Randomness Check Computer and Communication Security Lab. Dept. of Computer Science and Engineering KOREA University Hyundo.
The PageRank Citation Ranking: Bringing Order to the Web Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd Presented by Anca Leuca, Antonis Makropoulos.
Center for E-Business Technology Seoul National University Seoul, Korea BrowseRank: letting the web users vote for page importance Yuting Liu, Bin Gao,
Mining and Querying Multimedia Data Fan Guo Sep 19, 2011 Committee Members: Christos Faloutsos, Chair Eric P. Xing William W. Cohen Ambuj K. Singh, University.
Heuristics to Classify Internet Backbone Traffic based on Connection Patterns Wolfgang John and Sven Tafvelin Dept. of Computer Science and Engineering.
Internet Research Tips Daniel Fack. Internet Research Tips The internet is a self publishing medium. It must be be analyzed for appropriateness of research.
Challenges and Opportunities Posed by Power Laws in Network Analysis Bruno Ribeiro UMass Amherst MURI REVIEW MEETING Berkeley, 26 th Oct 2011.
Wide-scale Botnet Detection and Characterization Anestis Karasaridis, Brian Rexroad, David Hoeflin In First Workshop on Hot Topics in Understanding Botnets,
INDIANAUNIVERSITYINDIANAUNIVERSITY FlowRank Presentation by ANML July 2004.
1 Uncovering Functional Networks in Internet Traffic Mark Meiss September 25, 2006.
1 1 COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani.
CSE326: Data Structures World Wide What? Hannah Tang and Brian Tjaden Summer Quarter 2002.
09/13/04 CDA 6506 Network Architecture and Client/Server Computing Peer-to-Peer Computing and Content Distribution Networks by Zornitza Genova Prodanoff.
Spamming Botnets: Signatures and Characteristics Yinglian Xie, Fang Yu, Kannan Achan, Rina Panigrahy, Microsoft Research, Silicon Valley Geoff Hulten,
2009/6/221 BotMiner: Clustering Analysis of Network Traffic for Protocol- and Structure- Independent Botnet Detection Reporter : Fong-Ruei, Li Machine.
1 Internet Traffic Measurement and Modeling Carey Williamson Department of Computer Science University of Calgary.
Introduction Web analysis includes the study of users’ behavior on the web Traffic analysis – Usage analysis Behavior at particular website or across.
INTERNET AND . WHAT IS INTERNET The Internet can be defined as the wired or wireless mode of communication through which one can receive, transmit.
Heat-seeking Honeypots: Design and Experience John P. John, Fang Yu, Yinglian Xie, Arvind Krishnamurthy and Martin Abadi WWW 2011 Presented by Elias P.
Internet Service Providers and types of internet connections
Internet and Intranet.
Due: a start of class Oct 26
Worm Origin Identification Using Random Moonwalks
Introduction to client/server architecture
Internet and Intranet.
Chapter 12: Automated data collection methods
Building Networks from Networks
Internet and Intranet.
Degree Distributions.
The Internet and Electronic mail
Internet and Intranet.
Unconstrained Endpoint Profiling (Googling the Internet)‏
Network Basics and Architectures Neil Tang 09/05/2008
Presentation transcript:

Structural Mining of Large-Scale Behavioral Data from the Internet Thesis Defense Mark Meiss April 30, 2010

The Internet in 1969

The Internet in 2010

WWW Bittorrent FTP SSH Skype IRC Gnutella Botnets Shoutcast Flash WinMX DirectConnect NFS ConnectGateway eDonkey World of Warcraft Back Doors SQL Server Nintendo WFC Battlenet Steam Tsunami USENET

1,800,000,000 users

Key Questions Can we make meaningful inferences about user behavior with available sources of data? What implications do patterns of network behavior have for its design and structure? How can behavioral data be used to understand users and improve network applications?

“how it acts” > “what it is”

Patterns > Payloads

Practicality

Privacy

Roadmap Background Network flow analysis Web click analysis Conclusions

Roadmap Background Network flow analysis Web click analysis Conclusions

Weighted digraph

Degree and strength

Distributions We can calculate probability density functions for degree, strength, etc. –Area under curve is 1 –Can calculate likelihood of a node being in some interval Shown here is an example of a very wide- tailed distribution –Best approximated by power law

Scaling relations We can also investigate how these distributions are correlated Shown here is a degree vs. strength plot.

Other network properties Spectral analysis: Looking at the eigen{values,vectors} of the connectivity matrix. Clustering: Looking at the density of connections among neighbors of a node. Assortativity: Looking at whether high- degree nodes connect to other high- degree nodes.

Roadmap Background Network flow analysis Web click analysis Conclusions

The Internet2/Abilene Network TCP/IP network connecting research and educational institutions in the U.S. –Over 200 universities and corporate research labs Also provides transit service between Pacific Rim and European networks

Why study Abilene? Wide-area network that includes both domestic and international traffic Heterogeneous user base including hundreds of thousands of undergraduates High capacity network (10-Gbps fiber-optic links) that has never been congested Research partnership gives access to (anonymized) traffic data unavailable from commercial networks Variety of traffic to both academic and commercial hosts

Introducing the “flow” Web

Introducing the “flow” Web

Flow collection Flows are exported in Cisco’s netflow-v5 format and anonymized before being written to disk.

Data dimensions In a typical day: –Over 200 terabytes of data exchanged –Almost 1 billion flow records –Over 40 gigabytes on disk –Over 20 million unique hosts involved

What can you do with a flow? Standard answer: –Treat a flow as a record in a relational database Who talked to port 1337? What proportion of our traffic is on port 80? Who is scanning for vulnerable systems? Which hosts are infected with this worm?

What can you do with a flow? Graph-centric approach: –Treat a flow as a directed, weighted edge

Multiple digraphs Other Web P2P

Where do flows come from? Architectural features of Internet routers allow them to export flow data Routers can’t summarize all the data –Packets are sampled to construct the flows –Typical sampling rate is around 1:100

p(packet | 1 router) p(packet | 3 routers) p(flow | n packets)

Distribution Recovery Try to recover a power law, exponent = 2. Send to each of 10 hosts: packet flows packet flows packet flows (etc.)

Increase number of flows by factor of 10

Increase duration of flows by factor of 10

Results Nonlinear chance of flow detection. Very small flows lead to an overestimate of the exponent. With large flows, a range of exponents can be recovered reliably. Aggregation is necessary for accurate results.

Roadmap Background Network flow analysis Web click analysis Conclusions

Source of Click Data ~100 K users

Source MAC: 03:5a:66:17:90:5e Dest. MAC: 10:99:19:3f:51:2f Source IP: Dest. IP: Source Port: 9421 Dest. Port: 80 GET /index.html HTTP/1.1 Agent: SuperCrawler-2009/beta Referer: Host: Web Requests

Source MAC: 03:5a:66:17:90:5e Dest. MAC: 10:99:19:3f:51:2f Source IP: Dest. IP: Source Port: 9421 Dest. Port: 80 GET /index.html HTTP/1.1 Agent: SuperCrawler-2009/beta Referer: Host: Web Requests We have a Web request

Source MAC: 03:5a:66:17:90:5e Dest. MAC: 10:99:19:3f:51:2f Source IP: Dest. IP: Source Port: 9421 Dest. Port: 80 GET /index.html HTTP/1.1 Agent: SuperCrawler-2009/beta Referer: Host: Web Requests from this client

Source MAC: 03:5a:66:17:90:5e Dest. MAC: 10:99:19:3f:51:2f Source IP: Dest. IP: Source Port: 9421 Dest. Port: 80 GET /index.html HTTP/1.1 Agent: SuperCrawler-2009/beta Referer: Host: Web Requests going from this URL

Source MAC: 03:5a:66:17:90:5e Dest. MAC: 10:99:19:3f:51:2f Source IP: Dest. IP: Source Port: 9421 Dest. Port: 80 GET /index.html HTTP/1.1 Agent: SuperCrawler-2009/beta Referer: Host: Web Requests to this one

Source MAC: 03:5a:66:17:90:5e Dest. MAC: 10:99:19:3f:51:2f Source IP: Dest. IP: Source Port: 9421 Dest. Port: 80 GET /index.html HTTP/1.1 Agent: SuperCrawler-2009/beta Referer: Host: Web Requests using this agent

Click Collection

Structural properties: Degree (Link Count)

Structural properties: Strength (Site Traffic)

Evaluation of PageRank PR is a stationary distribution of visit frequency by a modified random walk Compare with actual site traffic (in-strength) From an application perspective, we care about the resulting ranking of sites

PageRank Assumptions 1. Equal probability of following each link from any given node 2. Equal probability of teleporting to each of the nodes 3. Equal probability of teleporting from each of the nodes

#1: Kendall’s Rank Correlation

Local Link Heterogeneity perfect concentration perfect homogeneity HH Index of concentration or disparity

#2: Teleportation Target Heterogeneity

#3: Teleportation Source Heterogeneity (“hubness”) s out < s in teleport sources browsing sinks -2 s out > s in popular hubs

Click data with retention of user identity

Popularity is unbounded!

Session properties depend on timeout.

Where’s the “sweet spot” ?

All users are abnormal.

Timeout dependence is much weaker.

Page Traffic Empty Referrer Traffic

Link Traffic Entropy

Session Size Session Depth

Roadmap Background Network flow analysis Web click analysis Conclusions

Internet behavior is characterized by extreme heterogeneity.

Conclusions Behavioral analysis offers advantages over packet inspection.

Conclusions We observe heterogeneity because users are idiosyncratic, not pathologically eclectic.

Future Directions Relationship between traffic & substrate Community detection Characterization of links Validation of HITS Time-series analysis

THANKS! Filippo Menczer Alessandro Vespigani Katy Börner Minaxi Gupta Kay Connelly Steven Wallace Gregory Travis David Ripley Edward Balas Camillo Viecco Jean Camp J. Duncan Alessandro Flammini Santo Fortunato Bruno Gonçalves José Ramasco Damon Beals Dave Hershberger John Stigall Heather Roinestad

Questions & Comments