Building Networks from Networks


Building Networks from Networks Mining Network Data to Model User Behavior IPAM Workshop -- October 2, 2007

Personal Introduction Tell the audience about myself: Ph.D. student at Indiana University Studying computer science (artificial intelligence and data networks) Advisor is Filippo Menczer Undergraduate background in mathematics Advanced Network Management Laboratory Applied research into analysis of data networks Close association with Internet2, TransPAC, etc. Focus of research Analyzing large-scale behavioral data Modeling user behavior Applications in network security, application design, search engines Personal tastes - fuzzy, real-valued measures - reduction of need for experts - avoidance of highly designed systems

Relevance to Workshop Discuss how structural data mining can allow us to infer aspects of user behavior “Structural data mining” = statistical properties of graph structures Going from lists of events to behavior of a population Applications wherever user behavior modeling can help

Relevance to Workshop (2) Increase community knowledge of available data sets Large graph-based data sets can be hard to come by Internet2 is designed for research partnerships ANML invites collaborative projects

Relevance to Workshop (3) Data management I am not a pure mathematician or expert statistician Do have expertise in managing analysis of very large data sets For each data set, I want to discuss Gathering the data Anonymizing the data Reducing the data

Network Flow Data Collaborators: Filippo Menczer and Alessandro Vespignani (IU, ISI/Torino). Filippo works with various Web projects: shared bookmarks, directed crawling, integrating content and link similarity - will be at the third workshop. Alessandro is a physicist (complex systems) who co-authored a book on the structure and evolution of the Internet and works on epidemiology on networks.

Network Flow Data What is it? Where do you get it? How do you process it? What can it tell you? First data source is “network flow data”. What is network flow data, what information does it contain, and how is it generated? What sources of network flow data are available? Data collection, anonymization, aggregation, derivation of a graph structure. What useful insights into user behavior can this data yield?

Introduction to the idea of a network flow (sorry if the material is basic, but…) Situation: Web surfer (Buddy) wants to access a page for his favorite band using the network.

Some sort of connection needed between the two computers.

To make a connection, the endpoints have to be identified To make a connection, the endpoints have to be identified. Use IP addresses.

Two computers can have multiple connections – need more detailed information.

Introduce idea of a “port” to uniquely identify a conversation.

Client: System INITIATING connection. Server: System WAITING for a connection and RESPONDING to it.

Client uses an EPHEMERAL port number because nobody else needs to know it. Server uses a WELL-KNOWN port number so that people know what door to open.

First Buddy contacts the server.

The server responds to Buddy.

A two-way connection is established, and a Web page can be downloaded.

This two-way connection can be thought of as two flows. One from client to server One from server to client Flows summarize a conversation without containing it. No user data inside!

This is all of the information stored in a standard flow record. Interesting features to point out: total number of packets AND octets; timestamps; cumulative OR of TCP flags – allows filtering of some attacks; protocol (TCP/UDP/ICMP) is important: it affects the meaning of the ports; ToS is “type of service” – not useful to look at; AS is “autonomous system” – useful for aggregation.
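The fields of a flow record like the one described above can be sketched as a simple data structure. This is an illustrative model, not the exact Cisco on-wire layout; field names and the example values are assumptions:

```python
from dataclasses import dataclass

@dataclass
class FlowRecord:
    """One flow: a summary of a one-way conversation, with no user payload."""
    src_addr: str    # source IP address
    dst_addr: str    # destination IP address
    src_port: int    # source port (ephemeral for clients)
    dst_port: int    # destination port (well-known for servers)
    protocol: int    # 6 = TCP, 17 = UDP, 1 = ICMP; affects the meaning of ports
    packets: int     # total number of packets in the flow
    octets: int      # total number of octets (bytes) in the flow
    start_ms: int    # timestamp of the first packet
    end_ms: int      # timestamp of the last packet
    tcp_flags: int   # cumulative OR of TCP flags; allows filtering of some attacks
    tos: int         # type of service (rarely useful)
    src_as: int      # source autonomous system; useful for aggregation
    dst_as: int      # destination autonomous system

# Example: one flow from a Web client's ephemeral port to a server on port 80.
f = FlowRecord("10.0.0.1", "192.0.2.7", 51234, 80, 6,
               12, 9000, 0, 350, 0b00011011, 0, 64512, 64513)
```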

Network Flow Data What is it? Where do you get it? How do you process it? What can it tell you? Will talk about: - flow generation - packet sampling - Abilene network - types of network (edge vs. transit)

Credit: Morehouse University. Flows generally come from network routers. The data is a side-effect of architecture – routing decisions are made on line cards. Cisco’s data format is the dominant standard.

Credit: Cisco Systems. Almost all flow data is sampled, typically 1:100 packets. Sampling methods VARY (time slice, uniform, others) and can be dependent on the TYPE OF PACKET (CPU-routed or not) – many potential biases. Also, the definition of a “flow” varies: timeout for large flows, arbitrary window for UDP and ICMP. Also, little support for IPv6.

The Internet2/Abilene network: a TCP/IP network connecting research and educational institutions in the U.S. Over 200 universities and corporate research labs. Also provides transit service between Pacific Rim and European networks. Ideal for behavioral studies: never full; lots of undergraduates; instrumented for research (core node data centers, etc.). Perhaps most important: it is a transit network rather than an edge network – a much smaller source of bias than edge networks.

Network Flow Data What is it? Where do you get it? How do you process it? What can it tell you? The data occurs in very large volumes. The usual approach: treat flow records as relations, aggregate in various ways, store in a database, and ask SQL-like questions. This leads to an emphasis on anomaly detection by thresholds, etc. Problems with this: thresholds are for well-behaved distributions, and there is a lot of information in the relationships. Suppose computer C is evil and A is an accomplice; then B may be as well if it relates to C similarly. Or: suppose C is evil; then B may be as well if it exhibits similar behavior. Behavior is a graph idea. “X said 10,000 words today” vs. “X spoke 5,000 words to A and 5,000 words to B” vs. “X spoke 1 word to each of 10,000 people”

Flows are exported in Cisco’s netflow-v5 format and anonymized before being written to disk. Sampling: random 1:100 Export: 48-byte records Anonymization: choice of two techniques 13-bit mask (156.56.103.1 -> 156.56.96.0) unique index (temporary) why not one-way hash? 32-bit key space is trivial
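The 13-bit mask mentioned above zeroes the low 13 bits of the IPv4 address, keeping a /19 prefix. A minimal sketch of that anonymization step (function name is illustrative):

```python
import ipaddress

def anonymize_13bit(addr: str) -> str:
    """Zero the low 13 bits of an IPv4 address, keeping the /19 prefix."""
    ip = int(ipaddress.IPv4Address(addr))
    masked = ip & ((0xFFFFFFFF << 13) & 0xFFFFFFFF)
    return str(ipaddress.IPv4Address(masked))

print(anonymize_13bit("156.56.103.1"))  # -> 156.56.96.0
```

This illustrates why a one-way hash would not help here: with only 2^32 possible inputs, an attacker can hash every address and invert the mapping by brute force.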

Data Dimensions Abilene on April 14, 2005: 600 million flow records (almost 28 gigabytes on disk). About 200 terabytes of data exchanged – roughly 25,000 DVDs of information. 15 million unique hosts involved. No loss during data gathering – about 3 megabits per second. We don’t know how many hosts really exist – spoofing, scans, etc.

A flow is an edge. Mention use of time information for aggregation OR dynamic analysis

Weighted Bipartite Digraph What’s a client and what’s a server? - heuristic: well-known port is a server - hosts can occur in both sets (extent is interesting in itself – Web vs. P2P) Why bipartite? - many protocols are highly asymmetric
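The client/server heuristic above (the endpoint on a well-known port is the server) can be sketched as follows. The flow tuples, the port cutoff, and the choice to accumulate octets on a client-to-server edge are illustrative assumptions, not the exact construction used in the talk:

```python
from collections import defaultdict

WELL_KNOWN = 1024  # heuristic cutoff: ports below this mark the server side

def build_bipartite(flows):
    """Build a weighted client -> server digraph from
    (src, sport, dst, dport, octets) flow tuples.

    A host may appear on both the client side and the server side;
    the extent of that overlap is interesting in itself (Web vs. P2P).
    """
    weight = defaultdict(int)  # (client, server) -> total octets
    for src, sport, dst, dport, octets in flows:
        if dport < WELL_KNOWN:       # destination is on a well-known port: server
            weight[(src, dst)] += octets
        elif sport < WELL_KNOWN:     # source is on a well-known port: server
            weight[(dst, src)] += octets
        # else: ambiguous flow; skipped in this sketch
    return dict(weight)

edges = build_bipartite([
    ("A", 51000, "S", 80, 500),    # client A -> server S (Web request flow)
    ("S", 80, "A", 51000, 9000),   # reply flow: A is still the client
])
```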

Port 80 (Web) Port 6346 (Gnutella) Port 25 (Mail) Port 19101 (???) Can construct different networks based on different flow attributes - for example, server port number ~ application - identity of nodes is the same, edges and weights are different

Network Flow Data What is it? Where do you get it? How do you process it? What can it tell you? Results gleaned so far from this data Perhaps you can suggest other methods of analysis? Spectral methods k-cores

Three biggest chunks of network traffic, by proportion of total flows. Web = HTTP, HTTPS, alternate port, etc. P2P = Bittorrent, Gnutella, Napster, etc. Other = everything else under the sun

Basic distributions to examine for a behavioral network

First, explain that these are PDFs, with in/out combined. Point out the number of orders of magnitude – 10 for strength! Web client exponents for k and s in [2, 3] – unbounded variance. Web server exponents for k and s in [1, 2] – unbounded mean. Discuss what this means for thresholds - people want to do anomaly detection in transit networks - no good baseline for Web servers! Not a power law for P2P: individual computers - also, client + server similar

Examination of k vs. s - 2D histogram with log bins, normalized in each degree bin - a pdf of pdfs. Data points to the right are the mean within each bin - the means are not all well-defined. Main result: Web clients are superlinear - meaning traffic per server grows with the number of servers

Other methods of aggregation Looking at behavioral network up to this point Functional network: use applications and hosts as nodes Application network: just applications as nodes

Application Correlation Consider the out-strength of a client in the networks for ports p and q: Amount of data generated by a client for an application

Application Correlation Build a pair of vectors from the distribution of strength values: Each vector has one entry for each client

Application Correlation Examine the cosine similarity of the vectors: When σ = 0, applications p and q are never used together. When σ = 1, applications p and q are always used together, and to the same extent. Use cosine similarity as basic measure of similarity (recall these are vectors in 10,000,000-dimensional space) Do not consider anti-correlation - no such thing as negative traffic - failure to act is much weaker evidence
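The similarity measure can be sketched in a few lines; the strength vectors below are toy values, not real traffic (in practice these vectors live in a space with one dimension per client, i.e. ~10,000,000 dimensions):

```python
import math

def cosine_similarity(u, v):
    """sigma(p, q): cosine of the angle between two out-strength vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(a * a for a in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an unused application is similar to nothing
    return dot / (norm_u * norm_v)

# One entry per client: out-strength on port p and on port q.
s_p = [100, 0, 50]
s_q = [200, 0, 100]  # used by exactly the same clients, proportionally
print(cosine_similarity(s_p, s_q))  # -> 1.0 (up to floating-point rounding)
```

Note that this measure cannot be negative for traffic vectors, which matches the point above: there is no such thing as negative traffic, so anti-correlation is not considered.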

Clustering Applications We now have σ(p, q) for every pair of ports Convert these similarities into distances: If σ = 0, then d is large; if σ = 1, then d = 0 Now apply Ward’s hierarchical clustering algorithm Other clustering algorithms yield similar results – Ward’s used for convenience
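The exact distance transform is not shown in the transcript; one function with the stated endpoint behavior (d large when σ = 0, d = 0 when σ = 1) is d = 1/σ − 1, sketched here purely as an assumption:

```python
def similarity_to_distance(sigma, eps=1e-12):
    """Map a similarity in [0, 1] to a distance: d = 1/sigma - 1.

    sigma = 1 -> d = 0; sigma -> 0 -> d grows without bound.
    (The actual transform used in the talk is not shown here;
    this is one choice with the stated endpoint behavior.)
    """
    return 1.0 / max(sigma, eps) - 1.0

print(similarity_to_distance(1.0))  # -> 0.0
print(similarity_to_distance(0.5))  # -> 1.0
```

A distance matrix built this way over all port pairs could then be fed to Ward's hierarchical clustering, e.g. `scipy.cluster.hierarchy.linkage` with `method='ward'`.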

Features to point out - top ~30 ports by total amount of traffic - symmetric matrix, one port per row or column - matrix is manually ordered - pink: P2P, green: Web, blue: traditional client/server - ? = unknown applications - did classification of the 16 highest unknown ports - discovered ClubBox - no failed classifications Important point: classification by WHAT IT DOES, not WHAT IT IS. - What it does IS what it is. - example: the purposes of Usenet and IRC have changed, but the protocols have not

Next Stop: Behavioral Web Data (Clicks) A different data set – even bigger, with more analytical challenges

Behavioral Web Data Collaborators: Filippo Menczer (IU, ISI/Torino), Santo Fortunato (ISI/Torino), Alessandro Vespignani (IU, ISI/Torino), Alessandro Flammini (IU)

Eight branch campuses Constant stream of 600 – 800 Mbps of traffic Includes all non-internal traffic: academic and commodity Anonymization: no client information is retained - no IP addresses - no distinction between clients

We do not keep up, despite a well-tuned network stack, tweaked driver, optimized code, etc. We capture about 30% of clicks during “prime time”, and all of them at off-hours - this introduces a time-of-day bias. Explain the virtual host and referrer fields. Explanation of what we keep: - virtual host - full target URL: the form is used to determine type - agent: over 50,000 different agents (MS anecdote) - timestamp, referring host, target URL, client location, browser/bot

Over half of all page fetches have no referrer. Only about 8% of human page fetch traffic is search-driven (Note: no session information)

Take the top servers and top edges, and graph the intersection. Layout using a force-directed model (Kamada-Kawai). Width ~ amount that the “web highway” is used.

Expected result: gamma = 2.1 (well-known result) Why don’t we match? - the web has changed? - sampling bias? Idea: take random sample of servers, compare our sampled k_in with the k_in measured by Yahoo!

x-axis: Yahoo! / y-axis: our data Weak correlation – but sublinear in log-log space - systematic undersampling of very popular sites This is a clue: - each link to a very popular site is less important - everybody knows where it is - nobody needs to find a link to Microsoft or CNN

Again, gamma < 2 → no well-defined mean!

The idea is to predict future Web traffic based on current traffic - why? To see how predictable it is; Web caching; traffic allocation. Use one hour of data as the predictor – “there will be an edge from X to Y with weight W”. Precision: # of correct clicks / # of clicks guessed. Recall: # of correct clicks / # of clicks that actually happen. Strong daily effect. Upswing at one week. The bottom of the trough is still > 0.3!
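The precision and recall definitions above can be sketched as set operations over click edges. This simplified sketch treats predictions as an unweighted edge set (the actual evaluation predicts edge weights as well); the toy edges are invented:

```python
def precision_recall(predicted, actual):
    """predicted, actual: sets of (source, target) click edges.

    precision = correct / guessed; recall = correct / actually happened.
    """
    correct = len(predicted & actual)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(actual) if actual else 0.0
    return precision, recall

pred = {("X", "Y"), ("X", "Z"), ("Q", "R")}
act = {("X", "Y"), ("X", "Z"), ("A", "B"), ("C", "D")}
p, r = precision_recall(pred, act)  # 2 of 3 guesses correct; 2 of 4 clicks found
```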

Quick refresher: PageRank is a simulation of a random walker on the web graph - the first eigenvector of the connectivity matrix. (1) Equal probability of starting anywhere. (2) Equal probability of following any link. (3) Equal probability of jumping at any time. Of these, (2) is very easy to incorporate - just use a weighted connectivity matrix with standard PR (this will still converge) - the others require a bit more thought.
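Weighting the link-following step only changes the transition probabilities: the walker follows each out-link with probability proportional to its observed traffic. A minimal power-iteration sketch, with a toy graph and uniform teleportation kept as in standard PageRank (this is an illustrative implementation, not the code used in the study):

```python
def weighted_pagerank(weights, alpha=0.85, iters=100):
    """weights: dict (u, v) -> edge weight. Returns dict node -> rank.

    Standard PageRank power iteration, except that the probability of
    following an out-link is proportional to its weight.
    """
    nodes = sorted({n for edge in weights for n in edge})
    n = len(nodes)
    out = {node: 0.0 for node in nodes}          # total out-weight per node
    for (u, _v), w in weights.items():
        out[u] += w
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iters):
        new = {node: (1 - alpha) / n for node in nodes}  # teleportation mass
        for node in nodes:
            if out[node] == 0:                   # dangling node: spread uniformly
                for m in nodes:
                    new[m] += alpha * rank[node] / n
        for (u, v), w in weights.items():        # weighted link-following
            new[v] += alpha * rank[u] * w / out[u]
        rank = new
    return rank

# A sends 90% of its traffic to B and 10% to C.
r = weighted_pagerank({("A", "B"): 9, ("A", "C"): 1,
                       ("B", "A"): 1, ("C", "A"): 1})
```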

Method of evaluation: compare different distributions by creating the rank lists and comparing them using Kendall’s tau - accelerated version with O(n log n) running time - range is -1 for perfect anticorrelation to 1 for perfect correlation Results: - PR and PRW disagree the most for the highest-traffic sites but then become very similar - Much larger disparity between PR / PRW and actual traffic Conclusion: - There is something significant about jumping behavior and start pages that PR does not capture, even w/ weights - Is this where content similarity can finally come in?
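Kendall's tau between two rank lists can be sketched naively in O(n^2) by counting concordant and discordant pairs; the accelerated O(n log n) version mentioned above uses a merge-sort-style inversion count instead. A minimal sketch, assuming no ties (as in a strict rank list):

```python
from itertools import combinations

def kendall_tau(x, y):
    """Rank correlation in [-1, 1] between two equal-length score lists.

    Naive O(n^2) pair count; assumes no ties.
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / pairs

print(kendall_tau([1, 2, 3, 4], [1, 2, 3, 4]))  # -> 1.0  (perfect correlation)
print(kendall_tau([1, 2, 3, 4], [4, 3, 2, 1]))  # -> -1.0 (perfect anticorrelation)
```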

Thanks to my collaborators! Flow Analysis Filippo Menczer (IU, ISI/Torino) Alessandro Vespignani (IU, ISI/Torino) Click Analysis Santo Fortunato (ISI/Torino) Alessandro Flammini (IU)

Thank you!