Building Networks from Networks Mining Network Data to Model User Behavior IPAM Workshop -- October 2, 2007
Personal Introduction Tell the audience about myself: Ph.D. student at Indiana University Studying computer science (artificial intelligence and data networks) Advisor is Filippo Menczer Undergraduate background in mathematics Advanced Network Management Laboratory Applied research into analysis of data networks Close association with Internet2, TransPAC, etc. Focus of research Analyzing large-scale behavioral data Modeling user behavior Applications in network security, application design, search engines Personal tastes - fuzzy, real-valued measures - reduction of need for experts - avoidance of highly designed systems
Relevance to Workshop Discuss how structural data mining can allow us to infer aspects of user behavior “Structural data mining” = statistical properties of graph structures Going from lists of events to behavior of a population Applications wherever user behavior modeling can help
Relevance to Workshop (2) Increase community knowledge of available data sets Large graph-based data sets can be hard to come by Internet2 is designed for research partnerships ANML invites collaborative projects
Relevance to Workshop (3) Data management I am not a pure mathematician or expert statistician Do have expertise in managing analysis of very large data sets For each data set, I want to discuss Gathering the data Anonymizing the data Reducing the data
Network Flow Data Collaborators: Filippo Menczer (IU, ISI/Torino) Alessandro Vespignani (IU, ISI/Torino) Filippo works with various Web projects: shared bookmarks, directed crawling, integrating content and link similarity - will be at third workshop Alessandro is a physicist (complex systems), co-authored book on structure & evolution of the Internet, epidemiology on networks
Network Flow Data What is it? Where do you get it? How do you process it? What can it tell you? First data source is “network flow data”. What is network flow data, what information does it contain, and how is it generated? What sources of network flow data are available? Data collection, anonymization, aggregation, derivation of a graph structure. What useful insights into user behavior can this data yield?
Introduction to the idea of a network flow (sorry if the material is basic, but…) Situation: Web surfer (Buddy) wants to access a page for his favorite band using the network.
Some sort of connection needed between the two computers.
To make a connection, the endpoints have to be identified. Use IP addresses.
Two computers can have multiple connections – need more detailed information.
Introduce idea of a “port” to uniquely identify a conversation.
Client: System INITIATING connection. Server: System WAITING for a connection and RESPONDING to it.
Client uses an EPHEMERAL port number because nobody else needs to know it. Server uses a WELL-KNOWN port number so that people know what door to open.
First Buddy contacts the server.
The server responds to Buddy.
A two-way connection is established, and a Web page can be downloaded.
This two-way connection can be thought of as two flows. One from client to server One from server to client Flows summarize a conversation without containing it. No user data inside!
This is all of the information stored in a standard flow record. Interesting features to point out: - total number of packets AND octets - timestamp - cumulative OR of TCP flags – allows filtering of some attacks - protocol is important (TCP/UDP/ICMP): affects meaning of ports - ToS is “type of service” – not useful to look at - AS is “autonomous system” – useful for aggregation
Network Flow Data What is it? Where do you get it? How do you process it? What can it tell you? Will talk about: - flow generation - packet sampling - Abilene network - types of network (edge vs. transit)
Credit: Morehouse University Flows generally come from network routers Data is side-effect of architecture – routing decisions made on line cards Cisco’s data format is dominant standard
Credit: Cisco Systems Almost all flow data is sampled Typically 1:100 packets Sampling methods VARY (time slice, uniform, others) Can be dependent on TYPE OF PACKET (CPU-routed or not) Many potential biases Also, varying definition of a “flow” - timeout for large flows - arbitrary window for UDP and ICMP Also, little support for IPv6
The Internet2/Abilene network TCP/IP network connecting research and educational institutions in the U.S. Over 200 universities and corporate research labs Also provides transit service between Pacific Rim and European networks Ideal for behavioral studies - never full - lots of undergraduates - instrumented for research (core node data centers, etc.) Perhaps most important: transit network rather than edge network Much smaller source of bias than edge networks
Network Flow Data What is it? Where do you get it? How do you process it? What can it tell you? Data occurs in very large volumes Usual approach: treat flow records as relations, aggregate in various ways, store in a database, and ask SQL-like questions Leads to emphasis on anomaly detection by thresholds, etc. Problems with this - thresholds are for well-behaved distributions - there is a lot of information in the relationships Suppose computer C is evil Suppose A is an accomplice Then B may be as well if it relates to C similarly OR… Suppose C is evil Then B may be as well if it exhibits similar behavior Behavior is a graph idea. “X said 10,000 words today” vs. “X spoke 5000 words to A and 5000 words to B” vs. “X spoke 1 word to each of 10,000 people”
Flows are exported in Cisco’s netflow-v5 format and anonymized before being written to disk. Sampling: random 1:100 Export: 48-byte records Anonymization: choice of two techniques 13-bit mask (156.56.103.1 -> 156.56.96.0) unique index (temporary) why not one-way hash? 32-bit key space is trivial
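The 13-bit mask can be sketched as follows, assuming it zeroes the low-order 13 bits of the IPv4 address (which is what the 156.56.103.1 example implies). The function name `anonymize` is hypothetical.

```python
import ipaddress

MASKED_BITS = 13  # number of low-order bits cleared, per the slide


def anonymize(addr: str, masked_bits: int = MASKED_BITS) -> str:
    """Zero the low-order bits of an IPv4 address given in dotted-quad form."""
    value = int(ipaddress.IPv4Address(addr))
    keep_mask = (0xFFFFFFFF << masked_bits) & 0xFFFFFFFF
    return str(ipaddress.IPv4Address(value & keep_mask))
```

With this mask, up to 8,192 adjacent addresses collapse into one anonymized value, so per-host identity is lost while network-level locality is preserved.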
Data Dimensions Abilene on April 14, 2005 600 million flow records About 200 terabytes of data exchanged This is roughly 25,000 DVDs of information Almost 28 gigabytes on disk 15 million unique hosts involved No loss during data gathering – about 3 megabits per second Don’t know how many hosts really exist – spoofing, scans, etc.
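A quick sanity check of the storage and rate figures, assuming every record is the 48-byte export size quoted on the earlier slide, that "600 million" is a round number, and that the records arrive evenly over a full day. Export packet headers are ignored.

```python
RECORDS = 600_000_000      # rounded count from the slide
RECORD_BYTES = 48          # netflow-v5 record size
SECONDS_PER_DAY = 86_400

# Total bytes written to disk for the day.
total_bytes = RECORDS * RECORD_BYTES

# Average collection rate in megabits per second.
avg_mbps = total_bytes * 8 / SECONDS_PER_DAY / 1e6
```

Under these assumptions the day occupies a little under 29 billion bytes and the average collection rate is a bit below 3 Mbps, in line with the figures on the slide.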
A flow is an edge. Mention use of time information for aggregation OR dynamic analysis
Weighted Bipartite Digraph What’s a client and what’s a server? - heuristic: well-known port is a server - hosts can occur in both sets (extent is interesting in itself – Web vs. P2P) Why bipartite? - many protocols are highly asymmetric
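A minimal sketch of the graph construction, assuming flows are available as `(src, sport, dst, dport, octets)` tuples. The set `KNOWN_SERVER_PORTS` is a hypothetical stand-in for whatever list of well-known ports is actually used, and flows where the heuristic cannot pick exactly one server are simply skipped here.

```python
from collections import defaultdict

# Hypothetical list of well-known server ports (an assumption for illustration).
KNOWN_SERVER_PORTS = {25, 80, 443, 6346}


def split_roles(src, sport, dst, dport, known=KNOWN_SERVER_PORTS):
    """Return (client, server), or None when the heuristic is ambiguous."""
    src_known, dst_known = sport in known, dport in known
    if dst_known and not src_known:
        return src, dst
    if src_known and not dst_known:
        return dst, src
    return None


def build_bipartite(flows):
    """Accumulate octets per (client, server) pair, one dict per direction."""
    to_server = defaultdict(int)  # weight of client -> server edges
    to_client = defaultdict(int)  # weight of server -> client edges
    for src, sport, dst, dport, octets in flows:
        roles = split_roles(src, sport, dst, dport)
        if roles is None:
            continue
        client, server = roles
        if src == client:
            to_server[(client, server)] += octets
        else:
            to_client[(client, server)] += octets
    return to_server, to_client
```

A host may appear as a client in one pair and as a server in another, which is the overlap between the two node sets that the slide calls interesting in itself.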
Port 80 (Web) Port 6346 (Gnutella) Port 25 (Mail) Port 19101 (???) Can construct different networks based on different flow attributes - for example, server port number ~ application - identity of nodes is the same, edges and weights are different
Network Flow Data What is it? Where do you get it? How do you process it? What can it tell you? Results gleaned so far from this data Perhaps you can suggest other methods of analysis? Spectral methods k-cores
Three biggest chunks of network traffic, by proportion of total flows. Web = HTTP, HTTPS, alternate port, etc. P2P = Bittorrent, Gnutella, Napster, etc. Other = everything else under the sun
Basic distributions to examine for a behavioral network
First, explain that these are PDFs, with in/out combined. Point out number of orders of magnitude – 10 for strength! Web client k and s in [2, 3] – unbounded variance Web server k and s in [1, 2] – unbounded mean Discuss what this means for thresholds - people want to do anomaly detection in transit networks - no good baseline for Web servers! Not power law for P2P : individual computers - also, client + server similar
Examination of k vs. s - 2D histogram with log bins, normalized in each degree bin - pdf of pdfs Data points to right are mean within each bin - means not all well-defined Main result: Web clients are superlinear - means traffic/server grows with number of servers
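The mean-per-bin part of this analysis can be sketched as follows, assuming edges are given as a `{(client, server): weight}` dictionary and that degree bins are powers of two. Both function names are hypothetical, and the full 2D histogram is not reproduced.

```python
from collections import defaultdict


def degree_and_strength(edges):
    """Return {client: (k, s)}: number of servers and total edge weight."""
    k = defaultdict(int)
    s = defaultdict(float)
    for (client, _server), weight in edges.items():
        k[client] += 1
        s[client] += weight
    return {c: (k[c], s[c]) for c in k}


def mean_strength_by_degree_bin(ks):
    """Group nodes into log2 degree bins and average strength within each bin."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for k, s in ks.values():
        b = k.bit_length() - 1  # exact floor(log2(k)) for integer k >= 1
        sums[b] += s
        counts[b] += 1
    return {b: sums[b] / counts[b] for b in sums}
```

If mean strength grows faster than degree from bin to bin, the relationship is superlinear. With heavy-tailed strength values these sample means can be unstable, which is the caveat the slide raises about means not being well-defined.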
Other methods of aggregation Looking at behavioral network up to this point Functional network: use applications and hosts as nodes Application network: just applications as nodes
Application Correlation Consider the out-strength of a client in the networks for ports p and q: Amount of data generated by a client for an application
Application Correlation Build a pair of vectors from the distribution of strength values: Each vector has one entry for each client
Application Correlation Examine the cosine similarity of the vectors: When σ = 0, applications p and q are never used together. When σ = 1, applications p and q are always used together, and to the same extent. Use cosine similarity as basic measure of similarity (recall these are vectors in 10,000,000-dimensional space) Do not consider anti-correlation - no such thing as negative traffic - failure to act is much weaker evidence
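A minimal sketch of the similarity computation, assuming each application's vector is stored sparsely as a `{client: out-strength}` dictionary so that the millions of zero entries never need to be materialized. The function name is hypothetical.

```python
import math


def cosine_similarity(s_p, s_q):
    """Cosine similarity of two sparse non-negative strength vectors."""
    # Iterate over the smaller dict; missing clients contribute zero.
    if len(s_q) < len(s_p):
        s_p, s_q = s_q, s_p
    dot = sum(w * s_q.get(client, 0.0) for client, w in s_p.items())
    norm_p = math.sqrt(sum(w * w for w in s_p.values()))
    norm_q = math.sqrt(sum(w * w for w in s_q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0  # an application with no traffic is similar to nothing
    return dot / (norm_p * norm_q)
```

Because strengths are never negative, the result always lies in [0, 1]. It reaches 1 when every client uses the two applications in the same proportion.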
Clustering Applications We now have σ(p, q) for every pair of ports Convert these similarities into distances: If σ = 0, then d is large; if σ = 1, then d = 0 Now apply Ward’s hierarchical clustering algorithm Other clustering algorithms yield similar results – Ward’s used for convenience
Features to point out - top ~ 30 points by total amount of traffic - symmetric matrix, one port per row or column - matrix is manually ordered - pink: P2P, green: Web, blue: traditional client/server ? = unknown applications - did classification of 16 highest unknown ports - discovered ClubBox - no failed classifications Important point: classification by WHAT IT DOES, not WHAT IT IS. - What it does IS what it is. - example: the purposes of Usenet and IRC have changed, but the protocols have not
Next Stop: Behavioral Web Data (Clicks) Different data set – even bigger, more analytical challenges
Behavioral Web Data Collaborators: Filippo Menczer Santo Fortunato (ISI/Torino) Alessandro Vespignani Alessandro Flammini (IU)
Eight branch campuses Constant stream of 600 – 800 Mbps of traffic Includes all non-internal traffic: academic and commodity Anonymization: no client information is retained - no IP addresses - no distinction between clients
We do not keep up. Well-tuned network stack, tweaked driver, optimized code, etc. Capture about 30% of clicks during “prime time”, all at off-hours - introduces time-of-day bias Explain virtual host and referrer fields Explanation of what we keep - virtual host - full target URL: form used to determine type - agent: over 50,000 different agents (MS anecdote) Timestamp, referring host, target URL, client location, browser/bot
Over half of all page fetches have no referrer. Only about 8% of human page fetch traffic is search-driven (Note: no session information)
Take top servers, top edges, graph the intersection Layout using force-directed model (Kamada-Kawai) Width ~ amount that “web highway” is used.
Expected result: gamma = 2.1 (well-known result) Why don’t we match? - the web has changed? - sampling bias? Idea: take random sample of servers, compare our sampled k_in with the k_in measured by Yahoo!
x-axis: Yahoo! / y-axis: our data Weak correlation – but sublinear in log-log space - systematic undersampling of very popular sites This is a clue: - each link to a very popular site is less important - everybody knows where it is - nobody needs to find a link to Microsoft or CNN
Again, gamma < 2 no well-defined mean!
Idea is to predict future Web traffic based on current traffic - why? See how predictable, Web caching, traffic allocation Use one hour of data as predictor – “there will be an edge from X to Y with weight W” Precision: # of correct clicks / # of clicks guessed Recall: # of correct clicks / # of clicks that actually happen Strong daily effect Upswing at one week Bottom of trough is still > 0.3 !
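A minimal sketch of the two measures, assuming both the prediction and the observed traffic are `{(source, target): clicks}` dictionaries. How "correct clicks" are counted on a weighted edge is not spelled out in the slide; this sketch assumes the overlap `min(predicted, actual)` on each edge, and the function name is hypothetical.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for weighted edge predictions."""
    # Assumption: a predicted click is correct if it is matched by an
    # actual click on the same edge, so the overlap is the minimum.
    correct = sum(min(w, actual.get(edge, 0)) for edge, w in predicted.items())
    guessed = sum(predicted.values())
    happened = sum(actual.values())
    precision = correct / guessed if guessed else 0.0
    recall = correct / happened if happened else 0.0
    return precision, recall
```

Using one hour as the predictor and sliding the comparison hour forward would then trace out the daily and weekly pattern described above.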
Quick refresher: PageRank is simulation of random walker on web graph - first eigenvector of connectivity matrix Three assumptions: (1) equal probability of starting anywhere (2) equal probability of following any link (3) equal probability of jumping at any time Of these, (2) is very easy to incorporate - just use a weighted connectivity matrix with standard PR (this will still converge) - others require a bit more thought
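A minimal power-iteration sketch of PageRank on a weighted connectivity matrix, assuming edge weights are click counts in a `{(source, target): weight}` dictionary. The damping value 0.85 is the conventional choice, not something stated in the talk, and starting points and jumps are still uniform here.

```python
def pagerank(weights, damping=0.85, iterations=100, tol=1e-12):
    """Weighted PageRank: links are followed in proportion to their weight."""
    nodes = sorted({node for edge in weights for node in edge})
    n = len(nodes)
    out_total = {v: 0.0 for v in nodes}
    for (src, _dst), w in weights.items():
        out_total[src] += w
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread uniformly.
        dangling = sum(rank[v] for v in nodes if out_total[v] == 0.0)
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {v: base for v in nodes}
        for (src, dst), w in weights.items():
            new_rank[dst] += damping * rank[src] * w / out_total[src]
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank
```

Setting every weight to 1 recovers the standard unweighted walk, which makes it easy to compare the two rankings on the same graph.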
Method of evaluation: compare different distributions by creating the rank lists and comparing them using Kendall’s tau - accelerated version with O(n log n) running time - range is -1 for perfect anticorrelation to 1 for perfect correlation Results: - PR and PRW disagree the most for the highest-traffic sites but then become very similar - Much larger disparity between PR / PRW and actual traffic Conclusion: - There is something significant about jumping behavior and start pages that PR does not capture, even w/ weights - Is this where content similarity can finally come in?
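For reference, a direct sketch of Kendall's tau. This is the plain quadratic pair-counting form (tau-a, with tied pairs counted as neither concordant nor discordant), not the accelerated O(n log n) version used for the actual analysis, so it is only suitable for small samples. The function name is hypothetical.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a for two equal-length score lists over the same items."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    if n < 2:
        raise ValueError("need at least two items")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Here `x` and `y` would be, for example, the PR score and the measured traffic for the same list of sites.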
Thanks to my collaborators! Flow Analysis Filippo Menczer (IU, ISI/Torino) Alessandro Vespignani (IU, ISI/Torino) Click Analysis Santo Fortunato (ISI/Torino) Alessandro Flammini (IU)
Thank you!