Published by Sophie Burke. Modified over 9 years ago.
1
INDIANA UNIVERSITY
FlowRank
Presentation by ANML, July 2004
2
About the Presenter
Mark Meiss
Academic Background:
– B.S. Mathematics, B.S. Computer Science
– Ph.D. student in Department of Computer Science
Research interests:
– Structural analysis of network traffic data
– High-performance file transfer protocols
– Autonomous information retrieval agents
3
About the Presenter
Professional Experience:
– Over 10 years in software development
– With IU IT Services since 1997
– Worked with Bloomington NOC
– First employee of ANML
– Developed Animated Traffic Map, Router Proxy, Tsunami file transfer protocol, etc.
4
PageRank
PageRank is a Web page ranking system invented by Brin and Page of Google
– Attempts to measure importance of a Web page
– Pages gain rank by being pointed to by many pages and by being pointed to by pages with high rank
– Calculated offline using an iterative algorithm
– Examines only the connections in the Web, not the content of the pages
5
Technical Details of PageRank
A given set of Web pages creates an implied directed graph of connections
– The graph has an edge from page A to page B if page A links to page B
This graph can be represented as a matrix
– If entry (i, j) is non-zero, page i links to page j
– Sparse representation is necessary
Google’s matrix has over 1,000,000,000,000,000,000 entries
6
Technical Details of PageRank
Problem with “dangling links”
– These are links to pages that contain no links of their own
– These pages absorb PageRank without distributing it to other pages
Solution is to say that a page without outbound links actually links to every page with equal probability
7
Calculating PageRank
We can think of the connectivity matrix as defining a Markov model that generates a random list of Web pages
– In other words, we can use the matrix to make a random walk of the Web
The PageRank vector is the principal eigenvector (the one with the largest eigenvalue) of the connectivity matrix, once its entries are normalized into transition probabilities
– In other words, each entry is the long-run probability that we’re at the corresponding page during our random walk
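The random-walk view above can be sketched as a power iteration. This is a minimal illustration, not the code used in this work: the function name `pagerank`, the toy graph `web`, and the damping factor of 0.85 (the value commonly quoted from the original PageRank paper, not mentioned in these slides) are all assumptions. Pages without outbound links are treated as linking to every page with equal probability, as described for dangling links.

```python
def pagerank(links, damping=0.85, iterations=100):
    """links maps each page to the list of pages it links to (hypothetical input format)."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start the walk at a uniformly random page
    for _ in range(iterations):
        # Rank held by pages with no outbound links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            for q in targets:
                # Each page passes its rank on in equal shares along its links.
                new_rank[q] += damping * rank[p] / len(targets)
        rank = new_rank
    return rank


# Toy graph: C is pointed to by three pages, D by none.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

The ranks always sum to 1, so they can be read directly as the probabilities of the random walk.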
8
Vulnerability of PageRank
PageRank was first published in 1998
Since then, it has been shown to be vulnerable to “clique attacks”
– Unsavory Web site owner buys 75 domains
– Home page on each domain points to each of the other domains
– All of the domains thus rise in PageRank score
Google blacklists Web sites for this
9
FlowRank
Netflow records also create an implied connectivity matrix
– We can create an edge from host A to host B if host A transmits data to host B
The vulnerability to a clique attack becomes a detector of peer-to-peer applications and social networks!
10
Weighted PageRank
The volume of data in a flow is an important characteristic of the traffic
– We modify the basic PageRank algorithm by weighting all entries based on traffic volume
– This new algorithm still converges, but the final values have a significantly different distribution
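One way to read the weighting step is that a host passes rank to its peers in proportion to the bytes it sent to each of them, instead of in equal shares. The slides do not give the exact scheme, so the sketch below is an assumption throughout: the name `weighted_flowrank`, the `(src, dst, volume)` record layout, the damping factor of 0.85, and the requirement that volumes be positive are all hypothetical.

```python
from collections import defaultdict


def weighted_flowrank(flows, damping=0.85, iterations=100):
    """flows is a list of (src, dst, volume) records with positive volumes (assumed layout)."""
    out = defaultdict(lambda: defaultdict(float))
    hosts = set()
    for src, dst, volume in flows:
        out[src][dst] += volume  # total volume sent along each edge
        hosts.update((src, dst))
    hosts = sorted(hosts)
    n = len(hosts)
    rank = {h: 1.0 / n for h in hosts}
    for _ in range(iterations):
        # Hosts that never transmit are treated like dangling pages.
        dangling = sum(rank[h] for h in hosts if h not in out)
        new_rank = {h: (1.0 - damping) / n + damping * dangling / n for h in hosts}
        for src, targets in out.items():
            total = sum(targets.values())
            for dst, volume in targets.items():
                # Rank flows along an edge in proportion to its share of src's traffic.
                new_rank[dst] += damping * rank[src] * volume / total
        rank = new_rank
    return rank


flows = [("a", "b", 900), ("a", "c", 100), ("b", "a", 500), ("c", "a", 500)]
flow_ranks = weighted_flowrank(flows)
print({h: round(r, 3) for h, r in flow_ranks.items()})
```

In this toy data, host b receives nine times the volume that c does from a, so it ends up with the higher rank even though both have exactly one inbound edge.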
11
Weighted PageRank
12
So What’s It Good For?
These are potential applications; this research is just starting
– Automatic detection of peer-to-peer applications or “bot networks”
– Heuristic for node importance in visualization tools
– Heuristic for ordering importance of IDS anomalies
13
Rethinking the Edges
In theory, every TCP connection between host A and host B involves two flows
– One from host A to host B
– One from host B to host A
Due to sampling, we often catch only one of the two
– This interferes with the operation of FlowRank
14
Rethinking the Edges
When we see a flow from host A to host B, why should the edge go from A to B and not from B to A?
– We can try to identify which host is the client (initiator of the connection) and which is the server (receiver of the connection)
– We can make a good guess at this by studying the relative frequency of the ports used
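A possible reading of the port-frequency heuristic: a service port such as 80 shows up in many flows, while a client's ephemeral port shows up in very few, so the endpoint using the more common port is probably the server. The sketch below illustrates that idea only; the function name `classify_endpoints`, the four-field flow layout, and the tie-breaking rule are assumptions, not details from the slides.

```python
from collections import Counter


def classify_endpoints(flows):
    """flows is a list of (src_ip, src_port, dst_ip, dst_port) tuples (assumed layout).

    Returns one (client, server) edge per flow.
    """
    port_counts = Counter()
    for _, sport, _, dport in flows:
        port_counts[sport] += 1
        port_counts[dport] += 1
    edges = []
    for src, sport, dst, dport in flows:
        if port_counts[sport] > port_counts[dport]:
            # The source port is the more common one, so the source is likely the server.
            client, server = dst, src
        else:
            # Otherwise (including ties) keep the direction in which the flow was seen.
            client, server = src, dst
        edges.append((client, server))
    return edges


flows = [
    ("10.0.0.1", 51000, "10.0.0.9", 80),
    ("10.0.0.9", 80, "10.0.0.2", 52000),  # only the server-to-client half was sampled
    ("10.0.0.3", 53000, "10.0.0.9", 80),
]
edges = classify_endpoints(flows)
print(edges)
```

With this orientation, the second flow still yields an edge pointing at 10.0.0.9, even though sampling caught only the reply direction.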
15
Rethinking the Edges
This client/server classification seems to greatly increase the utility of the connectivity graph
Examining the connectivity graph over time can give us an idea of the type of application that runs on a TCP port
16
Visualization
We build the entries for the connectivity matrix by assigning an index to each IP address
– The first host to show up is index 1, etc.
Suppose there is a flow from 127.54.1.3 to 10.99.4.63
– 127.54.1.3 may get index 314
– 10.99.4.63 may get index 57
– Then entry (314, 57) in the matrix will be non-zero
We can see this matrix using the “spy” command in Matlab
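The index assignment can be sketched in a few lines. This is an illustration under assumptions: the name `build_connectivity` is hypothetical, flows are reduced to `(src, dst)` pairs, and within a single flow the source is indexed before the destination. The non-zero entries are kept as a set of coordinate pairs, which is one simple sparse representation.

```python
def build_connectivity(flows):
    """flows is a list of (src_ip, dst_ip) pairs in order of occurrence (assumed layout)."""
    index = {}       # IP address -> 1-based index, in order of first appearance
    entries = set()  # (row, column) coordinates of the non-zero matrix entries
    for src, dst in flows:
        for host in (src, dst):
            if host not in index:
                index[host] = len(index) + 1
        entries.add((index[src], index[dst]))
    return index, entries


flows = [
    ("127.54.1.3", "10.99.4.63"),
    ("10.99.4.63", "192.0.2.7"),
    ("127.54.1.3", "10.99.4.63"),  # a repeated flow adds no new entry
]
index, entries = build_connectivity(flows)
print(index)
print(sorted(entries))
```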
17
Visualization
18
Problems
Assigning the indices in order of occurrence
– Makes the non-zero entries in the graph grow down and to the right over time
– Concentrates high-traffic nodes in the upper left
– Exposes artifacts of netflow sampling
Static image gives very little temporal information
19
Solutions
After generating the full index for a set of data, we can randomize its order
– Tends to separate high-traffic nodes
– Avoids sampling artifacts
We can include a temporal element as well
– Produce a movie with a sliding window of netflow traffic
– For example, use a 1-hour window and 15-minute increments for each frame
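Both fixes are easy to sketch. The code below is an illustration only: the names `randomize_index` and `sliding_windows`, the `(timestamp, src, dst)` layout with timestamps in seconds, and the choice to start the first window at the earliest flow are assumptions. The defaults correspond to the 1-hour window and 15-minute increment mentioned above.

```python
import random


def randomize_index(index, seed=None):
    """Reassign the existing index values to hosts in random order."""
    hosts = list(index)
    positions = list(index.values())
    random.Random(seed).shuffle(positions)
    return dict(zip(hosts, positions))


def sliding_windows(flows, window=3600, step=900):
    """Yield (start, flows_in_window) for each frame of the movie."""
    if not flows:
        return
    start = min(t for t, _, _ in flows)
    last = max(t for t, _, _ in flows)
    while start <= last:
        # A flow belongs to a frame if its timestamp falls in [start, start + window).
        yield start, [f for f in flows if start <= f[0] < start + window]
        start += step


index = {"10.0.0.1": 1, "10.0.0.2": 2, "10.0.0.3": 3}
shuffled = randomize_index(index, seed=42)

flows = [
    (0, "10.0.0.1", "10.0.0.2"),
    (1000, "10.0.0.2", "10.0.0.3"),
    (4000, "10.0.0.3", "10.0.0.1"),
]
frames = list(sliding_windows(flows))
print([(start, len(members)) for start, members in frames])
```

Because consecutive windows overlap by 45 minutes, most flows appear in several frames, which is what gives the resulting movie its continuity.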
20
[Interlude] …video demonstration…
21
Problems
We can’t see the FlowRank data
We can’t highlight the importance of any particular node
We can’t generate a video file in a convenient codec using Matlab
22
Solutions
Write a frame rendering program and save each frame as a .PNG file
– Use the mplayer system to create a DivX file
Use the FlowRank vector to modify the size of a flow in the frame
– Size of a flow is proportional to the number of standard deviations between the mean FlowRank and (src+dst)/2, the average of the FlowRank values of the flow’s source and destination
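The sizing rule amounts to a z-score of the flow's average endpoint rank. A minimal sketch follows, with several assumptions: the name `flow_sizes` and the `scale` parameter are hypothetical, the population standard deviation is used, and the absolute value is taken so that flows far below the mean are also drawn large (the slide does not say whether the sign matters).

```python
import statistics


def flow_sizes(rank, flows, scale=1.0):
    """rank maps host -> FlowRank; flows is a list of (src, dst) pairs (assumed layout)."""
    mean = statistics.mean(rank.values())
    std = statistics.pstdev(rank.values())
    sizes = {}
    for src, dst in flows:
        midpoint = (rank[src] + rank[dst]) / 2
        # Number of standard deviations between the midpoint and the mean FlowRank.
        deviations = abs(midpoint - mean) / std if std > 0 else 0.0
        sizes[(src, dst)] = scale * deviations
    return sizes


rank = {"a": 0.1, "b": 0.2, "c": 0.7}
sizes = flow_sizes(rank, [("a", "b"), ("b", "c")])
print({edge: round(size, 3) for edge, size in sizes.items()})
```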
23
[Interlude] …video demonstration…
24
Another Quick Fix
We don’t know whether a flow is important because of its source (server), its destination (client), or both
Solution: Give each flow a red component and a blue component
– A red flow is important because of the server
– A blue flow is important because of the client
– A magenta flow is important because of both
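The color scheme can be sketched as an RGB triple whose red channel tracks the server's rank and whose blue channel tracks the client's. The linear scaling against the highest rank in the data is an assumption made for illustration, as is the function name `flow_color`; the slides do not say how rank is mapped to intensity.

```python
def flow_color(rank, server, client):
    """Return an (R, G, B) color with components in 0..255 for one flow.

    rank maps host -> FlowRank and must contain at least one positive value.
    """
    top = max(rank.values())
    red = round(255 * rank[server] / top)   # importance contributed by the server
    blue = round(255 * rank[client] / top)  # importance contributed by the client
    return (red, 0, blue)


rank = {"s": 0.8, "c": 0.2}
print(flow_color(rank, server="s", client="c"))  # mostly red: the server dominates
print(flow_color(rank, server="c", client="s"))  # mostly blue: the client dominates
print(flow_color(rank, server="s", client="s"))  # magenta: both ends rank highly
```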
25
[Interlude] …video demonstration…
26
Evaluating FlowRank
How can we show that FlowRank is a useful metric for distinguishing traffic?
We need some empirical way of measuring its utility
It has to be useful enough to justify the (considerable) computational expense of calculating it
27
Experimental Setup
Split large volume of TCP netflow data into 65,536 bins, one for each port
Compute an n-dimensional statistical profile for each port (n is currently around 20)
– Also compute an (n+m)-dimensional profile, where the extra dimensions are based on FlowRank statistics
Apply clustering and classification algorithms (SVM, k-means, etc.) to each set of profiles
Examine the differences between the two sets
28
Structural Visualization
It would be nice to examine the connectivity matrix as an actual graph
This presents major problems
– Because of port-scanning, crawling, etc., most data contains a single large component containing over 2/3 of all the edges, plus some noise
– Optimal graph layout is an NP-hard problem
– Current graph layout packages can’t handle hundreds of thousands of nodes (with some limited exceptions)
29
Structural Visualization
Moving the visualization to 3D gives layout algorithms another degree of freedom
Also allows for better interactive navigation of the data (virtual fly-bys, etc.)
We have had some early success with the Tulip package
30
[Interlude] …video demonstration…
31
Future Directions
– Real-time visualization
– Anomaly detection
– Tunneled traffic detection
– Intent profiling
32
Your Ideas are Valued!
Please share any thoughts, criticisms, or questions you may have!
E-mail: mmeiss@indiana.edu