Smarter Searching for a Network Packet Database
William (Bill) Kenworthy
School of Information Technology, Murdoch University, Perth, Western Australia
Content
- This presentation is about an alternative way to search and/or classify data travelling over a network
- I will describe the background, methodology and results of the research
- The research covers two seemingly disparate disciplines, so as the conference has a communication focus there is some background on bioinformatics to set the scene
- This presentation is (almost) maths free! If you want the maths, see the paper :)
Motivation
- Test and validate alternative ways to view network data
- Better visualise the intrinsic relationships between packets of data, based on structure rather than content
- Part of investigating the possibilities inherent in mining data via structure-based methods
- Searching using statistical ranking of possible answers
About
- Searching for information in high-speed network traffic is difficult, but is basically a "solved" problem!
- What is still a problem is searching for partial, obfuscated or spatially separated (in the data stream) search terms
- The work described here is a successful attempt to use characteristics more commonly associated with biological systems to identify areas of interest in a network data stream
Searching
- Traditional database search results come from exact (yes/no) matching based on some regular expression system, e.g. [Bb]ank*
- Instead, the algorithms I am recommending match on the low-level structure of a sequence of characters:
  - character value and position/relationship in the stream
  - character/term substitution
- Results are ranked according to identity score and include false error rate data
Problem: dealing with raw bits on a network
What do we mean by "bioinformatics" algorithms?
- There are useful parallels between the way data is structured in a stream of network data and a biological genome
- Target the structures within a data stream for searching
- Very sophisticated, statistically valid search algorithms were developed for use in searching biological data
- Results can be statistically correlated and ranked
What is constant? Structure!
- The property of the algorithms developed for bioinformatics that we are using primarily targets structure
- IP numbers will change in the header of an IP packet, BUT the position and placement of other tokens near the IP number do not (fixed-size fields)
- This property extends to data fields
- Example: DNS data packets will have a similar signature, with slight differences depending on the mutable data
Structure
00 => A
01 => C
10 => G
11 => T
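The slide's mapping takes each byte two bits at a time and emits one nucleotide per bit pair. A minimal sketch of that transcoding, assuming most-significant-bits-first ordering (the paper's actual bit order is not stated here, so that choice is an assumption, as is the function name):

```python
NUCLEOTIDES = "ACGT"  # index 0b00 -> A, 0b01 -> C, 0b10 -> G, 0b11 -> T

def bytes_to_dna(data: bytes) -> str:
    """Transcode raw packet bytes into a DNA-alphabet string,
    two bits per nucleotide, most significant bit pair first
    (bit order is an assumption, not taken from the paper)."""
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(NUCLEOTIDES[(byte >> shift) & 0b11])
    return "".join(out)

print(bytes_to_dna(b"\x1b"))  # 0x1b = 00 01 10 11 -> "ACGT"
```

Each input byte always yields exactly four nucleotides, so byte offsets in the packet map directly to fixed positions in the sequence, which is what preserves the structural relationships the search relies on.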
Example plot of relationships
Methodology
- The software used was standard bioinformatics software, with the input data modified to suit
- Most bioinformatics software has been implemented by large teams over many years; it is not practical for an individual to re-implement it for a different purpose :(
- Solution: translate packetised network data into bioinformatics-compatible data files by mapping ones and zeros to the DNA alphabet, i.e. basic data abstraction
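Standard bioinformatics tools such as BLAST expect sequence input in FASTA format, so "bioinformatics-compatible data files" plausibly means one FASTA record per packet. A hedged sketch of that translation step, assuming the 2-bit DNA mapping from the earlier slide; the record naming and line width are illustrative choices, not the paper's:

```python
def packets_to_fasta(packets, width=60):
    """Emit a FASTA string with one record per packet, so standard
    bioinformatics tools can index the data. Two bits of each byte
    become one nucleotide (00->A, 01->C, 10->G, 11->T)."""
    lines = []
    for i, pkt in enumerate(packets):
        seq = "".join("ACGT"[(b >> s) & 0b11]
                      for b in pkt for s in (6, 4, 2, 0))
        lines.append(f">packet_{i} len={len(pkt)}")          # header line
        lines.extend(seq[j:j + width]                         # wrapped body
                     for j in range(0, len(seq), width))
    return "\n".join(lines) + "\n"

print(packets_to_fasta([b"\x1b", b"\xff"]))
```

A file produced this way could then be indexed with the stock BLAST database-building tools, which is consistent with the "BLAST formatted flat file" database mentioned later.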
What?
- We propose to intelligently identify network traffic in a way that uses relationships between structural elements embedded in the data, rather than the literal content of the data
- Use this as a method to identify and classify network data into categories against which an event can be notified
- We have created a database of known good and bad data samples, which allows us to place network data in one of three possible categories:
  - known good
  - known bad
  - unknown
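The three-way split above can be sketched as a small decision function. This is an illustrative sketch only: the hit structure, threshold values and database labels are assumptions for the example, not values from the paper.

```python
def classify(hits, bad_db="bad", min_identity=0.90, max_evalue=1e-5):
    """Place a packet into one of three categories from its ranked
    database hits. `hits` is a list of (database, identity_fraction,
    e_value) tuples, best match first; thresholds are illustrative.
    """
    for db, identity, evalue in hits:
        # Only a strong, statistically significant match decides.
        if identity >= min_identity and evalue <= max_evalue:
            return "known bad" if db == bad_db else "known good"
    # No hit was convincing enough: fall through to the third bucket.
    return "unknown"

print(classify([("bad", 0.97, 1e-20)]))   # strong match in the bad set
print(classify([("good", 0.55, 3.0)]))    # weak match only -> unknown
```

The useful property of this scheme is the explicit "unknown" bucket: traffic that matches neither sample set is surfaced rather than silently forced into good or bad.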
Database Creation
- Created with isolated island networks using generic PCs with various operating systems; database pollution was a problem
- "Good" samples were typical email, database and browsing traffic
- "Bad" samples were from PCs intentionally infected with botnet, virus and worm examples
- The database is in the form of indexed motifs in a "BLAST"-formatted flat file design
Process
- Processing starts by extracting a packet of data to user space (via the Linux kernel netfilter nfqueue module)
- The packet (as a whole) is transcoded and searched against the database
- Returned is a set of "motifs", with score and false error rate statistics for each motif matched in the database
- Event notification works on a threshold basis, via an election process over the top-rated N hits returned (hits are ranked in order of identity score)
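The election step in the last bullet can be sketched as a majority vote over the top N ranked hits, firing an event only when the winning category clears a threshold. The function name, the vote fraction and the hit tuple shape are assumptions made for the sketch, not details taken from the paper.

```python
from collections import Counter

def elect(ranked_hits, n=10, threshold=0.6):
    """Vote over the top-n hits (already ranked best-first by
    identity score). Returns (category, support) when one category's
    share of the votes reaches `threshold`, else (None, support)."""
    top = ranked_hits[:n]
    if not top:
        return None, 0.0
    votes = Counter(category for category, _score in top)
    category, count = votes.most_common(1)[0]
    support = count / len(top)
    return (category, support) if support >= threshold else (None, support)

hits = [("bad", 98), ("bad", 95), ("good", 90), ("bad", 88)]
print(elect(hits, n=4))  # -> ("bad", 0.75): 3 of the top 4 hits agree
```

Voting over several hits rather than trusting the single best match gives some robustness against one spurious high-scoring alignment.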
Implementation
- The test design has proven less than reliable at higher packet rates, mainly due to inefficient design
- The next step is to implement the reference design as a Snort IDS module and link it to Snort's event notification process, where the well-designed data handling will alleviate the problems mentioned above
The Future
- These techniques have wide applicability to search problems where data is structured but mutable
- And for something completely different :) using a similar process for detecting collusion between student assignments, based on detecting structural similarities in software coding styles
- Create a database of motifs based on code … search!
Existing work?
- Considering the advantages I have found, very little work has been undertaken using these algorithms
- IBM proposed "An Intrusion-Detection System Based on the Teiresias Pattern-Discovery Algorithm" in 1999
- IBM proposed using the Teiresias algorithm for spam filtering in 2004; commentators thought it was interesting, but there has been little further activity...
Conclusions
- It works :)
- Known/unknown sorting might be a unique niche application
- The ability to statistically rank similarity is a useful tool, opening up access to alternative ways to view search results
Questions?
William (Bill) Kenworthy
W.Kenworthy@murdoch.edu.au
Thank you!