Lab for Internet & Security Technology (LIST) Northwestern University

Hamsa: Fast Signature Generation for Zero-day Polymorphic Worms with Provable Attack Resilience. Lab for Internet & Security Technology (LIST), Northwestern University.

The Spread of Sapphire/Slammer Worms. In the first 30 minutes of Sapphire's spread, we recorded nearly 75,000 unique infections. As we will detail later, most of these infections actually occurred within 10 minutes. This graphic is more for effect than for technical detail: we couldn't determine a detailed location for all infections, and the diameter of each circle is proportional to the logarithm (lg) of the number of infections, which underrepresents larger infections. Nevertheless, it gives a good feel for where Sapphire spread. We monitored the spread using several "Network Telescopes", address ranges where we had sampled or complete packet traces at single sources. We also used the DShield distributed intrusion detection system to determine the IPs of infected machines, but we couldn't use this data for calculating the scanning rate.

Desired Requirements for Polymorphic Worm Signature Generation. Network-based signature generation: worms spread at exponential speed, so detecting them at an early stage is crucial. However, at the early stage there are only limited worm samples. A high-speed network router may see more worm samples, but the generator then needs to keep up with the network speed and can use only network-level information.

Desired Requirements for Polymorphic Worm Signature Generation (continued). Noise tolerance: most network flow classifiers suffer from false positives, and even host-based approaches can be injected with noise. Attack resilience: attackers always try to evade detection systems. Efficient signature matching for high-speed links. No existing work satisfies all of these requirements!

Outline: Motivation; Hamsa Design; Model-based Signature Generation; Evaluation; Related Work; Conclusion.

Choice of Signatures. Two classes of signatures: content based and behavior based. For content-based signatures, a token is a substring with reasonable coverage of the suspicious traffic, and a signature is a conjunction of tokens. Our choice: content based. Reasons: fast signature matching (an ASIC-based approach can achieve 6 to 8 Gb/s), and generic, independent of any protocol or server.
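To make the "conjunction of tokens" idea concrete, here is a minimal matching sketch. It assumes a signature is stored as a mapping from token to a minimum occurrence count (the multiset form quoted later in the results), and that occurrences are counted without overlap; the function name `matches` and the sample flows are hypothetical, not part of Hamsa.

```python
from typing import Dict


def matches(signature: Dict[bytes, int], flow: bytes) -> bool:
    """Return True if every token occurs in the flow at least as often as required."""
    # bytes.count counts non-overlapping occurrences (an assumption of this sketch).
    return all(flow.count(token) >= required for token, required in signature.items())


# Token multiset quoted on the results slide for Code-Red II.
CODE_RED_II = {b".ida?": 1, b"%u780": 1, b" HTTP/1.0\r\n": 1, b"GET /": 1, b"%u": 2}

# Hypothetical flows, made up for illustration only.
worm_like = b"GET /default.ida?XXXX%u9090%u7801 HTTP/1.0\r\n"
benign = b"GET /index.html HTTP/1.0\r\n"

print(matches(CODE_RED_II, worm_like), matches(CODE_RED_II, benign))
```

A flow matches only when all tokens are present, which is why adding tokens can only lower the false positive rate of a signature.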

Unique Invariants of Worms. Protocol frame: the code path to the vulnerable part, usually infrequently used (Code-Red II: '.ida?' or '.idq?'). Control data, leading to control-flow hijacking: a hard-coded value to overwrite a jump target or a function call. Worm executable payload: for the CLET polymorphic engine, '0\x8b', '\xff\xff\xff' and 't\x07\xeb'. It is possible to have worms with no such invariants, but this is very hard to achieve.

Hamsa Architecture

Components from existing work. Worm flow classifiers: scan-based detector [Autograph]; byte-spectrum-based approach [PAYL]; honeynet/honeyfarm sensors [Honeycomb].

Hamsa Design. Key idea: model the uniqueness of worm invariants. Greedy algorithm for finding token-conjunction signatures: highly accurate while much faster, both analytically and experimentally, compared with the latest work, Polygraph. Suffix-array-based token extraction. Provable attack resilience guarantee. Noise tolerant.

Hamsa Signature Generator Core part: Model-based Greedy Signature Generation Iterative approach for multiple worms

Problem Formulation. Inputs to the signature generator: the suspicious pool, the normal pool, and a false positive bound r. Output: a signature that maximizes the coverage in the suspicious pool while its false positive in the normal pool is bounded by r. Without noise, the problem can be solved in linear time using token extraction. With noise, it is NP-hard!

Model Uniqueness of Invariants. u(1) = upper bound of FP(t1); u(2) = upper bound of FP(t1, t2). The total number of tokens is bounded by k*. [Figure: example tokens with individual FP of 21%, 9%, 17% and 5%, and joint FP with t1 of 2%, 0.5% and 1%.]

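Signature Generation Algorithm. Token extraction on the suspicious pool yields the candidate tokens. Tokens as (COV, FP) pairs, ordered by coverage: (82%, 50%), (70%, 11%), (67%, 30%), (62%, 15%), (50%, 25%), (41%, 55%), (36%, 41%), (12%, 9%). With u(1) = 15%, t1 is the highest-coverage token whose false positive stays within u(1).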

Signature Generation Algorithm (continued). Tokens as (COV, FP) pairs, ordered by coverage: (82%, 50%), (70%, 11%), (67%, 30%), (62%, 15%), (50%, 25%), (41%, 55%), (36%, 41%), (12%, 9%). The remaining tokens, ordered by joint coverage with t1: (69%, 9.8%), (68%, 8.5%), (67%, 1%), (40%, 2.5%), (35%, 12%), (31%, 9%), (10%, 0.5%).
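The two slides above can be read as a greedy loop: at step i, pick the token with the best joint coverage in the suspicious pool among those whose joint false positive in the normal pool stays within u(i), and stop once the false positive drops below r. The following is a minimal sketch under that reading; the function names, the substring-based matching, and the toy pools are assumptions for illustration, and the real system computes coverage and false positives with suffix arrays rather than by scanning.

```python
from typing import List, Optional, Sequence


def fraction_matching(tokens: Sequence[bytes], pool: Sequence[bytes]) -> float:
    """Fraction of flows in the pool that contain every token."""
    if not pool:
        return 0.0
    return sum(all(t in flow for t in tokens) for flow in pool) / len(pool)


def greedy_signature(candidates: Sequence[bytes], suspicious: Sequence[bytes],
                     normal: Sequence[bytes], u: Sequence[float],
                     r: float) -> Optional[List[bytes]]:
    """Greedily add tokens; u[i] bounds the joint FP after step i, r is the final FP bound."""
    chosen: List[bytes] = []
    for bound in u:  # len(u) plays the role of k*
        best, best_cov = None, -1.0
        for token in candidates:
            if token in chosen:
                continue
            trial = chosen + [token]
            if fraction_matching(trial, normal) > bound:
                continue  # violates the u-bound for this step
            cov = fraction_matching(trial, suspicious)
            if cov > best_cov:
                best, best_cov = token, cov
        if best is None:
            return None  # no token satisfies the bound
        chosen.append(best)
        if fraction_matching(chosen, normal) < r:
            return chosen
    return None


# Hypothetical toy pools: two worm-like flows plus one noise flow.
suspicious = [b"GET /a.ida?x%u780", b"GET /b.ida?y%u780", b"GET /index.html"]
normal = [b"GET /index.html", b"GET /home", b"POST /form", b"GET /a.ida?help"]
signature = greedy_signature([b"GET /", b".ida?", b"%u780"], suspicious, normal,
                             u=[0.5, 0.1], r=0.05)
```

In this toy run the common token `GET /` is rejected at both steps because its false positive exceeds the bound, even though it has the highest coverage.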

Algorithm Runtime Analysis. Preprocessing: O(m + n + T*l + T*(|M|+|N|)). Running time: O(T*(|M|+|N|)). In most cases |M| < |N|, so this reduces to O(T*|N|). Notation: T is the number of tokens; l is the maximum length of tokens; |M| is the number of flows in the suspicious pool; |N| is the number of flows in the normal pool; m is the number of bytes in the suspicious pool; n is the number of bytes in the normal pool.

Provable Attack Resilience Guarantee. We proved the worst-case bound on false negatives given the false positive bound, which analytically bounds the worst that attackers can do. Example: k*=5, u(1)=0.2, u(2)=0.08, u(3)=0.04, u(4)=0.02, u(5)=0.01 and r=0.01. With an FP upper bound of 1%, the FN upper bound is 1.84% at a 5% noise ratio, 3.89% at a 10% noise ratio, and 8.75% at a 20% noise ratio. The better the flow classifier, the lower the false negatives.

Attack Resilience Assumptions. Common assumptions for any signature generation system: (1) the attacker cannot control which worm samples are encountered by Hamsa; (2) the attacker cannot control which worm samples encountered will be classified as worm samples by the flow classifier. Assumptions unique to token-based schemes: (1) the attacker cannot change the frequency of tokens in normal traffic; (2) the attacker cannot control which normal samples encountered are classified as worm samples by the worm flow classifier.

Attack Resilience Assumptions (continued). Attacks on the flow classifier: our approach does not depend on perfect flow classifiers, but with 99% noise no approach can work! High noise injection also makes the worm propagate less efficiently. Ways to enhance flow classifiers: cluster suspicious flows by return messages; information-theory-based approaches (DePaul Univ.).

Generalizing Signature Generation with Noise. Best signature = balanced signature: balance sensitivity against specificity. Define a scoring function score(cov, fp, …) to evaluate the goodness of a signature. Intuition currently used: it is worth reducing the coverage by 1/a if the false positive becomes 10 times smaller. Add some weight for the length of the signature (LEN) to break ties between signatures with the same coverage and false positive.
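One scoring function consistent with the stated intuition is sketched below. The functional form and the default values of `a`, `b` and `delta` are assumptions chosen only so that a tenfold drop in false positive is worth exactly 1/a of coverage, with length as a weak tie-breaker; the slide itself does not give the formula.

```python
import math


def score(cov: float, fp: float, length: int,
          a: float = 20.0, b: float = 0.01, delta: float = 1e-6) -> float:
    """Higher is better: reward coverage, log-penalize false positives, break ties by length."""
    # delta keeps the logarithm finite when fp is 0 (an assumption of this sketch).
    return -math.log10(fp + delta) + a * cov + b * length


# Trading 3 points of coverage for a 10x smaller false positive pays off when 1/a = 5%.
tight = score(cov=0.90, fp=0.001, length=3)
loose = score(cov=0.93, fp=0.01, length=3)
print(tight > loose)
```

Because the length weight `b` is much smaller than `a`, it only matters when two signatures have the same coverage and false positive.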

Hamsa Signature Generator Next: Token extraction and token identification

Token Extraction. Problem formulation. Input: a set of strings, a minimum length l and a minimum coverage COVmin. Output: the set of tokens (substrings) that meet the minimum length and coverage requirements, with the corresponding sample vector for each token. Coverage: the portion of strings that contain the token. Main techniques: suffix array; LCP (Longest Common Prefix) array and LCP intervals; Token Extraction Algorithm (TEA).
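The input/output contract above can be illustrated with a brute-force stand-in. This is deliberately not the suffix-array-based TEA: it enumerates every substring, so it is only usable on tiny inputs and it returns all qualifying substrings rather than one token per LCP interval. The function name and the sample-vector representation (one boolean per sample) are assumptions.

```python
from typing import Dict, List, Sequence


def extract_tokens(samples: Sequence[str], min_len: int,
                   min_cov: float) -> Dict[str, List[bool]]:
    """Map each qualifying substring to its sample vector (True where the sample contains it)."""
    vectors: Dict[str, List[bool]] = {}
    for sample in samples:
        for start in range(len(sample)):
            for end in range(start + min_len, len(sample) + 1):
                token = sample[start:end]
                if token not in vectors:
                    vectors[token] = [token in other for other in samples]
    # Keep only the tokens whose coverage reaches the threshold.
    return {t: v for t, v in vectors.items() if sum(v) / len(samples) >= min_cov}


# The two strings used in the suffix array illustration.
tokens = extract_tokens(["abrac", "adabra"], min_len=2, min_cov=1.0)
print(sorted(tokens))
```

With full coverage required, only substrings shared by both strings survive; lowering `min_cov` to 0.5 also admits substrings that occur in a single sample.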

Suffix Array. Illustration by an example. String1: abrac; String2: adabra. Concatenated: abracadabra$. All suffixes: a$, ra$, bra$, abra$, dabra$, … Sort all the suffixes; sorting can be done in 4n space and O(n log(n)) time. Sorted suffixes with their start positions:
a 10
abra 7
abracadabra 0
acadabra 3
adabra 5
bra 8
bracadabra 1
cadabra 4
dabra 6
ra 9
racadabra 2
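The example above can be reproduced with a few lines of Python. This sketch sorts suffixes naively, which is far slower than the O(n log(n)) construction the slide refers to, and it omits the `$` terminator, which does not change the order for this string; the function name is hypothetical.

```python
from typing import List


def suffix_array(text: str) -> List[int]:
    """Start positions of all suffixes of text, in lexicographic order of the suffixes."""
    # Naive construction: compares whole suffixes, fine for a small illustration.
    return sorted(range(len(text)), key=lambda i: text[i:])


text = "abrac" + "adabra"
sa = suffix_array(text)
for position in sa:
    print(text[position:], position)
```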

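LCP Array and LCP Intervals. Suffixes with their suffix array (sufarr) and LCP array (lcparr) entries:
idx  sufarr  lcparr  suffix
0    10      -       a
1    7       1       abra
2    0       4       abracadabra
3    3       1       acadabra
4    5       1       adabra
5    8       0       bra
6    1       3       bracadabra
7    4       0       cadabra
8    6       0       dabra
9    9       0       ra
10   2       2       racadabra
LCP intervals, written as lcp value-[first idx, last idx]: 0-[0,10], 1-[0,4], 4-[1,2], 3-[5,6], 2-[9,10]. LCP intervals => tokens.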
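A small sketch of how the LCP array relates to repeated substrings: each nonzero LCP entry says that two neighbouring suffixes share a prefix of that length, and those shared prefixes are the strings behind the non-root LCP intervals listed above. The construction here is naive and the function names are hypothetical; the entry for the first suffix is stored as 0.

```python
from typing import List


def suffix_array(text: str) -> List[int]:
    """Naive suffix array: start positions sorted by suffix."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text: str, sa: List[int]) -> List[int]:
    """lcp[k] is the length of the common prefix of the suffixes at sa[k-1] and sa[k]."""
    def common(i: int, j: int) -> int:
        k = 0
        while i + k < len(text) and j + k < len(text) and text[i + k] == text[j + k]:
            k += 1
        return k
    return [0] + [common(sa[k - 1], sa[k]) for k in range(1, len(sa))]


text = "abracadabra"
sa = suffix_array(text)
lcp = lcp_array(text, sa)
# Each nonzero entry yields a substring that occurs at least twice in the text.
repeated = {text[sa[k]:sa[k] + lcp[k]] for k in range(len(sa)) if lcp[k] > 0}
print(lcp, sorted(repeated))
```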

Token Extraction Algorithm (TEA). Find the eligible LCP intervals first, then find the tokens.

Token Identification. For normal traffic, pre-compute and store the suffix array offline. For a given token, a binary search in the suffix array gives the corresponding LCP interval, in O(log(n)) time. A more sophisticated O(1) algorithm is possible but may require more space.
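The binary search described above can be sketched as follows: because the suffixes are sorted, all suffixes that start with a given token form one contiguous range of the suffix array, and two binary searches find its ends. The function names are hypothetical, the suffix array is built naively here rather than loaded from disk, and mapping positions back to flows (needed for a per-flow false positive) is left out.

```python
from typing import List, Tuple


def suffix_array(text: str) -> List[int]:
    """Naive suffix array: start positions sorted by suffix."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def token_interval(text: str, sa: List[int], token: str) -> Tuple[int, int]:
    """Half-open range [start, end) of suffix array indices whose suffixes start with token."""
    m = len(token)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= token
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < token:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > token
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= token:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


text = "abracadabra"
sa = suffix_array(text)
start, end = token_interval(text, sa, "abra")
positions = sorted(sa[start:end])  # where the token occurs in the text
```

An empty range (start equal to end) means the token does not occur in the text at all.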

Implementation Details. Token extraction: extract a set of tokens with minimum length l and minimum coverage COVmin. Polygraph uses a suffix-tree-based approach: 20n space and time consuming. Our approach: enhanced suffix array, 8n space and much faster (at least 20 times), where n is the total length of the suspicious pool. Token identification: calculate the false positive when checking the U-bounds. This is again a suffix-array-based approach, but for a 300MB normal pool the 1.2GB suffix array is still large. Optimization: using mmap, memory usage is 150 to 250MB.

Hamsa Signature Generator Next: signature refinement

Signature Refinement. Why refinement? To produce a signature with the same sensitivity but better specificity. How? After using the core algorithm to get the greedy signature, we assume that the samples matched by the greedy signature are all worm samples. This reduces the problem to signature generation without noise, so we do another round of token extraction.

Extend to Detect Multiple Worms. Iteratively use the single-worm detector to detect multiple worms. In the first iteration, the algorithm finds the signature for the most popular worm in the suspicious pool; all other worms and normal traffic are treated as noise.
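The iteration described above might look like the following sketch. It assumes that after each round the flows matched by the new signature are removed from the suspicious pool before the next round, which the slide does not state explicitly; `generate_one` and `matches` are pluggable placeholders, and the toy versions below are hypothetical stand-ins, not the Hamsa generator.

```python
from typing import Callable, List, Optional, Sequence


def detect_multiple_worms(suspicious: Sequence[bytes],
                          generate_one: Callable[[List[bytes]], Optional[bytes]],
                          matches: Callable[[bytes, bytes], bool],
                          max_worms: int = 10) -> List[bytes]:
    """Run a single-worm generator repeatedly, dropping matched flows after each round."""
    signatures: List[bytes] = []
    remaining = list(suspicious)
    for _ in range(max_worms):
        signature = generate_one(remaining)
        if signature is None:
            break
        rest = [flow for flow in remaining if not matches(signature, flow)]
        if len(rest) == len(remaining):
            break  # the signature matched nothing, so stop
        signatures.append(signature)
        remaining = rest
    return signatures


def toy_generate_one(pool: List[bytes]) -> Optional[bytes]:
    """Toy generator: the most frequent of two fixed tokens, if it covers at least 2 flows."""
    best = max([b"AAAA", b"BBBB"], key=lambda t: sum(t in f for f in pool))
    return best if sum(best in f for f in pool) >= 2 else None


def toy_matches(signature: bytes, flow: bytes) -> bool:
    return signature in flow


pool = [b"xAAAA1", b"yAAAA2", b"zAAAA3", b"pBBBB1", b"qBBBB2", b"noise"]
found = detect_multiple_worms(pool, toy_generate_one, toy_matches)
```

The most popular worm is found first, which mirrors the slide's point that the other worms act as noise during that round.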

Practical Issues on Data Normalization. Typical cases that need data normalization: IP packet fragmentation; TCP flow reassembly (to defend against fragroute); RPC fragmentation; URL obfuscation; HTML obfuscation; Telnet/FTP evasion by backspace or delete keys. Normalization translates data into a canonical form.
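Two of the cases listed above are easy to illustrate. The sketch below decodes percent-encoded URLs with the standard library and replays backspace/delete characters the way a terminal would; it is a simplified illustration under those assumptions, not the normalizer used with Hamsa, and the function names are hypothetical.

```python
from urllib.parse import unquote


def normalize_url(url: str) -> str:
    """Undo percent-encoding so that '%2E' and '.' map to the same canonical form."""
    return unquote(url)


def apply_backspaces(data: str) -> str:
    """Replay backspace (0x08) and delete (0x7f) characters to recover the effective input."""
    out = []
    for ch in data:
        if ch in ("\x08", "\x7f"):
            if out:
                out.pop()  # erase the previous character, if any
        else:
            out.append(ch)
    return "".join(out)


print(normalize_url("/default%2Eida%3Fx"))
print(apply_backspaces("rooo\x08t"))
```

After normalization, an obfuscated request contains the same token bytes as a plain one, so a single signature can cover both encodings.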

Practical Issues on Data Normalization (II). Hamsa works better with data normalization. Without data normalization, or with weak normalization, Hamsa still works, but because the data may have different forms of encoding, it may produce multiple signatures for a single worm, and it needs sufficient samples for each form of encoding.

Experiment Methodology. Experimental setup. Suspicious pool: three pseudo-polymorphic worms based on real exploits (Code-Red II, Apache-Knacker and ATPhttpd) and two polymorphic engines from the Internet (CLET and TAPiON). Normal pool: 2-hour departmental HTTP trace (326MB). Signature evaluation. False negative: 5000 generated worm samples per worm. False positive: 4-day departmental HTTP trace (12.6 GB); 3.7GB of web-crawled files including .mp3, .rm, .ppt, .pdf, .swf etc.; /usr/bin of Linux Fedora Core 4.

Results on Signature Quality. Table columns: Worms, Training FN, Training FP, Evaluation FN, Evaluation FP, Binary evaluation FP, Signature. Code-Red II: signature {'.ida?': 1, '%u780': 1, ' HTTP/1.0\r\n': 1, 'GET /': 1, '%u': 2}. CLET: 0.109%, 0.06236%, 0.268%; signature {'0\x8b': 1, '\xff\xff\xff': 1, 't\x07\xeb': 1}. Single worm with noise: suspicious pool sizes of 100 and 200 samples; noise ratios of 0%, 10%, 30%, 50% and 70%; noise samples randomly picked from the normal pool. We always get the above signatures and accuracy.

Results on Signature Quality (II). Suspicious pool with a high noise ratio: for noise ratios of 50% and 70%, we sometimes produce two signatures: one is the true worm signature, and another comes solely from noise, due to the locality of the noise. The false positives of these noise signatures have to be very small: mean 0.09%, maximum 0.7%. Multiple worms with noise give similar results.

Experiment: U-bound Evaluation. To be conservative we chose k* = 15, with u(k*) = u(15) = 9.16*10^-6. Evaluation of u(1) and ur: we tested u(1) = [0.02, 0.04, 0.06, 0.08, 0.10, 0.20, 0.30, 0.40, 0.5] and ur = [0.20, 0.40, 0.60, 0.8]. The minimum (u(1), ur) that works for all our worms was (0.08, 0.20). In practice, we use the conservative values (0.15, 0.5).
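The numbers on this slide are consistent with a geometric schedule u(i) = u(1) * ur^(i-1): with the conservative values u(1) = 0.15 and ur = 0.5, the fifteenth bound comes out at about 9.16*10^-6. The schedule is an inference from those numbers rather than something the slide states, and the function name is hypothetical.

```python
def u_bound(i: int, u1: float, ur: float) -> float:
    """Assumed geometric schedule: each extra token shrinks the FP bound by a factor ur."""
    return u1 * ur ** (i - 1)


# Conservative values quoted on the slide: u(1) = 0.15, ur = 0.5, k* = 15.
bounds = [u_bound(i, 0.15, 0.5) for i in range(1, 16)]
print(f"{bounds[-1]:.2e}")
```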

Speed Results. Implementation in C++/Python. 500 samples with 20% noise and a 100MB normal traffic pool: 15 seconds on a 2.8GHz Xeon, with 112MB memory consumption. Speed comparison with Polygraph. Asymptotic runtime: O(T) vs. O(|M|^2); as |M| increases, T does not increase as fast as |M|! Experimental: 64 to 361 times faster (Polygraph vs. ours, both in Python, with data already in memory).

Experiment: Sample Requirement. Coincidental-pattern attack [Polygraph]. Results: for the three pseudo worms, 10 samples give good results; CLET and TAPiON need at least 50 samples. Conclusion: for better signatures, to be conservative, at least 100 samples are needed, which requires scalable and fast signature generation!

Token-fit Attack Can Make Polygraph Fail. Polygraph uses hierarchical clustering to find the signatures with the smallest false positives. Knowing the token distribution of the noise in the suspicious pool, the attacker can make the worm samples look more like noise traffic: different worm samples encode different noise tokens. Our approach can still work!

Token-fit attack could make Polygraph fail. [Figure: noise samples N1, N2, N3 and worm samples W1, W2, W3 are merged into merge candidates 1, 2 and 3, which cannot be merged further, so no true signature is found.]

Experiment: Token-fit Attack. Suspicious pool of 50 samples with 50% noise. Different worm samples are crafted to resemble different noise samples. Results: Polygraph has 100% false negatives; Hamsa still gets the correct signature as before!

Related Works. Systems compared: Hamsa, Polygraph, CFG, PADS, Nemean, COVERS, Malware Detection. Criteria, with the cell values that survive in the transcript: network or host based (Network, Host); content or behavior based (Content based, Behavior based, Behavior based); noise tolerance (Yes, Yes (slow), No); multiple worms in one protocol; on-line signature matching (Fast, Slow); generality (General purpose, Protocol specific, Server specific); provable attack resilience; information exploited. Notes. Nemean, USENIX Security Symposium 2005 (Wisconsin): analyzes the protocol and treats it as an automaton for clustering. PADS, Infocom 04 or 05: double-honeypot flow classifier, byte distribution probability (combining offset information), spectrum analysis of the critical region. CFG, RAID 2005 (from UCSB): control flow graph, slow matching.

Conclusion. Network-based signature generation and matching are important and challenging. Hamsa: automated signature generation that is fast, noise tolerant, provably attack resilient, and capable of detecting multiple worms in a single application protocol. We proposed a model to describe the worm invariants.

Questions ?

Normal Traffic Poisoning Attack. We found that our approach is not sensitive to the normal traffic pool used. History: use a time window of the last 6 months. The attacker would have to poison the normal traffic 6 months ahead, and within 6 months the vulnerability may have been patched! Poisoning a popular protocol is very difficult.

Red Herring Attack. Hard to implement. It is a dynamic updating problem; again, our approach is fast. Partial signature matching is covered in the extended version.

Coincidental Attack. As mentioned in the Polygraph paper, increase the sample requirement. Again, our approach is scalable and fast.

Model Uniqueness of Invariants. Let a worm have a set of invariants. Determine their order as follows. t1: the token with the minimum false positive in normal traffic; u(1) is the upper bound of the false positive of t1. t2: the token with the minimum joint false positive with t1; FP({t1,t2}) is bounded by u(2). ti: the token with the minimum joint false positive with {t1, t2, …, ti-1}; FP({t1,t2,…,ti}) is bounded by u(i). The total number of tokens is bounded by k*.

Problem Formulation. Noisy Token Multiset Signature Generation Problem. INPUT: suspicious pool M and normal traffic pool N; value r<1. OUTPUT: a multiset-of-tokens signature S={(t1, n1), …, (tk, nk)} such that the signature maximizes the coverage in the suspicious pool while its false positive in the normal pool is less than r. Without noise, a polynomial-time algorithm exists. With noise, the problem is NP-hard.

Generalizing Signature Generation with Noise. Best signature = balanced signature: balance sensitivity against specificity. But how? Define a scoring function score(cov, fp, …) to evaluate the goodness of a signature. Intuition currently used: it is worth reducing the coverage by 1/a if the false positive becomes 10 times smaller. Add some weight for the length of the signature (LEN) to break ties between signatures with the same coverage and false positive.

Generalizing Signature Generation with Noise (continued). Algorithm: similar. Running time: same as the previous simple form. Attack resilience guarantee: similar.

Extension to Multiple Worms. Iteratively use the single-worm detector to detect multiple worms. In the first iteration, the algorithm finds the signature for the most popular worm in the suspicious pool; all other worms and normal traffic are treated as noise. Although the analysis for a single worm can be applied to multiple worms, the bounds are not very promising. Reason: high noise ratio.

Token Extraction. Extract a set of tokens with minimum length lmin and minimum coverage COVmin, and output the frequency vector for each token. Polygraph uses a suffix-tree-based approach: 20n space and time consuming. Our approach: enhanced suffix array, 4n space; much faster, at least 50(UPDATE) times; can be applied to Polygraph as well. Here n is the total length of the suspicious pool.

Calculate the False Positive. We need the false positive to check the U-bounds, and this is an expensive operation. This is again a suffix-array-based approach, but for a 300MB normal pool the 1.2GB suffix array is still large! Improvements: caching; mmap the suffix array (true memory usage: 150 to 250MB); 2-level normal pool; hardware-based fast string matching; compress the normal pool and run string matching algorithms directly over the compressed strings.

Future Works. Enhance the flow classifiers: cluster suspicious flows by return messages; verify malicious flows by replaying them to servers with Address Space Randomization enabled.