Automatic Network Protocol Analysis Gilbert Wondracek, Paolo Milani Comparetti, Christopher Kruegel, and Engin Kirda NDSS 2008 Speaker: Chang Huan Wu 2009/2/17
Outline Introduction Protocol Analysis (Analysis of a Single Message, Analysis of Multiple Messages) Evaluation Conclusions
Introduction (1/3) Protocol reverse engineering is the process of extracting application-level protocol specifications, especially for closed protocols Security applications: black-box testing of programs that implement a protocol, deep packet inspection, revealing differences between server implementations
Introduction (2/3) Manual protocol analysis is time-consuming The effort can be justified only for very popular protocols such as SMB => Automatic analysis
Introduction (3/3) Existing automatic approaches (1) Take a binary program as input and output the set of inputs that this program accepts: unable to determine the complete set of inputs, low scalability (2) Take network traffic traces as input: limited precision
Goal Focus on determining the format specification of a certain type of message first
Approach Use dynamic taint analysis to observe the data flow Observe how the program processes input messages Analyze individual messages Generalize to a message format from multiple messages of a given type
Dynamic Taint Analysis Assign a unique label to each byte of network input Monitor the program, and analyze which byte is processed by which instruction (e.g., mov, sub)
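The labeling idea on this slide can be illustrated with a small sketch. It is not the tool's implementation: the `Tainted` class and the `mov` and `sub` helpers are hypothetical names that model, in plain Python, how a per-byte label could follow a value through two instructions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tainted:
    """A byte value plus the set of input-byte labels it depends on (hypothetical model)."""
    value: int
    labels: frozenset = frozenset()


def taint_input(data: bytes):
    # Assign a unique label (here: the byte offset) to each byte of network input.
    return [Tainted(b, frozenset({i})) for i, b in enumerate(data)]


def mov(src: Tainted) -> Tainted:
    # A copy keeps the labels of its source.
    return Tainted(src.value, src.labels)


def sub(a: Tainted, b: Tainted) -> Tainted:
    # An arithmetic result depends on the labels of both operands.
    return Tainted((a.value - b.value) & 0xFF, a.labels | b.labels)


message = taint_input(b"GET")
constant = Tainted(0x20)                    # untainted: empty label set
result = sub(mov(message[0]), constant)     # depends only on input byte 0
mixed = sub(message[1], message[2])         # depends on input bytes 1 and 2
```

Recording the label sets seen by each instruction is what lets the later steps ask which input bytes a given comparison or loop depended on.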
Analysis of a Single Message - Finding delimiters A delimiter is a sequence of one or more bytes that separates fields or messages Record all operations that compare a tainted input byte with an untainted value, keeping for each compared value a list of the labels it was compared with Traverse each list and check for consecutive labels Ex. The program compares the first three bytes with 'a' and the fourth byte with 'b' Label lists: 'a' -> 0, 1, 2; 'b' -> 3; … Message: H A B C Label: 1 2 3
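A minimal sketch of the delimiter search, using the comparison log from the example on this slide. The function names and the `min_run` threshold are assumptions made for illustration; the slide only says that consecutive labels are checked.

```python
from collections import defaultdict


def consecutive_runs(labels):
    """Collapse byte labels into (first, last) runs of consecutive labels."""
    runs = []
    for label in sorted(set(labels)):
        if runs and label == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], label)
        else:
            runs.append((label, label))
    return runs


def delimiter_candidates(comparisons, min_run=2):
    """comparisons: (label, untainted value) pairs recorded during execution.

    min_run is a hypothetical threshold: a value counts as a candidate only if
    it was compared against at least that many consecutive input bytes.
    """
    by_value = defaultdict(list)
    for label, value in comparisons:
        by_value[value].append(label)

    candidates = {}
    for value, labels in by_value.items():
        runs = [r for r in consecutive_runs(labels) if r[1] - r[0] + 1 >= min_run]
        if runs:
            candidates[value] = runs
    return candidates


# The slide's example: bytes 0..2 are compared with 'a', byte 3 with 'b'.
comparison_log = [(0, "a"), (1, "a"), (2, "a"), (3, "b")]
candidates = delimiter_candidates(comparison_log)
```

With this log, only `'a'` is kept, because it is the only value compared against a run of consecutive input bytes.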
Analysis of a Single Message - Scopes and delimiter hierarchy Scope fields: a certain delimiter can be present multiple times in the scope field Delimited fields: a certain delimiter is present once in the delimited field A delimited field can itself be a scope field for another character A hierarchy of fields reflects nested scopes
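The nesting described on this slide can be pictured with a toy example. The sketch below assumes the delimiters and their nesting order are already known and simply splits recursively; the actual analysis derives that hierarchy from the observed comparisons, which this code does not attempt. `split_fields` is a hypothetical helper.

```python
def split_fields(data, delimiters):
    """Recursively split data by an ordered list of delimiters (outermost first).

    Returns nested lists: each level of nesting corresponds to one scope.
    """
    if not delimiters:
        return data
    outer, inner = delimiters[0], delimiters[1:]
    parts = data.split(outer)
    if len(parts) == 1:
        # This delimiter does not occur here; try the next, more deeply nested one.
        return split_fields(data, inner)
    return [split_fields(part, inner) for part in parts]


# Lines are delimited by CRLF; within a line, fields are delimited by a space.
tree = split_fields(b"GET /a HTTP/1.1\r\nHost: x", [b"\r\n", b" "])
```

Here the whole message is the scope of the CRLF delimiter, and each line it delimits is in turn the scope of the space delimiter.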
Analysis of a Single Message - Identifying length fields A length field is one or more bytes that store the length of another field (the target field) Use static analysis to detect loops Look for loops where an exit condition tests the same labels on every iteration => Length field candidate
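The loop test can be sketched as a set intersection. The code assumes loop detection has already been done and that, for each iteration, the labels read by the exit condition have been recorded; `trace_length_prefixed_copy` is a hypothetical stand-in for an instrumented parser, not part of the described system.

```python
def length_field_candidate(exit_tests):
    """exit_tests: one set of labels per loop iteration, as read by the exit condition.

    Returns the labels tested on every iteration (empty if there are none,
    or if the loop ran fewer than two exit checks).
    """
    tests = [frozenset(t) for t in exit_tests]
    if len(tests) < 2:
        return frozenset()
    return frozenset.intersection(*tests)


def trace_length_prefixed_copy(message):
    """Hypothetical instrumented parser: byte 0 holds the payload length."""
    exit_tests = []
    length = message[0]
    i = 0
    while True:
        exit_tests.append({0})      # the check `i < length` depends on label 0
        if not i < length:
            break
        i += 1
    return exit_tests


length_loop = length_field_candidate(trace_length_prefixed_copy(b"\x03abc"))
# A delimiter scan tests a different input byte on each iteration instead.
scan_loop = length_field_candidate([{0}, {1}, {2}])
```

The first loop yields label 0 as a length field candidate; the scan loop yields nothing, since no label is shared by all of its exit checks.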
Analysis of a Single Message - Identifying target fields For each length field candidate, look at the labels that are "touched" inside the loop Remove the labels touched in all iterations, because those bytes are independent of the current loop iteration
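The filtering step amounts to a set difference, sketched below under the assumption that the labels touched in each iteration are available as sets. `target_field_labels` is a hypothetical helper name.

```python
def target_field_labels(touched_per_iteration):
    """Labels touched inside the loop, minus those touched in every iteration."""
    sets = [frozenset(s) for s in touched_per_iteration]
    if not sets:
        return []
    touched = frozenset.union(*sets)
    if len(sets) < 2:
        # With a single iteration nothing can be told apart; keep everything.
        return sorted(touched)
    in_every_iteration = frozenset.intersection(*sets)
    return sorted(touched - in_every_iteration)


# Each iteration reads the length byte (label 0) and one payload byte.
target = target_field_labels([{0, 1}, {0, 2}, {0, 3}])
```

Label 0 is dropped because it is read in every iteration, leaving labels 1 to 3 as the target field.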
Analysis of a Single Message - Extracting additional information Protocol keywords: input data compared with a constant string File names: arguments of system calls that open or create files Echoed fields Pointers (to somewhere else in the packet) Unused fields
Analysis of Multiple Messages – Generalization (1/3) Message alignment Based on the Needleman-Wunsch algorithm Extended to a hierarchy of fields
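For reference, a minimal sketch of the classic Needleman-Wunsch algorithm on flat sequences. It does not include the extension to a hierarchy of fields, and the scoring values (`match`, `mismatch`, `gap`) and the field-type tokens in the example are assumptions chosen for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align sequences a and b; return (score, list of aligned pairs).

    A None in a pair marks a gap in that sequence.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def pair_score(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,                   # gap in b
                score[i][j - 1] + gap,                   # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            aligned.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned.append((a[i - 1], None))
            i -= 1
        else:
            aligned.append((None, b[j - 1]))
            j -= 1
    aligned.reverse()
    return score[n][m], aligned


# Two messages described as sequences of field types (hypothetical tokens).
long_msg = ["keyword", "delim", "text", "delim", "text"]
short_msg = ["keyword", "delim", "text"]
best_score, alignment = needleman_wunsch(long_msg, short_msg)
```

The pairs that contain `None` mark fields present in only one of the two messages, which is the kind of evidence that can indicate an optional field.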
Analysis of Multiple Messages – Generalization (2/3) Operate on a tree of fields, not on a string of bytes To align two inner nodes, recursively call NW on the sequence of child nodes To align two leaf nodes, take into account field semantics Repetition detection Merge two or more consecutive, optional nodes into a single repetition node
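The repetition step can be sketched as a single pass over a list of sibling nodes. The `Node` type is hypothetical, and the requirement that merged nodes be of the same kind is an assumption added here; the slide only states that consecutive optional nodes are merged.

```python
from itertools import groupby
from typing import NamedTuple


class Node(NamedTuple):
    """Hypothetical field node in a generalized message format."""
    kind: str
    optional: bool
    repeated: bool = False


def merge_repetitions(nodes):
    """Merge runs of two or more consecutive optional nodes of the same kind."""
    merged = []
    for (kind, optional), group in groupby(nodes, key=lambda n: (n.kind, n.optional)):
        run = list(group)
        if optional and len(run) >= 2:
            merged.append(Node(kind, True, repeated=True))
        else:
            merged.extend(run)
    return merged


children = [
    Node("request_line", False),
    Node("header", True),
    Node("header", True),
    Node("header", True),
    Node("body", True),
]
generalized = merge_repetitions(children)
```

In this example the three optional header nodes collapse into one repetition node, while the mandatory request line and the single optional body are left as they are.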
Analysis of Multiple Messages – Generalization (3/3)
Evaluation (1/4)
Evaluation (2/4)
Evaluation (3/4) The results in these tables were obtained by manually comparing our specifications with official RFC documents and with Wireshark output Most of the fields were correctly identified Parsing another set of messages with the generated specifications succeeded
Evaluation (4/4)
Conclusion Introduced a novel approach to automatic protocol reverse engineering Tested on common servers and protocols
Comments The approach automatically generates high-precision protocol specifications The generated specification may be affected by the program's implementation