Automatic Protocol Format Reverse Engineering through Context-Aware Monitored Execution
Zhiqiang Lin (1), Xuxian Jiang (2), Dongyan Xu (1), Xiangyu Zhang (1)
(1) Purdue University, (2) George Mason University
February 12th, 2008
The 15th Annual Network and Distributed System Security Symposium
Motivation
Protocol reverse engineering: a process to recover protocol specifications, e.g., fields and their relationships.
Applications:
  Network-based intrusion detection
  Network management
  Penetration testing
  …
Challenges
[Figure: hex dump of a raw protocol message, offsets 0x0040 to 0x00c0]
Multiple fields in a single message
Non-static size of fields
Complex relationships among protocol fields:
  Sequential
  Parallel
  Hierarchical
Challenges
A BNF specification of the HTTP request (RFC 2616):
  HTTP-Request = Request-Line
                 (( general-header | request-header | entity-header ) CRLF)*
                 CRLF
                 [ message-body ]
  Request-Line = Method SP Request-URI SP HTTP-Version CRLF
Relationships labeled on the grammar: parallel, sequential, hierarchical.
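A minimal sketch of the sequential relationship in the Request-Line rule, written in the same sscanf style as typical HTTP server code. The struct `request_line`, the function `parse_request_line`, and the buffer sizes are hypothetical names and values chosen for illustration; they do not come from the talk.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the three sequential fields of a Request-Line. */
struct request_line {
    char method[16];
    char uri[256];
    char version[16];
};

/* Split "Method SP Request-URI SP HTTP-Version CRLF" into its fields.
 * Returns 0 on success, -1 if the line does not have this shape. */
int parse_request_line(const char *line, struct request_line *out)
{
    assert(line != NULL && out != NULL);
    int consumed = 0;
    /* Width limits keep each conversion inside its buffer. */
    if (sscanf(line, "%15[^ ] %255[^ ] %15[^\r\n]%n",
               out->method, out->uri, out->version, &consumed) != 3)
        return -1;
    /* Require the CRLF terminator right after HTTP-Version. */
    return strcmp(line + consumed, "\r\n") == 0 ? 0 : -1;
}
```

Called on "GET /news.html HTTP/1.0\r\n", this fills the three buffers with the method, the URI, and the version, one field per conversion.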
Related Work
Network trace:
  Protocol Informatics
  Discoverer [W. Cui et al., Security'07]
Binary analysis:
  Polyglot [J. Caballero et al., CCS'07]
  Automatic Network Protocol Analysis [G. Wondracek et al., NDSS'08]
Observation
Code snippet in http.c (null-httpd-0.5.0):

    int read_header(int sid) {
        …
        /* Request line: three space-separated fields parsed in one call. */
        if (sscanf(line, "%[^ ] %[^ ] %[^ ]",
                   conn[sid].dat->in_RequestMethod,
                   conn[sid].dat->in_RequestURI,
                   conn[sid].dat->in_Protocol) != 3)
            …
        while (strlen(line) > 0) {
            …
            /* Each header field is copied by its own branch. */
            if (strncasecmp(line, "Cookie: ", 8) == 0)
                strncpy(conn[sid].dat->in_Cookie, (char *)&line+8,
                        sizeof(conn[sid].dat->in_Cookie)-1);
            if (strncasecmp(line, "Host: ", 6) == 0)
                strncpy(conn[sid].dat->in_Host, (char *)&line+6,
                        sizeof(conn[sid].dat->in_Host)-1);
            …
            if (strncasecmp(line, "User-Agent: ", 12) == 0)
                strncpy(conn[sid].dat->in_UserAgent, (char *)&line+12,
                        sizeof(conn[sid].dat->in_UserAgent)-1);
        }
    }
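AutoFormat -- Basic Idea
[Figure: the input bytes "G E T / n e w s …" are related to the execution contexts that process them; the bytes handled in one context form one field, and the bytes handled in another context form another field.]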
System Overview
Context-aware Execution Monitor
Input: GET /news.html
Log record: input offset, input byte, call stack, EIP

0  'G'   main->ap_mpm_run->0x15C57->0x15B38->0x15941->ap_process_connection->ap_run_process_connection->0xF5A8->ap_read_request->ap_rgetline_core->ap_get_brigade->0x2D2CE->ap_get_brigade->0x2D667->apr_brigade_split_line->memchr   EIP 0x4BA56A2
1  'E'   main->ap_mpm_run->0x15C57->0x15B38->0x15941->ap_process_connection->ap_run_process_connection->0xF5A8->ap_read_request->ap_rgetline_core->ap_get_brigade->0x2D2CE->ap_get_brigade->0x2D667->apr_brigade_split_line->memchr   EIP 0x4BA56A2
2  'T'   main->ap_mpm_run->0x15C57->0x15B38->0x15941->ap_process_connection->ap_run_process_connection->0xF5A8->ap_read_request->ap_rgetline_core->ap_get_brigade->0x2D2CE->ap_get_brigade->0x2D667->apr_brigade_split_line->memchr   EIP 0x4BA56A2
…
24 '\n'  main->ap_mpm_run->0x15C57->0x15B38->0x15941->ap_process_connection->ap_run_process_connection->0xF5A8->ap_read_request->ap_rgetline_core->ap_get_brigade->0x2D2CE->ap_get_brigade->0x2D667->apr_brigade_split_line->memchr   EIP 0x4BA56A2
…
0  'G'   main->ap_mpm_run->0x15C57->0x15B38->0x15941->ap_process_connection->ap_run_process_connection->0xF5A8->ap_read_request->ap_getword_white   EIP 0x1F7F3
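A minimal sketch of how such records could be turned into candidate fields, assuming that a field is a maximal run of consecutive input offsets logged under the same execution context. The types `log_entry` and `span` and the function `group_by_context` are hypothetical, and the context is flattened into a single string here; this is an illustration, not AutoFormat's actual implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One monitor record: input offset plus the execution context
 * (call stack and EIP, flattened into one string for this sketch). */
struct log_entry {
    int offset;
    const char *context;
};

/* A candidate field: input bytes [start, end] seen in one context. */
struct span {
    int start;
    int end;
    const char *context;
};

/* Merge runs of records with consecutive offsets and identical context.
 * Writes at most max_spans spans to out and returns how many were written. */
size_t group_by_context(const struct log_entry *log, size_t n,
                        struct span *out, size_t max_spans)
{
    assert(log != NULL && out != NULL);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        struct span *last = count > 0 ? &out[count - 1] : NULL;
        if (last != NULL &&
            log[i].offset == last->end + 1 &&
            strcmp(log[i].context, last->context) == 0) {
            last->end = log[i].offset;   /* same context: extend the run */
            continue;
        }
        if (count == max_spans)
            break;                       /* output buffer is full */
        out[count].start = log[i].offset;
        out[count].end = log[i].offset;
        out[count].context = log[i].context;
        count++;
    }
    return count;
}
```

With records like those in the log, offsets 0 to 24 under the memchr context would form one span (the whole line), and offsets 0 to 2 under the ap_getword_white context would form a second, nested span ("GET").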
Protocol Field Identifier
Analyze the log file:
  Step 1: Build the protocol field tree from the logged data.
  Step 2: Refine the tree using three heuristics.
  Step 3: Output the result.
Example: Apache log data
Message: GET /news.html HTTP/1.0\r\n\r\n (highlighted bytes: GET)

0  'G'   main->ap_mpm_run->0x15C57->0x15B38->0x15941->ap_process_connection->ap_run_process_connection->0xF5A8->ap_read_request->ap_rgetline_core->ap_get_brigade->0x2D2CE->ap_get_brigade->0x2D667->apr_brigade_split_line->memchr   EIP 0x4BA56A2
1  'E'   main->ap_mpm_run->0x15C57->0x15B38->0x15941->ap_process_connection->ap_run_process_connection->0xF5A8->ap_read_request->ap_rgetline_core->ap_get_brigade->0x2D2CE->ap_get_brigade->0x2D667->apr_brigade_split_line->memchr   EIP 0x4BA56A2
2  'T'   main->ap_mpm_run->0x15C57->0x15B38->0x15941->ap_process_connection->ap_run_process_connection->0xF5A8->ap_read_request->ap_rgetline_core->ap_get_brigade->0x2D2CE->ap_get_brigade->0x2D667->apr_brigade_split_line->memchr   EIP 0x4BA56A2
…
24 '\n'  main->ap_mpm_run->0x15C57->0x15B38->0x15941->ap_process_connection->ap_run_process_connection->0xF5A8->ap_read_request->ap_rgetline_core->ap_get_brigade->0x2D2CE->ap_get_brigade->0x2D667->apr_brigade_split_line->memchr   EIP 0x4BA56A2
…
24 '\n'  main->ap_mpm_run->0x15C57->0x15B38->0x15941->ap_process_connection->ap_run_process_connection->0xF5A8->ap_read_request->ap_rgetline_core   EIP 0x[truncated in source]
23 '\r'  main->ap_mpm_run->0x15C57->0x15B38->0x15941->ap_process_connection->ap_run_process_connection->0xF5A8->ap_read_request->ap_rgetline_core   EIP 0x[truncated in source]
0  'G'   main->ap_mpm_run->0x15C57->0x15B38->0x15941->ap_process_connection->ap_run_process_connection->0xF5A8->ap_read_request->ap_getword_white   EIP 0x1F7F3
1  'E'   main->ap_mpm_run->0x15C57->0x15B38->0x15941->ap_process_connection->ap_run_process_connection->0xF5A8->ap_read_request->ap_getword_white   EIP 0x1F7F3
2  'T'   main->ap_mpm_run->0x15C57->0x15B38->0x15941->ap_process_connection->ap_run_process_connection->0xF5A8->ap_read_request->ap_getword_white   EIP 0x1F7F3
…
Step 1: Building Protocol Field Tree
[Figure: protocol field tree for the example request. Node labels shown: root; "GET /news.html HTTP/1.0\r\n"; "User-Agent: Wget/ (Red Hat modified)\r\nAccept: */*\r\n…."; "GET /news.html"; "GET"; "HTTP/1.0"]
Step 1: Building Protocol Field Tree
[Figure: subtree for the request line. Node labels shown: "GET /news.html HTTP/1.0\r\n"; "GET /news.html"; "HTTP/1.0\r\n"; "GET"; "/"; "news.html"; "H"; "TTP/1.0"]
Step 2: Refinement (Tokenization)
[Figure: request-line subtree before and after tokenization.
Before: "GET /news.html HTTP/1.0\r\n"; "GET /news.html"; "HTTP/1.0\r\n"; "GET"; "/"; "news.html"; "H"; "TTP/1.0"
After: "GET /news.html HTTP/1.0\r\n"; "GET /news.html"; "HTTP/1.0\r\n"; "GET"; "/news.html"; "HTTP/1.0"]
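The slide shows the fragments "/" and "news.html" becoming "/news.html", and "H" and "TTP/1.0" becoming "HTTP/1.0", but it does not state the merging rule. The sketch below assumes one plausible rule, that leaves are merged when they touch with no byte in between; the type `leaf` and the function `merge_adjacent_leaves` are hypothetical and are not taken from AutoFormat.

```c
#include <assert.h>
#include <stddef.h>

/* A leaf of the field tree: message bytes [start, end], inclusive. */
struct leaf {
    int start;
    int end;
};

/* Merge leaves that touch each other (no byte in between) into one token.
 * Leaves must be sorted by start offset. Works in place and returns the
 * new leaf count. */
size_t merge_adjacent_leaves(struct leaf *leaves, size_t n)
{
    assert(leaves != NULL || n == 0);
    if (n == 0)
        return 0;
    size_t out = 0;                      /* index of the token being built */
    for (size_t i = 1; i < n; i++) {
        if (leaves[i].start == leaves[out].end + 1)
            leaves[out].end = leaves[i].end;   /* fragments touch: merge */
        else
            leaves[++out] = leaves[i];         /* a gap byte: new token */
    }
    return out + 1;
}
```

For the request line, the leaves at offsets [0,2], [4,4], [5,13], [15,15], [16,22] would collapse to [0,2], [4,13], [15,22], which correspond to "GET", "/news.html", and "HTTP/1.0"; the spaces at offsets 3 and 14 keep the tokens apart.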
Step 2: Refinement (Redundant Node Deletion)
[Figure: request-line subtree before and after redundant node deletion.
Before: "GET /news.html HTTP/1.0\r\n"; "GET /news.html"; "HTTP/1.0\r\n"; "GET"; "/news.html"; "HTTP/1.0"
After: "GET /news.html HTTP/1.0\r\n"; "GET /news.html"; "HTTP/1.0\r\n"; "GET"; "/news.html"]
Step 2: Refinement (Node Insertion)
[Figure: request-line subtree before and after node insertion. Node labels shown in both trees: "GET /news.html HTTP/1.0\r\n"; "GET /news.html"; "HTTP/1.0\r\n"; "GET"; "/news.html"]
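Step 3: Output the Result
[Figure: two views of the result for "GET /news.html HTTP/1.0\r\n".
Parallel & sequential: "GET"; "/news.html"; "HTTP/1.0\r\n"
Hierarchical: "GET /news.html HTTP/1.0\r\n" with the nodes "GET", "/news.html", "HTTP/1.0\r\n" beneath it]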
Evaluation
Implemented on top of Valgrind; the approach also applies to QEMU and PIN.
Benchmark: 30 messages from six known protocols and one unknown protocol.
Evaluation metric Re (ratio of exact matches): Re = |A ∩ W| / |W|
  A: set of fields identified by AutoFormat
  W: set of fields identified by Wireshark
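A minimal sketch of the Re computation, assuming each field is represented by its inclusive byte range and that two fields match exactly when both boundaries agree. The type `field` and the function `exact_match_ratio` are hypothetical names, and each set is assumed to contain no duplicate fields.

```c
#include <assert.h>
#include <stddef.h>

/* A field as an inclusive byte range within the message. */
struct field {
    int start;
    int end;
};

/* Re = |A ∩ W| / |W|, where a field of W counts as matched only if A
 * contains a field with exactly the same boundaries. */
double exact_match_ratio(const struct field *a, size_t na,
                         const struct field *w, size_t nw)
{
    assert(w != NULL && nw > 0);
    assert(a != NULL || na == 0);
    size_t matched = 0;
    for (size_t i = 0; i < nw; i++) {
        for (size_t j = 0; j < na; j++) {
            if (a[j].start == w[i].start && a[j].end == w[i].end) {
                matched++;
                break;               /* count each reference field once */
            }
        }
    }
    return (double)matched / (double)nw;
}
```

For example, if W holds four fields and A reproduces two of them with identical boundaries, the function returns 0.5.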
Overall Result
Re = 93.4%
  Re(F) = 88.5%   Re(F): Re for finest-grained fields
  Re(H) = 98.0%   Re(H): Re for hierarchical fields
  Re(P) = 100.0%  Re(P): Re for parallel fields
Experimental Result: Slapper Worm
Nested data structure declaration
Compiler-inserted gap
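The slide does not reproduce the Slapper data structures, so the sketch below uses a hypothetical nested declaration only to illustrate what a compiler-inserted gap is: padding bytes that sit inside a structure but belong to no declared field. The names `inner`, `message`, and `gap_after_tag` are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical inner structure: a 1-byte tag followed by a 4-byte id. */
struct inner {
    uint8_t tag;
    uint32_t id;     /* usually aligned, so padding precedes it */
};

/* Hypothetical outer structure that nests the inner one. */
struct message {
    struct inner hdr;
    uint16_t len;
};

/* Number of padding bytes between the end of tag and the start of id. */
size_t gap_after_tag(void)
{
    return offsetof(struct inner, id)
         - (offsetof(struct inner, tag) + sizeof(uint8_t));
}
```

On common ABIs where uint32_t is 4-byte aligned, the gap is 3 bytes. If such a structure is sent over the network as raw memory, those bytes travel with the message even though no code reads them as a field.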
Discussion
Dynamic trace dependency
Byte granularity
Protocol state machine
Obfuscated binaries
Conclusion
AutoFormat: a tool for automatic protocol format extraction.
Key insight: a protocol implementation is programmed to recognize the protocol format, so it usually processes each protocol field in a field-specific execution context. That context can be leveraged to infer the hierarchical structure of protocol fields, and even to derive their BNF structure.
Thank you
For more information: {zlin, dxu,
Q & A