Flexible Hardware Acceleration for Instruction-Grain Program Monitoring
Shimin Chen
Joint work with Michael Kozuch (1), Theodoros Strigkos (2), Babak Falsafi (3), Phillip B. Gibbons (1), Todd C. Mowry (1,2), Vijaya Ramachandran (4), Olatunji Ruwase (2), Michael Ryan (1), Evangelos Vlachos (2)
(1) Intel Research Pittsburgh, (2) CMU, (3) EPFL, (4) UT Austin
Instruction-Grain Monitoring
  Software often contains bugs
    Memory corruptions, data races, …, crashes
    Security attacks are often designed to exploit bugs
  Instruction-grain lifeguards can help
    Dynamic monitoring: during application execution
    Instruction-grain: e.g., memory access, data flow
    Enables a wide range of powerful lifeguards
  [Figure: a lifeguard monitoring an application]
  Speaker notes: It is difficult to write bug-free code. Reasons: added functionality over time, time-to-market pressures, parallel code (for CMPs). Dynamically monitoring instruction-grain events enables a wide range of lifeguards.
Example Instruction-Grain Lifeguards
  AddrCheck [Nethercote’04]
    Monitor malloc/free and memory accesses
    Check that all memory accesses visit allocated memory regions
  MemCheck [Nethercote & Seward ’03 ’07]
    AddrCheck + check for uninitialized values
    Copying partially uninitialized structures is not an error
    Lazy error detection to avoid many false positives
    Track propagation of uninitialized values
  TaintCheck [Newsome & Song’05]: detect overwrite-based security exploits
    Tainted data: data from network or disk
    Track propagation of tainted data to detect violations
  LockSet [Savage et al.’97]: detect data races in parallel programs
  Speaker notes: Let me briefly describe a number of representative examples.
Design Space of Support Platform
  [Figure: design space with two axes. Performance: poor to good. Generality: specific lifeguard to general purpose (wide range of lifeguards).]
  Lifeguard-specific hardware: good performance, but specific to one lifeguard
    [Crandall & Chong’04], [Dalton et al’07], [Shetty et al’06], [Shi et al’06], [Suh et al’04], [Venkataramani’07], [Venkataramani’08], [Zhou et al’07]
  Dynamic binary instrumentation (DBI): general purpose, 10-100X slowdowns
    [Bruening’04], [Luk et al’05], [Nethercote’04]
  General-purpose HW improving DBI: 3-8X slowdowns
    LBA [Chen et al’06], DISE [Corliss’03]
  This paper: general purpose with good performance
  Speaker notes: Our contribution is to achieve …flexibility + performance. In this way, we hope the solution has a better chance to get into future hardware.
Outline
  Introduction
  Background
  Three Hardware Acceleration Techniques
  Experimental Evaluation
  Conclusion
Example Lifeguard: TaintCheck [Newsome & Song’05]
  Purpose: detect overwrite-based security exploits
  Metadata kept for application memory and registers
    Tainted data: data from network or disk
    Track taint propagation
    Detect violation: e.g., tainted jump target address

  Application        TaintCheck lifeguard
  mov %eax A         taint(%eax) = taint(A)
  mov B %eax         taint(B) = taint(%eax)
  add %ebx D         taint(%ebx) |= taint(D)
  jmp *(F)           if (taint(F)==1) error;

  Detect exploit before attack code takes control
  Speaker notes: Mention heap overflow and stack smashing at the beginning.
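The propagation rules in the table above can be sketched in C. This is a minimal illustration, not the lifeguard's actual code: the toy address space (`MEM_BYTES`), the one-byte-per-location metadata arrays, and all function names are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NUM_REGS  8
#define MEM_BYTES 256            /* toy address space (assumption) */

uint8_t reg_taint[NUM_REGS];     /* 1 = tainted, 0 = clean */
uint8_t mem_taint[MEM_BYTES];

/* Clear all metadata. */
void taint_reset(void) {
    memset(reg_taint, 0, sizeof reg_taint);
    memset(mem_taint, 0, sizeof mem_taint);
}

/* Data arriving from network or disk is marked tainted. */
void taint_input(uint32_t addr, uint32_t len) {
    for (uint32_t i = 0; i < len; i++) {
        assert(addr + i < MEM_BYTES);
        mem_taint[addr + i] = 1;
    }
}

/* mov %reg <- mem: taint(reg) = taint(mem) */
void on_mem_to_reg(uint32_t reg, uint32_t addr) {
    assert(reg < NUM_REGS && addr < MEM_BYTES);
    reg_taint[reg] = mem_taint[addr];
}

/* mov mem <- %reg: taint(mem) = taint(reg) */
void on_reg_to_mem(uint32_t addr, uint32_t reg) {
    assert(reg < NUM_REGS && addr < MEM_BYTES);
    mem_taint[addr] = reg_taint[reg];
}

/* add %reg, mem: taint(reg) |= taint(mem) */
void on_reg_op_mem(uint32_t reg, uint32_t addr) {
    assert(reg < NUM_REGS && addr < MEM_BYTES);
    reg_taint[reg] |= mem_taint[addr];
}

/* jmp *(mem): returns 1 if the jump target is tainted (a violation). */
int on_indirect_jump(uint32_t addr) {
    assert(addr < MEM_BYTES);
    return mem_taint[addr] != 0;
}
```

Each handler corresponds to one row of the table; a real lifeguard would receive these calls from the event-delivery mechanism rather than from direct function calls.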
TaintCheck w/ Detailed Tracking
  TaintCheck: detect violation
    1 taint bit / application byte
  TaintCheck w/ detailed tracking [Newsome & Song’05]: construct taint propagation trail
    More detailed metadata per application location
      PC of the instruction that tainted this location
      “tainted from” address
    Not supported by previous lifeguard-specific HW
  [Figure: taint propagation trail from input to violation]
Instruction-Grain Lifeguard Metadata Characteristics
  Organization varies
    Kept per application byte or word
    Size, format, and semantics vary greatly
  Frequently updated
    e.g., propagation tracking
  Frequently checked
    e.g., memory accesses
Lifeguard Support
  [Figure: the unmodified application produces events; event capture and delivery passes them to event handlers (rare, update, check) in the software lifeguard, which maintain the metadata]
  Events
    Rare: e.g., malloc/free, system calls
    Frequent: e.g., memory access, data movement
  Event capture and delivery: general-purpose HW improving DBI
  Performance bottlenecks: (1) metadata mapping, (2) metadata updates, (3) metadata checks
Our Contributions
  [Figure: the lifeguard support diagram (application, event capture and delivery, event handlers, metadata) annotated with M-TLB, IT, and IF]
  Metadata-TLB (M-TLB) for metadata mapping
  Inheritance Tracking (IT) for metadata updates
  Idempotent Filters (IF) for metadata checks
Outline
  Introduction
  Background
  Three Hardware Acceleration Techniques
    Metadata-TLB
    Inheritance Tracking
    Idempotent Filters
  Experimental Evaluation
  Conclusion
Metadata-TLB: Motivation
  Metadata kept per application byte/word
    Element size may vary
  Two-level structure (level-1 index, level-2 chunks): robustness & space efficiency
  Mapping: application address -> metadata address
    Used in almost every handler
    Can be very costly
Example (TaintCheck)
  Each handler statement is followed by the instructions it compiles to.

  void dest_reg_op_mem_4B (UINT32 src_addr /* %eax */, UINT32 dest_reg /* %edx */)
  // app instruction type: dest_reg <- dest_reg op mem(src_addr)
  // handler operation: reg_taint(dest_reg) |= mem_taint(src_addr)

    map *mp = level1_index[src_addr>>16];
        mov %eax, %ecx
        shr $16, %ecx
        mov level1_index(,%ecx,4), %ecx
    int idx = (src_addr & 0xffff)>>2;
        and $0xffff, %eax
        shr $2, %eax
    UChar mem_taint = mp[idx];
        movzbl (%ecx,%eax,1), %eax
    reg_taint[dest_reg] |= mem_taint;
        or %al, reg_taint(%edx)
    nlba ();
        nlba

  nlba is our model of how an event is delivered.
  Metadata mapping takes 5 out of 8 instructions!
Our Solution: Metadata-TLB
  A TLB-like HW associative lookup table
  LMA (Load Metadata Address) instruction: application address -> lifeguard metadata address
  Managed by (user-mode) lifeguard software
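A software model may help make the M-TLB idea concrete. The sketch below is not the hardware design: the direct-mapped organization, the table size, the miss-handling policy, and the names `lma`, `slow_map_chunk`, and `mtlb_misses` are all assumptions for illustration. It assumes the two-level layout from the motivation slide, with one metadata byte per 4-byte application word.

```c
#include <stdint.h>
#include <stdlib.h>

#define L1_ENTRIES   65536            /* indexed by app_addr >> 16 */
#define CHUNK_ELEMS  (65536 >> 2)     /* one metadata byte per 4-byte word */
#define MTLB_ENTRIES 16               /* table size is an assumption */

uint8_t *level1_index[L1_ENTRIES];    /* level-1 index of level-2 chunks */

typedef struct {
    int      valid;
    uint32_t tag;                     /* high 16 bits of the app address */
    uint8_t *chunk;                   /* base of the level-2 metadata chunk */
} mtlb_entry;

mtlb_entry    mtlb[MTLB_ENTRIES];
unsigned long mtlb_misses;

/* Slow path: walk (and lazily allocate) the two-level structure. */
static uint8_t *slow_map_chunk(uint32_t app_addr) {
    uint32_t hi = app_addr >> 16;
    if (level1_index[hi] == NULL) {
        level1_index[hi] = calloc(CHUNK_ELEMS, 1);
        if (level1_index[hi] == NULL) abort();
    }
    return level1_index[hi];
}

/* Model of LMA: application address -> metadata address.
 * On a miss, the (user-mode) lifeguard software refills the entry. */
uint8_t *lma(uint32_t app_addr) {
    uint32_t tag = app_addr >> 16;
    mtlb_entry *e = &mtlb[tag % MTLB_ENTRIES];
    if (!e->valid || e->tag != tag) {
        mtlb_misses++;
        e->chunk = slow_map_chunk(app_addr);
        e->tag   = tag;
        e->valid = 1;
    }
    return e->chunk + ((app_addr & 0xffff) >> 2);
}
```

On a hit, the whole mapping collapses to one lookup plus an offset computation, which is the work a single LMA instruction would perform in hardware.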
Example (TaintCheck) w/ M-TLB
  void dest_reg_op_mem_4B (UINT32 src_addr /* %eax */, UINT32 dest_reg /* %edx */)
  // app instruction type: dest_reg <- dest_reg op mem(src_addr)
  // handler operation: reg_taint(dest_reg) |= mem_taint(src_addr)

  Original handler:
    map *mp = level1_index[src_addr>>16];
        mov %eax, %ecx
        shr $16, %ecx
        mov level1_index(,%ecx,4), %ecx
    int idx = (src_addr & 0xffff)>>2;
        and $0xffff, %eax
        shr $2, %eax
    UChar mem_taint = mp[idx];
        movzbl (%ecx,%eax,1), %eax
    reg_taint[dest_reg] |= mem_taint;
        or %al, reg_taint(%edx)
    nlba ();
        nlba

  Handler with M-TLB:
    UChar *p = LMA_macro(src_addr);
        LMA %eax, %ecx
    UChar mem_taint = *p;
        mov (%ecx), %al
    reg_taint[dest_reg] |= mem_taint;
        or %al, reg_taint(%edx)
    nlba ();
        nlba

  Reduces handler size by half (8 instructions to 4)!
Inheritance Tracking: Motivation
  Propagation tracking is expensive
    Metadata updates for almost every app instruction
  Previous hardware solutions track propagation
    Automatically update metadata in hardware
    Problem: only support simple metadata semantics
      e.g., do not support TaintCheck w/ detailed tracking
  Our goal: flexibility AND performance
  Idea: inheritance structure is common, so let’s track inheritance in hardware!
  Speaker notes: Previous solutions make simplifying assumptions about the metadata format. Tracking inheritance keeps the lifeguard flexible and at the same time removes a large fraction of metadata update calls.
Problem with General Inheritance Tracking

  Application     Propagation tracking         Inheritance tracking
  mov %eax A      taint(%eax) = taint(A)       %eax inherits from A
  mov B %eax      taint(B) = taint(%eax)       B inherits from %eax
  add %ebx D      taint(%ebx) |= taint(D)      insert D into %ebx’s inherit-from list

  Problem: state explosion for binary operations!
Unary Inheritance Tracking
  Many lifeguards can take advantage of unary IT:
    MemCheck
    TaintCheck
  Large performance improvements if used
  Can be disabled if unary IT does not match the lifeguard
Tracking Register Inheritance
  [Figure: an original event, together with the IT table entries IT(%rs) and IT(%rd) for its registers, goes through a state transition that updates the IT table and selects the (transformed) event to deliver]
  More details in the paper:
    IT table and state transition table details
    Conflict detection
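The register-inheritance idea can be sketched as a small software state machine. This is a simplification under stated assumptions, not the paper's state transition table: each register either inherits from one memory address or has its metadata held by the lifeguard, addresses are compared for exact equality (no partial overlaps), and the names `it_table`, `it_mem_to_reg`, `it_reg_to_mem`, `it_reg_op_mem`, and the `delivered` log are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_REGS   8
#define MAX_EVENTS 64

typedef enum { EV_MEM_TO_REG, EV_REG_TO_MEM, EV_MEM_TO_MEM, EV_REG_OP_MEM } ev_kind;
typedef struct { ev_kind kind; uint32_t dst; uint32_t src; } event;

/* IT table: does this register currently inherit from a memory address? */
typedef struct { int inherits; uint32_t from_addr; } it_entry;

it_entry it_table[NUM_REGS];
event    delivered[MAX_EVENTS];   /* events actually sent to the lifeguard */
int      num_delivered;

static void deliver(ev_kind k, uint32_t dst, uint32_t src) {
    assert(num_delivered < MAX_EVENTS);
    delivered[num_delivered++] = (event){k, dst, src};
}

/* Materialize a register's metadata in the lifeguard. */
static void flush_reg(uint32_t r) {
    if (it_table[r].inherits) {
        deliver(EV_MEM_TO_REG, r, it_table[r].from_addr);
        it_table[r].inherits = 0;
    }
}

/* Conflict handling (assumed policy): before addr's metadata changes,
 * flush every register that still inherits from addr. */
static void flush_conflicts(uint32_t addr) {
    for (uint32_t r = 0; r < NUM_REGS; r++)
        if (it_table[r].inherits && it_table[r].from_addr == addr)
            flush_reg(r);
}

/* Load: record the inheritance, deliver nothing. */
void it_mem_to_reg(uint32_t r, uint32_t addr) {
    it_table[r].inherits  = 1;
    it_table[r].from_addr = addr;
}

/* Store: an inheriting register turns the event into mem_to_mem. */
void it_reg_to_mem(uint32_t addr, uint32_t r) {
    int      inh  = it_table[r].inherits;
    uint32_t from = it_table[r].from_addr;
    flush_conflicts(addr);
    if (inh) deliver(EV_MEM_TO_MEM, addr, from);
    else     deliver(EV_REG_TO_MEM, addr, r);
}

/* Binary operation: not unary, so flush and deliver the original event. */
void it_reg_op_mem(uint32_t r, uint32_t addr) {
    flush_reg(r);
    deliver(EV_REG_OP_MEM, r, addr);
}
```

With this model, a load followed by a store of the same register delivers a single mem_to_mem event instead of two events, which is the kind of reduction unary IT aims for.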
Example
  Application instructions     Events before Inheritance Tracking     Events with Inheritance Tracking
  mov %eax A                   mem_to_reg                             mem_to_mem
  mov B %eax                   reg_to_mem

  mov %ebx C                   mem_to_reg                             imm_to_mem
  add %ebx D                   dest_reg_op_mem
  mov E %ebx                   reg_to_mem

  Can significantly reduce metadata update events!
Idempotent Filters: Idea
  Typically, metadata checks give the same result if
    Event parameters are the same, and
    Metadata are the same
  Idea: filter out idempotent (redundant) events
  For example:
    AddrCheck:
      After checking that a memory location is allocated,
      subsequent loads/stores to the same location are safe
      until the next free() event
    LockSet (surprisingly):
      In between synchronization events (e.g., lock/unlock)
      Check first load to a location
      Check first store to a location
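The filtering idea can be sketched as a small cache of recently checked events. This is an illustrative software model only: the direct-mapped organization, the entry count, and the names `if_should_deliver` and `if_invalidate_all` are assumptions. Entries are keyed on both event kind and address, which fits the LockSet rule of checking the first load and the first store separately; an AddrCheck-style filter could ignore the kind.

```c
#include <stdint.h>
#include <string.h>

#define IF_ENTRIES 32                 /* filter size is an assumption */

typedef enum { EV_LOAD = 1, EV_STORE = 2 } check_kind;

typedef struct {
    int        valid;
    check_kind kind;
    uint32_t   addr;
} if_entry;

if_entry filter[IF_ENTRIES];

/* Returns 1 if the check event must be delivered to the lifeguard,
 * 0 if an identical event was already checked and can be filtered out. */
int if_should_deliver(check_kind kind, uint32_t addr) {
    /* Mix the kind into the index so that a load and a store to the
     * same address do not evict each other. */
    if_entry *e = &filter[((addr >> 2) + (uint32_t)kind) % IF_ENTRIES];
    if (e->valid && e->kind == kind && e->addr == addr)
        return 0;                     /* redundant event */
    e->valid = 1;                     /* remember this check */
    e->kind  = kind;
    e->addr  = addr;
    return 1;
}

/* Called when the metadata may have changed: free() for AddrCheck,
 * a synchronization event (lock/unlock) for LockSet. */
void if_invalidate_all(void) {
    memset(filter, 0, sizeof filter);
}
```

Filtering is only safe while the metadata is unchanged, which is why every invalidating event clears the whole table in this sketch.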
Outline
  Introduction
  Background
  Three Hardware Acceleration Techniques
  Experimental Evaluation
    Log-Based Architectures (LBA)
    Simulation Study (w/ reduced input sets)
    PIN-based Analysis (w/ full inputs)
  Conclusion
Log-Based Architectures
  [Figure: the unmodified application produces events; event capture and delivery passes them to event handlers (rare, update, check) in the software lifeguard, which maintain the metadata]
  Events
    Rare: e.g., malloc/free, system calls
    Frequent: e.g., memory access, data movement
  Event capture and delivery: Log-Based Architecture (LBA)
Idea: Exploiting Chip Multiprocessors
  LBA components
Simulation Setup: Dual-Core LBA System
  [Figure: Core 1 runs the application (capture, compress); the log transport (e.g., L2 cache) carries the log to Core 2, which runs the lifeguard (decompress, dispatch; M-TLB, IT & IF); operating system: Fedora Core 5]
  Extend Virtutech Simics
  Application and lifeguard are processes
  Application is stalled when the log buffer is full
  Model a 2-level cache hierarchy
Overall Performance: TaintCheck
  [Figure: slowdowns for LBA baseline and LBA optimized; chart label: 1.36X]
  Slowdown = (application execution time w/ lifeguard) / (application execution time w/o lifeguard)
Applying Our Techniques One by One
  [Figure: bar chart of average slowdowns (y-axis, 0.0 to 10.0) for AddrCheck, MemCheck, TaintCheck, TaintCheck w/ detailed tracking, and LockSet. Configurations: BASE and MTLB for every lifeguard, plus MTLB+IF, MTLB+IT, MTLB+IT, MTLB+IT, and MTLB+IF and MTLB+IT+IF across the five lifeguards. Bar labels as extracted: 7.80, 6.05, 4.21, 4.25, 3.81, 3.23, 3.27, 3.36, 3.20, 2.71, 2.29, 1.90, 1.36, 1.51, 1.40, 1.02]
  IT, IF, and M-TLB are indeed complementary
  Achieve dramatically better performance
PIN-Based Analysis: IT
  IT removes 35.8% to 82.0% of the propagation events
PIN-Based Analysis: IF
  [Figure: reduced check events (%) (y-axis, 10 to 80) versus number of filter entries (x-axis: 8, 16, 32, 64, 128, 256) for fully-associative, 16-way, 8-way, 4-way, 2-way, and 1-way filters; panels: AddrCheck, LockSet]
  IF can effectively reduce check events
  4-way works as well as fully-associative
Conclusion
  Our focus: Instruction-Grain Lifeguards
  Three complementary hardware techniques:
    Metadata-TLB (M-TLB)
    Inheritance Tracking (IT)
    Idempotent Filters (IF)
  Flexible to support a wide range of lifeguards
  Reducing overheads by 2-3X in our experiments
  Achieving 2-51% overheads for all but MemCheck
Thank you!
People Working on LBA Project
  Intel Research: Shimin Chen, Phillip B. Gibbons, Mike Kozuch, Michael Ryan
  University Faculty: Babak Falsafi (EPFL), Todd C. Mowry (CMU), Vijaya Ramachandran (UT Austin)
  CMU Students: Michelle Goodstein, Olatunji Ruwase, Theodoros Strigkos, Evangelos Vlachos
  Previous Contributors: Limor Fix (IRP), Steve Schlosser (IRP), Anastasia Ailamaki (CMU), Greg Ganger (CMU), Bin Lin (Northwestern), Radu Teodorescu (UIUC)