Kyoungwoo Lee (final defense)

Cooperative Cross-Layer Protection for Resource-Constrained Mobile Multimedia Systems
Nov. 26, 2008
This thesis is supervised by Prof. Nikil Dutt, Prof. Nalini Venkatasubramanian, and Prof. Lichun Bao.

Contents
Thesis Motivation
Thesis Proposal – Cooperative, Cross-layer Methods
  PPC (Partially Protected Caches)
  EAVE (Error-Aware Video Encoding)
  CC-PROTECT (Cooperative, Cross-layer Protection)
Thesis Contribution and Future Direction
First, I will talk about the motivation for this work. Then I will present three cooperative, cross-layer methods: PPC, EAVE, and CC-PROTECT. Finally, I will conclude the talk and discuss future directions.

Mobile Multimedia Embedded Systems
Resources are limited in mobile devices!
The main problem is to achieve low power together with high performance, high QoS, and high reliability.
Applications: map routing, 3D graphics, image browsing, animation, mobile TV, web browsing, video streaming, satellite TV, video conferencing.

Reliability
Reliability is an emerging and critical concern in mobile devices
  New technology makes devices vulnerable to errors because of high complexity and high integration
    The soft error rate increases exponentially as technology scales [Baumann, 05]
  Mobile applications run close to humans
    In pervasive computing, failures of mobile healthcare devices can have serious consequences
Redundancy techniques incur high power and performance overheads
  TMR (Triple Modular Redundancy) may exceed 200% overhead without optimization [Nieuwland, 06]
It is challenging to optimize multiple properties (e.g., performance, power, QoS, and reliability) in mobile embedded systems
Reliability is an emerging issue, and optimizing it is challenging because of tradeoffs among multiple properties!

Soft errors are becoming an every-second concern!
Soft Error Rate (SER): FIT (Failures in Time) = number of errors in 10^9 hours
Technology / system | SER (FIT) | MTTF | Reason
µm | 1000 | 104 years |
µm | 64x8x1000 | 81 days | High integration
nm | 2x1000x64x8x1000 | 1 hour | Technology scaling and twice the integration
A 65 nm | 2x2x1000x64x8x1000 | 30 minutes | Memory accounts for 50% of soft errors in a system
A system with voltage 65 nm | 100x2x2x1000x64x8x1000 | 18 seconds | Exponential relationship between SER and supply voltage
A system with voltage flight (35, nm | 800x100x2x2x1000x64x8x1000 | 0.02 seconds | High intensity of neutron flux in flight (high altitude)
Reliability is a very broad topic, so let’s look at one emerging case. As a concrete example of errors, soft errors are increasing significantly!
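The MTTF column follows directly from the FIT definition on this slide: a rate of F failures per 10^9 hours gives a mean time to failure of 10^9 / F hours. A minimal sketch of that arithmetic follows; the function name `mttf_hours` is made up for illustration, and the FIT products are copied from the table.

```python
HOURS_PER_FIT_WINDOW = 1e9  # FIT counts failures per 10^9 device-hours


def mttf_hours(fit: float) -> float:
    """Mean time to failure, in hours, for a given FIT rate."""
    if fit <= 0:
        raise ValueError("FIT must be positive")
    return HOURS_PER_FIT_WINDOW / fit


# FIT products as written in the SER table on this slide.
high_integration = 64 * 8 * 1000
scaled_and_doubled = 2 * 1000 * high_integration
whole_system = 2 * scaled_and_doubled
voltage_scaled = 100 * whole_system
in_flight = 800 * voltage_scaled

print(f"{mttf_hours(high_integration) / 24:.1f} days")
print(f"{mttf_hours(scaled_and_doubled):.2f} hours")
print(f"{mttf_hours(whole_system) * 60:.1f} minutes")
print(f"{mttf_hours(voltage_scaled) * 3600:.1f} seconds")
print(f"{mttf_hours(in_flight) * 3600:.3f} seconds")
```

The printed values round to the 81 days, 1 hour, 30 minutes, 18 seconds, and 0.02 seconds shown in the table.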

Errors and Failures in Mobile Embedded Systems
Faults or errors can cause failures
Diagram: a bug at the application layer, an exception at the middleware/OS layer, a soft error at the hardware layer, and packet loss in the network.
A mobile device is composed of system abstraction layers (application, middleware, OS, and hardware) and communicates over a wireless connection. Different types of errors and faults exist, as shown in this diagram. Let’s look at these errors and failures, and their protection techniques, at each system abstraction layer.

Errors and Error Control Schemes at Hardware
Failures: soft errors, hard failures, system crashes
Causes: external radiation, thermal effects, power loss, poor design, aging
Metrics: FIT, MTTF, MTBF
Traditional approaches: spatial redundancy (TMR, duplex, RAID-1, etc.) and data redundancy (EDC, ECC, RAID-5, etc.)
Hardware failures are increasing as technology scales
  (e.g.) SER increases by up to 1000 times [Mastipuram, 04]
Redundancy techniques are expensive
  (e.g.) ECC-based protection in caches can incur a 95% performance penalty [Li, 05]
FIT: Failures in Time (per 10^9 hours); MTTF: Mean Time To Failure; MTBF: Mean Time Between Failures; TMR: Triple Modular Redundancy; EDC: Error Detection Codes; ECC: Error Correction Codes; RAID: Redundant Array of Inexpensive Disks
There are soft errors, permanent failures, and system crashes, which result from external radiation such as alpha particles and neutrons, high temperature, battery depletion, poor design, and aging. Metrics for the reliability of hardware components generally include FIT, MTTF, and MTBF. FIT, which stands for Failures in Time, measures the number of errors in one billion operating hours. MTTF indicates how long it takes until a failure occurs, and MTBF indicates how long it takes from one failure to the next. To recover from these failures, we can use spatial redundancy and data redundancy techniques. Spatial redundancy techniques include TMR, duplex, RAID, etc. Data redundancy includes error detection codes, error correction codes, etc. These hardware failures increase as technology scales and integration increases. For example, SER can increase by several orders of magnitude as technology scales. These redundancy techniques are effective but expensive. For instance, ECC for caches incurs up to a 95% performance penalty and up to a 22% power overhead. Thus, high-reliability processors protect only the L2 and L3 caches, not the L1 cache, because the L1 cache is so critical to processor performance and power.

Errors and Error Control Schemes at Software
Failures: wrong outputs, infinite loops, crashes
Causes: incomplete specification, poor software design, bugs, unhandled exceptions
Metrics: number of bugs per thousand lines of code, QoS, MTTF, MTBF
Traditional approaches: spatial redundancy (N-version programming, etc.) and temporal redundancy (checkpoints and backward recovery, etc.)
Software errors become dominant as system complexity increases
  (e.g.) Several bugs per thousand lines of code
Software is hard to debug, and redundancy techniques are expensive
  (e.g.) Backward recovery with checkpoints is inappropriate for real-time applications
QoS: Quality of Service
Examples of software failures are wrong outputs, infinite loops, and crashes, which come mostly from designer or programmer errors such as incomplete specifications, poor software design, programming bugs, and unhandled exceptions. To measure software reliability, there is the number of bugs per thousand lines of code, a measure of quality in programs. MTTF and MTBF are also used. Traditional approaches include spatial redundancy such as N-version programming, and temporal redundancy such as recovery with checkpoints. As system complexity increases, software errors become dominant. Software is also hard to debug, and fault-tolerance techniques are expensive. For example, backward error recovery with checkpoints is inappropriate for real-time applications.

Errors and Error Control Schemes in Networks
Failures: data losses, deadline misses, node (link) failures, system down
Causes: network congestion, noise/interference, malicious attacks
Metrics: packet loss rate, deadline miss rate, SNR, MTTF, MTBF, MTTR
Traditional approaches: resource reservation, data redundancy (CRC, etc.), temporal redundancy (retransmission, etc.), spatial redundancy (replicated nodes, MIMO, etc.)
Networks are unreliable (especially wireless networks)
Joint approaches across OSI layers have been investigated for minimal costs [Vuran, 06][Schaar, 07]
SNR: Signal to Noise Ratio; MTTR: Mean Time To Recovery; CRC: Cyclic Redundancy Check; MIMO: Multiple-Input Multiple-Output
Network reliability is another huge area, since networks are unreliable. Briefly, it covers data losses, deadline misses, node or link failures, and system downtime, which result from congested networks, noisy and interfered channels, and malicious attacks. Quality metrics for network reliability include SNR (Signal to Noise Ratio) and loss or miss rates, in addition to MTTF, MTBF, and MTTR. MTTR measures how long it takes to recover a system from a failure. There are many approaches in networking; some examples are resource reservation protocols, data redundancy techniques such as CRC, temporal redundancy techniques such as retransmission, and spatial redundancy such as node replication and multiple radios. Interestingly, there has been a lot of work on combining multiple approaches across the seven OSI layers for optimal solutions.

Conventional Approaches
Most redundancy techniques incur overheads in terms of performance, power, area, etc.
  Conventional TMR (Triple Modular Redundancy) can incur 200% overhead without optimization
  Backward recovery with checkpoints cannot guarantee the completion time of a task
Recently proposed techniques have focused on reducing cost without losing reliability
  However, they still incur overheads
Conventional approaches incur overheads. By the end of this talk, I will show the effectiveness of our approach, which improves performance and energy consumption rather than incurring overheads.

Thesis Problem Statement
Study tradeoffs among system properties
  (e.g.) Redundancy incurs energy overheads, while DVS increases SER significantly
Examine errors and error control schemes across system abstraction layers
  (e.g.) network errors and error-resilient video encoding, soft errors and ECC or EDC, etc.
Maximize reliability at minimal cost in power and performance for mobile embedded systems
  Focus mainly on soft error reduction for mobile multimedia embedded systems

Cross-Layer Methods
Cross-layer approaches:
  Aim at system-level optimization
  Integrate and coordinate techniques across system layers
Classification [Srivastava, 05]
  Top-down, bottom-up, or both directions
    Top-down: PPC, PDVS (GRACE), etc.
    Bottom-up: EAVE, etc.
    Both directions: CC-PROTECT, etc.
  Coupling or merging layers
    Dynamo [Mohapatra], xTune [Kim], etc.
PDVS: Practical Dynamic Voltage Scaling
In a layered architecture, only adjacent layers communicate, whereas cross-layer approaches allow non-adjacent layers to communicate. Cross-layer approaches merge and tightly couple layers, whereas a layered architecture separates functionality across layers.

Cross-Layer Approaches – GRACE
GRACE, UIUC [W. Yuan, Ph.D. thesis, ’04; A. F. Harris III, Ph.D. thesis, ’06]
QoS/power tradeoffs
  Primarily OS adaptation for power management in multimedia mobile devices
  Network adaptation for power management in multimedia communications
Figure [GRACE, 05]: application, operating system, and hardware layers.
GRACE mainly addresses QoS and power tradeoffs, and the work has focused on the OS coordinator.

Cross-Layer Approaches – DYNAMO & FORGE
DYNAMO middleware for FORGE, UCI [S. Mohapatra, Ph.D. thesis, ’05; R. Cornea, Ph.D. thesis, ’07]
QoS/performance/power tradeoffs for mobile embedded systems
Middleware-driven coordination and proxy-based cooperation
  Content transcoding at the application layer
  Network traffic shaping at the network layer
  Backlight (LCD display) setting at the hardware layer
  NIC shutdown and CPU DVS/DFS at the hardware layer
Diagram: application, middleware/OS, and hardware layers with a proxy server (network and middleware), annotated 1 to 4.
Dynamo, part of our FORGE project, has also investigated tradeoffs among power, performance, and QoS. It produced two Ph.D. theses; in particular, our middleware-driven approaches and proxy-based techniques have been studied, and several techniques have demonstrated the effectiveness of cross-layer approaches.

Cross-Layer Approaches – xTune
xTune, UCI and SRI [M. Kim, Ph.D. thesis, ’08]
QoS/power/timeliness adaptation for distributed real-time embedded systems
A formal methodology for cross-layer tuning and verifiable timeliness of mobile embedded systems
Diagram: a handheld (application, middleware/OS, hardware) and a proxy server that combines a formal method with system realization.
xTune is a framework that addresses timing issues in a cross-layer manner. Interestingly, this work focuses on dynamic parameter and policy tuning by combining formal methods and system realization at the proxy server. However, neither xTune nor the previously proposed cross-layer approaches have dealt with reliability in depth.

Thesis Proposed Contribution
This thesis proposes a cross-layer design methodology for mobile multimedia embedded systems with minimal costs
  Reliability/QoS/power/performance system optimization for mobile multimedia systems
  Cooperative, cross-layer protection: PPC, EAVE, and CC-PROTECT
  Low-cost reliability
Therefore, this thesis proposes a cross-layer design methodology for mobile embedded systems with minimal costs. We have proposed several techniques, namely PPC, EAVE, and CC-PROTECT, which I will present briefly in today’s talk.

Overview of Thesis Proposals
Diagram: overview of the three proposals, PPC (Partially Protected Caches), EAVE (Error-Aware Video Encoding), and CC-PROTECT (Cooperative, Cross-layer Protection), for a mobile video application on error-prone networks.
  Application: an error controller (e.g., frame drop) and an error-resilient encoder (e.g., PBPAIR) turn the original video into error-aware video (EAVE); frame drops and packet loss affect QoS.
  MW/OS: monitor and translate the SER, the error injection rate, and the frame loss rate.
  Hardware: an unprotected cache and a protected cache, with error detection (EDC) and error correction (ECC).
In particular, CC-PROTECT improves performance and energy consumption compared to existing techniques without any protection, while increasing reliability by 1,000 times. I will show the effectiveness of this approach in today’s talk.

Contents
Thesis Motivation
Thesis Proposal – Cooperative, Cross-layer Methods
  PPC (Partially Protected Caches)
  EAVE
  CC-PROTECT
Thesis Contribution and Future Direction

Conventional Protection for Caches
Caches are the components hit hardest by soft errors
Conventional protected caches
  Are unaware of fault tolerance at the application level
  Implement a redundancy technique such as ECC to protect all data on every access
  Are overkill for multimedia applications
  ECC (e.g., a Hamming code) incurs a performance penalty of up to 95%, a power overhead of up to 22%, and an area cost of up to 25%
Diagram: a multimedia application, middleware/OS, and hardware with an ECC-protected cache; the cache is unaware of the application and incurs high cost.
Conventional protection techniques for caches are highly expensive.

PPC (Partially Protected Caches)
Observation
  Not all data are equally failure critical
  Multimedia data vs. control variables
Propose PPC architectures to provide unequal protection for mobile multimedia systems [Lee, CASES06][Lee, TVLSI08]
  An unprotected cache and a protected cache at the same level of the memory hierarchy
  The protected cache is typically smaller, to keep its power and delay the same as or less than those of the unprotected cache
How to partition data?
Based on this observation, we have proposed the PPC architecture, which is implemented with different levels of protection.

PPC for Multimedia Applications
Propose selective data protection [Lee, CASES06]
  Unequal protection at the hardware layer, exploiting the error tolerance of multimedia data at the application layer
Simple data partitioning for multimedia applications
  Multimedia data is failure non-critical
  All other data is failure critical
Diagram: application (multimedia), middleware/OS, and hardware (PPC) layers, labeled with fault tolerance and power/delay reduction.

PPC for General Applications
DPExplore [Lee, PPCDIPES08]
  Explore the partitioning space by exploiting the vulnerability of each data page
Vulnerable time
  Data is vulnerable during the time before it is eventually read by the CPU or written back to memory
  Pages causing high vulnerable time are failure critical
Vulnerable time closely estimates the failure rate
  Reduces the number of simulations needed to estimate the failure rate
Timeline figure: incoming data, read, write, and eviction events at times t0 to t3, with each interval marked vulnerable or invulnerable.
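The vulnerable-time idea on this slide can be sketched as a small trace analysis. This is only an illustration under stated assumptions, not the DPExplore implementation: the event names (`incoming`, `read`, `write`, `evict`), the rule that an interval counts as vulnerable when it ends in a read or in the eviction of dirty data (a write-back), and the `threshold` parameter are all hypothetical choices made for the example.

```python
from typing import Dict, Iterable, List, Tuple

Event = Tuple[float, str]  # (timestamp, kind); kinds are assumed names


def vulnerable_time(events: Iterable[Event]) -> float:
    """Sum the intervals of one cache block's trace that are vulnerable.

    Assumed rule: the interval before a read is vulnerable, and so is the
    interval before the eviction of dirty data (it is written back).
    Intervals ending in a write or a clean eviction are invulnerable.
    """
    total = 0.0
    prev_time = None
    dirty = False
    for time, kind in events:
        if kind == "incoming":
            dirty = False  # a fresh fill starts clean
        elif prev_time is not None:
            interval = time - prev_time
            if kind == "read" or (kind == "evict" and dirty):
                total += interval
        if kind == "write":
            dirty = True
        prev_time = None if kind == "evict" else time
    return total


def failure_critical_pages(page_traces: Dict[str, List[List[Event]]],
                           threshold: float) -> List[str]:
    """Pages whose total vulnerable time exceeds a (hypothetical) threshold."""
    return sorted(page for page, traces in page_traces.items()
                  if sum(vulnerable_time(t) for t in traces) > threshold)


trace = [(0, "incoming"), (2, "read"), (5, "write"), (9, "evict")]
print(vulnerable_time(trace))
```

For the example trace, the intervals before the read and before the dirty eviction are counted, while the interval that ends in the write is not.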

Summary – PPC
Not all data are equally failure critical
Propose a PPC architecture to provide unequal protection
  Support unequal protection at the hardware layer by exploiting error tolerance and vulnerability at the application layer
  Present cost-efficient reliability
Related publications
  [Lee, CASES06]: PPC for multimedia embedded systems
  [Lee, PPCDIPES08]: PPC for general applications
  [Lee, TVLSI08]: PPC and design space exploration
Under submission
  [Lee, TODAES??]: Partitioning techniques for general applications and instruction caches
Diagram: application data and code are split by page partitioning algorithms, using the error tolerance of multimedia data and the vulnerability of data and code, into failure non-critical (FNC) and failure critical (FC) pages, which are mapped into the unprotected and protected caches of the PPC.

Contents
Thesis Motivation
Thesis Proposal – Cooperative, Cross-layer Methods
  PPC
  EAVE (Error-Aware Video Encoding)
  CC-PROTECT
Thesis Contribution and Future Direction
Researchers have studied error-resilient video encoding against network errors. Interestingly, some error-resilient video encoders are also energy efficient in general. However, they are not efficient when the network is error-free or the network error rate is low. Thus, to maximize the energy efficiency of this technique, we have developed error-aware video encoding, which exploits errors actively. We will take a brief look at the idea.

Active Error Exploitation – Intentional Frame Drop
Intentional frame drop (one way to actively exploit errors) can reduce the energy of each operation
  FDT-1 affects the following components with respect to power, performance, and QoS in mobile video applications
Diagram: a mobile video application over error-prone networks with packet loss, showing encoding (CPU), transmission (WNI), reception (WNI), and decoding (CPU), with frame drop points FDT-1, FDT-2, and FDT-3.
FDT: Frame Drop Type; Enc: Encoding; Dec: Decoding; WNI: Wireless Network Interface
We can inject errors intentionally at each component in this system model if doing so saves resources on resource-constrained mobile devices. Indeed, we can drop data or video frames before encoding, before transmission, or before decoding. The main idea is to exploit error-resilient video encoding to recover the video quality.

Error-Aware Video Encoding
Propose EE-PBPAIR [Lee, DIPES08]
  Intentionally drop frames at video encoding
  Reduce the energy consumption of video encoding
  Maintain the video quality by exploiting the error resilience of PBPAIR
Diagram: in the Error-Aware Video Encoder (EAVE), an error controller (e.g., frame dropping) applies intentional frame drops to the original video at rate EIR, and an error-resilient encoder (e.g., PBPAIR) produces error-aware video, which is sent over error-prone networks with packet loss.
EIR: Error Injection Rate

Summary – EAVE
Intentional frame drop is one way to exploit errors actively
Propose an error-aware video encoding (EE-PBPAIR)
  Present a knob (EIR) to adjust the amount of errors considering the QoS feedback
  Maintain the video quality using the error resilience of PBPAIR
Related publication
  [Lee, DIPES08]: EE-PBPAIR
Considering submission
  [Lee, TECS??]: Generalized idea for error-resilient video encodings
Diagram: at the application, the error-resilient video encoder produces error-aware video data under error rate = PLR + EIR; at the middleware, the error controller sets the EIR using the PLR and QoS from the network or decoding side; at the hardware, energy is reduced for the CPU, memory, and WNI.
EIR: Error Injection Rate; PLR: Packet Loss Rate
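The EIR knob can be illustrated with a small feedback step built on the relation error rate = PLR + EIR from this slide. This is a hedged sketch, not the EE-PBPAIR controller: the function names, the `quality_ok` flag standing in for QoS feedback, the `max_error_rate` budget the encoder is assumed to be tuned for, the step size, and the even spacing of dropped frames are all assumptions made for the example.

```python
from typing import List


def next_eir(eir: float, plr: float, quality_ok: bool,
             max_error_rate: float, step: float = 0.01) -> float:
    """One feedback step for the Error Injection Rate (EIR).

    Raise EIR while the QoS feedback is acceptable, lower it otherwise,
    and keep the total error rate PLR + EIR within max_error_rate.
    """
    eir = eir + step if quality_ok else eir - step
    eir = min(eir, max_error_rate - plr)
    return max(eir, 0.0)


def frames_to_drop(num_frames: int, eir: float) -> List[int]:
    """Indices of frames to drop so that drops are spread evenly (assumed policy)."""
    dropped = []
    credit = 0.0
    for index in range(num_frames):
        credit += eir
        if credit >= 1.0:
            dropped.append(index)
            credit -= 1.0
    return dropped


print(frames_to_drop(8, 0.25))
```

When the measured packet loss rate rises, the cap `max_error_rate - plr` shrinks, so fewer frames are dropped intentionally; in an error-free network the whole budget is available for intentional drops.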

Contents
Thesis Motivation
Thesis Proposal – Cooperative, Cross-layer Methods
  PPC
  EAVE
  CC-PROTECT (Cooperative Cross-layer Protection)
Thesis Contribution and Future Direction
Now we exploit several existing techniques across system abstraction layers to maximize resource efficiency, which is very effective.

Errors and Error Control Schemes – No Coupling
Different errors and their protection techniques have not been considered jointly
  No coupling and no cooperation
Cooperating control schemes in a cross-layer manner can open a new avenue
Diagram: a mobile video application on error-prone networks, with packet loss in the network and soft errors at the hardware, across the application, middleware/OS, network, and hardware layers.

PPC still incurs overheads due to ECC-protection
Propose PPC architectures to provide unequal protection for mobile multimedia systems [Lee, TVLSI08]
  An unprotected cache and a protected cache at the same level of the memory hierarchy
PPC still incurs overheads due to the highly expensive ECC protection at the protected cache
  29% energy reduction compared to the protected cache
  10% energy overhead compared to the unprotected cache

PBPAIR is energy-inefficient in error-free network
PBPAIR is error-resilient and energy-efficient in general
PBPAIR may not be energy efficient in an error-free network
Diagram: PBPAIR with the packet loss rate (PLR) of the network and Intra_Threshold.
PBPAIR: Probability-Based Power Aware Intra Refresh [Kim, 06]
We would like to use these two techniques, at two different abstraction layers, together to maximize their efficiency.

Outline of CC-PROTECT
Diagram: a mobile video application on error-prone networks.
  Application: the Error-Aware Video Encoder (EAVE), with an error controller (e.g., frame drop) and an error-resilient encoder (e.g., PBPAIR), turns the original video into error-aware video; frame drops and packet loss cause QoS loss, which is fed back.
  MW/OS: monitor and translate the SER, the error injection rate, and the frame loss rate; trigger selective DFR; support EAVE and PPC (parameters, data mapping).
  Hardware: PPC with an unprotected cache and a protected cache using EDC for error detection of soft errors.
  Recovery on frame K and frame K+1: BER (Backward Error Recovery) vs. DFR (Drop & Forward Recovery).
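The difference between backward error recovery and DFR can be sketched as an encoding loop. This is an illustration only: `SoftErrorDetected` stands in for an EDC detection signal from the protected cache, and `encode_frame` for an error-resilient encoder such as PBPAIR; both names are hypothetical. On a detected error, BER would roll back and redo frame K, whereas DFR drops frame K and moves on to frame K+1.

```python
from typing import Callable, List, Sequence, Tuple


class SoftErrorDetected(Exception):
    """Hypothetical signal that EDC detected a soft error while encoding."""


def encode_with_dfr(frames: Sequence[object],
                    encode_frame: Callable[[object], bytes]
                    ) -> Tuple[List[Tuple[int, bytes]], List[int]]:
    """Encode frames; on a detected soft error, drop the frame and go forward."""
    encoded: List[Tuple[int, bytes]] = []
    dropped: List[int] = []
    for index, frame in enumerate(frames):
        try:
            encoded.append((index, encode_frame(frame)))
        except SoftErrorDetected:
            # DFR: no rollback or re-encoding of frame K; continue with K+1
            # and rely on the encoder's error resilience for video quality.
            dropped.append(index)
    return encoded, dropped


def flaky_encoder(frame: int) -> bytes:
    """Toy encoder that reports a detected soft error on frame 2."""
    if frame == 2:
        raise SoftErrorDetected()
    return bytes([frame])


print(encode_with_dfr([0, 1, 2, 3], flaky_encoder))
```

A dropped frame looks like a lost frame to the decoder, which is why this scheme is paired with an error-resilient encoder.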

Energy Saving
Configurations (application: error-prone or error-resilient encoding; hardware: unprotected or protected cache)
  BASE = error-prone video encoding + unprotected cache
  HW-PROTECT = error-prone video encoding + PPC with ECC
  APP-PROTECT = error-resilient video encoding + unprotected cache
  MULTI-PROTECT = error-resilient video encoding + PPC with ECC
  CC-PROTECT1 = error-prone video encoding + PPC with EDC
  CC-PROTECT2 = error-prone video encoding + PPC with EDC + DFR
  CC-PROTECT = error-resilient video encoding + PPC with EDC + DFR
Results
  EDC impact: 17% reduction compared to HW-PROTECT, 4% reduction compared to BASE
  EDC + DFR impact: 36% reduction compared to HW-PROTECT, 26% reduction compared to BASE
  EDC + DFR + PBPAIR (CC-PROTECT) impact: 56% reduction compared to HW-PROTECT, 49% reduction compared to BASE

Summary – CC-PROTECT
Propose the CC-PROTECT approach, in which existing schemes cooperate across layers to mitigate the impact of soft errors on the failure rate and video quality in mobile video encoding systems
  PPC (Partially Protected Caches) with EDC (Error Detection Codes) at the hardware layer
  DFR (Drop and Forward Recovery) at the middleware
  PBPAIR (Probability-Based Power Aware Intra Refresh) at the application layer
Demonstrate the effectiveness of low-cost (about 50%) reliability (1,000x) at a minimal cost in QoS (less than 1%)
Related publication
  [Lee, ACMMM08]: CC-PROTECT
Considering submission
  [Lee, ACMTOMCCAP??]: Tradeoff space exploration with CC-PROTECT
Diagram: PBPAIR (error resilience) at the application, DFR (error correction) at the middleware/OS, and the unprotected and protected caches (ECC, EDC) at the hardware.

Contents
Thesis Motivation
Thesis Proposal – Cooperative, Cross-layer Methods
  PPC
  EAVE
  CC-PROTECT
Thesis Contribution and Future Direction

Overall Thesis Contribution
Cross-layer methodology to design mobile multimedia embedded systems with minimal costs
Effective cross-layer approaches for reliability
  Low-cost reliability
  Expanded trade-off space
  Extended applicability of existing techniques
Diagram: packet loss, frame drop, and soft errors across the application, middleware/OS, network, and hardware layers.

Effectiveness of Thesis Proposals (Energy Saving)
PPC: 29% energy reduction compared to a conventional protected cache with ECC
EAVE: 37% energy reduction compared to conventional video encoding
CC-PROTECT: 56% energy reduction compared to a conventional composition of protections

Publication
[Lee, ACMMM08] K. Lee, A. Shrivastava, M. Kim, N. Dutt, and N. Venkatasubramanian, “Mitigating the impact of hardware defects on multimedia applications – A cross-layer approach”, In ACM International Conference on Multimedia, Oct
[Lee, TVLSI08] K. Lee, A. Shrivastava, I. Issenin, N. Dutt, and N. Venkatasubramanian, “Partially protected caches to reduce failures due to soft errors in multimedia applications”, In IEEE Transactions on Very Large Scale Integration Systems (TVLSI), 2008, to appear.
[Lee, DIPES08] K. Lee, M. Kim, N. Dutt, and N. Venkatasubramanian, “Error exploiting video encoder to extend energy/QoS tradeoffs for mobile embedded systems”, In 6th IFIP Working Conference on Distributed and Parallel Embedded Systems (DIPES), Sep
[Lee, PPCDIPES08] K. Lee, A. Shrivastava, N. Dutt, and N. Venkatasubramanian, “Data partitioning techniques for partially protected caches to reduce soft error induced failures”, In 6th IFIP Working Conference on Distributed and Parallel Embedded Systems (DIPES), Sep
[Lee, CASES06] K. Lee, A. Shrivastava, I. Issenin, N. Dutt, and N. Venkatasubramanian, “Mitigating soft error failures for multimedia applications by selective data protection”, In Int. Conference on Compilers, Architecture, & Synthesis for Embedded Systems (CASES), Oct
[Lee, ICME05] K. Lee, N. Dutt, and N. Venkatasubramanian, “Experimental Study on Energy Consumption of Video Encryption for Mobile Handheld Devices”, In IEEE International Conference on Multimedia and Expo (ICME 05), Poster Session, July 2005.
[Mohapatra, IPDPS05] S. Mohapatra, R. Cornea, H. Oh, K. Lee, M. Kim, N. Dutt, R. Gupta, A. Nicolau, S. Shukla, and N. Venkatasubramanian, “A cross-layer approach for power-performance optimization in distributed mobile systems”, In Next Generation Software Program in conjunction with IEEE International Parallel and Distributed Processing Symposium (IPDPS), April
Diagram: the publications above placed on the application, middleware/OS, network, and hardware layers.

Future Direction
Error rate translation/integration
  Different types of errors
  Different components across system layers
Cross-layer methods for distributed embedded systems (horizontal expansion)
  Network-aware methods
  Context-aware approaches
Diagram: a mobile video application on error-prone networks, with bugs, exceptions, soft errors, and packet loss across the application, middleware/OS, network, and hardware layers.

Thank you! Any Questions or Comments?

Backup Slides

Why Cross-Layer Approach?
Cross-layer interactions and conflicts arise between system properties
  DVS increases SER exponentially
Over-protection or under-protection
  ECC for all multimedia data is overkill
Cross-layer approaches can maximize reliability with minimal power and performance overheads
Benefits of cross-layer approaches
  Global system view
  Coordination for intelligent selection
  Adaptation
Cross-layer approaches have shown promise in saving resources at the cost of QoS [Mohapatra, 05][Yuan, 04]
DVS: Dynamic Voltage Scaling; SER: Soft Error Rate; ECC: Error Correction Codes; QoS: Quality of Service
We have looked briefly at errors and error control schemes at each layer. Why do we need a cross-layer approach? First, there are definite conflicts and interactions among system properties such as power, performance, and reliability. For example, voltage scaling increases SER exponentially, while redundancy techniques for high reliability incur high overheads in power and performance. So we need to consider these properties together and choose approaches that do not break other properties. A cross-layer approach provides the opportunity to consider multiple properties from a global system perspective. Second, without coordination across system layers, we may end up with over-protection or under-protection. For example, in multimedia applications, protecting all data from hardware defects such as soft errors is overkill. A cross-layer approach can provide cooperation, with coordination for intelligent selection and dynamic adaptation, by monitoring and exploiting errors and error control schemes. Cross-layer approaches have been investigated and have demonstrated their effectiveness in mobile embedded systems for multiple properties such as power, performance, and QoS, but not reliability. Thus, this thesis investigates the opportunity for coordination between reliability approaches and resource management techniques.

Thesis Proposed Contribution: CC-PROTECT
Cooperative Cross-layer Protection (CC-PROTECT) by exploiting error-awareness and error control schemes across system abstraction layers
Contribution
  Present cost-efficient reliability methods (cooperative cross-layer protection)
  Open expanded tradeoff spaces and operating points
  Rediscover the applicability of existing approaches for other purposes
We propose cooperative cross-layer protection that exploits error-awareness, such as error tolerance, and error control schemes across system abstraction layers in mobile embedded systems. Our work presents cost-efficient reliability methods compared to an isolated approach at a single layer. Further, it opens largely expanded tradeoff spaces and operating points by exploiting errors actively. Our research also rediscovers the applicability of existing error-control schemes for other purposes.

Performance vs. Capacity
Total energy available from a battery is a design issue and is fixed at design time, along with its weight and size
Stark contrast between the linear growth rate of battery capacity and the exponential technology improvement rate of system components
[Udani] Sanjay Udani and Jonathan Smith, “Power management in mobile computing”

Generalized Fault Tolerance Techniques
Modular Redundancy
N-Version Programming
Error-Control Coding
Checkpoints and Rollbacks
Recovery Blocks
The goal of fault tolerance techniques is to add reliability without adding significant cost. Today I will talk about five generalized fault tolerance techniques, from modular redundancy to recovery blocks.
[Chetan, SPC04] S. Chetan, A. Ranganathan, and R. Campbell, “Towards Fault Tolerant Pervasive Computing”, in SPC ’04
[Somani, IEEECom97] A. K. Somani and N. H. Vaidya, “Understanding Fault Tolerance and Reliability”, in IEEE Computer ’97, vol. 30, issue 4

1) Modular Redundancy
Multiple identical replicas of hardware modules
Voter mechanism
  Compares outputs and selects the correct output
Tolerates most hardware faults
Effective but expensive
Diagram: producer A (with a fault) and producer B send data through a voter to the consumer.
Modular redundancy uses multiple identical replicas of a hardware module. A voter mechanism may compare the outputs and select the correct one. Modular redundancy can tolerate most hardware faults. It is a very typical and effective mechanism, but it is expensive. There are many examples, such as disk array systems (RAID), multiprocessors, and multiple wireless network interfaces. In this picture, there are two identical producers, producer A and producer B. If there is a fault at producer A, the voter must select the output from producer B.
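A voter can be sketched as a majority function over replica outputs. The sketch below assumes three replicas, as in TMR, because a plain majority vote needs more than two outputs to pick a winner; with only the two producers in the picture, a voter would need some other way to tell which output is faulty. The name `majority_vote` is made up for the example.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Hashable:
    """Return the output produced by a strict majority of replicas."""
    if not outputs:
        raise ValueError("no replica outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority among replica outputs")
    return value


# One of three replicas is hit by a fault and returns a wrong value.
print(majority_vote([42, 42, 17]))
```

The cost of this scheme is visible in the call: every result is computed three times, which is where the 200% overhead quoted earlier for unoptimized TMR comes from.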

2) N-version Programming
Different versions by different teams. Different versions may not contain the same bugs. Voter mechanism. Tolerates some software bugs. [Figure: Producer A runs Program i (by Programmer K) and Program j (by Programmer L); a voter passes data to the Consumer; a fault occurs in Program i.] The basic idea of N-version programming is to write multiple versions of a software module. It looks similar to modular redundancy, but the versions are written by different teams or different programmers so that they are unlikely to contain the same bugs. A voter mechanism selects the best output among them. If Program i at Producer A were the only version providing data to the Consumer, a fault in Program i would make the system fail. To make it reliable, the system is equipped with two different versions: Program i, developed by Programmer K, and Program j, developed by Programmer L. When a fault occurs in Program i, the voter should choose the output from Program j. What is the difference between modular redundancy, N-version programming, and recovery blocks? They all rely on redundancy. To differentiate them: modular redundancy uses identical modules, N-version programming uses different modules written by different programmers, and recovery blocks use different blocks implementing various algorithms.
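The idea can be sketched in Python as follows. This is only an illustration under assumptions: three versions of an integer square root stand in for independently developed programs, one of them carries a deliberately injected bug, and all function names are hypothetical.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def isqrt_newton(n: int) -> int:
    """Version 1: integer square root by Newton iteration."""
    if n < 2:
        return n
    x, y = n, (n + 1) // 2
    while y < x:
        x = y
        y = (x + n // x) // 2
    return x


def isqrt_bisect(n: int) -> int:
    """Version 2: integer square root by binary search."""
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid * mid <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


def isqrt_buggy(n: int) -> int:
    """Version 3: linear scan with an injected off-by-one bug."""
    r = 0
    while r * r <= n:
        r += 1
    return r  # bug: the correct result is r - 1


def n_version_run(versions: Sequence[Callable[[int], int]], n: int) -> Optional[int]:
    """Run every version on the same input and return the strict-majority output."""
    outputs = [version(n) for version in versions]
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


print(n_version_run([isqrt_newton, isqrt_bisect, isqrt_buggy], 10))
```

Because the two correct versions use different algorithms, the bug in the third version does not appear in the other two, and the vote masks it.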

3) Error-Control Coding
Replication is effective but expensive. Error-detection coding and error-correction coding (examples: parity bit, Hamming code, CRC) need much less redundancy than replication. [Figure: Producer A sends data plus error-control data to the Consumer; a fault occurs in transit.] The previous two techniques are based on the replication principle: they are effective for fault tolerance but expensive, because they keep multiple identical or different modules all the time. Error-control coding, such as error-detecting or error-correcting codes, needs much less redundancy. For instance, a parity bit is a simple way to detect a single-bit error, and a Hamming code corrects a single-bit error. CRC and Reed-Solomon codes are other examples of error-control coding used in telecommunication. If an error occurs during data transmission from Producer A to the Consumer, error-control coding at the receiver can detect the error and request retransmission from Producer A, or it can fix the error using an error-correction algorithm.
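To make the parity and Hamming examples concrete, here is a small Python sketch. It assumes even parity and the textbook Hamming(7,4) layout (parity bits at positions 1, 2, and 4); the function names are illustrative and are not taken from the thesis.

```python
from typing import List


def parity_bit(bits: List[int]) -> int:
    """Even-parity bit: chosen so that the total number of 1s is even."""
    return sum(bits) % 2


def hamming74_encode(d: List[int]) -> List[int]:
    """Encode 4 data bits as a 7-bit codeword laid out p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword: List[int]) -> List[int]:
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based position of the flipped bit, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


data = [1, 0, 1, 1]
sent = hamming74_encode(data)
received = list(sent)
received[4] ^= 1  # simulate a single-bit soft error in transit
print(hamming74_decode(received) == data)
```

Four data bits cost three check bits here, whereas a duplicated module costs a full second copy, which is the redundancy gap the slide points to.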

4) Checkpoints & Rollbacks
Checkpoint: a copy of an application's state, saved in storage that is immune to the failures. Rollback: restart the execution from a previously saved checkpoint. This recovers from transient and permanent hardware and software failures. [Figure: an application at Producer A checkpoints state (K-1) and state K; after a fault, it rolls back to the checkpoint.] Checkpointing and rollback is a coordinated scheme. At every interval it saves a copy of the application's state (such as the program counter and memory addresses) to storage that is safe from the failures. If a failure occurs, the application's state is rolled back to the last saved checkpoint and execution restarts from there. In this example the fault happens in the application at Producer A. At every interval the application saves its state to the storage shown in green. When a fault occurs after state K, the state is rolled back to state K and the application restarts from there.
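A toy Python sketch of the scheme follows. It assumes a transient fault that is detected immediately and an in-memory copy standing in for failure-immune storage; the class `CheckpointedApp`, the `run` helper, and the checkpoint interval are hypothetical.

```python
import copy
from typing import Any, Dict, List, Optional


class CheckpointedApp:
    """Toy application that accumulates values and checkpoints periodically."""

    def __init__(self, interval: int = 3) -> None:
        self.state: Dict[str, Any] = {"step": 0, "total": 0}
        self.interval = interval
        # Stand-in for storage that is immune to the failures.
        self._checkpoint = copy.deepcopy(self.state)

    def step(self, value: int, inject_fault: bool = False) -> None:
        if inject_fault:
            self.state["total"] = -999  # simulate state corruption
            raise RuntimeError("fault detected")
        self.state["total"] += value
        self.state["step"] += 1
        if self.state["step"] % self.interval == 0:
            self._checkpoint = copy.deepcopy(self.state)  # save a checkpoint

    def rollback(self) -> None:
        """Restore the last saved checkpoint."""
        self.state = copy.deepcopy(self._checkpoint)


def run(values: List[int], fault_at: Optional[int] = None, interval: int = 3) -> int:
    """Process all values, rolling back and re-executing after one transient fault."""
    app = CheckpointedApp(interval)
    faulted = False
    while app.state["step"] < len(values):
        i = app.state["step"]
        try:
            app.step(values[i], inject_fault=(i == fault_at and not faulted))
        except RuntimeError:
            faulted = True
            app.rollback()  # execution resumes from the checkpointed step
    return app.state["total"]


print(run([1, 2, 3, 4, 5], fault_at=4))
```

The work done between the last checkpoint and the fault is repeated after the rollback, which is the performance cost that the checkpoint interval trades against the cost of saving state.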

5) Recovery Blocks
Multiple alternates perform the same functionality using different approaches: one primary module and secondary modules. Select a module whose output satisfies the acceptance test. Recovery blocks and rollbacks: restart the execution from a previously saved checkpoint with a secondary module. Tolerates software failures. [Figure: an application at Producer A with Block X, Block X2, Block Y, and Block Z; checkpoints at state (K-1) and state K; after a fault, rollback to the checkpoint.] Recovery blocks provide multiple alternates that perform the same functionality using different approaches or algorithms. The basic approach is to try the primary module and then the secondary modules, selecting one whose output passes the acceptance test. Another possible scheme combines recovery blocks with rollbacks. It is the same as checkpoints and rollbacks, except that execution restarts with a secondary module instead of the primary module. If a fault is detected at Block X in the application, the state is rolled back to the previous one at K and execution restarts with the secondary Block X2 rather than the primary Block X. For example, the diamond search algorithm and the full search algorithm are different approaches to motion estimation in MPEG video encoding. The main difference between recovery blocks and N-version programming is that recovery blocks apply different algorithms, possibly written by the same programmer, whereas N-version programming is implemented by different programming teams.
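A minimal Python sketch of a recovery block is shown below. It is an assumption-laden illustration: sorting stands in for the protected functionality, the primary alternate is deliberately faulty, and the acceptance test only checks length and ordering. All names are hypothetical.

```python
from typing import Callable, List, Sequence


def primary_sort(data: List[int]) -> List[int]:
    """Primary alternate with an injected fault: it drops the largest element."""
    return sorted(data)[:-1]


def secondary_sort(data: List[int]) -> List[int]:
    """Secondary alternate: insertion sort, a different algorithm for the same job."""
    out: List[int] = []
    for x in data:
        i = len(out)
        while i > 0 and out[i - 1] > x:
            i -= 1
        out.insert(i, x)
    return out


def acceptance_test(inp: Sequence[int], out: Sequence[int]) -> bool:
    """Accept an output that has the input's length and is in non-decreasing order."""
    return len(out) == len(inp) and all(a <= b for a, b in zip(out, out[1:]))


def recovery_block(
    data: List[int],
    alternates: Sequence[Callable[[List[int]], List[int]]],
    accept: Callable[[Sequence[int], Sequence[int]], bool],
) -> List[int]:
    """Try each alternate on a fresh copy of the input until one passes the test."""
    for alternate in alternates:
        restored = list(data)  # roll back to the saved input before each attempt
        try:
            result = alternate(restored)
        except Exception:
            continue  # a crash counts as a failed alternate
        if accept(data, result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")


print(recovery_block([3, 1, 2], [primary_sort, secondary_sort], acceptance_test))
```

Unlike a voter, the acceptance test judges one output at a time, so the secondary alternates run only when the primary fails.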

Soft Errors (Transient Faults)
SER increases exponentially as technology scales. Contributing factors: integration, voltage scaling, altitude, latitude. Caches are hit hardest because they take up a large portion of the processor (more than 50%) and have no masking effects (e.g., logical masking). [Figure: Intel Itanium II processor [Baumann, 05]; a bit flip in a transistor; MTTF labels of 5 hours and 1 month.] MTTF: Mean Time To Failure

Related Work
Process technology solutions: Hardening [Baze, IEEE Trans. on Nuclear Science 00]; SOI [O. Musseau, IEEE Trans. on Nuclear Science 96]. Drawbacks: process complexity, yield loss, and substrate cost.
Microarchitectural solutions for caches: Cache Scrubbing [Mukherjee, PRDC04]; Low Power Cache [Li, ISLPED04]; Area Efficient Protection [Kim, DATE06]; Multiple Bit Correction [Neuberger, TODAES 03]; Cache Size Selection [Cai, ASP-DAC06]; In-Cache Replication [Zhang, DSN03]; Replication Cache [Zhang, IEEE Computers 05]. Drawbacks: high overheads in terms of power, performance, and area.
Our solution: protects caches from failures due to soft errors by exploiting the error tolerance of applications. The protection can be used in conjunction with any of the techniques above.

Unequal Data Protection
Not all pages are equally failure critical. Multimedia data is failure non-critical; program variables are failure critical. Failures: system crash, infinite loop, segmentation faults, etc. QoS degradation is not a failure. Only 9 pages out of 83 are failure critical.

Failure Critical and Failure Non-Critical Data

Soft Errors on Increase
SER increases exponentially due to technology scaling: 0.18 µm: 1,000 FIT per Mbit of SRAM; 0.13 µm: 10,000 to 100,000 FIT per Mbit of SRAM. Voltage scaling increases SER significantly: $SER \propto N_{flux} \times CS \times \exp\left(-\frac{Q_{critical}}{Q_s}\right)$, where $Q_{critical} = C \times V$.
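The effect of voltage scaling can be worked through numerically. The Python sketch below assumes the relation on this slide, SER proportional to N_flux x CS x exp(-Q_critical / Q_s) with Q_critical = C x V, and holds the neutron flux and cross-section fixed so that they cancel in a ratio. It also uses the standard definition of FIT as failures per 10^9 device-hours. The capacitance and Q_s values are placeholders in arbitrary units, not measured data, and the function names are hypothetical.

```python
import math


def fit_to_mttf_hours(fit: float) -> float:
    """Convert a failure rate in FIT (failures per 1e9 hours) to MTTF in hours."""
    return 1e9 / fit


def relative_ser(v_new: float, v_ref: float, capacitance: float, q_s: float) -> float:
    """Return SER(v_new) / SER(v_ref) for fixed flux and cross-section.

    With Q_critical = C * V, the ratio reduces to exp(-C * (v_new - v_ref) / Q_s).
    """
    return math.exp(-capacitance * (v_new - v_ref) / q_s)


# 1,000 FIT per Mbit corresponds to an MTTF of one million hours per Mbit.
print(fit_to_mttf_hours(1000.0))

# Placeholder parameters (arbitrary units): lowering the supply voltage
# from 1.2 to 1.0 multiplies the SER by exp(C * 0.2 / Q_s).
print(relative_ser(v_new=1.0, v_ref=1.2, capacitance=1.0, q_s=0.1))
```

Because the voltage sits inside the exponent, a modest voltage reduction multiplies the SER, which is why aggressive voltage scaling for power savings conflicts with reliability.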

Experimental Setup for Page Failure Rates

Experimental Framework

Experimental Results – Failure Rate
Failure rate of PPC is close to that of Safe (Safe is a cache configuration protected with ECC, i.e., protecting all data, and Unsafe is an unprotected cache)

Experimental Results – Performance
Runtime of PPC is close to that of Unsafe

Experimental Results – Power
Energy consumption of PPC is close to that of Unsafe

Experimental Setup for DPExplore

DPExplore Results

Video Encoding

Error-Resilient Video Encoding
Error-resilient video encodings have been developed to combat errors in networks. PBPAIR (Probability-Based Power Aware Intra Refresh) is an energy-efficient and error-resilient video encoding [Kim, 06]. It exploits errors passively: it compresses video data according to the PLR, embedding error resilience against packet losses in error-prone networks to maintain the QoS of the mobile video application. [Figure: system layers (Application, Middleware/OS, Network, Hardware) with labels Parameters, Resilience, PLR, and Packet Loss.]

Related Work
Energy/QoS-aware video encoding: Video encoding parameters [Mohapatra, IPDPS05]; Motion estimation algorithm [Tourapis, VCIP00]; Integrated power management [Mohapatra, ACM MM03]; Global cross-layer adaptation [Yuan, MMCN04]; Transmission power and QoS [Eisenberg, IEEE Trans. on CSVT 02]. These do not consider error resilience.
Error-resilient video encoding: Error-resilient GOP [Yang, JVCIP07]; AIR (Adaptive Intra Refreshing) [Worrall, ICASSP01]; PGOP (Progressive GOP) [Cheng, PCS04]; PBPAIR (Probability-Based Power Aware Intra Refresh) [Kim, MCCR06]. These exploit errors only passively.
Our solution: error-aware video encoding, which exploits errors actively to minimize energy consumption.

EE-PBPAIR

Experimental Setup

Experimental Results – Energy Reduction
Energy saving occurs at every component in a path from encoding to decoding in mobile video applications. EC = Energy Consumption; Enc EC = EC for Encoding; Tx EC = EC for Transmission; Dec EC = EC for Decoding; Rx EC = EC for Receiving. Setting: PLR = 10% and EIR = 10%. PSNR: Peak Signal to Noise Ratio.

Experimental Results – Expanded Tradeoff Space

Experimental Energy Saving
Source EC = Enc EC + Tx EC; Destination EC = Rx EC + Dec EC
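The accounting above can be written as a short Python sketch. The class `EnergyProfile`, the helper `saving_percent`, and every number below are hypothetical and serve only to show how the four components combine; they are not measured results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnergyProfile:
    """Energy consumption (EC) of each component, in arbitrary units."""

    enc: float  # Enc EC: encoding
    tx: float   # Tx EC: transmission
    rx: float   # Rx EC: receiving
    dec: float  # Dec EC: decoding

    @property
    def source(self) -> float:
        """Source EC = Enc EC + Tx EC."""
        return self.enc + self.tx

    @property
    def destination(self) -> float:
        """Destination EC = Rx EC + Dec EC."""
        return self.rx + self.dec


def saving_percent(baseline: float, improved: float) -> float:
    """Energy saving of `improved` relative to `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline


# Made-up profiles for a baseline encoder and an error-aware encoder.
baseline = EnergyProfile(enc=10.0, tx=2.0, rx=1.0, dec=3.0)
error_aware = EnergyProfile(enc=8.0, tx=1.6, rx=0.8, dec=2.8)
print(saving_percent(baseline.source, error_aware.source))
print(saving_percent(baseline.destination, error_aware.destination))
```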

Experimental Results – Adaptive EIR
Feedback-based approach (Adaptive EE-PBPAIR) maintains the required video quality compared to Static EE-PBPAIR

Adaptive EIR

Conclusion
Studied two main cross-layer approaches: PPC and EAVE. Demonstrated the effectiveness of our cooperative cross-layer approaches by exploiting error tolerance and error control schemes. [Figure: Application, Middleware/OS, Network, and Hardware layers with labels Tolerance, Resilience, FLR, EIR, feedback, PLR, and Unequal Protection.]

Failure Rate

Video Quality

Memory Access Time (performance)

Future Direction
Cooperative approaches combining PPC and EAVE: a middleware-driven cross-layer approach manages error control schemes; translate errors to exploit existing approaches at other abstraction layers. PPC: apply our approach to other components, such as instruction caches and logic. EAVE: intelligent frame dropping techniques to maximize the energy saving while minimizing the quality degradation. [Figure: Application (EAVE), Middleware/OS (DFR mechanism), and Hardware (Inexpensive PPC) layers with labels Tolerance, Resilience, FLR, EIR, feedback, PLR, SER, and Unequal Protection.]

Thesis Outline
The thesis proposes a cross-layer method: exploit errors and error control schemes across layers to maximize reliability with minimal costs for mobile embedded systems.
Topic 1 – Approach at hardware and application layers: PPC (unequal data protection at hardware exploiting error tolerance at application) [Lee, CASES06][Lee, DIPES08][Lee, TVLSI08]
Topic 2 – Approach at application, middleware, and network layers: EAVE (intentional exploitation of errors at application, incorporating error resilience in networks) [Lee, DIPES08]
Topic 3 – Approach across application/middleware-OS/HW: CC-PROTECT (middleware-driven cooperative exploitation of errors and error control schemes across layers) [Lee, ACM MM 08]
[Figure: Application, Middleware/OS, Network, and Hardware layers.]

References (cross-layers and tools)
[Bajic, 07] I. V. Bajic. Efficient cross-layer error control for wireless video multicast. 53(1):276–285, Mar
[Dynamo] DYNAMO. Power Aware Middleware for Distributed Mobile Computing. University of California at Irvine,
[Forge] FORGE Project. A Framework for Optimization of Distributed Embedded Systems Software. University of California at Irvine,
[Grace] GRACE Project. Global Resource Adaptation through CoopEration. University of Illinois at Urbana-Champaign,
[Kim, 08] M. Kim, N. Dutt, N. Venkatasubramanian, and C. Talcott. xTune: Online verifiable cross-layer adaptation for distributed real-time embedded systems. ACM SIGBED Review: Special Issue on the RTSS Forum on Deeply Embedded Real-Time Computing, 5(1), Jan
[Mohapatra, 03] S. Mohapatra, R. Cornea, N. Dutt, A. Nicolau, and N. Venkatasubramanian. Integrated power management for video streaming to mobile handheld devices. In ACM international conference on Multimedia,
[Mohapatra, 05] S. Mohapatra, R. Cornea, H. Oh, K. Lee, M. Kim, N. Dutt, R. Gupta, A. Nicolau, S. Shukla, and N. Venkatasubramanian. A cross-layer approach for power-performance optimization in distributed mobile systems. In Next Generation Software Program in conjunction with IPDPS, page 218.1, April
[Shivakumar, 01] P. Shivakumar and N. Jouppi. CACTI 3.0: An Integrated Cache Timing, Power, and Area Model. In WRL Technical Report 2001/2,
[Synopsys] Synopsys Inc., Mountain View, CA, USA. Design Compiler Reference Manual,
[Schaar, 07] M. van der Schaar and D. S. Turaga. Cross-layer packetization and retransmission strategies for delay-sensitive wireless multimedia transmission. IEEE Transactions on Multimedia, 9(1):185–197, Jan
[Vuran, 06] M. C. Vuran and I. F. Akyildiz. Cross-layer analysis of error control in wireless sensor networks. In IEEE Communications Society on Sensor and Ad Hoc Communications and Networks (SECON), pages 585–594, Sep
[Yuan, 03] W. Yuan and K. Nahrstedt. Energy-efficient soft real-time CPU scheduling for mobile multimedia systems. 37(5):149–163, Dec
[Yuan, 04] W. Yuan and K. Nahrstedt. Practical voltage scaling for mobile multimedia devices. In ACM international conference on Multimedia, pages 924–931, 2004.

References (soft errors and reliability)
[Baumann, 05] R. Baumann. Soft errors in advanced computer systems. IEEE Design and Test of Computers, pages 258–266,
[Hazucha, 00] P. Hazucha and C. Svensson. Impact of CMOS technology scaling on the atmospheric neutron soft error rate. IEEE Trans. on Nuclear Science, 47(6):2586–2594,
[Li, 05] J.-F. Li and Y.-J. Huang. An error detection and correction scheme for RAMs with partial-write function. In IEEE International Workshop on Memory Technology, Design and Testing (MTDT), pages 115–120,
[Li, 04] L. Li, V. Degalahal, N. Vijaykrishnan, M. Kandemir, and M. J. Irwin. Soft error and energy consumption interactions: A data cache perspective. In ISLPED, Aug
[Mastipuram, 04] R. Mastipuram and E. C. Wee. Soft Errors’ Impact on System Reliability. Sep
[Phelan, 03] R. Phelan. Addressing soft errors in ARM core-based designs. Technical report, ARM,
[Pradhan, 96] D. K. Pradhan. Fault-Tolerant Computer System Design. Prentice Hall, ISBN
[Shrivastava, 05] A. Shrivastava, I. Issenin, and N. Dutt. Compilation techniques for energy reduction in horizontally partitioned cache architectures. In CASES, pages 90–96,
[Wrobel, 01] F. Wrobel, J. M. Palau, M. C. Calvet, O. Bersillon, and H. Duarte. Simulation of nucleon-induced nuclear reactions in a simplified SRAM structure: Scaling effects on SEU and MBU cross sections. IEEE Trans. on Nuclear Science, 48(6),
[Xu, 96] J. Xu and B. Randell. Roll-forward error recovery in embedded real-time systems. In ICPADS, page 414,
[Nieuwland, 06] A. K. Nieuwland, S. Jasarevic, and G. Jerin. Combinational Logic Soft Error Analysis and Protection. In IOLTS06, 2006.

References (error-resilient encoding, etc.)
[Cheng, 04] L. Cheng and M. E. Zarki. PGOP: An error resilient techniques for low bit rate and low latency video communications. In Picture Coding Symposium (PCS), Dec
[Kim, 06] M. Kim, H. Oh, N. Dutt, A. Nicolau, and N. Venkatasubramanian. PBPAIR: An energy-efficient error-resilient encoding using probability based power aware intra refresh. ACM SIGMOBILE Mobile Computing and Communications Review, 10(3):58–69, July
[Wang, 98] Y. Wang and Q.-F. Zhu. Error control and concealment for video communication: A review. 86(5):974–997, May
[Worrall, 01] S. Worrall, A. Sadka, P. Sweeney, and A. Kondoz. Motion adaptive error resilient encoding for MPEG-4. In ICASSP, May 2001.
