A Framework for Supporting Adaptive Fault Tolerant Solutions

Kostas Siozios, Dimitrios Soudris, Michael Hübner

ACM Transactions on Embedded Computing Systems, Special Issue, 2014


For decades, computer architects pursued one primary goal: performance. The ever-faster transistors provided by Moore’s law were translated into remarkable improvements in operating frequency and power consumption. However, shrinking device sizes and growing architectural complexity impose several new challenges, including reduced dependability due to physical failures. In this paper we propose a software-supported methodology, based on game theory, for adapting the aggressiveness of fault tolerance at run-time. Experimental results demonstrate the efficiency of our solution: it achieves fault masking comparable to that of relevant solutions, but at a significantly lower mitigation cost. More specifically, our framework speeds up the identification of failure-prone resources by 76% on average, compared to the HotSpot tool. Similarly, the introduced solution yields average Power×Delay product (PDP) savings of 53% over an existing triple modular redundancy (TMR) approach.