Runtime Fault Recovery Protocol for NoC-based MPSoCs

Eduardo Wachter, Augusto Erichsen, Leonardo Juracy, Alexandre Amory, Fernando Moraes
PUCRS


Abstract

The design of reliable MPSoCs is mandatory to cope with faults that occur during fabrication or over the product lifetime. For instance, permanent faults in the interconnection network can stall or crash applications even though the network still has alternative fault-free paths to a given destination. This paper presents a novel fault-tolerant communication protocol that takes advantage of NoC parallelism to provide alternative paths between any source-target pair of processors, even in the presence of multiple faults. At the application layer, the method is seen as a typical MPI-like message-passing protocol. At the lower layers, the method consists of a software kernel layer that monitors the regularity of message exchanges between pairs of tasks. If a message is not delivered within a given time, the software triggers a path-finding mechanism implemented in hardware, which guarantees complete network reachability. The proposed approach determines new paths quickly, and its costs in extra silicon area and memory usage are small.
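The recovery flow outlined in the abstract can be illustrated with a minimal software sketch. It rests on several assumptions that are not taken from the paper: the NoC is modeled as a 2D mesh, the default route is deterministic XY routing, a missed delivery deadline is modeled simply as a route that crosses a faulty link, and a breadth-first search stands in for the hardware path-finding mechanism, whose actual algorithm is not described here. All function names (`xy_path`, `delivered`, `find_path`, `send_with_recovery`) are hypothetical.

```python
from collections import deque


def xy_path(src, dst):
    """Assumed default route: travel along x first, then along y."""
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:
        y += 1 if dy > y else -1
        path.append((x, y))
    return path


def delivered(path, faulty_links):
    """A message arrives only if no hop of the route uses a faulty link."""
    return all(frozenset(hop) not in faulty_links
               for hop in zip(path, path[1:]))


def find_path(width, height, src, dst, faulty_links):
    """Breadth-first search over the mesh, skipping faulty links.

    Software stand-in (assumption) for the hardware path search.
    Returns a list of routers from src to dst, or None if unreachable.
    """
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            nx, ny = nxt
            if not (0 <= nx < width and 0 <= ny < height):
                continue  # outside the mesh
            if nxt in parent or frozenset((node, nxt)) in faulty_links:
                continue  # already visited, or the link is faulty
            parent[nxt] = node
            queue.append(nxt)
    return None


def send_with_recovery(width, height, src, dst, faulty_links):
    """Try the default route; on a missed delivery, search for a new path."""
    path = xy_path(src, dst)
    if delivered(path, faulty_links):
        return path
    # Delivery deadline missed (modeled as a failed route): recover.
    return find_path(width, height, src, dst, faulty_links)


if __name__ == "__main__":
    # 3x3 mesh with one permanent fault on the link (0,0)-(1,0).
    faults = {frozenset(((0, 0), (1, 0)))}
    print(send_with_recovery(3, 3, (0, 0), (2, 0), faults))
```

In this sketch the search returns `None` when every route to the destination is cut off, so a new path is found exactly when a fault-free one exists in the modeled mesh.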