Debugging watchdog timeouts

Watchdog timeouts occur in crashed or hung systems when the main processor on a printed circuit board stops sending a heartbeat to an ancillary service processor. Once the timeout period expires, the service processor's watchdog re-initializes the main system to try to restore it to an operational state. Watchdog timeouts can occur in both high- and low-end systems, from routers to servers to cell phones. Debugging these can be difficult…
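The heartbeat-and-timeout mechanism described above can be sketched as a small simulation. This is only an illustrative model, not any particular vendor's implementation: the `Watchdog` class, the `simulate` helper, the 3-second timeout, and the 1-second heartbeat interval are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Watchdog:
    """Toy model of a service-processor watchdog (all names hypothetical)."""
    timeout: float                 # seconds allowed between heartbeats
    last_heartbeat: float = 0.0    # time of the most recent heartbeat
    resets: list = field(default_factory=list)  # times at which a reset fired

    def heartbeat(self, now: float) -> None:
        """Called by the main processor while it is healthy."""
        self.last_heartbeat = now

    def poll(self, now: float) -> bool:
        """Called periodically by the service processor.

        Returns True if the timeout expired and a reset was triggered.
        """
        if now - self.last_heartbeat > self.timeout:
            self.resets.append(now)
            # Re-initializing the main system restarts the timeout window.
            self.last_heartbeat = now
            return True
        return False


def simulate(hang_at: float, end: float,
             timeout: float = 3.0, step: float = 1.0) -> list:
    """Send a heartbeat every `step` seconds until the main processor hangs.

    Assumption: the hang is permanent; recovery after a reset is not modeled.
    """
    wd = Watchdog(timeout=timeout)
    t = 0.0
    while t <= end:
        if t < hang_at:
            wd.heartbeat(t)   # main processor is still alive
        wd.poll(t)            # service processor checks the deadline
        t += step
    return wd.resets


if __name__ == "__main__":
    # Main processor hangs at t=5; the last heartbeat was at t=4.
    print(simulate(hang_at=5.0, end=10.0))
```

In this sketch the reset fires at the first poll where more than `timeout` seconds have passed since the last heartbeat. A real service processor would typically use a hardware timer rather than polling, and the reset itself would be a board-level operation.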

Higher-end, high-availability systems typically contain diagnostic routines that capture system state at the time of the failure, allowing the system designer to find the root cause. Such routines can often also set breakpoints within field-deployed systems and permit single-stepping through code in real time. Even the most intermittent, irreproducible software bugs and hardware marginalities can then be rooted out. This reduces the dreaded No Trouble Found (NTF) problem and thereby greatly enhances the reliability and availability of those systems.

