At-Scale Debugging

Alan Sguigna

December 18, 2016
2:24 pm

The conventional approach to hardware-assisted debugging on Intel platforms involves physically connecting an external probe to the target. Is there a better way?

High-availability systems in industries such as telecom, military/aerospace, storage and high-end computing are expected to operate 24×7 with close to zero downtime. Because failures in deployed systems are inevitable, it is important to gather forensic data from lab and field electronics to aid root-cause post-mortem analysis. This discipline should be practiced over the entire lifespan of the product, from early prototype through end-of-life.

One of the most capable tools for forensics data retrieval is run-control, also known as probe mode or, in the Intel case, In-Target Probe (ITP). The ITP logic within the silicon allows for low-level, hardware (JTAG)-assisted fetching of the “breadcrumbs” needed to indict silicon, hardware, firmware or software bugs in a system. Once identified, these bugs can be rectified, contributing to greater robustness of the platform going forward.

Intel ITP is a very powerful technology. It forms the foundation of hardware-assisted source-level debug right out of system reset, using such tools as ASSET’s SourcePoint debugger. This tool is used universally by Intel OEMs and ODMs to troubleshoot the most elusive, intermittent bugs.

ITP forms a compelling combination with Intel Processor Trace and Trace Hub. Together, run-control and trace put system events from any logic element, thread, core or socket into a meaningful code context, which helps find “needle in the haystack” bugs.

ITP is also foundational to the Intel Customer Scripts, also known as CScripts. These programs are extremely useful for bringing up and debugging a new hardware design. Their capabilities range from basic state dumps (register and memory dumps) to error injection/logging and sideband-enabled post-mortem access. Good background on the CScripts is available in our CScripts eBook (note: requires registration). An example use case for CScripts is PCI Express Link Training & Status State Machine (LTSSM) testing, which you can read about in my two blogs: LTSSM Testing Part 1 and LTSSM Testing Part 2.

Intel ITP becomes most powerful, however, when it can be applied to any server, anywhere, anytime, without the need for physical access. This is accomplished by embedding the run-control functionality directly into a Baseboard Management Controller (BMC) on the target. The run-control API library (with functions such as EnterDebugMode, ReadMSR and ReadCSR) and the JTAG master driver (which performs the scan-by-scan interface) are delivered in the form of, for example, a Linux shared library, and together they perform autonomous forensics retrieval on the target itself. This configuration is known as Embedded ITP. It can also optionally be connected to a remote host running the CScripts and the Python Command Line Interface (CLI) that supports them.
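To make the idea of autonomous forensics retrieval more concrete, here is a minimal Python sketch of the halt, read, resume sequence that such a BMC-resident routine might follow. It is an illustration only, not the actual run-control API: the function names echo EnterDebugMode, ReadMSR and ReadCSR from the text, but their signatures, the exit_debug_mode call, the SimulatedRunControl class, the CSR name and all register values are assumptions made for this example. A simulated backend stands in for the shared library and JTAG driver so the sketch runs anywhere. The two MSR addresses are the architectural IA32_MCG_STATUS (0x17A) and IA32_MC0_STATUS (0x401) machine-check registers.

```python
from typing import Dict, Iterable, Tuple

IA32_MCG_STATUS = 0x17A  # architectural machine-check global status MSR
IA32_MC0_STATUS = 0x401  # architectural machine-check bank 0 status MSR


class SimulatedRunControl:
    """Hypothetical stand-in for an embedded run-control library.

    A real implementation would drive the target over JTAG; this one
    only returns canned values so the sequence can be exercised.
    """

    def __init__(self, msrs: Dict[Tuple[int, int], int], csrs: Dict[str, int]):
        self._msrs = msrs      # keyed by (core index, MSR address)
        self._csrs = csrs      # keyed by an illustrative CSR name
        self.halted = False

    def enter_debug_mode(self) -> None:
        self.halted = True

    def exit_debug_mode(self) -> None:
        self.halted = False

    def _require_halt(self) -> None:
        if not self.halted:
            raise RuntimeError("target must be in debug mode before reading state")

    def read_msr(self, core: int, address: int) -> int:
        self._require_halt()
        return self._msrs[(core, address)]

    def read_csr(self, name: str) -> int:
        self._require_halt()
        return self._csrs[name]


def collect_forensics(rc, cores: Iterable[int], msr_addresses: Iterable[int],
                      csr_names: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Halt the target, read the requested registers, then always resume."""
    msr_addresses = list(msr_addresses)
    rc.enter_debug_mode()
    try:
        msrs = {
            f"core{core}.msr_{addr:#x}": rc.read_msr(core, addr)
            for core in cores
            for addr in msr_addresses
        }
        csrs = {name: rc.read_csr(name) for name in csr_names}
    finally:
        # Resume the target even if a read fails part-way through.
        rc.exit_debug_mode()
    return {"msr": msrs, "csr": csrs}


# Canned values for the simulation; they are made up, not real register contents.
rc = SimulatedRunControl(
    msrs={
        (0, IA32_MCG_STATUS): 0x5,
        (0, IA32_MC0_STATUS): 0xB200000000000000,
        (1, IA32_MCG_STATUS): 0x0,
        (1, IA32_MC0_STATUS): 0x0,
    },
    csrs={"example_csr": 0x1234},  # hypothetical CSR name
)

report = collect_forensics(
    rc,
    cores=[0, 1],
    msr_addresses=[IA32_MCG_STATUS, IA32_MC0_STATUS],
    csr_names=["example_csr"],
)

for group, values in report.items():
    for name, value in values.items():
        print(f"{group:>3} {name:<20} {value:#018x}")
```

The try/finally block reflects one design choice worth carrying into a real implementation: a forensics routine running unattended on a BMC should release the target from debug mode even when a register read fails, so that a diagnostic pass does not itself cause an outage.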

The ASSET solution for Embedded ITP is known as ScanWorks Embedded Diagnostics (SED). ASSET’s first public customer deployment was announced with Cray in 2009: http://insidehpc.com/2009/08/cray-asset-intertech-partnership-emdebbed-supercomputer-diagnostics/. I especially enjoyed insideHPC’s assessment of the innovation:

"I like this for two reasons. It’s time for supercomputing vendors to take reliability seriously, and a real discipline around automated diagnosis and management will help. Vendors with a footprint on the enterprise have had an advantage over Cray (and, to some extent, SGI), and Cray has lagged in this department. The second thing I like about this is that Cray isn’t diverting its own resources to do it — it is buying the technology and focusing on integrating. This is smart execution, focused on core capability."

For more technical information on SED, please refer to our technical overview document (note: requires registration).