At-Scale Test

Now that At-Scale Debug (ASD) is becoming a de facto standard on Intel designs for the remote execution of CScripts, what other applications can take advantage of this technology?

In the At-Scale Debug article, I referred to the availability and standardization of remote debug forensics on the upcoming generation of Intel-based servers. This new capability represents a significant leap forward in triage and post-mortem analysis for lab and field-deployed systems. Essentially, it allows low-level, JTAG hardware-assisted debug to be applied “at-scale”, on tens, hundreds or thousands of server systems simultaneously, without the need to connect to each target manually with on-site hardware probes and cables. This will be used to resolve the most intermittent, hard-to-reproduce problems in the silicon, hardware, firmware and software of deployed systems. A high-level drawing looks like this:

[Figure: high-level diagram of At-Scale Debug]

The key hardware requirements for a server design to support At-Scale Debug are documented in the At-Scale Debug Hardware Design Guidelines article. An on-board BMC must be able to control the Intel CPU’s JTAG chain via TDI, TDO, TCK, TMS and TRST, as well as some of the sideband signals, including PREQ, PRDY, and reset/power sense.
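To make the role of each JTAG signal concrete, here is a minimal sketch of one TCK cycle as a BMC might drive it. The FakePins class, clock_bit() and shift_bits() are hypothetical names invented for illustration; a real BMC would toggle hardware pins instead, and the sideband signals (PREQ, PRDY, reset/power sense) are not modeled here.

```python
class FakePins:
    """Stand-in for BMC pin access (hypothetical; not a real BMC API)."""

    def __init__(self, tdo_bits):
        self.levels = {"TCK": 0, "TMS": 0, "TDI": 0}
        self.tdo_bits = list(tdo_bits)  # values the simulated target drives on TDO
        self.captured = []              # (TMS, TDI) seen on each rising TCK edge

    def write(self, pin, value):
        rising = pin == "TCK" and value == 1 and self.levels["TCK"] == 0
        self.levels[pin] = value
        if rising:
            # The target samples TMS and TDI on the rising edge of TCK.
            self.captured.append((self.levels["TMS"], self.levels["TDI"]))

    def read_tdo(self):
        return self.tdo_bits.pop(0) if self.tdo_bits else 0


def clock_bit(pins, tms, tdi):
    """Run one TCK cycle and return the TDO value sampled during it."""
    pins.write("TMS", tms)
    pins.write("TDI", tdi)
    tdo = pins.read_tdo()  # TDO is stable while TCK is low
    pins.write("TCK", 1)   # rising edge: target latches TMS and TDI
    pins.write("TCK", 0)   # falling edge: target updates TDO
    return tdo


def shift_bits(pins, tdi_bits):
    """Shift bits through the chain; TMS goes high on the last bit to leave the shift state."""
    out = []
    for i, bit in enumerate(tdi_bits):
        last = i == len(tdi_bits) - 1
        out.append(clock_bit(pins, tms=1 if last else 0, tdi=bit))
    return out


pins = FakePins(tdo_bits=[1, 0, 1, 1])
received = shift_bits(pins, [0, 1, 1, 0])
```

In this sketch, TMS steers the target's TAP controller, TDI carries data in, TDO carries data out, and TCK paces everything, which is why the achievable TCK rate bounds the scan throughput.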

In terms of software requirements, the BMC must support, at a minimum, a JTAG mastering function (which performs the IRScan/DRScan functions, state moves, etc.) in its firmware. This allows the BMC to perform the IEEE 1149.1 functions that are required to activate the Design for Debug (DfD) features within the Intel silicon. This is the foundation for the run-control functionality (often called In-Target Probe, or ITP) that performs the register, memory and I/O reads and writes necessary for platform debug. Some BMCs have only software-based, bit-banging JTAG masters, in which case the highest achievable TCK rate is 1-2 MHz. Other BMCs, such as the Pilot4 as documented within the article Remote CScripts Application Using the Pilot4 BMC, have hardware-based JTAG masters that can sustain TCK frequencies of 16 MHz and beyond. The higher the TCK rate, the better, of course, especially when dumping large amounts of forensics data, as with the sysDump() CScript.
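The "state moves" that a JTAG mastering function performs can be sketched as follows. The transition table is the 16-state TAP controller defined by IEEE 1149.1; the tms_path() helper and the breadth-first search used to find the shortest TMS sequence are illustrative choices, not a description of any particular BMC firmware.

```python
from collections import deque

# IEEE 1149.1 TAP controller: state -> (next state if TMS=0, next state if TMS=1)
TAP = {
    "Test-Logic-Reset": ("Run-Test/Idle", "Test-Logic-Reset"),
    "Run-Test/Idle":    ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-DR-Scan":   ("Capture-DR", "Select-IR-Scan"),
    "Capture-DR":       ("Shift-DR", "Exit1-DR"),
    "Shift-DR":         ("Shift-DR", "Exit1-DR"),
    "Exit1-DR":         ("Pause-DR", "Update-DR"),
    "Pause-DR":         ("Pause-DR", "Exit2-DR"),
    "Exit2-DR":         ("Shift-DR", "Update-DR"),
    "Update-DR":        ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-IR-Scan":   ("Capture-IR", "Test-Logic-Reset"),
    "Capture-IR":       ("Shift-IR", "Exit1-IR"),
    "Shift-IR":         ("Shift-IR", "Exit1-IR"),
    "Exit1-IR":         ("Pause-IR", "Update-IR"),
    "Pause-IR":         ("Pause-IR", "Exit2-IR"),
    "Exit2-IR":         ("Shift-IR", "Update-IR"),
    "Update-IR":        ("Run-Test/Idle", "Select-DR-Scan"),
}


def tms_path(start, goal):
    """Return the shortest list of TMS values that moves the TAP from start to goal."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        for tms in (0, 1):
            nxt = TAP[state][tms]
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [tms]))
    raise ValueError(f"no path from {start} to {goal}")


def walk(start, tms_bits):
    """Apply a TMS sequence, one bit per TCK cycle, and return the final state."""
    state = start
    for tms in tms_bits:
        state = TAP[state][tms]
    return state


to_shift_dr = tms_path("Run-Test/Idle", "Shift-DR")  # [1, 0, 0]
to_shift_ir = tms_path("Run-Test/Idle", "Shift-IR")  # [1, 1, 0, 0]
```

An IRScan or DRScan is then a state move into Shift-IR or Shift-DR, a run of shifted bits, and a state move back to Run-Test/Idle. Holding TMS high for five TCK cycles returns the controller to Test-Logic-Reset from any state, which walk() can confirm.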

In the most common version of At-Scale Debug, all functionality beyond basic JTAG mastering resides back on the remote host. ASSET’s ScanWorks Embedded Diagnostics (SED), by contrast, embeds the actual run-control functionality within the BMC firmware. This yields a significant performance boost, provides for on-target, in-situ BMC-based diagnostics, and enables a more secure implementation.

Some excellent concrete examples of what can be achieved with at-scale debugging are documented in a Cray User Group paper, Cray XC System Level Diagnosability: Commands, Utilities and Diagnostic Tools for the Next Generation of HPC Systems.

One might ask what other functionality can be made available once the JTAG master is in place and the BMC has access to the processor scan chain. JTAG is also the foundational technology for boundary-scan test (BST), which performs structural testing of printed circuit boards (PCBs). BST uses IEEE 1149.1/1149.6, typically in conjunction with Wagner patterns, to isolate short circuits and open circuits on PCBs; it is also frequently used to program devices that are accessible from the boundary-scan cells of chips on the scan chain. BST is usually found on prototype benches, in repair & return depots, and on manufacturing production lines, where an array of external cables, hardware controllers (containing the JTAG mastering function), and fixtures is necessary to gain access to the PCB’s scan chain(s).

But on PCBs equipped with At-Scale Debug support, the JTAG master and physical access to the processor chain are already provided. So, it is a simple matter of dropping in some additional firmware to execute the test patterns on the board. Embedding BST removes any dependency on external hardware (controller/pod, cabling, fixturing), so it can be applied at the system level, and in parallel, in such environments as manufacturing production lines, Environmental Stress Screening (ESS), repair/return depots, or even in the field.
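The parallel, system-level application described above might be orchestrated from a host along the following lines. This is only a sketch: run_on_all() and fake_bst() are hypothetical, the BMC names are made up, and how a host actually triggers an embedded test on a given BMC is not covered in this article, so the per-target call is passed in as a plain function.

```python
from concurrent.futures import ThreadPoolExecutor


def run_on_all(targets, run_test, max_workers=32):
    """Call run_test(target) for every target concurrently.

    Returns {target: ("ok", result)} or {target: ("error", message)} so that
    one unreachable system does not abort the whole run.
    """
    def guarded(target):
        try:
            return target, ("ok", run_test(target))
        except Exception as exc:
            return target, ("error", str(exc))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(guarded, targets))


def fake_bst(target):
    """Placeholder for whatever starts the embedded test on one BMC (hypothetical)."""
    if target == "bmc-07":
        raise ConnectionError("unreachable")
    return {"failing_nets": []}


results = run_on_all([f"bmc-{i:02d}" for i in range(1, 11)], fake_bst)
bad = sorted(t for t, (status, _) in results.items() if status == "error")
```

Since each board carries its own JTAG master, the host only dispatches and collects results; total test time is then governed by the slowest board rather than by the number of boards.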

Doing so provides immediate structural test access to the high-speed interconnects between the CPUs on a dual-socket server. Because this is a point-to-point bus, 100% coverage of shorts and opens is possible. And given the differential nature of the bus, such defects are extremely difficult to detect using functional test: these buses are self-healing in nature, but disproportionately high bit-error rates and link re-trainings lead to slower server performance and intermittent CATERRs. BST is the only technology that can detect the root cause of these issues. For how one customer used BST to isolate failures on Intel QuickPath Interconnect on an older Haswell server design, see the case study at Structural Defects on Intel QuickPath Interconnect. It’s fascinating to see how some stray solder on a via can escape detection by conventional means.

[Figure: QPI defect]

Of course, total boundary-scan test coverage depends on the boundary-scan access that is available on the board. A key consideration is synchronizing the test vectors so that they encompass as many JTAG-compliant devices on the board as possible. This can be achieved by placing all 1149.1/1149.6-capable devices in a monolithic board chain that the embedded JTAG controller can access through a single TAP.
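As a rough illustration of what a monolithic chain means for the scan data, the sketch below computes the length and layout of a single scan across several daisy-chained devices. The device names and register lengths are hypothetical. The one fact relied on from IEEE 1149.1 is that a device placed in BYPASS contributes a single bit to the data-register path.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    ir_length: int    # instruction register length, in bits
    bsr_length: int   # boundary-scan register length, in bits


def ir_scan_length(chain):
    """Total bits in one instruction scan across the whole chain."""
    return sum(dev.ir_length for dev in chain)


def dr_layout(chain, under_test):
    """Describe one data scan, with the chain listed in TDI-to-TDO order.

    Devices named in under_test expose their boundary-scan register; all
    others sit in BYPASS (1 bit). Returns (total_bits, {name: (offset, length)}),
    where offset counts from the first bit to emerge on TDO.
    """
    layout = {}
    offset = 0
    for dev in reversed(chain):  # the device nearest TDO shifts out first
        length = dev.bsr_length if dev.name in under_test else 1
        layout[dev.name] = (offset, length)
        offset += length
    return offset, layout


# Hypothetical dual-socket board: two CPUs plus a CPLD on one chain.
chain = [Device("cpu0", 8, 1000), Device("cpu1", 8, 1000), Device("cpld", 10, 200)]
total_bits, layout = dr_layout(chain, under_test={"cpu0", "cpu1"})
# total_bits == 2001; layout["cpu1"] == (1, 1000); layout["cpld"] == (0, 1)
```

With both CPUs on the same chain, a single scan updates the drivers on one socket and captures the receivers on the other in the same Update/Capture cycle, which is what makes the vectors synchronized. Devices on separate, independently controlled chains would not offer that guarantee without extra coordination.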

For more information about the technology behind boundary-scan test, and details on the case study, see our eBook here: JTAG Diagnostics for Intel QPI Structural Defects (note: requires registration).