Structural Defects on Intel QuickPath Interconnect – Part 2

Suppose there’s a hidden short-circuit on an Intel QPI net under the CPU socket – what’s the customer going to see?

We saw in my previous blog, “Structural Defects on Intel QuickPath Interconnect,” that boundary-scan tests can easily detect structural defects, such as shorts and opens, on high-speed differential I/O like Intel® QuickPath Interconnect (QPI). Boundary scan is the preferred technology for this kind of testing because In-Circuit Test has no access to buses like QPI, PCI Express Gen 3, and SATA III; putting test pads down on these nets creates signal integrity issues. But what happens if the manufacturer does not apply boundary-scan tests to these high-speed nets, a defect escapes detection, and the system ships to the customer? ASSET performed some experiments to find out.

In one of the experiments, we shorted two QPI nets together on a server board:

[Figure: QPI DN14 to DP17 short]

Interestingly, the QPI port trained up normally, and the system seemed to behave properly. As we know, this kind of defect is often invisible to conventional functional tests, since differential I/O is self-healing in nature.

However, we do know that these defects will affect the overall system margins. To prove it, we ran ScanWorks HSIO on the board, and the results are summarized below:

[Figure: QPI margins on Green City]

The margins have collapsed on lanes 14 and 17, the two that were shorted. The composite margin of that QPI port is poor, but there is in fact sufficient margin for the system to initialize, albeit at a much lower level of performance. Looking more closely, this link has a much higher rate of correctable errors than normal. (Note that QPI has a BER threshold of 10^-14, much more stringent than that of PCI Express Gen 3, for example, which is 10^-12, because of the size of QPI’s flits and its CRC.) So a number of different things can happen, depending on the degree of bit errors induced by the defect:

  • High number of PHY layer re-inits
  • Many CRC errors, with accompanying data link layer re-transmissions
  • Data lane failovers
  • Intermittent kernel crashes with CATERR thrown

This system is defective. Although it may appear to be operational, it is compromised, and will eventually fail in the field.
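
To put the BER thresholds mentioned above in perspective, here is a minimal sketch that converts a bit error ratio into the average time between bit errors on a single lane. The per-lane line rates (6.4 GT/s for QPI, 8 GT/s for PCI Express Gen 3) are assumptions chosen for illustration and are not taken from the experiment; the function name is hypothetical as well.

```python
def mean_seconds_between_errors(ber: float, bits_per_second: float) -> float:
    """Average time between bit errors on one lane running exactly at the given BER."""
    return 1.0 / (ber * bits_per_second)


# Assumed per-lane line rates (illustrative, not measured on the board above).
QPI_BITS_PER_SECOND = 6.4e9      # 6.4 GT/s
PCIE3_BITS_PER_SECOND = 8.0e9    # 8 GT/s

qpi_seconds = mean_seconds_between_errors(1e-14, QPI_BITS_PER_SECOND)
pcie3_seconds = mean_seconds_between_errors(1e-12, PCIE3_BITS_PER_SECOND)

print(f"QPI lane at its BER threshold: one bit error every {qpi_seconds / 3600:.1f} hours")
print(f"PCIe Gen 3 lane at its BER threshold: one bit error every {pcie3_seconds:.0f} seconds")
```

Under these assumptions, a QPI lane sitting right at its threshold sees a bit error only about once every few hours, while a PCI Express Gen 3 lane at its threshold sees one roughly every two minutes. A defect that erodes the margin on even one or two lanes can therefore raise the error rate far above what the link was designed to tolerate.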

For more of the theory behind defects on high-speed serial I/O, check out our whitepaper: Detection and Diagnosis of Printed Circuit Boards Defects and Variances.

Alan Sguigna