# JTAG DIAGNOSTICS FOR INTEL® QPI STRUCTURAL DEFECTS




# EBOOK

## BY KENT ZETTERBERG





#### By Kent Zetterberg – Product Manager

Kent Zetterberg started his career in the automation industry, working with systems from ABB, Siemens and others. Following graduation from the University of Gävle with a Bachelor of Science degree in Computer Engineering, he worked 15 years in the telecom industry where he held various positions involving hardware test and debug. He joined Ericsson AB in Sweden in 1997 where he developed functional test programs for processor boards, and designed interface boards and test fixtures. At Ericsson he became an expert in boundary scan and eventually led the boundary-scan team. With ASSET Kent has held several positions in support, serving as a customer trainer and European support team leader. Currently he is the technical product manager for ScanWorks® boundary-scan test products.



### **Table of Contents**

- Executive Summary
- Diminishing Probe Access
- Interconnect Test Background
- Boundary-Scan-Based Interconnect Testing of QPI Links
- Detecting QPI Faults
- Customer example of QPI Fault Detection
- Conclusions
- Learn More

## **Table of Figures**

- Figure 1: Backdrilling
- Figure 2: Typical Interconnect Test
- Figure 3: Schematic picture of boundary-scan interconnect test of differential nets.
- Figure 4: Intel QPI lanes
- Figure 5: Two shorted QPI nets
- Figure 6: 3-D X-ray of the CPU BGA sockets
- Figure 7: Visual inspection of the PCB
- Figure 8: Cross-section of the BGA socket

© 2014 ASSET InterTech, Inc.

ASSET and ScanWorks are registered trademarks while the ScanWorks logo is a trademark of ASSET InterTech, Inc. All other trade and service marks are the properties of their respective owners.



#### **Executive Summary**

Boundary-scan tests can easily detect structural defects like shorts and opens on high-speed differential input/output (I/O) buses, including Intel® QuickPath Interconnect (QPI). Boundary scan has become the preferred technology for this kind of testing because in-circuit test (ICT) equipment relies on physical access points such as test pads to reach the signals on buses like QPI, PCI Express (PCIe) Gen 3, SATA III and others, and this access is rapidly disappearing from circuit boards. Placing test pads on high-speed I/O nets introduces signal integrity disturbances on the buses, likely degrading system performance and possibly causing failures. In the past, through-hole vias sometimes provided limited access for probe-based ICT fixtures, but these too can cause signal integrity issues on sensitive high-speed nets. As a result, the unused portions of through-hole vias are commonly removed by backdrilling. This further limits the physical access that is essential to ICT and makes boundary scan the technology of choice for structural defect detection.

This eBook introduces the topic of how boundary scan improves the quality of circuit boards during manufacturing by detecting structural faults on the QPI bus. In addition, it shows why detecting QPI structural faults during manufacturing is so critical for ensuring acceptable overall performance and avoiding system failures later when the product is in the hands of users.

QPI is a differential bus. It employs a dedicated forwarded clock lane for every 20 data lanes. It is capable of both half-width and quarter-width operations, similar to but not identical with PCIe. In addition, QPI supports a failover process for both data and clock lanes. So, for example, if a QPI lane fails, the quadrant where the lane is located will be marked as unavailable and the link will drop back to 10- or five-lane operations, whichever is appropriate. When a clock lane fails, a data lane can substitute for the failed clock lane and, in turn, the link will follow the failover procedure described above.

This self-repairing failover functionality may be beneficial in a real-world operating environment, insofar as it will allow the system to remain operational, albeit at a reduced level of performance, but it can have a negative effect on the quality of hardware shipped by a manufacturer. Functional tests typically will not detect QPI structural faults because they may be hidden by QPI's self-repairing capabilities. Often, poor-quality circuit boards will appear to function well enough and slip through functional testing, only registering symptoms like reduced performance, undefined behavior or lane drop-outs. Unfortunately, system crashes and hangs may often occur later when the system is being operated by users.

Boundary-scan tests based on the IEEE 1149.1 Boundary-Scan (JTAG) Standard can detect the shorts and opens on QPI that functional testers and system diagnostic routines miss. When the quality of the assembled circuit board is important, more thorough testing processes with boundary scan are required.

#### **Diminishing Probe Access**

Test pads, which traditionally provided access for test probes, are being removed from high-speed nets on printed circuit boards (PCBs) because these pads introduce signal integrity (SI) disturbances on high-speed buses and chip interconnects. Another probe access method, through-hole vias, also causes SI issues. As a result, the unused portions of the through-hole vias and pads are being removed by a technique called backdrilling (Figure 1). While improving SI, backdrilling also eliminates the possibility that a via might provide access for test methods that rely on probes, such as ICT.





Even when a probe has access to a signal, placing a probe on the access point will likely distort the signal that is being measured. External probes introduce inter-symbol interference (ISI), multi-path interference and multiple internal reflections into a bus because of the frequency responses of traces, pads, probes and any other externally introduced points of discontinuity.



Before describing how structural faults on high-speed I/O buses can be tested, some background information is necessary.

#### **Interconnect Test Background**

Since the early 1990s, boundary-scan-based (IEEE 1149.1 JTAG) interconnect testing has helped electronics manufacturers achieve better product quality and deliver products to market faster. The typical boundary-scan tool applies test patterns to the boundary-scan driver cells on a JTAG device on a circuit board. Based on the tool's awareness of how the nets on the board are connected, the tool expects certain patterns to appear in the boundary-scan receiver cells on the opposite side of the net from the driver cells (Figure 2). Wagner-based or walking zeros and ones algorithms typically keep the number of vectors to a minimum while providing fault coverage. Today's boundary-scan tools feed netlists and/or CAD files, Boundary Scan Description Language (BSDL) files, device models and other data into their built-in automatic test pattern generators (ATPG), which create interconnect test patterns for a design of any size in just a few seconds. These interconnect tests can typically run in a few seconds on the production floor. In fact, the effect of boundary-scan interconnect tests on the production time of a PCB is likely less than one percent.



Figure 2: Typical Interconnect Test
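The pattern-and-compare idea described above can be sketched in a few lines. This is a minimal illustration, not how any particular boundary-scan tool implements its ATPG: it assigns every net a unique binary code in the style of a counting sequence (skipping the all-zeros and all-ones codes), models a short as a wired-AND and an open as a receiver stuck at 1 (both are modeling assumptions), and reports the nets whose captured code differs from the driven one. The function names and net names are hypothetical.

```python
def counting_codes(nets):
    """Give each net a unique code, one bit per test vector.

    Codes run from 1 to N, and the width is chosen so that the
    all-zeros and all-ones codes are never used.
    """
    width = (len(nets) + 1).bit_length()
    return {net: [(index >> bit) & 1 for bit in range(width)]
            for index, net in enumerate(nets, start=1)}


def capture(codes, shorts=(), opens=()):
    """Simulate what the receiver cells capture for each net."""
    received = {net: list(code) for net, code in codes.items()}
    for group in shorts:
        # Assumption: shorted nets behave as a wired-AND.
        merged = [min(bits) for bits in zip(*(codes[net] for net in group))]
        for net in group:
            received[net] = list(merged)
    for net in opens:
        # Assumption: an open receiver input reads as logic 1.
        received[net] = [1] * len(codes[net])
    return received


def failing_nets(codes, received):
    """Return the nets whose captured code differs from the driven code."""
    return sorted(net for net in codes if received[net] != codes[net])


nets = ["A", "B", "C", "D"]  # hypothetical net names
codes = counting_codes(nets)
print(failing_nets(codes, capture(codes, shorts=[("A", "B")])))
```

With four nets the sketch needs only three vectors. Plain counting codes can alias, since a short may leave one of the shorted nets reading its expected code, which is one reason production tools use more elaborate pattern sets for diagnosis.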

The ever-increasing speeds and throughput performance of systems today have caused a shift from single-ended DC communications to higher-speed differential-pair communications. Figure 3 shows a simplified representation of differential nets between boundary-scan devices. (Note that potential termination networks are not taken into account in Figure 3.) A large number of vectors, usually in the form of a Wagner pattern, will be applied during an interconnect test, resulting in diagnostic information, which will include the pin or net locations of faults.



Figure 3: Schematic picture of boundary-scan interconnect test of differential nets.

One can conclude that boundary-scan ATPG tools can create vectors and detect faults just as easily on this type of differential net as they can on typical LVTTL communication interconnects.

#### **Boundary-Scan-Based Interconnect Testing of QPI Links**

QPI is made up of 21 differential pairs, of which 20 are used for data communication and one as a forwarded clock. The bus is differential but, unlike PCIe, not AC-coupled. Each differential pair is called a QPI lane (Figure 4).





Figure 4: Intel QPI lanes

Typically, each node in the Intel QPI link is designed with IEEE 1149.1 Boundary-Scan (JTAG) functionality, which can be used for interconnect test. BSDL files describe the boundary-scan implementation in a device and define which differential drivers are paired. CAD files describe how the links are connected. From this data, the boundary-scan tool creates, applies and diagnoses the test vectors that are run over the QPI links. Applying boundary-scan tests to QPI interconnects is quite simple.
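As a rough illustration of the connectivity data such a tool works from, the sketch below enumerates the nets of one QPI link direction: 20 data pairs plus one forwarded clock pair. The `_DP` suffix follows the net name QPI1_RX_4_DP that appears later in this eBook. The `_DN` suffix for the complementary leg, the `CLK` label and the function name are assumptions made only for illustration; real net names come from the board's CAD data.

```python
DATA_LANES = 20  # data pairs per QPI link direction, as stated in the text


def qpi_link_nets(port="QPI1", direction="RX"):
    """Return (positive, negative) net-name pairs for one link direction.

    Apart from the *_DP pattern, the naming here is hypothetical.
    """
    lanes = [str(i) for i in range(DATA_LANES)] + ["CLK"]
    return [(f"{port}_{direction}_{lane}_DP", f"{port}_{direction}_{lane}_DN")
            for lane in lanes]


pairs = qpi_link_nets()
print(len(pairs), "differential pairs,", 2 * len(pairs), "nets")
print(pairs[4])
```

Each pair in this list is one candidate for the interconnect test, so a single link direction contributes 21 pairs, or 42 individual nets, that have no ICT probe access.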

#### **Detecting QPI Faults**

In an operational system at a user location, a failure on a QPI lane will cause the quadrant where the failed lane is located to be marked as unavailable. The link will then drop back and operate with 10 lanes or five, as needed. If a clock lane fails, a data lane can substitute for the clock lane and the link width will drop back as described above.
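The width fallback can be pictured with a small model. This is a simplified sketch rather than the actual QPI link-layer procedure: it assumes that the 20 data lanes are grouped into four quadrants of five consecutive lanes, that a link with two or three healthy quadrants runs at half width, and that a link with one healthy quadrant runs at quarter width. The lane-to-quadrant mapping and the function name are hypothetical.

```python
LANES_PER_QUADRANT = 5
QUADRANTS = 4


def link_width_after_failures(failed_lanes):
    """Return the usable data-lane count in this simplified model."""
    for lane in failed_lanes:
        if not 0 <= lane < LANES_PER_QUADRANT * QUADRANTS:
            raise ValueError(f"lane {lane} is out of range")
    # Any failed lane marks its whole quadrant as unavailable.
    bad_quadrants = {lane // LANES_PER_QUADRANT for lane in failed_lanes}
    good = QUADRANTS - len(bad_quadrants)
    if good == QUADRANTS:
        return 20  # full width
    if good >= 2:
        return 10  # half width
    if good == 1:
        return 5   # quarter width
    return 0       # no usable quadrant left


print(link_width_after_failures([]))
print(link_width_after_failures([7]))
print(link_width_after_failures([0, 7, 12]))
```

In this model a single defective lane halves the link width while the link keeps running, which is why the defect can go unnoticed by a test that only checks that the link is up.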

The critical question is what the user experience of a QPI-based system will be when the manufacturer does not apply boundary-scan tests to QPI and defects escape detection. To illustrate this point, an experiment was conducted in which a short circuit was inserted onto an Intel QPI net under the CPU socket. Specifically, two QPI nets were shorted together on a server board (Figure 5).





Figure 5: Two shorted QPI nets

Interestingly, the QPI port trained up normally and the system seemed to behave properly. This shows that this kind of defect is often invisible to conventional functional tests, because the differential QPI bus is self-healing.

Next, a boundary-scan interconnect test was run on the board. The interconnect test showed a failing result, as expected, and the following message was displayed:

> "A short was detected between nodes A and B"

The boundary-scan test tool provided the exact location of a fault that would otherwise have slipped through functional testing in production and ended up in a user's hands, quite possibly with serious consequences.

#### **Customer example of QPI Fault Detection**

The following are actual results from boundary-scan testing on Intel QPI nets as compiled by a circuit board manufacturer. As mentioned previously, ICT and conventional functional tests do not provide structural test coverage on these QPI nets.

QPI buses are rated at 9.6 gigatransfers per second (GT/s) per lane on Haswell Xeon systems. This is an increase from 8 GT/s on Sandy Bridge Xeon and the speed is expected to increase in the future. At these speeds, signal integrity issues preclude the placement of ICT test pads on QPI nets. As a result, ICT has no access to provide any test coverage. As shown above, since QPI uses differential signaling, its receivers may be able to reconstruct the incoming data stream even in the presence of board-level structural defects. So, lanes with defects may initialize at the physical layer and train up, albeit at a degraded level of overall throughput. What happens next depends on the overall operating margins of the board and chip, but typically such systems are subject to reduced performance, undefined behaviors, lane drop-outs, and even system crashes and hangs, often in the hands of a user, unfortunately.

One manufacturer recently reported that it applied boundary-scan tests on an Intel Xeon-based server platform and immediately saw a 2.9 percent failure rate. This was perplexing, since the boards were booting fine and passing the manufacturer's functional tests, so several root cause analyses were performed, including a 3-D X-ray of the CPU BGA sockets. This is what was seen:



Figure 6: 3-D X-ray of the CPU BGA sockets

On the left in Figure 6, the yellow features represent the 'dog bone' via and land for node QPI1\_RX\_4\_DP. The land is at the bottom and the via is circled in red. The green feature circled in orange is a land for node GND. The 3-D X-ray picture on the right suggests that there may be a short between the GND land and the adjacent via for QPI1\_RX\_4\_DP.

When the processor's BGA socket was removed, a visual inspection showed the following:





Figure 7: Visual inspection of the PCB

Figure 7 shows that the via for QPI1\_RX\_4\_DP (circled in red) is covered by solder. This makes it easy for the via to be shorted against the balls at either of the two adjacent lands, which are circled in green. What is happening is depicted in Figure 8, a graphical representation of a cross-section of the BGA socket ball, lands and vias.



Figure 8: Cross-section of the BGA socket

As stated above, these defects on high-speed serial QPI buses will often escape detection by conventional functional test, because the ports will appear to operate normally by training successfully and transferring traffic. Even more sophisticated functional test algorithms, which report the contents of the QPI error counter registers, may not indicate a potential failure, depending on the bit error rate induced by the defect and the duration of the functional test. (The bit error count is a function of the bit error rate observed during the test and the duration of the test.)
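The arithmetic behind that parenthetical remark can be made concrete. The sketch below assumes that errors are independent, that each transfer carries one bit per lane (so a 9.6 GT/s lane moves 9.6e9 bits per second), and that the bit error rates and test durations are purely hypothetical. None of these numbers are measurements from the boards discussed here.

```python
import math

LANE_RATE = 9.6e9  # bits per second on one lane, assuming 1 bit per transfer


def expected_bit_errors(bit_error_rate, bits_per_second, seconds):
    """Expected error count = error rate x bits transferred."""
    return bit_error_rate * bits_per_second * seconds


def probability_of_no_errors(bit_error_rate, bits_per_second, seconds):
    """Chance that the error counter stays at zero (Poisson approximation)."""
    return math.exp(-expected_bit_errors(bit_error_rate, bits_per_second, seconds))


# Hypothetical error rates and test durations, for illustration only.
for ber in (1e-12, 1e-10):
    for seconds in (10, 600):
        errors = expected_bit_errors(ber, LANE_RATE, seconds)
        p_zero = probability_of_no_errors(ber, LANE_RATE, seconds)
        print(f"BER={ber:g} duration={seconds}s "
              f"expected errors={errors:.3g} P(no errors)={p_zero:.3g}")
```

In this model, a 10-second test of a lane with a hypothetical error rate of 1e-12 expects only about 0.1 bit errors, so the error counters would most likely still read zero when the test ends.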



#### Conclusions

The increasing speeds on PCB interconnects like Intel QPI and PCIe mean that, in most cases, probe-based testing of those high-speed I/O nets with methods such as ICT or flying probe is not possible because probe access cannot be provided on the circuit board. Test points and pads would introduce signal integrity distortions onto the high-speed serial buses.

The examples included in this eBook show that structural defects may slip through functional test even when the functional test is targeted at detecting structural defects rather than malfunctioning applications. Creating interconnect vectors with common boundary-scan tools is easy. Depending on the application and tool of choice, these vectors can be applied with Intel XDP-based probes through the Intel XDP debug interface or with custom boundary-scan (JTAG) hardware.

Circuit board manufacturers who have quality as a priority have made boundary scan a cornerstone of their test strategies.

#### Learn More

For more insight into defects on high-speed serial I/O, check out another of our eBooks, "Detection and Diagnosis of Printed Circuit Board Defects and Variances."



