In previous blogs we covered the kinds of defects that might
exist on high-speed serial I/O and their associated impacts on system
performance and stability. A similar analysis on DDR SDRAM yields some
For the purposes of this discussion let’s consider a
standard 240-pin DDR3 DIMM. A simplified high-level block diagram of the pinout
DDR3 memory on high-end systems differs from serial I/O in
many fundamental ways. It is a parallel bus, as opposed to serial. Error
detection and correction is via ECC (Error Correcting Code) memory at the
“physical layer”, as opposed to high-speed serial buses which relegate such
tasks to the data link layer (using cyclic redundancy checks – CRCs) and upper
portions of the protocol stack. And unlike serial buses, which usually use embedded
clocking, DDR3 assigns a separate differential strobe pair (the DQS signals),
which is not AC-coupled, to each nibble of data (the DQ signals). Nor are the
strobes continually running “clocks” as they are in serial I/O; rather, they
turn on as needed and act as source-synchronous clocks.
To understand the behavior of a DIMM when there is a defect
present, it is important to understand the process whereby the memory is first
initialized (also known as “trained”). A BIOS or boot loader will run the
minimal amount of code when a system is first booted in order to ensure that
the memory is basically functional. So in general the BIOS will sync up DQ and
DQS to optimize the system at the center timing and voltage point, and then it
will do a basic test of the DQ at location 0 within each rank. If there are
gross defects, the BIOS will either (a) disable the channel, or (b)
hang the boot process with a “memory failure” POST code.
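To make the “center timing point” idea concrete, here is a minimal sketch of how a training routine might pick a strobe delay: sweep the delay, record pass/fail at each step, and settle on the middle of the widest passing window. The function names, the step count, and the `passes_at` callback are hypothetical stand-ins for illustration; real BIOS training code is chipset-specific and considerably more involved.

```python
def longest_passing_window(results):
    """Return (start, end) indices of the longest run of True, or None."""
    best = None
    start = None
    # A trailing False closes any window that runs to the end of the sweep.
    for i, ok in enumerate(list(results) + [False]):
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            if best is None or (i - start) > (best[1] - best[0] + 1):
                best = (start, i - 1)
            start = None
    return best


def train_strobe_delay(passes_at, num_steps):
    """Sweep a (hypothetical) DQS delay setting and return the center of
    the widest passing window, or None if no setting passes at all."""
    results = [passes_at(step) for step in range(num_steps)]
    window = longest_passing_window(results)
    if window is None:
        return None  # gross defect: disable the channel or halt the boot
    lo, hi = window
    return (lo + hi) // 2


# Simulated lane: reads pass only for delay steps 10 through 20.
healthy_lane = lambda step: 10 <= step <= 20
print(train_strobe_delay(healthy_lane, 32))       # center of the window
print(train_strobe_delay(lambda step: False, 32)) # nothing passes
```

Note that a marginal lane with a narrow but non-empty passing window would still “train” in this sketch, which is exactly why subtler defects can slip past the boot-time check.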
If the system quietly disables the channel, this may pose a
problem to conventional functional memory testers because they may not be aware
of the issue. And hanging the boot process is also an issue because, as we all
know, when the screen (normal terminal output) goes dark, it takes a level of expertise
to figure out what has gone wrong.
But, of course, there are numerous defect scenarios whose
impact is far more nefarious (if there weren’t, I wouldn’t be writing this
blog). Let’s look at some of them.
A Short-Circuit on
Since the DQS signals are differential pairs, they are immune to common-mode
noise, and the receivers operate on the difference in
amplitude between the positive and negative nets of the pair. Even with one leg
stuck at GND, for example, there may be enough residual signal to ensure
that the memory timing requirements are met most of the time, so the BIOS will
allow the memory to train. A higher number of bit errors will occur, however.
Most of these will be detected and corrected invisibly by the ECC, but they
will still impact memory performance.
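To illustrate why ECC can hide such errors from software, here is a toy single-error-correct, double-error-detect (SECDED) code. It protects just 4 data bits with 4 check bits using a textbook extended Hamming code; ECC DIMMs apply the same principle across a 72-bit word carrying 64 data bits, and the exact code used by any given memory controller is not something this sketch claims to reproduce.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits).
    Index 0 holds overall parity; indices 1..7 are the Hamming(7,4) positions."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes total parity even
    return code


def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                 # XOR of the set-bit positions
    parity_odd = sum(code) % 2 == 1
    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        code[syndrome] ^= 1                 # single flip; 0 = the parity bit
        status = "corrected"
    else:
        status = "uncorrectable"            # two flips: detected, not fixed
    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return nibble, status


word = encode(0b1011)
word[5] ^= 1                                # one bit error on the bus
print(decode(word))                         # data recovered, flagged corrected
word[2] ^= 1                                # a second error in the same word
print(decode(word))                         # flagged uncorrectable
```

The caller gets the right data back after a single flip, so nothing looks wrong from the application’s point of view. The only trace is the “corrected” status, which on a real system corresponds to a correctable-error counter that someone has to go and read.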
Two DQ Shorted
At first glance it might seem that this kind of defect would
be easily detectable. But, examining this further, consider that the BIOS memory
training process is not reliably pattern-sensitive. In a perfect world, the net
signal of a shorted ‘1’ and ‘0’ is 0.5, midway between the two logic levels, but
the internal voltage biases of the receivers may in fact cause them to miss the
defect and read back what was written. So the memory may or may not train; and
if it does train, the system will subsequently fail under load.
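A small model shows how receiver bias can mask a DQ-to-DQ short for one data pattern and expose it for another. This is an idealized sketch: the short is assumed to average the two driven levels exactly, and the per-receiver thresholds (expressed as a fraction of full swing) are made-up values chosen to illustrate the effect, not measurements from any real device.

```python
def read_back(written, shorted_pair, thresholds):
    """Model a read on a bus where two DQ nets are shorted together.

    written      -- list of 0/1 values driven onto the DQ lines
    shorted_pair -- (a, b) indices of the two shorted lines
    thresholds   -- per-receiver switching threshold, as a fraction of swing
    """
    levels = [float(bit) for bit in written]
    a, b = shorted_pair
    # Idealized short: both nets settle at the average of the driven levels.
    levels[a] = levels[b] = (written[a] + written[b]) / 2
    return [1 if level > t else 0 for level, t in zip(levels, thresholds)]


# Hypothetical biases: receiver 1 trips slightly low, receiver 2 slightly high.
thresholds = [0.50, 0.45, 0.55, 0.50]
short = (1, 2)

for pattern in ([0, 0, 0, 0], [1, 1, 1, 1], [0, 1, 0, 1], [1, 0, 1, 0]):
    result = read_back(pattern, short, thresholds)
    verdict = "pass" if result == pattern else "FAIL"
    print(pattern, "->", result, verdict)
```

With these biases, three of the four patterns read back exactly as written. Only the pattern that drives the shorted pair in the unlucky polarity fails, so a training routine that never happens to write it would pass the lane.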
Other Process and
At lower DDR speeds, it was common practice to add test vias
to memory nets in order to perform In-Circuit Test (ICT) on DIMMs. This practice
is now mostly defunct because the excess metal causes signal integrity
problems, namely reflections and signal attenuation.
However, the same signal integrity issues can still arise
from process and random variances in PCB manufacturing. These
include, but are not limited to, the following:
Trace surface finish
Incompletely plated vias
Flaws introduced during the imaging process (pin holes,
For more information on this fascinating subject, see our eBook.