Timing Benchmarks for BMC-Assisted Embedded JTAG

Moving run-control (Intel In-Target Probe, or ITP) down into the service processor (the BMC) on an x86 design results in a screamingly fast implementation of at-scale debug. This article contains some timing benchmarks of our embedded solution versus the alternatives. The results are nothing short of astonishing: the fastest embedded variant runs the benchmark 15X faster than the slowest alternative.

OK, after that dramatic introduction, it’s probably unfair of me to make this a very long article. So, let’s look at the benchmarking comparison results first, and then delve into the precise “what” and “how” of the benchmark.

Benchmark test => repeated PCIe speed changes from Gen1 (2.5GT/s) to Gen3 (8GT/s) and back.
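As a rough illustration of how a loop time like this can be measured, here is a minimal Python sketch. It assumes that one loop is a Gen1 -> Gen3 -> Gen1 round trip. The `retrain_link` function is a hypothetical stand-in for whatever run-control call actually retrains the link; it is not part of CScripts or any ASSET or Intel API.

```python
import time
from statistics import mean

def retrain_link(target_gen):
    """Hypothetical stand-in for the run-control call that retrains the link."""
    # A real implementation would drive the target over JTAG; this stub only echoes the request.
    return target_gen

def time_loops(retrain, loops=100):
    """Return the wall-clock time of each Gen1 -> Gen3 -> Gen1 loop, in seconds."""
    samples = []
    for _ in range(loops):
        start = time.perf_counter()
        retrain(3)  # Gen1 (2.5GT/s) -> Gen3 (8GT/s)
        retrain(1)  # Gen3 -> back to Gen1
        samples.append(time.perf_counter() - start)
    return samples

samples = time_loops(retrain_link, loops=100)
print(f"mean loop time: {mean(samples):.6f} s over {len(samples)} loops")
```

Averaging over many loops smooths out one-off delays, which matters most for the solutions that cross a network on every call.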

SED => ScanWorks Embedded Diagnostics – ASSET’s equivalent to ASD, with run-control library and JTAG mastering function down on the BMC. This runs CScripts remotely.

ASD => At-Scale Debug – an alternative solution, with the JTAG mastering function down on the BMC, and the run-control library running on a remote host connected via Ethernet. This runs CScripts remotely.

SED OTD => ScanWorks Embedded Diagnostics On-Target Diagnostic – ASSET’s unique SED solution that allows run-control applications (such as Python CScripts converted to ‘C’) to run directly down on the BMC, with no host intervention.

ITP => In-Target Probe – a traditional PC-based debugger, with run-control and JTAG mastering on an external hardware probe, connected over USB to a host PC running the debugger application.

SourcePoint => ASSET’s version of Intel benchtop ITP, only faster and more user-friendly.

To see the differences in the above, the slide from the Facebook presentation at the Open Compute Summit is a good reference:

[Figure: ITP, ASD, and SED comparison]

I’ve added some arrows and a little text to the diagram to help contrast the solutions more clearly.

Now that the solutions are compared, let’s look at how long each one takes to perform one loop of a Gen1 <-> Gen3 PCIe retrain, ordered from slowest to fastest:


[Table: Loop Time (sec) by solution]

You can see that ASD is the slowest. This is as expected, given that only the JTAG master is running down on the BMC, and it has to transit back over Ethernet to the remote PC for (at least!) each and every call to the run-control library. There’s a lot of wait time and therefore latency in this implementation.

SED is second-slowest and shares some of ASD’s latency, since it also has to hop back to the remote host over Ethernet for library invocations. But some of the Python/ITP commands running on the host PC encapsulate multiple calls to the run-control library, so SED is faster than ASD: in this instance, about 2.5X faster.
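To make the latency argument concrete, here is a minimal sketch of a per-loop cost model: a fixed amount of on-target work plus one network round trip for every remote call. Every number in it is hypothetical and chosen only to show the shape of the effect; none of them come from the benchmark.

```python
def loop_time(round_trips, rtt_s, work_s):
    """Estimated loop time: fixed on-target work plus one round trip per remote call."""
    return work_s + round_trips * rtt_s

# All values below are hypothetical, not measured.
RTT_S = 0.001   # assumed Ethernet round-trip time per remote call
WORK_S = 0.030  # assumed time spent actually shifting JTAG data per loop

per_call = loop_time(round_trips=500, rtt_s=RTT_S, work_s=WORK_S)  # one trip per library call
batched = loop_time(round_trips=200, rtt_s=RTT_S, work_s=WORK_S)   # several calls per trip
on_target = loop_time(round_trips=0, rtt_s=RTT_S, work_s=WORK_S)   # no host in the loop

print(f"per-call: {per_call:.3f} s, batched: {batched:.3f} s, on-target: {on_target:.3f} s")
```

In a model like this, once round trips dominate, the loop time scales with the number of trips rather than with the work done on the target. That is why folding several library calls into one remote command helps, and why removing the host altogether helps most.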

Benchtop ITP is faster than both SED and ASD, as you might expect, given that the hardware probe is local to the target and has a commercial CPU and FPGA on board. It’s 3.3X faster than ASD, and 30% faster than SED.

SourcePoint is the fastest of the benchtop solutions (those that require a host PC). It’s about 2.7X faster than ITP, 3.5X faster than SED, and 8.8X faster than ASD.

Finally, the SED OTD is by far the fastest: 4.5X faster than benchtop ITP, 6X faster than SED, and 15X faster than ASD. This is particularly critical in this instance, because PCIe hardware validation and component qualification using this mechanism may need to run for many thousands of loops to detect device, firmware or board system marginalities.

And because the on-target diagnostic runs directly on the target, it can be run simultaneously and independently on as many platforms as you want: it’s truly “at-scale”.

And in fact, running at 28 loops per second allows this SED OTD to be used most effectively in a Built-In Self Test (BIST) or Power-On Self Test (POST) scenario.
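The quoted figures can be cross-checked with a little arithmetic. The sketch below starts from the 28 loops per second quoted for SED OTD and applies the speedup ratios from the text to get the loop times they imply. These are derived estimates, not the measured values, and the 10,000-loop count is only an illustrative stand-in for "many thousands of loops".

```python
OTD_LOOPS_PER_SEC = 28  # quoted in the article for SED OTD
otd = 1 / OTD_LOOPS_PER_SEC  # seconds per loop

# Speedups of SED OTD over each alternative, as quoted in the article.
otd_speedup = {"ASD": 15, "SED": 6, "ITP": 4.5}
implied = {name: otd * factor for name, factor in otd_speedup.items()}
implied["SourcePoint"] = implied["ITP"] / 2.7  # SourcePoint is quoted as 2.7X faster than ITP
implied["SED OTD"] = otd

LOOPS = 10_000  # illustrative loop count, not from the article
for name, seconds in sorted(implied.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} ~{seconds:.3f} s/loop, ~{seconds * LOOPS / 60:.1f} min per {LOOPS} loops")
```

The ratios hang together: 15 / 6 gives the 2.5X quoted for SED over ASD, 15 / 4.5 gives about 3.3X for ITP over ASD, and chaining through ITP puts SourcePoint at about 9X faster than ASD, in line with the quoted 8.8X once rounding is allowed for.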

Want to know more about this specific PCIe test application? Read my blog here: Embedded Run-Control for Power-On Self Test.

Want to know even more? Leave a comment, or register for our technical overview at ScanWorks Embedded Diagnostics. I’ll get back to you.

Alan Sguigna