"PCIe link training failure" with Dell PowerEdge FC640


altug.simsek


Posted: Wed Dec 30, 2020 10:46 am

[Attached image: error message during boot]
Decklink Duo2 Mini cards
Dell FX2s blade chassis with Dell PowerEdge FC640 modules
2x Decklink Duo2 cards per Dell PowerEdge FC640 compute module

Decklink cards, Dell FX2s chassis and PowerEdge FC640 compute modules are all brand new.

Node1 with 2 Decklinks works fine.
Node1 + Node2, each with 2 Decklinks, also work fine.
No errors during boot.

When Node3 is inserted, we get "PCIe Link Training Failures" on random Decklink cards across all nodes.

We have 10+ fully loaded systems (4 compute modules & 8 Decklink cards on a single FX2s), all working fine at another customer.

Screenshot attached.

Any ideas?

altug.simsek


Re: "PCIe link training failure" with Dell PowerEdge FC640

Posted: Fri Feb 05, 2021 11:34 am

Finally solved:

1. Disable UEFI boot and switch to BIOS boot mode as described in:
https://programmersought.com/article/74824283587/
2. Install all available Windows updates on Windows Server 2016.


The long technical story, for those interested:

1.
Dell did not engage with the issue, saying that Decklink cards are not officially supported.
Dell support even claimed that Blackmagic does not support Dell hardware, citing screenshots from the Blackmagic web site.
Dell support pointing to the Blackmagic web site as evidence. Weird!

2.
Misled by the customer and Dell support, we believed that the "PCIe link training failure" observed on Bus95 during boot was coming from the Decklink cards. It was not!

"PCI Express Standard Downstream Switch Port" was at Bus95 and when Decklink cards were installed, it was causing this problem. I think this is the PCIe Switch Module on the FX2s chassis.

When PXE devices were disabled and BIOS boot mode was activated, this error disappeared.

3.
With a plain vanilla WinSrv2016 installation after selecting BIOS boot mode, we started getting "WHEA UNCORRECTABLE ERROR" BSODs during shutdowns and restarts. Not during startups!

The OS booted normally and both Decklink cards were visible in Device Manager; everything looked normal. On reboot or shutdown, BSOD. The CMC logs contained lines complaining about the devices on Bus93 ("Express Root Port A - 2030") and Bus94 ("PCI Express Standard Downstream Switch Port").

This also affected other nodes in the chassis. When I rebooted Node3 and got a BSOD, Node1 also got a BSOD.

Without the Decklink cards, everything was normal!

4.
Installed WinSrv2012R2 x64 on two nodes and observed no BSODs with WinSrv2012R2.

5.
Compared the driver versions of the devices on Bus 93-94-95 against the fully functional systems at our other customer. The "PCI Express Upstream/Downstream Switch Port" driver installed for the devices on Bus94 and Bus95 was to blame! I think this is for the PCIe Switch Module on the FX2s chassis.

With a plain vanilla WinSrv2016 installation, driver version 10.0.14393.0 is installed. This driver causes the BSOD. Once all Windows updates are installed, the driver is upgraded to 10.0.14393.2938.

With a plain vanilla WinSrv2012R2 installation, driver version 6.3.9600.17238 is installed. After Windows updates, the driver is updated to 6.3.18939. Both work normally.
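
If you want to check a node against the WinSrv2016 driver versions above before plugging the cards in, a small comparison helper is enough. This is only a sketch: the function names are made up for illustration, you still have to read the installed version yourself (for example from the driver tab in Device Manager), and treating 10.0.14393.2938 as a minimum is an assumption, since only that version and 10.0.14393.0 were actually observed here. Versions are compared field by field as numbers, because a plain string comparison can order dotted versions incorrectly.

```python
# Hypothetical helper, not part of any Dell or Blackmagic tool.
# The threshold is the switch port driver version seen on the fully
# updated WinSrv2016 systems in this thread; treating it as a minimum
# is an assumption.
KNOWN_GOOD_WS2016 = "10.0.14393.2938"


def parse_version(text):
    """Turn a dotted version such as '10.0.14393.2938' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def is_at_least(installed, minimum=KNOWN_GOOD_WS2016):
    """Return True if the installed version is numerically >= the minimum."""
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    # The two WinSrv2016 versions mentioned in this thread.
    for version in ("10.0.14393.0", "10.0.14393.2938"):
        status = "ok" if is_at_least(version) else "run Windows Update first"
        print(version, "->", status)
```

With the two versions from this thread, the first is flagged as needing updates and the second passes.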

6.
If you are installing WinSrv2016 from standard sources and updating afterwards, remove the Decklink cards and do not plug them back in until all Windows updates have finished. This may apply not only to Decklink cards but to all PCIe card types.

7.
I did not try switching back to UEFI boot mode.
