Upstream CentOS on the HP Z8: a cautionary tale

Posted: Thu May 17, 2018 2:16 am
by Seth Goldin
Hi folks,

Just want to recount some difficulties I've been having with upstream CentOS Linux, DaVinci Resolve, and maybe FreeNAS, as a cautionary tale for others.

When I got a few of BMD's recommended HP Z8 workstations, I figured I'd forgo optical drives, because they seemed a bit antiquated. The plan was to have HP's pre-installed Windows 10 Pro on one M.2 SSD for Adobe CC, and to do a fresh installation of upstream CentOS on the other. My understanding was, from Peter's description, that since "well more than 50% of Hollywood films and a much higher percentage of TV programs use that exact config every day," it would be worth it to make sure to run Resolve on Linux. I had seen Puget's recommendation that with just a single GPU there wasn't really any advantage to CentOS--but because I'm a purist and was used to macOS's UNIX architecture, I figured that Linux would be a bit easier.

Some other specs in my Z8s:
  1. Single GTX 1080 Ti
  2. Intel® Xeon® Gold 6136 Processor (3 GHz, up to 3.7 GHz w/Turbo Boost, 24.75 MB cache, 2666MHz, 12 core)
Though I hadn't had experience with RHEL or CentOS, I didn't think it would be too difficult, since I've been using Ubuntu for years.

This has been a world of hurt.

I did write up some notes on how to successfully install DaVinci Resolve on upstream CentOS; however, there have been a bunch of bumps along the way.

I hope these warnings help others--don't make the same mistakes I made!
  1. There's no way, when using bootable installation media, to definitively know which is the blank M.2 SSD available for CentOS and which is the one with Windows. The CentOS installer just shows generic disks labeled "0" or "1," and HP couldn't tell me if those numbers definitively represent the physical positions in the slots on the board, or if they're set by the UEFI order. The workaround was to actually remove one of the M.2 SSDs and then just see if the machine could boot to Windows. The M.2 SSDs live on a little "expansion" board at the very top of the chassis--not directly on the motherboard. The board has to be pulled out and then a heatsink needs to be unhooked and pulled off. For three different workstations, it turned out that HP did in fact install Windows to the first drive, the one labeled "1" on the board, but if you don't take one of them out and make sure it's not Windows, you're flipping a coin.
  2. CentOS's implementation of the Nouveau driver for the boot drive is a bit spotty--you might actually need to install in "simple graphics mode" from the "rescue" option in GRUB, and then later use "sudo systemctl set-default graphical.target" once you get everything installed again.
  3. If you have the "development" packages installed, you'll get a version of gcc that's newer than what was actually used to compile the kernel for the distro. This means that if you try to use NVIDIA's .run file from their site to install the driver, you'll need to set an environment variable for the gcc version to actually match.
  4. I originally installed with NVIDIA's .run file after manually figuring out how to blacklist the Nouveau driver, generate a new initramfs, and switch to a different virtual terminal (VT) to stop gdm, but was pretty annoyed to find that the update from CentOS 7.4 to 7.5 prevented X from starting at boot. I had to do a clean re-installation and opted to just use ELRepo and kmod-nvidia--which supposedly will persist through future updates.
  5. The default "GNOME Desktop" group of packages doesn't include libpng12--and Resolve won't start without that.
  6. The collaborative workflow has some GUI problems--though collaboration can be enabled across different workstations, the chat doesn't work, or even show different users in the project, and there aren't the regular icons that show which users are in which bins, shots, or timelines. It's not clear why, but I'm troubleshooting in a BMD support ticket.
  7. For reasons I still don't understand, C300 Mk II XF-AVC Intra 4:2:2 10-bit footage, which renders fine on Windows and Mac, causes a "timeout, waiting for frame" error without fail, within 10 minutes. It doesn't matter if it's SMB3 or NFS4 in the fstab. At first I thought this was some sort of issue with my FreeNAS box--but I tried again on a completely fresh ZFS pool on different hardware, migrated the footage, and tried again, only to replicate the same issue. The shares aren't "dropping." The data is all still mounted. This is happening with two different brands of NIC--one was a 10GbE NIC from SmallTree, and one was from HP. There are no such problems on Windows, with either NIC.
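Going back to warning 3: a rough way to spot the gcc mismatch before running NVIDIA's .run file is sketched below. This is only a sketch, not something from my install notes. The function names kernel_gcc_version and compare_gcc are made up for illustration, and the parsing assumes the older /proc/version format that contains the text "gcc version X.Y.Z" (newer kernels word this differently).

```shell
#!/bin/sh
# Sketch: compare the gcc that built the running kernel with the gcc on PATH.
# Assumption: /proc/version contains "gcc version X.Y.Z" (older kernel format).
# Caveat: gcc 7 and later print only the major version for -dumpversion, so an
# exact string comparison can report a mismatch even when the compilers agree.

# Print the gcc version embedded in a /proc/version-style string (empty if none).
kernel_gcc_version() {
    printf '%s\n' "$1" | sed -n 's/.*gcc version \([0-9][0-9.]*\).*/\1/p'
}

# Print "match" if both versions are non-empty and identical, else "mismatch".
compare_gcc() {
    if [ -n "$1" ] && [ "$1" = "$2" ]; then
        echo match
    else
        echo mismatch
    fi
}

# Only run the live check when both the kernel string and gcc are available.
if [ -r /proc/version ] && command -v gcc >/dev/null 2>&1; then
    kernel_gcc=$(kernel_gcc_version "$(cat /proc/version)")
    system_gcc=$(gcc -dumpversion)
    echo "kernel built with gcc: ${kernel_gcc:-unknown}"
    echo "gcc on PATH:           $system_gcc"
    compare_gcc "$kernel_gcc" "$system_gcc"
fi
```

If this prints "mismatch," that's the situation where the .run installer is likely to complain, and where the environment variable mentioned in warning 3 comes into play.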
At this point, Windows is performing perfectly--with SMB3 providing wonderful, reliable connectivity, reading and writing any supported codec just fine, and with correct GUI elements for collaborative workflow. My next step might be a $20 experiment to actually get an external DVD writer/reader, burn the official BMD iso, do clean installations of that, and see if the official build actually resolves all these lingering issues. Maybe there are undocumented modifications to upstream CentOS that are crucial for Resolve to run properly.

As an aside--does anyone know the recommended way to mount the same ZFS datasets on Mac, Windows, and Linux? Is it really not viable to make SMB3 shares with ACLs for the Mac and Windows clients and have an NFS share on the same dataset for the Linux clients? If I try again with the official build, I'd love to know what the best practices are. My understanding is that ZFS does have some file locking--you don't need to rely merely on the sharing protocol.