Regarding using GPUs in PCIe x8 versus x16 slots, here are some numbers for interest's sake for those who like reading my essays.
Please check my math and numbers as I quickly threw this post online.
The PCIe 3.0 per-lane specification is 8 gigatransfers per second (GT/s), which, after 128b/130b encoding overhead, equates to a usable data transfer rate of roughly 985 MB per second per lane.
A PCIe 3.0 slot running with 8 Lanes (x8) would therefore be able to transfer 8 x 985 = 7880 MB per second (7.88 GB/s).
A PCIe 3.0 slot running with 16 Lanes (x16) would therefore be able to transfer 16 x 985 = 15760 MB per second (15.76 GB/s).
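As a quick sanity check on the per-lane arithmetic, here is a minimal Python sketch (the 985 MB/s per-lane figure is the PCIe 3.0 number used above):

```python
# PCIe 3.0: 8 GT/s per lane; after 128b/130b encoding, ~985 MB/s usable per lane
PER_LANE_MBPS = 985

def slot_bandwidth_mbps(lanes):
    """Approximate usable PCIe 3.0 bandwidth for a slot with the given lane count."""
    return lanes * PER_LANE_MBPS

print(slot_bandwidth_mbps(8))   # x8  -> 7880 MB/s (7.88 GB/s)
print(slot_bandwidth_mbps(16))  # x16 -> 15760 MB/s (15.76 GB/s)
```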
The NVIDIA GTX 980 is rated at 224 GB/s memory bandwidth.
The AMD R9 290 is rated at 320 GB/s memory bandwidth.
Both are substantially faster than PCIe 3.0 bandwidth.
A single 1920x1080 128-bit RGBA frame data buffer is ((2,073,600 * 4) * 4) = 33,177,600 bytes (~32MB).
Note that this example is assuming a 32-bit floating point value for each color component per pixel.
I am not privy to how BMD stores the raw uncompressed frame data in memory.
At 30 frames per second that is (33,177,600 * 30) = 995,328,000 bytes (~995MB) per second.
And each frame is transferred twice over the PCIe bus: first from main memory to the GPU, then, after the Compute kernel processes it, back from the GPU to main memory. So we multiply the data transfer by 2.
So that is (995,328,000 * 2) = 1,990,656,000 bytes (~2GB) per second.
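The frame buffer arithmetic above can be checked with a few lines of Python (assuming, as noted, 4 color components at 32-bit float per pixel):

```python
# 1920x1080 frame, RGBA, 32-bit float per component = 16 bytes per pixel
WIDTH, HEIGHT = 1920, 1080
COMPONENTS = 4            # R, G, B, A
BYTES_PER_COMPONENT = 4   # 32-bit float

frame_bytes = WIDTH * HEIGHT * COMPONENTS * BYTES_PER_COMPONENT
per_second_bytes = frame_bytes * 30   # 30 frames per second
bus_bytes = per_second_bytes * 2      # each frame crosses the bus twice

print(frame_bytes)       # 33177600   (~32 MB per frame)
print(per_second_bytes)  # 995328000  (~995 MB per second)
print(bus_bytes)         # 1990656000 (~2 GB per second over the bus)
```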
Since 8 Lanes can transfer 7.88GB per second, and the ballpark data transfer as proposed above for HD 128-bit 30fps is ~2GB per second, it is easy to see that a PCIe 3.0 x8 slot can move that amount of data without breaking a sweat.
However, it isn't totally that simple (few things in life are).
Since x16 has twice the total bandwidth of x8, it also has twice the burst bandwidth if we measured the possible data transfer during a specific time-frame in milliseconds.
"Burst", with respect to data transfer, is a high-bandwidth transfer over a short period of time.
This is relevant because the frame data compute process typically occurs in five stages:
1. CPU preparation of frame data (pre-processing, initiating transfer setup, etc.).
2. Frame data bus transfer from main memory to GPU. <- PCIe performance related.
3. Compute kernel processing of frame data on GPU cores.
4. Frame data bus transfer from GPU to main memory. <- PCIe performance related.
5. CPU usage of frame data (post-processing, etc.).
Note that the time spent in each stage is not equal, i.e. not 20% per stage.
The number of milliseconds spent in bus transfer will be reasonably constant for the same frame size, while the number of milliseconds spent in Compute will vary with the assigned effects, etc.
What I am trying to show is that the frame data transfer between main memory and GPU is not a steady, continual process where the data is metered out consistently and evenly over the span of every second.
Instead, the data transfer typically occurs as two burst transfers per frame: CPU to GPU and GPU to CPU.
This matters because, as the burst transfer speed of the PCIe bus decreases, any "jitter" or "lag" in the frame compute cycle increases when measured against a constant metric such as frames-per-second.
Conversely, as the bus burst speed increases, any "jitter" or "lag" decreases.
If we measure the bus transfer portion as a specific number of milliseconds, and compare it to the number of milliseconds between frames at 30 frames-per-second, then whenever a bus transfer pushes a frame's processing beyond its per-frame time, the result is a frames-per-second flutter.
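To illustrate the flutter effect, here is a small hypothetical simulation. The compute and bus times are invented purely for illustration; only the ~33.33 ms frame budget comes from the 30 fps figure above:

```python
import random

FRAME_BUDGET_MS = 1000 / 30   # ~33.33 ms between frames at 30 fps

random.seed(1)                # fixed seed so the run is repeatable
overruns = 0
for _ in range(300):
    compute_ms = 27.0                    # hypothetical CPU + GPU time per frame
    bus_ms = random.uniform(4.0, 8.0)    # hypothetical bus time, both directions
    if compute_ms + bus_ms > FRAME_BUDGET_MS:
        overruns += 1                    # frame missed its slot -> visible flutter

print(overruns, "of 300 frames overran their per-frame time")
```

With a slower (x8) bus the `bus_ms` range shifts upward and more frames overrun; with a faster (x16) bus it shifts downward and fewer do.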
So, a person may not see any really distinct and noticeable visual difference between PCIe slots running at x8 or x16. But as they increase the video frame rate, which increases the number of data buffer transfers per second, and/or increase the video frame resolution, which increases the size of each data buffer transferred, jitter will usually increase, which is seen as a fluttering in frames-per-second throughput.
Having a computer system where every GPU is in an x16 slot simply grants you the performance overhead for a more stable and consistent frames-per-second data transfer, plus the headroom to allow future increases in frame rate or frame resolution.
So regarding x8 versus x16: if the amount of time spent in bus transfers is halved, that can reduce any apparent frames-per-second jitter and fluttering, and it also leaves more time available for compute functions.
Regarding benchmarking x8 versus x16 for Compute/CUDA use, I have not found any good comparative tests online that are relevant for video frame editing or bi-directional data buffer cases.
There are numerous articles on x8-vs-x16 related to video game rendering, but this is not relevant for video editing Compute/CUDA, since video game rendering is mainly uni-directional.
Even the CUDA-Z tool AFAIK does not test bi-directional performance, and is limited to integer, floating point, and bandwidth tests only.
I could write a sample Compute application to test this, to some degree, but it would not be entirely relevant to the performance of other software such as Resolve, since I do not know how they are managing memory or the code in their compute kernels.
FYI: I also write 3D software tools for video game developers and VFX use, so I deal with CPU-GPU data transfer requirements all of the time.
Regarding any "bottleneck" with Resolve and GPUs.
When processing frames, especially if there is a lot of math processing (filters, denoise, blur, etc.), probably 75%-90% of the time is spent on the CPU and GPU, and 10%-25% on bus transfers.
So a faster CPU and a faster GPU or multiple GPUs with more Compute Cores should make a larger difference than whether the video card(s) are in an x8 or x16 slot.
*edit*
Here is a quick explanation using visual ASCII art.
This is for example only and any values are not meant to be taken literally.
To achieve 30 frames per second, each frame must be fully processed and sent to the final output display within approximately 33 milliseconds (1000 / 30 = 33.333...).
This example is a simplistic view of what may occur within one frame's time duration.
The values chosen are strictly for example purposes and may not reflect actual circumstances.
c = CPU pre-processing of data
b = bus transfer of frame data from main memory to gpu
g = GPU processing of data
B = bus transfer of frame data from gpu to main memory
C = CPU post-processing of data
i = idle time, occurs when the frame process time is faster than 33 milliseconds, assuming framerate syncing is enabled
This is an example timing of a frame process cycle when the GPU is in an x16 slot.
- Code: Select all
frame |<-- 33 milliseconds per frame -->| frame
-----------------------------------------------
......|ccccccbbgggggggggggggggBBCCCCCCii|......
-----------------------------------------------
Now plug the GPU into an x8 slot, and assume the bus transfer time doubles, since x8 has half the bandwidth of the x16 slot. Note that the bus times b and B are twice as long here.
- Code: Select all
frame |<-- 33 milliseconds per frame -->| frame
-----------------------------------------------
......|ccccccbbbbgggggggggggggggBBBBCCCCCC|....
-----------------------------------------------
We can see that the frame process cycle requires more than 33 milliseconds to complete.
If we were imposing a sync timing lock for 30 fps, the frame misses its 33 ms slot and must wait for the next one, so the rendered framerate for this frame would drop in half, from 30 to 15 frames per second.
If the sync is free-running, the framerate would simply be less than 30 (35 ms per frame, or 28.57 fps, in this example).
If the frame process cycle time varies by a few milliseconds for each frame, which it typically will, we will see fluctuating playback framerates.
From this it is also easy to see where increasing the CPU performance or the GPU performance will decrease their time duration in the frame processing cycle, and also improve framerate.
And it is easy to see where the overall performance of the software is reliant on multiple components in the computer system.
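The two timelines above can also be expressed numerically. The per-stage millisecond values below are hypothetical, chosen only to roughly match the ASCII diagrams:

```python
FRAME_BUDGET_MS = 1000 / 30  # ~33.33 ms per frame at 30 fps

def frame_cycle_ms(cpu_pre, bus_to_gpu, gpu, bus_to_cpu, cpu_post):
    """Total time for one frame process cycle, in milliseconds."""
    return cpu_pre + bus_to_gpu + gpu + bus_to_cpu + cpu_post

x16_ms = frame_cycle_ms(6, 2, 15, 2, 6)  # x16 slot: 31 ms, fits the budget
x8_ms = frame_cycle_ms(6, 4, 15, 4, 6)   # x8 slot: bus times doubled, 35 ms

print(x16_ms <= FRAME_BUDGET_MS)   # True  -> some idle time remains
print(x8_ms <= FRAME_BUDGET_MS)    # False -> frame overruns its slot
print(round(1000 / x8_ms, 2))      # 28.57 fps if free-running
```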
David R. Green - Demenzun Media Inc. - Author Composer Filmmaker Programmer