GPUDirect for Video

Ask software engineering and SDK questions for developers working on Mac OS X, Windows or Linux.

SimonThompson

  • Posts: 23
  • Joined: Fri Mar 01, 2013 11:26 am
  • Location: London, United Kingdom

GPUDirect for Video

Posted: Thu May 21, 2015 1:56 pm

Hi,

I'm trying to find any information on using GPUDirect for Video with a Blackmagic card.

I'd like to take an HD-SDI video stream, apply a CUDA algorithm that modifies the frame, and output the result from the CUDA card's DVI port.

Has anyone done something similar? Has anyone used GPUDirect for Video?

Simon
Simon Thompson

Broadcast R&D Engineer
London

Matt Jefferson

Blackmagic Design

  • Posts: 130
  • Joined: Fri Aug 31, 2012 12:50 am

Re: GPUDirect for Video

Posted: Tue May 26, 2015 6:04 pm

The information on GPUDirect for Video with Blackmagic Design capture and playback products is in our Desktop Video SDK documentation. You can register and download the SDK and manual from our support page: select capture and playback products, then scroll down to the lower-left area to see the available downloads.

The Desktop Video (DeckLink) SDK supports capture and playback for all UltraStudio, Intensity and DeckLink products. GPUDirect is supported on Windows 7/8 64-bit and Linux 64-bit using our SDK, the current BMD driver (10.4) and a current NVIDIA Quadro card (4xxx and higher, including the 5xxx and 6xxx models).

The function is supported via OpenGL; at this time there is no support for a CUDA transfer between the GPU and the video input/output device. Blackmagic Design does provide a sample application, with source you can modify, in the SDK download under the OS folders and samples. The sample is called "LoopThroughWithOpenGLCompositing". If you are using DirectX 11, please let me know and we can determine how to support your needs.
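
For orientation, the unaccelerated capture-to-OpenGL path that the sample speeds up with DVP transfers looks roughly like this (a simplified sketch only, not the sample code; it assumes capture was enabled as 8-bit BGRA, and the IUnknown reference counting is stubbed out):

// Sketch: copy each captured DeckLink frame into an OpenGL texture.
// The GPUDirect/DVP path in the sample replaces the glTexSubImage2D upload
// with a direct GPU transfer; this is only the plain OpenGL version.
#include <GL/gl.h>
#include "DeckLinkAPI.h"

class CaptureToTexture : public IDeckLinkInputCallback
{
public:
    CaptureToTexture(GLuint texture, long width, long height)
        : mTexture(texture), mWidth(width), mHeight(height) {}

    virtual HRESULT VideoInputFrameArrived(IDeckLinkVideoInputFrame* frame,
                                           IDeckLinkAudioInputPacket*)
    {
        if (!frame || (frame->GetFlags() & bmdFrameHasNoInputSource))
            return S_OK;

        void* pixels = NULL;
        frame->GetBytes(&pixels);

        // Assumes EnableVideoInput() was called with bmdFormat8BitBGRA so the
        // frame bytes can be uploaded directly as a BGRA texture.
        glBindTexture(GL_TEXTURE_2D, mTexture);
        glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, mWidth, mHeight,
                        GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, pixels);
        return S_OK;
    }

    virtual HRESULT VideoInputFormatChanged(BMDVideoInputFormatChangedEvents,
                                            IDeckLinkDisplayMode*,
                                            BMDDetectedVideoInputFormatFlags)
    {
        return S_OK;
    }

    // Minimal IUnknown stubs (no real reference counting).
    virtual HRESULT QueryInterface(REFIID, LPVOID*) { return E_NOINTERFACE; }
    virtual ULONG AddRef() { return 1; }
    virtual ULONG Release() { return 1; }

private:
    GLuint mTexture;
    long   mWidth;
    long   mHeight;
};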

Thank you.
Matt Jefferson

Chad Capeland

  • Posts: 3025
  • Joined: Mon Nov 10, 2014 9:40 pm

Re: GPUDirect for Video

Posted: Wed May 27, 2015 3:00 pm

Just to clarify, GPUDirect works with Quadros under Linux? I was under the impression it was Tesla only, but the OpenGL support vs CUDA would indicate otherwise.
Chad Capeland
Indicated, LLC
www.floweffects.com

Matt Jefferson

Blackmagic Design

  • Posts: 130
  • Joined: Fri Aug 31, 2012 12:50 am

Re: GPUDirect for Video

Posted: Wed May 27, 2015 3:38 pm

Hi Chad -
Per our SDK manual and also our discussions with NVIDIA: yes. Below is from page 14 of our SDK manual.

NVIDIA GPUDirect support
NVIDIA GPUDirect is supported on Windows 7 and Linux for x86 and x64 architectures where those platforms are also supported by NVIDIA


Those platforms include the Quadro 4xxx cards and higher. That also includes Tesla cards that can process OpenGL (the correct generation with a DVI out port), but that use case is more limited. We are looking into a future CUDA GPUDirect processing sample but do not have a timeframe currently.

Chad Capeland

  • Posts: 3025
  • Joined: Mon Nov 10, 2014 9:40 pm

Re: GPUDirect for Video

Posted: Wed May 27, 2015 5:28 pm

Oh great, I had outdated information then; I thought a Linux version was off the table. Thank you.
Chad Capeland
Indicated, LLC
www.floweffects.com

Jeff Rayfield

  • Posts: 2
  • Joined: Tue Sep 03, 2013 11:43 am

Re: GPUDirect for Video

Posted: Wed Jun 03, 2015 8:27 pm

Matt Jefferson wrote:
The Desktop Video (DeckLink) SDK supports capture and playback for all UltraStudio, Intensity and DeckLink products. GPUDirect is supported on Windows 7/8 64-bit and Linux 64-bit using our SDK, the current BMD driver (10.4) and a current NVIDIA Quadro card (4xxx and higher, including the 5xxx and 6xxx models).

The function is supported via OpenGL; at this time there is no support for a CUDA transfer between the GPU and the video input/output device. Blackmagic Design does provide a sample application, with source you can modify, in the SDK download under the OS folders and samples. The sample is called "LoopThroughWithOpenGLCompositing". If you are using DirectX 11, please let me know and we can determine how to support your needs.

Thank you.
Matt Jefferson


Matt,

I'm trying to make the LoopThroughWithOpenGLCompositing sample work on Linux with a Quadro K4200 card. The BMD SDK docs indicate "GPUDirect support requires the use of the DVP library supplied by NVIDIA."

It's not clear where to find the DVP library. Is it part of the CUDA tools, or possibly NVIDIA's OpenGL toolkit? Can you tell me where to find it?

My actual application simply needs to access the frame buffer of the graphics card and output it directly to a DeckLink SDI output. Let me know if this doesn't seem doable with GPUDirect.

Thanks!

Matt Jefferson

Blackmagic Design

  • Posts: 130
  • Joined: Fri Aug 31, 2012 12:50 am

Re: GPUDirect for Video

Posted: Wed Jun 03, 2015 11:27 pm

The necessary file(s) come from NVIDIA and are integrated into our SDK, so you don't need any CUDA or OpenGL toolkits for DVP. They are included in our SDK package. Please register and download it from the support page; it will place the necessary DVP files on your system to allow you to run the loop-through sample.

It is compatible with Windows and Linux, and you will need the current Quadro graphics driver from NVIDIA. Download the latest driver and SDK from the Blackmagic Design support area of our website and install them to run the sample from the Linux folder.

To answer your question: it should be doable, depending on your application. You can modify our sample to your own needs, but we don't manage the graphics side / the offscreen buffer (PBO). We take your OpenGL output and use the DVP functions to transfer the contents of the OpenGL graphics buffer to the DeckLink card for output.
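
As a rough illustration of what that transfer replaces, the unaccelerated equivalent would be a CPU readback like this (a sketch only, assuming EnableVideoOutput() has already been called for the matching mode; the DVP path avoids the glReadPixels round trip through system memory):

// Sketch: read back the rendered OpenGL frame and hand it to the DeckLink
// output. The DVP/GPUDirect path avoids this glReadPixels copy.
#include <GL/gl.h>
#include "DeckLinkAPI.h"

bool OutputRenderedFrame(IDeckLinkOutput* output, long width, long height)
{
    IDeckLinkMutableVideoFrame* frame = NULL;
    if (output->CreateVideoFrame(width, height, width * 4, bmdFormat8BitBGRA,
                                 bmdFrameFlagFlipVertical, &frame) != S_OK)
        return false;

    void* pixels = NULL;
    frame->GetBytes(&pixels);

    // Copy the bottom-left-origin OpenGL framebuffer into the frame buffer;
    // bmdFrameFlagFlipVertical lets the hardware handle the vertical flip.
    glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, pixels);

    // Synchronous display for simplicity; the sample schedules frames instead.
    HRESULT result = output->DisplayVideoFrameSync(frame);
    frame->Release();
    return result == S_OK;
}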

Matt

Randy Spruyt

  • Posts: 4
  • Joined: Fri Oct 02, 2015 6:23 pm

Re: GPUDirect for Video

Posted: Fri Oct 02, 2015 7:04 pm

I'm interested in getting this working as well. I have a DeckLink 4K Extreme card with an NVIDIA Quadro M5000 in an HP Z820.

1.) When I run LoopThroughWithOpenGLCompositing.exe from the SDK, I get an "Expected both Input and Output DeckLink devices" error on initialization. I have a 1080p signal hooked up to the input and a 1080p monitor hooked up to the output (the loop-through port on the DeckLink card). Looking at this in a debugger:

In OpenGlComposite.cpp, function bool OpenGLComposite::InitDeckLink():

...
while (pDLIterator->Next(&pDL) == S_OK)
{
    // Use first board found as capture device, second board will be playout device
    if (! mDLInput)
    {
        if (pDL->QueryInterface(IID_IDeckLinkInput, (void**)&mDLInput) != S_OK)
            goto error;
    }
    else if (! mDLOutput)
    {
        if (pDL->QueryInterface(IID_IDeckLinkOutput, (void**)&mDLOutput) != S_OK)
            goto error;
    }
}
...

only iterates through once, finding an input device but no output device. The monitor plugged into the HDMI loop-through output port on the DeckLink card shows an image when running LoopThroughWithDX11Compositing.exe.


2.) GPUDirect should be shown in a standalone example, not coupled with the loop-through. A Quadro card should be all I need.

3.) You mentioned this works in Linux. Is there any performance difference between the two operating systems? I don't see any specific references in the Linux code to a DVP equivalent, or a dependency on a shared object/library. Is GPUDirect baked into the NVIDIA driver on Linux?

4.) What is the glass-to-glass frame delay of this input card with UHD60?

Randy Spruyt

  • Posts: 4
  • Joined: Fri Oct 02, 2015 6:23 pm

Re: GPUDirect for Video

Posted: Mon Oct 05, 2015 5:29 pm

Following up on my last post, it looks like this may just be an oversight in the SDK sample. The same iterator item queries successfully as both the input and the output, i.e. if we use

while (pDLIterator->Next(&pDL) == S_OK)
{
    // see if we can find an input and an output
    if (!mDLInput)
    {
        if (pDL->QueryInterface(IID_IDeckLinkInput, (void**)&mDLInput) != S_OK)
        {
            mDLInput = NULL;
        }
    }
    if (!mDLOutput)
    {
        if (pDL->QueryInterface(IID_IDeckLinkOutput, (void**)&mDLOutput) != S_OK)
        {
            mDLOutput = NULL;
        }
    }
}

if (!mDLOutput || !mDLInput)
{
    ... // error
}

we get both an input and an output from the single iterator object. The issue is that we never see the input show up, just a 3D "X" image floating around. In the

HRESULT CaptureDelegate::VideoInputFrameArrived(IDeckLinkVideoInputFrame* inputFrame, IDeckLinkAudioInputPacket* /*audioPacket*/)

callback

bool hasNoInputSource = (inputFrame->GetFlags() & bmdFrameHasNoInputSource) == bmdFrameHasNoInputSource;

evaluates to true. The display mode is:

BMDDisplayMode displayMode = bmdModeHD1080p6000; // mode to use for capture and playout

which matches the input. In the CapturePreview application, we see the input just fine at 1080p60.
Thoughts?

Nicholas Gill

Blackmagic Design

  • Posts: 169
  • Joined: Mon May 04, 2015 10:28 pm

Re: GPUDirect for Video

Posted: Tue Oct 06, 2015 2:17 am

Hi Randy,

As you have already identified, the LoopThroughWithOpenGLCompositing sample has not yet been updated to use one full duplex device for capture and playback, and presumes the presence of two simplex devices.

For an example of how this could be done, please see the DX11Composite::InitDeckLink() function in the LoopThroughWithDX11Compositing sample for a demonstration of how the devices may be enumerated while differentiating between simplex and duplex devices.

The CapturePreview sample in 10.5 uses the Input format detection feature to determine the input video mode and pixel format, while the LoopThroughWithOpenGLCompositing sample does not.

I suspect that your input source is RGB, and the CapturePreview sample is switching to RGB capture, while the LoopThroughWithOpenGLCompositing sample is hardcoded to bmdFormat8BitYUV.

Note that if your input source is RGB, it is not possible to simply change the pixel format passed to EnableVideoInput to RGB, as the rest of the sample assumes that it is processing YUV frame data.

It would be more straightforward to test with a known YUV source first, before attempting the modifications required to composite RGB pixel data.
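
For reference, the format-detection path looks roughly like this (a simplified sketch rather than the CapturePreview code itself; it assumes the capture delegate keeps a reference to the input as mDLInput, and the pixel-format choice is simplified):

// Sketch: using input format detection so capture follows the source.
// Assumes capture was enabled with the detection flag, e.g.
//   mDLInput->EnableVideoInput(bmdModeHD1080p6000, bmdFormat8BitYUV,
//                              bmdVideoInputEnableFormatDetection);
HRESULT CaptureDelegate::VideoInputFormatChanged(
        BMDVideoInputFormatChangedEvents /*events*/,
        IDeckLinkDisplayMode* newMode,
        BMDDetectedVideoInputFormatFlags detectedFlags)
{
    // Choose a pixel format matching what was detected on the wire.
    BMDPixelFormat pixelFormat = (detectedFlags & bmdDetectedVideoInputRGB444)
                                     ? bmdFormat8BitBGRA
                                     : bmdFormat8BitYUV;

    // Restart the input streams with the detected display mode and format.
    mDLInput->PauseStreams();
    mDLInput->EnableVideoInput(newMode->GetDisplayMode(), pixelFormat,
                               bmdVideoInputEnableFormatDetection);
    mDLInput->FlushStreams();
    mDLInput->StartStreams();
    return S_OK;
}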

Cheers,

-nick

Randy Spruyt

  • Posts: 4
  • Joined: Fri Oct 02, 2015 6:23 pm

Re: GPUDirect for Video

Posted: Wed Oct 07, 2015 7:19 pm

Hi Nicholas,

We are close to having an RGB example working out of the LoopThroughWithOpenGLCompositing example. On a source PC we are outputting a pure red image with RGB values (255, 0, 0). I am using an HDMI (1080p60) input. We have video frames coming through in the LoopThroughWithOpenGLCompositing example with bmdFormat8BitBGRA. i.e.

bool OpenGLComposite::InitDeckLink()
{
    ...
    if (mDLInput->EnableVideoInput(displayMode, bmdFormat8BitBGRA, bmdVideoInputFlagDefault) != S_OK)
    ...
    if (mDLOutput->CreateVideoFrame(mFrameWidth, mFrameHeight, mFrameWidth * 4, bmdFormat8BitBGRA, bmdFrameFlagFlipVertical, &outputFrame) != S_OK)
    ...
}


When a frame comes through, if we look at the contents of videoPixels in:

void OpenGLComposite::VideoFrameArrived(IDeckLinkVideoInputFrame* inputFrame, bool hasNoInputSource)
{
    ...
    void* videoPixels;
    inputFrame->GetBytes(&videoPixels);
    ...
}

we see the following. I would expect to see "00 00 FF 00" repeating. Using blue or green shows the same result: 0xEB instead of 0xFF, and 0x10 instead of 0x00.

[Attachment: blackmagic_screen.png - memory view of the captured frame bytes]


Thoughts? It seems like maybe we need to add sRGB gamma correction?

[Edit] The colours show up the same in Media Express: (235, 16, 16) instead of (255, 0, 0).

Thanks,
Randy

Nicholas Gill

Blackmagic Design

  • Posts: 169
  • Joined: Mon May 04, 2015 10:28 pm

Re: GPUDirect for Video

Posted: Sun Oct 11, 2015 11:18 pm

Hi Randy,

It is possible that the source is not outputting full range, that the particular capture card does not support full range, or both.

Please check the settings on the graphics card to see if it is outputting full range. The specific setting will vary based on the graphics card, but look for a setting like "dynamic range".

In a previous post you mentioned using the DeckLink 4K Extreme. Please note that this card does not support full range RGB over HDMI.

It is possible to scale the captured range to full range for processing by your application if that is a requirement.
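
If your application does need full range, the per-channel scaling for 8-bit data is straightforward (a sketch, assuming video levels of 16-235):

#include <algorithm>
#include <cstdint>

// Sketch: expand 8-bit limited/video range (16-235) to full range (0-255),
// clamping stray values outside the nominal range before scaling.
static inline uint8_t LimitedToFullRange(uint8_t v)
{
    int clamped = std::min(std::max(static_cast<int>(v), 16), 235);
    return static_cast<uint8_t>(((clamped - 16) * 255 + 109) / 219); // +109 rounds
}

Applied per channel, a captured (235, 16, 16) maps back to (255, 0, 0).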

Cheers,

-nick

mrhytel

  • Posts: 1
  • Joined: Fri Feb 09, 2024 6:19 pm
  • Real Name: Hillel Steinberg

Re: GPUDirect for Video

Posted: Fri Feb 09, 2024 6:39 pm

I'm really surprised that this question (also asked in this thread: viewtopic.php?f=12&t=91745) hasn't been addressed in nearly a decade! This is really quite simple. Libraries that manipulate CUDA buffers, such as OpenCV with cv::GpuMat and NVIDIA's NVEnc, are everywhere these days. If we are looking to process video, we don't need an OpenGL texture (about 100 lines just to set up!), we just need a CUDA array for each incoming frame.

But I saw no such examples in the Blackmagic SDK. Instead there are ones that download the frames into pinned memory and then map them to OpenGL textures for preview and such. This may be great for viewing frames, but it would be better to get a CUDA array directly, for use in the thousands of other libraries out there (e.g. object detection, tracking, GPU encoding, OpenCV image processing and display), without a CPU-bound memory copy (even if pinned). Is this possible? Is there an example?

Imagine a simple example that wraps each incoming frame in a cv::GpuMat on the GPU, followed by cv::imshow and cv::waitKey(1). OpenCV preview windows support CUDA mat buffers directly if cv::namedWindow is called with the proper flags first. I thought this was the promise of NVIDIA's GPUDirect RDMA? To me, this would be a much simpler "CapturePreview" and would let folks hit the ground running with the buffer already on the GPU, without all the OpenGL texture logic, shader format conversions and pinned memory overhead. It would also be much easier to understand, because as it is right now, it's impossible to figure out how to do exactly this.

I was able to modify the loop-through example so that at the exact moment the pinned memory had a new frame, I could simply display it with cv::imshow (first converting it from YUV to BGR), but this is unusable because to get it into a GPU CUDA buffer I would have to upload it again. Why is this round trip (Video -> Pinned Memory -> CUDA GPU Buffer) necessary?
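
To make that round trip concrete, this is roughly what it looks like today (a sketch only; it assumes a UYVY/bmdFormat8BitYUV capture and an OpenCV build with CUDA and OpenGL support, and the function name is just illustrative):

#include <opencv2/core.hpp>
#include <opencv2/core/cuda.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/highgui.hpp>

// Sketch of the current round trip: DeckLink frame in pinned host memory ->
// CPU colour conversion -> host-to-device upload. With GPUDirect the frame
// would ideally land in GPU memory directly and skip the first two steps.
void PreviewFrame(void* videoPixels, int width, int height, int rowBytes)
{
    // Wrap the captured UYVY (bmdFormat8BitYUV) bytes without copying.
    cv::Mat uyvy(height, width, CV_8UC2, videoPixels, rowBytes);

    // CPU colour conversion to BGR for display/processing.
    cv::Mat bgr;
    cv::cvtColor(uyvy, bgr, cv::COLOR_YUV2BGR_UYVY);

    // The extra host-to-device copy that a direct transfer would remove.
    cv::cuda::GpuMat gpuFrame;
    gpuFrame.upload(bgr);

    // An OpenGL-backed window can display the GpuMat directly
    // (create the window once in a real application).
    cv::namedWindow("preview", cv::WINDOW_OPENGL);
    cv::imshow("preview", gpuFrame);
    cv::waitKey(1);
}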

Thoughts?
