To clarify a little: the ATEM TVS adds at most (!) one frame of delay (I don't know whether there may be a second frame when using the DVE). One frame of delay is not noticeable. So audio inserted via XLR is nearly in sync, and that alone would be no reason to use an audio delay unit to keep audio in sync, but .....
Most of the video delay comes from cameras that were not designed for live video. They have a delay at their output caused by signal processing. This is not noticeable when playing back files from the camera's memory, but it is when comparing live video with live audio.
So you get at least 3, 4 or maybe 5 frames of delay in total at the recorder's input, and possibly a different delay with different cameras. So I don't think it is a good idea to mix different camera brands and models.
This delay is really noticeable, and to solve the issue an audio delay unit is recommended, for example a digital audio console or a digital audio effects generator. There is also a cheap box sold by "Lindy" that provides such a delay. Audio arriving (a little) after video is no problem; this is a normal situation for humans, because sound travels through air much more slowly than light.
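
To put numbers on it: most audio delay units are set in milliseconds, not frames, so the frame count has to be converted. Here is a rough sketch of that arithmetic in Python. The function name and the 25 fps frame rate are just my assumptions for illustration; use your own frame rate and whatever delay you actually measure.

```python
def frames_to_ms(frames: float, fps: float) -> float:
    """Convert a video delay given in frames to milliseconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return frames * 1000.0 / fps


if __name__ == "__main__":
    fps = 25.0  # assumed frame rate, change to match your setup
    # 1 frame is the mixer alone; 3 to 5 frames is the total mentioned above
    for frames in (1, 3, 4, 5):
        print(f"{frames} frame(s) at {fps:g} fps = {frames_to_ms(frames, fps):.0f} ms")
```

It is probably safer to set the audio delay slightly too long than too short, since audio a little after video looks more natural than audio ahead of video.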

Other vision mixers have such a delay too, not only Blackmagic!
When I worked with a Tricaster, I had the same issue when recording video from the vision mixer together with direct audio from a mixing console, bypassing the vision mixer. This shows that there will always be delay, but maybe other (and more expensive) vision mixers have an audio delay built in. So some of the bucks you save by buying an ATEM must be invested in additional equipment....