This should not really come as a surprise - in practically every forum where an ATEM is mentioned, the need to delay the audio comes up soon after.
There is a very logical reason for this as well - audio coming in through the SDI ports is fine, as it is already locked to the correct frame of video. But audio coming in through the XLR ports has no such sync information, so it is embedded in the stream "as it comes". As the video processing chain, from the moment the image is captured by the camera, takes a lot longer than the audio chain does, the audio WILL need to be delayed unless it enters via a camera, which then does this syncing.
Audio has a general maximum latency of a few milliseconds from when the artist makes the sound until it has to be out of the speakers and in-ear monitors, so the technology is designed for this speed. Video is designed to be transmitted in frames, each of which generally covers a 1/50 s period of image data. Unless you genlock your cameras, the frames need to be synced later on, and you need to wait at least until the start of every frame to do this. There are also general delays in cameras (especially consumer / semi-pro ones), effects and mixers, which mean that video is somewhat behind the real thing, and always will be.
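To put rough numbers on this, here is a small sketch that converts a video delay measured in frames into milliseconds and audio samples. The 50 fps default comes from the 1/50 s frame period above; the 48 kHz sample rate and the function name frame_delay are assumptions for illustration only. How many frames your own chain is actually behind is something you have to measure.

```python
def frame_delay(frames: float, fps: float = 50.0, sample_rate: int = 48000):
    """Convert a video delay in frames to (milliseconds, audio samples).

    fps and sample_rate are assumed defaults, not values read from any device.
    """
    seconds = frames / fps
    return seconds * 1000.0, round(seconds * sample_rate)


if __name__ == "__main__":
    for n in (1, 2, 3):
        ms, samples = frame_delay(n)
        print(f"{n} frame(s): {ms:.0f} ms, {samples} samples")
```

So even a single frame at 50 fps is 20 ms, which is already well beyond the few milliseconds the audio side is built around.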
In real life, practically ANY digital audio mixer has this functionality built in, so there is no point adding it to the video mixer. Do it there, or get a box like the DEQ2496 if you need to do it yourself and don't have an audio mixer.
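For anyone curious what such a delay actually does, it is conceptually just a fixed-length FIFO of samples. The sketch below is a minimal illustration of that idea, not how any particular mixer or box implements it; the class name DelayLine and its method are hypothetical.

```python
from collections import deque


class DelayLine:
    """Fixed delay: each sample comes back out delay_samples calls later."""

    def __init__(self, delay_samples: int):
        # Pre-fill with silence so the first outputs are zeros.
        self.buf = deque([0.0] * delay_samples)

    def process(self, sample: float) -> float:
        # Push the new sample in, pull the oldest one out.
        self.buf.append(sample)
        return self.buf.popleft()


if __name__ == "__main__":
    line = DelayLine(3)
    print([line.process(x) for x in (1.0, 2.0, 3.0, 4.0, 5.0)])
```

With a delay of 3 samples, the first three outputs are silence and the input then reappears unchanged, three samples late.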