The reference input on a video router is usually used to control when a route switch happens. If you change the input source assigned to a particular output on the video router/hub, the actual switch follows the frame timing of the reference input.
The precise timing of the switch generally follows SMPTE RP 168, a recommended practice which specifies that the switch should happen in the vertical blanking interval so that it does not affect the visible part of the video signal.
So the benefit of the reference input is that if the various video input signals going into the router are also locked to the same reference timing, then the input-to-output route change would typically happen at a frame boundary. If instead the different video inputs to the router each have their own timing (so they are not genlocked to the same reference), then the reference input to the router doesn't help very much, because the route change will probably end up happening somewhere in the middle of a video frame anyway.
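To make the difference concrete, here is a rough Python sketch of where a reference-timed switch lands inside each source's frame. It is a simplified model, not how any particular router works internally: it assumes a 25 fps format, treats the switch point as the start of the reference frame (the real switch point sits a few lines into the vertical interval), and assumes the free-running source has a fixed phase offset with no drift. The function names and numbers are made up for illustration.

```python
FRAME_PERIOD_US = 40_000  # one frame at 25 fps, in microseconds (assumed format)


def next_reference_boundary(request_time_us, period_us=FRAME_PERIOD_US):
    """Time of the first reference frame boundary at or after the route request.

    Reference boundaries are modelled as exact multiples of the frame period.
    """
    return -(-request_time_us // period_us) * period_us  # ceiling division


def position_in_source_frame(switch_time_us, source_offset_us,
                             period_us=FRAME_PERIOD_US):
    """Fraction of the way through the source's frame (0.0 up to <1.0)
    at the moment the router switches.

    source_offset_us is how long after a reference boundary the source
    starts its own frames; 0 means the source is genlocked.
    """
    return ((switch_time_us - source_offset_us) % period_us) / period_us


# A route change is requested at an arbitrary moment; the router waits
# for the next reference frame boundary before actually switching.
switch_at = next_reference_boundary(123_456)

genlocked = position_in_source_frame(switch_at, source_offset_us=0)
free_running = position_in_source_frame(switch_at, source_offset_us=7_000)

print(f"switch happens at {switch_at} us")
print(f"genlocked source:    {genlocked:.3f} of the way through its frame")
print(f"free-running source: {free_running:.3f} of the way through its frame")
```

With these assumed numbers the genlocked source is cut exactly at its frame boundary, while the source that runs 7 ms late is cut most of the way through a frame, which is the mid-frame glitch described above.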
With independent video sources that do not share a common reference, if you still want the video route changes to happen cleanly at a frame boundary, then you typically need something like the approach followed by the VideoHub CleanSwitch. That model has frame synchronizers on all of the inputs that add a bit of delay to match up the frame boundaries of the different video signals (the ATEM switchers do this as well, which is why they can be used with video sources that are not genlocked).
The internal frame synchronization approach adds up to a frame of latency to the video signals going through the video router though, which can be a problem for some applications. So if you need to avoid this additional latency, and you need the changes in video routes to match a common frame boundary timing, then you would need to lock all of the video sources to a common reference and feed a reference signal into the video router/hub as well.
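For a sense of scale, the "up to a frame" figure can be worked out with a few lines of Python. This is only back-of-the-envelope arithmetic under the assumption that an idealized frame synchronizer holds each source just long enough to line it up with the next reference frame boundary; real hardware may add a fixed minimum delay on top. The function names are hypothetical.

```python
from fractions import Fraction


def frame_period_ms(frames_per_second):
    """Duration of one frame in milliseconds, kept exact with Fraction."""
    return Fraction(1000) / Fraction(frames_per_second)


def synchronizer_delay_ms(source_offset_ms, frames_per_second):
    """Delay an idealized frame synchronizer adds to one source.

    source_offset_ms is how long after a reference frame boundary the
    source starts its frames. The frame is held until the next reference
    boundary, so the result is always less than one frame period.
    """
    period = frame_period_ms(frames_per_second)
    return (period - Fraction(source_offset_ms)) % period


# Upper bound on the added latency for a few common frame rates.
for name, fps in [("25", 25), ("50", 50), ("59.94", Fraction(60000, 1001))]:
    print(f"{name} fps: up to {float(frame_period_ms(fps)):.2f} ms")

# A source running 7 ms behind the reference at 25 fps is held for 33 ms.
print(float(synchronizer_delay_ms(7, 25)))
```

So at 25 fps the worst case is just under 40 ms per pass through the router, and a genlocked source (offset 0) would need no added delay at all in this model.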