Blackmagic Forum

Tue Nov 12, 2024 10:35 am

Joe Shapiro wrote:If you search the forum you’ll discover gobs of info about using it with Resolve

Hi, and thanks for the tip. I've been searching the forum for StoryToolkit for a couple of days now, but I haven't found any discussion about multilingual material or the possibility of analysing multiple audio tracks. If there is some, I'd be glad to study it!

Also, if this already great product is improved further, and if it is possible, it would be handy if the user could restrict the analysis (for which the audio file is exported from Resolve) to the part of the timeline between the in and out marks. That way it could perhaps be used more conveniently alongside editing. In my two days of experimenting so far, I have not found a way to do this at the moment.

Tue Nov 12, 2024 7:45 pm

Hey Joon
FYI, the developer of StoryToolkitAI prefers we use his project issues page on GitHub for StoryToolkitAI discussions.

Octavian Mot wrote:Feel free to ask stuff or add feature requests on the project issues page: https://github.com/octimot/StoryToolkitAI/issues

This way we'll keep the Blackmagic forums noise free and keep all the ideas in one place.

Cheers
Andy

Tue Nov 12, 2024 8:24 pm

Andy Mees wrote:FYI, the developer of StoryToolkitAI prefers we use his project issues page on GitHub for StoryToolkitAI discussions.

Aaa, okay - thanks! Looks like there's plenty to explore.
Br, Joon

Sun Jan 05, 2025 12:48 am

This project seems to have been abandoned. Is there any other way to use Whisper AI with DaVinci Resolve?

Sun Jan 05, 2025 8:10 am

What makes you say it's been abandoned?

Sun Jan 05, 2025 9:00 pm

Last public release: Aug 21, 2023
Last release for the monthly paying customers: Jan 26, 2024

Mon Jan 06, 2025 9:25 am

I think you'll find those are just the last packaged 'standalone' installer release dates, Mario. If you install from the git sources you'll see that the updates have continued.
For example:
[0.24.0] - 2024-02-22
[0.24.1] - 2024-02-29
[0.24.2] - 2024-06-01
[0.24.3] - 2024-08-28
[0.24.4] - 2024-12-09

Mon Jan 06, 2025 12:02 pm

I have a question for OP.

I currently use WhisperX (https://github.com/m-bain/whisperX) primarily because of this feature:

Version 3 released with a 70x speed improvement, thanks to batched whisper using the faster-whisper backend.
However, WhisperX lacks the level of control that StoryToolkit provides, such as the ability to set the number of words per line and other practical features that are very useful for my work.

While I miss these features, the speed of WhisperX's transcription is so fast now that it's hard to justify using StoryToolkit (for me, and my particular workflow), as its transcription speed is much slower in comparison. Even with all the useful extras.

Would it be possible to integrate the changes from the WhisperX fork into your tool in the future?
Or is their approach already in the original Whisper? I don't follow the development of these tools except when I do a git pull and check the changelog.

Currently, I have this command that I use with a right click on any file:

Code: whisperx "%%~fI" --output_format "!output_format!" --model !model! --verbose True --fp16 True --compute_type float16 --print_progress True --batch_size 18 --language en --align_model WAV2VEC2_ASR_LARGE_LV60K_960H !output_dir_cmd! !highlight_words_cmd!

I cranked up the batch size and it flies!

Then I import the transcription into Resolve (and there is a "bug" there, unfortunately: viewtopic.php?f=21&t=213856). But of course, I can't use the single word transcription option anymore when doing it like this :/

I want to start by saying that I really appreciate this tool being free, thank you so much for making it available! I hope this doesn't come across as complaining or anything like that, as that's not my intention. I just wanted to share a few suggestions that might make sense (or maybe not).

Tue Jan 21, 2025 9:53 am

Constantin Gross wrote:
aaronvandomelen wrote:This is incredible, wow! I need to spend some time testing this out.

Do you have a method to import as ranged markers (kind of like Simon Says)?

You can convert subtitle SRT files to ranged markers with this tool:

https://en.editingtools.io/subtitles/

As a YouTube video creator, this tool is going to help me a lot. Thank you, sir. Keep sharing with us. 8-)

Fri Jan 24, 2025 11:41 am

Andy Mees wrote:I think you'll find those are just the last packaged 'standalone' installer release dates, Mario. If you install from the git sources you'll see that the updates have continued.
For example:
[0.24.0] - 2024-02-22
[0.24.1] - 2024-02-29
[0.24.2] - 2024-06-01
[0.24.3] - 2024-08-28
[0.24.4] - 2024-12-09

I only use the standalone so to me it's abandoned.

Fri Feb 21, 2025 12:58 pm

Octavian Mot, I just updated your plugin and I saw a new large-v3-turbo model in it.
Is it still a better model than large-v2, only faster, with just slightly less precision than large-v3?

Fri Feb 21, 2025 10:07 pm

Can't open it:
Apple could not verify “StoryToolkitAI.0.23.2.1.arm64.app” is free of malware that may harm your Mac or compromise your privacy.

Fri Feb 21, 2025 10:13 pm

Figured it out!

Fri Feb 21, 2025 10:24 pm

Videoneth wrote:Octavian Mot, I just updated your plugin and I saw a new large-v3-turbo model in it.
Is it still a better model than large-v2, only faster, with just slightly less precision than large-v3?

I asked ChatGPT about the difference between large-v2 and large-v3:

The Whisper AI models, Large-v2 and Large-v3, share the same architecture but differ in certain aspects:

Spectrogram Input:
• Large-v2: Utilizes 80 Mel frequency bins.
• Large-v3: Expanded to 128 Mel frequency bins, potentially capturing more detailed audio features.

Language Support:
• Large-v3: Introduces a new language token specifically for Cantonese, enhancing its multilingual capabilities.

Training Data and Performance:
• Large-v3: Trained on a combination of 1 million hours of weakly labeled audio and 4 million hours of pseudo-labeled audio generated using Large-v2. This extensive training has led to a 10% to 20% reduction in error rates across various languages compared to Large-v2.

User Experiences:
• Some users have reported that Large-v2 delivers more reliable transcriptions, especially in noisy environments, while Large-v3 may produce hallucinations or inaccuracies under such conditions.
• Conversely, other users have found that Large-v3 offers superior performance in specific scenarios, particularly with clean audio inputs.

In summary, while Large-v3 introduces enhancements like expanded spectrogram input and additional language support, its performance may vary depending on the audio quality and specific use cases. Users are encouraged to evaluate both models to determine which best suits their particular requirements.

Then I asked about my own usage and it suggested v2:
Best Whisper Model for Your Needs

✅ Whisper Large-v2 (Recommended)
• Handles accents better: Many users report that Large-v2 performs more reliably with strong accents, including Italian-accented English.
• More stable in noisy environments: Since your vlogs may have background noise (traffic, wind, city sounds), Large-v2 tends to generate fewer hallucinations and errors compared to Large-v3.
• Good for multilingual speech: It supports Arabic, Spanish, Italian, Turkish, and more, switching between languages naturally.
• Proven track record: Many users still prefer Large-v2 for real-world applications like transcription and subtitles.

❓ Whisper Large-v3 (Alternative)
• Slightly better at recognizing different languages but may struggle with accented English and hallucinate in noisy conditions.
• Cantonese-specific improvements, but that’s not relevant for your case.
• Trained on more diverse data, which could be useful if you notice Large-v2 struggles with specific languages.

Why Not Smaller Models (Medium, Small, Base)?
• Lower accuracy with accents – They struggle with strong non-native English pronunciation.
• Weak in multilingual transcriptions – They may misidentify languages or mix them up.
• Less effective in noisy environments – Background noise could lead to errors.

Final Recommendation:

✔ Use Whisper Large-v2 if you want stable, reliable subtitles for accented English and multilingual speech.
✔ If you feel Large-v2 is making too many mistakes in language switching, try Large-v3 as a test.

Sat Feb 22, 2025 6:40 am

Videoneth wrote:Octavian Mot, I just updated your plugin and I saw a new large-v3-turbo model in it.

@Videoneth @Mario69Rossi
The short answer is that the large-v3-turbo model is an optimized version of large-v3, which runs (at least here in our editing room) at least 3x faster than the large-v3 model. According to the model card (https://huggingface.co/openai/whisper-large-v3-turbo) and this discussion (https://github.com/openai/whisper/discussions/2363), it only slightly affects the quality of the final output compared to the large models, and I assume only for specific languages.

Mario69Rossi wrote:I asked chatgpt the difference between large v2 and v3:

@Mario69Rossi I think ChatGPT is comparing the non-turbo models for you. "ChatGPT can make mistakes. Check important info."

Videoneth wrote:While I miss these features, the speed of WhisperX's transcription is so fast now that it's hard to justify using StoryToolkit (for me, and my particular workflow), as its transcription speed is much slower in comparison.

@Videoneth Last time we tested, the 70x speed improvement advertised for WhisperX was not accurate, but this might be down to our systems. I actually tried to implement it as an option in the tool's algorithm, but got really mixed results on both Macs and PCs, and definitely not more than 3-4x speed increases. Apart from the faster batched compute on the model, some optimizations are already implemented in the tool (e.g. VAD). So my assumption is that the 70x figure comes from audio with a lot of silent gaps, which were not sent to the model (just as it works in the tool), and that they measured against the total audio time (total audio time divided by processing time) instead of the actual speech time (total speech time divided by processing time). Again, I might be wrong.
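
To make the difference between the two measurements concrete, here is a minimal Python sketch. The function name and the numbers are made up for illustration and are not taken from WhisperX or from the tool; it only shows how the same run can yield two very different "x times faster" figures depending on what is counted.

```python
def speed_factors(total_audio_s, speech_s, processing_s):
    """Return (audio-based, speech-based) real-time speed factors.

    total_audio_s: full length of the file, silence included
    speech_s:      only the parts that VAD kept and sent to the model
    processing_s:  wall-clock time the transcription took
    """
    if processing_s <= 0:
        raise ValueError("processing time must be positive")
    return total_audio_s / processing_s, speech_s / processing_s


# Hypothetical example: a 60-minute file with 20 minutes of actual speech,
# processed in 2 minutes.
audio_factor, speech_factor = speed_factors(3600, 1200, 120)
print(f"audio-based: {audio_factor:.0f}x, speech-based: {speech_factor:.0f}x")
```

With these assumed numbers the audio-based figure is three times the speech-based one, simply because two thirds of the file is silence that never reaches the model.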

Then, for Windows CUDA systems, the installation of WhisperX "requires the NVIDIA libraries cuBLAS 11.x and cuDNN 8.x to be installed on the system" which could be a nightmare for many non-technical users.

For our use, the speed we get now with whisper large turbo is good enough, and the tradeoff for using faster systems is not worth it, but I can totally see some folks needing speed over quality.

We're trying to find a way to let users choose their own transcription systems when ingesting with the tool (like hooking up WhisperX, faster-whisper, whisper-mlx, OpenAI's whisper API etc.).

Sat Feb 22, 2025 7:15 pm

Well, I transcribe a 2h+ podcast in 120-180 seconds with the large-v3 model. Let's say 3-4 minutes with the alignment.

I use a batch size of 48, with a 3090.

I have a .bat file in the "SendTo" folder of Windows (so it appears under a right click on a file).

And here is the WhisperX command I use.

Needless to say, with regular 20-40 minute videos, the transcriptions are pretty much instant for me.

Code: whisperx "%%~fI" --output_format "!output_format!" --model !model! --verbose True --fp16 True --compute_type float16 --print_progress True --batch_size 48 --language en --align_model WAV2VEC2_ASR_LARGE_LV60K_960H !output_dir_cmd! !highlight_words_cmd! --vad_offset 0.8

(the names wrapped in ! are variables that the user is prompted for when it's launched)
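
For anyone who would rather drive this from Python than from a .bat file, a minimal sketch of the same idea might look like the following. The flags are copied from the command above and have not been checked against any particular WhisperX version; the function name, the default values, and the "srt" output format are assumptions for illustration only.

```python
import shlex


def build_whisperx_command(audio_path, model="large-v3", batch_size=48,
                           output_format="srt", language="en"):
    """Assemble the argument list for a whisperx call.

    The flags mirror the forum command above; the defaults are assumptions.
    """
    return [
        "whisperx", audio_path,
        "--model", model,
        "--output_format", output_format,
        "--compute_type", "float16",
        "--batch_size", str(batch_size),
        "--language", language,
        "--align_model", "WAV2VEC2_ASR_LARGE_LV60K_960H",
        "--vad_offset", "0.8",
    ]


cmd = build_whisperx_command("podcast episode.wav")
# Show the command as it would be typed in a POSIX shell (quoted safely).
print(shlex.join(cmd))

# To actually run it (requires whisperx to be installed and on PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids quoting problems with file names that contain spaces, which is what the "%%~fI" quoting in the batch version takes care of.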

YES, it won't replace everything we can do with StoryToolkitAI, and I'm super grateful for that tool.

Resolve is pretty solid for transcription as well, especially with normal speech at a normal pace and not too many overlapping speakers. It cuts sentences at the right points and does a decent job of capturing the natural flow of language, with ... - or "". And the speaker detection is WAY better than Whisper's (which can find so many different speakers with just two people in the audio lol, so it's pretty much useless imo).

BUT the transcription quality isn't nearly as good as with the Whisper(X) models. It misses a lot of basic things that StoryToolkit handles better.

StoryToolkit is superior because of the models and all the customization options it offers, BUT it's pretty slow. Sometimes I prefer using Resolve directly, hoping it will handle the transcription correctly, but only if the audio is clear. If the audio isn't clear, I'll always go with Whisper's models. BUT speaker detection is bad and honestly not very useful.

WhisperX is the best for speed and transcription accuracy, but man, it's pretty much just a "raw" tool. We need to write our own scripts and build additional tools around it to get better sentence cuts, etc. Sometimes we end up with something like 3 lines of 10-14 words each, taking up the whole screen. :lol:

For that kind of thing, Resolve is WAY better (especially when you use the basic settings, like 42 characters per line).
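
As a rough illustration of the kind of script one could build around raw transcription text to tame long lines, here is a minimal Python sketch using only the standard library. The 42-character limit comes from the Resolve setting mentioned above; the two-lines-per-subtitle limit and the function name are assumptions for illustration. It only breaks on whitespace and does no sentence-aware splitting or timing adjustment.

```python
import textwrap


def wrap_subtitle(text, max_chars=42, max_lines=2):
    """Split text into subtitle blocks of at most max_lines lines,
    each line at most max_chars characters, breaking on whitespace."""
    lines = textwrap.wrap(text, width=max_chars)
    return ["\n".join(lines[i:i + max_lines])
            for i in range(0, len(lines), max_lines)]


sample = ("Sometimes we end up with three very long lines of ten to fourteen "
          "words each and they end up taking the whole screen")
for block in wrap_subtitle(sample):
    print(block)
    print()
```

A real workflow would also have to split each subtitle's time range across the new blocks, which this sketch leaves out.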

What would be great to have (if it's even possible) is one tool that combines the best features of each of these tools into one package.

Here it is with a 2h14 podcast. I sped it up, but it's 3 minutes in total (not counting the loading of the model).

Free Transcriptions in Resolve using OpenAI Whisper
