Videoneth wrote: Octavian Mot, I just updated your plugin and noticed a new large-v3-turbo model in it.
Is it still better than large-v2, only faster, and just a bit less precise than large-v3?
I asked ChatGPT about the difference between large-v2 and large-v3:
The Whisper AI models, Large-v2 and Large-v3, share the same architecture but differ in certain aspects:
Spectrogram Input:
• Large-v2: Utilizes 80 Mel frequency bins.
• Large-v3: Expanded to 128 Mel frequency bins, potentially capturing more detailed audio features. 
Language Support:
• Large-v3: Introduces a new language token specifically for Cantonese, enhancing its multilingual capabilities. 
Training Data and Performance:
• Large-v3: Trained on a combination of 1 million hours of weakly labeled audio and 4 million hours of pseudo-labeled audio generated using Large-v2. This extensive training has led to a 10% to 20% reduction in error rates across various languages compared to Large-v2. 
User Experiences:
• Some users have reported that Large-v2 delivers more reliable transcriptions, especially in noisy environments, while Large-v3 may produce hallucinations or inaccuracies under such conditions. 
• Conversely, other users have found that Large-v3 offers superior performance in specific scenarios, particularly with clean audio inputs. 
In summary, while Large-v3 introduces enhancements like expanded spectrogram input and additional language support, its performance may vary depending on the audio quality and specific use cases. Users are encouraged to evaluate both models to determine which best suits their particular requirements.
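One rough way to run that kind of side-by-side evaluation is to transcribe the same clip with each model, correct one transcript by hand, and compare word error rate (WER). The snippet below is only a minimal sketch of that idea: the function name `word_error_rate` and the sample transcripts are made up for illustration and are not real model output, and it does no punctuation normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hand-corrected transcript of a short clip (invented placeholder text).
reference = "turn left at the next traffic light"

# Invented placeholder outputs; swap in what each model really produced.
candidates = {
    "large-v2": "turn left at the next traffic lights",
    "large-v3": "turn left at next traffic light",
}

for model_name, transcript in candidates.items():
    print(f"{model_name}: WER = {word_error_rate(reference, transcript):.2f}")
```

A lower WER on a few representative clips (noisy street audio, accented speech, language switches) gives a more concrete basis for choosing than general reports do.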
Then I asked about my own use case, and it suggested v2:
Best Whisper Model for Your Needs
✅ Whisper Large-v2 (Recommended)
• Handles accents better: Many users report that Large-v2 performs more reliably with strong accents, including Italian-accented English.
• More stable in noisy environments: Since your vlogs may have background noise (traffic, wind, city sounds), Large-v2 tends to generate fewer hallucinations and errors compared to Large-v3.
• Good for multilingual speech: It supports Arabic, Spanish, Italian, Turkish, and more, switching between languages naturally.
• Proven track record: Many users still prefer Large-v2 for real-world applications like transcription and subtitles.
❓ Whisper Large-v3 (Alternative)
• Slightly better at recognizing different languages but may struggle with accented English and hallucinate in noisy conditions.
• Cantonese-specific improvements, but that’s not relevant for your case.
• Trained on more diverse data, which could be useful if you notice Large-v2 struggles with specific languages.
Why Not Smaller Models (Medium, Small, Base)?
• Lower accuracy with accents – They struggle with strong non-native English pronunciation.
• Weak in multilingual transcriptions – They may misidentify languages or mix them up.
• Less effective in noisy environments – Background noise could lead to errors.
Final Recommendation:
✔ Use Whisper Large-v2 if you want stable, reliable subtitles for accented English and multilingual speech.
✔ If you feel Large-v2 is making too many mistakes in language switching, try Large-v3 as a test.