I tried the voice recognition feature, but the resulting mistakes are too numerous to be useful. So I would like to download the “better brain model” mentioned in the documentation. But where can it be found, and how is it added?
Thanks for the reference: I had found another text which didn’t provide this link.
I did try the ggml-base-q8_0 version (the preinstalled one is the q5), and it does indeed give somewhat better voice recognition.
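For anyone else looking for these files: the GGML models used by whisper.cpp are hosted in the ggerganov/whisper.cpp repository on Hugging Face. A minimal sketch of the download, assuming that repository layout (swap in the filename of the model you want):

```shell
# Assumption: the model files live in the Hugging Face repo
# ggerganov/whisper.cpp; ggml-base-q8_0.bin is just the model
# discussed above — substitute e.g. ggml-large-v3.bin as needed.
curl -L -o ggml-base-q8_0.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base-q8_0.bin
```

You then point Shotcut at the downloaded .bin file in its speech-to-text dialog.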
However, both the old model and the new one perform very poorly when translation is required: they produce meaningful sentences, but ones that have no relation whatsoever to the source. Moreover, after 40 seconds or so, they repeat the same sentence until the end of the video (which lasts 10 minutes, and in which the speaker is not repeating himself).
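A quick way to quantify the repetition you describe is to count consecutive duplicate text lines in the generated .srt file. A minimal sketch (the sample file here is made up; run the pipeline on your own subtitle file instead):

```shell
# Create a tiny sample SRT just for illustration; replace sample.srt
# with your real subtitle file.
cat > sample.srt <<'EOF'
1
00:00:00,000 --> 00:00:02,000
Hello world.

2
00:00:02,000 --> 00:00:04,000
Same line again.

3
00:00:04,000 --> 00:00:06,000
Same line again.
EOF

# Strip cue numbers, timestamp lines, and blanks, then count how often
# each text line repeats consecutively; the biggest counts come first.
grep -v -E '^[0-9]+$|-->' sample.srt | sed '/^$/d' | uniq -c | sort -rn | head
```

A long run of one sentence shows up as a single line with a large count, which makes it easy to compare how badly different models loop.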
All that we can really do is change command line arguments to the whisper.cpp executable we run. Also, if you view the job you can see the tool’s output that we must parse. If you see a difference between that and what you see in the Subtitles panel, we can try to fix that. When the “Include non-spoken words” option in Shotcut’s dialog is left unchecked, we do try to exclude phrases surrounded by ( ), and that obviously affects parsing.
I decided to compare whisper.cpp with OpenAI Whisper. So, I installed the latter using this Dockerfile, which uses the original OpenAI Python version. Then, I ran:

docker run -it --rm -v ${PWD}/models:/root/.cache/whisper -v ${PWD}/audio-files:/app openai-whisper whisper yt.opus --threads 8 --task translate --model base --output_dir /app --output_format srt
and got the attached he2en-base.txt. What do you think of its quality?
Then, I ran whisper.cpp with the unquantized base model to get a more apples-to-apples comparison. See whisper.cpp-base-he2en.txt. It is very different from Python Whisper, and strange! It also differs from your result with q8, with one similarity: heavy repetition. I wonder whether this implementation has these problems with most languages or only certain ones.
So next, I tested the translation performance of the large model in whisper.cpp: whisper.cpp-large_v3-he2en.txt. What do you think of it? Much better, agreed?
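For reference, the standalone whisper.cpp run above would look roughly like this. This is a sketch: the flags come from whisper.cpp’s main example, but the file names are hypothetical, and note that whisper.cpp wants 16 kHz WAV input, so an .opus source has to be converted first:

```shell
# Convert the source audio to mono 16 kHz WAV, which whisper.cpp expects.
ffmpeg -i yt.opus -ar 16000 -ac 1 yt.wav

# Transcribe-and-translate Hebrew audio to English, writing an .srt file.
# -m: model file, -f: input audio, -l: source language, -osrt: SRT output,
# -of: output file name without extension.
./main -m models/ggml-large-v3.bin -f yt.wav \
       -l he --translate -osrt -of he2en-large_v3
```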
We can add a note to the documentation that you probably need to download a large model for translation. The quality for the model size might depend on which language it is coming from.
Did you choose the correct source language when you started the detection? In the dialog, the source language needs to be selected. If your source language is not in the list, then it is not supported.
Repeated lines are the result of the model hallucinating. In my testing, I actually saw more hallucinations with larger models.
Hope it’s okay to jump in here instead of making a new topic, but which file is suggested to download for a “bigger brain”?
I’m assuming I need just one .bin file and am looking at the ggml-large-v3-turbo files, as they seem to be recent. But what do the q5 and q8 suffixes mean?
The default model for the original Python Whisper project is large-turbo. I doubt many people around here have done a thorough comparison of models. Maybe someone will find a good web page about it to share here. The “q” means quantized, which is faster and uses less memory. I suppose you could call it an alternate form of turbo, since I do not know what “turbo” means for models either. I also do not know the difference between q5 and q8; surely they are different levels of quantization. Whichever has less precision (my guess is q5) is going to be less accurate too.
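On q5 vs. q8: the numbers are the bit widths used to store each weight, so q5 packs weights into 5 bits and q8 into 8. A toy illustration (this is not whisper.cpp’s actual quantization scheme, just uniform rounding to a grid on [-1, 1]) of why fewer bits means more rounding error:

```shell
# Toy example only: snap one weight value to the nearest point on an
# 8-bit grid and on a 5-bit grid, and show the rounding error for each.
awk 'BEGIN {
  x = 0.7371                                   # an arbitrary weight value
  for (bits = 8; bits >= 5; bits -= 3) {
    levels = 2 ^ bits                          # number of representable values
    step = 2.0 / (levels - 1)                  # grid spacing on [-1, 1]
    q = -1 + step * int((x + 1) / step + 0.5)  # nearest grid point
    printf "%d-bit: %.4f (error %.4f)\n", bits, q, q - x
  }
}'
```

The 5-bit grid is coarser, so each stored weight lands farther from its true value, which is why the lower-precision model should also be the less accurate one.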