How do I download a bigger and better brain (model) in ggml format?

Hello everyone,

I tried the voice recognition feature, but the resulting mistakes are too numerous to be useful. So I would like to download the “better brain model” mentioned in the documentation. But where can it be found, and how is it added?

Thanks!

There is a link here

You can change the file in the dialog that pops up when you run speech to text.

Hello @Brian,

Thanks for the reference: I had found another text which didn’t provide this link.

I did try the ggml-base-q8_0 version (the one which is preinstalled is the q5) and it does indeed do somewhat better voice recognition.

However, both the old model and the current one behave erratically when translation is required: they produce meaningful sentences, but ones that have no relation whatsoever to the source; moreover, after 40 seconds or so, they repeat the same sentence until the end of the video (which lasts 10 minutes, and in which the speaker is not repeating himself).

You can get the official Python version of Whisper in order to compare the results. We do not make this tool, and we do not have the skills and resources to improve the model. If the same happens with OpenAI’s Whisper, you can report it to them: GitHub - openai/whisper: Robust Speech Recognition via Large-Scale Weak Supervision

All that we can really do is change command line arguments to the whisper.cpp executable we run. Also, if you view the job, you can see the tool’s output that we must parse. If you see a difference between that and what appears in the Subtitles panel, we can try to fix it. When the dialog’s “Include non-spoken words” option is left unchecked, we do try to exclude phrases surrounded by ( ), and that obviously affects parsing.
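For anyone curious what that exclusion looks like in practice, here is a minimal Python sketch of stripping parenthesized phrases from a subtitle line. The function name and regex are illustrative only, not Shotcut’s actual parser:

```python
import re

def strip_non_spoken(line: str) -> str:
    """Remove phrases wrapped in parentheses, e.g. "(music)" or
    "(applause)", the way a subtitle post-processor might when an
    "Include non-spoken words" option is unchecked."""
    # Drop every "( ... )" group, then collapse leftover double spaces.
    cleaned = re.sub(r"\([^)]*\)", "", line)
    return re.sub(r"\s{2,}", " ", cleaned).strip()

print(strip_non_spoken("(music) Good evening, Mr Prime Minister."))
# -> "Good evening, Mr Prime Minister."
```

A filter like this is fragile if the model ever emits unbalanced parentheses, which is one way parsing differences between the raw tool output and the Subtitles panel could arise.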

Does that mean the Translate to English option in the Speech to Text dialog?

I decided to compare whisper.cpp with OpenAI Whisper. So I installed the latter using this Dockerfile, which uses the original OpenAI Python version. Then I ran
docker run -it --rm -v ${PWD}/models:/root/.cache/whisper -v ${PWD}/audio-files:/app openai-whisper whisper yt.opus --threads 8 --task translate --model base --output_dir /app --output_format srt
and got the attached he2en-base.txt. What do you think of its quality?

he2en-base.txt (14.7 KB)

Then, I ran whisper.cpp using the unquantized base model to get a more apples-to-apples comparison. See whisper.cpp-base-he2en.txt. It is very different from Python Whisper, and strange! It also differs from your result with q8, with one similarity: heavy repetition. I wonder whether this implementation has these problems with most languages or only certain ones.

whisper.cpp-base-he2en.txt (32.9 KB)
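In case anyone wants to reproduce a run like this, here is a rough sketch of assembling the whisper.cpp command line in Python. The flag names follow the whisper.cpp README’s “main” example and may differ between versions, and `whisper_cpp_cmd` is just an illustrative helper:

```python
from pathlib import Path

def whisper_cpp_cmd(model: Path, audio: Path, src_lang: str) -> list[str]:
    """Build a whisper.cpp translate-to-English command line.

    Flag names are taken from the whisper.cpp "main" example's help
    output; check `./main --help` for your build, as options change
    between versions."""
    return [
        "./main",
        "-m", str(model),   # ggml model file
        "-f", str(audio),   # 16 kHz WAV input
        "-l", src_lang,     # source language, e.g. "he" for Hebrew
        "--translate",      # translate the output to English
        "-osrt",            # write an .srt file next to the input
    ]

print(" ".join(whisper_cpp_cmd(Path("models/ggml-base.bin"),
                               Path("yt.wav"), "he")))
```

You could pass the resulting list to `subprocess.run` to mirror the Docker comparison above.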

So next, I tested the translation performance of the large model in whisper.cpp: whisper.cpp-large_v3-he2en.srt. What do you think of it? Much better, agreed?

whisper.cpp-large_v3-he2en.srt (21.4 KB)

We can add a note to the documentation that you probably need to download a large model for translation. The quality at a given model size might depend on which language it is coming from.

Did you choose the correct source language when you started the detection? The source language needs to be selected in the dialog. If your source language is not in the list, then it is not supported.

[screenshot of the language selection in the Speech to Text dialog]

Repeated lines are the result of the model hallucinating. In my testing, I actually saw more hallucinations with larger models.


Hope it’s okay to jump in here instead of making a new topic - but which file is suggested to download for a ‘bigger brain’?

I’m assuming I need just one bin file and am looking at the ggml-large-v3-turbo files, as they seem to be recent - but what do the q5 and q8 suffixes refer to?

The default model for the original Python Whisper project is large-turbo. I doubt many around here have done a good comparison of models; maybe someone will find a good web page about it to share here. The “Q” means quantized, which is faster and uses less memory. I guess you can call it an alternate form of turbo, because I do not know what “turbo” means for models. I also do not know the difference between q5 and q8; surely they are different levels of quantization. Whichever has less precision (my guess is q5) is going to be less accurate too.
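To illustrate why fewer bits means less precision, here is a toy Python sketch of uniform rounding. Real ggml quantization works on blocks of weights with per-block scales, so this only shows the rounding idea, not the actual q5_0/q8_0 formats:

```python
def quantize(x: float, bits: int, lo: float = -1.0, hi: float = 1.0) -> float:
    """Round x onto a uniform grid of 2**bits levels spanning [lo, hi].
    Toy illustration only: actual ggml quantization uses per-block
    scales over groups of weights, not a single global range."""
    levels = 2 ** bits - 1
    step = (hi - lo) / levels
    return lo + round((x - lo) / step) * step

w = 0.123456
err5 = abs(w - quantize(w, 5))   # ~5-bit precision: coarser grid
err8 = abs(w - quantize(w, 8))   # ~8-bit precision: finer grid
print(err5, err8)                # err8 comes out noticeably smaller
```

So a q5 model stores each weight on a much coarser grid than q8, trading accuracy for a smaller file and lower memory use.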


In my quick tests the turbo model is way worse than the large one for non-English. The large one used about double the RAM and took double the time, so for me I’d say large is clearly the “best” choice if you have the power (it used about 4-5 GB of RAM for Romanian).

Sorry for not replying earlier. To your question: yes, I did select the source language - and the subtitle transcription was more or less correct.
Here is the source. It is a 1981 interview in Hebrew with the then-prime minister of Israel. The interviewer starts with “Mr Prime Minister, good evening to you”, and the PM replies “A good and blessed evening”.
Here is the log of the translation - no relation whatsoever to the original text:
[screenshot of the translation log]