Automatic subtitles are not converted to text.

The first job appears to run fine, but the text is not added to the timeline.

OS: Linux Mint LMDE 6.
Shotcut version 24.11.17 - Flatpak
Executable Whisper: /app/bin/whisper.cpp-main
GGML model: /app/share/shotcut/whisper_models/ggml-base-q5_1.bin
Recognition language: Italian.

Enabling the GPU option resulted in the same error.

Log:
whisper_init_from_file_with_params_no_state: loading model from '/app/share/shotcut/whisper_models/ggml-base-q5_1.bin'
whisper_init_with_params_no_state: use gpu = 1
whisper_init_with_params_no_state: flash attn = 0
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw = 0
whisper_model_load: loading model
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 9
whisper_model_load: qntvr = 1
whisper_model_load: type = 2 (base)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs = 99
Failed with exit code 4

Thanks for your help.

I just tried it, and it worked for me. Mine prints this additional info in the log (same as your beginning):

whisper_model_load:      CPU total size =    59.12 MB
whisper_model_load: model size    =   59.12 MB
whisper_init_state: kv self size  =    6.29 MB
whisper_init_state: kv cross size =   18.87 MB
whisper_init_state: kv pad  size  =    3.15 MB
whisper_init_state: compute buffer (conv)   =   16.26 MB
whisper_init_state: compute buffer (encode) =   85.86 MB
whisper_init_state: compute buffer (cross)  =    4.65 MB
whisper_init_state: compute buffer (decode) =   96.35 MB

system_info: n_threads = 15 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 | CANN = 0

main: processing '/tmp/shotcut-kJdGZj.wav' (2723840 samples, 170.2 sec), 15 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...
...
output_srt: saving output to '/tmp/shotcut-DiFbDv.srt'

whisper_print_timings:     load time =    79.77 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   149.32 ms
whisper_print_timings:   sample time =  2793.26 ms /  3542 runs (    0.79 ms per run)
whisper_print_timings:   encode time =  4152.87 ms /     6 runs (  692.15 ms per run)
whisper_print_timings:   decode time =     7.64 ms /     2 runs (    3.82 ms per run)
whisper_print_timings:   batchd time =  4308.48 ms /  3513 runs (    1.23 ms per run)
whisper_print_timings:   prompt time =   830.59 ms /  1024 runs (    0.81 ms per run)
whisper_print_timings:    total time = 12530.26 ms

Completed successfully in 00:00:12

I chose Italian without translating to English. Did you try more than one audio file?

What is your CPU? If it is rather old, it will not work, because this build expects AVX2 instructions. You can check in a terminal with `grep avx2 /proc/cpuinfo`. If that does not print a result, your CPU does not have it.
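To make that check a little more robust, here is a small sketch. The `flags` line below is a made-up sample for illustration; on a real machine you would read the actual flags line from /proc/cpuinfo instead. Note the `-w` (whole word) option, so that a CPU with plain AVX but no AVX2 is not reported as supported:

```shell
# Hypothetical sample "flags" line; replace with:
#   flags=$(grep -m1 '^flags' /proc/cpuinfo)
# on a real Linux system.
flags="fpu vme de pse tsc msr sse sse2 ssse3 sse4_1 sse4_2 avx"

# -w matches "avx2" as a whole word, so the bare "avx" above does not count.
if printf '%s\n' "$flags" | grep -qw avx2; then
    echo "AVX2 supported"
else
    echo "AVX2 not supported"
fi
```

With the sample line above this prints "AVX2 not supported", which matches the situation described in this thread.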
The other Linux builds we make turn off AVX2, but the Flatpak does not. (There is a reason related to OpenBLAS.)

You could try to compile whisper.cpp yourself from its GitHub repository (ggerganov/whisper.cpp, "Port of OpenAI's Whisper model in C/C++") and see if it is any more reliable.
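For reference, a from-source build generally looks like the sketch below. These commands follow the upstream whisper.cpp README at the time of writing; exact steps and flags may differ between versions, so treat this as a starting point rather than a definitive recipe:

```shell
# Sketch of building whisper.cpp from source (steps may vary by version).
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
cmake -B build
cmake --build build --config Release
```

A build made on the machine itself should only enable the instruction sets that CPU actually has, which is the point of trying it here.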

I have a dual-core i3 CPU without support for AVX2 instructions.
Oh well, the automatic method would have been convenient, but I will use the manual method to subtitle the video.

Thanks a lot! :slightly_smiling_face: