Automatic subtitles don't convert to text

shotcut · December 5, 2024, 4:32pm

I just tried it, and it worked for me. Mine prints this additional info in the log (same as your beginning):

whisper_model_load:      CPU total size =    59.12 MB
whisper_model_load: model size    =   59.12 MB
whisper_init_state: kv self size  =    6.29 MB
whisper_init_state: kv cross size =   18.87 MB
whisper_init_state: kv pad  size  =    3.15 MB
whisper_init_state: compute buffer (conv)   =   16.26 MB
whisper_init_state: compute buffer (encode) =   85.86 MB
whisper_init_state: compute buffer (cross)  =    4.65 MB
whisper_init_state: compute buffer (decode) =   96.35 MB

system_info: n_threads = 15 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 | CANN = 0

main: processing '/tmp/shotcut-kJdGZj.wav' (2723840 samples, 170.2 sec), 15 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...
...
output_srt: saving output to '/tmp/shotcut-DiFbDv.srt'

whisper_print_timings:     load time =    79.77 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   149.32 ms
whisper_print_timings:   sample time =  2793.26 ms /  3542 runs (    0.79 ms per run)
whisper_print_timings:   encode time =  4152.87 ms /     6 runs (  692.15 ms per run)
whisper_print_timings:   decode time =     7.64 ms /     2 runs (    3.82 ms per run)
whisper_print_timings:   batchd time =  4308.48 ms /  3513 runs (    1.23 ms per run)
whisper_print_timings:   prompt time =   830.59 ms /  1024 runs (    0.81 ms per run)
whisper_print_timings:    total time = 12530.26 ms

Completed successfully in 00:00:12

I chose Italian without translating to English. Did you try more than one audio file?

What is your CPU? If it is rather old, I think it will not work, because it expects AVX2 instructions. You can check in a terminal with cat /proc/cpuinfo | grep avx2. If it does not print a result, your CPU does not have AVX2.
The other Linux builds we make turn off AVX2, but the Flatpak build does not. (There is a reason related to OpenBLAS.)
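
If you would rather script the AVX2 check than eyeball the grep output, here is a minimal Python sketch of the same idea. The function names cpu_flags and has_avx2 are made up for illustration and are not part of Shotcut or whisper.cpp. It assumes a Linux system where /proc/cpuinfo lists CPU features on a "flags" line (x86) or a "Features" line (ARM).

```python
from pathlib import Path


def cpu_flags(cpuinfo_text):
    """Collect the CPU feature flags found in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # x86 kernels label the line "flags"; ARM kernels use "Features".
        if sep and key.strip() in ("flags", "Features"):
            flags.update(value.split())
    return flags


def has_avx2(cpuinfo_text):
    """Return True if the avx2 flag is listed (hypothetical helper)."""
    return "avx2" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")  # only exists on Linux
    if path.exists():
        found = has_avx2(path.read_text())
        print("AVX2 supported" if found else "AVX2 not found")
    else:
        print("/proc/cpuinfo is not available on this system")
```

Unlike the plain grep, this only looks at the flags line, so a stray "avx2" elsewhere in the file would not give a false positive. The answer should match the AVX2 = 1 or AVX2 = 0 entry that whisper.cpp prints in its system_info line.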

You could try to compile it yourself from the ggerganov/whisper.cpp repository on GitHub (a port of OpenAI's Whisper model in C/C++) and see if it is any more reliable.