Export Speed - Filters and Interpolation

Interesting! Does this mean a multi-cam setup of one event, or shooting several events back-to-back?

“manually turned on” - how?

I want to play!

Three tripods, up to five cameras (a piggy-back and a double-mount bar).

For example:

Abraham, When Called

Thank you.
As I said, I am learning this week.

I will experiment.
(This means turning off my browser for a little while. Be back shortly.)

Oops, I forgot to clarify why the core usage pattern was significant.

If somebody is using libx264 and it saturates the CPU, then they’re exporting as fast as their computer can go and there is nothing further to optimize.

But if someone is using a codec that doesn’t require a lot of CPU, like DNxHR or Ut Video, then the bottleneck becomes how fast Shotcut can generate a frame (filters and compositing). This is why some people see 16 threads used (they exported with libx264) while other people see only 4 threads used. The latter may have used Huffyuv, which has near-zero CPU overhead, so the bottleneck was frame generation held back by a single-threaded filter.
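To put the reasoning above in concrete terms, here is a minimal sketch that treats export as a two-stage pipeline (frame generation, then encoding) where the slower stage sets the pace. The function names and the frame rates in the usage note are made up for illustration; they are not measurements from this thread or anything Shotcut exposes.

```python
def export_fps(frame_gen_fps: float, encode_fps: float) -> float:
    """Throughput of a two-stage pipeline: the slower stage limits the export."""
    return min(frame_gen_fps, encode_fps)


def bottleneck(frame_gen_fps: float, encode_fps: float) -> str:
    """Name the stage that is holding the export back."""
    if frame_gen_fps < encode_fps:
        return "frame generation"
    return "encoding"
```

With a cheap codec (say the encoder could do 200 fps but the filters only deliver 30 fps), `bottleneck(30, 200)` reports frame generation, and a faster encoder would not help. With a heavy codec the roles flip.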

You found good tips! My main reason for responding to threads like this is to help people understand what hardware makes the biggest difference so their gear budgets are spent wisely. The worst thing is being disappointed about a performance boost someone hoped to see but didn’t get after spending a bunch of money (in the wrong place).

Absolutely!

More experiments.


(More screenshots are being uploaded to the Dropbox file mentioned above.)

All with Hardware Encoder UNchecked.

hevc_nvenc runs all four threads at 75% (about the same as with the Hardware Encoder)

All of the libx264 and libx265 exports run all four cores at 100% or nearly so.

hevc_nvenc is still the fastest. I wonder why it is the same with or without hardware encoding turned on?

This is the hardware encoder. If the checkbox is checked, it recommends this option. If it is unchecked but the user manually picks this option, hardware still gets used. That’s why results are identical.

Since this is a hardware encoder, the CPU is freed from encoding and is now used almost exclusively by Shotcut for frame generation. The fact that it isn’t at 100% CPU usage means a filter or some other stage is not threading well. If you had a 24-core box, it probably wouldn’t do much better because the additional cores would likely sit idle. They aren’t all being fully used even at 4 cores.
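The point about extra cores sitting idle is what Amdahl's law describes. Here is a minimal sketch, assuming frame generation can be split into a portion that threads perfectly and a portion that runs on one thread. The parallel fraction is a hypothetical input; nothing in this thread measures it.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Best-case speedup over one core when only part of the work is parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)
```

If, purely for illustration, half of the per-frame work were single-threaded, `amdahl_speedup(0.5, 4)` and `amdahl_speedup(0.5, 24)` would both stay below 2x, so going from 4 to 24 cores would buy very little.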

thought so

Thanks.

Oh dear.

It looks like I need to put my software engineer hat back on, learn yet another genre, and fix the multi-threading in the filters I need for everyday use.

Well, I have plenty of time; it will be quite a while before I can afford to upgrade to Threadripper.


I have found the answer to my question.

“No”

The NVIDIA X-Server Settings screen never shows higher than 10% PCIe Bandwidth Utilization during the bottlenecked (75% CPU Usage) time periods.

PCIe bandwidth is NOT the culprit.

Percentages here can be confusing. I was seeing 20% “Video Engine Utilization”, which I believe is the same measure (I cannot seem to find any documentation to confirm or refute this assumption). Still, I think we are now both seeing the same thing.

Using 40 CUDA cores to encode would be 5% on a GTX 1650, but it is over 20% on my dinky little GT 710.
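The arithmetic behind those two percentages can be checked with a short sketch. The core counts below are assumptions taken from the published specifications (896 CUDA cores for the GTX 1650, 192 for the 710 card, sold as the GeForce GT 710), and the figure of 40 cores is the hypothetical one used in the post above.

```python
# Assumed CUDA core counts from published specs, not measured here.
CUDA_CORES = {"GTX 1650": 896, "GT 710": 192}


def core_share_percent(cores_used: int, total_cores: int) -> float:
    """Percentage of a card's CUDA cores occupied by a fixed-size workload."""
    return 100.0 * cores_used / total_cores


for card, total in CUDA_CORES.items():
    print(f"{card}: {core_share_percent(40, total):.1f}% of {total} cores")
```

The same fixed workload is a small slice of a large card and a large slice of a small one, which is why the utilization percentages from different cards are hard to compare directly.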

Good point on percentages, sorry. I was referring to recent cards.

CUDA is mainly used to calculate B-frames. If B-frames are turned off, then CUDA doesn’t get used at all. Encoding would happen inside the separate and dedicated NVENC circuitry.

The Shotcut user interface is also using GPU to display preview video and draw some UI controls. It uses OpenGL or DirectX surfaces. Any other running programs doing a similar thing would also register GPU usage. So it’s difficult to tell what amount is actually going to encoding.


I did some more testing, based on this comment by @Austin; removing one filter for each test.

The worst culprit seems to be White Balance.
(@shotcut, perhaps some day someone might want to snoop around in the White Balance filter code with this in mind.)
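For anyone repeating the remove-one-filter-per-test approach, here is a minimal sketch of how the timings could be ranked. It assumes you have a baseline export time with every filter enabled and one export time per removed filter; the function name and the numbers in the usage note are hypothetical, not results from this thread.

```python
def rank_filters_by_cost(baseline_seconds: float,
                         seconds_without: dict[str, float]) -> list[tuple[str, float]]:
    """Rank filters by export time saved when each one is removed alone.

    baseline_seconds: export time with all filters enabled.
    seconds_without: export time with only the named filter removed.
    """
    savings = {name: baseline_seconds - t for name, t in seconds_without.items()}
    # Largest saving first: that filter costs the most in this project.
    return sorted(savings.items(), key=lambda item: item[1], reverse=True)
```

For example, `rank_filters_by_cost(100.0, {"Filter A": 60.0, "Filter B": 95.0})` puts Filter A first because removing it saved 40 seconds. Savings measured this way do not necessarily add up, since removing one filter can change how the others behave.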

White Balance calls the colgate filter that’s bundled with frei0r, a separate library maintained outside of Shotcut.


That makes sense, and explains a lot.

Such a library would not be expected to focus on multi-threading issues, unless it were a library built from the ground up with the intention and purpose of providing calculations optimized for today’s multi-threading environments.

This is already accelerated with SIMD CPU extensions, but in the next version 21.01 it uses slice threading.


Wonderful!

I am looking forward to doing an apples-to-apples comparison using the test set and data already developed here.

Interesting thread. :+1:
The issue of the Size, Position & Rotate filter position surprised me, although in my case, I always use that filter first in the filter chain as a habit.
