Export Speed - Filters and Interpolation

Long export times are a drag. We all know that.
Long export times are a fact of life. We all know that as well.
(This is explained in many other wonderfully enlightening threads.)

This week, pondering the delays caused by horrendously long export times, I began to research the topic here on the Shotcut Forum.
I wanted to know why my times were so long, why my GPU usage percentages were so low, and why my CPU usage was less than the nearly 100% I had seen previously.

There is a tremendous amount of elucidation here on these subjects. It turns out that our fearless leader @shotcut is an excellent teacher. Education continues.

This led to experimentation, the results of which were surprising, very useful to me (I have significantly altered my editing process now), and I believe will be useful to others. So, in the next comment I will summarize what I found by experiment, then note what I “unlearned”, and then give the details of environment and workflow for those who want to delve deeper.


The Findings:

(1) Using Bilinear Interpolation instead of Bicubic yields an export time improvement of about 2:1, with almost no loss of quality. (Yes, if I look close, I can see a little bit more pixel-jitter on the top of my hair. You have to look REAL close to see it.)

(2) The position of filters in the stack can make a large difference in Export times.
In Bicubic interpolation, moving the Size-Position-Rotation filter from the bottom of the stack to the top (making it first to execute?) gives a 15% improvement.
In Bilinear interpolation, moving the SPR filter up gives me a much larger 2:1 improvement (why the difference?).

So by making these two changes, I get a four-to-one improvement on my Export times, with no noticeable loss of quality.

(Also, my GPU Video Engine utilization goes from 1% to 4%.)


[Screenshots: the filter stack with Size, Position, Rotate at the bottom, and with it at the top.]

The filters.

For those interested, I have made my setup for this experiment, as well as the results, and copious screenshots, available in Dropbox.

Export Test Folder


I also tried other movements up and down in the filter stack; a few showed a possible 2% change for the worse. None showed an improvement other than moving Size, Position, Rotate from the bottom to the top.


My first suspicion, before consulting this forum, was that, since I was clearly neither CPU-bound (CPU use ~70%) nor GPU-bound (Video Engine use 1%), I must be disk-bound.

However, it was expertly noted in a forum thread about hardware that Shotcut makes only very slight demands on the drives.

This was verified when I added a Drives band to my System Monitor.

The Export begins at the large Blue-on-Red spike (Reads-on-Writes); note the very small red drive-write bumps, widely spaced.

What led me to try the change to my filters stack was someone’s comment in a forum thread on GPU filters that this had significant effect.

It was mentioned that the bottleneck is usually the loading of frame data back and forth between CPU memory and GPU memory.

This leads me to wonder if the real bottleneck may be the PCIe lanes.
In my rig, my GPU is using x8 lanes of PCIe. Would my GPU usage double again if I had x16 lanes available to my GPU? Is there such a thing as a GPU with x32 PCIe lanes?
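To get a feel for whether the lanes could really be the ceiling, here is a back-of-envelope sketch. The numbers are assumptions, not measurements: PCIe 3.0 at roughly 0.985 GB/s usable per lane, uncompressed 8-bit RGBA frames, and one upload plus one download per frame.

```python
# Rough PCIe budget for shuttling frames between CPU and GPU memory.
# Assumptions (not measured): PCIe 3.0 at ~0.985 GB/s usable per lane,
# uncompressed 8-bit RGBA frames, one upload plus one download per frame.

LANE_GBPS = 0.985  # approximate usable PCIe 3.0 bandwidth per lane

def max_frames_per_second(lanes, width, height, bytes_per_pixel=4, transfers=2):
    """Upper bound on frames/s that the PCIe link alone would allow."""
    link_bytes_per_s = lanes * LANE_GBPS * 1e9
    frame_bytes = width * height * bytes_per_pixel
    return link_bytes_per_s / (frame_bytes * transfers)

# 1080p over x8 vs x16 -- doubling the lanes doubles the ceiling:
for lanes in (8, 16):
    fps = max_frames_per_second(lanes, 1920, 1080)
    print(f"x{lanes}: ~{fps:.0f} frames/s ceiling")
```

Under these assumptions the x8 link alone is not the hard limit at 1080p, but every extra GPU filter adds another round trip, so the ceiling drops quickly as the transfer count grows.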

In most discussions of the optimal video editing workstation, the arguments seem to be between three contenders for “most important”: CPU clock speed, CPU cores, GPU horsepower.
Perhaps the real “most important” factor (for those using GPU filters during Export) is GPU PCIe width.

Nice research!

This is great when it’s an option. If the source is blisteringly sharp high-quality 4K, then the difference between bilinear and bicubic can become very noticeable. Cheap cell phone footage and old HDV home video would be less problematic. Those sources sometimes don’t have enough detail to reveal the weaknesses of bilinear scaling.

Scaling quality could also depend on how big a size change is being done. For a small change, the difference could be unnoticeable. But to enlarge or reduce by a factor of 3 or more would quickly reveal the quality difference.

I wonder if this is dependent on “canvas size”. If SPR is shrinking a video down to a quarter of its original size first, then would every filter after it only need to operate on a quarter the amount of data, and that’s where the speed difference is noticed? Less work for the following filters? If SPR is last, then Saturation etc would in theory be operating on full-size videos then their results would be all shrunk down at the end. If this was true, we would expect the final export time to increase if SPR was making a video larger rather than smaller (assuming it didn’t get clipped by the timeline dimensions).
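The “canvas size” idea above can be sketched as a toy cost model: assume each filter’s cost is proportional to the number of pixels it touches. All numbers here are hypothetical, chosen only to illustrate the ordering effect.

```python
# Toy cost model: each filter's cost is proportional to the pixels it touches.
# scale is SPR's linear scale factor (0.5 = half width and half height,
# i.e. a quarter of the pixels).

def chain_pixels(width, height, n_other_filters, scale, spr_first):
    full = width * height
    scaled = full * scale * scale
    if spr_first:
        # SPR reads the full frame once; every later filter sees the scaled frame
        return full + n_other_filters * scaled
    # all other filters work on the full frame; SPR scales at the end
    return n_other_filters * full + full

shrink_first = chain_pixels(1920, 1080, 3, 0.5, spr_first=True)
shrink_last  = chain_pixels(1920, 1080, 3, 0.5, spr_first=False)
print(f"SPR first / SPR last = {shrink_first / shrink_last:.2f}")  # < 1 when shrinking
```

With an enlarging scale (say 2.0) the ratio flips above 1, matching the prediction that SPR-first would become slower when the video is being made larger.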

As for the difference between bilinear and bicubic SPR… bicubic is just that complex to calculate and will take about 2x longer compared to bilinear.
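A rough way to see where bicubic’s extra cost comes from: per output pixel it samples a 4x4 neighborhood of source pixels, while bilinear samples 2x2. This is a simplified model; real scalers also pay per-pixel overhead that narrows the observed gap toward the ~2x reported above.

```python
# Sample counts per output pixel for the two interpolation methods.
# Simplified model: a separable 2-D filter with n taps per axis reads
# n x n source samples for each output pixel.

TAPS_PER_AXIS = {"bilinear": 2, "bicubic": 4}

def samples_per_output_pixel(method):
    n = TAPS_PER_AXIS[method]
    return n * n

print(samples_per_output_pixel("bilinear"))  # 4
print(samples_per_output_pixel("bicubic"))   # 16
```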

Option 4: Shotcut is not written multi-threaded enough to take advantage of all the available CPUs. Some cores sit idle. There’s nothing a user can do to optimize this except buy processors that specialize in fast clock speeds rather than high core counts.

True. But the current version of Shotcut isn’t even using GPU for filters, so the discussion was theoretical and moot. In light of this, buying a beefy GPU for Shotcut currently makes no sense. The only benefit would be 7th generation NVENC for higher quality hardware encoding. But encoding doesn’t use CUDA cores (less than 5% at most), so a cheap GTX 1650 Super is the most money that needs to be spent.

High clock speed is where the party’s currently at for Shotcut exporting. The i7-8700K is a beast. Good balance between high clock and high cores if doing software H.264/HEVC encoding.

For most people, this will be true. If the input and output formats are highly compressed like H.264, then disk usage is minimal. If intermediate or lossless codecs like Ut Video get involved, they can saturate the disk link quickly since their file sizes are insanely large.
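Some ballpark arithmetic shows why lossless intermediates stress disks while delivery files do not. The compression ratios and bitrates below are assumptions for illustration, not measurements.

```python
# Rough data-rate estimates: uncompressed vs lossless vs H.264 delivery.
# Ratios and bitrates are ballpark assumptions, not measurements.

def raw_rate_mb_s(width, height, fps, bytes_per_pixel=3):
    """Uncompressed 8-bit video data rate in MB/s."""
    return width * height * bytes_per_pixel * fps / 1e6

raw_1080p30 = raw_rate_mb_s(1920, 1080, 30)  # ~187 MB/s uncompressed
lossless    = raw_1080p30 * 0.5              # assume ~2:1 for a lossless codec
h264_mbit   = 8                              # assumed H.264 delivery bitrate, Mb/s
h264_mb_s   = h264_mbit / 8                  # ~1 MB/s

print(f"raw: {raw_1080p30:.0f} MB/s, lossless-ish: {lossless:.0f} MB/s, "
      f"H.264: {h264_mb_s:.0f} MB/s")
```

Even at an optimistic 2:1 ratio, two or three simultaneous lossless streams would approach the sustained throughput of a single magnetic drive, while the H.264 case barely registers.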


In this case, budget constraints and a GoPro-clone


I think you are right.
By zooming slightly (110% over unzoomed 100%) I have reduced the actual information by 1.1 squared.
I wonder if the effect is greater at greater zooms?
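The effect should indeed grow with the square of the linear zoom factor, since pixel area scales quadratically:

```python
# Pixel area (and hence downstream work) scales with the square of the
# linear zoom factor.

def area_factor(zoom):
    return zoom * zoom

for z in (1.1, 1.5, 2.0, 3.0):
    print(f"zoom {z:.1f}x -> {area_factor(z):.2f}x the pixels")
```

So a slight 110% zoom changes the pixel count by only 1.21x, while a 3x zoom changes it by 9x.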




Not sure I agree here.
In some of the other threads, users have reported Shotcut using all 8 of their threads; at least one user said Shotcut was using all 16 of his threads.
In my case, Shotcut always uses all four of my threads; when there is no other bottleneck, Shotcut uses all four cores at 100%.

This is curious.
Other threads here at the Shotcut forum (when I spent hours reading earlier this week) said Shotcut uses the GPU (if enabled) for some filters and not for others. It was specifically mentioned that when Shotcut uses the GPU for a filter, the loading/unloading of GPU memory is a performance bottleneck.

AMD good. Intel bad. LOL! :rofl: :rofl: :rofl:

Thanks for the tip, @Austin! If I invest in big disks so I can actually use lossless intermediate files, now I know they need to be blazingly fast big drives.


Already on my wish list. Thanks for the confirmation, @Austin

Sidenote - My Setup:

HP Compaq 6305 Pro Small Form Desktop,
AMD Quad Core A8 5500 3.2 GHz,
480 GB SSD,
GeForce GT 710,
Kubuntu Linux 18.04

Cameras, (at the time this video was recorded):

Nokia Lumia 920 cellphone,
Nokia Lumia 830 cellphone,
cheap Chinese GoPro-clone from the WalMart clearance table,
Panasonic HDC-SD80

The experiment here is taken from a remastering session; I am producing an edit-friendly intermediate file with the best audio from the Panasonic overlaid on the GoPro-clone video to get one of four simultaneous views for final editing.


Sidenote - My Process:

(1) Shoot three or four simultaneous videos with tripod-mounted cameras.

(2) Produce that many time-synchronized intermediate mp4 files, equal length, audio sync problems corrected, flesh-tones color matched, camera-angle problems corrected, etc.

(3) Final edit from the intermediate files.

Before proxies, this was the only way to avoid the horrendously slow rendering during editing with all filters in place. I had tried using the semi-automatic edit-friendly file conversion in Shotcut when it was introduced; I was not satisfied with the results, so I began to do my own.

Unfortunately, there is a ton of nuance to this. There is a difference between the number of threads used to generate a frame (for filters and compositing), versus the number of threads used to encode a frame (by libx264). If encoding with H.264/HEVC, the encoder will saturate any remaining CPU and make it look like it’s being used well. But Shotcut itself may only be using a single core to generate the frame.

The easiest way to verify this is to use an export codec with near-zero CPU load, like hardware encoding or libx264 with preset=veryfast, and then we can see the cores and threads used to generate a frame.

The other nuance is which filters are used. Some thread very well. Some like “Reduce Noise: Wavelet” only use a single thread (but it’s awesome at its job so I use it anyway). Using a zero-load encoder is the main way to tell what Shotcut is really using, as opposed to what the encoder is using.

Since you noticed your CPU usage at 70% instead of 100%, that tells me the encoder is waiting for a frame to be generated and is sitting idle.
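This situation can be sketched as a toy two-stage pipeline: a single-threaded frame-generation stage feeding a multi-threaded encoder. The timings below are hypothetical, chosen only to show how a generation-bound pipeline lands below 100% CPU.

```python
# Toy two-stage pipeline model: single-threaded frame generation feeds a
# multi-threaded encoder. Timings are hypothetical. The model only holds
# when generation is the bottleneck and the encoder has cores to spare.

def cpu_utilization(gen_ms, encode_ms, cores):
    """gen_ms: single-threaded time to generate one frame.
    encode_ms: total CPU time to encode one frame, spread across cores."""
    wall_per_frame = max(gen_ms, encode_ms / cores)  # slower stage sets the pace
    busy_core_ms = gen_ms + encode_ms
    return min(1.0, busy_core_ms / (cores * wall_per_frame))

# e.g. 40 ms to generate a frame, 72 ms of total encoding work, 4 cores:
print(f"{cpu_utilization(40, 72, 4):.0%}")  # 70% -- encoder idles between frames
```

In this sketch the encoder finishes each frame in 18 ms of wall time and then waits 22 ms for the next one, which is exactly the kind of idle gap that shows up as ~70% overall CPU use.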

This used to be true. GPU was disabled many versions ago due to instability. It can be manually turned on, but few people do it. Not recommended. Those were probably old discussion topics you found.

Whatever makes you happy lol. :rofl:

For real-time editing, this would be true. If you edit with proxies, the proxies are low disk usage and there is no speed issue with affordable giant magnetic drives. I still use large magnetic HDDs since I use proxies. Proxies are amazing.