Export experiment with Parallel Processing and Hardware Encoder options

I don't understand why HE or not makes a difference in file size or bitrate. If you set the exact same export parameters, the outcome should be exactly the same as far as the encoded video file is concerned.

I literally gave all the parameters in my video, and I even talk about a better-controlled environment at the end, so the sensitivity is unjustified criticism. Of course I want to dial in exactly the best way to do this, but go and spend a significant amount of time doing something for not just yourself but the community, and see if you aren't a little sensitive to that kind of response.

As for the rest, I am down. I want to compile a list of must-haves and nice-to-haves so I can create 1080p30fps and 4K60fps projects for testing.

That one I wasn't expecting either. I assume the encoder decides on the bitrate, but the larger file size baffled me as well.

@Kaje68, I acknowledge that I misinterpreted your request to Arpit, and I apologize for that. I'm sorry.

In hindsight, I should have included my reason for commenting from the start, as I do recognize that I jumped into the conversation a bit strong. The reason is that this thread opened the doors for export testing and hardware racing... and hardware means money. We know from previous threads that people monitor these races to make purchasing decisions on which GPU to buy. People can't test all GPUs themselves, so they look for comparison results in places like this. If tests are not done in a standardized (or at least well-documented) way, along with accurate interpretations, then people can get wrong ideas about what to expect. My goal was to prevent people from spending big money on hardware and then being disappointed that it doesn't deliver the improvements they expected, simply because their encoding environment is different. I should have stated my intentions earlier, and perhaps my comments would have seemed less personal and direct in that context.

It was never my goal to be dismissive of your hard work or to upset you, although I can see how my list of testing caveats, rather than praise of the video I didn't watch, could make it seem that way. I was in a rush and acknowledge that I did not use the best wording.

In regard to the video, I did say I would get to it this evening, as opposed to dismissing it entirely. That's only a few hours away. The reasons I didn't watch it sooner were 1) my employer, but also 2) the video is a 4K screencast with teeny-tiny fonts that are completely unreadable on a cell phone, which is all I had for viewing at the time. I couldn't watch the video in a meaningful way even if I tried (which I did), because all the details I needed were microprint.

Because I want a good working relationship with you, I embezzled some computer time and a big screen to watch your video. Right away, I was impressed with your speaking voice and mic setup, which I appreciate greatly as a former recording engineer myself. So, having now seen your video... it's a great walk-thru of your testing process. I also feel like my former points are still valid.

If I'm tracking things right, the basic premise of this thread is to post your individual testing results, encourage others to post their test results too, and discuss it all on the forum, perhaps with the end goal of a GPU purchase decision. (I know you didn't extend it to that last part, but rest assured that other people will.)

This is where I ran into problem #1: How do I learn your testing methodology? That means system specs, encoder settings, filter assortment, and export time and file size results. This is maybe three lines of quickly readable text if posted to the forum. Your system specs were posted to the forum... nice. But to get your encoder settings, I had to freeze-frame at 0:20 to learn the software encoder was libx264, then freeze-frame again at 4:17 to learn the hardware encoder was h264_nvenc. The settings panels went by in less than one second because the screencast video was sped up, so this information was difficult to locate. Since your encoder settings were not posted to the forum, it took 10 minutes of my time just to figure out your methodology in order to contribute similar tests. I think other users would have found it more convenient had these settings been posted to the forum, or spoken in the video, or listed in a summary table in the video. I had to dig much too hard to get relevant information. Other people would too, which will limit both the quantity and quality of responses to your call.

This leads to problem #2: When asking other people to contribute results, what information do they need to send in so that meaningful analysis can be done? Neither your posts nor your video provided people with a list of specs to submit with their test results. Analysis is impossible without the necessary data. So, to help you achieve your benchmarking goals, I filled in that gap with my first post, which was directed at everyone except you if you read it again. My first post listed in one place all the information that would need to be sent in by everyone else (since you had already provided your results), along with a brief explanation of why each spec was necessary in case someone thought the list was too long.

I'm legit trying to help you out. I'm interested in test results too and glad you got the ball rolling again. I'm just trying to help it roll well, because there are a lot of caveats to making tests that are comparable across environments.

For instance, in your video, libx264 is using GOP 299 while h264_nvenc is using GOP 30. How can hardware compete on file size when its GOP is only 10% the length of what libx264 gets? A keyframe every 30 frames instead of every 299 means far more full-size keyframes, and keyframes cost many more bits than predicted frames. This is one reason the hardware file was so much larger. Also, libx264 used preset=fast, which may not be a fair comparison to Turing/Ampere. Had it been preset=medium, the export time difference would be wider, maybe to the point that hardware would have had the advantage. While I do find the video's test results useful and insightful, I still believe they are a little incomplete. Even if export times had been similar, some people may value hardware encoding simply because they can do other things on their computer while exporting from Shotcut, since the CPU usage is lower. These are all points that people need to be aware of before investing big money in new hardware... or before believing that hardware will do nothing special for them when actually it could.
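
For anyone who wants to rerun the comparison on more even footing, here is a minimal sketch (Python driving the ffmpeg CLI, assumed to be on PATH; "input.mp4" and the quality numbers are placeholders, and note that CRF 23 and CQ 23 are separate scales, not equivalent quality levels across the two encoders):

# Minimal sketch: same source, same GOP length, quality-based rate
# control on both sides, so neither encoder pays for keyframes the
# other avoids.
import subprocess

SOURCE = "input.mp4"  # hypothetical test clip
GOP = 300             # identical keyframe interval for both encoders

# Software encode: libx264 at its default preset=medium rather than fast.
subprocess.run([
    "ffmpeg", "-y", "-i", SOURCE,
    "-c:v", "libx264", "-preset", "medium", "-crf", "23",
    "-g", str(GOP), "sw_x264.mp4",
], check=True)

# Hardware encode: h264_nvenc with the same GOP length. CQ 23 here is
# NOT the same quality level as CRF 23 above; treat both as placeholders.
subprocess.run([
    "ffmpeg", "-y", "-i", SOURCE,
    "-c:v", "h264_nvenc", "-rc", "vbr", "-cq", "23",
    "-g", str(GOP), "hw_nvenc.mp4",
], check=True)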

I realize these are all implications that go far beyond anything you intended with your video, and that is where the friction is. But we know from the past that people often respond to threads like this with purchases or spec competition in mind, and it helps to either standardize the test or clarify its caveats completely before people get wrong ideas and get upset. Been there, done that, trying to spare you the T-shirt.

To wrap up, I wasn't trying to criticize your results or video, because they represent your specific environment. If it works for you, that's all that matters, and it's great to see your results. I was trying to help everyone else for when they send in their test results, and to prepare everyone for how to interpret results when the inevitable comparisons start. In retrospect, I should have sat back longer and let you take the lead on designing the next phase of testing, and then I wouldn't have had to say anything. I rushed the play because this is familiar territory for me. Sorry I jumped in too soon. I'm also not allergic to 8-minute videos, and apologize that my wording seemed dismissive on the grounds of time alone. I simply couldn't watch it on a big screen at that time, which I needed since the fonts were unreadable on my phone.

The code inside the libx264 software library is not the same code embedded in NVENC circuitry. They each encode video in their own unique way, and therefore their results will differ in quality and size.

Specifically, hardware gets its speed advantage by breaking a frame into tiles and working on those tiles independently with dedicated cores. Tiles get limited (or zero) visibility into what's happening in other tiles, to prevent mathematical and synchronization bottlenecks. Being separate is what allows the tiles to be processed in parallel, which makes the process faster than software.

Meanwhile, software will look at the entire frame and take advantage of optimization opportunities that hardware physically can't see, because software has no tile boundaries to work around... though it gives up the speed that tiling provides.

This is a high-level view... the lines get blurry with Turing. But the basic issue is that hardware and software encoders approach the job in totally different ways because they have different hardware capabilities available to them. Therefore, their results are very different even with the same settings.
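
To make the tile idea concrete, here is a toy sketch (plain Python, not anything from NVENC or Shotcut) of why independent tiles parallelize well but cannot share information:

# Toy illustration only: split a frame into tiles and "encode" each one
# in a separate worker that cannot see its neighbors.
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def encode_tile(tile):
    # Stand-in for per-tile encoding work; a real encoder would run
    # motion search and transforms here, but only within this tile.
    return int(tile.sum())

def encode_frame_tiled(frame, tile_size=64):
    h, w = frame.shape
    tiles = [frame[y:y + tile_size, x:x + tile_size]
             for y in range(0, h, tile_size)
             for x in range(0, w, tile_size)]
    # Tiles run in parallel precisely because they share no state --
    # which is also why redundancy that spans tiles goes unnoticed.
    with ProcessPoolExecutor() as pool:
        return list(pool.map(encode_tile, tiles))

if __name__ == "__main__":
    frame = np.random.randint(0, 256, (1080, 1920), dtype=np.uint8)
    print(len(encode_frame_tiled(frame)), "tiles processed independently")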


OK, a lot to digest here.

A, I fully appreciate and accept your apology, and offer my own for being sensitive; but as you seem already aware, even though it is a trivial video, I did spend time on it.

B, part of what you spoke about I'm not currently competent on, so I will have to get educated on that. Specifically, the things you said you needed to pause on. I sped through those things because I, incorrectly, thought they weren't important.

C, I am going to learn everything you've talked about for better testing going forward, and will probably need to pick your brain. Before export settings are figured out, the source and project need to be figured out. I'm thinking maybe two twenty-minute source videos on separate tracks, edited down to make the project about 10 minutes. Audio muted on one track, and an additional audio track added (like background music). The two video tracks are cropped and resized for side-by-side, and then add in some transitions and fades. Off the top of my head, that feels like it would cover the most common usages and be a fair test, but that's based on my very limited personal experience, so any suggestions are welcome.

D, I sincerely appreciate the voiceover compliment. I actually feel like I'm pretty trash at it and mediocre at best with audio editing. I do that in Audacity, as that makes more sense to me on the audio side. Nothing makes you realize how many filler words you use, or how often you smack your lips, like recording yourself and playing it back.

Long story short, we cool, and with your help and anyone else interested, I will compile a list of ideas for creating the project. I want to do one at 1080p30fps and one at 4K60fps (assuming there isn't a flaw there), and a list of export options for the various test runs.


A finished video is like birthing a child to its creator, and I should have chosen wording that was more respectful to the time and energy you put into your video. I've been on the receiving end too after making a video, and I remember how amplified every piece of feedback felt, good or bad. So I should have known better. I actually do like your video because it shows us exactly what you did to get the results you got, and removes all confusion. Props for that, and sorry for my abrupt early responses.

Two projects at 1080p30 and 4K60 sound like a useful contrast. My first thought is to add an occasional text filter (both Simple and Rich) to the projects. These filters sometimes tore in the past when keyframed with Parallel Processing turned on, so it's a good test for trouble there. Also, they invoke the compositor, which is a slower processing pipeline in Shotcut (by nature of more tracks to smash together). If somebody uses text extensively, like baked-in subtitles for karaoke night, then it could significantly affect the threading efficiency of Shotcut, which changes the number of cores available for software encoding, which changes the speed difference between hardware and software, and so on.

EDIT: If you need an example of a single-threaded filter for testing purposes, "Reduce Noise: Wavelet" will do the trick.

Aside from adding a few text filters and maybe a few PNGs with transparency, the sample project components sound good! Maybe it's worth having two project types... one that has only video and no text filters, like a gaming video would, then another project that is heavy on text and effects, like a Top 10 List video would. Those two styles create radically different processing patterns in Shotcut. There is no right or wrong project so long as it's documented in a way that people can tell whether their own project would get similar results.

I've covered the most common testing caveats in earlier posts, so I'll step back and not interfere unless something big and new appears. And sure, pick my brain too, I'm here to help.

We're good and thanks again.

I've only ever used the Rich Text filter, I believe. What is the difference between the two?

Also, I haven't had much reason to use keyframes yet (though I need to learn them). Would using those alone have any impact on performance?

As for the project itself, does the source video matter? If the source is 1080p30fps and you export to 4K60, would you expect the same export performance as if it were 4K60fps to 4K60fps? The reason I ask is that if it has no impact on the actual export, then there's no need for two separate projects; the person running the test could just export to whatever they want.

I was just thinking, what would be REALLY interesting is if Shotcut could auto-detect, based on your specs and your project state, exactly the best options to use when exporting.

I tried my video, which is 5 mins in length.

With Hardware Encoding, it was 3 mins and 12 secs.

And with parallel processing, it was 3 mins and 22 secs.

I unfortunately didn't try both together; I didn't have time because I had to go get the 2nd dose of the vaccine.


Thanks Austin, for the information - I was not aware that HE can lead to different results, and it seems it can end up with quite a lot of difference - not only file size but also playtime of the encoded file :astonished:
I also know close to nothing about NVENC. Is it another codec, specialized in using GPU resources for encoding, and completely different from H.264, I guess? Can't you use the exact same codec for both HE and non-HE? If you used H.264 (or any other) for both, would the results still be so different?
I understand there must be a different approach to encoding/compressing when using HE, with lots of parallel work, so the tiles will be treated on their own and can probably lead to artefacts at the edges?
Is it right that the overall video quality will, on average, be lower when using HE?
Sorry, these are probably beginner questions - I have nearly always used H.264 with non-HE so far.

NVENC is short for NVidia ENCoder. Some generations only produce H.264 (AVC), while later ones are capable of producing H.265 (HEVC). For an overview see:


@RilosVideos If your PC/laptop does not have an Nvidia graphics card, the chip on which its CPU resides will almost certainly house an integrated GPU (Graphics Processing Unit). E.g. my laptop has an AMD Ryzen 5 4500U CPU with an integrated AMD Radeon Vega 6 GPU, and my old Surface Pro 4 has an Intel HD Graphics 520 GPU integrated. The Radeon GPU uses AMF (Advanced Media Framework) for encoding, while the Intel HD uses QSV (Quick Sync Video), just as Nvidia cards use NVENC.

These GPUs can be used by Shotcut/ffmpeg when exporting videos, rather than using libx264/libx265 software encoding. So, as you can see, there are quite a few options to try using hardware encoding.


Thanks, Elusien, for the explanation.
I have a Lenovo Thinkpad T50 from 2017 with an Intel® Core™ i7-6820HQ vPro processor (8 MB L3, up to 3.6 GHz), graphics: NVIDIA Quadro M1000M 4 GB (or NVIDIA Quadro M2000M 4 GB), and 32 GB total RAM, so not too bad.

I want to understand how HE influences the output quality - I'm not sure about that at all.
I don't care if the encoding takes a little longer; I am more interested in the video quality and file size.
I only do video editing on an occasional basis, nothing professional :slight_smile:

If HE produces a bigger file size at the same video quality, or the same file size at worse video quality, I would prefer not to use HE. But if the effect is only very marginal, or can be cured by a minimally bigger file size, that would be OK.

@RilosVideos Regarding filesize, almost all HE videos I have produced take up quite a bit more space than their respective libx264 (AVC) and libx265 (HEVC) software counterparts.

Regarding quality, that is more subjective and more recent HEs have improved a lot in this metric over the earlier models.

It is definitely worth doing a few experiments. Besides subjective estimation of quality, Shotcut itself can do a more objective estimation. If you export a file, then right-click on the exported job and choose "Measure Video Quality", Shotcut will compare each exported frame with the original and provide some metrics on Peak Signal-to-Noise Ratio (PSNR) - one type of quality metric. I did this with 2 short exports. The results were:

              <- AVC (H.264) ->||<- HEVC (H.265) ->
Metric        |   HE  |   SE   ||    HE   |   SE
--------------|-------|--------||---------|--------
Elapsed time  |  33s  |  32s   ||    29s  |  39s
Filesize      |  16MB |  10MB  ||    13MB |   9MB
PSNR (average)|  ~54% |  ~52%  ||    ~45% |  ~47%

From the PSNR metric (on the Y component) we can see that (by this measurement) for H.264 (AVC), HE produced a slightly better quality video than SE, but it did so in roughly the same elapsed time as SE and created a file around 1.5 times as big.

For H.265 (HEVC), HE (Hardware Encoding) produced a slightly worse quality video than SE (Software Encoding), but it did so in roughly 75% of the elapsed time of SE and created a file around 1.5 times as big.

All these values will change for different videos being exported, and in my opinion it is also best to study the resulting videos rather than take PSNR as an absolute measure of quality.
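
For anyone without Shotcut handy, a similar measurement can be scripted with plain ffmpeg (a sketch; assumes ffmpeg on PATH, the file names are placeholders, and note that ffmpeg reports PSNR in dB rather than the percentage Shotcut shows):

# Compare an export against its source with ffmpeg's psnr filter.
import subprocess

result = subprocess.run([
    "ffmpeg", "-i", "exported.mp4", "-i", "original.mp4",
    # The psnr filter pairs up frames from the two inputs and prints
    # average PSNR per component (y/u/v, in dB) on stderr.
    "-lavfi", "psnr", "-f", "null", "-",
], capture_output=True, text=True)

for line in result.stderr.splitlines():
    if "PSNR" in line:
        print(line)  # e.g. "... PSNR y:43.2 u:47.1 v:46.9 average:44.0 ..."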


Re: both text filters, Simple and Rich

Not too much. Features, complexity, etc. differ, but the processing demands on the CPU are probably similar. My main reason for testing both is to verify whether tearing reappears with Parallel Processing turned on.

I can't say for sure, but I doubt that keyframes themselves would affect performance. All they do is change the parameter values in the filters, and then the rendering effort of the filter is the same... unless it is keyframed to move outside the screen boundaries, perhaps.

EDIT: I forgot to mention that the point of keyframed text is to reveal any tearing due to Parallel Processing. If text sits still, we wouldn't notice tearing even if it happened.

Short answer... I would expect a significant difference.

There are a few layers involved.

The first layer is video file decoding. In a worst-case scenario, 4K60 will take 8x the decoding effort of 1080p30, because 4K is 4x the pixels per frame and 60fps is 2x the frames per second. That's almost an order of magnitude more CPU work, so it isn't trivial. This alone would sufficiently convince me to have separate 4K60 project sources for a performance test.

The next layer is the project timeline resolution. If sources are 1080p30 and the timeline is 1080p30, but then Export > Advanced > Resolution is changed to 4K60, we have at least two performance measurement problems (a short arithmetic sketch after the list puts rough numbers on both):

  1. Filter processing usually happens at timeline resolution, meaning at 1080 instead of the overridden 4K output. The 4K upscale doesn't happen until after the frame has been completely built at 1080. A filter running at 1080 is only a quarter the CPU effort of running at 4K. There will be a significant difference in export performance between 1080->4K versus 4K->4K for this reason in particular.

  2. A source file that's 1080p30 is only 30fps... where do the additional frames come from to make 60fps? When doing an Export > Advanced > Resolution/FPS override to a higher frame rate, the frames from the 30fps source and timeline simply get duplicated to create 60fps. Duplication is much less CPU effort than doing actual filter processing to create those additional 30 frames per second. So this result would be misleading regarding 60fps performance.
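
Here is the quick arithmetic behind both points (plain Python, just to put numbers on the workload difference):

# Relative pixel throughput of the combinations discussed above.
def pixel_rate(width, height, fps):
    return width * height * fps

r_1080p30 = pixel_rate(1920, 1080, 30)
r_4k60 = pixel_rate(3840, 2160, 60)

# 4K60 pushes 8x the pixels per second of 1080p30: 4x the area times
# 2x the frame rate. Decoding and filter effort scale accordingly.
print(r_4k60 / r_1080p30)  # -> 8.0

# Point 1: a filter at 1080 timeline resolution does a quarter of the
# per-frame work of the same filter at 4K.
print(pixel_rate(1920, 1080, 1) / pixel_rate(3840, 2160, 1))  # -> 0.25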

A setup that could somewhat mitigate those issues is to have 1080p30 sources, but make the project timeline 4K60. This will cause filter processing to happen in 4K instead of 1080, and will also cause filter processing to happen at 60fps instead of 30fps. The downsides are that decoding effort is still badly underrepresented, and there isn't true 60fps frame uniqueness in the source. The output will also appear fuzzy since it was upscaled from 1080 to 4K, which means the encoders will probably crush the output extra hard since the detail is lower. Encoders love to apply heavier compression to fuzzy stuff because they know there's less detail for the eye to notice, meaning more opportunity to remove stuff and reduce file size without being noticed as a quality loss. It will be very difficult to do output quality comparisons or VMAF scoring with upscaled output because all the fine detail got smeared inside that 1080->4K upscale.

This one is a challenge. How is "best" being defined? Does somebody need a visually lossless master for archiving? (Use CRF 16 and call it a day if so.) Or do they need to hit a target bitrate or file size that's required by whoever they're submitting the video to? That's going to be custom settings by nature. What if they want "just barely good enough" so the file size is as small as possible? The ideal settings would be very scene-specific, where dark movies would get different settings than bright outdoor movies, for instance. Basically, every definition of "best" comes with a different combination of settings to get there. If you can rigorously define what "best" means to you, then it probably becomes possible to automatically recommend encoder settings that will achieve it.
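
To illustrate how each definition of "best" lands on different settings, here is a sketch (Python driving ffmpeg, assumed on PATH; file names, durations, and numbers are placeholders, and the bitrate math ignores audio for simplicity):

import os
import subprocess

SOURCE = "master.mp4"  # hypothetical input

# "Best" = visually lossless archive: quality-based CRF, and the file
# size lands wherever it needs to.
subprocess.run(["ffmpeg", "-y", "-i", SOURCE,
                "-c:v", "libx264", "-preset", "slow", "-crf", "16",
                "archive.mp4"], check=True)

# "Best" = a required file size: compute the video bitrate from the
# size budget, then use two-pass average bitrate encoding.
target_mb, duration_s = 100, 600
bitrate_k = target_mb * 8 * 1000 // duration_s  # ~1333 kbps
subprocess.run(["ffmpeg", "-y", "-i", SOURCE, "-c:v", "libx264",
                "-b:v", f"{bitrate_k}k", "-pass", "1", "-an",
                "-f", "null", os.devnull], check=True)
subprocess.run(["ffmpeg", "-y", "-i", SOURCE, "-c:v", "libx264",
                "-b:v", f"{bitrate_k}k", "-pass", "2",
                "sized.mp4"], check=True)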

The option that's interesting to me, when I have time or necessity, is to do an intermediate or lossless export out of Shotcut to preserve maximum detail. Then, process that export with Av1an, which is an encoding tool that breaks the video into per-scene chunks and encodes them separately with their individual best settings. "Best settings" are determined by how much difference in VMAF scores you're willing to accept between the encoded output and the original file. (It takes the encoder several tries to find an acceptable configuration.) Then, it stitches the chunks together into a final file. This is essentially what Netflix does to get their movies as small as possible to save on bandwidth, yet make the best usage of that bandwidth for highest quality. Each scene gets its ideal compression settings instead of having a single setting that applies to the entire video.
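
The VMAF scoring itself can also be done outside Av1an; a sketch (requires an ffmpeg build compiled with libvmaf, which not all builds include; the file names are placeholders):

import subprocess

result = subprocess.run([
    # First input is the distorted/encoded file, second is the reference.
    "ffmpeg", "-i", "encoded.mp4", "-i", "reference.mp4",
    "-lavfi", "libvmaf", "-f", "null", "-",
], capture_output=True, text=True)

for line in result.stderr.splitlines():
    if "VMAF score" in line:
        print(line)  # aggregate score 0-100; higher means closer to source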

To @RilosVideos, to access these hardware encoders in Shotcut, you can tick the "Use hardware encoding" checkbox on the Export panel and hope it auto-detects your card, or you can go to Advanced > Codec and manually set the codec to one of these:

h264_amf
h264_nvenc
h264_qsv

That means you'll get an H.264-compliant video, but use the respective hardware to produce it.
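
If you are not sure which of those your ffmpeg/Shotcut build offers, here is a quick check (a sketch; assumes ffmpeg on PATH, and note that an encoder being listed still doesn't guarantee the matching GPU and driver are present):

import subprocess

out = subprocess.run(["ffmpeg", "-hide_banner", "-encoders"],
                     capture_output=True, text=True).stdout

for name in ("h264_amf", "h264_nvenc", "h264_qsv"):
    print(name, "listed" if name in out else "not in this build")

A short test export is the real proof, since an encoder can be compiled in yet still fail at runtime without the right hardware or driver.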

Sounds like you're on track to be a software encoder for life. :rofl: I am too at this rate. I'm not aware of an AV1 hardware encoder within my budget yet.

Hardware encoders are more focused on speed, and are generally willing to sacrifice a little video quality and file size to get there. Hardware has literally the opposite priorities from yours.

Hardware is currently at the "good enough" level for most people, and will only get better. Perhaps it will fully outshine software one day. But we're not quite there yet.


Information overload; I'll parse this in more detail later, but I wanted to ask if you had thoughts on the performance issues? We got so focused on the export itself that we ignored the fact that Shotcut started running terribly and actually froze a couple of times. As you've seen, I have a pretty high-end PC, so that one baffles me.

Also, do you know if, when you download the latest version of Shotcut, you can keep your project history and any other customizations?

Thanks @Elusien and Austin for your detailed answers! So it looks like I didn't do anything wrong so far. The quality differences seem minor, but file sizes can increase quite a lot with HE. So I will probably keep using SE, and will probably do a test with the next-generation computer - coming soon. Normally I am not in a hurry, and I have a separate PC to work on if one is busy computing. Surely HE with Nvidia will gain more and more ground in the future and is probably the way to go. I am a bit involved in 3D rendering, where all major tasks have been done on the GPU for years already. Especially simulations, like smoke, fire, and water, are mostly done via CUDA etc.
So it looks i didnā€™t do anything wrong so far. The differences quality-like seem minor, but file sizes can increase quite a lot with HE. So i will probably keep it on SE, will probably do a test with the next generation computer - coming soon. Normally i am not in a hurry and have a separate PC to work on if one is busy computing. Sure HE with Nvidia will gain more and more in the future and is probably the way to go. I am a bit involved in 3d-rendering where all major tasks are done on the GPU for the last years already. Esp. simulations like for smoke, fire, water are mostly done via Cuda etc.

Yeah, that was a bit peculiar. It's really hard to say what's going on without a debugger running.

One thing I've anecdotally noticed, but have no source-code evidence to support, is that Shotcut can sometimes act like it's frozen when it really isn't, if a lot of sped-up videos are involved. Your screencast was sped-up videos for the most part. The final video was 8m40s, but to Shotcut, it still feels like a 1h20m project because that's how much source video it is processing. In the case of sped-up projects, I've noticed that Shotcut may not respond to clicks and mouse movements, but it's still very busy underneath if given time to finish. For whatever reason, it seems much more responsive when clips run at normal speed. I have no scientific ability to prove this. It's just a tendency that I've noticed in my own projects.

Project history as in undo/redo history between editing sessions? No. That history is not stored in the .MLT file.

Project history as in Recently Opened Documents? Yes, that sticks around between sessions.

Other customizations? It's in the registry on Windows, or a config file on Linux. All platforms can be configured to use an App Data Directory (on the Settings menu), which lets you store Shotcut executable files and config files together on a USB stick that can run fully portable between different computers.

It depends on the rate control method. If you use quality-based VBR, then the percentage is completely different per codec and encoder. IOW, 55% x264 is different than 55% h264_nvenc, which is different than 55% HEVC. The % sets a CRF or quantization level that is defined per implementation, and Shotcut makes no attempt to normalize them, as that is quite difficult given the number of codecs and encoders. It would require several videos with some diversity, times many % levels, times a couple of different metrics. Then all the data would be analyzed, only to find that the differences between the encoders are not a linear function. Basically, no one is doing this - commercial or open source.
Otherwise, with average or constant bitrate, there should not be a big difference unless perhaps you go to a very low or very high bitrate.
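
To make that concrete, here is an illustration (the mapping below is hypothetical, not Shotcut's actual code) of how a quality % slider can land on each encoder's own 0-51 quantizer range, where identical numbers still mean different things:

# Hypothetical mapping for illustration: quality % -> quantizer value.
def x264_crf(quality_pct):
    # x264's CRF runs 0 (lossless) to 51 (worst) on its own scale.
    return (100 - quality_pct) * 51 / 100

def nvenc_cq(quality_pct):
    # NVENC's CQ also spans 0-51, but equal numbers do not mean equal
    # visual quality or bitrate compared to x264's CRF.
    return (100 - quality_pct) * 51 / 100

# Same slider position, same number, different results on disk:
print(x264_crf(55), nvenc_cq(55))  # 22.95 22.95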


I also agree, after doing a similar experiment: hardware encoding makes no improvement in the speed of encoding. My system is:
Processor | AMD Ryzen 7 4700U with Radeon Graphics 2.00 GHz
Installed RAM | 16.0 GB
System type | 64-bit operating system, x64-based processor, Windows 11
In fact, without HE, I achieve a smaller file size. I'd like to know if anyone else has seen the same effect.