Built-in proxy generation

@shotcut, your concern that an edit-friendly proxy could mask problems in a source file is definitely understandable. I assume the biggest worry would be users complaining on the forum that their proxies render fine, but their variable frame rate cell phone source video does not. Our scripts account for this, which I’ll describe at the end of this post for anyone interested.

To answer @DRM’s question, I used to be a C++ programmer back in the day before switching gears to other job-related languages. I dug through the Shotcut source code and now I understand why round-tripping relative coordinates through the pipeline is not the most trivial thing to do, especially with keyframes in the mix. There is a “use_profile” property already defined in some places, but it would be insufficient because the profile resolution itself would be changing in our workflow. I think I’ve found a way to make coordinates completely relative, in theory. I’m walking through some Qt tutorials to get familiar with the Qml components, and it looks pretty straightforward.

My first thought would be to maintain backward compatibility in the MLT XML by adding percent values as new attributes, so the existing attributes can remain pixel-based and filters can be retrofitted one by one as time allows. The Qml would hunt for percent attributes first, and if not found, use pixel attributes. For this to work, I think it would require alterations to the MLT XML format, possibly the MLT Framework itself for rendering, and the Shotcut Qml files for UI data entry and coordinate translation. I don’t think the filters themselves (like frei0r) would need modification; I think the coordinates can be translated before passing them into the filters. I found documentation in the code going back to 2005, so I’m sure this program is like a precious baby to @shotcut, and I’d only want to attempt a change this big with Dan’s blessing. I can say it wouldn’t be a fast change… I’m also a gigging musician and the Christmas season is coming up, which means many evenings are spent in rehearsals instead of coding. Maybe I could have a prototype partway into the new year. If someone else wants to tackle this before me, go for it. I just want to see it work no matter who gets it there first. :slight_smile:

On a different note, @D_S gave me something new to think about. Two years ago when I developed this proxy process, I attempted HuffYUV proxies, but the CPU utilization to decode the video was so high that playback would glitch terribly. That’s why I developed the MP4 CRF 12 solution. But at his suggestion, I went back and tried HuffYUV again and wow, its CPU utilization is now lower than that of the MP4 proxies! I guess we have a new version of Shotcut or the bundled ffmpeg to thank for that.

This prompted me to do a full review of lossless codecs for proxy purposes. I’m not a fan of DNxHD for proxy work because it’s too picky about the resolutions it officially supports (480x270 is not one of them). ProRes could have been an option, but as best I recall, its decoding speed was not as fast due to its thread model in ffmpeg. Anyhow, my findings were (the first entry is my original lossy format, included as the baseline):

-c:v libx264 -profile:v high -crf 12 -intra -tune film -preset veryfast
The original proxy format. About 10% the file size of 4K H.264 100 Mbps sources.

-c:v libx264 -crf 0 -intra -tune film -preset veryfast
H.264 Lossless. The CPU to decode is way too high. Disqualified.

-c:v ffv1
FFV1. The CPU to decode is too high, and it also produces weird glitch patterns on occasion. Not cool.

-c:v huffyuv -pred left
Less CPU to decode than MP4 CRF 12. File size is 5x that of MP4 CRF 12.

-c:v utvideo -pred left
Barely more CPU than HuffYUV, still less than MP4 CRF 12. File size is 4x that of MP4 CRF 12.

-c:v utvideo -pred median
Barely more CPU than Ut Video Left, still less than MP4 CRF 12. File size is 3.5x that of MP4 CRF 12.

MP4 CRF 12, HuffYUV, and Ut Video all transcode/encode at essentially the same speed through ffmpeg (within seconds on huge files).
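
If anyone wants to sanity-check the decode-cost numbers on their own hardware, a simple approximation (a sketch rather than my exact methodology; the file names are placeholders) is to decode each candidate proxy to a null target with ffmpeg’s -benchmark flag and compare the reported times:

ffmpeg -benchmark -i "proxy-utvideo.avi" -f null -
ffmpeg -benchmark -i "proxy-crf12.mp4" -f null -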

So congratulations, @D_S, your suggestion has prompted me to change my proxy generation scripts. I now use Ut Video Left because the file size, while 4x larger than my current proxies, is still small fry in the grand scheme of things. It also provides the lowest CPU utilization among the lossless codecs that support RGB+Alpha and up to 10-bit 4:4:4 colorspace. HuffYUV doesn’t go that high. So for essentially the same CPU usage as HuffYUV at decoding, Ut Video is a smaller file with greater colorspace support. Wikipedia suggests it was developed to be an alternative to HuffYUV, and I’m taking them up on it as my new one-and-done format for proxies. Plus, Ut Video is still actively developed and maintained.
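
For anyone who wants to adapt it, the core of the new proxy line looks roughly like this (a sketch, not the verbatim script: file names are placeholders, MKV is just one container that can hold Ut Video, the audio settings are illustrative, and scale=-1:270 preserves the aspect ratio while fixing the height at 270):

ffmpeg -i "source.mp4" -vf scale=-1:270 -c:v utvideo -pred left -c:a aac -b:a 192k "..\Proxy\source.mkv"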

The move to a lossless proxy format is already providing great benefits to our workflow because we can scale to full-screen previews without compression blockiness or dancing noise like MP4 did. We can also color grade against these lossless color-accurate proxies and it holds up perfectly when switching back to 4K. I also did a test just to see how ridiculous I could get, where I stacked up as many tracks of video as I could with every clip having two filters applied: opacity at 20%, and a color grade. The opacity is to ensure the entire stack of videos is composited and evaluated. Shotcut has an optimization to skip lower tracks when a track is opaque and has a blend mode of Over. Using opacity skips this optimization and forces the whole stack to be evaluated. And of course, GPU acceleration is turned off. In this dreadful scenario, I was able to stack up 18 tracks of proxy video with zero glitches in the audio. Glitches started at track 19. That’s insane, not to mention that the test was done on old hardware. There’s no way Shotcut or most other video editors would do 18 tracks of native 4K video with filters, especially on cheap old hardware. This proxy thing with Ut Video is game-changing.

Back to the original issue of VFR video sources: our proxy generation script accounts for this. The method it uses is slow because the entire file (every frame) must be scanned to detect VFR, but the process is at least bulletproof (more so than MediaInfo, which only checks the first hundred frames or so). We check for VFR using this logic:

Rem vfrdet prints its summary on stderr, so 2>&1 routes stderr into the pipe while >nul discards stdout.
Rem A purely constant frame rate file reports "VFR:0.000000 (0/total frames)".
ffmpeg -i %1 -vf vfrdet -f null - 2>&1 >nul | Find "VFR:0.000000 (0/"
Rem Find sets ErrorLevel 1 when the CFR signature is absent, i.e. the file is VFR.
If ErrorLevel 1 (
	Echo VFR
) Else (
	Echo CFR
)

If a file is VFR, then the script creates an intermediate, puts it into the Media subfolder, and moves the VFR original to ..\Transcoded. Proxies are then generated from the intermediates.
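
The intermediate format isn’t critical. As a sketch, assuming a lossless Ut Video intermediate at the source resolution and a 29.97 fps target (both assumptions; file names are placeholders), the conversion would be something like:

ffmpeg -i "vfr-original.mp4" -vsync cfr -r 30000/1001 -c:v utvideo -pred median -c:a pcm_s16le "Media\vfr-original.mkv"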

We also use ImageMagick to convert image files to our 480x270 proxy size. Especially for PNGs that have alpha channels, the smaller images composite much more quickly when doing a multi-track preview (the ^ escapes the > from the cmd shell; the > itself tells ImageMagick to only shrink, never enlarge):

For %f In (*.jpg, *.png, *.gif) Do magick "%f" -colorspace LAB -filter Lanczos -resize x270^> -unsharp 0x0.75+0.75+0.008 -colorspace sRGB "..\Proxy\%f"

And lastly, we turn PCM audio files into AAC simply for space savings because some of our projects have eight hours of WAV audio:

For %f In (*.wav?, *.aif?) Do ffmpeg -i "%f" -c:a aac -b:a 192k "..\Proxy\%f.aac"

…then we drop the .aac extension so each audio proxy keeps the same file name as its original, like we did with the other files.

That’s the status of things so far. Would love to hear what methods other people are using. I’ll start working on a source patch to get relative coordinates into more filters, but progress will be slow.


Ok, now it’s my turn to think about something. I’m going to have to look into UT Video myself.

@Austin Thanks for the tip about utvideo. I admit it has not even been on my radar. I should add an Export preset for that. Consider using AC-3 instead of AAC for audio proxies, as AAC introduces codec delay, which can make things more challenging and adds minor latency when seeking. Maybe there are other candidate audio codecs. If you run ffmpeg -h encoder=aac | grep General, you will see “delay”, which means codec delay is at play.
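
Since the scripts earlier in this thread are Windows batch, the same check there would be something like this (using cmd’s built-in Find in place of grep):

ffmpeg -h encoder=aac | Find "General"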

I am reading through your posts even though I do not have a lot of time for a discussion. Making mlt_rect parameters serialize and deserialize as a proportional (relative) value in conjunction with keyframes in MLT is very non-trivial. I tackled it once and had to scrap it. Obviously, I want to revisit it. It would help if you could take a look at trying to convert the crop filter, but do keep in mind that it needs to remain backwards compatible to render old projects. Thus, the proportional value handling should be done either by using new/different properties or by looking for ‘%’ in the string value, getting it as a double, and multiplying it against the maximum value.

Regarding Ut Video, I’ve noticed an odd difference between a transcode through static ffmpeg 4.0.2 vs. a transcode through Shotcut 18.10.08 from the source window (no timeline involved). The Shotcut transcode is 2x the file size and has darker colors. I looked at the bundled ffmpeg in Shotcut and it is also n4.0.2 (albeit compiled by a GCC that is two major versions older). I also haven’t been able to change the exported file size by passing pred=left or pred=median through the Export > Other tab. It doesn’t affect me since I do all transcoding through ffmpeg rather than Shotcut, but it could affect Shotcut’s ability to have a Ut Video preset. (FWIW, pred=left works great for lower CPU on proxies, and pred=median works great for cutting file size by 30% over HuffYUV on full-resolution intermediates.)

Thanks for the explanation of AC-3 audio. I noticed in one of the recent release notes that you switched from AAC to AC-3 but didn’t know why. I will follow suit now.
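
Concretely, following suit is a minimal tweak to the audio line from my earlier script (a sketch, not yet battle-tested; the .ac3 extension gets dropped afterward, same as before):

For %f In (*.wav?, *.aif?) Do ffmpeg -i "%f" -c:a ac3 -b:a 192k "..\Proxy\%f.ac3"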

Yes, I planned to store relative coordinates in new properties and attributes to maintain backward compatibility in the MLT XML. I’ll start with the crop filter.

Well, as Dan said in his reply.

@Austin, I am messing around with the UT Video preset that Dan added to the v18.11 beta. There is also a second preset for alpha channels. Do you know how UT Video with alpha channels compares to QuickTime Animation?

I’m going to write a more detailed summary of my experience messing around with UT Video in this thread, but first I want to ask about your plan to use UT Video for proxies. Shotcut is really the first video editor I have learned to use, so forgive my ignorance with some of my questions. :grinning: Proxies are meant to be smaller than the original files, but the UT Video preset in Shotcut exports files with huge sizes. How is the UT Video codec going to be implemented with a video editor’s built in proxy with the huge sizes it exports? Or does a proxy generator in a video editor work very differently with codecs? Or do I have it all completely wrong, and what would later be used as a proxy generator for Shotcut is not UT Video at all?

By the way, have you checked out the new beta and its UT Video presets? Any thoughts?

Proxies are intended to be faster, which typically means removal of IFC (inter-frame compression reduces performance when editing, since you need to decode multiple frames for each frame) and a lower resolution (probably where the “smaller” idea came from, e.g. 1080p vs. 4K, etc.). But smaller storage size isn’t typically a concern when the proxies are discarded at the end of the workflow (and it runs counter to removing IFC, which is good for storage but increases size when dropped). As long as you keep the originals, proxies are easily regenerated.

Yeah, I have made my own proxies before, but what I am wondering is whether a built-in proxy generator in a video editor works differently with codecs, or does it literally just export a proxy file for you, kind of like how Shotcut does now with the “Convert to Edit-Friendly” feature?

The final workflow would be up to Dan and Brian. But if it’s anything like Blender or kdenlive, the editor does a simple transcode that would be just like you exporting the video in proxy format yourself. It’s just automated by the editor to save you some effort and spare you the details. A codec is a codec… no difference whether it is used by a human or another program.

I haven’t played with QuickTime Animation, so I’m unable to comment on exported alpha channels and how they compare to Ut Video. That would be an interesting research project, especially for logos and lower thirds and transitions.

I love the way you phrased your statement… “How is the Ut Video codec going to be implemented with a video editor’s built in proxy with the huge sizes it exports?” :slight_smile: Yeah, basically, it destroys your disk and that’s the price of admission haha. You’re on the right track… an editor doesn’t use the codecs any differently than you do, so all the same considerations and limitations are in play, including disk space consumed.

There could be a difference between the way I do proxies and how other people do them. If the sources are 4K video, then the “real” way might be to use 1080p proxies in ProRes. However, I’m cheap on computer hardware because my money went to camera gear instead. That’s how I ended up using Shotcut to begin with. :slight_smile: (The audience will notice a better camera long before they notice a faster computer.) So, due to my slow hardware that won’t play multiple tracks of HD ProRes in real-time, my proxies are 480x270. This is a unique place to be. The color has to be spectacularly preserved, or else the proxy will play back and scale so poorly that you can hardly tell what’s going on in the video. Fortunately, lossless codecs come to the rescue, and Ut Video at 480x270 produces a file that is not terribly large at all. To be specific, if the sources are 4K H.264 4:2:0 100 Mbps, then the 480x270 Ut Video proxies will be 40% the size of the source files. This is totally acceptable to me, and the playback color is so crisp that even a full-screen preview is very usable for editing.

However, Ut Video at 1080p would be a beastly file size. I haven’t tried that and probably won’t. I stick to 480x270 because I routinely stack 5+ tracks in my projects and I need the low resolution to composite everything in real-time. If I were stacking fewer tracks, I might try 960x540. Once we’re in 1080p-land, lossless proxies are probably not the way to go. Flawless color is no longer paramount because there’s plenty of resolution to get “close enough” in the dithering sense. At this point, I would consider H.264 All-Intra at CRF 22 or higher to get small size and fast playback. A proxy isn’t supposed to be perfect – that’s the job of the original. A proxy is supposed to be fast and “good enough” to stand in for the original. It only has to be perfect if you’re down in 480x270-land and every pixel counts haha.
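
If I were to research that, my starting sketch would look something like this (flags mirror my CRF 12 recipe; file names are placeholders, and scale=-2:540 rounds the width to an even number, which libx264 requires):

ffmpeg -i "source.mp4" -vf scale=-2:540 -c:v libx264 -crf 22 -intra -tune film -preset veryfast -c:a ac3 -b:a 192k "..\Proxy\source.mp4"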

I haven’t checked out the 18.11 presets yet. I should, since Dan was so gracious to add them. But I’ve got to finish an existing video project first. One thing I’ve learned is that it’s a bad idea to switch horses in the middle of a stream. If I started a project with 18.03.06, I’m going to finish it with 18.03.06, even if we’re knee-deep in 2020. :slight_smile:

I’d love to hear how your Ut Video research is going.


I completely forgot about that! You are using the UT codec at very low resolution. I’ve been messing around with the UT codec in Shotcut as full-scale lossless exports! :rofl: It does indeed make a difference in file size. I just tested a 1080p file I have that’s 12 gigs and exported it out of Shotcut with the UT Lossless preset, changing the resolution to 480x270 as you suggested, and switching the audio codec from the pcm_s24le the preset is set to over to AC-3 at 128k bitrate as Dan suggested. You’re right that the image and colors still hold up even at such a low resolution, but the file is still big: it came out to 40 gigs or so. That’s still much lower than it would’ve been at full scale, although I have to say the results at full scale are very interesting. I don’t remember if it was made clear in the other thread, but UT Video is also mathematically lossless like FFV1 and HuffYUV, right?

I imagine that if this proxy generator thing gets going, probably the best approach would be to offer two options: one for a regular proxy file that is very small in file size, and one for a lossless proxy using the UT codec. The choice should be there in case disk space is a concern.

I’ll test some more stuff out with the UT codec and share my thoughts in a day or so. I’m hardly a technical expert on this so bear with me. :slight_smile: And thanks so much for taking the time out to explain all of this stuff. I’m learning a lot!


I think the idea of two proxy presets is brilliant. Ut Video is mathematically lossless, and using it along with 480x270 could be a useful preset for people stacking a lot of tracks, which requires low resolution to composite everything in real-time. (My personal stack record at 480x270 is 18 tracks.) The second preset could be 960x540 using H.264 All-Intra to save on disk space. I haven’t tested the lossy codecs as thoroughly for this second preset, so another one may end up being better. But H.264 All-I would be my starting point for research.

(To be super technical, all that really matters when deciding the proxy size is the height. My ffmpeg scripts use -vf scale=-1:270, which means “scale to whatever width corresponds to a height of 270 pixels while preserving the aspect ratio”. This way, you always have 270 or 540 lines available to your preview window, but the width is free to jump around in case somebody brought in a video of a digitized 4:3 VHS tape or something; a 16:9 source becomes 480x270, while a 4:3 source becomes 360x270. Not every source is guaranteed to be 16:9, so only the height is constant among proxies.)

As a general principle, there isn’t much benefit from a 1080p proxy, especially lossless. The phrase “full-resolution proxy” is literally an oxymoron. :slight_smile: A full-resolution transcode, if lossless, is an intermediate rather than a proxy. A full-resolution transcode, if lossy, is not perfect anyway, so why overkill the resolution on such imperfection?

960x540 is a bit of a sweet spot for people editing on 1080p monitors (which is probably the majority of people at the moment). At 960x540, the video is slightly bigger than the Shotcut default preview window, which makes for a fast and sharp preview. It is also half of the monitor’s resolution when going to full-screen, so it will scale perfectly and still look decently sharp for editing work at full-screen. If the proxy was 1080p, the preview window would have to scale the video down to fit (which takes CPU), the compositor and all filters would have four times as many pixels to smash together (more CPU), and the full-screen preview would provide a level of detail that gains you nothing in terms of functionality over 540p. There is also the extra encoding time and disk space required to make the 1080p proxy. It’s extra work for no added benefit if using a 1080p monitor. Proxies by nature are supposed to be imperfect so they can be fast. Sensitive work like color grading can be done after switching back to the originals.

Granted, everyone’s preferences and unique project requirements are different, but that’s the general principle and a pretty good starting point.

Oh, almost forgot… using Ut Video…

4K source to 480x270 proxy = proxy at 40% disk space of source. The start resolution is massive and the proxy resolution is tiny, so the relative space percentage is small.

1080p source (25% of 4K) to 1080p proxy = proxy that takes more space than there are atoms in the known universe. The start resolution is small, the proxy resolution is high for a proxy, and it’s lossless, so the relative percentage is going to be well over 100% (as in, proxies will be significantly larger than the sources).

BTW, you qualify as a technical person if you can make sense out of any of the mumbo jumbo going on in here. :joy:


We may want to give the user a little more control over it than that, since some projects could still be 4:3 instead of 16:9. But I’m sure it would be possible to build a small table and give users an option between “High quality” (H.264 All-Intra or something like ProRes) and “Lossless” (HuffYUV or UT Video), depending on final details.

If I wanted to do lossless editing of a 1080p or 4K video but I wanted to save as much space as I could, does this make sense to you: make an FFV1 of the 1080p or 4K source, then use that very FFV1 to create a UT 480x270 proxy? Sure, there’ll be lots of CPU usage because of FFV1, but if disk space were a concern, could that workflow be a good way to do lossless editing of 1080p or 4K?

The 4:3 possibility was covered in the second paragraph. All proxies would be created in the same aspect ratios as their originals because the width is left variable and calculated at encoding time. The project timeline has no bearing on proxy resolution. A proxy’s job is to mimic its source, including aspect ratio. So using our current example for lack of a better idea, the proxy preset names could be something like “Proxy - 540p 29.97 fps” which could use H.264 All-I, and “Proxy - 270p 29.97 fps” which could use Ut Video. Neither preset name specifies a width because it’s both unknown and irrelevant.

If I’m tracking this right, you want to edit on 270p proxies. If that’s true, your edit is fast because you’re using proxies. So when it comes to your final render, why not use the original video? Why do you need a lossless FFV1 copy of your original? Converting to FFV1 will not bring any magic quality gains to your original video. The only time I can see that making sense is if you’re converting variable frame rate cell phone video into constant frame rate and you want it to be lossless. That would make sense. Although even in that scenario, a phone sensor is so bad that it may not be worth lossless preservation anyway. H.264 at CRF 18 or so would be less compression than the phone already put on that video.

Doesn’t exporting from a source always go down a generation in quality? That’s why I was asking about lossless editing. Making a lossless file then exporting it from that for the finished work to not lose any quality.

If you’re working with proxies, they’re just a “placeholder”; ideally you still go back to the source when you export.

You are correct, exporting from a source does go down a generation in quality unless you’re exporting to a mathematically lossless format, or you’re exporting to a visually lossless format and you’re okay losing the details that only bumblebees can see. Visually lossless formats like ProRes are designed to pass through many generations without visible loss.

Just to make sure we’re not skipping any steps (I apologize if this is overbearing), there is no generation loss in reading the original video file. Reading it produces the same bitstream every single time. The loss happens when writing (exporting) a new video file into a lossy format. It’s the new file that has generation loss, not the rendering pipeline that led up to the new file. As in, the compositor and filters and color grades and previews would have access to all of the details that were in the original video files with no loss. The editing and computation phase is lossless. It’s the content of a lossy final file (if a lossy export format is chosen) where the breakdown happens. It’s generally okay for the final render to be in a “good enough” lossy format because no further editing will be done and there’s no point burning disk and CPU on details the eye can’t see anyway. YouTube and Netflix want to crush out everything that doesn’t matter in order to save network bandwidth on streaming. Studios crush the final render to fit longer movies and more bonus content on a single Blu-ray disc and avoid needing a second disc.

So here is where I’m not tracking yet. When you make a lossless file from your source (basically an intermediate) and render your final against the lossless, how is that any higher quality than rendering against the source video directly? How would the lossless file produce color information that the original video file couldn’t? Without that, what does the lossless file bring to the party (outside of converting VFR to CFR)? It doesn’t provide editing speed because editing is against the proxies. It doesn’t provide quality any higher than the original (it’s not going to create more pixels or better colors unless you add filters), so why add the extra step? The lossless file itself was an “export” from the original video and there was no loss. So it’s equally okay to render direct against your source and the only loss that would happen is whatever is inherent to the export format you write to (or none if you export the final in a lossless format).

If I’m reading it all wrong and you’re talking about the final export going to a lossless file rather than each source video, then yes, exporting the final render to a lossless file to prevent generation loss makes total sense.

A shorter version since my last one was too complicated. :slight_smile: My apologies for a double post:

If the goal is to convert the source video from variable frame rate to constant frame rate (or any other necessary conversion because the source is unusable as-is), then yes, using an FFV1 intermediate to avoid loss yet save disk space is a perfect solution.

If the original source file is usable as-is, then adding an FFV1 intermediate buys you nothing extra. You can render directly against the original file and get the same level of quality. After all, the lossless intermediate got its quality from the original; so the final render can do the same.
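
For completeness, a sketch of that first scenario (the frame rate and file names are placeholders; -level 3 selects FFV1 version 3, the better compressor):

ffmpeg -i "vfr-phone.mp4" -vsync cfr -r 30000/1001 -c:v ffv1 -level 3 -c:a pcm_s16le "phone-cfr.mkv"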


@Austin, I was experimenting more with Ut and thought about bringing the resolution even lower than 480x270 to see what happens. I have done several exports with Ut at 360x204, with both 1080p and 4K sources. I found that the video still holds up surprisingly well. I love it because the file sizes are much smaller than 480x270 Ut. Unless I am missing something, the 360x204 resolution is the one I will use.

Have you tried it out at 360x204? If not please do and let me know what you think. :slight_smile: