Audio Alignment Implementation

andrecaldas · October 29, 2021, 5:46pm

@DRM:
How do you “measure” the drift? What are the units?

DRM · October 29, 2021, 9:38pm

Not sure how to answer that one because I’m not knowledgeable in this area. Shotcut just uses frame rates, so that’s all I go by. Also, I bring this up because it’s been mentioned a number of times on the forum by users who run into this audio drift problem.

andrecaldas · October 29, 2021, 10:37pm

@DRM:
I can determine the drift!

I made a program to determine the drift.

Compiling

You compile it the same way you compile the other two:

g++ -fPIC -I /usr/include/mlt-7 -I /usr/include/mlt-7/mlt++ -o drift drift.cpp FFTInplaceArray.cpp AudioEnvelopeFFT.cpp -Wall -Wextra -lpthread -L /usr/lib/x86_64-linux-gnu/mlt-7/ -l 'mlt++-7' -lmlt-7 -lfftw3

Executing

You execute it:

./drift file.mp4 file.wav 2>/dev/null

And it tells you how to adjust the speed of file.wav.

/=== PROFILE: atsc_1080p_60 ===
| Start extracting audio…
| Finished extracting first producer’s audio. Elapsed: 8705 million ticks.
| Finished extracting second producer’s audio. Elapsed: 4569 million ticks.
Determined drift: -0.05%.
You should adjust the speed of file.wav to 99.95%.

I was adjusting these using exactly 99.95%!!!

How it works

I adjusted the speed of the audio from 99% to 101%, in increments of 0.01%, and calculated the “alignment quality” for each speed. This “quality” is a by-product of the alignment algorithm. That is, I executed the alignment algorithm 200 times, but without extracting the audio all over again (I just dropped or inserted frame values).

With two 5-minute files, extracting the audio took about 4.4 seconds, and calculating the alignment 200 times was instantaneous. With two 10-minute files, extracting the audio took 9 seconds and calculating the alignment 200 times took about 1 second.
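As a rough illustration of the search described above, here is a minimal Python sketch. It is not the code from drift.cpp: the function names are made up for this example, the "quality" is a plain correlation searched over a few lags (an assumption standing in for whatever the alignment algorithm actually computes), and the drift is exaggerated to 20% so that a short synthetic signal is enough.

```python
import math

def resample(samples, speed):
    """Play `samples` at `speed` by dropping or repeating values (nearest neighbour)."""
    n_out = int(len(samples) / speed)
    last = len(samples) - 1
    return [samples[min(int(i * speed), last)] for i in range(n_out)]

def correlation(a, b):
    """Pearson correlation of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b) if var_a and var_b else 0.0

def alignment_quality(reference, candidate, max_lag=20):
    """Best correlation over integer lags in [-max_lag, max_lag] (assumed quality measure)."""
    best = -1.0
    for lag in range(-max_lag, max_lag + 1):
        a = reference[max(0, -lag):]
        b = candidate[max(0, lag):]
        n = min(len(a), len(b))
        if n > 1:
            best = max(best, correlation(a[:n], b[:n]))
    return best

def find_speed(reference, drifted, speeds):
    """Return the candidate speed whose resampled audio matches `reference` best."""
    return max(speeds, key=lambda s: alignment_quality(reference, resample(drifted, s)))

# Synthetic stand-in for an audio envelope, and a copy that plays 20% too slowly.
reference = [math.sin(0.3 * t) + math.sin(0.17 * t) + math.sin(0.11 * t) for t in range(600)]
drifted = resample(reference, 0.8)

# The audio is extracted once; only the cheap resampling is repeated per candidate.
speeds = [k / 100 for k in range(100, 151, 5)]  # 1.00, 1.05, ..., 1.50
best_speed = find_speed(reference, drifted, speeds)
print(f"Adjust the speed to {best_speed:.2f}")
```

The point of the design is the one made in the post: extraction happens once, and each candidate speed only costs a resample plus a comparison.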

DRM · October 29, 2021, 10:48pm

Oh, I think I know what you mean now about the units. Shotcut doesn’t use percentages for speed control. In the Properties tab the equivalent of 100% original speed is 1.000000.

andrecaldas · October 29, 2021, 11:03pm

Yes. It is dimensionless.

What I implemented was dropping or inserting frames. Then, when I asked the question, I was thinking about a frequency for “dropping frames”.

Maybe it would be nice if Shotcut used %, just like Kdenlive. My program computes a float: 0.9995. But I print the output in %: 99.95%.

DRM · October 29, 2021, 11:11pm

It’s actually better that Shotcut doesn’t. Using that many decimal points allows for more precision. I’ve been able to salvage quite a number of audio and video files thanks to that many decimal points. It’s a precision that not even proprietary editing software offers, as far as I know.

So the most decimal points your program can do is 4? No way it can be bumped up to 6 to match Shotcut?

Although I believe that Shotcut is actually only using 5 decimal points because of a bug from Qt. Qt still hasn’t addressed it, right, @shotcut?

andrecaldas · October 30, 2021, 12:29am

But whether or not it is a “percent” makes no difference… 100.00% and 1.0000 are just the same. If you want more decimal digits, you can use 100.0000% or 1.000000.

It can be more precise. To avoid too many calculations, we should probably first search with a coarse precision over a wide range, and then repeat the search with a finer precision over a narrow range around the rough value previously determined. I will try to implement it later. I will let you know.

I believe that internally, it can be a double-precision or single-precision floating-point number. Those decimal points are just a matter of user interface. For the drift, I believe that single precision is more than sufficient.
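The coarse-to-fine idea can be sketched in a few lines of Python. This is only an illustration under assumptions: `refine_speed` and its parameters are hypothetical names, `quality` stands for any function that scores a candidate speed, and the sketch assumes that score has a single peak inside the searched range.

```python
def refine_speed(quality, center=1.0, first_digit=4, last_digit=9, steps_per_side=100):
    """Coarse-to-fine grid search for the speed that maximises `quality`.

    Each round scans 2 * steps_per_side + 1 candidates around the current
    best value, then the step shrinks by a factor of 10 for the next round.
    """
    best = center
    for digit in range(first_digit, last_digit + 1):
        step = 10.0 ** -digit
        candidates = [best + k * step for k in range(-steps_per_side, steps_per_side + 1)]
        best = max(candidates, key=quality)
    return best

# Toy quality function with a single peak at a made-up "true" speed.
true_speed = 0.999505762
estimate = refine_speed(lambda s: -abs(s - true_speed))
print(f"{estimate:.9f}")
```

With these defaults the search evaluates 6 × 201 = 1206 candidates. Scanning the same ±0.01 range directly with a step of 0.000000001 would take about 20 million evaluations.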

andrecaldas · October 30, 2021, 11:58am

Increasing the precision was a very good suggestion! Thank you. The clips I use are very short, but for longer clips it might be better to have more precision.

I wrote a version with “incremental precision”. It starts with a precision of 0.0001 and keeps refining down to 0.000000001.

time ./drift file1.mp4 file1.wav

/=== PROFILE: atsc_1080p_60 ===
| Start extracting audio...
Extracting audio...Mean: 1756.75.
| Finished extracting first producer's audio. Elapsed: 5014 million ticks.
Extracting audio...[adpcm_ima_wav @ 0x5595b52c31c0] Multiple frames in a packet.
Mean: 4440.2.
| Finished extracting second producer's audio. Elapsed: 2446 million ticks.
Calculating speed change for the file test_drifted.wav.
Precision: 0.01%.
Range: (99.00%, 101.00%).
Drift up to now: 99.95%.
Precision: 0.001%.
Range: (99.850%, 100.050%).
Drift up to now: 99.951%.
Precision: 0.0001%.
Range: (99.9410%, 99.9610%).
Drift up to now: 99.9505%.
Precision: 0.00001%.
Range: (99.94950%, 99.95150%).
Drift up to now: 99.95057%.
Precision: 0.000001%.
Range: (99.950470%, 99.950670%).
Drift up to now: 99.950577%.
Precision: 0.0000001%.
Range: (99.9505670%, 99.9505870%).
Adjust the speed of the second clip to 99.9505762%.

real	0m14.425s
user	0m14.127s
sys	0m0.280s

Again, what really takes time is the audio extraction. I think it is because the audio might be converted. I don’t know how to fetch the original frequency.

I don’t know much about Shotcut right now. But maybe Shotcut has an already extracted audio I can use.

shotcut · October 30, 2021, 7:31pm

No, it does not.

I think it is because the audio might be converted. I don’t know how to fetch the original frequency.

Yes, of course the audio is converted. There are many audio codecs and sample formats. Obviously, compressed audio must be decoded, but beyond that MLT does not strive to work with every sample format possible. That would make every filter, transition, and consumer dealing with audio very complex. So, all inputs are converted automatically to what you are requesting based on your code: 16-bit signed PCM, 48000 Hz, and 2 channels. If you want to bypass resampling and get the frequency and channel values of the source, you need to look at some properties of the producer:

meta.media.%d.codec.channels
meta.media.%d.codec.sample_rate

where “%d” is the stream index. Multiplexed files can have more than one stream/track including more than one audio stream/track. Eventually, your code will need to handle an audio stream selection that occurs within Shotcut’s Properties > Audio > Track field. The MLT producer property that corresponds to this is audio_index, and when not specified (or -1) MLT uses the first one, which is not always 0. Stream indexes in MLT are absolute and not relative to the stream type (audio, video, text, data, etc.).

To facilitate this lookup see these other properties:

meta.media.nb_streams
meta.media.%d.stream.type

The index is 0-based, and look for the stream.type value of “audio”.
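The lookup described above could look roughly like this in Python. It is a sketch, not MLT code: a plain dict stands in for the producer's properties (in a real program they would be read from the producer object), and the function names are invented for the example. Only the property names come from the post.

```python
def first_audio_stream(properties):
    """Return the absolute index of the first stream whose type is "audio", or None."""
    count = int(properties.get("meta.media.nb_streams", 0))
    for index in range(count):  # stream indexes are 0-based and absolute
        if properties.get(f"meta.media.{index}.stream.type") == "audio":
            return index
    return None

def source_audio_format(properties, audio_index=-1):
    """Channels and sample rate of the selected audio stream, before any resampling.

    audio_index < 0 mimics "not specified": fall back to the first audio stream.
    """
    index = first_audio_stream(properties) if audio_index < 0 else audio_index
    if index is None:
        return None
    return {
        "index": index,
        "channels": int(properties[f"meta.media.{index}.codec.channels"]),
        "sample_rate": int(properties[f"meta.media.{index}.codec.sample_rate"]),
    }

# Hypothetical properties of an mp4 whose first audio stream is stream 1, not 0.
example = {
    "meta.media.nb_streams": "2",
    "meta.media.0.stream.type": "video",
    "meta.media.1.stream.type": "audio",
    "meta.media.1.codec.channels": "2",
    "meta.media.1.codec.sample_rate": "44100",
}
print(source_audio_format(example))
```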

Ultimately, if you want something more direct then you can use FFmpeg libraries.

andrecaldas · October 30, 2021, 9:09pm

Thank you, @shotcut!!!

I wanted to understand why reading the audio takes so long. Maybe, using different frequencies for different files would make the lag detection fail. I think it is worth trying in the near future. So, I am putting it on the roadmap.

No way!
Here I am trying to agree disagreeing instead of disagree agreeing…

MLT is great software. I shall not try to bypass it.

But by understanding where the issue is, we can come up with creative solutions. What we need is for the UI to be responsive.

Is this probably the explanation for why I am getting an ALL ZERO sample when using Mlt::Producer('some_shotcut_file.mlt')?

shotcut · October 30, 2021, 9:23pm

No, the information above pertains to the avformat producer. I do not know yet why you are getting zeros.

andrecaldas · October 30, 2021, 11:09pm

I will try to migrate to Shotcut.

Since I will not be using the automatic un-drifting and alignment through Shotcut’s UI until Monday, I shall use the following hack. I’d love to get some help from MLT gurus…

I have written a script that generates an mlt file, just like @brian mentioned.

The script uses the “drift and alignment calculator” and then executes something like

melt -profile atsc_1080p_60 file.mp4 -track -blank 78 file.wav -consumer xml
to generate the “.mlt” file. I will include this mlt file instead of file.mp4 and file.wav.

However, I do not know how to instruct melt to apply the “drift”. It seems I need to use -chain and -link. I want to apply those to file.wav, without applying it to -blank 78. I suppose it is something like this:

melt -profile atsc_1080p_60 file.mp4 -track -blank 78 -chain file.wav -link timeremap speed=0.9995 -consumer xml

The values 78 and 0.9995 are determined by the program I wrote.

So, does anyone know the proper way to instruct melt to generate this “aligned and undrifted” mlt file?

Actually, I’d also like to suppress file.mp4’s audio…

brian · October 31, 2021, 12:11am

Using a chain and the timeremap link is one way. But a simpler way is to use the timewarp producer.

To understand how it works:

Open a clip in Shotcut
Edit speed in the properties panel
Save
Inspect the resulting .mlt file

That will show you the producer and properties to use to change the speed of a producer.

andrecaldas · October 31, 2021, 7:44pm

Thank you @brian!
I used:

melt -profile atsc_1080p_60 test.mp4 -attach-clip volume gain=0 -track -blank 82 -producer timewarp:0.99950576:test_drifted.wav -consumer xml > test.mlt

I got some information on how to implement what you suggested, and on how to mute a producer, on Stack Overflow.
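For anyone scripting this, the command above can be assembled from the two numbers the drift program reports. The Python sketch below only rebuilds the same argument list; `build_melt_command` is a hypothetical helper, and it assumes the audio starts after the video (a non-negative lag), which is the only case shown in this thread.

```python
import shlex

def build_melt_command(video, audio, lag_frames, speed, profile="atsc_1080p_60"):
    """Build the melt argument list: muted video plus delayed, speed-adjusted audio."""
    if lag_frames < 0:
        # Assumption of this sketch: the audio never starts before the video.
        raise ValueError("this sketch only handles audio that starts after the video")
    return [
        "melt", "-profile", profile,
        video, "-attach-clip", "volume", "gain=0",       # mute the video's own audio
        "-track", "-blank", str(lag_frames),             # delay the external audio
        "-producer", f"timewarp:{speed:.8f}:{audio}",    # undo the drift
        "-consumer", "xml",
    ]

command = build_melt_command("test.mp4", "test_drifted.wav", 82, 0.99950576)
print(shlex.join(command))
```

Passing the list to `subprocess.run` with its standard output redirected to a file would then produce the .mlt file without going through a shell.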

One interesting thing is that when I open test.mlt in Shotcut, the previewer does not apply the timewarp. Actually, if I simply run

melt -profile atsc_1080p_60 test.mp4 -attach-clip volume gain=0 -track -blank 82 -producer timewarp:0.99950576:test_drifted.wav
melt’s previewer does not apply the timewarp either.

But when I use Shotcut to export test.mlt to output.mp4, the timewarp is correctly applied.

Maybe I am doing something wrong. Or maybe, both previewers use the same consumer, and this consumer is not consuming correctly (SDL?).

The previewer in Shotcut is really slow processing test.mlt when I add it as a track in some other project.

Also, my drift program now accepts command line options!

$ ./drift --help
Options:
-h [ --help ] help message
-s [ --script ] output one line with

-p [ --profile ] arg (=atsc_1080p_60) MLT profile
--percent use percentage (i.e.: %) in reports
--debug print debug output
-d [ --field-separator ] arg (=;) field separator for script output
(default ‘;’): speed;lag file
number;lag
-a [ --approximate-drift ] arg (=1) initial drift estimation
-r [ --drift-range ] arg (=0.01) initial drift estimation
-m [ --print-precision ] arg (=6) number of precision digits
--precision arg (=9) number of precision digits
-b [ --base-file ] arg reference to align to
-i [ --input-file ] arg file to be un-drifted and aligned

brian · October 31, 2021, 8:37pm

I just did this test:

Open a clip
Change the speed to 2.0
Save as test.mlt
Make a new project
Open test.mlt as a clip
The speed was applied as expected in the preview

I would suggest comparing the test.mlt made using your melt command line to one made by Shotcut as I tested above. Maybe we have a bug somewhere.

andrecaldas · October 31, 2021, 9:02pm

The mlt file needs to have two clips. One mp4 and one wav. Something like this:

<muted mp4 />
<adjusted audio>
…<blank 82 frames />
…<wav with speed=“0.999504” />
</adjusted audio>

Well… my computer is quite under-powered, too.

brian · October 31, 2021, 9:08pm

My point is to construct the mlt file you want using Shotcut so you can compare it to the one created with your command line application.

andrecaldas · October 31, 2021, 9:42pm

You are right. I did it and it is not slow… but the audio speed does not change on the preview window.

Move audio and video to playlist.
Put video on track V1.
Add an audio track A1.
Change the speed of the audio in track A1.
Move the cursor to position 82 frames (00:00:01:22).
Snap the audio on track A1 to this position.
Save as test.mlt.
Close test.mlt.
Open a new project.
Drag test.mlt to “Playlist”.
Play test.mlt by double clicking it on the playlist.
Check that the audio is misaligned on the preview window.
Export the file (1280x720, 30fps, mp4).
Check that the exported mp4 audio is in perfect sync with the video.
Execute melt test.mlt and verify that melt also plays with the audio misaligned!

If melt and the preview window use the same consumer, I would say this consumer is not applying the timewarp.
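The timecode in the steps above (82 frames shown as 00:00:01:22) follows from the frame rate. A small Python sketch of the conversion, assuming an integer frame rate of 60 fps as in the atsc_1080p_60 profile used earlier and no drop-frame counting; the function name is made up for the example.

```python
def frames_to_timecode(frames, fps):
    """Convert a frame count to HH:MM:SS:FF for an integer frame rate."""
    seconds, ff = divmod(frames, fps)
    minutes, ss = divmod(seconds, 60)
    hh, mm = divmod(minutes, 60)
    return f"{hh:02d}:{mm:02d}:{ss:02d}:{ff:02d}"

print(frames_to_timecode(82, 60))
```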

andrecaldas · November 1, 2021, 2:47am

I may be having this issue with audio alignment. I will do more tests.

andrecaldas · November 1, 2021, 2:16pm

Hey, @DRM… thank you for the comment about the drift. My program can successfully detect the drift and the misalignment between my video and audio files. I think it is usable, and I hope it can be integrated in Shotcut soon. Thank you @brian and @shotcut for the support. Sorry for asking so many questions! I do not know anything about audio and video file and stream formats.

Problem I was having

It seems that my audio files were being processed differently by different “people”. I do think that there should be a way to make all those consumers/producers deliver results in a consistent way. Well… I do not know anything about audio or video file formats… and this is not what I want to discuss. So, I used ffmpeg to convert the audio and avoid the issue.

Audio format I was using

My audio is recorded by a cheap device I got on the internet.

First, the application file reports for my audio file:

$ file test.wav
test.wav: RIFF (little-endian) data, WAVE audio

The test.wav “properties” shown by my file browser (nautilus, in Gnome) are (I am translating freely from Portuguese):

Container: WAV
Codec: DVI ADPCM
[…]
Sampling frequency: 48000 Hz
Bit rate: 385 kbps

Using `ffmpeg` to convert

After using ffmpeg…

$ ffmpeg -y -stats -i test.wav -v error -vn -ar 48000 test_48000.wav

The command file reports:

$ file test_48000.wav
test_48000.wav: RIFF (little-endian) data, WAVE audio, (censored M$) PCM, 16 bit, stereo 48000 Hz

And nautilus gives:

Container: WAV
Codec: WAV
[…]
Sampling frequency: 48000 Hz
Bit rate: 1536 kbps
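The two bit rates are consistent with simple arithmetic on the sample format. The Python sketch below checks this; the 4 bits per sample for DVI (IMA) ADPCM and the stereo layout of the original file are assumptions, and the small gap between 384 and the reported 385 kbps would then presumably be ADPCM block overhead.

```python
def bit_rate_kbps(sample_rate_hz, bits_per_sample, channels):
    """Raw audio bit rate in kbps, ignoring container and block overhead."""
    return sample_rate_hz * bits_per_sample * channels / 1000

print(bit_rate_kbps(48000, 16, 2))  # converted file: 16-bit PCM, stereo
print(bit_rate_kbps(48000, 4, 2))   # original file, assuming 4-bit ADPCM, stereo
```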

Generating the audio and video combined `mlt` file

Now, when I use my undrifter and aligner program:
$ ./undrifted_and_aligned_xml.sh test.mp4 test_48000.wav > test.mlt

I get a perfectly aligned and drift-free test.mlt I can use to insert as a clip in my projects.

In Shotcut, on the timeline, I do lose the nice graphical representation of the audio envelope, and also the video thumbnails.

If you want to test it

If you don’t have any “drifted” audio files, but you want to test, you can generate one using the atempo filter in ffmpeg:

ffmpeg -y -stats -i nice_file.wav -v error -vn -ar 48000 -filter:a "atempo=1.001" file_with_drift.wav

To fix this, the detected drift should be 1/1.001 ≈ 0.999001. Probably, there is a way to cut the first second, so that you also have a misaligned file. If anyone knows how to do it, please tell me, as I do not really need it and therefore I shall not spend time looking for it.
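To check a test run against the expected value, the correction for a known tempo change is just its reciprocal. A tiny Python sketch (the function names are only for illustration):

```python
def correction_speed(tempo):
    """Speed that undoes a tempo change, e.g. one applied with atempo."""
    return 1.0 / tempo

def drift_percent(speed):
    """Express a speed factor as a signed drift in percent."""
    return (speed - 1.0) * 100.0

speed = correction_speed(1.001)
print(f"speed: {speed:.6f}, drift: {drift_percent(speed):.4f}%")
```

Note that the result is not simply 1 - 0.001 = 0.999; the difference only shows up in the sixth decimal place, which is where the extra precision discussed earlier in the thread starts to matter.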