Audio Alignment Implementation

Extracting the audio takes time (some 3 seconds). Processing the extracted audio is really fast (a blink).

I have uploaded a new version of compare that prints a report of the time spent on each part of the process. From the output, I conclude that the time spent on audio extraction probably depends on the frame rate and on the number of tracks (AV file vs. audio-only file). If there is a faster way to extract the audio, please let me know. :slight_smile:

/=== PROFILE: atsc_1080p_30 ===
| Start extracting audio…
| Finished extracting first producer’s audio. Elapsed: 1577 million ticks.
| Finished extracting second producer’s audio. Elapsed: 642 million ticks.
| Will calculate lag.
| Finished calculating lag. Elapsed: 1 million ticks.
=== RESULT (atsc_1080p_30) ===
Calculated lag: -10 frames. 0.333333 seconds.
-0h0m0s (-10 frames) (fps = 30).
=== END OF REPORT ===

/=== PROFILE: atsc_1080p_60 ===
| Start extracting audio…
| Finished extracting first producer’s audio. Elapsed: 2589 million ticks.
| Finished extracting second producer’s audio. Elapsed: 1119 million ticks.
| Will calculate lag.
| Finished calculating lag. Elapsed: 2 million ticks.
=== RESULT (atsc_1080p_60) ===
Calculated lag: -19 frames. 0.316667 seconds.
-0h0m0s (-19 frames) (fps = 60).
=== END OF REPORT ===

/=== PROFILE: atsc_720p_60 ===
| Start extracting audio…
| Finished extracting first producer’s audio. Elapsed: 2580 million ticks.
| Finished extracting second producer’s audio. Elapsed: 1187 million ticks.
| Will calculate lag.
| Finished calculating lag. Elapsed: 2 million ticks.
=== RESULT (atsc_720p_60) ===
Calculated lag: -19 frames. 0.316667 seconds.
-0h0m0s (-19 frames) (fps = 60).
=== END OF REPORT ===

Yes, it can. The main binary needs to call fftw_make_planner_thread_safe().
The FFTW library stores global state when it “plans” how to do the computations.
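For anyone trying this, a minimal sketch of the fix (illustrative only; fftw_make_planner_thread_safe() exists since FFTW 3.3.5 and may require linking the FFTW threads library, depending on the build):

```cpp
#include <fftw3.h>
#include <thread>
#include <vector>

int main()
{
    // Planning mutates FFTW's global planner state, so concurrent
    // fftw_plan_* calls are unsafe without this one-time call.
    fftw_make_planner_thread_safe();

    auto work = [] {
        std::vector<double> in(1024, 0.0), out(1024);
        fftw_plan p = fftw_plan_r2r_1d(1024, in.data(), out.data(),
                                       FFTW_R2HC, FFTW_ESTIMATE);
        fftw_execute(p);  // executing plans was already thread-safe
        fftw_destroy_plan(p);
    };

    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    return 0;
}
```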

With a little brainstorming, I think we can arrive at a very good user interface. I imagine, for example, that one could simply drag a bunch of audio files onto Shotcut, which would:

  1. Automatically choose which videos they correspond to.
  2. Align each file to its corresponding video.

I guess that each track is a producer. So, instead of aligning to the clip, we can simply align it to the whole track! :smiley:

This will not be something that is automatic; the user must choose to do it and control the reference. I am not going to let Shotcut compare against every video in the timeline.

Yes. The user chooses to do it. But the user doesn’t have to choose the clip. The user could choose the whole track, or even “associate” one track with some other track. The UI could determine the position and clip the audio.

For example (this is not a suggestion), there could be a dialog box with two drag-and-drop areas: a UI specific to audio alignment. This could generate many producers (like color clips) that would reproduce the video from one and the audio from the other. Some sort of audio_aligned_producer.

I need some help with audio extraction.

I used Shotcut to put many video files in one track. Then I saved it to videos.mlt. When I play it with melt using melt videos.mlt, there is audio.

But when I execute

./compare videos.mlt audio.wav
the audio extraction of videos.mlt just gives me zeros! :frowning_face:

This is the code that extracts the audio. I guess I am not using the most general and correct way to extract audio from a producer.

3 seconds for how long of an audio file? What if you have a 2 hour audio file? Will it also take just a few seconds?

When you say that it is extracting audio, do you literally mean that it is exporting an audio file to be saved on a hard drive?

Is there a limit on how many audio tracks it can sync at one time? I’m asking because this would be a very important function for a future implementation of multi-cam in Shotcut. People can be shooting with multiple cameras and would want to have them all in sync based on audio.

I’m also wondering if this could solve audio drift issues. Many will record video and audio on one camera and also record audio on a separate device. But because of frame rate issues, the audio from the external device will sometimes go out of sync with the audio from the video camera as time goes on. Can this solve audio drift from start to end? If not, can that be programmed in?

I tested with a 5 minute MP4 (video) and a 5 minute WAV file. I used the “atsc_1080p_60” profile to extract the audio. I am using an i3 laptop with about a 3.2 GHz clock. The program I wrote used about 70% of two CPUs. My laptop has 4 CPUs.

The computation is very fast and I suppose it would be (proportionally) as fast as with a 2 hour audio file. The computation algorithm is \mathrm{O}(n \log n).

I do not have such a file to test with. So, I produced an MLT file in Shotcut just to concatenate my video files. But the method for extracting the audio is not working for this case. It just extracts a huge sequence of zeroes.

Calculations are done by the FFTW library. It is very, very fast. The Kdenlive audio alignment has the same problem: I think it spends its time extracting the audio, not computing the lag.
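For context, the O(n log n) computation is classic FFT-based cross-correlation: transform both envelopes, multiply one spectrum by the conjugate of the other, transform back, and take the peak. A rough sketch with FFTW (illustrative names, not the actual compare code):

```cpp
#include <fftw3.h>
#include <algorithm>
#include <complex>
#include <vector>

// Returns the lag (in samples) of a relative to b.
long best_lag(const std::vector<double>& a, const std::vector<double>& b)
{
    int n = 1;
    while (n < static_cast<int>(a.size() + b.size()))
        n <<= 1;  // zero-padded FFT size

    std::vector<double> ta(n, 0.0), tb(n, 0.0), corr(n);
    std::copy(a.begin(), a.end(), ta.begin());
    std::copy(b.begin(), b.end(), tb.begin());

    std::vector<std::complex<double>> fa(n / 2 + 1), fb(n / 2 + 1);
    fftw_plan pa = fftw_plan_dft_r2c_1d(n, ta.data(),
        reinterpret_cast<fftw_complex*>(fa.data()), FFTW_ESTIMATE);
    fftw_plan pb = fftw_plan_dft_r2c_1d(n, tb.data(),
        reinterpret_cast<fftw_complex*>(fb.data()), FFTW_ESTIMATE);
    fftw_execute(pa);
    fftw_execute(pb);

    for (int i = 0; i <= n / 2; i++)
        fa[i] *= std::conj(fb[i]);  // spectrum of the cross-correlation

    fftw_plan pc = fftw_plan_dft_c2r_1d(n,
        reinterpret_cast<fftw_complex*>(fa.data()), corr.data(), FFTW_ESTIMATE);
    fftw_execute(pc);
    fftw_destroy_plan(pa);
    fftw_destroy_plan(pb);
    fftw_destroy_plan(pc);

    // The peak position is the lag; indexes past n/2 wrap to negative lags.
    long peak = std::max_element(corr.begin(), corr.end()) - corr.begin();
    return peak <= n / 2 ? peak : peak - n;
}
```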

No. I am just reading the array provided by Mlt::Frame::get_audio for each frame, using the audio format mlt_audio_s16. I think this process is too slow. And I also do not know how to configure things so that this method works for XML files, for example.
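For reference, a minimal sketch of that per-frame loop (illustrative, not the actual compare code; it assumes the producer is already open and requests 16-bit signed PCM as described):

```cpp
#include <Mlt.h>
#include <cstdint>
#include <memory>
#include <vector>

std::vector<int16_t> extract_audio(Mlt::Producer& producer)
{
    std::vector<int16_t> samples;
    for (int i = 0; i < producer.get_length(); i++) {
        std::unique_ptr<Mlt::Frame> frame(producer.get_frame());
        mlt_audio_format format = mlt_audio_s16;  // request signed 16-bit PCM
        int frequency = 48000, channels = 2, count = 0;
        auto* data = static_cast<int16_t*>(
            frame->get_audio(format, frequency, channels, count));
        if (data)  // interleaved; count is samples per channel
            samples.insert(samples.end(), data, data + count * channels);
    }
    return samples;
}
```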

No.

I have thought a little about this issue. I have dealt with drifted audio myself.

The algorithm cannot detect the lag if there is drift. But I think the alignment functions can be used to program a solution.

I don’t know of any general purpose “magic” like the one described in this post for calculating the drift. But of course, we can choose two small and distant “slices” of the audio, align them and use this information to calculate the drift. It is an important fact that the technique can be used to align a “short” producer to a “long” reference producer.

The slices would have to be short enough that the drift does not interfere with the lag detection. But they would also have to be long enough that we can find the corresponding match in the reference producer. This would require some experimenting.
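To make the arithmetic concrete, a tiny hypothetical helper (not existing code; positions and lags in samples): with drift, the measured lag grows linearly with position, so two slices give the speed correction as a slope.

```cpp
// pos1/pos2: positions of the two slices; lag1/lag2: their measured lags.
double drift_factor(double pos1, double lag1, double pos2, double lag2)
{
    return 1.0 + (lag2 - lag1) / (pos2 - pos1);  // speed factor to apply
}
```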

Another solution would be to simply discard (or insert) samples in the audio file at some frequency. We do this for a range of frequencies and pick the “best” result. The frequency of dropping (or inserting) samples then determines the drift.
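A sketch of that idea (illustrative only; nearest-sample drop/insert rather than proper resampling):

```cpp
#include <vector>

// Step through the signal at a non-unit rate; this drops or duplicates
// individual samples, changing the effective speed by the given factor.
std::vector<double> resample_by_drop_insert(const std::vector<double>& in,
                                            double speed)
{
    std::vector<double> out;
    out.reserve(static_cast<size_t>(in.size() / speed) + 1);
    for (double pos = 0.0; pos < in.size(); pos += speed)
        out.push_back(in[static_cast<size_t>(pos)]);
    return out;
}
```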

@DRM:
How do you “measure” the drift? What are the units?

Not sure how to answer that one because I’m not knowledgeable in this area. Shotcut just uses frame rates, so that’s all I go by. Also, I bring this up because it’s been mentioned a number of times on the forum by users who run into this audio drift problem.

@DRM:
I can determine the drift! :smiley:

I made a program to determine the drift.

Compiling

You compile it the same way you compile the other two:

g++ -fPIC -I /usr/include/mlt-7 -I /usr/include/mlt-7/mlt++ -o drift drift.cpp FFTInplaceArray.cpp AudioEnvelopeFFT.cpp -Wall -Wextra -lpthread -L /usr/lib/x86_64-linux-gnu/mlt-7/ -lmlt++-7 -lmlt-7 -lfftw3

Executing

You execute it:

./drift file.mp4 file.wav 2>/dev/null

And it tells you how to adjust the speed of file.wav.

/=== PROFILE: atsc_1080p_60 ===
| Start extracting audio…
| Finished extracting first producer’s audio. Elapsed: 8705 million ticks.
| Finished extracting second producer’s audio. Elapsed: 4569 million ticks.
Determined drift: -0.05%.
You should adjust the speed of file.wav to 99.95%.

I was adjusting these using exactly 99.95%!!! :smiley:

How it works

I adjusted the speed of the audio from 99% to 101%, in increments of 0.01%, and calculated the “alignment quality” for each one. This “quality” is a by-product of the alignment algorithm. That is, I executed the alignment algorithm 200 times, but without extracting the audio all over again (I just dropped or inserted the values of each frame).
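In outline, the search looks something like this (a sketch with hypothetical names; alignment_quality() stands in for the “quality” by-product, and resample_by_drop_insert() is the helper sketched earlier):

```cpp
#include <vector>

std::vector<double> resample_by_drop_insert(const std::vector<double>&, double);
double alignment_quality(const std::vector<double>&, const std::vector<double>&);

double find_drift(const std::vector<double>& reference,
                  const std::vector<double>& envelope)
{
    double best_speed = 1.0, best_quality = -1.0;
    for (double speed = 0.99; speed <= 1.01; speed += 0.0001) {  // 0.01% steps
        std::vector<double> candidate = resample_by_drop_insert(envelope, speed);
        double q = alignment_quality(reference, candidate);
        if (q > best_quality) {
            best_quality = q;
            best_speed = speed;
        }
    }
    return best_speed;
}
```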

With two 5 minute files, extracting the audio took about 4.4 seconds, and calculating the alignment 200 times was instantaneous. With two 10 minute files, it took 9 seconds to extract the audio, and calculating the alignment 200 times took about 1 second.


Oh, I think I know what you mean now about the units. Shotcut doesn’t use percentages for speed control. In the Properties tab, the equivalent of 100% original speed is 1.000000.

Yes. It is dimensionless.

What I implemented was dropping or inserting frames. Back when I asked the question, I was thinking of a frequency for “dropping frames”.

Maybe it would be nice if Shotcut used %, just like Kdenlive. My program computes a float: 0.9995. But I print the output in %: 99.95\%.

It’s actually better that Shotcut doesn’t. Using that many decimal points allows for more precision. I’ve been able to salvage quite a number of audio and video files thanks to that many decimal points. It’s a precision that not even proprietary editing software offers, as far as I know.

So the most decimal points your program can do is 4? No way it can be bumped up to 6 to match Shotcut?

Although I believe that Shotcut is actually only using 5 decimal points because of a bug from Qt. Qt still hasn’t addressed it, right, @shotcut?

But percent or not, it makes no difference… 100.00\% and 1.0000 are just the same. If you want more decimal digits, you can use 100.0000\% or 1.000000.

It can be more precise. To avoid too many calculations, we should probably first search with coarse precision over a big range, and then repeat the search with finer precision over a small range around the rough value previously determined. I will try to implement it later. I will let you know. :slight_smile:

I believe that, internally, it can be a double-precision or a single-precision floating-point number. Those decimal points are just a matter of user interface. For the drift, I believe that single precision is more than sufficient.

Increasing the precision was a very good suggestion! Thank you. The clips I use are very short, but for longer clips it might be better to have more precision.

I wrote a version with “incremental precision”. It starts with a precision of 0.0001 and refines down to 0.000000001.
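The refinement loop is roughly this (a sketch; best_speed_in() is a hypothetical helper that runs the grid search above over the given range and step):

```cpp
double best_speed_in(double lo, double hi, double step);

double refine_drift()
{
    // Each pass scans about 200 candidates around the current best speed,
    // then divides both the step and the window by 10.
    double center = 1.0;
    for (double step = 1e-4; step >= 1e-9; step /= 10.0)
        center = best_speed_in(center - 100 * step, center + 100 * step, step);
    return center;
}
```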

time ./drift file1.mp4 file1.wav

/=== PROFILE: atsc_1080p_60 ===
| Start extracting audio...
Extracting audio...Mean: 1756.75.
| Finished extracting first producer's audio. Elapsed: 5014 million ticks.
Extracting audio...[adpcm_ima_wav @ 0x5595b52c31c0] Multiple frames in a packet.
Mean: 4440.2.
| Finished extracting second producer's audio. Elapsed: 2446 million ticks.
Calculating speed change for the file test_drifted.wav.
Precision: 0.01%.
Range: (99.00%, 101.00%).
Drift up to now: 99.95%.
Precision: 0.001%.
Range: (99.850%, 100.050%).
Drift up to now: 99.951%.
Precision: 0.0001%.
Range: (99.9410%, 99.9610%).
Drift up to now: 99.9505%.
Precision: 0.00001%.
Range: (99.94950%, 99.95150%).
Drift up to now: 99.95057%.
Precision: 0.000001%.
Range: (99.950470%, 99.950670%).
Drift up to now: 99.950577%.
Precision: 0.0000001%.
Range: (99.9505670%, 99.9505870%).
Adjust the speed of the second clip to 99.9505762%.

real	0m14.425s
user	0m14.127s
sys	0m0.280s

Again, what really takes time is the audio extraction. I think it is because the audio might be converted. I don’t know how to fetch the original frequency.

I don’t know much about Shotcut internals right now. But maybe Shotcut has already-extracted audio that I could use.

No, it does not.

I think it is because the audio might be converted. I don’t know how to fetch the original frequency.

Yes, of course the audio is converted. There are many audio codecs and sample formats. Obviously, compressed audio must be decoded, but beyond that MLT does not strive to work with every sample format possible. That would make every filter, transition, and consumer dealing with audio very complex. So, all inputs are converted automatically to what you are requesting based on your code: 16-bit signed PCM, 48000 Hz, and 2 channels. If you want to bypass resampling and get the frequency and channel values of the source, you need to look at some properties of the producer:

  • meta.media.%d.codec.channels
  • meta.media.%d.codec.sample_rate

where “%d” is the stream index. Multiplexed files can have more than one stream/track including more than one audio stream/track. Eventually, your code will need to handle an audio stream selection that occurs within Shotcut’s Properties > Audio > Track field. The MLT producer property that corresponds to this is audio_index, and when not specified (or -1) MLT uses the first one, which is not always 0. Stream indexes in MLT are absolute and not relative to the stream type (audio, video, text, data, etc.).

To facilitate this lookup see these other properties:

  • meta.media.nb_streams
  • meta.media.%d.stream.type

The index is 0-based; look for a stream.type value of “audio”.
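Putting those properties together, a hedged sketch of the lookup (illustrative, not Shotcut code):

```cpp
#include <Mlt.h>
#include <cstdio>
#include <cstring>

// Walk the streams of a producer and report the native sample rate and
// channel count of the audio ones.
void print_audio_streams(Mlt::Producer& producer)
{
    int streams = producer.get_int("meta.media.nb_streams");
    for (int i = 0; i < streams; i++) {
        char key[64];
        snprintf(key, sizeof key, "meta.media.%d.stream.type", i);
        const char* type = producer.get(key);
        if (type && !std::strcmp(type, "audio")) {
            snprintf(key, sizeof key, "meta.media.%d.codec.sample_rate", i);
            int rate = producer.get_int(key);
            snprintf(key, sizeof key, "meta.media.%d.codec.channels", i);
            int channels = producer.get_int(key);
            printf("stream %d: audio, %d Hz, %d channels\n", i, rate, channels);
        }
    }
}
```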

Ultimately, if you want something more direct, you can use the FFmpeg libraries.

Thank you, @shotcut!!!

I wanted to understand why reading the audio takes so long. Maybe using different frequencies for different files would make the lag detection fail. I think it is worth trying in the near future, so I am putting it on the roadmap. :sweat_smile:

No way! :upside_down_face:
Here I am trying to agree disagreeing instead of disagree agreeing… :stuck_out_tongue_winking_eye:

MLT is a great software. I shall not try to bypass it.

But if we understand where the issue is, we can come up with creative solutions. What we need is for the UI to stay responsive.

Is this an explanation of why I am probably getting all-zero samples when using Mlt::Producer("some_shotcut_file.mlt")?

No, the information above pertains to the avformat producer. I do not know yet why you are getting zeros.

I will try to migrate to Shotcut. :smiling_face_with_three_hearts:

Since I will not be using the automatic un-drifting and alignment through Shotcut’s UI until Monday, I shall use the following hack. I’d love to get some help from MLT gurus… :sweat_smile:

I have written a script that generates an mlt file, just like @brian mentioned.

The script uses the “drift and alignment calculator” and then executes something like

melt -profile atsc_1080p_60 file.mp4 -track -blank 78 file.wav -consumer xml
to generate the “.mlt” file. I will include this mlt file instead of file.mp4 and file.wav.

However, I do not know how to instruct melt to apply the “drift”. It seems I need to use -chain and -link. I want to apply those to file.wav, without applying it to -blank 78. I suppose it is something like this:

melt -profile atsc_1080p_60 file.mp4 -track -blank 78 -chain file.wav -link timeremap speed=0.9995 -consumer xml

The values 78 and 0.9995 are determined by the program I wrote.

So, does anyone know the proper way to instruct melt to generate this “aligned and un-drifted” mlt file?

Actually, I’d also like to suppress file.mp4’s audio… :pray: :grin:

Using a chain and the timeremap link is one way. But a simpler way is to use the timewarp producer.

To understand how it works:

  1. Open a clip in Shotcut
  2. Edit speed in the properties panel
  3. Save
  4. Inspect the resulting .mlt file

That will show you the producer and properties to use to change the speed of a producer.
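The same thing works from mlt++, if that helps; a minimal sketch (the “speed:file” resource syntax is the one shown in the melt command below):

```cpp
#include <Mlt.h>

int main()
{
    Mlt::Profile profile("atsc_1080p_60");
    // The timewarp producer encodes the speed in its resource string.
    Mlt::Producer slowed(profile, "timewarp:0.9995:file.wav");
    return slowed.is_valid() ? 0 : 1;
}
```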

Thank you @brian!
I used:

melt -profile atsc_1080p_60 test.mp4 -attach-clip volume gain=0 -track -blank 82 -producer timewarp:0.99950576:test_drifted.wav -consumer xml > test.mlt

I got some information on how to implement what you suggested, and on how to mute a producer, from Stack Overflow.

One interesting thing is that when I open test.mlt in Shotcut, the previewer does not apply the timewarp. Actually, if I simply run

melt -profile atsc_1080p_60 test.mp4 -attach-clip volume gain=0 -track -blank 82 -producer timewarp:0.99950576:test_drifted.wav
melt’s previewer does not apply the timewarp either.

But when I use Shotcut to export test.mlt to output.mp4, the timewarp is correctly applied.

Maybe I am doing something wrong. Or maybe both previewers use the same consumer, and this consumer is not consuming correctly (SDL?).

The previewer in Shotcut is really slow processing test.mlt when I add it as a track in some other project.

Also, my drift program now accepts command line options! :slight_smile:

$ ./drift --help
Options:
  -h [ --help ]                         help message
  -s [ --script ]                       output one line with
  -p [ --profile ] arg (=atsc_1080p_60) MLT profile
  --percent                             use percentage (i.e.: %) in reports
  --debug                               print debug output
  -d [ --field-separator ] arg (=;)     field separator for script output
                                        (default ';'): speed;lag file
                                        number;lag
  -a [ --approximate-drift ] arg (=1)   initial drift estimation
  -r [ --drift-range ] arg (=0.01)      initial drift estimation
  -m [ --print-precision ] arg (=6)     number of precision digits
  --precision arg (=9)                  number of precision digits
  -b [ --base-file ] arg                reference to align to
  -i [ --input-file ] arg               file to be un-drifted and aligned