Audio Alignment Implementation

Extracting the audio takes time (some 3 seconds). Processing the extracted audio is really fast (a blink).

I have uploaded a new version of compare that prints a report of the time spent on each part of the process. From the output, I conclude that the time spent on audio extraction probably depends on the frame rate and on the number of tracks (AV file vs. audio-only file). If there is a faster way to extract the audio, please let me know. :slight_smile:

/=== PROFILE: atsc_1080p_30 ===
| Start extracting audio…
| Finished extracting first producer’s audio. Elapsed: 1577 million ticks.
| Finished extracting second producer’s audio. Elapsed: 642 million ticks.
| Will calculate lag.
| Finished calculating lag. Elapsed: 1 million ticks.
=== RESULT (atsc_1080p_30) ===
Calculated lag: -10 frames. 0.333333 seconds.
-0h0m0s (-10 frames) (fps = 30).
=== END OF REPORT ===

/=== PROFILE: atsc_1080p_60 ===
| Start extracting audio…
| Finished extracting first producer’s audio. Elapsed: 2589 million ticks.
| Finished extracting second producer’s audio. Elapsed: 1119 million ticks.
| Will calculate lag.
| Finished calculating lag. Elapsed: 2 million ticks.
=== RESULT (atsc_1080p_60) ===
Calculated lag: -19 frames. 0.316667 seconds.
-0h0m0s (-19 frames) (fps = 60).
=== END OF REPORT ===

/=== PROFILE: atsc_720p_60 ===
| Start extracting audio…
| Finished extracting first producer’s audio. Elapsed: 2580 million ticks.
| Finished extracting second producer’s audio. Elapsed: 1187 million ticks.
| Will calculate lag.
| Finished calculating lag. Elapsed: 2 million ticks.
=== RESULT (atsc_720p_60) ===
Calculated lag: -19 frames. 0.316667 seconds.
-0h0m0s (-19 frames) (fps = 60).
=== END OF REPORT ===

Yes, it can. The main binary needs to call fftw_make_planner_thread_safe().
The FFTW library stores global state when it “plans” how to do the computations.
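For anyone trying this, a minimal sketch of the fix (illustrative only; fftw_make_planner_thread_safe() exists since FFTW 3.3.5 and may require linking the FFTW threads library, depending on the build):

```cpp
#include <fftw3.h>
#include <thread>
#include <vector>

int main()
{
    // Planning mutates FFTW's global planner state, so concurrent
    // fftw_plan_* calls are unsafe without this one-time call.
    fftw_make_planner_thread_safe();

    auto work = [] {
        std::vector<double> in(1024, 0.0), out(1024);
        fftw_plan p = fftw_plan_r2r_1d(1024, in.data(), out.data(),
                                       FFTW_R2HC, FFTW_ESTIMATE);
        fftw_execute(p);  // executing plans was already thread-safe
        fftw_destroy_plan(p);
    };

    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    return 0;
}
```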

With a little brainstorming, I think we can arrive at a very good user interface. I imagine, for example, that one could simply drag a bunch of audio files onto Shotcut, which would:

  1. Automatically choose which videos they correspond to.
  2. Align each file to its corresponding video.

I guess that each track is a producer. So, instead of aligning to the clip, we can simply align it to the whole track! :smiley:

This will not be something that is automatic; the user must choose to do it and control the reference. I am not going to let Shotcut compare against every video in the timeline.

Yes. The user chooses to do it. But the user doesn’t have to choose the clip. The user could choose the whole track, or even “associate” one track with some other track. The UI could determine the position and clip the audio.

For example (this is not a suggestion), there could be a dialog box with two drag-and-drop areas: a UI specific to audio alignment. This could generate many producers (like color clips) that would reproduce the video from one and the audio from the other. Some sort of audio_aligned_producer.

I need some help with audio extraction.

I used Shotcut to put many video files in one track. Then I saved it to videos.mlt. When I play it with melt using melt videos.mlt, there is audio.

But when I execute

./compare videos.mlt audio.wav
the audio extraction of videos.mlt just gives me zeros! :frowning_face:

This is the code that extracts the audio. I guess I am not using the most general and correct way to extract audio from a producer.

3 seconds for how long of an audio file? What if you have a 2 hour audio file? Will it also take just a few seconds?

When you say that it is extracting audio, do you literally mean that it is exporting an audio file to be saved on a hard drive?

Is there a limit on how many audio tracks it can sync at one time? I’m asking because this would be a very important function for a future implementation of multi-cam in Shotcut. People can be shooting with multiple cameras and would want to have them all in sync based on audio.

I’m also wondering if this could solve audio drift issues. Many will record video and audio on one camera and also record audio on a separate device. But because of frame rate issues, the audio from the external device will sometimes go out of sync with the audio from the video camera as time goes on. Can this solve audio drift from start to end? If not, can that be programmed in?

I tested with a 5 minute MP4 (video) and a 5 minute WAV file. I used the “atsc_1080p_60” profile to extract the audio. I am using an i3 laptop with about a 3.2 GHz clock. The program I wrote used about 70% of two CPUs. My laptop has 4 CPUs.

The computation is very fast and I suppose it would be (proportionally) as fast as with a 2 hour audio file. The computation algorithm is \mathrm{O}(n \log n).

I do not have such a file to test with. So, I produced an MLT file in Shotcut just to concatenate my video files. But the method for extracting the audio is not working for this case. It just extracts a huge sequence of zeroes.

Calculations are done by the FFTW library. It is very, very fast. The Kdenlive audio alignment has the same problem: I think it spends its time extracting the audio, not computing the lag.
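For context, the O(n log n) computation is classic FFT-based cross-correlation: transform both envelopes, multiply one spectrum by the conjugate of the other, transform back, and take the peak. A rough sketch with FFTW (illustrative names, not the actual compare code):

```cpp
#include <fftw3.h>
#include <algorithm>
#include <complex>
#include <vector>

// Returns the lag (in samples) of a relative to b.
long best_lag(const std::vector<double>& a, const std::vector<double>& b)
{
    int n = 1;
    while (n < static_cast<int>(a.size() + b.size()))
        n <<= 1;  // zero-padded FFT size

    std::vector<double> ta(n, 0.0), tb(n, 0.0), corr(n);
    std::copy(a.begin(), a.end(), ta.begin());
    std::copy(b.begin(), b.end(), tb.begin());

    std::vector<std::complex<double>> fa(n / 2 + 1), fb(n / 2 + 1);
    fftw_plan pa = fftw_plan_dft_r2c_1d(n, ta.data(),
        reinterpret_cast<fftw_complex*>(fa.data()), FFTW_ESTIMATE);
    fftw_plan pb = fftw_plan_dft_r2c_1d(n, tb.data(),
        reinterpret_cast<fftw_complex*>(fb.data()), FFTW_ESTIMATE);
    fftw_execute(pa);
    fftw_execute(pb);

    for (int i = 0; i <= n / 2; i++)
        fa[i] *= std::conj(fb[i]);  // spectrum of the cross-correlation

    fftw_plan pc = fftw_plan_dft_c2r_1d(n,
        reinterpret_cast<fftw_complex*>(fa.data()), corr.data(), FFTW_ESTIMATE);
    fftw_execute(pc);
    fftw_destroy_plan(pa);
    fftw_destroy_plan(pb);
    fftw_destroy_plan(pc);

    // The peak position is the lag; indexes past n/2 wrap to negative lags.
    long peak = std::max_element(corr.begin(), corr.end()) - corr.begin();
    return peak <= n / 2 ? peak : peak - n;
}
```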

No. I am just reading the array provided by Mlt::Frame::get_audio for each frame, using the audio format mlt_audio_s16. I think this process is too slow. And I also do not know how to configure things so that this method works for XML files, for example.
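For reference, a minimal sketch of that per-frame loop (illustrative, not the actual compare code; it assumes the producer is already open and requests 16-bit signed PCM as described):

```cpp
#include <Mlt.h>
#include <cstdint>
#include <memory>
#include <vector>

std::vector<int16_t> extract_audio(Mlt::Producer& producer)
{
    std::vector<int16_t> samples;
    for (int i = 0; i < producer.get_length(); i++) {
        std::unique_ptr<Mlt::Frame> frame(producer.get_frame());
        mlt_audio_format format = mlt_audio_s16;  // request signed 16-bit PCM
        int frequency = 48000, channels = 2, count = 0;
        auto* data = static_cast<int16_t*>(
            frame->get_audio(format, frequency, channels, count));
        if (data)  // interleaved; count is samples per channel
            samples.insert(samples.end(), data, data + count * channels);
    }
    return samples;
}
```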

No.

I have thought a little about this issue. I have dealt with drifted audio myself.

The algorithm cannot detect the lag if there is drift. But I think the alignment functions can be used to program a solution.

I don’t know of any general purpose “magic” like the one described in this post for calculating the drift. But of course, we can choose two small and distant “slices” of the audio, align them and use this information to calculate the drift. It is an important fact that the technique can be used to align a “short” producer to a “long” reference producer.

The slices would have to be short enough that the drift does not interfere with the lag detection. But they would also have to be long enough that we can find the corresponding match in the reference producer. This would require some experimenting.
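To make the arithmetic concrete, a tiny hypothetical helper (not existing code; positions and lags in samples): with drift, the measured lag grows linearly with position, so two slices give the speed correction as a slope.

```cpp
// pos1/pos2: positions of the two slices; lag1/lag2: their measured lags.
double drift_factor(double pos1, double lag1, double pos2, double lag2)
{
    return 1.0 + (lag2 - lag1) / (pos2 - pos1);  // speed factor to apply
}
```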

Another solution would be to simply discard (or insert) samples in the audio file at some frequency. We do this for a range of frequencies and pick the “best” result. The frequency of dropping (or inserting) samples then determines the drift.
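A sketch of that idea (illustrative only; nearest-sample drop/insert rather than proper resampling):

```cpp
#include <vector>

// Step through the signal at a non-unit rate; this drops or duplicates
// individual samples, changing the effective speed by the given factor.
std::vector<double> resample_by_drop_insert(const std::vector<double>& in,
                                            double speed)
{
    std::vector<double> out;
    out.reserve(static_cast<size_t>(in.size() / speed) + 1);
    for (double pos = 0.0; pos < in.size(); pos += speed)
        out.push_back(in[static_cast<size_t>(pos)]);
    return out;
}
```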

@DRM:
How do you “measure” the drift? What are the units?

Not sure how to answer that one because I’m not knowledgeable in this area. Shotcut just uses frame rates, so that’s all I go by. Also, I bring this up because it’s been mentioned a number of times on the forum by users who run into this audio drift problem.

@DRM:
I can determine the drift! :smiley:

I made a program to determine the drift.

Compiling

You compile it the same way you compile the other two:

g++ -fPIC -I /usr/include/mlt-7 -I /usr/include/mlt-7/mlt++ -o drift drift.cpp FFTInplaceArray.cpp AudioEnvelopeFFT.cpp -Wall -Wextra -lpthread -L /usr/lib/x86_64-linux-gnu/mlt-7/ -lmlt++-7 -lmlt-7 -lfftw3

Executing

You execute it:

./drift file.mp4 file.wav 2>/dev/null

And it tells you how to adjust the speed of file.wav.

/=== PROFILE: atsc_1080p_60 ===
| Start extracting audio…
| Finished extracting first producer’s audio. Elapsed: 8705 million ticks.
| Finished extracting second producer’s audio. Elapsed: 4569 million ticks.
Determined drift: -0.05%.
You should adjust the speed of file.wav to 99.95%.

I was adjusting these using exactly 99.95%!!! :smiley:

How it works

I adjusted the speed of the audio from 99% to 101%, in increments of 0.01%, and calculated the “alignment quality” for each one. This “quality” is a by-product of the alignment algorithm. That is, I executed the alignment algorithm 200 times, but without extracting the audio all over again (I just dropped or inserted the values of each frame).
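In outline, the search looks something like this (a sketch with hypothetical names; alignment_quality() stands in for the “quality” by-product, and resample_by_drop_insert() is the helper sketched earlier):

```cpp
#include <vector>

std::vector<double> resample_by_drop_insert(const std::vector<double>&, double);
double alignment_quality(const std::vector<double>&, const std::vector<double>&);

double find_drift(const std::vector<double>& reference,
                  const std::vector<double>& envelope)
{
    double best_speed = 1.0, best_quality = -1.0;
    for (double speed = 0.99; speed <= 1.01; speed += 0.0001) {  // 0.01% steps
        std::vector<double> candidate = resample_by_drop_insert(envelope, speed);
        double q = alignment_quality(reference, candidate);
        if (q > best_quality) {
            best_quality = q;
            best_speed = speed;
        }
    }
    return best_speed;
}
```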

With two 5 minute files, extracting the audio took about 4.4 seconds, and calculating the alignment 200 times was instantaneous. With two 10 minute files, it took 9 seconds to extract the audio, and calculating the alignment 200 times took about 1 second.


Oh, I think I know what you mean now about the units. Shotcut doesn’t use percentages for speed control. In the Properties tab, the equivalent of 100% original speed is 1.000000.

Yes. It is dimensionless.

What I implemented was dropping or inserting frames. Back when I asked the question, I was thinking of a frequency for “dropping frames”.

Maybe it would be nice if Shotcut used %, just like Kdenlive. My program computes a float: 0.9995. But I print the output in %: 99.95\%.

It’s actually better that Shotcut doesn’t. Using that many decimal points allows for more precision. I’ve been able to salvage quite a number of audio and video files thanks to that many decimal points. It’s a precision that not even proprietary editing software offers, as far as I know.

So the most decimal points your program can do is 4? No way it can be bumped up to 6 to match Shotcut?

Although I believe that Shotcut is actually only using 5 decimal points because of a bug from Qt. Qt still hasn’t addressed it, right, @shotcut?

But percent or not, it makes no difference… 100.00\% and 1.0000 are just the same. If you want more decimal digits, you can use 100.0000\% or 1.000000.

It can be more precise. To avoid too many calculations, we should probably first search with coarse precision over a big range, and then repeat the search with finer precision over a small range around the rough value previously determined. I will try to implement it later. I will let you know. :slight_smile:

I believe that, internally, it can be a double-precision or a single-precision floating-point number. Those decimal points are just a matter of user interface. For the drift, I believe that single precision is more than sufficient.

Increasing the precision was a very good suggestion! Thank you. The clips I use are very short, but for longer clips it might be better to have more precision.

I wrote a version with “incremental precision”. It starts with a precision of 0.0001 and refines down to 0.000000001.
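The refinement loop is roughly this (a sketch; best_speed_in() is a hypothetical helper that runs the grid search above over the given range and step):

```cpp
double best_speed_in(double lo, double hi, double step);

double refine_drift()
{
    // Each pass scans about 200 candidates around the current best speed,
    // then divides both the step and the window by 10.
    double center = 1.0;
    for (double step = 1e-4; step >= 1e-9; step /= 10.0)
        center = best_speed_in(center - 100 * step, center + 100 * step, step);
    return center;
}
```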

time ./drift file1.mp4 file1.wav

/=== PROFILE: atsc_1080p_60 ===
| Start extracting audio...
Extracting audio...Mean: 1756.75.
| Finished extracting first producer's audio. Elapsed: 5014 million ticks.
Extracting audio...[adpcm_ima_wav @ 0x5595b52c31c0] Multiple frames in a packet.
Mean: 4440.2.
| Finished extracting second producer's audio. Elapsed: 2446 million ticks.
Calculating speed change for the file test_drifted.wav.
Precision: 0.01%.
Range: (99.00%, 101.00%).
Drift up to now: 99.95%.
Precision: 0.001%.
Range: (99.850%, 100.050%).
Drift up to now: 99.951%.
Precision: 0.0001%.
Range: (99.9410%, 99.9610%).
Drift up to now: 99.9505%.
Precision: 0.00001%.
Range: (99.94950%, 99.95150%).
Drift up to now: 99.95057%.
Precision: 0.000001%.
Range: (99.950470%, 99.950670%).
Drift up to now: 99.950577%.
Precision: 0.0000001%.
Range: (99.9505670%, 99.9505870%).
Adjust the speed of the second clip to 99.9505762%.

real	0m14.425s
user	0m14.127s
sys	0m0.280s

Again, what really takes time is the audio extraction. I think it is because the audio might be converted. I don’t know how to fetch the original frequency.

I don’t know much about Shotcut internals right now. But maybe Shotcut has already-extracted audio that I could use.

No, it does not.

I think it is because the audio might be converted. I don’t know how to fetch the original frequency.

Yes, of course the audio is converted. There are many audio codecs and sample formats. Obviously, compressed audio must be decoded, but beyond that MLT does not strive to work with every sample format possible. That would make every filter, transition, and consumer dealing with audio very complex. So, all inputs are converted automatically to what you are requesting based on your code: 16-bit signed PCM, 48000 Hz, and 2 channels. If you want to bypass resampling and get the frequency and channel values of the source, you need to look at some properties of the producer:

  • meta.media.%d.codec.channels
  • meta.media.%d.codec.sample_rate

where “%d” is the stream index. Multiplexed files can have more than one stream/track including more than one audio stream/track. Eventually, your code will need to handle an audio stream selection that occurs within Shotcut’s Properties > Audio > Track field. The MLT producer property that corresponds to this is audio_index, and when not specified (or -1) MLT uses the first one, which is not always 0. Stream indexes in MLT are absolute and not relative to the stream type (audio, video, text, data, etc.).

To facilitate this lookup see these other properties:

  • meta.media.nb_streams
  • meta.media.%d.stream.type

The index is 0-based; look for a stream.type value of “audio”.
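Putting those properties together, a hedged sketch of the lookup (illustrative, not Shotcut code):

```cpp
#include <Mlt.h>
#include <cstdio>
#include <cstring>

// Walk the streams of a producer and report the native sample rate and
// channel count of the audio ones.
void print_audio_streams(Mlt::Producer& producer)
{
    int streams = producer.get_int("meta.media.nb_streams");
    for (int i = 0; i < streams; i++) {
        char key[64];
        snprintf(key, sizeof key, "meta.media.%d.stream.type", i);
        const char* type = producer.get(key);
        if (type && !std::strcmp(type, "audio")) {
            snprintf(key, sizeof key, "meta.media.%d.codec.sample_rate", i);
            int rate = producer.get_int(key);
            snprintf(key, sizeof key, "meta.media.%d.codec.channels", i);
            int channels = producer.get_int(key);
            printf("stream %d: audio, %d Hz, %d channels\n", i, rate, channels);
        }
    }
}
```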

Ultimately, if you want something more direct, you can use the FFmpeg libraries.

Thank you, @shotcut!!!

I wanted to understand why reading the audio takes so long. Maybe using different frequencies for different files would make the lag detection fail. I think it is worth trying in the near future, so I am putting it on the roadmap. :sweat_smile:

No way! :upside_down_face:
Here I am trying to agree disagreeing instead of disagree agreeing… :stuck_out_tongue_winking_eye:

MLT is a great software. I shall not try to bypass it.

But if we understand where the issue is, we can come up with creative solutions. What we need is for the UI to stay responsive.

Is this an explanation of why I am probably getting all-zero samples when using Mlt::Producer("some_shotcut_file.mlt")?

No, the information above pertains to the avformat producer. I do not know yet why you are getting zeros.

I will try to migrate to Shotcut. :smiling_face_with_three_hearts:

Since I will not be using the automatic un-drifting and alignment through Shotcut’s UI until Monday, I shall use the following hack. I’d love to get some help from MLT gurus… :sweat_smile:

I have written a script that generates an mlt file, just like @brian mentioned.

The script uses the “drift and alignment calculator” and then executes something like

melt -profile atsc_1080p_60 file.mp4 -track -blank 78 file.wav -consumer xml
to generate the “.mlt” file. I will include this mlt file instead of file.mp4 and file.wav.

However, I do not know how to instruct melt to apply the “drift”. It seems I need to use -chain and -link. I want to apply those to file.wav, without applying it to -blank 78. I suppose it is something like this:

melt -profile atsc_1080p_60 file.mp4 -track -blank 78 -chain file.wav -link timeremap speed=0.9995 -consumer xml

The values 78 and 0.9995 are determined by the program I wrote.

So, does anyone know the proper way to instruct melt to generate this “aligned and un-drifted” mlt file?

Actually, I’d also like to suppress file.mp4’s audio… :pray: :grin:

Using a chain and the timeremap link is one way. But a simpler way is to use the timewarp producer.

To understand how it works:

  1. Open a clip in Shotcut
  2. Edit speed in the properties panel
  3. Save
  4. Inspect the resulting .mlt file

That will show you the producer and properties to use to change the speed of a producer.
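The same thing works from mlt++, if that helps; a minimal sketch (the “speed:file” resource syntax is the one shown in the melt command below):

```cpp
#include <Mlt.h>

int main()
{
    Mlt::Profile profile("atsc_1080p_60");
    // The timewarp producer encodes the speed in its resource string.
    Mlt::Producer slowed(profile, "timewarp:0.9995:file.wav");
    return slowed.is_valid() ? 0 : 1;
}
```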

Thank you @brian!
I used:

melt -profile atsc_1080p_60 test.mp4 -attach-clip volume gain=0 -track -blank 82 -producer timewarp:0.99950576:test_drifted.wav -consumer xml > test.mlt

I got some information on how to implement what you suggested, and on how to mute a producer, from Stack Overflow.

One interesting thing is that when I open test.mlt in Shotcut, the previewer does not apply the timewarp. Actually, if I simply run

melt -profile atsc_1080p_60 test.mp4 -attach-clip volume gain=0 -track -blank 82 -producer timewarp:0.99950576:test_drifted.wav
melt’s previewer does not apply the timewarp either.

But when I use Shotcut to export test.mlt to output.mp4, the timewarp is correctly applied.

Maybe I am doing something wrong. Or maybe both previewers use the same consumer, and this consumer is not consuming correctly (SDL?).

The previewer in Shotcut is really slow processing test.mlt when I add it as a track in some other project.

Also, my drift program now accepts command line options! :slight_smile:

$ ./drift --help
Options:
  -h [ --help ]                         help message
  -s [ --script ]                       output one line with
  -p [ --profile ] arg (=atsc_1080p_60) MLT profile
  --percent                             use percentage (i.e.: %) in reports
  --debug                               print debug output
  -d [ --field-separator ] arg (=;)     field separator for script output
                                        (default ';'): speed;lag file
                                        number;lag
  -a [ --approximate-drift ] arg (=1)   initial drift estimation
  -r [ --drift-range ] arg (=0.01)      initial drift estimation
  -m [ --print-precision ] arg (=6)     number of precision digits
  --precision arg (=9)                  number of precision digits
  -b [ --base-file ] arg                reference to align to
  -i [ --input-file ] arg               file to be un-drifted and aligned