Audio Alignment Implementation

This is not a feature suggestion. I plan to implement functionality that is already on the Road Map. But I would like to discuss it, document it and, if possible, get support from volunteers. I am not really familiar with MLT and Shotcut terminology. Feel free to correct things (this post is a wiki).

The “MLT only” part has been accomplished. All we need now is to integrate it into Shotcut. I (André) plan to wait until the next release because people are busy with beta tests right now. The code is, for now, hosted in my sandbox git repo.

This post is what I understood from studying the corresponding functionality in Kdenlive.

The idea is to document here how automatic audio alignment should/could be implemented.

Theory: basic idea of how it works

A representation of an audio clip is a sequence of sample values along time. It is simply a huge vector in R^n. When two vectors \vec{v}, \vec{w} \in R^n are “aligned”, their “angle” is as close to 0 as possible. That is,

\frac{\vec{v} \cdot \vec{w}}{\Vert \vec{v} \Vert\,\Vert \vec{w} \Vert} \approx 1

When this is equal to 1, it means \vec{v} is a positive multiple of \vec{w}. If one removes the “mean sample value” from each signal, this quantity is called the correlation.

The rough idea is to “apply a shift” t to \vec{w} (interpreted as a time shift). Denote by \vec{w}_t the shifted vector. The audios are “aligned” for the t at which \vec{v} \cdot \vec{w}_t is maximal. If I understood things right, this t is also called the “lag”.

Computing all those inner products, one for each t, is an expensive operation. As far as I understood, the so-called Fast Fourier Transform (FFT) can be used to provide a fast way to calculate the vector

\vec{v} \star \vec{w} = (\vec{v} \cdot \vec{w}_{-n}, \cdots, \vec{v} \cdot \vec{w}_0, \vec{v} \cdot \vec{w}_1, \cdots, \vec{v} \cdot \vec{w}_n)

If I understood things right, this is what is called the cross-correlation vector.

Calling \mathcal{F}(\vec{v}) and \mathcal{F}(\vec{w}) the Fourier transforms of \vec{v} and \vec{w},

\mathcal{F}(\vec{v} \star \vec{w}) = \overline{\mathcal{F}(\vec{v})} \mathcal{F}(\vec{w})

That is,

\vec{v} \star \vec{w} = \mathcal{F}^{-1}(\overline{\mathcal{F}(\vec{v})} \mathcal{F}(\vec{w}))

Using MLT

Given an MLT producer, I’d like to extract, for each frame, one number that represents the audio in some sense. That is, one number for each mlt_audio_s. It is important that “similar audio” gets a similar number.

In Kdenlive, I think this is done in AudioEnvelope::loadAndNormalizeEnvelope(). There, Kdenlive extracts the frame samples in mlt_audio_s16 format for channel = 1 and then simply sums the data. I would like to know if this is the best decision. Here are some options:

  1. Sum up the data from one channel.
  2. Sum up the data in every channel.
  3. Pick just data[0] (where data is cast correctly).

Any thoughts? :slight_smile:

FFT library

I believe MLT already uses the Fast Fourier Transform library FFTW. So, I think it is a good choice. It is licensed under GPLv2 or any later version.

First steps


  1. Implement an AudioEnvelope containing \vec{v} and \mathcal{F}(\vec{v}).
  2. Implement an AudioEnvelope::get_lag(const AudioEnvelope&) const.

Shotcut UI

The goal is to allow clips in the timeline to get aligned somehow. However, as a first step, I think it would be nice if we could simply take one audio clip and one AV clip, align them and generate a producer to be put inside the main Playlist, just like a color clip. In this case, I thought that creating an AbstractProducerWidget could be an easy way to get it done.

Using Kdenlive’s implementation

I think Kdenlive’s implementation is overly complicated and mixes Kdenlive, Qt signals and MLT too much. A purely MLT approach is much simpler, in my opinion. A callback function would allow “progress feedback”. Shotcut could use this in its interface.

So, I think that reusing Kdenlive’s implementation is not a good idea.


I do not know which would be best, but you could make multiple implementations and then empirically find the best one by testing various scenarios.

It is not clear to me what FFT is used for.

I do not think we want the AudioEnvelope class in MLT because it does not relate to the rest of the framework. I think the class could be in Shotcut instead.

Actually I am confused… I don’t really know what planes and channels are in the context of mlt_audio_s. I thought they were the same thing.

For “2”, I am just wondering. Kdenlive uses channel = 1, so I guess it is okay to do the same. I just wanted to understand if and why we should be concerned about a second channel. I don’t know what multiple audio planes/channels are used for. I want to compare two audio tracks…

  • Will one of them have many planes/channels and the other just one or a few?
  • Different planes/channels contain completely different information?

Even for 1 and 3… do the mlt_audio_s::data contents usually vary a lot (inside the same frame)?

I am not really familiar with the “various scenarios”. So, I guess the safe choice is to just do it the same way Kdenlive does, and just wait for complaints. :sweat_smile:

Sorry! I didn’t make myself clear. I do not intend to have this included inside MLT. I wrote MLT just to mean it would be “Shotcut independent”. It would be code that could be used by anyone working with MLT. The Kdenlive code is totally dependent on Kdenlive’s internal structure. And I think “ours” shouldn’t be this way.

I do agree a lot that it should be in Shotcut. :slight_smile:

We have two vectors \vec{v} and \vec{w} of a huge size n. Probably n is the number of frames. We need to calculate \vec{v} \cdot \vec{w}_t for t = -n, \dotsc, n. Doing it by “brute force”, each t takes n products and n sums. Since we have to do it 2n times, we would have 2n^2 products and 2n^2 sums. This is O(n^2) operations. The FFT is a trick that allows one to get the same result in O(n \log(n)) operations. If n = 1000, then n^2 = 1000000 while n \log(n) \approx 6900.

I am writing a “class” that “consumes” the audio of a producer and extracts its “envelope”. Does it make any sense to implement this class as a consumer?

What is the “recommended/correct” way to iterate through all frames of a producer?

Take a look at audiolevelstask.cpp in the source code. It does exactly that. You do not need to inherit Mlt::Consumer. You can just get frames from a producer and get audio samples from the frames.


There is a Frame::get_audio that I didn’t quite understand. It returns a pointer but the pointer is not used. I guess it is being called for its side effects. I suppose its side effect is to convert the audio to the desired format. Is it so?

If this is so, then I guess Frame::get_double is used to return the “audio level”. But since there is a sampling of the audio, I wonder what exactly is the meaning of what frame->get_double(key[channel]) is returning.

Finally, which channels shall I use?

You might have more questions than I can answer. In that code, a filter is attached to the producer that does the calculations. When getting audio, the samples are not used by the caller, but the call is necessary to make the filter do its processing. Then, the result of the filter is retrieved through a property.

I do not know which channels you should use.

I suggest to just use the first audio channel. That would allow alignment of stereo and mono audio sources.

It is returning the value of “” which is being set on the frame by the audio level filter:


It is alive!!!

Please, test…
I made a simple test play.cpp. It would be a good idea to add noise, gain, etc…

Check it out… or better… clone it out! :slight_smile:


I have implemented a program that actually takes two files and tells you how to align them. I have tested it with some of my videos, and it works just perfectly.

For example:

$ ./compare file.mp4 file.wav

You can even list the profiles you want to use:

$ ./compare file.mp4 file.wav atsc_720p_25 atsc_1080p_30 atsc_1080p_60 hdv_720_30p

With the files I tested, I put them in Shotcut, displaced one of the tracks according to the program output, and it works! The program spends most of its time extracting the audio data for each frame it gets from the producers.

Right now, it is just an MLT program. I hope I can integrate it in Shotcut. Shall I use a branch in my cloned repo for that? I hope you don’t mind my asking tons of questions here! It is going to get worse. :sweat_smile:

Well… I know there is a release candidate out there, right now… but if you have time… please, test it!!! :pray:

How long does it take to sync audio files? Is it right away or does it take some time?

And can this sync more than two audio files at the same time?

Congratulations! And thank you for your contribution. We will take a deeper look at this after the next release.

Here are the Kdenlive docs for audio alignment:

Maybe this can be simplified to a single action such that the first clip selected is the reference that does not move.

Extracting the audio takes time (about 3 seconds). Processing the extracted audio is really fast (a blink).

I have uploaded a new version of compare that prints a report of the time spent on each part of the process. From the output, I conclude that the time spent on audio extraction probably depends on the frame rate and the number of tracks (AV file or just an audio file). If there is a faster way to extract the audio, please let me know. :slight_smile:

/=== PROFILE: atsc_1080p_30 ===
| Start extracting audio…
| Finished extracting first producer’s audio. Elapsed: 1577 million ticks.
| Finished extracting second producer’s audio. Elapsed: 642 million ticks.
| Will calculate lag.
| Finished calculating lag. Elapsed: 1 million ticks.
=== RESULT (atsc_1080p_30) ===
Calculated lag: -10 frames. 0.333333 seconds.
-0h0m0s (-10 frames) (fps = 30).

/=== PROFILE: atsc_1080p_60 ===
| Start extracting audio…
| Finished extracting first producer’s audio. Elapsed: 2589 million ticks.
| Finished extracting second producer’s audio. Elapsed: 1119 million ticks.
| Will calculate lag.
| Finished calculating lag. Elapsed: 2 million ticks.
=== RESULT (atsc_1080p_60) ===
Calculated lag: -19 frames. 0.316667 seconds.
-0h0m0s (-19 frames) (fps = 60).

/=== PROFILE: atsc_720p_60 ===
| Start extracting audio…
| Finished extracting first producer’s audio. Elapsed: 2580 million ticks.
| Finished extracting second producer’s audio. Elapsed: 1187 million ticks.
| Will calculate lag.
| Finished calculating lag. Elapsed: 2 million ticks.
=== RESULT (atsc_720p_60) ===
Calculated lag: -19 frames. 0.316667 seconds.
-0h0m0s (-19 frames) (fps = 60).

Yes, it can. The main binary needs to call fftw_make_planner_thread_safe().
The FFTW library stores global information when it “plans” on how to do the computations.

With a little brainstorming, I think we can arrive at a very good user interface. I imagine, for example, that one could simply drag in a bunch of audio files and Shotcut would:

  1. Automatically choose which video each of them corresponds to.
  2. Align those files to their corresponding videos.

I guess that each track is a producer. So, instead of aligning to the clip, we can simply align it to the whole track! :smiley:

This will not be something that is automatic; the user must choose to do it and to control the reference. I am not going to let Shotcut compare with every video in the timeline.

Yes. The user chooses to do it. But the user doesn’t have to choose the clip. The user could choose the whole track. Or even “associate” one track to some other track. The UI could determine the position and clip the audio.

For example (this is not a suggestion), there could be a dialog box with two drag-and-droppable areas. A UI specific for audio alignment. This could generate many producers (like color clips) that would reproduce the video from one and the audio from the other. Some sort of audio_aligned_producer.

I need some help with audio extraction.

I used Shotcut to put many video files in one track. Then I saved it as videos.mlt. When I play it with melt using melt videos.mlt, there is audio.

But when I execute

$ ./compare videos.mlt audio.wav

the audio extraction of videos.mlt just gives me zeros! :frowning_face:

This is the code that extracts the audio. I guess I am not using the correct, most general way to extract audio from a producer.

3 seconds for how long of an audio file? What if you have a 2 hour audio file? Will it also take just a few seconds?

When you say that it is extracting audio, do you literally mean that it is exporting an audio file to be saved on a hard drive?

Is there a limit on how many audio tracks it can sync at one time? I’m asking because this would be a very important function for a future implementation of multi-cam in Shotcut. People can be shooting with multiple cameras and would want to have them all in sync based on audio.

I’m also wondering if this could solve audio drift issues. Many will record video and audio on one camera and also record audio on a separate device. But because of frame rate issues sometimes the audio from the external device will go out of sync with the audio that is from the video camera as time goes on. Can this solve audio drift from start to end? If not can that be programmed in?

I tested with a 5 minute MP4 (video) and a 5 minute WAV file. I used the “atsc_1080p_60” profile to extract the audio. I am on an i3 laptop with about a 3.2GHz clock. The program I wrote used about 70% of two CPUs. My laptop has 4 CPUs.

The computation is very fast and I suppose it would be (proportionally) as fast as with a 2 hour audio file. The computation algorithm is \mathrm{O}(n \log n).

I do not have such a file to test with. So, I produced an MLT file in Shotcut just to concatenate my video files. But the method for extracting the audio is not working in this case. It just extracts a huge sequence of zeros.

Calculations are done by the FFTW library. It is very very fast. The Kdenlive audio alignment has the same problem. I think it spends time extracting the audio, not computing the lag.

No. I am just reading the array provided by Mlt::Frame::get_audio, for each frame, using the audio format mlt_audio_s16. I think this process is too slow. And I also do not know how to configure things so that this method works for XML files, for example.


I have thought a little about this issue. I have drifted audios myself.

The algorithm cannot detect the lag if there is drift. But I think the alignment functions can be used to program a solution.

I don’t know of any general-purpose “magic”, like the one described in this post, for calculating the drift. But of course, we can choose two small and distant “slices” of the audio, align them, and use this information to calculate the drift. It is an important fact that the technique can be used to align a “short” producer to a “long” reference producer.

The slices would have to be short enough so that the drift does not interfere in the lag detection. But they also have to be long enough so we can find the corresponding match on the reference producer. This would require some experimenting.

Another solution would be to simply discard (or insert) samples in the audio file at some frequency. We do this for a range of frequencies and determine the “best” result. The frequency of dropping (or inserting) samples would determine the drift.