This is not a feature suggestion. I plan to implement functionality that is already on the Road Map, but I would like to discuss it, document it and, if possible, get support from volunteers. I am not really familiar with MLT and Shotcut terminology, so feel free to correct things (this post is a wiki).
The “MLT only” part has been accomplished. All we need now is to integrate it into Shotcut. I (André) plan to wait until the next release because people are busy with beta tests right now. The code is, for now, hosted in my sandbox git repo.
This post is what I understood from studying the corresponding functionality in Kdenlive.
The idea is to document here how automatic audio alignment should/could be implemented.
Theory: the basic idea of how it works
A representation of an audio clip is a sequence of sample values along time. It is simply a huge vector in R^n. When two vectors \vec{v}, \vec{w} \in R^n are “aligned”, their “angle” is as close to 0 as possible. That is,

\frac{\vec{v} \cdot \vec{w}}{\|\vec{v}\| \, \|\vec{w}\|} \approx 1

When this is equal to 1, it means \vec{v} is a positive multiple of \vec{w}. If one removes the “mean sample value” from the signal, then this quantity is called the correlation.
The rough idea is to “apply a shift” t to \vec{w} (interpreted as a time-shift). Denote by \vec{w}_t the shifted vector, so that (\vec{w}_t)_i = \vec{w}_{i - t}. The audios are “aligned” for a certain t when \vec{v} \cdot \vec{w}_t is maximal. If I understood things right, this t is also called the “lag”.
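To make the cost of this search concrete, here is a brute-force sketch (a hypothetical illustration, not code from my repo; out-of-range samples of the shifted \vec{w} are treated as silence):

```cpp
#include <limits>
#include <vector>

// Brute-force search for the lag t maximizing v . w_t.
// Cost is O(n * maxLag); this is what the FFT approach below avoids.
int naiveLag(const std::vector<double> &v, const std::vector<double> &w, int maxLag)
{
    int bestLag = 0;
    double bestScore = std::numeric_limits<double>::lowest();
    for (int t = -maxLag; t <= maxLag; ++t) {
        double score = 0.0;
        for (int i = 0; i < (int) v.size(); ++i) {
            const int j = i - t; // (w_t)_i = w_{i-t}; out of range counts as silence
            if (j >= 0 && j < (int) w.size())
                score += v[i] * w[j];
        }
        if (score > bestScore) {
            bestScore = score;
            bestLag = t;
        }
    }
    return bestLag;
}
```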
Computing all those inner products, one for each t, is an expensive operation. As far as I understood, the so-called Fast Fourier Transform can be used to provide a fast way to calculate the vector

\left( \vec{v} \cdot \vec{w}_t \right)_t

of all those inner products at once. If I understood things right, this is what is called the cross-correlation vector.
Calling \mathcal{F}(\vec{v}) and \mathcal{F}(\vec{w}) the Fourier transforms of \vec{v} and \vec{w}, the correlation theorem states that the cross-correlation vector is

\mathcal{F}^{-1}\left( \overline{\mathcal{F}(\vec{v})} \cdot \mathcal{F}(\vec{w}) \right)

That is, instead of one inner product per shift, we need only two transforms, one pointwise product and one inverse transform.
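Using FFTW (see “FFT library” below), the whole computation could look like the following rough sketch. The function name and the zero-padding policy are my own choices for illustration, not an existing MLT API:

```cpp
#include <fftw3.h>
#include <complex>
#include <vector>

// Cross-correlation of two real vectors via FFTW. Zero-padding to
// N = |v| + |w| avoids circular wrap-around. Entry t holds v . w_t for
// t >= 0; negative lags wrap around to the top of the vector.
std::vector<double> crossCorrelate(std::vector<double> v, std::vector<double> w)
{
    const int N = (int) (v.size() + w.size());
    v.resize(N, 0.0);
    w.resize(N, 0.0);

    std::vector<std::complex<double>> V(N / 2 + 1), W(N / 2 + 1);
    fftw_plan pv = fftw_plan_dft_r2c_1d(N, v.data(),
        reinterpret_cast<fftw_complex *>(V.data()), FFTW_ESTIMATE);
    fftw_plan pw = fftw_plan_dft_r2c_1d(N, w.data(),
        reinterpret_cast<fftw_complex *>(W.data()), FFTW_ESTIMATE);
    fftw_execute(pv);
    fftw_execute(pw);

    // Pointwise conj(F(v)) * F(w): the correlation theorem.
    for (size_t k = 0; k < V.size(); ++k)
        V[k] = std::conj(V[k]) * W[k];

    std::vector<double> result(N);
    fftw_plan pr = fftw_plan_dft_c2r_1d(N,
        reinterpret_cast<fftw_complex *>(V.data()), result.data(), FFTW_ESTIMATE);
    fftw_execute(pr);

    fftw_destroy_plan(pv);
    fftw_destroy_plan(pw);
    fftw_destroy_plan(pr);

    // FFTW transforms are unnormalized; divide by N once at the end.
    for (double &x : result)
        x /= N;
    return result;
}
```

The lag is then the index of the largest entry, reading indices above N/2 as negative shifts.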
Using MLT
Given an MLT producer, I’d like to extract, for each frame, one number that represents the audio in some sense. That is, one number for each mlt_audio_s. It is important that “similar audio” gets similar numbers.
In Kdenlive, I think this is done in AudioEnvelope::loadAndNormalizeEnvelope(). There, Kdenlive extracts the frame samples in mlt_audio_s16 format for channel = 1, and then just sums the data. I would like to know if this is the best decision. Here are some options:
- Sum up the data from one channel.
- Sum up the data in every channel.
- Pick up just data[0] (where data is cast correctly).
Any thoughts?
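For concreteness, here is a rough sketch of how I imagine this extraction loop, based on my reading of Kdenlive. Summing absolute values is my assumption (raw signed samples would mostly cancel out around zero), and error handling is omitted:

```cpp
#include <framework/mlt.h>
#include <cstdint>
#include <cstdlib>
#include <vector>

// One envelope value per frame, mimicking what I understood from
// Kdenlive's AudioEnvelope::loadAndNormalizeEnvelope().
std::vector<int64_t> extractEnvelope(mlt_producer producer, int frames)
{
    std::vector<int64_t> envelope;
    envelope.reserve(frames);
    const double fps = mlt_producer_get_fps(producer);

    for (int i = 0; i < frames; ++i) {
        mlt_producer_seek(producer, i);
        mlt_frame frame = NULL;
        mlt_service_get_frame(MLT_PRODUCER_SERVICE(producer), &frame, 0);

        mlt_audio_format format = mlt_audio_s16; // interleaved signed 16-bit
        int frequency = 48000;
        int channels = 1;                        // as Kdenlive requests
        int samples = mlt_audio_calculate_frame_samples(fps, frequency, i);
        int16_t *data = NULL;
        mlt_frame_get_audio(frame, (void **) &data, &format,
                            &frequency, &channels, &samples);

        // Sum the absolute sample values of the first channel.
        int64_t sum = 0;
        for (int s = 0; s < samples; ++s)
            sum += std::abs((int) data[s * channels]);
        envelope.push_back(sum);

        mlt_frame_close(frame);
    }
    return envelope;
}
```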
FFT library
I believe MLT already uses the Fast Fourier Transform library FFTW, so I think it is a good choice. It is licensed under GPLv2 or any later version.
First steps
MLT
- Implement an AudioEnvelope containing \vec{v} and \mathcal{F}(\vec{v}).
- Implement an AudioEnvelope::get_lag(const AudioEnvelope&) const (sketched below).
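A minimal sketch of what that interface could look like; everything beyond the two items above is an assumption on my part:

```cpp
#include <framework/mlt.h>
#include <complex>
#include <vector>

class AudioEnvelope
{
public:
    // Extracts one value per frame from the producer (see “Using MLT”)
    // and caches the Fourier transform of the resulting vector.
    explicit AudioEnvelope(mlt_producer producer);

    // Returns the shift t maximizing the cross-correlation between this
    // envelope and the other one, i.e. the “lag”.
    int get_lag(const AudioEnvelope &other) const;

private:
    std::vector<double> m_envelope;               // \vec{v}
    std::vector<std::complex<double>> m_spectrum; // \mathcal{F}(\vec{v})
};
```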
Shotcut UI
The goal is to allow clips in the timeline to get aligned somehow. However, as a first step, I think it would be nice if we could simply take one audio clip and one AV clip, align them and generate a producer to be put inside the main Playlist, just like a color clip. In this case, I thought that creating an AbstractProducerWidget widget could be an easy way to get it done.
Using Kdenlive’s implementation
I think Kdenlive’s implementation is overly complicated and mixes Kdenlive, Qt signals and MLT too much. A purely MLT approach is much simpler, in my opinion. A callback function would allow “progress feedback”, which Shotcut could use in its interface.
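A minimal sketch of the shape I have in mind, assuming nothing about existing APIs (all names are made up):

```cpp
#include <functional>

// The MLT-side routine accepts a plain callback, so it stays free of Qt;
// Shotcut can wrap a progress-bar update (or signal emission) in a lambda.
using ProgressCallback = std::function<void(int percent)>;

void buildEnvelope(int totalFrames, const ProgressCallback &onProgress)
{
    for (int i = 0; i < totalFrames; ++i) {
        // ... extract the envelope value for frame i ...
        if (onProgress)
            onProgress(100 * (i + 1) / totalFrames);
    }
}
```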
So, I think that reusing Kdenlive’s implementation is not a good idea.