Autocorrelation, cepstrum, and Mozer transform are three algorithms for processing pitched signals. I realized in the shower this morning how closely related these algorithms are. The only difference is what function is applied to each element of a spectrum in the frequency domain to change the balance of weak and strong signals.
Trigger warning: This follow-up to a post from six years ago will be very math-heavy, as I'm taking notes for a possible future project.
Given two functions f[] and g[] on the integers, their convolution (f * g)[] is defined as
(f * g)[t] = Σ[m=-∞ to ∞](f[m]g[t - m])
The related cross-correlation is
(f *_ g)[t] = Σ[m=-∞ to ∞](f*[m]g[t + m])
where f* is the complex conjugate (real part same, imaginary part negated) of f. This is equivalent to convolution of the time reversal of first signal's complex conjugate (or the signal itself if it is real-valued) with a second signal. If one signal is real and symmetric, correlation and convolution are essentially equivalent.
The support (width of domain where the function is nonzero) of a discrete convolution or cross-correlation equals the sum of the support of the inputs minus one. For example, convolving functions of length 3 and 6 produces a function of length 8.
Convolution with a finite impulse response kernel is often used to filter discrete-time signals, such as audio. Cross-correlation of a signal with itself, or autocorrelation, is used to determine the pitch of a signal. A signal's autocorrelation is time symmetric and has a peak once per period of the signal with an overall simpler shape than the signal itself.
Convolution of two long signals takes O(n^2) operations, where n is the width of their support. But because convolution in the time domain correspods to multiplication in the frequency domain (and vice versa), this can be reduced to O(n log n) through an algorithm using the Fourier transform:
Thus autocorrelation becomes this:
Autocorrelation{f[]} = F-1{|F{f[]}|^2}
Time symmetry thus results from loss of phase information, and only the positive half of the autocorrelation is considered.
A power cepstrum is similar to an autocorrelation, except a logarithm is inserted to boost the weak frequencies and separate the formants (timbre) of a periodic signal from its pitch.
PowerCepstrum{f[]} = F-1{log(|F{f[]}|^2)}
The power cepstrum discards phase and is suitable for analysis rather than synthesis, such as pitch detection. There also exists a complex cepstrum that preserves phase and is thus useful for synthesis, but the documents I initially found about calculating the unwrapped phase were paywalled at $33.00 per article from IEEE.
Another approach is to replace squared magnitude with magnitude. This results in the Mozer transform, which discards phase information to find a time-symmetric representation for a given timbre. This was used to compress voice in several products using a codec developed by Electronic Speech Systems, such as Impossible Mission for C64 ("Another visitor") and Mito Koumon for Famicom.
Mozer{f[]} = F-1{|F{f[]}|}
Because the signal is symmetric, half of it can be discarded and computed from the time reversal of the other half. It also concentrates the signal's power close to zero in the time domain, allowing lossy compression of the timbre by windowing out small components far from zero time. And theoretically, the only difference between this and the power cepstrum is that the logarithm (which boosts weak signals) is replaced with the square root (which also boosts weak signals).
So a speech codec designed for real-time playback on a 1.8 MHz microprocessor without a hardware multiply might break the signal into segments on the order of 25 ms and then do the following to each segment:
Trigger warning: This follow-up to a post from six years ago will be very math-heavy, as I'm taking notes for a possible future project.
Given two functions f[] and g[] on the integers, their convolution (f * g)[] is defined as
(f * g)[t] = Σ[m=-∞ to ∞](f[m]g[t - m])
The related cross-correlation is
(f *_ g)[t] = Σ[m=-∞ to ∞](f*[m]g[t + m])
where f* is the complex conjugate (real part same, imaginary part negated) of f. This is equivalent to convolution of the time reversal of first signal's complex conjugate (or the signal itself if it is real-valued) with a second signal. If one signal is real and symmetric, correlation and convolution are essentially equivalent.
The support (width of domain where the function is nonzero) of a discrete convolution or cross-correlation equals the sum of the support of the inputs minus one. For example, convolving functions of length 3 and 6 produces a function of length 8.
Convolution with a finite impulse response kernel is often used to filter discrete-time signals, such as audio. Cross-correlation of a signal with itself, or autocorrelation, is used to determine the pitch of a signal. A signal's autocorrelation is time symmetric and has a peak once per period of the signal with an overall simpler shape than the signal itself.
Convolution of two long signals takes O(n^2) operations, where n is the width of their support. But because convolution in the time domain correspods to multiplication in the frequency domain (and vice versa), this can be reduced to O(n log n) through an algorithm using the Fourier transform:
- Pad both signals to at least the sum of their support (to avoid circular convolution effects)
- Take the Fourier transform of both signals, producing two spectra F{f[]} and F{g[]}
- If performing cross-correlation, take the complex conjugate of F{g[]}
- Multiply the spectra together
- Take the inverse Fourier transform of the resulting spectrum
Thus autocorrelation becomes this:
- Pad the signal to at least twice its support
- Take the FT of the signal
- Multiply each element of the spectrum by its complex conjugate, or equivalently, take the square of the magnitude
- Take the inverse FT
Autocorrelation{f[]} = F-1{|F{f[]}|^2}
Time symmetry thus results from loss of phase information, and only the positive half of the autocorrelation is considered.
A power cepstrum is similar to an autocorrelation, except a logarithm is inserted to boost the weak frequencies and separate the formants (timbre) of a periodic signal from its pitch.
PowerCepstrum{f[]} = F-1{log(|F{f[]}|^2)}
- Pad the signal to at least twice its support
- Take the FT of the signal
- Take the logarithm of the square of the magnitude of each element of the spectrum
- Take the inverse FT
The power cepstrum discards phase and is suitable for analysis rather than synthesis, such as pitch detection. There also exists a complex cepstrum that preserves phase and is thus useful for synthesis, but the documents I initially found about calculating the unwrapped phase were paywalled at $33.00 per article from IEEE.
Another approach is to replace squared magnitude with magnitude. This results in the Mozer transform, which discards phase information to find a time-symmetric representation for a given timbre. This was used to compress voice in several products using a codec developed by Electronic Speech Systems, such as Impossible Mission for C64 ("Another visitor") and Mito Koumon for Famicom.
- Pad the signal to at least twice its support
- Take the FT of the signal
- Take the magnitude and discard the phase
- Take the inverse FT
Mozer{f[]} = F-1{|F{f[]}|}
Because the signal is symmetric, half of it can be discarded and computed from the time reversal of the other half. It also concentrates the signal's power close to zero in the time domain, allowing lossy compression of the timbre by windowing out small components far from zero time. And theoretically, the only difference between this and the power cepstrum is that the logarithm (which boosts weak signals) is replaced with the square root (which also boosts weak signals).
So a speech codec designed for real-time playback on a 1.8 MHz microprocessor without a hardware multiply might break the signal into segments on the order of 25 ms and then do the following to each segment:
- Across the previous, current, and next segment, apply a suitable window function to taper out edge discontinuities.
- Apply one of these dephasing transforms to the windowed signal.
- Pick a peak of the signal with a lag (aka quefrency) greater than 2 ms to find the period. One suggestion is to find the amplitude of the strongest such peak, then find the lowest-lag peak with at least 80% of that amplitude. If needed, use the peaks found when analyzing the previous and next segment to help avoid transient octave errors.
- Apply the Mozer transform to the signal.
- Crop out the signal more than half a period from zero, and see how much of the signal between a quarter and half period you can crop out.
- Store the cropped time domain formants and pitch data.