History of Math & Acoustics (timeline, 1600 to 2000)
● Newton: Law of Forces/Motions; Foundation of Calculus
● Bernoulli, Euler, d'Alembert: Wave Equation; Complex Number
● Gauss, Fourier, Laplace, Riemann, Cauchy, Kirchhoff, Heaviside: Fourier/Laplace Transform; Analog Circuits & Electromagnetism
● Filtering Theory; Sampling Theory; Digital Systems; ...
(http://www2.ling.su.se/staff/hartmut/kemplne.htm)
Source-Filter Model
Lung → Vocal Folds → Vocal Tract → Lip
Signal Generator (Source) → Filter 1 → Filter 2
Signal Generator → Filter 0 → Filter 1 → Filter 2
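The three-filter chain can be sketched as a toy synthesizer. This is a minimal illustration under stated assumptions, not the talk's implementation: the signal generator is an impulse train, Filter 0 is a hypothetical double-pole low-pass standing in for the glottal pulse, Filter 1 is a cascade of two-pole formant resonators with assumed /a/-like values (700, 1200, 2600 Hz), and Filter 2 is a first difference approximating lip radiation. The helper names (`iir`, `resonator`, `source_filter_synth`) are made up for this example.

```python
import numpy as np

FS = 16000  # assumed sampling rate (Hz)


def iir(b, a, x):
    """Direct-form IIR filter: a[0]*y[n] = sum_k b[k]*x[n-k] - sum_{k>=1} a[k]*y[n-k]."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y[n] = acc / a[0]
    return y


def resonator(freq, bw, fs=FS):
    """Two-pole resonator centred at `freq` Hz with bandwidth `bw` Hz."""
    r = np.exp(-np.pi * bw / fs)
    theta = 2 * np.pi * freq / fs
    return [1.0 - r], [1.0, -2.0 * r * np.cos(theta), r * r]


def source_filter_synth(f0=100.0, dur=0.1,
                        formants=((700, 80), (1200, 90), (2600, 120)), fs=FS):
    n = int(dur * fs)
    # Signal generator: one impulse per pitch period
    x = np.zeros(n)
    x[::int(round(fs / f0))] = 1.0
    # Filter 0 (glottal, hypothetical): double real pole at 0.97 gives a smooth pulse
    x = iir([1.0], [1.0, -2 * 0.97, 0.97 ** 2], x)
    # Filter 1 (vocal tract): cascade of formant resonators
    for freq, bw in formants:
        b, a = resonator(freq, bw, fs)
        x = iir(b, a, x)
    # Filter 2 (lip radiation): first difference, roughly +6 dB/octave
    return iir([1.0, -1.0], [1.0], x)


y = source_filter_synth()
print(len(y))  # 1600 samples = 0.1 s at 16 kHz
```

Each stage is a separate filter so that any one of them (for example Filter 0) can be swapped without touching the others, which is the point of the source-filter decomposition.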
20th Century, the Dawn of Speech Processing
● Cooley and Tukey (1965): Fast Fourier Transform
● Oppenheim (1969): one of the earliest digital implementations of speech analysis/synthesis
[Diagram: Input → Analysis (Cepstrum) → Spectrum (vocal tract filter) and Pitch (source) → Synthesis → Output]
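Oppenheim's homomorphic system separates source and filter in the cepstrum: taking the log-magnitude spectrum turns the product "source spectrum × vocal tract response" into a sum, and the inverse transform places the smooth vocal tract envelope at low quefrencies and the pitch period at a high-quefrency peak. A minimal NumPy sketch of that idea follows; the lifter length (30 samples), the FFT size and the 60 to 400 Hz pitch search range are assumed values, not taken from the 1969 paper, and the test frame is synthetic.

```python
import numpy as np


def real_cepstrum(frame, n_fft=1024):
    """Inverse FFT of the log-magnitude spectrum."""
    log_mag = np.log(np.abs(np.fft.rfft(frame, n_fft)) + 1e-12)
    return np.fft.irfft(log_mag, n_fft)


def split_source_filter(frame, fs, n_lifter=30, fmin=60.0, fmax=400.0, n_fft=1024):
    """Return (log-magnitude envelope, f0 estimate) from one windowed frame."""
    c = real_cepstrum(frame, n_fft)
    # Low-quefrency lifter -> smooth vocal tract envelope
    lifter = np.zeros(n_fft)
    lifter[0] = 1.0
    lifter[1:n_lifter] = 2.0  # fold the symmetric negative-quefrency half onto the positive side
    envelope = np.fft.rfft(c * lifter, n_fft).real
    # Strongest high-quefrency peak in the pitch range -> source period in samples
    qmin, qmax = int(fs / fmax), int(fs / fmin)
    period = qmin + int(np.argmax(c[qmin:qmax]))
    return envelope, fs / period


fs, n = 16000, 1024
t = np.arange(n) / fs
# Synthetic frame: harmonics of 200 Hz shaped by one resonance near 700 Hz, plus a little noise
harmonics = np.arange(200, 8000, 200)
frame = sum(np.cos(2 * np.pi * f * t) / (1 + ((f - 700) / 150) ** 2) for f in harmonics)
frame = (frame + 1e-3 * np.random.default_rng(0).standard_normal(n)) * np.hanning(n)
envelope, f0 = split_source_filter(frame, fs)
print(round(f0, 1))
```

The printed estimate should land close to 200 Hz; its resolution is limited to whole quefrency samples (16000/80 = 200, 16000/79 ≈ 202.5).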
Quasi-static Assumption (speech parameters are treated as constant within each short analysis frame)
Algorithms affected:
● Homomorphic Filtering
● PSOLA
● Linear Prediction & CELP & MLSA
● Sinusoidal Model
● Harmonic+Noise Model
● SMS & NBVPM
● WORLD & STRAIGHT (slightly)
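The quasi-static assumption shows up in code as frame-based analysis: the signal is cut into short overlapping frames and each frame is summarized by a single parameter set. The sketch below estimates one autocorrelation-based f0 per frame; the 40 ms frame, 10 ms hop, search range and helper names are assumptions for illustration. Any pitch glide inside a frame is collapsed into one number, which is exactly what the assumption takes for granted.

```python
import numpy as np


def frame_signal(x, frame_len, hop):
    """Cut x into overlapping frames of frame_len samples, hop samples apart."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]


def framewise_f0(x, fs, frame_len=640, hop=160, fmin=60.0, fmax=400.0):
    """One f0 per frame: each frame (40 ms at 16 kHz) is treated as stationary."""
    frames = frame_signal(x, frame_len, hop) * np.hanning(frame_len)
    lag_min, lag_max = int(fs / fmax), int(fs / fmin)
    f0 = np.empty(len(frames))
    for i, fr in enumerate(frames):
        ac = np.correlate(fr, fr, mode="full")[frame_len - 1:]  # lags 0 .. frame_len-1
        f0[i] = fs / (lag_min + np.argmax(ac[lag_min:lag_max]))
    return f0


fs = 16000
t = np.arange(int(0.5 * fs)) / fs
f0_track = framewise_f0(np.sin(2 * np.pi * 200 * t), fs)
print(len(f0_track))  # 47 frames
```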
Mis-represented Aperiodic Component
Popular belief:
1. Speech = periodic signal + aperiodic signal (breathing noise)
2. Aperiodic signal is filtered white noise
[Figure: Periodic vs. Aperiodic (Friction) components]
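For reference, the "popular belief" can be written down as a generator: a periodic part plus stationary filtered white noise. The sketch below is that conventional model, not a proposal from this talk; the first-order high-pass noise filter, the helper names and all parameter values are assumptions. Because the noise is stationary, its energy is the same at every point of the glottal cycle, which the hypothetical helper `energy_by_cycle_phase` makes visible.

```python
import numpy as np


def periodic_plus_filtered_noise(f0, amps, noise_gain, dur, fs, seed=0):
    """Conventional model: harmonic (periodic) part + stationary filtered white noise."""
    n = int(dur * fs)
    t = np.arange(n) / fs
    periodic = sum(a * np.cos(2 * np.pi * (k + 1) * f0 * t) for k, a in enumerate(amps))
    white = np.random.default_rng(seed).standard_normal(n)
    # Fixed colouring filter (assumed): 1 - 0.9 z^-1, a mild high-pass for friction-like noise
    aperiodic = noise_gain * np.convolve(white, [1.0, -0.9])[:n]
    return periodic, aperiodic


def energy_by_cycle_phase(x, f0, fs, n_bins=8):
    """Mean energy of x in each of n_bins slices of the pitch cycle."""
    phase = (np.arange(len(x)) * f0 / fs) % 1.0
    bins = np.minimum((phase * n_bins).astype(int), n_bins - 1)
    return np.array([np.mean(x[bins == b] ** 2) for b in range(n_bins)])


periodic, aperiodic = periodic_plus_filtered_noise(100.0, [1.0, 0.5, 0.25], 0.1, 1.0, 16000)
print(energy_by_cycle_phase(aperiodic, 100.0, 16000))  # roughly equal values in every bin
```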
Algorithms affected:
● (Quasi-)Harmonic+Noise Model
● SMS & Sines+Noise+Transients Model
● WORLD & (TANDEM-)STRAIGHT
● Algorithms that do not model the aperiodic component
  ○ Phase vocoder, CELP, MLSA, ...
Over-simplified Source-Filter Model
[Diagram: Oscillator → Source Filter → Tract Filter → Lip Filter, reduced to Oscillator → Tract Filter]
Assumption: the source filter is independent of pitch.
Equivalent assumption: “When my pitch is higher by 12 semitones, my vocal folds still oscillate at the same speed.”
Affected algorithms: all of those listed on page 11
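The "same speed" remark can be made concrete with a glottal pulse model. The sketch below uses a Rosenberg-style trigonometric pulse (swapped in for the LF model for brevity) and compares two hypothetical regimes: a fixed pulse shape, which is what a pitch-independent source filter implies, and a pulse whose opening and closing phases scale with the period (open phase 30% and closing phase 12% of the period, assumed values). Raising the pitch by 12 semitones doubles f0.

```python
import numpy as np


def rosenberg_pulse(f0, fs, t_open, t_close):
    """One period of a Rosenberg-style glottal flow pulse (t_open, t_close in seconds)."""
    n = int(round(fs / f0))
    t = np.arange(n) / fs
    g = np.zeros(n)
    rise = t < t_open
    fall = (t >= t_open) & (t < t_open + t_close)
    g[rise] = 0.5 * (1.0 - np.cos(np.pi * t[rise] / t_open))
    g[fall] = np.cos(np.pi * (t[fall] - t_open) / (2.0 * t_close))
    return g  # remaining samples stay 0: closed phase


def fixed_shape_pulse(f0, fs=16000):
    # Pitch-independent source: 3 ms opening, 1.2 ms closing at any f0 (assumed)
    return rosenberg_pulse(f0, fs, 0.003, 0.0012)


def period_scaled_pulse(f0, fs=16000):
    # Phases proportional to the period (assumed 30% / 12%)
    return rosenberg_pulse(f0, fs, 0.30 / f0, 0.12 / f0)


def closing_samples(g):
    """Samples from the flow peak until the folds are closed (flow reaches 0)."""
    peak = int(np.argmax(g))
    return int(np.argmax(g[peak:] == 0.0))


for f0 in (100.0, 200.0):  # 200 Hz is 12 semitones above 100 Hz
    print(f0, closing_samples(fixed_shape_pulse(f0)), closing_samples(period_scaled_pulse(f0)))
# 100.0 20 20
# 200.0 20 10
```

Under the fixed shape the closing phase keeps the same duration when the pitch doubles, which is the "vocal folds still oscillate at the same speed" situation; in the period-scaled regime it halves.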
Part 3. Future: How to Fix It & the Low Level Speech Model
“Neoclassical” Approaches to Speech Modeling
[Diagram: Input → inverse Lip and Tract filters → Source]
● OVE Synthesizer (Fant, 1953)
● LF Model (Liljencrants, Fant and Lin, 1985)
● Linear Prediction (Atal & Schroeder, 1967)
● ARX (Wen, et al., 1995)
● ARX-LF (Vincent, et al., 2005)
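Of the approaches listed, linear prediction is the simplest to sketch: fit an all-pole model A(z) to the signal, then run the signal through A(z) as an inverse (tract) filter so that the residual approximates the excitation. The code below uses the autocorrelation method with a Levinson-Durbin recursion; it does not implement the ARX or ARX-LF estimators, which jointly fit a source model. The synthetic AR(2) test signal and its coefficients are assumptions chosen so the estimate can be checked.

```python
import numpy as np


def autocorr(x, order):
    """Biased autocorrelation r[0..order]."""
    n = len(x)
    return np.array([np.dot(x[:n - k], x[k:]) for k in range(order + 1)])


def levinson_durbin(r, order):
    """Solve the LPC normal equations; returns (a, err) with A(z) = 1 + a1 z^-1 + ... + ap z^-p."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err  # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err


def inverse_filter(x, a):
    """Apply the FIR inverse filter A(z); the output is the prediction residual."""
    return np.convolve(x, a)[:len(x)]


# Synthetic check: x[n] = 1.3 x[n-1] - 0.8 x[n-2] + e[n]
rng = np.random.default_rng(0)
e = rng.standard_normal(20000)
x = np.zeros_like(e)
for n in range(2, len(e)):
    x[n] = 1.3 * x[n - 1] - 0.8 * x[n - 2] + e[n]

a, _ = levinson_durbin(autocorr(x, 2), 2)
residual = inverse_filter(x, a)
print(np.round(a, 2))  # close to [1, -1.3, 0.8]
```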
● Degottex (2013): similar idea, but in the frequency domain
● Hua (2016, in progress): more robust under poor recording conditions; less sensitive to processed input
The Low Level Speech Model (new version)
An acoustically meaningful speech model
● Level 1 (Acoustic Level): Glottal/Source Information (LF Model), Vocal Tract Filter, Lip Filter
● Level 0 (Signal Level): Pitch, Harmonic Model (Spectrum), Noise Model
  ○ Noise Model Channel 1: Energy, Harmonic Model
  ○ Noise Model Channel 2: Energy, Harmonic Model
  ○ Noise Model Channel 3: Energy, Harmonic Model
  ○ ...
[Diagram: Input Signal → model → Output Signal]
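One reading of the Level 0 diagram, sketched here as an assumption rather than LLSM's actual code: the harmonic model is a sum of sinusoids at multiples of the pitch, and each noise channel carries an energy envelope that is itself described by a harmonic series, so the aperiodic component can pulse in step with the glottal cycle instead of being stationary filtered noise. The band-splitting into Channel 1, 2, 3, ... is omitted, and the function names and coefficients are hypothetical.

```python
import numpy as np


def harmonic_synth(f0, amps, phases, n, fs):
    """Harmonic model: sum of cosines at integer multiples of f0."""
    t = np.arange(n) / fs
    out = np.zeros(n)
    for k, (a, p) in enumerate(zip(amps, phases), start=1):
        out += a * np.cos(2 * np.pi * k * f0 * t + p)
    return out


def noise_channel(f0, env_coefs, n, fs, rng):
    """White noise shaped by a pitch-synchronous energy envelope.

    env_coefs[0] is the mean energy; env_coefs[1:] are the envelope's harmonic
    amplitudes at f0, 2*f0, ... (hypothetical parametrization).
    """
    harm = env_coefs[1:]
    env = env_coefs[0] + harmonic_synth(f0, harm, np.zeros(len(harm)), n, fs)
    env = np.maximum(env, 0.0)  # an energy envelope cannot be negative
    return np.sqrt(env) * rng.standard_normal(n)


fs, n, f0 = 16000, 16000, 100.0
rng = np.random.default_rng(1)
voiced = harmonic_synth(f0, [1.0, 0.5, 0.25], [0.0, 0.0, 0.0], n, fs)
breathy = noise_channel(f0, [1.0, 0.9], n, fs, rng)
signal = voiced + 0.1 * breathy
```

With the envelope coefficients above, the noise energy near the start of each cycle is many times larger than half a cycle later, in contrast to stationary filtered white noise.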
Inverse Analysis of Speech
[Figure: Original signal and estimated Glottal Flow (Source Signal)]
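Inverse analysis has to undo the lip filter as well as the tract filter before the glottal flow becomes visible. A common approximation, assumed here, models lip radiation as the first-order differentiator 1 - r z^-1 with r slightly below 1 (0.99 in the sketch); its exact inverse is a leaky integrator. The sketch covers only this last step, assumes the inverse tract filter has already been applied, and uses hypothetical helper names and a toy flow waveform.

```python
import numpy as np


def lip_radiation(flow, r=0.99):
    """Approximate lip radiation: y[n] = x[n] - r * x[n-1]."""
    flow = np.asarray(flow, dtype=float)
    out = flow.copy()
    out[1:] -= r * flow[:-1]
    return out


def inverse_lip(pressure, r=0.99):
    """Leaky integrator x[n] = y[n] + r * x[n-1], the exact inverse of lip_radiation."""
    flow = np.empty(len(pressure))
    acc = 0.0
    for i, v in enumerate(pressure):
        acc = v + r * acc
        flow[i] = acc
    return flow


# Round trip on a toy glottal-flow-like waveform (half-wave rectified 100 Hz sine, assumed)
g = np.maximum(np.sin(2 * np.pi * 100 * np.arange(800) / 16000), 0.0)
recovered = inverse_lip(lip_radiation(g))
print(np.allclose(recovered, g))  # True
```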
Pitch Shifting powered by LLSM
[Audio examples: Original, 50% Pitch, 200% Pitch]
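The slides do not show how LLSM performs the shift. For comparison, the sketch below is the common envelope-preserving baseline in a harmonic model: scale f0 and read new harmonic amplitudes off the original amplitude envelope. It is not LLSM's method, and it keeps the whole envelope (glottal contribution included) fixed, which is exactly the pitch-independence assumption questioned on the "Over-simplified Source-Filter Model" slide. Log-linear interpolation, the toy envelope and the function name are assumptions.

```python
import numpy as np


def pitch_shift_harmonics(f0, amps, ratio, fs):
    """Baseline envelope-preserving pitch shift in a harmonic model.

    amps[k] is the amplitude of harmonic k+1 at f0. The new harmonics sample the
    original amplitude envelope (log-linear interpolation, assumed) at multiples of f0*ratio.
    """
    amps = np.asarray(amps, dtype=float)
    old_freqs = f0 * np.arange(1, len(amps) + 1)
    new_f0 = f0 * ratio
    n_new = int((fs / 2) // new_f0)  # keep harmonics up to Nyquist
    new_freqs = new_f0 * np.arange(1, n_new + 1)
    # np.interp clamps outside the original range (e.g. below the old first harmonic)
    log_amps = np.interp(new_freqs, old_freqs, np.log(amps + 1e-12))
    return new_f0, np.exp(log_amps)


fs, f0 = 16000, 100.0
amps = np.exp(-f0 * np.arange(1, 81) / 2000.0)  # toy envelope, harmonics up to 8 kHz
up_f0, up_amps = pitch_shift_harmonics(f0, amps, 2.0, fs)      # 200% pitch
down_f0, down_amps = pitch_shift_harmonics(f0, amps, 0.5, fs)  # 50% pitch
print(up_f0, len(up_amps), down_f0, len(down_amps))  # 200.0 40 50.0 160
```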
Instants of vocal fold closure were revealed.
References
● A. V. Oppenheim, "Speech Analysis-Synthesis System Based on Homomorphic Filtering," Journal of the Acoustical Society of America, vol. 45, no. 2, 1969.
● G. Degottex et al., "Mixed source model and its adapted vocal tract filter estimate for voice transformation and synthesis," Speech Communication, vol. 55, no. 2, pp. 278-294, 2013.
● H. K. Dunn, "The calculation of vowel resonances, and an electrical vocal tract," Journal of the Acoustical Society of America, vol. 22, pp. 740-753, 1950.
● Y. Pantazis and Y. Stylianou, "Improving the modeling of the noise part in the harmonic plus noise model of speech," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2008.