VQ Source Models
Transcription
VQ Source Models: Perceptual & Phase Issues

Dan Ellis & Ron Weiss
Laboratory for Recognition and Organization of Speech and Audio (LabROSA)
Dept. Electrical Eng., Columbia Univ., NY USA
{dpwe,ronw}@ee.columbia.edu
http://labrosa.ee.columbia.edu/

1. Source Models for Separation
2. VQ with Perceptual Weighting
3. Phase and Resynthesis
4. Conclusions

Single-Channel Scene Analysis

• How to separate overlapping sounds?
[Figure: spectrograms (freq/kHz vs. time/s, level/dB) of two isolated sources and their mixture]
  underconstrained: infinitely many decompositions
  time-frequency overlaps cause obliteration
  .. no obvious segmentation of sources (?)

Scene Analysis as Inference

• Ideal separation is rarely possible
  i.e. no projection can guarantee to remove overlaps
• Overlaps → ambiguity
  scene analysis = find "most reasonable" explanation
• Ambiguity can be expressed probabilistically
  i.e. posteriors of sources {Si} given observations X:
  P({Si} | X) ∝ P(X | {Si}) · P({Si})
  where P(X | {Si}) is the combination physics and P({Si}) the source models
• Better source models → better inference
  .. learn from examples?

Vector-Quantized (VQ) Source Models

• "Constraint" of source can be captured explicitly in a codebook (dictionary):
  x(t) ≈ c_i(t), where i(t) ∈ 1 … N
  defines the 'subspace' occupied by the source
• Codebook minimizes distortion (MSE) by k-means clustering
[Figure: VQ800 codebook, linear distortion measure, greedy ordering; freq/mel band vs. codebook index 1-800, level/dB]

Simple Source Separation

• Given models for sources, find "best" (most likely) states for spectra:
  combination model: p(x | i1, i2) = N(x; c_i1 + c_i2, Σ)
  inference of source state: {i1(t), i2(t)} = argmax_{i1,i2} p(x(t) | i1, i2)
  can include sequential constraints...
  different domains for combining c and defining Σ
• E.g. stationary noise:
[Figure: mel spectrograms of original speech; speech in speech-shaped noise (mel magsnr = 2.41 dB); VQ inferred states (mel magsnr = 3.6 dB)]

Codebook Size

• Two (main) variables:
  number of codewords
  amount of training data
• Measure average accuracy (distortion):
  main effect of codebook size
  larger codebooks need/allow more data
  (large absolute distortion values)
[Figure: VQ magnitude fidelity (dB SNR) vs. codebook size (100-3000), for 600 to 16000 training frames]

Distortion Metric

• Standard MSE gives equal weight by channel
  excessive emphasis on high frequencies
• Try e.g. Mel spectrum
  approx. log spacing of frequency bins
[Figure: the same utterance on linear-frequency (freq/kHz) and mel-frequency (freq/mel band) axes]
• Little effect (?):
[Figure: VQ800 codebooks trained with the linear distortion measure vs. the mel/cube-root distance]
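To make the codebook construction on the "Vector-Quantized (VQ) Source Models" slide concrete, here is a minimal k-means sketch in numpy. The names (`fit_vq_codebook`, `frames`) are illustrative, not from the original; it assumes `frames` is an (n_frames, n_bins) array of training spectra and uses the plain MSE distortion the slide names.

```python
import numpy as np

def fit_vq_codebook(frames, n_codewords=800, n_iters=50, seed=0):
    """Fit a VQ codebook to (n_frames, n_bins) spectral frames by k-means."""
    rng = np.random.default_rng(seed)
    # Initialize codewords from randomly chosen training frames.
    codebook = frames[rng.choice(len(frames), n_codewords, replace=False)].copy()
    for _ in range(n_iters):
        # Squared Euclidean distance from every frame to every codeword,
        # expanded as |a|^2 - 2ab + |b|^2 to avoid a huge 3-D intermediate.
        d = ((frames ** 2).sum(axis=1, keepdims=True)
             - 2.0 * frames @ codebook.T
             + (codebook ** 2).sum(axis=1))
        assign = d.argmin(axis=1)
        for k in range(n_codewords):
            members = frames[assign == k]
            if len(members):
                codebook[k] = members.mean(axis=0)  # centroid update
    return codebook
```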
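The per-frame MAP inference on the "Simple Source Separation" slide can be sketched the same way. This assumes magnitude spectra combine approximately additively with isotropic Gaussian error, i.e. the slide's p(x | i1, i2) = N(x; c_i1 + c_i2, Σ) with Σ = σ²I, so maximizing the likelihood is minimizing squared error. The exhaustive N1 × N2 search is exactly the combinatoric cost that the later "Future Directions" slide wants to factorize away.

```python
import numpy as np

def infer_pair(x, cb1, cb2):
    """MAP codeword pair (i1, i2) for one mixture frame x of shape (n_bins,)."""
    # Exhaustive search over all combined codewords c_i1 + c_i2; with an
    # isotropic Gaussian, -log p(x | i1, i2) is squared error up to a constant.
    combos = cb1[:, None, :] + cb2[None, :, :]
    neg_ll = ((combos - x) ** 2).sum(axis=2)
    return np.unravel_index(neg_ll.argmin(), neg_ll.shape)
```

Note the (N1, N2, n_bins) intermediate: for two 800-codeword books this is already hundreds of megabytes per frame, which is why large explicit codebooks scale badly.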
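And a rough sketch of the "Mel/cube root" distance from the "Distortion Metric" slide, under the assumption (suggested by the codebook figure labels) that it maps magnitudes through a mel filterbank and compresses amplitudes with a cube root before taking MSE. The triangular filterbank here is a standard simplified construction, not necessarily the authors' exact one.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(sr, n_bins, n_mels=80):
    """Unnormalized triangular mel filters over n_bins linear-frequency bins."""
    mel_edges = np.linspace(0.0, hz_to_mel(sr / 2.0), n_mels + 2)
    hz_edges = 700.0 * (10.0 ** (mel_edges / 2595.0) - 1.0)
    idx = np.round(hz_edges / (sr / 2.0) * (n_bins - 1)).astype(int)
    fb = np.zeros((n_mels, n_bins))
    for m in range(n_mels):
        l, c, r = idx[m], idx[m + 1], idx[m + 2]
        fb[m, l:c + 1] = np.linspace(0.0, 1.0, c - l + 1)  # rising slope
        fb[m, c:r + 1] = np.linspace(1.0, 0.0, r - c + 1)  # falling slope
    return fb

def mel_cuberoot_distance(mag_a, mag_b, fb):
    """Distortion after mel mapping and cube-root amplitude compression."""
    return np.mean(((fb @ mag_a) ** (1.0 / 3.0)
                    - (fb @ mag_b) ** (1.0 / 3.0)) ** 2)
```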
Resynthesis Phase

• Codewords quantize spectrum magnitude
  phase has arbitrary offset due to STFT grid
• Resynthesis (ISTFT) requires phase info
  use mixture phase? no good for filling-in
• Spectral peaks indicate common instantaneous frequency (∂φ/∂t)
  can quantize and cumulate in resynthesis
  .. like the "phase vocoder"
[Figure: Code 924: magnitudes (21 training frames, level/dB) and instantaneous frequency (Inst Freq/Hz) vs. freq/Hz]

Resynthesis Phase (2)

• Can also improve phase iteratively
  repeat:
    X^(1)(t,f) = |X̂(t,f)| · exp{ jφ^(1)(t,f) }
    x^(1)(t) = istft{ X^(1)(t,f) }
    φ^(2)(t,f) = ∠ stft{ x^(1)(t) }
  goal: |X^(n)(t,f)| = |X̂(t,f)|
• Visible benefit:
[Figure: iterative phase estimation; SNR of reconstruction (dB) vs. iteration number 0-8: magnitude quantization only 4.7 dB, starting from quantized phase reaches 7.6 dB, starting from random phase 3.4 dB]
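A sketch of the phase-vocoder-style resynthesis from the "Resynthesis Phase" slide: each codeword stores a quantized instantaneous frequency per bin, which is integrated (cumulated) across frames to give a usable phase track. The bookkeeping here (`codebook_if`, a fixed hop) is my assumption, not a detail given on the slides.

```python
import numpy as np

def cumulate_phase(states, codebook_if, hop, sr):
    """Integrate per-codeword instantaneous frequency (Hz) across frames.

    states:      length-T sequence of codeword indices i(t)
    codebook_if: (N, n_bins) instantaneous frequency per codeword and bin
    """
    n_bins = codebook_if.shape[1]
    phase = np.zeros((len(states), n_bins))
    phi = np.zeros(n_bins)                 # arbitrary starting offset
    for t, i in enumerate(states):
        phase[t] = phi
        # Advance each bin's phase by 2*pi*f*dt over one hop.
        phi = phi + 2.0 * np.pi * codebook_if[i] * (hop / sr)
    return phase
```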
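The iterative update on "Resynthesis Phase (2)" matches Griffin & Lim's magnitude-constrained phase estimation; here is a compact sketch using scipy's stft/istft (the parameter values are placeholders, and it assumes the analysis/synthesis round trip preserves the frame count up to a trim). As the slide's plot indicates, initializing from the quantized phase rather than random phase starts the iteration much closer to the target.

```python
import numpy as np
from scipy.signal import stft, istft

def refine_phase(target_mag, phase, n_iters=8, sr=16000, nperseg=512, hop=256):
    """Iteratively re-estimate STFT phase for a fixed target magnitude."""
    for _ in range(n_iters):
        X = target_mag * np.exp(1j * phase)   # impose |X_hat|, keep phase
        _, x = istft(X, fs=sr, nperseg=nperseg, noverlap=nperseg - hop)
        _, _, X2 = stft(x, fs=sr, nperseg=nperseg, noverlap=nperseg - hop)
        # Keep only the re-analysis phase; trim any extra trailing frame.
        phase = np.angle(X2)[:, :target_mag.shape[1]]
    return target_mag * np.exp(1j * phase)
```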
Evaluating Model Quality

• Low distortion is not really the goal; models are to constrain source separation
  fit source spectra but reject non-source signals
• Include sequential constraints
  e.g. transition matrix for codewords
  .. or smaller HMM with distributions over codebook
• Best way to evaluate is via a task
  e.g. separating speech from noise
[Figure: mel spectrograms: 0 dB SNR to speech-shaped noise (magsnr = 1.8 dB); direct VQ (magsnr = -0.6 dB); HMM-smoothed VQ (magsnr = -2.1 dB)]

Future Directions

• Factorized codebooks
  codebooks too large due to combinatorics
  separate codebooks for type, formants, excitation?
[Figure: spectrum decomposed into low, mid, high components; level/dB vs. freq/Hz, 0-8000 Hz]
• Model adaptation
  many speaker-dependent models, or...
  single speaker-adapted model, fit to each speaker
• Using uncertainty
  enhancing noisy speech for listeners: use special tokens to preserve uncertainty

Summary

• Source models permit separation of underconstrained mixtures
  or at least inference of source state
• Explicit codebooks need to be large
  .. and chosen to optimize perceptual quality
• Resynthesis phase can be quantized
  .. using "phase vocoder" derivative
  .. iterative re-estimation helps more

Extra Slides

Other Uses for Source Models

• Projecting into the model's space: restoration / extension
  inferring missing parts
• Generation / copying: adaptation + fitting

Example 2: Mixed Speech Recog.

• Cooke & Lee's Speech Separation Challenge
  short, grammatically-constrained utterances:
  <command:4><color:4><preposition:4><letter:25><number:10><adverb:4>
  e.g. "bin white at M 5 soon" (t5_bwam5s_m5_bbilzp_6p1.wav)
• IBM's "superhuman" recognizer (Kristjansson et al., Interspeech'06):
  model individual speakers (512-mix GMM)
  infer speakers and gain
  reconstruct speech
  recognize as normal...
• Grammar constraints a big help
[Screenshot: excerpts from Kristjansson et al., Interspeech'06, including Figure 1 (word error rates for the (a) Same Talker, (b) Same Gender, and (c) Different Gender cases at SNRs from 6 dB to -9 dB, comparing the SDL recognizer with no dynamics, acoustic dynamics, and grammar dynamics against human listeners), feature details (log-power spectra at a 15 ms rate, 40 ms frames, 640-point FFT), and the paper's reference list]

Model-Based Separation

• Central idea: employ strong learned constraints to disambiguate possible sources:
  {Si} = argmax_{Si} P(X | {Si})
  (Varga & Moore '90, Roweis '03, ...)
• e.g. fit a speech-trained vector quantizer to the mixed spectrum, then separate via T-F mask (again)
[Figure: MAXVQ denoising results from Roweis '03. Training: 300 sec of isolated speech (TIMIT) to fit 512 codewords, and 100 sec of isolated noise (NOISEX) to fit 32 codewords; testing on new signals at 0 dB SNR]

Separation or Description?

• Are isolated waveforms required?
  clearly sufficient, but may not be necessary
  not part of perceptual source separation!
• Integrate separation with application?
  e.g. speech recognition
[Diagram: "separation" path: mix → identify target energy → t-f masking + resynthesis → ASR (speech models) → words; vs. "identify" path: mix → find best words model (speech models + source knowledge) → words; output = abstract description of signal]
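The sequential constraints mentioned on "Evaluating Model Quality" (a transition matrix over codewords, as in the HMM-smoothed VQ panel) amount to Viterbi decoding over codeword states. A generic sketch, where the per-frame log-likelihoods are assumed to come from a fit like the Gaussian combination model earlier:

```python
import numpy as np

def viterbi(frame_loglik, log_trans, log_init):
    """Most likely codeword sequence under a transition-matrix prior.

    frame_loglik: (T, N) per-frame codeword log-likelihoods
    log_trans:    (N, N) log P(i_t | i_{t-1});  log_init: (N,) log P(i_0)
    """
    T, N = frame_loglik.shape
    delta = log_init + frame_loglik[0]
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans   # rows index the previous state
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + frame_loglik[t]
    path = np.empty(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 2, -1, -1):            # trace back the best path
        path[t] = back[t + 1, path[t + 1]]
    return path
```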
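And the "separate via T-F mask" step on the "Model-Based Separation" slide, in the spirit of Roweis '03 refiltering: once codewords are inferred for each source, each time-frequency cell of the mixture is assigned to whichever model claims more energy there. A binary-mask sketch; `mag1`/`mag2` are the inferred codeword magnitude spectrograms (my naming):

```python
import numpy as np

def refilter(mix_stft, mag1, mag2):
    """Split a mixture STFT with a binary mask from two inferred models."""
    mask = mag1 >= mag2                       # True where source 1 dominates
    return np.where(mask, mix_stft, 0.0), np.where(mask, 0.0, mix_stft)
```

Masking the mixture, rather than resynthesizing codewords directly, keeps the mixture's own magnitude and phase in the cells each source wins.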
Evaluation

• How to measure separation performance?
  depends what you are trying to do
• SNR?
  energy (and distortions) are not created equal
  different nonlinear components [Vincent et al. '06]
• Intelligibility?
  rare for nonlinear processing to improve intelligibility
  listening tests expensive
• ASR performance?
  separate-then-recognize too simplistic; ASR needs to accommodate separation
[Figure: net effect vs. aggressiveness of processing: transmission errors fall with reduced interference while increased artefacts grow, giving an optimum at intermediate aggressiveness]
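The "magsnr" figures quoted throughout these slides are never defined in the transcription; presumably they are a magnitude-domain SNR between reference and reconstructed (mel) spectrograms, something like:

```python
import numpy as np

def magsnr(ref_mag, est_mag):
    """dB ratio of reference magnitude energy to magnitude-error energy."""
    err = ref_mag - est_mag
    return 10.0 * np.log10(np.sum(ref_mag ** 2) / np.sum(err ** 2))
```

Treat this as a guess at the metric, not the authors' definition.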