Short-time Viterbi for online HMM decoding

Short-time Viterbi for online HMM decoding: Evaluation on a real-time
phone recognition task
Fernando Torre, David Pitchford, Phil Brown, Loren Terveen
ICASSP 2008, IEEE
1
Outline
● Authors
○ Julien Bloit
○ Xavier Rodet
● Abstract
○ The paper derives an online version of the Viterbi algorithm that runs on
successive variable-length windows, iteratively storing portions of the
optimal state path.
● Introduction
2
Author-Julien Bloit
I am a research engineer and musician working at the intersection of digital
media, new musical experience, and live performance. I’m a hands-on person
who likes to build things to help me think, and repurpose everyday objects for
digital interaction. I specialize in audio software but I love taking a break
from the headphones to enjoy the smell of solder, burning acrylic on a laser
cutter, or learning about animal communication. My most challenging job so
far was being a steward on the Eiffel Tower elevators.
Deep skills:
● PhD in statistical modelling of temporal sequences. Real-time sound
analysis and synthesis, model selection, sound morphology. More on my
research page at Ircam.
● Max/MSP, Matlab, C/C++, Java.
3
Author-Xavier Rodet
Xavier Rodet's research interests are in the areas of signal and pattern
analysis, recognition and synthesis. He has been working particularly on
digital signal processing for speech, speech and singing voice synthesis
and automatic speech recognition. Computer Music is his other main
domain of interest. He has been working on understanding spectrotemporal patterns of musical sounds and on synthesis-by-rules. He has
been developing new methods, programs and patents for musical sound
signal analysis, synthesis and control. He is also working on physical
models of musical instruments and nonlinear dynamical systems applied
to sound signal synthesis.
4
Introduction
● The standard Viterbi algorithm is unsuitable
○ for a streamed input, for which there is potentially no end to the
sequence.
● Existing solutions
○ applying the Viterbi algorithm on successive windows
○ tracking partial path hypotheses on an expanding window until they
converge towards the same solution
● Contribution
○ combine both approaches by forcing a suboptimal state label output
when the latency exceeds a predefined threshold.
○ study the influence of the model’s topology on the convergence of
potential paths, accuracy and latency.
5
Outline
● VITERBI DECODING
○ Standard Viterbi
○ Short-time Viterbi
● INFLUENCE OF THE MODEL’S TOPOLOGY
○ Necessary Condition for Fusion Points
○ Influence on Latency
6
Viterbi decoding- standard
The decoding step of a recognition system consists in retrieving the
state sequence that maximizes the a posteriori probability, referred to as
the optimal path s∗. The Viterbi algorithm produces this optimal path in
two main iterative steps:
● a time-synchronous forward pass that updates the partial likelihoods
δt(i), ∀i ∈ S: the score of the best path ending in state i at time t.
The best state predecessors are stored in a matrix of backpointers ψt(i).
● a backtracking pass on ψt(i), starting from the state with the highest
score at time T.
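The two steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it works in the log domain, and the names viterbi, log_pi, log_A and log_B, as well as the two-state toy model, are assumptions made for the example.

```python
import math


def viterbi(log_pi, log_A, log_B):
    """Return the best state path for one complete observation sequence.

    log_pi[i]   : log initial probability of state i
    log_A[i][j] : log transition probability from state i to state j
    log_B[t][i] : log likelihood of the observation at time t in state i
    """
    states = range(len(log_pi))
    # Forward pass: delta[i] is the score of the best path ending in state i.
    delta = [log_pi[i] + log_B[0][i] for i in states]
    psi = []  # psi[t - 1][j] = best predecessor of state j at time t
    for t in range(1, len(log_B)):
        back = [max(states, key=lambda i: delta[i] + log_A[i][j]) for j in states]
        delta = [delta[back[j]] + log_A[back[j]][j] + log_B[t][j] for j in states]
        psi.append(back)
    # Backtracking: start from the best-scoring state at the final time index.
    state = max(states, key=lambda i: delta[i])
    path = [state]
    for back in reversed(psi):
        state = back[state]
        path.append(state)
    return path[::-1]


# Hypothetical two-state model: the first two frames favour state 0,
# the last two favour state 1.
log = math.log
log_pi = [log(0.6), log(0.4)]
log_A = [[log(0.9), log(0.1)], [log(0.1), log(0.9)]]
log_B = [[log(0.9), log(0.1)]] * 2 + [[log(0.1), log(0.9)]] * 2
print(viterbi(log_pi, log_A, log_B))  # [0, 0, 1, 1]
```

Note that nothing can be returned before the last frame has been read, which is the limitation the following slides address.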
7
Viterbi decoding- standard
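The decoding step of a recognition system retrieves the state sequence
that maximizes the a posteriori probability; this sequence is referred to
as the optimal path s∗. The Viterbi algorithm is one method for producing
the optimal path, using two main iterative steps:
● a time-synchronous forward pass that updates the partial likelihoods
δt(i), ∀i ∈ S: the score of the best path ending in state i at time t.
The best state predecessors are stored in an array of backpointers ψt(i).
● a backtracking pass on ψt(i), starting from the highest-scoring state
at time T.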
8
Viterbi decoding- standard
The backtracking step implies that the algorithm is not
time-synchronous: the last observation frame, at time T,
is needed in order to decode the globally optimal state path s∗
from T back to the first time index.
9
Viterbi decoding- standard
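The backtracking step implies that the algorithm is not time-synchronous,
because the last observation frame at time T is required before the
globally optimal state path s∗ can be decoded from time T back to the
first time index.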
10
Viterbi decoding- short time
Let us define s(a, b, i) as the state sequence obtained by computing δ and ψ values
on an observation window delimited by time indices a and b (such that a < b) and
backtracking from an arbitrary state i at time b.
11
Viterbi decoding- short time
A fusion point τ, a time index at which the local paths s(a, b, i) for
all i ∈ S converge, has the following attractive property: the local
paths up to τ are always identical to the global Viterbi path between a
and τ.
12
Viterbi decoding- short time
From this observation, it is straightforward to derive an online short-time
Viterbi (STV) algorithm, based on local Viterbi decodings between
successive fusion points, which serve as left bounds for variable-size
observation windows.
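One possible reading of this procedure is sketched below in Python. It is an illustration under stated assumptions, not the algorithm listing from the paper: the names short_time_viterbi, _backtrack, window and base are hypothetical, scores are kept in the log domain, and the δ values are simply carried across windows while the backpointers before each fusion point are discarded.

```python
import math


def _backtrack(pointers, state):
    """Follow backpointers from `state` back to the start of `pointers`."""
    path = [state]
    for back in reversed(pointers):
        state = back[state]
        path.append(state)
    return path[::-1]


def short_time_viterbi(log_pi, log_A, frames):
    """Yield (b, first_time, states) each time frame b fixes new states.

    frames yields, for each time step, the per-state observation log likelihoods.
    """
    states = range(len(log_pi))
    delta = None   # partial scores at the right bound b of the window
    window = []    # window[k][j]: best predecessor (time base + k) of state j
    base = 0       # time index of the first state not yet output
    b = -1
    for b, log_b in enumerate(frames):
        if delta is None:
            delta = [log_pi[i] + log_b[i] for i in states]
            continue
        back = [max(states, key=lambda i: delta[i] + log_A[i][j]) for j in states]
        delta = [delta[back[j]] + log_A[back[j]][j] + log_b[j] for j in states]
        window.append(back)
        # Backtrack every local path at once and stop at the latest time
        # index where they all pass through a single state (fusion point).
        survivors = set(states)
        for k in range(len(window) - 1, -1, -1):
            survivors = {window[k][s] for s in survivors}
            if len(survivors) == 1:
                segment = _backtrack(window[:k], survivors.pop())
                yield b, base, segment
                base += len(segment)       # the fusion point becomes the left bound
                window = window[k + 1:]
                break
    if delta is not None:
        # End of stream: flush what is left from the best final state.
        last = max(states, key=lambda i: delta[i])
        yield b, base, _backtrack(window, last)


# Hypothetical two-state model and four frames of log likelihoods.
log = math.log
log_pi = [log(0.6), log(0.4)]
log_A = [[log(0.9), log(0.1)], [log(0.1), log(0.9)]]
log_B = [[log(0.9), log(0.1)]] * 2 + [[log(0.1), log(0.9)]] * 2
for b, first_time, segment in short_time_viterbi(log_pi, log_A, log_B):
    print(f"after frame {b}: states from time {first_time} are {segment}")
```

On this toy input the first state is output as soon as the second frame arrives, and the concatenated segments form the same path that a full backtrack over the four frames would give.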
13
Influence of the model’s topology
Necessary condition for fusion points
The above algorithm assumes that a fusion
point occurs inside a time interval [a, b] smaller
than [1, T]. However, this is not always allowed
by the topology of the model.
14
Influence of the model’s topology
Necessary condition for fusion points
Local paths cannot converge towards state 4 (on optimal path) because
backtracking from states 1 and 2 can only lead to states 1 and 2. Thus
the algorithm is doomed to stay stuck as the observation window
expands until the final observation T.
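The structural side of this condition can be checked directly on the transition matrix. The Python sketch below is an assumption-laden illustration: the function names and both 4-state matrices are hypothetical (they are not the model from the figure), and only the zero/non-zero pattern of the matrix is used. It lists the states that backtracking can reach from every state; if that set is empty, the local paths can never meet.

```python
def backward_reachable(A, j):
    """States that backtracking from state j can visit (j itself included)."""
    seen, stack = {j}, [j]
    while stack:
        s = stack.pop()
        for i in range(len(A)):
            # A[i][s] > 0 means state i is an allowed predecessor of state s.
            if A[i][s] > 0 and i not in seen:
                seen.add(i)
                stack.append(i)
    return seen


def fusion_candidates(A):
    """States on which the local paths from all states could meet."""
    common = set(range(len(A)))
    for j in range(len(A)):
        common &= backward_reachable(A, j)
    return common


# Hypothetical topology: states 0 and 1 only have predecessors 0 and 1,
# states 2 and 3 only have predecessors 2 and 3.
stuck = [
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.5, 0.5],
]
# Same topology with one extra transition from state 3 to state 0.
connected = [row[:] for row in stuck]
connected[3] = [0.2, 0.0, 0.4, 0.4]

print(fusion_candidates(stuck))      # set(): no fusion point is possible
print(fusion_candidates(connected))  # {2, 3}
```

This is only a necessary condition: a non-empty set says that fusion is structurally possible, not that it will occur on a given observation sequence.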
15
Influence of the model’s topology
Influence on Latency
● L = b − a: length of the observation window
● earliest state that can be decoded: the one at time a+1
● minimum latency for that state: L − 1
● a fusion point occurs at a+1 if all states in S are connex according to
matrix A
● distance d(i, j) = length of the shortest backward path from j to i
● DS: set of these lengths for any pair (i, j) in S
● a lower bound on the latency = max(DS)
16
Influence of the model’s topology
Influence on Latency
For example, in the model of Figure 1.b), the two largest distances
are d(1, 4) = 3 and d(3, 2) = 3. In the general case, therefore, optimal
decoding cannot be achieved with a latency lower than 3 frames.
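The bound can be computed with a breadth-first search over predecessors. The Python sketch below is illustrative only: the function names are hypothetical, and the 4-state adjacency matrix is not the model of Figure 1.b but a made-up topology chosen so that the two largest distances are also d(1, 4) = 3 and d(3, 2) = 3 (states are numbered from 0 in the code).

```python
from collections import deque


def backward_distances(A, j):
    """Map each state i to d(i, j), the shortest backward path length from j."""
    dist = {j: 0}
    queue = deque([j])
    while queue:
        s = queue.popleft()
        for i in range(len(A)):
            # A[i][s] > 0 means state i is an allowed predecessor of state s.
            if A[i][s] > 0 and i not in dist:
                dist[i] = dist[s] + 1
                queue.append(i)
    return dist


def latency_lower_bound(A):
    """Largest d(i, j) over all pairs, or infinity if some pair is not connex."""
    worst = 0
    for j in range(len(A)):
        dist = backward_distances(A, j)
        if len(dist) < len(A):
            return float("inf")
        worst = max(worst, max(dist.values()))
    return worst


# Hypothetical topology (1 = transition allowed), self-loops included.
A = [
    [1, 1, 0, 0],  # state 1 -> 1, 2
    [1, 1, 1, 0],  # state 2 -> 1, 2, 3
    [0, 0, 1, 1],  # state 3 -> 3, 4
    [1, 0, 1, 1],  # state 4 -> 1, 3, 4
]
print(backward_distances(A, 3)[0])  # d(1, 4) = 3
print(backward_distances(A, 1)[2])  # d(3, 2) = 3
print(latency_lower_bound(A))       # 3
```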
17
Experimentations
Database and Models
● STV was evaluated on a real-time phoneme recognition task
○ database: 3794 sentences, split into 3640 sentences for the training set
(131967 phonemes) and 154 sentences (3134 phonemes) for the test set
○ sound files: sliced into 25 ms windows every 5 ms
○ HTK tools: 37 monophone models
○ each monophone is modeled by a five-state left-to-right HMM
○ the observation densities are modeled using Gaussian mixture models with
diagonal covariance and J components per state, yielding a total of eight
sets of models for J = 1, 2, 4, ..., 128.
18
Experimentations
Topology modifications
● Two modifications of the reference model λref ensure connexity among the
model states
○ λP guarantees connexity by adding transitions at the phone level
(max(DS) = 5)
○ λS adds transitions at the state level (max(DS) = 1)
19
Experimentations
Topology modifications
20
Experimentations
Recognition performance
N: number of phonemes
D: number of deletion errors
S: number of substitution errors
I: number of insertion errors
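The slide text does not preserve the formula that combines these counts. Assuming the usual HTK-style definitions (percent correct ignores insertions, accuracy penalises them), a small Python sketch would be as follows; the function names and the error counts in the example are hypothetical.

```python
def percent_correct(N, D, S):
    """HTK-style percent correct: insertions are not penalised."""
    return 100.0 * (N - D - S) / N


def accuracy(N, D, S, I):
    """HTK-style accuracy: deletions, substitutions and insertions all count."""
    return 100.0 * (N - D - S - I) / N


# Hypothetical counts, for illustration only.
print(percent_correct(N=100, D=5, S=10))  # 85.0
print(accuracy(N=100, D=5, S=10, I=3))    # 82.0
```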
21
Experimentations
Latency performance
● J = 8; utterance lengths in the test set range from 249 to 1470 frames,
with a mean of 524 frames
● maximum latency with λP: 236 frames. This means that all online decodings
managed to find a fusion point
● minimum latency with λP: 7, 32-128
● minimum latency with λS: 2, all
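To relate these figures to wall-clock time, the frame counts can be converted using the 5 ms frame step quoted for the database. The Python sketch below assumes that latency is counted in frames at that step; the function name frames_to_ms is hypothetical.

```python
HOP_MS = 5  # frame step from the database description (25 ms windows every 5 ms)


def frames_to_ms(frames, hop_ms=HOP_MS):
    """Convert a latency expressed in frames to milliseconds."""
    return frames * hop_ms


for label, frames in [("maximum, λP", 236), ("minimum, λP", 7), ("minimum, λS", 2)]:
    print(f"{label}: {frames} frames = {frames_to_ms(frames)} ms")
```

Under that assumption the largest latency observed with λP is a little over one second, while the smallest with λS is 10 ms.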
22
Experimentations
Hard constraint on Latency
As the latency threshold C gets shorter, accuracy decreases because of the
suboptimal decoding procedure employed when forcing the output.
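The slide does not spell out how the output is forced. One plausible sketch in Python, assuming the decoder backtracks from the currently best-scoring state and commits just enough of the oldest states to bring the latency back to C, is given below. The function name forced_segment, its arguments and the toy backpointers are all hypothetical.

```python
def forced_segment(window, delta, max_latency):
    """Return the oldest states to commit when no fusion point came in time.

    window[k][j] : best predecessor of state j, one backpointer list per frame
    delta[i]     : current partial score of state i at the right bound
    max_latency  : threshold C, in frames
    """
    if len(window) <= max_latency:
        return []  # latency still within the threshold, nothing is forced
    # Suboptimal choice: trust the local path that currently scores best.
    state = max(range(len(delta)), key=lambda i: delta[i])
    path = [state]
    for back in reversed(window):
        state = back[state]
        path.append(state)
    path.reverse()
    # Commit only the states whose latency exceeds the threshold.
    return path[:len(window) - max_latency]


# Hypothetical window of three frames with no fusion point, two states.
window = [[0, 0], [0, 1], [1, 1]]
delta = [-5.0, -1.0]
print(forced_segment(window, delta, max_latency=2))  # [0]
print(forced_segment(window, delta, max_latency=3))  # []
```

A complete decoder would also have to keep later decisions consistent with the committed states, which this sketch leaves out.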
23
Conclusion and perspective
● It is possible to modify a model so that it obeys these connexity
constraints
● As the connexity lengths decrease, the lower bound on the best possible
latency decreases
24
Comment
● It is a nice concept for finding an optimal path.
● The model topology matters.
● Cost definition for Dijkstra.
25