Automatic Video Segmentation for Czech TV Broadcast Transcription

Josef Chaloupka
Laboratory of Computer Speech Processing, Institute of Information Technology and Electronics
Technical University of Liberec
Liberec, Czech Republic
Josef.chaloupka@tul.cz
Abstract—This contribution deals with the testing and selection of methods and algorithms for automatic video image (or visual signal) segmentation. The aim of this work has been to select a reliable and fast method for visual signal segmentation that can be used in a system for audio-visual automatic TV broadcast transcription.
Keywords—visual signal segmentation, shot change detection, audio-visual TV broadcast processing
I. INTRODUCTION
Video recordings of TV broadcasts (or movies) are made up of single short shots; a shot is a sequence of subsequent video images in which the visual information changes very little. The task of automatic visual signal segmentation is to find the time boundaries (shot changes) of these shots. The parts between the time boundaries are visual segments, which should correspond to the original shots. Two subsequent visual segments are most often separated by a shot cut, or a special video effect such as a dissolve, fade, or wipe is used. Our TV broadcast video recordings contain shot cuts and dissolves (see fig. 1); shot change detection has therefore been solved for these two cases in our work.
At present, the segmentation of a visual signal is used mainly for the indexing of audio-visual data (video recordings). Each indexed visual segment can be represented by a key frame, and it is then possible to work only with the key frames in very large audio-visual databases. Visual signal segmentation is further used in the research area of modern voice technologies, where visual information from visual segments can improve the recognition rate obtained from audio signals alone.
We have used visual segmentation in our very large vocabulary speech recognition system for the automatic transcription of Czech broadcasts (the ATT system – Audio Transcription Toolkit). This system has been under development in our Laboratory of Computer Speech Processing at the Technical University of Liberec since 2004 [1]. We use our own recognizer for the automatic continuous speech recognition of the Czech language in the ATT. The speech recognizer works with a vocabulary of more than 300,000 Czech words and with a Czech language model. The principle
of the ATT is as follows (see fig. 2): An input audio signal is preprocessed and parameterized in the first step. The parameterized signal is segmented into smaller audio segments, each containing homogeneous information only – e.g. only one speaker speaking, music playing, silence, and so on. Audio segmentation is based on speech/non-speech and speaker-change detection [2]. In the next step, speech is recognized in the audio speech segments. Our speech recognizer is based on Hidden Markov Models (HMM) of single Czech phonemes. It is possible to use Speaker Independent (SI) HMM or Gender Dependent (GD) HMM; the recognition accuracy of the speech recognizer is higher with GD HMM than with SI HMM. The identification accuracy of gender from a voice reaches more than 99% in our systems; we therefore use GD HMM in our speech recognizer. Another strategy for improving recognition accuracy is to use Speaker Adaptive (SA) HMM, where the HMM are adapted to specific speakers [3]. Some TV presenters, politicians, and other well-known people appear in TV broadcasts very often; it is therefore possible to adapt HMM for them. These speakers are then identified and verified before speech recognition in the ATT.
Speaker verification is used after speaker identification because it is necessary to find out whether the identified speaker has been identified correctly. A speaker may be identified correctly and yet sometimes fail to be verified. Therefore, we have modified the ATT system [4] to use audio-visual speaker identification instead of audio identification with subsequent verification. The visual signal is first segmented, and the visual segments are compared with the audio segments according to their time boundaries. It can be assumed that the information from an audio segment is similar (or equivalent) to the information in a visual segment if the time boundaries of the audio segment are more or less the same as the time boundaries of the visual segment. For example, a speaker captured by the camera is recorded in both the audio and the visual segment. It is therefore possible to identify the speaker from the audio signal and compare this result with the visual speaker identification, where the speaker is identified from the face detected in the video images of the visual segment. Several different methods and algorithms currently exist for visual speaker identification based on the image of a human face; the method based on Principal Component Analysis (PCA) is used most often for visual speaker (face) identification [5]. Audio speaker verification is not used when the identified speaker is the same in the visual segment and in the relevant audio segment, and the recognition accuracy is slightly higher when the module of audio-visual identification is incorporated into the ATT. One of the important tasks for audio-visual speaker identification in a system for automatic TV broadcast transcription is to find a reliable algorithm for visual signal segmentation, so several visual segmentation methods have been tested in this work. Many algorithms and methods for visual signal segmentation exist [9], but only the methods and algorithms that were used in this work are described here.
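To make the boundary comparison concrete, the following is a minimal sketch in Python of the decision described above; representing segments as (start, end) pairs in seconds and the 0.5 s tolerance are assumptions made for this illustration, not the exact values or logic of the ATT.

```python
def segments_match(audio_seg, visual_seg, tolerance=0.5):
    """True if two segments have (almost) the same time boundaries.

    Segments are (start, end) tuples in seconds; the tolerance is a
    hypothetical value chosen for this example.
    """
    return (abs(audio_seg[0] - visual_seg[0]) <= tolerance
            and abs(audio_seg[1] - visual_seg[1]) <= tolerance)


def needs_audio_verification(audio_speaker_id, visual_speaker_id,
                             audio_seg, visual_seg):
    """Audio speaker verification is skipped only when both modalities
    agree on the speaker within segments that share time boundaries."""
    if segments_match(audio_seg, visual_seg):
        return audio_speaker_id != visual_speaker_id
    return True
```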
Figure 1. Video effects – shot cut (frame nos. 8604–8607) and dissolve (frame nos. 8909–8912)
II. VISUAL SIGNAL SEGMENTATION METHODS
A. Pixel Based Methods
The simplest methods for visual signal segmentation are based on the comparison of corresponding pixels in two successive video images (1, 2) [6]; alternatively, we can determine how likely it is for the corresponding pixels to be identical, or follow the development of changes in the color values of the corresponding pixels over several consecutive video images.
$$\left| \sum_{x=1}^{X} \sum_{y=1}^{Y} f_{i+N}(x, y) - \sum_{x=1}^{X} \sum_{y=1}^{Y} f_i(x, y) \right| > T \qquad (1)$$
where $f_i(x, y)$ is the image function of video image $i$ from the video signal; the value of the image function can be an RGB color vector, a brightness value, or another color component from some color space. $X$ and $Y$ are the dimensions (width and height) of the video image, $N$ is the shift to the next video image (usually 1), and $T$ is a threshold used to set the boundaries of visual segments.

The resulting value of the similarity of two video images from (1) may be close to zero even when comparing two completely different video images, so it is better to compare the corresponding pixels of two successive video images directly (2):

$$\sum_{x=1}^{X} \sum_{y=1}^{Y} \left| f_i(x, y) - f_{i+N}(x, y) \right| > T \qquad (2)$$

In almost all visual segmentation methods, the criterion for creating the boundary of a visual segment is based on comparing the resulting value with a threshold $T$. Such methods are quite reliable in the case of static video shots. However, "over-segmentation" can be expected if an object (or the camera) moves; over-segmentation means that a single shot is divided into multiple visual segments.
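As an illustration, here is a minimal sketch of both pixel-based criteria (1) and (2) in Python with NumPy; the frame representation (grayscale NumPy arrays) and the threshold handling are assumptions made for this example.

```python
import numpy as np

def sum_difference(frame_a, frame_b):
    """Criterion (1): difference of the total pixel sums of two frames."""
    return abs(float(frame_a.sum()) - float(frame_b.sum()))

def pairwise_difference(frame_a, frame_b):
    """Criterion (2): sum of absolute differences of corresponding pixels."""
    a = frame_a.astype(np.int64)  # avoid uint8 wrap-around on subtraction
    b = frame_b.astype(np.int64)
    return float(np.abs(a - b).sum())

def detect_boundaries(frames, metric, threshold, n=1):
    """Indices i with a segment boundary between frames i and i + n."""
    return [i for i in range(len(frames) - n)
            if metric(frames[i], frames[i + n]) > threshold]
```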
Figure 2. The principle of the ATT system
B. Histogram Based Methods
One of the few global information sources that somehow characterize an image is its histogram. An image histogram is created from the frequencies of the single color values of the pixels. We get one image histogram for one video image, but different video images may have the same image histogram, which is a disadvantage of this approach. However, because they work with global information from the image histogram, histogram-based visual segmentation methods [7] are more robust than pixel-based methods, mainly with respect to small shakes or turns of the camera or of an object located in the video shot. The simplest calculation is the difference of the values in the image histograms of two successive video images:
$$\sum_{v=1}^{V} \left| H_i(v) - H_{i+N}(v) \right| > T \qquad (3)$$

where $H_i(v)$ are the values of the histogram of video image $i$.
An image histogram is usually computed from the brightness or the RGB values of single pixels, but different color components from different color spaces (HSV – Hue, Saturation, Value; YCbCr; …) are used for the computation of image histograms in some other projects. These color components can carry different amounts of information about each video image; therefore, in some visual segmentation methods, different weights are set for each color component. In other histogram-based segmentation methods, the intersections of image histograms are computed, or the image histograms are normalized for a better comparison.
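A minimal sketch of the histogram criterion (3) follows; grayscale frames and 256 brightness bins are assumptions of this example.

```python
import numpy as np

def brightness_histogram(frame, bins=256):
    """Histogram H_i(v) of the brightness values of one video image."""
    hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return hist

def histogram_difference(frame_a, frame_b, bins=256):
    """Criterion (3): sum of absolute bin-wise histogram differences."""
    h_a = brightness_histogram(frame_a, bins)
    h_b = brightness_histogram(frame_b, bins)
    return int(np.abs(h_a - h_b).sum())
```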
C. Feature Based Methods
A video image (a matrix of pixels) can be described with features which characterize the video image well, and the video signal is then segmented with the help of these features. A boundary of a visual segment is set if the features of two successive video images differ. The color values of pixels or the values of an image histogram could serve as features, but only a smaller group of features is acceptable for feature-based visual segmentation methods. Useful features are, for example, image moments, edges, parameters from statistical methods, coefficients of 2D transforms, and so on.
We have developed a feature-based visual segmentation method where the features are extracted from the coefficients of the 2D Discrete Cosine Transform (DCT) [4]. The principle of the feature extraction and the subsequent visual segmentation is as follows. A video image is transformed by the 2D DCT:

$$F(u, v) = \frac{2 c(u) c(v)}{N} \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} f(x, y) \cos\!\left(\frac{2x+1}{2N} u\pi\right) \cos\!\left(\frac{2y+1}{2N} v\pi\right) \qquad (4)$$

where $F(u, v)$ are the DCT coefficients of the transformed image $f(x, y)$ and $c$ are the coefficients:

$$c(k) = \begin{cases} \frac{1}{\sqrt{2}} & \text{for } k = 0 \\ 1 & \text{otherwise} \end{cases} \qquad (5)$$

The computation of the DCT is relatively fast because there is an algorithm very similar to the FFT (Fast Fourier Transform) algorithm used for the computation of the DFT (Discrete Fourier Transform). The squares of the DCT coefficients are computed:

$$E(u, v) = F(u, v)^2 \qquad (6)$$

and the $P$ highest $E$ coefficients are selected as features. In the last step, the distance between the features of two successive video images is computed (7). The criterion for shot change detection is very similar to the one in the previous methods: a specific threshold is used.

$$\sum_{p=1}^{P} \left| VP_{i+N}(p) - VP_i(p) \right| > T \qquad (7)$$

where $VP_i(p)$ is the feature vector of video image $i$.

The advantage of this method is that the distance between two similar successive video images is several times lower than the distance between two different ones. The disadvantage is that the first visual feature $VP_i(1)$ is usually several times higher than the others; it is therefore advisable to normalize the feature vector. The logarithm of the feature vector is used in our algorithm.
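The sketch below illustrates one possible reading of the DCT feature extraction (4)–(7) in Python with SciPy; selecting the P largest squared coefficients in sorted order and the epsilon inside the logarithm are simplifying assumptions of this example, not necessarily the exact procedure of [4].

```python
import numpy as np
from scipy.fft import dctn  # fast 2D DCT, cf. equations (4) and (5)

def dct_features(frame, p=64):
    """Feature vector VP_i: the P largest squared DCT coefficients (6),
    log-compressed because the first coefficient dominates the rest."""
    energy = dctn(frame.astype(np.float64), norm="ortho") ** 2
    top = np.sort(energy.ravel())[::-1][:p]
    return np.log(top + 1e-12)  # epsilon avoids log(0) on flat frames

def feature_distance(frame_a, frame_b, p=64):
    """Criterion (7): L1 distance between the feature vectors of two frames."""
    return float(np.abs(dct_features(frame_a, p) - dct_features(frame_b, p)).sum())
```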
D. Block Based Methods
A video image is divided into several parts (blocks) in block-based visual segmentation methods. The visual information in corresponding blocks of two successive video images is compared, and the results of the comparisons of the single blocks are then evaluated together. We can assign different weights to the single blocks, so each block may contribute to the result of the evaluation in a different way.
The methods described above can be used for the evaluation within blocks: the features, the image histograms, or the sums of pixel values are computed and compared only within the blocks of the video images. Another possibility is to compute statistical values such as the variance and the mean in the single blocks [8]; the function L(i, b) is then calculated for two corresponding blocks:
$$L(i, b) = \frac{\left( \dfrac{\sigma^2_{i,b} + \sigma^2_{i-N,b}}{2} + \left( \dfrac{\mu_{i,b} - \mu_{i-N,b}}{2} \right)^2 \right)^2}{\sigma^2_{i,b}\, \sigma^2_{i-N,b}} \qquad (8)$$

where $\sigma^2_{i,b}$ is the variance and $\mu_{i,b}$ is the mean of the color values in block $b$ of video image $i$.

The value $L(i, b)$ is then compared with a block threshold $T_b$: $L(i, b) = 1$ if it is higher than the threshold, otherwise $L(i, b) = 0$. The criterion for shot change detection is:

$$\sum_{b=1}^{B} w_b L(i, b) > T \qquad (9)$$

where $B$ is the number of blocks, $T$ is the segmentation threshold, and $w_b$ is the weight value of a single block.
It is necessary to properly determine the number and distribution of the single blocks in the video image for the block-based segmentation methods. It is easier to correctly adjust the threshold value T if we choose a suitable number of blocks and a suitable distribution of the blocks in the video image.
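A minimal sketch of the block-based criterion (8, 9) for grayscale frames is given below; the 4 x 4 grid, the uniform weights, and the epsilon guarding against zero variance are assumptions of this example.

```python
import numpy as np

def split_blocks(frame, grid=(4, 4)):
    """Split a grayscale frame into grid[0] x grid[1] rectangular blocks."""
    rows = np.array_split(frame, grid[0], axis=0)
    return [block for row in rows for block in np.array_split(row, grid[1], axis=1)]

def likelihood_ratio(block_a, block_b):
    """L(i, b) from (8), computed from the block means and variances."""
    var_a, var_b = block_a.var(), block_b.var()
    mean_a, mean_b = block_a.mean(), block_b.mean()
    numerator = ((var_a + var_b) / 2.0 + ((mean_a - mean_b) / 2.0) ** 2) ** 2
    return numerator / (var_a * var_b + 1e-12)  # epsilon: uniform blocks have zero variance

def block_criterion(frame_a, frame_b, block_threshold, weights=None, grid=(4, 4)):
    """Criterion (9): weighted sum of blocks whose L(i, b) exceeds T_b."""
    blocks_a, blocks_b = split_blocks(frame_a, grid), split_blocks(frame_b, grid)
    if weights is None:
        weights = [1.0] * len(blocks_a)
    return sum(w * float(likelihood_ratio(a, b) > block_threshold)
               for w, a, b in zip(weights, blocks_a, blocks_b))
```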
III. EXPERIMENTS
Seven methods for visual signal segmentation have been tested in our experiments: M1_PB – a pixel-based segmentation method (equation 1); M2_PB – a pixel-based segmentation method (equation 2); M3_HB – a histogram-based segmentation method (equation 3); M4_FB – the DCT feature-based segmentation method; M5_BB – a block-based segmentation method where the video image has been divided into 4 blocks (2 x 2) and the segmentation evaluation has been computed by equation 1; M6_BB – a block-based segmentation method with 16 blocks (4 x 4) and the segmentation evaluation computed just like in the previous method; M7_BB – a block-based segmentation method with 16 blocks (4 x 4) and computation by equations 8 and 9.
Figure 3. A sample of a short visual signal (frame nos. 8606–9380 with their times in hour:minute:second)
A database with almost 2 hours of video recordings of Czech TV broadcast news has been used for the automatic threshold setting of the single segmentation methods. The boundaries (shot changes) of the single visual segments have been found and set manually in this database; for each segmentation method, the threshold has then been varied over an interval, and the resulting threshold has been selected according to the highest value of the Visual Segmentation Rate – VSR (10). The single segmentation methods (with their set thresholds) have been tested in the next step on another database, COST278 [10], which includes video recordings of TV broadcasts from 3 Czech TV stations (1 hour). The shot changes of the visual segments have also been set manually in this database; the reliability (VSR) of the single segmentation methods can therefore be evaluated.
$$VSR = \frac{CS - IS}{NS} \cdot 100\ [\%] \qquad (10)$$

where $NS$ is the number of all manually selected shot changes, $CS$ is the number of correctly recognized shot changes, and $IS$ is the number of falsely inserted shot changes.
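Expressed as code, the VSR (10) is a one-line computation; the example numbers in the comment below are hypothetical.

```python
def visual_segmentation_rate(ns, cs, ins):
    """VSR (10): correctly detected shot changes minus false insertions,
    relative to the NS manually labelled shot changes, in percent."""
    return (cs - ins) / ns * 100.0

# Hypothetical example: 100 labelled shot changes, 80 detected correctly,
# 5 inserted in addition -> VSR = 75.0 %
assert visual_segmentation_rate(100, 80, 5) == 75.0
```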
Figure 4. The results of the visual signal segmentation methods: a) M1_PB, b) M2_PB, c) M3_HB, d) M4_FB, e) M5_BB, f) M6_BB, g) M7_BB
The best method in our tests has been the DCT feature-based visual segmentation method, with a VSR of 72.3%. The results of the visual segmentation methods for a short visual signal (figure 3) are shown in figure 4. The y-axis (the segmentation value) is normalized to the interval from 0 to 100 for a better comparison; a different interval is used for the last method because its segmentation value is almost zero between two similar video images and highly variable at a detected shot change. Only one special video effect (dissolve) occurs in our video recordings, but it has not been necessary to prepare a special algorithm for shot change detection in this effect because all the segmentation methods detected the shot change in the dissolve video effect.
IV. CONCLUSION AND FUTURE WORK
Several methods for visual signal segmentation have been tested in this work: two pixel-based, one histogram-based, one feature-based, and three block-based visual segmentation methods have been used in the experiments. The best result has been reached by the feature-based visual segmentation method, where the visual features are computed from the DCT coefficients. The advantage of this method is that it is possible to find a robust segmentation threshold for reliable visual signal segmentation. The DCT visual feature-based segmentation method is used in our experiments with our system for automatic TV broadcast transcription. We would like to improve our visual segmentation method in the near future and add algorithms for solving the visual segmentation task where special video effects (fade, wipe, ...) are used in the video recordings.
ACKNOWLEDGMENT
The research reported in this paper was partly supported by the Technology Agency of the Czech Republic (TACR) through grant no. TA01011204 and by the Czech Science Foundation (GACR) through project no. 102/08/0707.
REFERENCES
[1] Nouza, J., Nejedlová, D., Žďánský, J., Kolorenč, J.: Very Large Vocabulary Speech Recognition System for Automatic Transcription of Czech Broadcast. In: Proc. of ICSLP 2004, Jeju Island, Korea, pp. 409-412, ISSN 1225-441x, 2004
[2] Žďánský, J.: BINSEG: An Efficient Speaker-based Segmentation Technique. In: International Conference on Spoken Language Processing Interspeech 2006 – ICSLP 2006, September 2006, Pittsburgh, USA, pp. 2182-2185, ISSN 1990-9772
[3] Červa, P., Nouza, J., Silovský, J.: Two-Step Unsupervised Speaker Adaptation Based on Speaker and Gender Recognition and HMM Combination. In: International Conference on Spoken Language Processing Interspeech 2006 – ICSLP 2006, September 2006, Pittsburgh, USA, pp. 2326-2329, ISSN 1990-9772
[4] Chaloupka, J.: Visual Speech Segmentation and Speaker Recognition for Transcription of TV News. In: Proc. of International Conference on Spoken Language Processing Interspeech 2006 – ICSLP 2006, Pittsburgh, USA, pp. 1284-1287, ISSN 1990-9772, 2006
[5] Chan, L. H., Salleh, S. H., Ting, C. M.: PCA, LDA and neural network for face identification. In: IEEE Conference on Industrial Electronics and Applications, ICIEA 2009, art. no. 5138403, pp. 1256-1259, 2009
[6] Nagasaka, A., Tanaka, Y.: Automatic video indexing and full-video search for object appearances. In: IFIP Working Conference on Visual Database Systems, Hungary, pp. 113-127, 1991
[7] Tonomura, Y., Abe, S.: Content oriented visual interface using video icons for visual database systems. In: Journal of Visual Languages and Computing, pp. 183-198, 1990
[8] Kasturi, R., Jain, R. C.: Dynamic Vision. In: Computer Vision: Principles, editors: Kasturi and Jain, IEEE Computer Society Press, USA, pp. 469-480, 1991
[9] Lefèvre, S., Holler, J., Vincent, N.: A review of real-time segmentation of uncompressed video sequences for content-based search and retrieval. In: Real-Time Imaging 9, pp. 73-98, 2003
[10] Vandecatseye et al.: The COST278 pan-European broadcast news database. In: Proc. of LREC 2004, Lisbon, Portugal, May 2004