Automatic Video Segmentation for Czech TV Broadcast Transcription

Josef Chaloupka
Laboratory of Computer Speech Processing, Institute of Information Technology and Electronics
Technical University of Liberec, Liberec, Czech Republic
Josef.chaloupka@tul.cz

Abstract—This contribution deals with the testing and selection of methods and algorithms for automatic video image (visual signal) segmentation. The aim of this work has been to select a reliable and fast method for visual signal segmentation which can be used in a system for audio-visual automatic TV broadcast transcription.

Keywords—visual signal segmentation, shot change detection, audio-visual TV broadcast processing

I. INTRODUCTION

Video recordings of TV broadcasts (or movies) are made up of single short shots; a shot is a sequence of consecutive video images in which the visual information changes only slightly. The task of automatic visual signal segmentation is to find the time boundaries (shot changes) of these shots. The parts between the time boundaries are visual segments, which should correspond to the original shots. Two subsequent shots are most often separated by a hard cut, or a special video effect such as a dissolve, fade or wipe is used. Our TV broadcast video recordings contain shot cuts and dissolves (see Fig. 1), so shot change detection has been solved for these two cases in our work.

At present, the segmentation of a visual signal is used mainly for indexing audio-visual data (video recordings). Each indexed visual segment can be represented by a key frame, so that only the key frames need to be handled in very large audio-visual databases. Visual signal segmentation is further used in the research area of modern voice technologies, where visual information from visual segments can improve the recognition rate obtained from audio signals alone.

We have used visual segmentation in our very large vocabulary speech recognition system for the automatic transcription of Czech broadcasts (the ATT system – Audio Transcription Toolkit), which has been under development in our Laboratory of Computer Speech Processing at the Technical University of Liberec since 2004 [1]. The ATT uses our own recognizer for automatic continuous speech recognition of the Czech language, working with a vocabulary of more than 300 000 Czech words and with a Czech language model. The principle of the ATT is as follows (see Fig. 2): An input audio signal is preprocessed and parameterized in the first step. The parameterized signal is segmented into smaller audio segments containing only homogeneous information – e.g. only one speaker speaks, some music plays, there is silence and so on. Audio segmentation is based on speech/non-speech and speaker-change detection [2]. In the next step, speech is recognized in the audio speech segments. Our speech recognizer is based on Hidden Markov Models (HMM) of single Czech phonemes. It is possible to use Speaker Independent (SI) HMM or Gender Dependent (GD) HMM; the recognition accuracy is higher with GD HMM than with SI HMM. The identification accuracy of gender from a voice reaches more than 99% in our systems, so we use GD HMM in our speech recognizer. Another strategy for improving recognition accuracy is to use Speaker Adaptive (SA) HMM, where the HMM are adapted to specific speakers [3]. A minimal sketch of this processing chain is given below.
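To make the processing chain above concrete, the following is a minimal, runnable skeleton of the described pipeline. It is only a hedged sketch: all names and the stub logic are hypothetical illustrations, not the actual ATT implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AudioSegment:
    start: float        # seconds
    end: float
    is_speech: bool

def segment_audio(signal: List[float]) -> List[AudioSegment]:
    """Stub for speech/non-speech and speaker-change detection [2]."""
    return [AudioSegment(0.0, 5.0, True), AudioSegment(5.0, 7.5, False)]

def identify_gender(segment: AudioSegment) -> str:
    """Stub for gender identification (reported as more than 99% accurate)."""
    return "female"

def recognize(segment: AudioSegment, model: str) -> List[str]:
    """Stub for the HMM-based continuous recognizer (300k-word vocabulary)."""
    return ["<word>"]

def transcribe(signal: List[float]) -> List[str]:
    """Hypothetical outline of the ATT pipeline for one recording."""
    words: List[str] = []
    for segment in segment_audio(signal):
        if not segment.is_speech:
            continue                                   # music, silence, ...
        model = f"GD_HMM_{identify_gender(segment)}"   # gender-dependent models
        words += recognize(segment, model)
    return words

print(transcribe([0.0] * 16000))
```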
Some TV presenters, politicians and other well-known people appear in TV broadcasts very often, so it is possible to adapt HMM for them. These speakers are then identified and verified before speech recognition in the ATT. Speaker verification is used after speaker identification because it is necessary to find out whether the identified speaker is correct; a speaker may be identified correctly and yet occasionally fail verification. Therefore, we have modified the ATT system [4] to use audio-visual speaker identification instead of audio identification with subsequent verification. The visual signal is first segmented, and the visual segments are compared with the audio segments according to their time boundaries. It can be assumed that the information from an audio segment is similar (or equivalent) to the information in a visual segment if the time boundaries of the two segments are more or less the same – for example, when the camera films the speaker, the speech and the face are captured in corresponding audio and visual segments. It is therefore possible to identify the speaker from the audio signal and compare this information with visual speaker identification, where the speaker is identified from the face detected in a video image of the visual segment. Several methods and algorithms exist at present for visual speaker identification based on the image of a human face; the method based on Principal Component Analysis (PCA) is used most often [5]. Audio speaker verification is skipped when the same speaker is identified in the visual segment and in the relevant audio segment, and the recognition accuracy is slightly higher when the audio-visual identification module is incorporated into the ATT.

One of the important tasks for audio-visual speaker identification in a system for automatic TV broadcast transcription is to find a reliable algorithm for visual signal segmentation, so several visual segmentation methods have been tested in this work. There are many algorithms and methods for visual signal segmentation [9], but only the methods and algorithms used in this work are described here.

Figure 1. Video effects – shot cut (frames 8604–8607) and dissolve (frames 8909–8912)

II. VISUAL SIGNAL SEGMENTATION METHODS

A. Pixel Based Methods

The simplest methods for visual signal segmentation are based on the comparison of corresponding pixels in two successive video images (1), (2) [6]; alternatively, we can determine how likely it is that corresponding pixels are identical, or follow the development of changes in the color values of corresponding pixels over several consecutive video images:

$$\left| \sum_{x=1}^{X}\sum_{y=1}^{Y} f_{i+N}(x,y) - \sum_{x=1}^{X}\sum_{y=1}^{Y} f_i(x,y) \right| > T \qquad (1)$$

where f_i(x, y) is the image function of video image i from the video signal; its values can be an RGB color vector, a brightness value, or another color component from some color space. X and Y are the dimensions (width and height) of the video image, N is the shift to the next video image (usually 1), and T is the threshold used to set the boundaries of visual segments.

The resulting similarity value of two video images from (1) may be close to zero even when two completely different video images are compared, so it is better to compare the pixels of two successive video images directly (2):

$$\sum_{x=1}^{X}\sum_{y=1}^{Y} \left| f_i(x,y) - f_{i+N}(x,y) \right| > T \qquad (2)$$

The criterion for creating a visual segment boundary in almost all visual segmentation methods is based on comparing the resulting value with a threshold T. Such methods are quite reliable in the case of static video shots. However, "over-segmentation" can be expected if an object (or the camera) moves; over-segmentation means that a single shot is split into several visual segments. A small sketch of both pixel-based criteria follows.
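The following sketch implements the two pixel-based criteria (1) and (2) for grayscale frames stored as NumPy arrays. The function names and the toy threshold are our own illustrative choices; a real threshold would have to be tuned on annotated data as described in Section III.

```python
import numpy as np

def pixel_diff_global(frame_a, frame_b):
    """Eq. (1): absolute difference of the total pixel sums of two frames."""
    return abs(int(frame_a.sum()) - int(frame_b.sum()))

def pixel_diff_direct(frame_a, frame_b):
    """Eq. (2): sum of absolute pixel-wise differences."""
    return int(np.abs(frame_a.astype(np.int64) - frame_b.astype(np.int64)).sum())

def detect_boundaries(frames, threshold, metric=pixel_diff_direct, shift=1):
    """Place a segment boundary between frames i and i+shift when the
    chosen metric exceeds the threshold (shift corresponds to N)."""
    return [i for i in range(len(frames) - shift)
            if metric(frames[i], frames[i + shift]) > threshold]

# Toy usage: three identical frames followed by two different ones;
# the method places one boundary between frames 2 and 3.
rng = np.random.default_rng(0)
shot1 = rng.integers(0, 256, (240, 320), dtype=np.uint8)
shot2 = rng.integers(0, 256, (240, 320), dtype=np.uint8)
frames = [shot1, shot1, shot1, shot2, shot2]
print(detect_boundaries(frames, threshold=1_000_000))   # -> [2]
```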
Figure 2. The principle of the ATT system

B. Histogram Based Methods

One of the few sources of global information that somehow characterize an image is the image histogram. An image histogram is created from the frequencies of the individual color values over all pixels. We get one image histogram per video image, but different video images may have the same histogram, which can be a disadvantage of this method. On the other hand, because they acquire global information from the image histogram, histogram-based visual segmentation methods [7] are more robust than pixel-based methods, mainly with respect to slight shaking or turning of the camera or of an object located in the video shot. The simplest calculation is the difference of the values of the image histograms of two successive video images:

$$\sum_{v=1}^{V} \left| H_i(v) - H_{i+N}(v) \right| > T \qquad (3)$$

where H_i(v) are the values of the histogram of video image i and V is the number of histogram bins. An image histogram is computed from the brightness or RGB values of single pixels, but color components from other color spaces (HSV – Hue, Saturation, Value; YCbCr; …) are used for the computation of image histograms in some other projects. Each color component can carry a different amount of information about the video image, so some visual segmentation methods set a different weight for each component. In other histogram-based segmentation methods, the intersections of image histograms are computed, or the histograms are normalized for a better comparison. A minimal sketch of the histogram criterion follows.
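The following is a minimal sketch of the histogram criterion (3) for grayscale frames, again with assumed names and an illustrative setup. It also shows the robustness to small motion mentioned above, since shifting a frame leaves its histogram unchanged.

```python
import numpy as np

def brightness_histogram(frame, bins=256):
    """Histogram H_i(v) of the brightness values of a grayscale frame."""
    hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return hist

def histogram_distance(frame_a, frame_b, bins=256):
    """Eq. (3): sum of absolute bin-wise histogram differences."""
    h_a = brightness_histogram(frame_a, bins).astype(np.int64)
    h_b = brightness_histogram(frame_b, bins).astype(np.int64)
    return int(np.abs(h_a - h_b).sum())

# Toy usage: a shifted copy of a frame (simulating slight camera motion)
# keeps the histogram unchanged, while a different frame does not.
rng = np.random.default_rng(1)
frame = rng.integers(0, 256, (240, 320), dtype=np.uint8)
shifted = np.roll(frame, 5, axis=1)         # same content, moved 5 px
other = rng.integers(0, 256, (240, 320), dtype=np.uint8)
print(histogram_distance(frame, shifted))   # 0 (histogram is unchanged)
print(histogram_distance(frame, other))     # large value
```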
C. Feature Based Methods

A video image (a matrix of pixels) can be described by features which characterize it well, and the video signal can be segmented with the help of these features: a visual segment boundary is set if the features of two successive video images differ. The pixel color values or the histogram values themselves could be used as features, but only a smaller group of features is acceptable in feature-based visual segmentation methods. Useful features are, for example, image moments, edges, parameters from statistical methods, coefficients of 2D transforms and so on.

We have developed a feature-based visual segmentation method where the features are extracted from the coefficients of the 2D Discrete Cosine Transform (DCT) [4]. The principle of the feature extraction and the subsequent visual segmentation is as follows. A video image is transformed by the 2D DCT:

$$F(u,v) = \frac{2\,c(u)\,c(v)}{N} \sum_{x=0}^{N-1}\sum_{y=0}^{N-1} f(x,y)\,\cos\!\left(\frac{(2x+1)u\pi}{2N}\right)\cos\!\left(\frac{(2y+1)v\pi}{2N}\right) \qquad (4)$$

where F(u, v) are the DCT coefficients of the transformed image f(x, y) and c are the coefficients

$$c(k) = \begin{cases} \dfrac{1}{\sqrt{2}} & \text{for } k = 0 \\[4pt] 1 & \text{otherwise} \end{cases} \qquad (5)$$

The computation of the DCT is relatively fast because there is an algorithm very similar to the FFT (Fast Fourier Transform) algorithm used for the computation of the DFT (Discrete Fourier Transform). The squares of the DCT coefficients are computed:

$$E(u,v) = F(u,v)^2 \qquad (6)$$

and the P highest coefficients of E are selected as features. In the last step, the distance between the features of two successive video images is computed (7); the criterion for shot change detection is very similar to that of the previous methods, using a specific threshold:

$$\sum_{p=1}^{P} \left| VP_i(p) - VP_{i+N}(p) \right| > T \qquad (7)$$

where VP_i(p) is the feature vector of video image i. The advantage of this method is that the distance between two similar successive video images is several times lower than the distance between two different ones. The disadvantage is that the first visual feature VP_i(1) is usually several times higher than the others; it is therefore good to normalize the feature vector, and the logarithm of the feature vector is used in our algorithm. A sketch of this feature extraction is given below.
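The following sketch covers the DCT feature extraction (4)-(7). It assumes that the feature vector consists of the P largest values of E sorted in descending order, which is one plausible reading of the description above; SciPy's orthonormal DCT, the epsilon inside the logarithm and P = 64 are our own illustrative choices.

```python
import numpy as np
from scipy.fft import dctn

def dct_features(frame, p=64):
    """Feature vector: logarithm of the P largest squared 2D-DCT
    coefficients (eqs. 4-6), sorted in descending order."""
    coeffs = dctn(frame.astype(np.float64), norm="ortho")  # eqs. (4), (5)
    energy = coeffs ** 2                                   # eq. (6)
    top = np.sort(energy.ravel())[::-1][:p]                # P highest values
    return np.log(top + 1e-12)        # log-normalization tames VP_i(1)

def feature_distance(frame_a, frame_b, p=64):
    """Eq. (7): L1 distance between the feature vectors of two frames."""
    return float(np.abs(dct_features(frame_a, p) - dct_features(frame_b, p)).sum())

# Toy usage: a slightly perturbed copy of a structured frame stays close,
# while a frame with a completely different structure is far away.
gradient = (np.add.outer(np.arange(240), np.arange(320)) % 256).astype(np.uint8)
rng = np.random.default_rng(2)
noisy = np.clip(gradient.astype(int) + rng.integers(-3, 4, gradient.shape),
                0, 255).astype(np.uint8)
stripes = np.tile(np.array([0, 255], dtype=np.uint8), (240, 160))
print(feature_distance(gradient, noisy))    # small
print(feature_distance(gradient, stripes))  # much larger
```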
D. Block Based Methods

In block-based visual segmentation methods, a video image is divided into several parts (blocks), and the visual information in corresponding blocks of two successive video images is compared. The results of the comparisons in the single blocks are then evaluated together; different weights can be assigned to the single blocks, so each block may contribute to the result of the evaluation in a different way. The same methods as described above can be used for the evaluation within blocks, with the features, the image histograms or the sums of pixel values computed and compared only inside the blocks of the video images. Another possibility is to compute statistical values such as the variance and the mean in the single blocks [8]; the function L(i, b) is then calculated for two corresponding blocks:

$$L(i,b) = \frac{\left[ \dfrac{\sigma_{i,b}^{2} + \sigma_{i-N,b}^{2}}{2} + \left( \dfrac{\mu_{i,b} - \mu_{i-N,b}}{2} \right)^{2} \right]^{2}}{\sigma_{i,b}^{2}\;\sigma_{i-N,b}^{2}} \qquad (8)$$

where σ²(i,b) is the variance and μ(i,b) the mean of the color values in block b of video image i. The value L(i, b) is then compared with a block threshold T_b and replaced by 1 if it is higher than the threshold, otherwise by 0. The criterion for shot change detection is

$$\sum_{b=1}^{B} w_b\,L(i,b) > T \qquad (9)$$

where B is the number of blocks, T is the segmentation threshold and w_b is the weight of block b. For block-based segmentation methods, it is necessary to determine the number and layout of the blocks in the video image properly; with a suitable number and layout of blocks, it is easier to adjust the threshold value T correctly. A sketch of this block-based criterion follows.
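The following is a sketch of the block-based criterion (8)-(9) on a 4 × 4 grid. The epsilon in the denominator, the block threshold and the toy frames are illustrative assumptions of ours, not values from the paper.

```python
import numpy as np

def likelihood_ratio(block_a, block_b):
    """Eq. (8): likelihood ratio computed from the means and variances
    of two corresponding blocks [8]."""
    var_a, var_b = block_a.var(), block_b.var()
    mean_a, mean_b = block_a.mean(), block_b.mean()
    num = ((var_a + var_b) / 2.0 + ((mean_a - mean_b) / 2.0) ** 2) ** 2
    return num / (var_a * var_b + 1e-12)    # epsilon guards flat blocks

def block_change_score(frame_a, frame_b, grid=(4, 4), block_thr=3.0,
                       weights=None):
    """Eq. (9): weighted sum of the per-block binary decisions L(i, b)."""
    rows, cols = grid
    bh, bw = frame_a.shape[0] // rows, frame_a.shape[1] // cols
    if weights is None:
        weights = np.ones(rows * cols)      # equal block weights
    score, idx = 0.0, 0
    for r in range(rows):
        for c in range(cols):
            sl = (slice(r * bh, (r + 1) * bh), slice(c * bw, (c + 1) * bw))
            ratio = likelihood_ratio(frame_a[sl].astype(float),
                                     frame_b[sl].astype(float))
            score += weights[idx] * (1.0 if ratio > block_thr else 0.0)
            idx += 1
    return score        # compared with the segmentation threshold T

# Toy usage: identical frames score 0; statistically different frames fire
# in every block (score 16 with a 4 x 4 grid and unit weights).
rng = np.random.default_rng(3)
noisy = rng.integers(0, 256, (240, 320), dtype=np.uint8)
flat = np.full((240, 320), 128, dtype=np.uint8)
print(block_change_score(noisy, noisy), block_change_score(noisy, flat))
```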
III. EXPERIMENTS

Seven methods for visual signal segmentation have been tested in our experiments:

- M1_PB – a pixel-based segmentation method (equation 1),
- M2_PB – a pixel-based segmentation method (equation 2),
- M3_HB – a histogram-based segmentation method (equation 3),
- M4_FB – the DCT feature-based segmentation method,
- M5_BB – a block-based segmentation method with the video image divided into 4 blocks (2 × 2), evaluated by equation 1,
- M6_BB – a block-based segmentation method with 16 blocks (4 × 4), evaluated like the previous method,
- M7_BB – a block-based segmentation method with 16 blocks (4 × 4), computed by equations 8 and 9.

Figure 3. A sample of a short visual signal (frames 8606–9380, 00:05:44–00:06:15)

A database with almost 2 hours of video recordings of Czech TV broadcast news has been used for the automatic threshold setting of the single segmentation methods. The boundaries (shot changes) of the visual segments have been found and set manually in this database; the threshold of each segmentation method has then been varied over an interval, and the resulting threshold has been selected according to the highest value of the Visual Segmentation Rate (VSR) (10). The single segmentation methods (with the set thresholds) have been tested in the next step on another database, COST278 [10], which includes 1 hour of video recordings of TV broadcasts from 3 Czech TV stations. The shot changes have also been set manually in this database, so the reliability (VSR) of the single segmentation methods can be evaluated:

$$VSR = \frac{CS - IS}{NS} \cdot 100\ [\%] \qquad (10)$$

where NS is the number of all manually selected shot changes, CS is the number of correctly recognized shot changes, and IS is the number of additionally (falsely) detected shot changes.

Figure 4. The results of the visual signal segmentation methods: a) M1_PB, b) M2_PB, c) M3_HB, d) M4_FB, e) M5_BB, f) M6_BB, g) M7_BB

The best method in our tests has been the DCT feature-based visual segmentation method, with a VSR of 72.3%. The results of the visual segmentation methods for a short visual signal (Figure 3) are shown in Figure 4. The y-axis (the segmentation value) is normalized to the interval from 0 to 100 for a better comparison; a different interval is used for the last method because its segmentation value is almost zero between two similar video images and highly variable at a detected shot change. Only one special video effect (dissolve) occurred in our video recordings, and it has not been necessary to prepare a special algorithm for shot change detection in this effect, because all the segmentation methods detected the shot change in the dissolve video effect.

IV. CONCLUSION AND FUTURE WORK

Several methods for visual signal segmentation have been tested in this work: two pixel-based, one histogram-based, one feature-based and three block-based visual segmentation methods have been used in the experiments. The best result has been reached by the feature-based visual segmentation method, where the visual features are computed from the DCT coefficients. The advantage of this method is that it is possible to find a robust segmentation threshold for reliable visual signal segmentation. The DCT feature-based segmentation method is used in our experiments with our system for automatic TV broadcast transcription. In the near future, we would like to improve our visual segmentation method and add algorithms for solving the visual segmentation task for video recordings in which several special video effects (fade, wipe, …) are used.

ACKNOWLEDGMENT

The research reported in this paper was partly supported by the Technology Agency of the Czech Republic (TACR) grant no. TA01011204 and by the Czech Science Foundation (GACR) through project no. 102/08/0707.

REFERENCES

[1] Nouza, J., Nejedlová, D., Žďánský, J., Kolorenč, J.: Very Large Vocabulary Speech Recognition System for Automatic Transcription of Czech Broadcast. In: Proc. of ICSLP 2004, Jeju Island, Korea, pp. 409-412, ISSN 1225-441x, 2004
[2] Žďánský, J.: BINSEG: An Efficient Speaker-based Segmentation Technique. In: Proc. of Interspeech 2006 – ICSLP, Pittsburgh, USA, pp. 2182-2185, ISSN 1990-9772, 2006
[3] Červa, P., Nouza, J., Silovský, J.: Two-Step Unsupervised Speaker Adaptation Based on Speaker and Gender Recognition and HMM Combination. In: Proc. of Interspeech 2006 – ICSLP, Pittsburgh, USA, pp. 2326-2329, ISSN 1990-9772, 2006
[4] Chaloupka, J.: Visual Speech Segmentation and Speaker Recognition for Transcription of TV News. In: Proc. of Interspeech 2006 – ICSLP, Pittsburgh, USA, pp. 1284-1287, ISSN 1990-9772, 2006
[5] Chan, L. H., Salleh, S. H., Ting, C. M.: PCA, LDA and Neural Network for Face Identification. In: IEEE Conference on Industrial Electronics and Applications (ICIEA 2009), art. no. 5138403, pp. 1256-1259, 2009
[6] Nagasaka, A., Tanaka, Y.: Automatic Video Indexing and Full-Video Search for Object Appearances. In: IFIP Working Conference on Visual Database Systems, Hungary, pp. 113-127, 1991
[7] Tonomura, Y., Abe, S.: Content Oriented Visual Interface Using Video Icons for Visual Database Systems. In: Journal of Visual Languages and Computing, pp. 183-198, 1990
[8] Kasturi, R., Jain, R. C.: Dynamic Vision. In: Computer Vision: Principles, eds. Kasturi and Jain, IEEE Computer Society Press, USA, pp. 469-480, 1991
[9] Lefèvre, S., Holler, J., Vincent, N.: A Review of Real-Time Segmentation of Uncompressed Video Sequences for Content-Based Search and Retrieval. In: Real-Time Imaging, vol. 9, pp. 73-98, 2003
[10] Vandecatseye, A., et al.: The COST278 Pan-European Broadcast News Database. In: Proc. of LREC 2004, Lisbon, Portugal, May 2004