Bag-of-Audio-Words Feature Representation Using GMM Clustering
Transcription
FO-1-2-5 Bag-of-Audio-Words Feature Representation Using GMM Clustering for Sound Event Classification

Hyungjun Lim, Myung Jong Kim, and Hoirin Kim
Department of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST)
{hyungjun.lim, myungjong, hoirkim}@kaist.ac.kr

Abstract

This paper addresses the problem of sound event classification, focusing on feature representation methods. Sound events such as screaming and glass breaking show distinctive temporal and spectral characteristics, so extracting features that properly represent these characteristics is important for achieving good performance. In this paper, we employ a bag-of-audio-words feature representation, a histogram representation of frame-based features, to characterize the time-frequency patterns in the long-range segment of a sound event. In this method, Gaussian mixture model-based clustering is adopted to deal with the inconsistent dynamic ranges of the frame-based features. Test sounds are classified using a support vector machine. The proposed method is evaluated on a database of several hundred audio clips covering fifteen sound events, and the classification results show over 41% relative improvement compared to the conventional bag-of-audio-words representation.

Keywords: Bag-of-audio-words, Gaussian mixture model (GMM) clustering, sound event classification.

1. Introduction

Sound events are good descriptors for recognizing and understanding circumstances. In an audio surveillance application, for example, sound events such as screaming or an explosion may indicate a dangerous situation, whereas sound events such as conversation or music may imply a normal condition. Hence, a sound event classification method that produces highly accurate results is very useful in applications such as audio surveillance [1, 2, 3], health-care monitoring [4], and military systems [5]. In general, such sound events show distinctive temporal and spectral characteristics [5]. Therefore, developing a feature representation method that properly describes the characteristics of each sound event is very important for improving classification accuracy.

Sound event classification has conventionally been performed using general audio features, including MPEG-7 low-level features (LLFs) [1], linear-frequency cepstral coefficients (LFCCs) [2], Mel-frequency cepstral coefficients (MFCCs) [4], and their combinations [3, 5]. Kim and Kim [7] proposed segmental two-dimensional MFCCs, based on a two-dimensional discrete cosine transform, to capture the temporal and spectral characteristics of a sound event. Dennis et al. [6] utilized image processing techniques such as pseudo-coloring and partitioning of the spectrogram to overcome the noise sensitivity of MFCCs. Lee et al. [8] employed the angular radial transform to extract spectrogram shape features within a birdsong segment.

In recent years, the bag-of-audio-words (BoAW) feature representation, a histogram of frame-based audio features (such as LLFs) over a long-term segment rather than the frame-based features themselves, has been successfully applied to sound event classification [9, 10], since the histogram is well suited to describing the global characteristics of a sound event. In this method, k-means clustering based on the Euclidean distance measure is generally used to construct the histogram.
However, since the dynamic range of each frame-based feature is diverse and inconsistent, the clustering result is subject to bias. To overcome this drawback, this paper presents a sound event classification method focusing on BoAW feature representation using Gaussian mixture model (GMM)-based clustering, which takes the dynamic range of each feature into account. A support vector machine (SVM) classifier is used to identify the class of a test sound among fifteen sound event classes.

The remainder of the paper is organized as follows. The conventional BoAW feature representation is described in Section 2. In Section 3, we present the proposed distribution-based clustering method. Section 4 shows the experiments, and finally, our conclusions are summarized in Section 5.

2. BoAW feature representation

The block diagram of BoAW feature representation is shown in Figure 1.

[Figure 1: Block diagram of BoAW feature representation. The dotted box is used only in the training phase.]

First, the frame-based features are extracted from each sound clip. For each frame-based feature vector, we choose the cluster whose centroid has the minimum distance to that vector. Note that the clusters are obtained with k-means clustering based on the Euclidean distance measure, and only in the training phase. Given a set of $d$-dimensional frame-based feature vectors $(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N)$, k-means clustering aims to partition the $N$ feature vectors into $k$ sets $S = \{S_1, S_2, \ldots, S_k\}$ so as to minimize the within-cluster sum of squares, i.e.,

$$\arg\min_{S} \sum_{i=1}^{k} \sum_{\mathbf{x} \in S_i} \| \mathbf{x} - \boldsymbol{\mu}_i^{\mathrm{kmeans}} \|^2 \qquad (1)$$

where $\boldsymbol{\mu}_i^{\mathrm{kmeans}}$ is the centroid of the $i$-th cluster obtained by k-means clustering. Finally, the BoAW feature vector is obtained by accumulating a histogram of the selected clusters over the sound clip,

$$\mathbf{F}_{\mathrm{BoAW\_kmeans}} = \sum_{l=1}^{L} [\delta(P_l^c, 1), \delta(P_l^c, 2), \ldots, \delta(P_l^c, k)]^{T} \qquad (2)$$

where $P_l^c$ is the cluster selected for the $l$-th frame-based feature vector, $L$ is the total number of frames in the sound clip, and $\delta(\cdot)$ is the Kronecker delta function. As a result, the BoAW feature representation aggregates all the frame-based features in a sound clip, so it is useful for capturing the global time-frequency characteristics of a sound event.
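Before moving to the GMM-based variant, the following Python fragment is a minimal, hypothetical sketch (not the authors' code) of the conventional pipeline in Eqs. (1)-(2): a k-means codebook is learned on pooled training frames and each clip is represented by the histogram of its nearest-centroid assignments. Function names and the use of scikit-learn's KMeans are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical illustration of Eqs. (1)-(2): hard-assignment BoAW with k-means.
# train_frames: (N, d) array of frame-based features pooled over all training clips.
# frames: (L, d) array of frame-based features from one clip.

def train_codebook(train_frames, k=512, seed=0):
    """Learn k cluster centroids (the 'audio words') with Euclidean k-means."""
    return KMeans(n_clusters=k, random_state=seed, n_init=10).fit(train_frames)

def boaw_kmeans(frames, codebook):
    """Histogram of nearest-centroid assignments over all frames of a clip (Eq. 2)."""
    k = codebook.n_clusters
    assignments = codebook.predict(frames)        # index of the closest centroid per frame
    hist = np.bincount(assignments, minlength=k)  # occurrence count per cluster
    return hist.astype(float)
```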
3. GMM clustering-based BoAW representation

In the conventional BoAW feature representation described in Section 2, Euclidean distance-based k-means clustering is generally used. However, since the frame-based features have diverse dynamic ranges, features with a wide dynamic range dominate the clustering result. For example, the dynamic range of the short-time energy is much broader than that of the zero-crossing rate. Therefore, to tackle this disadvantage of k-means clustering, we propose a BoAW feature representation based on GMM clustering, one of the widely used distribution-based clustering methods [11]. GMM clustering is a form of soft clustering that uses probabilities instead of the occurrence counts used in k-means clustering. It can effectively compensate for the various dynamic ranges of the frame-based features by using the posterior probabilities of the Gaussian components. More specifically, each Gaussian component in the GMM takes the role of a cluster in k-means clustering, so the distances between frame-based features and the centroids are replaced by the posterior probabilities of the Gaussian components.

Let the GMM have $M$ Gaussian components. The posterior probability of the $m$-th Gaussian component is then obtained as

$$p(m \mid \mathbf{x}) = \frac{w_m \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_m^{\mathrm{GMM}}, \boldsymbol{\Sigma}_m)}{\sum_{i=1}^{M} w_i \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_i^{\mathrm{GMM}}, \boldsymbol{\Sigma}_i)} \qquad (3)$$

where $\mathbf{x}$ is the frame-based feature vector, and $\boldsymbol{\mu}_m^{\mathrm{GMM}}$, $\boldsymbol{\Sigma}_m$, and $w_m$ are the mean vector, covariance matrix, and mixture weight of the $m$-th Gaussian component, respectively. The GMM is trained using the expectation-maximization (EM) algorithm [12]. The BoAW feature vector $\mathbf{F}_{\mathrm{BoAW\_GMM}}$, which is the histogram of the frame-based features in the sound clip, is then obtained by summing the posterior probabilities over all frames in the clip:

$$\mathbf{F}_{\mathrm{BoAW\_GMM}} = \sum_{l=1}^{L} [\,p(m{=}1 \mid \mathbf{x}_l),\ p(m{=}2 \mid \mathbf{x}_l),\ \ldots,\ p(m{=}M \mid \mathbf{x}_l)\,]^{T}. \qquad (4)$$

Consequently, the BoAW feature representation based on distribution clustering may be more appropriate than the conventional method for capturing the distinct characteristics of sound events, since it compensates for the inconsistent dynamic range of each feature.
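As an illustrative counterpart to the k-means sketch above (again hypothetical, not the authors' implementation), the fragment below forms the soft BoAW histogram of Eqs. (3)-(4) by summing per-frame posterior probabilities from a GMM. scikit-learn's GaussianMixture is assumed, whose predict_proba returns the posteriors $p(m \mid \mathbf{x})$; the diagonal covariance type is an assumption, as the paper does not state which covariance structure was used.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical illustration of Eqs. (3)-(4): soft-assignment BoAW with a GMM.

def train_gmm_codebook(train_frames, n_components=512, seed=0):
    """Fit an M-component GMM to the pooled training frames with the EM algorithm."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag",   # assumption; not specified in the paper
                          random_state=seed)
    return gmm.fit(train_frames)

def boaw_gmm(frames, gmm):
    """Sum the posterior probabilities p(m | x_l) over all frames of a clip (Eq. 4)."""
    posteriors = gmm.predict_proba(frames)  # shape (L, M), each row sums to 1
    return posteriors.sum(axis=0)            # soft histogram of length M
```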
4. Experiments

4.1 Experimental setup

To evaluate the proposed method, we used fifteen classes of sound events: car crashing, crying, dog barking, explosion, glass breaking, screaming, air conditioner, bird song, conversation, car horn, motorcycle, music, raining, ambulance siren, and wind, collected from various sound effect libraries and the Web. Since the duration of a target sound varies, the sound clips have variable lengths of about 1-8 sec. Table 1 describes the database in terms of the number of clips per sound event class, the total duration, and the average clip duration. All sound clips were digitized at 16 bits per sample with a 48 kHz sampling rate in a mono channel.

Table 1: Configuration of the database.

Category   Class             # Clips   Total duration (sec)   Avg. clip duration (std.) (sec)
Abnormal   Car crashing      36        154.9                  4.3 (2.0)
Abnormal   Crying            66        311.4                  4.7 (1.0)
Abnormal   Dog barking       81        372.6                  4.6 (1.6)
Abnormal   Explosion         64        280.7                  4.4 (1.7)
Abnormal   Glass breaking    103       233.3                  2.3 (1.3)
Abnormal   Screaming         115       228.7                  2.0 (0.9)
Normal     Air conditioner   68        333.6                  4.9 (0.3)
Normal     Bird song         92        355.4                  3.9 (1.4)
Normal     Conversation      48        240.0                  5.0 (0.0)
Normal     Car horn          96        199.5                  2.1 (1.2)
Normal     Motorcycle        58        292.0                  5.0 (0.5)
Normal     Music             72        360.0                  5.0 (0.0)
Normal     Raining           65        324.2                  5.0 (0.1)
Normal     Ambulance siren   68        322.1                  4.7 (0.5)
Normal     Wind              56        350.1                  4.9 (0.4)

To show the effectiveness of the proposed method, we evaluated the performance of the LLF, LFCC, and MFCC features with the k-means and GMM clustering methods. The LLF consisted of the short-time energy, zero-crossing rate, spectral centroid, spectral bandwidth, sub-band energy, sub-band energy ratio, spectral flux, spectral flatness, and spectral roll-off. All features were extracted from short frames of 25 msec with 50% overlap. For clustering, we used 128, 256, and 512 clusters for k-means and the same numbers of Gaussian components for the GMM. For reliable results, a 5-fold cross validation was performed, with the database split randomly into five equal-sized parts. The classifier was a support vector machine (SVM) with a linear kernel [13].

From an application point of view, we carried out two additional experiments: distant environments and a surveillance scenario. First, to generate an additional distant sound database, each sound clip was re-recorded by playing the original recording back on a loudspeaker at a distance of 1 m or 10 m in a quiet outdoor environment. Second, for the experiments under the surveillance scenario, the fifteen classes of sound events were categorized into two classes, abnormal and normal. The abnormal class consists of car crashing, crying, dog barking, explosion, glass breaking, and screaming, and the remaining classes were mapped into the normal class, as shown in Table 1.

4.2 Experimental results

4.2.1 Effectiveness of GMM clustering

Table 2 shows the performance comparison between the proposed GMM clustering-based and the conventional k-means clustering-based BoAW feature representations with the various frame-based features, in terms of the average classification accuracy (CA) on the original database. The CA was averaged across the 5-fold experiments.

Table 2: Average classification accuracies (%) of the various features according to the number of clusters for the k-means and GMM clustering methods. The best result in each row is obtained with GMM clustering and 512 clusters.

Frame-based    k-means clustering          GMM clustering
features       128     256     512         128     256     512
LLF            65.6    67.6    67.6        85.4    85.7    87.1
LFCC           76.2    78.7    81.7        81.5    83.3    85.9
MFCC           90.2    91.5    92.5        93.3    95.0    95.6

These results show that GMM clustering outperformed the conventional k-means clustering in most cases, with a 55.9% relative improvement when using the LLF as frame-based features and 256 clusters. The relative improvement is computed as

$$\mathrm{ERR}\,[\%] = \frac{\mathrm{CER}_{\mathrm{baseline}} - \mathrm{CER}_{\mathrm{proposed}}}{\mathrm{CER}_{\mathrm{baseline}}} \times 100 \qquad (5)$$

where ERR and CER denote the error reduction rate and the classification error rate, respectively. This implies that GMM clustering is better suited to the BoAW framework because it deals effectively with the various dynamic ranges of the frame-based features. We can also observe that the MFCC is superior to the LLF and LFCC as a frame-based feature, showing 65.0% and 70.1% relative improvements, respectively, when using 512 clusters with the GMM clustering method. This indicates that the MFCC is more effective at expressing the characteristics of sound events in the BoAW method.
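To make the setup of Section 4.1 concrete, here is a hypothetical end-to-end sketch (not the authors' code) that extracts frame-based MFCCs with 25 ms frames and 50% overlap at 48 kHz, builds the GMM-based BoAW vectors with the boaw_gmm helper sketched after Section 3, and scores a linear-kernel SVM with 5-fold cross validation. librosa and scikit-learn are assumed; clip_paths and labels are placeholder inputs, and 13 MFCC coefficients is an assumption since the paper does not state the MFCC order.

```python
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

SR = 48000
FRAME = int(0.025 * SR)   # 25 ms frame -> 1200 samples
HOP = FRAME // 2          # 50% overlap

def mfcc_frames(path):
    """Frame-based MFCC features for one clip, shape (num_frames, n_mfcc)."""
    y, _ = librosa.load(path, sr=SR, mono=True)
    m = librosa.feature.mfcc(y=y, sr=SR, n_mfcc=13, n_fft=FRAME, hop_length=HOP)
    return m.T

def evaluate(clip_paths, labels, gmm):
    """Average 5-fold CA of a linear SVM on GMM-based BoAW vectors (placeholder inputs)."""
    X = np.vstack([boaw_gmm(mfcc_frames(p), gmm) for p in clip_paths])
    clf = SVC(kernel="linear")
    scores = cross_val_score(clf, X, np.asarray(labels), cv=5)
    return scores.mean()
```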
Therefore, frame-based MFCC features and 512 clusters were used as the default setting in the following experiments.

We analyze the best classification result (MFCC features with 512 GMM clusters) using the confusion matrix shown in Table 3. As can be seen, most classes show very little confusion, except for the car crashing class. This result can be interpreted from two points of view: insufficient data and/or the complex characteristics of the car crashing sound. Table 1 shows that the car crashing class has the smallest amount of data (about 150 sec of total duration), which can cause poor modeling in the training phase. Furthermore, a car crashing event is composed of 'tire skid' and 'crash' sounds, which resemble the motorcycle, glass breaking, and explosion classes. Therefore, a higher misclassification rate is observed for the car crashing class than for the other sound classes.

Table 3: Confusion matrix for the fifteen sound event classes. Each entry is the percentage of clips of the actual class (columns) predicted as the class in the corresponding row. Class indices: 1 air conditioner, 2 bird song, 3 car crashing, 4 conversation, 5 crying, 6 dog barking, 7 explosion, 8 glass breaking, 9 car horn, 10 motorcycle, 11 music, 12 raining, 13 screaming, 14 ambulance siren, 15 wind.

Pred\Actual    1      2      3      4      5      6      7      8      9     10     11     12     13     14     15
 1           98.5    0.0    0.0    0.0    0.0    0.0    3.1    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
 2            0.0  100.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    2.6    0.0    0.0
 3            0.0    0.0   69.4    0.0    0.0    0.0    1.6    1.9    1.0    3.4    0.0    0.0    0.0    0.0    0.0
 4            0.0    0.0    0.0  100.0    0.0    0.0    1.6    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
 5            0.0    0.0    0.0    0.0   97.0    0.0    0.0    1.0    1.0    0.0    0.0    0.0    0.0    0.0    0.0
 6            0.0    0.0    0.0    0.0    0.0  100.0    0.0    1.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
 7            1.5    0.0    5.6    0.0    0.0    0.0   92.2    0.0    1.0    0.0    0.0    0.0    0.0    0.0    0.0
 8            0.0    0.0    8.3    0.0    0.0    0.0    1.6   95.1    0.0    3.4    0.0    0.0    0.9    0.0    0.0
 9            0.0    0.0    0.0    0.0    3.0    0.0    0.0    0.0   91.7    0.0    1.4    0.0    5.2    0.0    0.0
10            0.0    0.0   11.1    0.0    0.0    0.0    0.0    0.0    1.0   93.1    0.0    0.0    0.0    0.0    0.0
11            0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    1.0    0.0   98.6    0.0    0.0    0.0    0.0
12            0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0  100.0    0.0    0.0    0.0
13            0.0    0.0    5.6    0.0    0.0    0.0    0.0    0.0    2.1    0.0    0.0    0.0   91.3    1.5    0.0
14            0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    1.0    0.0    0.0    0.0   98.5    0.0
15            0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    1.0    0.0    0.0    0.0    0.0    0.0  100.0

4.2.2 Evaluation of the proposed method on various distant environments

It is important to measure performance in distant environments for audio surveillance, because sounds related to dangerous situations are likely to reach the system from a distance. Table 4 shows the CA of the k-means and GMM clustering-based BoAW feature representations in various distant environments. In the distance-matched condition the acoustic model is trained using only training data whose recording distance matches the test data, whereas in the multi-condition setting the acoustic model is trained on all training data regardless of distance. As can be seen, the proposed method consistently shows better CA than the conventional BoAW method for all distant environments and training conditions. Although the time-frequency characteristics of a sound event are distorted in distant environments because of the significantly reduced power, the proposed method still gives fairly good performance. This result shows that the proposed method is more robust to distant environments.

Table 4: Average classification accuracies (%) of the k-means and GMM clustering methods in various distant environments (original, 1 m, and 10 m) for the distance-matched and multi-condition settings.

Condition                    Test set    k-means clustering    GMM clustering
Distance-matched condition   Original    92.5                  95.6
                             1 m         90.5                  94.3
                             10 m        90.6                  94.5
Multi-condition              Original    92.3                  92.9
                             1 m         90.4                  91.5
                             10 m        88.7                  91.2
4.2.3 Evaluation of the proposed method under the surveillance scenario

Under the surveillance scenario, confusions among mundane sounds are not of interest, because the goal is only to capture dangerous situations. From this point of view, we performed additional experiments by mapping the classification results onto the normal and abnormal classes. Table 5 presents the classification accuracy for the normal class and the individual abnormal sound events, i.e., 7-way classification over the car crashing, crying, dog barking, explosion, glass breaking, screaming, and normal classes. Table 6 presents the classification accuracy for the 2-way classification into abnormal and normal classes. Consistent with the previous experiments, the proposed method is more accurate than the conventional BoAW method, so it can be successfully applied to surveillance applications.

Table 5: Average classification accuracies (%) for seven classes: car crashing, crying, dog barking, explosion, glass breaking, screaming, and normal.

Condition                    Test set    k-means clustering    GMM clustering
Distance-matched condition   Original    92.9                  96.0
                             1 m         91.5                  94.8
                             10 m        91.4                  95.3
Multi-condition              Original    93.3                  93.6
                             1 m         91.4                  92.3
                             10 m        90.8                  93.0

Table 6: Average classification accuracies (%) for two classes: abnormal and normal.

Condition                    Test set    k-means clustering    GMM clustering
Distance-matched condition   Original    94.8                  97.3
                             1 m         94.3                  96.4
                             10 m        94.7                  97.3
Multi-condition              Original    95.5                  95.6
                             1 m         94.1                  94.0
                             10 m        93.7                  95.0
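As a small, hypothetical helper (the label strings below are placeholders, not identifiers from the authors' data), the 2-way surveillance evaluation of Table 6 can be obtained by collapsing the fifteen-class decisions and ground-truth labels into abnormal/normal before scoring:

```python
# Hypothetical mapping for the 2-way surveillance evaluation.
ABNORMAL = {"car crashing", "crying", "dog barking",
            "explosion", "glass breaking", "screaming"}

def to_surveillance_label(event_label):
    """Collapse a fifteen-class label into 'abnormal' or 'normal'."""
    return "abnormal" if event_label in ABNORMAL else "normal"

def two_way_accuracy(y_true, y_pred):
    """Accuracy after mapping 15-way true/predicted labels to the 2-way scheme."""
    pairs = [(to_surveillance_label(t), to_surveillance_label(p))
             for t, p in zip(y_true, y_pred)]
    return sum(t == p for t, p in pairs) / len(pairs)
```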
5. Conclusion

We proposed a feature representation method that employs BoAW based on GMM clustering to effectively represent the distinct time-frequency patterns of sound events. An SVM with a linear kernel was adopted as the sound event classifier. The proposed features were evaluated in terms of the CA across fifteen sound classes. The experimental results show that the proposed feature representation outperformed the conventional BoAW representation based on k-means clustering, achieving a CA of 95.6% when using MFCC frame features and 512 clusters with GMM clustering. Furthermore, additional experiments were performed for audio surveillance settings. Our work verifies that the proposed method can be successfully applied to audio surveillance systems.

6. Acknowledgements

This work was supported by the Technology Innovation Program of the Ministry of Trade, Industry & Energy [10047788, Development of Smart Video/Audio Surveillance SoC & Core Component for Onsite Decision Security System].

References

[1] A. Harma, M. F. McKinney, and J. Skowronek, "Automatic surveillance of the acoustic activity in our living environment," in Proc. IEEE Int. Conf. Multimedia and Expo, Jul. 2005.
[2] P. K. Atrey, N. C. Maddage, and M. S. Kankanhalli, "Audio based event detection for multimedia surveillance," in Proc. IEEE Int. Conf. Acoust., Speech, and Signal Process., May 2006, pp. 813-816.
[3] C. Clavel, T. Ehrette, and G. Richard, "Events detection for an audio-based surveillance system," in Proc. IEEE Int. Conf. Multimedia and Expo, Jul. 2005, pp. 1306-1309.
[4] Y. T. Peng, C. Y. Lin, M. T. Sun, and K. C. Tsai, "Healthcare audio event classification using hidden Markov models and hierarchical hidden Markov models," in Proc. IEEE Int. Conf. Multimedia and Expo, Jun. 2009, pp. 1218-1221.
[5] S. Ntalampiras, I. Potamitis, and N. Fakotakis, "On acoustic surveillance of hazardous situations," in Proc. IEEE Int. Conf. Acoust., Speech, and Signal Process., Apr. 2009, pp. 165-168.
[6] J. Dennis, H. D. Tran, and H. Li, "Image representation of the subband power distribution for robust sound classification," in Proc. Interspeech, Aug. 2011, pp. 2437-2440.
[7] M. J. Kim and H. Kim, "Audio-based objectionable content detection using discriminative transforms of time-frequency dynamics," IEEE Trans. Multimedia, vol. 14, no. 5, pp. 1390-1400, Oct. 2012.
[8] C. H. Lee, S. B. Hsu, J. L. Shih, and C. H. Chou, "Continuous birdsong recognition using Gaussian mixture modeling of image shape features," IEEE Trans. Multimedia, vol. 15, no. 2, pp. 454-464, Feb. 2013.
[9] S. Pancoast and M. Akbacak, "Bag-of-audio-words approach for multimedia event classification," in Proc. Interspeech, Sep. 2012, pp. 2105-2108.
[10] V. Carletti, P. Foggia, G. Percannella, A. Saggese, N. Strisciuglio, and M. Vento, "Audio surveillance using a bag of aural words classifier," in Proc. IEEE Int. Conf. Advanced Video and Signal Based Surveillance, Aug. 2013, pp. 81-86.
[11] C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
[12] T. K. Moon, "The expectation-maximization algorithm," IEEE Signal Process. Magazine, vol. 13, no. 6, pp. 47-60, Nov. 1996.
[13] C. C. Chang and C. J. Lin, "LIBSVM: a library for support vector machines," ACM Trans. on Intelligent Systems and Technology, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.