IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 13, NO. 11, NOVEMBER 2004

Statistical Modeling of Complex Backgrounds for Foreground Object Detection

Liyuan Li, Member, IEEE, Weimin Huang, Member, IEEE, Irene Yu-Hua Gu, Senior Member, IEEE, and Qi Tian, Senior Member, IEEE

Abstract—This paper addresses the problem of background modeling for foreground object detection in complex environments. A Bayesian framework that incorporates spectral, spatial, and temporal features to characterize the background appearance is proposed. Under this framework, the background is represented by the most significant and frequent features, i.e., the principal features, at each pixel. A Bayes decision rule is derived for background and foreground classification based on the statistics of principal features. Principal feature representation for both the static and dynamic background pixels is investigated. A novel learning method is proposed to adapt to both gradual and sudden "once-off" background changes. The convergence of the learning process is analyzed and a formula to select a proper learning rate is derived. Under the proposed framework, a novel algorithm for detecting foreground objects from complex environments is then established. It consists of change detection, change classification, foreground segmentation, and background maintenance. Experiments were conducted on image sequences containing targets of interest in a variety of environments, e.g., offices, public buildings, subway stations, campuses, parking lots, airports, and sidewalks. Good results of foreground detection were obtained. Quantitative evaluation and comparison with the existing method show that the proposed method provides much improved results.

Index Terms—Background maintenance, background modeling, background subtraction, Bayes decision theory, complex background, feature extraction, motion analysis, object detection, principal features, video surveillance.

Manuscript received June 19, 2003; revised January 29, 2004. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Luca Lucchese. L. Li, W. Huang, and Q. Tian are with the Institute for Infocomm Research, Singapore 119613 (e-mail: lyli@i2r.a-star.edu.sg; wmhuang@i2r.a-star.edu.sg; tian@i2r.a-star.edu.sg). I. Y.-H. Gu is with the Department of Signals and Systems, Chalmers University of Technology, SE-412 96 Göteborg, Sweden (e-mail: irenegu@s2.chalmers.se). Digital Object Identifier 10.1109/TIP.2004.836169

I. INTRODUCTION

IN COMPUTER vision applications, such as video surveillance, human motion analysis, human-machine interaction, and object-based video encoding (e.g., MPEG-4), the objects of interest are often the moving foreground objects in an image sequence. One effective way of extracting foreground objects is to suppress the background points in the image frames [1]–[6]. To achieve this, an accurate and adaptive background model is often desirable. The background usually contains nonliving objects that remain passive in the scene. Background objects can be stationary, such as walls, doors, and room furniture, or nonstationary, such as wavering bushes or moving escalators. The appearance of background objects also undergoes various changes over time, e.g., changes in brightness caused by changing weather conditions or by switching lights on and off. The background image can be described as consisting of static and dynamic pixels.
The static pixels belong to the stationary objects, and the dynamic pixels are associated with nonstationary objects. A static background pixel can become a dynamic one as time advances, e.g., when a computer screen is turned on, and a dynamic background pixel can turn into a static one, such as a pixel in a bush when the wind stops. To describe a general background scene, a background model must be able to 1) represent the appearance of a static background pixel; 2) represent the appearance of a dynamic background pixel; 3) self-evolve with gradual background changes; and 4) self-evolve with sudden "once-off" background changes.

For background modeling without specific domain knowledge, the background is usually represented by image features at each pixel. The features extracted from an image sequence can be classified into three types: spectral, spatial, and temporal. Spectral features are associated with gray-scale or color information, spatial features with gradients or local structure, and temporal features with interframe changes at the pixel. Many existing methods utilize spectral features (distributions of intensities or colors at each pixel) to model the background [4], [5], [7]–[9]. To be robust to illumination changes, some spatial features are also exploited [2], [10], [11]. Spectral and spatial features are suitable for describing the appearance of static background pixels. Recently, a few methods have introduced temporal features to describe the dynamic background pixels associated with nonstationary objects [6], [12], [13]. There is, however, a lack of systematic approaches that incorporate all three types of features into a representation of a complex background containing both stationary and nonstationary objects.

The features that characterize stationary and nonstationary background objects are different. For a background model to describe a general background, it should be able to learn the significant features of the background at each pixel and provide the information needed for foreground and background classification. Motivated by this, a Bayesian framework which incorporates multiple types of features for modeling complex backgrounds is proposed in this paper. The major novelties of the proposed method are as follows. 1) A Bayesian framework is proposed for incorporating spectral, spatial, and temporal features in background modeling. 2) A new formula of the Bayes decision rule is derived for background and foreground classification. 3) The background is represented using statistics of principal features associated with stationary and nonstationary background objects. 4) A novel method is proposed for learning and updating background features under both gradual and "once-off" background changes. 5) The convergence of the learning process is analyzed and a formula is derived to select a proper learning rate. 6) A new real-time algorithm is developed for foreground object detection in complex environments. Further, a wide range of tests is conducted on a variety of environments, including offices, campuses, parks, commercial buildings, hotels, subway stations, airports, and sidewalks.

The remaining part of the paper is organized as follows. After a brief literature review of existing work in Section I-A, Section II describes the statistical modeling of complex backgrounds based on principal features.
First, a new formula of the Bayes decision rule for background and foreground classification is derived. Based on this formula, an effective data structure to record the statistics of principal features is established, and principal feature representation for different background objects is addressed. In Section III, the method for learning and updating the statistics of principal features is described: strategies to adapt to both gradual and sudden "once-off" background changes are proposed, and the properties of the learning process are analyzed. In Section IV, an algorithm for foreground object detection based on the statistical background modeling is described. It contains four steps: change detection, change classification, foreground segmentation, and background maintenance. Section V presents the experimental results on various environments; evaluations and comparisons with an existing method are also included. Finally, conclusions are given in Section VI.

A. Related Work

A simple and direct way to describe the background at each pixel is to use the spectral information, i.e., the gray-scale or color of the background pixel. Early studies describe background features using an average of gray-scale or color intensities at each pixel, and infinite impulse response (IIR) or Kalman filters [7], [14], [15] are employed to track slow and gradual changes in the background. These methods are applicable to backgrounds consisting of stationary objects. To tolerate the background variations caused by imaging noise, illumination changes, and the motion of nonstationary objects, statistical models are used to represent the spectral features at each background pixel. Frequently used models include the Gaussian [8], [16]–[22] and the mixture of Gaussians (MoG) [4], [23]–[25]. In these models, one or a few Gaussians are used to represent the color distribution at each background pixel. A mixture of Gaussian distributions can represent various background appearances, e.g., road surfaces under the sun or in the shadows [23]. The parameters (mean, variance, and weight) of each Gaussian are updated with an IIR filter to adapt to gradual background changes. Moreover, by replacing an old Gaussian with a newly learned color distribution, MoG can adapt to "once-off" background changes. In [9], a nonparametric model is proposed for background modeling, where a kernel-based function is employed to represent the color distribution of each background pixel. The kernel-based distribution is a generalization of MoG which does not require parameter estimation, but its computational cost is high. A variant model is used in [5], where the distribution of temporal variations in color at each pixel is used to model the spectral feature of the background. MoG performs better in a time-varying environment where the background is not completely stationary, but it can misclassify the foreground if the background scene is complex [19], [26]. For example, if the background contains a nonstationary object with significant motion, the colors of pixels in that region may change widely over time, and foreground objects with similar colors (camouflaged foreground objects) can easily be misclassified as background.

Spatial information has recently been exploited to improve the accuracy of background representation.
The local statistics of spectral features [27], [28], local texture features [2], [3], or global structure information [29] have been found helpful for accurate foreground extraction. These methods are most suitable for stationary backgrounds. Paragios and Ramesh [10] use a mixture model (Gaussians or Laplacians) to represent the distributions of background differences for static background points, and develop a Markov random field (MRF) model to incorporate spatio-spectral coherence for robust foreground segmentation. In [11], gradient distributions are introduced into MoG to reduce the misclassification that results from depending purely on color distributions. Spatial information helps to detect camouflaged foreground objects and to suppress shadows. Spatial features are, however, not applicable to nonstationary background objects at the pixel level, since the corresponding spatial features vary over time.

A few more attempts to segment foreground objects from nonstationary backgrounds have been made by using temporal features. One way is to estimate the consistency of optical flow over a short duration of time [13], [30]; the dynamic features of nonstationary background objects are then represented by the significant variation of accumulated local optical flows. In [12], Li et al. propose a method that employs the statistics of color co-occurrence between two consecutive frames to model the dynamic features associated with a nonstationary background object. Temporal features are suitable for modeling the appearance of nonstationary objects. In Wallflower [6], Toyama et al. use a linear Wiener filter, an autoregressive model, to represent the intensity changes at each background pixel. The linear predictor can learn and estimate the intensity variations of a background pixel and works well for periodic changes, but it has difficulty predicting shadows and background changes of varying frequency in natural scenes. A brief summary of the existing methods, classified by the types of features used, is given in Table I.

TABLE I: CLASSIFICATION OF PREVIOUS METHODS AND THE PROPOSED METHOD

Further, most existing methods perform the background and foreground classification with one or more heuristic thresholds, and for backgrounds of different complexities the thresholds must be adjusted empirically. In addition, these methods are often tested only on a few background environments (e.g., laboratories, campuses, etc.).

II. STATISTICAL MODELING OF THE BACKGROUND

A. Bayes Classification of Background and Foreground

For arbitrary background and foreground objects or regions, the classification of background and foreground can be formulated under Bayes decision theory. Let s = (x, y) be the position of an image pixel, I_t be the input image at time t, and v_t be an n-dimensional feature vector extracted at position s and time t from the image sequence. Then, the posterior probability of the feature vector coming from the background at s can be computed by using the Bayes rule

P(b | v_t, s) = P(v_t | b, s) P(b | s) / P(v_t | s)    (1)

where b indicates the background, P(v_t | b, s) is the probability of the feature vector v_t being observed as background at s, P(b | s) is the prior probability of the pixel s belonging to the background, and P(v_t | s) is the prior probability of the feature vector v_t being observed at position s.
Similarly, the posterior probability that the feature vector comes from a foreground object at s is

P(f | v_t, s) = P(v_t | f, s) P(f | s) / P(v_t | s)    (2)

where f denotes the foreground. Using the Bayes decision rule, a pixel is classified as belonging to the background according to its feature vector v_t observed at time t if

P(b | v_t, s) > P(f | v_t, s).    (3)

Otherwise, it is classified as belonging to the foreground. Since a feature vector observed at an image pixel comes from either background or foreground objects, it follows that

P(v_t | s) = P(v_t | b, s) P(b | s) + P(v_t | f, s) P(f | s).    (4)

Substituting (1) and (4) into (3), the Bayes decision rule (3) becomes

2 P(v_t | b, s) P(b | s) > P(v_t | s)    (5)

i.e., the pixel is labeled as background exactly when P(b | v_t, s) > 0.5. Using (5), the pixel s with observed feature vector v_t at time t can be classified as a background or a foreground point, provided that the probabilities P(b | s), P(v_t | b, s), and P(v_t | s) are known in advance.

B. Principal Feature Representation of Background

To apply (5) for the classification of background and foreground, the probability functions P(b | s), P(v_t | b, s), and P(v_t | s) should be known in advance or be properly estimated. For complex backgrounds, the forms of these probability functions are unknown. One way to estimate them is to use the histogram of features. The problem that would be encountered is the high cost of storage and computation: assuming v is an n-dimensional vector and each of its elements is quantized to L values, the histogram would contain L^n cells. For example, for a color vector with a resolution of 256 levels per component, the histogram would contain 256^3 (more than 16 million) cells. The method would be unrealistic in terms of computational and memory requirements.

It is reasonable to assume that, if the selected features represent the background effectively, the intraclass spread of the background features is small, which implies that the distribution of background features is highly concentrated in a small region of the histogram. Features from various foreground objects, on the other hand, spread widely in the feature space. Therefore, there is little overlap between the distributions of background and foreground features. This implies that, with a proper selection and quantization of features, it is possible to describe the background approximately by using only a small number of feature vectors. A concise data structure to implement such a representation of the background is created as follows.

Let v_1, v_2, ... be the quantized feature vectors sorted in descending order with respect to P(v_i | s) for each pixel s. Then, for a proper selection of features, there would be a small integer N, a high percentage value M1, and a low percentage value M2 such that the background is well approximated by

sum_{i=1}^{N} P(v_i | b, s) > M1  and  sum_{i=1}^{N} P(v_i | f, s) < M2.    (6)

The value of N and the existence of M1 and M2 depend on the selection and quantization of the feature vectors. The feature vectors v_1, ..., v_N are defined as the principal features of the background at pixel s. To learn and update the prior and conditional probabilities for the principal feature vectors, a table of statistics for the possible principal features is established for each feature type at s. The table is denoted as

T_v^t(s) = {S_{v,1}^t(s), ..., S_{v,N_v}^t(s)}    (7)

where T_v^t(s) is learned from the observation of the features up to time t and records the statistics of the N_v (N_v > N) most frequent feature vectors at pixel s.

Fig. 1. One example of learned principal features for a static background pixel in a busy scene. The left image shows the position of the selected pixel. The two right images are the histograms of the statistics for the most significant colors and gradients, where the height of a bar is the value of p_{v,i}^t, the light gray part is p_{b,v,i}^t, and the top dark gray part is the remainder. The icons below the histograms are the corresponding color and gradient features.

Each element of the table contains three components

S_{v,i}^t(s) = {p_{v,i}^t(s), p_{b,v,i}^t(s), v_i^t(s)}    (8)

where p_{v,i}^t(s) is the learned estimate of P(v_i | s), p_{b,v,i}^t(s) is the learned estimate of the joint probability P(v_i, b | s), and v_i^t(s) = [a_1, ..., a_n] is the feature vector itself, n being the dimension of the feature vector. The elements in the table are sorted in descending order with respect to p_{v,i}^t(s).
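To make the data structure concrete, the following is a minimal Python sketch of the per-pixel table (7)-(8) and the decision rule (5). It is an illustration under the notation reconstructed above, not the authors' implementation; the names (FeatureTable, classify_bayes) are hypothetical, and p_bv stands in for the learned joint statistic P(v_i, b | s).

```python
import numpy as np

class FeatureTable:
    """Per-pixel statistics table in the spirit of (7)-(8): N_v entries,
    each holding p_vi ~ P(v_i|s), p_bvi ~ P(v_i, b|s), and the vector v_i."""
    def __init__(self, n_entries, dim):
        self.p_v = np.zeros(n_entries)           # learned P(v_i | s)
        self.p_bv = np.zeros(n_entries)          # learned P(v_i, b | s)
        self.vecs = np.zeros((n_entries, dim))   # the feature vectors v_i

def classify_bayes(p_v_est, p_vb_est):
    """Decision rule (5): background iff 2 * P(v|b,s) * P(b|s) > P(v|s).
    p_vb_est approximates the joint P(v, b | s) = P(v|b,s) * P(b|s)."""
    return 2.0 * p_vb_est > p_v_est
```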
The first N elements of the table, together with the learned estimate of P(b | s), are used in (5) for background and foreground classification.

C. Feature Selection

The next essential issue for principal feature representation is feature selection. The significant features of different background objects are different. To achieve an effective and accurate representation of background pixels with principal features, employing the proper types of features is important. Three types of features, the spectral, spatial, and temporal features, are used for background modeling.

1) Features for Static Background Pixels: For a pixel belonging to a stationary background object, the stable and most significant features are its color and local structure (gradient). Hence, two tables, T_c^t(s) and T_e^t(s), are used to learn the principal features, with c and e representing the color and gradient vectors, respectively. Since the gradient is less sensitive to illumination changes, the two types of feature vectors can be integrated under the Bayes framework as follows. Let v_t = [c_t, e_t] and assume that c_t and e_t are independent; the Bayes decision rule (5) then becomes

2 P(c_t | b, s) P(e_t | b, s) P(b | s) > P(c_t | s) P(e_t | s).    (9)

For the features from static background pixels, the quantization measure should be less sensitive to illumination changes. Here, a normalized distance measure based on the inner product of two vectors is employed for both color and gradient vectors. The distance measure is

d(v_1, v_2) = 1 - (v_1^T v_2)^2 / (||v_1||^2 ||v_2||^2)    (10)

where v can be the color or the gradient vector, respectively. If d(v_1, v_2) is less than a small value delta, v_1 and v_2 are matched to each other. The robustness of the distance measure (10) to illumination changes and imaging noise is shown in [2]. The color vector is obtained directly from the input images with 256 resolution levels for each component, while the gradient vector is obtained by applying the Sobel operator to the corresponding gray-scale input images with 256 resolution levels. With a proper matching threshold and table size, this is found accurate enough to learn the principal features for static background pixels. An example of principal feature representation for a static background pixel is shown in Fig. 1, where the histograms for the most significant color and gradient features in T_c and T_e are displayed. The histogram of the color features shows that only the first two are the principal colors for the background, and the histogram of the gradients shows that the first six, excluding the fourth, are the principal gradients for the background.

2) Features for Dynamic Background Pixels: For dynamic background pixels associated with nonstationary objects, color co-occurrences are used as their dynamic features, because the color co-occurrence between consecutive frames has been found suitable for describing the dynamic features associated with nonstationary background objects, such as moving tree branches or a flickering screen [12]. Given an interframe change at pixel s from the color at time t-1 to the color at time t, the feature vector of color co-occurrence is defined by the pair of quantized colors before and after the change. Similarly, a table of statistics for color co-occurrences, T_cc^t(s), is maintained at each pixel. Let I_t be the input color image; the color co-occurrence vector is generated by quantizing the color components to a low resolution. For example, by quantizing the color resolution to 32 levels for each component and selecting a modest table size, one may obtain a good principal feature representation for dynamic background pixels. An example of the principal feature representation with color co-occurrences for a flickering screen is shown in Fig. 2. Compared with the quantized color co-occurrence feature space of 32^6 (on the order of 10^9) cells, this implies that, with a very small number of feature vectors, the principal features are capable of modeling the dynamic background pixels.
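A sketch of the matching test built on the distance (10), assuming the reconstruction of the formula from [2] given above is correct; delta is the small matching threshold mentioned in the text (its value is not specified here):

```python
import numpy as np

def feature_distance(v1, v2, eps=1e-12):
    """Normalized inner-product distance of (10):
    d = 1 - (v1 . v2)^2 / (|v1|^2 |v2|^2). Small d means similar direction,
    which makes the measure tolerant to illumination-induced scaling."""
    num = float(np.dot(v1, v2)) ** 2
    den = float(np.dot(v1, v1)) * float(np.dot(v2, v2)) + eps
    return 1.0 - num / den

def is_match(v1, v2, delta):
    return feature_distance(v1, v2) < delta
```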
Fig. 2. One example of learned principal features for dynamic background pixels. The left image shows the position of the selected pixel. The right image is the histogram of the statistics for the most significant color co-occurrences in T_cc, where the height of a bar is the value of p_{v,i}^t, the light gray part is p_{b,v,i}^t, and the top dark gray part is the remainder. The icons below the histogram are the corresponding color co-occurrence features. In the screen, the color changes among white, dark blue, and light blue periodically.

III. LEARNING AND UPDATING THE STATISTICS FOR PRINCIPAL FEATURES

Since the background might undergo both gradual and "once-off" changes, two strategies to learn and update the statistics for principal features are proposed. The convergence of the learning process is analyzed and a formula to select a proper learning rate is derived.

A. For Gradual Background Changes

At each time instant, if the pixel is identified as a static point, the features of color and gradient are used for foreground and background classification; otherwise, the feature of color co-occurrence is used. Let us assume that the feature vector v_t is used to classify the pixel s at time t based on the principal features learned previously. Then the statistics of the corresponding feature vectors in the table (T_c^t and T_e^t, or T_cc^t) are gradually updated at each time instant by

p_b^{t+1}(s) = (1 - alpha1) p_b^t(s) + alpha1 M^t
p_{v,i}^{t+1}(s) = (1 - alpha1) p_{v,i}^t(s) + alpha1 M_i^t
p_{b,v,i}^{t+1}(s) = (1 - alpha1) p_{b,v,i}^t(s) + alpha1 M^t M_i^t    (11)

where the learning rate alpha1 is a small positive number. In (11), M^t = 1 means that s is classified as a background point at time t in the final segmentation, and M^t = 0 otherwise. Similarly, M_i^t = 1 means that the ith vector of the table matches the input feature vector v_t, and M_i^t = 0 otherwise. The above updating operation states the following. If the pixel s is labeled as a background point at time t, p_b is slightly increased due to M^t = 1. Further, the probabilities for the matched feature vector are also increased due to M_i^t = 1, whereas the statistics for the unmatched feature vectors are slightly decreased. If there is no match between the feature vector v_t and the vectors in the table, the N_v-th (least significant) vector in the table is replaced by a new feature vector

S_{v,N_v}^{t+1}(s) = {p_{v,N_v}^{t+1} = alpha1, p_{b,v,N_v}^{t+1} = M^t alpha1, v_{N_v}^{t+1} = v_t}.    (12)

If the pixel is labeled as a foreground point at time t, p_b and p_{b,v,i} are slightly decreased with M^t = 0; however, the p_{v,i} of the matched vector in the table is slightly increased. The updated elements in the table are resorted in descending order with respect to p_{v,i}^t(s), so that the table keeps the most frequent and significant feature vectors observed at pixel s.
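The gradual update of (11)-(12) can be sketched as below, reusing FeatureTable and feature_distance from the earlier sketches. The exact bookkeeping (in particular the initial values of a replaced entry) follows the reconstruction above and should be read as an assumption, not as the authors' code.

```python
import numpy as np

def update_stats(table, p_b, v_t, is_background, alpha, delta):
    """One gradual learning step in the spirit of (11)-(12).
    is_background is M^t from the final segmentation at time t."""
    m_t = 1.0 if is_background else 0.0
    p_b = (1 - alpha) * p_b + alpha * m_t            # update of P(b | s)
    dists = [feature_distance(v_t, v) for v in table.vecs]
    i_best = min(range(len(dists)), key=dists.__getitem__)
    matched = dists[i_best] < delta
    for i in range(len(table.p_v)):                  # per-entry update (11)
        m_i = 1.0 if (matched and i == i_best) else 0.0
        table.p_v[i] = (1 - alpha) * table.p_v[i] + alpha * m_i
        table.p_bv[i] = (1 - alpha) * table.p_bv[i] + alpha * m_t * m_i
    if not matched:                                  # replacement step (12)
        table.p_v[-1], table.p_bv[-1] = alpha, m_t * alpha
        table.vecs[-1] = v_t
    order = np.argsort(-table.p_v)                   # resort by descending p_v
    table.p_v, table.p_bv = table.p_v[order], table.p_bv[order]
    table.vecs = table.vecs[order]
    return p_b
```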
B. For "Once-Off" Background Changes

The probabilities above are learned gradually with the operations described by (11) and (12) at each pixel s. When a "once-off" background change happens, the new background appearance soon becomes dominant after the change. With the replacement operation (12), the gradual accumulation operation (11), and the resorting at each time step, the newly learned features are gradually moved to the first few positions of T_v^t(s). According to (4), the statistics of the principal features satisfy the relation (13). After some time, the term on the left-hand side of (13) becomes large (close to 1) and the first term on its right-hand side becomes very small, since the new background features are still classified as foreground. From (6) and (13), the new background appearance at s can be found once the condition (14) holds. In (14), b denotes the previous background before the "once-off" change and f denotes the new background appearance after it; a safety factor in (14) prevents errors caused by a small number of genuine foreground features. Using the notation in (7) and (8), the condition (14) becomes (15). Once this condition is satisfied, the statistics learned for the foreground should be converted into those of the new background appearance. According to (4), the "once-off" learning operation is performed as

p_b^{t+1}(s) = 1 - p_b^t(s),  p_{b,v,i}^{t+1}(s) = p_{v,i}^t(s) - p_{b,v,i}^t(s)    (16)

for i = 1, ..., N_v.

C. Convergence of the Learning Process

If the time-evolving principal feature representation has successfully approximated the background, then sigma^t := sum_{i=1}^{N} P^t(v_i | b, s) should be close to 1. Hence, it is desirable that sigma^t converge to 1 with the evolution of the learning process. We show in the following that the learning operation (11) indeed meets this condition. Suppose sigma^t = 1 at time t, and the ith vector in the table matches the input feature vector v_t, which has been detected as background in the final segmentation at time t. Then, according to (11), we have

sigma^{t+1} = (1 - alpha1) sigma^t + alpha1 = 1    (17)

which implies that the sum of the conditional probabilities of the principal features being background remains equal or close to 1 during the evolution of the learning process. Now suppose sigma^t < 1 at time t, due to, e.g., the disturbance from foreground objects or the operation of "once-off" learning. If the matched vector comes from the first N vectors of the table, then

sigma^{t+1} = (1 - alpha1) sigma^t + alpha1 > sigma^t    (18)

and the sum of the conditional probabilities of the principal features being background increases slightly. On the other hand, if the matched vector falls outside the first N vectors, there will be

sigma^{t+1} = (1 - alpha1) sigma^t < sigma^t    (19)

and the sum decreases slightly. From these two cases, it can be concluded that the sum of the conditional probabilities of the principal features being background converges to 1 as long as the background features are observed frequently.
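The convergence behavior of (17)-(19) is easy to check numerically; the following lines simulate the recursion on sigma for the matched-within-first-N case (a toy check, with an arbitrary start value and an assumed learning rate):

```python
alpha, sigma = 0.005, 0.2      # assumed learning rate and a low start value
for t in range(2000):          # background features observed at every step
    sigma = (1 - alpha) * sigma + alpha   # case (18): match within first N
print(round(sigma, 4))         # -> 1.0, as the convergence analysis predicts
```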
D. Selection of the Learning Rate

In general, for an IIR filtering-based learning process, there is a tradeoff in the selection of the learning rate alpha1. For the learning process to adapt smoothly to gradual background changes without being perturbed by noise and foreground objects, a small value should be selected for alpha1. On the other hand, if alpha1 is too small, the system becomes too slow in responding to "once-off" background changes. Previous methods select the rate empirically [4], [5], [8], [14]. Here, a formula is derived to select alpha1 according to the required time for the system to respond to "once-off" background changes. An ideal "once-off" background change at time t0 can be assumed to be a step function. Suppose the features before the change fall into the first N elements of the table and the features after the change fall into the next N elements. Then, the statistics at time t0 can be described as in (20). Since the new background appearance at pixel s after t0 is classified as foreground until the "once-off" updating with (16) is performed, p_b and the statistics of the old features decrease exponentially, whereas the statistics of the new features increase exponentially and are shifted to the first positions of the updated table by the resorting at each time step. Once the condition (15) is met at some time t0 + T, the new background state is learned. To simplify the expressions, let us assume that there is no resorting operation; then the condition (15) becomes (21). From (11) and (20), it follows that at time t0 + T the conditions (22) hold, and if the pixel is detected as a background point, this leads to (23) and (24). By substituting (22)–(24) into (21) and rearranging terms, one can obtain (25), where T is the number of frames required to learn the new background appearance. Equation (25) implies that if one wishes the system to learn the new background state in no more than T frames, one should choose alpha1 such that (25) is satisfied. For example, if the system is to respond to a "once-off" background change within 20 s at a frame rate of 20 fps, then T = 400 frames, and alpha1 should be chosen accordingly from (25).

IV. FOREGROUND OBJECT DETECTION: THE ALGORITHM

With the Bayesian formulation of background and foreground classification, as well as the background representation with principal features, an algorithm for foreground object detection in complex environments is developed. It consists of four parts: change detection, change classification, foreground object segmentation, and background maintenance. The block diagram of the algorithm is shown in Fig. 3; the white blocks from left to right correspond to the first three steps, and the blocks with gray shading correspond to background maintenance.

Fig. 3. Block diagram of the proposed method.

In the first step, unchanged background pixels in the current frame are filtered out by simple background and temporal differencing, and the detected changes are separated into static and dynamic points according to the interframe changes. In the second step, the detected static and dynamic change points are further classified as background or foreground using the Bayes rule and the statistics of the principal features of the background: static points are classified based on the statistics of principal colors and gradients, whereas dynamic points are classified based on those of principal color co-occurrences. In the third step, foreground objects are segmented by combining the classification results from both static and dynamic points. In the fourth step, the background models are updated, including the statistics of principal features for the background as well as a reference background image. Brief descriptions of the steps are presented in the following.

A. Change Detection

In this step, simple adaptive image differencing is used to filter out non-change background pixels, so that the minor color variations caused by imaging noise are removed and computation is saved in further processing. Let I_t be the input image and B_t be the reference background image maintained at time t, with c denoting a color component. The background difference is obtained as follows. First, image differencing and thresholding are performed for each color component, where the threshold is automatically generated using the least median of squares (LMedS) method [31]. The background difference F_bd is then obtained by fusing the results from the three color components. Similarly, the temporal (or interframe) difference F_td between two consecutive frames I_{t-1} and I_t is obtained. If both F_bd and F_td indicate no change, the pixel is classified as a non-change background point. In general, more than 50% of the pixels are filtered out in this step.
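A rough sketch of this step-1 filtering, under stated assumptions: the per-channel thresholds thr_bg and thr_fd stand in for the LMedS-derived thresholds of [31] (their computation is not reproduced here), and the fusion over the three color components is done with a simple OR.

```python
import numpy as np

def detect_changes(frame, bg_ref, prev_frame, thr_bg, thr_fd):
    """Returns (changed, dynamic) boolean maps. A pixel is a non-change
    background point only if both the background difference and the
    temporal difference are zero; a temporal change marks a dynamic point."""
    f_bd = (np.abs(frame.astype(int) - bg_ref.astype(int)) > thr_bg).any(axis=-1)
    f_td = (np.abs(frame.astype(int) - prev_frame.astype(int)) > thr_fd).any(axis=-1)
    changed = f_bd | f_td
    dynamic = f_td
    return changed, dynamic
```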
B. Change Classification

If an interframe change F_td is detected at a pixel s, the pixel is classified as a dynamic point; otherwise, it is classified as a static point. A change at a static point could be caused by illumination changes, "once-off" background changes, or a temporarily motionless foreground object; a change detected at a dynamic point could be caused by a moving background or foreground object. These points are further classified as background or foreground by using the Bayes decision rule and the statistics of the corresponding principal features. Let v_t be the input feature vector at s and time t. The probabilities in (5) are estimated as

P^(v_t | s) = sum_{v_i in V} p_{v,i}^t(s),  P^(v_t | b, s) P^(b | s) = sum_{v_i in V} p_{b,v,i}^t(s)    (26)

where V is the feature vector set composed of those among the first N entries of the table which match the input vector v_t, i.e.,

V = {v_i : d(v_t, v_i) < delta, 1 <= i <= N}.    (27)

If no principal feature vector in the table matches v_t, both estimates are set to 0. Then the change point is classified as background or foreground as follows.

Classification of a Static Point: For a static point, the probabilities for both the color and gradient features are estimated by (26) from T_c^t(s) and T_e^t(s), respectively, where the vector distance in (27) is calculated as in (10). In this work, the statistics of the two types of principal features (colors and gradients) are learned separately. In general cases, the two types of features are coincident and the Bayes decision rule (9) can be applied for background and foreground classification. In some complex cases, however, one type of feature from the background might be unstable. One example is the temporarily static state of a wavering water surface, for which the gradient features are not constant. Another example is video captured with an auto-gain camera: the gain is often self-tuned due to the motion of objects, and the gradient features are then more stable than the color features for static background pixels. To work stably in various conditions, the following method is adopted: if the learned priors of the color and gradient features are close to each other (within a small margin in our tests), the two types of features are considered coincident and both are used for classification with the Bayes rule (9); otherwise, only the feature type with the larger prior value is used for classification with the Bayes rule (5).

Classification of a Dynamic Point: For a dynamic point at time t, the feature vector of color co-occurrence is generated. The probabilities for it are calculated as in (26), where the distance between two feature vectors in (27) is computed as in (28) with a suitably chosen threshold. Finally, the Bayes rule (5) is applied for background and foreground classification. As observed in our experiments, only a small percentage of the dynamic background points are wrongly classified as foreground changes [12]; moreover, the remainders are isolated points, which can easily be removed by a smoothing operation.

C. Foreground Object Segmentation

Post-processing is applied to segment the remaining change points into foreground regions. This is done by first applying a morphological operation (a pair of open and close operations) to suppress the residual errors. Then the foreground regions are extracted, holes are filled, and small regions are removed. Further, an AND operation is applied to the resulting segments in consecutive frames to remove false foreground regions detected by temporal differencing [32].

D. Background Maintenance

With the feedback from the above segmentation, the background models are updated. First, the statistics of principal features are updated as described in Section III: for static points the tables T_c^t(s) and T_e^t(s) are updated, and for dynamic points the table T_cc^t(s) is updated. Meanwhile, a reference background image is maintained to keep the background difference accurate. Let s be a background point in the final segmentation result at time t. If it is identified as an unchanged background point in the change detection step, the background reference image at s is smoothly updated by

B_{t+1}(s) = (1 - beta) B_t(s) + beta I_t(s)    (29)

where beta is a small positive number. If s is instead classified as background in the change classification step, the background reference image at s is replaced by the new background appearance

B_{t+1}(s) = I_t(s).    (30)

With (30), the reference background image can follow the dynamic background changes, e.g., the changes of color between tree branches and sky, as well as "once-off" background changes.
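A minimal sketch of the reference-image maintenance (29)-(30); beta is the small smoothing rate (its value here is assumed), and the masks are taken from the earlier steps:

```python
import numpy as np

def update_reference(bg_ref, frame, label_bg, changed, beta=0.05):
    """IIR smoothing (29) for unchanged background points and direct
    replacement (30) for changed points classified as background."""
    out = bg_ref.astype(float).copy()
    unchanged = label_bg & ~changed
    replaced = label_bg & changed
    out[unchanged] = (1 - beta) * out[unchanged] + beta * frame[unchanged]
    out[replaced] = frame[replaced]
    return out
```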
Fig. 4. Summary of the complete algorithm.

E. Memory Requirement and Computational Time

The complete algorithm is summarized in Fig. 4. The major part of the memory usage is to store the tables of statistics (T_c, T_e, and T_cc) for each pixel. In our implementation, the memory requirement for each pixel is approximately 1.78 KB. For a video with images sized 160 x 120 pixels, the required memory is approximately 33.4 MB, while for images sized 320 x 240 pixels, 133.5 MB of memory is required. For a standard PC, this is still feasible. With a 1.7-GHz Pentium CPU PC, real-time processing of image sequences is achievable at a rate of about 15 frames per second (fps) for images sized 160 x 120 pixels and about 3 fps for images sized 320 x 240 pixels.
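The stated memory figures can be cross-checked with a two-line computation (using 1 MB = 1024 KB):

```python
per_pixel_kb = 1.78
print(160 * 120 * per_pixel_kb / 1024)   # ~33.4 MB for 160 x 120 images
print(320 * 240 * per_pixel_kb / 1024)   # ~133.5 MB for 320 x 240 images
```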
V. EXPERIMENTAL RESULTS

The proposed method has been tested on a variety of indoor and outdoor environments, including offices, campuses, parking lots, shopping malls, restaurants, airports, subway stations, sidewalks, and other private and public sites. It has also been tested on image sequences captured in various weather conditions, including sunny, cloudy, and rainy weather, as well as night and crowded scenes. In all the tests, the proposed method was automatically initialized (bootstrapped) from a "blinking background," i.e., with the statistics tables starting empty. The system gradually learned the most significant features of both stationary and nonstationary background objects. Once the "once-off" updating had been performed, the system was able to separate the foreground from the background well.

MoG [4] is a widely used adaptive background subtraction method; among existing methods, it performs quite well for both stationary and nonstationary backgrounds [6]. The proposed method has therefore been compared with MoG in the experiments. The same learning rate was used for both the proposed method and MoG in each test.¹ Further, for a fair comparison, the post-processing used in the proposed method was applied to the MoG method as well.

¹A similar analysis of the learning process and dynamic performance can be made for MoG, as in Sections III-C and III-D.

The visual examples and quantitative evaluations of the experiments are described in the following two subsections.

A. Examples on Various Environments

Selected results on five typical indoor and outdoor environments are displayed in this section: offices, campuses, shopping malls, subway stations, and sidewalks. In the figures of this subsection, pictures are arranged in rows. In each row, the images from left to right are the input frame, the background reference image maintained by the proposed method at that moment, the manually generated "ground truth," and the results of the proposed method and MoG.

Fig. 5. Experimental results on a meeting room environment (MR) with wavering curtains in the wind. The two examples are the results for frames 1816 and 2268.

Fig. 6. Experimental results on a lobby environment (LB) in an office building with lights switched on/off. Upper row: a frame before some lights are switched off (364). Lower row: a frame 15 s after some lights are switched off (648).

1) Office Environments: Office environments include offices, laboratories, meeting rooms, corridors, lobbies, and entrances. An office environment is usually composed of stationary background objects. The difficulties for foreground detection in these scenes can be caused by shadows, changes of illumination conditions, and camouflaged foreground objects (i.e., foreground objects whose color is similar to that of the background they cover). In some cases, the background may contain dynamic objects, such as waving curtains, running fans, and flickering screens. Examples from two test sequences are shown in Figs. 5 and 6.

Fig. 7. Experimental results on a campus environment (CAM) containing wavering tree branches in strong winds, for frames 1019, 1337, and 1393.

The first sequence (MR) was captured by an auto-gain camera in a meeting room, where the background curtain was moving in the wind. The first example (upper row of Fig. 5) came from a scenario containing significant motion of the curtain, as well as background changes caused by automatic gain adjustment. In the second example, the person wore brightly colored clothes similar in color to the curtain. In both cases, the proposed method separated the background and foreground satisfactorily.

The second sequence (LB) was captured in the lobby of an office building, where background changes were mainly caused by switching lights on and off. Two examples from this sequence are shown in Fig. 6. The first example shows a scene before some lights are switched off; a significant shadow of the person can be observed. The result of the proposed method is rather satisfactory apart from a small included shadow. The second example shows a scene about 220 frames (about 15 s) after some lights have been switched off. Even though the background reference image had not yet recovered completely, the proposed method detected the person successfully.

2) Campus Environments: The second type of environment is campuses or parks. Changes in the background are often caused by the motion of tree branches and their shadows on the ground surface, or by changes in the weather. The three examples displayed in Fig. 7 are from a sequence (CAM) captured on a campus containing moving tree branches. The large motion of the tree branches was caused by strong winds, as can be observed from the waving yellow flag at the left of the images. The moving tree branches also changed the tree shadows. The three example frames contain vehicles of different colors. The results show that the proposed method detected the vehicles quite well in such an environment.

3) Shopping Malls: The third type of typical environment includes shopping centers, hotels, museums, airports, and restaurants. In these environments, the lighting is distributed from the ceilings and some ground surfaces produce specular highlights. In such cases, if multiple persons move in the scene, the shadows on the ground surface vary significantly across the image sequence. In these environments, the shadows can be classified into umbra and penumbra [33]: the umbra corresponds to the background area where the direct light is almost totally blocked by the foreground object, whereas in the penumbra area the lighting is only partially blocked. Three examples from such environments are shown in Fig. 8.
Fig. 8. Experimental results on shopping mall environments which contain specular ground surfaces. The three examples came from a busy shopping center (SC), an airport (AP), and a buffet restaurant (BR), respectively.

They were from a busy shopping center (SC), an airport (AP), and a buffet restaurant (BR) [6]. Significant shadows of moving persons, cast on the ground surfaces from different directions, can be observed. As one can see, the proposed method obtained satisfactory results in these three environments, apart from small parts of the shadows being detected. The recognized shadows can also be observed in the maintained background reference images. This can be explained as follows: a) the feature distance measure (10), which is robust to illumination changes, plays a major role in suppressing the penumbra areas; and b) the learned color co-occurrences of the changes from the normal background appearance to umbra and vice versa identify many background pixels in the umbra areas. Hence, without special models for shadows, the proposed method suppressed much of the various shadows in these environments.

4) Subway Stations: Subway stations are other public sites that often require monitoring. In these situations, the motion of background objects (e.g., trains and escalators) makes background modeling difficult. Furthermore, the background model is hard to establish if there are frequent human crowds in the scene. Fig. 9 shows two examples from a sequence of a subway station (SS) recorded on tape by a CCTV surveillance system. The scene contains three moving escalators and frequent human flows at the right side of the images. In addition, there are significant background changes caused by variations in lighting conditions due to the many glass and stainless steel materials inside the building. Another difficulty with this sequence is the noise introduced by the old video recording device. The busy flow of human crowds can be observed in the first example in the figure. Our test results show that the proposed method performed quite satisfactorily in such difficult scenarios.

Fig. 9. Experimental results on a subway station environment (SS). The examples are frames 1993 and 2634.

5) Sidewalks: Pedestrians are often the targets of interest in many video surveillance systems. In such cases, a surveillance system may monitor the scene from day to night under a range of weather conditions. Tests were performed on such an environment around the clock. The image sequences (SW) were obtained from highly compressed MPEG-4 videos transmitted through a local wireless network, and there were large variations of the background in the images. Five examples and test results are shown in Fig. 10, corresponding to sunny, cloudy, and rainy weather conditions, as well as night and crowded scenes. The interval between the first two frames was less than 10 s. Comparing the results with the "ground truths," one finds that the proposed method performed very robustly in this complex environment.

From the comparisons with MoG in the examples shown in Figs. 5–10, one can see that the proposed method outperformed the MoG method in these selected difficult situations. The parameters used for these tests are listed in Tables II and III. The parameters in Table II were applied to all tests.
The learning rates in the first row of Table III were applied to all tests except for three shorter sequences, where the larger rates in the second row of the table were applied; for short image sequences, a slightly faster learning rate speeds up the initial learning. Since the decision rule (5) for the classification of background and foreground does not depend directly on any threshold, the performance of the proposed method is not very sensitive to these parameters.

Fig. 10. Experimental results of pedestrian detection in a sidewalk environment (SW) around the clock. From top to bottom are frames from sunny, cloudy, rainy, night, and crowded scenes.

TABLE II: PARAMETERS USED FOR ALL TEST EXAMPLES
TABLE III: LEARNING RATES USED IN THE TEST EXAMPLES

B. Quantitative Evaluations

To obtain a systematic evaluation, the performance of the proposed method was also evaluated quantitatively on randomly selected samples from ten sequences. In previous work [6], the results were evaluated quantitatively by comparison with the "ground truths" in terms of 1) the false negative error, the number of foreground pixels that are missed, and 2) the false positive error, the number of background pixels that are misdetected as foreground. However, it was found that when these measures are averaged over various environments, they are not accurate enough. In this paper, a new similarity measure is introduced to evaluate the results of foreground segmentation. Let A be a detected region and B be the corresponding "ground truth"; the similarity measure between the regions A and B is defined as

S(A, B) = |A ∩ B| / |A ∪ B|.    (31)

Using this measure, S(A, B) approaches a maximum value of 1.0 if A and B are the same; otherwise, S(A, B) varies between 1 and 0 according to their similarity, approaching 0 with the least similarity. It integrates the false positive and false negative errors in one measure. One drawback of (31) is that it is a nonlinear measure. To give a visual impression of the magnitudes of the similarity measure, some matching images and their similarity values are displayed in Fig. 11.

For systematic evaluation and comparison, the similarity measure (31) was applied to the experimental results of the proposed method and the MoG method. A total of ten image sequences were used, including those in Figs. 5–10 as well as two others [water surface (WS) and fountain (FT)]. We randomly selected 20 frames from each sequence, leading to a total of 200 sample frames for evaluation. The "ground truths" of these 200 frames were generated manually by four invited persons. All ten test sequences, the results, and the "ground truths" of the sample frames are available.²

²http://perception.i2r.a-star.edu.sg/bk_model/bk_index.html

Fig. 11. Some examples of matching images with different similarity measure values. In the images, the bright color indicates the intersection of the detected regions and the "ground truths," the dark gray color indicates the false negatives, and the light gray color indicates the false positives.

TABLE IV: QUANTITATIVE EVALUATION AND COMPARISON RESULTS: VALUES OF THE SIMILARITY MEASURE FROM THE TEST SEQUENCES

The average values of the similarity measure for each individual sequence and over all ten sequences are shown in Table IV. The corresponding values obtained with the MoG method are also included.
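The measure (31) is the intersection-over-union of the two regions; for binary masks it can be computed as follows (a straightforward sketch):

```python
import numpy as np

def similarity(detected, truth):
    """S(A, B) = |A n B| / |A u B| for boolean masks, as in (31):
    1.0 for identical regions, approaching 0 with the least similarity."""
    union = np.logical_or(detected, truth).sum()
    if union == 0:
        return 1.0  # both regions empty: treat as a perfect match
    return np.logical_and(detected, truth).sum() / union
```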
The ten test sequences were chosen from among the difficult sequences; besides the various background changes described in the previous subsection, they contain global background changes as well as persons staying motionless for quite a while. Taking these situations into account, the evaluation values obtained for both methods are quite good. Comparing the results in Table IV with those in Fig. 11, the performance of the proposed method is rather satisfactory. The comparison shows that the proposed method provides improved results over the MoG method, especially for image sequences with complex backgrounds.

C. Limitations of the Method

Since the statistics are related to each individual pixel without considering its neighborhood, the method can wrongly absorb a foreground object into the background if the object remains motionless for a long time, e.g., if a moving person or car suddenly stops and remains still for a long duration. Further improvement could be made, e.g., by combining information from high-level object recognition and tracking into the background updating [34], [35]. Another potential problem is that the method can wrongly learn the features of foreground objects as background if crowded foreground objects (e.g., crowds) are constantly present in the scene. Adjusting the learning rate based on feedback from optical flow could provide a possible solution [36]. A method of controlling the learning processes using multilevel feedback is being investigated to further improve the results.

VI. CONCLUSION

For detecting foreground objects in complex environments, this paper has proposed a novel statistical method for background modeling, in which the background appearance is characterized by principal features and their statistics. Foreground objects are detected through foreground and background classification under a Bayesian framework. Our test results have shown that the principal features are effective in representing the spectral, spatial, and temporal characteristics of the background. A learning method to adapt to time-varying background features has been proposed and analyzed. Experiments have been conducted on a variety of environments, including offices, public buildings, subway stations, campuses, parking lots, airports, and sidewalks. The experimental results have shown the effectiveness of the proposed method. Quantitative evaluation and comparison with an existing method have shown improved performance for foreground object detection against complex backgrounds. Some limitations of the method have been discussed, with suggestions for possible improvements.

ACKNOWLEDGMENT

The authors would like to thank R. Luo, J. Shang, X. Huang, and W. Liu for their work in generating the "ground truths" for evaluation.

REFERENCES

[1] D. Gavrila, "The visual analysis of human movement: A survey," Comput. Vis. Image Understanding, vol. 73, no. 1, pp. 82–98, 1999.
[2] L. Li and M. Leung, "Integrating intensity and texture differences for robust change detection," IEEE Trans. Image Processing, vol. 11, pp. 105–112, Feb. 2002.
[3] E. Durucan and T. Ebrahimi, "Change detection and background extraction by linear algebra," Proc. IEEE, vol. 89, pp. 1368–1381, Oct. 2001.
[4] C. Stauffer and W. Grimson, "Learning patterns of activity using real-time tracking," IEEE Trans. Pattern Anal. Machine Intell., vol. 22, pp. 747–757, Aug. 2000.
[5] I. Haritaoglu, D. Harwood, and L. Davis, "W4: Real-time surveillance of people and their activities," IEEE Trans. Pattern Anal. Machine Intell., vol. 22, pp. 809–830, Aug. 2000.
[6] K. Toyama, J. Krumm, B. Brumitt, and B. Meyers, "Wallflower: Principles and practice of background maintenance," in Proc. IEEE Int. Conf. Computer Vision, Sept. 1999, pp. 255–261.
[7] K. Karmann and A. von Brandt, "Moving object recognition using an adaptive background memory," in Time-Varying Image Processing and Moving Object Recognition, vol. 2, 1990, pp. 289–296.
[8] C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland, "Pfinder: Real-time tracking of the human body," IEEE Trans. Pattern Anal. Machine Intell., vol. 19, pp. 780–785, July 1997.
[9] A. Elgammal, D. Harwood, and L. Davis, "Non-parametric model for background subtraction," in Proc. Eur. Conf. Computer Vision, 2000.
[10] N. Paragios and V. Ramesh, "A MRF-based approach for real-time subway monitoring," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, Dec. 2001, pp. I-1034–I-1040.
[11] O. Javed, K. Shafique, and M. Shah, "A hierarchical approach to robust background subtraction using color and gradient information," in Proc. IEEE Workshop Motion and Video Computing, Dec. 2002, pp. 22–27.
[12] L. Li, W. M. Huang, I. Y. H. Gu, and Q. Tian, "Foreground object detection in changing background based on color co-occurrence statistics," in Proc. IEEE Workshop Applications of Computer Vision, Dec. 2002, pp. 269–274.
[13] L. Wixson, "Detecting salient motion by accumulating directionally consistent flow," IEEE Trans. Pattern Anal. Machine Intell., vol. 22, pp. 774–780, Aug. 2000.
[14] N. J. B. McFarlane and C. P. Schofield, "Segmentation and tracking of piglets in images," Mach. Vis. Applicat., vol. 8, pp. 187–193, 1995.
[15] D. Koller, J. Weber, T. Huang, J. Malik, G. Ogasawara, B. Rao, and S. Russell, "Toward robust automatic traffic scene analysis in real-time," in Proc. Int. Conf. Pattern Recognition, 1994, pp. 126–131.
[16] A. Bobick, J. Davis, S. Intille, F. Baird, L. Campbell, Y. Ivanov, C. Pinhanez, and A. Wilson, "KidsRoom: Action recognition in an interactive story environment," Mass. Inst. Technol., Cambridge, Perceptual Computing Tech. Rep. 398, 1996.
[17] J. Rehg, M. Loughlin, and K. Waters, "Vision for a smart kiosk," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 1997, pp. 690–696.
[18] T. Olson and F. Brill, "Moving object detection and event recognition algorithms for smart cameras," in Proc. DARPA Image Understanding Workshop, 1997, pp. 159–175.
[19] T. Boult, "Frame-rate multi-body tracking for surveillance," in Proc. DARPA Image Understanding Workshop, 1998.
[20] T. Darrell, G. Gordon, M. Harville, and J. Woodfill, "Integrated person tracking using stereo, color, and pattern detection," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 1998, pp. 601–608.
[21] A. Shafer, J. Krumm, B. Brumitt, B. Meyers, M. Czerwinski, and D. Robbins, "The new EasyLiving project at Microsoft," in Proc. DARPA/NIST Smart Spaces Workshop, 1998.
[22] C. Eveland, K. Konolige, and R. C. Bolles, "Background modeling for segmentation of video-rate stereo sequences," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 1998, pp. 266–271.
[23] N. Friedman and S. Russell, "Image segmentation in video sequences: A probabilistic approach," in Proc. 13th Conf. Uncertainty in Artificial Intelligence, 1997.
[24] A. J. Lipton, H. Fujiyoshi, and R. S. Patil, "Moving target classification and tracking from real-time video," in Proc. IEEE Workshop Applications of Computer Vision, Oct. 1998, pp. 8–14.
[25] M. Harville, G. Gordon, and J. Woodfill, "Foreground segmentation using adaptive mixture models in color and depth," in Proc. IEEE Workshop Detection and Recognition of Events in Video, July 2001, pp. 3–11.
[26] X. Gao, T. Boult, F. Coetzee, and V. Ramesh, "Error analysis of background adaption," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 2000, pp. 503–510.
[27] K. Skifstad and R. Jain, "Illumination independent change detection from real world image sequences," Comput. Vis., Graph., Image Process., vol. 46, pp. 387–399, 1989.
[28] S. C. Liu, C. W. Fu, and S. Chang, "Statistical change detection with moments under time-varying illumination," IEEE Trans. Image Processing, vol. 7, pp. 1258–1268, Aug. 1998.
[29] N. Oliver, B. Rosario, and A. Pentland, "A Bayesian computer vision system for modeling human interactions," IEEE Trans. Pattern Anal. Machine Intell., vol. 22, pp. 831–843, Aug. 2000.
[30] A. Iketani, A. Nagai, Y. Kuno, and Y. Shirai, "Detecting persons on changing background," in Proc. Int. Conf. Pattern Recognition, vol. 1, 1998, pp. 74–76.
[31] P. Rosin, "Thresholding for change detection," in Proc. IEEE Int. Conf. Computer Vision, Jan. 1998, pp. 274–279.
[32] Q. Cai, A. Mitiche, and J. K. Aggarwal, "Tracking human motion in an indoor environment," in Proc. IEEE Int. Conf. Image Processing, Oct. 1995, pp. 215–218.
[33] C. Jiang and M. O. Ward, "Shadow identification," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 1992, pp. 606–612.
[34] L. Li, I. Y. H. Gu, M. K. H. Leung, and Q. Tian, "Knowledge-based fuzzy reasoning for maintenance of moderate-to-fast background changes in video surveillance," in Proc. 4th IASTED Int. Conf. Signal and Image Processing, 2002, pp. 436–440.
[35] M. Harville, "A framework for high-level feedback to adaptive, per-pixel, mixture-of-Gaussian background models," in Proc. Eur. Conf. Computer Vision, 2002, pp. 543–560.
[36] D. Gutchess, M. Trajkovic, E. Cohen-Solal, D. Lyons, and A. K. Jain, "A background model initialization algorithm for video surveillance," in Proc. IEEE Int. Conf. Computer Vision, vol. 1, July 2001, pp. 733–740.

Liyuan Li (M'96) received the B.E. and M.E. degrees from Southeast University, Nanjing, China, in 1985 and 1988, respectively, and the Ph.D. degree from Nanyang Technological University, Singapore, in 2001. From 1988 to 1999, he was on the faculty of Southeast University, where he was an Assistant Lecturer (1988 to 1990), Lecturer (1990 to 1994), and Associate Professor (1995 to 1999). Since 2001, he has been a Research Scientist at the Institute for Infocomm Research, Singapore. His current research interests include video surveillance, object tracking, and event and behavior understanding.

Weimin Huang (M'97) received the B.Eng. degree in automation and the M.Eng. and Ph.D. degrees in computer engineering from Tsinghua University, Beijing, China, in 1989, 1991, and 1996, respectively. He is a Research Scientist at the Institute for Infocomm Research, Singapore. He has worked on handwritten signature verification, biometric authentication, and audio/video event detection. His current research interests include image processing, computer vision, pattern recognition, human-computer interaction, and statistical learning.

Irene Yu-Hua Gu (M'94–SM'03) received the Ph.D. degree in electrical engineering from the Eindhoven University of Technology, Eindhoven, The Netherlands, in 1992.
She is an Associate Professor in the Department of Signals and Systems, Chalmers University of Technology, Göteborg, Sweden. She was a Research Fellow at Philips Research Institute IPO, The Netherlands, and at Staffordshire University, Staffordshire, U.K., and a Lecturer at the University of Birmingham, Birmingham, U.K., from 1992 to 1996. Since 1996, she has been with the Department of Signals and Systems, Chalmers University of Technology. Her current research interests include image processing, video surveillance and object tracking, video communications, and signal processing applications to electric power systems. Dr. Gu has served as an Associate Editor for the IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS since 2000, and she is currently the Chair-Elect of the IEEE Swedish Signal Processing Chapter.

Qi Tian (M'83–SM'90) received the B.S. and M.S. degrees in electrical and computer engineering from Tsinghua University, Beijing, China, in 1967 and 1981, respectively, and the Ph.D. degree in electrical and computer engineering from the University of South Carolina, Columbia, in 1984. He is a Principal Scientist in the Media Division, Institute for Infocomm Research, Singapore. His main research interests are image/video/audio analysis, indexing and retrieval, media content identification and security, computer vision, and pattern recognition. He joined the Institute of System Science, National University of Singapore, in 1992; since then, he has worked on robust character ID recognition and video indexing. He was the Program Director for the Media Engineering Program at Kent Ridge Digital Labs, later the Laboratories for Information Technology, from 2001 to 2002. Dr. Tian has served on the editorial boards of professional journals and as a chair and member of technical committees of the IEEE Pacific-Rim Conference on Multimedia (PCM), the IEEE International Conference on Multimedia and Expo (ICME), etc.