IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 13, NO. 11, NOVEMBER 2004

Statistical Modeling of Complex Backgrounds for Foreground Object Detection

Liyuan Li, Member, IEEE, Weimin Huang, Member, IEEE, Irene Yu-Hua Gu, Senior Member, IEEE, and Qi Tian, Senior Member, IEEE

Abstract—This paper addresses the problem of background modeling for foreground object detection in complex environments. A Bayesian framework that incorporates spectral, spatial, and temporal features to characterize the background appearance is proposed. Under this framework, the background is represented by the most significant and frequent features, i.e., the principal features, at each pixel. A Bayes decision rule is derived for background and foreground classification based on the statistics of principal features. Principal feature representation for both the static and dynamic background pixels is investigated. A novel learning method is proposed to adapt to both gradual and sudden "once-off" background changes. The convergence of the learning process is analyzed and a formula to select a proper learning rate is derived. Under the proposed framework, a novel algorithm for detecting foreground objects from complex environments is then established. It consists of change detection, change classification, foreground segmentation, and background maintenance. Experiments were conducted on image sequences containing targets of interest in a variety of environments, e.g., offices, public buildings, subway stations, campuses, parking lots, airports, and sidewalks. Good results of foreground detection were obtained. Quantitative evaluation and comparison with the existing method show that the proposed method provides much improved results.

Index Terms—Background maintenance, background modeling, background subtraction, Bayes decision theory, complex background, feature extraction, motion analysis, object detection, principal features, video surveillance.

Manuscript received June 19, 2003; revised January 29, 2004. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Luca Lucchese. L. Li, W. Huang, and Q. Tian are with the Institute for Infocomm Research, Singapore 119613 (e-mail: lyli@i2r.a-star.edu.sg; wmhuang@i2r.a-star.edu.sg; tian@i2r.a-star.edu.sg). I. Y.-H. Gu is with the Department of Signals and Systems, Chalmers University of Technology, SE-412 96 Göteborg, Sweden (e-mail: irenegu@s2.chalmers.se). Digital Object Identifier 10.1109/TIP.2004.836169

I. INTRODUCTION

IN COMPUTER vision applications, such as video surveillance, human motion analysis, human-machine interaction, and object-based video encoding (e.g., MPEG-4), the objects of interest are often the moving foreground objects in an image sequence. One effective way of extracting foreground objects is to suppress the background points in the image frames [1]–[6]. To achieve this, an accurate and adaptive background model is often desirable. The background usually contains nonliving objects that remain passive in the scene. Background objects can be stationary, such as walls, doors, and room furniture, or nonstationary, such as wavering bushes or moving escalators. The appearance of background objects also undergoes various changes over time, e.g., changes in brightness caused by changing weather conditions or by switching lights on and off. The background image can be described as consisting of static and dynamic pixels.
The static pixels belong to the stationary objects, and the dynamic pixels are associated with nonstationary objects. A static background pixel can become a dynamic one as time advances, e.g., when a computer screen is turned on, and a dynamic background pixel can turn into a static one, such as a pixel in a bush when the wind stops. To describe a general background scene, a background model must be able to 1) represent the appearance of a static background pixel; 2) represent the appearance of a dynamic background pixel; 3) self-evolve with gradual background changes; and 4) self-evolve with sudden "once-off" background changes.

For background modeling without specific domain knowledge, the background is usually represented by image features at each pixel. The features extracted from an image sequence can be classified into three types: spectral, spatial, and temporal. Spectral features are associated with gray-scale or color information, spatial features with gradients or local structure, and temporal features with interframe changes at the pixel. Many existing methods utilize spectral features (distributions of intensities or colors at each pixel) to model the background [4], [5], [7]–[9]. To be robust to illumination changes, some spatial features are also exploited [2], [10], [11]. Spectral and spatial features are suitable for describing the appearance of static background pixels. Recently, a few methods have introduced temporal features to describe the dynamic background pixels associated with nonstationary objects [6], [12], [13]. There is, however, a lack of systematic approaches that incorporate all three types of features into a representation of a complex background containing both stationary and nonstationary objects.

The features that characterize stationary and nonstationary background objects are different. For a background model to describe a general background, it should be able to learn the significant features of the background at each pixel and provide the information needed for foreground and background classification. Motivated by this, a Bayesian framework which incorporates multiple types of features for modeling complex backgrounds is proposed in this paper. The major novelties of the proposed method are as follows. 1) A Bayesian framework is proposed for incorporating spectral, spatial, and temporal features in background modeling. 2) A new formula of the Bayes decision rule is derived for background and foreground classification. 3) The background is represented using statistics of principal features associated with stationary and nonstationary background objects. 4) A novel method is proposed for learning and updating background features under both gradual and "once-off" background changes. 5) The convergence of the learning process is analyzed and a formula is derived to select a proper learning rate. 6) A new real-time algorithm is developed for foreground object detection in complex environments. Further, a wide range of tests is conducted on a variety of environments, including offices, campuses, parks, commercial buildings, hotels, subway stations, airports, and sidewalks.

The remaining part of the paper is organized as follows. After a brief literature review of existing work in Section I-A, Section II describes the statistical modeling of complex backgrounds based on principal features.
First, a new formula of the Bayes decision rule for background and foreground classification is derived. Based on this formula, an effective data structure to record the statistics of principal features is established, and principal feature representation for different background objects is addressed. In Section III, the method for learning and updating the statistics of principal features is described: strategies to adapt to both gradual and sudden "once-off" background changes are proposed, and the properties of the learning process are analyzed. In Section IV, an algorithm for foreground object detection based on the statistical background modeling is described. It contains four steps: change detection, change classification, foreground segmentation, and background maintenance. Section V presents the experimental results on various environments; evaluations and comparisons with an existing method are also included. Finally, conclusions are given in Section VI.

A. Related Work

A simple and direct way to describe the background at each pixel is to use the spectral information, i.e., the gray-scale or color of the background pixel. Early studies describe background features using an average of gray-scale or color intensities at each pixel, and infinite impulse response (IIR) or Kalman filters [7], [14], [15] are employed to track slow and gradual changes in the background. These methods are applicable to backgrounds consisting of stationary objects. To tolerate the background variations caused by imaging noise, illumination changes, and the motion of nonstationary objects, statistical models are used to represent the spectral features at each background pixel. Frequently used models include the Gaussian [8], [16]–[22] and the mixture of Gaussians (MoG) [4], [23]–[25]. In these models, one or a few Gaussians are used to represent the color distribution at each background pixel. A mixture of Gaussian distributions can represent various background appearances, e.g., road surfaces under the sun or in the shadows [23]. The parameters (mean, variance, and weight) of each Gaussian are updated with an IIR filter to adapt to gradual background changes. Moreover, by replacing an old Gaussian with a newly learned color distribution, MoG can adapt to "once-off" background changes. In [9], a nonparametric model is proposed for background modeling, where a kernel-based function is employed to represent the color distribution of each background pixel. The kernel-based distribution is a generalization of MoG which does not require parameter estimation, but its computational cost is high. A variant model is used in [5], where the distribution of temporal variations in color at each pixel is used to model the spectral feature of the background. MoG performs better in a time-varying environment where the background is not completely stationary, but it can misclassify the foreground if the background scene is complex [19], [26]. For example, if the background contains a nonstationary object with significant motion, the colors of pixels in that region may change widely over time, and foreground objects with similar colors (camouflaged foreground objects) can easily be misclassified as background.

Spatial information has recently been exploited to improve the accuracy of background representation.
The local statistics of spectral features [27], [28], local texture features [2], [3], or global structure information [29] have been found helpful for accurate foreground extraction. These methods are most suitable for stationary backgrounds. Paragios and Ramesh [10] use a mixture model (Gaussians or Laplacians) to represent the distributions of background differences for static background points, and develop a Markov random field (MRF) model to incorporate spatio-spectral coherence for robust foreground segmentation. In [11], gradient distributions are introduced into MoG to reduce the misclassification that results from depending purely on color distributions. Spatial information helps to detect camouflaged foreground objects and to suppress shadows. Spatial features are, however, not applicable to nonstationary background objects at the pixel level, since the corresponding spatial features vary over time.

A few more attempts to segment foreground objects from nonstationary backgrounds have been made by using temporal features. One way is to estimate the consistency of optical flow over a short duration of time [13], [30]; the dynamic features of nonstationary background objects are then represented by the significant variation of accumulated local optical flows. In [12], Li et al. propose a method that employs the statistics of color co-occurrence between two consecutive frames to model the dynamic features associated with a nonstationary background object. Temporal features are suitable for modeling the appearance of nonstationary objects. In Wallflower [6], Toyama et al. use a linear Wiener filter, an autoregressive model, to represent the intensity changes at each background pixel. The linear predictor can learn and estimate the intensity variations of a background pixel and works well for periodic changes, but it has difficulty predicting shadows and background changes of varying frequency in natural scenes. A brief summary of the existing methods, classified by the types of features used, is given in Table I.

TABLE I: CLASSIFICATION OF PREVIOUS METHODS AND THE PROPOSED METHOD

Further, most existing methods perform the background and foreground classification with one or more heuristic thresholds, and for backgrounds of different complexities the thresholds must be adjusted empirically. In addition, these methods are often tested only on a few background environments (e.g., laboratories, campuses, etc.).

II. STATISTICAL MODELING OF THE BACKGROUND

A. Bayes Classification of Background and Foreground

For arbitrary background and foreground objects or regions, the classification of background and foreground can be formulated under Bayes decision theory. Let s = (x, y) be the position of an image pixel, I_t be the input image at time t, and v_t be an n-dimensional feature vector extracted at position s and time t from the image sequence. Then, the posterior probability of the feature vector coming from the background at s can be computed by using the Bayes rule

P(b | v_t, s) = P(v_t | b, s) P(b | s) / P(v_t | s)    (1)

where b indicates the background, P(v_t | b, s) is the probability of the feature vector v_t being observed as background at s, P(b | s) is the prior probability of the pixel s belonging to the background, and P(v_t | s) is the prior probability of the feature vector v_t being observed at position s.
Similarly, the posterior probability that the feature vector comes from a foreground object at s is

P(f | v_t, s) = P(v_t | f, s) P(f | s) / P(v_t | s)    (2)

where f denotes the foreground. Using the Bayes decision rule, a pixel is classified as belonging to the background according to its feature vector v_t observed at time t if

P(b | v_t, s) > P(f | v_t, s).    (3)

Otherwise, it is classified as belonging to the foreground. Since a feature vector observed at an image pixel comes from either background or foreground objects, it follows that

P(v_t | s) = P(v_t | b, s) P(b | s) + P(v_t | f, s) P(f | s).    (4)

Substituting (1) and (4) into (3), the Bayes decision rule (3) becomes

2 P(v_t | b, s) P(b | s) > P(v_t | s)    (5)

i.e., the pixel is labeled as background exactly when P(b | v_t, s) > 0.5. Using (5), the pixel s with observed feature vector v_t at time t can be classified as a background or a foreground point, provided that the probabilities P(b | s), P(v_t | b, s), and P(v_t | s) are known in advance.

B. Principal Feature Representation of Background

To apply (5) for the classification of background and foreground, the probability functions P(b | s), P(v_t | b, s), and P(v_t | s) should be known in advance or be properly estimated. For complex backgrounds, the forms of these probability functions are unknown. One way to estimate them is to use the histogram of features. The problem that would be encountered is the high cost of storage and computation: assuming v is an n-dimensional vector and each of its elements is quantized to L values, the histogram would contain L^n cells. For example, for a color vector with a resolution of 256 levels per component, the histogram would contain 256^3 (more than 16 million) cells. The method would be unrealistic in terms of computational and memory requirements.

It is reasonable to assume that, if the selected features represent the background effectively, the intraclass spread of the background features is small, which implies that the distribution of background features is highly concentrated in a small region of the histogram. Features from various foreground objects, on the other hand, spread widely in the feature space. Therefore, there is little overlap between the distributions of background and foreground features. This implies that, with a proper selection and quantization of features, it is possible to describe the background approximately by using only a small number of feature vectors. A concise data structure to implement such a representation of the background is created as follows.

Let v_1, v_2, ... be the quantized feature vectors sorted in descending order with respect to P(v_i | s) for each pixel s. Then, for a proper selection of features, there would be a small integer N, a high percentage value M1, and a low percentage value M2 such that the background is well approximated by

sum_{i=1}^{N} P(v_i | b, s) > M1  and  sum_{i=1}^{N} P(v_i | f, s) < M2.    (6)

The value of N and the existence of M1 and M2 depend on the selection and quantization of the feature vectors. The feature vectors v_1, ..., v_N are defined as the principal features of the background at pixel s. To learn and update the prior and conditional probabilities for the principal feature vectors, a table of statistics for the possible principal features is established for each feature type at s. The table is denoted as

T_v^t(s) = {S_{v,1}^t(s), ..., S_{v,N_v}^t(s)}    (7)

where T_v^t(s) is learned from the observation of the features up to time t and records the statistics of the N_v (N_v > N) most frequent feature vectors at pixel s.

Fig. 1. One example of learned principal features for a static background pixel in a busy scene. The left image shows the position of the selected pixel. The two right images are the histograms of the statistics for the most significant colors and gradients, where the height of a bar is the value of p_{v,i}^t, the light gray part is p_{b,v,i}^t, and the top dark gray part is the remainder. The icons below the histograms are the corresponding color and gradient features.

Each element of the table contains three components

S_{v,i}^t(s) = {p_{v,i}^t(s), p_{b,v,i}^t(s), v_i^t(s)}    (8)

where p_{v,i}^t(s) is the learned estimate of P(v_i | s), p_{b,v,i}^t(s) is the learned estimate of the joint probability P(v_i, b | s), and v_i^t(s) = [a_1, ..., a_n] is the feature vector itself, n being the dimension of the feature vector. The elements in the table are sorted in descending order with respect to p_{v,i}^t(s).
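To make the data structure concrete, the following is a minimal Python sketch of the per-pixel table (7)-(8) and the decision rule (5). It is an illustration under the notation reconstructed above, not the authors' implementation; the names (FeatureTable, classify_bayes) are hypothetical, and p_bv stands in for the learned joint statistic P(v_i, b | s).

```python
import numpy as np

class FeatureTable:
    """Per-pixel statistics table in the spirit of (7)-(8): N_v entries,
    each holding p_vi ~ P(v_i|s), p_bvi ~ P(v_i, b|s), and the vector v_i."""
    def __init__(self, n_entries, dim):
        self.p_v = np.zeros(n_entries)           # learned P(v_i | s)
        self.p_bv = np.zeros(n_entries)          # learned P(v_i, b | s)
        self.vecs = np.zeros((n_entries, dim))   # the feature vectors v_i

def classify_bayes(p_v_est, p_vb_est):
    """Decision rule (5): background iff 2 * P(v|b,s) * P(b|s) > P(v|s).
    p_vb_est approximates the joint P(v, b | s) = P(v|b,s) * P(b|s)."""
    return 2.0 * p_vb_est > p_v_est
```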
The first N elements of the table, together with the learned estimate of P(b | s), are used in (5) for background and foreground classification.

C. Feature Selection

The next essential issue for principal feature representation is feature selection. The significant features of different background objects are different. To achieve an effective and accurate representation of background pixels with principal features, employing the proper types of features is important. Three types of features, the spectral, spatial, and temporal features, are used for background modeling.

1) Features for Static Background Pixels: For a pixel belonging to a stationary background object, the stable and most significant features are its color and local structure (gradient). Hence, two tables, T_c^t(s) and T_e^t(s), are used to learn the principal features, with c and e representing the color and gradient vectors, respectively. Since the gradient is less sensitive to illumination changes, the two types of feature vectors can be integrated under the Bayes framework as follows. Let v_t = [c_t, e_t] and assume that c_t and e_t are independent; the Bayes decision rule (5) then becomes

2 P(c_t | b, s) P(e_t | b, s) P(b | s) > P(c_t | s) P(e_t | s).    (9)

For the features from static background pixels, the quantization measure should be less sensitive to illumination changes. Here, a normalized distance measure based on the inner product of two vectors is employed for both color and gradient vectors. The distance measure is

d(v_1, v_2) = 1 - (v_1^T v_2)^2 / (||v_1||^2 ||v_2||^2)    (10)

where v can be the color or the gradient vector, respectively. If d(v_1, v_2) is less than a small value delta, v_1 and v_2 are matched to each other. The robustness of the distance measure (10) to illumination changes and imaging noise is shown in [2]. The color vector is obtained directly from the input images with 256 resolution levels for each component, while the gradient vector is obtained by applying the Sobel operator to the corresponding gray-scale input images with 256 resolution levels. With a proper matching threshold and table size, this is found accurate enough to learn the principal features for static background pixels. An example of principal feature representation for a static background pixel is shown in Fig. 1, where the histograms for the most significant color and gradient features in T_c and T_e are displayed. The histogram of the color features shows that only the first two are the principal colors for the background, and the histogram of the gradients shows that the first six, excluding the fourth, are the principal gradients for the background.

2) Features for Dynamic Background Pixels: For dynamic background pixels associated with nonstationary objects, color co-occurrences are used as their dynamic features, because the color co-occurrence between consecutive frames has been found suitable for describing the dynamic features associated with nonstationary background objects, such as moving tree branches or a flickering screen [12]. Given an interframe change at pixel s from the color at time t-1 to the color at time t, the feature vector of color co-occurrence is defined by the pair of quantized colors before and after the change. Similarly, a table of statistics for color co-occurrences, T_cc^t(s), is maintained at each pixel. Let I_t be the input color image; the color co-occurrence vector is generated by quantizing the color components to a low resolution. For example, by quantizing the color resolution to 32 levels for each component and selecting a modest table size, one may obtain a good principal feature representation for dynamic background pixels. An example of the principal feature representation with color co-occurrences for a flickering screen is shown in Fig. 2. Compared with the quantized color co-occurrence feature space of 32^6 (on the order of 10^9) cells, this implies that, with a very small number of feature vectors, the principal features are capable of modeling the dynamic background pixels.
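A sketch of the matching test built on the distance (10), assuming the reconstruction of the formula from [2] given above is correct; delta is the small matching threshold mentioned in the text (its value is not specified here):

```python
import numpy as np

def feature_distance(v1, v2, eps=1e-12):
    """Normalized inner-product distance of (10):
    d = 1 - (v1 . v2)^2 / (|v1|^2 |v2|^2). Small d means similar direction,
    which makes the measure tolerant to illumination-induced scaling."""
    num = float(np.dot(v1, v2)) ** 2
    den = float(np.dot(v1, v1)) * float(np.dot(v2, v2)) + eps
    return 1.0 - num / den

def is_match(v1, v2, delta):
    return feature_distance(v1, v2) < delta
```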
Fig. 2. One example of learned principal features for dynamic background pixels. The left image shows the position of the selected pixel. The right image is the histogram of the statistics for the most significant color co-occurrences in T_cc, where the height of a bar is the value of p_{v,i}^t, the light gray part is p_{b,v,i}^t, and the top dark gray part is the remainder. The icons below the histogram are the corresponding color co-occurrence features. In the screen, the color changes among white, dark blue, and light blue periodically.

III. LEARNING AND UPDATING THE STATISTICS FOR PRINCIPAL FEATURES

Since the background might undergo both gradual and "once-off" changes, two strategies to learn and update the statistics for principal features are proposed. The convergence of the learning process is analyzed and a formula to select a proper learning rate is derived.

A. For Gradual Background Changes

At each time instant, if the pixel is identified as a static point, the features of color and gradient are used for foreground and background classification; otherwise, the feature of color co-occurrence is used. Let us assume that the feature vector v_t is used to classify the pixel s at time t based on the principal features learned previously. Then the statistics of the corresponding feature vectors in the table (T_c^t and T_e^t, or T_cc^t) are gradually updated at each time instant by

p_b^{t+1}(s) = (1 - alpha1) p_b^t(s) + alpha1 M^t
p_{v,i}^{t+1}(s) = (1 - alpha1) p_{v,i}^t(s) + alpha1 M_i^t
p_{b,v,i}^{t+1}(s) = (1 - alpha1) p_{b,v,i}^t(s) + alpha1 M^t M_i^t    (11)

where the learning rate alpha1 is a small positive number. In (11), M^t = 1 means that s is classified as a background point at time t in the final segmentation, and M^t = 0 otherwise. Similarly, M_i^t = 1 means that the ith vector of the table matches the input feature vector v_t, and M_i^t = 0 otherwise. The above updating operation states the following. If the pixel s is labeled as a background point at time t, p_b is slightly increased due to M^t = 1. Further, the probabilities for the matched feature vector are also increased due to M_i^t = 1, whereas the statistics for the unmatched feature vectors are slightly decreased. If there is no match between the feature vector v_t and the vectors in the table, the N_v-th (least significant) vector in the table is replaced by a new feature vector

S_{v,N_v}^{t+1}(s) = {p_{v,N_v}^{t+1} = alpha1, p_{b,v,N_v}^{t+1} = M^t alpha1, v_{N_v}^{t+1} = v_t}.    (12)

If the pixel is labeled as a foreground point at time t, p_b and p_{b,v,i} are slightly decreased with M^t = 0; however, the p_{v,i} of the matched vector in the table is slightly increased. The updated elements in the table are resorted in descending order with respect to p_{v,i}^t(s), so that the table keeps the most frequent and significant feature vectors observed at pixel s.
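The gradual update of (11)-(12) can be sketched as below, reusing FeatureTable and feature_distance from the earlier sketches. The exact bookkeeping (in particular the initial values of a replaced entry) follows the reconstruction above and should be read as an assumption, not as the authors' code.

```python
import numpy as np

def update_stats(table, p_b, v_t, is_background, alpha, delta):
    """One gradual learning step in the spirit of (11)-(12).
    is_background is M^t from the final segmentation at time t."""
    m_t = 1.0 if is_background else 0.0
    p_b = (1 - alpha) * p_b + alpha * m_t            # update of P(b | s)
    dists = [feature_distance(v_t, v) for v in table.vecs]
    i_best = min(range(len(dists)), key=dists.__getitem__)
    matched = dists[i_best] < delta
    for i in range(len(table.p_v)):                  # per-entry update (11)
        m_i = 1.0 if (matched and i == i_best) else 0.0
        table.p_v[i] = (1 - alpha) * table.p_v[i] + alpha * m_i
        table.p_bv[i] = (1 - alpha) * table.p_bv[i] + alpha * m_t * m_i
    if not matched:                                  # replacement step (12)
        table.p_v[-1], table.p_bv[-1] = alpha, m_t * alpha
        table.vecs[-1] = v_t
    order = np.argsort(-table.p_v)                   # resort by descending p_v
    table.p_v, table.p_bv = table.p_v[order], table.p_bv[order]
    table.vecs = table.vecs[order]
    return p_b
```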
B. For "Once-Off" Background Changes

The probabilities above are learned gradually with the operations described by (11) and (12) at each pixel s. When a "once-off" background change happens, the new background appearance soon becomes dominant after the change. With the replacement operation (12), the gradual accumulation operation (11), and the resorting at each time step, the newly learned features are gradually moved to the first few positions of T_v^t(s). According to (4), the statistics of the principal features satisfy the relation (13). After some time, the term on the left-hand side of (13) becomes large (close to 1) and the first term on its right-hand side becomes very small, since the new background features are still classified as foreground. From (6) and (13), the new background appearance at s can be found once the condition (14) holds. In (14), b denotes the previous background before the "once-off" change and f denotes the new background appearance after it; a safety factor in (14) prevents errors caused by a small number of genuine foreground features. Using the notation in (7) and (8), the condition (14) becomes (15). Once this condition is satisfied, the statistics learned for the foreground should be converted into those of the new background appearance. According to (4), the "once-off" learning operation is performed as

p_b^{t+1}(s) = 1 - p_b^t(s),  p_{b,v,i}^{t+1}(s) = p_{v,i}^t(s) - p_{b,v,i}^t(s)    (16)

for i = 1, ..., N_v.

C. Convergence of the Learning Process

If the time-evolving principal feature representation has successfully approximated the background, then sigma^t := sum_{i=1}^{N} P^t(v_i | b, s) should be close to 1. Hence, it is desirable that sigma^t converge to 1 with the evolution of the learning process. We show in the following that the learning operation (11) indeed meets this condition. Suppose sigma^t = 1 at time t, and the ith vector in the table matches the input feature vector v_t, which has been detected as background in the final segmentation at time t. Then, according to (11), we have

sigma^{t+1} = (1 - alpha1) sigma^t + alpha1 = 1    (17)

which implies that the sum of the conditional probabilities of the principal features being background remains equal or close to 1 during the evolution of the learning process. Now suppose sigma^t < 1 at time t, due to, e.g., the disturbance from foreground objects or the operation of "once-off" learning. If the matched vector comes from the first N vectors of the table, then

sigma^{t+1} = (1 - alpha1) sigma^t + alpha1 > sigma^t    (18)

and the sum of the conditional probabilities of the principal features being background increases slightly. On the other hand, if the matched vector falls outside the first N vectors, there will be

sigma^{t+1} = (1 - alpha1) sigma^t < sigma^t    (19)

and the sum decreases slightly. From these two cases, it can be concluded that the sum of the conditional probabilities of the principal features being background converges to 1 as long as the background features are observed frequently.
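The convergence behavior of (17)-(19) is easy to check numerically; the following lines simulate the recursion on sigma for the matched-within-first-N case (a toy check, with an arbitrary start value and an assumed learning rate):

```python
alpha, sigma = 0.005, 0.2      # assumed learning rate and a low start value
for t in range(2000):          # background features observed at every step
    sigma = (1 - alpha) * sigma + alpha   # case (18): match within first N
print(round(sigma, 4))         # -> 1.0, as the convergence analysis predicts
```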
D. Selection of the Learning Rate

In general, for an IIR filtering-based learning process, there is a tradeoff in the selection of the learning rate alpha1. For the learning process to adapt smoothly to gradual background changes without being perturbed by noise and foreground objects, a small value should be selected for alpha1. On the other hand, if alpha1 is too small, the system becomes too slow in responding to "once-off" background changes. Previous methods select the rate empirically [4], [5], [8], [14]. Here, a formula is derived to select alpha1 according to the required time for the system to respond to "once-off" background changes. An ideal "once-off" background change at time t0 can be assumed to be a step function. Suppose the features before the change fall into the first N elements of the table and the features after the change fall into the next N elements. Then, the statistics at time t0 can be described as in (20). Since the new background appearance at pixel s after t0 is classified as foreground until the "once-off" updating with (16) is performed, p_b and the statistics of the old features decrease exponentially, whereas the statistics of the new features increase exponentially and are shifted to the first positions of the updated table by the resorting at each time step. Once the condition (15) is met at some time t0 + T, the new background state is learned. To simplify the expressions, let us assume that there is no resorting operation; then the condition (15) becomes (21). From (11) and (20), it follows that at time t0 + T the conditions (22) hold, and if the pixel is detected as a background point, this leads to (23) and (24). By substituting (22)–(24) into (21) and rearranging terms, one can obtain (25), where T is the number of frames required to learn the new background appearance. Equation (25) implies that if one wishes the system to learn the new background state in no more than T frames, one should choose alpha1 such that (25) is satisfied. For example, if the system is to respond to a "once-off" background change within 20 s at a frame rate of 20 fps, then T = 400 frames, and alpha1 should be chosen accordingly from (25).

IV. FOREGROUND OBJECT DETECTION: THE ALGORITHM

With the Bayesian formulation of background and foreground classification, as well as the background representation with principal features, an algorithm for foreground object detection in complex environments is developed. It consists of four parts: change detection, change classification, foreground object segmentation, and background maintenance. The block diagram of the algorithm is shown in Fig. 3; the white blocks from left to right correspond to the first three steps, and the blocks with gray shading correspond to background maintenance.

Fig. 3. Block diagram of the proposed method.

In the first step, unchanged background pixels in the current frame are filtered out by simple background and temporal differencing, and the detected changes are separated into static and dynamic points according to the interframe changes. In the second step, the detected static and dynamic change points are further classified as background or foreground using the Bayes rule and the statistics of the principal features of the background: static points are classified based on the statistics of principal colors and gradients, whereas dynamic points are classified based on those of principal color co-occurrences. In the third step, foreground objects are segmented by combining the classification results from both static and dynamic points. In the fourth step, the background models are updated, including the statistics of principal features for the background as well as a reference background image. Brief descriptions of the steps are presented in the following.

A. Change Detection

In this step, simple adaptive image differencing is used to filter out non-change background pixels, so that the minor color variations caused by imaging noise are removed and computation is saved in further processing. Let I_t be the input image and B_t be the reference background image maintained at time t, with c denoting a color component. The background difference is obtained as follows. First, image differencing and thresholding are performed for each color component, where the threshold is automatically generated using the least median of squares (LMedS) method [31]. The background difference F_bd is then obtained by fusing the results from the three color components. Similarly, the temporal (or interframe) difference F_td between two consecutive frames I_{t-1} and I_t is obtained. If both F_bd and F_td indicate no change, the pixel is classified as a non-change background point. In general, more than 50% of the pixels are filtered out in this step.
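A rough sketch of this step-1 filtering, under stated assumptions: the per-channel thresholds thr_bg and thr_fd stand in for the LMedS-derived thresholds of [31] (their computation is not reproduced here), and the fusion over the three color components is done with a simple OR.

```python
import numpy as np

def detect_changes(frame, bg_ref, prev_frame, thr_bg, thr_fd):
    """Returns (changed, dynamic) boolean maps. A pixel is a non-change
    background point only if both the background difference and the
    temporal difference are zero; a temporal change marks a dynamic point."""
    f_bd = (np.abs(frame.astype(int) - bg_ref.astype(int)) > thr_bg).any(axis=-1)
    f_td = (np.abs(frame.astype(int) - prev_frame.astype(int)) > thr_fd).any(axis=-1)
    changed = f_bd | f_td
    dynamic = f_td
    return changed, dynamic
```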
B. Change Classification

If an interframe change F_td is detected at a pixel s, the pixel is classified as a dynamic point; otherwise, it is classified as a static point. A change at a static point could be caused by illumination changes, "once-off" background changes, or a temporarily motionless foreground object; a change detected at a dynamic point could be caused by a moving background or foreground object. These points are further classified as background or foreground by using the Bayes decision rule and the statistics of the corresponding principal features. Let v_t be the input feature vector at s and time t. The probabilities in (5) are estimated as

P^(v_t | s) = sum_{v_i in V} p_{v,i}^t(s),  P^(v_t | b, s) P^(b | s) = sum_{v_i in V} p_{b,v,i}^t(s)    (26)

where V is the feature vector set composed of those among the first N entries of the table which match the input vector v_t, i.e.,

V = {v_i : d(v_t, v_i) < delta, 1 <= i <= N}.    (27)

If no principal feature vector in the table matches v_t, both estimates are set to 0. Then the change point is classified as background or foreground as follows.

Classification of a Static Point: For a static point, the probabilities for both the color and gradient features are estimated by (26) from T_c^t(s) and T_e^t(s), respectively, where the vector distance in (27) is calculated as in (10). In this work, the statistics of the two types of principal features (colors and gradients) are learned separately. In general cases, the two types of features are coincident and the Bayes decision rule (9) can be applied for background and foreground classification. In some complex cases, however, one type of feature from the background might be unstable. One example is the temporarily static state of a wavering water surface, for which the gradient features are not constant. Another example is video captured with an auto-gain camera: the gain is often self-tuned due to the motion of objects, and the gradient features are then more stable than the color features for static background pixels. To work stably in various conditions, the following method is adopted: if the learned priors of the color and gradient features are close to each other (within a small margin in our tests), the two types of features are considered coincident and both are used for classification with the Bayes rule (9); otherwise, only the feature type with the larger prior value is used for classification with the Bayes rule (5).

Classification of a Dynamic Point: For a dynamic point at time t, the feature vector of color co-occurrence is generated. The probabilities for it are calculated as in (26), where the distance between two feature vectors in (27) is computed as in (28) with a suitably chosen threshold. Finally, the Bayes rule (5) is applied for background and foreground classification. As observed in our experiments, only a small percentage of the dynamic background points are wrongly classified as foreground changes [12]; moreover, the remainders are isolated points, which can easily be removed by a smoothing operation.

C. Foreground Object Segmentation

Post-processing is applied to segment the remaining change points into foreground regions. This is done by first applying a morphological operation (a pair of open and close operations) to suppress the residual errors. Then the foreground regions are extracted, holes are filled, and small regions are removed. Further, an AND operation is applied to the resulting segments in consecutive frames to remove false foreground regions detected by temporal differencing [32].

D. Background Maintenance

With the feedback from the above segmentation, the background models are updated. First, the statistics of principal features are updated as described in Section III: for static points the tables T_c^t(s) and T_e^t(s) are updated, and for dynamic points the table T_cc^t(s) is updated. Meanwhile, a reference background image is maintained to keep the background difference accurate. Let s be a background point in the final segmentation result at time t. If it is identified as an unchanged background point in the change detection step, the background reference image at s is smoothly updated by

B_{t+1}(s) = (1 - beta) B_t(s) + beta I_t(s)    (29)

where beta is a small positive number. If s is instead classified as background in the change classification step, the background reference image at s is replaced by the new background appearance

B_{t+1}(s) = I_t(s).    (30)

With (30), the reference background image can follow the dynamic background changes, e.g., the changes of color between tree branches and sky, as well as "once-off" background changes.
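A minimal sketch of the reference-image maintenance (29)-(30); beta is the small smoothing rate (its value here is assumed), and the masks are taken from the earlier steps:

```python
import numpy as np

def update_reference(bg_ref, frame, label_bg, changed, beta=0.05):
    """IIR smoothing (29) for unchanged background points and direct
    replacement (30) for changed points classified as background."""
    out = bg_ref.astype(float).copy()
    unchanged = label_bg & ~changed
    replaced = label_bg & changed
    out[unchanged] = (1 - beta) * out[unchanged] + beta * frame[unchanged]
    out[replaced] = frame[replaced]
    return out
```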
Fig. 4. Summary of the complete algorithm.

E. Memory Requirement and Computational Time

The complete algorithm is summarized in Fig. 4. The major part of the memory usage is to store the tables of statistics (T_c, T_e, and T_cc) for each pixel. In our implementation, the memory requirement for each pixel is approximately 1.78 KB. For a video with images sized 160 x 120 pixels, the required memory is approximately 33.4 MB, while for images sized 320 x 240 pixels, 133.5 MB of memory is required. For a standard PC, this is still feasible. With a 1.7-GHz Pentium CPU PC, real-time processing of image sequences is achievable at a rate of about 15 frames per second (fps) for images sized 160 x 120 pixels and about 3 fps for images sized 320 x 240 pixels.
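The stated memory figures can be cross-checked with a two-line computation (using 1 MB = 1024 KB):

```python
per_pixel_kb = 1.78
print(160 * 120 * per_pixel_kb / 1024)   # ~33.4 MB for 160 x 120 images
print(320 * 240 * per_pixel_kb / 1024)   # ~133.5 MB for 320 x 240 images
```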
V. EXPERIMENTAL RESULTS

The proposed method has been tested on a variety of indoor and outdoor environments, including offices, campuses, parking lots, shopping malls, restaurants, airports, subway stations, sidewalks, and other private and public sites. It has also been tested on image sequences captured in various weather conditions, including sunny, cloudy, and rainy weather, as well as night and crowded scenes. In all the tests, the proposed method was automatically initialized (bootstrapped) from a "blinking background," i.e., with the statistics tables starting empty. The system gradually learned the most significant features of both stationary and nonstationary background objects. Once the "once-off" updating had been performed, the system was able to separate the foreground from the background well.

MoG [4] is a widely used adaptive background subtraction method; among existing methods, it performs quite well for both stationary and nonstationary backgrounds [6]. The proposed method has therefore been compared with MoG in the experiments. The same learning rate was used for both the proposed method and MoG in each test.¹ Further, for a fair comparison, the post-processing used in the proposed method was applied to the MoG method as well.

¹A similar analysis of the learning process and dynamic performance can be made for MoG, as in Sections III-C and III-D.

The visual examples and quantitative evaluations of the experiments are described in the following two subsections.

A. Examples on Various Environments

Selected results on five typical indoor and outdoor environments are displayed in this section: offices, campuses, shopping malls, subway stations, and sidewalks. In the figures of this subsection, pictures are arranged in rows. In each row, the images from left to right are the input frame, the background reference image maintained by the proposed method at that moment, the manually generated "ground truth," and the results of the proposed method and MoG.

Fig. 5. Experimental results on a meeting room environment (MR) with wavering curtains in the wind. The two examples are the results for frames 1816 and 2268.

Fig. 6. Experimental results on a lobby environment (LB) in an office building with lights switched on/off. Upper row: a frame before some lights are switched off (364). Lower row: a frame 15 s after some lights are switched off (648).

1) Office Environments: Office environments include offices, laboratories, meeting rooms, corridors, lobbies, and entrances. An office environment is usually composed of stationary background objects. The difficulties for foreground detection in these scenes can be caused by shadows, changes of illumination conditions, and camouflaged foreground objects (i.e., foreground objects whose color is similar to that of the background they cover). In some cases, the background may contain dynamic objects, such as waving curtains, running fans, and flickering screens. Examples from two test sequences are shown in Figs. 5 and 6.

Fig. 7. Experimental results on a campus environment (CAM) containing wavering tree branches in strong winds, for frames 1019, 1337, and 1393.

The first sequence (MR) was captured by an auto-gain camera in a meeting room, where the background curtain was moving in the wind. The first example (upper row of Fig. 5) came from a scenario containing significant motion of the curtain, as well as background changes caused by automatic gain adjustment. In the second example, the person wore brightly colored clothes similar in color to the curtain. In both cases, the proposed method separated the background and foreground satisfactorily.

The second sequence (LB) was captured in the lobby of an office building, where background changes were mainly caused by switching lights on and off. Two examples from this sequence are shown in Fig. 6. The first example shows a scene before some lights are switched off; a significant shadow of the person can be observed. The result of the proposed method is rather satisfactory apart from a small included shadow. The second example shows a scene about 220 frames (about 15 s) after some lights have been switched off. Even though the background reference image had not yet recovered completely, the proposed method detected the person successfully.

2) Campus Environments: The second type of environment is campuses or parks. Changes in the background are often caused by the motion of tree branches and their shadows on the ground surface, or by changes in the weather. The three examples displayed in Fig. 7 are from a sequence (CAM) captured on a campus containing moving tree branches. The large motion of the tree branches was caused by strong winds, as can be observed from the waving yellow flag at the left of the images. The moving tree branches also changed the tree shadows. The three example frames contain vehicles of different colors. The results show that the proposed method detected the vehicles quite well in such an environment.

3) Shopping Malls: The third type of typical environment includes shopping centers, hotels, museums, airports, and restaurants. In these environments, the lighting is distributed from the ceilings and some ground surfaces produce specular highlights. In such cases, if multiple persons move in the scene, the shadows on the ground surface vary significantly across the image sequence. In these environments, the shadows can be classified into umbra and penumbra [33]: the umbra corresponds to the background area where the direct light is almost totally blocked by the foreground object, whereas in the penumbra area the lighting is only partially blocked. Three examples from such environments are shown in Fig. 8.
Fig. 8. Experimental results on shopping mall environments which contain specular ground surfaces. The three examples came from a busy shopping center (SC), an airport (AP), and a buffet restaurant (BR), respectively.

They were from a busy shopping center (SC), an airport (AP), and a buffet restaurant (BR) [6]. Significant shadows of moving persons, cast on the ground surfaces from different directions, can be observed. As one can see, the proposed method obtained satisfactory results in these three environments, apart from small parts of the shadows being detected. The recognized shadows can also be observed in the maintained background reference images. This can be explained as follows: a) the feature distance measure (10), which is robust to illumination changes, plays a major role in suppressing the penumbra areas; and b) the learned color co-occurrences of the changes from the normal background appearance to umbra and vice versa identify many background pixels in the umbra areas. Hence, without special models for shadows, the proposed method suppressed much of the various shadows in these environments.

4) Subway Stations: Subway stations are other public sites that often require monitoring. In these situations, the motion of background objects (e.g., trains and escalators) makes background modeling difficult. Furthermore, the background model is hard to establish if there are frequent human crowds in the scene. Fig. 9 shows two examples from a sequence of a subway station (SS) recorded on tape by a CCTV surveillance system. The scene contains three moving escalators and frequent human flows at the right side of the images. In addition, there are significant background changes caused by variations in lighting conditions due to the many glass and stainless steel materials inside the building. Another difficulty with this sequence is the noise introduced by the old video recording device. The busy flow of human crowds can be observed in the first example in the figure. Our test results show that the proposed method performed quite satisfactorily in such difficult scenarios.

Fig. 9. Experimental results on a subway station environment (SS). The examples are frames 1993 and 2634.

5) Sidewalks: Pedestrians are often the targets of interest in many video surveillance systems. In such cases, a surveillance system may monitor the scene from day to night under a range of weather conditions. Tests were performed on such an environment around the clock. The image sequences (SW) were obtained from highly compressed MPEG-4 videos transmitted through a local wireless network, and there were large variations of the background in the images. Five examples and test results are shown in Fig. 10, corresponding to sunny, cloudy, and rainy weather conditions, as well as night and crowded scenes. The interval between the first two frames was less than 10 s. Comparing the results with the "ground truths," one finds that the proposed method performed very robustly in this complex environment.

From the comparisons with MoG in the examples shown in Figs. 5–10, one can see that the proposed method outperformed the MoG method in these selected difficult situations. The parameters used for these tests are listed in Tables II and III. The parameters in Table II were applied to all tests.
The learning rates in the first row of Table III were applied to all tests except for three shorter sequences, where the larger rates in the second row of the table were applied; for short image sequences, a slightly faster learning rate speeds up the initial learning. Since the decision rule (5) for the classification of background and foreground does not depend directly on any threshold, the performance of the proposed method is not very sensitive to these parameters.

Fig. 10. Experimental results of pedestrian detection in a sidewalk environment (SW) around the clock. From top to bottom are frames from sunny, cloudy, rainy, night, and crowded scenes.

TABLE II: PARAMETERS USED FOR ALL TEST EXAMPLES
TABLE III: LEARNING RATES USED IN THE TEST EXAMPLES

B. Quantitative Evaluations

To obtain a systematic evaluation, the performance of the proposed method was also evaluated quantitatively on randomly selected samples from ten sequences. In previous work [6], the results were evaluated quantitatively by comparison with the "ground truths" in terms of 1) the false negative error, the number of foreground pixels that are missed, and 2) the false positive error, the number of background pixels that are misdetected as foreground. However, it was found that when these measures are averaged over various environments, they are not accurate enough. In this paper, a new similarity measure is introduced to evaluate the results of foreground segmentation. Let A be a detected region and B be the corresponding "ground truth"; the similarity measure between the regions A and B is defined as

S(A, B) = |A ∩ B| / |A ∪ B|.    (31)

Using this measure, S(A, B) approaches a maximum value of 1.0 if A and B are the same; otherwise, S(A, B) varies between 1 and 0 according to their similarity, approaching 0 with the least similarity. It integrates the false positive and false negative errors in one measure. One drawback of (31) is that it is a nonlinear measure. To give a visual impression of the magnitudes of the similarity measure, some matching images and their similarity values are displayed in Fig. 11.

For systematic evaluation and comparison, the similarity measure (31) was applied to the experimental results of the proposed method and the MoG method. A total of ten image sequences were used, including those in Figs. 5–10 as well as two others [water surface (WS) and fountain (FT)]. We randomly selected 20 frames from each sequence, leading to a total of 200 sample frames for evaluation. The "ground truths" of these 200 frames were generated manually by four invited persons. All ten test sequences, the results, and the "ground truths" of the sample frames are available.²

²http://perception.i2r.a-star.edu.sg/bk_model/bk_index.html

Fig. 11. Some examples of matching images with different similarity measure values. In the images, the bright color indicates the intersection of the detected regions and the "ground truths," the dark gray color indicates the false negatives, and the light gray color indicates the false positives.

TABLE IV: QUANTITATIVE EVALUATION AND COMPARISON RESULTS: VALUES OF THE SIMILARITY MEASURE FROM THE TEST SEQUENCES

The average values of the similarity measure for each individual sequence and over all ten sequences are shown in Table IV. The corresponding values obtained with the MoG method are also included.
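The measure (31) is the intersection-over-union of the two regions; for binary masks it can be computed as follows (a straightforward sketch):

```python
import numpy as np

def similarity(detected, truth):
    """S(A, B) = |A n B| / |A u B| for boolean masks, as in (31):
    1.0 for identical regions, approaching 0 with the least similarity."""
    union = np.logical_or(detected, truth).sum()
    if union == 0:
        return 1.0  # both regions empty: treat as a perfect match
    return np.logical_and(detected, truth).sum() / union
```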
The ten test sequences were chosen from among the difficult sequences; besides the various background changes described in the previous subsection, they contain global background changes as well as persons staying motionless for quite a while. Taking these situations into account, the evaluation values obtained for both methods are quite good. Comparing the results in Table IV with those in Fig. 11, the performance of the proposed method is rather satisfactory. The comparison shows that the proposed method provides improved results over the MoG method, especially for image sequences with complex backgrounds.

C. Limitations of the Method

Since the statistics are related to each individual pixel without considering its neighborhood, the method can wrongly absorb a foreground object into the background if the object remains motionless for a long time, e.g., if a moving person or car suddenly stops and remains still for a long duration. Further improvement could be made, e.g., by combining information from high-level object recognition and tracking into the background updating [34], [35]. Another potential problem is that the method can wrongly learn the features of foreground objects as background if crowded foreground objects (e.g., crowds) are constantly present in the scene. Adjusting the learning rate based on feedback from optical flow could provide a possible solution [36]. A method of controlling the learning processes using multilevel feedback is being investigated to further improve the results.

VI. CONCLUSION

For detecting foreground objects in complex environments, this paper has proposed a novel statistical method for background modeling, in which the background appearance is characterized by principal features and their statistics. Foreground objects are detected through foreground and background classification under a Bayesian framework. Our test results have shown that the principal features are effective in representing the spectral, spatial, and temporal characteristics of the background. A learning method to adapt to time-varying background features has been proposed and analyzed. Experiments have been conducted on a variety of environments, including offices, public buildings, subway stations, campuses, parking lots, airports, and sidewalks. The experimental results have shown the effectiveness of the proposed method. Quantitative evaluation and comparison with an existing method have shown improved performance for foreground object detection against complex backgrounds. Some limitations of the method have been discussed, with suggestions for possible improvements.

ACKNOWLEDGMENT

The authors would like to thank R. Luo, J. Shang, X. Huang, and W. Liu for their work in generating the "ground truths" for evaluation.

REFERENCES

[1] D. Gavrila, "The visual analysis of human movement: A survey," Comput. Vis. Image Understanding, vol. 73, no. 1, pp. 82–98, 1999.
[2] L. Li and M. Leung, "Integrating intensity and texture differences for robust change detection," IEEE Trans. Image Processing, vol. 11, pp. 105–112, Feb. 2002.
[3] E. Durucan and T. Ebrahimi, "Change detection and background extraction by linear algebra," Proc. IEEE, vol. 89, pp. 1368–1381, Oct. 2001.
[4] C. Stauffer and W. Grimson, "Learning patterns of activity using real-time tracking," IEEE Trans. Pattern Anal. Machine Intell., vol. 22, pp. 747–757, Aug. 2000.
[5] I. Haritaoglu, D. Harwood, and L. Davis, "W4: Real-time surveillance of people and their activities," IEEE Trans. Pattern Anal. Machine Intell., vol. 22, pp. 809–830, Aug. 2000.
[6] K. Toyama, J. Krumm, B. Brumitt, and B. Meyers, "Wallflower: Principles and practice of background maintenance," in Proc. IEEE Int. Conf. Computer Vision, Sept. 1999, pp. 255–261.
[7] K. Karmann and A. von Brandt, "Moving object recognition using an adaptive background memory," in Time-Varying Image Processing and Moving Object Recognition, vol. 2, 1990, pp. 289–296.
[8] C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland, "Pfinder: Real-time tracking of the human body," IEEE Trans. Pattern Anal. Machine Intell., vol. 19, pp. 780–785, July 1997.
[9] A. Elgammal, D. Harwood, and L. Davis, "Non-parametric model for background subtraction," in Proc. Eur. Conf. Computer Vision, 2000.
[10] N. Paragios and V. Ramesh, "A MRF-based approach for real-time subway monitoring," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, Dec. 2001, pp. I-1034–I-1040.
[11] O. Javed, K. Shafique, and M. Shah, "A hierarchical approach to robust background subtraction using color and gradient information," in Proc. IEEE Workshop Motion and Video Computing, Dec. 2002, pp. 22–27.
[12] L. Li, W. M. Huang, I. Y. H. Gu, and Q. Tian, "Foreground object detection in changing background based on color co-occurrence statistics," in Proc. IEEE Workshop Applications of Computer Vision, Dec. 2002, pp. 269–274.
[13] L. Wixson, "Detecting salient motion by accumulating directionally consistent flow," IEEE Trans. Pattern Anal. Machine Intell., vol. 22, pp. 774–780, Aug. 2000.
[14] N. J. B. McFarlane and C. P. Schofield, "Segmentation and tracking of piglets in images," Mach. Vis. Applicat., vol. 8, pp. 187–193, 1995.
[15] D. Koller, J. Weber, T. Huang, J. Malik, G. Ogasawara, B. Rao, and S. Russell, "Toward robust automatic traffic scene analysis in real-time," in Proc. Int. Conf. Pattern Recognition, 1994, pp. 126–131.
[16] A. Bobick, J. Davis, S. Intille, F. Baird, L. Campbell, Y. Ivanov, C. Pinhanez, and A. Wilson, "KidsRoom: Action recognition in an interactive story environment," Mass. Inst. Technol., Cambridge, Perceptual Computing Tech. Rep. 398, 1996.
[17] J. Rehg, M. Loughlin, and K. Waters, "Vision for a smart kiosk," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 1997, pp. 690–696.
[18] T. Olson and F. Brill, "Moving object detection and event recognition algorithms for smart cameras," in Proc. DARPA Image Understanding Workshop, 1997, pp. 159–175.
[19] T. Boult, "Frame-rate multi-body tracking for surveillance," in Proc. DARPA Image Understanding Workshop, 1998.
[20] T. Darrell, G. Gordon, M. Harville, and J. Woodfill, "Integrated person tracking using stereo, color, and pattern detection," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 1998, pp. 601–608.
[21] A. Shafer, J. Krumm, B. Brumitt, B. Meyers, M. Czerwinski, and D. Robbins, "The new EasyLiving project at Microsoft," in Proc. DARPA/NIST Smart Spaces Workshop, 1998.
[22] C. Eveland, K. Konolige, and R. C. Bolles, "Background modeling for segmentation of video-rate stereo sequences," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 1998, pp. 266–271.
[23] N. Friedman and S. Russell, "Image segmentation in video sequences: A probabilistic approach," in Proc. 13th Conf. Uncertainty in Artificial Intelligence, 1997.
[24] A. J. Lipton, H. Fujiyoshi, and R. S. Patil, "Moving target classification and tracking from real-time video," in Proc. IEEE Workshop Applications of Computer Vision, Oct. 1998, pp. 8–14.
[25] M. Harville, G. Gordon, and J. Woodfill, "Foreground segmentation using adaptive mixture models in color and depth," in Proc. IEEE Workshop Detection and Recognition of Events in Video, July 2001, pp. 3–11.
[26] X. Gao, T. Boult, F. Coetzee, and V. Ramesh, "Error analysis of background adaption," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 2000, pp. 503–510.
[27] K. Skifstad and R. Jain, "Illumination independent change detection from real world image sequences," Comput. Vis., Graph., Image Process., vol. 46, pp. 387–399, 1989.
[28] S. C. Liu, C. W. Fu, and S. Chang, "Statistical change detection with moments under time-varying illumination," IEEE Trans. Image Processing, vol. 7, pp. 1258–1268, Aug. 1998.
[29] N. Oliver, B. Rosario, and A. Pentland, "A Bayesian computer vision system for modeling human interactions," IEEE Trans. Pattern Anal. Machine Intell., vol. 22, pp. 831–843, Aug. 2000.
[30] A. Iketani, A. Nagai, Y. Kuno, and Y. Shirai, "Detecting persons on changing background," in Proc. Int. Conf. Pattern Recognition, vol. 1, 1998, pp. 74–76.
[31] P. Rosin, "Thresholding for change detection," in Proc. IEEE Int. Conf. Computer Vision, Jan. 1998, pp. 274–279.
[32] Q. Cai, A. Mitiche, and J. K. Aggarwal, "Tracking human motion in an indoor environment," in Proc. IEEE Int. Conf. Image Processing, Oct. 1995, pp. 215–218.
[33] C. Jiang and M. O. Ward, "Shadow identification," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 1992, pp. 606–612.
[34] L. Li, I. Y. H. Gu, M. K. H. Leung, and Q. Tian, "Knowledge-based fuzzy reasoning for maintenance of moderate-to-fast background changes in video surveillance," in Proc. 4th IASTED Int. Conf. Signal and Image Processing, 2002, pp. 436–440.
[35] M. Harville, "A framework for high-level feedback to adaptive, per-pixel, mixture-of-Gaussian background models," in Proc. Eur. Conf. Computer Vision, 2002, pp. 543–560.
[36] D. Gutchess, M. Trajkovic, E. Cohen-Solal, D. Lyons, and A. K. Jain, "A background model initialization algorithm for video surveillance," in Proc. IEEE Int. Conf. Computer Vision, vol. 1, July 2001, pp. 733–740.

Liyuan Li (M'96) received the B.E. and M.E. degrees from Southeast University, Nanjing, China, in 1985 and 1988, respectively, and the Ph.D. degree from Nanyang Technological University, Singapore, in 2001. From 1988 to 1999, he was on the faculty of Southeast University, where he was an Assistant Lecturer (1988 to 1990), Lecturer (1990 to 1994), and Associate Professor (1995 to 1999). Since 2001, he has been a Research Scientist at the Institute for Infocomm Research, Singapore. His current research interests include video surveillance, object tracking, and event and behavior understanding.

Weimin Huang (M'97) received the B.Eng. degree in automation and the M.Eng. and Ph.D. degrees in computer engineering from Tsinghua University, Beijing, China, in 1989, 1991, and 1996, respectively. He is a Research Scientist at the Institute for Infocomm Research, Singapore. He has worked on handwritten signature verification, biometric authentication, and audio/video event detection. His current research interests include image processing, computer vision, pattern recognition, human-computer interaction, and statistical learning.

Irene Yu-Hua Gu (M'94–SM'03) received the Ph.D. degree in electrical engineering from the Eindhoven University of Technology, Eindhoven, The Netherlands, in 1992.
She is an Associate Professor in the Department of Signals and Systems, Chalmers University of Technology, Göteborg, Sweden. She was a Research Fellow at Philips Research Institute IPO, The Netherlands, and at Staffordshire University, Staffordshire, U.K., and a Lecturer at the University of Birmingham, Birmingham, U.K., from 1992 to 1996. Since 1996, she has been with the Department of Signals and Systems, Chalmers University of Technology. Her current research interests include image processing, video surveillance and object tracking, video communications, and signal processing applications to electric power systems. Dr. Gu has served as an Associate Editor for the IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS since 2000, and she is currently the Chair-Elect of the IEEE Swedish Signal Processing Chapter.

Qi Tian (M'83–SM'90) received the B.S. and M.S. degrees in electrical and computer engineering from Tsinghua University, Beijing, China, in 1967 and 1981, respectively, and the Ph.D. degree in electrical and computer engineering from the University of South Carolina, Columbia, in 1984. He is a Principal Scientist in the Media Division, Institute for Infocomm Research, Singapore. His main research interests are image/video/audio analysis, indexing and retrieval, media content identification and security, computer vision, and pattern recognition. He joined the Institute of System Science, National University of Singapore, in 1992; since then, he has worked on robust character ID recognition and video indexing. He was the Program Director for the Media Engineering Program at Kent Ridge Digital Labs, later the Laboratories for Information Technology, from 2001 to 2002. Dr. Tian has served on the editorial boards of professional journals and as a chair and member of technical committees of the IEEE Pacific-Rim Conference on Multimedia (PCM), the IEEE International Conference on Multimedia and Expo (ICME), etc.