Relevance Feedback in Content-Based Image Search

Hong-Jiang Zhang, Zheng Chen, Wen-Yin Liu and Mingjing Li
Microsoft Research, China
5F Sigma Center, 49 Zhichun Road
Beijing, 100080, China
E-mail: hjzhang@microsoft.com
Abstract: Content-based image retrieval (CBIR) is a research area dedicated to the retrieval and search of multimedia documents for digital libraries. Relevance feedback is a powerful technique in CBIR and has been an active research topic for the past few years. In this paper, we review the current state of the art of research on relevance feedback for CBIR and present the iFind system developed at Microsoft Research China, which is equipped with a set of powerful relevance feedback algorithms. We also provide an outlook on the remaining research issues in CBIR, especially on applying learning and data mining technologies to the search of multimedia data on the Web.
1. Introduction: The Challenges in Content-Based Image Retrieval
Efficient image indexing and access tools are essential for effective utilization of the massive digital image resources now available, and substantial research effort has been devoted to this issue. However, by and large, the earlier image database systems used in digital libraries [e.g. 1] have all taken keyword- or text-based approaches to indexing and retrieval of image data. Image annotation is admittedly a tedious process; hence, it is practically impossible to annotate all the images on the Internet. Furthermore, due to the multiplicity of contents in a single image and the subjectivity of human perception and understanding, different users rarely produce exactly the same annotations for the same image. To address these limitations, content-based image retrieval (CBIR) approaches have been researched over the last decade [2, 3, 4]. These approaches work with descriptions based on properties inherent in the images themselves, such as color, texture, and shape, and utilize them for retrieval purposes. Since visual features are automatically extracted from images, automated indexing of image databases becomes possible.
However, despite the many research efforts, the retrieval accuracy of today's CBIR algorithms is still limited and often worse than that of keyword-based approaches. The problem stems from the fact that visual similarity measures, such as color histograms, do not in general match the perceptual semantics and subjectivity of images. In addition, each type of image feature tends to capture only one of many aspects of image similarity, and it is difficult to require a user to specify clearly which aspect, or what combination of aspects, he/she wants to apply in defining a query. To address these problems, interactive relevance feedback techniques have been proposed. The idea is to incorporate human perceptual subjectivity into the retrieval process, giving users the opportunity to evaluate retrieval results and automatically refining queries on the basis of those evaluations. Lately, this has become the most challenging topic in CBIR research.
In this paper, we first review the current state of the art of research on relevance feedback in CBIR and discuss a set of representative approaches in Section 2. In Section 3, we present a system called iFind©, developed at Microsoft Research China, to showcase how the functionalities of text-based image search, query by image example, relevance feedback, and data mining can be integrated to build a powerful web-based image search engine. Section 4 presents our concluding remarks.
2. Relevance Feedback in CBIR: The State of the Art
In general, the relevance feedback process in CBIR is as follows. For a given query, the CBIR system first retrieves a list of ranked images according to a predefined similarity metric, often defined by the distance between the query vector and the feature vectors of images in a database. The user then selects a set of positive and/or negative examples from the retrieved images, and the system refines the query and retrieves a new list of images. Hence, the key issue in relevance feedback approaches is how to incorporate positive and negative examples in the query and/or similarity refinement.
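To make this loop concrete, here is a minimal sketch in Python, assuming numpy feature vectors and Euclidean distance as the predefined similarity metric; the refinement strategy is passed in as a function (e.g. one of the schemes in Section 2.1), and the index arrays of user-marked examples stand in for the interactive selection step.

```python
import numpy as np

def feedback_round(query_vec, db_vecs, positives, negatives, refine, top_k=10):
    """One round of relevance feedback: refine the query, then re-rank.

    positives/negatives: index arrays of user-marked example images;
    refine: any query-refinement strategy, e.g. Rocchio's formula.
    """
    new_query = refine(query_vec, db_vecs[positives], db_vecs[negatives])
    # Rank all database images by distance to the refined query vector.
    dists = np.linalg.norm(db_vecs - new_query, axis=1)
    return new_query, np.argsort(dists)[:top_k]
```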
2.1 Classical Relevance Feedback Schemes
The early relevance feedback schemes for CBIR were mainly adopted from text document retrieval research and can be classified into two approaches: query point movement (query refinement) and re-weighting (similarity measure refinement). Both are built upon the vector model in information retrieval theory [5, 6, 7]. The query point movement method essentially tries to improve the estimate of the "ideal query point" by moving it towards good example points and away from bad example points. The technique most frequently used to iteratively improve this estimate is Rocchio's formula, given below for a set of relevant documents D'_R and non-relevant documents D'_N specified by the user [6].
$$Q' = \alpha Q + \beta \left( \frac{1}{N_{R'}} \sum_{i \in D'_R} D_i \right) - \gamma \left( \frac{1}{N_{N'}} \sum_{i \in D'_N} D_i \right) \qquad (1)$$
where α, β, and γ are suitable constants, and N_R' and N_N' are the numbers of documents in D'_R and D'_N, respectively. This is the technique implemented in the MARS system [8].
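As an illustration, a straightforward implementation of (1) might look as follows; the default values for α, β, and γ are common text retrieval choices, not constants prescribed by MARS.

```python
import numpy as np

def rocchio(query, pos, neg, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio query-point movement, equation (1).

    pos/neg: arrays of shape (n_pos, dim) and (n_neg, dim) holding the
    feature vectors of the relevant and non-relevant examples.
    """
    new_q = alpha * query
    if len(pos) > 0:
        new_q += beta * pos.mean(axis=0)    # move towards good examples
    if len(neg) > 0:
        new_q -= gamma * neg.mean(axis=0)   # move away from bad examples
    return new_q
```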
Another implementation of the point movement strategy uses the Bayesian method, such as the work in [9], which uses Bayesian learning to incorporate the user's feedback and update the probability distribution over all the images in the database. Experiments show that retrieval performance can be improved considerably by such relevance feedback approaches.
The central idea behind the re-weighting method is simple and intuitive. Since each image is represented by an N-dimensional feature vector, we can view it as a point in an N-dimensional space. The basic idea is then to enhance the importance of those dimensions of a feature that help in retrieving the relevant images and to reduce the importance of those dimensions that hinder this process. That is, if the variance of the good examples is high along a principal axis j, then the values on this axis are not very relevant to the input query, so we assign it a low weight w_j. A simple algorithm based on this idea was described in the ImageRover system [10]. That algorithm automatically selects the Minkowski distance metric that minimizes the mean distance between the relevant images specified by the user.
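A minimal sketch of this inverse-variance re-weighting idea (an illustration of the principle, not ImageRover's exact metric-selection procedure) could be:

```python
import numpy as np

def axis_weights(pos_examples, eps=1e-6):
    """Inverse-variance weights over feature dimensions.

    High variance among the relevant examples along a dimension suggests
    that dimension matters little for this query, so it gets a low weight.
    """
    var = pos_examples.var(axis=0)
    w = 1.0 / (var + eps)          # eps guards against zero variance
    return w / w.sum()             # normalize the weights

def weighted_distance(q, x, w):
    """Weighted Euclidean distance used in place of the plain metric."""
    return np.sqrt(np.sum(w * (q - x) ** 2))
```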
Recently, more computationally robust methods that perform global optimization have been proposed. The MindReader retrieval system designed by Ishikawa et al. [11] formulates parameter estimation as a minimization problem. Unlike traditional retrieval systems, whose distance functions can be represented by ellipses aligned with the coordinate axes, MindReader proposes a distance function that is not necessarily aligned with the coordinate axes. It therefore allows for correlations between attributes in addition to different weights on each component.
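Such a generalized distance can be sketched as below; estimating the matrix M as a regularized inverse covariance of the positive examples is one Mahalanobis-style choice for illustration, not necessarily MindReader's exact estimator.

```python
import numpy as np

def generalized_distance(q, x, M):
    """Generalized ellipsoid distance d(x, q) = (x - q)^T M (x - q).

    A full symmetric positive-definite M lets the ellipsoid rotate off the
    coordinate axes, capturing correlations between feature components;
    a diagonal M recovers ordinary per-axis re-weighting.
    """
    d = x - q
    return float(d @ M @ d)

def estimate_M(pos_examples, eps=1e-6):
    """Illustrative M: regularized inverse covariance of the positives."""
    cov = np.cov(pos_examples, rowvar=False)
    return np.linalg.inv(cov + eps * np.eye(cov.shape[0]))
```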
A further improvement over this approach is given by Rui and Huang [12]. The inputs to this query refinement system are a query vector q_i corresponding to the ith feature, an N-element vector π = [π_1, ..., π_N] that represents the degree of relevance of each of the N input feedback samples, and a set of N training vectors x_ni for each feature i. An ideal query vector for each feature i is given by the weighted sum of all positive feedback images as follows.
$$q_i^{T*} = \frac{\pi^T Y_i}{\sum_{n=1}^{N} \pi_n} \qquad (2)$$
where Y_i is the N×K_i training sample matrix for feature i, obtained by stacking the N feedback vectors x_ni into a matrix, and K_i is the length of the ith feature vector. It is interesting to note that the original query vector q_i does not appear in (2), which shows that the ideal query with respect to the feedback is not influenced by the initial query.
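In code, (2) is a one-line relevance-weighted average of the feedback vectors; the sketch below assumes numpy arrays.

```python
import numpy as np

def ideal_query(pi, Y):
    """Optimal query vector for one feature, equation (2).

    pi: length-N vector of relevance degrees for the N feedback samples;
    Y:  N x K_i matrix stacking the N feedback vectors for feature i.
    Note the initial query vector plays no role in the result.
    """
    return (pi @ Y) / pi.sum()
```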
2.2 Relevance Feedback with Semantics
However, while all the approaches adapted from text document retrieval presented above do improve the performance of CBIR, they have a severe limitation: even with feedback, it is still difficult to capture the high-level semantics of images when only low-level image features are used in queries. The inherent problem is that low-level features are often not as powerful in representing the complete semantic content of images as keywords are in representing text documents. In other words, applying the relevance feedback approaches of text document retrieval to low-level feature based image retrieval will not be as successful as in text retrieval: low-level features alone are not effective in representing users' feedback or in describing their intentions. Furthermore, in these algorithms, the semantics potentially captured during the relevance feedback process of one query session are not memorized to continuously improve the retrieval performance of the system. To overcome these limitations, another school of thought is to use learning approaches to incorporate semantics into relevance feedback.
The PicHunter framework by Cox et al. further extended the relevance feedback and learning idea with a Bayesian approach [13]. With an explicit model of what users would do given the target image they want, PicHunter uses Bayes' rule to predict the target they want given their actions. This is done via a probability distribution over possible image targets, rather than by refining a query. To this end, an entropy-minimizing display algorithm was developed that attempts to maximize the information obtained from the user at each iteration of the search. The proposed framework also makes use of hidden annotation, rather than a possibly inaccurate and inconsistent annotation structure that the user must learn and query in. However, this can be a disadvantage, since it excludes the possibility of benefiting from good annotations, which may lead to very slow convergence.
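The core Bayesian step can be sketched as follows; the user-action likelihood model, which PicHunter defines explicitly, is assumed given here.

```python
import numpy as np

def bayesian_update(prior, likelihoods):
    """One PicHunter-style Bayesian step over candidate target images.

    prior:       length-M probabilities that each database image is the target;
    likelihoods: length-M P(observed user action | image m is the target),
                 supplied by some explicit user model (assumed given).
    """
    posterior = prior * likelihoods        # Bayes' rule, unnormalized
    return posterior / posterior.sum()     # renormalize to a distribution
```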
In general, there are two different modes of user interaction in typical retrieval systems: using keywords to represent the semantic content of the desired images, or query by example based on low-level image features. In most image retrieval systems, these two modes of interaction are mutually exclusive. We argue that combining the two approaches and allowing them to benefit from each other yields a great deal of advantage in terms of both retrieval accuracy and ease of use.
The framework proposed in [14] attempts to embed semantic information into the CBIR process through relevance feedback, using a semantic correlation matrix and low-level feature distances. In this framework, the semantic relevance between image clusters is learned from users' feedback and used to improve retrieval performance. In other words, the framework maintains the strengths of feature-based image retrieval while incorporating learning and annotation into the relevance feedback process. Experiments have shown that this framework is effective not only in improving retrieval performance within a given query session, but also in utilizing the knowledge learned from previous queries to reduce the number of iterations in subsequent queries.
We have also put forward a framework that performs relevance feedback and query refinement on both the images' semantic content, represented by keywords, and their low-level feature vectors. In other words, semantic and low-level feature based relevance feedback are seamlessly integrated. Only when semantic information is unavailable does our method reduce to one of the previously described low-level feedback approaches as a special case. This framework has been implemented in the image search engine iFind, described in detail in the next section.
3. iFind: An Image Search Engine
iFind© is a web-based image retrieval system developed at Microsoft Research China and implemented with Microsoft COM objects [15]. iFind provides the functionalities of keyword-based image search, query by image example, and their combination. Images in this system are represented by low-level visual features, keyword features, and, optionally, annotations when available. The key technologies in the system are integrated semantics and feature based image retrieval, the relevance feedback approach, and data mining of the users' feedback log. The performance improvement of this new approach over traditional CBIR and relevance feedback approaches is significant.
3.1 Document Modeling of Images
As an image search engine, iFind collects most of its images using a crawler, a program that automatically analyzes web pages and downloads the images and associated semantic information from many websites.
We build the document space model, a representation of images as a set of (both visual and semantic) feature vectors, from the images and the text content of the web pages. The text features are extracted from image URLs and filenames, page titles, ALT text, hyperlinks, and surrounding text on the web pages according to a set of empirical rules. These text descriptors compose a text feature vector for each image. The TF*IDF method [5] is used to weight each keyword in the text feature vector.
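A minimal sketch of such TF*IDF weighting over per-image keyword lists follows; it is an illustration of the standard scheme [5], not iFind's exact implementation.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF*IDF weights for the keywords extracted for each image.

    docs: list of keyword lists, one per image (drawn from URLs, titles,
    ALT text, links, and surrounding text). Returns one {keyword: weight}
    dict per image.
    """
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))   # document frequency
    vectors = []
    for doc in docs:
        tf = Counter(doc)                               # term frequency
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors
```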
However, a simple combination of traditional text-based retrieval and CBIR is not adequate for image retrieval on the WWW, for several reasons. First, there is often too much clutter and irrelevant information on web pages, so these text features are less accurate than annotation text. There is also a mismatch between the page author's expression and the user's understanding and expectation.
To overcome these difficulties, we apply relevance feedback and user log analysis to improve the representation of images in three respects. First, the original document space model built from the images and the text content of the web pages is analyzed to detect and remove clutter and irrelevant text, thereby improving the accuracy of the semantic features. Second, the user space model, consisting of the keyword vectors used by users to represent images in the database, is constructed by analyzing the log data of users' relevance feedback. The user space model is then combined with the document space model to eliminate the mismatch between the page author's expression and the user's understanding and expectation. Third, the relationship between the low-level features and the high-level features is also discovered through user log analysis.
3.2 iFind Retrieval and Relevance Feedback Framework
The iFind retrieval and relevance feedback framework consists of a semantic network over the image database that links images to semantic annotations, a similarity measure that integrates both semantic features and image features, and a machine learning algorithm that iteratively updates the semantic network to improve the system's performance over time [16].
The semantic network is represented by a set of keywords with links to the images in the database, with a weight assigned to each individual link. This representation is shown pictorially in Figure 1. The degree of relevance of a keyword to an associated image's semantic content is represented by the weight on the link. Clearly, an image can be associated with multiple keywords, each with a different degree of relevance.
[Figure: keyword nodes linked to image nodes by weighted links]
Figure 1: Semantic network of the image database
In our system, initial keyword annotations can come via the crawler when the images are collected from the Web. More keywords can then be learned from users' feedback. Whenever a user feeds back a set of images as relevant to a keyword, or an example image together with a set of keywords, we add the input keywords to the system and link them with those images. This effectively implements a very simple voting scheme for updating the semantic network, in which the keywords with a majority of user consensus emerge as the dominant representation of the semantic content of their associated images. In this way, as more queries are submitted to the system, the system is able to expand its vocabulary. Also, through the voting process, the keywords that represent the actual semantic content of each image receive large weights.
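A minimal sketch of such a voting update is shown below; the data structure and the fixed step size are illustrative assumptions, not iFind's exact bookkeeping.

```python
from collections import defaultdict

# Semantic network: keyword -> {image_id: link weight}, as in Figure 1.
links = defaultdict(dict)

def vote(keywords, relevant_images, step=1.0):
    """Voting update on user feedback: strengthen (or create) the link
    between each input keyword and each image marked relevant. Keywords
    with broad user consensus accumulate the largest weights."""
    for kw in keywords:
        for img in relevant_images:
            links[kw][img] = links[kw].get(img, 0.0) + step
```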
The iFind framework also extends the feedback algorithm defined by (2) to incorporate the low-level feature based feedback and ranking results into high-level semantic feedback and ranking. We define a unified distance metric function G_j to measure the relevance of any image j in the database in terms of both semantic and low-level feature content. The function G_j is defined using a modified form of Rocchio's formula as follows.
 1
 I    1
 I  
Gj = log(1+π j )Dj + β 
∑ 1+ 1 S jk −γ  ∑ 1+ 2 S jk 
NR k∈NR  A1   NN k∈NN  A2  
(3)
where D_j is the distance score computed by the low-level feedback according to (2); N_R and N_N are the numbers of positive and negative feedback images, respectively; I_1 and I_2 are the numbers of distinct keywords that image j has in common with all the positive or negative feedback images, respectively; A_1 and A_2 are the total numbers of distinct keywords associated with all the positive and negative feedback images, respectively; and S_jk is simply the Euclidean distance between the low-level features of images j and k. We have replaced the first parameter α in Rocchio's formula with the logarithm of the degree of relevance of the jth image. The other two parameters, β and γ, can be assigned by the user to adjust the relative weighting of positive and negative feedback. In addition, it can easily be seen that our method degenerates into (1) when no semantic information is available.
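A direct transcription of (3) into code might look like this; note that the keyword-overlap factor is constant over k, so each inner sum reduces to a scaled mean.

```python
import numpy as np

def unified_score(pi_j, D_j, S_j_pos, S_j_neg, I1, A1, I2, A2,
                  beta=1.0, gamma=1.0):
    """Unified distance metric G_j of equation (3).

    pi_j:    degree of relevance of image j's keywords to the query;
    D_j:     low-level feedback distance from equation (2);
    S_j_pos: Euclidean distances S_jk from image j to the positive examples;
    S_j_neg: the same for the negative examples;
    I1/A1, I2/A2: shared vs. total distinct keyword counts for the
    positive and negative feedback sets.
    """
    pos = (1 + I1 / A1) * np.mean(S_j_pos) if len(S_j_pos) else 0.0
    neg = (1 + I2 / A2) * np.mean(S_j_neg) if len(S_j_neg) else 0.0
    return np.log(1 + pi_j) * D_j + beta * pos - gamma * neg
```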
The iFind system updates the annotations of feedback images by strengthening the links to the positive examples' annotations and weakening the links to the negative examples' annotations. The updated annotations further improve the system's retrieval results in later use. Log mining can also yield more accurate retrieval results by refining the semantic features.
3.3 iFind Log Mining
To further reduce the ambiguity in the text descriptors extracted from web pages and in the low-level image features, and to improve search performance, we have proposed a user space model to supplement the original document space model. This is achieved by applying a user log analysis process.
The user space model is also a vector space model. The difference between the user
space model and the document space model is that vectors in the user space model are
constructed from the information mined from the log data of user interactions.
Let Q be the set of all queries accumulated and T_j (j = 1, ..., N_T) be the set of all keywords that appear in Q. For a query in Q, I_ri is one of the relevant images specified by the user and stored in the user log. By Bayes' theorem, the probability that keyword T_j is relevant to an image I_ri that contains it is
$$P(T_j \mid I_{ri}) = \frac{P(I_{ri} \mid T_j)\, P(T_j)}{P(I_{ri})} \qquad (4)$$
where P(IriTj) is the probability that image Iri has been retrieved and marked as
relevant for those queries that contain word Tj and P(Tj) is the probability that a query
that contain Tj. For a given image I, P(TjI) (j = 1..NT) calculated using (4) forms a
vector for I. We call this vector the user space model of image I, compared to the
document space model of Image I, which is built from the related features extracted
from the web pages.
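With empirical counts mined from the log, Bayes' rule in (4) reduces to a simple co-occurrence ratio; a sketch under that assumption:

```python
from collections import defaultdict

def user_space_model(feedback_log):
    """Estimate P(T_j | I) from the user log via equation (4).

    feedback_log: (query_terms, relevant_image) pairs mined from the log.
    With empirical counts, P(I|T) P(T) / P(I) reduces to
    count(T and I) / count(I), which is what is computed here.
    """
    pair = defaultdict(int)   # times term t appeared in a query for which
    img = defaultdict(int)    # image i was marked relevant / marks on i
    for terms, image in feedback_log:
        img[image] += 1
        for t in set(terms):
            pair[(t, image)] += 1
    return {(t, i): c / img[i] for (t, i), c in pair.items()}
```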
We integrate the user space model described above into the original document space model to improve the accuracy of the final document space model. That is, for each image I with user space model vector U and document space model vector D, the updated document space model is

$$D_{new} = \eta U + (1 - \eta) D \qquad (5)$$

where η adjusts the weight between the user space model and the document space model.
Since irrelevant images are also recorded in the user feedback log, we can utilize this information as well. For each irrelevant image I_ii, we use P(I_ii | T_j) as the confidence that I_ii is irrelevant to query term T_j, and form a vector Ī from these values. The text feature vector of the image in the final document space model is then defined by (6), in a manner similar to the TF*IDF method.

$$D_{final} = D_{new} \ast (1 - \bar{I}) \qquad (6)$$
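Equations (5) and (6) combine into a short elementwise update; η = 0.5 below is an assumed value for illustration, tuned per system in practice.

```python
import numpy as np

def final_text_vector(D, U, I_bar, eta=0.5):
    """Combine document and user space models, equations (5) and (6).

    D:     document space vector from web-page text;
    U:     user space vector of P(T_j | I) values;
    I_bar: vector of irrelevance confidences P(I_ii | T_j);
    eta:   mixing weight (assumed value for illustration).
    """
    D_new = eta * U + (1 - eta) * D    # equation (5)
    return D_new * (1 - I_bar)         # equation (6): damp terms that
                                       # often co-occur with irrelevance
```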
With the combination of the document model and the user-perceived model, the retrieval accuracy of the iFind system is significantly improved compared to using the document model alone [15].
4. Concluding Remarks
CBIR is an important technology for automating the indexing process and achieving content-based search in digital multimedia libraries. Learning semantics through relevance feedback is both the enabling technique and the current research challenge for improving the retrieval performance of CBIR.
5. References
1. S. K. Chang and A. Hsu, "Image Information Systems: Where Do We Go From Here," IEEE Transactions on Knowledge and Data Engineering, Vol. 4, No. 5, pp. 431-442, October 1992.
2. R. Jain, A. Pentland, and D. Petkovic (editors), Workshop Report: NSF-ARPA Workshop on Visual Information Management Systems, Cambridge, Mass., USA, June 1995.
3. M. Flickner et al., "Query by Image and Video Content," IEEE Computer, Vol. 28, No. 9, pp. 23-32, 1995.
4. B. Furht, S. W. Smoliar, and H. J. Zhang, Image and Video Processing in Multimedia Systems, Kluwer Academic Publishers, 1995.
5. G. Salton and M. J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, 1983.
6. J. J. Rocchio, "Relevance Feedback in Information Retrieval," in The SMART Retrieval System, Prentice Hall, 1971, pp. 313-323.
7. C. Buckley and G. Salton, "Optimization of Relevance Feedback Weights," in Proc. of SIGIR'95.
8. Y. Rui, T. S. Huang, and S. Mehrotra, "Content-Based Image Retrieval with Relevance Feedback in MARS," in Proc. IEEE Int. Conf. on Image Processing, 1997.
9. N. Vasconcelos and A. Lippman, "A Bayesian Framework for Content-Based Indexing and Retrieval," in Proc. of DCC'98, Snowbird, Utah, 1998.
10. S. Sclaroff, L. Taycher, and M. L. Cascia, "ImageRover: A Content-Based Image Browser for the World Wide Web," in Proc. IEEE Workshop on Content-Based Access of Image and Video Libraries, 1997.
11. Y. Ishikawa, R. Subramanya, and C. Faloutsos, "MindReader: Querying Databases Through Multiple Examples," in Proc. of the 24th VLDB Conference, New York, 1998.
12. Y. Rui and T. S. Huang, "A Novel Relevance Feedback Technique in Image Retrieval," in Proc. ACM Multimedia, 1999.
13. I. J. Cox et al., "The Bayesian Image Retrieval System, PicHunter: Theory, Implementation, and Psychophysical Experiments," IEEE Trans. on Image Processing, Vol. 9, No. 1, pp. 20-37, January 2000.
14. C. Lee, W. Y. Ma, and H. J. Zhang, "Information Embedding Based on User's Relevance Feedback for Image Retrieval," in Proc. of SPIE Photonics East, 1998.
15. Z. Chen et al., "Web Mining for Web Image Retrieval," to appear in International Journal of the American Society for Information Science, Special Issue on Visual Based Retrieval Systems and Web Mining.
16. Y. Lu, C. Hu, X. Zhu, H. J. Zhang, and Q. Yang, "A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems," in Proc. ACM Multimedia, 2000.