Mechanisms for Scalable Image Searching: A survey
International Journal of Computer and Advanced Engineering Research (IJCAER), Volume 02, Issue 02, April 2015

Ansila Henderson Kavitha K. V. Assistant professor M.Tech Student SCTCE, Pappanamcode Trivandrum, Kerala ansilahenderson@gmail.com Department of Computer Science & Engg. SCTCE, Pappanamcode Trivandrum, Kerala kavitha279@yahoo.co.in

Abstract: With the growing number of images on the Internet, there is a strong need for efficient image retrieval methods. Image features today are high-dimensional and image databases are huge, which challenges the ability of traditional systems, methods, and programs to perform retrieval in an efficient, useful, and timely manner. Recent techniques use hashing to embed high-dimensional image features into Hamming space, where search can be performed in real time based on the Hamming distance between compact hash codes. Traditional metrics (e.g., Euclidean) offer continuous distances, whereas Hamming distances are discrete integer values, so a large number of images may share the same Hamming distance to a query. For an image search mechanism, however, fine-grained ranking is very important.

Keywords: inverted file; hashing; KD tree; indexing

I. INTRODUCTION

The heterogeneity and size of digital image collections are growing exponentially, so there is a strong need for efficient image retrieval methods. The primary goal of an image management system is to search images efficiently and in a timely manner, so that it can keep pace with current applications. Image search should be based on visual content. To obtain more accurate results with high retrieval performance, researchers have devised techniques based on different parameters.
An image can be the representation of a real object or a scene. With the development of the Internet and the availability of image capturing devices such as digital cameras, huge numbers of images are created every day in different areas. Many fields, such as remote sensing, fashion, crime prevention, medicine, and architecture, need an efficient image retrieval system. Image similarity can be highly subjective and dependent on real-world context. A search for an image involves generating a set of feature vectors that describe the image's characteristics and comparing them with the features indexed for the stored images. The search result is produced from this comparison.

In recent techniques, images are usually represented using the popular bag-of-visual-words (BoW) framework. In this framework, local invariant image descriptors are extracted and quantized against a set of visual words, so methods used for document retrieval can be adapted to the image retrieval task. The BoW features can be embedded into compact hash codes for efficient search. Hashing is more suitable for image retrieval than tree-based indexing structures such as the kd-tree: it requires far less memory and works better for high-dimensional samples. Using the hash codes, image similarity can be measured efficiently in Hamming space with logical XOR operations. The measure is the Hamming distance, an integer obtained by counting the number of bit positions at which the binary codes of two images differ.

II. OBJECTIVE OF SCALABLE IMAGE SEARCHING

As the number of images increases, it becomes difficult to find visually similar content. Exhaustive search is infeasible for large-scale applications because it takes too much time, so efficient search mechanisms such as indexing methods are needed to provide both acceptable search time and retrieval accuracy.
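The Hamming-distance comparison described in the Introduction can be illustrated with a short sketch. The 8-bit codes and image names below are made-up values for illustration only; real systems use longer codes produced by a learned or random hashing function.

```python
def hamming_distance(code_a, code_b):
    """Count the bit positions at which two integer hash codes differ."""
    # XOR leaves a 1 exactly where the two codes disagree.
    return bin(code_a ^ code_b).count("1")


def rank_by_hamming(query_code, database):
    """Return (distance, name) pairs sorted by distance to the query."""
    return sorted(
        (hamming_distance(query_code, code), name)
        for name, code in database.items()
    )


# Hypothetical 8-bit hash codes for four database images.
database = {
    "img_a": 0b10110010,
    "img_b": 0b10110001,
    "img_c": 0b01001101,
    "img_d": 0b10100011,
}
query = 0b10110011
ranking = rank_by_hamming(query, database)
```

In this toy example three of the four images end up at distance 1 from the query, which shows why integer-valued Hamming distances alone give only a coarse ranking.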
The frequency of similar objects in the data space increases with the number of images. These objects may have similar semantics, so inferences based on nearest neighbors become more reliable. Images are usually described by sequences of descriptor vectors with more than a thousand dimensions, and image similarity is examined by nearest neighbor search.

III. LITERATURE REVIEW

Existing image features are high-dimensional and image databases are huge, so comparing the query image with every database sample is computationally prohibitive, which makes an efficient search mechanism critical. Recent image feature representations such as BoW resemble the bag-of-words representation of textual documents, so methods used for document retrieval have been adopted for image retrieval. Existing work on efficient search mechanisms falls mainly into three categories:
1. Inverted file
2. Tree-based indexing
3. Hashing

A. Inverted file

J. Zobel and A. Moffat describe the inverted file for retrieving similar documents [1]. An inverted file is an index structure used for text query evaluation. It maps terms to the documents that contain them: for each term, it keeps a list of the identifiers of the documents containing that term. It has two major components:
1. Search structure or vocabulary
2. Set of inverted lists
For each term, the search structure stores a count and a pointer. The count specifies the number of documents containing the term, and the pointer points to the start of the corresponding inverted list.
The inverted list stores the identifiers of the documents containing a particular term, together with the number of times the term occurs in each document. The word positions within documents can also be recorded. These components provide all the information needed for query evaluation. A complete text database system contains several other structures, including the documents themselves and a table that maps ordinal document numbers to disk locations. Each inverted list is assumed to be stored contiguously, or as a sequence of blocks that are linked or indexed in some way. The similarity score for a document is calculated from how many times a particular term occurs in that document.

In a ranked query, a phrase is treated as an ordinary term: a lexical entity that occurs in given documents with given frequencies and contributes to the similarity score of a document when it appears there. Similarity can therefore be computed in the usual way, but the inverted lists of the terms in the phrase must be combined, using a Boolean intersection algorithm, to construct an inverted list for the phrase itself. Alternatively, a set of identified phrases can be added to the vocabulary with their own inverted lists, so that users can query them without any alteration to the query evaluation procedures. However, such indexing is potentially expensive. The number of distinct two-word phrases grows more rapidly than the number of distinct terms, there is no obvious mechanism for accurately identifying which phrases might be used in queries, and the number of candidate phrases is enormous.

The inverted index was initially proposed for document retrieval, where it is still very popular, and was later introduced to image retrieval. In this setting, each visual word keeps a list of references to the images that contain it, so that for a query with several visual words the relevant images can be located quickly.
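The structure described above can be sketched as follows. This is a minimal illustration, not the implementation from [1]: the function names are hypothetical, documents (or images) are represented simply as lists of term or visual-word identifiers, and the score is the raw sum of term frequencies rather than a full statistical similarity measure.

```python
from collections import Counter, defaultdict


def build_inverted_file(documents):
    """Build the vocabulary (term -> document count) and the inverted lists
    (term -> list of (doc_id, term_frequency) postings)."""
    inverted_lists = defaultdict(list)
    for doc_id, terms in enumerate(documents):
        for term, freq in sorted(Counter(terms).items()):
            inverted_lists[term].append((doc_id, freq))
    vocabulary = {term: len(postings) for term, postings in inverted_lists.items()}
    return vocabulary, dict(inverted_lists)


def rank(query_terms, inverted_lists):
    """Score only the documents that share at least one term with the query."""
    scores = Counter()
    for term in set(query_terms):
        for doc_id, freq in inverted_lists.get(term, []):
            scores[doc_id] += freq  # simplified score: summed term frequency
    # Highest score first; ties broken by document identifier.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))


# Three toy "images", each described by a list of visual-word identifiers.
documents = [["w3", "w7", "w7"], ["w1", "w3"], ["w7"]]
vocabulary, inverted_lists = build_inverted_file(documents)
results = rank(["w7", "w3"], inverted_lists)
```

Only documents that appear in the inverted lists of the query terms are touched, which is what makes the structure efficient when queries contain few terms.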
A key difference between document retrieval and visual search is that textual queries usually contain very few words, whereas a single image may contain hundreds of visual words, resulting in a large number of candidate images that need further verification. This largely limits the application of inverted files to large-scale image search. Increasing the visual vocabulary size in BoW reduces the number of candidates, but it also increases memory usage [2].

B. Tree-based indexing

C. Silpa-Anan and R. Hartley proposed optimised KD-trees for fast image descriptor matching [3]. Tree-based indexing of this kind can substantially improve image retrieval quality. The elements stored in a KD-tree are high-dimensional vectors. At the root of the tree, the data is split into two halves by a hyperplane orthogonal to a chosen dimension at a threshold value. This split is usually made at the median of the dimension with the greatest variance in the data set. Comparing the query vector with the splitting value determines in which half of the data the query vector lies. Each of the two halves is then split recursively in the same way to produce a fully balanced binary tree. At the bottom of the tree, each leaf node corresponds to a single point in the data set, although in some cases a leaf may contain more than one point. The height of the tree is log_2 N, where N is the number of points in the data set [3].

For a given query vector, descending the tree takes log_2 N comparisons. The data point associated with the leaf reached in this way is the first candidate for the nearest neighbor. Each node in the tree corresponds to a cell, as shown in the figure, and any query point lying in a given leaf cell leads to the same leaf node. KD-trees are effective in low dimensions, but their efficiency decreases for high-dimensional data. For high-dimensional image features, a large number of nodes may have to be searched.
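The construction and search just described can be sketched with a single KD-tree. This is an illustrative simplification, not the optimised multi-tree method of [3]: nodes are plain dictionaries, the names `build`, `height`, and `search` are hypothetical, and the optional `max_leaves` argument is an assumed way of capping how many leaves are examined when the search backtracks.

```python
import numpy as np


def build(points):
    """Split at the median of the highest-variance dimension until one point remains."""
    if len(points) <= 1:
        return {"points": points}
    dim = int(np.argmax(points.var(axis=0)))
    order = np.argsort(points[:, dim])
    mid = len(points) // 2
    return {
        "dim": dim,
        "threshold": float(points[order[mid], dim]),
        "left": build(points[order[:mid]]),   # coordinates <= threshold
        "right": build(points[order[mid:]]),  # coordinates >= threshold
    }


def height(node):
    """Number of comparisons needed to reach the deepest leaf."""
    if "points" in node:
        return 0
    return 1 + max(height(node["left"]), height(node["right"]))


def search(tree, query, max_leaves=None):
    """Nearest-neighbor search; max_leaves limits the amount of backtracking."""
    best = {"point": None, "dist": np.inf, "leaves": 0}

    def visit(node):
        if max_leaves is not None and best["leaves"] >= max_leaves:
            return
        if "points" in node:
            best["leaves"] += 1
            dists = np.linalg.norm(node["points"] - query, axis=1)
            i = int(np.argmin(dists))
            if dists[i] < best["dist"]:
                best["dist"], best["point"] = float(dists[i]), node["points"][i]
            return
        gap = query[node["dim"]] - node["threshold"]
        near, far = (node["left"], node["right"]) if gap < 0 else (node["right"], node["left"])
        visit(near)
        # The far half can only contain a closer point if the splitting
        # hyperplane is nearer than the best distance found so far.
        if abs(gap) < best["dist"]:
            visit(far)

    visit(tree)
    return best["point"], best["dist"], best["leaves"]
```

Calling `search(tree, query)` without a limit returns the exact nearest neighbor, while `search(tree, query, max_leaves=1)` returns only the first candidate reached by descending the tree. Comparing the number of leaves visited in low and high dimensions is a simple way to observe the loss of efficiency mentioned above.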
A KD-tree therefore spends a lot of time backtracking through the tree to find the optimal solution. When the amount of backtracking is limited, the certainty of finding the absolute minimum is sacrificed and replaced with probabilistic performance [3]. Recent research aims to increase the probability of success while keeping backtracking within reasonable limits.

C. Hashing Techniques

Binary hash codes are compact in memory, and search can be performed efficiently using hash table lookup or bitwise operations. Hashing therefore satisfies both the time and the memory requirements. Hashing methods used for image retrieval can be divided into three main categories [12]: unsupervised methods, supervised methods, and semi-supervised methods.

1. Supervised methods: Supervised hashing is based on the machine learning task known as supervised learning, which infers functions from labeled training data. The training data consists of a set of training samples, each of which is a pair made up of an input object and a desired output value. The functions produced by the supervised learning algorithm can be used to map new samples. In the optimal scenario, they correctly determine the class labels of unseen samples, so the learning algorithm needs to generalize from the training data to unseen situations in a reasonable way.

2. Unsupervised methods: Unsupervised hashing is based on the machine learning task known as unsupervised learning. It uses unlabeled data to generate binary codes for the given points, and it tries to find hidden structure in the unlabeled data.
There is no error signal with which to evaluate a possible solution. Many methods employed in unsupervised learning are based on data mining methods. They include clustering (e.g., k-means, mixture models) and blind signal separation using feature extraction techniques for dimensionality reduction.

3. Semi-supervised methods: Semi-supervised learning is a class of learning techniques that also makes use of unlabeled data for training. Generally, a small amount of labeled data is used together with a large amount of unlabeled data. Semi-supervised learning falls between unsupervised learning (without any labeled training data) and supervised learning (with completely labeled training data). Many machine learning researchers have found that a considerable improvement in learning accuracy can be achieved by combining a small amount of labeled data with the unlabeled data.

Locality Sensitive Hashing (LSH) is one of the most popular unsupervised hashing methods in computer vision. Another effective method, Spectral Hashing (SH), was proposed by Weiss et al. [4]. Since unsupervised methods do not require any labeled data, their parameters are easy to learn using a pre-specified distance metric. However, in vision problems the similarity between data points sometimes cannot be defined with a simple metric, and metric similarity of the image descriptors may not preserve semantic similarity. In that case, pairs of images labeled as 'similar' or 'dissimilar' need to be provided. From such pairwise labeled data, a hashing mechanism can automatically generate codes that respect the semantic similarity. The output of supervised methods may be meaningful to humans, but it is difficult to label all the images in a huge database, and not everything in the real world has a distinctive meaning. Image similarity is more closely related to high-level image features than to low-level features.
It is therefore better to use semi-supervised learning for the image retrieval task.

Locality Sensitive Hashing (LSH)

P. Indyk and R. Motwani proposed Locality Sensitive Hashing [5]. LSH maps similar samples to the same bucket with high probability, so the locality of the original space is preserved in the Hamming space. More precisely, the hashing functions h(.) from an LSH family satisfy the following locality preserving property [5]:

P\{h(x) = h(y)\} = \mathrm{sim}(x, y)    (2.1)

where the similarity measure can be directly linked to the distance function. A typical category of LSH functions consists of random projections and thresholds:

h(x) = \mathrm{sign}(w^{T} x + b)    (2.2)

where w is a random hyperplane and b is a random intercept. The random vector w is data independent and is usually constructed by sampling each component of w randomly from a p-stable distribution. Although there is an asymptotic theoretical guarantee for random-projection-based LSH, it is not very efficient in practice since it requires multiple tables with long codes. Consider a total of l hash tables, each keyed by a K-bit code H(x) = [h_1(x), ..., h_K(x)]. When the K bits are drawn independently, the probability that two samples collide in a table is the single-bit collision probability raised to the K-th power. For a large-scale application, K should be considerably large to reduce the size of each hash bucket. However, a large value of K decreases the collision probability between similar samples. To overcome this drawback, multiple hash tables need to be constructed, which is inefficient because it incurs extra storage cost and longer query time. Kulis and Grauman [6] extended LSH to work in arbitrary kernel space, and Chum et al. [7] proposed min-hashing to extend LSH to sets of features.

Spectral Hashing (SH)

Weiss et al. proposed a spectral hashing (SH) method that hashes the input space based on the data distribution [4].
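As a point of comparison with the data-dependent approach just introduced, the data-independent random-projection scheme of equation (2.2) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the class name `RandomProjectionLSH` and its methods are hypothetical, w is drawn from a standard Gaussian (which is 2-stable), and b is drawn from the same distribution purely for illustration.

```python
from collections import defaultdict

import numpy as np


class RandomProjectionLSH:
    """l hash tables, each keyed by a K-bit code built from h(x) = sign(w^T x + b)."""

    def __init__(self, dim, n_bits, n_tables, seed=0):
        rng = np.random.default_rng(seed)
        # One (n_bits x dim) random projection matrix and intercept vector per table.
        self.W = rng.standard_normal((n_tables, n_bits, dim))
        self.b = rng.standard_normal((n_tables, n_bits))
        self.tables = [defaultdict(list) for _ in range(n_tables)]

    def codes(self, x):
        """Return one K-bit key (a tuple of booleans) per table."""
        bits = (self.W @ x + self.b) > 0  # shape: (n_tables, n_bits)
        return [tuple(row.tolist()) for row in bits]

    def index(self, item_id, x):
        for table, key in zip(self.tables, self.codes(x)):
            table[key].append(item_id)

    def query(self, x):
        """Union of the buckets that x falls into across all tables."""
        candidates = set()
        for table, key in zip(self.tables, self.codes(x)):
            candidates.update(table.get(key, []))
        return candidates
```

Increasing `n_bits` makes each bucket smaller but also makes it less likely that a true neighbor shares the query's bucket, which is why `n_tables` usually has to grow with it, at the storage and query-time cost noted in the LSH discussion.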
In spectral hashing, the bits are calculated by thresholding a subset of eigenvectors of the Laplacian of the similarity graph. The SH algorithm consists of three key steps [4]:
1. The extraction of maximum variance directions through Principal Component Analysis (PCA) on the data.
2. The direction selection, which prefers to partition projected dimensions with large range and small spatial frequency.
3. The partition of projected data by a sinusoidal function with previously computed angular frequency.
SH is very effective in encoding large-scale, low-dimensional data, since the important PCA directions are selected multiple times to create binary bits. However, for high-dimensional image features, where many directions contain enough variance, each PCA direction is usually picked only once. This is because the top few projections have similar ranges, so a low spatial frequency is preferred. In such cases, SH approximately reduces to a PCA projection followed by a mean partition. In SH, the projection directions are data dependent but learned in an unsupervised manner. Also, the assumption of a uniform data distribution usually does not hold for real-world data.

Semantic Hashing

R. Salakhutdinov and G. Hinton proposed semantic hashing for the retrieval of similar documents [8]. Learning with deep belief networks (DBNs), originally proposed for dimensionality reduction [9], was later adopted for semantic hashing in large-scale search applications. The deep belief network needs image labels during the training phase to generate hash codes. The DBN structure gradually reduces the number of units in each layer.
In this way, the high-dimensional input of original image features is projected into a compact Hamming space. A general DBN is a directed acyclic graph in which each node represents a stochastic variable. Hash code generation with a DBN involves two main steps: first, learning the interactions between variables, and second, inferring observations from inputs. Learning a DBN with multiple layers requires estimating millions of parameters. To reduce this difficulty, the DBN is structured from Restricted Boltzmann Machines (RBMs) [9]. Each RBM consists of two layers, one of visible units and one of hidden units. Multiple RBMs are stacked to form a deep belief net, and the network can be specifically designed to reduce the number of units layer by layer and finally output compact hash codes.

The training process of a DBN consists of two main stages: unsupervised pre-training and supervised fine-tuning. The pre-training phase aims to place the network weights (and biases) in suitable neighborhoods of parameter space. Once the parameters of one layer have converged via Contrastive Divergence, the outputs of that layer are fixed and treated as inputs to drive the training of the next layer. During the fine-tuning stage, labeled data is used to refine the network parameters through back-propagation: the parameters are adjusted to maximize the fine-tuning objective function using the conjugate gradient method. The optimal weights of the entire network are obtained in this way.

Semi-Supervised Hashing

Wang et al. proposed semi-supervised hashing (SSH) for scalable image retrieval [10]. In this method, for a given set of points, a fraction of the pairs are associated with two categories of label information. A neighbor-pair is a pair of points that are either neighbors in a metric space or share common class labels.
Similarly, a pair of points is called a non-neighbor-pair if the two samples are far apart in metric space or have different class labels. The objective function of SSH consists of two major components: supervised empirical fitness and unsupervised information-theoretic regularization. The supervised part tries to minimize an empirical error on a small amount of labeled data. The unsupervised term provides effective regularization by maximizing desirable properties such as the variance and independence of individual bits.

Consider a set of n points S = {p_i}, i = 1, 2, ..., n, p_i ∈ R^D, in which a small fraction of pairs are associated with two categories of label information. A pair of points (p_i, p_j) belongs to the set of neighbor-pairs M when the two points share common class labels, and to the set of non-neighbor-pairs C when they have no common class label. The objective function is

J(H) = \sum_{k} \left[ \sum_{(p_i, p_j) \in M} h_k(p_i) h_k(p_j) - \sum_{(p_i, p_j) \in C} h_k(p_i) h_k(p_j) \right] + \lambda \sum_{k} \mathrm{var}[h_k(p)]    (2.3)

where the first term measures the empirical accuracy over the labeled samples in M and C, the second term realizes the maximum entropy principle, and \lambda weights the second term against the first. The empirical accuracy for a family of hash functions H is defined as the difference between the total number of correctly classified pairs and the total number of wrongly classified pairs. Maximizing the empirical accuracy for just a few pairs can lead to severe overfitting, so a regularizer is used that utilizes both the labeled and the unlabeled data. From an information-theoretic point of view, one would like to maximize the information provided by each bit. By the maximum entropy principle, a binary bit that gives a balanced partition of the points provides maximum information [10]. The maximum variance condition states that a hash function with maximum entropy must maximize the variance of the hash values, and vice versa.

D. Comparison of mechanisms in image retrieval

A comparison of mechanisms used in image retrieval is given below:

TABLE I. IMAGE SEARCHING TECHNIQUES COMPARISON TABLE

TECHNIQUE | METHODOLOGY | ADVANTAGES | DISADVANTAGES
Inverted Files | Indexing and query evaluation | Queries are ranked by statistical similarity | Increases memory usage
Tree-based Indexing | Multiple randomized KD-trees | Dramatic improvement in retrieval quality | Does not work well with high-dimensional features
Hashing Techniques [5] | Semi-supervised hashing | Minimizes the empirical error on labeled data | Performance degrades with less training data due to overfitting

IV. CONCLUSION

With the increasing demands of multimedia applications over the Internet, the importance of image retrieval has also increased. This survey discussed recent image searching techniques, each of which has its own advantages as well as certain limitations. State-of-the-art solutions often use hashing methods to embed high-dimensional image features into Hamming space, where search can be performed in real time based on the Hamming distance between compact hash codes. Unlike traditional metrics (e.g., Euclidean) that offer continuous distances, Hamming distances are discrete integer values. As a consequence, a large number of images often share the same Hamming distance to a query, which hurts search results in applications where fine-grained ranking is very important.

REFERENCES

[1] J. Zobel and A. Moffat, "Inverted files for text search engines," ACM Comput. Surveys, 2006.
[2] H. Jegou, M. Douze, and C. Schmid, "Packing bag-of-features," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2009.
[3] C. Silpa-Anan and R. Hartley, "Optimised kd-trees for fast image descriptor matching," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2008.
[4] Y. Weiss, A. Torralba, and R. Fergus, "Spectral hashing," Adv. Neural Inf. Process. Syst., 2008.
[5] P. Indyk and R. Motwani, "Approximate nearest neighbors: Towards removing the curse of dimensionality," Proc. Symp. Theory of Computing, 1998.
[6] B. Kulis and K. Grauman, "Kernelized locality-sensitive hashing for scalable image search," Proc. IEEE Int. Conf. Computer Vision, 2009.
[7] O. Chum, M. Perdoch, and J. Matas, "Geometric min-hashing: Finding a (thick) needle in a haystack," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2009.
[8] R. Salakhutdinov and G. Hinton, "Semantic hashing," Proc. Workshop of ACM SIGIR Conf. Research and Development in Information Retrieval, 2007.
[9] G. Hinton and R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, 2006.
[10] J. Wang, S. Kumar, and S.-F. Chang, "Semi-supervised hashing for scalable image retrieval," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2010.
[11] A. Torralba, R. Fergus, and Y. Weiss, "Small codes and large image databases for recognition," Proc. of CVPR, 2008.
[12] G. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Computation, 2002.
[13] E. Horster and R. Lienhart, "Deep networks for image retrieval on large-scale databases," Proc. ACM Int. Conf. Multimedia, 2008.
[14] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov, "Neighbourhood components analysis," Proc. of NIPS, 2005.
[15] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov, "Neighbourhood components analysis," Adv. Neural Inf. Process. Syst., 2004.
[16] Y.-G. Jiang, J. Wang, X. Xue, and S.-F. Chang, "Query-adaptive image search with hash codes," IEEE Transactions on Multimedia, 2013.