Photo Search by Face Positions and Facial Attributes on Touch Devices

Yu-Heng Lei, Yan-Ying Chen, Lime Iida, Bor-Chun Chen, Hsiao-Hang Su, Winston H. Hsu
National Taiwan University, Taipei, Taiwan
{ryanlei, yanying}@cmlab.csie.ntu.edu.tw, {limeiida, siriushpa}@gmail.com, b95901019@ntu.edu.tw, winston@csie.ntu.edu.tw

ABSTRACT
With the explosive growth of camera devices, people can freely take photos to capture moments of life, especially those shared with friends and family. A better way to organize the increasing number of personal and group photos is therefore needed. In this paper, we propose a novel way to search for face images according to the facial attributes and face similarity of the target persons. To better match the face layout the user has in mind, our system allows the user to graphically specify face positions and sizes on a query “canvas,” where each attribute or identity is represented by an “icon.” Moreover, we provide aesthetics filtering to enhance the visual experience by removing candidates of poor photographic quality. The scenario has been realized on a touch device with an intuitive user interface. With the proposed block-based indexing approach, we achieve near real-time retrieval (0.1 second on average) on a large-scale dataset (more than 200k faces in Flickr images).

[Figure 1 image: query canvases (a)-(e), each with its top ranked results 1-5.]
Figure 1: Example queries and top 5 retrieval results from our image search system. (a) specifies two arbitrary faces, with the larger one on the left and the smaller one on the right. (b) further constrains the left face to have the attributes “female” and “youth” and the right face to have the attribute “kid.” (c) specifies two faces of “male” and “African” on the left and right, in addition to an arbitrary face in the center. (d) specifies a particular face in the database at the desired position and in the desired size.
(e) specifies the previous database face on the left, and a face of “female” and “youth” on the right.

Categories and Subject Descriptors
H.3.1 [Content Analysis and Indexing]: Indexing methods; H.5.2 [User Interfaces]: Input devices and strategies

General Terms
Algorithms, Design, Experimentation, Performance

Keywords
Face attributes, Face retrieval, Touch-based user interface, Block-based indexing

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. MM’11, November 28–December 1, 2011, Scottsdale, Arizona, USA. Copyright 2011 ACM 978-1-4503-0616-4/11/11 ...$10.00.

1. INTRODUCTION
When browsing photos, what makes an image memorable? MIT Media Relations [6] pointed out that images with people in them are the most memorable, followed by images of human-scale space and close-ups of objects. The phenomenon is even more pronounced in consumer photos, because most of them contain family members or close friends whom users care about and usually keep in mind. Users may forget where or when they took the photos, but they do not forget their friends and family. They can therefore use facial attributes and face identities to formulate their search intentions effectively. Furthermore, reviewing the retrieved images is likely to bring more scenes back to the user's memory. For example: “Alice seems to be standing next to me, and an African kid is sitting in the middle.” The photo imagined in mind can be organized intuitively by graphically arranging people on a query “canvas” and refined by designating more face attributes and identities (Fig. 1). Although consumer photos naturally lack annotations, automatic facial attribute and identity recognition techniques make the scenario more economical and scalable.

Recently, some efforts have attempted to capture users' intentions by allowing them to visually describe the image content and layout on a query canvas. [1] revisits the problem of sketch-based image search for scene photos. However, the gap between what the user has in mind and the query they specify can still be large even in such a system. For instance, users with poor drawing skills may have a hard time describing their intention accurately. In addition, many object details are naturally difficult to sketch, such as the age of a face. To deal with this sketching difficulty, [7] allows the user to formulate a 2-D “semantic map” by placing text boxes of various search concepts at desired positions and in desired sizes. However, it does not address the problem of discovering specific persons in the intended scene and is therefore inapplicable to managing consumer photos. Moreover, although both sketch-based and concept-based queries aim at a better user experience, typing text is not a simple operation on touch panels. Recent trends show that the popularity of touch devices brings new opportunities and challenges to image organization.

In this paper, we propose a novel system for searching consumer photos by exploiting computer vision technologies for estimating facial attributes and face similarity. Rather than laboriously sketching detailed appearances [1] or typing text [7], our work allows users to formulate a query canvas by placing “icons” of desired facial attributes (Fig. 1 (b)(c)), a specific face instance (Fig. 1 (d)), or a specific face instance with a wildcard face (Fig.
1 (e)) at desired positions and in desired sizes. Moreover, we provide aesthetics filtering to retain images with better composition, colorfulness, and contrast, which enhances the visual experience and saves the time otherwise spent reviewing poorly photographed (usually unintended) candidates. The scenario has been realized on a touch device with an intuitive user interface. To tackle the computational overhead of searching face positions in large-scale datasets, we propose a block-based indexing approach that enables rapid on-line retrieval (0.1 second on average). The real system currently processes more than 200k faces in Flickr images, and the approach scales to larger photo collections.

2. OBSERVATIONS AND SYSTEM OVERVIEW
When looking for a photo they have in mind, users find it difficult and inefficient to indicate the exact file location in storage, even when the photos are well categorized by time or geo-location. Some prevailing photo sharing websites employ crowd-sourcing to obtain free tags semantically associated with images, but this mechanism cannot be carried over to a personal photo organizer because users are not expected to actively annotate their own photos. Although current commercial photo management software has begun to exploit face recognition and face clustering, such solutions still lack the capability of searching for scenes with faces arranged in a specific layout.

In light of this observation, our proposed system (Fig. 2) attempts to make consumer photo management faster and easier. The contributions of this paper are (1) analyzing “wild photos” that have no tag information at all by automatic facial attribute detection (Sec. 3) and face similarity estimation (Sec. 4), (2) enhancing the visual experience by aesthetic filtering, which removes image candidates of poor photographic quality (Sec. 5), (3) advancing the search pattern from query by a single face instance to query by multiple attributed faces arranged on a canvas (Sec. 6.1, 6.2), and (4) supporting rapid and accurate search for face positions by the proposed block-based indexing approach (Sec. 6.3).

[Figure 2 diagram; component labels: User, Query, Server, Online, Offline, Pre-load, Return, Relevance Ranking, Aesthetic Filtering, Image Database, Face Detection, Attribute Detection, Similarity Estimation, Aesthetics Assessment, Block-based Indexing.]
Figure 2: Framework of the proposed system. Photos are analyzed by facial attribute detection, face similarity estimation, aesthetics assessment, and block-based indexing in the off-line process.

3. DETECTING FACE ATTRIBUTES
Facial attributes carry rich information about people and have been shown to be promising for seeking specific persons in face retrieval and surveillance systems. In this work, we use eight people attributes to categorize faces in large-scale photo collections: two of gender (female, male), three of age (kid, youth, elder), and three of race (Caucasian, Asian, African). We will extend the set to more facial attributes in the future.

In the training phase, all the attributes are learned through a combination of Support Vector Machines (SVM) and Adaboost, similar to [4]. First, we crawl user-contributed photos from Flickr and extract facial regions with a face detector. The face images are then annotated manually and decomposed into different face components (e.g., whole face, eyes, nose, mouth), from each of which various low-level features (e.g., Gabor filter, HOG, grid color moments, and local binary patterns) are extracted. Each mid-level feature is an SVM classifier learned with a specific low-level feature extracted from a specific face component. Finally, the optimal mid-level features for the designated facial attribute are selected and weighted through Adaboost. The combined strong classifier reflects the most important parts for that attribute; for example, <Gabor, whole face> is most effective for the female attribute, while <color, whole face> is most effective for the African attribute.

In experiments on the benchmark data [4], the approach detects facial attributes effectively and achieves more than 80% accuracy on average. The framework is also generic across facial attributes, so it scales to profiling people with more attributes. The real-valued attribute scores are normalized to the interval (0,1) by a sigmoid function before they are used.

4. ESTIMATING FACE SIMILARITIES
To enable search by face appearance, we adapt the face retrieval framework of [2]. The advantages of this framework include (1) efficiency, which is achieved by using a sparse representation of the face image with inverted indexing, and (2) leveraging identity information, which is done by incorporating the identity information into the optimization process for codebook construction. Both points suit our system. In detail, detected faces are first aligned into a canonical position, and component-based local binary patterns are then extracted from the images to form feature vectors. Sparse representations are further computed from these feature vectors based on a learned dictionary combined with extra identity information. By incorporating this framework into our system, the user can not only specify the positions and attributes of a face but also use a face image itself, together with a position, as the query. The real-valued similarity scores are normalized to the interval (0,1) before they are used.

5. ASSESSING PHOTO AESTHETICS
Filtering based on photo aesthetics is also integrated into the proposed system. Following [5], we extract bag-of-aesthetics-preserving features to model photo aesthetics at the global scope. These features have the following advantages: 1) photos can be modeled at multiple resolutions by the decomposition method; 2) photos can be described from different aesthetic aspects, including color, texture, saliency, and edge, by applying the patchwise operations proposed in [5]; 3) contrast information, to which humans are more sensitive, is taken into consideration. Based on these features, the aesthetic properties of photo composition, colorfulness, contrast, etc., can therefore be modeled jointly.

6. PHOTO SEARCH ON TOUCH DEVICES

6.1 User Interface
Since our system is well suited to a touch-based interface, we have implemented the user interface on a tablet device, as shown in Figure 3. The user can drag faces from the top-right onto the canvas, and the results are displayed in the result panel in real time. Holding a face icon invokes a pop-up attribute selector. We have designed a total of 48 face icons (3 x 4 x 4) to describe the attribute combinations. To search by similarity, the user can hold a face in the result panel and use the new icon on the top-right to find similar faces in other photos. There is also a simple aesthetic filter to help find better-looking photos.

Figure 3: The touch-based interface of our system. Users can formulate a query by adding face icons from the top-right control panel and dragging the icons onto the canvas at desired positions. When an icon is clicked, a pop-up window shows up for attribute selection. Users can browse the query results at the bottom and hold any face back onto the query canvas to find more images with similar faces.

6.2 Ranking Function
First note that in our system, coordinates are always represented as a fraction of the image width or height. This allows the computation to adapt to the various aspect ratios of the query canvas and the database images. For a (query image, target image) pair, the ranking problem is cast as a greedy version of maximum bipartite matching, with the two sets Q and T being the faces in the query image and in the target image, respectively. A (query face, target face) pair is denoted as (q, t). Also note that bipartite matching ensures each face is matched at most once. By greedy, we mean that the first query face is matched first, by choosing the best matching face available. This significantly reduces the computational cost and coincides with the idea that the first face coming to the user's mind is the most important.

The matching score between a face q in the query and a face t in the target is proposed as a linear combination of face similarity, face attributes, face center position, and face size:

\[
\mathrm{match}(q,t) = w_{sim}\,\mathrm{Sim}(q,t) + w_{attr}\Bigl(\prod_{\alpha=1}^{|\alpha|}\mathrm{Attr}_{\alpha}(q,t)\Bigr)^{1/|\alpha|} + w_{pos}\Bigl(1-\frac{d_c}{\sqrt{2}}\Bigr) + w_{size}\Bigl(1-\frac{d_w+d_h}{2}\Bigr) \tag{1}
\]

The first term is the similarity score between q and t. The second term is the geometric mean of all the three attribute scores ($|\alpha| = 3$). If an attribute is not specified, it is counted as 1. Note that in the UI, face similarity and face attributes are not specified at the same time. The third and fourth terms normalize the errors in position and size to the interval (0,1), where $d_c$ is the L2 distance between the face centers, and $d_w$ and $d_h$ are the L1 distances between the face widths and heights. The overall matching score is then proposed as the arithmetic mean of the individual matching scores:

\[
\mathrm{score}(Q,T) = \frac{1}{\max(|Q|,|T|)}\sum_{(q,t)\in M}\mathrm{match}(q,t) \tag{2}
\]

M denotes the set of matched faces. |Q| and |T| in the denominator are the numbers of faces in the query image and the target image, so a mismatch between the two numbers incurs a large penalty in the overall score.

6.3 Block-based Indexing
We apply a block-based method to spatially index all the database faces. Since the face center coordinates, width, and height, denoted as x, y, w, and h, are fractions, there are infinitely many possible values in the interval (0,1), which makes direct indexing computationally infeasible and naive quantization too sensitive.
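The quantization and block-id encoding that this section relies on can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function names (`quantize`, `block_id`, `decode`, `build_index`, `candidates`) and the tolerance parameters `pos_tol` and `size_tol` are hypothetical, the index stores image ids only (the paper's index also stores attribute scores), and the notions of overlapping blocks and invalid (x, y, w, h) combinations are not modeled. L = 20 is the value reported in the experiments.

```python
from collections import defaultdict
from itertools import product

L = 20  # number of quantization levels (value reported in the experiments)


def quantize(v, levels=L):
    """Map a fraction in [0, 1] to an integer level in 0..levels-1."""
    return min(int(v * levels), levels - 1)


def encode(cell, levels=L):
    """Pack four quantized digits (x, y, w, h) into one base-`levels` number."""
    qx, qy, qw, qh = cell
    return ((qx * levels + qy) * levels + qw) * levels + qh


def block_id(x, y, w, h, levels=L):
    """Block id of a fractional face box: quantize, then encode."""
    return encode(tuple(quantize(v, levels) for v in (x, y, w, h)), levels)


def decode(bid, levels=L):
    """Inverse of encode: recover the four quantized digits from a block id."""
    digits = []
    for _ in range(4):
        bid, d = divmod(bid, levels)
        digits.append(d)
    return tuple(reversed(digits))


def build_index(faces):
    """Inverted index: block id -> list of image ids with a face in that block."""
    index = defaultdict(list)
    for image_id, (x, y, w, h) in faces:
        index[block_id(x, y, w, h)].append(image_id)
    return index


def candidates(index, query, pos_tol=1, size_tol=1, levels=L):
    """Collect image ids from the query block and its neighbors.

    pos_tol and size_tol are hypothetical +/- tolerances (in levels) that play
    the role of a sliding window over neighboring blocks.
    """
    qx, qy, qw, qh = (quantize(v, levels) for v in query)
    found = set()
    for dx, dy in product(range(-pos_tol, pos_tol + 1), repeat=2):
        for dw, dh in product(range(-size_tol, size_tol + 1), repeat=2):
            cell = (qx + dx, qy + dy, qw + dw, qh + dh)
            if all(0 <= c < levels for c in cell):
                found.update(index.get(encode(cell, levels), ()))
    return found
```

Because the block id is just the four quantized digits written in base L, no lookup table is needed to go from a face box to its block or back, which is one way to read the claim that the mapping is unique and storage-free.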
Therefore, we first quantize each of the four variables into L levels, pre-define overlapping blocks of various valid (x, y, w, h) combinations, and use them throughout the system. Note that not all of the $L^4$ combinations are valid blocks. The mapping between an (x, y, w, h) tuple and a block id is easily achieved by representing the block id as a 4-digit base-L number. This mapping is both unique and storage-free. We can then build an inverted index that records, for each block id, the ids of the images that contain a face in this block, together with all of their attribute scores.

Of course, examining only the faces in the query's own block is still too sensitive. In on-line search, the system therefore runs a small “sliding window” to compute the scores of faces in neighboring blocks. The range of the sliding window indicates the level of tolerance. For multiple-face queries, each face is processed separately. It is important to enforce the constraints that each target face is matched at most once and that each query face matches at most one face in the same target image.

7. EXPERIMENTS
In this section, we describe the dataset and implementation details, and evaluate the performance of the joint ranking of face attributes, face position, and face size. For the performance on face similarity and photo aesthetics, please refer to the corresponding references [2] and [5]. For a video demonstration of the system, please visit our project page: http://www.csie.ntu.edu.tw/~winston/projects/face

7.1 Dataset and Implementation Details
The dataset is composed of two portions. As mentioned in Section 3, we crawl a large number of user-contributed photos from Flickr as the main portion. For similar face retrieval, 732 daily photos containing 1,248 faces are added to the dataset as the second portion. After face detection by a commercial but free API [3], the dataset contains 115,487 images and 244,491 faces in total (2.117 faces per image). Similar face retrieval is intended only for the second portion, so only the pairwise similarity scores within the second portion are estimated, and faces in the first portion have zero similarity scores.

For the weights in the ranking function, we empirically choose $w_{sim} = 2.00$, $w_{attr} = 0.70$, $w_{pos} = 0.20$, and $w_{size} = 0.10$. This reflects the user's intention that face similarity and attributes are much more important when they are specified. For block-based indexing, we choose the number of quantization levels as L = 20, and the sliding window for tolerance is 5 levels in position and 4 levels in size. Aesthetic filtering is optionally applied to the initial result and keeps the images in the top 50% of aesthetic ranks in the final result. With the index and metadata preloaded, the proposed system reports a typical running time of 0.10 second on a 16-core, 2.40GHz Intel Xeon server with 48GB of RAM. The storage cost is 112MB.

7.2 Performance Evaluation
To evaluate our system, we manually create eight queries with different search intentions, listed in Table 1. We then ask twenty people to do the evaluation by showing them each query with the top 10 results retrieved by our system and asking them whether each retrieved result is relevant to the query. Table 1 shows the average results from the twenty users.

Table 1: Precision@10 of 8 selected queries.
#   Query Intention                            P@10
1   Single face (top-left)                     0.99
2   Single face (profile canvas)               0.93
3   Single face (close-up)                     0.99
4   Two faces (left and right)                 0.97
5   Single face (male, youth)                  0.75
6   Male (left) and female (right)             0.53
7   Female kid (left) and male kid (right)     0.29
8   Three faces on top and two below           0.68

Query tasks 1 through 4 all have precision higher than 90%. The precision decreases as the query becomes more complicated, because the attribute detector itself makes errors. When a query contains many attributes, it becomes harder to find images in which all the attributes are correctly detected. For instance, if the attribute detector has an 80% detection rate and three attributes are specified in a query, only about 51% of the images ($0.8^3 \approx 0.51$, assuming independent errors) have all three attributes correctly detected. Query task 7 has the lowest P@10. This is probably because it is naturally hard for the attribute detector to tell whether a kid is male or female.

8. CONCLUSIONS
Our work proposes a novel way of effectively organizing and searching consumer photos by placing attributed faces at desired positions and in desired sizes on a query canvas. We automatically detect facial attributes and measure face similarity in the off-line process to provide rapid on-line photo search. Integrated aesthetics assessment further saves the time spent browsing photos of poor quality. The scenario has been realized on a touch device with an easy-to-use interface and achieves fast retrieval response through the proposed block-based indexing approach.

9. REFERENCES
[1] Y. Cao et al. Edgel index for large-scale sketch-based image search. CVPR, 2011.
[2] B.-C. Chen, Y.-H. Kuo, Y.-Y. Chen, K.-Y. Chu, and W. Hsu. Semi-supervised face image retrieval using sparse coding with identity constraint. ACM Multimedia, 2011.
[3] face.com API. http://developers.face.com.
[4] N. Kumar et al. Facetracer: A search engine for large collections of images with faces. ECCV, 2008.
[5] H.-H. Su, T.-W. Chen, C.-C. Kao, S.-Y. Chien, and W. Hsu. Scenic photo quality assessment with bag of aesthetics-preserving features. ACM Multimedia, 2011.
[6] A. Trafton. What makes an image memorable? MIT Media Relations, 2011.
[7] H. Xu et al. Image search by concept map. SIGIR, 2010.
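As a supplementary illustration of the ranking function in Sec. 6.2, the matching score of Eq. (1) and the greedy, one-to-one overall score of Eq. (2) can be sketched in Python as below. This is a sketch under assumptions rather than the authors' code: the `Face` class, its `scores` and `wanted` fields, and the `sim` callback are hypothetical names, a requested attribute with no stored score is treated as 0, and the similarity term defaults to 0 when no similar-face query is given. The weights are the ones reported in Sec. 7.1.

```python
import math
from dataclasses import dataclass, field

# Weights as reported in Sec. 7.1.
W_SIM, W_ATTR, W_POS, W_SIZE = 2.00, 0.70, 0.20, 0.10
NUM_ATTR = 3  # one attribute each for gender, age, and race


@dataclass
class Face:
    x: float  # face center, as a fraction of the image width
    y: float  # face center, as a fraction of the image height
    w: float  # face width, as a fraction of the image width
    h: float  # face height, as a fraction of the image height
    scores: dict = field(default_factory=dict)  # target face: attribute -> score in (0, 1)
    wanted: tuple = ()  # query face: names of the requested attributes


def match(q, t, sim=0.0):
    """Eq. (1): matching score between a query face q and a target face t."""
    # Requested attributes use the target's scores (0 if missing, an assumption);
    # unspecified attributes count as 1.
    attr = [t.scores.get(name, 0.0) for name in q.wanted]
    attr += [1.0] * (NUM_ATTR - len(attr))
    geo_mean = math.prod(attr) ** (1.0 / NUM_ATTR)
    d_c = math.hypot(q.x - t.x, q.y - t.y)      # L2 distance between centers
    d_w, d_h = abs(q.w - t.w), abs(q.h - t.h)   # L1 distances in width and height
    return (W_SIM * sim + W_ATTR * geo_mean
            + W_POS * (1.0 - d_c / math.sqrt(2.0))
            + W_SIZE * (1.0 - (d_w + d_h) / 2.0))


def score(query, target, sim=lambda q, t: 0.0):
    """Eq. (2): greedy one-to-one matching, averaged over max(|Q|, |T|)."""
    if not query and not target:
        return 0.0
    free = list(target)  # target faces that are still unmatched
    total = 0.0
    for q in query:      # query faces are matched in the order they were placed
        if not free:
            break
        best = max(free, key=lambda t: match(q, t, sim(q, t)))
        total += match(q, best, sim(q, best))
        free.remove(best)
    return total / max(len(query), len(target))
```

With these weights, a query face that coincides exactly with a target face and specifies no attributes or identity scores 0.70 + 0.20 + 0.10 = 1.0, and an extra unmatched face in the target image halves the overall score, which illustrates the penalty for differing face counts.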