EmoPhoto: Identification of Emotions in Photos Information Systems
Transcription
EmoPhoto: Identification of Emotions in Photos Information Systems
EmoPhoto: Identification of Emotions in Photos Soraia Vanessa Meneses Alarcão Castelo Thesis to obtain the Master of Science Degree in Information Systems and Computer Engineering Supervisor: Prof. Manuel João Caneira Monteiro da Fonseca Examination Committee Chairperson: Prof. José Carlos Martins Delgado Supervisor: Prof. Manuel João Caneira Monteiro da Fonseca Member of the Committee: Prof. Daniel Jorge Viegas Gonçalves October 2014 ii Abstract Nowadays, with the development in digital photography and the increasing easiness of acquiring cameras, taking pictures is a common task. Thus, the number of images in private collections of each person or in the Internet is becoming bigger. Every time we use our collection of images, for example, to search for an image of a specific event, the images we receive will always be the same. However, our emotional state is not always the same: sometimes we are happy, and other times sad. Depending of the emotions perceived from the image, we are more receptive to some images than others. In the worst case, we will feel worse, which, given the importance of the emotions in our daily life, will lead to a significantly negative performance during cognitive tasks such as attention or problem-solving. Although it seems interesting to take advantage of the emotions that an image transmits, currently there is no way of knowing which emotions are associated with a given image. In order to identify the emotional content present in an image, as well as the category of those emotions (Negative, Positive or Neutral), we describe in this document two approaches: one using Valence and Arousal information, and the other one using the content of the image. The two developed recognizers achieved recognition rates of 89.20% and 68.68%, for the categories of emotions, and 80.13% for the emotions. Finally, we also describe a new dataset of images annotated with emotions, obtained from sessions with users. Keywords: emotion recognition, emotions in images, fuzzy logic, content-based image retrieval, emotionbased image retrieval iii Resumo Actualmente, com os desenvolvimentos na área da fotografia digital e a crescente facilidade de aquisição de câmaras fotográficas, tirar fotos tornou-se uma tarefa comum. Consequentemente, o número de imagens nas colecções particulares de cada pessoa, bem como das imagens disponı́veis na Internet, aumentou. Sempre que procuramos uma imagem de um determinado evento na nossa colecção particular, as imagens apresentadas serão sempre as mesmas. No entanto, o nosso estado emocional não permanece igual: por vezes estamos felizes, e outras vezes tristes. Dependendo das emoções percepcionadas a partir de uma imagem, estamos mais receptivos a algumas imagens do que outras. No pior caso, vamos sentir-nos pior, o que, dada a importância das emoções no nosso quotidiano, poderá conduzir a uma deterioração no desempenho de tarefas a nı́vel cognitivo, como atenção ou resolução de problemas. Embora pareça interessante aproveitar as emoções transmitidas pelas imagens, actualmente não existe nenhuma forma de saber quais as emoções que estão associadas a uma determinada imagem. A fim de identificar os conteúdos emocionais presentes numa imagem, assim como a categoria desses conteúdos (negativa, positiva ou neutra), descrevemos neste documento duas abordagens: uma recorrendo aos nı́veis de Valência e Excitação da imagem, e uma outra utilizando o conteúdo da mesma. Os dois reconhecedores desenvolvidos alcançaram taxas de reconhecimento de 89.20% e 68.68%, para as categorias de emoções, e 80.13% para as emoções. Por fim, criámos um novo conjunto de dados de imagens anotadas com emoções, obtidas a partir de sessões com utilizadores. Palavras-chave: reconhecimento de emoções, emoções em imagens, lógica difusa, recuperação de imagens baseada em conteúdo, recuperação de imagens baseada em emoções iv Acknowledgments I would like to thank my supervisor, Prof. Manuel João da Fonseca, not only for being an inspiration in the fields of Human-Computer Interaction and Multimedia Information Retrieval, but also for being a supportive and committed supervisor, who has always believed in my work, provided valuable feedback and motivated me all the way through this journey. To Prof. Rui Santos Cruz, thank you for also encouraging me to go even further in my academic choices and for being always available to sort out any issue that I came across. To my family, in general, thank you for forgiving my “absence” these past years due to my academic life. To my brother Pedro Castelo, my sister Alexandra Castelo, and my sister-in-law Laura Pereira, thank you for all your support and care, and for making sure that I would withstand these past years. To my mother Carmo Meneses Alarcão, thank you for always believing in my success, and for being with grandma when I was not able to. To Jorge Cabrita, thank you for being the “father” that I never had. To my second “mommy” Luı́sa Bravo da Mata, for encouraging me and always cheering me on each decision I took, thank you! My deepest, fondest and heartfelt thank you goes to my grandma Alcinda Meneses Alarcão, for all the sacrifices she made throughout her life in order to get me where I am now. Without her, I would not be who I am, and would not have gotten this far as I did. In the last couple of years, I was fortunate to find true and amazing friends. Every time I was happy they were there to smile and celebrate with me. However, all the times I needed their support, they were also there: listening, helping, and most of the time, telling “our” silly jokes to cheer me up! Therefore, each one of you is the family that I chose: Ana Sousa, Andreia Ferrão, Bernardo Santos, Joana Condeço, João Murtinheira, Inês Bexiga, Inês Castelo, Inês Fernandes, Margarida Alberto, Maria João Aires, Miguel Coelho, Ricardo Carvalho, Rui Fabiano, and last, but not least, my “sister” Vânia Mendonça. A special thank you to my favorite “grammar-police” staff: Bernardo, João, Miguel, and Vânia, for your patience and availability to proof countless times for each part of this work. Thanks to all of you, this final version became much more complete and typo-free. I appreciate everything that you have taught me, my English skills have improved so much! Catarina Moreira, João Vieira, and João Simões Pedro, thank you for your precious assistance with your knowledge of Machine Learning and Statistics. Thank you to all the amazing people that I had the pleasure to work with these past years, whether in class projects or other academic projects (especially NEIIST): David Duarte, David Silva, Fábio Alves, Luis Carvalho, Mauro Brito, Ricardo Laranjeiro, Rita Tomé, among many others. I would also like to thank to everyone who accepted to participate in the user sessions I performed in the context of this thesis. To each and every one of you - Thank you. v vi To my grandma Alcinda vii viii Contents Abstract iii Resumo iv Acknowledgments v List of Figures xii List of Tables xiv List of Acronyms xvii 1 Introduction 1 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.4 Contributions and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.5 Document Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2 Context and Related Work 5 2.1 Emotions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2 Emotions in Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2.1 Content-Based Image Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2.2 Facial-Expression Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2.3 Relationship between features and emotional content of an image . . . . . . . . . 13 2.3 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.3.1 Emotion-Based Image Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.3.2 Recommendation Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.4 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3 Fuzzy Logic Emotion Recognizer 23 3.1 The Recognizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 ix 4 Content-Based Emotion Recognizer 39 4.1 List of features used . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.2 Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.2.1 One feature type combinations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 4.2.2 Two feature type combinations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.2.3 Three feature type combinations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 4.2.4 Four feature type combinations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 4.2.5 Overall best features combinations . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 4.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 5 Dataset 51 5.1 Image Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 5.2 Description of the Experience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 5.3 Pilot Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 5.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 5.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 6 Evaluation 59 6.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 6.1.1 Fuzzy Logic Emotion Recognizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 6.1.2 Content-Based Emotion Recognizer . . . . . . . . . . . . . . . . . . . . . . . . . . 60 6.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 6.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 7 Conclusions and Future Work 63 7.1 Summary of the Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 7.2 Final Conclusions and Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 7.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 Bibliography 67 Appendix A 73 Appendix B 89 x List of Figures 2.1 Universal basic emotions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2 Circumplex model of affect, which maps the universal emotions in the Valence-Arousal plane. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.3 Wheel of Emotions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 3.1 Circumplex Model of Affect with basic emotions. Adapted from [75] . . . . . . . . . . . . . 24 3.2 Distribution of the images in terms of Valence and Arousal . . . . . . . . . . . . . . . . . . 25 3.3 Polar Coordinate System for the distribution of the images . . . . . . . . . . . . . . . . . . 25 3.4 Sigmoidal membership function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.5 Trapezoidal membership function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.6 2-D Membership Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.7 Membership Functions for Negative category . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.8 2-D Membership Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.9 Membership Functions for Neutral category . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.10 2-D Membership Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.11 Membership Functions for Positive category . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.12 Membership Functions for Anger, Disgust and Sadness . . . . . . . . . . . . . . . . . . . 30 3.13 Membership Functions for Disgust . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.14 Membership Functions for Disgust and Fear . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.15 Membership Functions for Disgust and Sadness . . . . . . . . . . . . . . . . . . . . . . . 32 3.16 Membership Functions for Fear . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.17 Membership Functions for Happiness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.18 Membership Functions for Sadness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.19 Membership Functions of Angle for all classes of emotions . . . . . . . . . . . . . . . . . 34 3.20 Membership Functions of Radius for all classes of emotions . . . . . . . . . . . . . . . . . 35 4.1 Average recognition considering all features . . . . . . . . . . . . . . . . . . . . . . . . . . 42 4.2 Results for Color - one feature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.3 Results for Color - two features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.4 Results for Color - three features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.5 Time to build models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 5.1 EmoPhoto Questionnaire . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 5.2 Emotional state of the users in the beginning of the test . . . . . . . . . . . . . . . . . . . 53 5.3 Classification of the Negative images of our dataset (from users) . . . . . . . . . . . . . . 56 5.4 Classification of the Neutral and Positive images of our dataset (from users) . . . . . . . . 56 6.1 Classification of the Negative and Positive images of our dataset (from users) . . . . . . . 61 xi B1 EmoPhoto Questionnaire . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 B2 1. Age . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 B3 2. Gender . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 B4 3. Education Level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 B5 4. Have you ever participated in a study using any Brain-Computer Interface Device? . . . 92 B6 7. How do you feel? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 B7 8. Please classify your emotional state regarding the following cases: Anger, Disgust, Fear, Happiness, Neutral, Sadness and Surprise . . . . . . . . . . . . . . . . . . . . . . . xii 93 List of Tables 2.1 Comparision between International Affective Picture System (IAPS), Geneva Affective PicturE Database (GAPED) and Mikels datasets . . . . . . . . . . . . . . . . . . . . . . . 21 3.1 Confusion Matrix for the classes of emotions in the IAPS dataset . . . . . . . . . . . . . . 36 3.2 Confusion Matrix for the categories in the Mikels dataset . . . . . . . . . . . . . . . . . . . 36 3.3 Confusion Matrix for the categories in the GAPED dataset . . . . . . . . . . . . . . . . . . 36 4.1 List of best features for each category type . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.2 Overall best features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 5.1 Confusion Matrix for the categories between Mikels and our dataset . . . . . . . . . . . . 55 5.2 Confusion Matrix for the categories between GAPED and our dataset . . . . . . . . . . . 56 6.1 Confusion Matrix for the categories using our dataset . . . . . . . . . . . . . . . . . . . . 59 6.2 Confusion Matrix for the categories using our dataset . . . . . . . . . . . . . . . . . . . . 60 A1 Simple and Meta classifiers results for each feature . . . . . . . . . . . . . . . . . . . . . . 73 A2 Vote classifiers results for each feature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 A3 Results for Color using one feature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 A4 Results for combination of two Color features . . . . . . . . . . . . . . . . . . . . . . . . . 74 A5 Results for combination of three Color features . . . . . . . . . . . . . . . . . . . . . . . . 75 A6 Results for combination of four Color features . . . . . . . . . . . . . . . . . . . . . . . . . 76 A7 Results for combination of five Color features . . . . . . . . . . . . . . . . . . . . . . . . . 76 A8 Results for combination of six Color features . . . . . . . . . . . . . . . . . . . . . . . . . 76 A9 Results for combination of seven Color features . . . . . . . . . . . . . . . . . . . . . . . . 77 A10 Results for combination of all Color features . . . . . . . . . . . . . . . . . . . . . . . . . . 77 A11 List of candidate features for Color . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 A12 Results for Composition feature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 A13 Results for combination of Shape features . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 A14 Results for combination of Texture features . . . . . . . . . . . . . . . . . . . . . . . . . . 78 A15 Results for combination of Joint features . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 A16 Results for combination of Color and Composition features . . . . . . . . . . . . . . . . . 79 A17 Results for combination of Color and Shape features . . . . . . . . . . . . . . . . . . . . . 79 A18 Results for combination of Color and Texture features . . . . . . . . . . . . . . . . . . . . 79 A19 Results for combination of Color and Joint features . . . . . . . . . . . . . . . . . . . . . . 80 A20 Results for combination of Composition and Shape features . . . . . . . . . . . . . . . . . 80 A21 Results for combination of Composition and Texture features . . . . . . . . . . . . . . . . 80 A22 Results for combination of Composition and Joint features . . . . . . . . . . . . . . . . . . 80 xiii A23 Results for combination of Shape and Texture features . . . . . . . . . . . . . . . . . . . . 80 A24 Results for combination of Shape and Joint features . . . . . . . . . . . . . . . . . . . . . 81 A25 Results for combination of Texture and Joint features . . . . . . . . . . . . . . . . . . . . . 81 A26 Results for combination of Color, Composition and Shape features . . . . . . . . . . . . . 81 A27 Results for combination of Color, Composition and Texture features . . . . . . . . . . . . . 81 A28 Results for combination of Color, Composition and Joint features . . . . . . . . . . . . . . 82 A29 Results for combination of Color, Shape and Texture features . . . . . . . . . . . . . . . . 82 A30 Results for combination of Color, Shape and Joint features . . . . . . . . . . . . . . . . . 82 A31 Results for combination of Color, Texture and Joint features . . . . . . . . . . . . . . . . . 83 A32 Results for combination of Color, Composition, Texture and Shape features . . . . . . . . 83 A33 Results for combination of Color, Composition, Texture and Joint features . . . . . . . . . 84 A34 Results for combination of Color, Texture, Joint and Shape features . . . . . . . . . . . . . 84 A35 Results for combination of Color, Texture, Joint and Composition features . . . . . . . . . 84 A36 Confusion Matrices for each combination . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 A37 Confusion Matrices for each combination using GAPED dataset with Negative and Positive categories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 A38 Confusion Matrices for each combination using GAPED dataset with Negative, Neutral and Positive categories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 A39 Confusion Matrices for each combination using Mikels and GAPED dataset . . . . . . . . 88 xiv List of Acronyms AAM Active Appearance Models ACC AutoColorCorrelogram ADF Anger, Disgust and Fear ADS Anger, Disgust and Sadness AF Anger and Fear AM Affective Metadata ANN Artificial Neural Network AS Anger and Sadness AU Action Unit Bag Bagging BCI Brain-Computer Interfaces CBIR Content-based Image Retrieval CBER Content-based Emotion Recognizer CBR Content-based Recommender CCV Color Coherence Vectors CEDD Color and Edge Directivity Descriptor CF Collaborative-Filtering CH Color Histogram CM Color Moments CMA Circumplex Model of Affect D Disgust DF Disgust and Fear DOM Degree of Membership DOF Depth of Field DS Disgust and Sadness xv EBIR Emotion-based Image Retrieval EEG Electroencephalography EH Edge Histogram F Fear FCP Facial Characteristic Point FCTH Fuzzy Color and Texture Histogram FE Feature Extraction FLER Fuzzy Logic Emotion Recognizer FER Facial Expression Recognition FCTH Fuzzy Color and Texture Histogram FS Fear and Sadness G Gabor GAPED Geneva Affective PicturE Database GLCM Gray-Level Co-occurence Matrix GM Generic Metadata GMM Gaussian Mixture Models GPS Global Positioning System Ha Happiness H Haralick HSV Hue, Saturation and Value IAPS International Affective Picture System IBk K-nearest neighbours IGA Interactive Genetic Algorithm J48 C4.5 Decision Tree (algorithm from Weka) JCD Joint Composite Descriptor KDEF Karolinska Directed Emotional Faces LB LogitBoost Log Logistic MHCPH Modified Human Colour Perception Histogram MIP Mood-Induction Procedures MLP Multi-Layer Percepton xvi NB Naive Bayes NDC Number of Different Colors OH Opponent Histogram PCA Principal Component Analysis PFCH Perceptual Fuzzy Color Histogram PFCHS Perceptual Fuzzy Color Histogram with 3x3 Segmentation POFA Pictures of Facial Affect PVR Personal Video Recorders RCS Reference Color Similarity RecSys Recommendation Systems RF Random Forest RSS RandomSubSpace RT Rule of Thirds S Sadness SAM Self-Assessment Manikin SM Similarity Measurement SMO John Platt’s sequential minimal optimization algorithm for training a support vector classifier SPCA Shift-invariant Principal Component Analysis Su Surprise SVM Support Vector Machine T Tamura V1 Vote 1 V2 Vote 2 V3 Vote 3 V4 Vote 4 V5 Vote 5 V6 Vote 6 VAD Valence, Arousal and Dominance VOD Video-On-Demand systems xvii xviii 1 Introduction In this chapter we present our motivation, the goals we intend to achieve, as well as the solution developed to identify emotions, in particular the Fuzzy Logic Emotion Recognizer (FLER) and the Contentbased Emotion Recognizer (CBER). We also enumerate the main contributions and results of our work, as well as the document outline. 1.1 Motivation Images are an increasingly important class of data, especially as computers become more usable, with greater memory and communication capacities [42]. Nowadays, with the development in digital photography and the increasing easiness of acquiring cameras and smartphones, taking pictures (and storing them) is a common task. Thus, the number of images in private collections of each person is becoming bigger. In the case of the images available in the internet, it is not big, it is huge. With this massive growth in the amount of visual information available, the need to store and retrieve images in an efficient manner arises, leading to an increase of the importance of Content-based Image Retrieval (CBIR) systems [78]. However, these systems do not take into account high level features like human emotions associated with the images or the emotional state of the users. To overcome this, a new technique, Emotion-based Image Retrieval (EBIR) was proposed in order to extend CBIR systems through the use of human emotions besides common features [81, 87]. Currently, emotion or mood information are already used as search terms within multimedia databases, retrieval systems or even multimedia players. We can interact with and explore image collections in many ways. One possibility is through their content, such as Colors, Shapes, Texture and Lines, or through associated information such as tags, data or Global Positioning System (GPS) information. Every time we search for something, for example for images from a specific day or event, the order in which the images are presented can be different but the images will always be the same. However, our emotional state is not always the same: sometimes we are happy, and other times sad or depressed. Therefore we are more receptive to some images than others, depending of the emotions perceived from the image. In the image domain, emotions describe the personal affectedness based on spontaneous perception 1 [19], and can be achieved, for example, through the colors or facial expressions of people present in the image. In the worst case, these results will make us feel even worse, which, given the importance of the emotions in our daily life, will lead to a significantly negative performance during cognitive tasks such as attention, creativity, memory, decision-making, judgment, learning or problem-solving. Although it seems interesting to take advantage of the emotions that an image transmits, for example, by using them to explore a collection of images, currently there is no way of knowing which emotions are associated with a given image. In order to identify the emotional content present in an image, i.e., the emotions that would be triggered when viewing the image, as well as the corresponding category of those emotions (Negative, Positive or Neutral), we will follow two approaches: one using Valence and Arousal information, and the other one using the content of the image. 1.2 Goals This work aims to be able to identify the emotional content present in an image, regarding the corresponding category of emotions, i.e., if an image transmits Negative, Positive or Neutral feelings to the viewer. We also want to be able to give an insight about what emotions would be triggered when viewing an image. To that end, we plan to take advantage of the Valence and Arousal values associated to some datasets of images, and in the case where there are no information about V-A, we plan to use the content of the images to derive the emotion or category of emotion that it conveys to the viewer. To achieve this, we need to focus on three different sub-goals: i) develop an emotion recognizer based on the Valence and Arousal information associated to images; ii) develop an emotion recognizer based on the visual content of the images, such as Colors, Shape or Texture; iii) finally, we want to collect information, using people, about the dominant emotions transmitted by a set of images, by performing an experiment with people. 1.3 Solution The solution developed, in the context of this work, consists in two recognizers that are able to identify the emotional content of an image using different inputs, and producing different levels of output. The first recognizer, Fuzzy Logic Emotion Recognizer (FLER), uses the normalized values of Valence and Arousal of an image to automatically classify the classes of emotions: Anger, Disgust and Sadness (ADS), Disgust (D), Disgust and Fear (DF), Disgust and Sadness (DS), Fear (F), Happiness (Ha), and Sadness (S) and categories: Positive, Neutral and Negative, conveyed by an image. To describe each class of emotions, as well as the categories, we used a Fuzzy Logic approach, in which each set is characterized by a membership function that assigns to each object a Degree of Membership laying between zero and one [90]. In the case of emotions, we used the Product of Sigmoidal membership function for the Angle that correlates Valence and Arousal values, and Trapezoidal membership function for the Radius, that will help to reduce emotion confusion between images with similar Angles. Regarding the categories we used the Trapezoidal membership function, both for the Angle and the Radius. Finally, for each class of emotions and category, we used a two-dimensional membership function that is the result of the composition of the two one-dimensional membership functions mentioned above. The second recognizer, Content-based Emotion Recognizer (CBER), uses visual content information of the image to automatically classify if an image is Negative or Positive. To select the best descriptors to use, we performed a large number of tests using different combinations of Color, Texture, Shape, Composition and Joint descriptors/features. We started by analyzing a set of classifiers, in order to understand which one best learns the relationship between features and the given category of emotion. 2 After that, and based on the relationships found, we proposed six different combinations of classifiers using the Vote classifier as a base. For each of the proposed combinations of classifiers, we tested several feature combinations. In the end, the best solution is composed by a Vote classifier, containing John Platt’s sequential minimal optimization algorithm for training a support vector classifier (SMO), Naive Bayes (NB), LogitBoost (LB), Random Forest (RF), and RandomSubSpace (RSS), and a combination of features, which include the Color Histogram (CH), Color Moments (CM), Number of Different Colors (NDC), and Reference Color Similarity (RCS). 1.4 Contributions and Results With the completion of this thesis, we achieved three main contributions: • A Fuzzy recognizer with a classification rate of 100% in the case of categories and 91.56% in the case of emotions, for Mikels dataset [66]; with the Geneva Affective PicturE Database (GAPED) we achieved an average classification rate of 95.59% for the categories. Using our dataset, we achieved a success rate of 68.70% for emotions. In the case of categories, we achieved 100% for Negative category, 85% for the Positive and 28% for the Neutral. • A recognizer based on the content of the images, that has a recognition rate of 87.18% for the Negative category, and 57.69% for the Positive, using a dataset of images selected both from International Affective Picture System (IAPS) and from GAPED datasets. Using our dataset, we achieved a recognition rate of 76.54% for the Negative category and 52.38% for the Positive. • A new dataset of 169 images from IAPS, Mikels and GAPED annotated with the dominant categories and emotions, according to what people felt while viewing each image. 1.5 Document Outline In chapter 2, we describe the importance of emotions, as well as how they can be represented. Along with it, we detail the previous works in the recognition of emotions from images, how to identify the emotional state of a user, and some research areas where these two topics are combined: Emotionbased Image Retrieval (EBIR) and Recommendation Systems (RecSys). We also describe the relationship between emotions and the different visual characteristics of an image. Finally, we present the datasets that we used in our work: International Affective Picture System (IAPS), Geneva Affective PicturE Database (GAPED) and Mikels. In chapter 3, we describe the Fuzzy Logic Emotion Recognizer (FLER) and the corresponding experimental results achieved, while in chapter 4, both the Content-based Emotion Recognizer (CBER) and the experimental results obtained are described. We also present an analysis of the different possible combinations between the different types of features used: Color, Texture, Composition, and Shape. In chapter 5, we present a new dataset that is annotated with information collected through experiments with users. In chapter 6, we present the evaluation of FLER and CBER using our new annotated dataset. Finally, a summary of the dissertation, the conclusions and future work are presented in chapter 7. 3 4 2 Context and Related Work Within this chapter we present a review and summary of the related works in the fields of emotions, the recognition of the emotions in images using Content-based Image Retrieval (CBIR) and Facial Expression Recognition (FER), as well as the relationship between the emotions and the visual characteristics mentioned in CBIR, and finally Emotion-based Image Retrieval (EBIR) and Recommendation Systems (RecSys). Although this seems to be a lot of fields, some of them such as emotions, and RecSys are described here only to give some context, while CBIR, FER and EBIR are our main focus. We also present and describe the datasets used in our work. 2.1 Emotions “An emotion is a complex psychological state that involves three distinct components: a subjective experience, a physiological response, and a behavioral or expressive response.” [35] Emotions have been described as discrete and consistent responses to external or internal events with particular significance for the organism. They are brief in duration and correspond to a coordinated set of responses, which may include verbal, behavioral, physiological and neural mechanisms. In affective neuroscience, the emotion can be differentiated from similar constructs like feelings, moods and affects. Feelings can be viewed as a subjective representation of emotions. Moods are diffuse affective states that generally last for much longer durations than emotions and are also usually less intense than emotions. Finally, affect is an encompassing term, used to describe the topics of emotion, feelings, and moods together [23]. The role of emotion in human cognition is essential. Emotions also play a critical role in rational decision-making, perception, human interaction, and in human intelligence [69]. Emotions play an important role in the daily life of human beings. The importance (and need) of automatic emotion recognition has grown with the increasing role of human-computer interface applications. Nowadays, new forms of human-centric and human-driven interaction with digital media have the potential of revolutionizing entertainment, learning, and many other areas of life. Emotion recognition could be done from text, speech, facial expressions or gestures [54]. 5 Currently, given the importance of emotions/emotion-related variables in the gaming behavior, people seek and are eager to pay for games that elicit strong emotional experiences. This can be achieved using bio-signals in a biofeedback system, which can be implicit or explicit. The implicit biofeedback is similar to affective feedback, i.e., the users are not aware that their physiological states are being sensed, because the intention is to capture their normal affective reactions; the system modulates its behavior according to the registered bio-signals. The explicit is originated from the field of medicine, with the intuit of making the subjects more aware of their bodily processes by displaying the information in an easy and clear way. This means the user has direct and conscious control over the application. If the user, in the implicit feedback, starts to learn how the system works and use that knowledge to obtain control over it, it becomes an explicit system. It is a popular trend that various game mechanics are used in other areas, such as education, simulation, exercising, group work and design. For this reason, there is the belief that the work in biofeedback interaction will find applications in a broad range of domains [46]. Previous studies have suggested that men and women process emotional stimuli differently. In [53], it was verified if there would be any consistency in regions of activation in men and women when processing stimuli portraying happy or sad emotions presented in the form of facial expressions, scenes and words. During emotion recognition of all forms of stimuli studied, the collected imaging data revealed that the right insula and left thalamus were consistently activated for men, but not for women. The findings suggest that men rely on the recall of past emotional experiences to evaluate current emotional experiences, whereas women seemed to engage the emotional system more readily. This finding is consistent with the common belief that women are more emotional than men, which suggests possible gender-related neural responses to emotional stimuli. This difference may be relevant to the evaluation of the emotional reaction of a user to a given picture. Figure 2.1: Universal basic emotions from Grimace 1 There are two different perspectives towards emotion representation. The first one (categorial), indicates that basic emotions have evolved through natural selection. Plutchik [71] proposed eight basic emotions: Anger, Fear, Sadness, Disgust, Surprise, Curiosity, Acceptance, and Joy. All the other emotions can be formed by these basic ones, for example, disappointment is composed of Surprise and Sadness. Ekman, following a Darwinian tradition, based his work in the relationship between facial expressions and emotions derived from a number of universal basic emotions: Anger, Disgust, Fear, happiness, Sadness, and Surprise (see Figure 2.1). Later he expanded the basic emotions by adding: Amusement, Contempt, Contentment, Embarrassment, Excitement, Guilt, Pride in achievement, Relief, Satisfaction, sensory Pleasure, and Shame. In the second perspective (dimensional), which is based on cognition, the emotions (also called affective labels [30]) are mapped into the Valence, Arousal and Dominance (VAD) dimensions. Valence goes from very Positive feelings to very Negative, Arousal is also called activation and goes from states like sleepy to excited, and finally, dominance that corresponds to the strength of the emotion [2, 20, 49, 54]. The most common model used is the two-dimensional, that only uses Valence and Arousal (see Figure 2.2). 1 http://www.grimace-project.net/ 6 Figure 2.2: Circumplex model of affect, which maps the universal emotions in the Valence-Arousal plane. In [76], some correlations between basic emotions were described. One of the most important results was that when happiness rises, all other emotions decline, and the other one, that Fear correlates positively with Sadness and with Anger. These correlations are well-known phenomena in the field of psychology. Many studies in psychology involve manipulating Valence and/or Arousal via emotional stimuli. This technique of inducing emotion in human participants is referred to as affective priming. Several methods have been introduced for priming participants with Positive or Negative affect. Common methods include images, text (stories), videos, sounds, word-association tasks, and combinations thereof. Such methods are commonly referred to as Mood-Induction Procedures (MIP). In general, Positive emotions tend to lead to better cognitive performance, and Negative emotions (with some exceptions) lead to decreased performance [32]. Affective computing is a rising topic within human-computer interaction that tries to satisfy other user needs, besides the need of the user to be as productive as possible. As the user is an affective human being, many needs are related to emotions and interaction [5]. Research has already been done into recognizing emotions from faces and voice. Humans can recognize emotions from these signals with a 70-98% accuracy, and computers are already pretty successful especially at classifying facial expressions (80-90%). With the rising interest for Brain-Computer Interfaces (BCI), user’s Electroencephalography (EEG) have been analyzed as well [5]. In [2], they use information about the affective/mental states of users to adapt interfaces or add functionalities. In [32], the authors describe a crowdsourced experiment in which affective priming is used to influence low-level visual judgment performance. They present results that suggest that affective priming significantly influences visual judgments, and that Positive priming increases performance. Additionally, individual personality differences can influence performance with visualizations. In addition to stable personality traits, research in psychology has found that temporary changes in affect (emotion) can also significantly impact performance during cognitive tasks such as memory, attention, learning, judgment, creativity, decision-making, and problem-solving. 7 The category-based models can be used for tagging purposes, specially with a list of different adjectives for the same mood, which allows the generalization of the subjective perceptions of multiple users and provides a dictionary for search and retrieval applications [19]. However, the dimensional model is preferable in emotion recognition experiments because a dimensional model can locate discrete emotions in its space, even when no particular label can be used to define a certain feeling [54]. For our work, we will use the six universal basic emotions with the addition of a new emotion: the Neutral. To map the emotions into a two-dimensional model (since it is better for the purpose of our work), an adaptation of the circumplex model of affect introduced in [72] will be used (see Figure 2.2) [83]. 2.2 Emotions in Images In order to extract emotions from an image, we need to understand how their contents affect the way emotions are perceived by users. For example, different Colors give us different feelings: bright Colors help to create a Positive and friendly mood whereas dark Colors create the opposite. In the other hand, the lines such as the diagonal ones indicate activity, and the horizontal ones express calmness. In the human interaction, if we communicate with someone that appears to be sad, we tend to sympathize with that person and feel sad too. However, if the person is happy we tend to become happier. The same effect is observed in case we see sad or happy expressions in pictures, i.e., it will also affect our emotional state. 2.2.1 Content-Based Image Retrieval Content-based Image Retrieval (CBIR) is a well-known technique that uses visual contents of an image to search images in large databases, according to users’ interests [42]. A wide range of possible applications for CBIR has been identified, such as crime prevention, architectural and engineering design, fashion and interior design, journalism and advertising, medical diagnosis, geographical information and remote sensing systems and education. Also, a large number of commercial and academic retrieval systems have been developed by universities, companies, government organizations and hospitals [78]. The initial CBIR systems can be divided into two categories, according to the type of queries: text or pictorial query. In the first case, the images are represented as text information like keywords or tags, which can be very effective if appropriate text descriptions are given to the images in the database. However, since the annotations are made manually, they are subjective and context-sensitive, and can be wrong, incomplete or nonexistent. Also, this method can be expensive and time-consuming [42]. In the second case, an example of the image is given as query. In order to obtain similar images, different low-level features such as colors, edges, shapes and textures can be automatically extracted. Typically, the system is composed by Feature Extraction (FE) and Similarity Measurement (SM). In the case of the FE, a set of features, such as the indicated previously, are generated to accurately represent the content of each image in the database. This set of features is called image signature or feature vector, and is usually stored in a feature database. In SM, the distance between the query image and each image in the database is computed, using the corresponding signatures, in a way that the closest images are retrieved. The most used distances to calculate similarity are MinkowskiForm distance, Quadratic Form distance, Mahalanobis distance, Kullback-Leibler divergence, Jeffrey divergence and the Euclidean distance [42, 78]. User interfaces in image retrieval systems consist of two parts: the query formulation and the result presentation part. Recent retrieval systems have incorporated users’ relevance feedback to modify the retrieval process in order to generate perceptually and semantically more meaningful retrieval results. An important thing we need to keep in mind is that human perception of image similarity is subjective, semantic, and task-dependent. Besides that, each type of visual feature usually captures only one 8 aspect of image property, and it is usually hard for the user to specify clearly how different aspects are combined. However, there is no single “best” feature that gives accurate results in any general setting, which means that, usually, a combination of features is needed to provide adequate retrieval results [78]. Nowadays, researchers are merging fields such as computer vision, machine learning and image processing, which provides an opportunity to find solutions of different issues such as semantic gap and dimensionality reduction [42]. Semantic gap shows the difference between high-level concepts such as emotions, events, objects or activities as conveyed by an image and limited descriptive power of low-level visual features. After the CBIR technical/theoretical characteristics have been explained, it is important to explain the visual features that this technique uses: Color: This is the most extensively used visual content for image retrieval since it is the basic constituent of images. It is relatively robust to background complication and independent of orientation and image size. In some works like [78], grayscale is also considered as a color. Usually, in these systems, Color histogram is the most used feature representation and gives us the description of the colors present in an image as well as their quantities. It is obtained by quantizing image color into discrete levels, then the number of times each discrete occur in the image is counted. They are insensitive to small perturbations in camera position and are computationally efficient to compute. When a database contains a large number of images, histogram comparison will saturate the discrimination. To solve this problem, the joint histogram technique [27] was introduced, in which it is incorporated additional information without affecting the robustness of Color histograms. It is possible that two different images have the same Color histogram because a single Color histogram extracted from an image lacks spatial information of colors in the image. In [78], the authors expose different possible solutions. In the first one, a CBIR takes account of the spatial information of the Colors by using multiple histograms. In the second one, the spatial features area (zero-order) and position (first-order) moments are used for retrieval. Finally, in the last one, a different way of incorporating spatial information into the Color histogram, Color Coherence Vectors (CCV), was proposed. Using the Color correlogram, it is possible to characterize not only the color distributions of pixels, but also the spatial correlation of pairs of colors. Modified Human Colour Perception Histogram (MHCPH) [77] is based on human visual perception of the Color. The gray weights and color are distributed to neighboring bins smoothly with respect to pixel information. The amount of weight that is distributed to the neighboring bins is estimated using NBS distance, which makes it possible to extract the background color information effectively along with the foreground information. Shape: It corresponds to an important criterion for matching objects based on their physical structure and profile. Shape is a well-defined concept and there is considerable evidence that natural objects are primarily recognized by their shape [78]. These features can represent the spacial information that is not represented by Texture or Color, and contains all the geometrical information of an object in the image. This information does not change even if the location or orientation of the object changes. The simplest Shape features are the perimeter, area, eccentricity and symmetry [42], but, usually, two main types of Shape features are commonly used: global features and local features. Aspect 9 ratio, circularity and moment invariants are examples of global features and sets of consecutive boundary segments corresponds to local features. Shape representations can be divided into two classes: boundary-based and region-based. In the first case, only the outer boundary of the shape is used. The most common approaches are the rectilinear shapes, polygonal approximation, finite element models, and Fourier-based shape descriptors. In the second one, the entire Shape region is used to compute statistical moments. A good Shape representation feature for an object should be invariant to translation, rotation and scaling [78]. Texture: It is defined as all that it is left after Color and local Shape has been considered. It is used to look for visual patterns, with properties of homogeneity that are not achieved by the presence of a single color, in images and how they are spatially defined. Also, it contains information about the structural arrangement of surfaces and their relationship to the surrounding environment. Texture similarity can be used to distinguishing between areas of images with similar Color such as sky and sea. Texture representation can be classified into three categories: statistical, structural and spectral. In the statistical approach, the Texture is characterized using the statistical properties of the gray levels in the pixels of the image. Usually, there is a periodic occurrence of certain gray levels. Some of the methods used are co-occurrence matrix, Tamura features, Shift-invariant Principal Component Analysis (SPCA), Wold decomposition and multi-resolution filtering techniques such as Gabor and Wavelet transform. Both the Tamura features and Wold decomposition are designed according to physiological studies on the human perception of Texture and described in terms of perceptual properties. The structural methods describe the Texture as a composition of texels (texture elements) that are arranged regularly on a surface according to some specific arrangement rules. Some methods, such as morphological operator and adjacency graphs, describe the Texture by the identification of structural primitives and their corresponding placement rules. If they are applied to regular textures, they tend to be very effective [78]. In the spectral method, the Texture description is done by using a Fourier transform of an image and then group the transformed data in a way that it gives some set of measurements. These descriptions allowed us to understand how Color, Shape and Texture are characterized, as well as how they usually appear in an image. It also allowed us to identify the visual descriptors and the different approaches that are most commonly used to capture each of the visual features used in CBIR systems. 2.2.2 Facial-Expression Recognition The human face is one of the major ”objects” in our daily lives that is used to provide information about the gender, attractiveness and age of a person, but also helps to identify the emotion that person is feeling; this has an important role in human communication. Underneath our skin, a large number of face muscles allow us to produce different configurations. These muscles can be summarized as Action Unit (AU) [21] and are used to define the facial expressions of an emotion. Facial expressions are typically classified as Joy, Surprise, Anger, Sadness, Disgust and Fear [15, 21]. Recent research in cognitive science and neuroscience has shown that humans use mostly the Shape for the perception and recognition of facial expressions of emotion. Furthermore, humans are 10 very good only at recognizing a few facial expressions of emotion. The most well recognized emotions are Happiness and Surprise and the worst are Fear and Disgust. Learning why our visual system easily recognizes some expressions and not others should help the definition of the form and dimensions of the computational model of facial expressions of emotion [63]. To describe how humans perceive and classify facial expressions of an emotion, there are two types of models: the continuous and categorical. In the first one, each facial expression of an emotion is defined as a feature vector in a face space, given by some characteristics that are common to all the emotions. In the second one, there are C classifiers, each one associated to a specific emotion category. The continuous model explains how expressions of emotion can be seen at different intensities, whereas the categorical explains, among other findings, why the images in a morphing sequence between two emotions, like happiness and surprise, are perceived as either happy or surprise but not something in between. Also several psychophysical experiments suggest the perception of emotions by humans is categorical [22]. There have been developed models of the perception and classification of the six facial expressions of emotion, in which sample feature vectors or regions of the feature space are used to represent each one of the emotion labels, but only one emotion can be detected from a single image, despite the fact that humans can perceive more than one emotion in a single image, even if they have no prior experience with it. Initially, researchers have created several feature and shape-based algorithms for recognition of objects and faces [40, 55, 61], in which geometric, Shape features and edges were extracted from an image and used to create a model of the face. Then, this model was fitted to the image, and in case of a good fit, it is used to determine the class and position of the face. In [63], an independent computational (face) space for a small number of emotion labels was presented. In this approach, it is only needed to sample faces of those few facial expressions of emotion. This approach corresponds to a categorical model, however the authors define each of these face spaces as continue feature spaces. Essentially, the observed intensity in this continuous representation is used to define the weight of the contribution of each basic category toward the final classification, allowing the representation and recognition of a very large number of emotion categories without the need to have a categorical space for each one or having to use many samples of each expression as in the continuous model. With this approach, a new model was introduced; it consists of C distinct continuous spaces, in which multiple emotion categories can be recognized by linearly combining these C face spaces. The most important aspect of this model is that it is possible to define new categories as linear combinations of a small set of categories. The proposed model thus bridges the gap between the categorical and continuous ones and resolves most of the debate facing each of the models. The authors explained that the face spaces should include configural and shape features, because the configural features can be obtained from an appropriate representation of shape, however expressions such as Fear and Disgust seem to be mostly based on Shape features, making the recognition process less accurate and more susceptible to image manipulation. Each one of the six categories of emotion used is represented in a shape space given by classical statistical shape analysis. The face and the shape of the major facial components are automatically detected, i.e., the brows, eyes, nose, mouth and jaw line. Then, the shape is sampled with d equally spaced landmark points and the mean of all the points is computed. To provide invariance to translation and scale, the 2d-dimensional shape feature vector is given by the x and y coordinates of the d shape landmarks subtracted by the mean and divided by its norm. In the case of the 3D rotation invariance, it can be achieved with the inclusion of a kernel. The authors used the algorithm defined by [29] to obtain the dimensions of each emotion category, because it minimizes the Bayes classification error. 11 Since two categories can be connected by a more general one, the authors use the already defined shape space to find the two most discriminant dimensions separating each of the six categories previously listed. Then, in order to test the model, they trained a linear Support Vector Machine (SVM) and achieved the following results: Happiness is correctly classified 99% of the times, Surprise and Disgust 95%, Sadness 90%, Fear 92% and Anger 94%. They also mentioned that adding new dimensions in the feature space and using nonlinear classifiers makes it possible to achieve perfect classification. In the last two decades, a new approach was studied: appearance-based, in which the faces are represented by their pixel-intensity maps or the response to some filters (e.g. Gabors). The main advantage of appearance-based model is that there is no need to: predefine a feature/shape model like in the previous approaches, since the face model is given by the training images. Also, it provides good results for near-frontal images of good quality, but it is sensitive to image manipulation like scale, illumination changes or poses. In [16], two methods are presented, one for static pictures and the other for video, for automatic facial expression recognition using the shape informations of the face, extracted using Active Appearance Models (AAM) that is a computer vision algorithm for matching a statistical model of a object shape and appearance to a new image. The main difference between these methods is the type of selected features. The system uses a face detection algorithm based on [68], a Facial Characteristic Point (FCP) extraction method based on AAM, and the classification of the emotions is made using SVM. The dataset used for training the facial expression recognizer was the Cohn-Kanade database containing a set of video sequences of different subjects on multiple scenarios. Each of these sequences contains a subject expressing an emotion from the Neutral state to the apex of that emotion, and only the first and last frames are used. The AAM was built using 19 shape models, 24 texture models and 22 appearance models, resulting in a shape vector of 58 face shape points. The model handles a certain degree of scaling, translation, rotation and asymmetry (using parameters for both sides of the face). The effect of illumination changes is minimized by scaling the texture data of the face samples during the training of the AAM. To increase the performance of the model fitting, the authors decided to use samples with occluded faces as well. A SVM classifier and 2-fold Cross Validation were used to present the results, and in the case of the video sequences the results were better than for static images in all emotions. The approximate results are: Fear 85% for image and 88% for video, Surprise 84%-89%, Sadness 83%-86%, Anger 76%-86%, Disgust 80%-82% and happy 73%-80%. In [62], a new approach for facial recognition classification, also based on AAM, is presented. In order to be able to work in real-time, the authors used AAM on edge images instead of gray ones, a two-stage hierarchical AAM tracker and a very efficient implementation. With the use of edge images, it is possible to overcome one of the problems in AAM: different illumination conditions. In this new approach, it was used a 2-dimensional shape model S with 58 points placed in regions of the face which usually have a lot of texture information and an appearance model used to transform the input image into a linear-space of Eigenfaces. The combination of these two models leads to a model instance, with appearance parameters and shape parameters p. The developed system is composed of foursubsystems: Face Detection, Coarse AAM, Detailed AAM and Facial Expression Classifier. The first one identifies faces in real-time using [86] face detector. The position and size of the detected faces are used to initialize the Coarse AAM and new shape components are added to describe: the scaling of the shape, an approximation of the in-plane rotation and the translation on the x-axis and y-axis. This step allows to do a coarse estimation of the input image. The Detailed AAM is initialized after the error associated with the previous step drops below a given threshold, and is used to estimate the details of the face that are necessary for a mimic recognition. Finally, for the classification of the facial expression were used an AAM-classifier set, a Multi-Layer Percepton (MLP) based classifier and a SVM based classifier. The emotions used were the six typical facial expressions and a new one: Neutral. The FEEDTUM mimic 12 database was used, and consists of 18 different persons (9 males and 9 females), each showing the six different basic emotions and the Neutral in a short video sequence. Using the SVM classifier, = 20 and p = 10, the average detection rate was 92%. 2.2.3 Relationship between features and emotional content of an image Color is the result of interpretation in the brain of the perception of light in the human eye [18]. It is also the basic constituent and the first discriminated characteristic of images for the extraction of the emotions. In the last years, many works in psychology have been making hypothesis about the relationship between Colors and emotions [25, 87]. This research has shown that Color is a good predictor for emotions in terms of saturation, brightness, and warmth [38], and that the relationship between Colors and human emotions has a strong influence on how we perceive our environment. The same happens for our perception of images, i.e, all of us are in some way emotionally affected when looking at a photograph or an image [81]. In photography and color psychology, color tones and saturation play important roles. Saturation indicates chromatic purity, i.e., corresponds to the intensity of a pixel color. The purer the primary colors, red (sunset, flowers), green (trees, grass), and blue (sky), the more striking the scenery is to viewers [39]. Brightness corresponds to a subjective perception of the luminance in the pixel’s color [18]. In the case of too much exposure it will lead to a brighter shot, that often yields to lower quality pictures, but in the case of the ones that are too dark, usually they are not appealing. However, an over-exposed or underexposed photograph under certain scenarios may yield very original and beautiful shots [17]. Also, in photographs, the pure colors tend to be more appealing than dull or impure ones [17]. Regarding color temperature, warm colors tend to be associated with excitement and danger, while images dominated by cool colors tend to create cool, clamming, and gloomy moods [?, 65]. Images of happiness tend to be brighter, more saturated and have more colors than images of Sadness [18]. Concerning the relationship between colors and emotions, usually red is considered to be vibrant and exciting and is assumed to communicate happiness, dynamism, and power. Yellow is the most clear, cheerful, radiant and youthful Color. Orange is the most dynamic Color and resemble glory. The blue color is deep and may suggest gentleness, fairness, faithfulness, and virtue. Green should elicit calmness and relaxation. Purple sometimes communicates Fear, while brown is associated with relaxing scenes. A sense of quietness and calmness can be conveyed by the use of complementary colors, while a sense of uneasiness can be evoked by the absence of contrasting hues and the presence of a single dominant color region. This effect may also be amplified by the presence of dark yellow and purple colors [25, 30]. Basic emotions seem to be fundamentally universal, and their external manifestation seems to be independent of culture and personal experience. In what regards the brightness, there are distinct groups of emotions: Happiness, Fear and Surprise combined with very light colors, Disgust and Sadness with colors of intermediate lightness, and Anger with rather dark colors (usually black and red). The colors relative to Sadness and Fear are very desaturated, while Happiness, Surprise and Anger are associated with highly chromatic colors [13]. In the wheel of Emotions (See Figure 2.3), proposed by Plutchik [71], it is possible to identify the different emotions and their corresponding colors. In the case of the basic emotions, we have the following associations: Anger corresponds to red, Disgust to the purple, Fear to the dark green, Sadness to the dark blue, Surprise to the light blue and, finally, Happiness to the yellow. 13 Figure 2.3: Wheel of Emotions Since perception of emotion in color is influenced by biological, individual and cultural factors [18], mapping low-level color features to emotions is a complex task which theories about the use of colors, cognitive models and involve cultural and anthropological backgrounds must be considered [?]. Given that colors can be used in different ways, we need effective methods to measure their occurrence in an image. Color Moments [17, 18, 60, 78] are measures that characterize color distribution in an image. Different histograms such as Color Histogram [74, 78], Fuzzy Histogram (for Dominant Colors) [4], Wang Histogram [?] and Emotion-Histogram [81, 87] give the representation of the colors in an image. Color Correlogram [78] allows combining the advantages of histograms with spatial and color information. Color Layout Descriptor [67] also captures the spatial distribution of color in an image. Number of Colors [18] will be used to differentiate Positive from Negative images, since the first ones usually have more colors. Scalable Color Descriptor [19, 67] allows analyzing the brightness/darkness, saturation/pastel/pallid and the color tone/hue. Itten Contrasts [60] captures information about the contrasts of brightness, saturation, hue and complements. Harmonious composition is essential in a work of art and useful to analyze an image’s character [?]. In terms of Composition, images with a simplistic composition and a well-focused center of interest are sometimes more pleasing than images with many different objects [17, 39]. Nature scenes, such as forests or waterscapes, are strongly preferred when compared to urban scenes for population groups from different areas of the world [39]. In terms of Composition, there are common and not-so-common rules. The most popular and widely known is the Rule of Thirds, that can be considered as a sloppy approximation to the ‘golden ratio’ (about 0.618) [17, 39]. It states that the most important part of an image is not the center of the image but instead at the one third and two third lines (both horizontal and vertical), and their four intersections. Therefore, viewer’s eyes can naturally concentrate on these areas than either the center or the borders of the image, meaning that it is often beneficial to place objects of interest in these areas. This implies that a large part of the main object often lies on the periphery or inside of the inner rectangle [17]. 14 The size of an image has a good chance of affecting the photo aesthetics. Although most images are scaled, their initially size must be agreeable to the content of the photograph. In the case of the aspect ratio of an image, it is well-known that some aspect ratios such as 4:3 and 16:9 (which approximate the ‘golden ratio’) are chosen as standards for television screens or movies, for reasons related to viewing pleasure [17]. A less common rule in nature photography is to use diagonal lines (such as a railway, a line of trees, a river, or a trail) or converging lines for the main objects of interest to draw the attention of the human eyes [39]. Professional photographers often reduce the Depth of Field (DOF) for shooting single objects by using larger aperture settings, macro lenses, or telephoto lenses. On the photo, areas in the DOF are noticeably sharper [17]. Another Composition rule is to frame the photo so that there are interesting objects in both the close-up foreground and the far-away background. According to Gestalt psychology, that produced influential ideas such as the concept of goodness of configuration, we do not see isolated visual elements but instead patterns and configurations, which are formed according to the processes of perceptual organization in the nervous system. This is given to the “law of Prägnanz”, which enhances properties such as closure, regularity, simplicity or symmetry, leading us to prefer the “good” structures [39]. Shape is a fairly well-defined concept, and there is considerable evidence that natural objects are primarily recognized by their shape [78]. Growing evidence indicates that the underlying geometry of a visual image is an effective mechanism for conveying the affective meaning of a scene or object, even for very simple context-free geometric shapes. Objects containing non-representational images of sharp angles are less well liked. Abstract angular geometric patterns tend to be perceived as threatening, and circles and curvilinear forms are usually perceived as pleasant [51]. Accordingly to the fields of visual arts and psychology, shapes and their characteristics, such as angularity, complexity, roundness and simplicity, have been suggested to affect the emotional responses of human beings. Complexity and roundness of shapes appear to be fundamental to the understanding of emotions. In the case of complexity, humans visually prefer simplicity. Although the perception of simplicity is partially subjective to individual experiences, it can also be highly affected by parsimony and orderliness. Parsimony refers to the minimalistic structures that are used in a given representation, whereas orderliness refers to the simplest way of organizing these structures. For the case of roundness, it indicates that geometric properties convey emotions like Anger or Happiness [56, 87]. Usually, perceptual Shape features are extracted through angles, line segments, continuous lines and curves. The number of angles, as well as the number of different angles, can be used to describe complexity. Line segments refer to short straight lines used to capture the structure of an image. Continuous lines are generated by connecting intersecting line segments having the same orientations with a small margin of error. Line segments and Continuous lines are used to describe and interpret complexity of an image. Curves are a subset of continuous lines that are used to measure the roundness of an image [56]. Regarding the lines, their directions can express different feelings. Strong vertical elements usually indicate high tensional states while horizontal ones are much more peaceful. Oblique lines could be associated with dynamism [12, 25, 87]. Lines with many different directions present chaos, confusion or action. The longer, thicker and more dominant the line, the stronger the induced psychological effect [?]. In the field of computer vision, Texture is defined as all that is left after Color and local Shape have been considered or it is defined by such terms as structure and randomness. Textures are also important for emotional analysis of an image [25, 60], and their use can change the way other features are perceived; for example, in the case of the emotion unpleasantness, the addition of texture changes the perception of the image’s colors [57]. From an aesthetics point of view, specific patterns such as flowers make people feel warm, while the 15 abstract patterns make people feel cool. Thin and sparse patterns such as dots and small flowers make people feel soft. In contrast, the thick and dense patterns such as plaid make people feel hard [88]. In some situations, a great deal of detail gives a sense of reality to a scene, and less detail implies more smoothing moods [12]. Artists and professional photographers, in specific situations and in order to achieve a desired expression, create pictures which are sharp, or where the main object is sharp with a blurred background. Purposefully blurred images were frequently present in the category of art photography images which expressed Fear [60]. Graininess or smoothness in a photograph can be interpreted in different ways. If as a whole it is smooth, the picture can be out-of-focus, in which case it is in general not pleasing to the eye. If as a whole it is grainy, one possibility is that the picture was taken with a grainy film or under high ISO settings. Graininess can also indicate the presence/absence and nature of Texture within the image [17]. The following Texture features, Tamura [60, 74, 78], Gabor Transform [25, 39, 78, 87, 88], Waveletbased [60] and Gray-Level Co-occurence Matrix (GLCM) [60], are intended to capture the granularity and repetitive patterns of surfaces in an image. With the use of these features we will be able to measure the roughness or the crinkliness, the coarseness characterizes the grain size of an image, the contrast, directionality, line-likeness and regularity of a surface [74]. 2.3 Applications All of us are in some way emotionally affected when looking at an image, which means we often relate some of our emotional response to the context, or to particular objects in the scene. Usually CBIR systems or Recommendation Systems (RecSys), do not take into account the emotions that the images convey. However, recently, to solve this, new efforts have been made, which will be explained below. 2.3.1 Emotion-Based Image Retrieval The low-level information used in CBIR systems does not sufficiently capture the semantic information that the user has in mind [89]. In marketing and advertising research, attention has been given to the way in which media content can trigger the particular emotions and impulse buying behavior of the viewer/listener since emotions are quite important in brand perception and purchase intents [76]. Nowadays, many posters and movie previews use specific emotions that are specifically designed to attract potential customers. Emotionbased Image Retrieval (EBIR) can be used to identify tense, relaxed, sad, or joyful parts of a movie, or to characterize the prevailing emotion of a movie, which could be a great enhancement to personalizing the recommendation processes in future Video-On-Demand systems (VOD) or Personal Video Recorders (PVR) [30]. As an analogy to the semantic gap in CBIR systems, extracting the affective content information from audiovisual signals requires bridging the affective gap. Affective gap can be defined as the lack of coincidence between the measurable signal properties, commonly referred to as features, and the expected affective state in which the user is brought by perceiving the signal [30]. In [89], the authors present the first studies that were made in this new area. One of them is based on the Color theory of Itten, the expressive and perceptual features were mapped into emotions. Their method segmented the image into homogeneous regions, extracted features such as color, hue, luminance, saturation, position, size and warmth from each region, and used its contrasting and harmonious relationships with other regions to capture emotions. But this method was only designed for art painting retrieval. In another study, the authors designed a psychology space that captures the human emotion and mapped those onto physical features extracted from images. In a similar approach, based on 16 wavelet coefficients, retrieved emotionally gloomy images through feedbacks called Interactive Genetic Algorithm (IGA), but this method has the limitation of only differentiating two categories: gloomy or not. Finally, in the last one, the authors proposed an emotional model to define a relationship between physical vales of color image patterns and emotions, using color, gray and texture information from an image and input them into the model. Then, the model returned the degree of strength with respect to each emotion. It, however, has a problem with generalization due to the narrow scope of experiments on only five images and could not be applied to the image retrieval directly. In [87], authors explored the strong relationship between colors and human emotions and proposed an emotional semantic query model based on image color semantic description. Image semantics has several levels: abstract semantics that contributes to the interpretation of the senses, semantic templates (categories) related to the accumulation of semantic knowledge, semantic indicators corresponding to image elements that are characteristic for certain semantic categories, and finally, the low-level image features. The proposed model contains three stages. In the first one, the images were segmented using color clustering in L*a*b* space because the definitions and measurements of this space color are suited for vision perception psychology. In the second one, semantic terms using fuzzy clustering algorithm were generated, and used to describe both the image region and the whole image. After that, in the last one, an image query scheme through image color semantic description, that allows the user to query images using emotional semantic words, was presented. This system is general and able to satisfy queries for which it had not been explicitly designed. Also, the presented results demonstrate that the features successfully captures the semantics of the basic emotions. In [89], a new EBIR method was proposed, using query emotional descriptors called query color code and query gray code. These descriptors were designed on the basis of human evaluation of 13 emotion pairs (like-dislike, beautiful-ugly, natural-unnatural, dynamic-static, warm-cold, gay-sober, cheerful-dismal, unstable-stable, light-dark, strong-weak, gaudy-plain, hard-soft, heavy-light) when 30 random patterns with different color, intensity, and dot sizes are presented. For the emotion image retrieval, when a user performs a query emotion, the associated query color code and query gray code are obtained, and codes that capture color, intensity, and dot size are extracted from each database image. After that, a matching process between the two color codes and between the two gray codes is performed to retrieve images with a sensation of the query emotion. The major limitation of this method was the use of the emotions pairs since they do not cover all emotions that a human can feel, and it is difficult to map them onto the six basic emotions frequently used. In 2008, the authors of [12] said: “On the contrary, there are very few papers on automatic photo emotion detection if any.”, and proposed an emotion-based music player that combines the emotions evoked by auditory stimulus with visual content (photos). The emotion detection from photos was made using an own database with emotions manually annotated and a Bayesian classifier. To combine the music and photos, besides the high-level emotions, low-level features as harmony and temporal visual coherence was used. It is formulated as an optimization problem, solved by a greedy algorithm. The photos for the database used were chosen based on two criteria: images related to daily life without specific semantic meaning and photos without human faces, because they usually dominate the moods of photos. The emotion taxonomy used was based in Hevner’s work, and consists of eight emotions: sublime, sad, touching, easy, light, happy, exciting and grand. Since each photo was labeled by many users and they could perceive different emotions to it, the aggregated annotation of an image is considered as a distribution vector over the eight emotion classes. A set of visual features that effectively reflect emotions was used: color, textureness and line. Using a Bayesian framework, the obtained accuracy was 43%, but the misclassified photos are often classified as nearby emotions. The first retrieval system that indexes and searches images using human emotion was presented 17 in [43] 2 . In this system, the 10 Kobayashi emotional keyword were used for image tagging: romantic, clear, natural, casual, elegant, chic, dynamic, classic, dandy and modern. In the case of the images collected from the web, an indexer extracts physical features such as color, texture and pattern, and transpose them to human emotions. The system allows the users to search through a query interface based on emotional keywords and example images. The authors used 389 textile images from different domains such as interior (images such as curtain, carpet and wallpaper), fashion (images of clothes) and artificial (product designs). In [19] the authors, in order to extract emotions (aggressive, euphoric, calm and melancholic) from images, developed three new features: Color Histogram, Haar Wavelet and Color Temperature Histogram. The Color Histogram is calculated in the Hue, Saturation and Value (HSV) Color space with an individual quantization of each channel (similar to the MPEG-7 Scalable Color Descriptor), and it covers the properties of brightness/darkness, saturation/pastel/pallid and the color tone/hue. The Haar Wavelet describes the mean and variance of the energy of each band, and allows to describe the structure or horizontal/vertical frequencies. The Color Temperature Histogram is based on a first k-means clustering of all image pixels in the LUV 3 Color space, and it describes the warm/cool impact of images. Using this features, the authors achieve the following recognition rates: 44% using a Gaussian Mixture Models (GMM) and 53.5% using a SVM. The authors also stated that these results seem to be worse when compared with other approaches, which can be explained by the heterogeneity of their reference set. However, the heterogeneity is needed to cover the different interpretations of mood of various subjects. In [67], the authors proposed a new EBIR system that uses an Artificial Neural Network (ANN) for labeling images with emotional keywords based on visual features only. Advantages of such approach is easiness adjustment to any kind of pictures and emotional preferences. The system consists of a database of images, neural network, searching engine and interface to communicate with a user. For all the images in the database, the authors extracted the following visual feature descriptors: Edge Histogram, Scalable Color Descriptor, Color Layout Descriptor, Color and Edge Directivity Descriptor (CEDD) and Fuzzy Color and Texture Histogram (FCTH). They used a supervised trained neural network for the recognition of the emotional content of images. The experiments showed that average retrieval rate depends on many factors: a database, a query image, number of similar images in the database and the training set of the neural network. The authors also suggest some improvements to increase accuracy of the results: a module for face detection and face expression analysis, and one to analyze existing textual descriptions of images and other meta-data. In [60], the authors investigate and develop new methods, based on theoretical and empirical concepts from psychology and art theory, to extract and combine low-level features that represent the emotional content of an image, and use them for image emotion classification. The features represent color, texture, composition and content (faces and skin). For Color, they implement the following features: brightness and saturation statistics, Colorfulness, Color names, hue statistics, Itten contrasts and Wang Wei-ning specialized histogram. In the case of the Texture, they implement the Tamura, wavelet Textures and GLCM features. Finally, for the Composition features, they used the level of detail of the image, the DOF, the rule of thirds, and the dynamics given by the lines. The authors performed several experiments and compared the results with similar works. They also stated that their feature sets outperform the results of the state of the work, specifically using the International Affective Picture System (IAPS) for five of the eight categories used, which means that the best feature set is dependent on both the category and dataset. 2 http://conceptir.konkuk.ac.kr 3 http://en.wikipedia.org/wiki/CIELUV 18 2.3.2 Recommendation Systems Recommendation systems are used to help users find a small but relevant subset of multimedia items based on their preferences. The most common implementations of recommendation systems are the TiVo 4 system and the Netflix 5 system [83]. These systems can be divided into two types: the Collaborative-Filtering (CF) and the Content-based Recommender (CBR). In the first one, they are based on collecting and analyzing a large amount of information on user’s behaviors, activities or preferences, and predicting what they will like based on their similarity to other users. In the second one, the items are annotated with metadata and the system estimates the relevance level of an observed item based on the inclination of the user toward the item’s metadata values. Traditionally, the recommendation systems relied on data-centric descriptors for content and user modeling. However, recently, there has been an increasing number of attempts to use emotions in different ways to improve the quality of recommendation systems [83, 84]. In [84] a new metadata field containing emotional parameters was used to increase the precision rate of the CBR systems: Affective Metadata (AM). The main assumption here is that the emotional parameters contain information that account for more variance than the Generic Metadata (GM) typically used. Furthermore, the users differ in the target emotive state while they are seeking and choosing multimedia content to view. These assumptions lead to an hypothesis: these individual differences can be exploited to achieve better recommendations. The authors propose a novel affective modeling approach using the first two statistical moments of the users emotive responses in VAD space. They performed a user-interaction session, and then compared the performance of the recommendations systems with both the AM and GM. The results achieved showed that the usage of the proposed affective features in a CBR system for images brings a significant improvement over generic features, and also indicated that SVM algorithm is the best candidate for the calculation of item’s rating estimates. These results indicate that the formulated hypothesis is true. One of the most well known problems of these systems is usually referred to as the matrix-sparsity problem. In theory, with the increase of the number of ratings per user, the model would be trained on a larger training set, which allows to achieve better accuracy for the recommended items. But, since the number of ratings per user is relatively low, the user models are not as good as they could be if the users had rated more items. However, if we replace the need for explicit feedback from the user with an implicit one, such as recording the emotional reaction of the user to a given item, and then use it as a way of rate that item, we can try to reduce this issue. This idea allows us to compute the proposed AM on the fly, as new information arrives. The inclusion of these methods would lead us to a standalone recommender system that can be used in real applications. In [83], a unifying framework using emotions in three different stages: entry, consumption and exit, of the model is presented, since it is important that the recommendation system application detects the emotions and makes good use of that information (as already explained earlier). When the user starts to use the system, he is in a given affective state: the entry stage, which is caused by some previous activities that are unknown to the system. When the recommendation system suggests a collection of items, the user’s mood influences the choice that he will do because the decision making process of the user (as explained in Section 2.1) is strongly influenced by his emotive state. For example, if a user is happy or sad, he might want to consume a different type of content according to the way he is feeling. In order to adapt the list of recommended items to the users entry mood, the system 4 http://www.tivo.com/ 5 https://signup.netflix.com/MediaCenter/HowNetflixWorks 19 must be able to detect the mood and to use it in the content filtering algorithm as contextual information. In the consumption stage, the user receives affective responses induced by the content that he is viewing. These responses can be single values (for example when watching an image) or a vector of emotions that change over the time (for example when watching a movie). In [84], these emotional responses were used for generating implicit affective tags for the content. The exit stage is when the user finishes the content consumption. In this stage, the exit mood will influence the user’s next actions, which will be taken in account in the entry mood if the user continues to use the recommendation system. 2.4 Datasets Several possibilities have been explored so far in order to induce emotional reactions, relying on different contexts and various degrees of participant involvement. The most used method of emotion induction is through the presentation of emotionally salient material like pictures, audio or video, without explicitly asking for a personal contribution from the participant. If stimuli are relevant enough, an appraisal is automatically executed and will trigger reactions in other measurable components of emotion such as physiological responses, expressivity, action tendencies, and subjective feeling. Although this kind of induction can target different perceptual modalities, the use of the visual channel remains the most common to convey emotional stimulation. In the different areas of research based on visual stimulation, such as EBIR systems or psychological studies, reliable databases are important for the success of emotion induction. Regarding this, in 1997, the IAPS [50] database was introduced. However, the extensive use of the same stimuli lowers the impact of the images since it increases the knowledge that participants have of the images. Another problem seems to be the limited number of pictures for specific themes in the IAPS database. This specially affects studies centered on a specific emotion thematic and designs that require a lot of trials from the same kind (e.g., EEG recordings). In order to increase the availability of visual emotion stimuli, in 2011, a new database called Geneva Affective PicturE Database (GAPED) [14], was created. It is important to remember that contrary to the IAPS database, the goal of the GAPED is not to be able to compare research performed by using the same database, but to provide researchers with some additional pre-rated emotional pictures. Even though research has shown that the IAPS is useful in the study of discrete emotions, the categorical structure of the IAPS has not been characterized thoroughly. In 2005, Mikels [66] collected descriptive emotional category data on subsets of the IAPS in an effort to identify images that elicit discrete emotions. In the following paragraphs we provide some detail about these three datasets. IAPS The IAPS database contains about 1182 images, and provides a set of normative emotional stimuli for experimental investigations of emotion and attention. The goal is to develop a large set of standardized, emotionally-evocative, internationally accessible, color photographs that includes contents across a wide range of semantic categories [50]. The authors rely on a relatively simple dimensional view, which assumes emotions can be defined by a coincidence of values on a number of VAD dimensions. Each picture of the database is plotted in terms of the mean Valence and Arousal rating. These ratings were made by male, female and children subjects using Self-Assessment Manikin (SAM) questionnaire for pleasure, Arousal and dominance, during 10 years. 20 GAPED To increase the availability of visual emotion stimuli, a new database called GAPED was created. The database contains 730 pictures, 121 representing Positive emotions using human and animal babies as well as natural sceneries, 89 for the Neutral emotions, mainly using inanimate objects, and 520 for the Negative emotions. In the case of the Negative pictures, they are divided into four categories: spiders, snakes, human rights violation and animal mistreatment. The pictures were rated according to Valence, Arousal, and the congruence of the represented scene with internal (moral) and external (legal) norms. These ratings were made by 60 subjects, where each subject rated 182 images. Given the size of the database, participants were divided into five groups, each one rated a subset of the database, which means that only 39 images were rated by all participants. Since Positive emotions are often neglected in the study of emotions, the GAPED has also followed this orientation, with attention being put on developing large Negative categories and a unique Positive category. Consequently, the database is asymmetric, with many more Negative than Positive pictures, and with contents more specific in the Negative pictures. Mikels This new dataset is composed of 330 images from IAPS: 133 Negative and 187 Positive, and was annotated with Positive and Negative emotions [79] [80]. The Positive emotions are Amusement, Awe, Contentment and Excitement, while the Negative are Anger, Disgust, Fear and Sadness. These data reveal multiple emotional categories for the images and indicate that this image set has great potential in the investigation of discrete emotions. The emotional category ratings were made by 30 males and 30 females, in two studies, using a subset of Negative images and a subset of Positive images, with a constrained set of categorical labels. For the Negative images, the study resulted in four categories: Disgust (31), Fear (12), Sadness (42), and blended (48), i.e, more than one emotion present in the image. In the case of the Positive images, the study resulted in six categories: Amusement (10), Awe (7), Contentment (15), Excitement (10), Blended (71), and Undifferentiated (74), i.e., with all the emotions present in the image. As we can see in Table 2.1, only GAPED and Mikels provides information about the category of an emotion, i.e., Negative, Neutral or Positive. Mikels also discriminate the emotions elicited by the images regarding Anger, Disgust, Fear, Sadness, Amusement, Awe, Contentment and Excitement. IAPS does not provide any information about the emotional content of the images that compose the dataset, only V-A information. IAPS GAPED Mikels # Total 1182 730 330 # Negative N/A 520 133 # Neutral N/A 89 N/A # Positive N/A 121 187 Emotions No No Yes Table 2.1: Comparision between IAPS, GAPED and Mikels datasets Besides the IAPS and GAPED databases, in which each image was annotated with their Valence and Arousal ratings, there are other databases (typically related to facial expressions) that were labeled with the corresponding emotions, such as NimStim Face Stimulus Set6 , Pictures of Facial Affect (POFA)7 or Karolinska Directed Emotional Faces (KDEF)8 . 6 http://www.macbrain.org/resources.htm 7 http://www.paulekman.com/product/pictures-of-facial-affect-pofa/ 8 http://www.emotionlab.se/resources/kdef 21 2.5 Summary Emotion in human cognition is essential and plays an important role in the daily life of human beings, namely in rational decision-making, perception, human interaction, and in human intelligence. Regarding the emotion representation, there are two different perspectives: categorial and dimensional. Usually, the dimensional model is preferable because it could be used to locate discrete emotions in space, even when no particular label could be used to define a certain feeling. In order to extract emotions from an image, we need to understand how their contents affect the way emotions are perceived by users. This content can be facial expressions of the faces present in the images, color, shape or texture information. To describe how humans perceive and classify facial expressions of an emotion, there are two types of models: the continuous and categorical. The continuous model explains how expressions of emotion can be seen at different intensities, whereas the categorical explains, among other findings, why the images in a morphing sequence between two emotions, like Happiness and Surprise, are perceived as either happy or surprise but not something in between. There have been developed models of the perception and classification of the six facial expressions of emotion. Initially, it used feature and shapebased algorithms, but, in the last two decades, appearance-based models (AAM) have been used. In both cases the recognition rates are already very good, varying from 80% to 90%. CBIR is a technique that uses visual contents of images to search images in large databases, using a set of features, such as Color, Shape or Texture. Color is the most extensively used visual content for image retrieval since it is the basic constituent of images. Shape corresponds to an important criterion for matching objects based on their physical structure and profile. Texture is defined as all that left after Color and local Shape has been considered; it also contains information about the structural arrangement of surfaces and their relationship to the surrounding environment. Each type of visual feature usually captures only one aspect of image property, which means that, usually, a combination of features is needed to provide adequate retrieval results. However, the low-level information used in CBIR systems does not sufficiently capture the semantic information that the user has in mind. In order to solve this, the EBIR systems could be used. These systems are a subcategory of the CBIR that, besides the common features, also use emotions as a feature. Most of the research in the area is focused on assigning image mood on the basis of eyes and lips arrangement, but colors, textures, composition and objects are also used to characterized the emotional content of an image, i.e., some expressive and perceptual features are extracted and then mapped into emotions. In the last five years, some EBIR systems have been development that were able to achieve recognition rates of 44% and 53.5% using classification methods such as GMM or SVM. Besides the extraction of emotions from an image, there has been an increasing number of attempts to use emotions in different ways such as the increase of the quality of recommendation systems. These systems help users find a small and relevant subset of multimedia items based on their preferences. Finally, the most known problems of these systems: matrix-sparsity problem, can be solved using implicit feedback, such as recording the emotional reaction of the user to a given item, and use it as a way of rate that item. As we can see a lot of work has been done identifying the relationship between emotions and the different visual characteristics of an image, recognizing faces in images and analyze the emotions that they transmit or even the new technique EBIR used to retrieve images based in emotion’s features. However, there is no system for identifying the emotional content present in an image. 22 3 Fuzzy Logic Emotion Recognizer As we have seen in Section 2.4, there are two types of datasets: those who have the images annotated with V-A values, and the ones with images annotated with the emotions they convey. However, there is no dataset with both characteristics or a model that, given the V-A values, can classify the emotions they represent. Hereupon, we propose a recognizer to classify an image with the universal emotions present in it and the corresponding category (Negative, Neutral and Positive), based on their V-A ratings using Fuzzy Logic. This recognizer will allow us to increase the number of images annotated with their emotions without the need of manual classification, reducing both the subjectivity of the classification and the extensive use of the same stimuli. This is particularly important because, if we use these images to perform manual classification, the impact of them in future studies will be lower, since it increases the knowledge that participants have about the images. 3.1 The Recognizer In order to map V-A ratings into emotion labels, we used the Circumplex Model of Affect (CMA) [75] [72] which states that all affective states arise from cognitive interpretations of core neural sensations that are the product of two independent neurophysiological systems: Valence and Arousal. It is important to mention that there are a lot of variations of this model with no consensus among them. In our case, we used the model in Figure 3.1 to be able to recognize the following six emotions (defined accordingly to Oxford Dictionaries1 ): Anger: A strong feeling of annoyance, displeasure, or hostility. Disgust: A feeling of revulsion or profound disapproval aroused by something unpleasant or offensive. Fear: An unpleasant emotion caused by the belief that someone or something is dangerous, likely to cause pain, or a threat. Happiness: The state of being happy, i.e., feeling or showing pleasure or contentment. Sadness: The condition or quality of being sad, i.e., feeling or showing sorrow or unhappy. 1 http://www.oxforddictionaries.com/us/definition/american_english/ 23 Surprise: An unexpected or astonishing event, fact, or thing. Figure 3.1: Circumplex Model of Affect with basic emotions. Adapted from [75] To build our dataset for training and testing our recognizer, we used Mikels dataset [79] [80] [66]. To our purposes, we have made two assumptions: 1) we assume that Amusement, Awe, Contentment and Excitement correspond to the basic emotion Happiness, and 2) besides each isolated emotion, we also consider classes of emotions that often occur together. According to the assumptions made, our initial dataset is composed by 1 image of Anger, Disgust and Fear (ADF), 6 images of Anger, Disgust and Sadness (ADS), 1 image of Anger and Fear (AF), 1 image of Anger and Sadness (AS), 31 images of Disgust (D), 25 images of Disgust and Fear (DF), 11 images of Disgust and Sadness (DS), 12 images of Fear (F), 3 images of Fear and Sadness (FS), 114 images of Happiness (Ha), and finally, 43 images of Sadness (S). Given that we removed the classes of emotions with fewer samples (less than 5), the resulting dataset includes: ADS, D, DF, DS, F, Ha and S. For each image in the dataset, we started by normalizing the V-A values (ranging between 0.5 and 0.5). Then, we divided the Cartesian Space, using these values, in order to define each class of emotions, and as we can see in Figure 3.2 there was a huge confusion among the different classes. In order to reduce the existing confusion, and considering the Circumplex Model of Affect (See Figure 3.1), we used the Polar Coordinate System (See Figure 3.3) to represent each image in terms of Angle (see Equation 3.1) and Radius (see Equation 3.2), each computed using the V-A. Angle was used to identify the class of emotion for each image belongs to, while Radius was used to help reduce emotion confusion between images with similar angles. Angle(V alence, Arousal) = arctan( Radius(V alence, Arousal) = Arousal ) 2 [0 , 360 ] V alence p 2 p 2 V alence2 + Arousal2 2 [0, 24 2 ] 2 (3.1) (3.2) Figure 3.2: Distribution of the images in terms of Valence and Arousal Figure 3.3: Polar Coordinate System for the distribution of the images 25 Even with the use of Angle and Radius to describe each image, it still exists confusing among the different classes of emotions, so instead of using rigid intervals we decided to used Fuzzy Set Theory to describe each class of emotions, as well as the categories. A fuzzy set corresponds to a class of objects with a continuum Degree of Membership (DOM), where each set is characterized by a membership function, usually denoted as µA (x), which assigns to each object a DOM with a range between zero and one [90]. Any type of continuous probability distribution function can be used as a membership function. In our work we used the Product of Sigmoidal membership function and the Trapezoidal membership function, which we shortly describe in the following paragraphs. Product of Sigmoidal membership function: A sigmoidal function (See Figure 3.4) depends on two parameters: a and c (see Equation 3.3). The first one controls the slope, while the second is the center of the function. Depending on the sign of the parameter a, the function is inherently open to the right or to the left. Figure 3.4: Sigmoidal membership function sigmf (x : a, c) = 1 1+e (3.3) a(x c) The final equation for this membership function is given by: psigmf (x : a1 , c1 , a2 , c2 ) = 1 1+e a1 (x c1 ) ⇥ 1 1+e a2 (x c2 ) (3.4) Trapezoidal membership function: The trapezoidal curve (See Figure 3.5) depends on four scalar parameters a, b, c, and d (see Equation 3.5). The parameters a and d locate the “feet” of the trapezoid and the parameters b and c locate the “shoulders”. Figure 3.5: Trapezoidal membership function trapmf (x : a, b, c, d) = max(min( x b a d , 1, a d x ), 0) c (3.5) Regarding the computation of the parameters for the membership functions, we started by using mean and stddev measures. In the case of the classes of emotions, for the Angle membership function, both measures were used for the slope parameters, i.e., a1 and a2 . The parameters c1 and c2 26 correspond, respectively, to the lowest and highest value of the Angle for that subset of images. For the Radius membership function, b is the minimum, and c the maximum value of the Radius for that subset of images, while a = b ✏1 and d = c + ✏2 , with ✏1 = ✏2 = 0.01 (empirical value). In the case of the categories parameters, and since we used trapezoidal memberships for the angles and the radiuses, b and c parameters correspond to the lowest and highest value of the Angle/Radius for that subset of images; in the case of the parameters a and d, the only difference are the ✏ values that vary according to each category. For all the classes of emotions, we removed the outliers, i.e, images with angles or radius that were distant from the angles or radius of the majority of the images for the corresponding class. Although fuzzy sets are commonly defined using only one dimension, they can be complemented with the use of cylindrical extensions. Given this, we used a two-dimensional membership function that is the result of the composition of the two one-dimensional membership functions mentioned above. For each category (see Figure 3.7 to 3.11) we used Trapezoidal membership function, both for Angle and Radius (see Equation 3.6). In the case of the classes of emotions (see Figures 3.12 to 3.18) we used the Product of Sigmoidal membership function for the Angle and the Trapezoidal membership function for the Radius (see Equation 3.7). category(Angle, Radius : a1 , c1 , a2 , c2 , a, b, c, d) = trapmf (Angle : a, b, c, d) ⇥trapmf (Radius : a, b, c, d) emotions(Angle, Radius : a1 , c1 , a2 , c2 , a, b, c, d) = psigmf (Angle : a1 , c1 , a2 , c2 ) ⇥trapmf (Radius : a, b, c, d) 2-D Membership Function 27 (3.6) (3.7) 1-D Membership Functions for Angle 1-D Membership Functions for Radius Figure 3.7: Membership Functions for Negative category 2-D Membership Function 1-D Membership Functions for Angle 1-D Membership Functions for Radius Figure 3.9: Membership Functions for Neutral category 28 2-D Membership Function 1-D Membership Functions for Angle 1-DMembership Functions for Radius Figure 3.11: Membership Functions for Positive category 2-D Membership Function 29 1-D Membership Function for Angle 1-D Membership Function for Radius Figure 3.12: Membership Functions for Anger, Disgust and Sadness 2-D Membership Function 1-D Membership Function for Angle 1-D Membership Function for Radius Figure 3.13: Membership Functions for Disgust 30 2-D Membership Function 1-D Membership Function for Angle 1-D Membership Function for Radius Figure 3.14: Membership Functions for Disgust and Fear 2-D Membership Function 31 1-D Membership Function for Angle 1-D Membership Function for Radius Figure 3.15: Membership Functions for Disgust and Sadness 2-D Membership Function 1-D Membership Function for Angle 1-D Membership Function for Radius Figure 3.16: Membership Functions for Fear 32 2-D Membership Function 1-D Membership Function for Angle 1-D Membership Function for Radius Figure 3.17: Membership Functions for Happiness 2-D Membership Function 33 1-D Membership Function for Angle 1-D Membership Function for Radius Figure 3.18: Membership Functions for Sadness Each image was annotated with the degree of membership for each possible category and class of emotions, and were also associated to the image the two dominant categories and the two dominant classes of emotions. In Figure 3.19 there is a global view of the membership functions of all the classes of emotions for Angle, being possible to see the existing confusion between each of the classes of emotions. There is clearly a differentiation between the Positive emotion Happiness, ([0 , 95 ] [ [300 , 360 ]) and the Negative emotions ([120 , 280 ]). However, there is a lot of confusion among the Negative emotions, being the main confusions between DF and F, between D, DF, ADS and DS, and finally between ADS, DS and S. With the exception of DF, that is overlapping with almost all other Negative emotions, even with the ones without any obvious connection (for example S), the remaining are logical, and expectable, overlaps of emotions. Figure 3.19: Membership Functions of Angle for all classes of emotions 34 In Figure 3.20 there is a global view of the confusion between emotions regarding Radius. In this case, and contrary to what happened in the case of Angle, there is no clear differentiation between Negative and Positive emotions. As we can see, there are no emotions in the proximity of the extremes (0 and 70), in fact the emotions are lying between 8 and 55. Almost all emotions are completely inside the D interval ([10, 55]), which is the emotion with the biggest range of values for Radius. In some cases, such as F, which is completely inside DF, or S, which almost overlaps completely DF, the Radius will not be particularly helpful. However, and considering the results for the Angle, in the case of confusions between ADS, DS and S, the use of Radius will be useful, for example in the interval of [10, 18] the emotion will undoubtedly be S. So, the combination of the two attributes (Angle and Radius) allow us to better distinguish the emotions. Figure 3.20: Membership Functions of Radius for all classes of emotions 3.2 Experimental Results In order to build the training dataset, we analyzed the dataset for each class of emotions and categories. For both cases, we concluded that the distribution of the images is not symmetric, being more evident in the classes of emotions. For the proper evaluation of our model for the classes of emotions, and taking into account that “Clinicians and researchers have long noted the difficulty that people have in assessing, discerning, and describing their own emotions. This difficulty suggests that individuals do not experience, or recognize, emotions as isolated, discrete entities, but that they rather recognize emotions as ambiguous and overlapping experiences.” as stated in [72], we consider that a result is correct if the expected class of emotion is present (totally or partially) in the result label; if it is not present, we consider the class of emotion with the biggest DOM as a confusion. For example, if the expected class of emotion was D, we considered correct results ADS, DS, D, DF or any combination of one of those with a second class of emotion. As we can see in Table 3.1, the best results were achieved for D, F and Ha. In the case of Ha, this result is due to the clear distinction between the Angle values for Ha and the remaining emotions; while 35 in the case of D, we believe it is due to the big interval, both for Angle and Radius; finally, in the case of F, we believe it is because of the interval of Angle that is only overlapping with DF. However, DF shows the worst result, but this is expectable given that both the Angle and Radius intervals are overlapping with the majority of the emotions. (%) ADS D DF DS F S H ADS 83.33 D 100 9.09 4.65 DF DS 76 8 90.91 4.65 F S 16.67 H 16 100 90.70 100 Table 3.1: Confusion Matrix for the classes of emotions in the IAPS dataset For evaluating the model when it come to categories, we follow a similar approach to the one described above. If the expected category is one of the returned categories, we consider the result as correct; otherwise we consider the one with the biggest DOM as a confusion. If the two categories have the same DOM we select the “worst”, i.e., for example between Neutral and Positive, we choose Neutral as the confusion result. Table 3.2 presents the achieved results for the IAPS dataset (corresponding to our training dataset), while Table 3.3 shows the results for the GAPED dataset. These results were achieved after the adjustment of the Radius parameters for each category (in order to eliminate some non-classified results). (%) Negative Positive Negative 100 Positive 100 Table 3.2: Confusion Matrix for the categories in the Mikels dataset (%) Negative Neutral Positive Negative 87.89 Neutral 7.69 98.88 Positive 4.42 1.12 100 Table 3.3: Confusion Matrix for the categories in the GAPED dataset As expected the results achieved when using the same set for training and test were 100% for Negative and Positive categories. With the use of GAPED as a testing set and with the addition of the Neutral category (which did not exist in the training dataset), the results for the Negative category became worse; however, this is mainly due to the existing confusion between the Negative and Neutral categories in the used dataset [14]. The Positive category maintains an accuracy of 100%, while the Neutral category obtained almost 99%. 3.3 Discussion In this work, we developed a model to automatically classify the emotions and categories conveyed by an image in terms of their normalized Valence and Arousal ratings. With this model we were able to successfully annotate our training set with the dominant categories with classification rates of 100% and the dominant classes of emotions with an average classification rate of 91.56%. 36 Although we intend to be able to recognize the six basic emotions, we don’t have any data for the emotions of Anger and Surprise. In the case of Anger, this is due to the fact that it is difficult to elicit the emotion through the images used as explained in [66], while in the case of the emotion Surprise, it is because the work we followed did not consider this emotion. In general, the results achieved are very good; however, in the case of emotions, it is important to mention the existing confusion between some of them, mainly Disgust and Sadness (and the corresponding classes that are composed with at least one of these emotions), which can be explained by the neuroanatomical findings in [48], in which the authors mentioned that some regions such as prefrontal cortex and thalamus are common to these emotions, as well as the association with activation of anterior and posterior temporal structures of the brain, using film-induced emotion. We also annotated the GAPED dataset and the remaining pictures of the IAPS dataset. For GAPED we have a non-classification rate of 23.4%, and an average classification rate of 95.59% for the categories (we don’t have any information about the emotions). In the case of IAPS we achieved a nonclassification rate of 4.86%, however we cannot identify the classification rates because, besides the images we used as training set (Mikels), we did not have any information about the categories or emotions. The non-classified results are explained with the lack of images covering the whole space of the CMA on both datasets, and, in the particular case of GAPED it can also be explained by the use of a slightly different CMA. The existing confusion between the Negative and Neutral categories (in the GAPED dataset) already existed as explained in [14], while the confusion between the Negative and Positive categories can be explained by the use of different models of the CMA. 3.4 Summary We developed a recognizer to classify an image with the universal emotions present in it and the corresponding category (Negative, Neutral and Positive), based on their V-A ratings using Fuzzy Logic. For each image in the dataset, we started by normalizing the V-A values, and computed the Angle and the Radius for each image in order to help reduce emotion confusion between images with similar angles. To describe each class of emotions, as well as the categories, we used the Product of Sigmoidal membership function and the Trapezoidal membership function. For the categories we used Trapezoidal membership function, both for Angle and Radius, while for the classes of emotions, we used the Product of Sigmoidal membership function for the Angle and the Trapezoidal membership function for the Radius; As expected, the results achieved when using the same set for training and test were 100% for Negative and Positive categories. In the case of the dominant classes of emotions we achieved an average classification rate of 91.56%. With the use of GAPED as a testing set we achieved an average recognition rate of 96% for categories. For GAPED, we achieved a non-classification rate of 23.4%, while in the case of the IAPS, we achieved a non-classification rate of 4.86%. 37 38 4 Content-Based Emotion Recognizer Emotion, which is also called mood or feeling, can be seen as emotional content of an image itself or the impression it makes on a human. When talking about emotions, it is important to mention the subjectivity inherent, since different emotions can appear in a subject while looking at the same picture, depending on its current emotional state [67]. However, the expected affective response can be considered objective, as it reflects the more-or-less unanimous response of a general audience to a given stimulus [30]. There is a general agreement on the fact that humans can perceive all levels of image features, from the primitive/syntactic to the highly semantic ones [74], and also that artists have been exploring the formal elements of art, such as lines, space, mass, light or color to express emotions [18]. Given this, we assumed that emotional content can be characterized by the image color, texture and shape. Additionally, and given that certain features in photographic images are believed, by many, to please humans more than others, we also consider the aesthetics of an image, that in the world of art and photography refers to the principles of the nature and appreciation of beauty [17, 39]. In order to acquire as much information as possible about an image, we will use different features regarding Color, Texture, Shape, Composition, among others. However, a commitment between all the information collected and the processing time has to be found. Since most of the descriptors only model a particular property of the images, and in order to obtain the best results, a combination of features is often required. As stated in [30, 74], low-level image features can be easily extracted using computer vision methods, however they are no match for the information a human observer perceives. After the identification of the features that can be used to describe an image in terms of their emotional content, we will train different classifiers in order to identify the best features to describe an image according to their category of emotions. Given this, our goal is to identify the combination of visual features that can match human perception as closely as possible regarding the Positive or Negative content of an image. Regarding the features’ extraction, we selected the most used in literature and easiest to compute resulting in the following descriptor vector features (for simplification purposes, in the future, we will only refer to them as “features”): AutoColorCorrelogram (ACC) [36], Color Histogram (CH) [78], Color Moments (CM) [17], Number of Different Colors (NDC) [18], Opponent Histogram (OH) [85], 39 Perceptual Fuzzy Color Histogram (PFCH) [3, 4], Perceptual Fuzzy Color Histogram with 3x3 Segmentation (PFCHS) [3], Reference Color Similarity (RCS) [45], Gabor (G) [64], Haralick (H) [31], Tamura (T) [82], Edge Histogram (EH) [8], Rule of Thirds (RT) [17], Color Edge Directivity Descriptor CEDD [9], Fuzzy Color and Texture Histogram FCTH [11] and Joint Composite Descriptor (JCD) [10]. The majority of the features were extracted using jFeatureLib [26] and LIRE [58, 59], although PFCH, PFCHS, RT and NDC were implemented by us. For the classification, we used Weka 3.7.11, a data mining software [28]. This software allows us to use three different groups of classifiers: simple, meta and combination. For the simple classifiers we use Naive Bayes (NB) [37], Logistic (Log) [52], John Platt’s sequential minimal optimization algorithm for training a support vector classifier (SMO) [33, 41, 70], C4.5 Decision Tree (algorithm from Weka) (J48) [73], Random Forest (RF) [7], and K-nearest neighbours (IBk) [1]. In the case of meta classifiers, i.e., classifiers based on other classifiers, we used LogitBoost (LB) [24], RandomSubSpace (RSS) [34], and Bagging (Bag) [6]. For the combination of classifiers we used Vote with the Average combination rule [44, 47]. Although one of the good practices of machine learning is to use normalized data, in our tests we did not find any difference in the results, so we kept the features unnormalized. The tests and results are described in subsection 4.2. 4.1 List of features used In this section we describe shortly the several features extracted from the images AutoColorCorrelogram [36] Given that a color histogram only captures the color distribution in an image and does not include any spatial correlation information, the highlight of this feature is the inclusion of the spatial correlation of colors with the color information. Color Histogram [78] This feature is a representation of the distribution of colors in an image, i.e., it represents the number of pixels that have colors in each of a fixed list of color ranges (quantization in bins). In our work we use a HSB Color histogram. Color Moments [17] This feature computes the basic color statistical moments of an image like mean, standard deviation, skewness and kurtosis. Number of Different Colors [18] This feature counts the number of different colors, using RGB space, that compose an image. Opponent Histogram [85] This feature is a combination of three 1D histograms based on the channels of the opponent Color space: O1 , O2 and, O3 . The color information is represented by O1 and O2 , while intensity information is represented by channel O3 . Perceptual Fuzzy Color Histogram [4] In this feature, for each pixel of the image, the degree of membership for its Hue is evaluated and assigned to the correspondent bin of the fuzzy histogram. Therefore, after processing the whole image, each of the 12 bins of the fuzzy histogram will have the sum of the DOMs (degree of membership) for the corresponding Hues. 40 Perceptual Fuzzy Color Histogram with 3x3 Segmentation [3] This feature divides an image into 9 equal parts, and performs PFCH in each one. The result is the combination of the nine fuzzy histograms. Reference Color Similarity [45] This feature is not a histogram, since reference colors used are processed independently; any subset of dimensions gives the same result as computing just these colors, making this feature space very favorable for feature bagging and other projections. Gabor [64] This feature represents and discriminates Texture information, using frequency and orientation representations of Gabor filters since they are similar to those of the human visual system. Haralick [31] This feature is based on statistics, and summarizes the relative frequency distribution, that describes how often one gray tone will appear in a specified spatial relationship to another gray tone on the image. Tamura [82] This feature implements three of the following Tamura features: coarseness, contrast, directionality, line-likeness, regularity and roughness. Edge Histogram [8] This feature captures the spatial distribution of undirected edges within an image. The image is divided into 16-equal-sized, non overlapping blocks. After that, each block is divided into a 5-bin histogram counting edges in the following categories: vertical, horizontal, 35 , 135 and nondirectional. Rule of Thirds [17] This feature computes the color moments for the inner rectangle of an image divided into 9 equal parts. Color Edge Directivity Descriptor [9] This feature incorporates Color and Texture information in a histogram. In order to extract the Color information, it uses a Fuzzy-Linking histogram. Texture information is captured using 5 digital filters that were proposed in the MPEG-7 Edge Histogram Descriptor. Fuzzy Color and Texture Histogram [11] This feature combines, in one histogram, Color and Texture information using 3 fuzzy systems. Joint Composite Descriptor [10] This feature corresponds to a joint descriptor that joins CEDD and FCTH information in one histogram. 4.2 Classifier For testing and training, we used Mikels dataset [66, 79, 80], with 113 Positive images and 123 Negative images. The Positive images are the ones with the Happiness label, while the Negative ones correspond to ADS (6), D (31), DF (20), DS (11), F (12) and S (43) labels. We separated the data into a training and test set using K-fold Cross Validation with K = 5 [60]. We started by analyzing a set of classifiers, in order to understand which one learned best the relation between features and the given category of emotion (See Table A1). In the case of the simple classifiers 41 (NB, Log, SMO, J48, RF and IBk) we used their default configurations, but in the case of the meta classifiers (LB, RSS and Bag) we used RF as the base classifier. For these preliminary tests, we used all the features, but without any combination between them, and we did not consider the time required to build the model. With these classifiers, we achieved average recognition rates between 52.75% and 56.62%. However, after the observation of the results, we were not able to choose only one classifier. For example, for ACC feature, the best result was achieved using Bag, while for PFCH or PFCHS the best result was achieved using the NB classifier. Based on these relations (for each feature), we studied the following combinations of classifiers (using Vote classifier): Vote 1 (V1) Vote(SMO+NB+LB+Log+Bag) Vote 2 (V2) Vote(SMO+NB+LB+RF+RSS) Vote 3 (V3) Is similar to V2, but with default configurations for the LB and RSS classifier Vote 4 (V4) Vote(SMO+NB+LB) Vote 5 (V5) Vote(SMO+NB+Log) Vote 6 (V6) Vote(SMO+NB) As we can see in Figure 4.1, the global results for Vote classifiers are better than the ones achieved using simple or meta classifiers. Regarding the features, and considering the average of recognition rates across all the classifiers, the most promising features correspond to the PFCH with 64.27%, CH with 64.13%, RT with 63.06%, PFCHS and JCD with 61.16%, CEDD with 60.59% and FCTH with 60.10%. Figure 4.1: Average recognition considering all features Although we were able to improve recognition rates with the use of the combination of classifiers, we still have similar average recognition rates, from 56.57% to 57.76%, between them. Additionally, on average, each classifier performs best for 3 of the 16 features, which means that there were no better classifiers in the majority of the features. 4.2.1 One feature type combinations Considering the preliminary results, we performed more tests using different combinations of features inside each feature type (Color, Composition, Texture, Shape and Joint) using the six vote classifiers (See Table A2). Given the amount of possible combinations, especially for Color features, we considered 42 as candidates for the best features those with a recognition rate greater than the average of all the features (for each classifier). Color In Table A3, we can see the results for Color using only one feature. Tables A4 to Table A10 show the results for combinations of two to all of the Color features. The grey cells in tables correspond to the cells with a value greater than the average for that classifier. In the following paragraphs we discuss the results for the various combinations of features. One feature: Even though we had only identified the Color features PFCH, PFCHS and CH, earlier, as the most promising, we also included OH and RCS to the Color candidates list. As expected from the literature, and as we can see in Figure 4.2 the best results were the ones corresponding to any type of histogram, in particular the commonly used Color Histogram, as well as PFCH and PFCHS, that take into account the way users perceive color. Figure 4.2: Results for Color - one feature Two features: When we considered combinations of two Color features, the average recognition rates increased almost 2%, considering all of the possible combinations. As we can see in figure 4.3 the best results were achieved using combinations that include CH, PFCHS or PFCH. Generally, for CH, the use of NDC, OH, PFCH and RCS improved its recognition rates, while CM and PFCHS reduced them. The features OH and NDC, when combined with the other features, also improved their average recognition rate. Finally, PFCH is improved slightly when combined 43 Figure 4.3: Results for Color - two features with RCS. As expected, and in general, the combination of the best individual features gave us the best results. Three features: In this case, the average rate between two and three features is almost the same. Considering the best combination features for two features: CH+RCS (68.22%), OH+PFCHS (67.37%), PFCHS+RCS (66.95%), CH+OH (66.95%), and CH+NDC (66.95%), we can see in figure 4.4 that, in general, the results were better. For example, the best combination for two features achieved 68.22%, but CH+OH+RCS has a recognition rate of 69.50%. Moreover, the second best combination for two features achieved worst results than all the best combinations for three features. Figure 4.4: Results for Color - three features For the other combinations with a recognition rate at least of ' 65%, we observe that the majority 44 of them achieved better results than the ones achieved only with the use of two features. However, for example, CH+OH+PFCH has a rate of 65.68% while CH+OH has a better rate of 66.95%. These observations allow us to conclude, for now, and against our beliefs, that with the increase of information, in general, the accuracy of the classification is not linearly better. Four features: Contrary to what happened in the previously analyzed tests, the average rate between three and four features has decreased. The same also happened if we consider only the average for the best features. Although the differences appears to be minimal, nevertheless they meet our previous observation. The best results were the ones including the combination CH+CM+NDC, where the best combination was CH+CM+NDC+RCS with a recognition rate of 68.64% using V2. Five features: Comparing these tests with the ones above, we noticed a marginal decrease in the average recognition rate for all combinations. Just by looking to the best features: CH+CM+NDC+OH+RCS (67.80%), and CH+CM+NDC+OH+PFCHS (66.95%), we found the contrary since there was a small increase, in the recognition rate, when compared with CH+CM+NDC+OH (66.10%). For all the combinations based on CM+NDC+OH+PFCH or NDC+OH+PFCH+PFCHS, in general, there were no noticeable differences in the average recognition rates when compared with the results from five features. But when we combined them with the RCS feature we verified an increase in the corresponding recognition rates. Six features: CH+CM+NDC+OH+PFCH+RCS (67.80%), and CH+CM+NDC+OH+PFCH+PFCHS (67.38%) were the best combinations. In both cases, when we compared them with CH+CM+NDC+OH+PFCH (65.25%) there was a significant improvement in the recognition rates achieved. In some cases, such as for example CM+NDC+OH+PFCH+PFCHS or CH+CM+NDC+OH+PFCH, if we combine them with the RCS feature, there was an improvement in the corresponding average recognition rates. Seven features: In this group we achieved a poor average recognition rate. However, this was expectable since the majority of the combinations include the ACC feature. This feature had achieved the worst results in all the performed tests. The best combination was CH+CM+NDC+OH+PFCH+PFCHS+RCS that achieved an average recognition rate of 66.95% for V3. All features: Across all the tests, this was one of the worst results. However, as in the previous test this was expectable since it includes the worst feature (ACC). In fact, as we can see in Table A9, the same combination without ACC feature has an average rate of 64.69%. In Table A11, there is the final list of Color candidate features. Given the size of this list, we started by reducing the number of classifiers to analyze. We first looked into the recognition rates for all the combinations for each of the classifiers. V4 was the best classifier with an average rate of 63.96%, followed by V1, V2, V3, V5, and V6, with similar recognition rates of 63.83%, 63.71%, 63.62%, 62.70%, and 62.19%. Given these results, we only kept analyzing the values for the best classifiers, i.e., V4, V1 and V2, and decided to keep only Color combinations with an average recognition rate of at least 66%, reducing the list only to 7 combinations of features (See Table 4.1). 45 For the remaining classifiers, we analyzed the time they took to learn and build the model. In Figure 4.5 we can see the time that each vote take to build the model for the different number of features used in the performed tests. As we can see, the most unstable classifiers was V1, V3 and V5. Given this, subsequently we will only consider the V2 and V4 classifiers. Figure 4.5: Time to build models Composition Given that we only consider one feature of this type (RT), we cannot perform an extended analysis. However, and considering that this feature corresponds to the Color moments of a segmented part of an image, i.e., it captures Color information for the inner rectangle of an image, we can perform some comparisons against the Color results (see Table A12). Across all the classifiers, this feature achieved an average recognition rate of 63.35%. If we compare it with the average recognition rate for Color, there is a difference of almost 3% in the recognition rates, but it is important to mention that RT only has a dimension of 4, which means it is extremely quick to extract from an image, while the average dimension of Color is 343. Given both the recognition rate and the dimension of this feature, we considered it as a promising feature, not only for combination with other features, but also to use as a single feature. Shape Similarly to Composition, for this type, we only considered one feature (EH), which achieved the worst results (see Table A13), in all the performed tests until now: an average rate of 44.71%. However, we selected it for further tests, in order to see if in combination with other types, such as colors, it helps to discriminate the emotional category of an image. Texture For this group, H was the best feature (56.78%) (see Table A14). If we observe the combinations of two Texture features, the best one was H+T with a small decrease when comparing to H. When we combine all the Texture features, the rate slightly decreases (55.08%). For further tests, we selected the two best features: H+T, and H. Joint The best features were JCD and CEDD, respectively with 63.56% and 62.71% recognition rates (see Table A15). For the combinations of two features, the majority achieved worse results than the individual 46 ones, with the exception of the FCTH+JCD that had the same rate as CEDD. For the combination of all the features we achieved an average recognition rate of 61.44%. The selected features were: JCD (61.16%), CEDD, and FCTH+JCD. In Table 4.1 we can see the final list of features to use in the following tests. Regarding the distribution of the type of features, we have 50.00% for Color features, 21.44% for Joint, 14.28% for Texture, and the remaining 14.28% equally divided between Composition and Shape features. At this point, and given these results, we expect that the combination of features of different types increases the recognition rates, and allows us to better discriminate the emotional category of a given image. The new tests were done using combinations of two and three different types of features. Color Composition Shape Texture Joint CH+RCS CH+NDC+RCS CH+OH+RCS CH+PFCH+RCS CH+PFCHS+RCS CH+CM+NDC+RCS CH+CM+NDC+OH+PFCH+RCS RT EH H H+T CEDD JCD FCTH+JCD Table 4.1: List of best features for each category type 4.2.2 Two feature type combinations In the case of the tests performed using combinations of two types of features, the results can be seen in Table A16 for Color and Composition, Table A17 for Color and Shape, Table A18 for Color and Texture, Table A19 for Color and Joint, Table A20 for Composition and Shape, Table A21 for Composition and Texture, Table A22 for Composition and Joint, Table A23 for Shape and Texture, Table A24 for Shape and Joint, and Table A25 for Texture and Joint. Using the combination of the best features for Color and Composition, almost all the combinations performed worse than the original feature Color (i.e., without the Composition feature); the only exception was OH+PFCHS+RCS+RT which increased the corresponding recognition rate. In the case of Color and Shape, with the addition of Color information to the Shape features, all the combinations achieved better results. For Color and Texture, some of the Color combinations were improved with the use of the Texture feature H, namely CH+PFCH+RCS and CH+PFCHS+RCS. In fact, CH+PFCH+RCS+H, is one of the best features. Regarding the two Texture features used: H and H+T, the first one when combined with the different feature colors achieved, on average, better results. In the tests using Color and Joint, we were combining two of the best feature types. None of the combinations performed better than the original Color feature, which means that the use of CEDD, JCD, and FCTH+JCD did not add any useful information to the one already captured by color. For the combination of Composition and Shape, if we compare it with Shape feature EH, it is slightly better, however it is considerably worse (more than 13%) if compared with the Composition feature RT. On average, the achieved results using combined Composition and Texture features were worse than the average recognition rate of the two types separately. In the case of the Texture feature H+T, it is 47 slightly better when combined with RT, with a similar dimension, which means that, in this case, it is better to use the combined feature. For Composition and Joint, all of the combinations achieved worse results than the isolated features. So, in this case it is preferable to use Composition feature alone. Regarding Shape and Texture, and although the tested combinations achieved a better average recognition rate when compared to Shape, it is still better to use one of the Texture features (H or H+T), since the corresponding recognition remains better. For Shape and Joint combinations, when compared with Shape the achieved results were better, but considerably lower than the results achieved for Joint. In the case of Texture and Joint all of the combinations, when compared with Texture features, were improved. 4.2.3 Three feature type combinations For the tests using combinations of three types of features, the results are in Table A26 for Color, Composition and Shape, Table A27 for Color, Composition and Texture, Table A28 for Color, Composition and Joint, Table A29 for Color, Shape and Texture, Table A30 for Color, Shape and Joint, and Table A31 for Color, Texture and Joint. For Color, Composition and Shape, all the combinations achieved worst results with the addition of Shape feature. For Color, Composition and Texture, with the addition of Texture information to Color and Composition combinations, some of the new combinations achieved better results such as OH+PFCHS+RCS+RT+H or OH+PFCHS+RCS+RT+H+T. For Color, Composition and Joint all the results were worst. In the case of Color and Shape and Texture we achieved better recognition rates, especially with the use of H Texture feature. For Color, Shape and Joint, we achieved some better results with the use of FCTH+JCD feature. For Color, Texture and Joint all the results were worst. In general, the results with the addition of more information tend to decrease, even though we were able to improve some of our previous results. Considering the results achieved until now, for the next tests we will only use the best three feature type combinations: OH+PFCHS+RCS+RT+H, OH+PFCHS+RCS+RT+H+T, CH+RCS+H+FCTH+JCD, and CH+PFCH+RCS+H+T+CEDD. 4.2.4 Four feature type combinations For these tests, the results can be seen in Table A32 for Color, Composition, Texture and Shape, Table A33 for Color, Composition, Texture and Joint, Table A34 for Color, Texture, Joint and Shape, and in Table A35 for Color, Texture, Joint and Composition. For all the combinations, the achieved results were considerably worse than the original ones. The average recognition rate of the initial combinations was 66.53%, while the new achieved recognition rate decreased to 62.83%. Given these results, we will not perform tests using all the feature types combinations. 4.2.5 Overall best features combinations In Table 4.2 we can see the best features across all the tests, and the recognition rates achieved. Color Color and Composition Color and Texture % CH + OH + RCS CH + CM + NDC + RCS CH + OH + RCS + RT CH + CM + NDC + RCS + RT CH + RCS + H CH + PFCH + RCS + H CH + CM + NDC + RCS + H CH + PFCH + RCS + H + T 48 V2 68.64 68.64 67.37 66.95 67.80 68.22 68.22 66.95 V4 66.53 66.10 64.83 65.25 64.83 66.10 64.41 65.68 Color and Joint Color, Composition and Texture Color, Texture and Joint CH + RCS + CEDD CH + NDC + RCS + CEDD CH + OH + RCS + CEDD OH + PFCHS + RCS + RT + H OH + PFCHS + RCS + RT + H + T CH + RCS + H + FCTH + JCD CH + PFCH + RCS + H + T + CEDD 66.95 67.80 68.64 65.25 66.95 68.22 66.95 65.23 65.25 64.83 68.22 66.56 63.98 66.10 Table 4.2: Overall best features We consider the following combinations as the best ones: CH+CM+NDC+RCS, CH+OH+RCS, CH+CM+NDC+RCS+H, CH+OH+RCS+CEDD, CH+PFCH+RCS+H, CH+PFCHS+RCS+RT+H+T, CH+ RCS+H+FCTH+JCD, and CH+PFCHS+RCS+RT+H, with recognition rates above 68.00% for V2, and 66.50% for V4. Almost all of the combinations were composed mainly by Color features, which was expected since color is the primary constituent of images, and usually the most important characteristic for influencing the way people perceive images. In some cases, the use of Texture or Joint features was useful to reduce the number of Color features used to capture the emotional information of the images. In Table A36 we can see the respective confusion matrices for each of the best features. For the Positive category the best combination was OH+PFCHS+RCS+RT+H (58.41%) using classifier V4, while for the Negative the best were: CH+CM+NDC+RCS, and CH+OH+RCS, both using classifier V2 with a recognition rate of 82.65%. In order to confirm if our selected combinations really discriminate an image in terms of their emotional content, we also trained the two classifiers V2 and V4 using a new dataset with images from GAPED. For the first tests, we used 121 Negative images (31 from Animal, 30 from Human, 30 from Snake, and 30 from Spider, chosen randomly) and 121 Positive images. Although we had only considered Positive and Negative category in the tests performed until now, due to the use of Mikels dataset, we also performed tests using Neutral category (89 images from GAPED dataset). The results of the first tests, i.e., the ones only using Positive and Negative categories, can be seen in Table A37 for confusion matrices. In Table A38 we can see the results using also Neutral category. For the tests using Positive and Negative categories the best combinations were CH+OH+ RCS+CEDD (70.25%) for Positive category, and CH+PFCH+RCS+H (82.11%) for Negative, in both cases using classifier V4. In the case of tests using also the Neutral category, the results were considerably worse, but since we did not train the model considering Neutral category, they were somewhat expectable. The biggest confusion was between Negative and Neutral categories, although this is a known problem for the GAPED dataset. The best combination for Positive category was OH+PFCHS+RCS+RT+H+T (58.78%) using classifier V2, in the case of Neutral category it was the CH+RCS+H+FCTH+JCD (65.17%) using classifier V4, finally, for the Negative category we had two best combinations (both using classifier V2): OH+PFCHS+RCS+RT+H (65.29%), and OH+PFCHS+RCS+RT+H+T (65.29%). Given the results achieved until now, and in order to select the best classifier and combination of features for our final recognizer, we created a new dataset of 468 images selected from both Mikels dataset and GAPED dataset. From each one we selected 121 Negative and 113 Positive images, giving us a total of 242 Negative images and 226 Positive images. We divided the dataset with 2 3 for training (312 images), and the remaining for test (156 images). As we can see in Table A39 the best combination for the Negative category was OH+PFCHS+RCS+RT+H+T using classifier V4 (88.50%), while for the Positive it was CH+RCS+H+FCTH+JCD using classifier V2 (61.54%). For both the categories, and the one that we choose as the best overall combination was CH+CM+NDC+RCS that using classifier V2 achieved an average recognition rate of 72.44% (Negative: 87.18% and Positive: 57.69%). 49 4.3 Discussion Regarding the tests performed using only combinations of Color features, the ACC feature always achieved the worst results. However, when we incorporate more Color information, the results have always increased, from 56.43% (using only ACC) to 59.25% (using all features). Globally, the best features seem to be CH, CM, RCS and OH. For each of the group tests, they are in general the ones with the best results and when used with Joint features, that appear to capture less information, the recognition rates increased. Regarding PFCH, PFCHS and NDC, these features don’t always improve the results. When comparing the number of features used in each combination, we observed that with up to four features, the increase of the average recognition rates are linear: more features gave us better results. However, in the other cases, it seems that sometimes adding more features only confuses the information. In fact, as we can see in Table 4.1, only one of the selected combinations has more than four features. Overall, and as expected from our previous studies from literature, the best results were achieved using Color features (and combination of Color features). All other types, except Shape, also achieved relatively good recognition rates, especially if we consider the subjectivity inherent to the way humans interpret the emotional content of an image. In general, the results with the addition of more information tend to decrease, even though we were able to improve some of our previous results. Given that we performed all the tests using a small number of observations, and that in the majority of the tests the amount of features used for each image is considerably bigger than the number of observations, we considered the possibility of overfitting. Overfitting is a phenomenon that occurs when a statistical model describes noise instead of the underlying relationship, i.e., it memorizes information, instead of learning it. Usually occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. Although we used cross-validation in all the tests we performed, which is helpful in reduce overfit in classifiers, we decided to verify if our final classifier suffers from overfitting. We started by testing our classifier using only the training set; in case of overfitting, the expectable accuracy should be around 100%, however it is only in the order of 70%. Besides that, if our model was Overfitting, when we use images to test that were not used to train the model, the classifier should perform considerably worst, however the recognition rates were similar as the ones using the training set. Given this, we believe that our classifier is not suffering from overfitting. Additionally, we also tried to reduce the number of features used, by performing Principal Component Analysis (PCA), however the results achieved were that all the features used are important. 4.4 Summary We developed a recognizer to classify an image with the corresponding emotion category: Positive or Negative, based on the content of the image, such as Color, Texture or Shape. Using a set of 156 images, for testing, that were not used for training, we achieved an average recognition rate of 72.44% (Negative: 87.18% and Positive: 57.69%). The recognizer uses a Vote classifier based on SMO, NB, LB, RF, and RSS, and is composed by CH, CM, NDC, and RCS features. 50 5 Dataset In order to provide a new dataset annotated with the emotional content of each image, we performed a study with different subjects. For this purpose we developed a Java application: EmoPhotoQuest, that used the Swing toolkit to show the images to the users and collect the ratings for each one of the displayed images. 5.1 Image Selection Concerning the creation of the dataset, we started by selecting the images, using the results of the recognizer developed in chapter 3, from the following datasets: IAPS, GAPED and Mikels. From the first one we selected 86 images: 9 of A (Anger), ADS, D, DF, DS, F, Ha, N (Neutral), S and 5 of Surprise (Su). From GAPED we selected 76 images: 8 of A, ADS, D, DF, DS, F, Ha, N, S, and 4 of Su. Finally, from Mikels we selected 7 images: 1 for ADS, D, DF, DS, F, Ha and S. For each class of emotions, we selected images with the biggest DOM possible. The dataset contains multiple images with animals, such as snakes, spiders, dogs, sharks, horses, cats, tiger, among others. The remaining images include children, war scenarios, mutilation, poverty, diseases and death situations. It also include images from cirurgical procedures, as well as images of natural catastrophes, car accidents or fire. For the experience, we divided the dataset into 4 subsets: DS0 to DS3. The first one contains 57 images, 20 from our subset of IAPS, 20 from our subset of GAPED, and all the images from our subset of Mikels. This dataset will be rated by all the participants. Dataset DS1 contains 40 images, while DS2 and DS3 contain 36 images each. 5.2 Description of the Experience First, we started by explaining the purposes of the study and how it will be held. To ensure the willingness of the subject, regarding Negative images, we started by showing three images as examples of what can be expected. After that, the subject could decide to continue or not the study. If the subject decides to continue the study, s/he should fill the user’s questionnaire (See Figure 7.3) with his/her personal 51 information (age, gender, etc.), as well as the classification of their current emotional state (categories and emotions). In the initial screen of EmoPhotoQuest (See Figure 5.1a), it is possible to select the language (Portuguese or English), as well as reading a summary of the most important aspects of the study. There are 7 different blocks with nearly 14 images each. Images were presented in a random order, i.e., each user will have a different sequence of images. Each image was shown to the user during 5 seconds (See Figure 5.1b). After looking at the image, the user should evaluate his/her current emotional state, and rate the image according to each of the universal emotions using a 5-Likert scale (See Figure 5.1c). When the user fills all the requested information for that image, the Next button appears and s/he can move on to the next image. Although in other study users usually have a limited time to respond, we decided not to do it. This way we allowed the user to spend as much time as needed, without feeling pressured to respond or even stressed out. In order to relax and avoid user fatigue, we provided a 30 seconds interval during which only a black screen was displayed (See Figure 5.1d). (a) EmoPhoto: Start screen (b) EmoPhoto: Image visualization screen (c) Rate screen (d) Pause screen Figure 5.1: EmoPhoto Questionnaire 52 5.3 Pilot Tests In order to verify and validate if our procedure had any error and also if it was completely clear to the subjects, we performed two preliminary tests with different subjects. The first one was a 27 year old man, that performed the test in Portuguese and took 35 minutes to complete it. The second subject was an 18 year old female, that also preferred to take the test in Portuguese. The time spent to conclude the test was 42 minutes. Regarding an image that was duplicated, none of the subjects had any doubt or detected any error in our application for collecting their emotional information. An interesting aspect of the performed tests was the sensitivity to the Negative images. The first subject considered the majority of the images very violent, while the second one considered them almost Neutral, and in some cases she enjoyed the Negative content. These preliminary results demonstrated how subjective the emotional content of an image can be. 5.4 Results We conducted 60 tests: 26 with females and 34 with males, with 70% of them belonging to the 18-29 age group (See Figure B2), and almost 60% having a BsC Degree (See Figure B4). Only 3 of the users had participated in a study using any Brain-Computer Interface Device (See Figure B5), while none of the users had participated in a study using the IAPS or GAPED database. In fact, the overwhelming majority had no knowledge about these databases. Regarding their current emotional state (in terms of categories), 31 participants classify it as Neutral, 25 as Positive, and only 4 as Negative (See Figure 5.2a). Considering now the emotional state according to each of the following emotions: Anger, Disgust, Fear, Happiness, Neutral, Sadness, and Surprise, we can see in Figure 5.2b that the majority of the participants were feeling moderately Happy or moderately Neutral, both with a Median of 3, in the beginning of the tests. Given the number of participants in our tests, each image of DS0 was rated by 60 participants, while each image of DS1, DS2 and DS3 was rated by 20 participants. (a) Categories (b) Emotions Figure 5.2: Emotional state of the users in the beginning of the test During each session, participants were encouraged to share their opinions/comments about the experience. More than 40% of the participants indicated some type of difficulty in understanding the content of some of the images, leading to confusion about their feelings. The majority identified the lack of context as the main reason for this. 53 For example, some users did not understand if an animal in front of a car will be hit by it or not. In this case there is confusion between feeling Negative if the animal is hit, and Neutral/Positive otherwise. In some images of animals with people around them, it is not clear if the people are helping the injured animal, or if they are the ones that caused injury. As in the previous example, if the people are helping, the users tend to feel Positive, otherwise they feel Negative and irritated/angry. Another example is the case of animals that are lying on the ground; it is not clear for the user if the animal is dead or just sleeping. This doubt also influences the way the user feels: Negative in case of death, Neutral/Positive otherwise. Besides these concrete examples, one of these users explained that if he sees an image of a hideous act that is made based on religious fanaticism, he feels disgusted and angry, but if it is due to necessity (poverty, to get food, etc.) he only felt Sadness. Another user reported that he feels disgusted, not for what the image expresses by itself, but because of the situation in which the image was taken: violence against women or poverty. Some users also mentioned that some images had bad quality (pixelated) or appear to be faked/manipulated with programs, such as Photoshop, which means they did not feel affected by these images. Some of the users (2) indicated that there are too many emotions to rate. However, other users (8) suggest that there should be an option such as Confusion, Strangeness, Anxiety or Disturbed, because they consider that some images do not correspond to any of the available emotions. Moreover, another user considered that Happiness is not enough to discriminate the Positive feelings of some images, such as cuteness. In the case of Surprise, some users (5) claimed that it is subjective, difficult to understand and difficult to elicit from an image. There seem to be some exceptions to this, such as a shark moving as it is attacking a person or images with unexpected content like a lamp or stairs. However, two users considered Surprise as one of the most common emotions in the beginning of the test, but that tends to disappear during the test. In the case of Anger, two users explicitly indicated us that none of the images was able to trigger that emotion. In the case of the Neutral emotion, and given the existence of the Neutral category, four users did not comprehend the use of the emotion suggesting that a rate of “3” in all the other emotions will be equivalent to ”feeling Neutral”; one user suggested the use of indifference/apathy term instead of Neutral. Regarding the personal taste of the users, some of them do not appreciate spiders (4), snakes (3) or aquatic animals (1), but some of them consider images with these animals beautiful because of the colors in them. However, the opposite is also true (some users appreciate snakes (4) or spiders (3)). Two users hate needles, one user hates hunting and another is afraid of “heights”, i.e, he reported that he felt Fear from an image in which he thinks that should feel happy. In the contrary, two users identified that a specific image should be considered as “Negative”, but since they enjoy the content in it (fire and cirurgical instruments), they felt Positive and happy. Some users (3) declared that they are not sensitive to some images, for example the ones with children smiling. They said that they should feel “happy”, but they feel Neutral. Finally, one user also noticed that in a image with a couple in which the woman is pregnant, usually this scenario would be Neutral to him, but as his sister is pregnant, he feels happy because he remembers his sister. One of the users was particularly happy in the beginning of the test, and reported us he did not feel affected by the images. However after viewing various Negative images, he said that his emotional state was getting worse. In fact, more users (4) stated that sequential Negative images, for example 3, Negatively affects the emotional state more than, for example, one Negative, one Neutral, one Negative. The same happens for a Positive image, the user feels Positive, but he is also influenced by the Negative images, so he did not feel so happy as he “should”. However, two users justified that, given the extensive amount of sequential Negative images, they tend to rate a Positive image with a higher value. Finally, some users mentioned that the emotional content of the last image also interferes in the way they were 54 feeling at that moment. In the case of the impact of images, two users indicated that if it were real, i.e., for example if they were around a snake they would feel much more affected than by only seeing an image of a snake. Another user mentioned that if the person (or people) that appear in the image were from his family or friends, the impact in his emotional state would be considerably bigger. Two users also reported feeling Fear not for what the image transmits, but because they imagined themselves in that situation. Concerning the Negative images, one user mentioned that they should be “more shocking”. Four users would have preferred if the images were larger, ideally fullscreen and with high definition quality, while two other users suggested that the use of videos instead of images would cause a bigger impact in their emotional state. Finally, one user suggested the use of 3D using a device such as Oculus Rift. Concerning the design of the test, six users considered it very long, i.e., with too many images, and two other users suggested that should have been more Positive images. A large number of users also reported that the test had too many images of snakes (18) or spiders (7). With so many images of snakes/spiders, the users (6) reported that they got used to them, and stopped feeling afraid or disgusted. To avoid the use of many images with the same animals (snakes and spiders) users (3) suggested the use of salamanders, grasshoppers, scorpions or maggots. In the case of the pause screen, seven users considered it very long, and one of them did not even understand the need of a pause between blocks of images. At least one user appreciated the pause screen, and suggested the use of a timer to indicate the time left for resting. Finally, some users (6) explained that it was complicated to analyse what they were feeling, given that it was very subjective, and also difficult to rate from 1 to 5; two of them gave the example that they would only give a rating of 5 in extreme cases, such as if they started crying or laughing out loud. Besides this, three of them also mentioned that the first images of each sequence could have had biased ratings because people are adapting to the rating scheme. The existing comments as well as the reported inconsistencies represent a minority of participants (10%). The remaining participants did the experiment as it should be, and their responses were aligned with the emotions that were supposed to be transmitted by the images. 5.5 Discussion For the images classified as Negative by our users (Figure 5.3) almost all of them had at least 50% of negative ratings, however 20 to 30% of the images also had a significant number of neutral ratings. Besides that, only 27% did not have any positive vote. Regarding the images classified as Neutral or Positive (see Figure 5.4), for the first case (images from 1033 to Sp139) almost 39% had a considerable number of negative ratings, and only 12% did not have any Positive vote. In the case of the Positive images (from 1340 to P124) almost 50% had at least one negative rating, while 10% were rated by all the participants as positive. As in the case of the Negative images, we can see a lot of neutral ratings for each positive image. We compared the achieved results, concerning categories, between our dataset and the GAPED/Mikels datasets for each of the images of our dataset, in order to obtain the agreement between them. In the case of the images from Mikels dataset (see Table 5.1), the agreement was 100% for the Positive image, where in the case of the Negative there is confusion between the Negative and Neutral categories. (%) Negative Positive Negative 66.77 Neutral 33.33 Positive 100 Table 5.1: Confusion Matrix for the categories between Mikels and our dataset 55 Figure 5.3: Classification of the Negative images of our dataset (from users) Figure 5.4: Classification of the Neutral and Positive images of our dataset (from users) For the GAPED dataset (see Table 5.2) we analyzed 76 images (33 Negative, 9 Positives, and 34 Neutral). For the Neutral and Positive categories the achieved agreement was 100% for each, while in the case of the Negative, similarly to what happens for Mikels dataset, there is confusion between the Negative and Neutral categories. (%) Negative Neutral Positive Negative 55 Neutral 43.33 100 Positive 1.67 100 Table 5.2: Confusion Matrix for the categories between GAPED and our dataset 56 5.6 Summary In this chapter we described the experience performed to annotate a new dataset of images with the emotional content of each image. Besides that, we also collected important information about what users think about the experience, and what influences the way they feel during the visualization of an image. Given this, we consider the following aspects as the most important: the way a person interprets an image, specifically the context in which the image is inserted, the current emotional state of the person, and the previous personal experiences of the person. From the results achieved we conclude that there was no clear agreement between the users, with this fact being more evident in the Negative and Neutral categories, while the Positive category was the most consensual. We also compared the agreement, for each image, between our dataset and Mikels/GAPED datasets. In both cases, there was an overall good agreement, with the worst results achieved in the Negative category, where the images considered as Negative in the Mikels or GAPED were mainly considered as Neutral by our users. 57 58 6 Evaluation In this chapter we present the evaluation, using the new dataset, of the two recognizers: Content-based Emotion Recognizer (CBER) and Fuzzy Logic Emotion Recognizer (FLER). 6.1 Results Each image of the new dataset was classified by the two recognizers: FLER and CBER. Concerning the categories, each image was annotated with the dominant category using CBER and FLER; in the later each image was annotated with up to two dominant categories. In the case of the emotions, only FLER was used to annotate the image with the most dominant emotions (up to three). Besides the classifications made by our recognizers, each image had already the classification made by the participants of our study (see Chapter 5). 6.1.1 Fuzzy Logic Emotion Recognizer In the following paragraphs we will describe the evaluation performed concerning the categories (Negative, Neutral, and Positive), as well as the emotions (ADS, D, DF, DS, Ha, N, and S). Categories In Table 6.1 we can see the results achieved, using our dataset, to evaluate FLER considering the categories. From our dataset we used 21 Positive, 67 Neutral and 81 Negative images. In the case of the Negative category, the achieved recognition rate was 100%, while in the Positive category, it achieved almost 86%. For the Neutral category, the achieved results was considerably worst (only 28%). (%) Negative Neutral Positive Negative 100 61.19 4.76 Neutral Positive 28.36 9.52 10.45 85.71 Table 6.1: Confusion Matrix for the categories using our dataset 59 When we compared these results with the ones achieved using only GAPED dataset (see Section 3.2) the Negative and Positive categories achieved good results, existing an increase in the Negative results (from 87.89 to 100%), and a decrease in the Positive (from 100% to 85.71%). It is also clear that the Neutral category achieved a poor result; it decreases from almost 99% to 28%. However, this result can be explained by the lack of agreement between the results from our users and the previous classification of the images from the GAPED dataset (See Section 5.3), as well as the existing confusion between Negative and Neutral category for the GAPED. Emotions Concerning the classification in terms of the emotions that an image conveys, we considered that a given emotion is present in the image if the median of the values assigned by users to that emotion was 2.0. Considering this, from the 169 images that compose our dataset, almost 23% did not have any emotion associated. None of the non-annotated images belongs to the Positive category, and almost 60% corresponds to the Neutral category. Considering only the 131 images with emotions associated, there were no images with the emotions Anger or Surprise. For the remaining Negative emotions, we had 18 images of Sadness, 8 of Fear, and 5 images associated with Disgust. In the case of Happiness there were 17 images, while for Neutral we had 36. In the case of two emotions in the same image, we had the following combinations: DS (8), AS (7), HaN (4), DF(3), FSu (2), NS (2), AD (1), AF (1), DSu (1), and FS (1). Considering combinations of three emotions in the same image, we had: ADS (7), AFS (3), ADF(2), DFN (1), DFSu (1), FHaN (1), and HaNSu (1). Finally, there was only one image with four emotions associated: ADSSu. To check if the emotion identified by our recognizer was correct, we assumed that a result is considered correct, if at least one of the emotions for our dataset is present in the emotions identified by the recognizer. For example if an image has the emotions ADS from the dataset, all the following emotions, from the recognizer, will be considered correct: A, D, S, AD, AS, or DS. Given this, our FLER achieved a success rate of 68.70%. Considering the subset of images annotated with negative emotions, we had a success rate of 88.41%, while in the case of the images with the positive emotion it was 82.35%. In the case of the Neutral, it was only 38.89%, and in this case we observed a lot of confusion between the N and S, DF or F. 6.1.2 Content-Based Emotion Recognizer For this evaluation, and given that CBER only classifies an image in terms of being Negative or Positive, we did not consider the Neutral images of our dataset; therefore we used 21 Positive and 81 Negative images. In Table 6.2 we can see the achieved results. (%) Negative Positive Negative 76.54 47.62 Positive 23.46 52.38 Table 6.2: Confusion Matrix for the categories using our dataset If we compare these results with the ones obtained in Section 4.2.5, in both cases there was a decrease in the recognition rates, from 87.18% to 76.54% to the Negative category, and from 57.69% to 52.38% in the case of the Positive. Although this can be justified by the use of only one category, for each image, given that even across our users, in many cases and for different reasons (see Section 5.4), there is no consensus about which feeling each image transmits (see Figure 6.1). If we consider the negative images (from1304 to Sp146 in Figure 6.1), almost all the images had at least 50% of negative ratings, but there are also a lot of neutral ratings (in average from 20% to 30%) 60 in these images. Besides that, only 27% of the negative images did not have any positive vote. For the positive images (from 1340 to P124), almost 50% had at least one negative rating, while 10% were rated by all the participants as positive. As in the negative images, we can see a lot of neutral ratings for each positive image. Figure 6.1: Classification of the Negative and Positive images of our dataset (from users) 6.2 Discussion Although there is a lot of work done in understanding the content of an image (see section 2.3.1), the majority of this work did not specifically focus on the emotions or categories that an image conveys. In some cases it was possible to identify whether a picture is gloomy or not, associate the visual content of an image to adjectives such as sublime, sad, touch, aggressive, romantic, elegant, chic, or calm, or pairs of emotions such as like-dislike or gaudy-plain. Besides that, in general, the images that were used were not generic, since they correspond to painting art or textures related to clothing and decoration. The most similar work [?], to the one that we have developed in the case of CBER, managed to sort pictures using categories (Positive/Negative) with an accuracy of 55%. In the case of basic emotions: Happiness, Sadness, Anger, Disgust and Fear, they obtained an accuracy of 52%. Concerning the work we had developed in FLER, and as far as we know, there is no similar work. 6.3 Summary In this chapter, we perform additional evaluation of our recognizers, using the new dataset. For each image, we compared the classification of each recognizer to the one achieved using the experiment described in Chapter 5. In the case of CBER, using our dataset, we achieved the following recognition rates: 76.54% for the Negative category and 53.28% for the Positive. In the case of FLER, we achieved a success rate of 68.70%, using our dataset, for emotions. In the case of categories, we achieved 100% for Negative category, 88% for the Positive and 28% for the Neutral. We also briefly compare our work with the works detailed in Chapter 2. 61 62 7 Conclusions and Future Work In this Chapter, we present a summary of the dissertation, our final conclusions and the contributions of our work. We also present the new issues that might be addressed in the future. 7.1 Summary of the Dissertation In this work, we proposed two solutions to identify the emotional content conveyed by an image, one using the Valence and Arousal values, and another using the content of the image, such as colors, texture or shape. We also provide a new dataset of images annotated with emotions, obtained from experiments with users. In Chapter 2, we described the importance of emotions, as well as how they can be represented. Emotion in human cognition is essential and plays an important role in the daily life of human beings, namely in rational decision-making, perception, human interaction, and in human intelligence. Regarding the emotion representation, there are two different perspectives: categorial and dimensional. Usually, the dimensional model is preferable because it could be used to locate discrete emotions in space, even when no particular label could be used to define a certain feeling. Along with it, we detailed the previous works in the recognition of emotions from images, and how image contents, such as faces, Color, Shape or Texture information, affect the way emotions are perceived by the users. To describe how humans perceive and classify facial expressions of an emotion, there are two types of models: the continuous and categorical. The continuous model explains how expressions of emotion can be seen at different intensities, whereas the categorical explains, among other findings, why the images in a morphing sequence between two emotions, like Happiness and Surprise, are perceived as either happy or surprise but not something in between. There have been developed models of the perception and classification of the six facial expressions of emotion. Initially, they used feature and Shape-based algorithms, but, in the last two decades, appearance-based models (AAM) have been used. We also described the relationship between emotions and the different visual characteristics of an image, namely Color, Shape, Texture, and Composition. Color is the most extensively used visual content for image retrieval since it is the basic constituent of images. Shape corresponds to an important 63 criterion for matching objects based on their physical structure and profile. Texture is defined as all that is left after color and local shape has been considered; it also contains information about the structural arrangement of surfaces and their relationship to the surrounding environment. Composition is based on common (and not-so-common) rules. The most popular and widely known is the Rule of Thirds, that can be considered as a sloppy approximation to the golden ratio (about 0.618) [41, 42]. It states that the most important part of an image is not the center of the image but instead at the one third and two third lines (both horizontal and vertical), and their four intersections. We also presented CBIR: a technique that uses visual contents of images to search images in large databases, using a set of features, such as Color, Shape or Texture. However, the low-level information used in CBIR systems does not sufficiently capture the semantic information that the user has in mind. In order to solve this, the EBIR systems could be used. These systems are a subcategory of the CBIR that, besides the common features, also use emotions as a feature. Most of the research in the area is focused on assigning image mood on the basis of eyes and lips arrangement, but colors, textures, composition and objects are also used to characterized the emotional content of an image, i.e., some expressive and perceptual features are extracted and then mapped into emotions. Besides the extraction of emotions from an image, there has been an increasing number of attempts to use emotions in different ways, such as the increase of the quality of recommendation systems. These systems help users find a small and relevant subset of multimedia items based on their preferences. Finally, the most wellknown problem of these systems, matrix-sparsity problem, can be solved using implicit feedback, such as recording the emotional reaction of the user to a given item, and use it as a way of rate that item. Finally, we present the datasets that we used in our work: IAPS, GAPED and Mikels. The IAPS database provides a set of normative emotional stimuli for experimental investigations of emotion and attention. The goal is to develop a large set of standardized, emotionally-evocative, internationally accessible, color photographs that includes contents across a wide range of semantic categories [59]. To increase the availability of visual emotion stimuli, a new database called GAPED was created. Even though research has shown that the IAPS is useful in the study of discrete emotions, the categorical structure of the IAPS has not been characterized thoroughly. In 2005, Mikels collected descriptive emotional category data on subsets of the IAPS in an effort to identify images that elicit discrete emotions. Besides the IAPS and GAPED databases, in which each image was annotated with their Valence and Arousal ratings, there are other databases (typically related to facial expressions) that were labeled with the corresponding emotions, such as NimStim Face Stimulus Set, Pictures of Facial Affect (POFA) or Karolinska Directed Emotional Faces (KDEF). In Chapter 3, we presented a recognizer to classify an image, based on their V-A ratings using Fuzzy Logic, with the universal emotions present in it and the corresponding category (Negative, Neutral and Positive). For each image in the dataset, we started by normalizing the V-A values, and computed the Angle and the Radius for each image in order to help reduce emotion confusion between images with similar angles. To describe each class of emotions, as well as the categories, we used the Product of Sigmoidal membership function and the Trapezoidal membership function. For the categories, we used Trapezoidal membership function, both for Angle and Radius, while for the classes of emotions, we used the Product of Sigmoidal membership function for the Angle and the Trapezoidal membership function for the Radius; We also present the achieved results concerning the experimental results that we have done. When using the same set for training and test, we achieved a recognition rate of 100% for Negative and Positive categories. In the case of the dominant classes of emotions we achieved an average classification rate of 91,56%. With the use of GAPED as a testing set we achieved an average recognition rate of 96% for categories. For GAPED, we achieved a non-classification rate of 23.4%, while in the case of the IAPS, we achieved a non-classification rate of 4.86%. 64 In chapter 4, we described a recognizer to classify an image with the corresponding emotion category: Positive or Negative, based on the content of the image, such as Color, Texture or Shape. We also presented the several studies that we have made, concerning the combinations of different visual features, to select the best one for our recognizer. The recognizer uses a Vote classifier based on SMO, NB, LB, RF, and RSS, and is composed by CH, CM, NDC, and RCS features. Finally, we presented the experimental results, in which, using a set of 156 images, for testing, that were not used for training, we achieved an average recognition rate of 72.44% (Negative: 87.18% and Positive: 57.69%). In chapter 5, we described the experience performed to annotate a new dataset of images with the emotional content of each image, as well as the collected information about what users think about the experience, and what influenced the way they felt during the visualization of an image. Next, we discuss the aspects that we considered as the most important: the way a person interprets an image, specifically the context in which the image is inserted, the current emotional state of the person, and the previous personal experiences of the person. We also presented the comparison about the agreement, for each image, between our dataset and Mikels/GAPED datasets. In the case of the images from Mikels dataset, the agreement was 100% for all the 4 Negative images, as well as for the only Positive image. The remaining two images were classified as Neutral by the people, although their original classification was Negative. For the GAPED dataset, we achieved 100% of agreement for the Negative category, and almost 90% for Positive. In the case of the Neutral category, only about 24% of the images were considerer as Neutral in both datasets, with the majority, almost 77%, being considered Negative. In chapter 6, we performed additional evaluation of our recognizers, using the new dataset. For each image, we compared the classification of each recognizer to the one achieved using the experiment described in Chapter 5. We also briefly compare our work with the works detailed in Chapter 2. In the case of CBER, using our dataset, we achieved the following recognition rates: 76.54% for the Negative category and 53.28% for the Positive. In the case of FLER, we achieved a success rate of 68.70%, using our dataset, for emotions. In the case of categories, we achieved 100% for Negative category, 88% for the Positive and 28% for the Neutral. 7.2 Final Conclusions and Contributions Although there is a lot of work regarding the retrieval of images based on their content, most of this work did not take into account the emotions that an image conveys. Therefore, our work focused on retrieving the emotions related to a given image, by providing two recognizers: one using the Valence and Arousal information from the image, and the other using the visual content of the image. This way, we increased the number of images annotated with their emotions without the need of manual classification, reducing both the subjectivity of the classification and the extensive use of the same stimuli. In short, the main contributions of this work were: • A Fuzzy recognizer that achieved a recognition rate of 100% for categories of emotion and 91.56% for emotions, using Mikels dataset [66]; for GAPED, the recognizer achieved an average classification rate of 95.59% for the categories of emotion and, finally, using our dataset, it achieved a success rate of 68.70% for emotions, and, in the case of categories, it achieved 100% for Negative category, 88% for the Positive and 28% for the Neutral. • A recognizer based on the content of the images, that has obtained a recognition rate of 87.18% for the Negative category and 57.69% for the Positive, using a dataset of images selected both 65 from IAPS and from GAPED datasets. Using our dataset, this recognizer achieved a recognition rate of 76.54% for the Negative category and 53.28% for the Positive. • A new dataset of 169 images from IAPS, Mikels and GAPED annotated with the dominant categories and emotions, havin in account what users felt while viewing each image. 7.3 Future Work From the experimental evaluation of the developed recognizers detailed in Section 3.2, 4.2 and Chapter 6, we can establish new guidelines for the work to be done in the future. Concerning FLER, we used 6 images for ADS, 11 for DS, 12 for F, 24 for DF, 31 for D, 43 for S, and finally, 114 for Ha, for the creation of each Fuzzy Set. As we can see, the distribution of the images according to each class of emotion is not balanced, and in the majority of the cases, there is a small number of images of each. Given this, we consider important to use more annotated images to adjust the Fuzzy Sets for each class of emotions, and consequently the Fuzzy Sets for each of the categories. Considering the results obtained throughout this work in the case of the categories, the next possible step is to merge the two recognizers into one. If a particular image, provided as input to the “new” recognizer has information about their Valence and Arousal values, a weighting system should be used between the values of DOM (assigned by FLER) and the estimated probability (assigned by the CBER) in order to classify the image. Otherwise, it should be used only CBER classification. Further, we suggest to complement the new dataset with data collected using an BCI device (e.g., Emotiv). This way, each image will have the emotion felt, and the emotion reported by the users. Another possibility is to use the automatic identification of the category of emotions from content, in order to organize or sort the results of an image search, or even to filter the images that will be displayed to the user. Besides that, for example in therapy sessions, it may be helpful to use the emotional information from the images and emotional state of the user, to improve their emotional state. 66 Bibliography [1] D Aha and D Kibler. Instance-based learning algorithms. Machine Learning, 6:37–66, 1991. [2] O AlZoubi, RA Calvo, and RH Stevens. Classification of eeg for affect recognition: an adaptive approach. AI 2009: Advances in Artificial Intelligence Lecture Notes in Computer Science, 5866:52– 61, 2009. [3] JC Amante. Colorido : Identificação da Cor Dominante de Fotografias. PhD thesis, 2011. [4] JC Amante and MJ Fonseca. Fuzzy Color Space Segmentation to Identify the Same Dominant Colors as Users. DMS, 2012. [5] Danny Oude Bos. EEG-based Emotion Recognition: The Influence of Visual and Auditory Stimuli. Capita Selecta Paper, 2007. [6] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996. [7] Leo Breiman. Random Forests. Machine Learning, 45(1):5–32, 2001. [8] Shih-Fu Chang, T Sikora, and A Purl. Overview of the MPEG-7 standard. Circuits and Systems for Video Technology, IEEE Transactions on, 11(6):688–695, June 2001. [9] SA Chatzichristofis and YS Boutalis. CEDD: color and edge directivity descriptor. A compact descriptor for image indexing and retrieval. Computer Vision Systems, pages 312–322, 2008. [10] Savvas A Chatzichristofis, Y S Boutalis, and Mathias Lux. Selection of the proper compact composite descriptor for improving content based image retrieval. In B Zagar, editor, Signal Processing, Pattern Recognition and Applications (SPPRA 2009), page 0, Calgary, Canada, February 2009. ACTA Press. [11] Savvas A Chatzichristofis and Yiannis S Boutalis. FCTH: Fuzzy Color and Texture Histogram - A Low Level Feature for Accurate Image Retrieval. In Proceedings of the 2008 Ninth International Workshop on Image Analysis for Multimedia Interactive Services, WIAMIS ’08, pages 191–196, Washington, DC, USA, 2008. IEEE Computer Society. [12] Chin-han Chen, MF Weng, SK Jeng, and YY Chuang. Emotion-based music visualization using photos. Advances in Multimedia Modeling, 4903:358–368, 2008. [13] O da Pos and Paul Green-Armytage. Facial expressions, colours and basic emotions. Journal of the International Colour Association, 1:1–20, 2007. [14] Elise S Dan-Glauser and Klaus R Scherer. The Geneva affective picture database (GAPED): a new 730-picture database focusing on valence and normative significance. Behavior research methods, 43(2):468–77, June 2011. 67 [15] Charles Darwin. The Expression of the Emotions in Man and Animals. 1872. [16] Drago Datcu and L Rothkrantz. [17] Ritendra Datta, Dhiraj Joshi, Jia Li, and James Z Wang. Studying Aesthetics in Photographic Images Using a Computational Approach. In Proceedings of the 9th European Conference on Computer Vision - Volume Part III, ECCV’06, pages 288–301, Berlin, Heidelberg, 2006. SpringerVerlag. [18] CM de Melo and Jonathan Gratch. Evolving expression of emotions through color in virtual humans using genetic algorithms. Proceedings of the 1st International Conference on Computational Creativity ({ICCC-X)}, 2010. [19] Peter Dunker, Stefanie Nowak, André Begau, and Cornelia Lanz. Content-based Mood Classification for Photos and Music: A Generic Multi-modal Classification Framework and Evaluation Approach. In Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval, MIR ’08, pages 97–104, New York, NY, USA, 2008. ACM. [20] Paul Ekman. Basic emotions, chapter 3, pages 45–60. John Wiley & Sons Ltd, New York, 1999. [21] Paul Ekman and Wallace Friesen. Pictures of Facial Affect. Consulting Psychologists Press, Palo Alto, CA, 1976. [22] Paul Ekman and Erika L. Rosenberg. What the Face Reveals: Basic and Applied Studies of Spontaneous Expression Using the Facial Action Coding System (Facs) (Series in Affective Science). Oxford University Press, 2005. [23] Elaine Fox. Emotion Science Cognitive and Neuroscientific Approaches to Understanding Human Emotions, September 2008. [24] J Friedman, T Hastie, and R Tibshirani. Additive Logistic Regression: a Statistical View of Boosting. Technical report, Stanford University, 1998. [25] Syntyche Gbèhounou, François Lecellier, Christine Fernandez-maloigne, and U M R Cnrs. Extraction of emotional impact in colour images. 6th European Conference on Colour in Graphics, Imaging and Vision, 2012. [26] Franz Graf. JFeatureLib, 2012. [27] Ramin Zabih Greg Pass. Comparing Images Using Joint Histograms. 1999. [28] Mark Hall, Eibe Frank, and Geoffrey Holmes. The WEKA Data Mining Software: An Update. ACM SIGKDD, 11(1):10–18, 2009. [29] Onur C Hamsici and Aleix M Martı́nez. Bayes Optimality in Linear Discriminant Analysis. IEEE Trans. Pattern Anal. Mach. Intell., 30(4):647–657, 2008. [30] Alan Hanjalic. Extracting Moods from Pictures and Sounds. IEEE SIGNAL PROCESSING MAGAZINE, (March 2006):90–100, 2006. [31] R Haralick, K Shanmugam, and I Dinstein. Texture Features for Image Classification. IEEE Transactions on Systems, Man, and Cybernetics, 3(6), 1973. [32] Lane Harrison, Drew Skau, and Steven Franconeri. Influencing visual judgment through affective priming. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 2949–2958, 2013. 68 [33] Trevor Hastie and Robert Tibshirani. Classification by Pairwise Coupling. In Michael I Jordan, Michael J Kearns, and Sara A Solla, editors, Advances in Neural Information Processing Systems, volume 10. MIT Press, 1998. [34] Tin Kam Ho. The Random Subspace Method for Constructing Decision Forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8):832–844, 1998. [35] D.H. Hockenbury and S.E. Hockenbury. Discovering psychology. New York: Worth Publishers, 2007. [36] Jing Huang, S Ravi Kumar, Mandar Mitra, Wei-Jing Zhu, and Ramin Zabih. Image Indexing Using Color Correlograms. 1997 IEEE Conference on Computer Vision and Pattern Recognition, 0:762, 1997. [37] George H John and Pat Langley. Estimating Continuous Distributions in Bayesian Classifiers. In Eleventh Conference on Uncertainty in Artificial Intelligence, pages 338–345, San Mateo, 1995. Morgan Kaufmann. [38] Evi Joosten, GV Lankveld, and Pieter Spronck. Colors and emotions in video games. 11th International Conference on Entertainment Computing, 2010. [39] Dhiraj Joshi, Ritendra Datta, Elena Fedorovskaya, Quang-tuan Luong, James Z Wang, Li Jia, and Jiebo Luo. Aesthetics and Emotions in Images [A computational perspective ]. IEEE Signal Processing Magazine, (SEPTEMBER 2011):94–115, 2011. [40] Takeo Kanade. Picture Processing System by Computer Complex and Recognition of Human Faces. 1973. [41] S S Keerthi, S K Shevade, C Bhattacharyya, and K R K Murthy. Improvements to Platt’s SMO Algorithm for SVM Classifier Design. Neural Computation, 13(3):637–649, 2001. [42] A Khokher and R Talwar. Content-based image retrieval: state of the art and challenges. (IJAEST) INTERNATIONAL JOURNAL OF ADVANCED ENGINEERING SCIENCES AND TECHNOLOGIES, 9(2):207–211, 2011. [43] Youngrae Kim, So-jung Kim, and Eun Yi Kim. EBIR: Emotion-based image retrieval. 2009 Digest of Technical Papers International Conference on Consumer Electronics, pages 1–2, January 2009. [44] J Kittler, M Hatef, Robert P W Duin, and J Matas. On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3):226–239, 1998. [45] Hans-Peter Kriegel, Erich Schubert, and Arthur Zimek. Evaluation of Multiple Clustering Solutions. In MultiClust@ECML/PKDD, pages 55–66, 2011. [46] Kai Kuikkaniemi, Toni Laitinen, Marko Turpeinen, Timo Saari, Ilkka Kosunen, and Niklas Ravaja. The influence of implicit and explicit biofeedback in first-person shooter games. CHI’10 Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 859–868, 2010. [47] Ludmila I Kuncheva. Combining Pattern Classifiers: Methods and Algorithms. John Wiley and Sons, Inc., 2004. [48] R D Lane, E M Reiman, G L Ahern, G E Schwartz, and R J Davidson. Neuroanatomical correlates of happiness, sadness, and disgust. The American journal of psychiatry, 154(7):926–33, July 1997. 69 [49] P J Lang. The emotion probe: Studies of motivation and attention. American psychologist, 50:372, 1995. [50] P.J. Lang, M.M. Bradley, and B.N. Cuthbert. International affective picture system (IAPS): Affective ratings of pictures and instruction manual. NIMH Center for the Study of Emotion and Attention, 1997. [51] Christine L. Larson, Joel Aronoff, and Elizabeth L. Steuer. Simple geometric shapes are implicitly associated with affective value. Motivation and Emotion, 36(3):404–413, October 2011. [52] S le Cessie and J C van Houwelingen. Ridge Estimators in Logistic Regression. Applied Statistics, 41(1):191–201, 1992. [53] T M C Lee, H-L Liu, C C H Chan, S-Y Fang, and J-H Gao. Neural activities associated with emotion recognition observed in men and women. Molecular psychiatry, 10(5):450–5, May 2005. [54] Yisi Liu, Olga Sourina, and MK Nguyen. Real-time EEG-based emotion recognition and its applications. Transactions on computational science XII, 2011. [55] David G. Lowe. Three-dimensional object recognition from single two-dimensional images. Artificial Intelligence, 31:355–395, 1987. [56] Xin Lu, Poonam Suryanarayan, Reginald B Adams, Jia Li, Michelle G Newman, and James Z Wang. On Shape and the Computability of Emotions. Proceedings of the ACM Multimedia Conference, 2012. [57] Marcel P. Lucassen, Theo Gevers, and Arjan Gijsenij. Texture affects color emotion. Color Research & Application, 36(6):426–436, December 2011. [58] Mathias Lux and Savvas A Chatzichristofis. Lire: Lucene Image Retrieval: An Extensible Java CBIR Library. In Proceedings of the 16th ACM International Conference on Multimedia, MM ’08, pages 1085–1088, New York, NY, USA, 2008. ACM. [59] Mathias Lux and Oge Marques. Visual Information Retrieval Using Java and LIRE. Synthesis Lectures on Information Concepts, Retrieval, and Services. Morgan & Claypool Publishers, 2013. [60] Jana Machajdik and Allan Hanbury. Affective image classification using features inspired by psychology and art theory. Proceedings of the international conference on Multimedia - MM ’10, page 83, 2010. [61] D Marr. Early processing of visual information. Philosophical Transactions of the Royal Society of London, B275:483–524, 1976. [62] Christian Martin, Uwe Werner, and HM Gross. A real-time facial expression recognition system based on active appearance models using gray images and edge images. IEEE, 216487(216487):1–6, 2008. [63] Aleix Martinez and Shichuan Du. A Model of the Perception of Facial Expressions of Emotion by Humans: Research Overview and Perspectives. Journal of Machine Learning Research : JMLR, 13:1589–1608, May 2012. [64] S Marčelja. Mathematical description of the responses of simple cortical cells. J. Opt. Soc. Am., 70(11):1297–1300, November 1980. 70 [65] Celso De Melo and Ana Paiva. Expression of emotions in virtual humans using lights, shadows, composition and filters. Affective Computing and Intelligent Interaction, pages 549–560, 2007. [66] Joseph a Mikels, Barbara L Fredrickson, Gregory R Larkin, Casey M Lindberg, Sam J Maglio, and Patricia a Reuter-Lorenz. Emotional category data on images from the International Affective Picture System. Behavior research methods, 37(4):626–30, November 2005. [67] Katarzyna Agnieszka Olkiewicz and Urszula Markowska-kaczmar. Emotion-based image retrieval - An artificial neural network approach. Proceedings of the International Multiconference on Computer Science and Information Technology, pages 89–96, 2010. [68] Michael Jones Paul Viola. Robust Real-time Object Detection. International Journal of Computer Vision, 2001. [69] W.R. Picard. Affective Computing, 1995. [70] J Platt. Fast Training of Support Vector Machines using Sequential Minimal Optimization. In B Schoelkopf, C Burges, and A Smola, editors, Advances in Kernel Methods - Support Vector Learning. MIT Press, 1998. [71] R. Plutchik. The nature of Emotions. Am. Sci., 89(4):344–350, 2001. [72] Jonathan Posner, James a Russell, and Bradley S Peterson. The circumplex model of affect: an integrative approach to affective neuroscience, cognitive development, and psychopathology. Development and psychopathology, 17(3):715–34, January 2005. [73] Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA, 1993. [74] Thomas Rorissa, Abebe; Clough, Paul; Deselaers. Exploring the Relationship Between Feature. JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 59(5):770–784, 2008. [75] J A Russell. A circumplex model of affect. Journal of personality and social psychology, 39(6):1161– 1178, 1980. [76] Stefanie Schmidt and WG Stock. Collective indexing of emotions in images. A study in emotional information retrieval. JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 60(February):863–876, 2009. [77] SG Shaila and A Vadivel. Content-Based Image Retrieval Using Modified Human Colour Perception Histogram. ITCS, SIP, JSE-2012, CS & IT, pages 229–237, 2012. [78] DS Shete and MS Chavan. Content Based Image Retrieval: Review. International Journal of Emerging Technology and Advanced Enginnering, 2(9):85–90, 2012. [79] A. Smith. A new set of norms. Behavior Research Methods, Instruments, and Computers, (3x(x), xxx-xxx), 2004. [80] A. Smith. Smith2004norms.txt. Retrieved October 2, 2004 from Psychonomic Society Web Archieve: http://www.psychonomic.org/ARCHIEVE/, 2004. [81] Martin Solli. Color Emotions in Large Scale Content Based Image Indexing. PhD thesis, 2011. [82] H Tamura, S Mori, and T Yamawaki. Texture features corresponding to visual perception. IEEE Transactions on Systems, Man and Cybernetics, 8(6), 1978. 71 [83] M Tkalčič, A Kosir, and J Tasic. Affective recommender systems: the role of emotions in recommender systems. Decisions@RecSys, 2011. [84] Marko Tkalčič, Urban Burnik, and Andrej Košir. Using affective parameters in a content-based recommender system for images. User Modeling and User-Adapted Interaction, 20(4):279–311, September 2010. [85] Koen E a van de Sande, Theo Gevers, and Cees G M Snoek. Evaluating color descriptors for object and scene recognition. IEEE transactions on pattern analysis and machine intelligence, 32(9):1582–96, September 2010. [86] Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple features. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, volume 1, pages I—-511. IEEE, 2001. [87] WN Wang and YL Yu. Image emotional semantic query based on color semantic descrip- tion. Proceedings of the Fourth International Conference on Machine Learning and Cybernetics, Guangzhou, (August):18–21, 2005. [88] X Wang, Jia Jia, Yongxin Wang, and Lianhong Cai. Modeling the Relationship Between Texture Semantics and Textile Images. Research Journal of Applied Sciences, Engineering and Technology, 3(9):977–985, 2011. [89] HW Yoo. Visual-based emotional descriptor and feedback mechanism for image retrieval. Journal of information science and engineering, 1227:1205–1227, 2006. [90] L. A. Zadeh. Fuzzy Sets*. Information and Control, 8:338–353, 1965. 72 Appendix A % NB 56.78 61.44 53.81 50.00 58.05 65.68 66.53 57.20 56.78 46.18 47.88 49.58 51.69 62.71 51.69 62.29 56.14 ACC CH CM NDC OH PCFH PCFHS RCS RT EH Gabor Haralick Tamura CEDD FCTH JCD Average RF 56.78 56.36 52.12 51.27 56.78 57.63 54.66 58.05 61.02 52.54 45.76 53.81 50.00 61.44 53.81 59.75 55.11 Log 50.00 61.44 56.78 50.85 53.81 61.86 50.85 61.02 56.78 47.03 36.86 56.78 55.93 55.08 51.27 52.54 53.68 J48 57.20 56.36 48.73 52.12 53.39 55.93 56.36 57.63 60.17 55.51 50.00 48.31 47.03 55.08 52.12 59.32 54.08 SMO 55.93 65.25 52.97 52.12 58.90 63.98 58.90 58.05 61.86 44.92 50.85 55.51 52.97 58.90 55.93 58.90 56.62 Ibk 56.36 58.05 47.03 51.69 49.58 55.80 54.66 54.24 61.44 49.58 45.34 46.19 52.54 54.66 52.97 53.81 52.75 Bag 60.17 59.75 51.27 47.03 41.27 56.78 60.59 62.29 59.32 48.73 43.22 50.42 47.88 62.29 55.93 58.47 54.09 RSS 58.90 60.59 52.54 51.69 52.97 60.59 58.90 60.17 58.90 47.46 43.64 54.24 46.19 60.59 55.08 59.32 55.11 LB 58.47 61.86 55.51 52.54 54.66 57.63 57.20 61.02 61.02 46.19 43.22 50.42 47.88 58.05 58.05 59.32 55.19 Table A1: Simple and Meta classifiers results for each feature % V1 55.93 65.68 52.54 52.97 57.63 64.41 59.75 61.02 64.41 44.07 44.07 57.20 56.36 60.59 60.17 58.90 57.23 ACC CH CM NDC OH PCFH PCFHS RCS RT EH Gabor Haralick Tamura CEDD FCTH JCD Average V2 56.78 63.56 51.27 52.54 58.05 61.86 63.14 60.17 64.83 44.92 44.07 56.78 53.81 62.71 59.75 63.56 57.36 V3 59.32 63.98 51.69 51.69 59.75 64.83 62.71 60.17 63.98 44.92 45.76 55.93 52.54 61.02 62.71 62.29 57.71 V4 56.78 63.14 52.97 52.12 59.75 64.83 62.29 59.75 61.86 44.92 48.73 56.36 53.81 61.86 61.44 63.56 57.76 V5 53.39 63.56 52.97 52.12 58.47 65.68 60.17 60.59 61.44 44.92 46.19 54.66 52.97 58.47 59.75 59.75 56.57 V6 56.36 64.83 52.97 52.12 58.90 63.98 58.90 58.05 61.86 44.92 51.27 55.51 52.97 58.90 56.78 58.90 56.70 V4 56.78 63.14 52.97 V5 53.39 63.56 52.97 V6 56.36 64.83 52.97 Table A2: Vote classifiers results for each feature % V1 55.93 65.68 52.54 ACC CH CM 73 V2 56.78 63.56 51.27 V3 59.32 63.98 51.69 52.97 57.63 64.41 59.75 61.02 58.74 NDC OH PFCH PFCHS RCS Average 52.54 58.05 61.86 63.14 60.17 58.42 51.69 59.75 64.83 62.71 60.17 59.27 52.12 59.75 64.83 62.29 59.75 58.95 52.12 58.47 65.68 60.17 60.59 58.37 52.12 58.90 63.98 58.90 58.05 58.26 V4 58.90 55.08 56.78 57.20 54.24 56.36 56.36 61.86 63.56 63.98 63.98 64.83 64.83 53.81 59.32 62.71 63.56 58.90 59.32 65.68 62.71 59.75 62.71 64.83 62.29 62.29 65.68 66.95 61.02 V5 54.66 52.97 53.81 53.81 55.51 51.69 53.39 62.71 63.98 65.25 63.56 60.59 66.10 52.54 60.17 62.29 61.02 59.32 58.47 66.10 60.59 59.32 62.71 64.83 61.44 59.32 63.98 63.14 59.76 V6 58.05 56.36 56.36 57.63 57.20 56.78 57.20 64.41 65.25 66.52 65.25 58.90 67.80 52.54 59.32 61.86 58.90 56.78 58.90 64.41 59.75 58.47 61.44 61.44 61.02 60.59 65.68 61.44 60.37 V4 59.32 59.32 56.36 57.20 56.36 60.17 55.08 54.66 55.51 56.36 55.93 57.63 54.66 56.36 57.20 55.93 V5 54.24 54.24 55.08 53.39 54.66 53.39 50.85 51.27 54.24 52.12 52.12 54.24 55.08 52.54 56.78 58.47 V6 57.63 58.05 57.20 57.63 58.90 58.05 56.78 57.20 57.63 58.05 57.20 57.63 57.63 57.20 57.63 58.05 Table A3: Results for Color using one feature % V1 56.36 55.51 54.66 56.36 55.93 53.39 55.08 65.68 66.95 66.95 64.83 63.98 65.25 48.73 58.05 62.29 61.02 61.02 57.20 64.83 61.44 63.14 62.29 67.37 62.29 61.02 63.98 64.41 60.71 ACC+CH ACC+CM ACC+NDC ACC+OH ACC+PFCH ACC+PFCHS ACC+RCS CH+CM CH+NDC CH+OH CH+PFCH CH+PFCHS CH+RCS CM+NDC CM+OH CM+PFCH CM+PFCHS CM+RCS NDC+OH NDC+PFCH NDC+PFCHS NDC+RCS OH+PFCH OH+PFCHS OH+RCS PFCH+PFCHS PFCH+RCS PFCHS+RCS Average V2 59.32 55.93 58.05 61.02 57.20 56.36 59.32 65.25 64.83 64.83 63.98 64.41 68.22 51.27 58.47 60.17 63.14 61.02 58.32 63.56 62.29 60.59 62.29 62.72 60.59 63.14 63.56 64.83 61.24 V3 56.36 57.78 58.47 57.20 56.78 56.78 58.05 63.56 62.29 64.41 64.41 63.14 64.83 49.58 58.47 61.86 63.14 60.59 59.75 65.25 63.56 60.59 62.71 64.83 61.86 63.14 65.25 65.68 61.08 Table A4: Results for combination of two Color features % V1 57.20 56.36 54.66 56.36 57.63 57.20 54.66 52.54 53.81 54.66 55.08 56.78 56.78 52.96 58.90 59.32 ACC+CH+CM ACC+CH+NDC ACC+CH+OH ACC+CH+PFCH ACC+CH+PFCHS ACC+CH+RCS ACC+CM+NDC ACC+CM+OH ACC+CM+PFCH ACC+CM+PFCHS ACC+CM+RCS ACC+NDC+OH ACC+NDC+PFCH ACC+NDC+PFCHS ACC+NDC+RCS ACC+OH+PFCH 74 V2 59.75 58.47 57.20 57.63 58.05 59.32 55.93 56.36 56.78 58.05 58.48 56.78 57.20 56.36 60.17 57.63 V3 58.47 58.47 55.51 55.93 56.36 60.59 57.63 55.51 56.78 57.20 56.78 57.78 55.93 56.78 58.90 58.05 53.39 57.20 57.20 55.51 54.66 64.41 66.10 64.83 63.56 63.98 66.52 64.41 63.98 66.95 65.68 64.83 69.50 61.86 66.52 64.83 58.90 63.56 61.86 61.44 61.02 63.56 60.59 61.86 62.29 62.71 62.29 64.83 60.59 61.44 63.14 63.98 63.98 61.44 64.83 61.86 60.66 ACC+OH+PFCHS ACC+OH+RCS ACC+PFCH+PFCHS ACC+PFCH+RCS ACC+PFCHS+RCS CH+CM+NDC CH+CM+OH CH+CM+PFCH CH+CM+PFCHS CH+CM+RCS CH+NDC+OH CH+NDC+PFCH CH+NDC+PFCHS CH+NDC+RCS CH+OH+PFCH CH+OH+PFCHS CH+OH+RCS CH+PFCH+PFCHS CH+PFCH+RCS CH+PFCHS+RCS CM+NDC+OH CM+NDC+PFCH CM+NDC+PFCHS CM+NDC+RCS CM+OH+PFCH CM+OH+PFCHS CM+OH+RCS CM+PFCH+PFCHS CM+PFCH+RCS CM+PFCHS+RCS NDC+OH+PFCH NDC+OH+PFCHS NDC+OH+RCS NDC+PFCH+PFCHS NDC+PFCH+RCS NDC+PFCHS+RCS OH+PFCH+PFCHS OH+PFCH+RCS OH+PFCHS+RCS PFCH+PFCHS+RCS Average 58.90 60.17 58.90 60.59 58.90 65.25 66.52 63.98 63.98 67.80 65.25 63.14 63.56 67.37 64.41 65.68 68.64 63.98 66.52 64.83 58.90 63.14 60.17 60.59 61.44 63.14 60.59 63.14 62.29 61.44 61.02 64.83 58.48 64.41 62.29 65.68 63.56 63.14 67.37 66.10 61.68 57.63 58.90 57.20 59.32 57.20 63.56 62.29 63.14 63.98 63.56 62.71 63.14 63.98 65.69 61.86 64.83 65.25 62.29 66.52 65.25 59.75 63.98 63.56 59.32 60.59 63.98 61.86 63.98 63.98 61.44 62.29 64.83 61.86 61.86 64.41 64.83 62.29 65.25 66.95 66.52 61.22 56.36 57.20 57.63 58.48 57.20 62.72 62.70 64.41 64.83 65.25 63.56 64.41 65.68 65.68 63.14 66.10 66.53 63.98 66.52 64.41 59.32 63.56 63.56 59.32 60.59 63.56 62.29 62.71 63.56 63.98 63.14 64.41 61.85 61.86 64.83 67.37 62.29 64.83 66.10 66.52 61.26 53.39 56.36 53.39 53.39 53.82 63.56 63.25 63.14 61.02 65.25 63.68 63.98 62.29 66.10 63.56 63.14 69.07 58.05 64.83 61.86 60.17 63.14 61.02 59.32 61.86 62.71 60.59 60.59 62.71 60.59 62.71 63.98 61.02 60.59 65.25 62.29 62.71 62.29 64.83 60.17 59.36 55.93 57.63 56.36 56.78 57.63 65.25 64.83 64.41 58.90 66.95 66.25 66.10 58.90 67.80 64.41 59.32 67.80 61.44 65.68 61.44 59.32 63.14 58.90 55.51 61.02 58.90 60.17 60.17 62.71 58.05 61.86 61.02 60.59 60.59 64.83 62.72 58.05 63.98 60.59 60.59 60.34 V4 58.90 57.20 58.90 56.36 59.75 55.08 55.51 56.78 56.78 55.08 57.63 57.20 56.78 59.32 58.47 V5 53.81 55.08 52.97 55.93 53.81 50.85 53.81 50.85 54.66 57.20 54.24 57.20 53.81 55.93 54.66 V6 57.63 58.48 57.20 59.75 57.20 56.78 57.63 58.05 58.05 58.05 57.63 58.05 56.36 57.20 57.63 Table A5: Results for combination of three Color features % ACC+CH+CM+NDC ACC+CH+CM+OH ACC+CH+CM+PFCH ACC+CH+CM+PFCHS ACC+CH+CM+RCS ACC+CM+NDC+OH ACC+CM+NDC+PFCH ACC+CM+NDC+PFCHS ACC+CM+NDC+RCS ACC+NDC+OH+PFCH ACC+NDC+OH+PFCHS ACC+NDC+OH+RCS ACC+OH+PFCH+PFCHS ACC+OH+PFCH+RCS ACC+PFCH+PFCHS+RCS V1 56.36 56.78 55.51 58.05 56.78 54.66 56.36 56.78 56.36 58.05 56.78 55.93 55.93 55.93 59.32 75 V2 58.05 59.32 58.05 57.63 59.32 55.93 58.05 56.78 60.17 58.05 56.78 59.75 58.90 60.17 59.32 V3 59.75 59.32 59.32 59.75 59.32 57.63 56.78 54.24 60.17 57.20 56.78 58.47 57.63 59.32 57.63 66.10 63.98 63.98 66.10 62.29 63.56 61.44 62.29 63.98 66.10 61.02 65.25 59.84 CH+CM+NDC+OH CH+CM+NDC+PFCH CH+CM+NDC+PFCHS CH+CM+NDC+RCS CM+NDC+OH+PFCH CM+NDC+OH+PFCHS CM+NDC+OH+RCS NDC+OH+PFCH+PFCHS NDC+OH+PFCH+RCS NDC+OH+PCFHS+RCS NDC+PFCH+PFCHS+RCS OH+PFCH+PFCHS+RCS Average 64.41 64.83 63.14 68.64 61.44 62.71 60.59 63.98 63.14 65.78 63.14 61.02 60.71 60.59 65.25 62.29 65.25 62.71 62.71 62.71 61.44 63.98 65.68 65.25 63.98 60.56 62.71 64.83 65.68 66.10 62.29 63.56 61.86 62.29 64.83 66.10 65.68 63.56 60.34 65.25 63.56 61.44 65.68 63.56 63.56 60.59 61.44 62.71 65.68 58.90 62.29 58.13 64.83 64.41 57.20 67.80 62.29 59.75 59.75 58.05 62.71 60.59 60.17 60.17 59.39 V4 56.36 59.32 56.26 59.75 56.78 55.93 56.78 56.35 59.32 58.05 64.41 66.95 63.98 62.71 63.14 63.98 60.00 V5 55.93 53.81 55.51 55.08 56.36 54.66 54.66 54.24 57.63 54.66 63.56 62.71 64.41 59.75 62.28 63.14 58.02 V6 58.48 58.05 59.76 57.63 58.05 57.63 58.05 57.63 57.20 57.20 64.41 58.90 65.25 58.47 61.44 60.17 59.27 V4 58.05 56.78 58.90 57.20 59.74 59.75 56.78 60.17 60.17 55.93 59.32 59.32 58.05 67.38 64.83 65.25 62.29 59.99 V5 56.36 55.51 55.93 54.66 55.93 55.51 56.34 57.63 55.93 53.81 57.35 54.66 53.81 61.86 63.14 63.14 61.44 57.24 V6 58.47 58.90 58.05 59.32 57.20 58.90 58.90 57.20 58.05 58.05 58.05 58.90 57.20 61.02 64.41 61.86 59.75 59.07 Table A6: Results for combination of four Color features % ACC+CH+CM+NDC+OH ACC+CH+CM+NDC+PFCH ACC+CH+CM+NDC+PFCHS ACC+CH+CM+NDC+RCS ACC+CM+NDC+OH+PFCH ACC+CM+NDC+OH+PFCHS ACC+CM+NDC+OH+RCS ACC+NDC+OH+PFCH+PFCHS ACC+NDC+OH+PFCH+RCS ACC+OH+PFCH+PFCHS+RCS CH+CM+NDC+OH+PFCH CH+CM+NDC+OH+PFCHS CH+CM+NDC+OH+RCS CM+NDC+OH+PFCH+PFCHS CM+NDC+OH+PFCH+RCS NDC+OH+PFCH+PFCHS+RCS Average V1 57.63 57.20 58.90 58.90 57.63 56.78 56.36 55.51 58.47 57.20 65.25 65.25 65.25 61.86 62.71 65.68 60.04 V2 58.90 59.75 58.05 61.44 58.05 58.05 60.17 57.63 59.32 58.90 62.71 64.41 67.80 61.44 61.44 63.14 60.70 V3 57.63 59.75 59.75 60.59 57.2 58.9 58.47 58.47 59.75 59.75 63.56 64.83 62.71 61.02 62.71 64.41 60.59 Table A7: Results for combination of five Color features % ACC+CH+CM+NDC+OH+PFCH ACC+CH+CM+NDC+OH+PFCHS ACC+CH+CM+NDC+OH+RCS ACC+CH+CM+NDC+PFCH+PFCHS ACC+CH+CM+NDC+PFCH+RCS ACC+CH+CM+NDC+PFCHS+RCS ACC+CH+CM+OH+PFCH+PFCHS ACC+CH+CM+OH+PFCH+RCS ACC+CH+CM+OH+PFCHS+RCS ACC+CM+NDC+OH+PFCH+PFCHS ACC+CM+NDC+OH+PFCH+RCS ACC+CM+NDC+OH+PFCHS+RCS ACC+NDC+OH+PFCH+PFCHS+RCS CH+CM+NDC+OH+PFCH+PFCHS CH+CM+NDC+OH+PFCH+RCS CH+CM+NDC+OH+PFCHS+RCS CM+NDC+OH+PFCH+PFCHS+RCS Average V1 56.78 57.63 58.47 59.75 58.05 58.90 61.44 58.47 59.32 56.78 57.20 57.63 57.20 63.56 65.68 63.98 63.56 59.67 V2 61.44 60.59 58.47 60.17 60.59 59.32 59.75 58.90 60.59 58.90 58.90 60.59 59.75 63.14 67.80 64.83 64.41 61.07 V3 59.75 58.90 58.47 60.59 60.17 60.17 57.20 59.75 59.75 58.47 58.05 60.59 58.47 66.10 65.25 64.83 63.56 60.59 Table A8: Results for combination of six Color features 76 % ACC+CH+CM+NDC+OH+PFCH+PFCHS ACC+CH+CM+NDC+OH+PFCH+RCS ACC+CH+CM+NDC+OH+PFCHS+RCS ACC+CH+CM+NDC+PFCH+PFCHS+RCS ACC+CH+CM+OH+PFCH+PFCHS+RCS ACC+CH+NDC+OH+PFCH+PFCHS+RCS ACC+CM+NDC+OH+PFCH+PFCHS+RCS CH+CM+NDC+OH+PFCH+PFCHS+RCS Average V1 59.75 58.47 58.48 58.90 62.29 59.32 57.63 65.68 60.07 V2 60.17 61.86 61.86 60.59 61.44 61.44 60.59 63.98 61.49 V3 60.59 58.90 62.71 61.02 59.32 60.17 58.9 66.95 61.07 V4 57.20 60.17 59.75 59.32 58.90 58.90 58.90 65.68 59.85 V5 54.66 56.36 56.36 55.08 58.05 55.51 54.66 62.71 56.67 V6 59.32 57.63 58.90 60.17 60.59 58.90 59.32 63.14 59.75 V4 58.90 V5 56.36 V6 60.59 V4 63.14 59.75 64.83 62.29 59.75 61.86 63.56 63.98 63.98 64.83 64.83 62.71 63.56 65.68 62.71 62.71 64.83 62.29 62.29 65.68 66.95 62.72 62.70 64.41 64.83 65.25 63.56 64.41 65.68 65.68 63.14 66.1 66.53 63.98 66.52 64.41 63.56 63.56 63.56 62.71 V5 63.56 58.47 65.68 60.17 60.59 62.71 63.98 65.25 63.56 60.59 66.10 62.29 61.02 66.10 60.59 62.71 64.83 61.44 59.32 63.98 63.14 63.56 63.25 63.14 61.02 65.25 63.68 63.98 62.29 66.1 63.56 63.14 69.07 58.05 64.83 61.86 63.14 61.02 62.71 60.59 V6 64.83 58.90 63.98 58.90 58.05 64.41 65.25 66.52 65.25 58.90 67.80 61.86 58.90 64.41 59.75 61.44 61.44 61.02 60.59 65.68 61.44 65.25 64.83 64.41 58.90 66.95 66.25 66.10 58.90 67.8 64.41 59.32 67.80 61.44 65.68 61.44 63.14 58.90 58.90 60.17 Table A9: Results for combination of seven Color features % V1 60.59 ALL V2 59.75 V3 59.32 Table A10: Results for combination of all Color features % V1 65.68 57.63 64.41 59.75 61.02 65.68 66.95 66.95 64.83 63.98 65.25 62.29 61.02 64.83 61.44 62.29 67.37 62.29 61.02 63.98 64.41 64.41 66.10 64.83 63.56 63.98 66.52 64.41 63.98 66.95 65.68 64.83 69.50 61.86 66.52 64.83 63.56 61.86 63.56 61.86 CH OH PFCH PFCHS RCS CH+CM CH+NDC CH+OH CH+PFCH CH+PFCHS CH+RCS CM+PFCH CM+PFCHS NDC+PFCH NDC+PFCHS OH+PFCH OH+PFCHS OH+RCS PFCH+PFCHS PFCH+RCS PFCHS+RCS CH+CM+NDC CH+CM+OH CH+CM+PFCH CH+CM+PFCHS CH+CM+RCS CH+NDC+OH CH+NDC+PFCH CH+NDC+PFCHS CH+NDC+RCS CH+OH+PFCH CH+OH+PFCHS CH+OH+RCS CH+PFCH+PFCHS CH+PFCH+RCS CH+PFCHS+RCS CM+NDC+PFCH CM+NDC+PFCHS CM+OH+PFCHS CM+PFCH+PFCHS 77 V2 63.56 58.05 61.86 63.14 60.17 65.25 64.83 64.83 63.98 64.41 68.22 60.17 63.14 63.56 62.29 62.29 62.72 60.59 63.14 63.56 64.83 65.25 66.52 63.98 63.98 67.80 65.25 63.14 63.56 67.37 64.41 65.68 68.64 63.98 66.52 64.83 63.14 60.17 63.14 63.14 V3 63.98 59.75 64.83 62.71 60.17 63.56 62.29 64.41 64.41 63.14 64.83 61.86 63.14 65.25 63.56 62.71 64.83 61.86 63.14 65.25 65.68 63.56 62.29 63.14 63.98 63.56 62.71 63.14 63.98 65.69 61.86 64.83 65.25 62.29 66.52 65.25 63.98 63.56 63.98 63.98 62.29 62.71 62.29 64.83 60.59 61.44 63.14 63.98 63.98 61.44 64.83 61.86 66.10 63.98 63.98 66.10 62.29 63.56 61.44 62.29 63.98 65.25 65.25 65.25 65.25 61.86 62.71 65.68 63.56 65.68 63.98 63.56 65.68 60.59 63.83 CM+PFCH+RCS CM+PFCHS+RCS NDC+OH+PFCH NDC+OH+PFCHS NDC+OH+RCS NDC+PFCH+PFCHS NDC+PFCH+RCS NDC+PFCHS+RCS OH+PFCH+PFCHS OH+PFCH+RCS OH+PFCHS+RCS PFCH+PFCHS+RCS CH+CM+NDC+OH CH+CM+NDC+PFCH CH+CM+NDC+PFCHS CH+CM+NDC+RCS CM+NDC+OH+PFCH CM+NDC+OH+PFCHS CM+NDC+OH+RCS NDC+OH+PFCH+PFCHS NDC+OH+PFCH+RCS OH+PFCH+PFCHS+RCS CH+CM+NDC+OH+PFCH CH+CM+NDC+OH+PFCHS CH+CM+NDC+OH+RCS CM+NDC+OH+PFCH+PFCHS CM+NDC+OH+PFCH+RCS NDC+OH+PFCH+PFCHS+RCS CH+CM+NDC+OH+PFCH+PFCHS CH+CM+NDC+OH+PFCH+RCS CH+CM+NDC+OH+PFCHS+RCS CM+NDC+OH+PFCH+PFCHS+RCS CH+CM+NDC+OH+PFCH+PFCHS+RCS ALL Average 62.29 61.44 61.02 64.83 58.48 64.41 62.29 65.68 63.56 63.14 67.37 66.10 64.41 64.83 63.14 68.64 61.44 62.71 60.59 63.98 63.14 61.02 62.71 64.41 67.8 61.44 61.44 63.14 63.14 67.8 64.83 64.41 63.98 59.75 63.71 63.98 61.44 62.29 64.83 61.86 61.86 64.41 64.83 62.29 65.25 66.95 66.52 60.59 65.25 62.29 65.25 62.71 62.71 62.71 61.44 63.98 63.98 63.56 64.83 62.71 61.02 62.71 64.41 66.1 65.25 64.83 63.56 66.95 59.32 63.62 63.56 63.98 63.14 64.41 61.85 61.86 64.83 67.37 62.29 64.83 66.10 66.52 62.71 64.83 65.68 66.10 62.29 63.56 61.86 62.29 64.83 63.56 64.41 66.95 63.98 62.71 63.14 63.98 67.38 64.83 65.25 62.29 65.68 58.90 63.97 62.71 60.59 62.71 63.98 61.02 60.59 65.25 62.29 62.71 62.29 64.83 60.17 65.25 63.56 61.44 65.68 63.56 63.56 60.59 61.44 62.71 62.29 63.56 62.71 64.41 59.75 62.28 63.14 61.86 63.14 63.14 61.44 62.71 56.36 62.70 V5 61.44 V6 61.86 V5 44.92 V6 44.92 V5 46.19 54.66 52.97 52.12 51.69 57.20 56.36 53.03 V6 51.27 55.51 52.97 53.81 52.12 57.20 58.05 54.42 Table A11: List of candidate features for Color % RT V1 64.41 V2 64.83 V3 63.98 V4 61.86 Table A12: Results for Composition feature % EH V1 44.07 V2 44.92 V3 44.92 V4 44.49 Table A13: Results for combination of Shape features % G H T G+H G+T H+T G+H+T Average V1 44.07 57.20 56.36 50.85 48.73 55.93 56.36 52.79 V2 44.07 56.78 53.81 48.31 47.88 55.93 55.08 51.69 V3 45.76 55.93 52.54 52.54 49.58 55.93 55.08 52.48 V4 48.73 56.36 53.81 51.27 52.12 56.36 54.24 53.27 Table A14: Results for combination of Texture features 78 62.71 58.05 61.86 61.02 60.59 60.59 64.83 62.72 58.05 63.98 60.59 60.59 64.83 64.41 57.20 67.80 62.29 59.75 59.75 58.05 62.71 60.17 64.41 58.9 65.25 58.47 61.44 60.17 61.02 64.41 61.86 59.75 63.14 60.59 62.19 % CEDD FCTH JCD CEDD+FCTH CEDD+JCD FCTH+JCD CEDD+FCTH+JCD Average V1 60.59 60.17 58.90 59.32 58.05 61.44 62.29 60.11 V2 62.71 59.75 63.56 60.59 63.56 62.71 61.44 62.05 V3 61.02 62.71 62.29 62.29 61.02 62.29 63.56 62.17 V4 61.86 61.44 63.56 61.44 60.59 61.02 61.02 61.56 V5 58.47 59.75 59.75 57.20 56.78 58.90 58.05 58.41 V6 58.90 56.78 58.90 55.08 55.08 57.20 55.08 56.72 Table A15: Results for combination of Joint features % CH+RCS+RT CH+NDC+RCS+RT CH+OH+RCS+RT CH+PFCH+RCS+RT OH+PFCHS+RCS+RT CH+CM+NDC+RCS+RT CH+CM+NDC+OH+PFCH+RCS+RT Average V2 65.21 65.25 67.37 67.80 64.41 66.95 64.41 65.92 V4 64.41 65.25 64.84 63.14 67.38 65.25 63.56 64.83 Table A16: Results for combination of Color and Composition features % CH+RCS+EH CH+NDC+RCS+EH CH+OH+RCS+EH CH+PFCH+RCS+EH OH+PFCHS+RCS+EH CH+CM+NDC+RCS+EH CH+CM+NDC+OH+PFCH+RCS+EH Average V2 66.10 64.41 63.56 60.17 61.44 61.86 62.72 62.89 V4 64.41 63.56 63.56 60.17 62.71 62.29 59.75 62.35 Table A17: Results for combination of Color and Shape features % CH+RCS+H CH+NDC+RCS+H CH+OH+RCS+H CH+PFCH+RCS+H OH+PFCHS+RCS+H CH+CM+NDC+RCS+H CH+CM+NDC+OH+PFCH+RCS+H CH+RCS+H+T CH+NDC+RCS+H+T CH+OH+RCS+H+T CH+PFCH+RCS+H+T OH+PFCHS+RCS+H+T CH+CM+NDC+RCS+H+T CH+CM+NDC+OH+PFCH+RCS+H+T Average V2 67.80 66.95 66.10 68.22 63.56 68.22 64.83 67.80 67.80 63.98 66.95 62.29 67.37 63.98 66.13 V4 64.83 64.40 63.14 66.10 66.56 64.41 62.29 63.56 63.56 61.86 65.68 62.71 63.14 66.10 64.17 Table A18: Results for combination of Color and Texture features % V2 66.95 CH+RCS+CEDD 79 V4 65.23 CH+NDC+RCS+CEDD CH+OH+RCS+CEDD CH+PFCH+RCS+CEDD OH+PFCHS+RCS+CEDD CH+CM+NDC+RCS+CEDD CH+CM+NDC+OH+PFCH+RCS+CEDD CH+RCS+JCD CH+NDC+RCS+JCD CH+OH+RCS+JCD CH+PFCH+RCS+JCD OH+PFCHS+RCS+JCD CH+CM+NDC+RCS+JCD CH+CM+NDC+OH+PFCH+RCS+JCD CH+RCS+FCTH+JCD CH+NDC+RCS+FCTH+JCD CH+OH+RCS+FCTH+JCD CH+PFCH+RCS+FCTH+JCD OH+PFCHS+RCS+FCTH+JCD CH+CM+NDC+RCS+FCTH+JCD CH+CM+NDC+OH+PFCH+RCS+FCTH+JCD Average 67.80 68.64 65.68 63.98 67.37 66.10 65.25 65.25 65.25 62.29 64.41 66.95 63.56 66.53 65.23 65.25 62.29 64.41 66.95 64.41 65.45 65.25 64.83 63.98 61.02 63.98 62.71 65.25 64.83 63.56 63.56 61.12 64.83 59.75 63.14 64.83 63.56 63.56 61.02 64.83 64.41 63.58 Table A19: Results for combination of Color and Joint features % V2 47.88 RT+EH V4 52.12 Table A20: Results for combination of Composition and Shape features % V2 55.51 55.93 55.72 RT+H RT+H+T Average V4 57.63 58.90 58.27 Table A21: Results for combination of Composition and Texture features % V2 61.44 62.29 62.71 62.15 RT+CEDD RT+JCD RT+FCTH+JCD Average V4 61.44 60.59 59.32 60.45 Table A22: Results for combination of Composition and Joint features % V2 50.85 50.42 50.64 EH + H EH + H + T Average V4 51.65 49.15 50.40 Table A23: Results for combination of Shape and Texture features % V2 56.78 56.36 57.63 56.92 EH + CEDD EH + JCD EH + FCTH + JCD Average 80 V4 52.97 59.75 57.63 56.78 Table A24: Results for combination of Shape and Joint features % V2 61.02 59.32 58.47 60.59 61.44 61.44 60.38 H+CEDD H+JCD H+FCTH+JCD H+T+CEDD H+T+JCD H+T+FCTH+JCD Average V4 58.57 56.78 60.17 59.32 61.02 61.02 59.48 Table A25: Results for combination of Texture and Joint features % CH+RCS+RT+EH CH+NDC+RCS+RT+EH CH+OH+RCS+RT+EH CH+PFCH+RCS+RT+EH OH+PFCHS+RCS+RT+EH CH+CM+NDC+RCS+RT+EH CH+CM+NDC+OH+PFCH+RCS+RT+EH Average V2 62.29 63.98 61.44 63.98 62.29 62.29 61.86 62.59 V4 61.02 61.02 62.71 61.02 63.98 61.86 59.32 61.56 Table A26: Results for combination of Color, Composition and Shape features % CH+RCS+RT+H CH+NDC+RCS+RT+H CH+OH+RCS+RT+H CH+PFCH+RCS+RT+H OH+PFCHS+RCS+RT+H CH+CM+NDC+RCS+RT+H CH+CM+NDC+OH+PFCH+RCS+RT+H CH+RCS+RT+H+T CH+NDC+RCS+RT+H+T CH+OH+RCS+RT+H+T CH+PFCH+RCS+RT+H+T OH+PFCHS+RCS+RT+H+T CH+CM+NDC+RCS+RT+H+T CH+CM+NDC+OH+PFCH+RCS+RT+H+T Average V2 66.92 66.10 64.41 66.53 65.25 67.37 64.83 66.95 66.52 64.41 66.95 66.95 66.10 66.10 66.10 V4 63.98 65.25 63.98 64.41 68.22 64.41 63.14 63.98 64.41 62.71 63.98 66.56 64.83 65.25 64.65 Table A27: Results for combination of Color, Composition and Texture features % CH+RCS+RT+CEDD CH+NDC+RCS+RT+CEDD CH+OH+RCS+RT+CEDD CH+PFCH+RCS+RT+CEDD OH+PFCHS+RCS+RT+CEDD CH+CM+NDC+RCS+RT+CEDD CH+CM+NDC+OH+PFCH+RCS+RT+CEDD CH+RCS+RT+JCD CH+NDC+RCS+RT+JCD CH+OH+RCS+RT+JCD CH+PFCH+RCS+RT+JCD 81 V2 63.98 67.37 68.22 63.14 61.44 65.68 66.52 65.25 65.68 66.10 66.10 V4 61.86 62.71 63.14 60.59 61.86 61.86 61.44 64.83 63.98 64.41 63.98 OH+PFCHS+RCS+RT+JCD CH+CM+NDC+RCS+RT+JCD CH+CM+NDC+OH+PFCH+RCS+RT+JCD CH+RCS+RT+FCTH+JCD CH+NDC+RCS+RT+FCTH+JCD CH+OH+RCS+RT+FCTH+JCD CH+PFCH+RCS+RT+FCTH+JCD OH+PFCHS+RCS+RT+FCTH+JCD CH+CM+NDC+RCS+RT+FCTH+JCD CH+CM+NDC+OH+PFCH+RCS+RT+FCTH+JCD Average 63.56 65.68 65.25 65.25 65.68 66.10 66.10 61.86 65.68 64.83 65.21 61.44 63.14 63.98 64.83 63.98 64.41 63.98 60.17 63.14 63.56 63.01 Table A28: Results for combination of Color, Composition and Joint features % CH+RCS+EH+H CH+NDC+RCS+EH+H CH+OH+RCS+EH+H CH+PFCH+RCS+EH+H OH+PFCHS+RCS+EH+H CH+CM+NDC+RCS+EH+H CH+CM+NDC+OH+PFCH+RCS+EH+H CH+RCS+EH+H+T CH+NDC+RCS+EH+H+T CH+OH+RCS+EH+H+T CH+PFCH+RCS+EH+H+T OH+PFCHS+RCS+EH+H+T CH+CM+NDC+RCS+EH+H+T CH+CM+NDC+OH+PFCH+RCS+EH+H+T Average V2 64.41 65.25 62.71 63.14 61.44 64.83 62.71 62.29 61.01 63.98 63.98 63.14 63.14 63.98 63.29 V4 65.68 65.68 63.98 63.56 62.71 63.56 61.86 63.29 63.98 65.68 61.86 62.71 63.98 62.29 63.63 Table A29: Results for combination of Color, Shape and Texture features % CH+RCS+EH+CEDD CH+NDC+RCS+EH+CEDD CH+OH+RCS+EH+CEDD CH+PFCH+RCS+EH+CEDD OH+PFCHS+RCS+EH+CEDD CH+CM+NDC+RCS+EH+CEDD CH+CM+NDC+OH+PFCH+RCS+EH+CEDD CH+RCS+EH+JCD CH+NDC+RCS+EH+JCD CH+OH+RCS+EH+JCD CH+PFCH+RCS+EH+JCD OH+PFCHS+RCS+EH+JCD CH+CM+NDC+RCS+EH+JCD CH+CM+NDC+OH+PFCH+RCS+EH+JCD CH+RCS+EH+FCTH+JCD CH+NDC+RCS+EH+FCTH+JCD CH+OH+RCS+EH+FCTH+JCD CH+PFCH+RCS+EH+FCTH+JCD OH+PFCHS+RCS+EH CH+CM+NDC+RCS+EH+FCTH+JCD CH+CM+NDC+OH+PFCH+RCS+EH+FCTH+JCD Average V2 62.71 63.56 63.14 62.71 60.59 63.14 62.29 62.29 61.44 63.14 62.29 63.14 63.98 61.44 62.29 65.25 62.71 62.71 61.44 63.56 62.29 62.67 Table A30: Results for combination of Color, Shape and Joint features 82 V4 62.71 62.29 64.41 59.75 59.75 62.71 59.32 64.41 63.98 61.86 61.86 60.59 64.41 61.02 64.41 64.83 63.14 63.56 62.71 62.29 60.59 62.41 % CH+RCS+H+CEDD CH+NDC+RCS+H+CEDD CH+OH+RCS+H+CEDD CH+PFCH+RCS+H+CEDD OH+PFCHS+RCS+H+CEDD CH+CM+NDC+RCS+H+CEDD CH+CM+NDC+OH+PFCH+RCS+H+CEDD CH+RCS+H+T+CEDD CH+NDC+RCS+H+T+CEDD CH+OH+RCS+H+T+CEDD CH+PFCH+RCS+H+T+CEDD OH+PFCHS+RCS+H+T+CEDD CH+CM+NDC+RCS+H+T+CEDD CH+CM+NDC+OH+PFCH+RCS+H+T+CEDD CH+RCS+H+JCD CH+NDC+RCS+H+JCD CH+OH+RCS+H+JCD CH+PFCH+RCS+H+JCD OH+PFCHS+RCS+H+JCD CH+CM+NDC+RCS+H+JCD CH+CM+NDC+OH+PFCH+RCS+H+JCD CH+RCS+H+T+JCD CH+NDC+RCS+H+T+JCD CH+OH+RCS+H+T+JCD CH+PFCH+RCS+H+T+JCD OH+PFCHS+RCS+H+T+JCD CH+CM+NDC+RCS+H+T+JCD CH+CM+NDC+OH+PFCH+RCS+H+T+JCD CH+RCS+H+FCTH+JCD CH+NDC+RCS+H+FCTH+JCD CH+OH+RCS+H+FCTH+JCD CH+PFCH+RCS+H+FCTH+JCD OH+PFCHS+RCS+H+FCTH+JCD CH+CM+NDC+RCS+H+FCTH+JCD CH+CM+NDC+OH+PFCH+RCS+H+FCTH+JCD CH+RCS+H+T+FCTH+JCD CH+NDC+RCS+H+T+FCTH+JCD CH+OH+RCS+H+T+FCTH+JCD CH+PFCH+RCS+H+T+FCTH+JCD OH+PFCHS+RCS+H+T+FCTH+JCD CH+CM+NDC+RCS+H+T+FCTH+JCD CH+CM+NDC+OH+PFCH+RCS+H+T+FCTH+JCD Average V2 66.52 64.83 66.53 65.68 63.56 65.25 62.71 66.95 66.53 66.10 66.95 62.71 64.83 63.14 65.25 62.29 64.83 63.98 59.75 64.41 64.83 66.52 65.68 65.68 63.56 58.48 63.98 64.41 68.22 65.68 64.83 64.41 60.59 64.41 65.25 65.68 65.25 65.68 65.25 62.71 63.25 66.52 64.61 V4 63.14 62.71 63.14 61.02 62.29 61.86 62.29 64.41 64.41 64.41 66.10 62.71 64.83 65.68 63.14 61.86 63.56 62.71 60.59 61.86 64.41 61.44 60.59 62.71 65.25 59.75 61.44 63.56 63.98 65.98 63.56 63.68 59.75 64.41 63.56 62.29 61.86 63.98 64.41 60.17 62.29 63.98 62.99 Table A31: Results for combination of Color, Texture and Joint features % OH + PFCHS + RCS + RT + H + EH OH + PFCHS + RCS + RT + H + T + EH Average V2 61.86 61.86 61.86 V4 62.71 63.56 63.14 Table A32: Results for combination of Color, Composition, Texture and Shape features % OH + PFCHS + RCS + RT + H + T + CEDD OH + PFCHS + RCS + RT + H + T + JCD OH + PFCHS + RCS + RT + H + T + FCTH + JCD OH + PFCHS + RCS + RT + H + CEDD 83 V2 63.56 63.56 62.71 63.56 V4 62.71 61.86 60.17 63.56 61.44 64.83 63.28 OH + PFCHS + RCS + RT + H + JCD OH + PFCHS + RCS + RT + H + FCTH + JCD Average 62.29 61.02 61.94 Table A33: Results for combination of Color, Composition, Texture and Joint features % CH + RCS + H + FCTH + JCD + EH CH + PFCH + RCS + H + T + CEDD + EH Average V2 63.56 61.86 62.71 V4 65.25 61.86 63.56 Table A34: Results for combination of Color, Texture, Joint and Shape features % CH + RCS + H + FCTH + JCD + RT CH + PFCH + RCS + H + T + CEDD + RT Average V2 65.25 63.14 64.20 V4 63.98 61.86 62.92 Table A35: Results for combination of Color, Texture, Joint and Composition features CH+CM+NDC+RCS % Negative Positive V2 Negative 82.11 46.02 Positive 17.89 53.98 % Negative Positive V4 Negative 75.61 44.25 Positive 24.39 55.75 V4 Negative 73.17 45.13 Positive 26.83 54.87 V4 Negative 74.80 46.02 Positive 25.20 53.98 V4 Negative 77.24 45.13 Positive 22.76 54.87 V4 Negative 72.36 40.71 Positive 27.64 59.29 CH+CM+NDC+RCS+H % Negative Positive V2 Negative 79.67 44.25 Positive 20.33 53.98 % Negative Positive CH+OH+RCS+CEDD % Negative Positive V2 Negative 79.67 43.36 Positive 20.33 56.64 % Negative Positive CH+OH+RCS % Negative Positive V2 Negative 82.11 46.02 Positive 17.89 53.98 % Negative Positive CH+PFCH+RCS+H % Negative Positive V2 Negative 80.49 45.13 Positive 19.51 54.87 % Negative Positive 84 CH+RCS+H+FCTH+JCD % Negative Positive V2 Negative 79.67 44.25 Positive 20.33 55.75 % Negative Positive V4 Negative 75.61 48.67 Positive 24.39 51.33 V4 Negative 77.24 41.59 Positive 22.76 58.41 V4 Negative 78.05 46.02 Positive 21.95 53.98 OH+PFCHS+RCS+RT+H % Negative Positive V2 Negative 75.61 46.02 Positive 24.39 53.98 % Negative Positive OH+PFCHS+RCS+RT+H+T % Negative Positive V2 Negative 78.86 46.02 Positive 21.14 53.98 % Negative Positive Table A36: Confusion Matrices for each combination CH+CM+NDC+RCS % Negative Positive V2 Negative 78.51 34.71 Positive 21.49 65.29 % Negative Positive V4 Negative 81.82 33.06 Positive 18.18 66.94 V4 Negative 80.17 34.71 Positive 19.83 65.29 V4 Negative 71.07 29.75 Positive 28.93 70.25 V4 Negative 76.86 34.71 Positive 23.14 65.29 V4 Negative 82.64 35.54 Positive 17.36 64.46 CH+CM+NDC+RCS+H % Negative Positive V2 Negative 77.69 34.71 Positive 22.31 65.29 % Negative Positive CH+OH+RCS+CEDD % Negative Positive V2 Negative 75.21 32.23 Positive 24.79 67.77 % Negative Positive CH+OH+RCS % Negative Positive V2 Negative 77.69 32.23 Positive 22.31 67.77 % Negative Positive CH+PFCH+RCS+H % Negative Positive V2 Negative 80.17 38.02 Positive 19.83 61.98 % Negative Positive 85 CH+RCS+H+FCTH+JCD % Negative Positive V2 Negative 78.51 35.54 Positive 21.49 64.46 % Negative Positive V4 Negative 76.86 32.23 Positive 23.14 67.77 V4 Negative 76.86 33.88 Positive 23.14 66.12 V4 Negative 77.69 33.88 Positive 22.31 66.12 OH+PFCHS+RCS+RT+H % Negative Positive V2 Negative 75.21 34.71 Positive 24.79 65.29 % Negative Positive OH+PFCHS+RCS+RT+H+T % Negative Positive V2 Negative 76.03 32.23 Positive 23.97 67.77 % Negative Positive Table A37: Confusion Matrices for each combination using GAPED dataset with Negative and Positive categories CH+CM+NDC+RCS % Negative Neutral Positive V2 Negative Neutral 51.24 28.93 33.71 49.44 23.97 22.31 Positive 19.83 16.85 53.72 % Negative Neutral Positive V4 Negative Neutral 37.19 44.63 24.72 62.92 22.31 29.75 Positive 18.18 12.36 47.93 V4 Negative Neutral 51.24 31.41 25.84 62.92 25.62 27.27 Positive 17.36 11.24 47.11 V4 Negative Neutral 58.68 22.31 24.72 64.04 21.49 26.45 Positive 19.01 11.24 52.07 V4 Negative Neutral 37.19 47.11 30.34 56.18 23.97 29.75 Positive 15.70 13.48 46.28 CH+CM+NDC+RCS+H % Negative Neutral Positive V2 Negative Neutral 61.16 18.18 23.60 57.30 27.27 21.49 Positive 20.66 19.19 51.24 % Negative Neutral Positive CH+OH+RCS+CEDD % Negative Neutral Positive V2 Negative Neutral 55.37 21.49 22.47 59.55 23.14 22.31 Positive 23.14 17.98 54.55 % Negative Neutral Positive CH+OH+RCS % Negative Neutral Positive V2 Negative Neutral 37.19 47.11 30.34 56.18 23.97 29.75 Positive 15.70 13.48 46.28 % Negative Neutral Positive 86 CH+PFCH+RCS+H % Negative Neutral Positive V2 Negative Neutral 64.46 14.05 26.97 51.69 28.10 18.18 Positive 21.49 21.35 53.72 % Negative Neutral Positive V4 Negative Neutral 54.55 28.93 24.72 64.04 26.45 23.14 Positive 16.53 11.24 50.41 V4 Negative Neutral 54.55 25.62 25.84 65.17 23.14 23.14 Positive 19.83 8.99 53.72 V4 Negative Neutral 59.50 24.79 30.34 52.81 22.31 23.97 Positive 15.70 16.85 53.72 V4 Negative Neutral 61.16 21.49 28.09 55.06 21.49 23.14 Positive 17.36 16.85 55.37 CH+RCS+H+FCTH+JCD % Negative Neutral Positive V2 Negative Neutral 62.81 16.53 24.72 59.55 23.97 21.49 Positive 20.66 15.73 54.55 % Negative Neutral Positive OH+PFCHS+RCS+RT+H % Negative Neutral Positive V2 Negative Neutral 65.29 14.05 29.21 46.07 23.14 19.01 Positive 20.66 24.72 57.85 % Negative Neutral Positive OH+PFCHS+RCS+RT+H+T % Negative Neutral Positive V2 Negative Neutral 65.29 12.40 32.58 42.70 23.14 18.18 Positive 22.31 24.72 58.78 % Negative Neutral Positive Table A38: Confusion Matrices for each combination using GAPED dataset with Negative, Neutral and Positive categories CH+CM+NDC+RCS % Negative Positive V2 Negative 87.18 42.31 Positive 12.82 57.69 % Negative Positive V4 Negative 82.05 44.87 Positive 17.95 55.13 V4 Negative 82.05 42.31 Positive 17.95 57.69 V4 Negative 83.33 39.74 Positive 16.67 60.26 CH+CM+NDC+RCS+H % Negative Positive V2 Negative 79.49 39.74 Positive 20.51 60.26 % Negative Positive CH+OH+RCS+CEDD % Negative Positive V2 Negative 80.77 41.03 Positive 19.23 58.97 % Negative Positive 87 CH+OH+RCS % Negative Positive V2 Negative 78.21 43.59 Positive 21.79 56.41 % Negative Positive V4 Negative 75.64 46.15 Positive 24.36 53.85 V4 Negative 79.49 43.59 Positive 20.51 56.41 V4 Negative 85.90 44.87 Positive 14.10 55.13 V4 Negative 87.18 47.45 Positive 12.82 52.56 V4 Negative 88.50 46.15 Positive 11.54 53.85 CH+PFCH+RCS+H % Negative Positive V2 Negative 80.77 39.74 Positive 19.23 60.26 % Negative Positive CH+RCS+H+FCTH+JCD % Negative Positive V2 Negative 79.49 38.47 Positive 20.51 61.54 % Negative Positive OH+PFCHS+RCS+RT+H % Negative Positive V2 Negative 82.05 44.87 Positive 17.95 55.13 % Negative Positive OH+PFCHS+RCS+RT+H+T % Negative Positive V2 Negative 85.90 43.59 Positive 14.10 56.41 % Negative Positive Table A39: Confusion Matrices for each combination using Mikels and GAPED dataset 88 Appendix B Questionnaire 89 Figure B1: EmoPhoto Questionnaire 90 Results of the Questionnaire Figure B2: 1. Age Figure B3: 2. Gender 91 Figure B4: 3. Education Level Figure B5: 4. Have you ever participated in a study using any Brain-Computer Interface Device? 92 Figure B6: 7. How do you feel? Figure B7: 8. Please classify your emotional state regarding the following cases: Anger, Disgust, Fear, Happiness, Neutral, Sadness and Surprise 93 94