Perceptually-motivated Non-Photorealistic Graphics
NORTHWESTERN UNIVERSITY

Perceptually-motivated Non-Photorealistic Graphics

A DISSERTATION SUBMITTED TO THE GRADUATE SCHOOL IN PARTIAL FULFILLMENT OF THE REQUIREMENTS for the degree DOCTOR OF PHILOSOPHY

Field of Computer Science

By Holger Winnemöller

EVANSTON, ILLINOIS
December 2006

© Copyright by Holger Winnemöller 2006. All Rights Reserved.

ABSTRACT

Perceptually-motivated Non-Photorealistic Graphics

Holger Winnemöller

At a high level, computer graphics deals with conveying information to an observer by visual means. Generating realistic images for this task requires considerable time and computing resources. Human vision faces the opposite challenge: to distill knowledge of the world from a massive influx of visual information. It is reasonable to assume that synthetic images based on human perception and tailored for a given task can (1) decrease image synthesis costs by obviating a physically realistic lighting simulation, and (2) increase human task performance by omitting superfluous detail and enhancing visually important features.

This dissertation argues that the connection between non-realistic depiction and human perception is a valuable tool for improving the effectiveness of computer-generated images in visual communication tasks and, conversely, for learning more about human perception of such images. Artists have capitalized on non-realistic imagery to great effect, and have become masters of conveying complex and even abstract messages by visual means. The relatively new field of non-photorealistic computer graphics attempts to harness artists' implicit expertise by imitating their visual styles, media, and tools, but only a few works move beyond such simulations to verify the effectiveness of generated images with perceptual studies, or to investigate which stylistic elements are effective for a given visual communication task.

This dissertation demonstrates the mutual benefit of non-realistic computer graphics and perception with two rendering frameworks and accompanying psychophysical studies: (1) Inspired by low-level human perception, a novel image-based abstraction framework simplifies and enhances images to make them easier to understand and remember. (2) A non-realistic rendering framework generates isolated visual shape cues to study human perception of fast-moving objects. The first framework leverages perception to increase the effectiveness of (non-realistic) images for visually-driven tasks, while the second framework uses non-realistic images to learn about task-specific perception, thus closing the loop. As instances of the bi-directional connections between perception and non-realistic imagery, the frameworks illustrate numerous benefits including effectiveness (e.g. better recognition of abstractions versus photographs), high performance (e.g. real-time image abstraction), and relevance (e.g. shape perception in non-impoverished conditions).

Dedication

To my parents, for their unconditional love and support.

Acknowledgements

There are many people who can be blamed, to various degrees, for helping me get away with a PhD: My parents endowed me with a working brain, and always made sure that I use it to its full potential. My sister, Martina, kept telling me to believe in myself, and I am starting to listen to her. Angela has been my confidante and friend in good times and when things were rough, which they were a bit.
Bruce Gooch, my advisor at Northwestern University, believed in my ideas, gave me the freedom to pursue my goals, supported me generously wherever he could, and has been the mentor that I had wanted for a long time. He also managed to give the graphics group a sense of family and belonging. Without Jack Tumblin, I would never have come to Northwestern to begin with. He made the first contact, invited me to come to Evanston as a scholar, and has been supportive and interested even when I decided to work with Bruce. Talking to Jack reminds you that there is always more to learn in life. Amy Gooch was one of my NPR contacts when I was at the University of Cape Town, looking for a place to finish my PhD. She was helpful then, and has helped me with papers, gruesome corrections, and good advice ever since. Bryan Pardo graciously agreed to be on my PhD committee, and gave me much of his time and many helpful suggestions for the dissertation. James Gain kindly offered to be my co-advisor at UCT when all else failed. I am sure he would have done a fine job.

Ankit Mohan and Pin Ren have been my Evanston friends from the first day they welcomed me in their office. Since that day, we've had many interesting, silly, and funny times together. I'll always remember our 48-hour retreat around Lake Michigan. Sven Olsen has been my conspirator for the Videoabstraction project and a great companion during those long office hours when everybody else was already sleeping. David Feng joined the team only later, but quickly became an integral part of the crew. I miss having a worthy squash opponent. Marc Nienhaus joined the graphics group as a post-doc and left three months later as a cool roommate and good friend. I will also miss the rest of the graphics lab (Tom Lechner, Sangwon Lee, Yolanda Rankin, Vidya Setlur, and Conrad Albrecht-Buehler), but I am sure that our paths will cross in the times to come. The rest of my family, especially my brother Ronald, and my friends in Germany and South Africa have been a constant source of inspiration in my life. They have achieved so much, made me so proud, and given me good reasons not to give up whenever times were tough. You know who you are.

I would like to thank the many volunteers for my experiments, who were always patient, courteous, and interested. I also owe thanks to Rosalee Wolfe and Karen Alkoby for the deaf signing video; Douglas DeCarlo and Anthony Santella for proof-reading and supplying data-driven abstractions and eye-tracking data for the Videoabstraction project; as well as Marcy Morris and James Bass for acquiring image permission from Cameron Diaz.

Table of Contents

ABSTRACT
Dedication
Acknowledgements
List of Tables
List of Figures
Chapter 1. Introduction
1.1. Realistic versus Non-realistic Graphics
1.2. The Art of Perception and the Perception of Art
1.3. Contributions
Chapter 2. General Related Work
2.1. Simple Error Metrics (Non-Perceptual)
2.2. Saliency
2.3. Visible Difference Predictors (VDPs)
2.4. Applications
Chapter 3. Real-Time Video Abstraction
3.1. Related Work
3.2. Human Visual System
3.3. Implementation
3.4. Experiments
3.5. Framework Results and Discussion
3.6. Summary
Chapter 4. An Experiment to Study Shape-from-X of Moving Objects
4.1. Introduction
4.2. Human Visual System
4.3. Related Work
4.4. Implementation
4.5. Procedure
4.6. Evaluation
4.7. Results and Discussion
4.8. Summary
Chapter 5. General Future Work
5.1. Vision Shortcuts
5.2. Discussion
Chapter 6. Conclusion
6.1. Conclusion drawn from Real-time Video Abstraction Chapter
6.2. Conclusions drawn from Shape-from-X Chapter
6.3. Summary
References
Appendix A. User-data for Videoabstraction Studies
Appendix B. User-data for Shape-from-X Study
Appendix C. Links for Selected Objects

List of Tables

2.1 Comparison Metrics
4.1 Within-"Aggregate Measure" Effects
4.2 Significance Analysis
A.1 Data for Videoabstraction Study 1
A.2 Data for Videoabstraction Study 2
B.1 Shading Data
B.2 Outline Data
B.3 Mixed Data
B.4 TexISO Data
B.5 TexNOI Data
B.6 Questionnaire Data
C.1 Internet references

List of Figures

1.1 Photorealistic Graphics
1.2 Realism vs. Non-Realism - Subway System
1.3 Perceptual Constancy
1.4 Realism in Art
2.1 Simple Error Metrics
3.1 Abstraction Example
3.2 Explicit Image Structure
3.3 Scale-Space Abstraction
3.4 Framework Overview
3.5 Linear vs. Non-linear Filtering
3.6 Diffusion Conduction Functions and Derivatives
3.7 Progressive Abstraction
3.8 Data-driven Abstraction
3.9 Painted Abstraction
3.10 Separable Bilateral Approximation
3.11 Center-Surround Cell Activation
3.12 DoG Edge Detection and Enhancement
3.13 DoG Parameter Variations
3.14 Edge Cleanup Passes
3.15 DoG vs. Canny Edges
3.16 IWB Effect
3.17 Computing Warp Fields
3.18 Luminance Quantization Parameters
3.19 Sample Images for Study 1
3.20 Sample Images from Study 2
3.21 Participant-data for Video Abstraction Experiments
3.22 Failure Case
3.23 Benefits for Vectorization
3.24 Automatic Indication
3.25 Motion Blur Examples
3.26 Motion Blur Result
4.1 Shape-from-X Cues
4.2 Left: Depth Ambiguity
4.3 Right: Tilt & Slant
4.4 Real-time Models
4.5 Experimental Setup
4.6 Display Modes
4.7 The First Version of the Experiment
4.8 Constructing Shapes
4.9 Shape Categories
4.10 Experiment Object Matrix
4.11 Mistaken Identity
4.12 Aggregate Comparison
4.13 Detailed Aggregate Measures
4.14 Detailed Aggregate Measures Histograms
5.1 Lifecycle of a Synthetic Image
5.2 Flicker Color Designs
5.3 Retinex Images
5.4 Originals for Retinex Images
5.5 Anomalous Motion
5.6 Deleted Contours
B.1 Questionnaire

CHAPTER 1
Introduction

This dissertation presents two rendering frameworks and validatory studies to demonstrate, by example, the intimate connection between non-photorealistic (NPR) graphics and perception, and to show how the two research areas can form an effective and natural symbiosis. While the notion of such a connection is not novel in itself, it is also not commonly leveraged, particularly within the NPR community. It is my hope that future researchers will adopt the frameworks, methodologies, and experiments documented in this dissertation to the mutual benefit of both communities.

To explain the origins of this connection and its significance, it is instructive to discuss non-photorealism and then list the commonalities of non-photorealism and perception. Seeing as non-photorealism is defined by exclusion (being not realistic) rather than by explicit goals, it seems appropriate to look briefly at the historical contrast between realistic and non-realistic graphics.

1.1. Realistic versus Non-realistic Graphics

Traditionally, the ultimate goal of computer graphics has been photo-realism: to generate synthetic images that are indistinguishable from photographs [166, 63, 85, 28]. Today, this goal has arguably been achieved.
Given enough time and resources, synthetic renderers can generate imagery that is indistinguishable from photographic images to the naked eye (Figure 1.1), and models exist that simulate optical processes down to the level of individual photons [37].

Figure 1.1. Photorealistic Graphics. This image shows a state-of-the-art rendering of a synthetic scene (using POV-Ray 3.6). Notable realistic optical effects include: reflection, refraction, global illumination, depth-of-field, and lens distortion.— {By Gilles Tran, Public Domain. See Table C.1 for URLs to selected images.}

While this success does not foreclose further research to advance the number and types of optical phenomena that can be modeled, or to improve efficiency, an increasing number of researchers question realism as the only viable goal for computer graphics. The question these scientists ask is: What are the images we create used for?

Footnote 1: Pat Hanrahan, in his Eurographics 2005 keynote address, saw slideshow presentations at conferences as one of the main uses.

1.1.1. Depiction Purpose

"Because pictures always have a purpose, producing a picture is essentially an optimization process. Depiction consists in trying to make the picture that best satisfies the goals." ([40], pg. 116, original emphasis). If this purpose is simulation of physical interaction between light and matter (for research, realistic conceptualization, or entertainment) [162, 37, 165], then photo-realism is a logically sound choice. If, on the other hand, the purpose is more general or abstract (to convey an idea, to give directions, to explain a situation, to give an example), then photo-realism may confuse the issue at hand through unnecessary specificity, visual clutter (masking), and physical limitations. For example, the spatial layout map of a subway system does not include every bend and corner (specificity) because only the stations and their relative positions are of interest to the viewer (Figure 1.2). The map does not include all the buildings and streets where the subway runs (visual clutter) because this would make it difficult to see the subway paths. Lastly, the map could not have been captured in a single photograph (physical limitation) because most parts of the subway system are underground and mutually hidden.

Figure 1.2. Realism vs. Non-Realism - Subway System. (a) Photographic: An aerial photograph of London. This image is ill-suited to show the underground subway system covering the photographed area. (b) Schematic: A schematic (non-realistic) map of part of the London subway system with a variety of abstractions/simplifications: all streets, buildings, and parks are omitted; train-paths are drawn color-coded and so that angles are multiples of 45°; train stations are symbolized by circles, indicating connections through connected circles; other symbols list additional services offered at a given station.— {(a) by Andrew Bossi, GNU Free Documentation License. (b) after maps of Transport for London.}

1.1.2. Realistic Non-realism

Images generated with a specific purpose in mind could thus be called artistic, symbolic, stylistic, comprehensive, instructive, expressive, or communicative but, unfortunately, the rather unimaginative term non-photorealism has become established. Perhaps because of this lack of a purpose-statement, much of the research in non-photorealism has focussed, again, on realism instead.
This dissertation uses a similar classification to Gooch and Gooch [61], who identify three main areas of NPR research: (1) Natural media simulation; (2) Artistic Tools; and (3) Artistic style simulation.

Natural media simulation. Natural media simulation concerns itself with simulating (realistically) the substance that is applied to an image (e.g. oil, acrylic, coal), the instruments with which the substance is applied (e.g. brush, pencil, crayon), and the substrate to which the substance is applied (e.g. canvas, paper) [61]. In all cases, the simulated media is intended to produce surface marks that are indistinguishable from the real media [131, 170, 143, 32].

Artistic Tools. Simulated media itself is of little practical use if it is not controlled by some entity. Assisting users in creating images is therefore a worthwhile endeavor. Commercial products, such as Photoshop or CorelDraw, provide a rich set of tools and functionality by repurposing standard input devices (mouse, keyboard, digital tablet). Other software and research work assist users with technically challenging, tedious, or repetitive tasks [152, 190, 120, 72, 24], but ultimately the user still has to create the image and therefore make all decisions about layout, design, placement, etc.

Artistic Styles. The last category of NPR research takes inspiration from existing artistic styles and attempts to automatically transform some data (usually geometric models or photographs) into images in a given artistic style. Examples of this work include the creation of line drawings from three-dimensional models [170, 38, 86, 174], light-models for cartoon-like shading [99, 25, 84, 173], and painterly systems from geometric models [110], videos [73, 68] or photographs [176]. It should be noted that none of these systems presumes to create Art; they merely generate images that (as realistically as possible) resemble a particular artistic style.

In short, much of NPR research is still devoted to realistic picture (as opposed to photo) creation. There exist several noteworthy exceptions to this trend, which serve as inspiration and which I discuss throughout this dissertation. Saito and Takahashi [142] increased the comprehensibility of images using G-Buffers. Gooch and Willemsen used a non-realistic virtual environment to study virtual distance perception [60]. DeCarlo and Santella created meaningful abstractions guided by eye-tracking data [36, 144]. Gooch et al. showed that illustrations of faces were more effective than photographs for facial recognition [62]. Raskar et al. visually enhanced images with multi-flash hardware [138]. DeCarlo et al. facilitated shape perception with suggestive contours [35]. It is clear from these citations that currently only a fairly small number of researchers are addressing NPR issues beyond realistic simulation of artistic media and styles.

1.1.3. Stylistic Effects and Effective Styles

So what is wrong with reproducing an artistic style? After all, artists have been very successful at communicating abstract ideas, expressing emotions, triggering curiosity, and entertaining. The answer is that there might be nothing wrong at all, but we cannot be sure. As Durand put it, "[the] availability of this new variety of styles raises the important question of the choice of an appropriate style, especially when clarity is paramount" ([40], pg. 120). Santella and DeCarlo [144] argued that many NPR systems abstract imagery, but often without meaning or purpose, other than stylistic.
In Santella and DeCarlo's experiment, they found that lack of shading and uniformly varying the level of detail in pure line drawings had no effect on the number of viewers' fixation points, whereas targeted control of detail "[. . . ] affected viewers in a way that supports an interpretation of enhanced understanding." ([144], pg. 77). Many authors of NPR systems motivate their work with the expressiveness and communicative benefits of stylistic imagery, but few go on to prove that transferring visual aspects of a given style to their synthesized imagery satisfies these higher perceptual or cognitive goals.

Admittedly, some NPR systems do not stand to gain much from perceptual validation, particularly artistic tools and natural media simulations. Such systems are better served with the extreme programming method and physical validation, respectively. Most other NPR systems, and the NPR community at large, however, stand to benefit from perceptual evaluation and validation. The measures currently employed to compare different NPR algorithms and implementations are mainly taken from realistic graphics performance measures, chiefly frame-rates dependent on geometric complexity or screen resolution. While such measures are suitable to demonstrate performance enhancements for algorithmic improvements, they are ill-suited to objectively answer questions like: "Does this system capture the essence of an artistic style?"; "Does this system help a user in detecting certain features of an image faster?"; or "How do we know which style to choose to support a given perceptual task?"

Several authors, including myself, believe that the answers to these questions lie in perception. Seeing as most visual art is conceived through visual experience and expressed through an artistic process that heavily relies on feedback from the human visual system (HVS), it is likely that (1) Perception is a large influence in the creation of Art; and (2) the analysis of artistic principles may lead to insights on human perception. The following section discusses these connections between perception and art (and by extension NPR).

Figure 1.3. Perceptual Constancy. (a) Size Constancy: Objects at a distance or reflections in a mirror produce greatly reduced retinal images, yet they are perceived as being of a normal size (e.g. a train at a distance does not appear to be a toy-train; it appears as a normal-sized train at a distance). (b) Shape Constancy: Although the two views of the bunny generate radically different retinal images, they are nonetheless perceived as depicting the same bunny. Note that shape constancy does not require an observer to have seen any particular view previously.— {Bunny model courtesy of Stanford University Computer Graphics Laboratory.}

1.2. The Art of Perception and the Perception of Art

The neurobiologist Semir Zeki claims that "[. . . ] the overall function of art is an extension of the function of the brain" ([189], pg. 76). More specifically, Zeki defines the function of the (visual) brain as the "[. . . ] search for constancies with the aim of obtaining knowledge about the world" (pg. 79). Similarly, he defines the general function of art as a "[. . . ] search for the constant, lasting, essential, and enduring features of objects, surfaces, situations, and so on." (pg. 79).

1.2.1. Constancy
Why is constancy so important? In the words of Durand, "The notion of invariants and constancy are crucial in studying vision and the complex dualism of pictures. Invariants are intrinsic properties of scenes or objects, such as reflectance, as opposed to accidental extrinsic properties such as outgoing light that vary with, e.g., the lighting condition or the viewpoint. Constancy is the ability to discount the accidental conditions and to extract invariants." ([40], pg. 113, original emphasis). There are many examples of perceptual constancy (Figure 1.3): color constancy allows us to see a green apple as green, regardless of whether we encounter it during an orange sunset or in a fluorescently-lit room. Size constancy allows us to subjectively perceive our own reflection in a mirror as normal-sized, although the dimensions of the reflection are objectively halved. Shape constancy permits objects to be recognized from a variety of viewpoints, even novel ones that have not been experienced before.

Not surprisingly, the notion of intrinsic and extrinsic properties has had a profound impact on the evolution of art. For example, the Dutch Golden Age of the 17th century focussed on high detail and realism, whereas many of the modern artistic styles, like cubism, pointillism, fauvism, and expressionism, focussed instead on cognitive and perceptual aspects of depiction. The difference between the realistic and expressionistic art forms (Figure 1.4) "[. . . ] can also be stated in terms of depicting 'what I see' (extrinsic) as opposed to depicting 'what I know' (intrinsic)." ([40], pg. 113).

Figure 1.4. Realism in Art. Two approaches to engage a viewer. (a) Photorealistic Painting: This painting, called Escaping Criticism (1874), by Pere Borrell del Caso, is an example of a trompe l'oeil, a work of art that is so realistic that it tricks the observer into believing that the depicted scene exists in reality. (b) Expressionistic Painting: This Portrait of Dr. Gachet (1890), painted by Vincent van Gogh shortly before his suicide, employs various stylistic elements like visible brush-strokes, contrasting colors, and symbolism (the foxglove was used for medical cures and thus attributes Gachet).— {Both images in public domain.}

1.2.2. Goals of Art and Vision

Focussing again on vision, Gregory believes that "[. . . ] perception involves going beyond the immediately given evidence of the senses: this evidence is assessed on many grounds and generally we make the best bet, and see things more or less correctly. But the senses do not give us a picture of the world directly; rather they provide evidence for the checking of hypothesis about what lies before us. Indeed, we may say that the perception of an object is an hypothesis, suggested and tested by sensory data" ([65], p. 13). The process of seeing is therefore not just a passive absorption of electromagnetic radiation, but an active, highly complex, and parallel search to gain knowledge from our visual surroundings.

Footnote 2: Given the complexity of the vision process, Hoffman refers to the mechanisms which allow for our effortless visual experience as Visual Intelligence [75]. Biological evolutionists have even offered that much of the human brain's cognitive and intellectual capabilities owe to the great computational demands of vision [117, 34].

It appears, then, that many of the goals of art and perception are similar: "[. . . ] the brain must discount much of the information reaching it, select only what is necessary in order to obtain knowledge about the visual world, and compare the selected information with its stored record of all that it has seen." ([189], pg. 78).
An "[. . . ] artist must also be selective and invest his work with attributes that are essential, discarding much that is superfluous. It follows that one of the functions of art is an extension of the major function of the visual brain." ([189], pg. 79). Given this goal agreement, it is not far-fetched to assume that artistic images (e.g. pictures, paintings) that are designed appropriately can greatly assist the brain in performing its difficult task.

1.2.3. Perceptual Art(ists)

Some authors go as far as claiming that many artistic styles are based upon the collective perceptual insight of generations of artists (e.g. [189, 135]). Zeki (himself a leading neurologist) writes, "artists are neurologists, studying the brain with techniques that are unique to them and reaching interesting but unspecified conclusions about the organization of the brain. Or, rather, that they are exploiting the characteristics of the parallel processing-perceptual systems of the brain to create their works, sometimes even restricting themselves largely or wholly to one system, as in kinetic art." ([189], pg. 80). Specifically, Zeki and Lamb found that various types of late kinetic art are ideal stimuli for the motion-sensitive cells in area V5 of the visual cortex [185]. In another experiment, Zeki and Marini [186] showed that fauvist paintings, which often divorce shapes from their naturally assumed colors, excite quite distinct neurological pathways from representational art where objects appear in normal color. Gooch et al. demonstrated that caricatured line drawings of unknown faces are learned up to two times faster than the corresponding photographs [62]. Ryan and Schwartz reported similar findings for drawings and cartoons of objects [141].

Zeki refers to Art that is designed to specifically stimulate particular types of cortical cells (intentionally or not) as art of the receptive field. "The receptive field is one of the most important concepts to emerge from sensory physiology in the past fifty years. It refers to the part of the body (in the case of the visual system, the part of the retina or its projection into the visual field) that, when stimulated, results in a reaction from the cell, specifically, an increase or decrease in its resting electrical discharge rate. To be able to activate a cell in the visual brain, one must not only stimulate in the correct place (i.e., stimulate the receptive field) but also stimulate the receptive field with the correct visual stimulus, because cells in the visual brain are remarkably fussy about the kind of visual stimulus to which they will respond. The art of the receptive field may thus be defined as that art whose characteristic components resemble the characteristics of the receptive fields of cells in the visual brain and which can therefore be used to activate such cells." ([189], pp. 88).

1.2.4. Benefits of combining NPR and Perception

One principled method of unlocking the perceptual potential of art for the purpose of creating task-oriented computer-generated imagery, then, is to study and leverage the different visual areas of the brain, or more precisely, the cells comprising these areas and the stimuli to which these cells are responsive.
The benefit of designing imagery based on perceptual principles instead of physical/optical laws is that we can focus on creating and supporting the visual stimuli pertinent for a given perceptual task and eliminate unnecessary detail. The reverse approach is similarly advantageous (compared to fully realistic imagery): by generating non-realistic images that purposefully trigger only certain visual areas, we can study how the generated visual stimuli influence task-specific perception in isolation. These are the two approaches exemplified by the NPR rendering frameworks and perceptual studies in this dissertation.

Footnote 3: This is not to say that artistic development is unprincipled, but rather that less quantifiable factors, like experience, intuition, and aesthetic sense, play a more marked role than is commonly regarded as scientific (of course, it is often exactly these qualities that lead to the most exciting and groundbreaking scientific discoveries).

1.3. Contributions

This dissertation presents two frameworks and accompanying studies that demonstrate the important link between non-realistic graphics and perception research. Each framework uses fundamental concepts of one research area to inform the other.

1.3.1. Perception informing Graphics

Chapter 3 presents a real-time NPR image processing framework to convert images or video into abstracted representations of the input data. The framework is designed to operate on general natural scenes and produces abstractions that can improve the communication content of the resulting imagery. Specifically, participants in two user studies are able to recognize/identify objects and faces more quickly than in the source photographs. The framework achieves meaningful abstraction by implementing a simple model of low-level human vision. This model estimates regional perceptual importance within images and removes superfluous detail (simplification) while at the same time supporting perception of important regions by increasing local contrast (enhancement), thus catering specifically to edge-sensitive cortical cells. Compared to other automatic abstraction systems, the framework presented here offers superior temporal coherence, does not rely on an explicit image structure representation, and can be efficiently implemented on modern parallel graphics hardware.

1.3.2. Graphics informing Perception

The human visual system derives shape from a multitude of shape cues. Chapter 4 presents a novel experiment to study shape perception of dynamically moving objects. The experimental framework generates NPR display conditions that specifically target individual shape perception mechanisms. By comparing user performance for a highly dynamic, interactive task under each of the display conditions, the experiment establishes a relative effectiveness ordering for the given shape cues. Data collected during experimentation indicates that shape perception in a severely time-constrained condition may behave differently from static shape perception and that a shape cue prioritization may occur in the former condition. The sensitivity of the experimental design and its flexibility enable a large number of future investigations into the effects of isolated shape cues and their parameterizations. Such research, in turn, should help in the design of better graphics and visualization systems.
1.3.3. Evaluation for NPR systems

Several reasons exist why psychophysical evaluation and validation experiments are not performed more commonly for NPR systems designed to increase the communication potential or expressiveness of images. Experiments are difficult to devise, time-consuming to perform, and require careful analysis. These issues could be somewhat mitigated by establishing a corpus of experiments for NPR validation, along with a database of test imagery. This dissertation contributes to such a corpus by defining clear perceptual goals for the stylization frameworks presented, along with psychophysical experiments to test the effectiveness of achieving these goals. It is my hope that the presented frameworks will provide a foundation for future NPR work, and similarly, that the given validatory experiments will be used to evaluate and compare future NPR systems.

CHAPTER 2
General Related Work

Various existing works have used mathematical and perceptual models and metrics to guide approximation algorithms, to control data compression, and to determine data similarity. Although many metrics designed for photorealistic imagery do not directly apply to NPR imagery, they are nonetheless illustrative of the different approaches to compression, comparison, and analysis that realistic and non-realistic imagery require. I therefore discuss related photorealistic works in this chapter and defer the discussion of non-photorealistic works to the individual frameworks in Chapter 3 and Chapter 4.

Most research into perception for photorealistic graphics centers around perceptual models and metrics. In the context of this dissertation, a perceptual model is an algorithm that simulates a particular aspect of human visual perception (for example saliency or contrast sensitivity), whereas a perceptual metric may use a given model to quantify the perceived differences between two stimuli or the probability that artifacts (e.g. as a result of compression) in a stimulus may be detected.

Footnote 1: It is interesting to note that many photorealistic applications employ perceptual metrics to degrade imagery up to the point where such degradation becomes perceptible or even objectionable. Their ultimate goal therefore shifts from physical realism to perceived realism; a goal much more in line with other perceptually-guided but intentionally non-realistic graphics.

2.1. Simple Error Metrics (Non-Perceptual)

A number of commonly used metrics, particularly in the compression and signal processing communities, are mathematical in nature and not derived from perceptual models. Among these are the relative error (RE), the mean-squared error (MSE), and the peak signal-to-noise ratio (PSNR). Given two grayscale images, A and B, with J pixels in the horizontal direction (width) and I pixels in the vertical direction (height), the measures are defined as:

(2.1)  RE(A, B) = \frac{\sum_{i=0}^{I-1} \sum_{j=0}^{J-1} (A_{i,j} - B_{i,j})^2}{\sum_{i=0}^{I-1} \sum_{j=0}^{J-1} (A_{i,j})^2}

(2.2)  MSE(A, B) = \frac{\sum_{i=0}^{I-1} \sum_{j=0}^{J-1} (A_{i,j} - B_{i,j})^2}{I \cdot J}

(2.3)  PSNR(A, B, m) = 10 \cdot \log_{10}\left(\frac{m^2}{MSE(A, B)}\right)

While Equation 2.2 yields an absolute value depending on the range of A and B, Equation 2.1 and Equation 2.3 give a relative error value. In the case of PSNR, the result is based on a maximum possible value, m, for each pixel, and expressed in decibels (dB).

Footnote 2: This discussion uses scalar-valued images for simplicity, but applies equally to color images.
Footnote 3: For a common grayscale image, m = 2^8 − 1 = 255.
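For reference, Equations 2.1-2.3 translate directly into a few lines of NumPy. The sketch below is my illustration (not code from the dissertation) and assumes 8-bit grayscale inputs:

```python
import numpy as np

def relative_error(a, b):
    """Relative error (Eq. 2.1): squared difference normalized by the energy of A."""
    a, b = a.astype(np.float64), b.astype(np.float64)
    return np.sum((a - b) ** 2) / np.sum(a ** 2)

def mse(a, b):
    """Mean-squared error (Eq. 2.2): average squared difference over all I*J pixels."""
    a, b = a.astype(np.float64), b.astype(np.float64)
    return np.mean((a - b) ** 2)

def psnr(a, b, m=255):
    """Peak signal-to-noise ratio in dB (Eq. 2.3); m is the maximum pixel value."""
    return 10.0 * np.log10(m ** 2 / mse(a, b))

# Example: compare an image against a noisy copy of itself.
rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(480, 640), dtype=np.uint8)
noisy = np.clip(original + rng.normal(0, 10, original.shape), 0, 255).astype(np.uint8)
print(relative_error(original, noisy), mse(original, noisy), psnr(original, noisy))
```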
Figure 2.1 puts the use of these error metrics for quantifying image quality or image fidelity into perspective. I generated these images by creating an abstraction (see Chapter 3) of an original image, computing the error values between original and abstraction, and generating two other types of common image distortion (noise and blur) with similar error values (Table 2.1). Comparing the images in Figure 2.1 should make it clear that the perceived quality of each image and the perceived fidelity to the original image differ greatly between the types of distortions, despite the fact that their RE, MSE, and PSNR scores are nearly identical.

Figure 2.1. Simple Error Metrics. An original image and three variations with the same level of errors. Noise: The original image with salt-and-pepper noise added. Blur: The original image with a Gaussian filter applied. Abstract: The original image processed with the real-time abstraction framework discussed in Chapter 3.

Table 2.1. Comparison Metrics. This table lists numeric values for a number of error and comparison metrics applied to the images in Figure 2.1. Polarity symbolizes whether a low numeric value indicates a small error (↓) or a large error (↑).

Metric       | Noise       | Blur        | Abstract    | Polarity
RE           | 0.0576      | 0.0570      | 0.0578      | ↓
MSE          | 1.004 × 10³ | 0.991 × 10³ | 1.006 × 10³ | ↓
PSNR         | 18.106      | 18.169      | 18.113      | ↑
PNG (78.4%)  | 94.7%       | 31.3%       | 40.5%       | n.a.
PDIFF        | 1070        | 2917        | 3144        | ↓
HDR-VDP      | 42.53%      | 99.71%      | 54.72%      | ↓

Footnote 4 (PDIFF): Number of pixels perceived to be different from the original. Settings: gamma = 2.2, luminance = 100 lux, fov = 6°.
Footnote 5 (HDR-VDP): Percentage of pixels with p > 95% chance of being perceived as different. Same settings as PDIFF.

Another method of comparing images is to look at their information content (entropy). Considering that humans have to extract information from images in order to understand them, this seems like a sensible approach. Table 2.1, row 4, lists file-size ratios for the lossless PNG compression compared to an uncompressed image. The compression ratio of the original image is given in the Metric column. When examining the other columns, we can see that adding noise to the image increases the entropy of the original image, while blurring (averaging) reduces the entropy, as expected.

Footnote 6: ISO standard, ISO/IEC 15948:2003 (E).

The problem here is that my generic use of the word information (or entropy) does not determine how useful this information might be for visual communication purposes. The addition of random information (noise), uncorrelated with the content of the image, does not enhance the image. Conversely, I demonstrate in Chapter 3 that targeted removal of information (unlike the uniform blur in Figure 2.1) can actually help perceptual tasks based on image understanding. From Section 1.2.1, we know that much of visual perception is concerned with removing extrinsic information while distilling intrinsic information, so it is not information in itself that is important; rather, the type of information plays the deciding role. Simple metrics are not designed to make such distinctions.
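The PNG row of Table 2.1 can be reproduced in spirit with a small sketch (my illustration, not the tooling behind the table's numbers): losslessly compress an image and compare the compressed size with the raw 8-bit size. It assumes NumPy and Pillow are available.

```python
import io
import numpy as np
from PIL import Image

def png_ratio(gray):
    """Rough entropy proxy: PNG-compressed size relative to the uncompressed 8-bit size."""
    buf = io.BytesIO()
    Image.fromarray(gray, mode="L").save(buf, format="PNG")
    return len(buf.getvalue()) / gray.size  # uncompressed: 1 byte per pixel

rng = np.random.default_rng(1)
flat = np.full((256, 256), 128, dtype=np.uint8)            # low entropy: compresses well
noise = rng.integers(0, 256, (256, 256), dtype=np.uint8)   # high entropy: barely compresses
print(png_ratio(flat), png_ratio(noise))
```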
2.2. Saliency

Simple, mathematical metrics commonly fail for perceptual applications because, when it comes to the human visual system, not all pixels are created equal. The location, neighborhood, and semantic meaning of (a group of) pixels are generally more important than their exact color. Humans can only focus in a very narrow foveal region, so pixels in this region have more impact on the perceived image. Additionally, color discrimination in this region is fairly good, but motion detection is better outside the foveal region. Pixels can further be masked by texture or noise [46]. As a rule, some image regions are visually more important (have a higher saliency) than others.

Footnote 7: Besides the fact that humans do not operate on pixels per se, anyway.
Footnote 8: The fovea spans about 15° visual angle.
Footnote 9: For example, a green leaf on a red blanket is perceived very prominently, whereas the same leaf would probably not be noticed in a pile of other leaves. The pile of leaves thus masks the single leaf.

Given their narrow foveal extent, humans have to continually scan their visual field with head movements and quick saccadic eye movements. For visual efficiency and to preserve energy, these movements are mostly directed towards salient regions in the visual field. Saliency is therefore an important tool to model and predict perceptual attention.

Footnote 10: Santella and DeCarlo [36, 144] exploit this fact by using eye-tracking data to guide their NPR abstraction system.

Itti et al. [79, 78] computed explicit contrast measures for brightness, color opponency, and orientation (via Gabor filters) at multiple spatial scales. They then averaged the individual contrasts over all scales onto an arbitrary common scale. Finally, they normalized and averaged all contrasts to obtain a combined saliency map. From this, they predicted the sequence and durations of eye fixations using local maxima and a capacitance-based model of inhibition-of-return.

Footnote 11: Preventing visiting the same maxima in short succession.
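As a rough illustration of this multi-scale contrast idea (my sketch, not Itti et al.'s actual implementation), the code below approximates center-surround differences with Gaussian blurs at a few assumed scale pairs, normalizes each map, and averages them. It uses only the luminance channel and omits color opponency, orientation, and the fixation model.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def saliency_map(gray):
    """Crude saliency sketch: center-surround luminance contrast at several
    scale pairs, each map normalized to [0, 1], then averaged."""
    gray = gray.astype(np.float64)
    maps = []
    for center_sigma, surround_sigma in [(1, 4), (2, 8), (4, 16)]:  # assumed scales
        contrast = np.abs(gaussian_filter(gray, center_sigma) -
                          gaussian_filter(gray, surround_sigma))
        span = contrast.max() - contrast.min()
        maps.append((contrast - contrast.min()) / span if span > 0
                    else np.zeros_like(contrast))
    return np.mean(maps, axis=0)  # larger values suggest more salient regions
```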
Privitera and Stark [130] analyzed the effectiveness of 10 partly perceptually-inspired image processing operators (including Gabor, discrete wavelet transform, and Laplacian of Gaussian) to predict human eye fixations. They computed the 10 operators for a test image and clustered local maxima until they reached a predetermined number of clusters. By comparing the remaining clusters with actual fixation locations obtained from human subjects, they determined the reliability of each operator to predict fixation points. Privitera and Stark's approach was novel in that they did not assert a priori which image processing operator would model human attention accurately. Rather, they assembled a number of suitable operators and evaluated them empirically.

Because image distortions in non-salient regions commonly remain unnoticed, saliency forms a central component in many perceptual error metrics (Section 2.3) as well as optimization and compression algorithms (Section 2.4).

2.3. Visible Difference Predictors (VDPs)

To address the shortcomings of simple error metrics, researchers have designed several perceptually-based difference predictors that take into account a limited number of low-level human vision mechanisms, including saliency. As the name suggests, a VDP metric predicts whether a human could tell two images apart, or how different a human would judge two images to be. Daly's [33] VDP modeled three aspects of human vision: non-linear brightness perception, the contrast sensitivity function (CSF), and masking [46] due to texture and other noise. Mantiuk et al. [106] modified Daly's VDP for use with high-dynamic-range (HDR) imagery. Yee et al. [180] defined a predictive error map, ℵ, that considered intensity, color, orientation, and motion at different spatial scales to estimate visual saliency and to determine the perceived visual differences in salient regions.

The PDIFF and HDR-VDP entries in Table 2.1 list the pairwise difference scores between the original image and the distorted images in Figure 2.1 for Yee et al.'s [180] (PDIFF) and Mantiuk et al.'s [106] (HDR-VDP) metrics. For the comparisons, I chose environmental conditions similar to the user-studies in Section 3.4. The PDIFF scores for Noise and Blur appropriately indicate the aggressive distortion of the blur operation. Note, though, that the abstracted image, itself derived using a model of human perception, attains the worst score of all. Although HDR-VDP still prefers the Noise image to the Abstract image, the metric at least performs better at judging the excessive visual loss in the Blur image.

The problem lies not in the abstracted image, and not even necessarily in the VDPs, but in my use of the VDPs. The above VDPs predict perceivable differences between images; they do not predict the perceived likeness of images. Many forms of art are exceptionally good likenesses of a scene, despite the fact that their visual appearance is markedly different from the real world. For this reason, standard VDPs and other perceptual metrics devised for realistic scenes generally fare poorly on NPR imagery. To the best of my knowledge, no NPR image quality or fidelity metrics exist to date, and I believe this to be an excellent opportunity for future research. As a starting point, it might be interesting to leverage the null-operator qualities of some NPR systems to transform both images to be compared into the same domain and then compute a simple error score.

Footnote 12: For example, abstracting an already abstracted image in Chapter 3 changes almost nothing.
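This suggested starting point can be written down as a minimal sketch (my illustration, not an implementation from the dissertation): map both images into the same stylized domain with a placeholder abstraction operator, then score them there with a simple metric such as MSE.

```python
import numpy as np

def npr_distance(img_a, img_b, abstract):
    """Compare two images after transforming both with the same (roughly
    idempotent) NPR operator; `abstract` is a hypothetical placeholder."""
    a = abstract(img_a).astype(np.float64)
    b = abstract(img_b).astype(np.float64)
    return np.mean((a - b) ** 2)  # any simple error score could be used here
```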
2.4. Applications

While the above models and metrics can be used directly to compare images, for example in database searches, they are more commonly integrated into applications to control image distortion. The two application areas I focus on, lossy compression (Section 2.4.1) and adaptive rendering (Section 2.4.2), are both very active research areas in their own right. Because this section only addresses peripherally related work, and because this work is too vast to present comprehensively in this space, I limit my discussion to exemplary applications instead.

Two points are worth remembering throughout the following discussion. First, all of the listed works are perceptually motivated, yet only the smallest number of them perform user studies for perceptual validation. Second, even the most sophisticated perceptual models are mostly used only to hide artifacts and to degrade images without an objectionable loss in visual quality; they are generally not designed to make an image easier or quicker to understand. Although counter-examples exist, particularly for contrast reduction and tone-mapping work, these shortcomings prevent us from harnessing the full potential of perceptual models for graphical applications.

2.4.1. Lossy Compression

To obtain very high compression ratios, lossy compression methods sacrifice some information; that is, the signal recovered from a compressed stream is commonly not identical to the original signal. To ensure that this information loss remains below a perceivable threshold, or at least does not become objectionable, perceptual models and metrics can guide the compression process. In the past, many researchers have developed lossy compression methods for a number of different signal types and signal dimensions. The most common types are images, video, and geometric meshes, while the most common dimensions are spatial (domain), temporal, and dynamic range.

Images. Reid et al. [139] gave an overview of so-called second-generation (2G) coding techniques, i.e. lossy image compression systems that incorporate a simple HVS model. They concluded that most existing 2G systems outperform first-generation systems, that the 2G systems are of similar complexity, and that an objective quality comparison is impossible until a quantitative quality metric is adopted. In similar work, Kambhatla et al. [87] compared several image compression schemes, including mixture of principal components (MPC), wavelets, and Karhunen-Loève transform (KLT; also known as principal component analysis, PCA). They found that while PSNR for wavelet transform and KLT are higher than MPC, the MPC method produced fewer subjective errors as judged by (a) radiologist(s) analyzing brain magnetic resonance images (MRI).

Footnote 13: The authors gave no details on their subjective evaluation.

Video. Bordes and Philippe [15] proposed perceptual enhancements to the MPEG-2 compression standard. They developed a quality map based on a pyramid decomposition of spatial frequencies together with a multi-resolution motion representation. This quality map was then used in a pre-process to remove non-visible information to limit the amount of data to be encoded. The second use of the quality map was to locally adapt the encoding quantization for constant bitrate encoding.

Footnote 14: This codec, most commonly used for high-quality DVD video-encoding, is part of the larger MPEG compression and coding family. More information is available at http://www.mpeg.org.
Footnote 15: The authors did not define this term clearly. I assume they referred to quality-loss below a certain threshold.

Meshes. Williams et al. [167] developed a view-dependent mesh simplification algorithm sensitive to an object's silhouette, its texture, as well as the dynamic scene illumination. The authors weighed the cost-benefit trade-off between these factors in terms of distortion effects and rendering costs, and allocated run-time resources accordingly. Watson et al. [163] applied two different mesh simplification schemes, VClust and QSlim, to 36 polygonal models of animals and manmade artifacts. They then compared results of a series of user-studies, including naming times, ratings, and preferences, to the results of numerous automatic measures computed in object and image space. They found that ratings and preferences were predicted adequately with automatic measures, while naming times were not. The authors also found significant effects between the two object types, indicating that mesh simplification systems may need to consider a broader range of information than mere geometry and connectivity.

Dynamic Range. To address the problem of displaying high dynamic range images on low dynamic range displays, Tumblin et al. [157] developed two contrast reduction methods. The first method, practical only for synthetic images, computed separate image channels for lighting and surface information. By compressing only the lighting channels, the authors were able to reduce the overall contrast of images while preserving much of the surface information. The second, generally applicable method allowed users to manually specify foveal fixation locations. The algorithm then adjusted global contrast based on foveal contrast adaptation while attempting to preserve local contrast in the fixation regions.
Tumblin and Turk [158] took inspiration from artists’ approach to high dynamic range reproduction in developing their low curvature image simplifier (LCIS). They argued that skilled artists preserve details by drawing scene contents in coarse-to-fine order using a hierarchy of scene boundaries and shadings. The LCIS operator, a partial differential equation inspired by anisotropic diffusion, was designed to dissect a scene into smooth regions bounded by sharp gradient discontinuities. A single parameter, K, chosen for each LCIS, controlled region size and boundary complexity. Using a hierarchy of LCISs the authors could compress the dynamic range of large contrast features and then add detail from small features back into the final image. In addition to its value as a tone reproduction operator, this work is relevant to my research due to its similar approach (albeit for different reasons) to feature analysis and simplification via anisotropic-like diffusion (Section 3.3.2). 39 Temporal Dynamic Range. Mantiuk et al. [107] extended the MPEG-4 video com- pression standard to deal with high dynamic range video. They described a luminance quantization method optimized for contrast threshold perception of the HVS. Additionally, the proposed quantization offered perceptually-optimized luminance sampling to implement global tone mapping operators via simple and efficient look-up tables. Pattanaik et al. [122] proposed a new operator to account for transient scene intensity adjustments of the HVS in animation or interactive real-time simulations. Their operator simulated the dramatic compression of visual responses, and the gradual recovery of normal vision, caused by large contrast fluctuations, for example when quickly entering or leaving a dark tunnel on a bright sunny day. 2.4.2. Adaptive Rendering Realistic image synthesis is computationally extremely expensive due to the complexity of interactions between light and matter that need to be modeled to achieve a convincing level of optical/physical realism. This problem can be mitigated by lowering the goal from optical realism to perceived realism, instead. Using perceptual models and metrics, applications can allocate rendering resources to salient image regions, while reducing computational accuracy and resolution in less salient regions. Error Sources. Arvo et al. [2] defined three main causes of error in global illumina- tion algorithms: (1) Perturbed Boundary Data - errors in the input data due to limitations of measurement or modeling; (2) Discretization Errors - introduced when analytical functions are replaced by finite-dimensional linear systems for actual computations; and (3) Computational 40 Errors - due to limited arithmetic precision. Any or all of these errors can result in the visual degradation of synthetic images and objectionable artifacts, such as faceting on tesselated curved surfaces, banding (and even exaggerated Mach-banding effects) due to quantization, aliasing as a result of insufficient sampling, and noise as a residual effect of stochastic models used in random sample placement. Static Scenes. Ferweda et al. [46] made use of the common observation that some of these artifacts can be masked (hidden) when they appear co-located with visual texture. The authors developed a computational model of visual masking that predicted how the presence of one visual pattern affected the detection of another. 
Using their system, the authors could select and devise texture patterns to use in synthetic image generation that would hide artifacts due to the above error-types. Bolin and Meyer [13] presented a perceptually inspired approach to optimize sampling distributions for image synthesis. They computed a wavelet representation of the currently rendered scene and used a custom image quality model in combination with statistical information about the spatial frequency distribution of natural images to determine locations where additional samples needed to be taken. Their approach was able to predict masking effects and could be used to attain equivalent visual quality from different rendering techniques by controlling sample placement. In similar work, Ramasubramanian et al. [136] devised a physical error metric that accounted for the HVS’s loss of sensitivity at high background illumination levels, high spatial frequencies, and high contrast levels (visual masking). To reduce the cost of their metric for adaptive rendering, the authors separated luminance-dependent processing from the expensive spatially-dependent component, which could be pre-computed one off. 41 Recently, Cater at al. [23] performed user studies to demonstrate that different visual tasks have an effect on eye-tracking of images (effectively changing saliency in an image). They therefore extended previous HVS-based systems by additionally considering a so-called taskmap. The task map encoded information about objects’ locations and their purpose for a given task, and was generally specified manually. The authors modified the Radiance rendering engine [162] to synthesize images, optimized for a given task. Unlike Santella and DeCarlo [144] they did not perform further user studies to prove that their optimized images retained the same fixation locations as the unoptimized images. Dynamic Scenes. In addition to a saliency map and spatial frequency estimation, the perceptual model of Yee et al. [180] included an estimate of retinal velocity. Because detail resolution in high velocity regions is limited, the authors could speed up global illumination solutions by up to an order of magnitude. Myszkowski [116], developed an extension to Daly’s [33] VDP, called Animation Quality Metric (AQM) to facilitate high-quality walk-throughs of static environments and to speed up global illumination computations of dynamic environments. 42 CHAPTER 3 Real-Time Video Abstraction Figure 3.1. Abstraction Example. Abstractions like the one shown here can be more effective in visual communication tasks than photographs. Original: Snapshot of two business students on an overcast day. Abstracted: After several bilateral filtering passes and with DoG-edges overlayed. Quantized: Luminance channel soft-quantized to 8 bins. Note how folds in the clothing and shadows on ground are emphasized. In this chapter, I present an automatic, real-time video and image processing framework with the goal of improving the effectiveness of imagery for visual communication tasks (Figure 3.1). This goal is naturally broken down into two tasks: (1) Modifying imagery based on visual perception principles (Sections 3.2-3.3); and (2) proving that such modifications can lead to 43 improved performance in visual communication (Section 3.4). Additionally, I show how the various processing steps in my framework can be utilized for artistic stylization purposes. 
The framework operates by modifying the contrast of perceptually important features, namely luminance and color opponency. It reduces contrast in low-contrast regions using an approximation to anisotropic diffusion, and artificially increases contrast in higher contrast regions with difference-of-Gaussian edges. The abstraction step is extensible and allows for artistic or data-driven control. Abstracted images can optionally be stylized using soft color quantization to create cartoon-like effects. Technical Contributions. Unlike most previous video stylization systems, my framework is purely image-based and refrains from deriving an explicit image representation1. That is, instead of computing a structural description of the image content and then subsequently stylizing or otherwise modifying this description, my framework directly manipulates perceptual features of an image, in image space. While this may seem to limit stylization capabilities at first sight, I devise several soft quantization functions that offer important benefits for abstraction, performance, and stylization: (1) a significant improvement in temporal coherence without requiring user-correction; (2) a highly parallel framework design, allowing for a GPU-based, real-time implementation; and (3) parameters for the quantization functions which allow for a different, but rich set of stylization options, not easily available to previous systems. Theoretical Contributions. I demonstrate the effectiveness of the abstraction framework with two user-studies and find that participants are faster at naming abstracted faces of known persons compared to photographs. Traditionally, faces are considered very difficult to abstract 1 While implementation details may vary, an explicit image representation generally describes an image in terms of vector or curve-based bounded areas. See Section 3.1.1, pg. 47 and Figure 3.2 for details. 44 and stylize. Participants are also better at remembering abstracted images of arbitrary scenes in a memory task. The user studies employ small images to emulate portable display technology. I believe that small imagery will play an increasingly important role in the immediate future, with the onset in ubiquity of mobile display-enabled devices like mobile phones, digital cameras, personal digital assistants, game consoles, and multimedia players. To keep these devices portable, their display size is necessarily limited and the given screen space has to be used effectively. A framework that offers increased recognition of image features for visual communication purposes while reducing the complexity of images and thus aiding compression is therefore a valuable asset. My framework is one of only a few existing automatic abstraction systems built upon perceptual principles, and the only one to date that achieves real-time performance. 3.1. Related Work A number of issues are important for most stylization and abstraction systems and can be used to differentiate my work from previous systems. These are defined in the following section and later used to discuss previous systems. 3.1.1. Definitions Automatic vs. User-driven. As discussed in Section 2.3 various computational models of low-level human perception have been proposed. These automatically approximate a limited set of visual perceptual phenomena. No computational (or even theoretical) model exists todate that satisfactorily predicts or synthesizes anything but the most basic visual features. 
Most models break down when attempting to analyze global effects requiring semantic information 45 or integration over the entire visual field, and effects based on binocular vision. These limitations are partly due to the fact that not much is known about how humans achieve such global analysis [188]. Consequently, any system relying on semantic information or intended to create art requires human interaction. Other systems, particularly those intended to aid (and not replace) humans in a particular visual task, can well benefit from automation. Ideally, a system should offer a best-effort automatic solution along with an overriding or extension mechanism to improve upon the results. This is the approach I have taken in my automatic video abstraction framework. Real-time vs. Off-line. By definition, the amount of computation that can be performed by a real-time system is limited by the intended frame-rate. Because my framework is designed to support visual communication, real-time performance is paramount to support interactive applications like video-telephony or video-conferencing. Other applications, like visual database searches or summaries can be created off-line and then accessed asynchronously. My framework design leverages parallelism of the underlying image processing operations wherever possible, enabling real-time performance on modern GPU processors. Temporal Coherence. Temporal coherence is a desirable property of any animation and video system because unintentional incoherence draws perceptional attention and is therefore distracting. A system exhibits temporal coherence if small input changes lead to small output changes and is not given for most stylization systems using discrete conditionals and hard quantization functions. Additional problems arise if scene objects need to be identified and tracked through computer vision algorithms, as those algorithms are often brittle (see Explicit 46 Figure 3.2. Explicit Image Structure. Two pairs of images showing explicit image structure (Left image of pair shows color coded segments. Right image of pair shows colors derived from original image). Coarse Segmentation: The level of detail is manually chosen to segment the image into semantically meaningful segments. Some detail, like the face, is too fine to be resolved at this level. Fine Segmentation: The level of detail is chosen so that the face is resolved, but this leads to over-segmentation in the remaining image. A common approach to this problem is to over-segment an image and then use a heuristical method to merge adjacent segments, but such heuristics are commonly non-robust and temporally incoherent, requiring user correction. image structure, below). My framework offers temporal coherence by two different mechanisms: (1) reducing noise in the input images with non-linear diffusion; and (2) soft pseudoquantization functions that are all continuous or semi-continuous2 (and adaptive where applicable). 2 Formally, a function, f , defined on some topological space X, f : X 7→ R, is upper semi-continuous at x0 , if lim sup f (x) ≤ f (x0 ), and lower semi-continuous, if lim inf f (x) ≥ f (x0 ). x→x0 x→x0 For my soft quantization functions, it is also true that the ranges of the continuous intervals are much greater than the ranges of the discontinuities. 47 Explicit Image Structure and Stylization. An explicit image structure is the logical rep- resentation of image elements, such as objects, and their relative positioning (Figure 3.2). 
Image structure is commonly represented with a (possibly multi-resolution) hierarchy of contourbound areas, expressed as polylines or parametric curves. There exist several advantages of such explicit representations. They can be arbitrarily scaled, they can be recombined in different ways, and most importantly for stylization systems, their geometric descriptions can be parameterized and then simplified or stylized freely. Several disadvantages counterbalance these benefits. Correctly identifying and extracting image structure from raw images is a difficult and costly vision problem, often requiring user-correction and preventing real-time performance (see Automatic vs. User-driven and Real-time vs. Off-line, above). A related problem is that of tracking image structure between successive frames, particularly for noisy input, non-trivial camera movements, and occlusions. My framework stays clear of these vision problems to become fully automatic as well as real-time at the cost of a more limited range of stylistic options. I offset this limitation by providing a rich set of user-parameters to the quantization functions of the framework. In addition to the points mentioned above, the discussion in Section 1.1.3 on the merits of psychophysical validation applies directly to related works as well. Having defined the most important design factors for work directly related to mine, I can now continue to discuss previous systems in terms of these factors. 3.1.2. Previous Systems Among the earliest work on image-based NPR was that of Saito and Takahashi [142] who performed image processing operations on data buffers derived from geometric properties of 48 3-D scenes. These buffers contained highly accurate values for scene normals, curvature, depth discontinuities and other measure that are difficult to derive from natural images without knowledge of the underlying scene geometry. Unlike my own framework, their approach was mainly limited to visualizing synthetic scenes with known geometry. To reliably derive limited image structure from their source data, Raskar et al. [138] computed ordinal depth from pictures taken with purpose-built multi-flash hardware. This allowed them to separate texture edges from depth edges and perform effective texture removal and other stylization effects. My own framework cannot derive ordinal depth information or deal well with general repeated texture but also requires no specialized hardware and therefore does not face the technical challenges of multi-flash for video. Several video stylization systems have been proposed, mainly to help artists with laborintensive procedures [161, 26]. Such systems computed explicit image structure by extending the mean-shift-based stylization approach of DeCarlo and Santella [36] to computationally expensive3 three-dimensional segmentation surfaces. Difficulties with contour tracking required substantial user intervention to correct errors in the segmentation results, particularly in the presence of occlusions and camera movement. My framework does not derive an explicit representation of image structure but offers a different mechanism for stylization, which is much faster to compute, fully automatic, and temporally coherent. Contemporaneous work by Fischer et al. [49] explored the use of automatic stylization techniques in augmented reality applications. To visually merge virtual objects with a live video stream, they applied stylization effects to both virtual and real inputs. 
Although parts of their 3 Wang et al.’s [161] system took over 12 hours to segment 300 frames (10 seconds of video) and users had to correct errors in approximately every third frame. 49 system are similar to the framework presented here, their approach is style-driven instead of perceptually motivated, leading to different implementation approaches. As a result, their system is limited in the amount of detail it can resolve, their stylization edges require a post-processing step for thickening, and their edges tend to suffer from temporal noise4. Recently, some authors of NPR systems have defined task-dependent objectives for their stylized imagery and tested these with perceptual user studies. DeCarlo and Santella [36] used eye-tracking data to guide image simplification in a multi-scale system. In follow-up work, Santella and DeCarlo [144] found that their eye-tracking-driven simplifications guided viewers to regions determined to be important. They also considered the use of computational saliency as an alternative to measured saliency. My own work does not rely on eye-tracking data, although such data can be used. My implicit visual saliency model is less elaborate than the explicit model of Santella and DeCarlo’s later work, but can be computed in real-time and can be extended for a more sophisticated off-line version. Their explicit image structure representation allowed for more aggressive stylization, but included no provisions for the temporal coherence featured in my framework. Gooch et al. [62] automatically created monochromatic human facial illustrations from Difference-of-Gaussian (DoG) edges and a simple model of brightness perception. Using an extended soft-quantization version of a DoG edge detector, my framework can create similar illustrations in a single pass and additionally address color, real-time performance and temporal coherence. My face recognition study follows closely the protocol set forth by Stevenage [149] and consequently used by Gooch et al. [62]. 4 It should be noted that while these drawbacks are generally not desirable for a video stylization system, they helped to effectively hide the boundaries between real and virtual objects in Fischer et al.’s system. 50 Work by Tumblin and Turk [158], traditionally associated with the tone-mapping literature, is worth mentioning for its use of related techniques and the fact that the authors took inspiration from artistic painterly techniques5. In order to map high-dynamic range (HDR) images into a range displayable on standard display devices, Tumblin and Turk decomposed an HDR image into a hierarchy of large and fine features (as defined by a conductance threshold function, related to local contrast). Hierarchical levels with a large dynamic range were then compressed before combination with smaller features, effectively compressing the range of the entire image without sacrificing small detail. The low curvature image simplifiers (LCIS) used at each hierarchy level are closely related to the approximate anisotropic diffusion operation I use for simplification, but are based on higher order derivatives. Despite this similarity, Tumblin and Turk’s goals were different in that they would not modify low dynamic range images, whereas I am interested in simplifying and abstracting these. 3.2. Human Visual System Visual processing of information in humans involves a large part of the brain and processing operations too vast and complex to be currently fully understood, let alone be modeled. 
Given the design considerations defined in Section 3.1.1 (automation, real-time performance, temporal coherence) I limit the framework to modeling a small part of visual processing and base my design on the following assumptions: (1) The human visual system operates on various features of a scene. (2) Changes in these features (contrasts) are of perceptual importance and therefore visually interesting (salient). 5 Artists are commonly faced with the difficulty of capturing high dynamic range, real-world scenes on a canvas of limited dynamic range 51 (3) Polarizing contrast (decreasing low contrast while increasing high contrast) is a basic but useful method for automatic image abstraction. 3.2.1. Features Although the human visual experience is generally holistic, several distinct visual features are believed to play a vital role in low level human vision, among these are luminance, color opponency, orientation, and motion [121]. Evidence for such features derives from several sources. Within the visual cortex, several structurally different and variedly connected sub-regions have been identified, whose comprising cells are selectively sensitive to very distinct visual stimuli (e.g. Area V3: Orientation. Area V4: Color. Area V5: Global Motion) [188]. In addition, cerebral lesions and other pathological conditions can lead to cases where the holistic visual experience is selectively impaired (e.g. color blindness types: protanopia, deuteranopia, tritanopia, monochromasy, and cerebral achromatopsia; form deficiency: visual agnosia; motion blindness: akinetopsia) [188]. Similar evidence can be gleaned from blind people who regain sight. Their visual system is generally heavily underdeveloped and (depending on age) may never fully recover, but they can almost immediately perceive lines, edges, brightness, and color6 [65]. Based on this evidence, I consider luminance, color, and edges (which really are a secondary feature) in my real-time framework. The framework uses the perceptually uniform CIELab [179] color space to encode the luminance (L) and color opponency (a and b) features of input images and performs all abstraction operations in this feature space. The perceptual 6 The problem often does not lie in perceiving the individual features of the visual world, but their meaningful integration and interpretation. 52 uniformity of CIELab guarantees that small distances measured in this space correspond to perceptually just noticable differences (see Contrast, below). The framework design further allows for inclusion of additional features for off-line processing or when such features can be viably computed in real-time on future hardware (Section 3.5.4). 3.2.2. Contrast Constant features are generally not a prime source of biologically vital information (e.g. a featureless blue sky; a tree with uniformly green leaves; a stationary bush). Changes in features (feature contrasts) are often much more important (e.g. the silhouette of a hawk hovering above; the color-contrast of a red apple on a green tree; the motion of a tiger moving in the bushes). For this reason, humans are notoriously inept at estimating absolute stimuli and much more proficient at distinguishing even small differences between two similar stimuli [105]. People can name and describe only a handful of colors, yet they can differentiate hundreds of thousands of colors. People have difficulty estimating speed when moving, yet they are extremely sensitive to acceleration and deceleration7. 
Only very few people can tell the frequency of a pure sinusoidal sound wave, yet most people can distinguish two different notes. In technical terms, the resolution of absolute measures of features can be orders of magnitude less than the differential resolution of so-called just-noticeable-differences 8 (JND) [105, 45]. Because changes play such an important role in perception, much of my framework is based on contrasts (see below). 7 For example, without visual feedback, one cannot tell if an elevator is moving or stationary, only if it is starting or stopping. 8 Is is therefore not surprising to find differential measures becoming increasingly prominent in computer graphics research [123, 156, 59]. 53 3.2.3. Saliency To remove extraneous detail from imagery while emphasizing important detail requires a measure of visual importance. Itti et al. [79, 78] recognized the biological importance of high feature contrasts in their saliency model, introduced in Section 2.2. Because their explicit model is computationally rather expensive and thus too complex for real-time applications, I employ a simpler, implicit9 saliency model for my automatic, real-time implementation. Within my framework, the following restrictions apply: (1) It considers just two feature contrasts: luminance, and color opponency. (2) It does not model effects requiring global integration. (3) It processes images only within a small range of spatial scales (Section 3.2.5). Since the framework (Figure 3.4) optionally allows for externally-guided abstraction via usermaps (Equation 3.2 and Figure 3.9), a more complex saliency map, like that of Itti et al., can be supplied at the cost of sacrificing real-time performance. 3.2.4. Contrast Polarization Exaggerating feature contrasts can aide in visual perception. For example, super-portraits and caricatures have been shown to help recognition of faces10 [17, 149, 62] and can be considered a special case of the more general peak-shift effect [66]. My approach for image simplification and abstraction is therefore to simply polarize the existing contrast characteristics of an image: to diminish feature contrast in low contrast regions, 9 Here, implicit means that contrast is both the measure that defines saliency and the operand that is modified via saliency. 10 Here, feature refers to facial features (like big nose, tight lips); and contrast refers to feature differentials compared to an ideal norm-face. 54 Figure 3.3. Scale-Space Abstraction. Left: Image of a man used as the base level in a scale-space representation. Left to right: Difference-of-Gaussian (DoG) feature edges computed at increasingly coarser scales. As the kernel size for the DoG filters increases (about an order of magnitude from left to right), the visual depiction changes from a concrete instance of a man in shirt and trousers, to a generic and abstract standing figure. while increasing feature contrast in high contrast regions in order to yield abstractions that are easier and faster to understand. 3.2.5. Scale-Space Real-world entities are comprised of structural elements at different scales (Figure 3.3). A forest can span dozens of kilometers, each tree can be dozens of meters high, branches are several meters long, leaves are best measured in centimeters, while the leaves’ cells extend only fractions of millimeters. It makes as little sense to describe a forest in terms of millimeters as it does to describe a leaf in terms of kilometers. 
The fact that scale is such an important aspect when discussing structure has led to the development of several scale-space theories [175, 93, 102]. In terms of the human visual system, Witkin’s continuous (linear) scale-space theory is compatible with results by De Valois and De Valois [159], showing that receptive fields (Figure 3.11) of cortical cells include a fairly dense representation of sizes in the spatial frequency domain [121]. 55 Figure 3.4. Framework Overview. Each step lists the function performed, along with user parameters. The right-most paired images show alternative results, depending on whether luminance quantization is enabled (right) or not (left). The top image pair shows the final output after the optional image-based warping step.— {Cameron Diaz with permission of Cameron Diaz.} My framework supports structural scale with various framework parameters (σd , σe ), which can be used to extract and smooth features at a given scale (Figures 3.3 and 3.13). Particularly, a single spatial scale can be defined for edge-detection, and the non-linear diffusion process (Section 3.3.2) inherently operates at multiple scales due to its iterative nature11. 3.3. Implementation The basic workflow of my framework is shown in Figure 3.4. The framework first polarizes the given contrast in an image using nonlinear diffusion (Section 3.3.2). It then adds highlighting edges to increase local contrast (Section 3.3.3), and it optionally stylizes (Section 3.3.5) and sharpens (Section 3.3.4) the resulting images. 11 It should be noted that only the base scale of the non-linear diffusion is well-defined. Additional scales are spatially varying due to the non-linearity. As such, the multi-scale operations are less powerful than those based on Itti et al.’s explicit representation. 56 3.3.1. Notation This work combines results from such diverse disciplines as Psychology, Physics, Computer Vision, and Computer Graphics, each of which tend to have their unique formalisms and notation. In my own work, I try to use recognizable formulations of existing results and favor readability over mathematical rigor. Specifically, I mix notation from continuous and discrete domains and I do not discuss issues arising due to boundary conditions, as these issues are not specific to my work. Additionally, numerical accuracy is not a deciding factor in my framework because there exists no ground-truth to judge against and because the filters I employ are stable for the parameter ranges given, unless explicitly stated otherwise. 3.3.2. Extended Nonlinear Diffusion Linear vs. Non-linear diffusion. Linear filters, like the well-known Gaussian blur, are an effective method for decreasing the contrast of an image (Figure 3.5). In the frequency domain, the Gaussian blur acts as a low-pass filter, meaning that high-frequency components are subdued or even eliminated. As a result, edges become softer and contrast decreases. Unfortunately, this particularly applies to sharp edges, which contain a broad spectrum of frequency components. As it is my goal not only to lower low contrast, but also to preserve or even increase high contrast, edge blurring poses a problem. To explain the relevant technical terms in their historical context, it is useful to introduce an alternative description of the Gaussian blur. Several linear filters, the Gaussian blur among them, can be interpreted as solutions to the heat equation [43]. 
To gain an intuitive understanding one can imagine a room filled with a gas of spatially varying temperature. Because the gas is free to move around, it will attempt to reach an equilibrium state of constant temperature everywhere.

Figure 3.5. Linear vs. Non-linear Filtering. Scanlines (top row) of luminance values for the horizontal lines marked in green in the bottom row images. Significant luminance discontinuities are marked with vertical lines. Original: The original scanline contains several large and sharp discontinuities, corresponding to semantically meaningful regions in the source image that I would like to preserve (wall, guard outline right, right leg fold, guard outline left). The scanline also contains a large amount of small, high-frequency components on top of the base signal. These smaller components generally constitute texture or noise, which I would like to subdue. Linear Filter: Linear filtering (here, Gaussian blur) successfully subdues high-frequency components, thus simplifying the scanline. Since a linear filter operates isotropically and homogeneously, it also suppresses the high-frequency components of the sharp discontinuities, thus smoothing these undesirably. Non-Linear Filter: The anisotropic and inhomogeneous action of the non-linear filter smooths away high frequencies in low contrast regions, while preserving most frequencies in high contrast regions. Compare the shape of all scanlines, especially at the discontinuities marked with vertical lines.

If there exists no spatial bias in the way the gas can move (apart from boundary conditions), the system is said to have a constant diffusion conduction function and the gas diffuses isotropically in all directions. In that case a linear diffusion function, like the Gaussian blur, can be used to model the diffusion process12.

12 To bring this example back to the image domain, imagine an arbitrary image to which one applies a very small Gaussian blur. As a result, neighboring colors mix. Repeating this process ad infinitum mixes all colors into a single color that is the average of the initial colors.

Perona and Malik [125] defined a class of filters with spatially varying diffusion conduction functions, resulting in anisotropic diffusion. These filters have the property of blurring small discontinuities and sharpening edges13 (Figure 3.5). Using such a filter with a conduction function based on feature contrast, the contrast can be effectively amplified or subdued. Unfortunately, Perona and Malik’s neural-net-like implementation is not efficient enough for a real-time system on standard graphics hardware, due to its very small (see Footnote 12) spatial support14.

13 A filter that increases the steepness of edges towards a true step-like discontinuity is sometimes called shock-forming.

Barash and Comaniciu [4] demonstrated that anisotropic diffusion solvers can be extended to larger spatial neighborhoods, thus producing a broader class of extended nonlinear diffusion filters. This class includes iterated bilateral filters as one special case, which I prefer due to their larger support size and the fact that they can be approximated quickly and with few visual artifacts using a separated kernel [126].

Extended Nonlinear Diffusion. Given an input image f(·), which maps pixel locations into some feature space, I define the following customized bilateral filter, H(·):

(3.1)  H(x̂, σd, σr) = [ ∫ e^(−½ (‖x̂−x‖/σd)²) · w(x, x̂, σr) · f(x) dx ] / [ ∫ e^(−½ (‖x̂−x‖/σd)²) · w(x, x̂, σr) dx ]
14 Their approach is still very much parallelizable and could be efficiently implemented in special hardware.

x̂ : Pixel location          w(·) : Range weight function
x : Neighboring pixel        m(·) : Linear weighting function
σd : Neighborhood size       w′(·) : Diffusion conduction function
σr : Conductance threshold   u(·) : User-defined map

In this formulation, x̂ is a pixel location and x are neighboring pixels, where the neighborhood size is defined by σd (blur radius). For implementation purposes, I limit the evaluation radius to two standard deviations, ±2σd, and normalize the convolution kernel to account for the missing area under the curve. This rule applies similarly to all following functions involving convolutions with exponential fall-off. Increasing σd results in more blurring, but if σd is too large, features may blur across significant boundaries. The range weighting function, w(·), is closely related to the diffusion conduction function (see below) and determines where in the image contrasts are smoothed or sharpened by iterative applications of H(·).

(3.2)  w(x, x̂, σr) = (1 − m(x̂)) · w′(x, x̂, σr) + m(x̂) · u(x̂)

Range Weights and Diffusion Conduction Functions. My definition of Equation 3.2 extends the traditional bilateral filter to become more customizable for data-driven or artistic control. For the real-time, automatic case, I set m(·) = 0, such that w(·) = w′(·) and Equation 3.1 becomes the familiar bilateral filter [155]. Here, w′(·) is the traditional diffusion conduction function and can take on numerous forms (Figure 3.6), given that w′(x = x̂, x̂, σr) = c, with c some finite constant, and lim_{x→±∞} w′(x, x̂, σr) = 0.

Figure 3.6. Diffusion Conduction Functions and Derivatives. Three possible range functions, w′, for use in Equation 3.2. All functions have a Gauss-like bell shape, but differ in their differentiability and differential function shape. Since all functions produce very similar results when applied to an image, the best choice for a given application depends largely on the support for optimized implementations of each function.

(3.3)  w′E(x, x̂, σr) = e^(−½ (∆fx̂ / σr)²)

(3.4)  w′I(x, x̂, σr) = A / (1 + (∆fx̂ / σr)²)

(3.5)  w′C(x, x̂, σr) = (A/2) · (1 + cos(∆fx̂ · π / (3·σr)))  if −3σr ≤ ∆fx̂ ≤ 3σr,  and 0 otherwise

(3.6)  where ∆fx̂ = ‖f(x̂) − f(x)‖

Equation 3.3 is the conduction function used by Tomasi and Manduchi [155] (they use the term range weighting function, as above) and employed for most images in this chapter. Equation 3.4 is based on one of Perona and Malik’s [125] original functions, and Equation 3.5 is a function I devised for its finite spatial support (both other functions have infinite support and need to be truncated and normalized for practical implementations). Figure 3.6 shows comparisons of these functions along with their first two derivatives. In practice, I find that all functions give comparable results and a selection is best based on implementation efficiency on a given platform and subjective quality estimates. As I am interested in manipulating contrast, all proposed conduction functions operate on local contrast15, as defined in Equation 3.6. Perona and Malik [125] called parameter σr in Equations 3.2-3.5 the conductance threshold in reference to its deciding role in whether contrasts are sharpened or blurred.
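For illustration, a brute-force, single-iteration version of H(·) with the exponential conduction function of Equation 3.3 (and m(·) = 0, i.e. the plain bilateral filter) can be sketched in a few lines of Python/NumPy. The function name and parameter defaults below are illustrative only; the input is assumed to be already converted to CIELab (e.g. with skimage.color.rgb2lab), and the actual system instead uses the separable GPU implementation described in Section 3.3.6.

```python
import numpy as np

def bilateral_iteration(lab, sigma_d=3.0, sigma_r=4.25):
    """One brute-force pass of H (Eq. 3.1) with the exponential conduction
    function w'_E (Eq. 3.3) and m = 0.  `lab` is an (H, W, 3) float array in
    CIELab; the contrast term is the feature-space distance of Eq. 3.6."""
    h, w, _ = lab.shape
    r = int(2 * sigma_d)                      # evaluate only within +-2 sigma_d
    out = np.zeros_like(lab)
    # Precompute spatial weights for the (2r+1) x (2r+1) neighborhood.
    ys, xs = np.mgrid[-r:r + 1, -r:r + 1]
    spatial = np.exp(-0.5 * (ys**2 + xs**2) / sigma_d**2)
    pad = np.pad(lab, ((r, r), (r, r), (0, 0)), mode='edge')
    for y in range(h):
        for x in range(w):
            patch = pad[y:y + 2*r + 1, x:x + 2*r + 1, :]
            diff = np.linalg.norm(patch - lab[y, x], axis=2)        # local contrast
            weight = spatial * np.exp(-0.5 * (diff / sigma_r)**2)   # w'_E range term
            out[y, x] = (weight[..., None] * patch).sum(axis=(0, 1)) / weight.sum()
    return out
```

Iterating such a function nb ≈ 3-4 times (with edges extracted after ne ≈ 1-2 iterations, Section 3.3.3) corresponds to the iterative nonlinear diffusion process illustrated in Figure 3.7; the sketch is meant only to make the role of σd and σr tangible, not to match the performance-oriented implementation.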
Small values of σr preserve almost all contrasts, and thus lead to filters with little effect on the image, whereas for large values, lim_{σr→∞} w′(·) = 1, thus turning H(·) into a standard, linear Gaussian blur. For intermediate values of σr, iterative filtering of H(·) results in an extended nonlinear diffusion process, where the degree of smoothing or sharpening is determined by local contrasts in f(·)’s feature space. Figure 3.7 shows the progressive removal of low contrast detail due to iterative nonlinear diffusion.

15 Other non-linear diffusion filters that operate on higher-order derivatives of the image have been proposed to achieve different goals [164, 158].

Figure 3.7. Progressive Abstraction. This figure shows a source image (unfiltered) that is progressively abstracted by successive applications of an extended nonlinear diffusion filter. Note how low contrast detail (e.g. the texture in the stone wall and the soft folds in the guard’s garments) is smoothed away, while high contrast detail (facial features, belts, sharp creases in garment) is preserved and possibly enhanced.

Automatic vs. Data-driven Abstraction. With m(·) ≠ 0, the range weighting function, w(·), turns into a weighted sum of w′(·) and an arbitrary importance field, u(·), defined over the image. In this case, m(·) and u(·) can be computed via a more elaborate visual saliency model [130, 78], derived from eye-tracking data [36], or painted by an artist [71]. Figure 3.8 shows comparisons between DeCarlo and Santella’s [36] explicit stylization system and my implicit framework.

Figure 3.8. Data-driven Abstraction. This figure shows images abstracted automatically (left) vs. abstractions guided by eye-tracking data (right). The top row are original images by DeCarlo and Santella [36], while the bottom row shows my results given the same data. It is noteworthy that despite some stylistic differences the two systems abstract images very similarly, although the systems themselves are radically different in design. Particularly, it is not necessary to derive a computationally expensive explicit image representation to achieve meaningful abstraction.— {Top images and eye-tracking data by Doug DeCarlo and Anthony Santella, with permission.}

I created the data-driven example by converting DeCarlo and Santella’s eye-tracking data into an importance map, u(·), setting m(·) := u(·), and tuning the remaining framework parameters to approximate the spatial scales and simplification levels found in DeCarlo and Santella’s original image, to allow for better comparability. After setting the initial parameters, the framework ran automatically. Note that although the two abstraction systems are radically different in design and implementation (e.g. DeCarlo and Santella’s image-structure-based system versus my image-based framework), the level of abstraction achieved by both is very similar.

Figure 3.9 demonstrates the use of a user-painted importance mask, u(·). As above, I set m(·) := u(·). The masks, shown as insets in the figure, are kept simple for demonstrative reasons but could be arbitrarily complex.

Figure 3.9. Painted Abstraction. User-painted masks achieve an effective separation of foreground and background objects. Automatic: The automatic abstraction of the source image yields the same level of abstraction everywhere. Foreground & Background: Masks (shown as insets) selectively focus abstraction primarily on the background and foreground, respectively.— {Original Source image in public domain.}
In effect, a user can simply paint abstraction onto an image with a brush, the level of abstraction depending on the brightness and spatial extent of the brush. Since the framework operates in real-time, this process affords immediate visual feedback to the user and allows even novice users to easily create abstractions with a simple and intuitive interaction common in many image manipulation products.

Optimizations and Other Considerations. Applying a full extended non-linear diffusion solver with reasonable spatial support and sufficient iterations to achieve valuable abstraction is computationally too expensive for real-time purposes. Fischer et al. [49] addressed this problem by applying their full filter implementation on a downsampled input image and then interpolating the result to the original size. While this allowed them to perform at least one iteration in real-time, the upsampling interpolation caused blurring of the resulting image, as expected.

My solution uses a separable implementation of the non-linear diffusion kernel. A two-dimensional kernel is separable if it is equal to the convolution of two one-dimensional kernels:

∫∫_{R²} k1(x1, x2) dx1 dx2 = ∫_R k2(x1) dx1 ∗ ∫_R k3(x2) dx2

The one-dimensional kernels can be applied sequentially (the latter operating on the result of the former), thus reducing the computational complexity from O(n²) to O(n), where n is the radius of the convolution kernel, and in turn limiting costly memory fetches. Mathematically, a nonlinear filter is generally not separable, the bilateral filter included. Still, I have obtained good results with this approach in practice. My results show empirically that a separable approximation to a bilateral filter produces minor (difficult to see with the naked eye) spatial biasing artifacts compared to the full implementation for a small number of iterations (< 5 for most images tested and using the default values in this chapter). Due to the shock-forming behavior of the bilateral filter, these biases tend to harden and become more pronounced with successive iterations (Figure 3.10).

Figure 3.10. Separable Bilateral Approximation. Two images of a bilateral filter diffusion process after 41 iterations. Full: Using the full two-dimensional implementation. Approximate: Using two separate one-dimensional passes. In most cases these errors are fairly small and only become prominent after a large number of iterations.— {Fair use: The images shown here for educational purposes are derivations of a small portion of an original image as shown on daily television.}

Pham and van Vliet [126] corroborate this result in contemporaneous work. They show empirically that a single iteration of a separable bilateral filter produces few visual artifacts, even for the worst-case scenario of a 45° tilted discontinuous edge. Figure 3.10 shows results for a large number of iterations, where errors tend to accumulate. I observe two types of effects: (1) sharp diagonal edges often evolve into jagged horizontal and vertical steps (examples 1, 2, and 3); and (2) soft diagonal edges fail to evolve (examples 4 and 5). For most of my videos and images, including the user-study, I have found it sufficient to apply between 2-4 iterations, so that spatial biases are rarely noticeable. The speed improvement, on the other hand, is in excess of 30 times in the GPU implementation.

Figure 3.11. Center-Surround Cell Activation. (Diagram panels: Receptive Field; Output; 1D Response Profile over Position.)
The receptive field of a cortical cell is modeled as an antagonistic system in which the stimulation of the central cell (blue) is inhibited by the simultaneous excitation of its surrounding neighbors (green). In other words, a center-surround cell is only triggered if it itself receives a signal while its receptive field is not stimulated. This system gives rise to the Mexican hat shape in 3-D (left, checkered shape) and the corresponding curves shown in the right image. The combined response curve can be modeled by subtracting two Gaussian distribution functions whose standard deviations are proportional to the spatial extent of the central cell and its receptive field [121]. As noted previously, all abstraction operations are performed in CIELab space. Consequently, the parameter values given here and in the following sections are based on the assumption that L ∈ [0, 100] and (a, b) ∈ [−127, 127]. 3.3.3. Edge detection In general, edges are defined by high local contrast, so adding visually distinct edges to regions of high contrast further increases the visual distinctiveness of these locations. 68 Figure 3.12. DoG Edge Detection and Enhancement. The center-surround mechanism described in Figure 3.11 can be used to detect edges in an image. Source: An abstracted image used to detect edges. DoG Result: The raw output of a DoG filter needs to be quantized to obtain high contrast edges. Step Quantization: Discontinuous quantization results in temporally incoherent edges near the step boundary. Smooth Quantization: Using Equation 3.7 for quantization results in edges that quickly fade at the quantization boundary, leading to improved temporal coherence. Compare the circled edges in the bottom images. Marr and Hildreth [109] formulated an edge detection mechanism based on zero-crossings of the second derivative of the luminance function. They postulated that retinal cells (center), which are stimulated while their surrounding cells are not stimulated, could act as neural implementations of this edge detector (Figure 3.11). A computationally efficient approximation to this edge detection mechanism is the quantized result of the difference-of-Gaussians 69 Figure 3.13. DoG Parameter Variations. Extending the standard DoG edge detector with soft quantization parameters allows me to create a rich set of stylistic variations. Left: A classic DoG result (no shading) with fine edges (low scale parameter σe ). (σe , τ, ) = (0.7, 0.9904, 0). Center: Same edge scale as in left image, but with additional shading information. Note that this image is not simply a combination of edges with luminance information as in Gooch et al. [62], because edges in dark regions (e.g. person’s right cheek, bottom of beard) are still visible (as bright lines against dark background). In terms of style, the image has a distinct charcoal-and-pencil appearance. (σe , τ, ) = (0.7, 0.9896, 0.00292). Right: Coarse edges using a large spatial kernel (compare detail in hair and hat with left image) and light shading around eyes, cheek and throat. (σe , τ, ) = (1.8, 0.9650, 0.01625). Parameter ϕe = 5.0 throughout. Given the above parameters, these images are created fully automatc ically in a single processing step.— {Original photograph used as input Andrew Calder, with permission.} (DoG) operator (Figure 3.12). 
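To make the mechanism concrete, the following is a minimal Python/NumPy/SciPy sketch of a DoG edge response computed on the luminance channel. The function name (dog_edges) and the default parameter values are illustrative only, and the sketch anticipates the smooth tanh activation D(·) that Equation 3.7 defines below rather than a hard binary quantization; it is not the GPU shader used in the actual system.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_edges(luminance, sigma_e=1.0, tau=0.98, eps=0.0, phi_e=2.0):
    """Difference-of-Gaussians edge response with a smooth activation.
    `luminance` is a 2-D float array (the L channel of the abstracted image).
    The surround blur uses 1.6 * sigma_e, mirroring the center-surround model."""
    center   = gaussian_filter(luminance, sigma_e)        # center response
    surround = gaussian_filter(luminance, 1.6 * sigma_e)  # surround response
    diff = center - tau * surround
    # Smooth thresholding: 1 where the response is clearly positive, and a
    # soft tanh ramp in (0, 1] elsewhere, darkening toward strong negative
    # responses (these darkened pixels form the edge lines).
    edges = np.where(diff > eps, 1.0, 1.0 + np.tanh(phi_e * diff))
    return edges
```

Multiplying this response into the abstracted luminance is one plausible way of overlaying the dark edge lines seen in Figure 3.1; the framework's exact formulation and the parameter ranges actually used follow next.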
Rather than using a binary edge quantization model as in previous works [22, 62, 49], I define my edges using a slightly smoothed continuous function, D(·) (Equation 3.7; depicted in Figure 3.4, bottom inset), to increase temporal coherence in animations and to allow for a wider range of stylization effects (Figure 3.13) than previous implementations.

Figure 3.14. Edge Cleanup Passes. DoG edges are extracted after ne < nb bilateral filter passes to eliminate noise that could lead to temporal incoherence in the edges. From left to right, this figure shows the original edges contained in a source image, the edges extracted after two and after four bilateral cleanup passes. Note that the differences between no cleanup and two passes are much greater than between two and four passes, indicating that a point of diminishing returns is quickly reached.— {Original photograph used as input © Andrew Calder, with permission.}

(3.7)  D(x̂, σe, τ, ε, ϕe) = 1,  if (Sσe − τ · Sσr) > ε;  and  1 + tanh(ϕe · (Sσe − τ · Sσr)),  otherwise.

(3.8)  Sσe ≡ S(x̂, σe)

(3.9)  Sσr ≡ S(x̂, 1.6 · σe)

(3.10)  S(x̂, σe) = 1/√(2πσe²) · ∫ f(x) e^(−½ (‖x̂−x‖/σe)²) dx

Equation 3.8 and Equation 3.9 represent Gaussian blurs (Equation 3.10) with different standard deviations and correspond to the center and negative surround responses of a cell, respectively. The factor of 1.6 in Equation 3.9 relates the size of a typical center-surround cell to the extent of its receptive field [109].

Figure 3.15. DoG vs. Canny Edges. DoG Edges: Soft DoG edges tuned to yield results comparable to the Canny edges. The thickness of the lines is proportional to the strength of edges as well as the scale at which edges are detected (Figure 3.3), giving the lines an organic feel. Canny Edges: Canny edgelines are designed to be infinitely thin, irrespective of scale. This is advantageous for image segmentation (Figure 3.2), but often belies the true scale of edges, making it more difficult to visually interpret the resulting lines. Canny Edges Eroded: Morphological thickening of lines, as in Fischer et al. [49], can easily hide small detail (e.g. threads in hat).— {Original photograph used as input © Andrew Calder, with permission.}

Together, the parameters τ and ε in Equation 3.7 control the amount of center-surround difference required for cell activation. Parameter ε commonly remains zero, while τ is smaller yet very close to one. Various visual effects can be achieved by changing these default values (Figure 3.13). Parameter ϕe controls the sharpness of the activation falloff. A larger value of ϕe increases the sharpness of the fall-off function, thereby creating a highly sensitive edge detector with reduced temporal coherence, while a small value increases temporal coherence but only detects strong edges. Typically, I set ϕe ∈ [0.75, 5.0]. Parameter σe determines the spatial scale for edge detection (Figure 3.3). The larger the value, the coarser the edges that are detected. For nb bilateral iterations, I extract edges after ne < nb iterations to reduce noise (Figures 3.14 and 3.23). Typically, ne ∈ {1, 2} and nb ∈ {3, 4}.

Canny [22] devised a more sophisticated edge detection algorithm (sometimes called optimal), which due to its computer vision roots is commonly used to derive explicit image representations via segmentation [36], but has also been used in purely image-based systems [49].
Canny edges are well suited for image segmentation because they are infinitely thin16 and guaranteed to lie on any real edge in an image, but at the same time they can become disconnected for large values of σe and are computationally more expensive than DoG edges. DoG edges are cheaper to compute and not prone to disconnectedness, but may drift from real image edges for large values of σe. I prefer DoG edges for computational efficiency, for temporal coherence, because their thickness scales naturally with σe (Figure 3.3 and Figure 3.15), and because my soft-quantization version (Equation 3.7) allows for a number of stylistic variations. I address edge drift with image-based warping.

16 For their image-based system, Fischer et al. [49] artificially increase the thickness of Canny edges using morphological operations.

3.3.4. Image-based warping (IBW)

DoG edges can become dislodged from true edges for large values of σe and may not line up perfectly with edges in the color channels. To address such small edge drifts and to sharpen the overall appearance of the final result (Figure 3.4, top-right), I optionally perform an image-based warp, or warpsharp filter. IBW is a technique first proposed by Arad and Gotsman [1] for image sharpening and edge-preserving upscaling, in which they moved pixels along a warping field towards nearby edges (Figure 3.16).

Figure 3.16. IBW Effect. Top Row: An image before and after warping and the color-coded differences between the two (Green = black expands; Red = black recedes). Bottom Row: Detail of the person’s left eye. Note that although the effect is fairly subtle (zoom in for better comparison), it generally improves the subjective quality of images considerably, particularly for upscaled images. This figure uses an edge image as input for clarity, but in the full implementation the entire color image is warped.— {Original photograph used as input © Andrew Calder, with permission.}

Given an image, f(·), and a warp-field, Mw : R² ↦ R², which maps the image-plane onto itself17, the warped image, W(x), is constructed as:

(3.11)  W(x) = f(Mw⁻¹(x))

This notation is after Arad and Gotsman [1], where Mw⁻¹ is used to indicate backward mapping, which is preferable for upscaling interpolation.

17 That is, the warp-field maps pixel positions rather than pixel values.

Figure 3.17. Computing Warp Fields. An input image is blurred and convolved with horizontal and vertical Sobel kernels, resulting in spatially varying warp fields for sharpening an image.— {Original photograph used as input © Andrew Calder, with permission.}

In my implementation, which closely follows Loviscach’s [103] simpler IBW approach, Mw is the blurred and scaled result of a Sobel filter, a simple 2-valued vector field that in the discrete domain (see Section 3.3.1) is easily invertible to obtain Mw⁻¹:

(3.12)  Mw(x̂, σw, ϕw) = ϕw · 1/(2πσw²) · ∫ Ψ(x) · e^(−½ (‖x̂−x‖/σw)²) dx

(3.13)  Ψ(x) = ( fL(x) ∗ [−1 0 +1; −2 0 +2; −1 0 +1],  fL(x) ∗ [+1 +2 +1; 0 0 0; −1 −2 −1] )ᵀ

Here, parameter σw in Equation 3.12 controls the area of influence that edges have on the resulting warp. The larger the value, the more distant pixels are affected. Parameter ϕw controls the warp-strength, that is, how much affected pixels are warped toward edges. A value of zero has no effect, while very large values can significantly distort the image and push pixels beyond the attracting edge18. For most images, I use σw = 1.5 and ϕw = 2.7 with bi-linear or bi-cubic backward mapping.
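As an illustration of this warpsharp step, a CPU sketch in Python/NumPy and SciPy might look as follows. The helper name (warp_sharpen), the displacement sign convention, and the use of scipy.ndimage are assumptions made for illustration; the actual framework implements the equivalent operation as a GPU pass with bi-linear or bi-cubic backward mapping.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates, sobel

def warp_sharpen(image, luminance, sigma_w=1.5, phi_w=2.7):
    """Image-based warp sketch (Section 3.3.4): pull pixels toward nearby edges.
    `image` is (H, W, C); `luminance` is the L channel used to build the warp
    field in the spirit of Eqs. 3.12-3.13: blurred Sobel gradients, scaled by phi_w."""
    # Psi(x): horizontal and vertical Sobel responses of the luminance channel.
    gx = sobel(luminance, axis=1)
    gy = sobel(luminance, axis=0)
    # M_w: Gaussian-blurred, scaled gradient field.
    wx = phi_w * gaussian_filter(gx, sigma_w)
    wy = phi_w * gaussian_filter(gy, sigma_w)
    # Backward mapping: sample the source at positions displaced along the field.
    h, w = luminance.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    coords = [ys + wy, xs + wx]   # sign convention is an assumption, see note below
    warped = np.stack(
        [map_coordinates(image[..., c], coords, order=1, mode='nearest')
         for c in range(image.shape[2])], axis=-1)
    return warped
```

Whether the displacement is added or subtracted (i.e., whether the gradient field pulls pixels toward or pushes them past the attracting edge) depends on the sign conventions of Equations 3.11-3.13, so this sketch only illustrates the structure of the computation, not a verified reproduction of the dissertation's results.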
Note that, while Equation 3.11 operates on all channels of the input image, Equation 3.13 is only based on the luminance channel, fL, of the image. Figure 3.17 shows the horizontal and vertical Sobel components of Ψ(·) for a given input image.

18 For completeness: negative values of ϕw push pixels away from edges, which looks interesting, but is generally not useful for meaningful image abstraction.

3.3.5. Temporally Coherent Stylization

To further simplify an image (in terms of its color histogram) and to open the framework further for creative use, I perform an optional color quantization step on the abstracted images, which results in cartoon or paint-like effects (Figures 3.1 and 3.18).

(3.14)  Q(x̂, q, ϕq) = qnearest + (∆q / 2) · tanh(ϕq · (f(x̂) − qnearest))

In Equation 3.14, Q(·) is the pseudo-quantized image, ∆q is the bin width, qnearest is the bin boundary closest to f(x̂), and ϕq controls the sharpness of the transition from one bin to another (top inset, Figure 3.4). Equation 3.14 is formally a discontinuous function, but for sufficiently large ϕq, these discontinuities are not noticeable. For a fixed ϕq, the transition sharpness is independent of the underlying image, possibly creating many noticeable transitions in large smooth-shaded regions. To minimize jarring transitions, I define the sharpness parameter, ϕq, to be a function of the luminance gradient in the abstracted image. I allow hard bin boundaries only where the luminance gradient is high. In low gradient regions, bin boundaries are spread out over a larger area.

Figure 3.18. Luminance Quantization Parameters. An original image along with parameters resulting in sharp and soft quantizations. Compare details in marked regions. Sharp: A very large ϕq creates hard, toon-shading like boundaries. (q, Λϕ, Ωϕ, ϕq) = (8, 2.0, 32.0, 500.0). Soft: A larger number of quantization bins and a low value of ϕq creates soft, paint-like whisks at the quantization boundaries. (q, Λϕ, Ωϕ, ϕq) = (14, 3.4, 10.6, 9.7). Edge scale σe = 2.0 for both abstractions.

I thus offer the user a trade-off between reduced color variation and increased quantization artifacts by defining a
Finally, an adaptive controlling mechanism offers the benefits of both effective quantization and temporal coherence, with easily adjustable trade-off parameters set by the user.

3.3.6. Optimizations

In designing my framework, I capitalize on two types of optimizations: parallelism and separability.

Parallelism. Modern graphics processing units (GPUs) are highly efficient parallel computation machines and are particularly well suited for many image processing operations. To take advantage of this parallel computing power, every element in my processing framework is highly parallelizable, that is, it does not rely on global operations (like min(·), max(·), Σ(·), etc.) and all operations only rely on previous processing steps (i.e. no forward dependencies). In addition, the non-linear diffusion (Section 3.3.2) and edge-detection (Section 3.3.3) operations after the initial noise-removal iterations (n > ne) can be performed in parallel, as can the center and surround kernel convolutions of the edge-detection itself. I use Olsen's [119] GPU image processing system to automatically compute and schedule processes and resolve memory dependencies.

Separability. As discussed in Section 3.3.2, the separable implementation of a two-dimensional filter kernel yields a significant performance gain. Since Gaussian(-like) convolutions feature so heavily in the abstraction framework (see Section 6.1.3 for a discussion of this observation), I take advantage of this optimization in almost every processing step (non-linear diffusion, edge-detection, and image-based warping).

3.4. Experiments

Section 3.2 explains the perceptual considerations that have gone into the framework design and Section 3.3 details the various image processing operations that implement the corresponding image simplification and abstraction steps, but this still does not guarantee that the abstracted images are effective for visual communication. To verify that my abstractions preserve or even distill perceptually important information, I performed two task-based studies that test recognition speed and short-term memory retention. The studies use small images because (1) I expect portable visual communication and low-bandwidth applications to benefit most from my framework in practice, and (2) small images may be a more telling test of the framework, as each pixel represents a larger percentage of the image.

Participants. In each study, 10 (5 male, 5 female) undergraduates, graduate students, or research staff acted as volunteers.

Materials. Images in Study 1 are scaled to 176 × 220, while those in Study 2 are scaled to 152 × 170. These resolutions approximate those of many portable devices. Images are shown centered on an Apple Cinema Display at a distance of 24 inches to subtend visual angles of 6.5° and 6.0°, respectively. The unused portion of the monitor framing the images is set to white.

Figure 3.19. Sample Images for Study 1. The top row shows the original images (non-professional photographs) and the bottom row shows the abstracted versions. Note how many wrinkles and individual strands of hair are smoothed away, reducing the complexity of the images while actually improving recognition in the experiment. All images use the same σe for edges and the same number of simplification steps, nb. — {Pierce Brosnan and Ornella Muti by Rita Molnár, Creative Commons License. Paris Hilton by Peter Schäfermeier, Creative Commons License.
George Clooney, public domain.}

In Study 1, 50 images depicting the faces of 25 famous movie stars are used as visual stimuli. Each face is depicted as a color photograph and as a color abstracted image created with my framework (Figure 3.19). In Study 2, 32 images depicting arbitrary scenes are used as visual stimuli. Humans are a component in 16 of these images (Figure 3.20).

Analysis. For both studies, p-values are computed using two-way analysis of variance (ANOVA), with α = 0.05.

Figure 3.20. Sample Images from Study 2. The top row shows the original snapshot-style photographs and the bottom row shows the abstracted versions. Note how much of the texture in the original photographs (like water waves, sand, and grass) is abstracted away to simplify the images. All images use the same σe for edges and the same number of simplification steps, nb.

3.4.1. Study 1: Recognition Speed

Hypothesis. Study 1 tests the hypothesis (H1) that abstracted images of familiar faces are recognized more quickly than normal photographs. Faces are a very important component of daily human visual communication and I want the framework to help in the efficient representation of faces.

Procedure. To ensure that participants in the study are likely to know the persons depicted in the test images, I use photographs of celebrities as source images and controls. The study uses a protocol [149] demonstrated to be useful in the evaluation of recognition times for facial images [62] and consists of two phases: (1) reading the list of 25 movie star names out loud; and (2) a reaction time task in which participants are presented with sequences of the 25 facial images. All faces take up approximately the same space in the images and are three-quarter views. By pronouncing the names of the people to be rated, participants tend to reduce the tip-of-the-tongue effect, where a face is recognized without being able to quickly recall the associated name [149]. For the same reason, participants are told that first, last, or both names can be given, whichever is easiest. Each participant is asked to say the name of the pictured person as soon as that person's face is recognized. A study coordinator records reaction times, as well as accuracy of the answers. Images are shown and reaction times recorded using the Superlab software product for 5 seconds at 5-second intervals. The order of image presentation is randomized for each participant.

Data Conditioning. Two additional volunteers were eliminated from the study after failing familiarity requirements. One volunteer was not familiar with at least 25 celebrities. Another volunteer claimed familiarity with at least 25 celebrities, but his or her accuracy for both photographs and abstractions was more than three standard deviations from the remainder of the group, indicating that the volunteer was not reliably able to associate faces with names. By the same reasoning, two images were deleted from the experimental evaluation because their accuracy (in both conditions) was more than three standard deviations from the mean. This could indicate that those images simply were not good likenesses of the depicted celebrities or that familiarity with the celebrities' names was higher than with their faces.

Results and Discussion. Data for this study (Figure 3.21, Top Graph; Table A.1) shows a correlation trend between timings for abstractions and photographs.

Figure 3.21. Participant-data for Video Abstraction Experiments.
Top Graph: Data for study 1 showing per-participant averages for all faces. Middle & Bottom Graphs: Data for study 2 showing timings and number of clicks for participants to complete two memory games, one with photographs and one with abstractions. Data pairs for both experiments are not intended to refer to the same participant and are sorted in ascending order of abstraction time.

Three data pairs (2, 4 & 5) show only a very small difference between recognition times in both presentation conditions, but for all data pairs, the abstraction condition requires less time than the photographs. Averaging over all participants shows that participants are faster at naming abstract images (M = 1.32s) compared to photographs (M = 1.51s), thus rejecting the null hypothesis in favor of H1 (p < 0.018). In other words, the likelihood of obtaining the results of the study by pure chance is less than 1.8%, and it is therefore more reasonable to assume that the results were caused by a significant increase in recognizability of the abstracted images. The accuracies for recognizing abstract images and photographs are 97% and 99%, respectively, and there is no significant speed-for-accuracy trade-off. I can thus conclude that substituting abstract images for fully detailed photographs reduces recognition latency by 12.6%.

Interestingly, this significant improvement was reported neither by Stevenage [149] nor by Gooch et al. [62]. Since both of these authors only used black-and-white stimuli, I suspect that the simplified color information in my abstraction framework contributes to the measured improvement in recognition speed. This promises to be a worthwhile avenue for future research.

It is worth pointing out that the performance improvement measured in this study might seem small in terms of percentage, but it represents an improvement in a task that humans are already extremely proficient at. In fact, there exist brain structures dedicated to the recognition of faces [188], and many people can recognize familiar faces from the image of a single eye or the mouth alone. A similar remark can be made about the results of the next study, which are even more marked.

3.4.2. Study 2: Memory Game

Hypothesis. Study 2 tests the hypothesis (H2) that abstracted images are easier to memorize (in a memory game) than photographs. By removing extraneous detail from source images and highlighting perceptually important features, my framework emphasizes the essential information in these images. If done successfully, less information needs to be remembered and prominent details are remembered more easily.

Procedure. Study 2 assesses short-term memory retention for abstract images versus photographs with a memory game, consisting of a grid of 24 cards (12 pairs) that are randomly distributed and placed face-down. The goal is to create a match by turning over two identical cards. If a match is made, the matched cards are removed. Otherwise, both cards are returned to their face-down position and another set of cards is turned over. The game ends when all pairs are matched. The study uses a Java program of the card game in which a user turns over a virtual card with a mouse click. The 12 images used in any given memory game are randomly chosen from the pool of 32 images without replacement, and randomly arranged. The program records the time it takes to complete a game and the number of cards turned over (clicks) before all pairs are matched.
Study 2 consists of three phases: (1) a practice memory game with alphabet cards (no images); (2) a memory game of photographs; and (3) a memory game of abstract images. All participants first play a practice game with alphabet cards to learn the user-interface and to develop a game strategy without being biased with any of the real experimental stimuli. No data is recorded for the practice phase. For the remaining two phases, half the participants are presented with photographs followed by abstracted images, and the other half is presented with abstracted images followed by photographs.

Results and Discussion. In the study, participants were significantly faster in completing a memory game using abstract images (Mtime = 60.0s) compared to photographs (Mtime = 76.1s), thus rejecting the null hypothesis in favor of H2 (ptime < 0.003). The fact that the probability of obtaining the measured timings by pure chance is less than 0.3% indicates a statistically highly significant result. Participants further needed to turn over far fewer cards in the game with abstract images (Mclicks = 49.2) compared to photographs (Mclicks = 62.4), with a type-I error likelihood of pclicks < 0.004, again highly significant. Presentation order (abstractions first or photographs first) did not have a significant effect.

Despite the fact that the measured reduction in time (21.3%) and the reduction in the number of cards turned over (21.2%) were almost identical, the per-participant data in Figure 3.21 (Middle and Bottom graphs) and Table A.1 does not indicate a strong correlation between timing results and clicks. As in Study 1, the results (both timing and clicks) for all participants were lower for abstractions than for photographs (only minimally so for the timing data pairs 2 & 10). Since the number of clicks corresponds to the number of matching errors made before completing the game19, the lower number of clicks for the abstracted images indicates significantly fewer matching errors compared to photographs, and I conclude that my framework can simplify images in a way that makes them easier to remember.

19 The minimum number of clicks is 24, one per card. This is unrealistic, however, as the probability of randomly picking a matching pair by turning two cards out of 24 is 1 : 23. By removing this pair, no additional knowledge of the game is discovered, so that even with perfect memory the probability for the next pair is 1 : 21, and so on.

Figure 3.22. Failure Case. A case where the contrast-based importance assumption fails. Left: The subject of this photograph has very low contrast compared with its background. Right: The cat's low-contrast fur is abstracted away, while the detail in the structured carpet is further emphasized. Despite this rare reversal of contrast assignment, the cat is still well represented.

3.5. Framework Results and Discussion

3.5.1. Performance

The framework was implemented and tested in both a GPU-based real-time version, using OpenGL and fragment shader programs, and a CPU-based version. Both versions were tested on an Athlon 64 3200+ with Windows XP and a GeForce GT 6800. Performance values depend on graphics drivers, image size, and framework parameters. Typical values for a 640 × 480 video stream and the default parameters given in this text are 9 − 15 frames per second (FPS) for the GPU version and 0.3 − 0.5 FPS for the CPU version.

3.5.2. Limitations

Contrast. The framework depends on local contrast to estimate visual saliency.
Images with 87 very low contrast do not carry much visual information to abstract (e.g. the fur in Figure 3.22). Simply increasing contrast of the original image may reduce this problem, but also increases noise. Figure 3.22 demonstrates a rare inversion of this general assumption, where the main subject exhibits low contrast and is deemphasized, while the background exhibits high contrast and is emphasized. Extracting semantic meaning about foreground versus background from images automatically and reliably is a hard problem, which is why I use the contrast heuristic, instead. Note that despite the contrast reversal the cat in the abstracted image in Figure 3.22 is still clearly separated from the similarly colored background due to overall contrast polarization. In practice, I have obtained good results for many indoor and outdoor scenes. Scale-Space. Human vision operates at a large range of spatial scales simultaneously. By applying multiple iterations of a non-linear diffusion filter, the framework covers a small range of spatial scales, but the range is not explicitly parameterized and not as extensive as that of real human vision. Global Integration. Several features that may be emphasized by my framework are actu- ally deemphasized in human vision, among these are specular highlights and repeated texture (like the high-contrast carpet in Figure 3.22). Repeated texture can be considered a higherorder contrast problem: while the weaving of the carpet exhibits high-contrast locally, at a global level the high-contrast texture itself is very regular and therefore exhibits low contrast in terms of texture-variability. Dealing with these phenomena using existing techniques requires global image processing, which is impractical in real-time on today’s GPUs, due to their limited gather-operation capabilities20. 20 The framework deals partially with some types of repeated texture. See Section 3.5.7 (Indication) for details. 88 3.5.3. Compression A thorough discussion of theoretical data compression and codecs exceeds the scope of this dissertation because traditional compression schemes and error metrics are optimized for natural images, not abstractions (Sections 2.3 and 2.4.1). To recall, many existing error metrics, even perceptual ones, yield a high error value for the image pairs in Figures 3.19 and 3.20, although I have shown in Section 3.4 that my abstractions are often better at representing image content for visual communication purposes than photographs. An interesting point of discussion in this respect is the error source. Several popular blockbased encoding schemes (e.g. JPEG, MPEG-1, MPEG-2) exhibit blockiness artifacts at low bitrates while many frequency-based compression schemes produce ringing around sharp edges. All of these artifacts are perceptually very noticeable. Artifacts in abstraction systems, like that presented here, are of stylistic nature and people tend to be much more accepting of these [145] because they do not expect a realistic result. Non-realistic image compression promises to be an exciting new research direction. In terms of the constituent filters in the framework, Pham and Vliet [126] have shown that video compresses better using traditional coding methods when bilaterally filtered beforehand, judged by RMS error and MPEG quality score. Collomosse et al. [26] list theoretical compression results for vectorized cartoon images. 
Possibly most applicable to my abstractions is work by Elder [42], who describes a method for storing the color information of an image only in high-contrast regions, achieving impressive compression results. Without going into technical detail, it can be shown that the individual filter steps in the framework simplify an image in the Shannon [146] sense and a suitable component compression scheme should be able to capitalize on that. For example, the emphasis edges in Section 3.3.3 89 pose a problem for most popular compression schemes due to their large spectral range, yet the edges before quantization are derived from a severely band-limited DoG filter of an image. In general, an effective compression scheme would not attempt to compress the final images, but rather the individual filter outputs before quantization. The final composition of the channels would then be left to the decompressor. Another advantage of this approach, which promises novel applications for streaming video, is that only selected channels may be distributed for extreme low-bandwidth transmission (e.g. only the highlight edges) and that the stylistic options represented by the quantization parameters can be chosen by a decompression client (viewer) instead of hard-coded into the image-stream. 3.5.4. Feature Extension I do not include an orientation dependent feature in the contrast feature space because of its relatively high computational cost and because orientation is generally only necessary for highlevel vision processes, like object recognition, whereas my work focuses on using low level human vision processes to improve visual communication. Should such a feature be required, the combined response for Gabor filters at different angular orientations can be included in the input feature space conversion step in Figure 3.4. This response would need to be scaled to a comparable range as the other feature channels to retain perceptual uniformity. For implementation details of a separable, recursive Gabor filter, compatible with the framework, see Young et al. [181]. 90 Figure 3.23. Benefits for Vectorization. Vectorizing abstracted images carries several advantages. Edges: Extracting edges after ne smoothing passes removes many noisy artifacts that would otherwise have to be vectorized. Here, the difference between two consecutive passes is shown. Color: Quantization results after 1 and 5 non-linear diffusion passes, respectively. The simplification achieved by the abstraction is evident in the simplified quantization contours, which requires fewer control-points for vector-encoding. Zoom in for full detail. 3.5.5. Video Segmentation My stylization step in Section 3.3.5 is a relatively simple modification to an abstracted image. Despite this, I have found that it yields surprisingly good result in terms of color flattening and is much faster than the mean-shift procedures used in off-line cartoon stylization for video [26, 161]. Interestingly, several authors [4, 14] have shown that anisotropic diffusion filters are closely related to the mean-shift algorithm [27]. It is thus conceivable that various graphics applications that today rely on mean-shift could benefit from a much faster anisotropic diffusion implementation, at least as a pre-process to speed up convergence. 91 3.5.6. Vectorization An explicit image representation is an integral part of many existing stylization systems. 
Although I already discussed the trade-offs between these explicit representations and my imagebased approach, I show here how my abstraction framework can be used as a pre-process to derive an efficient explicit image representation. Benefits. Vectorization21 of images of natural scenes requires significant simplification for most practical applications because generally neighboring pixels are of different colors so that a true representation of an input image might require a single polygon for each pixel. This simplification is essentially analogous to the abstraction qualities I have discussed so far: I want to simplify (the contours and colors) of a vector representation of an image, while retaining most of the perceptually important information. Consequently, it is not surprising that my abstraction framework can aid in this simplification process with its use of a non-linear diffusion filter and pseudo color quantization. Figure 3.23 demonstrates the two key benefits: (1) noise removal; and (2) contour simplification. Because the non-linear diffusion step of the framework removes high-frequency noise, this information does not need to be encoded into a complex vector representation. Similarly, the quantization contours of the abstracted images are progressively simplified in their shape, requiring fewer control points to encode into any standard vector format. This approach of simplification followed by vectorization contrasts with the traditional approach of vectorization followed by simplification [36, 161, 26]. The main advantage of the traditional approach is that vector representations at different spatial scales can be treated independently, in the course of which some features may be removed completely as part of the simplification. The advantages of the approach presented here are that, as above, many 21 Here, defined as the act of converting an image into bounded polygons of a single color. 92 features do not need to be vectorized in the first place, that the simplification can happen much faster, and that temporal coherence of consecutive frames is improved. The reason for increased temporal coherence is rooted in the sparse parametric representation that vectorization affords. Given an efficient (low redundancy) vector representation (e.g. B-splines) of a shape, this shape can change considerably if one of its control-points is removed or altered excessively. Since simplification of vectors includes just these types of modifications, the traditional vectorization approach is prone to very unstable shape representations22. If vectorization is performed after simplification, then temporal coherence is mainly a function of the coherence quality of the vectorization input. Given the good coherence characteristics of my framework23, this leads to improved temporal coherence after vectorization. Implementation. The vectorization implementation I have chosen is based on simple iso- contour extraction of the color information in the abstracted and hard-quantized images. I vectorize the edge and color information separately to keep the vectorized representation as simple as possible. Individual polygons are expressed as polylines or Beziér curves, depending on the local curvature of the underlying contours and written out as Postscript files. Vectorization of a single image takes in the order of 1-3 seconds, depending on the resolution of the input image and the desired complexity of the vectorized output. This process is not optimized for efficiency. Limitation. 
An advantage of temporal vectorization extensions, in addition to increased temporal coherence, is the possibility of a more compact temporal vector representation. Instead of encoding each frame independently, one can specify an initial shape and then encode how the shape transforms for successive frames24. Unfortunately, this is a difficult problem, as it requires accurate knowledge of where a shape in one frame can be found in the next. This object-tracking problem (in this case, contour-tracking) is a major research effort in the computer vision community and not, as yet, robustly solved. For this reason, both Wang et al. [161] and Collomosse et al. [26] require user-interaction to correct for tracking mistakes, particularly in the presence of camera movement and occlusions. My vectorization approach faces the same challenges and limitations when moving from a single-frame encoding to an inter-frame encoding scheme.

Vectorization, as defined here, requires true discontinuous quantization boundaries (for both edges and color information). As a result my vectorized images lose those temporal coherence advantages that stem from the soft-quantization functions of my framework.

22 This has led to some computationally very expensive temporal vectorization extensions [161, 26].
23 This refers to coherence as a result of simplification and smoothing, not the soft-quantization functions, as these functions cannot be used for vectorization (see Limitation, below).
24 Most video compression schemes make use of this inter-frame coherence by encoding just the information that changes between two frames in so-called delta-frames.

3.5.7. Complementary Framework Effects

In addition to the design goals that I implemented within the abstraction framework, a handful of stylistic effects presented themselves for free as a result of the framework's various image processing operations. Initially, this came as a surprise to me, considering that (1) I did not intentionally program these effects; (2) most of the effects are traditionally considered artistic, not perceptual or computational; and (3) most effects are considered challenging research objectives in their own right (see Indication, below). Upon reflection, though, these observations strengthen my belief that there are many unknown connections between perception and art that wait to be modeled and measured with the use of NPR. In this dissertation, I include the two most prominent effects that have also been discussed in previous work.

Figure 3.24. Automatic Indication. The inhomogeneous texture in these images causes spatially varying abstraction. As a result, fine detail subsists in some regions, while being abstracted away in other regions. Note how the bricks in the top and middle images are only represented intermittently with edges, yet the observer perceives the entire wall as bricked. The few visible brick instances are interpreted as indicating a brick wall and empty regions are visually interpolated. The same applies to the blinds in the middle image and the shingles, the wheat, and the trees in the bottom image. These types of indication are commonly used by artists, particularly the shadows indicated underneath the windowsill in the top image and the fine branches in the bottom image, which are hinted at by faint color, while only the main branches are drawn with edges.

Indication.
Indication is the process of representing a repeated texture with a small num- ber of exemplary patches and relying on an observer to interpolate between patches. Winkenbach and Salesin [169] explain the associated challenges thus: “Indication is one of the most notoriously difficult techniques for the pen-and-ink student to master. It requires putting just enough detail in just the right places, and also fading the detail out into the unornamented parts of the surface in a subtle and unobtrusive way. Clearly, a purely automated method for artistically placing indication is a challenging research project.” For structurally simple, slightly inhomogeneous textures with limited scale variation, like the examples in Figure 3.24, my framework can perform simple automatic indication, including stroke texture25 (Figure 3.24: top-right image, shadows under window-sill). The framework achieves indication by extracting edges after a number of abstraction simplification steps. Depending on the given image contrast and Equation 3.2, some parts of an image are simplified more, some less, in an approximation to the perceived difference in those image regions. The emphasis DoG edges then highlight high contrast texture regions that remain prominent throughout the simplification process. All other edges in textured regions are removed, leaving the missing texture to be inferred by an observer. As DeCarlo and Santella [36] noted, such simple indication does not deal well with complex or foreshortened textures. The automatic indication in my framework is not as effective as the user-drawn indications of Winkenbach and Salesin [169], but some user guidance can be supplied via Equation 3.2, to provide vital semantic meaning. 25 Winkenbach and Salesin [169] refer to line markings that represent both texture and tone (brightness) as stroke texture. 96 Figure 3.25. Motion Blur Examples. Motion Lines: Cartoons often indicate motion with motion lines. Motion Blur: This sequence shows a radial pattern of rays at different orientations (angle) and of varying width (radius), which is convolved with a motion blur filter at different orientations. Note that lines parallel to the direction of the motion blur are preserved, while lines perpendicular to the motion blur are maximally blurred. Figure 3.26. Motion Blur Result. Original: Images of a stationary car and a moving motion-blurred car. DoG Filter: Corresponding images from my modified DoG filter. Note how many of the remaining horizontal lines resemble the speed lines used by comic artists.— {Original image released under GNU Free Documentation License.} Motion Lines. Comic artists commonly indicate motion with motion lines parallel to the suggested direction of movement (Figure 3.25, Motion Lines). Interestingly, Kim and Francis [89, 52] showed that these motion lines are not purely artistic and actually have perceptual foundations, which is likely the reason why artists have adopted them in the first place and why 97 they are so easily understood. The DoG edges in my framework automatically create streaks resembling motion lines as shown in Figure 3.26. Although I did not explicitly program this behavior (as in Collomosse et al. [26]), it can be easily explained. Motion blur is a temporal accumulation effect that occurs when a camera moves relative to a photographed scene. This relative movement can be any affine transformation like translation, rotation, and scaling but I focus this discussion on translational movements only. 
A motion blur, or oriented blur, O(·), can be formulated using a modification of the familiar Gaussian kernel:

(3.15) O(x̂, σo, θ) = ( ∫ f(x) · e^(−½(‖Θ(x̂−x, θ)‖/σo)²) dx ) / ( ∫ e^(−½(‖Θ(x̂−x, θ)‖/σo)²) dx )

(3.16) Θ(x, θ) = [ cos(θ) sin(θ) ; 0 0 ] · x

Here, parameter σo determines how much the image is blurred, i.e. the duration of the exposure in relation to the speed of the scene relative to the camera. Parameter θ indicates the blur direction in the image plane. Equation 3.15 is a very simple but sufficient model for this discussion that does not take into account depth (image elements moving at different speeds) and assumes that only the camera moves with respect to the scene.

Figure 3.25 (Motion Blur) shows the result of this filter on a pattern of lines of varying widths and different orientations. Lines parallel to the blur direction are blended only with themselves and appear unaffected, while lines perpendicular to the blur direction are blended with neighboring lines and lose sharpness. Intermediate angles vary with the sine of the angle. In the car example in Figure 3.26, the vertical line of the door is blurred away, while the door's horizontal line (parallel to the motion) is preserved.

The DoG filter therefore mainly detects edges in the direction of motion, because other edges are largely blurred away. As a consequence, the resulting image looks like it has motion lines added.

3.5.8. Comparison to Previous Systems

I have pointed out throughout this Chapter how my framework differs from previous systems in terms of the design goals I have chosen, and I have demonstrated performance increases in perceptual tasks not evident in previous work [149, 62]. However, these comparisons are still not as detailed as they should be, mainly because the NPR community lacks comparison criteria above the level of simple frame-rate counts. There are other issues that compound this problem. Most stylization systems are not based on perceptual principles and are therefore not psychophysically validated. Performing such comparative analyses oneself is complicated by the fact that previous stylization systems are rarely freely available and are difficult and time-consuming to implement, and that they generally have a limited amount of results openly available. There simply is no standard repository of imagery for NPR applications (like the Stanford bunny for meshes, or the Lena image for image processing). I hope that my work can contribute to the solution of these problems by making available a large number of input and result images and videos, and more importantly, by validating my own framework with psychophysical experiments that can be used in direct comparison with future NPR systems.

3.5.9. Future Work

Despite the numerous processing steps that comprise my video abstraction framework, it is simple to implement and shows great potential in terms of computational and perceptual efficiency. I therefore hope that the framework will be adopted for a number of interesting research directions.

NPR Compression. As noted in Section 3.5.3, I believe that abstractions generated by my framework are subject to good compression ratios, yet most current compression schemes are likely to perform sub-optimally. Non-photorealistic compression is basically unheard of, partly because the compression community has very well-defined and rigid ideas about realism and desirable image fidelity.
I believe NPR compression to be promising future research mainly because of the significant removal of information in abstractions and because of the ability to alter the reconstruction parameters on the decompression side for stylistic effect and perceptual efficiency. Minimal Graphics. In their paper called Minimal Graphics, Herman and Duke [69] state that “[the] main question which still remains is how to automatically extract the minimal amount of information necessary for a particular task?”. I have shown that two specific tasks can be performed better given my abstractions, but I did not show (nor do I believe) that this performance increase is maximal. As Section 3.4 demonstrated, removal of information can actually lead to better efficiency for specific perceptual tasks, but there is a point at which additional removal of information will bring about a decline in efficiency26. It would be interesting and valuable to use a framework like the one presented here to graph a chart of image information versus task efficiency and to map these findings to framework parameters. Such perceptual 26 This can be proven by considering the extreme case of removing all information. 100 research using an NPR framework would be another example of how to close the loop of mutual beneficence that this dissertation is intended to demonstrate. 3.6. Summary In this chapter, I presented a video and image abstraction framework (Figure 3.4) that works in real-time, is temporally coherent, and can increase perceptual performance for two recognition and memory tasks (Section 3.4). Framework. Unlike previous systems, my framework is purely image-based and demon- strates that meaningful abstraction is possible without requiring a computationally expensive explicit image representation. To the best of my knowledge, my framework is one of only three automatic abstraction systems that prove effectiveness for visual communication tasks with user studies. Of these, my studies are the most comprehensive (two tasks with colored stimuli compared to one study with colored stimuli and two studies with black-and-white stimuli, respectively). By basing the framework design on perceptual principles, I obtain at least two visual effects (Section 3.5.7) in my output images for free27, which previous systems implemented explicitly and with computational overhead. These effects are indication, the suggestion of extensive image texture by sparse texture elements, and motion lines, an artistic technique to illustrate motion in static images. Customizable Non-linear Diffusion. I developed an extension (Equation 3.2) to the bilateral filter (Equation 3.1) as an approximation to non-linear diffusion that allows for external control via user-data in various forms (painted, data-driven, or computed). 27 Without explicit computation devoted to these effects. 101 Temporally Coherent Quantization. I constructed two smooth quantization functions, both of which (1) increase temporal coherence for animation; and (2) offer stylization options not available to previous systems using discontinuous quantization functions. The first quantization function (Equation 3.7) operates on the well-known DoG function to extract edges from an image. The second quantization function (Equation 3.14) flattens colors in an image for data reduction and artistic purposes. 
Another contribution of this second function is its spatially adaptive behavior, which achieves a good trade-off between the desired level of quantization and temporal coherence by adapting to local image gradients. Additional Materials. More information on this project, including a conference pa- per [171], GPU code, and an explanatory video, can be found on the Siggraph 2006 conference DVD. The same materials and additional images are also available online at http: //videoabstraction.net. 102 CHAPTER 4 An Experiment to Study Shape-from-X of Moving Objects Figure 4.1. Shape-from-X Cues. The human visual system derives shape information from a number of distinct visual cues which can be targeted and tested using non-photorealistic imagery. Shading: Lambertian shading varies with the cosine of the angle between light direction and surface normal. Flat surfaces exhibit constant color, while curved surfaces show color gradients. Texture: Texture elements, or texels, accommodate their form to align with the underlying surface, causing texture compression. Contours: Discontinuities of various surface properties are shown in different colors (red: silhouette, black: outline, green: ridges, blue: valleys). Note that the objects’ shadows are synonymous with silhouettecontours as seen from the casting light’s point-of-view. Motion: Under rigid body motion, points on the surface move at different speeds and in different directions. In this Chapter, I present a psychophysical experiment that uses non-photorealistic imagery to study the perception of several shape cues for rigidly moving objects in an interactive task. Traditionally, most shape perception studies only display a small number (generally one; two for some comparison experiments. See Section 4.3.4) of static objects. Yet, most interactive graphical environments, such as medical visualization, architectural visualization, virtual 103 reality, physical simulations, and games, contain a large number of concurrent dynamic shapes and objects that move independently or relative to an observer. Because shape perception is vital to many recognition and interaction tasks, it is of great interest to study shape perception for multiple shapes in dynamic environments, in order to develop effective display algorithms. The experiment I propose in this chapter benefits greatly from carefully designed nonphotorealistic imagery to separate and individually study shape cues that find common usage in many computer graphics applications. 4.1. Introduction The art and science of photorealism in computer graphics, as exemplified in Figure 1.1, has shown impressive improvements over the last decades, but the associated computational demands have put this level of realism out of reach of most real-time applications. As a result, real-time 3-D graphics commonly only offer best-effort approximations in terms of realistic lighting, shading, and material effects. These limitations beg several important questions. What effects do the approximations have on applications depending on shape perception? If we want to prioritize computational resources for the most effective shape cues for a given set of shapes or a given application, how do we determine this effectiveness? Another set of questions concerns the necessity for realism. I have already mentioned in Chapter 1 and demonstrated in Chapter 3 that sometimes less is more when it comes to visual stimuli for humans1. 
Realistic images can, at times, be overbearing or conflicting in terms of the information that is presented to a viewer, and it may be more effective for a given task to display less information that is emphasized appropriately [150]. Being freed from the restrictions that reality (even an approximate one) imposes, how can we emphasize the shape of an 1 Incidentally, results in this Chapter reiterate this concept. 104 object effectively using stylistic (non-realistic) elements? Similarly, how can we compare the effects of various known stylization techniques for conveying shape? I believe the answers to these questions to be important, not only because they will advance the state-of-the-art in realistic and non-realistic graphics, but because they may provide insights into the development of art, and our perception of art. Of course, this chapter provides only very few actual answers to these questions. What it provides instead is a simple and flexible experiment that enables research into these questions. 4.1.1. Experimental Design Goals The set of shape cues I investigate (shading, contours, and textures) is not meant to be exhaustive, but rather demonstrative. The number of additional existing shape cues, their possible parameterizations, and the permutations of combined effects are probably too vast to explore in a single lifetime. As such, the main purpose of this chapter is to demonstrate an example of the types of studies that my experiment supports and to offer my methodology up for other researchers in computer graphics and perception to perform their own investigations. In designing the experiment, I take special care to address the following key issues: (1) A number of different shape cues can be studied in isolation and in combination — This is important to support a broad range of studies. (2) The difficulty of the experimental task can be easily adjusted — If the task is too easy or too difficult no meaningful data can be gathered. The task should be designed so that participants at different performance levels can provide meaningful statistical data. 105 (3) The interaction itself is simple — It is important to separate the task from the interaction necessary to perform the task. While the task should be as difficult as possible (without being impossible), the interaction should be very simple to ensure that the performance of the task is measured and not that of the interaction. (4) The performance of participants can be tested under time-constrained conditions — Most traditional shape experiments have no time limit for their trials. Because humans can only attend to very few stimuli simultaneously [39, 160] the results under timepressure might very well be different from these static experiments and offer important guidance for the design of real-time applications. (5) The experimental shapes are general, relevant, and parameterizable — This is important so that valid and meaningful statements can be made about the shapes that are tested and the results that apply to them. It also facilitates replication and verification of experimental results by third parties. (6) Learning effects and other biases for the task are minimal — After an initial period of getting acquainted with the interaction and developing a strategy, learning and memory of the experimental procedure should not impact the performance of the interactive task2, so that performance differences are due to varied experimental conditions and not increasing experience. 
For the same reason, the experimental conditions should not be biased or otherwise predictable to ensure the experimental data reflects perceptual performance instead of system biases or deductive reasoning abilities. 2 Note, that this is different from studying memory performance, as in Section 3.4.2. Even there, the position of cards between trials was randomized, so that participants could not remember the correct position from the previous trial. Instead, participants had to remember the positions anew for each trial, thus making each trial independent. 106 It is my hope that these design goals are specific enough to provide meaningful results, yet general enough to allow other researchers to (1) adopt my experimental framework to study other types of shape cues, and to (2) evaluate the effectiveness of interactive non-photorealistic rendering systems to convey shape information. Section 4.4 explains how I implement the above goals in my own experiment and Section 4.7 demonstrates via data analysis that these goals were attained. 4.1.2. Overview In the experiment I present here, participants are shown 16 moving objects, 4 of which are designated targets, rendered in different shape-from-X styles. Participants select these targets by simply touching a touch-sensitive table onto which the objects are projected. The experimental data shows that simple Lambertian shading offers the best shape cue, followed by outline contours and, lastly, texturing. The data also indicates that multiple shape cues should be used with care, as these may not behave additively in a highly dynamic environment. This result is in contrast to previous additive shape cue studies for static environments and reflects the importance of investigating shape perception in the presence of motion. To the best of my knowledge, my experiment is unique in its capacity to compare the effectiveness of multiple shape cues in dynamic environments and it represents a step away from traditional, impoverished (reductionist) test conditions, which may not translate well to real-time, interactive applications. Other advantages of the experiment are that it is simple to implement, engaging and intuitive for participants, and sensitive enough to detect significant performance differences between all single shape cues. 107 4.1.3. Note on Chapter structure Although this chapter follows largely the same structure as Chapter 3, it does so in a slightly different order. This is due to the fact that Chapter 3 presents an automatic abstraction system based on perception and verified by two experiments; whereas this chapter presents a psychophysical experiment to study perception, based on non-photorealistic imagery. In this chapter, I therefore introduce important aspects of the human visual system (Section 4.2) before discussing related work (Section 4.3). 4.2. Human Visual System Most interaction with our visual world requires some shape identification or categorization. The shape of the visible portion of an object can be correctly interpreted if the distance between each point on the surface of the object, PO , and its projection onto the eye’s retina, PE , is known (Figure 4.2). I will refer to this distance as the depth at PE . Calculating the depth from the light-signal at the retina is an ill-constrained problem, because the light reaching PE could have emanated from any point along the view-ray cast through PO . 
To address this problem, the human visual system is equipped with a number of mechanisms to infer depth information from an image. The convergence of depth interpretations from different mechanisms leads to a stable perception of shape3. The different depth interpretation mechanisms that allow shape perception are collectively referred to as Shape-from-X. The most important shape cues for computer graphics applications are shading, texture, contours, and motion. Other important 3 Sometimes convergence does not occur, leading to multiple possible shape interpretations, as in the famous Necker cube illusion. It is interesting that in such cases only one interpretation can be perceived at a time and that the different perceived interpretations alternate perpetually [75]. 108 Figure 4.2. Left: Depth Ambiguity. Light reflects off a surface point PO , reaching the retina at point PE . The length of the vector |~v |, ~v = PO − PE , is the distance between the surface point and the viewer. This situation is ambiguous for the viewer, because the light could have emanated anywhere along the ray, PE + α · ~v , α ∈ R+ . Figure 4.3. Right: Tilt & Slant. The orientation of a surface at a point can be described by the tilt and slant of a thumbtack gimbal placed at that point and aligned with the surface normal. Both the length of the gimbal’s rod and the elongation of the attached disk in the image plane indicate the local surface orientation. shape cues exist, like binocular stereopsis and ocular accommodation and vergence, but are less commonly applied in a computer graphics context. 4.2.1. Shading The shading of an object is a complex function of the object’s properties, such as shape and material, as well as those of its environment, including lights, other objects and the direction from which it is viewed. For simple illumination conditions (see below) a change in surface orientation can be inferred from a change in surface shading [95, 96]. Real-time computer graphics commonly approximate realistic shading with the Phong reflection model [127], a 109 local illumination model that considers ambient light, diffuse reflection, and specular reflection. To reduce the number of free variables in my experiment, I set the ambient contribution to zero and only model diffuse reflection (Lambertian shading) as Ir = kd · I0 · dot(~n, ~`), where I0 and Ir are the incoming and reflected light intensities, respectively, ~n is the surface normal at PO , ~` is the incoming light direction (as in Figure 4.2), dot denotes the vector dotproduct and kd ∈ [0 . . . 1] indicates the diffuse reflectance properties of the object. I use a single point-light-source at infinity. Since the dot-product of two vectors changes smoothly according to the cosine of the angle between the vectors, the change in light-intensity on a Lambertian surface is a good indicator of change in surface orientation (Figure 4.1). 4.2.2. Contours Smooth changes in depth often indicate smooth changes in the shape of a surface. Depth discontinuities, on the other hand, are a likely sign of figure/ground separation (where figure is an object and ground is everything else), changes in local topology, or abutting but distinct surfaces. Such discontinuities are therefore important visual markers for the distinction of figure from ground, object components, and surfaces. Changes in surface normals and other differential geometry measures, such as principal curvatures, can also be used to mark shape discontinuities or extrema in images. 
Together, these define the set of contours of an object, some of which are shown in Figure 4.1, Contours. Note that while some contour-types depend 110 only on object shape, others also depend on the observer’s point-of-view [97]. Several nonphotorealistic rendering algorithms rely on contours to convey essential, but much condensed shape information [70, 35]. 4.2.3. Texture Texture is most often described in terms elemental texture units, called texels, and their distribution on a surface. Figure 4.1, Texture, illustrates the use of a random cell texture to indicate shape. Many natural scenes and materials contain textures, such as fields of grass or flowers, heads in a crowd, woven fabric, etc. While Gibson [56] was the first to identify and investigate the importance of texture as a depth-cue, several works have since extended his research. Cumming et al. [31] defined three distinct parameters along which texture covaries with depth: compression4, density, and perspective. They found that compression accounts for the majority of texture variation in shape, so I focus on this cue. Compression refers to the change in shape of a texel when mapped onto a surface non-orthogonal to the viewer. Another important factor in texturing is the distribution of texels on a surface, which is generally achieved through a parametrization function of the surface. This function provides a mapping relating texel distribution and orientation to surface shape. 4.2.4. Motion If an object moves relative to an observer (via translation, rotation, or a combination of the two), then points on its surface that are at different depths move at different relative speeds. Therefore, the relative movements of these points convey the underlying depths for rigid objects 4 Not to be confused with the term compression used in information theory. 111 (Figure 4.1, Motion). The rigidity constraint is necessary because plastic deformations can also lead to relative movement of surface points, and the two types of motion, rigid and plastic, cannot be distinguished visually. Such a constraint is also employed by human perception in the form of a bias towards recognizing motion as rigid, if such an interpretation is consistent with the visual stimulus [121]. 4.2.5. Limitations None of the shape cues above is a sufficient requisite for shape perception. The shading of an object may be indistinguishable from the color variation of its material. The efficiency of shape-from-texture depends largely on the texture used, its homogeneity and its parametrization, all of which are arbitrary. Contours are highly localized visual markers, requiring visual interpolation, and are commonly under-constrained. Lastly, shape-from-motion depends on the reliable tracking of surface points, as well as robust distinction of rigid and plastic motion, both of which cannot be guaranteed. This insufficiency of any single shape cue to provide robust depth-information explains the redundant shape detection mechanisms of the human visual system. 4.3. Related Work Compared to Chapter 3, the rendering techniques used in my experiment are too basic to warrant a comparison to previous work. Instead, I focus here on the related experiments that researchers have undertaken to study shape. The following discussion is structured according to the shape-from-X cues of Section 4.2. 112 4.3.1. 
Shape from Shading In early pioneering work on non-realistic shape perception, Ryan and Schwartz [141] presented participants with photographs and shaded images of objects in different configurations and measured the time it took participants to correctly identify the depicted configuration. Due to the preliminary nature of their study, they used only 3 arbitrary real objects. The configurations of the objects depended on their functionality, which may not have been known to participants. More importantly, the authors lacked computer graphics capabilities and commissioned an artist to produce the shaded images. Their experiment therefore largely measured the artist’s craftsmanship at conveying the different object configurations. Koenderink et al. [95, 96] invented a thumbtack-shaped widget, as in Figure 4.3, for participants to indicate the perceived shape on a Lambertian surface. Sweet and Ware [153] investigated interaction between shading and texture and also included specular reflections in one of their experiments. Again, participants used Koenderink’s thumbtack widget for feedback. Johnston and Passmore [83] mapped Phong-shaded spheres with band-limited random-dot textures. Instead of using the thumbtack widget they asked subjects forced-choice questions about the spheres and a paired surface patch, which was oriented in a different direction or had a different curvature to the spheres. As explained in Section 4.3.4, none of the presented evaluation techniques lend themselves to experimentation with moving objects and their results may therefore not apply to many highly dynamic computer graphics applications. Barfield et al. [5] investigated the effect of simple computer shading techniques (wireframe, flat shading with one or two light sources, and smooth shading) on the mental rotation performance (see Section 4.3.4, Mental Rotation) of participants. The mental rotation task is similar 113 to the task in my experiment, but several differences exist to enable real-time dynamic interaction: my experiment uses multiple concurrent shapes, the shapes all move, and the shapes differ in their constituent parts, not in their arrangement. Rademacher et al. [132] compared photographs to shaded computer graphics images of simple geometric shapes and asked participants whether they were seeing a photograph or synthetic image. In a similar experiment, Ferwerda et al. [47] compared the perceived realism of photographs of automobile designs with versions rendered in OpenGL and rendered with a global illumination model. As such, neither experiment directly measured the perception of shape, but the contribution of soft shading, number of lights, and surface properties to the perception of subjective realism. Kayert et al. [88] probed neural activity of Macaque monkeys using invasive surgical probes to study the modulation of inferior temporal cells to nonaccidental shape properties (NAP) versus metric shape properties (MP) of shaded objects. Such experiments can obviously not be performed on human subjects. Biederman and Bar [8] used non-sensical objects with diffuse and specular shading to compare shape perception theories based on NAP against theories based on MP. The effects of shading itself were not measured. 4.3.2. Shape from Contours In their study described above, Ryan and Schwartz [141] also presented participants with line drawings and cartoons of different object configurations. 
Because an artist generated their images, it is likely that considerable perceptual and cognitive effort went into creating effective images. At the same time, the method for creating the images cannot be described quantitatively. The results of their experiment are thus largely biased by the artist and difficult to replicate. (In fact, the different types of representations varied not only in their use of shading or lines, but also in the amount of detail that was depicted. Particularly the cartoon representations were more symbolic than literal copies of the original scene.)

Shepard and Metzler [147] evaluated mental rotation performance of three-dimensional chained cubes presented in a line-drawing style. Yuille and Steiger [184] performed follow-up work based on the same experiment. In his recognition-by-components (RBC) paper, Biederman [10] used line drawings to illustrate his theory and perform various experiments to determine the effects of reducing the number of components used to represent an object, and of deleting parts of the lines used to represent each component. The experiments were designed to support the RBC theory and not to test the effectiveness of line drawings to convey shape.

In contrast to the large corpus of research in rendering techniques for contours on polygonal meshes [35, 70], implicit surfaces [16, 128], and even volumetric data [20, 41], the literature on perceptual evaluation of contour rendering from 3-D models is relatively sparse. Gooch and Willemsen [60] performed a blind walking task in an immersive virtual environment, rendered with contours, to determine perceived distances as compared to distances estimated in the real world. They did not evaluate how contours compared to other shape cues. Another difference from my work is that Gooch and Willemsen probed estimation of quantitative distances, a task that humans find notoriously difficult, whereas my experimental design is geared towards shape estimation and categorization, for which only relative or qualitative depth information is required.

4.3.3. Shape from Texture

Various authors have shown that texture elements, aligned with the first and second principal curvature directions of a surface, are good candidates for indicating local surface shape and curvature [77, 58, 90, 153]. The specific experiments of these authors do not translate well to dynamic scenes, but it will be interesting future work to verify their results for dynamic environments using my experiment.

4.3.4. Measurements

Most work on shape perception uses one of the following established methods to measure perceived surface shape:

(1) Thumbtack gimbal. Participants place a (virtual) gimbal widget akin to a thumbtack at a particular orientation on the object's surface so that the pin's direction is aligned with the estimated surface normal. Both the direction of the pin and the eccentricity of the attached disk are used to indicate and measure estimated tilt and slant, as in Figure 4.3 [95, 98, 118, 96, 58, 90, 153].

(2) Mental Rotation. Participants are shown a pair of images in one of two configurations: (a) depicting the same object but from different viewpoints; and (b) depicting different objects. The experimenter measures the time a participant takes to decide on a given configuration. This task requires participants to mentally rotate one of the shapes to match the other and can employ 2-D shapes and rotations (in-plane) [30, 51] or 3-D shapes and rotations [147, 184, 5, 11].

(3) Exemplars/Comparisons.
Several physical objects with a variety of surface shapes are kept at hand. These are used as exemplars, so that a participant of the study can indicate the position on a surface with similar properties to that of the object being studied (e.g. [151]). When measuring perceived distances in virtual environments some experiments required users to walk the estimated distance in the real-world as an analogous measure [60, 111], while others asked participants to estimate the time it would take them to walk the perceived distance at a constant pace in the real world [129]. (4) Naming of objects. Subjects are shown depictions of real-world objects and are asked to name the depicted object as quickly as they can [10]. Discussion. While the gimbal and exemplar methods are capable of yielding highly sen- sitive quantitative data, they do not transfer to a fully dynamic context like the one I am investigating. Moving objects simply do not hold still long enough to perform these types of measurements. The same restriction applies to mental rotation, because at least one of the two shapes to be compared has to be stationary. Naming of objects requires participants to be familiar with the object they are presented with. Even if they know the name, participants might still suffer from the tip-of-the-tongue effect, where a known word is not readily verbalized [149]. My solution is to opt for the more qualitative shape perception task detailed in Section 4.4. Technically, this task is closer to shape categorization than exact shape quantification, but then, so are most everyday shape-dependent tasks. To distinguish a plate from a cup humans need to make qualitative judgements about the objects’ shapes rather than compare their exact dimensions or other geometric measures. 117 Direct vs. indirect Measurements. Another way to classify measurements is in terms of direct versus indirect measurements. Placing a widget on a surface position yields direct numerical values for the estimated surface normal. As discussed, this is not practical for moving objects. Consequently, much of the work on distance perception in Virtual Environments (VEs) uses indirect measurements. Plumert at al. [129], Gooch and Willemsen [60], and Messing and Durgin [111], have all used walking-related tasks to indirectly estimate perceived distances in VEs. The task was either to guess the time it would take to walk from the current position to another position inside the VE, or to actually walk the estimated distance in the real world without visual feedback from the VE. Indirect measurements have the disadvantage that they generally include larger individual variations and have to be related to the measure of interest via some mapping, which may introduce additional error. The advantage is that they allow a dynamic and often more natural and intuitive experimental scenario, e.g. walking versus orientating a widget using a mouse. In my experiment, I use several indirect measures of performance, allowing me to present participants with an intuitive task and interaction paradigm that evokes competitive performance levels. I address individual variations and cumulative errors in the statistical analysis of the experimental data (Section 4.6.2 and Section 4.6.3). 4.3.5. Test Shapes The test objects in previous studies can be broken down into two broad categories: (1) representational, i.e. representing real-world objects (e.g. 
flashlight, banana, table, chair) [141, 10, 11, 60, 47, 129]; and (2) non-representational (or nonsensical) objects. 118 Figure 4.4. Real-time Models. The complexity of models used in real-time and interactive applications, like games and 3-D visualizations, is often kept fairly low to minimize the computational demand of the rendering process. Many of these simple models can thus be well described in terms of generalized cylinders [12, 67]. Because the perception of representational objects is affected by a person’s familiarity with the object (e.g. a telephone versus a specialized technical instrument), most related work employed non-representational shapes, instead. Some authors have used wire-like [140, 18] or tube-shaped [5] objects that resemble bent paper-clips and pipes, while others used similar stimuli comprised of chained cubes instead of wires [147, 184, 154]. Several authors have used soft, organic, blob-shaped or amoebic objects [18, 118]. Some experiments were based on generalized cylinders [8, 11, 88]. Finally, a number of experiments, mostly those involving shape-from-texture studies, used shapes resembling undulating hills and valleys or folded cloth [151, 153, 90, 3]. My own experiment uses a form of generalized cylinders, called geons [10] (Section 4.4.6) to avoid the familiarity problem associated with real-world objects. Additionally, generalized cylinders are easily parameterized, and, unlike wires, cubes, blobs, or cloth, are flexible enough 119 to describe many basic shapes, and can be combined to form a large number of real-world objects, particularly low-resolution objects commonly found in real-time graphics (Figure 4.4). 4.4. Implementation In this Section, I use the concepts and terminology introduced in the related work (Section 4.3) to explain how my experimental design implements the goals put forth in the introduction (Section 4.1). I start off with a brief overview of the experiment to define the given task and interaction, and then list the details of implementing each of the goals. 4.4.1. Overview Participants in the experiment have to distinguish particular shapes from a set of moving objects under different display conditions. Figure 4.5 shows the experimental setup. Participants sit in front of a touch-sensitive board onto which I project moving test shapes with an overhead dataprojector. Participants are asked to select objects that share certain shape characteristics. An object is selected by simply touching a finger on the board where the object is displayed. Each experimental trial consists of different phases during which objects are displayed in different shape-from-X modes (Figure 4.6). Although I am mainly interested in shape from shading, contours, and texture, I test two additional display modes, one combining shading and contours and one using an alternative color texture (for details, see Section 4.4.2). The system records all user events, as well as several system events (Section 4.6). 120 Figure 4.5. Experimental Setup. Left: Schematic diagram of setup. A data projector projects imagery onto a conductivity-based touch-sensitive surface. The user, grounded through the seat and a foot mat, simply taps on virtual objects displayed on the surface. A single computer synthesizes imagery and gathers data. Right: Photograph of actual setup. 4.4.2. Shape Cue Variety (1) In the real world (or in photorealistic rendering) all shape cues are simultaneously present in a scene to various degrees. 
The fact that the human visual system derives shape information from numerous sources does not mean, however, that all of these sources are equally valuable or can be leveraged equally. To test the individual contributions of shape cues to shape perception, the cues have to be separated into orthogonal (mutually independent) stimuli. The following list describes the rendering techniques used to produce these non-photorealistic stimuli, which are depicted in Figure 4.6.

Figure 4.6. Display Modes. Left column: Screen-shots from the experiment for each of the display modes. Test objects are rendered onto a static background with the same visual characteristics as the foreground objects to prevent outlines from depth-discontinuities (as in the right column). Static objects are extremely difficult to identify. Once they are moving, they pop out from the background immediately. Right column: Objects highlighted visually for comparison.

(1) Outline. Of the many possible contours I can render (silhouette, outline, creases, ridges, valleys, etc. [70, 137]), I choose outlines, because they are the basis of most NPR line-style algorithms. Outlines are those edges, e, for which

    dot(v, t1) · dot(v, t2) < 0,    (4.1)

with e = {t1, t2}, where v is the view vector, and e is the edge shared by the triangles with normals t1 and t2. (I use triangles for the outline definition because triangles form a generic geometric primitive supported by most rendering systems.) There exist many efficient methods to implement Equation 4.1 [70], but one of the fastest methods is a two-pass rendering approach in which first only back-facing (dot(v, tb) > 0) triangles are rendered into the graphics card's depth buffer, followed by front-facing (dot(v, tf) < 0) triangles that are rendered into the color buffer with OpenGL's line drawing mode (a minimal sketch of such a two-pass approach is given after this list). Since colored, shaded, and textured objects create a natural silhouette (a type of contour) against a differently colored, shaded, or textured background (right column in Figure 4.6), I have to ensure that the other display modes do not inadvertently create a contour cue. To ensure this, I fill the display background with static random elements as in Figure 4.6, left column. These backgrounds are designed to resemble the current display mode without containing any complete instance of the 16 test objects (this ensures that participants are not exposed to targets in the background with which they might try to interact) and thereby create a homogeneous display. While this makes static identification extremely difficult, the test objects pop out immediately from their surroundings when animated (see also the discussion on Motion, below).

(2) Shading. The open graphics language (OpenGL [177]) provides built-in support for the shading model described in Section 4.2.1. To obtain smooth shading across triangles approximating curved surfaces, OpenGL interpolates the normals at the vertices. Special care needs to be taken when rendering sharp edges (e.g. the boxes in Row 3 of Figure 4.10) to prevent the interpolation scheme from visually smoothing out these edges. I achieve this with so-called smoothing groups, which only interpolate normals within each group, but not across groups. This requires specifying several normals per vertex, depending on which face references the vertex.

(3) Mixed. In anticipation that Shading and Outline might yield statistically indistinguishable data, I add a Mixed mode combining the two, to test for cumulative effects (see Section 4.7.3 for a discussion of possible interactions between shape cues).

(4) TexISO & TexNOI. For texturing I rely on OpenGL's built-in texturing capabilities. I use a trichromatic random-design texture, which is sphere-mapped onto the objects. (Along with the choices for lighting model and contour type, this mapping was picked with reason but can still be considered fairly arbitrary; there exist many more possible mappings than can be explored in this dissertation, see Section 6.2 for further discussion.) To prevent the texture cue from interfering with the shading cue, the colors of the texture are isoluminant (TexISO mode), i.e. they have different chrominance values but the same luminance (note that the colors may reproduce non-isoluminantly on your printer or display device). The colors are chosen to roughly fall within the red, green, and blue parts of the color spectrum, and are calibrated for equal luminance at the participant's head position using a Pentax Spotmeter V lightmeter. In case an isoluminant texture interacts particularly strongly with motion [104], I include a control texture mode without isoluminant colors (TexNOI mode).
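To make the two-pass idea concrete, the following is a minimal fixed-function OpenGL sketch of one common variant of this technique: it lays down depth for the front faces and then draws the back faces in line mode, so that only a thin band around the silhouette survives the depth test. The helper drawModel() is a hypothetical callback that issues the object's triangles; the actual implementation used in the experiment may differ in details such as pass order, depth-test configuration, and line width.

```cpp
// Minimal two-pass silhouette/outline sketch (fixed-function OpenGL).
// Assumption: drawModel() renders the triangle mesh of one test object.
#include <GL/gl.h>

extern void drawModel();

void drawOutlineMode()
{
    glEnable(GL_DEPTH_TEST);
    glEnable(GL_CULL_FACE);

    // Pass 1: fill the depth buffer with the front-facing surface,
    // without writing any color.
    glCullFace(GL_BACK);                                  // keep front faces
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
    drawModel();

    // Pass 2: draw back faces as thick lines. Inside the object's footprint
    // these lines lie behind the front surface and fail the depth test; only
    // the part of each thick line that spills just outside the silhouette
    // remains visible, producing the outline.
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
    glCullFace(GL_FRONT);                                 // keep back faces
    glPolygonMode(GL_BACK, GL_LINE);
    glLineWidth(3.0f);
    glDepthFunc(GL_LEQUAL);
    glColor3f(0.0f, 0.0f, 0.0f);                          // black outline
    drawModel();

    // Restore default state.
    glPolygonMode(GL_FRONT_AND_BACK, GL_FILL);
    glDepthFunc(GL_LESS);
    glCullFace(GL_BACK);
}
```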
(4) TexISO & TexNOI. For texturing I rely on OpenGL’s built-in texturing capabilities. I use a trichromatic random-design texture, which is sphere-mapped onto the objects8. To prevent the texture cue from interfering with the shading cue, the colors of the texture are isoluminant (T exISO mode), i.e. they have different chrominance values but the same luminance9. The colors are chosen to roughly fall within the red, green, and blue part of the color spectrum, and are calibrated for equal luminance at the participant’s head position using a Pentax Spotmeter V lightmeter. In case that 7 See Section 4.7.3 for a discussion of possible interactions between shape cues. Along with the choices for lighting model and contour type, this mapping was picked with reason but can still be considered fairly arbitrary. There exist many more possible mappings than can be explored in this dissertation. See Section 6.2 for further discussion. 9 Note that colors may reproduce non-isoluminant on your printer or display device. 8 124 an isoluminant texture interacts particularly strongly with motion [104], I include a control texture mode without isoluminant colors (T exN OI mode). Motion. As illustrated in Figure 4.1, motion itself is a shape cue. There are several rea- sons why I do not separate motion from the other shape cues. First, many real-time graphics environments, including games, immersive VR, and visualizations contain a significant number of dynamic elements. Since shape cues may perform differently in highly dynamic environments compared to static environments, it is sensible to include motion in the experimental setup. Second, shape-from-Motion relies on discernible parts of an object to move. In general, these parts are only discernible because of their color, shading, contour, or texture properties. Although motion is processed independently in separate cortical structures (area V5, to be specific [188]) these structures rely on the output from other cortical areas. It is therefore simply impractical to separate motion from other shape cues. One concern might still be that motion interacts with (depends on) some cues more strongly than with others. Critics of my experiment have even ventured that motion is the main effect I am able to measure. My position is, as above, that I am interested in the effectiveness of different shape cues in dynamic environments, irrespective of any naturally existing bias that may exist and which is therefore also part of any real or virtual dynamic scene. In response to the above criticism, I refer to the results in Section 4.7, showing significant performance differences for all different types of shape cues. 4.4.3. Adjusting Task Difficulty (2) If an experimental task is too easy all participants are likely to excel and no statistical variation can be measured to compute the effect of different experimental conditions. The same applies 125 to a task that is too difficult. If no participant is able to perform the task, no meaningful data can be gathered. If we think of the performance graph as a simple statistical curve with total failure (0%) on the one side and absolute success (100%) on the other, then there exists some transitional region in the middle that represents the performance threshold for that task. The best data is measured at the point of greatest slope in the transition region because small changes in subjective task difficulty lead to the greatest effect in measured performance. 
Of course, this point is difficult to find in practice because it depends on many variables, including priming, learning, day-to-day form, and individual variability. Unlike physical performance measures like speed and strength, perceptual performance tends not to vary as greatly between participants. Small variations do exist, though, and should be accounted for in the task design. In short, the given task should be difficult enough to challenge experts, yet manageable for novices. I design towards this goal by including two system parameters that affect task difficulty: object speed and object count. The speed of objects is set low enough to ensure adequate visual detection and interaction; that is, all participants can interact with at least some objects, and the number of objects they interact with, correctly or incorrectly, determines their performance. The number of objects, on the other hand, is set high enough to make a perfect trial (correct interaction with all objects) unlikely, even for an expert. I determined the actual parameters for speed and object count heuristically by performing a small set of trials.

4.4.4. Simple Interaction (3)

Version 0.1: What not to do. In a first version of the experiment, I initially used a task that required participants to drive a virtual car through a winding course (Figure 4.7). The idea was to use a task that participants were used to from daily experience and that would bear practical relevance for interactive tasks in computer graphics applications (e.g. navigation and orienteering). A fair amount of effort and coding went into modeling the car's interaction with gravity, the terrain, friction, inertia, etc., down to the wheels spinning at the correct rotational velocity when in contact with the ground. I took every precaution to ensure that the car would handle as expected of a real car, yet would be easy to drive. I even set the initial acceleration and maximum speed so that participants only had to steer right or left. Despite all this, the experiment turned out to be a failure because for the majority of participants the driving task was simply too difficult. Because my intention in this chapter is not merely to introduce a single experiment, but rather a flexible methodology, I believe the mistakes of that first design are instructive as good examples of what problems to be aware of and what design decisions can impact interaction performance (the main discussion continues with the Conclusion paragraph below).

Adjusting task difficulty. As discussed above, the task should accommodate participants with different performance levels or skills. In the car experiment, the main means of adjusting the task difficulty was to change the maximum speed of the car. Because a faster car allowed less reaction time, the task difficulty could be increased. Due to the realistic physics, however, a faster car was also more difficult to control, would break away in sharp turns, etc. (Figure 4.7, right). Despite several trials to establish a good common speed, the interaction turned out to be too difficult for most participants and too easy for some.

Measuring performance. I employed several indirect performance measurements in the car experiment, including lap time and deviation from an ideal path. In the analysis of the data, I determined that lap time was directly correlated with deviation from the ideal path and that the deviation was mostly a factor of interaction difficulty and not of the difficulty in perceiving the shape of the driving terrain.
Figure 4.7. The First Version of the Experiment. Left: Screen-shots of single (top row) and dual (bottom row) shape cue modes. The outline mode provides a strong visual cue for the bottom of the valley (center of screen), a good guide to drive towards. Shading provides the same cue although less pronounced, but additionally yields curvature information. Texture does not provide much shape information in a static screen-shot but helps to produce shape-from-motion when animated. Right: Analysis view (top-down) of 1.5 laps of a good driving performance with inset showing the participant's view at the simulated time. Note how the driver undulated around the ideal path, even on straight sections.

Participants constantly oversteered in one direction, followed by overcompensation in the opposite direction (Figure 4.7, right). As a result, the distance traveled by some participants was almost 1.7 times that of the ideal path, and many participants suffered various degrees of motion sickness.

Viewpoint. To increase the available reaction time and give a better sense of spatial awareness, the car experiment featured a third-person (bird's eye) view of the scene, common in many games (Figure 4.7, left). This allowed participants to see further ahead and perceive the car in its environmental context. It also placed the participants in a very unusual position to drive a car, and most participants had trouble adjusting to this view. As it turned out, participants with game experience had few problems adjusting to the third-person perspective, while for most others the cognitive leap seemed too large without significant practice.

Conclusion. The interaction with the system should fulfil the following requirements:
• Learning – The interaction should be simple to learn so that the amount of necessary training per participant is minimized.
• Intuition – The interaction should be intuitive, so that participants are not preoccupied with remembering arbitrary mappings between interaction and task.
• Unobtrusiveness – The interaction should not be obtrusive (e.g. by attaching many wires or restrictive head-gear) to ensure that participants behave naturally.
• Dynamics – The interaction should be suitable for a task with moving objects.
The car experiment was able to address the dynamics and unobtrusiveness requirements, but appeared difficult to learn and unintuitive for some participants.

Solution. My approach in the present version of the experiment is a touch-to-select interaction paradigm (Figure 4.5), which requires no technical skills (in fact, pointing is one of the earliest methods of gestural expression, emerging already in infancy [50]) and is commonplace in many social situations (simple learning). To the best of my knowledge, the use of a touch-table interface for shape perception studies is novel and offers three distinct advantages. First, it removes the level of indirection associated with many pointing devices (intuition). Mice, for example, translate motion on the plane of their supporting surface (e.g. a table) into motion in the plane of the display device. These two planes are commonly perpendicular to each other, resulting in some learning effort for novice users. Second, it eliminates the need for a cursor to indicate the current position of the pointing device, which might otherwise distract. Third, it does not require any external pointing device (unobtrusiveness).
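How a touch is mapped to an object selection ties directly into the events recorded for evaluation (Section 4.6.1). The sketch below shows one plausible way to classify a single touch as a correct, incorrect, or missed selection; all type and function names are hypothetical, since the dissertation does not show its actual event-handling code.

```cpp
// Sketch: classifying one touch event against the currently displayed objects.
// MovingObject, TouchResult, and classifyTouch are hypothetical names.
#include <cmath>
#include <cstddef>
#include <vector>

struct MovingObject {
    float x, y;        // current position in display coordinates
    float radius;      // approximate on-screen extent
    bool  isTarget;    // target vs. distractor type
};

enum TouchResult { CORRECT, INCORRECT, MISSED };

// Returns the event type for a touch at (tx, ty); *hitIndex receives the
// index of the touched object, or -1 if only the background was hit.
TouchResult classifyTouch(const std::vector<MovingObject>& objects,
                          float tx, float ty, int* hitIndex)
{
    *hitIndex = -1;
    float bestDist = 1e30f;
    for (std::size_t i = 0; i < objects.size(); ++i) {
        const float dx = objects[i].x - tx;
        const float dy = objects[i].y - ty;
        const float d  = std::sqrt(dx * dx + dy * dy);
        if (d <= objects[i].radius && d < bestDist) {  // touch inside footprint
            bestDist  = d;
            *hitIndex = static_cast<int>(i);
        }
    }
    if (*hitIndex < 0)
        return MISSED;                                 // background touched
    return objects[*hitIndex].isTarget ? CORRECT : INCORRECT;
}
```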
A disadvantage of the particular front-projection model I use is that the participant's hands cast visible shadows on the display. Interestingly, only 7 of the 21 participants ever noticed these shadows, and of those 7, only 3 felt somewhat impaired by the shadows.

4.4.5. High-demand Task/Time-Constraint (4)

In shape discrimination tasks involving reaction time there are two conceptual methods to increase the task difficulty. One is to decrease the perceptual difference between target and distractor stimuli, thereby increasing the time participants take to correctly distinguish the two. The other is to decrease the time participants have to make a distinction. I implement the second method by exposing participants to a large number of potential targets (not by minimizing the time each target is displayed) so that there is less reaction time available per target. (Of course, participants can choose to ignore most targets and instead focus on only a few, but with more accuracy; my experiment accounts for such strategic variations, Section 4.6.) The reason I prefer the second method is that, in my opinion, it better reflects the type of real-life dynamic interactions we encounter in our daily lives. When driving down a road we have to make split-second decisions about objects that are either negligible or potential hazards. When playing sports, or even just walking in a crowded room, we have to constantly make decisions about the shape and motion of our surroundings, often without much time to make these decisions. In a severely time-constrained situation, we may not have the luxury of taking in all the available evidence and making a correct decision. Instead, we have to make a best-effort decision with the information available at the time. Additionally, it is known that humans can only attend to a rather limited number of perceptual stimuli at a time [39, 160]. This may lead to shape cue prioritization not evident in static experiments but important for real-time graphics applications.

4.4.6. Test Shapes (5)

Section 4.3.5 briefly discussed the different test shapes that have featured in previous work. I now pick up some of the concepts introduced there to define what I consider to be important characteristics of a test shape set.

Familiarity Bias. Familiarity with the shapes should not influence their perception. Some people might be more familiar with dogs than with chinchillas, and this familiarity may influence the reaction time and accuracy for shape-dependent tasks, particularly those using naming times as measurements.

Generality. The test shapes should represent a large number of basic shapes. If, for example, a shape set only consists of shapes with right angles, then the results I obtain from a study using these shapes do not apply to rounded shapes, or shapes with non-orthogonal angles.

Relevance. While I prefer nonsensical shapes to avoid familiarity biases, I want the set of shapes to reasonably approximate a number of real-world objects in conjunction with each other. For example, stick-shaped objects could be used to model a broom or rake, but they would make poor components for building voluminous objects like a book or a refrigerator.

Parametrization. In reporting my experimental findings, I should be able to parameterize the shapes that I used, so that others can get an intuition for the types of shapes the findings are valid for and so that they can replicate my results. A counter-example I mentioned previously is the work by Ryan and Schwartz [141].
They used images of a hand, a switch, and a steam valve as stimuli. No obvious correlation exists between these objects, they are not representative of a class of objects, and it is not obvious which of their shape characteristics had an influence on the experimental outcome.

Shape Theories. A number of different theories exist to explain human perception of shape [94, 108, 57, 10, 64], each with its own strengths and limitations. While some theories are based on exemplars, others define metric properties, or non-accidental properties of shapes. (An in-depth discussion of the different theories is beyond the scope of this dissertation, but the interested reader might find Gordon [64] useful as a starting reference.) For the purpose of my experiment, I can choose any theory that suitably fulfills my requirements of non-bias, generality, relevance, and parametrization. An important point to note is that I do not actually require the theory to be valid in terms of modeling human perception, because I use the theory only to model shapes, not to model perception (a good theory can, of course, help to interpret the results of an experiment).

Geon Theory. One theory that suits my requirements, and which describes shapes that are easily parameterized with standard computer graphics techniques, is Biederman's [10] recognition-by-components (RBC), or geon theory. A geon is a volumetric shape similar to a generalized cylinder [12, 67], i.e. a volume constructed by sweeping a two-dimensional shape along a possibly curved axis. Geons can vary in terms of the geometry and symmetry of the sweeping shape and axis. Biederman defined four categories in which geons may vary, by imposing the restriction that each category must produce non-accidental viewing features. That is, the feature (e.g. curved vs. straight sweeping axis) must be evident for all but a small number of distinct accidental views, and slight perturbation of an accidental view must reveal the feature. This restriction ensures that unique geon identification is invariant to most translations and rotations. To construct compound objects, geon theory devises a hierarchical network, whose nodes are geons and whose structure indicates relative geon positioning. Like all other existing shape categorization theories, geon theory faces problems of generality, i.e. it is not evident how subtle shape differences, like those between apples and oranges, could be modeled. Nonetheless, the simple shapes described by geon theory are reminiscent of the basic shape primitives found in many virtual environments, architectural mockups, and computer games (Figure 4.4). The principled categorization of geon shapes further allows me to specify exactly the types of shapes and objects for which my experimental results are valid.

Figure 4.8. Constructing Shapes. Each experimental object consists of a main body and two attachments. To ensure that attachments are visible from any viewpoint, they are duplicated for each object, mirrored, and rotated by 90°. Colors are used only for illustration purposes.

Figure 4.9. Shape Categories. Both the main body of objects and the attachments vary along two non-accidental shape categories. Main Body: Main bodies either have a round or square cross-section, and have a longitudinal axis of constant or tapered width. Attachments: Attachments also have a round or square cross-section, but their longitudinal axis is either straight or curved.

Constructing Shapes.
To construct shapes for my experiment, I combine a main body shape with two identical but mirrored and rotated attachment shapes (Figure 4.8). Because a single large shape is easy to see and differentiate, participants are instructed to ignore the shape of the main body and only differentiate shapes according to their attachments. The main body therefore merely serves to increase the difficulty of the perceptual task without increasing the cognitive load on participants (compared to a hypothetical combination task, where participants would have to look for attachments only on particular main body shapes). The attachments are duplicated and transformed so that they are visible from any direction as the compound object moves across the display. The main bodies and attachments vary along three parametric dimensions (CS, LS, LA, see Figure 4.10 caption), adapted from a subset of Biederman's descriptors. Main bodies vary according to their cross-section (CS) and longitudinal size (LS), whereas attachments vary according to their cross-section and longitudinal axis (LA) (Figure 4.9). Parameters for the construction of the main body and attachments are chosen to yield an approximately constant volume, to ensure that the average display size of objects under random rotation is approximately equal.

Figure 4.10. Experiment Object Matrix. The complete set of experimental objects comprised of 2 × 2 × 2 × 2 = 16 shape permutations. The variational parameters are: CS, cross-section (square/round); LA, longitudinal axis (straight/curved); LS, longitudinal size (constant/tapered). Parameters for these properties are chosen to preserve the volumes of main bodies and attachments.

Targets and Distractors. Together, the permutations of parameters add up to 2 (main body shape) × 2³ (3 parametric dimensions with 2 choices each) = 16 objects, listed in Figure 4.10. For each trial of the experiment, a different column (same attachments, different main body) in Figure 4.10 is selected as the set of target objects, with the remaining objects acting as distractors.
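The object matrix can be expressed as a few lines of construction code. The sketch below, using hypothetical type names that mirror the CS/LS/LA descriptors of Figure 4.10 rather than the dissertation's actual data structures, enumerates the 2 × 2 × 2 × 2 = 16 parameter combinations; objects sharing the same attachment parameters correspond to one column of Figure 4.10 and hence to one possible target set.

```cpp
// Sketch: enumerating the 16 test objects from their binary shape parameters.
#include <vector>

enum CrossSection      { ROUND, SQUARE };      // CS
enum LongitudinalSize  { CONSTANT, TAPERED };  // LS (main body)
enum LongitudinalAxis  { STRAIGHT, CURVED };   // LA (attachments)

struct TestObject {
    CrossSection     bodyCS;
    LongitudinalSize bodyLS;
    CrossSection     attachCS;
    LongitudinalAxis attachLA;
};

// Builds the full 2 x 2 x 2 x 2 = 16 object matrix. Objects that share
// (attachCS, attachLA) form one column of Figure 4.10, i.e. one target set.
std::vector<TestObject> buildObjectMatrix()
{
    std::vector<TestObject> objects;
    for (int bcs = 0; bcs < 2; ++bcs)
        for (int bls = 0; bls < 2; ++bls)
            for (int acs = 0; acs < 2; ++acs)
                for (int ala = 0; ala < 2; ++ala) {
                    TestObject o;
                    o.bodyCS   = static_cast<CrossSection>(bcs);
                    o.bodyLS   = static_cast<LongitudinalSize>(bls);
                    o.attachCS = static_cast<CrossSection>(acs);
                    o.attachLA = static_cast<LongitudinalAxis>(ala);
                    objects.push_back(o);
                }
    return objects;  // 16 permutations
}
```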
4.4.7. Learning and Biasing (6)

The performance of participants in a visual perception experiment depends partly on the participants (e.g. acuteness of vision, reaction time) and partly on the experimental setup (e.g. display modes, test shapes). The setup parameters affecting performance should not be predictable, to ensure that the experimental data reflects the participants' perceptual abilities and not their memory or deductive reasoning skills. For example, if participants knew that every fourth object was a target object, while the three intermediate objects were distractors, then they would only have to recognize one target correctly and thence continue to count. In the car experiment described in Section 4.4.4, I only used two different tracks, one for training and one for the trials, so that participants' performances were the aggregated effect of the different visual stimuli and of learning the curves of the trial track. Such aggregation can be separated into its constituent components using statistical techniques, but this requires many more independent trials. In practice, learning of some sort is often unavoidable, even if it is only to practice an interaction technique or to internalize the experimental instructions. To ensure that this learning does not affect the experimental data, the experimenter can perform a practice trial without collecting data.

A system bias is an effect that results in two experimental conditions differing by any other measure than the intended free variable. If, for example, the target-to-distractor ratio of the Outline mode was different from that of the Shading mode, then this could affect the experimental data even if the two modes otherwise behaved identically in terms of their ability to provide shape information. The following paragraphs list the precautions I took to minimize learning effects and biases during experimental trials.

Display Strategies. Because the different shape cues provide different shape information, participants have to develop varied strategies to distinguish targets from distractors (Figure 4.11). For example: flat, shaded surfaces are single-colored, while curved, shaded surfaces show color gradients. Outlines, on the other hand, do not use color at all. To enable participants to develop a strategy for each display mode, and to ensure that the learned strategy for one mode does not bias the performance in a later display mode that uses a similar strategy, I require participants to perform a trial run that shows all display modes in random order (the detailed experimental procedure is listed in Section 4.5).

Figure 4.11. Mistaken Identity. During the experiment, objects constantly rotate randomly. This ensures that the objects can be viewed from all directions, generates a depth-from-motion cue, and separates objects from the background. The rotation also increases the likelihood of accidental views for which some objects may look alike. By definition, accidental views are inherently unstable and will quickly disambiguate, but the viewer is required to track objects for a finite amount of time to reliably interpret the scene. Labels in the image correspond to the labels in Figure 4.10, i.e. objects with the same label are different views of the same object, while objects with different labels are views of different objects. The top row illustrates different objects that look similar for some views. The bottom row shows that for some views (middle) the silhouette of two objects (left and right) can be identical. Different perceptual strategies may be necessary to disambiguate similar looking shapes.

Randomization. To avoid introducing a system bias into the experiment, I fully randomize all system variables. In particular, I randomize the order in which the columns in Figure 4.10 are chosen as target objects, including the practice trial. The order of the 5 display modes for each trial and the practice trial is also random. Objects move across the screen in random linear paths, but I ensure that they always cross the entire display, that all objects take the same amount of time to cross the display (8 seconds), and that they constantly rotate at similar speeds (between 1-3 radians per second). I determined the values for linear and angular velocities empirically based on a small group of participants. I ensure that the ratio of target objects to distractor objects always remains 1 : 4 by only using the fixed set of objects shown in Figure 4.10. When any object is selected by a participant (correctly or incorrectly), or when an object has crossed the display entirely, it is re-initialized, which causes its trajectory and rotational velocity to be reset. The object is also deactivated for a random time between 2-7 seconds, so that participants cannot anticipate the type of object re-appearing on the display.
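The randomization rules above amount to a short re-initialization routine per object. The sketch below uses the values stated in the text (8-second crossing time, rotation between 1 and 3 rad/s, deactivation for 2-7 s); the structure, the rand()-based sampling, and the left-to-right path are simplifying assumptions rather than the dissertation's actual code.

```cpp
// Sketch: re-initializing an object after it is touched or leaves the display.
#include <cstdlib>

struct SceneObject {
    float startX, startY, endX, endY;  // endpoints of the linear path
    float crossingTime;                // seconds to traverse the display
    float angularSpeed;                // radians per second
    float inactiveTime;                // delay before the object reappears
};

static float randomRange(float lo, float hi)
{
    return lo + (hi - lo) * (std::rand() / static_cast<float>(RAND_MAX));
}

void reinitialize(SceneObject& obj, float displayWidth, float displayHeight)
{
    // One simple way to guarantee that the path crosses the entire display:
    // enter on the left edge and exit on the right edge at random heights.
    obj.startX = 0.0f;
    obj.startY = randomRange(0.0f, displayHeight);
    obj.endX   = displayWidth;
    obj.endY   = randomRange(0.0f, displayHeight);

    obj.crossingTime = 8.0f;                     // constant crossing time
    obj.angularSpeed = randomRange(1.0f, 3.0f);  // rad/s
    obj.inactiveTime = randomRange(2.0f, 7.0f);  // s, prevents anticipation
}
```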
Hardware & Software I implemented the entire experimental system in C++, using OpenGL for rendering. The system, including rendering and data acquisition, ran on an AMD AthlonTM 3500+ with 2Gb RAM and displayed via a Dell 3300MP digital projector. Participants interacted via a touch-sensitive DiamondTouch table interface (Figure 4.5). 4.5. Procedure All participants are asked to be seated in front of the experimentation desk and given brief oral instructions as to the duration of the experiment as well as the interaction method. They are then asked to follow the instructions on the screen and address any questions they may have after reading the instructions but before commencing the experiment. The participants wear ear-mufflers and sit in an isolated partition to minimize distractions. Participants read a few short pages of instructions. The instructions introduce the objects that are used in the experiment and explain with text and visual examples how to differentiate targets from distractors. To advance to the next instruction page, participants use the same interaction as in the experiment (i.e. touching the table). Participants are then asked to perform a short practice trial (20 seconds per display mode). In the practice trial, participants are given visual feedback about their performance. The instructions state that such feedback is not given during the experimental trials. The user display is replicated on an external display visible to the experimenter who can 139 monitor the participants’ performances and note any obvious problems (e.g. a participant only hitting distractors instead of targets). After the practice trial, the experimental trials begin. Each experimental trial is preceded by a single instruction summary page, followed by a textual description and visual example of targets versus distractors. Afterwards, the actual trial begins. Each trial consists of the same set of targets shown in all 5 display modes in randomized order. Each display mode is shown for 60 seconds, followed by a fade-to-black and several seconds of darkness to prevent delayed interactions from one mode affecting the following mode. Each trial ends with intructions to the participants informing them of the completion of the trial and allowing them to rest for up to a minute before continuing. Altogether, each participant performs 4 trials, one for each column in Figure 4.10, for a total time of about 25-30 minutes, including the practice trial and rest-periods. After the last trial, participants are asked to fill out a short questionnaire with yes/no and Likert-type (1 to 5) questions (Figure B.1) to collect subjective ratings for shape cues, personal performance, experimental duration, fatigue, and discomfort. 4.6. Evaluation In Section 4.3.4, I explained the different measurement methods commonly used in shape perception experiments. In this Section, I discuss the direct measurements I gather for each participant (Section 4.6.1) and how these are converted into indirect measures to discount individual variations due to risk disposition and interaction strategy (Section 4.6.2). Finally, Section 4.6.3 describes the statistical analysis of the acquired data. 140 4.6.1. Measurements For each trial of each participant, the system records the following named interaction events (direct measurements), along with time-stamps: shots The number of times a participant indicates an object-selection by touching the table input-device. correct The number of times the touched object is of the target type. 
incorrect — The number of times the touched object is of the distractor type.

missed — The number of times the participant touches the background instead of an object. This is seldom due to a participant mistaking the background for an actual object, but rather because of imprecise hand-eye coordination (missed events are generally followed immediately by correct or incorrect events).

Using the above definitions, it is always true that shots = correct + incorrect + missed.

The system also records the following system events:

lost — The number of target objects that traverse the screen completely without intervention. This happens if the participant fails to identify the object as being of the target type, or if the participant is too busy interacting with other objects.

initialized — The number of objects that are initialized to traverse the screen. An object is re-initialized every time the participant touches it, or if it traverses the screen completely without interaction.

4.6.2. Aggregate (Indirect) Measures

Independent of their objective performance skills, participants' data in task-driven studies is sensitive to subjective factors such as competitiveness, strategy, and risk profile. In my experiment some people might only take a shot if they are very certain of the object type they are selecting, while others might take as many shots as possible while risking false target identification. To normalize for these factors, and to answer other performance questions that cannot be measured directly, I define the following aggregate measures in terms of their name, the question the measure is supposed to help answer, a motivation for the measure's definition, the formula to compute the measure, and the measure's scale/range. All aggregate measures are defined in terms of the direct measures of Section 4.6.1.

success — Of the shots taken, how many are correct? — Some participants might be more aggressive and willing to take risks. If they shoot often, then their absolute correct count might be high despite also making many mistakes. To compare such participants to more conservative ones, who shoot less but are more accurate, I normalize the correct count by the number of shots taken:

    success = correct / shots.

Scale: The range of success is normalized, with a value of 1 indicating a perfect score.

failure — Of the shots taken, how many are incorrect? — An equivalent motivation as for success (above) applies here:

    failure = incorrect / shots.

Scale: The range of failure is normalized, with a value of 1 indicating that all attempted shots were incorrect.

risk — Of the objects crossing the screen, how many shot attempts are taken? — To assess the risk profile of a participant, I measure the readiness of that participant to shoot at an object crossing the screen. The higher the risk measure, the more a participant is willing to risk an incorrect target choice (or the more confident the participant is in his or her decision). This makes risk more of a personality measure than a performance measure:

    risk = shots / (initialized − shots).

Scale: Because each shot itself causes an object re-initialization (to keep the number of objects on the screen roughly constant), I subtract shots from the denominator. This means that risk has no upper bound (and is not normalized to 1). A risk value of 0 indicates no risk (no shots). A value of 1 indicates very high risk (the number of shots equals the number of freely initialized objects).
Values above 1 indicate extreme risk (more shot attempts than new objects traversing the screen), but no participant in the study exhibited such risk behavior.

placement — How good is each participant's hand-eye coordination, and is this a function of display mode? — When participants interact with the system they are supposed to select a target object (whether their actual selection is correct or incorrect is irrelevant). Failure to do so (selecting the background instead) indicates poor hand-eye coordination, which may be linked to the display mode:

    placement = (correct + incorrect) / shots = (shots − missed) / shots.

Scale: The range of placement is normalized, with a value of 1 indicating perfect placement.

detection — How well can correct target objects be detected on the display? — This measure compares the correctly shot targets to the number of target objects that traversed the screen completely without having been shot at. Because lack of hand-eye coordination can lower the number of correctly identified targets that are actually shot, the numerator takes into account both the correct shots, as well as a fraction of the missed shots that likely would have been correct given overall performance:

    detection = (correct + missed · correct / (correct + incorrect)) / lost.

Scale: Like risk, the range of detection is not normalized to 1, but uses the value of 1 as a qualitative threshold. A detection value of 0 means that no objects were correctly detected. A value of 1 indicates that half the target-type objects were detected. A value of 2 means that twice as many target-type objects were detected as were lost, etc.
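For reference, the aggregate measures can be computed directly from the recorded event counts. The sketch below implements the formulas above; the struct and function names are hypothetical, and the code assumes non-degenerate counts (at least one shot and one lost object per trial) to avoid divisions by zero.

```cpp
// Sketch: computing the aggregate measures of Section 4.6.2 from the
// direct measurements of Section 4.6.1.
struct DirectCounts {
    int shots, correct, incorrect, missed;  // interaction events
    int lost, initialized;                  // system events
};

struct AggregateMeasures {
    double success, failure, risk, placement, detection;
};

AggregateMeasures aggregate(const DirectCounts& c)
{
    AggregateMeasures m;
    m.success   = static_cast<double>(c.correct)   / c.shots;
    m.failure   = static_cast<double>(c.incorrect) / c.shots;
    m.risk      = static_cast<double>(c.shots) / (c.initialized - c.shots);
    m.placement = static_cast<double>(c.shots - c.missed) / c.shots;

    // Credit a fraction of the missed shots as likely-correct selections,
    // in proportion to the participant's overall hit accuracy.
    const double hitAccuracy =
        static_cast<double>(c.correct) / (c.correct + c.incorrect);
    m.detection = (c.correct + c.missed * hitAccuracy) / c.lost;
    return m;
}
```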
4.6.3. Analysis

I first convert the direct measurements into aggregate measures and average the latter over the 4 trials that each participant performs (Figure 4.13 and Figure 4.14). I then average these per-participant averages (Table B.1–B.5) over each display mode (Figure 4.12) to obtain overall performance values for each display mode. To establish whether different shape-from-X display modes have a significant effect on aggregate performance measures, I perform a repeated-measures analysis of variance (ANOVA) for each of the aggregate measures (Table 4.1).

Figure 4.12. Aggregate Comparison. These bar graphs show aggregate measures averaged over all participants and all trials. Error bars are normalized standard errors. Note the different vertical scales (to show small details) when comparing. Interpretations for scales are given in Section 4.6.2.

Figure 4.13. Detailed Aggregate Measures. These charts show per-participant values for each display mode and each aggregate measure. Most data is reasonably normally distributed (see Figure 4.14), with occasional outliers, some of which are extreme. Individual performance appears mostly consistent throughout, in accordance with the general analysis remarks in this Section. Note the different scales of the individual charts.

Figure 4.14. Detailed Aggregate Measures Histograms. These charts show the histograms (frequency distributions) for the detailed participant data in Figure 4.13. Despite the erratic appearance of the traces (the number of participants limits the resolution of the histogram) the distributions appear to be evenly or normally distributed and no clustering is evident. Note the different scales of the individual charts.

The ANOVA analysis only determines the overall effect of display modes. Further analysis of pairs of results for different display modes using Student's t-test allows me to detect significantly different means for each display mode pair. Table 4.2 shows the t-test results for all combinations of display modes. Since the likelihood of a false positive is higher across multiple tests than for each individual test, I use the highly conservative Bonferroni correction, which divides the alpha-value, α, by the number of tests, n, so that α → α/n (with α = 0.05 and n = 10 pairwise mode comparisons, this yields the significance threshold of p < 0.005 used in Table 4.2).

4.7. Results and Discussion

I tested 21 participants (8 female, 13 male), comprising graduate students and university staff volunteers. All participants had normal or corrected-to-normal vision.

4.7.1. Data Consistency and Distribution

The scatter-spread charts in Figure 4.13 show that the group of participants as a whole performed fairly consistently (the vertical ordering of participants does not change dramatically between the different display modes). Good performers generally performed well throughout, and poor performers generally performed worse for most modes. Some notable outliers are evident, though. One extreme outlier (more than three interquartile ranges from the third quartile) is recorded for participant BA056's Mixed-risk performance. Further analysis of this trial data shows that the participant attempted more than double the number of shots during the first trial compared to the remaining trials. Although this led to an 18% higher success rate in the first trial, it also resulted in an up to ten-fold higher failure rate. The reason for this abnormality is difficult to ascertain because the remaining trials show much more moderation in terms of shots attempted. Since no feedback was given during or between trials, the participant decided to adjust his or her interaction behavior autonomously. I eliminate the above-mentioned extreme outlier from the analysis lest I contaminate the remaining data. Another interesting example is BA055's Mixed-detection performance, which, in this case, is exceptionally good. As above, this anomaly is due to the performance of a single trial (the third). During this trial the participant lost almost no objects, and detection becomes very sensitive for very low values of lost due to not being a normalized measure. It should be noted, however, that BA055's detection score is among the highest of all participants for all display modes. Given this detection proficiency, it is somewhat surprising that the participant's success scores are only average, especially considering that the failure scores are also among the lowest of all participants. A possible reason may be the fairly low placement scores (hitting the background instead of objects). Judging by the low failure scores, the poor placement is not likely due to mistaking the background for real objects, but more likely due to poor hand-eye coordination or fatigue (most missed objects occurred in the last trial).

4.7.2. Strategies

During the initial practice trial, participants develop strategies for categorizing target and distractor objects for the different display modes (Section 4.4.7). To support reasoning about shape perception strategies, I use Figure 4.14, which displays the same data as Figures 4.13 and 4.12 but this time as histograms, to demonstrate distribution properties. The graphs in Figure 4.14 approximate a near-normal distribution (when taking into account the low resolution of the histogram, limited by the number of participants in the study and the discrete nature of histograms).
Most importantly, there exists no evidence to suggest participants splitting into multiple distinct clusters (e.g. like the two humps of a bactrian camel). Such clusters would be suggestive of the 149 existence of a small number of shape recognition strategies with distinct performance characteristics, applied by participant subgroups. The lack of clusters does not rule out the possibility of several coexisting strategies, however. Multiple strategies could exist that happen to be equally effective. Or there could be a large number of strategies with different performance characteristics and the exhibited distributions are an effect of the population’s likelihood to adopt an optimal strategy for a given display mode. A simpler explanation, and one in line with exit interviews and personal experience, is that all single shape cues suggest a single strategy15. The results for M ixed mode are therefore highly interesting as several possibilities arise: (1) Coexistence : The two strategies for Outline and Shading remain in effect independently and participants choose the better strategy to apply to M ixed. In that case, M ixed should perform like the better of Outline and Shading. (2) Interference : The two strategies for Outline and Shading remain in effect but interdependently. In the case of constructive interference, participants make use of both strategies simultaneously and performance rises above the level of Outline or Shading alone. In the case of destructive interference, one strategy hinders the other and performance is less than the better of Outline and Shading. (3) Synergy : The simultaneous presence of Outline and Shading shape cues allows for a novel strategy only applicable to M ixed. In this case, participants can choose to use the new strategy or stick with the better of the Outline and Shading strategies. The performance should thus not be worse than for the individual shape cues. 15 The following discussion is facilitated by assuming a single strategy per display mode, but does not depend on it. Each occurrence of strategy could be replaced by distinct set of strategies without affecting the arguments. 150 4.7.3. Interference vs. Synergy Altogether, I find that Shading provides the best shape cue in my study (as determined by success, f ailure, and detection scores), followed by M ixed, Outline, and the texture modes, T exIso & T exN OI (Figure 4.12). According to the above discussion, the particular ordering of Shading, M ixed, and Outline suggests a destructive interference effect instead of coexistence, synergy, or constructive interference. Intriguingly, the combination of Outline and Shading actually decreases the efficiency of Shading, instead of adding constructively by helping to disambiguate, as might be expected. This finding reiterates a common theme throughout this dissertation, that for perceptual tasks less visual information can be more effective than more information. Indeed, several participants commented in the exit interview that the M ixed mode offered too much information and confused them. My explanation for this result is that different detection strategies for shapefrom-contours and shape-from-shading could impede each other. For the Outline mode it is advantageous to compare the terminating contour angle of attachments, while Shading offers the most reliable information in terms of presence or absence of gradients along the surfaces interior of attachments. 
If these strategies are different enough, or even mutually exclusive, participants may find it difficult to focus on one strategy while ignoring the other. These results are therefore highly valuable for the design of effective shapes and shape cues for interactive non-realistic rendering systems. This result is also important because it partly contradicts and partly augments findings of previous shape perception studies. Bülthoff [19], for example, found that subjects underestimated curvature of static objects shown with shading or texture alone, but results improved when shading and texture were shown in conjunction, lending support to a synergy or constructive interference theory. I believe the fact that Bülthoff detected an additive effect while I found that multiple shape cues may be counterproductive can be explained by expanding upon my above theory on detection strategies with a timing argument. Participants may find it difficult to focus on one strategy while ignoring the other under time-constrained conditions. While there is no reason to believe that humans would not take all available evidence into consideration when given the time, I have mentioned previously that the human visual system can only attend to a limited number of stimuli simultaneously. It is therefore conceivable that for static scenes, and when given ample time, humans use multiple shape cues constructively, while in a time-critical interactive situation a shape cue prioritization takes place (although the results were not statistically significant, I also found evidence for such prioritization in the data trends of the car experiment). Such an argument could also find support in findings by Moutoussis and Zeki [114], stating that each of the different visual processing systems of the HVS "[...] terminates its perceptual task and reaches its perceptual endpoint at a slightly different time than the others, thus leading to a perceptual asynchrony in vision - color is seen before form, which is seen before motion, with the advantage of colour over motion being of the order of 60-100 ms [...]" ([189], pg. 79). I thus believe it is vital to perform more studies on shape perception for real-time, interactive tasks.

4.7.4. Interaction

Table 4.1 shows analysis of variance (ANOVA) results for the different aggregate measurements, to test if varying the display mode had a significant effect on the means of these measures.

Measure      F(4,84)    p
success      48.594     1.14 · 10^-20
failure      50.154     4.68 · 10^-21
risk         13.317     2.23 · 10^-8
placement    1.412      0.238
detection    49.625     6.32 · 10^-21

Table 4.1. Within-"Aggregate Measure" Effects. This table lists the F-value for the given degrees of freedom, and the p-value for each of the aggregate measures across all display modes. Display modes are averaged over all trials of all participants. Given values assume sphericity.

The F(dof, n) value in the first column represents the ratio of two independent estimates of the variance of a normal distribution, where dof = m − 1 are the degrees of freedom, m is the number of factor levels (here, different display modes: m = 5), and n = 84 is the number of observations under identical conditions (4 trials for 21 participants). Higher F-ratios indicate a greater dissimilarity of the variances under investigation. For the given dof and n values, an F-ratio above 2.5 indicates statistical significance at the p = 0.05 level (actual p-values are shown in the second column).
Given the values in Table 4.1, I find that the different display modes have a (highly, p < 0.01) significant effect on all aggregate measures except placement. This is an ideal result, because it means that while the performance measures related to the task are critically affected by the display modes, the placement measure, related to interaction, does not vary significantly with display mode. In other words, participants are able to consistently touch their intended moving objects, even if the objects themselves may be difficult to differentiate.

Modes                 succ.   fail.   risk    place.  detec.
Outline vs. TexNOI    vsig    vsig    vsig    0.061   vsig
Outline vs. TexISO    vsig    vsig    vsig    0.080   vsig
Outline vs. Shading   vsig    vsig    vsig    0.184   vsig
Shading vs. TexNOI    vsig    vsig    vsig    0.469   vsig
Shading vs. TexISO    vsig    vsig    vsig    0.401   vsig
Mixed vs. TexNOI      vsig    vsig    0.001   0.282   vsig
Mixed vs. TexISO      vsig    vsig    0.001   0.190   vsig
Mixed vs. Outline     0.025   0.002   0.018   0.740   vsig
Mixed vs. Shading     0.025   vsig    0.346   0.380   0.099
TexISO vs. TexNOI     0.267   0.237   0.490   0.987   0.347

Table 4.2. Significance Analysis. This table lists p-values for Student's paired t-test of all combinations of display modes. The columns refer to each of the aggregate measures. A value of p < 0.005 is considered significant, while a value marked vsig (p < 0.0005) is highly significant, under the highly conservative Bonferroni correction.

Another conclusion is that performance differences for the different display modes can be detected with the aggregate measures, number of participants, and number of trials specified in this chapter. This suggests reusability of the experimental setup for numerous other dynamic shape perception studies (Section 6.2).

As evident in Figure 4.12, the success rates for all display modes are significantly higher than pure chance (25%), and than chance (50%) if participants ignored instructions and considered only one of the two attachment categories (CS & LA, in Figure 4.10). This indicates that participants understand the instructions correctly (using both attachment categories for distinction) and find the task easy enough to perform. These results demonstrate the successful implementation of the third design goal (interaction simplicity), and I am hopeful that the experimental methodology can be adopted for a large variety of additional display modes.

4.7.5. Motion and Color

In the detailed t-test analysis of Table 4.2, no significant differences are found between the different texture modes. This is surprising, as motion perception is commonly thought to be linked to luminance channels and independent of color [187, 134, 115]. In that case, the isoluminant texture mode, TexISO, should perform worse than the non-isoluminant texture mode, TexNOI. In fact, the results of my study show the opposite trend (although that trend is not statistically significant) and are in line with subjective responses from the exit interview (Table B.6, bottom row), indicating that TexISO appears easier than TexNOI to most participants. This is an interesting finding and may substantiate recent studies that propose two different motion pathways in the HVS, which process slow motion (chromatically) differently from fast motion (achromatically) [55, 168, 104].
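Referring back to Table 4.2, a minimal sketch of the pairwise analysis is given below: Student's paired t-tests over all ten mode pairs, compared against a Bonferroni-adjusted threshold (0.05 / 10 = 0.005, matching the significance level used above). The per-participant scores here are synthetic placeholders; only the procedure is meant to be illustrative.

```python
import numpy as np
from itertools import combinations
from scipy.stats import ttest_rel

MODES = ["Outline", "Shading", "Mixed", "TexISO", "TexNOI"]

def pairwise_paired_ttests(scores, alpha=0.05):
    """scores: dict mapping each display mode to one aggregate value per
    participant (same participant order in every array)."""
    pairs = list(combinations(MODES, 2))
    threshold = alpha / len(pairs)           # Bonferroni correction over 10 pairs
    rows = []
    for a, b in pairs:
        t, p = ttest_rel(scores[a], scores[b])
        rows.append((f"{a} vs. {b}", p, p < threshold))
    return threshold, rows

# Synthetic example data for 21 participants.
rng = np.random.default_rng(1)
scores = {m: 0.5 + 0.1 * rng.standard_normal(21) for m in MODES}
threshold, rows = pairwise_paired_ttests(scores)
for name, p, significant in rows:
    print(f"{name:20s} p={p:.4f} significant at {threshold}: {significant}")
```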
4.7.6. Risk Assessment

An interesting result, evident in Figure 4.12, is the positive correlation between success and risk, and the negative correlation between failure and risk (significant at p(dof=3) < 0.01 in both cases). Intuitively, it seems that more risk should lower success and increase failure, up to the point of ultimate risk, which equates to pure chance. In my interpretation of this data, participants are generally able to judge their limitations well and behave rather conservatively, in line with the instruction to be as fast as possible without making any mistakes.

Mostly non-significant differences are found for Mixed vs. Outline and Mixed vs. Shading. I attribute this to the fact that Mixed performance lies between Shading and Outline for all measures but risk (Figure 4.12), to the relatively large variability of Mixed, and to the use of the conservative Bonferroni correction. A possible explanation for the high risk value of Mixed is that participants became more daring because they assumed that more visual information would improve their correct target identification. The objective performance measures (success and failure) do not corroborate this notion, and, interestingly, neither do the exit interview results (Table B.6).

4.7.7. Exit Interview

From the exit interview (Figure B.1 and Table B.6) I gather that participants were content with the duration of the experiment. No participant felt dizzy or disoriented during or after the experiment, and participants found the interaction paradigm very intuitive. Most participants described the experiment as "fun", even though this was not asked in the questionnaire.

4.8. Summary

In this chapter, I presented an experiment to study shape perception of multiple concurrent dynamic objects. This experimental approach is novel in that it deviates from traditional reductionist (single, static shape) studies, whose results may not apply directly to most interactive graphics applications.

Experimental Framework. I created several non-realistic display modes specifically designed to target only a single shape cue at a time, allowing me to study individual shape cues as well as combinations thereof. My framework implementation carefully follows a number of high-level design goals described in Section 4.1.1, and the statistical significance of the data collected during my study (Section 4.6.3) indicates a high-quality experimental setup in which these design goals were attained. Results further indicate that my experiment supports a number of additional shape-from-X studies with only minor modifications (Section 6.2).

Interaction. I presented a novel interaction paradigm that does not rely on pointing devices or other indirect mechanisms. Participants interact with the experiment by simply touching a table at the position where they see an object. This interaction is simple, intuitive, unobtrusive, and reliable (see placement values in Figure 4.12).

Task. Compared to most previous shape perception studies, which can quickly become monotonous and tiring, my experimental task is inspired by games (Section 4.4) and intended to be motivating. Participants have to react quickly and stay alert to achieve a good performance.
Most participants became very competitive during the experiment and described the task as fun (Section 4.7.7).

Results. Although the main contribution of this chapter is a reusable experimental design that allows for a large number of graphics-relevant shape perception studies, the results for the single study I performed are already interesting (Section 4.7). The most important of these results is that, in dynamic situations, shape cues do not seem to add constructively and may even interfere destructively. This is an important result for interactive graphics, and one which contrasts with previous studies on static objects (Section 4.7.3).

Shape Cues in Action. It is common knowledge amongst graphic designers that different types of shape cues are effective for different design goals. I have made use of this concept throughout the figures in this chapter: Figure 4.10 uses shading and coloring to indicate shape and differentiate object parts, Figure 4.6 uses contours to draw attention to target objects, and Figure 4.3 uses texture to illustrate a curved surface. It is important to study the effectiveness of these shape cues for various perceptual tasks, in order to apply the cues appropriately in a given graphical situation.

CHAPTER 5
General Future Work

Given the beneficial relationship between NPR and Perception advocated in this dissertation, one might naturally ask questions such as: How far can we push this relationship? or What is the ultimate perceptual depiction? or What other non-realistic imagery, apart from that inspired by art, can be used for visual communication purposes? Although I do not presume to have conclusive answers to these questions, I believe the direction outlined by my dissertation points towards some interesting leads. To start this discussion off, I revisit the topic of realism versus non-realism in the light of the issues addressed in previous chapters.

5.0.1. Realistic Images

Figure 5.1 illustrates simplistically the general lifecycle of a synthetic image from conception to perception and on to cognition. I argued in Section 1.1.1 that every image serves a purpose. For now, let this purpose be to convey a message, even if this message is only the image itself (e.g. "A table and chair in the corner of a room"). In the purely photorealistic approach, this message is encoded into a life-like visual representation, without reference to the HVS (see footnote 1), to be consumed by an observer. The observer's task, then, is to decipher the message given the input image. If all elements of this encoding and decoding process work well, the observer recovers a good approximation of the original message. Because the entire process is rather lengthy and complicated, and because there are no possible shortcuts (see below) for realistic image synthesis, there are various stages at which the message can be degraded or confused.

Footnote 1: As noted in Section 2.4.2, even adaptive rendering, which sometimes does consider the HVS, does so mostly to hide artifacts, not to enhance images.

Figure 5.1. Lifecycle of a Synthetic Image. The image generation (rendering, blue outlines) starts with a concept: A table and chair stand in the corner of a room. A user models the objects, sets up the scene, and renders the image to a display device. An observer views the final image on the display and starts deconstructing the retinal projection (vision, red outlines). The observer goes through various low-level and cognitive processing steps before recognizing the depicted scene: A chair next to a table in a room. If rendering and vision work in perfect harmony, the initial concept and the recognized scene are identical. Vision shortcuts are the attempt to bypass some of the rendering and visual decoding pipeline to effect a more direct visual communication.
5.0.2. Non-realistic Images

Non-realistic image synthesis is not bound by the constraints of the physical world. It thus becomes easier to eliminate detail that (1) does not contribute to representing the message and could in the worst case mask the message (confusion), and (2) requires additional rendering resources, thereby incurring unnecessary costs. The best example of purposeful omission of information is abstraction. Restrooms around the world generally do not post photographs of a man and a woman on their doors. Doing so would give too much information and be too specific; patrons may be led to believe that the room behind the door belongs to the depicted person. Instead, restroom signs are abstract representations of men and women, so that any person of the appropriate gender can identify with the depiction.

The allowed shortcuts for non-realistic images are to bypass optical models required for realistic image synthesis. I should note that my use of the term shortcut chiefly refers to optimizations in visual communication. While it is possible that such shortcuts are also computationally efficient (as is the case for many non-realistic image synthesis algorithms that do not rely on global illumination solutions), I do not require this to consider a shortcut to be effective.

5.0.3. Perceptually-based Images

I have argued throughout this dissertation that the effectiveness of non-realistic imagery can be further increased by considering human perception. I indicate this with the perceptually-based rendering label in Figure 5.1. To generate images optimized for low-level human vision, the rendering process needs to include a model of perception (light-blue rendering input). Although such a model does not introduce additional shortcuts on the rendering side, it might increase efficiency on the visual decoding side. This is the approach I took in Chapter 3, and it led to increased performance in two perceptual tasks.

One way to discuss the questions I pose at the beginning of this section is to investigate any perceptual shortcuts beyond those already mentioned. In other words: Can we generate images that convey a given message while bypassing more of the coding/decoding pipeline? I believe the answer is yes. To substantiate this claim, let me give a few examples of what I refer to as vision shortcuts.

5.1. Vision Shortcuts

In most realistic and even non-realistic graphics, there exists a fairly straightforward connection between a generated visual stimulus and its perceptual response. The intensity of a pixel on a monitor is related to the perceived brightness of that pixel. The perceived color of a pixel is related to the red, green, and blue intensities of that pixel, and so on. There exist, however, various examples of visual stimuli producing a perceptual sensation that is naturally associated with a very different type of stimulus: a sequence of black-and-white signals can create the illusion of colors. An interlaced duo-chrome image can be perceived to contain colors outside the gamut of additive mixture. A static texture pattern can elicit the sensation of motion. Partially deleted outlines can be perceived as complete.
The following sections introduce these perceptual phenomena in terms of non-realistic imagery and discuss some of their potential applications for visual communication.

5.1.1. Benham-Fechner Illusion: Flicker Color

The Phenomenon. In 1895, a toy-maker named Charles Benham created a spinning top painted with a pattern similar to the left pattern in Figure 5.2. This toy was inspired by his finding that, when the pattern was spun, it created the appearance of multiple colored, concentric rings (see footnote 2) [7]. Gustav Fechner [44] and Hermann von Helmholtz investigated the phenomenon more generally and termed it pattern induced flicker color (PIFC), or flicker color for short. Although the effect has been researched for a long time [21], a satisfactory explanation remains elusive. An early theory stipulated that the pulse patterns of the Benham design approximated neural coding of color information, similar to Morse code. Festinger et al. [48] argued that Benham's induced (or subjective) colors were only faint because they poorly approximated real neural codes. They devised several new patterns with cell-typical activation and fall-off characteristics and demonstrated that their patterns did not require the half-period rest-state of typical Benham-like patterns (Figure 5.2). Festinger et al.'s theory was later disputed, particularly by Jarvis [81], who could not reproduce their results. A currently accepted partial explanation argues that lateral inhibition of neighboring HVS cells exposed to flicker stimuli causes subjective colors to be seen [182].

Footnote 2: I first experienced this illusion in a Natural Science museum in India. The exhibit was in motion when I read the accompanying instructions, and it was not until the disk was almost stationary that I was finally convinced that there were, indeed, no colors.

Figure 5.2. Flicker Color Designs. The left and center circular designs can be enlarged, cut out, and placed on an old record turntable with adjustable speed. When viewing the animated pattern, most people experience concentric circles in different colors. When the rotation is reversed, the color ordering reverses accordingly. The square design is intended for a conveyor-belt motion, or to be painted onto a cylinder. These designs are but a few of many others possible. Note, though, that all designs contain half a period of blackness.

Applications. Apart from research work, the PIFC effect has found applications in ophthalmic treatment and even numerous patents (BD Patent Nr. 931533, 11. Aug. 1955; U.S. Patent #2844990, July 29, 1958; U.S. Patent #3311699, Mar. 28, 1967), including novelty advertisement before the era of color television. If more knowledge existed about the causes of PIFCs, and a reliable method to synthesize saturated and vibrant PIFCs were known, it could be possible to induce color sensations in retinally color-blind or otherwise retinally-damaged people.
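To experiment with the phenomenon, a Benham-style disk can be synthesized directly. The sketch below renders a disk with half a period of blackness and short concentric arcs at staggered angles; the arc placement and sizes are my own guesses for illustration and do not reproduce the designs in Figure 5.2. Spinning such an image at a few revolutions per second (or animating its rotation on screen) is what elicits the flicker colors.

```python
import numpy as np
import matplotlib.pyplot as plt

def benham_like_disk(size=512):
    """Render a Benham-style pattern: one half of the disk is black, the
    other half carries 45-degree arcs at three radial bands."""
    y, x = np.mgrid[-1:1:size * 1j, -1:1:size * 1j]
    r = np.hypot(x, y)
    theta = np.mod(np.arctan2(y, x), 2 * np.pi)
    img = np.ones((size, size))                      # white page
    img[(r <= 1.0) & (theta < np.pi)] = 0.0          # half a period of blackness
    bands = [(0.35, 0.45, 1.00), (0.55, 0.65, 1.25), (0.75, 0.85, 1.50)]
    for r0, r1, t0 in bands:                         # arcs in the white half
        arc = ((r >= r0) & (r <= r1) &
               (theta >= t0 * np.pi) & (theta <= (t0 + 0.25) * np.pi))
        img[arc] = 0.0
    img[(r > 0.97) & (r <= 1.0)] = 0.0               # thin outer rim
    return img

plt.imshow(benham_like_disk(), cmap="gray", vmin=0, vmax=1)
plt.axis("off")
plt.show()
```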
5.1.2. Retinex Color

The Phenomenon. In Section 1.2.1, I mentioned the phenomenon of color constancy, which allows humans to perceive the true color of a material instead of its reflected color. Another phenomenon, sometimes called color illusion, is best explained with an example: if a small grey square (the shape is not important) is placed upon a larger green square, then the grey square appears tinted lightly red. Similarly, if a grey square is placed upon a larger red square, the grey square appears tinted slightly green, i.e. the tint appears as the opposite color of the square it is placed upon. This effect also works with cyan/yellow combinations and poses problems for theories that posit that the cones in human retinas are independently sensitive to red, blue, and green wavelengths. To explain these color illusions, the cones' responses cannot be interpreted independently. Alternative theories, along with supporting physiological evidence, exist, based on antagonistic interactions between combinations of cones resulting in spectrally opposing stimulation [76, 80].

Edwin Land devised an experiment using both phenomena to suggest subjective colors which are objectively not present. In this experiment, he implemented the color illusion phenomenon with a picture slide and a few color-filters to produce duo-chrome images that induced the illusion of colors which were present only in the original image. The HVS interpreted the overall color bias of his images as a global illuminant, thus taking advantage of the color constancy phenomenon. Land described this experiment and the accompanying theory in his Retinex publication (see footnote 3) [100].

Footnote 3: Retinex = Retina + visual cortex.

Figure 5.3. Retinex Images. Viewing Instructions: Due to interlacing, the images may not display well at some magnification levels. In the electronic version of this document, zoom into each image until all the horizontal lines comprising the images appear of the same height. Then adjust your viewing distance to the display until the individual lines cease to be discernible. In this configuration, examine the images for 30 seconds or more and then determine what colors you see. Afterwards you can compare the real colors in Figure 5.4. Finally, zoom fully into the above images to inspect the actual colors used.

Figure 5.4. Originals for Retinex Images. Originals used to construct the images in Figure 5.3.— {Left: Public Domain. Right: Creative Commons License.}

Figure 5.3 shows two Retinex image examples. The images are best viewed on a computer display under adjusted viewing conditions (see the Figure 5.3 caption for instructions). The left image really only uses one color, red, but induces the sensation of green (and other colors) with interlaced grey bands. Note the brown tinge of the burger bun, the yellow of the fries, and the bluish-green tray. The right image uses two different colors, green and red, to achieve a much fuller color appearance. Note the grey color of the sweater-vest and the blue color of the shirt. None of these colors are in the gamut of additive mixture of red and green. The image borders are not strictly necessary, but they help to improve the effect. In the left image, I selected a green that is suggestive of the perceived color of the tray. In the right image, I selected a substitute white, sampled from the bright stripes in the sweater-vest.

In previous work [172], I have conducted a distributed user-study (see footnote 4) to test whether subjective colors could be induced reliably on different monitors and under different illumination conditions. The results indicated that this was possible for a variety of monitors and illumination conditions; that participants perceived colors clearly outside the gamut of additive mixture; but that some colors were not identified uniquely.

Footnote 4: Similar to those suggested in Section 6.2.

Applications. Retinex theory, even though heavily criticized when Land first presented it, has regained some interest in the research community and is used increasingly in image enhancement applications [54, 6], including images taken by NASA (see footnote 5) [133, 178] during orbital and space missions. Along with the flicker color illusion discussed above, applied Retinex theory is one of the prime examples of vision shortcuts. In fact, because these two examples address visual processes so early in the HVS, it could be possible to do away with an external display device altogether. Once artificial retinas become a reality, Retinex theory and flicker color could be used to encode color for direct neural stimulation.

Footnote 5: http://dragon.larc.nasa.gov
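The interlacing described for the left image in Figure 5.3 can be approximated from any color photograph: alternate scanlines carry only the red record (shown as shades of red), while the remaining scanlines carry a grey luminance record. The sketch below follows that textual description only; the file names, the luminance weights, and the omission of the colored border are my own assumptions, and it makes no claim to reproduce the exact construction used for Figure 5.3.

```python
import numpy as np
from PIL import Image

def land_style_interlace(in_path, out_path="retinex_interlaced.png"):
    """Even scanlines: red record only. Odd scanlines: grey luminance record."""
    rgb = np.asarray(Image.open(in_path).convert("RGB"), dtype=np.float32) / 255.0
    red = rgb[..., 0]
    lum = rgb @ np.array([0.299, 0.587, 0.114], dtype=np.float32)
    out = np.zeros_like(rgb)
    out[0::2, :, 0] = red[0::2]              # red-only lines
    out[1::2, :, :] = lum[1::2, :, None]     # grey lines
    Image.fromarray((out * 255).astype(np.uint8)).save(out_path)

# Example call (the path is a placeholder):
# land_style_interlace("original_photo.jpg")
```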
5.1.3. Anomalous Motion

The Phenomenon. Anomalous motion is an example of a more indirect method of triggering a sensation generally associated with a different stimulus. When viewing images like those in Figure 5.5 at a suitable magnification level, the motion texture elements, which I call motiels, appear to be moving. Akiyoshi Kitaoka has created many different types of anomalous motion illusions (see footnote 6) and published several papers and books on the phenomenon [92, 91]. Despite Kitaoka's and other research efforts, there exist many more types of anomalous motion designs than theories explaining their perceptual mechanisms. In a sense, these illusions are great examples of Zeki's observation about artists acting as neurologists (see footnote 7) (Section 1.2.3).

Footnote 6: A large number of Kitaoka's designs are available at http://www.ritsumei.ac.jp/~akitaoka/index-e.html.

Footnote 7: This is not to imply that A. Kitaoka's scientific prowess is in any way inferior to his artistic talents, but rather that less scholarly individuals throughout the Internet have found it possible to adopt and modify his original designs.

Figure 5.5. Anomalous Motion. Top row: Two anomalous motion designs using the same motiel shape, but different color schemes. Bottom row: The Rotating Snakes illusion, after A. Kitaoka. Changing the viewing distance or zooming in/out affects the magnitude of the effect. Try viewing only one image at a time.

There exist various rules of thumb to create anomalous motion designs. Most designs require a repeated texture element (motiel) with the following characteristics: one side of the shape is brighter than the center, while the opposing side is darker than the center; the brightness of the center region should not be too different from the background; the shape of the motiel can be varied; most observers perceive motion in the light-to-dark direction of the motiel; and the size of the motiel has a significant effect on the magnitude of the illusion. These and additional rules help in designing anomalous motion illusions, but they do not explain them. However, parametrization of these rules, combined with computer graphics visualization, may help us to learn more about the extent to which these rules apply and when they break down. This, in turn, is likely to increase our understanding of the illusions, and may lead to perceptual models explaining them in more detail, again reiterating the leitmotif of my dissertation.

Applications. Possible uses of these illusions, in addition to their entertainment factor and scientific interest, could include indication of motion in print media, motion visualization in static displays, and velocity indication of slow-moving objects. These applications are similar to those Freeman et al. [53] proposed, although their system required a short animation sequence not realizable on truly static media.
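The rules of thumb above translate directly into a texture generator, which is exactly the kind of parametrization that could be used to probe where the illusion breaks down. The sketch below tiles a simple motiel with a bright leading edge, a body close to the background grey, and a dark trailing edge, flipping the light-to-dark direction on alternating rows; all sizes and grey levels are illustrative choices, not values taken from Figure 5.5, and no particular perceptual strength is claimed for this specific design.

```python
import numpy as np
import matplotlib.pyplot as plt

def motiel_texture(rows=8, cols=12, cell=48, background=0.55):
    """Tile a simple motiel: bright edge, near-background body, dark edge.
    Alternating rows reverse the light-to-dark direction."""
    img = np.full((rows * cell, cols * cell), background)
    motiel = np.full((cell // 2, cell - 8), background)
    w = motiel.shape[1]
    motiel[:, : w // 5] = 1.0            # bright leading edge
    motiel[:, -(w // 5):] = 0.0          # dark trailing edge
    for r in range(rows):
        tile = motiel if r % 2 == 0 else motiel[:, ::-1]
        for c in range(cols):
            y0 = r * cell + cell // 4
            x0 = c * cell + 4
            img[y0:y0 + motiel.shape[0], x0:x0 + motiel.shape[1]] = tile
    return img

plt.imshow(motiel_texture(), cmap="gray", vmin=0, vmax=1)
plt.axis("off")
plt.show()
```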
5.1.4. Deleted Contours

The Phenomenon. As noted in Section 4.2, the HVS is equipped with a fair amount of redundancy to increase the robustness of visual tasks and to deal with underconstrained visual situations. Figure 5.6 illustrates another facet of this concept. The cube example shows that the straight edges connecting the corners of the cube do not add much more visually useful information to the image. Their presence can be inferred from the termination points of the corners. The exact mechanism by which humans are able to automatically complete such missing contours is not fully understood, but Hoffman [75] composed a set of rules that are viable candidates for visual hypothesis testing, as introduced in Section 1.2.2. While Koenderink [94, 97] investigated the geometric properties of contours that allow shape recovery, Biederman and others [10, 9, 112, 113] demonstrated via user-studies which types of contour deletions the HVS could recover. The scissor example in Figure 5.6 shows that, as mentioned in Section 2.2, not all visual information is of equal importance. While some contour deletions are easily recovered, others are not. Interestingly, adding arbitrary plausible masking shapes to the unrecoverable scissor image re-enables recognition.

I believe deleted contours are an excellent example of the minimal graphics described by Herman et al. [69], which I mentioned in Section 3.5.9. If we do not require a complete contour description to obtain shape, then how much do we need, and what? Junctions (corners and intersections) are good candidates for a necessity requirement, but we need additional information to discern curved features (e.g. a circle). Hoffman's minima rule (referring to an extremum in curvature) and other shape parsing rules [74, 148] could help in that respect.

Applications. In terms of applications of this phenomenon, it is surprising that (to the best of my knowledge) no adaptive rendering technique (Section 2.4.2) makes use of the fact that some parts of an outline are more salient than others. Given that most rendering systems have 3-D information readily available and could easily compute contour saliency, this is an avenue worth investigating; not only to speed up rendering and hide artifacts, but to actively increase the visual clarity of a rendered image.

Figure 5.6. Deleted Contours. Cube: To perceive a cube, it is not necessary to fully depict it, because the HVS fills in missing information automatically. Scissors: Not all information is equally valuable. Strategically placed information in the recoverable version facilitates the recognition task. Both the recoverable and non-recoverable versions contain about the same length of total contour, but redundant and coincidental information in the non-recoverable version makes identification very difficult. Adding masking cues to the non-recoverable version (recoverable again) disambiguates coincidental information, leading to renewed recovery.— {Scissor example after Biederman [10].}
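As a concrete starting point for such an investigation, junction-preserving contour deletion is easy to prototype. The sketch below keeps, for every edge of a polygonal outline, only the pieces adjacent to its two corners and deletes the middles, mimicking a recoverable deletion in the spirit of Figure 5.6; the test shape and the kept fraction are arbitrary illustrative choices, not taken from the figure.

```python
import numpy as np
import matplotlib.pyplot as plt

def corner_segments(poly, keep=0.3):
    """For each edge of a closed polygon, keep only the fraction `keep`
    nearest to each of its two endpoints (the corners / junctions),
    deleting the middle. Keeping only the middles instead would delete
    exactly the parts the HVS appears to need most."""
    segs = []
    n = len(poly)
    for i in range(n):
        a = np.asarray(poly[i], float)
        b = np.asarray(poly[(i + 1) % n], float)
        segs.append((a, a + keep * (b - a)))      # piece at corner a
        segs.append((b - keep * (b - a), b))      # piece at corner b
    return segs

# A simple L-shaped outline as the test contour.
outline = [(0, 0), (4, 0), (4, 1.5), (1.5, 1.5), (1.5, 3), (0, 3)]
for p, q in corner_segments(outline):
    plt.plot([p[0], q[0]], [p[1], q[1]], "k-", linewidth=2)
plt.axis("equal")
plt.axis("off")
plt.show()
```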
5.2. Discussion

The examples of vision shortcuts I give in Section 5.1 are all more or less ad hoc, and there exists no unified framework that ties them all together. Some of their possible applications may sound fantastical – for now. Too little is still known about the perceptual mechanisms whereby vision shortcuts operate. NPR systems in conjunction with perceptual studies might bridge this knowledge gap some day.

The great potential I see in vision shortcuts is as a continuation of what art (and NPR) has already achieved: a divorce of function from form. This separation allows for greater freedom in the design of images and more direct targeting of visual information. Art is not bound by the requirement to simulate optical processes in order to convey a message, and more often than not this helps the visual communication purpose of an image. Vision shortcuts take the separation of function and form one step further. By almost directly addressing HVS processes (as in the flicker color examples), function (perception of color) can be targeted completely without form. Note that form, here, does not refer to a generic medium through which function may be applied, but only to the natural medium associated with the function. In the case of color, the natural medium is light of different wavelengths. Flicker color replaces this medium with a series of pulses that objectively are nothing more than intermittent light signals, but in the HVS these become perceptions of color. The divorce of color from wavelengths may enable us to create revolutionary new display devices and techniques. I do concede that we are still a long way from incorporating vision shortcuts into standard rendering pipelines, but I hope that research at the interface between NPR and Perception, as advocated in my dissertation, will bring us closer to that goal.

CHAPTER 6
Conclusion

In the beginning of this dissertation, I argue that the connection between non-realistic depiction and human perception is a valuable tool to improve the visual communication potential of computer-generated images, and conversely, to learn more about human perception of such images. My perception-centric approach to non-realistic depiction differs from most previous NPR work in that I am not interested in merely replicating an artistic style (see footnote 1), but instead focus on the perceptual motivation for using non-realistic imagery.

Footnote 1: This is not to say that I argue against purely artistic use of NPR. There definitely is merit in such use for creative expression and aesthetic purposes. For this reason, I included the various stylistic parameters introduced in Chapter 3.

In Chapter 3, I show how a perceptually inspired image processing framework can create images that are effective for visual communication tasks. These images also resemble cartoons; not primarily because my motivation was to imitate a cartoon style, but because in the design of the framework I used the same perceptual principles that make cartoons highly effective for visual communication purposes. This subtle difference has important consequences: although the resulting images of my framework may resemble those of previous works, my perceptually inspired framework is faster than previous systems, more temporally coherent, and implicitly generates certain visual effects (indication and motion lines) that other NPR cartooning systems have to program explicitly. The appearance of these complimentary effects is likely linked to the fact that, although the effects are commonly considered merely stylistic, they actually have roots in perceptual mechanisms and physiological structures [89, 52]. This demonstrates some of the benefits of re-examining non-realistic graphics and artistic styles in the light of perceptual motivations. Not only can this approach teach us about art and the perception of art, but it can provide insights to leverage the perceptual principles that make art so effective for visual communication. We can then use that knowledge to improve computer graphics, realistic and non-realistic alike.
Similarly, we can leverage existing non-realistic imaging techniques and the immense processing power of graphics hardware to perform perceptual studies that are more relevant to interactive computer graphics applications (Chapter 4) than the impoverished studies that are traditionally performed. The knowledge gained from such studies is not only valuable for non-realistic graphics, but is likely to transfer to improving realistic computer graphics as well.

6.1. Conclusions drawn from Real-time Video Abstraction Chapter

In Chapter 3, I have presented a simple and effective real-time framework that abstracts images while retaining much of their perceptually important information, as demonstrated by two user studies (Section 3.4). In addition to the contribution of the actual framework, I can draw several high-level conclusions from Chapter 3. While not all of these conclusions are necessarily novel, they are in my opinion particularly well reflected in the framework's design and implementation.

6.1.1. Contrast

All of the important processing steps in the framework are based on contrasts, not absolute values, continuing a recently developing trend in the graphics community towards differential (change-based) models and algorithms [123, 156, 59]. In particular, the automatic version of the non-linear diffusion approximation in Section 3.3.2 uses the given contrast in an image to change said contrast, forming a closed-loop, implicit algorithm. I believe that differential methods will play an increasingly important role in future systems, particularly in those based on perceptual principles.

6.1.2. Soft Quantization

Temporal coherence has been a major problem for many animated stylization systems from the very beginning [101]. There are several reasons for this. Stylization is often an arbitrary external addition to an image (e.g. waviness or randomness in line-drawings of exact 3-D models) and should therefore be controlled via a temporally smooth function (e.g. Perlin noise [124] or wavelet noise [29]). Another problem that is more difficult to address is that of quantization. Many existing stylization systems force discontinuous quantizations, particularly to derive an explicit image representation [36, 161, 26]. My approach is different. I want to aid the human visual system to increase efficiency for certain visual tasks, but I do not endeavor to perform the visual task for the user, who is much more capable than any system that I can devise. In terms of quantization this means that I will not force a quantization if I cannot be relatively sure of making the correct decision (e.g. whether a pixel belongs to one object or another). Instead, I perform a quasi-quantization, or soft-quantization, which suggests rather than enforces, and lets observers mentally complete the picture for themselves. This principle is used effectively in the color quantization in Section 3.3.5 and the edge detection in Section 3.3.3. In essence, it can often be more useful to give a good partial solution than an erroneous full solution which needs to be corrected with the help of the user [161, 26].
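As an illustration of the soft-quantization idea, the sketch below blends each luminance value towards its nearest bin with a smooth tanh ramp instead of snapping to it, so that bin boundaries are suggested rather than enforced. It is only a sketch of the principle under assumed parameter names and defaults; the exact formulation and parameter control used in Section 3.3.5 are not reproduced here.

```python
import numpy as np

def soft_quantize(luminance, n_bins=8, sharpness=10.0):
    """Soft (quasi-)quantization: pull each value towards its nearest bin
    with a bounded tanh ramp rather than snapping to it. Larger `sharpness`
    makes the staircase look harder; small values leave it nearly smooth.

    luminance: array of values in [0, 1].
    """
    width = 1.0 / n_bins
    nearest = np.round(luminance / width) * width
    return nearest + 0.5 * width * np.tanh(sharpness * (luminance - nearest))

# Example: a smooth ramp becomes a soft staircase.
ramp = np.linspace(0.0, 1.0, 11)
print(np.round(soft_quantize(ramp, n_bins=4, sharpness=8.0), 3))
```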
6.1.3. Gaussians, Art, and Perception

Gaussian filters or Gaussian-like convolutions recur throughout Chapter 3, from scale-space theory and edge detection to diffusion approximations and motion blur. Why should this be so? My personal explanation is related to the receptive-field concept, which is illustrated in Figure 3.11 and to which Zeki refers as "[...] one of the most important concepts to emerge from sensory physiology in the past fifty years." ([189], pg. 88). The receptive field of a cell connects the cell with its physical or logical neighbors (see footnote 2) and thus performs information integration over larger and larger areas, and eventually the entire visual field. In essence, the output of many cortical cells depends on the weighted input of its own trigger mechanism and that of its neighbors, not entirely unlike the convolution of a Gaussian-like kernel with an image. This connection has been shown physiologically [183] before, but I believe it to be important for two additional reasons.

Footnote 2: This corresponds to the topological (physical) versus feature-based (logical) mappings found to connect the different visual cortical areas [188].

First, using Gaussian-like convolutions allows for very efficient, parallel, and implicit information processing frameworks, which become akin to neural net implementations when processed iteratively. It might thus be interesting to look at neural nets and related artificial intelligence applications in terms of image processing operations that can leverage the parallel processing power of modern GPUs for high-performance computations.

Second, many of the Gaussian-based image processing operations I have used show interesting connections to well-known artistic techniques and principles. There are obvious connections (see footnote 3), like the one between DoG edges and line-drawings, but there are also less obvious connections, like the indication and motion lines of Section 3.5.7. Another effect (not shown in detail) is that recursively bilaterally filtered images often tend to look like water-color paintings (in which non-linear color diffusion through canvas and water plays a pivotal role). In short, Gaussian-like filters seem to play heavily into both perception and art, and given the fact that various artistic techniques have been traced back to cortical structures and low-level perception [66, 17, 89, 52, 189], it might be worthwhile to attempt an explanation and parametrization of art in terms of a perceptually-based computational information-integration model using Gaussian-like functions.

Footnote 3: In terms of relatedness, not causality.
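The connection between Gaussian-like convolutions and line-drawings mentioned above can be made concrete in a few lines of code: a difference-of-Gaussians (DoG) response thresholded into dark lines. The hard threshold below is a simplification introduced for brevity; as noted in Section 6.1.2, the edge detection of Section 3.3.3 applies the soft-quantization principle to such a response rather than a hard cut. Parameters and the synthetic test image are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_edges(gray, sigma=1.0, k=1.6, threshold=0.01):
    """Difference-of-Gaussians line extraction: pixels where the narrow blur
    falls noticeably below the wide blur are drawn as ink.
    Returns an image with 1.0 = paper, 0.0 = ink."""
    narrow = gaussian_filter(gray, sigma)
    wide = gaussian_filter(gray, k * sigma)
    response = narrow - wide
    return np.where(response < -threshold, 0.0, 1.0)

# Synthetic test image: a bright disc on a dark background.
y, x = np.mgrid[0:128, 0:128]
disc = (((x - 64) ** 2 + (y - 64) ** 2) < 40 ** 2).astype(float)
lines = dog_edges(disc, sigma=2.0)
print(lines.shape, lines.min(), lines.max())
```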
6.2. Conclusions drawn from Shape-from-X Chapter

In Chapter 4, I have presented an experiment to study shape perception of moving objects using non-realistic imagery. My main contributions in designing the experiment are the use of simple, non-realistic visual stimuli to separate shape cues onto orthogonal perceptual shape-from-X axes, and the display of these cues in a highly dynamic environment for a time-critical, game-like task. The experiment therefore demonstrates well the contribution that NPR can make to perceptual studies, as well as a methodology that NPR researchers can use to validate and improve their systems.

One of the most interesting results of the user study I performed using my experimental design is that shape cues in a time-constrained condition need not interact constructively. In fact, my results indicate that the opposite may be the case: when multiple shape cues are present, they can conflict and impede each other. Apart from the other results discussed in Section 4.7, the most important conclusions to be drawn from Chapter 4 are that the experimental framework fulfilled all the design goals put forth in Section 4.1.1, and that this allows for numerous additional studies to be performed and evaluated with my experimental design. One benefit of the design is that setting up new studies can be as simple as generating new sets of shapes to be tested, or varying the texture parametrization. The following sections describe some of the possible dynamic shape studies that my experiment supports, and which might yield valuable perceptual insights to improve existing rendering systems and to develop new display algorithms for interactive applications.

6.2.1. Contours

The different types of contours (silhouettes, outlines, ridges, valleys, creases, etc.) illustrated in Figure 4.1, Contours, might be investigated. Of particular interest would be the evaluation of the perceptually motivated suggestive contours [35]. I actually included DeCarlo et al.'s suggestive contour code in the initial car-experiment, but ended up not using it because the coarse real-time models of my setup did not provide enough geometric detail for the suggestive contour method to work properly, and higher-resolution models prohibited real-time rendering. The same limitations apply to the geon shapes of my current experiment. Because studying the effectiveness of suggestive contours requires a more complex shape set and higher-resolution models, some obstacles will have to be overcome to enable real-time rendering performance. An exciting development in that regard is the new feature set of the latest generation of graphics cards, which allows for geometry generation in GPU code. This, together with instancing, might enable real-time suggestive contour rendering of multiple complex, high-resolution models.

6.2.2. Textures

Apart from the simple sphere-mapping used in my experiment, a number of other shape parameterizations can be used to map texture onto objects [77, 58, 90, 153]. The type of texture can also be varied. My experiment uses a random-design texture comprising structure at a variety of spatial scales. Most perception studies that focus solely on texture use sinusoidal gratings of a well-defined frequency and amplitude. It will be interesting to study how these textures perform in a dynamic experiment.
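For such a study, gratings are straightforward to generate and parameterize. The sketch below produces a sinusoidal grating of a well-defined frequency, amplitude, and orientation that could serve as one of the texture variants; all parameter names and defaults are illustrative assumptions, not values used in the original experiment.

```python
import numpy as np
import matplotlib.pyplot as plt

def sinusoidal_grating(size=256, cycles=8, amplitude=0.5, orientation_deg=0.0):
    """Sinusoidal grating around mid-grey.
    cycles: full luminance periods across the image.
    amplitude: contrast about 0.5, clipped so values stay in [0, 1]."""
    y, x = np.mgrid[0:size, 0:size] / size
    theta = np.deg2rad(orientation_deg)
    phase = 2 * np.pi * cycles * (x * np.cos(theta) + y * np.sin(theta))
    return 0.5 + np.clip(amplitude, 0.0, 0.5) * np.sin(phase)

plt.imshow(sinusoidal_grating(cycles=12, orientation_deg=30.0),
           cmap="gray", vmin=0, vmax=1)
plt.axis("off")
plt.show()
```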
6.2.3. Shapes

I varied attachment shape along two categorical axes, but there are other shape categories to explore. Some of my results (Section 4.7.3) suggest that different shape cues might be more effective for certain types of shapes than for others, so it will be interesting to perform additional studies that match shape classes with optimal shape cues or shape cue combinations.

6.2.4. Dynamics

Another result of my experiment indicates that shape perception in dynamic environments may abide by different rules than perception in static environments. In line with recent research on motion detection of variously colored stimuli [55, 168, 104], it might be interesting to see what exactly constitutes a static versus a dynamic environment. What kinds of translational and rotational velocities can be considered dynamic or even highly dynamic? How fast can the different shape-from-X mechanisms of the human visual system reliably detect subtle shape differences?

I explained in Section 4.4.2, Motion (pg. 124), that shape-from-Motion in isolation is difficult to study because, for motion to be perceived, something has to move. Experiments on the perception of biological motion have shown that motion can indeed be divorced from form [82]. Coherently moving dots in a random dot display can be perceived to represent the motion of various rigid bodies or even complex biological entities. The dots are totally devoid of shape when stationary, but become part of a moving form when animated (similar to the effect of the complex background textures in my experiment). The resolution of sparsely distributed dots on a display is obviously too limited to resolve the subtle differences between the shape categories tested in my study, but it will be interesting to devise a modified version of my experiment that can help to separate the effect of shape-from-Motion from the contribution of the other shape cues in a dynamic environment.

6.2.5. Games and Distributed Studies

Finally, I am encouraged by the positive user feedback from the experiment. Perceptual studies often employ repetitive, time-consuming, and tedious tasks to obtain accurate data. Such tasks can negatively impact concentration and performance levels. I found in my experiment that participants generally enjoyed the interaction because it was simple and engaging. The task also triggered competitive behavior in participants, who wanted to do well and shoot as many correct targets as possible. I see game-like interaction tasks for perceptual studies as a way of obtaining data that is more relevant to real-life activities than the data obtained in most traditional, reductionist experiments. Of course, there are problems and pitfalls, as I found out with the car-experiment, but a careful experimental design can minimize those problems. I am interested to see how popular game paradigms, like racing games, first-person shooters, or third-person obstacle games, can be modified to yield perceptually valuable and scientifically sound data. One big advantage of such an approach, in addition to its immediate applicability to interactive graphics, game design, and perception research, is the large base of volunteer gamers who could download the experiment/game and would generate valuable data just by playing. The obvious problem of limited control over the environmental conditions during the experimental trials would have to be weighed against the benefits of the fast and copious data-gathering possible in a distributed, autonomous experiment.

6.3. Summary

The graphics community at large has acquired much knowledge about the design and performance of rendering algorithms as well as interactive and even immersive applications. Yet, very little is known about the perceptual effects of these algorithms and applications on human task performance. It is my hope that in the future we will harness more of the advanced rendering systems and processing power that computer graphics has to offer, to perform perceptual studies that would otherwise not be possible. In return, the insights gained from such perceptual studies can flow right back into designing graphical systems that are not only fast and photorealistic, but that provide verifiably effective visual stimuli for the human tasks they are intended to support.

References

[1] Nur Arad and Craig Gotsman. Enhancement by image-dependent warping. IEEE Trans. on Image Processing, 8(9):1063–1074, 1999. 72, 73 [2] James Arvo, Kenneth Torrance, and Brian Smits. A framework for the analysis of error in global illumination algorithms. In SIGGRAPH ’94: Proceedings of the 21st annual conference on Computer graphics and interactive techniques, pages 75–84, New York, NY, USA, 1994. ACM Press. 39 [3] Alethea Bair, Donald House, and Colin Ware. Perceptually optimizing textures for layered surfaces.
In APGV ’05: Proceedings of the 2nd symposium on Applied perception in graphics and visualization, pages 67–74, New York, NY, USA, 2005. ACM Press. 118 [4] Danny Barash and Dorin Comaniciu. A common framework for nonlinear diffusion, adaptive smoothing, bilateral filtering and mean shift. Image and Video Computing, 22(1):73–81, 2004. 58, 90 [5] Woodrow Barfield, James Sandford, and James Foley. The mental rotation and perceived realism of computer-generated three-dimensional images. Intl. J. Man-Machine Studies, 29:669–684, 1988. 112, 115, 118 [6] H.G. Barrow and J.M. Tenenbaum. Line drawings as three-dimensional surfaces. Artificial Intelligence, 17:75–116, 1981. 166 [7] C.E. Benham. The artificial spectrum top. Nature (London), 51:200, 1894. 161 [8] I. Biederman and M. Bar. One-shot viewpoint invariance in matching novel objects. Vision Research, 39(17):2885–2899, 1999. 113, 118 [9] I. Biederman and E. E. Cooper. Priming contour-deleted images: evidence for intermediate representations in visual object recognition. Cognitve Psychology, 23(3):393–419, 1991. 169 182 [10] Irving Biederman. Recognition-by-components: A theory of human image understanding. Psychological Review, 94(2):115–147, 1987. 114, 116, 117, 118, 131, 169, 170 [11] Irving Biederman and Peter C. Gerhardstein. Recognizing depth-rotated objects: Evidence and conditions for three-dimensional viewpoint invariance. Experimental Psychology, 19(6):1162–1182, 1993. 115, 117, 118 [12] T. O. Binford. Generalized cylinders representation. In S. C. Shapiro, editor, Encyclopedia of Artificial Intelligence, pages 321–323, New York, 1987. John Wiley & Sons. 118, 131 [13] Mark R. Bolin and Gary W. Meyer. A perceptually based adaptive sampling algorithm. In SIGGRAPH ’98: Proceedings of the 25th annual conference on Computer graphics and interactive techniques, pages 299–309, New York, NY, USA, 1998. ACM Press. 40 [14] R. Van den Boomgaard and J. Van de Weijer. On the equivalence of local-mode finding, robust estimation and mean-shift analysis as used in early vision tasks. 16th Internat. Conf. on Pattern Recog., 3:927–930, 2002. 90 [15] Philippe Bordes and Philippe Guillotel. Perceptually adapted MPEG video encoding. Human Vision and Electronic Imaging V, 3959(1):168–175, 2000. 37 [16] D. J. Bremer and J. F. Hughes. Rapid approximate silhouette rendering of implicit surfaces. Implicit Surfaces ’98, pages 155–164, 1998. 114 [17] S. E. Brennan. Caricature generator: The dynamic exaggeration of faces by computer. Leonardo, 18(3):170–178, 1985. 53, 176 [18] H. H. Bülthoff and S. Edelman. Psychophysical Support for a Two-Dimensional View Interpolation Theory of Object Recognition. Proc. of the Natl. Ac. of Sciences, 89(1):60– 64, 1992. 118 [19] H. H. Bülthoff and H. A. Mallot. Integration of stereo, shading and texture. In A. Blake and T. Troscianko, editors, AI and the Eye, pages 119–146. Wiley, London, UK, 1990. 150 [20] Michael Burns, Janek Klawe, Szymon Rusinkiewicz, Adam Finkelstein, and Doug DeCarlo. Line drawings from volume data. ACM Trans. Graph., 24(3):512–518, 2005. 114 [21] C. Von Campenhausen and J. Schramme. 100 years of Benham’s top in colour science. Perception, 24(6):695–717, 1995. 162 183 [22] J. F. Canny. A computational approach to edge detection. IEEE Trans. on Pattern Analysis and Machine Intelligence, 8:769–798, 1986. 69, 72 [23] K. Cater, A. Chalmers, and G. Ward. Detail to attention: exploiting visual tasks for selective rendering. 
In EGRW ’03: Proceedings of the 14th Eurographics workshop on Rendering, pages 270–280, Aire-la-Ville, Switzerland, Switzerland, 2003. Eurographics Association. 41 [24] Stephen Chenney, Mark Pingel, Rob Iverson, and Marcin Szymanski. Simulating cartoon style animation. In NPAR ’02: Proceedings of the 2nd international symposium on Nonphotorealistic animation and rendering, pages 133–138, New York, NY, USA, 2002. ACM Press. 18 [25] Johan Claes, Fabian Di Fiore, Gert Vansichem, and Frank Van Reeth. Fast 3D cartoon rendering with improved quality by exploiting graphics hardware. In Proceedings of Image and Vision Computing New Zealand (IVCNZ) 2001, pages 13–18. IVCNZ, November 2001. 19 [26] John P. Collomosse, David Rowntree, and Peter M. Hall. Stroke surfaces: Temporally coherent artistic animations from video. IEEE Trans. on Visualization and Computer Graphics, 11(5):540–549, 2005. 48, 88, 90, 91, 92, 93, 97, 174 [27] Dorin Comaniciu and Peter Meer. Mean shift analysis and applications. In ICCV ’99: Proceedings of the Int. Conference on Computer Vision-Volume 2, page 1197, Washington, DC, USA, 1999. IEEE Computer Society. 90 [28] Robert L. Cook, Loren Carpenter, and Edwin Catmull. The Reyes image rendering architecture. In SIGGRAPH ’87: Proceedings of the 14th annual conference on Computer graphics and interactive techniques, pages 95–102, New York, NY, USA, 1987. ACM Press. 15 [29] Robert L. Cook and Tony DeRose. Wavelet noise. ACM Trans. Graph., 24(3):803–811, 2005. 174 [30] Lynn A. Cooper. Mental rotation of random two-dimensional shapes. Cognitive Psychology, 7:20–43, 1975. 115 [31] B. Cumming, E. Johnston, and A. Parker. Effects of different texture cues on curved surfaces viewed stereoscopically. Vision Research, 33(5-6):827–838, 1993. 110 [32] Cassidy J. Curtis, Sean E. Anderson, Joshua E. Seims, Kurt W. Fleischer, and David H. Salesin. Computer-generated watercolor. Proceedings of SIGGRAPH 97, pages 421–430, August 1997. 18 184 [33] S. J. Daly. Visible differences predictor: an algorithm for the assessment of image fidelity. Proc. SPIE, 1666:2–15, 1992. 34, 41 [34] Richard Dawkins. Climbing Mount Improbable. W. W. Norton & Company, 1997. 23 [35] Doug DeCarlo, Adam Finkelstein, and Szymon Rusinkiewicz. Interactive rendering of suggestive contours with temporal coherence. In NPAR ’04, pages 15–24, New York, NY, USA, 2004. ACM Press. 19, 110, 114, 177 [36] Doug DeCarlo and Anthony Santella. Stylization and abstraction of photographs. ACM Trans. Graph., 21(3):769–776, 2002. 19, 33, 48, 49, 62, 63, 72, 91, 95, 174 [37] Michael F. Deering. A photon accurate model of the human eye. ACM Trans. Graph., 24(3):649–658, 2005. 15, 17 [38] Oliver Deussen and Thomas Strothotte. Computer-generated pen-and-ink illustration of trees. Proceedings of SIGGRAPH 2000, pages 13–18, July 2000. 19 [39] J. Duncan. Selective attention and the organization of visual information. Journal of experimental psychology. General., 113(4):501–517, December 1984. 105, 130 [40] Frédo Durand. An invitation to discuss computer depiction. In NPAR ’02: Proceedings of the 2nd international symposium on Non-photorealistic animation and rendering, pages 111–124, New York, NY, USA, 2002. ACM Press. 16, 20, 22 [41] David Ebert and Penny Rheingans. Volume illustration: non-photorealistic rendering of volume models. In VIS ’00: Proceedings of the conference on Visualization ’00, pages 195–202, Los Alamitos, CA, USA, 2000. IEEE Computer Society Press. 114 [42] James H. Elder. Are edges incomplete? Internat. 
APPENDIX A  User-data for Videoabstraction Studies

Table A.1 and Table A.2 list the per-participant data values for study 1 and study 2 in Section 3.4. Figure 3.21 visualizes the data for both tables. In these tables and the following, Std. Dev. stands for the standard deviation, σ, and Std. Err. stands for the standard error (not normalized), se ≡ σ/√n, where n is the number of samples.

Study 1: Recognition Time (msec)

  Data Pair   Photograph   Abstraction
  1           1159          965
  2           1291         1237
  3           1660         1281
  4           1305         1285
  5           1342         1330
  6           1486         1367
  7           1712         1378
  8           1622         1388
  9           1748         1435
  10          1811         1520
  Average     1513.5       1318.5
  Std. Dev.    227.5        148.5
  Std. Err.     72.0         47.0

Table A.1. Data for Videoabstraction Study 1. This table shows the average time (in milliseconds) each participant took to recognize a depicted face (photograph or abstraction), taken over all faces presented to that participant. The data pairs are ordered by ascending abstraction time, corresponding to Figure 3.21, top graph.

Study 2: Memory

  Data Pair   Time (secs)                Clicks
              Photograph   Abstraction   Photograph   Abstraction
  1           60.5         48.7          60           42
  2           54.1         50.8          58           52
  3           68.1         51.7          86           64
  4           64.6         55.4          50           40
  5           92.5         57.0          62           52
  6           77.7         57.1          59           42
  7           76.9         60.0          64           52
  8           92.0         66.2          64           44
  9           91.2         71.2          51           42
  10          83.7         81.4          70           62
  Average     75.5         59.4          62.4         49.2
  Std. Dev.   13.3          9.9           9.7          8.2
  Std. Err.    4.0          3.0           2.9          2.5

Table A.2. Data for Videoabstraction Study 2. This table shows the time (in seconds) and number of clicks each participant used to complete a memory game with photographs and a memory game with abstraction images. The data pairs are ordered by ascending abstraction time, corresponding to Figure 3.21, middle and bottom graphs (this ordering is not intended to correspond to Table A.1).
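For reference, the summary rows of Table A.1 can be recomputed directly from the per-pair values. The short Python sketch below is illustrative only; it is not part of the original study, and the variable and function names are my own. It reproduces the average, sample standard deviation σ, and standard error se = σ/√n for both conditions, up to small rounding differences caused by the already-rounded per-participant averages:

    import math
    import statistics

    # Per-pair recognition times (msec) from Table A.1.
    photograph  = [1159, 1291, 1660, 1305, 1342, 1486, 1712, 1622, 1748, 1811]
    abstraction = [965, 1237, 1281, 1285, 1330, 1367, 1378, 1388, 1435, 1520]

    def summarize(samples):
        n = len(samples)
        mean = statistics.mean(samples)
        sigma = statistics.stdev(samples)      # sample standard deviation
        std_err = sigma / math.sqrt(n)         # se = sigma / sqrt(n)
        return mean, sigma, std_err

    for label, data in (("Photograph", photograph), ("Abstraction", abstraction)):
        mean, sigma, se = summarize(data)
        print(f"{label}: mean={mean:.1f}  std.dev={sigma:.1f}  std.err={se:.1f}")

The same computation yields the Std. Dev. and Std. Err. rows in Table A.2 and in the tables of Appendix B.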
APPENDIX B  User-data for Shape-from-X Study

Tables B.1–B.5 list the experimental data (aggregate values averaged over four trials) for each display mode for all 21 participants of the shape-from-X study described in Section 4.5. Figure B.1 shows the questionnaire given to participants after they completed the experimental trials. Table B.6 lists the numerical data gathered from the questionnaire.

Shading
  UserID      success   failure   risk    placement   detection
  BA044       0.868     0.018     0.525   0.885       2.970
  BA046       0.910     0.073     0.235   0.983       1.021
  BA047       0.863     0.050     0.460   0.910       2.311
  BA048       0.920     0.055     0.688   0.978       3.344
  BA049       0.928     0.020     0.403   0.948       2.135
  BA050       0.878     0.043     0.505   0.923       2.445
  BA051       0.860     0.060     0.360   0.923       1.585
  BA052       0.820     0.040     0.335   0.860       1.483
  BA053       0.855     0.020     0.895   0.880       6.654
  BA054       0.800     0.035     0.443   0.833       2.196
  BA055       0.835     0.000     0.910   0.835       5.767
  BA056       0.670     0.198     0.793   0.870       1.758
  BA057       0.810     0.035     0.813   0.845       4.565
  BA058       0.903     0.008     0.643   0.910       3.868
  BA059       0.900     0.033     0.393   0.933       1.857
  BA060       0.785     0.053     0.463   0.835       1.976
  BA061       0.778     0.073     0.625   0.850       2.957
  BA062       0.785     0.020     0.800   0.805       3.545
  BA063       0.788     0.038     0.345   0.823       1.547
  BA064       0.793     0.038     0.493   0.830       2.289
  BA065       0.880     0.010     0.465   0.890       2.354
  Average     0.839     0.044     0.552   0.883       2.792
  Std. Dev.   0.063     0.041     0.198   0.051       1.433
  Std. Err.   0.014     0.009     0.043   0.011       0.313

Table B.1. Shading Data. Averages of each participant over four trials for the Shading display mode.

Outline
  UserID      success   failure   risk    placement   detection
  BA044       0.848     0.095     0.495   0.938       2.179
  BA046       0.693     0.280     0.093   0.973       0.263
  BA047       0.723     0.185     0.283   0.908       0.883
  BA048       0.830     0.140     0.310   0.965       1.200
  BA049       0.838     0.128     0.358   0.963       1.413
  BA050       0.903     0.068     0.315   0.968       1.311
  BA051       0.765     0.213     0.235   0.978       0.808
  BA052       0.683     0.190     0.188   0.873       0.639
  BA053       0.850     0.058     0.483   0.908       2.366
  BA054       0.690     0.065     0.355   0.755       1.465
  BA055       0.808     0.038     0.648   0.848       3.080
  BA056       0.683     0.248     0.690   0.928       1.507
  BA057       0.723     0.130     0.565   0.848       2.016
  BA058       0.755     0.110     0.428   0.863       1.664
  BA059       0.753     0.170     0.230   0.923       0.732
  BA060       0.803     0.103     0.390   0.905       1.464
  BA061       0.760     0.163     0.418   0.920       1.499
  BA062       0.648     0.065     0.825   0.715       3.194
  BA063       0.743     0.095     0.288   0.838       1.157
  BA064       0.785     0.090     0.398   0.875       1.646
  BA065       0.918     0.028     0.263   0.943       1.196
  Average     0.771     0.127     0.393   0.897       1.509
  Std. Dev.   0.075     0.069     0.177   0.069       0.740
  Std. Err.   0.016     0.015     0.039   0.015       0.162

Table B.2. Outline Data. Averages of each participant over four trials for the Outline display mode.

Mixed
  UserID      success   failure   risk    placement   detection
  BA044       0.853     0.085     0.533   0.938       2.287
  BA046       0.760     0.190     0.248   0.953       1.005
  BA047       0.890     0.075     0.470   0.965       2.103
  BA048       0.903     0.055     0.693   0.960       3.440
  BA049       0.933     0.038     0.430   0.973       2.083
  BA050       0.838     0.068     0.470   0.903       2.149
  BA051       0.843     0.070     0.365   0.908       1.506
  BA052       0.900     0.055     0.345   0.953       1.557
  BA053       0.885     0.043     0.633   0.928       3.231
  BA054       0.738     0.083     0.485   0.820       2.047
  BA055       0.788     0.045     1.048   0.833       5.975
  BA056       0.683     0.193     2.900   0.875       3.033
  BA057       0.743     0.078     0.698   0.823       2.833
  BA058       0.800     0.065     0.608   0.868       2.741
  BA059       0.805     0.063     0.420   0.870       1.904
  BA060       0.825     0.103     0.398   0.928       1.690
  BA061       0.798     0.075     0.630   0.873       2.669
  BA062       0.750     0.135     0.903   0.885       3.172
  BA063       0.688     0.080     0.345   0.765       1.343
  BA064       0.723     0.098     0.533   0.823       2.050
  BA065       0.860     0.040     0.503   0.898       2.635
  Average     0.810     0.083     0.650   0.892       2.450
  Std. Dev.   0.073     0.043     0.549   0.057       1.043
  Std. Err.   0.016     0.009     0.120   0.012       0.228

Table B.3. Mixed Data. Averages of each participant over four trials for the Mixed display mode.
TexISO
  UserID      success   failure   risk    placement   detection
  BA044       0.668     0.223     0.235   0.888       0.718
  BA046       0.658     0.253     0.133   0.908       0.403
  BA047       0.640     0.245     0.258   0.883       0.747
  BA048       0.620     0.295     0.353   0.915       0.934
  BA049       0.648     0.300     0.303   0.945       0.743
  BA050       0.758     0.185     0.195   0.940       0.673
  BA051       0.730     0.230     0.180   0.960       0.517
  BA052       0.670     0.185     0.148   0.855       0.458
  BA053       0.740     0.150     0.390   0.888       1.214
  BA054       0.735     0.140     0.133   0.880       0.423
  BA055       0.705     0.100     0.378   0.805       1.415
  BA056       0.468     0.350     0.513   0.818       0.780
  BA057       0.673     0.233     0.300   0.905       0.789
  BA058       0.680     0.250     0.223   0.928       0.677
  BA059       0.820     0.058     0.173   0.878       0.733
  BA060       0.728     0.183     0.293   0.910       0.955
  BA061       0.543     0.398     0.235   0.940       0.544
  BA062       0.580     0.135     0.438   0.715       1.309
  BA063       0.590     0.188     0.183   0.775       0.577
  BA064       0.540     0.213     0.275   0.753       0.814
  BA065       0.718     0.138     0.235   0.855       0.811
  Average     0.662     0.212     0.265   0.873       0.773
  Std. Dev.   0.084     0.082     0.103   0.066       0.274
  Std. Err.   0.018     0.018     0.022   0.014       0.060

Table B.4. TexISO Data. Averages of each participant over four trials for the TexISO display mode.

TexNOI
  UserID      success   failure   risk    placement   detection
  BA044       0.688     0.273     0.193   0.958       0.555
  BA046       0.675     0.218     0.153   0.893       0.439
  BA047       0.510     0.385     0.240   0.895       0.533
  BA048       0.620     0.323     0.275   0.945       0.717
  BA049       0.520     0.458     0.300   0.975       0.655
  BA050       0.773     0.200     0.258   0.968       0.878
  BA051       0.645     0.258     0.193   0.900       0.532
  BA052       0.573     0.323     0.115   0.895       0.344
  BA053       0.653     0.238     0.295   0.893       0.903
  BA054       0.715     0.123     0.115   0.838       0.374
  BA055       0.690     0.120     0.383   0.808       1.404
  BA056       0.565     0.315     0.533   0.880       0.719
  BA057       0.695     0.165     0.358   0.858       1.127
  BA058       0.650     0.228     0.233   0.880       0.731
  BA059       0.690     0.135     0.170   0.828       0.577
  BA060       0.640     0.113     0.328   0.753       1.135
  BA061       0.628     0.258     0.285   0.888       0.794
  BA062       0.480     0.165     0.395   0.645       1.111
  BA063       0.710     0.170     0.145   0.880       0.489
  BA064       0.580     0.255     0.263   0.833       0.707
  BA065       0.795     0.143     0.210   0.938       0.770
  Average     0.643     0.231     0.259   0.874       0.738
  Std. Dev.   0.082     0.093     0.103   0.076       0.277
  Std. Err.   0.018     0.020     0.023   0.017       0.060

Table B.5. TexNOI Data. Averages of each participant over four trials for the TexNOI display mode.

PiGeonAtor Questionnaire

1.) Please rank the display modes in order of difficulty. If some modes felt the same, you can assign the same number. (Scale: 1=Easiest … 5=Most difficult)
    (A) Shading          Rating: ...............
    (B) Lines            Rating: ...............
    (C) Shading&Lines    Rating: ...............
    (D) Texture1         Rating: ...............
    (E) Texture2         Rating: ...............
2.) Rate how well you think you performed in the experiment (did you hit most targets?)
    Performance: ...............   (Scale: 1=Very Good … 5=Poor)
3.) Rate how clear the instructions were to understand
    Instructions: ...............  (Scale: 1=Very Clear … 5=Totally Unclear)
4.) Rate how difficult you found the mode of interaction with the system (i.e. clicking/tapping)
    Interaction: ...............   (Scale: 1=Very Easy … 5=Very difficult)
5.) Rate the duration of the experiment
    Length: ...............        (Scale: 1=Too short … 5=Too long)
6.) Did the experiment tire/exhaust you?
    Exhaustion: ...............    (Scale: 1=Not at all … 5=I was very exhausted)
7.) Did the experiment cause you any discomfort?
    Comfort: ...............       (Scale: 1=Not at all … 5=I was very uncomfortable)
8.) If you did not answer Not at all above, please explain:
    Comfort explanation: ...............
9.) Did you notice that your hands were casting a shadow?  ( ) Yes  ( ) No
10.) If yes above, do you think it impaired your performance?  ( ) Yes  ( ) No
11.) Please give any additional comments or suggestions you may have
    ...............
Figure B.1. Questionnaire. Participants were asked to fill out this short questionnaire after completing all trials.

Questionnaire results. Columns correspond to questionnaire items: 1a) Shading, 1b) Lines (Outlines), 1c) Shading & Lines (Mixed), 1d) Texture1 (TexISO), and 1e) Texture2 (TexNOI) give the subjective difficulty ranking; the remaining columns are 2) Performance, 3) Instructions, 4) Interaction, 5) Duration, 6) Exhaustion, 7) Discomfort, 9) ShadowCast, and 10) ShadowImpair.

  UserID      1a   1b   1c   1d   1e   2    3    4    5    6    7    9   10
  BA044       1    2    2    5    4    3    1    2    4    4    3    0    0
  BA046       2    5    1    4    3    3    2    1    3    3    1    0    0
  BA047       2    3    1    5    4    4    1    1    3    1    1    0    0
  BA048       1    5    2    3    4    2    1    1    3    2    2    0    0
  BA049       3    4    1    5    5    4    2    3    3    2    1    0    0
  BA050       2    3    1    4    5    2    1    1    3    2    1    0    0
  BA051       1    4    1    5    5    3    2.5  2    5    4    2    1    0
  BA052       2    3    1    5    5    3    1    2    4    2    1    0    0
  BA053       2    4    1    3    5    2    4    2    3    3    3    1    0
  BA054       1    2    1    4    3    3    3    1    4    3    1    0    0
  BA055       2    3    1    4    5    4    1    1    3    1    1    0    0
  BA056       3    1    1    4    5    3    4    3    2    2    1    0    0
  BA057       1    3    1    5    4    2    2    3    5    4    3    0    0
  BA058       1    4    2    5    5    2    1    3    3    4    1    1    1
  BA059       1    3    2    4    5    2.5  3    1    3    2    1    1    0
  BA060       2    3    1    4    5    3    2    1    4    2    1    0    0
  BA061       2    3    1    4    5    3    1    1    3    2    2    0    0
  BA062       2    3    1    4    5    3    1    3    3    1    1    0    0
  BA063       2    3    1    4    5    2.5  4    2    2.5  2    2    1    1
  BA064       2    3    1    5    4    3    2    2    5    3    3    1    1
  BA065       2    3    1    4    5    4    5    2    3    2    2    0    0
  Average     1.8  3.2  1.2  4.3  4.6  2.9  2.1  1.8  3.4  2.4  1.6  0.3  0.2
  Std. Err.   0.1  0.2  0.1  0.1  0.2  0.2  0.3  0.2  0.2  0.2  0.2  0.1  0.1
  Mode        2    3    1    4    5

Table B.6. Questionnaire Data. Numerical results for the questionnaire shown in Figure B.1. See the questionnaire for the meaning of each column and the scales used. Display mode names in parentheses are those used in this dissertation. For questions 9 and 10, 1=Yes and 0=No. Mode in the last row refers to the statistical measure (the most frequent value), not to a display mode.
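To make the summary rows of Table B.6 concrete, the Python sketch below (again illustrative only, not part of the original study; the variable names are my own) recomputes the Average, Std. Err., and Mode entries for one column, the 1b) Lines (Outlines) difficulty ratings:

    import math
    import statistics

    # Column 1b) Lines (Outlines) difficulty ratings from Table B.6,
    # one value per participant, BA044 through BA065.
    lines_rating = [2, 5, 3, 5, 4, 3, 4, 3, 4, 2, 3, 1, 3, 4, 3, 3, 3, 3, 3, 3, 3]

    n = len(lines_rating)
    average = statistics.mean(lines_rating)                     # average rating
    std_err = statistics.stdev(lines_rating) / math.sqrt(n)     # se = sigma / sqrt(n)
    mode = statistics.mode(lines_rating)                        # most frequent rating
    print(f"average={average:.1f}  std.err={std_err:.1f}  mode={mode}")

Running this yields an average of about 3.2, a standard error of about 0.2, and a mode of 3, matching the corresponding column of Table B.6.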
APPENDIX C  Links for Selected Objects

Table C.1 lists publicly accessible internet URLs for several images and other objects used in this dissertation. No guarantees can be made about the validity and availability of these links.

  Figure 1.1: http://commons.wikimedia.org/wiki/Image:Glasses_800.png
  Figure 1.2 (a): http://commons.wikimedia.org/wiki/Image:IMG_0071_-_England%2C_London.JPG
  Figure 1.3 (b), Bunny model: http://graphics.stanford.edu/data/3Dscanrep/
  Figure 1.4 (a): http://commons.wikimedia.org/wiki/Image:Escaping_criticism_by_Caso.jpg
  Figure 1.4 (b): http://commons.wikimedia.org/wiki/Image:Portrait_of_Dr._Gachet.jpg
  Figure 3.8, eye-tracking data and source image: http://www.cs.rutgers.edu/~decarlo/abstract.html
  Figure 3.9, source images: http://upload.wikimedia.org/wikipedia/commons/4/4c/Pitt_Clooney_Damon.jpg
  Figure 3.10, source image: http://www.indcjournal.com/archives/Lehrer.jpg
  Figures 3.13–3.17, source image: http://www.flickr.com/photos/johnnydriftwood/115499900/
  Figure 3.26, original, stationary: http://commons.wikimedia.org/wiki/Image:Ferrari-250-GT-Berlinetta-1.jpg
  Figure 4.4, Girl courtesy of: www.crystalspace3d.org
  Figure 4.4, Man & Tool courtesy of: http://www.3dcafe.com
  Figure 4.4, Architecture model courtesy of Google 3D Warehouse: http://sketchup.google.com/3dwarehouse
  Figure 4.7, rendering engine: http://fabio.policarpo.nom.br/fly3d/index.htm
  Figure 5.4, left: http://commons.wikimedia.org/wiki/Image:Burger_King_Whopper_Combo.jpg
  Figure 5.4, right: http://www.flickr.com/photo_zoom.gne?id=100995096&size=o

Table C.1. Internet references. Links to selected images and 3-D models.