Motion Magnification of Facial Micro-expressions

Sumit Gogia, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA, summit@mit.edu
Runpeng Liu, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA, rliu42@mit.edu

December 8, 2014

Figure 1: Motion magnification of facial micro-expressions generated by "frustration" (male) and "disgust" (female) emotional cues. (a) Top: frame in original video sequence. Bottom: spatiotemporal YT slices of the video along the eye and lip profiles marked (yellow) above. (b-c) Results of applying linear Eulerian motion magnification [9] and phase-based motion magnification [7], respectively, to the video sequence in (a). Subtle lip and eye movements associated with micro-expressions are amplified (top), as visualized by the YT spatiotemporal slices (bottom).

1 Introduction

A common issue faced when judging emotional reactions, such as in psychology appointments and high-stakes interrogation rooms, is the inability to accurately detect brief, subtle expressions called micro-expressions. These micro-expressions frequently indicate true emotional response, but are also often difficult for human vision to perceive [2].

To tackle this problem, there have been attempts to frame micro-expression detection computationally, with the hope that machine vision resources can parse the small differences that human visual systems toss out ([4], [5], [6], [10]). In particular, machine learning has been employed to good effect in emotion recognition from facial images and videos [1]. The current models, though, cannot distinguish the expressions of a wide variety of people well, and cannot accurately detect and classify micro-expressions. In addition, in high-sensitivity environments such as psychological treatments and interrogation, it may not be desirable to fully automate the emotion detection procedure.

These issues motivate methods that sit between fully automated solutions and pure human perception for analyzing emotion. In this paper, we develop such a method relying on motion magnification of video sequences, a topic with recent and promising development ([3], [9]). Motion magnification, particularly Eulerian motion magnification ([7], [8], [9]), has been utilized to great effect in revealing subtle motions in videos, e.g. for amplification of physiological features such as heart pulse, breath rate, and pupil dilation ([9], [7]). Accordingly, we use motion magnification to magnify subtle facial expressions and make them visible to humans.

Fundamentally, the Eulerian method uses temporal filtering to amplify variation in a fixed spatial region across a sequence of video frames. It has been shown in [9] that this Eulerian process can also be used to approximate spatial translation in 2D, and thereby amplify small motions in the same video.

We formulate the task of magnifying micro-expressions by detailing the linear and phase-based Eulerian motion processing methods. We describe the filtering parameters used in these methods and delineate the connections between facial motion cues and human emotions. These are used to specify spatiotemporal filters for extracting motion from different facial features. We lastly evaluate our proposed approach against a small dataset of facial expression video sequences we collected, and examine the relevance of the collected data to target emotion detection environments. Our results indicate that motion magnification can be used to amplify a variety of subtle facial expressions and make them visible to humans.
2 Background

2.1 Lagrangian Motion Magnification

Lagrangian methods inspired the initial solutions to motion magnification, and are used in many optical flow and feature-tracking algorithms. In this approach, optical flow vectors are computed for each pixel in an image frame so as to track moving features in time and space through a video sequence. As such, Lagrangian methods are best suited for representing large, broad movements in a video. They generally fail to amplify small changes arising from very subtle motions.

Another significant drawback of the approach is that much computation must be expended to ensure that motion is amplified smoothly; even with diligence, artifacts tend to appear, as shown in [3]. Lastly, we note that the approach is also sensitive to noise, as tracking explicit pixels under noise is difficult. We will see that the Eulerian approach mitigates some of these effects.

2.2 Eulerian Motion Magnification

While solutions under the Lagrangian approach rely on global optimization of flow vectors and smoothness parameters, the Eulerian method of motion magnification analyzes temporal changes of pixel values in a spatially localized manner, making it less computationally expensive. In addition, it is capable of approximating and magnifying small-scale spatial motion [9, 7], indicating its applicability to magnification of micro-expressions.

In the first step of Eulerian motion magnification, the video is decomposed into different spatial pyramid levels (Figure 2a). Spatial decomposition allows modularity in signal amplification, so that bands that best approximate the desired motions receive greater amplification, and those that encode unwanted artifacts receive little amplification. Recent works in Eulerian motion magnification have used various types of spatial pyramids for this step, such as Laplacian pyramids by Wu et al. [9], complex steerable pyramids by Wadhwa et al. [7], and Riesz pyramids in [8].

Next, a temporal bandpass filter is applied to each spatial band in order to extract time series signals responding to a frequency range of interest (Figure 2b). The bandwidth and frequency range of the temporal filter are modular, and can be fine-tuned for various applications such as selective magnification of motion factors in a video. In our experiments, modular temporal filters are used to extract movement from different facial features.

After applying the desired temporal filter to each spatial pyramid level, the bands are amplified before being added back to the original signal (Figure 2c). These amplification factors are also modular, and can be specified as a function of the spatial frequency band. After amplification, the spatial bands are collapsed to form a reconstructed video with the desired temporal frequencies amplified (Figure 2d).

Figure 2: Spatiotemporal processing pipeline for the general Eulerian video magnification framework. (a) Input video is decomposed into different spatial frequency bands (i.e. pyramid levels). (b) A user-specified temporal bandpass filter is applied to each spatial pyramid level. (c) Each spatial band is amplified by a motion magnification factor, α, before being added back to the original signal. (d) Spatial pyramids are collapsed to reconstruct the motion-magnified output video. Figure adapted from [9].
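To make steps (a)-(d) concrete, the following is a minimal Python sketch of the linear variant of this pipeline, assuming NumPy, OpenCV, and SciPy are available. The function names, the first-order Butterworth filter, and the uniform amplification factor are our own illustrative simplifications, not the reference implementation of [9].

```python
# Minimal sketch of the Eulerian pipeline, steps (a)-(d). Illustrative only;
# not the reference code of [9]. frames: (T, H, W) float32 grayscale video.
import numpy as np
import cv2
from scipy.signal import butter, filtfilt

def laplacian_pyramid(img, levels):
    """(a) Decompose a frame into spatial frequency bands plus a low-pass residual."""
    pyramid, current = [], img
    for _ in range(levels):
        down = cv2.pyrDown(current)
        up = cv2.pyrUp(down, dstsize=(current.shape[1], current.shape[0]))
        pyramid.append(current - up)   # band-pass detail at this scale
        current = down
    pyramid.append(current)            # low-pass residual
    return pyramid

def collapse_pyramid(pyramid):
    """(d) Rebuild a frame from its (possibly modified) pyramid levels."""
    current = pyramid[-1]
    for band in reversed(pyramid[:-1]):
        current = cv2.pyrUp(current, dstsize=(band.shape[1], band.shape[0])) + band
    return current

def magnify(frames, fps, w_l, w_h, alpha, levels=4):
    """Amplify temporal variation in the [w_l, w_h] Hz band by a factor alpha."""
    pyramids = [laplacian_pyramid(f, levels) for f in frames]
    b, a = butter(1, [w_l, w_h], btype="bandpass", fs=fps)
    for lvl in range(levels + 1):
        band = np.stack([p[lvl] for p in pyramids])   # per-pixel time series
        filtered = filtfilt(b, a, band, axis=0)       # (b) temporal bandpass
        for t in range(len(frames)):
            pyramids[t][lvl] += alpha * filtered[t]   # (c) amplify, add back
    return np.array([collapse_pyramid(p) for p in pyramids])
```

In practice the amplification factor would be attenuated at the finer pyramid levels according to the bounds discussed in the next subsections, rather than applied uniformly as in this sketch.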
2.2.1 Linear Magnification

Linear motion magnification, examined in [9], was the first method proposed under the Eulerian approach. In linear motion magnification, variations of pixel values over time are considered the motion-coherent signals, and so are extracted by a temporal bandpass filter and amplified.

Theoretical justification for this coherence between pixel value and motion is given in [9]. The authors used a first-order Taylor series approximation of image motion to show that amplification of temporally-bandpassed pixel values can approximate motion, and derived a bound on the amplification factor:

(1 + α)δ(t) < λ/8

where α is the magnification factor, δ(t) is the motion signal, and λ is the spatial wavelength of the moving signal. This bound is then applied to determine how much to amplify the motion-coherent signal at each spatial band for smooth motion magnification.

This method was shown by the inventors to magnify subtle motion signals effectively; however, as seen through the derived bound, the range of suitable α values is small for high spatial frequency signals. In addition, with high α values noise can be amplified significantly, since amplification acts directly on the pixel values.

2.2.2 Phase-based Magnification

In phase-based magnification, as opposed to linear motion magnification, the image phase over time is taken to be the motion-coherent signal, following from a relation between signal phase and translation. To operate on phase as well as spatial subbands, a complex steerable pyramid is employed; each phase signal at each spatial subband and orientation is then temporally bandpassed and amplified.

The method has theoretical justification similar to that for linear motion magnification, though a Fourier expansion is used instead to expose the relation between translation and phase. Again, a corresponding bound on the amplification factor was found:

αδ(t) < λ/4

with α the magnification factor, δ(t) the motion signal, and λ the spatial wavelength corresponding to the frequency extracted by the processed subband.

The phase-based method extends the range of appropriate amplification factors and has improved noise performance relative to linear magnification, allowing for more accurate magnification of small motions and better modular control over magnification of larger motions. The only drawback of the phase-based approach is performance efficiency, as computing the complex steerable pyramid at different scales and orientations takes substantially more time than computing the Laplacian pyramid in the linear approach.
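For intuition about how restrictive these bounds are in practice, the short calculation below (our own illustration, using assumed pixel-scale values rather than measured ones) converts each inequality into a maximum usable α per spatial wavelength.

```python
# Illustrative only: largest alpha allowed by the two bounds, for assumed values
# of the motion amplitude delta (pixels) and spatial wavelength (pixels).
def max_alpha_linear(wavelength, delta):
    # (1 + alpha) * delta < wavelength / 8  =>  alpha < wavelength / (8 * delta) - 1
    return wavelength / (8.0 * delta) - 1.0

def max_alpha_phase(wavelength, delta):
    # alpha * delta < wavelength / 4  =>  alpha < wavelength / (4 * delta)
    return wavelength / (4.0 * delta)

delta = 0.2  # assumed sub-pixel facial motion
for wavelength in (16, 32, 64, 128):  # coarser pyramid levels carry longer wavelengths
    print(wavelength, max_alpha_linear(wavelength, delta), max_alpha_phase(wavelength, delta))
```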
2.3 Micro-expressions

Micro-expressions are brief facial expressions that occur when people unconsciously repress or deliberately conceal emotions. They match the facial expressions that occur when emotion is naturally expressed, and so their detection and magnification is valuable for recognizing emotion in high-stakes environments. Unfortunately, micro-expressions are difficult for humans to detect due to their brevity and their subtlety in the spatial domain. It is for these reasons that micro-expressions are a natural candidate for motion magnification.

In order to formulate the magnification task, a description of facial expressions in space is required, particularly for specifying the spatiotemporal frequencies to magnify. This description has been given much study, and accepted results have been determined for the 6 universal emotions. We list them in Table 1 and visualize them in Figure 3, as found in [1].

Table 1: Summary of facial feature motion cues associated with each of the six universal emotions, as specified in [1].

Emotion       Description
Happiness     Raising of mouth corners
Sadness       Lowering of mouth corners
Surprise      Brow arch; eyes open wide
Anger         Eyes bulging; brows lowered
Frustration   Lowering of mouth corners; slanted lips; eye twitch
Disgust       Brow ridge wrinkled; head withdrawal

Figure 3: A visualization of the 6 universal facial expressions as they form in space and time; taken from [1].

3 Methods

3.1 Generation of Micro-Expression Sequences

We collected video sequences of facial expressions from 10 volunteer undergraduate students at the Massachusetts Institute of Technology (MIT). A DSLR camera recording at 30 fps was trained on subjects' faces to videotape their responses to various emotion words. Subjects were instructed to remain as motionless and emotionless as possible during the experiment, except when cued by emotional keywords to generate a particular micro-expression.

Verbal cues for the 6 universal emotions (happiness, sadness, disgust, frustration, anger, surprise) as classified by psychological research [1] were given at 15-second intervals to elicit corresponding micro-expressions from each subject. Though micro-expressions often occur involuntarily or unconsciously in real-life situations, it was difficult to reproduce this effect in an experimental setting. Instead, subjects were advised to imagine a high-stakes situation in which there would be great incentive to conceal their true feelings or emotional responses to a sensitive topic. This instruction was intended to motivate subjects to make their facial expressions as brief and subtle as possible, so that any motion of facial features would be nearly imperceptible to the human eye, but suitable for applying motion magnification.

Finally, in selecting appropriate video footage for processing by the motion magnification algorithms described previously, we discarded any sequences in which a subject's facial expressions were trivially perceptible by the naked eye (i.e. too melodramatic to be considered a micro-expression), or in which large head movements would lead to unwanted artifacts in a motion-magnified video.

3.2 Application of Motion Magnification

We applied linear and phase-based Eulerian motion magnification, as formulated by Wu et al. [9] and Wadhwa et al. [7] respectively, to the micro-expression sequences we collected. In specifying the temporal frequency ranges that will best amplify motions of different facial features (i.e. head, mouth/lip, brow, eye/pupil), we follow the heuristic that motions of low temporal frequencies correspond to larger facial features (subtle but broad head movements); motions of mid-range temporal frequencies correspond to brow and lip movements; and motions of high temporal frequencies correspond to sudden motions of small facial features (eye/pupil movements). Rough estimates of the temporal frequency benchmarks we used for magnifying motion in these facial features are given in Table 2.

Table 2: Estimated temporal frequency benchmarks for magnifying motion of different facial features in our video sequences. Values are hypothesized based roughly on the size and scale of motion to be observed at each location.

Facial feature   Temporal frequency range
Head             < 1.0 Hz
Brow and lip     1.0-5.0 Hz
Eye/pupil        > 5.0 Hz

Next, we specified temporal bandpass filter parameters for motion-magnifying the subtle facial expressions corresponding to each of the 6 universal emotions. We synthesize the qualitative descriptions of facial motion cues in Table 1 with the quantified estimates of optimal frequency ranges in Table 2 to generate temporal filters (specified by a low frequency cutoff, ωl, and a high frequency cutoff, ωh) for the six emotions.

For example, in magnifying motion of facial features corresponding to the micro-expression of disgust, we chose a temporal bandpass filter of [ωl, ωh] = [0.5, 4.0] Hz that might amplify subtle movement due to "head withdrawal" (< 1.0 Hz; Table 2) and "brow wrinkling" (1.0-5.0 Hz; Table 2). For frustration, we used a temporal bandpass of [ωl, ωh] = [1.5, 6.0] Hz to extract possible movement due to "slanted lips" (1.0-5.0 Hz) and "eye twitching" (> 5.0 Hz). The full set of hypothesized temporal filter parameters is summarized in Table 3.

Table 3: Temporal bandpass filtering parameters used to magnify video sequences of each micro-expression. Values are hypothesized based on the description of facial motion cues in Table 1 and the temporal frequency ranges in Table 2.

Micro-expression   Temporal bandpass [ωl, ωh] (Hz)
Happiness          [1.0, 3.0]
Sadness            [1.0, 3.0]
Disgust            [0.5, 4.0]
Frustration        [1.5, 6.0]
Anger              [1.5, 8.0]
Surprise           [2.0, 8.0]
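To show how the entries of Table 3 translate into concrete filters at our 30 fps frame rate, the sketch below builds a Butterworth bandpass per emotion. The dictionary, the filter order, and the magnify function are the hypothetical names from the earlier pipeline sketch, not part of the released implementations of [9] or [7].

```python
# Hypothetical mapping from Table 3 to temporal bandpass filters at 30 fps.
from scipy.signal import butter

FPS = 30.0
EMOTION_BANDS = {            # [w_l, w_h] in Hz, taken from Table 3
    "happiness":   (1.0, 3.0),
    "sadness":     (1.0, 3.0),
    "disgust":     (0.5, 4.0),
    "frustration": (1.5, 6.0),
    "anger":       (1.5, 8.0),
    "surprise":    (2.0, 8.0),
}

def emotion_filter(emotion, order=1):
    """Bandpass coefficients (b, a) for the cued emotion."""
    w_l, w_h = EMOTION_BANDS[emotion]
    return butter(order, [w_l, w_h], btype="bandpass", fs=FPS)

# Example usage with the earlier pipeline sketch:
# magnified = magnify(frames, FPS, *EMOTION_BANDS["disgust"], alpha=20)
```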
4 Results and Discussion

We used the linear Eulerian magnification implementation from [9] and our own unoptimized MATLAB implementation of phase-based motion processing to magnify the collected video sequences for each facial expression, on a quad-core laptop with 32 GB of RAM. While our code for phase-based motion processing took roughly 100 seconds to run on each 200-frame video sequence, with the optimizations discussed in [7] the magnification could be run with similar quality in real time. All code, as well as the original and magnified video sequences, is available upon request.

Using the filtering parameters hypothesized in Table 3, the magnification results were good, with magnification clearly visible for all subjects over all emotions (Figure 4). In frames from the motion-magnified videos, we observe facial motion cues corresponding well to those described in Table 1. For example, for frustration (Figure 1), we observe slanting of the lips and eye twitching in the motion-magnified frame that is nearly imperceptible in the source frame. These amplified facial motions are consistent with those presented in Table 1 and described in [1].

Figure 4: Sample results of applying phase-based Eulerian motion magnification to facial micro-expressions corresponding to the six universal emotions. (left) shows a frame from the source video and (right) shows the corresponding frame in the motion-magnified video. The magnified facial motions, e.g. (a) raising of lip corners; (b) lowering of lip corners; (c) brow wrinkling; (d) eyes bulging; (e) slanted lips and eye twitching; (f) eyes open wide, correspond well to the motion cues described in Table 1.

Artifacting and noise amplification were more visible after linear magnification, as expected, but not significantly detrimental to video quality (Figure 1b). For both magnification methods, the happiness and sadness expressions required more magnification to reach levels of distinctiveness similar to the other expressions. The results as a whole indicate that magnification of micro-expressions is achievable by the methods proposed.

4.1 Usage of Simulated Expressions

A notable issue with the results is that they were obtained for artificial data; that is, micro-expressions were simulated by people on command, whereas true micro-expressions arise during emotional concealment or repression. It would of course be desirable to apply the methods described to real data. However, setting up an environment in which subjects are forced to react in emotionally-charged situations, particularly the high-stakes environments that are necessary for micro-expressions to appear, is difficult and proved infeasible in the time allotted for this project.

Despite this drawback, we believe that the data obtained appropriately approximates genuine production of micro-expressions, and at the least serves as subtle and brief facial motion. That these sequences can clearly be magnified indicates that the same holds for true micro-expressions, and possibly normal expressions as well.

4.2 Head Motion Environments

In our experiments, we requested that subjects remain completely motionless except when making facial expressions. While in many applicable real-life environments, such as interrogation rooms, this may be a feasible constraint, it is also likely that the subject is in a more natural state and so has continual small head motions.

We noted that for subjects with large head motions, the magnification was fairly unhelpful, as the expression motion was coordinated with the head motion and difficult to separate. It would then be prudent to see whether expression magnification under small head motions remains acceptable.

One approach which may help in this case, and would only improve the usefulness of the approach in this paper, would be to separate out the pieces of the face important to the expression, namely the eyes and mouth. Isolating the motion of these components may not help with separating head motion from expression motion, but it would allow better perceptual focus for the viewer, as sketched below.
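A minimal sketch of that idea, assuming approximate eye and mouth bounding boxes are available from some external source (the coordinates below are placeholders, not measured values), is to magnify only crops around those regions and paste the results back into otherwise unmodified frames.

```python
# Hedged sketch: magnify only eye/mouth crops so the rest of the frame, and any
# head motion in it, is left untouched. Boxes are placeholder values; in practice
# they would come from a facial landmark detector.
import numpy as np

def magnify_regions(frames, rois, fps, w_l, w_h, alpha):
    """frames: (T, H, W) array; rois: list of (y0, y1, x0, x1) boxes."""
    output = frames.copy()
    for (y0, y1, x0, x1) in rois:
        crop = np.ascontiguousarray(frames[:, y0:y1, x0:x1])
        output[:, y0:y1, x0:x1] = magnify(crop, fps, w_l, w_h, alpha)  # earlier sketch
    return output

# Placeholder eye and mouth regions for a 480x640 frame.
example_rois = [(150, 220, 180, 460),   # assumed eye band
                (300, 380, 240, 400)]   # assumed mouth region
```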
4.3 Comparison to Full Automation and Human Perception

A natural question the reader may have regards the usefulness of this approach in comparison to a fully automated system, or to a professional trained in recognizing and understanding facial expressions.

While a fully automated system is useful for unbiased emotion recognition, and current methods do achieve good results [10], they do not achieve the extremely high accuracy required for many high-stakes application environments. In some cases, an incorrect decision could be tremendously costly, such as the decision to release a serial killer after interrogation. As humans have proven to be effective in understanding emotion from expression, the method proposed has an advantage over a fully automated system. Using both in tandem can also prove useful.

Another drawback of fully automated systems is that current machine learning approaches require a significant amount of robust training data to train the necessary classifiers. The time needed for data collection and model training can become an issue with the fully automated approach.

On the other end, professionals trained in recognizing facial expressions may have limited availability in some target applications. Our proposed approach can serve as an effective substitute for these professionals, as well as support professional opinions, a quality valuable in high-stakes environments.
5 Conclusion

We showed the feasibility of amplifying nearly imperceptible facial expressions by applying Eulerian motion magnification to collected video sequences of micro-expressions. In our proposed method, we hypothesize appropriate temporal filtering parameters for magnifying motion of different facial features. Our motion-magnified results correspond well to the accepted facial motion cues of the six universal emotions classified by psychological research. This approach sits well between fully automated methods for emotion detection, which may not be suitable for high-stakes environments, and pure human perception, which may be unable to detect subtle changes in facial features at the scale of micro-expressions.

6 Acknowledgements

We thank Cole Graham, Harlin Lee, Rebecca Shi, Melanie Abrams, Staly Chin, Norman Cao, Eurah Ko, Aditya Gopalan, Ryan Fish, Sophie Mori, Kevin Hu, and Olivia Chong for volunteering in our micro-expression experiments.

References

[1] M. J. Black and Y. Yacoob. Recognizing facial expressions in image sequences using local parameterized models of image motion. International Journal of Computer Vision, 25(1):23-48, 1997.
[2] P. Ekman. Lie catching and microexpressions. The Philosophy of Deception, pages 118-133, 2009.
[3] C. Liu, A. Torralba, W. T. Freeman, F. Durand, and E. H. Adelson. Motion magnification. ACM Transactions on Graphics (TOG), 24(3):519-526, 2005.
[4] T. Pfister, X. Li, G. Zhao, and M. Pietikainen. Recognising spontaneous facial micro-expressions. IEEE International Conference on Computer Vision (ICCV), pages 519-526, 2011.
[5] M. Shreve, S. Godavarthy, V. Manohar, D. Goldgof, and S. Sarkar. Towards macro- and micro-expression spotting in video using strain patterns. IEEE Workshop on Applications of Computer Vision (WACV), 2009.
[6] M. Shreve, S. Godavarthy, V. Manohar, D. Goldgof, and S. Sarkar. Macro- and micro-expression spotting in long videos using spatio-temporal strain. IEEE Conference on Automatic Face and Gesture Recognition, 2011.
[7] N. Wadhwa, M. Rubinstein, F. Durand, and W. T. Freeman. Phase-based video motion processing. ACM Transactions on Graphics (TOG), 32(4):80, 2013.
[8] N. Wadhwa, M. Rubinstein, F. Durand, and W. T. Freeman. Riesz pyramids for fast phase-based video magnification. 2014.
[9] H.-Y. Wu, M. Rubinstein, E. Shih, J. V. Guttag, F. Durand, and W. T. Freeman. Eulerian video magnification for revealing subtle changes in the world. ACM Transactions on Graphics (TOG), 31(4):65, 2012.
[10] Q. Wu, X. Shen, and X. Fu. The machine knows what you are hiding: An automatic micro-expression recognition system. Computing and Intelligent Systems, 2011.