Kernel Subspace Mapping: Robust Human Pose and Viewpoint Inference from High Dimensional Training Sets

by Therdsak Tangkuampien, BscEng(Hons)

Thesis submitted by Therdsak Tangkuampien in fulfillment of the requirements for the Degree of Doctor of Philosophy

Supervisor: Professor David Suter
Associate Supervisor: Professor Ray Jarvis

Department of Electrical and Computer Systems Engineering
Monash University
January, 2007

© Copyright by Therdsak Tangkuampien 2007

Addendum

• page 2 para 2: Comment: KSM allows generalization from a single generic model to a previously unseen person, and it is a matter for future work to determine how this approach can generalize to people of different somatotype and those wearing different clothing.
• page 4 para 1: Comment: It is important to note that KPCA can perform both de-noising and data reduction. In the context of human motion capture via KSM, KPCA is used for 'de-noising' because the processed data is not constrained to be a subset of the training set.
• page 22 para 2: Comment: KSM can be applied to any number of cameras; the emphasis of the work is on the development of a new learning technique that generalizes from only a small training set. This is demonstrated in two-camera markerless tracking, which appears sufficient to disambiguate the coupled problem of pose and yaw.
• page 30 para 3: Comment: The Gaussian white noise model is used for parameter tuning (in KSM) because KPCA de-noising has been shown to be effective and efficient in the minimization of Gaussian noise [91]. Provided that the noise level is not substantial, experiments (conducted by the author) show that the optimal parameter in equation 3.6 remains relatively stable under changes in noise level. In practice (during capture), the robustness of KSM is tested by analyzing its accuracy in pose inference from previously seen and unseen silhouettes corrupted with synthetic noise (section 4.4).
• page 36 para 2: Comment: In addition to the Gaussian kernel (equation 3.3.2), other kernels can be used in KSM. It is a matter of future research to determine if some kernels perform better and more efficiently than others. The important factors that will need to be considered are the accuracy and efficiency of the pre-image approximation of KPCA [87] based on the specific kernel.
• page 37 para 2: Comment: The parameters K+ and λ are tuned over a predefined discretized range, so as to minimize the error function in equation 3.20.
• page 43 para 2: Comment: Specifically for the experiments in section 3.5, mean square error (Mse) is defined as the Euclidian distance between two RJC vectors as defined in section 3.4.1.
• page 58 Add to the end of para 1: There are many other tunable parameters for KSM. For example, two orthogonal camera views are used (in figure 4.4) because the author believes that this provided the greatest information from two synchronized views. Another tunable parameter is the level of the pyramids used in figure 4.5. In our analysis, we found that a pyramid level of 5 is the most robust for our experimental setup. There are also other factors (such as changes in illumination level and segmentation quality) which can affect the accuracy of KSM. It is a subject of further research to evaluate the performance of KSM under different parametric conditions.
• page 61 para 2: Comment: In the selection of the optimal number of neighbors for LLE mapping, it is important to avoid over-fitting.
If the number of neighbors is too small, KSM may end up locked onto the wrong poses. An interesting area to investigate further is how to dynamically identify the most robust number of neighbors to use for LLE mapping (irrespective of motion type).
• page 86 para 2: Comment: It is important to note that for IMED embedded KSM, the use of LLE mapping will still be affected by the same ambiguity problem as in figure 4.8. However, the probability of this occurring is substantially reduced due to the use of photometric information in section 5.4.1.
• page 93 Add to the end of para 1: Another interesting area of research for IMED embedded KSM is to investigate its robustness in an uncontrolled environment (e.g. varying lighting/illumination conditions). In this case, it may be possible to increase the technique's robustness by training KPCA using relative changes in intensity levels between neighboring pixels (as opposed to absolute pixel intensities).
• page 96 para 1: Comment: The reader should note that the de-noising result of the non-cyclical dancing motion is not included in the analysis in section 6.3.2.
• page 99 Add to the end of para 1: The optimization of Greedy Kernel PCA (GKPCA) is considered to be beyond the scope of this work. The emphasis of the work is on the performance improvement of KSM (in human motion capture) via the application of GKPCA in training set reduction. Readers interested in the topic of GKPCA optimization and upper bound minimization should refer to [41], section 5.4 (page 91).
• page 104 figure 6.6: The results in figure 6.6 are surprising. For further research, it would be interesting to investigate whether a portion of the reduction in feature space noise can be attributed to data reduction (instead of fully to KPCA or GKPCA de-noising).
• page 113 Add to the end of para 1: It may also be possible to generalize KSM to colour images by transferring colour properties from the target to be tracked to a generic model and regenerating the database (re-training) for tracking.
• page 133 Additional References:
– "Viewpoint invariant exemplar-based 3D human tracking", Computer Vision and Image Understanding, 104(2), 178–189, 2006, E.-J. Ong et al.
– "The Dynamics of Linear Combinations: Tracking 3D Skeletons of Human Subjects", Image and Vision Computing, 20, 397–414, 2002, E.-J. Ong and S. Gong.
– "A Multi-View Nonlinear Active Shape Model using Kernel PCA", BMVC 1999, S. Romdhani et al.
Errata

• section 1.1 (p1): 'limps of the human body' for 'limbs'
• section 1.1 (p3): 'KSM can refer' for 'infer'
• section 1.2 (p4): 'motion capture techniques calls' for 'called'
• section 1.2 (p4): 'in lower evaluation cost' for 'in a lower'
• section 2.2 (p13): 'particles' for 'particle'
• section 2.2.2 (p16): 'so that Euclidian norm' for 'that the Euclidian'
• section 2.2.2 (p16): 'boosted cascade of classifier' for 'classifiers'
• section 2.2.2 (p17): 'automatically constraining search space' for 'the search space'
• section 2.2.2 (p17): 'to identify region of' for 'regions of'
• section 2.3 (p21): '3D points on as an array' for 'on an'
• section 3.3.2 (p37): 'two free parameters that requires' for 'require'
• section 3.4.1 (p39): 'if two different pose vector' for 'vectors'
• section 3.4.2 (p41): 'de-noising, noisy human' for 'a noisy human'
• section 3.4.2 (p41): 'performance improvement' for 'improvements'
• section 3.4.2 (p41): 'integrating KPCA motion de-noiser' for 'the KPCA'
• section 3.5 (p41): 'play-backed' for 'played-back'
• section 3.5 (p45): 'As expected, smaller noise' for 'a smaller noise'
• section 4.3 (p59): 'different sets of motion' for 'motions'
• section 4.4.2 (p68): 'there are many possible set of' for 'sets of'
• section 4.4.2 (p68): 'false positives' for 'false negatives'
• section 4.4.3 (p72): 'was able to created' for 'create'
• section 4.5 (p73): 'high level of noise' for 'levels'
• section 4.5 (p73): 'very well at estimation 3D' for 'estimating'
• section 4.5 (p74): 'The limits the' for 'This limits'
• section 4.5 (p74): 'the capture other' for 'of other'
• section 5.6 (p90): 'only shows minor percentage' for 'a minor percentage'
• section 6 (p95): 'Human motion de-noising' for 'A human motion de-noising'
• section 6.2 (p97): 'human de-noising' for 'motion de-noising'
• section 6.2 (p97): 'whilst minimizing reduction' for 'the reduction'
• section 6.2 (p98): 'experiment are conducted' for 'experiments are conducted'
• section 6.2 (p98): 'as well as compare the' for 'comparing the'
• section 6.4 (p109): 'removed form' for 'removed from'
• section 7.1 (p109): 'is non-linear' for 'are non-linear'
• section 7.2 (p113): 'such as way' for 'such a way'
• section 7.2 (p143): 'techniques aim at' for 'aimed at'
• section 7.2 (p143): 'calibrate cameras' for 'calibrated cameras'

List of Tables

5.1 3D pose inference comparison using the mean angular error for 'unseen' views of the object 'Tom'.
5.2 3D pose inference comparison using the mean angular error for randomly selected views of the object 'Tom'.
5.3 3D pose inference comparison using the mean angular error for 'unseen' views of the object 'Dwarf'.
5.4 3D pose inference comparison using the mean angular error for randomly selected views of the object 'Dwarf'.
6.1 Comparison of capture rate for varying training sizes.

List of Figures

1.1 Diagram to summarize the training and testing process of Kernel Subspace Mapping.
2.1 2D Taxonomy plot to summarize the classification of the various motion capture literature reviewed in this thesis.
3.1 Linear toy example to show the results of PCA de-noising.
3.2 Toy example to show the projection of data via PCA.
3.3 Toy example to illustrate the limitation of PCA.
3.4 PCA de-noising of non-linear toy data.
3.5 KPCA de-noising of the non-linear toy data.
3.6 Diagram to illustrate the missing rotation in a ball & socket joint using the RJC encoding.
3.7 Data flow diagram to summarize the relationship between human motion capture and human motion de-noising.
3.8 Quantitative comparison between PCA and KPCA de-noising of a human motion sequence.
3.9 Frame by frame error comparison between PCA and KPCA de-noising of a walk sequence.
3.10 Frame by frame error comparison between PCA and KPCA de-noising of a run sequence.
3.11 Diagram to show results of KPCA in the implicit de-noising of feature space noise.
3.12 Feature and pose space mean square error relationship for KPCA.
4.1 Overview of Kernel Subspace Mapping (KSM).
4.2 Scatter plot of the RJC projections onto the first 4 kernel principal components in Mp.
4.3 Relationship between human motion de-noising, KPCA and KSM.
4.4 Example of a training pose and its corresponding concatenated synthetic image.
4.5 Diagram to summarize the silhouette encoding for 2 synchronized cameras.
4.6 Markerless motion capture re-expressed as the problem of mapping from the silhouette subspace.
4.7 Diagram to summarize the mapping from silhouette subspace to the pose subspace.
4.8 Diagram to show how two different poses may have similar concatenated silhouettes.
4.9 Illustration to summarize markerless motion capture.
4.10 Selected images of the different models used to test KSM.
4.11 Comparison between the shape context descriptor and the pyramid match kernel.
4.12 Intensity images of the RJC pose space kernel and the Pyramid Match kernel for a training walk sequence (fully rotated about the vertical axis).
4.13 Visual comparison of the captured pose with ground truth.
4.14 KSM capture error (degrees per joint) for a synthetic walk motion sequence.
4.15 KSM capture error (cm/joint) for a synthetic walk motion sequence.
4.16 KSM capture error (degrees per joint) for a motion with different noise densities (Salt & Pepper noise).
4.17 KSM motion capture on real data using 2 synchronized un-calibrated cameras.
4.18 Selected motion capture results to illustrate the robustness of KSM.
5.1 Diagram to summarize the 3D pose estimation problem of an object viewed from the upper viewing hemisphere.
5.2 Images of the Standardizing Transform with different σ values.
5.3 Images of a 3D object after applying the Standardizing Transform.
5.4 Diagram to summarize Kernel Subspace Mapping for 3D object pose estimation.
5.5 Selected images of the test object 'Tom' and object 'Dwarf'.
6.1 Illustration to geometrically summarize GKPCA.
6.2 De-noising comparison of a toy example between GKPCA and KPCA.
6.3 Pose space mse comparison between PCA, KPCA and GKPCA de-noising.
6.4 Frame by frame error comparison between PCA, KPCA and GKPCA de-noising of a human walk sequence.
6.5 Frame by frame error comparison between PCA, KPCA and GKPCA de-noising of a human run sequence.
6.6 Comparison of feature and pose space mse relationship for KPCA and GKPCA.
6.7 Average errors per joint for the reduced training sets filtered via GKPCA.
6.8 Frame by frame error comparison (degrees per joint) for a clean walk sequence with different levels of GKPCA filtering in KSM.
6.9 Frame by frame error comparison (degrees per joint) for a noisy walk sequence with different levels of GKPCA filtering in KSM.
6.10 Diagram to illustrate the most likely relationship between using the original training set without GKPCA filtering and using training sequences directly obtained from a motion capture database (to train KSM).
A.1 Diagram to summarize the hierarchical relationship of the bones of the inner biped structure.
A.2 Diagram to summarize the proposed markerless motion capture system.
A.3 Example of an Acclaim motion capture (AMC) format.
A.4 Comparison between animation in AMC format and DirectX format (generic mesh).
A.5 Comparison between animation in AMC format and RJC format.
A.6 Comparison between animation in AMC format and DirectX format (RBF mesh).
A.7 Images to illustrate Euler rotations.
B.1 Selected textured mesh models of the author.
B.2 Example images of the front and back scans of the author.
B.3 Diagram to illustrate surface fitting for RBF mesh generation.
B.4 Selected examples of the accurate mesh model of the author.

Kernel Subspace Mapping: Robust Human Pose and Viewpoint Inference from High Dimensional Training Sets
Therdsak Tangkuampien, BscEng(Hons)
therdsak.tangkuampien@eng.monash.edu.au
Monash University, 2007
Supervisor: Professor David Suter (d.suter@eng.monash.edu.au)
Associate Supervisor: Professor Ray Jarvis (ray.jarvis@eng.monash.edu.au)

Abstract

A novel markerless motion capture technique called Kernel Subspace Mapping (KSM) is introduced in this thesis.
The technique is based on the non-linear unsupervised learning algorithm, Kernel Principal Component Analysis (KPCA). Training sets of human motions captured from marker based systems are used to train de-noising subspaces for human pose estimation. KSM learns two feature subspace representations derived from the synthetic silhouettes and pose pairs, of a single generic human model, and views motion capture as the problem of mapping vectors between the learnt subspaces. After training, novel silhouettes, of previously unseen actors and of unseen poses, can be projected through the two subspaces via Locally Linear non-parametric mapping. The captured human pose is then determined by calculating the pre-image of the de-noised projections. The inference results show that KSM can estimate pose with accuracy similar to other recently proposed state of the art approaches, but requires a substantially smaller training set (which can potentially lead to lower processing costs). To allow automated training set reduction, the novel concept of applying Greedy KPCA as a preprocessing filter for KSM is proposed. The flexibility of KSM is also further illustrated via the integration of the Image Euclidian Distance (IMED) and the technique applied to the problem of 3D object viewpoint estimation. X Kernel Subspace Mapping: Robust Human Pose and Viewpoint Inference from High Dimensional Training Sets Declaration I declare that this thesis is my own work and has not been submitted in any form for another degree or diploma at any university or other institute of tertiary education. Information derived from the published and unpublished work of others has been acknowledged in the text and a list of references is given. Therdsak Tangkuampien January 23, 2007 XI Acknowledgments I would like to thank my principal supervisor, Professor David Suter for his advice and guidance during my candidature. It was he who initially motivated my research ideas in the area of human motion capture and machine learning. Many thanks to David for the constructive criticisms and the countless hours spent on the discussions and proof reading of my research. This thesis would not have been possible without his invaluable guidance. I would like to thank my colleagues from the Institute for Vision Systems Engineering: Tat-jun Chin, James U, James Cheong, Dr. Konrad Schindler, Dr. Hanzi Wang, Mohamed Gobara, Ee Hui Lim, Hang Zhou, Liang Wang and my associated supervisor, Prof. Raymond Jarvis, for the wonderful discussions and constructive criticisms during our weekly seminars. It has been a wonderful experience sharing research ideas with you all. Especially, I would like to thank Tat-jun Chin for initially generating my interest in the areas of manifold learning and high dimensional data analysis. I would also like to thank many of the reviewers of my conference and journal submissions. The constructive feedbacks have been extremely helpful in guiding the direction of my research. In particular, I would like to thank Kristen Grauman for providing the source code for the Pyramid Match kernel and Gabriele Peters for making available the 3D object viewpoint estimation data set of ‘Tom’ and ‘Dwarf’. These components have helped me greatly during my research candidature. Most importantly, I would like to thank my mother and father for their constant encouragement and for always being there for me, when I needed them. 
Therdsak Tangkuampien

Commonly used Symbols

Some commonly used symbols in this thesis are defined here:
• X tr refers to the training set.
• xi refers to the i-th element (column vector) of the training set X tr.
• x refers to the novel input (column vector) for the model learnt from X tr.
• Mp refers to the pose feature subspace.
• Ms refers to the silhouette feature subspace.
• kp(·, ·) refers to the KPCA kernel function in the pose space.
• ks(·, ·) refers to the KPCA kernel function in the silhouette space.
• Ψ refers to the silhouette descriptor for the silhouette kernel ks(·, ·).
• vp refers to the coefficients of the KPCA projected vector x via kp(·, ·).
• vs refers to the coefficients of the KPCA projected vector Ψ via ks(·, ·).
• P tr refers to the KPCA projected set of training poses.
• S tr refers to the KPCA projected set of training silhouettes.
• P lle refers to the reduced training subset of P tr used in LLE mapping.
• S lle refers to the reduced training subset of S tr used in LLE mapping.
• s in refers to the projection of the input silhouette in Ms.
• p out refers to the pose subspace representation of s in in Mp.
• x out refers to the output pose vector for KSM (i.e. the pre-image of p out).
• X gk refers to the reduced training subset (of X tr), filtered via Greedy KPCA.

Chapter 1
Introduction

1.1 Markerless Motion Capture and Machine Learning

Markerless human motion capture is the process of registering (capturing and encoding) human pose in mathematical forms without the need for intrusive markers. Moeslund [73] defined human motion capture as "the process of capturing the large scale (human) body movements at some resolution". The term 'large scale body movement' emphasizes that only the significant body parts, such as the arms, legs, torso and head, will be considered. The statement "at some resolution" implies that human motion capture can refer to both the tracking of a human as a single object (low resolution) and the inference of the relative positions of the limbs of the human body (high resolution). This thesis views the human body as an articulated structure consisting of multiple bones connected in a hierarchical manner (appendix A.1). Markerless motion capture, as referenced in our work, refers to the inference of the relative joint orientations of this hierarchical skeletal model. Virtual animation can be achieved by encoding these relative joint orientations (at each time frame) as pose vectors, and mapping these vectors sequentially to a hierarchical (skeletal) structure. In order to avoid the use of markers, pose information can be inferred from images captured from multiple synchronized cameras. Estimating human pose from images can be classified as computer vision-based motion capture, and has received growing interest in the last decade. Faster processors, cheaper digital camera technology and larger memory storage have all contributed to the significant growth in the area.

Machine learning is another area that has received significant interest from both the research community and industry. Literally, machine learning research is concerned with "the development of algorithms and techniques that allow computers (machines) to learn". In particular, due to the availability of large human motion capture databases, there have been many markerless motion capture techniques based on machine learning [37, 113, 45, 6, 83].
The principal concept is to generate a synthetic training set of images from available poses in the motion database and learn the (inverse) mapping from image to pose, such that it generalizes well to novel inputs. The term 'generalizes well to novel inputs' emphasizes that this is not a traditional database search problem, but a more complex one, which requires the generation of unseen poses not in the training database. An attribute which makes machine learning suitable for human motion capture is that a high percentage of human motion is coordinated [84, 24]. There have been many experiments on the application of unsupervised learning algorithms (such as Principal Components Analysis (PCA) [51, 98] and Locally Linear Embedding (LLE) [85]) in learning low dimensional embeddings of human motion [37, 14, 84]. Effectively, inferring pose via a lower dimensional embedding avoids expensive computational processing and searching in the high dimensional human pose space (e.g. 57 degrees of freedom in the normalized Relative Joint Center (RJC) format [section 3.4.1]).

The main contribution of this thesis is a novel markerless motion capture technique called Kernel Subspace Mapping (KSM) (chapter 4). The technique is based on the non-linear unsupervised (machine) learning algorithm, Kernel Principal Components Analysis (KPCA) [90]. KSM requires, at initialization, labelled input and output training pairs, which may both be high dimensional. In particular, for computer vision-based motion capture, KSM can learn the (inverse) mapping from synthetic (training) images to pose space encoded in the normalized RJC format (section 3.4.1). Instead of learning poses generated from a number of different mesh models (as in [45, 37]), KSM can estimate pose using training silhouettes generated from a single generic model (figure 1.1 [top left]). To ensure the robustness of the technique and test that it generalizes well to previously unseen∗ poses from a different actor, a different model is used in testing (figure 1.1 [top right]). Results are presented in section 4.4, which show that KSM can infer accurate human pose and direction, without the need for 3D processing (e.g. voxel carving, shape from silhouettes), and that KSM works robustly in poor segmentation environments as well.

Figure 1.1: Diagram to summarize the training and testing process of Kernel Subspace Mapping, which learns the mapping from image to the normalized Relative Joint Centers (RJC) space (section 3.4.1). Note that different mesh models are used in training and testing. The generic model [top left] is used to generate training images, whereas the accurate mesh model of the author (Appendix B) is used to generate synthetic test images.

1.2 Thesis Outline & Contributions

In chapter 2, a taxonomy and literature review of markerless motion capture techniques are presented. Relevant machine learning algorithms for markerless motion capture are discussed and classified into logical paradigms. This will serve as a context upon which the advantages and contributions of the proposed technique (KSM) can be identified. Thereafter, the remaining chapters and their contributions are outlined below:

∗ The term 'previously unseen' is used in this thesis to refer to vectors/silhouettes that are not in the training set.
• Subspace Learning for Human Motion Capture [chapter 3]: introduces the novel concept of human motion de-noising via non-linear Kernel Principal Components Analysis (KPCA) and summarizes how de-noising can advantageously contribute to markerless motion capture. Arguments are presented which advocate that the normalized Relative Joint Center (RJC) format for human motion encoding is not only intuitive, but also well suited for KPCA de-noising. De-noising results indicate that human motion is inherently non-linear, hence further supporting the integration of non-linear de-noising techniques (such as KPCA) in markerless motion capture techniques (a standard formulation of KPCA de-noising is sketched at the end of this chapter outline).
• Kernel Subspace Mapping (KSM)† [chapter 4]: integrates human motion de-noising via KPCA (chapter 3) and the Pyramid Match Kernel [44] into a novel (and efficient) markerless motion capture technique called Kernel Subspace Mapping. The technique learns two feature space representations derived from the synthetic silhouette and pose pairs, and alternatively views motion capture as the problem of mapping vectors between the two feature subspaces. Quantitative and qualitative motion capture results are presented and compared with other state of the art markerless motion capture algorithms.
• Image Euclidian Distance (IMED) embedded KSM‡ [chapter 5]: shows how the Image Euclidian Distance [116], which takes into account the spatial relationship of local pixels, can efficiently be embedded into KPCA via the Kronecker product and eigenvector projections. Mathematical proofs are presented which show that the technique retains the desirable properties of the Euclidian distance, such as kernel positive definiteness, and can hence be used in techniques based on convex optimization such as KPCA (and KSM). Results are presented which demonstrate that IMED embedded KSM is a more intuitive and accurate technique than standard KSM through a 3D object viewpoint estimation application.
• Greedy KPCA for Human Motion Capture§ [chapter 6]: presents the novel concept of applying Greedy KPCA [41] as a preprocessing filter for training set reduction in KSM. A human motion de-noising comparison between linear PCA, standard KPCA (using all poses in the original sequence) and Greedy KPCA (using the reduced set) is presented at the end of the chapter. The results show that both KPCA and Greedy KPCA have superior de-noising qualities over PCA, whilst Greedy KPCA results in a lower evaluation cost (for both KPCA de-noising and KSM in motion capture) due to the reduced training set.

Finally, in chapter 7, overall conclusions encapsulating the techniques presented in the previous chapters are drawn. Further improvements are highlighted, and most importantly, possible future directions of research on Kernel Subspace Mapping are summarized.

† This chapter is based on the conference paper [108] T. Tangkuampien and D. Suter: Real-Time Human Pose Inference using Kernel Principal Component Pre-image Approximations: British Machine Vision Conference (BMVC) 2006, pages 599–608, Edinburgh, UK.
‡ This chapter is based on the conference paper [106] T. Tangkuampien and D. Suter: 3D Object Pose Inference via Kernel Principal Component Analysis with Image Euclidian Distance (IMED): British Machine Vision Conference (BMVC) 2006, pages 137–146, Edinburgh, UK.
§ This chapter is based on the conference paper [107] T. Tangkuampien and D. Suter: Human Motion Denoising via Greedy Kernel Principal Component Analysis Filtering: International Conference on Pattern Recognition (ICPR) 2006, pages 457–460, Hong Kong, China.
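As background for the de-noising framework referred to in the chapter 3 bullet above, the following is a standard formulation of KPCA de-noising with a Gaussian kernel, as found in the kernel PCA literature [90, 91, 87]. It is given here only as a sketch: feature-space centring terms are omitted for brevity, and the symbols N, n and σ are generic rather than the thesis's own tuned parameters (equations 3.6 and 3.20 are not reproduced here).

```latex
% Projection of a (noisy) input x onto the leading n kernel principal components v_l,
% where each v_l is expanded with coefficients alpha_l over the N training points x_i:
\[
  P_n \varphi(\mathbf{x}) = \sum_{l=1}^{n} \beta_l \mathbf{v}_l , \qquad
  \beta_l = \sum_{i=1}^{N} \alpha_l^{i}\, k(\mathbf{x}_i, \mathbf{x}) , \qquad
  k(\mathbf{x}, \mathbf{y}) = \exp\!\Big(\!-\frac{\|\mathbf{x}-\mathbf{y}\|^2}{2\sigma^2}\Big).
\]
% The de-noised output is the pre-image z whose feature-space image best matches the projection:
\[
  \mathbf{z} = \arg\min_{\mathbf{z}} \big\| \varphi(\mathbf{z}) - P_n \varphi(\mathbf{x}) \big\|^2 ,
\]
% which, for the Gaussian kernel, can be approximated by the fixed-point iteration
\[
  \mathbf{z}_{t+1} =
  \frac{\sum_{i=1}^{N} \gamma_i\, k(\mathbf{x}_i, \mathbf{z}_t)\, \mathbf{x}_i}
       {\sum_{i=1}^{N} \gamma_i\, k(\mathbf{x}_i, \mathbf{z}_t)} , \qquad
  \gamma_i = \sum_{l=1}^{n} \beta_l\, \alpha_l^{i} .
\]
```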
Chapter 2
Literature Review

Recently, markerless human motion capture has become one of the research areas in computer vision that rely heavily on machine learning methods. Human motion capture can be classified into different paradigms, and this chapter serves to elucidate the differences between learning based approaches and the more conventional ones. As markerless motion capture is a popular field of research, this chapter does not aim to provide a complete literature review of all the algorithms. Instead, the principal goal is to motivate and differentiate the proposed method of Kernel Subspace Mapping (KSM) (chapter 4) against previously proposed approaches. To this end, in section 2.1, taxonomies of computer vision-based motion capture (similar to the ones introduced by Moeslund [72, 73] and Gavrila [43]) are summarized and combined. Based on the integrated taxonomy, in section 2.2, motion capture techniques are reviewed and classified. This will serve as a context upon which the advantages and contributions of the proposed technique (KSM) can be identified (section 2.3).

2.1 Motion Capture Taxonomy

There have been many attempts at developing a taxonomy for human motion capture [43, 8, 7, 22]. The most logical, as highlighted by [16], is the taxonomy suggested by Moeslund [72, 73], which categorizes the process of motion capture into four stages that should occur in order to solve the problem. These stages are: initialization, tracking, pose estimation, and finally, recognition. Initialization includes any form of off-line processing, such as camera calibration and model acquisition. As for tracking and pose estimation, it is important to highlight the differences between them. In the classification (figure 2.1), we use tracking to refer to the run-time identification of a pre-defined structure with temporal constraints. Specifically for human tracking from images, the pre-defined structure may be the entire person (high-level) or multiple rigid body parts representing a person (e.g. arms, legs). We use the term pose estimation exclusively for processes which generate, as output, pose vectors that can be used to (fully or partially) animate a skeletal model in 3D. As highlighted by Agarwal and Triggs in [4], it is common for the two stages of tracking and pose estimation to interact in a closed-loop relationship, in that accurate tracking may take advantage of prior pose knowledge, and pose estimation may require some form of tracking to disambiguate inconclusive poses. Finally, recognition is the classification of motion sequences into discrete paradigms (e.g. running, jumping, etc). We believe that most recognition algorithms can be augmented to pose estimation approaches if 3D joint information is available, and therefore, have explicitly highlighted this extendability (with an arrow) in figure 2.1. The stages of the classification of Moeslund [72] do not have to be in any specific order, and for some techniques, some stages (i.e. tracking and recognition) may even be excluded completely. Each human motion capture approach (e.g. segmentation based, prediction via tracking filter) can be classified as partially or fully belonging to any of the stages.
For example, a motion capture approach [23], which estimates pose via the use of Kalman filters [118] (to track body parts) would be categorized in both the tracking and pose estimation stages. Specifically for markerless motion capture techniques based on machine learning, such as [37, 113, 45, 6, 1, 75, 83], these can be classified into the initialization, tracking and pose estimation stages. The learning stage (from training data) and the inference stage (from test data) would be classified as initialization and pose estimation respectively. For a full categorization and survey of markerless motion capture techniques, the reader should refer to [72, 16]. 8 CHAPTER 2. LITERATURE REVIEW Another popular taxonomy is the one suggested by Gavrila [43], which classifies motion capture techniques into either 2D or 3D approaches∗ . As the goal of motion capture is to infer 3D pose information, most 2D and 3D techniques will eventually generate 3D human pose information. To distinguish between 2D and 3D techniques, in our classification, any approach which requires processing and volumetric reconstruction of the human body in 3D space (e.g. voxel carving, shape from silhouettes) will be classified as a 3D technique. The remaining image-based techniques will be classified as 2D approaches. The two taxonomies can be combined together to illustrate an abstract view of the motion capture literature (figure 2.1). For example, a motion capture technique based on voxel reconstruction [70], can be classified as 3D approaches whilst also classified as initialization, tracking and pose estimation. Since the core of Kernel Subspace Mapping (KSM) is based on an unsupervised learning algorithm (KPCA) (for comparison), it would be useful to classify all the reviewed motion capture literature into another two distinct classes: learning based techniques and nonlearning based techniques. In figure 2.1, the blue labels highlight techniques that are exemplar/learning based (i.e. requires a training database) and the yellow labels indicates techniques that are not. An interesting observation is that most of the recent work on 2D markerless motion capture is based on learning algorithms, which indicates an increase in the popularity of machine learning in computer vision applications. ∗ Note that in the survey by Gavrila [43], the taxonomy classifies techniques into 3 distinct classes: 2D approaches without explicit shape models, 2D approaches with explicit shape models, and 3D approaches. For simplicity, we have combined all 2D approaches into one category. 9 CHAPTER 2. LITERATURE REVIEW Figure 2.1: 2D Taxonomy plot to summarize the classification of the various motion capture literature reviewed in this thesis. The color coding emphasizes if a motion capture technique is exemplar/learning based (i.e. requires a training database)[blue] or not based on any learning algorithm [yellow]. Techniques that do not estimate full pose (e.g. upper body only or 2D pose in images) are highlighted by an uncomplete bar in the pose estimation column. Similarly, tracking in a reduced dimensional (embedded) space is signified by a reduced bar in the tracking column. We believe that most recognition algorithms can be augmented to pose estimation approaches, once 3D joint information is available. This is indicated by the arrow at the end of each row pointing towards the recognition column. 10 CHAPTER 2. 
LITERATURE REVIEW 2.2 Markerless Motion Capture A review of human motion capture techniques, as summarized in the combined taxonomy of Moeslund [72] and Gavrila [43] (figure 2.1), is now presented. The techniques are discussed in the following categorizations: Section 2.2.1 reviews pose estimation techniques which requires processing and volumetric reconstruction in 3D (e.g. voxel carving, shape from silhouettes). For quick technical comparison with KSM, section 2.2.2 summarizes the relevant 2D learning/exemplar based motion capture techniques (i.e. requires a training set). Note that not all 2D markerless motion capture approaches require a training set for pose inference. For example, Moeslund and Granum [74] use silhouettes in conjunction with the typical human kinematic constraints (instead of using the subspaces defined by the training data) to infer pose. Other popular approaches (that do not rely on training data) include the motion tracking algorithm of the Sony Playstation’s EyeToy (which only uses simple image differencing [36, 120]), the silhouette contour technique of Takahashi et al [104] (which uses Kalman filter [118] to track feature points), the particle filter based algorithm by Shen et al [93], and the product of exponential maps and twist motion framework of Bregler and Malik [18]. 2.2.1 Motion Capture via 3D Volumetric Reconstruction Human pose estimation from 3D volumetric reconstruction is a natural approach to markerless motion capture. These techniques usually require as input, synchronized images form multiple calibrated cameras, as well as the camera’s intrinsic and extrinsic parameters. The principal idea is to derive a 3D volume (enclosing the actor), which satisfies the constraint imposed by the multi-view images, and use the volume to aid in 3D pose estimation. Popular algorithms for volumetric reconstruction include the Shape-from-Silhouette (SFS) algorithm [27] and voxel-carving techniques [70, 111]. These 3D approaches can be further categorized into two distinct classes: model-free and model-based techniques. In a model-free setup, the surface/volume reconstruction at each time instance (sometimes in combination with its textures) can be captured (and used directly in animation) without prior knowledge of the human body [71, 100]. In a model-based setup [70, 111, 99, 48, 13], prior knowledge of the human structure is used to aid in motion capture, and therefore, 11 CHAPTER 2. LITERATURE REVIEW usually allows pose estimation (i.e. the capture of joint orientations/positions for reanimation). The advantage of pose estimation is that it is efficient to encode (only the structural joint orientations need to be encoded for each frame) and this allows its practical use in applications such as motion editing and motion synthesis [57, 62, 10, 84], as well as human computer interaction (HCI). As this thesis specifically proposes a motion capture technique (KSM) for pose estimation, only 3D model based approaches, which can capture pose for biped animation will be reviewed. A state of the art model-based approach to realistic surface reconstruction and pose estimation is the technique proposed by Starck et al [99, 48]. The technique has been shown successful when using 9 calibrated and synchronized cameras (with 8 cameras forming 4 stereo pairs, and the remaining camera positioned overhead). A prior humanoid model is used in conjunction with manually labelled feature points (e.g. skeletal joints and mesh vertices) to match the images at each time frame. 
A combination of silhouette, feature cues and stereo is then used in a constrained optimization framework to update the mesh model, whilst preserving the parametrization of the surface mesh. Tracking with a prior model allows the integration of temporal information to disambiguate under-constrained scenarios, which may occur as a result of self occlusion. The technique has been shown to successfully reconstruct complex motion sequences (e.g. jumping and dancing), and in some cases, it is possible to visualize realistic creases of the actor's clothing during reanimation. As the humanoid model is updated at every time frame, we believe that pose (joint orientations) can easily be estimated from the model. Two disadvantages are highlighted by the authors [99], these being the requirement for manual labelling of feature points and the constraints imposed by the shape of the prior model. Another state of the art 3D pose estimation algorithm is the technique proposed by Cheung et al [28]. The approach [28] takes into account temporal constraints in the form of a novel "Shape-from-Silhouette across time" algorithm [27]. Motion capture is partitioned into two distinct stages: human model acquisition and pose estimation. In model acquisition, the joint positions are registered in a sequential approach, where the actor is asked to rotate only one joint at a time. The initialization step, which consists of the joint skeleton and body shape acquisition, was reported to take a total of ≈7 hours [joint acquisition (≈5 hours), shape acquisition (≈2 hours)]. During pose tracking and estimation, the Visual Hull alignment uses the Shape-from-Silhouette reconstruction and photometric information in conjunction with the prior model. The authors reported an average tracking rate of 1.5 minutes per frame. The technique has been shown to successfully track complicated non-cyclical motion such as aerobics, dancing and Kung Fu sequences. Pose estimation approaches are sometimes designed specifically for the inference of only the upper part of the human body (from the torso upwards) (e.g. Bernier et al [13] and Fua et al [42]). These techniques may potentially be useful for human computer interaction in the near future. In [13], Bernier et al used particle filters and proposal maps to track fast moving human motion using depth images from a stereo camera. The technique can successfully estimate pose at up to 10 Hz without prior knowledge of the actor or the background. In the work of Fua et al [42], 3 synchronized pre-calibrated cameras are used to generate a 3D surface point cloud to constrain pose estimation of the upper body. The approach uses optimization to deform an articulated model (with "metaballs" to simulate muscles and joints) to align with the synchronized imagery. Other notable (but similar to [99, 48, 28]) pose estimation techniques based on 3D volumetric reconstruction include [29, 79, 19, 117, 54, 70, 69]. In [29], Chu et al integrated (the manifold learning algorithm) Isomap [110] to allow the extraction of "skeleton curve" features from volumetric reconstructions of the human body. In [79], Niskanen et al used a 4-6 camera setup to estimate 3D points and 3D normals of a surface mesh enclosing a model at a time instance. Caillette et al [19] presented a volumetric reconstruction and fitting scheme based on 4 calibrated cameras, but used Variable Length Markov Models for prediction.
Wang and Leow [117] derived a self-calibrated approach based on Nonparametric Belief Propagation [101], which allows pose inference via synchronized un-calibrated cameras (the self-calibration is highlighted in figure 2.1). Kerl et al [54] improved on volumetric capture techniques by introducing a fitting algorithm, which uses stochastic meta descent optimization (instead of deterministic ones). In addition to the space constraints 13 CHAPTER 2. LITERATURE REVIEW (imposed by the volumetric reconstruction), Mikić et al [70, 69] and Sundaresan et al [102] imposed temporal constraints via the use of extended Kalman filters [53]. In general, 3D motion capture based on volumetric reconstruction requires an expensive controlled environment of multiple calibrated cameras and clean silhouette segmentation. Voxel reconstruction, Shape-from-Silhouette or other space carving algorithms usually contribute to the higher processing cost and lower capture rate of 3D based techniques (when compared to learning based techniques [section 2.2.2]). On the other hand, volumetric reconstruction techniques with a tracking framework can capture more complicated and non-cyclical motion, and in some cases [99, 28], employing realistic surface models as well. 2.2.2 Learning or Exemplar Based Motion Capture Techniques Markerless motion capture techniques based on learning algorithms have grown significantly in this last decade due to the availability of online motion capture data. Generally learning/exemplars techniques adopt a supervised learning approach and require labelled input and output pairs as a training set. The principal concept is to learn the mapping (from input to output) from the training set, and generalize it to unseen inputs (sampled from a similar distribution as the training data) during online capture. The labelling (of the training sets) can be generated manually as in [75, 15]. However, as human motion is complicated, its encoding usually requires a large training set of high dimensional data, which makes manual labelling impractical. To allow scalability, most recent approaches, such as Agarwal & Triggs in [6] and Grauman et al in [45], used synthetic models to automatically generate synthetic training pairs for learning. To avoid over-fitting of the learnt prior to a specific actor/model, multiple synthetic (mesh) models are used in training (e.g. to capture walk sequences irrespective of yaw angle, a training set of 20,000 synthetic silhouettes are generated from a number of different models in [45]). Interesting enough, every member of the training set does not need to be labelled as shown by Navaratnam et al [76], who extended the learning-based concept to allow a semi-supervised learning approach (i.e. the training set consists of a combination of labelled [input & output pairs] 14 CHAPTER 2. LITERATURE REVIEW and unlabelled data [input without output, or vice versa]). There are many different types of inputs that have been successfully used in learning/exemplar based motion capture. In the technique proposed by Bowden et al [15, 14], a skin color classifier based on the Hue-Saturation space [67] is used to detect (image) locations of the hands and head (as input). Using a similar concept, Micilotta et al [68] trained the AdaBoost classifier to detect the locations of the head and hands in cluttered images. 
In [5], Agarwal and Triggs used the shape invariant feature transform (SIFT) descriptor [64] in conjunction with non-negative factorization [49] to encode edges in human images for training. In [65], Loy et al used manually labelled points in key frames together with a prior model (for image space fitting) to interpolate pose from cluttered images. These techniques [5, 68, 15, 14], although robust to pose estimation in cluttered environments, involve a complex scheme of mostly inefficient shape descriptors (when compared to using binary human silhouettes). A more efficient approach (also by Agarwal & Triggs [6]) infers pose using silhouettes from monocular images by directly regression from points sampled from the silhouette’s edge via the Shape Context [11] descriptor to the Euler pose space. Inferring high dimensional pose from monocular silhouettes is complicated because a high level of ambiguity arises due to self occlusion and the lack of depth and foreground (photometric) information. In that case [6], a complex temporal tracking algorithm needs to be employed to disambiguate inconclusive pose. To overcome ambiguities, 3D pose can be inferred from concatenated silhouette images of the actor captured from multiple synchronized cameras (as proposed by Ren et al in [83] with 3 cameras and Grauman et al in [45] with 4 cameras). Even though concatenating silhouettes has the advantage of reducing the level of ambiguity (in input images), there is also a drawback, in that the dimension of the input (i.e. the number of pixels) will increase, hence effecting the technique’s performance. In particular, for learning based approaches (where the main processing relies on comparing inputs to training data), the increase in input dimension becomes a principal factor to consider when determining the practicality of the techniques. 15 CHAPTER 2. LITERATURE REVIEW Specifically for silhouette-based approaches, the efficiency of shape comparison algorithms is a crucial attribute to consider for potential real-time systems. There have been many approaches aimed at optimizing matching/comparison costs between input and training data. In particular, Agarwal and Triggs [6, 1] improved on the shape context approach to motion capture (originally proposed by Mori and Malik [75]) by using vector quantization to compress the Shape Context histogram relationship between other silhouettes, so that Euclidian norm is applicable (instead of the more common approach of using the shortest augmenting path algorithm [52] to solve for the best match). In [83], Ren et al used a boosted cascade of feature classifier (introduced by Viola and Jones [115]) for efficient silhouette comparisons. Other efficient shape descriptors, which may be useful in silhouette-based motion capture, include the pyramid match kernel† of Grauman et al [44] and the Active Shape Model of Cootes et al [30]. In addition to using an efficient comparison algorithm, learning/exemplar based techniques should also consider strategies to minimize the search time (of the training set) during pose inference. For exemplar based approaches that rely solely on a matching framework (i.e. given an input, find the closest training data) [35, 83], it is obvious why faster search algorithms will lead to faster pose inference rate. For learning based approaches that rely on a reconstruction framework (i.e. 
generate the output from a combination of training data), a complex search strategy (which can efficiently locate arrays of similar training poses) will be required. A problem usually encountered in optimizing search strategies for learning based (human) motion capture techniques is the high dimensionality and size of the training set. As highlighted by Shakhnarovich et al [94]: "For complex and high-dimensional problems such as pose estimation, the number of required examples and the computational complexity rapidly become prohibitively high." To avoid a large training set, Agarwal & Triggs [6] use Bayesian non-linear regression to filter out a reduced training set that generalizes well to novel data. Shakhnarovich et al [94] learn a set of hash functions, which index only the relevant exemplars required for a specific motion. The main difference between the two approaches is that in the former [6], a smaller training set is selected from the original set to model the motion sequence, whereas in the latter [94], the larger set still remains, but a more efficient hashing algorithm is learnt to search the full training set.

† In chapter 4, we show how to efficiently integrate the Pyramid Match kernel [44] into KSM.

An alternative to using a hashing algorithm to reduce search time is to perform stochastic sampling [95] or Covariance Scaled Sampling [97] to locate the optimal pose. In Covariance Scaled Sampling, a hypothesis distribution is calculated by predicting the dynamics of the current time prior, and inflating the prior covariance at the predicted center for broader sampling. Bray et al [17] use the dynamic graph cut algorithm [55] to locate the global minimum (optimal pose), when inferring pose from images without silhouette segmentation. Other human tracking approaches based on constrained sampling algorithms include [33, 34, 82]. Instead of exploring ways to optimize the sampling algorithm for tracking, an alternative solution to avoiding the curse of dimensionality is to integrate manifold or subspace learning algorithms into motion capture. This is possible because a majority of human motion is coordinated [84]. The lower dimensional embedded space can be used to reduce comparison time by automatically constraining the search space to within the learnt model. Grauman et al [45] used a mixture of linear probabilistic principal components analyzer (PPCA) models [112] to locally and linearly model clusters of similar training data. Li et al [61] used the Locally Linear Coordination (LLC) algorithm [109] to enforce smoothness locally within the clusters of the pose (joint) space. Agarwal and Triggs [3] used kernel PCA [90] (with the polynomial kernel based on the Bhattacharyya histogram similarity measure [56]) to learn the manifold of training silhouettes from the (vector quantized) shape context histogram descriptor [6]. In that case [3], the manifold and the pose vectors (in Euler angles) are then combined to identify regions of "multi-valueness" when mapping from silhouette to pose space. Elgammal and Lee [37] integrated the unsupervised learning algorithm, Locally Linear Embedding (LLE) [85], to learn lower dimensional manifolds of viewpoint specific silhouettes. Urtasun et al [113] use Scaled Gaussian Process Latent Variable Models (SGPLVM) [46] to learn prior low dimensional embeddings for specific human motions (e.g. walking, golf swing).
Some of these approaches [37, 113] have only been shown to successfully track human joints using the same camera yaw angle (rotation about the vertical axis) as the ones used to capture the training data. It is not 17 CHAPTER 2. LITERATURE REVIEW clear if subspace models learnt from a different yaw orientation can be integrated together to form a single general model that can track joints irrespective of the camera’s yaw angle. From our experiments [108], one of the most difficult aspect of motion capture is not how to track the joint positions of the human body (when viewpoint is known), but how to automatically infer the correct yaw orientation of the model, as well as to correctly track the joints‡ . To do so, the prior learnt model must generalize well to unseen poses, as well as poses from unseen camera angles. To capture human pose irrespective of yaw orientation, Ren et al [83] separated yaw and pose inference into two independent stages. Yaw inference is viewed as a multi-class classification problem, which is always performed before pose inference. There are 36 classes in total, with each class encapsulating a 10 degree sector of the vertical axis rotation. The input is first classified into the correct class, before using a viewpoint specific model (with a 10 degree range) to infer pose. Other silhouette-based approaches that can accurately infer pose irrespective of yaw orientation include the “shape+structure” model of Grauman et al [45] and the regressive model of Agarwal and Triggs [6]. In general, the learning/exemplar based techniques summarized in this section [45, 5, 75, 15, 37, 83, 6, 95, 97, 113, 68], adopt a supervised learning approach and require labelled training data at initialization. In some approaches, unsupervised learning algorithms, such as PCA [15, 50], LLE [37], SGPLVM [46], have been used initially to learn lower dimensional embedding to constrain search and tracking. Even so, a supervised learning algorithm is still used to determine the mapping (between embedding and pose space) during pose inference. In comparison with 3D based markerless motion capture algorithms, learning based techniques are still limited to capturing less complicated (and mostly cyclical) motions, but 2D learning-based approaches generally do not require intrinsic camera calibration, and therefore, are cheaper and easier to initialize. The state of the art techniques [6, 96, 45, 37] mostly concentrate on robustly tracking and pose inference of spiral walk and turn sequences captured via un-calibrated cameras. Complex ‡ For this thesis, we have considered viewpoint change to be a result of rotation about the vertical axis (yaw rotation). The camera/sensor is assumed to remain at a fixed vertical height during the rotation 18 CHAPTER 2. LITERATURE REVIEW paradigms of pose estimation in cluttered background or poor silhouette segmentation environments are usually used to test and compare the robustness of proposed techniques. From our review of the current literatures, we believe the approach adopted by Agarwal and Triggs in [6] to be the current state of the art due to its ability to accurately and robustly infer pose and yaw orientation from monocular silhouettes. 2.3 Motivation for Kernel Subspace Mapping (KSM) Kernel Subspace Mapping (KSM) (chapter 4) falls in the paradigm of 2D learning-based approaches. 
With regards to the taxonomy classification of Moeslund [72] (figure 2.1), at initialization, KSM uses motion capture data from a marker-based system to learn a de-noising subspace of human motion (section 3.4). Efficient tracking is performed in the KPCA subspace (instead of the full pose space), hence its partial classification in the tracking column (figure 2.1). KSM uses concatenated silhouettes from synchronized cameras in conjunction with subspace temporal constraints to infer full pose vectors for full body biped animation. In general, most of the 2D learning-based approaches [6, 45, 37] rely on learning from a training set consisting of multiple people to allow good generalization to unseen input. KSM adopts a different approach by learning the de-noising subspace in the normalized relative joint orientation space (which is consistent for all actors), and de-noises (previously) unseen silhouettes with the learnt prior before pose inference. The advantage of this approach is a substantial reduction in training size, as only the motion sequence from a single generic model is used in learning. For comparison, in the technique proposed by Grauman et al [45], a large training set of 20,000 silhouettes (from multiple actors) are used to train a motion capture system consisting of 4 synchronized cameras. To capture similar walk sequences irrespective of (yaw) viewing angle, KSM shows similar results (to [45]) using a training set of only 343 exemplars and 2 synchronized cameras. The ability to accurately and efficiently estimate pose irrespective of yaw orientation is one of the many advantages of KSM. For learning-based approaches, inferring the yaw orientation (rotation 19 CHAPTER 2. LITERATURE REVIEW about vertical axis) is significantly harder than inferring, let’s say, an extra rotation of the arm. This is because yaw rotation is not coordinated with any joint rotation (as any motion sequence can be captured from an infinite number of angles). Techniques which learn a view specific prior model for tracking (or a discrete combination of multiple view specific models) [37, 83, 113, 35] are limited to inferring pose from the pre-defined viewpoints. KSM is not constrained to view-specific estimation because view-point inference and pose estimation is not performed in two independent steps as in [83, 37]. As a result, view-point inference is not only possible, but continuous in the sense that pose estimation is possible from unseen (camera) yaw angles. As common with most supervised learning approaches to markerless motion capture [45, 113, 6, 83], KSM requires (for training) a labelled set of silhouette and pose pairs. KSM is similar to the mixture of regressors of Agarwal and Triggs [3], in the sense that both techniques rely on the unsupervised learning algorithm, KPCA [90]. However, there are substantial differences between the two approaches. In [3], KPCA with a complicated kernel (polynomial kernel with the Bhattacharya histogram similarity measure [56]) is used to project (vector quantized) shape context descriptors from the silhouette space. The shape context calculation (this occurs before the vector quantization step) has a quadratic complexity as the number of features (points sampled form the silhouette’s edge). KSM, uses a shape descriptor, based on the efficient pyramid match kernel [44], which has only linear complexity in the cardinality of the set. 
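To give a sense of why a pyramid-style descriptor is only linear in the cardinality of the set, the sketch below illustrates the basic idea behind the pyramid match kernel of [44]: histogram intersections at increasingly coarse resolutions, with newly formed matches down-weighted at coarser levels. This is a schematic illustration only; the descriptor actually used in KSM is constructed in section 4.2.3 and differs in detail, and the weighting and self-match normalisation of [44] are simplified here.

import numpy as np
from collections import Counter

def pyramid_match(X, Y, n_levels=5):
    """Schematic pyramid match in the spirit of [44].
    X, Y: point sets (rows are features), assumed pre-scaled to lie in [0, 2**n_levels).
    Newly formed matches at each (coarser) level are weighted by 1 / 2**level."""
    def hist(points, width):
        return Counter(map(tuple, np.floor(points / width).astype(int)))

    score, prev = 0.0, 0.0
    for level in range(n_levels + 1):
        hx, hy = hist(X, 2 ** level), hist(Y, 2 ** level)
        intersection = sum(min(c, hy[b]) for b, c in hx.items())  # matches at this resolution
        score += (intersection - prev) / (2 ** level)             # count only the new matches
        prev = intersection
    return score

Because each level only bins and intersects the two feature sets once, the total cost grows linearly with the number of features per set.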
Instead of learning subspaces from the histograms of silhouette descriptor (as in [3]), which are ambiguous and nosier (as the descriptor is based on discrete histogram bins), KSM initially learns a de-noising subspace from the continuous relative joint center space (section 3.4.1), which is more efficient to analyze and unambiguous (except for when a limb is fully extended [figure 3.6]). Furthermore, KSM is based on a KPCA de-noising framework (i.e. requires pre-image approximations [91]) of the Gaussian kernel, which already has well established and efficient pre-image estimators [87]. In [3], there is no approximation of pre-images, and such a stage would require the pre-image approximations of complex and inefficient polynomial kernels of discrete histogram descriptors (when compared to pre-image approximations using Gaussian 20 CHAPTER 2. LITERATURE REVIEW kernels [87]). Finally, instead of regression from silhouette to Euler pose space as in [6, 3], we use the normalized relative joint center encoding RJC (section 3.4.1). The problems of mapping to Euler space (appendix A.1.1) are that Euler angles suffer from singularities and gimbal locks, are non-commutative, as well as having a complicated group structure rather than that of a simple vector space. On the other hand, the normalized RJC format (section 3.4.1) encodes pose using 3D points on as an array of unit spheres, which is simple to parameterize (from vector space) and does not suffer from gimbal locks. An advantage of normalized RJC is that the mapping from Euclidian to geodesic distance (between two points) on a sphere can be closely approximated using Gaussians, and is therefore wellsuited for use with non-linear Gaussian kernels, as required in KPCA de-noising (section 3.3.1). Kernel Subspace Mapping (KSM) is also similar to the activity manifold learning method of [37], which uses Locally Linear Embedding (LLE) [85] to learn manifolds of human silhouettes for each viewpoint. In that system, during capture, the preprocessed silhouettes must be projected onto all the manifolds of each static viewpoint before performing a one dimensional search on each manifold to determine the optimal pose and viewing angle. Furthermore, because the manifolds are learnt from discrete viewing angles, it is not possible to accurately infer pose if the input silhouettes are captured from previously unseen viewpoint, where a corresponding manifold has not been learnt. Kernel Subspace Mapping (KSM), which does not suffer from these inefficiencies and problems, has the following advantages: • Instead of learning separate subspaces for each viewpoint, a single combined subspace from multiple viewpoints (rotated about the vertical axis) is learnt, hence giving the technique the ability to accurately infer heading angle and human pose information in a single step (and avoid explicitly searching multiple subspaces). • Kernel Subspace Mapping, which is based on already well established de-noising algorithms (KPCA) [90], generalizes well to silhouettes of unseen models as well as silhouettes from unseen viewing angles. 21 CHAPTER 2. LITERATURE REVIEW • KSM is able to implicitly project noisy input silhouettes (which are perturbed by noise and lies outside the subspace defined by the training silhouettes) into the subspace, hence allowing the use of a single generic training model for training, which leads to a reduction in inference time. 
KSM uses images of silhouettes from synchronized cameras without the need for their intrinsic parameters (and therefore does not require any camera calibration). Extrinsic parameters are also not required, provided that the relative cameras position (between other cameras) are consistent between training and during online pose inference. It is important to note that KSM is a multiple camera approach to pose estimation, rather than the monocular approach proposed by Agarwal and Triggs [6, 3]. Pose inference from monocular silhouettes as in [6] is substantially harder as the level of ambiguity increases significantly in monocular sequences. In those cases, a more complicated tracking algorithm needs to be employed. KSM only uses a simple tracking algorithm embedded in the de-noising subspace, hence the requirement for (at least) two cameras to generate a learnt subspace that does not significantly overlap upon itself (e.g. the side view of a monocular silhouette walk sequence in [37]). From our experiments, we concluded that two cameras are sufficient in this regard, as oppose to the 3 camera framework in [83] or the 4 cameras setup in [45]. The important point to note is that KSM does not aim to improve on monocular approaches to pose estimation and their efficacy should not be directly compared as such. Instead, KSM aims to improve on the ability for learning-based approaches to generalize to unseen actors via the incorporating of a de-noising framework learnt from a single model, rather than the more common approach of using a large set from different models. For the capture of a walk sequence irrespective of angle, the reduction in training size allows KSM to efficiently infer full pose and yaw orientation at a rate of up to 10Hz. For comparison, Caillette et al [19] can capture at a speed of 10Hz, but requires volumetric reconstruction from four calibrated cameras. Bray et al presented an approach [17] which can infer pose without segmentation at an average of approximately 50 seconds per frame. Other markerless techniques [26] reported pose inference times of between 3 to 5 seconds per frame. 22 CHAPTER 2. LITERATURE REVIEW It is also important to highlight the differences between 3D markerless pose estimation techniques, which are based on volumetric reconstruction (section 2.2.1) and 2D markerless learning-based techniques (section 2.2.2). In 3D pose estimation techniques [111, 70, 102, 54, 99, 48], an expensive setup of multiple calibrated cameras (usually more than four) are required to derive the enclosing volume. Capture rate is usually lower than 2D approaches due to volumetric/surface reconstruction. On the other hand, 3D based techniques can capture more complicated non-cyclical human motion and, in some cases, have even been shown to capture complex surfaces (e.g. wrinkles in clothing) for reanimation [99, 48]. KSM, being a 2D learning-based approach, should not be directly compared with 3D approaches in relation to its ability to estimate pose, as the constraints vary substantially between the two paradigms. Nevertheless, the reader should bear in mind that the ultimate goal of full body pose estimation remains similar between the two paradigms. Finally, there is a large community of researchers in the field of machine learning, which concentrate specifically on the optimization and improvement of kernel techniques, such as Kernel PCA [91, 87] and Support Vector Machines (SVMs) [40]. 
KSM, being a technique which is based on the well-established KPCA algorithm [90], could most likely take advantage of any improved algorithms that may be proposed (from the machine learning community) in the future. To support this argument, two recently proposed algorithms, the Image Euclidian Distance [116] and the Greedy Kernel PCA algorithm [41] are integrated to the problem of 3D object viewpoint estimation (chapter 5) and human motion capture via KSM (chapter 6) respectively. There have not been, to the authors knowledge, any application of the Greedy Kernel KPCA algorithm in markerless motion capture, nor any use of the Image Euclidian Distance (IMED) in 3D object pose estimation problems. The results of the improved approaches indicate that theoretical improvements on KPCA can be transferred to practical improvements of KSM with relatively minor modifications. 23 CHAPTER 2. LITERATURE REVIEW 24 Chapter 3 Subspace Learning for Human Motion Capture This chapter introduces the novel concept of human motion de-noising via non-linear Kernel Principal Components Analysis (KPCA) [90] and summarizes how de-noising can contribute to markerless motion capture. Arguments are presented, which advocates that the normalized Relative Joint Center (RJC) format (section 3.4.1) for human motion encoding is, not only intuitive, but also well suited for KPCA de-noising. Motion denoising comparison between linear and non-linear approaches are presented to further demonstrate the advantages of using non-linear de-noising techniques (such as KPCA) in markerless motion capture. 3.1 Introduction The inference of full human pose and orientation (57 degrees of freedom) from silhouettes without the need of real world 3D processing is a complex and ill conditioned problem. When inferring from silhouettes, ambiguities arise due to the loss of depth cues (when projecting from 3D onto the 2D image plane) and the loss of foreground photometric information. Another complication to overcome is the high dimensionality of both the input and output spaces, which also leads to poor scalability of learning based algorithms. To avoid confusion it is important to note that, visually, an image as a whole is considered to exist in 2D. However, low level analysis of the pixels of an image (as applied in 25 CHAPTER 3. SUBSPACE LEARNING FOR HUMAN MOTION CAPTURE KSM) is considered high dimensional (more than 2D) as each pixel intensity represents a possible dimension for processing. Specifically in our work, the input is the vectorized pixel intensities of concatenated images from synchronized cameras. The output is the high dimensional human pose vector encoded in the normalized Relative Joint Center (RJC) format (section 3.4.1). For exemplar based motion capture techniques, such as [113, 94, 45, 75, 6, 37], searching the full dimension of the output pose vector space (e.g. 57 dimensions for RJC format) is usually avoided. A possible solution (to avoid searching the complete pose space) is to exploit the correlation in human movement [84, 14] and learn a reduced subspace in which searching and processing can be performed more efficiently. The dimensionality reduction technique Principal Components Analysis (PCA) [51] has been shown to be successful in learning human subspaces for the synthesis of novel human motion [84, 14]. However, PCA, being a linear technique is limited by its inability to effectively de-noise non-linear data. 
By de-noising, we mean the minimization of unwanted noise by projecting data onto a learnt subspace (determined from either PCA, Kernel PCA [90] or other subspace learning algorithms) and ignoring the remainder (section 3.2). To overcome the non-linearity in encoding, the Locally Linear Embedding algorithm [85] (which is a non-linear dimensionality reduction/manifold learning technique) has become popular in human motion capture [37]. The problem with LLE is that it does not yet have well established algorithms for the projection of out-of-sample points [12]. To project unseen points to the manifold embedding, the input is usually appended to the training set and the entire manifold relearned∗ . This chapter aims to find a suitable subspace learning algorithm (for human motion capture) which can effectively de-noise non-linear data, as well as one which is computationally efficient in the projection (de-noising) of novel inputs. To advocate the use of non-linear learning techniques in human motion capture, experiments should show that the simplest form of human motion (e.g. walking, running) is non-linear and, therefore, the motion would not be effectively de-noised by linear techniques such as PCA (when ∗ An investigation into the practical application of LLE for silhouette based motion capture was conducted by the author in [105]. Specifically for silhouette based approach to motion capture, the major problem encountered was the (computationally) expensive projection cost of novel silhouette feature vectors via LLE. Recently, the originators of LLE, Saul and Roweis, have investigated this problem and have presented two possible solutions in the update of their original Locally Linear Embedding paper [86]. 26 CHAPTER 3. SUBSPACE LEARNING FOR HUMAN MOTION CAPTURE compared to KPCA). The remainder of this chapter is organized as follows: Firstly, section 3.2 reviews the principal components analysis (PCA) algorithm [51, 98], as well as highlighting its linear limitation. The non-linear kernel PCA (KPCA) algorithm [90] is reviewed in section 3.3. In section 3.4, the novel application of KPCA in human motion de-noising is introduced, and the advantageous relationship between human subspace learning via KPCA [90] and markerless motion capture is summarized. Results are presented in section 3.5 to allow quantitative and visual comparison between PCA de-noising and KPCA de-noising of noisy motion sequences. 3.2 Review: Principal Components Analysis (PCA) Principal components analysis (PCA) [51] is a powerful statistical technique with useful applications in areas of computer vision, data compression, pattern/face recognition and human motion analysis [84, 24]. PCA effectively creates an alternative set of orthogonal ‘principal axes’ (basis vectors) with which to describe the data (figure 3.1). Geometrically speaking, expressing a data vector in terms of its principal components is a simple case of projecting the vector onto these independent axes (provided that the vector has already been centered around the training mean). Once the principal components are created, each projection is effectively a dot product, the calculation of which is only linear (in terms of the vector’s dimension) in complexity. Furthermore, multiple data vectors can be concatenated into a matrix and the projection performed in batches via matrix multiplications. Figure 3.1: Linear toy example to show the results of Principal Components Analysis (PCA) de-noising. The principal axes are highlighted in red. 
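As a concrete illustration of the toy example in figures 3.1 and 3.2, the short sketch below (illustrative only; the data is a synthetic stand-in, not the thesis' implementation) computes the two principal axes of a noisy, roughly linear 2D point set and de-noises it by keeping only the first axis. Note how each individual projection is a dot product, and projecting the whole batch is a single matrix multiplication.

import numpy as np

rng = np.random.default_rng(0)

# Noisy, roughly linear 2D toy data (analogous to figure 3.1).
t = rng.uniform(-1.0, 1.0, 200)
X_clean = np.column_stack([t, 0.5 * t])
X = X_clean + 0.05 * rng.normal(size=X_clean.shape)

mean = X.mean(axis=0)
Xc = X - mean                                    # centre around the training mean
C = (Xc.T @ Xc) / len(Xc)                        # covariance of the centred data
eigvals, eigvecs = np.linalg.eigh(C)             # principal axes and their variances
F = eigvecs[:, np.argsort(eigvals)[::-1]]        # order axes by decreasing variance

beta = Xc @ F[:, :1]                             # batch projection onto the 1st principal axis
X_denoised = beta @ F[:, :1].T + mean            # re-synthesis, ignoring the 2nd axis

print(np.mean(np.sum((X - X_clean) ** 2, axis=1)),          # error before de-noising
      np.mean(np.sum((X_denoised - X_clean) ** 2, axis=1)))  # error after de-noising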
Consider, for example, the 2D noisy (linear) toy data set in figure 3.1, where PCA will 27 CHAPTER 3. SUBSPACE LEARNING FOR HUMAN MOTION CAPTURE calculate two principal axes. Note that the principal axes must be orthogonal to each other and therefore, there can only be as many principal axes as the dimension of the original data. In this case, the main principal axis (1st principal component) lies along the direction of greatest variance, whereas, the less significant (2nd component) axis lies orthogonal to it. By ignoring data projection onto the 2nd component (and considering it as noise), the dimensionality of the problem reduces whilst minimizing loss of information. Figure 3.2: Toy Example to show the projection of data onto the 1st principal axis from PCA. The projections along the 2nd component is considered to be noisy data and ignored. Due to its simplicity, there are widespread applications of PCA in computer vision, more specifically in areas of high dimensionality analysis where subspace learning or dimensionality reduction are advantageous. There is a vast array of different applications (of PCA), ranging from, for example, data compression [14], optimization [84] to ‘de-noising’ [91]. However, in most applications, the underlying concept remains relatively the same, in that, given a training data in RD , PCA initially learns a set of K principal axes (where K ≤ D) to efficiently represent the data. In data compression via PCA, a ‘lossy’ form of compression is achieved by storing only the projection coefficients onto the first K principal axes. In human motion optimization [84], only the projection coefficients onto the K principal axes are processed and analyzed. In de-noising, the projections onto the K principal components are considered as the clean data (or that which is closest to the clean data). In most cases, vector projections onto the less significant components are ignored. To this end, two crucial questions that should be asked are: • how to automatically determine the optimal number of K principal components for a specific data set, and • how effective these K principal components are in representing the clean data. 28 CHAPTER 3. SUBSPACE LEARNING FOR HUMAN MOTION CAPTURE For notational purposes, a mathematical formulation of PCA [51, 90] is now presented to aid in the investigation of these problems. A training set of N centered vectors in RD is denoted as X tr , whereas the i-th training (column) vector is represented as xi . A novel (column) vector, which is not in the training set, is represented as x. Note that the data set must be preprocessed and centered around its mean beforehand (i.e. !N i=1 xi = 0). Thereafter, to determine the principal components, PCA diagonalizes the covariance matrix N 1 " xi xT C= i N (3.1) i=1 and solves for the Eigenvectors v and diagonal Eigenvalues matrix λ as follows: λv = Cv, for λk ≥ 0 and v ∈ RD . (3.2) Each (column) Eigenvector vk (in the matrix v ) corresponds to a principal axis for data projection. On the other hand, each Eigenvalue λk (the k-th diagonal in matrix λ) encodes the variance of the training data along each corresponding Eigenvector. Hence by ordering the Eigenvectors in decreasing order of their corresponding Eigenvalues, a set of ordered principal components is attained. Assuming that the optimal number of principal components is known (the selection of this will be explained in the next paragraph), a matrix of feature vectors F is constructed [98], where F = [v1 v2 v3 ... vK ]. 
(3.3) Projecting the novel point x (which has already been centered around the mean of the training set) onto the first K principal components is achieved via a single matrix multiplication β = FT x , (3.4) where the k-th column β k (of the coefficient matrix β) denotes the coefficient of the projection of x on the principal axis v k . Conversely, synthesizing a data vector from its 29 CHAPTER 3. SUBSPACE LEARNING FOR HUMAN MOTION CAPTURE projected coefficient is simply a weighted sum of the principal components: x̂ = Fβ. (3.5) To map the centered reconstruction back to its original space, the reconstructed vector x̂ can be added to the training mean. The remaining factor to consider is how to determine the optimal number of principal components. In order to do so, the meaning of optimal value, from a de-noising perspective must be defined. The goal of de-noising is to learn a subspace, which best represents the ‘true’ training subspace, given the constraint of the (de-noising) algorithm. The nearest point in the learnt subspace closest to a noisy data point is considered the cleanest representative of this entity. At this point, it is important to highlight the main difference between ‘clean’ data and de-noised data (which lies in the learnt subspace). As the denoiser may not be fully representative of the training data, the de-noised vector may still retain unwanted noise. Therefore, de-noised data is not necessarily (and most unlikely) clean data. The goal is to eliminate as much noise as possible by learning a subspace, which is as close as possible to the ‘true’ subspace† , without over-fitting. To this end, ‘denoising’ is the elimination of the components of any point which causes it to deviate from the learnt subspace. For example, in figure 3.2, projecting a noisy point onto the subspace (defined by the 1st principal axis) via equation 3.4 is considered a case of linear de-noising. In our work, for PCA de-noising [39], the optimal number of principal components K + will be defined as K + = argmin N " K∈[1,D] i=1 &x i − DK (x i + ∆nσi )&2 , (3.6) where ∆nσi is a random sample from a Gaussian white noise signal with variance of σ 2 . The de-noising function is signified by DK (•) and is a combination of the linear projection (3.4) and re-synthesizing (3.5) of the noisy data from the first K principal components † For interested readers, an analysis of noise in linear subspace approaches is conducted by Chen and Suter in [25]. The article aims to investigate the de-noising capacity, which indicates, in matrix terms, how close a low rank matrix (learnt subspace) is to the noise-free matrix (the ‘true’ training data). 30 CHAPTER 3. SUBSPACE LEARNING FOR HUMAN MOTION CAPTURE (ordered in decreasing Eigenvalues). For the sake of simplicity, in equation (3.6), there is only a single iteration over the entire training set for each value of K during tuning (this is the process of selecting the optimal value K + ). To ensure that the de-noising parameter generalizes well to novel test data, the optimal number of principal components is tuned using κ-fold cross-validation [39]. In this case, the training set is partitioned into κ testing subsets. For each subset, the remaining κ − 1 subsets are combined to create the training set. This effectively ensures that the training and testing subsets are mutually exclusive during the tuning process, hence representing a scenario which is more likely to occur in practice. 
To this end, equation 3.6 becomes:

K^+ = \arg\min_{K \in [1,D]} \sum_{k=1}^{\kappa} \sum_{i=1}^{\mathrm{card}(X_{tr}^{k})} \| x_i^k - \bar{D}_K^{k}(x_i^k + \Delta n_i^{\sigma}) \|^2 , (3.7)

where X_{tr}^{k} denotes the k-th subset and \bar{D}_K^{k}(\cdot) represents the de-noising function using the principal components learnt from the remaining κ − 1 subsets (i.e. not inclusive of the data in subset k). As the size of each subset may vary, the cardinality of the k-th subset is denoted by card(X_{tr}^{k}). Many other techniques exist for the selection of the optimal parameter for PCA. For example, in human motion synthesis via PCA, Safonova et al [84] choose K^+ such that

E_r = \frac{\sum_{i=1}^{K^+} \lambda_i}{\sum_{i=1}^{D} \lambda_i} \geq 0.9 . (3.8)

Geometrically speaking, because the i-th Eigenvalue encodes the variance along the i-th principal component (Eigenvector), equation 3.8 effectively selects K^+ such that the PCA projections represent more than 90% of the variance of the training data. In that case [84], PCA is used for data compression to allow optimization in a lower dimensional space, whereas in our work, the parameter is tuned specifically to optimize PCA's ability in the de-noising of novel data (equation 3.7).

3.2.1 Limitations of PCA

Even though PCA is an efficient and simple statistical learning technique, there are two crucial requirements for PCA de-noising to work effectively. These are:
• that the data remains within the original space‡, and
• that the clean data lie on or near a linear subspace.
The first constraint is obvious: since the orthogonal principal components lie within the span of R^D (the original space), any data reconstructed from these components must remain in R^D. The second limitation depends on the specific data set (i.e. is the data linear or non-linear?). This can best be illustrated with a simple non-linear toy example. Consider, for example, the non-linear toy data set (in figure 3.3), which has a mean square error of 0.1259.

Figure 3.3: Toy example to illustrate the limitation of PCA on the de-noising of non-linear data. The clean subspace is represented by the blue curve and the noisy test data represented by the red crosses.

In this case, the goal of PCA de-noising is to project the noisy (red) points onto the blue curve defining the clean subspace. Visually, for non-linear de-noising, a good projection for a noisy point would be one which is orthogonal to the curve's tangent (at the point of intersection). From figure 3.4, it is clear that PCA de-noising is ineffective in the de-noising of non-linear data. By projecting the test data (red points) onto the 1st principal component, the mean error actually increases to more than four times the original error.

‡ The ability to map data to an alternative space (from the original space R^D) is advantageous because it introduces a new degree of flexibility for data fitting and processing (this concept is investigated later for KPCA in section 3.3).

Figure 3.4: PCA de-noising of non-linear toy data via projection onto the 1st principal component.

Obviously, both principal axes can be used in de-noising, but as this is already the dimension of the original data, there is no de-noising effect and the error remains at its original value (up to some small rounding differences from reconstruction). For this specific example, a possible solution may be to view this as a linear fit in the polynomial space (x^n, x^{n-1}, ..., x^2, xy, y^2, x, y, 1).
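The selection rules of equations 3.6 to 3.8 can be written down compactly. The sketch below is a hedged illustration only (the noise level, fold count and data are placeholders, not the thesis' settings): it implements the de-noising function D_K, the cross-validated choice of K^+ from equation 3.7, and the variance-ratio alternative of equation 3.8.

import numpy as np

def pca_axes(X):
    """Training mean, principal axes (columns) and Eigenvalues, ordered by decreasing variance."""
    mean = X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(X - mean, rowvar=False, bias=True))
    order = np.argsort(eigvals)[::-1]
    return mean, eigvecs[:, order], eigvals[order]

def denoise(X, mean, F, K):
    """D_K(x): project onto the first K principal axes (eq. 3.4) and re-synthesise (eq. 3.5)."""
    return (X - mean) @ F[:, :K] @ F[:, :K].T + mean

def select_K_cv(X_tr, sigma, n_folds=5, seed=0):
    """Equation 3.7: kappa-fold cross-validated choice of K+ under Gaussian white noise."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X_tr)), n_folds)
    D = X_tr.shape[1]
    errors = np.zeros(D)
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(X_tr)), test_idx)
        mean, F, _ = pca_axes(X_tr[train_idx])      # principal components learnt without this fold
        noisy = X_tr[test_idx] + rng.normal(scale=sigma, size=(len(test_idx), D))
        for K in range(1, D + 1):
            errors[K - 1] += np.sum((X_tr[test_idx] - denoise(noisy, mean, F, K)) ** 2)
    return int(np.argmin(errors)) + 1

def select_K_variance(eigvals, threshold=0.9):
    """Equation 3.8: smallest K whose leading Eigenvalues explain at least `threshold` of the variance."""
    return int(np.searchsorted(np.cumsum(eigvals) / np.sum(eigvals), threshold)) + 1

For the non-linear toy data of figure 3.3, however, no choice of K helps; the remedy suggested above, a linear fit in an enlarged polynomial space, is precisely the idea that Kernel PCA generalises.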
This is the basic concept of Kernel PCA (section 3.3), where data is mapped to an alternative feature space where linear fitting/de-noising is possible. The problem that needs to be solved is how to automatically determine the optimal mapping for each specific data set. 3.3 Kernel Principal Components Analysis (KPCA) Kernel Principal Components Analysis (KPCA) [90, 91, 87] is a non-linear extension of PCA. Geometrically speaking, KPCA implicitly maps non-linear data to a (potentially infinite) higher dimensional feature space where the data may lie on or near a linear subspace. In this feature space, PCA might work more effectively. The basic principal of implicitly mapping data to a high dimensional feature space has found many other areas of application, such as Support Vector Machines (SVMs) [31, 89] for optical character recognition of handwritten characters [39], and image de-noising [91]. To the author’s knowledge, there is, however, no previous work on the de-noising of human motion and silhouettes for human motion capture using KPCA projections and pre-image approximations. To show that KPCA is useful in motion de-noising, in section 3.5, an analysis of 33 CHAPTER 3. SUBSPACE LEARNING FOR HUMAN MOTION CAPTURE the motion sequences is presented which aims to show that the (motion) sequences are inherently non-linear, and therefore should be de-noised by non-linear techniques, such as KPCA. 3.3.1 KPCA Feature Extraction & De-noising For completeness, a formulation of the KPCA algorithm as introduced by Schölkopf et al [90] will now be presented. Using the same notations as PCA, the training set of N data vectors is denoted by X tr in RD . Each element of the training set is indexed by x i . Each novel input point will, again, be represented by x . The non-linear mapping from the input space to a higher dimensional feature space F will be denoted by Φ, where Φ : RD → F, x )→ Φ(x ). (3.9) Provided that a suitable mapping for Φ can be found, the covariance matrix (similar to equation 3.1, but in feature space F) can be defined as: C̄ = N 1 " Φ(xi )Φ(xi )T . N (3.10) i=1 Note that the symbol C̄ is used instead of C to highlight the fact that the data is centered !N in feature space (i.e. i=1 Φ(x i ) = 0). For more details on how to center data in feature space, the reader should refer to [90, 91]. Following the same setting as PCA (section 3.2), the Eigenvectors V (in feature space) and the Eigenvalues (matrix) λ of the covariance matrix are calculated: λV = C̄V. (3.11) All Eigenvectors in V with Eigenvalues of more than zero must lie within the space of its training vectors Φ(x1 ), Φ(x2 ), ...., Φ(xN ), leading to the possible reconstruction of V from its ‘basis’ vectors: V = ΣN i=1 αi Φ(x i ), 34 (3.12) CHAPTER 3. SUBSPACE LEARNING FOR HUMAN MOTION CAPTURE where αi denotes the respective coefficients of each basis. The concatenated matrix α, which consists of these coefficient vectors, is the unknown that needs to be determined before any feature space projection via KPCA can be performed. By introducing an N ×N kernel matrix K, where Kij := +Φ(xi ) · Φ(xj ),, (3.13) N λKα = K2 α, (3.14) Schölkopf et al [90] showed that which further leads a simplified formulation: N λα = Kα. (3.15) From equation 3.15, it is clear that the unknown coefficient matrix can be determine by solving for the Eigenvectors and Eigenvalues of K. 
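In practice, solving equation 3.15 is an Eigen-decomposition of the kernel matrix. The following is a minimal sketch (not the thesis' implementation), assuming the Gaussian kernel that section 3.3.2 goes on to select and the feature-space centring expression of [90]:

import numpy as np

def kpca_fit(X_tr, gamma):
    """Solve N*lambda*alpha = K*alpha (equation 3.15) for a Gaussian kernel, after centring
    the kernel matrix in feature space [90]. Returns the normalised coefficient matrix alpha."""
    N = len(X_tr)
    sq = np.sum(X_tr ** 2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X_tr @ X_tr.T))   # kernel matrix (eq. 3.13)
    one = np.full((N, N), 1.0 / N)
    K_c = K - one @ K - K @ one + one @ K @ one                              # feature-space centring
    eigvals, alphas = np.linalg.eigh(K_c)
    order = np.argsort(eigvals)[::-1]
    eigvals, alphas = eigvals[order], alphas[:, order]
    keep = eigvals > 1e-12
    # Scale alpha_k so that V_k = sum_i alpha_i^k Phi(x_i) has unit norm in feature space.
    return alphas[:, keep] / np.sqrt(eigvals[keep])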
Specifically for the purpose of de-noising via KPCA (where a major concern is the ability to project novel points) the projections onto the k-th principal components in feature space is then +Vk · Φ(x), = N " i=1 αki +Φ(x) · Φ(xi ),. (3.16) Using similar notations for projected coefficients as equation 3.4, the coefficients of the KPCA projection (onto the first K + principal axis) of x can be denoted as + β = [+V1 · Φ(x),, .., +VK · Φ(x),]T , ∀ β ∈ RK + (3.17) where β k denotes the coefficient of the projection of x on the principal axis Vk (in feature space). The remaining factor to consider is how to define the feature space mapping Φ(·), such that KPCA de-noising is of any practical use. Referring back to the formulation of KPCA, there are, in fact, only two instances (when projecting points via KPCA) where an explicit mapping from input space RD to feature space F is required. These are when calculating for the kernel matrix K in equation 3.13 and during the projection of a novel 35 CHAPTER 3. SUBSPACE LEARNING FOR HUMAN MOTION CAPTURE point in equation 3.16. In both cases, a dot product (projection) with another mapped vector is performed immediately, once in feature space. To this end, it is possible to avoid defining an explicit map for Φ(·) , and instead, define a positive definite kernel function k(x i , x j ) [90], such that k(xi , xj ) = +Φ(xi ) · Φ(xj ),, Φ : RD → F (3.18) is a dot product in the feature space of the mapped vectors. Furthermore, due to the high (possibly infinite) dimension of the feature space, it would also be computationally expensive to explicitly perform the mapping. By implicitly mapping via a kernel function, which is a dot product in feature space [87], the system also avoids having to explicitly store high dimensional feature space vectors in memory. The following section summarizes the selection and tuning process for the relevant kernel. 3.3.2 Kernel Selection for Pre-Image Approximation There are many possible choices of valid kernels for KPCA, such as Gaussian or polynomial kernels. The choice of the optimal kernel will be dependant on the data and the application of KPCA. From a KPCA de-noising perspective, the radial basis Gaussian kernel T (x k(xi , x) = exp−γ{(x i −x ) i −x )} (3.19) is selected due to the availability of well established and tested ‘pre-image’ approximation algorithms [91, 58]. The pre-image (which exists in the original input space) is an approximation of the KPCA projected vector (which exists in the de-noised feature space). Approximating the pre-image of a projected vector (in feature space) is an extremely complex problem and for some kernels, this still remains an open question. Since the projected (de-noised) vector exists in a higher (possibly infinite) feature space, not all vectors in this space may have pre-images in the original (lower dimensional) input space. Specifically for the case of Gaussian kernels, however, the fixed point algorithm [88], gradient optimization [39] or the Kwok-Tsang algorithm [58], have been shown to be successful in approximating pre-images. For more information regarding the approximation of pre-images, the reader 36 CHAPTER 3. SUBSPACE LEARNING FOR HUMAN MOTION CAPTURE should refer to [89]. Note that there are two free parameters that requires tuning for the Gaussian kernel (equation 3.19), these being γ, the Euclidian distance scale factor, and K + , the optimal number of principal axis projections to retain in the feature space. 
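Putting equations 3.16 to 3.19 together, KPCA de-noising of a novel point reduces to kernel evaluations against the training set followed by a pre-image approximation. The sketch below is illustrative only: the feature-space centring terms of [90] are omitted for readability, the fixed-point iteration of [88] is one of several possible pre-image estimators, and `alphas` denotes the normalised coefficient matrix from the previous sketch.

import numpy as np

def gaussian_k(x, X_tr, gamma):
    """k(x, x_i) = exp(-gamma * ||x - x_i||^2) against every training vector (equation 3.19)."""
    return np.exp(-gamma * np.sum((X_tr - x) ** 2, axis=1))

def kpca_project(x, X_tr, alphas, gamma, K_plus):
    """Equation 3.16/3.17: coefficients of x on the first K+ feature-space principal axes
    (centring terms omitted)."""
    return alphas[:, :K_plus].T @ gaussian_k(x, X_tr, gamma)

def kpca_preimage(beta, X_tr, alphas, gamma, K_plus, z0, n_iter=100, tol=1e-8):
    """Fixed-point pre-image iteration [88] for the Gaussian kernel: find z in input space whose
    feature-space image is closest to the projected vector sum_k beta_k V_k."""
    w = alphas[:, :K_plus] @ beta          # expansion coefficients in terms of Phi(x_i)
    z = z0.copy()                          # the noisy input itself is a sensible starting point
    for _ in range(n_iter):
        weights = w * gaussian_k(z, X_tr, gamma)
        denom = weights.sum()
        if abs(denom) < 1e-12:
            break                          # degenerate step; keep the current estimate
        z_new = (weights[:, None] * X_tr).sum(axis=0) / denom
        if np.linalg.norm(z_new - z) < tol:
            return z_new
        z = z_new
    return z

def kpca_denoise(x, X_tr, alphas, gamma, K_plus):
    """Project onto the first K+ feature-space axes and map back via the pre-image."""
    beta = kpca_project(x, X_tr, alphas, gamma, K_plus)
    return kpca_preimage(beta, X_tr, alphas, gamma, K_plus, z0=x)

The two free parameters appearing above, the scale factor γ and the number of retained axes K+, are exactly the ones whose tuning is described next.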
To do so, the same concept of tuning for de-noising (as with PCA [section 3.2]) can be implemented. By letting P^{\gamma}_{K^+} define the KPCA de-noising function (projection and pre-image approximation with the Euclidian scale factor γ and an optimal number of principal axes K^+), the optimal parameters are chosen so as to minimize the error function

\varepsilon(\gamma, K^+) = \sum_{i=1}^{N} \| x_i - P^{\gamma}_{K^+}(x_i + \Delta n_i^{\sigma}) \|^2 . (3.20)

Note that this is similar to PCA tuning in equation 3.6, and can be improved to generalize to unseen inputs by using cross validation in tuning, hence resulting in the following form:

\varepsilon(\gamma, K^+) = \sum_{k=1}^{\kappa} \sum_{i=1}^{\mathrm{card}(X_{tr}^{k})} \| x_i^k - \bar{P}^{\gamma, k}_{K^+}(x_i^k + \Delta n_i^{\sigma}) \|^2 , (3.21)

where X_{tr}^{k} denotes the k-th subset for cross validation and \bar{P}^{\gamma, k}_{K^+}(\cdot) represents the KPCA de-noising function using the principal components learnt from the remaining κ − 1 subsets (i.e. not inclusive of the data in subset k).

Figure 3.5: KPCA de-noising of the non-linear toy data from figure 3.3.

Using the tuned Gaussian kernel, figure 3.5 shows the results of KPCA de-noising of the non-linear example [39] from figure 3.3. In this case, KPCA de-noising eliminated more than 40% of the original error. When compared with linear PCA, where there was no possible reduction in error, KPCA's ability to de-noise data provides a significant improvement§.

3.4 KPCA Subspace Learning for Motion Capture

The concept of motion subspace learning via KPCA is to use (Gaussian) kernels to model the non-linearity of joint rotation, such that a computationally efficient pose ‘distance’ can be calculated, hence allowing practical human motion de-noising. Clean human motion captured via a marker-based system [32] is converted to the normalized Relative Joint Center (RJC) encoding (section 3.4.1) and used as training data. The optimal de-noising subspace is learnt by mapping the concatenated joint rotation (pose) vector via a Gaussian kernel and tuning its parameters to minimize the error in pre-image approximations (section 3.3.2). In section 3.4.2, the relationship between human motion de-noising and markerless motion capture is summarized.

3.4.1 Normalized Relative Joint Center (RJC) Encoding

This section aims to show that encoding human pose using normalized RJC vectors is not only logical, but intuitive, in the sense that it allows efficient non-linear comparison of pose vectors, and has a structure which is well suited for KPCA. Generally, human pose can be encoded using a variety of mathematical forms, ranging from Euler angles (appendix A.1.1) to homogeneous matrices (appendix A.1.2). In terms of pose reconstruction and interpolation [9, 57], problems usually encountered (when using Euler angles and matrices) include how to overcome the non-linearity and non-commutative structure of rotation encoding. Specifically for markerless motion capture using Euler angles, we do not try to map from input to Euler space as in [6] because Euler angles also suffer from singularities and gimbal lock. Instead we model (3D) joint rotation as a point on a unit sphere and use a Gaussian kernel to approximate its non-linearity. The advantage of using normalized spherical surface points to model rotations is that the points lie on a non-linear manifold, which can be well-approximated using a combination of exponential maps and linear algebra.
A similar § An interesting area to consider for each kernel (of KPCA) is if any heteroscedastic noise [38, 63] in induced in feature space, and if so, how this relates to KPCA’s ability to de-noise and approximate pre-images. 38 CHAPTER 3. SUBSPACE LEARNING FOR HUMAN MOTION CAPTURE concept of applying non-linear exponential maps to model human joint rotations has been proposed by Bregler and Malik in [18]. From equation 3.16, the requirement needed to enable KPCA projection for human motion de-noising is the definition of the ‘distance’ between two pose descriptors, which should be a dot product in feature space (section 3.3). To begin, let us consider the simple case of defining the ‘distance’ between two different orientations of the same joint (e.g. the shoulder joint). In this case, the two rotations can be modelled as two points on a unit sphere, with a logical ‘distance’ being the geodesic (surface) distance between the two surface points. If we denote the two normalized points on the unit sphere as p and q respectively, the geodesic distance is the inverse cosine of the dot product: cos−1 (p · q ). The difficulty of using the inverse cosine arises when trying to efficiently extend the ‘distance’ definition to an array of spheres (encoding multiple joints of the human body), and ensuring that this ‘distance’ is positive definite to allow embedding into KPCA [87]. Instead of adopting this approach, we use a Gaussian kernel, which has already been proven to be positive definite for KPCA de-noising [91]. In this case, the ‘distance’ between two orientations of the i-th joint will be defined as k(p i , q i ), with T (p k(p i , q i ) = exp−γ{(p i −q i ) i −q i )} ∀ k(·, ·) ∈ [0, 1], (3.22) In equation 3.22, the Gaussian kernel in conjunction with the Euclidian distance (between p and q ) have been used to approximate the non-linearity between the two spherical surface points. The reader should note that the Gaussian kernel is an inverse encoding (i.e. a complete alignment of the two orientations is indicated by the maximum value of k(p i , q i ) = 1). Any misalignment of the specific joint will tend to reduce the kernel function’s output towards zero. The advantage of adopting an exponential map approach to joint distance definition becomes apparent when trying to calculate the distance between arrays of multiple spheres, which encode the full pose (stance) of a person. Since k(·, ·) ∈ [0, 1], the individual joint ‘distance’ can be extended to multiple joints by taking a direct multiplication of the kernel’s output. For example, if two different pose vector x 39 CHAPTER 3. SUBSPACE LEARNING FOR HUMAN MOTION CAPTURE and y for a human model of M joints is encoded as x = [p 1 , p 2 , ... p M ]T , x ∈ R3M , and (3.23) y = [q 1 , q 2 , ... q M ]T , y ∈ R3M , then the distance between these pose vectors, which are encoded as an array of normalized spherical surface points, is k(x , y ) = k(p 1 , q 1 ) × k(p 2 , q 2 ) × ..... k(p M , q M ) T (p 1 −q 1 )} T (p T 1 −q 1 )+(p 2 −q 2 ) (p 2 −q 2 )+..... = exp−γ{(p 1 −q 1 ) = exp−γ{(p 1 −q 1 ) T (x −y )} = exp−γ{(x −y ) T (p exp−γ{(p 2 −q 2 ) 2 −q 2 )} T (p ..... exp−γ{(p M −q M ) M −q M )} (p M −q M )T (p M −q M )} . (3.24) This results in the Gaussian kernel, which was used in KPCA de-noising of the non-linear toy example in figure 3.5. Intuitively, this encoding makes sense because two full body poses which are aligned will still generate a ‘distance’ of 1. 
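The factorisation in equation 3.24 is easy to verify numerically. In the short check below (the joint count and γ are arbitrary placeholders), the product of per-joint Gaussian kernels equals a single Gaussian kernel on the concatenated 3M-dimensional RJC vectors.

import numpy as np

rng = np.random.default_rng(1)
gamma, M = 0.5, 19

# Two poses: M joints, each a normalised point on the unit sphere (RJC encoding).
p = rng.normal(size=(M, 3)); p /= np.linalg.norm(p, axis=1, keepdims=True)
q = rng.normal(size=(M, 3)); q /= np.linalg.norm(q, axis=1, keepdims=True)

# Product of per-joint Gaussian kernels (left-hand side of equation 3.24) ...
lhs = np.prod(np.exp(-gamma * np.sum((p - q) ** 2, axis=1)))

# ... equals a single Gaussian kernel on the concatenated RJC vectors (right-hand side).
rhs = np.exp(-gamma * np.sum((p.reshape(-1) - q.reshape(-1)) ** 2))

assert np.isclose(lhs, rhs)

Since the squared chord length ||p_i - q_i||^2 = 2 - 2 p_i·q_i is a monotone function of the geodesic angle cos^{-1}(p_i·q_i), the Gaussian of the Euclidian distance also respects the geodesic ordering of joint orientations, which is why it is a reasonable surrogate for the spherical ‘distance’.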
Any joint which results in a relative misalignment (from the same joint in a different pose) will tend to reduce this ‘distance’ towards zero. Equation 3.24 shows that the ‘distance’ between pose vectors of M joints can be efficiently calculated by simply taking the Euclidian distance between the concatenated spherical surface points and mapping via a Gaussian kernel. The Euclidian scale parameter γ allows the tuning of the kernel to fit different motion sequences and training sets (section 4.2.2). Furthermore, the kernel is guaranteed to be positive definite, and therefore, applicable to human motion de-noising via KPCA. A disadvantage of encoding a rotation using a normalized spherical surface point is that it loses a degree of rotation for any ball & socket joint (figure 3.6). This is because a surface point on a sphere cannot encode the rotation about the axis defined by the sphere’s center and the surface point itself. However, as highlighted in the phase space representation of Moeslund [74], for the main limbs of the human body (e.g. arms and legs), the ball & socket joints (e.g. shoulder or pelvis) are directly linked to hinge joints (e.g. elbow or knee), which only has one degree of freedom. Provided that the limb is 40 CHAPTER 3. SUBSPACE LEARNING FOR HUMAN MOTION CAPTURE not fully extended, it is possible to rediscover the missing degree of freedom (for a ball & socket joint) by analyzing the orientation of the corresponding hinge joint. If the limb is fully extended, then temporal constraints can be used to select the most probable missing rotation of the ball & socket joint. Figure 3.6: Diagram to illustrate the missing rotation in a ball & socket joint using the RJC encoding. The missing rotation of the ball & socket joint (shoulder) can be inferred from the hinge joint (elbow) below it in the skeletal hierarchy (provided that the limb is not fully extended). 3.4.2 Relationship between KPCA De-noising and Motion Capture It is important to highlight the possible relationship between KPCA de-noising of human motion and exemplar based motion capture techniques. In human motion de-noising, noisy human pose vector can be de-noised by projecting it onto the feature subspace via the kernel trick, and mapping back via pre-image approximation [87]. In exemplarbased motion capture, performance improvement can be achieved by constraining search to within a more efficient lower dimensional human motion subspace (rather than the original pose space). Both techniques require the use of a human motion subspace for processing. To some readers, an obvious way to integrate KPCA de-noising into markerless motion capture would be to add a KPCA de-noiser as a post-processor to motion capture and have some kind of feedback loop to constrain the search space (figure 3.7 [top]). This thesis explores a more efficient concept, that of integrating KPCA human motion de-noiser into 41 CHAPTER 3. SUBSPACE LEARNING FOR HUMAN MOTION CAPTURE Figure 3.7: [Top] Data flow diagram to summarize a likely relationship between human motion capture and human motion de-noising. [Bottom] Data flow diagram summary of Kernel Subspace Mapping (KSM) [chapter 4] (based on KPCA [section 3.3]), which integrates motion capture and de-noising into a single efficient processing step. the core of a novel markerless motion capture technique called Kernel Subspace mapping (KSM)(figure 3.7 [bottom]). 
Instead of capturing a pose and de-noising from the normalized RJC space (figure 3.7 [top]), KSM maps silhouette descriptors immediately to the projected feature subspace learnt from KPCA, hence avoiding the KPCA projection step (equation 3.16) during pose inference. In order to test the effectiveness of KPCA for human motion capture, an engineering approach is adopted and experiments are performed on the KPCA de-noiser (for human motion) independently, before motion capture integration. To advocate the use of the more complex non-linear KPCA, instead of linear PCA, experiments (section 3.5) should show that the RJC encoding (section 3.4.1) of human motion is non-linear, and more importantly, that more noise can be eliminated via KPCA (when compared to PCA). Denoising of synthetic Gaussian white noise in the feature space of various motion sequences is also analyzed. This is investigated because in KSM, noise may also be induced in the feature space (crucial details regarding this will be discussed in chapter 4). 42 CHAPTER 3. SUBSPACE LEARNING FOR HUMAN MOTION CAPTURE 3.5 Experiments & Results To compare PCA and KPCA for the de-noising of human motion, various motion sequences were downloaded from the CMU motion capture database. All motions were captured via the VICON marker-based motion capture system [103], and were originally downloaded in AMC format (section A.1.1). To avoid the singularities and non-linearity of Euler angles (AMC format), the data is converted to its RJC equivalent (section 3.4.1) for training. All experiments were performed on a PentiumT M 4 with a 2.8 GHz processor. Synthetic Gaussian white noise are added to motion sequence X tr in normalized RJC format, and the de-noising qualities compared quantitatively (figure 3.8) and qualitatively in 3D animation playback¶ . Figure 3.8: Quantitative comparison between PCA and KPCA de-noising of human motion sequence. The level of input synthetic noise (mean square error [mse]) is depicted on the horizontal axes. The corresponding output noise level in the de-noised motion is depicted on the vertical axes. Figure 3.8 highlights the superiority of KPCA over linear PCA in human motion denoising. Specifically for the walking and running motion sequences (figure 3.9 and figure 3.10 respectively), the frame by frame comparison further emphasizes the superiority of KPCA (blue line) over PCA (black line). Furthermore, the KPCA algorithm was able to generate realistic and smooth animation when the de-noised motion is mapped to a ¶ For motion de-noising results via PCA and KPCA, please refer to the attached file: videos/motionDenoisingRun.MP4 & videos/motionDenoisingWalk.MP4. 43 CHAPTER 3. SUBSPACE LEARNING FOR HUMAN MOTION CAPTURE Figure 3.9: Frame by Frame error comparison between PCA and KPCA de-noising of a walk sequence. Figure 3.10: Frame by Frame error comparison between PCA and KPCA de-noising of a run sequence. skeleton model and play-backed in real time. PCA de-noising, on the other hand, was unsuccessful due to its linear limitation and resulted in jittery unrealistic motions similar to the original noisy sequences. A repeat of this experiment (motivated by our work) has also been conducted by Schraudolph et al (in [92]) with the fast iterative Kernel PCA algorithm, and similar de-noising results are confirmed. 
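For readers who wish to reproduce the shape of figure 3.8, the protocol can be sketched as follows. The motion data here is a synthetic stand-in (the actual experiments use CMU sequences in RJC format), and only the PCA baseline is shown; the KPCA branch plugs into the same loop via the projection and pre-image sketches given earlier.

import numpy as np

rng = np.random.default_rng(2)

# Stand-in for an RJC motion sequence (frames x 57 dimensions); the real experiments use CMU data.
t = np.linspace(0, 4 * np.pi, 300)
X_tr = np.column_stack([np.sin((1 + 0.1 * j) * t + j) for j in range(57)])

def pca_denoise(noisy, X_ref, K):
    """Project onto the first K principal components of X_ref and re-synthesise."""
    mean = X_ref.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_ref - mean, full_matrices=False)
    F = Vt[:K].T
    return (noisy - mean) @ F @ F.T + mean

for sigma in (0.05, 0.1, 0.2, 0.4):          # input noise levels (horizontal axis of figure 3.8)
    noisy = X_tr + rng.normal(scale=sigma, size=X_tr.shape)
    denoised = pca_denoise(noisy, X_tr, K=10)
    mse_in = np.mean(np.sum((noisy - X_tr) ** 2, axis=1))
    mse_out = np.mean(np.sum((denoised - X_tr) ** 2, axis=1))
    print(f"sigma={sigma:.2f}  input mse={mse_in:.3f}  PCA-denoised mse={mse_out:.3f}")
    # The KPCA branch follows the same loop, with kpca_denoise (earlier sketch) in place of pca_denoise.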
For feature space$ de-noising of another non-linear toy example (figure 3.11 [center]), the result shows that KPCA can implicitly de-noise noisy feature space data by directly " We are interested in the de-noising of feature space noise because in KSM (section 4), input data is initially mapped to the feature space, before pose estimation. 44 CHAPTER 3. SUBSPACE LEARNING FOR HUMAN MOTION CAPTURE Figure 3.11: Diagram to show results of KPCA in the implicit de-noising of feature space noise of a toy example via the fixed-point algorithm [88]. [Left] Projections of the clean (blue) training data and the noisy (red) test data onto the first 3 principal axes in feature space. [Center] The ‘direct’ pre-images of the noisy data in the original input space. [Right] The ‘direct’ pre-images re-projected back to the feature space. Note the reduction in mean squared error between the noisy data [left] and the de-noised data [right]. calculating its pre-image. This result is further supported when the pre-images of the noisy feature space vectors are re-projected back to feature space without any other form of processing via KPCA. Figure 3.11 [right] highlights geometrically and quantitatively the reduction in error of the re-projected points onto the first 3 principal axes in feature space. Figure 3.12 [top] shows the correlation (though erratic) between feature space noise and pose (RJC) space noise (in mean square error [mse]) for KPCA de-noising of a walk sequence. As expected, smaller noise level in feature space is an indication of smaller noise level in the original space, and vice versa. As the processing cost of KPCA is dependant on the training size (equation 3.16), an analysis of this cost with respect to the training size was also investigated. Figure 3.12 [bottom] confirms the linear relationship between the de-noising cost of KPCA and its training size. 3.6 Conclusions & Future Directions Human motion can be considered as a high dimensional set of concatenated vectors (57 dimensions in RJC format). As human motion is coordinated, it is possible to use less 45 CHAPTER 3. SUBSPACE LEARNING FOR HUMAN MOTION CAPTURE Figure 3.12: Feature and pose space mean square error [mse] relationship for KPCA [top]. Average computational de-noising cost for KPCA [bottom]. dimensions to represent the original data. PCA can encode more than 98% of a simple walking motion in less than 10 principal components [14], and it has even been shown successful in learning principal axes for the synthesis of new human motion via optimization [84]. However, because a set of axes represents a high percentage of the original data, it does not necessary mean that it is a good de-noiser of the data. The main point to take into account, is that neither PCA, nor KPCA, is being investigated (in this chapter) for its ability to compress (reduce the dimension of) human motion data . The main issue of concern is to compare the de-noising qualities (of human motion) between PCA and KPCA (and use the results to motivate the use of KPCA de-noising in Kernel Subspace Mapping [chapter 4]). From the results in figure 3.8, it is clear that KPCA performs significantly better than PCA in human motion de-noising, hence, verifying that the simplest forms of human motion (e.g. walking, running) are inherently non-linear. Note that PCA is, in fact, a special case of KPCA with linear kernels, hence, eliminating the possibility that PCA de-noising can perform better than KPCA de-noising. 
KPCA simply allows the selection of different 46 CHAPTER 3. SUBSPACE LEARNING FOR HUMAN MOTION CAPTURE kernels (which may also have more parameters than PCA), more flexibility in parameter tuning to avoid over/under fitting, and hence, enables custom improvements over linear PCA. Specifically for KPCA de-noising via the RBF kernel (equation 3.16), from experiments, it was discovered that the optimal value for γ (the Gaussian scale factor) plays a substantial role in improving the de-noising qualities. This parameter is not available for tuning via the traditional linear PCA de-noising algorithm. The ability for KPCA to implicitly de-noise data in feature space is also interesting. Instead, assume a situation where noise is actually induced in feature space, and not input space. By directly approximating the pre-images from the noisy data in feature space, it is possible to implicitly de-noise the data in the original space. To reiterate, because the feature space usually exists in much higher dimension than the input space, most points in the feature space do not have corresponding pre-images. This is even more likely if the (feature space) point is located away from the subspace of training data, since the feature space is, in fact, defined by the training data itself. In the case where the pre-image does not exist, then gradient descent techniques [88] can be used to approximate the most likely pre-image. From the results in figure 3.11, this approximation has the desirable de-noising effect of implicitly projecting the noisy feature space data into the non-linear subspace (i.e. the toy data set in figure 3.11 [center]) in the input space. From a markerless motion capture perspectively, human motion de-noising is advantageous as it allows the automated learning of human subspaces for specific applications. In chapter 4, the KPCA human motion de-noiser will be integrated into the core of the markerless motion capture technique (called Kernel Subspace Mapping), where noise may be induced in both the input and feature space. For real-time motion capture, a factor to take into account for future consideration is the complexity for KPCA projection (figure 3.12 [bottom]), which is O(N ), where N is the cardinality of the training set. As motion capture data is usually stored as a set of concatenated vectors captured (sampled) at high frame rates (from 60Hz to 120Hz), training the de-noiser with all the frames from each motion file will be extremely expensive when it comes to motion de-noising (and motion 47 CHAPTER 3. SUBSPACE LEARNING FOR HUMAN MOTION CAPTURE capture). To this end, the Greedy KPCA algorithm [41], which selects a reduced training set that retains a similar span in feature space will be reviewed for motion de-noising (chapter 6). In particular, the capture rate (which is dependant on the de-noising rate) can be controlled by modifying the size of the reduced training set via this greedy algorithm. 48 Chapter 4 Kernel Subspace Mapping (KSM)∗ This chapter integrates human motion de-noising via KPCA (chapter 3) and the Pyramid Match Kernel [44] into a novel markerless motion capture technique called Kernel Subspace Mapping. The technique learns two feature space representations derived from the synthetic silhouettes and pose pairs, and views motion capture as the problem of mapping vectors between the two feature subspaces. Quantitative and qualitative motion capture results are presented and compared with other state of the art markerless motion capture algorithms. 
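As a preview of the mapping step detailed later in this chapter (section 4.3), the following sketch shows one plausible reading of the LLE-style non-parametric mapping between the two feature subspaces (called Ms and Mp in the notation introduced below): local reconstruction weights are solved around the query in the silhouette subspace and re-applied to the paired training coordinates in the pose subspace. The function name, neighbour count and ridge regulariser are illustrative assumptions, not the thesis' exact settings.

import numpy as np

def lle_map(s_new, S_tr, P_tr, n_neighbors=8, reg=1e-3):
    """Map a silhouette-subspace point to the pose subspace via locally linear reconstruction.

    S_tr: (N, ds) training projections in the silhouette feature subspace Ms.
    P_tr: (N, dp) paired training projections in the pose feature subspace Mp.
    """
    d2 = np.sum((S_tr - s_new) ** 2, axis=1)
    nbrs = np.argsort(d2)[:n_neighbors]                # nearest training silhouettes in Ms
    Z = S_tr[nbrs] - s_new                             # shift neighbours to the query point
    G = Z @ Z.T                                        # local Gram matrix
    G += reg * np.trace(G) * np.eye(n_neighbors)       # regularise for stability (as in LLE [85])
    w = np.linalg.solve(G, np.ones(n_neighbors))
    w /= w.sum()                                       # reconstruction weights sum to one
    return w @ P_tr[nbrs]                              # same weights applied in the pose subspace Mp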
4.1 Introduction Kernel Subspace Mapping (KSM) concentrates on the problem of inferring articulated human pose from concatenated and synchronized human silhouettes. Due to ambiguities when mapping from 2D silhouette to 3D pose space, previous silhouette based techniques, such as [6, 19, 26, 37, 45, 75, 77], involve complex and expensive schemes to disambiguate between the multi-valued silhouette and pose pairs. A major contributor to the expensive cost of human pose inference is the high dimensionality of the output pose space. To mitigate exhaustively searching in high dimensional pose space, pose inference can be constrained to a lower dimensional subspace [37, 113]. This is possible due to a high degree ∗ This chapter is based on the conference paper [108] T. Tangkuampien and D. Suter: Real-Time Human Pose Inference using Kernel Principal Component Pre-image Approximations: British Machine Vision Conference (BMVC) 2006, pages 599–608, Edinburgh, UK. 49 CHAPTER 4. KERNEL SUBSPACE MAPPING (KSM) of correlation in human motion. In particular, a novel markerless motion capture technique called Kernel Subspace Mapping (KSM), which takes advantage of the correlation in human motion, is introduced. The algorithm, which is based on Kernel Principal Components Analysis (KPCA) [90] (section 3.3), learns two feature subspace representations derived from the synthetic silhouettes and normalized relative joint centers (section 3.4.1) of a single generic human mesh model (figure 4.1). After training, novel silhouettes of previously unseen actors and of unseen poses are projected through the two feature subspaces (learnt from KPCA) via Locally Linear non-parametric mapping [85]. The captured pose is then determined by calculating the pre-image [87] of the projected silhouettes. As highlighted in chapter 3, an advantage of KPCA is its ability to de-noise non-linear data before processing, as shown in [91] with images of handwritten characters. This chapter further explores this concept, and shows how this novel technique can be applied in the area of markerless motion capture. Results in section 4.4.2 show that KSM can infer relatively accurate poses (compared to [45, 6, 2]) from noisy unseen silhouettes by using only one synthetic human training model. A limitation of the technique is that silhouette data will be projected onto the subspace spanned by the training pose, hence restricting the output to within this pose subspace. This restriction, however, is not crucial since the system can be initialized with the correct pre-trained data set if prior knowledge on the expected type of motion is available. The main contribution of this chapter includes the introduction of a novel technique for markerless motion capture called Kernel Subspace Mapping (KSM), which is based on mapping between two feature subspaces learnt via KPCA. A novel concept of silhouette de-noising is presented, which allows previously unseen (test) silhouettes to be projected onto the subspace learnt from the generic (training) silhouettes (figure 4.1), hence allowing pose inference using only a single training model (which also leads to a significant decrease in training size and inference time). For mapping from silhouettes to the pose subspace, instead of using standard or robust regression (which was found to be both slower and less 50 CHAPTER 4. KERNEL SUBSPACE MAPPING (KSM) Figure 4.1: Overview of Kernel Subspace Mapping (KSM), a markerless motion capture technique base on Kernel Principal Components Analysis (KPCA) [90]. 
accurate in our experiments), non-parametric LLE mapping [85] is applied. The silhouette kernel parameters are tuned to optimize the silhouette-to-pose mapping by minimizing the LLE mapping error (section 4.3.3). Finally, by mapping silhouettes to the pose feature subspace, the search space can be implicitly constrained to a set of valid poses, whilst taking advantage of well established and optimized pre-image (inverse mapping) approximation techniques, such as the fixed-point algorithm of [87] or the gradient optimization technique [39]. 4.2 Markerless Motion Capture: A Mapping Problem As stated previously, Kernel Subspace Mapping (KSM) views markerless motion capture as a mapping problem. Given as input, silhouettes of a person, the algorithm maps the pixel data to a pose space defined by the articulated joints of the human body. The static pose of a person (in the pose space) is encoded using the normalized relative joint centers (RJC) format (section 3.4.1), where a pose (at a time instance) is denoted by x = [p 1 , p 2 , ...p M ]T , x ∈ R3M , 51 (4.1) CHAPTER 4. KERNEL SUBSPACE MAPPING (KSM) and p k represents normalized spherical surface point encoding of the k-th joint relative to its parent joint in the skeletal’s hierarchy (appendix A.1). Regression is not performed from silhouette to Euler pose vectors as in [2, 6] because the mapping from pose to Euler joint coordinates is non-linear and multi-valued. Any technique, like KPCA and regression based on standard linear algebra and convex optimization will therefore eventually breakdown when applied to vectors consisting of Euler angles (as it may potentially map the same 3D joint rotation to different locations in vector space). To avoid these problems, the KPCA de-noising subspace is learnt from (training) pose vectors in the relative joint center format (section 3.4.1). Figure 4.2: Scatter plot of the RJC projections onto the first 4 kernel principal components in Mp . of a walk motion, which is fully rotated about the vertical axis. The 4th dimension is represented by the intensity of each point. The pose feature subspace Mp (figure 4.2) is learnt from the set of training poses. Similarly, for the silhouette space, synchronized and concatenated silhouettes (synthesized from the pose x i ) are preprocessed to a hierarchical shape descriptor Ψi (section 4.2.3) using a technique similar to the pyramid match kernel of [44]. The preprocessed hierarchical training set is embedded into KPCA, and the silhouette subspace Ms learnt (figure 4.1 [bottom left]). The system is tuned to minimize the LLE non-parametric mapping [85] from Ms to Mp (section 4.3). During Capture, novel silhouettes of unseen actors (figure 4.1 [top left]) are projected through the two subspaces, before mapping to the output pose space using pre-image approximation techniques [87, 39]. Crucial steps are explained fully in the remainder of this chapter. 52 CHAPTER 4. KERNEL SUBSPACE MAPPING (KSM) 4.2.1 Learning the Pose Subspace via KPCA From a motion capture perspective, pose subspace learning is the same as learning the subspace of human motion for de-noising (chapter 3). For the latter (motion de-noising), a noisy pose vector is projected onto the (kernel) principal components for de-noising in the feature subspace Mp . Thereafter, the pre-image of the projected vector is approximated, hence mapping back to the original space (figure 4.3 [KPCA De-noiser: black rectangle]). 
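To make the projection and pre-image steps concrete, the following is a minimal NumPy sketch of RBF-kernel KPCA as used for the pose subspace: training on the RJC vectors, projection of a novel pose via the kernel trick (made explicit in equations 4.2–4.4 below), and the fixed-point pre-image iteration of Schölkopf et al. [87]. The function names and the simplified treatment of feature-space centring in the pre-image step are illustrative choices, not taken from the thesis implementation.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian RBF kernel k(a, b) = exp(-gamma * ||a - b||^2) between row vectors."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * sq)

def train_kpca(X_tr, gamma, eta):
    """Eigen-decompose the centred kernel matrix and keep the leading eta axes."""
    N = X_tr.shape[0]
    K = rbf_kernel(X_tr, X_tr, gamma)
    J = np.eye(N) - np.ones((N, N)) / N            # feature-space centring matrix
    lam, A = np.linalg.eigh(J @ K @ J)             # eigenvalues in ascending order
    lam, A = lam[::-1][:eta], A[:, ::-1][:, :eta]
    A = A / np.sqrt(np.maximum(lam, 1e-12))        # normalise so that <V_k, V_k> = 1
    return {'X': X_tr, 'K': K, 'alpha': A, 'gamma': gamma}

def kpca_project(model, x):
    """Projection of a novel vector x onto the first eta principal axes (kernel trick)."""
    X, K, A, gamma = model['X'], model['K'], model['alpha'], model['gamma']
    k = rbf_kernel(x[None, :], X, gamma).ravel()   # k(x_i, x) for every training exemplar
    k_c = k - k.mean() - K.mean(0) + K.mean()      # centre the test kernel row
    return A.T @ k_c                               # projected vector in R^eta

def preimage_fixed_point(model, v, z0, n_iter=100, tol=1e-8):
    """Fixed-point pre-image iteration for the Gaussian kernel (Schölkopf et al. [87]).
    v  : feature-subspace vector whose pre-image is sought;
    z0 : starting guess, e.g. the noisy pose or the previous frame's output.
    Feature-space centring terms are dropped here to keep the sketch short."""
    X, A, gamma = model['X'], model['alpha'], model['gamma']
    coeff = A @ v                                  # expansion coefficients of the projection
    z = z0.astype(float).copy()
    for _ in range(n_iter):
        w = coeff * np.exp(-gamma * np.sum((X - z) ** 2, axis=1))
        z_new = (w @ X) / (w.sum() + 1e-12)
        if np.linalg.norm(z_new - z) < tol:
            break
        z = z_new
    return z
```

Tuning then amounts to a cross-validated grid search over γp and ηp that minimizes the reconstruction error between held-out poses and the pre-images of their projections, as described in section 4.2.2.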
Figure 4.3: Diagram to highlight the relationship between human motion de-noising (chapter 3), Kernel Principal Components Analysis (KPCA) and Kernel Subspace Mapping (KSM).

For markerless motion capture (figure 4.3 [KSM: red rectangle]), KSM initially learns the projection subspace Mp as in KPCA de-noising. However, instead of beginning in the normalized RJC (pose) space, the input is derived from concatenated silhouettes (figure 4.3 [left]) and mapped (via a silhouette feature subspace Ms) to the pose feature subspace Mp. Thereafter, KSM and KPCA de-noising follow the same path in pre-image approximation. To update the notation for KSM, for a motion sequence, the training set of RJC pose vectors is denoted as Xtr. The KPCA projection of a novel pose vector x onto the k-th principal axis V_p^k in the pose feature space can be expressed implicitly via the kernel trick as

$$\langle V_p^k \cdot \Phi(x)\rangle = \sum_{i=1}^{N} \alpha_i^k \,\langle \Phi(x_i) \cdot \Phi(x)\rangle = \sum_{i=1}^{N} \alpha_i^k \, k_p(x_i, x), \qquad (4.2)$$

where α refers to the Eigenvectors of the centered RJC kernel matrix (equation 3.16). In this case, the radial basis Gaussian kernel

$$k_p(x_i, x) = \exp\!\left(-\gamma_p \,(x_i - x)^T (x_i - x)\right) \qquad (4.3)$$

is used because of the availability of a well established and tested fixed-point pre-image approximation algorithm [87] (the reason for this selection will become clear later in section 4.2.2). To avoid confusion, the symbol k_p(·,·) will be used explicitly for the pose space kernel and k_s(·,·) for the silhouette kernel [the kernel k_s(·,·) will be discussed in section 4.2.3]. As in section 3.3.2, there are two free parameters of k_p(·,·) in need of tuning, these being γ_p, the Euclidian distance scale factor, and η_p, the optimal number of principal axis projections to retain in the pose feature space. The KPCA projection (onto the first η_p principal axes) of x_i is denoted as v_p^i, where

$$v_p^i = \left[\langle V_p^1 \cdot \Phi(x_i)\rangle, \ldots, \langle V_p^{\eta_p} \cdot \Phi(x_i)\rangle\right]^T, \quad \forall\, v_p \in \mathbb{R}^{\eta_p}. \qquad (4.4)$$

For a novel input pose x, the KPCA projection is simply signified by v_p.

4.2.2 Pose Parameter Tuning via Pre-image Approximation

In order to understand how to optimally tune the KPCA parameters γ_p and η_p for the pose feature subspace Mp, the context of how Mp is applied in KSM must be highlighted (figure 4.1 [bottom right]). Ideally, a tuned system should minimize the inverse mapping error from Mp to the original pose space in normalized RJC format (figure 4.1 [top right]). In the context of KPCA, if we encode each pose as x and its corresponding KPCA projected vector as v_p, the inverse mapping from v_p to x is commonly referred to as the pre-image [87] mapping. Since a novel input will first be mapped from Ms to Mp, the system needs to determine its inverse mapping (pre-image) from Mp to the original pose space. Therefore, for this specific case, the Gaussian kernel parameters γ_p and η_p are tuned using cross-validation to minimize the pre-image reconstruction error (as in equation 3.21). Cross validation prevents over-fitting and ensures that the pre-image mapping generalizes well to unseen poses, which may be projected from the silhouette subspace Ms. An interesting advantage of using pre-image approximation for mapping is its implicit ability to de-noise if the input vector (which is perturbed by noise) lies outside the learnt subspace Mp.
This is relatively the same problem as that encountered by Elgammal and Lee in [37], which they solved by fitting each separate manifold to a spline and rescaling before performing a one dimensional search (for the closest point) on each separate manifold. For Kernel Subspace Mapping, however, it is usual that the noisy input (from Ms ) lies outside the clean subspace, in which case, the corresponding pre-image of v p usually does not exist [91]. In such a scenario, the algorithm will implicitly locate the closest point in the clean subspace corresponding to such an input, without the need to explicitly search for the optimal point. 4.2.3 Learning the Silhouette Subspace This section shows how to optimally learn the silhouette subspace Ms , which is a more complicated problem than learning the pose subspace Mp (section 4.2.1). Efficiently embedding silhouette distance into KPCA is more complex and expensive because the silhouette exists in a much higher dimensional image space. The use of Euclidian distance between vectorized images, which is common but highly inefficient, is therefore avoided. For KPCA, an important factor that must be taken into account when deciding on a silhouette kernel (to define the distance between silhouettes) is if the kernel is positive definite and satisfies Mercer’s condition [87]. In KSM, we use a modified version of the pyramid match kernel [44] (to define silhouette distances), which has already been proven positive definite. The main difference (between the original pyramid match kernel [44] and 55 CHAPTER 4. KERNEL SUBSPACE MAPPING (KSM) the one used in KSM) is that instead of using feature points sampled from the silhouette’s edges (as in [44]), KSM consider all the silhouette foreground pixels as feature points. This has the advantage of skipping the edge segmentation and contour sampling step. During training (figure 4.4), virtual cameras (with the same extrinsic parameters as the real cameras to be used during capture) are set up to capture synthetic training silhouettes. Note that more cameras can be added by concatenating more silhouettes into the image. Figure 4.4: Example of a training pose and its corresponding concatenated synthetic image using 2 synchronized cameras. Each segmented silhouette is first rotated to align the silhouette’s principal axis with the vertical axis before cropping and concatenation (figure 4.4) Each segmented silhouette is normalized by first rotating the silhouette’s principal axis to align with its vertical axis before cropping and concatenation. Thereafter, a simplified recursive multi-resolution approach is applied to encode the concatenated silhouette (figure 4.5). At each resolution (level) the silhouette area ratio is registered in the silhouette descriptor Ψ. 56 CHAPTER 4. KERNEL SUBSPACE MAPPING (KSM) Figure 4.5: Diagram to summarize the silhouette encoding for 2 synchronized cameras. Each concatenated image is preprocessed to Ψ before projection onto Ms . A five level pyramid is implemented, which results in a 341 dimensional silhouette descriptor† . To compare the difference between two concatenated images, their respective silhouette descriptors Ψi and Ψj are compared using the weighted distance D Ψ (Ψi , Ψj ) = F " f =1 1 {|Ψi (f ) − Ψj (f )| − γL(f )+1 }. L(f ) (4.5) The counter L(f ) denotes the current level of the sub-image f in the pyramid, with the smallest sub-images located at the bottom of the pyramid and the original image at the top. 
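A minimal sketch of this encoding, and of a simplified form of the weighted comparison in equation 4.5, is given below. It assumes square concatenated binary images whose side is divisible by 16, and it omits the cumulative correction term γ_{L(f)+1} discussed next, so it is only an approximation of the kernel actually used in KSM; the function names are illustrative, and the final function corresponds to the RBF form that equation 4.6 introduces below.

```python
import numpy as np

def silhouette_descriptor(img, levels=5):
    """Pyramid descriptor Psi: foreground area ratio of every cell on a 1x1, 2x2, 4x4,
    8x8 and 16x16 grid, giving 1 + 4 + 16 + 64 + 256 = 341 values for five levels."""
    img = (img > 0).astype(float)
    H, W = img.shape
    feats, lvls = [], []
    for L in range(1, levels + 1):                 # L = 1 is the whole image (top of pyramid)
        cells = 2 ** (L - 1)
        hs, ws = H // cells, W // cells
        for r in range(cells):
            for c in range(cells):
                feats.append(img[r*hs:(r+1)*hs, c*ws:(c+1)*ws].mean())
                lvls.append(L)
    return np.array(feats), np.array(lvls)

def silhouette_distance(psi_i, psi_j, lvls):
    """Simplified weighted distance of equation 4.5: each level's comparison scaled by 1/L(f)."""
    return float(np.sum(np.abs(psi_i - psi_j) / lvls))

def silhouette_kernel(psi_i, psi_j, lvls, gamma_s):
    """Silhouette kernel in the spirit of equation 4.6: an RBF kernel on the pyramid distance."""
    return float(np.exp(-gamma_s * silhouette_distance(psi_i, psi_j, lvls) ** 2))
```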
In order to minimize segmentation and silhouette noise (located mainly at the top levels), the lower resolution images are biased by scaling each level comparison by 1/L(f). As the encoding process moves downwards from the top of the pyramid to the bottom, it must continually update the cumulative mean area difference γ_{L(f)+1} at each level. This is because at any current level L(f), only the differences in features that have not already been recorded at the levels above it are recorded, hence the subtraction of γ_{L(f)+1}. To embed D_Ψ(Ψi, Ψj) into KPCA, the Euclidian distance in k_p is replaced with the weighted distance, resulting in a silhouette kernel

$$k_s(\Psi_i, \Psi) = \exp\!\left(-\gamma_s \, D_\Psi(\Psi_i, \Psi)^2\right). \qquad (4.6)^\dagger$$

† For a five-level pyramid, the feature vector dimension is determined by the following formula: 1² + 2² + 4² + 8² + 16² = 341.

Using the same implicit technique as that of equation 4.2, KPCA silhouette projection is achieved by using k_s(·,·), and the projection (onto the first η_s principal axes) of Ψi is denoted as

$$v_s^i = \left[\langle V_s^1 \cdot \Phi(\Psi_i)\rangle, \ldots, \langle V_s^{\eta_s} \cdot \Phi(\Psi_i)\rangle\right]^T, \ \forall\, v_s \in \mathbb{R}^{\eta_s}, \quad \text{where} \quad \langle V_s^k \cdot \Phi(\Psi_i)\rangle = \sum_{j=1}^{N} \gamma_j^k \, k_s(\Psi_j, \Psi_i), \qquad (4.7)$$

with γ representing the Eigenvectors of the corresponding centered silhouette kernel matrix.

4.3 Locally Linear Subspace Mapping

Having obtained Ms and Mp, markerless motion capture can now be viewed as the problem of mapping from the silhouette subspace Ms to the pose subspace Mp. Using Ptr to denote the KPCA projected set of training poses (where Ptr = [v_p^1, v_p^2, ..., v_p^M]) and, similarly, letting Str denote the projected set of training silhouettes (where Str = [v_s^1, v_s^2, ..., v_s^M]), the subspace mapping can now be summarized as follows:

• Given Str (the set of training silhouettes projected onto Ms via KPCA using the pyramid kernel) and its corresponding pose subspace training set Ptr in Mp, how do we learn a mapping from Ms to Mp, such that it generalizes well to previously unseen silhouettes projected onto Ms at run-time?

Referring to figure 4.6, the projection of the (novel) input silhouette during capture (onto Ms) is denoted by s_in and the corresponding pose subspace representation (in Mp) is denoted by p_out. The captured (output) pose vector x_out (representing the joint positions of a generic mesh in normalized RJC format) is approximated by determining the pre-image of p_out. The (non-parametric) LLE mapping [85] is used to transfer projected vectors from Ms to Mp. This only requires the first two (efficient) steps of LLE:

1. Neighborhood selection in Ms and Mp.
2. Computation of the weights for neighborhood reconstruction.

Figure 4.6: Markerless motion capture re-expressed as the problem of mapping from the silhouette subspace Ms to the pose subspace Mp. Only the projections of the training sets (Str and Ptr) onto the first 4 principal axes in their respective feature spaces are shown (point intensity is the 4th dimension). Note that the subspaces displayed here are those of a walk motion fully rotated about the vertical axis. Different sets of motion will obviously produce different subspace samples; however, the underlying concept of KSM remains the same.

The goal of LLE mapping for KSM is to map s_in (in Ms) to p_out (in Mp), whilst trying to preserve the local isometry of its nearest K neighbors in both subspaces.
Note that, in this case, we are not trying to learn a complete embedding of the data, which requires the inefficient third step of LLE, which is of complexity O(dN 2 ), where d is the dimension of the embedding and N is the training size. Crucial details regarding the first two steps of LLE, as used in KSM, are now summarized. Figure 4.7: Diagram to summarize the mapping of s in from silhouette subspace Ms to p out in pose subspace Mp via non-parametric LLE mapping [85]. Note that the reduced subsets S lle and P lle are used in mapping novel input silhouette feature vectors. 59 CHAPTER 4. KERNEL SUBSPACE MAPPING (KSM) 4.3.1 Neighborhood Selection for LLE Mapping From initial visual inspection of the two projected training sets S tr and P tr in figure 4.6, there may not appear to be any similarity. Therefore, the selection of the nearest K neighbors (of an input silhouette captured at run-time) by simply using Euclidian distances in Ms is not ideal. This is because the local neighborhood relationships between the training silhouettes themselves do not appear to be preserved between Ms and Mp , let alone trying to preserve the neighborhood relationship for a novel silhouette. For example to understand how the distortion in Ms may have been generated, the silhouette kernel ks is used to embed two different RJC pose subspace vectors p A and p B which have similar silhouettes (figure 4.8). The silhouette kernel ks , in this case, will map both silhouettes to positions close together (represented as sA and sB ) in Ms , even though they are far apart in Mp . Figure 4.8: Diagram to show how two different poses p A and p B may have similar concatenated silhouettes in image space. As a result, an extended neighborhood selection criterion, which takes into account temporal constraints, must be enforced. That is, given the unseen silhouette projected onto Ms , training exemplars that are neighbors in both Ms and Mp must be identified. Euclidian distances between training vectors in S tr and s in can still identify neighbors in Ms . The problem lies in finding local neighbors in P tr , since the system does not yet know the projected output pose vector p out (this is what the KSM is trying to determine 60 CHAPTER 4. KERNEL SUBSPACE MAPPING (KSM) in the first place). However, a rough estimation of p out can be predicted by tracking in the subspace Mp via a predictive tracker, such as the Kalman Filter [53]. From experiments, it was concluded that tracking using linear extrapolation is sufficient in KSM for motion capture. This is because tracking is only used to eliminate potential neighbors which may be close in Ms , but far apart (from the tracked vector) in Mp . Therefore, no accurate Euclidian distance between the training exemplars in P tr and the predicted pose needs to be calculated. In summary, to select the K neighbors needed for LLE mapping, the expected pose is predicted using linear extrapolation in Mp . A subset of training exemplars nearest to the predicted pose is then selected (from P tr ) to form a reduced subset P lle (in Mp ) and S lle (in Ms ). From the reduced subsets‡ , the closest K neighbors to s in (using Euclidian distances in Ms ) can be identified and p out reconstructed from the linked neighbors in Mp . 
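A compact sketch of this temporally constrained selection is given below. It assumes the illustrative names from the earlier KPCA sketch, that training silhouettes and poses are stored row-wise in corresponding order, and that the pose-subspace outputs of the two previous frames are available; the subset fraction follows the 10%–40% range reported in the footnote of the next section.

```python
import numpy as np

def select_lle_neighbours(s_in, S_tr, P_tr, p_prev, p_prev2, subset_frac=0.25, K=8):
    """Neighbourhood selection of section 4.3.1.
    s_in            : projection of the novel silhouette onto M_s;
    S_tr, P_tr      : projected training silhouettes and poses (rows correspond);
    p_prev, p_prev2 : pose-subspace outputs of the two previous frames.
    Returns the indices of the K neighbours used for LLE mapping."""
    p_pred = 2.0 * p_prev - p_prev2                          # linear extrapolation in M_p
    d_pose = np.linalg.norm(P_tr - p_pred, axis=1)
    subset = np.argsort(d_pose)[:max(K, int(subset_frac * len(P_tr)))]   # P_lle / S_lle
    d_sil = np.linalg.norm(S_tr[subset] - s_in, axis=1)      # Euclidian distances in M_s
    return subset[np.argsort(d_sil)[:K]]
```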
4.3.2 LLE Weight Calculation

Based on the filtered neighbor subset S_lle (with K exemplars identified from the previous section), the weight vector w^s is calculated so as to minimize the following reconstruction cost function:

$$\varepsilon(s_{in}) = \left\| s_{in} - \sum_{j=1}^{\kappa} w_j^s \, s_j^{lle} \right\|^2. \qquad (4.8)$$

LLE mapping also enforces that $\sum_{j=1}^{\kappa} w_j^s = 1$. Saul and Roweis [85] showed that w^s can be determined by initially calculating the symmetric (and positive semi-definite) “local” Gram matrix H, where

$$H_{ij} = (s_{in} - s_i^{lle}) \cdot (s_{in} - s_j^{lle}). \qquad (4.9)$$

‡ From experiments, subsets consisting of 10% to 40% of the entire training set are usually sufficient in filtering out silhouette outliers. The subset size also does not affect the output pose (as visually determined by the naked eye) whilst in this range, and is therefore not included as a parameter in the tuning process (section 4.3.3).

Solving for w_j^s from H is a constrained least squares problem, which has the following closed-form solution:

$$w_j^s = \frac{\sum_{k} H_{jk}^{-1}}{\sum_{lm} H_{lm}^{-1}}. \qquad (4.10)$$

To avoid explicit inversion of the Gram matrix H, w_j^s can efficiently be determined by solving the linear system of equations $\sum_{k} H_{jk} w_k^s = 1$ and re-scaling the weights to sum to one [86]. From w^s, the projected pose subspace representation can be calculated as

$$p_{out} = \sum_{j=1}^{\kappa} w_j^s \, p_j^{lle}, \qquad p_{out} \in M_p, \qquad (4.11)$$

where p_j^{lle} is an instance of P_lle (the pose subspace representation of S_lle in Mp). From p_out (which exists in Mp), the captured (pre-image) pose x_out (in the normalized RJC pose space) is approximated via the fixed-point algorithm of Schölkopf et al. [87].

4.3.3 Silhouette Parameter Tuning via LLE Optimization

Similar to Mp, the two free parameters γ_s and η_s (in equation 4.6 and equation 4.7) need to be tuned for the silhouette subspace Ms. In addition, an optimal value for K, the number of neighbors for LLE mapping, must also be selected. Referring back to figure 4.1, the parameters should be tuned to optimize the non-parametric LLE mapping from Ms to Mp. The same concept as in section 4.2.2 is applied, but instead of using pre-image approximations in tuning, the parameters γ_s, η_s and K are tuned to optimize the LLE silhouette-to-pose mapping. Note that due to the use of an unconventional kernel in silhouette KPCA projection, it is unlikely that a good pre-image can be approximated. However, in KSM for motion capture, there is no need to map back to the silhouette input space, hence no requirement to determine the pre-image in silhouette space. The silhouette tuning is achieved by minimizing the LLE reconstruction cost function

$$C_s = \frac{1}{N} \sum_{i=1}^{N} \left\| v_i^p - \sum_{j=1}^{K} w_{ij}^s \, v_j^p \right\|^2, \qquad (4.12)$$

where j indexes the K neighbors of v_i^p. The mapping weight w_{ij}^s, in this case, is the weight factor of v_i^s that can be encoded by v_j^s (as determined by the first two steps of LLE in [85] and the reduced training set from Mp) using the Euclidian distance in Ms. To ensure that the tuned parameters generalize well to unseen novel inputs, training is again performed using cross validation on the training set. Once the optimal parameters are selected for both Ms and Mp, the system is ready for capture by projecting the input silhouettes through the learnt feature subspaces.
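Equations 4.8–4.11 reduce to one small constrained least squares problem per frame. The sketch below follows the H w = 1 formulation of [86]; the small ridge term added to the local Gram matrix is an illustrative choice for numerical stability, not part of the thesis formulation.

```python
import numpy as np

def lle_map(s_in, S_nb, P_nb, reg=1e-3):
    """First two LLE steps used by KSM: reconstruction weights of s_in from its
    neighbours in M_s (eqs. 4.8-4.10), re-used to synthesise p_out in M_p (eq. 4.11).
    S_nb, P_nb : the K selected neighbours in the silhouette and pose subspaces."""
    D = S_nb - s_in
    H = D @ D.T                                    # local Gram matrix H_ij (eq. 4.9)
    H = H + reg * np.trace(H) * np.eye(len(S_nb))  # small ridge term for conditioning
    w = np.linalg.solve(H, np.ones(len(S_nb)))     # solve sum_k H_jk w_k = 1
    w /= w.sum()                                   # re-scale so the weights sum to one
    return w @ P_nb                                # p_out, whose pre-image gives x_out
```

The tuning loop of section 4.3.3 then simply evaluates the cost C_s of equation 4.12 over a grid of (γ_s, η_s, K) values under cross validation and keeps the minimiser.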
KERNEL SUBSPACE MAPPING (KSM) 4.4 Experiments & Results Section 4.4.1 presents experiments which compares the Pyramid Match kernel [44] (adopted in KSM) and the Shape Context [11] (which is a popular choice for silhouette-based markerless motion capture [75, 6]) in defining correspondence between human silhouettes. Motion capture results for Kernel Subspace Mapping (KSM) from real and synthetic data are presented in section 4.4.2 and section 4.4.3 respectively. The system is trained with a generic mesh model (figure 4.4) using motion captured data from the Carnegie Mellon University motion capture database [32]. Note that even though the system is trained with a generic model, the technique is still model free as no prior knowledge nor manual labelling of the test model is required. All concatenated silhouettes are preprocessed and resized to 160 × 160 pixels. Figure 4.9: Illustration to summarize markerless motion capture and the training and testing procedures. The generic mesh model (figure 4.10 [center]) in used in training, whilst a previously unseen mesh model is used for testing. 4.4.1 Pyramid Match Kernel for Motion Capture This section aims to show that the pyramid kernel for human silhouette comparison (section 4.2.3) is as descriptive as the shape context [11] (which has already been shown successful in silhouette-based markerless motion capture techniques such as [6, 75]). A set of synthetic silhouettes of a generic mesh model walking (figure 4.11 [right]) is used to test 64 CHAPTER 4. KERNEL SUBSPACE MAPPING (KSM) Figure 4.10: Selected images of the different models used to test Kernel Subspace Mapping (KSM) for markerless motion capture. [left] Biped model used to create deformable meshes for training and testing. [center] Generic model used to generate synthetic training silhouettes. [right] Example of a previously unseen mesh model, which can be used for synthetic testing purposes (appendix B). the efficacy of the modified pyramid Match kernel (section 4.2.3) on human silhouettes. For comparison, the Shape Context of the silhouettes are also calculated and the silhouette correspondence recorded (figure 4.11 [left]). From the experiments, our pyramid kernel is as expressive as the Shape Context for human silhouette comparison [11, 75] and its cost is (only) linear in the number of silhouette features (i.e. the sets’ cardinality). Figure 4.11: Comparison between the shape context descriptor (341 sample points) and the optimized pyramid match kernel (341 features) for walking human silhouettes. The silhouette distances are given relative to the first silhouette at the bottom left of each silhouette bar. 65 CHAPTER 4. KERNEL SUBSPACE MAPPING (KSM) To further test the kernel, a synthetic sequence of a walk motion fully rotated about the vertical [yaw] axis is used. In this case, the corresponding normalized RJC data (section 3.4.1) is used as the clean data (for comparison). From a markerless motion capture perspective, the goal is to infer the RJC vectors from the noisy input silhouettes. Therefore, a good silhouette descriptor must be able to define shape correspondence that shows a visual similarity with the RJC equivalent. The kernel matrix (using a Gaussian RBF kernel) of the corresponding RJC data (equation 3.19) is shown in figure 4.12 [left]. The kernel matrix generated from the pyramid kernel from the corresponding silhouettes is shown in figure 4.12 [right]. 
The silhouettes are captured with two synchronized virtual cameras (orthogonal to each other) and concatenated into a single feature image before processing. Note that a single synthetic camera can be used. However, due to ambiguities from 3D to 2D projections of silhouettes, it is difficult to generate visually similar kernel matrix (to the RJC matrix) via a single camera, let alone infer accurate pose and yaw direction for motion capture. In figure 4.12, the similarities in the diagonal intensity bands indicates a correlation between the two tuned kernels. Figure 4.12: Intensity images of the normalized RJC pose space kernel (RJC) and the Pyramid Match kernel of silhouettes captured from synchronized cameras(right) of a training walk sequence (fully rotated about the vertical axis). Similarities in the diagonal intensity bands indicates a correlation between the two tuned kernels. 66 CHAPTER 4. KERNEL SUBSPACE MAPPING (KSM) 4.4.2 Quantitative Experiments with Synthetic Data In order to test KSM quantitatively, novel motions (similar to the training set) are used to animate a previously unseen mesh model of the author (appendix B) and their corresponding synthetic silhouettes are captured for use as control test images§ . Using a walking training set of 323 exemplars, KSM is able to infer novel poses at an average speed of ∼0.104 seconds per frame on a PentiumT M 4 with 2.8 GHz processor. The captured pre-image pose is compared with the original pose that was used to generate the synthetic silhouettes (figure 4.13). At this point, it should be highlighted that the test mesh model is different to the training mesh model, and all our test images are from an unseen viewing angle or pose (though relative camera-to-camera positions should be the same as was used in training). Figure 4.13: Visual comparison of the captured pose (red with ‘O’ joints) and the original ground truth pose (blue with ‘*’ joints) used to generate the synthetic test silhouettes. For 1260 unseen test silhouettes of a walking sequence, which is captured from different yaw orientations, KSM (with prior knowledge of the starting pose) can achieve accurate § For motion capture results of synthetic walking sequences via KSM, please refer to the attached file: videos/syntheticMotionCapture.MP4. 67 CHAPTER 4. KERNEL SUBSPACE MAPPING (KSM) pose reconstructions with an average error of 2.78◦ per joint. Figure 4.14: KSM capture error (degrees per joint) for a synthetic walk motion sequence. The entire sequence has a mean error of 2.78◦ per joint. For comparison with other related work [75] the error may reduce down to less than 2◦ of error per each Euler degree of freedom [d.o.f.] (each 3D rotation is represented by a set of 3 Euler rotations¶ .) Agarwal and Triggs in [2] were able to achieve a mean angular error of 4.1◦ per d.o.f., but their approach requires only a single camera. We have intentionally recorded our errors in terms of degrees per joint, and not in Euler degree of freedom (as in [2]), because for each 3D rotation, there are many possible set of Euler rotations encoding. Therefore, using the Euler degree of freedom to encode error can result in scenarios where the same 3D rotation, can be interpreted as different, hence inducing false positives. Our technique (which uses a reduced training set of 343 silhouettes and 2 synchronized cameras) also shows visually comparable results to the technique proposed by Grauman et al in [45], which uses a synthetic training set of 20,000 silhouettes and 4 synchronized cameras. 
In [45], the pose error is recorded using the Euclidian distance between the joint centers in real world scale. We believe that this error measurement is not normalized, in the sense, that, for a similar motion sequence, a taller person (with longer limbs) will more likely record larger average error than a shorter person (due to larger variance for the joint located at the end of each limb). Nevertheless, to enable future comparison, we ¶ We note that the error relationship between a 3D rotation and the set of 3 Euler rotations (encoding a 3D rotation) is complicated. In this case, only an approximation has been made to allow error comparison between other markerless motion capture approaches. 68 CHAPTER 4. KERNEL SUBSPACE MAPPING (KSM) have presented our real world distance results (as in [45]) in combination with the test model’s height. For a test model with a height of 1.66cm, KSM can estimate pose with a mean (real world) error of approximately 2.02 cm per joint (figure 4.15). The technique proposed by Grauman et al in [45] reported an average pose error of 3 cm per joint, when using the full system of 4 cameras. A fairer comparison can be achieved when the two approaches (KSM and [45]) both use 2 synchronized cameras, in which case, the approach in [45] reported an average pose estimation error of over 10 cm per joint. Figure 4.15: KSM capture error (cm/joint) for a synthetic walk motion sequence. The entire sequence has a mean error of 2.02 cm per joint. To further test the robustness of KSM, binary salt & pepper noise with different noise densities are added to the original synthetic data set (as figure 4.14). The noise densities of 0.2, 0.4 and 0.6 were added to the test silhouettes and the following mean errors of 2.99◦ per joint, 4.45◦ per joint and 10.17◦ per joint were attained respectively. At this point, we stress that KSM is now being tested for its ability to handle a combination of silhouette noise (i.e. the test silhouettes are different to that of the training silhouettes) and pixel noise (which is used to simulate scenarios with poor silhouette segmentation). An interesting point to note is that the increase in noise level does not equally affect the inference error for each silhouette, but rather, the noise increases the error significantly for a minority of the poses (i.e. the peaks in figure 4.16). This indicates that the robustness of KSM can be improved substantially by introducing some form of temporal smoothing to minimize these peaks in error. 69 CHAPTER 4. KERNEL SUBSPACE MAPPING (KSM) Figure 4.16: KSM capture error for a walking motion (fully rotates about the vertical axis) with different noise densities (Salt & Pepper Noise). The errors are recorded in degrees per joint. 70 CHAPTER 4. KERNEL SUBSPACE MAPPING (KSM) 4.4.3 Qualitative Experiments with Real Data For testing with real data, silhouettes of a spiral walk sequence were captured using simple background subtraction$ . Two perpendicular cameras were set up (with the same extrinsic parameters as the training cameras) without precise measurements. Due to the simplicity of the setup, segmentation noise (as a result of shadows and varying light) was prominent in our test sequences. Selected results are shown in figure 4.17. Synthetic salt & pepper noise were also added to the concatenated real silhouettes to test the robustness of the system (figure 4.18) The ability to capture motion under these varying conditions demonstrates the robustness of KSM. 
Figure 4.17: Kernel subspace mapping motion capture on real data using 2 synchronized un-calibrated cameras. All captured poses are mapped to a generic mesh model and rendered from the same angles as the cameras. " For motion capture results of real walking sequences (noise and clean silhouettes) via KSM, please refer to the attached file: videos/realMotionCapture.MP4 and videos/realMotionCaptureNoisy.MP4 71 CHAPTER 4. KERNEL SUBSPACE MAPPING (KSM) For animation, captured poses were mapped to a generic mesh model and the technique was able to created smooth realistic animation in most parts of the sequence. In other parts (∼12% of the captured animation’s time frame), the animation was unrealistic as inaccurate captured poses were being appended, the presence of which is further exaggerated in animation. However, even though an incorrect pose may lead to unrealistic animation, it still remains within the subspace of realistic pose when viewed statically (by itself). This is because all output poses are constrained to lie within the subspace spanned by the set of realistic training poses in the first place. Figure 4.18: Selected motion capture results to illustrate the robustness of KSM. The system was able to infer pose (at lower accuracy) in the presence of Gaussian noise in segmented images. The algorithm was able to capture visually accurate pose (∼80% of the test set composing of 500 exemplars) from most concatenated silhouette images - Top two images and bottom left. In some cases KSM output visually incorrect pose vector (bottom right). However, because the pose was generated in the subspace of possible walking motion, even though the pose are visually inaccurate when compared to the silhouettes, it still is a pose of a person in a walking motion. 4.5 Conclusions & Future Directions This chapter introduces Kernel Subspace Mapping (KSM), a novel markerless motion capture technique that can capture 3D human pose using un-calibrated cameras. The result 72 CHAPTER 4. KERNEL SUBSPACE MAPPING (KSM) shows that no detailed measurements, nor initialization of the cameras are required. The main advantages of the approach are its simplicity in setup, its speed and its robustness. Furthermore the technique generalizes well to unseen examples and is able to generate realistic poses that are not in the original database, even when using a single generic model for training. The motion capture system, which is model free, does not require prior knowledge of the actor nor manual labelling of body parts. This makes our un-calibrated system well suited for real-time and low cost human computer interaction (HCI) as no accurate initialization is required. Furthermore, the technique is still able to accurately track and estimate full pose from silhouettes with high level of noise (figure 4.16). To the author’s knowledge, there has not been any previously proposed silhouette-based motion capture approaches, which can robustly handle these levels of noise. From the motion capture results (where KSM has shown the ability to effective infer pose from both synthetic and real silhouettes, and also in the presence of noise), we believe KSM to be a motion capture technique that works very well at estimation 3D pose from binary (human) silhouettes. It is important to elaborate other interesting aspects of KSM, such as the ability for KSM to handle gray scale images (as opposed to only binary images), as well as how KSM can take into account pixel relationships along the image plane. 
By taking into account the relationship of pixels along the image plane, a better ‘distance’ kernel can be defined (for KSM), which allows improvement in the areas of training set reduction and more accurate mapping results. In chapter 5, we investigate this concept on the problem of view-point estimation of 3D rigid objects form gray scale images. There is a good reason why KSM has not been applied to human pose inference from gray scale images. This is because by taking into account the photometric information of an actor at training, the system will only be constrained to the capture of that actor at run-time. Furthermore, during capture, the actor will need to be wearing the same clothing that he/she was wearing at the time the training sequences were capture. The background pattern would also need to remain the same between training and during 73 CHAPTER 4. KERNEL SUBSPACE MAPPING (KSM) motion capture. The limits the ability for KSM to generalize to the capture other unseen actors, as well as preventing KSM from being trained off-site (i.e. KSM cannot be trained in one place and used at a later stage to capture pose in an environment with a different background). Off-site training is possible for silhouettes because normalized silhouettes (i.e. after alignment, cropping and resizing [section 4.2.3]) are relatively similar irrespective of their environment (provided that acceptable silhouette segmentation can be attained). Specifically with the goal of improving the capture rate of KSM in mind, an interesting aspect to further investigate is how the size of the training set affects the computational cost of pose estimation. As highlighted earlier, because the cost of KPCA projection (a component in KSM) is dependant on the cardinality of the training set (section 3.3), it may be possible to further improve KPCA (motion) de-noising rate by using a smaller training set (than the original training set). Furthermore, if it is possible to find a way to select only the best data that represent the de-noising subspace (and discard the remainder), then an increase in capture rate can be achieved, perhaps at the expense of relatively minor decrease in de-noising and (motion) capture quality. Chapter 6 investigates this concept of training set reduction (for human motion de-noising) via the Greedy KPCA algorithm [41]. 74 Chapter 5 Image Euclidian Distance (IMED) embedded KSM∗ The previous chapter shows how Kernel Subspace Mapping (KSM) (chapter 4) can be apply in markerless motion capture to infer high dimensional pose from concatenated binary images. This chapter further confirms the flexibility of KSM by extending its application to the analysis of monocular intensity images (via a 3D object view-point estimation problem [figure 5.1]). The novel application of the Image Euclidian Distance (IMED) [116] in KPCA and KSM is introduced. We show how IMED (which takes into account spatial relationship of pixels and their intensities) allows KSM to accurately estimate viewpoints (less than 4◦ error) via the use a sparse training set (as low as 30 training images). This chapter aims to demonstrate that KSM embedded with the Image Euclidian Distance (IMED) [116] is a more accurate technique (and one with a sparser solution set) than KSM (using vectorized Euclidian distance [i.e. raster scan an image into a vector and calculate the standard Euclidian distance]) and other previously proposed approaches in 3D viewpoint estimation [80, 119, 47, 78, 80]. 
∗ This chapter is based on the conference paper [106] T. Tangkuampien and D. Suter: 3D Object Pose Inference via Kernel Principal Component Analysis with Image Euclidian Distance (IMED): British Machine Vision Conference (BMVC) 2006, pages 137–146, Edinburgh, UK. 75 CHAPTER 5. IMAGE EUCLIDIAN DISTANCE (IMED) EMBEDDED KSM 5.1 Introduction Kernel techniques such as Kernel Principal Component Analysis (KPCA) [87] and Support Vector Machines (SVM) [114] have both been shown to be powerful non-linear techniques in the analysis of pixel patterns in images [90, 91, 40]. In supervised SVM image classification [39], images are usually vectorized before embedding (via a kernel function) for discrete classification in high dimensional feature space. In unsupervised KPCA [90, 87], vectorized images are also embedded to a high dimensional feature space, but instead of determining optimal separating hyper planes, linear Principal Component Analysis (PCA) is performed. Both KPCA and SVM take advantage of the kernel trick (in order to avoid explicitly mapping input vectors), which involves defining the ‘distance’ between two vectors. In both cases, the traditional vectorized Euclidian distance is used for embedding. Each pixel, in this case, is usually considered as an independent dimension, and therefore these approaches do not take into account the spatial relationship of nearby pixels along the image plane. This chapter presents results indicating that the Image Euclidian Distance (IMED) [116] (which takes into account spatial pixel relation on the image plane) is a better distance criterion (than vectorized Euclidian distance), especially for 3D object pose estimation (figure 5.1). IMED can be embedded into KPCA via the use of the Standardizing Transform (ST) [116], and it can also be implemented efficiently via a combination of the Kronecker product and Eigenvector projections [60]. A major significance of ST, is that it can alternatively be viewed as a pre-processing transformation (on the pixel intensities), after which traditional vectorized Euclidian distance can be applied [116]. Effectively, this means that all desirable properties (such as positive definiteness, non-linear de-noising [91] and pre-image approximation using gradient descent [87]) of using traditional Euclidian distances in KPCA still applies. A minor disadvantage of ST is that it can be expensive in its memory consumption. For an image of size M ×N , the full ST matrix needs to be of size MN ×MN . Fortunately, for the case of Gaussian kernel embedding, this matrix is separable [116] and can be stored as the Kronecker product [60] of two reduced matrices 76 CHAPTER 5. IMAGE EUCLIDIAN DISTANCE (IMED) EMBEDDED KSM of size M ×M and N ×N . A significant contribution of this chapter is the demonstration of a practically viable embedding of IMED (with Gaussian kernel) into Kernel Principal Components Analysis (KPCA) [90] and Kernel Subspace Mapping (KSM) (section 4). By separating the originally proposed Standardizing Transform (ST) into the Kronecker product of two identical reduced transform [116], this chapter shows how IMED is a better image ‘distance’ criterion than traditional vectorized Euclidian distance. To support this, results using IMED embedded KSM for 3D object pose estimation (compared to [80]), are presented in section 5.5. 
5.2 Problem Definition and Related Work The problem of 3D object pose estimation using machine learning can be summarized as follows: • Given images of the object, from different known orientations, as the training set, how can a system optimally learn a mapping to accurately determine the orientation of unseen images of the same object from a ‘static’ camera? Alternatively, instead of calculating the object’s orientation, the system could determine the orientation and position of a moving camera that the ‘static’ object is viewed from. To test the system’s robustness, the unseen input images can be deliberately corrupted with noise, in which case, image de-nosing techniques, such as KPCA, can be applied. Zhao et al [119] used KPCA with a neural network architecture to determine the orientation of objects from the Columbia Object Image Library (COIL-20) database [78]. In that case, however, the pose is only restricted to rotations about a single vertical axis. This chapter considers the more complex pose estimation problem, that of determining the camera position where an object is viewed from anywhere on the upper hemisphere of an object’s viewing sphere. 77 CHAPTER 5. IMAGE EUCLIDIAN DISTANCE (IMED) EMBEDDED KSM Figure 5.1: Diagram to summarize the 3D pose estimation problem of an object viewed from the upper viewing hemisphere. Peters et al [80] introduced the notion of ‘view-bubbles’, which are area views where the object remains visually the same (up to within a pre-defined criterion). The problem of hemispherical object pose estimation can then be re-formulated as the selection of the correct view bubble and interpolating from within the bubble. A disadvantage of this technique, however, is the need for a large and well-sampled training set (up to 2500 images in [80]), to select the view bubble training images from. In this chapter, IMED embedded KSM presents comparable results using only 30 randomly selected training images of the same object (i.e. we do not select the best 30 training images by scanning through all 2500 images† ). 5.3 Image Euclidian Distance (IMED) For notation purposes, a summary of the Image Euclidian Distance as introduced by Wang et al [116] is now presented. Each M ×N input image must be first vectorized to x, where x ∈ RM N . The intensity at the (m, n) pixel is then represented as the (mN +n)-th dimension in x. The standard vectorized Euclidian distance dE (x , y ) between vectorized † The data set used in this chapter to test IMED embedded KSM (for 3D viewpoint estimation) is the same set described in [81]. Each data set consists of 2500 images of a 3D object taken at a regular interval from the top hemisphere of the object’s viewing sphere. 78 CHAPTER 5. IMAGE EUCLIDIAN DISTANCE (IMED) EMBEDDED KSM images x and y is then defined as d2E (x , y ) = M N " k=1 (xk − y k )2 (5.1) = (x − y)T (x − y). Alternatively, the IMED between images, introduces the notion of a metric matrix G of size M N ×M N , where the element gij represents how the dimension x i affects the dimension x j . To avoid confusion, it is important to note that i, j, k ∈ [1, M N ], whereas m ∈ [1, M ] and n ∈ [1, N ]. Provided that the metric matrix G is known, the Image Euclidian Distance can then be calculated as d2IM (x , y ) = M N " i,j=1 gij (x i − y i )(x j − y j ) (5.2) = (x − y)T G(x − y ). The metric matrix G solely defines how the IMED deviates from the standard (vectorized) Euclidian distance (which is induced when G is replaced by the identity matrix). 
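As a small illustration (with a hypothetical metric matrix G supplied by the caller), equation 5.2 is simply a quadratic form on the pixel difference vector:

```python
import numpy as np

def imed_squared(x, y, G):
    """Squared Image Euclidian Distance of equation 5.2 for vectorised images x and y.
    With G set to the identity matrix this reduces to the standard squared
    Euclidian distance of equation 5.1."""
    d = x - y
    return float(d @ G @ d)
```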
As highlighted in [116], the main constraints for IMED are that the element gij is dependent on the distance between pixels Pi and Pj , that is gij = f (|Pi −Pj |), and that gij monotonically decreases as |Pi − Pj | increases. A constraint on f is that it must be a continuous positive definite function [116], thereby ensuring that G is positive definite and excluding the tradition Euclidian distance as a subset of IMED‡ . To avoid confusion, the reader should note that f refers to a continuous positive definite function encoding pixel spatial relationship along the image place. f should not be confused with the positive definite kernel function as required in KPCA (section 3.3). Note that at this current stage, the calculation of IMED is not memory efficient, due to the need to store the metric G, which is of size M N ×M N . Section 5.3.2 shows how to improve on this, but first, a summary of the Standardizing Transform (ST) [116] (which allows IMED to be embedded into more powerful learning algorithms such as SVMs, KPCA and KSM) will be summarized. ‡ Note that even though the typical vectorized euclidian distance can be induced by replacing G with the identity matrix, the Euclidian distance is not a subset of IMED as it is not continuous across the image plane. 79 CHAPTER 5. IMAGE EUCLIDIAN DISTANCE (IMED) EMBEDDED KSM 5.3.1 Standardizing Transform (ST) As suggested in [116], the calculation of IMED can be simplified by decomposing G to AT A, leading to d2IM (x , y ) = (x − y)T AT A(x − y ) (5.3) = (u − v ) (u − v ), T where u = Ax and v = Ay. The Standardizing Transform (ST) is then merely a special case of A, where G = AT A 1 (5.4) 1 = G2 G2 . 1 This reveals symmetric, positive definite and unique solutions for both G and G 2 : (5.5) G = ΓΛΓT , 1 1 G 2 = ΓΛ 2 ΓT , (5.6) where Γ is the orthogonal column matrix of the Eigenvectors of G, and Λ the diagonal matrix of the corresponding Eigenvalues. The Standardizing Transform (ST) is then the 1 transformation G 2 (•). For example, to embed IMED into learning algorithms based on 1 standard Euclidian norm, the technique simply applies the transformation G 2 (•) to the vectorized images x and y , in order to obtain u and v respectively. From (5.3), the IMED is then simply the traditional Euclidian distance between the transformed vectors u and v . 5.3.2 Kronecker Product IMED Up until now, only the constraints of IMED and its corresponding Standardizing Transform 1 G 2 (•) have been summarized, but not how to construct the matrix G itself. This section reveals how to construct G efficiently using the Kronecker Product (as suggested in [116]) 80 CHAPTER 5. IMAGE EUCLIDIAN DISTANCE (IMED) EMBEDDED KSM and concentrating specifically on the Gaussian function, where gij = 1 −|Pi − Pj |2 exp{ }. 2πσ 2 2σ 2 (5.7) In this case, σ controls the spread of the Gaussian signal, which acts like a bandlimited filter. For large values of σ, the influence of, let’s say Pi to its neighboring pixel Pj on the image plane, is more significant (compared to when σ is small). Therefore, G acts like a low pass filter, since the signal induced is smoother and flatter and can be constructed from low frequency signals. As σ decreases, the Gaussian signal induced becomes steeper and thinner, reducing its influence on neighboring pixels, and requiring higher frequency signals for reconstruction in the Fourier domain. For the extreme case, where σ→0, the dirac delta signal is induced, which requires infinite bandwidth, leading to the traditional Euclidian distance. 
Figure 5.2: Images of the Standardizing Transform G^(1/2) with different σ values. Note how the images tend to the diagonal matrix as σ→0.

By letting P_i and P_j be the pixels at locations (m_i, n_i) and (m_j, n_j) respectively, the corresponding squared pixel distance between the two pixels on the image plane is then given by

$$|P_i - P_j|^2 = (m_i - m_j)^2 + (n_i - n_j)^2. \qquad (5.8)$$

As highlighted in [116], by substituting the above equation into equation 5.7, the separable and reducible metric representation can be obtained:

$$g_{ij} = \frac{1}{2\pi\sigma^2}\exp\!\left\{\frac{-\left[(m_i - m_j)^2 + (n_i - n_j)^2\right]}{2\sigma^2}\right\} = \frac{1}{2\pi\sigma^2}\left[\exp\!\left\{\frac{-(m_i - m_j)^2}{2\sigma^2}\right\}\cdot\exp\!\left\{\frac{-(n_i - n_j)^2}{2\sigma^2}\right\}\right]. \qquad (5.9)$$

As previously mentioned, m ∈ [1, M] and n ∈ [1, N], hence leading to a re-formulation of equation 5.9 as the Kronecker product [60] of two smaller matrices Ψ_M and Ψ_N of size M×M and N×N respectively, where

$$\Psi_M(m_i, m_j) = \frac{1}{2\pi\sigma^2}\exp\!\left\{\frac{-(m_i - m_j)^2}{2\sigma^2}\right\}, \qquad \Psi_N(n_i, n_j) = \frac{1}{2\pi\sigma^2}\exp\!\left\{\frac{-(n_i - n_j)^2}{2\sigma^2}\right\}. \qquad (5.10)$$

This leads to a simplified and memory efficient version of the metric matrix [116], where

$$G = \Psi_M \otimes \Psi_N. \qquad (5.11)$$

For the case of square images, where M = N, Ψ_M and Ψ_N are identical (henceforth both Ψ_M and Ψ_N will be denoted as Ψ) and only one matrix copy of size M×N is required§. Hence, Ψ can be decomposed into its corresponding matrix of Eigenvectors Ω and diagonal Eigenvalues Θ, where

$$\Psi = \Omega\Theta\Omega^T. \qquad (5.12)$$

§ For the remainder of this chapter, only square images will be considered, as this can effectively halve the current memory consumption. Note that this does not exclude all non-square images, as they can simply be pre-processed to square images before calculating IMED.

Using established properties of the Kronecker product [60] compatible with standard matrix multiplication, the metric matrix becomes

$$G = (\Omega\Theta\Omega^T) \otimes (\Omega\Theta\Omega^T) = (\Omega \otimes \Omega)(\Theta \otimes \Theta)(\Omega \otimes \Omega)^T = \Gamma\Lambda\Gamma^T. \qquad (5.13)$$

From this and equation 5.6, the Standardizing Transform G^(1/2)(•) can be derived as

$$G^{\frac{1}{2}} = \Gamma\Lambda^{\frac{1}{2}}\Gamma^T = (\Omega \otimes \Omega)(\Theta \otimes \Theta)^{\frac{1}{2}}(\Omega \otimes \Omega)^T. \qquad (5.14)$$

Furthermore, the reduced Eigenvalue matrix Θ is diagonal and can be diagonally vectorized as Θ̃ (i.e. take only the diagonal values), where Θ̃_m represents Θ(m, m). The corresponding Kronecker product of the Eigenvalue matrices (Θ ⊗ Θ)^(1/2) can efficiently be represented as the outer product

$$(\Theta \otimes \Theta)^{\frac{1}{2}} = \left(R[\tilde{\Theta}\tilde{\Theta}^T]\right)^{\frac{1}{2}}, \qquad (5.15)$$

where R is the vectorization function followed by the diagonalization function.

Further improvements in the calculation of the Standardizing Transform are possible by realizing that G^(1/2) is a series of linear Eigenvector projections and scalings [116]. Effectively, this means that a specified minimum Eigenvalue scaling threshold can be pre-selected, and the Eigenvalue projection ignored for Eigenvalues below the selected threshold. Letting τ be the minimum Eigenvalue scaling and Θ̃ the Eigenvalues re-ordered in descending order, the reduced Eigenvector column matrix can be rewritten as

$$\Omega = [\,J \,|\, K\,], \qquad (5.16)$$

where J represents the column matrix of the J Eigenvectors corresponding to the Eigenvalues larger than τ, and K the remaining Eigenvectors. The approximate Standardizing Transform then becomes

$$\tilde{G}_\tau^{\frac{1}{2}} = (J \otimes J)\left(R[\tilde{\Theta}_{(1:J)}\tilde{\Theta}_{(1:J)}^T]\right)^{\frac{1}{2}}(J \otimes J)^T, \qquad (5.17)$$

where J is only of dimension M×J (with J < M), and Θ̃_(1:J) is a one-dimensional vector of the highest J Eigenvalues.
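A minimal NumPy sketch of this construction for square M×M images is given below. It combines the eigenvalue truncation of equation 5.17 with the fully separable form derived in equation 5.18 in the next paragraph, so that G^(1/2) is assembled from a single M×M factor Ψ; the function names and the threshold value are illustrative choices.

```python
import numpy as np

def gaussian_metric_1d(M, sigma):
    """Reduced M x M factor Psi of equation 5.10 (Gaussian fall-off with pixel distance)."""
    idx = np.arange(M)
    d2 = (idx[:, None] - idx[None, :]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2)) / (2.0 * np.pi * sigma ** 2)

def standardizing_transform(M, sigma, tau=1e-8):
    """Approximate Standardizing Transform G^(1/2) for M x M images:
    eigen-decompose Psi, drop eigenvalues below tau (cf. eq. 5.17), then form the
    Kronecker product Psi^(1/2) kron Psi^(1/2) (cf. eq. 5.18)."""
    Psi = gaussian_metric_1d(M, sigma)
    lam, V = np.linalg.eigh(Psi)
    keep = lam > tau
    lam, V = lam[keep], V[:, keep]
    Psi_half = (V * np.sqrt(lam)) @ V.T
    return np.kron(Psi_half, Psi_half)             # G^(1/2), of size M^2 x M^2
```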
Therefore, for high resolution images where M is large, it is possible to use the approximate Standardizing Transform G̃_τ^(1/2) instead of G^(1/2). On the other hand, if the image resolution is low and speed is a factor, it is possible to construct the full G^(1/2) matrix using a single Kronecker product. From equation 5.14, the following can be derived:

$$G^{\frac{1}{2}} = (\Omega \otimes \Omega)(\Theta \otimes \Theta)^{\frac{1}{2}}(\Omega \otimes \Omega)^T = \Omega\Theta^{\frac{1}{2}}\Omega^T \otimes \Omega\Theta^{\frac{1}{2}}\Omega^T = \Psi^{\frac{1}{2}} \otimes \Psi^{\frac{1}{2}}, \qquad (5.18)$$

with the constraint $\Psi = (\Psi^{\frac{1}{2}})^T\Psi^{\frac{1}{2}} = \Psi^{\frac{1}{2}}\Psi^{\frac{1}{2}}$.

Visually, G^(1/2) is a transform domain smoothing [116], where the Eigenvectors with the higher Eigenvalues represent the low frequency basis signals. Hence, by leaving out the Eigenvectors represented by K, the higher frequency signals are suppressed. Referring back to figure 5.3 and equation 5.7, the larger the value of σ, the smoother the Standardizing Transform image will be, and therefore the higher the Eigenvalues corresponding to the low frequency basis. The sharpest that the image can ever be (including noise) after the Standardizing Transformation is when σ→0, where only the original image itself remains.

Figure 5.3: Images of a 3D object after applying the Standardizing Transform G^(1/2) with different σ values.

5.4 IMED embedded Kernel PCA

Since KSM is based on KPCA, a factor to initially consider is the embedding of IMED into unsupervised Kernel Principal Components Analysis (KPCA) [90]. A summary of KPCA is presented in section 3.3. KSM will be used in the 3D view-point estimation problem (which is summarized in section 5.2). To embed IMED into KPCA, the Standardizing Transform G^(1/2) is initially applied to the vectorized training set X, to get the standardized training set U, where

$$U = G^{\frac{1}{2}}X. \qquad (5.19)$$

KPCA is then performed using the standard RBF kernel, which leads to the overall effect of

$$k_{im}(x_i, x_j) = \exp\!\left(-\gamma\,(x_i - x_j)^T G\,(x_i - x_j)\right) = \exp\!\left(-\gamma\,(u_i - u_j)^T(u_i - u_j)\right) = k_v(u_i, u_j), \qquad (5.20)$$

which we refer to as IMED embedded KPCA. Considering the relationship in equation 5.20, we are assured that all desirable properties of the standard RBF kernel, such as positive definiteness and gradient descent pre-image approximations, are held by IMED. The free parameters are then tuned on U and the IMED embedded KPCA projection becomes

$$\langle V_{im}^k \cdot \Phi(x)\rangle = \sum_{i=1}^{N}\alpha_i^k\, k_{im}(x_i, x) = \sum_{i=1}^{N}\alpha_i^k\, k_v(u_i, u). \qquad (5.21)$$

The pre-image approximation using IMED embedded KPCA (trained using the set U) is then straightforward. For example, the fixed-point algorithm of [87] can be used to determine the standardized pre-image u_p, and the inverse Standardizing Transform (G^(1/2))^(−1) applied to obtain the vectorized pre-image x_p, where

$$x_p = \left(G^{\frac{1}{2}}\right)^{-1} u_p. \qquad (5.22)$$

The inverse transform (G^(1/2))^(−1) can be efficiently computed (without storing an explicit version of the inverse matrix) by realizing that it is also a projection onto the same Eigenvectors of G^(1/2), but with the inverse scaling (by the corresponding Eigenvalues) applied:

$$\left(G^{\frac{1}{2}}\right)^{-1} = \Gamma\left(\Lambda^{\frac{1}{2}}\right)^{-1}\Gamma^T = \Gamma\left(R^{-1}[\tilde{\Theta}\tilde{\Theta}^T]\right)\Gamma^T, \qquad (5.23)$$

where R^(−1)[Θ̃Θ̃^T] is defined as the vectorization function of the inverse scaling of each element in Θ̃Θ̃^T, followed by the diagonalization function. It is important to note that the Standardizing Transform matrix G^(1/2) may have Eigenvalues Θ̃_m = 0; G^(1/2) is then singular, and the inverse (G^(1/2))^(−1) does not exist [116]. However, (G̃^(1/2))^(−1) can be approximated using the same concept as in equation 5.17, ignoring the Eigenvectors whose corresponding Eigenvalues are close to zero.

5.4.1 IMED embedded Kernel Subspace Mapping

Recall the basic structure and properties of KSM: Kernel Subspace Mapping can be used to map data between two high dimensional spaces (chapter 4). KPCA is initialized to learn the subspaces of the high dimensional data, for, let us say, the input set X and the output set Y, and LLE non-parametric mapping [85] is used to map between the two feature subspaces. Given novel inputs, the data is projected through the two subspaces and the pre-image calculated to determine the input's representation in the output space. If any of the training vectors are derived from images, then IMED embedded KPCA can be applied. For generality, this section will consider both cases, where the input set X is constructed from images of a 3D object, and the output set Y consists of three dimensional camera positions on the upper hemisphere of the object's viewing sphere. Note that this is a simple case (as the output space is only 3 dimensional) for KSM, which can map to higher dimensional output spaces as in human markerless motion capture (chapter 4). KSM is initialized by learning the subspace representations of X and Y using IMED embedded KPCA and standard (RBF kernel) KPCA respectively. From these, the KPCA projected training sets V_im^X and V_v^Y are obtained, where the corresponding instances in the sets have the following forms:

Figure 5.4: Diagram to summarize Kernel Subspace Mapping for 3D object pose estimation. Selected members of the image training set X are highlighted with black bounding boxes, whereas the output training set Y is represented as blue circles on the viewing hemisphere.

$$v_i^X = \left[\langle V_{im}^1 \cdot \Phi(x_i)\rangle, \ldots, \langle V_{im}^{\eta} \cdot \Phi(x_i)\rangle\right]^T, \ \forall\, v^X \in \mathbb{R}^{\eta}, \qquad v_i^Y = \left[\langle V_v^1 \cdot \Phi(y_i)\rangle, \ldots, \langle V_v^{\psi} \cdot \Phi(y_i)\rangle\right]^T, \ \forall\, v^Y \in \mathbb{R}^{\psi}, \qquad (5.24)$$

where η and ψ are the training sets' corresponding tuned projected feature dimensions (using the tuning techniques in sections 4.2.2 and 4.3.3). During run time, given a new unseen vectorized image x of the same 3D object, projection via IMED embedded KPCA can be used to obtain v^X. The projected vector is then mapped between the two feature subspaces using the LLE neighborhood reconstruction weights w, which are calculated by minimizing the reconstruction cost function (as shown in [85] and section 4.3)

$$\varepsilon(v^X) = \left\| v^X - \sum_{i\in I} w_i\, v_i^X \right\|^2, \qquad (5.25)$$

where I is the set of neighborhood indices of the projected vector v^X in the tuned feature space. This leads to the mapping

$$v^Y = \sum_{i\in I} w_i\, v_i^Y, \qquad (5.26)$$

from which the corresponding output hemispherical position y_p (which is the pre-image of v^Y) can be determined. Note that there is now another free parameter to tune, this being card(I), the cardinality of the index set I. For KSM, this is tuned using cross validation as in section 4.2.2, but instead of minimizing the pre-image reconstruction, the mapping error is minimized (section 4.3.3).

5.5 Experiments & Results

To test the efficacy of IMED embedded KSM on 3D object viewpoint estimation, training images of the object ‘Tom’ (figure 5.5 [left]) are selected from the data set described in [81].
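Before the results, the inference path being evaluated can be summarised in a short sketch that reuses the earlier illustrative functions (train_kpca, kpca_project, preimage_fixed_point, lle_map and standardizing_transform, all assumed to be in scope); it is a schematic of sections 5.3–5.4 under those assumptions, not the thesis implementation.

```python
import numpy as np

def train_imed_ksm(X_images, Y_views, sigma, gamma_im, gamma_v, eta, psi):
    """Learn the two feature subspaces: IMED-embedded KPCA on the standardised images
    (eqs. 5.19-5.21) and standard RBF KPCA on the hemispherical view positions."""
    M = int(np.sqrt(X_images.shape[1]))            # square vectorised images assumed
    G_half = standardizing_transform(M, sigma)
    U = X_images @ G_half                          # U = G^(1/2) X (G_half is symmetric)
    img_model = train_kpca(U, gamma_im, eta)
    view_model = train_kpca(Y_views, gamma_v, psi)
    V_X = np.vstack([kpca_project(img_model, u) for u in U])
    V_Y = np.vstack([kpca_project(view_model, y) for y in Y_views])
    return {'G_half': G_half, 'img': img_model, 'view': view_model,
            'V_X': V_X, 'V_Y': V_Y, 'Y': Y_views}

def estimate_viewpoint(ksm, x_image, K=8):
    """Project a novel image through both subspaces and recover the camera position."""
    u = ksm['G_half'] @ x_image                    # standardise the novel image
    v_X = kpca_project(ksm['img'], u)
    nb = np.argsort(np.linalg.norm(ksm['V_X'] - v_X, axis=1))[:K]   # card(I) = K neighbours
    v_Y = lle_map(v_X, ksm['V_X'][nb], ksm['V_Y'][nb])              # eqs. 5.25-5.26
    y0 = ksm['Y'][nb[0]]                                            # start from nearest exemplar
    return preimage_fixed_point(ksm['view'], v_Y, y0)               # hemispherical position y_p
```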
From the original image set of 2500 images, reduced sets of 30, 60, 90 and 145 training images are randomly selected. The training size of 30 and 145 were selected to match the experiments in [80]¶ . A test set of 600 ‘unseen’ images are then filtered from the remaining images at regular intervals, such as to maximize the distribution around the viewing hemisphere$ . Table 5.1: 3D pose inference comparison using the mean angular error for 600 ‘unseen’ views of the object ‘Tom’. Training size 30 60 120 145 Clean data set KSM via standard KPCA IMED embedded KSM 6.68◦ 3.38◦ 2.97◦ 1.92◦ 1.41◦ 0.89◦ 1.19◦ 0.76◦ Gaussian noise set KSM via standard KPCA IMED embedded KSM 8.63◦ 3.44◦ 5.88◦ 2.05◦ 3.22◦ 1.01◦ 3.01◦ 0.89◦ Salt/Pepper noisy set KSM via standard KPCA IMED embedded KSM 10.68◦ 4.58◦ 9.40◦ 2.40◦ 5.14◦ 1.51◦ 4.56◦ 1.34◦ ¶ In [80] - figure 1, the parameters are given in terms of the ‘tracking threshold’, and their corresponding view bubbles, each consisting of 5 training images. For view bubble count of 6 and 29 this corresponds to 30 and 145 training images respectively. " For 3D object pose estimation results, please refer to the attached file: videos/[IMEDposeEst.mp4, IMEDposeEstNoisy.mp4, IMEDposeEstNoisy2.mp4, IMEDposeEstNoisySP.mp4]. 88 CHAPTER 5. IMAGE EUCLIDIAN DISTANCE (IMED) EMBEDDED KSM Figure 5.5: Selected images of the object ‘Tom’ [left] and object ‘Dwarf’ [right] used in the 3D pose estimation problem. The images consist of the original clean image (1st & 4th image), image corrupted with Gaussian noise (2nd & 5th image) and Salt/Pepper noise (3rd & 6th image). IMED embedded KSM is tested with clean images as well as with images corrupted with zero mean Gaussian noise (variance of 0.01) and salt/pepper noise (noise density of 0.05) as shown in figure 5.5. Table 5.1 shows that in all cases, IMED embedded KSM provides more accurate viewpoint estimation (than standard KSM) from both clean and noisy images. Furthermore, IMED embedded KSM is also more robust to noise and it shows the least percentage increase in error for both Gaussian and salt/pepper noise. As expected, with any exemplar based learning algorithm, the mean error gradually decreases as more training images are included in the set. For completeness, another set of pose inference results for the same tuned parameters as in table 5.1 is presented. In this case, however, 400 test images are selected at random from the original hemispherical data set (2500 images) and the selected images are not constrained to be novel (i.e. the test images may also include images that were used in training). This is done because it gives a more realistic representation of the input data in practice, where there is no exclusion of the training set, let alone any idea of which images were used in training. As expected, IMED embedded KSM performs even better in this case (table 5.2), as KSM will simply project any training images to the correct position in the feature space, and this has zero error to begin with. Results from [80] are also included for comparison in table 5.2 even though only 30 test images were used∗∗ . It must be noted that the accuracy of the pose inference is dependent on the complexity of the object. For example, it is not possible to determine the hemispherical pose using ∗∗ Note that the test images in [80] have not been considered as novel/unseen. This is because a sparse representation of the object is built from the original set of 2500 images using a greedy approach, which is then used for pose estimation. 
Since the 30 test images are also derived from the original 2500 images, they cannot be considered novel. 89 CHAPTER 5. IMAGE EUCLIDIAN DISTANCE (IMED) EMBEDDED KSM Table 5.2: 3D pose inference comparison using the mean angular error for 400 randomly selected views of the object ‘Tom’. Training size 30 Clean data set View Bubble Method [80] KSM via standard KPCA IMED embedded KSM 36.51◦ 4.44◦ 3.22◦ Gaussian noise set KSM via standard KPCA IMED embedded KSM Salt/Pepper noisy set KSM via standard KPCA IMED embedded KSM 60 120 145 2.32◦ 1.15◦ 0.91◦ 0.65◦ 0.77◦ 0.84◦ 0.59◦ 6.28◦ 3.29◦ 5.13◦ 1.26◦ 2.59◦ 0.77◦ 2.72◦ 0.69◦ 9.47◦ 4.51◦ 8.64◦ 1.75◦ 4.56◦ 1.19◦ 4.38◦ 1.04◦ a sphere with uniform colour (i.e. all images of the sphere will look the same from any viewpoint). To test the efficacy of IMED embedded KPCA on a more symmetrical object, the object ‘Dwarf’ (which also appears in [80]) is used. All experiments are performed in exactly the same was as before, and results summarized in table 5.3 & table 5.4. The training size of 30, 60, 120 and 145 are also used to allow comparison (with object ‘Tom’), in which case table 5.3 should be compared with table 5.1, and table 5.4 with table 5.2. For cross comparison with [80], IMED embedded KSM is able to achieve more accurate results for a training set of only 30 images (3.26◦ ) as compared to using the ‘view bubble’ sparse set of 130 images (4.2◦ ). A surprising result is that it is possible to achieve relatively the same level of accuracy (even higher in some cases) for the ‘Dwarf’ object (as compared to object ‘Tom’). This was not the case using the view bubble method in [80]. 5.6 Discussions and Concluding Remarks For the case of IMED embedded KSM, pose inference results with mean errors of 3.22◦ were achieved, whilst using only a training set of 30 images. This is quite accurate, as it is unlikely that a human would be able to achieve the same level of accuracy [80]. Our technique is also robust to noise (especially Gaussian noise) and, in such a case, only shows minor percentage increase in mean angular error. Furthermore, this was achieved 90 CHAPTER 5. IMAGE EUCLIDIAN DISTANCE (IMED) EMBEDDED KSM Table 5.3: 3D pose inference comparison using the mean angular error for 600 ‘unseen’ views of the object ‘Dwarf’. Training size 30 60 120 145 Clean data set KSM via standard KPCA IMED embedded KSM 8.27◦ 3.99◦ 2.98◦ 1.67◦ 1.30◦ 0.91◦ 1.15◦ 0.81◦ Gaussian noise set KSM via standard KPCA IMED embedded KSM 12.01◦ 4.04◦ 5.22◦ 1.73◦ 3.09◦ 0.99◦ 2.60◦ 0.87◦ Salt/Pepper noisy set KSM via standard KPCA IMED embedded KSM 16.51◦ 4.46◦ 8.33◦ 1.90◦ 6.04◦ 1.10◦ 5.07◦ 1.07◦ without explicit knowledge of the full training set of size 2500 images as is the case with [80]. This is because a greedy approach is not used to filter out (from the original set of 2500 images) training data, in order to build a sparse representation; but simply, training images are randomly selected from it. The constraint, in this case, being that the training set is relatively evenly spread over the probability distribution of the data (as opposed to spatial configuration of the data). The test images used are also novel (table 5.1 and 5.3) and are regularly spread over the entire hemisphere, whereas in [80], only 30 test images were used from three different patches of the viewing hemisphere. 
Results with more than 145 training images are not presented because for industrial applications, it becomes impractical to capture such a large training set, and in such a case, the hemispherical data will be so dense that linear interpolation will be as accurate. This chapter presents a technique that potentially allows automatic parameter selection (i.e. without human intervention). This leads to the ability to tune KSM for a novel 3D object by simply placing it on a rotating table (which rotates about the vertical axis). For the other degree of freedom, a synchronized camera that moves up and down along the longitudinal arc of the hemisphere can be installed (which will automatically capture images of the object from different orientations). From the captured images and corresponding orientation, the system can automatically tune the parameters, which can then 91 CHAPTER 5. IMAGE EUCLIDIAN DISTANCE (IMED) EMBEDDED KSM Table 5.4: 3D pose inference comparison using the mean angular error for 400 randomly selected views of the object ‘Dwarf’. Training size 30 60 120 145 Clean data set KSM via standard KPCA IMED embedded KSM 9.40◦ 3.26◦ 3.14◦ 1.49◦ 1.31◦ 0.77◦ 1.23◦ 0.68◦ Gaussian noise set KSM via standard KPCA IMED embedded KSM 12.72◦ 3.56◦ 5.60◦ 1.55◦ 2.81◦ 0.86◦ 2.46◦ 0.75◦ Salt/Pepper noisy set KSM via standard KPCA IMED embedded KSM 14.64◦ 4.80◦ 8.57◦ 1.83◦ 5.40◦ 1.14◦ 4.92◦ 1.05◦ be loaded into a single camera pose inference system in situ. Note that in this case an assumption was made that the background of the training images and test images are the same. In reality, it is usually the case that the training images are captured in controlled environment, whereas the test images are not. To solve this problem, a robust background segmentation algorithm can be appended as a pre-processing step to mask out the varying backgrounds in both controlled training images and run-time images on site. Learning and pose inference can then be performed on the masked images, hence mitigating the problem of uncontrolled backgrounds. For the more common problem of determining the relative yaw orientation (rotation about the vertical axis only) of an object on a (factory) conveyor belt using static cameras, this is an even simpler problem. This is because the problem reduces to constraining the space of possible orientation from three dimensional space onto a two dimensional plane. Kernel Subspace Mapping (KSM) was initially developed for markerless motion capture (chapter 4). In that case, KSM was used to learn the mapping between concatenated images of the human silhouette and the normalized relative joint centers (section 4.2). For future directions, IMED embedded KPCA can be integrated into the core of markerless motion capture using KSM. Results have been presented indicating that significant improvements can be achieved by embedding IMED into KSM (for viewpoint estimation). 92 CHAPTER 5. IMAGE EUCLIDIAN DISTANCE (IMED) EMBEDDED KSM It is likely that relatively the same increase in accuracy can be achieved by applying the same concept in markerless motion capture. The problem to overcome is how to effectively integrate IMED into the pyramid kernel (section 4.2.3) for (computationally) efficient silhouette comparison. Another area to consider is using IMED on local gradients found from SIFT -like algorithms [64] and possibly improving on previous techniques on human pose estimation in cluttered environments [5]. 93 CHAPTER 5. 
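For reference, the mean angular errors reported in tables 5.1 to 5.4 presumably measure the angle between the estimated and ground-truth camera positions on the viewing hemisphere; the following minimal sketch (illustrative only, not the evaluation code used for the tables) makes that measure concrete.

```python
import numpy as np

def mean_angular_error(estimated, ground_truth):
    """Mean angle in degrees between estimated and ground-truth viewpoints.

    Both arguments are (n, 3) arrays of camera positions on the viewing
    hemisphere; only their directions matter, so each row is normalized."""
    est = estimated / np.linalg.norm(estimated, axis=1, keepdims=True)
    gt = ground_truth / np.linalg.norm(ground_truth, axis=1, keepdims=True)
    cos = np.clip(np.sum(est * gt, axis=1), -1.0, 1.0)
    return np.degrees(np.arccos(cos)).mean()
```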
IMAGE EUCLIDIAN DISTANCE (IMED) EMBEDDED KSM 94 Chapter 6 Greedy KPCA for Human Motion Capture∗ This chapter presents the novel concept of applying Greedy KPCA (GKPCA) [41] as a preprocessing filter (in training set reduction) for Kernel Subspace Mapping (KSM). Human motion de-noising comparison between linear PCA, standard KPCA (using all poses in the original sequence) and Greedy KPCA (using the reduced set) is presented at the end of the chapter. The chapter advocates the use of Greedy KPCA in KSM by showing that both KPCA and Greedy KPCA have superior de-noising qualities over PCA, whilst KSM with Greedy KPCA results in relatively similar pose estimation quality (as standard KSM [chapter 4]) but with lower evaluation cost (due to the reduced training set). 6.1 Introduction Due to the high degree of correlation in human motion, unsupervised learning techniques like Principal Components Analysis (PCA) [51, 98] and its non-linear extension, Kernel Principal Components Analysis (KPCA) [90, 91] are commonly used to learn subspaces of human motion. As highlighted in section 3.2.1, PCA, being a linear technique, is not well suited for the de-noising of non-linearly correlated human motion. This hypothesis ∗ This chapter is based on the conference paper [107] T. Tangkuampien and D. Suter: Human Motion Denoising via Greedy Kernel Principal Component Analysis Filtering: International Conference on Pattern Recognition (ICPR) 2006, pages 457–460,Hong Kong, China. 95 CHAPTER 6. GREEDY KPCA FOR MOTION HUMAN CAPTURE is further supported by the human de-noising results in section 3.5. KPCA, on the other hand, shows significant improvement over PCA, for both cyclical human motion (e.g. walking and running) to more complex non-cyclical motion (e.g. dancing and boxing). A drawback of KPCA, however, is that the training and evaluation costs are dependant on the size of the training set (equation 3.13 and equation 3.16). During training, the kernel matrix (which grows quadratically with the number of exemplars in the training set) needs to be calculated before standard PCA can be applied in feature space. As for the de-noising (projection and pre-image approximations [91]) via KPCA, the cost is linear in the exemplar number because each projection requires a kernel comparison with each vector in the training set. As a result, the cardinality of the training set is vital in any real system incorporating KPCA. The goal of KPCA with greedy filtering (GKPCA) [41] is to filter (from the original motion sequences) a reduced training subset than can optimally represent the original de-noising subspace, given a specific prior constraint (e.g. using 70% of the original training data). Figure 6.1: Illustration to geometrically summarize Greedy Kernel Principal Components Analysis (GKPCA) [41] in relation to Kernel Principal Components Analysis (KPCA) [90]. Novel captured motions similar to the original training sequence can then be de-noised using this reduced set. To this end, the recently proposed GKPCA algorithm proposed by [40] is investigated for the goal of filtering the reduced training set. To the author’s knowledge, there is currently no previous work on the practical applications of Greedy Kernel 96 CHAPTER 6. GREEDY KPCA FOR MOTION HUMAN CAPTURE Principal Components Analysis (GKPCA) on human motion de-noising. Similarly, KPCA with greedy algorithm filtering is a relatively novel approach to the problem of markerless motion capture based on subspace learning. 
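To make the cost argument above concrete, the following minimal NumPy sketch of RBF-kernel KPCA (following the standard formulation summarized in section 3.3; data and parameter names are illustrative) makes the two costs explicit: training requires the N × N kernel matrix and its eigendecomposition, while projecting a novel vector requires one kernel evaluation against every one of the N training exemplars.

```python
import numpy as np

def rbf_gram(A, B, gamma):
    """Pairwise RBF kernel matrix between the rows of A and B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * d2)

def kpca_train(X, gamma, n_components):
    """Train KPCA: the kernel matrix grows as O(N^2) with the training size."""
    N = X.shape[0]
    K = rbf_gram(X, X, gamma)
    J = np.eye(N) - np.ones((N, N)) / N            # centering matrix
    Kc = J @ K @ J                                  # centered kernel matrix
    w, A = np.linalg.eigh(Kc)
    w, A = w[::-1][:n_components], A[:, ::-1][:, :n_components]
    A = A / np.sqrt(np.clip(w, 1e-12, None))        # normalize the expansion coefficients
    return dict(X=X, K=K, alpha=A, gamma=gamma)

def kpca_project(model, x):
    """Project a novel vector: the cost is linear in the training-set size,
    since k(x, x_i) must be evaluated for every training exemplar."""
    X, K, A, gamma = model['X'], model['K'], model['alpha'], model['gamma']
    k = rbf_gram(x[None, :], X, gamma).ravel()
    k_c = k - k.mean() - K.mean(0) + K.mean()       # centre the test kernel vector
    return A.T @ k_c
```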
Results are presented in section 6.3, which supports the integration of GKPCA into the pre-processing of KSM (chapter 4). The experiments aim to show that GKPCA filtering (for training set reduction) can aid in the reduction of pose estimation cost, whilst still retaining the de-noising and pose estimation qualities of the original training set. 6.2 Greedy KPCA filtering on Motion Capture Data To understand the concept of GKPCA filtering, imagine that the non-linear toy data set in figure 3.3 were cloned and the training set doubled in size (i.e. there are now two instances of each point in the training set). In this case, there should not be any substantial change in the de-noising quality as the feature subspace defined by the training set should remain relatively the same (as the original set). However, from a performance perspective (due to the linear complexity of KPCA projections) the computation cost would have (approximately) doubled. Most likely, there are redundant (training) poses, which when removed would increase performance, whilst minimizing reduction in de-noising quality. Put more generally, the objective (of GKPCA filtering) is to remove training vectors, which do not contribute substantially to the definition of the de-noising subspace. Specifically for human motion capture data, GKPCA is highly compatible because most animation data (which is a concatenation of poses captured from a marker-based system) is repetitive and usually sampled at a high frame rate (between 60-120 Hz). The higher the sampling rate, the more likely it is that there will be redundant poses when it comes to defining subspaces for human de-noising. In most motion capture training data (where there is a high percentage of redundant poses), the removal of any redundant pose will increase the speed of KSM for markerless motion capture. GKPCA can act as a preprocessing filter that will select a reduced training set P gk (which defines a de-noising 97 CHAPTER 6. GREEDY KPCA FOR MOTION HUMAN CAPTURE pose subspace similar to P tr [section 4.3]) from the original training set. The remainder of this chapter is structured as follows: in section 6.2.1 the Greedy KPCA algorithm is summarized. In section 6.3, experiment are conducted to show how a reduced training set can play an important role in controlling the capture rate and pose estimation quality of KSM for markerless motion capture (as well as compare the motion de-noising quality between GKPCA, KPCA and PCA). 6.2.1 Training Set Filtering via Greedy KPCA Using the same notation an section 3.3, recall that standard KPCA basically maps the training set X tr of size N non-linearly to a feature space F, before performing the equivalent of linear PCA. Greedy KPCA [41] aims to filter X tr to a reduced set X gk of size M (where M 1 N ), such that the linear span of F gk is similar to the linear span of F, where (6.1) Φ : X gk → F gk . Assuming that X gk can be found, it is then possible to express every vector in F as a linear combination of the filtered set in F gk [41]. To summarize Greedy KPCA as in [41, 40]: let J = {j1 , j2 , ..., jM } be the set of M indices, a subset of I, where I = {i1 , i2 , ..., iN } is the original indices of X tr . The approximate feature space representation of the original training exemplars can be expressed as follows: Φ̃(x i ) = " ωij Φ(x j ), j∈J ∀i ∈ I. (6.2) The reduced set’s objective function to minimize is the mean square error εM S (Htr |J ) = " g 1 " &Φ(x i ) − ωij Φ(x j )&2 . 
N i∈I (6.3) j∈J It is important to note that given the subset X tr indexed by J , it is possible to compute ω g optimally to minimize εM S (Htr |J ). Therefore ω g can be removed from the error function εM S (Htr |J ), and the greedy approximation problem re-expressed as the problem 98 CHAPTER 6. GREEDY KPCA FOR MOTION HUMAN CAPTURE of determining J , where J = argmin εM S (Htr |J ) (6.4) card(J )=M and card(J ) denoting the cardinality of subset J . Furthermore, as shown in [40, 87], it is possible to avoid explicitly mapping to feature space F tr by re-expressing equation (6.3) using the kernel trick as εM S (F tr |J ) = 1 " gk gk gk gk (kp (x i , x i ) − 2Kgk c k (x i ) + +k (x i ), Kc k (x i ),). N (6.5) i∈I gk and Kgk c , in this case, is the centered kernel matrix of the filtered set X kgk (x i ) = [kp (x j1 , x i ), ....., kp (x jM , x i )]T , (6.6) is the projection of X tr onto the reduced set’s feature space. Greedy KPCA, therefore only needs to determine the optimal subset J from I [41]. In order to achieve this, we could try #N $ all possible combination, but this would be impractical as there exists M combinations. Instead, as shown in [40], we can choose to minimize the upper bound where εM S (F tr |J ) ≤ 1 (N − M ) max &Φ(x i ) − Φ̃(x i )&2 . N i∈I\J (6.7) Intuitively, the upper bound can be viewed as initially finding the maximum feature space error (between the approximate set X̃ construction and the original set X ) and multiplying it by (N − M ). The reason (N − M ) is chosen as the scale factor, and not N is because M vectors in X tr can already be represented with zero error [41]. Equation (6.7) should hold because the mean error εM S (F tr |J ) cannot be higher than the mean of the maximum error. Further discussion of the Greedy KPCA algorithm is beyond the scope of this chapter, but can be found in [40]. Given the original motion sequence X tr , it is possible to use greedy KPCA to filter out X gk = X tr (J ), a subset of X tr (I), which has similar linear span in the pose feature space. Figure 6.2 compares the de-noising quality (on a toy example from figure 3.3) between 99 CHAPTER 6. GREEDY KPCA FOR MOTION HUMAN CAPTURE GKPCA using 50% of the original training set) and KPCA (using the full training set). Figure 6.2: De-noising comparison of a toy example between GKPCA (using 50% of the original training set) [top] and standard KPCA (using the full training set) [bottom]. 6.3 Experiments & Results There are three factors that should be considered in concluding if Greedy KPCA can filter out a reduced set (in a manner which will enhance the performance of KSM). These are: 1. Does a reduced training size lead to a reduction in motion capture time (i.e. an increase in motion capture rate)? 2. What is the effect of a reduced training set on the de-noising qualities of human motion when compared with using the full set? 3. What is the effect of a reduced training set on the quality of pose estimation via KSM? 100 CHAPTER 6. GREEDY KPCA FOR MOTION HUMAN CAPTURE Experiments regarding the motion capture rate and de-noising qualities are summarized in section 6.3.1 and section 6.3.2 respectively. Section 6.3.3 shows the results of using Greedy KPCA as a preprocessing filter for KSM in markerless motion capture. 6.3.1 Capture Rate Control for KSM To understand how a reduced training set can control the capture rate of KSM motion capture, the reader should refer to equation 3.16. 
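The selection step of section 6.2.1 can be sketched as follows: at each iteration the exemplar with the largest feature-space reconstruction error onto the span of the already-selected set is appended, which greedily drives down the upper bound of equation 6.7. The code below is a minimal NumPy sketch of this idea (RBF kernel, small ridge term for numerical stability), not the optimized implementation of [40, 41]; the data and parameter values are illustrative.

```python
import numpy as np

def rbf_gram(A, B, gamma):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * d2)

def greedy_reduced_set(X, M, gamma, ridge=1e-10):
    """Greedily select M of the N training exemplars.

    The reconstruction error of Phi(x_i) onto span{Phi(x_j) : j in J} is
    err_i = k(x_i, x_i) - k_iJ^T K_JJ^{-1} k_iJ; the exemplar with the
    largest error is added at every step."""
    N = X.shape[0]
    K = rbf_gram(X, X, gamma)
    J = [0]                                        # arbitrary start: k(x_i, x_i) = 1 for RBF
    for _ in range(1, M):
        K_JJ = K[np.ix_(J, J)] + ridge * np.eye(len(J))
        K_JI = K[J, :]                             # kernel of selected set vs. all exemplars
        proj = np.sum(K_JI * np.linalg.solve(K_JJ, K_JI), axis=0)
        err = np.diag(K) - proj                    # feature-space reconstruction error
        err[J] = -np.inf                           # selected exemplars already have zero error
        J.append(int(np.argmax(err)))
    return np.array(J)

# example: keep roughly 30% of a toy pose sequence
rng = np.random.default_rng(1)
poses = rng.random((200, 57))                      # 200 frames of 57-D RJC-like vectors
keep = greedy_reduced_set(poses, M=60, gamma=0.5)
print(keep.shape)                                  # (60,) indices of the reduced set
```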
The outer iteration over the cardinality of the set confirms that any data projection cost via KPCA will be dependent on the training size. Therefore, training KPCA using the reduced set X gk will result in a reduced kernel matrix Kgk and lower computational load in equation 3.16, hence reduction in the kernel subspace mapping cost. As previously discussed, figure 3.12 [bottom] highlights the linear relationship between the de-noising cost of KPCA and the training size. Table 6.1 summarizes how the current capture rate of KSM for motion capture can be controlled via training size modification. The dominant cost, being the recursive pyramid cost† of calculating the silhouette feature vector Ψ(f ), which is currently implemented in un-optimized MatlabT M code. This cost is not shown in the table as it is independent of the training size, but can be determined by taking the difference between the KSM total cost and the LLE mapping and pre-image cost. Table 6.1: Comparison of capture rate for varying training sizes. Training size LLE mapping & pre-image (s) KSM total cost (s) Capture rate (Hz ) 100 200 300 400 500 0.0065 0.0120 0.0220 0.0310 0.0410 0.0847 0.0902 0.1002 0.1092 0.1192 11.81 11.09 9.98 9.16 8.39 † The cost of calculating Ψ(f ) using the pyramid kernel is dependent on the number of silhouette features and not on the size of the training set. For a five level pyramid of 431 features, this results in a average cost of ∼0.0782 seconds. 101 CHAPTER 6. GREEDY KPCA FOR MOTION HUMAN CAPTURE 6.3.2 Comparison between Greedy KPCA and KPCA De-noising This section begins by comparing the mean square error [mse] of human motion de-noising (in normalized RJC format [section 3.4.1]) using PCA, standard KPCA trained with the entire set X tr , and Greedy KPCA trained with the reduced set X gk . All experiments were performed on a PentiumT M 4 with a 2.8 GHz processor. In figure 6.3, GKPCA selects 30% of the original sequence to build the reduced set subspace. Synthetic gaussian white noise is added to motion sequence X in pose space, and the de-noising qualities compared quantitatively (figure 6.3) and qualitatively in a 3D animation playback‡ . Figure 6.3: Pose space mse comparison between PCA, KPCA and GKPCA de-noising for walking and boxing sequences. Figure 6.3 highlights the superiority of both KPCA and GKPCA over linear PCA in motion de-noising. GKPCA will tend towards the KPCA de-noising limit as a greater percentage of the original sequence is included in the reduced set. Specifically for the walking and running motion sequences, the frame by frame comparison in figure 6.4 and figure 6.5 further emphasizes the similarity of de-noising qualities between KPCA (blue line) and GKPCA (red line). Both the error plots for GKPCA and KPCA are well below the error plot for PCA de-noising (black line). For human animation, the GKPCA algorithm is able to generate realistic and smooth animation comparable with KPCA when ‡ For motion de-noising results via PCA, KPCA and GKPCA, please refer to the attached file: videos/motionDenoisingRun.MP4 & videos/motionDenoisingWalk.MP4. 102 CHAPTER 6. GREEDY KPCA FOR MOTION HUMAN CAPTURE Figure 6.4: Frame by Frame error comparison between PCA, KPCA and GKPCA denoising of a human walk sequence. GKPCA selects 30% of the original sequence. Figure 6.5: Frame by Frame error comparison between PCA, KPCA and GKPCA denoising of a human run sequence. GKPCA selects 30% of the original sequence. the de-noised motion is mapped to a skeleton model and play backed in real time. 
To analyze the ability for GKPCA to implicitly de-noise feature subspace noise, we add synthetic noise in the feature space of a walk motion sequence. We are interested in this aspect of de-noising because, in KSM, noise may be induced in the process of mapping from one feature subspace to another (i.e. from Ms to Mp ). Figure 6.6 shows the relationship between feature space noise and pose space noise for both KPCA and GKPCA de-noising for a walk sequence from various yaw orientations (GKPCA uses a reduced set consisting of 30% of the original set). Surprisingly, the ratio of pose space noise to 103 CHAPTER 6. GREEDY KPCA FOR MOTION HUMAN CAPTURE Figure 6.6: Comparison of feature and pose space mse relationship for KPCA and GKPCA. GKPCA uses a reduced set consisting of 30% of the original set. feature subspace noise (i.e. pose space noise/feature subspace noise) is lower for GKPCA de-noising when compared to using KPCA de-noising. The lower GKPCA plot indicates its superiority over KPCA at minimizing feature space noise, which may be induced in the process of pose inference via KSM (i.e. the same level of noise in GKPCA feature subspace will more likely map to a lower level of noise [when compared to KPCA] in the RJC pose space). However, it is important to note that the noise analysis presented here is only for human motion de-noising. The effect of a reduced set (filtered via GKPCA) on markerless motion capture has not yet been analyzed. The following section attempts to investigate this relationship between the size of the reduced training set and the quality of pose estimation via KSM. 6.3.3 Greedy KPCA for Kernel Subspace Mapping To test the efficacy of KSM motion capture with a GKPCA pre-processing filter, the training set of 323 exemplars (as previously applied in section 4.4.2) is used as the original (starting) set. Using the distance in the pose feature subspace for filtering, the GKPCA algorithm [39, 41] is initialized to extract reduced training sets from the original set. The reduced sets are then tuned independently using the process described in chapter 4. For comparison with previous results of motion capture via KSM, the same synthetic set of 1260 unseen silhouettes (as used in section 4.4.2) is used for testing. 104 CHAPTER 6. GREEDY KPCA FOR MOTION HUMAN CAPTURE For pose inference from clean silhouettes (i.e. clean silhouette segmentation, but still using silhouettes of different models between testing and training), and using reduced training sets consisting of 90%, 80% and 70% of the original set: KSM can infer pose with average errors of 3.17◦ per joint, 4.40◦ per joint and 5.86◦ degree per joint respectively. The frame by frame error in pose estimation (from clean silhouettes) are plotted in figure 6.8. Figure 6.9 shows the frame by frame estimation error from noisy silhouettes. For training sets consisting of 90%, 80% and 70% of the original set, the mean pose inference error from noisy silhouettes (with salt & pepper noise density of 0.2) are recorded as 3.71◦ per joint, 4.80◦ per joint and 6.51◦ degree per joint respectively (figure 6.9). The average errors per joint for all the tested reduced sets are summarized in figure 6.7. As expected, the mean pose estimation error (for KSM) when inferring from clean or noisy silhouettes increases with a reduction in the training size. The important factor to note is the nonlinear relationship between the training size and (pose) inference error. 
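For reference, the degrees-per-joint errors quoted above can be computed directly from the RJC encoding, since each joint is stored as a unit direction vector (section 3.4.1). The following minimal sketch averages the per-joint angular deviation over all joints and frames; the assumption that the thesis averages in exactly this way, and the 19-joint, 57-dimensional layout, are stated here only to make the metric concrete.

```python
import numpy as np

def mean_joint_angle_error(estimated, ground_truth, n_joints=19):
    """Mean angular error in degrees per joint between two RJC pose sequences.

    estimated, ground_truth : (T, 57) arrays of normalized RJC pose vectors,
    i.e. 19 joints x 3 components per frame."""
    est = estimated.reshape(-1, n_joints, 3)
    gt = ground_truth.reshape(-1, n_joints, 3)
    est = est / np.linalg.norm(est, axis=2, keepdims=True)
    gt = gt / np.linalg.norm(gt, axis=2, keepdims=True)
    cos = np.clip(np.sum(est * gt, axis=2), -1.0, 1.0)   # per-frame, per-joint cosines
    return np.degrees(np.arccos(cos)).mean()
```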
Similar to KSM pose estimation in the presence of noise (section 4.4.2), the reduction in training size does not equally affect the inference error for each silhouette, but rather, the size reduction increases the error significantly for a minority of the poses (i.e. the peaks in figure 6.9). Figure 6.7: Average errors per joint for the reduced training sets filtered via GKPCA. 105 CHAPTER 6. GREEDY KPCA FOR MOTION HUMAN CAPTURE Figure 6.8: Frame by Frame error comparison (degrees per joint) for a clean walk sequence with different level of GKPCA filter in KSM. 106 CHAPTER 6. GREEDY KPCA FOR MOTION HUMAN CAPTURE Figure 6.9: Frame by Frame error comparison (degrees per joint) for a noisy walk sequence with different level of GKPCA filter in KSM. 107 CHAPTER 6. GREEDY KPCA FOR MOTION HUMAN CAPTURE 6.4 Conclusions & Future Directions The chapter investigates the novel approach of applying Greedy KPCA [41] in the minimization of noise in non-linearly correlated human motion. Section 3.5 has already shown how non-linear KPCA is superior than linear PCA in the de-noising of Gaussian noise in human motion captured data. GKPCA shares this superiority over linear PCA. In terms of its advantage over KPCA, Greedy KPCA can create realistic and comparable (to KPCA) animation from noisy motion sequences, but at a fraction of KPCA’s processing cost. GKPCA, therefore enables KSM to be extended to capture complex sequences defined by a large training size. Currently KSM learns pose and silhouette subspaces via KPCA, and approaches motion capture as the problem of mapping from the projected silhouette subspace Ms to the projected pose subspace Mp (section 4.2). GKPCA can be applied, in this case, to control capture rate by filtering out a reduced training set that can define relatively similar de-noising subspaces (as the original set), thereby reducing the de-noising cost of KPCA (equation 3.16) and pose inference cost of KSM. An area that should be further investigated is GKPCA’s superiority over KPCA in the reduction of pose (feature) subspace noise (figure 6.6). A possible explanation for this improvement may be due to GKPCA’s reduced set constraint, therefore resulting in a reduced space of valid pre-images in pose space, and favorably, a reduced space for unwanted noise as well. This improvement provided by GKPCA filtering, however, does not necessarily mean that GKPCA can lead to superior pose estimation results for KSM. As expected, a reduction in training data leads to an increase in pose estimation error. An interesting point to note is the pattern of the error increase (which occurs in peaks [figure 6.9]) as the training size is reduced. This pattern arises because some silhouettes (of a walk motion) are significantly more ambiguous than others. A relatively similar pattern of error increase has also been recorded (in section 4.4.2) for KSM pose estimation in the presence of noise. This indicates that the robustness of KSM (to both noise and training set reduction) may be improved by applying a better tracking and neighbor selection criterion, which allows the integration of more complex temporal smoothing (to minimize 108 CHAPTER 6. GREEDY KPCA FOR MOTION HUMAN CAPTURE error due to ambiguous silhouettes located at the error peaks [figure 6.9]). 
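As one deliberately simple instance of the temporal smoothing suggested above, successive RJC pose vectors could be blended with an exponential moving average and each joint direction re-normalized back onto the unit sphere. The sketch below is only an illustration of the idea (with an assumed blending weight), not a technique evaluated in this thesis.

```python
import numpy as np

def smooth_rjc_sequence(poses, alpha=0.7):
    """Exponentially smooth a sequence of RJC pose vectors.

    poses : (T, 57) array, 19 joints x 3 components per frame (section 3.4.1)
    alpha : weight given to the newly inferred pose (alpha = 1 disables smoothing)
    """
    smoothed = np.empty_like(poses)
    smoothed[0] = poses[0]
    for t in range(1, len(poses)):
        blend = alpha * poses[t] + (1.0 - alpha) * smoothed[t - 1]
        joints = blend.reshape(19, 3)
        joints /= np.linalg.norm(joints, axis=1, keepdims=True)   # back onto the unit sphere
        smoothed[t] = joints.ravel()
    return smoothed
```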
Figure 6.10: Diagram to illustrate the most likely relationship between using the original training set (used to capture results of figure 4.14) without GKPCA filtering (to train KSM) and using training sequences directly obtained from a motion capture database (to train KSM). The vertical axis indicates the probable mean error for different sizes of the training set. An important point to note is that the original training set (of 323 exemplars) is already a sparse set, in the sense that there is no redundant information and the training poses are (relatively) uniformly spread out in the normalized RJC space (i.e. training exemplars have already been removed form it to begin with). In practice, the marker-based motion capture training set (which can be downloaded from motion capture databases, such as [32]), is usually sampled at high frame rates of between 60Hz to 120Hz. Therefore, when these motion sequences are used directly as the original training set (before GKPCA filtering), they will have a high level of redundant information. For these scenarios, we believe GKPCA filtering will be extremely useful because it may be able to remove a majority of training exemplars before resulting in a significant decrease in the pose estimation error of KSM (figure 6.10). 109 CHAPTER 6. GREEDY KPCA FOR MOTION HUMAN CAPTURE 110 Chapter 7 Conclusions & Future Work 7.1 Concluding Remarks This thesis introduces a novel markerless motion capture technique called Kernel Subspace Mapping (KSM) [chapter 4], which can estimate full body (human) pose without the need for any intrinsic camera calibration. The technique uses the well established unsupervised learning algorithm, Kernel Principal Components Analysis (KPCA) [90], to learn the denoising subspace of human motion in the normalized relative joint center space (which is consistent for all actors) [section 3.4.1]. After training, novel silhouettes are projected into the pose feature subspace, before pose inference via pre-image approximations [87]. This has the advantageous effect of implicitly de-noising the input silhouettes onto the subspace defined by the generic (clean) training data, hence allowing pose inference (of similar motion to the training set) from silhouettes of unseen actors and of unseen poses. To begin, the thesis advocates the use of non-linear KPCA in human motion de-noising by showing the simplest forms of human motion (encoded in the normalized RJC format [section 3.4.1]) is non-linear. Thereafter, to test our hypothesis of a de-noising framework for markerless motion capture, KSM was trained using synthetic silhouettes of a walk sequence (fully rotated about the yaw axis) generated from a generic model. For testing, a substantially different mesh model is used to generate a large set of silhouettes in unseen poses, and from previously unseen viewing orientations. The pose inference results (average error of 2.78◦ /joint or 2.02 cm/joint) show that KSM can estimate pose with 111 CHAPTER 7. CONCLUSIONS & FUTURE WORK accuracy similar to other state of the art approaches [2, 45]. KSM is also one of the few 2D learning-based pose estimation approaches (others are [6, 45, 83]), which can accurately infer pose irrespective of yaw orientation. Furthermore, KSM also works robustly in the presence of synthetic binary noise, which are used to simulate scenarios with poor silhouette segmentation. 
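The synthetic binary noise referred to above amounts to flipping a random fraction of silhouette pixels; a minimal sketch is given below, using the salt & pepper density of 0.2 quoted in section 6.3.3 as the default (the function name and exact corruption scheme are illustrative).

```python
import numpy as np

def corrupt_silhouette(silhouette, density=0.2, rng=None):
    """Simulate poor segmentation by salt & pepper corruption of a binary mask.

    silhouette : 2D boolean (or 0/1) array
    density    : fraction of pixels randomly forced to foreground or background
    """
    rng = np.random.default_rng() if rng is None else rng
    noisy = silhouette.astype(bool).copy()
    mask = rng.random(silhouette.shape) < density
    noisy[mask] = rng.random(mask.sum()) < 0.5     # on average half salt, half pepper
    return noisy
```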
As KPCA [90] is a well established machine learning technique, there is a large community of researchers developing novel optimization algorithms for it. We believe that because KSM is based on this popular technique (KPCA), most of the improvements on KPCA can be transferred to practical improvements of KSM with relatively minor modifications. To illustrate the ease of integration (of novel improved algorithms) and elaborate further some of the important aspects of KSM, two recently proposed techniques are embedded into KSM: • the Image Euclidian Distance (IMED) [116], and • the greedy KPCA (GKPCA) algorithm [41]. Specifically for IMED embedded KSM, we concentrate on the problem of 3D object viewpoint estimation from intensity images [81]. In this case, IMED embedded KSM shows an improvement in the accuracy of viewpoint estimation over KSM (using vectorized Euclidian distance) and other previously proposed approaches [80, 119, 47, 78, 80]. For greedy KPCA integration, we concentrate specifically on the problem of human pose inference (from synchronized silhouettes) via the use of a reduced training set. The greedy KPCA algorithm is used as a preprocessing filter to select a reduced training subset than can optimally represent the original de-noising subspace, given a specific prior constraint (e.g. using 70% of the original training data). The experiments show that Greedy KPCA filtering for KSM results in lower evaluation cost (for both KPCA denoising and KSM in motion capture) due to the reduced training set. More importantly, KSM with a GKPCA filter is able to retain most of the pose inference quality when compared to using the original training set (to train KSM). 112 CHAPTER 7. CONCLUSIONS & FUTURE WORK 7.2 Future Directions Kernel Subspace Mapping (KSM) has been shown to be accurate in human pose estimation from binary human silhouettes. An interesting area to explore is the possibility of extending KSM to take into account photometric information without limiting its flexibility. Using silhouettes is advantageous because normalized silhouettes (section 4.2.3) of the same pose (for most actors) are relatively similar irrespective of their environment (provided that there is acceptable silhouette segmentation). However, silhouettes encode less information due to the loss of foreground photometric information. The use of pixel intensity and colors in KSM could potentially lead to more accurate pose estimation. This is because photometric data can be used to disambiguate inconclusive silhouettes (i.e. the ones located mostly at the error peaks in figure 4.16). However, photometric data is also model specific, in the sense that two different persons will most likely have extremely different patterns and color imageries for the same stance (pose). The problem that will need to be overcome is how to standardized the color/intensity information in such as way that the training data (using the photometric patterns generated from one person) can be generalized to a different person for (computationally) efficient pose estimation. A possible solution to this may be to use local gradients from SIFT -like algorithms [64] (as in the work of Agarwal and Triggs [5]) to first encode photometric data before training and pose estimation via KSM. The integration of IMED [116] into KSM (for viewpoint estimation) allows improved estimation results over using vectorized Euclidian distance in KSM. 
An interesting area to further explore is whether IMED can be embedded into KSM to improve human pose estimation from silhouettes. The integration of IMED into the pyramid silhouette kernel (of the original KSM) is substantially harder (than vectorized Euclidian distance) due to the hierarchical structure of the silhouette descriptor. Furthermore, the (silhouette) pyramid kernel with embedded IMED would also need to be positive definite to ensure its efficacy with techniques based on convex optimization such as KPCA and KSM.
Greedy KPCA filtering for reduced set selection in KSM is a research area which should be further investigated. Interestingly, KSM with GKPCA filtering results in a superior reduction in pose (feature) subspace noise (i.e. a lower ratio of pose space noise to feature subspace noise) than KSM trained with the full set. As previously mentioned, an explanation for this improvement may be GKPCA's reduced set constraint, which leads to a reduced space of valid pre-images in the output space and, favorably, a reduced space for unwanted noise as well. The frame by frame analysis of the pose estimation error (when using GKPCA filtering) is also interesting (figure 6.9). The error plot shows that as the training size reduces, the errors do not increase (relatively) equally for all the test silhouettes; rather, the errors increase substantially for a small group of silhouettes. A similar pattern of error increase is also observed for pose estimation from noisy silhouettes (figure 4.16). Therefore, an important future area of research for KSM would be to explore techniques aimed at the minimization of the error peaks due to ambiguous silhouettes. We believe that such an improvement would increase the robustness of KSM in pose estimation, both in the presence of noise and when learning from reduced training sets.
Finally, most of the learning-based human pose estimation approaches do not rely on 3D volumetric reconstruction of the human body. This is mainly because 3D approaches mostly require an expensive array of synchronized and calibrated cameras. Specifically for previously proposed 3D pose estimation techniques (e.g. [99, 28]), an interesting area to investigate would be the integration of KSM into 3D pose estimation approaches, by learning the mapping from 3D feature points to pose vectors. Effectively, this may lead to a reduction in the computational cost of pose estimation, as the full 3D volumetric reconstruction of the actor for each frame would not be required (to constrain the skeletal structure for pose vector estimation). Instead, the pose vectors can be inferred directly from 3D feature points, which can be more efficiently tracked and reconstructed.
Appendix A
Motion Capture Formats and Converters
A.1 Motion Capture Formats in KSM
A deformable mesh model can be animated by simply controlling its inner hierarchical biped/skeleton structure (figure A.1). As a result, most motion formats only encode pose information to animate these structures. For information on how to create a deformable skinned mesh in DirectX (i.e. attach a static mesh model to its skeleton), the reader should refer to [66]. This section does not aim to present a complete review of all motion capture formats, but rather a summary of the relevant formats applied in the proposed technique, Kernel Subspace Mapping (KSM). In KSM, three different formats are used.
These are: the Acclaim Motion Capture (AMC) (section A.1.1) [59], the DirectX format (section A.1.2) [66] and the Relative Joint Center (RJC) format (section 3.4.1). Two motion capture format converters are used in this thesis (Appendix A.2), one to convert AMC to DirectX format and the other from AMC to RJC format. Figure A.2 shows a diagram summarizing the use of these converters in the proposed markerless motion capture system. KSM learns the mapping from the synthetic silhouette space to the corresponding RJC space (figure A.2 [right:dotted arrow]) for a specific motion set (e.g. walking, running). For training, a database of accurate human motion is required, which can be downloaded in AMC format 115 APPENDIX A. MOTION CAPTURE FORMATS AND CONVERTERS Figure A.1: Diagram to summarize the hierarchical relationship of the bones of inner biped structure. from the CMU motion capture database [32]. From this, synthetic silhouettes for machine learning purposes can be generated via the DirectX skinned mesh model [66]. From our experiments, the interpolation and reconstruct (of realistic new poses) is best synthesized in the normalized RJC format (section 3.4.1). A.1.1 Acclaim Motion Capture (AMC) Format The Acclaim motion capture (AMC) format, developed by the game maker Acclaim, stores human motion using concatenated Euler rotations. The full motion capture format consists of two types of file, the Acclaim skeleton file (ASF) and the AMC file [59]. The ASF file stores the attributes of the skeleton such as the number of bones, their geometric dimension, their hierarchical relationships and the base pose, which is usually the ‘tpose’ (figure A.3 [right]). For the case where a single skeletal structure is used, the ASF file can be kept constant and ignored. On the other hand, the AMC file (which encodes human motion using joint rotations over a sequence of frames) is different for every set of motion and is the file generated during motion capture. 116 APPENDIX A. MOTION CAPTURE FORMATS AND CONVERTERS Figure A.2: Diagram to summarize the proposed markerless motion capture system and how the reviewed motion formats combine together to create a novel pose inference technique called Kernel Subspace Mapping (KSM). Concentrating principally on the AMC file format, human motion is stored by encoding each 3D local joint rotation as a set of consecutive Euler rotations. This is highlighted in figure A.3 [left humerus bone], where the left shoulder joint rotation (which corresponds to a ball & socket joint) is represented by 3 concatenated Euler rotations. Note that it is not necessary for a joint to be encoded using 3 Euler transformations. For example, the left elbow joint, which is a hinge joint (and represented by the left radius bone in figure A.3) has a single degree of freedom, and hence, is only represented by one Euler rotation. A full human pose can be encoded as a set of Euler pose (stance) vectors representing the actor’s joint orientations at specific point in time (as used in [1, 84, 6, 24, 83]). Skeleton animation is achieved by concatenating these pose vectors (over time) and mapping them sequentially to a skeleton embedded inside a human mesh model. The AMC format is the simplest form of joint encoding and is one of the format adopted by the VICON marker based motion capture system [103]. On the other hand, encoding pose with Euler rotations presents numerus problems due to the fact that the mapping of pose to Euler joint coordinates is non-linear and multivalued. 
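A minimal parser for the motion section of an AMC file might look as follows, assuming the common CMU-style layout in which header lines begin with ':' or '#', each frame starts with its frame number alone on a line, and every subsequent line holds a bone name followed by its Euler values in degrees. This is only an illustrative sketch, not the converter used in this thesis.

```python
def parse_amc(path):
    """Return a list of frames; each frame maps a bone name to its Euler values."""
    frames, current = [], None
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith((':', '#')):
                continue                           # skip header and comment lines
            tokens = line.split()
            if len(tokens) == 1 and tokens[0].isdigit():
                current = {}                       # a bare integer starts a new frame
                frames.append(current)
            elif current is not None:
                current[tokens[0]] = [float(v) for v in tokens[1:]]
    return frames

# each frame can then be mapped to a pose vector, e.g.
# frames = parse_amc('walk.amc'); left_shoulder = frames[0]['lhumerus']
```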
Any technique, like Kernel Principal Components Analysis (KPCA)[90, 91, 87] (which is the core component in KSM) will eventually breakdown when applied to vectors consisting of Euler angles. This is because it may potentially map the same 3D (joint) 117 APPENDIX A. MOTION CAPTURE FORMATS AND CONVERTERS Figure A.3: Example of a Acclaim motion capture (AMC) format with example rotation of the left humerus bone (i.e. the left shoulder joint) rotation to different locations in vector space. Furthermore, linear pose interpolation using Euler pose vectors also presents a problem, because Euler rotations are not commutative, as well as suffering from the problems of Gimbal locks (Appendix A.2 [figure A.7]). A.1.2 DirectX Animation Format Human motion capture format in DirectX is similar to the AMC format in that there are two parts to the file structure, the skeleton/mesh attribute section and the human motion section. The skeleton/mesh attribute section encodes the biped hierarchical structure, its geometric attributes and additional information such as its relationship to the mesh model (i.e. the weight matrix of the mesh vertices [66]). This information can again be considered constant and can be ignored in the pose inference process. Similar to the AMC format (section A.1.1), we concentrating principally on the file section that encodes the motion data. The DirectX encoding is similar to the AMC format in that the relative 3D 118 APPENDIX A. MOTION CAPTURE FORMATS AND CONVERTERS joint rotations are recorded in a sequential frame structure. However, instead of encoding 3D rotations using Euler rotations, the 4 × 4 homogeneous matrix is adopted. Note that the 4 × 4 matrix can also encode translation and scaling data (in addition to the rotation). However, for motion capture, where the geometric attributes of the model (e.g. length of arms, height, etc) is known and considered constant, pose inference can be constrained to inferring a set of 3 × 3 rotation matrices, which only encodes the rotation information. The main advantage of the DirectX format is the optimized skinned mesh animation and support for the format [66], which enables the efficient rendering and capture of synthetic silhouettes for training and testing purposes. As there are major similarities between the AMC and the DirectX format, AMC data from the marker based VICON system can easily be converted to its DirectX equivalent by converting each set of xyz Euler rotations to its corresponding 3 × 3 rotation matrix as follows: L[1:3,1:3] = Rx (Θxi )Ry (Θyi )Rz (Θzi ), (A.1) where Θxi , Θyi and Θzi are the Euler rotations about the x,y and z axis respectively, and L[1:3,1:3] denoting the first 3 rows and columns of the DirectX homogenous matrix (see Appendix A.2 for more details regarding format conversion). Unfortunately, the algebra of rotations using matrices is non-commutative and its corresponding manifold is nonlinear [9]. Therefore, use of the DirectX format for the synthesis of novel pose is also prone to relatively the same problems as when using Euler angles. 119 APPENDIX A. MOTION CAPTURE FORMATS AND CONVERTERS Figure A.4: Comparison between the 3D models view using the AMC viewer [32] (yellow figure) and the DirectX mesh viewer of a generic skinned mesh model (grey mesh) for a gold swing [top], a boxing motion sequence [center] and a salsa dancing motion sequence [bottom]. 120 APPENDIX A. 
MOTION CAPTURE FORMATS AND CONVERTERS Figure A.5: Comparison between the 3D models view using the AMC viewer [32] (yellow figure) and the connected stick figure of the normalized RJC format of a running motion sequence [top], a boxing motion sequence [center] and a dancing motion sequence [bottom]. 121 APPENDIX A. MOTION CAPTURE FORMATS AND CONVERTERS Figure A.6: Comparison between the 3D models view using the AMC viewer [32] (yellow figure) and the DirectX mesh viewer of an RBF mesh model of the author (blue background). 122 APPENDIX A. MOTION CAPTURE FORMATS AND CONVERTERS A.2 A.2.1 Motion Capture Format Converters The AMC to DirectX converter The AMC and DirectX formats basically store pose as a set of 3D rotations in different forms. Therefore, conversion between the two pose can be simplified to the process of converting one 3D rotation format to another. In this case, the converter will simply convert Euler rotations to a 4 × 4 homogenous matrix. A 3D Euler rotation is characterized by three concatenated rotations, usually about the x, y and z axis, which can commonly be referred to as the yaw, pitch and row respectively. The order of this concatenation plays a significant role in how the object is orientated after applying the rotations. Geometrically, if we apply the Euler rotations in the order of the yaw (x ), the pitch (y) and the roll (z ) as in figure A.7, the pitch axis will be rotated by the yaw rotation, the roll axis by both the yaw and the pitch rotations, however, the yaw axis will be affected by neither the roll nor the pitch rotation. Figure A.7: Images to illustrate Euler rotations in term of the yaw, pitch and roll. The image on the right shows the negative effect of Gimbal lock, where a degree of rotation is lost due to the alignment of the roll and the yaw axis. Notice how the pitch is affected by the yaw rotation, the roll is by both the yaw and pitch rotations, whereas the yaw is independent of any other rotations. 123 APPENDIX A. MOTION CAPTURE FORMATS AND CONVERTERS We can represent a rotation Θxi about the yaw axis as Rx (Θxi ), where 0 0 1 Rx (Θxi ) = cos(Θxi ) sin(Θxi ) 0 0 − sin(Θxi ) cos(Θxi ) (A.2) and the remaining pitch and roll rotations as Ry (Θyi ) and Rz (Θzi ) respectively as: y cos(Θi ) Ry (Θyi ) = 0 sin(Θyi ) 0 1 0 − sin(Θyi ) 0 y cos(Θi ) Rz (Θzi ) cos(Θzi ) sin(Θzi 0) z z = − sin(Θi ) cos(Θi 0) 0 0 0 0 1 (A.3) . The 3 × 3 rotation matrix of the xyz (yaw-pitch-roll) Euler rotation can be generated by concatenating the separate rotation matrices in reverse order as in equation A.1. The reversal in matrix application may initially appear incorrect, however, closer examination of figure A.7 will reveal that from the object’s point of view, the roll is performed first (z axis), followed by the pitch (y axis) and finally the yaw (x axis) rotation . For each pose, the set of 3 × 3 rotations matrices for the i-th joint is converted to its corresponding 4 × 4 homogeneous matrix Li (which also encodes the bone’s translation displacement relative to its parent) as follows: y x) O z) O R R I b R (Θ (Θ ) O (Θ i x i y i z i Li = [0, 0, 0, 1]T . T T T T O 1 O 1 O 1 O 1 (A.4) In equation A.4, O represents a 3 × 1 zero vector, I the 3 × 3 identity matrix, and bi the length vector of the i-th bone, whose x element stores the actual length of the bone (the y and z elements are both zero). 
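Equations A.1 to A.4 translate directly into code. The following minimal NumPy sketch (illustrative only, since the actual converter writes DirectX data structures) builds the 4 × 4 homogeneous matrix L_i from the xyz Euler rotations and the bone length; angles are assumed to be in radians, so the degree values stored in an AMC file must be converted beforehand.

```python
import numpy as np

def Rx(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[1, 0, 0], [0, c, s], [0, -s, c]], float)       # equation A.2

def Ry(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, 0, -s], [0, 1, 0], [s, 0, c]], float)       # equation A.3

def Rz(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, s, 0], [-s, c, 0], [0, 0, 1]], float)       # equation A.3

def amc_joint_to_homogeneous(theta_x, theta_y, theta_z, bone_length):
    """Pack the xyz Euler rotations and the bone translation b_i = [length, 0, 0]
    into the 4x4 homogeneous matrix L_i of equation A.4."""
    L = np.eye(4)
    L[:3, :3] = Rx(theta_x) @ Ry(theta_y) @ Rz(theta_z)               # equation A.1
    L[:3, 3] = [bone_length, 0.0, 0.0]                                # b_i
    return L
```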
A.2.2 The AMC to RJC converter
The conversion from the AMC format to the RJC format (section 3.4.1) can be summarized as the problem of transforming Euler rotations to 3D points on a unit sphere. As the local homogeneous 4 × 4 matrix L_i already encodes the rotation information for the i-th joint, the simplest form of conversion is to transform the appended zero vector [0, 0, 0, 1]^T as follows:
p_i = L_i · [0, 0, 0, 1]^T / ‖L_i · [0, 0, 0, 1]^T‖ .   (A.5)
Each joint rotation is now encoded as a point on a unit sphere and denoted by p_i. Specifically in our work, a column-wise concatenation of the separate joint rotations forms the full pose vector p, which requires 57 dimensions (19 joints). Geometrically, for each frame, the normalized RJC encoding simply stores the position of the joints relative to each other along the skeletal hierarchy, with the pelvis node as the root.
Appendix B
Accurate Mesh Acquisition
Figure B.1: Selected textured mesh models of the author in varying boxing poses.
This section focuses on the capturing, synthesizing and texturing of accurate human mesh models for realistic skinned mesh animation in the DirectX format (section A.1.2)∗. Human model acquisition was achieved via a laser scanner integrated with a synchronized high-resolution camera. In section B.1, the surface fitting and re-sampling technique of Carr et al [20, 21] is applied in a novel way to the scanned data in order to create accurate human mesh models. Thereafter, images captured from the digital camera during scanning are exported to the DirectX format and textured onto the mesh. This will enable the generation of synthetic test silhouettes of real people, hence allowing the quantitative analysis of any markerless motion capture technique by comparing the captured pose with the pose that was used to generate the synthetic images.
∗ This appendix summarizes one of the many solutions available for capturing an accurate human mesh model for synthetic testing purposes. Note that there are other techniques which can also be used for accurate human model acquisition and texturing.
B.1 Model Acquisition for Surface Fitting & Re-sampling
The model acquisition was performed using the Riegl LMS-Z420i terrestrial laser scanner equipped with a calibrated Nikon D100 6 megapixel digital camera. The laser is positioned approximately five metres from the actor and two scans of the front and back of the actor are captured. Two stands are positioned on the left and right of the actor for hand placement in order to ensure ease of biped and mesh alignment. Thereafter, the front and back point cloud data are filtered and merged using the Riscan Pro software and the combined point cloud data exported for surface fitting.
Figure B.2: Example images of the front and back scan of the author [left] and the merged filtered scan data, which is used for surface fitting and re-sampling purposes.
B.2 Radial Basis Fitting & Re-sampling
From the merged point cloud data of the actor (figure B.2 [right]), a radial basis function (RBF) can be fitted to the data set for re-sampling, after which a smooth mesh surface can be attained by joining the sampled points. For the case of 3D point cloud data, the original scanned points (also referred to as on-surface points) are each assigned a 4th dimension density value of zero [20, 21].
By letting x denote a 3D vector, a smooth surface in R3 can be obtained by fitting an RBF function S(x) to the labelled scanned points and 128 APPENDIX B. ACCURATE MESH ACQUISITION re-sampling the function at the same density value of S(x) = 0. It is obvious that if only data points with density values of 0 are available, RBF fitting (to this set) will produce the trivial solution [i.e. S(x) = 0 for all x ∈ R3 ]. To avoid this, a signed-distance function, which encodes the relationship between off -surface and on-surface points, can be adopted (figure B.3). Figure B.3: Diagram to illustrate the generation of off -surface points (blue and red) from the input on-surface points (green) before fitting an RBF function. An off -surface point, in this case, is a synthetic point xsm whose density is assigned the signed value dm , which is proportional to the distance to its closest on-surface point. Off -surface points are created along the projected normals, and can be created inside (blue points: dm < 0) or outside (red points: dm > 0) the surface defined by the onsurface (green) points. Normals are determined from local patches of on-surface points and performed using an evaluation version of the FastRBFT M Matlab toolbox [20, 21]. The surface fitting problem can now be re-formulated as the problem of determining an RBF function S(x ) such that S(xn ) = 0, S(xsm ) = dm , for n = 1, 2, ...., N and for m = 1, 2, ...., M 129 (B.1) APPENDIX B. ACCURATE MESH ACQUISITION where N and M are the number of on-surface and off -surface points respectively. Using the notation from [20], an RBF S(x) can be represented as follows: S(x) = p(x) + M +N " i=1 λi Φ(|x − xi |), (B.2) with λi denoting the RBF coefficients of Φ, the real values basis function, and p representing a linear polynomial function. By additionally constraining the RBF to the Beppo-Levi space in R3 [20], the following side conditions can be implicitly guaranteed: M +N " λi = i=1 M +N " λi xi = i=1 M +N " λi yi = i=1 M +N " λi zi = 0. (B.3) i=1 Amalgamating the side constraints with the RBF representation in (B.2), the problem of RBF fitting can effectively be reduced to that of solving the following system of linear equation: λ A P λ =B c c PT 0 (B.4) where A is the matrix defining the relationship between x i and x j (i.e. Ai = φ(|x i − x j |) for i, k ∈ [1, M + N ]), and Pij = pj (x i ), with j denoting the index to the basis of the polynomials. By solving for the unknown coefficients c and λ, the RBF function S(x) can now be sampled at any point within the range of the on-surface and off -surface training data. More importantly, it is possible to sample the RBF at S(x) = 0, which corresponds to the surface defined by the on-surface points. Furthermore, it is possible to sample this RBF at regular grid interval, adding to the simplicity and efficiency of the mesh construction algorithm. A disadvantage of RBF fitting, as highlighted in [20, 21], is the complexity of solving for the unknown coefficients c and λ, which is O(M +N )3 . For an accurate mesh model derived from approximately 8000 on-surface points (as in figure B.2 - right), this complexity makes the technique impractical on a home computer. By applying fast approximation techniques [20, 21], it is possible to reduce this complexity to O((M + N ) log(M + N )). Figure B.4 shows the resultant mesh model of the author generated via the RBF fast approximation 130 APPENDIX B. ACCURATE MESH ACQUISITION and re-sampling technique. 
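A minimal dense-solver sketch of the fitting step (equations B.2 to B.4), using the biharmonic basis φ(r) = r and a linear polynomial, is given below. It is illustrative only: as noted above, the O((M+N)^3) dense solve is impractical for full body scans, and the thesis relies on the fast approximation of [20, 21].

```python
import numpy as np

def fit_rbf(points, densities):
    """Solve the linear system of equation B.4 for the RBF coefficients.

    points    : (K, 3) on- and off-surface points
    densities : (K,) signed density values (zero for on-surface points)"""
    K = len(points)
    A = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)   # phi(r) = r
    P = np.hstack([np.ones((K, 1)), points])            # linear polynomial basis p(x)
    top = np.hstack([A, P])
    bottom = np.hstack([P.T, np.zeros((4, 4))])
    rhs = np.concatenate([densities, np.zeros(4)])      # side conditions of equation B.3
    sol = np.linalg.solve(np.vstack([top, bottom]), rhs)
    return sol[:K], sol[K:]                              # lambda coefficients, poly coefficients

def evaluate_rbf(x, points, lam, c):
    """Evaluate S(x) = p(x) + sum_i lambda_i phi(|x - x_i|), as in equation B.2."""
    r = np.linalg.norm(points - x, axis=1)
    return c[0] + c[1:] @ x + lam @ r

# the surface is the zero level set: sample S(x) on a regular grid and extract
# S(x) = 0 (e.g. with a marching-cubes routine) to obtain the re-sampled mesh.
```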
Figure B.4: Selected examples of the accurate mesh model of the author created via the RBF fast approximation & re-sampling technique of Carr et al. [20, 21].

For increased realism, images of the person captured from the synchronized digital camera can be textured onto the plain model. Selected poses of textured mesh models of the author (using this simple solution) are shown in figure B.1.

References

[1] A. Agarwal and B. Triggs. 3D human pose from silhouettes by relevance vector regression. In International Conference on Computer Vision & Pattern Recognition, pages 882–888, 2004.
[2] A. Agarwal and B. Triggs. Learning to track 3D human motion from silhouettes. In International Conference on Machine Learning, pages 9–16, 2004.
[3] A. Agarwal and B. Triggs. Monocular human motion capture with a mixture of regressors. In IEEE Workshop on Vision for Human Computer Interaction at CVPR, June 2005.
[4] A. Agarwal and B. Triggs. Tracking articulated motion using a mixture of autoregressive models. In European Conference on Computer Vision, volume 3, pages 54–65, 2005.
[5] A. Agarwal and B. Triggs. A local basis representation for estimating the human pose from cluttered images. In Asian Conference on Computer Vision, pages 55–59, 2006.
[6] A. Agarwal and B. Triggs. Recovering 3D human pose from monocular images. IEEE Transactions on Pattern Analysis & Machine Intelligence, 28(1), 2006.
[7] J.K. Aggarwal and Q. Cai. Nonrigid motion analysis: Articulated and elastic motion. In Computer Vision and Image Understanding, volume 70, pages 142–156, 1998.
[8] J.K. Aggarwal and Q. Cai. Human motion analysis: A review. In Computer Vision and Image Understanding, volume 73, pages 428–440, 1999.
[9] M. Alexa. Linear combination of transformations. In SIGGRAPH, pages 380–387, 2002.
[10] O. Arikan, D.A. Forsyth, and J.F. O'Brien. Motion synthesis from annotations. In SIGGRAPH, volume 22, pages 402–408, 2003.
[11] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. In IEEE Transactions on Pattern Analysis & Machine Intelligence, volume 24, pages 509–522, 2002.
[12] Y. Bengio, J.-F. Paiement, and P. Vincent. Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps and spectral clustering. In Advances in Neural Information Processing Systems, volume 16, 2004.
[13] O. Bernier and P. Cheung-Mon-Chan. Real-time 3D articulated pose tracking using particle filtering and belief propagation on factor graphs. In British Machine Vision Conference, volume 1, pages 27–36, 2006.
[14] R. Bowden. Learning statistical models of human motion. In IEEE Workshop on Human Modeling, Analysis and Synthesis, International Conference on Computer Vision & Pattern Recognition, 2000.
[15] R. Bowden, T.A. Mitchell, and M. Sarhadi. Reconstructing 3D pose and motion from a single camera view. In British Machine Vision Conference, volume 2, pages 904–913, September 1998.
[16] J. Bray. Markerless based human motion capture: A survey. Technical report, Vision and VR Group, Department of Systems Engineering, Brunel University.
[17] M. Bray, P. Kohli, and P.H.S. Torr. POSECUT: Simultaneous segmentation and 3D pose estimation of humans using dynamic graph-cuts. In European Conference on Computer Vision, volume II, pages 642–655, 2006.
[18] C. Bregler and J. Malik. Tracking people with twists and exponential maps. In International Conference on Computer Vision & Pattern Recognition, pages 8–15, 1998.
[19] F. Caillette, A. Galata, and T. Howard. Real-time 3D human body tracking using variable length Markov models. In British Machine Vision Conference, pages 469–478, 2005.
[20] J.C. Carr, R.K. Beatson, J.B. Cherrie, T.J. Mitchell, W.R. Fright, B.C. McCallum, and T.R. Evans. Reconstruction and representation of 3D objects with radial basis functions. In SIGGRAPH, pages 67–76, 2001.
[21] J.C. Carr, R.K. Beatson, B.C. McCallum, W.R. Fright, T.J. McLennan, and T.J. Mitchell. Smooth surface reconstruction from noisy range data. Applied Research Associates NZ Ltd.
[22] C. Cedras and M. Shah. Motion based recognition: A survey. In IEEE Proceedings, Image and Vision Computing, 1995.
[23] P. Cerveri, A. Pedotti, and G. Ferrigno. Robust recovery of human motion from video using Kalman filters and virtual humans. In Human Movement Science, volume 22, pages 377–404, 2003.
[24] J. Chai and J.K. Hodgins. Performance animation from low-dimensional control signals. ACM Transactions on Graphics, 24(3):686–696, 2005.
[25] P. Chen and D. Suter. An analysis of linear subspace approaches for computer vision and pattern recognition. In International Journal of Computer Vision, volume 68, pages 83–106, 2006.
[26] Y. Chen, J. Lee, R. Parent, and R. Machiraju. Markerless monocular motion capture using image features and physical constraints. In Computer Graphics International, pages 36–43, June 2005.
[27] K-M. Cheung, S. Baker, and T. Kanade. Shape-from-silhouette across time part I: Theory and algorithms. In International Journal of Computer Vision, volume 3, pages 221–247, 2005.
[28] K-M. Cheung, S. Baker, and T. Kanade. Shape-from-silhouette across time part II: Applications to human modeling and markerless motion tracking. In International Journal of Computer Vision, volume 3, pages 225–245, 2005.
[29] C-W. Chu, O.C. Jenkins, and M.J. Matarić. Markerless kinematic model and motion capture from volume sequences. In International Conference on Computer Vision & Pattern Recognition, page 475, 2003.
[30] T.F. Cootes, C.J. Taylor, D.H. Cooper, and J. Graham. Active shape models - their training and application. In Computer Vision and Image Understanding, volume 61, pages 38–59, 1995.
[31] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines (and other kernel-based learning methods). Cambridge University Press, 2000.
[32] Carnegie Mellon University Graphics Lab Motion Capture Database. http://mocaps.cs.cmu.edu.
[33] D. Demirdjian, L. Taycher, G. Shakhnarovich, K. Grauman, and T. Darrell. Avoiding the streetlight effect: tracking by exploring likelihood modes. In International Conference on Computer Vision, pages 357–364, October 2005.
[34] J. Deutscher, A. Blake, and I. Reid. Articulated body motion capture by annealed particle filtering. In International Conference on Computer Vision & Pattern Recognition, volume 2, pages 126–133, June 2000.
[35] M. Dimitrijevic, V. Lepetit, and P. Fua. Human body pose recognition using spatio-temporal templates. In ICCV Workshop on Modeling People and Human Interaction, October 2005.
[36] J. Eisenstein and W.E. Mackay. Interacting with communication appliances: an evaluation of two computer vision-based selection techniques. In Computer Human Interaction, pages 1111–1114, 2006.
[37] A. Elgammal and C.-S. Lee. Inferring 3D body pose from silhouettes using activity manifold learning. In International Conference on Computer Vision & Pattern Recognition, pages 681–688, 2004.
[38] R. Engle. GARCH 101: The use of ARCH/GARCH models in applied econometrics. In Journal of Economic Perspectives, volume 15, pages 157–168, 2001.
[39] V. Franc. Pattern recognition toolbox for Matlab. Centre for Machine Perception, Czech Technical University, 2000.
[40] V. Franc. Optimization Algorithms for Kernel Methods. PhD thesis, Centre for Machine Perception, Czech Technical University, July 2005.
[41] V. Franc and V. Hlavac. Greedy algorithm for a training set reduction in the kernel methods. In Int. Conf. on Computer Analysis of Images and Patterns, pages 426–433, 2003.
[42] P. Fua, A. Gruen, N. D'Apuzzo, and R. Plänkers. Markerless full body shape and motion capture from video sequences. In International Archives of Photogrammetry and Remote Sensing, volume 34, pages 256–261, 2002.
[43] D.M. Gavrila. The visual analysis of human movement: A survey. In Computer Vision and Image Understanding, volume 73, pages 82–96, 1999.
[44] K. Grauman and T. Darrell. The pyramid match kernel: Discriminative classification with sets of image features. In International Conference on Computer Vision, pages 1458–1465, 2005.
[45] K. Grauman, G. Shakhnarovich, and T. Darrell. Inferring 3D structure with a statistical image-based shape model. In International Conference on Computer Vision, pages 641–648, 2003.
[46] K. Grochow, S.L. Martin, A. Hertzmann, and Z. Popovic. Style-based inverse kinematics. In SIGGRAPH, pages 522–531, 2004.
[47] J.H. Ham, I. Ahn, and D. Lee. Learning a manifold-constrained map between image sets: Applications to matching and pose estimation. In International Conference on Computer Vision & Pattern Recognition, 2006.
[48] A. Hilton and J. Starck. Multiple view reconstruction of people. In International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT), pages 357–364, 2004.
[49] P.O. Hoyer. Non-negative matrix factorization with sparseness constraints. In Journal of Machine Learning Research, number 5, pages 1457–1469, 2004.
[50] S. Hu and B.F. Buxton. Using temporal coherence for gait pose estimation from a monocular camera view. In British Machine Vision Conference, volume 1, pages 449–457, 2005.
[51] I.T. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, 1986.
[52] R. Jonker and A. Volgenant. A shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing, 38:325–340, 1987.
[53] R.E. Kalman. A new approach to linear filtering and prediction problems. In Transactions of the ASME - Journal of Basic Engineering, volume 83, pages 95–107, 1961.
[54] R. Kehl, M. Bray, and L. Van Gool. Full body tracking from multiple views using stochastic sampling. In International Conference on Computer Vision & Pattern Recognition, volume 2, pages 129–136, 2005.
[55] P. Kohli and P. Torr. Efficiently solving dynamic Markov random fields using graph cuts. In International Conference on Computer Vision, pages 922–929, 2005.
[56] R. Kondor and T. Jebara. A kernel between sets of vectors. In International Conference on Machine Learning, 2003.
[57] L. Kovar, M. Gleicher, and F. Pighin. Motion graphs. In SIGGRAPH: Proceedings of the 29th annual conference on Computer graphics and interactive techniques, pages 473–482, 2002.
[58] J.T. Kwok and I.W. Tsang. The pre-image problem in kernel methods. In International Conference on Machine Learning, pages 408–415, 2003.
[59] J. Lander. Working with motion capture file formats. In Game Developer, pages 30–37, January 1998.
[60] A.J. Laub. Matrix Analysis for Scientists and Engineers. pages 139–150, 2005.
[61] R. Li, M.-H. Yang, S. Sclaroff, and T.-P. Tian. Monocular tracking of 3D human motion with a coordinated mixture of factor analyzers. In European Conference on Computer Vision, volume II, 2006.
[62] Y. Li, T. Wang, and H-Y. Shum. Motion texture: a two-level statistical model for character motion synthesis. In SIGGRAPH: Proceedings of the 29th annual conference on Computer graphics and interactive techniques, pages 465–472, 2002.
[63] Y. Liang, W. Gong, W. Li, and Y. Pan. Face recognition using heteroscedastic weighted kernel discriminant analysis. In International Conference on Advances in Pattern Recognition, volume 2, pages 199–205, 2005.
[64] D. Lowe. Distinctive image features from scale-invariant keypoints. In International Journal of Computer Vision, volume 60, pages 91–110, 2004.
[65] G. Loy, M. Eriksson, J. Sullivan, and S. Carlsson. Monocular 3D reconstruction of human motion in long action sequences. In European Conference on Computer Vision, pages 442–455, 2004.
[66] F. Luna. Skinned mesh character animation with Direct3D 9.0c. www.moonslab.com, September 2004.
[67] S. McKenna, G. Gong, and Y. Raja. Face recognition in dynamic scenes. In British Machine Vision Conference, pages 140–151, 1997.
[68] A.S. Micilotta, E.J. Ong, and R. Bowden. Detection and tracking of humans by probabilistic body part assembly. In British Machine Vision Conference, volume 1, pages 429–438, 2005.
[69] I. Mikić, M. Trivedi, E. Hunter, and P. Cosman. Human body model acquisition and tracking using voxel data. In International Journal of Computer Vision, volume 3, pages 199–223, 2003.
[70] I. Mikić, M.M. Trivedi, E. Hunter, and P.C. Cosman. Human body model acquisition and motion capture using voxel data. In International Workshop on Articulated Motion and Deformable Objects, pages 104–118, 2002.
[71] G. Miller, J. Starck, and A. Hilton. Projective surface refinement for free-viewpoint video. In European Conference on Visual Media Production (CVMP), 2006.
[72] T. Moeslund. Computer vision-based human motion capture - a survey. Technical report, Laboratory of Image Analysis, Institute of Electronic Systems, University of Aalborg, Denmark, 1999.
[73] T. Moeslund. Summaries of 107 computer vision-based human motion capture papers. Technical report, Laboratory of Image Analysis, Institute of Electronic Systems, University of Aalborg, Denmark, 1999.
[74] T.B. Moeslund and E. Granum. 3D human pose estimation using 2D-data and an alternative phase space representation. In Proceedings of the IEEE Workshop on Human Modeling, Analysis and Synthesis, pages 26–33, 2000.
[75] G. Mori and J. Malik. Estimating human body configurations using shape context matching. In European Conference on Computer Vision, volume 3, pages 666–680, 2002.
[76] R. Navaratnam, A. Fitzgibbon, and R. Cipolla. Semi-supervised learning of joint density models for human pose estimation. In British Machine Vision Conference, volume 2, pages 679–688, 2006.
[77] R. Navaratnam, A. Thayananthan, P.H.S. Torr, and R. Cipolla. Hierarchical part-based human body pose estimation. In British Machine Vision Conference, pages 479–488, 2005.
[78] S.A. Nene, S.K. Nayar, and H. Murase. Columbia object image library (COIL-20). http://www.cs.columbia.edu/CAVE/, 1996.
[79] M. Niskanen, E. Boyer, and R. Horaud. Articulated motion capture from 3-D points and normals. In British Machine Vision Conference, volume 1, pages 439–448, 2005.
[80] G. Peters. Efficient pose estimation using view-based object representations. In Machine Vision and Applications, volume 16, pages 59–63, 2004.
[81] G. Peters, B. Zitova, and C. von der Malsburg. How to measure the pose robustness of object views. In Image and Vision Computing, volume 20, pages 249–256, 2002.
[82] R. Plankers and P. Fua. Articulated soft objects for video-based body modeling. In International Conference on Computer Vision, pages 394–401, 2001.
[83] L. Ren, G. Shakhnarovich, J.K. Hodgins, H. Pfister, and P. Viola. Learning silhouette features for control of human motion. ACM Transactions on Graphics, 24(4), October 2005.
[84] A. Safonova, J.K. Hodgins, and N.S. Pollard. Synthesizing physically realistic human motion in low-dimensional, behavior-specific spaces. ACM Transactions on Graphics, 23(3):514–521, 2004.
[85] S.T. Roweis and L.K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, 2000.
[86] L.K. Saul and S.T. Roweis. Think globally, fit locally: unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research, 4:119–155, 2003.
[87] B. Schölkopf, S. Mika, A.J. Smola, G. Rätsch, and K.R. Müller. Kernel PCA pattern reconstruction via approximate pre-images. In International Conference on Artificial Neural Networks, pages 147–152, 1998.
[88] B. Schölkopf, P. Knirsch, A. Smola, and C. Burges. Fast approximation of support vector kernel expansions, and an interpretation of clustering as approximation in feature spaces. In Mustererkennung, pages 124–132, 1998.
[89] B. Schölkopf and A.J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond - Chapter 18. MIT Press, Cambridge, 2002.
[90] B. Schölkopf, A.J. Smola, and K.R. Müller. Kernel principal component analysis. In International Conference on Artificial Neural Networks, pages 583–588, 1997.
[91] B. Schölkopf, A.J. Smola, and K.R. Müller. Kernel PCA and de-noising in feature spaces. In Advances in Neural Information Processing Systems, pages 536–542, 1999.
[92] N.N. Schraudolph, S. Günter, and S.V.N. Vishwanathan. Fast iterative kernel PCA. In Advances in Neural Information Processing Systems, 2007.
[93] C. Shen, A. van den Hengel, A. Dick, and M.J. Brooks. 2D articulated tracking with dynamic Bayesian networks. In International Conference on Computer and Information Technology, pages 130–136, 2004.
[94] G. Shakhnarovich, P. Viola, and T. Darrell. Fast pose estimation with parameter sensitive hashing. In International Conference on Computer Vision, 2003.
[95] H. Sidenbladh, M. Black, and D. Fleet. Stochastic tracking of 3D human figures using 2D image motion. In European Conference on Computer Vision, pages 702–718, June 2000.
[96] C. Sminchisescu, A. Kanaujia, and D. Metaxas. Learning joint top-down and bottom-up processes for 3D visual inference. In International Conference on Computer Vision & Pattern Recognition, volume 2, pages 1743–1752, 2006.
[97] C. Sminchisescu and B. Triggs. Covariance scaled sampling for monocular 3D body tracking. In International Conference on Computer Vision & Pattern Recognition, volume 1, pages 447–454, December 2001.
[98] L.I. Smith. A tutorial on principal components analysis. 2002.
[99] J. Starck and A. Hilton. Model-based multiple view reconstruction of people. In International Conference on Computer Vision, pages 915–922, 2003.
[100] J. Starck, G. Miller, and A. Hilton. Volumetric stereo with silhouette and feature constraints. In British Machine Vision Conference, volume 3, pages 1189–1198, 2006.
[101] E.B. Sudderth, A.T. Ihler, W.T. Freeman, and A.S. Willsky. Nonparametric belief propagation. In International Conference on Computer Vision & Pattern Recognition, volume 11, page 605, 2003.
[102] A. Sundaresan and R. Chellappa. Markerless motion capture using multiple cameras. In Computer Vision for Interactive and Intelligent Environment, pages 15–26, 2005.
[103] Vicon Peak: Vicon MX System. http://www.vicon.com/products/systems.html.
[104] K. Takahashi, T. Sakaguchi, and J. Ohya. Real-time estimation of human body postures using Kalman filter. In International Workshop on Robot and Human Interaction, pages 189–194, 1999.
[105] T. Tangkuampien and T-J. Chin. Locally linear embedding for markerless human motion capture using multiple cameras. In Digital Image Computing: Techniques and Applications, page 72, 2005.
[106] T. Tangkuampien and D. Suter. 3D object pose inference via kernel principal components analysis with image Euclidian distance (IMED). In British Machine Vision Conference, pages 137–146, 2006.
[107] T. Tangkuampien and D. Suter. Human motion de-noising via greedy kernel principal component analysis filtering. In International Conference on Pattern Recognition, pages 457–460, 2006.
[108] T. Tangkuampien and D. Suter. Real-time human pose inference using kernel principal components pre-image approximations. In British Machine Vision Conference, pages 599–608, 2006.
[109] Y.W. Teh and S. Roweis. Automatic alignment of local representations. In Advances in Neural Information Processing Systems, pages 841–848, 2002.
[110] J.B. Tenenbaum, V. de Silva, and J.C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, 2000.
[111] C. Theobalt, M. Magnor, P. Schüler, and H.P. Seidel. Combining 2D feature tracking and volume reconstruction for online video-based human motion capture. In Pacific Conference on Computer Graphics and Applications, page 96, 2002.
[112] M.E. Tipping and C.M. Bishop. Mixtures of probabilistic principal component analysers. In Neural Computation, volume 11, pages 443–482, 1999.
[113] R. Urtasun, D.J. Fleet, A. Hertzmann, and P. Fua. Priors for people tracking from small training sets. In International Conference on Computer Vision, pages 403–410, 2005.
[114] V.N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.
[115] P. Viola and M.J. Jones. Rapid object detection using a boosted cascade of simple features. In International Conference on Computer Vision & Pattern Recognition, pages 511–518, 2001.
[116] L. Wang, Y. Zhang, and J. Feng. On the Euclidean distance of images. In IEEE Transactions on Pattern Analysis & Machine Intelligence, volume 27, pages 1334–1339, 2005.
[117] R. Wang and W.K. Leow. Human posture sequence estimation using two uncalibrated cameras. In British Machine Vision Conference, volume 1, pages 459–468, 2005.
[118] G. Welch and G. Bishop. An introduction to the Kalman filter. In SIGGRAPH Course Notes, 2001.
[119] L-W Zhao, S-W Luo, and L-Z Liao. 3D object recognition and pose estimation using kernel PCA. In International Conference on Machine Learning & Cybernetics, pages 3258–3262, 2004.
[120] Z. Zivkovic. Optical-flow-driven gadgets for gaming user interface. In Proceedings of the 3rd International Conference on Entertainment Computing, pages 90–100, 2004.