Kernel Subspace Mapping: Robust Human Pose and
Viewpoint Inference from High Dimensional Training Sets
by
Therdsak Tangkuampien, BscEng(Hons)
Thesis
Submitted by Therdsak Tangkuampien
for fulfillment of the Requirements for the Degree of
Doctor of Philosophy
Supervisor: Professor David Suter
Associate Supervisor: Professor Ray Jarvis
Department of
Electrical and Computer Systems Engineering
Monash University
January, 2007
© Copyright by Therdsak Tangkuampien 2007
Addendum
• page 2 para 2: Comment: KSM allows generalization from a single generic model
to a previously unseen person and it is a matter for future work to determine how
this approach can generalize to people of different somatotype and those wearing
different clothing.
• page 4 para 1: Comment: It is important to note that KPCA can perform both
de-noising and data reduction. In the context of human motion capture via KSM,
KPCA is used for ‘de-noising’ because the processed data is not constrained to be a
subset of the training set.
• page 22 para 2: Comment: KSM can be applied to any number of cameras;
the emphasis of the work is on the development of a new learning technique that
generalizes from only a small training set. This is demonstrated in two-camera
markerless tracking, which appears sufficient to disambiguate the coupled problem of
pose and yaw.
• page 30 para 3: Comment: The Gaussian white noise model is used for parameter tuning (in KSM) because KPCA de-noising has been shown to be effective and
efficient in the minimization of Gaussian noise [91]. Provided that the noise level
is not substantial, experiments (conducted by the author) show that the optimal
parameter in equation 3.6 remains relatively stable under changes in noise level. In
practice (during capture), the robustness of KSM is tested by analyzing its accuracy in pose inference from previously seen and unseen silhouettes corrupted with
synthetic noise (section 4.4).
• page 36 para 2: Comment: In addition to the Gaussian kernel (equation 3.3.2),
other kernels can be used in KSM. It is a matter of future research to determine if
some kernels perform better and more efficiently than others. The important factors
that will need to be considered are the accuracy and efficiency of the pre-image
approximation of KPCA [87] based on the specific kernel.
• page 37 para 2: Comment: The parameters K+ and λ are tuned over a predefined
discretized range, so as to minimize the error function in equation 3.20.
• page 43 para 2: Comment: Specifically for the experiments in section 3.5, mean
square error (Mse) is defined as the Euclidian distance between two RJC vectors as
defined in section 3.4.1.
• page 58 Add to the end of para 1: There are many other tunable parameters for
KSM. For example, two orthogonal camera views are used (in figure 4.4) because the
author believes that this provided the greatest information from two synchronized
views. Another tunable parameter is the level of the pyramids used in figure 4.5.
In our analysis, we found that a pyramid level of 5 is the most robust for our
experimental setup. There are many other tunable parameters (such as changes in
illumination level and segmentation quality) which can affect the accuracy of KSM.
It is a subject of further research to evaluate the performance of KSM under different
parametric conditions.
• page 61 para 2 Comment: In the selection of the optimal number of neighbors
for LLE mapping, it is important to avoid over-fitting. If the number of neighbors
used in the mapping is too small, KSM may end up locked onto the wrong poses. An
interesting area to further investigate is how to dynamically identify the most robust
number of neighbors to use for LLE mapping (irrespective of different motion types).
• page 86 para 2: Comment: It is important to note that for IMED embedded
KSM, the use of LLE mapping will still be affected by the same ambiguity problem
as in figure 4.8. However, the probability of this occurring is substantially reduced
due to the use of photometric information in section 5.4.1.
• page 93 Add to the end of para 1: Another interesting area of research for IMED
embedded KSM is to investigate its robustness in an uncontrolled environment (e.g.
varying light/illumination condition). In this case, it may be possible to increase the
technique’s robustness by training KPCA using relative change in intensity levels
between neighboring pixels (as opposed to absolute pixel intensities).
• page 96 para 1: Comment: The reader should note that the de-noising result of
non-cyclical dancing motion is not included in the analysis in section 6.3.2.
• page 99 Add to the end of para 1: The optimization of Greedy Kernel PCA
(GKPCA) is considered to be beyond the scope of this work. The emphasis of the
work is on the performance improvement of KSM (in human motion capture) via
the application of GKPCA in training set reduction. Readers interested in the topic
of GKPCA optimization and upper bound minimization should refer to [41], section
5.4 (page 91).
• page 104 figure 6.6: The results in figure 6.6 are surprising. For further research,
it would be interesting to investigate if a portion of the reduction in feature space
noise can be attributed to data reduction (instead of fully to KPCA or GKPCA
de-noising).
• page 113 Add to the end of para 1: It may also be possible to generalize KSM
to colour images by transferring colour properties from the target to be tracked to
a generic model and regenerating the database (re-training) for tracking.
• page 133 Additional References:
– “Viewpoint invariant exemplar-based 3D human tracking”, Computer Vision and Image Understanding, 104(2), 178–189, 2006, E.-J. Ong et al.
– “The Dynamics of Linear Combinations: Tracking 3D Skeletons of Human Subjects”, Image and Vision Computing, 20, 397–414, 2002, E.-J. Ong and S. Gong.
– “A Multi-View Nonlinear Active Shape Model using Kernel PCA”, BMVC 1999, S. Romdhani et al.
Errata
• section 1.1 (p1): ‘limps of the human body’ for ‘limbs’
• section 1.1 (p3): ‘KSM can refer’ for ‘infer’
• section 1.2 (p4): ‘motion capture techniques calls’ for ‘called’
• section 1.2 (p4): ‘in lower evaluation cost’ for ‘in a lower’
• section 2.2 (p13): ‘particles’ for ‘particle’
• section 2.2.2 (p16): ‘so that Euclidian norm’ for ‘that the Euclidian’
• section 2.2.2 (p16): ‘boosted cascade of classifier’ for ‘classifiers’
• section 2.2.2 (p17): ‘automatically constraining search space’ for ‘the search space’
• section 2.2.2 (p17): ‘to identify region of’ for ‘regions of’
• section 2.3 (p21): ‘3D points on as an array’ for ‘on an’
• section 3.3.2 (p37): ‘two free parameters that requires’ for ‘require’
• section 3.4.1 (p39): ‘if two different pose vector’ for ‘vectors’
• section 3.4.2 (p41): ‘de-noising, noisy human’ for ‘a noisy human’
• section 3.4.2 (p41): ‘performance improvement’ for ‘improvements’
• section 3.4.2 (p41): ‘integrating KPCA motion de-noiser’ for ‘the KPCA’
• section 3.5 (p41): ‘play-backed’ for ‘played-back’
• section 3.5 (p45): ‘As expected, smaller noise’ for ‘a smaller noise’
• section 4.3 (p59): ‘different sets of motion’ for ‘motions’
• section 4.4.2 (p68): ‘there are many possible set of’ for ‘sets of’
• section 4.4.2 (p68): ‘false positives’ for ‘false negatives’
• section 4.4.3 (p72): ‘was able to created’ for ‘create’
• section 4.5 (p73): ‘high level of noise’ for ‘levels’
• section 4.5 (p73): ‘very well at estimation 3D’ for ‘estimating’
• section 4.5 (p74): ‘The limits the’ for ‘This limits’
• section 4.5 (p74): ‘the capture other’ for ‘of other’
• section 5.6 (p90): ‘only shows minor percentage’ for ‘a minor percentage’
• section 6 (p95): ‘Human motion de-noising’ for ‘A human motion de-noising’
• section 6.2 (p97): ‘human de-noising’ for ‘motion de-noising’
• section 6.2 (p97): ‘whilst minimizing reduction’ for ‘the reduction’
• section 6.2 (p98): ‘experiment are conducted’ for ‘experiments are conducted’
• section 6.2 (p98): ‘as well as compare the’ for ‘comparing the’
• section 6.4 (p109): ‘removed form’ for ‘removed from’
• section 7.1 (p109): ‘is non-linear’ for ‘are non-linear’
• section 7.2 (p113): ‘such as way’ for ‘such a way’
• section 7.2 (p143): ‘techniques aim at’ for ‘aimed at’
• section 7.2 (p143): ‘calibrate cameras’ for ‘calibrated cameras’
List of Tables

5.1  3D pose inference comparison using the mean angular error for ‘unseen’ views of the object ‘Tom’.  88
5.2  3D pose inference comparison using the mean angular error for randomly selected views of the object ‘Tom’.  90
5.3  3D pose inference comparison using the mean angular error for ‘unseen’ views of the object ‘Dwarf’.  91
5.4  3D pose inference comparison using the mean angular error for randomly selected views of the object ‘Dwarf’.  92
6.1  Comparison of capture rate for varying training sizes.  101
List of Figures

1.1  Diagram to summarize the training and testing process of Kernel Subspace Mapping.  3
2.1  2D Taxonomy plot to summarize the classification of the various motion capture literature reviewed in this thesis.  10
3.1  Linear toy example to show the results of PCA de-noising.  27
3.2  Toy example to show the projection of data via PCA.  28
3.3  Toy example to illustrate the limitation of PCA.  32
3.4  PCA de-noising of non-linear toy data.  33
3.5  KPCA de-noising of the non-linear toy data.  37
3.6  Diagram to illustrate the missing rotation in a ball & socket joint using the RJC encoding.  41
3.7  Data flow diagram to summarize the relationship between human motion capture and human motion de-noising.  42
3.8  Quantitative comparison between PCA and KPCA de-noising of a human motion sequence.  43
3.9  Frame by frame error comparison between PCA and KPCA de-noising of a walk sequence.  44
3.10 Frame by frame error comparison between PCA and KPCA de-noising of a run sequence.  44
3.11 Diagram to show results of KPCA in the implicit de-noising of feature space noise.  45
3.12 Feature and pose space mean square error relationship for KPCA.  46
4.1  Overview of Kernel Subspace Mapping (KSM).  51
4.2  Scatter plot of the RJC projections onto the first 4 kernel principal components in Mp.  52
4.3  Relationship between human motion de-noising, KPCA and KSM.  53
4.4  Example of a training pose and its corresponding concatenated synthetic image.  56
4.5  Diagram to summarize the silhouette encoding for 2 synchronized cameras.  57
4.6  Markerless motion capture re-expressed as the problem of mapping from the silhouette subspace.  59
4.7  Diagram to summarize the mapping from the silhouette subspace to the pose subspace.  59
4.8  Diagram to show how two different poses may have similar concatenated silhouettes.  60
4.9  Illustration to summarize markerless motion capture.  64
4.10 Selected images of the different models used to test KSM.  65
4.11 Comparison between the shape context descriptor and the pyramid match kernel.  65
4.12 Intensity images of the RJC pose space kernel and the Pyramid Match kernel for a training walk sequence (fully rotated about the vertical axis).  66
4.13 Visual comparison of the captured pose with ground truth.  67
4.14 KSM capture error (degrees per joint) for a synthetic walk motion sequence.  68
4.15 KSM capture error (cm/joint) for a synthetic walk motion sequence.  69
4.16 KSM capture error (degrees per joint) for a motion with different noise densities (Salt & Pepper noise).  70
4.17 KSM motion capture on real data using 2 synchronized un-calibrated cameras.  71
4.18 Selected motion capture results to illustrate the robustness of KSM.  72
5.1  Diagram to summarize the 3D pose estimation problem of an object viewed from the upper viewing hemisphere.  78
5.2  Images of the Standardizing Transform with different σ values.  81
5.3  Images of a 3D object after applying the Standardizing Transform.  84
5.4  Diagram to summarize Kernel Subspace Mapping for 3D object pose estimation.  87
5.5  Selected images of the test object ‘Tom’ and object ‘Dwarf’.  89
6.1  Illustration to geometrically summarize GKPCA.  96
6.2  De-noising comparison of a toy example between GKPCA and KPCA.  100
6.3  Pose space mse comparison between PCA, KPCA and GKPCA de-noising.  102
6.4  Frame by frame error comparison between PCA, KPCA and GKPCA de-noising of a human walk sequence.  103
6.5  Frame by frame error comparison between PCA, KPCA and GKPCA de-noising of a human run sequence.  103
6.6  Comparison of feature and pose space mse relationship for KPCA and GKPCA.  104
6.7  Average errors per joint for the reduced training sets filtered via GKPCA.  105
6.8  Frame by frame error comparison (degrees per joint) for a clean walk sequence with different levels of GKPCA filtering in KSM.  106
6.9  Frame by frame error comparison (degrees per joint) for a noisy walk sequence with different levels of GKPCA filtering in KSM.  107
6.10 Diagram to illustrate the most likely relationship between using the original training set without GKPCA filtering and using training sequences directly obtained from a motion capture database (to train KSM).  109
A.1  Diagram to summarize the hierarchical relationship of the bones of the inner biped structure.  116
A.2  Diagram to summarize the proposed markerless motion capture system.  117
A.3  Example of an Acclaim motion capture (AMC) format.  118
A.4  Comparison between animation in AMC format and DirectX format (generic mesh).  120
A.5  Comparison between animation in AMC format and RJC format.  121
A.6  Comparison between animation in AMC format and DirectX format (RBF mesh).  122
A.7  Images to illustrate Euler rotations.  123
B.1  Selected textured mesh models of the author.  127
B.2  Example images of the front and back scan of the author.  128
B.3  Diagram to illustrate surface fitting for RBF mesh generation.  129
B.4  Selected examples of the accurate mesh model of the author.  131
Kernel Subspace Mapping: Robust Human Pose and
Viewpoint Inference from High Dimensional Training Sets
Therdsak Tangkuampien, BscEng(Hons)
therdsak.tangkuampien@eng.monash.edu.au
Monash University, 2007
Supervisor: Professor David Suter
d.suter@eng.monash.edu.au
Associate Supervisor: Professor Ray Jarvis
ray.jarvis@eng.monash.edu.au
Abstract
A novel markerless motion capture technique called Kernel Subspace Mapping (KSM)
is introduced in this thesis. The technique is based on the non-linear unsupervised learning
algorithm, Kernel Principal Component Analysis (KPCA). Training sets of human motions
captured from marker based systems are used to train de-noising subspaces for human pose
estimation. KSM learns two feature subspace representations derived from the synthetic
silhouette and pose pairs of a single generic human model, and views motion capture
as the problem of mapping vectors between the learnt subspaces. After training, novel
silhouettes, of previously unseen actors and of unseen poses, can be projected through
the two subspaces via Locally Linear non-parametric mapping. The captured human
pose is then determined by calculating the pre-image of the de-noised projections. The
inference results show that KSM can estimate pose with accuracy similar to other recently
proposed state of the art approaches, but requires a substantially smaller training set
(which can potentially lead to lower processing costs). To allow automated training set
reduction, the novel concept of applying Greedy KPCA as a preprocessing filter for KSM
is proposed. The flexibility of KSM is also further illustrated via the integration of the
Image Euclidian Distance (IMED) and the application of the technique to the problem of
3D object viewpoint estimation.
Kernel Subspace Mapping: Robust Human Pose and
Viewpoint Inference from High Dimensional Training Sets
Declaration
I declare that this thesis is my own work and has not been submitted in any form for
another degree or diploma at any university or other institute of tertiary education. Information derived from the published and unpublished work of others has been acknowledged
in the text and a list of references is given.
Therdsak Tangkuampien
January 23, 2007
Acknowledgments
I would like to thank my principal supervisor, Professor David Suter for his advice and
guidance during my candidature. It was he who initially motivated my research ideas in
the area of human motion capture and machine learning. Many thanks to David for the
constructive criticisms and the countless hours spent on the discussions and proof reading
of my research. This thesis would not have been possible without his invaluable guidance.
I would like to thank my colleagues from the Institute for Vision Systems Engineering: Tat-jun Chin, James U, James Cheong, Dr. Konrad Schindler, Dr. Hanzi Wang,
Mohamed Gobara, Ee Hui Lim, Hang Zhou, Liang Wang and my associate supervisor,
Prof. Raymond Jarvis, for the wonderful discussions and constructive criticisms during
our weekly seminars. It has been a wonderful experience sharing research ideas with you
all. Especially, I would like to thank Tat-jun Chin for initially generating my interest in
the areas of manifold learning and high dimensional data analysis.
I would also like to thank many of the reviewers of my conference and journal submissions. The constructive feedback has been extremely helpful in guiding the direction
of my research. In particular, I would like to thank Kristen Grauman for providing the
source code for the Pyramid Match kernel and Gabriele Peters for making available the
3D object viewpoint estimation data set of ‘Tom’ and ‘Dwarf’. These components have
helped me greatly during my research candidature.
Most importantly, I would like to thank my mother and father for their constant encouragement and for always being there for me, when I needed them.
Therdsak Tangkuampien
Commonly used Symbols
Some commonly used symbols in this thesis are defined here:
• X tr refers to the training set.
• xi refers to the i-th element (column vector) of the training set X tr.
• x refers to the novel input (column vector) for the model learnt from X tr.
• Mp refers to the pose feature subspace.
• Ms refers to the silhouette feature subspace.
• kp (·, ·) refers to the KPCA kernel function in the pose space.
• ks (·, ·) refers to the KPCA kernel function in the silhouette space.
• Ψ refers to the silhouette descriptor for the silhouette kernel ks (·, ·).
• vp refers to the coefficients of the KPCA projected vector x via kp (·, ·).
• vs refers to the coefficients of the KPCA projected vector Ψ via ks (·, ·).
• P tr refers to the KPCA projected set of training poses.
• S tr refers to the KPCA projected set of training silhouettes.
• P lle refers to the reduced training subset of P tr used in LLE mapping.
• S lle refers to the reduced training subset of S tr used in LLE mapping.
• s in refers to the projection of the input silhouette in Ms.
• p out refers to the pose subspace representation of s in in Mp.
• x out refers to the output pose vector for KSM (i.e. the pre-image of p out).
• X gk refers to the reduced training subset (of X tr), filtered via Greedy KPCA.
Chapter 1
Introduction
1.1 Markerless Motion Capture and Machine Learning
Markerless human motion capture is the process of registering (capturing and encoding)
human pose in mathematical forms without the need for intrusive markers. Moeslund [73]
defined human motion capture as “the process of capturing the large scale (human) body
movements at some resolution”. The term ‘large scale body movement’ emphasizes that
only the significant body parts, such as the arms, legs, torso and head will be considered.
The statement “at some resolution” implies that human motion capture can refer to both
the tracking of the human as a single object (low resolution) or the inference of the relative positions of the limbs of the human body (high resolution). This thesis views the human body
as an articulated structure consisting of multiple bones connected in a hierarchical manner
(appendix A.1). Markerless motion capture, as referenced in our work, refers to the inference of the relative joint orientations of this hierarchical skeletal model. Virtual animation
can be achieved by encoding these relative joint orientations (at each time frame) as pose
vectors, and mapping these vectors sequentially to a hierarchical (skeletal) structure. In
order to avoid the use of markers, pose information can be inferred from images captured
from multiple synchronized cameras. Estimating human pose from images can be classified
as computer vision-based motion capture, and has received growing interest in the last
decade. Faster processors, cheaper digital camera technology and larger memory storage
have all contributed to the significant growth in the area.
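To make this encoding concrete, the following is a minimal forward-kinematics sketch (in Python, with a purely hypothetical skeleton and joint names, not the biped structure of appendix A) of how relative joint rotations at one frame might be mapped onto a hierarchical skeleton to obtain 3D joint positions.

```python
import numpy as np

# Hypothetical skeleton: each joint stores its parent and a fixed bone offset in
# the parent's local frame (parents listed before their children).
SKELETON = {
    "root":    (None,      np.zeros(3)),
    "spine":   ("root",    np.array([0.0, 0.3, 0.0])),
    "head":    ("spine",   np.array([0.0, 0.25, 0.0])),
    "l_upper": ("spine",   np.array([-0.2, 0.2, 0.0])),
    "l_lower": ("l_upper", np.array([-0.3, 0.0, 0.0])),
}

def pose_to_positions(relative_rotations):
    """Forward kinematics: relative joint rotations -> global joint positions."""
    global_rot, positions = {}, {}
    for joint, (parent, offset) in SKELETON.items():
        rot = relative_rotations.get(joint, np.eye(3))
        if parent is None:
            global_rot[joint], positions[joint] = rot, np.zeros(3)
        else:
            global_rot[joint] = global_rot[parent] @ rot
            positions[joint] = positions[parent] + global_rot[parent] @ offset
    return positions

# One frame of an illustrative pose vector: rotate the left shoulder joint by
# 90 degrees about the z axis; all other joints stay at rest.
c, s = np.cos(np.pi / 2), np.sin(np.pi / 2)
frame = {"l_upper": np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])}
print(pose_to_positions(frame)["l_lower"])
```

Animating a sequence then amounts to repeating this mapping for the pose vector of every frame.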
Machine learning is another area that has received significant interest from both the research community and industry. Literally, machine learning research is concerned with “the
development of algorithms and techniques that allow computers (machines) to learn”. In
particular, due to the availability of large human motion capture databases, there have been
many markerless motion capture techniques based on machine learning [37, 113, 45, 6, 83].
The principal concept is to generate a synthetic training set of images from available poses
in the motion database and learn the (inverse) mapping from image to pose, such that it
generalizes well to novel inputs. The term ‘generalizes well to novel inputs’ emphasizes
that this is not a traditional database search problem, but a more complex one, which
requires the generation of unseen poses not in the training database. An attribute which
makes machine learning suitable for human motion capture is that a high percentage of
human motion is coordinated [84, 24]. There have been many experiments on the application of unsupervised learning algorithms (such as Principal Components Analysis (PCA)
[51, 98] and Locally Linear Embedding (LLE) [85]) in learning low dimensional embedding
of human motion [37, 14, 84]. Effectively, inferring pose via a lower dimensional embedding
avoids expensive computational processing and searching in the high dimensional human
pose space (e.g. 57 degrees of freedom in normalized Relative Joint Center (RJC) format
[section 3.4.1]).
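As a small, hedged illustration of this idea, the snippet below embeds synthetic 57-dimensional pose vectors into a 5-dimensional linear subspace with scikit-learn's PCA; the data is random and purely illustrative, and the dimensions are chosen only to echo the RJC example above.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for a motion sequence: 500 frames of 57-D pose vectors that
# actually lie near a 5-D linear subspace (coordinated motion) plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 57))
poses = latent @ mixing + 0.01 * rng.normal(size=(500, 57))

pca = PCA(n_components=5).fit(poses)
embedded = pca.transform(poses)                    # 500 x 5 embedding
reconstructed = pca.inverse_transform(embedded)    # back to 57-D pose space
print(embedded.shape, np.abs(poses - reconstructed).max())
```

Searching or tracking in the 5-dimensional embedding is clearly cheaper than doing so in the full 57-dimensional pose space.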
The main contribution of this thesis is a novel markerless motion capture technique
called Kernel Subspace Mapping (KSM)(chapter 4). The technique is based on the nonlinear unsupervised (machine) learning algorithm, Kernel Principal Components Analysis
(KPCA) [90]. KSM requires, at initialization, labelled input and output training pairs,
which may both be high dimensional. In particular, for computer vision-based motion
capture, KSM can learn the (inverse) mapping from synthetic (training) images to pose
space encoded in the normalized RJC format (section 3.4.1). Instead of learning poses
generated from a number of different mesh models (as in [45, 37]), KSM can estimate
pose using training silhouettes generated from a single generic model (figure 1.1 [top
left]). To ensure the robustness of the technique and test that it generalizes well to
Figure 1.1: Diagram to summarize the training and testing process of Kernel Subspace
Mapping, which learns the mapping from image to the normalized Relative Joint Centers
(RJC) space (section 3.4.1). Note that different mesh models are used in training and
testing. The generic model [top left] is used to generate training images, whereas the
accurate mesh model of the author (Appendix B) is used to generate synthetic test images.
previously unseen∗ poses from a different actor, a different model is used in testing (figure
1.1 [top right]). Results are presented in section 4.4, which show that KSM can infer
accurate human pose and direction, without the need for 3D processing (e.g. voxel carving,
shape from silhouettes), and that KSM works robustly in poor segmentation environments
as well.
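To give a feel for the subspace-mapping idea before the full development in chapter 4, the following is a schematic, self-contained sketch on toy data. It uses scikit-learn's KernelPCA, with its built-in kernel-ridge pre-image approximation standing in for the pre-image scheme developed later in the thesis, and simple locally linear reconstruction weights for the mapping between subspaces; all data, kernels, parameter values and function names are illustrative rather than those used by KSM.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

# Toy paired training data standing in for (silhouette descriptor, pose) pairs
# generated from a single generic model; both share a 1-D latent curve.
rng = np.random.default_rng(1)
t = rng.uniform(0, 2 * np.pi, 300)
poses = np.c_[np.sin(t), np.cos(t), np.sin(2 * t)]
silhouettes = np.c_[np.cos(t), np.sin(3 * t), t / np.pi]

# One KPCA subspace per modality. The pose-space KPCA keeps scikit-learn's
# approximate pre-image map (kernel ridge regression), standing in for the
# pre-image approximation used by KSM.
kpca_s = KernelPCA(n_components=4, kernel="rbf", gamma=1.0).fit(silhouettes)
kpca_p = KernelPCA(n_components=4, kernel="rbf", gamma=1.0,
                   fit_inverse_transform=True).fit(poses)
S_tr = kpca_s.transform(silhouettes)        # training projections in M_s
P_tr = kpca_p.transform(poses)              # training projections in M_p

def ksm_infer(descriptor, k=8):
    """Schematic inference: locally linear mapping from M_s to M_p."""
    s_in = kpca_s.transform(descriptor[None, :])[0]
    idx = np.argsort(np.linalg.norm(S_tr - s_in, axis=1))[:k]   # k neighbours
    G = S_tr[idx] - s_in                     # locally linear (LLE-style) weights
    w = np.linalg.solve(G @ G.T + 1e-6 * np.eye(k), np.ones(k))
    w /= w.sum()
    p_out = w @ P_tr[idx]                    # transfer the weights to M_p
    return kpca_p.inverse_transform(p_out[None, :])[0]   # approximate pre-image

print(ksm_infer(silhouettes[10]), poses[10])
```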
1.2 Thesis Outline & Contributions
In chapter 2, a taxonomy and literature review of markerless motion capture techniques
are presented. Relevant machine learning algorithms for markerless motion capture are
discussed and classified into logical paradigms. This will serve as a context upon which
the advantages and contributions of the proposed technique (KSM) can be identified.
Thereafter, the remaining chapters and their contributions are outlined below:
∗ The term ‘previously unseen’ is used in this thesis to refer to vectors/silhouettes that are not in the training set.
• Subspace Learning for Human Motion Capture [chapter 3]: introduces the
novel concept of human motion de-noising via non-linear Kernel Principal Components Analysis (KPCA) and summarizes how de-noising can advantageously contribute to markerless motion capture. Arguments are presented which advocate
that the normalized Relative Joint Center (RJC) format for human motion encoding is not only intuitive, but also well suited to KPCA de-noising (a minimal sketch
of this de-noising idea appears after this list). De-noising results indicate that human motion is inherently non-linear, further supporting the integration of non-linear
de-noising techniques (such as KPCA) into markerless motion capture.
• Kernel Subspace Mapping (KSM)† [chapter 4]: integrates human motion denoising via KPCA (chapter 3) and the Pyramid Match Kernel [44] into a novel
(and efficient) markerless motion capture technique called Kernel Subspace Mapping.
The technique learns two feature space representations derived from the synthetic
silhouettes and pose pairs, and alternatively views motion capture as the problem
of mapping vectors between the two feature subspaces. Quantitative and qualitative motion capture results are presented and compared with other state of the art
markerless motion capture algorithms.
• Image Euclidian Distance (IMED) embedded KSM‡ [chapter 5]: shows how
the Image Euclidian Distance [116], which takes into account the spatial relationship
of local pixels, can efficiently be embedded into KPCA via the Kronecker product
and eigenvector projections. Mathematical proofs are presented which show that
the technique retains the desirable properties of Euclidian distance, such as kernel
positive definiteness, and can hence be used in techniques based on convex optimization such as KPCA (and KSM). Results are presented which demonstrate that
IMED embedded KSM is a more intuitive and accurate technique than standard
KSM through a 3D object viewpoint estimation application.
† This chapter is based on the conference paper [108] T. Tangkuampien and D. Suter: Real-Time Human Pose Inference using Kernel Principal Component Pre-image Approximations: British Machine Vision Conference (BMVC) 2006, pages 599–608, Edinburgh, UK.
‡ This chapter is based on the conference paper [106] T. Tangkuampien and D. Suter: 3D Object Pose Inference via Kernel Principal Component Analysis with Image Euclidian Distance (IMED): British Machine Vision Conference (BMVC) 2006, pages 137–146, Edinburgh, UK.
• Greedy KPCA for Human Motion Capture§ [chapter 6]: presents the novel
concept of applying Greedy KPCA [41] as a preprocessing filter in training set reduction for KSM. A human motion de-noising comparison between linear PCA, standard
KPCA (using all poses in the original sequence) and Greedy KPCA (using the reduced set) is presented at the end of the chapter. The results show that both KPCA
and Greedy KPCA have superior de-noising qualities over PCA, whilst Greedy KPCA
results in a lower evaluation cost (for both KPCA de-noising and KSM in motion capture) due to the reduced training set.
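As flagged in the chapter 3 summary above, the following is a minimal sketch of the KPCA de-noising pattern on a non-linear toy data set. It relies on scikit-learn's KernelPCA and its approximate pre-image map rather than the thesis implementation, and the data, kernel and parameters are illustrative only.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

# Toy stand-in for a clean training motion: samples on a non-linear curve.
rng = np.random.default_rng(2)
t = rng.uniform(0, 2 * np.pi, 400)
clean = np.c_[np.cos(t), np.sin(t), np.sin(2 * t)]

# Fit KPCA on the clean data. fit_inverse_transform=True learns an approximate
# pre-image map (kernel ridge regression), standing in for the pre-image
# approximation that KPCA de-noising relies on.
kpca = KernelPCA(n_components=3, kernel="rbf", gamma=2.0,
                 fit_inverse_transform=True).fit(clean)

# "De-noise" corrupted samples: project into the learnt subspace, then map back.
noisy = clean[:50] + 0.15 * rng.normal(size=(50, 3))
denoised = kpca.inverse_transform(kpca.transform(noisy))
print(np.mean(np.linalg.norm(noisy - clean[:50], axis=1)),
      np.mean(np.linalg.norm(denoised - clean[:50], axis=1)))
```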
Finally, in chapter 7, overall conclusions encapsulating the techniques presented in the previous chapters are drawn. Further improvements are highlighted, and most importantly,
possible future directions of research on Kernel Subspace Mapping are summarized.
§ This chapter is based on the conference paper [107] T. Tangkuampien and D. Suter: Human Motion De-noising via Greedy Kernel Principal Component Analysis Filtering: International Conference on Pattern Recognition (ICPR) 2006, pages 457–460, Hong Kong, China.
Chapter 2
Literature Review
Recently, markerless human motion capture has become one of the research areas in computer vision that rely heavily on machine learning methods. Human motion capture can
be classified into different paradigms and this chapter serves to elucidate the differences
between learning based approaches and the more conventional ones. As markerless motion
capture is a popular field of research, this chapter does not aim to provide a complete literature review of all the algorithms. Instead, the principal goal is to motivate and differentiate
the proposed method of Kernel Subspace Mapping (KSM) (chapter 4) against previously
proposed approaches. To this end, in section 2.1, taxonomies of computer vision-based
motion capture (similar to the ones introduced by Moeslund [72, 73] and Gavrila [43])
are summarized and combined. Based on the integrated taxonomy, in section 2.2, motion
capture techniques are reviewed and classified. This will serve as a context upon which the
advantages and contributions of the proposed technique (KSM) can be identified (section
2.3).
2.1 Motion Capture Taxonomy
There have been many attempts at developing a taxonomy for human motion capture
[43, 8, 7, 22]. The most logical, as highlighted by [16], is the taxonomy suggested by
Moeslund [72, 73], which categorizes the process of motion capture into four stages that
should occur in order to solve the problem. These stages are: initialization, tracking, pose
estimation, and finally, recognition. Initialization includes any form of off-line processing,
such as camera calibration and model acquisition. As for tracking and pose estimation, it
is important to highlight the differences between them. In the classification (figure 2.1), we
use tracking to refer to the run-time identification of pre-defined structure with temporal
constraints. Specifically for human tracking from images, the pre-defined structure may
be the entire person (high-level) or multiple rigid body parts representing a person (e.g.
arms, legs). We use the term pose estimation exclusively for processes which generate as
output, pose vectors which can be used to (fully or partially) animate a skeletal model in
3D. As highlighted by Agarwal and Triggs in [4], it is common for the two stages of tracking
and pose estimation to interact in a closed-loop relationship, in that accurate tracking
may take advantage of prior pose knowledge, and pose estimation may require some form
of tracking to disambiguate inconclusive poses. Finally, recognition is the classification of
motion sequences into discrete paradigms (e.g. running, jumping, etc). We believe that
most recognition algorithms can be augmented to pose estimation approaches if 3D joint
information is available, and therefore, have explicitly highlighted this extendability (with
an arrow) in figure 2.1.
The stages of the classification of Moeslund [72] do not have to be in any specific order,
and for some techniques, some stages (i.e. tracking and recognition) may even be excluded
completely. Each human motion capture approach (e.g. segmentation based, prediction
via tracking filter) can be classified as partially or fully belonging to any of the stages.
For example, a motion capture approach [23], which estimates pose via the use of Kalman
filters [118] (to track body parts) would be categorized in both the tracking and pose estimation stages. Specifically for markerless motion capture techniques based on machine
learning, such as [37, 113, 45, 6, 1, 75, 83], these can be classified into the initialization,
tracking and pose estimation stages. The learning stage (from training data) and the
inference stage (from test data) would be classified as initialization and pose estimation
respectively. For a full categorization and survey of markerless motion capture techniques,
the reader should refer to [72, 16].
Another popular taxonomy is the one suggested by Gavrila [43], which classifies motion
capture techniques into either 2D or 3D approaches∗ . As the goal of motion capture is to
infer 3D pose information, most 2D and 3D techniques will eventually generate 3D human
pose information. To distinguish between 2D and 3D techniques, in our classification, any
approach which requires processing and volumetric reconstruction of the human body in
3D space (e.g. voxel carving, shape from silhouettes) will be classified as a 3D technique.
The remaining image-based techniques will be classified as 2D approaches.
The two taxonomies can be combined together to illustrate an abstract view of the
motion capture literature (figure 2.1). For example, a motion capture technique based
on voxel reconstruction [70] can be classified as a 3D approach whilst also being classified under
initialization, tracking and pose estimation.
Since the core of Kernel Subspace Mapping (KSM) is based on an unsupervised learning
algorithm (KPCA), it would be useful, for comparison, to classify all the reviewed motion
capture literature into another two distinct classes: learning based techniques and non-learning based techniques. In figure 2.1, the blue labels highlight techniques that are
exemplar/learning based (i.e. require a training database) and the yellow labels indicate
techniques that are not. An interesting observation is that most of the recent work on 2D
markerless motion capture is based on learning algorithms, which indicates an increase in
the popularity of machine learning in computer vision applications.
∗ Note that in the survey by Gavrila [43], the taxonomy classifies techniques into 3 distinct classes: 2D approaches without explicit shape models, 2D approaches with explicit shape models, and 3D approaches. For simplicity, we have combined all 2D approaches into one category.
Figure 2.1: 2D Taxonomy plot to summarize the classification of the various motion capture literature reviewed in this thesis. The color coding emphasizes if a motion capture
technique is exemplar/learning based (i.e. requires a training database) [blue] or not based
on any learning algorithm [yellow]. Techniques that do not estimate full pose (e.g. upper
body only or 2D pose in images) are highlighted by an incomplete bar in the pose estimation column. Similarly, tracking in a reduced dimensional (embedded) space is signified
by a reduced bar in the tracking column. We believe that most recognition algorithms can
be augmented to pose estimation approaches, once 3D joint information is available. This
is indicated by the arrow at the end of each row pointing towards the recognition column.
2.2 Markerless Motion Capture
A review of human motion capture techniques, as summarized in the combined taxonomy
of Moeslund [72] and Gavrila [43] (figure 2.1), is now presented. The techniques are
discussed in the following categorizations: Section 2.2.1 reviews pose estimation techniques
which require processing and volumetric reconstruction in 3D (e.g. voxel carving, shape
from silhouettes). For quick technical comparison with KSM, section 2.2.2 summarizes the
relevant 2D learning/exemplar based motion capture techniques (i.e. require a training
set). Note that not all 2D markerless motion capture approaches require a training set for
pose inference. For example, Moeslund and Granum [74] use silhouettes in conjunction
with the typical human kinematic constraints (instead of using the subspaces defined by
the training data) to infer pose. Other popular approaches (that do not rely on training
data) include the motion tracking algorithm of the Sony Playstation’s EyeToy (which only
uses simple image differencing [36, 120]), the silhouette contour technique of Takahashi et
al [104] (which uses Kalman filter [118] to track feature points), the particle filter based
algorithm by Shen et al [93], and the product of exponential maps and twist motion
framework of Bregler and Malik [18].
2.2.1 Motion Capture via 3D Volumetric Reconstruction
Human pose estimation from 3D volumetric reconstruction is a natural approach to markerless motion capture. These techniques usually require as input synchronized images
from multiple calibrated cameras, as well as the cameras’ intrinsic and extrinsic parameters. The principal idea is to derive a 3D volume (enclosing the actor), which satisfies the
constraint imposed by the multi-view images, and use the volume to aid in 3D pose estimation. Popular algorithms for volumetric reconstruction include the Shape-from-Silhouette
(SFS) algorithm [27] and voxel-carving techniques [70, 111]. These 3D approaches can be
further categorized into two distinct classes: model-free and model-based techniques. In a
model-free setup, the surface/volume reconstruction at each time instance (sometimes in
combination with its textures) can be captured (and used directly in animation) without
prior knowledge of the human body [71, 100]. In a model-based setup [70, 111, 99, 48, 13],
prior knowledge of the human structure is used to aid in motion capture, and therefore,
usually allows pose estimation (i.e. the capture of joint orientations/positions for reanimation). The advantage of pose estimation is that it is efficient to encode (only the
structural joint orientations need to be encoded for each frame) and this allows its practical use in applications such as motion editing and motion synthesis [57, 62, 10, 84], as
well as human computer interaction (HCI). As this thesis specifically proposes a motion
capture technique (KSM) for pose estimation, only 3D model based approaches, which
can capture pose for biped animation will be reviewed.
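For orientation, the following is a minimal, hypothetical sketch of the carving step underlying such volumetric approaches: a voxel is kept only if it projects into the foreground of every calibrated view. The camera model and the inputs are assumptions for illustration, not the pipeline of any cited system.

```python
import numpy as np

def carve(voxel_centers, cameras, silhouettes):
    """Keep only voxels whose projection is foreground in every view.

    `voxel_centers` is an Nx3 array, `cameras` a list of 3x4 projection
    matrices and `silhouettes` the matching binary masks (all hypothetical
    inputs; real systems add calibration, visibility and resolution handling).
    """
    keep = np.ones(len(voxel_centers), dtype=bool)
    homog = np.c_[voxel_centers, np.ones(len(voxel_centers))]      # N x 4
    for P, mask in zip(cameras, silhouettes):
        uvw = homog @ P.T                                          # N x 3
        u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
        v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
        inside = (u >= 0) & (u < mask.shape[1]) & (v >= 0) & (v < mask.shape[0])
        foreground = np.zeros(len(voxel_centers), dtype=bool)
        foreground[inside] = mask[v[inside], u[inside]] > 0
        keep &= foreground                                         # carve away
    return voxel_centers[keep]
```

The surviving voxels approximate the visual hull of the actor, which a model-based method then fits with an articulated skeleton to recover joint orientations.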
A state of the art model-based approach to realistic surface reconstruction and pose estimation is the technique proposed by Starck et al [99, 48]. The technique has been shown
successful when using 9 calibrated and synchronized cameras (with 8 cameras forming 4
stereo pairs, and the remaining camera positioned overhead). A prior humanoid model is
used in conjunction with manually labelled feature points (e.g. skeletal joints and mesh
vertices) to match the images at each time frame. A combination of silhouette, feature
cues and stereo is then used in a constrained optimization framework to update the mesh
model, whilst preserving the parametrization of the surface mesh. Tracking with a prior
model allows the integration of temporal information to disambiguate under-constrained
scenarios, which may occur as a result of self occlusion. The technique has been shown
to successfully reconstruct complex motion sequences (e.g. jumping and dancing), and
in some cases, it is possible to visualize realistic creases of the actor’s clothing during
reanimation. As the humanoid model is updated at every time frame, we believe that
pose (joint orientations) can easily be estimated from the model. Two disadvantages are
highlighted by the authors [99], these being the requirement for manual labelling of feature
points and the constraints imposed by the shape of the prior model. Another state of the
art 3D pose estimation algorithm is the technique proposed by Cheung et al [28]. The
approach [28] takes into account temporal constraints in the form of a novel “Shape-from-Silhouette across time” algorithm [27]. Motion capture is partitioned into two distinct
stages: human model acquisition and pose estimation. In model acquisition, the joint
positions are registered in a sequential approach, where the actor is asked to rotate only
one joint at a time. The initialization step, which consists of the joint skeleton and body shape
acquisition, was reported to take a total of ≈7 hours [joint acquisition (≈5 hours), shape
acquisition (≈2 hours)]. During pose tracking and estimation, the Visual Hull alignment
uses the Shape-from-Silhouette reconstruction and photometric information in conjunction with the prior model. The authors reported an average tracking time of 1.5 minutes
per frame. The technique has been shown to successfully track complicated non-cyclical
motion such as aerobics, dancing and Kung Fu sequences.
Pose estimation approaches are sometimes designed specifically for the inference of
only the upper part of the human body (from torso upwards) (e.g. Bernier et al [13] and
Fua et al [42]). These techniques may potentially be useful for human computer interaction in the near future. In [13], Bernier et al used particle filters and proposal maps to
track fast moving human motion using depth images from a stereo camera. The technique
can successfully estimate pose at up to 10Hz without prior knowledge of the actor or the
background. In the work of Fua et al [42], 3 synchronized pre-calibrated cameras are used
to generate a 3D surface point cloud to constrain pose estimation of the upper body. The
approach uses optimization to deform an articulated model (with “metaballs” to simulate
muscle and joints) to align with the synchronized imageries.
Other notable (but similar to [99, 48, 28]) pose estimation techniques based on 3D volumetric reconstruction include [29, 79, 19, 117, 54, 70, 69]. In [29], Chu et al integrated
(the manifold learning algorithm) Isomap [110], to allow the extraction of “skeleton curve”
features from volumetric reconstruction of the human body. In [79], Niskanen et al used
a 4-6 camera setup to estimate 3D points and 3D normals of a surface mesh enclosing a
model at a time instance. Caillette et al [19] presented a volumetric reconstruction and
fitting scheme based on 4 calibrated cameras, but used Variable Length Markov Models for
prediction. Wang and Leow [117] derived a self-calibrated approach based on Nonparametric Belief Propagation [101], which allows pose inference via synchronized un-calibrated
cameras (the self-calibration is highlighted in figure 2.1). Kerl et al [54] improved on volumetric capture techniques by introducing a fitting algorithm, which uses stochastic meta
descent optimization (instead of deterministic ones). In addition to the space constraints
(imposed by the volumetric reconstruction), Mikić et al [70, 69] and Sundaresan et al [102]
imposed temporal constraints via the use of extended Kalman filters [53].
In general, 3D motion capture based on volumetric reconstruction requires an expensive
controlled environment of multiple calibrated cameras and clean silhouette segmentation.
Voxel reconstruction, Shape-from-Silhouette or other space carving algorithms usually contribute to the higher processing cost and lower capture rate of 3D based techniques (when
compared to learning based techniques [section 2.2.2]). On the other hand, volumetric
reconstruction techniques with a tracking framework can capture more complicated and
non-cyclical motion, and in some cases [99, 28], employ realistic surface models as well.
2.2.2 Learning or Exemplar Based Motion Capture Techniques
Markerless motion capture techniques based on learning algorithms have grown significantly in the last decade due to the availability of online motion capture data. Generally,
learning/exemplar techniques adopt a supervised learning approach and require labelled
input and output pairs as a training set. The principal concept is to learn the mapping
(from input to output) from the training set, and generalize it to unseen inputs (sampled
from a similar distribution as the training data) during online capture. The labelling (of
the training sets) can be generated manually as in [75, 15]. However, as human motion is
complicated, its encoding usually requires a large training set of high dimensional data,
which makes manual labelling impractical. To allow scalability, most recent approaches,
such as Agarwal & Triggs in [6] and Grauman et al in [45], used synthetic models to automatically generate synthetic training pairs for learning. To avoid over-fitting of the learnt
prior to a specific actor/model, multiple synthetic (mesh) models are used in training (e.g.
to capture walk sequences irrespective of yaw angle, a training set of 20,000 synthetic
silhouettes is generated from a number of different models in [45]). Interestingly enough,
not every member of the training set needs to be labelled, as shown by Navaratnam
et al [76], who extended the learning-based concept to allow a semi-supervised learning
approach (i.e. the training set consists of a combination of labelled [input & output pairs]
and unlabelled data [input without output, or vice versa]).
There are many different types of inputs that have been successfully used in learning/exemplar based motion capture. In the technique proposed by Bowden et al [15, 14],
a skin color classifier based on the Hue-Saturation space [67] is used to detect (image)
locations of the hands and head (as input). Using a similar concept, Micilotta et al [68]
trained the AdaBoost classifier to detect the locations of the head and hands in cluttered
images. In [5], Agarwal and Triggs used the scale invariant feature transform (SIFT)
descriptor [64] in conjunction with non-negative factorization [49] to encode edges in human images for training. In [65], Loy et al used manually labelled points in key frames
together with a prior model (for image space fitting) to interpolate pose from cluttered
images. These techniques [5, 68, 15, 14], although robust to pose estimation in cluttered
environments, involve a complex scheme of mostly inefficient shape descriptors (when compared to using binary human silhouettes). A more efficient approach (also by Agarwal &
Triggs [6]) infers pose using silhouettes from monocular images by direct regression from
points sampled from the silhouette’s edge via the Shape Context [11] descriptor to the
Euler pose space. Inferring high dimensional pose from monocular silhouettes is complicated because a high level of ambiguity arises due to self occlusion and the lack of depth
and foreground (photometric) information. In that case [6], a complex temporal tracking
algorithm needs to be employed to disambiguate inconclusive poses. To overcome ambiguities, 3D pose can be inferred from concatenated silhouette images of the actor captured
from multiple synchronized cameras (as proposed by Ren et al in [83] with 3 cameras and
Grauman et al in [45] with 4 cameras). Even though concatenating silhouettes has the
advantage of reducing the level of ambiguity (in input images), there is also a drawback,
in that the dimension of the input (i.e. the number of pixels) will increase, hence affecting
the technique’s performance. In particular, for learning based approaches (where the main
processing relies on comparing inputs to training data), the increase in input dimension
becomes a principal factor to consider when determining the practicality of the techniques.
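To make the regression-based idea described above concrete, the sketch below fits a plain ridge regressor from a synthetic silhouette descriptor to a synthetic 57-dimensional pose vector; it is only a stand-in for the relevance-vector and shape-context machinery of [6], and all data and dimensions are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Purely synthetic stand-ins: 1000 pairs of a 100-D "silhouette descriptor" and
# a 57-D "pose vector" related by an unknown smooth map plus a little noise.
rng = np.random.default_rng(3)
descriptors = rng.normal(size=(1000, 100))
hidden_map = rng.normal(size=(100, 57)) / 10.0
poses = np.tanh(descriptors @ hidden_map) + 0.01 * rng.normal(size=(1000, 57))

# Learn a direct descriptor -> pose regressor and apply it to a novel input.
reg = Ridge(alpha=1.0).fit(descriptors, poses)
predicted = reg.predict(rng.normal(size=(1, 100)))
print(predicted.shape)   # (1, 57): one full pose vector per input descriptor
```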
Specifically for silhouette-based approaches, the efficiency of shape comparison algorithms is a crucial attribute to consider for potential real-time systems. There have
been many approaches aimed at optimizing matching/comparison costs between input
and training data. In particular, Agarwal and Triggs [6, 1] improved on the shape context
approach to motion capture (originally proposed by Mori and Malik [75]) by using vector
quantization to compress the Shape Context histogram relationship between other silhouettes, so that the Euclidian norm is applicable (instead of the more common approach of using
the shortest augmenting path algorithm [52] to solve for the best match). In [83], Ren et
al used a boosted cascade of feature classifiers (introduced by Viola and Jones [115]) for
efficient silhouette comparisons. Other efficient shape descriptors, which may be useful
in silhouette-based motion capture, include the pyramid match kernel† of Grauman et al
[44] and the Active Shape Model of Cootes et al [30].
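As a hedged illustration of why vector quantization helps, the sketch below builds a codebook over toy local shape descriptors with k-means and reduces whole-silhouette comparison to a Euclidean distance between normalized codeword histograms; the descriptors are random stand-ins, not actual shape contexts, and the codebook size is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)

# Stand-ins for local shape descriptors sampled from many training silhouettes
# (random 60-D vectors here; real shape contexts are log-polar edge histograms).
training_descriptors = rng.normal(size=(2000, 60))
codebook = KMeans(n_clusters=100, n_init=10, random_state=0).fit(training_descriptors)

def silhouette_histogram(descriptors):
    """Normalized bag-of-codewords histogram summarizing one silhouette."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()

# After quantization, comparing two silhouettes is a cheap Euclidean distance.
h1 = silhouette_histogram(rng.normal(size=(150, 60)))
h2 = silhouette_histogram(rng.normal(size=(150, 60)))
print(np.linalg.norm(h1 - h2))
```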
In addition to using an efficient comparison algorithm, learning/exemplar based techniques should also consider strategies to minimize the search time (of the training set)
during pose inference. For exemplar based approaches that rely solely on a matching
framework (i.e. given an input, find the closest training data) [35, 83], it is obvious
why faster search algorithms will lead to a faster pose inference rate. For learning based
approaches that rely on a reconstruction framework (i.e. generate the output from a combination of training data), a complex search strategy (which can efficiently locate arrays
of similar training poses) will be required. A problem usually encountered in optimizing
search strategies for learning based (human) motion capture techniques is the high dimensionality and size of the training set. As highlighted by Shakhnarovich et al [94]: “For
complex and high-dimensional problems such as pose estimation, the number of required
examples and the computational complexity rapidly become prohibitively high.” To avoid a
large training set, Agarwal & Triggs [6] use Bayesian non-linear regression to filter out a
reduced training set that generalizes well to novel data. Shakhnarovich et al [94] learns a
set of hash functions, which indexes only the relevant exemplars required for a specific motion. The main difference between the two approaches is that in the former [6], a smaller
training set is selected from the original set to model the motion sequence, whereas in
† In chapter 4, we show how to efficiently integrate the Pyramid Match kernel [44] into KSM.
the latter [94], the larger set still remains, but a more efficient hashing algorithm is learnt
to search the full training set. An alternative to using a hashing algorithm to reduce
search time is to perform a stochastic sampling [95] or a Covariance Scaled Sampling [97]
to locate the optimal pose. In Covariance Scaled Sampling, a hypothesis distribution is
calculated by predicting the dynamics of the current time prior, and inflating the prior
covariance at the predicted center for broader sampling. Bray et al [17] use the dynamic
graph cut algorithm [55] to locate the global minimum (optimal pose), when inferring pose
from images without silhouette segmentation. Other human tracking approaches based
on constrained sampling algorithms include [33, 34, 82].
Instead of exploring ways to optimize the sampling algorithm for tracking, an alternative solution to avoiding the curse of dimensionality is to integrate manifold or subspace
learning algorithms into motion capture. This is possible because a majority of human
motion is coordinated [84]. The lower dimensional embedded space can be used to reduce
comparison time by automatically constraining the search space to within the learnt model.
Grauman et al [45] used a mixture of linear probabilistic principal components analyzer
(PPCA) models [112] to locally and linearly model clusters of similar training data. Li et al
[61] used the Locally Linear Coordination (LLC) algorithm [109] to enforce smoothness locally within the clusters of the pose (joint) space. Agarwal and Triggs [3] used kernel PCA
[90] (with the polynomial kernel based on the Bhattacharya histogram similarity measure
[56]) to learn the manifold of training silhouettes from the (vector quantized) shape context histogram descriptor [6]. In that case [3], the manifold in conjunction with the pose
vectors (in Euler angles) are then combined to identify regions of “multi-valueness” when
mapping from silhouette to pose space. Elgammal and Lee [37] integrated the unsupervised learning algorithm, Locally Linear Embedding (LLE) [85] to learn lower dimensional
manifolds of viewpoint specific silhouettes. Urtasun et al [113] uses Scaled Gaussian Process Latent Variable Models (SGPLVM) [46] to learn prior low dimensional embedding
for specific human motion (e.g. walking, golf swing). Some of these approaches [37, 113]
have only been shown to successfully track human joints using the same camera yaw angle
(rotation about the vertical axis) as the ones used to capture the training data. It is not
clear if subspace models learnt from a different yaw orientation can be integrated together
to form a single general model that can track joints irrespective of the camera’s yaw angle.
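As a small illustration of this manifold-learning strategy (not any of the cited systems), the snippet below embeds synthetic 57-dimensional pose vectors, generated from a single cyclic latent variable, into two LLE coordinates using scikit-learn; the data, dimensions and neighbourhood size are illustrative only.

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

# Toy stand-in for coordinated motion data: 57-D pose vectors generated from a
# single periodic latent variable (a crude surrogate for a cyclic walk).
rng = np.random.default_rng(5)
phase = np.sort(rng.uniform(0, 2 * np.pi, 600))
basis = rng.normal(size=(3, 57))
poses = np.c_[np.sin(phase), np.cos(phase), np.sin(2 * phase)] @ basis
poses += 0.01 * rng.normal(size=poses.shape)        # small noise for stability

# Learn a 2-D embedding in which nearby frames stay nearby; search and tracking
# can then be constrained to this space rather than the full 57-D pose space.
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2)
embedding = lle.fit_transform(poses)
print(embedding.shape)                               # (600, 2)
```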
From our experiments [108], one of the most difficult aspects of motion capture is not
how to track the joint positions of the human body (when viewpoint is known), but how
to automatically infer the correct yaw orientation of the model, as well as to correctly
track the joints‡ . To do so, the prior learnt model must generalize well to unseen poses,
as well as poses from unseen camera angles. To capture human pose irrespective of yaw
orientation, Ren et al [83] separated yaw and pose inference into two independent stages.
Yaw inference is viewed as a multi-class classification problem, which is always performed
before pose inference. There are 36 classes in total, with each class encapsulating a 10 degree sector of the vertical axis rotation. The input is first classified into the correct class,
before using a viewpoint specific model (with a 10 degree range) to infer pose. Other
silhouette-based approaches that can accurately infer pose irrespective of yaw orientation
include the “shape+structure” model of Grauman et al [45] and the regressive model of
Agarwal and Triggs [6].
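A toy sketch of this two-stage, classify-then-infer scheme is given below; the 36-way binning follows the 10-degree sectors described above, while the per-class models are hypothetical placeholders.

```python
import numpy as np

def yaw_class(yaw_degrees, n_classes=36):
    """Map a yaw angle to one of 36 discrete 10-degree viewpoint sectors."""
    return int(np.floor((yaw_degrees % 360.0) / (360.0 / n_classes)))

# Hypothetical placeholders for the 36 view-specific pose models: the first
# stage picks a sector, the second stage defers to that sector's model.
view_specific_models = {c: (lambda silhouette, c=c: f"pose from model {c}")
                        for c in range(36)}

def infer_pose(silhouette, estimated_yaw):
    return view_specific_models[yaw_class(estimated_yaw)](silhouette)

print(yaw_class(187.5))          # -> 18 (the 180-190 degree sector)
print(infer_pose(None, 187.5))   # dispatches to the class-18 placeholder model
```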
In general, the learning/exemplar based techniques summarized in this section [45,
5, 75, 15, 37, 83, 6, 95, 97, 113, 68], adopt a supervised learning approach and require
labelled training data at initialization. In some approaches, unsupervised learning algorithms, such as PCA [15, 50], LLE [37], SGPLVM [46], have been used initially to learn
lower dimensional embedding to constrain search and tracking. Even so, a supervised
learning algorithm is still used to determine the mapping (between embedding and pose
space) during pose inference. In comparison with 3D based markerless motion capture
algorithms, learning based techniques are still limited to capturing less complicated (and
mostly cyclical) motions, but 2D learning-based approaches generally do not require intrinsic camera calibration, and therefore, are cheaper and easier to initialize. The state
of the art techniques [6, 96, 45, 37] mostly concentrate on robust tracking and pose
inference of spiral walk and turn sequences captured via un-calibrated cameras. Complex
‡
For this thesis, we have considered viewpoint change to be a result of rotation about the vertical axis
(yaw rotation). The camera/sensor is assumed to remain at a fixed vertical height during the rotation
18
CHAPTER 2. LITERATURE REVIEW
paradigms of pose estimation in cluttered background or poor silhouette segmentation
environments are usually used to test and compare the robustness of proposed techniques.
From our review of the current literature, we believe the approach adopted by Agarwal
and Triggs in [6] to be the current state of the art due to its ability to accurately and
robustly infer pose and yaw orientation from monocular silhouettes.
2.3 Motivation for Kernel Subspace Mapping (KSM)
Kernel Subspace Mapping (KSM) (chapter 4) falls in the paradigm of 2D learning-based
approaches. With regard to the taxonomy classification of Moeslund [72] (figure 2.1),
at initialization, KSM uses motion capture data from a marker-based system to learn a
de-noising subspace of human motion (section 3.4). Efficient tracking is performed in the
KPCA subspace (instead of the full pose space), hence its partial classification in the tracking column (figure 2.1). KSM uses concatenated silhouettes from synchronized cameras
in conjunction with subspace temporal constraints to infer full pose vectors for full body
biped animation.
In general, most of the 2D learning-based approaches [6, 45, 37] rely on learning from
a training set consisting of multiple people to allow good generalization to unseen input.
KSM adopts a different approach by learning the de-noising subspace in the normalized
relative joint orientation space (which is consistent for all actors), and de-noises (previously) unseen silhouettes with the learnt prior before pose inference. The advantage of
this approach is a substantial reduction in training size, as only the motion sequence from
a single generic model is used in learning. For comparison, in the technique proposed by
Grauman et al [45], a large training set of 20,000 silhouettes (from multiple actors) is
used to train a motion capture system consisting of 4 synchronized cameras. To capture
similar walk sequences irrespective of (yaw) viewing angle, KSM shows similar results (to
[45]) using a training set of only 343 exemplars and 2 synchronized cameras. The ability to
accurately and efficiently estimate pose irrespective of yaw orientation is one of the many
advantages of KSM. For learning-based approaches, inferring the yaw orientation (rotation
about the vertical axis) is significantly harder than inferring, say, an additional rotation of
the arm. This is because yaw rotation is not coordinated with any joint rotation (as any
motion sequence can be captured from an infinite number of angles). Techniques which
learn a view specific prior model for tracking (or a discrete combination of multiple view
specific models) [37, 83, 113, 35] are limited to inferring pose from the pre-defined viewpoints. KSM is not constrained to view-specific estimation because view-point inference
and pose estimation are not performed as two independent steps as in [83, 37]. As a result,
view-point inference is not only possible, but continuous in the sense that pose estimation
is possible from unseen (camera) yaw angles.
As is common with most supervised learning approaches to markerless motion capture
[45, 113, 6, 83], KSM requires (for training) a labelled set of silhouette and pose pairs.
KSM is similar to the mixture of regressors of Agarwal and Triggs [3], in the sense that both
techniques rely on the unsupervised learning algorithm, KPCA [90]. However, there are
substantial differences between the two approaches. In [3], KPCA with a complicated kernel (polynomial kernel with the Bhattacharya histogram similarity measure [56]) is used to
project (vector quantized) shape context descriptors from the silhouette space. The shape
context calculation (this occurs before the vector quantization step) has quadratic complexity in the number of features (points sampled from the silhouette's edge). KSM uses a shape descriptor based on the efficient pyramid match kernel [44], which has only linear complexity in the cardinality of the set. Instead of learning subspaces from the histograms of the silhouette descriptor (as in [3]), which are ambiguous and noisier (as the descriptor is
based on discrete histogram bins), KSM initially learns a de-noising subspace from the
continuous relative joint center space (section 3.4.1), which is more efficient to analyze
and unambiguous (except for when a limb is fully extended [figure 3.6]). Furthermore,
KSM is based on a KPCA de-noising framework (i.e. requires pre-image approximations
[91]) of the Gaussian kernel, which already has well established and efficient pre-image
estimators [87]. In [3], there is no approximation of pre-images, and such a stage would
require the pre-image approximations of complex and inefficient polynomial kernels of discrete histogram descriptors (when compared to pre-image approximations using Gaussian
kernels [87]). Finally, instead of regression from silhouette to Euler pose space as in [6, 3],
we use the normalized relative joint center encoding RJC (section 3.4.1). The problems
of mapping to Euler space (appendix A.1.1) are that Euler angles suffer from singularities and gimbal lock, are non-commutative, and have a complicated group structure rather than that of a simple vector space. On the other hand, the normalized RJC format (section 3.4.1) encodes pose using 3D points on an array of unit spheres, which is simple to parameterize (from vector space) and does not suffer from gimbal lock. An advantage of normalized RJC is that the mapping from Euclidian to geodesic distance (between two points) on a sphere can be closely approximated using Gaussians, and is therefore well-suited for use with non-linear Gaussian kernels, as required in KPCA de-noising (section
3.3.1).
Kernel Subspace Mapping (KSM) is also similar to the activity manifold learning
method of [37], which uses Locally Linear Embedding (LLE) [85] to learn manifolds of
human silhouettes for each viewpoint. In that system, during capture, the preprocessed
silhouettes must be projected onto all the manifolds of each static viewpoint before performing a one dimensional search on each manifold to determine the optimal pose and
viewing angle. Furthermore, because the manifolds are learnt from discrete viewing angles, it is not possible to accurately infer pose if the input silhouettes are captured from
a previously unseen viewpoint, for which a corresponding manifold has not been learnt. Kernel
Subspace Mapping (KSM), which does not suffer from these inefficiencies and problems,
has the following advantages:
• Instead of learning separate subspaces for each viewpoint, a single combined subspace
from multiple viewpoints (rotated about the vertical axis) is learnt, hence giving the
technique the ability to accurately infer heading angle and human pose information
in a single step (and avoid explicitly searching multiple subspaces).
• Kernel Subspace Mapping, which is based on already well established de-noising
algorithms (KPCA) [90], generalizes well to silhouettes of unseen models as well as
silhouettes from unseen viewing angles.
• KSM is able to implicitly project noisy input silhouettes (which are perturbed by
noise and lie outside the subspace defined by the training silhouettes) into the
subspace, hence allowing the use of a single generic training model for training,
which leads to a reduction in inference time.
KSM uses images of silhouettes from synchronized cameras without the need for their
intrinsic parameters (and therefore does not require any camera calibration). Extrinsic
parameters are also not required, provided that the relative positions between the cameras are consistent between training and online pose inference. It is
important to note that KSM is a multiple camera approach to pose estimation, rather
than the monocular approach proposed by Agarwal and Triggs [6, 3]. Pose inference from
monocular silhouettes as in [6] is substantially harder as the level of ambiguity increases
significantly in monocular sequences. In those cases, a more complicated tracking algorithm needs to be employed. KSM only uses a simple tracking algorithm embedded in
the de-noising subspace, hence the requirement for (at least) two cameras to generate a
learnt subspace that does not significantly overlap upon itself (e.g. the side view of a
monocular silhouette walk sequence in [37]). From our experiments, we concluded that
two cameras are sufficient in this regard, as opposed to the 3 camera framework in [83]
or the 4 cameras setup in [45]. The important point to note is that KSM does not aim
to improve on monocular approaches to pose estimation and their efficacy should not be
directly compared as such. Instead, KSM aims to improve on the ability for learning-based
approaches to generalize to unseen actors via the incorporation of a de-noising framework
learnt from a single model, rather than the more common approach of using a large set
from different models. For the capture of a walk sequence irrespective of angle, the reduction in training size allows KSM to efficiently infer full pose and yaw orientation at a rate
of up to 10Hz. For comparison, Caillette et al [19] can capture at a speed of 10Hz, but
requires volumetric reconstruction from four calibrated cameras. Bray et al presented an
approach [17] which can infer pose without segmentation at an average of approximately
50 seconds per frame. Other markerless techniques [26] reported pose inference times of
between 3 to 5 seconds per frame.
It is also important to highlight the differences between 3D markerless pose estimation techniques, which are based on volumetric reconstruction (section 2.2.1) and 2D
markerless learning-based techniques (section 2.2.2). In 3D pose estimation techniques
[111, 70, 102, 54, 99, 48], an expensive setup of multiple calibrated cameras (usually more
than four) is required to derive the enclosing volume. Capture rate is usually lower
than 2D approaches due to volumetric/surface reconstruction. On the other hand, 3D
based techniques can capture more complicated non-cyclical human motion and, in some
cases, have even been shown to capture complex surfaces (e.g. wrinkles in clothing) for
reanimation [99, 48]. KSM, being a 2D learning-based approach, should not be directly
compared with 3D approaches in relation to its ability to estimate pose, as the constraints
vary substantially between the two paradigms. Nevertheless, the reader should bear in
mind that the ultimate goal of full body pose estimation remains similar between the two
paradigms.
Finally, there is a large community of researchers in the field of machine learning,
which concentrates specifically on the optimization and improvement of kernel techniques,
such as Kernel PCA [91, 87] and Support Vector Machines (SVMs) [40]. KSM, being a
technique which is based on the well-established KPCA algorithm [90], could most likely
take advantage of any improved algorithms that may be proposed (from the machine
learning community) in the future. To support this argument, two recently proposed
algorithms, the Image Euclidian Distance [116] and the Greedy Kernel PCA algorithm [41]
are integrated into the problems of 3D object viewpoint estimation (chapter 5) and human motion capture via KSM (chapter 6) respectively. There has not been, to the author's knowledge, any application of the Greedy Kernel PCA algorithm in markerless motion
capture, nor any use of the Image Euclidian Distance (IMED) in 3D object pose estimation
problems. The results of the improved approaches indicate that theoretical improvements
on KPCA can be transferred to practical improvements of KSM with relatively minor
modifications.
Chapter 3
Subspace Learning for Human Motion Capture
This chapter introduces the novel concept of human motion de-noising via non-linear
Kernel Principal Components Analysis (KPCA) [90] and summarizes how de-noising can
contribute to markerless motion capture. Arguments are presented which advocate that the normalized Relative Joint Center (RJC) format (section 3.4.1) for human motion encoding is not only intuitive, but also well suited for KPCA de-noising. Comparisons of motion de-noising between linear and non-linear approaches are presented to further
demonstrate the advantages of using non-linear de-noising techniques (such as KPCA) in
markerless motion capture.
3.1 Introduction
The inference of full human pose and orientation (57 degrees of freedom) from silhouettes
without the need for real world 3D processing is a complex and ill-conditioned problem.
When inferring from silhouettes, ambiguities arise due to the loss of depth cues (when
projecting from 3D onto the 2D image plane) and the loss of foreground photometric
information. Another complication to overcome is the high dimensionality of both the
input and output spaces, which also leads to poor scalability of learning based algorithms.
To avoid confusion it is important to note that, visually, an image as a whole is considered to exist in 2D. However, low level analysis of the pixels of an image (as applied in
KSM) is considered high dimensional (more than 2D) as each pixel intensity represents
a possible dimension for processing. Specifically in our work, the input is the vectorized
pixel intensities of concatenated images from synchronized cameras. The output is the
high dimensional human pose vector encoded in the normalized Relative Joint Center
(RJC) format (section 3.4.1). For exemplar based motion capture techniques, such as
[113, 94, 45, 75, 6, 37], searching the full dimension of the output pose vector space (e.g.
57 dimensions for RJC format) is usually avoided. A possible solution (to avoid searching
the complete pose space) is to exploit the correlation in human movement [84, 14] and learn
a reduced subspace in which searching and processing can be performed more efficiently.
The dimensionality reduction technique Principal Components Analysis (PCA) [51] has
been shown to be successful in learning human subspaces for the synthesis of novel human
motion [84, 14]. However, PCA, being a linear technique, is limited by its inability to effectively de-noise non-linear data. By de-noising, we mean the minimization of unwanted
noise by projecting data onto a learnt subspace (determined from either PCA, Kernel
PCA [90] or other subspace learning algorithms) and ignoring the remainder (section 3.2).
To overcome the non-linearity in encoding, the Locally Linear Embedding algorithm [85]
(which is a non-linear dimensionality reduction/manifold learning technique) has become
popular in human motion capture [37]. The problem with LLE is that it does not yet have
well established algorithms for the projection of out-of-sample points [12]. To project unseen points to the manifold embedding, the input is usually appended to the training set
and the entire manifold relearned∗ .
This chapter aims to find a suitable subspace learning algorithm (for human motion
capture) which can effectively de-noise non-linear data, as well as one which is computationally efficient in the projection (de-noising) of novel inputs. To advocate the use of
non-linear learning techniques in human motion capture, experiments should show that
the simplest form of human motion (e.g. walking, running) is non-linear and, therefore,
the motion would not be effectively de-noised by linear techniques such as PCA (when compared to KPCA).
∗ An investigation into the practical application of LLE for silhouette based motion capture was conducted by the author in [105]. Specifically for the silhouette based approach to motion capture, the major problem encountered was the (computationally) expensive projection cost of novel silhouette feature vectors via LLE. Recently, the originators of LLE, Saul and Roweis, have investigated this problem and have presented two possible solutions in the update of their original Locally Linear Embedding paper [86].
The remainder of this chapter is organized as follows: Firstly, section 3.2 reviews the
principal components analysis (PCA) algorithm [51, 98], as well as highlighting its linear
limitation. The non-linear kernel PCA (KPCA) algorithm [90] is reviewed in section 3.3.
In section 3.4, the novel application of KPCA in human motion de-noising is introduced,
and the advantageous relationship between human subspace learning via KPCA [90] and
markerless motion capture is summarized. Results are presented in section 3.5 to allow
quantitative and visual comparison between PCA de-noising and KPCA de-noising of
noisy motion sequences.
3.2 Review: Principal Components Analysis (PCA)
Principal components analysis (PCA) [51] is a powerful statistical technique with useful
applications in areas of computer vision, data compression, pattern/face recognition and
human motion analysis [84, 24]. PCA effectively creates an alternative set of orthogonal
‘principal axes’ (basis vectors) with which to describe the data (figure 3.1). Geometrically
speaking, expressing a data vector in terms of its principal components is a simple case of
projecting the vector onto these independent axes (provided that the vector has already
been centered around the training mean). Once the principal components are created, each
projection is effectively a dot product, the calculation of which is only linear (in terms of the
vector’s dimension) in complexity. Furthermore, multiple data vectors can be concatenated
into a matrix and the projection performed in batches via matrix multiplications.
Figure 3.1: Linear toy example to show the results of Principal Components Analysis
(PCA) de-noising. The principal axes are highlighted in red.
Consider, for example, the 2D noisy (linear) toy data set in figure 3.1, where PCA will
calculate two principal axes. Note that the principal axes must be orthogonal to each
other and therefore, there can only be as many principal axes as the dimension of the
original data. In this case, the main principal axis (1st principal component) lies along
the direction of greatest variance, whereas, the less significant (2nd component) axis lies
orthogonal to it. By ignoring data projection onto the 2nd component (and considering it
as noise), the dimensionality of the problem reduces whilst minimizing loss of information.
Figure 3.2: Toy Example to show the projection of data onto the 1st principal axis from
PCA. The projections along the 2nd component are considered to be noisy data and ignored.
Due to its simplicity, there are widespread applications of PCA in computer vision,
more specifically in areas of high dimensionality analysis where subspace learning or dimensionality reduction are advantageous. There is a vast array of different applications (of
PCA), ranging from, for example, data compression [14], optimization [84] to ‘de-noising’
[91]. However, in most applications, the underlying concept remains relatively the same,
in that, given training data in $\mathbb{R}^D$, PCA initially learns a set of $K$ principal axes (where $K \leq D$) to efficiently represent the data. In data compression via PCA, a 'lossy' form of
compression is achieved by storing only the projection coefficients onto the first K principal axes. In human motion optimization [84], only the projection coefficients onto the
K principal axes are processed and analyzed. In de-noising, the projections onto the K
principal components are considered as the clean data (or that which is closest to the clean
data). In most cases, vector projections onto the less significant components are ignored.
To this end, two crucial questions that should be asked are:
• how to automatically determine the optimal number of K principal components for
a specific data set, and
• how effective these K principal components are in representing the clean data.
For notational purposes, a mathematical formulation of PCA [51, 90] is now presented to aid in the investigation of these problems. A training set of $N$ centered vectors in $\mathbb{R}^D$ is denoted as $X^{tr}$, whereas the $i$-th training (column) vector is represented as $x_i$. A novel (column) vector, which is not in the training set, is represented as $x$. Note that the data set must be preprocessed and centered around its mean beforehand (i.e. $\sum_{i=1}^{N} x_i = 0$). Thereafter, to determine the principal components, PCA diagonalizes the covariance matrix
$$C = \frac{1}{N}\sum_{i=1}^{N} x_i x_i^T \qquad (3.1)$$
and solves for the Eigenvectors $v$ and the diagonal Eigenvalue matrix $\lambda$ as follows:
$$\lambda v = C v, \quad \text{for } \lambda_k \geq 0 \text{ and } v \in \mathbb{R}^D. \qquad (3.2)$$
Each (column) Eigenvector $v_k$ (in the matrix $v$) corresponds to a principal axis for data projection. On the other hand, each Eigenvalue $\lambda_k$ (the $k$-th diagonal in matrix $\lambda$) encodes the variance of the training data along each corresponding Eigenvector. Hence by
ordering the Eigenvectors in decreasing order of their corresponding Eigenvalues, a set of
ordered principal components is attained. Assuming that the optimal number of principal
components is known (the selection of this will be explained in the next paragraph), a
matrix of feature vectors F is constructed [98], where
$$F = [v_1 \ v_2 \ v_3 \ \ldots \ v_K]. \qquad (3.3)$$
Projecting the novel point x (which has already been centered around the mean of the
training set) onto the first K principal components is achieved via a single matrix multiplication
$$\beta = F^T x, \qquad (3.4)$$
where the $k$-th column $\beta_k$ (of the coefficient matrix $\beta$) denotes the coefficient of the projection of $x$ on the principal axis $v_k$. Conversely, synthesizing a data vector from its projected coefficient is simply a weighted sum of the principal components:
$$\hat{x} = F\beta. \qquad (3.5)$$
To map the centered reconstruction back to its original space, the reconstructed vector x̂
can be added to the training mean.
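For concreteness, equations 3.1–3.5 can be gathered into a short NumPy sketch. The sketch below is purely illustrative (it is not the implementation used in this thesis) and all function and variable names are hypothetical:

```python
import numpy as np

def pca_denoise(X_tr, x_noisy, K):
    """Illustrative PCA de-noising (equations 3.1-3.5).

    X_tr    : (N, D) training matrix, one training vector per row.
    x_noisy : (D,) novel vector to de-noise.
    K       : number of principal components to retain.
    """
    mean = X_tr.mean(axis=0)
    Xc = X_tr - mean                          # centre the training set
    C = (Xc.T @ Xc) / Xc.shape[0]             # covariance matrix (3.1)
    eigvals, eigvecs = np.linalg.eigh(C)      # solve lambda v = C v (3.2)
    order = np.argsort(eigvals)[::-1]         # decreasing Eigenvalue order
    F = eigvecs[:, order[:K]]                 # feature matrix F (3.3)
    beta = F.T @ (x_noisy - mean)             # projection coefficients (3.4)
    return F @ beta + mean                    # reconstruction plus training mean (3.5)
```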
The remaining factor to consider is how to determine the optimal number of principal
components. In order to do so, the meaning of the optimal value, from a de-noising perspective, must be defined. The goal of de-noising is to learn a subspace which best represents the 'true' training subspace, given the constraints of the (de-noising) algorithm. The point in the learnt subspace closest to a noisy data point is considered the cleanest representative of this entity. At this point, it is important to highlight the main difference between 'clean' data and de-noised data (which lies in the learnt subspace). As the de-noiser may not be fully representative of the training data, the de-noised vector may still retain unwanted noise. Therefore, de-noised data is not necessarily (and most likely not) clean data. The goal is to eliminate as much noise as possible by learning a subspace which is as close as possible to the 'true' subspace†, without over-fitting. To this end, 'de-noising' is the elimination of the components of any point which cause it to deviate from
the learnt subspace. For example, in figure 3.2, projecting a noisy point onto the subspace
(defined by the 1st principal axis) via equation 3.4 is considered a case of linear de-noising.
In our work, for PCA de-noising [39], the optimal number of principal components $K^+$ will be defined as
$$K^+ = \underset{K \in [1,D]}{\operatorname{argmin}} \sum_{i=1}^{N} \| x_i - D_K(x_i + \Delta n_i^\sigma) \|^2, \qquad (3.6)$$
where $\Delta n_i^\sigma$ is a random sample from a Gaussian white noise signal with variance $\sigma^2$. The de-noising function is signified by $D_K(\cdot)$ and is a combination of the linear projection (3.4) and re-synthesis (3.5) of the noisy data from the first $K$ principal components (ordered in decreasing Eigenvalues).
† For interested readers, an analysis of noise in linear subspace approaches is conducted by Chen and Suter in [25]. The article aims to investigate the de-noising capacity, which indicates, in matrix terms, how close a low rank matrix (learnt subspace) is to the noise-free matrix (the 'true' training data).
For the sake of simplicity, in equation (3.6), there is only a single iteration over the
entire training set for each value of K during tuning (this is the process of selecting the
optimal value K + ). To ensure that the de-noising parameter generalizes well to novel test
data, the optimal number of principal components is tuned using κ-fold cross-validation
[39]. In this case, the training set is partitioned into κ testing subsets. For each subset,
the remaining κ − 1 subsets are combined to create the training set. This effectively
ensures that the training and testing subsets are mutually exclusive during the tuning
process, hence representing a scenario which is more likely to occur in practice. To this
end, equation 3.6 becomes:
$$K^+ = \underset{K \in [1,D]}{\operatorname{argmin}} \sum_{k=1}^{\kappa} \sum_{i=1}^{\operatorname{card}(X_{tr}^k)} \| x_i^k - \bar{D}_K^k(x_i^k + \Delta n_i^\sigma) \|^2, \qquad (3.7)$$
where $X_{tr}^k$ denotes the $k$-th subset and $\bar{D}_K^k(\cdot)$ represents the de-noising function using the principal components learnt from the remaining $\kappa - 1$ subsets (i.e. not inclusive of data in subset $k$). As the size of each subset may vary, the cardinality of the $k$-th subset is denoted by $\operatorname{card}(X_{tr}^k)$.
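As an illustration of equation 3.7, the κ-fold selection of $K^+$ might be sketched as follows (this builds on the hypothetical pca_denoise function sketched above; the exhaustive loop over every value of $K$ is shown only for clarity and would be restricted to a discretized range in practice):

```python
import numpy as np

def tune_K_kfold(X_tr, sigma, kappa=5, seed=0):
    """Illustrative kappa-fold tuning of the number of components K+ (equation 3.7)."""
    rng = np.random.default_rng(seed)
    N, D = X_tr.shape
    folds = np.array_split(rng.permutation(N), kappa)      # kappa mutually exclusive test subsets
    errors = np.zeros(D)
    for K in range(1, D + 1):
        for test_idx in folds:
            train_idx = np.setdiff1d(np.arange(N), test_idx)
            for i in test_idx:
                noisy = X_tr[i] + rng.normal(0.0, sigma, size=D)        # add Gaussian white noise
                denoised = pca_denoise(X_tr[train_idx], noisy, K)       # de-noise with the other folds
                errors[K - 1] += np.sum((X_tr[i] - denoised) ** 2)
    return int(np.argmin(errors) + 1)    # K+ minimising the cross-validated error
```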
Many other techniques exist for the selection of the optimal parameter for PCA. For example, in human motion synthesis via PCA, Safonova et al [84] choose $K^+$ such that:
$$E_r = \frac{\sum_{i=1}^{K^+} \lambda_i}{\sum_{i=1}^{D} \lambda_i} \geq 0.9 \qquad (3.8)$$
Geometrically speaking, because the i-th Eigenvalue encodes the variance along the i-th
principal component (Eigenvector), equation 3.8 effectively selects $K^+$ such that the PCA projections represent more than 90% of the variance of the training data. In that case
[84], PCA is used for data compression to allow optimization in a lower dimensional space,
whereas in our work, the parameter is tuned specifically to optimize PCA’s ability in the
de-noising of novel data (equation 3.7).
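For comparison, the variance-ratio criterion of equation 3.8 reduces to a cumulative sum over the ordered Eigenvalues. A hypothetical sketch (names are illustrative only):

```python
import numpy as np

def K_for_variance_ratio(eigenvalues, ratio=0.9):
    """Smallest K whose leading Eigenvalues explain at least `ratio` of the
    total variance (equation 3.8)."""
    lam = np.sort(np.asarray(eigenvalues))[::-1]          # decreasing order
    explained = np.cumsum(lam) / np.sum(lam)              # cumulative variance ratio
    return int(np.searchsorted(explained, ratio) + 1)
```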
3.2.1 Limitations of PCA
Even though PCA is an efficient and simple statistical learning technique, there are two
crucial requirements for PCA de-noising to work effectively. These are:
• that the data remains within the original space‡ , and,
• that the clean data lie on or near a linear subspace.
The first constraint is obvious: since the orthogonal principal components lie within the span of $\mathbb{R}^D$ (the original space), data constructed from these components must remain in $\mathbb{R}^D$. The second limitation depends on the specific data set (i.e. is the data linear or non-linear?). This can best be illustrated with a simple non-linear toy example. Consider, for example, the non-linear toy data set (in figure 3.3), which has a mean square error of 0.1259.
Figure 3.3: Toy example to illustrate the limitation of PCA on the de-noising of nonlinear data. The clean subspace is represented by the blue curve and the noisy test data
represented by the red crosses.
In this case, the goal of PCA de-noising is to project the noisy (red) points onto the blue
curve defining the clean subspace. Visually, for non-linear de-noising, a good projection for a noisy point would be one which is orthogonal to the curve's tangent (at the point of intersection). From figure 3.4, it is clear that PCA de-noising is ineffective in the
de-noising of non-linear data. By projecting the test data (red points) onto the 1st principal
component, the mean error actually increases by more than four times the original error.
‡ The ability to map data to an alternative space (from the original space $\mathbb{R}^D$) is advantageous because it introduces a new degree of flexibility for data fitting and processing (this concept is investigated later for KPCA in section 3.3).
Figure 3.4: PCA de-noising of non-linear toy data via projection onto the 1st principal
component.
Obviously, two principal component axes can be used in de-noising, but as this is already the dimension of the original data, there is no de-noising effect and the error remains at its original value (up to some small rounding differences from reconstruction). For this specific example, a possible solution may be to view this as a linear fit in the polynomial space $(x^n, x^{n-1}, \ldots, x^2, xy, y^2, x, y, 1)$. This is the basic concept of Kernel PCA (section 3.3), where data is mapped to an alternative feature space where linear fitting/de-noising is possible. The problem that needs to be solved is how to automatically determine the optimal mapping for each specific data set.
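To make the idea concrete, a degree-2 version of such an explicit polynomial map for 2D points could be written as below; ordinary linear PCA applied to the mapped data can then fit a curved (quadratic) subspace that it cannot fit in the original 2D space. This is a hypothetical illustration only, since KPCA (section 3.3) avoids forming the explicit map altogether:

```python
import numpy as np

def quadratic_feature_map(points):
    """Explicit degree-2 map (x, y) -> (x^2, xy, y^2, x, y, 1) for an (N, 2) array of 2D points."""
    x, y = points[:, 0], points[:, 1]
    return np.column_stack([x**2, x * y, y**2, x, y, np.ones_like(x)])
```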
3.3 Kernel Principal Components Analysis (KPCA)
Kernel Principal Components Analysis (KPCA) [90, 91, 87] is a non-linear extension of
PCA. Geometrically speaking, KPCA implicitly maps non-linear data to a (potentially
infinite) higher dimensional feature space where the data may lie on or near a linear subspace. In this feature space, PCA might work more effectively. The basic principle of
implicitly mapping data to a high dimensional feature space has found many other areas
of application, such as Support Vector Machines (SVMs) [31, 89] for optical character
recognition of handwritten characters [39], and image de-noising [91]. To the author’s
knowledge, there is, however, no previous work on the de-noising of human motion and
silhouettes for human motion capture using KPCA projections and pre-image approximations. To show that KPCA is useful in motion de-noising, in section 3.5, an analysis of
the motion sequences is presented which aims to show that the (motion) sequences are
inherently non-linear, and therefore should be de-noised by non-linear techniques, such as
KPCA.
3.3.1 KPCA Feature Extraction & De-noising
For completeness, a formulation of the KPCA algorithm as introduced by Schölkopf et al
[90] will now be presented. Using the same notations as PCA, the training set of N data
vectors is denoted by $X^{tr}$ in $\mathbb{R}^D$. Each element of the training set is indexed by $x_i$. Each novel input point will, again, be represented by $x$. The non-linear mapping from the input
space to a higher dimensional feature space F will be denoted by Φ, where
$$\Phi : \mathbb{R}^D \rightarrow F, \quad x \mapsto \Phi(x). \qquad (3.9)$$
Provided that a suitable mapping for $\Phi$ can be found, the covariance matrix (similar to equation 3.1, but in feature space $F$) can be defined as:
$$\bar{C} = \frac{1}{N}\sum_{i=1}^{N} \Phi(x_i)\Phi(x_i)^T. \qquad (3.10)$$
Note that the symbol $\bar{C}$ is used instead of $C$ to highlight the fact that the data is centered in feature space (i.e. $\sum_{i=1}^{N} \Phi(x_i) = 0$). For more details on how to center data in feature space, the reader should refer to [90, 91]. Following the same setting as PCA (section 3.2), the Eigenvectors $V$ (in feature space) and the Eigenvalue matrix $\lambda$ of the covariance matrix are calculated:
$$\lambda V = \bar{C} V. \qquad (3.11)$$
All Eigenvectors in $V$ with Eigenvalues of more than zero must lie within the space of the training vectors $\Phi(x_1), \Phi(x_2), \ldots, \Phi(x_N)$, leading to the possible reconstruction of $V$ from its 'basis' vectors:
$$V = \sum_{i=1}^{N} \alpha_i \Phi(x_i), \qquad (3.12)$$
where $\alpha_i$ denotes the respective coefficients of each basis. The concatenated matrix $\alpha$, which consists of these coefficient vectors, is the unknown that needs to be determined before any feature space projection via KPCA can be performed. By introducing an $N \times N$ kernel matrix $K$, where
$$K_{ij} := \langle \Phi(x_i) \cdot \Phi(x_j) \rangle, \qquad (3.13)$$
Schölkopf et al [90] showed that
$$N\lambda K\alpha = K^2\alpha, \qquad (3.14)$$
which further leads to a simplified formulation:
$$N\lambda\alpha = K\alpha. \qquad (3.15)$$
From equation 3.15, it is clear that the unknown coefficient matrix can be determined by solving for the Eigenvectors and Eigenvalues of $K$. Specifically for the purpose of de-noising via KPCA (where a major concern is the ability to project novel points), the projection onto the $k$-th principal component in feature space is then
$$\langle V_k \cdot \Phi(x) \rangle = \sum_{i=1}^{N} \alpha_i^k \langle \Phi(x) \cdot \Phi(x_i) \rangle. \qquad (3.16)$$
Using similar notations for projected coefficients as equation 3.4, the coefficients of the KPCA projection (onto the first $K^+$ principal axes) of $x$ can be denoted as
$$\beta = [\langle V_1 \cdot \Phi(x) \rangle, \ldots, \langle V_{K^+} \cdot \Phi(x) \rangle]^T, \quad \forall\ \beta \in \mathbb{R}^{K^+} \qquad (3.17)$$
where β k denotes the coefficient of the projection of x on the principal axis Vk (in feature
space). The remaining factor to consider is how to define the feature space mapping Φ(·),
such that KPCA de-noising is of any practical use. Referring back to the formulation of
KPCA, there are, in fact, only two instances (when projecting points via KPCA) where
an explicit mapping from input space RD to feature space F is required. These are when
calculating for the kernel matrix K in equation 3.13 and during the projection of a novel
point in equation 3.16. In both cases, a dot product (projection) with another mapped
vector is performed immediately, once in feature space. To this end, it is possible to avoid
defining an explicit map for Φ(·) , and instead, define a positive definite kernel function
k(x i , x j ) [90], such that
$$k(x_i, x_j) = \langle \Phi(x_i) \cdot \Phi(x_j) \rangle, \quad \Phi : \mathbb{R}^D \rightarrow F \qquad (3.18)$$
is a dot product in the feature space of the mapped vectors. Furthermore, due to the
high (possibly infinite) dimension of the feature space, it would also be computationally
expensive to explicitly perform the mapping. By implicitly mapping via a kernel function,
which is a dot product in feature space [87], the system also avoids having to explicitly
store high dimensional feature space vectors in memory. The following section summarizes
the selection and tuning process for the relevant kernel.
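Gathering equations 3.13–3.17 together, both training and the projection of a novel point can be expressed entirely through kernel evaluations. The sketch below is illustrative only: it uses, for concreteness, the Gaussian kernel discussed in the next section, omits the feature-space centring step (which the chapter defers to [90, 91]), and normalizes $\alpha$ following the standard unit-norm convention for the feature-space Eigenvectors. All identifiers are hypothetical:

```python
import numpy as np

def gaussian_kernel(A, B, gamma):
    """Gaussian kernel matrix k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2) (see section 3.3.2)."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * d2)

def kpca_fit(X_tr, gamma, K_plus):
    """Illustrative KPCA training (equations 3.13-3.15), without feature-space centring."""
    Kmat = gaussian_kernel(X_tr, X_tr, gamma)       # kernel matrix (3.13)
    mu, a = np.linalg.eigh(Kmat)                    # Eigen-decomposition of K (3.15)
    order = np.argsort(mu)[::-1][:K_plus]           # keep the K+ largest Eigenvalues
    alpha = a[:, order] / np.sqrt(mu[order])        # scale so each V_k has unit norm
    return alpha                                    # (N, K+) coefficient matrix

def kpca_project(X_tr, alpha, gamma, x):
    """Projection of a novel point onto the feature-space principal axes (3.16-3.17)."""
    k_x = gaussian_kernel(x[None, :], X_tr, gamma).ravel()
    return alpha.T @ k_x                            # beta in R^{K+}
```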
3.3.2 Kernel Selection for Pre-Image Approximation
There are many possible choices of valid kernels for KPCA, such as Gaussian or polynomial
kernels. The choice of the optimal kernel will be dependent on the data and the application of KPCA. From a KPCA de-noising perspective, the radial basis Gaussian kernel
$$k(x_i, x) = \exp\left(-\gamma\{(x_i - x)^T (x_i - x)\}\right) \qquad (3.19)$$
is selected due to the availability of well established and tested ‘pre-image’ approximation
algorithms [91, 58]. The pre-image (which exists in the original input space) is an approximation of the KPCA projected vector (which exists in the de-noised feature space).
Approximating the pre-image of a projected vector (in feature space) is an extremely complex problem and for some kernels, this still remains an open question. Since the projected
(de-noised) vector exists in a higher (possibly infinite) feature space, not all vectors in this
space may have pre-images in the original (lower dimensional) input space. Specifically for
the case of Gaussian kernels, however, the fixed point algorithm [88], gradient optimization
[39] or the Kwok-Tsang algorithm [58], have been shown to be successful in approximating
pre-images. For more information regarding the approximation of pre-images, the reader
should refer to [89]. Note that there are two free parameters that require tuning for the Gaussian kernel (equation 3.19), these being $\gamma$, the Euclidian distance scale factor, and $K^+$, the optimal number of principal axis projections to retain in the feature space. To do so, the same concept of tuning for de-noising (as with PCA [section 3.2]) can be implemented. By letting $P_{K^+}^{\gamma}$ define the KPCA de-noising function (projection and pre-image approximation with the Euclidian scale factor of $\gamma$ and an optimal number of principal axes $K^+$), the optimal parameters will be chosen so as to minimize the error function
$$\varepsilon(\gamma, K^+) = \sum_{i=1}^{N} \| x_i - P_{K^+}^{\gamma}(x_i + \Delta n_i^\sigma) \|^2. \qquad (3.20)$$
Note that this is similar to PCA tuning in equation 3.6, and can be improved to generalize
to unseen inputs by using cross validation in tuning, hence resulting in the following form:
$$\varepsilon(\gamma, K^+) = \sum_{k=1}^{\kappa} \sum_{i=1}^{\operatorname{card}(X_{tr}^k)} \| x_i^k - \bar{P}_{K^+}^{\gamma,k}(x_i^k + \Delta n_i^\sigma) \|^2, \qquad (3.21)$$
where $X_{tr}^k$ denotes the $k$-th subset for cross validation and $\bar{P}_{K^+}^{\gamma,k}(\cdot)$ represents the KPCA de-noising function using the principal components learnt from the remaining $\kappa - 1$ subsets (i.e. not inclusive of data in subset $k$).
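For illustration, the fixed-point pre-image iteration for the Gaussian kernel (after [88]) can be sketched as below, again omitting the feature-space centring terms; the update repeatedly re-weights the training points by their kernel similarity to the current estimate. The names build on the hypothetical kpca_fit/kpca_project sketches above and are not the implementation used in this work:

```python
import numpy as np

def kpca_preimage(X_tr, alpha, gamma, beta, z0, n_iter=50):
    """Illustrative fixed-point pre-image approximation for the Gaussian kernel [88].

    beta : projection coefficients of the input (equation 3.17).
    z0   : starting estimate, e.g. the noisy input itself.
    """
    w = alpha @ beta                  # expansion coefficients of the projected feature-space vector
    z = np.array(z0, dtype=float)
    for _ in range(n_iter):
        k_z = np.exp(-gamma * np.sum((X_tr - z) ** 2, axis=1))
        den = np.sum(w * k_z)
        if abs(den) < 1e-12:          # guard against a degenerate denominator
            break
        z = (w * k_z) @ X_tr / den    # kernel-weighted mean of the training points
    return z

def kpca_denoise(X_tr, alpha, gamma, x):
    """De-noise x by projecting it via KPCA and approximating the pre-image."""
    beta = kpca_project(X_tr, alpha, gamma, x)
    return kpca_preimage(X_tr, alpha, gamma, beta, z0=x)
```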
Figure 3.5: KPCA de-noising of the non-linear toy data from figure 3.3.
Using the tuned Gaussian kernel, figure 3.5 shows the results of KPCA de-noising of the non-linear example [39] from figure 3.3. In this case, KPCA de-noising eliminated more than 40% of the original error. When compared with linear PCA, where no reduction in error was possible, KPCA's ability to de-noise data provides a significant improvement§.
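The tuning of equation 3.20 then amounts to a grid search over a discretized range of $(\gamma, K^+)$; cross-validation, as in equation 3.21, follows the same pattern. A hypothetical sketch reusing the functions above:

```python
import numpy as np

def tune_gamma_K(X_tr, sigma, gammas, K_values, seed=0):
    """Illustrative grid search minimising the de-noising error of equation 3.20."""
    rng = np.random.default_rng(seed)
    best, best_err = None, np.inf
    for gamma in gammas:
        for K_plus in K_values:
            alpha = kpca_fit(X_tr, gamma, K_plus)
            err = 0.0
            for x in X_tr:
                noisy = x + rng.normal(0.0, sigma, size=x.shape)      # corrupt with Gaussian noise
                err += np.sum((x - kpca_denoise(X_tr, alpha, gamma, noisy)) ** 2)
            if err < best_err:
                best, best_err = (gamma, K_plus), err
    return best     # the (gamma, K+) pair with the lowest de-noising error
```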
3.4 KPCA Subspace Learning for Motion Capture
The concept of motion subspace learning via KPCA is to use (Gaussian) kernels to model
the non-linearity of joint rotation, such that computationally efficient pose ‘distance’ can
be calculated, hence, allowing practical human motion de-noising. Clean human motion
captured via a marker-based system [32] is converted to the normalized Relative Joint
Center (RJC) encoding (section 3.4.1) and used as training data. The optimal de-noising
subspace is learnt by mapping the concatenated joint rotation (pose) vector via a Gaussian
kernel and tuning its parameters to minimize error in pre-image approximations (section
3.3.2). In section 3.4.2, the relationship between human motion de-noising and markerless
motion capture is summarized.
3.4.1 Normalized Relative Joint Center (RJC) Encoding
This section aims to show that encoding human pose using normalized RJC vectors is not
only logical, but intuitive, in the sense that it allows efficient non-linear comparison of pose
vectors, and has a structure which is well suited for KPCA. Generally, human pose can be
encoded using a variety of mathematical forms, ranging from Euler angles (appendix A.1.1)
to homogeneous matrices (appendix A.1.2). In terms of pose reconstruction and interpolation [9, 57], problems usually encountered (when using Euler angles and matrices) include
how to overcome the non-linearity and non-commutative structure of rotation encoding.
Specifically for markerless motion capture using Euler angles, we do not try to map from
input to Euler space as in [6] because Euler angles also suffer from singularities and gimbal
locks. Instead we model (3D) joint rotation as a point on a unit sphere and use a Gaussian
kernel to approximate its non-linearity. The advantage of using normalized spherical surface points to model rotations is that the points lie on a non-linear manifold, which can be
well-approximated using a combination of exponential maps and linear algebra.
§ An interesting area to consider for each kernel (of KPCA) is whether any heteroscedastic noise [38, 63] is induced in feature space, and if so, how this relates to KPCA's ability to de-noise and approximate pre-images.
A similar concept of applying non-linear exponential maps to model human joint rotations has been
proposed by Bregler and Malik in [18]. From equation 3.16, the requirement needed to
enable KPCA projection for human motion de-noising is the definition of the ‘distance’
between two pose descriptors, which should be a dot product in feature space (section 3.3).
To begin, let us consider the simple case of defining the ‘distance’ between two different
orientations of the same joint (e.g. the shoulder joint). In this case, the two rotations can
be modelled as two points on a unit sphere, with a logical ‘distance’ being the geodesic
(surface) distance between the two surface points. If we denote the two normalized points
on the unit sphere as p and q respectively, the geodesic distance is the inverse cosine of
the dot product: cos−1 (p · q ). The difficulty of using the inverse cosine arises when trying
to efficiently extend the ‘distance’ definition to an array of spheres (encoding multiple
joints of the human body), and ensuring that this ‘distance’ is positive definite to allow
embedding into KPCA [87]. Instead of adopting this approach, we use a Gaussian kernel,
which has already been proven to be positive definite for KPCA de-noising [91]. In this
case, the ‘distance’ between two orientations of the i-th joint will be defined as k(p i , q i ),
with
$$k(p_i, q_i) = \exp\left(-\gamma\{(p_i - q_i)^T (p_i - q_i)\}\right) \quad \forall\ k(\cdot, \cdot) \in [0, 1], \qquad (3.22)$$
In equation 3.22, the Gaussian kernel in conjunction with the Euclidian distance (between
$p$ and $q$) has been used to approximate the non-linearity between the two spherical
surface points. The reader should note that the Gaussian kernel is an inverse encoding
(i.e. a complete alignment of the two orientations is indicated by the maximum value
of k(p i , q i ) = 1). Any misalignment of the specific joint will tend to reduce the kernel
function’s output towards zero. The advantage of adopting an exponential map approach
to joint distance definition becomes apparent when trying to calculate the distance between arrays of multiple spheres, which encode the full pose (stance) of a person. Since
k(·, ·) ∈ [0, 1], the individual joint ‘distance’ can be extended to multiple joints by taking
a direct multiplication of the kernel's output. For example, if two different pose vectors x
and $y$ for a human model of $M$ joints are encoded as
$$x = [p_1, p_2, \ldots, p_M]^T, \ x \in \mathbb{R}^{3M}, \quad \text{and} \quad y = [q_1, q_2, \ldots, q_M]^T, \ y \in \mathbb{R}^{3M}, \qquad (3.23)$$
then the distance between these pose vectors, which are encoded as an array of normalized spherical surface points, is
$$\begin{aligned} k(x, y) &= k(p_1, q_1) \times k(p_2, q_2) \times \cdots \times k(p_M, q_M) \\ &= \exp\left(-\gamma\{(p_1 - q_1)^T(p_1 - q_1)\}\right)\, \exp\left(-\gamma\{(p_2 - q_2)^T(p_2 - q_2)\}\right) \cdots \exp\left(-\gamma\{(p_M - q_M)^T(p_M - q_M)\}\right) \\ &= \exp\left(-\gamma\{(p_1 - q_1)^T(p_1 - q_1) + (p_2 - q_2)^T(p_2 - q_2) + \cdots + (p_M - q_M)^T(p_M - q_M)\}\right) \\ &= \exp\left(-\gamma\{(x - y)^T(x - y)\}\right). \end{aligned} \qquad (3.24)$$
This results in the Gaussian kernel, which was used in KPCA de-noising of the non-linear
toy example in figure 3.5. Intuitively, this encoding makes sense because two full body
poses which are aligned will generate a 'distance' of 1. Any joint which results in a
relative misalignment (from the same joint in a different pose) will tend to reduce this
‘distance’ towards zero. Equation 3.24 shows that the ‘distance’ between pose vectors of
M joints can be efficiently calculated by simply taking the Euclidian distance between the
concatenated spherical surface points and mapping via a Gaussian kernel. The Euclidian
scale parameter γ allows the tuning of the kernel to fit different motion sequences and
training sets (section 4.2.2). Furthermore, the kernel is guaranteed to be positive definite,
and therefore, applicable to human motion de-noising via KPCA.
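A short sketch of equation 3.24 is given below; the per-joint form (equation 3.22) and the concatenated form are shown together purely to illustrate that they agree numerically. The function names and the assumed layout of the pose vector ($M$ stacked 3D unit vectors) are illustrative assumptions of this sketch:

```python
import numpy as np

def pose_kernel(x, y, gamma):
    """Pose 'distance' of equation 3.24: a Gaussian kernel on the concatenated
    normalized relative joint centre vectors x, y in R^{3M}."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def pose_kernel_per_joint(x, y, gamma, M):
    """Equivalent product of per-joint Gaussian kernels (equation 3.22)."""
    P, Q = x.reshape(M, 3), y.reshape(M, 3)
    per_joint = np.exp(-gamma * np.sum((P - Q) ** 2, axis=1))   # k(p_i, q_i) for each joint
    return np.prod(per_joint)
```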
A disadvantage of encoding a rotation using a normalized spherical surface point is
that it loses a degree of rotation for any ball & socket joint (figure 3.6). This is because
a surface point on a sphere cannot encode the rotation about the axis defined by the
sphere’s center and the surface point itself. However, as highlighted in the phase space
representation of Moeslund [74], for the main limbs of the human body (e.g. arms and
legs), the ball & socket joints (e.g. shoulder or pelvis) are directly linked to hinge joints
(e.g. elbow or knee), which have only one degree of freedom. Provided that the limb is not fully extended, it is possible to rediscover the missing degree of freedom (for a ball &
socket joint) by analyzing the orientation of the corresponding hinge joint. If the limb is
fully extended, then temporal constraints can be used to select the most probable missing
rotation of the ball & socket joint.
Figure 3.6: Diagram to illustrate the missing rotation in a ball & socket joint using the
RJC encoding. The missing rotation of the ball & socket joint (shoulder) can be inferred
from the hinge joint (elbow) below it in the skeletal hierarchy (provided that the limb is
not fully extended).
3.4.2 Relationship between KPCA De-noising and Motion Capture
It is important to highlight the possible relationship between KPCA de-noising of human
motion and exemplar based motion capture techniques. In human motion de-noising,
a noisy human pose vector can be de-noised by projecting it onto the feature subspace via the kernel trick, and mapping back via pre-image approximation [87]. In exemplar-based motion capture, performance improvement can be achieved by constraining search
to within a more efficient lower dimensional human motion subspace (rather than the
original pose space). Both techniques require the use of a human motion subspace for
processing.
To some readers, an obvious way to integrate KPCA de-noising into markerless motion
capture would be to add a KPCA de-noiser as a post-processor to motion capture and
have some kind of feedback loop to constrain the search space (figure 3.7 [top]). This thesis
explores a more efficient concept, that of integrating a KPCA human motion de-noiser into the core of a novel markerless motion capture technique called Kernel Subspace Mapping (KSM) (figure 3.7 [bottom]).
Figure 3.7: [Top] Data flow diagram to summarize a likely relationship between human motion capture and human motion de-noising. [Bottom] Data flow diagram summary of Kernel Subspace Mapping (KSM) [chapter 4] (based on KPCA [section 3.3]), which integrates motion capture and de-noising into a single efficient processing step.
Instead of capturing a pose and de-noising from the normalized RJC space (figure 3.7 [top]), KSM maps silhouette descriptors immediately to the projected feature subspace learnt from KPCA, hence avoiding the KPCA projection step (equation 3.16) during pose inference.
In order to test the effectiveness of KPCA for human motion capture, an engineering
approach is adopted and experiments are performed on the KPCA de-noiser (for human
motion) independently, before motion capture integration. To advocate the use of the
more complex non-linear KPCA, instead of linear PCA, experiments (section 3.5) should
show that the RJC encoding (section 3.4.1) of human motion is non-linear, and more
importantly, that more noise can be eliminated via KPCA (when compared to PCA). Denoising of synthetic Gaussian white noise in the feature space of various motion sequences
is also analyzed. This is investigated because in KSM, noise may also be induced in the
feature space (crucial details regarding this will be discussed in chapter 4).
3.5 Experiments & Results
To compare PCA and KPCA for the de-noising of human motion, various motion sequences
were downloaded from the CMU motion capture database. All motions were captured via
the VICON marker-based motion capture system [103], and were originally downloaded in
AMC format (section A.1.1). To avoid the singularities and non-linearity of Euler angles
(AMC format), the data is converted to its RJC equivalent (section 3.4.1) for training.
All experiments were performed on a Pentium™ 4 with a 2.8 GHz processor. Synthetic Gaussian white noise is added to the motion sequence $X^{tr}$ in normalized RJC format, and the de-noising qualities are compared quantitatively (figure 3.8) and qualitatively in 3D animation playback¶.
Figure 3.8: Quantitative comparison between PCA and KPCA de-noising of human motion
sequence. The level of input synthetic noise (mean square error [mse]) is depicted on the
horizontal axes. The corresponding output noise level in the de-noised motion is depicted
on the vertical axes.
Figure 3.8 highlights the superiority of KPCA over linear PCA in human motion de-noising. Specifically for the walking and running motion sequences (figure 3.9 and figure
3.10 respectively), the frame by frame comparison further emphasizes the superiority of
KPCA (blue line) over PCA (black line). Furthermore, the KPCA algorithm was able
to generate realistic and smooth animation when the de-noised motion is mapped to a skeleton model and played back in real time.
¶ For motion de-noising results via PCA and KPCA, please refer to the attached files: videos/motionDenoisingRun.MP4 & videos/motionDenoisingWalk.MP4.
Figure 3.9: Frame by frame error comparison between PCA and KPCA de-noising of a walk sequence.
Figure 3.10: Frame by frame error comparison between PCA and KPCA de-noising of a run sequence.
PCA de-noising, on the other hand, was
unsuccessful due to its linear limitation and resulted in jittery unrealistic motions similar
to the original noisy sequences. A repeat of this experiment (motivated by our work) has
also been conducted by Schraudolph et al (in [92]) with the fast iterative Kernel PCA
algorithm, and similar de-noising results are confirmed.
For feature space‖ de-noising of another non-linear toy example (figure 3.11 [center]), the result shows that KPCA can implicitly de-noise noisy feature space data by directly calculating its pre-image.
‖ We are interested in the de-noising of feature space noise because in KSM (chapter 4), input data is initially mapped to the feature space before pose estimation.
Figure 3.11: Diagram to show results of KPCA in the implicit de-noising of feature space noise of a toy example via the fixed-point algorithm [88]. [Left] Projections of the clean (blue) training data and the noisy (red) test data onto the first 3 principal axes in feature space. [Center] The 'direct' pre-images of the noisy data in the original input space. [Right] The 'direct' pre-images re-projected back to the feature space. Note the reduction in mean squared error between the noisy data [left] and the de-noised data [right].
This result is further supported when the pre-images of the
noisy feature space vectors are re-projected back to feature space without any other form
of processing via KPCA. Figure 3.11 [right] highlights geometrically and quantitatively
the reduction in error of the re-projected points onto the first 3 principal axes in feature
space. Figure 3.12 [top] shows the correlation (though erratic) between feature space noise
and pose (RJC) space noise (in mean square error [mse]) for KPCA de-noising of a walk
sequence. As expected, smaller noise level in feature space is an indication of smaller noise
level in the original space, and vice versa.
As the processing cost of KPCA is dependent on the training size (equation 3.16), an
analysis of this cost with respect to the training size was also investigated. Figure 3.12
[bottom] confirms the linear relationship between the de-noising cost of KPCA and its
training size.
3.6 Conclusions & Future Directions
Human motion can be considered as a high dimensional set of concatenated vectors (57
dimensions in RJC format). As human motion is coordinated, it is possible to use fewer dimensions to represent the original data.
Figure 3.12: Feature and pose space mean square error [mse] relationship for KPCA [top]. Average computational de-noising cost for KPCA [bottom].
PCA can encode more than 98% of a simple
walking motion in less than 10 principal components [14], and it has even been shown successful in learning principal axes for the synthesis of new human motion via optimization
[84]. However, the fact that a set of axes represents a high percentage of the original data does not necessarily mean that it is a good de-noiser of the data. The main point to take into account is that neither PCA nor KPCA is being investigated (in this chapter) for its ability to compress (reduce the dimension of) human motion data. The main issue
of concern is to compare the de-noising qualities (of human motion) between PCA and
KPCA (and use the results to motivate the use of KPCA de-noising in Kernel Subspace
Mapping [chapter 4]).
From the results in figure 3.8, it is clear that KPCA performs significantly better than
PCA in human motion de-noising, hence, verifying that the simplest forms of human motion (e.g. walking, running) are inherently non-linear. Note that PCA is, in fact, a special
case of KPCA with linear kernels, hence, eliminating the possibility that PCA de-noising
can perform better than KPCA de-noising. KPCA simply allows the selection of different
kernels (which may also have more parameters than PCA), more flexibility in parameter
tuning to avoid over/under fitting, and hence, enables custom improvements over linear
PCA. Specifically for KPCA de-noising via the RBF kernel (equation 3.19), from experiments, it was discovered that the optimal value for γ (the Gaussian scale factor) plays a
substantial role in improving the de-noising qualities. This parameter is not available for
tuning via the traditional linear PCA de-noising algorithm.
The ability for KPCA to implicitly de-noise data in feature space is also interesting.
Consider a situation where noise is actually induced in feature space, and not in the input
space. By directly approximating the pre-images from the noisy data in feature space, it
is possible to implicitly de-noise the data in the original space. To reiterate, because the
feature space usually exists in much higher dimension than the input space, most points
in the feature space do not have corresponding pre-images. This is even more likely if the
(feature space) point is located away from the subspace of training data, since the feature
space is, in fact, defined by the training data itself. In the case where the pre-image does
not exist, then gradient descent techniques [88] can be used to approximate the most likely
pre-image. From the results in figure 3.11, this approximation has the desirable de-noising
effect of implicitly projecting the noisy feature space data into the non-linear subspace
(i.e. the toy data set in figure 3.11 [center]) in the input space.
From a markerless motion capture perspective, human motion de-noising is advantageous as it allows the automated learning of human subspaces for specific applications.
In chapter 4, the KPCA human motion de-noiser will be integrated into the core of the
markerless motion capture technique (called Kernel Subspace Mapping), where noise may
be induced in both the input and feature space. For real-time motion capture, a factor to
take into account for future consideration is the complexity for KPCA projection (figure
3.12 [bottom]), which is O(N ), where N is the cardinality of the training set. As motion
capture data is usually stored as a set of concatenated vectors captured (sampled) at high
frame rates (from 60Hz to 120Hz), training the de-noiser with all the frames from each
motion file will be extremely expensive when it comes to motion de-noising (and motion
capture). To this end, the Greedy KPCA algorithm [41], which selects a reduced training set that retains a similar span in feature space, will be reviewed for motion de-noising (chapter 6). In particular, the capture rate (which is dependent on the de-noising rate) can
be controlled by modifying the size of the reduced training set via this greedy algorithm.
Chapter 4
Kernel Subspace Mapping (KSM)∗
This chapter integrates human motion de-noising via KPCA (chapter 3) and the Pyramid
Match Kernel [44] into a novel markerless motion capture technique called Kernel Subspace Mapping. The technique learns two feature space representations derived from the
synthetic silhouettes and pose pairs, and views motion capture as the problem of mapping
vectors between the two feature subspaces. Quantitative and qualitative motion capture
results are presented and compared with other state of the art markerless motion capture
algorithms.
4.1 Introduction
Kernel Subspace Mapping (KSM) concentrates on the problem of inferring articulated
human pose from concatenated and synchronized human silhouettes. Due to ambiguities
when mapping from 2D silhouette to 3D pose space, previous silhouette based techniques,
such as [6, 19, 26, 37, 45, 75, 77], involve complex and expensive schemes to disambiguate
between the multi-valued silhouette and pose pairs. A major contributor to the expensive
cost of human pose inference is the high dimensionality of the output pose space. To
mitigate exhaustively searching in high dimensional pose space, pose inference can be constrained to a lower dimensional subspace [37, 113]. This is possible due to a high degree
∗ This chapter is based on the conference paper [108] T. Tangkuampien and D. Suter: Real-Time Human Pose Inference using Kernel Principal Component Pre-image Approximations: British Machine Vision Conference (BMVC) 2006, pages 599–608, Edinburgh, UK.
of correlation in human motion.
In particular, a novel markerless motion capture technique called Kernel Subspace
Mapping (KSM), which takes advantage of the correlation in human motion, is introduced. The algorithm, which is based on Kernel Principal Components Analysis (KPCA)
[90] (section 3.3), learns two feature subspace representations derived from the synthetic
silhouettes and normalized relative joint centers (section 3.4.1) of a single generic human
mesh model (figure 4.1). After training, novel silhouettes of previously unseen actors and
of unseen poses are projected through the two feature subspaces (learnt from KPCA) via
Locally Linear non-parametric mapping [85]. The captured pose is then determined by
calculating the pre-image [87] of the projected silhouettes. As highlighted in chapter 3, an
advantage of KPCA is its ability to de-noise non-linear data before processing, as shown
in [91] with images of handwritten characters. This chapter further explores this concept,
and shows how this novel technique can be applied in the area of markerless motion capture. Results in section 4.4.2 show that KSM can infer relatively accurate poses (compared
to [45, 6, 2]) from noisy unseen silhouettes by using only one synthetic human training
model. A limitation of the technique is that silhouette data will be projected onto the
subspace spanned by the training pose, hence restricting the output to within this pose
subspace. This restriction, however, is not crucial since the system can be initialized with
the correct pre-trained data set if prior knowledge of the expected type of motion is available.
The main contributions of this chapter include the introduction of a novel technique
for markerless motion capture called Kernel Subspace Mapping (KSM), which is based on
mapping between two feature subspaces learnt via KPCA. A novel concept of silhouette
de-noising is presented, which allows previously unseen (test) silhouettes to be projected
onto the subspace learnt from the generic (training) silhouettes (figure 4.1), hence allowing
pose inference using only a single training model (which also leads to a significant decrease
in training size and inference time). For mapping from silhouettes to the pose subspace,
instead of using standard or robust regression (which was found to be both slower and less
Figure 4.1: Overview of Kernel Subspace Mapping (KSM), a markerless motion capture
technique based on Kernel Principal Components Analysis (KPCA) [90].
accurate in our experiments), non-parametric LLE mapping [85] is applied. The silhouette
kernel parameters are tuned to optimize the silhouette-to-pose mapping by minimizing the
LLE mapping error (section 4.3.3). Finally, by mapping silhouettes to the pose feature
subspace, the search space can be implicitly constrained to a set of valid poses, whilst
taking advantage of well established and optimized pre-image (inverse mapping) approximation techniques, such as the fixed-point algorithm of [87] or the gradient optimization
technique [39].
4.2
Markerless Motion Capture: A Mapping Problem
As stated previously, Kernel Subspace Mapping (KSM) views markerless motion capture
as a mapping problem. Given as input, silhouettes of a person, the algorithm maps the
pixel data to a pose space defined by the articulated joints of the human body. The static
pose of a person (in the pose space) is encoded using the normalized relative joint centers
(RJC) format (section 3.4.1), where a pose (at a time instance) is denoted by
\[ x = [p_1, p_2, \ldots, p_M]^T, \quad x \in \mathbb{R}^{3M}, \tag{4.1} \]
and p_k represents the normalized spherical surface point encoding of the k-th joint relative to its parent joint in the skeletal hierarchy (appendix A.1). Regression is not performed from silhouette to Euler pose vectors as in [2, 6] because the mapping from pose to Euler joint coordinates is non-linear and multi-valued. Any technique based on standard linear algebra and convex optimization, like KPCA and regression, will therefore eventually break down when applied to vectors consisting of Euler angles (as it may potentially map
the same 3D joint rotation to different locations in vector space). To avoid these problems,
the KPCA de-noising subspace is learnt from (training) pose vectors in the relative joint
center format (section 3.4.1).
Figure 4.2: Scatter plot of the RJC projections onto the first 4 kernel principal components in M_p of a walk motion that is fully rotated about the vertical axis. The 4th dimension is represented by the intensity of each point.
The pose feature subspace Mp (figure 4.2) is learnt from the set of training poses. Similarly, for the silhouette space, synchronized and concatenated silhouettes (synthesized from
the pose x i ) are preprocessed to a hierarchical shape descriptor Ψi (section 4.2.3) using
a technique similar to the pyramid match kernel of [44]. The preprocessed hierarchical
training set is embedded into KPCA, and the silhouette subspace Ms learnt (figure 4.1
[bottom left]). The system is tuned to minimize the LLE non-parametric mapping [85]
from Ms to Mp (section 4.3). During capture, novel silhouettes of unseen actors (figure
4.1 [top left]) are projected through the two subspaces, before mapping to the output pose
space using pre-image approximation techniques [87, 39]. Crucial steps are explained fully
in the remainder of this chapter.
4.2.1
Learning the Pose Subspace via KPCA
From a motion capture perspective, pose subspace learning is the same as learning the
subspace of human motion for de-noising (chapter 3). For the latter (motion de-noising), a
noisy pose vector is projected onto the (kernel) principal components for de-noising in the
feature subspace Mp . Thereafter, the pre-image of the projected vector is approximated,
hence mapping back to the original space (figure 4.3 [KPCA De-noiser: black rectangle]).
Figure 4.3: Diagram to highlight the relationship between human motion de-noising (chapter 3), Kernel Principal Components Analysis (KPCA) and Kernel Subspace Mapping
(KSM).
For markerless motion capture (figure 4.3 [KSM: red rectangle]), KSM initially learns
the projection subspace Mp as in KPCA de-noising. However, instead of beginning in the
normalized RJC (pose) space, the input is derived from concatenated silhouettes (figure
4.3 [left]) and mapped (via a silhouette feature subspace Ms ) to the pose feature subspace
Mp . Thereafter, KSM and KPCA de-noising follow the same path in pre-image approximation.
To update the notations for KSM, for a motion sequence, the training set of RJC pose
vectors is denoted as X tr . The KPCA projection of a novel pose vector x onto the k-th
principal axis Vpk in the pose feature space can be expressed implicitly via the kernel trick
as
\[ \langle V_p^k \cdot \Phi(x) \rangle = \sum_{i=1}^{N} \alpha_i^k \langle \Phi(x_i) \cdot \Phi(x) \rangle = \sum_{i=1}^{N} \alpha_i^k \, k_p(x_i, x), \tag{4.2} \]
where α refers to the Eigenvectors of the centered RJC kernel matrix (equation 3.16). In
this case, the radial basis Gaussian kernel
\[ k_p(x_i, x) = \exp\{-\gamma_p (x_i - x)^T (x_i - x)\} \tag{4.3} \]
is used because of the availability of a well established and tested fixed point pre-image approximation algorithm [87] (the reason for this selection will become clear in section 4.2.2). To avoid confusion, the symbol k_p(·,·) will be used explicitly for the pose space kernel and k_s(·,·) for the silhouette kernel [the kernel k_s(·,·) will be discussed in section 4.2.3]. As in section 3.3.2, there are two free parameters of k_p(·,·) in need of tuning: γ_p, the Euclidian distance scale factor, and η_p, the optimal number of principal axis projections to retain in the pose feature space. The KPCA projection (onto the first η_p principal axes) of x_i is denoted as v_{p_i}, where
\[ v_{p_i} = [\langle V_p^1 \cdot \Phi(x_i) \rangle, \ldots, \langle V_p^{\eta_p} \cdot \Phi(x_i) \rangle]^T, \quad \forall\, v_p \in \mathbb{R}^{\eta_p}. \tag{4.4} \]
For novel input pose x , the KPCA projection is simply signified by v p .
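To make the projection step concrete, the following is a minimal sketch in Python/NumPy of fitting KPCA with the Gaussian kernel of equation 4.3 and projecting a novel pose as in equations 4.2 and 4.4. The function names, the toy 57-dimensional RJC vectors and the parameter values are illustrative assumptions of this sketch, not the thesis implementation; in practice γ_p and η_p are set by the cross-validation tuning of section 4.2.2.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian RBF kernel matrix between the row vectors of A and B (equation 4.3)."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * d2)

def kpca_fit(X_tr, gamma, eta):
    """Fit KPCA on the training poses X_tr (N x d): centre the kernel matrix and
    keep the eta leading components (the alpha of equation 4.2)."""
    N = X_tr.shape[0]
    K = rbf_kernel(X_tr, X_tr, gamma)
    one = np.ones((N, N)) / N
    Kc = K - one @ K - K @ one + one @ K @ one          # centred kernel matrix
    lam, alpha = np.linalg.eigh(Kc)                     # ascending eigenvalues
    lam, alpha = lam[::-1][:eta], alpha[:, ::-1][:, :eta]
    alpha = alpha / np.sqrt(np.maximum(lam, 1e-12))     # normalise so <V^k, V^k> = 1
    return K, alpha

def kpca_project(X_tr, K_tr, alpha, x, gamma):
    """Project a novel pose x onto the first eta principal axes (equations 4.2 and 4.4)."""
    k = rbf_kernel(x[None, :], X_tr, gamma).ravel()     # k_p(x_i, x) for every x_i
    kc = k - k.mean() - K_tr.mean(0) + K_tr.mean()      # centre the test kernel vector
    return alpha.T @ kc                                 # v_p in R^eta

# toy usage with random 'poses' (57 dimensions stands in for a small RJC vector)
X_tr = np.random.randn(50, 57)
K_tr, alpha = kpca_fit(X_tr, gamma=0.1, eta=4)
v_p = kpca_project(X_tr, K_tr, alpha, np.random.randn(57), gamma=0.1)
```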
4.2.2
Pose Parameter Tuning via Pre-image Approximation
In order to understand how to optimally tune the KPCA parameters γp and ηp for the
pose feature subspace Mp , the context of how Mp is applied in KSM must be highlighted
(figure 4.1 [bottom right]). Ideally, a tuned system should minimize the inverse mapping error from Mp to the original pose space in normalized RJC format (figure 4.1 [top
right]). In the context of KPCA, if we encode each pose as x and its corresponding KPCA
projected vector as v p , the inverse mapping from v p to x is commonly referred to as the
pre-image [87] mapping. Since a novel input will first be mapped from Ms to Mp , the
system needs to determine its inverse mapping (pre-image) from Mp to the original pose
space. Therefore, for this specific case, the Gaussian kernel parameters, γp and ηp , are
tuned using cross-validation to minimize pre-image reconstruction error (as in equation
3.21). Cross validation prevents over-fitting and ensures that the pre-image mapping generalizes well to unseen poses, which may be projected from the silhouette subspace Ms .
An interesting advantage of using pre-image approximation for mapping is its implicit
ability to de-noise if the input vector (which is perturbed by noise) lies outside the learnt
subspace Mp . This is essentially the same problem as that encountered by Elgammal
and Lee in [37], which they solved by fitting each separate manifold to a spline and rescaling before performing a one dimensional search (for the closest point) on each separate
manifold. For Kernel Subspace Mapping, however, it is usual that the noisy input (from
Ms ) lies outside the clean subspace, in which case, the corresponding pre-image of v p
usually does not exist [91]. In such a scenario, the algorithm will implicitly locate the
closest point in the clean subspace corresponding to such an input, without the need to
explicitly search for the optimal point.
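As an illustration of how such a pre-image can be approximated, the sketch below follows the style of the fixed-point iteration for the Gaussian kernel described in [87]. The expansion coefficients gamma_coeffs (obtained from the projection coefficients and the KPCA Eigenvectors), the starting point z0 and the omission of the kernel-centring terms are simplifying assumptions of this sketch; a noisy input vector is a natural choice for z0.

```python
import numpy as np

def gaussian_preimage(X_tr, gamma_coeffs, rbf_gamma, z0, n_iter=100, tol=1e-8):
    """Fixed-point pre-image iteration for a Gaussian-kernel expansion
    P(Phi(x)) ~ sum_i gamma_coeffs[i] * Phi(X_tr[i]); kernel-centring terms omitted."""
    z = z0.copy()
    for _ in range(n_iter):
        d2 = np.sum((X_tr - z) ** 2, axis=1)
        w = gamma_coeffs * np.exp(-rbf_gamma * d2)       # pull of each training pose
        denom = w.sum()
        if abs(denom) < 1e-12:                           # degenerate case: keep start point
            return z0
        z_new = (w[:, None] * X_tr).sum(axis=0) / denom
        if np.linalg.norm(z_new - z) < tol:
            return z_new
        z = z_new
    return z
```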
4.2.3
Learning the Silhouette Subspace
This section shows how to optimally learn the silhouette subspace Ms , which is a more
complicated problem than learning the pose subspace Mp (section 4.2.1). Efficiently
embedding silhouette distance into KPCA is more complex and expensive because the
silhouette exists in a much higher dimensional image space. The use of Euclidian distance
between vectorized images, which is common but highly inefficient, is therefore avoided.
For KPCA, an important factor that must be taken into account when deciding on a
silhouette kernel (to define the distance between silhouettes) is whether the kernel is positive
definite and satisfies Mercer’s condition [87]. In KSM, we use a modified version of the
pyramid match kernel [44] (to define silhouette distances), which has already been proven
positive definite. The main difference (between the original pyramid match kernel [44] and
the one used in KSM) is that instead of using feature points sampled from the silhouette’s
edges (as in [44]), KSM considers all the silhouette foreground pixels as feature points.
This has the advantage of skipping the edge segmentation and contour sampling step.
During training (figure 4.4), virtual cameras (with the same extrinsic parameters as the
real cameras to be used during capture) are set up to capture synthetic training silhouettes.
Note that more cameras can be added by concatenating more silhouettes into the image.
Figure 4.4: Example of a training pose and its corresponding concatenated synthetic
image using 2 synchronized cameras. Each segmented silhouette is first rotated to align
the silhouette’s principal axis with the vertical axis before cropping and concatenation.
Each segmented silhouette is normalized by first rotating the silhouette’s principal axis
to align with its vertical axis before cropping and concatenation. Thereafter, a simplified
recursive multi-resolution approach is applied to encode the concatenated silhouette (figure
4.5). At each resolution (level) the silhouette area ratio is registered in the silhouette
descriptor Ψ.
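A minimal sketch of this encoding is given below, assuming a binary (0/1) concatenated silhouette image whose side length is divisible by 16; for five levels it yields the 1^2 + 2^2 + 4^2 + 8^2 + 16^2 = 341 area-ratio entries mentioned below. Function and variable names are illustrative assumptions.

```python
import numpy as np

def pyramid_descriptor(silhouette, levels=5):
    """Recursive multi-resolution area descriptor Psi of a binary silhouette image.
    Level l contributes a (2**l x 2**l) grid of foreground-area ratios."""
    H, W = silhouette.shape
    feats, level_of = [], []
    for l in range(levels):
        n = 2 ** l
        hs, ws = H // n, W // n
        for r in range(n):
            for c in range(n):
                block = silhouette[r * hs:(r + 1) * hs, c * ws:(c + 1) * ws]
                feats.append(block.mean())               # area ratio of this sub-image
                level_of.append(l + 1)                   # L(f) of this feature
    return np.array(feats), np.array(level_of)

# toy usage: a 160 x 160 binary image gives 1 + 4 + 16 + 64 + 256 = 341 features
img = (np.random.rand(160, 160) > 0.7).astype(float)
psi, levels = pyramid_descriptor(img)
print(psi.shape)                                         # (341,)
```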
Figure 4.5: Diagram to summarize the silhouette encoding for 2 synchronized cameras.
Each concatenated image is preprocessed to Ψ before projection onto Ms .
A five level pyramid is implemented, which results in a 341 dimensional silhouette
descriptor† . To compare the difference between two concatenated images, their respective
silhouette descriptors Ψi and Ψj are compared using the weighted distance
\[ D_\Psi(\Psi_i, \Psi_j) = \sum_{f=1}^{F} \frac{1}{L(f)} \left\{ |\Psi_i(f) - \Psi_j(f)| - \gamma_{L(f)+1} \right\}. \tag{4.5} \]
The counter L(f ) denotes the current level of the sub-image f in the pyramid, with the
smallest sub-images located at the bottom of the pyramid and the original image at the
top. In order to minimize segmentation and silhouette noise (located mainly at the top
levels), the lower resolution images are biased by scaling each level comparison by 1/L(f ).
As the encoding process moves downwards from the top of the pyramid to the bottom,
it must continually update the cumulative mean area difference γL(f )+1 at each level.
This is because, at any current level L(f), only the differences in features that have not already been recorded at levels above it are recorded, hence the subtraction of γ_{L(f)+1}. To
embed DΨ (Ψi , Ψj ) into KPCA, the Euclidian distance in kp is replaced with the weighted
distance, resulting in a silhouette kernel
\[ k_s(\Psi_i, \Psi) = \exp\{-\gamma_s D_\Psi(\Psi_i, \Psi)^2\}. \tag{4.6} \]
† For a five level pyramid, the feature vector dimension is determined by the following formula: 1^2 + 2^2 + 4^2 + 8^2 + 16^2 = 341.
Using the same implicit technique as that of equation 4.2, KPCA silhouette projection
is achieved by using ks (·, ·) and the projection (onto the first ηs principal axis) of Ψi is
denoted as
\[ v_{s_i} = [\langle V_s^1 \cdot \Phi(\Psi_i) \rangle, \ldots, \langle V_s^{\eta_s} \cdot \Phi(\Psi_i) \rangle]^T, \quad \forall\, v_s \in \mathbb{R}^{\eta_s}, \tag{4.7} \]
where \(\langle V_s^k \cdot \Phi(\Psi_i) \rangle = \sum_{j=1}^{N} \gamma_j^k \, k_s(\Psi_j, \Psi_i)\), with γ representing the Eigenvectors of the corresponding centered silhouette kernel matrix.
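A small sketch of how such a precomputed pyramid distance matrix can be embedded into KPCA is given below: the Euclidian distance of the RBF kernel is simply substituted by D_Ψ as in equation 4.6, after which the kernel matrix is centred and eigendecomposed as usual. The random symmetric matrix stands in for the real pairwise D_Ψ values and is only an assumption for illustration.

```python
import numpy as np

def silhouette_kernel_matrix(D, gamma_s):
    """Distance-substitution Gaussian kernel of equation 4.6: the Euclidian distance of
    the RBF kernel is replaced by the precomputed pyramid distance matrix D."""
    return np.exp(-gamma_s * D ** 2)

def centre_kernel(K):
    """Centre a kernel matrix in feature space, as required before KPCA."""
    N = K.shape[0]
    one = np.ones((N, N)) / N
    return K - one @ K - K @ one + one @ K @ one

# toy usage: a random symmetric matrix stands in for the pairwise D_Psi distances
D = np.abs(np.random.randn(100, 100)); D = (D + D.T) / 2.0; np.fill_diagonal(D, 0.0)
Ks = centre_kernel(silhouette_kernel_matrix(D, gamma_s=0.5))
lam, gam = np.linalg.eigh(Ks)      # the 'gamma' Eigenvectors of equation 4.7
```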
4.3
Locally Linear Subspace Mapping
Having obtained Ms and Mp , markerless motion capture can now be viewed as the problem of mapping from the silhouette subspace Ms to the pose subspace Mp . Using P tr to
denote the KPCA projected set of training poses (where P tr =[v p1 , v p2 , ... , v pM ]) and, similarly, letting S tr denote the projected set of training silhouettes (where S tr =[v s1 , v s2 , ... , v sM ]),
the subspace mapping can now be summarized as follows:
• Given S tr (the set of training silhouettes projected onto Ms via KPCA using the
pyramid kernel) and its corresponding pose subspace training set P tr in Mp , how
do we learn a mapping from Ms to Mp , such that it generalizes well to previously
unseen silhouettes projected onto Ms at run-time?
Referring to figure 4.6, the projection of the (novel) input silhouette during capture (onto
Ms ) is denoted by s in and the corresponding pose subspace representation (in Mp ) is
denoted by p out . The captured (output) pose vector xout (representing the joint positions
of a generic mesh in normalized RJC format) is approximated by determining the pre-image of p out .
The (non-parametric) LLE mapping [85] is used to transfer projected vectors from Ms
to Mp . This only requires the first 2 (efficient) steps of LLE:
1. Neighborhood selection in Ms and Mp .
2. Computation of the weights for neighborhood reconstruction.
Figure 4.6: Markerless motion capture re-expressed as the problem of mapping from the
silhouette subspace Ms to the pose subspace Mp . Only the projection of the training set
(S tr and P tr ) onto the first 4 principal axes in their respective feature spaces is shown (point intensity is the 4th dimension). Note that the subspaces displayed here are those of a walk motion fully rotated about the vertical axis. Different sets of motion will obviously
have different subspace samples, however, the underlying concept of KSM remains the
same.
The goal of LLE mapping for KSM is to map s in (in Ms ) to p out (in Mp ), whilst trying
to preserve the local isometry of its nearest K neighbors in both subspaces. Note that, in
this case, we are not trying to learn a complete embedding of the data, which requires the
inefficient third step of LLE, which is of complexity O(dN 2 ), where d is the dimension of
the embedding and N is the training size. Crucial details regarding the first two steps of
LLE, as used in KSM, are now summarized.
Figure 4.7: Diagram to summarize the mapping of s in from silhouette subspace Ms to
p out in pose subspace Mp via non-parametric LLE mapping [85]. Note that the reduced
subsets S lle and P lle are used in mapping novel input silhouette feature vectors.
4.3.1
Neighborhood Selection for LLE Mapping
From initial visual inspection of the two projected training sets S tr and P tr in figure 4.6,
there may not appear to be any similarity. Therefore, the selection of the nearest K neighbors (of an input silhouette captured at run-time) by simply using Euclidian distances in
Ms is not ideal. This is because the local neighborhood relationships between the training silhouettes themselves do not appear to be preserved between Ms and Mp , let alone
trying to preserve the neighborhood relationship for a novel silhouette.
For example, to understand how the distortion in Ms may have been generated, the
silhouette kernel ks is used to embed two different RJC pose subspace vectors p A and
p B which have similar silhouettes (figure 4.8). The silhouette kernel ks , in this case, will
map both silhouettes to positions close together (represented as sA and sB ) in Ms , even
though they are far apart in Mp .
Figure 4.8: Diagram to show how two different poses p A and p B may have similar concatenated silhouettes in image space.
As a result, an extended neighborhood selection criterion, which takes into account
temporal constraints, must be enforced. That is, given the unseen silhouette projected
onto Ms , training exemplars that are neighbors in both Ms and Mp must be identified.
Euclidian distances between training vectors in S tr and s in can still identify neighbors in
Ms . The problem lies in finding local neighbors in P tr , since the system does not yet
know the projected output pose vector p out (this is what the KSM is trying to determine
in the first place). However, a rough estimate of p out can be obtained by tracking in the
subspace Mp via a predictive tracker, such as the Kalman Filter [53]. From experiments,
it was concluded that tracking using linear extrapolation is sufficient in KSM for motion
capture. This is because tracking is only used to eliminate potential neighbors which may
be close in Ms , but far apart (from the tracked vector) in Mp . Therefore, no accurate
Euclidian distance between the training exemplars in P tr and the predicted pose needs to
be calculated.
In summary, to select the K neighbors needed for LLE mapping, the expected pose
is predicted using linear extrapolation in Mp . A subset of training exemplars nearest to
the predicted pose is then selected (from P tr ) to form a reduced subset P lle (in Mp ) and
S lle (in Ms ). From the reduced subsets‡ , the closest K neighbors to s in (using Euclidian
distances in Ms ) can be identified and p out reconstructed from the linked neighbors in
Mp .
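The neighbour selection just summarized can be sketched as follows; the linear extrapolation from the two previously captured projections, the subset fraction and the variable names are assumptions of this sketch rather than fixed choices of the thesis.

```python
import numpy as np

def select_lle_neighbours(S_tr, P_tr, s_in, p_prev, p_prev2, K=8, subset_frac=0.2):
    """Temporally constrained neighbour selection (section 4.3.1): linearly extrapolate
    the expected pose in M_p, keep the exemplars nearest to the prediction, then pick
    the K exemplars closest to the input silhouette projection s_in in M_s."""
    p_pred = 2.0 * p_prev - p_prev2                      # linear extrapolation in M_p
    n_keep = max(K, int(subset_frac * len(P_tr)))        # reduced subsets P_lle / S_lle
    keep = np.argsort(np.linalg.norm(P_tr - p_pred, axis=1))[:n_keep]
    S_lle, P_lle = S_tr[keep], P_tr[keep]
    nbrs = np.argsort(np.linalg.norm(S_lle - s_in, axis=1))[:K]
    return S_lle[nbrs], P_lle[nbrs]
```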
4.3.2
LLE Weight Calculation
Based on the filtered neighbor subset S lle (with K exemplars identified from the previous
section), the weight vector w^s is calculated so as to minimize the following reconstruction cost function:
\[ \varepsilon(s_{in}) = \left\| s_{in} - \sum_{j=1}^{\kappa} w_j^s \, s_j^{lle} \right\|^2. \tag{4.8} \]
LLE mapping also enforces that \(\sum_{j=1}^{\kappa} w_j^s = 1\). Saul and Roweis [85] showed that w^s can be determined by initially calculating the symmetrical (and semi-positive definite) “local” Gram matrix H, where
\[ H_{ij} = (s_{in} - s_i^{lle}) \cdot (s_{in} - s_j^{lle}). \tag{4.9} \]
‡ From experiments, subsets consisting of 10% to 40% of the entire training set are usually sufficient in filtering out silhouette outliers. The subset size also does not affect the output pose (as visually determined by the naked eye) whilst in this range, and is therefore not included as a parameter in the tuning process (section 4.3.3).
Solving w_j^s from H is a constrained least squares problem, which has the following closed-form solution:
\[ w_j^s = \frac{\sum_k H_{jk}^{-1}}{\sum_{lm} H_{lm}^{-1}}. \tag{4.10} \]
To avoid explicit inversion of the Gram matrix H, w_j^s can efficiently be determined by solving the linear system of equations \(\sum_k H_{jk} w_k^s = 1\) and re-scaling the weights to sum to one [86]. From w^s, the projected pose subspace representation can be calculated as
\[ p_{out} = \sum_{j=1}^{\kappa} w_j^s \, p_j^{lle}, \quad p_{out} \in \mathcal{M}_p, \tag{4.11} \]
where p_j^{lle} is an instance of P^{lle} (the pose subspace representation of S^{lle} in M_p). From p_out (which exists in M_p), the captured (pre-image) pose x_out (in the normalized RJC pose space) is approximated via the fixed point algorithm of Schölkopf et al [87].
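A minimal sketch of the weight computation and mapping of equations 4.8 to 4.11 is given below; the small regularisation term added to the Gram matrix is a numerical-stability assumption of the sketch and not part of the formulation above.

```python
import numpy as np

def lle_weights(s_in, S_lle, reg=1e-6):
    """Reconstruction weights of equations 4.8-4.10: build the local Gram matrix H,
    solve H w = 1 (avoiding an explicit inverse) and rescale the weights to sum to one."""
    diffs = s_in - S_lle                                 # K x dim differences
    H = diffs @ diffs.T                                  # equation 4.9
    H = H + reg * np.trace(H) * np.eye(len(S_lle))       # small regulariser for stability
    w = np.linalg.solve(H, np.ones(len(S_lle)))
    return w / w.sum()

def map_to_pose_subspace(s_in, S_lle, P_lle):
    """Equation 4.11: apply the weights found in M_s to the linked neighbours in M_p."""
    w = lle_weights(s_in, S_lle)
    return w @ P_lle                                     # p_out in M_p
```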
4.3.3
Silhouette Parameter Tuning via LLE Optimization
Similar to Mp , the two free parameters γs and ηs (in equation 4.6 and equation 4.7) need
to be tuned for the silhouette subspace Ms . In addition, an optimal value for K, the
number of neighbors for LLE mapping, must also be selected. Referring back to figure
4.1, the parameters should be tuned to optimize the non-parametric LLE mapping from
Ms to Mp . The same concept as in section 4.2.2 is applied, but instead of using preimage approximations in tuning, the parameters γs , ηs and K are tuned to optimize the
LLE silhouette-to-pose mapping. Note that due to the use of an unconventional kernel in
silhouette KPCA projection, it is unlikely that a good pre-image can be approximated.
However, in KSM for motion capture, there is no need to map back to the silhouette input
space, hence, no requirement to determine the pre-image in silhouette space.
The silhouette tuning is achieved by minimizing the LLE reconstruction cost function
\[ C_s = \frac{1}{N} \sum_{i=1}^{N} \left\| v_i^p - \sum_{j=1}^{K} w_{ij}^s \, v_j^p \right\|^2, \tag{4.12} \]
where j indexes the K neighbors of v_i^p. The mapping weight w_{ij}^s, in this case, is the weight factor of v_i^s that can be encoded in v_j^s (as determined by the first two steps of LLE in [85] and the reduced training set from M_p) using the Euclidian distance in M_s.
To ensure that the tuned parameters generalize well to unseen novel inputs, training is
again performed using cross validation on the training set. Once the optimal parameters
are selected for both Ms and Mp , the system is ready for capture by projecting the input
silhouettes through the learnt feature subspaces.
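The tuning procedure can be sketched as a simple grid search (reusing lle_weights from the earlier sketch). Here project_silhouettes is an assumed helper that returns the KPCA projections of the training silhouettes for a given (γ_s, η_s), and choosing neighbours by Euclidian distance in M_s alone is a simplification of the procedure described above.

```python
import numpy as np
from itertools import product

def lle_mapping_cost(S_proj, P_proj, K):
    """LLE reconstruction cost C_s of equation 4.12: weights computed from the silhouette
    projections are applied to the corresponding pose projections."""
    N, cost = len(S_proj), 0.0
    for i in range(N):
        others = np.delete(np.arange(N), i)
        nbrs = others[np.argsort(np.linalg.norm(S_proj[others] - S_proj[i], axis=1))[:K]]
        w = lle_weights(S_proj[i], S_proj[nbrs])         # from the earlier sketch
        cost += np.sum((P_proj[i] - w @ P_proj[nbrs]) ** 2)
    return cost / N

def tune_silhouette_parameters(project_silhouettes, P_proj, gammas, etas, Ks):
    """Grid search over (gamma_s, eta_s, K); project_silhouettes(gamma_s, eta_s) is an
    assumed helper returning the KPCA projections S_proj of the training silhouettes."""
    best = None
    for g, e, k in product(gammas, etas, Ks):
        c = lle_mapping_cost(project_silhouettes(g, e), P_proj, k)
        if best is None or c < best[0]:
            best = (c, g, e, k)
    return best                                          # (cost, gamma_s, eta_s, K)
```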
4.4
Experiments & Results
Section 4.4.1 presents experiments which compare the Pyramid Match kernel [44] (adopted
in KSM) and the Shape Context [11] (which is a popular choice for silhouette-based markerless motion capture [75, 6]) in defining correspondence between human silhouettes. Motion capture results for Kernel Subspace Mapping (KSM) from real and synthetic data
are presented in section 4.4.2 and section 4.4.3 respectively. The system is trained with
a generic mesh model (figure 4.4) using motion captured data from the Carnegie Mellon
University motion capture database [32]. Note that even though the system is trained
with a generic model, the technique is still model free as no prior knowledge nor manual
labelling of the test model is required. All concatenated silhouettes are preprocessed and
resized to 160 × 160 pixels.
Figure 4.9: Illustration to summarize markerless motion capture and the training and
testing procedures. The generic mesh model (figure 4.10 [center]) is used in training,
whilst a previously unseen mesh model is used for testing.
4.4.1
Pyramid Match Kernel for Motion Capture
This section aims to show that the pyramid kernel for human silhouette comparison (section 4.2.3) is as descriptive as the shape context [11] (which has already been shown
successful in silhouette-based markerless motion capture techniques such as [6, 75]). A set
of synthetic silhouettes of a generic mesh model walking (figure 4.11 [right]) is used to test
Figure 4.10: Selected images of the different models used to test Kernel Subspace Mapping (KSM) for markerless motion capture. [left] Biped model used to create deformable
meshes for training and testing. [center] Generic model used to generate synthetic training
silhouettes. [right] Example of a previously unseen mesh model, which can be used for
synthetic testing purposes (appendix B).
the efficacy of the modified Pyramid Match kernel (section 4.2.3) on human silhouettes.
For comparison, the Shape Context of the silhouettes is also calculated and the silhouette
correspondence recorded (figure 4.11 [left]). From the experiments, our pyramid kernel is
as expressive as the Shape Context for human silhouette comparison [11, 75] and its cost
is (only) linear in the number of silhouette features (i.e. the sets’ cardinality).
Figure 4.11: Comparison between the shape context descriptor (341 sample points) and
the optimized pyramid match kernel (341 features) for walking human silhouettes. The
silhouette distances are given relative to the first silhouette at the bottom left of each
silhouette bar.
To further test the kernel, a synthetic sequence of a walk motion fully rotated about
the vertical [yaw] axis is used. In this case, the corresponding normalized RJC data
(section 3.4.1) is used as the clean data (for comparison). From a markerless motion
capture perspective, the goal is to infer the RJC vectors from the noisy input silhouettes.
Therefore, a good silhouette descriptor must be able to define shape correspondence that
shows a visual similarity with the RJC equivalent. The kernel matrix (using a Gaussian
RBF kernel) of the corresponding RJC data (equation 3.19) is shown in figure 4.12 [left].
The kernel matrix generated from the pyramid kernel from the corresponding silhouettes
is shown in figure 4.12 [right]. The silhouettes are captured with two synchronized virtual
cameras (orthogonal to each other) and concatenated into a single feature image before
processing. Note that a single synthetic camera can be used. However, due to ambiguities
from 3D to 2D projections of silhouettes, it is difficult to generate a visually similar kernel
matrix (to the RJC matrix) via a single camera, let alone infer accurate pose and yaw
direction for motion capture. In figure 4.12, the similarities in the diagonal intensity bands
indicate a correlation between the two tuned kernels.
Figure 4.12: Intensity images of the normalized RJC pose space kernel [left] and the Pyramid Match kernel of silhouettes captured from synchronized cameras [right] of a training walk sequence (fully rotated about the vertical axis). Similarities in the diagonal intensity bands indicate a correlation between the two tuned kernels.
4.4.2
Quantitative Experiments with Synthetic Data
In order to test KSM quantitatively, novel motions (similar to the training set) are used
to animate a previously unseen mesh model of the author (appendix B) and their corresponding synthetic silhouettes are captured for use as control test images§ . Using a
walking training set of 323 exemplars, KSM is able to infer novel poses at an average
speed of ∼0.104 seconds per frame on a Pentium™ 4 with a 2.8 GHz processor. The captured pre-image pose is compared with the original pose that was used to generate the
synthetic silhouettes (figure 4.13). At this point, it should be highlighted that the test
mesh model is different to the training mesh model, and all our test images are from an
unseen viewing angle or pose (though relative camera-to-camera positions should be the
same as was used in training).
Figure 4.13: Visual comparison of the captured pose (red with ‘O’ joints) and the original
ground truth pose (blue with ‘*’ joints) used to generate the synthetic test silhouettes.
For 1260 unseen test silhouettes of a walking sequence, which is captured from different
yaw orientations, KSM (with prior knowledge of the starting pose) can achieve accurate
pose reconstructions with an average error of 2.78° per joint.
§ For motion capture results of synthetic walking sequences via KSM, please refer to the attached file: videos/syntheticMotionCapture.MP4.
Figure 4.14: KSM capture error (degrees per joint) for a synthetic walk motion sequence.
The entire sequence has a mean error of 2.78◦ per joint.
For comparison with other related work [75], this error reduces to less than 2° per Euler degree of freedom [d.o.f.] (each 3D rotation is represented by a set of 3 Euler rotations¶). Agarwal and Triggs in [2] were able to achieve a mean angular error
of 4.1◦ per d.o.f., but their approach requires only a single camera. We have intentionally
recorded our errors in terms of degrees per joint, and not in Euler degree of freedom (as
in [2]), because for each 3D rotation there are many possible sets of Euler rotation encodings. Therefore, using the Euler degree of freedom to encode error can result in scenarios where the same 3D rotation can be interpreted as different, hence inducing false positives.
Our technique (which uses a reduced training set of 343 silhouettes and 2 synchronized
cameras) also shows visually comparable results to the technique proposed by Grauman
et al in [45], which uses a synthetic training set of 20,000 silhouettes and 4 synchronized
cameras. In [45], the pose error is recorded using the Euclidian distance between the joint
centers in real world scale. We believe that this error measurement is not normalized,
in the sense that, for a similar motion sequence, a taller person (with longer limbs) will more likely record a larger average error than a shorter person (due to larger variance for
the joint located at the end of each limb). Nevertheless, to enable future comparison, we
have presented our real world distance results (as in [45]) in combination with the test model's height. For a test model with a height of 1.66 m, KSM can estimate pose with a mean (real world) error of approximately 2.02 cm per joint (figure 4.15).
¶ We note that the error relationship between a 3D rotation and the set of 3 Euler rotations (encoding a 3D rotation) is complicated. In this case, only an approximation has been made to allow error comparison with other markerless motion capture approaches.
proposed by Grauman et al in [45] reported an average pose error of 3 cm per joint, when
using the full system of 4 cameras. A fairer comparison can be achieved when the two
approaches (KSM and [45]) both use 2 synchronized cameras, in which case, the approach
in [45] reported an average pose estimation error of over 10 cm per joint.
Figure 4.15: KSM capture error (cm/joint) for a synthetic walk motion sequence. The
entire sequence has a mean error of 2.02 cm per joint.
To further test the robustness of KSM, binary salt & pepper noise with different noise densities is added to the original synthetic data set (as in figure 4.14). Noise densities of 0.2, 0.4 and 0.6 were added to the test silhouettes, attaining mean errors of 2.99° per joint, 4.45° per joint and 10.17° per joint respectively. At this
point, we stress that KSM is now being tested for its ability to handle a combination of
silhouette noise (i.e. the test silhouettes are different to that of the training silhouettes)
and pixel noise (which is used to simulate scenarios with poor silhouette segmentation).
An interesting point to note is that the increase in noise level does not equally affect the
inference error for each silhouette, but rather, the noise increases the error significantly for
a minority of the poses (i.e. the peaks in figure 4.16). This indicates that the robustness
of KSM can be improved substantially by introducing some form of temporal smoothing
to minimize these peaks in error.
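The noise used in this robustness test can be generated as in the following sketch, where a fraction of the silhouette pixels equal to the stated density is flipped to a random binary value; the exact noise generator used for the experiments is not specified here, so this is only an illustrative assumption.

```python
import numpy as np

def add_salt_and_pepper(silhouette, density, rng=None):
    """Corrupt a binary silhouette with salt & pepper noise: a fraction 'density' of the
    pixels is flipped to a random binary value."""
    rng = np.random.default_rng(0) if rng is None else rng
    noisy = silhouette.copy()
    mask = rng.random(silhouette.shape) < density
    noisy[mask] = rng.integers(0, 2, size=int(mask.sum()))
    return noisy

# densities matching the experiment above
clean = (np.random.rand(160, 160) > 0.7).astype(float)
noisy_sets = [add_salt_and_pepper(clean, d) for d in (0.2, 0.4, 0.6)]
```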
Figure 4.16: KSM capture error for a walking motion (fully rotated about the vertical axis)
with different noise densities (Salt & Pepper Noise). The errors are recorded in degrees
per joint.
4.4.3
Qualitative Experiments with Real Data
For testing with real data, silhouettes of a spiral walk sequence were captured using simple
background subtraction‖. Two perpendicular cameras were set up (with the same extrinsic
parameters as the training cameras) without precise measurements. Due to the simplicity
of the setup, segmentation noise (as a result of shadows and varying light) was prominent in
our test sequences. Selected results are shown in figure 4.17. Synthetic salt & pepper noise
was also added to the concatenated real silhouettes to test the robustness of the system (figure 4.18). The ability to capture motion under these varying conditions demonstrates
the robustness of KSM.
Figure 4.17: Kernel subspace mapping motion capture on real data using 2 synchronized
un-calibrated cameras. All captured poses are mapped to a generic mesh model and
rendered from the same angles as the cameras.
"
For motion capture results of real walking sequences (noise and clean silhouettes) via KSM, please
refer to the attached file: videos/realMotionCapture.MP4 and videos/realMotionCaptureNoisy.MP4
For animation, captured poses were mapped to a generic mesh model and the technique was able to create smooth realistic animation in most parts of the sequence. In
other parts (∼12% of the captured animation’s time frame), the animation was unrealistic
as inaccurate captured poses were being appended, the presence of which is further exaggerated in animation. However, even though an incorrect pose may lead to unrealistic
animation, it still remains within the subspace of realistic pose when viewed statically (by
itself). This is because all output poses are constrained to lie within the subspace spanned
by the set of realistic training poses in the first place.
Figure 4.18: Selected motion capture results to illustrate the robustness of KSM. The
system was able to infer pose (at lower accuracy) in the presence of Gaussian noise in
segmented images. The algorithm was able to capture visually accurate pose (∼80% of
the test set comprising 500 exemplars) from most concatenated silhouette images (top two images and bottom left). In some cases KSM outputs a visually incorrect pose vector (bottom right). However, because the pose was generated in the subspace of possible walking motions, even though the pose is visually inaccurate when compared to the silhouettes, it is still a pose of a person in a walking motion.
4.5
Conclusions & Future Directions
This chapter introduces Kernel Subspace Mapping (KSM), a novel markerless motion capture technique that can capture 3D human pose using un-calibrated cameras. The result
shows that neither detailed measurements nor initialization of the cameras is required. The
main advantages of the approach are its simplicity in setup, its speed and its robustness.
Furthermore, the technique generalizes well to unseen examples and is able to generate realistic poses that are not in the original database, even when using a single generic model
for training. The motion capture system, which is model free, does not require prior
knowledge of the actor nor manual labelling of body parts. This makes our un-calibrated
system well suited for real-time and low cost human computer interaction (HCI) as no
accurate initialization is required. Furthermore, the technique is still able to accurately
track and estimate full pose from silhouettes with high levels of noise (figure 4.16). To the author's knowledge, there have been no previously proposed silhouette-based motion capture approaches that can robustly handle these levels of noise.
From the motion capture results (where KSM has shown the ability to effectively infer pose from both synthetic and real silhouettes, and also in the presence of noise), we believe KSM to be a motion capture technique that works very well at estimating 3D pose from binary (human) silhouettes. It is important to elaborate on other interesting aspects of
KSM, such as the ability for KSM to handle gray scale images (as opposed to only binary
images), as well as how KSM can take into account pixel relationships along the image
plane. By taking into account the relationship of pixels along the image plane, a better
‘distance’ kernel can be defined (for KSM), which allows improvements in the areas of training set reduction and mapping accuracy. In chapter 5, we investigate this concept on the problem of view-point estimation of 3D rigid objects from gray scale images.
There is a good reason why KSM has not been applied to human pose inference from
gray scale images. This is because by taking into account the photometric information
of an actor at training, the system will be constrained to the capture of only that actor
at run-time. Furthermore, during capture, the actor will need to be wearing the same
clothing that he/she was wearing at the time the training sequences were captured. The
background pattern would also need to remain the same between training and during
motion capture. This limits the ability of KSM to generalize to the capture of other unseen actors, as well as preventing KSM from being trained off-site (i.e. KSM cannot be
trained in one place and used at a later stage to capture pose in an environment with a
different background). Off-site training is possible for silhouettes because normalized silhouettes (i.e. after alignment, cropping and resizing [section 4.2.3]) are relatively similar
irrespective of their environment (provided that acceptable silhouette segmentation can
be attained).
Specifically with the goal of improving the capture rate of KSM in mind, an interesting
aspect to further investigate is how the size of the training set affects the computational
cost of pose estimation. As highlighted earlier, because the cost of KPCA projection (a
component in KSM) is dependent on the cardinality of the training set (section 3.3), it may be possible to further improve the KPCA (motion) de-noising rate by using a smaller training
set (than the original training set). Furthermore, if it is possible to find a way to select
only the best data that represent the de-noising subspace (and discard the remainder),
then an increase in capture rate can be achieved, perhaps at the expense of a relatively minor decrease in de-noising and (motion) capture quality. Chapter 6 investigates this
concept of training set reduction (for human motion de-noising) via the Greedy KPCA
algorithm [41].
Chapter 5
Image Euclidian Distance (IMED)
embedded KSM∗
The previous chapter shows how Kernel Subspace Mapping (KSM) (chapter 4) can be applied to markerless motion capture to infer high dimensional pose from concatenated binary
images. This chapter further confirms the flexibility of KSM by extending its application
to the analysis of monocular intensity images (via a 3D object view-point estimation problem [figure 5.1]). The novel application of the Image Euclidian Distance (IMED) [116] in
KPCA and KSM is introduced. We show how IMED (which takes into account spatial
relationship of pixels and their intensities) allows KSM to accurately estimate viewpoints
(less than 4° error) via the use of a sparse training set (as few as 30 training images). This
chapter aims to demonstrate that KSM embedded with the Image Euclidian Distance
(IMED) [116] is a more accurate technique (and one with a sparser solution set) than
KSM (using vectorized Euclidian distance [i.e. raster scan an image into a vector and
calculate the standard Euclidian distance]) and other previously proposed approaches in
3D viewpoint estimation [80, 119, 47, 78].
∗ This chapter is based on the conference paper [106] T. Tangkuampien and D. Suter: 3D Object Pose Inference via Kernel Principal Component Analysis with Image Euclidian Distance (IMED): British Machine Vision Conference (BMVC) 2006, pages 137–146, Edinburgh, UK.
5.1
Introduction
Kernel techniques such as Kernel Principal Component Analysis (KPCA) [87] and Support
Vector Machines (SVM) [114] have both been shown to be powerful non-linear techniques
in the analysis of pixel patterns in images [90, 91, 40]. In supervised SVM image classification [39], images are usually vectorized before embedding (via a kernel function) for
discrete classification in high dimensional feature space. In unsupervised KPCA [90, 87],
vectorized images are also embedded to a high dimensional feature space, but instead of
determining optimal separating hyper planes, linear Principal Component Analysis (PCA)
is performed. Both KPCA and SVM take advantage of the kernel trick (in order to avoid
explicitly mapping input vectors), which involves defining the ‘distance’ between two vectors. In both cases, the traditional vectorized Euclidian distance is used for embedding.
Each pixel, in this case, is usually considered as an independent dimension, and therefore
these approaches do not take into account the spatial relationship of nearby pixels along
the image plane.
This chapter presents results indicating that the Image Euclidian Distance (IMED) [116]
(which takes into account spatial pixel relation on the image plane) is a better distance
criterion (than vectorized Euclidian distance), especially for 3D object pose estimation
(figure 5.1). IMED can be embedded into KPCA via the use of the Standardizing Transform (ST) [116], and it can also be implemented efficiently via a combination of the
Kronecker product and Eigenvector projections [60]. A major significance of ST, is that it
can alternatively be viewed as a pre-processing transformation (on the pixel intensities),
after which traditional vectorized Euclidian distance can be applied [116]. Effectively, this
means that all desirable properties (such as positive definiteness, non-linear de-noising [91]
and pre-image approximation using gradient descent [87]) of using traditional Euclidian
distances in KPCA still applies. A minor disadvantage of ST is that it can be expensive
in its memory consumption. For an image of size M ×N , the full ST matrix needs to be
of size MN ×MN . Fortunately, for the case of Gaussian kernel embedding, this matrix is
separable [116] and can be stored as the Kronecker product [60] of two reduced matrices
of size M ×M and N ×N .
A significant contribution of this chapter is the demonstration of a practically viable
embedding of IMED (with Gaussian kernel) into Kernel Principal Components Analysis
(KPCA) [90] and Kernel Subspace Mapping (KSM) (section 4). By separating the originally proposed Standardizing Transform (ST) into the Kronecker product of two identical
reduced transform [116], this chapter shows how IMED is a better image ‘distance’ criterion than traditional vectorized Euclidian distance. To support this, results using IMED
embedded KSM for 3D object pose estimation (compared to [80]), are presented in section 5.5.
5.2
Problem Definition and Related Work
The problem of 3D object pose estimation using machine learning can be summarized as
follows:
• Given images of the object, from different known orientations, as the training set,
how can a system optimally learn a mapping to accurately determine the orientation
of unseen images of the same object from a ‘static’ camera?
Alternatively, instead of calculating the object’s orientation, the system could determine
the orientation and position of a moving camera that the ‘static’ object is viewed from. To
test the system’s robustness, the unseen input images can be deliberately corrupted with
noise, in which case, image de-nosing techniques, such as KPCA, can be applied. Zhao
et al [119] used KPCA with a neural network architecture to determine the orientation
of objects from the Columbia Object Image Library (COIL-20) database [78]. In that
case, however, the pose is only restricted to rotations about a single vertical axis. This
chapter considers the more complex pose estimation problem, that of determining the
camera position where an object is viewed from anywhere on the upper hemisphere of an
object’s viewing sphere.
Figure 5.1: Diagram to summarize the 3D pose estimation problem of an object viewed
from the upper viewing hemisphere.
Peters et al [80] introduced the notion of ‘view-bubbles’, which are area views where
the object remains visually the same (up to within a pre-defined criterion). The problem
of hemispherical object pose estimation can then be re-formulated as the selection of the
correct view bubble and interpolating from within the bubble. A disadvantage of this
technique, however, is the need for a large and well-sampled training set (up to 2500
images in [80]), to select the view bubble training images from. In this chapter, IMED
embedded KSM presents comparable results using only 30 randomly selected training
images of the same object (i.e. we do not select the best 30 training images by scanning
through all 2500 images† ).
5.3
Image Euclidian Distance (IMED)
For notation purposes, a summary of the Image Euclidian Distance as introduced by
Wang et al [116] is now presented. Each M ×N input image must first be vectorized to x,
where x ∈ RM N . The intensity at the (m, n) pixel is then represented as the (mN +n)-th
dimension in x. The standard vectorized Euclidian distance dE (x , y ) between vectorized
images x and y is then defined as
\[ d_E^2(x, y) = \sum_{k=1}^{MN} (x_k - y_k)^2 = (x - y)^T (x - y). \tag{5.1} \]
† The data set used in this chapter to test IMED embedded KSM (for 3D viewpoint estimation) is the same set described in [81]. Each data set consists of 2500 images of a 3D object taken at a regular interval from the top hemisphere of the object's viewing sphere.
Alternatively, the IMED between images introduces the notion of a metric matrix
G of size M N ×M N , where the element gij represents how the dimension x i affects the
dimension x j . To avoid confusion, it is important to note that i, j, k ∈ [1, M N ], whereas
m ∈ [1, M ] and n ∈ [1, N ]. Provided that the metric matrix G is known, the Image
Euclidian Distance can then be calculated as
\[ d_{IM}^2(x, y) = \sum_{i,j=1}^{MN} g_{ij} (x_i - y_i)(x_j - y_j) = (x - y)^T G (x - y). \tag{5.2} \]
The metric matrix G solely defines how the IMED deviates from the standard (vectorized) Euclidian distance (which is induced when G is replaced by the identity matrix). As
highlighted in [116], the main constraints for IMED are that the element gij is dependent
on the distance between pixels Pi and Pj , that is gij = f (|Pi −Pj |), and that gij monotonically decreases as |Pi − Pj | increases. A constraint on f is that it must be a continuous
positive definite function [116], thereby ensuring that G is positive definite and excluding
the traditional Euclidian distance as a subset of IMED‡. To avoid confusion, the reader should note that f refers to a continuous positive definite function encoding the pixel spatial relationship along the image plane; f should not be confused with the positive definite kernel function as required in KPCA (section 3.3). Note that at this stage, the calculation of IMED is not memory efficient, due to the need to store the metric G, which is of size MN×MN. Section 5.3.2 shows how to improve on this, but first, the Standardizing Transform (ST) [116] (which allows IMED to be embedded into more powerful learning algorithms such as SVMs, KPCA and KSM) is summarized.
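For reference, the definition above amounts to nothing more than a quadratic form, as in the following tiny sketch (the toy 4 × 4 images and names are illustrative assumptions); the practical issue is that G has MN × MN entries, which motivates the Standardizing Transform discussed next.

```python
import numpy as np

def imed_squared(x, y, G):
    """Direct evaluation of equation 5.2: d_IM^2(x, y) = (x - y)^T G (x - y).
    G is the MN x MN metric matrix; G = identity recovers equation 5.1."""
    d = x - y
    return float(d @ G @ d)

# toy usage on tiny 4 x 4 'images', so that G is only 16 x 16 here
x, y = np.random.rand(16), np.random.rand(16)
assert np.isclose(imed_squared(x, y, np.eye(16)), np.sum((x - y) ** 2))
```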
‡ Note that even though the typical vectorized Euclidian distance can be induced by replacing G with the identity matrix, the Euclidian distance is not a subset of IMED as it is not continuous across the image plane.
5.3.1
Standardizing Transform (ST)
As suggested in [116], the calculation of IMED can be simplified by decomposing G into A^T A, leading to
\[ d_{IM}^2(x, y) = (x - y)^T A^T A (x - y) = (u - v)^T (u - v), \tag{5.3} \]
where u = Ax and v = Ay. The Standardizing Transform (ST) is then merely a special
case of A, where
\[ G = A^T A = G^{\frac{1}{2}} G^{\frac{1}{2}}. \tag{5.4} \]
This reveals symmetric, positive definite and unique solutions for both G and G^{1/2}:
\[ G = \Gamma \Lambda \Gamma^T, \tag{5.5} \]
\[ G^{\frac{1}{2}} = \Gamma \Lambda^{\frac{1}{2}} \Gamma^T, \tag{5.6} \]
where Γ is the orthogonal column matrix of the Eigenvectors of G, and Λ the diagonal
matrix of the corresponding Eigenvalues. The Standardizing Transform (ST) is then the transformation G^{1/2}(•). For example, to embed IMED into learning algorithms based on the standard Euclidian norm, the technique simply applies the transformation G^{1/2}(•) to the vectorized images x and y, in order to obtain u and v respectively. From (5.3), the IMED is then simply the traditional Euclidian distance between the transformed vectors u and v.
5.3.2
Kronecker Product IMED
Up until now, only the constraints of IMED and its corresponding Standardizing Transform
G^{1/2}(•) have been summarized, but not how to construct the matrix G itself. This section reveals how to construct G efficiently using the Kronecker Product (as suggested in [116])
and concentrating specifically on the Gaussian function, where
\[ g_{ij} = \frac{1}{2\pi\sigma^2} \exp\left\{ \frac{-|P_i - P_j|^2}{2\sigma^2} \right\}. \tag{5.7} \]
In this case, σ controls the spread of the Gaussian signal, which acts like a bandlimited
filter. For large values of σ, the influence of, say, P_i on its neighboring pixel P_j on the image plane is more significant (compared to when σ is small). Therefore, G acts like
a low pass filter, since the signal induced is smoother and flatter and can be constructed
from low frequency signals. As σ decreases, the Gaussian signal induced becomes steeper
and thinner, reducing its influence on neighboring pixels, and requiring higher frequency
signals for reconstruction in the Fourier domain. For the extreme case, where σ→0, the
Dirac delta signal is induced, which requires infinite bandwidth, leading to the traditional
Euclidian distance.
Figure 5.2: Images of the Standardizing Transform G^{1/2} with different σ values. Note how the images tend to the diagonal matrix as σ→0.
By letting Pi and Pj be the pixels at location (mi ,ni ) and (mj ,nj ) respectively, the
corresponding squared pixel distance between the two pixels on the image plane is then
given by
\[ |P_i - P_j|^2 = (m_i - m_j)^2 + (n_i - n_j)^2. \tag{5.8} \]
As highlighted in [116], by substituting the above equation into equation 5.7, the separable,
and reducible metric representation can be obtained:
\[ g_{ij} = \frac{1}{2\pi\sigma^2} \left[ \exp\left\{ \frac{-[(m_i - m_j)^2 + (n_i - n_j)^2]}{2\sigma^2} \right\} \right] = \frac{1}{2\pi\sigma^2} \left[ \exp\left\{ \frac{-(m_i - m_j)^2}{2\sigma^2} \right\} \cdot \exp\left\{ \frac{-(n_i - n_j)^2}{2\sigma^2} \right\} \right]. \tag{5.9} \]
As previously mentioned, m ∈ [1, M ] and n ∈ [1, N ], hence leading to a re-formulation of
equation 5.9 as the Kronecker product [60] of two smaller matrices ΨM and ΨN of size
M ×M and N ×N respectively, where
\[ \Psi_M(m_i, m_j) = \frac{1}{2\pi\sigma^2} \exp\left\{ \frac{-(m_i - m_j)^2}{2\sigma^2} \right\}, \qquad \Psi_N(n_i, n_j) = \frac{1}{2\pi\sigma^2} \exp\left\{ \frac{-(n_i - n_j)^2}{2\sigma^2} \right\}. \tag{5.10} \]
This leads to a simplified and memory efficient version of the metric matrix [116], where
\[ G = \Psi_M \otimes \Psi_N. \tag{5.11} \]
For the case of square images, where M = N, Ψ_M and Ψ_N are identical (henceforth both Ψ_M and Ψ_N will be denoted as Ψ) and only one matrix copy of size M×M is required§.
Hence, Ψ can be decomposed into its corresponding matrix of Eigenvectors Ω and diagonal
Eigenvalues Θ, where
\[ \Psi = \Omega \Theta \Omega^T. \tag{5.12} \]
Using established properties of the Kronecker product [60] compatible with standard matrix multiplication, the metric matrix becomes
\[ G = (\Omega \Theta \Omega^T) \otimes (\Omega \Theta \Omega^T) = (\Omega \otimes \Omega)(\Theta \otimes \Theta)(\Omega \otimes \Omega)^T = \Gamma \Lambda \Gamma^T. \tag{5.13} \]
§ For the remainder of this chapter, only square images will be considered, as this can effectively halve the current memory consumption. Note that this does not exclude all non-square images as they can simply be pre-processed to square images before calculating IMED.
From this and equation 5.6, the Standardizing Transform G^{1/2}(•) can be derived as
\[ G^{\frac{1}{2}} = \Gamma \Lambda^{\frac{1}{2}} \Gamma^T = (\Omega \otimes \Omega)(\Theta \otimes \Theta)^{\frac{1}{2}} (\Omega \otimes \Omega)^T. \tag{5.14} \]
Furthermore, the reduced Eigenvalue matrix Θ is diagonal and can be diagonally vectorized as \(\tilde{\Theta}\) (i.e. take only the diagonal values), where \(\tilde{\Theta}_m\) represents Θ(m, m). The corresponding Kronecker product of the Eigenvalue matrices \((\Theta \otimes \Theta)^{\frac{1}{2}}\) can efficiently be represented as the outer product
\[ (\Theta \otimes \Theta)^{\frac{1}{2}} = (R[\tilde{\Theta} \tilde{\Theta}^T])^{\frac{1}{2}}, \tag{5.15} \]
where R is the vectorization function followed by the diagonalization function.
Further improvements to the calculation of the Standardizing Transform are possible by realizing that G^{1/2} is a series of linear Eigenvector projections and scalings [116]. Effectively, this means that a specified minimum Eigenvalue scaling threshold can be pre-selected, and the Eigenvalue projection ignored for Eigenvalues below the selected threshold. Letting τ be the minimum Eigenvalue scaling and \(\tilde{\Theta}\) the Eigenvalues re-ordered in descending order, the reduced Eigenvector column matrix can be rewritten as
\[ \Omega = [\, J \,|\, K \,], \tag{5.16} \]
where J represents the column matrix of the J Eigenvectors corresponding to the Eigenvalues larger than τ, and K the remaining Eigenvectors. The approximate Standardizing Transform then becomes
\[ \tilde{G}_\tau^{\frac{1}{2}} = (J \otimes J)\,(R[\tilde{\Theta}_{(1:J)} \tilde{\Theta}_{(1:J)}^T])^{\frac{1}{2}}\,(J \otimes J)^T, \tag{5.17} \]
where J is only of dimension M×J (with J < M), and \(\tilde{\Theta}_{(1:J)}\) is a one dimensional vector of the highest J Eigenvalues. Therefore, for high resolution images where M is large, it is possible to use the approximate Standardizing Transform \(\tilde{G}_\tau^{\frac{1}{2}}\) instead of G^{1/2}.
On the other hand, if the image resolution is low, and speed is a factor, it is possible to construct the full G^{1/2} matrix using a single Kronecker product. From equation 5.14, the following can be derived:
\[ G^{\frac{1}{2}} = (\Omega \otimes \Omega)(\Theta \otimes \Theta)^{\frac{1}{2}} (\Omega \otimes \Omega)^T = \Omega \Theta^{\frac{1}{2}} \Omega^T \otimes \Omega \Theta^{\frac{1}{2}} \Omega^T = \Psi^{\frac{1}{2}} \otimes \Psi^{\frac{1}{2}}, \tag{5.18} \]
with the constraint \(\Psi = (\Psi^{\frac{1}{2}})^T \Psi^{\frac{1}{2}} = \Psi^{\frac{1}{2}} \Psi^{\frac{1}{2}}\).
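The construction just derived can be sketched compactly: for a square image the reduced factor Ψ and its square root are built once, and the Standardizing Transform is applied without ever forming the MN × MN matrix by using the Kronecker identity (A ⊗ A) vec(X) = vec(A X A^T). The function names and the guard against tiny negative Eigenvalues are assumptions of this sketch.

```python
import numpy as np

def st_factor(M, sigma):
    """Reduced Gaussian factor Psi of equation 5.10 (M x M, over one image axis) and its
    symmetric square root Psi^(1/2) via the eigendecomposition of equation 5.12."""
    idx = np.arange(M)
    Psi = np.exp(-(idx[:, None] - idx[None, :]) ** 2 / (2.0 * sigma ** 2)) / (2.0 * np.pi * sigma ** 2)
    lam, Omega = np.linalg.eigh(Psi)
    lam = np.clip(lam, 0.0, None)                        # guard against tiny negative values
    return (Omega * np.sqrt(lam)) @ Omega.T              # Psi^(1/2)

def standardize_image(img, sigma):
    """Apply G^(1/2) = Psi^(1/2) kron Psi^(1/2) to a square image without forming the
    MN x MN matrix, using (A kron A) vec(X) = vec(A X A^T)."""
    Psi_half = st_factor(img.shape[0], sigma)
    return Psi_half @ img @ Psi_half                     # IMED is then the Euclidian
                                                         # distance between such images
```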
Visually, G^{1/2} is a transform domain smoothing [116], where the Eigenvectors with the higher Eigenvalues represent the low frequency basis signals. Hence, by leaving out the
Eigenvectors represented by K, the higher frequency signals are suppressed. Referring back
to figure 5.3 and equation 5.7, the larger the value of σ, the smoother the Standardizing
Transform image will be, and therefore, the higher the Eigenvalues corresponding to the
low frequency basis. The sharpest that the image can ever be (including noise) after the
Standardizing Transformation is when σ→0, where only the original image itself remains.
Figure 5.3: Images of a 3D object after applying the Standardizing Transform G^{1/2} with different σ values.
5.4
IMED embedded Kernel PCA
Since KSM is based on KPCA, a factor to initially consider is the embedding of IMED into
unsupervised Kernel Principal Components Analysis (KPCA) [90]. A summary of KPCA
is presented in section 3.3. KSM will be used in the 3D view-point estimation problem
(which is summarized in section 5.2). To embed IMED into KPCA, the Standardizing Transform G^{1/2} is initially applied to the vectorized training set X, to get the standardized training set U, where
\[ U = G^{\frac{1}{2}} X. \tag{5.19} \]
KPCA is then performed using the standard RBF kernel, which leads to the overall effect
of:
\[ k_{im}(x_i, x_j) = \exp\{-\gamma (x_i - x_j)^T G (x_i - x_j)\} = \exp\{-\gamma (u_i - u_j)^T (u_i - u_j)\} = k_v(u_i, u_j), \tag{5.20} \]
which we refer to as IMED embedded KSM. Considering the relationship in equation 5.20,
we are assured that all desirable properties of the standard RBF kernel, such as positive
definiteness and gradient descent pre-image approximations, are held by IMED. The free
parameters are then tuned on U and the IMED embedded KPCA projection becomes
k
+Vim
· Φ(x), =
N
"
αki kim (xi , x)
=
i=1
N
"
αki kv (ui , u).
(5.21)
i=1
The pre-image approximation using IMED embedded KPCA (trained using the set U) is then straightforward. For example, the fixed point algorithm of [87] can be used to determine the standardized pre-image u_p, and the inverse Standardizing Transform (G^{1/2})^{−1} applied to obtain the vectorized pre-image x_p, where

x_p = (G^{1/2})^{−1} u_p.    (5.22)

The inverse transform (G^{1/2})^{−1} can be efficiently computed (without storing an explicit version of the inverse matrix) by realizing that it is also a projection onto the same Eigenvectors of G^{1/2}, but with the inverse scaling (by the corresponding Eigenvalues) applied,
(G^{1/2})^{−1} = Γ(Λ^{1/2})^{−1} Γ^T = Γ(R^{−1}[Θ̂ Θ̂^T]) Γ^T,    (5.23)

where R^{−1}[Θ̂ Θ̂^T] is defined as the vectorization function of the inverse scaling of each element in Θ̂ Θ̂^T, followed by the diagonalization function. It is important to note that the positive semi-definite matrix G^{1/2} may have Eigenvalue Θ̂_m = 0. The Standardizing Transform G^{1/2} is then singular, and the inverse (G^{1/2})^{−1} does not exist [116]. However, (G̃^{1/2})^{−1} can be approximated using the same concept as in equation 5.17, ignoring the Eigenvectors whose corresponding Eigenvalues are close to zero.
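The relationships in equations 5.19, 5.20 and 5.22 can be checked numerically. The sketch below is only an illustration: it uses an arbitrary symmetric positive definite matrix as a stand-in for G^{1/2} (in practice it would be the Gaussian Standardizing Transform of section 5.3), and verifies that the IMED embedded kernel on the raw vectors equals the standard RBF kernel on their standardized counterparts, and that the image-space pre-image is recovered by inverting the transform.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 64                                   # number of pixels in a toy vectorized image
A = rng.standard_normal((M, M))
G_half = A @ A.T + M * np.eye(M)         # symmetric positive definite stand-in for G^{1/2}
G = G_half @ G_half                      # the corresponding metric matrix G

def k_im(x_i, x_j, gamma):
    """IMED embedded kernel of equation 5.20: exp(-gamma (x_i - x_j)^T G (x_i - x_j))."""
    d = x_i - x_j
    return np.exp(-gamma * d @ G @ d)

def k_v(u_i, u_j, gamma):
    """Standard RBF kernel on standardized vectors u = G^{1/2} x (equation 5.19)."""
    d = u_i - u_j
    return np.exp(-gamma * d @ d)

gamma = 1e-6
x_i, x_j = rng.standard_normal(M), rng.standard_normal(M)
u_i, u_j = G_half @ x_i, G_half @ x_j                            # standardized vectors
assert np.isclose(k_im(x_i, x_j, gamma), k_v(u_i, u_j, gamma))   # equation 5.20

# pre-image in image space (equation 5.22): invert the Standardizing Transform
u_p = u_i                              # stand-in for a pre-image found in the standardized space
x_p = np.linalg.solve(G_half, u_p)     # (G^{1/2})^{-1} u_p without forming the inverse explicitly
assert np.allclose(x_p, x_i)
```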
5.4.1 IMED embedded Kernel Subspace Mapping
Recall the basic structure and properties of KSM: Kernel Subspace Mapping can be used
to map data between two high dimensional spaces (chapter 4). KPCA is initialized to
learn the subspaces of the high dimensional data, say the input set X and the
output set Y, and LLE non-parametric mapping [85] is used to map between the two
feature subspaces. Given novel inputs, the data is projected through the two subspaces
and the pre-image calculated to determine the input’s representation in the output space.
If any of the training vectors are derived from images, then IMED embedded KPCA can
be applied. For generality, this section will consider both cases, where the input set X
is constructed from images of a 3D object, and the output set Y are three dimensional
camera positions on the upper hemisphere of the object’s viewing sphere. Note that this
is a simple case (as the output space is only 3 dimensional) for KSM, which can map to
higher dimensional output spaces as in human markerless motion capture (chapter 4).
KSM is initialized by learning the subspace representations of X and Y using IMED embedded KPCA and standard (RBF kernel) KPCA respectively. From these, the KPCA projected training sets V_im^X and V_v^Y are obtained, where the corresponding instances in the sets have the following forms:
Figure 5.4: Diagram to Summarize Kernel Subspace Mapping for 3D object pose estimation. Selected members of the image training set X are highlighted with black bounding
boxes, whereas the output training set Y are represented as blue circles on the viewing
hemisphere.
v_i^X = [⟨V_im^1 · Φ(x_i)⟩, ..., ⟨V_im^η · Φ(x_i)⟩]^T, ∀ v_i^X ∈ R^η,
v_i^Y = [⟨V_v^1 · Φ(y_i)⟩, ..., ⟨V_v^ψ · Φ(y_i)⟩]^T, ∀ v_i^Y ∈ R^ψ,    (5.24)
where η and ψ are the training set’s corresponding tuned projected feature dimensions
(using the tuning techniques in section 4.2.2 and 4.3.3).
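A minimal sketch of the projection step is given below (standard RBF-kernel KPCA as in [90], written in Python with illustrative names; in the IMED embedded case the inputs are simply the standardized vectors u = G^{1/2}x, per equation 5.20). It produces the η-dimensional projections v^X of equation 5.24 for a toy training set.

```python
import numpy as np

def rbf_gram(A, B, gamma):
    """k(a_i, b_j) = exp(-gamma ||a_i - b_j||^2) for all pairs."""
    sq_a, sq_b = np.sum(A ** 2, axis=1), np.sum(B ** 2, axis=1)
    return np.exp(-gamma * (sq_a[:, None] + sq_b[None, :] - 2.0 * A @ B.T))

def kpca_fit(U, gamma, n_components):
    """Fit KPCA on the (standardized) training set U; return the kernel matrix and
    the expansion coefficients alpha^k scaled so each component has unit norm in F."""
    N = len(U)
    K = rbf_gram(U, U, gamma)
    one = np.full((N, N), 1.0 / N)
    Kc = K - one @ K - K @ one + one @ K @ one            # centre in feature space
    vals, vecs = np.linalg.eigh(Kc)
    order = np.argsort(vals)[::-1][:n_components]
    vals, vecs = vals[order], vecs[:, order]
    alphas = vecs / np.sqrt(np.clip(vals, 1e-12, None))
    return K, alphas

def kpca_project(U, K, alphas, u_new, gamma):
    """Project a novel (standardized) vector onto the tuned feature subspace (eqs. 5.21, 5.24)."""
    k_new = rbf_gram(u_new[None, :], U, gamma).ravel()
    k_cent = k_new - k_new.mean() - K.mean(axis=1) + K.mean()
    return k_cent @ alphas

# toy usage: eta-dimensional projections v_X for a training set of standardized images
rng = np.random.default_rng(3)
U = rng.standard_normal((60, 256))       # stand-in for standardized, vectorized images
gamma, eta = 0.01, 8
K, alphas = kpca_fit(U, gamma, eta)
VX = np.vstack([kpca_project(U, K, alphas, u, gamma) for u in U])
```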
During run time, given a new unseen vectorized image x of the same 3D object, projection via IMED embedded KPCA can be used to obtain v^X. The projected vector is then mapped between the two feature subspaces using the LLE neighborhood reconstruction weights w, which are calculated by minimizing the reconstruction cost function (as shown in [85] and section 4.3),

ε(v^X) = ‖v^X − Σ_{i∈I} w_i v_i^X‖²,    (5.25)
where I is the set of neighborhood indices of the projected vector v X in the tuned feature
space. This leads to the mapping
v^Y = Σ_{i∈I} w_i v_i^Y,    (5.26)
from which the corresponding output hemispherical position y p (which is the pre-image of
v Y ) can be determined. Note that there is now another free parameter to tune, this being
card(I), the cardinality of the index set I. For KSM, this is tuned using cross validation
as in section 4.2.2, but instead of minimizing the pre-image reconstruction, the mapping
error is minimized (section 4.3.3).
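The mapping of equations 5.25 and 5.26 can be sketched as follows (Python, with illustrative names; the sum-to-one constraint on the weights follows the standard LLE formulation [85] and is an assumption here, as is the simple regularization of the local covariance):

```python
import numpy as np

def lle_map(v_x, VX, VY, k):
    """Map a projected input v_x from the input feature subspace to the output
    feature subspace via LLE neighborhood reconstruction weights (eqs. 5.25-5.26).
    VX and VY are the projected training sets (rows correspond to the same exemplars)."""
    # index set I: the k nearest training projections to v_x
    dists = np.linalg.norm(VX - v_x, axis=1)
    I = np.argsort(dists)[:k]
    # solve for weights w minimizing ||v_x - sum_i w_i VX[i]||^2 subject to sum(w) = 1
    Z = VX[I] - v_x                      # shifted neighbors
    C = Z @ Z.T                          # local covariance (k x k)
    C += 1e-9 * np.trace(C) * np.eye(k)  # regularize for numerical stability
    w = np.linalg.solve(C, np.ones(k))
    w /= w.sum()
    # transfer the same weights to the output subspace (equation 5.26)
    return w @ VY[I]

# toy usage with random stand-in projections (eta = 8, psi = 3 dimensional subspaces)
rng = np.random.default_rng(1)
VX, VY = rng.standard_normal((50, 8)), rng.standard_normal((50, 3))
v_y = lle_map(rng.standard_normal(8), VX, VY, k=6)
```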
5.5 Experiments & Results
To test the efficacy of IMED embedded KSM on 3D object viewpoint estimation, training
images of the object ‘Tom’ (figure 5.5[left]) are selected from the data set described in [81].
From the original image set of 2500 images, reduced sets of 30, 60, 120 and 145 training images are randomly selected. The training sizes of 30 and 145 were selected to match the experiments in [80]¶. A test set of 600 'unseen' images is then filtered from the remaining images at regular intervals, such as to maximize the distribution around the viewing hemisphere‖.
Table 5.1: 3D pose inference comparison using the mean angular error for 600 'unseen' views of the object 'Tom'.

Training size              |   30    |   60    |  120    |  145
Clean data set
  KSM via standard KPCA    |  6.68◦  |  2.97◦  |  1.41◦  |  1.19◦
  IMED embedded KSM        |  3.38◦  |  1.92◦  |  0.89◦  |  0.76◦
Gaussian noise set
  KSM via standard KPCA    |  8.63◦  |  5.88◦  |  3.22◦  |  3.01◦
  IMED embedded KSM        |  3.44◦  |  2.05◦  |  1.01◦  |  0.89◦
Salt/Pepper noisy set
  KSM via standard KPCA    | 10.68◦  |  9.40◦  |  5.14◦  |  4.56◦
  IMED embedded KSM        |  4.58◦  |  2.40◦  |  1.51◦  |  1.34◦
¶ In [80], figure 1, the parameters are given in terms of the 'tracking threshold' and their corresponding view bubbles, each consisting of 5 training images. For view bubble counts of 6 and 29 this corresponds to 30 and 145 training images respectively.
‖ For 3D object pose estimation results, please refer to the attached files: videos/[IMEDposeEst.mp4, IMEDposeEstNoisy.mp4, IMEDposeEstNoisy2.mp4, IMEDposeEstNoisySP.mp4].
Figure 5.5: Selected images of the object ‘Tom’ [left] and object ‘Dwarf’ [right] used in
the 3D pose estimation problem. The images consist of the original clean image (1st &
4th image), image corrupted with Gaussian noise (2nd & 5th image) and Salt/Pepper noise
(3rd & 6th image).
IMED embedded KSM is tested with clean images as well as with images corrupted
with zero mean Gaussian noise (variance of 0.01) and salt/pepper noise (noise density of
0.05) as shown in figure 5.5. Table 5.1 shows that in all cases, IMED embedded KSM
provides more accurate viewpoint estimation (than standard KSM) from both clean and
noisy images. Furthermore, IMED embedded KSM is also more robust to noise and it
shows the least percentage increase in error for both Gaussian and salt/pepper noise. As
expected, with any exemplar based learning algorithm, the mean error gradually decreases
as more training images are included in the set. For completeness, another set of pose
inference results for the same tuned parameters as in table 5.1 is presented. In this case,
however, 400 test images are selected at random from the original hemispherical data set
(2500 images) and the selected images are not constrained to be novel (i.e. the test images
may also include images that were used in training). This is done because it gives a more
realistic representation of the input data in practice, where there is no exclusion of the
training set, let alone any idea of which images were used in training. As expected, IMED
embedded KSM performs even better in this case (table 5.2), as KSM will simply project
any training images to the correct position in the feature space, and this has zero error to
begin with. Results from [80] are also included for comparison in table 5.2 even though
only 30 test images were used∗∗ .
It must be noted that the accuracy of the pose inference is dependent on the complexity
of the object. For example, it is not possible to determine the hemispherical pose using
∗∗
Note that the test images in [80] have not been considered as novel/unseen. This is because a sparse
representation of the object is built from the original set of 2500 images using a greedy approach, which
is then used for pose estimation. Since the 30 test images are also derived from the original 2500 images,
they cannot be considered novel.
Table 5.2: 3D pose inference comparison using the mean angular error for 400 randomly selected views of the object 'Tom'.

Training size              |   30    |   60    |  120    |  145
Clean data set
  View Bubble Method [80]  | 36.51◦  |    –    |    –    |  0.77◦
  KSM via standard KPCA    |  4.44◦  |  2.32◦  |  0.91◦  |  0.84◦
  IMED embedded KSM        |  3.22◦  |  1.15◦  |  0.65◦  |  0.59◦
Gaussian noise set
  KSM via standard KPCA    |  6.28◦  |  5.13◦  |  2.59◦  |  2.72◦
  IMED embedded KSM        |  3.29◦  |  1.26◦  |  0.77◦  |  0.69◦
Salt/Pepper noisy set
  KSM via standard KPCA    |  9.47◦  |  8.64◦  |  4.56◦  |  4.38◦
  IMED embedded KSM        |  4.51◦  |  1.75◦  |  1.19◦  |  1.04◦
a sphere with uniform colour (i.e. all images of the sphere will look the same from any
viewpoint). To test the efficacy of IMED embedded KPCA on a more symmetrical object,
the object ‘Dwarf’ (which also appears in [80]) is used. All experiments are performed
in exactly the same way as before, and the results are summarized in table 5.3 & table 5.4. The
training size of 30, 60, 120 and 145 are also used to allow comparison (with object ‘Tom’),
in which case table 5.3 should be compared with table 5.1, and table 5.4 with table 5.2.
For cross comparison with [80], IMED embedded KSM is able to achieve more accurate
results for a training set of only 30 images (3.26◦ ) as compared to using the ‘view bubble’
sparse set of 130 images (4.2◦ ). A surprising result is that it is possible to achieve relatively
the same level of accuracy (even higher in some cases) for the ‘Dwarf’ object (as compared
to object ‘Tom’). This was not the case using the view bubble method in [80].
5.6 Discussions and Concluding Remarks
For the case of IMED embedded KSM, pose inference results with mean errors of 3.22◦
were achieved, whilst using only a training set of 30 images. This is quite accurate, as it
is unlikely that a human would be able to achieve the same level of accuracy [80]. Our
technique is also robust to noise (especially Gaussian noise) and, in such a case, only
shows minor percentage increase in mean angular error. Furthermore, this was achieved
Table 5.3: 3D pose inference comparison using the mean angular error for 600 'unseen' views of the object 'Dwarf'.

Training size              |   30    |   60    |  120    |  145
Clean data set
  KSM via standard KPCA    |  8.27◦  |  2.98◦  |  1.30◦  |  1.15◦
  IMED embedded KSM        |  3.99◦  |  1.67◦  |  0.91◦  |  0.81◦
Gaussian noise set
  KSM via standard KPCA    | 12.01◦  |  5.22◦  |  3.09◦  |  2.60◦
  IMED embedded KSM        |  4.04◦  |  1.73◦  |  0.99◦  |  0.87◦
Salt/Pepper noisy set
  KSM via standard KPCA    | 16.51◦  |  8.33◦  |  6.04◦  |  5.07◦
  IMED embedded KSM        |  4.46◦  |  1.90◦  |  1.10◦  |  1.07◦
without explicit knowledge of the full training set of 2500 images, as is the case with [80]. This is because a greedy approach is not used to filter training data out of the original set of 2500 images in order to build a sparse representation; instead, training images are simply selected from it at random. The constraint, in this case, is that the training set is relatively evenly spread over the probability distribution of the data (as opposed to the spatial configuration of the data). The test images used are also novel (tables 5.1 and 5.3) and are regularly spread over the entire hemisphere, whereas in [80], only 30 test images were used from three different patches of the viewing hemisphere. Results with more than 145 training images are not presented because, for industrial applications, it becomes impractical to capture such a large training set, and in such a case, the hemispherical data will be so dense that linear interpolation will be as accurate.
This chapter presents a technique that potentially allows automatic parameter selection (i.e. without human intervention). This leads to the ability to tune KSM for a novel
3D object by simply placing it on a rotating table (which rotates about the vertical axis).
For the other degree of freedom, a synchronized camera that moves up and down along
the longitudinal arc of the hemisphere can be installed (which will automatically capture
images of the object from different orientations). From the captured images and corresponding orientation, the system can automatically tune the parameters, which can then
Table 5.4: 3D pose inference comparison using the mean angular error for 400 randomly selected views of the object 'Dwarf'.

Training size              |   30    |   60    |  120    |  145
Clean data set
  KSM via standard KPCA    |  9.40◦  |  3.14◦  |  1.31◦  |  1.23◦
  IMED embedded KSM        |  3.26◦  |  1.49◦  |  0.77◦  |  0.68◦
Gaussian noise set
  KSM via standard KPCA    | 12.72◦  |  5.60◦  |  2.81◦  |  2.46◦
  IMED embedded KSM        |  3.56◦  |  1.55◦  |  0.86◦  |  0.75◦
Salt/Pepper noisy set
  KSM via standard KPCA    | 14.64◦  |  8.57◦  |  5.40◦  |  4.92◦
  IMED embedded KSM        |  4.80◦  |  1.83◦  |  1.14◦  |  1.05◦
be loaded into a single camera pose inference system in situ. Note that in this case an assumption was made that the backgrounds of the training images and test images are the same. In reality, it is usually the case that the training images are captured in a controlled environment, whereas the test images are not. To solve this problem, a robust background segmentation algorithm can be appended as a pre-processing step to mask out the varying backgrounds in both the controlled training images and the run-time images on site. Learning and pose inference can then be performed on the masked images, hence mitigating the problem of uncontrolled backgrounds. Determining the relative yaw orientation (rotation about the vertical axis only) of an object on a (factory) conveyor belt using static cameras is an even simpler and more common problem, because it reduces to constraining the space of possible orientations from three dimensional space onto a two dimensional plane.
Kernel Subspace Mapping (KSM) was initially developed for markerless motion capture (chapter 4). In that case, KSM was used to learn the mapping between concatenated
images of the human silhouette and the normalized relative joint centers (section 4.2). For
future directions, IMED embedded KPCA can be integrated into the core of markerless
motion capture using KSM. Results have been presented indicating that significant improvements can be achieved by embedding IMED into KSM (for viewpoint estimation).
It is likely that a similar increase in accuracy can be achieved by applying the
same concept in markerless motion capture. The problem to overcome is how to effectively
integrate IMED into the pyramid kernel (section 4.2.3) for (computationally) efficient silhouette comparison. Another area to consider is using IMED on local gradients found
from SIFT-like algorithms [64], and possibly improving on previous techniques for human
pose estimation in cluttered environments [5].
Chapter 6
Greedy KPCA for Human Motion Capture∗
This chapter presents the novel concept of applying Greedy KPCA (GKPCA) [41] as
a preprocessing filter (in training set reduction) for Kernel Subspace Mapping (KSM).
Human motion de-noising comparison between linear PCA, standard KPCA (using all
poses in the original sequence) and Greedy KPCA (using the reduced set) is presented
at the end of the chapter. The chapter advocates the use of Greedy KPCA in KSM by
showing that both KPCA and Greedy KPCA have superior de-noising qualities over PCA,
whilst KSM with Greedy KPCA results in relatively similar pose estimation quality (as
standard KSM [chapter 4]) but with lower evaluation cost (due to the reduced training
set).
6.1 Introduction
Due to the high degree of correlation in human motion, unsupervised learning techniques
like Principal Components Analysis (PCA) [51, 98] and its non-linear extension, Kernel
Principal Components Analysis (KPCA) [90, 91] are commonly used to learn subspaces
of human motion. As highlighted in section 3.2.1, PCA, being a linear technique, is not
well suited for the de-noising of non-linearly correlated human motion. This hypothesis
∗ This chapter is based on the conference paper [107]: T. Tangkuampien and D. Suter, Human Motion De-noising via Greedy Kernel Principal Component Analysis Filtering, International Conference on Pattern Recognition (ICPR) 2006, pages 457–460, Hong Kong, China.
is further supported by the human de-noising results in section 3.5. KPCA, on the other
hand, shows significant improvement over PCA, for both cyclical human motion (e.g.
walking and running) to more complex non-cyclical motion (e.g. dancing and boxing).
A drawback of KPCA, however, is that the training and evaluation costs are dependent
on the size of the training set (equation 3.13 and equation 3.16). During training, the
kernel matrix (which grows quadratically with the number of exemplars in the training
set) needs to be calculated before standard PCA can be applied in feature space. As
for the de-noising (projection and pre-image approximations [91]) via KPCA, the cost is
linear in the exemplar number because each projection requires a kernel comparison with
each vector in the training set. As a result, the cardinality of the training set is vital in
any real system incorporating KPCA. The goal of KPCA with greedy filtering (GKPCA)
[41] is to filter (from the original motion sequences) a reduced training subset that can
optimally represent the original de-noising subspace, given a specific prior constraint (e.g.
using 70% of the original training data).
Figure 6.1: Illustration to geometrically summarize Greedy Kernel Principal Components
Analysis (GKPCA) [41] in relation to Kernel Principal Components Analysis (KPCA) [90].
Novel captured motions similar to the original training sequence can then be de-noised using this reduced set. To this end, the recently proposed GKPCA algorithm of [40] is investigated for the goal of filtering the reduced training set. To the author's knowledge, there is currently no previous work on the practical applications of Greedy Kernel
Principal Components Analysis (GKPCA) on human motion de-noising. Similarly, KPCA
with greedy algorithm filtering is a relatively novel approach to the problem of markerless
motion capture based on subspace learning. Results are presented in section 6.3, which
supports the integration of GKPCA into the pre-processing of KSM (chapter 4). The
experiments aim to show that GKPCA filtering (for training set reduction) can aid in the
reduction of pose estimation cost, whilst still retaining the de-noising and pose estimation
qualities of the original training set.
6.2 Greedy KPCA filtering on Motion Capture Data
To understand the concept of GKPCA filtering, imagine that the non-linear toy data set in
figure 3.3 were cloned and the training set doubled in size (i.e. there are now two instances
of each point in the training set). In this case, there should not be any substantial change
in the de-noising quality as the feature subspace defined by the training set should remain
relatively the same (as the original set). However, from a performance perspective (due
to the linear complexity of KPCA projections) the computation cost would have (approximately) doubled. Most likely, there are redundant (training) poses, which when removed
would increase performance, whilst minimizing reduction in de-noising quality. Put more
generally, the objective (of GKPCA filtering) is to remove training vectors, which do not
contribute substantially to the definition of the de-noising subspace.
Specifically for human motion capture data, GKPCA is highly compatible because
most animation data (which is a concatenation of poses captured from a marker-based
system) is repetitive and usually sampled at a high frame rate (between 60-120 Hz). The
higher the sampling rate, the more likely it is that there will be redundant poses when
it comes to defining subspaces for human de-noising. In most motion capture training
data (where there is a high percentage of redundant poses), the removal of any redundant
pose will increase the speed of KSM for markerless motion capture. GKPCA can act as a
preprocessing filter that will select a reduced training set P gk (which defines a de-noising
pose subspace similar to P tr [section 4.3]) from the original training set.
The remainder of this chapter is structured as follows: in section 6.2.1 the Greedy
KPCA algorithm is summarized. In section 6.3, experiments are conducted to show how a
reduced training set can play an important role in controlling the capture rate and pose
estimation quality of KSM for markerless motion capture (as well as compare the motion
de-noising quality between GKPCA, KPCA and PCA).
6.2.1 Training Set Filtering via Greedy KPCA
Using the same notation as section 3.3, recall that standard KPCA basically maps the training set X^tr of size N non-linearly to a feature space F, before performing the equivalent of linear PCA. Greedy KPCA [41] aims to filter X^tr to a reduced set X^gk of size M (where M ≪ N), such that the linear span of F^gk is similar to the linear span of F, where

Φ : X^gk → F^gk.    (6.1)

Assuming that X^gk can be found, it is then possible to express every vector in F as a linear combination of the filtered set in F^gk [41].
To summarize Greedy KPCA as in [41, 40]: let J = {j_1, j_2, ..., j_M} be the set of M indices, a subset of I, where I = {i_1, i_2, ..., i_N} are the original indices of X^tr. The approximate feature space representation of the original training exemplars can be expressed as follows:

Φ̃(x_i) = Σ_{j∈J} ω_{ij} Φ(x_j),  ∀ i ∈ I.    (6.2)
The reduced set’s objective function to minimize is the mean square error
εM S (Htr |J ) =
" g
1 "
&Φ(x i ) −
ωij Φ(x j )&2 .
N
i∈I
(6.3)
j∈J
It is important to note that given the subset of X^tr indexed by J, it is possible to compute ω^g optimally to minimize ε_MS(H^tr|J). Therefore ω^g can be removed from the error function ε_MS(H^tr|J), and the greedy approximation problem re-expressed as the problem of determining J, where

J = argmin_{card(J)=M} ε_MS(H^tr|J),    (6.4)
with card(J) denoting the cardinality of the subset J. Furthermore, as shown in [40, 87], it is possible to avoid explicitly mapping to feature space F^tr by re-expressing equation (6.3) using the kernel trick as

ε_MS(F^tr|J) = (1/N) Σ_{i∈I} ( k_p(x_i, x_i) − 2 K_c^gk k^gk(x_i) + ⟨k^gk(x_i), K_c^gk k^gk(x_i)⟩ ).    (6.5)
K_c^gk, in this case, is the centered kernel matrix of the filtered set X^gk and

k^gk(x_i) = [k_p(x_{j_1}, x_i), ..., k_p(x_{j_M}, x_i)]^T    (6.6)

is the projection of X^tr onto the reduced set's feature space. Greedy KPCA, therefore, only
needs to determine the optimal subset J from I [41]. In order to achieve this, we could try all possible combinations, but this would be impractical as there exist (N choose M) such combinations. Instead, as shown in [40], we can choose to minimize the upper bound

ε_MS(F^tr|J) ≤ (1/N)(N − M) max_{i∈I\J} ‖Φ(x_i) − Φ̃(x_i)‖².    (6.7)
Intuitively, the upper bound can be viewed as initially finding the maximum feature space error (between the approximate set X̃ and the original set X) and multiplying it by (N − M). The reason (N − M) is chosen as the scale factor, and not N, is because M vectors in X^tr can already be represented with zero error [41]. Equation (6.7) should hold because the mean error ε_MS(F^tr|J) cannot be higher than the mean of the maximum error. Further discussion of the Greedy KPCA algorithm is beyond the scope of this chapter, but can be found in [40].
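The selection idea can be sketched as follows (Python, for an RBF kernel): at each step the exemplar with the largest feature-space reconstruction error from the span of the current subset is added, which is the quantity bounded in equation 6.7. This is only an illustration of the greedy principle, not the exact (and more efficient) algorithm of [40, 41]; the function names and toy data are assumptions.

```python
import numpy as np

def rbf_gram(X, gamma):
    """RBF kernel matrix k_p(x_i, x_j) = exp(-gamma ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))

def greedy_subset(X, M, gamma):
    """Greedily pick M exemplar indices J: at each step add the exemplar whose
    feature-space reconstruction error ||Phi(x_i) - Phi_tilde(x_i)||^2 from the
    current subset is largest."""
    K = rbf_gram(X, gamma)
    J = [0]                                   # arbitrary first pick (RBF self-similarities are all equal)
    while len(J) < M:
        KJ_inv = np.linalg.pinv(K[np.ix_(J, J)])
        kJ = K[:, J]                          # k^gk(x_i) for every i, stacked as rows
        # reconstruction error of every exemplar from span{Phi(x_j), j in J}
        err = np.diag(K) - np.einsum('ij,jk,ik->i', kJ, KJ_inv, kJ)
        err[J] = -np.inf                      # exemplars already in the subset have zero error
        J.append(int(np.argmax(err)))
    return J

# toy usage: keep roughly 30% of a synthetic 200-frame "pose" sequence
rng = np.random.default_rng(2)
X_tr = rng.standard_normal((200, 57))         # stand-in for normalized RJC pose vectors
J = greedy_subset(X_tr, M=60, gamma=0.05)
X_gk = X_tr[J]
```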
Given the original motion sequence X^tr, it is possible to use greedy KPCA to filter out X^gk = X^tr(J), a subset of X^tr(I), which has a similar linear span in the pose feature space. Figure 6.2 compares the de-noising quality (on a toy example from figure 3.3) between GKPCA (using 50% of the original training set) and KPCA (using the full training set).
Figure 6.2: De-noising comparison of a toy example between GKPCA (using 50% of the
original training set) [top] and standard KPCA (using the full training set) [bottom].
6.3 Experiments & Results
There are three factors that should be considered in concluding if Greedy KPCA can filter
out a reduced set (in a manner which will enhance the performance of KSM). These are:
1. Does a reduced training size lead to a reduction in motion capture time (i.e. an
increase in motion capture rate)?
2. What is the effect of a reduced training set on the de-noising qualities of human
motion when compared with using the full set?
3. What is the effect of a reduced training set on the quality of pose estimation via
KSM?
Experiments regarding the motion capture rate and de-noising qualities are summarized
in section 6.3.1 and section 6.3.2 respectively. Section 6.3.3 shows the results of using
Greedy KPCA as a preprocessing filter for KSM in markerless motion capture.
6.3.1 Capture Rate Control for KSM
To understand how a reduced training set can control the capture rate of KSM motion
capture, the reader should refer to equation 3.16. The outer iteration over the cardinality
of the set confirms that any data projection cost via KPCA will be dependent on the
training size. Therefore, training KPCA using the reduced set X gk will result in a reduced
kernel matrix K^gk and a lower computational load in equation 3.16, hence a reduction in the
kernel subspace mapping cost.
As previously discussed, figure 3.12 [bottom] highlights the linear relationship between
the de-noising cost of KPCA and the training size. Table 6.1 summarizes how the current
capture rate of KSM for motion capture can be controlled via training size modification.
The dominant cost is the recursive pyramid cost† of calculating the silhouette feature vector Ψ(f), which is currently implemented in un-optimized Matlab™ code. This cost
is not shown in the table as it is independent of the training size, but can be determined
by taking the difference between the KSM total cost and the LLE mapping and pre-image
cost.
Table 6.1: Comparison of capture rate for varying training sizes.

Training size | LLE mapping & pre-image (s) | KSM total cost (s) | Capture rate (Hz)
     100      |           0.0065            |       0.0847       |       11.81
     200      |           0.0120            |       0.0902       |       11.09
     300      |           0.0220            |       0.1002       |        9.98
     400      |           0.0310            |       0.1092       |        9.16
     500      |           0.0410            |       0.1192       |        8.39
† The cost of calculating Ψ(f) using the pyramid kernel is dependent on the number of silhouette features and not on the size of the training set. For a five level pyramid of 431 features, this results in an average cost of ∼0.0782 seconds.
6.3.2 Comparison between Greedy KPCA and KPCA De-noising
This section begins by comparing the mean square error [mse] of human motion de-noising
(in normalized RJC format [section 3.4.1]) using PCA, standard KPCA trained with the
entire set X tr , and Greedy KPCA trained with the reduced set X gk . All experiments were
performed on a Pentium™ 4 with a 2.8 GHz processor. In figure 6.3, GKPCA selects 30% of the original sequence to build the reduced set subspace. Synthetic Gaussian white noise is added to the motion sequence X in pose space, and the de-noising qualities compared
quantitatively (figure 6.3) and qualitatively in a 3D animation playback‡ .
Figure 6.3: Pose space mse comparison between PCA, KPCA and GKPCA de-noising for
walking and boxing sequences.
Figure 6.3 highlights the superiority of both KPCA and GKPCA over linear PCA in
motion de-noising. GKPCA will tend towards the KPCA de-noising limit as a greater
percentage of the original sequence is included in the reduced set. Specifically for the
walking and running motion sequences, the frame by frame comparison in figure 6.4 and
figure 6.5 further emphasizes the similarity of de-noising qualities between KPCA (blue
line) and GKPCA (red line). Both the error plots for GKPCA and KPCA are well
below the error plot for PCA de-noising (black line). For human animation, the GKPCA
algorithm is able to generate realistic and smooth animation comparable with KPCA when
‡
For motion de-noising results via PCA, KPCA and GKPCA, please refer to the attached file:
videos/motionDenoisingRun.MP4 & videos/motionDenoisingWalk.MP4.
Figure 6.4: Frame by Frame error comparison between PCA, KPCA and GKPCA de-noising of a human walk sequence. GKPCA selects 30% of the original sequence.
Figure 6.5: Frame by Frame error comparison between PCA, KPCA and GKPCA de-noising of a human run sequence. GKPCA selects 30% of the original sequence.
the de-noised motion is mapped to a skeleton model and played back in real time.
To analyze the ability of GKPCA to implicitly de-noise feature subspace noise, we
add synthetic noise in the feature space of a walk motion sequence. We are interested in
this aspect of de-noising because, in KSM, noise may be induced in the process of mapping
from one feature subspace to another (i.e. from Ms to Mp ). Figure 6.6 shows the relationship between feature space noise and pose space noise for both KPCA and GKPCA
de-noising for a walk sequence from various yaw orientations (GKPCA uses a reduced
set consisting of 30% of the original set). Surprisingly, the ratio of pose space noise to
Figure 6.6: Comparison of feature and pose space mse relationship for KPCA and
GKPCA. GKPCA uses a reduced set consisting of 30% of the original set.
feature subspace noise (i.e. pose space noise/feature subspace noise) is lower for GKPCA
de-noising when compared to using KPCA de-noising. The lower GKPCA plot indicates
its superiority over KPCA at minimizing feature space noise, which may be induced in
the process of pose inference via KSM (i.e. the same level of noise in GKPCA feature
subspace will more likely map to a lower level of noise [when compared to KPCA] in the
RJC pose space). However, it is important to note that the noise analysis presented here
is only for human motion de-noising. The effect of a reduced set (filtered via GKPCA) on
markerless motion capture has not yet been analyzed. The following section attempts to
investigate this relationship between the size of the reduced training set and the quality
of pose estimation via KSM.
6.3.3 Greedy KPCA for Kernel Subspace Mapping
To test the efficacy of KSM motion capture with a GKPCA pre-processing filter, the
training set of 323 exemplars (as previously applied in section 4.4.2) is used as the original
(starting) set. Using the distance in the pose feature subspace for filtering, the GKPCA
algorithm [39, 41] is initialized to extract reduced training sets from the original set. The
reduced sets are then tuned independently using the process described in chapter 4. For
comparison with previous results of motion capture via KSM, the same synthetic set of
1260 unseen silhouettes (as used in section 4.4.2) is used for testing.
For pose inference from clean silhouettes (i.e. clean silhouette segmentation, but still
using silhouettes of different models between testing and training), and using reduced
training sets consisting of 90%, 80% and 70% of the original set: KSM can infer pose with
average errors of 3.17◦ per joint, 4.40◦ per joint and 5.86◦ per joint respectively. The frame by frame errors in pose estimation (from clean silhouettes) are plotted in figure 6.8. Figure 6.9 shows the frame by frame estimation error from noisy silhouettes. For
training sets consisting of 90%, 80% and 70% of the original set, the mean pose inference
errors from noisy silhouettes (with salt & pepper noise density of 0.2) are recorded as 3.71◦ per joint, 4.80◦ per joint and 6.51◦ per joint respectively (figure 6.9). The average
errors per joint for all the tested reduced sets are summarized in figure 6.7. As expected,
the mean pose estimation error (for KSM) when inferring from clean or noisy silhouettes
increases with a reduction in the training size. The important factor to note is the nonlinear relationship between the training size and (pose) inference error. Similar to KSM
pose estimation in the presence of noise (section 4.4.2), the reduction in training size does
not equally affect the inference error for each silhouette, but rather, the size reduction
increases the error significantly for a minority of the poses (i.e. the peaks in figure 6.9).
Figure 6.7: Average errors per joint for the reduced training sets filtered via GKPCA.
Figure 6.8: Frame by Frame error comparison (degrees per joint) for a clean walk sequence with different levels of GKPCA filtering in KSM.
Figure 6.9: Frame by Frame error comparison (degrees per joint) for a noisy walk sequence with different levels of GKPCA filtering in KSM.
6.4 Conclusions & Future Directions
The chapter investigates the novel approach of applying Greedy KPCA [41] in the minimization of noise in non-linearly correlated human motion. Section 3.5 has already shown how non-linear KPCA is superior to linear PCA in the de-noising of Gaussian noise in human motion capture data. GKPCA shares this superiority over linear PCA. In terms of its advantage over KPCA, Greedy KPCA can create realistic and comparable (to KPCA) animation from noisy motion sequences, but at a fraction of KPCA's processing cost. GKPCA therefore enables KSM to be extended to capture complex sequences defined by a large training size. Currently KSM learns pose and silhouette subspaces via
KPCA, and approaches motion capture as the problem of mapping from the projected
silhouette subspace Ms to the projected pose subspace Mp (section 4.2). GKPCA can be
applied, in this case, to control capture rate by filtering out a reduced training set that
can define relatively similar de-noising subspaces (as the original set), thereby reducing
the de-noising cost of KPCA (equation 3.16) and pose inference cost of KSM.
An area that should be further investigated is GKPCA’s superiority over KPCA in
the reduction of pose (feature) subspace noise (figure 6.6). A possible explanation for
this improvement may be due to GKPCA’s reduced set constraint, therefore resulting
in a reduced space of valid pre-images in pose space, and favorably, a reduced space for
unwanted noise as well. This improvement provided by GKPCA filtering, however, does
not necessarily mean that GKPCA can lead to superior pose estimation results for KSM.
As expected, a reduction in training data leads to an increase in pose estimation error. An
interesting point to note is the pattern of the error increase (which occurs in peaks [figure
6.9]) as the training size is reduced. This pattern arises because some silhouettes (of a
walk motion) are significantly more ambiguous than others. A relatively similar pattern
of error increase has also been recorded (in section 4.4.2) for KSM pose estimation in the
presence of noise. This indicates that the robustness of KSM (to both noise and training
set reduction) may be improved by applying a better tracking and neighbor selection
criterion, which allows the integration of more complex temporal smoothing (to minimize
error due to ambiguous silhouettes located at the error peaks [figure 6.9]).
Figure 6.10: Diagram to illustrate the most likely relationship between using the original
training set (used to capture results of figure 4.14) without GKPCA filtering (to train
KSM) and using training sequences directly obtained from a motion capture database (to
train KSM). The vertical axis indicates the probable mean error for different sizes of the
training set.
An important point to note is that the original training set (of 323 exemplars) is already a sparse set, in the sense that there is no redundant information and the training poses are (relatively) uniformly spread out in the normalized RJC space (i.e. redundant exemplars have already been removed from it to begin with). In practice, the marker-based motion capture training set (which can be downloaded from motion capture databases, such as [32]) is usually sampled at high frame rates of between 60 Hz and 120 Hz. Therefore, when these motion sequences are used directly as the original training set (before GKPCA filtering), they will have a high level of redundant information. For these scenarios, we believe GKPCA filtering will be extremely useful because it may be able to remove a majority of training exemplars before causing a significant increase in the pose estimation error of KSM (figure 6.10).
Chapter 7
Conclusions & Future Work
7.1 Concluding Remarks
This thesis introduces a novel markerless motion capture technique called Kernel Subspace
Mapping (KSM) [chapter 4], which can estimate full body (human) pose without the need
for any intrinsic camera calibration. The technique uses the well established unsupervised
learning algorithm, Kernel Principal Components Analysis (KPCA) [90], to learn the denoising subspace of human motion in the normalized relative joint center space (which
is consistent for all actors) [section 3.4.1]. After training, novel silhouettes are projected
into the pose feature subspace, before pose inference via pre-image approximations [87].
This has the advantageous effect of implicitly de-noising the input silhouettes onto the
subspace defined by the generic (clean) training data, hence allowing pose inference (of
similar motion to the training set) from silhouettes of unseen actors and of unseen poses.
To begin, the thesis advocates the use of non-linear KPCA in human motion de-noising
by showing that even the simplest forms of human motion (encoded in the normalized RJC format [section 3.4.1]) are non-linear. Thereafter, to test our hypothesis of a de-noising framework
for markerless motion capture, KSM was trained using synthetic silhouettes of a walk
sequence (fully rotated about the yaw axis) generated from a generic model. For testing,
a substantially different mesh model is used to generate a large set of silhouettes in unseen poses, and from previously unseen viewing orientations. The pose inference results
(average error of 2.78◦ /joint or 2.02 cm/joint) show that KSM can estimate pose with
accuracy similar to other state of the art approaches [2, 45]. KSM is also one of the few 2D learning-based pose estimation approaches (others are [6, 45, 83]) that can accurately infer pose irrespective of yaw orientation. Furthermore, KSM also works robustly in the presence of synthetic binary noise, which is used to simulate scenarios with poor silhouette segmentation.
As KPCA [90] is a well established machine learning technique, there is a large community of researchers developing novel optimization algorithms for it. We believe that because
KSM is based on this popular technique (KPCA), most of the improvements on KPCA
can be transferred to practical improvements of KSM with relatively minor modifications.
To illustrate the ease of integration (of novel improved algorithms) and elaborate further
some of the important aspects of KSM, two recently proposed techniques are embedded
into KSM:
• the Image Euclidian Distance (IMED) [116], and
• the greedy KPCA (GKPCA) algorithm [41].
Specifically for IMED embedded KSM, we concentrate on the problem of 3D object
viewpoint estimation from intensity images [81]. In this case, IMED embedded KSM shows
an improvement in the accuracy of viewpoint estimation over KSM (using vectorized Euclidian distance) and other previously proposed approaches [80, 119, 47, 78].
For greedy KPCA integration, we concentrate specifically on the problem of human
pose inference (from synchronized silhouettes) via the use of a reduced training set. The
greedy KPCA algorithm is used as a preprocessing filter to select a reduced training
subset that can optimally represent the original de-noising subspace, given a specific prior
constraint (e.g. using 70% of the original training data). The experiments show that
Greedy KPCA filtering for KSM results in lower evaluation cost (for both KPCA denoising and KSM in motion capture) due to the reduced training set. More importantly,
KSM with a GKPCA filter is able to retain most of the pose inference quality when
compared to using the original training set (to train KSM).
7.2 Future Directions
Kernel Subspace Mapping (KSM) has been shown to be accurate in human pose estimation from binary human silhouettes. An interesting area to explore is the possibility of
extending KSM to take into account photometric information without limiting its flexibility. Using silhouettes is advantageous because normalized silhouettes (section 4.2.3)
of the same pose (for most actors) are relatively similar irrespective of their environment
(provided that there is acceptable silhouette segmentation). However, silhouettes encode
less information due to the loss of foreground photometric information. The use of pixel
intensity and colors in KSM could potentially lead to more accurate pose estimation. This
is because photometric data can be used to disambiguate inconclusive silhouettes (i.e.
the ones located mostly at the error peaks in figure 4.16). However, photometric data is
also model specific, in the sense that two different persons will most likely have extremely
different patterns and color imageries for the same stance (pose). The problem that will
need to be overcome is how to standardize the color/intensity information in such a
way that the training data (using the photometric patterns generated from one person)
can be generalized to a different person for (computationally) efficient pose estimation. A
possible solution to this may be to use local gradients from SIFT -like algorithms [64] (as
in the work of Agarwal and Triggs [5]) to first encode photometric data before training
and pose estimation via KSM.
The integration of IMED [116] into KSM (for viewpoint estimation) allows improved
estimation results over using vectorized Euclidian distance in KSM. An interesting area
to further explore is whether IMED can be embedded into KSM to improve human pose estimation from silhouettes. The integration of IMED into the pyramid silhouette kernel (of
the original KSM) is substantially harder (than vectorized Euclidian distance) due to the
hierarchical structure of the silhouette descriptor. Furthermore, the (silhouette) pyramid
kernel with embedded IMED would also need to be positive definite to ensure its efficacy
with techniques based on convex optimization such as KPCA and KSM.
Greedy KPCA filtering for reduced set selection in KSM is a research area which should
be further investigated. Interestingly enough, KSM with GKPCA filtering results in superior reduction in pose (feature) subspace noise (i.e. lower ratio of pose space noise/feature
subspace noise) than using KSM trained with the full set. As previously mentioned, an
explanation for this (improvement) may be due to GKPCA’s reduced set constraint, which
leads to a reduced space of valid pre-images in the output space, and favorably, a reduced
space for unwanted noise as well. The frame by frame analysis of the pose estimation error
(when using GKPCA filtering) is also interesting (figure 6.9). The error plot shows that
as the training size reduces, the errors do not increase (relatively) equally for all the test
silhouettes, but the errors increase substantially for a small group of silhouettes. This
similar pattern in error increase is also observed for pose estimation from noisy silhouettes
(figure 4.16). Therefore, an important future area of research for KSM would be to explore
techniques aimed at the minimization of the error peaks due to ambiguous silhouettes. We believe that such an improvement would increase the robustness of KSM in pose estimation,
in the presence of noise and when learning from reduced training sets.
Finally, most of the learning based human pose estimation approaches do not rely on
3D volumetric reconstruction of the human body. This is mainly because 3D approaches
mostly require an expensive array of synchronized and calibrated cameras. Specifically for
previously proposed 3D pose estimation techniques (e.g. [99, 28]), an interesting area
to investigate would be the integration of KSM into 3D pose estimation approaches, by
learning the mapping from 3D feature points to pose vectors. Effectively, this may lead to
a reduction in computational cost of pose estimation, as the full 3D volumetric reconstruction of the actor for each frame would not be required (to constrain the skeletal structure
for pose vector estimation). Instead, the pose vectors can be inferred directly from 3D
feature points, which can be more efficiently tracked and reconstructed.
Appendix A
Motion Capture Formats and Converters
A.1 Motion Capture Formats in KSM
A deformable mesh model can be animated by simply controlling its inner hierarchical
biped/skeleton structure (figure A.1). As a result, most motion formats only encode pose
information to animate these structures. For information on how to create a deformable
skinned mesh in DirectX (i.e. attach a static mesh model to its skeleton), the reader
should refer to [66].
This section does not aim to present a complete review of all motion capture formats,
but rather a summary of the relevant formats applied in the proposed technique, Kernel
Subspace Mapping (KSM). In KSM, three different formats are used. These are: the
Acclaim Motion Capture (AMC) format (section A.1.1) [59], the DirectX format (section A.1.2)
[66] and the Relative Joint Center (RJC) format (section 3.4.1). Two motion capture
format converters are used in this thesis (Appendix A.2), one to convert AMC to DirectX
format and the other from AMC to RJC format. Figure A.2 shows a diagram summarizing
the use of these converters in the proposed markerless motion capture system. KSM learns
the mapping from the synthetic silhouette space to the corresponding RJC space (figure
A.2 [right:dotted arrow]) for a specific motion set (e.g. walking, running). For training, a
database of accurate human motion is required, which can be downloaded in AMC format
Figure A.1: Diagram to summarize the hierarchical relationship of the bones of inner
biped structure.
from the CMU motion capture database [32]. From this, synthetic silhouettes for machine
learning purposes can be generated via the DirectX skinned mesh model [66]. From our
experiments, the interpolation and reconstruction (of realistic new poses) is best performed
in the normalized RJC format (section 3.4.1).
A.1.1 Acclaim Motion Capture (AMC) Format
The Acclaim motion capture (AMC) format, developed by the game maker Acclaim,
stores human motion using concatenated Euler rotations. The full motion capture format
consists of two types of file, the Acclaim skeleton file (ASF) and the AMC file [59]. The
ASF file stores the attributes of the skeleton such as the number of bones, their geometric
dimension, their hierarchical relationships and the base pose, which is usually the 't-pose'
(figure A.3 [right]). For the case where a single skeletal structure is used, the ASF file can
be kept constant and ignored. On the other hand, the AMC file (which encodes human
motion using joint rotations over a sequence of frames) is different for every set of motion
and is the file generated during motion capture.
Figure A.2: Diagram to summarize the proposed markerless motion capture system and
how the reviewed motion formats combine together to create a novel pose inference technique called Kernel Subspace Mapping (KSM).
Concentrating principally on the AMC file format, human motion is stored by encoding
each 3D local joint rotation as a set of consecutive Euler rotations. This is highlighted in
figure A.3 [left humerus bone], where the left shoulder joint rotation (which corresponds
to a ball & socket joint) is represented by 3 concatenated Euler rotations. Note that it is
not necessary for a joint to be encoded using 3 Euler transformations. For example, the
left elbow joint, which is a hinge joint (and represented by the left radius bone in figure
A.3) has a single degree of freedom, and hence, is only represented by one Euler rotation.
A full human pose can be encoded as a set of Euler pose (stance) vectors representing the
actor’s joint orientations at specific point in time (as used in [1, 84, 6, 24, 83]). Skeleton
animation is achieved by concatenating these pose vectors (over time) and mapping them
sequentially to a skeleton embedded inside a human mesh model. The AMC format is the
simplest form of joint encoding and is one of the formats adopted by the VICON marker-
based motion capture system [103].
On the other hand, encoding pose with Euler rotations presents numerous problems due to the fact that the mapping of pose to Euler joint coordinates is non-linear and multi-valued. Any technique, like Kernel Principal Components Analysis (KPCA) [90, 91, 87] (which is the core component in KSM), will eventually break down when applied to vectors consisting of Euler angles. This is because it may potentially map the same 3D (joint) rotation to different locations in vector space. Furthermore, linear pose interpolation using Euler pose vectors also presents a problem, because Euler rotations are not commutative, as well as suffering from the problem of Gimbal lock (Appendix A.2 [figure A.7]).

Figure A.3: Example of an Acclaim motion capture (AMC) format with an example rotation of the left humerus bone (i.e. the left shoulder joint).
A.1.2 DirectX Animation Format
Human motion capture format in DirectX is similar to the AMC format in that there are
two parts to the file structure, the skeleton/mesh attribute section and the human motion
section. The skeleton/mesh attribute section encodes the biped hierarchical structure,
its geometric attributes and additional information such as its relationship to the mesh
model (i.e. the weight matrix of the mesh vertices [66]). This information can again be
considered constant and can be ignored in the pose inference process. Similar to the AMC
format (section A.1.1), we concentrate principally on the file section that encodes the
motion data. The DirectX encoding is similar to the AMC format in that the relative 3D
joint rotations are recorded in a sequential frame structure. However, instead of encoding
3D rotations using Euler rotations, the 4 × 4 homogeneous matrix is adopted. Note that
the 4 × 4 matrix can also encode translation and scaling data (in addition to the rotation).
However, for motion capture, where the geometric attributes of the model (e.g. length of
arms, height, etc.) are known and considered constant, pose inference can be constrained to
inferring a set of 3 × 3 rotation matrices, which only encodes the rotation information.
The main advantage of the DirectX format is the optimized skinned mesh animation
and support for the format [66], which enables the efficient rendering and capture of
synthetic silhouettes for training and testing purposes. As there are major similarities
between the AMC and the DirectX format, AMC data from the marker based VICON
system can easily be converted to its DirectX equivalent by converting each set of xyz
Euler rotations to its corresponding 3 × 3 rotation matrix as follows:
L_[1:3,1:3] = R_x(Θ^x_i) R_y(Θ^y_i) R_z(Θ^z_i),    (A.1)

where Θ^x_i, Θ^y_i and Θ^z_i are the Euler rotations about the x, y and z axes respectively, and L_[1:3,1:3] denotes the first 3 rows and columns of the DirectX homogeneous matrix (see Appendix A.2 for more details regarding format conversion). Unfortunately, the algebra of rotations using matrices is non-commutative and its corresponding manifold is non-linear [9]. Therefore, use of the DirectX format for the synthesis of novel poses is also prone to relatively the same problems as when using Euler angles.
Figure A.4: Comparison between the 3D models view using the AMC viewer [32] (yellow figure) and the DirectX mesh viewer of a generic skinned mesh model (grey mesh) for a golf swing [top], a boxing motion sequence [center] and a salsa dancing motion sequence [bottom].
Figure A.5: Comparison between the 3D models view using the AMC viewer [32] (yellow
figure) and the connected stick figure of the normalized RJC format of a running motion
sequence [top], a boxing motion sequence [center] and a dancing motion sequence [bottom].
Figure A.6: Comparison between the 3D models view using the AMC viewer [32] (yellow figure) and the DirectX mesh viewer of an RBF mesh model of the author (blue
background).
A.2 Motion Capture Format Converters

A.2.1 The AMC to DirectX converter
The AMC and DirectX formats basically store pose as a set of 3D rotations in different forms. Therefore, conversion between the two formats can be simplified to the process of converting one 3D rotation format to another. In this case, the converter simply converts Euler rotations to a 4 × 4 homogeneous matrix. A 3D Euler rotation is characterized by three concatenated rotations, usually about the x, y and z axes, which are commonly referred to as the yaw, pitch and roll respectively. The order of this concatenation plays a significant role in how the object is orientated after applying the rotations. Geometrically, if we apply the Euler rotations in the order of the yaw (x), the pitch (y) and the roll (z) as in figure A.7, the pitch axis will be rotated by the yaw rotation, and the roll axis by both the yaw and the pitch rotations; the yaw axis, however, will be affected by neither the roll nor the pitch rotation.
Figure A.7: Images to illustrate Euler rotations in term of the yaw, pitch and roll. The
image on the right shows the negative effect of Gimbal lock, where a degree of rotation is
lost due to the alignment of the roll and the yaw axis. Notice how the pitch is affected
by the yaw rotation, the roll is by both the yaw and pitch rotations, whereas the yaw is
independent of any other rotations.
We can represent a rotation Θ^x_i about the yaw axis as R_x(Θ^x_i), where

R_x(Θ^x_i) = [ 1        0             0           ]
             [ 0    cos(Θ^x_i)    sin(Θ^x_i)      ]
             [ 0   −sin(Θ^x_i)    cos(Θ^x_i)      ],    (A.2)

and the remaining pitch and roll rotations as R_y(Θ^y_i) and R_z(Θ^z_i) respectively as:

R_y(Θ^y_i) = [ cos(Θ^y_i)   0   −sin(Θ^y_i) ]        R_z(Θ^z_i) = [ cos(Θ^z_i)    sin(Θ^z_i)   0 ]
             [ 0            1    0          ]                     [ −sin(Θ^z_i)   cos(Θ^z_i)   0 ]
             [ sin(Θ^y_i)   0    cos(Θ^y_i) ],                    [ 0             0            1 ].    (A.3)
The 3 × 3 rotation matrix of the xyz (yaw-pitch-roll) Euler rotation can be generated by
concatenating the separate rotation matrices in reverse order as in equation A.1. The
reversal in matrix application may initially appear incorrect, however, closer examination
of figure A.7 will reveal that from the object’s point of view, the roll is performed first (z
axis), followed by the pitch (y axis) and finally the yaw (x axis) rotation. For each pose, the set of 3 × 3 rotation matrices for the i-th joint is converted to its corresponding 4 × 4 homogeneous matrix L_i (which also encodes the bone's translation displacement relative to its parent) as follows:

L_i = [ R_x(Θ^x_i)   O ] [ R_y(Θ^y_i)   O ] [ R_z(Θ^z_i)   O ] [ I     b_i ]
      [ O^T          1 ] [ O^T          1 ] [ O^T          1 ] [ O^T   1   ].    (A.4)
In equation A.4, O represents a 3 × 1 zero vector, I the 3 × 3 identity matrix, and bi the
length vector of the i-th bone, whose x element stores the actual length of the bone (the
y and z elements are both zero).
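As an illustration, the conversion of equations A.1–A.4 can be written in a few lines of Python (the function names and the example joint values are assumptions for illustration, not the converter used in this thesis):

```python
import numpy as np

def rot_x(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[1, 0, 0], [0, c, s], [0, -s, c]])      # equation A.2

def rot_y(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, 0, -s], [0, 1, 0], [s, 0, c]])      # equation A.3

def rot_z(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, s, 0], [-s, c, 0], [0, 0, 1]])      # equation A.3

def homog(R=None, t=None):
    """Embed a 3x3 rotation and/or a 3x1 translation into a 4x4 homogeneous block."""
    L = np.eye(4)
    if R is not None:
        L[:3, :3] = R
    if t is not None:
        L[:3, 3] = t
    return L

def amc_joint_to_directx(theta_x, theta_y, theta_z, bone_length):
    """L_i of equation A.4: the three Euler rotation blocks followed by the bone
    translation block, with b_i = [bone_length, 0, 0]^T."""
    b_i = np.array([bone_length, 0.0, 0.0])
    return (homog(rot_x(theta_x)) @ homog(rot_y(theta_y))
            @ homog(rot_z(theta_z)) @ homog(t=b_i))

# usage: a hinge joint such as the left elbow needs only one non-zero Euler angle
L_elbow = amc_joint_to_directx(0.0, 0.0, np.deg2rad(45.0), bone_length=0.28)
```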
A.2.2 The AMC to RJC converter
The conversion from the AMC format to the RJC format (section 3.4.1) can be summarized
as the problem of transforming Euler rotations to 3D points on a unit sphere. As the local
homogeneous 4 × 4 matrix Li already encodes rotation information for the i-th joint, the
simplest form of conversion is to transform the appended zero vector [0, 0, 0, 1]T as follows:
124
APPENDIX A. MOTION CAPTURE FORMATS AND CONVERTERS
\[
p_i = \frac{L_i \cdot [0, 0, 0, 1]^T}{\left\| L_i \cdot [0, 0, 0, 1]^T \right\|}. \tag{A.5}
\]
Each joint rotation is now encoded as a point on a unit sphere, denoted by p_i. Specifically in our work, a column-wise concatenation of the separate joint rotations forms the full pose vector p, which requires 57 dimensions (19 joints). Geometrically, for each frame, the normalized RJC encoding simply stores the positions of the joints relative to one another along the skeletal hierarchy, with the pelvis node as the root.
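A corresponding sketch of the RJC conversion, under the same assumptions as the previous listing (the joint matrices L_i are those produced by the hypothetical joint_matrix helper above; equation A.5 is applied per joint, with the 3D displacement normalized so that each p_i lies on the unit sphere as described):

import numpy as np

def rjc_point(L_i):
    # Equation A.5: map one joint's homogeneous matrix to a point on the unit
    # sphere.  The 3D displacement of the joint from its parent is normalized
    # (the homogeneous coordinate is dropped).
    v = L_i @ np.array([0.0, 0.0, 0.0, 1.0])
    d = v[:3]
    return d / np.linalg.norm(d)

def rjc_pose(joint_matrices):
    # Column-wise concatenation of the per-joint unit vectors into the full
    # RJC pose vector (57 dimensions for 19 joints).
    return np.concatenate([rjc_point(L) for L in joint_matrices])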
Appendix B
Accurate Mesh Acquisition
Figure B.1: Selected textured mesh models of the author in varying boxing poses.
This section focuses on the capturing, synthesizing and texturing of accurate human
mesh models for realistic skinned mesh animation in the DirectX format (section A.1.2)∗ .
Human model acquisition was achieved via a laser scanner integrated with a synchronized
high-resolution camera. In section B.1, the surface fitting and re-sampling technique of
Carr et al [20, 21] is applied in a novel way to the scanned data in order to create accurate
human mesh models. Thereafter, images captured from the digital camera during scanning
are exported to the DirectX format and textured onto the mesh. This will enable the
generation of synthetic test silhouettes of real people, hence allowing the quantitative
analysis of any markerless motion capture technique by comparing the captured pose with
the pose that was used to generate the synthetic images.
∗ This appendix summarizes one of many available solutions for capturing an accurate human mesh model for synthetic testing purposes. Note that other techniques can also be used for accurate human model acquisition and texturing.
B.1 Model Acquisition for Surface Fitting & Re-sampling
The model acquisition was performed using a Riegl LMS-Z420i terrestrial laser scanner equipped with a calibrated Nikon D100 6-megapixel digital camera. The laser is positioned approximately five metres from the actor, and two scans (of the actor's front and back) are captured. Two stands are positioned to the left and right of the actor for hand placement, in order to ease biped and mesh alignment. Thereafter, the front and back point cloud data are filtered and merged using the Riscan Pro software, and the combined point cloud data are exported for surface fitting.
Figure B.2: Example images of the front and back scans of the author [left] and the merged, filtered scan data, which is used for surface fitting and re-sampling [right].
B.2 Radial Basis Fitting & Re-sampling
From the merged point cloud data of the actor (figure B.2 [right]), a radial basis function (RBF) can be fitted to the data set for re-sampling, after which a smooth mesh surface can be attained by joining the sampled points. For the case of 3D point cloud data, the original scanned points (also referred to as on-surface points) are each assigned a fourth-dimension density value of zero [20, 21]. Letting x denote a 3D vector, a smooth surface in R³ can be obtained by fitting an RBF function S(x) to the labelled scanned points and re-sampling the function at the same density value, S(x) = 0. Clearly, if only data points with density values of 0 are available, RBF fitting (to this set) will produce the trivial solution [i.e. S(x) = 0 for all x ∈ R³]. To avoid this, a signed-distance function, which encodes the relationship between off-surface and on-surface points, can be adopted (figure B.3).
Figure B.3: Diagram to illustrate the generation of off-surface points (blue and red) from the input on-surface points (green) before fitting an RBF function.
An off-surface point, in this case, is a synthetic point x_m^s whose density is assigned the signed value d_m, proportional to the distance to its closest on-surface point. Off-surface points are created along the projected normals, and can be created inside (blue points: d_m < 0) or outside (red points: d_m > 0) the surface defined by the on-surface (green) points. Normals are determined from local patches of on-surface points; this step is performed using an evaluation version of the FastRBF™ Matlab toolbox [20, 21].
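A minimal sketch of the off-surface point generation, assuming unit normals have already been estimated for each on-surface point (the offset distance eps is an illustrative parameter, not a value used in the thesis):

import numpy as np

def make_off_surface_points(points, normals, eps=0.01):
    # Generate off-surface points along the estimated surface normals and
    # assign signed density values: 0 on the surface, +eps outside (red
    # points), -eps inside (blue points).
    normals = normals / np.linalg.norm(normals, axis=1, keepdims=True)
    outside = points + eps * normals
    inside = points - eps * normals
    x = np.vstack([points, outside, inside])
    d = np.concatenate([np.zeros(len(points)),
                        np.full(len(points), eps),
                        np.full(len(points), -eps)])
    return x, d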
The surface fitting problem can now be re-formulated as the problem of determining an RBF function S(x) such that
\[
S(x_n) = 0 \quad \text{for } n = 1, 2, \ldots, N
\qquad \text{and} \qquad
S(x_m^s) = d_m \quad \text{for } m = 1, 2, \ldots, M, \tag{B.1}
\]
where N and M are the number of on-surface and off-surface points respectively. Using the notation from [20], an RBF S(x) can be represented as follows:
\[
S(x) = p(x) + \sum_{i=1}^{M+N} \lambda_i \Phi(|x - x_i|), \tag{B.2}
\]
with λ_i denoting the RBF coefficients of Φ, the real-valued basis function, and p representing a linear polynomial. By additionally constraining the RBF to the Beppo-Levi space in R³ [20], the following side conditions can be implicitly guaranteed:
\[
\sum_{i=1}^{M+N} \lambda_i
= \sum_{i=1}^{M+N} \lambda_i x_i
= \sum_{i=1}^{M+N} \lambda_i y_i
= \sum_{i=1}^{M+N} \lambda_i z_i = 0. \tag{B.3}
\]
Amalgamating the side constraints with the RBF representation in (B.2), the problem of RBF fitting can effectively be reduced to that of solving the following system of linear equations:
\[
\begin{bmatrix} A & P \\ P^T & 0 \end{bmatrix}
\begin{bmatrix} \lambda \\ c \end{bmatrix} = B \tag{B.4}
\]
where A is the matrix defining the relationship between x_i and x_j (i.e. A_{ij} = Φ(|x_i − x_j|) for i, j ∈ [1, M + N]), and P_{ij} = p_j(x_i), with j denoting the index into the polynomial basis. By solving for the unknown coefficients c and λ, the RBF function S(x) can now be sampled at any point within the range of the on-surface and off-surface training data. More importantly, it is possible to sample the RBF at S(x) = 0, which corresponds to the surface defined by the on-surface points. Furthermore, it is possible to sample this RBF at regular grid intervals, adding to the simplicity and efficiency of the mesh construction algorithm.
A disadvantage of RBF fitting, as highlighted in [20, 21], is the complexity of solving for the unknown coefficients c and λ, which is O((M + N)³). For an accurate mesh model derived from approximately 8000 on-surface points (as in figure B.2, right), this complexity makes the technique impractical on a home computer. By applying fast approximation techniques [20, 21], it is possible to reduce this complexity to O((M + N) log(M + N)). Figure B.4 shows the resultant mesh model of the author generated via the RBF fast approximation and re-sampling technique.
Figure B.4: Selected examples of the accurate mesh model of the author created via the RBF fast approximation & sampling technique of Carr et al [20, 21].
For increased realism, images of the person captured from the synchronized digital camera can be textured onto the plain model. Selected poses of textured mesh models of the author (using this simple solution) are shown in figure B.1.
References
[1] A. Agarwal and B. Triggs. 3D human pose from silhouettes by relevance vector
regression. In International Conference on Computer Vision & Pattern Recognition,
pages 882–888, 2004.
[2] A. Agarwal and B. Triggs. Learning to track 3D human motion from silhouettes. In
International Conference on Machine Learning, pages 9–16, 2004.
[3] A. Agarwal and B. Triggs. Monocular human motion capture with a mixture of
regressors. In IEEE workshop on Vision for Human Computer Interaction at CVPR,
June 2005.
[4] A. Agarwal and B. Triggs. Tracking articulated motion using a mixture of autoregressive models. In European Conference on Computer Vision, volume 3, pages
54–65, 2005.
[5] A. Agarwal and B. Triggs. A local basis representation for estimating the human
pose from cluttered images. In Asian Conference on Computer Vision, pages 55–59,
2006.
[6] A. Agarwal and B. Triggs. Recovering 3D human pose from monocular images.
IEEE Transaction on Pattern Analysis & Machine Intelligence, 28(1), 2006.
[7] J.K. Aggarwal and Q. Cai. Nonrigid motion analysis: Articulated and elastic motion.
In Computer Vision and Image Understanding, volume 70, pages 142–156, 1998.
[8] J.K. Aggarwal and Q. Cai. Human motion analysis: A review. In Computer Vision
and Image Understanding, volume 73, pages 428–440, 1999.
[9] M. Alexa. Linear combination of transformations. In SIGGRAPH, pages 380–387,
2002.
[10] O. Arikan, D.A. Forsyth, and J.F. O’Brien. Motion synthesis from annotations. In
SIGGRAPH, volume 22, pages 402–408, 2003.
[11] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using
shape contexts. In IEEE Transaction on Pattern Analysis & Machine Intelligence,
volume 24, pages 509–522, 2002.
[12] Y. Bengio, J.-F. Paiement, and P. Vincent. Out-of-sample extensions for LLE, Isomap,
MDS, eigenmaps and spectral clustering. In Advances in Neural Information Processing Systems, volume 16, 2004.
[13] O. Bernier and P. Cheung-Mon-Chan. Real-time 3D articulated pose tracking using
particle filtering and belief propagation on factor graphs. In British Machine Vision
Conference, volume 1, pages 27–36, 2006.
[14] R. Bowden. Learning statistical models of human motion. In IEEE Workshop
on Human Modeling, Analysis and Synthesis, International Conference on Computer
Vision & Pattern Recognition, 2000.
[15] R. Bowden, T. A. Mitchell, and M. Sarhadi. Reconstructing 3D pose and motion
from a single camera view. In British Machine Vision Conference, volume 2, pages
904–913, September 1998.
[16] J. Bray. Markerless based human motion capture: A survey. Technical report, Vision
and VR Group, Department of Systems Engineering, Brunel University.
[17] M. Bray, P. Kohli, and P.H.S. Torr. POSECUT: Simultaneous segmentation and
3D pose estimation of humans using dynamic graph-cuts. In European Conference
on Computer Vision, volume II, pages 642–655, 2006.
[18] C. Bregler and J. Malik. Tracking people with twists and exponential maps. In
International Conference on Computer Vision & Pattern Recognition, pages 8–15,
1998.
[19] F. Caillette, A. Galata, and T. Howard. Real-time 3D human body tracking using
variable length markov models. In British Machine Vision Conference, pages 469–
478, 2005.
[20] J.C. Carr, R.K. Beatson, J.B. Cherrie, T.J. Mitchell, W.R. Fright, B.C. McCallum,
and T.R. Evans. Reconstruction and representation of 3D objects with radial basis
functions. In SIGGRAPH, pages 67–76, 2001.
[21] J.C. Carr, R.K. Beatson, B.C. McCallum, W.R. Fright, T.J. McLennan, and T.J.
Mitchell. Smooth surface reconstruction from noisy range data. In Applied Research
Associates NZ Ltd.
[22] C. Cedras and M. Shah. Motion based recognition: A survey. In IEEE Proceedings,
Image and Vision Computing, 1995.
[23] P. Cerveri, A. Pedotti, and G. Ferrigno. Robust recovery of human motion from video
using kalman filters and virtual humans. In Human movement science, volume 22,
pages 377–404, 2003.
[24] J. Chai and J.K. Hodgins. Performance animation from low-dimensional control
signals. ACM Transaction on Graphics, 24(3):686–696, 2005.
[25] P. Chen and D. Suter. An analysis of linear subspace approaches for computer vision
and pattern recognition. In International Journal of Computer Vision, volume 68,
pages 83–106, 2006.
[26] Y. Chen, J. Lee, R. Parent, and R. Machiraju. Markerless monocular motion capture
using image features and physical constraints. In Computer Graphics International,
pages 36–43, June 2005.
[27] K-M. Cheung, S. Baker, and T. Kanade. Shape-from-silhouette across time part I:
Theory and algorithms. In International Journal of Computer Vision, volume 3,
pages 221–247, 2005.
[28] K-M. Cheung, S. Baker, and T. Kanade. Shape-from-silhouette across time part II:
Applications to human modeling and markerless motion tracking. In International
Journal of Computer Vision, volume 3, pages 225–245, 2005.
[29] C-W. Chu, O.C. Jenkins, and M.J. Matarić. Markerless kinematic model and motion
capture from volume sequences. In International Conference on Computer Vision
& Pattern Recognition, page 475, 2003.
[30] T.F. Cootes, C.J. Taylor, D.H. Cooper, and J. Graham. Active shape models - their
training and application. In Computer Vision and Image Understanding, volume 61,
pages 38–59, 1995.
[31] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines
(and other kernel-based learning methods). Cambridge University Press, 2000.
[32] Carnegie Mellon University Graphics Lab Motion Capture Database. http://mocaps.cs.cmu.edu.
[33] D. Demirdjian, L. Taycher, G. Shakhnarovich, K. Grauman, and T. Darrell. Avoiding the streetlight effect: tracking by exploring likelihood modes. In International
Conference on Computer Vision, pages 357–364, October 2005.
[34] J. Deutscher, A. Blake, and I. Reid. Articulated body motion capture by annealed
particle filtering. In International Conference on Computer Vision & Pattern Recognition, volume 2, pages 126–133, June 2000.
[35] M. Dimitrijevic, V. Lepetit, and P. Fua. Human body pose recognition using spatiotemporal templates. In ICCV workshop on Modeling People and Human Interaction,
October 2005.
[36] J. Eisenstein and W.E. Mackay. Interacting with communication appliances: an
evaluation of two computer vision-based selection techniques. In Computer Human
Interaction, pages 1111–1114, 2006.
[37] A. Elgammal and C.-S. Lee. Inferring 3D body pose from silhouettes using activity manifold learning. In International Conference on Computer Vision & Pattern
Recognition, pages 681–688, 2004.
[38] R. Engle. GARCH 101: The use of ARCH/GARCH models in applied econometrics.
In Journal of Economic Perspectives, volume 15, pages 157–168, 2001.
[39] V. Franc. Pattern recognition toolbox for matlab. In Centre for Machine Perception,
Czech Technical University, 2000.
[40] V. Franc. Optimization Algorithms for Kernel Methods. PhD thesis, Centre for
Machine Perception, Czech Technical University, July 2005.
[41] V. Franc and V. Hlavac. Greedy algorithm for a training set reduction in the kernel
methods. In Int. Conf. on Computer Analysis of Images and Patterns, pages 426–
433, 2003.
[42] P. Fua, A. Gruen, N. D’Apuzzo, and R. Plänkers. Markerless full body shape and
motion capture from video sequences. In International Archives of Photogrammetry
and Remote Sensing, volume 34, pages 256–261, 2002.
[43] D.M. Gavrila. The visual analysis of human movement: A survey. In Computer
Vision and Image Understanding, volume 73, pages 82–96, 1999.
[44] K. Grauman and T. Darrell. The pyramid match kernel: Discriminative classification
with sets of image features. In International Conference on Computer Vision, pages
1458–1465, 2005.
[45] K. Grauman, G. Shakhnarovich, and T. Darrell. Inferring 3D structure with a statistical image-based shape model. In International Conference on Computer Vision,
pages 641–648, 2003.
[46] K. Grochow, S.L. Martin, A. Hertzmann, and Z. Popovic. Style-based inverse kinematics. In SIGGRAPH, pages 522–531, 2004.
[47] Ji Hun Ham, I. Ahn, and D. Lee. Learning a manifold-constrained map between
image sets: Applications to matching and pose estimation. International Conference
on Computer Vision & Pattern Recognition, 2006.
[48] A. Hilton and J. Starck. Multiple view reconstruction of people. In International
Symposium on 3D Data Processing, Visualization and Transmission (3DPVT),
pages 357–364, 2004.
[49] P.O. Hoyer. Non-negative matrix factorization with sparseness constraints. In Journal
of Machine Learning Research, number 5, pages 1457–1469, 2004.
[50] S. Hu and B.F. Buxton. Using temporal coherence for gait pose estimation from
a monocular camera view. In British Machine Vision Conference, volume 1, pages
449–457, 2005.
[51] I. T. Jolliffe. Principal component analysis. Springer-Verlag, New York, 1986.
[52] R. Jonker and A. Volgenant. A shortest augmenting path algorithm for dense and
sparse linear assignments problems. Computing, 38:325–340, 1987.
[53] R.E. Kalman. A new approach to linear filtering and prediction problems. In
Transactions of the ASME - Journal of Basic Engineering, volume 83, pages 95–
107, 1961.
[54] R. Kehl, M. Bray, and L. Van Gool. Full body tracking from multiple views using
stochastic sampling. In International Conference on Computer Vision & Pattern
Recognition, volume 2, pages 129–136, 2005.
[55] P. Kohli and P. Torr. Efficiently solving dynamic Markov random fields using graph
cuts. In International Conference on Computer Vision, pages 922–929, 2005.
[56] R. Kondor and T. Jebara. A kernel between sets of vectors. In International Conference on Machine Learning, 2003.
[57] L. Kovar, M. Gleicher, and F. Pighin. Motion graphs. In SIGGRAPH: Proceedings of
the 29th annual conference on Computer graphics and interactive techniques, pages
473–482, 2002.
[58] J.T. Kwok and I.W. Tsang. The pre-image problem in kernel methods. In International Conference on Machine Learning, pages 408–415, 2003.
[59] J. Lander. Working with motion capture file formats. In Game Developer, pages
30–37, January 1998.
[60] A. J. Laub. Matrix analysis for scientists and engineers. pages 139–150, 2005.
[61] R. Li, M.-H. Yang, S. Sclaroff, and T.-P. Tian. Monocular tracking of 3D human
motion with a coordinated mixture of factor analyzers. In European Conference on
Computer Vision, volume II, 2006.
[62] Y. Li, T. Wang, and H-Y. Shum. Motion texture: a two-level statistical model
for character motion synthesis. In SIGGRAPH: Proceedings of the 29th annual
conference on Computer graphics and interactive techniques, pages 465–472, 2002.
[63] Y. Liang, W. Gong, W. Li, and Y. Pan. Face recognition using heteroscedastic
weighted kernel discriminant analysis. In International conference on advances in
pattern recognition, volume 2, pages 199–205, 2005.
[64] D. Lowe. Distinctive image features from scale-invariant keypoints. In International
Journal of Computer Vision, volume 60, pages 91–110, 2004.
[65] G. Loy, M. Eriksson, J. Sullivan, and S. Carlsson. Monocular 3D reconstruction
of human motion in long action sequences. In European Conference on Computer
Vision, pages 442–455, 2004.
[66] F. Luna. Skinned mesh character animation with direct3D 9.0c. In www.moonslab.com, September 2004.
[67] S. McKenna, G. Gong, and Y. Raja. Face recognition in dynamic scenes. In British
Machine Vision Conference, pages 140–151, 1997.
[68] A.S. Micilotta, E.J. Ong, and R. Bowden. Detection and tracking of humans by
probabilistic body part assembly. In British Machine Vision Conference, volume 1,
pages 429–438, 2005.
[69] I. Mikić, M. Trivedi, E. Hunter, and P. Cosman. Human body model acquisition and
tracking using voxel data. In International Journal of Computer Vision, volume 3,
pages 199–223, 2003.
[70] I. Mikić, M.M. Trivedi, E. Hunter, and P.C. Cosman. Human body model acquisition
and motion capture using voxel data. In International workshop on Articulated
Motion and Deformable Objects, pages 104–118, 2002.
[71] G. Miller, J. Starck, and A. Hilton. Projective surface refinement for free-viewpoint
video. In European Conference on Visual Media Production (CVMP), 2006.
[72] T. Moeslund. Computer vision-based human motion capture - a survey. Technical
report, Laboratory of Image Analysis, Institute of Electronic Systems, University of
Aalborg, Denmark, 1999.
[73] T. Moeslund. Summaries of 107 computer vision-based human motion capture papers. Technical report, Laboratory of Image Analysis, Institute of Electronic Systems, University of Aalborg, Denmark, 1999.
[74] T.B. Moeslund and E. Granum. 3D human pose estimation using 2D-data and an
alternative phase space representation. In Proceedings of the IEEE Workshop on
Human Modeling, Analysis and Synthesis, pages 26–33, 2000.
[75] G. Mori and J. Malik. Estimating human body configurations using shape context
matching. In European Conference on Computer Vision, volume 3, pages 666–680,
2002.
[76] R. Navaratnam, A. Fitzgibbon, and R. Cipolla. Semi-supervised learning of joint
density models for human pose estimation. In British Machine Vision Conference,
volume 2, pages 679–688, 2006.
[77] R. Navaratnam, A. Thayananthan, P.H.S. Torr, and R. Cipolla. Hierarchical part-based human body pose estimation. In British Machine Vision Conference, pages
479–488, 2005.
[78] S. A. Nene, S. K. Nayar, and H. Murase. Columbia object image library (COIL-20).
In http://www.cs.columbia.edu/CAVE/, 1996.
[79] M. Niskanen, E. Boyer, and R. Horaud. Articulated motion capture from 3-D points
and normals. In British Machine Vision Conference, volume 1, pages 439–448, 2005.
[80] G. Peters. Efficient pose estimation using view-based object representations. In
Machine Vision and Applications, volume 16, pages 59–63, 2004.
[81] G. Peters, B. Zitova, and C. von der Malsburg. How to measure the pose robustness
of object views. In Image and Vision Computing, volume 20, pages 249–256, 2002.
[82] R. Plankers and P. Fua. Articulated soft objects for video-based body modeling. In
International Conference on Computer Vision, pages 394–401, 2001.
[83] L. Ren, G. Shakhnarovich, J.K. Hodgins, H. Pfister, and P. Viola. Learning silhouette features for control of human motion. ACM Transaction on Graphics, 24(4),
October 2005.
[84] A. Safonova, J.K. Hodgins, and N.S. Pollard. Synthesizing physically realistic human
motion in low-dimensional, behavior-specific spaces. ACM Transaction on Graphics, 23(3):514–521, 2004.
[85] L. K. Saul and S. T. Roweis. Nonlinear dimensionality reduction by locally linear
embedding. Science, 290:2323–2326, 2000.
[86] L.K. Saul and S.T. Roweis. Think globally, fit locally: unsupervised learning of low
dimensional manifolds. Journal of Machine Learning Research, 4:119–155, 2003.
[87] B. Schölkopf, S. Mika, A.J. Smola, G. Rätsch, and K.R. Müller. Kernel PCA pattern
reconstruction via approximate pre-images. In International Conference on Artificial
Neural Networks, pages 147–152, 1998.
[88] B. Schölkopf, P. Knirsch, A. Smola, and C. Burges. Fast approximation of support
vector kernel expansions, and an interpretation of clustering as approximation in
feature spaces. In Mustererkennung, pages 124–132, 1998.
[89] B. Schölkopf and A.J. Smola. Learning with Kernels: Support Vector Machines,
Regularization, Optimization and Beyond - Chapter 18. MIT Press, Cambridge,
2002.
[90] B. Schölkopf, A.J. Smola, and K.R. Müller. Kernel principal component analysis.
In International Conference on Artificial Neural Networks, pages 583–588, 1997.
[91] B. Schölkopf, A.J. Smola, and K.R. Müller. Kernel PCA and de-noising in feature
spaces. In Advances in Neural Information Processing Systems, pages 536–542, 1999.
[92] N.N. Schraudolph, S. Günter, and S.V.N. Vishwanathan. Fast iterative kernel PCA.
In Advances in Neural Information Processing Systems, 2007.
[93] C. Shen, A. van den Hengel, A. Dick, and M.J. Brooks. 2D articulated tracking
with dynamic Bayesian networks. In International Conference on Computer and
Information Technology, pages 130–136, 2004.
[94] G. Shakhnarovich, P. Viola, and T. Darrell. Fast pose estimation with parameter
sensitive hashing. In International Conference on Computer Vision, 2003.
[95] H. Sidenbladh, M. Black, and D. Fleet. Stochastic tracking of 3D human figures
using 2D image motion. In European Conference on Computer Vision, pages 702–
718, June 2000.
[96] C. Sminchisescu, A. Kanaujia, and D. Metaxas.
Learning joint top-down and
bottom-up processes for 3D visual inference. In International Conference on Computer Vision & Pattern Recognition, volume 2, pages 1743–1752, 2006.
[97] C. Sminchisescu and B. Triggs. Covariance scaled sampling for monocular 3D body
tracking. In International Conference on Computer Vision & Pattern Recognition,
volume 1, pages 447–454, December 2001.
[98] L. I. Smith. A tutorial on principal components analysis. 2002.
[99] J. Starck and A. Hilton. Model-based multiple view reconstruction of people. In
International Conference on Computer Vision, pages 915–922, 2003.
[100] J. Starck, G. Miller, and A. Hilton. Volumetric stereo with silhouette and feature
constraints. In British Machine Vision Conference, volume 3, pages 1189–1198,
2006.
[101] E.B. Sudderth, A.T. Ihler, W.T. Freeman, and A.S. Willsky. Nonparametric belief
propagation. In International Conference on Computer Vision & Pattern Recognition, volume 11, page 605, 2003.
[102] A. Sundaresan and R. Chellappa. Markerless motion capture using multiple cameras.
In Computer Vision for Interactive and Intelligent Environment, pages 15–26, 2005.
[103] Vicon Peak: Vicon MX System. http://www.vicon.com/products/systems.html.
[104] K. Takahashi, T. Sakaguchi, and J. Ohya. Real-time estimation of human body
postures using Kalman filter. In International Workshop on Robot and Human Interaction, pages 189–194, 1999.
[105] T. Tangkuampien and T-J. Chin. Locally linear embedding for markerless human
motion capture using multiple cameras. In Digital Image Computing: Techniques
and Applications, page 72, 2005.
[106] T. Tangkuampien and D. Suter. 3D object pose inference via kernel principal components analysis with image euclidian distance (IMED). In British Machine Vision
Conference, pages 137–146, 2006.
[107] T. Tangkuampien and D. Suter. Human motion de-noising via greedy kernel principal component analysis filtering. In International Conference on Pattern Recognition, pages 457–460, 2006.
[108] T. Tangkuampien and D. Suter. Real-time human pose inference using kernel principal components pre-image approximations. In British Machine Vision Conference,
pages 599–608, 2006.
[109] W.Y. Teh and S. Roweis. Automatic alignment of local representations. In Advances
in Neural Information Processing Systems, pages 841–848, 2002.
[110] J.B. Tenenbaum, V. de Silva, and J.C. Langford. A global geometric framework for
nonlinear dimensionality reduction. Science, 290:2319–2323, 2000.
[111] C. Theobalt, M. Magnor, P. Schüler, and H.P. Seidel. Combining 2D feature tracking
and volume reconstruction for online video-based human motion capture. In Pacific
Conference on Computer Graphics and Applications, page 96, 2002.
[112] M.E. Tipping and C.M. Bishop. Mixtures of probabilistic principal component analysers. In Neural Computation, volume 11, pages 443–482, 1999.
[113] R. Urtasun, D.J. Fleet, A. Hertzmann, and P. Fua. Priors for people tracking from
small training sets. In International Conference on Computer Vision, pages 403–410,
2005.
[114] V.N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York,
1995.
[115] P. Viola and M.J. Jones. Rapid object detection using a boosted cascade of simple
features. In International Conference on Computer Vision & Pattern Recognition,
pages 511–518, 2001.
[116] L. Wang, Y. Zhang, and J. Feng. On the Euclidean distance of images. In IEEE
Transaction on Pattern Analysis & Machine Intelligence, volume 27, pages 1334–
1339, 2005.
[117] R. Wang and W.K. Leow.
Human posture sequence estimation using two un-
calibrated cameras. In British Machine Vision Conference, volume 1, pages 459–468,
2005.
[118] G. Welch and G. Bishop. An introduction to the Kalman filter. In SIGGRAPH, 2001.
[119] L-W Zhao, S-W Luo, and L-Z Liao. 3D object recognition and pose estimation using
kernel PCA. In International Conference on Machine learning & Cybernetics, pages
3258–3262, 2004.
[120] Z. Zivkovic. Optical-flow-driven gadgets for gaming user interface. In Proceedings of
the 3rd International Conference on Entertainment Computing, pages 90–100, 2004.