WenYi Zhao
The focus of my research has been several aspects of statistical image/signal
processing and computer vision, and their applications, for example, better
video coding schemes.
Statistical pattern recognition techniques have been successfully applied
to recognition tasks based on still images or video.
However, in many cases,
computer vision techniques are needed to accomplish difficult tasks such as 3D
object recognition. One such example is face recognition, in which the system
is presented with a face picture from which the person's identity needs
to be determined. The key challenge here is that we need to determine the class
label from the 2D image of a 3D object. By incorporating computer vision and/or
other techniques, we can solve such difficult problems.
I. Statistical Image/Signal Processing
In the statistical pattern recognition literature, one major approach is
training-based methods, which usually consist of the following steps: choosing
an appropriate classifier, and then constructing the specific
classifier, including estimating the parameters in the case of parametric
classifiers. There are many classifiers available, including the Bayesian
classifier, the nearest-neighbor rule, neural networks, and linear
discriminants, to name a few. The Bayesian classifier is optimal according to
traditional statistical pattern recognition theory. But for applications
involving high-dimensional signals, the demand for a large number of training
samples to construct a good Bayesian classifier is difficult to satisfy.
Thus, researchers continue to search for
classifiers that perform close to the Bayesian classifier but with fewer
training samples.
A. Subspace Linear/Nonlinear Discriminant Analysis
We proposed a statistical framework, subspace discriminant
analysis, with which we can construct a practically good classifier (either
linear or nonlinear) from a limited number of training samples. Discriminant
analysis has been studied in both pattern recognition and statistics for
many years. For example, linear classifiers such as LDA (Linear Discriminant
Analysis) have been successfully applied to the task of face recognition. On
the other hand, subspace methods, especially PCA-based ones, have been used
for effective
dimension reduction. We instead proposed using a universal subspace to
overcome the generalization/over-fitting problem in applications such as face
recognition [4].
Combining subspace and discriminant analysis, we proposed a general
framework to solve practical classification problems. For example,
a successful face recognition system has been built upon
subspace LDA, and it has been evaluated in a competitive test of face
recognition algorithms called the FERET test [4]. However, linear classifiers
have a fundamental restriction: they cannot handle linearly non-separable
cases, which can occur even in the task of face recognition [8]. For such cases,
multiple subspaces or a parametric subspace can be constructed from the original
subspace to accommodate inputs distorted by scaling, rotation,
translation, etc. [6].
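
The combination of a subspace projection with discriminant analysis can be
sketched as follows. This is only a minimal illustration under stated
assumptions, not the system evaluated in [4]: it assumes plain PCA as the
subspace step and classical Fisher LDA inside that subspace, and the function
name subspace_lda and its parameters are hypothetical.

```python
import numpy as np

def subspace_lda(X, y, n_pca, n_lda):
    """Sketch: PCA to n_pca dimensions, then LDA to n_lda dimensions.

    X is (n_samples, n_features), y holds integer class labels.
    n_pca should not exceed n_samples - n_classes, otherwise the
    within-class scatter matrix below is singular.
    Returns the data mean, the combined projection matrix, and the
    LDA eigenvalues of the kept directions.
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    # PCA via SVD of the centered data; rows of Vt are principal directions.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W_pca = Vt[:n_pca].T                 # (n_features, n_pca)
    Z = Xc @ W_pca                       # samples in the PCA subspace

    # Within-class (Sw) and between-class (Sb) scatter in the subspace.
    Sw = np.zeros((n_pca, n_pca))
    Sb = np.zeros((n_pca, n_pca))
    z_mean = Z.mean(axis=0)
    for c in np.unique(y):
        Zc = Z[y == c]
        mc = Zc.mean(axis=0)
        Sw += (Zc - mc).T @ (Zc - mc)
        diff = (mc - z_mean)[:, None]
        Sb += len(Zc) * (diff @ diff.T)

    # Directions maximizing between-class over within-class scatter:
    # eigenvectors of Sw^{-1} Sb with the largest eigenvalues.
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(evals.real)[::-1][:n_lda]
    W_lda = evecs[:, order].real
    return mean, W_pca @ W_lda, evals.real[order]
```

A new sample x would then be represented by (x - mean) @ W and compared with
stored class representatives using a distance metric.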
More generally, the concept of subspace [14] can also be used to
derive the so-called kernel PCA method [2],
based on the replacement of the dot product with an appropriate kernel
function [1]. Using such a method, the original signal can be
transformed into a subspace of a much higher-dimensional
space through a nonlinear mapping, so that performing linear
classification in that subspace is close to performing nonlinear
classification in the original space.
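
The kernel substitution can be illustrated with a short sketch. It assumes a
Gaussian (RBF) kernel, which is only one possible choice of kernel function;
the names rbf_kernel and kernel_pca and the parameter gamma are hypothetical
and are not taken from [1] or [2].

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian kernel k(a, b) = exp(-gamma * ||a - b||^2) for all pairs."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def kernel_pca(X, n_components, gamma=1.0):
    """Sketch of kernel PCA: project the training samples onto the leading
    principal directions of the implicitly mapped (feature-space) data."""
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    # Center the data in feature space using only kernel values.
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one
    # Symmetric eigenproblem; eigh returns eigenvalues in ascending order.
    evals, evecs = np.linalg.eigh(Kc)
    idx = np.argsort(evals)[::-1][:n_components]
    evals, evecs = evals[idx], evecs[:, idx]
    # Coordinates of the training samples along each kernel principal axis.
    return evecs * np.sqrt(np.maximum(evals, 0.0))
```

A linear classifier applied to the returned coordinates then acts as a
nonlinear classifier with respect to the original inputs.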
B. Distance Metrics
Distance metrics have been an active research topic in the pattern recognition
community for many years. Various distance metrics are available, including
Euclidean distance, Mahalanobis distance, Kullback-Leibler distance,
etc. For better classification,
smart distance metrics have been studied. For example, distance metrics based
on discriminant analysis have been studied extensively. We proposed
two new distance metrics, a DCA-based distance metric [9]
and a minimax-based distance metric [7],
and demonstrated the efficiency of these new distance metrics.
Also, in our subspace LDA face recognition system, we employed an
eigenvalue-guided weighted Euclidean distance for better
performance [10].
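
A weighted Euclidean distance of this kind can be sketched as below. The
weighting rule is an assumption made for illustration (weights proportional to
the eigenvalues, normalized to sum to one); the exact rule used in [10] is not
reproduced here, and all function names are hypothetical.

```python
import numpy as np

def eigenvalue_weights(eigenvalues):
    """Assumed weighting rule: weights proportional to the eigenvalues."""
    lam = np.asarray(eigenvalues, dtype=float)
    return lam / lam.sum()

def weighted_euclidean(a, b, weights):
    """d(a, b) = sqrt(sum_i w_i * (a_i - b_i)^2)."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum(weights * diff ** 2)))

def nearest_neighbor(probe, gallery, weights):
    """Index of the gallery row closest to the probe, plus all distances."""
    d = np.sqrt((((gallery - probe) ** 2) * weights).sum(axis=1))
    return int(np.argmin(d)), d
```

Directions with larger eigenvalues then contribute more to the match score
than directions with little discriminatory power.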
C. Performance Evaluation
Performance evaluation is the technique used to evaluate how well a designed
system performs. As more and more approaches are developed
for the same or similar tasks, performance evaluation becomes increasingly
important for choosing the best approach in a given situation.
The evaluation task has also been
facilitated by the creation of large databases and increasing computing
power. However, such a purely empirical approach depends very much on the size
of the database and the feature vector space. On the other hand, performance
evaluation can be carried out by analyzing the empirical system
performance, decomposing it into the theoretical
performance of the given system and the performance perturbation due to signal
noise, small sample size, etc. Such analysis
not only gives insight into how sensitive system performance is to the
system implementation, but also helps to choose practically good classifiers
in certain situations. For example, the performance of the LDA classifier was
analyzed using matrix perturbation theory [5]. Also,
performance analysis based on Shape-from-Shading (SFS) suggests that significant
illumination change can seriously degrade system
performance. Hence it is necessary to seek methods that compensate for these
changes [8].
D. Improving the Efficiency
As an alternative to LDA, we proposed a new scheme, DCA (Discriminant Component
Analysis) [9], which is analogous to PCA (Principal Component
Analysis) but fundamentally different. The new scheme decomposes a signal
onto an orthonormal basis such that each basis vector has an
eigenvalue representing the discriminatory power of the projection in that
direction. Because DCA iteratively seeks a full orthonormal basis,
it has the following advantages over LDA: first, it encodes the full
discriminatory information; second, it generates eigenvalues that
are more suitable as weights in a weighted distance metric; and
third, it is more suitable for non-Gaussian distributions.
E. Statistical Learning: Clustering and Factor Analysis
Clustering is a standard unsupervised learning technique
that gives a more compact representation of the data. In [6], a
simple k-means algorithm was successfully used to
cluster more than 1,000 face images into 7 clusters.
With such clustering, the efficiency of solving a parameter
estimation problem was improved dramatically [6].
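
For reference, the basic k-means iteration can be sketched as follows. This is
the textbook (Lloyd) procedure with random initialization, given under the
assumption that samples are compared with squared Euclidean distance; it is not
the specific implementation used in [6], and the function names are
hypothetical.

```python
import numpy as np

def assign(X, centers):
    """Label each sample with the index of its nearest center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(X, k, n_iter=100, seed=0):
    """Sketch of k-means: alternate assignment and center update."""
    rng = np.random.default_rng(seed)
    # Initialize the centers with k distinct samples chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign(X, centers)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:          # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break                          # centers stopped moving
        centers = new_centers
    # Final labels are computed against the returned centers.
    return centers, assign(X, centers)
```

For the setting described above, X would hold one feature vector per face image
and k would be 7.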
Factor analysis generally refers to statistical methods that recover the
underlying factors used to model the observations (Independent
Component Analysis is the newest addition to this branch).
Traditionally, factor models are additive.
More recently, a multiplicative model (bilinear model) was
proposed to address a wide range of problems [3].
The bilinear model not only provides a sufficiently expressive representation of
factor interactions but can also be fit using efficient algorithms
based on Singular Value Decomposition (SVD) and the
Expectation-Maximization (EM) algorithm. One application of this
approach is to separate face identity (one factor) from
pose (another factor), and then to perform pose estimation.
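
An SVD-based fit of a bilinear model can be sketched as follows. The sketch
assumes an asymmetric form in which each pose has its own linear map and each
identity has a low-dimensional vector, and it assumes that one observation is
available for every (pose, identity) pair. The array layout and the function
names are hypothetical; the formulation in [3] may differ in its details.

```python
import numpy as np

def fit_asymmetric_bilinear(Y, n_factors):
    """Fit Y[s, c, :] ~ A[s] @ B[:, c] by a truncated SVD.

    Y has shape (n_poses, n_identities, n_pixels).
    Returns A with shape (n_poses, n_pixels, n_factors) and
    B with shape (n_factors, n_identities).
    """
    S, C, K = Y.shape
    # Stack the observations of all poses vertically: one column per identity.
    stacked = Y.transpose(0, 2, 1).reshape(S * K, C)
    U, s, Vt = np.linalg.svd(stacked, full_matrices=False)
    A = (U[:, :n_factors] * s[:n_factors]).reshape(S, K, n_factors)
    B = Vt[:n_factors]
    return A, B

def reconstruct(A, B):
    """Rebuild all observations from pose maps A and identity vectors B."""
    return np.einsum('skj,jc->sck', A, B)
```

Here A carries the pose factor and B the identity factor; the truncated SVD
gives the best least-squares fit of the stacked observation matrix for the
chosen number of factors.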
II. Combining Statistics with Computer Vision etc.
In many cases, even with robust pattern recognition techniques, the
task of complex object recognition cannot be accomplished without incorporating
other techniques. For example, the above-mentioned subspace LDA face
recognition system [6] is robust against any 2D face image
transformation, but not against 3D transformations. This is because the 2D image
does not explicitly encode any 3D information, and hence pure pattern
recognition techniques will not work in such
cases. Meanwhile, inferring 3D information from 2D images has been a
major research topic in the computer vision community for many years. Hence we
feel it is natural to combine pattern recognition and computer vision
approaches to solve such difficult problems. However,
one difficulty in applying computer vision techniques to
practical problems is that traditional computer vision seeks
perfect solutions to ill-posed problems under some
unrealistic assumptions.
Hence these techniques are fragile, because real tasks usually do not
satisfy these assumptions. One possible solution to this difficulty
would be to construct better computer vision techniques
or to integrate them with other techniques. For example,
we can combine computer graphics and computer vision to
solve the ill-posed problem with the help of some prior knowledge
about the objects.
A. Symmetric Shape-from-Shading
We have proposed a new SFS algorithm that can handle symmetric
objects such as a face [8]. Symmetry is very useful information that
can be exploited in SFS algorithms for symmetric objects. However, implicitly
bringing this information into existing SFS algorithms does not seem to help
much. So we describe a direct method for incorporating this important
cue. Compared to existing SFS algorithms, the new
symmetric SFS algorithm has the following advantages: a) It has not only
a point-wise unique solution for the partial derivatives
(p,q) but also a unique solution for the albedo. (Here the albedo can be
either constant or piece-wise constant across the whole image plane.)
b) By using the self-ratio image, problems due to variations in albedo
are avoided. Hence a model-based light-source estimation approach
becomes more accurate. c) By combining symmetric SFS with regular
SFS, a unique solution can be obtained even when shadow points are
present.
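
The reason a self-ratio image sidesteps albedo variation can be illustrated
with a small sketch. It assumes that the self-ratio image is the difference
between an image and its mirror image divided by their sum, and that the
symmetry axis is the vertical center line of the image; both are assumptions
made here for illustration (the exact formulation is given in [8]), and the
function name is hypothetical.

```python
import numpy as np

def self_ratio_image(image, eps=1e-8):
    """Assumed definition: (I - I_mirror) / (I + I_mirror), with the image
    mirrored about its vertical center line. eps guards against division
    by zero where both pixels are dark."""
    I = np.asarray(image, dtype=float)
    mirrored = I[:, ::-1]
    return (I - mirrored) / (I + mirrored + eps)
```

If the image is the product of a left-right symmetric albedo map and a shading
term, the albedo appears in both the numerator and the denominator and cancels,
so the ratio depends on shading alone.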
B. Illumination-Insensitive Face Recognition using Symmetric SFS
Even though symmetric SFS provides a better solution for symmetric
objects than regular SFS, using it for applications such as face
recognition is still difficult. This is partly due to possible violations
of assumptions such as the Lambertian model and a single light source.
Instead, we propose a direct image-to-image computation based on
symmetric SFS and a generic 3D head model [8]. This method has the
following features in handling the illumination problem in face
recognition: a) There is no training, hence only one image is needed.
b) A new illumination-invariant matching measure is proposed.
c) Since no full symmetric SFS is actually carried out and the computation
is image to image, it is fast. d) The problem of solving for
complex/arbitrary albedo information is avoided. To demonstrate the
efficacy of our method, we have applied it to several publicly available
face databases. We demonstrate
significant performance improvement over existing face recognition
systems using PCA and/or LDA, for images acquired under variable
lighting conditions.
C. Model-Based Image Synthesis
Image synthesis is an active research area with numerous
applications, such as in computer graphics. Multi-view-based image
synthesis is a technique used to generate images under different views
(or even lightings) based on multiple images of the same object/scene.
There are two major approaches to this technique: 1) the viewpoint is
static [11], 2) the lighting is static [12].
Using just one image, we propose a model-based image synthesis method that
can synthesize good-quality images with arbitrary albedo under
different poses and illuminations [8].
D. Physics-based 3D Object Recognition
It has been a difficult problem to recognize a 3D object under different
views if we only have one view of this object. Various learning-based
methods have been proposed. The success of these methods relies on large
numbers of training samples. For poses that do not have enough
training samples, it is difficult to recognize new images under such
poses, since these are essentially extrapolation problems. A better
alternative (at least in theory) is to infer 3D information from a
single 2D image. After obtaining the 3D information, recognizing
new images of the same object under different poses and illuminations
is simple. However, it is not easy to infer accurate 3D information
from just a single image. Currently we are using a generic 3D
model to recover the frontal-view image from a
given image, which is a special case of synthesizing a new
image from a given image [8]. After generating the
frontal-view image, we can simply apply the already-trained subspace LDA
system.
III. Applications
Both statistical signal processing and computer vision have numerous
applications: video compression, speech recognition, content-based
information retrieval, etc. One particularly important class of applications
concerns human beings: how people act and respond, access control based
on personal identity, etc. For example, face recognition can be utilized
in many applications. As one such effort, we have built a prototype
viewer-identification system for television based on [10].
We have also proposed a face descriptor to MPEG-7 based on our
successful face recognition system [13]. The initial test
results on the MPEG-7 test content are very satisfactory.
On the subject of multimedia applications, we have demonstrated that
the performance of a simple color-based shot-detection algorithm can
be improved using the minimax distance metric [7].
IV. Future Directions
My future research would focus on the following directions: