Larry Davis' research on understanding human activity is supported
by DARPA, ONR, Microsoft Corporation, Philips Research Laboratory, ATR's
Media Integration Laboratory, and the Keck Foundation. The research focuses
on new computer vision algorithms and systems that can detect, track and
analyze human movement. Much of this research is carried out in the newly
formed Keck Laboratory for the Analysis of Visual Movement. The Keck Lab
(see Figure 1) contains 64 digital, progressive-scan monochromatic and
color cameras connected to a network of PCs. The cameras can acquire
images at rates of up to 85 frames per second. The PCs can collect
up to ten seconds of uncompressed video from the 64 cameras for off-line
analysis, or can be used for real-time systems development.
Figure 1. Keck Laboratory architecture.
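To give a rough sense of the buffering this implies, the following sketch estimates the raw size of one ten-second capture. The camera count, frame rate and duration come from the description above; the 640x480 resolution and one byte per pixel are assumptions chosen only for illustration, since the actual sensor format is not stated here.

```python
def capture_bytes(num_cameras, fps, seconds, width, height, bytes_per_pixel=1):
    """Raw bytes needed to buffer an uncompressed multi-camera capture."""
    frames_per_camera = int(fps * seconds)
    return num_cameras * frames_per_camera * width * height * bytes_per_pixel


# 64 cameras, 85 frames/s and 10 s are taken from the text.
# 640x480 at 1 byte/pixel is an ASSUMED monochrome format, not a specification.
total = capture_bytes(64, 85, 10, width=640, height=480)
print(f"{total / 2**30:.1f} GiB across all cameras")
print(f"{total / 64 / 2**20:.1f} MiB per camera")
```

Under these assumptions the per-camera share is what each capture PC would need to hold in memory, which is one reason the capture window is limited to a few seconds.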
W4: W4 is a real-time PC-based visual surveillance system that operates on both single images and stereo image pairs, and on visible as well as infrared imagery. It includes real-time vision algorithms for
W4 has been successfully applied to hours of monochromatic video,
and can detect and track people against complex backgrounds at speeds of
up to 30 frames per second. Recent extensions to W4 include the ability
to analyze people in a variety of natural postures (the original version
was restricted to upright people), to track individuals within a moving
group, and to recognize that people are carrying or exchanging objects.
Figure 2 shows one frame from a visible image sequence processed by the
W4 system. Here, W4 is tracking three people who are passing through its
field of regard. It has classified them as people based on their dynamic
shapes and motions, and has built models of their appearance that allow
it to track them through occlusions, and to recognize when a person leaves
the field of view and then later returns. Based on its models of human
form and motion, it has identified the locations of the principal body
parts, and can track these parts through the sequence. Finally, it has
approximately placed these people onto the ground plane via a simple and
automatic calibration procedure.
Figure 2. Tracking of three people by W4.
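The calibration procedure itself is not described here. One common way to place an image point (for example, the point where a tracked person's feet meet the floor) onto a ground plane is a planar homography, and the sketch below assumes that model. The function name and the example matrix are hypothetical and are not taken from W4.

```python
def to_ground_plane(H, u, v):
    """Map image point (u, v) to ground-plane coordinates.

    H is a 3x3 homography (list of rows), assumed to come from some
    calibration step that is outside the scope of this sketch.
    """
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    if abs(w) < 1e-12:
        raise ValueError("point maps to infinity (it lies on the horizon line)")
    return x / w, y / w


# Hypothetical homography with a mild perspective term in the last row.
H_example = [[1.0, 0.0, 0.0],
             [0.0, 1.0, 0.0],
             [0.0, 0.5, 1.0]]
print(to_ground_plane(H_example, 2.0, 2.0))
```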
We have also developed a version of W4 that employs stereo cameras and combines its single-camera, intensity-based analysis with depth analysis from stereo. This version of W4 integrates stereo and intensity during its detection phase to eliminate shadows and to accurately segment the person from the background. More recently, we have extended W4's segmentation algorithm to color imagery.
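The idea of using depth to reject shadows can be illustrated with a minimal sketch: a shadow changes a pixel's intensity but not its distance from the camera, so requiring both cues to differ from the background suppresses it. The AND rule, the thresholds and the function name below are assumptions made for illustration; the actual way W4 fuses the two cues is not specified here.

```python
def segment_foreground(intensity, depth, bg_intensity, bg_depth,
                       intensity_thresh=25, depth_thresh=0.3):
    """Return a boolean mask that is True where a pixel differs from the
    background in BOTH intensity and depth (assumed fusion rule).

    All inputs are 2D lists of equal size; depth is in arbitrary but
    consistent units (metres in the example below).
    """
    mask = []
    for row_i, row_d, row_bi, row_bd in zip(intensity, depth,
                                            bg_intensity, bg_depth):
        mask_row = []
        for i, d, bi, bd in zip(row_i, row_d, row_bi, row_bd):
            intensity_changed = abs(i - bi) > intensity_thresh
            depth_changed = abs(d - bd) > depth_thresh
            mask_row.append(intensity_changed and depth_changed)
        mask.append(mask_row)
    return mask


# One image row: unchanged pixel, shadow pixel, person pixel.
bg_i, bg_d = [[120, 120, 120]], [[5.0, 5.0, 5.0]]
cur_i, cur_d = [[121, 70, 40]], [[5.0, 5.0, 3.2]]
print(segment_foreground(cur_i, cur_d, bg_i, bg_d))
```

In the example row, the middle pixel darkens like a shadow but keeps the background depth, so only the last pixel is marked as foreground.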
W4 has also been extended to detect and track small groups of people, and to count the number of people in a group. This extension, called Hydra, is illustrated in Figure 3, which shows several frames from visible image sequences containing groups of 2-4 people walking together. The color-coding is a probability map showing to which person each foreground pixel belongs.
Figure 3. Segmenting a group of people into individuals using Hydra.
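A per-pixel probability map of this kind can be sketched as follows. The sketch assumes that each person in the group has already been localized by a vertical body axis (a single x coordinate), and that membership falls off as a Gaussian of horizontal distance to that axis. Both the model and the `sigma` value are illustrative assumptions, not a description of Hydra's actual formulation.

```python
import math


def membership_probabilities(pixel_x, person_axes_x, sigma=15.0):
    """Probability that a foreground pixel at column pixel_x belongs to each
    person, given the (assumed known) x position of each person's body axis."""
    weights = [math.exp(-0.5 * ((pixel_x - ax) / sigma) ** 2)
               for ax in person_axes_x]
    total = sum(weights)
    if total == 0.0:
        # Pixel is so far from every axis that all weights underflowed.
        return [1.0 / len(person_axes_x)] * len(person_axes_x)
    return [w / total for w in weights]


# Three people with hypothetical body axes at columns 100, 140 and 185.
for x in (100, 120, 180):
    print(x, [round(p, 3) for p in membership_probabilities(x, [100, 140, 185])])
```

Counting the people in the group then amounts to counting the axes that were found; the probabilities only decide how the shared silhouette is divided among them.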
W4 can control an active pan/tilt/zoom camera to conduct surveillance over a wide field of regard, and to zoom in on moving objects so they can be classified and tracked. Figure 4 illustrates this with several of the fields of view employed by the system as it first detects a person leaving a building, and then tracks that person over a very long distance by controlling the pan, tilt and zoom of the active camera.
Figure 4. Active tracking of a person over a long distance.
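As a rough illustration of the geometry behind active tracking, the sketch below computes the pan and tilt offsets that would bring a detected target to the image center. It assumes an ideal pinhole camera whose pan axis is vertical and whose focal length in pixels is known at the current zoom; the function and parameter names are hypothetical, and no real camera-control API is implied.

```python
import math


def pan_tilt_to_center(u, v, width, height, focal_px):
    """Pan and tilt offsets (degrees) that re-center a target seen at pixel
    (u, v). Positive pan turns right, positive tilt turns up; image rows are
    assumed to grow downward."""
    dx = u - width / 2.0
    dy = v - height / 2.0
    pan = math.degrees(math.atan2(dx, focal_px))
    # After panning, the target lies at distance hypot(focal_px, dx) along
    # the new optical axis direction, so tilt is measured against that.
    tilt = math.degrees(math.atan2(-dy, math.hypot(focal_px, dx)))
    return pan, tilt


# Target detected right of and above center in a 640x480 image,
# with an assumed focal length of 500 pixels.
print(pan_tilt_to_center(420, 180, 640, 480, 500.0))
```

A tracking loop would issue these offsets each frame and adjust zoom separately, for example to keep the target's height in pixels near a fixed value.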
Figure 5. Periodicity analysis of human movement. (a) Image. (b) Spectral analysis.
Shall We Dance: Shall We Dance was a real-time motion capture demo presented at SIGGRAPH 98 in collaboration with ATR's Media Integration Laboratory and M.I.T.'s Media Laboratory. Shall We Dance uses several of the component algorithms of W4 and incorporates them into a 3D body part tracking system. The operation of Shall We Dance is illustrated in Figure 6 (the graphical characters are reproduced with the permission of ATR's Media Integration Laboratory). Six cameras view a person moving freely. Each camera runs a version of the W4 system and tracks the positions of the person's head, hands, torso and feet, assisted by predictions of their positions provided by the controller. The 2D positions are integrated through stereo analysis and models of human movement (developed at M.I.T.) to estimate the body parts' locations and motions in 3D. The 3D motion estimates are then used to predict the locations of the body parts in the next set of frames acquired by the cameras. Graphics algorithms from ATR are used to animate graphical characters designed by ATR to illustrate the accuracy of the motion capture. The system operated at speeds of up to 25 frames per second; hundreds of people entered the demonstration area and controlled the movements of the graphical characters through their own motions.
Figure 6a. Shall We Dance: Architecture.
Figure 6b. Shall We Dance: Example.
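The estimate-then-predict loop described above can be sketched in a simplified form. The sketch assumes a single rectified stereo pair rather than six cameras, and it replaces the M.I.T. models of human movement with a plain constant-velocity predictor, so it only illustrates the data flow. All names and camera parameters are hypothetical.

```python
def triangulate_rectified(u_left, u_right, v, focal_px, baseline_m, cx, cy):
    """3D position (metres, left-camera frame) of a body part seen at column
    u_left in the left image and u_right in the right image of a rectified
    stereo pair, both on row v."""
    disparity = u_left - u_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the rig")
    z = focal_px * baseline_m / disparity
    x = (u_left - cx) * z / focal_px
    y = (v - cy) * z / focal_px
    return (x, y, z)


def predict_next(prev_xyz, curr_xyz):
    """Constant-velocity prediction of the next 3D position (a stand-in for
    a learned motion model)."""
    return tuple(2 * c - p for p, c in zip(prev_xyz, curr_xyz))


# A hand tracked in two consecutive frames (assumed camera parameters).
cam = dict(focal_px=500.0, baseline_m=0.2, cx=320.0, cy=240.0)
hand_t0 = triangulate_rectified(340, 320, 240, **cam)
hand_t1 = triangulate_rectified(352, 332, 236, **cam)
print(hand_t0, hand_t1, predict_next(hand_t0, hand_t1))
```

The predicted 3D position would then be projected back into each camera to narrow the 2D search for that body part in the next frame.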
Detecting Pedestrians: In conjunction with Daimler-Benz Research in Ulm, Germany, we have developed a template-based approach to detecting pedestrians in images. The system constructs hierarchical template models from thousands of instances of people in different poses. It compares these hierarchical template structures to structured edge maps of images using distance transform techniques such as chamfer matching and Hausdorff matching. It operates at speeds of about two frames per second on a PC and can be applied to either visible or infrared imagery (Figure 7).
Figure 7. Pedestrian detection.
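Chamfer matching with a distance transform can be illustrated with a minimal sketch. It matches a single flat template by exhaustive search over translations and uses a brute-force Euclidean distance transform, so it omits both the template hierarchy and the efficiency of the system described above. The function names and the toy edge map are made up for illustration.

```python
import math


def distance_transform(edges):
    """Distance from every pixel to the nearest edge pixel (brute force)."""
    edge_points = [(r, c) for r, row in enumerate(edges)
                   for c, is_edge in enumerate(row) if is_edge]
    if not edge_points:
        raise ValueError("edge map contains no edge pixels")
    return [[min(math.hypot(r - er, c - ec) for er, ec in edge_points)
             for c in range(len(edges[0]))]
            for r in range(len(edges))]


def chamfer_score(dt, template_points, row_off, col_off):
    """Mean distance from the shifted template points to the nearest edge."""
    total = sum(dt[r + row_off][c + col_off] for r, c in template_points)
    return total / len(template_points)


def best_match(edges, template_points):
    """Return (score, row_off, col_off) of the lowest-scoring translation."""
    dt = distance_transform(edges)
    max_r = max(r for r, _ in template_points)
    max_c = max(c for _, c in template_points)
    best = None
    for ro in range(len(edges) - max_r):
        for co in range(len(edges[0]) - max_c):
            score = chamfer_score(dt, template_points, ro, co)
            if best is None or score < best[0]:
                best = (score, ro, co)
    return best


# Toy 6x6 edge map containing a short vertical stroke, and a matching template.
edge_map = [[0] * 6 for _ in range(6)]
for r in (2, 3, 4):
    edge_map[r][3] = 1
print(best_match(edge_map, [(0, 0), (1, 0), (2, 0)]))
```

Because the distance transform is computed once per image, each additional template or translation costs only a lookup per template point, which is what makes matching against a large template hierarchy practical.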
Appearance Models for Human Action: We have been studying how
objects with complex time-varying geometries, such as moving people, can
be accurately tracked based on learned models of their typical motions.
We have been developing an appearance-based approach, in which compact
models of evolving parametric flow fields observed from generic viewpoints
are learned from experience. These models are then used to track subsequent
instances of these movements, or to recognize which of a variety of movements
is being observed (based on a goodness-of-tracking criterion). In our most
recent work we show how these learned models of movement can be used for
tracking even when the camera observing the movement is itself moving.
This involves decoupling the motion due to the rigid movement of the camera
from the learned movement of the object.
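The decoupling step can be illustrated with a deliberately simplified sketch: the flow measured on the moving person is treated as the sum of a camera-induced component and the person's own motion, and the camera component is estimated from flow vectors on the static background. Here the camera motion is reduced to a single image translation estimated by a median, which is an assumption made for brevity; a rigid camera motion generally induces a richer flow field, and the learned parametric models are not reproduced.

```python
from statistics import median


def remove_camera_motion(object_flow, background_flow):
    """Subtract an estimate of camera-induced flow from flow vectors measured
    on the object.

    Both arguments are lists of (dx, dy) flow vectors. The camera flow is
    approximated by the per-component median of the background vectors,
    which tolerates a few outliers.
    """
    cam_dx = median(dx for dx, _ in background_flow)
    cam_dy = median(dy for _, dy in background_flow)
    return [(dx - cam_dx, dy - cam_dy) for dx, dy in object_flow]


# Background vectors show the camera panning by about 2 pixels per frame;
# one outlier (9, 9) is ignored by the median.
background = [(2.0, 0.0), (2.0, 0.0), (2.0, 0.0), (9.0, 9.0)]
person = [(3.0, 1.0), (2.0, 0.0)]
print(remove_camera_motion(person, background))
```

The residual vectors are what a learned model of the movement would then be fitted to, as if the camera had been stationary.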