Prof. Aloimonos' research group continues to work on the computations
underlying the development of descriptions of visual space from images.
These are descriptions of space (objects) and space-time (actions) extracted
from multiple views of the world (such as video), and their extraction
involves geometry and physics. This task is known by various names and
is of interest to many fields, ranging from technology to biology.
After studying the geometry of multiple views from point and line correspondences and introducing the trilinear constraints, we realized that implementing multiple-view geometry in actual robotic vision systems cannot be based on feature correspondences. At that time, our research was influenced by the philosophy of direct perception advocated by Gibson. Since perception is immediate, properties of the scene in view must be directly encoded in patterns or aggregate structures of image measurements. Through the application of computational principles we have been searching for these structures and their associated representations. The technical problems we have investigated are related to basic processes underlying the perception of 3D motion, shape, and their relationships.
Results from this research have recently given rise to new mathematical constraints governing the geometry of visual space and contributing to the understanding of the non-Euclidean nature of visual space. This has led to a new framework with far-reaching consequences and a multitude of applications in technology and biology. Besides applications to robotics and navigation, these include computational video, new camera technologies, distributed sensor networks, Web-related technologies (video indexing), and a number of empirically testable hypotheses about the structure and function of the brain. For example, this research makes possible, for the first time, the automatic development of scene and motion representations that allow photorealistic manipulation of video (deleting objects or embedding virtual objects in a physical scene, changing viewpoint, etc.) as well as the development of 3D video.
In the following subsections we describe in more detail the basic problem
we have been studying, the conventional wisdom, and our approach, and we
show some results. Finally, we discuss two new application areas we have
initiated, new camera technology and computational video (analysis and
synthesis of visual data).
Figure 1. The Ouchi pattern. Slight, rapid movements of this pattern produce different movements of the inset and the background.
Since at the beginning of the process correspondence or flow is not available, we cannot utilize the epipolar constraint that has been traditionally used. Instead, we utilize the positive depth constraint and geometric constraints arising from understanding the distortion function, which depends on the errors in the 3D transformation and image measurements. Understanding this function provides the insight that human visual space is a non-Euclidean space; further, it explains a number of illusions and predicts others. At the same time, this understanding gives rise to algorithms for 3D motion estimation, motion segmentation and scene reconstruction from video sequences, producing results not obtainable by correspondence-based approaches. Theoretical results also demonstrate that existing approaches are special cases of our approach; that is, our approach is provably better than the state-of-the-art, correspondence-based schemes.
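The positive depth constraint can be illustrated with a small numerical sketch. It is not the group's algorithm: it assumes the standard rigid motion field equations for a pinhole camera with focal length `f`, synthetic normal flow (the flow component along a known gradient direction `n`), and hypothetical function names. For a candidate 3D motion, the rotational part is subtracted from the normal flow; what remains should be the translational part scaled by inverse depth, so a sign disagreement means the candidate implies negative depth at that point.

```python
import numpy as np

def flow_components(x, y, t, w, f=1.0):
    """Translational (before scaling by 1/Z) and rotational image flow
    for camera translation t = (U, V, W) and rotation w = (alpha, beta, gamma)."""
    U, V, W = t
    alpha, beta, gamma = w
    trans = np.stack([x * W - f * U, y * W - f * V], axis=1)
    rot = np.stack([alpha * x * y / f - beta * (x**2 / f + f) + gamma * y,
                    alpha * (y**2 / f + f) - beta * x * y / f - gamma * x], axis=1)
    return trans, rot

def negative_depth_count(x, y, n, normal_flow, t, w, f=1.0, eps=1e-9):
    """Count points where the candidate motion (t, w) implies negative depth."""
    trans, rot = flow_components(x, y, t, w, f)
    trans_n = np.sum(trans * n, axis=1)      # translational part along n
    rot_n = np.sum(rot * n, axis=1)          # rotational part along n
    derotated = normal_flow - rot_n          # should equal trans_n / Z
    usable = np.abs(trans_n) > eps           # skip points with no depth information
    return int(np.sum(derotated[usable] * trans_n[usable] < 0))

# Synthetic scene (all values below are illustrative assumptions).
rng = np.random.default_rng(0)
N = 500
x = rng.uniform(-0.5, 0.5, N)
y = rng.uniform(-0.5, 0.5, N)
depth = rng.uniform(2.0, 10.0, N)            # all points in front of the camera
angles = rng.uniform(0.0, 2.0 * np.pi, N)
n = np.stack([np.cos(angles), np.sin(angles)], axis=1)

t_true, w_true = (0.1, 0.0, 1.0), (0.01, -0.02, 0.005)
trans, rot = flow_components(x, y, t_true, w_true)
normal_flow = np.sum((trans / depth[:, None] + rot) * n, axis=1)

count_true = negative_depth_count(x, y, n, normal_flow, t_true, w_true)
count_flipped = negative_depth_count(x, y, n, normal_flow, (-0.1, 0.0, -1.0), w_true)
```

With the true motion no point yields negative depth, while reversing the translation makes the depth negative wherever it can be measured. A search over candidate motions that minimizes this count needs neither correspondence nor full optical flow.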
3D shape: Perhaps the most defining test of how well 3D motion is recovered is the estimation of shape. The reason for this is that many tasks can be achieved with slightly incorrect 3D motion, but an error of a few pixels (for example, in the location of the translation direction) is enough to create significant errors in the estimated shape. How well shape models can be estimated depends on a number of factors besides accurate 3D motion estimates, such as the number of frames utilized (amount of data) and the actual representation of the model. Sequence F (reconst.mpg) shows a model for the scene recovered from a few frames and without any elaborate data structures; the scene is simply a set of 3D points. More frames and a bit more sophistication in representing the scene (triangles) result in much better models. Sequence G (sct-input.mpg) shows an original sequence, and Sequence H (sct-fly.mpg) shows the obtained reconstruction. Sequences I (yiann-input.mpg) and J (yiann-fly.mpg) show another example. No post-processing was performed here but, clearly, graphics post-processing would further improve the results. Finally, consider a reconstruction from multiple videos K, L and M (pooh2-input.mpg, pooh3-input.mpg, and pooh4-input.mpg). Sequence N (pooh4.mpg) shows that the recovery is almost perfect. Again, no post-processing was performed.
Motion segmentation: This is the hardest problem in dynamic scene
analysis; our approach was conceived with this problem in mind. Sequence
O is an original, well-known sequence (mobile-input.mpg). An elaborate
optimization scheme with feedback starts from the normal flow values and
builds representations of camera motion and localizations of motion and
background boundaries. The principle of depth variability plays a central
role. Sequence P (mobile-depth.mpg) shows recovered inverse depth for a
part of the sequence with the gray-level value showing the amount of depth
(white denoting large positive values, i.e., close to the camera, and black
denoting negative values). Notice the high variability of depth at the
locations of independent movement. Also, notice that, at times, the train
motion is consistent with the camera motion (making independent motion
detection difficult) so no high variability of depth is obtained, but the
depth comes out negative, marking independent motion. Notice in Sequence
Q (mobile-binerr.mpg) the depth variability measurements for the same part
of the sequence (white denoting large values). The procedure searches for
the camera motion and the motion boundaries. Depth variability is the basis
for the solution.
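The role of depth variability can be sketched in a few lines. This is a simplified illustration, not the optimization scheme with feedback described above: it assumes that an inverse-depth map has already been computed under a candidate camera motion, and it flags image blocks whose estimates either vary strongly or turn negative. The function name, block size, and threshold are hypothetical.

```python
import numpy as np

def flag_independent_motion(inv_depth, block=8, var_thresh=0.01):
    """Flag blocks whose inverse-depth estimates vary strongly or are negative."""
    h, w = inv_depth.shape
    hb, wb = h // block, w // block
    # Cut the map into non-overlapping block x block tiles.
    tiles = inv_depth[:hb * block, :wb * block].reshape(hb, block, wb, block)
    variability = tiles.var(axis=(1, 3))
    has_negative = tiles.min(axis=(1, 3)) < 0
    return (variability > var_thresh) | has_negative, variability

# Toy inverse-depth map (illustrative values only).
rng = np.random.default_rng(1)
inv_depth = np.full((32, 32), 0.2)                     # smooth static background
inv_depth[8:16, 8:16] += rng.normal(0.0, 0.5, (8, 8))  # inconsistent estimates
inv_depth[24:32, 0:8] = -0.1                           # consistent but negative

mask, variability = flag_independent_motion(inv_depth)
```

The two flagged blocks correspond to the two cases described above: an object whose motion conflicts with the camera motion produces erratic depth, and an object whose motion happens to be consistent with it is still exposed by the sign of the depth.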
Take, for example, the problem of recovering descriptions of space-time from video information. We give here the principles underlying the construction of new eyes for this task. As is well known, to solve this problem we need to accurately recover 3D motion and image motion.
Our point of departure is a set of geometric results regarding inherent ambiguities in estimating 3D motion from video sequences. Denote the five unknown motion parameters of a common moving video camera by (x_0, y_0) (direction of translation) and (α, β, γ) (rotation). Assume that the scene in view has depth values uniformly distributed in the camera's coordinate system between any two values. Then, no matter how 3D motion is estimated from the image motion, the expected solution will contain errors (x_{0_e}, y_{0_e}) and (α_e, β_e, γ_e) that satisfy two constraints:
(a) The orthogonality constraint: x_{0_e}/y_{0_e} = β_e/α_e
(b) The line constraint: x_0/y_0 = x_{0_e}/y_{0_e}.
The solution thus contains errors that are mingled and create a confusion between rotation and translation that cannot be cleared up. The errors may be small or large, but their expected value will always satisfy the above conditions. Intuitively, the surface representing an error function whose minimum provides the solution has a "bad" topography, making it hard to localize the minimum.
Let us step back for a moment and ask our original question differently. We are interested in space and action descriptions that can be extracted from visual data. This requires that there exist an eye or device imaging the scene. All along we took for granted that our basic device was a camera-type eye, that is, a common video camera whose basic principle is the pinhole model, but there was no particular reason to make this assumption.
An examination of the designs of eyes in the biological world reveals a very wide variety. The mechanisms that organisms have evolved for collecting photons and forming images that they use to perform various actions in their environment depend on a number of factors. Chief among these are the individual organism's computational capacity and the tasks that the organism performs. Michael Land, a prominent British zoologist and the world's foremost expert on the science of eyes, has provided a landscape of eye evolution. Considering evolution as a mountain range, with the lower hills representing the earlier steps in the evolutionary ladder, and the highest peaks representing the later stages of evolution, the situation is pictured in Figure 2. It has been estimated that eyes have evolved no fewer than forty times, independently, in diverse parts of the animal kingdom. In some cases, these eyes use radically different principles and the "eye landscape" of the figure shows nine basic types of eyes. Eyes low in the hierarchy (such as the nautilus' pinhole eye or the marine snail eye) make very crude images of the world, but at higher levels of evolution we find different types of compound eyes and camera-type eyes (like the ones we have) such as the corneal eyes of land vertebrates and the eyes of fish.
Figure 2. "Landscape" of eye evolution. (From R. Dawkins, Climbing Mount Improbable, Norton, New York, 1996.)
Inspiration for our research on this topic has come from the compound eyes of insects, which are particularly intriguing, especially in view of the fact that insects compute 3D motion excellently. Their lives depend on their ability to fly with precision through cluttered environments, avoid obstacles and land on demand on surfaces oriented in various ways. In addition, they perform these tasks with minimal memory and computational capacity, much less than that of an average personal computer of today. Could it be possible that much of their success emanates from the special construction of their eyes?
Compound eyes exist in several varieties, and can be classified in two categories, apposition and superposition. An apposition eye (Figure 3a) is built as a dense cluster of long, straight tubes radiating out in all directions as from the roof of a dome. Each tube is like a gunsight which sees only a small part of the world in its own direct line of fire. Thus, rays coming from other parts of the dome are prevented by the walls of the tube and the backing of the dome from hitting the back of the tube where the photocells are. In practice, each of the little tube eyes, called ommatidia, is a bit more than a tube. It has its own private lens and its own private retina of about half a dozen photocells. The ommatidium works like a long, poor-quality camera eye. Superposition compound eyes, on the other hand, do not trap rays in tubes. They allow rays that pass through the lens of one ommatidium to be picked up by a neighboring ommatidium's photocells. There is an empty, transparent zone shared by all ommatidia. The lenses of all ommatidia conspire to form a single image on a shared retina which is put together from the light-sensitive cells of all the ommatidia. One kind of superposition eye is the neural superposition (or wired-up superposition) eye shown in Figure 3b. In this case, the ommatidia are isolated tubes just as in the case of the apposition eye, but they achieve a superposition-like effect by ingenious wiring of nerve cells behind the ommatidia.
Figure 3. Compound eyes of apposition (a) and superposition (b) types. (From R. Dawkins, op. cit.)
Why is it that biological systems that need to fly (insects, birds) have panoramic vision implemented either as a compound eye or by placing camera-type eyes on opposite sides of the head? This is a fascinating question that has remained open since the time of the pioneer investigator Sigmund Exner at the beginning of this century. The obvious answer is that flying systems must perceive the whole space around them; thus panoramic vision emerges. There is, however, a deeper mathematical reason that has only recently been understood, and it has to do with the ability of a system to estimate 3D motion when it analyzes panoramic images. Put simply, a spherical eye (360 degree field of view) is superior to a planar eye (restricted field) with regard to 3D motion estimation.
Recall that estimating 3D motion from "planar" image sequences introduces errors satisfying the "orthogonality" and "line" constraints. The main reason is the restricted field of view. If, however, the field goes to 360 degrees, the topography of the error surface drastically changes, with the minimum clearly standing out! Thus there is no confusion between the motion parameters. It is no wonder, then, that flying organisms possess panoramic vision!
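A small numerical sketch makes the effect of the field of view concrete. It rests on simplifying assumptions that are not part of the analysis above: all scene points lie at unit distance, image motion is measured on a unit sphere of viewing directions r, and the flow for translation t and rotation w is taken as -(t - (t.r) r) - w x r. The sketch asks how well a purely rotational flow field can imitate the flow of a sideways translation, first over a narrow cap of directions and then over the whole sphere. All function names are hypothetical.

```python
import numpy as np

def sample_cap(half_angle_deg, n=2000):
    """Spiral lattice of unit directions within half_angle_deg of the +z axis."""
    z_min = np.cos(np.radians(half_angle_deg))
    k = np.arange(n) + 0.5
    z = 1.0 - (1.0 - z_min) * k / n
    phi = np.pi * (1.0 + np.sqrt(5.0)) * k
    s = np.sqrt(1.0 - z**2)
    return np.stack([s * np.cos(phi), s * np.sin(phi), z], axis=1)

def rotation_fit_residual(half_angle_deg, t=(1.0, 0.0, 0.0)):
    """Relative residual after fitting a pure rotation to translational flow.

    Values near 0 mean translation and rotation are confusable;
    values near 1 mean rotation cannot explain the translational flow.
    """
    t = np.asarray(t, dtype=float)
    r = sample_cap(half_angle_deg)
    flow_t = -(t - (r @ t)[:, None] * r)                  # translation, unit depth
    # Rotational flow -w x r is linear in w; build one column per axis.
    columns = [(-np.cross(e, r)).ravel() for e in np.eye(3)]
    A = np.stack(columns, axis=1)
    w_fit, *_ = np.linalg.lstsq(A, flow_t.ravel(), rcond=None)
    residual = flow_t.ravel() - A @ w_fit
    return np.linalg.norm(residual) / np.linalg.norm(flow_t)

residual_narrow = rotation_fit_residual(10.0)    # 20 degree field of view
residual_full = rotation_fit_residual(180.0)     # full sphere
```

Over the narrow cap a rotation reproduces almost all of the translational flow, so the two motions are nearly indistinguishable; over the full sphere the best-fitting rotation explains almost none of it.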
Since spherical eyes such as those of insects (or, more generally, panoramic vision systems) provide a much better capability for 3D motion estimation, and since our problem of building accurate space and action descriptions depends on accurate 3D motion computation, it makes sense to reconsider what the eye for our problem should be. There are a few ways to create panoramic vision cameras, and the recent literature is rich in alternative approaches, but the insect eye is not just panoramic. It has an additional property whose mathematics is still largely unknown. It is built from a large collection of ommatidia that for our purpose can be considered as individual cameras. This construction offers additional, unexpected benefits from a computational viewpoint, though we do not know exactly what the benefits for the insect are. One such benefit arises from the fact that the large number of cameras constitutes a large collection of stereo systems! Using simple techniques, these stereo systems are capable of providing a large number of the depth discontinuities in the scene. With the depth discontinuities available, we can estimate the motion field in each of the cameras very well. Thus, if we implement a spherical eye by putting cameras on the surface of a sphere (Figure 4), we can achieve a new eye that has the following two desirable properties:
(a) From an image sequence, it can best estimate 3D motion.
(b) From an image sequence, it can best estimate the image motion field.
An eye like this is what is needed for our problem.
Figure 4. An eye composed of cameras on the surface of a sphere.
The preceding discussion demonstrates the power of multiple-view vision. Using many conventional video cameras and arranging them in specific, purposive configurations provides new eyes that are much more powerful. Their power is not due only to redundancy. It is due to the rich relationships between different projections of the world. As shown above, by treating the sets of video streams collected in a particular way as a new image, mathematical and statistical properties are obtained that were unknown before. It is expected that the study of biological eyes will reveal formidable properties. Relating such properties to tasks that systems perform will reveal a new landscape of mathematical problems related to shape, form, motion and action. To give an example from current problems in the field related to surveillance and monitoring, the problem of motion segmentation (finding independently moving objects from a moving sensor) becomes much easier if one uses a small array of video cameras as in Figure 5. The reason for this is that image motion can be better estimated and background/object motion can be separated more easily.
Figure 5. An eye composed of a planar array of cameras.
Such a research program is supported by current technology. In the early 1980s, we could hardly digitize a video. In the late 1980s, we needed sophisticated, specialized, expensive processors/systems. Now, we can put video directly on PCs! Moreover, video cameras are quite inexpensive and are becoming even cheaper.
Eyes like the ones just described have provably optimal properties regarding 3D motion estimation/segmentation, but may be impractical to use unless they are miniaturized. Luckily, from a mathematical viewpoint, it makes no difference whether the cameras are looking inward or outward. Imaging a moving object at the center of a sphere creates the same geometry! The resulting configuration (Figure 6) is a possible configuration for a new eye that recovers accurate shape/action descriptions. Relating the multiple video streams (without correspondence of points) gives rise to robust algorithms for shape/action description recovery.
Figure 6. Inward-pointing cameras on the surface of a sphere have the same geometry as the outward-pointing cameras of a spherical eye.
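One way to read the inward/outward equivalence is through relative motion: images of a rigidly moving object seen by fixed inward-pointing cameras are identical to images of a static object seen by the same camera rig after it has undergone the inverse motion. The sketch below checks this numerically for ideal pinhole cameras with unit focal length. The rig layout, the object, and all function names are illustrative assumptions and do not describe the actual laboratory setup.

```python
import numpy as np

def rodrigues(axis, angle):
    """Rotation matrix for a rotation of `angle` radians about `axis`."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def look_at(center, target):
    """Camera orientation (columns = camera axes) with optical axis toward target."""
    z = target - center
    z = z / np.linalg.norm(z)
    helper = np.array([0.0, 0.0, 1.0]) if abs(z[2]) < 0.9 else np.array([1.0, 0.0, 0.0])
    x = np.cross(helper, z)
    x = x / np.linalg.norm(x)
    y = np.cross(z, x)
    return np.stack([x, y, z], axis=1)

def project(points, R_cam, c):
    """Pinhole projection (unit focal length) of world points into a camera."""
    p = (points - c) @ R_cam            # each row is R_cam^T (P - c)
    return p[:, :2] / p[:, 2:3]

rng = np.random.default_rng(2)
points = rng.uniform(-0.3, 0.3, (50, 3))          # small object near the sphere center
directions = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
                       [0.0, 0.0, 1.0], [-1.0, -1.0, 1.0]])
centers = 2.0 * directions / np.linalg.norm(directions, axis=1, keepdims=True)

R = rodrigues([0.0, 1.0, 0.0], 0.1)               # object rotation
T = np.array([0.05, 0.0, 0.0])                    # object translation
moved = points @ R.T + T

max_diff = 0.0
for c in centers:
    R_cam = look_at(c, np.zeros(3))               # inward-pointing camera
    view_object_moves = project(moved, R_cam, c)
    # Same images from a static object and a rig moved by the inverse motion.
    view_rig_moves = project(points, R.T @ R_cam, R.T @ (c - T))
    max_diff = max(max_diff, float(np.abs(view_object_moves - view_rig_moves).max()))
```

Because the two sets of images agree to numerical precision, recovering the motion of an object at the center of the rig is the same estimation problem as recovering the rig's own rigid motion.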
Our Institute recently obtained a grant from the Keck Foundation to establish the Keck Laboratory for the Study of Visual Movement. The Laboratory consists of a large number of cameras (currently sixty-four) and a network of PCs with the capability for simultaneous recording and synchronization among all sensors.
Using the Keck Lab we are implementing our ideas about the problems
described above. At the same time, we are examining different configurations
of eyes best suited for specific applications. One of our interests is
3D video. This amounts to acquiring visual data in a way that makes it
possible to visualize it from any viewpoint. It is of course impossible
to gather images from all viewpoints. Only through the recovery of some
particular aspects of the 3D structure and motion does it become possible
to visualize from any viewpoint. This is a problem falling in the general
category of multiple-view geometry and statistics.