Take, for example, the problem of recovering descriptions of space-time from video information. We give here the principles underlying the construction of new eyes for this task. As is well known, to solve this problem we need to accurately recover 3D motion and image motion.
Our point of departure is a set of geometric results regarding inherent ambiguities in estimating 3D motion from video sequences. Denote the five unknown motion parameters of an ordinary moving video camera as t (the direction of translation, two parameters) and ω (the rotation, three parameters). Assume that the scene in view has depth values uniformly distributed, in the camera's coordinate system, between any two values. Then, no matter how 3D motion is estimated from the image motion, the expected solution will contain errors that satisfy two constraints:
(a) The orthogonality constraint:
(b) The line constraint:
The result states that the estimated solution contains errors that mingle rotation and translation, creating a confusion between the two that cannot be cleared up. The errors may be small or large, but their expected values will always satisfy the above conditions. Intuitively, the surface representing the error function whose minimum provides the solution has a "bad" topography, making it hard to localize the minimum.
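For concreteness, it may help to recall the classical pinhole motion-field equations (the notation here is ours, since the original symbols did not survive; this is the standard Longuet-Higgins and Prazdny formulation). For a camera with focal length f translating with velocity (U, V, W) and rotating with angular velocity (α, β, γ), a scene point at depth Z projecting to image point (x, y) induces the image motion

```latex
u = \frac{-fU + xW}{Z} + \frac{\alpha xy}{f} - \beta\left(f + \frac{x^{2}}{f}\right) + \gamma y,
\qquad
v = \frac{-fV + yW}{Z} + \alpha\left(f + \frac{y^{2}}{f}\right) - \frac{\beta xy}{f} - \gamma x .
```

Since the unknown depth Z multiplies only the translational terms, the magnitude of (U, V, W) cannot be separated from depth; only the direction of translation (two parameters) and the rotation (three parameters) are recoverable, which is why five motion parameters appear above. It is precisely the coupling between the translational and rotational terms in these equations that produces the rotation-translation confusion.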
Let us step back for a moment and ask our original question differently. We are interested in space and action descriptions that can be extracted from visual data. This requires that there exists an eye or device imaging the scene. All along we took it for granted that our basic device was a camera-type eye, that is, a common video camera whose basic principle is the pinhole model, but there was no particular reason to make this assumption.
An examination of the design of eyes in the biological world reveals a very wide variety. The mechanisms organisms have evolved for collecting photons and forming images, which they use to perform various actions in their environment, depend on a number of factors; chief among these are the organism's computational capacity and the tasks it performs. Michael Land, a prominent British zoologist and the world's foremost expert on the science of eyes, has provided a landscape of eye evolution. If we picture evolution as a mountain, with the lower hills representing the earlier steps on the evolutionary ladder and the highest peaks the later stages, the situation is as shown in this figure. It has been estimated that eyes have evolved no fewer than forty times, independently, in diverse parts of the animal kingdom. In some cases these eyes use radically different principles, and the "eye landscape" of the previous figure shows nine basic types. Eyes low in the hierarchy (such as the nautilus's pinhole eye or the marine snail eye) form very crude images of the world, but at higher levels of evolution we find various types of compound eyes and camera-type eyes (like our own), such as the corneal eyes of land vertebrates and the eyes of fish.
Inspiration for our research on this topic has come from the compound eyes of insects, which are particularly intriguing given how well insects compute 3D motion. Their lives depend on their ability to fly with precision through cluttered environments, avoid obstacles and land on demand on surfaces oriented in various ways. Moreover, they perform these tasks with minimal memory and computational capacity, much less than that of an average personal computer of today. Could it be that much of their success emanates from the special construction of their eyes?
Compound eyes exist in several varieties and can be classified into two categories, apposition and superposition eyes. The apposition eye is built as a dense cluster of long, straight tubes radiating out in all directions, as from the roof of a dome. Each tube is like a gun sight which sees only the small part of the world in its own direct line of fire: rays coming from other parts of the world are prevented by the walls of the tube and the backing of the dome from hitting the back of the tube, where the photocells are (Figure). In practice, each of the little tube eyes, called ommatidia, is a bit more than a tube. It has its own private lens and its own private retina of about half a dozen photocells; each ommatidium works like a long, poor-quality camera eye. Superposition compound eyes, on the other hand, do not trap rays in tubes. They allow rays that pass through the lens of one ommatidium to be picked up by a neighboring ommatidium's photocells; there is an empty, transparent zone shared by all ommatidia. The lenses of all ommatidia conspire to form a single image on a shared retina, which is put together from the light-sensitive cells of all the ommatidia. One kind of superposition eye is the neural superposition (or wired-up superposition) eye shown in this figure. In this case the ommatidia are isolated tubes just as in the apposition eye, but they achieve a superposition-like effect through ingenious wiring of the nerve cells behind the ommatidia.
Why is it that biological systems that need to fly (insects, birds) have panoramic vision, implemented either as a compound eye or by placing camera-type eyes on opposite sides of the head? This is a fascinating question that has remained open since the time of the pioneering investigator Sigmund Exner at the beginning of the twentieth century. The obvious answer is, of course, that flying systems should perceive the whole space around them; thus panoramic vision emerged. There is, however, a deeper mathematical reason that has only recently been understood, and it has to do with the ability of a system to estimate 3D motion when it analyzes panoramic images. Put simply, a spherical eye (360-degree field of view) is superior to a planar eye (restricted field of view) with regard to 3D motion estimation.
Recall that estimating 3D motion from "planar" image sequences introduces errors satisfying the "orthogonality" and "line" constraints, and that the main reason is the restricted field of view. If, however, the field of view extends to 360 degrees, the topography of the error surface changes drastically, with the minimum clearly standing out! There is then no confusion between the motion parameters. It is no wonder that flying organisms possess panoramic vision!
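This claim can be checked numerically. The following sketch (our illustration, not from the original text; all parameter values are made up) simulates a rigidly moving spherical retina. For a candidate translation direction, depth is eliminated via the epipolar-type constraint (ṙ + ω̂ × r) · (t̂ × r) = 0, which is linear in the rotation ω̂, and the least-squares residual over all image points serves as the error function. A narrow field of view yields a shallow error landscape around the true translation, while the full spherical field makes the minimum stand out.

```python
import numpy as np

rng = np.random.default_rng(0)

def directions(n, half_angle):
    """Sample n unit viewing directions in a cone about +z
    (half_angle = pi covers the whole sphere)."""
    cos_t = rng.uniform(np.cos(half_angle), 1.0, n)
    phi = rng.uniform(0.0, 2.0 * np.pi, n)
    sin_t = np.sqrt(1.0 - cos_t**2)
    return np.column_stack([sin_t * np.cos(phi), sin_t * np.sin(phi), cos_t])

def motion_field(rs, depths, t, w):
    """Motion field on a spherical retina: r_dot = -(t - (t.r) r)/R - w x r."""
    radial = rs * (rs @ t)[:, None]
    return -(t - radial) / depths[:, None] - np.cross(w, rs)

def error(t_hat, rs, flows):
    """RMS residual of (r_dot + w x r).(t_hat x r) = 0 with the
    best-fitting rotation w; note that depth never enters."""
    a = np.einsum('ij,ij->i', flows, np.cross(t_hat, rs))
    B = t_hat - rs * (rs @ t_hat)[:, None]  # coefficient of w in the constraint
    w_hat, *_ = np.linalg.lstsq(B, -a, rcond=None)
    return np.sqrt(np.mean((a + B @ w_hat) ** 2))

n = 2000
t_true = np.array([1.0, 0.0, 0.0])           # lateral translation direction
w_true = np.array([0.01, -0.02, 0.015])      # small rotation (rad/frame)
depths = rng.uniform(2.0, 10.0, n)           # depths uniform between two values
theta = np.radians(10.0)
t_off = np.array([np.cos(theta), 0.0, np.sin(theta)])  # 10-degrees-off hypothesis

results = {}
for name, half in [('narrow', np.radians(15.0)), ('spherical', np.pi)]:
    rs = directions(n, half)
    flows = motion_field(rs, depths, t_true, w_true)
    scale = np.sqrt(np.mean(flows**2))       # normalize by overall flow magnitude
    results[name] = (error(t_true, rs, flows) / scale,
                     error(t_off, rs, flows) / scale)
    print(name, results[name])
```

In runs of this sketch the residual at the true translation is numerically zero for both retinas, while at the 10-degrees-off hypothesis the spherical field rejects the wrong translation far more decisively than the 15-degree field, because a compensating rotation can nearly flatten the narrow-field residual but not the spherical one: the "flat valley" versus "sharp minimum" topographies described above.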
Since spherical eyes such as those of insects, or panoramic vision in general, provide a much better capability for 3D motion estimation, and since our problem of building accurate space and action descriptions depends on accurate 3D motion computation, it makes sense to reconsider what the eye for our problem should be. There are a few ways to create panoramic vision cameras, and the recent literature is rich in alternative approaches, but the insect eye is not just panoramic. It has an additional property whose mathematics is still largely unknown: it is built from a large collection of ommatidia that, for our purposes, can be considered individual cameras. This construction offers additional, unexpected benefits from a computational viewpoint, though we do not know exactly what the benefits are for the insect. One such benefit arises from the fact that the large number of cameras constitutes a large collection of stereo systems! Using simple techniques, these stereo systems can provide a large number of the depth discontinuities in the scene, and with the depth discontinuities available we can estimate very well the motion field in each of the cameras. Thus, if we implement a spherical eye by putting cameras on the surface of a sphere, we obtain a new eye with the following two desirable properties (Figure):
(a) From an image sequence, it can best estimate 3D motion!
(b) From an image sequence, it can best estimate the image motion field!
An eye like this is what is needed for our problem!
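To make the stereo point concrete: any two neighboring cameras in such an assembly form a stereo pair, and for a rectified pair the depth of a matched point follows from its disparity by simple triangulation. The sketch below uses the textbook relation Z = fB/d; the parameter values are illustrative choices of ours, not specifications of the actual device.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth of a point seen by a rectified stereo pair: Z = f * B / d,
    where f is the focal length in pixels, B the baseline in meters,
    and d the disparity in pixels. Abrupt jumps in Z between adjacent
    matches are exactly the depth discontinuities mentioned above."""
    if disparity_px <= 0:
        raise ValueError("zero or negative disparity: point at infinity or bad match")
    return focal_px * baseline_m / disparity_px

# Two cameras 10 cm apart with an 800-pixel focal length: a feature that
# shifts 20 pixels between the two views lies 4 meters away.
z = depth_from_disparity(20.0, 800.0, 0.10)
print(z)  # 4.0
```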
The preceding discussion demonstrates the power of multiple view vision. Using many conventional video cameras and arranging them in specific, purposive configurations provides new eyes that are much more powerful. Their power is not due only to redundancy; it is due to the rich relationships between the different projections of the world. As shown above, by treating a set of video streams, collected in a particular way, as a single new image, we obtain mathematical and statistical properties that were unknown before. It is expected that the study of biological eyes will reveal further formidable properties, and relating such properties to the tasks that systems perform will reveal a new landscape of mathematical problems related to shape, form, motion and action. To give an example from current problems in the field related to surveillance and monitoring, the problem of motion segmentation (finding independently moving objects from a moving sensor) becomes much easier if one uses a small array of video cameras, as in this figure. The reason is that image motion can be better estimated, and background and object motion can be separated more easily.
Such a research program is supported by current technology. In the early 1980s, we could hardly digitize a video. In the late 1980s, we needed sophisticated, specialized, expensive processors/systems. Now, we can just put video directly on PCs! Not to mention that video cameras are quite inexpensive (and are becoming even cheaper!).
Eyes like the ones just described have provably optimal properties regarding 3D motion estimation and segmentation, but they may be impractical to use (unless they are miniaturized). Luckily, from a mathematical viewpoint it makes no difference whether the cameras look outward or inward: imaging a moving object at the center of the sphere creates the same geometry! The resulting configuration (Figure) is a possible configuration for a new eye that recovers accurate shape and action descriptions. Relating the multiple video streams (without point correspondences) gives rise to robust algorithms for recovering shape and action descriptions.
At the Institute of Advanced Computer Studies at the University of Maryland, we obtained a gift from the Keck Foundation to establish the Keck Laboratory for the study of visual movement. The Laboratory consists of a large number of cameras (currently sixty-four) and a network of PCs with the capability for simultaneous recording and synchronization among all sensors. See this video for some views of the current set-up in the Keck Lab. The cameras are in sixteen clusters of four.
Using the Keck Lab we are implementing our ideas on the problems described above. At the same time, we are examining the configurations of multiple-view eyes best suited to specific applications. One of our interests is 3D video, which amounts to acquiring visual data in a way that makes it possible to visualize the scene from any viewpoint. It is of course impossible to gather images from all viewpoints; only through the recovery of particular aspects of the 3D structure and motion does it become possible to visualize from any viewpoint, and this is a problem falling in the general category of multiple view geometry and statistics.