Yiannis Aloimonos


Prof. Aloimonos' research group continues to work on the computations underlying the development of descriptions of visual space from images. These are descriptions of space (objects) and space-time (actions) extracted from multiple views of the world (such as video), and their extraction involves geometry and physics. This task is known by various names and is of interest to many fields, ranging from technology to biology.

After studying the geometry of multiple views from point and line correspondences and introducing the trilinear constraints, we realized that implementing multiple-view geometry in actual robotic vision systems cannot be based on feature correspondences. At that time, our research was influenced by the philosophy of direct perception advocated by Gibson. Since perception is immediate, properties of the scene in view must be directly encoded in patterns or aggregate structures of image measurements. Through the application of computational principles we have been searching for these structures and their associated representations. The technical problems we have investigated are related to basic processes underlying the perception of 3D motion, shape, and their relationships.

Results from this research have recently given rise to new mathematical constraints governing the geometry of visual space and contributing to the understanding of the non-Euclidean nature of visual space. This caused the emergence of a new framework with far-reaching consequences and a multitude of applications in technology and biology. Besides applications to robotics and navigation, these include computational video, new camera technologies, distributed sensor networks, Web-related technologies (video indexing), and a number of empirically testable hypotheses about the structure and function of the brain. For example, this research allows, for the first time, the automatic development of scene and motion representations that allow photorealistic manipulation of video (deleting objects or embedding virtual objects in a physical scene, changing viewpoint, etc.) as well as the development of 3D video.

In the following subsections we describe in more detail the basic problem we have been studying, the conventional wisdom, and our approach, and we show some results. Finally, we discuss two new application areas we have initiated, new camera technology and computational video (analysis and synthesis of visual data).
 

The Problem

Consider multiple views of a scene, for example, in a video. These amount to central projections onto the retina of an eye or the film of a camera. The problem is, given this information, to derive models of the scene in view (which could, in general, be changing). Models can be a wide variety of things and they are generally purposive or teleological. However, for the purpose of explaining our work, consider a model as consisting of the three-dimensional locations of each scene feature or point, in some coordinate system. In the case of a changing scene, add to this model the three-dimensional velocities of moving scene points.
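
As a minimal sketch of the setting just described, the following Python fragment projects scene points, given in the camera's coordinate system, onto an image plane by central projection. The function name `project`, the unit focal length, and the sample coordinates are illustrative assumptions and do not come from the original work.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]
Point2D = Tuple[float, float]


def project(point: Point3D, focal_length: float = 1.0) -> Point2D:
    """Central (pinhole) projection of a camera-frame point onto the plane Z = f."""
    X, Y, Z = point
    if Z <= 0:
        raise ValueError("point must lie in front of the camera (Z > 0)")
    return (focal_length * X / Z, focal_length * Y / Z)


# A "model" in the sense used here: one 3D location per scene point
# (hypothetical sample values).
scene_points: List[Point3D] = [(0.5, 0.2, 2.0), (-1.0, 0.4, 4.0)]
image_points = [project(p) for p in scene_points]
```

The reconstruction problem is the inverse of this mapping: given many such `image_points` over time, recover `scene_points` (and, for a changing scene, their velocities).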

The Conventional Wisdom

In the bulk of the literature this problem is still studied in the same way that photogrammetrists first approached it at the beginning of the century. Given two positions of the eye or camera, there exist two concepts of interest: (a) the 3D rigid transformation relating the coordinate systems of the two viewing positions, consisting of the composition of a rotation and a translation; and (b) the 2D transformation relating the images. This second concept is usually taken to be the correspondence between features in the two images that are the projections of the same feature in the scene, or, equivalently, the velocities with which image points move, the "motion field" (whose estimate is referred to as optical flow). Given the correspondence or flow, the 3D transformation can be computed, and subsequently finding a model for the scene is easy. The mathematics of this approach were worked out by Longuet-Higgins, and Huang and his group, for two views and point correspondences; and by Spetsakis and Aloimonos for three views and correspondences of points or lines. Additional views provide no more geometric information for a static scene. Koenderink, Faugeras and Sparr provided insights for computing affine or projective models in this framework. A characteristic of the approach is clear separation between structure and motion computation, and between 2D and 3D information. Usually, first 2D-based smoothing constraints are employed to obtain, from the image measurements, the optical flow field or correspondence; then this information is used to estimate 3D motion and, subsequently, structure. The problem with such an approach is that optical flow or correspondence cannot be computed well on the basis of image measurements only, and erroneously computed optical flow leads to errors in 3D motion and structure. One problem arises from the locations of flow discontinuities which are due to scene elements at different depths or differently moving objects.
If we knew where the discontinuities were, we could, using a multitude of approaches based on smoothness constraints, estimate flow values for image patches corresponding to smooth scene patches; but to know the discontinuities requires solving for motion and structure first (a chicken-egg problem). A second problem arises which is of a statistical nature: even within smooth scene patches, optical flow cannot be estimated accurately; the estimation is biased and depends on the gradient distribution of the scene texture. This bias is highly pronounced in the pattern designed by Ouchi (Figure 1) and explained in our recent work. Slight movements of this pattern produce different movements in the inset and the background. This is an example where accurate flow is impossible to compute. This correspondence-based framework has given rise to some applications, especially ones involving well-structured geometric objects or semi-automatic approaches (for example, use of an operator). This framework is approaching its limits. Treating an image sequence as a moving cloud of points has its limitations.
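
To make the notion of a motion field concrete, the sketch below computes the image velocity that a rigid camera motion induces at a static scene point, by differentiating the central projection directly. It assumes a pinhole camera with focal length `f`, a camera translating with velocity `translation` and rotating with angular velocity `rotation`, and the convention that a static point then moves relative to the camera with velocity -t - w x P. All names and sample values are hypothetical.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def cross(a: Vec3, b: Vec3) -> Vec3:
    """Vector cross product a x b."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def motion_field(point: Vec3, translation: Vec3, rotation: Vec3,
                 f: float = 1.0) -> Tuple[float, float]:
    """Image velocity (u, v) of a static scene point seen by a moving camera."""
    X, Y, Z = point
    w_cross_p = cross(rotation, point)
    # Velocity of the point in the camera frame: dP/dt = -t - w x P.
    dX = -translation[0] - w_cross_p[0]
    dY = -translation[1] - w_cross_p[1]
    dZ = -translation[2] - w_cross_p[2]
    # Time derivative of x = f X / Z and y = f Y / Z (quotient rule).
    u = f * (dX * Z - X * dZ) / (Z * Z)
    v = f * (dY * Z - Y * dZ) / (Z * Z)
    return (u, v)


# Forward translation: flow points away from the image center.
forward = motion_field((1.0, 0.0, 2.0), (0.0, 0.0, 1.0), (0.0, 0.0, 0.0))
# Rotation about the optical axis: the flow does not depend on depth.
spin_near = motion_field((1.0, 0.0, 2.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0))
spin_far = motion_field((2.0, 0.0, 4.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0))
```

The translational part of the field scales with inverse depth while the rotational part does not, which is why errors in the flow propagate into both the motion and the structure estimates.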

Figure 1. The Ouchi pattern. Slight, rapid movements of this pattern produce different movements of the inset and the background.

Our Approach

Our approach combines the processes of smoothing, segmentation, 3D motion and structure estimation. New constraints have been developed which are defined directly on the image derivatives, leading to a geometrical and statistical estimation problem. The main idea is based on the interaction between 3D motion and shape, which allows us to estimate the 3D motion while at the same time segmenting the scene. If we use a wrong 3D motion estimate to compute depth, then we obtain a distorted version of the depth function. The distortion, however, is such that the worse the motion estimate, the more likely we are to obtain depth estimates that vary locally more than the correct ones. Local variability of depth is due either to the existence of a discontinuity or to a wrong 3D motion estimate; by exploiting the statistics of the raw image measurements (derivatives) these two cases can be differentiated. Clearly, at the end of the process a good estimate of correspondence can also be made.
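
The link between a wrong motion estimate and depth variability can be illustrated with a toy computation. The sketch below is not the group's algorithm: it assumes pure translation, noise-free flow, unit focal length, and a smooth (fronto-parallel) scene patch, and every name and number in it is hypothetical. It estimates inverse depth at each point of a patch under a candidate translation and measures how much the estimates vary.

```python
from statistics import pvariance
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Vec2 = Tuple[float, float]


def translational_flow(x: float, y: float, inv_depth: float, t: Vec3) -> Vec2:
    """Flow of a static point under pure camera translation (focal length 1)."""
    U, V, W = t
    return ((x * W - U) * inv_depth, (y * W - V) * inv_depth)


def estimated_inverse_depth(x: float, y: float, flow: Vec2, t: Vec3) -> float:
    """Least-squares inverse depth at (x, y) if the translation were t."""
    U, V, W = t
    dx, dy = x * W - U, y * W - V
    return (flow[0] * dx + flow[1] * dy) / (dx * dx + dy * dy)


def depth_variability(patch: List[Vec2], flows: List[Vec2], t: Vec3) -> float:
    """Variance of the inverse depth estimates over a small image patch."""
    return pvariance([estimated_inverse_depth(x, y, fl, t)
                      for (x, y), fl in zip(patch, flows)])


true_t = (0.2, 0.0, 1.0)     # translation that generated the flow
wrong_t = (-0.3, 0.25, 1.0)  # an erroneous candidate
patch = [(0.4 + 0.05 * i, 0.1 + 0.05 * j) for i in range(4) for j in range(4)]
flows = [translational_flow(x, y, 0.5, true_t) for x, y in patch]  # constant depth

print(depth_variability(patch, flows, true_t))   # essentially zero
print(depth_variability(patch, flows, wrong_t))  # clearly larger
```

On this smooth patch the correct translation reproduces a constant inverse depth, while the wrong candidate yields a distorted, spatially varying one. A real scene also contains genuine discontinuities, which is why the two sources of variability have to be told apart statistically.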

Since at the beginning of the process correspondence or flow is not available, we cannot utilize the epipolar constraint that has been traditionally used. Instead, we utilize the positive depth constraint and geometric constraints arising from understanding the distortion function, which depends on the errors in the 3D transformation and image measurements. Understanding this function provides the insight that human visual space is a non-Euclidean space; further, it explains a number of illusions and predicts others. At the same time, this understanding gives rise to algorithms for 3D motion estimation, motion segmentation and scene reconstruction from video sequences, producing results not obtainable by correspondence-based approaches. Theoretical results also demonstrate that existing approaches are special cases of our approach; that is, our approach is provably better than the state-of-the-art, correspondence-based schemes.
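
The positive depth constraint can be sketched directly on normal flow, that is, the flow component along the local image gradient, which is available without correspondence. The toy example below is an illustration under strong assumptions (pure translation, noise-free measurements, unit focal length, axis-aligned gradient directions) and is not the estimation procedure used in this work. The helper names are hypothetical. A candidate translation is penalized for every measurement that would force the scene point behind the camera.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Vec2 = Tuple[float, float]
Measurement = Tuple[float, float, Vec2, float]  # x, y, gradient direction, normal flow


def negative_depth_count(measurements: List[Measurement], t: Vec3,
                         eps: float = 1e-12) -> int:
    """Number of measurements that imply negative depth under translation t."""
    U, V, W = t
    count = 0
    for x, y, n, normal_flow in measurements:
        # Normal flow predicted by t, up to the (positive) inverse depth factor.
        predicted = (x * W - U) * n[0] + (y * W - V) * n[1]
        if normal_flow * predicted < -eps:  # signs disagree: depth would be negative
            count += 1
    return count


true_t = (0.2, 0.0, 1.0)
measurements: List[Measurement] = []
for i in range(10):
    for j in range(10):
        x, y = (2 * i - 9) / 20, (2 * j - 9) / 20
        inv_depth = 0.2 + 0.05 * ((3 * i + j) % 5)  # positive, varying depth
        n = (1.0, 0.0) if (i + j) % 2 == 0 else (0.0, 1.0)
        u, v = (x * true_t[2] - true_t[0]) * inv_depth, (y * true_t[2] - true_t[1]) * inv_depth
        measurements.append((x, y, n, u * n[0] + v * n[1]))

candidates = [true_t, (-0.3, 0.25, 1.0), (0.0, 0.4, 1.0)]
scores = {t: negative_depth_count(measurements, t) for t in candidates}
best = min(candidates, key=lambda t: scores[t])
```

In this noise-free setting only the translation that generated the data produces no negative depth values, so `best` is `true_t`. With real measurements the count becomes a cost to be minimized rather than a hard test.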

Some Results

3D motion: The best way to test algorithms for 3D motion estimation is with a hand-held camera. This way the motion changes continually and smoothing or other regularization procedures cannot be used, so one can only use the information in the successive frames. The results of this algorithm testing cannot be shown properly in still images, so the reader is referred to video sequences in each example. All these video sequences are available on the World Wide Web; all their URLs begin with http://www.cfar.umd.edu/~larson/dialogue/videos/. Sequence A, captured in our lab (orig.mpg), has a large number of discontinuities, near and far objects, and a rich variety of surface structure. Sequence B (foe-both.mpg) shows the solution for the epipole or Focus of Expansion (the place where the translation vector pierces the image plane). The green dot is the solution from our algorithm and the hollow yellow dot is the solution provided by correspondence-based epipolar minimization. Sequence C (good-depth.mpg) shows the color-coded inverse depth map generated by our solution (mid-gray corresponds to places where no information is available). An important aspect of our approach is based on the concept of distortion. If the wrong 3D motion is recovered and used to find the depth of the scene, then the wrong depth will be recovered, or, as we say, a distorted depth will emerge. This distortion has interesting properties. Notice in Sequence D (bad-depth.mpg) the inverse depth map generated by an incorrect 3D motion, and note the high variability of depth in many places. Even negative depth values are produced (black). Our solution utilizes this property. Sequence E (warp.mpg) shows how well the rotation is estimated. It does so by subtracting the rotation from the original sequence so that the remaining video represents a sequence containing only translation (shown at the right).
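
Two of the quantities shown in these sequences, the Focus of Expansion and the derotated flow, have simple closed forms under a pinhole model. The sketch below assumes focal length `f`, translation (U, V, W), rotation (a, b, g), and the sign convention in which a static point moves relative to the camera with velocity -t - w x P. It is an illustrative reconstruction of the textbook equations with hypothetical names, not the code that produced the videos.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]
Vec2 = Tuple[float, float]


def focus_of_expansion(t: Vec3, f: float = 1.0) -> Vec2:
    """Image point where the translation vector pierces the image plane."""
    U, V, W = t
    if W == 0:
        raise ValueError("FOE is at infinity when there is no forward translation")
    return (f * U / W, f * V / W)


def translational_flow(x: float, y: float, inv_depth: float, t: Vec3,
                       f: float = 1.0) -> Vec2:
    U, V, W = t
    return ((x * W - f * U) * inv_depth, (y * W - f * V) * inv_depth)


def rotational_flow(x: float, y: float, w: Vec3, f: float = 1.0) -> Vec2:
    """Depth-independent flow component caused by camera rotation (a, b, g)."""
    a, b, g = w
    return (a * x * y / f - b * (x * x / f + f) + g * y,
            a * (y * y / f + f) - b * x * y / f - g * x)


def derotate(x: float, y: float, flow: Vec2, w: Vec3, f: float = 1.0) -> Vec2:
    """Subtract the rotational component, leaving purely translational flow."""
    ru, rv = rotational_flow(x, y, w, f)
    return (flow[0] - ru, flow[1] - rv)


t, w = (0.1, -0.05, 1.0), (0.01, -0.02, 0.005)  # hypothetical motion
x, y, inv_depth = 0.3, 0.2, 0.4
tu, tv = translational_flow(x, y, inv_depth, t)
ru, rv = rotational_flow(x, y, w)
residual = derotate(x, y, (tu + ru, tv + rv), w)

x0, y0 = focus_of_expansion(t)
# After derotation the flow lies on the line through the FOE, so this is ~0.
alignment = residual[0] * (y - y0) - residual[1] * (x - x0)
```

If the rotation estimate is accurate, the derotated flow everywhere points along lines through the FOE, which is the property visualized by a rotation-compensated sequence.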

3D shape: Perhaps the most defining test of how well 3D motion is recovered is the estimation of shape. The reason for this is that many tasks can be achieved with somewhat or slightly incorrect 3D motion, but an error of a few pixels (for example, in the location of the translation direction) is enough to create significant errors in the estimated shape. How well shape models can be estimated depends on a number of factors besides accurate 3D motion estimates, such as the number of frames utilized (amount of data) and the actual representation of the model. Sequence F (reconst.mpg) shows a model for the scene recovered from a few frames and without any elaborate data structures; the scene is simply a set of 3D points. More frames and a bit more sophistication in representing the scene (triangles) result in much better models. Sequence G (sct-input.mpg) shows an original sequence, and Sequence H shows the obtained reconstruction (sct-fly.mpg). Sequences I (yiann-input.mpg) and J (yiann-fly.mpg) show another example. No post-processing was performed here but, clearly, graphics post-processing further improves the results. Finally, consider a reconstruction from multiple videos K, L and M (pooh2-input.mpg, pooh3-input.mpg, and pooh4-input.mpg). Sequence N (pooh4.mpg) shows that the recovery is almost perfect. Again, no post-processing was performed.

Motion segmentation: This is the hardest problem in dynamic scene analysis; our approach was conceived with this problem in mind. Sequence O is an original, well-known sequence (mobile-input.mpg). An elaborate optimization scheme with feedback starts from the normal flow values and builds representations of camera motion and localizations of motion and background boundaries. The principle of depth variability plays a central role. Sequence P (mobile-depth.mpg) shows recovered inverse depth for a part of the sequence with the gray-level value showing the amount of depth (white denoting large positive values, i.e., close to the camera, and black denoting negative values). Notice the high variability of depth at the locations of independent movement. Also, notice that, at times, the train motion is consistent with the camera motion (making independent motion detection difficult) so no high variability of depth is obtained, but the depth comes out negative, marking independent motion. Notice in Sequence Q (mobile-binerr.mpg) the depth variability measurements for the same part of the sequence (white denoting large values). The procedure searches for the camera motion and the motion boundaries. Depth variability is the basis for the solution.
 

New Camera Technology: Eyes from Eyes

If conventional video cameras are put together in various configurations, new sensors can be constructed that are much more powerful, and the way they "see" the world makes problems of vision much easier to solve. This research is motivated by examining the wide variety of eye designs in the biological world and obtaining inspiration for an ensemble of computational studies that relate how a system sees to what that system does (i.e., relating perception to action). This, coupled with the geometry of multiple views that has flourished in terms of theoretical results in the past few years, points to new ways of constructing powerful imaging devices which suit particular tasks in robotics, visualization, video processing, virtual reality and various computer vision applications, better than conventional cameras. From this perspective, this research could lead to a new imaging technology.

Take, for example, the problem of recovering descriptions of space-time from video information. We give here the principles underlying the construction of new eyes for this task. As is well known, to solve this problem we need to accurately recover 3D motion and image motion.

Our point of departure is a set of geometric results regarding inherent ambiguities in estimating 3D motion from video sequences. Denote the five unknown motion parameters of a common moving video camera by (x_0, y_0) (direction of translation) and (a, b, g) (rotation). Assume that the scene in view has depth values uniformly distributed in the camera's coordinate system between any two values. Then, no matter how 3D motion is estimated from the image motion, the expected solution will contain errors (x_{0_e}, y_{0_e}), (a_e, b_e, g_e) that satisfy two constraints:

(a) The orthogonality constraint: x_{0_e}/y_{0_e} = -b_e/a_e
(b) The line constraint: x_0/y_0 = x_{0_e}/y_{0_e}.

The solution thus contains errors that are mingled and create a confusion between rotation and translation that cannot be cleared up. The errors may be small or large, but their expected value will always satisfy the above conditions. Intuitively, the surface representing an error function whose minimum provides the solution has a "bad" topography, making it hard to localize the minimum.
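
The two constraints are easier to read in cross-multiplied form, which also avoids division by zero. The sketch below is only a restatement of the constraints as residual functions, with hypothetical names and sample values: the orthogonality constraint says the translation error (x_{0_e}, y_{0_e}) is perpendicular to the rotation error (a_e, b_e), and the line constraint says the translation error is parallel to (x_0, y_0), so the erroneous translation direction lies on the line through the image center and the true one.

```python
from typing import Tuple

Vec2 = Tuple[float, float]
Vec3 = Tuple[float, float, float]


def orthogonality_residual(foe_error: Vec2, rotation_error: Vec3) -> float:
    """x0e * a_e + y0e * b_e; zero when x0e / y0e = -b_e / a_e."""
    x0e, y0e = foe_error
    a_e, b_e, _g_e = rotation_error
    return x0e * a_e + y0e * b_e


def line_residual(foe: Vec2, foe_error: Vec2) -> float:
    """x0 * y0e - y0 * x0e; zero when x0 / y0 = x0e / y0e."""
    x0, y0 = foe
    x0e, y0e = foe_error
    return x0 * y0e - y0 * x0e


# Hypothetical error configuration that satisfies both constraints.
foe = (2.0, 1.0)
foe_error = (0.5, 0.25)            # parallel to foe
rotation_error = (1.0, -2.0, 0.0)  # (a_e, b_e) perpendicular to foe_error
```

Both residuals vanish for this configuration, which is the kind of coupled rotation/translation error that a restricted field of view cannot rule out.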

Let us step back for a moment and ask our original question differently. We are interested in space and action descriptions that can be extracted from visual data. This requires that there exist an eye or device imaging the scene. All along we took for granted that our basic device was a camera-type eye, that is, a common video camera whose basic principle is the pinhole model, but there was no particular reason to make this assumption.

An examination of the designs of eyes in the biological world reveals a very wide variety. The mechanisms that organisms have evolved for collecting photons and forming images that they use to perform various actions in their environment depend on a number of factors. Chief among these are the individual organism's computational capacity and the tasks that the organism performs. Michael Land, a prominent British zoologist and the world's foremost expert on the science of eyes, has provided a landscape of eye evolution. Considering evolution as a mountain range, with the lower hills representing the earlier steps in the evolutionary ladder, and the highest peaks representing the later stages of evolution, the situation is pictured in Figure 2. It has been estimated that eyes have evolved no fewer than forty times, independently, in diverse parts of the animal kingdom. In some cases, these eyes use radically different principles and the "eye landscape" of the figure shows nine basic types of eyes. Eyes low in the hierarchy (such as the nautilus' pinhole eye or the marine snail eye) make very crude images of the world, but at higher levels of evolution we find different types of compound eyes and camera-type eyes (like the ones we have) such as the corneal eyes of land vertebrates and the eyes of fish.

Figure 2. "Landscape" of eye evolution. (From R. Dawkins, Climbing Mount Improbable, Norton, New York, 1996.)

Inspiration for our research on this topic has come from the compound eyes of insects which are particularly intriguing, especially in view of the fact that insects compute 3D motion excellently. Their lives depend on their ability to fly with precision through cluttered environments, avoid obstacles and land on demand on surfaces oriented in various ways. In addition, they perform these tasks with minimal memory and computational capacity, much less than an average personal computer of today. Could it be possible that much of their success emanates from the special construction of their eyes?

Compound eyes exist in several varieties, and can be classified in two categories, apposition and superposition. An apposition eye (Figure 3a) is built as a dense cluster of long, straight tubes radiating out in all directions as from the roof of a dome. Each tube is like a gunsight which sees only a small part of the world in its own direct line of fire. Thus, rays coming from other parts of the dome are prevented by the walls of the tube and the backing of the dome from hitting the back of the tube where the photocells are. In practice, each of the little tube eyes, called ommatidia, is a bit more than a tube. It has its own private lens and its own private retina of about half a dozen photocells. The ommatidium works like a long, poor-quality camera eye. Superposition compound eyes, on the other hand, do not trap rays in tubes. They allow rays that pass through the lens of one ommatidium to be picked up by a neighboring ommatidium's photocells. There is an empty, transparent zone shared by all ommatidia. The lenses of all ommatidia conspire to form a single image on a shared retina which is put together from the light-sensitive cells of all the ommatidia. One kind of superposition eye is the neural superposition (or wired-up superposition) eye shown in Figure 3b. In this case, the ommatidia are isolated tubes just as in the case of the apposition eye, but they achieve a superposition-like effect by ingenious wiring of nerve cells behind the ommatidia.

Figure 3. Compound eyes of apposition (a) and superposition (b) types. (From R. Dawkins, op. cit.)

Why is it that biological systems that need to fly (insects, birds) have panoramic vision implemented either as a compound eye or by placing camera-type eyes on opposite sides of the head? This is a fascinating question that has remained open since the time of the pioneer investigator Sigmund Exner at the beginning of this century. The obvious answer is that flying systems must perceive the whole space around them; thus panoramic vision emerges. There is, however, a deeper mathematical reason that has only recently been understood, and it has to do with the ability of a system to estimate 3D motion when it analyzes panoramic images. Put simply, a spherical eye (360 degree field of view) is superior to a planar eye (restricted field) with regard to 3D motion estimation.

Recall that estimating 3D motion from "planar" image sequences introduces errors satisfying the "orthogonality" and "line" constraints. The main reason is the restricted field of view. If, however, the field goes to 360 degrees, the topography of the error surface drastically changes, with the minimum clearly standing out! Thus there is no confusion between the motion parameters. It is no wonder, then, that flying organisms possess panoramic vision!
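
One simple way to see why a full field of view helps, offered here as an intuition rather than as the error-surface analysis referred to above, is to write the motion field on a unit sphere. For a viewing direction r and a scene point at distance R along it, the flow is (-t + (t.r) r)/R - w x r under the convention that a static point moves relative to the eye with velocity -t - w x P. The sketch below (hypothetical names and values) shows that adding the flows at two antipodal directions cancels the rotation exactly, whatever the two depths are. A narrow field of view never observes such pairs.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a: Vec3, b: Vec3) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def spherical_flow(r: Vec3, distance: float, t: Vec3, w: Vec3) -> Vec3:
    """Motion field at unit direction r for a point at the given distance."""
    t_along_r = dot(t, r)
    w_cross_r = cross(w, r)
    return (
        (-t[0] + t_along_r * r[0]) / distance - w_cross_r[0],
        (-t[1] + t_along_r * r[1]) / distance - w_cross_r[1],
        (-t[2] + t_along_r * r[2]) / distance - w_cross_r[2],
    )


def add(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


r = (1 / 3, 2 / 3, 2 / 3)            # unit viewing direction
minus_r = (-1 / 3, -2 / 3, -2 / 3)   # its antipode
t, w = (0.3, -0.1, 1.0), (0.02, 0.05, -0.01)
zero = (0.0, 0.0, 0.0)

# Different depths in the two directions; rotation still cancels in the sum.
antipodal_sum = add(spherical_flow(r, 2.0, t, w), spherical_flow(minus_r, 7.0, t, w))
rotation_free = add(spherical_flow(r, 2.0, t, zero), spherical_flow(minus_r, 7.0, t, zero))
```

`antipodal_sum` equals `rotation_free`, so it depends only on the translation and the two depths, and it lies in the plane spanned by t and r.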

Since spherical eyes such as those of insects, and panoramic vision in general, provide a much better capability for 3D motion estimation, and since our problem of building accurate space and action descriptions depends on accurate 3D motion computation, it makes sense to reconsider what the eye for our problem should be. There are a few ways to create panoramic vision cameras, and the recent literature is rich in alternative approaches, but the insect eye is not just panoramic. It has an additional property whose mathematics are still largely unknown. It is built from a large collection of ommatidia that for our purpose can be considered as individual cameras. This construction offers additional, unexpected benefits from a computational viewpoint, though we do not know exactly what the benefits for the insect are. One such benefit arises from the fact that the large number of cameras constitutes a large collection of stereo systems! Using simple techniques, these stereo systems are capable of providing a large number of the depth discontinuities in the scene. Having available the depth discontinuities, we can estimate very well the motion field in each of the cameras. Thus, if we implement a spherical eye by putting cameras on the surface of a sphere (Figure 4), we can achieve a new eye that has the following two desirable properties:

(a) From an image sequence, it can best estimate 3D motion.
(b) From an image sequence, it can best estimate the image motion field.

An eye like this is what is needed for our problem.
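
The remark that many cameras form many stereo systems can be quantified with a small sketch. It assumes, purely for illustration, that a pair of neighboring cameras can be treated as a rectified stereo pair with known focal length and baseline, so that depth is focal length times baseline divided by disparity. The function names, the threshold, and the sample disparities are hypothetical, and the text does not specify which "simple techniques" are actually used.

```python
from itertools import combinations
from typing import Iterable, List, Sequence, Tuple


def stereo_pairs(camera_ids: Iterable[int]) -> List[Tuple[int, int]]:
    """All unordered camera pairs; N cameras give N * (N - 1) / 2 of them."""
    return list(combinations(camera_ids, 2))


def depth_from_disparity(disparity: float, focal_length: float,
                         baseline: float) -> float:
    """Depth for a rectified stereo pair (an assumption of this sketch)."""
    if disparity <= 0:
        raise ValueError("disparity must be positive")
    return focal_length * baseline / disparity


def depth_discontinuities(disparities: Sequence[float], focal_length: float,
                          baseline: float, jump: float) -> List[int]:
    """Indices along a scanline where depth changes by more than `jump`."""
    depths = [depth_from_disparity(d, focal_length, baseline) for d in disparities]
    return [i for i in range(1, len(depths))
            if abs(depths[i] - depths[i - 1]) > jump]


pairs = stereo_pairs(range(8))  # 8 cameras -> 28 stereo systems
edges = depth_discontinuities([8.0, 8.0, 8.0, 2.0, 2.0],
                              focal_length=400.0, baseline=0.125, jump=1.0)
```

Here the scanline jumps from depth 6.25 to depth 25 at index 3, which is the kind of boundary that can then be excluded when smoothing the flow within each camera.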

Figure 4. An eye composed of cameras on the surface of a sphere.

The preceding discussion demonstrates the power of multiple-view vision. Using many conventional video cameras and arranging them in specific, purposive configurations provides new eyes that are much more powerful. Their power is not due only to redundancy. It is due to the rich relationships between different projections of the world. As shown above, by treating the sets of video streams collected in a particular way as a new image, mathematical and statistical properties are obtained that were unknown before. It is expected that the study of biological eyes will reveal formidable properties. Relating such properties to tasks that systems perform will reveal a new landscape of mathematical problems related to shape, form, motion and action. To give an example from current problems in the field related to surveillance and monitoring, the problem of motion segmentation (finding independently moving objects from a moving sensor) becomes much easier if one uses a small array of video cameras as in Figure 5. The reason for this is that image motion can be better estimated and background/object motion can be separated more easily.

Figure 5. An eye composed of a planar array of cameras.

Such a research program is supported by current technology. In the early 1980s, we could hardly digitize a video. In the late 1980s, we needed sophisticated, specialized, expensive processors/systems. Now, we can put video directly on PCs! Not to mention that video cameras are quite inexpensive, and are becoming even cheaper.

Eyes like the ones just described have provably optimal properties regarding 3D motion estimation/segmentation, but may be impractical to use unless they are miniaturized. Luckily, from a mathematical viewpoint, it makes no difference whether the cameras are looking inward or outward. Imaging a moving object at the center of a sphere creates the same geometry! The resulting configuration (Figure 6) is a possible configuration for a new eye that recovers accurate shape/action descriptions. Relating the multiple video streams (without correspondence of points) gives rise to robust algorithms for shape/action description recovery.

Figure 6. Inward-pointing cameras on the surface of a sphere have the same geometry as the outward-pointing cameras of a spherical eye.

Our Institute recently obtained a grant from the Keck Foundation to establish the Keck Laboratory for the Study of Visual Movement. The Laboratory consists of a large number of cameras (currently sixty-four) and a network of PCs with the capability for simultaneous recording and synchronization among all sensors.

Using the Keck Lab we are implementing our ideas about the problems described above. At the same time, we are examining different configurations of eyes best suited for specific applications. One of our interests is 3D video. This amounts to acquiring visual data in a way that makes it possible to visualize it from any viewpoint. It is of course impossible to gather images from all viewpoints. Only through the recovery of some particular aspects of the 3D structure and motion does it become possible to visualize from any viewpoint. This is a problem falling in the general category of multiple-view geometry and statistics.
 

Computational Video

By computational video we mean the set of principles and associated algorithms that make explicit the relationship between the analysis and synthesis of visual data. Examples include video editing/manipulation, tele-immersion and virtual reality, three-dimensional video, synthetic worlds and the synergistic mixture of graphics with vision. In video manipulation one needs to alter the video's content by inserting or deleting particular objects; it amounts, for the most part, to recovering and maintaining relationships between different coordinate systems. An example of an original sequence (sct-input.mpg) is shown in Sequence R. Having recovered the structure of the scene, we can insert any object for which a model is available with that object having any desired relationship with the structure of the original scene. Having recovered the camera's motion, we can view the "new" video; for example, Sequence S (insert.mpg) shows donuts inserted into Sequence R at a few locations. Sequence T (FENCE.MPG) has a fence in front of the scene. A new video can be made with the fence taken out.
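
The bookkeeping between coordinate systems that object insertion requires can be sketched as a chain of rigid transformations followed by projection. The example below is a schematic illustration, not the system that produced the sequences: it assumes that the object's placement in the scene and the recovered camera pose for each frame are given as a rotation matrix and a translation vector (scene-to-camera), and all names and values are hypothetical.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Tuple[Vec3, Vec3, Vec3]
Pose = Tuple[Mat3, Vec3]  # rotation matrix and translation vector


def rot_y(angle: float) -> Mat3:
    """Rotation by `angle` radians about the Y axis."""
    c, s = math.cos(angle), math.sin(angle)
    return ((c, 0.0, s), (0.0, 1.0, 0.0), (-s, 0.0, c))


def transform(pose: Pose, p: Vec3) -> Vec3:
    """Apply the rigid transformation p -> R p + t."""
    R, t = pose
    return (
        R[0][0] * p[0] + R[0][1] * p[1] + R[0][2] * p[2] + t[0],
        R[1][0] * p[0] + R[1][1] * p[1] + R[1][2] * p[2] + t[1],
        R[2][0] * p[0] + R[2][1] * p[1] + R[2][2] * p[2] + t[2],
    )


def render_inserted_point(model_point: Vec3, object_pose: Pose,
                          camera_pose: Pose, f: float = 1.0) -> Tuple[float, float]:
    """Object coordinates -> scene coordinates -> camera coordinates -> image."""
    scene_point = transform(object_pose, model_point)
    X, Y, Z = transform(camera_pose, scene_point)
    if Z <= 0:
        raise ValueError("inserted point is behind the camera in this frame")
    return (f * X / Z, f * Y / Z)


# Place the object 5 units in front of the first camera position.
object_pose: Pose = (rot_y(0.0), (0.0, 0.0, 5.0))
# Recovered scene-to-camera poses for two frames (camera shifts sideways).
frame_poses = [(rot_y(0.0), (0.0, 0.0, 0.0)), (rot_y(0.0), (-0.5, 0.0, 0.0))]
track = [render_inserted_point((1.0, 0.0, 0.0), object_pose, cam) for cam in frame_poses]
```

Because the object's pose is fixed in scene coordinates and only the per-frame camera pose changes, the inserted point moves across the frames consistently with the rest of the scene.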


November 1999