Remember the days of the Lytro camera? That camera allowed you to take a picture and adjust the focus afterwards as often as you liked. Now, a machine-learning algorithm is making it possible to do the same and more.
The problem with images taken by a smartphone or DSLR is that both capture a scene from a single point of view, while we, the observers, can move around and view a scene from multiple angles. In other words, a traditional image always leaves something to be desired: the world is three-dimensional, and a flat photograph discards most of that information.
Computer scientists have been working to provide an immersive experience that would allow viewers to observe a scene from different viewpoints, but so far those efforts have relied on specialised camera equipment that is out of reach for most of us.
To make the process easier, Dr. Nima Kalantari, professor in the Department of Computer Science and Engineering at Texas A&M University, and graduate student Qinbo Li have developed a machine-learning-based approach that would allow people to take a single photo and use it to generate different views of the scene.
Kalantari’s method makes it possible for non-scientists to take (or download from the internet) any image and bring it to life by viewing it from several angles. It does this through something called “view synthesis”: the process of generating new views of an object or a scene using images taken from multiple viewpoints. To create these new viewing angles, the process uses information about the distances between the objects in the scene, then renders synthetic photos from a virtual camera placed at different points within it.
The concept of synthesising new angles has existed for some time, but many existing methods require the user to capture multiple photos of the same scene from different viewpoints, often simultaneously and with specialist hardware. In short, those methods are difficult and time-consuming, and they were never designed to generate new views from a single input image.
Kalantari’s team has found a way to train a deep-learning network to generate a new view based on a single input image. To train the network, they showed it a large set of images paired with their corresponding new-view images. That takes a lot of computing power and time. An essential aspect of this approach, therefore, is to model the input scene in a way that makes the training process more straightforward.
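The supervised setup — pairs of an input image and its ground-truth novel view — can be illustrated in miniature. The sketch below stands in for the deep network with a single linear map and an invented “mirror” target; it only illustrates the training signal (predict the novel view, measure the error, adjust the weights), not the team’s actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the real deep network: one linear map trained
# to turn a flattened "input view" into a flattened "novel view".
W = rng.normal(scale=0.01, size=(16, 16))
target_map = np.eye(16)[:, ::-1]   # pretend the true novel view mirrors the input

lr = 0.05
for step in range(500):
    x = rng.normal(size=16)        # a random training "image"
    y = target_map @ x             # its ground-truth novel view
    pred = W @ x                   # the network's guess
    grad = np.outer(pred - y, x)   # gradient of 0.5 * ||pred - y||**2 w.r.t. W
    W -= lr * grad                 # gradient-descent step
```

After enough input/target pairs, `W` closely approximates the mirror map — the same principle, at toy scale, by which the real network learns to map an input photo to unseen viewpoints.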
To make the training process more manageable, the researchers converted the input image into a multiplane image, a type of layered 3D representation. First, they broke the image down into planes at different depths using the positions of the objects in the scene. Then, to generate a photo of the scene from a new viewpoint, they shifted the planes relative to one another in a specific way and combined them. Using this representation, the network learns to infer the location of the objects in the scene faster and more efficiently.
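The shift-and-combine step can be sketched in miniature. Assuming a simple horizontal camera move, grayscale layers, and integer-pixel shifts (the function name and the toy scene are illustrative, not the team’s code), each plane is translated by a parallax inversely proportional to its depth and the planes are blended back to front:

```python
import numpy as np

def render_novel_view(planes, alphas, depths, dx):
    """Composite depth planes, far to near, after shifting each by its parallax.

    planes : list of HxW colour layers (grayscale here), ordered far to near
    alphas : matching HxW opacity layers in [0, 1]
    depths : depth of each plane; nearer planes shift more (parallax ~ 1/depth)
    dx     : horizontal camera translation, in pixels at unit depth
    """
    out = np.zeros_like(planes[0])
    for colour, alpha, depth in zip(planes, alphas, depths):
        shift = int(round(dx / depth))        # parallax is inverse to depth
        c = np.roll(colour, shift, axis=1)    # translate the plane sideways
        a = np.roll(alpha, shift, axis=1)
        out = c * a + out * (1.0 - a)         # "over" blend: near covers far
    return out

# A toy scene: a distant grey backdrop and a bright square close to the camera.
far = np.full((8, 8), 0.2)
near = np.zeros((8, 8))
near[2:5, 2:5] = 1.0
img = render_novel_view(
    planes=[far, near],
    alphas=[np.ones((8, 8)), (near > 0).astype(float)],
    depths=[10.0, 1.0],   # the backdrop is ten times farther than the square
    dx=2.0,
)
# The near square shifts two pixels; the backdrop barely moves, so the
# previously hidden background behind the square becomes visible.
```

The differing shifts per depth plane are exactly what creates the parallax a viewer expects when moving around a real scene.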
To train the network, the scientists introduced it to a dataset of over 2,000 unique scenes containing a variety of objects. They demonstrated that their approach could produce high-quality new-view images of many kinds of scenes, outperforming previous state-of-the-art methods.
The researchers are currently working on extending their approach to synthesise videos. Because a video is just a sequence of individual frames played rapidly, they can apply their approach to generate new views of each frame independently. But since each frame is processed in isolation, the resulting video flickers and lacks consistency when played back.
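The flicker problem can be sketched as follows. Here `synthesize_view` is a hypothetical stand-in for any single-image model: because it infers the scene from scratch on every call, its small per-frame differences are modeled as seeded random noise, so even a perfectly static input yields outputs that jitter from frame to frame:

```python
import numpy as np

def synthesize_view(frame, seed):
    # Hypothetical stand-in for a single-image view-synthesis model.
    # Each call infers the scene layout from scratch, so near-identical
    # inputs can come back slightly different -- modeled here as noise.
    noise = np.random.default_rng(seed).normal(scale=0.05, size=frame.shape)
    return frame + noise

video = [np.full((4, 4), 0.5) for _ in range(10)]   # a completely static scene
novel = [synthesize_view(f, seed=i) for i, f in enumerate(video)]

# Even though the input never changes, consecutive outputs disagree:
flicker = np.mean([np.abs(a - b).mean() for a, b in zip(novel, novel[1:])])
```

A nonzero `flicker` on a static scene is precisely the temporal inconsistency described above; fixing it requires the model to share information across frames rather than treating each one independently.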
The single-image view synthesis method can also be used to generate refocused images and could potentially be applied in virtual reality and augmented reality.
The results of Kalantari’s work will trickle down to the end-user market eventually, but it’s unclear when we mere mortals will be able to buy an app in the Mac or iOS app store that uses the new approach.