Structured Generative Models for Controllable Scene and 3D Content Synthesis
Open access
Author
Date: 2023
Type: Doctoral Thesis
ETH Bibliography: yes
Abstract
Deep learning has fundamentally transformed the field of image synthesis, facilitated by the emergence of generative models that demonstrate a remarkable ability to generate photorealistic imagery and intricate graphics. These models have advanced a wide range of industries, including art, gaming, movies, augmented & virtual reality (AR/VR), and advertising. While realism is undoubtedly a major contributor to their success, the ability to control these models is equally important in ensuring their practical viability and making them more useful for downstream applications. For instance, it is natural to describe an image through natural language, sketches, or attributes controlling the style of specific objects. It is therefore desirable to devise generative frameworks that follow a workflow similar to that of an artist. Furthermore, for interactive applications, the generated content needs to be visualized from various viewpoints while ensuring that the identity of the scene is preserved and remains consistent across views. Addressing this issue is interesting not only from an application-oriented standpoint, but also from an image understanding perspective. Our visual system perceives 2D projections of 3D scenes, but the convolutional architectures commonly used in generative models ignore the concept of image formation and attempt to learn this structure from the data. Generative models that explicitly reason about 3D representations can provide disentangled control over shape, pose, and appearance, better handle spatial phenomena such as occlusions, and generalize with less data. These practical requirements motivate the need for generative models driven by structured representations that are efficient, easily interpretable, and more aligned with human perception.
In this dissertation, we initially focus on the research question of controlling generative adversarial networks (GANs) for complex scene synthesis. We observe that, while existing approaches exhibit some degree of control over simple domains such as faces or centered objects, they fall short when it comes to complex scenes consisting of multiple objects. We therefore propose a weakly-supervised approach in which generated images are described by a sparse scene layout (i.e., a sketch) and the style of individual objects can be refined through textual descriptions or attributes. We then show that this paradigm can effectively be used to generate complex images without trading off realism for control.
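To make the layout-and-style conditioning concrete, the following is a minimal illustrative sketch, not the architecture proposed in the thesis: a toy generator that consumes a one-hot semantic layout, one style code per object class (e.g., pooled from text or attribute embeddings), and a global noise vector. All names, shapes, and hyperparameters (LayoutConditionedGenerator, num_classes, style_dim) are assumptions made for illustration.

```python
# Minimal sketch (not the thesis architecture): a GAN generator conditioned on a
# sparse semantic layout and per-object style codes derived from text/attributes.
import torch
import torch.nn as nn

class LayoutConditionedGenerator(nn.Module):
    def __init__(self, num_classes=10, style_dim=64, noise_dim=128, base_ch=64):
        super().__init__()
        # The layout is a per-pixel one-hot map of object classes (the "sketch");
        # each class is additionally modulated by its own style code.
        in_ch = num_classes + style_dim + noise_dim
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, base_ch, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(base_ch, base_ch, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(base_ch, 3, 3, padding=1),
            nn.Tanh(),  # RGB output in [-1, 1]
        )

    def forward(self, layout, styles, noise):
        # layout: (B, num_classes, H, W) one-hot masks
        # styles: (B, num_classes, style_dim) one style vector per class
        # noise:  (B, noise_dim) global latent code
        B, C, H, W = layout.shape
        # Broadcast each object's style code over the pixels it occupies.
        style_map = torch.einsum('bchw,bcd->bdhw', layout, styles)
        noise_map = noise[:, :, None, None].expand(-1, -1, H, W)
        return self.net(torch.cat([layout, style_map, noise_map], dim=1))

# Usage: generate a 64x64 image from a toy layout and random style codes.
G = LayoutConditionedGenerator()
layout = torch.zeros(1, 10, 64, 64)
layout[:, 3, 16:48, 16:48] = 1.0   # one object of class 3 in the centre
styles = torch.randn(1, 10, 64)    # e.g., pooled text/attribute embeddings
img = G(layout, styles, torch.randn(1, 128))
print(img.shape)  # torch.Size([1, 3, 64, 64])
```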
Next, we address the aforementioned issue of view consistency. Following recent advances in differentiable rendering, we introduce a convolutional mesh generation paradigm that can be used to generate textured 3D meshes using GANs. This model natively reasons about 3D representations and can therefore be used to generate 3D content for computer graphics applications. We also demonstrate that our 3D generator can be controlled using the same standard techniques applied to 2D GANs, and successfully condition our model on class labels, attributes, and textual descriptions. We then observe that methods for 3D content generation typically require ground-truth poses, restricting their applicability to simple datasets where these are available. We therefore propose a follow-up approach that relaxes this requirement, demonstrating our method on a larger set of classes from ImageNet.
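As a rough illustration of convolutional mesh generation, the sketch below shows a toy generator that predicts a displacement map and a texture over a UV parameterisation of a template mesh; in a full pipeline these outputs would be passed to a differentiable renderer and the rendered views fed to a 2D discriminator. The class name, resolutions, and the specific choice of regressing displacements in UV space are illustrative assumptions, not the exact architecture of the thesis, and the renderer is deliberately left out.

```python
# Minimal sketch (assumptions throughout): a convolutional generator that deforms
# a template mesh and predicts a texture, intended to be trained adversarially
# against 2D renders produced by a differentiable renderer (not shown here).
import torch
import torch.nn as nn

class TexturedMeshGenerator(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        # Displacements and texture are both predicted as 2D maps over a UV
        # parameterisation of a template shape, so standard convolutions apply.
        self.fc = nn.Linear(latent_dim, 256 * 4 * 4)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(True),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(True),
            nn.ConvTranspose2d(32, 3 + 3, 4, stride=2, padding=1),
        )

    def forward(self, z):
        x = self.fc(z).view(-1, 256, 4, 4)
        out = self.deconv(x)                          # (B, 6, 64, 64) UV-space maps
        displacement = 0.1 * torch.tanh(out[:, :3])   # small offsets of template vertices
        texture = torch.sigmoid(out[:, 3:])           # RGB texture in [0, 1]
        return displacement, texture

G = TexturedMeshGenerator()
disp, tex = G(torch.randn(2, 128))
print(disp.shape, tex.shape)  # torch.Size([2, 3, 64, 64]) torch.Size([2, 3, 64, 64])
# In a full pipeline the displaced template mesh and texture would be rendered
# differentiably from sampled viewpoints and scored by a 2D GAN discriminator.
```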
Finally, we draw inspiration from the literature on Neural Radiance Fields (NeRF) and incorporate this recently-proposed representation into our work on 3D generative modelling. We show how these models can be used to solve a series of downstream tasks such as single-view 3D reconstruction. To this end, we propose an approach that bridges NeRFs and GANs to reconstruct the 3D shape, appearance, and pose of an object from a single 2D image. Our approach adopts a bootstrapped GAN inversion strategy where an encoder produces a first guess of the solution, which is then refined through optimization by inverting a pre-trained 3D generator.
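The bootstrapped inversion strategy can be sketched as follows. The encoder and generator below are toy stand-ins (plain linear layers rather than a NeRF-based 3D generator), and all names, losses, and hyperparameters are assumptions; only the two-stage structure, a feed-forward first guess followed by optimization through a frozen pre-trained generator, reflects the approach described above.

```python
# Minimal sketch (assumptions throughout) of bootstrapped GAN inversion:
# an encoder predicts an initial latent code for an input image, which is then
# refined by gradient descent through a frozen, pre-trained generator.
import torch
import torch.nn as nn
import torch.nn.functional as F

latent_dim, img_res = 128, 64

# Stand-in pre-trained generator and feed-forward encoder (toy modules).
generator = nn.Sequential(nn.Linear(latent_dim, 3 * img_res * img_res), nn.Tanh())
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * img_res * img_res, latent_dim))
for p in generator.parameters():
    p.requires_grad_(False)  # the generator stays frozen during inversion

def invert(image, steps=200, lr=1e-2):
    # 1) Bootstrap: the encoder produces a first guess of the latent code.
    z = encoder(image).detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    # 2) Refine: optimize z so the generator's output matches the target image.
    for _ in range(steps):
        recon = generator(z).view(-1, 3, img_res, img_res)
        loss = F.mse_loss(recon, image)  # a perceptual loss is often added in practice
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()

target = torch.rand(1, 3, img_res, img_res) * 2 - 1  # dummy input image in [-1, 1]
z_star = invert(target)
print(z_star.shape)  # torch.Size([1, 128])
```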
Permanent link: https://doi.org/10.3929/ethz-b-000614499
Publication status: published
External links: Search print copy at ETH Library
Publisher: ETH Zurich
Subject: Deep Learning; Computer Vision; Generative models; 3D Vision
Organisational unit: 09462 - Hofmann, Thomas / Hofmann, Thomas
Funding: 176004 - Deep Learning for Generating Template Pictorial and Textual Representations (SNF)