Open access
Date: 2023
Type: Doctoral Thesis
ETH Bibliography: yes
Abstract
Recent advances in deep learning have enabled generative models to produce samples of unparalleled quality.
The true value of these models, however, emerges from our ability to control them.
Controllable synthesis and manipulation hold potential as democratizing tools, enabling those without expert training to realize creative concepts and transforming industries such as entertainment, virtual and augmented reality, e-commerce, and industrial design.
This thesis offers four main contributions in this domain.
Firstly, we present a semantic image editing pipeline, in which the user only needs to provide semantic information for the region they want to edit in order to realize their changes. We introduce a semantic inpainting generator and a novel two-stream conditional discriminator that enable local control and improved perceptual quality.
Secondly, we design a Generative Adversarial Network (GAN) that can synthesize images at arbitrary scales. We implement scale-consistent positional encodings and train a patch-based generator with novel inter-scale augmentations. Our model facilitates the generation of a continuum of scales, even ones unseen during training.
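A scale-consistent positional encoding can be illustrated with a toy sketch: coordinates live in continuous, resolution-independent image units, so a patch covering the same region yields a consistent encoding at any sampling resolution. The helper names and the Fourier-feature formulation below are illustrative assumptions, not the thesis implementation:

```python
import numpy as np

def patch_coords(offset, size, resolution):
    """Continuous image-plane coordinates for a patch.

    Hypothetical helper: `offset` and `size` are in normalized [0, 1] image
    units, `resolution` is the number of pixels sampled along each axis, so
    the same image region is described in the same coordinate frame at any
    output scale.
    """
    xs = offset[0] + size * (np.arange(resolution) + 0.5) / resolution
    ys = offset[1] + size * (np.arange(resolution) + 0.5) / resolution
    return np.stack(np.meshgrid(xs, ys, indexing="xy"), axis=-1)  # (H, W, 2)

def fourier_encoding(coords, num_freqs=4):
    """Sinusoidal positional encoding of the continuous coordinates."""
    freqs = 2.0 ** np.arange(num_freqs) * np.pi
    angles = coords[..., None] * freqs                 # (H, W, 2, F)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*coords.shape[:2], -1)          # (H, W, 2 * 2 * F)

# The same region rendered at two resolutions shares one coordinate frame,
# so the generator sees consistent positional inputs across scales.
lo = patch_coords((0.25, 0.25), 0.5, resolution=16)
hi = patch_coords((0.25, 0.25), 0.5, resolution=32)
print(fourier_encoding(lo).shape)  # (16, 16, 16)
```

Because the encoding depends only on continuous position, increasing `resolution` densifies the sampling of the same region rather than changing its coordinate frame, which is what allows synthesis at scales unseen during training.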
Thirdly, we propose to sample the latent vector of GANs by concatenating sub-vectors independently sampled from a collection of small learnable embedding codebooks. We show that our approach uses only a limited number of parameters to create a broad and versatile latent representation, while enabling intuitive latent-space exploration, superior disentanglement, and conditional sampling through a pretrained classifier.
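The codebook-based latent sampling can be sketched as follows. The sizes (four codebooks, 32 entries each, 16-dimensional sub-vectors) are illustrative, and the stand-in codebooks here are random rather than learned:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4 codebooks, each holding 32 entries of dimension 16.
num_codebooks, codebook_size, sub_dim = 4, 32, 16

# In the thesis these embeddings are learned; random stand-ins are used here.
codebooks = [rng.normal(size=(codebook_size, sub_dim))
             for _ in range(num_codebooks)]

def sample_latent():
    """Draw one entry per codebook and concatenate the chosen sub-vectors."""
    indices = rng.integers(0, codebook_size, size=num_codebooks)
    return np.concatenate([cb[i] for cb, i in zip(codebooks, indices)])

z = sample_latent()
print(z.shape)  # (64,) = num_codebooks * sub_dim
```

The parameter count is just `num_codebooks * codebook_size * sub_dim` embedding weights, yet the number of distinct index combinations grows as `codebook_size ** num_codebooks`, which is why a small table can span a broad latent representation.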
Lastly, we introduce a latent 3D diffusion model for synthesizing static and articulated 3D assets.
At first, we learn a compact 3D representation by training a volumetric autodecoder to reconstruct multi-view images.
Then, we train the latent diffusion model on the intermediate features of the autodecoder.
We apply our approach to diverse multi-view image datasets of synthetic objects, real in-the-wild videos of moving people, and a large-scale, real video dataset of static objects.
We perform both unconditional and text-driven generation; our approach is flexible enough to use either existing camera supervision or efficiently infer the camera parameters during training.
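The second stage applies standard diffusion training to the autodecoder's latents. As a hedged sketch under assumed shapes and a standard DDPM-style linear noise schedule (neither is specified by the abstract), the forward noising step that produces the denoiser's training target looks like:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1 (assumed): the volumetric autodecoder yields one latent code per
# object. Here, 8 hypothetical objects with flattened 128-dim latent features.
latents = rng.normal(size=(8, 128))

# Stage 2: standard forward diffusion applied to those latents.
T = 1000
betas = np.linspace(1e-4, 0.02, T)          # linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)         # cumulative signal retention

def noise_latent(z0, t):
    """Sample z_t ~ q(z_t | z_0) = N(sqrt(a_bar_t) z_0, (1 - a_bar_t) I)."""
    eps = rng.normal(size=z0.shape)
    zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return zt, eps  # a denoiser would be trained to predict eps from (zt, t)

zt, eps = noise_latent(latents, t=500)
print(zt.shape)  # (8, 128)
```

Training the diffusion model in this compact latent space, rather than on volumes or pixels directly, is what keeps 3D synthesis tractable in this two-stage design.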
To conclude, this thesis explores different approaches to controllable synthesis and manipulation of images and 3D assets.
We hope that our contributions bring us a step closer to our vision of democratizing content creation and enabling human creativity.
Permanent link
https://doi.org/10.3929/ethz-b-000650287
Publication status
published
External links
Search print copy at ETH Library
Contributors
Examiner: Van Gool, Luc
Examiner: Kastanis, Iason
Examiner: Timofte, Radu
Examiner: Isola, Phillip
Examiner: Tombari, Federico
Publisher
ETH Zurich
Subject
Generative models; Image synthesis; 3D generation; Image manipulation
Organisational unit
03514 - Van Gool, Luc / Van Gool, Luc