Open access
Date: 2023
Type: Doctoral Thesis
ETH Bibliography: yes
Abstract
Recent advances in deep learning have enabled generative models to produce samples of unparalleled quality.
The true value of these models, however, emerges from our ability to control them.
Controllable synthesis and manipulation hold potential as democratizing tools, enabling those without expert training to realize creative concepts and transforming industries such as entertainment, virtual and augmented reality, e-commerce, and industrial design.
This thesis offers four main contributions in this domain.
Firstly, we present a semantic image editing pipeline, in which the user only needs to provide semantic information for the region they want to edit in order to realize their changes. We introduce a semantic inpainting generator and a novel two-stream conditional discriminator that enable local control and improved perceptual quality.
Secondly, we design a Generative Adversarial Network (GAN) that can synthesize images at arbitrary scales. We implement scale-consistent positional encodings and train a patch-based generator with novel inter-scale augmentations. Our model facilitates the generation of a continuum of scales, even ones unseen during training.
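A scale-consistent positional encoding can be illustrated with a toy sketch: coordinates live in continuous, resolution-independent image units, so a patch covering the same region yields a consistent encoding at any sampling resolution. The helper names and the Fourier-feature formulation below are illustrative assumptions, not the thesis implementation:

```python
import numpy as np

def patch_coords(offset, size, resolution):
    """Continuous image-plane coordinates for a patch.

    Hypothetical helper: `offset` and `size` are in normalized [0, 1] image
    units, `resolution` is the number of pixels sampled along each axis, so
    the same image region is described in the same coordinate frame at any
    output scale.
    """
    xs = offset[0] + size * (np.arange(resolution) + 0.5) / resolution
    ys = offset[1] + size * (np.arange(resolution) + 0.5) / resolution
    return np.stack(np.meshgrid(xs, ys, indexing="xy"), axis=-1)  # (H, W, 2)

def fourier_encoding(coords, num_freqs=4):
    """Sinusoidal positional encoding of the continuous coordinates."""
    freqs = 2.0 ** np.arange(num_freqs) * np.pi
    angles = coords[..., None] * freqs                 # (H, W, 2, F)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*coords.shape[:2], -1)          # (H, W, 2 * 2 * F)

# The same region rendered at two resolutions shares one coordinate frame,
# so the generator sees consistent positional inputs across scales.
lo = patch_coords((0.25, 0.25), 0.5, resolution=16)
hi = patch_coords((0.25, 0.25), 0.5, resolution=32)
print(fourier_encoding(lo).shape)  # (16, 16, 16)
```

Because the encoding depends only on continuous position, increasing `resolution` densifies the sampling of the same region rather than changing its coordinate frame, which is what allows synthesis at scales unseen during training.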
Thirdly, we propose to sample the latent vector of GANs by concatenating sub-vectors independently sampled from a collection of small learnable embedding codebooks. We show that our approach uses only a limited number of parameters to create a broad and versatile latent representation, while enabling intuitive latent-space exploration, superior disentanglement, and conditional sampling through a pretrained classifier.
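The codebook-based latent sampling can be sketched as follows. The sizes (four codebooks, 32 entries each, 16-dimensional sub-vectors) are illustrative, and the stand-in codebooks here are random rather than learned:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4 codebooks, each holding 32 entries of dimension 16.
num_codebooks, codebook_size, sub_dim = 4, 32, 16

# In the thesis these embeddings are learned; random stand-ins are used here.
codebooks = [rng.normal(size=(codebook_size, sub_dim))
             for _ in range(num_codebooks)]

def sample_latent():
    """Draw one entry per codebook and concatenate the chosen sub-vectors."""
    indices = rng.integers(0, codebook_size, size=num_codebooks)
    return np.concatenate([cb[i] for cb, i in zip(codebooks, indices)])

z = sample_latent()
print(z.shape)  # (64,) = num_codebooks * sub_dim
```

The parameter count is just `num_codebooks * codebook_size * sub_dim` embedding weights, yet the number of distinct index combinations grows as `codebook_size ** num_codebooks`, which is why a small table can span a broad latent representation.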
Lastly, we introduce a latent 3D diffusion model for synthesizing static and articulated 3D assets.
At first, we learn a compact 3D representation by training a volumetric autodecoder to reconstruct multi-view images.
Then, we train the latent diffusion model on the intermediate features of the autodecoder.
We apply our approach to diverse multi-view image datasets of synthetic objects, real in-the-wild videos of moving people, and a large-scale, real video dataset of static objects.
We perform both unconditional and text-driven generation; our approach is flexible enough to use either existing camera supervision or efficiently infer the camera parameters during training.
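The second stage applies standard diffusion training to the autodecoder's latents. As a hedged sketch under assumed shapes and a standard DDPM-style linear noise schedule (neither is specified by the abstract), the forward noising step that produces the denoiser's training target looks like:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1 (assumed): the volumetric autodecoder yields one latent code per
# object. Here, 8 hypothetical objects with flattened 128-dim latent features.
latents = rng.normal(size=(8, 128))

# Stage 2: standard forward diffusion applied to those latents.
T = 1000
betas = np.linspace(1e-4, 0.02, T)          # linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)         # cumulative signal retention

def noise_latent(z0, t):
    """Sample z_t ~ q(z_t | z_0) = N(sqrt(a_bar_t) z_0, (1 - a_bar_t) I)."""
    eps = rng.normal(size=z0.shape)
    zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return zt, eps  # a denoiser would be trained to predict eps from (zt, t)

zt, eps = noise_latent(latents, t=500)
print(zt.shape)  # (8, 128)
```

Training the diffusion model in this compact latent space, rather than on volumes or pixels directly, is what keeps 3D synthesis tractable in this two-stage design.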
To conclude, this thesis explores different approaches to controllable synthesis and manipulation of images and 3D assets.
We hope that our contributions bring us a step closer to our vision of democratizing content creation and enabling human creativity.
Permanent link
https://doi.org/10.3929/ethz-b-000650287
Publication status
published
External links
Search print copy at ETH Library
Contributors
Examiner: Van Gool, Luc
Examiner: Kastanis, Iason
Examiner: Timofte, Radu
Examiner: Isola, Phillip
Examiner: Tombari, Federico
Publisher
ETH Zurich
Subject
Generative models; Image synthesis; 3D generation; Image manipulation
Organisational unit
03514 - Van Gool, Luc / Van Gool, Luc