Show simple item record

dc.contributor.author
Paschalidou, Despoina
dc.contributor.supervisor
Van Gool, Luc
dc.contributor.supervisor
Geiger, Andreas
dc.contributor.supervisor
Ferrari, Vittorio
dc.contributor.supervisor
Tombari, Federico
dc.contributor.supervisor
Savva, Manolis
dc.date.accessioned
2021-12-16T12:07:52Z
dc.date.available
2021-12-16T11:47:53Z
dc.date.available
2021-12-16T12:07:52Z
dc.date.issued
2021
dc.identifier.uri
http://hdl.handle.net/20.500.11850/521013
dc.identifier.doi
10.3929/ethz-b-000521013
dc.description.abstract
Humans develop a common-sense understanding of the physical behaviour of the world within the first year of their life. We are able to identify 3D objects in a scene, infer their geometric and physical properties, predict physical events in dynamic environments and act based on our interaction with the world. Our understanding of our surroundings relies heavily on our ability to properly reason about the arrangement of elements in a scene. Inspired by early works in cognitive science that stipulate that the human visual system perceives objects as a collection of semantically coherent parts and in turn uses them to easily associate unknown objects with object parts whose functionality is already known, researchers developed compositional representations capable of capturing the functional composition and spatial arrangement of objects and object parts in a scene. In the first two parts of this dissertation, we propose learning-based solutions for recovering the 3D object geometry using semantically consistent part arrangements. Finally, we introduce a network architecture that synthesizes indoor environments as object arrangements, whose functional composition and spatial configuration follow clear patterns that are directly inferred from data. First, we present an unsupervised learning-based approach for recovering shape abstractions using superquadric surfaces as atomic elements. We demonstrate that superquadrics lead to more expressive part decompositions while being easier to learn than cuboidal primitives. Moreover, we provide an analytical solution to the Chamfer loss, which avoids the need for computationally expensive reinforcement learning or iterative prediction. Next, we introduce a novel 3D primitive representation that defines primitives using an Invertible Neural Network (INN) that implements homeomorphic mappings between a sphere and the target object. 
Since this representation does not impose any constraint on the shape of the predicted primitives, they can capture complex geometries using an order of magnitude fewer parts than existing primitive-based representations. We consider this representation a first step towards bridging the gap between interpretable and high-fidelity primitive-based reconstructions. Subsequently, we introduce a structure-aware representation that jointly recovers the geometry of a 3D object as a set of primitives as well as its latent hierarchical structure without any part-level supervision. Our model recovers the higher-level structural decomposition of various objects in the form of a binary tree of primitives, where simple parts are represented with fewer primitives and more complex parts are modeled with more components. We demonstrate that considering the latent hierarchical layout of an object into parts facilitates reasoning about the 3D object geometry. Finally, we propose a neural network architecture for synthesizing indoor scenes by plausibly arranging objects within the scene boundaries. In particular, given a room type (e.g., bedroom, living room) and its shape, our model generates meaningful object arrangements by sequentially placing objects in a permutation-invariant fashion. In contrast to prior work, which poses scene synthesis as a sequence generation problem, our model generates rooms as unordered sets of objects. This enables various interactive scenarios such as room completion, failure case correction, and object suggestions with user-provided constraints. To summarize, we propose novel primitive-based representations that do not limit the available shape vocabulary to a specific set of shapes such as cuboids, spheres, or planes. 
Next, we introduce a structure-aware representation that considers part relationships and represents object parts at multiple levels of granularity, where geometrically complex parts are modeled with more components and simpler parts with fewer components. Finally, we propose a network architecture that generates indoor scenes by properly arranging objects within a room's boundaries. Our model enables new interactive applications for semi-automated scene authoring that were not possible before.
en_US
dc.format
application/pdf
en_US
dc.language.iso
en
en_US
dc.publisher
ETH Zurich
en_US
dc.rights.uri
http://rightsstatements.org/page/InC-NC/1.0/
dc.subject
Primitive-based representations
en_US
dc.subject
3D reconstruction
en_US
dc.subject
Structure-aware representations
en_US
dc.subject
Scene understanding
en_US
dc.subject
Scene synthesis
en_US
dc.subject
Interpretable representations
en_US
dc.subject
Unsupervised learning
en_US
dc.subject
Generative modelling
en_US
dc.title
Learning Deep Models with Primitive-Based Representations
en_US
dc.type
Doctoral Thesis
dc.rights.license
In Copyright - Non-Commercial Use Permitted
dc.date.published
2021-12-16
ethz.size
218 p.
en_US
ethz.code.ddc
DDC - DDC::0 - Computer science, information & general works::004 - Data processing, computer science
en_US
ethz.identifier.diss
28066
en_US
ethz.publication.place
Zurich
en_US
ethz.publication.status
published
en_US
ethz.leitzahl
ETH Zürich::00002 - ETH Zürich::00012 - Lehre und Forschung::00007 - Departemente::02140 - Dep. Inf.technologie und Elektrotechnik / Dep. of Inform.Technol. Electrical Eng.::02652 - Institut für Bildverarbeitung / Computer Vision Laboratory::03514 - Van Gool, Luc / Van Gool, Luc
en_US
ethz.leitzahl.certified
ETH Zürich::00002 - ETH Zürich::00012 - Lehre und Forschung::00007 - Departemente::02140 - Dep. Inf.technologie und Elektrotechnik / Dep. of Inform.Technol. Electrical Eng.::02652 - Institut für Bildverarbeitung / Computer Vision Laboratory::03514 - Van Gool, Luc / Van Gool, Luc
en_US
ethz.date.deposited
2021-12-16T11:47:59Z
ethz.source
FORM
ethz.eth
yes
en_US
ethz.availability
Open access
en_US
ethz.rosetta.installDate
2021-12-16T12:08:00Z
ethz.rosetta.lastUpdated
2022-03-29T16:37:44Z
ethz.rosetta.versionExported
true
ethz.COinS
ctx_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.atitle=Learning%20Deep%20Models%20with%20Primitive-Based%20Representations&rft.date=2021&rft.au=Paschalidou,%20Despoina&rft.genre=unknown&rft.btitle=Learning%20Deep%20Models%20with%20Primitive-Based%20Representations