Emre Aksan


Last Name: Aksan
First Name: Emre

Search Results

Publications 1 - 10 of 20
  • Aksan, Emre; Hilliges, Otmar (2019)
    arXiv
    Convolutional architectures have recently been shown to be competitive on many sequence modelling tasks when compared to the de-facto standard of recurrent neural networks (RNNs), while providing computational and modeling advantages due to inherent parallelism. However, currently there remains a performance gap to more expressive stochastic RNN variants, especially those with several layers of dependent random variables. In this work, we propose stochastic temporal convolutional networks (STCNs), a novel architecture that combines the computational advantages of temporal convolutional networks (TCN) with the representational power and robustness of stochastic latent spaces. In particular, we propose a hierarchy of stochastic latent variables that captures temporal dependencies at different time-scales. The architecture is modular and flexible due to the decoupling of the deterministic and stochastic layers. We show that the proposed architecture achieves state-of-the-art log-likelihoods across several tasks. Finally, the model is capable of predicting high-quality synthetic samples over a long-range temporal horizon when modeling handwritten text.
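    As a rough illustration of the idea in this abstract, the sketch below pairs a dilated temporal convolution stack with a small ladder of Gaussian latent variables, one per deterministic level. It is a minimal, hypothetical PyTorch sketch, not the authors' implementation; the two-level sizes, the name `StochasticTCN`, and the decoding scheme are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): a dilated temporal conv stack whose
# per-level features parameterize a hierarchy of Gaussian latent variables.
import torch
import torch.nn as nn


class StochasticTCN(nn.Module):
    def __init__(self, in_dim=3, hidden=64, latent=16, levels=3):
        super().__init__()
        # Deterministic TCN backbone: causal, dilated 1D convolutions.
        self.convs = nn.ModuleList([
            nn.Conv1d(in_dim if i == 0 else hidden, hidden,
                      kernel_size=2, dilation=2 ** i, padding=2 ** i)
            for i in range(levels)
        ])
        # One Gaussian latent variable per level (a "ladder" of latents).
        self.to_mu = nn.ModuleList([nn.Conv1d(hidden, latent, 1) for _ in range(levels)])
        self.to_logvar = nn.ModuleList([nn.Conv1d(hidden, latent, 1) for _ in range(levels)])
        self.decoder = nn.Conv1d(latent * levels, in_dim, 1)

    def forward(self, x):
        # x: (batch, in_dim, time)
        h, zs = x, []
        for conv, mu_l, lv_l in zip(self.convs, self.to_mu, self.to_logvar):
            h = torch.relu(conv(h))[..., :x.shape[-1]]  # trim padding to stay causal
            mu, logvar = mu_l(h), lv_l(h)
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
            zs.append(z)
        # Decode from the concatenated latent hierarchy.
        return self.decoder(torch.cat(zs, dim=1))


out = StochasticTCN()(torch.randn(8, 3, 120))  # e.g. pen strokes over 120 steps
```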
  • Kaufmann, Manuel; Aksan, Emre; Song, Jie; et al. (2020)
    2020 International Conference on 3D Vision (3DV)
    In this paper, we propose a convolutional autoencoder to address the problem of motion infilling for 3D human motion data. Given a start and end sequence, motion infilling aims to complete the missing gap in between, such that the filled-in poses plausibly forecast the start sequence and naturally transition into the end sequence. To this end, we propose a single, end-to-end trainable convolutional autoencoder. We show that a single model can be used to create natural transitions between different types of activities. Furthermore, our method is not only able to fill in entire missing frames, but it can also be used to complete gaps where partial poses are available (e.g. from end effectors), or to clean up other forms of noise (e.g. Gaussian). Also, the model can fill in an arbitrary number of gaps that potentially vary in length. In addition, no further post-processing of the model’s outputs, such as smoothing or closing discontinuities at the end of the gap, is necessary. At the heart of our approach lies the idea of casting motion infilling as an inpainting problem and training a convolutional de-noising autoencoder on image-like representations of motion sequences. At training time, blocks of columns are removed from such images and we ask the model to fill in the gaps. We demonstrate the versatility of the approach via a number of complex motion sequences and report on thorough evaluations performed to better understand the capabilities and limitations of the proposed approach.
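    The inpainting formulation above lends itself to a very simple training-time corruption step: treat a motion clip as a (pose-dimension x frame) image and zero out a block of columns for the model to reconstruct. The NumPy sketch below is only illustrative; the shapes, the gap length, and the function `mask_gap` are hypothetical, not the authors' pipeline.

```python
# Illustrative sketch (not the authors' pipeline): corrupt an image-like motion
# representation by removing a block of frames (columns) for the model to infill.
import numpy as np

def mask_gap(motion, gap_len, rng=np.random.default_rng()):
    """motion: (pose_dim, n_frames) array; returns corrupted copy and binary mask."""
    pose_dim, n_frames = motion.shape
    start = rng.integers(0, n_frames - gap_len)   # random gap position
    mask = np.ones_like(motion)
    mask[:, start:start + gap_len] = 0.0          # drop a block of columns
    return motion * mask, mask

clip = np.random.randn(72, 240)                   # e.g. 24 joints x 3 dims, 240 frames
corrupted, mask = mask_gap(clip, gap_len=60)
# A denoising autoencoder would then be trained to reconstruct `clip` from `corrupted`.
```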
  • Wang, Xi; Li, Gen; Kuo, Yen-Ling; et al. (2022)
    2022 International Conference on 3D Vision (3DV)
    We present a method for inferring diverse 3D models of human-object interactions from images. Reasoning about how humans interact with objects in complex scenes from a single 2D image is a challenging task given ambiguities arising from the loss of information through projection. In addition, modeling 3D interactions requires the ability to generalize across diverse object categories and interaction types. We propose an action-conditioned modeling of interactions that allows us to infer diverse 3D arrangements of humans and objects without supervision on contact regions or 3D scene geometry. Our method extracts high-level commonsense knowledge from large language models (such as GPT-3), and applies it to perform 3D reasoning about human-object interactions. Our key insight is that priors extracted from large language models can help in reasoning about human-object contacts from textual prompts alone. We quantitatively evaluate the inferred 3D models on a large human-object interaction dataset and show how our method leads to better 3D reconstructions. We further qualitatively evaluate the effectiveness of our method on real images and demonstrate its generalizability across interaction types and object categories.
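    Purely as an illustration of how commonsense contact priors might be queried from a language model, the snippet below builds a textual prompt for a given action and object and parses a free-text answer. The prompt wording, the body-part list, and the `query_llm` placeholder are hypothetical, not the paper's actual prompts or API.

```python
# Hypothetical sketch: querying a language model for likely human-object contacts.
# `query_llm` is a placeholder; the prompts and parsing used in the paper differ.
BODY_PARTS = ["hands", "hips", "back", "feet", "thighs"]

def contact_prompt(action: str, obj: str) -> str:
    parts = ", ".join(BODY_PARTS)
    return (f"A person is {action} a {obj}. "
            f"Which of these body parts are most likely in contact with the {obj}? "
            f"Choose from: {parts}. Answer with a comma-separated list.")

def parse_contacts(answer: str) -> list[str]:
    return [p for p in BODY_PARTS if p in answer.lower()]

prompt = contact_prompt("sitting on", "chair")
# answer = query_llm(prompt)            # placeholder for an LLM call
# contacts = parse_contacts(answer)     # e.g. ["hips", "back", "thighs"]
```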
  • Ghosh, Partha; Song, Jie; Aksan, Emre; et al. (2017)
    2017 International Conference on 3D Vision (3DV)
  • Aksan, Emre; Ma, Shugao; Caliskan, Akin; et al. (2022)
    Lecture Notes in Computer Science ~ Computer Vision – ECCV 2022
    Neural face avatars that are trained from multi-view data captured in camera domes can produce photo-realistic 3D reconstructions. However, at inference time, they must be driven by limited inputs such as partial views recorded by headset-mounted cameras or a front-facing camera, and sparse facial landmarks. To mitigate this asymmetry, we introduce a prior model that is conditioned on the runtime inputs and tie this prior space to the 3D face model via a normalizing flow in the latent space. Our proposed model, LiP-Flow, consists of two encoders that learn representations from the rich training-time and impoverished inference-time observations. A normalizing flow bridges the two representation spaces and transforms latent samples from one domain to another, allowing us to define a latent likelihood objective. We train our model end-to-end to maximize the similarity of both representation spaces and the reconstruction quality, making the 3D face model aware of the limited driving signals. We conduct extensive evaluations where the latent codes are optimized to reconstruct 3D avatars from partial or sparse observations. We show that our approach leads to an expressive and effective prior, better capturing facial dynamics and subtle expressions. Check out our project page for an overview.
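    To make the two-encoder idea concrete, the sketch below pairs a "rich" training-time encoder and an "impoverished" runtime encoder with a simple invertible per-dimension affine map between their latent spaces, so a latent-likelihood term can be computed. This is a heavily simplified, hypothetical sketch (a real normalizing flow would use e.g. coupling layers); all module names and dimensions are illustrative, not LiP-Flow itself.

```python
# Simplified sketch (not LiP-Flow itself): two encoders linked by an invertible
# elementwise affine map, giving a tractable latent-likelihood term.
import torch
import torch.nn as nn

class TwoEncoderFlow(nn.Module):
    def __init__(self, rich_dim=256, sparse_dim=32, latent=64):
        super().__init__()
        self.enc_rich = nn.Linear(rich_dim, latent)      # training-time observations
        self.enc_sparse = nn.Linear(sparse_dim, latent)  # runtime observations
        # Invertible elementwise affine "flow": z_rich ~ a * z_sparse + b
        self.log_scale = nn.Parameter(torch.zeros(latent))
        self.shift = nn.Parameter(torch.zeros(latent))

    def latent_log_likelihood(self, x_rich, x_sparse):
        z_rich = self.enc_rich(x_rich)
        z_sparse = self.enc_sparse(x_sparse)
        z_mapped = z_sparse * self.log_scale.exp() + self.shift  # sparse -> rich space
        # Gaussian log-density of the mapped code around the rich code, plus the
        # log|det Jacobian| of the affine map (change of variables).
        log_prob = -0.5 * ((z_mapped - z_rich) ** 2).sum(-1)
        return log_prob + self.log_scale.sum()

model = TwoEncoderFlow()
ll = model.latent_log_likelihood(torch.randn(4, 256), torch.randn(4, 32))
```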
  • Christen, Sammy; Kocabas, Muhammed; Aksan, Emre; et al. (2022)
    2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
    We introduce the dynamic grasp synthesis task: given an object with a known 6D pose and a grasp reference, our goal is to generate motions that move the object to a target 6D pose. This is challenging, because it requires reasoning about the complex articulation of the human hand and the intricate physical interaction with the object. We propose a novel method that frames this problem in the reinforcement learning framework and leverages a physics simulation, both to learn and to evaluate such dynamic interactions. A hierarchical approach decomposes the task into low-level grasping and high-level motion synthesis. It can be used to generate novel hand sequences that approach, grasp, and move an object to a desired location, while retaining human-likeness. We show that our approach leads to stable grasps and generates a wide range of motions. Furthermore, even imperfect labels can be corrected by our method to generate dynamic interaction sequences. Video and code are available at: https://eth-ait.github.io/d-grasp/.
  • Aksan, Emre; Hilliges, Otmar (2021)
    Human–Computer Interaction Series ~ Artificial Intelligence for Human Computer Interaction: A Modern Approach
    Digital ink promises to combine the flexibility of pen-and-paper interaction with the versatility of digital devices. Computational models of digital ink often focus on recognition of the content using discriminative techniques such as classification, albeit at the cost of ignoring or losing personalized style. In this chapter, we propose augmenting the digital ink framework via generative modeling to achieve a holistic understanding of the ink content. Our focus particularly lies in developing novel generative models to gain fine-grained control by preserving user style. To this end, we model the inking process and learn to create ink samples similar to those of a given user. We first present how digital handwriting can be disentangled into style and content to implement editable digital ink, enabling content synthesis and editing. Second, we address a more complex setup of free-form sketching and propose a novel approach for modeling stroke-based data efficiently. Generative ink promises novel functionalities, leading to compelling applications that enhance the inking experience for users in an interactive and collaborative manner.
  • Aksan, Emre (2022)
    Humans possess a comprehensive set of interaction capabilities at various levels of abstraction, including physical activities, verbal and non-verbal cues, and abstract communication skills, to interact with the physical world, express themselves, and communicate with others. In the quest to digitize humans, we must seek answers to the problems of how to represent humans and how to establish human-like interactions on digital mediums. A critical issue is that human activities exhibit complex and rich dynamic behavior that is non-linear, time-varying, and context-dependent, properties that are typically infeasible to define rigorously. In this thesis, we are primarily interested in modeling complex processes like how humans look, move, and communicate, and in generating novel samples that are similar to those produced by humans. To do so, we propose using the deep generative modeling framework, which is capable of learning the underlying data generation process directly from observations. Over the course of this thesis, we showcase generative modeling strategies at various levels of abstraction and demonstrate how they can be used to model humans and synthesize plausible and realistic interactions. Specifically, we present three problems that are different in modality and complexity, yet related in terms of the modeling strategies. We first introduce the task of modeling free-form human actions like drawings and handwritten text. Our work focuses on personalization and generalization concepts by learning latent representations of writing style or drawing content. Second, we present the 3D human motion modeling task, where we aim to learn spatio-temporal representations to capture motion dynamics for both accurate short-term and plausible long-term motion predictions. Finally, we focus on learning an expressive representation space for the synthesis and animation of photo-realistic face avatars. Our proposed model is able to create a personalized 3D avatar from rich training data and animate it via impoverished observations at runtime. Our results in different tasks support our hypothesis that deep generative models are able to learn structured representations and capture human dynamics from unstructured observations. Accordingly, the contributions in this thesis aim to demonstrate that the deep generative modeling framework is a promising instrument, paving the way for digitizing humans.
  • Huang, Yinghao; Kaufmann, Manuel; Aksan, Emre; et al. (2018)
    ACM Transactions on Graphics
  • Christen, Sammy; Jendele, Lukas; Aksan, Emre; et al. (2021)
    IEEE Robotics and Automation Letters
    We present HiDe, a novel hierarchical reinforcement learning architecture that successfully solves long horizon control tasks and generalizes to unseen test scenarios. Functional decomposition between planning and low-level control is achieved by explicitly separating the state-action spaces across the hierarchy, which allows the integration of task-relevant knowledge per layer. We propose an RL-based planner to efficiently leverage the information in the planning layer of the hierarchy, while the control layer learns a goal-conditioned control policy. The hierarchy is trained jointly but allows for the modular transfer of policy layers across hierarchies of different agents. We experimentally show that our method generalizes across unseen test environments and can scale to 3x the horizon length compared to both learning-based and non-learning-based methods. We evaluate on complex continuous control tasks with sparse rewards, including navigation and robot manipulation.
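    The planner/controller separation described above can be illustrated with a generic two-level control loop: a high-level planner emits a sub-goal on a coarse timescale and a goal-conditioned low-level policy acts toward it on every step. The sketch below is schematic and hypothetical (not the HiDe implementation); `planner`, `controller`, and the environment interface are placeholders.

```python
# Schematic sketch (not the HiDe code): a planner proposes sub-goals on a coarse
# timescale; a goal-conditioned controller acts on the fine timescale.
def run_episode(env, planner, controller, subgoal_every=10, max_steps=500):
    state = env.reset()
    total_reward, subgoal = 0.0, None
    for t in range(max_steps):
        if t % subgoal_every == 0:
            # Planning layer sees an abstract state (e.g. agent position on a map).
            subgoal = planner.act(env.abstract_state(state))
        # Control layer is goal-conditioned on the current sub-goal.
        action = controller.act(state, subgoal)
        state, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward
```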