
Open access
Author
Date
2020
Type
Doctoral Thesis
ETH Bibliography
yes
Abstract
Knowing where a user is looking allows intelligent systems to keep track of the user's needs and respond immediately to changes. The availability of eye gaze information enables new applications in adaptive user interfaces, increased context awareness in intelligent assistants, and direct assistance in the execution of complex tasks in settings such as augmented reality, cockpits, mobile devices in the field, and the office. Recent works have proposed large-scale in-the-wild datasets and deep-learning-based approaches for estimating eye gaze from webcam images, making gaze estimation possible in unmodified and unconstrained environments. However, to truly bring gaze estimation to the masses, large gains in performance must still be made while requiring as few manual interventions from the user as possible. We suggest that by incorporating prior knowledge of eye shape and eyeball anatomy into the design and training of deep neural networks, we can achieve meaningful improvements in performance both with and without a few manually labeled samples from the end user.
In this thesis, we explore the space of learning-based representations for webcam-based gaze estimation, in particular by proposing novel explicitly defined representations, as well as training methods and neural network architectures for learning implicitly defined representations. We show through evaluations on publicly available datasets that our representations improve gaze estimator performance both in the absence of labeled samples from the end user and when only a few such labeled samples are available. Our contributions thus push the boundaries of what is possible with webcam-based gaze estimation, making novel applications more accessible.
First, we propose to learn eye-region landmarks as an intermediate representation for conventional gaze estimation methods. Describing eyelid and iris shape via landmark coordinates allows for better generalization across datasets and easier adaptation to previously unseen users given just a few labeled samples.
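As a minimal sketch of this idea, the snippet below fits a person-specific gaze regressor on top of already-detected eye-region landmarks. The landmark detector itself is omitted, and the feature normalization and SVR-based regressor shown here are illustrative assumptions rather than the thesis' exact pipeline:

```python
# Minimal sketch: person-specific gaze regression from eye-region landmarks.
import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

def landmarks_to_features(landmarks: np.ndarray) -> np.ndarray:
    """Flatten K 2-D eye-region landmarks into a translation- and
    scale-normalized feature vector."""
    centred = landmarks - landmarks.mean(axis=1, keepdims=True)
    scale = np.linalg.norm(centred, axis=(1, 2)).reshape(-1, 1, 1) + 1e-8
    return (centred / scale).reshape(len(landmarks), -1)

rng = np.random.default_rng(0)
# A handful of labeled calibration samples from the target user:
# 9 samples, 18 landmarks each; gaze is (pitch, yaw) in radians.
train_lms = rng.normal(size=(9, 18, 2))
train_gaze = rng.uniform(-0.4, 0.4, size=(9, 2))

model = MultiOutputRegressor(SVR(kernel="rbf", C=10.0))
model.fit(landmarks_to_features(train_lms), train_gaze)

test_lms = rng.normal(size=(3, 18, 2))
print(model.predict(landmarks_to_features(test_lms)))  # -> (3, 2) pitch/yaw
```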
Second, we propose a pictorial intermediate representation that can be produced from gaze direction labels alone. This representation effectively decomposes gaze direction estimation into two easier sub-problems, yielding large performance improvements in cross-person gaze estimation.
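A minimal sketch of such a pictorial representation follows: a synthetic "gazemap" rendered from a (pitch, yaw) label alone, with an eyeball mask and an iris mask whose position encodes the gaze direction. The exact geometry (image size, eyeball and iris radii) is an assumption made for illustration:

```python
# Minimal sketch: render a pictorial gaze representation from a label alone.
import numpy as np

def render_gazemap(pitch: float, yaw: float, h: int = 36, w: int = 60) -> np.ndarray:
    """Two-channel image: channel 0 = eyeball mask, channel 1 = iris mask.
    The iris centre is the eyeball centre displaced by the projected gaze
    direction, so the map encodes gaze purely pictorially."""
    yy, xx = np.mgrid[0:h, 0:w]
    cy, cx = h / 2.0, w / 2.0
    r_eye = h / 2.0 - 1.0          # eyeball radius in pixels (assumed)
    r_iris = r_eye / 2.5           # iris radius (assumed)
    # Orthographic projection of the 3-D gaze direction onto the image plane.
    ix = cx + r_eye * np.cos(pitch) * np.sin(yaw)
    iy = cy - r_eye * np.sin(pitch)
    eyeball = (yy - cy) ** 2 + (xx - cx) ** 2 <= r_eye ** 2
    iris = (yy - iy) ** 2 + (xx - ix) ** 2 <= r_iris ** 2
    return np.stack([eyeball, iris & eyeball]).astype(np.float32)

gm = render_gazemap(pitch=0.1, yaw=-0.3)
print(gm.shape)  # (2, 36, 60): a network first predicts this map, then a
                 # lightweight regressor reads (pitch, yaw) back out of it.
```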
Third, we explore learning equivariance to gaze direction changes while disentangling the effects of head orientation from gaze direction via a novel disentangling and transforming neural network architecture. Furthermore, we use this representation as input to a meta-learning scheme, yielding large performance improvements when adapting to a target user with as little as a single labeled sample.
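The core equivariance idea can be sketched as rotating a structured latent code by a known gaze difference, as in the minimal PyTorch snippet below. The encoder and decoder are omitted, and laying out the gaze code as 3-D feature points is an assumption for illustration:

```python
# Minimal sketch: a gaze-equivariant latent code rotated by a known angle.
import math
import torch

def rotation_matrix(pitch: float, yaw: float) -> torch.Tensor:
    """3x3 rotation composed from pitch (about x) and yaw (about y)."""
    cp, sp = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    rx = torch.tensor([[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]])
    ry = torch.tensor([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    return ry @ rx

# The gaze part of the latent code is laid out as F 3-D "feature points"
# that rotate rigidly with gaze, making the embedding equivariant to
# gaze direction changes.
z_gaze = torch.randn(16, 3)               # from a (hypothetical) encoder
R = rotation_matrix(pitch=0.1, yaw=-0.2)  # known source-to-target difference
z_rotated = z_gaze @ R.T                  # a decoder would reconstruct the
                                          # target image from z_rotated
```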
Last, we suggest that the spatio-temporal relation between the visual stimulus presented to a user and their apparent eye movements should be modeled jointly by a neural network. To achieve this, we collect a novel video-based dataset from 54 participants with synchronized multi-camera views and screen-content video. Our proposed architecture for automatic gaze-estimate refinement yields large performance improvements even in the absence of any labeled samples from the target user. This is enabled in particular by a data augmentation scheme that mimics the unique person-specific offsets in the definition of line of sight.
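The offset augmentation can be sketched as follows: an entire sequence of gaze estimates is shifted by one constant random angular offset, which a refinement network must then learn to undo. The offset magnitude and tensor shapes are assumptions for illustration:

```python
# Minimal sketch: per-sequence constant-offset augmentation for refinement.
import math
import torch

def augment_with_person_offset(gaze_seq: torch.Tensor,
                               max_offset_deg: float = 3.0) -> torch.Tensor:
    """gaze_seq: (T, 2) pitch/yaw estimates in radians for one video.
    One random (pitch, yaw) offset is applied to every frame, imitating
    a person-specific line-of-sight offset the model must learn to undo."""
    offset = (torch.rand(2) * 2.0 - 1.0) * math.radians(max_offset_deg)
    return gaze_seq + offset  # same offset broadcast over all T frames

seq = torch.zeros(100, 2)                # per-frame initial gaze estimates
noisy = augment_with_person_offset(seq)  # refinement-network input; the
                                         # original `seq` is the target
```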
Permanent link
https://doi.org/10.3929/ethz-b-000469238
Publication status
published
External links
Search print copy at ETH Library
Contributors
Examiner: Hilliges, Otmar
Examiner: Odobez, Jean-Marc
Examiner: De Mello, Shalini
Examiner: Tang, Siyu
Publisher
ETH Zurich
Organisational unit
03979 - Hilliges, Otmar / Hilliges, Otmar