“Where am I?” Scene Retrieval with Language
METADATA ONLY
Date
2025
Publication Type
Conference Paper
ETH Bibliography
yes
Abstract
Natural language interfaces to embodied AI are becoming increasingly common in daily life. This opens up further opportunities for language-based interaction with embodied agents, such as a user verbally instructing an agent to execute a task in a specific location, for example, “put the bowls back in the cupboard next to the fridge” or “meet me at the intersection under the red sign.” We therefore need methods that interface between natural language and map representations of the environment. To this end, we explore whether an open-set natural language query can be used to identify a scene represented by a 3D scene graph. We define this task as “language-based scene-retrieval.” It is closely related to “coarse-localization,” but here we search for a match among a collection of disjoint scenes and not necessarily within a large-scale continuous map. We present Text2SceneGraphMatcher, a “scene-retrieval” pipeline that learns joint embeddings between text descriptions and scene graphs to determine whether they match. The code, trained models, and datasets will be made public.
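The retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes that a text query and each candidate scene graph have already been mapped into a shared embedding space, and it ranks scenes by cosine similarity to the query. The function names (`cosine_similarity`, `retrieve_scenes`), the scene labels, and the toy 3-dimensional vectors are all hypothetical; this is not the Text2SceneGraphMatcher implementation, and the paper's actual encoders and matching score may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_scenes(text_embedding, scene_embeddings, top_k=1):
    """Rank candidate scenes by similarity to the text embedding.

    scene_embeddings maps a scene id to its (hypothetical) precomputed
    scene-graph embedding. Returns up to top_k (scene_id, score) pairs,
    best match first.
    """
    scored = [
        (scene_id, cosine_similarity(text_embedding, embedding))
        for scene_id, embedding in scene_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings standing in for the output of learned encoders (assumption).
scene_embeddings = {
    "kitchen": [0.9, 0.1, 0.0],
    "office": [0.1, 0.9, 0.1],
    "street": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of a query such as "the cupboard next to the fridge".
query_embedding = [0.8, 0.2, 0.1]

ranking = retrieve_scenes(query_embedding, scene_embeddings, top_k=3)
for scene_id, score in ranking:
    print(f"{scene_id}: {score:.3f}")
```

With these toy vectors the query ranks the "kitchen" scene first. In a learned pipeline, the embeddings would come from trained text and scene-graph encoders instead of hand-written lists.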
Publication status
published
Book title
Computer Vision – ECCV 2024
Journal / series
Volume
15095
Pages / Article No.
201 - 220
Publisher
Springer
Event
18th European Conference on Computer Vision (ECCV 2024)
Subject
Scene graphs; Text-based localization; Scene retrieval; Cross-modal learning; Coarse-localization
Organisational unit
03766 - Pollefeys, Marc / Pollefeys, Marc
Related publications and datasets
Is supplemented by: https://whereami-langloc.github.io/
Is new version of: https://www.ecva.net/papers/eccv_2024/papers_ECCV/html/5418_ECCV_2024_paper.php