Directional Concept Scales for Latent Space Interpretability


Author / Producer

Date

2025

Publication Type

Doctoral Thesis

ETH Bibliography

yes

Abstract

Generative machine learning models have gained widespread adoption for their remarkable performance in producing human-level outputs across machine vision and natural language processing domains. From probabilistic topic models to state-space and large language models, they describe the processes through which observed data is generated from latent variables. In some cases, these latent variables represent semantically meaningful axes of variation within the data and correspond to human-interpretable concepts. Concepts are particularly interpretable when they manifest as one-dimensional, linear, and directional scales in a model’s latent representation space. The abstract concept of “sentiment”, for instance, can only be inferred indirectly through observable representational measurements such as tone or sentiment-laden words in text. These surface-level features may be driven by an underlying latent representation that forms a directional scale ranging from negative to positive sentiment polarity. In this context, generative models can be seen as operationalizing a measurement modeling setting by capturing abstract concepts in their latents; accessing and evaluating this information calls for interpretability methods. This thesis explores latent space interpretability methods to uncover directional concept scales within generative models. It proceeds from simpler to more complex models, reflecting a shift from active to passive interpretability. Active interpretability pertains to designing a model with latent scales explicitly in mind, whereas passive methods extract these scales post hoc, after training. In both paradigms, we impose structural constraints on the model’s latent representations, largely in the form of directional semantics that convert non-directional latent manifolds into ordinal or cardinal scales. For instance, we apply bijective ordering transformations to the parameters of probabilistic topic models, and introduce a structural constraint that mitigates the label-switching problem in state-space models. Further, we train probing models with an objective that enforces logical constraints on the latent representations of Transformer-based language models. Finally, we define criteria for intervening on the latent mechanisms of a language model, allowing for scale-controlled steering of the model’s predictions. Beyond this variety of models, we apply these methods to different data modalities, including tabular, sequential, and natural language data. Sentiment-related concepts serve as an application testbed, showcasing the potential for interpretability methods to advance measurement practices in the quantitative social sciences. When a latent space constraint represents the right inductive bias for a given task, performance and interpretability can complement each other.
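
As a minimal sketch of the kind of structural constraint described above, the code below illustrates a bijective ordering transformation: unconstrained real parameters are mapped to a strictly increasing vector via a cumulative sum of exponentials, so the ordering of the resulting latent scale is fixed by construction. The NumPy code and function names are illustrative assumptions, not the implementation used in the thesis.

    import numpy as np

    def to_ordered(z):
        # Map an unconstrained vector z in R^K to a strictly increasing
        # vector: keep the first entry, then add exp(z_k) > 0 at each step.
        increments = np.concatenate(([z[0]], np.exp(z[1:])))
        return np.cumsum(increments)

    def to_unconstrained(x):
        # Inverse map: recover z from a strictly increasing vector x.
        return np.concatenate(([x[0]], np.log(np.diff(x))))

    # Any unconstrained draw yields ordered parameters, so a direction
    # along the resulting scale is identified by construction.
    z = np.array([0.3, -1.2, 0.5, 0.0])
    x = to_ordered(z)
    assert np.all(np.diff(x) > 0)
    assert np.allclose(to_unconstrained(x), z)

Assuming the underlying parameters are exchangeable, restricting them to an ordered representative does not change the model class; it only removes the permutation symmetry that gives rise to label switching.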

Publication status

published

Contributors

Examiner: Cotterell, Ryan
Examiner: West, Robert
Examiner: Leippold, Markus

Publisher

ETH Zurich

Organisational unit

09682 - Cotterell, Ryan / Cotterell, Ryan
