
Open access
Author:
Date: 2021
Type: Doctoral Thesis
ETH Bibliography: yes
Abstract
Sequential decision-making is an iterative process between a learning agent and an environment. We study the stochastic setting, where the learner chooses an action in each round and the environment returns a noisy feedback signal. The learner's objective is to maximize a reward function that depends on the chosen actions. This basic model has many applications, including adaptive experimental design, product recommendations, dynamic pricing and black-box optimization.
The combination of statistical uncertainty and the objective of maximizing reward creates a tension between exploration and exploitation: the learner must carefully balance actions that provide informative feedback against actions estimated to yield high reward. The fields of bandit algorithms and partial monitoring study methods to resolve this exploration-exploitation trade-off optimally, under various regularity assumptions on the feedback-reward structure.
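The interaction loop described above can be sketched in a few lines. The snippet below simulates a stochastic multi-armed bandit with epsilon-greedy action selection, one simple (and generally suboptimal) way to trade off exploration against exploitation; the arm means, noise level, and epsilon are illustrative placeholders, not values from the thesis.

```python
import numpy as np

def run_epsilon_greedy(true_means, n_rounds=1000, eps=0.1, noise=0.1, seed=0):
    """Simulate the sequential decision-making loop: each round the learner
    picks an action, the environment returns a noisy reward, and the learner
    updates its estimates. With probability eps the learner explores (random
    action); otherwise it exploits the empirically best action."""
    rng = np.random.default_rng(seed)
    n = len(true_means)
    counts, sums = np.zeros(n), np.zeros(n)
    for _ in range(n_rounds):
        if counts.min() == 0 or rng.random() < eps:
            a = int(rng.integers(n))           # explore: uniformly random action
        else:
            a = int(np.argmax(sums / counts))  # exploit: highest empirical mean
        reward = true_means[a] + noise * rng.normal()  # noisy feedback signal
        counts[a] += 1
        sums[a] += reward
    return counts, sums

counts, sums = run_epsilon_greedy(np.array([0.2, 0.5, 0.8]))
```

After enough rounds the best arm accumulates most of the pulls, but the constant exploration rate is exactly the kind of unstructured exploration that the structured methods discussed below aim to improve on.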
Two of the most widely used methods are optimistic algorithms and Thompson sampling, which have been successfully applied in numerous settings and come with strong theoretical guarantees. More recently, however, a growing body of evidence shows that optimism and Thompson sampling are not universal exploration principles. In structured models with correlated feedback, clearly suboptimal actions sometimes provide informative feedback that outweighs their cost; optimistic approaches and Thompson sampling discard such actions early on, which leads to inefficient exploration.
An alternative and less studied design principle is information-directed sampling (IDS), originally proposed in the Bayesian setting. The main contribution of this thesis is a frequentist interpretation of the IDS framework, complemented with frequentist performance guarantees for several settings with linear reward and feedback structure. Using the IDS approach, we resolve the long-standing challenge of finding an asymptotically instance-optimal algorithm for linear bandits that is simultaneously minimax optimal. We further extend the IDS approach to the more general linear partial monitoring setting, making the method applicable to a vast range of previously studied models for sequential decision-making. Along the way, we develop the theory of information-directed sampling, uncover a connection to primal-dual methods, and propose computationally faster approximations.
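At a high level, IDS samples actions from the distribution that minimizes the information ratio: squared expected regret divided by information gain. A standard observation is that a minimizer supported on at most two actions always exists, so a grid search over action pairs suffices for a finite action set. The sketch below illustrates only this optimization step; the regret and information-gain estimates passed in are placeholder inputs, and how they are computed (the core of the frequentist IDS framework) is not shown.

```python
import numpy as np

def ids_distribution(deltas, gains, grid=1001):
    """Minimize the information ratio psi(p) = <p, deltas>^2 / <p, gains>
    over probability distributions p, searching mixtures of two actions
    (a two-point minimizer always exists). deltas[i] is an estimate of the
    regret of action i, gains[i] of its information gain."""
    n = len(deltas)
    best_ratio, best_p = np.inf, None
    qs = np.linspace(0.0, 1.0, grid)  # weight on the first action of the pair
    for i in range(n):
        for j in range(n):
            d = qs * deltas[i] + (1 - qs) * deltas[j]  # expected regret
            g = qs * gains[i] + (1 - qs) * gains[j]    # expected info gain
            with np.errstate(divide="ignore", invalid="ignore"):
                psi = np.where(g > 0, d ** 2 / g, np.inf)
            k = int(np.argmin(psi))
            if psi[k] < best_ratio:
                p = np.zeros(n)
                p[i] = qs[k]
                p[j] += 1 - qs[k]
                best_ratio, best_p = float(psi[k]), p
    return best_ratio, best_p

# Toy inputs: action 0 has higher regret but is far more informative.
ratio, p = ids_distribution(np.array([1.0, 0.5]), np.array([2.0, 0.25]))
```

On these toy numbers the best deterministic choice attains ratio 0.5, while a mixture of the two actions attains roughly 0.49: a small example of why IDS deliberately randomizes, sometimes over clearly suboptimal but informative actions.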
Lastly, we discuss extensions of the IDS framework to contextual decision-making and the kernelized setting, and highlight example applications.
Permanent link: https://doi.org/10.3929/ethz-b-000494381
Publication status: published
External links: Search print copy at ETH Library
Publisher: ETH Zurich
Subject: Sequential Decision Making; Partial Monitoring; Information-Directed Sampling
Organisational unit: 03908 - Krause, Andreas / Krause, Andreas