Adaptive Experimentation and Policy Learning

Paper Session

Friday, Jan. 6, 2023 8:00 AM - 10:00 AM (CST)

Hilton Riverside, Quarterdeck C
Hosted By: American Economic Association
  • Chair: Toru Kitagawa, Brown University

Best Arm Identification with a Fixed Budget under a Small Gap

Masahiro Kato, CyberAgent, Inc.
Kaito Ariu, CyberAgent, Inc.
Masaaki Imaizumi, University of Tokyo
Masatoshi Uehara, Cornell University
Masahiro Nomura, CyberAgent, Inc.

Abstract

We consider the fixed-budget best arm identification problem in two-armed Gaussian bandits with unknown variances. The tightest lower bound on the complexity and an algorithm whose performance guarantee matches the lower bound have long been open problems when the variances are unknown and when the algorithm is agnostic to the optimal proportion of the arm draws. In this paper, we propose a strategy comprising a sampling rule with randomized sampling (RS) following the estimated target allocation probabilities of arm draws and a recommendation rule using the augmented inverse probability weighting (AIPW) estimator, which is often used in the causal inference literature. We refer to our strategy as the RS-AIPW strategy. In the theoretical analysis, we first derive a large deviation principle for martingales, which can be used when the second moment converges in mean, and apply it to our proposed strategy. Then, we show that the proposed strategy is asymptotically optimal in the sense that the probability of misidentification achieves the lower bound by Kaufmann et al. (2016) when the sample size becomes infinitely large and the gap between the two arms goes to zero.
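
As a rough illustration of this type of strategy, the sketch below assumes a two-armed Gaussian bandit, a Neyman-style target allocation proportional to the estimated standard deviations, and a simple AIPW recommendation at the end of the budget. The function name rs_aipw_sketch and all implementation details are illustrative assumptions and may differ from the paper's RS-AIPW strategy.

    import numpy as np

    def rs_aipw_sketch(reward_fn, T, rng=None):
        """Illustrative fixed-budget best-arm identification for two Gaussian arms.

        reward_fn(arm) draws one reward from the chosen arm (0 or 1).
        Returns the index of the recommended arm.
        """
        rng = rng or np.random.default_rng()
        obs = [[], []]                      # rewards observed so far, per arm
        aipw_terms = [[], []]               # per-round AIPW contributions, per arm

        for t in range(T):
            # Plug-in estimates built from past data only (mild regularization).
            means = [np.mean(o) if o else 0.0 for o in obs]
            stds = [(np.std(o) + 1e-3) if len(o) > 1 else 1.0 for o in obs]

            # Target allocation: draw arms in proportion to estimated standard
            # deviations (a Neyman-type allocation); randomized sampling (RS)
            # follows these estimated probabilities.
            p1 = stds[1] / (stds[0] + stds[1])
            probs = np.array([1.0 - p1, p1])

            arm = rng.choice(2, p=probs)
            y = reward_fn(arm)
            obs[arm].append(y)

            # AIPW contribution for each arm: m_hat + 1{A_t = a} (Y_t - m_hat) / p_a.
            for a in range(2):
                correction = (y - means[a]) / probs[a] if a == arm else 0.0
                aipw_terms[a].append(means[a] + correction)

        # Recommend the arm with the larger AIPW estimate of its mean.
        aipw_means = [np.mean(terms) for terms in aipw_terms]
        return int(np.argmax(aipw_means))

The inverse-probability correction divides the residual by the realized sampling probability, which is what keeps the arm-mean estimates unbiased even though the allocation adapts to past data.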

Treatment Choice with Nonlinear Regret

Toru Kitagawa, Brown University
Sokbae Lee, Columbia University
Chen Qiu, Cornell University

Abstract

The literature on treatment choice focuses on the mean of welfare regret. Ignoring other features of the regret distribution, however, can lead to an undesirable rule due to sampling uncertainty. Instead, we propose to minimize the mean of a nonlinear transformation of welfare regret. This paradigm shift alters optimal rules drastically. We show that for a wide class of nonlinear criteria, admissible rules are fractional. Focusing on mean square regret, we derive the closed-form probabilities of randomization for finite-sample Bayes and minimax optimal rules when data are normal with known variance. The minimax rule is a simple logit based on the sample mean and agrees with the posterior probability of a positive treatment effect under the least favorable prior. The Bayes rule with an uninformative prior is different but produces quantitatively comparable mean square regret. We extend these results to limit experiments and discuss our findings through sample size calculations.
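
For intuition about a fractional rule, the sketch below assumes a sample of size n from a normal distribution with known variance and treats with a probability given by a logistic transformation of the standardized sample mean. The scale constant k and the function name are placeholders chosen for illustration, not the coefficient or form of the paper's minimax rule.

    import numpy as np

    def fractional_treatment_rule(sample_mean, sigma, n, k=1.0, rng=None):
        """Illustrative fractional (randomized) treatment rule.

        Treats with a probability given by a logit in the standardized sample
        mean; k is a placeholder, not the paper's minimax coefficient.
        Returns (treat decision, treatment probability).
        """
        rng = rng or np.random.default_rng()
        t_stat = np.sqrt(n) * sample_mean / sigma        # standardized estimated effect
        p_treat = 1.0 / (1.0 + np.exp(-k * t_stat))      # fractional assignment probability
        return bool(rng.random() < p_treat), p_treat

Because the rule randomizes rather than thresholding at zero, small-sample noise near the decision boundary translates into intermediate treatment probabilities instead of abrupt all-or-nothing choices.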

Policy Design in Experiments with Unknown Interference

Davide Viviano, University of California-San Diego

Abstract

In this talk, I will discuss the problem of experimental design for estimation and inference on welfare-maximizing policies in the presence of spillover effects. As a first contribution, I introduce a single-wave experiment that estimates the marginal effect of a change in treatment probabilities, taking spillover effects into account. Using the marginal effect, I propose a practical test for policy optimality. The idea is that researchers should report the marginal effect and test for policy optimality: the marginal effect indicates the direction for a welfare improvement, and the test provides evidence on whether it is worth conducting additional experiments to estimate a welfare-improving treatment allocation. As a second contribution, I design a multiple-wave experiment to estimate treatment assignment rules and maximize welfare, and derive guarantees on the proposed procedure. I illustrate the benefits of the method in simulations calibrated to existing experiments on information diffusion and cash-transfer programs.
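
The sketch below is a heavily simplified illustration of estimating a marginal policy effect by finite differences, assuming clusters are independently assigned to two nearby treatment saturations p and p + delta and cluster-level welfare outcomes are observed; the paper's actual design, estimator, and optimality test are more involved, and all names below are illustrative.

    import numpy as np
    from scipy import stats

    def marginal_effect_test(welfare_low, welfare_high, delta):
        """Illustrative finite-difference estimate of the marginal welfare effect.

        welfare_low:  cluster-level welfare under saturation p
        welfare_high: cluster-level welfare under saturation p + delta
        Returns (estimate, two-sided p-value of H0: marginal effect = 0).
        """
        diff = np.mean(welfare_high) - np.mean(welfare_low)
        mde = diff / delta                                 # finite-difference marginal effect
        se = np.sqrt(np.var(welfare_high, ddof=1) / len(welfare_high)
                     + np.var(welfare_low, ddof=1) / len(welfare_low)) / delta
        t = mde / se
        df = len(welfare_low) + len(welfare_high) - 2
        pval = 2 * stats.t.sf(abs(t), df)
        return mde, pval

In this simplified reading, a rejection of the null with a positive estimate suggests that raising the treatment saturation would improve welfare and that a further experiment to locate a better allocation may be worthwhile.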

Adaptivity and Confounding in Multi-Armed Bandit Experiments

Chao Qin, Columbia University
Daniel Russo, Columbia University

Abstract

Multi-armed bandit algorithms minimize experimentation costs required to converge on optimal behavior. They do so by rapidly adapting experimentation effort away from poorly performing actions as feedback is observed. But this desirable feature makes them sensitive to confounding, which is the primary concern underlying classical randomized controlled trials. We highlight, for instance, that popular bandit algorithms cannot address the problem of identifying the best action when day-of-week effects may confound inferences. In response, this paper proposes deconfounded Thompson sampling, which makes simple, but critical, modifications to the way Thompson sampling is usually applied. Theoretical guarantees suggest the algorithm strikes a delicate balance between adaptivity and robustness to confounding. It attains asymptotic lower bounds on the number of samples required to confidently identify the best action — suggesting optimal adaptivity — but also satisfies strong performance guarantees in the presence of day-of-week effects and delayed observations — suggesting unusual robustness. At the core of the paper is a new model of contextual bandit experiments in which issues of delayed learning and distribution shift arise organically.
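
The sketch below illustrates one way to pair Thompson sampling with an additive day-of-week model: rewards are assumed Gaussian with arm effects plus day effects under a conjugate prior, and arms are drawn from the posterior over the arm effects alone, so the selection probabilities do not depend on the realized day. This is an assumption-laden illustration of the deconfounding idea, not the paper's exact algorithm.

    import numpy as np

    def deconfounded_ts_sketch(reward_fn, n_arms, n_days, T,
                               sigma2=1.0, tau2=10.0, rng=None):
        """Illustrative Thompson sampling with additive day-of-week effects.

        Assumed model: Y_t = mu[arm] + beta[day] + N(0, sigma2).
        reward_fn(arm, day) draws one reward. Returns the recommended arm.
        """
        rng = rng or np.random.default_rng()
        d = n_arms + n_days
        X, y = [], []

        for t in range(T):
            day = t % n_days

            # Conjugate Bayesian linear regression posterior over (mu, beta).
            if X:
                Xm = np.array(X)
                prec = np.eye(d) / tau2 + Xm.T @ Xm / sigma2
                cov = np.linalg.inv(prec)
                mean = cov @ Xm.T @ np.array(y) / sigma2
            else:
                cov, mean = np.eye(d) * tau2, np.zeros(d)

            # Thompson step on the arm effects only; the day effect is additive
            # and common to all arms, so the draw does not condition on today's day.
            theta = rng.multivariate_normal(mean, cov)
            arm = int(np.argmax(theta[:n_arms]))

            r = reward_fn(arm, day)
            x = np.zeros(d)
            x[arm] = 1.0
            x[n_arms + day] = 1.0
            X.append(x)
            y.append(r)

        # Final posterior mean; recommend the arm with the largest estimated effect.
        Xm = np.array(X)
        prec = np.eye(d) / tau2 + Xm.T @ Xm / sigma2
        mean = np.linalg.inv(prec) @ Xm.T @ np.array(y) / sigma2
        return int(np.argmax(mean[:n_arms]))

Modeling the day effects explicitly while keeping the arm-selection rule context-free is one simple way to see how adaptivity and robustness to a confounder such as day of week can be combined.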

Discussant(s)
Toru Kitagawa, Brown University
Kaito Ariu, CyberAgent, Inc.
Davide Viviano, University of California-San Diego
Chao Qin, Columbia University

JEL Classifications
  • C4 - Econometric and Statistical Methods: Special Topics
  • C9 - Design of Experiments