Sample mean based index policies with O(log n) regret for the multi-armed bandit problem

Rajeev Agrawal

doi:10.2307/1427934

Part of: Operations research and management science

Published online by Cambridge University Press: 01 July 2016

Affiliation: University of Wisconsin-Madison
Postal address: Department of Electrical and Computer Engineering, University of Wisconsin-Madison, Madison, WI 53706-1691, U.S.A. E-mail: agrawal@engr.wisc.edu

Abstract

We consider a non-Bayesian infinite horizon version of the multi-armed bandit problem with the objective of designing simple policies whose regret increases slowly with time. In their seminal work on this problem, Lai and Robbins obtained an O(log n) lower bound on the regret, with a constant that depends on the Kullback–Leibler number. They also constructed policies for some specific families of probability distributions (including exponential families) that achieved the lower bound. In this paper we construct index policies that depend on the rewards from each arm only through their sample mean. These policies are computationally much simpler and are also applicable much more generally. They achieve an O(log n) regret with a constant that is also based on the Kullback–Leibler number. This constant turns out to be optimal for one-parameter exponential families; however, in general it is derived from the optimal one via a ‘contraction’ principle. Our results rely entirely on a few key lemmas from the theory of large deviations.
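To make the idea of a sample-mean-based index policy concrete, the following is a minimal sketch in Python. It is not the index constructed in the paper: the exploration bonus `sqrt(2 log n / T_j)` is a commonly used upper-confidence form assumed here purely for illustration, and the names `ucb_index` and `run_index_policy`, the Bernoulli reward model, and the arm means are all hypothetical. At each step the policy plays the arm with the largest index, and the index depends on past rewards only through the arm's sample mean and its number of plays.

```python
import math
import random


def ucb_index(sample_mean, pulls, n):
    """Sample mean plus an exploration bonus (assumed form, not the paper's)."""
    return sample_mean + math.sqrt(2.0 * math.log(n) / pulls)


def run_index_policy(arm_means, horizon, seed=0):
    """Simulate an index policy on Bernoulli arms (illustrative reward model).

    Returns the number of plays of each arm and the cumulative expected regret.
    """
    rng = random.Random(seed)
    k = len(arm_means)
    pulls = [0] * k      # number of times each arm has been played
    means = [0.0] * k    # running sample mean of each arm's rewards
    best = max(arm_means)
    regret = 0.0
    for n in range(1, horizon + 1):
        if n <= k:
            arm = n - 1  # play every arm once so each index is defined
        else:
            arm = max(range(k), key=lambda j: ucb_index(means[j], pulls[j], n))
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        pulls[arm] += 1
        means[arm] += (reward - means[arm]) / pulls[arm]  # incremental mean
        regret += best - arm_means[arm]  # expected loss of this play
    return pulls, regret


if __name__ == "__main__":
    pulls, regret = run_index_policy([0.2, 0.8], horizon=2000)
    print("plays per arm:", pulls)
    print("cumulative expected regret:", regret)
```

Because the policy stores only a count and a running mean per arm, its per-step cost is linear in the number of arms, which illustrates the computational simplicity the abstract refers to.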

Keywords

UPPER CONFIDENCE BOUNDS; ASYMPTOTICALLY EFFICIENT; LARGE DEVIATIONS; STOCHASTIC ADAPTIVE CONTROL

MSC classification

Primary: 90B50: Management decision making, including multiple objectives

Type: General Applied Probability
Information: Advances in Applied Probability, Volume 27, Issue 4, December 1995, pp. 1054–1078

DOI: https://doi.org/10.2307/1427934
Copyright: Copyright © Applied Probability Trust 1995

Footnotes

Research supported by NSF Grant No. ECS-8919818.

References

[1] Agrawal, R. (1991) Minimizing the learning loss in adaptive control of Markov chains under the weak accessibility condition. J. Appl. Prob. 28, 779–790.

[2] Agrawal, R., Hegde, M. and Teneketzis, D. (1988) Asymptotically efficient adaptive allocation rules for the multi-armed bandit problem with switching cost. IEEE Trans. Autom. Control 33, 899–906.

[3] Agrawal, R., Hegde, M. and Teneketzis, D. (1990) Multi-armed bandit problems with multiple plays and switching cost. Stoch. Stoch. Reports 29, 437–459.

[4] Agrawal, R. and Teneketzis, D. (1989) Certainty equivalence control with forcing: Revisited. Syst. Contr. Lett. 13, 405–412.

[5] Agrawal, R., Teneketzis, D. and Anantharam, V. (1989) Asymptotically efficient adaptive allocation schemes for controlled i.i.d. processes: Finite parameter space. IEEE Trans. Autom. Control, 258–267.

[6] Agrawal, R., Teneketzis, D. and Anantharam, V. (1989) Asymptotically efficient adaptive allocation schemes for controlled Markov chains: Finite parameter space. IEEE Trans. Autom. Control 34, 1249–1259.

[7] Anantharam, V., Varaiya, P. and Walrand, J. (1987) Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays; Part I: IID rewards. IEEE Trans. Autom. Control 32, 968–975.

[8] Anantharam, V., Varaiya, P. and Walrand, J. (1987) Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays; Part II: Markovian rewards. IEEE Trans. Autom. Control 32, 975–982.

[9] Billingsley, P. (1986) Probability and Measure, 2nd edn. Wiley, New York.

[10] Brown, L. D. (1986) Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory. Institute of Mathematical Statistics.

[11] Dembo, A. and Zeitouni, O. (1993) Large Deviation Techniques and Applications. Jones and Bartlett.

[12] Ellis, R. S. (1985) Entropy, Large Deviations, and Statistical Mechanics. Springer-Verlag, Berlin.

[13] Lai, T. L. (1987) Adaptive treatment allocation and the multi-armed bandit problem. Ann. Statist. 15, 1091–1114.

[14] Lai, T. L. and Robbins, H. (1984) Asymptotically optimal allocation of treatments in sequential experiments. In Design of Experiments, ed. Santer, T. J. and Tamhane, A. J., pp. 127–142. Marcel Dekker, New York.

[15] Lai, T. L. and Robbins, H. (1985) Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6, 4–22.
