Some memoryless bandit policies

Erol A. Peköz

doi:10.1239/jap/1044476838

Published online by Cambridge University Press: 14 July 2016

Affiliation: Boston University
Postal address: School of Management, Boston University, Boston, MA 02215, USA. Email address: [email protected]

Abstract

We consider a multiarmed bandit problem, where each arm when pulled generates independent and identically distributed nonnegative rewards according to some unknown distribution. The goal is to maximize the long-run average reward per pull with the restriction that any previously learned information is forgotten whenever a switch between arms is made. We present several policies and a peculiarity surrounding them.
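The setting can be illustrated with a small simulation. The following is a minimal sketch and not one of the paper's policies, which the abstract does not specify: it assumes a hypothetical "stay until a low reward" rule in which the player keeps pulling the current arm while the reward is at least a threshold, and otherwise moves to a uniformly chosen arm. The only state carried across a switch is the index of the new arm, so nothing learned earlier is retained. The function name, the threshold, and the exponential reward distributions are all assumptions made for illustration.

```python
import random


def simulate_memoryless_policy(arms, threshold, num_pulls, seed=0):
    """Estimate the average reward per pull of a hypothetical memoryless rule.

    arms      -- list of callables; arms[i](rng) returns one nonnegative reward
    threshold -- assumed tuning parameter: switch when a reward falls below it
    num_pulls -- total number of pulls to simulate
    """
    rng = random.Random(seed)
    current = rng.randrange(len(arms))  # start on a uniformly chosen arm
    total = 0.0
    for _ in range(num_pulls):
        reward = arms[current](rng)  # i.i.d. draw from the current arm
        total += reward
        if reward < threshold:
            # Switch arms. Only the new arm index is kept, so everything
            # observed before the switch is forgotten.
            current = rng.randrange(len(arms))
    return total / num_pulls


# Illustrative arms: exponential rewards with assumed means 0.5, 1.0 and 2.0.
example_arms = [lambda rng, m=m: rng.expovariate(1.0 / m) for m in (0.5, 1.0, 2.0)]
example_average = simulate_memoryless_policy(example_arms, threshold=0.5, num_pulls=100_000)
```

Because the rule tends to stay longer on arms that rarely produce low rewards, more pulls go to better arms even though no statistics are stored across switches.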

Keywords

Bandit problem; multiarmed bandit problem; average reward criteria

MSC classification

Primary: 62L05: Sequential design; 90B36: Scheduling theory, stochastic; 62L15: Optimal stopping

Type: Short Communications
Information: Journal of Applied Probability, Volume 40, Issue 1, March 2003, pp. 250–256

DOI: https://doi.org/10.1239/jap/1044476838
Copyright: © Applied Probability Trust 2003
