Published online by Cambridge University Press: 13 September 2016
The knowledge gradient (KG) policy was originally proposed for offline ranking and selection problems but has recently been adapted for use in online decision-making in general and multi-armed bandit problems (MABs) in particular. We study its use in a class of exponential family MABs and identify weaknesses, including a propensity to take actions which are dominated with respect to both exploitation and exploration. We propose variants of KG which avoid such errors. These new policies include an index heuristic, which deploys a KG approach to develop an approximation to the Gittins index. A numerical study shows this policy to perform well over a range of MABs, including those for which index policies are not optimal. While KG does not take dominated actions when bandits are Gaussian, it fails to be index consistent and, when arms are correlated, appears not to enjoy a performance advantage over competitor policies sufficient to compensate for its greater computational demands.
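For readers unfamiliar with the policy discussed above, the sketch below illustrates the standard online KG decision rule for independent Gaussian arms with known observation noise: each arm receives a bonus equal to the remaining horizon times the expected one-step improvement in the best posterior mean. This is a minimal illustration under those assumptions, not the exponential family variants or the KG-based Gittins index approximation developed in the paper; the function and parameter names (kg_factor, online_kg_choice, noise_var, remaining_steps) are illustrative.

```python
import numpy as np
from scipy.stats import norm

def kg_factor(mu, sigma2, noise_var):
    """Expected one-step improvement in the best posterior mean, per arm."""
    mu = np.asarray(mu, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    # Std dev of the one-step change in each arm's posterior mean.
    sigma_tilde = sigma2 / np.sqrt(sigma2 + noise_var)
    # Best competing posterior mean for each arm.
    best_other = np.array([np.max(np.delete(mu, a)) for a in range(len(mu))])
    z = -np.abs(mu - best_other) / np.maximum(sigma_tilde, 1e-12)
    f = z * norm.cdf(z) + norm.pdf(z)
    return sigma_tilde * f

def online_kg_choice(mu, sigma2, noise_var, remaining_steps):
    """Pick the arm maximising current mean plus remaining-horizon KG bonus."""
    nu = kg_factor(mu, sigma2, noise_var)
    return int(np.argmax(mu + remaining_steps * nu))

# Example: the undersampled third arm earns a large exploration bonus.
print(online_kg_choice(mu=[0.2, 0.5, 0.4], sigma2=[0.1, 0.1, 1.0],
                       noise_var=1.0, remaining_steps=50))
```

The weaknesses identified in the abstract concern how such a bonus behaves outside the Gaussian setting, where the one-step improvement can assign zero (or dominated) value to arms that still merit exploration.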