
10 - Reinforcement Learning

from Part II - Cognitive Modeling Paradigms

Published online by Cambridge University Press:  21 April 2023

Edited by Ron Sun, Rensselaer Polytechnic Institute, New York

Summary

Reinforcement learning (RL) is a computational framework in which an active agent learns behaviors on the basis of scalar reward feedback. The theory of reinforcement learning was developed in the artificial intelligence community, drawing intuitions from psychology and animal learning theory and its mathematical basis from control theory. It has been successfully applied to tasks such as game playing and robot control. Reinforcement learning gives a theoretical account of behavioral learning in humans and animals and of the underlying brain mechanisms, such as dopamine signaling and the basal ganglia circuit. Reinforcement learning serves as a “common language” in which engineers, biologists, and cognitive scientists can exchange their problems and findings about goal-directed behavior. This chapter introduces the basic theoretical framework of reinforcement learning and reviews its impact on artificial intelligence, neuroscience, and cognitive science.
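As an illustration of this framework, the sketch below implements standard tabular Q-learning on a toy chain environment. The environment, constants, and `step` helper are illustrative assumptions of this sketch, not material from the chapter; what it shows is the core loop the summary describes: an agent improving its behavior from nothing but a scalar temporal-difference error, the reward-prediction-error quantity that has been mapped onto dopamine signaling.

```python
import random

# Minimal tabular Q-learning sketch (illustrative only; the toy chain
# environment and all constants below are assumptions of this example).
N_STATES, N_ACTIONS = 5, 2        # states 0..4; actions: 0 = left, 1 = right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1  # learning rate, discount, exploration

Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

def step(state, action):
    """Toy dynamics: action 1 moves right, action 0 moves left.
    Reaching the rightmost state yields reward 1 and ends the episode."""
    next_state = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done

for episode in range(200):
    state, done = 0, False
    while not done:
        # Epsilon-greedy action selection: mostly exploit, occasionally explore.
        if random.random() < EPSILON:
            action = random.randrange(N_ACTIONS)
        else:
            action = max(range(N_ACTIONS), key=lambda a: Q[state][a])
        next_state, reward, done = step(state, action)
        # Temporal-difference (reward prediction) error drives the update.
        td_error = reward + GAMMA * max(Q[next_state]) - Q[state][action]
        Q[state][action] += ALPHA * td_error
        state = next_state
```

After training, the greedy policy (choosing the action that maximizes Q[s][a] in each state) moves rightward toward the rewarded state, and the discount factor GAMMA controls how steeply future reward is devalued relative to immediate reward.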

Type: Chapter
Book: The Cambridge Handbook of Computational Cognitive Sciences
Publisher: Cambridge University Press
Print publication year: 2023
Online publication: 21 April 2023
Chapter DOI: https://doi.org/10.1017/9781108755610.013


