
10 - Reinforcement Learning

from Part II - Cognitive Modeling Paradigms

Published online by Cambridge University Press:  21 April 2023

Edited by Ron Sun, Rensselaer Polytechnic Institute, New York

Summary

Reinforcement learning (RL) is a computational framework in which an active agent learns behaviors on the basis of scalar reward feedback. The theory of reinforcement learning was developed in the artificial intelligence community, drawing intuitions from psychology and animal learning theory and its mathematical basis from control theory. It has been successfully applied to tasks such as game playing and robot control. Reinforcement learning gives a theoretical account of behavioral learning in humans and animals and of the underlying brain mechanisms, such as dopamine signaling and the basal ganglia circuit. Reinforcement learning serves as a “common language” in which engineers, biologists, and cognitive scientists can exchange their problems and findings about goal-directed behavior. This chapter introduces the basic theoretical framework of reinforcement learning and reviews its impact on artificial intelligence, neuroscience, and cognitive science.
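As an illustration of this framework, the sketch below implements standard tabular Q-learning on a toy chain environment. The environment, constants, and `step` helper are illustrative assumptions of this sketch, not material from the chapter; what it shows is the core loop the summary describes: an agent improving its behavior from nothing but a scalar temporal-difference error, the reward-prediction-error quantity that has been mapped onto dopamine signaling.

```python
import random

# Minimal tabular Q-learning sketch (illustrative only; the toy chain
# environment and all constants below are assumptions of this example).
N_STATES, N_ACTIONS = 5, 2        # states 0..4; actions: 0 = left, 1 = right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1  # learning rate, discount, exploration

Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

def step(state, action):
    """Toy dynamics: action 1 moves right, action 0 moves left.
    Reaching the rightmost state yields reward 1 and ends the episode."""
    next_state = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done

for episode in range(200):
    state, done = 0, False
    while not done:
        # Epsilon-greedy action selection: mostly exploit, occasionally explore.
        if random.random() < EPSILON:
            action = random.randrange(N_ACTIONS)
        else:
            action = max(range(N_ACTIONS), key=lambda a: Q[state][a])
        next_state, reward, done = step(state, action)
        # Temporal-difference (reward prediction) error drives the update.
        td_error = reward + GAMMA * max(Q[next_state]) - Q[state][action]
        Q[state][action] += ALPHA * td_error
        state = next_state
```

After training, the greedy policy (choosing the action that maximizes Q[s][a] in each state) moves rightward toward the rewarded state, and the discount factor GAMMA controls how steeply future reward is devalued relative to immediate reward.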

Type: Chapter
Book: The Cambridge Handbook of Computational Cognitive Sciences
Publisher: Cambridge University Press
Print publication year: 2023
Online publication: 21 April 2023
Chapter DOI: https://doi.org/10.1017/9781108755610.013


