Introduction
Prosocial behavior refers to behavior intended to benefit others in social interactions (Fehr & Fischbacher, 2003; Fiske & Taylor, 2013; Moskowitz, 2005). To behave prosocially, humans need to process social information and learn about the impacts that their actions can have on others (Fehr & Fischbacher, 2003; Fiske & Taylor, 2013; Lockwood, Apps, Valton, Viding, & Roiser, 2016). Reinforcement learning theory provides a powerful framework for understanding how humans and other species form action–outcome associations (Sutton & Barto, 2018). Recent evidence has shown that prosocial behaviors can be described in terms of reinforcement learning when people learn to benefit themselves (self-oriented learning) and others (prosocial learning) (Lockwood et al., 2016), suggesting a self-bias whereby humans learn faster from feedback when rewarding themselves than when rewarding others (Lockwood et al., 2016; Martins, Lockwood, Cutler, Moran, & Paloyelis, 2022).
Learning is one of the most crucial abilities of our brain for adapting to social life (Alberts, 1994; van den Berg, Molleman, & Weissing, 2015). Individuals learn to make decisions that maximize personal utility or secure necessary resources. Yet it has long been recognized that many real-world decisions are made in a social context, i.e. choices involve not only personal goals but also potential benefits for others. Recent studies have shown that people can behave prosocially or egoistically by learning from reward feedback about the consequences of their decisions for others and themselves (Liao, Huang, & Luo, 2021; Lockwood et al., 2016). Reward and punishment are crucial elements of reinforcement learning – positive and negative feedback have dissociable effects on learning (Galea, Mallia, Rothwell, & Diedrichsen, 2015) and may independently bias human estimates of the information content of outcomes (Pulcu & Browning, 2017). These dissociable effects of positive and negative feedback also modulate decision-making in a context-dependent manner. For example, people tend to display an optimistic bias toward self-relevant beliefs, updating their beliefs to a greater extent following positive than negative feedback (Sharot & Garrett, 2016), but also prioritize learning about the suffering of others and learn from punishment feedback to avoid harming others (Crockett, Kurth-Nelson, Siegel, Dayan, & Dolan, 2014; Lockwood, Klein-Flügge, Abdurahman, & Crockett, 2020). However, the role of self/other-orientation in prosocial learning, especially learning from punishment, remains largely unclear.
Understanding more about how we learn prosocial behavior in different social contexts may help us understand atypical behaviors across psychiatric conditions such as antisocial behavior (Lock, 2008) and autism spectrum disorders (Apps, Rushworth, & Chang, 2016; Lockwood et al., 2016).
It is critical to understand the neurochemical systems and neurocomputational mechanisms of prosocial learning. As an evolutionarily conserved neuropeptide, arginine vasopressin (AVP) modulates various social behaviors in mammals (Caldwell, 2017; Winslow, Hastings, Carter, Harbaugh, & Insel, 1993). Animal studies have illustrated the role of AVP in social memory (Albers, 2015; Caldwell, 2017), social communication/recognition (Song, Larkin, Malley, & Albers, 2016; Song et al., 2014), aggression (Caldwell & Albers, 2004; Gobrogge, Liu, Young, & Wang, 2009), and pair bonding (Liu, Curtis, & Wang, 2001; Pitkow et al., 2001). Not only aggressive but also prosocial behaviors can be modulated by AVP in specific social contexts. Human genetic studies have revealed a link between the AVP system and complex human social behaviors, such that polymorphisms of the human AVP receptor gene (AVPR1A) have been associated with reciprocity and trust (Nishina, Takagishi, Takahashi, Sakagami, & Inoue-Murayama, 2019) as well as altruistic behavior (Avinun et al., 2011; Knafo et al., 2008; Wang et al., 2016).
Importantly, intranasal administration of AVP has been widely used in humans to reveal the causal role of AVP in social cognition and is considered an effective means of directly affecting central processes across the blood–brain barrier (Born et al., 2002; Dhuria, Hanson, & Frey, 2010). For instance, intranasal AVP regulates auditory attention, perception, and memory of emotional and social cues (Dodt et al., 1994; Uzefovsky, Shalev, Israel, Knafo, & Ebstein, 2012; Zink et al., 2011), as well as risky decision making, cooperation, and prosocial behaviors (Feng, Qin, Luo, & Xu, 2020; Feng et al., 2015; Neto et al., 2020; Patel et al., 2015; Rilling et al., 2014). AVP is therefore a strong molecular candidate for modulating the neurocomputational mechanisms that underlie learning to act prosocially in specific social contexts.
To examine self/other-oriented biases in learning to seek reward and avoid punishment, and the modulatory effect of intranasal AVP on prosocial learning, we designed a probabilistic reversal learning task in which participants learned either to benefit themselves/others or to avoid punishment for themselves/others. The neurocomputational mechanisms underlying the effect of AVP on prosocial learning were examined using computational modeling as well as event-related potentials (ERPs) and brain oscillations. We hypothesized that self/other-oriented bias would be specific to reward/punishment-related prosocial learning and that AVP might be a crucial modulator supporting prosocial learning.
Materials and methods
Participants
One hundred and four healthy participants were recruited in the current study (age: 18–26; 54 males; two left-handed). For an effect size of f = 0.30, a type I error rate of 0.05, and statistical power of 0.8, G-Power 3.1 yielded a required minimum sample size of 58 participants for the two-level between-subject factor of drug administration [placebo (PBO) v. AVP] and its interactions with other factors in a repeated-measures design (Faul, Erdfelder, Lang, & Buchner, 2007). Participants were recruited via an online recruiting system and received monetary compensation. All potential participants completed a medical history questionnaire. Participants were not recruited if they reported any clinical disorder or drug/medication/alcohol abuse, had recently participated in any other drug study, or majored in economics/psychology. Participants were asked to abstain from caffeine and alcohol on the day of the experiment and from drink (except water) and food for 2 h before drug administration. The study was carried out according to the 1964 Helsinki Declaration and its later amendments and was approved by the local Ethics Committee. Written informed consent was obtained from each participant before the experiment. For electroencephalography (EEG) analyses, data from six participants were excluded because of incomplete EEG data, and data from four participants were discarded because no trials remained for at least one condition after denoising or because of left-handedness for the stimulus-preceding negativity (SPN) analysis.
Administration of AVP and PBO
Drug administration in the current study was randomized, double-blind, and PBO-controlled. Participants were randomly assigned to the PBO or the AVP group. The PBO group self-administered a matched placebo spray intranasally (n = 50; 24 females) and the AVP group 20 IU of vasopressin (n = 54; 27 females). The effective duration of 20 IU AVP on social processes is about 80 min (Born et al., 2002; Thompson, George, Walton, Orr, & Benson, 2006). An experimenter supervised the drug administration; however, both the experimenters and the participants were blind to the treatment. Participants were asked to place the nasal applicator in one nostril and to press the lever until they felt a mist of spray in the nostril, then to breathe in deeply through the nose. They were then instructed to repeat this process in the other nostril, so each application involved both nostrils. The drug was applied three times in total, with a 30 s interval between applications. Participants proceeded to the main experiment approximately 20 min after drug treatment (Thompson et al., 2006).
Task procedure
The experiment consisted of two probabilistic reversal learning tasks (Fig. 1a): a reward learning task (RLT) and a punishment learning task (PLT). Each session included two runs, and participants made choices either for themselves or for another participant (described as the next participant) in each run (run order was pseudorandomized within each session). There were therefore four conditions in total: making decisions for self in the RLT session (SR), for others in the RLT session (OR), for self in the PLT session (SP), and for others in the PLT session (OP). At the beginning of each trial, participants were instructed whether to make decisions for self or others. In each trial, after a fixation of 750–1250 ms, two visual stimuli/options were presented simultaneously and participants chose one with the corresponding mouse click. In the RLT session, one option was designated as the optimal option, associated with a high probability (70%) of obtaining a monetary reward (winning 5 cents) and a low probability (30%) of a null reward (0 cents). The other option was linked to a low probability (40%) of obtaining a reward and a high probability (60%) of a null reward. In contrast to the RLT session, participants in the PLT were informed that they had been given 400 cents as initial funding. One option was associated with a high probability (70%) of not being punished (0 cents) and a low probability (30%) of being punished (losing 5 cents). The other option was associated with a low probability (40%) of not being punished and a high probability (60%) of being punished. Once participants had chosen the optimal option on four consecutive occasions, the contingencies would reverse with a probability of 25% on each successive trial.
Once the reinforcement contingencies reversed, the option with high rewards or low punishments (winning 5 cents in the RLT frame or losing 0 cents in the PLT frame) became the frequently punished one (winning 0 cents in the RLT frame or losing 5 cents in the PLT frame), and vice versa. Participants then needed to switch to the other option – the one with high rewards or low punishments after the reversal. To prevent participants from using explicit strategies, such as counting the number of trials to reversal, they were not told how reversals were triggered by the computer, only that reversals occurred randomly throughout the experiment. Participants were asked to obtain rewards or avoid punishments as much as possible, which determined their payment.
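As a minimal illustration of the trial structure described above (a Python sketch with hypothetical helper names, not the authors' task code), the outcome probabilities and the reversal rule can be written as:

```python
import random

def feedback(optimal_chosen, context, rng=random):
    """Probabilistic outcome (in cents) for one trial.

    In the reward frame (RLT) the optimal option wins 5 cents with
    p = .70 and 0 cents otherwise; the suboptimal option wins with
    p = .40. In the punishment frame (PLT) the optimal option avoids
    a 5-cent loss with p = .70, the suboptimal option with p = .40.
    """
    p_good = 0.70 if optimal_chosen else 0.40
    good = rng.random() < p_good
    if context == "RLT":
        return 5 if good else 0
    return 0 if good else -5  # PLT: "good" means no punishment

def maybe_reverse(consecutive_optimal, rng=random):
    """Reversal rule: after four consecutive optimal choices, the
    contingencies reverse with probability 0.25 on each successive trial."""
    return consecutive_optimal >= 4 and rng.random() < 0.25
```

Because reversals are gated on the participant's own run of optimal choices, the reversal timing cannot be predicted by trial counting, which is why participants were told only that reversals occurred randomly.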
EEG data collection and preprocessing
EEG data were recorded continuously from 64 scalp sites using electrodes mounted on an elastic cap (Compumedics, Texas, USA), with an online reference to the left mastoid. All inter-electrode impedances were maintained below 5 kΩ. The EEG and electrooculography were filtered with a 0.05–100 Hz bandpass and continuously sampled at 500 Hz in each channel for off-line analysis. The EEG was re-referenced to the algebraic average of the left and right mastoids. Eye blinks and muscle artifacts were removed using independent component analysis from the EEGLAB toolbox (Delorme & Makeig, 2004). Trials contaminated by artifacts exceeding ±100 μV were excluded from averaging.
Data analysis
Behavioral measure
To quantify the performance of participants in the tasks, the accuracy was analyzed using repeated measures analysis of variance (ANOVA) with Context (RLT v. PLT) and Target (Self v. Other) as within-subject factors, and with Drug (PBO v. AVP) as a between-subject factor.
Computational model
To flexibly estimate participants' choices in response to changes in reward and punishment contingencies, we used the positive–negative (P-N) model, an extension of the Rescorla–Wagner model (Rescorla & Wagner, 1972), to capture dissociable learning effects from positive and negative outcomes separately:

$$V_{c,t} = V_{c,t-1} + \eta \,( O_{t-1}-V_{c,t-1}) , \quad \eta = \begin{cases} \eta_{pos}, & \text{positive feedback} \\ \eta_{neg}, & \text{negative feedback} \end{cases}$$

where $\eta_{pos}$ is the reward learning rate (0 on negative-feedback trials), $\eta_{neg}$ is the learning rate for negative feedback (0 on positive-feedback trials), $O$ is the received outcome, and the value $V_{c,t}$ of the chosen option $c$ on each trial $t$ is updated with the actual prediction error $(O_{t-1}-\; V_{c,t-1})$.
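The P-N update rule can be sketched in a few lines of Python (a hypothetical illustration, not the authors' hBayesDM/Stan implementation); the learning rate applied on each trial is gated by feedback valence:

```python
def pn_update(v_chosen, outcome, positive_feedback, eta_pos, eta_neg):
    """One trial of the positive-negative (P-N) model.

    V_{c,t} = V_{c,t-1} + eta * (O_{t-1} - V_{c,t-1}), where eta is
    eta_pos on positive-feedback trials and eta_neg on negative-feedback
    trials (the other rate is effectively 0 on that trial).
    """
    eta = eta_pos if positive_feedback else eta_neg
    prediction_error = outcome - v_chosen
    return v_chosen + eta * prediction_error
```

For instance, starting from V = 0, a rewarded trial with eta_pos = 0.5 moves the value halfway toward the outcome, while a subsequent null outcome is discounted by the (typically different) eta_neg.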
We fitted models using the hBayesDM package (Ahn, Haines, & Zhang, 2017). Parameter estimation was performed with hierarchical Bayesian analysis using the Stan language in R (Carpenter et al., 2017; R Core Team, 2016). Markov chain Monte Carlo sampling was used for posterior inference, and we compared and selected the optimal model using the leave-one-out information criterion (LOOIC). We compared three computational models: the fictitious update model, which assumes that participants simultaneously update the values of the chosen and unchosen options; the experience-weighted attraction model, which captures the weight given to past experience over and above new information as an individual progresses through the task; and the positive–negative model, which hypothesizes that individuals update their value estimates by learning from positive and negative outcomes separately. See online Supplementary information for model comparisons. To identify the optimal learning parameters for each model, we simulated choice data for each learning rate with random noise. We then fitted each model to the simulated data to explore parameter recovery and identify the optimal learning rate (Crawley et al., 2020; Wilson & Collins, 2019). Next, we used the estimated parameters from the winning model to simulate choices. For the following analyses, we excluded the data of two participants whose accuracy was lower than 45%, because these data were outliers that the model could not fit precisely (Frank, Seeberger, & O'Reilly, 2004).
EEG data analysis
To examine the neural mechanisms of prosocial learning and AVP modulation, we were interested in the motivation- and prediction-related SPN (Brunia & Damen, 1988; Hackley, Valle-Inclán, Masaki, & Hebert, 2014; Masaki, Yamazaki, & Hackley, 2010; Morís, Luque, & Rodríguez-Fornells, 2013), the feedback-related negativity (FRN), which is associated with expectation and learning (Gehring, Goss, Coles, Meyer, & Donchin, 1993; Holroyd & Coles, 2002; Miltner, Braun, & Coles, 1997; Yeung, Holroyd, & Cohen, 2005), and the outcome evaluation-related P300 (Nieuwenhuis, Aston-Jones, & Cohen, 2005; Osinsky, Mussel, & Hewig, 2012). Given that frontal theta is associated with updating of dynamic prediction errors and cognitive control, while delta reflects the prediction of future behavioral adjustments (Cavanagh & Frank, 2014; Cohen, Elger, & Ranganath, 2007; Hauser et al., 2014), we also expected to observe theta (Bernat, Nelson, Steele, Gehring, & Patrick, 2011; Hauser et al., 2014) and delta oscillations (Bernat et al., 2011; Cavanagh, 2015) at the outcome evaluation stage (see online Supplementary Fig. S1).
For ERP analyses, we were interested in the FRN and P300 as well as slow waves such as the SPN. The original EEG data were low-pass filtered at 20 Hz for the SPN analysis, but band-pass filtered at 0.1–30 Hz to remove low-frequency waves for the FRN and P300 analyses (Brunia, van Boxtel, & Böcker, 2012; Zheng, Li, Wang, Wu, & Liu, 2015). The filtered EEG data were then segmented into epochs time-locked to feedback onset. For the SPN, epochs were extracted from −2500 to 500 ms, with the activity from −2500 to −2300 ms serving as the baseline (Hackley et al., 2014; Masaki, Takeuchi, Gehring, Takasawa, & Yamazaki, 2006). We selected this baseline at the start of anticipation because the SPN is a slow negative wave that develops progressively prior to feedback presentation, so the baseline interval should not contain SPN signal. For the FRN and P300, epochs were first extracted from −500 to 1000 ms around each feedback onset and then re-segmented from −200 to 1000 ms, with the activity from −200 to 0 ms serving as the baseline (Zheng et al., 2015), on the assumption that neural activity in this period is unaffected by feedback presentation. For illustration, SPN waveforms were filtered with a low-pass cutoff at 7 Hz (24 dB/octave).
Based on the grand-average waveforms and topographic maps, the SPN amplitude from −200 to 0 ms (i.e. the 200 ms window immediately prior to feedback onset) was extracted as the mean voltage at bilateral electrode sites (F5/6 and FC5/6). Two participants were excluded from the SPN analysis because no trials remained in one condition after denoising, and two were excluded due to left-handedness. The data were analyzed with a repeated-measures ANOVA, with Context (RLT v. PLT), Target (Self v. Other), Hemisphere (Left v. Right), and Site (F5/6 v. FC5/6) as within-subject factors, and Drug (PBO v. AVP) as the between-subject factor. Based on previous studies, we used a peak-to-peak method to measure the FRN (Holroyd, Nieuwenhuis, Yeung, & Cohen, 2003; Osinsky et al., 2012; Osinsky, Walter, & Hewig, 2014). See online Supplementary information for details. To isolate the FRN from overlap with positive-feedback components (Holroyd, Krigolson, Baker, Lee, & Gibson, 2009; Walsh & Anderson, 2012; Zheng et al., 2015), we created peak-to-peak FRN difference waves (negative feedback minus positive feedback under each condition) separately for the RLT and PLT contexts (Pfabigan, Alexopoulos, Bauer, & Sailer, 2011). The peak-to-peak FRN difference waves were computed for the SR, OR, SP, and OP conditions. We measured them at FCz, a site commonly used to analyze the FRN in reinforcement learning that correlates with the updating of dynamic prediction errors (Hauser et al., 2014), and where the difference waves were maximal across the entire sample.
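A peak-to-peak FRN score of this kind can be sketched as follows (Python/NumPy; the search window and function names are illustrative assumptions, not the exact parameters used in this study):

```python
import numpy as np

def peak_to_peak_frn(erp, times, window=(0.2, 0.35)):
    """Score the FRN on one averaged waveform as the most negative
    trough within the search window minus the immediately preceding
    positive peak (peak-to-peak scoring)."""
    idx = np.flatnonzero((times >= window[0]) & (times <= window[1]))
    trough = idx[np.argmin(erp[idx])]          # most negative point
    preceding = erp[idx[0]:trough + 1]         # samples up to the trough
    return erp[trough] - preceding.max()

def frn_difference(erp_negative, erp_positive, times):
    """Peak-to-peak FRN difference score:
    negative-feedback FRN minus positive-feedback FRN."""
    return (peak_to_peak_frn(erp_negative, times)
            - peak_to_peak_frn(erp_positive, times))
```

Because the score is a trough-minus-peak difference within each waveform, it is negative by construction, and a more negative difference score indicates a larger FRN for negative relative to positive feedback.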
Similarly, the P300 amplitude was calculated as the mean voltage of the difference wave at CPz (Cavanagh, 2015), given the posterior distribution of the P300 component, in the 320–420 ms window after feedback onset. A 2 (Context) × 2 (Target) × 2 (Drug) ANOVA was used to examine differences in the FRN and P300, respectively.
Next, we focused on oscillations in the delta (<4 Hz) and theta (4–7 Hz) bands during outcome evaluation (see online Supplementary information for details). Time–frequency distributions of the EEG time course were obtained using a windowed Fourier transform with a fixed 200 ms Hanning window for theta and a fixed 500 ms window for delta. For each epoch, this yielded a complex time–frequency spectral estimate at each point of the time–frequency plane, extending from −500 to 1000 ms (in 2 ms steps) in the time domain and from 1 to 30 Hz (in 1 Hz steps) in the frequency domain. The resulting spectrogram represents signal power as a joint function of time and frequency at each time–frequency point. When the center of the fixed 200 ms Hanning window falls between −100 and 0 ms, the spectral estimate is contaminated by signals after feedback onset (Hu & Zhang, 2019). Therefore, the spectrogram was baseline-corrected (reference interval from −300 to −200 ms relative to feedback onset) at each frequency using the subtraction approach (Cavanagh, 2015). Mean theta activity (4–7 Hz) was extracted in the 100–300 ms interval following feedback onset at FCz, because the topographic distribution of power exhibited a fronto-central peak maximal around FCz. Mean delta (<4 Hz) activity was extracted in the 320–420 ms interval at Cz. Power differences between negative and positive feedback, as well as differences between frequency bands, were then compared using a 2 (Frequency band) × 2 (Context) × 2 (Target) × 2 (Drug) repeated-measures ANOVA.
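The fixed-window time–frequency estimate and subtraction baseline can be sketched as follows (Python/NumPy; a simplified single-channel illustration with assumed parameter names, not the analysis pipeline used in the study):

```python
import numpy as np

def windowed_power(signal, fs, win_len, freqs, step=1):
    """Power spectrogram from a sliding Hanning-tapered Fourier
    transform with a fixed window length (e.g. 100 samples = 200 ms
    for theta, 250 samples = 500 ms for delta at fs = 500 Hz)."""
    half = win_len // 2
    taper = np.hanning(win_len)
    centers = np.arange(half, len(signal) - half, step)
    fft_freqs = np.fft.rfftfreq(2 * half, d=1.0 / fs)
    bins = [int(np.argmin(np.abs(fft_freqs - f))) for f in freqs]
    power = np.empty((len(freqs), len(centers)))
    for ti, c in enumerate(centers):
        spec = np.fft.rfft(signal[c - half:c + half] * taper[:2 * half])
        for fi, k in enumerate(bins):
            power[fi, ti] = np.abs(spec[k]) ** 2
    return power, centers

def baseline_subtract(power, centers, fs, onset, base=(-0.3, -0.2)):
    """Subtraction baseline per frequency, referenced to the
    -300 to -200 ms interval relative to feedback onset (a sample
    index given by `onset`)."""
    t = (centers - onset) / fs
    mask = (t >= base[0]) & (t < base[1])
    return power - power[:, mask].mean(axis=1, keepdims=True)
```

The fixed window trades time resolution for frequency resolution: the 200 ms window resolves theta at the cost of smearing roughly ±100 ms around each time point, which is why estimates centered between −100 and 0 ms already contain post-onset signal.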
To explore the relationship between behavioral adjustments and brain oscillations, we correlated the average delta activity with reaction time (RT) (Cavanagh, 2015) and estimated the moderating effect of AVP on the relation between delta activity and RT (online Supplementary Fig. S1).
Moreover, we conducted two moderation models, at the anticipation stage and the outcome evaluation stage, estimating the moderating role of AVP in the relation between neuroelectrophysiological signals and psychological processes. For all statistical tests, the Greenhouse–Geisser epsilon correction was applied for nonsphericity where appropriate (Jennings & Wood, 1976). Partial eta-squared $( \eta _P^2 )$ is reported as a measure of effect size. The Bonferroni procedure was used to correct for multiple comparisons in post hoc analyses.
Results
Behavioral differences in learning to avoid punishment and AVP modulation of self-related reward-seeking and other-regarding punishment-avoidance
The 2 × 2 × 2 ANOVA of accuracy showed a significant main effect of Context, with higher accuracy in punishment learning than reward learning (F(1,102) = 4.396, p = 0.038, $\eta _P^2$ = 0.041), a significant main effect of Target, with higher accuracy when participants learned for themselves than for others (F(1,102) = 5.046, p = 0.027, $\eta _P^2$ = 0.047), and a significant three-way Context × Target × Drug interaction (F(1,102) = 5.231, p = 0.024, $\eta _P^2$ = 0.049; Fig. 1b). However, the main effect of Drug (F(1,102) = 0.011, p = 0.916, $\eta _P^2$ = 0.000) and the Context × Drug (F(1,102) = 1.204, p = 0.275, $\eta _P^2$ = 0.012) and Target × Drug (F(1,102) = 0.362, p = 0.549, $\eta _P^2$ = 0.004) interactions were not significant. Simple effect analyses of the three-way interaction showed that the Context × Target interaction was significant in the PBO group (F(1,102) = 4.19, p = 0.043, $\eta _P^2$ = 0.099) but not in the AVP group (F(1,102) = 1.37, p = 0.245, $\eta _P^2$ = 0.02). In addition, the Context × Drug interaction was significant when learning for others (F(1,102) = 5.511, p = 0.021, $\eta _P^2$ = 0.051). Specifically, in the PBO group, accuracy in SP was significantly higher than in SR (F(1,102) = 3.946, p = 0.050, $\eta _P^2$ = 0.037) and significantly higher than in OP (F(1,102) = 5.311, p = 0.023, $\eta _P^2$ = 0.049). In the AVP group, however, accuracy in OP was significantly higher than in OR (F(1,102) = 5.765, p = 0.018, $\eta _P^2$ = 0.053), and accuracy in SR was significantly higher than in OR (F(1,102) = 4.983, p = 0.028, $\eta _P^2$ = 0.047). These results suggest that AVP modulates the adaptation of individuals' self/other-oriented bias depending on the specific frame.
Specifically, AVP promotes individuals' self-bias in reward learning compared with learning for others, and enhances prosocial performance in punishment learning compared with learning for themselves.
Computational evidence for self-bias on punishment learning and dissociable modulations of AVP in prosocial learning
Bayesian model comparison showed that the positive–negative (P-N) model was superior to the other two models under all four conditions (Fig. 1c). Subsequently, the two estimated P-N model learning rates, $\eta_{pos}$ and $\eta_{neg}$, were analyzed with a 2 (Learning rate) × 2 (Context) × 2 (Target) × 2 (Drug) ANOVA, in which the four-way interaction was significant (F(1,100) = 78.122, p < 0.001, $\eta _P^2$ = 0.439). Interestingly, the three-way interaction for the negative-feedback learning rate $\eta_{neg}$ was significant (F(1,100) = 6.999, p = 0.009, $\eta _P^2$ = 0.065; Fig. 1d), consistent with the accuracy results. For the $\eta_{neg}$ parameter we also found a significant main effect of Context, with better performance in punishment learning (F(1,100) = 7.556, p = 0.007, $\eta _P^2$ = 0.070), and a significant main effect of Target (F(1,100) = 13.532, p < 0.001, $\eta _P^2$ = 0.119), while the main effect of Drug (F(1,100) = 0.243, p = 0.623, $\eta _P^2$ = 0.002) was not significant. The Context × Drug interaction was significant (F(1,100) = 5.882, p = 0.017, $\eta _P^2$ = 0.056), while the Target × Drug interaction was not (F(1,100) = 1.727, p = 0.192, $\eta _P^2$ = 0.017). Simple effect analyses showed that the AVP group performed better than the PBO group in punishment learning (F(1,100) = 13.654, p < 0.001, $\eta _P^2$ = 0.120) but not in reward learning (p = 0.821).
Simple effect analyses of the three-way interaction for the $\eta_{neg}$ parameter showed that in the PBO group, $\eta_{neg}$ in SP was marginally significantly higher than in SR (F(1,100) = 3.134, p = 0.080, $\eta _P^2$ = 0.030) and significantly higher than in OP (F(1,100) = 21.714, p < 0.001, $\eta _P^2$ = 0.178). In the AVP group, however, $\eta_{neg}$ in OP was significantly higher than in OR (F(1,100) = 12.480, p = 0.001, $\eta _P^2$ = 0.111) and $\eta_{neg}$ in SP was significantly higher than in SR (F(1,100) = 4.194, p = 0.043, $\eta _P^2$ = 0.040). Moreover, $\eta_{neg}$ in SR was significantly higher than in OR (F(1,100) = 3.960, p = 0.049, $\eta _P^2$ = 0.038). Additionally, $\eta_{neg}$ in OP was higher under AVP than under PBO (F(1,100) = 7.872, p = 0.006, $\eta _P^2$ = 0.073). These results supported the behavioral findings and suggest that AVP may modulate individuals' ability to extract information from negative-feedback trials, particularly for prosociality in punishment learning and for self-orientation in reward learning.
Identifying self-bias on punishment learning and dissociable modulations of AVP in prosocial learning using ERPs
SPN at the stage of anticipation. The SPN develops gradually as a relative negativity after the choice and reaches its maximum immediately prior to feedback onset (Fig. 2a). The SPN topography is plateau-shaped and tends to be larger over frontal areas. The 2 (Context) × 2 (Target) × 2 (Drug) ANOVA of the SPN data revealed a significant three-way interaction (F(1,93) = 5.651, p = 0.019, $\eta _P^2$ = 0.057). Specifically, in the PBO group, the SPN amplitude in SP was significantly higher than in OP (F(1,93) = 7.31, p = 0.008, $\eta _P^2$ = 0.073; Fig. 2b). In the AVP group, the SPN amplitude in OP was significantly larger than in OR (F(1,93) = 4.154, p = 0.044, $\eta _P^2$ = 0.043), and the SPN in SR was significantly higher than in OR (F(1,93) = 6.092, p = 0.015, $\eta _P^2$ = 0.061). It should be noted that the PBO group showed a symmetrical distribution over frontal areas (F(1,93) = 4.315, p = 0.041, $\eta _P^2$ = 0.044), in line with previous findings (Brunia, Hackley, van Boxtel, Kotani, & Ohgami, 2011). However, the main effect of Drug was not significant (F(1,93) = 0.059, p = 0.809, $\eta _P^2$ = 0.001). At the anticipation stage, individuals prepare the brain for the upcoming feedback, so the SPN indexes the modulation by AVP of anticipation in self-related reward-seeking and other-regarding punishment-avoidance behaviors.
We established a moderation model (Fig. 2c) to estimate whether drug treatment moderated the association between SPN amplitude and $\eta_{neg}$. The results revealed a significant moderation under the OP condition. There was a significant main effect of Drug on $\eta_{neg}$ (b = 0.094, p = 0.009) but not of SPN amplitude (b = 0.002, p = 0.741), and, more importantly, the effect of SPN on $\eta_{neg}$ was significantly moderated by Drug (b = 0.032, p = 0.005). Simple slope tests revealed that larger SPN amplitudes were associated with higher negative-feedback learning rates in the AVP group ($b_{simple}$ = 0.017, p = 0.015; Fig. 2d), while this correlation was not significant in the PBO group ($b_{simple}$ = −0.015, p = 0.093). These results indicate a moderating role of AVP in the relation between anticipation and prosocial punishment learning.
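In regression form, a moderation model of this kind is an ordinary least-squares fit with a Drug × SPN interaction term; the following is a minimal sketch (hypothetical variable names and synthetic data; the actual analysis may have included covariates and used dedicated statistical software):

```python
import numpy as np

def fit_moderation(spn, drug, eta_neg):
    """OLS estimates for eta_neg = b0 + b1*Drug + b2*SPN + b3*(Drug x SPN).

    A significant b3 indicates that Drug moderates the SPN-eta_neg
    association; the simple slopes are b2 in the PBO group (drug = 0)
    and b2 + b3 in the AVP group (drug = 1).
    """
    X = np.column_stack([np.ones_like(spn), drug, spn, drug * spn])
    beta, *_ = np.linalg.lstsq(X, eta_neg, rcond=None)
    return beta
```

The simple-slope tests reported above correspond to testing b2 and b2 + b3 against zero within each drug group.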
Dissociable neural processing of the FRN and P300 at the stage of outcome evaluation. The FRN difference waveform at FCz as a function of feedback type (positive v. negative) exhibited a negative deflection over fronto-central regions during feedback evaluation (Fig. 3), whereas the P300 at CPz showed a positive potential over centro-parietal regions (Fig. 4a). The 2 × 2 × 2 ANOVA of the peak-to-peak FRN difference wave revealed a significant three-way interaction (F(1,96) = 7.811, p = 0.006, $\eta_p^2$ = 0.075; Fig. 4b). Interestingly, in the PBO group, the negative amplitude of the peak-to-peak FRN difference wave under the SP condition was significantly larger than under SR (F(1,96) = 11.858, p = 0.001, $\eta_p^2$ = 0.11), whereas in the AVP group, the difference amplitude in OP was significantly more negative than in OR (F(1,96) = 11.863, p = 0.001, $\eta_p^2$ = 0.11). In addition, the difference amplitude of the AVP group under the OP condition was significantly larger than under SP (F(1,93) = 5.582, p = 0.020, $\eta_p^2$ = 0.055). The 2 × 2 × 2 ANOVA of the P300 difference wave revealed a significant three-way interaction among Context, Target, and Drug (F(1,96) = 5.332, p = 0.023, $\eta_p^2$ = 0.053; Fig. 4c). Simple effect analyses showed that in the PBO group, the amplitude of the P300 difference wave in response to OR was significantly larger than to OP (F(1,96) = 8.734, p = 0.004, $\eta_p^2$ = 0.083), whereas in the AVP group the difference wave in SR was significantly larger than in SP (F(1,96) = 25.261, p < 0.001, $\eta_p^2$ = 0.208). The amplitude of the P300 difference wave in OP was larger in the AVP group than in the PBO group (F(1,96) = 5.932, p = 0.017, $\eta_p^2$ = 0.058). There was no significant main effect of Drug for either the FRN (F(1,96) = 0.633, p = 0.428, $\eta_p^2$ = 0.007) or the P300 (F(1,96) = 2.467, p = 0.120, $\eta_p^2$ = 0.025).
Taken together, these results suggest that in the interactive learning tasks, AVP enhanced the FRN difference wave in response to other-related feedback in the punishment context and increased the P300 in response to self-related feedback in the reward context.
Theta oscillation at the stage of outcome evaluation. To examine the brain oscillations underlying the modulation of AVP on prosocial learning, a 2 (Frequency band) × 2 (Context) × 2 (Target) × 2 (Drug) repeated-measures ANOVA was applied to the differences in frequency-band activity between negative and positive feedback at FCz (Fig. 5). The results showed a significant main effect of Context (F(1,96) = 5.974, p = 0.004, $\eta_p^2$ = 0.084), reflecting a larger response evoked by punishment learning. There was also a significant four-way interaction (F(1,96) = 5.974, p = 0.016, $\eta_p^2$ = 0.059). Post hoc analyses showed that this effect was driven by a three-way interaction of Context × Target × Drug in the theta band (F(1,96) = 4.559, p = 0.035, $\eta_p^2$ = 0.045; Fig. 5b) rather than the delta band (F(1,96) = 3.081, p = 0.082, $\eta_p^2$ = 0.031). Simple effect analyses showed that in the PBO group, the theta-band difference power in SP was significantly larger than in SR (F(1,96) = 4.703, p = 0.033, $\eta_p^2$ = 0.047) and significantly higher than in OP (F(1,96) = 4.050, p = 0.047, $\eta_p^2$ = 0.040). In the AVP group, however, theta-band power in OP was significantly larger than in OR (F(1,96) = 8.054, p = 0.006, $\eta_p^2$ = 0.077), and the power in SR was marginally higher than in OR (F(1,102) = 3.854, p = 0.053, $\eta_p^2$ = 0.039). There was no significant main effect of Drug (F(1,96) = 0.003, p = 0.960, $\eta_p^2$ = 0.000). These results indicate that theta oscillations underlie the modulation by AVP of self-oriented reward-seeking and prosocial punishment-avoidance behaviors.
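The theta-band difference measure analyzed above can be illustrated with a minimal sketch: estimate spectral power in the 4–8 Hz band for feedback-locked trials and take the negative-minus-positive difference. The sampling rate, trial counts, and signal parameters below are illustrative assumptions, not the study's recording settings:

```python
import numpy as np

def band_power(x, fs, f_lo, f_hi):
    """Mean FFT power of a 1-D signal x within [f_lo, f_hi] Hz."""
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)
    psd = np.abs(np.fft.rfft(x)) ** 2 / x.size
    band = (freqs >= f_lo) & (freqs <= f_hi)
    return psd[band].mean()

def theta_difference(neg_trials, pos_trials, fs):
    """Theta (4-8 Hz) power difference, negative minus positive feedback.
    Each input is an array of shape (n_trials, n_samples)."""
    theta = lambda trials: np.mean([band_power(t, fs, 4, 8) for t in trials])
    return theta(neg_trials) - theta(pos_trials)

# Simulated single-channel epochs: stronger 6 Hz activity after
# negative feedback, weaker after positive feedback.
fs = 250
t = np.arange(0, 1, 1 / fs)
rng = np.random.default_rng(1)
neg = 2.0 * np.sin(2 * np.pi * 6 * t) + rng.normal(scale=0.5, size=(30, t.size))
pos = 0.5 * np.sin(2 * np.pi * 6 * t) + rng.normal(scale=0.5, size=(30, t.size))

print(theta_difference(neg, pos, fs) > 0)  # True: theta power is larger for negative feedback
```

In practice time-frequency power would be obtained with wavelet or multitaper decomposition on baseline-corrected epochs; the FFT band integration here only conveys the negative-minus-positive contrast entered into the ANOVA.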
Moreover, our data revealed a significant moderating effect at the outcome-evaluation stage (Fig. 5c), estimating whether drug treatment moderated the association between the theta oscillation difference (negative − positive) and the difference between negative and positive learning rates ($\eta^{neg} - \eta^{pos}$) under the OP condition. The model showed a significant main effect of Drug on $\eta^{neg} - \eta^{pos}$ (b = 0.210, p < 0.001) but not of the theta oscillation difference (b = −0.035, p = 0.161); more importantly, the effect of the theta oscillation difference on $\eta^{neg} - \eta^{pos}$ was significantly moderated by Drug (b = −0.107, p = 0.036). Simple slope tests revealed that a lower theta oscillation difference was associated with higher $\eta^{neg} - \eta^{pos}$ in the AVP group ($b_{simple}$ = −0.088, p = 0.024; Fig. 5d), whereas this association was not significant in the PBO group ($b_{simple}$ = 0.019, p = 0.562). These results reveal the moderating mechanism of AVP in prosocial punishment learning.
Discussion
In the present study, we used a combination of behavioral manipulation, computational modeling, and EEG to examine how people adapt when learning to benefit, or avoid harm to, themselves (self-oriented learning) and others (prosocial learning), and the modulatory role of AVP in this adaptation, from external behavioral performance to internal psychological processes and the underlying neural dynamics. Our behavioral findings showed that the self-bias was specific to avoiding punishment, and that AVP increased learning performance for self in reward-seeking and for others in punishment-avoidance. Using computational modeling and electrophysiological measurements, we found that the self-bias and the modulation by AVP were specific to negative feedback learning, underpinned by increased brain responses during anticipation (i.e. the SPN) and outcome evaluation (i.e. the FRN and P300, as well as frontal theta oscillations). At the outcome-evaluation stage, the AVP system improved prosocial learning by adjusting the punishment-related early FRN process and enhanced proself learning by acting on the reward-related late P300 process, with these two dissociable temporal processes also reflected in the theta band. At the anticipation stage, the larger SPN in the AVP group than in the PBO group suggests that the AVP system directly modulated the self/other-oriented bias to expedite learning in self-oriented reward-seeking and other-regarded punishment-avoidance. Together, our study reveals the neurocomputational mechanisms by which we adapt to obtain reward or avoid punishment in self-oriented and prosocial learning, in which AVP plays a context-dependent modulatory role.
In the PBO group, individuals performed differently when learning for themselves and for others, suggesting a self-bias in punishment learning. Previous studies focusing on reward learning have likewise shown better performance for self than for others in prosocial learning (Lockwood et al., 2016; Martins et al., 2022). Consistently, we found a self-bias in learning rate when participants learned from positive feedback in reward learning (online Supplementary Fig. S2), which was not evident in accuracy. Using a prosocial learning paradigm featuring both reward and punishment, we found the self-oriented learning effect when avoiding punishment. However, other studies have shown influences of social dilemmas on people's social preferences, especially other-regarding concerns and altruism (Liu et al., 2020; van Dijk & Wilke, 2000). In contrast to social dilemma paradigms, which typically involve a tradeoff between economic benefits and the feelings of others, decisions in our task were made in a self-action reference frame across reward and punishment. The absence of such tradeoffs may be one explanation for the absence of altruism in the PBO group. Our learning framework therefore allowed us to measure the interaction between self/other-oriented biases and reward/punishment biases in social learning.
Our results showed that intranasal AVP up-regulated both altruism concerning others' losses and reward-seeking for self-oriented benefits. Consistent with recent studies showing the involvement of AVP in prosocial behavior (Nishina et al., 2019; Wang et al., 2016) and in social cooperation (Feng et al., 2015), our results showed that AVP enhances altruism, particularly in protecting others from monetary losses. On the other hand, vasopressin can help individuals maximize personal utility in adapting to the environment (Brunnlieb et al., 2016; Patel et al., 2015). This account is also supported by our finding that AVP improved individuals' performance toward proself benefits in reward-seeking. Thus, AVP induced both prosocial and proself behaviors, depending on the reward/punishment context. Although learning performance and neural responses did not differ overall between the PBO and AVP groups, the three-way interaction suggests that the effects of AVP were conditional.
The difference in learning rates for negative rather than positive feedback, which was modulated by the AVP system, suggests that learning from negative feedback may be a crucial aspect of decision making in social learning. People are more sensitive to negative than to positive information, which manifests as a negativity bias in attention (Rozin & Royzman, 2001) and as loss aversion in decision making (Tversky & Kahneman, 1991). Previous reinforcement-learning models suggest that negative outcomes make greater contributions to the overall feedback evaluation (Cavanagh, Frank, Klein, & Allen, 2010; Pearce & Hall, 1980). In social settings, negative events weigh more heavily than positive ones (Alves, Koch, & Unkelbach, 2017; Shin & Niv, 2021). It has also been shown that people take the consequences of their actions into account when these affect others, in particular learning to avoid harming others (Crockett et al., 2014; Lockwood et al., 2020). Therefore, negative information processing may be an important component of prosocial learning and an evolutionarily natural target for the modulatory role of the AVP system.
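The asymmetric sensitivity to negative feedback discussed above is commonly formalized with valence-specific learning rates in a Rescorla-Wagner-style update, in which positive and negative prediction errors are scaled by separate rates ($\eta^{pos}$, $\eta^{neg}$). The following is a schematic sketch of that idea, not the fitted model used in this study:

```python
def update_value(v, outcome, eta_pos, eta_neg):
    """One Rescorla-Wagner step with valence-specific learning rates.

    v: current expected value of the chosen option
    outcome: observed feedback (e.g. +1 reward, -1 punishment)"""
    delta = outcome - v                       # prediction error
    eta = eta_pos if delta >= 0 else eta_neg  # rate depends on error sign
    return v + eta * delta

# A learner with eta_neg > eta_pos adjusts its expectations faster
# after negative feedback than after positive feedback.
v_after_loss = update_value(0.0, -1.0, eta_pos=0.2, eta_neg=0.6)
v_after_win = update_value(0.0, +1.0, eta_pos=0.2, eta_neg=0.6)
print(v_after_loss, v_after_win)  # -0.6 0.2
```

Fitting such a model separately for self- and other-recipient trials is one way a self-bias in punishment learning could appear as a larger $\eta^{neg}$ for self than for other.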
The dissociated responses of the FRN and P300 within our learning framework suggest that distinct temporal stages of neural processing underlie the self/other-oriented bias. The FRN and P300 are critical ERP components in outcome evaluation. Previous studies suggest that the FRN reflects a fast evaluation of outcome valence, with a larger differential effect between loss and win (Gehring et al., 1993; Yeung et al., 2005), and an evaluation of the consistency between expectations and actual outcomes (Holroyd & Coles, 2002). In contrast, the P300 is related to reward processing and is sensitive to a later, top-down controlled process of outcome evaluation (Cavanagh, 2015; Nieuwenhuis et al., 2005; Pfabigan et al., 2011). In social contexts, the FRN and P300 respond differently to outcomes for oneself and for others (Hu, Xu, & Mai, 2017; Qi, Wu, Raiha, & Liu, 2018). For instance, in a gambling task the FRN was sensitive to a self-benefit context while the P300 responded to a prosocial context (Qi et al., 2018). Consistently, in the PBO group we observed a larger FRN when participants made decisions for themselves in the aversive context, and a larger P300 response to others' benefits. Interestingly, AVP up-regulated prosocial punishment-avoidance behavior, accompanied by a larger FRN response in the aversive context, while it modulated reward-seeking behavior with a larger P300 response to self-related benefits. Together, these results suggest that the AVP system dissociably improves prosocial learning by adjusting the punishment-related FRN process and enhances proself learning by acting on the reward-related P300 process.
Our results also shed light on the oscillatory mechanisms underlying the modulation of AVP on prosocial learning at the outcome-evaluation stage. Consistent with the behavioral and computational model measures, theta-band difference activity also supported the self-bias and the dissociated modulations by AVP. Previous studies suggest that midfrontal theta-band activity is predictive of cognitive control (Cavanagh & Frank, 2014; Cohen, 2011) and indicative of altruistic behavioral responses (Rodrigues, Ulrich, & Hewig, 2015). Theta activity has also been shown to respond to social interactions (Rodrigues et al., 2015; Tendler & Wagner, 2015). Therefore, theta activity, proself in reward-seeking and prosocial in punishment-avoidance learning, points to the control mechanisms underlying the interaction between self/other-oriented bias and feedback valence.
Lastly, the SPN findings reveal the self-bias and the modulation by AVP at the anticipation stage of prosocial learning. In our study, the SPN showed a self-bias when participants were trying to avoid punishment. The larger SPN amplitudes at the anticipation stage when individuals made decisions for self-related reward-seeking and other-regarded punishment-avoidance suggest that participants receiving AVP held a biased expectation of the imminent outcome. The SPN, a slow negative wave that progressively develops prior to motivationally relevant stimuli (Brunia & Damen, 1988), has been considered to reflect outcome prediction and the expectation of response reinforcement (Masaki et al., 2010). The right-hemisphere predominance in the PBO group was compatible with previous findings, possibly reflecting contributions from the ventral attention system (Brunia et al., 2011; Zheng et al., 2015). The SPN has been interpreted as preparatory activity that speeds up brain processing of the relevant stimulus, as preparation of the brain for the upcoming event or action, and as an index of anticipatory attention (Brunia et al., 2011, 2012). Therefore, our SPN findings illuminate a self-bias in preparation for forthcoming aversive stimuli in social learning, while the AVP system directly modulates the self/other-oriented bias at the anticipation stage to expedite self-oriented reward-seeking and other-oriented punishment-avoidance behaviors in prosocial learning.
Overall, our findings suggest that intranasal vasopressin modulates the self/other-oriented bias by up-regulating self-related reward-seeking and other-regarded punishment-avoidance behaviors in prosocial learning. AVP modulates the learning and processing of negative feedback at both the anticipation and outcome-evaluation stages. These modulations by the AVP system are underpinned by the punishment-related FRN for prosocial learning and the reward-related P300 for proself learning, as well as by theta-band oscillations at the outcome-evaluation stage and the SPN at the anticipation stage. Our work sheds light on the mechanisms of prosocial behavior and has important implications for atypical social behaviors in psychiatric disorders.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/S0033291722002483.
Acknowledgements
This study was supported by the National Natural Science Foundation of China (31871137, 31900757, 31920103009, and 32020103008), the Major Project of National Social Science Foundation (20&ZD153), Young Elite Scientists Sponsorship Program by China Association for Science and Technology (YESS20180158), Guangdong International Scientific Collaboration Project (2019A050510048), Natural Science Foundation of Guangdong Province (2020A1515011394 and 2021A1515010746), Shenzhen-Hong Kong Institute of Brain Science-Shenzhen Fundamental Research Institutions (2019SHIBS0003), and Shenzhen Science and Technology Research Funding Program (JCYJ20180507183500566, JCYJ20180306173253533 and JCYJ20190808121415365).
Author contributions
J. X., C. F., and P. X. designed research; L. Q. performed research; G. D. analyzed data; and G. D., H. A., C. F., and P. X. wrote the paper.
Conflict of interest
The authors declare that they have no competing financial interests.