Non-technical Summary
Researchers often use large databases to conduct their studies; however, they do not always provide credit, through citations, to the people who produced the data in the databases. In the field of paleontology, researchers use a large database called the Paleobiology Database (PBDB) to study global patterns and processes over millions of years. These studies use data from the PBDB and typically receive a greater number of citations than the original data-producing papers. This creates a situation where the hard work of collecting the data is not credited and rewarded in a fair way, even though this work is equally important to the field of paleontology. By fixing this issue and giving proper credit to data-producing papers, paleontology itself can be strengthened by increasing the incentives for producing data and at the same time creating more high-quality data for everyone to use.
Introduction
“Both data collectors and data crunchers are important, and certainly the latter would not exist without the former.”
MacRoberts and MacRoberts (2018: p. 476)
Large data compilations allow new questions to be asked at previously unreachable scales and provide greater certainty in answering old questions. Across the biological sciences, large compilations have led to theoretical and practical advances, ranging from new insights on biodiversity dynamics in conservation science (Dornelas et al. 2014), to biomedical advances based on compiled genetic data (Benson et al. 2013), to the identification of mass extinctions in paleontology (Raup and Sepkoski 1982). Compilations of data at these scales, and the associated advances, are accompanied by a need to establish a standard set of protocols for the curation, management, and citation of the underlying data (Altman et al. 2015; Cousijn et al. 2018; Kaufman et al. 2018; Marwick and Birch 2018; Lammey 2019). Recognizing the reliance of large data compilations on the many data-producing studies from which they are built, there is a growing consensus that data users should credit data producers in a way that is on par with the credit attributed to traditionally recognized outputs, like peer-reviewed publications (Piwowar and Vision 2013; Altman et al. 2015; Penev et al. 2017; Cousijn et al. 2018; Kaufman et al. 2018; Silvello 2018; Zhao et al. 2018; Lammey 2019; Pierce et al. 2019; Dosso and Silvello 2020).
Despite this emerging consensus, the scientific community at large has been slow to adopt the practice of citing data sources, and a common procedure for data citation remains elusive (Ingwersen and Chavan 2011; Marwick and Birch 2018; Zhao et al. 2018; Cousijn et al. 2019; Tomaszewski 2019; Silveira et al. 2020; Suhr et al. 2020). Data citation is used inclusively here to refer to any attribution to data provisioners by data users (e.g., Penev et al. 2017; Cousijn et al. 2018; Hood and Sutherland 2021; and see “Balancing Data Use and Citation in Paleontology”). For example, in a review of 600 papers from 12 disciplines (e.g., biology, earth sciences, ecology, and environmental sciences; Zhao et al. 2018), when authors used a new or existing dataset in their analysis (n = 312), data attribution was variable: 6% included data citations, 9% used unique identifiers for the data (e.g., DOI), 24% mentioned data with only a database name, and 60% referenced their data using a URL, an imperfect citation given that URLs can expire. Relatedly, 88% of studies (n = 100, randomly drawn from 4533 studies) using data compiled by the Global Biodiversity Information Facility failed to appropriately cite the sources of the data they used (Escribano et al. 2018).
It has become clear that making data citation a standardized practice will require changes at all stages of academic research, from funders, publishers, editorial boards, data repositories, authors submitting analyses of compiled data, researchers producing the data, and scientists evaluating each other's work to all other persons involved in research production (Kaufman et al. 2018; Marwick and Birch 2018; Cousijn et al. 2019; Colavizza et al. 2020; Silveira et al. 2020).
Like many other scientific disciplines, paleontology has much room for improvement in how data are cited (Payne et al. 2012; Kaufman et al. 2018; Fig. 1). Paleontology has historically been a descriptive field wherein accumulations of fossils are documented when they are found in rocks and sediments. Most basically, individual fossils, alongside information on their location, stratigraphy, and taxonomy, are the raw data of paleontology (Johnson et al. 2005; Allmon et al. 2018). It is typically these records of taxa at a given place and time that are compiled for larger-scale analyses. The analysis of data compilations has deep roots in paleontology (e.g., Phillips 1860; Newell 1952, 1967; Harland 1967; Sepkoski et al. 1981; Sepkoski 1984), and the development of online databases (e.g., ART [Raja et al. 2022a]; BioDeepTime [Smith et al. 2023b]; Geobiodiversity Database [Fan et al. 2013]; Neotoma [Williams et al. 2018]; Neptune Sandbox Berlin [Renaudie et al. 2020]; Paleobiology Database, https://paleobiodb.org; PARED [Kiessling and Krause 2022]; Triton [Fenton et al. 2021]) in the last two decades has helped make these types of analyses a cornerstone of modern paleontology (Supplementary Fig. S1). Paleontologists now routinely analyze compiled data at local to global scales across temporal ranges of hundreds of millions of years, greatly expanding the ambition of the hypotheses and questions that can be addressed about the history of life on Earth (e.g., Kiessling 2005; Payne and Finnegan 2007; Alroy et al. 2008). However, the use of compiled data in paleontology has moved at a faster pace than the development of protocols for best practices in data citation (Payne et al. 2012; Kaufman et al. 2018), which has contributed to a decrease in the number of taxonomic experts in paleontology (e.g., Payne et al. 2012), much as it has in overlapping disciplines (e.g., archaeology [Marwick and Birch 2018], biodiversity research [Escribano et al. 2018; Mandeville et al. 2021], ecology and evolution [Hood and Sutherland 2021], taxonomy [Agnarsson and Kuntner 2007; Engel et al. 2021; Benichou et al. 2022]). As paleontology and related disciplines move toward a FAIR (Findability, Accessibility, Interoperability, and Reuse; Wilkinson et al. 2016; and see https://www.go-fair.org/fair-principles) infrastructure for digital assets in the long-term future, a short-term solution is needed to ensure the continuance of the specimen-based work that is foundational to each of these areas of research.
Here we quantify the extent to which scientific contributions of data-provisioning publications are unseen and uncredited and discuss present and future consequences of this imbalance. We do so by estimating the number of neglected citations, defined here as citations that were not attributed to these studies despite the data being used, in peer-reviewed publications based on analyses of the Paleobiology Database (PBDB; hereafter, “PBDB publications,” including only those listed as “official publications”). The PBDB was selected as it is one of the oldest, largest, and most widely used paleontological databases and maintains a list of publications that make use of the database (i.e., “official publications”). We transform the raw estimates of neglected citations into an annual citation rate that enables us to standardize comparison of citations across PBDB publications and the underlying data-provisioning publications and capture an estimate of neglected citations. To demonstrate the effect of neglected citations beyond individual publications, we also estimate changes to the impact factors of paleontological journals (e.g., Acta Palaeontologica Polonica, Journal of Paleontology, Palaeontology) that often publish specimen-based work (used inclusively for taxonomy, systematics, morphology, and other areas associated with data provisioning). Leveraging these comparisons, we advocate for the proper citation of specimen-based work in paleontology and present a strategy for more equitable data citation.
Methods
The data used to produce this study were drawn from published studies based on data from the PBDB (https://paleobiodb.org), bibliometric data from Google Scholar (https://scholar.google.com), and Journal Citation Reports generated by Clarivate (https://jcr.clarivate.com/jcr/home). These data were used to estimate the extent to which data-provisioning publications have been undervalued through a lack of citation when their data have been reused in publications drawing from the PBDB. We estimated how citation metrics for data-provisioning publications would change if they were cited in all instances where their data outputs were reused, as well as the effect this would have on the impact factors of discipline-specific paleontological journals. All analyses were carried out using R 4.1.2.
Data Collection on Paleontological Data Reuse
In this study, we focused on the Paleobiology Database (https://paleobiodb.org). The PBDB is among the most commonly used large fossil occurrence databases and is widely used in large-scale temporal and spatial analyses of biodiversity in the fossil record. The PBDB records a list of “official publications,” which are publications that use data from the database and have requested an official publication number (see: https://paleobiodb.org/#/publications)—this list is maintained to demonstrate the importance and utility of the database to funding agencies. We compiled this publication list into a dataset on May 6, 2021, and at that time, the list included 396 publications spanning the years 2001–2021.
As our study required the scientometric information for the original publications that contributed the data reused in the PBDB publications, we extracted the raw datasets associated with each PBDB publication whenever they were available (e.g., those uploaded to a data repository linked to the manuscript). It was important to have the raw datasets, because these are downloaded from the PBDB directly and contain the reference information for the data-provisioning publications. We assumed that all data listed in these exported dataset files were used in the subsequent study, and we gave equal weight to a study provisioning 1 or 100 data points (see Dosso and Silvello [2020] for an alternative approach to data credit distribution). When these datasets were not available online, we sent a personalized template email (see Supplementary Material) to the lead or corresponding author(s) of PBDB publications asking for the dataset. If no response was received after 2 weeks, we contacted authors again with a follow-up email. Authors who replied typically did so within a few days (median = 1 day, mean = 5.5 days). Of the 167 responses, 50% provided either a file or a link to the requested data, 17% indicated that the files had been lost, 9% indicated only simple use of the PBDB that required no download, and 23% indicated the publication did not use PBDB data. In some cases, authors provided us with the parameters they used to extract their data from the PBDB; however, as the PBDB is a dynamic database, the data produced by these queries change over time and could not be incorporated. We did not receive a response from authors for 25% of the PBDB publications (68/268 requests). In total, we were able to extract the needed information from 151 PBDB publications, accounting for 38% of PBDB publications (total = 396) within the temporal scope of our data collection phase (see Smith et al. 2023a).
Existing and Neglected Citations for Data-provisioning Publications
Using the combined data from the 151 datasets available to us from PBDB publications, we compiled each instance of unique citation information, yielding a list of 49,999 data-provisioning publications. To quantify the magnitude of neglected citations attributed to these references, we first needed to extract the existing number of citations for each publication. This was done by scraping citation data of each data-provisioning publication from Google Scholar in June–August 2021. Google Scholar is detached from academic publishers and other metadata aggregators (e.g., CrossRef, Scopus) and has less transparency than some of these other tools; however, it continues to be commonly used by the academic community and is readily accessed, making it a suitable choice for the objectives of this study. The process of scraping citations was complicated by several factors, including incomplete citation information in some PBDB datasets and issues with Google Scholar not retrieving the correct publication associated with a citation. Consequently, 9816 references required non-automated data extraction by members of the authorship team between August 2021 and April 2022. It is possible that some publications received additional citations during this period, but we assume the overall effect was negligible, an assumption supported by the relatively low median citation rate in our data. Overall, this process produced citation information for 47,122 of the 49,999 (94.2%) data-provisioning publications. Citation data were also extracted for all 396 PBDB publications to enable comparisons between citations of the two publication types.
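The compilation step can be illustrated with a minimal Python sketch (the study itself used R). It assumes each PBDB download has already been reduced to a list of reference strings; the function names, the normalization rule, and the example references are hypothetical and are not taken from the study's code.

```python
from collections import Counter


def normalize(ref: str) -> str:
    """Collapse case and whitespace so trivially different strings match."""
    return " ".join(ref.lower().split())


def tally_reuse(datasets: dict) -> Counter:
    """Count how many datasets (PBDB publications) reuse each reference.

    A reference counts once per dataset, no matter how many occurrences it
    contributed, mirroring the equal-weight assumption stated in the text.
    """
    reuse = Counter()
    for refs in datasets.values():
        reuse.update({normalize(r) for r in refs})
    return reuse


# Hypothetical example: two PBDB publications and the references in their downloads.
datasets = {
    "pbdb_pub_1": ["Smith 1990. Fossils of X", "Smith 1990.  Fossils of X", "Lee 2005. Fauna of Y"],
    "pbdb_pub_2": ["smith 1990. fossils of x", "Chen 2012. Flora of Z"],
}
reuse = tally_reuse(datasets)
n_unique = len(reuse)  # number of unique data-provisioning references
print(n_unique, reuse[normalize("Smith 1990. Fossils of X")])
```

Real reference strings are messier than this example, which is consistent with the manual extraction the text describes for a large share of the references.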
We tabulated the number of times data from each data-provisioning paper were reused. Although the number of neglected citations is informative on its own (Supplementary Figs. S2, S3), we standardized citations to an annual rate to enable comparison between data-provisioning and PBDB publications. Likewise, we focused on publications from the period of 2001–2021, as this encompasses the period during which PBDB publications have existed and rates of citation are likely influenced by the time period being considered (e.g., more citations and publications in more recent times). Annual citation rates were calculated for data-provisioning publications in three scenarios, using: (1) only existing citations; (2) instances of data reuse in the 151 PBDB publications with data available, in addition to existing citations; and (3) extrapolating to potential neglected citations in the entire dataset of 396 PBDB publications for which we sought data in this study (assuming rates of data reuse in this larger dataset would be similar to those in our smaller set; see Supplementary Material for discussion of assumptions). Citation rate of PBDB publications was calculated solely for existing citations and used as a basis of comparison to approximate the relative seen and unseen contributions of data-provisioning publications to paleontology.
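The three scenarios can be sketched in Python as follows. Two choices here are assumptions made for illustration and are not confirmed by the text: citations are divided by the number of calendar years from publication through an assumed census year of 2021, and the extrapolation scales reuse linearly by 396/151. The study's actual procedure is described in its Supplementary Material.

```python
CENSUS_YEAR = 2021  # assumed year in which citations were counted
N_SAMPLED = 151     # PBDB publications with data available (from the text)
N_TOTAL = 396       # all official PBDB publications (from the text)


def annual_rate(citations: float, pub_year: int, census_year: int = CENSUS_YEAR) -> float:
    """Citations per year; the publication year itself counts as one year."""
    years = max(census_year - pub_year + 1, 1)
    return citations / years


def scenario_rates(existing: int, reuse_in_sample: int, pub_year: int) -> tuple:
    """Return annual rates for scenarios (1), (2), and (3)."""
    s1 = annual_rate(existing, pub_year)
    s2 = annual_rate(existing + reuse_in_sample, pub_year)
    # Assumed linear scaling from the sampled to the full set of PBDB publications.
    extrapolated = reuse_in_sample * N_TOTAL / N_SAMPLED
    s3 = annual_rate(existing + extrapolated, pub_year)
    return s1, s2, s3


# Hypothetical publication: 10 existing citations, reused in 5 sampled PBDB papers.
s1, s2, s3 = scenario_rates(existing=10, reuse_in_sample=5, pub_year=2012)
print(s1, s2, round(s3, 3))
```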
Rates of citation for data-provisioning publications in each of the three scenarios were compared statistically to the citation rate for PBDB publications using median and harmonic mean. Comparison of median citation rates was conducted using a Wilcoxon rank sum test with continuity correction. Harmonic mean was also evaluated to account for outliers in the dataset that might have biased comparisons (e.g., a publication with an exceptionally high citation rate). As the results were similar when using median and harmonic mean, only the results using the median are reported in the main text (see Supplementary Material for results with harmonic means).
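For readers unfamiliar with the test, the following is a minimal standard-library Python sketch of a two-sided Wilcoxon rank sum test using the normal approximation with continuity correction and a tie correction. It is not the authors' code (they used R); the function names and example rates are hypothetical, and the sketch always uses the normal approximation, whereas statistical software may switch to an exact test for small samples without ties.

```python
import math
from statistics import harmonic_mean


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Return (W, p): W is the rank sum of x minus its minimum possible value."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    combined = list(x) + list(y)
    ranks = average_ranks(combined)
    w = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    # Variance of W under the null hypothesis, reduced for ties.
    counts = {}
    for v in combined:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t**3 - t for t in counts.values())
    var_w = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_w == 0:
        return w, 1.0  # all observations identical
    diff = w - n1 * n2 / 2
    # Continuity correction: shrink the difference by 0.5 toward zero.
    correction = 0.5 if diff > 0 else (-0.5 if diff < 0 else 0.0)
    z = (diff - correction) / math.sqrt(var_w)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return w, p


# Hypothetical annual citation rates for two groups of publications.
group_a = [4.0, 5.5, 3.2, 6.1, 4.8]
group_b = [1.0, 1.5, 0.8, 2.2, 1.3]
w, p = rank_sum_test(group_a, group_b)
hm = harmonic_mean([1, 2, 4])  # small values dominate the harmonic mean
print(w, round(p, 4), round(hm, 3))
```

The harmonic mean is pulled toward small values, so a single exceptionally high citation rate shifts it far less than it would shift the arithmetic mean.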
Estimating Effects on Paleontological Journal Impact Factors
We estimated the effect that citation of data-provisioning publications in past PBDB publications would have on paleontological journals, using impact factor as a metric. For this analysis, we evaluated changes to the journal impact factor (JIF) of all journals categorized by Clarivate as “Paleontology.” As with citations to data-provisioning publications, we first compiled the data used to calculate JIF for the period of 1997–2021 (see Smith et al. [2023a] for raw data). These data included number of citable items published in each journal every year, the number of citations of those citable items every year, and the resulting impact factor, which is calculated as, for example:
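As a hedged illustration, the standard two-year journal impact factor for a year $y$ can be written as below. The notation is introduced here for illustration only: $C_{y}(t)$ is the number of citations received in year $y$ by items the journal published in year $t$, and $N_{t}$ is the number of citable items the journal published in year $t$.

```latex
\mathrm{JIF}_{y} \;=\; \frac{C_{y}(y-1) + C_{y}(y-2)}{N_{y-1} + N_{y-2}}
```

For instance, under this definition the 2021 impact factor divides the citations received in 2021 by items published in 2019 and 2020 by the number of citable items published in those two years.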
All necessary data are compiled annually by Clarivate and published as Journal Citation Reports (https://jcr.clarivate.com/jcr/home), which we accessed between November 18, 2022, and February 19, 2023, for use in calculating adjusted impact factors. To calculate new impact factors, we tabulated the number of neglected citations to data-provisioning publications, aggregated by journal on an annual basis. These neglected citations were added to the number of citations of citable items for each paleontological journal, and impact factor was recalculated based on these new citation counts. Changes in JIF were converted to percent differences to standardize the results. To contextualize these changes within the scope of publishing in paleontology, in general, we also tabulated and plotted the total number of citable items and citations to those items each year.
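The recalculation described above amounts to adding neglected citations to the numerator of the impact factor and expressing the change as a percentage. A minimal Python sketch, with hypothetical function names and invented example numbers, is:

```python
def impact_factor(citations: float, citable_items: int) -> float:
    """Citations to a journal's citable items divided by the number of those items."""
    if citable_items <= 0:
        raise ValueError("citable_items must be positive")
    return citations / citable_items


def adjusted_impact_factor(citations: float, neglected: float, citable_items: int) -> tuple:
    """Return (original JIF, adjusted JIF, percent difference)."""
    original = impact_factor(citations, citable_items)
    adjusted = impact_factor(citations + neglected, citable_items)
    if original == 0:
        raise ValueError("percent difference is undefined when the original JIF is 0")
    pct = 100 * (adjusted - original) / original
    return original, adjusted, pct


# Hypothetical journal-year: 300 counted citations, 45 neglected citations,
# and 150 citable items in the two-year window.
original, adjusted, pct = adjusted_impact_factor(citations=300, neglected=45, citable_items=150)
print(original, adjusted, round(pct, 1))
```

Because only the numerator changes, the percent difference in impact factor equals the neglected citations as a percentage of the counted citations (45/300 in this example).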
Results and Discussion
Balancing Data Use and Citation in Paleontology
PBDB publications were cited at a median rate of 4.28 times each year (median absolute deviation: 3.47), a significantly greater rate than annual citations for data-provisioning publications from the same period of time (1.35/year, median absolute deviation: 1.26; Wilcoxon rank sum test, p-value < 0.0001; Fig. 2; data available in Smith et al. [2023a]). When a citation was credited to each data-provisioning publication within the available subset of data-using publications (151 out of 396 PBDB publications with available data), the citation rate increased to 2.44 each year (median absolute deviation: 1.49). Assuming these 151 publications are a representative sample of all PBDB publications (see Supplementary Material for discussion of assumptions), extrapolating to the entire set of 396 PBDB publications increased the median citation rate for data-provisioning publications to 4.16 annual citations (median absolute deviation: 2.22), statistically indistinguishable from the median rate for PBDB papers (Wilcoxon rank sum test, p-value = 0.2103; Fig. 2). Using the harmonic mean as the summary statistic to downweight outliers (e.g., publications with extraordinarily high citation rates) resulted in a similar pattern (see “Supporting Analyses” in the Supplementary Material). These results suggest data-provisioning publications should be cited at a rate equal to that for the PBDB publications that reuse their data.
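The summary statistics reported here, the median and the median absolute deviation (MAD), can be computed as in the following Python sketch. The example rates are invented. The scaling constant is an assumption: R's `mad()` multiplies by 1.4826 by default, and the text does not state whether the scaled or unscaled value is reported, so the sketch exposes the constant as a parameter.

```python
from statistics import median


def mad(values, constant: float = 1.0) -> float:
    """Median absolute deviation about the median, times an optional constant."""
    center = median(values)
    return constant * median([abs(v - center) for v in values])


# Hypothetical annual citation rates with one extreme value.
rates = [0.5, 1.0, 1.5, 2.0, 10.0]
center = median(rates)
spread = mad(rates)
spread_scaled = mad(rates, constant=1.4826)
print(center, spread, round(spread_scaled, 4))
```

Both statistics are insensitive to the extreme value of 10.0 in this example. From the medians reported above, the ratio of existing citation rates is 4.28 / 1.35, or roughly 3.2.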
It is clear that the status quo (Fig. 1A), in which the professional reward for PBDB publications is roughly three times that for data-provisioning publications, does not give adequate recognition to the importance of data-provisioning publications and the effort required to produce them. At a minimum, the outputs of data-provisioning publications represent intellectual input by their authors and, in many cases, represent dozens or hundreds of hours of work and large financial investment (Agnarsson and Kuntner 2007; Ebach et al. 2011; Baker and Mayernik 2020; Melville et al. 2021). As has been broadly recognized in the literature, data producers deserve credit for their foundational work (Agnarsson and Kuntner 2007; Payne et al. 2012; Penev et al. 2017; Cousijn et al. 2018, 2019; Kaufman et al. 2018; Marwick and Birch 2018; Silvello 2018; Zhao et al. 2018; Jones et al. 2019; Lammey 2019; Pierce et al. 2019; Tomaszewski 2019; Colavizza et al. 2020; Dosso and Silvello 2020; Dorta-González et al. 2021; Hood and Sutherland 2021).
In a hypercompetitive academic environment where many aspects of an individual's career (e.g., reputation, career prospects, funding) are influenced by citation counts, this status quo for citation practice is neither fair nor sustainable (Agnarsson and Kuntner 2007; Neylon and Wu 2009; Payne et al. 2012; Piwowar and Vision 2013; Tang et al. 2017; Curry 2018; Gingras and Khelfaoui 2018; MacRoberts and MacRoberts 2018; Silvello 2018; Pierce et al. 2019; Stern and O'Shea 2019; Colavizza et al. 2020; Dosso and Silvello 2020; Raja and Dunne 2022). Without rebalancing the credit distribution (Fig. 1B), emerging “big data” research in paleontology (considered here as research using large amounts of data [e.g., > 1 TB], including environmental data, images, stratigraphic information, taxonomic records, and more; Leonelli 2014; Allmon et al. 2018) is at risk of undercutting itself by contributing to a systematic devaluation of the specimen-based work that is foundational to the discipline.
The estimated citation rate for data-provisioning papers after the addition of neglected citations demonstrates the fundamental and underappreciated value of specimen-based work in paleontology. One way to acknowledge its value and to incentivize future specimen-based work is to cite the data in a formal way when they are used (Piwowar and Vision 2013; Penev et al. 2017; Cousijn et al. 2018, 2019; Kaufman et al. 2018; Silvello 2018; Zhao et al. 2018; Pierce et al. 2019; Dosso and Silvello 2020; Dorta-González et al. 2021; Hood and Sutherland 2021). Although citations are inherently flawed as a metric and subject to biases (e.g., Gingras and Khelfaoui 2018; MacRoberts and MacRoberts 2018; Davies et al. 2021; Hood and Sutherland 2021; Raja and Dunne 2022), citations in one form or another are likely to continue being used to evaluate researchers (see Hicks et al. [2015] for cautionary guidelines and Wilkinson et al. [2016] for discussion of FAIR principles). Citing data producers may be another step toward increased transparency and reproducibility in the pipeline from data production to digital upload and reuse (Wilkinson et al. 2016; Escribano et al. 2018; Hood and Sutherland 2021; see also Supplementary Table S1 and “Additional Contributions to the Paleobiology Database” in the Supplementary Material). Consequently, the development of a clear protocol for citing data can set a community-wide standard that preempts many of the shortcomings reported for traditional text citations (e.g., Gingras and Khelfaoui 2018; MacRoberts and MacRoberts 2018; Davies et al. 2021). The recommended best practices for data citation from the broader literature, from which paleontology can draw (Payne et al. 2012; Kaufman et al. 2018), include two general themes: (1) credit data provisioning by citing the publication in which the data were initially reported, or (2) use new metrics specifically developed for data citation.
Conceptually, the most straightforward way to credit data producers is to cite the publication from which the data were originally reported when the data are reused (Penev et al. 2017; Cousijn et al. 2018; Hood and Sutherland 2021; and as recommended by some databases, e.g., BioTIME, https://biotime.st-andrews.ac.uk/usageGuidelines.php; Neotoma, https://www.neotomadb.org/data/data-use-and-embargo-policy). As the most basic option, this strategy carries the simplifying assumption that all authors participated in data production and credits them equally on this basis (see Pierce et al. [2019] for a counterargument). By virtue of its simplicity, this strategy for citing original publications facilitates ease of use through rapid integration into existing citation metrics, circumventing the need for an independent data citation tool. A prerequisite for using many data citation tools is a unique identifier for datasets (e.g., DOI), which is not available for many past publications (Hood and Sutherland 2021) and, in recent publications, continues to be a shortcoming driven by poor adherence to data-sharing recommendations (Gabelica et al. 2022; see Agosti et al. [2022] for recommendations on use of identifiers). Many of the datasets included in the PBDB do not have unique identifiers, making the application of more complex data citation tools intractable. In alignment with our objective, citing original publications upon data reuse allowed for the most intuitive comparison between data-provisioning and data-using publications with a metric already familiar to academics.
It also bears stating that citing the database itself (in this case study, the PBDB) is necessary but not sufficient. As secondary sources of data, databases can indirectly create a barrier to citation of data-provisioning publications by masking the original data sources. Reflecting this issue, several databases (e.g., BioTIME, Neotoma) provide guidance on citing original data sources and, in the PBDB itself, recommendations toward this end have been made (https://paleobiodb.org/#/faq/how-should-the-paleobiology-database-data-be-cited-; see also Uhen et al. [2023] for a current user guide).
Citation practice is developing rapidly, as a host of data citation tools have been proposed, including the Data Citation Index (Clarivate 2023), SageCite (Lyon 2010), Data Usage Index (Ingwersen and Chavan 2011), and Data Credit Distribution (Dosso and Silvello 2020), and multiple working groups have been convened on this topic (e.g., Scholix, Data Usage Metrics, Data Citation Synthesis Group). One of the driving principles behind the development of these metrics is the idea that data use is complex and therefore requires a tool that captures the nuances of data (Data Citation Synthesis Group 2014; Cousijn et al. 2019; Dosso and Silvello 2020; Hood and Sutherland 2021). As a scientist's value cannot be distilled to a single metric, using several of these tools in combination with other measures of a person's contributions to science and society may be a more equitable option for evaluating scientists in the future (Neylon and Wu 2009; Curry 2018; Ewers et al. 2019; Stern and O'Shea 2019; Davies et al. 2021; Hood and Sutherland 2021; Westoby et al. 2021). Given the attention to citation practice and alternative metrics in the recent literature and the progress made by working groups on the topic, it may only be a matter of time before data citation metrics become mainstream (Data Citation Synthesis Group 2014; Kaufman et al. 2018; Cousijn et al. 2019; Hood and Sutherland 2021).
Regardless of the citation approach, attributing credit to data provisioning (and all individuals involved in the process of making data available in digital compilations; see Supplementary Table S1; see also Escribano et al. Reference Escribano, Galicia and Ariño2018; Benichou et al. Reference Benichou, Buschbom, Campbell, Hermann, Kvaček, Mergen, Mitchell, Rinaldo and Agosti2022) in a professionally meaningful way represents a shift in citation practice and credit distribution in paleontology (Payne et al. Reference Payne, Smith, Kowalewski, Krause, Boyer, McClain, Finnegan, Novack-Gottshall and Sheble2012; Kaufman et al. Reference Kaufman, Abram, Evans, Francus, Goosse, Linderholm and Loutre2018; Fig. 1). With the rise of quantitative paleontology and the associated shift away from paleontology's traditional descriptive roots, it is imperative that we find reasonable and equitable ways to improve our data citation practices. For example, a single publication might draw data from thousands of primary sources (Supplementary Fig. S4) and, particularly for journals with strict page limits or length-based page charges, it is often not feasible to include citations for each of the data-provisioning publications. Though it will continue to be impractical to cite thousands of papers in printed format, the growing awareness of the importance of data citation and improving digital infrastructure provide a path forward. As a starting point, online archives and preprint servers (e.g., bioRxiv, EarthArXiv, Open Science Framework) can accommodate the long list of references required to cite all data-provisioning publications used by a publication based on the PBDB or another database. These online archives and preprint servers are routinely indexed by aggregators (e.g., Google Scholar, Web of Science).
Publication of a reference list with an online repository or preprint server can increase the likelihood that citations are attributed to data-provisioning publications but, critically, the references must be included with the main text in the references section, not placed in the supplementary material. Current processes for aggregating citations do not find references in supplementary material. Raja et al. (Reference Raja, Dimitrijević, Krause and Kiessling2022a) illustrated the feasibility of this approach by publishing their database references in a preprint hosted at Open Science Framework (Raja et al. Reference Raja, Dimitrijević, Krause and Kiessling2022b). To alleviate the burden and facilitate consistency of compiling these large reference lists, future authors can use the R package refer (https://github.com/adamkocsis/refer). This package offers tools to generate a formatted document containing the metadata and reference list that can be used to upload to the aforementioned online archives. The user is required to provide either a text file containing formatted references or a BibTeX file containing the references in their data-using publication along with other generic information (e.g., title of the publication, author affiliations). The package also includes a ready-made template for the formatting of the document, and experienced users can provide their own templates.
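The workflow described above (a plain-text list of formatted references plus generic publication metadata, combined into a single document for upload) can be sketched in a few lines of Python. This is only an illustration of the idea and is not the interface of the refer package; the names `Metadata`, `load_references`, and `build_document` are hypothetical, and the input is assumed to hold one formatted reference per line.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Metadata:
    """Generic information about the data-using publication (hypothetical structure)."""
    title: str
    authors: List[str]
    affiliations: List[str]


def load_references(text: str) -> List[str]:
    """Return de-duplicated, alphabetically sorted references.

    Assumes one formatted reference per non-empty line.
    """
    seen = set()
    refs = []
    for line in text.splitlines():
        ref = " ".join(line.split())  # collapse stray whitespace
        key = ref.casefold()
        if ref and key not in seen:
            seen.add(key)
            refs.append(ref)
    return sorted(refs, key=str.casefold)


def build_document(meta: Metadata, refs: List[str]) -> str:
    """Assemble a plain-text document whose references sit in the main text,
    under a References heading, rather than in supplementary material."""
    lines = [
        meta.title,
        "",
        "; ".join(meta.authors),
        "; ".join(meta.affiliations),
        "",
        "References",
        "",
    ]
    lines.extend(refs)
    return "\n".join(lines) + "\n"
```

Keeping the list under a References heading in the main document, rather than in a supplementary file, reflects the point made above that citation aggregators do not read supplementary material.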
Still, this solution is a stopgap measure, and it would be preferable for journals to implement policies and technical changes on their platforms to encourage more equitable citation practices. Although many journals still print hard copies, essentially all journals have online versions, and many journals are now published online-only. Even so, many online publishers have retained strict manuscript length policies, thereby limiting the number of references allowed. Rather than relegating data reference information to the depths of supplementary material where they will not be included in citation counts, online journals can, as a first step, omit the reference list from their imposed page limits. Another option for journals is to require authors to submit a list of references for data-provisioning publications as an appendix and to publish this list with the main text references in the online version (e.g., McGill et al. Reference McGill, Dornelas and Field2016)—a printed issue could still include only the references cited in the main text. Encouragingly, some journals (e.g., Global Ecology and Biogeography [McGill et al. Reference McGill, Dornelas and Field2016], Scientific Data [personal experience, e.g., Raja et al. Reference Raja, Dimitrijević, Krause and Kiessling2022a]) have already made these changes, allowing authors to fully cite their data sources. Admittedly, these changes will be somewhat onerous, as they require managing and formatting thousands of references; however, AI tools and the refer package presented here are viable options for streamlining this process. Whether these changes are adopted more broadly will depend on demand from the community.
Broader Considerations for Paleontology as a Discipline
Improving data citation practice will also have a positive effect on paleontological journals, especially those that publish specimen-based work (e.g., Acta Palaeontologica Polonica, Journal of Paleontology, Journal of Vertebrate Paleontology). Whereas higher-profile outlets (e.g., Science, Nature) tend to publish paleontological articles on charismatic and unusual specimens (e.g., dinosaurs, fossils in amber) or on large data compilations (e.g., latitudinal diversity gradients, extinction), most paleontological papers are published in discipline-specific journals (Raja and Dunne Reference Raja and Dunne2022). As might be expected, publications in these journals traditionally receive fewer citations, and the journals have lower impact factors.
Just as the citation rate for data-provisioning publications increased after accounting for neglected data citations (Fig. 2), the JIF—another flawed but commonly used evaluative metric (e.g., Neylon and Wu Reference Neylon and Wu2009; Stephan et al. Reference Stephan, Veugelers and Wang2017; Curry Reference Curry2018; Stern and O'Shea Reference Stern and O'Shea2019)—increases substantially for paleontological journals (Fig. 3A). Combining our tabulated neglected citations with currently attributed citations used to calculate JIF by Clarivate (https://jcr.clarivate.com/jcr/home), we found that in the last decade (2010–2019), the JIF reported for a journal in a given year (e.g., Journal of Paleontology in 2015) would increase on average by ~0.1, or 5.08%. This is a conservative estimate, as it only includes neglected citations from the 151 PBDB publications for which data were available. Extrapolating to the entire dataset of 396 PBDB publications suggests that any of the 55 journals categorized by Clarivate as a paleontological journal would see an increase in JIF by ~0.2, or 13.3% (see “7_paleo_journal_JIFcalculation.csv” in Smith et al. [Reference Smith, Raja, Clements, Dimitrijević, Dowding, Dunne and Gee2023a] for raw data for all 55 paleontological journals from 1997 to 2021 and additional information on language and country of publishing). The change in JIF from neglected citations was not, however, uniform across journals or through time. Whereas some journals had no (n = 10; e.g., Micropaleontology, Paleoceanography and Paleoclimatology, Stratigraphy) or few (e.g., GFF, Palaios, Zootaxa) neglected citations in our dataset and limited associated changes to JIF, other journals would have substantial increases in JIF in one or more years (e.g., Palaeontologia Electronica, Palaeontology, PalZ). Furthermore, for those journals with large JIF changes, there is a notable increase in the effect of adding neglected citations in more recent years (Fig. 3B). 
Neglected citations rarely contributed to JIF in the early part of the decade (2010 onward); however, at the end of the decade, the average JIF for 10 highly impacted journals increased by 27% in 2018 and 36% in 2019 when recalculated to include neglected citations. These differences through time are a consequence of the formula for calculating JIF and publishing trends in paleontology (Fig. 3C,D). Because JIF for a given year (e.g., 2020) is based on citations of research published in the preceding 2 years (e.g., 2018 and 2019), a relatively short turnaround time is needed between publication of a data-provisioning study and subsequent PBDB publication using those data. Consequently, many instances of data reuse cannot be incorporated into this metric because of the limited look-back period. Alternatives to the 2-year JIF do exist (e.g., 5-year JIF); however, analyses comparing 2- and 5-year JIFs show minimal differences between the two (e.g., Campanario Reference Campanario2011; Dorta-González and Dorta-González Reference Dorta-Gonzalez and Dorta-González2013). Though it continues to be a widely used metric across many branches of science, JIF performs poorly when capturing reuse of data and undervalues journals where data-provisioning studies are published. Accelerating publication rates in paleontology (Fig. 3C,D) and the shift from printed to online publication appear to have reduced the time between initial data publication and data reuse. Moreover, the number of paleontological journals published in 2021 was 55, more than double the 24 published in 1997 when Clarivate began compiling Journal Citation Reports. As the number of publications and citations in paleontology continues to grow (Fig. 3C,D), so too will the consequences of neglected data citations (Fig. 3A,B). A more equitable future in paleontology requires rapid correction to citation practice.
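To make the arithmetic behind these comparisons concrete, the following Python sketch recomputes a 2-year JIF after adding neglected citations and scales a sample-based percentage to a larger set of publications. It is a minimal illustration under stated assumptions, not the code used in this study: the function names and the example journal are hypothetical, and the linear scaling is one simple reading of the extrapolation from 151 to 396 PBDB publications.

```python
def two_year_jif(citations: int, citable_items: int) -> float:
    """2-year JIF for year Y: citations received in Y by items published in
    Y-1 and Y-2, divided by the number of citable items from Y-1 and Y-2."""
    if citable_items <= 0:
        raise ValueError("citable_items must be positive")
    return citations / citable_items


def jif_with_neglected(attributed: int, neglected: int, citable_items: int):
    """Return (original JIF, recalculated JIF, percent change).

    `neglected` should count only neglected citations to items published
    in the two preceding years, because older items fall outside the window.
    """
    original = two_year_jif(attributed, citable_items)
    revised = two_year_jif(attributed + neglected, citable_items)
    percent_change = 100.0 * (revised - original) / original
    return original, revised, percent_change


def scale_to_full_dataset(percent_sample: float, n_sample: int, n_total: int) -> float:
    """Assumption: neglected citations grow linearly with the number of
    data-using publications examined."""
    return percent_sample * n_total / n_sample


# Hypothetical journal: 200 attributed citations to 100 citable items,
# plus 10 neglected citations that fall inside the 2-year window.
original, revised, change = jif_with_neglected(200, 10, 100)
print(f"JIF {original:.2f} -> {revised:.2f} ({change:.1f}% increase)")

# Scaling the 5.08% average from 151 publications to all 396.
print(f"{scale_to_full_dataset(5.08, 151, 396):.1f}%")
```

Under the linear-scaling assumption, the 5.08% average increase measured from 151 publications corresponds to roughly 13.3% for all 396, which matches the extrapolated value reported above.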
JIF influences more than how journals rank in comparison to one another; it also influences how authors and the work they publish in those journals are regarded and rewarded professionally (Neylon and Wu Reference Neylon and Wu2009; Stephan et al. Reference Stephan, Veugelers and Wang2017; Curry Reference Curry2018; Stern and O'Shea Reference Stern and O'Shea2019). Despite its poor performance as an indicator of quality, JIF continues to influence an author's choice of publication venue and contributes to the perceived importance of the papers published in the journal and, more broadly, the discipline (Neylon and Wu Reference Neylon and Wu2009; Curry Reference Curry2018; Stern and O'Shea Reference Stern and O'Shea2019). In the absence of systematic changes in publication practices in science (e.g., Kravitz and Baker Reference Kravitz and Baker2011; Curry Reference Curry2018; Davies et al. Reference Davies, Putnam, Ainsworth, Baum, Bove, Crosby and Côté2021), increasing the prestige of discipline-specific journals is imperative for increasing the profile of paleontology and will benefit all in the discipline.
Data sharing—particularly when credited appropriately—is an equally important component in any effort to strengthen the field of paleontology. As with data citation, the issue of data sharing is commonly discussed in the literature, and there is a consensus that it is incumbent upon authors to share the data they use to produce their results (e.g., Piwowar and Vision Reference Piwowar and Vision2013; Kaufman et al. Reference Kaufman, Abram, Evans, Francus, Goosse, Linderholm and Loutre2018; Marwick and Birch Reference Marwick and Birch2018; Jones et al. Reference Jones, Grant and Hrynaszkiewicz2019; Lammey Reference Lammey2019; Mandeville et al. Reference Mandeville, Koch, Nilsen and Finstad2021). Even so, data sharing is not practiced consistently (Stuart et al. Reference Stuart, Baynes, Hrynaszkiewicz, Allin, Penny, Lucraft and Astell2018; Gabelica et al. Reference Gabelica, Bojčić and Puljak2022; Roche et al. Reference Roche, Berberi, Dhane, Lauzon, Soeharjono, Dakin and Binning2022). Several large publishers (e.g., Elsevier, Springer, Taylor and Francis, Wiley) have data availability policies with multiple tiers, ranging from written recommendations to strict requirements for publishing data, but it remains at the discretion of journals to enact and enforce these policies (Jones et al. Reference Jones, Grant and Hrynaszkiewicz2019)—this may change in the United States, however, with a new mandate for public availability of data produced in federally funded research, beginning in 2023 (National Science and Technology Council 2022). As demonstrated by Gabelica et al. (Reference Gabelica, Bojčić and Puljak2022), who found that data were only available for 6.8% (n = 3,556) of publications in their review of 333 open access journals from BioMed Central, many data-sharing policies are not effective in practice. 
Data availability was considerably better in the present study, with data accessible for 32% (n = 128) of PBDB publications, although the available data were not always usable for the analysis conducted here. Encouragingly, when data from PBDB publications were not readily available online, 167 of the 268 authors who were contacted (62%) were responsive, and approximately half (n = 84) of these responses included the requested data. Still, we were unable to recover data for 21% (n = 83) of PBDB publications for myriad reasons. With pushes toward big data science in paleontology and related disciplines, it will be up to the community to influence journal policies toward required sharing rather than relying on unenforceable recommendations (Payne et al. Reference Payne, Smith, Kowalewski, Krause, Boyer, McClain, Finnegan, Novack-Gottshall and Sheble2012; Kaufman et al. Reference Kaufman, Abram, Evans, Francus, Goosse, Linderholm and Loutre2018; Jones et al. Reference Jones, Grant and Hrynaszkiewicz2019).
Improved data sharing requires buy-in from individuals, who may themselves benefit from the practice and enhance the quality of science in paleontology. As reviewed by Marwick and Birch (Reference Marwick and Birch2018), there are many reasons to share data (e.g., reciprocal data sharing by others; reproducibility of research; enabling others to ask new questions) and some associated costs (e.g., time required to clean data; data use without citation). One of the incentives is that data sharing is associated with increased citation of the publication where the data were initially published (Sears Reference Sears2011; Piwowar and Vision Reference Piwowar and Vision2013; Tomaszewski Reference Tomaszewski2019; Colavizza et al. Reference Colavizza, Hrynaszkiewicz, Staden, Whitaker and McGillivray2020; Dorta-González et al. Reference Dorta-González, González-Betancor and Dorta-González2021). For example, Colavizza et al. (Reference Colavizza, Hrynaszkiewicz, Staden, Whitaker and McGillivray2020) reported that when publications included data availability statements with the associated data publicly accessible, those publications saw a 25% increase in their citations compared with publications without available data. As demonstrated here (Fig. 2), the potential citation benefit may be even larger in a discipline like paleontology, where publications on data compilations have become mainstream. Changes to the format of funding proposals and reports (for example, the inclusion of a “research outcomes” section that includes datasets by the Deutsche Forschungsgemeinschaft [i.e., German Research Foundation] and a non-publication section in National Science Foundation grant reports) can further encourage data sharing. Of perhaps greater importance, data sharing ensures the reproducibility of scientific results (Piwowar and Vision Reference Piwowar and Vision2013; Altman et al. Reference Altman, Borgman, Crosas and Matone2015; Marwick and Birch Reference Marwick and Birch2018).
As has been demonstrated to the detriment of many fields of study (e.g., behavioral ecology [Viglione Reference Viglione2020], food science [van der Zee et al. Reference van der Zee, Anaya and Brown2017], paleontology [Price Reference Price2022], psychology [John et al. Reference John, Loewenstein and Prelec2012]), some researchers have been guilty of misrepresenting their data. Data sharing provides a means to uphold academic integrity and establishes an ethical and practical standard that encourages scientific advancement (Marwick and Birch Reference Marwick and Birch2018; Raja and Dunne Reference Raja and Dunne2022).
Paleontology has not yet crossed the threshold to become a big data discipline (Allmon et al. Reference Allmon, Dietl, Hendricks and Ross2018) but has the potential to do so in the near future. Realizing this potential will expand research horizons in paleontology but, to be done effectively and equitably (e.g., Raja et al. Reference Raja, Dunne, Matiwane, Khan, Nätscher, Ghilardi and Chattopadhyay2022c), it requires a stable foundation in specimen-based work and reckoning with structural biases. Large paleontological databases, including the PBDB, are far from complete. For example, in examining the collections at nine paleontological museums in the United States, Marshall et al. (Reference Marshall, Finnegan, Clites, Holroyd, Bonuso, Cortez, Davis, Dietl, Druckenmiller and Eng2018) estimated that there were 23 times as many unique localities in those nine collections alone as were in the PBDB at the time. Paleontologists should be wary of assuming our databases are comprehensive, as “having a lot of data is not the same as having all of them; and cultivating such an illusion of completeness is a very risky and potentially misleading strategy” (Leonelli Reference Leonelli2014: p. 7). Activating the extensive data held in museum collections (e.g., unpublished specimens; “extended specimen” data; Webster Reference Webster2017; Allmon et al. Reference Allmon, Dietl, Hendricks and Ross2018; Marshall et al. Reference Marshall, Finnegan, Clites, Holroyd, Bonuso, Cortez, Davis, Dietl, Druckenmiller and Eng2018) will require support for the infrastructure sustaining collections and recognition of the importance of specimen-based work that often goes wanting in paleontology and related disciplines (Johnson et al. Reference Johnson, Filkorn and Stecheson2005; Agnarsson and Kuntner Reference Agnarsson and Kuntner2007; Payne et al. Reference Payne, Smith, Kowalewski, Krause, Boyer, McClain, Finnegan, Novack-Gottshall and Sheble2012; Allmon et al. Reference Allmon, Dietl, Hendricks and Ross2018; Marshall et al. Reference Marshall, Finnegan, Clites, Holroyd, Bonuso, Cortez, Davis, Dietl, Druckenmiller and Eng2018; Engel et al. Reference Engel, Ceríaco, Daniel, Dellapé, Löbl, Marinov and Reis2021; Benichou et al. Reference Benichou, Buschbom, Campbell, Hermann, Kvaček, Mergen, Mitchell, Rinaldo and Agosti2022). A critical component of realizing a big data future in paleontology will be increased funding to support museum collections and data repositories, with respect to both maintaining existing materials and obtaining and curating new materials. Illustrating the scope of the need, Allmon et al. (Reference Allmon, Dietl, Hendricks and Ross2018) estimated that it costs US$1 to digitize each specimen, and digitizing only the currently identified specimens in U.S. collections (as of 2018) would require an investment of US$35 million. That figure increased to US$75 million after including all fossils, not just those with existing taxonomic identifications. Investment at this scale represents a massive increase in funding, as the budget for this type of work in the United States was only US$10 million at the time (Allmon et al. Reference Allmon, Dietl, Hendricks and Ross2018). These monetary estimates also do not account for the costs of data storage and maintenance of data repositories (whether museum-based or external) that provide access to other researchers and the public. Particularly as complex data (e.g., CT scans, images) become more commonplace, meeting infrastructure requirements will be critical to ensuring a big data future in paleontology. Without funding for this fundamental work, growth and advances in paleontology will be slow at best.
The illusion of completeness is elucidated further when considering biases in where data recorded in paleontological databases originated, what organisms are preferentially studied, who contributes to compiling data in databases, and who conducts the research (e.g., Raja and Dunne Reference Raja and Dunne2022; Raja et al. Reference Raja, Dunne, Matiwane, Khan, Nätscher, Ghilardi and Chattopadhyay2022c). Compilations of modern biodiversity data show a clear association between data production and wealthier, more resource-rich countries, particularly those in western Europe and North America (Amano and Sutherland Reference Amano and Sutherland2013; Hughes et al. Reference Hughes, Beas-Luna, Barner, Brewitt, Brumbaugh, Cerny-Chipman and Close2017). The same is true for compilations of paleontological data; a recent study examining data recorded in the PBDB found that 97% of fossil occurrence data were produced by researchers based in high- or upper middle-income countries (Raja et al. Reference Raja, Dunne, Matiwane, Khan, Nätscher, Ghilardi and Chattopadhyay2022c). The same study found a direct link between paleontological data production and socioeconomic factors, such as greater wealth, education level, and political stability (Raja et al. Reference Raja, Dunne, Matiwane, Khan, Nätscher, Ghilardi and Chattopadhyay2022c). These patterns clearly illustrate a global knowledge and power imbalance in paleontological research that can only be rectified by changes to how paleontological research is conducted (Cisneros et al. Reference Cisneros, Raja, Ghilardi, Dunne, Pinheiro, Fernández and Sales2022; Monarrez et al. Reference Monarrez, Zimmt, Clement, Gearty, Jacisin, Jenkins and Kusnerik2022; Raja et al. Reference Raja, Dunne, Matiwane, Khan, Nätscher, Ghilardi and Chattopadhyay2022c).
Conclusion
The scientific value of large-scale analyses in paleontology is undeniable, and the scope and quality of insights produced in such analyses will only increase with the inclusion of more data. Databases like the PBDB have been instrumental in making these research directions possible and, with a community initiative to improve data citation and sharing practices, can continue to unlock new discoveries about life on Earth. Although we focus here on paleontology, similar imbalances affect related and overlapping disciplines (e.g., archaeology [Marwick and Birch Reference Marwick and Birch2018], biodiversity research [Escribano et al. Reference Escribano, Galicia and Ariño2018; Mandeville et al. Reference Mandeville, Koch, Nilsen and Finstad2021], ecology and evolution [Hood and Sutherland Reference Hood and Sutherland2021], taxonomy [Agnarsson and Kuntner Reference Agnarsson and Kuntner2007; Engel et al. Reference Engel, Ceríaco, Daniel, Dellapé, Löbl, Marinov and Reis2021; Benichou et al. Reference Benichou, Buschbom, Campbell, Hermann, Kvaček, Mergen, Mitchell, Rinaldo and Agosti2022]), all of which can benefit from similar structural improvements. Whether citations are attributed to the data or to the original publication, there are potentially large implications for how research and researchers are credited and valued, and how journals are perceived. Our objective here is not to devalue papers examining large-scale trends relying on data compilations drawn from other scientists’ work, but rather to ensure it remains feasible for taxonomists, systematists, and other specimen-based workers, and those conducting the equally important work on stratigraphy, lithology, and depositional environments, to publish research and be credited in a way that acknowledges their critical importance to paleontology and all life sciences. Citation counts and the metrics derived from them continue to influence most aspects of a scientific career. 
When people producing data receive proper credit, the community data pool will increase in availability and quality. At the same time, the profile and prestige of paleontological journals will improve. As a unified science, paleontology will benefit and grow.
Acknowledgments
We thank the many authors of the official PBDB papers who shared their raw data with us, and those responsible for maintaining the PBDB as the excellent community resource that it is. We also thank M. Patzkowsky, G. Jones, M. Hopkins (editor), and P. Monarrez and P. Novack-Gottshall (reviewers) for their comments that improved an earlier version of this article. This work was supported in part by the Paleosynthesis Project, with funding from the Volkswagen Stiftung, and by the TERSANE project, with funding from the Deutsche Forschungsgemeinschaft (FOR 2332; grant nos. KI 806/17-1 [N.B.R., D.D.], BA 5148/1-2 to K. De Baets [P.S.N.], AB 109/11-1 to M. Aberhan [C.J.R.], and Ko 5382/2-1 [Á.T.K.]). P.L.G. was supported by the São Paulo Research Foundation (FAPESP 2022/05697-9). B.M.G. was supported by the National Science Foundation (ANT-1947094 to C. Sidor). B.S. was supported by the Deutsche Forschungsgemeinschaft (JA 2718/3-1) and the Netherlands Earth System Science Centre (NESSC).
Author Contributions
J.A.S., N.B.R., and Á.T.K. contributed equally to this work. J.A.S. led manuscript drafting. J.A.S., N.B.R., Á.T.K., L.P.A.M., C.J.R., and B.S. conceived of and designed the study. D.D., J.A.S., N.B.R., E. M. Dunne, E.M.L., L.P.A.M., P.S.N., C.J.R., B.S., and Á.T.K. contributed to data collection. N.B.R., Á.T.K., and J.A.S. generated code to extract and manipulate data. T.C., D.D., and J.A.S. led figure development. J.A.S. and Á.T.K. conducted analyses. All authors edited, reviewed, and approved the submitted manuscript.
Competing Interest
The authors declare that they have no competing interests.
Data Availability Statement
All data and supplementary material are available on Zenodo at https://doi.org/10.5281/zenodo.7881567.
Code Availability Statement
All code used to extract, manipulate, and visualize data for this manuscript is available at https://doi.org/10.5281/zenodo.7881567. The code for the R package refer is available at https://github.com/adamkocsis/refer.