Policy Significance Statement
Mapping groups of actors sharing similar interests is crucial for understanding and addressing their specific claims. This occurs, for instance, when targeting groups of voters during political campaigns, sorting out the claims of sets of stakeholders during negotiations, or mapping the balance of pros and cons around large public infrastructure projects. Yet social network data, such as who-follows-who, that would be necessary to delineate these groups of actors and identify their claims are not always available. In this study, we propose a computational approach that makes use of the textual contents produced by actors to infer underlying social networks. Since actors’ claims manifest themselves in the topical content of the text data they produce, we show that these texts can be used to identify latent groups of individuals sharing similar interests (i.e., hidden communities of interest).
1. Introduction
Analysis of networks of relationships between different actors is well known to provide a wealth of insights, be it about the identification of key members and their relationships, the interdependence and flows of influence among individuals, groups, and institutions, or the very structure of the networks themselves and their evolution over time. This is what social network analysis (SNA) is all about: developing and applying network methods to investigate social structures wherever they may occur, such as friendship and acquaintance networks, business networks, knowledge networks, and so forth. In social networks, nodes typically refer to persons, organizations, or, generally speaking, actors, while edges or links represent some form of connection between the nodes. The underlying idea is that the network formed by these nodes and edges can be understood as a kind of structure that captures crucial aspects of a particular social or political phenomenon.
In political science, network theory and methods have been increasingly applied in the past decades to a broad range of questions (Lazer, 2011; Ward et al., 2011; Victor et al., 2017). For instance, questions about voting, political participation, interest groups, and legislative networks have been addressed with SNA (Fowler, 2006; Huckfeldt, 2009; Battaglini and Patacchini, 2019; Praet et al., 2021), as have numerous issues in public policy and public administration (e.g., health policy) (Luke and Harris, 2007; Shearer et al., 2014), or in comparative politics (Siegel, 2011). Perhaps one of the best-known areas where network analyses have been applied is international relations (Hafner-Burton et al., 2009), notably on questions of terrorism, trade networks, global governance, and advocacy networks (Krebs, 2002; Ressler, 2006; Knoke, 2015; Varone et al., 2017).
Due to the richness that SNA affords, this type of approach is now ubiquitous across disciplines, from the natural sciences to the human and social sciences. Unsurprisingly, SNA also plays a significant role in the very study of science itself. Indeed, in science studies, the social networks that scientists form have been examined under numerous perspectives, from the identification of latent social structures and “hidden colleges” (Crane, 1969) to broader investigations about the general dynamics of science (Small, 1999; Boyack et al., 2005), including the role of networks of scientists and institutions over such issues as problem selection, discovery, collaboration, or even career dynamics (Tang et al., 2008; Fortunato et al., 2018; Kong et al., 2019).
In any case, before being submitted to analysis, social networks first need to be built, which presupposes access to data about actors and their relationships. Sources of information are very diverse and depend on the cases at hand. In political science, data can, for instance, be manually collated from press articles, extracted from pieces of legislation, reconstructed from email exchanges, or mined from social media such as X (previously Twitter), Facebook, and the like. Likewise, in science studies, major sources of network data, including academic social media and citation data, have been mined to identify researchers’ profiles, collaborations, and trajectories (Tang et al., 2008; Kong et al., 2019). Often, the interpretation of social networks requires supplementary information from which to infer the meaning of the relationships between actors or the specificity of groups of actors or communities. In other words, having data about nodes and edges is not enough when it comes to understanding some of the specific structural features of social networks. For instance, one will want to gain insights into the key properties of specific clusters of actors that become apparent through SNA: What is special about that particular advocacy group compared to the others? What do these political representatives have in common that may explain their higher success in passing pieces of legislation? Such additional data may have been collected, typically as metadata, jointly with actors and interaction data. Yet, often, it is assembled in a second step.
For instance, to make sense of scientific co-citation networks, data about author research specialties are often needed in addition to author names (nodes) and co-citation frequencies (edges), and are typically obtained by examining key publications or keywords through other means (e.g., Raimbault et al., 2016; Réale et al., 2020). With such approaches, networks are built from relational data (links between nodes), and the meaning of the resulting communities (sets of nodes which may share similar content) is inferred with the help of supplementary data.
To address this network interpretability issue, some researchers have proposed to computationally mine the textual data of actors and develop specific topic models that could incorporate such actor-related data (Steyvers et al., 2004), notably in the case of social media and directional networks (McCallum et al., 2007; Pathak et al., 2008), or in the case of co-authorship data (Zhou et al., 2006; Zhang et al., 2007). Others have proposed to further develop community detection algorithms so as to include not only topological information but also prior constraining data on nodes, as in semi-supervised community detection algorithms and graph neural network approaches (Yang et al., 2014; Ye et al., 2018). On the other hand, specific topic modeling techniques have been built upon to elaborate approaches aimed at more than simply extracting topics from textual data. This is, for instance, the case when topic models are used as a first step to identify product opportunities using social media mining (Ko et al., 2018). Here too, we propose to use topic models not only to extract topics from texts but also to infer actor networks.
The present work tackles the question of social networks in extreme contexts where semantic or textual information produced by actors is abundant but relational data are scarce or even nonexistent. In other words, these are contexts where data about the nodes are present, as well as textual data attributable to each node, but where no data about the edges between nodes exist. Such situations could be seen as even precluding the very notion of social network. Yet, as we propose to show, underlying communities of actors can still be identified based on their shared semantic content. We call such communities “hidden communities of interest” (HCoI), that is, groups of actors sharing similar semantic contents but whose social relationships with one another may be unknown. HCoI’s reflect the existence of underlying latent social networks whose study can nevertheless be pursued to gain insights, for instance, into their structure and evolution.
To a certain extent, semantic network approaches may be used in such contexts. Semantic networks typically map relationships between terms used in a given corpus by measuring their co-occurrences (Carley, 1993; Danowski, 1993; Doerfel and Barnett, 1999). These networks have been used in a wide variety of contexts to analyze word association patterns, for instance, in the context of information and communications technology policy (Danowski et al., 2023), in cognitive science for understanding semantic memory (Siew et al., 2019; Kumar et al., 2022; Christensen and Kenett, 2023), in business and management to analyze public–private partnerships (Castelblanco et al., 2021), and in many domains of the social sciences (Segev, 2021). When specifically targeting named entities, semantic networks can be used to infer relationships between these named entities by mapping their co-occurrences in texts. This has been done, for instance, to identify covert networks (Diesner and Carley, 2004) or reconstruct the social networks of cabinet members of past US presidents (Danowski and Cepela, 2010). Semantic network approaches can also be used jointly with social network approaches, for instance, to better understand the semantic content of specific clusters of authors within a social network, as has been done in health policy by examining the tweets of communities of vaccine-hesitant influencers inferred from the graph of their social relationships (Ruiz et al., 2021).
Furthermore, when texts are attributed to actors, semantic networks can then be built for all actors and compared with one another using a matrix similarity measure, resulting in an overall actor network (Danowski, 2011).
The approach we propose here for identifying latent networks of actors is somewhat akin to this last case: the main idea is to infer actor networks from similarities in the content of their textual data. Yet, instead of building actor semantic networks and comparing these networks with one another, we compute author topic profiles from a topic model fitted to the complete textual data, something Danowski (2011) attempted, though inconclusively, with the tools available at the time (to the best of our knowledge, this is the only such attempt, and we identified it post hoc). Here, we show that applying a combination of topic modeling and community detection approaches can indeed help reconstruct underlying communities of authors, or HCoI’s as we propose to name them. An advantage of using topic models in this context is the additional possibility of easily gaining insight into the semantic specificity of each community. Furthermore, by segmenting the corpus into time periods, diachronic analyses can be conducted to examine the temporal evolution of the communities and their genealogies.
Such an approach should be of interest to a diversity of political science contexts where textual data are accessible jointly with the actors that produce them, but the relationships between these actors are unknown. This could be the case with textual data gathered from a variety of Internet sources (e.g., advocacy blogs, political websites, newsfeeds) as well as varied published materials (e.g., newspaper articles, books, pamphlets) or even transcripts of recorded data (e.g., interviews). Such textual data, of course, may concern any policy-related topics, be they in public policy and administration (e.g., textual data expressing opinions about public health policies), international relations (e.g., textual data produced by various extremist groups), or voting and political participation (e.g., textual data originating from political blogs). The identification of HCoI’s in these varied contexts should in turn provide valuable insights at all stages of the policy lifecycle, from agenda setting and policy formulation, to decision-making, policy implementation, and evaluation (Howlett et al., 2020).
In the present contribution, we use as a test case a non-policy-oriented corpus of 16,917 full-text academic articles in the domain of the philosophy of science. Besides being readily available from previous studies that led to its topical analysis (Malaterre and Lareau, 2022), the corpus has the advantage of containing a large amount of textual data (in English) produced by numerous authors over a period of nearly 90 years, thereby allowing diachronic analyses.
The proposed approach starts by fitting a latent Dirichlet allocation (LDA) topic model to this corpus, thereby resulting in topic probability distributions for all full-text articles. Having split the corpus into four broad time periods, these topic probability distributions were averaged out per author and per time period, depending on the author’s contribution to each article. This resulted in author topic profiles for each time period. Correlation analyses between author topic profiles then led to the construction of author correlation networks for each time period, which were submitted to Louvain community detection. In turn, the topic profile of each community was quantified by averaging out author topic profiles per community. These community topic profiles provide immediate insights into the semantic specificities of each community. Furthermore, measuring pairwise distances between community topic profiles across time periods provides a means of understanding the diachronic evolution of communities and their genealogies.
In what follows, we first describe the data and methods in more detail (Section 2). We then present the results, notably the networks of communities that were detected and their temporal evolution (Section 3). These findings are then discussed (Section 4).
2. Data and Methods
Since the proposed identification of HCoI’s relies on the textual content produced by actors, the approach comprises two main steps: first, the assembly of the working corpus and its preparation for computational analyses and second, the computational analyses per se (Figure 1).
2.1. Step 1—Corpus assembly and preprocessing
For this case study, we used a corpus of full-text academic articles that had been assembled by Malaterre and Lareau (2022). The corpus spans from 1930 (the first issue of the earliest published journal) to 2017 and includes 16,917 research articles from eight of the most significant English-language philosophy of science journals (Table 1). In the present case, there are good reasons to consider the corpus as a representative sample of all the texts produced by the authors assembled here. Of course, philosophy of science is published in numerous other venues, including general philosophy journals, discipline-focused journals, science journals, and books. It is also published in many languages other than English. Nevertheless, the corpus includes the most authoritative journals of the field. It also includes the journals that started the field some ninety years ago and are still flagship journals today. These are therefore good reasons for accepting the corpus as offering a representative perspective of the discipline.
The corpus was cleaned and preprocessed in a standard way. To reduce the size of the lexicon, and therefore computation time, only nouns, verbs, adverbs, and adjectives were kept following part-of-speech (POS) tagging and lemmatization, using the TreeTagger package (Schmid, 1994) with Penn TreeBank tag sets (Marcus et al., 1993). Words occurring in fewer than 50 sentences in the corpus were removed. In parallel, author names were checked and disambiguated, ensuring that consistent spellings were used throughout the corpus. All authors (N=8,009) were assigned publication weights based on their respective number of articles (coauthored articles were evenly split). Four main time periods of 22 years each were then defined (1930–1951, 1952–1973, 1974–1995, 1996–2017).
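The weighting and period assignment just described can be sketched as follows. This is a minimal illustration, not the authors' code: the record layout (`year`, `authors`) and the function names are assumptions, and the toy records are hypothetical. Only the period boundaries and the even split between coauthors come from the text.

```python
from collections import defaultdict

# Period boundaries as listed in the text (inclusive years).
PERIODS = [(1930, 1951), (1952, 1973), (1974, 1995), (1996, 2017)]


def period_of(year):
    """Return the index (0-3) of the period containing `year`, or None."""
    for index, (start, end) in enumerate(PERIODS):
        if start <= year <= end:
            return index
    return None


def publication_weights(articles):
    """Sum per-author weights, splitting each article evenly among coauthors."""
    weights = defaultdict(float)
    for article in articles:
        share = 1.0 / len(article["authors"])
        for author in article["authors"]:
            weights[author] += share
    return dict(weights)


# Hypothetical toy records, not taken from the corpus.
articles = [
    {"year": 1948, "authors": ["A"]},
    {"year": 1960, "authors": ["A", "B"]},
    {"year": 2001, "authors": ["B", "C"]},
]
weights = publication_weights(articles)
```

With these toy records, author A accumulates one full article plus half of a coauthored one, which is the kind of fractional count that the later filtering on weighted publications relies on.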
As shown in Figure 2, the volume of articles has significantly increased over the past eight decades, from 1,575 for the 1930–1951 period to 8,300 for the 1996–2017 period, which is a 5.3-fold increase. Meanwhile, the number of authors has undergone an eightfold increase. Knowing that the number of articles per author has roughly remained constant throughout all four periods at about 2, the increase in authors denotes an increase in co-authorship. Indeed, the share of multiauthored articles has increased fourfold, from 4% in the first period to 16% in the last. Although this share is significantly lower than that in the sciences where single-authored articles are now virtually non-existent (e.g., in ecology, Barlow et al., 2018), or even in some areas of the humanities (e.g., in economics, Kuld and O’Hagan, 2018), multiple-authorship has been steadily rising in the philosophy of science. Note that the proportion of authors who only publish once (or “transients,” see Crane, 1969) is relatively stable at about 65%. This is not significantly different from what is observed elsewhere, a partial explanation being the share of doctoral students and postdoctoral researchers (e.g., in synthetic biology, Raimbault et al., 2016). Moreover, while the proportion of new authors (from one period to another) is above 80%, this proportion has been decreasing over time. This means that, although many authors in any given period were not present in the previous period, new authors now tend to represent a smaller share of authors than they used to.
2.2. Step 2—Data analysis and HCoI’s identification
Following Malaterre and Lareau (2022), topic modeling was carried out with the well-known LDA algorithm (Blei et al., 2003; Griffiths and Steyvers, 2004). Units of analysis were complete articles. The topic modeling operation resulted in 25 probability distributions over the lexicon of the corpus terms (each probability distribution considered to represent a topic), and the probability distributions of these topics in each one of the 16,917 articles. The number of topics k = 25 was chosen as a compromise between an optimal coherence measure (Röder et al., 2015), computed for a variety of models from k = 5 to 100 (see Figure 3), and the manual inspection of top words (in particular for models with k below 35, since higher-k models led to no increase in coherence). In the end, the topics of the model with k = 25 were found to be more meaningfully interpretable compared to the others, while being characterized by a relatively high coherence (though not the highest).
Inspection of the most probable terms within each topic and of selected text excerpts made it possible to carefully interpret and label all topics. For ease of handling, topics were also grouped into categories based on their correlation within corpus documents, with Louvain community detection performed on the graph of topic correlations in Gephi (Bastian et al., 2009). These categories were interpreted based on expert knowledge of the field (the categories are denoted by a capital letter in front of the topic name). To give a sense of the 25 topics, their top 10 words are listed in Table 2, sorted by category and in alphabetical order.
In this particular case study, the topics correspond to well-known research themes in the philosophy of science (Malaterre and Lareau, 2022). Group A of topics denotes research questions that are characteristic of the philosophy of language and logic. Group B includes topics in epistemology and theory of knowledge (including questions about realism), while group C relates more specifically to induction, confirmation, and the use of probabilities. Group D is about rational decisions and game theory. Topics in the philosophy of biology and the neurosciences are found in group E. Group F includes a set of traditional topics that concern the process of scientific explanation, the nature of causation, and the status of natural kinds. Topics in the philosophy of physics are in group G, with thermodynamics, electromagnetism, chemistry in one topic, and relativity and quantum theory in the other. Finally, group H gathers topics that are characterized by a more historical or social discourse. These include research themes in the history of science and in the history of philosophy, but also investigations on the social dimensions of science.
Note that a measure of similarity between topics was also calculated as 1 − d, where d is the Hellinger distance (which is appropriate for probability distributions) between topic probability distributions over the corpus lexicon, using the Gensim implementation (Rehurek and Sojka, 2010). The corresponding heat map (Figure 4) confirms that the topics are overall very dissimilar, the only slight exceptions being the topic B-Arguments, whose similarity values are among the highest, reaching about 0.4 with the two other topics of cluster B (hence denoting quite a generic topic), and topic H-History, also with similarity values near 0.4 with two other topics of cluster H.
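For readers unfamiliar with this measure, the similarity 1 − d can be sketched in plain Python. This is an illustrative sketch only: it assumes the standard Hellinger form normalized to the range [0, 1], and the function names are ours, not those of the library used in the study.

```python
import math


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions of equal length.

    Assumes the usual normalization, so that the distance lies in [0, 1].
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total / 2.0)


def hellinger_similarity(p, q):
    """Similarity defined as 1 - d, as in the text."""
    return 1.0 - hellinger_distance(p, q)


# Hypothetical toy distributions over a four-word lexicon.
topic_x = [0.7, 0.1, 0.1, 0.1]
topic_y = [0.1, 0.1, 0.1, 0.7]
similarity_xy = hellinger_similarity(topic_x, topic_y)
```

Two identical distributions give a similarity of 1, and two distributions with no shared support give 0, so values near 0.4 indicate only modest overlap in vocabulary.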
Obviously, the topics identified through such topic models depend on the textual data that are being modeled. A policy-related corpus will exhibit topics about policy matters as expressed by the texts. Note that the size and structure of the corpus also matter, notably with respect to the number k of topics that will ultimately be chosen: this is the case for the units of analysis (e.g., entire documents vs. sections or paragraphs), and the overall number of these units of analysis. Metrics can help in the choice of k, though we have found that the interpretability of the topics by human judgment and their relevance for the research questions at stake also matter much (Grimmer and Stewart, 2013).
The probability distributions of topics inside documents were then used to identify the semantic signature of authors in terms of their contributions to particular topics. First, articles were sorted depending on their publication year into the four time periods defined above, and equally split between coauthors (i.e., an article with two authors counted for ½ for each author in terms of weight). Then, article topic distributions were averaged out per author for each one of these periods (taking into account the co-authorship weights). This step resulted in topic profiles for each author based on their publications during any given time period. In other words, for each time period, probability distributions over the 25 topics were computed for all authors having published during that time period. These distributions are what we call “author topic profiles”.
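The averaging step can be illustrated with a short sketch. It assumes that an author's profile is the weighted mean of the topic distributions of their articles within one period, with each coauthor receiving an equal share 1/n of an n-author article; the data layout and function name are hypothetical.

```python
from collections import defaultdict


def author_topic_profiles(articles):
    """Weighted mean of article topic distributions per author.

    Each article is a dict with "authors" (list of names) and "topics"
    (a probability distribution over topics). Every coauthor receives an
    equal share 1/n of the article as weight.
    """
    sums = {}
    totals = defaultdict(float)
    for article in articles:
        weight = 1.0 / len(article["authors"])
        for author in article["authors"]:
            acc = sums.setdefault(author, [0.0] * len(article["topics"]))
            for k, prob in enumerate(article["topics"]):
                acc[k] += weight * prob
            totals[author] += weight
    # Dividing by the total weight keeps each profile a probability distribution.
    return {a: [v / totals[a] for v in vec] for a, vec in sums.items()}


# Hypothetical articles from a single time period, with three topics.
period_articles = [
    {"authors": ["A"], "topics": [0.8, 0.2, 0.0]},
    {"authors": ["A", "B"], "topics": [0.2, 0.2, 0.6]},
]
profiles = author_topic_profiles(period_articles)
```

In this toy case, author A's profile is dominated by the single-authored article, since the coauthored one contributes with only half its weight.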
For each time period, Pearson correlations among these author topic profiles were calculated. Correlation networks were then built in Gephi (Bastian et al., 2009), and communities were detected with the Louvain algorithm (with default parameters). To reduce noise, only authors with a weighted publication count above 2 were retained (thereby filtering out “transient authors”), and a correlation threshold was set to 0.6 (this resulted in keeping all significant author communities connected to the network main component across all four time periods while removing clutter). To facilitate the interpretation of each author community, topic profiles (i.e., topic probability distributions) were calculated at the community level by averaging out their author topic profiles.
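The construction of the correlation network can be sketched as an edge list, before any community detection is applied. The sketch below is an assumption-laden illustration: it keeps pairs whose correlation is at or above the threshold (the text does not say whether the bound is inclusive), treats constant profiles as uncorrelated, and uses hypothetical names and toy data. Louvain detection itself is not reproduced here.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant profile is treated as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def correlation_edges(profiles, weights, min_weight=2.0, threshold=0.6):
    """Edges between retained authors whose profiles correlate strongly."""
    kept = sorted(a for a in profiles if weights.get(a, 0.0) > min_weight)
    edges = []
    for a, b in combinations(kept, 2):
        r = pearson(profiles[a], profiles[b])
        if r >= threshold:  # assumption: inclusive threshold
            edges.append((a, b, r))
    return edges


# Hypothetical author topic profiles and weighted publication counts.
profiles = {"A": [0.7, 0.2, 0.1], "B": [0.6, 0.3, 0.1],
            "C": [0.1, 0.2, 0.7], "D": [0.7, 0.2, 0.1]}
weights = {"A": 3.0, "B": 2.5, "C": 4.0, "D": 1.0}
edges = correlation_edges(profiles, weights)
```

Here author D is dropped by the publication filter despite having the same profile as A, and only the A-B pair survives the correlation threshold. The resulting edge list is the kind of input a community detection tool would then partition.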
To get further insights into the genealogy of communities over time, we calculated the Hellinger distances between community topic profiles across time periods and focused on closest pairings. These sets of distances made it possible to identify which communities persisted over time, which other ones appeared or disappeared, bifurcated or merged, thereby generating a diachronic picture of the evolution of author communities and their main topics.
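The genealogical pairing can be sketched as follows. This is a simplified illustration under stated assumptions: community profiles are taken as unweighted means of member profiles (the text does not specify a weighting), each later community is linked to its single closest predecessor, and the community labels and numbers are hypothetical.

```python
import math


def mean_profile(member_profiles):
    """Unweighted mean of member topic profiles (an assumption of this sketch)."""
    n = len(member_profiles)
    return [sum(col) / n for col in zip(*member_profiles)]


def hellinger(p, q):
    """Hellinger distance, normalized to lie in [0, 1]."""
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total / 2.0)


def closest_predecessors(earlier, later):
    """Map each later community to its nearest earlier community and distance."""
    links = {}
    for name, profile in later.items():
        best = min(earlier, key=lambda prev: hellinger(earlier[prev], profile))
        links[name] = (best, hellinger(earlier[best], profile))
    return links


# Hypothetical community topic profiles for two consecutive periods.
earlier = {
    "1a": mean_profile([[0.8, 0.1, 0.1], [0.6, 0.3, 0.1]]),
    "1b": [0.1, 0.1, 0.8],
}
later = {"2a": [0.65, 0.25, 0.10], "2b": [0.2, 0.1, 0.7], "2c": [0.1, 0.8, 0.1]}
links = closest_predecessors(earlier, later)
```

In this toy example, 2a and 2b each have a close predecessor, whereas the nearest predecessor of 2c lies much farther away. A large minimal distance of this kind is what would suggest a newly appearing community rather than a continuation, although judging where that boundary lies remains an interpretive step.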
3. Results
3.1. HCoI’s and their evolution through time
The approach we described makes it possible to identify groups of actors sharing similar semantic contents as revealed by the texts they produce, but whose social relationships with one another may be initially unknown or underlying (HCoI’s). Technically speaking, HCoI’s are groups of actors whose topic profiles (obtained through a weighted average of their document topic profiles) are highly correlated with one another. Adding a temporal dimension makes it possible to map the evolution of the different communities through time. In the present case study, we chose to investigate the different HCoI’s through four successive time windows so as to shed light not only on the structure of the resulting networks of actors, but also on their evolution and the relative importance of the different communities through time.
As with any network, key structural features of HCoI networks can be analyzed with the help of descriptive measures such as density (ratio of actual edges to the total possible number of edges), betweenness (the extent to which nodes lie between other nodes), modularity (the extent to which the network divides into densely connected groups that are only sparsely connected to one another), cohesion, or degree (number of edges per node) (e.g., Wasserman and Faust, 1994; Borgatti et al., 2013; Yang et al., 2016). Table 3 summarizes some of these network statistics in the present case study (calculated with Gephi on each graph). The density of the networks tends to decrease over time, while measures of betweenness, cohesion, and modularity increase: this indicates that the HCoI’s gradually differentiated from one another over time, resulting in more distinct and compact communities in the 2000s compared to the 1930s. This trend toward more numerous and highly distinctive communities also visually stands out when looking at the sequence of network graphs over time (Figures 5–8). These figures include a network representation of the communities present during the corresponding time period (Figures “a,” where nodes represent actors; actor name size and node size proportional to actor weighted number of publications; node color corresponding to community dominant topic), and the topic profiles of the communities (Figures “b”).
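Two of these descriptive measures, density and degree, are simple enough to sketch directly from an edge list. The sketch assumes an undirected graph without self-loops or duplicate edges; the function names and toy graph are hypothetical, and the statistics reported in the study were computed with Gephi, not with this code.

```python
def density(n_nodes, edges):
    """Ratio of actual edges to possible edges in an undirected simple graph."""
    if n_nodes < 2:
        return 0.0
    return len(edges) / (n_nodes * (n_nodes - 1) / 2)


def degrees(nodes, edges):
    """Number of edges incident to each node."""
    counts = {node: 0 for node in nodes}
    for a, b in edges:
        counts[a] += 1
        counts[b] += 1
    return counts


# Hypothetical toy network: one triangle and one isolated actor.
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C"), ("A", "C")]
network_density = density(len(nodes), edges)
node_degrees = degrees(nodes, edges)
```

With four nodes there are six possible edges, so the three edges of the triangle give a density of 0.5, while the isolated actor has degree 0.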
As is apparent, the overall field of interest, namely the discipline of philosophy of science, has significantly grown in terms of both domains of interest and actors. In particular, the number of specialized communities has undergone a threefold increase over the past eight decades. In the 1930s–1940s (Figure 5a), the field comprised just a handful of communities. A clearly identifiable cluster (1a) consists of the community of the logical positivists and members of the Vienna Circle (e.g., Neurath, Reichenbach, Carnap, Hempel), distinctively focused on the philosophy of language and logic (as can be seen on the topic profile of that community in Figure 5b). As is well known to experts in the field, the subsequent development of the philosophy of science owes much to these authors. The other half of the network consists of two closely interconnected communities, with, on the one hand (1c), a group of actors somewhat at the border between philosophy and other humanities (e.g., history, anthropology, economics, psychology), and on the other (1d), authors engaging in more traditional metaphysics or ontology (e.g., realism, subjectivity). Although still engaging with science, these two groups remained much anchored to a classical philosophical discourse. A distinct and much smaller community (1b), somewhat at the fringe of the network, consists of philosophers focusing more on physics, and discussing issues related to matter, energy, or physical theories (e.g., electromagnetism or quantum mechanics; note the presence of Malisoff, founder of Philosophy of Science).
The actor network developed substantially from the 1950s through the early 1970s, with an increasing number of interconnected communities (Figure 6). Community (2a) includes logicians and philosophers of mathematics with a distinctively formal vocabulary. Philosophers of language constitute a separate community (2b). Occupying a central position in the network, (2c) is a community targeting specific issues related to confirmation and the status of scientific theories (e.g., induction, verifiability, corroboration, or refutation; note the presence of Popper). A small and peripheral group of actors (2d) consists of the nascent community of philosophers of biology, with a notable focus on evolutionary theory. By contrast, philosophers of physics constitute a larger community (2e), addressing a diversity of epistemic issues related, for instance, to relativity theory or quantum mechanics. In continuity with the previous period, a distinctive community is constituted by actors at the border with traditional philosophy (2f), while a nearby community appears to address more sociological aspects of science (2g).
From the 1970s through the 1990s, the philosophy of science continued to grow in terms of actors but also in terms of topic communities (Figure 7). Community (3a) consists of logicians and philosophers of mathematics, somewhat in continuity with a second community (3b) more centered on semantics and the philosophy of language (note the presence of Hintikka, known for his work on formal epistemic logic and game semantics for logic). A specific community consists of actors addressing epistemology and theory of knowledge questions (3c). Questions about confirmation and the status of scientific theories characterize a community somewhat at the center of the network (3d). Note the appearance of a specific community focused on probability theory and its relevance for science and knowledge (3e). Another new community consists of actors interested in decision and game theories, and their applications in science (3f). The community of philosophers of biology remains at the margin of the rest of philosophy of science but has significantly grown in size (3g; note the presence of Sober). A novel community has appeared around researchers more specifically targeting the philosophy of mind and the neurosciences (3h). Yet another community of actors distinctively focuses on causation (3i). Two communities are characterized by topics related to the philosophy of physics: the first one with a clear focus on quantum mechanics and relativity (3k), and the second smaller one more oriented toward thermodynamics, chemistry, and electromagnetism (3j). A relatively diffuse community gathers actors who tend to have a more traditional philosophical standpoint (3l). Finally, a large community consists of a diverse set of actors who tend to target some social dimensions in science (3m).
The trend toward a growing number of actors and a specialization of discursive topics continued from the 1990s through the 2010s (Figure 8). A community of philosophers of language (4a) is quite tightly connected to a second community of philosophers of logic (including modal and intuitionistic logic) notably interested in notions of truth (4b). A nearby community consists of epistemologists, philosophers specializing in theory of knowledge (4c). Toward the center of the network, a community focuses on the status of scientific theories, notably with respect to realist and anti-realist stances (4d). A nearby and more diffuse community gathers actors interested in topics that relate to data, experiments, and modeling, but also to some extent to causation (4e). The community of philosophers of probability, which had appeared in the previous period, has grown in size and individuated itself (4f). A community of researchers bridging, to some degree, philosophers of probability and of logic consists of actors focusing on game theory and various aspects of rational choice theory (4g). The community of philosophers of biology (4h) has significantly grown and is somewhat more integrated with the rest of the network, notably with the community of philosophers of the neurosciences and others interested in scientific explanation (4i). At the center of the network lies a community generally interested in ontology (4j), addressing issues about properties or kinds among others. A large group of philosophers of physics constitutes a relatively distinct community that tends to focus on relativity and quantum mechanics, with related issues such as the structure of space-time (4k). A noticeably distinct category of actors appears to mobilize classical philosophical works in their discussion of science (4l). Finally, a community of authors focuses on the social dimensions of science and various aspects of the practice of science (4m).
What was done here with this case-study corpus of academic articles could be done in any other policy-related context where textual data are abundantly produced by a set of actors whose relational data may be unknown. As soon as texts can be associated with actors, the topics extracted from these texts can be used to construct actor-specific topical profiles that can in turn be used to assess the relative proximity of these actors to one another. The resulting HCoI’s then correspond to groups of actors who produce texts with similar semantic content. In other words, HCoI networks capture actor relationships in terms of who-talks-about-the-same-things-as-who.
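The chain from texts to actor proximity can be illustrated with a minimal sketch. It assumes that each document already comes with a topic distribution (for instance from a fitted topic model) and a list of authors. The toy data, the function names, the rule that each co-author receives the full topic distribution of a shared document, the use of Pearson correlation as the similarity measure, and the 0.5 threshold are all illustrative assumptions rather than the exact procedure of the case study.

```python
from collections import defaultdict
from math import sqrt

def actor_profiles(doc_topics, doc_authors):
    """Average the topic distributions of the documents written by each actor.

    Assumption: every co-author of a document receives its full distribution.
    """
    sums = {}
    counts = defaultdict(int)
    for doc_id, dist in doc_topics.items():
        for author in doc_authors[doc_id]:
            acc = sums.setdefault(author, [0.0] * len(dist))
            for k, p in enumerate(dist):
                acc[k] += p
            counts[author] += 1
    return {a: [v / counts[a] for v in acc] for a, acc in sums.items()}

def pearson(x, y):
    """Pearson correlation between two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def similarity_edges(profiles, threshold=0.5):
    """Keep an edge between two actors when their profiles correlate strongly."""
    actors = sorted(profiles)
    edges = []
    for i, a in enumerate(actors):
        for b in actors[i + 1:]:
            r = pearson(profiles[a], profiles[b])
            if r >= threshold:  # hypothetical cut-off, to be tuned per corpus
                edges.append((a, b, r))
    return edges

# Toy corpus (hypothetical): three documents, three topics, three actors.
doc_topics = {
    "d1": [0.8, 0.1, 0.1],
    "d2": [0.7, 0.2, 0.1],
    "d3": [0.1, 0.1, 0.8],
}
doc_authors = {"d1": ["ana"], "d2": ["ana", "ben"], "d3": ["cy"]}

profiles = actor_profiles(doc_topics, doc_authors)
edges = similarity_edges(profiles)
for a, b, r in edges:
    print(f"{a} -- {b}: r = {r:.3f}")
```

The resulting weighted edge list is the kind of input that a community detection algorithm would then partition into groups of actors with similar topical profiles.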
3.2. Retracing community genealogies
Measuring the pairwise distances between the topic profiles of any two communities from two different periods provides insights into the transformation of HCoI’s into one another through time: the shorter the distances, the closer the communities in terms of their thematic interests (Figure 9). This makes it possible to map the evolution of HCoI’s through time, and in particular their genealogies, as communities from one time period split into two or more communities in the subsequent period, or the other way around. When considered in conjunction with community topic profiles, the genealogical relationships between communities also make it possible to understand the relative shifts in significance of the different thematic interests through time among related HCoI’s.
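As a rough illustration of this genealogical step, the sketch below compares community topic profiles from two consecutive periods and links each later community to its closest predecessor. The Jensen-Shannon distance is used here only as one plausible choice for comparing probability distributions; the distance measure, the community labels, and the toy profiles are assumptions made for the example.

```python
from math import log2, sqrt

def js_distance(p, q):
    """Jensen-Shannon distance (base 2) between two probability distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * log2(a / b) for a, b in zip(x, y) if a > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return sqrt(max(0.0, divergence))  # guard against tiny negative rounding

def nearest_predecessors(earlier, later):
    """Map each later community to the earlier community with the closest profile."""
    return {
        late: min(earlier, key=lambda early: js_distance(earlier[early], prof))
        for late, prof in later.items()
    }

def successors(nearest):
    """Invert the mapping: earlier community -> sorted list of later communities."""
    out = {}
    for late, early in nearest.items():
        out.setdefault(early, []).append(late)
    return {early: sorted(lates) for early, lates in out.items()}

# Toy community topic profiles over three topics (hypothetical labels and values).
earlier = {"A1": [0.70, 0.20, 0.10], "A2": [0.10, 0.45, 0.45]}
later = {
    "B1": [0.65, 0.25, 0.10],
    "B2": [0.10, 0.80, 0.10],
    "B3": [0.10, 0.10, 0.80],
}

nearest = nearest_predecessors(earlier, later)
lineage = successors(nearest)
for early, lates in sorted(lineage.items()):
    kind = "splits into" if len(lates) > 1 else "continues as"
    print(early, kind, ", ".join(lates))
```

In this toy example, one earlier community continues as a single later community while the other splits into two. In practice, inspecting the full distance matrix rather than only the nearest predecessor helps reveal later communities that sit close to several earlier ones, or close to none.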
In the present case study, transitioning from the first period (1930–1951) to the second (1952–1973) (Figure 9a), one sees a reasonably good filiation between communities focused on philosophy of language and logic (1a to 2b). However, the other three communities tend to consolidate into one, and possibly into two (1b–d to 2f–g): the early philosophy of physics (1b) tends to bifurcate into a community still centered on similar physics-related topics (2e) and another community closer to traditional philosophy (2f), the latter being in the continuity of actors engaging in more metaphysics or ontology (1d). Note how the communities 2f and 2g appear to be relatively close to all the communities of the previous period, indicating a reconfiguration of actors and their topics of interest. Given the increase in the number of communities, this also denotes a form of marginalization of what once constituted the core of the philosophy of science. Note how the philosophy of biology community of the second period (2d) shows little continuity with previous communities, indicating the emergence of a novel HCoI.
The transition from the second period (1952–1973) to the third (1974–1995) also shows an increase in the number of communities, yet filiations tend to be stronger, indicating a form of stabilization of research communities with novel themes still emerging (Figure 9b). Philosophy of language and logic communities map well onto one another (2a-b to 3a-b). The community about confirmation and scientific theories (2c) persisted into 3d, while giving rise to a distinct community focused on probability theory and its relevance for knowledge (3e). The philosophy of biology community also persisted as a well identified set of actors and topics (2d to 3g). The community of philosophers of physics (2e) appears to have grown and split into one community more focused on relativity and quantum theory (3k) and another on the rest of physics (3j). The two socio-historico-philosophical communities (2f–g) somehow persisted (3l-m), though one notes a relative proximity of the latter communities to many of the communities of the previous period, indicating multiple reconfigurations. Four novel communities appeared in the 1970s–1980s without any clear filiation from communities of the previous period: a community focusing on knowledge theory (3c), another exploring game theory and rational choice (3f), yet another on philosophy of mind and the neurosciences (3h), and finally a community distinctively focusing on causation (3i).
The number of communities stabilized during the last decades of the 20th century. The transition from the third (1974–1995) to the fourth period (1996–2017) shows a relatively good continuity (Figure 9c). Communities of philosophers of language and logic slightly reorganized themselves depending on topic alignments but remained stable as a group (3a-b to 4a-b). Epistemologists persisted as a specific community, while gaining in momentum and autonomy (3c to 4c). The community focusing on probability theory and knowledge also persisted (3e to 4f), as did the communities on game theory (3f to 4g), on the philosophy of biology (3g to 4h), on the philosophy of mind and the neurosciences (3h to 4i), on the philosophy of relativity and quantum theory (3k to 4k), and on the social dimensions of knowledge (3m to 4m). On the other hand, some communities appear to have dissolved into several others. This is notably the case for the community on confirmation and scientific theories (3d), denoting a detachment from these topics in the 1990s–2000s. Similarly, the community focusing on chemistry, electromagnetism, or thermodynamics (3j) dissolved in subsequent decades, as did the one centered on more traditional philosophical issues (3l). Finally, philosophers focusing on causation (3i) appear to have joined a broader community also interested in data, experiments, and modeling (4e).
Similar genealogical analyses can be carried out in any other context where actors and their HCoI’s are similarly characterized by topic profiles. While an actor topic profile can be interpreted as a distinctive marker for each actor and used to cluster actors into specific interest-driven communities, the overall topic profile of each community can in turn be interpreted as a distinctive trait of that community and used to assess its relative proximity to other communities across time. This then makes it possible to trace back community genealogies and shed light on the origination of specific HCoI’s.
4. Discussion
As we have seen, the findings result from a combination of topic modeling and community detection approaches. The main objective of these approaches is to identify HCoI, that is, groups of actors sharing similar semantic contents but whose social relationships with one another may be unknown. The methods make it possible to identify such communities from their semantic content and in the absence of known social connections. They also make it possible to assess the relative topic proximity of these communities at a given time and diachronically.
In the present case study, the identification of HCoI’s in the academic domain of the philosophy of science highlights semantic reconfigurations in the field, with some communities dissolving into others (e.g., the community of confirmation and scientific theories of the 1970s–1990s), and others subsequently emerging (e.g., the communities of philosophers of biology in the early 1970s, or of epistemologists in the 1980s). Overall, the evolution of actor communities shows a phase of growth and diversification as the number of actors (and interests) increased, followed by a later phase of stabilization characterized by a form of intellectual entrenchment of larger and usually well-delineated communities. These results concur with known episodes of the field, for instance the role of logical positivism in the constitution of the philosophy of science in the early 20th century (Giere and Richardson, 1996) or the emergence of a philosophy of biology in the 1970s, as can be reconstituted by examining dedicated anthologies (Sober, 2006; Rosenberg and Arp, 2009), which lends confidence to the approach.
The methods can be relevant to a broad range of policy-related contexts where textual data can be gathered alongside the corresponding author-actors, even in the total absence of relationship data between actors. The construction of topic profiles for all actors is sufficient for inferring the underlying content-oriented networks they form, thereby making it possible to identify the HCoIs that these actors form (together with a topic-based interpretation of the distinctive semantic profile of each community). When a temporal dimension is added, HCoI genealogies can also be generated, providing additional insights into the appearance and evolution of communities.
Note that the resulting networks differ from social networks as usually construed. Indeed, social networks are typically understood as depicting actual relationships between actors, such as who-follows-who, who-is-friend-with-who, who-sends-a-message-to-who, and so forth. By contrast, in the HCoI approach that is proposed here, relationships between actors are inferred from similarities in the thematic content of their texts: the actor networks are based on the similarity of their topic profiles (averaged from their respective texts). In short, HCoI’s are about who-talks-about-the-same-thing-as-who. Whereas typical social networks can be said to encapsulate specific social relationships between actors (e.g., that of sending messages to one another), HCoI’s capture relationships between actors that are mediated by the texts produced by these actors. Obviously, in some contexts, both social relational data and textual data may be available, offering the special opportunity to build multilayer or multiplex networks that provide complementary perspectives on the same phenomenon. In any case, mapping out HCoI’s can be used as a heuristic to identify actual social connections between actors.
As mentioned earlier (Section 1), semantic networks may also be used to identify HCoI’s (e.g., Danowski, 2011). However, this requires first building semantic networks for all actors based on their texts (which means calculating square matrices whose dimensionality equals the size of the lexicon of the corpus), then measuring the pairwise distances between all networks (i.e., between all matrices). In the case of corpora with numerous authors, this approach may prove computationally demanding. In the case of the present corpus, this would mean building 8,009 semantic matrices of dimensionality 23,672 × 23,672 (size of the lexicon after lemmatization and POS tag filtering), and then calculating 8,009 × 8,009 / 2 matrix distances. By comparison, the approach proposed here relies on only three matrices of smaller dimensionality: an author × topic matrix (8,009 × 25), a document × topic matrix (16,917 × 25), and a topic × term matrix (25 × 23,672). Although semantic networks can provide more details in terms of pairwise relationships between terms as opposed to the bag-of-words approach of topic models, the latter should prove less computationally demanding and more feasible. Furthermore, the topic model approach to HCoI’s also naturally leads to community topic profiles, facilitating their interpretation.
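The orders of magnitude involved can be made explicit with a short back-of-the-envelope calculation. The sketch below counts matrix entries under the simplifying assumption that all matrices are stored densely (real term co-occurrence matrices are typically sparse, so this is an upper bound), and it counts unordered pairs of distinct authors as n(n − 1)/2, which is of the same order as the 8,009 × 8,009 / 2 figure quoted above. Variable names are illustrative.

```python
# Corpus dimensions reported in the text.
n_authors = 8_009
n_docs = 16_917
n_terms = 23_672   # lexicon size after lemmatization and POS tag filtering
n_topics = 25

# Semantic-network route: one term x term matrix per author (dense upper bound).
semantic_entries = n_authors * n_terms ** 2

# Number of unordered pairs of distinct authors whose matrices must be compared.
author_pairs = n_authors * (n_authors - 1) // 2

# Topic-model route: author x topic, document x topic, and topic x term matrices.
topic_entries = (
    n_authors * n_topics
    + n_docs * n_topics
    + n_topics * n_terms
)

ratio = semantic_entries / topic_entries

print(f"semantic-network entries: {semantic_entries:,}")
print(f"matrix-to-matrix distances: {author_pairs:,}")
print(f"topic-model entries: {topic_entries:,}")
print(f"ratio (dense upper bound): {ratio:.2e}")
```

Under these assumptions, the semantic-network route involves on the order of 10^12 matrix entries against roughly 10^6 for the three topic-model matrices.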
Of course, the quality of the HCoI networks obtained through textual analyses depends on the representativeness of the working corpus. This is crucial for the conclusions that will be drawn from the analyses. Note that the methods are agnostic as to the type of texts contained in the corpus. In the present case study, full-text academic articles were used (and we showed how multiple authorship could be handled). Yet, the methods can target any type of text, be it posts on social media, blogs, letters, or reports (including textual transcriptions of audio content), but also surveys, voting intentions (Pekar et al., 2022), or e-petitioning (Harrison et al., 2022). A possible limitation concerns languages: the methods we have described work best with a monolingual corpus. For a multilingual corpus, a simple approach is to machine translate the non-English texts into English so as to produce a monolingual working corpus (English is taken here as an example and could be replaced by any other language accepted by the POS tagging package). This is a strategy that has been successfully tested elsewhere (Vries et al., 2018; Malaterre and Lareau, 2022).
Note also that the overall approach for identifying HCoI’s is agnostic as to which topic model is chosen. Here, LDA was used but other topic modeling algorithms are possible, provided the resulting topics are not crisply assigned to documents but are associated with them via continuous measures amenable to renormalization as probability distributions. Indeed, a crisp clustering of documents into topics would also result in a crisp clustering of actors into topics, preventing the use of distance measures to assess the relative proximity of actors to one another, and the ultimate representation of actors within a network. As we explained, one of the main stages of the methods is to calculate author topic profiles, which then serve as a basis for identifying HCoI’s. In turn, topic profiles can be calculated at the community level, providing a topic chart for each community.
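The contrast between continuous and crisp topic assignments can be illustrated with a small sketch. It renormalizes non-negative topic weights into probability distributions and compares the pairwise distances obtained from these graded profiles with those obtained once each profile is collapsed to its single dominant topic. The Hellinger distance, the actor names, and the toy weights are assumptions chosen for the illustration.

```python
from math import sqrt

def renormalize(weights):
    """Rescale non-negative topic weights so that they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must contain at least one positive value")
    return [w / total for w in weights]

def to_crisp(profile):
    """Collapse a profile to a one-hot vector on its dominant topic."""
    top = max(range(len(profile)), key=profile.__getitem__)
    return [1.0 if k == top else 0.0 for k in range(len(profile))]

def hellinger(p, q):
    """Hellinger distance between two probability distributions (0 to 1)."""
    return sqrt(0.5 * sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(p, q)))

def pairwise(profiles):
    """Distances for all unordered pairs of actors."""
    names = sorted(profiles)
    return {
        (a, b): hellinger(profiles[a], profiles[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }

# Hypothetical raw topic weights for three actors over three topics.
raw = {"ana": [6.0, 3.0, 1.0], "ben": [5.0, 3.0, 2.0], "cy": [1.0, 2.0, 7.0]}

soft = {name: renormalize(w) for name, w in raw.items()}
crisp = {name: to_crisp(p) for name, p in soft.items()}

soft_d = pairwise(soft)
crisp_d = pairwise(crisp)

for pair in soft_d:
    print(pair, f"soft = {soft_d[pair]:.3f}", f"crisp = {crisp_d[pair]:.1f}")
```

With one-hot profiles, every pair of actors is either identical or maximally distant, so no graded notion of proximity remains to weight the edges of a network, whereas the renormalized continuous profiles yield a full range of intermediate distances.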
While other social network approaches (e.g., bibliometric approaches in the case of scientific networks) typically rely on supplementary investigations about actor profiles to interpret and make sense of the observed networks, the methods we propose here make it possible to automatically pin down the specific identity of the identified communities in terms of discursive topics. Overall, this is crucial to understand what these communities talk about and how they are related to one another: their concerns, their claims, or broadly speaking their interests. Furthermore, the relatedness of communities to one another in terms of their interests can also be assessed by examining the topology of the HCoI networks and by identifying genealogical relationships between communities over time. Such an approach is relevant to many policy-related questions at all stages of the policy lifecycle, wherever textual data are available and attributable to specific actors. HCoI mapping can notably lead to a better understanding of the social forces at play and their claims on specific issues, thereby contributing to agenda setting, for instance by collecting indirect inputs from citizens through social media analysis (Belkahla Driss et al., 2019; Ronzhyn and Wimmer, 2021). This can be done to understand a broad range of interests, from privacy issues in government AI deployment (Saura et al., 2022) to the extraction of social network information in the case of anti-corruption policy (Diviák and Lord, 2023). The approach can in turn increase political legitimacy during policy implementation, as is generally the case with data-driven, evidence-based decision-making (Starke and Lünich, 2020).
HCoI mapping can also facilitate a posteriori policy evaluation, for instance, by aggregating the perceived quality of citizen-centric public services as a complement to other social media text analytics (Reddick et al., 2017).
5. Conclusion
Combining topic modeling and community detection methods makes it possible to uncover HCoI and map their proximity in terms of semantic content both synchronically (through correlation networks) and diachronically (over time periods). Using as a case study a working corpus of 16,917 full-text academic articles written by 8,009 philosophers of science from the 1930s up to the 2010s, this approach revealed how these actors constituted well-delineated HCoI’s characterized by specific topic profiles, and how these HCoI’s evolved over time. Being consistent with what is otherwise known by experts in the field, these results lend credibility to the approach. The results notably show how it is possible to gain insights into the social structures underlying sets of texts through the characterization of their topic profiles. The approach provides insights into author-based communities, notably their semantic content in the form of directly interpretable topic profiles, but also their relative proximity and temporal evolution. When data about actual social interactions are not available but textual data are, mapping such HCoI networks can provide relevant insights about groups of social actors sharing similar interests. In cases where both textual and social data are available, HCoI analyses could also provide a content-based perspective that complements usual SNAs.
Data availability statement
Code and datasets available on Zenodo.org: https://doi.org/10.5281/zenodo.7967417.
Acknowledgments
The authors thank the audience of the 56th Hawaii International Conference on System Sciences for comments on an earlier version of this work (see Malaterre and Lareau, 2023).
Author contribution
Conceptualization: C.M., F.L.; Data curation: F.L.; Formal analysis and investigation: C.M., F.L.; Funding acquisition: C.M.; Investigation: C.M., F.L.; Methodology: C.M., F.L.; Project administration: C.M.; Resources: C.M.; Software: F.L.; Supervision: C.M.; Validation: C.M., F.L.; Visualization: C.M.; Writing—original draft preparation: C.M.; Writing—review and editing: C.M., F.L. Both authors approved the final submitted manuscript.
Funding statement
C.M. acknowledges funding from Canada Social Sciences and Humanities Research Council (Grant 430-2018-00899) and Canada Research Chairs (CRC-950-230795). F.L. acknowledges funding from the Fonds de recherche du Québec Société et culture (FRQSC-276470) and the Canada Research Chair in Philosophy of the Life Sciences at UQAM.
Competing interest
The authors declare none.