Introduction
The family has been the focus of interest for generations of scholars convinced that studying it offers significant insights into populations, societies, and even entire nations (Le Play 1877–1879). One of the most powerful ideas that fired the imaginations of countless researchers was the shared belief that the characteristics of historical family patterns in Europe could be identified, recorded, and analyzed in structural and numerical terms and that these patterns could be understood by paying attention to the phenomenon of household co-residence (Anderson 1980; Hammel and Laslett 1974; Ruggles 2012; Wall 1995).Footnote 1 As part of the broader agenda of unearthing past demographic regimes, since the mid-1960s a number of scholars have undertaken an unprecedented effort to study historical co-residence patterns comparatively, analyzing archival documents that contained enumerations of people by residence units, and employing measurement tools from the social sciences and demography, first within pre-industrial England (Laslett 1965), then within Europe, and eventually beyond (Hajnal 1982; Laslett 1977; Ruggles 2010; VDEFH 1998; Wall 2001). As well as spatially classifying and taxonomizing European societies based on family characteristics, these scholars recognized that the way historical European families were organized in the past could spill over to higher levels of organization as societies evolve (Laslett 1983; Reher 1998), leading to fruitful reflections on the relationship between the past and the present.
Despite their enthusiasm, however, the protracted efforts of these scholars have so far failed to provide a comprehensive reconstruction of historical European family structures. A key reason is that all these initiatives have had to cope with a lack of reliable, large-scale historical data on family patterns representing the rich diversity of family structures on the continent. Not only was there no “pan-European” data infrastructure, but new data for comparative historical family demography generally proved difficult to obtain and time-consuming and costly to compile, analyze, and interpret within the technological limitations of the time, forcing scholars to rely on informal data sharing and painstaking efforts to compile and compare data collected by others (Wall 2001; also Bohon 2018; VDEFH 1998). For many areas in Europe, data remained scarce, and even where datasets were available, they were rarely in a machine-readable and standardized format, which made them difficult to process when seeking to account for the complexity of family organization or to conduct replication analyses (VDEFH 1998; Viazzo 2003; cf. Kitchin 2014: 32). Although the need for multi-layered analyses of family systems became apparent early on (Laslett 1983; Wall 1995), the successful implementation of such analyses required the use of tools and methods for data management and processing that were out of reach for family historians (cf. Boonstra et al. 2006).
Thus, the decomposition of data into easily usable but small parts (i.e., individual communities or a small group of communities), which were not infrequently far apart in space and in time, has long been the main approach applied in the debate on the geography of historical family systems in Europe (see Ruggles 2012). Nevertheless, these diverse and sparse comparative data collections have often served as building blocks for the development of the most far-reaching models of the geography of European family systems.
Older practices of data collection and management were placed on a completely new footing in the 1990s, when the IPUMS and NAPP projects revealed the possibilities for mobilizing new historical demographic data, including for historical north-western Europe, through extensive digitization and transcription initiatives. Researchers of historical family structures who were used to working in “data deserts” now faced an avalanche of information. Thanks to the development of new computer technologies and the availability of the internet, rapid data processing and analysis, as well as unlimited data sharing and dissemination, became possible (see e.g., Ruggles 2014; Ruggles et al. 2011; Sobek et al. 2011).
Yet for all the enthusiasm generated by the IPUMS/NAPP revolutionaries, there were ambivalent feelings about the extent to which the emerging “data boosterism” would actually fulfill the longstanding dream of a pan-European reconstruction of family patterns. This was in part because those recent advances were limited to the population of the North Atlantic region, and focused mainly on the second part of the 19th century (Szołtysek and Gruber 2016). At the start of the 21st century, large parts of continental Europe (for an exception, see VDEFH 1998) were still lacking the necessary data infrastructure for conducting systematic comparative historical family research. Thus, researchers in these regions did not even attempt to formulate their arguments based on the analysis of large-scale and harmonized census microdata (e.g., Burguière et al. 1996; Kertzer and Barbagli 2001; Wall et al. 2001). By the late 2000s, it was suggested that an extensive pool of census or census-like material should be developed for as broad a territorial spectrum of continental Europe as possible, as had been previously done for the North Atlantic region. The Mosaic project (Szołtysek and Gruber 2016), building on the experiences of the IPUMS and NAPP initiatives, took up this challenge by extending the collection and distribution of historical census and census-like microdata to the regions of continental Europe.Footnote 2
This paper is concerned with the changes that Mosaic has enabled in the study of historical European family patterns. Our main argument is that the combination of comprehensive archival search, digitization and computation, data mining, and open-access dissemination that is at the core of the Mosaic project is bringing about an important shift in the fundamental principles that have driven research on European family history to date. We also contend that these transformative features of Mosaic can lead to a significant shift in the scale and the scope of knowledge about historical European family systems (cf. Borgman 2015).
Accordingly, we argue that the transformation heralded by Mosaic has changed the ways data are sought, acquired, stored, processed, and made available for analysis. The availability of this unprecedented amount of computationally manipulable data is creating new options for expanding historical knowledge about past family systems (cf. Emigh and Hernández-Pérez 2022). As we will show, the sheer volume of Mosaic data now offers researchers opportunities to gain insights that were not previously possible, encompassing many areas that were either barely explored or entirely unknown before. Moreover, these advances can propel this field of research into new areas.
However, we also reiterate that the proposed vision of Mosaic-induced change goes beyond data infrastructure developments, as scaling up to much larger datasets leads to qualitative differences in the measurements, methods, and questions that are used (see Bohon 2018; Borgman 2015). In addition to breaking with the “data desert” paradigm, these new directions in family history research are dependent on applied computer-based innovations and techniques for combining data (cf. Boonstra et al. 2006; Schürer 1986; Schürer and Wall 1986) that allow multiple censuses to be analyzed as a single dataset; comparative analyses to be conducted at different geographical levels; and different characteristics of family systems to be effectively measured with metrics tailored to a particular place, time, and level of aggregation. Finally, we argue that the historical census microdata in Mosaic, rich and informative though they may be, come with their own challenges and limitations, some of which can be mitigated, and some of which cannot. This has resulted in a certain dialectic in the overall assessment of the data discussed here, which can be seen as either “great and rich” or “poor and uninformative,” depending on the research question and the epistemological standpoint.
These concerns shape the structure of the paper. After providing an overview of the genesis of the Mosaic project, and noting that the discussion of the “new” is always linked to the “old” (Aronova et al. 2017), we present these themes along the main axes mentioned in the title: i.e., as advances that have revealed new ways of embodying the main concerns of an earlier tradition of family history; and, accordingly, as improvements that have enabled innovations in concepts and approaches that are indeed capable of changing the ways in which research on historical family patterns will be shaped in the years to come. These two perspectives are complemented by a discussion of the main challenges that may arise in using Mosaic.
It should be noted that to understand the nature of the changes brought about by Mosaic, we must at least briefly consider the broader developments in historical-comparative family demography. We will not, however, deal with these developments in their entirety here. We are also aware that while Mosaic plays an important role within these broader trends, it is not the only recent project of its kind. In particular, we must be careful not to regard many of the features of the Mosaic project – especially the infrastructural and computational advances – as stand-alone achievements, as many of them stem from several parallel knowledge infrastructure projects that are actually older and much larger than Mosaic, such as IPUMS and NAPP.Footnote 3 In many ways, Mosaic “stands on the shoulders” of its larger predecessors. There have, after all, been many parallel achievements in the development of longitudinal databases in recent decades (Mandemakers et al. 2023). While a number of these studies have provided real innovations in family history (e.g., Tsuya et al. 2010), their contributions to the continental European and the pan-European geography of family patterns have been rather limited (e.g., Dillon and Roberts 2002).
The emergence of Mosaic
Mosaic grew out of two census microdata infrastructure developments that took place almost simultaneously in the late 2000s. The first compilation was the CEURFAMFORM database, which contained information on the inhabitants of more than 20,000 rural households belonging to 236 parishes and 900 settlements in late 18th-century Poland–Lithuania. The data came from various types of population registers that were meticulously excavated from historical archives in Poland, Belarus, Ukraine, and Lithuania, and were then transcribed into a computer file (Szołtysek 2015). The other database was made up of the rich surviving material from the 1918 Albanian census, which covered most of the country, and contained transcribed information on 140,611 persons out of the 524,217 people who were living in some 1800 villages, towns, and cities in the Austro-Hungarian administered territory during the First World War (Kaser et al. 2011).
Simply due to the sheer amount of information they amassed, these two databases were unprecedented endeavors in the history of demographic studies of past populations. However, the innovative features of these databases did not end there. Although they covered widely separated places and periods and originated from different institutional contexts, both datasets followed similar core surveying principles. First, they both described the characteristics of all the individuals in a given locality by grouping them into co-resident domestic groups and provided information on each person’s age, sex, marital status, and relationship to the household head. Second, in both datasets, such units consisted not only of the head’s core family, but also of his relatives, co-resident servants, and lodgers. Third, all of this information was harmonized across both datasets using the international coding structure of IPUMS (Sobek and Kennedy 2009).
These similarities made it possible to combine the two databases (Szołtysek and Gruber 2014) while ensuring that they could be analyzed as a single dataset in which the same variables could be coded, and standardized queries could be made. Consequently, the Albanian-Polish project established the “prototype” for future Mosaic-type datasets in terms of the database structure and the rules for data inclusion, and in terms of the particular research framework in which they were embedded. Further data developments occurred quite rapidly (see Figure 1) due to the strong and coordinated financial and infrastructural support from the Max Planck Institute, the help of a pan-European network of researchers, and internet access. The Mosaic team and their partners were thus able to identify, sample, and digitize vast amounts of previously unknown census and census-like microdata from many areas of continental Europe.Footnote 4
These advances in data collection were accompanied throughout by a commitment to thoroughly examine the preconditions for data inclusion and to trace how and with which categories each population survey was conducted in a given context to ensure comparability (cf. VDEFH 1998: 115). Finally, to facilitate data transformation and dissemination, the common harmonization scheme was applied to all data collections.
Figures 2 and 3 show the spatial distribution of the most recent Mosaic data by location and region, including forthcoming data releases. While covering the entire territory of continental Europe with historical census microdata remains a dream we may never achieve, Mosaic’s current data scope represents an unprecedented expansion in the volume and the spatial breadth of data for the study of historical family patterns. In total, Mosaic contains information on 4364 settlements (villages, hamlets, parishes, estates) with 1,172,241 people living in over 200,000 family households across societies stretching from Navarre and Vizcaya in the west to western Siberia in the east, and from the “far north” of Europe via Saint Petersburg to Almeria and Kythera in the south.Footnote 5 These Mosaic sites are also grouped into 161 regions, which correspond either to the respective administrative units (usually also counties), or, in the absence of administrative units, to geographical clusters to facilitate meso-level analysis.Footnote 6 As a rule of thumb, efforts were made to ensure that each Mosaic region has at least 2000 inhabitants and that urban and rural settlements are separated.Footnote 7 An important extension of the current version of the dataset is the inclusion of historical census microdata from western Siberia that cover a large proportion of the indigenous peoples of Russia’s circumpolar north. This marks the first attempt to study the populations of north-west Asia using integrated census microdata structures.
The Mosaic samples come from different types of historical census and census-like materials (see Szołtysek and Gruber 2016; also fn. 4). Despite the rigorous data pre-selection procedures, this diversity can affect both the nature and the quality of particular listings. To capture this institutional variability, we used our metadata to categorize all regional censuses into three groups according to their varying degrees of control over census administration (i.e., more direct and more intensive involvement of trained personnel in the census process) (see more in Szołtysek et al. 2018).
All these data are geo-referenced (both as location points and as regional centroids), which makes it possible to link them to various covariates derived from geographic information systems (GIS) and other location attributes (see below). While the total area covered by the Mosaic data is extremely large, spanning 6345 km from west to east and 3687 km from north to south, the relevant data points are mostly noncontiguous (see discussion below). The database crosses many important fault lines in the European geography of demographic regimes. It also captures much of the variation across the continent in environmental characteristics, cultures (including kinship regimes), and socioeconomic geography, and in patterns of economic growth in the early modern and modern periods.
In total, the database covers 22 European countries, and most of these data – with the exception of the Croatian, Bulgarian, Belgian, Turkish, and Spanish data – come from census collections covering very large populations from multiple localities and wide geographical areas, and therefore provide a reasonably adequate representation of historical diversity in these areas, even if they are not nationally representative in a statistical sense. Most of the Mosaic samples also remain the best samples that are currently available for the regions or countries they cover, and it is likely that for some areas (e.g., Poland–Lithuania), better samples will never be obtained (Szołtysek and Gruber 2016: 44; also Szołtysek 2015).
Consolidations
Mapping variation
One of the most tangible implications of the Mosaic project in relation to the main concerns and interests of the older family history tradition is its potential to map family characteristics in geographical space. Thanks to the geo-referenced nature of all the data, it is possible to display a large number of elements related to family organization at the meso (regional) or local level in cartographic (digital) form, and thus to make instant comparisons. For example, for the first time since the appearance of the seminal works of the 1960s and 1970s, we can map quite accurately many European regions in terms of the three variables that Hajnal (1982), Laslett (1977), and many others have considered crucial to the study of historical family organization: marriage patterns, household structure, and the incidence of service (Figure 4) (see below on more sophisticated variables).
In addition to illustrating the patterns that once existed in Europe, this approach can serve important analytical purposes. It can, for example, show the role that geographical proximity played in patterns of family organization, and can thus improve our understanding of how aspects of family organization in one area differed from those in other areas. Rather than relying on simplistic notions of dividing lines, “transition zones,” and/or “ideal family systems” (Hajnal 1982; Mitterauer 2003; Reher 1998; Therborn 2004; Todd 1985), the analysis of Mosaic data can result in a more sensitive description of the geography of family patterns, and may lead to the discovery of more complex patterns, including those reflecting the ways in which family and demographic boundaries were crossed and spread, both spatially and temporally. These new geographies may still be incomplete, changeable, or contestable. However, compared to the ways these issues were managed in the “pre-Mosaic world,” this approach represents a major breakthrough. Take, for example, Laslett’s famous regional “sets of familial tendencies” (Laslett 1983), which can now be discussed not only on the basis of a few local case studies (e.g., Wall 1995, 2001), but also on the basis of a large pool of regionally differentiated data on households, families, and individuals.
By mobilizing spatially organized, large-scale information at different levels of aggregation, the Mosaic database can not only better address the question of what the most important variations in European family organization were, but can also move the problem of variability in family characteristics to the center of inquiry (cf. Smith 1984).
Figure 5 illustrates this point by showing the distribution of the values of the shares of nuclear and multifamily households for two sub-datasets of the Mosaic collection from the historical German territories and the Polish-Lithuanian Commonwealth. Despite its simplicity, this type of “compositional” data representation provides several important insights. For example, it shows that the extent of variation observed in Poland-Lithuania is not comparable to that found in the German data, and that none of the standard population units are homogeneous. It also shows that the identification and the sorting of sub-populations are indeed necessary to understand the family history of any area, because these are the only ways to capture real differences in local or regional conditions that make certain family patterns “thinkable” in particular contexts (cf. Plakans and Wetherell 2005). Accordingly, Mosaic allows for populations to be compared not only in terms of the mean values of certain indicators but also in terms of how much variation in certain family characteristics they can include.
In addition, the approach illustrated in Figure 5 alludes to the possibility of investigating the extent to which the size of localities can lead to random variations in the distribution of certain indicators. For example, a simple permutation test conducted for the two “country” populations in Figure 5 shows that if two German villages were randomly selected and their average share of simple family households was calculated, then across 1000 such draws 95 percent of the results would range from 44.2 to 85.1 percent; for Poland, the corresponding range is 27.6 to 84.2 percent.Footnote 8 Thus, we observe a lot of differentiation each time, and see no significant differences between countries that we intuitively know are very different. In this respect, the agglomeration of Mosaic data can be more robust and rewarding, in part because the use of larger populations (of regions or macro-regions) can help to compensate for random errors due to stochastic fluctuations, allowing for more accurate and parsimonious estimates of many parameters than those obtained in earlier comparisons (cf. Burguière and Lebrun 1996: 36).
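The resampling exercise described above can be sketched in a few lines. This is only an illustration under stated assumptions: the village-level percentages are invented (they are not Mosaic data), and the function name, the seed, and the way the central interval is read off the sorted draws are choices made for this sketch rather than the procedure actually used for Figure 5.

```python
import random

def resampled_interval(shares, n_units=2, n_draws=1000, level=0.95, seed=0):
    """Central interval of the mean share over repeated random draws of n_units localities."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    means = sorted(
        sum(rng.sample(shares, n_units)) / n_units for _ in range(n_draws)
    )
    tail = (1.0 - level) / 2.0
    lo_idx = int(tail * n_draws)              # lower cut-off position
    hi_idx = int((1.0 - tail) * n_draws) - 1  # upper cut-off position
    return means[lo_idx], means[hi_idx]

# Hypothetical village-level percentages of simple family households.
villages = [41.0, 48.5, 55.2, 60.1, 63.7, 68.0, 72.4, 79.9, 86.3]

low, high = resampled_interval(villages)
print(f"95 percent of two-village means fall between {low:.1f} and {high:.1f}")
```

Raising `n_units` narrows the interval, which is the sense in which pooling localities into regions or macro-regions dampens purely stochastic fluctuation.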
Because it offers large-scale data integrated across different levels of aggregation, Mosaic can easily be used to place local patterns in a larger meso- or macro-level context of which they are a part, and can thus better distinguish the particular from the general than scattered case studies could (see, e.g., Flandrin 1979; Todorova 1996; cf. Kurosu 2016). How the particular can be systematically distinguished from the general and assessed on the basis of the scalable and multi-layered geographical structure of the dataset is shown in Figure 6 using the example of the proportion of female servants in a small community in Poland in 1791 and the corresponding scaling of the Mosaic data. This simple exercise shows that Kazimierza Wielka differed only slightly on this measure from the province to which it belonged (38.4 percent versus 32.8 percent), but that it was clearly exceptional relative to the country (12.4 percent) and to the entire Eastern European region included in the database (14.8 percent). Such programmatic comparisons can be made for most Mosaic sites with a large collection of regional data and for a long list of variables.
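A comparison of one locality against successively larger units, as in Figure 6, amounts to pooling the same counts under different geographic filters. The sketch below assumes a hypothetical record layout (one row per settlement, with a numerator and a denominator count for whichever indicator is of interest); the field names and all numbers are invented and do not reproduce the Mosaic file structure.

```python
from collections import namedtuple

# Hypothetical layout: one row per settlement with counts for one indicator.
Site = namedtuple("Site", "macro_region country province settlement numer denom")

def pooled_share(sites, **where):
    """Percentage pooled over all sites that match the given level filters."""
    chosen = [s for s in sites if all(getattr(s, k) == v for k, v in where.items())]
    denom = sum(s.denom for s in chosen)
    return 100.0 * sum(s.numer for s in chosen) / denom if denom else float("nan")

# Invented example data.
sites = [
    Site("East", "A", "P1", "s1", 30, 100),
    Site("East", "A", "P1", "s2", 50, 200),
    Site("East", "A", "P2", "s3", 10, 200),
    Site("East", "B", "P3", "s4", 10, 500),
]

for label, where in [
    ("settlement s1", {"settlement": "s1"}),
    ("province P1", {"province": "P1"}),
    ("country A", {"country": "A"}),
    ("macro-region East", {"macro_region": "East"}),
]:
    print(f"{label}: {pooled_share(sites, **where):.1f} percent")
```

Pooling counts, rather than averaging settlement-level percentages, weights each locality by its population; whether that is the appropriate choice depends on the research question.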
Measurements
Due to the prevailing paradigm of research and data organization in the past, and given the technological limitations at that time, many important dimensions of family organization could not be effectively quantified and compared, let alone visualized.
Take, for example, a comparative analysis of the relationship between the age-specific proportion of men who had ever been married and the proportion of men who were heads of households, which has been advocated as a measure of the extent to which marriage signified the creation of an independent residential and economic unit. Such analysis has rarely been undertaken (and then with limited information content), because it was extremely difficult in the past to generate the necessary comparative data on age-specific marriage and household headship rates en masse (Hajnal 1982; cf. Smith 1993: 396–399), and it was even more difficult to process these data. Today, by contrast, historical microdata infrastructures such as Mosaic allow us to calculate these parameters simultaneously for multiple datasets and populations.
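With person-level records, the two age-specific proportions can be tabulated in a single pass. The following is a minimal sketch, assuming hypothetical field names (`age`, `sex`, `marst`, `relate`) and category labels that are stand-ins for whatever harmonized codes a given dataset uses; the sample persons are invented.

```python
from collections import defaultdict

def age_specific_rates(persons, width=5):
    """Per age group: share of men ever married and share of men heading a household."""
    tally = defaultdict(lambda: [0, 0, 0])  # [men, ever married, household heads]
    for p in persons:
        if p["sex"] != "m":
            continue
        start = (p["age"] // width) * width
        cell = tally[start]
        cell[0] += 1
        cell[1] += p["marst"] != "never_married"  # married or widowed both count
        cell[2] += p["relate"] == "head"
    return {
        (start, start + width - 1): (ever / men, heads / men)
        for start, (men, ever, heads) in sorted(tally.items())
    }

# Invented person records with hypothetical codes.
sample = [
    {"age": 22, "sex": "m", "marst": "never_married", "relate": "child"},
    {"age": 24, "sex": "m", "marst": "married", "relate": "head"},
    {"age": 23, "sex": "f", "marst": "married", "relate": "spouse"},
    {"age": 31, "sex": "m", "marst": "married", "relate": "head"},
    {"age": 33, "sex": "m", "marst": "widowed", "relate": "sibling"},
    {"age": 34, "sex": "m", "marst": "married", "relate": "child"},
]

men_by_age = age_specific_rates(sample)
for (lo, hi), (ever, head) in men_by_age.items():
    print(f"{lo}-{hi}: ever married {ever:.2f}, heads {head:.2f}")
```

A gap between the two columns at a given age, as in the invented 30-34 group here, is the kind of signal the text refers to: men who have married without heading a household of their own.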
Figure 7 illustrates how such an investigation could be carried out for all Mosaic records. Because of the agglomeration of local censuses and technological capacities for data processing, what had been seen as a scarce commodity in earlier studies can now be easily transformed with Mosaic into a veritable “flood” of fine-grained information that can be sorted, sifted, and scaled for specific analyses. This information can be further used to investigate variations, spatial groupings, and central tendencies, generating potential discoveries on topics that – although central to family history research – could not be fully captured before (cf. Smith 1993; also Szołtysek and Ogórek 2020).
By relying on synthetic cohort methods (as in Figure 7), we can compensate to some extent for missing longitudinal cohort data and obtain reasonable surrogate measures of the timing, magnitude, and pace of certain life course changes, especially for populations clustered around the same census period (see Watkins 1980). Figure 8 shows how this might be done for a section of the Mosaic data, and illustrates the differences in the timing of key life course transitions for three regions in 18th-century Poland-Lithuania. New studies of the life course (e.g., the impact of service, early marriage, living with grandparents) can use such (or similar) Mosaic results to assess the relative importance of particular historical demographic contexts.
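One standard synthetic-cohort measure, introduced here purely as an illustration and not named in the text above, is Hajnal's singulate mean age at marriage (SMAM). It treats the proportions never married in successive age groups of a single census as if they described one cohort moving through life. The sketch assumes eight five-year groups from 15-19 to 50-54; the input proportions are invented.

```python
def singulate_mean_age_at_marriage(never_married):
    """Hajnal's SMAM from proportions never married in age groups 15-19 ... 50-54."""
    if len(never_married) != 8:
        raise ValueError("expected eight proportions for ages 15-19 through 50-54")
    # Years lived single between ages 15 and 50 by the synthetic cohort.
    years_single = 5.0 * sum(never_married[:7])
    # Share still single at exact age 50: average of the 45-49 and 50-54 groups.
    single_at_50 = (never_married[6] + never_married[7]) / 2.0
    # Remove those who never marry, then average over those who do.
    return (15.0 + years_single - 50.0 * single_at_50) / (1.0 - single_at_50)

# Invented proportions never married, ages 15-19 through 50-54.
proportions = [0.98, 0.70, 0.35, 0.18, 0.12, 0.10, 0.09, 0.09]
print(f"SMAM: {singulate_mean_age_at_marriage(proportions):.1f} years")
```

The same logic carries over to other transitions observable in a single listing, such as leaving service or becoming a household head, by substituting the relevant age-specific proportions.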
The above examples have shown how the combination of sheer data volume and modern data-processing capacity can enable advances in measurement where earlier research could reach only the “low-hanging fruit.” While having more data does not always result in better research (e.g., Borgman 2015), another example of how Mosaic’s drive to assemble much larger datasets can increase the chances of gaining important research insights is the application of machine learning.Footnote 9 Because of its scale, content, and coverage, Mosaic is particularly well-suited to harnessing the power of unsupervised machine learning or cluster analysis techniques to infer optimal natural groupings in multidimensional data, which allow complex patterns to be identified with high levels of efficiency and low costs (e.g., Han et al. 2011; Hastie et al. 2009). This quality could prove crucial, as many classical models of family patterns are in fact sets of interrelated variables or elements (Hajnal 1982; Laslett 1983), but have seldom been formally “tested” (e.g., Barbagli 1991). The application of machine learning tools could provide new insights by answering previously unresolved questions, such as whether historical European populations form natural groupings based on how similar or dissimilar they are with respect to certain family demographic markers, and if so, how many such groupings can plausibly be identified. Such approaches can be particularly helpful in replacing the ad hoc deductive typologies prevalent in previous studies with formal methods of automatic pattern recognition.Footnote 10
For example, using the Partitioning Around Medoids algorithm and careful optimization criteria, Szołtysek and Ogórek (2020) have shown (see Figure 9) that partitioning household formation systems in historical populations into four clusters is a far more reasonable way to capture variation in the Mosaic data (merged here with NAPP; see below) than the dual partition model proposed by Hajnal (1982). The proposed clustering solution yielded several other intriguing results. A similar approach can be applied to many other historical demographic problems.
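To make the idea of partitioning around medoids concrete, the sketch below implements a simplified variant: a greedy build step followed by first-improvement swaps, using only the standard library. It is not the implementation or the optimization criteria used in the cited study; the two-dimensional "regional indicator" vectors are invented, and the choice of Euclidean distance is an assumption of this example.

```python
import math

def total_cost(dist, medoids):
    """Sum over all points of the distance to the nearest medoid."""
    return sum(min(row[m] for m in medoids) for row in dist)

def pam(dist, k):
    """Simplified k-medoids: greedy BUILD, then swaps until no swap lowers the cost."""
    n = len(dist)
    medoids = []
    while len(medoids) < k:  # BUILD: add the point that lowers total cost most
        best = min((c for c in range(n) if c not in medoids),
                   key=lambda c: total_cost(dist, medoids + [c]))
        medoids.append(best)
    improved = True
    while improved:  # SWAP: accept any medoid/non-medoid exchange that helps
        improved = False
        current = total_cost(dist, medoids)
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                cost = total_cost(dist, trial)
                if cost < current:
                    medoids, current, improved = trial, cost, True
    labels = [min(medoids, key=lambda m: dist[i][m]) for i in range(n)]
    return sorted(medoids), labels

# Invented two-indicator profiles for six hypothetical regions.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]
dist = [[math.dist(a, b) for b in points] for a in points]

medoids, labels = pam(dist, k=2)
print("medoids:", medoids)
print("labels:", labels)
```

Because each cluster is represented by an actual observation (the medoid), the result can be read as "regions most like region X", which is often easier to interpret historically than an abstract centroid. Choosing the number of clusters requires a separate criterion that this sketch does not cover.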
It is noteworthy that most of the above measurements can be broken down by urban-rural differences. However, the validity of such comparisons is compromised by the overwhelming dominance of rural regions (80 percent) and the uneven spatial distribution of the urban population in the Mosaic database.
Finally, it should be mentioned that Mosaic can ultimately facilitate partial analyses of the impact of the socioeconomic status of household heads on various domestic group characteristics. Three-quarters of the Mosaic regions, which account for 73.8 percent of the population in the Mosaic database, include occupational information. Only 48 regions (26.2 percent) with 28.2 percent of the database population do not contain information on occupational titles. Currently, however, only 69 regions (37.7 percent) with 47.0 percent of the database population (536,214 persons) have their occupational titles coded, and further work to improve this situation is in progress.Footnote 11
Innovations
While Mosaic can help to consolidate the field of comparative historical family demography by providing better answers to many critical questions that have long been asked, it also provides fertile ground for innovations in the ways historical family demographic research is conducted in general. The following section describes some of these new elements, focusing on the issues of measurement, analysis, and data merging.
Measurements
As early as the 1980s, it was recognized that classifications of co-residence at the household level are limited and that such measures must be combined with measures of family composition at the individual level to capture the complexity of living arrangements (Ruggles 1987; Schürer and Wall 1986). Because of the structure of its core variables, Mosaic can enable such analyses by applying a common coding scheme for housing units based on the commonly used classification schemes, while also representing the individual relationships between the people included in the database through distinct but linked and compatible classification pointers (to be further broken down by sex, age, and marital status) (see Ruggles 1995).
Table 1 shows an example of a combination of coding variables of different orders applied to the census list of members of a domestic group from an exemplary parish in the Mosaic collection. First, the relationships of these individuals to the main reference person on the list, a household head, are determined. Then, each person is assigned a common code for the residential structure in which they live. This is supplemented by the codes that capture the conjugal-family relationships of all individual household members (Wall Reference Wall1998), and finally by a set of dyadic variables that identify the marital, parental, sibling, and other kinship relationships between all persons living under the same roof (only a subset of the actual dyads available is given in the table). When analyzed in combination (either cross-sectionally or by age group), the different dyads can provide information on the simultaneous presence (or absence) of several kinship ties at certain stages of the person’s life in the domestic sphere. This can foster various in-depth research approaches focusing on the residential circumstances of older people, on age-specific changes in “micro-networks” (“roles”) in domestic groups, or on empirical considerations of the advantages and disadvantages of using individual-level versus household-level measures in specific contexts (Szołtysek Reference Szołtysek2015: 684–89; Szołtysek et al. Reference Szołtysek, Ogórek, Poniat and Gruber2020; cf. Ruggles Reference Ruggles2012).
Source: Szołtysek, Mikołaj (2015).
Note: the data are for the census from Słupia parish in Greater Poland province of Poland in 1791.
Second, by combining household- and individual-level variables that are harmonized across multiple datasets, Mosaic allows researchers to develop measures tailored to specific research problems without having to rely on predefined schemes (cf. Ruggles 2012: 341). The main example in this regard concerns the use of Mosaic data to construct the Patriarchy Index (hereafter PI) to quantify the social and ideological construct of familial patriarchy (see Gruber and Szołtysek 2016). For this index to be useful, it was first necessary to identify clearly defined items for cross-cultural comparisons in the multifaceted manifestations of the patriarchal order. Accordingly, the operationalisability of these items had to be tested using the information available in the Mosaic data, which inevitably led to the omission of aspects that, although theoretically important, are hardly reflected in the historical sources (e.g., domestic violence). Furthermore, given the open-ended, cross-cultural, and cross-temporal structure of Mosaic, it was necessary to walk a tightrope between specificity and generality in compiling the index to ensure that all its potential components had equal chances of occurring in populations from different regions and time periods. The aim of this approach was to ensure the greatest possible effectiveness with a minimum of information content.Footnote 12
The result was a composite measure consisting of four sub-indices to capture inter-generational and inter-gender relations: dominance of men over women, dominance of the older generation over the younger generation, patrilocality, and preference for sons. All 11 (earlier 12) variables that made up these sub-indices could easily be calculated from routine individual-level censuses or census-like microdata that had been widely used in Europe since the early modern period.Footnote 13
The PI can serve several purposes in the study of family history: (1) It can be used to measure the intensity of patriarchy in family systems across cultures (see Figure 10), and to assess whether the clustering of PI elements on particular dimensions differs across populations; (2) it can be used as a composite measure of family systems, and as a measure of strong/weak family ties in historical populations (cf. Reher 1998); and, finally (3), it can serve as a predictor variable in modeling different demographic behaviors, also in comparison to other similar measures (see Szołtysek and Poniat 2018; Szołtysek, Beltrán Tapia, et al. 2022).
Spatial analyses
Because the Mosaic data are geo-referenced (see above), a wide range of family demographic characteristics contained in the database can be projected onto geographic coordinates of specific populations and at different geographic levels. Thus, in addition to enabling descriptive mapping (see above), it is possible to take advantage of rapid advances in spatial computing technology (Gutmann et al. 2011) to examine more explicitly the local spatial patterns of particular aspects of family systems, and to identify and understand their spatial variability (Anselin 1995; Fotheringham 1997). Analyses based on Mosaic data therefore have considerable potential for improving on the findings of previous scholarship. This is because much of the research to date on the historical family demography of the continent has been conducted without spatially structured data or even basic forms of spatial modeling (e.g., Alter 2013; Ruggles 2010), despite the recognition that “place really did matter” (Goodchild 2008: 200) when it came to the evolution of family structures in historical Europe. The fact that only small quantities of data were collected for many areas of continental Europe in the “pre-Mosaic era” was obviously one of the factors that hindered the development of spatial models.Footnote 14
Apart from the compilation of a vast collection of data, a prerequisite for moving forward in this area is having an appropriate definition of a network structure that reflects the idea of locality and connectivity (Anselin 1988). In the context of the spatial dispersion of Mosaic data points (and regions’ centroids) and their uneven density in many parts of Europe, the network structure of the five nearest neighbors (based on great circle distances) with a row-standardized inverse distance weight matrix (Anselin 1988) seemed to be the most suitable solution (see Figure 11). With this approach, each spatial point in our data has exactly the same number of neighbors, but the relative importance (weight) of each neighbor is proportional to its inverse distance (Getis and Aldstadt 2004). This implies that the structure of our data can take into account spatial relationships and proximity, as expressed in the so-called first law of geography, which states that places that are close to each other are generally more similar than those that are further apart (Tobler 1970). By applying this matrix to Mosaic data, we can formally regionalize the many demographic variables stored in the database and locate boundaries between areas, flag areas with anomalous values within regions, or identify local patterns that deviate from regional patterns.
Figure 12 uses the example of the “proportion of older people living in stem families” (Szołtysek et al. 2020) to produce what is known as the Moran scatter plot (Anselin 1995), which illustrates the relationship between the values of the focal attribute at each of the Mosaic sites and the average value of the same attribute at neighboring sites in the matrix. In this case, we see that the majority of the Mosaic data fall in the upper-right quadrant and the lower-left quadrant in Figure 12, corresponding to positive spatial autocorrelation (similar values are observed at neighboring sites, either as high-high or low-low spatial autocorrelation). This pattern is also confirmed by a global indicator of spatial autocorrelation (Moran’s Global I), which is 0.43 (p < .001).Footnote 15 Using this scatter plot, we can also determine which areas of the Mosaic data map are most responsible for the observed high or low spatial autocorrelation, and which locations, if any, run counter to the overall pattern. This allows the variation to be captured better than was possible with the older approaches based on more fragmentary data and less rigorous (non-spatial) comparisons.Footnote 16
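The neighbor structure described above can be illustrated with a minimal sketch. It assumes that each region is represented by a single centroid given as a (latitude, longitude) pair in degrees, that great circle distances are computed with the haversine formula on a sphere of mean Earth radius, and that no two centroids coincide. The function names and the radius constant are illustrative assumptions rather than part of the Mosaic tooling.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def great_circle_km(p, q):
    """Haversine great-circle distance between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def knn_inverse_distance_weights(points, k=5):
    """Build an n x n row-standardized inverse-distance matrix over k nearest neighbors."""
    n = len(points)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Distances from site i to every other site, nearest first.
        dists = sorted(
            (great_circle_km(points[i], points[j]), j) for j in range(n) if j != i
        )
        # Inverse distances of the k nearest sites (assumes distinct centroids).
        inverse = [(1.0 / d, j) for d, j in dists[:k]]
        total = sum(w for w, _ in inverse)
        for w, j in inverse:
            weights[i][j] = w / total  # row-standardize: each row sums to 1
    return weights


# Hypothetical centroids, for illustration only (not Mosaic regions).
centroids = [(52.2, 21.0), (50.1, 19.9), (48.2, 16.4), (47.5, 19.0),
             (52.5, 13.4), (59.3, 18.1), (41.3, 19.8)]
W = knn_inverse_distance_weights(centroids, k=5)
```

Every row of `W` has exactly `k` non-zero entries that sum to one, so each site has the same number of neighbors while closer neighbors carry more weight.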
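To make the logic of the Moran scatter plot concrete, the sketch below computes global Moran's I and the quadrant of each site from a vector of values and a spatial weights matrix. It is a simplified illustration under stated assumptions: the weights matrix is supplied by the user, the function names are hypothetical, the toy values are invented, and no significance test (such as the permutation-based p-value usually reported alongside I) is included.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and an n x n spatial weights matrix."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]                  # deviations from the mean
    s0 = sum(sum(row) for row in weights)           # sum of all weights
    cross = sum(weights[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(zi * zi for zi in z)


def moran_scatter_points(values, weights):
    """Return (deviation, spatial lag, quadrant label) for each site."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    points = []
    for i in range(n):
        # Spatial lag: weighted average deviation among the neighbors of site i.
        lag = sum(weights[i][j] * z[j] for j in range(n))
        label = ("high" if z[i] >= 0 else "low") + "-" + ("high" if lag >= 0 else "low")
        points.append((z[i], lag, label))
    return points


# Toy example (invented values): four sites on a line, adjacent sites are neighbors,
# and the weights are row-standardized.
toy_values = [1.0, 2.0, 8.0, 9.0]
toy_weights = [
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
]
print(morans_i(toy_values, toy_weights))
print([label for _, _, label in moran_scatter_points(toy_values, toy_weights)])
```

Sites labeled high-high or low-low contribute to positive spatial autocorrelation, whereas high-low and low-high sites run counter to it.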
Data merging
The final area susceptible to innovation is related to the possibilities for expanding Mosaic data both vertically (in terms of content) and horizontally (in terms of scope). The former efforts stem from the motivation to increase the self-contained explanatory power of the database, and to move from describing to explaining patterns in the Mosaic data by embedding them in relevant sociocultural, demographic, and ecological/environmental contexts. In the “pre-Mosaic” studies, such gaps could occasionally be filled by intensive case studies or small subsystem studies and data triangulation (e.g., Mitterauer 1992). In large “surface” studies with multiple censuses, such a goal could only be achieved by mobilizing exogenous information from different sources and areas, which was then linked to the demographic/family data in Mosaic through geographical linkage and spatial overlay.
First, the regional Mosaic populations were linked to information on the prevailing infant mortality rate (hereafter IMR) and life expectancy at birth (e0), based on the assumption that both parameters had an important influence on living arrangements (Ruggles 1987). Despite the heterogeneity of the procedures used to obtain this information (both data fusion and top-down/bottom-up extrapolations had to be used), a total of 160 Mosaic regions were assigned IMR values, and 145 regions were assigned the corresponding e0 values (Szołtysek, Ogórek, et al. 2022). The data collected were generally consistent with the spatial distribution and evolution of infant mortality and life expectancy in historical Europe. As expected, the two variables were also negatively correlated with each other (Pearson r = −.68, p < .001).
In addition, for each regional population included in our database, the stage of demographic development was approximated by matching the respective data with the corresponding provincial-level estimates of the onset of fertility decline from the Princeton European Fertility Project (Coale and Watkins 1986). Accordingly, a dummy variable was created for each regional population that indicated whether the respective population belonged to a province that had already experienced monotonic fertility decline at the time of the census. Overall, the three variables discussed above could be used as moderately coarse control variables in modeling various family demographic processes operationalized with the Mosaic data, along with some variables that could be derived from the data itself (e.g., SMAM or child-woman ratios) (e.g., Szołtysek, Beltrán Tapia, et al. 2022).
The next example of the vertical extension of the data concerns the possibilities for including environmental variables, either to use them as explanatory factors for the European family patterns recorded in Mosaic or to include them as control variables in specific studies.Footnote 17 Again, such enrichment can be achieved by collecting information from various increasingly available Big Data repositories on environmental features and biogeographical conditions.Footnote 18
Figure 13 shows some of the existing possibilities in which Mosaic regional data are overlaid and directly linked to specific contemporary geo-environmental raster data or to existing areal and raster top-down reconstructions of land-use patterns at the global scale.
For example, the measure of terrain ruggedness can be calculated separately for each of the Mosaic sites by weighting the gridded elevation data by the population size of the regions. This measure, which is perhaps the least controversial of all the geo-variates considered here, has already been shown to be a good and robust predictor of a range of family demographic characteristics drawn from the Mosaic data (e.g., Szołtysek, Beltrán Tapia, et al. 2022). Similarly, measures of the suitability of land for agriculture and the proportion of land under cultivation, either separately or in combination, can be used as rough proxy measures for the impact of geographical characteristics on the ecological endowment and historical role of agriculture in a given region. It is noteworthy, for example, that the three measures alone explain 11.5 percent of the variation in the proportion of multiple-family households in the Mosaic dataset (results of an ordinary least squares [OLS] regression).Footnote 19
One of the greatest benefits of data harmonization is that it allows data collected in different cultural contexts and over long periods of time to be brought together (see Borgman 2015; Kitchin 2014). In the context of Mosaic, this created an interoperability that made it possible for the first time to place the large-scale family demographic patterns of historical continental Europe into a much broader comparative framework than ever before.
In the first instance, Mosaic could be easily integrated into the largest collection of nationally representative historical European census microdata compiled by the North Atlantic Population Project (distributed by IPUMS-International; Ruggles et al. 2011). As the Mosaic data tend to be chronologically biased towards earlier periods, to ensure comparability, preference was given to the oldest available censuses when selecting NAPP data (i.e., for Iceland, Denmark, England and Wales, and Sweden), using complete censuses in each case (or samples thereof).Footnote 20 To achieve a relative balance in the number of regions between the two data corpora, the microdata from the NAPP were aggregated into 156 administrative units (generally counties) that were used in the respective censuses and recorded in the NAPP.
This combination of the Mosaic and NAPP datasets created a real critical mass of data that has already led to a number of unexpected discoveries. It revealed for the first time the full range of family patterns across Europe, from the simplest to the most complex. It also showed that many assumed features of the north-west type of family organization were present in parts of Europe where they had not been expected, and often with intensities greater than those in the alleged “core” areas (Szołtysek and Ogórek 2020; Szołtysek, Ogórek et al. 2020; cf. Dennison and Ogilvie 2014; Ruggles 2010). Finally, in the Mosaic data, the highest levels of agreement in terms of mutual associations between the four household formation traits advocated by Hajnal were found outside the north-western “heartlands” in different central European populations (Szołtysek et al. 2021).
However, even more global accounts could be created by merging historical and current data, as the harmonized structure of Mosaic and NAPP is fairly closely aligned with IPUMS-International’s global data. With such a goal in mind, a “global” patriarchy dataset has recently been created that combines Mosaic/NAPP data on 311 regions with 29 million people of historical Europe and North America with IPUMS-I data on 22 countries with 65 million people for the 1970–2014 period, and maps them onto 546 territorial units (Figure 14). Such a comparative dataset can serve various purposes, including to map the concentration of patriarchal family systems in a “global” regional perspective by confronting the alleged European “uniqueness” using a Eurasian mirror; to examine the differences between historical Europe and its North Atlantic offshore territories in the past; or to assess how differences in basic historical and structural conditions (while also taking into account the factors discussed in Figure 13) have conditioned the emergence of various patriarchal formations (Szołtysek et al. 2022).Footnote 21
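As a reminder of what “share of variation explained” means in an OLS setting, the following sketch fits a regression of an outcome on an intercept and several predictors and returns the coefficient of determination (R²). It is a generic illustration written in plain Python, not the authors' estimation code. The function names are hypothetical, and the numbers in the usage lines are invented for demonstration and are not Mosaic values.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ols_r_squared(y, predictors):
    """Regress y on an intercept plus predictor columns; return (coefficients, R^2)."""
    n = len(y)
    X = [[1.0] + [col[i] for col in predictors] for i in range(n)]
    p = len(X[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)] for a in range(p)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(bj * xij for bj, xij in zip(beta, row)) for row in X]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return beta, 1.0 - ss_res / ss_tot


# Invented illustration: three regional geo-variates and one household outcome.
ruggedness = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
suitability = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
cultivated = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0]
share_multiple_family = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0]
coefficients, r_squared = ols_r_squared(
    share_multiple_family, [ruggedness, suitability, cultivated]
)
```

An R² of 0.115 in such a model would correspond to the 11.5 percent of variation reported above. With real data, an established statistics library would normally be preferred over hand-written linear algebra.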
Challenges
Given their scope and coverage, the Mosaic data surpass all previous efforts to create an infrastructure for family history data in continental Europe and offer many promising research opportunities. However, the use of these data comes with certain challenges.
Italy and the Iberian Peninsula are either not included or insufficiently included in the current dataset. This data gap limits our ability to explore the north-south dimension of variation in family systems in Europe (Reher 1998) and may represent a missing element in the development of a “new” geography of family patterns based on machine-learning technologies.
Broad cross-cultural and cross-temporal comparisons using Mosaic data could pose epistemic risks in terms of the ontological status of the basic census units “unearthed” from historical censuses or census-like registers. These may arise if there is too little cross-cultural overlap in census definitions (which risks occidentalisation); if the term “household,” as defined by survey statisticians to ensure complete coverage, is not consistent with particular economic or social concepts (requiring a distinction between “etic” and “emic” ways of grouping people; see Szreter et al. 2004); and if census “units” are taken out of context by overly mechanistic standardisation (requiring careful cross-cultural translation of the source material) (see Szołtysek 2023).
Because Mosaic captures populations that are unevenly distributed across time and space, each time window of the dataset literally contains different populations, even within broad macro-regions (Figure 15). As well as severely limiting the analysis of family change (although some broad temporal trends can certainly be identified), this seems to contradict the idea of comparing elements of different temporal sequences without a clear idea of the extent to which they might change over time (Wawro and Katznelson 2022; for similar examples in earlier studies, see Barbagli 1991; Hajnal 1982; Laslett 1977; Smith 1993; Wall 2001; cf. Dennison and Ogilvie 2014).
Since this mixing of time periods is virtually unavoidable with such extensive data (and an ideal data structure to mitigate this problem is unrealistic), we make four practical suggestions to ensure that analyses based on Mosaic data are justified even with this caveat in mind. First, 146 of the 161 Mosaic populations (90 percent of the current regions) represent populations that have not yet experienced a fertility transition, and, with the exception of France, most regions without this characteristic are widely dispersed without changing the overall picture. This narrows the gap between the Mosaic populations, at least in terms of the general demographic characteristics that most of them have long exhibited (Del Panta et al. 2006). Second, the two largest data collections of “recent” populations in Mosaic (the 1918 census for Albania and the 1926 Polar Census) represent not only pre-transitional populations but also quite archaic family organizations, further reducing the seemingly huge time span of the data. Third, we suggest that all multivariate analyses of Mosaic data always include the period or other time-varying covariates (census quality, onset of fertility decline, IMR, or e0) as control variables. Finally, the pooled time cross-sections from Mosaic should ideally be cross-checked with other place-specific evidence before they can be assumed to represent family patterns that are durable beyond the specific time window covered by the data (e.g., Reher 1998; Schürer et al. 2018; Therborn 2004).
The fact that the Mosaic dataset has a huge overall volume does not necessarily mean that all its variables are free from noise generated by small Ns. Although Mosaic has tried to minimize these potential effects by creating regions that are “large enough” (see above), and thus allow the random fluctuations to become smaller as the sample size increases, population size can still be an issue if the calculation of certain variables requires a large reduction in the denominator (e.g., for age-specific measures).
An exemplary variable of this type is the child sex ratio, i.e., the number of males per 100 females in the 0–4 age group, which is commonly used as a cumulative measure of sex-specific mortality around birth, in infancy, and in childhood (Szołtysek, Ogórek, et al. 2022). In Figure 16, the observed sex ratios of the original samples (represented by filled squares) are overlaid with the 2.5th and 97.5th percentiles of the distribution of sex ratios resulting from a bootstrapping procedure using individual-level information from the Mosaic data (5,000 sex ratio values were resampled for each region). This exercise clearly shows that the uncertainty of the calculated measure (sex ratio) increases dramatically as the sample size (the number of children 0–4 in the region) decreases. A practical lesson that can be drawn from this example is that researchers using Mosaic data should always be mindful of which at-risk population is being considered for particular demographic indicators, and should take every precaution when proceeding with the analysis. It is recommended that researchers apply resampling methods that use individual-level information from the Mosaic data file attached to the regional file to gain more confidence in specific measures.
As was mentioned above, Mosaic’s core data are relatively weak semantically, and linking them to additional contextual information (see above) runs into insurmountable limitations. For many potentially critical intervening factors (e.g., the socioeconomic structures and the labor, inheritance, and kinship patterns), creating relevant variables based on information from the secondary literature or from the original data providers would be extremely tedious, unproductive, and most likely impossible for the entire dataset. Many hindcast reconstructions of historical land-use patterns (see above) are clearly not “data” in the sense of measured quantities, but are, rather, good guesses about what happened (e.g., Klein Goldewijk and Verburg 2013). For many of these areas, it would only be possible to obtain meaningful information in the context of high-resolution local case studies (Hedefalk et al. 2017), which, once again, is not feasible for all Mosaic data points. These limitations should be kept in mind when developing multivariate models with the Mosaic data.
Furthermore, Mosaic data are not particularly useful for individual life course analyses, and their linkage/integration with longitudinal databases is actually quite cumbersome (Mandemakers et al. 2023). This apparent lack of synergy is in fact reciprocal, as transforming the latter into a cross-sectional matrix of the NAPP/Mosaic data structure would require generalized solutions that are currently difficult or impossible to obtain (cf. Alter et al. 2009). Nevertheless, both the Mosaic data and the longitudinal data can serve the common goal of charting and explaining demographic dynamics (cf. Dillon and Roberts 2002). First, the existing longitudinal databases could become a source of additional information for Mosaic-like large “surface” studies, at least for some areas of historical Europe (and even beyond). Moreover, as many of the longitudinal data sources are highly localized (e.g., Matthijs and Moreels 2010), they could benefit from the use of Mosaic data to assess the relative importance of particular family demographic contexts. This is particularly true with regard to the Mosaic project’s potential to outline broad regional patterns of life course transitions across cohorts (as mentioned above).
Last but not least, the successful management of a project like Mosaic requires a combination of different practices, skills, and technologies, and necessitates interdisciplinary conversations between scientists who do not always communicate directly with each other. Such collaboration can be very difficult without long-term and flexible institutional support, a long-term vision, and a commitment on the part of the data curators to manage and be accountable for the content (Borgman 2015; Kitchin 2014: 40). Strong institutional support is also crucial to continue the long-term task of digitizing and curating additional microdata samples for many parts of Europe in the future (cf. Emigh and Hernández-Pérez 2022), especially as such efforts often require international interactions and collaboration across large distances.
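The resampling logic described above can be sketched as follows. The sketch assumes that the sex of each child aged 0–4 in a region is available as a list of "m"/"f" codes, draws bootstrap samples with replacement, and reads off the 2.5th and 97.5th percentiles with a simple index-based rule. The function names, the coding of sex, the percentile rule, and the example counts are assumptions made for illustration and are not taken from the Mosaic files.

```python
import math
import random


def sex_ratio(sexes):
    """Males per 100 females in a list of 'm'/'f' codes."""
    males = sum(1 for s in sexes if s == "m")
    females = len(sexes) - males
    return 100.0 * males / females if females else float("nan")


def bootstrap_sex_ratio_interval(sexes, n_resamples=5000, seed=0):
    """Return the 2.5th and 97.5th percentiles of bootstrapped sex ratios."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sexes)
    ratios = []
    for _ in range(n_resamples):
        sample = rng.choices(sexes, k=n)  # resample children with replacement
        ratio = sex_ratio(sample)
        if not math.isnan(ratio):  # skip resamples without any girls
            ratios.append(ratio)
    ratios.sort()
    lower = ratios[int(0.025 * (len(ratios) - 1))]  # simple index-based percentile
    upper = ratios[int(0.975 * (len(ratios) - 1))]
    return lower, upper


# Invented example: a small and a larger region with the same observed sex ratio.
small_region = ["m"] * 21 + ["f"] * 19
large_region = ["m"] * 525 + ["f"] * 475
print(bootstrap_sex_ratio_interval(small_region, n_resamples=1000))
print(bootstrap_sex_ratio_interval(large_region, n_resamples=1000))
```

Although both invented regions have the same observed ratio, the interval for the small region is much wider, which mirrors the pattern reported for Figure 16.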
Conclusions
The main motivation for initiating the Mosaic project was a lack of existing comparative family history data, which, it was argued, had to be overcome to answer more systematically many important research questions related to our understanding of the population and family history of continental Europe. In this paper, we have explored the opportunities and challenges associated with filling this gap by developing and exploring a specifically European data infrastructure on historical family patterns.
The changes that Mosaic has ushered in reshape some of the fundamental principles of family history research in the data domain. For most of its history, historical family demography has operated in a data-poor environment in which measurements of many aspects of family organization have been difficult or inaccessible, or have been expensive and cumbersome to obtain, purchase, and process. Thanks to the Mosaic database, scholars interested in researching family history now have access to an unprecedented amount of fine-grained data on populations and societies, regions, and small areas and places, with a large share of these data referenced in geo-space and time.
The proposed vision of change goes beyond the purely technical aspects. Scaling from traditionally small data infrastructures to much larger data infrastructures leads to the introduction of new approaches to data processing and analysis that enable older questions to be answered and new questions to be asked in a more efficient way. By enabling them to shift from a data-poor to a data-rich approach to analyzing historical family systems, Mosaic provides researchers with opportunities to move from coarse aggregations to high resolutions, from simple descriptions to complex modeling, and from tentative observations to formal pattern recognition. These advances should, in turn, lead to a much broader, deeper, and more comprehensive understanding of past family patterns. A fuller history of European family organization can now be provided using a range of approaches, from sharpening and developing insights that have often been marginalized, obscured, or only secondarily addressed; to engaging in Big Data-like data dredging to comprehensively examine relationships between a large number of variables for which data are available.
Mosaic also raises fundamental questions about the organization and practice of historical family research (Borgman 2015). Efforts like the project discussed here offer new possibilities for fostering interdisciplinary collaborations beyond the lone-scholar model that has long dominated family history research. The complexity of research practices and the possible ways to explore Mosaic-like data inevitably encourage more (network) collaborations (“crowdsourcing of minds”), especially (but not only) between “computationally literate social scientists and socially literate computer scientists” (Kitchin 2014: 137). The usage of Mosaic data may also improve the levels of research productivity within the field (especially in the context of public data sharing), the possibilities for further data re-use, and the provision of test-bed data for teaching and student projects.
Although the large volume of data collected by Mosaic may produce important innovations and improvements on previous studies of historical family systems based on more limited data, there are also strong continuities and potential synergies between the Mosaic project and the older practices of historical family demographers. For example, Mosaic does not advocate entirely replacing older studies based on small datasets with studies of large datasets analyzed using automated approaches. While the Mosaic database offers opportunities for conducting large-scale “surface” studies, it can also support more traditional approaches that focus on in-depth analysis of smaller entities, be it a community or a village. Small-scale studies can answer more finely tailored research questions or specifically formulated comparisons, telling individual, nuanced, and contextual stories, while also being less resource-intensive (cf. Kitchin 2014: 29 ff). At the same time, the Mosaic database can help the authors of such studies develop better micro-stories (i.e., embedded in larger structures).
The “deluge” of Mosaic-like structured information on historical family patterns notwithstanding, some important areas are not yet covered by the dataset. Thus, the organizers of the project are eager for it to grow bigger. The Mosaic project’s ability to generalize about the European familial past would definitely improve if more data on the Iberian, Mediterranean, Russian, and perhaps also French areas could be included. Moreover, the project’s ability to generalize about the place of Europe in world family systems would be enhanced if historical census and census-like microdata from Asia could be combined with its data (e.g., Dong et al. 2015; Ochiai and Hirai 2023), an opportunity that so far has not been taken up by any Asian colleagues. Although the scope of the data that could be usefully included in the database is not infinite, Mosaic is still far from the point beyond which further data would not add any (new) information (cf. Succi and Coveney 2019). For such data expansions to happen in the future, large amounts of funding, institutional support, and cooperation of the broader research community for data curation would be necessary.
Databases are now more widespread than microscopes, voltmeters, and test tubes. The increasing amount of data has led to major changes in research practices, and historical family demography is no exception to this general trend. While the Mosaic project is probably not a prime example of the use of “Big Data” (although the data might be called “biggish”), its transformative capabilities should not be ignored. With “data is the new oil” as the motto of our Zeitgeist, we are challenged to remember that mining new horizons of data can indeed yield scientifically useful insights even within the confines of historical family demography. Despite the potential outcry from parts of the family history community over such practices (e.g., Dennison 2021; Devos 2016), it is unlikely that the trend of adopting large-scale data solutions in historical family demography will be slowed down or reshaped. We argue that social science historians of the family should recognize and face the challenges associated with large-scale data projects. The price of missing out on such opportunities may be high, given that family historians have already lost some of their previous standing as the primary interpreters of the panoramic worlds of the historical family (see, e.g., Bertocchi and Bozzano 2019; Duranton et al. 2009; Gutmann and Voigt 2022). After all, if the avalanche of data is here, shouldn’t we be digging?
Acknowledgments
We thank Joshua Goldstein for his encouragement and thorough support in the development of the Mosaic database.
Funding
Mikolaj Szoltysek disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research has been funded in whole by the National Science Centre (Poland) under the grant scheme OPUS (no. 2022/47/B/HS3/00004).