
The REDATAM format and its challenges for data access and information creation in public policy

Published online by Cambridge University Press:  17 February 2025

Mauricio Vargas Sepúlveda*
Affiliation:
Department of Political Science and Munk School of Global Affairs and Public Policy, University of Toronto, Toronto, ON, Canada
Lital Barkai
Affiliation:
Independent Software Developer
*
Corresponding author: Mauricio Vargas Sepúlveda; Email: [email protected]

Abstract

The REDATAM (retrieval of data for small areas by microcomputer) statistical package and format, developed by ECLAC, has been a critical tool for disseminating census data across Latin America since the 1990s. However, significant limitations persist, including its proprietary nature, lack of documentation, and restricted flexibility for advanced data analysis. These challenges hinder the transformation of raw census data into actionable information for policymakers, researchers, and advocacy groups. To address these issues, we developed Open REDATAM, an open-source and multiplatform tool that converts REDATAM data into widely supported CSV files and native R and Python data structures. By providing integration with R and Python, Open REDATAM empowers users to work with the tools they already know and perform data analyses without leaving their R or Python window. Our work emphasizes the need for a REDATAM official format specification to further enable informed policy debates that can improve policy processes’ implementation and feedback.

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press

Policy Significance Statement

The REDATAM format, widely used for census data across Latin America, presents significant challenges for modern data use in public policy. Despite its historical role in facilitating data access, its closed, non-transparent structure limits the statistical analysis essential for informed decision-making. We call for an official format specification and anonymization procedures that protect individual privacy while keeping the data usable for research by allowing filtering at the unit of analysis (e.g., individual) level. For policymakers, addressing these technical limitations is relevant to facilitate data sharing and to allow advocacy groups to process and analyze census data so that they can participate more effectively in policy debates.

1. Introduction

The REDATAM (retrieval of data for small areas by microcomputer; recuperación de datos para áreas pequeñas por microcomputador in Spanish) statistical package and format, developed by ECLAC (Economic Commission for Latin America and the Caribbean; CEPAL in Spanish), remains a widely used tool for census data dissemination across Latin America. However, almost a decade after the original academic critique of the REDATAM format by De Grande (2016), concerns persist regarding its limitations, including its proprietary nature, lack of transparency, and limited flexibility for data analysis. These issues remain critical for researchers in economics, political science, and sociology: access to data becomes increasingly important as the proportion of quantitative studies in these fields grows, and policymakers and the public are interested in insights that contribute to better policy outcomes (Farrell and Knight, 2019).

For readers unfamiliar with the REDATAM data format, it is a binary format, that is, a representation that cannot be opened with software such as “Notepad” to inspect its contents, as we can do with a CSV file. The downloaded files do not have a “redatam” extension. Instead, the main file is a dictionary with a “dic” or “dicx” extension, and each level (e.g., region, municipality, or household) has its own pointers file with a “ptr” extension and data files with an “rbf” extension, where each data file contains a single variable.
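To make the difference between a binary file and a CSV file concrete, the following base R sketch writes the same toy values in both forms. The layout used here (4-byte little-endian integers) is an assumption chosen for illustration and is not the actual “rbf” layout; the file names and values are hypothetical.

```r
# Toy illustration only: this is NOT the REDATAM "rbf" layout.
values <- c(1L, 2L, 2L, 1L, 2L)  # hypothetical codes for one variable

bin_path <- tempfile(fileext = ".bin")
csv_path <- tempfile(fileext = ".csv")

# Binary: 4 bytes per integer, little-endian, no header and no delimiters
writeBin(values, bin_path, size = 4L, endian = "little")

# CSV: human-readable text with a header row
write.csv(data.frame(sex = values), csv_path, row.names = FALSE)

# Reading the CSV needs no prior knowledge of the byte layout
csv_values <- read.csv(csv_path)$sex

# Reading the binary file only works if type, size, and byte order are known
bin_values <- readBin(bin_path, what = "integer", n = length(values),
                      size = 4L, endian = "little")

# A wrong guess (2-byte integers) still returns numbers, but not the data
wrong_values <- readBin(bin_path, what = "integer", n = length(values),
                        size = 2L, endian = "little")

print(rbind(expected = values, wrong_guess = wrong_values))
```

The wrong guess does not raise an error, which is why a published format specification matters: without one, a reader has no way to tell a correct decoding from a plausible-looking incorrect one.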

Table 1, derived from the 2017 Chilean census data, shows the population by sex at the country level.

Table 1. Population by sex in Chile (2017)

Similarly, Table 2, derived from the 2017 Chilean census data, shows the total population at the region level.

Table 2. Population by region in Chile (2017)

These tables are obtained after reading multiple “rbf” files and merging them. The official software to obtain these counts is REDATAM R+SP Process (not related to the R programming language), which includes a graphical user interface (GUI) for exporting variables through a point-and-click menu. The main drawback of this approach is that applying multiple filters and exporting several variables can be time-consuming, and it does not allow more advanced analyses such as ANOVA or Poisson regression. The same software includes a command prompt that lets the user write queries in a syntax similar to Structured Query Language (SQL) to export data, but it does not provide functions to test statistical hypotheses (Figure 1).

Figure 1. REDATAM SP main window.
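As a minimal sketch of the kind of analysis mentioned above, once counts or microdata are available as an R data frame, a Poisson regression and a one-way ANOVA each take one line of base R. The data below are invented for illustration and are not census results; the variable names are hypothetical.

```r
# Toy counts invented for illustration; not census results
toy_counts <- data.frame(
  region = factor(rep(c("A", "B", "C"), each = 2)),
  sex = factor(rep(c("female", "male"), times = 3)),
  persons = c(120L, 110L, 80L, 95L, 60L, 58L)
)

# Poisson regression of counts on region and sex (log link)
fit_pois <- glm(persons ~ region + sex, family = poisson(link = "log"),
                data = toy_counts)
print(summary(fit_pois))

# One-way ANOVA on a simulated continuous outcome
set.seed(1)
toy_income <- data.frame(
  region = factor(rep(c("A", "B", "C"), each = 10)),
  income = rnorm(30, mean = rep(c(10, 12, 11), each = 10))
)
fit_aov <- aov(income ~ region, data = toy_income)
print(summary(fit_aov))
```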

2. Data versus information

The distinction between data and information is crucial for understanding the role of REDATAM in public policy. Data refer to raw facts or observations that lack context or meaning. In contrast, information is data that has been processed, organized, or structured to convey meaning or support decision-making (King, 2010). The REDATAM format, with its current lack of an official format specification, creates a bottleneck in transforming raw census data into information for policymakers, researchers, and the public. The REDATAM software does not provide the tools necessary to perform statistical analyses, such as (generalized) linear regression, limiting the potential insights that can be derived from census data. It even makes it challenging to obtain averages and counts for different subnational units because its point-and-click graphical interface restricts the analysis (ECLAC, 2023b).

These limitations underscore the importance of data access in the context of public policy analysis. Besides academic concerns about challenges in data processing because of a closed-source format and restrictive software, this affects advocacy groups and non-governmental organizations (NGOs) that can inform policy design, implementation, and change by bringing expert knowledge combined with information derived from census data to characterize populations (Jenkins-Smith et al., 2018).

Operational barriers, such as insufficient infrastructure and limited processing capacity, hinder the ability to leverage open data effectively (Kawashita et al., 2022). Many organizations face challenges related to organizational capabilities to manage and analyze open data, resulting in missed opportunities to enhance policy design and civic engagement. Considering that governments already covered the cost of producing and releasing census records, using open formats for their release could enhance civic participation and policy design by enabling NGOs to analyze data with already popular tools such as R or Python. This change would have a marginal economic and financial impact for governments, and currently, the cost of converting REDATAM data to an open format is also marginal (e.g., it requires a user to read the instructions) for any organization that wants to use the open-source tool we developed.

Beyond the technical and operational barriers, the interplay of policy and legal frameworks significantly affects software utilities. Open-source initiatives often use licensing terms that can limit derivative works. For instance, the GNU General Public License (GPL) requires that derivative works be distributed under the same copyleft license, and currently around 15,000 out of 21,000 R packages (about 70%) are GPL-licensed. Advocacy groups and NGOs encounter additional challenges when navigating these barriers: their efforts to combine expert knowledge with census data to inform and influence policies are often constrained by such restrictions, reducing their effectiveness in driving evidence-based governance (Kawashita et al., 2022). For this reason, Open REDATAM is distributed under the Apache license, which allows users to use, modify, and distribute the software freely, even for commercial uses, provided that the authors of this article are listed as the copyright owners of the original software; the license does not grant permission to use trademarks associated with our work.

Addressing these barriers requires comprehensive strategies that include fostering a culture of openness, strengthening institutional infrastructure, and advocating for legal frameworks that accompany open data access with permissive software licenses for policy analysis that preserve individual data privacy but expose policy gaps and failures affecting target populations. These measures are relevant for virtuous policy processes with the participation and feedback from empowered citizens and advocacy coalitions, therefore enhancing democratic practices and accountability (Jenkins-Smith et al., 2018).

The relationship between data and information emphasizes the transformative potential of addressing these barriers. Data, as raw material, gains its value when transformed into information (e.g., tables or plots) that drive actionable insights. REDATAM’s current limitations impede this transformation, leaving users with raw data that is difficult to process and contextualize, creating a barrier for NGOs and governmental technical teams to better inform policymakers about policy challenges.

By embracing open formats, governments can enable policymakers, researchers, and advocacy groups to make better use of existing data with data analysis tools that are free of monetary cost. Without civic participation, it is challenging to adopt a paradigmatic change from data to information to uncover systemic inequities, develop targeted interventions, and engage in a better-informed policy debate. Bridging the gap between data and information is not just a technical necessity; it is a cornerstone for fostering transparency, inclusivity, and effective governance that could lead to paradigmatic changes in policy design (Blyth, 2013; Broome et al., 2018; Hall, 1993).

Data are different from information: data often require transformations from raw text or numbers in documents, images, or SQL databases into summaries and interpretations that can inform policy decisions. Structured data, such as Excel spreadsheets or REDATAM files, adheres to a predefined structure, making it easier to query, analyze, and integrate compared to unstructured data. Unstructured data, such as scanned documents or images from traffic cameras, lacks this inherent organization, presenting obstacles in terms of processing it to inform policy design. Both forms of data inherit difficulties including discrepancies and mismanagement, and obtaining information from them can present additional technical issues when facing unsearchable formats like scanned PDFs (Beltran, 2023).

In order to bridge the gap between data and information, computational tools such as optical character recognition (OCR) and natural language processing facilitate processing and interpreting data presented in formats that make figures difficult to read and analyze, as is the case when rescuing and harmonizing monetary figures and contextual details in fiscal budgets (Beltran, 2023). These methodologies demonstrate how data, regardless of the format, can be processed and combined with sector knowledge to identify patterns, such as recurring fiscal irregularities or trends in budgetary compliance, that structured data alone might not capture.

Similarly, the goal of REDATAM is to help organize raw data to produce information that not only enhances transparency but also equips policymakers with an input that adds to specialized knowledge and other sources to maintain, alter, or create policies. Under the multiple streams framework, the process of transforming data into information can affect the interaction between policy, politics, and problems. Policymakers aim to organize an agenda with limited resources, including time and information, to provide solutions, and they face trade-offs between precision and ambiguity when internalizing information to prepare responses to problems in a context where disagreement among experts facing the same information is a part of effective planning (Ackrill et al., 2013).

3. Data access versus statistical secret

Since its inception in 1986, REDATAM has become a standard for distributing census data in countries like Argentina, Colombia, Chile, and Mexico. While the software’s availability at no cost and its integration into national statistical systems have boosted its adoption, the format it uses to store microdata remains closed and poorly documented. This lack of transparency has raised concerns in both academic and policy-making circles, where access to comprehensive datasets is critical for informed decision-making.

Policy outcomes can be improved by using accurate demographic information in the different stages of policy processes. The closed nature of the REDATAM format hinders the ability of researchers and analysts to perform statistical analyses beyond the simple tabulations the software allows, or at least to perform them with relative ease.

It is important to mention that we followed an unofficial format specification provided by De Grande (2016) to develop Open REDATAM, and without it, we could not have created the tool. The lack of an official format specification constitutes what is known as security by obscurity. It is considered a bad practice as it consists of relying on the secrecy of the format to protect the data, and it is often a matter of time until the format is reverse-engineered or a vulnerability is found, as happened with the DVD format that once was considered unbreakable (Diehl, 2016).

While some encryption strategies are almost impossible to break by brute force, research shows that encryption can be broken by accessing the hardware where the data are stored, exploiting vulnerabilities in the software, using social engineering to obtain the encryption key, or by finding a flaw in the encryption algorithm. This suggests that efforts should be put toward encryption and anonymization, which are not mutually exclusive (Jager and Somorovsky, 2011).

There are also technical limitations identified in the past decade, such as the lack of the encryption and compression strategies claimed to be implemented in the software (De Grande, 2016). This issue is not just a technical limitation but also a potential risk to data security and confidentiality that may emerge from an excess of confidence in security by obscurity at the expense of data anonymization procedures that protect individual privacy while keeping the data usable for research by allowing filtering at the unit of analysis (e.g., individual) level. National census data often include sensitive information that, if not properly encrypted, could compromise individual privacy. For example, at the time of writing this article, we were able to download from the ECLAC site the Uruguay 2011 census microdata in REDATAM format, labeled “for internal use at the Uruguay Statistical Bureau” (“base de uso interno del INE Uruguay” in Spanish; ECLAC, 2023a). While we did not find any sensitive information in the file, something like this could accidentally leak information that could be used to identify individuals.

Balancing transparency and data access with the obligation to protect statistical confidentiality poses a challenge for any government. The sixth principle of the United Nations Fundamental Principles of Official Statistics explicitly emphasizes the need to ensure confidentiality in statistical data (United Nations, 2014). It mandates that data collected for statistical purposes should not be used to identify individuals or entities, thereby safeguarding the privacy of respondents while enabling the use of anonymized data for policy debate. This principle is relevant for REDATAM microdata, where sharing census data without these considerations could expose sensitive demographic and socioeconomic information.

Adhering to data privacy principles reinforces public trust in official statistics, ensuring that individuals feel secure in providing accurate information during data collection efforts. The empirical evidence reflects that new immigrants tend to have a lower trust in government institutions, and this leads to a higher non-response rate in official surveys for a target population that is more vulnerable and more affected by inadequate policies or the lack of policies that could help them (Sumption, 2020). For instance, during the 2024 Chilean census, there were multiple messages on social media instigating people not to answer the census questions based on the idea that the government would use age, employment, and military service completion data to discriminate against certain people (Fuentes, 2024).

Governments can be better informed by tracking changes from one census to another to evaluate the long-term impact of policies and the need for adjustments or new policies, but using distorted data for policy design can be worse than designing policies based on social construction or tradition, which reinforces notions such as “the elderly deserve tailored policies because of their age” (Schneider and Ingram, 1993; Schneider and Sidney, 2009).

While open data policies aim to maximize the accessibility and usability of data for diverse stakeholders, including policymakers, researchers, and advocacy groups, these goals must not come at the expense of exposing sensitive information. Open REDATAM provides a technical tool, and the discussion about this format could be enriched if ECLAC brings insights from users and existing frameworks such as the European Union General Data Protection Regulation, which can provide further guidance for reconciling these competing priorities (Zoonen, 2020).

Another concern is the preservation of data in the long term. For instance, the 2001 Argentine census came with an installer that we could not run on Windows 10, but it worked with Wine on Ubuntu 22.04, and after installing the software we had access to the data to convert it with our tool (Figure 2).

Figure 2. 2001 Argentine census installer failing on Windows 10 but working on Ubuntu 22.04.

REDATAM data are stored so that they can be easily accessed and read without needing decryption tools. This raises significant concerns, particularly as more institutions adopt REDATAM under the assumption of confidentiality and security. Recently, Myanmar released its 2014 census data in REDATAM format, following the format’s widespread success in Latin America (ECLAC, 2023a).

4. Integration with other tools

Quantitative policy analysis often requires the integration of various datasets and the use of statistical methods to test hypotheses and the effect of multiple variables on a dependent variable of interest (Breunig and Ahlquist, 2014).

One of REDATAM’s most significant shortcomings is its incompatibility with software such as Microsoft Excel, SPSS, R, or Python, tools that are widespread in the social sciences. REDATAM’s closed format and lack of integration options highlight the necessity of developing tools that enable broader accessibility and usability of census data. To address these limitations, we developed Open REDATAM, a multiplatform tool compatible with Linux, Mac, and Windows. This tool converts REDATAM data into the universally supported CSV format, which can be processed using popular statistical software such as R, Python, SPSS, and Stata. Open REDATAM also generates XML summaries of tables and variables, facilitating access to metadata, variable descriptions, and how the variables are related to each other.
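Once a census has been converted to CSV, any statistical software can pick up the files. The base R sketch below reads every CSV file in a directory into a named list of data frames. The directory, file names, and columns are hypothetical stand-ins created inside the example; they are not the names Open REDATAM necessarily produces.

```r
# Hypothetical output directory with made-up files, created for this example
out_dir <- tempfile("open_redatam_csv_demo")
dir.create(out_dir)
write.csv(data.frame(region_ref_id = 1:2, nregion = c("North", "South")),
          file.path(out_dir, "region.csv"), row.names = FALSE)
write.csv(data.frame(persona_ref_id = 1:3, p08 = c(1L, 2L, 2L)),
          file.path(out_dir, "persona.csv"), row.names = FALSE)

# Read every CSV file in the directory into a named list of data frames
csv_files <- list.files(out_dir, pattern = "\\.csv$", full.names = TRUE)
tables <- lapply(csv_files, read.csv, stringsAsFactors = FALSE)
names(tables) <- tools::file_path_sans_ext(basename(csv_files))

# Number of rows per table
print(vapply(tables, nrow, integer(1)))
```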

Our approach builds upon the foundation of the original Redatam Converter designed by Pablo De Grande (2016). By rewriting De Grande’s software, a graphical tool written in C# that exports to CSV, as a command line tool with an additional graphical menu written in C++, we improved its portability and ensured compatibility with a wide range of environments. Besides Windows, Mac, and Linux desktop systems, we tested our software with GitHub Actions to verify that it works on different server environments, including a wide range of Linux distributions that use different C++ compilers.

De Grande’s (2016) original implementation is distributed as Windows-only software. However, because De Grande has released the source code, it is possible to modify it to work on Mac or Linux, since Microsoft offers C# compilers for these operating systems.

A difficulty of C# is that there are no official R or Python tools to call C# functions, and the REDATAM format requires a compiled language to read it because it requires defining custom data structures that can then be exported to R, Python, or other languages. One of the advantages of using C++ is its integration with R and Python (Vaughan et al., 2024; Wenzel, 2024), which allows for the development of user-friendly tools that can be used by researchers and policymakers without requiring advanced programming skills. This means that continuing to use C# to create a multiplatform tool would leave the data import step to the user, which we avoided by providing R and Python packages that read REDATAM data directly into these languages without the additional command line step of exporting CSV files and then reading them into R or Python.

An additional consideration is that we separated the core functions that read the data from the GUI. This separation allowed us to create lighter R and Python packages and to build the GUI on the widely used Qt framework (Qt, 2024), keeping all the code modular.

Our ground-up rewrite features R and Python packages, reducing the data access barrier for researchers and policymakers even more by moving the data import step from the user to the software developer following the “Tidy Data” principles described in Wickham et al. (2023).

Figure 3 shows the original Redatam Converter tool running on Windows XP, where the user selects the input dictionary and the output directory. We ran it on a ThinkPad X220 to validate Open REDATAM by comparing it with the REDATAM Converter tool for the following countries and years: Argentina (1991, 2001, and 2010), Bolivia (2001 and 2012), Chile (2017), Dominican Republic (2002), Ecuador (2010), El Salvador (2007), Guatemala (2018), Mexico (2000), Myanmar (2014), Peru (2007 and 2017), and Uruguay (2011).

Figure 3. Original REDATAM Converter running on Windows XP in 2024.

Figure 4 shows the new Open REDATAM, which follows a streamlined structure that allows the user to select the input dictionary and the output directory and then export the data to CSV files.

Figure 4. Open REDATAM running on Ubuntu 22.04.

The available Open REDATAM helps overcome the limitations of REDATAM’s format, but it is not a definitive solution to the challenges posed by the software’s closed format, and we posit the need for an official format specification available to the public. For instance, XLSX is a widely used format for spreadsheets. While it was created by Microsoft to be used with Microsoft Office, it comes with a detailed official format specification (ISO/IEC 29500) that allows tools such as Google Sheets or R to read and write XLSX files (International Organization for Standardization, 2016). Today’s globalized world is heavily dependent on standards and protocols for data exchange, and an official REDATAM format specification could build on the decades of experience that the International Organization for Standardization (ISO) has with standards for data exchange and information security. The ISO’s role in trade and commerce comes from its ability to provide rules and guidelines that are effective in easing duties and communication in the public and private sectors (Büthe and Mattli, 2011).

By converting REDATAM’s obscure structure into a more widely accessible format, this tool opens the possibility for more complex data manipulations and analysis that REDATAM’s original software does not support. To create this tool, we followed the partial (and unofficial) specification of the REDATAM format provided by De Grande (2016), and we tested it with all the census microdata available from the ECLAC website. A previous experience for the Chilean 2017 census microdata was described by Vargas Sepúlveda (2021, 2022b, 2024a).

5. Testing and comparison with IPUMS data

One of the major limitations of software testing, despite our efforts to test on different operating systems and hardware architectures, is that passing tests shows that the tested cases behave as expected but does not prove that the code is correct. In other words, it is possible to test 100% of the written lines of code and still have functions that conduct incorrect steps, such as applying a logarithmic transformation to a variable that should not be transformed.
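The point about line coverage can be sketched in a few lines of base R. The helper `prepare_age()` and both test functions below are hypothetical and written only for this illustration; they are not part of Open REDATAM.

```r
# Hypothetical helper with a deliberate bug: age should be returned untouched
prepare_age <- function(age) {
  log(age + 1)  # wrong step: a logarithmic transformation that was not required
}

# A weak test that executes every line of prepare_age() and still passes,
# because it only checks the type and length of the output
weak_test <- function() {
  out <- prepare_age(c(0, 10, 35))
  is.numeric(out) && length(out) == 3
}

# A stronger test compares the output against values known to be correct
strong_test <- function() {
  isTRUE(all.equal(prepare_age(c(0, 10, 35)), c(0, 10, 35)))
}

print(weak_test())    # passes despite the bug
print(strong_test())  # fails and exposes the bug
```

Both tests reach 100% of the lines in `prepare_age()`, yet only the second one checks the result against an independent source of truth, which is the role the IPUMS comparison plays in this section.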

We exported an aggregated table, created in R from the 2011 Uruguayan census after reading it with the Open REDATAM package, to SPSS format; we then read it in REDATAM 7 and saved it in REDATAM format. The REDATAM GUI version we used cannot import CSV files, but this procedure nonetheless allowed us to test that exporting from R to SPSS, to REDATAM, and then back to R does not change the data. We worked around the lack of official documentation by following the steps described by Araujo Gonzalez (2021).
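The logic of such a round-trip check can be sketched in base R. As a simplifying assumption, the sketch uses a CSV file as the intermediate format instead of the SPSS and REDATAM steps described above, and the table values are invented.

```r
# Toy aggregated table; the values are invented for illustration
agg <- data.frame(
  sex = c("female", "male"),
  n = c(150L, 140L),
  stringsAsFactors = FALSE
)

# Export to an intermediate file and read it back
csv_path <- tempfile(fileext = ".csv")
write.csv(agg, csv_path, row.names = FALSE)
agg_back <- read.csv(csv_path, stringsAsFactors = FALSE)

# TRUE only if column names, types, and values survive the round trip
round_trip_ok <- isTRUE(all.equal(agg, agg_back))
print(round_trip_ok)
```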

The Integrated Public Use Microdata Series (IPUMS) is a widely used service that provides harmonized census data for different countries. IPUMS data are available in a binary format with an R package to read it, making it accessible to a wide range of users and eliminating the parsing errors that arise when users have to guess how to read the data. The availability of comprehensive data from IPUMS allows us to compare our data extraction and aggregation in R with the data provided by IPUMS and thus gives us an indirect verification of our tool’s correctness.

We compared the results for the following countries and years: Bolivia (2012), Chile (2017), Dominican Republic (2002), Ecuador (2010), El Salvador (2007), Peru (2017), Uruguay (2011). For each country and year, we read the data in DIC (REDATAM legacy format) and DICX (a newer but also a legacy format since 2024) depending on the availability, and we obtained consistent results with the IPUMS data for the same countries and years. This comparison is not a definitive test of the correctness of our tool, but it is a step in the right direction to ensure that the data we extract is correct and can be used for policy analysis.

To avoid presenting cluttered information, we only provide the comparison for the 2017 Chilean census microdata (DIC and DICX, same results) and IPUMS data in the following tables with the aggregate count by sex at the country level (Table 3):

Table 3. 2017 Chilean census data aggregated by sex using R and the “redatam” (Open REDATAM) and “ipums” (IPUMS) packages

It is pertinent to also provide a finer granularity count by age group at the country level (Table 4):

Table 4. 2017 Chilean census data aggregated by age group using R and the “redatam” (Open REDATAM) and “ipums” (IPUMS) packages

These differences are explained by the fact that IPUMS data provides a sample with 10% of the full census databases in addition to data cleaning and harmonization steps, while the REDATAM data are provided as-is by governments (Ruggles et al., 2003, 2024). These tables were obtained by reading the census in R with Open REDATAM and the “dplyr” package, in a similar way to the code examples provided in the next section.
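A comparison between a full enumeration and a 10% sample can be sketched as follows. All counts are invented and are not the values reported in Tables 3 and 4; the constant expansion factor of 10 is a simplifying assumption, since real samples may carry person-level weights.

```r
# Invented counts for illustration only; not the values in Tables 3 and 4
full_counts <- c(female = 5100, male = 4900)   # full enumeration
sample_counts <- c(female = 515, male = 485)   # hypothetical 10% sample

# Expand the sample assuming a constant weight of 10
expanded <- sample_counts * 10

# Shares are comparable without any expansion
full_share <- full_counts / sum(full_counts)
sample_share <- sample_counts / sum(sample_counts)

comparison <- data.frame(
  sex = names(full_counts),
  full = full_counts,
  expanded = expanded,
  rel_diff_pct = 100 * (expanded - full_counts) / full_counts,
  share_diff_pp = 100 * (sample_share - full_share),
  row.names = NULL
)
print(comparison)
```

Small relative differences of this kind are what sampling and harmonization would be expected to produce, whereas a parsing error would typically show up as a much larger or systematic gap.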

6. Demonstration

For simplicity, the demonstration will show how to use Open REDATAM from R and create the first two tables presented in this article.

First, install the package from CRAN or GitHub:

install.packages("redatam")
# or
remotes::install_github("litalbarkai/redatam")

For the case of Chile, we can count the population by sex, and to do that we start by downloading and reading the 2017 census data:

library(redatam)
library(dplyr)

# baseurl is used just to respect the margin
baseurl <- "https://redatam.org/"
url <- paste0(
  baseurl,
  "cdr/descargas/censos/poblacion/CP2017CHL.zip"
)

zip <- basename(url)
dout <- "CHL2017"

# download and extract only if the files are not already present
if (!file.exists(zip)) download.file(url, zip)
if (!file.exists(dout)) unzip(zip, exdir = dout)

chl17 <- read_redatam(paste0(dout, "/BaseOrg16/CPV2017-16.dicx"))

Then we join the different levels in the REDATAM database to match each person to a region (e.g., person to household, household to dwelling, and finally province to region). This approach allows us to obtain the aggregate tables with minimal effort in a later step:

# each inner_join() matches on the key column shared by two adjacent levels
chl17_pop <- chl17$region %>%
  select(region_ref_id, nregion) %>%
  inner_join(
    chl17$provinci %>%
      select(provinci_ref_id, region_ref_id)
  ) %>%
  inner_join(
    chl17$comuna %>%
      select(comuna_ref_id, provinci_ref_id)
  ) %>%
  inner_join(
    chl17$distrito %>%
      select(distrito_ref_id, comuna_ref_id)
  ) %>%
  inner_join(
    chl17$area %>%
      select(area_ref_id, distrito_ref_id)
  ) %>%
  inner_join(
    chl17$zonaloc %>%
      select(zonaloc_ref_id, area_ref_id)
  ) %>%
  inner_join(
    chl17$vivienda %>%
      select(vivienda_ref_id, zonaloc_ref_id)
  ) %>%
  inner_join(
    chl17$hogar %>%
      select(hogar_ref_id, vivienda_ref_id)
  ) %>%
  inner_join(
    chl17$persona %>%
      select(persona_ref_id, hogar_ref_id, p08)
  )

Finally, we aggregate the data to obtain the population by sex and region in two tables:

# population by sex, using the joined person-level table
chl17_pop %>%
  group_by(sex = p08) %>%
  count()

# population by region
chl17_pop %>%
  group_by(region = nregion) %>%
  count()
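The join-then-count pattern used above can also be tried without downloading any census, using base R and toy tables. The tables below only mimic the `_ref_id` key convention seen in the listing; all values are invented, and the hierarchy is shortened to three levels (region, household, person) as a simplifying assumption.

```r
# Toy tables mimicking the "<level>_ref_id" key convention; all values invented
region <- data.frame(region_ref_id = 1:2, nregion = c("North", "South"))
hogar <- data.frame(hogar_ref_id = 1:3, region_ref_id = c(1L, 1L, 2L))
persona <- data.frame(
  persona_ref_id = 1:6,
  hogar_ref_id = c(1L, 1L, 2L, 3L, 3L, 3L),
  p08 = c(1L, 2L, 2L, 1L, 2L, 2L)
)

# merge() performs an inner join on the shared key column by default
pop <- merge(
  merge(region, hogar, by = "region_ref_id"),
  persona,
  by = "hogar_ref_id"
)

# One row per person, so counting rows gives population counts
by_sex <- as.data.frame(table(sex = pop$p08), responseName = "n")
by_region <- as.data.frame(table(region = pop$nregion), responseName = "n")

print(by_sex)
print(by_region)
```

Stating the key explicitly with `by` documents which column links two levels, which can help when two tables happen to share more than one column name.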

The REDATAM query syntax for this is shorter than the R code shown, but users familiar with R may find working in a known syntax to be preferable. Additionally, as mentioned, the REDATAM query syntax cannot handle certain statistical functions and plotting tools. A separate problem is exporting the data from the query results to CSV or another format for later analysis.

To count the population at region level, we can use the following query:

RUNDEF Job
    SELECTION ALL

DEFINE REGION.COUNTER
    AS COUNT PERSONA
    TYPE INTEGER

TABLE C1
    TITLE "Conteo de : PERSONA"
    AS AREALIST
    OF REGION, REGION.COUNTER 10.0
    OUTPUTFILE DBF
    OVERWRITE

This query is equivalent to using the GUI, but in the GUI, it is only possible to select one variable at a time while the query allows the user to select multiple variables at once (Figure 5).

Figure 5. REDATAM R+SP query results (the software reveals a memory leak).

Consider the overcrowding definitions described by Vargas Sepúlveda (2024a) for a dwelling (e.g., a studio apartment has zero bedrooms):

$$ \mathrm{Overcrowding} = \begin{cases} \dfrac{\mathrm{No.\ people}}{\mathrm{No.\ bedrooms}} & \text{if } \mathrm{No.\ bedrooms} \ge 1 \\ \dfrac{\mathrm{No.\ people}}{\mathrm{No.\ bedrooms} + 1} & \text{if } \mathrm{No.\ bedrooms} = 0 \end{cases} $$
$$ \mathrm{Overcrowding\ Discrete} = \begin{cases} \text{No Overcrowding} & \text{if } \mathrm{Overcrowding} < 2.5 \\ \text{Mean} & \text{if } 2.5 \le \mathrm{Overcrowding} < 3.5 \\ \text{High} & \text{if } 3.5 \le \mathrm{Overcrowding} < 5 \\ \text{Critical} & \text{if } \mathrm{Overcrowding} \ge 5 \end{cases} $$
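To make the two definitions concrete, the following minimal Python sketch applies them to a handful of made-up dwellings. The function names and the sample values are hypothetical and are not part of the "censo2017" package; the thresholds and category labels are taken from the definitions above.

```python
def overcrowding_index(people, bedrooms):
    """People per bedroom; a dwelling with zero bedrooms is treated as having one."""
    if people < 0 or bedrooms < 0:
        raise ValueError("people and bedrooms must be non-negative")
    if bedrooms >= 1:
        return people / bedrooms
    return people / (bedrooms + 1)


def overcrowding_category(index):
    """Map the continuous index to the four discrete categories."""
    if index < 2.5:
        return "No Overcrowding"
    if index < 3.5:
        return "Mean"
    if index < 5:
        return "High"
    return "Critical"


def share_high_or_critical(dwellings):
    """Share of (people, bedrooms) pairs classified as High or Critical."""
    if not dwellings:
        return 0.0
    flagged = sum(
        1
        for people, bedrooms in dwellings
        if overcrowding_category(overcrowding_index(people, bedrooms))
        in ("High", "Critical")
    )
    return flagged / len(dwellings)


# Made-up dwellings as (number of people, number of bedrooms).
dwellings = [(2, 1), (3, 0), (7, 2), (10, 2)]
share = share_high_or_critical(dwellings)
```

Here a studio with three people gets an index of 3.0 and falls in the "Mean" category, while seven people in two bedrooms (3.5) are classified as "High".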

Following a similar procedure to the previous two tables, we can visualize the households with overcrowding in Santiago de Chile by combining the census microdata with the R packages “censo2017,” “chilemapas,” and “tintin” (Vargas Sepúlveda 2024b, 2022a, 2024a; Figure 6):

Figure 6. Proportion of households with high or critical overcrowding in Santiago de Chile (2017) computed from the 2017 Chilean census microdata.

7. Access to source code and data

Any interested user can explore the source code on GitHub and propose improvements to the tool. Improvements are always welcome and can cover a wide range of topics, such as adding new features, making the software easier to use, fixing bugs, making the software faster, or improving the documentation.

For users who only need the data in an accessible format, we provide census microdata in R and CSV formats for a broader range of countries and years than those used in the tests against the REDATAM Converter. The files can be downloaded from GitHub or OpenICPSR and include:

  • Argentina: 1991, 2001, and 2010;

  • Bolivia: 2001 and 2012;

  • Chile: 2017;

  • Ecuador: 2010 and 2015 (Galapagos);

  • El Salvador: 2007;

  • Guatemala: 2018;

  • Peru: 2007 and 2017; and

  • Mexico: 2000, 2010, and 2020.

8. Conclusion

As we reflect on REDATAM’s role in the dissemination of census data over the past decade, it is evident that the software has both strengths and weaknesses. With Open REDATAM, we were able to convert hierarchical datasets, including department (region), household, and individual levels. However, REDATAM’s closed, nontransparent nature and technical shortcomings do not help turn data into information, and they make it difficult for researchers and policymakers to use the data in a meaningful way. The lack of an official format specification and of anonymization strategies are significant concerns that must be addressed to ensure the usefulness and confidentiality of census data. An official, publicly available REDATAM format specification would allow more flexibility in data analysis and help policymakers fully benefit from census data.

Data availability statement

The Open REDATAM tool is available at https://github.com/pachadotdev/open-redatam. Curated microdata converted from REDATAM to CSV and R native format (RDS) is available from https://github.com/pachadotdev/redatam-microdata/releases and https://www.openicpsr.org/openicpsr/project/211903.

Acknowledgments

The authors would like to thank Pablo de Grande for his valuable feedback and suggestions, Catherine Moez for the useful comments on the drafts, Miguel Araujo Gonzalez for the unofficial REDATAM documentation, Renan Levine for suggesting how to share the data, the Integrated Public Use Microdata Series (IPUMS), and the statistical offices that provided the underlying data used for the validation.

Author contribution

MVS conceptualized the research, designed the GUI, and tested the correctness of the data exports and software builds. LB developed the initial version of the Open REDATAM tool in C++17, which required C# knowledge to understand de Grande’s work, and corrected different parts of it as the initial tests revealed encoding issues and other bugs, or failed with some REDATAM databases. Then, MVS created the R and Python packages that, with LB’s help, led to a later C++ refactor using C++11 for broader platform compatibility.

Funding statement

This initiative was not funded by the University of Toronto, nor any other individual or institution.

Competing interest

The authors declare no competing interests.

Ethical standard

This research did not involve undue access to data or any other ethical concerns.

Footnotes

This research article was awarded Open Data and Open Materials badges for transparent practices. See the Data Availability Statement for details.

References

Ackrill, R, Kay, A and Zahariadis, N (2013) Ambiguity, multiple streams, and EU policy. Journal of European Public Policy 20(6), 871–887. https://doi.org/10.1080/13501763.2013.781824.
Araujo Gonzalez, M (2021) Red7 Create Con R (Archivos CSV). Available at https://www.youtube.com/watch?v=83frlnXCIJI
Beltran, A (2023) Fiscal data in text: Information extraction from audit reports using natural language processing. Data & Policy 5(January), e7. https://doi.org/10.1017/dap.2023.4.
Blyth, M (2013) Paradigms and paradox: The politics of economic ideas in two moments of crisis. Governance (Oxford) 26(2), 197–215. https://doi.org/10.1111/gove.12010.
Breunig, C and Ahlquist, JS (2014) Quantitative methodologies in public policy. In Engeli, I, Allison, CR (eds), Comparative Policy Studies. Research Methods Series. London: Palgrave Macmillan UK, pp. 109–129. https://doi.org/10.1057/9781137314154_6.
Broome, A, Homolar, A and Kranke, M (2018) Bad science: International organizations and the indirect power of global benchmarking. European Journal of International Relations 24(3), 514–539. https://doi.org/10.1177/1354066117719320.
Büthe, T and Mattli, W (2011) The New Global Rulers: The Privatization of Regulation in the World Economy. Princeton, NJ: Princeton University Press. https://doi.org/10.1515/9781400838790.
De Grande, P (2016) El Formato Redatam. Estudios Demográficos y Urbanos 31(3), 811–832.
Diehl, E (2016) Ten Laws for Security, 1st edn. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-42641-9
ECLAC (2023a) Microdatos. Available at https://redatam.org/es/microdatos.
ECLAC (2023b) Tutorial uso básico R+SP process. Available at https://www.redatam.org/cdr/Tutoriales/Process_Esp.html.
Farrell, H and Knight, J (2019) How political science can be most useful. The Chronicle of Higher Education. Available at https://www.chronicle.com/article/how-political-science-can-be-most-useful/.
Fuentes, M (2024) Censo 2024: Desmintiendo información falsa – Escuela de Salud Pública. Available at https://saludpublica.uchile.cl/noticias/214629/censo-2024-desmintiendo-informacion-falsa.
Hall, PA (1993) Policy paradigms, social learning, and the state: The case of economic policymaking in Britain. Comparative Politics 25(3), 275–296. https://doi.org/10.2307/422246.
International Organization for Standardization (2016) ISO/IEC 29500–1:2016. ISO. Available at https://www.iso.org/standard/71691.html.
Jager, T and Somorovsky, J (2011) How to break XML encryption. In Proceedings of the 18th ACM Conference on Computer and Communications Security, CCS’11. New York, NY: Association for Computing Machinery, pp. 413–422. https://doi.org/10.1145/2046707.2046756.
Jenkins-Smith, HC, Nohrstedt, D, Weible, CM and Ingold, K (2018) The advocacy coalition framework: An overview of the research program. In Weible, CM, Sabatier, PA (eds) Theories of the Policy Process, 4th edn, vol 1. New York, NY: Routledge, pp. 135–171. https://doi.org/10.4324/9780429494284-5.
Kawashita, I, Baptista, AA and Soares, D (2022) Open government data use by the public sector – An overview of its benefits, barriers, drivers, and enablers. Available at http://hdl.handle.net/10125/79648.
Qt (2024) Development framework for cross-platform applications. Available at https://www.qt.io/product/framework.
Ruggles, S, Cleveland, L, Lovaton, R, Sarkar, S, Sobek, M, Burk, D, … Ehrlich, D, Heimann, Q and Lee, J (2024) Integrated public use microdata series (IPUMS). Available at https://international.ipums.org/international/.
Ruggles, S, King, ML, Levison, D, McCaa, R and Sobek, M (2003) IPUMS-International. Historical Methods: A Journal of Quantitative and Interdisciplinary History 36(2), 60–65. https://doi.org/10.1080/01615440309601215.
Schneider, A and Ingram, H (1993) Social construction of target populations: Implications for politics and policy. The American Political Science Review 87(2), 334–347. https://doi.org/10.2307/2939044.
Schneider, A and Sidney, M (2009) What is next for policy design and social construction theory? Policy Studies Journal 37(1), 103–119. https://doi.org/10.1111/j.1541-0072.2008.00298.x.
Sumption, M (2020) How useful are survey data for analyzing immigration policy? Data & Policy 2(January), e19. https://doi.org/10.1017/dap.2020.20.
United Nations (2014) Fundamental principles of official statistics. Available at https://unstats.un.org/unsd/dnss/gp/fundprinciples.aspx.
Vargas Sepúlveda, M (2021) The story behind Censo2017, the first rOpenSci package to be reviewed in Spanish. Available at https://ropensci.org/blog/2021/07/27/censo2017/.
Vargas Sepúlveda, M (2022a) Chilemapas: Mapas de las divisiones politicas y administrativas de Chile (Maps of the Political and Administrative Divisions of Chile). Available at https://CRAN.R-project.org/package=chilemapas.
Vargas Sepúlveda, M (2022b) Interesting uses of Censo2017 a year after publishing. Available at https://ropensci.org/blog/2022/10/19/censo2017-one-year-after/.
Vargas Sepúlveda, M (2024a) Censo2017: Base de Datos de Facil Acceso Del Censo 2017 de Chile (2017 Chilean Census Easy Access Database). Available at https://github.com/ropensci/censo2017.
Vargas Sepúlveda, M (2024b) Tintin: Tintin palette generator. Available at https://CRAN.R-project.org/package=tintin.
Vaughan, D, Hester, J and Romain, F (2024) A C++11 interface for R’s C interface. Available at https://cpp11.r-lib.org/.
Wenzel, J (2024) Seamless operability between C++11 and Python. Available at https://pybind11.readthedocs.io/en/stable/index.html.
Wickham, H, Cetinkaya-Rundel, M and Grolemund, G (2023) R for Data Science. 2nd edn. O’Reilly. Available at https://r4ds.hadley.nz/.
Zoonen, Lv (2020) Data governance and citizen participation in the digital welfare state. Data & Policy 2(January), e10. https://doi.org/10.1017/dap.2020.10.
Table 1. Population by sex in Chile (2017)

Table 2. Population by region in Chile (2017)

Figure 1. REDATAM SP main window.

Figure 2. 2001 Argentine census installer failing on Windows 10 but working on Ubuntu 22.04.

Figure 3. Original REDATAM Converter running on Windows XP in 2024.

Figure 4. Open REDATAM running on Ubuntu 22.04.

Table 3. 2017 Chilean census data aggregated by sex using R and the “redatam” (Open REDATAM) and “ipums” (IPUMS) packages

Table 4. 2017 Chilean census data aggregated by age group using R and the “redatam” (Open REDATAM) and “ipums” (IPUMS) packages
