Introduction to the Utilities

Peter Bol

doi:10.1017/jch.2020.10

Published online by Cambridge University Press: 12 August 2020

Affiliation: Harvard University (Peter Bol, corresponding author)

Extract

A variety of databases, tools and platforms have created the foundation for digital scholarship in Chinese studies. The creators of some open-access projects introduce their work below, but first I offer some notes on the kinds of utilities that make up the expanding digital universe.

Type: Utilities
Information: Journal of Chinese History 中國歷史學刊, Volume 4, Special Issue 2: Digital Humanities, July 2020, pp. 483–486

DOI: https://doi.org/10.1017/jch.2020.10
Copyright: © Cambridge University Press 2020

Searchable text databases are the most widely used resource. Beginning in 1984 with the dynastic histories, Academia Sinica's Institute of History and Philology has set the highest standard for the creation of digital texts, and other institutions have since followed suit. A proportion of Scripta Sinica is open access.¹ In addition to many commercial text databases, three other collections have been established that are entirely open access. CBETA, the Chinese Electronic Tripitaka Collection, began over twenty years ago.² The most recent is the Kanseki Repository of premodern Chinese texts, popularly known as Kanripo; it is overseen by Christian Wittern of Kyoto University and has 9,500 texts.³ The largest is Donald Sturgeon's Ctext, the Chinese Text Project, currently holding over thirty thousand titles and more than five billion characters. The text repository is only one part of the Ctext platform, which includes a variety of tools, as Sturgeon explains in his introduction to the platform.⁴ The most convenient way to discover whether a text is available in one of these three repositories (or the Zhonghua jingdian gujiku 中华经典古籍库) is to consult textref.org; icons show whether a text is open to view, search, or download and whether a scanned image is available. The scanned image is important because, as Sturgeon explains in the case of Ctext, the searchable text is created by applying optical character recognition (OCR) to the scanned image. Textref.org currently has 54,000 records and is an important contribution to the cyberinfrastructure for Chinese studies. All providers of digital text, whether open-access or licensed-access, can choose to reveal the titles they have without losing their proprietary rights.⁵

The Ming Qing Women's Writings Digital Archive and Database (MQWW), discussed below by its director Grace Fong, is an example of a digital archive of selected writings with an online scholarly apparatus.⁶ MQWW is valuable for its collection of rare texts. It also illustrates how a database focused on a particular set of texts and people can make use of other online utilities. In this case MQWW uses an application programming interface (API) to call up information from the China Biographical Database, which relieves it of the need to keep track of the kin and social relations of the writers themselves.

The China Biographical Database (CBDB), discussed by two of its managers, Wang Hongsu and Tsui Lik Hang, is an open-access biographical database, like the Dharma Drum Buddhist College (DDBC) Buddhist Studies Authority Database 佛學規範資料庫 and the Database of Names and Biographies 人物傳記資料庫 from the Institute of History and Philology.⁷ One can discover persons in these databases through biogref.org, which currently has almost 520,000 records. All three databases provide categorized biographical data. There is an important difference, however, in that CBDB is a relational database composed of code tables and data tables. This allows it to be used in complex queries covering large numbers of persons. The existence of code tables for offices, people, places, and so on also means that CBDB code tables, accessed through its APIs, can be used to mark up or “tag” texts on other platforms.
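The distinction between code tables and data tables can be illustrated with a minimal sketch in Python, using an in-memory SQLite database. The table names, column names, and sample rows below are hypothetical illustrations and do not reproduce CBDB's actual schema or data; the point is only that a shared code makes it possible to query across many persons at once.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with one code table and two data tables.

    All table names, column names, and rows are hypothetical examples.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Code table: one row per controlled term.
        CREATE TABLE office_codes (office_id INTEGER PRIMARY KEY, office_name TEXT);
        -- Data tables: persons, and the offices they held (recorded by code).
        CREATE TABLE persons (person_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE postings (person_id INTEGER, office_id INTEGER, year INTEGER);
        """
    )
    conn.executemany("INSERT INTO office_codes VALUES (?, ?)",
                     [(1, "prefect"), (2, "magistrate")])
    conn.executemany("INSERT INTO persons VALUES (?, ?)",
                     [(10, "Person A"), (11, "Person B"), (12, "Person C")])
    conn.executemany("INSERT INTO postings VALUES (?, ?, ?)",
                     [(10, 1, 1070), (11, 2, 1071), (12, 1, 1085), (10, 2, 1065)])
    return conn

def holders_of(conn, office_name):
    """Return (name, year) for every person who held the named office."""
    return conn.execute(
        """
        SELECT p.name, po.year
        FROM postings AS po
        JOIN persons AS p ON p.person_id = po.person_id
        JOIN office_codes AS oc ON oc.office_id = po.office_id
        WHERE oc.office_name = ?
        ORDER BY po.year
        """,
        (office_name,),
    ).fetchall()

conn = build_demo_db()
prefects = holders_of(conn, "prefect")
```

Because the postings table stores only codes, the same office code can also serve as a tag for mentions of that office in a text held on another platform.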

There are specialized datasets that provide code tables for tagging texts, such as the China Historical Geographic Information System (CHGIS), which also has an API to be used by online systems. Its most important use, as described in my “Visualization and Analysis of Historical Space” in this section, is the provision of data layers of administrative units for 2,000 years of China's history for use in GIS software. The CHGIS project also provides other valuable spatial datasets, including G. William Skinner's nineteenth- and twentieth-century datasets.

Platforms are online systems that allow users to upload their own data (or retrieve data that is already available on or through the platform) and use the capabilities of the platform to analyze the data. MARKUS, introduced by Hilde De Weerdt, is a platform that allows users to upload text and tag it, using code tables from CBDB, CHGIS, and other databases or by creating their own lists of terms. Tagging words in a text allows them to be extracted from the text and analyzed and visualized in diverse ways. Li Bin and his collaborators provide an example of this with their online system for the Basic Annals of Sima Qian's Records of the Grand Historian. In this case they tagged the text manually for person names and place names, thus allowing users to visualize the connections between persons and between persons and places statistically and geographically. Doing this manually with single texts is manageable, but only an automated system would make it possible across large text corpora.
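The tagging-and-extraction workflow described above can be sketched in a few lines of Python. This is a minimal illustration using longest-match lookup against a small term list; the term list, the tag labels, and the sample sentence are invented for the example (the sentence is not a quotation from the Records), and the sketch does not reproduce how MARKUS or any other platform implements tagging.

```python
from itertools import product

# Hypothetical term list standing in for a code table: term -> tag type.
TERMS = {
    "項羽": "person",
    "劉邦": "person",
    "彭城": "place",
    "咸陽": "place",
}

def tag_text(text, terms):
    """Scan left to right, preferring the longest term at each position.

    Returns a list of (start_index, term, tag_type) tuples.
    """
    max_len = max(len(t) for t in terms)
    tags = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in terms:
                tags.append((i, candidate, terms[candidate]))
                i += length
                break
        else:
            i += 1  # no term starts here; move one character on
    return tags

def person_place_pairs(tags):
    """Pair every tagged person with every tagged place in the same passage."""
    persons = [term for _, term, kind in tags if kind == "person"]
    places = [term for _, term, kind in tags if kind == "place"]
    return sorted(set(product(persons, places)))

sentence = "項羽至彭城，劉邦入咸陽。"  # invented sample sentence
tags = tag_text(sentence, TERMS)
pairs = person_place_pairs(tags)
```

Co-occurrence within a passage is only a crude proxy for a connection between a person and a place, but once the tags exist the pairs can be counted or mapped.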

As Chen Shih-pei explains, the Local Gazetteers Research Tools (LoGaRT) is a platform devoted to extracting all manner of data from local gazetteers. Gazetteers are databases-to-be: they contain structured information using common categories. But in fact there is just enough variation in the original texts to make extraction complicated. In her contribution to this section, on the composition of the Qing bureaucracy, Chen Bijia notes how challenging this can be in discussing the mining of the Roster of Appointments (Jinshen lu). The LoGaRT system is quite powerful, but the platform must be installed on a local server using texts available to that institution, or users must arrange to spend time at the Max Planck Institute in Berlin. Currently most searchable gazetteers are in licensed text databases. PhiloLogic is another powerful platform or framework for text analysis at scale; when it is on a local server users can create their own instances with their own corpora. Jeffrey Tharsen and Clovis Gladstone explain its capabilities with the example of the Twenty-Four Chinese Histories; they have opened their instance to readers.

Platforms provide different kinds of tools for analyzing text corpora. Ctext's text tools allow comparison between chosen texts, visualizing similarity and proximity based on statistical analysis, and more. In addition to the introduction to the Ctext platform and its text analysis tools, the website offers detailed introductions and guides.⁸ Some of the same capabilities are also part of PhiloLogic, and they are being built into MARKUS. A different kind of platform is 10,000 Rooms, introduced by Nicholas Frisch, which is aimed at enabling users to upload and annotate images of texts. This facilitates the study of historical editions of texts that may exist in multiple woodblock editions.

Platforms typically allow users to upload texts for one-time analysis and, with registration, to store texts (taken from Ctext or Kanripo, for instance) on the platform's servers, employ various tools, download data, and produce visualizations. The Digital Humanities Research Platform at Academia Sinica⁹ and DocuSky from National Taiwan University¹⁰ provide these services to registered users. DocuSky, first developed by Tu Hsieh-chang and discussed by Hsiang Jieh, is unusual in that it is a system created to enable users to transform the texts and spreadsheets they may have on their own computer into their own online database. It provides a specific XML format which acts as a bridge between content and tools. Users can convert texts and spreadsheets into the XML format and use them with the tools provided by DocuSky or other open-access platforms, such as MARKUS. In addition, it is meant to create a link between researchers and developers.
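The idea of an XML format acting as a bridge between content and tools can be sketched as follows. The element and attribute names below are illustrative assumptions only and are not DocuSky's actual format; the spreadsheet rows are likewise invented. The sketch shows only the general pattern of converting rows into a shared XML structure that any tool aware of the format can read back.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical spreadsheet content; in practice this would be read from a file.
SPREADSHEET = """title,author,text
Sample Text One,Author A,First passage.
Sample Text Two,Author B,Second passage.
"""

def rows_to_xml(csv_text):
    """Convert spreadsheet rows into a simple XML tree.

    Element and attribute names here are illustrative assumptions only.
    """
    root = ET.Element("corpus")
    for row in csv.DictReader(io.StringIO(csv_text)):
        doc = ET.SubElement(root, "document",
                            title=row["title"], author=row["author"])
        content = ET.SubElement(doc, "content")
        content.text = row["text"]
    return root

root = rows_to_xml(SPREADSHEET)
xml_string = ET.tostring(root, encoding="unicode")

# A tool that understands the shared format can read the same data back.
titles = [doc.get("title") for doc in root.findall("document")]
```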

The text corpora, databases, datasets, code tables, APIs, tools, and platforms discussed here are examples of the kinds of utilities and digital resources available for the study of China's history. The list is not exhaustive. I have not covered the various software packages being used in data analysis. Of the several challenges in using digital resources I will draw attention here to one: the lack of a reliable means of segmenting words or phrasemes in texts written in literary Chinese. The lack of white space between words is a problem for East Asian texts generally, and it is exacerbated by the lack of punctuation and the ambiguity of the status of a string of characters as a word, although there is a parser for modern spoken Chinese.¹¹ There are open-access utilities that have had some success with punctuating literary Chinese and identifying parts of speech.¹²
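The segmentation problem can be made concrete with a small Python sketch that lists every way of dividing a string into items from a lexicon. The lexicon here is a hypothetical toy list, and the function is a plain exhaustive enumeration, not the method used by any of the parsers or utilities mentioned above.

```python
def segmentations(text, lexicon):
    """Return every way of splitting `text` into items found in `lexicon`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            for rest in segmentations(text[end:], lexicon):
                results.append([head] + rest)
    return results

# Toy lexicon: each character is a word, and so are two of the pairs.
LEXICON = {"天", "下", "天下", "大", "亂", "大亂"}
candidates = segmentations("天下大亂", LEXICON)
```

Even this four-character string yields four candidate segmentations, and the lexicon alone offers no basis for choosing among them; that choice requires context or a model of the language.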

The increasing number of searchable text databases, most of them commercial, presents researchers with new challenges. First, researchers would like to search metadata (that is, information about the text such as author, title, edition) across databases, even if the local library does not have a license to the content. The HOLLIS catalog at Harvard, for example, reveals metadata for those Erudition databases it licenses, although unaffiliated users cannot access the content. The CrossAsia Fulltext Search Catalog from the State Library in Berlin does this as well for the domain of Asian studies.¹³ The lack of interconnectivity across the digital universe of Chinese studies led to the 2018 Shanghai conference on “Cyberinfrastructure for Historical China Studies.”¹⁴ At this point there is no one agreed path forward, but there are several possibilities. The Max Planck Institute for the History of Science has developed the Research Infrastructure for the Study of Eurasia (RISE) which, through its API, is meant to enable institutions to create secure linkages between third-party research tools and various third-party textual collections.¹⁵ The organizers of the 2018 conference, together with major libraries and research institutes in China and around the world, are working with the Chaoxing group to see whether a sophisticated, wide-ranging search, retrieval, and analysis system could be the basis for a common multi-lingual platform of open and licensed content.¹⁶ Another approach, represented by the aforementioned textref.org and biogref.org, is for database providers to agree on a common standard for the basic metadata necessary to identify texts and individuals in their systems. The challenge is to build this into library and database workflow so that new data is entered automatically. A third option will take shape with the sixth and final edition of Endymion Wilkinson's Chinese History: A New Manual, to appear in 2021–2022. The Manual will then also appear as a curated online database that can continue to evolve, a kind of hub whose spokes are links through APIs to library catalogs and other databases, at the same time that internal hyperlinks make it easy to explore the rich content of the book itself.
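The common-metadata approach mentioned above can be sketched in Python as a shared record type plus a simple index. The field names, provider names, and sample records are hypothetical and are not the schema used by textref.org or biogref.org; the sketch only illustrates how an agreed minimal record lets a user see which providers hold a text without exposing the content itself.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class TextRecord:
    """Minimal shared metadata for one text in one provider's system."""
    provider: str
    title: str
    author: str
    edition: str
    open_access: bool

def index_by_work(records):
    """Group records from different providers by (title, author)."""
    index = defaultdict(list)
    for rec in records:
        index[(rec.title.strip(), rec.author.strip())].append(rec)
    return dict(index)

# Hypothetical records contributed by two providers.
records = [
    TextRecord("Provider X", "Title One", "Author A", "edition 1", True),
    TextRecord("Provider Y", "Title One", "Author A", "edition 2", False),
    TextRecord("Provider Y", "Title Two", "Author B", "edition 1", False),
]
index = index_by_work(records)
holders = [rec.provider for rec in index[("Title One", "Author A")]]
```

In practice, matching titles and authors across providers is harder than exact string comparison, which is one reason an agreed standard for identifying texts and individuals matters.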