ChartEx: a project to extract information from the content of medieval charters and create a virtual workbench for historians to work with this information

July 19, 2013, 10:30 | Short Paper, Embassy Regents F

Summary

The ChartEx (Charter Excavator) Project is developing and evaluating an innovative collection of computational methods to assist researchers in searching, extracting, analyzing, linking and understanding the content of medieval charters. The project is using both natural language processing and data mining techniques to establish entities such as locations (location in this context refers to a specific building or piece of land) and actors, events and dates related to those locations. The project is also developing a “virtual workbench” to support historians in working with large corpora of digital charters and the vast amounts of information that can now be extracted from them. These methods could subsequently be applied to other corpora of digitized texts, be they historical or contemporary.

The ChartEx Project

Researchers now have access to a deluge of data in the form of digitized historical records. One example is medieval charters or title deeds which record transfers of land ownership and are a major source for the study of people and places in the past, including the topography, economy and social relationships of pre-modern communities. However, current digital search aids are not sufficiently sensitive to the needs of researchers seeking to exploit the wealth of detail available within this type of record. The ChartEx (Charter Excavator) Project is developing and evaluating an innovative collection of computational methods to assist researchers in searching, extracting, analyzing, linking and understanding the content of medieval charters. These methods could subsequently be applied to other corpora of digitized texts, be they historical or contemporary.

The ChartEx Project is extracting information from charters, using a combination of natural language processing (NLP) and data mining (DM) components to establish entities such as locations (location in this context refers to a specific building or piece of land) and actors, events and dates related to those locations. The NLP component uses rules derived from the knowledge of experienced researchers, in combination with the semantic meaning of the written language of the charters, to extract these entities. In order to inform these rules a new markup schema was defined for use in the project.

From a sample of manually marked up charters, the NLP component generates automatic markup of entities in a larger corpus of charters. The DM component then uses the output of the NLP component, as refined by researchers if needed, to extract relationships between entities.
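The first stage of this pipeline, rule-based entity tagging, can be illustrated with a minimal sketch. It is not the project's NLP component: the two regular-expression rules, the `Person` and `Site` labels and the function name `tag_entities` are assumptions made purely for illustration, and the sample is a location description of the kind found in English charter summaries.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    kind: str   # hypothetical label: "Person" or "Site"
    text: str   # the matched span
    start: int  # character offset in the source text


# Hypothetical hand-written rules; a realistic rule set would be far larger
# and would have to cope with Latin as well as English wording.
RULES = [
    ("Person", re.compile(r"\b[A-Z][a-z]+ (?:the [a-z]+|of [A-Z][a-z]+)")),
    ("Site", re.compile(r"\b(?:tenement|church) (?:in|of) (?:St )?[A-Z][a-z]+")),
]


def tag_entities(text: str) -> list[Entity]:
    """Apply every rule and return the matches in reading order."""
    found = [
        Entity(kind, match.group(0), match.start())
        for kind, pattern in RULES
        for match in pattern.finditer(text)
    ]
    return sorted(found, key=lambda entity: entity.start)


SAMPLE = (
    "the tenement in Petergate lying between the tenement once held by "
    "John the apothecary and now held by Richard of Huntington on one side, "
    "and the church of St Michael on the other"
)

entities = tag_entities(SAMPLE)
for entity in entities:
    print(f"{entity.kind:<6} {entity.text}")
```

Even this toy example misses the second, unnamed tenement ("the tenement once held by John the apothecary"), which hints at why automatically generated markup may need refinement by researchers before relationships are extracted from it.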

However, the rules used by the NLP component and the relationships found by the DM component cannot reflect all the knowledge of experienced researchers. Therefore, the third crucial component of the ChartEx Project is the use of novel instrumental interaction techniques which will allow researchers to both refine the processing of the NLP and DM, and to directly manipulate (visualise, confirm, correct, refine, augment) relationships extracted from the charters to gain new insights of interest about the entities within them. Instrumental interaction emphasizes that both the computational system and human users have knowledge and must work together to achieve the users’ goals. ChartEx is developing a highly usable “virtual workbench” for researchers to support this instrumental interaction. The ChartEx Workbench has interactive visualizations of large amounts of data and a range of tools to manipulate data and their relationships.

The ChartEx project is demonstrating its approach by analysing five corpora of charters from the 10th to 16th centuries originating from both the UK and other European countries. Two corpora contain full Latin texts and three contain Latin texts that have been provided with English summaries within digital archival catalogues. The collections derive from The National Archives of the United Kingdom, the Borthwick Institute for Archives in York (United Kingdom), the deeds collected as part of the DEEDS project at the University of Toronto (Canada), and deeds originating from the archive at Cluny (France).

Charters are an abundant and fundamental source for the study of many aspects of medieval societies. While recent scholarship has expanded the range of charter studies to such fields as the history of emotions and performativity, their core usefulness remains their provision of basic data: personal names, place names, and dates. In particular, they help us to trace the ownership and occupation of houses and parcels of land over centuries, providing the basis for many further studies from history to tourism and conservation. Initially in the project we assumed that the formulaic nature of the language in medieval charters would make the NLP quite straightforward. However, expressions that appear formulaic at a high level revealed considerable variability when subjected to fine-grained analysis. Nonetheless, by developing a detailed markup schema for the initial hand coding of a set of charters, it was possible to train the NLP component to identify personal names, place names and locations. The Latin (and after c.1300, vernacular) phrases that describe, for example, the location of a property (e.g., ‘the tenement in Petergate lying between the tenement once held by John the apothecary and now held by Richard of Huntington on one side, and the church of St Michael on the other’) were the precursors to the street numbers and scientific spatial referencing developed from the 19th century. When researching a particular historical lived environment, the researcher needs to establish links between actors, events, and locations by recovering and reconstructing the relationships between hundreds, even thousands, of data points scattered across different charters and even different archives.
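The linking task described above can be pictured as building an index from entities to the charters that mention them. The sketch below is a deliberately naive stand-in, not the project's data mining method: it links mentions only when their normalised spellings match exactly, and the charter identifiers, the mention records and the function names are all invented for illustration.

```python
from collections import defaultdict

# Invented (charter id, entity kind, surface form) records for illustration.
MENTIONS = [
    ("charter-A", "Actor", "John the Apothecary"),
    ("charter-B", "Actor", "john  the apothecary"),
    ("charter-B", "Location", "Tenement in Petergate"),
    ("charter-C", "Location", "tenement in Petergate"),
    ("charter-C", "Actor", "Richard of Huntington"),
]


def normalise(surface: str) -> str:
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(surface.lower().split())


def build_index(mentions):
    """Map (kind, normalised name) to the set of charters mentioning it."""
    index = defaultdict(set)
    for charter_id, kind, surface in mentions:
        index[(kind, normalise(surface))].add(charter_id)
    return index


def shared_entities(index):
    """Keep only the entities that occur in more than one charter."""
    return {key: sorted(ids) for key, ids in index.items() if len(ids) > 1}


links = shared_entities(build_index(MENTIONS))
for (kind, name), charters in links.items():
    print(f"{kind}: {name} -> {', '.join(charters)}")
```

Exact matching of this kind would fail on variant spellings, Latin name forms and namesakes, which is one reason the links between charters cannot simply be read off the text.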

Most current systems for searching digital charters depend on manual markup systems which identify entities such as place-names and personal-names in the text and include such entities in catalogue metadata. One example of such a project would be ‘Paradox of Medieval Scotland’ which uses metadata in just this way and then provides full text transcriptions of the locational data within the charter. Another example is the WARD 2 data in The National Archives. ChartEx is using NLP to go beyond the metadata and to explore the actual content of charters, thus enabling researchers to explore the discursive descriptions of locations, actors and events within those locations.

The ChartEx Project has designed a detailed markup schema that adequately represents the ways in which historians currently read charters and extract spatial meaning from them. This new markup schema was created through a collaborative process involving researchers from three different institutions. The schema includes a set of entities and relationships that are at a level of abstraction above traditional diplomatic markup and specifically identifies actors, locations and events in a way that can be used by the NLP component. It is documented in a detailed set of guidelines that have been used to annotate a set of 250 charters from across the five different archival data sets. These manually marked up charters act as a set of training data for the technical components of the project.
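One way to picture a schema of entities and relationships is as a small set of typed records attached to a charter text. The sketch below is an assumption-laden illustration, not the ChartEx schema itself: only the three entity categories (actors, locations, events) come from the description above, while the relationship labels, the class names and the sample sentence are invented.

```python
from dataclasses import dataclass, field

# The three categories named in the text; everything else is hypothetical.
ENTITY_TYPES = {"Actor", "Location", "Event"}


@dataclass(frozen=True)
class Entity:
    id: str
    type: str
    text: str  # the span of charter text that was marked up


@dataclass(frozen=True)
class Relationship:
    source: str  # id of the entity the relationship starts from
    label: str   # hypothetical label such as "is_grantor_in"
    target: str  # id of the entity the relationship points to


@dataclass
class Annotation:
    charter_text: str
    entities: list[Entity] = field(default_factory=list)
    relationships: list[Relationship] = field(default_factory=list)

    def problems(self) -> list[str]:
        """List inconsistencies an annotator might want to correct."""
        known = {entity.id for entity in self.entities}
        issues = []
        for entity in self.entities:
            if entity.type not in ENTITY_TYPES:
                issues.append(f"{entity.id}: unknown type {entity.type!r}")
            if entity.text not in self.charter_text:
                issues.append(f"{entity.id}: span not found in charter text")
        for rel in self.relationships:
            for end in (rel.source, rel.target):
                if end not in known:
                    issues.append(f"{rel.label}: unknown entity id {end!r}")
        return issues


# An invented one-sentence summary, annotated by hand.
annotation = Annotation(
    charter_text=(
        "John the apothecary grants to Richard of Huntington "
        "a tenement in Petergate"
    ),
    entities=[
        Entity("e1", "Actor", "John the apothecary"),
        Entity("e2", "Actor", "Richard of Huntington"),
        Entity("e3", "Location", "tenement in Petergate"),
        Entity("e4", "Event", "grants"),
    ],
    relationships=[
        Relationship("e1", "is_grantor_in", "e4"),
        Relationship("e2", "is_recipient_in", "e4"),
        Relationship("e3", "is_subject_of", "e4"),
    ],
)

print(annotation.problems())
```

Keeping relationships as separate records that refer to entities by identifier means that either side can be corrected independently, which suits a workflow in which manual annotations later serve as training data.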

The ChartEx Project has also developed the ChartEx Virtual Workbench to allow historians to work with the large amounts of data becoming available to them. For this we have taken a highly user-centric approach, using the latest methodologies from human-computer interaction. Eight historians participated in contextual interviews, in which they described in considerable detail how they work with charters on a range of different historical problems and recreated actual pieces of research they had undertaken with charters. These sessions were recorded and analysed in detail to understand the specific tasks that historians undertake with charters, how to support them most effectively in doing the same tasks in a purely digital environment, and how to provide them with new functionality that they have not had before. In particular, we also considered the additional information that historians will have available from the results of the NLP and DM components in the project. On the basis of all this information, a number of “low fidelity” prototypes of possible workbench configurations and functionalities have been developed, and two co-design workshops have been held with the six historians and two archivist advisors to the project to discuss the different possibilities. As a result of these workshops, a more detailed prototype has been developed and is being evaluated by the original set of historians and archivists and a number of newly recruited historians and archivists, who are testing it with realistic research tasks. This will undoubtedly lead to a further iteration of refinement of the Virtual Workbench before the final prototype is developed by the end of the ChartEx Project.

The ChartEx Project is funded under the Digging into Data challenge (www.diggingintodata.org). The research consortium includes historians and linguists as well as experts in human-computer interaction, natural language processing and data mining from six institutions in four countries: Universities of York and Brighton (United Kingdom), University of Toronto (Canada), University of Washington and Columbia University (USA) and University of Leiden (Netherlands).