Normalisation in Historical Text Collections

July 17, 2013, 15:30 | Centennial Room, Nebraska Union

Improved full-text search, named-entity recognition and relationship extraction are all key research topics across many areas of technology, with emerging applications in the intelligence, healthcare and financial fields amongst many others [1] . In Digital Humanities, there is a growing interest in the application of such Natural Language Processing (NLP) approaches to historical texts [2] with a view to improving how a user can explore and analyse these collections [3] [4] [5] [6] . However, the text contained in handwritten historical manuscript collections can often be ‘noisy’ in nature — with variation in spelling, punctuation, word form, sentence structure and terminology. This is particularly the case with collections written in archaic language forms, such as Early-Modern English. Multiple studies have concluded that the applicability of modern NLP tooling to such historical texts has been very limited due to this inherent noisiness in the texts. This historical language barrier hinders the accessibility and thus the potential exploration and analysis of many significant historical text collections. This paper will discuss the normalisation of historical texts as a solution to this problem and examine how normalisation can improve the analysis, interpretation and exploration of these collections.

Normalisation is the process of transforming text into a single canonical form, in this case, the modern equivalent of the language. Once this has been completed, the texts can be processed using current NLP techniques and technologies. However, the normalisation of historical texts presents a difficult challenge in itself.

Much research has been undertaken in an attempt to cope with the correction and normalisation of text produced by Optical Character Recognition (OCR), speech recognition, instant messaging etc. which show similar characteristics to those of historical texts. One technique which has been applied is the use of a historical lexicon, supplemented by computational tools and linguistic models of variation. However, because of the absence of language standards, multiple orthographic variations of a given word or expression can be found in a collection of material, even in the same document. As a result, the quality of the results achieved, even after normalisation, has not been satisfactory. Researchers have also noted a general lack of tools and resources specialised to this domain.

This paper will present the normalisation research conducted as part of the CULTURA project, which has developed techniques for the normalisation of a 17th century manuscript collection written in Early Modern English, The 1641 Depositions [7] . CULTURA analyses the artefacts and through the application of novel linguistic models of variation, enables normalisation techniques to remove issues of inconsistency in spelling, grammar and punctuation. The technologies developed and applied have had to solve issues arising from the need to contend with noisy inputs, the impact noise can have on downstream applications, and the demands that noisy information places on document analysis. The normalisation of texts in Early Modern English can be interpreted as a special (restricted) case of translation. Using this intuition, a methodology was developed based upon statistical machine translation models. The key ingredient of this approach is a new translation module that further develops known OCR correction techniques.

Once the content has been normalised, further analysis is conducted to perform named entity and relationship extraction. This identifies the individuals, events, dates etc. within the collection and the relationships between these entities. It encodes this data in a manner which promotes interoperability with other collections of related cultural heritage material.

The normalisation process allows CULTURA to perform disambiguation on the text. For example, one of the main players mentioned many times in the 1641 Depositions collection is Sir Phelim O’Neill. In addition to huge variations in the spelling of his name (Phelin, felim, ffelim, O’Neill, Neil, Onell etc.), he is also referred to in a number of different ways throughout the depositions (The O’Neill, Phelim MacTurlough MacAodh, ‘The rebel leader’ etc.). Normalisation is used to address this variation and also identify the context in which he is mentioned – for example, is he accused of being directly involved in an incident or merely mentioned as a known rebel figure? A combination of NLP and SNA techniques are then used to identify relationships between this individual and other people in the community. He can also be associated with specific events and locations over time, all of which would be impossible without the collection having been normalised.

While the normalisation of the collection has facilitated this application of further NLP techniques, it has also supported improved interaction with the collection for novice researchers and members of the general public. People without a deep scholarly knowledge of such collections can find it easier to interact with, and gain an understanding of, the normalised versions of the transcribed text rather than the originals.

The un-normalised text of the collection is still available to “professional” scholars who use the CULTURA portal and every effort is made during normalisation to ensure that changes that implicitly involve interpretation of the text, or that go beyond normalisation, are avoided. The normalisation process should never impact upon the semantic content of the collections.

A number of important advancements have been achieved by CULTURA, including the development of:

  • A normalisation algorithm [8] called Regularities Based Embedding of Language Structures (REBELS).
  • An integrated REST based web service for the implemented normalisation module.
  • A tool for manual annotation which makes the normalisation process as simple as possible, and helps to verify consistency of annotations and to help the resolution of detected conflicts.

These developments all support a more effective text normalisation process and improve the effectiveness of the entity and relationship extraction procedures which follow. Experimental results have demonstrated that the normalisation averages 98% accuracy in the translation of regular words (tokens) in the 1641 Depositions collection.

In order to show the general applicability of the approach, the methodology was applied to another Early Modern English text collection, the Innsbruck Letters Corpus, part of the Innsbruck Computer-Archive of Machine-Readable English Texts (ICAMET) corpus. The results were compared to two state of the art systems, Moses [9] and VARD2 [10] . The CULTURA approach achieved an accuracy of 83% with minimal training, when compared with 77% and 65% for Moses and VARD2 respectively.

In summary, this paper will discuss the normalisation of historical texts, and the analysis and feature identification that normalisation facilitates. These are increasingly important processes for “noisy” cultural heritage resources, and provide significant benefits to the analysis, interpretation and exploration of these collections. Together, they can:

  • Improve the quality and re-usability of artefacts by normalising spelling, punctuation and nomenclature.
  • Facilitate deeper interrogation of the material.
  • Identify features of the collection — individuals, locations, dates etc.
  • Enable social network analysis on these features, identifying those with the greatest influence.
  • Open new pathways for the exploration and interrogation of the resources for both novices and experts.
  • Add structure and logic to otherwise featureless material, enabling new forms of engagement, use and enjoyment of the content.


1. Sunita Sarawagi. Information extraction. FnT Databases, 1(3), 2008.

2. The CHLT Project, funded under EU FP5, aimed to create an infrastructure for pioneering International Digital Library Technology (IDLT), and a range of IT applications for use within digital collections (with special emphasis on early modern Latin, classical Greek, and Old Norse texts), including generic tools for multi-lingual information retrieval; concept identification and visualisation; vocabulary analysis and syntactic parsing.

3. 3 The IMPACT project http://www.impact-project.eu

4. Annette Gotscharek, Andreas Neumann, Ulrich Reffle, Christoph Ringlstetter, and Klaus U. Schulz. 2009. Enabling information retrieval on historical document collections: the role of matching procedures and special lexica. In Proceedings of The Third Workshop on Analytics for Noisy Unstructured Text Data (AND '09). ACM, New York, NY, USA, 69-76.

5. A. Gotscharek, U. Reffle, C. Ringlstetter, and K. U. Schulz. On lexical resources for digitization of historical documents. In DocEng '09: Proceedings of the 9th ACM symposium on Document engineering, pages 193--200, New York, NY, USA, 2009.

6. A. Ernst-Gerlach and N. Fuhr. Retrieval in text collections with historic spelling using linguistic and spelling variants. In JCDL ’07: Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries, pages 333–341, New York, NY, USA, 2007. ACM.

7. The 1641 Depositions – http://1641.tcd.ie

8. Gerdjikov, S. “Some algebraic properties of alignments of words.” In Comptes rendus de l’académie bulgare des science. 2012.

9. Moses is a statistical machine translation system - http://www.statmt.org/moses/

10. A. Baron and P. Rayson. Automatic standardisation of texts containing spelling variation: How much training data do you need? In: Proceedings of the Corpus Linguistics Conference. Lancaster University, Lancaster, 2009.