Practical Interoperability: The Map of Early Modern London and the Internet Shakespeare Editions
July 17, 2013, 10:30 | Long Paper, Burnett 115
While the promise of interoperability has been one of the major driving forces in the adoption of standards such as TEI, it has long been recognized that interoperability has only limited practicality (McDonough 2008; Sperberg-McQueen 2008). As large-scale digitization projects have matured, it has become apparent that the most effective approach to interoperability between them is based on loose coupling through APIs and metadata exchange services such as OAI-PMH, rather than wholesale convertibility or aggregation (see, for example, Bol, Hsiang and Fong 2012; Matei 2012).
The Map of Early Modern London (MoEML) and the Internet Shakespeare Editions (ISE) are mature projects, and both are under active development. While MoEML's text database is steadily growing, the literary texts in the ISE collection, on the same network and sharing some of the same research team, have become a tempting target for integration. MoEML’s goal is to give users a sense of the lived space of London, particularly as that space was invoked by the implied geography of early modern plays. Shakespeare’s ten history plays are rich in references to London places. 1 Henry IV moves between Eastcheap and Westminster; the title character of Richard III bustles through London; and the Tower looms ominously over the action of nearly every play. Ingesting and mapping these references in the MoEML environment would stimulate research questions about Shakespeare and London alike. How typical is Shakespeare’s invocation of London? How do his characters move through the urban environment? What is the relationship between London and the court in Shakespeare’s historical vision? How does this vision compare to that of other playwrights, such as Thomas Heywood, and to that of historians like Holinshed and Stow?
MoEML maps the streets, sites, and significant boundaries of London from 1560 to 1640, basing its interface on the Agas Map of London, which dates originally from the 1560s. The project incorporates a detailed gazetteer, topical essays, and digital texts from the period, and will soon include three editions of John Stow's A Survey of London. At the heart of the project is an XML placeography incorporating over 720 streets, churches, wards, neighbourhoods, and sites of interest. Places are both geo-referenced and linked to the Agas Map.
One goal is to use the Agas Map as a platform on which to visualize the locations in texts of the period. To that end, MoEML includes a library of early modern texts with all the toponyms identified and tagged. With dramatic texts, we have until now included only the “Dramatic Extracts” that contain London toponyms. It would be preferable, though, to extract toponyms dynamically from existing digital editions and plot them on the Agas map. The simple data visualization in Figure 1 shows a prototype for Richard III, with each location sized according to the number of references to it, demonstrating the manner in which the Tower dominates the action.
Figure 1: The London locations in Richard III on the Agas Map, sized according to the number of references to them.
The Internet Shakespeare Editions is primarily an open-source digital anthology of Shakespeare's plays. The ISE's programming platform also runs the Queen's Men's Editions (QME) and the Digital Renaissance Editions (DRE). Between these three projects, all the plays of Shakespeare and his contemporaries from 1500 to 1640 will be available in standardized XML base texts. At the heart of the projects are carefully edited texts of each play, in both their early printed forms and in modern editions with spelling and punctuation regularized. ISE editors already tag the base texts with simple tags that are converted to XML. We could ask the ISE, DRE, and QME editors to add in the London references; however, they are likely to turn to MoEML for help with identifying specific locations, so it is preferable to process their XML files and identify the toponyms ourselves. We propose to begin our prototyping with the ten history plays because five of them are complete or nearly complete.
TEI versions of the ISE plays are currently indexed in an eXist XML database (like the MoEML texts) in order to provide search capabilities. The ISE textbase also includes modern-spelling versions of the texts, and all its versions of each core text are linked using "through line numbers" (TLNs) based on the First Folio. These features provide the basis for a comprehensive system to identify placename references throughout the texts.
We will use a multi-phase approach to identifying relevant placename instances. First, we will deploy a Named Entity Recognition tool such as the Stanford Named Entity Recognizer, trained on a subset of texts selected to provide sufficient variety of genre and known to include a useful number of London place references. We will combine the results with the entries in a dictionary of spelling variants of London placenames extracted from our MoEML collection. We will generate results in a form that includes:
- Candidate placename
- Surrounding context (paragraph, line selection, etc.)
- Link to online version of the text using TLN
- ID and name of candidate match location in MoEML database (if there is one)
- Link to MoEML location data
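As an illustration only, the report fields listed above could be held in a small record type like the one below. The class, field, and function names are hypothetical and are not taken from the MoEML or ISE codebases; the review states simply mirror the accept, reject, or correct decisions a research assistant would make.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Review(Enum):
    """Hypothetical review states for a candidate placename."""
    PENDING = "pending"
    ACCEPTED = "accepted"
    REJECTED = "rejected"
    CORRECTED = "corrected"


@dataclass
class Candidate:
    placename: str                     # candidate placename as found in the text
    context: str                       # surrounding paragraph or line selection
    text_link: str                     # link to the online text, anchored by TLN
    moeml_id: Optional[str] = None     # ID of the candidate MoEML match, if any
    moeml_name: Optional[str] = None   # name of the candidate MoEML match, if any
    moeml_link: Optional[str] = None   # link to the MoEML location data, if any
    review: Review = Review.PENDING

    def has_match(self) -> bool:
        """True when a candidate MoEML location has been proposed."""
        return self.moeml_id is not None


def review_queue(candidates: List[Candidate]) -> List[Candidate]:
    """Return pending candidates, those with a proposed MoEML match first."""
    pending = [c for c in candidates if c.review is Review.PENDING]
    return sorted(pending, key=lambda c: (not c.has_match(), c.placename.lower()))
```

Sorting matched candidates first is one possible design choice: it lets reviewers confirm the likely hits quickly before turning to candidates that have no proposed location.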
This manner of reporting the results will allow research assistants to rapidly accept, reject, or correct the placename instance. Confirmed references will be stored in a TEI document in the form of <linkGrp> elements:
<linkGrp target="mol:CHAR1" n="Charing Cross">
  <link target="ise:1H4/M/scene/2.1#tln-659|76-88"/>
</linkGrp>
This encoding links an entry in the MoEML placeography (Charing Cross, which has the @xml:id "CHAR1") to a passage in the ISE's modern-spelling version of 1 Henry IV. Any other references to the same location will be encoded as further <link> elements inside the same <linkGrp>. For convenience, the links use private URI schemes. Pointers prefixed with "mol:" are dereferenced in the context of the MoEML database through XPath (//TEI[@xml:id="CHAR1"] in this case). The "ise:" prefix is dereferenced similarly to construct a full URI to the target location in the document: http://internetshakespeare.uvic.ca/Library/Texts/1H4/M/scene/2.1#tln-659. The final component of the pointer, after the "|" separator, gives the character offset range of the placename (76-88 here). A formal method for documenting and mechanically dereferencing private URI schemes and similar abbreviated pointers has been proposed (Holmes 2012) and is being considered for adoption by the TEI Council.
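A minimal sketch of how such pointers might be dereferenced mechanically is given below. The base URL is inferred from the example URI above and the XPath template from the "mol:" example; the function and type names are hypothetical, and the regular expression is an assumption about the pointer syntax rather than a documented grammar.

```python
import re
from typing import NamedTuple, Optional

# Assumed expansion for the "ise:" prefix, inferred from the example URI.
ISE_BASE = "http://internetshakespeare.uvic.ca/Library/Texts/"


class Pointer(NamedTuple):
    scheme: str             # "mol" or "ise"
    target: str             # XPath (for "mol") or full URI (for "ise")
    start: Optional[int]    # first character offset of the placename, if given
    end: Optional[int]      # last character offset of the placename, if given


# scheme ":" path, optionally followed by "|" start "-" end
_POINTER = re.compile(
    r"^(?P<scheme>[a-z]+):(?P<path>[^|]+)(?:\|(?P<start>\d+)-(?P<end>\d+))?$"
)


def dereference(pointer: str) -> Pointer:
    """Expand an abbreviated pointer into an XPath or URI plus offsets."""
    m = _POINTER.match(pointer)
    if m is None:
        raise ValueError(f"not a recognised pointer: {pointer!r}")
    scheme, path = m["scheme"], m["path"]
    if scheme == "mol":
        target = f'//TEI[@xml:id="{path}"]'
    elif scheme == "ise":
        target = ISE_BASE + path
    else:
        raise ValueError(f"unknown prefix: {scheme!r}")
    start = int(m["start"]) if m["start"] is not None else None
    end = int(m["end"]) if m["end"] is not None else None
    return Pointer(scheme, target, start, end)
```

For example, `dereference("ise:1H4/M/scene/2.1#tln-659|76-88")` yields the full ISE URI together with the offsets 76 and 88, while `dereference("mol:CHAR1")` yields the XPath and no offsets.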
Once candidate placenames have been encoded for the modern-spelling editions of the plays, the TLN referencing system in use by the ISE can be used to identify corresponding references in the other editions. For each confirmed reference, we will use the following automated process:
1. Retrieve the text following the corresponding TLN from an original-spelling edition of the text.
2. Search for the placename as it appears in the modern-spelling edition. If found, record its offsets and generate a <link>.
3. If not found, try a search for each variant spelling of the placename known to the MoEML database.
4. If a match is still not found, tokenize the target text, create bigrams and trigrams, and run similarity metrics between the modern-spelling placename and each n-gram. If a similarity threshold is reached, assume a match and create a <link>.
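The search steps above can be sketched roughly as follows. This is an illustration under several assumptions, not the project's implementation: the text retrieved for the TLN is passed in as a string, matching is case-insensitive, offsets are zero-based and inclusive, and Python's `difflib.SequenceMatcher` ratio (where 1.0 means identical) stands in for whichever similarity metric is eventually chosen. The function names and the 0.8 threshold are hypothetical.

```python
import re
from difflib import SequenceMatcher
from typing import Iterable, Iterator, Optional, Tuple

# (start offset, end offset inclusive, flagged for manual checking)
Span = Tuple[int, int, bool]


def find_exact(text: str, needle: str) -> Optional[Tuple[int, int]]:
    """Case-insensitive literal search returning inclusive offsets."""
    m = re.search(re.escape(needle), text, re.IGNORECASE)
    return None if m is None else (m.start(), m.end() - 1)


def ngrams(text: str, sizes: Iterable[int] = (2, 3)) -> Iterator[Tuple[str, int, int]]:
    """Yield (n-gram, start, end inclusive) over word tokens in the text.

    Each n-gram is the slice of the original text from its first token to
    its last, so intervening hyphens and spaces are preserved.
    """
    tokens = list(re.finditer(r"\w+", text))
    for n in sizes:
        for i in range(len(tokens) - n + 1):
            start, end = tokens[i].start(), tokens[i + n - 1].end()
            yield text[start:end], start, end - 1


def similarity(a: str, b: str) -> float:
    """Stand-in metric: difflib ratio in [0, 1], with 1.0 meaning identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def locate(text: str, placename: str, variants: Iterable[str] = (),
           threshold: float = 0.8) -> Optional[Span]:
    """Apply the search steps in order to one passage of original-spelling text."""
    # Steps 2 and 3: the modern spelling first, then each known variant.
    for needle in (placename, *variants):
        hit = find_exact(text, needle)
        if hit is not None:
            return (hit[0], hit[1], False)
    # Step 4: fuzzy comparison against bigrams and trigrams.
    scored = [(similarity(placename, gram), start, end)
              for gram, start, end in ngrams(text)]
    if scored:
        score, start, end = max(scored, key=lambda s: s[0])
        if score >= threshold:
            return (start, end, True)  # fuzzy matches are flagged for review
    return None
```

Because the tokenizer splits on non-word characters, a hyphenated form such as "Charing-crosse" is seen as a bigram. Single-word placenames would need unigrams as well, which can be requested by passing a different `sizes` value to `ngrams`.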
Various similarity metrics might be appropriate here, including the Universal Similarity Metric (USM; see Holmes 2010). Wherever a similarity metric is invoked, the result will be flagged for manual checking. To continue the Charing Cross example, the First Folio spells the name "Charing-crosse". A Java implementation of the USM gives the two spellings a score of 0.206, which represents high similarity (scores range from 0 to 1, with 0 representing identity). Processing the First Folio edition would therefore generate a second <link>:
<link target="ise:1H4/F1/scene/2.1#tln-660|37-50"/>
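For readers unfamiliar with this family of metrics, the sketch below shows the normalized compression distance (NCD), a common computable approximation to Kolmogorov-complexity-based similarity measures such as the USM. It is not the Java implementation mentioned above, and it will not reproduce the 0.206 score: on strings this short, the fixed overhead of the zlib compressor distorts the values, so only the relative ordering of scores is meaningful. As with the USM, lower scores indicate greater similarity.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (maximum compression)."""
    return len(zlib.compress(data, 9))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance between two strings.

    Values near 0 suggest the strings share most of their content; values
    near 1 suggest they share little. With a real compressor the result is
    only approximate and can slightly exceed 1.
    """
    x, y = a.encode("utf-8"), b.encode("utf-8")
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Under this sketch, `ncd("Charing Cross", "Charing-crosse")` should come out well below `ncd("Charing Cross", "Westminster Abbey")`, which is the property a threshold test relies on.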
Figure 2: A flowchart representing the placename identification process.
Figure 3: Mapping placename identification in the modern-spelling texts onto original-spelling versions.
The complete process is represented in the flowcharts in Figures 2 and 3. This approach will enable us to generate a large number of matches and resulting links without excessive human labour. The link groups will be stored in the MoEML database. No modification of MoEML or ISE texts is required; this is the "loose coupling" mentioned above. Links to instances of London placenames in the ISE texts can be provided as part of MoEML's online placeography. Meanwhile, the ISE team has expressed interest in linking out to MoEML location data. This could easily be achieved either by processing the MoEML link groups to add annotations directly into the ISE texts or, in keeping with the loose coupling methodology, by calling an API provided by MoEML when sections of ISE texts are rendered, so that relevant links are incorporated at display time.