Pure Transcriptional Markup
July 19, 2013, 08:30 | Long Paper, Embassy Regents C
Renear (2000) introduces the term "transcriptional markup" in a loose sense, as pertaining to the reproduction of an existing text, and he does not attempt to define transcription per se. More recently Huitfeldt and Sperberg-McQueen (2008), and Huitfeldt, Marcoux, and Sperberg-McQueen (2009, 2010), have developed a model of transcription (hereafter the HMS model) that they believe "provides a sort of greatest common denominator for markup systems" (2010, 15). But an awkward gap exists between the HMS model's abstract components and the realities of, for example, the TEI markup language and its typical usage. Here I make the case for a kind of markup — perfectly possible though in practice improbable — which bridges that gap. I call it pure transcriptional markup because it refines the sense of Renear's term and grounds it in an actual, formally specified model of transcription. [1]
Transcription and text encoding are clearly connected, especially where markup replaces the work of presentation. [2] However, not everything that typical scholarly encoding practice might assert about the logical domain of a pre-existing text can be considered proper to transcription (if "transcription" is to have any specific meaning of its own). In the initial HMS model, Huitfeldt and Sperberg-McQueen (2008) avoid the trap of trying to decide which visible details of an exemplar document E should be reproduced in a transcription T by asking instead what overall relation must hold between the two such that document T is a successful transcription of document E (expressed as t_similar). They say that under a set of reading conditions R, the marks of E can be seen as a sequence of tokens, each instantiating a type. So, abstractly, a document E is a sequence of types, and if under the same set of reading conditions R a document T can be seen as a token sequence whose corresponding type sequence matches E's type sequence, then E and T are t_similar.
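To make the relation concrete, here is a minimal sketch in Python (my own illustration, not part of the HMS formalism), assuming a reading can be modelled as a function from tokens to types:

```python
# A minimal sketch of t_similar, assuming a reading R can be modelled
# as a function from tokens to types (all names here are hypothetical).

def type_sequence(tokens, reading):
    """Apply a reading R to a document's token sequence."""
    return [reading(token) for token in tokens]

def t_similar(doc_e, doc_t, reading):
    """T is a successful transcription of E iff, under the same reading,
    both token sequences instantiate the same type sequence."""
    return type_sequence(doc_e, reading) == type_sequence(doc_t, reading)

# Example: tokens as glyph occurrences, types as case-folded characters,
# so a reading that treats case as insignificant judges these t_similar.
exemplar   = list("Joe stinks!")
transcript = list("JOE STINKS!")
assert t_similar(exemplar, transcript, reading=str.lower)
```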
The initial HMS model considers a simple sequence of basic, indivisible tokens at grapheme level; later versions (2009, 2010) introduce components that allow the modeling of a complex structure of token groups (and related types). These groups, known as compound tokens, may have as constituents either compound tokens or basic (atomic) tokens. The resulting view of a document as a structured, multi-level token sequence immediately suggests similarities with the well-known 'ordered hierarchy of content objects' (OHCO). As noted above, the model's creators explicitly connect it to markup, saying some aspects "serve purposes analogous to the generic identifiers and attribute-value pairs of SGML and related markup languages" (2010, 11), that "element types are types", and that "element instances are tokens" (2010, 15).
However, some features of common encoding practice are problematic for the model. I briefly outline two of them here; a single example will illustrate both. [3] Suppose an original printed text contains the sentence "Joe stinks!" Describing this sentence in terms of the HMS model, we have the following (rendered as a data-structure sketch after the list):
- 11 basic tokens at character level — 'J', 'o', 'e', ' ', 's', 't', 'i', 'n', 'k', 's', '!'
- 2 compound tokens at word level — 'Joe' and 'stinks'
- 1 compound token at sentence level — 'Joe stinks!'
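This multi-level token structure can be sketched as a small data structure; the class names below are my own hypothetical rendering, not the HMS model's vocabulary:

```python
# A hypothetical rendering of the compound-token structure of
# "Joe stinks!"; the classes are illustrative, not part of the model.
from dataclasses import dataclass

@dataclass
class BasicToken:
    glyph: str        # an indivisible, grapheme-level token

@dataclass
class CompoundToken:
    level: str        # e.g. "word" or "sentence"
    parts: list       # constituents: basic and/or compound tokens

joe = CompoundToken("word", [BasicToken(c) for c in "Joe"])
stinks = CompoundToken("word", [BasicToken(c) for c in "stinks"])
sentence = CompoundToken(
    "sentence", [joe, BasicToken(" "), stinks, BasicToken("!")]
)
```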
A typical TEI encoding of our example might be:
- <persName>Joe</persName> <emph rend="italics">stinks</emph>!
For the first of the awkward features, recall that for HMS element types are types and element instances are tokens. In markup theory an instance of an element with #PCDATA content would be start tag + content + end tag. So <persName>Joe</persName> is a single element instance and hence a single token. But what type does it instantiate — type 'persName', or type 'Joe'? The problem is that the single element instance appears to be two tokens, one wrapped inside the other. Furthermore, the two appear to operate at different levels: one as a specific lexical item, the other as a characterization of that lexical item.
The second awkward feature can be seen in <emph rend="italics">stinks</emph>. If element types are types, then we must assume an <emph> element instantiates an 'emphasis' type. Yet there are strong grounds for arguing that emphasis is not a type in the sense used by the model. Wetzel (2009, xii-xiii), citing arguments from Wollheim (1968), believes that while types may well be considered universals, they differ from other universals such as properties. Speaking of words as types, she says "they are objects according to the common sense and scientific theories we have about them—values of the first-order variables and referents of singular terms—rather than properties" (124). The emphasis associated with the token "stinks" by the use of italic typeface is, at the type level, a property associated with the word type 'stinks', not a type itself. Viewing emphasis as a property accords with the model, which allows for relations between properties and types.
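One way of modelling that relation, on the assumption that a token instantiates exactly one type while carrying properties alongside it, is sketched below (the classes are hypothetical):

```python
# A sketch of the type/property distinction; the classes are my own
# illustration of "relations between properties and types".
from dataclasses import dataclass, field

@dataclass(frozen=True)
class WordType:
    spelling: str            # a type is an object in its own right

@dataclass
class Token:
    type: WordType           # each token instantiates a single type
    properties: dict = field(default_factory=dict)

# The italicized occurrence of "stinks" instantiates the word type
# 'stinks'; emphasis attaches as a property, not as a second type.
stinks_token = Token(WordType("stinks"), {"emphasis": "italics"})
```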
Commonly in encoding schemes there are elements that certainly seem to be associated with types as per the HMS model; but there are also elements we would associate with properties, a point noted by Dubin (2003). The TEI scheme has many type-like elements, but no constraining principle that elements should be restricted to Peircean types. Rather, the 'targets' seem to be textual features, which may or may not be objects, and may or may not be visible properties. Markup is directed at organizational features made manifest by the work of presentation (paragraphs, lists, etc.), and at non-organizational features that stand out from the regular text flow by virtue of either visible difference or semantic nature.
The challenge, then, is firstly to describe a general view of encoding that does fit the HMS model, and secondly to see whether this view helps account for actual encoding in terms of the model. We noted that in the HMS model transcription is defined by the mediation of a type sequence between E and T. The conceptual movement is not directly from E to T, but from E to a type sequence and from that type sequence to T.
The middle part of this progression — establishing the type sequence under a reading — may happen in the transcriber's head, but it is the core of the model nevertheless. It is also fundamental to the HMS model that at any one level, a token instantiates a single type. From these givens, some criteria for pure transcriptional encoding emerge.
The elements of pure transcriptional markup are analogous to the posited sentences that Zellig Harris, in his mathematical theory of grammar, says "go beyond what is normally said in English and are characterized as grammatically possible rather than as actual sentences" (1982, 15, my emphasis). The crucial point for Harris is that these sentences, while never encountered 'in the wild', can through a series of rule-governed transformations be reduced to — and so account for — the sentence forms we do encounter in everyday English. These operations "have the common property . . . of reducing high-likelihood, low-information entries [of words into sentences]" (1982, 8). I suggest that a similar relation holds between pure transcriptional markup and everyday markup, and that we could describe a series of non-random transformations that would reduce the high-likelihood, low-information encoding such as we see in the example above to the kind of markup we normally encounter. In digital humanities we commonly 'zero out' most transcriptional encoding, leaving the majority of tokens in #PCDATA form, so that <character type='e'/> becomes just 'e'. Where we want to retain information about a type property without using a specific form of #PCDATA token, we apply transformations that zero out the nature of the type (i.e. we remove the low-information markup that says "stinks" is a word), leaving just the property to be expressed by the markup — hence TEI's <emph>. [4]
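As a rough illustration of such reductions, the sketch below invents a pure transcriptional encoding of our example and applies two zeroing transformations to it. The element and attribute names are hypothetical, and the same pattern that yields <emph> here would yield <persName> from a retained name property:

```python
# A sketch of "zeroing" transformations over an invented pure
# transcriptional encoding; element and attribute names are my own.
import xml.etree.ElementTree as ET

PURE = """<sentence>
  <word><character type="J"/><character type="o"/><character type="e"/></word>
  <space/>
  <word emphasis="italics"><character type="s"/><character type="t"/>
    <character type="i"/><character type="n"/><character type="k"/>
    <character type="s"/></word>
  <character type="!"/>
</sentence>"""

def characters(word):
    """Zero out <character> markup, leaving plain #PCDATA."""
    return "".join(c.get("type", "") for c in word.iter("character"))

def reduce_word(word):
    """Zero out the word type itself; keep only a retained property,
    re-expressed here as TEI-style <emph>."""
    text = characters(word)
    if word.get("emphasis"):
        return '<emph rend="{}">{}</emph>'.format(word.get("emphasis"), text)
    return text

def reduce_sentence(sentence):
    out = []
    for child in sentence:
        if child.tag == "word":
            out.append(reduce_word(child))
        elif child.tag == "space":
            out.append(" ")
        elif child.tag == "character":
            out.append(child.get("type", ""))
    return "".join(out)

print(reduce_sentence(ET.fromstring(PURE)))
# -> Joe <emph rend="italics">stinks</emph>!
```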
Notes
1. This paper presents work from a much larger, ongoing project prompted by ideas first presented in Caton 2009. I no longer defend the conclusions of that earlier work, and the current project represents a substantial rethinking and development of my initial ideas.
2. On the work of presentation see Caton 2004. Another part of the larger project from which this work is drawn examines in detail how presentation mediates transcription and encoding.
3. Necessarily, the outline given here is greatly condensed from an extensive discussion in the larger project, with some consequent loss of continuity and supporting evidence from the argument.
4. Note that we would still have to associate a semantics with the pure transcriptional markup if we wanted to establish formal identity between documents in the manner suggested by Renear and Dubin (2003).