Pure Transcriptional Markup

July 19, 2013, 08:30 | Long Paper, Embassy Regents C

Renear (2000) introduces the term "transcriptional markup" in the loose sense of markup pertaining to the reproduction of an existing text; he does not attempt to define transcription per se. More recently Huitfeldt and Sperberg-McQueen (2008), and Huitfeldt, Marcoux, and Sperberg-McQueen (2009, 2010) have developed a model of transcription (hereafter the HMS model) that they believe "provides a sort of greatest common denominator for markup systems" (2010, 15). But an awkward gap exists between the HMS model's abstract components and the realities of, for example, the TEI markup language and its typical usage. Here I make the case for a kind of markup, perfectly possible though in practice improbable, which bridges that gap. I call it pure transcriptional markup because it refines the sense of Renear's term and grounds it in an actual, formally specified model of transcription. [1]

Transcription and text encoding are clearly connected, especially where markup replaces the work of presentation. [2] However, not everything that typical scholarly encoding practice might assert regarding the logical domain of a pre-existing text can be considered proper to transcription (if "transcription" is to have any specific meaning of its own). In the initial HMS model Huitfeldt and Sperberg-McQueen (2008) avoid the trap of trying to decide what visible details of an exemplar document E should be reproduced in a transcription T by asking instead what overall relation must hold between the two such that document T is a successful transcription of document E (expressed as t_similar). They say that under a set of reading conditions R, marks of E can be seen as a sequence of tokens each instantiating a type. So, abstractly, a document E is a sequence of types, and if under the same set of reading conditions R a document T can be seen as a token sequence whose corresponding type sequence matches E's type sequence, then E and T are t_similar.
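The t_similar relation described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a "reading" is modelled as a plain function from a mark to the type it instantiates, and the names type_sequence, t_similar, case_insensitive_reading and exact_reading are hypothetical, not taken from the HMS papers.

```python
from typing import Callable, List, Sequence

# Assumption: a reading R is modelled as a function that maps each mark
# (token) to the type it instantiates. The HMS model is far richer.
Reading = Callable[[str], str]


def type_sequence(tokens: Sequence[str], reading: Reading) -> List[str]:
    """Return the sequence of types instantiated by a token sequence."""
    return [reading(token) for token in tokens]


def t_similar(e_tokens: Sequence[str], t_tokens: Sequence[str],
              reading: Reading) -> bool:
    """E and T are t_similar if, under the same reading, their type
    sequences match."""
    return type_sequence(e_tokens, reading) == type_sequence(t_tokens, reading)


def case_insensitive_reading(mark: str) -> str:
    """A reading under which 'J' and 'j' instantiate the same type."""
    return mark.lower()


def exact_reading(mark: str) -> str:
    """A reading under which every distinct mark is its own type."""
    return mark


exemplar = list("Joe stinks!")
transcript = list("JOE STINKS!")

print(t_similar(exemplar, transcript, case_insensitive_reading))  # True
print(t_similar(exemplar, transcript, exact_reading))             # False
```

The same pair of documents counts as a successful transcription under one reading and not under the other, which is the sense in which the relation depends on R rather than on a fixed list of visible details.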

The initial HMS model considers a simple sequence of basic, indivisible tokens at grapheme level; later versions (2009, 2010) introduce components that allow the modeling of a complex structure of token groups (and related types). These groups, known as compound tokens, may have as constituents either compound tokens or basic (atomic) tokens. The resulting view of a document comprising a structured, multi-level token sequence immediately suggests similarities with the well-known 'ordered hierarchy of content objects' (OHCO). As noted above, its creators explicitly connect the HMS model to markup, saying some aspects "serve purposes analogous to the generic identifiers and attribute-value pairs of SGML and related markup languages" (2010, 11), that "element types are types", and "element instances are tokens" (2010, 15).

However, some features of common encoding practice are problematic for the model. I briefly outline two of them here; a single example will illustrate both. [3] Suppose an original printed text contains the sentence "Joe stinks!" Describing this sentence in terms of the HMS model we have:

  • 11 basic tokens at character level: 'J', 'o', 'e', ' ', 's', 't', 'i', 'n', 'k', 's', '!'
  • 2 compound tokens at word level: 'Joe' and 'stinks'
  • 1 compound token at sentence level: 'Joe stinks!'
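The token counts listed above can be reproduced with a small data structure. The sketch below is an assumption-laden illustration: the classes BasicToken and CompoundToken and the helper functions are hypothetical names, and words are identified simply as runs of alphabetic characters.

```python
from dataclasses import dataclass, field
from itertools import groupby
from typing import List, Union


@dataclass
class BasicToken:
    """An indivisible token at character level."""
    mark: str


@dataclass
class CompoundToken:
    """A token whose constituents are basic or compound tokens."""
    level: str
    constituents: List[Union["CompoundToken", BasicToken]] = field(
        default_factory=list)


def tokenize_sentence(text: str) -> CompoundToken:
    """Group alphabetic runs into word tokens; keep other marks basic."""
    sentence = CompoundToken("sentence")
    for is_word, run in groupby(text, key=str.isalpha):
        characters = [BasicToken(mark) for mark in run]
        if is_word:
            sentence.constituents.append(CompoundToken("word", characters))
        else:
            sentence.constituents.extend(characters)
    return sentence


def count_basic(token: Union[CompoundToken, BasicToken]) -> int:
    """Count the basic tokens at or below this token."""
    if isinstance(token, BasicToken):
        return 1
    return sum(count_basic(child) for child in token.constituents)


def count_compound(token: Union[CompoundToken, BasicToken], level: str) -> int:
    """Count the compound tokens of a given level at or below this token."""
    if isinstance(token, BasicToken):
        return 0
    own = 1 if token.level == level else 0
    return own + sum(count_compound(child, level)
                     for child in token.constituents)


sentence = tokenize_sentence("Joe stinks!")
print(count_basic(sentence))                 # 11
print(count_compound(sentence, "word"))      # 2
print(count_compound(sentence, "sentence"))  # 1
```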

A typical TEI encoding of our example might be:

  • <persName>Joe</persName> <emph rend="italics">stinks</emph>!

For the first of the awkward features, recall that for HMS element types are types and element instances are tokens. In markup theory an instance of an element with #PCDATA content would be start tag + content + end tag. So <persName>Joe</persName> is a single element instance and hence a single token. But what type does it instantiate — type 'persName', or type 'Joe'? The problem here is that the single element instance appears to be two tokens, one wrapped inside another. Furthermore, they appear to be operating at different levels: as a specific lexical item, and as a characterization of the lexical item.

The second awkward feature can be seen in <emph rend="italics">stinks</emph>. If element types are types, then we must assume an <emph> element instantiates an 'emphasis' type. Yet there are strong grounds for arguing that emphasis is not a type in the sense used by the model. Wetzel (2009, xii-xiii), citing arguments from Wollheim 1968, believes that while types may well be considered universals, they differ from other universals such as properties. Speaking of words as types she says "they are objects according to the common sense and scientific theories we have about them—values of the first-order variables and referents of singular terms—rather than properties" (124). The emphasis associated with the token "stinks" by the use of italic typeface is at the type level a property associated with the word type 'stinks', not a type itself. Viewing emphasis as a property does accord with the model, which allows for relations between properties and types.

Commonly in encoding schemes there are elements that certainly seem to be associated with types as per the HMS model; but there are also elements we would associate with properties, a point noted by Dubin (2003). The TEI scheme has many type-like elements, but no constraining principle that elements should be restricted to Peircean types. Rather, the 'targets' seem to be textual features that may or may not be objects and may or may not be visible properties. Markup is directed at organizational features made manifest by the work of presentation (paragraphs, lists, etc.), and at non-organizational features that stand out in the regular text flow by virtue of either visible difference or semantic nature.

The challenge, then, is firstly to describe a general view of encoding that does fit the HMS model, and secondly to see if this view helps account for actual encoding in terms of the model. We noted that in the HMS model transcription is defined by the mediation of a type sequence between E and T. The conceptual movement is not

E_token_structure -> T_token_structure
but rather
E_token_structure -> E_type_structure -> T_token_structure.

The middle part of this progression — establishing the type sequence under a reading — may happen in the transcriber's head, but it is the core of the model nevertheless. It is also fundamental to the HMS model that at any one level, a token instantiates a single type. From these givens, some criteria for pure transcriptional encoding emerge.

* if elements are tokens they must instantiate a single type. We therefore move #PCDATA content into attribute values and make all basic-level elements empty. Our example "Joe stinks!" might be represented as follows:
  <sentence>
    <word designation="persName">
      <character type="j" form="majuscule"/>
      <character type="o"/>
      <character type="e"/>
    </word>
    <whitespace/>
    <word communicative_intent="emphasis">
      <character type="s"/>
      <character type="t"/>
      <character type="i"/>
      <character type="n"/>
      <character type="k"/>
      <character type="s"/>
    </word>
    <punctuation type="exclamation"/>
  </sentence>
* elements, attributes, and element structure must either supply overtly or make available through logical inference the information necessary for t_similarity; we must therefore view pure transcriptional encoding as applying to E_type_structure rather than to E_token_structure. [4]
* under a reading R, at whatever level a token is considered basic, so must an element at that level be considered basic and therefore atomic; i.e., if character level is considered basic, then "e" and "<character type='e'/>" are equally indivisible.
* the status of elements as tokens must be somewhat unusual. Wetzel (2009), following Wollheim (1968), notes that types usually resemble their tokens. Even if we allow for minor differences of form (e.g. token "E" instantiating type 'e'), we cannot claim "<character type='e'/>" resembles 'e'. Pure transcriptional markup represents an intermediary stage not normally intended for human consumption: in effect a kind of precipitation out of the E_type_structure into something half-way towards a normal token sequence.
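As a rough illustration of these criteria, the following sketch parses the example encoding with Python's standard xml.etree.ElementTree module and precipitates it into an ordinary token sequence. The PUNCTUATION table, the handling of form="majuscule", and the function names are assumptions made for this example; the element and attribute names are those of the example encoding.

```python
import xml.etree.ElementTree as ET

PURE_MARKUP = """
<sentence>
  <word designation="persName">
    <character type="j" form="majuscule"/>
    <character type="o"/>
    <character type="e"/>
  </word>
  <whitespace/>
  <word communicative_intent="emphasis">
    <character type="s"/>
    <character type="t"/>
    <character type="i"/>
    <character type="n"/>
    <character type="k"/>
    <character type="s"/>
  </word>
  <punctuation type="exclamation"/>
</sentence>
""".strip()

# Hypothetical lookup from punctuation type names to marks.
PUNCTUATION = {"exclamation": "!"}


def render(element: ET.Element) -> str:
    """Turn a pure transcriptional element into a normal token sequence."""
    if element.tag == "character":
        mark = element.get("type")
        return mark.upper() if element.get("form") == "majuscule" else mark
    if element.tag == "whitespace":
        return " "
    if element.tag == "punctuation":
        return PUNCTUATION[element.get("type")]
    # Compound level (word, sentence): concatenate the constituents.
    return "".join(render(child) for child in element)


def basic_elements(root: ET.Element) -> list:
    """Return the empty elements, which stand in for basic tokens."""
    return [element for element in root.iter() if len(element) == 0]


root = ET.fromstring(PURE_MARKUP)
print(render(root))               # Joe stinks!
print(len(basic_elements(root)))  # 11
```

Every basic-level element here is empty, so all of the information needed to rebuild the token sequence comes from element names, attributes, and element structure.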

The elements of pure transcriptional markup are analogous to the posited sentences that Zellig Harris in his mathematical theory of grammar says "go beyond what is normally said in English and are characterized as grammatically possible rather than as actual sentences" (1982, p. 15, my emphasis). The crucial point for Harris is that these sentences, while never encountered 'in the wild', can through a series of rule-governed transformations be reduced to, and so account for, the sentence forms we do encounter in everyday English. These operations "have the common property . . . of reducing high-likelihood, low information entries [or words into sentences]" (p. 8). I suggest a similar relation holds with respect to the difference between pure transcriptional markup and everyday markup, and that we could describe a series of non-random transformations that would reduce the high-likelihood, low-information encoding such as we see in the example above to the kind of markup we normally encounter. In digital humanities we commonly 'zero out' most transcriptional encoding, leaving a majority of tokens in #PCDATA form, so "<character type='e'/>" becomes just "e". Where we want to retain information about a type property without using a specific form of #PCDATA token, we apply transformations that zero out the nature of the type (i.e. we remove the low-information markup that says "stinks" is a word), leaving just the property to be expressed by the markup; hence TEI's <emph>.
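One possible set of such transformations can be sketched as follows. This is a hypothetical illustration, not a worked-out rule system: the function names and the PUNCTUATION table are assumptions, and only the two properties used in the "Joe stinks!" example are handled.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

PURE_MARKUP = """
<sentence>
  <word designation="persName">
    <character type="j" form="majuscule"/>
    <character type="o"/>
    <character type="e"/>
  </word>
  <whitespace/>
  <word communicative_intent="emphasis">
    <character type="s"/>
    <character type="t"/>
    <character type="i"/>
    <character type="n"/>
    <character type="k"/>
    <character type="s"/>
  </word>
  <punctuation type="exclamation"/>
</sentence>
""".strip()

# Hypothetical lookup from punctuation type names to marks.
PUNCTUATION = {"exclamation": "!"}


def zero_out_basic(element: ET.Element) -> str:
    """Transformation 1: reduce a basic-level element to #PCDATA."""
    if element.tag == "character":
        mark = element.get("type")
        return mark.upper() if element.get("form") == "majuscule" else mark
    if element.tag == "whitespace":
        return " "
    if element.tag == "punctuation":
        return PUNCTUATION[element.get("type")]
    raise ValueError("not a basic-level element: " + element.tag)


def reduce_word(word: ET.Element) -> str:
    """Transformation 2: drop the low-information 'word' wrapper and keep
    only a property, if one is recorded, as everyday markup."""
    text = escape("".join(zero_out_basic(child) for child in word))
    if word.get("designation") == "persName":
        return "<persName>" + text + "</persName>"
    if word.get("communicative_intent") == "emphasis":
        return "<emph>" + text + "</emph>"
    return text


def reduce_sentence(sentence: ET.Element) -> str:
    """Apply both transformations and drop the sentence wrapper."""
    parts = []
    for child in sentence:
        if child.tag == "word":
            parts.append(reduce_word(child))
        else:
            parts.append(escape(zero_out_basic(child)))
    return "".join(parts)


print(reduce_sentence(ET.fromstring(PURE_MARKUP)))
# <persName>Joe</persName> <emph>stinks</emph>!
```

The output has no rend attribute because the pure encoding in the example records the communicative intent but not the typeface that expressed it.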


Caton, Paul (2004). Text Encoding, Theory, and English: A Critical Relation. Ph.D. dissertation. Brown University.
Caton, Paul (2009). Lost in Transcription: Types, Tokens, and Modality in Document Representation. In Digital Humanities held June 2009 at College Park, University of Maryland.
Dubin, D. (2003). Object mapping for markup semantics. In Usdin, B. T. (ed.). Proceedings of Extreme Markup Languages. Montreal, Quebec.
Harris, Z. (1982). A Grammar of English on Mathematical Principles. New York: Wiley-Interscience.
Huitfeldt, C., and C. M. Sperberg-McQueen (2008). What is transcription? Literary and Linguistic Computing 23 (3). 295-310. doi:10.1093/llc/fqn013
Huitfeldt, C., Y. Marcoux, and C. M. Sperberg-McQueen (2009). What is transcription? (Part 2). In Digital Humanities held June 2009 at College Park, University of Maryland.
Huitfeldt, C., Y. Marcoux, and C. M. Sperberg-McQueen (2010). Extension of the type/token distinction to document structure. In Proceedings of Balisage: The Markup Conference 2010, held August 3-6, 2010 in Montréal, Canada. Balisage Series on Markup Technologies, vol. 5. doi:10.4242/BalisageVol5.Huitfeldt01
Renear, A. (2000). The descriptive/procedural distinction is flawed. Markup Languages 2 (4). 411-420.
Renear, A., and D. Dubin (2003). Towards identity conditions for digital documents. In S. Sutton (ed.) Proceedings of the 2003 Dublin Core Conference. University of Washington, Seattle, WA.
Wetzel, L. (2009). Types and Tokens: On Abstract Objects. Cambridge, MA: MIT Press.
Wollheim, R. (1968). Art and Its Objects. New York: Harper and Row.


1. This paper presents work from a much larger, ongoing project prompted by ideas first presented in Caton 2009. I no longer defend the conclusions of that earlier work, and the current project represents a substantial rethinking and development of my initial ideas.

2. On the work of presentation see Caton 2004. Another part of the larger project from which this work is drawn examines in detail how presentation mediates transcription and encoding.

3. Necessarily, the outline given here is greatly condensed from an extensive discussion in the larger project, with some consequent loss of continuity and supporting evidence from the argument.

4. Note that we would still have to associate a semantics with the pure transcriptional markup if we wanted to establish formal identity between documents in the manner suggested by Renear and Dubin (2003).