Text Encoding, the Index, and the Dynamic Table of Contexts
July 19, 2013, 10:30 | Long Paper, Embassy Regents D
Short Abstract
This paper investigates the nature of the index and its role within scholarly publishing by means of an experiment using the Dynamic Table of Contexts Browser to publish a scholarly essay collection that offers a full intellectual index combined with semantic encoding and free-text search functionality.
Extended Abstract
The index has long been a feature of printed text. The noun form of the word developed from various material pointers, whether literal fingers or portions of instruments, into the more abstract concept of a sign or token that emerged simultaneously with the lists placed at the backs of books in the late sixteenth century (“Index,” def. n. 1, 2a, 4b, 5b).
The concept of the index pervades the digital world. Symptomatically, the most recent change to the meaning of the verb form emerged from computing in 1962 (“Index,” def. n. 5d). The digital humanities themselves trace their origin back to Father Busa’s concordance of the Index Thomisticus, a machine-generated index produced at IBM (Burton). Yet the conceptual index produced through the intellectual engagement of a human being with the meaning of the text, that is, the kind of index that we as scholars most value from print culture, is exceedingly rare in the context of digital texts, where instead the automated, machine-generated index abounds.
Computers can generate indefatigably exhaustive and unimaginably extensive indices, or deliver with lightning speed the portions of a text containing a word on which the user searches, and the resulting gains are undeniable. However, the fate of the intellectual index is uncertain given the rise of digital books generally and of semantic encoding within digital humanities publishing in particular. This paper reflects on the role of the index within scholarly publishing by means of an experiment using the Dynamic Table of Contexts Browser to publish a scholarly essay collection that offers a full, professionally produced intellectual index, semantic encoding of recurrent features, and free-text search.
The Dynamic Table of Contexts Browser was designed as a reading environment for digitally encoded texts that would combine the table of contents with tools drawing on embedded semantic markup to create new affordances for reading in a digital environment. The familiarity of the table of contents provides prospect and navigational assistance, while the browser leverages the embedded semantic markup to allow users, as a description of an earlier instance of the browser put it, “to add or subtract what are essentially index items in and out of the table of contents” (Ruecker et al., 180; Nelson et al., 12). The combination aims to provide an advance over the standard fixed-content, if expandable and contractable, digital table of contents, and thereby solve some key usability challenges related to lengthy digital texts.
What an index item is, “essentially,” however, is far from simple. The investigation described here revolves in part around the tension between free text and controlled vocabulary as the basis for indexing. There has long been an unresolved, antagonistic relation between free-text terms and controlled vocabulary terms. [1] Both describe ways of representing and retrieving information in a digital context. Yet, long before the invention of the computer, scholars quibbled over the value and effectiveness of uncontrolled vocabularies and free-text searches (Svenonius, 333). This controversy falls into three stages: the nineteenth-century “title-term or title-catchword indexing” of library catalogues; the invention of keyword in context (KWIC) indexing in 1959 by Hans Peter Luhn; and the rise of “instantaneous keyword searching” so familiar to us today (Svenonius, 333; Garret). These debates frequently focus on optimizing access to the vast collections of digital texts on and off the web. At the level of the book, however, the emphasis on generalized controlled vocabularies gives way to the question of the value of highly granular and customized intellectual indexes, produced by professional indexers, which seem to be struggling for survival against the generalized free-text search function offered by eBook interfaces.
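For readers unfamiliar with keyword in context indexing, the following is a minimal Python sketch of the general idea: every occurrence of a search term is listed with a fixed window of surrounding text. It is an illustration only, not Luhn's procedure or the code behind any interface discussed here; the function name `kwic` and the `width` parameter are assumptions introduced for this example.

```python
import re


def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of keyword.

    `width` (an illustrative parameter) is the number of characters of
    context kept on each side of the match.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


if __name__ == "__main__":
    sample = (
        "The index has long been a feature of printed text. "
        "The intellectual index is rare in digital texts."
    )
    for line in kwic(sample, "index"):
        print(line)
```

A listing of this kind is purely mechanical: it can only surface passages in which the search word literally appears, which is precisely the limitation that the controlled-vocabulary side of the debate emphasizes.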
At stake in the transition from the traditional print index to digital modes of retrieving and organizing knowledge is not only the profession of the indexer, but the broader socio-cultural signification and future of the ‘index’. The first generation of eBooks often omitted indexes entirely, even when present in the printed book, or included them as passive page images lacking navigational features. [2] The American Society for Indexing (ASI) is working with the International Digital Publishing Forum (IDPF) to “ensure inclusion of usable indexes in nonfiction digital book formats and e-books” (“Digital Trends Task Force”) in the creation of the specifications for EPUB 3.0. [3] The new workflow required to “output in various formations (e.g. eBook, HTML, PDF for print, etc.)” is part of the challenge for the publishing industry, to which XML seems to be emerging as a leading solution, due to its separation of “function from layout” (“Moving to XML Workflow”). However, the differences between encoding and conventional indexing practices do not figure in considerations to date (MacGlashan).
Semantic encoding is, after all, indexical. It demarcates and hence allows an interface to point to a section of a text; it labels a span of text according to a controlled vocabulary of values that has been devised to elucidate the nature of the text being encoded and conceptually groups that particular span of text with all other spans that have been encoded likewise. The <index> tag in the Text Encoding Initiative tagset is defined as follows: “(index entry) marks a location to be indexed for whatever purpose” (TEI Consortium, “TEI element”). The TEI documentation provides an excellent summary of the tension between what it terms manual indexing and free-text search:
The indexing of scholarly texts is a skilled activity, involving substantial amounts of human judgment and analysis. It should not therefore be assumed that simple searching and information retrieval software will be able to meet all the needs addressed by a well-crafted manual index, although it may complement them for example by providing free text search. The role of an index is to provide access via keywords and phrases which are not necessarily present in the text itself, but must be added by the skill of the indexer. (TEI Consortium, “3 Elements”)
This begs the question further, however, about the indexical function of markup, since most TEI indexing is still “manual” and much scholarly markup that uses the TEI involves tags, particularly named entities, that overlap substantially with a conventional index. Other semantically oriented schemas, such as the bespoke one developed by the Orlando Project to encode feminist literary history, provide numerous tags that overlap considerably with the terrain of the traditional index: a tag like <education> or <relationsWithPublisher> serves to index portions of the text collection that may not contain such keywords or phrases, but which have been identified as relevant to these concepts by the skilled encoders (Brown et al.). There are significant differences between markup and professional or manual indexing, [4] but the extent of overlap between the two is evident in the ways that such markup functions within interfaces which offer the user the ability to look up spans of text marked with those terms. Yet whereas markup is generally considered more or less synonymous with indexing, broadly conceived, particularly where a controlled vocabulary is employed, the intellectual index, whatever its theoretical value, seems in practice to have been deemed as dispensable to online digital humanities projects as it has been to early eBooks.
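The overlap between index entries and entity markup can be made concrete with a small Python sketch. It assumes a TEI fragment in which <index> elements carrying <term> children sit alongside <persName> tags; both element names come from the TEI tagset, but the sample text, the xml:id values, and the function name `collect_pointers` are invented for illustration and do not reproduce the encoding of the collection discussed in this paper.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

TEI_NS = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI_NS}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented sample: the index terms do not occur in the running prose.
SAMPLE = """<div xmlns="http://www.tei-c.org/ns/1.0">
  <p xml:id="p1"><index><term>education</term></index>
    <persName>Jane Doe</persName> trained as a teacher before she began to write.</p>
  <p xml:id="p2"><index><term>publishing</term></index><index><term>education</term></index>
    Her first collection appeared while she was still a student.</p>
</div>"""


def collect_pointers(tei_xml):
    """Map index terms and personal names to the ids of their paragraphs."""
    root = ET.fromstring(tei_xml)
    index_terms = defaultdict(list)
    entities = defaultdict(list)
    for para in root.iter(f"{{{TEI_NS}}}p"):
        pid = para.get(XML_ID)
        # Terms supplied by an indexer; they need not appear in the prose.
        for term in para.findall(".//tei:index/tei:term", NS):
            index_terms[term.text.strip()].append(pid)
        # Named entities tagged directly on spans of the prose.
        for name in para.findall(".//tei:persName", NS):
            entities[name.text.strip()].append(pid)
    return dict(index_terms), dict(entities)


if __name__ == "__main__":
    terms, names = collect_pointers(SAMPLE)
    print(terms)
    print(names)
```

In this sketch both kinds of markup resolve to the same thing, a label pointing at a location in the text, which is why an interface can treat them interchangeably as look-up terms. The difference lies in where the label comes from: the entity tag wraps words already on the page, while the index term is added by human judgment.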
Our survey of a range of online projects, as well as of systematic reviews of features of the digital edition, shows little evidence that the backbone of scholarly print resource navigation, the semantic index, is deemed crucial to digital scholarly resources. [6] This probably has in part to do with cost balanced against perceived benefits. [7] User studies in information science show that the strengths of the manual index are rivaled or even outstripped by automated indexing and information retrieval: “users find them, on balance, more or less equally effective” (Anderson and Pérez-Carballo, 233; cf. Barnum et al.; cf. Fidel, 575). However, the former is far more costly than the latter, particularly as the volume of digitized materials grows. Furthermore, some of the functionality of the back-of-the-book index is covered by standard entity markup, which is frequently combined with controlled-vocabulary markup associated with the particular domain of the resource. As John Walsh has argued of the image vocabulary employed by the Blake Archive, such controlled-vocabulary markup functions very much like an index (Walsh).
General usability studies do not, however, get at the value of the intellectual or subject index to the scholarly or expert user. Indeed, the Bureau of National Affairs, a Virginia, USA publisher of highly specialized news in such areas as law, employment, the environment, and health care, conducted a usability study at law schools comparing text searching and index-aided research, and found that index users had an 86 percent success rate while text searchers had only a 23 percent success rate, particularly for tasks that departed from specific facts (“Using Online Indexes”). This result suggests the power of intellectual indexing and the potential need for such indexing within online scholarly resources.
To evaluate the role of the intellectual index and its relationship to more common forms of indexing, via markup, in digital humanities publishing, we combined the two within a new version of the Dynamic Table of Contexts (DToC) interface to publish an online edition of Canadian Women Writers: Connecting Texts and Generations, edited by Marie Carrière and Patricia Demers, in partnership with the University of Alberta Press and in conjunction with their print publication. The online version incorporates the same extensive intellectual index as the printed version, along with semantic markup for named entities and other recurrent features of the text, some of which (such as named entities) overlap with index terms. To accommodate the index, the interface has been revised to incorporate a specialized panel for the index terms that operates as an alternative to the panel for tags; users can use the two panels to see both types of term embedded together in the table of contents. It incorporates free-text search, coinciding with the ASI proposal to EPub that user interfaces combine search and index functionality (Wright et al., “DTTF proposal to EPub,” 2, 7), and draws on the utility of the concordance in the provision of KWIC-like snippets for index terms and tags as well as keyword searches.
This paper describes the markup strategy used to encode the collection, including the index, in TEI; summarizes the team’s revisions to the interface; and demonstrates the function of the intellectual index within the interface in relation to the table of contents, the markup of named entities, and other features of the text. The Dynamic Table of Contexts Browser retains the traditional content of the intellectual index prepared for the back of the print edition, but dramatically reorients its location in the digital interface by placing it at the “front” or within the persistent navigation features of the reading interface. We will present some preliminary results of combining markup with an intellectual index by reporting on an initial user study undertaken with scholars from the Canadian Writing Research Collaboratory community from which the collection emerged, and suggest future directions for investigating the function of the intellectual index within digital scholarly editions.
References
Notes
1. “Controlled vocabulary terms are variously called index terms, descriptors, subject headings and, somewhat erroneously and increasingly ambiguously, keywords. Terms not belonging to a controlled vocabulary are called free-text terms, natural language terms and, again, keywords. Free-text is also used as an adjective to describe a type of searching, viz., searching that can be performed without the constraint of having to translate one’s own vocabulary into the vocabulary used by a particular system” (Svenonius, 332).
2. Blogger John Lamb notes that when the Amazon Kindle was released in 2007 it did not support an index. When publishers created an eBook for it they “often excluded the index, even when it existed in the print version” (Lamb). In November 2011, Amazon released a new search tool to replace the index, called the Kindle X-Ray: “a new search and information feature that allows you to find information about characters, events, and topics in books” (Wright, “Amazon,” 11).
3. The DTTF’s “proposal moved forward quickly and an Indexes Charter document was published for a vote. The IDPF approved the formation of the EPUB 3.0 Indexes Working Group in December 2011 […] When completed it will be added as a modular update” (“Digital Trends Task Force”).
4. These include the plurality of the encoders involved and the fact that the indexical terms were developed in advance and applied to multiple texts (Butler et al.). The plurality of the encoders also suggests an aspect of the potential of digital systems for novel forms of indexing that is beyond the scope of this paper: the crowdsourcing of keywords, and perhaps even relations among them, as a basis for a folksonomic navigational aid.
5. It is telling that one of the more extensive discussions of how to encode indexes in the TEI documentation relates to the encoding of “existing ‘pre-electronic’ documents” (Burnard). A section called “Why Markup is Important” from A Prospectus for Electronic Historical Editions, a methodological framework compiled by the Steering Committee of the Model Editions Partnership, indicates the extent to which early thinking about TEI considered indexing an intrinsic function of markup (“Why Markup is Important”).
6. The Rossetti Archive and Orlando, for example, provide quite sophisticated search capabilities including Boolean functions or ones that draw on sub-elements or attributes in the markup. However, digital humanities projects seldom include anything resembling the intellectual index with its carefully organized hierarchy of terms systematically synthesizing the contents of the book. The exemplary edition of Vincent Van Gogh’s letters project offers a case in point. Whereas the 6-volume hardback edition includes a “full index,” the online edition provides access to the letters by period, correspondent, place, or other features that could easily be flagged via structural and entity markup (Jansen et al.). Portions of the scholarly apparatus are available via a table of contents or through the same searches (Van Vliet and Kets-Vree).
7. It is noteworthy that even in the embattled context of print scholarship, the intellectual index seems to be losing ground: indexes are often absent from scholarly collections of essays altogether, or limited to named entity listings.