Surrogacy and Image Error: Transformations in the Value of Digitized Books

July 19, 2013, 13:30 | Long Paper, Embassy Regents D

The large-scale digitization of books is generating extraordinary collections of visual and textual surrogates, whose preservation is premised partly upon expected transformations in teaching and scholarship in the humanities. Questions have been and continue to be raised about the quality and usefulness of digital surrogates produced by third-party vendors and deposited in digital repositories for preservation and access (Cohen 2010). If the surrogacy of published materials that serve as primary sources is to find wide acceptance within humanities scholarship, then those who build and manage preservation repositories must be able to make claims about, validate the quality of, and certify the fitness for use of these preserved digital surrogates. Understanding the relationship between digital surrogacy and the presence or absence of evidence regarding digitization processes is thus a substantial challenge for scholars and preservation archivists alike.

The purpose of the paper is to synthesize and extend the findings and implications of a major ongoing research project at the University of Michigan School of Information. The research explores the relationship between quality (or its absence in the form of unacceptable error) and usefulness of digitized books at scale. The HathiTrust Digital Library is the test bed for the project. HathiTrust is an international repository collaborative that is preserving and providing access to the output of large-scale book digitization projects, including those by Google, the Internet Archive, and a host of localized digitization programs (York 2010). The research reported here is designed to produce some foundation of statistical truth, accompanied by a transparent methodology, so that follow-up user validation studies can explore how digital image error impacts the acceptance of digital surrogates for scholarly inquiry and the management of physical collections in libraries.

This research into the quality and usefulness of large-scale digitization is built on a synthesis of scholarship in multiple fields that typically do not intersect: information quality (Knight 2008), digital image analysis (Lin 2006), relevance clues (Saracevic 2007), and humanities scholars’ use of digital collections (Henry 2010). The research is grounded in a model of error that specifies the gap between the digitization ideal, represented by digitization best practices and standards, and the realities of repositories’ acceptance of digitized content produced by third parties. The design of the research (Conway 2011), data gathering and analysis procedures (Conway and Bronicki 2012), and summary findings (Conway 2012) are reported separately. This work was supported by the US Institute of Museum and Library Services [grant number LG-06-10-0144-10] and The Andrew W. Mellon Foundation.

The emphasis of the research is on the visual representation of books as digitally bound bitmap sequences, derived from sometimes deeply flawed source volumes and produced through a complex set of manual scanning processes and automated post-scan image processing procedures. The transformation of published books to digital code and algorithm is mitigated by the terms of digitization technologies. “The aesthetic transformations that make digital objects so eloquent are themselves always subject to the functional constraints imposed by the material variables of computation. Understood at this level, digital surrogates are just as ‘real’ (and tangible) as their analog counterparts.” (Eaves 2003, 164) The relationship between source and digital surrogate conforms to the “law of contact” proposed by Taussig (1993): “things which have once been in contact with each other continue to act on each other at a distance after the physical contact has been severed.” (52-53) Significantly, digital surrogates produced through high-volume digitization carry with them traces of the terms of their creation. Such traces may inevitably affect the trust that is essential to the acceptance of digital surrogates as sources of scholarship. “If we cannot trust our means of reproduction of images of texts, can we trust the readings from them? How do scholars acknowledge the quality of digitized images of texts?” (Terras 2011, 1)

This paper is an explicit effort to foster a conversation on the impact of digital imaging at scale on humanities scholarship by marshaling empirical research data on digitization error to characterize the strengths and limitations of digital surrogacy. The implications for the use of surrogates are derived from data gathered from four 1,000-volume random samples of digital surrogates covering the full range of source volumes digitized by Google and the Internet Archive from more than 20 research libraries. Proportional and systematic sampling of page-images within each volume in the samples produced a study set of over 350,000 page images, which have been evaluated visually by highly trained coders working in two university libraries in the United States. Using a web-enabled database system, coders assigned error severity scores for eleven page-level errors and five book-level errors specified in the carefully tested model.
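The within-volume sampling step can be illustrated with a minimal sketch of generic proportional, systematic sampling. The abstract does not state the sampling fraction or the exact selection procedure, so the function name `systematic_page_sample`, the `fraction` parameter, and the 10% value in the usage lines are assumptions made for illustration only, not the project's documented method.

```python
import random


def systematic_page_sample(page_count, fraction, rng):
    """Pick 1-based page numbers from one volume by systematic sampling.

    Hypothetical helper: the sample size is proportional to the length of
    the volume, and pages are taken at a fixed interval from a random start.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if page_count <= 0:
        return []
    # Proportional step: longer volumes contribute more pages (at least one).
    sample_size = max(1, round(page_count * fraction))
    # Systematic step: a fixed interval with a random starting offset.
    interval = page_count / sample_size
    start = rng.random() * interval
    return [
        min(page_count, int(start + i * interval) + 1)
        for i in range(sample_size)
    ]


# Usage with an assumed 10% fraction on a 300-page volume.
rng = random.Random(0)
pages = systematic_page_sample(300, 0.10, rng)
print(len(pages), pages[:5])
```

Because the interval is fixed, the selected pages are spread evenly from the front matter to the back of the volume, which is the usual reason for preferring systematic over simple random selection when errors may cluster in particular parts of a book.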

Statistical analysis of the datasets produces a stark portrait of the visual properties of book surrogates in a 10-million-volume collection: the nearly one third of all volumes that exhibit a low level of text-oriented degradation coexist with volumes in which severe error cascades through interrelated digitization processes. Minor error that does not limit the readability of digitized text might be accepted as a part of the price of enhanced access. Only a minority of the volumes in HathiTrust are error-free at very low levels of severity. The four most common errors (thick text, broken text, warped pages, and obscured content) are easily and reliably detectable and so common as to be part of the fabric of digital surrogacy. With the exception of Asian language text digitized by Google, near-fatal errors are distributed largely at random and in very small proportions in the corpus of HathiTrust volumes digitized by Google and the Internet Archive. Extremely severe error, however, compromises the integrity of large-scale digitization and threatens the long-term trustworthiness of repositories that preserve digital surrogates. The findings from one aspect of a multi-faceted investigation into the quality of the digital surrogates suggest that the imperfection of digital surrogates is a transparent and nearly ubiquitous attribute, one that reflects the flaws of the source and introduces new complexity in preservation repositories.

Additional details about the project, including its metrics and progress reports, may be found on the project’s website: http://hathitrust-quality.projects.si.umich.edu/ For information on HathiTrust Digital Library, see: http://www.hathitrust.org


Cohen, D. (2010). Is Google Good for History? Dan Cohen’s Digital Humanities Blog. Posting on 12 Jan. 2010. http://www.dancohen.org/2010/01/07/is-google-good-for-history/ (accessed 8 March 2013).
Conway, P. (2011). Archival Quality and Long-term Preservation: A Research Framework for Validating the Usefulness of Digital Surrogates, Archival Science, 11, 3. Open access online.
Conway, P. (2012). Validating Quality in Large-Scale Digitization: Selective Findings on the Distribution of Imaging Error. In Proceedings of UNESCO Memory of the World in the Digital Age, September 26-28, 2012, Vancouver, BC Canada.
Conway, P. and J. Bronicki (2012). Error Metrics for Large-Scale Digitization. In Curating Quality: Ensuring Data Quality to Enable New Science: An invitational workshop sponsored by the National Science Foundation, September 10-11, 2012, Arlington, VA USA.
Eaves, M. (2003). Graphicality: Multimedia Fables for ‘Textual’ Critics. In Bergmann-Loizeaux, E., and N. Fraistat (eds.) Reimagining Textuality: Textual Studies in the Late Age of Print, Madison: University of Wisconsin Press, 99-122.
Henry, C. (2010). The Idea of Order: Transforming Research Collections for 21st Century Scholarship. Washington, DC: Council on Library and Information Resources. http://www.clir.org/pubs/abstract/reports/pub147 (accessed 8 March 2013).
Knight, S. (2008). User Perceptions of Information Quality in World Wide Web Information Retrieval Behaviour. Ph.D. thesis, Edith Cowan University.
Lin, X. (2006). Quality Assurance in High Volume Document Digitization: A Survey. In Proceedings of the Second International Conference on Document Image Analysis for Libraries (DIAL’06), 27-28 April, Lyon, France, 319-326.
Saracevic, T. (2007). Relevance: A Review of the Literature and a Framework for Thinking on the Notion in Information Science. Part III: Behavior and Effects of Relevance, Journal of the American Society for Information Science and Technology, 58, 13: 2126-2144.
Taussig, M. (1993). Mimesis and Alterity: A Particular History of the Senses. Routledge, London.
Terras, M. (2011). Artefacts and Errors: Acknowledging Issues of Representation in the Digital Imaging of Ancient Texts. In Fischer, F., Fritze, C. and Vogeler, G. (eds), Kodikologie und Paläographie im digitalen Zeitalter 2 / Codicology and Palaeography in the Digital Age 2. Norderstedt, Germany: Books on Demand, 43-61. http://discovery.ucl.ac.uk/171362/ (accessed 8 March 2013).
York, J. J. (2010). Building a Future by Preserving Our Past: The Preservation Infrastructure of HathiTrust Digital Library. In Proceedings of 76th IFLA General Congress and Assembly, 10-15 August 2010, Gothenburg, Sweden.