Reliable Citation as a Foundation for Preservable Web-Based Digital Humanities Projects

July 17, 2013, 15:30 | Centennial Room, Nebraska Union

Preservation of digital humanities (DH) projects is an emerging problem. As languages, frameworks, platforms, libraries, and databases evolve, the effort required to maintain meaningful access to existing projects is a challenge that competes with efforts to produce new work. Creators of DH projects need to be able to continue to iterate and innovate but also demonstrate good stewardship of digital materials.

The field recognizes this problem. Beginning with the 2004 Sustaining Digital Scholarship (SDS) final report and continuing through projects such as Memento, SiteStory, and TAPAS, researchers have sought ways to enable sustainable, long-term stewardship of certain project components, but these efforts do not provide a comprehensive platform for the full DH project. All of them treat a project as a set of static pages or resources that seldom change. None would be sufficient for hosting or describing ongoing projects such as the World Shakespeare Bibliography Online.

Many new DH projects build their web presence with open source platforms such as WordPress, Drupal, or Omeka, extending them through customizations. The long-term sustainability of such projects is an issue, involving the cost of hosting as well as the migration of customizations as platforms and languages evolve. SalahEldeen and Nelson show evidence that after only a year, an average of eleven percent of cited on-line resources are lost. It is not surprising, then, that none of the common systems used to build DH projects allows reliable citation, since the same platforms used in DH host a share of the resources lost each year. It is surprising that the scholarly community does not see reliable citation as a greater problem. Reliable citation is necessary for long-term preservation, which points to the need for a temporal content and data management system that allows reliable citation of scholarly narratives and resources.

We might base the value of a scholarly work on its place in the larger scholarly conversation. If it is not part of the conversation, then it has little effect on the field and thus has little value. Scholarly work can only be part of the conversation if it can be referenced. Referencing is well developed for traditional publications (e.g., citing a particular edition of a printed work), but remains a problem for web-based scholarly work, not because particular pages cannot be addressed, but because the information presented as part of a page is not stable. Without being able to reference a particular version of the page, scholars cannot make reliable arguments about the work. What a reader sees might differ from what the author saw when researching and writing the work that references the web-based project. These problems increase when referencing a dynamic, algorithmic project.

We see three fundamental requirements for such a system to enable reliable citation of scholarly works and resources: temporal citation, reproducible citation, and sustainable citation. Any platform meeting these requirements should be able to provide level four preservation as described in the SDS final report (11).

A temporal citation of a web-based scholarly work must be able to address the view within the context of the project’s history. Not only must a scholar be able to point a reader to a particular resource in a project, but the scholar must be able to point to a particular resource at a particular date and time.

A reproducible citation of a web-based scholarly work must show the same content over time. Fetching the cited resource year after year should show no significant changes in the scholarly content of the resource.

A sustainable citation of a web-based scholarly work allows a scholar to cite a project and know that their readers will be able to see the same information they saw by following the citation, for as long as their citation exists. Sustainability is a social issue as much as a technical one. We are not trying to address the social issues involved in sustainability in this poster.

The poster consists of diagrams and text explaining how reliable citation works with respect to resource versioning and project timelines. In addition, a demonstration of a temporal content and data management system hosted at http://alpha.ookook.net/ providing reliable citation will accompany the poster so that attendees can interact with the system and see the platform affordances in action. The demonstration will also provide an opportunity for attendees to interact with the developer.

Reliable citation does not require any unique data model or software architecture. The poster outlines both the data model and the architecture as they are developed in the demonstration software, principally by segmenting a project’s history into discrete editions that aggregate changes to the project.

Discussion will quickly muddy if we don’t establish some nomenclature for dates and times. A resource date and time is the date and time for which the resource should be rendered. For example, if I specify a resource date and time of noon on January 1st, 2012, then I expect to see a rendering of the resource as it appeared at noon on January 1st, 2012. A request date and time is the date and time at which the request is made, even if the request is for a resource with a resource date and time different from the request date and time.
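
The distinction between the two dates can be made concrete with a minimal Python sketch. The names `effective_resource_time`, `request_at`, and `resource_at` are hypothetical, and falling back to the request date and time when no resource date and time is given is an assumption made for illustration, not a documented behavior of the demonstration software.

```python
from datetime import datetime
from typing import Optional


def effective_resource_time(
    request_at: datetime, resource_at: Optional[datetime] = None
) -> datetime:
    """Return the date and time for which a resource should be rendered.

    request_at  -- when the request is made (the request date and time)
    resource_at -- the moment the reader wants to see (the resource date
                   and time); assumed here to default to request_at
    """
    return resource_at if resource_at is not None else request_at


# A reader asking in July 2013 for the project as it stood at noon on
# January 1st, 2012: the two dates differ, and rendering follows resource_at.
asked = datetime(2013, 7, 17, 15, 30)
wanted = datetime(2012, 1, 1, 12, 0)
render_for = effective_resource_time(asked, wanted)
```

Under this sketch a citation would carry the resource date and time, so every later request, whenever it is made, resolves to the same moment in the project's history.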

The data model partitions the project into two classes of objects: editioned objects, such as a project or theme, and versioned objects, such as pages or stylesheets. Editions are published for a span of time during which no public changes are made to the pages. Any changes made to a page require the creation of a new page version which will be aggregated with other versions when a new edition is created and published. Only one project edition is active for a resource date and time. By tracking the time spans for which an edition is active, we can reproduce the project as it existed at a particular date and time.
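
One way to picture this data model is the following Python sketch. It is not the demonstration software's implementation: the class and method names (`Edition`, `Project`, `publish`, `edition_at`, `render`) are hypothetical, and it assumes that an edition stays active from its publication until the next edition is published and that unchanged pages carry forward into the new edition.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class Edition:
    published_at: datetime
    pages: Dict[str, str]  # page name -> content of the version in this edition


@dataclass
class Project:
    editions: List[Edition] = field(default_factory=list)  # oldest first

    def publish(self, published_at: datetime, changed_pages: Dict[str, str]) -> None:
        """Aggregate new page versions into a new edition."""
        if self.editions and published_at <= self.editions[-1].published_at:
            raise ValueError("editions must be published in chronological order")
        previous = self.editions[-1].pages if self.editions else {}
        # Unchanged pages carry forward; changed pages get their new version.
        self.editions.append(Edition(published_at, {**previous, **changed_pages}))

    def edition_at(self, resource_at: datetime) -> Optional[Edition]:
        """Return the single edition active at the resource date and time."""
        starts = [e.published_at for e in self.editions]
        index = bisect_right(starts, resource_at)
        return self.editions[index - 1] if index else None

    def render(self, page: str, resource_at: datetime) -> Optional[str]:
        edition = self.edition_at(resource_at)
        return edition.pages.get(page) if edition else None


project = Project()
project.publish(datetime(2012, 1, 1), {"index": "first text"})
project.publish(datetime(2013, 1, 1), {"index": "revised text", "about": "new page"})
```

In this sketch, asking for `index` with a resource date and time in mid-2012 returns the first text even after the second edition has been published, because only the edition active at that moment is consulted.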

The demonstration software separates information into two editioned resources: Projects (web sites) and Themes (collections of style information). Editions of projects and themes are independent of each other, with each resource managing its own history.

References between different editioned resources are done by naming the referenced resource as well as the referenced resource date and time. This allows a project to select a theme in a reproducible fashion. References to a versioned resource within an editioned resource (e.g., a page within a project) may reference the page without referencing a particular version. The appropriate version will be retrieved based on the edition selected by the resource date and time of the request.
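
A reference from one editioned resource to another can be sketched in Python as a name paired with a resource date and time. Everything here is hypothetical (the `ThemeReference` class, the `resolve_theme` function, and the in-memory `ThemeHistory` mapping); the sketch only assumes that a theme edition is active from its publication until the next one is published.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional, Tuple

# Hypothetical store: theme name -> editions as (published_at, stylesheet),
# listed oldest first.
ThemeHistory = Dict[str, List[Tuple[datetime, str]]]


@dataclass(frozen=True)
class ThemeReference:
    name: str              # which theme is referenced
    resource_at: datetime  # the moment in the theme's history that is pinned


def resolve_theme(history: ThemeHistory, ref: ThemeReference) -> Optional[str]:
    """Return the stylesheet of the theme edition active at ref.resource_at."""
    active = None
    for published_at, stylesheet in history.get(ref.name, []):
        if published_at <= ref.resource_at:
            active = stylesheet
        else:
            break
    return active


history: ThemeHistory = {
    "plain": [
        (datetime(2012, 1, 1), "body { color: black; }"),
        (datetime(2013, 1, 1), "body { color: navy; }"),
    ]
}

# A project edition that pinned the theme in mid-2012 keeps the black text,
# even though the theme has since published a newer edition.
pinned = ThemeReference("plain", datetime(2012, 6, 1))
stylesheet = resolve_theme(history, pinned)
```

Because the reference pins the theme's history rather than following its latest edition, later theme changes do not alter how an already published project edition renders. A page reference inside a project needs no such pin, since the project edition selected by the request already determines the page version.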

This poster will be of interest to anyone wishing to see how a platform supporting reliable citation might be designed.


SalahEldeen, M. Hany, and M. L. Nelson. Losing My Revolution: How Many Resources Shared on Social Media Have Been Lost? (2012). arXiv:1209.3026.
Sustaining Digital Scholarship Final Report. (2004) http://www2.iath.virginia.edu/sds/SDS_AR_2003.pdf
TEI Archiving Publishing and Access Service (TAPAS) Project. http://www.tapasproject.org/
World Shakespeare Bibliography Online. http://www.worldshakesbib.org/