Coding Media History: A Digital Suite for Opening Access, Building Tools, and Analyzing Texts

July 17, 2013, 10:30 | Short Paper, Embassy Regents F


This short paper seeks to introduce the Digital Humanities community to three ongoing, interrelated projects: the Media History Digital Library (an open access digital resource), Lantern (a search tool), and Coding Media History (a text mining research project). Together, these three projects aim to use digital technology to transform the field of Film and Media Studies, a discipline that has lagged behind English and History in the creation of high-impact DH work.

I would also like to suggest that the three interrelated projects demonstrate a productive model for scaffolding work in the Digital Humanities. The three sides of this work (enabling access, building tools, and analyzing texts) support and enhance one another. In the space below, I will briefly address all three projects and suggest the ways they enrich one another.

In terms of enabling access, the Media History Digital Library (www.mediahistoryproject.org) has digitized over 500,000 pages of out-of-copyright periodicals relating to the histories of film, broadcasting, and recorded sound. Prior to the launch of the MHDL, scholars wrote the histories of film and television through page-by-page microfilm readings of key periodicals, such as Moving Picture World and Photoplay. By scanning these publications along with previously unavailable materials, the MHDL enables scholars to conduct research more efficiently, ask new questions, and write new histories.

The MHDL’s collections are open access and built on a collaborative model. David Pierce and I lead the project, and we work closely with collectors, who loan materials, and sponsors, who pay for the scanning. The scanning is carried out by the Internet Archive (www.archive.org), which also hosts and preserves the digital files. By using the Internet Archive as a scanning vendor and provider of backend infrastructure, the MHDL follows in the tradition of other collaboratively built digital collections, including the Biodiversity Heritage Library (http://www.biodiversitylibrary.org/), Medical Heritage Library (http://www.medicalheritage.org/), International Children’s Digital Library (http://en.childrenslibrary.org/), and International Music Score Library Project (http://imslp.org/).

Film and media educators at institutions around the world are already incorporating the Media History Digital Library into their teaching. In one especially creative assignment, Elizabeth Clarke is having her students at Wilfrid Laurier University in Waterloo, Ontario read the MHDL’s digital editions of early cinema magazines and imagine they are the intended audience of motion picture exhibitors in the 1910s. Students are asked to design their own programs of short films and live entertainment based on what they discover inside the magazines.

The MHDL’s diverse user base encompasses students, educators, expert researchers, and casual classic movie fans. In order to better serve all of these groups, I have been leading the development of Lantern, a software tool that is a co-production of the Media History Digital Library and UW-Madison’s Department of Communication Arts. Lantern offers users the ability to perform full-text searches across the Media History Digital Library’s entire corpus. Eventually, we also hope to equip Lantern with functionality beyond search, such as topic modeling and network visualizations.

My team and I are developing Lantern using Ruby on Rails, Python, XML, and CSS, and by customizing three open source technologies: Apache’s Solr search engine; the University of Virginia Library’s Blacklight interface; and the Internet Archive’s BookReader. We are currently indexing more materials into Lantern, overhauling its graphic interface, and enhancing its speed and functionality. We anticipate publicly launching Lantern in Summer 2013. In the meantime, you may view a work-in-progress demo at http://lantern-demo.commarts.wisc.edu/
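As a rough illustration of what a full-text search against a Solr index involves, the following Python sketch builds a request URL for Solr's standard /select handler. It is not Lantern's actual code: the base URL, the core name "lantern", and the field names "body" and "year" are all hypothetical, chosen only for the example.

```python
from urllib.parse import urlencode


def build_solr_query_url(base_url, phrase, year_range=None, rows=10):
    """Return a Solr /select URL for a full-text phrase search.

    The field names "body" and "year" are hypothetical; a real index
    would use whatever fields its schema defines.
    """
    # Escape backslashes and double quotes so the phrase stays one quoted term.
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    params = [
        ("q", f'body:"{escaped}"'),  # phrase query on the assumed full-text field
        ("rows", str(rows)),         # number of results to return
        ("wt", "json"),              # ask Solr for a JSON response
    ]
    if year_range is not None:
        start, end = year_range
        # Filter query restricting results to an inclusive range of years.
        params.append(("fq", f"year:[{start} TO {end}]"))
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical local Solr core; nothing is fetched, the URL is only constructed.
url = build_solr_query_url(
    "http://localhost:8983/solr/lantern", "block booking", year_range=(1925, 1935)
)
print(url)
```

Keeping the date restriction in a filter query (fq) rather than in the main query leaves relevance ranking to depend on the search phrase alone.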

The third project I want to address is a work-in-progress called “Coding Media History: Computational Analysis of the Hollywood Trade Press.” Despite the heavy reliance of film and television scholars on Variety and other industry trade papers, there has been little work that reflexively examines these sources. My research project, Coding Media History, uses computer analytics both to enrich our understanding of these key sources and to destabilize the notion that we can conceive of 60 years of Variety as a singular “text.” In pursuit of these goals, I borrow from the text mining methods (and warnings) of Stephen Ramsay and, especially, from Andrew J. Torget, Rada Mihalcea, Jon Christensen, and Geoff McGhee’s work on applying topic modeling and text mining to historical newspapers.

I have begun working with a research assistant, who is marking up the XML of the digitized publications. We will soon be able to start asking research questions of the marked-up corpus. In a 1905 issue of Variety, for instance, what percentage of the pages was dedicated to vaudeville compared to motion pictures? How were these page allocations different in 1915, 1925, 1935, 1945, and 1955? When were radio and television introduced as their own sections? How did the buyers and amounts of advertising change over time? These are questions that I can answer by starting with the digitized magazines, adding a research assistant’s tags, and finally running my own algorithms over the marked-up corpus.
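To make the page-allocation question concrete, here is a minimal Python sketch of how section shares might be computed once pages carry section tags. The XML schema (an "issue" element containing "page" elements with a "section" attribute) is an assumption made for illustration, and the sample issue is invented data, not a transcription of any real issue of Variety.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample markup; the element and attribute names are hypothetical.
SAMPLE_ISSUE = """
<issue title="Variety" year="1905">
  <page n="1" section="vaudeville"/>
  <page n="2" section="vaudeville"/>
  <page n="3" section="vaudeville"/>
  <page n="4" section="motion-pictures"/>
</issue>
"""


def section_shares(issue_xml):
    """Return the percentage of pages assigned to each section of one issue."""
    root = ET.fromstring(issue_xml.strip())
    # Count pages per section; pages without a tag are grouped as "untagged".
    counts = Counter(page.get("section", "untagged") for page in root.iter("page"))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {section: 100.0 * n / total for section, n in counts.items()}


shares = section_shares(SAMPLE_ISSUE)
for section, percent in sorted(shares.items()):
    print(f"{section}: {percent:.1f}%")
```

Running the same function over issues from different years would give the kind of decade-by-decade comparison described above, provided the tagging scheme is applied consistently across issues.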

One of the questions I am exploring is the extent to which the various trade papers were truly similar to or different from one another. The Hollywood trade papers have an infamous reputation for publishing the exact same studio press releases. By using open source plagiarism software, we can test whether this reputation is warranted. The answers to these questions hold real stakes. Consider the case of Motion Picture Herald, a trade paper that claimed to represent the interests of independent movie theatre owners. What does it mean if we discover that Motion Picture Herald published 40% of the same content as the trade papers that spoke to producers and the major studios? Can we truly think of Motion Picture Herald as representing the independent theatre owners’ interests?
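One simple way to estimate shared content between two papers is to compare overlapping word sequences ("shingles"). The Python sketch below is a stand-in for that general idea, not the plagiarism software used in the project, which is not named here. The function names, the shingle length, and the sample sentences are all assumptions made for illustration.

```python
import re


def shingles(text, n=5):
    """Return the set of n-word sequences in a text, ignoring case and punctuation."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def containment(text_a, text_b, n=5):
    """Fraction of text_a's n-word shingles that also appear in text_b."""
    a, b = shingles(text_a, n), shingles(text_b, n)
    if not a:
        return 0.0  # text_a is too short to form any shingle
    return len(a & b) / len(a)


# Invented sentences standing in for a press release and a trade paper story.
release = "The studio announced a new picture today"
story = "Yesterday the studio announced a new picture today in Hollywood"
score = containment(release, story, n=3)
print(f"{score:.0%} of the release's 3-word sequences reappear in the story")
```

Containment is asymmetric by design: it asks how much of one text reappears in another, which fits the question of how much of a given paper's content was also printed elsewhere. OCR errors in scanned pages would lower the scores, so any threshold would need to be calibrated against known reprints.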

In conclusion, the Media History Digital Library, Lantern, and Coding Media History are already making a positive intervention in the field of Film and Media Studies. I also hope that this suite of interrelated projects can serve as a useful model for scholars in other fields pursuing Digital Humanities projects. In the course of my work, I’ve found that being involved across a suite of activities (digitization, tool building, and text analysis) leads to better decision-making at every stage in the process. Although it’s unrealistic to expect that we’ll all become hybrid librarian-programmer-scholars, we need to better understand the integrated range of activities in order for the Digital Humanities to tackle bold new projects and break free of our tokenized comfort zones.


Hoyt, E., W. Hagenmaier, and C. Hagenmaier (2013). “Media + History + Digital + Library: An Experiment in Synthesis.” Journal of E-Media Studies 3: forthcoming.
Ramsay, S. (2011). Reading Machines: Toward an Algorithmic Criticism. Urbana: University of Illinois Press.
Torget, A. J., et al. (2011). Mapping Texts: Combining Text-Mining and Geo-Visualization To Unlock The Research Potential of Historical Newspapers. UNT Digital Library. http://digital.library.unt.edu/ark:/67531/metadc83797/. (accessed March 9, 2013).
Yang, T.-I., A. J. Torget, and R. Mihalcea. (2011). “Topic Modeling on Historical Newspapers.” Proceedings of the Association for Computational Linguistics workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (June): 96-104. http://aclweb.org/anthology/W/W11/W11-15.pdf. (accessed March 9, 2013).