Lexomics: Integrating the research and teaching spaces
July 18, 2013, 10:30 | Short Paper, Embassy Regents E
Integrating research and teaching is exciting, time intensive, and a prescription for energizing faculty and students. We present outcomes of a six-year effort in multidisciplinary collaboration centered on the digital humanities as experienced in our teaching and research. Rooted in a set of “connected” courses between English and Computer Science (LeBlanc et al., 2010) and three summers of NEH-funded research, our Lexomics Research Group has developed a modest set of web-based applications for scholars of digitized texts. We report here on the iterative development of the open-source toolset, how scholars both inside and outside our group have used these tools to make significant discoveries, and, perhaps most important, how our research and teaching collaborations introduce a spirit of experimentation to the digital humanities.
Our current website is both a repository for our toolset and an evangelistic platform and teaching resource: http://lexomics.wheatoncollege.edu. We continue to develop online tools for three independent but logically connected functions that lead scholars through the steps needed to perform hierarchical cluster analyses of texts and/or sections of texts. At this point, our cluster analysis tools are more narrowly focused than other toolsets (cf. Voyant Tools and the data-intensive flow execution environment of Meandre). Our scrubber tool (PHP, CSS) accepts texts in multiple formats (.txt, .html, .docx) and handles preprocessing steps including stripping tags, removing stop words, and applying lemma lists. A second tool, diviText (ExtJS, PHP), accepts the output from scrubber, cuts texts into “chunks” in one of three ways (fixed-size chunks, a specific number of chunks, and/or manually selected locations between words for chunk breaks), computes word counts within each chunk, and allows users to merge chunks. The ability to merge chunks has proved valuable for generating “virtual manuscripts”, that is, joining sections from different manuscripts. A third tool, treeView (PHP, R), accepts output from diviText, performs a number of variants of hierarchical cluster analysis, and returns a dendrogram plot in .pdf or phyloXML format.
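The chunk, count, and cluster steps described above can be illustrated with a minimal Python sketch. It is not the implementation behind diviText or treeView (which are written in ExtJS, PHP, and R); the function names are hypothetical, and the use of relative frequencies, Euclidean distance, and average linkage is an assumption standing in for whichever clustering variant a scholar might select.

```python
from collections import Counter
from itertools import combinations
import math


def cut_fixed_size(words, size):
    """Cut a word list into consecutive chunks of `size` words (the last may be shorter)."""
    return [words[i:i + size] for i in range(0, len(words), size)]


def relative_frequencies(chunk):
    """Map each word in a chunk to its share of the chunk's total word count."""
    counts = Counter(chunk)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def euclidean(p, q):
    """Euclidean distance between two sparse frequency profiles."""
    vocab = set(p) | set(q)
    return math.sqrt(sum((p.get(w, 0.0) - q.get(w, 0.0)) ** 2 for w in vocab))


def cluster(profiles):
    """Average-linkage agglomerative clustering (assumed variant).

    Returns the tree as nested tuples of chunk indices.
    """
    # Each cluster is a pair: (subtree, list of member chunk indices).
    clusters = [(i, [i]) for i in range(len(profiles))]

    def linkage(a, b):
        # Mean pairwise distance between the members of two clusters.
        dists = [euclidean(profiles[i], profiles[j]) for i in a[1] for j in b[1]]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        # Merge the two closest clusters and record the merge in the tree.
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters = [c for c in clusters if c is not a and c is not b]
        clusters.append(((a[0], b[0]), a[1] + b[1]))
    return clusters[0][0]


# Toy input: two chunks share a vocabulary, the third does not.
words = "king sword king sword king sword sword king sea wave sea wave".split()
chunks = cut_fixed_size(words, 4)
profiles = [relative_frequencies(c) for c in chunks]
print(cluster(profiles))  # (2, (0, 1))
```

In the toy output, chunks 0 and 1 join first and chunk 2 attaches last as a lone leaf, which is the shape a dendrogram takes when one section's vocabulary differs from the rest of the text.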
Based on feedback from scholars who are using our tools, the website now provides video and written tutorials to help new users get started. These tutorials have been especially valuable for introducing the tools to our undergraduates. In the spirit of evangelizing, our website offers a series of “best practices” videos, discussions, and step-by-step diagrams that shed light on how textual analysis at this level of detail can lead to rich new questions. The instructional videos include “The Story of Daniel”, a discussion of one of our initial successes with the tools, in which we showed that lexomic methods can accurately characterize the structure and relationships of texts that are already known, for example, identifying Genesis B within the Old English Genesis and the section of Daniel that is paralleled in Azarias (Drout et al., 2011). Other videos include “How to Read a Dendrogram”, “How to Create a Dendrogram”, “How to Read a Ribbon Diagram”, “Lexomics for Comparison”, and “Lexomics for Source Detection”. A much longer video, “Editions and Manuscripts”, addresses the challenges of choosing between different kinds of editions that may exist for a text that is found in multiple forms.
We have made what we think are significant discoveries in a number of areas, including Beowulf, the poems of Cynewulf, Anglo-Saxon prose, a few Old Norse sagas, and Modern English texts including the Harlem Renaissance play Mule Bone (by Zora Neale Hurston and Langston Hughes). Lexomics is both an excellent first step to augment traditional scholarship and a rich source of deep analysis.
For example, previous lexomic analysis of several Old English poems suggests that there is a connection between dendrograms with an isolated, single leaf and poems that have an external source for one subsection of the poem different from the source or sources of the main body of the poem. We find in the dendrogram of Daniel a single-leaf clade corresponding to lines 299–455 of the poem. This section includes parts of Daniel that have external Latin sources that are different from the source of the rest of the poem (the Latin Bible). Similarly, in the Anglo-Saxon poem Christ III, a single-leaf clade that represents lines 1350–1510 has its source in Sermon 57 of Cæsarius of Arles (lines 1379–1498), and a single-leaf clade in Genesis A (lines 1079–1256) is associated with the genealogical lists from Adam to Noah that give the lineages of both Cain and Seth (lines 1055–1252), material that, for at least some of its content, must have a source different from the biblical text. These relationships were already known to scholars, but our investigation of the Old English poem Guthlac A resolved a century-long critical controversy by demonstrating that a key section of this poem (when demons drag Guthlac to the mouth of hell) has a different proximate source than the rest of the poem and that Guthlac A therefore must have been composed after a separately circulating text similar in content to Vercelli Homily 23 (Downey et al., 2012).
The toolset, instructional materials, and publications are obvious deliverables from our efforts. Yet we submit that our collaborative experiences with faculty and undergraduate students are even more exciting and provide a significant use case showing how scholarship in the humanities is evolving from the stereotypical solitary scholar to a paradigm of community, collaboration, and experimentation (cf. Unsworth, 1997). In our recent NEH- and locally funded summer experience, humanities faculty in particular were pleasantly surprised by the intellectual environment that emerged. We got a glimpse of what it must have been like to work at a place like Bell Labs when they were making daily discoveries. This kind of collaborative, fast-moving research is unfortunately largely unknown in the humanities.
So how do we continue our own momentum and replicate a spirit of experimentation for others? Earhart (2010) rightly notes that “digital projects remain rare, often the product of tenacious participants rather than a supportive academic environment” (emphasis added). We submit that faculty (not administrators or technologists in the library) are the prime drivers and that change must begin with our syllabi. Robust working relationships in the lab are strongest after students have already applied new modes of thinking in the classroom; hence the importance of exposing undergraduate humanities students to computational thinking: problem decomposition, algorithmic thinking, and the successes and failures of experimentation. And we need not overplay the lab metaphor. Our image of the digital humanities lab need not include beakers and soapstone benches; rather, the “new lab” is a room filled with scholars from multiple disciplines and a whiteboard.
Even if we had discovered nothing during our past summers in the lab, the intellectual thrill of the research group would have been a major accomplishment that these students (and we faculty) will never forget. But in fact we made discoveries, so many that there were days when participating faculty got none of their own work done because we were so busy bouncing from student to student seeing what they had found. Most critically, the experience continues to shape the way we share our disciplines with new cohorts of students. The solitary scholar still has a role, to be sure, but that role is no longer sufficient for the multidisciplinary demands of, and rewards to be gained from, collaborations in the digital humanities: from our teaching, to our research, and back again.
lexomics: The term was originally coined by Betsey Dexter Dyer and first appeared in Genome Technology (2002). Since then “lexomics” has appeared on the internet and in some publications without attribution. Some of these appearances could be independent inventions of the term.