TXM Platform for analysis of TEI encoded textual sources

July 19, 2013, 10:30 | Long Paper, Embassy Regents D

Textometry is a methodology for analyzing text corpora that combines qualitative and quantitative techniques (KWIC concordances, word frequency lists, collocations, factorial analysis, etc.) and produces valuable results for various fields of the humanities (linguistics, literary studies, history, geography, etc.).

The first generation of textometric software operated mainly on “raw text” with limited metadata and structural markup. In recent years, a great number of digital resources with complex markup have been created. These can include multiple languages, variant readings and other forms of critical apparatus, annotations such as word lemmas or parts of speech, syntactic structures, etc. As a general markup environment, the TEI Guidelines provide a common framework for encoding all kinds of textual resources, although this framework allows great flexibility, and the same information can sometimes be encoded in many different ways. Interpreting these data correctly outside the context of their original project is a challenge for the software and for the researcher, but it is also an opportunity to make textometric analysis deeper and more precise.
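
For instance, a word lemma can be carried directly by an attribute on the token, or factored out into an interpretation element that the token points to; a generic import tool has to reconcile both. The following fragment is an illustrative sketch, not taken from any particular project:

  <!-- lemma encoded as an attribute on the token -->
  <w lemma="be">is</w>

  <!-- lemma encoded as an interpretation referenced by @ana -->
  <w ana="#lem_be">is</w>
  <interp xml:id="lem_be" type="lemma">be</interp>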

A new generation of open-source textometric software called TXM was initiated by the Textométrie research project [1], funded by the French ANR agency (2007-2010), bringing together previous textometric techniques and state-of-the-art text encoding and corpus-building technologies: Unicode, XML, TEI, NLP (Heiden, 2010; Heiden et al., 2010). The TXM platform can be downloaded for free, with its sources, at http://sf.net/projects/txm. This article presents the design and the current state of the import environment developed since then to allow the platform to analyze various kinds of TEI encoded sources.

The TXM platform addresses the challenge of importing TEI encoded corpora by “translating” the source document structure into the terms (or objects) relevant for textometric analysis. The main objects are: “text units” (defining the limits of each text in a corpus), “text metadata” (associating texts with their properties), “lexical units” (how word forms are delimited), “word properties” (how to obtain their lemma or morpho-syntactic description, if available), “text divisions” (book parts, sections, paragraphs...), primary and secondary “text surface” (the main language of the text, so that NLP tools run on the right tokens, and possible secondary languages such as foreign quotations or section titles supplied by the editor of a historical source text), “out-of-text” (parts not to be considered as belonging to the source text: critical apparatus, encoding comments, etc.), “pagination” (to build an edition of the texts), etc.
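
The fragment below is a minimal, hypothetical TEI source showing where several of these objects can surface in the markup; the element and attribute choices are illustrative only, since actual projects encode the same objects in different ways:

  <TEI xmlns="http://www.tei-c.org/ns/1.0">
    <teiHeader>
      <!-- source of "text metadata": author, date, genre, etc. -->
    </teiHeader>
    <text xml:lang="en"><!-- a "text unit"; @xml:lang gives the primary "text surface" -->
      <body>
        <div type="chapter"><!-- a "text division" -->
          <pb n="12"/><!-- "pagination" -->
          <p>Main text with a <foreign xml:lang="la">verbum</foreign> quotation.
             <!-- <foreign> opens a secondary "text surface" --></p>
          <note type="editorial">an encoding comment</note><!-- "out-of-text" -->
        </div>
      </body>
    </text>
  </TEI>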

For each type of source corpus, one has to define precisely how the textometric objects are encoded in the TEI sources and how to extract them, so that the corresponding objects can be expressed in a specially designed XML-TXM pivot format before being instantiated inside the platform. The XML-TXM format specializes in analytic data categories, in a way similar to the “TEI Analytics” format of the MONK project (Pytlik Zillig, 2009), but it is richer in data categories and is a formal TEI extension.
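
To give an idea of the pivot representation, a tokenized word carrying its NLP annotations could look like the following sketch; the element names and the txm namespace URI are simplified here and do not reproduce the exact XML-TXM schema:

  <tei:w xml:id="w_42" xmlns:tei="http://www.tei-c.org/ns/1.0"
         xmlns:txm="http://textometrie.org/1.0">
    <txm:form>sings</txm:form><!-- the surface form: a "lexical unit" -->
    <txm:ana type="#pos" resp="#nlp">VBZ</txm:ana><!-- "word properties" -->
    <txm:ana type="#lemma" resp="#nlp">sing</txm:ana>
  </tei:w>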

The extraction process is implemented by a combination of specific XSLT stylesheets, XPath expressions and Groovy script parameters [2].
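
As an illustration of the stylesheet layer, the following sketch (not one of the platform's shipped stylesheets) combines the classic XSLT identity transform with a single XPath-driven rule that filters “out-of-text” editorial notes during extraction:

  <xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:tei="http://www.tei-c.org/ns/1.0">
    <!-- treat editorial notes as "out-of-text": drop them -->
    <xsl:template match="tei:note[@type='editorial']"/>
    <!-- copy every other attribute and node unchanged -->
    <xsl:template match="@*|node()">
      <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
    </xsl:template>
  </xsl:stylesheet>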

We will describe how this approach has been validated on a comprehensive set of completely unrelated TEI encoded corpora: the “Bibliothèques Virtuelles Humanistes” corpus (BVH, a collection of 16th century books: http://www.bvh.univ-tours.fr); the corpus of Flaubert’s 19th century novel “Bouvard et Pécuchet” (http://dossiers-flaubert.ish-lyon.cnrs.fr); a corpus of five years of issues of the “DISCOURS” linguistics journal (http://discours.revues.org/?lang=en); and the TEI version of the one million word Brown corpus from the NLTK project (http://nltk.org).

The TXM TEI import environment and its XML-TXM pivot format have proven flexible enough to process various data sources efficiently. In further developments, we will define a complete ODD description of the XML-TXM format, to document it better for the TEI community and to contribute to the discussion on the ability of software tools to analyze TEI encoded data.

References

Heiden, S. (2010). The TXM Platform: Building Open-Source Textual Analysis Software Compatible with the TEI Encoding Scheme. In K. Ishikawa and R. Otoguro (eds.), Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation. Institute for Digital Enhancement of Cognitive Development, Waseda University, 4-7 November 2010, pp. 389-398.
Heiden, S., J.-P. Magué, and B. Pincemin (2010). TXM : Une plateforme logicielle open-source pour la textométrie – conception et développement [TXM: An open-source software platform for textometry – design and development]. In Bolasco, S., et al. (eds.), Statistical Analysis of Textual Data – Proceedings of the 10th International Conference JADT.
Pytlik Zillig, B. L. (2009). TEI Analytics: converting documents into a TEI format for cross-collection text analysis. Literary and Linguistic Computing 24(2): 187-192. doi:10.1093/llc/fqp005.

Notes

1. http://textometrie.ens-lyon.fr/?lang=en

2. The TXM import environment is implemented by several dynamic scripts written in the Groovy programming language. The whole software environment is directly accessible to the user for modification and adaptation: platform sources, import scripts, XSLT stylesheets, etc.