Abstracts

TXM Portal: Providing Online Access to Textometric Corpus Analysis

July 17, 2013, 15:30 | Centennial Room, Nebraska Union

This poster presents the TXM portal, a software platform providing online access to textometric corpus analysis. Textometry is a computer-assisted methodology of corpus analysis that combines qualitative and quantitative tools and applies to various fields of the humanities (linguistics, literary studies, geography, philosophy, history, etc.). The methodology was initially developed in France in the 1980s under the name of lexicometry, and a number of software products implementing its analytical tools followed. TXM is a new generation of open-source software, built on a modular basis, that brings together earlier textometric techniques and state-of-the-art text encoding and corpus-building technologies (Unicode, XML, TEI, NLP) (Heiden, 2010; Heiden et al., 2010; Pincemin et al., 2010). The word search engine used by TXM is provided by the Open CWB open-source project (http://cwb.sourceforge.net), and syntactic structures can be queried using the TigerSearch engine, provided that the corpus is syntactically annotated in Tiger XML format (Lezius, 2002). Statistical analyses are performed using the R statistical environment (http://www.r-project.org). Other search engines and libraries can be plugged into the TXM platform as necessary.

The TXM software is available as a desktop application (for Windows, Mac and Linux) and as a web portal application; both share a common “toolbox” for corpus building, querying and statistical analysis. Most of the corpus analysis features are the same in both applications, but the poster will pay special attention to the features that are available only in the portal version.

Corpus import and annotation features are available only in the desktop version. The portal version allows the administrator to upload previously compiled “binary” TXM corpora.

The main feature specific to the TXM portal is the management of user registration and of access rights to the corpora, with the possibility of specifying access conditions for each individual text of a corpus (e.g. a limit on context size in concordances). This is important for copyrighted texts, whose owners may wish to prevent users from copying an entire text or a substantial part of it. User accounts and profiles can be edited by the portal administrator through the web interface. Customized web pages (“home”, “help” and “contact”) can be created for each user profile. The portal interface and the user web pages can be internationalized (the current portal distribution provides English and French interfaces).
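The effect of a per-text limit on concordance context can be illustrated with a minimal Python sketch. It is not the TXM implementation: the function name `concordance` and the parameters `context` and `max_context` are hypothetical, and the sketch only assumes that the portal caps the number of context words requested by the user at a value set for the text.

```python
def concordance(tokens, keyword, context=5, max_context=None):
    """Return keyword-in-context lines as (left, keyword, right) tuples.

    max_context is a hypothetical per-text access limit: when set, the
    requested context size is capped so that a user cannot rebuild a
    protected text from very wide concordance lines.
    """
    if max_context is not None:
        context = min(context, max_context)
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = tokens[max(0, i - context):i]
            right = tokens[i + 1:i + 1 + context]
            lines.append((" ".join(left), token, " ".join(right)))
    return lines


tokens = "the cat sat on the mat and the dog sat too".split()
# The user asks for 10 words of context, but the text allows only 2.
for left, kw, right in concordance(tokens, "sat", context=10, max_context=2):
    print(f"{left:>10} [{kw}] {right}")
```

Capping the window at query time, rather than filtering the results afterwards, keeps the restriction independent of how the concordance is later displayed or exported.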

Another feature available only in the TXM portal version is the creation of subcorpora by selecting texts through a dedicated interface. It allows the user to select texts by various criteria (depending on the metadata available for the corpus), to add or remove texts individually, and to view the size of the subcorpus in number of words or texts.
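The selection logic described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the portal's code: the `Text` class, the metadata keys and the helper names `select_texts` and `dimensions` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Text:
    text_id: str
    n_words: int
    metadata: dict = field(default_factory=dict)


def select_texts(texts, **criteria):
    """Keep the texts whose metadata match every given criterion."""
    return [
        t for t in texts
        if all(t.metadata.get(key) == value for key, value in criteria.items())
    ]


def dimensions(texts):
    """Size of a selection, in number of texts and of words."""
    return {"texts": len(texts), "words": sum(t.n_words for t in texts)}


# Hypothetical corpus with invented metadata values.
corpus = [
    Text("t1", 12000, {"genre": "chronicle", "form": "prose"}),
    Text("t2", 8000, {"genre": "romance", "form": "verse"}),
    Text("t3", 5000, {"genre": "romance", "form": "prose"}),
]

subcorpus = select_texts(corpus, genre="romance")   # selection by criteria
subcorpus.append(corpus[0])                         # add a text individually
subcorpus = [t for t in subcorpus if t.text_id != "t3"]  # remove one
print(dimensions(subcorpus))
```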

The basic tools of textometric analysis are available in all TXM versions. These include creating corpus partitions for contrastive analysis, building indexes and concordances of word or text patterns, and displaying one or several alternative editions of a text (including facsimile images). One can also search for the collocates of a particular word. Statistical analysis tools are available for corpus partitions; these include computing the specificity of word or text patterns and correspondence analysis. The results of corpus queries can be downloaded as CSV tables for further analysis.
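To give an idea of what a specificity score measures, the following Python sketch assumes the hypergeometric model commonly used in textometry: a word's frequency in one part of a partition is compared with what random sampling from the whole corpus would produce. The abstract does not state which formula TXM implements (its computations are run in R), so the function names, the parameters and the signed log10 scale below are illustrative assumptions.

```python
from math import comb, log10


def hypergeom_pmf(k, T, F, t):
    """P(X = k) when t tokens are drawn from a corpus of T tokens
    that contains F occurrences of the word."""
    return comb(F, k) * comb(T - F, t - k) / comb(T, t)


def specificity(k, T, F, t):
    """Signed log10 score (an assumed convention): positive when the
    word is over-represented in the part, negative when it is
    under-represented.

    k: frequency of the word in the part    t: size of the part
    F: frequency of the word in the corpus  T: size of the corpus
    """
    lowest = max(0, t - (T - F))   # smallest possible value of k
    highest = min(F, t)            # largest possible value of k
    if not lowest <= k <= highest:
        raise ValueError("k is not a possible frequency for this part")
    expected = F * t / T
    if k >= expected:
        # Probability of observing k or more occurrences by chance.
        p = sum(hypergeom_pmf(i, T, F, t) for i in range(k, highest + 1))
        return -log10(p)
    # Probability of observing k or fewer occurrences by chance.
    p = sum(hypergeom_pmf(i, T, F, t) for i in range(lowest, k + 1))
    return log10(p)


# Hypothetical counts: a word occurring 20 times in a 1000-token corpus,
# 10 of them inside a 100-token part (2 would be expected by chance).
print(specificity(k=10, T=1000, F=20, t=100))
```

Exact integer binomials keep the sketch dependency-free, but they are only practical for small counts; a real corpus would call for a statistical library working with log-probabilities.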

The TXM portal software is available for free under the GNU GPLv3 license from the SourceForge development site (http://sourceforge.net/projects/txm). A demo TXM portal, where the various tools can be tested on sample corpora, is accessible at http://txm.risc.cnrs.fr/demo. The TXM portal is currently used in production to give a community of 200 medievalists from around the world regular access to the Base de Français Médiéval corpus of Old French (http://txm.bfm-corpus.org/bfm).

References

Pincemin, B., S. Heiden, M.-H. Lay, J.-M. Leblanc, and J.-M. Viprey (2010). Fonctionnalités textométriques: Proposition de typologie selon un point de vue utilisateur. In S. Bolasco et al. (Eds.), Statistical Analysis of Textual Data: Proceedings of the 10th International Conference JADT 2010. Edizioni Universitarie di Lettere Economia Diritto, Rome, 9-11 June 2010.
Heiden, S. (2010). The TXM Platform: Building Open-Source Textual Analysis Software Compatible with the TEI Encoding Scheme. In K. Ishikawa and R. Otoguro (Eds.), 24th Pacific Asia Conference on Language, Information and Computation, 389-398. Institute for Digital Enhancement of Cognitive Development, Waseda University.
Heiden, S., J.-P. Magué, and B. Pincemin (2010). TXM: Une plateforme logicielle open-source pour la textométrie – conception et développement. In S. Bolasco et al. (Eds.), Statistical Analysis of Textual Data: Proceedings of the 10th International Conference JADT 2010.
Heiden, S., and A. Lavrentiev (2012). The TXM Portal Software Giving Access to Old French Manuscripts Online. In Proceedings of the 1st Workshop on Adaptation of Language Resources and Tools for Processing Cultural Heritage Objects, Eighth International Conference on Language Resources and Evaluation, Istanbul, Turkey.
Lezius, W. (2002). TIGERSearch – Ein Suchwerkzeug für Baumbanken. In Proceedings der 6. Konferenz zur Verarbeitung natürlicher Sprache (KONVENS 2002), Saarbrücken. http://konvens2002.dfki.de.