Abstracts

Introduction to the TXM content analysis platform

July 16, 2013, 08:00 | Workshop, Regency B, Union

The objective of the “Introduction to TXM” tutorial is to introduce participants to the methodology of textometric content analysis (http://textometrie.ens-lyon.fr/?lang=en) by working with the TXM software directly on their own laptop computers. By the end of the tutorial, participants will be able to import their own textual corpora (Unicode-encoded raw texts or XML-tagged texts) into TXM and to analyze them with the available content analysis tools: word pattern frequency lists, KWIC concordances and text browsing, a rich full-text search engine syntax (for expressing sequences of word forms, parts of speech and lemma combinations, constrained by XML structures), statistically specific sub-corpus vocabulary analysis, statistical collocation analysis, etc.
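
To give a concrete idea of two of the tools listed above, a word frequency list and a KWIC (keyword in context) concordance can be sketched in a few lines of Python. This is only an illustrative sketch, not TXM code: the function names (`tokenize`, `frequency_list`, `kwic`), the naive regular-expression tokenizer and the sample sentence are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Naive tokenizer (assumption): lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())


def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()


def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows


# Hypothetical two-sentence sample corpus
sample = "The cat sat on the mat. The dog sat on the log."
tokens = tokenize(sample)

for left, key, right in kwic(tokens, "sat", width=2):
    # Right-align the left context so the keyword forms a column
    print(f"{left:>10} [{key}] {right}")
```

A real platform additionally works on lemmas and part-of-speech tags rather than on raw lowercase forms, which this sketch does not attempt.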

During the tutorial, each participant will install TXM (from http://sourceforge.net/projects/txm) and the TreeTagger lemmatizer (http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger) on her Windows, Mac or Linux laptop and will leave the tutorial with a ready-to-use environment.

The tutorial will also introduce participants to the TXM community ecosystem (users' mailing list and wiki, bug reports, etc.) and to the TXM portal server software (see, for example, http://portal.textometrie.org/demo) for online corpus distribution and analysis. Time permitting, TEI encoding aspects of corpora related to TXM may also be introduced, as well as the encoding and analysis of speech transcriptions or parallel corpora.

The tutorial will be taught in English for the first time at DH2013 (the TXM graphical user interface is already available in English), and will complement two accepted communications introducing the TXM platform that will be given during the conference:

  • “TXM Platform for analysis of TEI encoded textual sources”, #391, long paper;
  • “TXM Portal: Providing Online Access to Textometric Corpus Analysis”, #399, poster with live demo.

Tutorial Instructor

Serge Heiden

Project manager of the TXM platform development (http://textometrie.ens-lyon.fr/spip.php?article9). S. Heiden develops the textometric content analysis methodology by building tools able to process richly encoded corpora. Working on the relation between analysis tools and XML-TEI encoded corpora, he is involved in the activities of the TEI Consortium as convener of the TEI Tools SIG (http://www.tei-c.org/Activities/SIG/Tools).

Target audience and expected number of participants

The ideal number of participants is about 12 to 15; the maximum is about 20.

Each participant should come with her own laptop computer. The tutorial needs to run for at least a full day (*): typically half a day for the fundamentals of the TXM tools, and half a day for the fundamentals of the main corpus formats (TXT and XML) and the procedures for importing them into the platform.

(*) The regular TXM tutorials run for two days (one day of TXM introduction, one day of corpus formatting and import into TXM).

Brief Outline

  • 9am – 12pm
      • Install & introduction: 45'
          • TXM user interface & windows, corpus Description command
      • Main tools: 2h15
          • Lexicon analysis & spreadsheet export
          • Index building for distributional semantics & Corpus Query Language syntax
          • Concordance & Reading, Progression graphics
          • Partitions, Subcorpus & Specificity/Factorial analysis
          • Cooccurrence analysis
          • TXM portal demo (optional)
          • TXM community: mailing lists, web sites and documentation
  • 1pm – 5pm
      • TXM import strategy and main corpus formats (TXT-Unicode+CSV, XML+CSV, XML-TEI): 1/2h
      • TXT-Unicode sample corpus and TXT+CSV import into TXM, sample analysis: 1h15
      • Introduction to XML and to TXT2XML conversion tools: 1/2h
      • XML sample corpus and XML/w+CSV import into TXM, sample analysis: 1h45