Looking for needles in DH haystacks: efficient querying of complex data

July 16, 2013, 13:00 | Workshop, Ubuntu, Multicultural Center

The rapid development of the discipline (or, more precisely, disciplines) known as Digital Humanities has resulted in ever wider accessibility of digitization methods and, consequently, a steadily growing amount of digitized and interlinked data. However, as in many other disciplines that have followed a similar pattern of development, it turns out that, while the amount of information is growing, the methods for quick, easy and successful retrieval of that information are either not yet established or not yet sufficiently widespread.

In view of the massive amount of available data, the average DH scholar is confronted with the task of finding a needle in a haystack. Seemingly, everything is there: structured, interlinked and ready to be used, and well-known query mechanisms exist and have been used for years in other disciplines. Yet the fundamental questions remain: how best to formulate a particular research question, which method is appropriate to the task at hand, and which user-friendly tool will provide the relevant results in the desired format without too steep a learning curve.

The tutorial will present state-of-the-art methods in querying data, from textual to multimodal, with a focus on use cases commonly found in Digital Humanities, or envisioned for the near future of this expanding field. It will be taught by two specialists in markup languages and corpus linguistics, currently involved in the process of creating a new analysis platform designed to handle large amounts of linguistic data. This is not meant to be a tutorial just for linguists, however: we intend to provide an opportunity to carry over some well-known methods and techniques from linguistic research, where they have been used for years, onto the broader area of Digital Humanities, where queries target not only texts but also non-textual objects, such as binary streams, ontologies, prosopographic databases or GIS data (these latter types of objects will be discussed to the extent to which they can be linked from textual resources).

We shall focus primarily on search in metadata, non-annotated data, and structured annotated data (especially TEI-encoded).
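As a minimal illustration of search in structured annotated data, XPath-style expressions can query a TEI-encoded document even from a general-purpose scripting language. The fragment below is invented for demonstration purposes, and Python's built-in ElementTree module supports only a subset of XPath; a full-featured processor (XQuery, full XPath) would be used in practice:

```python
import xml.etree.ElementTree as ET

# A minimal, invented TEI fragment (not a real corpus document).
tei = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc><titleStmt><title>Sample</title></titleStmt></fileDesc>
  </teiHeader>
  <text><body>
    <p>Looking for <term>needles</term> in haystacks.</p>
  </body></text>
</TEI>"""

ns = {"tei": "http://www.tei-c.org/ns/1.0"}
root = ET.fromstring(tei)

# Metadata search: the document title from the TEI header.
title = root.find(".//tei:titleStmt/tei:title", ns).text

# Structured search: all <term> elements, wherever they occur.
terms = [t.text for t in root.findall(".//tei:term", ns)]

print(title)  # Sample
print(terms)  # ['needles']
```

The same two-step pattern, querying the header for metadata and the text body for annotated content, carries over directly to XQuery over large TEI collections.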

Part of the way to ensure closer cooperation among DH researchers may be to provide them with a common language in which they can formulate questions asked of a variety of datasets in a variety of structures. The tutorial shall investigate to what extent the creation of such a Corpus Query Lingua Franca is a realistic endeavour, what basic elements such a language would have to possess, what kinds of objects it would have to query, and what constraints it would unavoidably have to obey. This is also one of the current foci of the ISO subcommittee that addresses language resource management (TC37 SC4), in which both presenters actively participate.
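One basic element such a lingua franca would almost certainly have to possess is a token-level query over annotation attributes, familiar from existing CQP-style corpus query languages (e.g. [pos="NN.*"] for any noun). The sketch below is purely illustrative and hypothetical, not the actual CQLF syntax: it models annotated tokens as attribute dictionaries and evaluates a single attribute-pattern constraint with a regular expression:

```python
import re

# Hypothetical toy corpus: each token carries annotation attributes.
corpus = [
    {"word": "Looking",   "pos": "VVG"},
    {"word": "for",       "pos": "IN"},
    {"word": "needles",   "pos": "NNS"},
    {"word": "in",        "pos": "IN"},
    {"word": "haystacks", "pos": "NNS"},
]

def match_tokens(tokens, attr, pattern):
    """Return tokens whose attribute fully matches the regex pattern,
    mimicking a CQP-style constraint such as [pos="NN.*"]."""
    rx = re.compile(pattern)
    return [t for t in tokens if rx.fullmatch(t.get(attr, ""))]

# Roughly the equivalent of the query [pos="NN.*"]:
nouns = [t["word"] for t in match_tokens(corpus, "pos", "NN.*")]
print(nouns)  # ['needles', 'haystacks']
```

A real common query language would, of course, also need sequences of such constraints, queries over hierarchical structure, and queries over metadata, which is precisely where the design questions discussed in the tutorial arise.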

Description of target audience and expected number of participants

We expect up to ca. 35 people. The tutorial is intended for a general DH audience: variety is a virtue in this case, because we want to address actual use cases, some of which will surely come from the participants themselves.

Background information

Piotr Bański is an Assistant Professor of linguistics at the Institute of English Studies of the University of Warsaw, and a researcher at the Institut für Deutsche Sprache in Mannheim, where he is the Project Manager of the “Corpus Analysis Platform of the Next Generation” (KorAP), a project financed by the Leibniz Association (Leibniz-Gemeinschaft). He served as an elected member of the TEI Technical Council for the 2011-2012 term and since 2010 has been involved in the work of the ISO TC37 SC4 committee for Language Resource Management. His latest project within the scope of ISO is work on the Corpus Query Lingua Franca, within TC37 SC4 Working Group 6, convened by Andreas Witt. His current interests focus mostly on text encoding and the creation and use of robust language resources.

After graduating from Bielefeld University in 1996, Andreas Witt stayed on there as a researcher and instructor in Text Technology. He was heavily involved in establishing the Magister and BA programmes in Text Technology at Bielefeld University in 1999 and 2002, respectively. After completing his Ph.D. in 2002, he became an assistant lecturer with the Text Technology group in Bielefeld. In 2006 he moved to Tübingen University, where he worked on a project on the “Sustainability of Linguistic Resources” and on projects on the interoperability of language data. Since 2009 he has been a senior researcher at the Institut für Deutsche Sprache (Institute for the German Language) in Mannheim. Andreas is a member of numerous research organizations, including the TEI Special Interest Group “TEI for Linguists”. His major research interests concern the use and limitations of markup languages for the linguistic description of language data.


Main issues addressed by the tutorial:

  • What should a text query system for DH in the 21st century look like?
  • What kinds of queries should a query system be able to deal with?
  • How should a modern query language be defined?
  • How should a text corpus be structured in the future?

List of topics (some may receive only cursory attention; much depends on the composition of the audience and on demand)

  • Digital text
  • Annotation of text
  • Annotation formats (HTML, TEI, others)
  • Text corpora
  • Corpora of written language
  • Corpora of spoken language
  • Aligned corpora
  • Trees, graphs, feature structures
  • Web as a Corpus
  • Characters and character encoding
  • Metadata
  • Simple search
  • Search with regular expressions
  • Search in XML data (XQuery, XPath)
  • Complex Annotations
  • Multilevel annotations
  • Relations between annotations
  • Existing corpus query systems
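
To make the contrast between two of the topics above concrete, simple (substring) search and search with regular expressions differ in what a single query can express. The example below is invented for illustration; it shows one regex query covering spelling variants that a simple search would need several passes to find:

```python
import re

# Invented sample of unannotated text with spelling variation.
text = "The corpus contains digitised texts; digitization varies by source."

# Simple search: an exact substring, one variant at a time.
found = "digitised" in text
print(found)  # True

# Regex search: both British and American variants in one query.
hits = re.findall(r"digiti[sz](?:ed|ation)", text)
print(hits)  # ['digitised', 'digitization']
```

Regex search is often the first step up from simple search; the structured and multi-level annotation queries listed above require correspondingly richer query languages.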