The German Language of the Year 1933. Building a Diachronic Text Corpus for Historical German Language Studies.

July 17, 2013, 10:30 | Short Paper, Embassy Regents E

This paper presents a new research project undertaken within the framework of the AAC–Austrian Academy Corpus, which is operated by the Institute for Corpus Linguistics and Text Technology at the Austrian Academy of Sciences in Vienna. The project aims to develop a diachronic text corpus of historical significance and to establish a corpus-based research environment for language studies of the interwar period, with particular emphasis on 1933, the year the Nazis came to power in Germany. The presentation gives an overview of the necessary methodological considerations and outlines the research perspectives, which are based on the principles of corpus linguistics. Corpus-based approaches to analyzing the language of the periods before, during and after National Socialism have been rare, despite the numerous works in historical studies and in German language studies. A research group within the AAC has therefore decided to start a special investigation into this challenging question.

The AAC is an established German-language text corpus of more than 500 million tokens. It is a large diachronic digital corpus comprising several thousand German-language texts of historical and cultural significance. The core of the AAC dates from the first half of the twentieth century, so an analysis of the German language of 1933 is not only highly appropriate but can also be addressed on a comprehensive basis. A large number of texts from the period in question have already been collected, digitized, converted into machine-readable text, fully annotated and provided with metadata. Structural and thematic mark-up has been applied according to annotation schemes based on XML-related standards. The AAC has been committed to the research field of text technology since its official foundation in 2001.

Building a diachronic digital text corpus of this kind for historical German language studies is a challenging task for two main reasons. First, the technical difficulties of corpus building that arise from a large historical variety of text types and genres have to be taken into consideration. Second, the specific historical parameters and the methodological scope of such an investigation have to be taken into account. The German language of the year 1933 is treated as a historical focal point for which an exemplary corpus-based research methodology for the study of the German language can be developed. The sources of a first exemplary study will cover many domains and genres. Newspapers, political journals and magazines will be at the core, but several other text types representing the communicative strategies of the period will also be included. Among them are pamphlets, flyers, advertisements, radio programs and political speeches, as well as essays, literary texts and administrative, scientific or legal texts, to name just a few examples, all of which are difficult to collect. The AAC has started to build up a small collection of ephemera in this field.

In the overall project special emphasis will be given to the "Dritte Walpurgisnacht" (Third Walpurgis Night), written by the satirist Karl Kraus, which will be taken (in digital format), among other sources, as a starting point for text selection for the corpus. This text is the most important contemporary text of German literature dealing with National Socialism. In the "Third Walpurgis Night" Karl Kraus documented the murderous reality of the Nazi regime as early as May 1933 and recorded and commented upon the murderous language of that time in numerous examples. Because of this text no one can claim not to have been able to know from the very start where Nazi rule would lead. However, Karl Kraus, the editor and author of the journal "Die Fackel" (The Torch), who died in 1936, did not publish the text, which begins with the famous line "Mir fällt zu Hitler nichts ein" (I can think of nothing to say about Hitler), because he considered the deed of the word inappropriate in the face of violence.

The historical period covered by the AAC ranges from the 1848 revolution to the fall of the Iron Curtain in 1989. In this period significant historical changes with remarkable influences on the language and on language use in the German-speaking areas can be observed. The year 1933, together with the years preceding and following the "Machtergreifung" (seizure of power) of the National Socialists, is a historical period of particular interest for language studies. The analysis will not focus primarily on the well-known documents and the overt language of the Nazis; it will also systematically take into account less visible documents and less conspicuous lexical items. This approach promises to be particularly fruitful when methods of corpus linguistics are applied and new strategies for applying them are tested in the context of historical language studies. For this historical period the AAC corpus holdings provide a great number of reliable resources and interesting corpus-based approaches for investigations into the linguistic and textual properties of the texts in question. The digital text will be enriched with additional data, lemmatized and annotated with part-of-speech (POS) tags using the STTS (Stuttgart-Tübingen Tagset). "Quantitative corpus linguistics has proved to be a valuable technique in many domains of philological, sociological and historical research. The digitized and linguistically annotated corpus is therefore an interesting source for studies in many fields and facilitates the investigation of changing patterns of language use, and how these reflect underlying cultural shifts." (M. Volk).
The question is whether corpus research methods based upon a multidisciplinary combination of corpus linguistics, lexicography, historical studies and cultural studies can be applied in order to gain insights into the textual representations of historical collections of this importance.
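
As a rough illustration of how lemmatized and POS-tagged text supports the kind of diachronic comparison described above, the following minimal Python sketch computes the relative frequency of noun lemmas per year. The token list, the tuple layout and the function name are hypothetical and serve only as an example; they do not describe the AAC data format or tools. The tags NN, VVFIN and ART are taken from the STTS.

```python
from collections import Counter, defaultdict

# Hypothetical sample of annotated tokens: (year, word form, lemma, STTS tag).
# The layout is an assumption for illustration, not the AAC annotation format.
TOKENS = [
    (1932, "Zeitungen", "Zeitung", "NN"),
    (1932, "berichten", "berichten", "VVFIN"),
    (1933, "Die", "die", "ART"),
    (1933, "Zeitung", "Zeitung", "NN"),
    (1933, "berichtet", "berichten", "VVFIN"),
    (1933, "Zeitungen", "Zeitung", "NN"),
]


def relative_frequencies(tokens, tag_prefix=""):
    """Return {year: {lemma: share of all tokens of that year}}.

    Only lemmas whose POS tag starts with tag_prefix are counted,
    but the denominator is the total number of tokens in the year.
    """
    counts = defaultdict(Counter)  # year -> lemma -> matching occurrences
    totals = Counter()             # year -> all tokens
    for year, _form, lemma, tag in tokens:
        totals[year] += 1
        if tag.startswith(tag_prefix):
            counts[year][lemma] += 1
    return {
        year: {lemma: n / totals[year] for lemma, n in lemma_counts.items()}
        for year, lemma_counts in counts.items()
    }


# Relative frequency of common-noun lemmas (STTS tag NN) per year.
freqs = relative_frequencies(TOKENS, tag_prefix="NN")
for year in sorted(freqs):
    print(year, freqs[year])
```

Counting lemmas rather than word forms groups inflected variants such as "Zeitung" and "Zeitungen" together, and normalizing by the yearly token total keeps years with different amounts of text comparable.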

The AAC research group will go beyond a quantitative approach and integrate text studies into its research on the German language of 1933. The methodology of corpus-based text research is determined by corpus-linguistic, lexicographic and analytical procedures. The historical condition of Germany and Austria, with their cultural and linguistic diversity, and in particular the situation at the time of National Socialism have to be taken into consideration as historical changes with significant influences on the language. In contrast to other corpus-oriented projects, the working group proceeds from the premises of literary studies and text lexicography. Corpus research and the creation of large electronic text collections have traditionally been the domain of corpus linguists. Literary digitization initiatives were quite often restricted to particular writers, and many of these projects neither produced large amounts of data nor pursued research on methods for tackling the problems involved in working with such data. Our perspectives parallel those formulated in the European project CLARIN, which has been set up to "create, coordinate and make language resources and technology available and readily usable" (Call for Proposal), in this case also for text historians and for those interested in ideologically determined language change.

The AAC has already developed methods and tools that allow scholars to access these texts and comparable resources. The access tools enable researchers to enter queries and to view the results together with the necessary metadata, such as information on sources, authors and dates of publication. The pages containing the results are displayed as digital text, with access to the XML source of the texts and the option to define custom style sheets. A further feature offers simultaneous access to the text and to facsimiles of the pages. In addition, a sophisticated navigation tool will be provided, offering random access to the various documents, journals, books and other sources for linguistic, literary or historical research. Access to the text corpus will thus be given not only through query result lists but also through a structuring tool that allows readers to navigate to any desired part of the corpus. The server delivers query results in XML format, which makes it fairly easy to adjust the presentation of the output, for example with XSLT style sheets. Such style sheet transformations can also be used to create statistical analyses of the data. It has been pointed out before (Smith, 2008) that available standard tools provide only limited support for processing query results. Building a diachronic digital text corpus for historical German language studies of this kind is therefore a challenging task that also demands the development of new tools and new approaches in text technology. Such a research environment would be especially useful for corpus-based analyses of the language of critical historical periods, such as the German language of the year 1933.
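
To illustrate how query results delivered as XML can feed a simple statistical analysis, the following Python sketch counts hits per year. It uses the Python standard library in place of an XSLT style sheet, and the result document, including the element and attribute names `results`, `hit`, `source` and `date`, is a hypothetical format assumed for this example only; the actual AAC result schema is not described here.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical query result. Element names, attribute names and values are
# placeholders for illustration and do not reflect the real AAC output.
RESULT_XML = """
<results query="example">
  <hit source="Source A" date="1932-11-05">... example ...</hit>
  <hit source="Source A" date="1933-03-12">... example ...</hit>
  <hit source="Source B" date="1933-05-01">... example ...</hit>
</results>
"""


def hits_per_year(xml_text):
    """Count the <hit> elements of a result document by publication year."""
    root = ET.fromstring(xml_text.strip())
    years = Counter()
    for hit in root.iter("hit"):
        # The year is assumed to be the first four characters of the date.
        years[hit.get("date", "")[:4]] += 1
    return dict(years)


def hits_per_source(xml_text):
    """Count the <hit> elements of a result document by source."""
    root = ET.fromstring(xml_text.strip())
    return dict(Counter(hit.get("source") for hit in root.iter("hit")))


print(hits_per_year(RESULT_XML))
print(hits_per_source(RESULT_XML))
```

The same grouping could be expressed in an XSLT style sheet; the point of the sketch is only that metadata carried on each hit makes such aggregations straightforward.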


AAC–Austrian Academy Corpus: AAC-FACKEL. Online Version: «Die Fackel. Herausgeber: Karl Kraus, Wien 1899-1936». In Biber, H., Breiteneder, E., Kabas, H., and Mörth, K. AAC Digital Edition No 1. http://www.aac.ac.at/fackel
Biber, H. (2010). Aufbruch der Phrase zur Tat. Kommunikationsmaßnahmen und sprachliche Formungen der nationalsozialistischen Machtübernahme in Österreich 1938. In Welzig, W., Biber, H., and Resch, C. Anschluss. März/April 1938 in Österreich. Wien: Verlag der Österreichischen Akademie der Wissenschaften, 15-37.
Bubenhofer, N. et al. (2007). XML-Technologien als Grundlage dynamischer Textpräsentation. Die digitale Quellenedition Der Zürcher Sommer 1968. Jahrbuch für Computerphilologie 9: 89-110.
Mörth, K. (2000). The representation of literary texts by means of XML: some experiences of doing markup in historical magazines. In Fraser, M., Williamson, N., and Deegan, M. (eds), Digital Evidence. 2002. Selected papers from DRH 2000, Digital Resources for the Humanities Conference. Office for Humanities Communication 14, 17-32.
Roth, T. (2009). Verteilte Korpusabfragesysteme. Proceedings in Language and Text Corpus. Design and Linguistic. Corpus Analysis.
Smith, N. et al. (2008). Corpus Tools and Methods. Today and Tomorrow: Incorporating Linguists' Manual Annotations. LLC 23: 163-180.
Volk, M. et al. (2010). Challenges in building a multilingual alpine heritage corpus. In LREC Proceedings.