General Information
The Corpus of Old Polish Dramatic Texts (1772–1939) is an annotated collection of 50 Polish dramatic works (e.g., dramas, comedies, stage scenes). The project was carried out at the Institute of Linguistics of the University of Silesia in Katowice in cooperation with the CLARIN-PL consortium. The corpus currently contains 814,043 segments; a segment is understood here in the same way as by the creators of the National Corpus of Polish [1].
The corpus was conceived as a tool enabling research on old colloquial Polish. The acronym KorTeDa refers to the name Corpus of Old Texts, which the authors intend to develop into a collection of various speech-related texts [2]. The Corpus of Old Polish Dramatic Texts (1772–1939) is the first subcorpus included in KorTeDa; the acronym is also used to refer to the corpus in its current form.
The texts were selected in line with this research goal. KorTeDa therefore includes works that reflect the basic features of colloquial language: expressiveness and interactivity. Selection proceeded in two stages. The first criterion was formal: the corpus includes only original, non-verse Polish texts intended for stage performance, which literary theory therefore assigns to the drama genre. The second stage applied criteria derived directly from the overarching research goal. The included works exhibit the following characteristics:
- they address everyday social topics
- their characters come from various social backgrounds
- communication between characters is informal
- the depicted world is consistent with the realities of the era
- the text is dynamic and rich in dialogue.
All the works included in the corpus were published in print and retain their original spelling. The only exception is a small number of forms containing graphemes no longer in use today (e.g., the long 's'), which were modernized. Making the corpus operational also required minor adjustments to punctuation: ellipsis notation was standardized; stage directions were enclosed in parentheses, so parentheses occurring outside stage directions were replaced with square brackets; and final periods in stage directions were removed.
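The kind of normalization described above can be sketched in Python. This is only an illustration, not the project's actual procedure: the function name `normalize_text`, the choice of three periods as the target ellipsis form, and the sample string in the usage note are assumptions. The handling of parentheses and stage directions is left out.

```python
import re

LONG_S = "\u017f"  # the long 's' grapheme

def normalize_text(text: str) -> str:
    """Modernize the long 's' and unify ellipsis notation (hypothetical rules)."""
    # Replace the long 's' with a regular 's'.
    text = text.replace(LONG_S, "s")
    # Map the single-character ellipsis to three periods (assumed target form).
    text = text.replace("\u2026", "...")
    # Collapse spaced dots (". . .") and runs of more than three dots to "...".
    text = re.sub(r"\.(?: ?\.){2,}", "...", text)
    return text
```

For example, `normalize_text("nie. . . wiem....")` returns `"nie... wiem..."`.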
The dramas were sourced from digital libraries (mainly Polona). We used the oldest digitally available edition of each work, and the year of that edition is recorded in the metadata.
KorTeDa is a monolingual historical corpus of selected full texts that meet specific criteria. For historical corpora, it is difficult to apply the criterion of representativeness understood as a reference to a reality existing outside the corpus [3]; the texts in KorTeDa portray only a fragment of that reality, though as comprehensively as possible. Another important feature of language corpora, balance, is not fully achieved in KorTeDa because the corpus is homogeneous in terms of text type. A step towards chronological balance is the division of the period of interest (1772–1939) into six intervals, each represented by a similar number of texts (7 to 11). The temporal span of the works allows the corpus to be used in diachronic studies: comparing selected linguistic features from the first interval with the same features in the sixth may reveal changes or trends, although the observations will likely be more quantitative than qualitative [4]. KorTeDa is also one more historical corpus [5] that could become part of the National Diachronic Corpus of Polish [6]. Detailed bibliographic data for the sources included in KorTeDa are tabulated in the "Texts in the Corpus" section.
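A quantitative comparison between two intervals usually relies on normalized rather than raw counts, since the intervals need not be equal in size. The sketch below shows one common way to do this, as a frequency per 100,000 segments. The function name, the scaling factor, and all figures are hypothetical and are not KorTeDa data.

```python
def relative_frequency(hits: int, segments: int, per: int = 100_000) -> float:
    """Occurrences of a feature per `per` segments."""
    if segments <= 0:
        raise ValueError("segments must be positive")
    return hits * per / segments

# Hypothetical figures, for illustration only (not real corpus counts).
first_interval = {"hits": 30, "segments": 120_000}
sixth_interval = {"hits": 75, "segments": 150_000}

freq_first = relative_frequency(**first_interval)
freq_sixth = relative_frequency(**sixth_interval)

# A positive difference means the feature is relatively more frequent
# in the later interval.
change = freq_sixth - freq_first
```

With these invented numbers the feature occurs 25 times per 100,000 segments in the first interval and 50 times in the sixth, even though the raw counts (30 and 75) would suggest a larger gap.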
KorTeDa is annotated both morphosyntactically and sociolinguistically. The morphosyntactic annotation is automatic: the MorphoDiTa tagger [7] assigned morphosyntactic properties to each segment. Sociolinguistic parameters were assigned manually to each drama character based on information in the text and then applied automatically to the text using the Inforex tool [8], adapted for dialogical texts. The principles of sociolinguistic annotation are presented in the article "Sociolinguistic Annotation in the Corpus of Old Polish Dramatic Texts (1772–1939)" [9]. KorTeDa users can select utterances marked with sociolinguistic variables, such as the age, gender, and social status of characters, without needing query syntax. Additionally, each utterance segment was assigned the following metadata: title, subtitle, author, abbreviation, place of publication, year of creation, year of publication of the source text, time interval, and source. Some of these also serve as filters for narrowing down the browsed resources. KorTeDa also offers additional parameters for refining search results (stage directions, collective utterances). A detailed description of the corpus search capabilities is available in the "Instruction" section. The contemporary understanding of a language corpus includes the following features: digitality, annotation, appropriate size, and representativeness [4]; all of these characterize the Corpus of Old Polish Dramatic Texts (1772–1939).
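The selection of utterances by sociolinguistic variables described above is done through the KorTeDa interface. Purely to illustrate the idea, the sketch below filters a list of utterance records in Python. The `Utterance` class, its field names, the label values, and the sample records are all hypothetical and do not reflect the corpus's internal format or its actual annotation values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """One character utterance with hypothetical sociolinguistic labels."""
    text: str
    gender: str
    age: str
    social_status: str


def select(utterances, **criteria):
    """Return the utterances whose fields equal every given criterion."""
    return [
        u for u in utterances
        if all(getattr(u, field) == value for field, value in criteria.items())
    ]


# Invented sample records; the texts and label values are placeholders.
sample = [
    Utterance("Utterance 1", gender="female", age="young", social_status="high"),
    Utterance("Utterance 2", gender="male", age="old", social_status="low"),
    Utterance("Utterance 3", gender="female", age="old", social_status="low"),
]

women = select(sample, gender="female")
old_low = select(sample, age="old", social_status="low")
```

Criteria are combined conjunctively, so adding a variable narrows the result in the same way that stacking filters would.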
The specificity of KorTeDa allows this tool to be used not only in linguistic research but also in historical, literary, and theatre studies.
Publications
- Pastuch M., Mitrenga B., Wąsińska K., 2024, Anotacja socjolingwistyczna w Korpusie dawnych polskich tekstów dramatycznych (1772–1939), „LingVaria”, no. 1(37), pp. 151–170. https://doi.org/10.12797/LV.19.2024.37.10
- Pastuch M., Mitrenga B., Wąsińska K., 2024, Przerywnik leksykalny w historycznojęzykowych badaniach socjopragmatycznych (na materiale Korpusu dawnych polskich tekstów dramatycznych (1772–1939)), „Język Polski”, no. 104(1), pp. 93–110. https://doi.org/10.31286/JP.00488
- Ryś A., 2023, Jak frazeologizmy funkcjonują w różnych wariantach i kontekstach? Funkcjonalna analiza materiału historycznojęzykowego na podstawie jednostki jak Boga kocham, „LingVaria”, no. 2(36), pp. 323–337. https://doi.org/10.12797/LV.18.2023.36.21
- Wąsińska K., 2023, Przymiotnik słodki w dawnych i współczesnych połączeniach wyrazowych języka polskiego, „Poradnik Językowy”, no. 10, pp. 32–47. https://doi.org/10.33896/PorJ.2023.10.3
- Pastuch M., 2023, "From the Speaker's Point of View" – Subjectification as Pragmatic-Semantic Language Change, „Studies in Polish Linguistics”, vol. 18, no. 4, pp. 145–167. https://doi.org/10.4467/23005920SPL.23.007.18682
- Mitrenga B., Pastuch M., Wąsińska K., 2021, Możliwości i ograniczenia historycznych badań pragmalingwistycznych, [in:] M. Mączyński, E. Horyń, E. Zmuda (eds.), W kręgu dawnej polszczyzny VII, Wydawnictwo Naukowe Akademii Ignatianum w Krakowie, pp. 163–182. https://wydawnictwowam.pl/sites/default/files/88998_skrot.pdf?srsltid=AfmBOoq1FYqLkdyPC0A0Ty6ubLLMDB9DeYRBgYDaor941tB67lyU6CYc
- Ryś A., 2024, Jak współcześnie badać frazeologię w tekstach dawnych? Możliwości wykorzystania narzędzi cyfrowych w pracy historyka języka, „Forum Lingwistyczne”, no. 12, pp. 1–17. https://doi.org/10.31261/FL.2024.12.2.08
Presentations and Multimedia
Magdalena Pastuch, Barbara Mitrenga, Kinga Wąsińska, Jak badać dawny język potoczny – metody i narzędzia, Komisja Historii Języka PAN, 23 November 2023, online.
Magdalena Pastuch, Barbara Mitrenga, Kinga Wąsińska, KorTeDa – nowe narzędzie do badania dawnej polszczyzny, conference: Dziesięć lat otwartej struktury naukowej Clarin w Polsce, 25–27 September 2023, Wrocław.
online: https://www.youtube.com/watch?v=ZIWjQ6JXmBk
Magdalena Pastuch, Barbara Mitrenga, Kinga Wąsińska, Narzędzia CLARIN-PL w historycznych badaniach socjo- i pragmalingwistycznych, conference: CLARIN-PL w praktyce badawczej, 9 December 2022, Katowice.
Magdalena Pastuch, Barbara Mitrenga, Kinga Wąsińska, Potoczność w dawnych polskich dramatach, CLARIN-PL workshop, 20 November 2020.
online: https://www.youtube.com/watch?v=eV00GPUbkyg
Barbara Mitrenga, Kinga Wąsińska, Dialogiczne teksty dawne – od digitalizacji źródeł historycznych do operatywnego korpusu, conference: Dane i język. Perspektywy badań w cyfrowych edycjach źródeł historycznych, 13–15 November 2024, University of Warsaw, Warsaw.
Team
Author Team – University of Silesia in Katowice
- Magdalena Pastuch – team leader
- Barbara Mitrenga
- Kinga Wąsińska
IT Tools and Technical Support – CLARIN-PL Consortium
- Tomasz Naskręt
- Tomasz Bernaś
- Jonatan Dalasiński
- Jan Wieczorek
- Marcin Oleksy
- Wojciech Rauk
First Annotation (2020–2022)
- Agnieszka Jedziniak-Danowska
- Marta Kasprowska-Jarczyk
- Aleksandra Ryś
- Paulina Tomaszewska
Second Annotation (2023–2024)
- Marta Kasprowska-Jarczyk
Superannotation
- Author team
First Proofreading
- Aleksandra Ryś
- Monika Golysz
- Emilia Cichy
Second Proofreading
- Author team
Auxiliary Work (2020–2022)
- Karolina Lisczyk
Text Selection Consultation
- Dorota Fox
- Magdalena Bąk
OCR Procedure
- Scientific Information Centre and Academic Library (CINiBA)
Conceptual Work
In 2018–2020, the following people took part in the initial work on the concept for selecting texts for the corpus: Beata Duda, Karolina Lisczyk, Joanna Przyklenk, Arkadiusz Pulikowski, Katarzyna Sujkowska, Jacek Tomaszczyk.
How to Cite
We kindly ask KorTeDa users to cite the following publication:
Pastuch M., Mitrenga B., Wąsińska K., 2024, Anotacja socjolingwistyczna w Korpusie dawnych polskich tekstów dramatycznych (1772–1939), „LingVaria”, no. 1(37), pp. 151–170.
https://doi.org/10.12797/LV.19.2024.37.10
Contact
University of Silesia in Katowice
Faculty of Humanities
Institute of Linguistics
4 Uniwersytecka Street, Room A/3.26
40-007 Katowice, Poland