Published online: 30 Dec 2021 Pages: 319 - 329
Abstract
This article presents the current results of an ongoing study of the possibilities of fine-tuning automatic morphosyntactic and semantic annotation by improving the underlying formal grammar and ontology, using one Tibetan text as an example. The ultimate purpose of the work at this stage was to improve linguistic software developed for natural-language processing and understanding in order to achieve complete annotation of a specific text and a state of the formal model in which all linguistic phenomena observed in the text are explained. This purpose includes the following tasks: analysing error cases in the annotation of the text from the corpus; eliminating these errors in automatic annotation; and developing the formal grammar and updating the dictionaries. Along with the morphosyntactic analysis, the current approach involves simultaneous semantic analysis. The article describes the semantic annotation of the corpus, required by grammar revision and development, which was made with the use of a computer ontology. The work is carried out on one of the corpus texts, the grammatical poetic treatise Sum-cu-pa (7th century).
Published online: 30 Dec 2021 Pages: 330 - 341
Abstract
The paper presents a discussion of homonymy of Czech nouns with different or varying genders. The lemmas with this type of homonymy are treated in the new release of the MorfFlex dictionary as separate. We show that the separation of paradigms according to the gender is not only superfluous, but also clumsy, because it forces a choice when making one is not necessary. That is why we call this type of homonymy “artificial”.
Published online: 30 Dec 2021 Pages: 342 - 352
Abstract
Inspired by earlier work on typological profiling of English by Benedikt Szmrecsányi and Bernd Kortmann ([1], [2], [3]), this paper investigates the typological profiles of English, Spanish, German, and Slovak, applying Szmrecsányi and Kortmann’s methodology of calculating a SYNTHETICITY INDEX and an ANALYTICITY INDEX based on 1,000-word corpus samples. The results show that Szmrecsányi and Kortmann’s methodology is replicable, and confirm claims in the literature about degrees of analyticity and syntheticity of these languages. Instead of a simple analytic-synthetic continuum, Szmrecsányi and Kortmann’s “typological space” [3] is used to visualize results, showing that languages can be both synthetic and analytic to varying degrees.
Published online: 30 Dec 2021 Pages: 353 - 370
Abstract
The paper deals with the acquisition of Slovak word order in written texts of students of Slovak as a foreign language. Its attention is focused on identifying the correct and incorrect placement of enclitic components, and their erroneous usage is analysed with respect to different investigated variables (types of enclitic components, types of syntactic construction, distance from lexical/syntactic anchor, and realization in pre- or post-verbal position). The paper also pays attention to the error rate regarding individual proficiency levels of students, and error distribution in two language groups, Slavic and Non-Slavic learners, is compared.
Published online: 30 Dec 2021 Pages: 371 - 382
Abstract
We present results of an automatic comparison of valency frames of interlinked adjectival and verbal lexical units based on the valency lexicons NomVallex and VALLEx. We distinguish nine derivational types of deverbal adjectives and examine whether they tend to display systemic or non-systemic valency behavior. The non-systemic valency behavior includes changes in the number of valency complementations and, more dominantly, non-systemic forms of actants, especially a prepositional group.
Published online: 30 Dec 2021 Pages: 383 - 393
Abstract
The paper presents work in progress on the compilation and automatic annotation of a dataset comprising examples of stative verbs in parallel Bulgarian-Russian corpora with the goal of facilitating the elaboration of a classification of stative verbs in the two languages based on their lexical and semantic properties. We extract stative verbs from the Bulgarian and the Russian WordNets with their assigned conceptual information (frames) from FrameNet. We then assign the set of probable Bulgarian and Russian stative verbs to the verb instances in a parallel Bulgarian-Russian corpus using WordNet correspondences to filter out unlikely stative candidates. Further, manual inspection will ensure high quality of the resource and its application for the purposes of semantic analysis.
Published online: 30 Dec 2021 Pages: 394 - 404
Abstract
The paper attempts to identify the usage and productivity of five different international suffixes in Slovak by means of corpus evidence. The analysis focuses on real and potential productivity in a two-stage comparison: 1) tokens/lemmas occurring in a general balanced corpus vs general corpus of specialised and academic texts, 2) general corpus of specialised and academic texts vs specialised (sub)corpora of medical, legal, economic and religious texts. The aim of the analysis is to explore whether productivity varies across registers by means of statistical measures.
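The abstract does not name the statistical measures it uses. As an illustration only, the sketch below assumes Baayen-style measures: realized productivity as the number of distinct types carrying a suffix, and potential productivity as the ratio of hapax legomena to tokens with that suffix. The function name and the sample word list are hypothetical.

```python
from collections import Counter


def suffix_productivity(tokens, suffix):
    """Return (types, tokens, hapaxes, potential_productivity) for one suffix.

    Assumption: words are matched by a plain string suffix test, which is a
    crude stand-in for proper morphological analysis.
    """
    counts = Counter(t.lower() for t in tokens if t.lower().endswith(suffix))
    n_tokens = sum(counts.values())          # all occurrences with the suffix
    n_types = len(counts)                    # realized productivity (distinct words)
    n_hapax = sum(1 for c in counts.values() if c == 1)
    # Potential productivity: share of tokens that are one-off formations.
    potential = n_hapax / n_tokens if n_tokens else 0.0
    return n_types, n_tokens, n_hapax, potential


# Hypothetical sample: three tokens with the suffix, one of them a hapax.
sample = ["realizmus", "realizmus", "turizmus", "dom"]
result = suffix_productivity(sample, "izmus")
```

Running the same function on a general corpus and on a specialised subcorpus would give the two sides of each comparison stage described above.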
Published online: 30 Dec 2021 Pages: 405 - 414
Abstract
It is shown that the mean morpheme length (measured in phonemes) decreases with the increasing length of word types (in morphemes) in Czech texts, i.e., these language units behave according to the Menzerath-Altmann law. The law is not valid in general for word tokens. Some hints towards an interpretation of parameters are presented.
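The Menzerath-Altmann law is commonly written as y(x) = a * x^b * exp(-c * x), where x is the length of the construct (here, word length in morphemes) and y the mean length of its constituents (here, morpheme length in phonemes). The abstract does not say which form was fitted, so the sketch below assumes the frequently used truncated form y = a * x^b and estimates a and b by least squares on log-transformed data. The function name and the data points are hypothetical.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**b by ordinary least squares on (log x, log y).

    Returns (a, b). A negative b means constituents get shorter as the
    construct gets longer, which is what the law predicts.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    b = sxy / sxx                      # slope in log-log space
    a = math.exp(mean_y - b * mean_x)  # intercept transformed back
    return a, b


# Hypothetical data: word length in morphemes vs. mean morpheme length.
word_lengths = [1, 2, 3, 4]
morpheme_lengths = [3.0 * x ** -0.2 for x in word_lengths]
a, b = fit_power_law(word_lengths, morpheme_lengths)
```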
Published online: 30 Dec 2021 Pages: 415 - 424
Abstract
This study aims at tracing a reallocation process of a grammatical feature along the dialect-standard axis with the aid of corpus linguistics methods; more precisely, with an integrative application of quantitative and qualitative approaches. The phenomenon under investigation is articles without the definiteness marker d- in German, usually ascribed to the Bavarian dialect area. Analyses show, however, that this apparently dialectal feature diffuses to other communication settings closer to the intended standard language use. This process is accompanied by a refunctionalisation of reduced article forms, indicating the relevance of language-internal relations for reallocation of grammatical features. The methodical approach should be easily applicable to other variants and – as many European languages show a diaglossic repertoire – relevant to other languages as well.
Published online: 30 Dec 2021 Pages: 425 - 433
Abstract
Complex adverbial prepositions with spatial meaning have not been sufficiently studied so far in Czech. To establish a set of these expressions in their actual usage, the resources of the Czech National Corpus were used in this study. The research has shown that the SYN2020 corpus is a relevant tool for searching for two-word expressions with a LOCATIVE ADVERB – SIMPLE PREPOSITION structure that have the same function as a one-word locative preposition. The article describes a method for the extraction of these expressions from the corpus, as well as a method for the collection of their quantitative data using corpus tools. As a result of the research, a list of expressions that are presumably complex prepositions is provided.
Published online: 30 Dec 2021 Pages: 434 - 443
Abstract
Theories of valency and valency dictionaries are inevitably and understandably based on the valency behavior of frequent verbs. This paper scrutinizes 154 low-frequency Czech verbs and argues that they demonstrate that Czech verbs are more malleable in their valency behavior than suggested by the literature. It is argued that this fits better within a constructionist approach to valency rather than a lexicalist one. Furthermore, the paper illustrates two alternations, previously unrecognized for Czech as semantic diatheses, namely the causative-inchoative alternation and the Agent-Means alternation.
Published online: 30 Dec 2021 Pages: 444 - 453
Abstract
In this paper, we present a preliminary study of three intensifiers (absolutně, naprosto, úplně) based on data from three different corpora: the written corpus SYN2020, the web corpus ONLINE-ARCHIVE, and the spoken corpus ORTOFON 1. Providing a parallel annotation of a random sample of each intensifier, we focus on their functions and meanings in context. We analyse their properties in order to define those features which are relevant to their word class assignment, and to prepare the ground for future disambiguation tasks.
Published online: 30 Dec 2021 Pages: 454 - 464
Abstract
The paper presents a novel and unified morphological description of numerals and pronouns, as compiled for the newest edition of the Prague Dependency Treebank (Prague Dependency Treebank – Consolidated 1.0) and its integral part the morphological dictionary MorfFlex. On the basis of considerable experience with real data annotation and the use of the morphological dictionary, particular changes were proposed. For both of the parts of speech a new set of subtypes was proposed, based mainly on the morphological criterion and its combination with semantic properties and other relevant features, such as definiteness in numerals and possessivity, reflexivity, and clitichood in pronouns. Each subtype has a specific value at the 2nd position of the morphological tag, which serves also as an indicator of the applicability of other tag categories.
Published online: 30 Dec 2021 Pages: 465 - 474
Abstract
This article reports on the quantitative corpus-based investigation into the form-function interplay of the English detached adjectival construction with an explicit subject. Taking Usage-based Construction Grammar as its theoretical framework, this paper investigates the patterns of attraction of lexical items that appear in the main slots of the grammatical construction. The data obtained substantiate the constructional status of the construction and determine its semantic and functional specification in present-day English.
Published online: 30 Dec 2021 Pages: 477 - 487
Abstract
Text readability metrics assess how much effort a reader must put into comprehending a given text. They are, e.g., used to choose appropriate readings for different student proficiency levels, or to make sure that crucial information is efficiently conveyed (e.g., in an emergency). Flesch Reading Ease is such a globally used formula that it is even integrated into the MS Word Processor. However, its constants are language-dependent. The original formula was created for English. So far it has been adapted to several European languages, Bangla, and Hindi. This paper describes the Czech adaptation, with the language-dependent constants optimized by a machine-learning algorithm working on parallel corpora of Czech and English, Russian, Italian, and French, respectively.
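For orientation, the original English formula is FRE = 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words). The sketch below exposes the three constants as parameters, since those are the language-dependent values an adaptation re-estimates. The Czech constants are not given in the abstract and are not reproduced here; the function name and the sample counts are hypothetical.

```python
def flesch_reading_ease(n_words, n_sentences, n_syllables,
                        base=206.835, sentence_weight=1.015,
                        syllable_weight=84.6):
    """Flesch Reading Ease from raw counts.

    The defaults are the constants of the original English formula; a
    language adaptation would pass its own values for the three constants.
    """
    avg_sentence_length = n_words / n_sentences      # words per sentence
    avg_syllables_per_word = n_syllables / n_words   # syllables per word
    return (base
            - sentence_weight * avg_sentence_length
            - syllable_weight * avg_syllables_per_word)


# Hypothetical text: 100 words, 5 sentences, 150 syllables.
score = flesch_reading_ease(100, 5, 150)
```

Taking counts rather than raw text keeps the sketch independent of any particular tokenizer or syllable counter, which would themselves be language-specific.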
Published online: 30 Dec 2021 Pages: 488 - 501
Abstract
This paper presents a synchronic and diachronic computer corpus of Makarska littoral dialects. This corpus was created as part of a project to explore the ikavian neoštokavian dialects of the narrow coastal area in the Croatian region of Dalmatia around the town of Makarska. The dialectological characteristics of the dialects studied are briefly presented first, followed by a presentation of the digital system. The system is logically organized in two parts: the first is a corpus of digitally processed literary texts created from 1729 to 1803, and the second consists of materials collected through dialectological questionnaires prepared and methodologically adapted as part of the creation of the Croatian Linguistic Atlas. The methods of collecting linguistic data, the method of input into digital form, and the methods and possibilities of data processing will be explained. Based on the input and search strategies within the system, the examples will prove the origin of the dialects of the Makarska littoral to be that of the ikavian neoštokavian dialect described in the dialectological literature. This computer-based principle of work is a novelty in Croatian dialectology, which has not been digitally processed so far, and offers a basis for future dialectological research. The platform can be used to shorten the time of data processing and to analyse the data more systematically and more efficiently. So far, there has been no such digital repository for any Croatian speech. This project represents a thorough synchronic and diachronic study of one rounded language area.
Published online: 30 Dec 2021 Pages: 502 - 509
Abstract
A new interactive map-based web application named Mapka was published by the Institute of the Czech National Corpus in 2020. It aims to serve linguists, as well as schools and the general public, and it features various functions described in this paper. Mapka was designed as a supplement to the CNC spoken corpora, starting with the DIALEKT corpus (more to come in the future). Its main function is to display various types of territorial division (primarily in terms of dialect, but also administrative) and networks of localities associated with the corpus. The main dialect regions are provided with overviews of their typical dialectal features and two samples of dialectal discourse – one slightly historical and one contemporary. The application offers the possibility of searching for municipalities, plotting the points on the map and creating a custom map. The paper concludes with future prospects concerning an enhanced and improved version of the application.
Published online: 30 Dec 2021 Pages: 510 - 519
Abstract
In this paper, we would like to provide a brief overview of the current state of pronunciation teaching in e-learning and demonstrate a new approach to building tools for automatic feedback concerning correct pronunciation based on the most frequent or typical errors in speech production made by non-native speakers. We will illustrate this in the process of designing annotation for a sound recognition tool to provide feedback on pronunciation. At the end of the paper, we will also present how we have tried to apply this annotation to the tool, what caveats we have found and what our plans are.
Published online: 30 Dec 2021 Pages: 520 - 530
Abstract
ORATOR v2 is a new 1.5M word corpus of Czech monologues, delivered to a live audience in semi-formal to formal settings. It was designed to chart the space of naturally occurring monologues which can be obtained for corpus processing. As such, it aims for diversity but does not attempt any balancing of subcategories, recognizing that some types of data are inherently easier to obtain in high volume than others. The transcription guidelines and annotation tools employed are the same as other recent spoken corpora published by the CNC, which facilitates interesting comparisons between various types of spoken Czech. The present paper sketches out three case studies, comparing ORATOR to the informal conversations of ORTOFON v2 in terms of the frequencies of demonstratives and hesitations, as well as lexical richness.
Published online: 30 Dec 2021 Pages: 531 - 544
Abstract
This paper presents a specialized corpus tool GramatiKat in the context of Open Science principles, namely data sharing, which offers opportunities for original research and facilitates verifiability of research and building on previous research. The tool is designed primarily for examining grammatical categories from the quantitative point of view. It offers grammatical profiles of particular lemmas (currently 14 thousand Czech nouns) and the proportion of individual grammatical categories within a part of speech, i.e., the standard behavior of a word class. The data in GramatiKat are pre-processed, statistically evaluated, and presented in charts and tables for clarity, and they are available to other linguists, especially from fields of morphology and lexicography. This article is aimed at providing inspiration and support to corpus and non-corpus linguists with utilization and enhanced use of the existing tools and with the creation of new specialized tools available to other users.
Published online: 30 Dec 2021 Pages: 545 - 555
Abstract
The paper introduces a new section separated from journalistic texts in Czech corpora, namely interviews. This genre is highly specific; from among the texts that can be found in newspapers and magazines, it is probably the closest to spoken language. In two case studies, we present the possible application of the interviews subcorpus in linguistic research. The first one deals with the role of paralinguistic behaviour, especially laughter in written interviews vs. spoken dialogues. The second one investigates the specifics of the demonstrative ten in the function of a nominal attribute, again in both written and spoken data.
Published online: 30 Dec 2021 Pages: 556 - 567
Abstract
We present a hybrid HMM-based PoS tagger for Old Church Slavonic. The training corpus is a portion of one text, Codex Marianus (40k), annotated with the Universal Dependencies UPOS tags in the UD-PROIEL treebank. We perform a number of experiments in within-domain and out-of-domain settings, in which the remaining part of Codex Marianus serves as a within-domain test set, and Kiev Folia is used as an out-of-domain test set. Analysing by-PoS-class precision and sensitivity in each run, we combine a simple context-free n-gram-based approach with a Hidden Markov Model (HMM) and add linguistic rules for specific cases such as punctuation and digits. While the model achieves a rather unimpressive accuracy of 81% in within-domain settings, we observe an accuracy of 51% in out-of-domain evaluation, which is comparable to the results of large neural architectures based on pre-trained contextual embeddings.
Published online: 30 Dec 2021 Pages: 568 - 578
Abstract
The article presents the process of building the Franček Slovenian language portal aimed at primary- and secondary-school students. We discuss problems and solutions of linking and adapting existing non-pedagogical dictionaries for school use, while overcoming content and structural differences among the dictionaries. We also present some solutions within the process of adaptation to the online medium and visualisation adjustments for three age groups of school users with different content needs and levels of (meta)linguistic knowledge.
Published online: 30 Dec 2021 Pages: 579 - 589
Abstract
The paper describes a methodology for creating a Slovak database of speech under stress, together with pilot observations. While the relationship between stress and speech characteristics can be utilized in a wide domain of speech technology applications, its research suffers from the lack of suitable databases, particularly in conversational speech. We propose a novel procedure to record acted speech in the actors' homes using their own smartphones. We describe both the collection of speech material under three levels of stress and the subsequent annotation of stress levels in this material. First observations suggest a reasonable inter-annotator agreement, as well as interesting avenues for the relationship between the intended stress levels and those perceived in speech.
Published online: 30 Dec 2021 Pages: 590 - 602
Abstract
The article tackles the problems of linguistic annotation in the Chinese texts presented in the Ruzhcorp – Russian-Chinese Parallel Corpus of RNC, and the ways to solve them. Particular attention is paid to the processing of Russian loanwords. On the one hand, we present a theoretical comparison of the widespread standards of Chinese text processing. On the other hand, we describe our experiments in three fields: word segmentation, grapheme-to-phoneme conversion, and PoS-tagging, on specific corpus data that contain many transliterations and loanwords. As a result, we propose a preprocessing pipeline for the Chinese texts that will be implemented in Ruzhcorp.
Published online: 30 Dec 2021 Pages: 603 - 617
Abstract
Quantitative, corpus-based research on the spontaneous spoken Carpathian Rusyn language can cause several data-related problems: speakers use ambivalent forms in different quantities, resulting in a biased data set, while a stricter data-cleaning process would lead to large-scale data loss. On top of that, polytomous categorical dependent variables are hard to analyze due to methodological limitations. This paper provides several approaches to face unbalanced and biased data sets containing variation of conjugational forms of the verbs maty ‘to have’ and (po-)znaty ‘to know’ in the Carpathian Rusyn language. Using resampling-based methods like Cross-Validation, Bootstrapping and Random Forests, we provide a strategy for circumventing possible methodological pitfalls and gaining the most information from our precious data, without trying to p-hack the results. Calculating the predictive power of several sociolinguistic factors on linguistic variation, we can make valid statements about the (sociolinguistic) status of Rusyn and the stability of the old dialect continuum of Rusyn varieties.
Published online: 30 Dec 2021 Pages: 618 - 630
Abstract
A literary essay is an interesting unit for language analyses, as its stylistic means often exceed the boundaries of the genre of an artistic essay. The article presents a new corpus of Czech literary essays covering approximately fifty years from 1890 to 1940. Along with the characterisation of the corpus and its annotation, the paper focuses on the TxM corpus tool: in the second part of the study, we use selected texts to conduct an analysis of seven different authors through multidimensional cluster analysis, factorial correspondence analysis and a specificity score. The main parameter of the analyses was the usage of parts of speech in texts by individual authors. At present, the Corpus of Czech Essays contains 40 essayist titles written by 15 authors covering various topics (music, visual arts, theatre, literature, etc.).
Published online: 30 Dec 2021 Pages: 631 - 640
Abstract
This work-in-progress paper presents a specialized language corpus UcebKo built from textbooks of Czech for foreigners. The corpus integrates three subcorpora (UcebKo-A2, UcebKo-B1, and UcebKo-B2) which allow research of Czech as a second/foreign language at chosen language levels (A2, B1, and B2). In this case, the research is focused on word-formation, where the first results, i.e., mapping of derived words denoting persons, illustrate the approach and methodology used.
Published online: 30 Dec 2021 Pages: 643 - 655
Abstract
This paper focuses on a linguistic image of mother in German languages. It seeks to grasp it through a typical context of the German word Mutter ʻmotherʼ. The research is based on results of distributional and thematic analyses of these words. These analyses are used as a base for reconstructing prototypical characteristics of “mother” and the related concepts used by speakers of German. The paper develops these findings into compiling the most frequent collocations and other (mostly contextual) information gathered by the use of corpus tools. The paper concludes with an outline of unconscious axiological processes used in evaluating the image of mother on the good/bad axis.
Published online: 30 Dec 2021 Pages: 656 - 666
Abstract
The paper focuses on dynamics of changes of several linguistic and text properties in diachronic development of Czech. Specifically, we analyze the proportion of identical word-forms (types), the average type length, text length, the proportion of hapax legomena, the moving average type-token ratio, and entropy. For the analysis, seven translations of the Gospel of Matthew from the 14th to the 21st century were used. The study reveals some differences in dynamics of changes of particular properties.
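Two of the listed measures can be stated compactly. The moving average type-token ratio (MATTR) is the mean type-token ratio over all windows of a fixed length sliding through the text, and entropy is taken here to be the Shannon entropy of the word-form distribution. The sketch below assumes those standard definitions; the window size, the function names, and the sample tokens are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def mattr(tokens, window=100):
    """Mean type-token ratio over every sliding window of `window` tokens."""
    if not tokens:
        return 0.0
    size = min(window, len(tokens))  # short texts fall back to one window
    ratios = [
        len(set(tokens[i:i + size])) / size
        for i in range(len(tokens) - size + 1)
    ]
    return sum(ratios) / len(ratios)


def shannon_entropy(tokens):
    """Shannon entropy (in bits) of the word-form frequency distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical token sequence.
tokens = ["a", "a", "b", "c"]
lexical_diversity = mattr(tokens, window=2)
text_entropy = shannon_entropy(["a", "b", "a", "b"])
```

Unlike the plain type-token ratio, MATTR does not shrink simply because a text is longer, which matters when comparing translations of different lengths.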
Published online: 30 Dec 2021 Pages: 667 - 678
Abstract
This article takes a bird’s eye view of how positive or negative sentiments in the news press about countries and nationality nouns seem to reflect the country’s general income groups. The study focuses on the four income groups classified by the World Bank and their co-occurrence with positively and negatively classified adjectives from the Subjectivity Lexicon for Czech. A search in the journalistic subcorpus of the SYN series, release 8 of the Czech National Corpus, results in a time line covering three decades. Previous research on subjectivity has either focused on other parts of the Subjectivity Lexicon or on fewer adjectives from other languages. In this article, it is argued that the income groups are treated in descending order, i.e., the higher the income, the more positive the sentiment. Even when the most influential groups in the top and bottom are removed, the result holds. Discourse concerning global war and peace, and the security of different nations, is also detected as a result.
Published online: 30 Dec 2021 Pages: 679 - 689
Abstract
The research paper analyses key words found in pre-election communication of electorally successful political parties, based on which the main communication differences among those parties and the specifics of pre-election communication, as well as the pre-election discourse as a whole, are identified. Research material consists of political parties’ microblogs published on individual political parties’ Facebook profiles in the period from January 1, 2020 to February 28, 2020, with a reference corpus formed by the total of these microblogs. The analysis showed professionalization of political communication, the use of new, but also traditional ways of interaction with the electorate, pre-election communication based on the presentation of candidates, offensive and combative tone of the most successful parties, self-presentation, hints of persuasive and manipulative techniques, topic points of electoral programmes, but also thematic neutrality and non-specificity that suggest smaller electoral success.
Published online: 30 Dec 2021 Pages: 690 - 704
Abstract
The phenomenon of political evasiveness in the genre of a political interview has been the focus of several discourse studies employing conversation analysis, critical discourse analysis and the social psychology approach. Most of the above-mentioned studies focus on a detailed qualitative analysis of political discourse identifying a wide range of communication strategies that permit politicians to ambiguate their agency and at the same time boost their positive face. Since these strategies may change over time and also be subject to a culture-specific environment, the aim of this paper is to discover a) which evasive communicative strategies were employed by Slovak politicians in 2012–2016, b) which lexical substitutions were most frequently used by them to avoid negative connotations of face-threatening questions, and finally, c) which cognitive frames formed a frequent conceptual background of their evasive political argumentation. The paper will draw on a combination of quantitative and qualitative approaches to the analysis of non-replies devised by Bull and Mayer (1993) and critical discourse analysis in a sample of five Slovak radio interviews aired on Rádio Express. The selection of interviews was not random: in each interview the politician was asked highly conflictual questions about bribery, embezzlement or disputes in the coalition. Based on qualitative research of Russian-Slovak political discourse (2009) by Dulebová it is hypothesized that a) the evasive strategy of ‘attack’ on the opposition and ‘attack on the interviewer’ would occur in our sample with the highest prominence in the speech of the former Prime Minister Fico, and b) the politicians accused of direct involvement in scandals would be the most evasive ones.
Published online: 30 Dec 2021 Pages: 705 - 718
Abstract
The paper follows the tradition of research in legal linguistics and into formulaic language, specifically into lexical bundles. The aim of the paper is to describe lexical bundles in samples from the corpus of Slovak judicial decisions OD-JUSTICE by means of quantitative characteristics of the identified bundles and by their comparison with bundles found in two other specialized corpora: the corpus of Slovak legal regulations and the corpus of annual reports by Slovak public institutions. For the identification of bundles, the concept of the h-point was used. Identified bundles are described with respect to their maximal, minimal, average, median and mode values, distributions and ratios. The aim of the paper is to outline an interpretation of these bundle characteristics with regard to communicative function(s) of compared document genres.
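In quantitative linguistics, the h-point of a rank-frequency distribution is the rank r at which the frequency equals the rank, f(r) = r. When no rank satisfies this exactly, it is commonly obtained by linear interpolation between the two neighbouring ranks. The abstract does not describe how the h-point was applied to bundle lists, so the sketch below only illustrates the general definition; the function name and the frequency lists are hypothetical.

```python
def h_point(frequencies):
    """Return the h-point of a frequency list, or None if it is not reached.

    Frequencies are sorted in descending order and ranked from 1. If some
    rank r has f(r) == r, that rank is returned. Otherwise the crossing of
    the line f = r is interpolated between the last rank with f > r and
    the first rank with f < r.
    """
    freqs = sorted(frequencies, reverse=True)
    if not freqs or freqs[-1] <= 0:
        raise ValueError("expected a non-empty list of positive frequencies")
    for rank, freq in enumerate(freqs, start=1):
        if freq == rank:
            return float(rank)
        if freq < rank:
            # The previous rank is guaranteed to have had freq > rank.
            r_i, f_i = rank - 1, freqs[rank - 2]
            r_j, f_j = rank, freq
            return (f_i * r_j - f_j * r_i) / (r_j - r_i + f_i - f_j)
    return None  # every frequency is still above its rank


exact = h_point([10, 6, 3, 3, 1])     # rank 3 has frequency 3
interpolated = h_point([10, 5, 2, 1]) # crossing lies between ranks 2 and 3
```

Items ranked above the h-point are the high-frequency ones, so the value can serve as a data-driven cut-off instead of an arbitrary frequency threshold.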
This article presents the current results of an ongoing study of the possibilities of fine-tuning automatic morphosyntactic and semantic annotation by means of improving the underlying formal grammar and ontology on the example of one Tibetan text. The ultimate purpose of work at this stage was to improve linguistic software developed for natural-language processing and understanding in order to achieve complete annotation of a specific text and such state of the formal model, in which all linguistic phenomena observed in the text would be explained. This purpose includes the following tasks: analysis of error cases in annotation of the text from the corpus; eliminating these errors in automatic annotation; development of formal grammar and updating of dictionaries. Along with the morpho-syntactic analysis, the current approach involves simultaneous semantic analysis as well. The article describes semantic annotation of the corpus, required by grammar revision and development, which was made with the use of computer ontology. The work is carried out with one of the corpus texts – a grammatical poetic treatise Sum-cu-pa (VII c.).
The paper presents a discussion of homonymy of Czech nouns with different or varying genders. The lemmas with this type of homonymy are treated in the new release of the MorfFlex dictionary as separate. We show that the separation of paradigms according to the gender is not only superfluous, but also clumsy, because it forces a choice when making one is not necessary. That is why we call this type of hononymy “artificial”.
Inspired by earlier work on typological profiling of English by Benedikt Szmrecsányi and Bernd Kortmann ([1], [2], [3]), this paper investigates the typological profiles of English, Spanish, German, and Slovak, applying Szmrecsányi and Kortmann’s methodology of calculating a SYNTHETICITY INDEX and an ANALYTICITY INDEX based on 1,000-word corpus samples. The results show that Szmrecsányi and Kortmann’s methodology is replicable, and confirm claims in the literature about degrees of analyticity and syntheticity of these languages. Instead of a simple analytic-synthetic continuum, Szmrecsányi and Kortmann’s “typological space” [3] is used to visualize results, showing that languages can be both synthetic and analytic to varying degrees.
The paper deals with the acquisition of Slovak word order in written texts of students of Slovak as a foreign language. Its attention is focused on identifying the correct and incorrect placement of enclitic components, and their erroneous usage is analysed with respect to different investigated variables (types of enclitic components, types of syntactic construction, distance from lexical/syntactic anchor, and realization in pre- or post-verbal position). The paper also pays attention to the error rate regarding individual proficiency levels of students, and error distribution in two language groups, Slavic and Non-Slavic learners, is compared.
We present the results of an automatic comparison of valency frames of interlinked adjectival and verbal lexical units based on the valency lexicons NomVallex and VALLEX. We distinguish nine derivational types of deverbal adjectives and examine whether they tend to display systemic or non-systemic valency behavior. Non-systemic valency behavior includes changes in the number of valency complementations and, more dominantly, non-systemic forms of actants, especially a prepositional group.
The paper presents work in progress on the compilation and automatic annotation of a dataset comprising examples of stative verbs in parallel Bulgarian-Russian corpora with the goal of facilitating the elaboration of a classification of stative verbs in the two languages based on their lexical and semantic properties. We extract stative verbs from the Bulgarian and the Russian WordNets with their assigned conceptual information (frames) from FrameNet. We then assign the set of probable Bulgarian and Russian stative verbs to the verb instances in a parallel Bulgarian-Russian corpus using WordNet correspondences to filter out unlikely stative candidates. Further, manual inspection will ensure high quality of the resource and its application for the purposes of semantic analysis.
The paper attempts to identify the usage and productivity of five different international suffixes in Slovak by means of corpus evidence. The analysis focuses on real and potential productivity in a two-stage comparison: 1) tokens/lemmas occurring in a general balanced corpus vs general corpus of specialised and academic texts, 2) general corpus of specialised and academic texts vs specialised (sub)corpora of medical, legal, economic and religious texts. The aim of the analysis is to explore whether productivity varies across registers by means of statistical measures.
It is shown that the mean morpheme length (measured in phonemes) decreases with the increasing length of word types (in morphemes) in Czech texts, i.e., these language units behave according to the Menzerath-Altmann law. The law is not valid in general for word tokens. Some hints towards an interpretation of parameters are presented.
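The relationship described above can be illustrated with a minimal sketch. It assumes the truncated form of the Menzerath-Altmann law, y = a·x^b, where x is word length in morphemes and y is mean morpheme length in phonemes; the fuller form y = a·x^b·e^(−cx) is also common, and the abstract does not state which one was fitted. The function name `fit_power_law` and the sample data are hypothetical and are not taken from the paper.

```python
import math


def fit_power_law(lengths, means):
    """Fit y = a * x**b by ordinary least squares on log-transformed data.

    lengths: word lengths in morphemes (x), all > 0
    means:   mean morpheme lengths in phonemes (y), all > 0
    Returns the pair (a, b); b < 0 indicates the decreasing trend
    that the law predicts.
    """
    if len(lengths) != len(means) or len(lengths) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    xs = [math.log(x) for x in lengths]
    ys = [math.log(y) for y in means]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                       # slope in log-log space
    a = math.exp(mean_y - b * mean_x)   # intercept, back-transformed
    return a, b


# Synthetic illustration only: data generated from y = 4 / x.
a, b = fit_power_law([1, 2, 4], [4.0, 2.0, 1.0])
print(round(a, 3), round(b, 3))
```

A log-log linear fit is the simplest choice here; a study that fits the three-parameter form would more likely use non-linear least squares.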
This study aims to trace the reallocation of a grammatical feature along the dialect-standard axis with the aid of corpus linguistics methods, more precisely through an integrated application of quantitative and qualitative approaches. The phenomenon under investigation is articles without the definiteness marker d- in German, usually ascribed to the Bavarian dialect area. Analyses show, however, that this apparently dialectal feature diffuses to other communication settings closer to the intended standard language use. This process is accompanied by a refunctionalisation of reduced article forms, indicating the relevance of language-internal relations for the reallocation of grammatical features. The methodological approach should be easily applicable to other variants and – as many European languages show a diaglossic repertoire – relevant to other languages as well.
Complex adverbial prepositions with spatial meaning have not been sufficiently studied so far in Czech. To establish a set of these expressions in their actual usage, the resources of the Czech National Corpus were used in this study. The research has shown that the SYN2020 corpus is a relevant tool for searching for two-word expressions with a LOCATIVE ADVERB – SIMPLE PREPOSITION structure that have the same function as a one-word locative preposition. The article describes a method for the extraction of these expressions from the corpus, as well as a method for the collection of their quantitative data using corpus tools. As a result of the research, a list of expressions that are presumably complex prepositions is provided.
Theories of valency and valency dictionaries are inevitably and understandably based on the valency behavior of frequent verbs. This paper scrutinizes 154 low-frequency Czech verbs and argues that they demonstrate that Czech verbs are more malleable in their valency behavior than suggested by the literature. It is argued that this fits better within a constructionist approach to valency rather than a lexicalist one. Furthermore, the paper illustrates two alternations, previously unrecognized for Czech as semantic diatheses, namely the causative-inchoative alternation and the Agent-Means alternation.
In this paper, we present a preliminary study of three intensifiers (absolutně, naprosto, úplně) based on data from three different corpora, a written corpus SYN2020, a web corpus ONLINE-ARCHIVE, and a spoken corpus ORTOFON 1. Providing a parallel annotation of a random sample of each intensifier, we focus on their functions and meanings in context. We analyse their properties in order to define those features which are relevant to their word class assignment, and to prepare grounds for the future disambiguation tasks.
The paper presents a novel and unified morphological description of numerals and pronouns, as compiled for the newest edition of the Prague Dependency Treebank (Prague Dependency Treebank – Consolidated 1.0) and its integral part the morphological dictionary MorfFlex. On the basis of considerable experience with real data annotation and the use of the morphological dictionary, particular changes were proposed. For both of the parts of speech a new set of subtypes was proposed, based mainly on the morphological criterion and its combination with semantic properties and other relevant features, such as definiteness in numerals and possessivity, reflexivity, and clitichood in pronouns. Each subtype has a specific value at the 2nd position of the morphological tag, which serves also as an indicator of the applicability of other tag categories.
This article reports on the quantitative corpus-based investigation into the form-function interplay of the English detached adjectival construction with an explicit subject. Taking Usage-based Construction Grammar as its theoretical framework, this paper investigates the patterns of attraction of lexical items that appear in the main slots of the grammatical construction. The data obtained substantiate the constructional status of the construction and determine its semantic and functional specification in present-day English.
Text readability metrics assess how much effort a reader must put into comprehending a given text. They are, e.g., used to choose appropriate readings for different student proficiency levels, or to make sure that crucial information is efficiently conveyed (e.g., in an emergency). Flesch Reading Ease is such a globally used formula that it is even integrated into the MS Word Processor. However, its constants are language-dependent. The original formula was created for English. So far it has been adapted to several European languages, Bangla, and Hindi. This paper describes the Czech adaptation, with the language-dependent constants optimized by a machine-learning algorithm working on parallel corpora of Czech and English, Russian, Italian, and French, respectively.
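For orientation, the original English Flesch Reading Ease formula can be sketched as follows. The three constants are the standard English values; the Czech-optimized constants are not given in the abstract, so the keyword parameters `base`, `w_sentence` and `w_syllable` are hypothetical hooks through which language-specific values could be supplied. Counting words, sentences and syllables is assumed to happen elsewhere.

```python
def flesch_reading_ease(words, sentences, syllables,
                        base=206.835, w_sentence=1.015, w_syllable=84.6):
    """Return the Flesch Reading Ease score for pre-computed counts.

    Defaults are the constants of the original English formula:
        206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
    An adaptation to another language would pass different constants.
    """
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    avg_sentence_length = words / sentences
    avg_syllables_per_word = syllables / words
    return (base
            - w_sentence * avg_sentence_length
            - w_syllable * avg_syllables_per_word)


# 100 words in 5 sentences with 150 syllables in total.
score = flesch_reading_ease(words=100, sentences=5, syllables=150)
print(round(score, 3))
```

Higher scores indicate easier text; keeping the constants as parameters makes it clear which part of the formula is language-dependent.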
This paper presents a synchronic and diachronic computer corpus of the Makarska littoral dialects. The corpus was created as part of a project exploring the ikavian neoštokavian dialects of the narrow coastal area around the town of Makarska in the Croatian region of Dalmatia. The dialectological characteristics of the dialects studied are briefly presented first, followed by a presentation of the digital system. The system is logically organized in two parts: the first is a corpus of digitally processed literary texts created from 1729 to 1803, and the second consists of materials collected through dialectological questionnaires prepared and methodologically adapted as part of the creation of the Croatian Linguistic Atlas. The methods of collecting linguistic data, the method of input into digital form, and the methods and possibilities of data processing are explained. Based on the input and search strategies within the system, the examples show that the dialects of the Makarska littoral originate in the ikavian neoštokavian dialect described in the dialectological literature. This computer-based working principle is a novelty in Croatian dialectology, which has not been digitally processed so far, and it offers a basis for future dialectological research. The platform can be used to shorten data-processing time and to analyse the data more systematically and efficiently. So far, there has been no such digital repository for any Croatian speech. This project represents a thorough synchronic and diachronic study of one rounded language area.
A new interactive map-based web application named Mapka was published by the Institute of the Czech National Corpus in 2020. It aims to serve linguists, as well as schools and the general public, and it features various functions described in this paper. Mapka was designed as a supplement to the CNC spoken corpora, starting with the DIALEKT corpus (more to come in the future). Its main function is to display various types of territorial division (primarily in terms of dialect, but also administrative) and networks of localities associated with the corpus. The main dialect regions are provided with overviews of their typical dialectal features and two samples of dialectal discourse – one slightly historical and one contemporary. The application offers the possibility of searching for municipalities, plotting the points on the map and creating a custom map. The paper concludes with future prospects concerning an enhanced and improved version of the application.
In this paper, we would like to provide a brief overview of the current state of pronunciation teaching in e-learning and demonstrate a new approach to building tools for automatic feedback concerning correct pronunciation based on the most frequent or typical errors in speech production made by non-native speakers. We will illustrate this in the process of designing annotation for a sound recognition tool to provide feedback on pronunciation. At the end of the paper, we will also present how we have tried to apply this annotation to the tool, what caveats we have found and what our plans are.
ORATOR v2 is a new 1.5M word corpus of Czech monologues, delivered to a live audience in semi-formal to formal settings. It was designed to chart the space of naturally occurring monologues which can be obtained for corpus processing. As such, it aims for diversity but does not attempt any balancing of subcategories, recognizing that some types of data are inherently easier to obtain in high volume than others. The transcription guidelines and annotation tools employed are the same as other recent spoken corpora published by the CNC, which facilitates interesting comparisons between various types of spoken Czech. The present paper sketches out three case studies, comparing ORATOR to the informal conversations of ORTOFON v2 in terms of the frequencies of demonstratives and hesitations, as well as lexical richness.
This paper presents a specialized corpus tool GramatiKat in the context of Open Science principles, namely data sharing, which offers opportunities for original research and facilitates verifiability of research and building on previous research. The tool is designed primarily for examining grammatical categories from the quantitative point of view. It offers grammatical profiles of particular lemmas (currently 14 thousand Czech nouns) and the proportion of individual grammatical categories within a part of speech, i.e., the standard behavior of a word class. The data in GramatiKat are pre-processed, statistically evaluated, and presented in charts and tables for clarity, and they are available to other linguists, especially from fields of morphology and lexicography. This article is aimed at providing inspiration and support to corpus and non-corpus linguists with utilization and enhanced use of the existing tools and with the creation of new specialized tools available to other users.
The paper introduces a new section separated from journalistic texts in Czech corpora, namely interviews. This genre is highly specific; from among the texts that can be found in newspapers and magazines, it is probably the closest to spoken language. In two case studies, we present the possible application of the interviews subcorpus in linguistic research. The first one deals with the role of paralinguistic behaviour, especially laughter in written interviews vs. spoken dialogues. The second one investigates the specifics of the demonstrative ten in the function of a nominal attribute, again in both written and spoken data.
We present a hybrid HMM-based PoS tagger for Old Church Slavonic. The training corpus is a portion of one text, Codex Marianus (40k), annotated with the Universal Dependencies UPOS tags in the UD-PROIEL treebank. We perform a number of experiments in within-domain and out-of-domain settings, in which the remaining part of Codex Marianus serves as a within-domain test set, and Kiev Folia is used as an out-of-domain test set. Analysing by-PoS-class precision and sensitivity in each run, we combine a simple context-free n-gram-based approach with a Hidden Markov Model (HMM) and add linguistic rules for specific cases such as punctuation and digits. While the model achieves a rather unimpressive accuracy of 81% in in-domain settings, we observe an accuracy of 51% in out-of-domain evaluation, which is comparable to the results of large neural architectures based on pre-trained contextual embeddings.
The article presents the process of building the Franček Slovenian language portal aimed at primary- and secondary-school students. We discuss problems and solutions of linking and adapting existing non-pedagogical dictionaries for school use, while overcoming content and structural differences among the dictionaries. We also present some solutions within the process of adaptation to the online medium and visualisation adjustments for three age groups of school users with different content needs and levels of (meta)linguistic knowledge.
The paper describes a methodology for creating a Slovak database of speech under stress, together with pilot observations. While the relationship between stress and speech characteristics can be utilized in a wide range of speech technology applications, research on it suffers from a lack of suitable databases, particularly for conversational speech. We propose a novel procedure for recording acted speech in the actors’ homes using their own smartphones. We describe both the collection of speech material under three levels of stress and the subsequent annotation of stress levels in this material. First observations suggest reasonable inter-annotator agreement, as well as interesting avenues for exploring the relationship between the intended stress levels and those perceived in speech.
The article tackles the problems of linguistic annotation in the Chinese texts presented in the Ruzhcorp – Russian-Chinese Parallel Corpus of RNC, and the ways to solve them. Particular attention is paid to the processing of Russian loanwords. On the one hand, we present the theoretical comparison of the widespread standards of Chinese text processing. On the other hand, we describe our experiments in three fields: word segmentation, grapheme-to-phoneme conversion, and PoS-tagging, on the specific corpus data that contains many transliterations and loanwords. As a result, we propose the preprocessing pipeline of the Chinese texts, that will be implemented in Ruzhcorp.
Quantitative, corpus-based research on the spontaneous spoken Carpathian Rusyn language can run into several data-related problems: speakers use ambivalent forms in different quantities, resulting in a biased data set, while a stricter data-cleaning process would lead to large-scale data loss. On top of that, polytomous categorical dependent variables are hard to analyze due to methodological limitations. This paper provides several approaches for dealing with unbalanced and biased data sets containing variation in the conjugational forms of the verbs maty ‘to have’ and (po-)znaty ‘to know’ in the Carpathian Rusyn language. Using resampling-based methods such as Cross-Validation, Bootstrapping and Random Forests, we provide a strategy for circumventing possible methodological pitfalls and gaining the most information from our precious data, without trying to p-hack the results. By calculating the predictive power of several sociolinguistic factors on linguistic variation, we can make valid statements about the (sociolinguistic) status of Rusyn and the stability of the old dialect continuum of Rusyn varieties.
A literary essay is an interesting unit for language analyses, as its stylistic means often exceed the boundaries of the genre of an artistic essay. The article presents a new corpus of Czech literary essays covering approximately fifty years from 1890 to 1940. Along with the characterisation of the corpus and its annotation, the paper focuses on the TxM corpus tool: in the second part of the study, we use selected texts to analyse seven authors by means of multidimensional cluster analysis, factorial correspondence analysis and a specificity score. The main parameter of the analyses was the usage of parts of speech in texts by individual authors. At present, the Corpus of Czech Essays contains 40 essayist titles written by 15 authors covering various topics (music, visual arts, theatre, literature, etc.).
This work-in-progress paper presents a specialized language corpus UcebKo built from textbooks of Czech for foreigners. The corpus integrates three subcorpora (UcebKo-A2, UcebKo-B1, and UcebKo-B2) which allow research of Czech as a second/foreign language at chosen language levels (A2, B1, and B2). In this case, the research is focused on word-formation, where the first results, i.e., mapping of derived words denoting persons, illustrate the approach and methodology used.
This paper focuses on the linguistic image of mother in the German language. It seeks to grasp it through the typical context of the German word Mutter ʻmotherʼ. The research is based on the results of distributional and thematic analyses of this word. These analyses are used as a basis for reconstructing the prototypical characteristics of “mother” and the related concepts used by speakers of German. The paper develops these findings by compiling the most frequent collocations and other (mostly contextual) information gathered with corpus tools. The paper concludes with an outline of unconscious axiological processes used in evaluating the image of mother on the good/bad axis.
The paper focuses on dynamics of changes of several linguistic and text properties in diachronic development of Czech. Specifically, we analyze the proportion of identical word-forms (types), the average type length, text length, the proportion of hapax legomena, the moving average type-token ratio, and entropy. For the analysis, seven translations of the Gospel of Matthew from the 14th to the 21st century were used. The study reveals some differences in dynamics of changes of particular properties.
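The properties listed above can be sketched on a tokenized text as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the hapax proportion is taken relative to the number of types, entropy is Shannon entropy in bits over word-form frequencies, and the window size for the moving average type-token ratio is an arbitrary parameter. The function name `text_profile` and the dictionary keys are hypothetical.

```python
import math
from collections import Counter


def text_profile(tokens, window=500):
    """Compute simple lexical statistics for a list of word-form tokens."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    counts = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(counts)

    # Hapax legomena: word forms that occur exactly once.
    hapaxes = sum(1 for c in counts.values() if c == 1)

    # Shannon entropy (bits) of the word-form frequency distribution.
    entropy = -sum((c / n_tokens) * math.log2(c / n_tokens)
                   for c in counts.values())

    # Moving average type-token ratio: mean TTR over all windows of
    # fixed size; falls back to the plain TTR for very short texts.
    if n_tokens < window:
        mattr = n_types / n_tokens
    else:
        ratios = [len(set(tokens[i:i + window])) / window
                  for i in range(n_tokens - window + 1)]
        mattr = sum(ratios) / len(ratios)

    return {
        "text_length": n_tokens,
        "type_proportion": n_types / n_tokens,
        "avg_type_length": sum(len(t) for t in counts) / n_types,
        "hapax_proportion": hapaxes / n_types,   # assumption: per type
        "mattr": mattr,
        "entropy": entropy,
    }


profile = text_profile("a b a c".split(), window=2)
print(profile)
```

Comparing such profiles across translations from different centuries would give one curve per property; the window-based ratio is less sensitive to text length than the plain type-token ratio.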
This article takes a bird’s eye view of how positive or negative sentiments in the news press about countries and nationality nouns seem to reflect the country’s general income groups. The study focuses on the four income groups classified by the World Bank and their co-occurrence with positively and negatively classified adjectives from the Subjectivity Lexicon for Czech. A search in the journalistic subcorpus of the SYN series, release 8 of the Czech National Corpus, results in a time line covering three decades. Previous research on subjectivity has either focused on other parts of the Subjectivity Lexicon or on fewer adjectives from other languages. In this article, it is argued that the income groups are treated in descending order, i.e., the higher the income, the more positive the sentiment. Even when the most influential groups in the top and bottom are removed, the result holds. Discourse concerning global war and peace, and the security of different nations, is also detected as a result.
The research paper analyses key words found in pre-election communication of electorally successful political parties, based on which the main communication differences among those parties and the specifics of pre-election communication, as well as the pre-election discourse as a whole, are identified. Research material consists of political parties’ microblogs published on individual political parties’ Facebook profiles in the period from January 1, 2020 to February 28, 2020, with a reference corpus formed by the total of these microblogs. The analysis showed professionalization of political communication, the use of new, but also traditional ways of interaction with the electorate, pre-election communication based on the presentation of candidates, offensive and combative tone of the most successful parties, self-presentation, hints of persuasive and manipulative techniques, topic points of electoral programmes, but also thematic neutrality and non-specificity that suggest smaller electoral success.
The phenomenon of political evasiveness in the genre of the political interview has been the focus of several discourse studies employing conversation analysis, critical discourse analysis and the social psychology approach. Most of these studies focus on a detailed qualitative analysis of political discourse, identifying a wide range of communication strategies that permit politicians to ambiguate their agency and at the same time boost their positive face. Since these strategies may change over time and also be subject to a culture-specific environment, the aim of this paper is to discover a) which evasive communicative strategies were employed by Slovak politicians in 2012–2016, b) which lexical substitutions they most frequently used to avoid negative connotations of face-threatening questions, and finally, c) which cognitive frames formed a frequent conceptual background of their evasive political argumentation. The paper draws on a combination of the quantitative and qualitative approaches to the analysis of non-replies devised by Bull and Mayer (1993) and critical discourse analysis, applied to a sample of five Slovak radio interviews aired on Rádio Express. The selection of interviews was not random: in each interview the politician was asked highly conflictual questions about bribery, embezzlement or disputes in the coalition. Based on Dulebová’s qualitative research on Russian-Slovak political discourse (2009), it is hypothesized that a) the evasive strategies of ‘attack on the opposition’ and ‘attack on the interviewer’ would occur in our sample with the highest prominence in the speech of the former Prime Minister Fico, and b) the politicians accused of direct involvement in scandals would be the most evasive ones.
The paper follows the tradition of research in legal linguistics and into formulaic language, specifically into lexical bundles. The aim of the paper is to describe lexical bundles in samples from the corpus of Slovak judicial decisions OD-JUSTICE by means of quantitative characteristics of the identified bundles and by comparing them with bundles found in two other specialized corpora: the corpus of Slovak legal regulations and the corpus of annual reports by Slovak public institutions. For the identification of bundles, the concept of the h-point was used. The identified bundles are described with respect to their maximal, minimal, average, median and mode values, distributions and ratios. A further aim is to outline an interpretation of these bundle characteristics with regard to the communicative function(s) of the compared document genres.
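As background, the h-point of a rank-frequency distribution is the point where rank and frequency coincide. The sketch below computes it under the usual definition, with linear interpolation between the two neighbouring ranks when no rank has a frequency exactly equal to it. How the paper applies the h-point to select bundles is not stated in the abstract, so this shows only the general computation; the function name `h_point` and the sample frequencies are hypothetical.

```python
def h_point(frequencies):
    """Return the h-point of a frequency list.

    Frequencies are sorted in descending order and ranked from 1.
    If some rank r has frequency f(r) == r, that rank is the h-point.
    Otherwise the point where the line through the two neighbouring
    (rank, frequency) pairs crosses the diagonal f = r is returned.
    """
    freqs = sorted(frequencies, reverse=True)
    if not freqs or freqs[-1] < 1:
        raise ValueError("need a non-empty list of positive frequencies")
    for rank, freq in enumerate(freqs, start=1):
        if freq == rank:
            return float(rank)
        if freq < rank:
            # Neighbours: (r_i, f_i) lies above the diagonal,
            # (r_j, f_j) lies below it.
            r_i, f_i = rank - 1, freqs[rank - 2]
            r_j, f_j = rank, freq
            return (f_i * r_j - f_j * r_i) / (r_j - r_i + f_i - f_j)
    raise ValueError("frequencies never fall to their rank; list too short")


print(h_point([10, 6, 3, 3, 1]))  # exact match at rank 3
print(h_point([10, 6, 4, 2]))     # interpolated between ranks 3 and 4
```

Items ranked above the h-point are conventionally treated as the high-frequency part of the distribution, which is one plausible way such a threshold could separate bundles from less recurrent sequences.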