Journal details
Format
Journal
eISSN
1338-4287
First published
05 Mar 2010
Publication frequency
2 times per year
Languages
English

Volume 72 (2021): Issue 2 (December 2021)
NLP, Corpus Linguistics and Interdisciplinarity

Open access

From Graphematics to Phrasal, Sentential, and Textual Semantics Through Morphosyntax by Means of Corpus-Driven Grammar and Ontology: A Case Study on One Tibetan Text

Published online: 30 Dec 2021
Pages: 319 - 329

Abstract

This article presents the current results of an ongoing study on fine-tuning automatic morphosyntactic and semantic annotation by improving the underlying formal grammar and ontology, using one Tibetan text as an example. The ultimate purpose of the work at this stage was to improve linguistic software developed for natural-language processing and understanding, in order to achieve complete annotation of a specific text and a state of the formal model in which all linguistic phenomena observed in the text are explained. This purpose includes the following tasks: analysing error cases in the annotation of the text from the corpus; eliminating these errors in automatic annotation; developing the formal grammar; and updating the dictionaries. Along with morphosyntactic analysis, the current approach involves simultaneous semantic analysis. The article describes the semantic annotation of the corpus, required by grammar revision and development, which was carried out using a computer ontology. The work is carried out on one of the corpus texts, the grammatical poetic treatise Sum-cu-pa (7th century).

Keywords

  • Tibetan language
  • computer ontology
  • Tibetan corpus
  • natural language processing
  • corpus linguistics
  • parsing
Open access

Artificial Homonymy

Published online: 30 Dec 2021
Pages: 330 - 341

Abstract

The paper presents a discussion of the homonymy of Czech nouns with different or varying genders. Lemmas with this type of homonymy are treated as separate in the new release of the MorfFlex dictionary. We show that separating paradigms according to gender is not only superfluous but also clumsy, because it forces a choice where none is necessary. That is why we call this type of homonymy “artificial”.

Keywords

  • homonymy
  • polysemy
  • gender variation
  • dictionary
Open access

Typological Profiling of English, Spanish, German and Slovak: A Corpus-Based Approach

Published online: 30 Dec 2021
Pages: 342 - 352

Abstract

Inspired by earlier work on typological profiling of English by Benedikt Szmrecsányi and Bernd Kortmann ([1], [2], [3]), this paper investigates the typological profiles of English, Spanish, German, and Slovak, applying Szmrecsányi and Kortmann’s methodology of calculating a SYNTHETICITY INDEX and an ANALYTICITY INDEX based on 1,000-word corpus samples. The results show that Szmrecsányi and Kortmann’s methodology is replicable, and confirm claims in the literature about degrees of analyticity and syntheticity of these languages. Instead of a simple analytic-synthetic continuum, Szmrecsányi and Kortmann’s “typological space” [3] is used to visualize results, showing that languages can be both synthetic and analytic to varying degrees.
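
The index calculation can be illustrated with a minimal sketch. It assumes that the analyticity index counts free grammatical markers and the syntheticity index counts bound grammatical markers, each normalised per 1,000 tokens; the abstract does not spell out these definitions. The function name, parameter names, and counts below are hypothetical and not taken from the paper.

```python
def typological_indices(free_markers, bound_markers, sample_size):
    """Return (analyticity_index, syntheticity_index) per 1,000 tokens.

    free_markers:  count of free grammatical markers (assumed definition)
    bound_markers: count of bound grammatical markers (assumed definition)
    sample_size:   number of tokens in the corpus sample
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    scale = 1000 / sample_size  # normalise to a 1,000-token sample
    return free_markers * scale, bound_markers * scale


# Invented counts for a 1,000-word sample, for illustration only.
analyticity, syntheticity = typological_indices(450, 150, 1000)
print(analyticity, syntheticity)
```

Plotting each language as a point with these two values as coordinates gives a two-dimensional picture of the kind the abstract calls a "typological space".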

Keywords

  • typological profiling
  • syntheticity index
  • analyticity index
  • typological space
  • English
  • German
  • Spanish
  • Slovak
  • corpus samples
Open access

Acquiring Word Order in Slovak as a Foreign Language: Comparison of Slavic and Non-Slavic Learners Utilizing Corpus Data

Published online: 30 Dec 2021
Pages: 353 - 370

Abstract

The paper deals with the acquisition of Slovak word order in written texts of students of Slovak as a foreign language. Its attention is focused on identifying the correct and incorrect placement of enclitic components, and their erroneous usage is analysed with respect to different investigated variables (types of enclitic components, types of syntactic construction, distance from lexical/syntactic anchor, and realization in pre- or post-verbal position). The paper also pays attention to the error rate regarding individual proficiency levels of students, and error distribution in two language groups, Slavic and Non-Slavic learners, is compared.

Keywords

  • word order
  • enclitics
  • error analysis
  • syntactic complexity
  • Slavic learners
  • Non-Slavic learners
  • acquisition stages
  • interlanguage
Open access

Systemic and non-systemic valency behavior of Czech deverbal adjectives

Published online: 30 Dec 2021
Pages: 371 - 382

Abstract

We present results of an automatic comparison of valency frames of interlinked adjectival and verbal lexical units based on the valency lexicons NomVallex and VALLEX. We distinguish nine derivational types of deverbal adjectives and examine whether they tend to display systemic or non-systemic valency behavior. The non-systemic valency behavior includes changes in the number of valency complementations and, more dominantly, non-systemic forms of actants, especially a prepositional group.

Keywords

  • deverbal adjective
  • derivational type
  • non-systemic valency
  • passive valency
Open access

Towards classification of stative verbs in view of corpus data

Published online: 30 Dec 2021
Pages: 383 - 393

Abstract

The paper presents work in progress on the compilation and automatic annotation of a dataset comprising examples of stative verbs in parallel Bulgarian-Russian corpora with the goal of facilitating the elaboration of a classification of stative verbs in the two languages based on their lexical and semantic properties. We extract stative verbs from the Bulgarian and the Russian WordNets with their assigned conceptual information (frames) from FrameNet. We then assign the set of probable Bulgarian and Russian stative verbs to the verb instances in a parallel Bulgarian-Russian corpus using WordNet correspondences to filter out unlikely stative candidates. Further, manual inspection will ensure high quality of the resource and its application for the purposes of semantic analysis.

Keywords

  • stative verbs
  • parallel corpora
  • semantic annotation
Open access

Usage and empirical productivity of international adjectival suffixes in Slovak based on general and specialised corpora

Published online: 30 Dec 2021
Pages: 394 - 404

Abstract

The paper attempts to identify the usage and productivity of five international suffixes in Slovak by means of corpus evidence. The analysis focuses on realized and potential productivity in a two-stage comparison: 1) tokens/lemmas occurring in a general balanced corpus vs. a general corpus of specialised and academic texts, 2) the general corpus of specialised and academic texts vs. specialised (sub)corpora of medical, legal, economic and religious texts. The aim of the analysis is to explore, by means of statistical measures, whether productivity varies across registers.
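
As a rough illustration of the two measures, the sketch below assumes the common corpus-based definitions: realized productivity as the number of distinct lemmas (types) formed with a suffix, and potential productivity as the number of hapax legomena with that suffix divided by the number of its tokens. The abstract does not state the exact formulas, so these definitions, the function name, and the sample data are assumptions.

```python
from collections import Counter


def productivity(lemmas_with_suffix):
    """Compute simple productivity measures for one suffix.

    lemmas_with_suffix: one lemma per corpus token that carries the suffix.
    """
    freq = Counter(lemmas_with_suffix)
    n_tokens = sum(freq.values())
    n_types = len(freq)  # realized productivity (assumed definition)
    n_hapax = sum(1 for count in freq.values() if count == 1)
    # Potential productivity (assumed definition): hapaxes / tokens.
    potential = n_hapax / n_tokens if n_tokens else 0.0
    return {"realized": n_types, "hapax": n_hapax, "potential": potential}


# Invented mini-sample of Slovak adjectives in -álny, for illustration only.
sample = ["lokálny", "lokálny", "globálny", "formálny", "lokálny", "vizuálny"]
print(productivity(sample))
```

Running the same function on the lemma lists extracted from each corpus or subcorpus would give comparable figures per register.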

Keywords

  • productivity
  • realized productivity
  • potential productivity
  • general corpus
  • specialised corpus
  • adjective
  • suffix
Open access

The Menzerath-Altmann law as the relation between lengths of words and morphemes in Czech

Published online: 30 Dec 2021
Pages: 405 - 414

Abstract

It is shown that the mean morpheme length (measured in phonemes) decreases with the increasing length of word types (in morphemes) in Czech texts, i.e., these language units behave according to the Menzerath-Altmann law. The law is not valid in general for word tokens. Some hints towards an interpretation of parameters are presented.
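
A minimal sketch of how such a relation can be checked: it fits the truncated form of the Menzerath-Altmann law, y = a * x^b, by least squares on log-transformed values, where x is word length in morphemes and y is mean morpheme length in phonemes. Whether the paper uses this truncated form or the fuller variant with an exponential term is not stated in the abstract, and the data below are synthetic.

```python
import math


def fit_power_law(lengths, mean_sizes):
    """Fit y = a * x**b by ordinary least squares in log-log space."""
    xs = [math.log(x) for x in lengths]
    ys = [math.log(y) for y in mean_sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                      # slope = exponent b
    a = math.exp(mean_y - b * mean_x)  # intercept back-transformed to a
    return a, b


# Synthetic data generated from a = 3.0, b = -0.2 (not from the paper).
word_lengths = [1, 2, 3, 4, 5]
morpheme_lengths = [3.0 * x ** -0.2 for x in word_lengths]
a, b = fit_power_law(word_lengths, morpheme_lengths)
print(a, b)
```

A negative fitted b corresponds to the pattern the abstract describes: the longer the word in morphemes, the shorter its morphemes on average.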

Keywords

  • Menzerath-Altmann law
  • word
  • morpheme
  • phoneme
  • Czech
Open access

Persistent features – Corpus-based evidence for reallocation processes in German

Published online: 30 Dec 2021
Pages: 415 - 424

Abstract

This study aims to trace the reallocation of a grammatical feature along the dialect-standard axis with the aid of corpus-linguistic methods, more precisely through an integrated application of quantitative and qualitative approaches. The phenomenon under investigation is articles without the definiteness marker d- in German, usually ascribed to the Bavarian dialect area. Analyses show, however, that this apparently dialectal feature diffuses to other communication settings closer to the intended standard language use. This process is accompanied by a refunctionalisation of reduced article forms, indicating the relevance of language-internal relations for the reallocation of grammatical features. The methodological approach should be easily applicable to other variants and, as many European languages show a diaglossic repertoire, relevant to other languages as well.

Keywords

  • reallocation
  • article system
  • Bavarian
  • dialect-standard axis
Open access

On corpus-driven research of complex adverbial prepositions with spatial meaning in Czech

Published online: 30 Dec 2021
Pages: 425 - 433

Abstract

Complex adverbial prepositions with spatial meaning have not been sufficiently studied so far in Czech. To establish a set of these expressions in their actual usage, the resources of the Czech National Corpus were used in this study. The research has shown that the SYN2020 corpus is a relevant tool for searching for two-word expressions with a LOCATIVE ADVERB – SIMPLE PREPOSITION structure that have the same function as a one-word locative preposition. The article describes a method for the extraction of these expressions from the corpus, as well as a method for the collection of their quantitative data using corpus tools. As a result of the research, a list of expressions that are presumably complex prepositions is provided.

Keywords

  • complex preposition
  • locative adverb
  • spatial meaning
  • Czech language
  • Czech National Corpus
Open access

The study of valency is biased toward more frequent verbs: A corpus study of the valency of less frequent verbs in Czech

Published online: 30 Dec 2021
Pages: 434 - 443

Abstract

Theories of valency and valency dictionaries are inevitably and understandably based on the valency behavior of frequent verbs. This paper scrutinizes 154 low-frequency Czech verbs and argues that they demonstrate that Czech verbs are more malleable in their valency behavior than suggested by the literature. It is argued that this fits better within a constructionist approach to valency rather than a lexicalist one. Furthermore, the paper illustrates two alternations, previously unrecognized for Czech as semantic diatheses, namely the causative-inchoative alternation and the Agent-Means alternation.

Keywords

  • valency
  • valency alternation
  • causativity
  • frequency
Open access

Between adverbs and particles: A corpus study of selected intensifiers

Published online: 30 Dec 2021
Pages: 444 - 453

Abstract

In this paper, we present a preliminary study of three intensifiers (absolutně, naprosto, úplně) based on data from three different corpora: the written corpus SYN2020, the web corpus ONLINE-ARCHIVE, and the spoken corpus ORTOFON 1. Providing a parallel annotation of a random sample of each intensifier, we focus on their functions and meanings in context. We analyse their properties in order to identify the features relevant to their word class assignment and to lay the groundwork for future disambiguation tasks.

Keywords

  • particles
  • adverbs
  • intensifiers
  • corpus
  • Czech
Open access

Capturing Numerals and Pronouns at the Morphological Layer in the Prague Dependency Treebanks of Czech

Published online: 30 Dec 2021
Pages: 454 - 464

Abstract

The paper presents a novel and unified morphological description of numerals and pronouns, as compiled for the newest edition of the Prague Dependency Treebank (Prague Dependency Treebank – Consolidated 1.0) and its integral part the morphological dictionary MorfFlex. On the basis of considerable experience with real data annotation and the use of the morphological dictionary, particular changes were proposed. For both of the parts of speech a new set of subtypes was proposed, based mainly on the morphological criterion and its combination with semantic properties and other relevant features, such as definiteness in numerals and possessivity, reflexivity, and clitichood in pronouns. Each subtype has a specific value at the 2nd position of the morphological tag, which serves also as an indicator of the applicability of other tag categories.

Keywords

  • numerals
  • pronouns
  • morphology
  • treebank
  • annotation
  • Czech
Open access

English detached adjectival constructions with an explicit subject: A quantitative corpus-based analysis

Published online: 30 Dec 2021
Pages: 465 - 474

Abstract

This article reports on the quantitative corpus-based investigation into the form-function interplay of the English detached adjectival construction with an explicit subject. Taking Usage-based Construction Grammar as its theoretical framework, this paper investigates the patterns of attraction of lexical items that appear in the main slots of the grammatical construction. The data obtained substantiate the constructional status of the construction and determine its semantic and functional specification in present-day English.

Keywords

  • detached clauses
  • Usage-based Construction Grammar
  • grammatical construction
  • simple collexeme analysis
Open access

Using a parallel corpus to adapt the Flesch Reading Ease formula to Czech

Published online: 30 Dec 2021
Pages: 477 - 487

Abstract

Text readability metrics assess how much effort a reader must put into comprehending a given text. They are used, e.g., to choose appropriate readings for different student proficiency levels, or to make sure that crucial information is efficiently conveyed (e.g., in an emergency). Flesch Reading Ease is so widely used that it is even integrated into Microsoft Word. However, its constants are language-dependent. The original formula was created for English. So far it has been adapted to several European languages, Bangla, and Hindi. This paper describes the Czech adaptation, with the language-dependent constants optimized by a machine-learning algorithm working on parallel corpora of Czech and English, Russian, Italian, and French, respectively.
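
For orientation, the original English formula is 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words). The sketch below exposes the three constants as parameters, since these are the language-dependent values that an adaptation replaces. The defaults are the English constants; the Czech values are not given in the abstract and are not guessed here. The function and parameter names are illustrative.

```python
def flesch_reading_ease(n_words, n_sentences, n_syllables,
                        base=206.835, sentence_weight=1.015,
                        syllable_weight=84.6):
    """Flesch Reading Ease with replaceable language-dependent constants.

    Defaults are the constants of the original English formula.
    """
    if n_words <= 0 or n_sentences <= 0:
        raise ValueError("n_words and n_sentences must be positive")
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return (base
            - sentence_weight * words_per_sentence
            - syllable_weight * syllables_per_word)


# Invented counts: 100 words, 5 sentences, 150 syllables.
print(flesch_reading_ease(100, 5, 150))  # roughly 59.6 with English constants
```

Higher scores indicate easier text. An adaptation to another language would keep the same shape and pass different values for the three keyword arguments.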

Keywords

  • complexity
  • parallel corpus
  • Czech
  • Flesch Reading Ease
  • machine learning
Open access

A synchronic and diachronic computer corpus of Makarska littoral dialects (Croatia)

Published online: 30 Dec 2021
Pages: 488 - 501

Abstract

This paper presents a synchronic and diachronic computer corpus of Makarska littoral dialects. The corpus was created as part of a project exploring the ikavian neoštokavian dialects of the narrow coastal area around the town of Makarska in the Croatian region of Dalmatia. The dialectological characteristics of the dialects studied are briefly presented first, followed by a presentation of the digital system. The system is organized in two parts: the first is a digitally processed corpus of literary texts created from 1729 to 1803, and the second consists of materials collected through dialectological questionnaires prepared and methodologically adapted as part of the creation of the Croatian Linguistic Atlas. The methods of collecting linguistic data, the method of input into digital form, and the methods and possibilities of data processing are explained. Based on the input and search strategies within the system, the examples are used to show that the dialects of the Makarska littoral originate in the ikavian neoštokavian dialect described in the dialectological literature. This computer-based approach is a novelty in Croatian dialectology, which has not been digitally processed so far, and it offers a basis for future dialectological research. The platform can be used to shorten data processing time and to analyse the data more systematically and efficiently. So far, there has been no such digital repository for any Croatian speech variety. This project represents a thorough synchronic and diachronic study of one self-contained language area.

Keywords

  • spoken corpus
  • corpus design
  • computer corpus
  • dialect corpus
  • dialectology
  • štokavian
Open access

Mapka: A map application for working with corpora of spoken Czech

Published online: 30 Dec 2021
Pages: 502 - 509

Abstract

A new interactive map-based web application named Mapka was published by the Institute of the Czech National Corpus in 2020. It aims to serve linguists, as well as schools and the general public, and it features various functions described in this paper. Mapka was designed as a supplement to the CNC spoken corpora, starting with the DIALEKT corpus (more to come in the future). Its main function is to display various types of territorial division (primarily in terms of dialect, but also administrative) and networks of localities associated with the corpus. The main dialect regions are provided with overviews of their typical dialectal features and two samples of dialectal discourse – one slightly historical and one contemporary. The application offers the possibility of searching for municipalities, plotting the points on the map and creating a custom map. The paper concludes with future prospects concerning an enhanced and improved version of the application.

Keywords

  • corpus
  • map
  • Czech language
  • spoken language
  • dialect
Open access

L2 Czech Annotation for Automatic Feedback on Pronunciation

Published online: 30 Dec 2021
Pages: 510 - 519

Abstract

In this paper, we would like to provide a brief overview of the current state of pronunciation teaching in e-learning and demonstrate a new approach to building tools for automatic feedback concerning correct pronunciation based on the most frequent or typical errors in speech production made by non-native speakers. We will illustrate this in the process of designing annotation for a sound recognition tool to provide feedback on pronunciation. At the end of the paper, we will also present how we have tried to apply this annotation to the tool, what caveats we have found and what our plans are.

Keywords

  • pronunciation
  • L2
  • Czech
  • machine learning
  • neural networks
  • e-learning
  • annotation
  • speech recognition
  • automatic feedback
  • phonetics
Open access

Designing a Corpus of Czech Monologues: Orator v2

Published online: 30 Dec 2021
Pages: 520 - 530

Abstract

ORATOR v2 is a new 1.5M word corpus of Czech monologues, delivered to a live audience in semi-formal to formal settings. It was designed to chart the space of naturally occurring monologues which can be obtained for corpus processing. As such, it aims for diversity but does not attempt any balancing of subcategories, recognizing that some types of data are inherently easier to obtain in high volume than others. The transcription guidelines and annotation tools employed are the same as other recent spoken corpora published by the CNC, which facilitates interesting comparisons between various types of spoken Czech. The present paper sketches out three case studies, comparing ORATOR to the informal conversations of ORTOFON v2 in terms of the frequencies of demonstratives and hesitations, as well as lexical richness.

Keywords

  • speech
  • corpus
  • monologue
  • Czech
Open access

Sharing Data Through Specialized Corpus-Based Tools: The Case of GramatiKat

Published online: 30 Dec 2021
Pages: 531 - 544

Abstract

This paper presents the specialized corpus tool GramatiKat in the context of Open Science principles, namely data sharing, which offers opportunities for original research and facilitates the verifiability of research and building on previous research. The tool is designed primarily for examining grammatical categories from a quantitative point of view. It offers grammatical profiles of particular lemmas (currently 14 thousand Czech nouns) and the proportion of individual grammatical categories within a part of speech, i.e., the standard behavior of a word class. The data in GramatiKat are pre-processed, statistically evaluated, and presented in charts and tables for clarity, and they are available to other linguists, especially from the fields of morphology and lexicography. This article aims to provide inspiration and support to corpus and non-corpus linguists in making better use of existing tools and in creating new specialized tools available to other users.

Keywords

  • specialized corpus tools
  • grammatical category
  • morphology
  • lexicography
  • Open Science
Open access

The New Value of the Structural Attribute Section in the SYN v8 Corpus and its Possible Application in Linguistic Research

Published online: 30 Dec 2021
Pages: 545 - 555

Abstract

The paper introduces a new section separated from journalistic texts in Czech corpora, namely interviews. This genre is highly specific; from among the texts that can be found in newspapers and magazines, it is probably the closest to spoken language. In two case studies, we present the possible application of the interviews subcorpus in linguistic research. The first one deals with the role of paralinguistic behaviour, especially laughter in written interviews vs. spoken dialogues. The second one investigates the specifics of the demonstrative ten in the function of a nominal attribute, again in both written and spoken data.

Keywords

  • Czech spoken corpora
  • interviews
  • paralinguistic behaviour
  • determiner
Open access

An HMM-Based PoS Tagger for Old Church Slavonic

Published online: 30 Dec 2021
Pages: 556 - 567

Abstract

We present a hybrid HMM-based PoS tagger for Old Church Slavonic. The training corpus is a portion of one text, Codex Marianus (40k), annotated with the Universal Dependencies UPOS tags in the UD-PROIEL treebank. We perform a number of experiments in within-domain and out-of-domain settings, in which the remaining part of Codex Marianus serves as a within-domain test set, and Kiev Folia is used as an out-of-domain test set. Analysing by-PoS-class precision and sensitivity in each run, we combine a simple context-free n-gram-based approach with a Hidden Markov Model (HMM) and add linguistic rules for specific cases such as punctuation and digits. While the model achieves a rather unimpressive accuracy of 81% in in-domain settings, we observe an accuracy of 51% in out-of-domain evaluation, which is comparable to the results of large neural architectures based on pre-trained contextual embeddings.
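
To make the HMM component concrete, here is a minimal sketch of Viterbi decoding for a bigram HMM tagger. It is a generic textbook formulation, not the authors' hybrid system: the n-gram component and the linguistic rules are omitted, and the toy probabilities and English tokens are invented purely for illustration.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable tag sequence for words under a bigram HMM."""
    if not words:
        return []

    def logp(p):
        # Floor zero probabilities so unseen events do not break the log.
        return math.log(max(p, floor))

    # best[i][t] = (log-prob of best path ending in tag t at position i, previous tag)
    best = [{t: (logp(start_p.get(t, 0)) + logp(emit_p[t].get(words[0], 0)), None)
             for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            prev, score = max(
                ((p, best[-1][p][0] + logp(trans_p[p].get(t, 0))) for p in tags),
                key=lambda pair: pair[1])
            column[t] = (score + logp(emit_p[t].get(word, 0)), prev)
        best.append(column)

    # Backtrace from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


# Toy model with UPOS-style tags; all numbers are invented.
TAGS = ["DET", "NOUN", "VERB"]
START = {"DET": 0.8, "NOUN": 0.1, "VERB": 0.1}
TRANS = {"DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
         "NOUN": {"DET": 0.1, "NOUN": 0.1, "VERB": 0.8},
         "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1}}
EMIT = {"DET": {"the": 1.0},
        "NOUN": {"dog": 0.6, "barks": 0.4},
        "VERB": {"barks": 0.9, "dog": 0.1}}

print(viterbi(["the", "dog", "barks"], TAGS, START, TRANS, EMIT))
# ['DET', 'NOUN', 'VERB']
```

In a real tagger the three probability tables would be estimated from the annotated training portion, with smoothing in place of the simple probability floor used here.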

Keywords

  • HMM tagger
  • Old Church Slavonic
  • PoS tagging
  • hybrid models
  • Universal Dependencies
Open access

Building an Educational Language Portal Using Existing Dictionary Data

Published online: 30 Dec 2021
Pages: 568 - 578

Abstract

The article presents the process of building the Franček Slovenian language portal aimed at primary- and secondary-school students. We discuss problems and solutions of linking and adapting existing non-pedagogical dictionaries for school use, while overcoming content and structural differences among the dictionaries. We also present some solutions within the process of adaptation to the online medium and visualisation adjustments for three age groups of school users with different content needs and levels of (meta)linguistic knowledge.

Keywords

  • pedagogical lexicography
  • language portal
  • Slovenian language
  • dictionary linking
  • children’s dictionary
Open access

StressDat – Database of speech under stress in Slovak

Published online: 30 Dec 2021
Pages: 579 - 589

Abstract

The paper describes the methodology for creating a Slovak database of speech under stress, together with pilot observations. While the relationship between stress and speech characteristics can be utilized in a wide range of speech technology applications, research on it suffers from a lack of suitable databases, particularly for conversational speech. We propose a novel procedure for recording acted speech in the actors' homes using their own smartphones. We describe both the collection of speech material under three levels of stress and the subsequent annotation of stress levels in this material. First observations suggest a reasonable inter-annotator agreement, as well as interesting avenues for exploring the relationship between the intended stress levels and those perceived in speech.
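
The abstract does not say which agreement statistic was used. As one common possibility, the sketch below computes Cohen's kappa for two annotators who each assign one of three stress levels to the same items. The choice of statistic, the function name, and the labels are assumptions for illustration only.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label]
                   for label in set(labels_a) | set(labels_b)) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented annotations with stress levels 0, 1, 2.
annotator_1 = [0, 1, 2, 0, 1, 2]
annotator_2 = [0, 1, 2, 0, 1, 1]
print(cohens_kappa(annotator_1, annotator_2))  # 0.75
```

Because stress levels are ordered, a weighted variant that penalises a 0-versus-2 disagreement more than a 0-versus-1 disagreement may be more appropriate; the unweighted version is shown for simplicity.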

Keywords

  • speech database
  • speech under stress
  • stress annotation
  • inter-annotator agreement
Open access

Linguistic Annotation of Translated Chinese Texts: Coordinating Theory, Algorithms and Data

Published online: 30 Dec 2021
Pages: 590 - 602

Abstract

The article tackles the problems of linguistic annotation of the Chinese texts in Ruzhcorp, the Russian-Chinese Parallel Corpus of the RNC, and the ways to solve them. Particular attention is paid to the processing of Russian loanwords. On the one hand, we present a theoretical comparison of the widespread standards of Chinese text processing. On the other hand, we describe our experiments in three fields (word segmentation, grapheme-to-phoneme conversion, and PoS-tagging) on specific corpus data that contains many transliterations and loanwords. As a result, we propose a preprocessing pipeline for the Chinese texts that will be implemented in Ruzhcorp.

Keywords

  • Mandarin
  • Russian
  • parallel corpus
  • Chinese word segmentation (CWS)
  • grapheme-to-phoneme conversion (G2P)
  • PoS-tagging
  • code-switching detection
Open access

A Robust Approach to Variation in Carpathian Rusyn: Resampling-Based Methods for Small Data Sets

Published online: 30 Dec 2021
Pages: 603 - 617

Abstract

Quantitative, corpus-based research on spontaneous spoken Carpathian Rusyn can run into several data-related problems: speakers use ambivalent forms in different quantities, resulting in a biased data set, while a stricter data-cleaning process would lead to large-scale data loss. On top of that, polytomous categorical dependent variables are hard to analyze due to methodological limitations. This paper provides several approaches for dealing with unbalanced and biased data sets containing variation in the conjugational forms of the verbs maty ‘to have’ and (po-)znaty ‘to know’ in Carpathian Rusyn. Using resampling-based methods such as Cross-Validation, Bootstrapping and Random Forests, we provide a strategy for circumventing possible methodological pitfalls and gaining the most information from our precious data, without trying to p-hack the results. By calculating the predictive power of several sociolinguistic factors on linguistic variation, we can make valid statements about the (sociolinguistic) status of Rusyn and the stability of the old dialect continuum of Rusyn varieties.
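
Of the resampling methods named above, bootstrapping is the simplest to sketch. The example below computes a percentile bootstrap confidence interval for the share of one variant in a small sample. It is a generic illustration, not the authors' procedure; the variant codes "A" and "B" and the counts are hypothetical placeholders.

```python
import random


def bootstrap_ci(observations, statistic, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for statistic(observations)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(observations)
    # Resample with replacement, recompute the statistic, and sort the results.
    estimates = sorted(
        statistic([rng.choice(observations) for _ in range(n)])
        for _ in range(n_resamples))
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def share_of_a(sample):
    """Proportion of tokens realised as the hypothetical variant 'A'."""
    return sum(1 for item in sample if item == "A") / len(sample)


# Invented data: 30 tokens of variant A and 70 of variant B.
data = ["A"] * 30 + ["B"] * 70
print(bootstrap_ci(data, share_of_a))
```

The width of the interval shows how much the estimated share could vary with a data set of this size, which is the kind of uncertainty that matters for small corpora.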

Keywords

  • oral corpora
  • border effects
  • language variation
  • spoken language corpus
  • robust statistics
  • Carpathian Rusyn
Open access

A Corpus of Czech Essays from the Turn of the 1900s

Published online: 30 Dec 2021
Pages: 618 - 630

Abstract

A literary essay is an interesting unit for language analyses, as its stylistic means often exceed the boundaries of the genre of an artistic essay. The article presents a new corpus of Czech literary essays covering approximately fifty years from 1890 to 1940. Along with the characterisation of the corpus and its annotation, the paper focuses on the TxM corpus tool: in the second part of the study, we use selected texts to analyse seven different authors through multidimensional cluster analysis, factorial correspondence analysis and a specificity score. The main parameter of the analyses was the usage of parts of speech in texts by individual authors. At present, the Corpus of Czech Essays contains 40 essayist titles written by 15 authors covering various topics (music, visual arts, theatre, literature, etc.).

Keywords

  • annotation
  • corpus
  • corpus linguistics
  • quantitative analysis
  • literary essay
  • multidimensional analysis
  • orthography
  • specificity score
  • TxM
Open access

Building Czech Textbook Corpora (UcebKo) for Word-Formation Research of Czech as a Second Language

Published online: 30 Dec 2021
Pages: 631 - 640

Abstract

This work-in-progress paper presents a specialized language corpus UcebKo built from textbooks of Czech for foreigners. The corpus integrates three subcorpora (UcebKo-A2, UcebKo-B1, and UcebKo-B2) which allow research of Czech as a second/foreign language at chosen language levels (A2, B1, and B2). In this case, the research is focused on word-formation, where the first results, i.e., mapping of derived words denoting persons, illustrate the approach and methodology used.

Keywords

  • word-formation
  • derivational morphology
  • textbook corpus
  • Czech as a second language
  • names of persons
Open access

On Conceptual and Axiological Aspects of the Word Mutter ʻMotherʼ in Context (Based on Corpus Material)

Published online: 30 Dec 2021
Pages: 643 - 655

Abstract

This paper focuses on the linguistic image of mother in German. It seeks to grasp it through the typical context of the German word Mutter ʻmotherʼ. The research is based on the results of distributional and thematic analyses of this word. These analyses are used as a basis for reconstructing prototypical characteristics of “mother” and the related concepts used by speakers of German. The paper develops these findings by compiling the most frequent collocations and other (mostly contextual) information gathered with corpus tools. The paper concludes with an outline of unconscious axiological processes used in evaluating the image of mother on the good/bad axis.

Keywords

  • corpus linguistics
  • collocation profile
  • context
  • linguistic image of a word
  • axiology
Open access

Czech Translations of the Gospel of Matthew from the Diachronic Point of View – Plus Ça Change…

Published online: 30 Dec 2021
Pages: 656 - 666

Abstract

The paper focuses on the dynamics of change in several linguistic and textual properties in the diachronic development of Czech. Specifically, we analyze the proportion of identical word-forms (types), the average type length, text length, the proportion of hapax legomena, the moving average type-token ratio, and entropy. For the analysis, seven translations of the Gospel of Matthew from the 14th to the 21st century were used. The study reveals some differences in the dynamics of change of particular properties.
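
The properties listed in this abstract can be illustrated with a short sketch. This is not the authors' implementation: the function name, the default window size, and the exact definitions (types and hapax legomena as proportions of tokens, entropy as Shannon entropy over word-form frequencies) are assumptions made for illustration only.

```python
from collections import Counter
from math import log2


def text_properties(tokens, window=100):
    """Compute simple lexical statistics for a list of word-form tokens.

    All definitions are illustrative assumptions, not taken from the paper.
    """
    if not tokens:
        raise ValueError("tokens must not be empty")
    n = len(tokens)
    counts = Counter(tokens)
    n_types = len(counts)
    n_hapax = sum(1 for c in counts.values() if c == 1)

    # Moving-average type-token ratio: mean TTR over all windows of a fixed
    # size (the window shrinks to the text length for very short texts).
    w = min(window, n)
    ttrs = [len(set(tokens[i:i + w])) / w for i in range(n - w + 1)]

    return {
        "text_length": n,
        "type_proportion": n_types / n,
        "avg_type_length": sum(len(t) for t in counts) / n_types,
        "hapax_proportion": n_hapax / n,
        "mattr": sum(ttrs) / len(ttrs),
        # Shannon entropy (in bits) of the word-form frequency distribution.
        "entropy": -sum((c / n) * log2(c / n) for c in counts.values()),
    }
```

Applied to each translation in turn, such a function would yield one value per property and per period, which is the kind of series whose change over time the study compares.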

Keywords

  • Gospel of Matthew
  • diachronic development
  • dynamics of changes
  • properties of text
  • Czech language
Open access

Income, Nationality and Subjectivity in Media Text

Published online: 30 Dec 2021
Pages: 667 - 678

Abstract

This article takes a bird’s eye view of how positive or negative sentiments in the news press about countries and nationality nouns seem to reflect the country’s general income groups. The study focuses on the four income groups classified by the World Bank and their co-occurrence with positively and negatively classified adjectives from the Subjectivity Lexicon for Czech. A search in the journalistic subcorpus of the SYN series, release 8 of the Czech National Corpus, results in a time line covering three decades. Previous research on subjectivity has either focused on other parts of the Subjectivity Lexicon or on fewer adjectives from other languages. In this article, it is argued that the income groups are treated in descending order, i.e., the higher the income, the more positive the sentiment. Even when the most influential groups in the top and bottom are removed, the result holds. Discourse concerning global war and peace, and the security of different nations, is also detected as a result.

Keywords

  • income groups
  • news press
  • sentiment
  • nationality
  • corpus linguistics
  • Czech language
Open access

Key Words and Political Parties in the 2020 Pre-Election Campaign on Facebook

Published online: 30 Dec 2021
Pages: 679 - 689

Abstract

The paper analyses key words in the pre-election communication of electorally successful political parties. On this basis, it identifies the main communication differences among those parties, the specifics of pre-election communication, and the pre-election discourse as a whole. The research material consists of microblogs published on the Facebook profiles of individual political parties in the period from January 1, 2020 to February 28, 2020, with a reference corpus formed by the total of these microblogs. The analysis showed professionalization of political communication, the use of new, but also traditional ways of interaction with the electorate, pre-election communication based on the presentation of candidates, an offensive and combative tone of the most successful parties, self-presentation, hints of persuasive and manipulative techniques, topic points of electoral programmes, but also thematic neutrality and non-specificity that suggest smaller electoral success.

Keywords

  • key words analysis
  • Facebook
  • microblog
  • pre-election communication
Open access

‘And we are Stuck in One Place, Minister.’ A Study of Evasiveness in Replies to Face-Threatening Questions in Slovak Political Interviews on Scandals (A Combined Approach)

Published online: 30 Dec 2021
Pages: 690 - 704

Abstract

The phenomenon of political evasiveness in the genre of the political interview has been the focus of several discourse studies employing conversation analysis, critical discourse analysis and the social psychology approach. Most of these studies offer a detailed qualitative analysis of political discourse, identifying a wide range of communication strategies that permit politicians to ambiguate their agency and at the same time boost their positive face. Since these strategies may change over time and also depend on a culture-specific environment, the aim of this paper is to discover a) which evasive communicative strategies were employed by Slovak politicians in 2012–2016, b) which lexical substitutions they used most frequently to avoid negative connotations of face-threatening questions, and finally, c) which cognitive frames formed a frequent conceptual background of their evasive political argumentation. The paper draws on a combination of the quantitative and qualitative approach to the analysis of non-replies devised by Bull and Mayer (1993) and critical discourse analysis, applied to a sample of five Slovak radio interviews aired on Rádio Express. The selection of interviews was not random: in each interview the politician was asked highly conflictual questions about bribery, embezzlement or disputes in the coalition. Based on Dulebová's qualitative research on Russian-Slovak political discourse (2009), it is hypothesized that a) the evasive strategy of ‘attack’ on the opposition and ‘attack on the interviewer’ would occur in our sample with the highest prominence in the speech of the former Prime Minister Fico, and b) the politicians accused of direct involvement in scandals would be the most evasive ones.

Keywords

  • attack
  • CDA
  • corruption
  • evasiveness
  • face-threat
  • hedging
  • interview
  • media
  • metaphor
  • muzhik
  • scandals
  • social psychology
  • political discourse
Open access

Lexical Bundles in the Corpus of Slovak Judicial Decisions

Published online: 30 Dec 2021
Pages: 705 - 718

Abstract

The paper follows the tradition of research in legal linguistics and into formulaic language, specifically into lexical bundles. The aim of the paper is to describe lexical bundles in samples from the corpus of Slovak judicial decisions OD-JUSTICE by means of quantitative characteristics of the identified bundles and by their comparison with bundles found in two other specialized corpora: the corpus of Slovak legal regulations and the corpus of annual reports by Slovak public institutions. For the identification of bundles, the concept of the h-point was used. Identified bundles are described with respect to their maximal, minimal, average, median and mode values, distributions and ratios. The paper also outlines an interpretation of these bundle characteristics with regard to the communicative function(s) of the compared document genres.
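
The abstract does not spell out how the h-point is computed for bundles. In quantitative linguistics it is commonly defined on a rank-frequency distribution as the point where rank equals frequency. The sketch below assumes that definition, with linear interpolation between neighbouring ranks when no rank matches exactly. The function name and input format are hypothetical, and the authors' procedure may differ.

```python
def h_point(frequencies):
    """Return the h-point of a rank-frequency distribution, or None.

    Assumes positive frequencies; ranks start at 1 (illustrative sketch).
    """
    if any(f <= 0 for f in frequencies):
        raise ValueError("frequencies must be positive")
    freqs = sorted(frequencies, reverse=True)
    for rank, freq in enumerate(freqs, start=1):
        if freq == rank:
            return float(rank)
        if freq < rank and rank > 1:
            # No exact match: intersect the line through the two neighbouring
            # (rank, frequency) points with the diagonal frequency == rank.
            prev_rank, prev_freq = rank - 1, freqs[rank - 2]
            return (prev_freq * rank - freq * prev_rank) / (
                rank - prev_rank + prev_freq - freq
            )
        if freq < rank:
            # Only reachable for non-integer frequencies below 1 at rank 1.
            return None
    # Every frequency exceeds its rank, so the curves never cross.
    return None
```

Under this assumed definition, items ranked above the h-point are the high-frequency ones, which is one plausible way a frequency cut-off for bundles could be derived.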

Keywords

  • lexical bundles
  • formulaic language
  • judicial decisions
  • specialized discourse
  • legal linguistics
  • pattern-driven research
Open access

From Graphematics to Phrasal, Sentential, and Textual Semantics Through Morphosyntax by Means of Corpus-Driven Grammar and Ontology: A Case Study on One Tibetan Text

Published online: 30 Dec 2021
Pages: 319 - 329

Abstract

This article presents the current results of an ongoing study of the possibilities of fine-tuning automatic morphosyntactic and semantic annotation by improving the underlying formal grammar and ontology, using one Tibetan text as an example. The ultimate purpose of the work at this stage was to improve linguistic software developed for natural-language processing and understanding in order to achieve complete annotation of a specific text and a state of the formal model in which all linguistic phenomena observed in the text would be explained. This purpose includes the following tasks: analysis of error cases in the annotation of the text from the corpus; elimination of these errors in automatic annotation; development of the formal grammar and updating of dictionaries. Along with the morphosyntactic analysis, the current approach involves simultaneous semantic analysis as well. The article describes the semantic annotation of the corpus, required by grammar revision and development, which was made with the use of a computer ontology. The work is carried out with one of the corpus texts, the grammatical poetic treatise Sum-cu-pa (7th century).

Keywords

  • Tibetan language
  • computer ontology
  • Tibetan corpus
  • natural language processing
  • corpus linguistics
  • parsing
Open access

Artificial Homonymy

Published online: 30 Dec 2021
Pages: 330 - 341

Abstract

The paper discusses the homonymy of Czech nouns with different or varying genders. Lemmas with this type of homonymy are treated as separate in the new release of the MorfFlex dictionary. We show that the separation of paradigms according to gender is not only superfluous, but also clumsy, because it forces a choice when making one is not necessary. That is why we call this type of homonymy “artificial”.

Keywords

  • homonymy
  • polysemy
  • gender variation
  • dictionary
Open access

Typological Profiling of English, Spanish, German and Slovak: A Corpus-Based Approach

Published online: 30 Dec 2021
Pages: 342 - 352

Abstract

Inspired by earlier work on typological profiling of English by Benedikt Szmrecsányi and Bernd Kortmann ([1], [2], [3]), this paper investigates the typological profiles of English, Spanish, German, and Slovak, applying Szmrecsányi and Kortmann’s methodology of calculating a SYNTHETICITY INDEX and an ANALYTICITY INDEX based on 1,000-word corpus samples. The results show that Szmrecsányi and Kortmann’s methodology is replicable, and confirm claims in the literature about degrees of analyticity and syntheticity of these languages. Instead of a simple analytic-synthetic continuum, Szmrecsányi and Kortmann’s “typological space” [3] is used to visualize results, showing that languages can be both synthetic and analytic to varying degrees.

Keywords

  • typological profiling
  • syntheticity index
  • analyticity index
  • typological space
  • English
  • German
  • Spanish
  • Slovak
  • corpus samples
Open access

Acquiring Word Order in Slovak as a Foreign Language: Comparison of Slavic and Non-Slavic Learners Utilizing Corpus Data

Published online: 30 Dec 2021
Pages: 353 - 370

Abstract

The paper deals with the acquisition of Slovak word order in written texts of students of Slovak as a foreign language. Its attention is focused on identifying the correct and incorrect placement of enclitic components, and their erroneous usage is analysed with respect to different investigated variables (types of enclitic components, types of syntactic construction, distance from lexical/syntactic anchor, and realization in pre- or post-verbal position). The paper also pays attention to the error rate regarding individual proficiency levels of students, and error distribution in two language groups, Slavic and Non-Slavic learners, is compared.

Keywords

  • word order
  • enclitics
  • error analysis
  • syntactic complexity
  • Slavic learners
  • Non-Slavic learners
  • acquisition stages
  • interlanguage
Open access

Systemic and non-systemic valency behavior of Czech deverbal adjectives

Published online: 30 Dec 2021
Pages: 371 - 382

Abstract

We present the results of an automatic comparison of valency frames of interlinked adjectival and verbal lexical units based on the valency lexicons NomVallex and VALLEX. We distinguish nine derivational types of deverbal adjectives and examine whether they tend to display systemic or non-systemic valency behavior. The non-systemic valency behavior includes changes in the number of valency complementations and, more dominantly, non-systemic forms of actants, especially a prepositional group.

Keywords

  • deverbal adjective
  • derivational type
  • non-systemic valency
  • passive valency
Open access

Towards classification of stative verbs in view of corpus data

Published online: 30 Dec 2021
Pages: 383 - 393

Abstract

The paper presents work in progress on the compilation and automatic annotation of a dataset comprising examples of stative verbs in parallel Bulgarian-Russian corpora with the goal of facilitating the elaboration of a classification of stative verbs in the two languages based on their lexical and semantic properties. We extract stative verbs from the Bulgarian and the Russian WordNets with their assigned conceptual information (frames) from FrameNet. We then assign the set of probable Bulgarian and Russian stative verbs to the verb instances in a parallel Bulgarian-Russian corpus using WordNet correspondences to filter out unlikely stative candidates. Further, manual inspection will ensure high quality of the resource and its application for the purposes of semantic analysis.

Keywords

  • stative verbs
  • parallel corpora
  • semantic annotation
Open access

Usage and empirical productivity of international adjectival suffixes in Slovak based on general and specialised corpora

Published online: 30 Dec 2021
Pages: 394 - 404

Abstract

The paper attempts to identify the usage and productivity of five different international suffixes in Slovak by means of corpus evidence. The analysis focuses on real and potential productivity in a two-stage comparison: 1) tokens/lemmas occurring in a general balanced corpus vs general corpus of specialised and academic texts, 2) general corpus of specialised and academic texts vs specialised (sub)corpora of medical, legal, economic and religious texts. The aim of the analysis is to explore whether productivity varies across registers by means of statistical measures.
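
The keywords of this article name realized and potential productivity. The following is a minimal sketch, assuming Baayen-style definitions: realized productivity as the number of distinct lemmas carrying the suffix, and potential productivity as the share of hapax legomena among all tokens carrying the suffix. The abstract does not state which statistical measures the authors apply. Identifying suffixed lemmas by their string ending is a deliberate simplification, and the function name is hypothetical.

```python
from collections import Counter


def suffix_productivity(lemmas, suffix):
    """Return (realized, potential) productivity for one suffix.

    `lemmas` is a list with one lemma per corpus token. Matching by string
    ending is a crude stand-in for real morphological analysis.
    """
    counts = Counter(lemma for lemma in lemmas if lemma.endswith(suffix))
    n_tokens = sum(counts.values())
    # Realized productivity: number of distinct lemmas (types) with the suffix.
    realized = len(counts)
    # Potential productivity: hapax legomena with the suffix per suffixed token.
    hapaxes = sum(1 for c in counts.values() if c == 1)
    potential = hapaxes / n_tokens if n_tokens else 0.0
    return realized, potential
```

Running the function on a general corpus and on each specialised (sub)corpus would give the pairs of values that a two-stage comparison across registers could contrast.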

Keywords

  • productivity
  • realized productivity
  • potential productivity
  • general corpus
  • specialised corpus
  • adjective
  • suffix
Open access

The Menzerath-Altmann law as the relation between lengths of words and morphemes in Czech

Published online: 30 Dec 2021
Pages: 405 - 414

Abstract

It is shown that the mean morpheme length (measured in phonemes) decreases with the increasing length of word types (in morphemes) in Czech texts, i.e., these language units behave according to the Menzerath-Altmann law. The law is not valid in general for word tokens. Some hints towards an interpretation of parameters are presented.
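
For orientation, the Menzerath-Altmann law is usually written in the general form below. The abstract does not say which variant the authors fit, and sign conventions for the parameters vary across the literature, so this is background rather than the paper's model.

```latex
y(x) = a \, x^{b} \, e^{-c x}
```

Here x is the length of the construct (word length in morphemes), y is the mean length of its constituents (morpheme length in phonemes), and a, b, c are parameters. A common simplification sets c = 0, giving y(x) = a x^{b}, where b < 0 corresponds to the decreasing trend described in the abstract.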

Keywords

  • Menzerath-Altmann law
  • word
  • morpheme
  • phoneme
  • Czech
Open access

Persistent features – Corpus-based evidence for reallocation processes in German

Published online: 30 Dec 2021
Pages: 415 - 424

Abstract

This study aims to trace the reallocation of a grammatical feature along the dialect-standard axis with the aid of corpus linguistics methods, more precisely through an integrative application of quantitative and qualitative approaches. The phenomenon under investigation is articles without the definiteness marker d- in German, usually ascribed to the Bavarian dialect area. Analyses show, however, that this apparently dialectal feature diffuses to other communication settings closer to the intended standard language use. This process is accompanied by a refunctionalisation of reduced article forms, indicating the relevance of language-internal relations for the reallocation of grammatical features. The methodological approach should be easily applicable to other variants and, as many European languages show a diaglossic repertoire, relevant to other languages as well.

Keywords

  • reallocation
  • article system
  • Bavarian
  • dialect-standard axis
Open access

On corpus-driven research of complex adverbial prepositions with spatial meaning in Czech

Published online: 30 Dec 2021
Pages: 425 - 433

Abstract

Complex adverbial prepositions with spatial meaning have not been sufficiently studied so far in Czech. To establish a set of these expressions in their actual usage, the resources of the Czech National Corpus were used in this study. The research has shown that the SYN2020 corpus is a relevant tool for searching for two-word expressions with a LOCATIVE ADVERB – SIMPLE PREPOSITION structure that have the same function as a one-word locative preposition. The article describes a method for the extraction of these expressions from the corpus, as well as a method for the collection of their quantitative data using corpus tools. As a result of the research, a list of expressions that are presumably complex prepositions is provided.

Keywords

  • complex preposition
  • locative adverb
  • spatial meaning
  • Czech language
  • Czech National Corpus
Open access

The study of valency is biased toward more frequent verbs: A corpus study of the valency of less frequent verbs in Czech

Published online: 30 Dec 2021
Pages: 434 - 443

Abstract

Theories of valency and valency dictionaries are inevitably and understandably based on the valency behavior of frequent verbs. This paper scrutinizes 154 low-frequency Czech verbs and argues that they demonstrate that Czech verbs are more malleable in their valency behavior than suggested by the literature. It is argued that this fits better within a constructionist approach to valency rather than a lexicalist one. Furthermore, the paper illustrates two alternations, previously unrecognized for Czech as semantic diatheses, namely the causative-inchoative alternation and the Agent-Means alternation.

Keywords

  • valency
  • valency alternation
  • causativity
  • frequency
Open access

Between adverbs and particles: A corpus study of selected intensifiers

Published online: 30 Dec 2021
Pages: 444 - 453

Abstract

In this paper, we present a preliminary study of three intensifiers (absolutně, naprosto, úplně) based on data from three different corpora: the written corpus SYN2020, the web corpus ONLINE-ARCHIVE, and the spoken corpus ORTOFON 1. Providing a parallel annotation of a random sample of each intensifier, we focus on their functions and meanings in context. We analyse their properties in order to define the features that are relevant to their word class assignment, and to prepare the ground for future disambiguation tasks.

Keywords

  • particles
  • adverbs
  • intensifiers
  • corpus
  • Czech
Open access

Capturing Numerals and Pronouns at the Morphological Layer in the Prague Dependency Treebanks of Czech

Published online: 30 Dec 2021
Pages: 454 - 464

Abstract

The paper presents a novel and unified morphological description of numerals and pronouns, as compiled for the newest edition of the Prague Dependency Treebank (Prague Dependency Treebank – Consolidated 1.0) and its integral part the morphological dictionary MorfFlex. On the basis of considerable experience with real data annotation and the use of the morphological dictionary, particular changes were proposed. For both of the parts of speech a new set of subtypes was proposed, based mainly on the morphological criterion and its combination with semantic properties and other relevant features, such as definiteness in numerals and possessivity, reflexivity, and clitichood in pronouns. Each subtype has a specific value at the 2nd position of the morphological tag, which serves also as an indicator of the applicability of other tag categories.

Keywords

  • numerals
  • pronouns
  • morphology
  • treebank
  • annotation
  • Czech
Open access

English detached adjectival constructions with an explicit subject: A quantitative corpus-based analysis

Published online: 30 Dec 2021
Pages: 465 - 474

Abstract

This article reports on the quantitative corpus-based investigation into the form-function interplay of the English detached adjectival construction with an explicit subject. Taking Usage-based Construction Grammar as its theoretical framework, this paper investigates the patterns of attraction of lexical items that appear in the main slots of the grammatical construction. The data obtained substantiate the constructional status of the construction and determine its semantic and functional specification in present-day English.

Keywords

  • detached clauses
  • Usage-based Construction Grammar
  • grammatical construction
  • simple collexeme analysis
Open access

Using a parallel corpus to adapt the Flesch Reading Ease formula to Czech

Published online: 30 Dec 2021
Pages: 477 - 487

Abstract

Text readability metrics assess how much effort a reader must put into comprehending a given text. They are used, for example, to choose appropriate readings for different student proficiency levels, or to make sure that crucial information is efficiently conveyed (e.g., in an emergency). Flesch Reading Ease is so widely used that it is even integrated into Microsoft Word. However, its constants are language-dependent. The original formula was created for English. So far it has been adapted to several European languages, Bangla, and Hindi. This paper describes the Czech adaptation, with the language-dependent constants optimized by a machine-learning algorithm working on parallel corpora of Czech and English, Russian, Italian, and French, respectively.
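
To make the role of the language-dependent constants concrete, here is a minimal sketch of the Flesch Reading Ease score. The default constants are those of the original English formula. The Czech constants derived in the paper are not given in the abstract and are therefore not reproduced here. The function and parameter names are hypothetical, and counting words, sentences and syllables is left to the caller.

```python
def flesch_reading_ease(words, sentences, syllables,
                        base=206.835, sentence_weight=1.015,
                        syllable_weight=84.6):
    """Flesch Reading Ease from raw counts.

    Defaults are the original English constants. A language adaptation
    would pass its own base, sentence_weight and syllable_weight.
    """
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    avg_sentence_length = words / sentences      # words per sentence
    avg_syllables_per_word = syllables / words   # syllables per word
    return (base
            - sentence_weight * avg_sentence_length
            - syllable_weight * avg_syllables_per_word)
```

Keeping the three constants as parameters mirrors the idea of the adaptation: the shape of the formula stays fixed while the constants are re-estimated for each language.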

Keywords

  • complexity
  • parallel corpus
  • Czech
  • Flesch Reading Ease
  • machine learning
Open access

A synchronic and diachronic computer corpus of Makarska littoral dialects (Croatia)

Published online: 30 Dec 2021
Pages: 488 - 501

Abstract

This paper presents a synchronic and diachronic computer corpus of Makarska littoral dialects. The corpus was created as part of a project to explore the ikavian neoštokavian dialects of the narrow coastal area around the town of Makarska in the Croatian region of Dalmatia. The dialectological characteristics of the dialects studied are briefly presented first, followed by a presentation of the digital system. The system is logically organized in two parts: the first is a corpus of digitally processed literary texts created from 1729 to 1803, and the second consists of materials collected through dialectological questionnaires prepared and methodologically adapted as part of the creation of the Croatian Linguistic Atlas. The methods of collecting linguistic data, the method of input into digital form, and the methods and possibilities of data processing are explained. Based on the input and search strategies within the system, examples are used to show that the dialects of the Makarska littoral originate in the ikavian neoštokavian dialect described in the dialectological literature. This computer-based approach is a novelty in Croatian dialectology, which has not been digitally processed so far, and offers a basis for future dialectological research. The platform can be used to shorten data processing time and to analyse the data more systematically and efficiently. So far, there has been no such digital repository for any Croatian speech variety. The project represents a thorough synchronic and diachronic study of one self-contained language area.

Keywords

  • spoken corpus
  • corpus design
  • computer corpus
  • dialect corpus
  • dialectology
  • štokavian
Open access

Mapka: A map application for working with corpora of spoken Czech

Published online: 30 Dec 2021
Pages: 502 - 509

Abstract

A new interactive map-based web application named Mapka was published by the Institute of the Czech National Corpus in 2020. It aims to serve linguists, as well as schools and the general public, and it features various functions described in this paper. Mapka was designed as a supplement to the CNC spoken corpora, starting with the DIALEKT corpus (more to come in the future). Its main function is to display various types of territorial division (primarily in terms of dialect, but also administrative) and networks of localities associated with the corpus. The main dialect regions are provided with overviews of their typical dialectal features and two samples of dialectal discourse – one slightly historical and one contemporary. The application offers the possibility of searching for municipalities, plotting the points on the map and creating a custom map. The paper concludes with future prospects concerning an enhanced and improved version of the application.

Keywords

  • corpus
  • map
  • Czech language
  • spoken language
  • dialect
Open access

L2 Czech Annotation for Automatic Feedback on Pronunciation

Published online: 30 Dec 2021
Pages: 510 - 519

Abstract

In this paper, we would like to provide a brief overview of the current state of pronunciation teaching in e-learning and demonstrate a new approach to building tools for automatic feedback concerning correct pronunciation based on the most frequent or typical errors in speech production made by non-native speakers. We will illustrate this in the process of designing annotation for a sound recognition tool to provide feedback on pronunciation. At the end of the paper, we will also present how we have tried to apply this annotation to the tool, what caveats we have found and what our plans are.

Keywords

  • pronunciation
  • L2
  • Czech
  • machine learning
  • neural networks
  • e-learning
  • annotation
  • speech recognition
  • automatic feedback
  • phonetics
Open access

Designing a Corpus of Czech Monologues: Orator v2

Published online: 30 Dec 2021
Pages: 520 - 530

Abstract

ORATOR v2 is a new 1.5M word corpus of Czech monologues, delivered to a live audience in semi-formal to formal settings. It was designed to chart the space of naturally occurring monologues which can be obtained for corpus processing. As such, it aims for diversity but does not attempt any balancing of subcategories, recognizing that some types of data are inherently easier to obtain in high volume than others. The transcription guidelines and annotation tools employed are the same as other recent spoken corpora published by the CNC, which facilitates interesting comparisons between various types of spoken Czech. The present paper sketches out three case studies, comparing ORATOR to the informal conversations of ORTOFON v2 in terms of the frequencies of demonstratives and hesitations, as well as lexical richness.

Keywords

  • speech
  • corpus
  • monologue
  • Czech
Open access

Sharing Data Through Specialized Corpus-Based Tools: The Case of GramatiKat

Published online: 30 Dec 2021
Pages: 531 - 544

Abstract

This paper presents a specialized corpus tool GramatiKat in the context of Open Science principles, namely data sharing, which offers opportunities for original research and facilitates verifiability of research and building on previous research. The tool is designed primarily for examining grammatical categories from the quantitative point of view. It offers grammatical profiles of particular lemmas (currently 14 thousand Czech nouns) and the proportion of individual grammatical categories within a part of speech, i.e., the standard behavior of a word class. The data in GramatiKat are pre-processed, statistically evaluated, and presented in charts and tables for clarity, and they are available to other linguists, especially from fields of morphology and lexicography. This article is aimed at providing inspiration and support to corpus and non-corpus linguists with utilization and enhanced use of the existing tools and with the creation of new specialized tools available to other users.

Keywords

  • specialized corpus tools
  • grammatical category
  • morphology
  • lexicography
  • Open Science
Open access

The New Value of the Structural Attribute Section in the SYN v8 Corpus and its Possible Application in Linguistic Research

Published online: 30 Dec 2021
Pages: 545 - 555

Abstract

The paper introduces a new section separated from journalistic texts in Czech corpora, namely interviews. This genre is highly specific; from among the texts that can be found in newspapers and magazines, it is probably the closest to spoken language. In two case studies, we present the possible application of the interviews subcorpus in linguistic research. The first one deals with the role of paralinguistic behaviour, especially laughter in written interviews vs. spoken dialogues. The second one investigates the specifics of the demonstrative ten in the function of a nominal attribute, again in both written and spoken data.

Keywords

  • Czech spoken corpora
  • interviews
  • paralinguistic behaviour
  • determiner
Open access

An HMM-Based PoS Tagger for Old Church Slavonic

Published online: 30 Dec 2021
Pages: 556 - 567

Abstract

We present a hybrid HMM-based PoS tagger for Old Church Slavonic. The training corpus is a portion of one text, Codex Marianus (40k), annotated with the Universal Dependencies UPOS tags in the UD-PROIEL treebank. We perform a number of experiments in within-domain and out-of-domain settings, in which the remaining part of Codex Marianus serves as the within-domain test set and Kiev Folia is used as the out-of-domain test set. Analysing by-PoS-class precision and sensitivity in each run, we combine a simple context-free n-gram-based approach with a Hidden Markov model (HMM) and add linguistic rules for specific cases such as punctuation and digits. While the model achieves a rather unimpressive accuracy of 81% in within-domain settings, we observe an accuracy of 51% in out-of-domain evaluation, which is comparable to the results of large neural architectures based on pre-trained contextual embeddings.

Keywords

  • HMM tagger
  • Old Church Slavonic
  • PoS tagging
  • hybrid models
  • Universal Dependencies
Open access

Building an Educational Language Portal Using Existing Dictionary Data

Published online: 30 Dec 2021
Pages: 568 - 578

Abstract

The article presents the process of building the Franček Slovenian language portal aimed at primary- and secondary-school students. We discuss problems and solutions of linking and adapting existing non-pedagogical dictionaries for school use, while overcoming content and structural differences among the dictionaries. We also present some solutions within the process of adaptation to the online medium and visualisation adjustments for three age groups of school users with different content needs and levels of (meta)linguistic knowledge.

Keywords

  • pedagogical lexicography
  • language portal
  • Slovenian language
  • dictionary linking
  • children’s dictionary
Open access

StressDat – Database of speech under stress in Slovak

Published online: 30 Dec 2021
Pages: 579 - 589

Abstract

The paper describes a methodology for creating a Slovak database of speech under stress and reports pilot observations. While the relationship between stress and speech characteristics can be utilized in a wide domain of speech technology applications, research on it suffers from a lack of suitable databases, particularly for conversational speech. We propose a novel procedure for recording acted speech in the actors' homes using their own smartphones. We describe both the collection of speech material under three levels of stress and the subsequent annotation of stress levels in this material. First observations suggest a reasonable inter-annotator agreement, as well as interesting avenues for exploring the relationship between the intended stress levels and those perceived in speech.

Keywords

  • speech database
  • speech under stress
  • stress annotation
  • inter-annotator agreement
Open access

Linguistic Annotation of Translated Chinese Texts: Coordinating Theory, Algorithms and Data

Published online: 30 Dec 2021
Pages: 590 - 602

Abstract

The article tackles the problems of linguistic annotation in the Chinese texts presented in Ruzhcorp, the Russian-Chinese Parallel Corpus of RNC, and the ways to solve them. Particular attention is paid to the processing of Russian loanwords. On the one hand, we present a theoretical comparison of the widespread standards of Chinese text processing. On the other hand, we describe our experiments in three fields (word segmentation, grapheme-to-phoneme conversion, and PoS-tagging) on specific corpus data that contain many transliterations and loanwords. As a result, we propose a preprocessing pipeline for the Chinese texts that will be implemented in Ruzhcorp.

Keywords

  • Mandarin
  • Russian
  • parallel corpus
  • Chinese word segmentation (CWS)
  • grapheme-to-phoneme conversion (G2P)
  • PoS-tagging
  • code-switching detection
Open Access

A Robust Approach to Variation in Carpathian Rusyn: Resampling-Based Methods for Small Data Sets

Published online: 30 Dec 2021
Pages: 603 - 617

Abstract

Quantitative, corpus-based research on spontaneous spoken Carpathian Rusyn raises several data-related problems: speakers use ambivalent forms in different quantities, resulting in a biased data set, while a stricter data-cleaning process would lead to large-scale data loss. On top of that, polytomous categorical dependent variables are hard to analyze due to methodological limitations. This paper provides several approaches for dealing with unbalanced and biased data sets containing variation in the conjugational forms of the verbs maty ‘to have’ and (po-)znaty ‘to know’ in Carpathian Rusyn. Using resampling-based methods such as cross-validation, bootstrapping and random forests, we provide a strategy for circumventing possible methodological pitfalls and gaining the most information from our precious data without p-hacking the results. By calculating the predictive power of several sociolinguistic factors on linguistic variation, we can make valid statements about the (sociolinguistic) status of Rusyn and the stability of the old dialect continuum of Rusyn varieties.
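To illustrate the bootstrapping idea on a small data set, the sketch below computes a percentile bootstrap interval for the share of one variant among the observed tokens. It is not the authors' procedure. The function name, the labels `variant_a` and `variant_b`, and the counts are hypothetical. Resampling individual tokens also assumes they are independent, which a per-speaker design would need to relax.

```python
import random


def bootstrap_proportion_ci(observations, target, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the share of `target`."""
    if not observations:
        raise ValueError("observations must not be empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(observations)
    estimates = []
    for _ in range(n_resamples):
        # Draw n observations with replacement and recompute the proportion.
        sample = [observations[rng.randrange(n)] for _ in range(n)]
        estimates.append(sum(1 for x in sample if x == target) / n)
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical data: 12 tokens of one variant, 18 of another.
data = ["variant_a"] * 12 + ["variant_b"] * 18
print(bootstrap_proportion_ci(data, "variant_a"))
```

The interval width shows how much a proportion estimated from only 30 tokens can move under resampling, which is the kind of uncertainty the abstract is concerned with.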

Keywords

  • oral corpora
  • border effects
  • language variation
  • spoken language corpus
  • robust statistics
  • Carpathian Rusyn
Open Access

A Corpus of Czech Essays from the Turn of the 1900s

Published online: 30 Dec 2021
Pages: 618 - 630

Abstract

A literary essay is an interesting unit for language analyses, as its stylistic means often exceed the boundaries of the genre of an artistic essay. The article presents a new corpus of Czech literary essays covering approximately fifty years from 1890 to 1940. Along with the characterisation of the corpus and its annotation, the paper focuses on the TxM corpus tool: in the second part of the study, we use selected texts to analyse seven different authors through multidimensional cluster analysis, factorial correspondence analysis and a specificity score. The main parameter of the analyses was the usage of parts of speech in the texts of individual authors. At present, the Corpus of Czech Essays contains 40 essayistic titles written by 15 authors covering various topics (music, visual arts, theatre, literature, etc.).

Keywords

  • annotation
  • corpus
  • corpus linguistics
  • quantitative analysis
  • literary essay
  • multidimensional analysis
  • orthography
  • specificity score
  • TxM
Open Access

Building Czech Textbook Corpora (UcebKo) for Word-Formation Research of Czech as a Second Language

Published online: 30 Dec 2021
Pages: 631 - 640

Abstract

This work-in-progress paper presents a specialized language corpus, UcebKo, built from textbooks of Czech for foreigners. The corpus integrates three subcorpora (UcebKo-A2, UcebKo-B1, and UcebKo-B2), which allow research on Czech as a second/foreign language at the chosen language levels (A2, B1, and B2). In this case, the research is focused on word-formation, where the first results, i.e., the mapping of derived words denoting persons, illustrate the approach and methodology used.

Keywords

  • word-formation
  • derivational morphology
  • textbook corpus
  • Czech as a second language
  • names of persons
Open Access

On Conceptual and Axiological Aspects of the Word Mutter ʻMotherʼ in Context (Based on Corpus Material)

Published online: 30 Dec 2021
Pages: 643 - 655

Abstract

This paper focuses on the linguistic image of the mother in the German language. It seeks to grasp it through the typical context of the German word Mutter ʻmotherʼ. The research is based on results of distributional and thematic analyses of these words. These analyses are used as a basis for reconstructing prototypical characteristics of “mother” and the related concepts used by speakers of German. The paper develops these findings by compiling the most frequent collocations and other (mostly contextual) information gathered with corpus tools. The paper concludes with an outline of unconscious axiological processes used in evaluating the image of the mother on the good/bad axis.

Keywords

  • corpus linguistics
  • collocation profile
  • context
  • linguistic image of a word
  • axiology
Open Access

Czech Translations of the Gospel of Matthew from the Diachronic Point of View – Plus Ça Change…

Published online: 30 Dec 2021
Pages: 656 - 666

Abstract

The paper focuses on the dynamics of change in several linguistic and textual properties during the diachronic development of Czech. Specifically, we analyze the proportion of identical word-forms (types), the average type length, text length, the proportion of hapax legomena, the moving average type-token ratio, and entropy. For the analysis, seven translations of the Gospel of Matthew from the 14th to the 21st century were used. The study reveals some differences in the dynamics of change of particular properties.
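Several of the listed properties are simple functions of a tokenized text. The sketch below shows one plausible way to compute four of them. The function names, the window size, and the normalisations (for example, hapax legomena counted relative to the number of tokens) are assumptions for illustration and need not match the definitions used in the paper. Each function expects a non-empty list of word-forms.

```python
import math
from collections import Counter


def average_type_length(tokens):
    """Mean length in characters of the distinct word-forms (types)."""
    types = set(tokens)
    return sum(len(t) for t in types) / len(types)


def hapax_proportion(tokens):
    """Number of word-forms occurring exactly once, divided by the token count."""
    counts = Counter(tokens)
    return sum(1 for c in counts.values() if c == 1) / len(tokens)


def mattr(tokens, window=100):
    """Moving-average type-token ratio over all windows of a fixed size."""
    if len(tokens) <= window:
        # Text shorter than the window: fall back to the plain type-token ratio.
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


def entropy(tokens):
    """Shannon entropy (in bits) of the word-form frequency distribution."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Toy input; a real analysis would tokenize a whole translation.
sample = ["ab", "c", "ab", "def"]
print(average_type_length(sample), hapax_proportion(sample), entropy(sample))
```

Unlike the plain type-token ratio, the moving-average variant is far less sensitive to text length, which matters when translations of different lengths are compared.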

Keywords

  • Gospel of Matthew
  • diachronic development
  • dynamics of changes
  • properties of text
  • Czech language
Open Access

Income, Nationality and Subjectivity in Media Text

Published online: 30 Dec 2021
Pages: 667 - 678

Abstract

This article takes a bird’s eye view of how positive or negative sentiments in the news press about countries and nationality nouns seem to reflect the country’s general income groups. The study focuses on the four income groups classified by the World Bank and their co-occurrence with positively and negatively classified adjectives from the Subjectivity Lexicon for Czech. A search in the journalistic subcorpus of the SYN series, release 8 of the Czech National Corpus, results in a time line covering three decades. Previous research on subjectivity has either focused on other parts of the Subjectivity Lexicon or on fewer adjectives from other languages. In this article, it is argued that the income groups are treated in descending order, i.e., the higher the income, the more positive the sentiment. Even when the most influential groups in the top and bottom are removed, the result holds. Discourse concerning global war and peace, and the security of different nations, is also detected as a result.

Keywords

  • income groups
  • news press
  • sentiment
  • nationality
  • corpus linguistics
  • Czech language
Open Access

Key Words and Political Parties in the 2020 Pre-Election Campaign on Facebook

Published online: 30 Dec 2021
Pages: 679 - 689

Abstract

The research paper analyses key words found in pre-election communication of electorally successful political parties, based on which the main communication differences among those parties and the specifics of pre-election communication, as well as the pre-election discourse as a whole, are identified. Research material consists of political parties’ microblogs published on individual political parties’ Facebook profiles in the period from January 1, 2020 to February 28, 2020, with a reference corpus formed by the total of these microblogs. The analysis showed professionalization of political communication, the use of new, but also traditional ways of interaction with the electorate, pre-election communication based on the presentation of candidates, offensive and combative tone of the most successful parties, self-presentation, hints of persuasive and manipulative techniques, topic points of electoral programmes, but also thematic neutrality and non-specificity that suggest smaller electoral success.

Keywords

  • key words analysis
  • Facebook
  • microblog
  • pre-election communication
Open Access

‘And we are Stuck in One Place, Minister.’ A Study of Evasiveness in Replies to Face-Threatening Questions in Slovak Political Interviews on Scandals (A Combined Approach)

Published online: 30 Dec 2021
Pages: 690 - 704

Abstract

The phenomenon of political evasiveness in the genre of a political interview has been the focus of several discourse studies employing conversation analysis, critical discourse analysis and the social psychology approach. Most of the above-mentioned studies focus on a detailed qualitative analysis of political discourse, identifying a wide range of communication strategies that permit politicians to ambiguate their agency and at the same time boost their positive face. Since these strategies may change over time and may also be subject to a culture-specific environment, the aim of this paper is to discover a) which evasive communicative strategies were employed by Slovak politicians in 2012–2016, b) which lexical substitutions were most frequently used by them to avoid negative connotations of face-threatening questions, and finally, c) which cognitive frames formed a frequent conceptual background of their evasive political argumentation. The paper draws on a combination of the quantitative and qualitative approaches to the analysis of non-replies devised by Bull and Mayer (1993) and critical discourse analysis in a sample of five Slovak radio interviews aired on Rádio Express. The selection of interviews was not random: in each interview the politician was asked highly conflictual questions about bribery, embezzlement or disputes in the coalition. Based on qualitative research of Russian-Slovak political discourse (2009) by Dulebová, it is hypothesized that a) the evasive strategies of ‘attack on the opposition’ and ‘attack on the interviewer’ would occur in our sample with the highest prominence in the speech of the former Prime Minister Fico, and b) the politicians accused of direct involvement in scandals would be the most evasive ones.

Keywords

  • attack
  • CDA
  • corruption
  • evasiveness
  • face-threat
  • hedging
  • interview
  • media
  • metaphor
  • muzhik
  • scandals
  • social psychology
  • political discourse
Open Access

Lexical Bundles in the Corpus of Slovak Judicial Decisions

Published online: 30 Dec 2021
Pages: 705 - 718

Abstract

The paper follows the tradition of research in legal linguistics and into formulaic language, specifically into lexical bundles. The aim of the paper is to describe lexical bundles in samples from the corpus of Slovak judicial decisions OD-JUSTICE by means of quantitative characteristics of the identified bundles and by comparing them with bundles found in two other specialized corpora: the corpus of Slovak legal regulations and the corpus of annual reports by Slovak public institutions. For the identification of bundles, the concept of the h-point was used. The identified bundles are described with respect to their maximal, minimal, average, median and mode values, distributions and ratios. The paper also outlines an interpretation of these bundle characteristics with regard to the communicative function(s) of the compared document genres.
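The h-point is commonly defined on a rank-frequency distribution as the point where rank equals frequency. The sketch below computes it from a list of frequencies, interpolating linearly between neighbouring ranks when no rank matches its frequency exactly. How the paper applies the h-point to select bundles is not stated in the abstract, so the function name, the fallback for distributions that never cross, and the sample frequencies are assumptions.

```python
def h_point(frequencies):
    """Point of a rank-frequency distribution where rank equals frequency."""
    freqs = sorted(frequencies, reverse=True)
    if not freqs or freqs[-1] <= 0:
        raise ValueError("frequencies must be a non-empty list of positive counts")
    prev_rank, prev_freq = None, None
    for rank, freq in enumerate(freqs, start=1):
        if freq == rank:
            return float(rank)
        if freq < rank:
            # Frequency dropped below rank: intersect the line through the two
            # neighbouring (rank, frequency) points with the diagonal y = x.
            return (prev_freq * rank - freq * prev_rank) / (
                rank - prev_rank + prev_freq - freq
            )
        prev_rank, prev_freq = rank, freq
    # Assumed fallback: every frequency exceeds its rank, so use the last rank.
    return float(len(freqs))


# Hypothetical bundle frequencies, not data from the corpus.
print(h_point([10, 6, 3, 2, 1]))
print(h_point([10, 6, 4, 2]))
```

Items ranked above the h-point are the high-frequency part of the distribution, which is one way such a threshold could separate recurrent bundles from the rest.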

Keywords

  • lexical bundles
  • formulaic language
  • judicial decisions
  • specialized discourse
  • legal linguistics
  • pattern-driven research