Journal Details
Format: Journal
eISSN: 1338-4287
ISSN: 0021-5597
First published: 05 Mar 2010
Publication frequency: 2 times per year
Languages: English

Volume 70 (2019): Issue 2 (December 2019)

33 Articles
Open Access

Colloquial Lexemes in Journalistic Texts

Published online: 21 Dec 2019
Pages: 139 - 147

Abstract

This paper focuses on colloquial lexical units in journalistic texts. The object of the research is colloquiality as a marked attribute of journalistic texts. We first define the terms hovorovosť (colloquiality), also in relation to the term hovorenosť (spokenness), and hovorový (colloquial). Since the aim was to study “living language”, represented here by journalism, our source material consisted of journalistic texts from the Slovak National Corpus. Occurrences of colloquial lexical units were recorded by absolute frequency, and the results were categorized and interpreted. The most frequent means of expression were checked against their current lexicographic treatment, and changes in the colloquiality label were studied. Against the background of style parameters, we evaluated the markedness of the vocabulary of the analyzed journalistic texts.

Keywords

  • corpus linguistics
  • corpus lexicography
  • dialect corpora
Open Access

Frequency in Corpora as a Signal of Lexicalization (On the Absolute Usage of Comparative and Superlative Adjectives)

Published online: 21 Dec 2019
Pages: 148 - 157

Abstract

The study deals with the category of comparison of Czech adjectives from the semantic point of view; it concentrates especially on the so-called absolute (or elative) usage of comparatives and the absolute usage of superlatives and their lexicographic treatment (or absence of the lexicographic treatment) in Czech monolingual dictionaries. The question is whether their frequency in corpora can prove lexicalization of this usage.

Keywords

  • corpus linguistics
  • corpus lexicography
  • dialect corpora
Open Access

On the Valency of Various Types of Adverbs and Its Lexicographic Description

Published online: 21 Dec 2019
Pages: 158 - 169

Abstract

This paper deals with the neglected issue of the valency of adverbs. After providing a brief theoretical background, we present a procedure for extracting a list of potentially valent adverbs from two syntactically parsed corpora of Czech, SYN2015 and PDT. Taking note of the methodological and theoretical problems surrounding this task, especially those relating to the fuzzy boundaries of word classes, we outline the types of adverbs identified as having valency properties. Where appropriate, we comment on, and occasionally suggest improvements to, the lexicographic treatment of valent adverbs.

Keywords

  • adverbs
  • valency
  • dictionary
  • syntactically parsed corpora
  • Czech
Open Access

The Synchronic Dynamics of Words Ending in -ita/-osť

Published online: 21 Dec 2019
Pages: 170 - 179

Abstract

This paper focuses on the potential of using corpora to study manifestations of the synchronic dynamics of language and on the analysis of how words with the suffixes -ita/-osť function in contemporary texts. The analysis is based on data from the Slovak National Corpus: the corpus of older texts (texts from 1955 to 1989), the primary corpus (texts from 1955 to 2017, especially since 2000), and the corpus of online texts (until 2017). A comparison of the frequency and collocations of the analyzed words shows the dynamics of these microsystems in the language of the previous and the current period.

Keywords

  • synchrony of the language
  • lexical analysis
  • dynamics of abstract terms
  • frequency
  • Slovak National Corpus
Open Access

Slovak Comparative Correlatives: New Insights

Published online: 21 Dec 2019
Pages: 180 - 190

Abstract

Comparative Correlatives (CCs) are structures that have attracted substantial interest. In Slovak, they typically look like the following proverb:

Čím bližšie Rím, tým horší kresťan.

‘The closer (to) Rome, the worse the Christian.’

So far, no extensive research has been conducted on CCs in Slavic languages except Polish [1]. In Slovak, CCs have not received a great deal of attention. Accordingly, this study examines the various forms of CCs in a Slovak National Corpus (SNC) random sample of 500 tokens, showing that there is much more variety than has been acknowledged in the literature. Frequencies will be used to show that there are iconic structures, and it will be argued that there are construction-specific properties that suggest the existence of a specific CC construction in Slovak.

Keywords

  • Slovak
  • Comparative Correlative
  • Slovak National Corpus
Open Access

Analysis of Verbal Prepositional “of” Structures

Published online: 21 Dec 2019
Pages: 191 - 199

Abstract

The article presents empirical research on verbal prepositional “of” structures, i.e. grammatical collocations of a verb and the preposition OF. The preposition OF is among the most frequent prepositions in English. The study is based on comparisons of English and Czech sentences containing verbs and prepositions followed by an object. The material was taken from the Prague Czech-English Dependency Treebank 2.0. The structures were analyzed from morphological, syntactic and semantic points of view. The aims of the study are to create English-Czech verbal prepositional counterparts; to form verbal prepositional groups on the basis of similar semantic and syntactic features; to identify and generalize the features shared within each verb group; and to identify trends and tendencies for verbs when they collocate with a certain preposition. The findings are presented in several charts and tables.

Keywords

  • verbal prepositional structure
  • grammatical collocations
  • verbal semantic group
  • preposition “of”
Open Access

Temporal ‘Since’ in Slovak: Conjunction(s) and Aspect Choice – A Corpus Study

Published online: 21 Dec 2019
Pages: 200 - 215

Abstract

It has recently been shown, especially by [1] through [4] and [12] for Russian and by [8] and [9] for Polish, that conjunctions corresponding to Dutch sinds (cf. also [1], [2], [3]) and English since (cf. also [7], [10]) have temporal functions that are subject to restrictions on the choice of tense and aspect. Ultimately, these restrictions can be related to the semantic input of tense and aspect into complex sentences with these connective items. For Polish, extensive data provided by corpus research enabled us to shed light on usage and restrictions in this area and to establish which constellations with particular conjunctions are more or less likely or not possible (cf. [8], [9]). In the present contribution we present freshly sourced quantitative data from the Slovak SNK corpus. We consider the sixteen logically possible tense-aspect constellations and the Slovak connective items odkedy; odvtedy, čo / ako; od chvíle, keď / čo / ako; od tých čias, čo / ako; od tej doby, čo / ako. This quantitative study is intended to pinpoint areas for future research; for this purpose, comparisons are made in certain instances with Polish, the only other language for which we have such data to date.

Keywords

  • conjunction
  • tense
  • aspect
  • anteriority
  • simultaneity
  • taxis
  • Slovak
  • Polish
Open Access

In which Clause do Subordinate Conjunctions Prosodically Belong?

Published online: 21 Dec 2019
Pages: 216 - 224

Abstract

This paper deals with the position of three Czech subordinating conjunctions, že ‘that’, když ‘when’, and až ‘when’, within the prosodic word, using the phonetic annotation in the ORTOFON corpus. The position of subordinating conjunctions is traditionally described as initial within the subordinate clause, but the situation in spontaneous speech is less clear. This paper shows the functional differences between the various positions within the prosodic word and presents the words most frequently combined with the selected conjunctions.

Keywords

  • conjunction
  • spontaneous spoken language
  • spoken corpus
  • prosody
  • prosodic word
Open Access

Russian Indefinite Pronoun kakoj-libo: Non-Standard Usage and Changes in the Semantics

Published online: 21 Dec 2019
Pages: 225 - 233

Abstract

The paper deals with the meaning and use of the indefinite pronoun kakoj-libo ‘any/some’ in modern Russian. Research based on corpus data revealed non-standard usage of the pronoun kakoj-libo ‘any/some’. The paper describes the main types of deviation and evaluates their pragmatic and semantic effects. Finally, tendencies of change in the semantics and use of these pronouns are characterized.

Keywords

  • Russian language
  • semantics
  • indefinite pronouns
  • nonstandard speech
  • corpus-based approach
Open Access

Ways of Automatic Identification of Words Belonging to Semantic Field

Published online: 21 Dec 2019
Pages: 234 - 243

Abstract

The paper presents results of ongoing research on building the semantic field of the concept “empire”. A semantic field is a collection of content units covering a certain area of human experience and forming a relatively autonomous microsystem with one or several centers. Relations in such microsystems are also called associations. The idea is to use distributional analysis techniques to extract, from data on syntagmatic collocability, a set of lexical units connected by systemic paradigmatic relations of various types and strengths. The first goal of the study is to develop a methodology for filling a semantic field with lexical units on the basis of morphologically tagged corpora. We used the Sketch Engine corpus system, which implements the method of distributional statistical analysis. The text material consists of our own corpora in the domain of “empire”. In the course of the work we obtained lists of items filling the semantic space around the concept of “empire”.

Keywords

  • semantic field
  • concept of empire
  • distributive and statistical analysis
  • corpus
  • thesaurus
Open Access

Analysis of the Lemma Mateřství (Motherhood)

Published online: 21 Dec 2019
Pages: 244 - 253

Abstract

The paper presents the results of an analysis of the lemma mateřství ‘motherhood’. The authors applied methods of corpus linguistics and discourse analysis (the corpus-assisted discourse studies approach) to survey representations of the lemma in Czech journalistic texts published from 2010 to 2014. They sorted the results into discourse categories on the basis of collocation and concordance analysis and found that the chief referential discourse-of-motherhood categories were surrogate motherhood, the relationship between motherhood and career, delight in motherhood, family relationships, financial and time aspects of motherhood, changes brought by motherhood, and active motherhood. Surrogate motherhood was presented as a solution for women who cannot have a baby themselves, but also as a complicated issue, in which case emphasis was put on the relevant legislation. Motherhood was presented as a danger to a woman’s career, but also as a source of joy, an essential relationship within a family, a right to financial support from the state, a life change, an activity, and an entity closely connected to time factors.

Keywords

  • motherhood
  • discourse
  • mass media
  • corpus
  • CADS
Open Access

Corpus-Supported Semantic Studies: Part/Whole Expressions in Russian

Published online: 21 Dec 2019
Pages: 254 - 266

Abstract

We investigate valency properties of partials, i.e. words and constructions that express the Part/Whole relation, primarily in Russian, offering new observations largely based on the Russian National Corpus. Special attention is given to such lexical units as bol’šinstvo ‘majority’, men’šinstvo ‘minority’, čast’ ‘part’, protsent ‘percentages’, v bol’šinstve svoem ‘in its <their, etc.> majority’, po bol’šej časti ‘for the most part’, etc.

Keywords

  • corpus
  • semantics
  • valency
  • part-whole
Open Access

Wackernagel’s Position and Contact Position of Pronominal Enclitics in Older Czech. Competition or Cooperation?

Published online: 21 Dec 2019
Pages: 267 - 275

Abstract

The paper analyzes the relationship between the word order positions of pronominal enclitics in the history of Czech. Specifically, we look at Wackernagel’s position and the contact position and try to decide whether these two positions compete, as is usually taken for granted, or whether there is a certain kind of cooperation between them. The results show that the positions do not compete, at least not in the majority of cases. For the analysis we used a corpus based on selected books of the first edition of the Old Czech Bible and the Kralice Bible.

Keywords

  • corpus linguistics
  • corpus lexicography
  • dialect corpora
Open Access

Frequency Dictionary of 16th Century Cyrillic Written Monument

Published online: 21 Dec 2019
Pages: 276 - 288

Abstract

The article presents an algorithm for compiling a frequency dictionary of an original ancient text, “Otpys” (“Response”) by Kliryk Ostrozkyi (the Cleric of Ostroh), from the late 16th century. Until now, no historical text corpus of the Ukrainian language has been created; preparing metagraphical texts and processing them in accordance with linguistic tasks can therefore help fill this gap. What is distinctive about creating a frequency dictionary from a single written monument is that it uses the model of frequency dictionaries while describing the specifics of processing an ancient text. These specifics are based on a deep understanding of the state of the language at the end of the 16th century and consist in the unification of graphic and spelling variants, as well as in the formation of stems and lemmas. The results are presented as a Frequency Dictionary of Word Forms of “Otpys” by Kliryk Ostrozkyi and a Frequency Dictionary of “Otpys” by Kliryk Ostrozkyi, both ordered by decreasing frequency.
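
As a rough illustration of the processing steps named in this abstract (unifying spelling variants, counting, and ordering by decreasing frequency), the following Python sketch may help. It is not the authors' algorithm: the function names, the `variants` mapping, and the sample words are hypothetical, and tokenization is reduced to a simple regular expression that would not handle all features of a 16th-century Cyrillic text.

```python
import re
from collections import Counter


def frequency_dictionary(text, variants=None):
    """Return (word form, count) pairs ordered by decreasing frequency.

    `variants` is a hypothetical mapping from a spelling variant to the
    unified form it should be counted under.
    """
    variants = variants or {}
    # Simplified tokenization: lowercase the text and keep runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    # Unify graphic and spelling variants before counting.
    unified = [variants.get(token, token) for token in tokens]
    counts = Counter(unified)
    # Sort by decreasing frequency, then alphabetically to break ties.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def hapax_legomena(entries):
    """Return the forms that occur exactly once."""
    return [form for form, count in entries if count == 1]


entries = frequency_dictionary("Slovo slowo i slovo", {"slowo": "slovo"})
print(entries)
print(hapax_legomena(entries))
```

A lemma-based dictionary could be built the same way by mapping each word form to a lemma before counting, provided such a mapping is available.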

Keywords

  • frequency dictionary
  • tokenization
  • stemming
  • lemmatization
  • hapax legomena
  • written monument of the late 16th century
Open Access

Kinship Terminology in Western Slavic Languages Based on Corpora Analysis

Published online: 21 Dec 2019
Pages: 289 - 298

Abstract

This paper discusses kinship arrangements and, more generally, the families of Western Slavs on the basis of linguistic and corpus data. We argue that there is a correlation between lexicon and society, and that the study of the lexicon can provide supporting data for the study of society. We used corpus data, which provide reliable information about the lexicon actually used by speakers of Western Slavic languages, and offered possible explanations for the changes occurring in this part of the vocabulary. The paper is divided into three main parts: the first discusses the relations between social reality and kinship terminology, the second discusses the corpus data, and the third draws conclusions.

Keywords

  • kinship terminology
  • corpora linguistics
  • social reality
  • family
  • Western Slavic languages
Open Access

Gender-Specific Adjectives in Czech Newspapers and Magazines

Published online: 21 Dec 2019
Pages: 299 - 312

Abstract

This study is one of the few studies dealing with gender in the Czech language using corpus methods. It focuses on the issue of gender in Czech journalistic texts from the years 2010–2014. The main goal was to investigate the extent of stereotypical images of men and women in the press. This analysis is based on adjectival collocations of the lexemes muž ‘man’ and žena ‘woman’ and their semantic categorization. The research uses a journalistic part of the SYN2015 corpus. First, gender-specific adjectival collocates were identified. Second, adjectival collocates were classified into semantic categories and analyzed within journalistic genres. This study has shown that certain adjectives tend to co-occur with one of the examined lexemes and project a gender-stereotypical image of men and women within particular journalistic genres. It was confirmed that men are strongly associated with age specification, strength, appearance, and negative situations as a subject of crime, whereas women are related to motherhood, attractiveness, ethnicity, nationality, and are more often seen as victims of crime.

Keywords

  • gender studies
  • language and gender
  • discourse analysis
  • corpus linguistics
  • sociolinguistics
Open Access

From the National Corpus of Polish to the Polish Corpus Infrastructure

Published online: 21 Dec 2019
Pages: 315 - 323

Abstract

The National Corpus of Polish emerged as a cumulative result of many years of work on large reference corpora by computer scientists and linguists in Poland. While its impact on research in linguistics, humanities and language technology is unquestionable and highly significant, the construction of the national corpus was halted in 2011. In the paper we call for activating the research community and funding institutions around the construction of a corpus infrastructure with the national corpus at its heart. It is claimed that on the verge of an artificial intelligence revolution the envisaged Polish Corpus Infrastructure would provide reliable language data, combine available resources and allow easy integration of new ones.

Keywords

  • corpus linguistics
  • corpus lexicography
  • dialect corpora
Open Access

Relevant Criteria for Selection of Spoken Data: Theory Meets Practice

Published online: 21 Dec 2019
Pages: 324 - 335

Abstract

The present paper seeks to review relevant criteria used in classifying speech events (SEs) from the perspective of spoken corpus design. The primary goal is to survey the landscape of possible types of spoken language, so as to assess in which directions the coverage of spoken Czech offered by Czech National Corpus corpora can be expanded in the future. We approach the problem from both theoretical and practical points of view, examining what the theoretical literature has to say as well as approaches implemented in practice by existing spoken corpora of various languages. We then synthesize the obtained information into a pragmatically motivated set of SE classification criteria which does not aspire to be universal or definitive but aims to serve as a useful guiding principle and conceptual framework for understanding and promoting SE diversity when collecting spoken data.

Keywords

  • corpus linguistics
  • corpus lexicography
  • dialect corpora
Open Access

The Dialekt Corpus and Its Possibilities

Published online: 21 Dec 2019
Pages: 336 - 344

Abstract

DIALEKT, a corpus of Czech dialects, has been continuously curated and expanded by the Spoken Corpora section of the Institute of the Czech National Corpus. This paper first gives a concise characterization of the corpus, addressing its sociolinguistic parameters and the subcorpora that can be derived from them, its two-layer approach to the transcription of dialect recordings, and its lemmatization and morphological tagging. We then move on to examples of how linguists can use the corpus and discuss two related projects that expand on the currently available possibilities: an archive of dialect-specific differential phones of the Czech language (completed) and an interactive web environment for map-based spatial visualization of data from all kinds of spoken corpora (in preparation). Thanks in part to these additional tools, the DIALEKT corpus should serve both experts in the field and the general public.

Keywords

  • spoken corpus
  • dialect corpus
  • dialectology
  • corpus design
  • transcription
Open Access

Annotations in the Corpus of Texts of Students Learning Slovak as a Foreign Language (ERRKORP)

Published online: 21 Dec 2019
Pages: 345 - 357

Abstract

The article presents the forthcoming acquisition corpus of written texts by students learning Slovak as a foreign language and focuses on the annotation of the texts, which includes information about the text as well as social and linguistic details about the student. The article also discusses the tags that identify individual errors in the texts and the concept behind the tagset itself.

Keywords

  • language error
  • learner corpus
  • Slovak
  • tagging
  • annotation
Open Access

Parts of Speech in NovaMorf, A New Morphological Annotation of Czech

Published online: 21 Dec 2019
Pages: 358 - 369

Abstract

A detailed morphological description of word forms in any language is a necessary condition for successful automatic processing of linguistic data. The paper focuses on a new description of morphological categories, mainly on the subcategorization of parts of speech in Czech within the NovaMorf project. NovaMorf aims to describe the morphological properties of Czech word forms in a more compact and consistent way, and with greater explanatory power, than the approaches used so far. It also aims to unify diverse approaches to the morphological annotation of Czech. The NovaMorf approach will be reflected in a new morphological dictionary to be used for a new automatic morphological analysis (and disambiguation) of corpora of contemporary Czech.

Keywords

  • NovaMorf
  • morphological annotation
  • parts of speech
  • morphological categories
  • subcategorization
Open Access

Improving Nominalized Adjectives Tagging

Published online: 21 Dec 2019
Pages: 370 - 379

Abstract

Part-of-speech transitions represent an interesting issue for Automatic Morphological Analysis (AMA). In these cases, two parts of speech have to be considered: the initial one and the final one. However, their automatic recognition is complicated by the fact that they share the same form. This article presents the results of a corpus study aimed at mapping the tagging of nominalized adjectives, with a focus on detecting candidates for nominalization among frequent adjectives. Analysis of the data obtained from the ČNK SYN v5 corpus reveals different reasons for incorrect tagging. Taking these reasons into account, we propose three solutions for improving the tagging of nominalized adjectives.

Keywords

  • nominalized adjectives
  • automatic morphological analysis
  • disambiguation
  • corpus
  • tagging
Open Access

Modifications of the Czech Morphological Dictionary for Consistent Corpus Annotation

Published online: 21 Dec 2019
Pages: 380 - 389

Abstract

We describe systematic changes that have been made to the Czech morphological dictionary in connection with annotating new data within the Prague Dependency Treebank (PDT) project. We bring new solutions to several complicated morphological features that occur in Czech texts. We introduced two new parts of speech, namely foreign word and segment. We adopted new principles for the morphological analysis of global and inflectional variants, homonymous lemmas, abbreviations and aggregates. The changes were motivated by the need for consistency between the data and the dictionary and within the dictionary itself.

Keywords

  • morphological dictionary
  • Czech part of speech
  • corpus annotation
  • Golden rule of morphology
Open Access

Levels of Annotation in the Slovene Training Corpus ssj500k 2.2

Published online: 21 Dec 2019
Pages: 390 - 399

Abstract

This paper presents the Slovene Training Corpus ssj500k 2.2, which has been annotated on the levels of tokenization, sentence segmentation, part-of-speech tagging, lemmatization, syntactic dependencies, named entities, verbal multi-word expressions, and semantic role labeling. It describes the individual layers of annotation and shows the scope of using the training corpus in the production of various lexicons, such as the lexicon of multi-word units and the valency lexicon of modern Slovene. It concludes by presenting our future work, i.e. the annotation of multi-word expressions based on the Slovene Lexical Database.

Keywords

  • corpus linguistics
  • training corpus
  • corpus annotation
  • Slovene language
Open Access

Meaning and Semantic Roles in CzEngClass Lexicon

Published online: 21 Dec 2019
Pages: 403 - 411

Abstract

This paper focuses on Semantic Roles, an important component of studies in lexical semantics, as they are captured in a bilingual (Czech-English) synonym lexicon called CzEngClass. This lexicon builds upon the existing valency lexicons included within the framework of the annotation of the various Prague Dependency Treebanks. The present analysis of Semantic Roles is approached from the point of view of Functional Generative Description and supported by textual evidence taken specifically from the Prague Czech-English Dependency Treebank.

Keywords

  • semantic roles
  • valency
  • parallel corpus
  • lexical semantics
  • lexical resource
Open Access

Introducing Semantic Labels into the DeriNet Network

Published online: 21 Dec 2019
Pages: 412 - 423

Abstract

The paper describes a semi-automatic procedure introducing semantic labels into the DeriNet network, which is a large, freely available resource modeling derivational relations in the lexicon of Czech. The data were assigned labels corresponding to five semantic categories (diminutives, possessives, female nouns, iteratives, and aspectual meanings) by a machine learning model, which achieved excellent results in terms of both precision and recall.

Keywords

  • derivation
  • semantic category
  • comparative semantic concepts
  • suffix
  • machine learning
Open Access

Non-Systemic Valency Behavior of Czech Deverbal Nouns Based on the NomVallex Lexicon

Published online: 21 Dec 2019
Pages: 424 - 433

Abstract

In order to describe the non-systemic valency behavior of Czech deverbal nouns, we present the results of an automatic comparison of the valency frames of interlinked noun and verbal lexical units included in the valency lexicons NomVallex and VALLEX. We show that the non-systemic valency behavior of the nouns is mostly manifested by non-systemic forms of their actants, while changes in the number or type of adnominal actants are negligible in frequency. Non-systemic forms contribute considerably to a general increase in the number of forms in the valency frames of nouns compared to the number of forms in the valency frames of their base verbs. The non-systemic forms are more frequent in the valency frames of non-productively derived nouns than in those of productively derived ones.

Keywords

  • adnominal morphemic forms
  • Czech deverbal nouns
  • non-systemic valency behavior
  • valency
  • valency lexicon
Open Access

Towards Reciprocal Deverbal Nouns in Czech: From Reciprocal Verbs to Reciprocal Nouns

Published online: 21 Dec 2019
Pages: 434 - 443

Abstract

Reciprocal verbs are widely debated in current linguistics. However, other parts of speech can be characterized by reciprocity as well, and, in contrast to verbs, their analysis has so far remained underdeveloped. In this paper, we attempt to fill this gap by applying the results of the description of Czech reciprocal verbs to nouns derived from these verbs. We show that many aspects characteristic of reciprocal verbs hold for reciprocal nouns as well.

Keywords

  • reciprocity
  • deverbal nouns
  • lexical and syntactic reciprocal nouns
Open Access

Processing of Derivational Features for (Semi)Automatic Creation of Dictionary Definitions in the User Interface (CZEDD) for Learning Czech as a Second Language: Suffix -tel and -ista

Published online: 21 Dec 2019
Pages: 444 - 455

Abstract

This work-in-progress paper presents the tool CZEDD, which enables the user to learn how to predict the meaning of words. CZEDD consists of (semi)automatic definitions for derived words, since many of these words have a predictable lexical meaning. The tool is intended for foreigners learning Czech, and it could be useful as a dictionary and/or translator in which definitions based on a word’s structure are stored. Two detailed case examples (the suffix -tel and the suffix -ista) illustrate the approach.

Keywords

  • derivational morphology
  • Czech for foreigners
  • suffixes
  • lexical meaning
  • structural meaning
  • dictionary
Open Access

Conception and Development of an Open Database System on Historical Multilingualism in Austria

Published online: 21 Dec 2019
Pages: 456 - 466

Abstract

This paper discusses the development and structure of an online information system that aims to gather and visualize data on historical multilingualism in Austria (German: historische Mehrsprachigkeit in Österreich, short: MiÖ), with a particular focus on Slavic languages. The database tracks the development of multilingualism over time, its distribution in space and its representation in literature, thereby making it possible to examine its dynamics and change. As an example, we investigate the area of the so-called Marchfeld (č./sk. Moravské pole). The paper further discusses how the database is embedded into the collaborative research platform of the Special Research Program “German in Austria (DiÖ)”, as well as its technical realization and the possibility of including data from other related research projects.

Keywords

  • online information system
  • historical multilingualism
  • language contact
  • Austria
  • Austria-Hungary
Accesso libero

On Possibilities and Methods of Analysis of Thematic Expressions in Spoken Texts

Pubblicato online: 21 Dec 2019
Pagine: 469 - 480

Astratto

Abstract

This paper compares three methods for detecting prominent text units (prominent in relation to the content of the text): 1) keyword analysis based on a comparison of source and reference corpora, 2) thematic concentration and the h-point, and 3) the TF*IDF method. We weigh their pros and cons and, drawing on the results of our analyses, propose the optimal method for extracting thematic words from spoken texts, whose frequency structure differs distinctly from that of written texts.

Keywords

  • corpus linguistics
  • corpus lexicography
  • dialect corpora
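
As a rough illustration of two of the measures named in the abstract above, the following Python sketch computes TF*IDF scores and a simplified integer h-point for tokenized texts. It is not the authors' implementation: the function names `tf_idf` and `h_point`, the toy texts, the weighting (relative term frequency times the natural log of inverse document frequency), and the integer h-point (the largest rank whose frequency is at least that rank, with no interpolation) are all assumptions made for this example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per tokenized document.

    Assumed weighting: relative term frequency * ln(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


def h_point(doc):
    """Simplified h-point: the largest rank r whose frequency is >= r."""
    freqs = sorted(Counter(doc).values(), reverse=True)
    h = 0
    for rank, freq in enumerate(freqs, start=1):
        if freq < rank:
            break
        h = rank
    return h


# Hypothetical toy texts, already tokenized on whitespace.
docs = [
    "well the corpus is is a spoken corpus".split(),
    "well the dictionary lists lemmas".split(),
    "well the tagger marks errors".split(),
]
scores = tf_idf(docs)
top_terms = sorted(scores[0], key=scores[0].get, reverse=True)[:2]
```

In this toy setup, words that occur in every text receive a TF*IDF score of zero. This hints at why the frequency structure of spoken texts, with their many recurring fillers, matters when choosing among such methods.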
Open access

Identification of Spontaneous Spoken Texts in Slovak

Published online: 21 Dec 2019
Pages: 481 - 490

Abstract

We propose a text classification method intended to support the creation of a language model for the automatic recognition of spontaneous speech. Transcripts from our departmental speech database served as the spontaneous spoken texts. Using supervised machine learning, we built multiple classification models (including neural networks) that distinguished these transcripts from written texts with high accuracy. We then verified the accuracy of the trained models on a database of texts containing direct speech extracted from newspaper articles.

Keywords

  • spontaneous speech
  • text classification
  • supervised machine learning
  • neural networks
  • Slovak language
Open access

Affordable Annotation of the Mobile App Reviews

Published online: 21 Dec 2019
Pages: 491 - 497

Abstract

This paper presents a use-case study of the annotation of mobile app reviews from Google Play and the Apple Store. The sentiment polarity annotations were created for later use in automatic processing based on machine learning. This should solve some of the problems encountered in previous analyses of Czech, where data assumptions play a greater role than the annotation itself (owing to financial constraints). Our proposal shows that some of the assumptions used for English do not apply to Czech and that such data can be annotated without extensive funding.

Keywords

  • sentiment polarity
  • topics analysis
  • annotation
33 Articles
Open access

Colloquial Lexemes in Journalistic Texts

Published online: 21 Dec 2019
Pages: 139 - 147

Abstract

This paper examines colloquial lexical units in journalistic texts. The object of the research is colloquiality as a marked attribute of journalistic texts. We first define the terms hovorovosť (colloquiality), also in relation to the term hovorenosť (spokenness), and hovorový (colloquial). Since the aim was to study “living language”, represented here by journalism, our source material consisted of journalistic texts from the Slovak National Corpus. Occurrences of colloquial lexical units were recorded by absolute frequency, and the results were categorized and interpreted. The most frequent means of expression were checked against their current lexicographic treatment, and changes in the indicator of colloquiality were studied. Against the background of style parameters, we evaluated the markedness of the vocabulary of the analyzed journalistic texts.

Keywords

  • corpus linguistics
  • corpus lexicography
  • dialect corpora
Open access

Frequency in Corpora as a Signal of Lexicalization (On the Absolute Usage of Comparative and Superlative Adjectives)

Published online: 21 Dec 2019
Pages: 148 - 157

Abstract

The study deals with the category of comparison of Czech adjectives from a semantic point of view. It concentrates on the so-called absolute (or elative) usage of comparatives and the absolute usage of superlatives, and on their lexicographic treatment (or the lack of it) in Czech monolingual dictionaries. The question is whether their frequency in corpora can prove that this usage has been lexicalized.

Keywords

  • corpus linguistics
  • corpus lexicography
  • dialect corpora
Open access

On the Valency of Various Types of Adverbs and Its Lexicographic Description

Published online: 21 Dec 2019
Pages: 158 - 169

Abstract

This paper deals with the neglected issue of the valency of adverbs. After a brief theoretical background, we present a procedure for extracting a list of potentially valent adverbs from two syntactically parsed corpora of Czech, SYN2015 and PDT. Taking note of the methodological and theoretical problems surrounding this task, especially those relating to the fuzzy boundaries of word classes, we outline the types of adverbs identified as having valency properties. Where appropriate, we comment on, and occasionally suggest improvements to, the lexicographic treatment of valent adverbs.

Keywords

  • adverbs
  • valency
  • dictionary
  • syntactically parsed corpora
  • Czech
Open access

The Synchronic Dynamics of Words Ending in -ita/-osť

Published online: 21 Dec 2019
Pages: 170 - 179

Abstract

This paper focuses on the potential of using corpora to study manifestations of the synchronic dynamics of language and on the analysis of how words with the suffixes -ita/-osť function in contemporary texts. The analysis is based on data from the Slovak National Corpus: the corpus of older texts (texts from 1955 to 1989), the primary corpus (texts from 1955 to 2017, especially since 2000), and the corpus of online texts (until 2017). A comparison of the frequency and collocations of the analyzed words shows the dynamics of these microsystems in the language of the previous and the current period.

Keywords

  • synchrony of the language
  • lexical analysis
  • dynamics of abstract terms
  • frequency
  • Slovak National Corpus
Open access

Slovak Comparative Correlatives: New Insights

Published online: 21 Dec 2019
Pages: 180 - 190

Abstract

Comparative Correlatives (CCs) are structures that have attracted substantial interest. In Slovak, they typically look like the following proverb:

Čím bližšie Rím, tým horší kresťan.

‘The closer (to) Rome, the worse the Christian.’

So far, no extensive research has been conducted on CCs in Slavic languages other than Polish [1], and Slovak CCs have received little attention. Accordingly, this study examines the various forms of CCs in a random sample of 500 tokens from the Slovak National Corpus (SNC), showing that there is much more variety than has been acknowledged in the literature. Frequencies are used to show that there are iconic structures, and it is argued that construction-specific properties suggest the existence of a specific CC construction in Slovak.

Keywords

  • Slovak
  • Comparative Correlative
  • Slovak National Corpus
Open access

Analysis of Verbal Prepositional “of” Structures

Published online: 21 Dec 2019
Pages: 191 - 199

Abstract

The article presents empirical research on verbal prepositional “of” structures, that is, grammatical collocations of a verb and the preposition OF, which is among the most frequent prepositions in English. The study is based on comparisons of English and Czech sentences containing verbs and prepositions followed by an object. The material was taken from the Prague Czech-English Dependency Treebank 2.0. The structures were analyzed from morphological, syntactic, and semantic points of view. The aims of the study are to establish English-Czech verbal prepositional counterparts; to form verbal prepositional groups on the basis of similar semantic and syntactic features; to identify and generalize the features shared within each verb group; and to identify trends and tendencies for verbs when they collocate with a certain preposition. The findings are presented in several charts and tables.

Keywords

  • verbal prepositional structure
  • grammatical collocations
  • verbal semantic group
  • preposition “of”
Open access

Temporal ‘Since’ in Slovak: Conjunction(s) and Aspect Choice – A Corpus Study

Published online: 21 Dec 2019
Pages: 200 - 215

Abstract

It has recently been shown, especially by [1] through [4] and [12] for Russian and by [8] and [9] for Polish, that conjunctions corresponding to Dutch sinds (cf. also [1], [2], [3]) and English since (cf. also [7], [10]) have temporal functions that are subject to restrictions on the choice of tense and aspect. Ultimately, these restrictions can be related to the semantic input of tense and aspect into complex sentences with these connective items. For Polish, extensive data from corpus research enabled us to shed light on usage and restrictions in this area and to establish which constellations with particular conjunctions are more likely, less likely, or impossible (cf. [8], [9]). In the present contribution we present newly collected quantitative data for Slovak from the SNK corpus. We consider the sixteen logically possible tense-aspect constellations and the Slovak connective items odkedy; odvtedy, čo / ako; od chvíle, keď / čo / ako; od tých čias, čo / ako; od tej doby, čo / ako. This quantitative study is intended to pinpoint areas for future research; to this end, comparisons are made in certain instances with Polish, the only other language for which we have such data to date.

Keywords

  • conjunction
  • tense
  • aspect
  • anteriority
  • simultaneity
  • taxis
  • Slovak
  • Polish
Open access

In which Clause do Subordinate Conjunctions Prosodically Belong?

Published online: 21 Dec 2019
Pages: 216 - 224

Abstract

This paper deals with the position of three Czech subordinating conjunctions, že ‘that’, když ‘when’, and až ‘when’, within the prosodic word, using the phonetic annotation in the ORTOFON corpus. The position of subordinating conjunctions is traditionally described as initial within the subordinate clause, but the situation in spontaneous speech is less clear. This paper shows the functional differences between the various positions within the prosodic word and presents the words most frequently combined with the selected conjunctions.

Keywords

  • conjunction
  • spontaneous spoken language
  • spoken corpus
  • prosody
  • prosodic word
Open access

Russian Indefinite Pronoun kakoj-libo: Non-Standard Usage and Changes in the Semantics

Published online: 21 Dec 2019
Pages: 225 - 233

Abstract

The paper deals with the meaning and use of the indefinite pronoun kakoj-libo ‘any/some’ in modern Russian. Corpus-based research revealed non-standard usage of this pronoun. The paper describes the main types of deviation and evaluates their pragmatic and semantic effects. Finally, tendencies of change in the semantics and use of these pronouns are characterized.

Keywords

  • Russian language
  • semantics
  • indefinite pronouns
  • nonstandard speech
  • corpus-based approach
Open access

Ways of Automatic Identification of Words Belonging to Semantic Field

Published online: 21 Dec 2019
Pages: 234 - 243

Abstract

The paper presents results of ongoing research on constructing the semantic field of the concept of “empire”. A semantic field is a collection of content units covering a certain area of human experience and forming a relatively autonomous microsystem with one or several centers. Relations in such microsystems are also called associations. The idea is to use distributional analysis techniques to extract, from data on syntagmatic collocability, a set of lexical units connected by systemic paradigmatic relations of various types and strengths. The first goal of the study is to develop a methodology for filling a semantic field with lexical units on the basis of morphologically tagged corpora. We used the Sketch Engine corpus system, which implements the method of distributional statistical analysis. The text material consists of our own corpora in the domain of “empire”. In the course of the work we obtained lists of items filling the semantic space around the concept of “empire”.

Keywords

  • semantic field
  • concept of empire
  • distributive and statistical analysis
  • corpus
  • thesaurus
Open access

Analysis of the Lemma Mateřství (Motherhood)

Published online: 21 Dec 2019
Pages: 244 - 253

Abstract

The paper presents the results of an analysis of the lemma mateřství ‘motherhood’. The authors applied methods of corpus linguistics and discourse analysis, following the corpus-assisted discourse studies (CADS) approach, to survey representations of the lemma in Czech journalistic texts published from 2010 to 2014, and sorted the results into discourse categories on the basis of collocation and concordance analysis. The chief referential categories of the discourse of motherhood were surrogate motherhood, the relationship between motherhood and career, delight in motherhood, family relationships, financial and time aspects of motherhood, changes brought by motherhood, and active motherhood. Surrogate motherhood was presented as a solution for women who cannot have a baby themselves, but also as a complicated issue, in which case emphasis was placed on the relevant legislation. Motherhood was presented as a danger to a woman’s career, but also as a source of joy, an essential relationship within a family, an entitlement to financial support from the state, a life change, an activity, and an entity closely connected to time factors.

Keywords

  • motherhood
  • discourse
  • mass media
  • corpus
  • CADS
Open access

Corpus-Supported Semantic Studies: Part/Whole Expressions in Russian

Published online: 21 Dec 2019
Pages: 254 - 266

Abstract

We investigate valency properties of partials, that is, words and constructions that express the Part/Whole relation, primarily in Russian, offering new observations largely based on the Russian National Corpus. Special attention is given to lexical units such as bol’šinstvo ‘majority’, men’šinstvo ‘minority’, čast’ ‘part’, protsent ‘percentages’, v bol’šinstve svoem ‘in its <their, etc.> majority’, po bol’šej časti ‘for the most part’, etc.

Keywords

  • corpus
  • semantics
  • valency
  • part-whole
Open access

Wackernagel’s Position and Contact Position of Pronominal Enclitics in Older Czech. Competition or Cooperation?

Published online: 21 Dec 2019
Pages: 267 - 275

Abstract

The paper analyzes the relationship between the word order positions of pronominal enclitics in the history of Czech. Specifically, we look at Wackernagel’s position and the contact position and try to decide whether these two positions compete, as is usually taken for granted, or whether there is a certain kind of cooperation between them. The results show that the positions do not compete, at least not in the majority of cases. For the analysis, we used a corpus based on selected books of the first edition of the Old Czech Bible and of the Kralice Bible.

Keywords

  • corpus linguistics
  • corpus lexicography
  • dialect corpora
Open access

Frequency Dictionary of 16th Century Cyrillic Written Monument

Published online: 21 Dec 2019
Pages: 276 - 288

Abstract

The article presents an algorithm for compiling a frequency dictionary of an original historical text, “Otpys” (“Response”) by Kliryk Ostrozkyi (the Cleric of Ostroh), from the late 16th century. Until now, no historical corpus of Ukrainian texts has been created; the preparation of metagraphical texts and their subsequent processing in accordance with linguistic tasks can therefore fill this gap. What is distinctive about creating a frequency dictionary from a single written monument is the use of the model of frequency dictionaries together with a description of the specifics of processing the old text. These specifics rest on a deep understanding of the state of the language at the end of the 16th century and consist in the unification of graphic and spelling variants, as well as in the formation of stems and lemmas. The results are presented as a Frequency Dictionary of Word Forms of “Otpys” by Kliryk Ostrozkyi and a Frequency Dictionary of “Otpys” by Kliryk Ostrozkyi, both ordered by decreasing frequency.

Keywords

  • frequency dictionary
  • tokenization
  • stemming
  • lemmatization
  • hapax legomena
  • written monument of the late 16th century
Open access

Kinship Terminology in Western Slavic Languages Based on Corpora Analysis

Published online: 21 Dec 2019
Pages: 289 - 298

Abstract

This paper discusses kinship arrangements, and more generally the families, of Western Slavs on the basis of linguistic and corpus data. We argue that a correlation can be found between lexicon and society, and that the study of the lexicon can provide supporting data for the examination of society. We used corpus data, which provide reliable information about the lexicon actually used by speakers of Western Slavic languages, and we offer possible explanations for the changes occurring in this part of the vocabulary. The paper is divided into three main parts: the first discusses the relations between social reality and kinship terminology, the second discusses the corpus data, and the third draws conclusions.

Keywords

  • kinship terminology
  • corpora linguistics
  • social reality
  • family
  • Western Slavic languages
Open access

Gender-Specific Adjectives in Czech Newspapers and Magazines

Published online: 21 Dec 2019
Pages: 299 - 312

Abstract

This study is one of the few studies dealing with gender in the Czech language using corpus methods. It focuses on the issue of gender in Czech journalistic texts from the years 2010–2014. The main goal was to investigate the extent of stereotypical images of men and women in the press. This analysis is based on adjectival collocations of the lexemes muž ‘man’ and žena ‘woman’ and their semantic categorization. The research uses a journalistic part of the SYN2015 corpus. First, gender-specific adjectival collocates were identified. Second, adjectival collocates were classified into semantic categories and analyzed within journalistic genres. This study has shown that certain adjectives tend to co-occur with one of the examined lexemes and project a gender-stereotypical image of men and women within particular journalistic genres. It was confirmed that men are strongly associated with age specification, strength, appearance, and negative situations as a subject of crime, whereas women are related to motherhood, attractiveness, ethnicity, nationality, and are more often seen as victims of crime.

Keywords

  • gender studies
  • language and gender
  • discourse analysis
  • corpus linguistics
  • sociolinguistics
Open access

From the National Corpus of Polish to the Polish Corpus Infrastructure

Published online: 21 Dec 2019
Pages: 315 - 323

Abstract

The National Corpus of Polish emerged as a cumulative result of many years of work on large reference corpora by computer scientists and linguists in Poland. While its impact on research in linguistics, humanities and language technology is unquestionable and highly significant, the construction of the national corpus was halted in 2011. In the paper we call for activating the research community and funding institutions around the construction of a corpus infrastructure with the national corpus at its heart. It is claimed that on the verge of an artificial intelligence revolution the envisaged Polish Corpus Infrastructure would provide reliable language data, combine available resources and allow easy integration of new ones.

Keywords

  • corpus linguistics
  • corpus lexicography
  • dialect corpora
Open access

Relevant Criteria for Selection of Spoken Data: Theory Meets Practice

Published online: 21 Dec 2019
Pages: 324 - 335

Abstract

The present paper seeks to review relevant criteria used in classifying speech events (SEs) from the perspective of spoken corpus design. The primary goal is to survey the landscape of possible types of spoken language, so as to assess in which directions the coverage of spoken Czech offered by Czech National Corpus corpora can be expanded in the future. We approach the problem from both theoretical and practical points of view, examining what the theoretical literature has to say as well as approaches implemented in practice by existing spoken corpora of various languages. We then synthesize the obtained information into a pragmatically motivated set of SE classification criteria which does not aspire to be universal or definitive but aims to serve as a useful guiding principle and conceptual framework for understanding and promoting SE diversity when collecting spoken data.

Keywords

  • corpus linguistics
  • corpus lexicography
  • dialect corpora
Open access

The Dialekt Corpus and Its Possibilities

Published online: 21 Dec 2019
Pages: 336 - 344

Abstract

DIALEKT, a corpus of Czech dialects, has been continuously curated and expanded by the Spoken Corpora section of the Institute of the Czech National Corpus. This paper first gives a concise characterization of the corpus, addressing its sociolinguistic parameters and the subcorpora that can be derived from them, its two-layer approach to the transcription of dialect recordings, and its lemmatization and morphological tagging. We then move on to examples of how linguists can use the corpus and discuss two related projects that expand on the currently available possibilities: an archive of dialect-specific differential phones of the Czech language (completed) and an interactive web environment for spatial, map-based visualization of data from all kinds of spoken corpora (in preparation). Thanks in part to these additional tools, the DIALEKT corpus should serve both experts in the field and the general public.

Keywords

  • spoken corpus
  • dialect corpus
  • dialectology
  • corpus design
  • transcription
Open access

Annotations in the Corpus of Texts of Students Learning Slovak as a Foreign Language (ERRKORP)

Published online: 21 Dec 2019
Pages: 345 - 357

Abstract

The article presents a forthcoming acquisition corpus of written texts by students learning Slovak as a foreign language. It focuses on the annotation of the texts, which includes information about each text as well as social and linguistic details about the student. The article also discusses the tags that identify individual errors in the texts and the concept behind the tagset itself.

Keywords

  • language error
  • learner corpus
  • Slovak
  • tagging
  • annotation
Open access

Parts of Speech in NovaMorf, A New Morphological Annotation of Czech

Published online: 21 Dec 2019
Pages: 358 - 369

Abstract

A detailed morphological description of word forms in any language is a necessary condition for successful automatic processing of linguistic data. The paper focuses on a new description of morphological categories, mainly on the subcategorization of parts of speech in Czech within the NovaMorf project. NovaMorf aims to describe the morphological properties of Czech word forms in a more compact and consistent way, and with greater explanatory power, than the approaches used so far. It also aims to unify diverse approaches to the morphological annotation of Czech. The NovaMorf approach will be reflected in a new morphological dictionary to be used for a new automatic morphological analysis (and disambiguation) of corpora of contemporary Czech.

Keywords

  • NovaMorf
  • morphological annotation
  • parts of speech
  • morphological categories
  • subcategorization
Open access

Improving Nominalized Adjectives Tagging

Published online: 21 Dec 2019
Pages: 370 - 379

Abstract

Part-of-speech transitions are an interesting issue for Automatic Morphological Analysis (AMA). In these cases, two parts of speech have to be considered: the initial one and the final one. However, their automatic recognition is complicated by the fact that both share the same form. This article presents the results of a corpus study that maps the tagging of nominalized adjectives, with a focus on detecting candidates for nominalization among frequent adjectives. Analysis of data from the ČNK SYN v5 corpus reveals several different reasons for incorrect tagging. Taking these reasons into account, we propose three solutions for improving the tagging of nominalized adjectives.

Keywords

  • nominalized adjectives
  • automatic morphological analysis
  • disambiguation
  • corpus
  • tagging
Open access

Modifications of the Czech Morphological Dictionary for Consistent Corpus Annotation

Published online: 21 Dec 2019
Pages: 380 - 389

Abstract

We describe systematic changes made to the Czech morphological dictionary in connection with the annotation of new data within the Prague Dependency Treebank (PDT) project. We propose new solutions for several complicated morphological features that occur in Czech texts. We introduce two new parts of speech, namely foreign word and segment, and adopt new principles for the morphological analysis of global and inflectional variants, homonymous lemmas, abbreviations, and aggregates. The changes were motivated by the need for consistency between the data and the dictionary, and within the dictionary itself.

Keywords

  • morphological dictionary
  • Czech part of speech
  • corpus annotation
  • Golden rule of morphology
Open access

Levels of Annotation in the Slovene Training Corpus ssj500k 2.2

Published online: 21 Dec 2019
Pages: 390 - 399

Abstract

This paper presents the Slovene Training Corpus ssj500k 2.2, which has been annotated on the levels of tokenization, sentence segmentation, part-of-speech tagging, lemmatization, syntactic dependencies, named entities, verbal multi-word expressions, and semantic role labeling. It describes the individual layers of annotation and shows the scope of using the training corpus in the production of various lexicons, such as the lexicon of multi-word units and the valency lexicon of modern Slovene. It concludes by presenting our future work, i.e. the annotation of multi-word expressions based on the Slovene Lexical Database.

Keywords

  • corpus linguistics
  • training corpus
  • corpus annotation
  • Slovene language
Open access

Meaning and Semantic Roles in CzEngClass Lexicon

Published online: 21 Dec 2019
Pages: 403 - 411

Abstract

This paper focuses on Semantic Roles, an important component of studies in lexical semantics, as they are captured in a bilingual (Czech-English) synonym lexicon called CzEngClass. This lexicon builds on the existing valency lexicons included within the framework of the annotation of the various Prague Dependency Treebanks. The analysis of Semantic Roles is approached from the point of view of Functional Generative Description and supported by textual evidence taken specifically from the Prague Czech-English Dependency Treebank.

Keywords

  • semantic roles
  • valency
  • parallel corpus
  • lexical semantics
  • lexical resource
Open access

Introducing Semantic Labels into the DeriNet Network

Published online: 21 Dec 2019
Pages: 412 - 423

Abstract

The paper describes a semi-automatic procedure introducing semantic labels into the DeriNet network, which is a large, freely available resource modeling derivational relations in the lexicon of Czech. The data were assigned labels corresponding to five semantic categories (diminutives, possessives, female nouns, iteratives, and aspectual meanings) by a machine learning model, which achieved excellent results in terms of both precision and recall.

Keywords

  • derivation
  • semantic category
  • comparative semantic concepts
  • suffix
  • machine learning
Open access

Non-Systemic Valency Behavior of Czech Deverbal Nouns Based on the NomVallex Lexicon

Published online: 21 Dec 2019
Pages: 424 - 433

Abstract

In order to describe the non-systemic valency behavior of Czech deverbal nouns, we present the results of an automatic comparison of the valency frames of interlinked nominal and verbal lexical units included in the valency lexicons NomVallex and VALLEX. We show that the non-systemic valency behavior of the nouns is mostly manifested by non-systemic forms of their actants, while changes in the number or type of adnominal actants are negligible in terms of frequency. Non-systemic forms contribute considerably to a general increase in the number of forms in the valency frames of nouns compared with the valency frames of their base verbs. Non-systemic forms are more frequent in the valency frames of non-productively derived nouns than in those of productively derived ones.

Keywords

  • adnominal morphemic forms
  • Czech deverbal nouns
  • non-systemic valency behavior
  • valency
  • valency lexicon
Open access

Towards Reciprocal Deverbal Nouns in Czech: From Reciprocal Verbs to Reciprocal Nouns

Published online: 21 Dec 2019
Pages: 434 - 443

Abstract

Reciprocal verbs are widely debated in current linguistics. Other parts of speech can be characterized by reciprocity as well, but in contrast to verbs, their analysis remains underdeveloped. In this paper, we attempt to fill this gap by applying the results of the description of Czech reciprocal verbs to nouns derived from these verbs. We show that many aspects characteristic of reciprocal verbs hold for reciprocal nouns as well.

Keywords

  • reciprocity
  • deverbal nouns
  • lexical and syntactic reciprocal nouns
Accesso libero

Processing of Derivational Features for (Semi)Automatic Creation of Dictionary Definitions in the User Interface (CZEDD) for Learning Czech as a Second Language: Suffix -tel and -ista

Pubblicato online: 21 Dec 2019
Pagine: 444 - 455

Astratto

Abstract

This work-in-progress paper presents the tool CZEDD which enables the user to learn how to predict the meaning of words. The CZEDD consists of (semi) automatic definitions for derived words because a lot of these words have predictable lexical meaning. The tool will be intended for foreigners who learn the Czech language and it could be useful as a dictionary and/or translator in which the definitions based on the word’s structure are stored. Two detailed case examples (the suffix -tel, and the suffix -ista) illustrate the approach.

Parole chiave

  • derivational morphology
  • Czech for foreigners
  • suffixes
  • lexical meaning
  • structural meaning
  • dictionary
Accesso libero

Conception and Development of an Open Database System on Historical Multilingualism in Austria

Pubblicato online: 21 Dec 2019
Pagine: 456 - 466

Astratto

Abstract

This paper discusses the development and structure of an online information system, which aims to gather and visualize data on historical multilingualism in Austria (German: historische Mehrsprachigkeit in Österreich, short: MiÖ), with a particular focus on Slavic languages. The database tracks the development of multilingualism over time, its distribution in space and its representation in literature, therefore allowing to examine its dynamics and change. As an example, we investigate the area of the so-called Marchfeld (č./sk. Moravské pole). The paper further discusses how the database is embedded into the collaborative research platform of the Special Research Program “German in Austria (DiÖ)” as well as its technical realization and the possibility to include data from other related research projects.

Keywords

  • online information system
  • historical multilingualism
  • language contact
  • Austria
  • Austria-Hungary
Open Access

On Possibilities and Methods of Analysis of Thematic Expressions in Spoken Texts

Published online: 21 Dec 2019
Pages: 469 - 480

Abstract

This paper compares three methods for detecting prominent text units (prominent in relation to the content of the text): 1) keyword analysis based on a comparison of source and reference corpora, 2) thematic concentration and the h-point, and 3) the TF*IDF method. We discuss their pros and cons and, based on the results of our analyses, propose the optimal method for extracting thematic words from spoken texts, whose frequency structure differs distinctly from that of written texts.
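Two of the measures named above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' procedure: `h_point` is a simplified integer version (the largest rank whose frequency is at least the rank, without interpolation), `tf_idf` uses relative term frequency and idf = log(N / df), and the token lists are invented toy data.

```python
import math
from collections import Counter


def h_point(tokens):
    """Simplified integer h-point: the largest rank r with frequency >= r."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    h = 0
    for rank, freq in enumerate(freqs, start=1):
        if freq >= rank:
            h = rank
        else:
            break
    return h


def tf_idf(docs):
    """Return one {term: score} dict per document.

    tf is the relative frequency of the term in the document;
    idf is log(N / df), with df the number of documents containing the term.
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Invented toy token lists standing in for transcribed spoken texts.
docs = [
    "well you know the harvest was late you know".split(),
    "well the bus was late again".split(),
    "you know the dog ran off".split(),
]

scores = tf_idf(docs)
top_term = max(scores[0], key=scores[0].get)
print("h-point of first text:", h_point(docs[0]))
print("top TF*IDF term of first text:", top_term)
```

In the toy data, fillers such as "you know" are frequent in the first text but also occur elsewhere, so their TF*IDF scores stay low, while a word found in every text scores zero.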

Keywords

  • corpus linguistics
  • corpus lexicography
  • dialect corpora
Open Access

Identification of Spontaneous Spoken Texts in Slovak

Published online: 21 Dec 2019
Pages: 481 - 490

Abstract

We propose a text classification method for the purpose of creating a language model for automatic recognition of spontaneous speech. Transcripts from our departmental speech database served as spontaneous spoken texts. Using supervised machine learning methods, we created multiple classification models (including neural networks) that were able to distinguish these transcripts from written texts with high accuracy. We subsequently verified the accuracy of the trained models on a database of texts containing direct speech extracted from newspaper articles.
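The supervised setup described above (labelled spoken and written texts, a trained model, then prediction on unseen text) can be sketched as follows. The abstract does not specify the models beyond mentioning neural networks, so a multinomial naive Bayes classifier is used here purely as a stand-in; the class name and the English training sentences are invented for illustration.

```python
import math
from collections import Counter


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes with add-one smoothing over word tokens."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = {label: Counter() for label in self.class_counts}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, text):
        # Tokens never seen in training are ignored.
        tokens = [t for t in text.lower().split() if t in self.vocab]
        n_total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / n_total)  # log prior
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy training data; real work would use transcripts and written corpora.
train_texts = [
    "well um you know i mean it was like really late",
    "uh yeah so i was like you know whatever",
    "the committee approved the proposal after a lengthy debate",
    "the report describes the results of the annual survey",
]
train_labels = ["spoken", "spoken", "written", "written"]

clf = NaiveBayesTextClassifier().fit(train_texts, train_labels)
print(clf.predict("um yeah you know it was like that"))
print(clf.predict("the committee describes the results"))
```

A held-out set, such as the direct speech from newspaper articles mentioned in the abstract, would be passed through `predict` in the same way to estimate accuracy.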

Keywords

  • spontaneous speech
  • text classification
  • supervised machine learning
  • neural networks
  • Slovak language
Open Access

Affordable Annotation of the Mobile App Reviews

Published online: 21 Dec 2019
Pages: 491 - 497

Abstract

This paper presents a use-case study of annotating mobile app reviews from Google Play and the Apple Store. The sentiment polarity annotations were created for later use in automatic processing based on machine learning. This should solve some of the problems encountered in previous analyses of Czech, where data assumptions play a greater role than annotation itself (due to financial constraints). Our proposal shows that some of the assumptions used for English do not apply to Czech and that it is possible to annotate such data without extensive funding.

Keywords

  • sentiment polarity
  • topics analysis
  • annotation
