Journal information
Format: Journal
eISSN: 1338-4287
ISSN: 0021-5597
First published: 05 Mar 2010
Publication frequency: 2 times per year
Languages: English

Volume 68 (2017): Issue 2 (December 2017)

31 articles

Open access

Georgian Dialect Corpus: Linguistic and Encyclopedic Information in Online Dictionaries

Published: 24 Jan 2018
Page range: 109 - 121

Abstract

The Georgian Dialect Corpus (GDC) has been created within the framework of the project “Linguistic Portrait of Georgia”. It was the first attempt to create a structured corpus of Georgian dialects. The work of this project includes building the technical framework for a corpus, collecting the corpus (text) data of Georgian dialects including the lexicographic data (dictionaries), their linguistic processing and digitization, developing an annotation framework, and making decisions on the morphosyntactic annotation. Currently, the Georgian Dialect Corpus is a platform consisting of the dialect corpus, the text library, and the lexicographical database with online dialect dictionaries. For the purposes of developing the lexicographical database and dialect dictionaries, we have created a new program, the Lexicographic Editor. It allows us to structure and enrich the dictionaries with multiple kinds of linguistic and lexicographic information. The lexicographic concept of the GDC has been developed taking into consideration linguistic and social features of the Georgian dialects.

Keywords

  • corpus linguistics
  • corpus lexicography
  • dialect corpora

Open access

Modeling Semantic Distance in the Pattern Dictionary of English Verbs

Published: 24 Jan 2018
Page range: 122 - 135

Abstract

We explore human judgments on how well individual patterns of 29 target verbs from the Pattern Dictionary of English Verbs describe their random KWICs. We focus on cases where more than one pattern is judged as highly appropriate for a given KWIC and seek to estimate the effect of event participants (arguments) being denotatively similar in two patterns, considering all pair combinations in a given lemma. We compare this effect to the effect of several contextual features of the KWICs, the effect of paired PDEV implicatures implying each other, and the effect of belonging to a given lemma. We show that the lemma effect is still stronger than any feature going across lemmas we have examined so far, so that each verb appears to be a little universe in its own right.

Keywords

  • usage patterns
  • lexicography
  • verbs
  • CPA
  • semantics
  • word embeddings
  • WSD
  • graded decisions
  • corpus
  • English
  • annotation

Open access

Golden Rule of Morphology and Variants of Word forms

Published: 24 Jan 2018
Page range: 136 - 144

Abstract

In many languages, some words can be written in several ways. We call them variants. Values of all their morphological categories are identical, which leads to an identical morphological tag. Together with the identical lemma, we have two or more wordforms with the same morphological description. This ambiguity may cause problems in various NLP applications. There are two types of variants – those affecting the whole paradigm (global variants) and those affecting only wordforms sharing some combinations of morphological values (inflectional variants). In the paper, we propose a way to tag all wordforms, including their variants, unambiguously. We call this requirement the “Golden rule of morphology”. The paper deals mainly with Czech, but the ideas can be applied to other languages as well.

Keywords

  • morphology
  • global variants
  • inflectional variants
  • multiple lemma
  • Golden rule of morphology

Open access

Morphological Disambiguation of Multiword Expressions and Its Impact on the Disambiguation of Their Environment in a Sentence

Published: 24 Jan 2018
Page range: 145 - 155

Abstract

This study concerns the impact of the collocation/phraseme disambiguation component within the complex system of the rule-based morphological disambiguation of Czech. This system constitutes one of the two main disambiguation subsystems that are responsible for the morphological disambiguation of the corpora of synchronic Czech within the Czech National Corpus project. We will show that although the part of texts constituted by collocations/phrasemes (generally multiword expressions – MWEs) is relatively small, and consequently the error-free morphological disambiguation of MWEs covers only a small portion of textual material, such perfectly disambiguated fragments help to improve the disambiguation of the remaining, non-MWE part of the sentences.

Keywords

  • multiword expressions
  • lexical database
  • morphological analysis
  • morphological ambiguity
  • morphological disambiguation
  • process of disambiguation
  • Czech National Corpus

Open access

Valency Potential of Slovak and French Verbs in Contrast

Published: 24 Jan 2018
Page range: 156 - 168

Abstract

The paper presents the results of a synchronic contrastive study of the fifteen most frequent Slovak full verbs and their French equivalents, using corpus analysis to observe and compare their valency potential in relation to their semantic structure. The inventory of valency structures of Slovak verbs and their French equivalents shows not only differences, but also, to a great extent, identical semantic-syntactic connectivities. The main contribution of the study lies in its contrastive research perspective and its interdisciplinary character at the crossroads of grammar, semantics, syntax, cognitive and corpus linguistics. Findings can be of use to linguists, terminologists, lexicographers, authors of textbooks and grammars, translators and interpreters, as well as to French-speaking learners of Slovak and Slovak students of French.

Keywords

  • linguistics
  • grammar
  • corpus
  • verb
  • valency

Open access

Microsyntactic Annotation of Corpora and its Use in Computational Linguistics Tasks

Published: 24 Jan 2018
Page range: 169 - 178

Abstract

Microsyntax is a linguistic discipline dealing with idiomatic elements whose important properties are strongly related to syntax. In a way, these elements may be viewed as transitional entities between the lexicon and the grammar, which explains why they are often underrepresented in both of these resource types: the lexicographer fails to see such elements as full-fledged lexical units, while the grammarian finds them too specific to justify the creation of individual well-developed rules. As a result, such elements are poorly covered by linguistic models used in advanced modern computational linguistic tasks like high-quality machine translation or deep semantic analysis. A possible way to mend the situation and improve the coverage and adequate treatment of microsyntactic units in linguistic resources is to develop corpora with microsyntactic annotation, closely linked to specially designed lexicons. The paper shows how this task is solved in the deeply annotated corpus of Russian, SynTagRus.

Keywords

  • Text corpora
  • Russian syntactically tagged corpus SynTagRus
  • syntactic idioms
  • microsyntactic annotation
  • microsyntactic dictionary

Open access

Clitic Climbing, Finiteness and the Raising-Control Distinction. A Corpus-based Study

Published: 24 Jan 2018
Page range: 179 - 190

Abstract

In the paper, we discuss the phenomenon of clitic climbing out of finite da2-complements in contemporary Serbian. Scholars’ opinions on the acceptability and occurrence of this construction, based on a handful of self-made examples, vary considerably. Expanding on the assumption that the correctness of the phenomenon has often been denied due to its rarity, we employ large corpora to examine the problem. We focus on possible constraints arising from the syntactic properties of clause-embedding predicates.

Keywords

  • srWaC
  • constraints on clitic climbing
  • semifinite complements
  • syntax
  • Serbian

Open access

On the Development of an Interdisciplinary Annotation and Classification System for Language Varieties – Challenges and Solutions

Published: 24 Jan 2018
Page range: 191 - 207

Abstract

The Special Research Programme (SFB) ‘German in Austria: Variation – Contact – Perception’ is a project financed by the Austrian Science Fund (FWF F60). Its nine project parts are collaboratively conducting research on the variation and change of the German language in Austria. The SFB explores the use and the subjective perception of the German language in Austria as well as its contact with other languages. Methodologically and theoretically, most SFB project parts are situated within variationist linguistics, others in contact linguistics and perceptionist linguistics. This paper gives an insight into the conception of a framework for the annotation and ultimately also classification of language varieties, which is being developed within the SFB. It outlines the requirements of the various project parts and reviews whether and how standardised language codes (ISO 639) and language tags (following BCP 47) can be utilised for the annotation of language varieties in variationist linguistic projects.
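To make the idea of BCP 47 language tags concrete, the following minimal Python sketch splits a tag into its language, script, region and private-use parts. It is an illustration only, not the SFB's annotation scheme: the pattern covers just a simplified subset of the BCP 47 grammar and expects conventional casing, the function name `parse_tag` is hypothetical, and the private-use subtag `vienna` is an invented example. The code `bar` is the ISO 639-3 code for Bavarian.

```python
import re

# Simplified subset of BCP 47: language[-Script][-REGION][-x-private...].
# Variants, extensions and grandfathered tags are deliberately not covered.
_TAG = re.compile(
    r"^(?P<language>[a-z]{2,3})"
    r"(?:-(?P<script>[A-Z][a-z]{3}))?"
    r"(?:-(?P<region>[A-Z]{2}|\d{3}))?"
    r"(?:-x(?P<private>(?:-[a-z0-9]{1,8})+))?$"
)


def parse_tag(tag: str) -> dict:
    """Split a language tag into its subtags (hypothetical helper)."""
    match = _TAG.match(tag)
    if match is None:
        raise ValueError(f"not a tag in the simplified subset: {tag!r}")
    parts = match.groupdict()
    private = parts["private"]
    # "-vienna-urban" -> ["vienna", "urban"]; no private part -> []
    parts["private"] = private[1:].split("-") if private else []
    return parts


# "de-AT": German as used in Austria.
print(parse_tag("de-AT"))
# "bar-AT-x-vienna": Bavarian in Austria with an invented private-use subtag.
print(parse_tag("bar-AT-x-vienna"))
```

Private-use subtags (everything after `-x-`) are the part of the standard that leaves room for project-specific variety labels, which is why the sketch keeps them as a separate list.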

Keywords

  • language varieties
  • dialects
  • language tags

Open access

Possible but not probable: A quantitative analysis of valency behaviour of Czech nouns in the Prague Dependency Treebank

Published: 24 Jan 2018
Page range: 208 - 218

Abstract

In order to optimize corpus searches for valency lexicon production, we analyse the relative frequencies of different combinations of valency complementations of Czech deverbal nouns in the Prague Dependency Treebank, considering differences between productively and non-productively derived nouns and their semantic class. We also classify combinations of forms of participants according to their frequency.

Keywords

  • valency
  • valency lexicon
  • Czech nouns
  • Word Sketch
  • corpus
  • Prague Dependency Treebank
  • quantitative analysis

Open access

New Spoken Corpora of Czech: ORTOFON and DIALEKT

Published: 24 Jan 2018
Page range: 219 - 228

Abstract

The paper introduces the ORTOFON corpus of spontaneous spoken Czech and the DIALEKT corpus of Czech dialects, their design principles and practical solutions adopted during data collection.

Keywords

  • dialectology
  • lemmatization
  • spoken corpus
  • tagging
  • transcription

Open access

What Does že jo (and že ne) Mean in Spoken Dialogue

Published: 24 Jan 2018
Page range: 229 - 237

Abstract

The goal of this paper is to examine the role of two collocations (že jo and že ne) in spoken dialogue. Both are said to be typical of spontaneous conversation and to express a wide range of pragmatic functions, e.g. uncertainty of the speaker or a request for a backchannel. Examining their position within the utterance in relation to the meaning of their immediate context helped us to identify the functions and to distinguish between cases which are simple co-occurrences of the conjunction že and the particle jo/ne, and those which are instances of the set phrase. The source material comes from the ORAL2013 and DIALOG corpora.

Keywords

  • DIALOG
  • spoken corpora
  • ORAL
  • co-occurrence

Open access

Grammatical Change Trends in Contemporary Czech Newspapers

Published: 24 Jan 2018
Page range: 238 - 248

Abstract

The paper presents a corpus-driven method for the detection of recent grammatical change in contemporary Czech newspapers. It is based on a large and homogeneous material (825 million tokens of a single newspaper) that covers a 23-year time span. The task is operationalised into finding the most relevant frequency change manifested by selected subsets of the Czech tagset. The results show changing proportions of parts of speech, nominal cases etc. that indicate a shift towards more “verbal” language associated with increasing informality of the newspaper register.

Keywords

  • modern diachrony
  • language change
  • Czech
  • newspaper register
  • corpus composition

Open access

Corpus-Based Semantic Models of the Noun Phrases Containing Words with ‘Person’ Marker

Published: 24 Jan 2018
Page range: 249 - 257

Abstract

The mechanism underlying the construction of lexically correct word sequences is an object of attention in both theoretical and applied linguistics. This paper reveals some aspects of modelling the patterns of semantic valence in noun phrases of NN (Noun+Noun) structure, one or both components of which contain the ‘person’ semantic tag. The research is based on the Corpus of Ukrainian and performed with the help of automatic language processing.

Keywords

  • natural language processing
  • semantic valence
  • noun phrases

Open access

Text collections for evaluation of Russian morphological taggers

Published: 24 Jan 2018
Page range: 258 - 267

Abstract

The paper describes the preparation and development of the text collections for the MorphoRuEval-2017 shared task, an evaluation campaign designed to stimulate development of automatic morphological processing technologies for Russian. The main challenge for the organizers was to standardize all available Russian corpora with manually verified high-quality tagging to a single format (Universal Dependencies CoNLL-U). The sources of the data were the disambiguated subcorpus of the Russian National Corpus, SynTagRus, OpenCorpora.org data and the GICR corpus with resolved homonymy, all exhibiting different tagsets, rules for lemmatization, pipeline architecture, technical solutions and error systematicity. The collections include both normative texts (the news and modern literature) and more informal discourse (social media and spoken data); the texts are available under the CC BY-NC-SA 3.0 license.
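For readers unfamiliar with the target format, a CoNLL-U token line consists of ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), with `_` marking an empty value and morphological features written as `Name=Value` pairs joined by `|`. The Python sketch below parses one such line. It is a minimal illustration, not the organizers' tooling: the function name `parse_token_line` and the sample line are constructed for this example and are not taken from the collections.

```python
# Column names follow the CoNLL-U specification.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def parse_token_line(line: str) -> dict:
    """Parse one CoNLL-U token line into a dict (hypothetical helper)."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(columns)}")
    token = dict(zip(FIELDS, columns))
    feats = token["feats"]
    # "_" means no features; otherwise split "Case=Nom|Number=Sing" into a dict.
    token["feats"] = {} if feats == "_" else dict(
        pair.split("=", 1) for pair in feats.split("|")
    )
    return token


# Constructed example line (not from the shared-task data).
sample = "1\tмама\tмама\tNOUN\t_\tAnimacy=Anim|Case=Nom|Gender=Fem|Number=Sing\t2\tnsubj\t_\t_"
token = parse_token_line(sample)
print(token["form"], token["upos"], token["feats"])
```

A full reader would also skip comment lines starting with `#` and treat blank lines as sentence boundaries; those steps are omitted here to keep the sketch short.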

Keywords

  • text collection
  • shared task
  • morphological tagging
  • universal dependencies
  • morphological parsing
  • Russian corpora

Open access

Subcategorization of Adverbial Meanings Based on Corpus Data

Published: 24 Jan 2018
Page range: 268 - 277

Abstract

We introduce a corpus-based description of selected adverbial meanings in Czech sentences. Their basic repertory has a long-standing tradition in both scientific and school grammars. Before the corpus era, researchers had to rely on their own excerption; nowadays, syntax has a vast material basis available in the form of electronic corpora. Using the case of spatial adverbials, we describe the methodology we used to acquire a detailed, comprehensive, well-arranged description of the meanings of adverbials, including a list of formal realizations with examples. Theoretical knowledge stemming from this work will lead to an improvement of the annotation of these meanings in the Prague Dependency Treebanks, which serve as the corpus sources for our research. The Prague Dependency Treebanks include data manually annotated on the layer of deep syntax and thus provide a large number of valuable examples on the basis of which the meanings of adverbials can be defined more accurately and subcategorized more precisely. Both theoretical and practical results will subsequently be used in NLP applications such as machine translation.

Keywords

  • adverbial meanings
  • deep syntax
  • annotation
  • treebank

Open access

Measuring and Improving Children’s Reading Aloud Attributes by Computers

Published: 24 Jan 2018
Page range: 278 - 286

Abstract

In this paper, a method for the automated measurement of reading-aloud attributes is presented. Forced alignment, a part of the speech recognition technique, is used: the recorded reading is force-aligned to the known text and the attributes are computed from the alignment. The tempo and fluency of children are monitored and used for individual motivation. The length of the read text is chosen according to the readers’ skills so that children finish reading at about the same time and poor readers are not frustrated. This approach has been tested and improved at an elementary school for five years and has brought positive results.

Keywords

  • speech recognition
  • teaching reading
  • reading aloud

Open access

Three Aspects of Processing Ophthalmological Terminology in a “Small Language”: a Case of Croatian Term Bank Struna

Published: 24 Jan 2018
Page range: 287 - 295

Abstract

In this paper, we will present the problems we have observed while editing terminological units as a part of the specialized language of ophthalmology that is currently being processed as part of the program Struna. Struna is the Croatian National term bank (http://struna.ihjj.hr/). Its aim is to gradually standardize Croatian terminology, for all professional domains, by coordinating the work of domain experts, terminologists and language experts [1], [2]. The Croatian Ophthalmological Terminology is the first Struna project that encompasses a subfield of an already existing field in the database. Namely, in 2013 the general medical terminology was processed as a part of the project Croatian Anatomy and Physiology. This situation has revealed a new set of problems that previously were not taken into account and has forced us to re-evaluate our methodology and adapt accordingly.

Keywords

  • Struna
  • terminology
  • specialized language of ophthalmology
  • terminology management

Open access

Terminology and Labelling Words by Subject in Monolingual Dictionaries – What Do Domain Labels Say to Dictionary Users?

Published: 24 Jan 2018
Page range: 296 - 304

Abstract

The paper focuses on labelling words by subject in a non-specialized dictionary. We compare the existing monolingual dictionaries of Czech and their ways of labelling terms of medicine and related fields; besides apparent differences between dictionaries, there are also inconsistencies within one dictionary. We consider pros and cons of domain labels as such and their usability in the light of needs and limits of dictionary users, with the aim to motivate further discussion on related issues.

Keywords

  • terminology
  • terms
  • lexicography
  • monolingual dictionary
  • e-dictionary
  • domain labels
  • Czech

Open access

Correlative Conjunctions in Spoken Texts

Published: 24 Jan 2018
Page range: 305 - 315

Abstract

Correlative conjunctions (such as buď – anebo (either – or), jednak – jednak (firstly – secondly) etc.) represent one means of textual cohesion. The occurrence of one component of the pair implies the use of the other, which contributes to the cohesiveness of a text. Using data provided by ORAL2013, the corpus of informal spoken Czech, I will try to demonstrate their use in prototypical spoken language, which is commonly considered less coherent and more fragmentary than written language.

Keywords

  • correlative conjunctions
  • spoken Czech
  • ORAL2013
  • corpus

Open access

Issues of POS Tagging of the (Diachronic) Corpus of Czech: Preparing a Morphological Dictionary

Published: 24 Jan 2018
Page range: 316 - 325

Abstract

Many important decisions concerning the part-of-speech categorization remain unexplained in the current practice, only reported in corpus manuals. The aim of this paper is to offer a different perspective on the problems of morphological annotation of corpora – the perspective of mapping and analyzing conceptual problems in the annotation. Focused mainly on function words in Czech, we discuss the possibilities of the POS tagging of the inherently ambiguous category of particles and we introduce criteria for distinguishing particles from interjections.

Keywords

  • corpus
  • function words
  • morphological annotation
  • Czech

Open access

Designing the Database of Speech Under Stress

Published: 24 Jan 2018
Page range: 326 - 335

Abstract

This study describes the methodology used for designing a database of speech under real stress. Based on the limits of existing stress databases, we used a communication task via a computer game to collect speech data. To validate the presence of stress, known psychophysiological indicators such as heart rate and electrodermal activity, as well as subjective self-assessment, were used. This paper presents the data from the first 5 speakers (3 men, 2 women) who participated in initial tests of the proposed design. In 4 out of 5 speakers, increases in fundamental frequency and intensity of speech were registered. Similarly, in 4 out of 5 speakers, heart rate was significantly increased during the task when compared with the reference measurement from before the task. These first results show that the proposed design might be appropriate for building a speech under stress database. However, there are still considerations that need to be addressed.

Keywords

  • stress
  • arousal
  • stress detection
  • heart rate
  • speech under stress
  • speech database

Open access

Annotation of the Evaluative Language in a Dependency Treebank

Published: 24 Jan 2018
Page range: 336 - 345

Abstract

In the paper, we present our efforts to annotate evaluative language in the Prague Dependency Treebank 2.0. The project is a follow-up of the series of annotations of small plaintext corpora. It uses automatic identification of potentially evaluative nodes through mapping a Czech subjectivity lexicon to syntactically annotated data. These nodes are then manually checked by an annotator and either dismissed as standing in a non-evaluative context, or confirmed as evaluative. In the latter case, information about the polarity orientation, the source and target of evaluation is added by the annotator. The annotations unveiled several advantages and disadvantages of the chosen framework. The advantages involve more structured and easy-to-handle environment for the annotator, visibility of syntactic patterning of the evaluative state, effective solving of discontinuous structures or a new perspective on the influence of good/bad news. The disadvantages include little capability of treating cases with evaluation spread among more syntactically connected nodes at once, little capability of treating metaphorical expressions, or disregarding the effects of negation and intensification in the current scheme.

Keywords

  • dependency treebank
  • corpus
  • plaintext annotation

Open access

TEDxSK and JumpSK: A New Slovak Speech Recognition Dedicated Corpus

Published: 24 Jan 2018
Page range: 346 - 354

Abstract

This paper describes a new Slovak speech recognition dedicated corpus built from TEDx talks and Jump Slovakia lectures. The proposed speech database consists of 220 talks and lectures with a total duration of about 58 hours. The annotated speech database was generated automatically in an unsupervised manner by using acoustic speech segmentation based on principal component analysis and automatic speech transcription using two complementary speech recognition systems. The evaluation data, consisting of 50 manually annotated talks and lectures with a total duration of about 12 hours, has been created for evaluating the quality of Slovak speech recognition. By unsupervised automatic annotation of TEDx talks and Jump Slovakia lectures we have obtained 21.26% of new speech segments with approximately 9.44% word error rate, suitable for retraining or adaptation of acoustic models trained beforehand.
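Word error rate is conventionally defined as the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and the recognizer output, divided by the number of reference words. The Python sketch below shows that standard definition under the assumption of simple whitespace tokenization; the function name `word_error_rate` and the sample sentences are illustrative and do not come from the paper or its evaluation scripts.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference prefix processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("dobrý deň vážení poslucháči",
                      "dobrý deň vážený poslucháči"))  # 0.25
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.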

Keywords

  • automatic annotation
  • speech recognition
  • speech corpus

Open access

Helping the Translator Choose: The Concept of a Dictionary of Equivalents

Published: 24 Jan 2018
Page range: 355 - 363

Abstract

The purpose of the article is to present the innovative concept of a dictionary of equivalents, a reference work designed specifically for translators of legal texts. The article describes the features of legal terminology which render legal translation particularly difficult, such as polysemy and synonymy as well as incongruence among legal systems. Then it proposes a classification and labelling system of equivalents which ought to be offered in a terminographic reference work for legal translators.

Keywords

  • dictionary of equivalents
  • legal language
  • legal translation
  • congruence
  • equivalence

Open access

CzEngClass – Towards a Lexicon of Verb Synonyms with Valency Linked to Semantic Roles

Published: 24 Jan 2018
Page range: 364 - 371

Abstract

In this paper, we introduce our ongoing project about synonymy in a bilingual context. This project aims at exploring semantic ‘equivalence’ of verb senses of generally different verbal lexemes in a bilingual (Czech-English) setting. Specifically, it focuses on their valency behavior within such equivalence groups. We believe that using bilingual context (translation) as an important factor in the delimitation of classes of synonymous lexical units (verbs, in our case) may help to specify the verb senses, also with regard to the (semantic) roles relation to other verb senses and roles of their arguments more precisely than when using monolingual corpora. In our project, we work “bottom-up”, i.e., from the evidence recorded in our corpora, and not “top-down”, from a predefined set of semantic classes.

Keywords

  • lexical resources
  • valency
  • synonymy
  • semantic roles
  • dependency corpus
  • multilingual

Open access

Slavic Phraseology: A View Through Corpora

Published: 24 Jan 2018
Page range: 372 - 384

Abstract

The study of word collocability is one of the main tasks of linguistics. The combinatory ability of language units, collocability, is one of the linguistic syntagmatic laws. This phenomenon is the main object of phraseology and lexicography. The article deals with set phrases of different types in Russian, Czech and Slovak from the point of view of their quantitative evaluation. Corpus linguistics understands set phrases as statistically determined units. This approach is the starting point of different automatic ways to extract idioms and collocations. The paper describes experiments which show how text corpora and corpus methods and tools can be used to expand the entries in existing dictionaries and how set phrases could be evaluated quantitatively. It is shown and maintained that corpus linguistics methods and tools make it possible to create dictionaries of a new type that include a larger number of set phrases and collocations than before.
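As an illustration of what "statistically determined" can mean in practice, the Python sketch below computes two association scores widely used in corpus tools for ranking collocation candidates: pointwise mutual information and logDice. The abstract does not state which measures the authors used, so the choice of measures, the function names and all counts here are assumptions made for the example.

```python
import math


def pmi(f_xy: int, f_x: int, f_y: int, n: int) -> float:
    """Pointwise mutual information: log2 of observed over expected co-occurrence.

    f_xy: co-occurrence count of the pair, f_x and f_y: counts of each word,
    n: corpus size in tokens.
    """
    return math.log2(f_xy * n / (f_x * f_y))


def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """logDice score: 14 + log2 of the Dice coefficient; independent of corpus size."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


# Invented counts for a candidate set phrase in a corpus of 1,024 tokens.
print(pmi(f_xy=8, f_x=16, f_y=32, n=1024))
print(log_dice(f_xy=50, f_x=100, f_y=100))
```

Candidates are typically ranked by such a score, and a frequency threshold is often applied first because mutual information tends to favor very rare pairs.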

Keywords

  • Slavic phraseology
  • phraseological units
  • set phrases
  • idioms
  • collocations
  • corpus
  • lexicography

Open access

Slovak Dependency Treebank in Universal Dependencies

Published: 24 Jan 2018
Page range: 385 - 395

Abstract

We describe a conversion of the syntactically annotated part of the Slovak National Corpus into the annotation scheme known as Universal Dependencies. Only a small subset of the data has been converted so far; yet it is the first Slovak treebank that is publicly available for research. We list a number of research projects in which the dataset has been used so far, including the first parsing results.

Keywords

  • treebank
  • dependency
  • universal dependencies
  • syntax
  • morphology
  • tagging
  • parsing

Open access

Compound Adverbs as an Issue in Machine Analysis of Czech language

Published: 24 Jan 2018
Page range: 396 - 403

Abstract

Compound adverbs represent an interesting issue in terms of Automatic Morphological Analysis (AMA). The reason is that compound adverbs in Czech are expressions formed by compounding existing words that are different parts of speech without any change in their form. An indicative sign of compound adverbs is that they can always be decomposed again. Compound adverbs may be written as one word but sometimes a multiword form coexists. A word that is originally a different part of speech gains an adverbial meaning and becomes an adverb. This article presents the results of a corpus probe aimed at mapping expressions that are demonstrably compound adverbs and were not recognized by AMA or were incorrectly tagged by AMA as another part of speech. Analysis of data obtained from the Czech National Corpus (ČNK) SYN v3 shows that the unrecognized and incorrectly tagged units can be divided into several groups. Based on knowledge of these groups it is possible to refine part of speech tagging by AMA. The corpus probe examined units written in accordance with the current codification as well as substandard units.

Keywords

  • compound adverb
  • multiword expression
  • automatic morphological analysis
  • nominal form
  • corpus
  • tag

Open access

The Use of Authorial Corpora Beyond Linguistics

Published: 24 Jan 2018
Page range: 404 - 414

Abstract

The study concentrates on the issue of quantitative and qualitative methods within the context of literary theory. In particular, it presents the concept of a literary corpus of Czech prose and defines the main parameters of the corpus. Besides the project of a specialized corpus, primarily intended for use in the field of literary theory, the study deals with current stochastic and corpus methods applied by foreign scholars in the analysis of literary prose texts. The study tries to place the original project of a Czech prose literary corpus in this contemporary context, which represents one form of a recently flourishing discipline called Digital Humanities (Digital Literary Studies).

Keywords

  • Literary Studies
  • Digital Humanities
  • Literary Corpus
  • Thematic Analysis

Open access

Automatic Morphemic Analysis in the Corpus of the Ukrainian Language: Results and Prospects

Published: 24 Jan 2018
Page range: 415 - 425

Abstract

The article describes the theoretical issues and the principles of construction and functioning of the Automated System of Morphemic and Derivational Analysis (ASMDA). The ASMDA system performs the following functions: 1) information system; 2) automatic morphemic annotation of text; 3) automatic linguistic constructor for frequency dictionaries. The use of ASMDA as an automatic morphemic analyser of the lexicon of Ukrainian texts is at the centre of attention; the article also describes the structure as well as the search and classification options of the electronic morphemic dictionaries presented in the linguistic research system of the Corpus of the Ukrainian language.

Keywords

  • Morphemic-Derivational database
  • Corpus of the Ukrainian language
  • the morphic segmentator of the Ukrainian text
  • Electronic dictionary of frequency
  • automatic morphemic analysis
Open Access

Ján Horecký’s Approach to Language and Thinking

Published: 24 Jan 2018
Pages: 426 - 431

Abstract

The paper aims to reflect on the theoretical foundations of Horecký’s approach to the relation between language (and more specifically: terms) and thinking (concepts). Reflections are devoted to Horecký’s explicit and implicit beliefs on the nature of terms and concepts and their mutual relation, as well as their relation to the surrounding reality. Definitions of both term and concept appear in some of Horecký’s major papers. The paper focuses on the models of the term-concept relation proposed in those papers. Finally, an attempt is made to find some convergences and divergences between the theories of Horecký and the Czech logician Pavel Tichý.

Keywords

  • terms
  • concepts
  • philosophy
  • Ján Horecký
  • Pavel Tichý
  • logical spectrum
  • Transparent Intensional Logic
31 Articles
Open Access

Georgian Dialect Corpus: Linguistic and Encyclopedic Information in Online Dictionaries

Published: 24 Jan 2018
Pages: 109 - 121

Abstract

The Georgian Dialect Corpus (GDC) has been created within the framework of the project “Linguistic Portrait of Georgia”. It was the first attempt to create a structured corpus of Georgian dialects. The work of this project includes building the technical framework for a corpus, collecting the corpus (text) data of Georgian dialects including the lexicographic data (dictionaries), processing and digitizing the data, developing an annotation framework, and making decisions on the morphosyntactic annotation. Currently, the Georgian Dialect Corpus is a platform consisting of the dialect corpus, the text library, and the lexicographical database with online dialect dictionaries. For the purposes of developing the lexicographical database and dialect dictionaries, we have created a new program – the Lexicographic Editor. It allows us to structure and enrich the dictionaries with multiple kinds of linguistic and lexicographic information. The lexicographic concept of the GDC has been developed taking into consideration the linguistic and social features of the Georgian dialects.

Keywords

  • corpus linguistics
  • corpus lexicography
  • dialect corpora
Open Access

Modeling Semantic Distance in the Pattern Dictionary of English Verbs

Published: 24 Jan 2018
Pages: 122 - 135

Abstract

We explore human judgments on how well individual patterns of 29 target verbs from the Pattern Dictionary of English Verbs describe their random KWICs. We focus on cases where more than one pattern is judged as highly appropriate for a given KWIC and seek to estimate the effect of event participants (arguments) being denotatively similar in two patterns, considering all pair combinations in a given lemma. We compare this effect to the effect of several contextual features of the KWICs, the effect of paired PDEV implicatures implying each other, and the effect of belonging to a given lemma. We show that the lemma effect is still stronger than any feature going across lemmas we have examined so far, so that each verb appears to be a little universe in its own right.

Keywords

  • usage patterns
  • lexicography
  • verbs
  • CPA
  • semantics
  • word embeddings
  • WSD
  • graded decisions
  • corpus
  • English
  • annotation
Open Access

Golden Rule of Morphology and Variants of Word forms

Published: 24 Jan 2018
Pages: 136 - 144

Abstract

In many languages, some words can be written in several ways. We call them variants. The values of all their morphological categories are identical, which leads to an identical morphological tag. Together with the identical lemma, we have two or more wordforms with the same morphological description. This ambiguity may cause problems in various NLP applications. There are two types of variants – those affecting the whole paradigm (global variants) and those affecting only wordforms sharing some combinations of morphological values (inflectional variants). In the paper, we propose a means of tagging all wordforms, including their variants, unambiguously. We call this requirement the “Golden rule of morphology”. The paper deals mainly with Czech, but the ideas can be applied to other languages as well.

Keywords

  • morphology
  • global variants
  • inflectional variants
  • multiple lemma
  • Golden rule of morphology
Open Access

Morphological Disambiguation of Multiword Expressions and Its Impact on the Disambiguation of Their Environment in a Sentence

Published: 24 Jan 2018
Pages: 145 - 155

Abstract

This study concerns the impact of the collocation/phraseme disambiguation component within the complex system of the rule-based morphological disambiguation of Czech. This system constitutes one of the two main disambiguation subsystems that are responsible for the morphological disambiguation of the corpora of synchronic Czech within the Czech National Corpus project. We will show that although the part of texts constituted by collocations/phrasemes (generally multiword expressions – MWEs) is relatively small, and consequently the error-free morphological disambiguation of MWEs covers only a small portion of textual material, such perfectly disambiguated fragments help to improve the disambiguation of the remaining, non-MWE part of sentences.

Keywords

  • multiword expressions
  • lexical database
  • morphological analysis
  • morphological ambiguity
  • morphological disambiguation
  • process of disambiguation
  • Czech National Corpus
Open Access

Valency Potential of Slovak and French Verbs in Contrast

Published: 24 Jan 2018
Pages: 156 - 168

Abstract

The paper presents the results of a synchronic contrastive study of the fifteen most frequent Slovak full verbs and their French equivalents. The study uses corpus analysis to observe and compare their valency potential in relation to their semantic structure. The inventory of valency structures of Slovak verbs and their French equivalents shows not only differences, but also, to a great extent, identical semantic-syntactic connectivities. The main contribution of the study lies in its contrastive research perspective and its interdisciplinary character at the crossroads of grammar, semantics, syntax, cognitive and corpus linguistics. Findings can be of use to linguists, terminologists, lexicographers, authors of textbooks and grammars, translators and interpreters, as well as to French-speaking learners of Slovak and Slovak students of French.

Keywords

  • linguistics
  • grammar
  • corpus
  • verb
  • valency
Open Access

Microsyntactic Annotation of Corpora and its Use in Computational Linguistics Tasks

Published: 24 Jan 2018
Pages: 169 - 178

Abstract

Microsyntax is a linguistic discipline dealing with idiomatic elements whose important properties are strongly related to syntax. In a way, these elements may be viewed as transitional entities between the lexicon and the grammar, which explains why they are often underrepresented in both of these resource types: the lexicographer fails to see such elements as full-fledged lexical units, while the grammarian finds them too specific to justify the creation of individual well-developed rules. As a result, such elements are poorly covered by linguistic models used in advanced modern computational linguistic tasks like high-quality machine translation or deep semantic analysis. A possible way to mend the situation and improve the coverage and adequate treatment of microsyntactic units in linguistic resources is to develop corpora with microsyntactic annotation, closely linked to specially designed lexicons. The paper shows how this task is solved in the deeply annotated corpus of Russian, SynTagRus.

Keywords

  • Text corpora
  • Russian syntactically tagged corpus SynTagRus
  • syntactic idioms
  • microsyntactic annotation
  • microsyntactic dictionary
Open Access

Clitic Climbing, Finiteness and the Raising-Control Distinction. A Corpus-based study

Published: 24 Jan 2018
Pages: 179 - 190

Abstract

In the paper, we discuss the phenomenon of clitic climbing out of finite da₂-complements in contemporary Serbian. Scholars’ opinions on the acceptability and occurrence of this construction, based on a handful of self-made examples, vary considerably. Expanding on the assumption that the correctness of the phenomenon has often been denied due to its rarity, we employ large corpora to examine the problem. We focus on possible constraints arising from the syntactic properties of clause-embedding predicates.

Keywords

  • srWaC
  • constraints on clitic climbing
  • semifinite complements
  • syntax
  • Serbian
Open Access

On the Development of an Interdisciplinary Annotation and Classification System for Language Varieties – Challenges and Solutions

Published: 24 Jan 2018
Pages: 191 - 207

Abstract

The Special Research Programme (SFB) ‘German in Austria: Variation – Contact – Perception’ is a project financed by the Austrian Science Fund (FWF F60). Its nine project parts are collaboratively conducting research on the variation and change of the German language in Austria. The SFB explores the use and the subjective perception of the German language in Austria as well as its contact with other languages. Methodologically and theoretically, most SFB project parts are situated within variationist linguistics, others in contact linguistics and perceptionist linguistics. This paper gives an insight into the conception of a framework for the annotation, and ultimately also the classification, of language varieties, which is being developed within the SFB. It outlines the requirements of the various project parts and reviews whether and how standardised language codes (ISO 639) and language tags (following BCP 47) can be utilised for the annotation of language varieties in variationist linguistic projects.
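
As a rough illustration of the language tags mentioned in the abstract, the following Python sketch splits a BCP 47-style tag into its main subtags. It is a simplified reading of the tag syntax (extended language subtags and extension singletons are not handled), and it is not the SFB's annotation framework. The function name `split_language_tag` and the private-use subtag `x-vienna` are hypothetical and chosen only for this example.

```python
import re


def split_language_tag(tag):
    """Split a BCP 47-style tag into language, script, region, variants, private use.

    Simplified sketch: extlang subtags and extension singletons are not supported.
    """
    subtags = tag.split("-")
    result = {
        "language": subtags[0].lower(),
        "script": None,
        "region": None,
        "variants": [],
        "private_use": [],
    }
    rest = subtags[1:]
    for index, subtag in enumerate(rest):
        if subtag.lower() == "x":
            # Everything after the singleton "x" is private use.
            result["private_use"] = rest[index + 1:]
            break
        no_later_part_yet = result["region"] is None and not result["variants"]
        if (re.fullmatch(r"[A-Za-z]{4}", subtag)
                and result["script"] is None and no_later_part_yet):
            result["script"] = subtag.title()
        elif re.fullmatch(r"[A-Za-z]{2}|[0-9]{3}", subtag) and no_later_part_yet:
            result["region"] = subtag.upper()
        elif re.fullmatch(r"[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}", subtag):
            result["variants"].append(subtag.lower())
        else:
            raise ValueError(f"unsupported subtag {subtag!r} in {tag!r}")
    return result


# "de-AT": German as used in Austria.
print(split_language_tag("de-AT"))
# A hypothetical private-use subtag marking a local variety.
print(split_language_tag("de-AT-x-vienna"))
```

A private-use subtag is one place where a project could record a variety that has no code of its own; whether the framework described in the paper does so is not stated in the abstract.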

Keywords

  • language varieties
  • dialects
  • language tags
Open Access

Possible but not probable: A quantitative analysis of valency behaviour of Czech nouns in the Prague Dependency Treebank

Published: 24 Jan 2018
Pages: 208 - 218

Abstract

In order to optimize corpus searches for valency lexicon production, we analyse the relative frequencies of different combinations of valency complementations of Czech deverbal nouns in the Prague Dependency Treebank, considering differences between productively and non-productively derived nouns and their semantic class. We also classify combinations of forms of participants according to their frequency.

Keywords

  • valency
  • valency lexicon
  • Czech nouns
  • Word Sketch
  • corpus
  • Prague Dependency Treebank
  • quantitative analysis
Open Access

New Spoken Corpora of Czech: ORTOFON and DIALEKT

Published: 24 Jan 2018
Pages: 219 - 228

Abstract

The paper introduces the ORTOFON corpus of spontaneous spoken Czech and the DIALEKT corpus of Czech dialects, their design principles and practical solutions adopted during data collection.

Keywords

  • dialectology
  • lemmatization
  • spoken corpus
  • tagging
  • transcription
Open Access

What Does že jo (and že ne) Mean in Spoken Dialogue

Published: 24 Jan 2018
Pages: 229 - 237

Abstract

The goal of this paper is to examine the role of two collocations (že jo and že ne) in spoken dialogue. Both are said to be typical of spontaneous conversation and to express a wide range of pragmatic functions, e.g. uncertainty of the speaker or a request for a backchannel. Examining their position within the utterance in relation to the meaning of their immediate context helped us to identify these functions and to distinguish between cases which are simple co-occurrences of the conjunction že and the particle jo/ne, and those which are instances of the set phrase. The source material comes from the ORAL2013 and DIALOG corpora.

Keywords

  • DIALOG
  • spoken corpora
  • ORAL
  • co-occurrence
Open Access

Grammatical Change Trends in Contemporary Czech Newspapers

Published: 24 Jan 2018
Pages: 238 - 248

Abstract

The paper presents a corpus-driven method for the detection of recent grammatical change in contemporary Czech newspapers. It is based on a large and homogeneous material (825 million tokens of a single newspaper) that covers a 23-year time span. The task is operationalised into finding the most relevant frequency change manifested by selected subsets of the Czech tagset. The results show changing proportions of parts of speech, nominal cases etc. that indicate a shift towards more “verbal” language associated with increasing informality of the newspaper register.

Keywords

  • modern diachrony
  • language change
  • Czech
  • newspaper register
  • corpus composition
Open Access

Corpus-Based Semantic Models of the Noun Phrases Containing Words with ‘Person’ Marker

Published: 24 Jan 2018
Pages: 249 - 257

Abstract

The mechanism underlying the construction of lexically correct sequences of words is an object of attention in both theoretical and applied fields of linguistics. This paper reveals some aspects of modelling the patterns of semantic valence in noun phrases of NN (Noun+Noun) structure in which one or both components carry the ‘person’ semantic tag. The research is based on the Corpus of Ukrainian and performed with the help of automatic language processing.

Keywords

  • natural language processing
  • semantic valence
  • noun phrases
Open Access

Text collections for evaluation of Russian morphological taggers

Published: 24 Jan 2018
Pages: 258 - 267

Abstract

The paper describes the preparation and development of the text collections within the framework of the MorphoRuEval-2017 shared task, an evaluation campaign designed to stimulate the development of automatic morphological processing technologies for Russian. The main challenge for the organizers was to standardize all available Russian corpora with manually verified high-quality tagging to a single format (Universal Dependencies CoNLL-U). The sources of the data were the disambiguated subcorpus of the Russian National Corpus, SynTagRus, OpenCorpora.org data and the GICR corpus with resolved homonymy, all exhibiting different tagsets, rules for lemmatization, pipeline architecture, technical solutions and error systematicity. The collections include both normative texts (news and modern literature) and more informal discourse (social media and spoken data). The texts are available under the CC BY-NC-SA 3.0 license.
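
For readers unfamiliar with the target format named in the abstract, the sketch below reads CoNLL-U text into Python dictionaries. It relies only on the general layout of the format (ten tab-separated columns per token, `#` comment lines, a blank line between sentences). The function name `parse_conllu` and the three-token sample sentence are invented for illustration and are not taken from the MorphoRuEval-2017 data.

```python
# The ten CoNLL-U columns, in order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def parse_conllu(text):
    """Return a list of sentences, each a list of token dictionaries."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment or metadata
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(columns)}")
        token = dict(zip(FIELDS, columns))
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in token["id"] or "." in token["id"]:
            continue
        current.append(token)
    if current:
        sentences.append(current)
    return sentences


# Invented sample; features are left unspecified ("_").
ROWS = [
    ["1", "Мама", "мама", "NOUN", "_", "_", "2", "nsubj", "_", "_"],
    ["2", "мыла", "мыть", "VERB", "_", "_", "0", "root", "_", "_"],
    ["3", "раму", "рама", "NOUN", "_", "_", "2", "obj", "_", "_"],
]
SAMPLE = "# text = Мама мыла раму\n" + "\n".join("\t".join(r) for r in ROWS) + "\n\n"

for token in parse_conllu(SAMPLE)[0]:
    print(token["form"], token["lemma"], token["upos"])
```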

Keywords

  • text collection
  • shared task
  • morphological tagging
  • universal dependencies
  • morphological parsing
  • Russian corpora
Open Access

Subcategorization of Adverbial Meanings Based on Corpus Data

Published: 24 Jan 2018
Pages: 268 - 277

Abstract

We introduce a corpus-based description of selected adverbial meanings in Czech sentences. Their basic repertory has a long-lasting tradition in both scientific and school grammars. Before the corpus era, researchers had to rely on their own excerption; nowadays, syntax has a vast material basis available in the form of electronic corpora. Using the case of spatial adverbials, we describe the methodology we used to acquire a detailed, comprehensive and well-arranged description of the meanings of adverbials, including a list of formal realizations with examples. Theoretical knowledge stemming from this work will lead to an improvement of the annotation of these meanings in the Prague Dependency Treebanks, which serve as the corpus sources for our research. The Prague Dependency Treebanks include data manually annotated on the layer of deep syntax and thus provide a large number of valuable examples on the basis of which the meanings of adverbials can be defined more accurately and subcategorized more precisely. Both theoretical and practical results will subsequently be used in NLP tasks such as machine translation.

Keywords

  • adverbial meanings
  • deep syntax
  • annotation
  • treebank
Open Access

Measuring and Improving Children’s Reading Aloud Attributes by Computers

Published: 24 Jan 2018
Pages: 278 - 286

Abstract

In this paper, a method for the automated measurement of reading-aloud attributes is presented. Forced alignment, a technique from speech recognition, is used: the recorded reading is force-aligned to the known text and the attributes are computed from the alignment. The tempo and fluency of children are monitored and used for individual motivation. The length of the read text is chosen according to the readers’ skills so that children finish reading at about the same time and poor readers are not frustrated. This approach has been tested and improved at an elementary school for five years and has brought positive results.

Keywords

  • speech recognition
  • teaching reading
  • reading aloud
Open Access

Three Aspects of Processing Ophthalmological Terminology in a “Small Language”: a Case of Croatian Term Bank Struna

Published: 24 Jan 2018
Pages: 287 - 295

Abstract

In this paper, we present the problems we have observed while editing terminological units of the specialized language of ophthalmology, which is currently being processed as part of the Struna programme. Struna is the Croatian national term bank (http://struna.ihjj.hr/). Its aim is to gradually standardize Croatian terminology for all professional domains by coordinating the work of domain experts, terminologists and language experts [1], [2]. The Croatian Ophthalmological Terminology is the first Struna project that encompasses a subfield of a field already existing in the database. Namely, in 2013 the general medical terminology was processed as a part of the project Croatian Anatomy and Physiology. This situation has revealed a new set of problems that previously were not taken into account and has forced us to re-evaluate the methodology and adapt accordingly.

Keywords

  • Struna
  • terminology
  • specialized language of ophthalmology
  • terminology management
Open Access

Terminology and Labelling Words by Subject in Monolingual Dictionaries – What Do Domain Labels Say to Dictionary Users?

Published: 24 Jan 2018
Pages: 296 - 304

Abstract

The paper focuses on labelling words by subject in a non-specialized dictionary. We compare the existing monolingual dictionaries of Czech and their ways of labelling terms of medicine and related fields; besides apparent differences between dictionaries, there are also inconsistencies within one dictionary. We consider pros and cons of domain labels as such and their usability in the light of needs and limits of dictionary users, with the aim to motivate further discussion on related issues.

Keywords

  • terminology
  • terms
  • lexicography
  • monolingual dictionary
  • e-dictionary
  • domain labels
  • Czech
Open Access

Correlative Conjunctions in Spoken Texts

Published: 24 Jan 2018
Pages: 305 - 315

Abstract

Correlative conjunctions (such as buď – anebo (either – or), jednak – jednak (firstly – secondly) etc.) represent one means of textual cohesion. The occurrence of one component of the pair implies the use of the other, which contributes to the cohesiveness of a text. Using data provided by ORAL2013, the corpus of informal spoken Czech, I will try to demonstrate their use in prototypical spoken language, which is commonly considered less coherent and more fragmentary than written language.

Keywords

  • correlative conjunctions
  • spoken Czech
  • ORAL2013
  • corpus
Open Access

Issues of POS Tagging of the (Diachronic) Corpus of Czech: Preparing a Morphological Dictionary

Published: 24 Jan 2018
Pages: 316 - 325

Abstract

Many important decisions concerning the part-of-speech categorization remain unexplained in the current practice, only reported in corpus manuals. The aim of this paper is to offer a different perspective on the problems of morphological annotation of corpora – the perspective of mapping and analyzing conceptual problems in the annotation. Focused mainly on function words in Czech, we discuss the possibilities of the POS tagging of the inherently ambiguous category of particles and we introduce criteria for distinguishing particles from interjections.

Keywords

  • corpus
  • function words
  • morphological annotation
  • Czech
Open Access

Designing the Database of Speech Under Stress

Published: 24 Jan 2018
Pages: 326 - 335

Abstract

This study describes the methodology used for designing a database of speech under real stress. Given the limits of existing stress databases, we used a communication task via a computer game to collect speech data. To validate the presence of stress, known psychophysiological indicators such as heart rate and electrodermal activity, as well as subjective self-assessment, were used. This paper presents the data from the first 5 speakers (3 men, 2 women) who participated in initial tests of the proposed design. In 4 out of 5 speakers, increases in fundamental frequency and intensity of speech were registered. Similarly, in 4 out of 5 speakers, heart rate was significantly increased during the task when compared with the reference measurement from before the task. These first results show that the proposed design might be appropriate for building a speech under stress database. However, there are still considerations that need to be addressed.

Keywords

  • stress
  • arousal
  • stress detection
  • heart rate
  • speech under stress
  • speech database
Open Access

Annotation of the Evaluative Language in a Dependency Treebank

Published: 24 Jan 2018
Pages: 336 - 345

Abstract

In the paper, we present our efforts to annotate evaluative language in the Prague Dependency Treebank 2.0. The project is a follow-up of the series of annotations of small plaintext corpora. It uses automatic identification of potentially evaluative nodes through mapping a Czech subjectivity lexicon to syntactically annotated data. These nodes are then manually checked by an annotator and either dismissed as standing in a non-evaluative context, or confirmed as evaluative. In the latter case, information about the polarity orientation, the source and target of evaluation is added by the annotator. The annotations unveiled several advantages and disadvantages of the chosen framework. The advantages involve more structured and easy-to-handle environment for the annotator, visibility of syntactic patterning of the evaluative state, effective solving of discontinuous structures or a new perspective on the influence of good/bad news. The disadvantages include little capability of treating cases with evaluation spread among more syntactically connected nodes at once, little capability of treating metaphorical expressions, or disregarding the effects of negation and intensification in the current scheme.

Keywords

  • dependency treebank
  • corpus
  • plaintext annotation
Open Access

TEDxSK and JumpSK: A New Slovak Speech Recognition Dedicated Corpus

Published: 24 Jan 2018
Pages: 346 - 354

Abstract

This paper describes a new Slovak speech recognition dedicated corpus built from TEDx talks and Jump Slovakia lectures. The proposed speech database consists of 220 talks and lectures with a total duration of about 58 hours. The annotated speech database was generated automatically in an unsupervised manner by using acoustic speech segmentation based on principal component analysis and automatic speech transcription using two complementary speech recognition systems. Evaluation data consisting of 50 manually annotated talks and lectures with a total duration of about 12 hours have been created for evaluating the quality of Slovak speech recognition. By unsupervised automatic annotation of TEDx talks and Jump Slovakia lectures we have obtained 21.26% of new speech segments with approximately 9.44% word error rate, suitable for retraining or adaptation of acoustic models trained beforehand.
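
The word error rate quoted in the abstract is conventionally the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The Python sketch below shows that standard definition; it is not the authors' evaluation script, and the function name `word_error_rate` and the sample sentences are made up for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # previous[j] holds the distance between the reference words processed
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1       # reference word missing in hypothesis
            insertion = current[j - 1] + 1   # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One substitution and one deletion against six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.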

Keywords

  • automatic annotation
  • speech recognition
  • speech corpus
Open Access

Helping the Translator Choose: The Concept of a Dictionary of Equivalents

Published: 24 Jan 2018
Pages: 355 - 363

Abstract

The purpose of the article is to present the innovative concept of a dictionary of equivalents, a reference work designed specifically for translators of legal texts. The article describes the features of legal terminology which render legal translation particularly difficult, such as polysemy and synonymy as well as incongruence among legal systems. Then it proposes a classification and labelling system of equivalents which ought to be offered in a terminographic reference work for legal translators.

Keywords

  • dictionary of equivalents
  • legal language
  • legal translation
  • congruence
  • equivalence
Open Access

CzEngClass – Towards a Lexicon of Verb Synonyms with Valency Linked to Semantic Roles

Published: 24 Jan 2018
Pages: 364 - 371

Abstract

In this paper, we introduce our ongoing project on synonymy in a bilingual context. This project aims at exploring the semantic ‘equivalence’ of verb senses of generally different verbal lexemes in a bilingual (Czech-English) setting. Specifically, it focuses on their valency behavior within such equivalence groups. We believe that using bilingual context (translation) as an important factor in the delimitation of classes of synonymous lexical units (verbs, in our case) may help to specify the verb senses, also with regard to the relation of their (semantic) roles to other verb senses and the roles of their arguments, more precisely than when using monolingual corpora. In our project, we work “bottom-up”, i.e., from the evidence recorded in our corpora, and not “top-down”, from a predefined set of semantic classes.

Keywords

  • lexical resources
  • valency
  • synonymy
  • semantic roles
  • dependency corpus
  • multilingual
Open Access

Slavic Phraseology: A View Through Corpora

Published: 24 Jan 2018
Pages: 372 - 384

Abstract

The study of word collocability is one of the main tasks of linguistics. The combinatory ability of language units, collocability, is one of the linguistic syntagmatic laws. This phenomenon is the main object of phraseology and lexicography. The article deals with set phrases of different types in Russian, Czech and Slovak from the point of view of their quantitative evaluation. Corpus linguistics understands set phrases as statistically determined unities. This approach is the starting point of various automatic ways to extract idioms and collocations. The paper describes experiments which show how text corpora and corpus methods and tools can be used to expand the entries in existing dictionaries and how set phrases could be evaluated quantitatively. It is shown and maintained that corpus linguistics methods and tools make it possible to create dictionaries of a new type, which should include a larger number of set phrases and collocations than before.

Keywords

  • Slavic phraseology
  • phraseological units
  • set phrases
  • idioms
  • collocations
  • corpus
  • lexicography
Open Access

Slovak Dependency Treebank in Universal Dependencies

Published: 24 Jan 2018
Pages: 385 - 395

Abstract

We describe a conversion of the syntactically annotated part of the Slovak National Corpus into the annotation scheme known as Universal Dependencies. Only a small subset of the data has been converted so far; yet it is the first Slovak treebank that is publicly available for research. We list a number of research projects in which the dataset has been used so far, including the first parsing results.

Keywords

  • treebank
  • dependency
  • universal dependencies
  • syntax
  • morphology
  • tagging
  • parsing