Journal Details
Format
Journal
eISSN
1338-4287
First Published
05 Mar 2010
Publication timeframe
2 times per year
Languages
English

Volume 72 (2022): Issue 4 (June 2022)
Building Web corpora as sources for linguistic research and its applications

14 Articles
Open Access

Consistency of morphological dictionary MorfFlex

Published Online: 17 Aug 2022
Page range: 855 - 861

Abstract

Language corpora usually contain, in addition to their own texts, various types of annotations. The most common one is a morphological annotation, which consists in assigning a lemma and a morphological tag to each wordform. For morphological tagging, morphological dictionaries are traditionally used. Our paper presents a new version of the so-called “Prague” morphological dictionary MorfFlex used for tagging many Czech corpora (particularly Prague Dependency Treebanks, corpora published by the Institute of the Czech National Corpus in Prague or large Czech web corpora of the Aranea series). Three basic principles were used to update the dictionary: the Golden Rule of Morphology, the Principle of Paradigm Unity, and the Principle of Paradigm Uniqueness.

Keywords

  • morphological dictionary
  • morphological analysis
  • language corpus
  • the Czech language
Open Access

Kladenští type as a problem of automatic morphological analysis

Published Online: 17 Aug 2022
Page range: 862 - 872

Abstract

The aim of our paper is to demonstrate the procedures by which the data needed to refine tools for automatic morphological analysis of Czech can be obtained from a corpus, namely the Araneum Bohemicum IV Maximum (Czech, 20.03) 7.10 G web corpus of the ARANEA series and Araneum Bohemicum Maximum (Czech, 15.04) 3,20 G (hereinafter Araneum). In particular, we focus on propria of the Kladenští type, i.e., substantivized adjectives denoting groups of persons according to affiliation. The probe into the Aranea web corpus has two goals: 1) a corpus-based description of the frequent properties of the Kladenští type, which can serve as a starting point for rule-based disambiguation; 2) a list of the most frequent lemmas belonging to the Kladenští type, which can then be included in the dictionaries of automatic morphological analyzers (e.g. the MorfFlex dictionary by Hajič and Hlaváčová). We believe that the probe can help improve the results of tools for automatic morphological analysis of Czech.

Keywords

  • automatic morphological analysis
  • derivational type
  • part of speech transition
Open Access

Language interpretation of German migration discourse (in comparison view of the years 2019 and 2015/16)

Published Online: 17 Aug 2022
Page range: 873 - 881

Abstract

This paper explores the topic of web corpora and analyzes how the issue of migration is grasped linguistically, from the perspective of social, cultural and cognitive linguistics. The research addresses how the linguistic grasp of this issue in Europe is constructed in a selected German mass media discourse. We compare the phenomenon of migration in 2015/2016, when record migration flows to the EU were registered, and in 2019, when migration kept increasing. The analysis is part of our research within the VEGA project Xenisms in German and Slovak communications.

Keywords

  • linguistic interpretation
  • web corpora
  • migration
  • German political discourse
Open Access

Thematic words in the Slovak pre-election campaign on Facebook

Published Online: 17 Aug 2022
Page range: 882 - 893

Abstract

The aim of this paper is 1. to describe/specify and compare the thematic orientations of 735 pre-election microblogs published on the virtual profiles of the six major Slovak political parties and 2. on the basis of this description and comparison, to identify and sketch features of Slovak political discourse. The conceptual and methodological frame consists of thematic words, that is, autosemantics above the so-called h-point, and the qualitative analysis of these thematic words. The identified features of general Slovak pre-election communication include: populist communication, ego presentation of the party, its leader or its candidate, conflict between government parties and opposition parties, and an image of Slovakia as a country facing troubles but also holding the potential to solve them.

Keywords

  • h-point
  • Facebook
  • microblog
  • political communication
  • political discourse
  • pre-election campaign
  • thematic words
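
The notion of thematic words used in the abstract above can be illustrated with a short sketch. It assumes the simplified reading of the h-point as the largest rank r whose frequency is at least r in the rank-frequency list (no interpolation between ranks), and it treats a word as autosemantic when it is absent from a caller-supplied list of function words. The function names and the function-word list are hypothetical and are not taken from the paper.

```python
from collections import Counter


def h_point(frequencies):
    """Largest 1-based rank r whose frequency is >= r (0 if there is none).

    Simplified variant: no interpolation between neighbouring ranks.
    """
    ranked = sorted(frequencies, reverse=True)
    h = 0
    for rank, freq in enumerate(ranked, start=1):
        if freq >= rank:
            h = rank
        else:
            break
    return h


def thematic_words(tokens, function_words):
    """Content words whose rank lies at or above the h-point of the text."""
    ranked = Counter(tokens).most_common()  # (word, frequency), descending
    h = h_point([freq for _, freq in ranked])
    # function_words is an assumed, caller-supplied stand-in for a real
    # list of synsemantic (function) words.
    return [word for word, _ in ranked[:h] if word not in function_words]


if __name__ == "__main__":
    sample = "a a a a b b b c c d".split()
    print(thematic_words(sample, function_words={"a"}))
```

On real data the tokens would normally be lemmatized first, so that inflected forms of one word are counted together.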
Open Access

Kabyle corpus digital database and exploitation. Test of lexicometric analysis of the identity dimension in the romanesque discourse

Published Online: 17 Aug 2022
Page range: 894 - 905

Abstract

The purpose of this contribution is to show, through a preliminary analysis of a corpus sample composed of the first five Kabyle novels (1963-1990), the contribution of lexicometry as a new method based on statistics to the treatment of large corpora and the establishment of databases. The aim is to describe all the phases of the preliminary processing of a corpus (transcription, tagging and lemmatization) that precede the various stages of its exploitation. In our corpus, we have opted to deal with the theme of identity raised by the five works, highlighting both the overused vocabulary and the singularity of each work in relation to the corpus as a whole. Before moving on to the quantitative analysis of the vocabulary, however, data preparation is necessary. We focus on the orthographic choices to be adopted in order to remove ambiguities, and on the tagging and lemmatization of the corpus. For this purpose, we have used the Lexico5 software tool.

Keywords

  • corpus
  • kabyle
  • identity
  • novel
  • lexicometry
  • databases
Open Access

Some observations on the composition by blending in contemporary French from Petit Robert

Published Online: 17 Aug 2022
Page range: 906 - 915

Abstract

The objective of this article is to analyze composition by amalgam in current French, focusing on the one hand on the notion of amalgam in linguistics and on the other hand on the use and frequency of the chosen amalgams across the diatopic variation of French. The notion of amalgam and / or of portmanteau word is not self-evident, and the explanations or definitions offered by dictionaries and by works on lexicology are not unanimous. Before presenting the results of more detailed research, we therefore find it essential to frame the contribution in a theoretical context dealing with the notion of amalgam, or even of portmanteau word, which allows us to better understand the whole problem.

Keywords

  • blending
  • French language
  • lexicography
  • loanword
  • diatopy
Open Access

Extracting fishing terminology using GNU/Linux tools

Published Online: 17 Aug 2022
Page range: 916 - 926

Abstract

The technological revolution of recent decades has made large collections of textual data accessible to researchers. At the same time, the development of increasingly sophisticated computer tools provides them with new methods of analyzing texts. In the present study, however, we examine the functionalities offered by traditional tools, namely GNU/Linux tools, which are easily accessible via the command line but still unknown among linguists with little or no computer knowledge. Our goal is to show how, using a web corpus on the one hand and GNU/Linux processing tools on the other, we can extract key terms of fishing jargon.

Keywords

  • web corpus
  • GNU/Linux tools
  • key-term
  • fishing terminology
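
The abstract above does not say how key terms are scored, so the following is only an illustrative sketch of one common approach: comparing a word's relative frequency in a focus corpus with its relative frequency in a reference corpus. It is written in Python rather than with the GNU/Linux command-line tools the paper examines, and the function names, the smoothing constant and the toy sentences are all assumptions.

```python
from collections import Counter


def relative_freqs(tokens):
    """Map each word to its share of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def key_terms(focus_tokens, reference_tokens, smoothing=1e-3, top=10):
    """Rank focus-corpus words by a smoothed ratio of relative frequencies."""
    focus = relative_freqs(focus_tokens)
    reference = relative_freqs(reference_tokens)
    # The smoothing constant (an assumed value) avoids division by zero
    # for words that never occur in the reference corpus.
    scores = {
        word: (freq + smoothing) / (reference.get(word, 0.0) + smoothing)
        for word, freq in focus.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top]


if __name__ == "__main__":
    focus = "the hook the bait the hook".split()
    reference = "the cat the dog the house".split()
    print(key_terms(focus, reference, top=2))
```

Words that are frequent in both corpora, such as "the" in the toy data, score close to 1 and drop out of the top of the list.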
Open Access

How different types of linguistic corpora shed light (or not) on various categories of substandard lexicon: contrastive analysis of vocabulary in the comedy “Les Kaïra” [Porn in the hood], a typical example of the hood film genre

Published Online: 17 Aug 2022
Page range: 927 - 941

Abstract

The arrival of WaC corpora, including the Aranea family of corpora, with their “close-to-spoken language” writing from various informal web pages, has opened new options to researchers of sociolects, mainly to those who were previously obliged to observe youth groups in spontaneous discourse and to produce the resulting time-consuming transcripts. Non-spontaneous spoken language from rap songs or youth film dialogues also helps researchers to describe the level of societal diffusion of some typical features of youth slang. In this paper, we demonstrate these crossed approaches in order to describe three types of verbs, representing different types of substandard lexicon, used in Les Kaïra (2012), a successful comedy about Parisian peri-urban post-adolescents.

Keywords

  • substandard verbs
  • French
  • neology
  • film dialogues
  • corpus linguistics
  • hood films
Open Access

Didactising specialised parallel corpora: the case of European directives

Published Online: 17 Aug 2022
Page range: 942 - 950

Abstract

Within the framework of a didactic proposal, this article presents a preliminary step towards specialized French-Greek translation. It attempts to highlight the benefits of autonomous learning through the consultation of a corpus of specialized parallel texts established by the EU institutions. The use of concordancers will provide solutions to students wishing to study the variability of terminology and specialized vocabulary at monolingual and bilingual levels.

Keywords

  • specialized translation French-Greek
  • parallel corpora
  • variability
  • terminology
  • concordancer
Open Access

Proposal to use the study corpus for contemporary French in Didactics of French as a Foreign Language

Published Online: 17 Aug 2022
Page range: 951 - 966

Abstract

Since 2018, the ORFÉO platform (Tools and Research on Written and Oral French) has made available to users a Study Corpus for sampled Contemporary French as well as tools for working with it. Although this resource is intended for an audience of researchers and students in the fields of linguistics and automatic language processing, we endeavor in this article to report on the didactic potential that it offers within the framework of a Licensing Syntax course treating “subordination” and intended for Czech and Slovak students at levels B1 to C1 in French. We propose a didactic sequence composed of four activities and pursuing three objectives: consolidating the mastery of the basic functions of dont («which») on the basis of a corpus of friendly conversations; using simple query interface tools; and introducing certain principles of corpus sociolinguistics. The corpus-based approach, by confronting learners with authentic contextualized data, helps to redefine the teaching-learning priorities of a language by giving primacy not to respect for grammatical norms but to genre norms.

Keywords

  • Didactics of French as a Foreign Language
  • Data-driven learning
  • corpus linguistics
  • “dont”
  • sociolinguistics
Open Access

Comparative Corpus-Driven Study of Prepositional Semantics in Russian and Czech

Published Online: 17 Aug 2022
Page range: 967 - 976

Abstract

This paper deals with prepositions with causal meaning in Russian and Czech. In Slavic languages, prepositions are closely connected to cases, and Russian and Czech prepositions have many common features. Prepositions show a relation in space or time or a special relationship between two or more people, places, things or situations. In this paper we deal with causal relations. There are different ways to express them, the most common being prepositional-case forms and complex sentences with a subordinate causal part. We analyze the repertoire of causal prepositions in both languages and describe their statistical representation in corpora. Another task is to reveal translation equivalents between the two languages.

Keywords

  • preposition
  • causal meaning
  • Russian language
  • Czech language
  • corpus statistics
  • parallel corpora
Open Access

Identifying Errors in Russian Web Corpora

Published Online: 17 Aug 2022
Page range: 977 - 985

Abstract

The explosion of the Web leads to the production of large amounts of texts and inevitably influences their quality. Errors that tend to occur more often can distort results, especially when texts are used for scientific purposes, in language teaching or learning. Hence, there is a need to examine the existing corpora based on web texts and to clean up the data, which may contain such “noisy” fragments. In our study, we deal with the problem of errors and analyze the Aranea Russicum Maximum corpus. Among such errors, we can name, above all, encoding errors, incorrect font types, as well as segments written in other languages. These phenomena result in incorrect morphological analysis and lemmatization, frequency distortion, as well as the fact that lexical units cannot be found and therefore displayed to corpus users. The paper focuses on the errors, describes their types and outlines possible ways to eliminate them.

Keywords

  • corpora
  • web texts
  • errors
  • typos
  • orthography
  • typography
  • Russian language
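
Two of the error types named in the abstract above, encoding errors and segments written in other languages, lend themselves to a simple token-level check. The sketch below is a minimal illustration under stated assumptions, not the authors' procedure: it treats the Unicode replacement character U+FFFD as a sign of an encoding error, and it decides the script of each letter from its Unicode character name. The labels and function names are hypothetical.

```python
import unicodedata


def scripts(token):
    """Set of coarse script labels for the alphabetic characters in a token."""
    found = set()
    for ch in token:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            found.add("cyrillic")
        elif name.startswith("LATIN"):
            found.add("latin")
        else:
            found.add("other")
    return found


def classify(token):
    """Label a token as 'encoding', 'mixed-script', 'foreign' or 'ok'."""
    if "\ufffd" in token:  # replacement character left by a failed decode
        return "encoding"
    found = scripts(token)
    if len(found) > 1:  # e.g. a Latin "c" inside a Cyrillic word
        return "mixed-script"
    if not found or found == {"cyrillic"}:
        return "ok"  # digits and punctuation are not flagged
    return "foreign"


if __name__ == "__main__":
    for token in ["\u0441\u043b\u043e\u0432\u043e", "c\u043b\u043e\u0432\u043e", "word"]:
        print(repr(token), classify(token))
```

A check of this kind flags candidates only; legitimate Latin-script names or citations in Russian text would still need manual review.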
Open Access

A Project Work as a Way of Bringing Corpora to Secondary School

Published Online: 17 Aug 2022
Page range: 986 - 995

Abstract

Corpus linguistics is one of the most dynamic and rapidly developing areas of modern linguistics. It affects all areas of linguistics, including methodology of teaching foreign languages, translation and other linguistic disciplines. Corpus linguistics has had a direct impact on teaching foreign languages. However, in general, it remains a marginal method in teaching. Analysis of publications on the subject allows us to conclude that very few studies are long-term and aimed at working with schoolchildren. This article proposes a model for the development of sustainable interest among high school students in online corpora as sources of linguistic information, including the initiation stage in the form of project work in mini-groups to study well-known sayings with the consequent stage aiming at completing tasks supplementing the main textbook on a regular basis. The organization of project work addressing the corps of 11th grade students of the Natural Science Lyceum at Peter the Great St. Petersburg Polytechnic University is described. The paper outlines further research.

Keywords

  • corpus linguistics
  • language pedagogy
  • longitudinal studies
  • method of projects/project work
  • proverbs
  • sayings
Open Access

Chinese Language Word Embeddings Based on the Corpus Hanku

Published Online: 17 Aug 2022
Page range: 996 - 1004

Abstract

Vector models based on word embeddings are an indispensable part of advanced Natural Language Processing research and language analysis. We describe several Chinese language (Pǔtōnghuà) word embeddings, the differences from “western” language models caused by specific orthographic and linguistic features of the written Chinese language, and introduce a publicly available web interface for querying the vector models, aimed at linguistically or pedagogically oriented users.

Keywords

  • word embeddings
  • Chinese
  • Pǔtōnghuà
  • corpus
  • NLP
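
Querying a vector model, as the web interface described in the abstract above allows, usually comes down to ranking words by the cosine similarity of their vectors. The sketch below shows that operation in plain Python. The two-dimensional vectors are invented toy values, not embeddings from the Hanku corpus, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(word, vectors, k=3):
    """The k words most similar to `word`, as (word, similarity) pairs."""
    target = vectors[word]
    others = [(w, cosine(target, v)) for w, v in vectors.items() if w != word]
    return sorted(others, key=lambda pair: pair[1], reverse=True)[:k]


if __name__ == "__main__":
    # Toy vectors for illustration only; real embeddings typically have
    # hundreds of dimensions and are trained on a segmented corpus.
    toy = {
        "\u5b66\u6821": [1.0, 0.0],  # "school"
        "\u5927\u5b66": [0.9, 0.1],  # "university"
        "\u82f9\u679c": [0.0, 1.0],  # "apple"
    }
    print(nearest("\u5b66\u6821", toy, k=2))
```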