Journal Details
Format: Journal
eISSN: 1338-4287
First published: 05 Mar 2010
Publication frequency: 2 times per year
Languages: English

Volume 72 (2022): Edition 4 (June 2022)
Building Web corpora as sources for linguistic research and its applications

14 Articles
Open Access

Consistency of morphological dictionary MorfFlex

Published online: 17 Aug 2022
Pages: 855 - 861

Abstract

Language corpora usually contain, in addition to the texts themselves, various types of annotation. The most common is morphological annotation, which assigns a lemma and a morphological tag to each wordform. Morphological tagging traditionally relies on morphological dictionaries. Our paper presents a new version of the so-called “Prague” morphological dictionary MorfFlex, which is used for tagging many Czech corpora (in particular the Prague Dependency Treebanks, corpora published by the Institute of the Czech National Corpus in Prague, and large Czech web corpora of the Aranea series). Three basic principles guided the update of the dictionary: the Golden Rule of Morphology, the Principle of Paradigm Unity, and the Principle of Paradigm Uniqueness.

Keywords

  • morphological dictionary
  • morphological analysis
  • language corpus
  • the Czech language
Open Access

Kladenští type as a problem of automatic morphological analysis

Published online: 17 Aug 2022
Pages: 862 - 872

Abstract

The aim of our paper is to demonstrate how a corpus can supply the data needed to refine tools for automatic morphological analysis of Czech. We use the Araneum Bohemicum IV Maximum (Czech, 20.03) 7.10 G web corpus of the ARANEA series and Araneum Bohemicum Maximum (Czech, 15.04) 3.20 G (hereinafter Araneum). In particular, we focus on propria of the Kladenští type, i.e., substantivized adjectives denoting groups of persons according to affiliation. The probe into the Aranea web corpus has two goals: 1) a corpus-based description of frequent properties of the Kladenští type, which can serve as a starting point for rule-based disambiguation; 2) a list of the most frequent lemmas belonging to the Kladenští type, which can then be included in the dictionaries of automatic morphological analyzers (e.g. the MorfFlex dictionary by Hajič and Hlaváčová). We believe that the probe can help improve the results of tools for automatic morphological analysis of Czech.

Keywords

  • automatic morphological analysis
  • derivational type
  • part of speech transition
Open Access

Language interpretation of German migration discourse (in comparison view of the years 2019 and 2015/16)

Published online: 17 Aug 2022
Pages: 873 - 881

Abstract

This paper explores the topic of web corpora and analyzes the linguistic interpretation of migration from the perspective of social, cultural and cognitive linguistics. The research examines how the linguistic interpretation of this issue in Europe is constructed in a selected German mass-media discourse. We compare the phenomenon of migration in 2015/2016, when record migration flows to the EU were recorded, and in 2019, when migration kept increasing. The analysis is part of our research within the VEGA project Xenisms in German and Slovak communications.

Keywords

  • linguistic interpretation
  • web corpora
  • migration
  • German political discourse
Open Access

Thematic words in the Slovak pre-election campaign on Facebook

Published online: 17 Aug 2022
Pages: 882 - 893

Abstract

The aim of this paper is (1) to describe and compare the thematic orientations of 735 pre-election microblogs published on the virtual profiles of six major Slovak political parties and (2) to identify and sketch, on the basis of this description and comparison, features of Slovak political discourse. The conceptual and methodological frame consists of thematic words, that is, autosemantics above the so-called h-point, and the qualitative analysis of these thematic words. The identified features of general Slovak pre-election communication include populist communication, ego presentation of the party, its leader or its candidate, conflict between government and opposition parties, and an image of Slovakia as a country that faces troubles but also holds the potential to solve them.

Keywords

  • h-point
  • Facebook
  • microblog
  • political communication
  • political discourse
  • pre-election campaign
  • thematic words
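
The notion of thematic words above the h-point can be illustrated with a minimal Python sketch. It assumes the usual quantitative-linguistics definition of the h-point as the rank at which rank equals frequency in the rank-frequency list, with linear interpolation between neighbouring ranks when no exact match exists. The function names, the toy token list and the small function-word set are hypothetical; the abstract does not say how the authors separated autosemantics from function words, and treating "above the h-point" as rank <= h is also an assumption.

```python
from collections import Counter


def h_point(freqs_desc):
    """Return the h-point of a descending frequency list (ranks start at 1)."""
    prev_rank, prev_freq = None, None
    for rank, freq in enumerate(freqs_desc, start=1):
        if freq == rank:
            return float(rank)
        if freq < rank and prev_rank is not None:
            # No exact crossing: interpolate between the two neighbouring ranks.
            return (prev_freq * rank - freq * prev_rank) / (
                rank - prev_rank + prev_freq - freq
            )
        prev_rank, prev_freq = rank, freq
    # Every frequency exceeds its rank: fall back to the last rank.
    return float(len(freqs_desc))


def thematic_words(tokens, function_words):
    """Content words whose rank is at or above the h-point (rank <= h).

    `function_words` is a hypothetical stand-in for a proper
    part-of-speech filter separating autosemantics from synsemantics.
    """
    ranked = Counter(tokens).most_common()
    h = h_point([freq for _, freq in ranked])
    return [
        word
        for rank, (word, _) in enumerate(ranked, start=1)
        if rank <= h and word not in function_words
    ]
```

Calling `thematic_words(tokens, {"the", "a"})` on a tokenized microblog collection returns the content words that are frequent enough to sit among the top-ranked items, which is the input to the qualitative analysis the abstract describes.
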
Open Access

Kabyle corpus digital database and exploitation. Test of lexicometric analysis of the identity dimension in the romanesque discourse

Published online: 17 Aug 2022
Pages: 894 - 905

Abstract

The purpose of this contribution is to show, through a preliminary analysis of a corpus sample composed of the first five Kabyle novels (1963-1990), what lexicometry, a new statistics-based method, contributes to the treatment of large corpora and the establishment of databases. The aim is to describe all the phases of the preliminary processing of a corpus (transcription, tagging and lemmatization) that precede the various stages of its exploitation. In our corpus, we have chosen to address the theme of identity raised by the five works, highlighting both the overused vocabulary and the singularity of each work in relation to the corpus as a whole. Before the quantitative analysis of the vocabulary, however, the data must be prepared. We focus on the orthographic choices needed to remove all ambiguities, and on the markup and lemmatization of the corpus. For this purpose, we used the Lexico5 software tool.

Keywords

  • corpus
  • kabyle
  • identity
  • novel
  • lexicometry
  • databases
Open Access

Some observations on the composition by blending in contemporary French from Petit Robert

Published online: 17 Aug 2022
Pages: 906 - 915

Abstract

The objective of this article is to analyze composition by blending (amalgam) in contemporary French, focusing on the one hand on the notion of amalgam in linguistics and on the other hand on the use and frequency of the chosen amalgams across the diatopic variation of French. The notion of amalgam and/or portmanteau word is not self-evident, and the explanations and definitions offered by dictionaries and by works on lexicology differ from one another. Before presenting the results of more detailed research, we therefore find it essential to place the contribution in a theoretical context that deals with the notion of amalgam, or portmanteau word, which allows us to better understand the problem as a whole.

Keywords

  • blending
  • French language
  • lexicography
  • loanword
  • diatopy
Open Access

Extracting fishing terminology using GNU/Linux tools

Published online: 17 Aug 2022
Pages: 916 - 926

Abstract

The technological revolution of recent decades has made large textual data collections accessible to researchers. At the same time, the development of increasingly sophisticated computer tools provides them with new methods of analyzing texts. In the present study, however, we examine the functionalities offered by traditional tools, namely GNU/Linux tools, which are easily accessible via the command line but still unknown among linguists with little or no computer knowledge. Our goal is to show how, using a web corpus on the one hand and GNU/Linux processing tools on the other, we can extract key terms of fishing jargon.

Keywords

  • web corpus
  • GNU/Linux tools
  • key-term
  • fishing terminology
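
The idea of extracting key terms by comparing a specialized web corpus with a general one can be sketched as follows. The study itself works with GNU/Linux command-line tools; Python is used here only for illustration. The abstract does not say how terms are scored, so this sketch swaps in one widely used option, the "simple maths" keyness ratio of smoothed per-million frequencies. All function names, the smoothing value and the sample sentences are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def per_million(tokens):
    """Relative frequency of each word, scaled to occurrences per million."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count * 1_000_000 / total for word, count in counts.items()}


def key_terms(focus_text, reference_text, smoothing=1.0, top=10):
    """Rank words of the focus corpus by a smoothed frequency ratio.

    score = (focus_fpm + smoothing) / (reference_fpm + smoothing)
    Ties are broken alphabetically so the output is deterministic.
    """
    focus = per_million(tokenize(focus_text))
    reference = per_million(tokenize(reference_text))
    scores = {
        word: (fpm + smoothing) / (reference.get(word, 0.0) + smoothing)
        for word, fpm in focus.items()
    }
    return sorted(scores, key=lambda w: (-scores[w], w))[:top]
```

Words that are common in both corpora, such as articles, score close to 1 and drop to the bottom of the list, while words that occur mainly in the specialized corpus rise to the top.
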
Open Access

How different types of linguistic corpora shed light (or not) on various categories of substandard lexicon: contrastive analysis of vocabulary in the comedy “Les Kaïra” [Porn in the hood], a typical example of the hood film genre

Published online: 17 Aug 2022
Pages: 927 - 941

Abstract

The arrival of WaC corpora, including the Aranea family, whose texts from various informal web pages are close to spoken language, opened new options for researchers of sociolects, especially those who previously had to observe youth groups in spontaneous discourse and then produce time-consuming transcripts. Non-spontaneous spoken language from rap songs or youth film dialogues also helps researchers describe how widely some typical features of youth slang have spread in society. In this paper, we demonstrate these combined approaches by describing three types of verbs, representing different types of substandard lexicon, used in Les Kaïra (2012), a successful comedy about Parisian peri-urban post-adolescents.

Keywords

  • substandard verbs
  • French
  • neology
  • film dialogues
  • corpus linguistics
  • hood films
Open Access

Didactising specialised parallel corpora: the case of European directives

Published online: 17 Aug 2022
Pages: 942 - 950

Abstract

As part of a didactic proposal, this article presents a preliminary step toward specialized French-Greek translation. It highlights the benefits of autonomous learning through the consultation of a corpus of specialized parallel texts established by the EU institutions. The use of concordancers offers solutions to students who wish to study the variability of terminology and specialized vocabulary at the monolingual and bilingual levels.

Keywords

  • specialized translation French-Greek
  • parallel corpora
  • variability
  • terminology
  • concordancer
Open Access

Proposal to use the study corpus for contemporary French in Didactics of French as a Foreign Language

Published online: 17 Aug 2022
Pages: 951 - 966

Abstract

Since 2018, the ORFÉO platform (Tools and Research on Written and Oral French) has offered users a sampled Study Corpus for Contemporary French together with tools for exploiting it. Although this resource is intended for researchers and students in linguistics and natural language processing, this article reports on its didactic potential within a bachelor's-level syntax course on “subordination” intended for Czech and Slovak students at levels B1 to C1 in French. We propose a didactic sequence composed of four activities and pursuing three objectives: consolidating mastery of the basic functions of dont («which») using a corpus of conversations between friends; using simple query interface tools; and introducing certain principles of corpus sociolinguistics. The corpus-based approach, by confronting learners with authentic contextualized data, helps to redefine the teaching and learning priorities of a language by giving primacy not to respect for grammatical norms but to genre norms.

Keywords

  • Didactics of French as a Foreign Language
  • Data-driven learning
  • corpus linguistics
  • “dont”
  • sociolinguistics
Open Access

Comparative Corpus-Driven Study of Prepositional Semantics in Russian and Czech

Published online: 17 Aug 2022
Pages: 967 - 976

Abstract

This paper deals with prepositions with causal meaning in Russian and Czech. In Slavic languages, prepositions are closely connected to cases, and Russian and Czech prepositions have many features in common. Prepositions show a relation in space or time or a special relationship between two or more people, places, things or situations. This paper addresses causal relations, which can be expressed in different ways. The most common means are prepositional-case forms and complex sentences with a subordinate causal clause. We analyze the repertoire of causal prepositions in both languages and describe their statistical representation in corpora. A further task is to identify translation equivalents between the two languages.

Keywords

  • preposition
  • causal meaning
  • Russian language
  • Czech language
  • corpus statistics
  • parallel corpora
Open Access

Identifying Errors in Russian Web Corpora

Published online: 17 Aug 2022
Pages: 977 - 985

Abstract

The explosion of the Web leads to the production of large amounts of text and inevitably influences its quality. Errors that tend to occur more often can distort results, especially when texts are used for scientific purposes or in language teaching and learning. Hence, there is a need to examine existing corpora based on web texts and to clean up the data, which may contain such “noisy” fragments. In our study, we address the problem of errors by analyzing the Araneum Russicum Maximum corpus. The errors include, above all, encoding errors, incorrect font types, and segments written in other languages. These phenomena result in incorrect morphological analysis and lemmatization, in frequency distortion, and in lexical units that cannot be found and therefore cannot be displayed to corpus users. The paper focuses on the errors, describes their types and outlines possible ways to eliminate them.

Keywords

  • corpora
  • web texts
  • errors
  • typos
  • orthography
  • typography
  • Russian language
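
Some of the error types named in this abstract can be flagged token by token. The following Python sketch is only an illustration under stated assumptions, not the authors' procedure: it treats the Unicode replacement character as a sign of an encoding error, a token mixing Cyrillic and Latin letters as a likely font or lookalike-character problem, and a token with letters but no Cyrillic as a candidate foreign-language segment. The function names and category labels are hypothetical.

```python
import unicodedata


def letter_scripts(token):
    """Return the set of scripts used by the letters in a token."""
    found = set()
    for ch in token:
        if not ch.isalpha():
            continue
        # unicodedata.name returns the default "" for unnamed characters.
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            found.add("Cyrillic")
        elif name.startswith("LATIN"):
            found.add("Latin")
        else:
            found.add("Other")
    return found


def classify_token(token):
    """Assign a coarse, hypothetical error label to one token."""
    if "\ufffd" in token:
        return "encoding"  # replacement character left by a failed decode
    scripts = letter_scripts(token)
    if len(scripts) > 1:
        return "mixed-script"  # e.g. a Latin "c" inside a Cyrillic word
    if not scripts or scripts == {"Cyrillic"}:
        return "ok"  # purely Cyrillic, or no letters at all (digits, punctuation)
    return "foreign"
```

Counting the labels over a frequency list gives a first estimate of how much noise each category contributes, and the flagged tokens can then be inspected manually before any are removed.
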
Open Access

A Project Work as a Way of Bringing Corpora to Secondary School

Published online: 17 Aug 2022
Pages: 986 - 995

Abstract

Corpus linguistics is one of the most dynamic and rapidly developing areas of modern linguistics. It affects all areas of linguistics, including the methodology of teaching foreign languages, translation and other linguistic disciplines. Corpus linguistics has had a direct impact on foreign language teaching, yet in general it remains a marginal method in the classroom. An analysis of publications on the subject shows that very few studies are long-term and aimed at working with schoolchildren. This article proposes a model for developing a sustainable interest among high school students in online corpora as sources of linguistic information. The model comprises an initiation stage, in which mini-groups carry out project work on well-known sayings, and a subsequent stage of regular tasks that supplement the main textbook. We describe the organization of project work with 11th-grade students of the Natural Science Lyceum at Peter the Great St. Petersburg Polytechnic University. The paper also outlines further research.

Keywords

  • corpus linguistics
  • language pedagogy
  • longitudinal studies
  • method of projects/project work
  • proverbs
  • sayings
Open Access

Chinese Language Word Embeddings Based on the Corpus Hanku

Published online: 17 Aug 2022
Pages: 996 - 1004

Abstract

Vector models based on word embeddings are an indispensable part of advanced Natural Language Processing research and language analysis. We describe several word embeddings for the Chinese language (Pǔtōnghuà) and their differences from models of “western” languages, which are caused by specific orthographic and linguistic features of written Chinese. We also introduce a publicly available web interface for querying the vector models, aimed at linguistically or pedagogically oriented users.

Keywords

  • word embeddings
  • Chinese
  • Pǔtōnghuà
  • corpus
  • NLP
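
Querying a vector model typically means asking which words lie closest to a given word in the embedding space. The Python sketch below shows that operation with cosine similarity. It is not the interface described in the abstract: the three-dimensional toy vectors, the chosen words and the function names are all hypothetical, and real embeddings have hundreds of dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(word, vectors, k=3):
    """Return the k words most similar to `word`, excluding the word itself."""
    target = vectors[word]
    scored = [
        (cosine(target, vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    # Highest similarity first; ties broken by the word for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:k]]


# Hypothetical toy vectors, invented for illustration only.
toy_vectors = {
    "猫": [0.9, 0.1, 0.0],
    "狗": [0.8, 0.2, 0.1],
    "汽车": [0.0, 0.1, 0.9],
}
```

With these toy vectors, `nearest("猫", toy_vectors, k=1)` picks 狗 over 汽车 because the first two vectors point in nearly the same direction.
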