In compiling and testing the diachronic part of the Helsinki Corpus of English Texts, our project group has come across three problems which arise from the use of computer corpora in studies of syntax and vocabulary. While these problems are mainly associated with work on diachronic corpora, they may be universal enough to deserve somewhat more general consideration. They could be called “The philologist’s dilemma”, “God’s truth fallacy”, and “The mystery of vanishing reliability”. The first could be described as pedagogical, the second methodological and the third pragmatic.
The introductory it pattern, as in ‘It is important to note that information was added’, is a tool used by academic writers for a range of different rhetorical and information-structural purposes. It is thus an important pattern for students to learn. Since previous research on student writing has indicated that there seems to be a correlation between form and function of the pattern, the present study sets out to investigate this more systematically in non-native-speaker and native-speaker student writing in two disciplines (linguistics and literature). In doing so, the study adds to and extends previous research looking into factors such as NS status and discipline. It uses data from three corpora: ALEC, BAWE and MICUSP. The results show that there is indeed a correlation between form and function, as the most common syntactic types of the pattern each display a preferred function and vice versa. While very few differences across NS status were found, there were certain discipline-specific disparities. The findings, which could be useful for teaching students about the use of the introductory it pattern, also have implications for the automatized functional tagging of parsed corpora.
This study focuses on the progressive vs. non-progressive alternation to revisit the debate on the ENL-ESL-EFL continuum (i.e. whether native (ENL) and nonnative (ESL/EFL) Englishes are dichotomous types of English or form a gradient continuum). While progressive marking is traditionally studied independently of its unmarked counterpart, we examine (i) how the grammatical contexts of both constructions systematically affect speakers’ constructional choices in ENL (American, British), ESL (Indian, Nigerian and Singaporean) and EFL (Finnish, French and Polish learner Englishes) and (ii) what light speakers’ varying constructional choices shed on the continuum debate. Methodologically, we use a clustering technique to group together individual varieties of English (i.e. to identify similarities and differences between those varieties) based on linguistic contextual features such as AKTIONSART, ANIMACY, SEMANTIC DOMAIN (of aspect-bearing lexical verb), TENSE, MODALITY and VOICE to assess the validity of the ENL-ESL-EFL classification for our data. Then, we conduct a logistic regression analysis (based on lemmas observed in both progressive and non-progressive constructions) to explore how grammatical contexts influence speakers’ constructional choices differently across English types. While, overall, our cluster analysis supports the ENL-ESL-EFL classification as a useful theoretical framework to explore cross-variety variation, the regression shows that, when we start digging into the specific linguistic contexts of (non-)progressive constructions, this classification does not emerge uniformly in the data. Ultimately, by including more than one statistical technique in their exploration of the continuum, scholars could avoid potential methodological biases.
Research into orthography in the history of English is not a simple venture. The history of English spelling is primarily based on printed texts, which fail to capture the range of variation inherent in the language; many manuscript phenomena are simply not found in printed texts. Manuscript-based corpora would be the ideal research data, but as this is resource-intensive, linguists use editions that have been produced by non-linguists. Many editions claim to retain original spellings, but in practice text is always normalized at the graph level and possibly more so. This does not preclude using such a corpus for orthographical research, but there has been no systematic way to determine the philological reliability of an edited text. In this paper we present a typological methodology we are developing for the evaluation of orthographical quality of edition-based corpora, with the aim of making the best use of bad data in the context of editions and manuscript practices. As a case study, we apply this methodology to the Early Modern and Late Modern English sections of the Corpus of Early English Correspondence.
Published Online: 11 Apr 2018 Page range: 133 - 166
Abstract
This paper offers a formally driven quantitative analysis of stance-annotated sentences in the Brexit Blog Corpus (BBC). Our goal is to identify features that determine the formal profiles of six stance categories (contrariety, hypotheticality, necessity, prediction, source of knowledge and uncertainty) in a subset of the BBC. The study has two parts: firstly, it examines a large number of formal linguistic features, such as punctuation, words and grammatical categories that occur in the sentences, in order to describe the specific characteristics of each category, and secondly, it compares these characteristics across the entire data set in order to determine similarities between the stance categories. We show that among the six stance categories in the corpus, contrariety and necessity are the most discriminative ones, with the former using longer sentences, more conjunctions, more repetitions and shorter forms than the sentences expressing other stances. Necessity, by contrast, has longer lexical forms but shorter sentences, which are syntactically more complex. We show that stance in our data set is expressed in sentences of around 21 words. The sentences consist mainly of alphabetical characters forming a varied vocabulary without special forms, such as digits or special characters.
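To make the notion of a sentence's "formal profile" more concrete, the following minimal Python sketch computes a handful of surface features of the kind the abstract mentions (sentence length in words, word length, punctuation, vocabulary variety, share of alphabetical characters, presence of digits). It is an illustration only: the function name `formal_profile`, the choice of features and the tokenisation by regular expression are assumptions made here, not the feature set or tooling used in the study.

```python
import re
import string


def formal_profile(sentence: str) -> dict:
    """Return a few surface-level features of one sentence.

    Hypothetical helper for illustration; not the study's actual feature set.
    """
    # Crude tokenisation: runs of letters, digits or underscores count as words.
    words = re.findall(r"\w+", sentence)
    # All characters except whitespace.
    visible = [c for c in sentence if not c.isspace()]
    n_words = len(words)

    return {
        # Sentence length in words.
        "n_words": n_words,
        # Average number of characters per word (0.0 if there are no words).
        "mean_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        # Count of ASCII punctuation marks only (string.punctuation).
        "n_punctuation": sum(c in string.punctuation for c in sentence),
        # Distinct lower-cased word forms divided by total words.
        "type_token_ratio": len({w.lower() for w in words}) / n_words if n_words else 0.0,
        # Share of non-whitespace characters that are alphabetical.
        "alpha_share": sum(c.isalpha() for c in visible) / len(visible) if visible else 0.0,
        # Whether the sentence contains any digit.
        "has_digit": any(c.isdigit() for c in sentence),
    }


if __name__ == "__main__":
    # Invented example sentence, not taken from the corpus.
    print(formal_profile("We must leave, and we must leave now."))
```

Profiles of this kind, averaged per stance category, would be one simple way to compare categories on sentence length or lexical form; the study itself also draws on grammatical categories, which this sketch does not attempt to cover.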
Published Online: 11 Apr 2018 Page range: 167 - 190
Abstract
The aims of this paper are twofold: i) to present the motivation and design of a sociohistorical corpus derived from the popular BBC Radio show, Desert Island Discs (DID); and ii) to illustrate the potential of the DID corpus (DIDC) with a case study. In an era of ever-increasing digital resources and scholarly interest in recent language change, there remains an enormous disparity between available written and spoken corpora. We describe how a corpus derived from DID contributes to redressing the balance. Treating DID as an example of a specialized register, namely, a ‘biographical chat show’, we review its attendant situational characteristics, and explain the affordances and design features of a sociolinguistic corpus sampling of the show. Finally, to illustrate the potential of DIDC for linguistic exploration of recent change, we conduct a case study on two pronouns with generic, impersonal reference, namely you and one.