Open Access

Against level-3-only analyses in corpus linguistics


Cite

In the last few decades, much work in corpus linguistics has attempted to discover, and then interpret, differences in the frequencies of use of linguistic elements (words, patterns, constructions, discourse features, etc.). It is probably fair to say that such studies were particularly frequent in (i) learner corpus research, (ii) corpus-based varieties research, and (iii) sociolinguistically motivated studies. For instance, many studies have discussed the differences in how often certain elements are used (i) in corpus data from native speakers vs. corpus data from learner from different L1 backgrounds, (ii) in corpora representing different inner- and outer-circle varieties, or (iii) by speakers in corpora representing people of different gender or sexual identities.

This paper will make the admittedly bold claim that any such study can in fact by definition unable to ‘prove’ what is often their main points, namely that the distributional differences found are in fact due to the one hypothesized explanatory variable(s) of L1, VARIETY, or, e.g., GENDER even when the distributional differences are significant and come with a decent effect size. To substantiate this claim, I will discuss some terminology from the family of methods known as multi-level modeling, namely the distinction between level-1, level-2, ... level-n variables and its relevance for many corpus studies. Second, I will then demonstrate how studies using only the above kinds of variables cannot distinguish the effect of their favored predictors from the effect of local/contextual level-1 variables. Third, in discussing this, I will exemplify how such effects need to be explored quantitatively instead.