Quantitative, corpus based research on spontaneous spoken Carpathian Rusyn language can cause several data-related problems: Speakers are using ambivalent forms in different quantities, resulting in a biased data set – while a stricter data-cleaning process would lead to a large scale data loss. On top of that, polytomous categorical dependent variables are hard to analyze due to methodological limitations. This paper provides several approaches to face unbalanced and biased data sets containing variation of conjugational forms of the verb