Agrawal, A., Fu, W., & Menzies, T. (2018). What is wrong with topic modelling? And how to fix it using search-based software engineering. Information and Software Technology, 98, 74-88.
Albanese, F., & Feuerstein, E. (2021). Improved topic modelling in Twitter through community pooling. In String Processing and Information Retrieval: 28th International Symposium, SPIRE 2021, Lille, France, October 4–6, 2021, Proceedings 28 (pp. 209-216). Springer International Publishing.
Aho, A. V., Lam, M. S., Sethi, R., & Ullman, J. D. (2007). Compilers: Principles, techniques, and tools (2nd ed.). Pearson Education.
Athiwaratkun, B., Wilson, A. G., & Anandkumar, A. (2018). Probabilistic FastText for multi-sense word embeddings. arXiv preprint arXiv:1806.02901.
Barriere, V., & Balahur, A. (2020). Improving sentiment analysis over non-English tweets using multilingual transformers and automatic translation for data-augmentation. arXiv preprint arXiv:2010.03486.
Beltagy, I., Lo, K., & Cohan, A. (2019). SciBERT: A pretrained language model for scientific text. arXiv preprint arXiv:1903.10676.
Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77-84.
Blei, D., & Lafferty, J. (2006). Correlated topic models. Advances in Neural Information Processing Systems, 18, 147.
Bochinski, E., Senst, T., & Sikora, T. (2017). Hyper-parameter optimization for convolutional neural network committees based on evolutionary algorithms. In 2017 IEEE international conference on image processing (ICIP) (pp. 3924-3928). IEEE.
Boyd-Graber, J., & Blei, D. (2008). Syntactic topic models. Advances in Neural Information Processing Systems, 21.
Briciu, A., Călin, A. D., Miholca, D. L., Moroz-Dubenco, C., Petrașcu, V., & Dascălu, G. (2024). Machine-Learning-Based Approaches for Multi-Level Sentiment Analysis of Romanian Reviews. Mathematics, 12(3), 456.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877-1901.
Chalkidis, I., Fergadiotis, M., Malakasiotis, P., Aletras, N., & Androutsopoulos, I. (2020). LEGAL-BERT: The muppets straight out of law school. arXiv preprint arXiv:2010.02559.
Cheng, X., Yan, X., Lan, Y., & Guo, J. (2014). BTM: Topic modeling over short texts. IEEE Transactions on Knowledge and Data Engineering, 26(12), 2928-2941.
Ciobotaru, A., & Dinu, L. P. (2023). SART & COVIDSentiRo: Datasets for Sentiment Analysis Applied to Analyzing COVID-19 Vaccination Perception in Romanian Tweets. Procedia Computer Science, 225, 1331-1339.
Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., ... & Stoyanov, V. (2019). Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 4171-4186).
Dimitrov, D., Baran, E., Fafalios, P., Yu, R., Zhu, X., Zloch, M., & Dietze, S. (2020). TweetsCOV19: A knowledge base of semantically annotated tweets about the COVID-19 pandemic. In Proceedings of the 29th ACM international conference on information & knowledge management (pp. 2991-2998).
Dingliwal, S., Shenoy, A., Bodapati, S., Gandhe, A., Gadde, R. T., & Kirchhoff, K. (2021). Prompt Tuning GPT-2 language model for parameter-efficient domain adaptation of ASR systems. arXiv preprint arXiv:2112.08718.
Dumitrescu, S. D., Avram, A. M., & Pyysalo, S. (2020). The birth of Romanian BERT. arXiv preprint arXiv:2009.08712.
Eisenstein, J. (2013). What to do about bad language on the internet. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 359-369).
Erlingsson, Ú., Feldman, V., Mironov, I., Raghunathan, A., Talwar, K., & Thakurta, A. (2019). Amplification by shuffling: From local to central differential privacy via anonymity. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms (pp. 2468-2479). Society for Industrial and Applied Mathematics.
Gage, P. (1994). A new algorithm for data compression. The C Users Journal, 12(2), 23-38.
Ganganwar, V. (2012). An overview of classification algorithms for imbalanced datasets. International Journal of Emerging Technology and Advanced Engineering, 2(4), 42-47.
Gentzkow, M., Kelly, B., & Taddy, M. (2019). Text as data. Journal of Economic Literature, 57(3), 535-574.
Goldberg, Y., & Levy, O. (2014). word2vec Explained: deriving Mikolov et al.’s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722.
Gupta, M. R., Bengio, S., & Weston, J. (2014). Training highly multiclass classifiers. The Journal of Machine Learning Research, 15(1), 1461-1492.
Hamborg, F., & Donnay, K. (2021). NewsMTSC: A dataset for (multi-)target-dependent sentiment classification in political news articles. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics.
He, P., Liu, X., Gao, J., & Chen, W. (2020). DeBERTa: Decoding-enhanced BERT with disentangled attention. arXiv preprint arXiv:2006.03654.
Ho, V. A., Nguyen, D. H. C., Nguyen, D. H., Pham, L. T. V., Nguyen, D. V., Nguyen, K. V., & Nguyen, N. L. T. (2020). Emotion recognition for Vietnamese social media text. In Computational Linguistics: 16th International Conference of the Pacific Association for Computational Linguistics, PACLING 2019, Hanoi, Vietnam, October 11–13, 2019, Revised Selected Papers 16 (pp. 319-333). Springer Singapore.
Hong, L., & Davison, B. D. (2010). Empirical study of topic modeling in Twitter. In Proceedings of the first workshop on social media analytics (pp. 80-88).
Istrati, L., & Ciobotaru, A. (2022). Automatic monitoring and analysis of brands using data extracted from Twitter in Romanian. In Intelligent Systems and Applications: Proceedings of the 2021 Intelligent Systems Conference (IntelliSys) Volume 3 (pp. 55-75). Springer International Publishing.
Izsak, P., Berchansky, M., & Levy, O. (2021). How to train BERT with an academic budget. arXiv preprint arXiv:2104.07705.
Lee, K., Palsetia, D., Narayanan, R., Patwary, M. M. A., Agrawal, A., & Choudhary, A. (2011). Twitter trending topic classification. In 2011 IEEE 11th international conference on data mining workshops (pp. 251-258). IEEE.
Leskovec, J., Rajaraman, A., & Ullman, J. D. (2020). Mining of massive datasets. Cambridge University Press.
Levine, Y., Lenz, B., Lieber, O., Abend, O., Leyton-Brown, K., Tennenholtz, M., & Shoham, Y. (2020). PMI-Masking: Principled masking of correlated spans. arXiv preprint arXiv:2010.01825.
Liu, H., Tam, D., Muqeeth, M., Mohta, J., Huang, T., Bansal, M., & Raffel, C. A. (2022). Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems, 35, 1950-1965.
Liu, X., He, P., Chen, W., & Gao, J. (2019). Improving multi-task deep neural networks via knowledge distillation for natural language understanding. arXiv preprint arXiv:1904.09482.
Masala, M., Ruseti, S., & Dascalu, M. (2020). RoBERT – A Romanian BERT model. In Proceedings of the 28th International Conference on Computational Linguistics (pp. 6626-6637).
Mori, N., Takeda, M., & Matsumoto, K. (2005). A comparison study between genetic algorithms and Bayesian optimize algorithms by novel indices. In Proceedings of the 7th annual conference on Genetic and evolutionary computation (pp. 1485-1492).
Neagu, D. C., Rus, A. B., Grec, M., Boroianu, M. A., Bogdan, N., & Gal, A. (2022). Towards sentiment analysis for Romanian Twitter content. Algorithms, 15(10), 357.
Neagu, D. C., Rus, A. B., Grec, M., Boroianu, M., & Silaghi, G. C. (2022). Topic Classification for Short Texts. In International Conference on Information Systems Development (pp. 207-222). Cham: Springer International Publishing.
Nguyen, D. Q., Vu, T., & Nguyen, A. T. (2020). BERTweet: A pre-trained language model for English Tweets. arXiv preprint arXiv:2005.10200.
Oh, S. (2017). Top-k hierarchical classification. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 31, No. 1).
Ojha, V. K., Abraham, A., & Snášel, V. (2017). Metaheuristic design of feedforward neural networks: A review of two decades of research. Engineering Applications of Artificial Intelligence, 60, 97-116.
Paaß, G., & Giesselbach, S. (2023). Pre-trained Language Models. In Foundation Models for Natural Language Processing: Pre-trained Language Models Integrating Media (pp. 19-78). Cham: Springer International Publishing.
Pelikan, M., Goldberg, D. E., & Lobo, F. G. (2002). A survey of optimization by building and using probabilistic models. Computational Optimization and Applications, 21, 5-20.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8), 9.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., ... & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140), 1-67.
Rahman, M. A., & Akter, Y. A. (2019). Topic classification from text using decision tree, K-NN and multinomial naïve Bayes. In 2019 1st international conference on advances in science, engineering and robotics technology (ICASERT) (pp. 1-4). IEEE.
Raschka, S. (2018). Model evaluation, model selection, and algorithm selection in machine learning. arXiv preprint arXiv:1811.12808.
Tani, L., Rand, D., Veelken, C., & Kadastik, M. (2021). Evolutionary algorithms for hyperparameter optimization in machine learning for application in high energy physics. The European Physical Journal C, 81, 1-9.
Tay, Y., Dehghani, M., Tran, V. Q., Garcia, X., Wei, J., Wang, X., ... & Metzler, D. (2022). UL2: Unifying language learning paradigms. arXiv preprint arXiv:2205.05131.
Vasile, A., Rădulescu, R., & Păvăloiu, I. B. (2014). Topic classification in Romanian blogosphere. In 12th Symposium on Neural Network Applications in Electrical Engineering (NEUREL) (pp. 131-134). IEEE.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
Vayansky, I., & Kumar, S. A. (2020). A review of topic modeling methods. Information Systems, 94, 101582.
Velankar, A., Patil, H., & Joshi, R. (2022). Mono vs multilingual BERT for hate speech detection and text classification: A case study in Marathi. In IAPR workshop on artificial neural networks in pattern recognition (pp. 121-128). Cham: Springer International Publishing.
Wei, J., Garrette, D., Linzen, T., & Pavlick, E. (2021). Frequency effects on syntactic rule learning in transformers. arXiv preprint arXiv:2109.07020.
Wettig, A., Gao, T., Zhong, Z., & Chen, D. (2022). Should you mask 15% in masked language modeling? arXiv preprint arXiv:2202.08005.
Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I 13 (pp. 818-833). Springer International Publishing.
Zeng, J., Li, J., Song, Y., Gao, C., Lyu, M. R., & King, I. (2018). Topic memory networks for short text classification. arXiv preprint arXiv:1809.03664.
Zhang, A., Lipton, Z. C., Li, M., & Smola, A. J. (2023). Dive into deep learning. Cambridge University Press.
Zhang, Y., Jin, R., & Zhou, Z. H. (2010). Understanding bag-of-words model: A statistical framework. International Journal of Machine Learning and Cybernetics, 1, 43-52.
Zhao, J., Liu, K., & Xu, L. (2016). Sentiment analysis: Mining opinions, sentiments, and emotions.