Bag of Words and Embedding Text Representation Methods for Medical Article Classification

eISSN: 2083-8492
Language: English

Publication frequency: 4 issues per year
Journal subjects: Mathematics, Applied Mathematics

Published online: 21 Dec 2023

Pages: 603–621

Received: 19 Apr 2023

Accepted: 28 Aug 2023

DOI: https://doi.org/10.34768/amcs-2023-0043

Keywords: text representation, text classification, bag of words, word embeddings

© 2023 Paweł Cichosz, published by Sciendo

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License.
