Journal Details
First Published: 05 Apr 2007
Publication timeframe: 4 times per year
Access type: Open Access

A Case Study in Text Mining of Discussion Forum Posts: Classification with Bag of Words and Global Vectors

Published Online: 11 Jan 2019
Volume & Issue: Volume 28 (2018) - Issue 4 (December 2018)
Page range: 787 - 801
Received: 22 Sep 2017
Accepted: 02 Jun 2018

Despite the rapid growth of other types of social media, Internet discussion forums remain a highly popular communication channel and a useful source of text data for analyzing user interests and sentiments. Because they are suited to richer, deeper, and longer discussions than microblogging services, they reflect particularly well topics of long-term, persistent involvement and areas of specialized knowledge or experience. Discovering and characterizing such topics and areas with text mining algorithms is therefore an interesting and useful research direction. This work presents a case study in which selected classification algorithms are applied to posts from a Polish discussion forum devoted to psychoactive substances obtained from home-grown plants, such as hashish or marijuana. The utility of two vector text representations is examined: the simple bag-of-words representation and the more refined global vectors (GloVe) embedding representation. While the former is found to work well for the multinomial naive Bayes algorithm, the latter turns out to be more useful for the other classification algorithms: logistic regression, SVMs, and random forests. The obtained results suggest that post classification can be applied to measure the publication intensity of particular topics and, in the case of forums related to psychoactive substances, to monitor the risk of drug-related crime.
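To make the two representations concrete, the following is a minimal sketch in plain Python, not the study's actual implementation. It builds bag-of-words counts, fits a multinomial naive Bayes classifier with Laplace smoothing on them, and separately shows one common way to turn word embeddings into a document vector by averaging. The toy posts, the class labels, the two-dimensional `toy_embeddings`, and the averaging step itself are assumptions made for illustration; real GloVe vectors would be trained on a corpus and have far more dimensions.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count word occurrences."""
    return Counter(text.lower().split())


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log word likelihoods per class."""
    vocab = sorted({w for d in docs for w in bag_of_words(d)})
    log_prior, log_lik = {}, {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter()
        for d in class_docs:
            counts.update(bag_of_words(d))
        total = sum(counts.values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_lik


def predict_nb(model, text):
    """Return the class with the highest posterior log score for the text."""
    log_prior, log_lik = model
    bow = bag_of_words(text)
    scores = {
        c: log_prior[c]
        + sum(n * log_lik[c][w] for w, n in bow.items() if w in log_lik[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)


def mean_vector(text, embeddings):
    """Average the vectors of known words; return a zero vector if none match."""
    dim = len(next(iter(embeddings.values())))
    vecs = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]


# Hypothetical toy posts and labels (not data from the study).
docs = [
    "seeds soil light watering",
    "soil light harvest seeds",
    "dose effect experience",
    "effect dose trip experience",
]
labels = ["growing", "growing", "usage", "usage"]
model = train_multinomial_nb(docs, labels)
predicted = predict_nb(model, "light soil seeds")

# Hypothetical 2-dimensional embeddings standing in for trained GloVe vectors.
toy_embeddings = {"soil": [1.0, 0.0], "light": [0.0, 1.0]}
doc_vector = mean_vector("soil light unknown", toy_embeddings)
```

In this sketch the bag-of-words counts feed the naive Bayes model directly, whereas a dense vector such as `doc_vector` would typically be passed as a feature vector to a classifier like logistic regression, an SVM, or a random forest.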


