Embedding-based Detection and Extraction of Research Topics from Academic Documents Using Deep Clustering

Language: English

Publication frequency: 4 times per year
Journal subjects: Computer Science, Information Technology, Project Management, Databases and Data Mining

Sahand Vahidnia

Alireza Abbasi

Hussein A. Abbass

Article category: Research Paper

Published: 18 Jun 2021

Page range: 99–122

Received: 30 Nov 2020

Accepted: 26 Apr 2021

DOI: https://doi.org/10.2478/jdis-2021-0024

Keywords: Dynamics of science, Science mapping, Document clustering, Artificial intelligence, Deep learning

© 2021 Sahand Vahidnia et al., published by Sciendo

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
