Embedding-based Detection and Extraction of Research Topics from Academic Documents Using Deep Clustering

Figure 1

Simplified overview of the overall process and objectives.

Figure 2

PV-DM model diagram.
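The PV-DM (Distributed Memory) variant of Doc2Vec shown above learns a paragraph vector that is combined with context word vectors to predict a target word. A minimal sketch of training such a model with gensim follows; the toy abstracts and all hyperparameters are illustrative assumptions, not the settings used in the study.

# Minimal PV-DM sketch with gensim; dm=1 selects the PV-DM architecture.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

abstracts = [
    "deep clustering detects research topics in article abstracts",
    "arc consistency propagation speeds up csp solvers",
]  # stand-ins for the journal article texts

corpus = [TaggedDocument(words=text.split(), tags=[i])
          for i, text in enumerate(abstracts)]

# The paragraph vector plus context word vectors predict the next word.
model = Doc2Vec(corpus, dm=1, vector_size=100, window=5,
                min_count=1, epochs=40)

doc_vec = model.dv[0]  # learned embedding of the first document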

Figure 3

DEC autoencoder diagram. K is the number of clusters.
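As a rough sketch of the DEC idea in this figure (not the authors' exact architecture), the PyTorch code below illustrates the two losses involved: autoencoder reconstruction and a KL clustering loss over K soft assignments computed with a Student's t kernel. Layer sizes, K, and the input dimension are assumptions.

# Minimal DEC sketch in PyTorch; dimensions and K are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

K, d_in, d_z = 10, 768, 10  # clusters, input dim, latent dim (assumed)

class DEC(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, 256), nn.ReLU(),
                                     nn.Linear(256, d_z))
        self.decoder = nn.Sequential(nn.Linear(d_z, 256), nn.ReLU(),
                                     nn.Linear(256, d_in))
        self.centroids = nn.Parameter(torch.randn(K, d_z))

    def forward(self, x):
        z = self.encoder(x)
        # Soft assignment q_ij: Student's t kernel between z_i and centroid j
        q = 1.0 / (1.0 + torch.cdist(z, self.centroids) ** 2)
        q = q / q.sum(dim=1, keepdim=True)
        return z, q, self.decoder(z)

def target_distribution(q):
    # Sharpened target p used by DEC's KL(p || q) clustering loss
    w = q ** 2 / q.sum(dim=0)
    return w / w.sum(dim=1, keepdim=True)

model = DEC()
x = torch.randn(32, d_in)  # stand-in batch of document embeddings
z, q, x_hat = model(x)
loss = F.kl_div(q.log(), target_distribution(q).detach(),
                reduction="batchmean") + F.mse_loss(x_hat, x)
loss.backward()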

Figure 4

Additional WoS data figure.

Figure 5

Data growth from 1970 to 2019 in the three journals, yielding about 4,300 records.

Figure 6

Diagram of clusters over the study period. The Y-axis shows the number of articles (symmetrical); the X-axis shows years.

Figure 7

N-gram dictionary histogram of the dataset. The Y-axis is frequency; the X-axis is N.

Comparison of clustering techniques for pre-trained BERT embeddings.

Embedding            Clustering       NMI      AMI      ARI
BERT (uncased A12)   k-means (rand)   0.1950   0.1948   0.1442
                     k-means++        0.1950   0.1948   0.1442
                     Agglomerative    0.2158   0.2156   0.1877
                     DBSCAN           0.0042   0.0037   0.0001
                     DEC              0.2568   0.2566   0.2377
SciBERT              k-means (rand)   0.1498   0.1496   0.1266
                     k-means++        0.1492   0.1489   0.1266
                     Agglomerative    0.1903   0.1901   0.1505
                     DBSCAN           0.0042   0.0037   0.0001
                     DEC              0.1776   0.1774   0.1731
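A hedged sketch of how such a comparison can be scored with scikit-learn: cluster pre-computed document embeddings with each algorithm and evaluate the predicted partition against gold topic labels using NMI, AMI, and ARI. The random embeddings and labels below stand in for the paper's data.

# Scoring several clusterers on fixed embeddings with scikit-learn.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn import metrics

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 768))   # stand-in for BERT vectors
labels = rng.integers(0, 10, size=200)     # stand-in gold topic labels

clusterers = {
    "k-means (rand)": KMeans(n_clusters=10, init="random", n_init=10),
    "k-means++":      KMeans(n_clusters=10, init="k-means++", n_init=10),
    "Agglomerative":  AgglomerativeClustering(n_clusters=10),
    "DBSCAN":         DBSCAN(eps=0.5, min_samples=5),
}
for name, algo in clusterers.items():
    pred = algo.fit_predict(embeddings)
    print(name,
          metrics.normalized_mutual_info_score(labels, pred),
          metrics.adjusted_mutual_info_score(labels, pred),
          metrics.adjusted_rand_score(labels, pred))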

Ordered key phrases of the clusters.

Cluster  Terms (normalized TF-IDF score)
c1       creativity(1.00), sentiment_analysis(0.85), university(0.81), facial(0.79), insect(0.74), dreyfus(0.71), expert_system(0.67), music(0.65), indian_language(0.64), recommendation(0.63), argumentation(0.62), swarm(0.62), data_mining(0.61), face_recognition(0.61), natural_language_processing(0.60)
c2       ois(1.00), execution(0.98), sinix(0.88), perception(0.80), people(0.75), unix(0.69), team(0.66), discourse(0.62), intention(0.57)
c3       revision(1.00), contraction(0.70), postulate(0.65), horn(0.65)
c4       csp(1.00), propagation(0.80), arc_consistency(0.75), backjumping(0.59)
c5       description_logic(1.00), deep_learning(0.89), ontology(0.74), rcc(0.56)
c6       auction(1.00), equilibrium(0.74), election(0.66), coalition(0.66), bargaining(0.56)
c7       support_vector_machine(1.00), classifier(0.68), knee(0.66)
c8       document(1.00), wikipedia(0.99), wordnet(0.68), dictionary(0.63)
c9       phase_transition(1.00), minimax(0.89), voting(0.87), alpha_beta(0.75), chess(0.69), backbone(0.64), optimal_solution(0.63), heuristic_function(0.63), game_tree(0.61), ratio(0.59), heuristic_search(0.59), monte_carlo_tree_search(0.55)
c10      execution(1.00), reward(0.80), ebl(0.77), pomdp(0.68), team(0.66), heuristic_search(0.64), action_model(0.63), portfolio(0.60), monte_carlo_tree_search(0.59), mdp(0.59), conformant(0.58), mdps(0.57)
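One plausible way to produce ranked key phrases of this form, sketched below under assumed inputs, is to sum the TF-IDF scores of n-gram terms within each cluster and normalize by the cluster's top score so the best phrase receives 1.00.

# Per-cluster n-gram key phrases via normalized TF-IDF (assumed inputs).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["belief revision and contraction postulates",
        "arc consistency propagation for csp solvers"]
cluster_ids = [3, 4]  # assumed cluster assignment per document

vec = TfidfVectorizer(ngram_range=(1, 3))  # uni- to tri-gram phrases
X = vec.fit_transform(docs)
terms = np.array(vec.get_feature_names_out())

for c in sorted(set(cluster_ids)):
    rows = [i for i, cid in enumerate(cluster_ids) if cid == c]
    scores = np.asarray(X[rows].sum(axis=0)).ravel()
    scores = scores / scores.max()          # normalize: top term -> 1.00
    top = scores.argsort()[::-1][:15]
    print(f"c{c}:", [(terms[i], round(scores[i], 2))
                     for i in top if scores[i] > 0])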

Comparison of clustering techniques with various embeddings on two different training corpora.

                                          KIPRIS                  WoS+KIPRIS
Embedding           Clustering           NMI    AMI    ARI       NMI    AMI    ARI
FastText (mean)     K-means (rand)       0.379  0.379  0.312     0.387  0.387  0.327
                    K-means (++)         0.379  0.379  0.322     0.387  0.387  0.327
                    Hierarchical Aggl.   0.391  0.391  0.289     0.363  0.363  0.306
                    DBSCAN               0.006  0.005  0.000     0.005  0.005  0.000
                    DEC                  0.511  0.511  0.504     0.459  0.459  0.400
                    DEC (scaled)         0.329  0.329  0.268     0.284  0.283  0.239
FastText (w. mean)  K-means (rand)       0.243  0.243  0.186     0.239  0.239  0.184
                    K-means (++)         0.243  0.243  0.186     0.239  0.239  0.184
                    Hierarchical Aggl.   0.260  0.260  0.140     0.234  0.234  0.176
                    DBSCAN               0.037  0.035  0.001     0.011  0.010  0.000
                    DEC                  0.348  0.347  0.321     0.352  0.352  0.300
                    DEC (scaled)         0.201  0.201  0.169     0.172  0.172  0.158
Doc2Vec             K-means (rand)       0.586  0.586  0.629     0.712  0.712  0.742
                    K-means (++)         0.586  0.586  0.630     0.711  0.711  0.741
                    Hierarchical Aggl.   0.444  0.444  0.457     0.602  0.602  0.633
                    DBSCAN               0.004  0.004  0.000     0.004  0.004  0.000
                    DEC                  0.600  0.600  0.629     0.734  0.734  0.759
                    DEC (scaled)         0.235  0.235  0.220     0.322  0.322  0.279

Topic modeling baseline (NMI, AMI, ARI):
LDA                 0.350  0.350  0.291
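The two FastText rows above differ only in how word vectors are pooled into a document vector. Below is a sketch of both poolings, assuming gensim FastText and TF-IDF weighting for the "w. mean" variant; the study's exact weighting scheme and corpus are not specified here, so everything in the snippet is illustrative.

# Mean vs. TF-IDF-weighted mean pooling of FastText word vectors.
import numpy as np
from gensim.models import FastText
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["neural embeddings for patent documents",
        "clustering patent documents with embeddings"]
tokens = [d.split() for d in docs]

ft = FastText(tokens, vector_size=100, window=3, min_count=1, epochs=10)

tfidf = TfidfVectorizer().fit(docs)
idx = tfidf.vocabulary_
W = tfidf.transform(docs)

def mean_vec(words):
    # Plain mean of word vectors ("FastText (mean)")
    return np.mean([ft.wv[w] for w in words], axis=0)

def weighted_mean_vec(i, words):
    # TF-IDF-weighted mean ("FastText (w. mean)"); OOV terms get weight 0
    wts = np.array([W[i, idx[t]] if t in idx else 0.0 for t in words])
    vecs = np.stack([ft.wv[t] for t in words])
    return (wts[:, None] * vecs).sum(axis=0) / max(wts.sum(), 1e-12)

doc_vectors = [weighted_mean_vec(i, ws) for i, ws in enumerate(tokens)]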