Embedding-based Detection and Extraction of Research Topics from Academic Documents Using Deep Clustering

Sahand Vahidnia; Alireza Abbasi; Hussein A. Abbass

Open access

Journal of Data and Information Science

Volume 6 (2021): Issue 3 (June 2021)

Article Category: Research Paper

Published online: 18 Jun 2021

Pages: 99-122

Received: 30 Nov 2020

Accepted: 26 Apr 2021

DOI: https://doi.org/10.2478/jdis-2021-0024

Keywords
Dynamics of science, Science mapping, Document clustering, Artificial intelligence, Deep learning

© 2021 Sahand Vahidnia et al., published by Sciendo

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

The overall and simplified process and objectives.

DEC Autoencoder diagram. K is the number of clusters.

Data growth from 1970 to 2019 in the three journals, yielding approximately 4,300 records.

Diagram of clusters over the study period. The y-axis is the number of articles (symmetrical); the x-axis is the year.

N-gram dictionary histogram of the dataset. The y-axis is frequency; the x-axis is N.

Comparison of clustering techniques for pre-trained BERT embeddings.

Embedding	Clustering	NMI	AMI	ARI
BERT (uncased A12)	kmeans (rand)	0.19507	0.1948	0.1442
BERT (uncased A12)	kmeans++	0.1950	0.1948	0.1442
BERT (uncased A12)	Agglomerative	0.2158	0.2156	0.1877
BERT (uncased A12)	DBSCAN	0.0042	0.0037	0.0001
BERT (uncased A12)	DEC	0.2568	0.2566	0.2377
SciBERT	kmeans (rand)	0.1498	0.1496	0.1266
SciBERT	kmeans++	0.1492	0.1489	0.1266
SciBERT	Agglomerative	0.1903	0.1901	0.1505
SciBERT	DBSCAN	0.0042	0.0037	0.0001
SciBERT	DEC	0.1776	0.1774	0.1731
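
The NMI and ARI columns above compare predicted cluster labels against reference labels. The following is a minimal sketch of how two of these scores can be computed from label lists, not the authors' evaluation code. It assumes NMI is normalized by the arithmetic mean of the two entropies (the page does not say which normalization the paper used), and it omits AMI, which also needs the expected mutual information under random labelings. The function names and toy labels are hypothetical.

```python
from collections import Counter
from math import comb, log


def nmi(labels_true, labels_pred):
    """Normalized mutual information (arithmetic-mean normalization, an assumption)."""
    n = len(labels_true)
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    count_joint = Counter(zip(labels_true, labels_pred))

    def entropy(counts):
        return -sum((c / n) * log(c / n) for c in counts.values())

    # MI = sum over label pairs of p(t, p) * log(p(t, p) / (p(t) * p(p)))
    mutual_info = sum(
        (c / n) * log(c * n / (count_true[t] * count_pred[p]))
        for (t, p), c in count_joint.items()
    )
    mean_entropy = (entropy(count_true) + entropy(count_pred)) / 2
    return mutual_info / mean_entropy if mean_entropy > 0 else 1.0


def ari(labels_true, labels_pred):
    """Adjusted Rand index from pair counts; assumes at least two items."""
    n = len(labels_true)
    sum_joint = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    maximum = (sum_true + sum_pred) / 2
    if maximum == expected:
        return 1.0
    return (sum_joint - expected) / (maximum - expected)


# Hypothetical toy labels: "permuted" is the same partition with renamed clusters,
# "coarse" merges the reference clusters into two groups.
true_labels = [0, 0, 1, 1, 2, 2]
permuted = [2, 2, 0, 0, 1, 1]
coarse = [0, 0, 0, 1, 1, 1]

print(nmi(true_labels, permuted), ari(true_labels, permuted))
print(nmi(true_labels, coarse), ari(true_labels, coarse))
```

Both scores are invariant to renaming cluster ids, which is why they suit comparisons across clustering methods whose label numbering is arbitrary.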

Ordered key phrases of the clusters.

Cluster #	Terms (Normalized TF-IDF score)
c 1	creativity(1.00), sentiment_analysis(0.85), university(0.81), facial(0.79), insect(0.74), dreyfus(0.71), expert_system(0.67), music(0.65), indian_language(0.64), recommendation(0.63), argumentation(0.62), swarm(0.62), data_mining(0.61), face_recognition(0.61), natural_language_processing(0.60)
c 2	ois(1.00), execution(0.98), sinix(0.88), perception(0.80), people(0.75), unix(0.69), team(0.66), discourse(0.62), intention(0.57)
c 3	revision(1.00), contraction(0.70), postulate(0.65), horn(0.65)
c 4	csp(1.00), propagation(0.80), arc_consistency(0.75), backjumping(0.59)
c 5	description_logic(1.00), deep_learning(0.89), ontology(0.74), rcc(0.56)
c 6	auction(1.00), equilibrium(0.74), election(0.66), coalition(0.66), bargaining(0.56)
c 7	support_vector_machine(1.00), classifier(0.68), knee(0.66)
c 8	document(1.00), wikipedia(0.99), wordnet(0.68), dictionary(0.63)
c 9	phase_transition(1.00), minimax(0.89), voting(0.87), alpha_beta(0.75), chess(0.69), backbone(0.64), optimal_solution(0.63), heuristic_function(0.63), game_tree(0.61), ratio(0.59), heuristic_search(0.59), monte_carlo_tree_search(0.55)
c 10	execution(1.00), reward(0.80), ebl(0.77), pomdp(0.68), team(0.66), heuristic_search(0.64), action_model(0.63), portfolio(0.60), monte_carlo_tree_search(0.59), mdp(0.59), conformant(0.58), mdps(0.57)
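
One plausible way to obtain per-cluster scores like those above, where the top term of every cluster scores 1.00, is to treat each cluster as a single pooled document, compute TF-IDF across clusters, and divide by the cluster's maximum score. The sketch below illustrates that idea and is not the authors' procedure. The smoothed IDF formula, the max-normalization, the function name `cluster_key_terms`, and the toy tokens are all assumptions; tokens are assumed to be already merged into phrases such as `arc_consistency`.

```python
from collections import Counter
from math import log


def cluster_key_terms(cluster_docs, top_k=5):
    """Rank terms per cluster by max-normalized TF-IDF.

    cluster_docs maps a cluster id to a list of tokenized documents.
    """
    # Term frequency per cluster: pool the tokens of all documents in the cluster.
    term_freq = {
        cid: Counter(tok for doc in docs for tok in doc)
        for cid, docs in cluster_docs.items()
    }
    n_clusters = len(term_freq)
    # Number of clusters in which each term occurs at least once.
    cluster_freq = Counter(term for tf in term_freq.values() for term in tf)

    ranked = {}
    for cid, tf in term_freq.items():
        if not tf:
            ranked[cid] = []
            continue
        total = sum(tf.values())
        # Smoothed IDF (assumed): log((1 + N) / (1 + df)) + 1
        scores = {
            term: (count / total) * (log((1 + n_clusters) / (1 + cluster_freq[term])) + 1)
            for term, count in tf.items()
        }
        top = max(scores.values())
        # Normalize so the best term scores 1.0; break ties alphabetically.
        ranked[cid] = sorted(
            ((term, score / top) for term, score in scores.items()),
            key=lambda pair: (-pair[1], pair[0]),
        )[:top_k]
    return ranked


# Hypothetical toy clusters, loosely echoing terms from the table.
clusters = {
    "c3": [["revision", "contraction", "revision"], ["postulate", "revision", "logic"]],
    "c4": [["csp", "propagation", "csp"], ["arc_consistency", "csp", "logic"]],
}
ranked = cluster_key_terms(clusters)
for cid, terms in ranked.items():
    print(cid, ", ".join(f"{term}({score:.2f})" for term, score in terms))
```

Terms shared by several clusters (here `logic`) receive a lower IDF and therefore rank below cluster-specific terms with the same frequency.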

Comparison of clustering techniques with various embeddings on two different training corpora.

Embedding	Clustering	KIPRIS NMI	KIPRIS AMI	KIPRIS ARI	WoS+KIPRIS NMI	WoS+KIPRIS AMI	WoS+KIPRIS ARI
FastText (mean)	K-means (rand)	0.379	0.379	0.312	0.387	0.387	0.327
FastText (mean)	K-means (++)	0.379	0.379	0.322	0.387	0.387	0.327
FastText (mean)	Hierarchy Aggl.	0.391	0.391	0.289	0.363	0.363	0.306
FastText (mean)	DBSCAN	0.006	0.005	0.000	0.005	0.005	0.000
FastText (mean)	DEC	0.511	0.511	0.504	0.459	0.459	0.400
FastText (mean)	DEC (scaled)	0.329	0.329	0.268	0.284	0.283	0.239
FastText (w. mean)	K-means (rand)	0.243	0.243	0.186	0.239	0.239	0.184
FastText (w. mean)	K-means (++)	0.243	0.243	0.186	0.239	0.239	0.184
FastText (w. mean)	Hierarchy Aggl.	0.260	0.260	0.140	0.234	0.234	0.176
FastText (w. mean)	DBSCAN	0.037	0.035	0.001	0.011	0.010	0.000
FastText (w. mean)	DEC	0.348	0.347	0.321	0.352	0.352	0.300
FastText (w. mean)	DEC (scaled)	0.201	0.201	0.169	0.172	0.172	0.158
Doc2Vec	K-means (rand)	0.586	0.586	0.629	0.712	0.712	0.742
Doc2Vec	K-means (++)	0.586	0.586	0.630	0.711	0.711	0.741
Doc2Vec	Hierarchy Aggl.	0.444	0.444	0.457	0.602	0.602	0.633
Doc2Vec	DBSCAN	0.004	0.004	0.000	0.004	0.004	0.000
Doc2Vec	DEC	0.600	0.600	0.629	0.734	0.734	0.759
Doc2Vec	DEC (scaled)	0.235	0.235	0.220	0.322	0.322	0.279

Topic Modeling	NMI	AMI	ARI
LDA	0.350	0.350	0.291

eISSN: 2543-683X
Language: English

Publication frequency: 4 times per year
Journal subjects: Computer Sciences, Information Technology, Project Management, Databases and Data Mining
