Embedding-based Detection and Extraction of Research Topics from Academic Documents Using Deep Clustering

Figure 1

Simplified overview of the overall process and objectives.

Figure 2

PV-DM model diagram.
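The PV-DM (Distributed Memory) variant of Doc2Vec shown above learns a paragraph vector that is combined with context word vectors to predict a target word. A minimal sketch of training such a model with gensim follows; the toy abstracts and all hyperparameters are illustrative assumptions, not the settings used in the study.

# Minimal PV-DM sketch with gensim; dm=1 selects the PV-DM architecture.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

abstracts = [
    "deep clustering detects research topics in article abstracts",
    "arc consistency propagation speeds up csp solvers",
]  # stand-ins for the journal article texts

corpus = [TaggedDocument(words=text.split(), tags=[i])
          for i, text in enumerate(abstracts)]

# The paragraph vector plus context word vectors predict the next word.
model = Doc2Vec(corpus, dm=1, vector_size=100, window=5,
                min_count=1, epochs=40)

doc_vec = model.dv[0]  # learned embedding of the first document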

Figure 3

DEC autoencoder diagram. K is the number of clusters.
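As a rough sketch of the DEC idea in this figure (not the authors' exact architecture), the PyTorch code below illustrates the two losses involved: autoencoder reconstruction and a KL clustering loss over K soft assignments computed with a Student's t kernel. Layer sizes, K, and the input dimension are assumptions.

# Minimal DEC sketch in PyTorch; dimensions and K are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

K, d_in, d_z = 10, 768, 10  # clusters, input dim, latent dim (assumed)

class DEC(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, 256), nn.ReLU(),
                                     nn.Linear(256, d_z))
        self.decoder = nn.Sequential(nn.Linear(d_z, 256), nn.ReLU(),
                                     nn.Linear(256, d_in))
        self.centroids = nn.Parameter(torch.randn(K, d_z))

    def forward(self, x):
        z = self.encoder(x)
        # Soft assignment q_ij: Student's t kernel between z_i and centroid j
        q = 1.0 / (1.0 + torch.cdist(z, self.centroids) ** 2)
        q = q / q.sum(dim=1, keepdim=True)
        return z, q, self.decoder(z)

def target_distribution(q):
    # Sharpened target p used by DEC's KL(p || q) clustering loss
    w = q ** 2 / q.sum(dim=0)
    return w / w.sum(dim=1, keepdim=True)

model = DEC()
x = torch.randn(32, d_in)  # stand-in batch of document embeddings
z, q, x_hat = model(x)
loss = F.kl_div(q.log(), target_distribution(q).detach(),
                reduction="batchmean") + F.mse_loss(x_hat, x)
loss.backward()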

Figure 4

Additional WoS data figure.

Figure 5

Data growth from 1970 to 2019 in the three journals, yielding about 4,300 records.

Figure 6

Diagram of clusters over the study period. The Y-axis shows the number of articles (symmetrical); the X-axis shows years.

Figure 7

N-gram dictionary histogram of the dataset. The Y-axis is frequency; the X-axis is N.

Comparison of clustering techniques for pre-trained BERT embeddings.

Embedding            Clustering       NMI      AMI      ARI
BERT (uncased A12)   k-means (rand)   0.1950   0.1948   0.1442
                     k-means++        0.1950   0.1948   0.1442
                     Agglomerative    0.2158   0.2156   0.1877
                     DBSCAN           0.0042   0.0037   0.0001
                     DEC              0.2568   0.2566   0.2377
SciBERT              k-means (rand)   0.1498   0.1496   0.1266
                     k-means++        0.1492   0.1489   0.1266
                     Agglomerative    0.1903   0.1901   0.1505
                     DBSCAN           0.0042   0.0037   0.0001
                     DEC              0.1776   0.1774   0.1731
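A hedged sketch of how such a comparison can be scored with scikit-learn: cluster pre-computed document embeddings with each algorithm and evaluate the predicted partition against gold topic labels using NMI, AMI, and ARI. The random embeddings and labels below stand in for the paper's data.

# Scoring several clusterers on fixed embeddings with scikit-learn.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn import metrics

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 768))   # stand-in for BERT vectors
labels = rng.integers(0, 10, size=200)     # stand-in gold topic labels

clusterers = {
    "k-means (rand)": KMeans(n_clusters=10, init="random", n_init=10),
    "k-means++":      KMeans(n_clusters=10, init="k-means++", n_init=10),
    "Agglomerative":  AgglomerativeClustering(n_clusters=10),
    "DBSCAN":         DBSCAN(eps=0.5, min_samples=5),
}
for name, algo in clusterers.items():
    pred = algo.fit_predict(embeddings)
    print(name,
          metrics.normalized_mutual_info_score(labels, pred),
          metrics.adjusted_mutual_info_score(labels, pred),
          metrics.adjusted_rand_score(labels, pred))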

Ordered key phrases of the clusters.

Cluster  Terms (normalized TF-IDF score)
c1       creativity(1.00), sentiment_analysis(0.85), university(0.81), facial(0.79), insect(0.74), dreyfus(0.71), expert_system(0.67), music(0.65), indian_language(0.64), recommendation(0.63), argumentation(0.62), swarm(0.62), data_mining(0.61), face_recognition(0.61), natural_language_processing(0.60)
c2       ois(1.00), execution(0.98), sinix(0.88), perception(0.80), people(0.75), unix(0.69), team(0.66), discourse(0.62), intention(0.57)
c3       revision(1.00), contraction(0.70), postulate(0.65), horn(0.65)
c4       csp(1.00), propagation(0.80), arc_consistency(0.75), backjumping(0.59)
c5       description_logic(1.00), deep_learning(0.89), ontology(0.74), rcc(0.56)
c6       auction(1.00), equilibrium(0.74), election(0.66), coalition(0.66), bargaining(0.56)
c7       support_vector_machine(1.00), classifier(0.68), knee(0.66)
c8       document(1.00), wikipedia(0.99), wordnet(0.68), dictionary(0.63)
c9       phase_transition(1.00), minimax(0.89), voting(0.87), alpha_beta(0.75), chess(0.69), backbone(0.64), optimal_solution(0.63), heuristic_function(0.63), game_tree(0.61), ratio(0.59), heuristic_search(0.59), monte_carlo_tree_search(0.55)
c10      execution(1.00), reward(0.80), ebl(0.77), pomdp(0.68), team(0.66), heuristic_search(0.64), action_model(0.63), portfolio(0.60), monte_carlo_tree_search(0.59), mdp(0.59), conformant(0.58), mdps(0.57)
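One plausible way to produce ranked key phrases of this form, sketched below under assumed inputs, is to sum the TF-IDF scores of n-gram terms within each cluster and normalize by the cluster's top score so the best phrase receives 1.00.

# Per-cluster n-gram key phrases via normalized TF-IDF (assumed inputs).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["belief revision and contraction postulates",
        "arc consistency propagation for csp solvers"]
cluster_ids = [3, 4]  # assumed cluster assignment per document

vec = TfidfVectorizer(ngram_range=(1, 3))  # uni- to tri-gram phrases
X = vec.fit_transform(docs)
terms = np.array(vec.get_feature_names_out())

for c in sorted(set(cluster_ids)):
    rows = [i for i, cid in enumerate(cluster_ids) if cid == c]
    scores = np.asarray(X[rows].sum(axis=0)).ravel()
    scores = scores / scores.max()          # normalize: top term -> 1.00
    top = scores.argsort()[::-1][:15]
    print(f"c{c}:", [(terms[i], round(scores[i], 2))
                     for i in top if scores[i] > 0])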

Comparison of clustering techniques with various embeddings on two different training corpora.

                                          KIPRIS                  WoS+KIPRIS
Embedding           Clustering           NMI    AMI    ARI       NMI    AMI    ARI
FastText (mean)     K-means (rand)       0.379  0.379  0.312     0.387  0.387  0.327
                    K-means (++)         0.379  0.379  0.322     0.387  0.387  0.327
                    Hierarchical Aggl.   0.391  0.391  0.289     0.363  0.363  0.306
                    DBSCAN               0.006  0.005  0.000     0.005  0.005  0.000
                    DEC                  0.511  0.511  0.504     0.459  0.459  0.400
                    DEC (scaled)         0.329  0.329  0.268     0.284  0.283  0.239
FastText (w. mean)  K-means (rand)       0.243  0.243  0.186     0.239  0.239  0.184
                    K-means (++)         0.243  0.243  0.186     0.239  0.239  0.184
                    Hierarchical Aggl.   0.260  0.260  0.140     0.234  0.234  0.176
                    DBSCAN               0.037  0.035  0.001     0.011  0.010  0.000
                    DEC                  0.348  0.347  0.321     0.352  0.352  0.300
                    DEC (scaled)         0.201  0.201  0.169     0.172  0.172  0.158
Doc2Vec             K-means (rand)       0.586  0.586  0.629     0.712  0.712  0.742
                    K-means (++)         0.586  0.586  0.630     0.711  0.711  0.741
                    Hierarchical Aggl.   0.444  0.444  0.457     0.602  0.602  0.633
                    DBSCAN               0.004  0.004  0.000     0.004  0.004  0.000
                    DEC                  0.600  0.600  0.629     0.734  0.734  0.759
                    DEC (scaled)         0.235  0.235  0.220     0.322  0.322  0.279

Topic modeling baseline (NMI, AMI, ARI):
LDA                 0.350  0.350  0.291
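The two FastText rows above differ only in how word vectors are pooled into a document vector. Below is a sketch of both poolings, assuming gensim FastText and TF-IDF weighting for the "w. mean" variant; the study's exact weighting scheme and corpus are not specified here, so everything in the snippet is illustrative.

# Mean vs. TF-IDF-weighted mean pooling of FastText word vectors.
import numpy as np
from gensim.models import FastText
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["neural embeddings for patent documents",
        "clustering patent documents with embeddings"]
tokens = [d.split() for d in docs]

ft = FastText(tokens, vector_size=100, window=3, min_count=1, epochs=10)

tfidf = TfidfVectorizer().fit(docs)
idx = tfidf.vocabulary_
W = tfidf.transform(docs)

def mean_vec(words):
    # Plain mean of word vectors ("FastText (mean)")
    return np.mean([ft.wv[w] for w in words], axis=0)

def weighted_mean_vec(i, words):
    # TF-IDF-weighted mean ("FastText (w. mean)"); OOV terms get weight 0
    wts = np.array([W[i, idx[t]] if t in idx else 0.0 for t in words])
    vecs = np.stack([ft.wv[t] for t in words])
    return (wts[:, None] * vecs).sum(axis=0) / max(wts.sum(), 1e-12)

doc_vectors = [weighted_mean_vec(i, ws) for i, ws in enumerate(tokens)]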