Open Access

A Topic Detection Method Based on Word-attention Networks

18 Aug 2021


Figure 1

Statistics of the PNAS papers in three major disciplines. Of all papers, 90.44% belong to three major disciplines, namely Biological Sciences, Physical Sciences, and Social Sciences. Panel (a) shows the annual number of papers in each discipline. Panel (b) shows the fraction of Biophysics papers within Biological Sciences.

Figure 2

Illustration of the proposed method. The method consists of two steps: constructing word branches with the Transformer, and partitioning the network formed by connecting word branches using community detection methods. Topics here take the form of subgraphs.

Figure 3

Performance of the random-walk method for detecting node communities. The word networks are treated as directed. The step parameter of the random-walk method is the length of the random walks to perform. Panels show the performance of the method with step 3 (blue dotted lines), 4 (red dotted lines), and 5 (orange dotted lines).
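The paper's exact implementation is not shown on this page; below is a minimal sketch of random-walk (walktrap-style) community detection with python-igraph, assuming that is what the step parameter refers to. The toy edge list stands in for a word network.

```python
# A minimal sketch of random-walk (walktrap-style) community detection
# with python-igraph; the toy edge list stands in for a word network.
import igraph as ig

# Toy directed word network: edges point from a token to predicted successors.
g = ig.Graph(edges=[(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 5), (5, 3)],
             directed=True)

for steps in (3, 4, 5):  # the "step" parameter varied in Figure 3
    # Walktrap groups nodes that short random walks tend to stay among;
    # here the graph is converted to its undirected view first.
    dendrogram = g.as_undirected().community_walktrap(steps=steps)
    print(steps, dendrogram.as_clustering().membership)
```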

Figure 4

Performance of the Louvain method for detecting node communities. The word networks are treated as undirected. The resolution is the parameter of the Louvain method. Panels show the performance of the method with resolution 0.5 (orange dotted lines), 0.7 (red dotted lines), and 1.0 (blue dotted lines).
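A minimal sketch of Louvain node-community detection at the three resolutions compared in Figure 4, using NetworkX's implementation; the karate-club graph stands in for an undirected word network.

```python
# A minimal sketch of Louvain community detection with NetworkX (>= 2.8);
# the karate-club graph stands in for an undirected word network.
import networkx as nx

G = nx.karate_club_graph()

for resolution in (0.5, 0.7, 1.0):  # the resolutions compared in Figure 4
    # Higher resolution favors more, smaller communities.
    communities = nx.community.louvain_communities(G, resolution=resolution, seed=42)
    print(resolution, len(communities))
```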

Figure 5

Performance of the Louvain method for detecting edge communities. The Louvain method is used to obtain overlapping communities. Panels show the performance of this method with resolution 0.5 (orange dotted lines), 0.7 (red dotted lines), and 1.0 (blue dotted lines).
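The paper's edge-community procedure is not reproduced on this page; one common way to obtain an edge partition (and hence overlapping node communities) with Louvain is to run it on the line graph, sketched below under that assumption.

```python
# Hedged sketch: partition the line graph, whose nodes are the edges of
# the original network, to obtain edge communities with Louvain.
import networkx as nx

G = nx.karate_club_graph()
L = nx.line_graph(G)

edge_communities = nx.community.louvain_communities(L, resolution=0.7, seed=42)

# A node of G belongs to every community containing one of its incident
# edges, so the induced node communities may overlap.
node_communities = [{u for edge in comm for u in edge} for comm in edge_communities]
print(len(edge_communities), [len(c) for c in node_communities])
```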

Figure 6

Comparisons between LDA and our method. Panels (a–d) show the average values of the assessment indexes, and Panels (e–h) show the standard deviations, for runs of LDA (red dotted lines) and of the proposed method using the Louvain method to detect node communities (resolution 0.7, blue dotted lines).
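Figure 6 reports means and standard deviations over repeated LDA runs; a minimal sketch of that protocol with gensim follows, where the toy corpus and the c_v coherence score stand in for the paper's data and assessment indexes.

```python
# Sketch of the repeated-runs protocol behind Figure 6: train LDA several
# times and report the mean and standard deviation of an assessment index.
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

texts = [["topic", "detect", "word"], ["network", "word", "attention"],
         ["network", "community", "detect"]] * 10
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

scores = []
for seed in range(10):  # the paper runs LDA ten times
    lda = LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=seed)
    cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary, coherence="c_v")
    scores.append(cm.get_coherence())

print(np.mean(scores), np.std(scores))
```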

Figure 7

Performance of LDA with the same number of topics as our method. Panels show the performance when the number of topics is set to the number of token communities (blue dotted lines) and to the number of edge communities (red dotted lines). These communities are detected by the Louvain method.

Figure 8

The evolution of the tokens emphasized by the proposed method using the Louvain method. Panels show word clouds of the top five tokens in the top five topics, ranked by the tokens' betweenness centrality in the word network.
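A minimal sketch of the Figure 8 ranking, assuming topics are node communities and tokens are ordered by betweenness centrality with NetworkX; the toy graph is illustrative.

```python
# Sketch of the Figure 8 ranking: order each topic's tokens by their
# betweenness centrality in the word network.
import networkx as nx

G = nx.karate_club_graph()  # stand-in for a word network
topics = nx.community.louvain_communities(G, seed=42)  # topics as node sets

centrality = nx.betweenness_centrality(G)
for topic in sorted(topics, key=len, reverse=True)[:5]:  # top five topics
    top_tokens = sorted(topic, key=centrality.get, reverse=True)[:5]
    print(top_tokens)
```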

Figure 9

The evolution of the tokens emphasized by LDA. Panels show word clouds of the top five tokens in the top five topics, ranked by the summation of a token's weight over all topics. The number of topics is five. The index n here is the number of tokens emphasized by running LDA ten times.
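A minimal sketch of the Figure 9 ranking with gensim, where a token's score is the sum of its weights over all topics; the toy corpus is illustrative.

```python
# Sketch of the Figure 9 ranking: score each token by the summation of
# its weights over all LDA topics.
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["topic", "detect", "word"], ["network", "word", "attention"]] * 10
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(corpus, num_topics=5, id2word=dictionary, random_state=0)
weights = lda.get_topics()          # shape: (num_topics, vocabulary_size)
token_scores = weights.sum(axis=0)  # sum of a token's weight over all topics

top_ids = np.argsort(token_scores)[::-1][:5]
print([dictionary[int(i)] for i in top_ids])
```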

Figure 10

The difference in emphasized tokens. Panel (a) shows the tokens emphasized by both LDA and our method; the set of tokens emphasized by the proposed method includes the set emphasized by LDA. Panel (b) shows the tokens emphasized only by our method.


Algorithm 3. Constructing word branches.
Input:
titles of papers;
parameter m;
Output:
word branches.
1: for each preprocessed title a0, ..., an do
2:   let b01 = a0;
3:   for i from 1 to n do
4:     predict bi1, ..., bim ranked according to the probability given by the Transformer;
5:     if i > 1 then
6:       generate directed edges from b(i−1)1 to bi1, ..., bim;
7:       use b01, ..., bi1 to predict next tokens by the Transformer;
8:     end if
9:   end for
10: end for
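
A minimal Python sketch of Algorithm 3, assuming a hypothetical predict_next(prefix, m) wrapper around the trained Transformer that returns the m most probable next tokens.

```python
# Illustrative sketch of Algorithm 3: grow one word branch per title by
# repeatedly asking the Transformer for the m most probable next tokens.
# predict_next(prefix, m) is a hypothetical wrapper around the trained model.

def build_word_branches(titles, m, predict_next):
    edges = set()  # directed edges (from_token, to_token)
    for title in titles:  # each title is a preprocessed token list a0, ..., an
        prefix = [title[0]]  # b01 = a0
        for _ in range(1, len(title)):
            # bi1, ..., bim ranked by the probability given by the Transformer
            candidates = predict_next(prefix, m)
            if len(prefix) > 1:  # i > 1: connect b(i-1)1 to bi1, ..., bim
                edges.update((prefix[-1], c) for c in candidates)
            prefix.append(candidates[0])  # extend the prefix with bi1 only
    return edges
```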


Algorithm 4. Cropping word branches.
Input:
word branches;
Output:
cropped word branches.
1: calculate the paper-token matrix (fij)M×N, where fij is the frequency of token j in paper i;
2: calculate the tf-idf matrix (wij)M×N;
3: for i from 1 to M do
4:   rank tokens according to {wi1, ..., wiN};
5:   if token bi1 is not in top x% then
6:     crop the tokens of {bi1, ..., bim};
7:     crop the edges connecting those tokens;
8:   else
9:     crop the tokens of {bi2, ..., bim} that are not in top x%;
10:     crop the edges connecting those tokens.
11:   end if
12: end for
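
A sketch of the tf-idf cropping step with scikit-learn; the per-paper top-x% selection and the variable names are illustrative, not the paper's exact code.

```python
# Illustrative sketch of the tf-idf cropping: per paper, keep only the
# branch tokens whose tf-idf weight ranks in the top x percent.
from sklearn.feature_extraction.text import TfidfVectorizer

def top_tokens_per_paper(papers, x=20):
    """papers: list of token lists; returns one set of kept tokens per paper."""
    vectorizer = TfidfVectorizer(analyzer=lambda tokens: tokens)
    w = vectorizer.fit_transform(papers)  # papers x tokens tf-idf matrix
    vocab = vectorizer.get_feature_names_out()
    kept = []
    for i in range(w.shape[0]):
        row = w.getrow(i).toarray().ravel()
        ranked = row.argsort()[::-1]
        nonzero = int((row > 0).sum())
        n_keep = max(1, nonzero * x // 100)
        kept.append({vocab[j] for j in ranked[:n_keep]})
    return kept

# Branch tokens outside a paper's kept set, and the edges attached to
# them, would then be cropped from that paper's word branches.
```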


Algorithm 1. Data preprocessing.
Input:
titles and abstracts of papers;
Output:
list of token lists.
1: stem words using the PorterStemmer of NLTK;
2: remove stopwords using the stopword corpus of NLTK;
3: remove the words that appear in fewer than x = 6 papers' abstracts and titles.
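
A minimal sketch of Algorithm 1 with NLTK; stopwords are removed before stemming here so tokens match NLTK's stopword list, and the corpus variables are illustrative.

```python
# Sketch of Algorithm 1 with NLTK (requires nltk.download("stopwords") once).
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

def preprocess(documents, min_df=6):
    stemmer = PorterStemmer()
    stop = set(stopwords.words("english"))
    token_lists = [[stemmer.stem(w) for w in doc.lower().split() if w not in stop]
                   for doc in documents]
    # document frequency: the number of papers a token appears in
    df = Counter(tok for tokens in token_lists for tok in set(tokens))
    # drop words appearing in fewer than min_df papers (x = 6 in the paper)
    return [[t for t in tokens if df[t] >= min_df] for tokens in token_lists]
```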


Algorithm 2. The architecture of the Transformer used here.
1: model = get_model(token_num=max(len(source_token_dict), len(target_token_dict)), embed_dim=32, encoder_num=4, decoder_num=4, head_num=4, hidden_dim=32, dropout_rate=0.05, use_same_embed=False)
2: model.compile('adam', 'sparse_categorical_crossentropy')
3: model.fit(x=[np.array(encode_input * 30), np.array(decode_input * 30)], y=np.array(decode_output * 30), epochs=5, batch_size=32)
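
The calls above match the keras-transformer package's get_model API; below is a hedged sketch of how the token dictionaries and padded encoder/decoder inputs it expects are typically built, following that package's documented usage. The <PAD>/<START>/<END> conventions and toy titles are assumptions, not the paper's exact setup.

```python
# Hedged sketch of preparing inputs for keras-transformer's get_model.
def build_token_dict(token_lists):
    token_dict = {"<PAD>": 0, "<START>": 1, "<END>": 2}
    for tokens in token_lists:
        for token in tokens:
            token_dict.setdefault(token, len(token_dict))
    return token_dict

source_tokens = [["topic", "detect"], ["word", "attention", "network"]]
target_tokens = source_tokens  # next-token prediction over the same titles

source_token_dict = build_token_dict(source_tokens)
target_token_dict = build_token_dict(target_tokens)

# Pad to a common length; the decoder input is wrapped in <START>/<END>.
max_len = max(map(len, source_tokens)) + 2
encode_input = [[source_token_dict[t] for t in ts] + [0] * (max_len - len(ts))
                for ts in source_tokens]
decode_input = [[1] + [target_token_dict[t] for t in ts] + [2]
                + [0] * (max_len - len(ts) - 2) for ts in target_tokens]
decode_output = [row[1:] + [0] for row in decode_input]  # shifted left by one
```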

Information about the dblp dataset.

Time a b c d e f
1999 2,475 3,274 95,021 0.11 2.371 0.998
2000 2,380 3,347 93,910 0.101 2.395 0.998
2001 2,455 3,477 108,954 0.108 2.355 0.999
2002 2,812 3,710 117,269 0.094 2.272 1.0
2003 2,656 3,592 115,019 0.1 2.312 0.999
2004 2,955 3,919 138,451 0.101 2.299 0.999
2005 3,131 4,084 154,041 0.099 2.275 0.999
2006 3,248 4,260 166,614 0.1 2.289 0.999
2007 3,419 4,368 184,420 0.102 2.279 0.999
2008 3,408 4,436 184,881 0.104 2.304 1.0
2009 3,658 4,609 212,771 0.098 2.218 1.0
2010 3,639 4,668 221,090 0.1 2.204 0.999
2011 3,462 4,688 220,020 0.111 2.228 0.999
2012 3,621 4,875 209,517 0.114 2.28 1.0
2013 3,593 4,846 231,959 0.096 2.189 1.0
2014 3,334 4,679 210,099 0.096 2.208 1.0