Open Access

Automatic Classification of Swedish Metadata Using Dewey Decimal Classification: A Comparison of Approaches

Koraljka Golub, Johan Hagelbäck, et al.

Apr 22, 2020

Journal of Data and Information Science

Volume 5 (2020): Issue 1 (February 2020)

About this article

Article Category: Research Paper

Published Online: Apr 22, 2020

Page range: 18 - 38

Received: Feb 04, 2020

Accepted: Mar 25, 2020

DOI: https://doi.org/10.2478/jdis-2020-0003

Keywords
LIBRIS, Dewey Decimal Classification, Automatic classification, Machine learning, Support Vector Machine, Multinomial Naïve Bayes, Simple linear network, Standard neural network, 1D convolutional neural network, Recurrent neural network, Word embeddings, String matching

© 2020 Koraljka Golub et al., published by Sciendo

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.

The different datasets generated from the raw LIBRIS data

Dataset	ID	Records	Classes
Titles	T	143,838	816
Titles and keywords	T_KW	121,505	802
Keywords only	KW	121,505	802
Titles, major classes	T_MC	72,937	29
Titles and keywords, major classes	T_KW_MC	60,641	29
Keywords only, major classes	KW_MC	60,641	29

Accuracy of the Support Vector Machine classifier when using two digits

Dataset	Unigrams, training set	Unigrams, test set	Unigrams + 2-grams, training set	Unigrams + 2-grams, test set
T_KW_stm_2d	90.60%	72.68%	96.23%	73.32%
T_KW_2d	91.21%	72.14%	95.48%	73.24%
KW_2d	81.75%	71.86%	86.18%	71.96%
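
The `_2d` suffix in the dataset names presumably marks class labels shortened to the first two DDC digits; this excerpt does not describe the exact procedure. As a minimal sketch under that assumption, the hypothetical helper `truncate_ddc` below keeps the leading digits of a DDC number and ignores the decimal point. The function name and the sample class numbers are illustrative and are not taken from the article.

```python
def truncate_ddc(ddc_number, digits=2):
    """Keep the first `digits` digits of a DDC number, ignoring the decimal point.

    Assumption: this mirrors what "using two digits" means in the tables;
    the article's own preprocessing may differ.
    """
    only_digits = [ch for ch in ddc_number if ch.isdigit()]
    return "".join(only_digits[:digits])


# Illustrative class numbers (not from the LIBRIS data).
two_digit = truncate_ddc("839.7")
three_digit = truncate_ddc("004.6", digits=3)
```

Shortening the labels this way merges many fine-grained classes into fewer, broader ones, which would be consistent with the higher test accuracy reported for the `_2d` datasets than for the full-class datasets.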

Accuracy of the Support Vector Machine classifier on the different datasets

Dataset	Unigrams, training set	Unigrams, test set	Unigrams + 2-grams, training set	Unigrams + 2-grams, test set
T	93.74%	40.91%	99.59%	40.45%
T_KW	97.50%	65.25%	99.90%	66.13%
KW	83.09%	64.02%	92.38%	64.09%
T_MC	93.95%	57.99%	99.62%	57.80%
T_KW_MC	97.89%	80.75%	99.93%	81.37%
KW_MC	90.58%	79.56%	96.30%	80.38%
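
The tables compare features built from unigrams alone against unigrams combined with 2-grams. The following is a minimal sketch of what such a feature set can look like, assuming simple lowercasing and whitespace tokenization; the article's actual tokenizer and weighting scheme are not given in this excerpt. The function name `ngram_features` and the sample title are hypothetical.

```python
from collections import Counter


def ngram_features(text, max_n=2):
    """Count word n-grams of length 1..max_n in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of n tokens across the text.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Hypothetical title, used only for illustration.
features = ngram_features("Svensk historia och svensk litteratur")
unigrams_only = ngram_features("Svensk historia och svensk litteratur", max_n=1)
```

Adding 2-grams enlarges the feature space considerably. That fits the pattern in the tables, where training accuracy rises sharply with 2-grams while test accuracy changes only slightly.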

Accuracy of the Support Vector Machine classifier using different pre-processing

Dataset	Unigrams, training set	Unigrams, test set	Unigrams + 2-grams, training set	Unigrams + 2-grams, test set
T_KW_MC	97.89%	80.75%	99.93%	81.37%
T_KW_MC_rem	92.51%	80.94%	95.02%	81.83%
T_KW_MC_stm	97.21%	81.07%	99.91%	81.80%
T_KW_MC_stm_rem	92.18%	81.34%	94.89%	82.20%
T_KW_MC_sw	95.44%	80.98%	98.48%	81.24%
T_KW_MC_sw_rem	92.46%	81.04%	94.30%	82.13%
T_KW_MC_sw_stm	94.87%	81.40%	98.72%	81.24%
T_KW_MC_sw_stm_rem	92.17%	81.54%	94.16%	81.90%

Accuracy of Linear and RNN classifiers using word embeddings

Dataset	Linear, training set	Linear, test set	RNN, training set	RNN, test set
T_KW_MC	97.17%	79.99%	92.76%	78.70%
KW_MC	91.30%	78.41%	88.03%	78.74%
T_KW_MC_stm	96.90%	80.81%	92.38%	79.16%

Accuracy of the Naïve Bayes classifier when using two digits

Dataset	Unigrams, training set	Unigrams, test set	Unigrams + 2-grams, training set	Unigrams + 2-grams, test set
T_KW_stm_2d	87.40%	65.64%	93.18%	67.79%
T_KW_2d	88.26%	64.78%	93.55%	66.92%
KW_2d	78.36%	68.12%	82.53%	67.94%

Accuracy of NN and CNN classifiers using word embeddings

Dataset	NN, training set	NN, test set	CNN, training set	CNN, test set
T_KW_MC	96.19%	79.40%	95.33%	79.92%
KW_MC	90.54%	78.23%	90.39%	79.15%
T_KW_MC_stm	95.92%	79.57%	94.60%	80.38%

Accuracy of the Naïve Bayes classifier using different pre-processing

Dataset	Unigrams, training set	Unigrams, test set	Unigrams + 2-grams, training set	Unigrams + 2-grams, test set
T_KW_MC	95.42%	76.52%	99.66%	75.96%
T_KW_MC_rem	90.17%	76.79%	93.25%	78.21%
T_KW_MC_stm	94.32%	76.36%	99.59%	76.36%
T_KW_MC_stm_rem	89.62%	76.26%	92.95%	78.27%
T_KW_MC_sw	95.50%	76.46%	99.64%	76.62%
T_KW_MC_sw_rem	90.28%	77.09%	92.33%	78.60%
T_KW_MC_sw_stm	94.49%	76.59%	99.53%	76.95%
T_KW_MC_sw_stm_rem	89.79%	76.36%	91.96%	78.90%

Accuracy of the Multinomial Naïve Bayes classifier on the different datasets

Dataset	Unigrams, training set	Unigrams, test set	Unigrams + 2-grams, training set	Unigrams + 2-grams, test set
T	83.54%	34.89%	95.82%	34.15%
T_KW	90.01%	55.33%	98.14%	55.45%
KW	75.28%	59.15%	84.95%	58.11%
T_MC	90.83%	54.21%	98.63%	50.51%
T_KW_MC	95.42%	76.52%	99.66%	75.96%
KW_MC	86.94%	77.25%	94.24%	77.09%

Language: English

Publication timeframe: 4 times per year
Journal Subjects: Computer Sciences, Information Technology, Project Management, Databases and Data Mining