This paper compares the paradigmatic differences between knowledge organization (KO) in library and information science and knowledge representation (KR) in AI to show the convergence in KO and KR methods and applications.
Methodology
A literature review and a comparative analysis of the KO and KR paradigms are the primary methods used in this paper.
Findings
A key difference between KO and KR lies in their purpose: KO organizes knowledge into a structure that standardizes and/or normalizes the vocabulary of concepts and relations, while KR is problem-solving oriented. Differences between KO and KR are discussed in terms of goals, methods, and functions.
Research limitations
This is preliminary research only, with a case study serving as proof of concept.
Practical implications
The paper articulates the opportunities for applying KR and other AI methods and techniques to enhance the functions of KO.
Originality/value
Ontologies and linked data, as evidence of the convergence of the KO and KR paradigms, provide theoretical and methodological support for innovating KO in the AI era.
As more and more digital collections of various information resources become available, the challenge of assigning subject index terms and classes from quality knowledge organization systems is also increasing. While the ultimate purpose is to understand the value of automatically produced Dewey Decimal Classification (DDC) classes for Swedish digital collections, the paper aims to evaluate the performance of six machine learning algorithms as well as a string-matching algorithm based on characteristics of DDC.
Design/methodology/approach
State-of-the-art machine learning algorithms require at least 1,000 training examples per class. The complete data set at the time of research comprised 143,838 records, which had to be reduced to the top three hierarchical levels of DDC in order to provide sufficient training data (totaling 802 classes in the training and testing sample, out of 14,413 classes at all levels).
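The reduction to the top three hierarchical levels can be pictured with a minimal sketch. It assumes, for illustration only, that each record carries a DDC notation such as "025.431" and that the top three levels correspond to its first three digits; the function names, sample notations and threshold are hypothetical and are not taken from the study.

```python
from collections import Counter


def truncate_ddc(notation: str, levels: int = 3) -> str:
    """Keep only the leading digits of a DDC notation (assumed form: '025.431')."""
    digits = [ch for ch in notation if ch.isdigit()]
    if len(digits) < levels:
        raise ValueError(f"notation too short: {notation!r}")
    return "".join(digits[:levels])


def classes_with_enough_data(notations, min_examples):
    """Count records per truncated class and keep classes meeting the threshold."""
    counts = Counter(truncate_ddc(n) for n in notations)
    return {cls: n for cls, n in counts.items() if n >= min_examples}


# Hypothetical sample notations, not data from the study.
sample = ["025.431", "025.04", "025.3", "839.7", "839.738"]
kept = classes_with_enough_data(sample, min_examples=3)
print(kept)  # {'025': 3}
```

With a realistic threshold such as 1,000 examples per class, the same filter would show how many classes remain trainable at each level of truncation.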
Findings
Evaluation shows that Support Vector Machine with linear kernel outperforms other machine learning algorithms as well as the string-matching algorithm on average; the string-matching algorithm outperforms machine learning for specific classes when characteristics of DDC are most suitable for the task. Word embeddings combined with different types of neural networks (simple linear network, standard neural network, 1D convolutional neural network, and recurrent neural network) produced worse results than Support Vector Machine, but came close, with the benefit of a smaller representation size. Analysis of the impact of features in machine learning shows that using keywords or combining titles and keywords gives better results than using only titles as input. Stemming only marginally improves the results. Removing stop-words reduced accuracy in most cases, while removing less frequent words increased it marginally. The greatest impact is produced by the number of training examples: 81.90% accuracy on the training set is achieved when at least 1,000 records per class are available in the training set, and 66.13% when too few records (often fewer than 100 per class) are available for training. These figures hold only for the top three hierarchical levels (803 instead of 14,413 classes).
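The feature options compared above (titles only versus titles plus keywords, stop-word removal, pruning of infrequent words) can be sketched as a small bag-of-words builder. This is an illustrative sketch under stated assumptions, not the study's pipeline: the record layout, the tiny stop-word list, and the names `tokenize` and `build_features` are all hypothetical.

```python
from collections import Counter

# Tiny illustrative stop-word list; a real list would be language-specific.
STOP_WORDS = {"the", "of", "and", "in", "a"}


def tokenize(record, use_title=True, use_keywords=True):
    """Lower-case and split the selected fields of a record into tokens."""
    parts = []
    if use_title:
        parts.append(record["title"])
    if use_keywords:
        parts.extend(record["keywords"])
    return " ".join(parts).lower().split()


def build_features(records, remove_stop_words=False, min_count=1, **fields):
    """Return one bag-of-words Counter per record under the chosen options."""
    docs = [tokenize(r, **fields) for r in records]
    if remove_stop_words:
        docs = [[t for t in d if t not in STOP_WORDS] for d in docs]
    # Corpus-wide frequencies decide which infrequent words are pruned.
    corpus_counts = Counter(t for d in docs for t in d)
    return [Counter(t for t in d if corpus_counts[t] >= min_count) for d in docs]


# Hypothetical records, not data from the study.
records = [
    {"title": "The history of Swedish libraries", "keywords": ["libraries", "sweden"]},
    {"title": "Cataloguing in public libraries", "keywords": ["cataloguing"]},
]
features = build_features(records, remove_stop_words=True, min_count=2)
print(features[0])
```

Passing `use_keywords=False` gives the titles-only variant, so the same function can produce each input configuration for a classifier.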
Research limitations
Having to reduce the number of hierarchical levels to the top three levels of DDC because of the lack of training data for all classes skews the results, so that they work in experimental conditions but barely for end users in operational retrieval systems.
Practical implications
In conclusion, for operative information retrieval systems, applying purely automatic DDC classification does not work, either using machine learning (because of the lack of training data for the large number of DDC classes) or using a string-matching algorithm (because DDC characteristics perform well for automatic classification only in a small number of classes). Over time, more training examples may become available, and DDC may be enriched with synonyms in order to enhance the accuracy of automatic classification, which may also benefit information retrieval performance based on DDC. For quality information services to reach the objective of the highest possible precision and recall, automatic classification should never be implemented on its own; instead, machine-aided indexing that combines the efficiency of automatic suggestions with the quality of human decisions at the final stage should be the way forward.
Originality/value
The study explored machine learning on a large classification system of over 14,000 classes which is used in operational information retrieval systems. Due to lack of sufficient training data across the entire set of classes, an approach complementing machine learning, that of string matching, was applied. This combination should be explored further since it provides the potential for real-life applications with large target classification systems.
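One way to picture such a combination is a fallback rule: use the machine learning prediction when it is confident, and otherwise try string matching against class terms. The sketch below is only an assumption about how the two could be combined, not the procedure used in the study; the class terms, the confidence threshold and the function names are hypothetical.

```python
def string_match(text, class_terms):
    """Return the class whose terms occur most often in the text, or None."""
    tokens = text.lower().split()
    scores = {
        cls: sum(tokens.count(term) for term in terms)
        for cls, terms in class_terms.items()
    }
    best = max(scores, key=scores.get, default=None)
    return best if best is not None and scores[best] > 0 else None


def suggest_class(text, ml_prediction, ml_confidence, class_terms, threshold=0.8):
    """Prefer a confident ML prediction; otherwise fall back to string matching."""
    if ml_prediction is not None and ml_confidence >= threshold:
        return ml_prediction, "machine learning"
    matched = string_match(text, class_terms)
    if matched is not None:
        return matched, "string matching"
    return ml_prediction, "machine learning (low confidence)"


# Hypothetical class terms; real ones would come from DDC captions and index terms.
CLASS_TERMS = {"020": ["library", "libraries"], "780": ["music"]}

print(suggest_class("Folk music in Sweden", "020", 0.35, CLASS_TERMS))
# ('780', 'string matching')
```

In a machine-aided indexing setting, the returned label and its source would be shown to a human indexer as a suggestion rather than applied automatically.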
This paper reports the publication of the second edition of the Integrative Levels Classification (ILC2), a freely-faceted knowledge organization system (KOS), and reviews the main changes introduced since its first edition (ILC1).
Design/methodology/approach
The most relevant changes are illustrated, with special reference to those of interest to general classification theory, by means of examples of notation for individual classes and combinations of them.
Findings
Changes introduced in ILC2 include: the names and order of some main classes; the development of subclasses for various phenomena, especially quantities and algebraic structures; the order of facet categories and the new category of Disorder; notation for special facets; and the distinction between the semantic function of facets (attributes) and their syntactic function. The system can be freely accessed online through a PHP browser as well as in SKOS format.
Research limitations
Only a selection of changed classes is discussed for space reasons.
Practical implications
ILC1 has previously been applied to the BARTOC directory of KOSs. An update of BARTOC data to ILC2 and the application of ILC2 to further information systems are envisaged. Possible methods for reclassifying BARTOC with ILC2 are discussed.
Originality/value
ILC is a newly developed classification system, based on phenomena instead of traditional disciplines and featuring various innovative devices. This paper is an original account of its most recent evolution.
This paper presents the ARQUIGRAFIA project, an open, public, nonprofit and continuously growing collaborative web environment dedicated to Brazilian architectural photographic images.
Design/methodology/approach
The ARQUIGRAFIA project promotes active and collaborative participation among its institutional users (GLAMs, NGOs, laboratories and research groups) and private users (students, professionals, professors, researchers). Both groups can create an account and share their digitized iconographic collections in the same web environment by uploading their files, indexing and georeferencing them, and assigning a Creative Commons license.
Findings
The project has developed user interactions that record impressions, expressed as semantic differentials, of the visible plastic-spatial aspects of the architecture and present them in synthetic infographics; images can also be retrieved through an advanced search based on those impression parameters. Through gamification, the system often invites users to review images in order to improve the accuracy of image data. A pilot project named Open Air Museum allows users to add audio descriptions to images in situ. An interface for users’ digital curatorship will soon be available.
Research limitations
ARQUIGRAFIA’s multidisciplinary team, which brings together professor-researchers and graduate and undergraduate students from the Architecture and Urbanism, Design, Information Science and Computer Science faculties of the University of São Paulo, requires continuous funding for grants, third-party services, participation in scientific events in Brazil and abroad, and equipment. Since 2016, significant budget cuts to the University of São Paulo’s own research funds and to Brazilian federal scientific agencies may compromise the continuity of this project.
Practical implications
The open source template called +GRAFIA can be freely used to help other areas of knowledge build their own visual collaborative web environments.
Originality/value
The collaborative nature of ARQUIGRAFIA distinguishes it from institutional image databases on the internet, precisely because it involves a heterogeneous network of collaborators.
This research project aims to organize the archival information of traditional Korean performing arts in a semantic web environment. Key requirements that archival records managers should consider when publishing and distributing gugak performing arts archival information in a semantic web environment are presented from the perspective of linked data.
Design/methodology/approach
This study analyzes the metadata provided by the National Gugak Center’s Gugak Archive, the search and browse menus of the Gugak Archive’s website, and K-PAAN, the performing arts portal site.
Findings
The importance of consistency, continuity, and systematicity—crucial qualities in traditional record management practices—is undiminished in a semantic web environment. However, a semantic web environment also requires new tools such as web identifiers (URIs), data models (RDF), and link information (interlinking).
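The three tools named above can be illustrated with a minimal sketch that writes RDF statements in N-Triples syntax: a URI identifies a record, a triple states its title, and an `owl:sameAs` triple links it to an external resource. The namespace `http://example.org/gugak/`, the record identifier, the title and the external URI are placeholders and are not taken from the Gugak Archive.

```python
def iri(value):
    """Wrap a URI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(text, lang=None):
    """Quote a string literal, escaping backslashes and double quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}' if lang else f'"{escaped}"'


def triple(subject, predicate, obj):
    """Serialize one subject-predicate-object statement."""
    return f"{subject} {predicate} {obj} ."


BASE = "http://example.org/gugak/"  # hypothetical namespace
DCT_TITLE = "http://purl.org/dc/terms/title"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

performance = iri(BASE + "performance/0001")  # web identifier (URI)
triples = [
    # Data model (RDF): a descriptive statement about the record.
    triple(performance, iri(DCT_TITLE), literal("Example gugak performance", lang="en")),
    # Interlinking: a link to a placeholder external resource.
    triple(performance, iri(OWL_SAME_AS), iri("http://example.com/external/0001")),
]
print("\n".join(triples))
```

A production service would normally rely on an RDF library and a persistent URI policy; the point here is only that each record becomes an addressable subject with typed links.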
Research limitations
The scope of this study does not include practical implementation strategies for the archival records management system and website services. The suggestions also do not discuss issues related to copyright or policy coordination between related organizations.
Practical implications
The findings of this study can assist records managers in converting a traditional performing arts information archive into an online archival service and system based on a semantic web environment. They can also be useful for collaboration with records managers who are unfamiliar with relational or triple-store database systems.
Originality/value
This study analyzed the metadata of the Gugak Archive and its online services to present practical requirements for managing and disseminating gugak performing arts information in a semantic web environment. By applying the principles and methods of semantic web services to the Gugak Archive, this study can contribute to the improvement of information organization and services in the field of Korean traditional music.
This study proposes an abstract model by gathering concepts that focus on resource representation and description in digital curation models, and suggests a conceptual model that emphasizes semantic enrichment in digital curation.
Design/methodology/approach
This study conducts a literature review to analyze earlier curation models: DCC CLM, DCC&U, UC3, and DCN.
Findings
In this study, the concept of semantic enrichment is expressed in a single word, SEMANTIC. The Semantic Enrichment Model, SEMANTIC, has eight elements: subject, extraction, multi-language, authority, network, thing, identity, and connect.
Research limitations
This study does not reflect the actual information environment because it focuses on the concepts of the representation of digital objects.
Practical implications
This study presents the main considerations for creating and reinforcing the description and representation of digital objects when building and developing digital curation models in specific institutions.
Originality/value
This study summarizes the elements that should be emphasized in the representation of digital objects in terms of information organization.
This study aims to develop a set of metrics and identify criteria for assessing the functionality of LOD KOS products, while providing common guiding principles that LOD KOS producers and users can apply to maximize the functions and usages of LOD KOS products.
Design/methodology/approach
Data collection and analysis were conducted in three periods: 2015–16, 2017 and 2019. The sample data used in the comprehensive data analysis comprise all datasets tagged as types of KOS in the Datahub and extracted through their respective SPARQL endpoints. A comparative study of the LOD KOS collected from the terminology services Linked Open Vocabularies (LOV) and BioPortal was also performed.
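Extraction through a SPARQL endpoint can be pictured with a minimal sketch that builds a SPARQL Protocol GET request for a simple query counting SKOS concepts. The study does not state which queries were run, so the query, the endpoint URL `http://example.org/sparql` and the helper name `build_query_url` are assumptions made for illustration; the request is constructed but not sent.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Illustrative query: how many distinct SKOS concepts does a dataset expose?
SKOS_CONCEPT_COUNT = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT (COUNT(DISTINCT ?c) AS ?concepts)
WHERE { ?c a skos:Concept . }
"""


def build_query_url(endpoint, query):
    """Build a SPARQL Protocol GET URL; the request itself is not sent here."""
    return endpoint + "?" + urlencode({"query": query.strip()})


# Placeholder endpoint, not one of the datasets analyzed in the study.
url = build_query_url("http://example.org/sparql", SKOS_CONCEPT_COUNT)
print(url)
```

Running the same query against each endpoint in a sample would give one comparable figure per dataset, which is the kind of raw material a metrics study needs.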
Findings
The study proposes a set of Functional, Impactful and Transformable (FIT) metrics for LOD KOS as value vocabularies. The FAIR principles, with additional recommendations, are presented for LOD KOS as open data.
Research limitations
The metrics need to be further tested and aligned with the best practices and international standards of both open data and various types of KOS.
Practical implications
Assessments performed with FAIR and FIT metrics support the creation and delivery of user-friendly, discoverable and interoperable LOD KOS datasets, which can be used for innovative applications, act as a knowledge base, become a foundation for semantic analysis and entity extraction, and enhance research in science and the humanities.
Originality/value
Our research provides best practice guidelines for LOD KOS as value vocabularies.