Semantic technology standards advanced in the Semantic Web era have enabled open access to well-structured and well-curated Linked Open Data (LOD) datasets. Among the most useful LOD products are the Knowledge Organization Systems (KOS) that were originally published as thesauri, classifications, taxonomies, name authorities, or picklists and now are available as LOD datasets. At the tenth anniversary of the W3C (2009) formal recommendations Simple Knowledge Organization System (SKOS) and SKOS eXtension for Labels (SKOS-XL), the number of KOS datasets available through open data registries (e.g. Datahub
The “FAIR Guiding Principles for scientific data management and stewardship” (Wilkinson et al., 2016) provides guidelines for the publication of digital resources such as datasets, code, workflows, and research objects, in a manner that makes them Findable, Accessible, Interoperable, and Reusable (FAIR). The FAIR principles
Consequently, we were inspired to implement and assess FAIR principles for KOS open datasets and in particular to assemble relevant functional metrics for a specific type of product—the LOD KOS vocabularies. This paper reports a set of metrics developed through a comprehensive data analysis and a comparative study. The project extends a previous study of “KOS in the Semantic Web” (Zeng & Mayr, 2018) which examined the functions of LOD KOS based on a set of collected cases through the viewpoints of LOD dataset producers, KOS vocabulary producers, and researchers who are the end-users of LOD KOS. The study reveals the remarkable potential of LOD KOS while also highlighting obstacles and issues. It is believed that the main barrier for maximizing the usage of LOD KOS resides in communication about these KOS through their delivery services rather than their structure, format, or contents. Undoubtedly, common guiding principles and assessment metrics are needed for the LOD KOS. By using these metrics, any assessment performed on LOD KOS products can lead to actions addressing any or all identified issues case-by-case, collectively or independently.
Our research has also confirmed that semantic technologies have brought KOS vocabularies into a new era, making many technologically advanced use cases possible. The functions of KOS have been extended far beyond those of controlled vocabularies and taxonomies. They have become knowledge bases, the trustable resources for knowledge graphs, and fundamental components for the contextualization of data-driven and AI-dominated processes. It is essential that the LOD KOS be measurable by developers and users for enhanced and effective usage, while encouraging innovative approaches for the LOD KOS to be FAIR (Findable, Accessible, Interoperable, and Reusable) as open data and FIT (Functional, Impactful, and Transformable) as value vocabularies.
In this section, we present the methodology of an investigation designed to gather descriptive data on LOD KOS datasets. This study involved a series of steps for collecting and analyzing LOD KOS datasets to assess their functionality. Data collection and analysis were conducted in three periods: 2015–16, 2017, and 2019. The sample comprises all datasets tagged as a kind of knowledge organization system in the Datahub (
The steps described below involve data collected through the SPARQL endpoints of LOD KOS products registered in the Datahub. We further analyzed and recorded the presence of certain features and extracted LOD KOS properties used via a SPARQL query.
To search for qualified KOS datasets to study, we used the following terms provided and used as tags by providers in the Datahub:
To evaluate the structure and content of the datasets in detail, we manually verified that each dataset found aligned with the definition of its type of KOS. In some cases, providers had tagged their products with terms such as “terminology” or “taxonomy”. However, a dataset tagged as a terminology might be just a list of terms that was not crafted for information retrieval purposes, was arranged in no order, or had no definitions. Some datasets tagged as taxonomies neither grouped objects by any particular characteristic, as commonly understood, nor had any kind of hierarchical arrangement. After validation, datasets that did not fit the commonly understood definitions of the various types of KOS were removed from consideration. Therefore, as shown in Figure 2, “found” refers to all datasets tagged as a kind of KOS that were returned by the initial search, while “verified” refers to those that were checked and aligned with the commonly understood definitions of these terms. (Refer to Figure 2 in Section 3.3 for the initial search and verified results.)
To study the datasets confirmed as real KOS vocabularies in each category, the names and features of all datasets were documented in a spreadsheet. For each dataset, we checked and recorded whether the SPARQL API was available and working, had moved, was unavailable, or returned an error message. (Refer to Table 1 in Section 4.1.)
For each dataset, additional features of the SPARQL API were also recorded, including: (1) the name of the editor facilitating deployment of the SPARQL endpoint (e.g. Virtuoso, Fuseki, PoolParty, etc.); (2) the default query, if available; (3) the enabled operations (e.g. SELECT, CONSTRUCT, ASK, DESCRIBE); (4) the maximum number of possible results from queries; (5) the enabled HTTP methods; (6) the available formats in which results can be downloaded; and (7) the number of example queries provided. (Refer to Tables 2 and 3 in Section 4.1.)
To investigate the datatype properties of each dataset, a specific query was run at each endpoint across all datasets to extract the first 100 properties (
A complete assessment of the datasets’ interoperability level requires an examination of the property vocabularies used. We created a list of common vocabularies, including Dublin Core, FOAF, RDFS, OWL, SKOS, DBpedia, Schema.org, etc. The use of standard vocabularies was recorded along with the occurrence of specialized and locally created terms. Extracted element properties were then color-marked to provide an overview of the sources of the properties reused by a dataset. The color codes show the frequency of non-standard/local terms compared to standard controlled terms. (Refer to Figure 3 and Figure 4 in Section 4.1.)
To give a simple quantitative analysis of the collected data, we obtained counts of the collected datasets for each KOS type and the frequencies of the various property elements, and determined how many KOS overall use those properties. Furthermore, the number of occurrences of each document format was captured. Finally, the number of endpoints that provide SPARQL query examples or templates for exploring the dataset was also assessed.
In addition to the data collected from the Datahub, another round of data collection and analysis was added in 2019, aimed at gathering real cases that demonstrate areas that could be enhanced. These data were collected from BioPortal and LOV.
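The property check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the exact query used in the study is not reproduced here, so `PROPERTY_QUERY` is one plausible formulation, and the helper names (`classify_property`, `summarize`) and the addition of the RDF namespace to the list are ours.

```python
from collections import Counter

# Assumed query text: one plausible way to list up to 100 distinct
# properties used in a dataset (not necessarily the study's query).
PROPERTY_QUERY = """
SELECT DISTINCT ?property
WHERE { ?s ?property ?o }
LIMIT 100
"""

# Namespace IRIs of the standard vocabularies named in the text
# (RDF itself is added here as an assumption).
STANDARD_NAMESPACES = {
    "RDF": ("http://www.w3.org/1999/02/22-rdf-syntax-ns#",),
    "RDFS": ("http://www.w3.org/2000/01/rdf-schema#",),
    "OWL": ("http://www.w3.org/2002/07/owl#",),
    "SKOS": ("http://www.w3.org/2004/02/skos/core#",),
    "Dublin Core": ("http://purl.org/dc/elements/1.1/",
                    "http://purl.org/dc/terms/"),
    "FOAF": ("http://xmlns.com/foaf/0.1/",),
    "DBpedia ontology": ("http://dbpedia.org/ontology/",),
    "Schema.org": ("http://schema.org/", "https://schema.org/"),
}


def classify_property(iri):
    """Return the standard vocabulary a property IRI belongs to,
    or 'local/other' when no known namespace matches."""
    for name, prefixes in STANDARD_NAMESPACES.items():
        if iri.startswith(prefixes):  # str.startswith accepts a tuple
            return name
    return "local/other"


def summarize(property_iris):
    """Count properties per source vocabulary (the 'color-marking' step)."""
    return Counter(classify_property(iri) for iri in property_iris)


# Hypothetical property list, as might be returned by PROPERTY_QUERY.
sample = [
    "http://www.w3.org/2004/02/skos/core#prefLabel",
    "http://www.w3.org/2004/02/skos/core#broader",
    "http://purl.org/dc/terms/creator",
    "http://example.org/vocab#localCode",
]
print(summarize(sample))
```

A high share of `local/other` properties in such a summary corresponds to the independent specialized properties discussed later in the paper.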
Besides the processes mentioned above, a parallel project, “KOS in the Semantic Web”, has focused on case studies and content analysis of the LOD KOS products. Many specific cases have been traced and studied in order to discover best practices and innovative approaches in KOS creation, connection, production, and transformational usages. In the following sections, selected cases will be referenced from our reports and publications from this “KOS in the Semantic Web” project.
Table 1
Number of SPARQL endpoints provided (data collected in 2016, 2017, and 2019 from the Datahub).
Type of KOS (search term) | 2016: # found | 2016: # with SPARQL endpoints | 2017: # found | 2017: # with SPARQL endpoints | 2019: # found | 2019: # with SPARQL endpoints |
---|---|---|---|---|---|---|
Thesaurus | 67 | 39 | | 40 | 80 | 41 |
Classification | 458 | 29 | 476 | 31 | 478 | 31 |
Taxonomy | 26 | 8 | 35 | 8 | 37 | 10 |
Terminology | 35 | 7 | 39 | 8 | 39 | 8 |
List | 665 | 52 | 821 | 58 | 825 | 59 |
Table 2
Available serialization formats of KOS datasets (sorted by the 2019 counts).
Format | 2016 | 2017 | 2019 |
---|---|---|---|
JSON | 54 | 42 | 74 |
HTML | 47 | 37 | 71 |
XML | 55 | 42 | 69 |
TSV | 44 | 30 | 63 |
RDF+XML | 40 | 30 | 61 |
DEFAULT/AUTO | 37 | 27 | 51 |
TURTLE | 30 | 26 | 39 |
CSV | 34 | 20 | 39 |
N-TRIPLES | 26 | 18 | 36 |
JAVASCRIPT | 23 | 11 | 31 |
SPREADSHEET | 22 | 3 | 30 |
PLAIN/TEXT | 20 | 21 | 28 |
QUERY STRUCTURE | 15 | 15 | 23 |
SERIALIZED PHP | 15 | 15 | 22 |
JSON-LD | 3 | 1 |
As explained by the European Commission Expert Group on FAIR Data (2018), “FAIR” or “FAIR data” should be understood as shorthand for a concept that comprises a range of scholarly materials that surround and relate to research data. This includes the algorithms, tools, workflows, and analytical pipelines that lead to creation of the data and which give it meaning. It also encompasses the technical specifications, standards, metadata, vocabularies, ontologies and identifiers that are needed to provide meaning, both to the data itself and to any associated materials. We recommend that LOD KOS follow the FAIR principles, to improve findability, accessibility, interoperability, and to enable reuse of any digital assets owned.
Since the FAIR Guiding Principles have been studied and reported in multiple venues, we will not repeat the explanations and implementation cases in this paper. Instead, we emphasize situations that align with FAIR and can be enhanced with FAIRification. Each of the FAIR principles is presented and described below with our findings, which come from two main sources.
First, Figure 1 highlights various descriptive elements used on the Datahub for the datasets. Found in various views, they indicate how the LOD KOS have been registered, described, and made available in the Datahub.
Second, the properties describing the LOD KOS collected through their own SPARQL endpoints have been captured and analyzed.
Figure 1
Various levels of FAIRness of LOD KOS datasets as seen from the Datahub.

Findable requires that metadata and data be easy to find for both humans and computers.
Machine-readable metadata are essential for the automatic discovery of datasets and services, so they are an indispensable component of the FAIRification process. Considering the “F” approaches for the LOD KOS datasets registered in the Datahub, we found that the metadata supplied by providers could benefit from being enriched. Furthermore, while many datasets met the requirements for “F”, some lacked basic information about the KOS dataset, such as languages, creators, and history. (See the example on the upper right of Figure 1.) Thus, we have a specific additional recommendation for Findable:
Enrich metadata about KOS as much as possible to enable data discovery processes.
Accessible requires that, once users find the required data, they know how the data can be accessed, possibly including authentication and authorization.
Access to a LOD KOS includes various paths: query access, entity-level access, and access via data dump. There is a wide range of types and formats in which a KOS vocabulary dataset can be delivered. A highly accessible KOS would provide not only a SPARQL endpoint for query access, but also common entity-level browsing and searching access. It would be ideal to deliver datasets in certain RDF serialization formats and to provide examples in varying formats (e.g. left in Figure 1). Yet, in reality, some KOS products have only one mode of downloadable access. This leads to a specific additional recommendation for Accessible:
Provide multiple pathways for accessing the KOS datasets.
Interoperable recognizes that data usually need to be integrated with other data. In addition, the data need to interoperate with applications or workflows for analysis, storage, and processing.
The preliminary findings of this study reveal that the metadata used to describe vocabulary types vary across registries. For example, the way vocabulary types are categorized in the Datahub is not standardized, even though suggested terms are provided (refer to Figure 2).
Figure 2
Initial search and verified results of types of KOS.

The situation is similar in other KOS registries. Tags can be misused when indicating the types of KOS vocabularies. One can imagine the amount of time spent on verifying them before any mapping. The issue can be resolved by applying the
Utilize the
Metadata about the LOD KOS found at individual endpoints showed that many have used Dublin Core Metadata Element Set
Figure 3
Property checking (2016).
Study conducted in 2015–2016. Independent special properties used in the datasets are orange colored.

Figure 4
Property checking (2019).
Study conducted in 2019. Independent special properties used in the datasets are grey colored.

Reusable indicates that the ultimate goal of FAIR is to optimize the reuse of data. To achieve this, metadata and data should be well-described so that they can be replicated and/or combined in different settings.
License and provenance metadata are critical to the dissemination of data. There has been a paradigm shift in the digital age for the publication of KOS products. Before the Web, the authorized and relevant version of a thesaurus or taxonomy was clear because there was a single release source. Today, however, multiple versions, formats, and locations for a single KOS vocabulary can be found, ranging from formally updated releases to project-based releases at various hubs or registries. Deciding which source to use is in the hands of the users and not the producers of a vocabulary. Lack of license and provenance metadata may cause confusion and negatively impact the KOS, especially those that are constantly updated under high quality control. Therefore, as alluded to in Figure 1 (those items listed as not present, e.g. no-license-metadata), to enable dataset reusability we additionally recommend for Reusable:
Adequately supply license and provenance metadata to enable datasets’ reusability.
The main aim of the FAIR principles is to enable and advance the reuse of data. Not all data are created equal, and so we wanted to examine KOS data in light of these principles. We discovered that while the FAIR principles could significantly improve the quality of these datasets, additional considerations were necessary for KOS as value vocabularies. This section presents the main research findings for LOD KOS as value vocabularies with a set of functional metrics and recommendations. We use the term “metrics” in the sense of key performance indicators which can be used to assess the efficiency, performance, and quality of LOD KOS datasets. Tiemensma (2010) discusses this usage of the term and describes it as a shift from measuring what you can count to measuring what counts. Accordingly, we have identified three critical indicators, defined here as Functional, Impactful, and Transformable, and coined the acronym FIT to reference them. These recommendations are supported by our research findings.
As introduced at the beginning of this article, the FAIR guiding principles inspired us to assemble relevant functional metrics for this specific type of product—the KOS vocabularies, in addition to implementing FAIR for KOS as open datasets. Our “KOS in the Semantic Web” study (Zeng & Mayr, 2018) highlighted the need for common guiding principles and assessment metrics for the LOD KOS and further that these should be used by LOD KOS producers and users to maximize the functions and usages of LOD KOS products and enable better research and application development utilizing them. In the following sections, when presenting the FIT metrics, selected examples will be presented as evidence of the reality of today’s LOD KOS products.
For this section, a significant portion will be devoted to the research findings and recommendations related to the Functional metric. If a dataset is assessed according to these guidelines and fails to adhere to the principles indicated in terms of its functions, it would not be useful to further consider its Impacts and potential Transformable usages.
Our findings suggest that this is the most critical metric for datasets to align with. No matter how well a LOD KOS dataset meets the FAIR principles, if it is not made available in ways that enhance its inherent purpose, it cannot be considered a good value vocabulary. To be Functional, we recommend that a dataset be assessed using four major criteria.
Table 1 presents the number of datasets of the original KOS types, including thesauri, classifications, taxonomies, terminologies, and lists, plus the number of SPARQL endpoints provided. Table 2 shows the formats and the number of datasets that make data available in each format. The major findings reveal that: (1) Many serialization formats have been used for KOS deliverables, such as JSON, HTML, Turtle, N-Triples, RDF/XML, CSV, and more. The default/auto format is usually an HTML table. (2) The number of KOS products offering operational SPARQL endpoints is still very limited, despite a gradual increase. Enabling a SPARQL endpoint allows users to target specific pieces of data, and the content types of query results can be selected based on the intended usage in applications. (3) The majority of KOS datasets made available via SPARQL endpoints have been implemented with tools including Virtuoso, PoolParty, Fuseki, and ARC SPARQL+. Others have used their own custom-built implementation or do not indicate what tool they are using. (4) Supported SPARQL features differ based on the service implementation, yet we found that all endpoints offered a number of serialization formats for query results.
Our recommendation is that a KOS vocabulary should be delivered in consumable formats: available in various data serialization formats and accessible through a SPARQL endpoint.
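To make the link between an endpoint and its result formats concrete, the following minimal sketch builds (but does not send) an HTTP request that asks a SPARQL endpoint for SELECT results in a chosen serialization. The media types are the standard ones for SPARQL query results; the endpoint URL `https://example.org/sparql` and the helper name `build_select_request` are hypothetical, and which formats a given endpoint actually honors depends on its implementation.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Standard media types for SPARQL SELECT/ASK results.
RESULT_MEDIA_TYPES = {
    "JSON": "application/sparql-results+json",
    "XML": "application/sparql-results+xml",
    "CSV": "text/csv",
    "TSV": "text/tab-separated-values",
}


def build_select_request(endpoint_url, query, result_format="JSON"):
    """Build a GET request whose Accept header selects the result format.

    Assumes endpoint_url has no query string of its own.
    """
    url = endpoint_url + "?" + urlencode({"query": query})
    headers = {"Accept": RESULT_MEDIA_TYPES[result_format]}
    return Request(url, headers=headers, method="GET")


# Hypothetical endpoint; nothing is sent over the network here.
request = build_select_request(
    "https://example.org/sparql",
    "SELECT ?s WHERE { ?s ?p ?o } LIMIT 5",
    result_format="CSV",
)
print(request.full_url)
print(request.get_header("Accept"))
```

Passing the returned object to `urllib.request.urlopen` would execute the query; the same query can then be retrieved in another format simply by changing `result_format`.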
Our findings have shown that nearly 80% of endpoints reviewed are operational. Though encouraging, this also means that more than 20% are no longer working. Being able to rely on an endpoint, or to anticipate when it might be unavailable, will be critical for some users. It is our recommendation that institutions commit to ensuring sustained access to their KOS dataset deliverables by providing a persistently available SPARQL endpoint.
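An availability check of this kind could be automated along the following lines. This is a sketch, not the procedure used in the study (which recorded endpoint status manually): the status categories mirror those recorded in our spreadsheet, while the function names, the lightweight `ASK` probe query, and the timeout are assumptions.

```python
import urllib.error
import urllib.request
from urllib.parse import urlencode


def classify_status(status):
    """Map an HTTP status code (or None for no response) to a category."""
    if status is None:
        return "unavailable"
    if 200 <= status < 300:
        return "working"
    if 300 <= status < 400:
        return "moved"
    return "error"


def probe_endpoint(endpoint_url, timeout=10):
    """Send a trivial ASK query and classify the outcome.

    urlopen follows redirects on its own, so 'moved' is only reported
    when a redirect cannot be followed automatically.
    """
    url = endpoint_url + "?" + urlencode({"query": "ASK { ?s ?p ?o }"})
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return classify_status(response.status)
    except urllib.error.HTTPError as err:  # must precede URLError
        return classify_status(err.code)
    except (urllib.error.URLError, OSError):
        return "unavailable"


# Classification alone needs no network access.
print([classify_status(code) for code in (200, 301, 404, 503, None)])
```

Running `probe_endpoint` on a schedule and publishing the results would let users anticipate outages, which is the practical point of the recommendation above.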
An assessment of the properties used inside a LOD KOS dataset is necessary because, even though the datasets are made available, it is not immediately obvious what classes and properties are involved and what links exist between datasets. Understanding the properties used can help us evaluate how suited a dataset may be for reuse in other contexts; it also allows users to better understand and integrate it in various applications. One measure of dataset quality when evaluating a data structure is whether it complies with W3C standards and whether independent specialized properties are used. Our assumption was that LOD KOS datasets would rely heavily on W3C recommended standard properties from SKOS, OWL, and RDFS. Yet the initial findings in 2015 alerted us to the fact that a noticeable number of independent specialized properties are used. These are represented by the orange-colored rows in the partial example from the 2015–16 sample (Figure 3). A similar property check was applied to the data collected in 2019 to assess changes over time. In Figure 4, the independent properties used in the datasets are colored in grey.
The findings reveal that the standard vocabularies from which properties are drawn are SKOS, Dublin Core and Dublin Core Terms, OWL, RDFS, the DBpedia ontology, FOAF, and Schema.org. SKOS was heavily used, especially among datasets tagged as thesauri. RDFS and OWL are increasingly being used. The majority of datasets have included independent specialized properties (as additional or as primary) to represent their structures, which could directly impact their interoperability and reusability. The situation is especially troublesome when there is no hint of how to query and use these properties through an endpoint. Even users who know the SPARQL query language must still know the internal properties and data structures in order to use the products.
We recommend dataset properties and structure information be more effectively and readily available. A SPARQL service should at least contain refined query examples to reveal the internal structures of the datasets. This will serve to make vocabulary contents reachable, as the usability and reusability of LOD KOS products is a major hurdle yet to be overcome.
A common and increasingly challenging issue for the full usage of LOD products is that end-users may have difficulty accessing and using the LOD datasets, since they might not have been trained to access data dumps or SPARQL endpoints. Therefore, although SPARQL gives users the opportunity to design unique queries, non-technical users may find themselves at a loss when they encounter endpoint implementations loaded with a default query and no other information. Among the available endpoints checked, fewer than a quarter of SPARQL endpoints would load with a default query in the query window. Table 3, for example, discloses that, in 2019, about 41% of them loaded with a default query; fewer than 20% provided query examples for users to explore the data; and only a few endpoints made more than three example queries available.
Table 3
Query examples available by year.
Year | # of datasets | Endpoint provided | Endpoint no longer available | # providing default query | # providing example queries | # providing more than 3 example queries |
---|---|---|---|---|---|---|
2019 | 1,459 | 149 | 74 | 66 | 26 | 9 |
2017 | 1,450 | 145 | 63 | 33 | 21 | 10 |
2016 | 1,251 | 135 | 29 | - | 16 | 6 |
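Readers who wish to derive shares from the counts in Table 3 can do so with a short calculation such as the one below. It is a sketch under one explicit assumption: the denominator is the number of endpoints provided in each year. The text does not state which denominator its quoted percentages use, so the values computed here need not match those figures exactly.

```python
# Counts transcribed from Table 3; None marks the cell shown as "-".
TABLE_3 = {
    2019: {"endpoints": 149, "default_query": 66,
           "example_queries": 26, "more_than_3": 9},
    2017: {"endpoints": 145, "default_query": 33,
           "example_queries": 21, "more_than_3": 10},
    2016: {"endpoints": 135, "default_query": None,
           "example_queries": 16, "more_than_3": 6},
}


def share(count, total):
    """Percentage of `total` represented by `count`, or None if unknown."""
    if count is None:
        return None
    return round(100 * count / total, 1)


for year, row in sorted(TABLE_3.items()):
    total = row["endpoints"]  # assumed denominator: endpoints provided
    print(
        year,
        "default query:", share(row["default_query"], total),
        "| example queries:", share(row["example_queries"], total),
        "| more than 3 examples:", share(row["more_than_3"], total),
    )
```

Whatever denominator is chosen, the pattern is the same: endpoints offering more than three example queries remain a small minority in every year.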
Figure 5 shows an ideal example for endpoint providers to emulate. In the UNESCO vocabularies SPARQL service, multiple query examples are provided such as those for obtaining lists of all concepts, microthesauri, or data values in certain languages.
Figure 5
User friendly SPARQL service providing multiple templates for obtaining data.

We highly recommend KOS producers adopt and adhere to best practices like this to enhance usability. Datasets with SPARQL endpoints should provide query examples or forms and templates to enable the easy creation of queries allowing users to interact with the data. This could be accomplished through a showcase of example queries which loads in the query window and is adjustable to user needs (e.g. Wikidata query service
Although such user-friendly products are rare, they illustrate how LOD KOS datasets can be potentially useful to researchers and eventually become knowledge bases (i.e. not just published RDF triple stores of value vocabularies). This will be the primary strategy that will enable LOD KOS to be Transformable (to be discussed in section 4.3).
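The query templates recommended above can be as simple as parameterized query strings. The sketch below uses Python's standard `string.Template`; the template names, parameters, and the concept IRI are hypothetical, and the queries assume a dataset modeled with SKOS. A real service would also need to validate user-supplied values before substituting them into a query.

```python
from string import Template

# Hypothetical templates assuming a SKOS-modeled vocabulary.
QUERY_TEMPLATES = {
    "all_concepts": Template(
        "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n"
        "SELECT ?concept ?label WHERE {\n"
        "  ?concept a skos:Concept ;\n"
        "           skos:prefLabel ?label .\n"
        '  FILTER (lang(?label) = "$lang")\n'
        "}\n"
        "LIMIT $limit"
    ),
    "narrower_concepts": Template(
        "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n"
        "SELECT ?narrower WHERE {\n"
        "  ?narrower skos:broader <$concept_iri> .\n"
        "}"
    ),
}


def render(template_name, **params):
    """Fill a named template; raises KeyError if a parameter is missing."""
    return QUERY_TEMPLATES[template_name].substitute(**params)


print(render("all_concepts", lang="fr", limit=10))
print(render("narrower_concepts",
             concept_iri="http://example.org/concept/123"))
```

Loading such a rendered query into the endpoint's query window gives non-technical users a working starting point that they can adjust.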
In summary, this section on Functional presents four major criteria and addresses the importance of delivering the vocabulary in consumable formats and ensuring that a product is accessible through a persistently operational SPARQL endpoint. At a minimum, a KOS SPARQL endpoint should contain refined query examples that reveal the dataset properties and internal structures. Usability and user friendliness can be enhanced by providing default or refined query examples that enable users to explore the data structure and contents of the KOS. When a KOS is Functional, it can extend its Impacts and potential Transformable usages.
A KOS vocabulary requires a tremendous investment. How can we measure whether the investment is worthwhile and, additionally, maximize its impact? The following section illustrates the best practices found among well-known KOS vocabularies.
The first recommended approach is to expose the LOD KOS vocabulary through terminology services such as vocabulary registries and repositories. The direct result of this action would be an increase of visibility. In our “KOS in the Semantic Web” study, major vocabulary registries and services are listed and explained, including (a) vocabulary registries and (b) vocabulary repositories/portals (Zeng, 2018).
Vocabulary
Vocabulary
The impact of a KOS can be measured through its usage by data providers in two categories: (a) used as a primary value vocabulary and (b) used in semantic enrichment processes.
In the 21st century, the number of users of, and the variety of applications for, KOS as primary value vocabularies by data providers have increased in comparison to the original usages by cataloging, indexing, and abstracting services in the 20th century. LOD KOS vocabularies have become a fundamental component of the LOD building blocks because they enable datasets to become 4- and 5-star Open Data, which depend on KOS value vocabularies as the sources of URIs/IRIs. Individual KOS services may accumulate daily site statistics (visits, downloads, and sharing). KOS vocabularies served by BioPortal are shown with statistics of visits and a list of the projects using this vocabulary (see example about
A noticeable new usage of LOD KOS is related to the semantic enrichment process. Enriching metadata has been used to improve data quality while providing more contextual and multilingual information. Metadata enrichment from select KOS vocabularies is now an integral part of Europeana and its data providers’ strategy to enrich millions of data values related to concepts, places, and agents.
Figure 6
Vocabularies used by Europeana for semantic enrichment.

Figure 7
Query templates for ULAN and TGN (portion).

Institutions that use name authority data to semantically enrich their digital collections can easily embed the identifiers within individual datasets. By using specific properties of established schemas, for example,
To achieve the semantic interoperability of existing KOS vocabularies, activities establishing relationships between the contents of one vocabulary and those of another have seen increased engagement. Mapping is a common process of establishing relationships between the concepts of one vocabulary and those of another. A top-down or centralized full vocabulary mapping could be initiated by one source vocabulary (e.g. AGROVOC) and mapped to other target vocabularies. Alignments require interoperability in syntax & structure. The levels of mapping might be clearly defined based on SKOS, such as the
The bottom-up alignment product Mix’n’match
Another approach distinct from vocabulary-based mapping is value-based mapping. A similar volunteer-contributed outcome is the “Authority Control” section in Wikipedia pages for agents, places, works (distinct intellectual or artistic creation), and historical events where identifiers of a THING are provided (e.g. for Leonardo da Vinci
Cases of KOS shown and discussed at professional conferences and in academic publications provide another way to disseminate, discover, and measure the impact of a LOD KOS. Established methods such as content analysis and bibliometrics would be appropriate for studying their impact. (Since those methodologies are mature, we will not explain them here in detail.) Notable professional conferences include the NKOS workshops
In summary, the discussions about the Impactful metric in this section reveal ways to measure the new impacts of KOS vocabularies brought by their advanced LOD products in the 21st century. Any vocabulary can be exposed through terminology services following the FAIR criteria, used as a primary value vocabulary as well as in semantic enrichment processes, mapped with other KOS vocabularies (whole or part), and aligned with entry-level entities of Wikimedia. Research projects and usages shown or discussed at professional conferences and in publications provide evidence of the impacts. All of these will help to maximize the impact of a LOD KOS vocabulary, which may in turn affect decisions about investing in the vocabulary itself as an open resource.
During the last decade, encouragingly, a handful of LOD KOS products have extended the functionality of original KOS resources through publishing into LOD. Among the transformable approaches, we would like to highlight the great potential when LOD KOS datasets become knowledge bases (i.e. rather than existing solely as published RDF triple stores). A LOD KOS is Transformable when it extends its functionality and impact through innovative adaptations.
LOD brings effective new features to KOS vocabularies, enabling components to be derived from the original datasets in a few seconds. For example, from the UNESCO Thesaurus, dozens of micro-thesauri (e.g. “Geography and oceanography”, “Culture”, “Religion”, “Social policy and welfare”, “International relations”, “Finance and trade”) can be obtained quickly. A micro-thesaurus is a designated subset of a thesaurus that is capable of functioning as a complete thesaurus (ISO 25964-2:2013). Their individual concepts can also be obtained in other languages.
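The idea of deriving a self-contained subset from a larger vocabulary can be illustrated with a toy example. The sketch below collects a root concept and everything beneath it by following `skos:broader` links in a small in-memory list of triples. All concept names are hypothetical, and the approach is an assumption for illustration: a real service such as the UNESCO Thesaurus may define its micro-thesauri differently, for example through explicit group membership rather than hierarchy.

```python
SKOS_BROADER = "skos:broader"

# Hypothetical triples (subject, predicate, object) in prefixed form.
TRIPLES = [
    ("ex:Oceanography", SKOS_BROADER, "ex:EarthSciences"),
    ("ex:Geography", SKOS_BROADER, "ex:EarthSciences"),
    ("ex:MarineBiology", SKOS_BROADER, "ex:Oceanography"),
    ("ex:Religion", SKOS_BROADER, "ex:Culture"),
    ("ex:Oceanography", "skos:prefLabel", "Oceanography"),
]


def derive_subset(triples, root):
    """Return the root concept plus all concepts reachable beneath it."""
    # Invert skos:broader so each concept maps to its narrower concepts.
    narrower = {}
    for subject, predicate, obj in triples:
        if predicate == SKOS_BROADER:
            narrower.setdefault(obj, []).append(subject)

    subset, stack = set(), [root]
    while stack:
        concept = stack.pop()
        if concept not in subset:  # guards against cycles
            subset.add(concept)
            stack.extend(narrower.get(concept, []))
    return subset


print(sorted(derive_subset(TRIPLES, "ex:EarthSciences")))
```

Against a live endpoint, the same traversal could be expressed as a single query, which is why such derivations take only seconds once the vocabulary is published as LOD.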
Fully benefiting from the original faceted and hierarchical structures of a KOS vocabulary, a LOD KOS gives users the autonomy to determine what structure and information is desired and can be reproduced. One case worth sharing is the Art and Architecture Thesaurus (AAT). AAT’s Linked Data SPARQL endpoint
Another illustration of this point is the Global Agricultural Concept Scheme (GACS)
These cases demonstrate the benefits of giving users the autonomy to determine what and how they will use the data provided, which acts as incentive to reproduce it in unique applications.
The above cases also apply to T3, as KOS are being extended to fit the diverse needs of language, culture, domain, and structure. This concept is not new for KOS, since several have been internationally adopted and used worldwide in the 20th century. For KOS to be appropriately adopted, reused, and reproduced in these contexts, the provenance data of the whole KOS or parts are essential for its quality and trustworthiness. Such data can be very well documented and used in the LOD version. Commonly used properties such as Example: Example: in the RDF/XML raw data of
As we consider these cases, especially the last example, we can further ask the question: how can LOD KOS products become something beyond a value vocabulary? Next, we will outline our final T: Supports innovative and transformative uses beyond normal “value vocabularies.”
Through various case studies, we found that the newest and most important function of KOS datasets is to serve as “knowledge bases,” beyond being ordinary “value vocabularies” (Zeng & Mayr, 2018). With the advancement of the RDF model, the graph data model has come to be considered one of the most flexible formal data structures. Among the knowledge bases, “knowledge graphs” have become an increasingly widely used concept and label.
A Knowledge Graph (KG) is a graph-theoretic knowledge representation that (at its simplest) models entities and attribute values as nodes, and relationships and attributes as labeled, directed edges (Kejriwal, Sequeda, & Lopez, 2019). Prior to coinage of the term “Knowledge Graph”, proponents of the Semantic Web pressed for the use of graph-theoretic models, pattern-matching query languages, graph data management and use of publicly available KGs like DBpedia, GeoNames and Wikipedia for information retrieval as well as knowledge acquisition and alignment (Kejriwal, Sequeda, & Lopez, 2019). Among the many benefits of knowledge graphs, one of the most noticeable is the potential for discovery of hidden knowledge. The contextual information which enables this discovery is provided by the RDF triples which follow the Linked Data principles and are embedded in trustable value vocabularies and property vocabularies.
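The definition above, with nodes connected by labeled, directed edges, and the notion of discovering hidden connections can be illustrated with a small sketch. The entities and edge labels below are hypothetical, and breadth-first search is used here simply as one straightforward way to surface an indirect link between two nodes.

```python
from collections import deque

# Hypothetical labeled, directed edges: (source node, label, target node).
EDGES = [
    ("ex:ProteinA", "ex:encodedBy", "ex:GeneA"),
    ("ex:GeneA", "ex:associatedWith", "ex:DiseaseX"),
    ("ex:ProteinA", "ex:foundIn", "ex:TaxonT"),
    ("ex:ProteinA", "rdfs:label", "Protein A"),  # attribute value as a node
]


def find_path(edges, start, goal):
    """Breadth-first search for a shortest chain of edges from start to goal.

    Returns the list of edges along the path, or None if none exists.
    """
    outgoing = {}
    for source, label, target in edges:
        outgoing.setdefault(source, []).append((label, target))

    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for label, target in outgoing.get(node, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, path + [(node, label, target)]))
    return None


# No single edge links the protein to the disease, but a two-step path does.
for edge in find_path(EDGES, "ex:ProteinA", "ex:DiseaseX"):
    print(edge)
```

The indirect connection only becomes visible because both statements share the node `ex:GeneA`, which is the role that shared URIs/IRIs from trusted value vocabularies play in real knowledge graphs.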
Consider the SPARQL query examples provided by UniProt (Universal Protein Resource). One of them is “Select all bacterial taxa and their scientific name from the UniProt taxonomy,” which is clearly a function that a value vocabulary provides. UniProt also provides over 20 other query examples, such as “Select the preferred gene name and disease annotation of all human UniProt entries that are known to be involved in a disease” and “Select all triples that relate to the taxon that describes Examples provided by UniProt
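To make the contrast between these two kinds of queries concrete, the following Python sketch imitates SPARQL-style pattern matching over a tiny in-memory set of triples. It is not UniProt's data model or endpoint: the `ex:` identifiers, the property names, and the helper functions `match_pattern` and `query` are all assumptions made for illustration, and the taxonomy is deliberately simplified.

```python
# Toy triples with invented identifiers; not actual UniProt data.
TRIPLES = [
    ("ex:bacteria", "rdf:type", "ex:Taxon"),
    ("ex:bacteria", "ex:scientificName", "Bacteria"),
    ("ex:ecoli", "rdf:type", "ex:Taxon"),
    ("ex:ecoli", "ex:scientificName", "Escherichia coli"),
    ("ex:ecoli", "ex:subClassOf", "ex:bacteria"),
    ("ex:human", "rdf:type", "ex:Taxon"),
    ("ex:human", "ex:scientificName", "Homo sapiens"),
    ("ex:human", "ex:subClassOf", "ex:eukaryota"),
    ("ex:proteinP", "ex:organism", "ex:human"),
    ("ex:proteinP", "ex:preferredGeneName", "GENE1"),
    ("ex:proteinP", "ex:diseaseAnnotation", "Example disease"),
]


def is_var(term):
    """Terms starting with '?' are variables, as in SPARQL."""
    return term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Extend binding so pattern matches triple; return None on a conflict."""
    extended = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            if extended.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return extended


def query(triples, patterns):
    """Evaluate a list of triple patterns as a conjunctive (join) query."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


# Value-vocabulary style lookup: bacterial taxa and their scientific names.
bacterial_taxa = query(TRIPLES, [
    ("?taxon", "rdf:type", "ex:Taxon"),
    ("?taxon", "ex:subClassOf", "ex:bacteria"),
    ("?taxon", "ex:scientificName", "?name"),
])

# Knowledge-base style join: gene names and disease annotations of human entries.
human_disease_genes = query(TRIPLES, [
    ("?entry", "ex:organism", "ex:human"),
    ("?entry", "ex:preferredGeneName", "?gene"),
    ("?entry", "ex:diseaseAnnotation", "?disease"),
])

print(bacterial_taxa)
print(human_disease_genes)
```

The first query stays inside the taxonomy, which is what a value vocabulary offers on its own. The second joins the taxonomy with entry-level annotations, which is the kind of use that treats the same data as a knowledge base.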
The important roles of KOS in the creation of knowledge graphs have been emphasized in the past two years, as knowledge graph development has been considered a major strategy for corporations including Google, Apple, Amazon, Alphabet, Microsoft Corp, Facebook, and more. The Microsoft Academic Knowledge Graph (MAKG) set has over eight billion triples with information about scientific publications and related entities as of 2018–11
In summary, the discussion of the Transformable dimension in this section implies great potential for KOS vocabularies to extend their functionality and impact through innovative adaptations. Allowing special KOS products to be derived from the original data, giving users autonomy in reproduction, and enabling extensions to fit diverse needs are the major transformable approaches. More importantly, the innovative use of the originally constructed, high-quality, contextualized data entries enables LOD KOS to generate large or specialized knowledge graphs, which function as knowledge bases, and to become foundations of semantic analysis and entity extraction. They consequently become the building blocks of a framework for research in the humanities and sciences. LOD KOS products are thus transformed beyond being just “value vocabularies.”
The motivation for this research was to encourage the production of more LOD KOS products. It addresses the major issues and challenges encountered with LOD KOS and offers suggestions for improving their quality and the impact of their contribution to the Semantic Web. We aim to share the best-practice approaches identified through our multiple years of investigation and to present a set of recommended metrics. By using these metrics, any assessment performed on LOD products can lead to actions addressing any or all identified issues, case-by-case, from the top down or the bottom up, collectively or independently.
In conclusion, without FAIR principles, FIT metrics have no foundation. Therefore, as an open dataset, a LOD KOS should be Findable, Accessible, Interoperable, and Reusable, and should also implement these additional recommendations for KOS as FAIR datasets:
Findable recommendation: Enrich metadata as much as possible to enable data discovery processes.
Accessible recommendation: Provide multiple pathways for access to the data.
Interoperable recommendation: Utilize the KOS types vocabulary to standardize the way vocabulary types are categorized and thereby support mapping and interoperability.
Reusable recommendation: Adequately supply license and provenance metadata to enable dataset reusability.
As a value vocabulary, a LOD KOS should be Functional, Impactful, and Transformable, as outlined in Table 4.
FIT – Metrics for LOD KOS (as value vocabularies).
Functional [The vocabulary is...] | Impactful [The vocabulary...] | Transformable [The vocabulary...] |
---|---|---|
The vocabulary is delivered in consumable formats | Exposed through terminology services | Allows special KOS products to be derived from the original data |
Provided SPARQL endpoints are operational | Used by data providers as a primary value vocabulary in semantic enrichment | The user is given autonomy to determine what structure and information is desired and can be reproduced from the vocabulary |
Dataset properties and structures are informed effectively | Mapped with other KOS vocabularies | Enables extensibility to fit diverse needs |
Services are user-friendly, making vocabulary contents reachable | Shown/discussed at professional conferences and publications | Supports innovative and transformative uses beyond normal “value vocabularies” |
The metrics of FAIR and FIT for LOD KOS need to be further tested and aligned with the best practices and international standards of both open data and various types of KOS. Discussions with the KOS community and further enhancement of these metrics will be ongoing.
Number of SPARQL endpoints provided (data collected in 2016, 2017, and 2019 from the Datahub).
Search type of KOS/dataset (2016) | # found (2016) | # with SPARQL endpoints (2016) | Search type of KOS/dataset (2017) | # found (2017) | # with SPARQL endpoints (2017) | Search type of KOS/dataset (2019) | # found (2019) | # with SPARQL endpoints (2019) |
---|---|---|---|---|---|---|---|---|
Thesaurus | 67 | 39 | Thesaurus | 40 | Thesaurus | 80 | 41 | |
Classification | 458 | 29 | Classification | 476 | 31 | Classification | 478 | 31 |
Taxonomy | 26 | 8 | Taxonomy | 35 | 8 | Taxonomy | 37 | 10 |
Terminology | 35 | 7 | Terminology | 39 | 8 | Terminology | 39 | 8 |
List | 665 | 52 | List | 821 | 58 | List | 825 | 59 |
Available serialization formats of KOS datasets (sorted by the data collected in 2019).
Format | 2016 | 2017 | 2019 |
---|---|---|---|
JSON | 54 | 42 | 74 |
HTML | 47 | 37 | 71 |
XML | 55 | 42 | 69 |
TSV | 44 | 30 | 63 |
RDF+XML | 40 | 30 | 61 |
DEFAULT/AUTO | 37 | 27 | 51 |
TURTLE | 30 | 26 | 39 |
CSV | 34 | 20 | 39 |
N-TRIPLES | 26 | 18 | 36 |
JAVASCRIPT | 23 | 11 | 31 |
SPREADSHEET | 22 | 3 | 30 |
PLAIN/TEXT | 20 | 21 | 28 |
QUERY STRUCTURE | 15 | 15 | 23 |
SERIALIZED PHP | 15 | 15 | 22 |
JSON-LD | 3 | 1 |
Query examples available by year.
Year | # of datasets | Endpoint provided | Endpoint no longer available | # providing default query | # providing example queries | # providing more than 3 example queries |
---|---|---|---|---|---|---|
2019 | 1,459 | 149 | 74 | 66 | 26 | 9 |
2017 | 1,450 | 145 | 63 | 33 | 21 | 10 |
2016 | 1,251 | 135 | 29 | - | 16 | 6 |