Maintaining the accessibility of collections for future generations is a central mission of libraries and other memory institutions. Long-term accessibility of data collections depends on the longevity of their metadata. Metadata longevity, however, faces several difficulties, including the consistent maintenance of metadata, the maintenance of metadata vocabularies and metadata terms, the structural and syntactic features of metadata, and metadata description rules. This paper focuses on the consistent maintenance of metadata vocabularies and metadata terms, because changes to the definition of a metadata term are not always recorded appropriately. The definition of a metadata term may include the meaning and usage of the term, its relationships to other terms, human-readable labels, and so forth. Metadata terms are usually defined as a set, which is called a metadata vocabulary. This paper proposes a metadata model designed to keep track of changes to the definitions of metadata terms and metadata vocabularies.
In digital preservation standards, e.g. Open Archival Information System (OAIS)
The goal of this paper is to propose a model for formal provenance description of metadata vocabularies that keeps track of primitive changes to their terms. The classified primitive change types can be applied to terms expressing either properties or classes of resources, i.e. to both property vocabularies and value vocabularies.
The rest of this paper is organized as follows. Section 2 clarifies the meanings of Term and Term Definition in this paper. Section 3 presents the requirements of provenance description of metadata vocabularies for metadata maintenance. Section 4 summarizes the related literature on metadata registry services and the representation of changes. Section 5 applies W3C PROV to the provenance description of metadata vocabularies. Section 6 provides a detailed description of the proposed model. Concluding remarks are given in Section 7.
In the library community, commonly used metadata vocabularies are controlled vocabularies and metadata element sets (Hyland et al., 2013; Isaac et al., 2011), e.g. subject headings, authority files, Resource Description and Access (RDA) See See LCSH introduction at
To propose a general provenance description model for tracking primitive changes of metadata terms in metadata vocabularies, this study defines “Term” and “Term Definition” as follows.
The word provenance comes from the French verb “provenir.” Provenance means the source, history, or derivation of an object, which can be a work, data, etc. The provenance of a piece of data is the process that led to that piece of data in a computer system (Moreau, 2010). According to the W3C Provenance Working Group, provenance is a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data or a thing (Moreau et al., 2013). Provenance is used for many purposes, e.g. judging whether information can be trusted and reproducing how something was generated (Gil et al., 2013).
Metadata vocabularies have to be maintained to keep metadata terms consistently interpretable. The definition of a metadata term may change, e.g. through renaming of the term, revision of its meaning, or revision of its relationships to other terms. It is therefore crucial to trace changes of metadata terms in metadata vocabularies. For long-term maintenance, the provenance of a metadata vocabulary is primarily the series of activities that have taken place on the vocabulary and its terms. This paper proposes a model for describing the provenance of metadata vocabularies based on W3C PROV. We classified entities and activities based on the relations defined in W3C PROV to describe primitive changes of metadata terms in metadata vocabularies. The recorded entities and activities are traceable and provide evidence for change tracking, which helps, e.g., to prevent misinterpretation and to audit inconsistencies in metadata vocabularies. These benefits are valuable for the long-term maintenance of metadata vocabularies throughout their life cycle.
Provenance of metadata vocabularies is a record that describes the agents, activities, and entities involved in the lifecycle of metadata vocabularies. Provenance of metadata vocabularies includes information about how metadata terms in a metadata vocabulary and its term definitions come to a specific state. The definitions of metadata terms can change over time. For instance, a term can be split into two related terms, or the semantic relationship between two terms can change over time. Those who are responsible for maintaining metadata vocabularies need to pay attention to the changes and also document the changes.
Groth et al. (2012) illustrated requirements of provenance on the Web. The requirements cover many dimensions, e.g. activities, records of changes, derivation, and interoperability, and they address both the content of provenance and its use. However, they are not directly oriented to metadata maintenance. Keeping track of the provenance of metadata vocabularies benefits their consistent maintenance. Provenance description of metadata vocabularies should be recorded in a machine-readable, traceable, and interoperable form to support effective checking of inconsistencies caused by changes.
The SPARQL Protocol and RDF Query Language (SPARQL) is a query language and protocol for RDF. Please see the details at
This section discusses related work on two aspects that are closely related to this study: metadata registries and the representation of changes.
The reuse of existing metadata terms is essential for improving metadata interoperability. Metadata registries play an important role in collecting and sharing metadata vocabularies to achieve metadata interoperability. Although metadata interoperability is an important aspect of the long-term maintenance of metadata, a metadata registry does not ensure metadata longevity. A metadata registry typically provides registration, management, storage, and sharing of metadata element sets, controlled vocabularies, and application profiles. For example, Open Metadata Registry (OMR)
OMR also provides vocabulary owners and managers with versioning and change tracking for their registered vocabularies. The time of a change, the action taken, and the vocabulary maintainer who made the change are accessible on the OMR history page. RDA vocabularies (element sets and value vocabularies) are maintained in the RDA Registry, which is based on OMR in combination with Git and GitHub. The RDA Registry supports semantic versioning of RDA vocabularies, and the version designations follow the general principles of semantic versioning. GitHub provides the list of changes in released RDA vocabularies in natural language, e.g. lists of “Adds new RDA entities,” “Adds new RDA elements,” “Adds new constrained RDA elements,” “Deprecates published RDA elements,” “Adds value vocabularies,” and “Renames value vocabularies” (Phipps, Dunsire, & Hillmann, 2015). However, these records of changes to RDA vocabularies are not kept in a form that machines can interpret over time.
Javed, Abgaz, and Pahl (2014) proposed a layered change log model that records ontology changes using an RDF triple-based representation. The changes are recorded using their own change metadata ontology together with existing Provenance Vocabulary Core Ontology terms. Chawuthai et al. (2016) presented a logical model named Linked Taxonomic Knowledge (LTK) and the LTK Ontology for preserving and representing changes in taxonomic knowledge for linked data. Changes in conception or in the relationship between taxa are preserved as events along with aspects of time, provenance, causes, and effects. SemVersion, a tool supporting version management of RDF vocabularies, has also been developed (Kendall et al., 2008). SemVersion provides structural and semantic versioning for RDF models and RDF-based ontology languages such as RDFS (Völkel & Groza, 2006).
Changeset vocabulary defines a set of terms (e.g. Addition, ChangeReason, and Removal) to describe changes between two versions of a resource description by using two sets of triples, i.e. additions and removals (Tunnicliffe & Davis, 2009). Changeset vocabulary represents changes to resource descriptions using RDF reification. An update is represented by a set of statements about statements and whether they are added or removed (Meinhardt, 2015). Changeset vocabulary is used by LCSH to describe the information of “Change Notes” of subject headings. The document-centric approved list of new headings and revisions to existing headings in LCSH are available on the Acquisitions and Bibliographic Access Web page
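The additions-and-removals idea behind the Changeset vocabulary can be illustrated without any RDF library by treating each version of a resource description as a set of triples. The following is a minimal sketch under stated assumptions: the `ex:` identifiers and the labels are hypothetical, and the code only computes the two triple sets; it does not reproduce the reified RDF statements that the Changeset vocabulary itself produces.

```python
from typing import FrozenSet, Iterable, NamedTuple, Tuple

# A triple is modelled as (subject, predicate, object) strings.
Triple = Tuple[str, str, str]


class ChangeSet(NamedTuple):
    """Two sets of triples describing the change between two versions."""
    additions: FrozenSet[Triple]
    removals: FrozenSet[Triple]


def diff_descriptions(old: Iterable[Triple], new: Iterable[Triple]) -> ChangeSet:
    """Compare two versions of a resource description."""
    old_set, new_set = frozenset(old), frozenset(new)
    return ChangeSet(additions=new_set - old_set, removals=old_set - new_set)


# Hypothetical resource description before and after a label change.
old_description = [
    ("ex:heading1", "skos:prefLabel", "Old label"),
    ("ex:heading1", "skos:broader", "ex:heading0"),
]
new_description = [
    ("ex:heading1", "skos:prefLabel", "New label"),
    ("ex:heading1", "skos:broader", "ex:heading0"),
]

change = diff_descriptions(old_description, new_description)
for triple in sorted(change.removals):
    print("removed:", triple)
for triple in sorted(change.additions):
    print("added:  ", triple)
```

A diff of this kind shows which statements differ between two versions, but not which activity caused the difference or why.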
The W3C PROV standard for provenance description and provenance interchange was developed by the W3C Provenance Working Group in 2013. The data model defined by W3C PROV, i.e. PROV-DM, has been used to encode the revision history of wiki pages (Missier & Chen, 2013). The Getty Thesaurus of Geographic Names adopts W3C PROV to describe the revision history of geographic names. There, W3C PROV is used to document information about the Activity that revised a geographic name, e.g. the Activity type (Create, Modify) and the temporal information associated with the Activity. Given the extensibility of W3C PROV, this paper selects W3C PROV to record, as provenance in RDF, how metadata vocabularies change.
The W3C PROV standard includes a set of specifications covering many aspects of provenance, e.g. modeling, serialization, exchange, access, validation, semantics, and reasoning (Moreau et al., 2015). PROV-DM defines a conceptual data model along with relations to describe general provenance. PROV-O defines an OWL ontology consisting of a set of classes and properties for mapping PROV-DM to RDF. W3C PROV is intended for general provenance description and allows application to specific domains.
This paper applies W3C PROV to describe the provenance of metadata vocabularies. The main reason is that W3C PROV is a Web-oriented standard for provenance description and provenance interchange. Entities and Activities are important components for describing provenance in PROV-DM. An Entity is a physical, digital, conceptual, or other kind of thing (Gil et al., 2013). An Activity is something that occurs over a period of time and acts upon or with Entities (Moreau et al., 2013). An Activity can be used to represent how an Entity comes into existence, and how its attributes change to become a new Entity (Gil et al., 2013). To describe the provenance of metadata vocabularies based on W3C PROV, it is necessary to classify the Entities and Activities associated with changes among different versions of a metadata vocabulary. In other words, W3C PROV is used to describe the provenance of metadata vocabularies by defining which Entities have been changed and how the changes are caused by a series of Activities.
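The interplay of Entities and Activities can be sketched as plain data structures that emit provenance triples. This is an illustrative sketch rather than the model proposed in this paper: the `ex:` URIs are hypothetical, only the PROV-O properties `prov:used`, `prov:wasGeneratedBy`, and `prov:wasDerivedFrom` are taken from the standard, and the code assumes that every generated Entity is derived from every used Entity, which is reasonable for a simple revision but does not hold for Activities in general.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Triple = Tuple[str, str, str]


@dataclass(frozen=True)
class Entity:
    """A PROV Entity, e.g. one version of a metadata vocabulary."""
    uri: str


@dataclass
class Activity:
    """A PROV Activity that uses some Entities and generates new ones."""
    uri: str
    used: List[Entity] = field(default_factory=list)
    generated: List[Entity] = field(default_factory=list)

    def to_triples(self) -> List[Triple]:
        triples: List[Triple] = []
        for source in self.used:
            triples.append((self.uri, "prov:used", source.uri))
        for result in self.generated:
            triples.append((result.uri, "prov:wasGeneratedBy", self.uri))
            # Assumption of this sketch: each result derives from each source.
            for source in self.used:
                triples.append((result.uri, "prov:wasDerivedFrom", source.uri))
        return triples


# Hypothetical URIs for two versions of a vocabulary and one revision.
vocabulary_v1 = Entity("ex:vocabulary/v1")
vocabulary_v2 = Entity("ex:vocabulary/v2")
revision = Activity(
    "ex:activity/revision-1",
    used=[vocabulary_v1],
    generated=[vocabulary_v2],
)

for triple in revision.to_triples():
    print(triple)
```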
To describe the provenance of metadata vocabularies, Activities acting on the previously classified Entities are categorized into four types: Revision, Addition, Deletion, and Replacement. Table 1 shows which of the classified Activities apply to which of the classified Entities; the mark “○” means “applicable” and “×” means “not applicable.” Table 2 lists the classified Activities with their names and definitions. Note that the replacement of a term covers several cases: a composite term is split into more than one term, more than one term is merged into a single term, or a term is replaced by another term. Table 3 provides the change types of metadata vocabularies and their terms with specific examples, which are mainly taken from the changes between the BIBFRAME 2.0 vocabulary (BIBFRAME 2.0 vocabulary list view, 2016) and the BIBFRAME 1.0 vocabulary (BIBFRAME 2.0 specifications notes, 2016). In this paper, the separation of a single term into two or more terms is called a split. An example of a split in a subject heading is given in Table 3.
Table 1. Activities acted on Entities for provenance of metadata vocabularies.

| Subtypes of PROV Entity | Revision | Addition | Deletion | Replacement |
|---|---|---|---|---|
| Vocabulary | ○ | × | × | × |
| Term | ○ | ○ | ○ | ○ |
| Term Definition | ○ | ○ | ○ | ○ |

(The four right-hand columns are subtypes of PROV Activity.)

Table 2. Definitions of the classified Activities for provenance of metadata vocabularies.

| Activity name | Definition |
|---|---|
| RevisionOnVocabulary | The revision of the contents or information of a metadata vocabulary |
| RevisionOnTerm | The revision of a term of the metadata vocabulary |
| AdditionOnTerm | The addition of a term |
| DeletionOnTerm | The deletion of a term |
| ReplacementOnTerm | The replacement of term(s) by other term(s) |
| RevisionOnTermDefinition | The revision of a term definition |
| AdditionOnTermDefinition | The addition of a term definition |
| DeletionOnTermDefinition | The deletion of a term definition |
| ReplacementOnTermDefinition | The replacement of a term definition by another term definition |

Table 3. Primitive change types of metadata vocabularies and their terms with examples.

| Change type | Example |
|---|---|
| Revision of a Vocabulary | BIBFRAME 1.0 vocabulary is revised to BIBFRAME 2.0 vocabulary. |
| Revision of a Term | |
| Addition of a Term | Class bf:Note is newly defined in BIBFRAME 2.0 vocabulary. |
| Deletion of a Term | Property bf:otherEditionOf that was defined in BIBFRAME 1.0 vocabulary is deleted in BIBFRAME 2.0 vocabulary. |
| Replacement of a Term | Property bf:credits in BIBFRAME 2.0 vocabulary essentially replaces bf:creditsNote in BIBFRAME 1.0 vocabulary; Subject heading “Folklore, Negro” is split into “Folklore, African” and “Folklore, Afro-American.” |
| Revision of a Term Definition | |
| Addition of a Term Definition | The inverse property to property bf:absorbed is added in BIBFRAME 2.0 vocabulary. |
| Deletion of a Term Definition | The definitions of property bf:otherEditionOf that was defined in BIBFRAME 1.0 vocabulary is deleted in BIBFRAME 2.0 vocabulary. |
| Replacement of a Term Definition | The expected value of property bf:copyrightRegistration is corrected in BIBFRAME 2.0 vocabulary. |
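The correspondence in Table 1 and the naming pattern in Table 2 can be expressed compactly in code. The sketch below is only an illustration of the classification: the Python names (`EntityType`, `ActivityType`, `APPLICABLE`, `activity_name`) are assumptions introduced here, while the Entity types, Activity types, and resulting Activity names follow the two tables.

```python
from enum import Enum
from typing import Dict, List, Set


class EntityType(Enum):
    VOCABULARY = "Vocabulary"
    TERM = "Term"
    TERM_DEFINITION = "TermDefinition"


class ActivityType(Enum):
    REVISION = "Revision"
    ADDITION = "Addition"
    DELETION = "Deletion"
    REPLACEMENT = "Replacement"


# Table 1: which Activity types are applicable to which Entity types.
APPLICABLE: Dict[EntityType, Set[ActivityType]] = {
    EntityType.VOCABULARY: {ActivityType.REVISION},
    EntityType.TERM: set(ActivityType),
    EntityType.TERM_DEFINITION: set(ActivityType),
}


def activity_name(activity: ActivityType, entity: EntityType) -> str:
    """Build an Activity name in the pattern of Table 2."""
    if activity not in APPLICABLE[entity]:
        raise ValueError(f"{activity.value} is not applicable to {entity.value}")
    return f"{activity.value}On{entity.value}"


def all_activity_names() -> List[str]:
    """List every applicable combination, in the order used in Table 2."""
    return [
        activity_name(activity, entity)
        for entity in EntityType
        for activity in ActivityType
        if activity in APPLICABLE[entity]
    ]


for name in all_activity_names():
    print(name)
```

Rejecting inapplicable combinations, such as an Addition on a Vocabulary, keeps recorded provenance within the nine Activity classes of Table 2.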
A revision of a vocabulary is caused by a revision of its terms. The revision of a term may be a revision of the term as an instance, or a revision of documentation of the term. For example, replacement of a single term by a set of terms is a revision of an instance, and replacement of a title text is a revision of term definition. Therefore, the relationships between the classified Activities are as follows. A
The relations between Entities and Activities defined in W3C PROV include
Figure 1(a) provides provenance description in RDF graphs defined for the example of term replacement in Table 3: Subject heading “Folklore, Negro” is split into “Folklore, African” and “Folklore, Afro-American” (Knowlton, 2005). The classes and properties with prefix “mv” are defined in this research. The property
This paper assumes the following URIs to describe the headings: “Folklore, Negro” with “
This paper identifies the thesaurus Entity before the split by URI “
The goal of this paper is to define a model for provenance description of metadata vocabularies based on W3C PROV and RDF. To achieve this, we defined primitive change types of metadata vocabularies and their metadata terms, as shown in Tables 1, 2, and 3. Following the proposed model, the provenance of metadata vocabularies and their metadata terms can be recorded in RDF, which is machine-readable and traceable using SPARQL. Keeping the change history of metadata vocabularies traceable by machines is important for keeping large volumes of metadata consistently interpretable.
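The kind of tracing that such a record enables can be sketched with an in-memory traversal. This is a stand-in for a SPARQL query over an RDF store, not the query used in this study: the `ex:` URIs are hypothetical, the standard PROV-O property `prov:wasDerivedFrom` is used in place of the properties defined in this research, and the function name `successors` is introduced here for illustration.

```python
from collections import deque
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical provenance triples for the subject heading split in Table 3.
PROVENANCE: List[Triple] = [
    ("ex:FolkloreAfrican", "prov:wasDerivedFrom", "ex:FolkloreNegro"),
    ("ex:FolkloreAfroAmerican", "prov:wasDerivedFrom", "ex:FolkloreNegro"),
]


def successors(term: str, triples: Iterable[Triple]) -> List[str]:
    """Return every term derived, directly or transitively, from `term`."""
    triples = list(triples)
    found: List[str] = []
    seen = set()
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for subject, predicate, obj in triples:
            is_derivation = predicate == "prov:wasDerivedFrom" and obj == current
            if is_derivation and subject not in seen:
                seen.add(subject)  # guards against cycles in the data
                found.append(subject)
                queue.append(subject)
    return found


# Which headings should metadata using the old heading be checked against?
print(successors("ex:FolkloreNegro", PROVENANCE))
```

With the triples in an RDF store, a SPARQL property path over the same property could answer the same question.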
The proposed model can describe the revision history of metadata terms. As shown in Figure 1(a), the subject heading “Folklore, Negro” (before the split) connects with “Folklore, African” and “Folklore, Afro-American” (after the split) through property
Figure 2 defines the RDF model for the provenance description of a metadata term corresponding to the meaning revision example of term “soundContent.” We use the URI “
Not only the provenance description of metadata vocabularies but also the provenance description of structural features of metadata is crucial for the long-term maintenance of metadata. Related to this paper, our previous papers present models for provenance description of metadata schemas (Li & Sugimoto, 2014; Li, Nagamori, & Sugimoto, 2015). The practical use of metadata provenance and the development of services that facilitate long-term maintenance of metadata are left for future research.
Provenance tracking is an important issue for the long-term maintenance of metadata vocabularies. Evidence of such provenance of metadata vocabularies enables consistent maintenance of metadata vocabularies. This paper proposes a model to formally describe provenance of metadata vocabularies, especially how metadata terms and term definitions (e.g. meaning and usage) change over time.
In this paper, the W3C PROV standard for general provenance description is applied to describe provenance of metadata vocabularies. We classified primitive change types of metadata terms in metadata vocabularies with specific examples. This study proposes a general model for provenance description of metadata vocabularies to track the primitive changes of metadata terms between different versions of a metadata vocabulary, e.g. split and merge of metadata terms and revision of meaning of metadata terms.