The fundamental standard of the Semantic Web is the Resource Description Framework (RDF) (Cyganiak, Wood, & Lanthaler, 2014), a machine-understandable data modeling language built upon the notion of statements about resources in the triple form <subject, property, object> (SPO), in which the subject represents the resource being described, the property denotes the predicate, and the object contains either the value associated with the property for the given subject or a simple value (called a literal). RDF is designed to relate resources at hierarchical levels in a graph-based representation, and is now widely used in Semantic Web applications for managing large knowledge bases and data ontologies. With the exponential growth of information, the Web has evolved from a network of linked documents to one where both documents and data are linked; the advent of Linked Data (Bizer, Heath, & Berners-Lee, 2009) and its underlying technologies makes data reuse and federation possible, and RDF data in particular seems promising for this process. At the same time, the complexity of data processing is increasingly challenging: data consumers often deal with heterogeneous triple data or even non-RDF data, e.g. internal or open data in relational databases or CSV, Excel, JSON, and XML files. Tools or services that support multiple-source data processing and cover the whole life-cycle from data extraction and integration, through storage and querying, to applications are therefore needed.
There are numerous tools for Linked Data preparation. RDF generators that support extracting RDF data from relational databases (RDB) include R2RML Parser
Suppose a data consumer wants to build an RDF repository that integrates information from various non-RDF and RDF sources, or to merge RDF data in different formats from multiple repositories, with little or even no coding; a packaged and multifunctional RDF data processing method would be much easier to use and more effective. None of the tools or services presented so far supports such comprehensive and convenient RDF data processing. Therefore, we developed the RDFAdaptor, a set of Extract-Transform-Load (ETL) plugins for RDF data processing.
The rest of this paper is structured as follows: Section 2 gives an overview of existing literature of significance for the study area and summarizes the current problems. Section 3 outlines the overall design of the plugin architecture and gives implementation details. Section 4 is devoted to describing the main functionalities of the plugin set and its application scenarios. In Section 5 we describe some cases/experiments with the RDFAdaptor and discuss the experiences and lessons learnt. Finally, Section 6 concludes the paper and provides directions for future work.
With the publication and exponential growth of RDF data, maintaining data processing and management tasks of increasing complexity is challenging. Many researchers and data solution providers have attempted to fulfill the required outcomes of the Semantic Web. In this section, a brief overview of both types of existing frameworks for RDF data transformation, processing, and management is given.
Extract-Transform-Load (ETL) is the common paradigm by which data from multiple systems is combined into a target database, data store, or data warehouse, dealing with data warehouse homogeneity, cleaning, and loading problems. ETL arises from the fact that modern business data resides in various locations and in many incompatible formats. It is a key process for bringing all the data together in a standard, homogeneous environment; managing and processing such huge collections of interconnected data costs 55% of the total data warehouse runtime (Inmon, 1997). There exists a multitude of open-source ETL frameworks/tools, most of them non-RDF frameworks, for example Apache Camel
Pentaho Data Integration
Non-RDF ETL frameworks are the key means of performing ETL on relational databases or other data formats (e.g. CSV/Excel, XML files), but they are not able to process RDF data, such as exchanging RDF data among systems, extracting RDF data from external SPARQL endpoints, transforming RDF data from/to other formats, or loading RDF data to external SPARQL endpoints. As RDF data gains traction, proper support for its processing and management is more important than ever. The Linked Data Integration Framework (LDIF) (Schultz et al., 2011) is an open-source Linked Data integration framework aiming at Web data transformation with a predefined set of
RDB (relational database) is the most traditional and popular storage model and performs well in almost all application scenarios. However, it only guarantees the syntax and structure of the stored data regardless of semantic meaning, and offers little support for data exchange and sharing among Web users (Elmasri & Navathe, 2003). Moreover, storing RDF data in relational databases, namely extending relational databases to support semantic data, poses serious limitations on querying and raises all kinds of issues, e.g. it results in self-joins, and the openness of the RDF data model makes it challenging to define a fixed relational schema (Heese & Znamirowski, 2011). To preserve the re-usability of published relational data and maximize the benefit of the Semantic Web, there have been a number of efforts to transform data from RDB schemas to RDF schemas (RDB-to-RDF) (Stefanova & Risch, 2013), such as direct mapping or domain-semantics-driven mapping. The approaches can be broadly classified into four categories:
- Ontology matching: concepts and relations are extracted from the relational schema or data by using data mining, and then mapped to an established ontology or specific database schema (Shvaiko & Euzenat, 2013).
- Direct mapping: in this W3C-recommended approach, tables of the RDB are mapped to classes defined by an RDFS vocabulary, and attributes are mapped to RDF properties (Kyzirakos et al., 2018) (Arenas et al., 2012); a minimal sketch of this idea follows the list.
- Mapping language: this covers cases of low similarity between the database and the target RDF graph, as exemplified by R2RML (Das, Sundara, & Cyganiak, 2012), which enables users to express the desired transformation by following a chosen structure or vocabulary.
- Search engine-based: the transformation process is based on SPARQL queries over search engines capable of supporting large collections of concurrent queries (Roshdy, Fadel, & EIYamany, 2013).
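To make the direct-mapping idea concrete, the following minimal Java sketch (the table name, column names, JDBC URL, and base IRI are hypothetical, and real mappers such as R2RML-based tools handle many more cases) reads rows from a relational table and emits one RDF resource per row using RDF4J.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.eclipse.rdf4j.model.IRI;
import org.eclipse.rdf4j.model.ValueFactory;
import org.eclipse.rdf4j.model.impl.SimpleValueFactory;
import org.eclipse.rdf4j.model.util.ModelBuilder;
import org.eclipse.rdf4j.model.vocabulary.FOAF;
import org.eclipse.rdf4j.model.vocabulary.RDF;
import org.eclipse.rdf4j.rio.RDFFormat;
import org.eclipse.rdf4j.rio.Rio;

public class DirectMappingSketch {
    public static void main(String[] args) throws Exception {
        ValueFactory vf = SimpleValueFactory.getInstance();
        // Hypothetical base IRI; in direct mapping the table name usually becomes part of the IRI.
        String base = "http://example.org/person/";
        ModelBuilder builder = new ModelBuilder().setNamespace("foaf", FOAF.NAMESPACE);

        // Hypothetical JDBC connection and table 'person(id, name)'.
        try (Connection con = DriverManager.getConnection("jdbc:mysql://localhost/demo", "user", "pwd");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT id, name FROM person")) {
            while (rs.next()) {
                IRI subject = vf.createIRI(base + rs.getLong("id"));             // row    -> RDF resource
                builder.subject(subject)
                       .add(RDF.TYPE, FOAF.PERSON)                               // table  -> class
                       .add(FOAF.NAME, vf.createLiteral(rs.getString("name")));  // column -> property
            }
        }
        // Serialize the mapped statements as Turtle.
        Rio.write(builder.build(), System.out, RDFFormat.TURTLE);
    }
}
```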
There are various RDF storage solutions, such as triple stores, vertically partitioned stores, row stores, or column stores. For efficient query processing in semantics-oriented environments, plenty of RDF stores which interpret the predicate as a connection between subject and object have been developed, for example Apache Jena TDB, Amazon Neptune
MarkLogic
GraphDB
Virtuoso
Blazegraph
It is important to note that RDF stores and the associated query languages, SPARQL Query (Harris & Seaborne, 2013) and SPARQL Update (Gearon, Passant, & Polleres, 2013), offer the means to query and update large amounts of RDF data. SPARQL is the standard query language for RDF data.
With RDF data continuing to evolve in technology and scale, maintaining RDF data processing is becoming more complicated. Instead of requiring data wranglers to select different tools and configure them with custom scripts, an integrated solution that provides standard maintenance and easy-to-use interfaces for multi-scenario applications is needed to ensure that data processing tasks are executed regularly and efficiently.
The RDFAdaptor is conceived as a set of Kettle ETL plugins focused on RDF data processing, which allows users to transform, merge, transfer, update, and share linked data, and also aims to take maximum advantage of Kettle's ETL ecosystem. Figure 1 illustrates the overall framework of the RDFAdaptor plugins.
The RDFAdaptor data access connector is built on top of Kettle and RDF stores: Kettle's built-in interfaces provide access to multiple non-RDF data sources, while the SPARQL 1.1 Query Language, supported by most available RDF databases, serves as a common communication protocol for indexing RDF data. The RDFAdaptor set reuses the open-source Java framework RDF4J
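For readers unfamiliar with RDF4J, the following minimal sketch (the endpoint URL and query are placeholders, not taken from RDFAdaptor's code) shows the repository abstraction the plugins build on: a SPARQL 1.1 endpoint is wrapped as an RDF4J repository and queried through a connection.

```java
import org.eclipse.rdf4j.query.BindingSet;
import org.eclipse.rdf4j.query.TupleQuery;
import org.eclipse.rdf4j.query.TupleQueryResult;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.sparql.SPARQLRepository;

public class Rdf4jQuerySketch {
    public static void main(String[] args) {
        // Hypothetical endpoint; any store exposing SPARQL 1.1 can be addressed this way.
        SPARQLRepository repo = new SPARQLRepository("http://localhost:8080/sparql");
        repo.init();
        try (RepositoryConnection conn = repo.getConnection()) {
            TupleQuery query = conn.prepareTupleQuery(
                    "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10");
            try (TupleQueryResult result = query.evaluate()) {
                while (result.hasNext()) {
                    BindingSet row = result.next();
                    System.out.println(row.getValue("s") + " " + row.getValue("p") + " " + row.getValue("o"));
                }
            }
        } finally {
            repo.shutDown();
        }
    }
}
```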
There are four types of specific plugins, distinguished by how they combine RDF4J with the SPARQL 1.1 protocol, as well as by their intended purposes:
Taking user-friendliness and efficiency into account, RDFAdaptor provides a variety of one-stop configuration templates, created in the graphical user interface for all plugins, and is thus kept clean and straightforward; only SparqlUpdate requires users to do a little programming in JavaScript. More details about the usage follow in Section 4.
Given the time-consuming nature of data collection and the complexity of data management, we implemented a hybrid data storage mode which supports both local/remote file systems and RDF stores as selectable output targets. The middleware RDF4J enables direct access to Virtuoso, GraphDB, and MarkLogic, while Blazegraph is accessed via its RESTful API.
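A rough sketch of this hybrid storage idea follows; the endpoint URLs and the file name are invented for illustration, and stores not reachable through plain SPARQL 1.1 would need their native RDF4J repository implementations instead. The same statements are written both to a local Turtle file and to a remote store.

```java
import java.io.FileOutputStream;
import java.io.OutputStream;

import org.eclipse.rdf4j.model.Model;
import org.eclipse.rdf4j.model.util.ModelBuilder;
import org.eclipse.rdf4j.model.vocabulary.FOAF;
import org.eclipse.rdf4j.model.vocabulary.RDF;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.sparql.SPARQLRepository;
import org.eclipse.rdf4j.rio.RDFFormat;
import org.eclipse.rdf4j.rio.Rio;

public class HybridStorageSketch {
    public static void main(String[] args) throws Exception {
        // A tiny example model; contents are illustrative only.
        Model model = new ModelBuilder()
                .setNamespace("foaf", FOAF.NAMESPACE)
                .subject("http://example.org/person/1")
                .add(RDF.TYPE, FOAF.PERSON)
                .add(FOAF.NAME, "Alice")
                .build();

        // Target 1: local file system, serialized as Turtle.
        try (OutputStream out = new FileOutputStream("data.ttl")) {
            Rio.write(model, out, RDFFormat.TURTLE);
        }

        // Target 2: an RDF store reachable via SPARQL 1.1 (query and update endpoints are hypothetical).
        SPARQLRepository store = new SPARQLRepository(
                "http://localhost:7200/repositories/demo",
                "http://localhost:7200/repositories/demo/statements");
        store.init();
        try (RepositoryConnection conn = store.getConnection()) {
            conn.add(model); // pushed to the store as an update
        } finally {
            store.shutDown();
        }
    }
}
```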
The RDFAdaptor inherits plenty of functions from Pentaho Data Integration and has most of the features required to process RDF data in a highly intuitive way. From a user standpoint, RDFAdaptor is free of installation and easy to use. It has been successfully deployed and used in a Windows environment with the Java Development Kit (JDK) installed, and provides a redeveloped interactive user interface (Spoon) to add and start tasks. Figure 3 illustrates an excerpt of the frontend interface screen where a data processing task is designed. A task is composed of three main parts:
It can further be seen that the user interface additionally includes modules such as
The RDFAdaptor enables a range of use scenarios, including: (1) transformation from non-RDF data to linked data (RDF), (2) RDF data format conversion, (3) linked data migration between triple stores, (4) RDF data update and management.
The RDFAdaptor supports a wide variety of data access methods and makes possible complex and reliable data transformations into linked data (RDF), i.e. the shift from internal or open data in relational databases, or in structured files such as Excel, CSV, XML, and JSON, to RDF data.
For RDF data generation, RDFAdaptor provides the following features:
- Support for multiple mainstream non-RDF input formats, inherited from Kettle's strong capabilities
- Visualized, dynamic, and advanced RDF schema mapping
- One-stop execution that saves the effort of programming or script implementation
- Repeatable data transformation: configure once and run everywhere
- Efficient parallel processing with multithreaded operations
As depicted in Subsection 3.2, a task of RDF data generation can be deployed by adding the corresponding steps (configuration templates). To show how such a task works, an RDB2RDF application scenario is given. Using a link to an external relational database (table) as input, with field information selected by an SQL query, we use the plugin
Table 1. Parameters defined in the RDF generation plugin.
Group | Parameter | Description
---|---|---
Namespace | Prefix | collections of names identified by URI references
 | Namespace | different prefixes depending on the required namespaces
Mapping Setting | Subject URI | HTTP URI template for the subject/resource; a placeholder {sid} is replaced by the UniqueKey
 | Class Types | the classes to which the resource belongs, supporting multiple class types (separated by semicolons), such as skos:Concept; foaf:Person
 | UniqueKey | the unique and stable primary key of the resource, part of the Subject URI
 | Fields Mapping Parameters | a list of field mappings from the selected data source to the target RDF schema, including the input Stream Field, Predicates, Object URIs, Multi-Value Separator, Data Type, and Lang Tag
Dataset Metadata | Meta Subject URI | URI pattern of the generated dataset
 | Meta Class Types | the classes to which the resource belongs
 | Parameters | a list of descriptions of the generated dataset, including Property Type, Predicates, Object Values, Data Type, and Lang Tag
Output Setting | File system setting | options for file system storage, including Filename and RDF format
 | RDF store setting | options for the RDF store, including triple store name, Server URL, Repository ID, Username (if any), Password, and Graph URI
The template contains four main components: Namespace, Mapping Setting, Dataset Metadata, and Output Setting.
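To make the mapping parameters concrete, the short sketch below (all values are invented and would normally be entered through the plugin's configuration template, not hard-coded) shows how a Subject URI pattern with a {sid} placeholder, a class type, and a mapped field with a language tag translate into triples.

```java
import org.eclipse.rdf4j.model.IRI;
import org.eclipse.rdf4j.model.Model;
import org.eclipse.rdf4j.model.ValueFactory;
import org.eclipse.rdf4j.model.impl.SimpleValueFactory;
import org.eclipse.rdf4j.model.util.ModelBuilder;
import org.eclipse.rdf4j.model.vocabulary.RDF;
import org.eclipse.rdf4j.model.vocabulary.SKOS;

public class MappingTemplateSketch {
    public static void main(String[] args) {
        ValueFactory vf = SimpleValueFactory.getInstance();

        // Hypothetical template values as they might appear in the configuration.
        String subjectUriTemplate = "http://example.org/concept/{sid}"; // Subject URI with {sid} placeholder
        String uniqueKey = "C12345";                                    // UniqueKey from the input row
        String prefLabel = "linked data";                               // value of a mapped input field

        // {sid} is replaced by the UniqueKey to mint the resource IRI.
        IRI subject = vf.createIRI(subjectUriTemplate.replace("{sid}", uniqueKey));

        Model model = new ModelBuilder()
                .setNamespace("skos", SKOS.NAMESPACE)
                .subject(subject)
                .add(RDF.TYPE, SKOS.CONCEPT)                             // Class Types, e.g. skos:Concept
                .add(SKOS.PREF_LABEL, vf.createLiteral(prefLabel, "en")) // field mapping with a Lang Tag
                .build();

        model.forEach(System.out::println);
    }
}
```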
The constant expansion of RDF data across a wealth of syntaxes leads to problems of limited interoperability between tools/systems and of dealing with the various syntactic variants. Because of the different characteristics of RDF data formats, data wranglers or Semantic Web developers choose among them according to personal taste, practical scenario, or intended purpose. To give an example, the most prominent syntaxes for RDF nowadays are RDF/XML, Turtle, and JSON. While RDF/XML is preferably used for information exchange or large-scale datasets, Turtle is an abbreviated syntax more suitable for human readability, and JSON (or JSON-LD) is designed for embedding RDF in HTML markup and reducing the cost of RDF consumption by Web applications.
Therefore, RDF data translation or conversion between different serialization formats is quite valuable. The task can be developed with the plugin
The data source can be a local file system, a remote URL (SPARQL endpoint), or a string stream generated by the previous step in the task. The input format is determined by manual selection, on the premise that the correct data source is supplied; otherwise the automatic detection of the document format will fail with an error log. Table 2 depicts the parameter details defined in
Table 2. Parameters defined in the RDF format conversion plugin.
Group | Parameter | Description
---|---|---
Input | Source | RDF triples to be converted or loaded
 | Source Type | data source, such as local file system, remote URL, or string stream
 | Source RDF Format | format of the input RDF data; the common RDF formats are fully supported
 | Large Input Triples | a selector indicating whether the input data is large; if it is, the output step cannot count, merge, or split the triples
Advance | BaseIRI | resolve against a base IRI if the RDF data contains relative IRIs
 | BNode | a selector for preserving BNode IDs
 | Verify URI syntax / relative URIs / language tags / datatypes | selectors for URI syntax, relative URI, language tag, and datatype checks, which return a fail log when corresponding errors occur
 | Language tags / Datatype | selectors for language tag / datatype handling, including failing the parse if languages / datatypes are not recognised and normalizing recognised language tag / datatype values
Output | Target RDF Format | RDF format of the converted output
 | Commit or Split Size | number of RDF triples written to each output RDF file or submitted to the store per batch; the default value 0 means all the input data is processed at once
 | Local File Setting | options for file system storage, including three selectors ("Save to File System", "Keep Source FileName", and "Merge to Single File", which takes precedence over "Commit or Split Size"), file name, and location
 | TripleStore Setting | options for the RDF store, including a selector "Save to Store", Triple Store, Server URL, Database/Repository ID/Namespace (identifier of the database, differing per triple store), UserName, Password, and Graph URI
 | Stream Setting | option for a string stream for further data transfer, including a selector "Save to Stream" and the Result Field
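As a rough illustration of what such a conversion step does under the hood (file names and formats here are arbitrary examples, not the plugin's actual implementation), RDF4J's Rio parser/writer pair can turn one serialization into another. For very large inputs a streaming handler chain would presumably be used instead of an in-memory model, which matches the "Large Input Triples" selector disabling counting, merging, and splitting.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

import org.eclipse.rdf4j.model.Model;
import org.eclipse.rdf4j.rio.RDFFormat;
import org.eclipse.rdf4j.rio.Rio;

public class FormatConversionSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical input/output files; the plugin resolves these from its Input/Output settings.
        try (InputStream in = new FileInputStream("input.rdf");
             OutputStream out = new FileOutputStream("output.ttl")) {
            // Parse RDF/XML into an in-memory model (the base IRI resolves relative IRIs, cf. "BaseIRI").
            Model model = Rio.parse(in, "http://example.org/base/", RDFFormat.RDFXML);
            // Write the same statements back out as Turtle ("Target RDF Format").
            Rio.write(model, out, RDFFormat.TURTLE);
        }
    }
}
```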
Table 3. Parameters defined in the SPARQL query (data migration) plugin.
Group | Parameter | Description
---|---|---
SPARQL Setting | Accept URL from field | checkbox; if checked, the URL of the SPARQL endpoint comes from Kettle's previous steps and the value is taken from "URL field name"
 | URL field name | only used, as a drop-down list of input fields, when the option "Accept URL from field" is selected
 | SPARQL Endpoint URL | endpoint URL to be queried when "Accept URL from field" is disabled
 | Query Type | query type, with two options: Graph query or Tuple query
 | SPARQL Query | SPARQL query form: SELECT or CONSTRUCT
 | Limit | limit on the data size to be processed, if necessary
 | Offset | the starting position of data processing
Output Setting | Result Field Name | field specified for file saving
 | RDF Format | target local data format: JSON, XML, CSV, or TSV for a SELECT query, and an RDF format only for a CONSTRUCT query
 | Max Rows | maximum size of the output file; empty or 0 means all triples are fetched
Http Auth | HTTP UserID | user ID of the SPARQL endpoint, if any
 | HTTP Password | password of the SPARQL endpoint, if a UserID exists
Another important feature of the plugin is that it supports loading the incoming triples, at the same time, to RDF files, to triple stores, and to streams for Kettle's next step. All RDF formats are supported, as are commercial (MarkLogic, GraphDB) and open-source (Virtuoso, Blazegraph) triple stores. It also supports loading the same triples to different file locations and to different kinds of triple stores at one time.
The maintenance of linked data across different RDF repositories is of increasing importance in the Semantic Web, especially RDF data migration, which makes it feasible to manage large-scale and distributed RDF datasets. Given that plenty of knowledge bases are open on the Web, linked data migration using SPARQL queries is one of the important means of organizing and managing massive RDF data. The plugin
It should be noted that the user needs to add a text-file-like output step, in which more output parameters (file name, target storage path, parent folder, compression or not, etc.) can be set.
It should be noted that the parameter "Max Rows" enables users to obtain all the data of a triple store in batches when it is left at the default value of 0.
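A minimal sketch of this batched export pattern is given below, with the endpoint URL and batch size chosen purely for illustration: CONSTRUCT queries are issued with an increasing OFFSET until fewer results than the batch size are returned, and each batch is appended to a local Turtle file. (Without an ORDER BY clause such paging is only reliable if the endpoint returns results in a stable order.)

```java
import java.io.FileOutputStream;
import java.io.OutputStream;

import org.eclipse.rdf4j.query.GraphQuery;
import org.eclipse.rdf4j.query.GraphQueryResult;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.sparql.SPARQLRepository;
import org.eclipse.rdf4j.rio.RDFFormat;
import org.eclipse.rdf4j.rio.RDFWriter;
import org.eclipse.rdf4j.rio.Rio;

public class EndpointDumpSketch {
    public static void main(String[] args) throws Exception {
        SPARQLRepository source = new SPARQLRepository("http://example.org/sparql"); // hypothetical endpoint
        source.init();
        int batchSize = 10000; // illustrative batch size, analogous to the Limit/Offset parameters
        try (RepositoryConnection conn = source.getConnection();
             OutputStream out = new FileOutputStream("dump.ttl")) {
            RDFWriter writer = Rio.createWriter(RDFFormat.TURTLE, out);
            writer.startRDF();
            long offset = 0;
            while (true) {
                GraphQuery query = conn.prepareGraphQuery(
                        "CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o } LIMIT " + batchSize + " OFFSET " + offset);
                long count = 0;
                try (GraphQueryResult result = query.evaluate()) {
                    while (result.hasNext()) {
                        writer.handleStatement(result.next()); // append this batch to the output file
                        count++;
                    }
                }
                if (count < batchSize) {
                    break; // last (possibly empty) batch reached
                }
                offset += batchSize;
            }
            writer.endRDF();
        } finally {
            source.shutDown();
        }
    }
}
```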
New data usually arrives regularly, so an executable and repeatable data transformation is needed for a scalable and lower-cost data process. As mentioned previously in Subsection 2.2, an RDF store is also a kind of graph store which works as a mutable container of RDF graphs. Similar to RDF datasets operated on by the SPARQL 1.1 Query Language, RDF graphs can be updated by using SPARQL 1.1 Update as a graph update language. Based on the plugin SparqlUpdate, such graph update tasks can be configured and executed repeatedly.
Table 4 describes in detail the configuration parameters of an RDF data update task, and Figure 5(b) shows a demo screenshot. Unlike the other three plugins, where the user only needs to fill out the configuration templates, SparqlUpdate additionally requires a little JavaScript programming to construct the SPARQL update query (see the "SPARQL Update Query" parameter in Table 4).
Table 4. Parameters defined in SparqlUpdate.
Group | Parameter | Description
---|---|---
SPARQL Setting | Query Endpoint Url From Field? | checkbox; if checked, the URL of the SPARQL query endpoint comes from Kettle's previous steps and the value is taken from "Query Endpoint Url Field"
 | Query Endpoint Url Field | only used, as a drop-down list of input fields, when the option "Query Endpoint Url From Field" is selected
 | Query Endpoint Url | the value used when "Query Endpoint Url From Field" is unchecked
 | Update Endpoint Url From Field? | checkbox; if checked, the URL of the SPARQL update endpoint comes from Kettle's previous steps and the value is taken from "Update Endpoint Url Field"
 | Update Endpoint Url Field | only used, as a drop-down list of input fields, when the option "Update Endpoint Url From Field" is selected
 | Update Endpoint Url | the value used when "Update Endpoint Url From Field" is unchecked
 | Query From Field? | checkbox; if checked, the SPARQL update query comes from Kettle's previous steps and the value is taken from "Query Field Name"
 | Query Field Name | only used when the option "Query From Field" is selected
 | Base URI | resolve against a base IRI if the RDF data contains relative IRIs
 | SPARQL Update Query | JavaScript programming for the graph update; only used when the option "Query From Field" is disabled
Output Setting | Result Field Name | field specified for file saving
Http Auth | HTTP UserID | user ID of the SPARQL endpoint, if any
 | HTTP Password | password of the SPARQL endpoint, if a UserID exists
Graph Update: operations that add or remove triples in graphs, i.e. INSERT DATA, DELETE DATA, DELETE/INSERT (including DELETE WHERE), LOAD, and CLEAR.
Graph Management: operations that create or manipulate entire graphs, i.e. CREATE, DROP, COPY, MOVE, and ADD.
An illustrative example snippet for updating a graph is shown below.
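The sketch that follows is illustrative only (the graph IRI, resource, and literal values are invented): it issues the kind of SPARQL 1.1 Update request the SparqlUpdate step sends, here executed through RDF4J against hypothetical query and update endpoints. In the plugin itself, the equivalent update string is assembled in the "SPARQL Update Query" JavaScript field rather than in Java.

```java
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.sparql.SPARQLRepository;

public class GraphUpdateSketch {
    public static void main(String[] args) {
        // Hypothetical query and update endpoints of the same store.
        SPARQLRepository repo = new SPARQLRepository(
                "http://example.org/sparql", "http://example.org/sparql/update");
        repo.init();
        try (RepositoryConnection conn = repo.getConnection()) {
            // Graph update: replace the dc:title of a resource inside a named graph.
            String update =
                    "PREFIX dc: <http://purl.org/dc/elements/1.1/> " +
                    "WITH <http://example.org/graph/books> " +
                    "DELETE { <http://example.org/book/1> dc:title ?old } " +
                    "INSERT { <http://example.org/book/1> dc:title \"New title\" } " +
                    "WHERE  { OPTIONAL { <http://example.org/book/1> dc:title ?old } }";
            conn.prepareUpdate(update).execute();
        } finally {
            repo.shutDown();
        }
    }
}
```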
The modules of these two types of update operations can also be used in a cascading combination to achieve more complex applications.
Based on the extensive application scenario design, the RDFAdaptor plugins have been deployed and tested successfully in several practical use cases involving linked data processing. In this section, we describe some experiments and projects on open linked data and data organization in the library and information field.
The National Science and Technology Library (NSTL) in China is an information service agency for science and technology literature which provides public services such as full-text access and knowledge discovery. Before NSTL releases its core datasets (data about journals, journal articles, books, conference papers, dissertations, scientific reports, patents, standards, metrology regulations, and other specific resources), a lot of preparatory data processing work is involved. With the Semantic Web opening up new opportunities for data mining and increasing content discoverability, NSTL moved to adopt linked data, a rich representational data model, and worked to build large knowledge repositories about real-world entities and concepts to support advanced knowledge service forms such as semantic retrieval. Traditionally, NSTL has mainly used relational databases to store structured data collected and processed from multiple sources, so the goal of NSTL is to get through the whole chain from data collection to data release in a low-cost and efficient way.
Currently, NSTL has successfully integrated the RDFAdaptor into its whole data processing workflow, including:
- ETL of multi-source heterogeneous data
- Semantization of the collected data
- Integration of various types of data
- SPARQL queries for internal data managers over graph repositories
With the graph repositories, NSTL is taking further steps in other usage scenarios, e.g. building the RDF data into knowledge graphs to serve information needs precisely, or using links to external repositories to contribute to the web of linked science data.
To evaluate the effectiveness of the plugins, a set of experiments on RDF data generation/translation and loading (to a triple store) was performed; Table 5 gives the final results.
Table 5. RDF data generation/translation and loading.
Data Source | Data Format | Number of Records | Number of Mapped Fields | Number of RDF Triples Generated | Total Time
---|---|---|---|---|---
MongoDB | JSON | 1,948,268 | 17 | 37,038,563 | 32min18s
SqlServer | RDB | 336,831 | 5 | 1,159,687 | 38.6s
 | | 798,389 | 9 | 7,521,876 | 5min4s
The table above shows that the plugins support the generation/translation and loading of RDF data with good performance, over 10,000 triples per second in our cases, which depends mainly on the reading speed of the data sources; SqlServer was significantly more efficient than MongoDB in our test cases. Moreover, the complexity of the data is inversely correlated with efficiency: the more fields there are, the slower the RDF is generated/translated and loaded.
In another common case, data wranglers may want to obtain RDF triples directly from remote SPARQL query endpoints, so we dumped all the triples from the AGROVOC SPARQL query endpoint to local Turtle files via a Kettle transformation using the
The experiments and the experience of using the plugins in these use cases confirm that RDFAdaptor can be used to complement other services or tools with missing functionalities in an efficient way. Using RDFAdaptor simplified the tasks of RDF data processing and, at the same time, provided us with valuable hints about the importance of adapting to the needs of existing data infrastructure. However, it is apparent that support for some of these functionality dimensions still has limitations and requires further improvement, e.g. manual configuration and editing, a costly process, in the plugin
In this paper we have presented the RDFAdaptor, which comes with four developer-friendly and out-of-the-box plugins for RDF data processing built on top of the prominent ETL tool Pentaho Data Integration and the RDF Java framework RDF4J. The RDFAdaptor is implemented with a Java backend in a semantics-based way and is able to manage data of different types and formats from heterogeneous sources or multiple repositories, and even from SPARQL endpoints (remote URLs). It is intended as a comprehensive solution to the problem of efficient multiple-source data processing, covering the whole life-cycle from extraction, transformation, and integration to storage, and embodies the following features:
- Support for multiple types of data sources, including non-RDF data (e.g. Excel, CSV, XML, and JSON files or relational databases) and RDF data.
- A hybrid data storage mode which supports both local file systems and prominent RDF stores as selectable output targets.
- A friendly and straightforward user interface, which sketches key components or modules such as steps and hops.
- A simple and efficient way to create and start complex data processing tasks by providing various configuration templates, which saves labor and cost.
- Multi-scenario applications, including RDF data generation, RDF data translation or conversion between different serialization formats, remote RDF data migration, and RDF graph update.
- Flexibility to complement other services or tools.
In addition to the features presented so far, the exemplary use cases/experiments introduced in Section 5 also reveal certain limitations of RDFAdaptor. As future work we plan to further improve the plugin functions in various aspects. This includes: error detection/checking (e.g. URI syntax, language tags, datatypes), encapsulation of the plugin