Recent advances in technology have led to the generation and use of data on a massive scale. Spatial data, too, is now freely available and easily accessible, creating opportunities for many types of users, and it is widely used by private and public organisations for planning and data analysis. Because spatial data cannot be handled with traditional data mining tools and techniques, the concepts of the spatial database and the spatial data warehouse have evolved, and spatial online analytical processing (SOLAP) tools provide an interactive environment for representing spatial data. Spatial metadata is data that describes spatial data. It was originally required by mapping organisations for the production and management of datasets; advances in technology and the internet have since extended the use of geospatial metadata to searching for spatial data, and it aids users in selecting relevant data from large datasets. Data used for map production, urban planning, landcover mapping and so on must be free from dirty data, that is, useless data in the dataset. However, manual cleaning of data using geographical information system (GIS) functions is not easy for end users, for the following reasons:
- It requires deep knowledge of the cleaning functions of the software, which is not provided during training.
- In the vector model, cleaned and uncleaned data give the same results; the lack of visual analysis does not enhance the user's confidence in using the data for any scientific problem.
- No summary of the cleaned data is available, so cleaning must be repeated on the same data by re-running all the manual steps, as the user is unaware of the deleted fields. A summary holds information about the cleaned fields and supports future analytical use of the data.
To handle these issues in data selection and analysis, we have designed a prototype that cleans data in a user-friendly environment by taking input from the user, and provides cleaning information that can be saved as CSV files and reused whenever the user needs to know about the quality of the data. In this paper, we design and develop this novel cleaning prototype.
Many eminent scholars have worked in this field, and their findings are valuable for this research. In particular, 22 studies were found to be highly relevant and gave direction to this work. Atluri and Chun (2004) state that geospatial data can be accessed by both nongovernmental organisations and government agencies; it can be obtained by downloading from geo-portals, by receiving it on secondary storage devices, and from business partners. Sheoran and Parmar (2020) used spatial data of Gurugram District for multicriteria analysis and decision making. Azri et al. (2014) concluded that metadata is essential for the discovery, assessment, access, understanding and standardisation of geospatial information; spatial metadata has therefore become an essential component of any standard data repository. Gaikwad et al. (2014) state that metadata is a vital part of data, describing a dataset's relevancy, characteristics, uniqueness, freshness, purpose and interoperability with other components.
Spatial data quality is affected by the quality of its sources (Jakobsson 2002). Lim (2010) concluded that reduced data quality leads to unexpected results. Error-free data of salient quality is a must, and to achieve this, data cleaning is highly necessary. When data are integrated from multiple sources, errors present at a single source propagate with them and accumulate in one place. Eldrandaly et al. (2019) stated that spatial data gathered from multiple sources contains anomalies and errors and cannot be considered fit for analysis, planning and decision-making; data cleaning is mandatory for data storage and information management. Zhao et al. (2019) proposed CLEAN, a spatial and temporal compression framework for computing and maintaining trajectory data. Zylshal (2020) performed visual and statistical analysis for topographic correction to reduce reflectance variability in mountainous regions. Bielecka and Burek (2019) compared and analysed research on the quality and uncertainty of spatial data. Data cleaning is performed on a dataset using data cleaning functions: data are first examined to detect errors and then cleaned using data cleaning techniques. Dirty data can take the form of missing values, duplicate values and extraneous attributes in a spatial dataset; it spoils the quality and suitability of the data and makes it unfit for a particular application or scientific data analysis problem. Errors enter the dataset at various stages of data collection, entry and storage.
The border detection algorithm was developed by Arturas Mazeika and Michael H. Böhlen in 2006. It cleans string data in two steps: in the first step, a cluster is formed around the string data by connecting the border and the centre of a hypersphere; in the second step, the strings in each cluster are cleansed against their cluster. The algorithm is simple and yields clean output for string data.
Smart token technique: this technique uses smart tokens to identify duplicate records and lowers the dependency of data cleaning on the match threshold.
Record linkage similarity measures algorithm: This technique is used to compare two relational tuples for their similarity.
Koshley and Halder (2015) proposed an abstraction-based data cleaning approach that yields instances of abstract domains. Many errors, such as typographical, measurement and data integration errors, enter the data and are harmful during decision making and analysis.
Kumar and Khosla (2018) showed the value of data cleaning by working on a pollution dataset: dirt, errors and noise present in data hamper its quality, and they performed a survey, analysis and visualisation of dirty, unstructured air-quality data. Li and Chen (2014) stated that the Geospatial Sensor Web performs resource access, query, discovery and resource visualisation. Parmar and Sheoran (2021) performed context-based cleaning on a population dataset using the record linkage technique. Ridzuan and Zainon (2019) state that data cleansing is a time-consuming and complex process, but that afterwards the quality of the data is enhanced and its verification and validation can be tested. Zou et al. (2018) reviewed the technology and applications of geospatial data for better understanding of large and complex datasets using visualisation platforms.
According to Boella et al. (2019), the Volunteered Geographic Information System collects and distributes user-generated content having geographic components. Keim (2002) classified the visualisation techniques into dense pixel display, iconic display, standard 2D/3D display, and interaction and distortion techniques.
Yoshizumi et al. (2020) define geospatial analysis as a tool or analysis specifically designed for geospatial data or applications. Visualisation techniques have immense potential to communicate geospatial data. Thiyagalingam and Getov (2006) organised and illustrated applications of metadata in a hierarchical fashion. Zhao and Huang (2010) found that determining the quality of online analytical processing (OLAP) metadata is difficult due to the structural nature of metadata.
Quality is a primary requirement for evaluating the value of any product; a product that is poor in quality cannot be considered worthy and is neglected. Data and information quality can be determined by analysing certain parameters. Spatial data is complex and requires information that explains its types and usage, and its quality is determined using its metadata. In terms of spatial data, quality refers to completeness, accuracy and consistency, as seen in Figure 1. These spatial data quality elements are consumed by various organisations for different applications. GIS users easily access and edit spatial data using Google Earth, Google Maps, GIS tools and social media sites. This enormous production and use of spatial data makes maintaining the quality of spatial datasets difficult. Data quality refers to the degree to which data meets a given set of objectives; data quality standards are assessed by mapping agencies and the private sector to produce good results. Data collected from different sources vary in displacement, orientation and resolution.
Data quality parameters.
GIS evolved in the 1960s, and after the 1970s the volume of spatial data grew rapidly due to satellite imagery. From the 1980s, spatial data quality became a topic of concern in the GIS community. Large-scale acceptance of GIS increased the number of digital spatial data users in different areas, and falling computer prices and the easy availability of spatial data in digital form shifted consumers from analogue to digital data; managing spatial data in digital form is much easier than managing it in analogue form. Satisfying quality requirements for spatial data was a big challenge in the geospatial community. Quality symbolises the fitness of data for use: quality spatial data can be used for a specific application because it is free from errors and produces the right results. The quality requirement differs between users: for data producers it relates to the inherent nature of the data, while for users it is fitness for use. The presence of errors in spatial data degrades its quality. Errors enter at different stages of data collection, entry and storage in databases, and are identified using error detection procedures to maintain and manage data quality. Removing all errors at once is difficult because spatial data is complex, but proper recognition and reduction of these errors is essential to maintain quality. Depending on the accuracy, completeness and consistency of the data, users take decisions in their planning and use.
Completeness refers to the presence in the dataset of all attributes found in the reference dataset; it counts omitted and committed objects. The method of evaluation determines the quality measure, which can be expressed as an integer, percentage, Boolean or ratio value. Commission and omission are the sub-elements of completeness.
Consistency in spatial data refers to adherence to the syntactic rules used to describe the schema of the database. It can be measured in percentage, ratio, integer and Boolean values.
Accuracy is an important parameter for judging the quality of spatial data. It is the difference between values in the dataset and the corresponding values in the reference dataset.
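To make the three parameters concrete, the sketch below (illustrative only, not part of the proposed prototype; the field name AREA and all function names are assumptions) computes them for a toy attribute table against a reference dataset:

```python
# Illustrative computation of completeness, consistency and accuracy
# for a small attribute table compared against a reference dataset.

def completeness(dataset_ids, reference_ids):
    """Percentage of reference objects present in the dataset (omission check)."""
    present = len(set(dataset_ids) & set(reference_ids))
    return 100.0 * present / len(reference_ids)

def consistency(records, rule):
    """Percentage of records that satisfy a syntactic schema rule."""
    valid = sum(1 for r in records if rule(r))
    return 100.0 * valid / len(records)

def accuracy_error(values, reference_values):
    """Mean absolute difference between dataset and reference values."""
    diffs = [abs(v - r) for v, r in zip(values, reference_values)]
    return sum(diffs) / len(diffs)

ref_ids = [1, 2, 3, 4]
data_ids = [1, 2, 4]                                   # object 3 is omitted
print(completeness(data_ids, ref_ids))                 # 75.0

records = [{"AREA": 10.5}, {"AREA": None}]
print(consistency(records, lambda r: r["AREA"] is not None))  # 50.0

print(accuracy_error([10.2, 4.9], [10.0, 5.0]))        # ≈ 0.15
```

The measures deliberately mirror the text: completeness and consistency come out as percentages, while accuracy is a deviation from the reference values.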
Spatial data management is addressed by a myriad of research papers in areas such as healthcare, city modelling, remote sensing, image classification and spatial game analytics. Spatial metadata is widely used by different applications without determining the quality of the data, and many errors propagate with the data during the various stages of collection. Before the data is applied, these errors need to be uncovered and removed using a data cleaning algorithm; dirty spatial data produces invalid results that hamper data analysis and visualisation. Manual cleaning of the attribute data of spatial data using GIS functions available in the QGIS software has been performed by Parmar and Sheoran (2021), but the following issues and concerns were found after manual cleaning:
Cleaning is highly important before making use of data. The data cleaning process involves the following stages:
- Deep data analysis: detailed analysis is required to identify the different types of errors, quality problems, inconsistencies and anomalies present in the data.
- Identification of the data transformation workflow: data integrated from different sources have different schemas, errors, dirtiness and heterogeneity; the degree of heterogeneity and dirtiness determines the mapping rules and transformation workflow needed for cleaning.
- Data testing: the transformation workflow and rules are verified to test their accuracy and effectiveness for data cleaning.
- Transformation: the transformation steps are executed and query operations are run to clean the data.
- Retreat of data: dirty data is replaced by clean data at the original source, to maintain quality there and to avoid repeating the same cleaning work.
Spatial data management is addressed by a myriad of research papers in areas such as healthcare, city modelling, remote sensing, image classification and spatial game analytics. However, untidy data gives wrong results; it needs cleaning for proper data analysis and correct decision making, and cannot be used to solve any of the above scientific problems. Data are massively produced, stored and accessed by millions of people around the globe, and advances in technology and the availability of resources make spatial data easy to access. Spatial data integrated from different sources is full of errors that reduce its quality; these errors need to be removed using a data cleaning method, and performing data cleansing on a dataset improves its quality. Spatial data quality refers to the accuracy, consistency, integrity and completeness of data.
Cleaning of spatial metadata is not possible for all types of users; only professional and expert users can perform contextual metadata cleaning. For proper analysis and visualisation, spatial data should be free from dirty data. In this research, we have therefore developed a cleaning prototype.
Vector data of Gurugram District in shapefile format was collected from the Society for Geoinformatics and Sustainable Development (SGSD) and added to QGIS. All the data used in the study are authentic, as they have been received from responsible and authorised agencies and used after verification.
The framework for the proposed model is depicted in Figure 2. First, the collected spatial data is added as a vector layer in the QGIS software. The spatial data layer is investigated for analysis and use. If the data contains dirty values, needs cleaning and requires quality information to be supplied for future use, then the designed cleaning prototype is executed.
Framework of the proposed cleaning prototype.
The algorithmic steps followed by the prototype are as follows.
Algorithm:
INPUT: Spatial data from different sources with dirty attribute data.
OUTPUT: Clean data with a summary of the cleaning information.
1. Select the geospatial data (vector data) layers.
2. Add the geospatial data (vector data) layers in the QGIS software.
3. Import the cleaning script into the Python console of QGIS.
4. For each vector data layer:
   a. Execute the cleaning script (auto cleaning is performed).
   b. Empty values of the required (area) field are calculated.
   c. Null values, duplicate values and 0 values are removed from the attribute table.
   d. Extraneous fields are searched for and deleted.
   e. The cleaning process completes and the cleaned layer is saved as a new layer.
   f. Metadata and a summary of the cleaned data are generated and displayed.
   g. The metadata and summary can be exported as CSV files for future reference.
The following steps are performed in the design of the automated cleaning process: the automatic cleaning step and the cleaning status message.
Once the user loads spatial data layers into QGIS, any layer can be selected to start the cleaning process. During execution of the code, the program evaluates various cases and decides whether to remove features or attributes automatically or to ask the user whether removal is necessary.
The proposed data cleaning process is illustrated in the following figure.
Data cleaning process using the proposed prototype.
Case 1: Removal of duplicate values
Executing the code, the program will select these features and it will delete them, because: they have a duplicated 0 BOUNDAR_ID value; they have a NULL Name value.
Duplicate values.
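The logic of this case can be sketched outside QGIS as plain Python over dictionary records (the field names mirror the paper's attribute table; the function name and key choice are illustrative assumptions):

```python
# Simplified sketch of Case 1: drop features whose key field values
# duplicate an earlier record, keeping only the first occurrence.
def remove_duplicates(records, key_fields=("BOUNDAR_ID",)):
    seen, cleaned = set(), []
    for rec in records:
        key = tuple(rec.get(f) for f in key_fields)
        if key in seen:
            continue  # duplicate: the prototype would delete this feature
        seen.add(key)
        cleaned.append(rec)
    return cleaned

rows = [{"BOUNDAR_ID": 0, "Name": None},
        {"BOUNDAR_ID": 0, "Name": None},   # duplicate of the first row
        {"BOUNDAR_ID": 7, "Name": "Ward 7"}]
print(remove_duplicates(rows))  # keeps one row per BOUNDAR_ID (0 and 7)
```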
Case 2: Removal of null values
Executing the code, the program will select these features (blue selection) and it will delete them, as shown in Figure 5, because:
They have the NULL BOUNDAR_ID value; They have the NULL AREA value.
Null values.
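A minimal sketch of this case, outside QGIS (field names from the paper's attribute table; the choice of required fields is an assumption):

```python
# Simplified sketch of Case 2: delete records with NULL (None) values
# in required fields, as the prototype does for BOUNDAR_ID and AREA.
def remove_nulls(records, required=("BOUNDAR_ID", "AREA")):
    return [r for r in records if all(r.get(f) is not None for f in required)]

rows = [{"BOUNDAR_ID": 1, "AREA": 12.5},
        {"BOUNDAR_ID": None, "AREA": 3.1},   # NULL BOUNDAR_ID: dropped
        {"BOUNDAR_ID": 2, "AREA": None}]     # NULL AREA: dropped
print(remove_nulls(rows))  # [{'BOUNDAR_ID': 1, 'AREA': 12.5}]
```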
Case 3: Removal of extraneous attributes
Executing the code, the program will select and delete the attributes (Unique_Id, Name) as shown in Figure 6, because they are empty.
Extraneous attributes.
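The idea behind this case can be sketched as follows (a simplified stand-in for the prototype's behaviour; treating both None and the empty string as "empty" is an assumption):

```python
# Simplified sketch of Case 3: an attribute (column) is extraneous when
# it is empty for every record; such columns are deleted from the table.
def drop_empty_attributes(records):
    fields = {f for r in records for f in r}
    empty = {f for f in fields
             if all(r.get(f) in (None, "") for r in records)}
    cleaned = [{f: v for f, v in r.items() if f not in empty} for r in records]
    return cleaned, empty

rows = [{"BOUNDAR_ID": 1, "Unique_Id": None, "Name": ""},
        {"BOUNDAR_ID": 2, "Unique_Id": None, "Name": None}]
cleaned, removed = drop_empty_attributes(rows)
print(sorted(removed))  # ['Name', 'Unique_Id']
```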
Case 4: Handling missing values in an important attribute
Executing the code, the program will select the AREA field and compute and fill its missing values, since AREA is a necessary attribute field and cannot be deleted, as shown in Figure 7.
Missing values in important attribute.
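Filling the missing value from the feature's geometry can be sketched as below. In QGIS the area would come from the layer geometry; here a plain shoelace formula on polygon vertices stands in for it, and the record layout is illustrative:

```python
# Simplified sketch of Case 4: compute and fill a missing AREA value
# from the feature's polygon geometry instead of deleting the field.
def polygon_area(vertices):
    """Shoelace formula for a simple polygon given as (x, y) vertex pairs."""
    s = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

feature = {"BOUNDAR_ID": 3, "AREA": None,
           "geometry": [(0, 0), (4, 0), (4, 3), (0, 3)]}  # a 4 x 3 rectangle
if feature["AREA"] is None:            # required value is missing
    feature["AREA"] = polygon_area(feature["geometry"])
print(feature["AREA"])  # 12.0
```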
After cleaning the current layer, the prototype generates the metadata and cleaning summary of the new layer.
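The CSV export of the cleaning summary, mentioned in the algorithm above, can be sketched with Python's standard csv module (the file name and summary contents here are illustrative, not the prototype's actual output):

```python
# Hedged sketch of the summary export: write the cleaning summary to a
# CSV file for future reference using csv.DictWriter.
import csv

summary = [
    {"operation": "duplicate values removed", "count": 2},
    {"operation": "null values removed", "count": 3},
    {"operation": "extraneous fields deleted", "count": 2},
    {"operation": "missing AREA values computed", "count": 1},
]

with open("cleaning_summary.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=["operation", "count"])
    writer.writeheader()
    writer.writerows(summary)
```

The same file can later be reopened with `csv.DictReader` whenever the user needs to review what was cleaned.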
Analysis of the cleaning prototype
Spatial data of Gurugram District in shapefile format is added to the vector layer of QGIS platform and the attribute table is analysed as shown in Figure 8.
Attribute table of spatial data.
Automated cleaning.
Geospatial metadata cleaning using the prototype.
The user is asked to confirm the deletion of duplicate attribute values, as shown in Figure 10.
User interaction with the prototype.
The cleaned data is saved as a new layer, as shown in Figure 11.
Cleaned data saved as new layer.
The quality information of the cleaned layer is displayed as a summary in tabular form along with the metadata information of the new layer as shown in Figure 12.
Quality information of geospatial data.
After execution of the cleaning process, the attribute table of the new layer is free from dirty values, as shown in Figure 13.
Attribute table of new layer.
The spatial data is free from errors after execution of the cleaning process.
Visualisation of cleaned and uncleaned data
Metadata and summary of cleaned layer.
This section compares the metadata cleaning ability of the state-of-the-art approach and the proposed prototype.
Spatial data of Gurugram District in shapefile format, taken from the Society for Geoinformatics and Sustainable Development (SGSD), was analysed in QGIS to determine the quality of the data stored in the attribute table. The attribute table was full of errors, and their removal requires the use of various GIS functions available in QGIS. Cleaning of attribute data was done using GIS functions by Parmar and Sheoran (2021), but manual cleaning is a tedious, cumbersome, time-consuming and lengthy process; the proposed prototype was therefore developed to automate it.
Comparative analysis of cleaning using QGIS functions and the proposed prototype.
Cleaning using QGIS functions | Cleaning using the proposed prototype |
---|---|
The user must have prior knowledge of GIS cleaning functions and its steps. QGIS is a vast software having various functions. New users are not aware of these functions and need a tutorial before performing the cleaning process. | Users can perform cleaning using a single function with a single click in QGIS. There is no need to analyse the dirty data. A user just needs to import the cleaning function in the Python console of QGIS and click on the run tab. The vector layer will be cleaned. |
It is suitable for trained GIS users. The cleaning needs expertise in QGIS and cannot be handled by novice users. | It is suitable for all types of users. |
It is a time-consuming process as it requires operation and analysis of various GIS functions such as JOIN, DELETE and SQL Query in Advance Filter Expression. | It is a very fast cleaning process. There is no need for any GIS function and query execution. |
It is not an interactive approach as no input is asked from the user. The user is not aware of the work performed by the GIS functions. | It is interactive and user-friendly as input is asked from a user before the removal of duplicate values. |
It is less reliable as cleaning performance depends on the skills of the user. If the user chooses wrong functions, then cleaning is not done properly. | It is reliable as cleaning is performed by the prototype automatically. |
Incapable of providing cleaning information about the attributes. The summary of cleaned data is not available. | Provides cleaning information about the attributes as shown in Figure 12. |
The cleaned layer cannot be automatically saved. | The cleaned layer is saved as a new layer automatically after cleaning. |
Metadata information of spatial data cannot be stored for future use. | Metadata information of cleaned data is exported as CSV files and can be used for comparison and analysis. |
Data quality parameters such as completeness, consistency and accuracy cannot be perceived by the users after cleaning as no cleaning information is provided. | Users can easily judge the quality parameters after analysing summary information. |
Output as the cleaned vector layer is not distinguishable from the dirty layer. | Output as the cleaned vector layer is clearly distinguishable from the dirty layer, as shown in Figure 14; the spatial data in green is cleaned and free from errors. |
The output produced after performing the cleaning operation using the GIS functions available in the attribute table of QGIS is analysed. The cleaned values cannot be visually distinguished from the dirty values, as both are stored in the same layer, and the user cannot judge whether the data has been cleaned until the attribute table is analysed.
The output produced by the prototype is saved as a new layer, so the cleaned data can be visually distinguished from the dirty data.
Performance analysis of the prototype
From Figure 14, the cleaned values can be distinguished from the dirty values.
This tool is very helpful for the analysis of spatial data, as cleaned data leads to correct analysis while dirty data gives wrong output; incorrect analysis of data is a serious problem and must be taken seriously.
The objective is to clean the contextual metadata given in the attribute table of spatial data added to the QGIS software, for the right analysis of the spatial data. Let M represent the dataset of all attribute tables; the metadata dataset is taken from the data sources, and cleaning removes the dirty elements from M.
The cleaned data set CM can be used for data analysis as it is free from missing values, null values, extraneous attributes and so on.
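One possible formalisation of this objective is sketched below; only M and CM appear in the text, so the union over sources and the dirty set D are notation introduced here for illustration:

```latex
M = \bigcup_{i=1}^{n} M_i, \qquad
D = \{\, m \in M \mid m \text{ is null, duplicate, missing or extraneous} \,\}, \qquad
CM = M \setminus D .
```

Here each $M_i$ denotes the metadata drawn from source $i$, and the cleaned set $CM$ is what remains after removing the dirty elements $D$.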
In this paper, an intensive metadata cleaning tool was designed and developed.