Open Access

Research on Efficient Algorithms for Intelligent Computing in Big Data Analytics

Feb 03, 2025


Introduction

With the arrival of the information age, computer technology and the Internet have been widely adopted, providing favourable conditions for entering the era of big data [1-2]. In the context of big data, an organisation should establish new intelligent management ideas according to its own needs, build professional capabilities for handling internal affairs by analysing data with information technology, analyse data scientifically and efficiently, and use the results of data analysis as an effective reference for its business activities [3-4]. In general, the data sources in an intelligent system differ from one another, and the data also follow different standards; optimising the database system with respect to these differences produces different results. Such differences in results are regarded as innate differences in the data itself, and this error problem cannot be circumvented [5-6].

Big data plays an important role in the information economy and in modern life. Advances in hardware and software technologies have made big data collection easier, and big data analytics refers to the techniques used to examine and process big data in order to extract useful information [7-9]. In the modern world, the rate of data generation is accelerating alongside the development of advanced big data analytics and computational intelligence technologies. Big data analytics can support government efforts to provide better services to citizens and to improve key sectors such as healthcare and public transport, which in turn helps shape a more efficient modern society [10-13].

As a catalyst for transformation and growth in the new era, AI technology represented by intelligent computing and decision-making, which deeply integrates big data intelligence and optimisation algorithms, has injected unprecedented vitality into the vigorous development of the digital economy [14-16]. Intelligent computing and decision-making technology, as the most central and valuable application field of AI, not only plays a crucial role in promoting the development of the national digital economy but is also an important engine for Beijing to build a science and technology innovation highland and accelerate its construction as a global digital economy benchmark city [17-19].

Computational intelligence is a subfield of machine learning in which algorithms are designed to mimic human information processing and reasoning mechanisms when dealing with complex and uncertain data sources; computational intelligence technologies are computational methods and techniques developed to solve complex, real-world, data-driven problems [20-22]. To deal with the growing number of real-world problems, fuzzy logic, evolutionary algorithms, and artificial neural networks constitute the three core computational intelligence approaches. Combinations of computational intelligence techniques can be used to extract insight and meaning from data and to provide integrated solutions adapted to offline and online hardware and software data processing and control requirements. Applied across a variety of domains, they can provide effective multi-purpose intelligent data analytics and decision support systems for a wide range of industrial and commercial applications that require the analysis of large amounts of fuzzy or complex information to support operational and cost-effective decision making [23-25].

This article uses the Hadoop HDFS system for highly reliable collection and distributed storage of massive data, designs secondary indexes under the cloud storage model, and performs parallel region queries on system data through the Hilbert coding method. The query performance of the Hadoop HDFS data storage and query system is examined through experiments that explore the data scalability and node scalability of the Hadoop HDFS system. After achieving efficient storage and querying of data, the Spark platform and the traditional parallel DBSCAN algorithm are combined to construct the Spark DBSCAN algorithm for efficient data mining, and experiments are designed to compare the Spark DBSCAN algorithm with the traditional parallel DBSCAN algorithm in order to test its performance on data mining and clustering.

Hadoop-based efficient storage and querying of massive data
Platform architecture

Processing massive network data in the mobile Internet requires solving the problems of reliable data collection, storage of network traffic data, fast and efficient processing of massive network data, and reliable guarantees for the security of stored data. Existing cloud computing platforms cannot meet these needs well, so this paper carries out the corresponding design, optimisation, and improvement work.

In this context, this paper studies and designs a massive data processing platform based on Hadoop [26]. The complete platform architecture is shown in Fig. 1. First, data collection equipment is deployed at key link nodes; these devices collect data from the live network in real time. The collected raw network data are then classified. Key indicators for real-time analysis are classified, merged, and forwarded through the data forwarding layer before being uploaded to the data storage area of the massive network data processing platform. Raw traffic message data, by contrast, are first cached in the collector's storage medium, physically transported to the data centre where the platform is located, and then stored in the platform's data storage area via local uploading. In the data storage area, the HDPF (Hadoop-based Data Processing Framework) module first preprocesses the primary data, filtering out erroneous and defective records to ensure the integrity, reliability, and accuracy of the data to be stored. The preprocessed primary data are then stored in HDFS. Next, based on the data characteristics, corresponding data processing and analysis software is designed and developed for analysing and processing semi-structured, table-like data structures. Finally, platform users can, through the client, invoke the relevant MapReduce data processing software to process and analyse the massive network data stored in HDFS and obtain valuable data analysis results.

Figure 1.

Mass network data processing platform framework based on Hadoop
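For illustration, the "local upload" step above, in which preprocessed primary data is written into the platform's HDFS storage area, can be sketched with the standard Hadoop FileSystem API. This is only an indicative sketch: the NameNode address, directory layout, and file names are hypothetical, and in the actual platform this step is performed by the HDPF module rather than a standalone tool.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Minimal sketch of the local-upload step: preprocessed primary data that has
// passed the HDPF filtering stage is copied from the collector's local storage
// into the platform's HDFS data storage area. Paths and addresses are illustrative.
object HdfsUploader {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    conf.set("fs.defaultFS", "hdfs://namenode:9000") // assumed NameNode address
    val fs = FileSystem.get(conf)

    val localFile = new Path("/data/preprocessed/traffic-20250203.dat") // hypothetical file
    val hdfsDir   = new Path("/platform/raw_traffic/2025-02-03/")       // hypothetical target dir

    if (!fs.exists(hdfsDir)) fs.mkdirs(hdfsDir)
    fs.copyFromLocalFile(localFile, hdfsDir) // upload into the storage area
    fs.close()
  }
}
```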

Highly Reliable Massive Data Acquisition

Data collection is the underlying layer of the Hadoop-based massive network data processing platform: traffic monitoring equipment deployed on the network collects network traffic, stores it in the form of flow records, and ultimately pools it into cloud storage through different data transmission channels. The massive network data collection framework is shown in Figure 2. Collectors are mainly deployed at key nodes such as MAN egress links, provincial network egress links, and backbone network interconnect links. Each collector transmits the collected data to a transmission Channel, which carries out preliminary processing on the received data and forwards it to different Uploaders according to the data type. Each Uploader aggregates and merges the received data of the same type and finally transmits it to HDFS.

Figure 2.

Mass network data acquisition framework

Distributed data storage

The storage of big data is the foundation and support of the massive network data processing platform. To meet the high-concurrency, low-latency read and write characteristics of massive data, as well as the requirements for a highly reliable, fault-tolerant, and stable storage architecture, a distributed file system architecture is needed to store massive data.

GFS

GFS (Google File System) is the massive data file storage system used by Google for its cloud computing services. It is tightly integrated with Google's MapReduce, Bigtable, and other technologies, making it one of Google's core technologies. Like other distributed file systems, a GFS cluster has a Master node and a large number of Chunk service nodes, and can be accessed by multiple users simultaneously. The cluster nodes generally run a Linux operating system. The Master node is responsible for maintaining file system metadata, including namespaces, file-to-block mappings, and block locations. GFS users communicate with the Master node and the Chunk service nodes through clients to perform data read and write operations: the client communicates with the Master node only for metadata operations, while specific data reads and writes are exchanged directly with the Chunk service nodes.

Compared with previous file systems, GFS has the following characteristics:

Massive data storage in chunks.

Node failure in the cluster is the norm.

Separation of data management and read/write.

GFS remains one of the distributed file systems that best balance performance, reliability, and cost in production environments.

Distributed Data Storage in Heterogeneous Environments

HDFS, as an open-source implementation of GFS, can be deployed easily and quickly on many common hardware devices, and the cluster size can be dynamically expanded as needed to increase the storage space for large datasets. Its multi-replica technology effectively guarantees the reliability of user data storage, and its open-source nature allows researchers to customise the file system according to their own needs, giving a high degree of autonomy.

For the HDFS storage architecture, researchers have conducted extensive work and proposed optimisation schemes for different application scenarios. One such approach is a data placement strategy for HDFS built on a consistent hashing algorithm. In HDFS, node failure is common; consistent hashing ensures that when a node is added to or removed from the cluster, the mapping of data to nodes does not change significantly. When data need to be stored, the strategy determines the storage location based on the hash value of the data, thereby increasing the proportion of map tasks that run on local data when subsequent jobs are executed and ultimately increasing the speed of data processing.
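As a rough illustration of how such a consistent-hash placement strategy behaves, the sketch below builds a hash ring with virtual nodes and maps block identifiers to storage nodes. The node names, virtual-node count, and MD5-based hash are illustrative choices, not the exact scheme of the strategy described above.

```scala
import java.security.MessageDigest
import scala.collection.immutable.TreeMap

// Sketch of consistent-hash data placement: nodes are mapped onto a hash ring
// (with virtual nodes for balance), and a data block is stored on the first
// node clockwise from its own hash. Adding or removing a node only remaps the
// keys on the affected arc of the ring.
object ConsistentHashPlacement {
  private def hash(key: String): Long = {
    val digest = MessageDigest.getInstance("MD5").digest(key.getBytes("UTF-8"))
    digest.take(8).foldLeft(0L)((acc, b) => (acc << 8) | (b & 0xffL)) // ring position
  }

  def buildRing(nodes: Seq[String], vnodes: Int = 100): TreeMap[Long, String] =
    TreeMap((for (n <- nodes; i <- 0 until vnodes) yield hash(s"$n#$i") -> n): _*)

  def nodeFor(ring: TreeMap[Long, String], blockId: String): String = {
    val it = ring.iteratorFrom(hash(blockId)) // first node at or after this position
    if (it.hasNext) it.next()._2 else ring.head._2 // wrap around the ring
  }

  def main(args: Array[String]): Unit = {
    val ring = buildRing(Seq("datanode-1", "datanode-2", "datanode-3"))
    println(nodeFor(ring, "block-000042")) // node chosen for this block
  }
}
```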

Massive data indexing and query method design
Secondary Index Design in Cloud Storage Model

An index is a regular data structure designed according to the attributes, location, and shape of spatial objects. It sits between the user query and the data in the spatial database and improves query efficiency and speed by filtering out the large amount of spatial data unrelated to the query region. Traditional indexing methods for planar (areal) features include R-trees, quadtrees, and their variants; for small-scale data, these methods offer good indexing efficiency in relational databases. With the explosive growth of data volumes, massive data must be stored in a distributed way. This paper combines the storage characteristics of HBase (Hadoop Database) with the characteristics of data queries to design an indexing method for big data under the cloud storage model: a secondary multi-column index table based on attribute queries.

HBase provides exact queries (get) and full-table scans (scan) based on row keys, but a single row-key index is inadequate for multi-dimensional queries, and full-table scans are inefficient on massive data [27]. In this paper, a non-row-key index is built on top of the RegionObserver Coprocessor in HBase: by listening to client write operations, the Coprocessor's built-in hook function is invoked before the data in the MemStore are flushed to the StoreFile so as to build an index over those data, and the constructed index file is added to the index map for use in data queries. Based on this idea, a secondary multi-column indexing scheme under HBase is proposed. It effectively avoids the problems caused by splicing non-row-key values into the RowKey and, by constructing a mapping between non-row-key query fields and RowKey values, allows the target data to be retrieved quickly and accurately, achieving efficient indexing for non-row-key queries. When a non-row-key value is queried, the secondary index table is consulted first to obtain the RowKey corresponding to that value, and the record corresponding to the RowKey is then retrieved from the HBase source data table.

The secondary multi-column index stores "key-value" data for one or more columns of the target records. Its structure is an inverted index, with the column value of the original HBase table as the key and the RowKey of the original table as the value. When a query uses these column values as conditions, the target record can be found quickly by retrieving the corresponding "key-value" entry. Constructing the secondary multi-column index table mainly involves designing the index-table row key and the column values. The row key is determined by the value of the target field to be indexed; if multiple non-row-key columns need to be indexed, the index-table row key is a combination of those column values.
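To make the index layout concrete, the sketch below maintains an index table whose row key is the combination of the indexed column values and whose cell stores the source RowKey, and then answers a non-row-key query in two steps (index table first, source table second). In the design above this maintenance is actually performed inside the RegionObserver coprocessor hook; the client-side code here, along with the table, family, and column names, is only an illustration and uses the modern HBase client API rather than the 0.94-era one.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put, Result}
import org.apache.hadoop.hbase.util.Bytes

// Illustrative inverted secondary index: index-table row key = indexed column
// values, cell value = RowKey of the source record. Names are hypothetical.
object SecondaryIndexExample {
  private val cf = Bytes.toBytes("cf")

  def main(args: Array[String]): Unit = {
    val conn       = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val dataTable  = conn.getTable(TableName.valueOf("t_data"))
    val indexTable = conn.getTable(TableName.valueOf("t_data_idx"))

    // Write path: every Put on the data table is mirrored by an index Put whose
    // row key combines the indexed (non-row-key) column values.
    def putWithIndex(rowKey: String, city: String, category: String): Unit = {
      val dataPut = new Put(Bytes.toBytes(rowKey))
      dataPut.addColumn(cf, Bytes.toBytes("city"), Bytes.toBytes(city))
      dataPut.addColumn(cf, Bytes.toBytes("category"), Bytes.toBytes(category))
      dataTable.put(dataPut)

      val indexPut = new Put(Bytes.toBytes(s"$city|$category"))
      indexPut.addColumn(cf, Bytes.toBytes("rk"), Bytes.toBytes(rowKey))
      indexTable.put(indexPut)
    }

    // Query path: look up the index table first, then fetch the record by RowKey.
    def queryByColumns(city: String, category: String): Option[Result] = {
      val hit = indexTable.get(new Get(Bytes.toBytes(s"$city|$category")))
      if (hit.isEmpty) None
      else Some(dataTable.get(new Get(hit.getValue(cf, Bytes.toBytes("rk")))))
    }

    putWithIndex("row-0001", "Beijing", "traffic")
    println(queryByColumns("Beijing", "traffic").isDefined)
    conn.close()
  }
}
```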

Parallel region query method based on Hilbert coding

The parallel region query algorithm can be summarised in two steps: filtering and refinement. Specifically, the MBR (minimum bounding rectangle) of the query region is first computed, an HBase index filter is constructed from the MBR to obtain the grid codes (Hilbert codes) intersecting the MBR, and the index table is scanned using these grid codes, with the hit index records saved in a candidate dataset. The candidate dataset is then transformed into an RDD with Spark, and the corresponding records are queried in parallel from the HBase storage table through the RDD operators. Finally, the retrieved dataset is refined against the query region, and the records that satisfy the query region's conditions are returned.
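The filter-refine control flow can be sketched as below. The helpers hilbertEncode, scanIndex, and fetchRow are assumed stand-ins for the grid coding, the HBase index-table scan, and the HBase storage-table lookup described above, so this is an outline of the flow rather than a runnable query engine.

```scala
import org.apache.spark.sql.SparkSession

// Filter-refine region query sketch. hilbertEncode, scanIndex and fetchRow are
// assumed helpers standing in for the grid coding, index-table scan and
// storage-table lookup described in the text.
case class Rect(xmin: Double, ymin: Double, xmax: Double, ymax: Double) {
  def contains(x: Double, y: Double): Boolean =
    x >= xmin && x <= xmax && y >= ymin && y <= ymax
}

object RegionQuery extends Serializable {
  def hilbertEncode(col: Int, row: Int, order: Int): Long = ??? // assumed: grid cell -> Hilbert code
  def scanIndex(codes: Seq[Long]): Seq[String]            = ??? // assumed: index-table scan -> RowKeys
  def fetchRow(rowKey: String): (Double, Double)          = ??? // assumed: storage-table lookup -> point

  def query(spark: SparkSession, mbr: Rect, cellSize: Double, order: Int): Array[String] = {
    // Filtering: grid cells intersecting the MBR, converted to Hilbert codes.
    val cols  = math.floor(mbr.xmin / cellSize).toInt to math.floor(mbr.xmax / cellSize).toInt
    val rows  = math.floor(mbr.ymin / cellSize).toInt to math.floor(mbr.ymax / cellSize).toInt
    val codes = for (c <- cols; r <- rows) yield hilbertEncode(c, r, order)

    // Candidate RowKeys hit in the secondary index table.
    val candidates = scanIndex(codes)

    // Refinement: fetch candidates in parallel and keep points inside the query region.
    spark.sparkContext
      .parallelize(candidates)
      .filter { rk => val (x, y) = fetchRow(rk); mbr.contains(x, y) }
      .collect()
  }
}
```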

Experimental performance evaluation

This section presents an experimental evaluation of the query performance of the Hadoop HDFS system. For comparative reference, the performance of several typical distributed RDF triple databases (SHARD, Jena-Hbase) is evaluated in the same environment.

Experimental environment

Experimental platform configuration

The system uses a physical cluster of 18 nodes for the experiments. Among these 18 nodes, one serves as the node to run the Query Client, and the remaining 17 nodes serve as the distributed storage nodes for the Hadoop HDFS system. Each node is configured with two Xeon Quad 2.4GHz processors, 24GB of RAM, and 2TB 7200 RPM SATA hard drives. All nodes are interconnected with 1Gb/s Ethernet. Each node is installed with RedHat Enterprise Linux 6.0 operating system and Ext3 format file system. The version of Hbase installed in the cluster is 0.94, the version of Java is 1.6, and the version of Redis is 2.4. The heap size of the JVM in all the experiments is uniformly set to 8GB.

Benchmark

The system performance is evaluated using the LUBM (Lehigh University Benchmark) standard test set, which consists of a simulated synthetic OWL/RDF dataset of arbitrary size and 10 query statements reflecting various query characteristics.

Specifically, three sizes of university datasets generated with 5, 50, and 500 universities as seeds were used in the experiments. They are labelled as LUBM-5, LUBM-50 and LUBM-500, respectively.

Query Performance Comparison

The experiments tested the performance (query execution time) of Hadoop HDFS against the comparison systems above on the three datasets LUBM-5, LUBM-50, and LUBM-500. All results are averaged over 10 runs, and values were recorded for both cold-start and hot-run states. The specific experimental results are shown in Table 1.

Table 1. Comparison of query execution time (unit: ms)

Dataset Database State Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10
LUBM-5 Hadoop HDFS Cold 235 9445 241 369 425 1491 299 365 14K 277
Hot 114 9188 159 152 194 513 109 142 14K 152
Jena-Hbase Cold 20K 11K 60K 4256 62K 2378 NA NA NA 18K
Hot 16K 10K 45K 4024 9345 864 NA 322K NA 18K
SHARD Cold 156K 302K 184K 212K 287K 672K 65K 203K 856K 200K
Hot 101K 285K 112K 124K 169K 611K 42K 172K 432K 142K
LUBM-50 Hadoop HDFS Cold 244 9051 303 314 415 2003 511 425 14K 363
Hot 112 8879 115 164 185 1734 203 302 14K 122
Jena-Hbase - NA NA NA NA NA NA NA NA NA NA
SHARD Cold 188K 415K 224K 306K 179K 406K 206K 108K 425K 174K
Hot 116K 315K 189K 177K 133K 342K 166K 77K 348K 130K
LUBM-500 Hadoop HDFS Cold 218 8974 266 273 231 18K 237 321 15K 227
Hot 112 8546 105 130 121 17K 133 201 15K 102
Jena-Hbase - NA NA NA NA NA NA NA NA NA NA
SHARD Cold 306K 986K 426K 387K 462K 884K 506K 472K 926K 412K
Hot 245K 758K 285K 204K 306K 695K 330K 394K 734K 283K

The 10 LUBM queries were divided into two groups. The first group includes queries 1, 3, 4, 5, 7, 8, and 10; these queries are highly selective and return few results, and are therefore labelled as fast queries. The second group includes queries 2, 6, and 9, which return more results or are more complex, execute more slowly, and are labelled as slow queries.

In the experiments, some systems failed to return results within a reasonable timeframe (1000 seconds) on some datasets, and the results were labelled as NA. Jena-Hbase failed to return results within a reasonable timeframe on two large-scale datasets, LUBM-50 and LUBM-500.

Hadoop HDFS outperforms Jena-Hbase on all queries. Even for fast queries, the execution time of Jena-Hbase increases as the dataset grows; in contrast, the execution time of Hadoop HDFS remains essentially constant for fast queries, mainly thanks to its efficient indexing scheme and query optimisation strategy. In addition, Hadoop HDFS achieves a three-order-of-magnitude performance improvement over SHARD for most queries. This is mainly because SHARD must transform each query into a series of MapReduce jobs, which incurs significant overhead for job startup and for network and disk reads and writes.

The memory usage of the indexes in Hadoop HDFS is shown in Table 2. As the table shows, index memory usage is very low, which allows the indexes to fit into memory on very few nodes; therefore, all the queries mentioned above can in practice be answered by Hadoop HDFS. In addition, Hadoop HDFS exhibits good load balancing because consistent hashing is used to divide the index data in memory.

Table 2. Hadoop HDFS index storage usage

LUBM-5 LUBM-50 LUBM-500
Total 195.4MB 2.0GB 17.9GB
Avg.±Std. 10.25±1.68MB 118.00±19.48MB 1.02GB±203.45MB
Scalability experiments

This section evaluates the scalability of the Hadoop HDFS system. First, the dataset size is varied with a fixed number of machines to evaluate data scalability; then the number of machines is varied with a fixed data size to evaluate node scalability.

Data scalability

Hadoop HDFS is tested on a cluster of 17 nodes using five datasets whose sizes are scaled up from 5 universities to 500 universities. The experimental results are shown in Fig. 3. For fast queries such as Q1, the query time of Hadoop HDFS remains essentially constant, because of its highly selective index structure and because the number of results returned by the query changes little. For slow queries such as Q6, Hadoop HDFS achieves near-linear scalability as the data size grows.

Node scalability

The scalability of Hadoop HDFS is evaluated using different numbers of nodes on the LUBM-50 dataset. The experimental results are shown in Fig. 4, where the time consumption of Hadoop HDFS is almost constant for all the queries. This is because the indexes of Hadoop HDFS can be stored in memory using only a small number of nodes. The distributed memory space of Hadoop HDFS is increased by the addition of new nodes, which also increases the system’s ability to store and manage larger data.

Figure 3.

Query time change with adjusted data

Figure 4.

Query time change with adjusted storage node

Efficient Data Mining Based on Spark DBSCAN Algorithm

After solving the storage and query problems, this chapter will mainly focus on the efficient mining of massive data by combining the DBSCAN algorithm with a distributed engine. This chapter implements the distributed and parallelised DBSCAN algorithm on the Spark platform [28]. For Spark DBSCAN, the process and implementation are first introduced, then some optimization schemes are proposed, and a detailed performance evaluation is carried out.

Distributed parallelised DBSCAN algorithm flow

This section describes the flow of the distributed DBSCAN algorithm, which differs in several ways from the single-node serial DBSCAN algorithm [29]. Inspired by the idea of partitioning, the distributed flow is divided into three main phases. First, the massive dataset is divided into many smaller datasets; this step is called partitioning, and various partitioning rules with different constraints can be applied. Each partition is then used as the input to a serial DBSCAN run, and computing many partitions in parallel parallelises the algorithm. Finally, a merging stage combines the clustering results of the individual partitions and outputs the overall result.
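As an orientation before the detailed phases below, the skeleton here sketches this three-phase flow on Spark. bspPartition, localDBSCAN, and mergeClusters are assumed placeholders for the partitioning, per-partition clustering, and global merging steps described in the following subsections, and the paths and parameter values are illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Skeleton of the three-phase distributed DBSCAN flow. bspPartition, localDBSCAN
// and mergeClusters are assumed placeholders for the steps detailed below.
object DistributedDbscanDriver extends Serializable {
  type Point = (Double, Double)

  def bspPartition(pts: Seq[Point], maxPerPartition: Int, eps: Double): Seq[(Int, Seq[Point])] = ???
  def localDBSCAN(pts: Seq[Point], eps: Double, minPts: Int): Seq[(Point, Int)]                = ???
  def mergeClusters(local: Seq[(Int, Seq[(Point, Int)])]): Seq[(Point, Int)]                   = ???

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("spark-dbscan").getOrCreate()
    val sc = spark.sparkContext
    val points: Seq[Point] = Seq.empty // in practice, loaded from HDFS

    // Phase 1: split the data set into extended partitions.
    val partitions = bspPartition(points, maxPerPartition = 10000, eps = 0.05)

    // Phase 2: run serial DBSCAN on every partition in parallel.
    val locallyClustered = sc.parallelize(partitions)
      .map { case (pid, pts) => (pid, localDBSCAN(pts, eps = 0.05, minPts = 5)) }
      .collect()

    // Phase 3: merge local cluster IDs that share points in overlap regions.
    val merged = mergeClusters(locallyClustered.toSeq)
    sc.parallelize(merged).saveAsTextFile("hdfs:///output/dbscan") // illustrative output path
    spark.stop()
  }
}
```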

Partitioning phase

The partitioning phase is an important phase and consists of two sub-phases: data slicing and slice extension. The data slicing stage divides the dataset X into M disjoint sets Sᵢ according to specified rules, i.e., X = ⋃ᵢ₌₁ᴹ Sᵢ and Sᵢ ∩ Sⱼ = ∅, where i ≠ j and i, j ≤ M. The slice extension stage extends the range of each partition outward by a certain distance, which must be at least the DBSCAN clustering radius ε in order to ensure the correctness of the final merging result. To keep partitions small and improve computational efficiency, this paper extends each partition by exactly ε. The dataset is divided into multiple partitions using Bisection Space Partitioning (BSP), proposed by M. Berger and S. Bokhari, which recursively partitions the target space into roughly equal regions.
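A minimal sketch of these two sub-phases is given below: a recursive bisection along the longer axis stands in for BSP, and each slice's bounding box is grown outwards by ε to form the extended partition. The point representation, the splitting criterion, and the maxPoints threshold are illustrative assumptions rather than the exact rules used in the paper.

```scala
// Sketch of the partitioning sub-phases: BSP-style bisection into slices of
// roughly equal size, then extension of each slice's bounding box by ε.
case class Box(xmin: Double, ymin: Double, xmax: Double, ymax: Double) {
  def extend(eps: Double): Box = Box(xmin - eps, ymin - eps, xmax + eps, ymax + eps)
  def contains(p: (Double, Double)): Boolean =
    p._1 >= xmin && p._1 <= xmax && p._2 >= ymin && p._2 <= ymax
}

object BspPartitioner {
  type Point = (Double, Double)

  // Recursively bisect `points` along the longer axis until every slice is small enough.
  def split(points: Seq[Point], maxPoints: Int): Seq[Seq[Point]] =
    if (points.size <= maxPoints) Seq(points)
    else {
      val xSpan  = points.map(_._1).max - points.map(_._1).min
      val ySpan  = points.map(_._2).max - points.map(_._2).min
      val sorted = if (xSpan >= ySpan) points.sortBy(_._1) else points.sortBy(_._2)
      val (left, right) = sorted.splitAt(sorted.size / 2)
      split(left, maxPoints) ++ split(right, maxPoints)
    }

  // Slice extension: partition Pi holds the slice Si plus all points of X that
  // fall within ε of Si's bounding box.
  def extendSlice(all: Seq[Point], slice: Seq[Point], eps: Double): Seq[Point] = {
    val box = Box(slice.map(_._1).min, slice.map(_._2).min,
                  slice.map(_._1).max, slice.map(_._2).max).extend(eps)
    all.filter(box.contains)
  }
}
```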

Local clustering phase

After dividing the dataset X into many partitions, DBSCAN clustering, called local clustering, can be executed on each partition's data. Local clustering operates only on partitioned data; because the massive dataset has already been divided into multiple partitions with smaller volumes in the first stage, and local clustering runs in parallel, the time required for clustering massive data can be greatly reduced. Local clustering is the core of the whole distributed DBSCAN algorithm, and its output is an important input to the merging phase. At the end of local clustering, each point in the dataset is assigned at least one cluster ID and a point type (core, boundary, or noise). Most points are assigned a single cluster ID, but points in the overlapping regions of partitions may be labelled with multiple cluster IDs and, likewise, multiple point types.

Global merged cluster phase

After all local clustering phases are completed, the global merging phase merges clusters globally based on the clustering results of the points that lie in the overlapping regions of the partitions. It can be completed in three steps: 1) extract the points that lie in overlapping regions and carry multiple local cluster IDs; 2) based on these points, determine which local cluster IDs across partitions belong to the same global cluster and record this information; 3) regenerate unique global cluster IDs.
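One way to realise steps 2) and 3) is a union-find over local cluster IDs: any two local IDs carried by the same overlap point are united, and each connected component receives one global ID. The (partition ID, local cluster ID) encoding below is an assumption made for illustration; it is not taken from the paper's implementation.

```scala
// Union-find sketch of the global merging phase: local cluster IDs that share
// an overlap point collapse into the same global cluster, and the connected
// components yield the final global IDs.
object GlobalMerge {
  type LocalId = (Int, Int) // (partition ID, local cluster ID)

  def merge(overlapPoints: Seq[Seq[LocalId]]): Map[LocalId, Int] = {
    val parent = scala.collection.mutable.Map[LocalId, LocalId]()
    def find(x: LocalId): LocalId = {
      val p = parent.getOrElseUpdate(x, x)
      if (p == x) x else { val root = find(p); parent(x) = root; root }
    }
    def union(a: LocalId, b: LocalId): Unit = parent(find(a)) = find(b)

    for (ids <- overlapPoints; id <- ids) find(id) // register every local ID
    for (ids <- overlapPoints; pair <- ids.sliding(2) if pair.size == 2)
      union(pair.head, pair(1))                    // step 2: unite IDs sharing a point

    // Step 3: one global ID per connected component.
    val ids   = parent.keys.toSeq
    val roots = ids.map(find).distinct.zipWithIndex.toMap
    ids.map(id => id -> roots(find(id))).toMap
  }

  def main(args: Array[String]): Unit = {
    // Example: one overlap point carries local IDs (1,0) and (2,3), so those two
    // local clusters collapse into the same global cluster.
    println(merge(Seq(Seq((1, 0), (2, 3)))))
  }
}
```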

Implementation and Optimisation of the Spark DBSCAN Algorithm

This section discusses the implementation of the DBSCAN algorithm on Spark, the corresponding optimisation strategies, and the terminology used.

Implementation

The scheme of Spark implementation of the DBSCAN algorithm is shown in Figure 5. Although the figure is illustrated with 4 partitions as an example, without loss of generality, it can be extended to more partitions.

For the partitioning algorithm, owing to its simplicity, only the process of extending slice Si into partition Pi is plotted in the figure, corresponding to the transition from (1) to (2) in Figure 5. This process is implemented mainly by passing an extendSplit() function to the map operator; extendSplit() extends each slice Si outwards by ε to form partition Pi.

Steps (2) to (3) in Fig. 5 show the implementation of local clustering. In this phase, the algorithm performs the clustering operation independently on the data in each partition, and the output is a set of points labelled with cluster IDs and point types. The serial DBSCAN algorithm is used for each partition, and this phase is likewise performed by passing the DBSCAN function to the map operator.

Steps (4) to (6) in Fig. 5 show the process of extracting the overlapping points and constructing the undirected graph. In the merge phase, the extractOverLap() method uses a double loop over partitions and data points to pick out the points that lie in overlapping partitions and carry multiple cluster IDs. extractOverLap() needs the partition information produced in phase (1); this partition information is passed to each Worker via broadcast, which reduces network transfers and is one of the programme's optimisations. As discussed in Section 3.1.3, points in the interleaved regions of different partitions may carry multiple cluster IDs, so a single point may hold several pieces of partition information at the same time; for example, the topmost RDD partition in (4) shows that point labelPoint1 lies in both partitions P1 and P2. To adapt the data as input to the graph-building algorithm (buildGraph), the data are regrouped by partition using the groupBy operator after this stage. The buildGraph() method was then implemented in Scala as a serial implementation, because the number of clusters in the result is generally small: excluding the noise class, there are only four clusters in total for the four partitions. A small number of clusters occupies little memory and does not require the large memory resources of the distributed cluster.

Finally, the relabel() method updates the local clustering results obtained in step (3) with the merged global cluster information, and the final results are saved to a file using the saveAsTextFile() method.

Optimisation

In terms of data transmission optimisation, this scheme uses broadcast variables to distribute the data needed for computation to each Worker in advance, so that sub-tasks do not have to pull the data from other Workers over the network during execution. Only one copy of the data is kept per node rather than one copy per sub-task, which reduces the amount of data transferred between Workers, saves network bandwidth, and shortens the algorithm's runtime. In this scheme, the broadcast mechanism is used to deliver the partition information needed by the extractOverlap() function.
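A minimal sketch of this broadcast optimisation is shown below: the (small) partition boundary table is shipped once per Worker as a broadcast variable and read through .value inside each task. The bounding-box layout and sample points are illustrative, and the overlap test merely mimics the role of extractOverlap() rather than reproducing the paper's implementation.

```scala
import org.apache.spark.sql.SparkSession

// Broadcast the partition boundary table once per Worker and use it inside
// tasks to detect points lying in more than one extended partition.
object BroadcastExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("broadcast-demo").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Partition ID -> extended bounding box (xmin, ymin, xmax, ymax); illustrative values.
    val partitionBoxes = Map(1 -> (0.0, 0.0, 5.0, 5.0), 2 -> (4.9, 0.0, 10.0, 5.0))
    val bcBoxes = sc.broadcast(partitionBoxes)

    val points = sc.parallelize(Seq((4.95, 1.0), (2.0, 2.0), (9.0, 3.0)))

    // extractOverlap-style check: a point inside more than one extended partition
    // is an overlap point and is kept for the merge phase.
    val overlapPoints = points.filter { case (x, y) =>
      bcBoxes.value.values.count { case (xmin, ymin, xmax, ymax) =>
        x >= xmin && x <= xmax && y >= ymin && y <= ymax
      } > 1
    }
    overlapPoints.collect().foreach(println) // only (4.95, 1.0) falls in both boxes
    spark.stop()
  }
}
```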

Figure 5.

Spark implementation process

Experimental performance comparison
Experimental environment

For the experiments in this subsection, a Hadoop cluster with three nodes (one management node and two computing nodes) was built, and a Spark cluster was deployed on top of it. The Spark cluster uses Hadoop's HDFS for data storage, which simplifies dataset storage and improves the speed of the whole experimental environment. Each of the three nodes runs Ubuntu Server with an Intel 12th-generation i9-12900H processor, 16 GB of memory, and 60 GB of disk storage. The software configuration is Hadoop 3.2.3, Spark 3.2.2, and Python 3.9.13. The Spark cluster uses YARN for resource management at runtime, and the experiments use Yarn-Client mode to enhance the flexibility of resource allocation across the cluster.
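For reference, a rough sketch of how a Spark application could be configured for YARN in client mode is shown below; in practice the master and deploy mode are usually supplied through spark-submit, and the application name and resource figures here are illustrative only, not the settings used in the experiments.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative YARN client-mode configuration for a Spark application.
object ExperimentSession {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-dbscan-experiments")
      .master("yarn")
      .config("spark.submit.deployMode", "client")
      .config("spark.executor.instances", "2") // e.g. one executor per computing node
      .config("spark.executor.memory", "4g")
      .getOrCreate()
    println(s"Running on ${spark.sparkContext.master}")
    spark.stop()
  }
}
```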

Experimental data and analysis

To explore the accuracy of the proposed Spark DBSCAN algorithm relative to the traditional parallel DBSCAN algorithm on different datasets, this paper compares the clustering result metrics of the traditional parallel DBSCAN algorithm (Naive DBSCAN) and the proposed Spark DBSCAN algorithm on different datasets, as shown in Table 3.

Table 3. Comparison of clustering result metrics of different parallel DBSCAN algorithms

Data set Algorithm Silhouette coefficient Purity Rand index Adjusted Rand index F1-score
R15 Naive DBSCAN 0.7658 0.9644 0.9685 0.9532 0.9412
Spark DBSCAN 0.7346 0.9416 0.9602 0.9263 0.9331
Jain Naive DBSCAN 0.3015 0.9745 0.4913 0.1026 0.2578
Spark DBSCAN 0.3015 0.9745 0.4913 0.1026 0.2578
Pathbased Naive DBSCAN 0.3562 0.9278 0.7016 0.1152 0.1723
Spark DBSCAN 0.3562 0.9278 0.7016 0.1152 0.1723
Aggregation Naive DBSCAN 0.3325 0.8244 0.8078 0.1605 0.2346
Spark DBSCAN 0.3325 0.8244 0.8078 0.1605 0.2346
D31 Naive DBSCAN 0.5815 0.9045 0.9952 0.8142 0.8156
Spark DBSCAN 0.5685 0.8712 0.9896 0.7724 0.7789

As shown in Table 3, compared with the stand-alone DBSCAN algorithm, the Spark DBSCAN algorithm shows a decrease in metrics such as the silhouette coefficient, purity, and Rand index in most cases. This is mainly because, in a parallel DBSCAN algorithm, multiple computing nodes process different data slices in parallel and the local clustering results are then merged into the final result, a process that may lead to conflicts or overlaps between the clustering results of different partitions and thus affect the accuracy of the final clustering. Nevertheless, across the different datasets, the accuracy of the Spark DBSCAN algorithm differs little from that of the traditional parallel DBSCAN algorithm. The clustering results of the Spark DBSCAN algorithm are shown in Figure 6. On the Jain and Pathbased datasets, the Spark DBSCAN algorithm and the traditional parallel DBSCAN algorithm produce identical clustering results because these datasets have the same amount of data and similar data characteristics.

Figure 6.

Spark DBSCAN algorithm clustering result graph

Subsequently, to demonstrate the runtime efficiency of the Spark DBSCAN algorithm, Table 4 compares the running time of the traditional parallel DBSCAN algorithm and the Spark DBSCAN algorithm on different datasets, with the number of partitions for both the Naive DBSCAN and Spark DBSCAN algorithms set to 4.

Table 4. Comparison of clustering time cost of different parallel DBSCAN algorithms

Data set Algorithm Clustering time
R15 Naive DBSCAN 20.485s
Spark DBSCAN 17.065s
Jain Naive DBSCAN 18.746s
Spark DBSCAN 15.062s
Pathbased Naive DBSCAN 17.223s
Spark DBSCAN 16.012s
Aggregation Naive DBSCAN 15.462s
Spark DBSCAN 4.726s
D31 Naive DBSCAN 87.633s
Spark DBSCAN 40.745s

As shown in Table 4, the Spark DBSCAN algorithm runs faster than the Naive DBSCAN algorithm on all the datasets; its running time on the Aggregation dataset is only 30.57% of that of the Naive DBSCAN parallel algorithm, and on the D31 dataset only 46.50%.

Conclusion

The author first uses the Hadoop HDFS system for efficient storage and querying of big data, and conducts experiments on this storage and querying system to check its querying performance. Then, the Spark DBSCAN algorithm is proposed to mine the system data efficiently, and experiments are designed to test the algorithm’s performance.

The query performance of Hadoop HDFS is better than that of the Jena-Hbase and SHARD systems, with an overall query performance about three orders of magnitude higher than SHARD's. For fast queries, the execution time of Hadoop HDFS remains essentially constant, while for slow queries it achieves near-linear scalability as the data size grows; as the number of nodes changes, its time consumption is also basically constant for all queries.

In most cases, the Spark DBSCAN algorithm experiences a slight decrease in clustering result metrics, but this decrease is not significant, resulting in a very small difference in clustering accuracy between the Spark DBSCAN algorithm and the traditional parallel DBSCAN algorithm.

On different datasets, the Spark DBSCAN algorithm runs significantly faster than the Naive DBSCAN algorithm, particularly on the Aggregation and D31 datasets, where its running time is only 30.57% and 46.50%, respectively, of that of the Naive DBSCAN parallel algorithm.
