INTRODUCTION

With the rapid development and deep penetration of the Internet, more and more large portal sites, e-business websites and online communities have appeared. These websites hold large image collections, and the number of images grows at high speed. Traditional technical architectures cannot deal effectively with massive image sets. How to build a cheap and efficient image storage and management system under the precondition of supporting highly concurrent access has therefore become a problem to be solved.

Based on an analysis of Hadoop's HDFS, a study of MapReduce and a business requirement analysis of image storage, this paper proposes a massive image storage module built on Hadoop. The system adopts a Master/Slave architecture on top of the HDFS distributed file system of Hadoop, and its hardware consists of a cluster of commodity Linux machines. Internal monitoring provides high fault tolerance, fast response and load balancing, while the externally exposed service supports high concurrency and dependable operation.

OVERALL PLATFORM DESIGN

The system adopts a high-availability (HA) architecture and supports smooth expansion, which guarantees the availability and extensibility of the whole file system. It uses a flat data organization with an efficient, extensible first-level index, so that the SequenceFile containing an image and the image's offset within that file can be located quickly. Load balancing and the cache system design keep every storage node sound and stable. Considering the heterogeneous structure, distribution and diversity of massive image data, together with implementation concerns, the system employs a three-layer MVC-style design, which makes the structure clearer and the system easier to extend.

The overall platform consists of three layers: the data resource layer, the business logic layer and the application interface layer. The data resource layer is the basis of the whole platform and the fundamental part of the cloud storage. The business logic layer processes massive image data in parallel and manages the configuration of the entire platform system; it is the most important and most technically demanding part of the cloud storage. Its main functions are to coordinate the multiple storage devices underneath it, to provide a uniform API to the upper-layer application services, and to shield the storage devices of the lower layer.
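As an illustration of the first-level index just mentioned, one index entry can simply map an image key to the SequenceFile that holds it and the byte offset of its record. This is a minimal sketch; the class and field names are ours, not the paper's.

    // One entry of a flat first-level index: image key -> (SequenceFile, offset).
    // Illustrative only; names are not taken from the paper.
    public class ImageIndexEntry {
        public final String imageKey;  // logical image name
        public final String seqFile;   // SequenceFile that contains the image record
        public final long offset;      // byte offset of the record inside that file

        public ImageIndexEntry(String imageKey, String seqFile, long offset) {
            this.imageKey = imageKey;
            this.seqFile = seqFile;
            this.offset = offset;
        }
    }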

PLATFORM FUNCTION DESIGN
Platform overall function module structure

Considering the system's functions, the overall function module structure of the whole system is shown in Figure 1.

Figure 1. The overall function module structure of the platform

Layered platform module structure analysis
Data resource layer module structure

At the bottom is the data storage layer, composed mainly of the storage module. For massive data storage, the storage module must shield the different data sources provided by different databases and expose uniform data access services, so that the system meets the requirements of massive data storage while remaining extensible, complete, and easy to manage and deploy. A large number of low-cost machines are combined into a cluster through Hadoop's HDFS, which provides massive storage capacity.
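As a concrete illustration, the sketch below stores one local image file in the cluster through the Hadoop 0.20-era FileSystem API. The NameNode address and the paths are placeholders of ours, not values from the paper.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsPutExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // fs.default.name is the pre-2.x key for the NameNode address;
            // host and port here are placeholders.
            conf.set("fs.default.name", "hdfs://namenode:9000");
            FileSystem fs = FileSystem.get(conf);
            // Copy a local picture into the distributed file system.
            fs.copyFromLocalFile(new Path("sample.jpg"), new Path("/images/sample.jpg"));
            fs.close();
        }
    }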

Business logic layer module structure

The business logic layer is the core of the whole system and the key element of its design and development. Using distributed database technology and Linux cluster technology, it provides the main function of loading and storing massive data in parallel. The data processed in this layer is stored in the system's distributed database; besides the massive parallel processing of data, the layer also provides management support services to keep the system operating normally. This layer consists of five functional modules.

Image file pre-processing module

The image file preprocessing module preprocesses image files, designs file names and manages image metadata. It merges strongly correlated files into a SequenceFile by means of a classification algorithm, which greatly reduces the number of files in HDFS. Through the design of the image file name, an image can be read back easily.
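The sketch below packs a directory of local images into one SequenceFile with the Hadoop 0.20 SequenceFile.Writer, keyed by file name with the raw bytes as value. It is illustrative only: the paths are placeholders and the paper's classification algorithm for grouping correlated files is not reproduced here.

    import java.io.ByteArrayOutputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class ImagePacker {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // Key: image file name; value: raw image bytes.
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, new Path("/images/packed-00001.seq"),
                    Text.class, BytesWritable.class);
            try {
                // "local-images" is a placeholder directory of correlated pictures.
                for (File img : new File("local-images").listFiles()) {
                    writer.append(new Text(img.getName()),
                                  new BytesWritable(readAll(img)));
                }
            } finally {
                writer.close();
            }
        }

        private static byte[] readAll(File f) throws Exception {
            FileInputStream in = new FileInputStream(f);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) > 0) out.write(buf, 0, n);
            in.close();
            return out.toByteArray();
        }
    }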

Storage control module

The storage control module's main function is to provide a web interface and a unified command set for storage management, so that the storage layer nodes can be administered through the web interface. Hadoop exposes port 50070 for viewing NameNode information; through this interface one can see the currently healthy nodes, the failed nodes and the NameNode log.

Caching service module

The main purpose of the caching service module is to build a buffer: client requests are first screened by the cache before they reach the storage layer. We use Redis to build the caching system. Redis is similar to the traditional Memcached in that both are <Key/Value> storage systems, but Redis supports more value types, including String, List, Set and sorted set (zset). These data types support push/pop, add/remove, intersection, union, difference and other rich operations, and all of these operations are atomic.
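As a hedged illustration of how the cache might be consulted, the sketch below uses the Jedis client for Redis to cache the SequenceFile-and-offset location of an image; the client library, key scheme and expiry time are our assumptions, not details given in the paper.

    import redis.clients.jedis.Jedis;

    public class MetadataCache {
        private final Jedis jedis = new Jedis("cache-host", 6379); // placeholder host

        // Cache the SequenceFile name and offset for an image id, expiring after an hour.
        public void putLocation(String imageId, String seqFile, long offset) {
            jedis.setex("img:" + imageId, 3600, seqFile + ":" + offset);
        }

        // Returns "seqFile:offset", or null on a cache miss (the caller then
        // falls back to the first-level index / NameNode).
        public String getLocation(String imageId) {
            return jedis.get("img:" + imageId);
        }
    }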

Business processing module

The business processing module mainly handles image processing tasks. Pictures uploaded to Internet applications usually need processing, for example generating thumbnails for uploaded pictures or applying image segmentation to uploaded portraits. When pictures are uploaded in bulk, concentrating the image processing on the application server occupies its CPU resources and degrades the quality of the application service. The traditional approach is to dedicate a separate machine to image processing, which decouples it from the application. Since the emergence of MapReduce, however, pictures can be processed with MapReduce: the processing is distributed over the cheap storage machines that hold the images and the results are stored directly after processing, which not only saves hardware resources but also spreads the processing load over every node.
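A sketch of a map-only thumbnail job in this spirit is shown below, assuming the images were packed into SequenceFiles of (file name, image bytes) records as described earlier; the 128-pixel width and JPEG output are arbitrary choices of ours, not settings from the paper.

    import java.awt.Graphics2D;
    import java.awt.image.BufferedImage;
    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import javax.imageio.ImageIO;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Map-only job: input records are (file name, image bytes) from a
    // SequenceFile; output records are (file name, thumbnail bytes).
    public class ThumbnailMapper
            extends Mapper<Text, BytesWritable, Text, BytesWritable> {

        @Override
        protected void map(Text name, BytesWritable value, Context context)
                throws IOException, InterruptedException {
            BufferedImage src = ImageIO.read(new ByteArrayInputStream(
                    value.getBytes(), 0, value.getLength()));
            if (src == null) return;  // skip records that are not decodable images
            int w = 128;
            int h = Math.max(1, w * src.getHeight() / src.getWidth());
            BufferedImage thumb = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
            Graphics2D g = thumb.createGraphics();
            g.drawImage(src, 0, 0, w, h, null);  // scale down to thumbnail size
            g.dispose();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            ImageIO.write(thumb, "jpg", out);
            context.write(name, new BytesWritable(out.toByteArray()));
        }
    }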

Load balancing module

The load balancing module mainly addresses the stability of the system under highly concurrent use. We deploy load balancing for the entire system using HAProxy with the RoundRobin load balancing algorithm, dividing the pressure of front-end user requests among the Web image servers. HAProxy provides high availability, load balancing and proxying for TCP and HTTP-based applications, and it is a free, fast and reliable solution. Requests are routed to different servers by distinguishing reads from writes: picture read requests are sent to the image servers, which on the one hand read picture metadata through the cache and on the other hand fetch the hosted image via the NameNode.
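A minimal HAProxy configuration sketch of this round-robin balancing with a read/write split follows; the host names, addresses and the rule that treats POST/PUT requests as writes are placeholder assumptions of ours, not settings given in the paper.

    # Route writes to the application servers, reads to the image servers.
    frontend images
        bind *:80
        acl is_write method POST PUT
        use_backend upload_servers if is_write
        default_backend image_servers

    backend image_servers        # read path: Nginx image servers
        balance roundrobin
        server img1 192.168.1.11:80 check
        server img2 192.168.1.12:80 check

    backend upload_servers       # write path: application servers writing to HDFS
        balance roundrobin
        server app1 192.168.1.21:8080 check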

Application interface layer structure

This layer is made up of a GUI interface module oriented to users and an API interface module oriented to programmatic access. The GUI module provides users with various kinds of operating tools, giving them easy access to the massive data storage and processing services. For high-level users with more demanding requirements, we provide an application development facility and assigned APIs for computation and storage, with which the required application functions can be realized.
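Purely as an illustrative sketch (the paper does not specify the API surface), the programmatic interface of this layer might look like the following; all names are hypothetical.

    // Hypothetical API surface of the application interface layer; the GUI
    // module would be one client of this interface.
    public interface ImageStorageApi {
        String upload(byte[] imageData, String fileName) throws java.io.IOException; // returns an image id
        byte[] download(String imageId) throws java.io.IOException;
        void delete(String imageId) throws java.io.IOException;
    }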

SYSTEM REALIZATION AND RESULT ANALYSIS
Software preparation

Operating system: Ubuntu 9.04

Distributed file system: Hadoop 0.20.2. Running Hadoop requires a JDK environment, and this paper chooses JDK 1.6.0_31. The cluster includes one NameNode server, one JobTracker server and four DataNode servers.

Image server: Nginx-0.9.6

Caching software: Redis

Load balancing software: HAProxy

Java environment: JRE6

Java development tool: Eclipse 3.2

The main steps for installing Hadoop are as follows: first configure the hosts file, create the new users and directories, install the JDK and set the environment variables; then configure passwordless SSH login, and finally install and configure Hadoop.

Experimental result analysis

We first use concurrent threads on the application server to perform store and fetch operations. We chose pictures at random for the analysis, with sizes between 2 KB and 10 MB; the effect is shown in Figure 2.
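A hedged sketch of how such a multi-threaded test harness can be driven is shown below; the stub read operation and the 60-second measurement window are our assumptions, with only the varying thread count mirroring the experiment.

    import java.util.Random;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    public class ReadBenchmark {
        // Stand-in for one read against the image store; in the real test this
        // would call the HTTP image server or the HDFS client.
        static void readOneImage(Random rnd) throws Exception {
            Thread.sleep(rnd.nextInt(5)); // placeholder latency
        }

        public static void main(String[] args) throws Exception {
            int threads = Integer.parseInt(args[0]); // e.g. 10, 20, ... 70
            final long end = System.currentTimeMillis() + 60000L;
            final AtomicLong ops = new AtomicLong();
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            for (int i = 0; i < threads; i++) {
                pool.execute(new Runnable() {
                    public void run() {
                        Random rnd = new Random();
                        while (System.currentTimeMillis() < end) {
                            try {
                                readOneImage(rnd);
                                ops.incrementAndGet(); // count completed reads
                            } catch (Exception e) { /* failed reads are not counted */ }
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(2, TimeUnit.MINUTES);
            System.out.println("TPS: " + ops.get() / 60.0);
        }
    }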

Figure 2. Image read analysis

In the storage system built on Hadoop, read TPS (read operations per second) increases with the number of threads, but the growth gradually slows down and first peaks when the thread count reaches 70; adding further read threads beyond that no longer yields stable TPS growth. The traditional NFS storage system requires multiple IO operations per picture read, so it reaches its load limit once the thread count reaches 40. The new Hadoop-based storage system reduces the number of IO operations, so under the same machine configuration it can sustain more concurrent read operations.

We then analyzed random picture writes. The picture sizes are again distributed between 2 KB and 10 MB, covering both small and large pictures. After the clients issue concurrent requests, HDFS stores the pictures. The effect is shown in Figure 3.

Figure 3. Image write analysis

Write TPS peaks at a thread count of 60; when additional write threads are added beyond that point, TPS no longer grows stably. The write performance of the new Hadoop-based storage system is superior to that of NFS, which ensures the high throughput of the system.

Analysis of the multi-threaded concurrent read and write curves shows that the new Hadoop-based storage system optimizes the storage procedure, reduces the number of IO accesses and improves the carrying capacity of the system. In short, its overall storage performance is better than that of NFS systems. Especially in the face of highly concurrent read and write operations, the new HDFS-based storage system guarantees the stability of the system and the security of the data. We can therefore conclude that the new Hadoop-based storage system is able to solve the problem of massive image storage.

CONCLUSIONS

This paper designs and implements a massive image and data storage platform based on Hadoop, adopting Linux cluster technology and parallel distributed database technology. It employs a series of image and data management techniques, including the HDFS distributed file system, the Map/Reduce parallel computation model and HBase database technology. The paper also adopts Redis and HAProxy to build the cache and the load balancing, which brings the whole system to a stable and healthy condition. When the platform is built on many cheap commodity computers, the requirement of efficient image and data management can be met. The results of the platform implementation show that the system extends well and is easy to maintain, so the technical route and design method adopted by the system are effective and feasible.
