Overview of current techniques in remote data auditing

Abstract The emergence of cloud computing opens up new possibilities for both individuals and organizations, owing to advantages that are unprecedented in IT history: on-demand self-service, ubiquitous network access, location-independent resource pooling, rapid resource elasticity, usage-based pricing, and transference of risk. Many individuals and organizations relieve the pressure on their local data storage, and reduce its maintenance overhead, by outsourcing data to the cloud. However, outsourced data is not absolutely safe in the cloud. To enhance users' confidence in the integrity of their outsourced data, promote the rapid deployment of cloud data storage services, and regain security assurances about the dependability of outsourced data, many researchers have designed Remote Data Auditing (RDA) techniques, which enable public auditability for data outsourced to the cloud. RDA is a useful technique for ensuring the correctness of data outsourced to cloud servers. This paper presents a comprehensive survey of remote data auditing techniques for cloud servers. To present a taxonomy, remote auditing approaches are categorized into three classes: replication-based, erasure coding-based, and network coding-based. This paper also explores the major open issues.


Introduction
Cloud computing has been envisioned as the next-generation architecture of the IT enterprise, owing to a long list of advantages that are unprecedented in IT history: on-demand self-service, ubiquitous network access, location-independent resource pooling, rapid resource elasticity, usage-based pricing, and transference of risk (see Mell and Grance [1], Buyya et al. [2], Wang et al. [3] and [4]). With resource virtualization, the cloud can deliver computing resources and services in a pay-as-you-go mode, which is expected to become as convenient and as commonly used as daily-life utilities such as electricity, gas, water and telephone in the near future (Mell and Grance [5], and [6,7]). Many international Internet giants now provide cloud computing services, and cloud computing has become a trend. At present, however, cloud computing faces many problems, among which security is one of the most important. The architecture of cloud data storage service is illustrated in Fig. 1. Because the cloud service provider is a separate entity, storing data in the cloud in fact amounts to giving up actual control of the data (see Wang et al. [3]). As a result, data stored in the cloud is exposed to hidden security risks, for the following reasons. First, although the infrastructures under the cloud are much more powerful and reliable than the client's hardware, they still face a broad range of both internal and external threats to data integrity (Subashini and Kavitha [8]). Recent incidents at Amazon, Google, and other cloud service providers have intensified these concerns, for example Amazon S3's downtime [9], Gmail's mass email deletion incident (see Arrington and Disaster [10]), and Apple MobileMe's post-launch downtime (see Krigsman [11]). Second, driven by their own interests, some cloud service providers (CSPs) behave dishonestly toward cloud customers regarding the status of their outsourced data, for example for monetary or reputational reasons (Juels et al. [12]). In short, the integrity of data stored on a cloud server is difficult to guarantee, even though storing data in the cloud is economically attractive. This problem, if not properly addressed, may impede the successful deployment of the cloud architecture. Conventional integrity verification methods are inapplicable in cloud computing because data owners no longer physically possess the storage that holds their data (refer to Ateniese et al. [13]). Downloading the whole outsourced data set for verification is also impractical. Designing a suitable audit mechanism that can remotely verify the correctness of outsourced data is therefore necessary.
To prove the correctness of outsourced data in cloud storage reliably and efficiently, a remote data auditing service comprises a set of purpose-designed protocols. Remote data auditing frameworks verify the outsourced data while requiring the auditor to access only a small fragment of the whole data. A remote data auditing technique must consider the following properties: (a) Efficiency: to audit the data with the least possible computational complexity; (b) Public Verifiability: to allow the auditing process to be delegated to a trustworthy Third Party Auditor (TPA) instead of the client; (c) Frequency: to allow the verification process to be repeated as frequently as possible with different challenge messages; (d) Detection probability: the probability of detecting a potential data corruption; (e) Recovery: the ability to restore corrupted data to its original state; and (f) Dynamic Update: to remain able to audit the data while the cloud user performs delete, modify, and append operations on the outsourced file without retrieving the entire uploaded data (refer to Wang et al. [3]).
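The detection-probability property can be made concrete with the standard sampling argument. The sketch below assumes that the server has corrupted some of the n blocks and that the auditor challenges c distinct blocks chosen uniformly at random; the function name and parameters are illustrative and do not come from any surveyed scheme.

```python
from math import comb


def detection_probability(n_blocks: int, corrupted: int, challenged: int) -> float:
    """Probability that a random challenge of `challenged` distinct blocks
    hits at least one of the `corrupted` blocks among `n_blocks`."""
    if challenged > n_blocks - corrupted:
        return 1.0  # the sample cannot avoid every corrupted block
    # Probability that all challenged blocks fall among the intact ones.
    miss = comb(n_blocks - corrupted, challenged) / comb(n_blocks, challenged)
    return 1.0 - miss


if __name__ == "__main__":
    # 10,000 blocks with 1% corrupted: effect of the challenge size.
    for c in (100, 300, 460):
        print(c, round(detection_probability(10_000, 100, c), 4))
```

Under these assumptions the auditor reads only a few hundred blocks, yet the chance of missing a 1% corruption becomes small, which is why sampling-based schemes can avoid downloading the whole file.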
This paper presents a basis for classifying present and future developments in remote data auditing techniques for distributed cloud servers and summarizes recent remote data auditing techniques for distributed servers. It also analyzes the similarities and differences among the existing techniques and identifies significant outstanding issues for further study.

Background
The term "Cloud Computing" was inspired by the cloud symbol that is often used to represent the Internet. Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort or service provider interaction (see [5] for more details).
The cloud service models rely on a pay-as-you-go pricing model that charges users on the basis of their amount of usage and some service metrics. Unlike traditional computing, cloud computing uses virtualization to maximize computing power. By separating the logical from the physical, virtualization resolves some of the challenges faced by traditional computing (see Lynch [14]). Figure 2 illustrates the three models of cloud services: Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS).
In cloud computing, the available service models are as follows (for more details refer to [4]): (a) Infrastructure as a Service (IaaS). It provides the consumer with the capability to provision processing, storage, networks, and other fundamental computing resources, and allows the consumer to deploy and run arbitrary software, which can include operating systems and applications. The consumer controls operating systems, storage, and deployed applications, and possibly has limited control of select networking components. (b) Platform as a Service (PaaS). It provides the consumer with the capability to deploy onto the cloud infrastructure user-created or acquired applications produced using programming languages and tools supported by the provider. The consumer does not manage or control the underlying cloud infrastructure, including network, servers, operating systems, or storage, but can control the deployed applications and possibly the application hosting environment configurations. (c) Software as a Service (SaaS). It provides the consumer with the capability to use the provider's applications running on a cloud infrastructure. The applications are accessible from various client devices through a thin client interface, such as a web browser (e.g., web-based e-mail). The consumer does not manage or control the underlying cloud infrastructure, including network, servers, operating systems, storage, or even individual application capabilities, with the possible exception of limited user-specific application configuration settings.
Four deployment models have been identified for cloud architecture solutions: private, community, public, and hybrid cloud. The hybrid model binds two or more clouds together with technology that enables data and application portability (e.g., cloud bursting for load-balancing between clouds). Compared with traditional computing, cloud computing can resolve a number of identified deficiencies of traditional architectures thanks to its unique characteristics, but adopting this architecture may also introduce many problems.

Remote data auditing
Today, many individuals and organizations use cloud storage to store their data remotely and enjoy on-demand, high-quality applications and services from a shared pool of configurable computing resources, without the burden of local data storage and maintenance. Although cloud computing makes these advantages more appealing than ever, it also brings new and challenging security threats to users' outsourced data. Because cloud service providers (CSPs) are separate administrative entities, data outsourcing actually relinquishes the user's ultimate control over the fate of the data. This situation puts the correctness of the data in the cloud at risk for the following reasons. (a) Although the infrastructures under the cloud are much more powerful and reliable than personal computing devices, they still face a broad range of both internal and external threats to data integrity (see Ateniese et al. [13]). Examples of outages and security breaches of noteworthy cloud services appear from time to time (Ateniese et al. [15], Kincaid [16], and Sookhak et al. [17]). (b) CSPs have various motivations to behave unfaithfully toward cloud users regarding the status of their outsourced data. For example, a CSP might reclaim storage for monetary reasons by discarding data that has not been or is rarely accessed, or even hide data loss incidents to maintain its reputation (Wang et al. [18], Ateniese [19], and Shah et al. [20]). In short, although outsourcing data to the cloud is economically attractive for long-term, large-scale storage, it does not by itself offer any guarantee of data integrity and availability. This problem, if not properly addressed, may impede the success of the cloud architecture.
RDA schemes for distributed cloud servers consist of four main entities, as follows: (1) Data Owner (DO): the person who uploads his/her data to the cloud. (2) Cloud Service Provider (CSP): the entity that has a significant amount of computing resources and stores and manages the DO's data. The CSP is also responsible for managing the cloud servers. (3) Third Party Auditor (TPA): to alleviate the computation burden on the data owner's side, the auditing process is often assigned to a TPA with adequate skills and capabilities to accomplish the auditing task on behalf of the data owner. The TPA's role is particularly important when data owners possess relatively weak computing devices in terms of processing power, storage space and bandwidth. Although the TPA is regarded as a trusted and reliable entity, it might be inquisitive at the same time. Consequently, one significant countermeasure during data auditing is to prevent the TPA from obtaining knowledge of the data owner's data content and so protect the privacy of the data. (4) User (individual or enterprise): an entity that is enrolled and authenticated by the DO and permitted a pre-determined type of access to the outsourced data (refer to Sookhak et al. [17]). The architecture of RDA when a TPA is involved is shown in Figure 3.

The research status
To protect the integrity of outsourced data, researchers have put forward many solutions. This section introduces the current RDA methods for distributed storage systems.

Replication-based Remote Data Auditing
In an unreliable storage system, redundancy plays an important role in improving reliability when data failures happen. An effective way to deal with data failures is replication, in which multiple copies of the data are outsourced within the distributed storage system (see Ying and Vlassov [21]). When data is corrupted, the client can use an intact copy of the file, of size |f|, from any of the r unaffected servers. However, replication requires a storage space of r|f| (see Agrawal and Jalote [22]).
Although the method is simple to implement, there is no strong evidence that the cloud actually stores multiple copies of the data files. In other words, in replication-based storage systems, the client does not know whether the server really holds multiple backups of the data (see Chen et al. [23]). For example, in peer-to-peer networks, servers perform a freeloading attack with the aim of using disproportionately more of the system's resources without contributing a proportionate quantity of resources back to their peers (refer to Osipkov et al. [24] and Ateniese et al. [19]). As a result, the data owners suffer a decline in the durability and availability of the file, while the CSP has more storage space to sell to users.
A simple way to overcome these problems is to let the user apply a single-copy Provable Data Possession (PDP) method t times, once for each of t different servers. However, because an identical copy of the file is stored on all t servers, the servers can collude and pretend that t copies are stored while only a single copy is stored in reality (see Chen et al. [23]). In addition, the computational cost of this method is too high, so the method is inapplicable to distributed storage systems, particularly for larger values of t.
Curtmola et al. proposed a provably secure scheme, called Multiple Replica Provable Data Possession (MR-PDP), which was the first to address the collusion attack in replication-based schemes. In this scheme, the client encrypts the original file and masks the blocks of the encrypted file with a random value (Ru) for each replica to generate t unique replicas (Rs) (see Barsoum and Hasan [25]). The client then creates a tag using a decrypted file for each block.
In the tag construction, T_i is the tag for the i-th data block b[i], n is the number of blocks, d and a second secret value form the client's private key, and g is the public key. The user then checks the data held by the servers through random sampling. From this description, the method is suitable for checking the integrity of outsourced data on distributed servers. However, the method supports only private verification, and updating the file generates a large computation and communication overhead on the client and the server (see Barsoum and Hasan [25]).
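The replica-masking step of MR-PDP described above can be sketched as follows. This is only an illustration under stated assumptions: XOR with an HMAC-SHA256 keystream stands in for the scheme's own masking operation, the blocks are assumed to be already encrypted, and the names `prf_mask`, `make_replica` and the key value are hypothetical.

```python
import hashlib
import hmac


def prf_mask(key: bytes, replica: int, block: int, length: int) -> bytes:
    """Derive a pseudo-random mask for (replica, block) with HMAC-SHA256."""
    out = b""
    counter = 0
    while len(out) < length:
        msg = f"{replica}:{block}:{counter}".encode()
        out += hmac.new(key, msg, hashlib.sha256).digest()
        counter += 1
    return out[:length]


def make_replica(key: bytes, enc_blocks: list[bytes], replica: int) -> list[bytes]:
    """Mask every (already encrypted) block to obtain one unique replica."""
    return [
        bytes(a ^ b for a, b in zip(blk, prf_mask(key, replica, i, len(blk))))
        for i, blk in enumerate(enc_blocks)
    ]


# XOR masking is its own inverse, so unmasking reuses make_replica.
unmask_replica = make_replica

if __name__ == "__main__":
    key = b"hypothetical-client-secret"
    enc_blocks = [b"block-0-ciphertext", b"block-1-ciphertext"]
    r1 = make_replica(key, enc_blocks, 1)
    r2 = make_replica(key, enc_blocks, 2)
    print(r1 != r2, unmask_replica(key, r1, 1) == enc_blocks)
```

Because each replica depends on a secret that the servers do not hold, a server that keeps only one replica cannot derive another on demand, which is the intuition behind resisting collusion.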
A scheme named Efficient Multi-Copy Provable Data Possession (EMC-PDP) was proposed by Barsoum and Hasan. This scheme is based on Boneh-Lynn-Shacham (BLS) homomorphic linear authenticators. The scheme has two versions: Deterministic (DEMC-PDP) and Probabilistic (PEMC-PDP). The deterministic version verifies all file blocks, whereas the probabilistic version checks a random sample of the file.
The main idea of the scheme is to generate unique replicas of the file by attaching a replica number to the original file. The client generates a tag for each block of each replica and distributes the tags among the different servers.
In the tag construction, F_id is a unique fingerprint for each file, generated by attaching the filename to the number of blocks and the number of replicas; i is the replica number; j is the block number; d is the client's private key; and u is a generator of the bilinear group G. Although this scheme supports auditing by a trusted third party, to update a block of the file the client must re-encrypt all replicas and upload them to the servers. Even though the scheme offers security, it produces a higher storage overhead on both the client and the server.
The Dynamic Multi-Replica Provable Data Possession (DMR-PDP) scheme was proposed by Mukundan et al. [26]. In its replica-generation equation, j is the index of the block, i is the replica number, k_j is a random number that is used to identify the replica number, r_{i,j} is a random number used by the Paillier encryption scheme, and N is the owner's public key.
Xiao et al. [27] built a replication-based remote data auditing framework for distributed systems, called Cooperative Provable Data Possession (CPDP), using a homomorphic verifiable response (HVR) and a hash index hierarchy (HIH). The HIH is a hierarchical structure with three layers: Express, Service, and Storage. They are described as follows: 1) Express Layer: offers an abstract representation of the stored resources; 2) Service Layer: offers and manages cloud storage services; 3) Storage Layer: realizes data storage on many physical devices. The homomorphic verifiable response is used to integrate multiple responses from the different CSPs in the CPDP scheme. It is the key technique of CPDP because it not only reduces the communication bandwidth but also conceals the location of the outsourced data in the distributed cloud storage environment. However, the heterogeneous structure of the proposed scheme leads to a high communication load due to the inter-communication between the various cloud servers.
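The use of Paillier encryption to derive distinct replicas of the same block, as in DMR-PDP, can be illustrated with a toy example. The sketch below is textbook Paillier with generator N + 1 and deliberately tiny primes; it is not the exact DMR-PDP replica equation, and all function names are hypothetical.

```python
import math
import secrets


def keygen(p: int, q: int):
    """Toy Paillier key from two primes (far too small for real use)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # valid because the generator is g = n + 1
    return n, (lam, mu)


def encrypt(n: int, m: int, r: int) -> int:
    """c = (1 + n)^m * r^n mod n^2, with 0 <= m < n and gcd(r, n) = 1."""
    n2 = n * n
    return (pow(1 + n, m, n2) * pow(r, n, n2)) % n2


def decrypt(n: int, priv, c: int) -> int:
    lam, mu = priv
    n2 = n * n
    return ((pow(c, lam, n2) - 1) // n) * mu % n


def random_unit(n: int) -> int:
    """Pick a random value in [2, n-1] that is coprime to n."""
    while True:
        r = secrets.randbelow(n - 2) + 2
        if math.gcd(r, n) == 1:
            return r


def make_replicas(n: int, blocks, copies: int):
    """Encrypt every block with fresh randomness once per replica."""
    return [[encrypt(n, b, random_unit(n)) for b in blocks] for _ in range(copies)]


if __name__ == "__main__":
    n, priv = keygen(1009, 1013)
    replicas = make_replicas(n, [11, 22, 33], copies=3)
    for rep in replicas:
        print(rep, [decrypt(n, priv, c) for c in rep])
```

Each replica is a different ciphertext of the same plaintext blocks, so the servers hold distinguishable copies while the owner can still recover the original file from any one of them.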

Erasure Coding-based Remote Data Auditing
The erasure code technique is based on Maximum Distance Separable (MDS) codes, which are a form of data repair technique. Compared to replication, erasure coding provides more reliability for the same redundancy. As shown in Figure 5, when the corruption of a file block is detected, the client can use the remaining intact blocks to recompute the corrupted block. Figure 5 shows an original input file of three blocks (b[1], b[2], b[3]) that is encoded using the (3:2) erasure code. However, the communication overhead of this method is higher than that of the replication-based technique (see Kubiatowicz et al. [28], Changho and Ramchandran [29], and Li and Shu [30]).
The High-Availability and Integrity Layer (HAIL) scheme provides erasure coding-based remote data auditing for distributed storage systems. When a file corruption is detected on a server, the client recovers the corrupted block using the Proofs of Retrievability scheme. The scheme consists of three parts: (a) Dispersal code: the file is distributed to the servers using the erasure code technique. (b) Server code: when the servers receive the data blocks, the blocks are encoded using an error-correcting code to protect them. (c) Aggregation code: the responses from all of the servers to the received challenge, including the multiple Message Authentication Codes, are combined into a single composite Message Authentication Code. Figure 6 shows the dispersal coding technique.
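The repair idea behind erasure coding, recomputing a lost block from the intact ones, can be sketched with the simplest MDS code. The example assumes a single-parity code in which k = 2 data blocks are extended to n = 3 stored blocks; the layout in Figure 5 may differ, and the function names are illustrative.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(d1: bytes, d2: bytes) -> list[bytes]:
    """(n=3, k=2) single-parity code: store d1, d2 and d1 XOR d2."""
    assert len(d1) == len(d2)
    return [d1, d2, xor_bytes(d1, d2)]


def repair(stored: list) -> list[bytes]:
    """Rebuild the one missing block (marked None) from the other two."""
    missing = [i for i, blk in enumerate(stored) if blk is None]
    if len(missing) > 1:
        raise ValueError("a (3,2) code tolerates only one lost block")
    if missing:
        a, b = (blk for blk in stored if blk is not None)
        stored = list(stored)
        # The XOR of all three stored blocks is zero, so any one block
        # equals the XOR of the other two.
        stored[missing[0]] = xor_bytes(a, b)
    return stored


if __name__ == "__main__":
    blocks = encode(b"first half", b"second hal")
    damaged = [blocks[0], None, blocks[2]]
    print(repair(damaged) == blocks)
```

This code stores 1.5|f| and survives the loss of any one block, whereas replication needs 2|f| for the same tolerance, which illustrates the redundancy advantage claimed for erasure coding.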
The Secure Distributed Storage (SDS) scheme is based on homomorphic tokens and the Reed-Solomon erasure-correcting code, and guarantees data integrity and availability in an erasure-coded distributed storage system (see Wang [31]). Compared to previous methods, its most distinctive feature is support for error localization. The scheme consists of four main parts, as follows: (a) Data Dispersal: the Reed-Solomon erasure code is used to distribute the original file to the servers, and the blocks are encrypted. (b) Token Pre-computation: before the file is distributed to the servers, tokens are generated from random subsets of the data blocks. The client randomly selects r sectors $\{I_k\}_{k=1}^{r}$ of the j-th block of the extended client file G using a pseudo-random permutation keyed with $K_{chal}$, generates r coefficients $\{\alpha_k\}_{k=1}^{r}$ using a pseudo-random function keyed with $K_{coe}$, and calculates the linear combination of the sectors as the verification token of block j: $v_j = \sum_{k=1}^{r} \alpha_k \, G_j[I_k]$, where $v_j$ is the verification token of block j and $G_j[I_k]$ is sector $I_k$ of the j-th block of the extended client file G. (c) Error Localization: to identify malicious servers, the client reveals the coefficients $\{\alpha_k\}_{k=1}^{r}$ and the pseudo-random permutation key $K_{chal}$ and requests the servers to compute the linear combination of the specified block. (d) Dynamic Update: the scheme supports dynamic update operations by the client, such as updates, deletes, appends, and inserts. The client uses the homomorphic token to update the changed blocks without retrieving the other blocks.
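Token pre-computation and error localization in SDS can be sketched as below. The sketch makes several simplifying assumptions that are not from the source: sectors are integers in a prime field of size 2^61 - 1, HMAC-SHA256 stands in for both the pseudo-random permutation and the pseudo-random function (so challenged indices may repeat), and the names `challenge`, `token` and `locate_errors` are hypothetical.

```python
import hashlib
import hmac

P = 2**61 - 1  # prime modulus standing in for the scheme's finite field


def prf_int(key: bytes, label: str, modulus: int) -> int:
    """Keyed pseudo-random integer in [0, modulus)."""
    digest = hmac.new(key, label.encode(), hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % modulus


def challenge(k_chal: bytes, k_coe: bytes, round_id: int, r: int, n_sectors: int):
    """Derive r sector indices and r non-zero coefficients for one audit round."""
    idx = [prf_int(k_chal, f"{round_id}:{k}", n_sectors) for k in range(r)]
    coef = [prf_int(k_coe, f"{round_id}:{k}", P - 1) + 1 for k in range(r)]
    return idx, coef


def token(block: list[int], idx, coef) -> int:
    """v_j = sum_k alpha_k * G_j[I_k] mod P for one block."""
    return sum(a * block[i] for i, a in zip(idx, coef)) % P


def locate_errors(expected: dict, responses: dict) -> list:
    """Return the servers whose response differs from the stored token."""
    return [srv for srv, v in expected.items() if responses.get(srv) != v]


if __name__ == "__main__":
    k_chal, k_coe = b"hypothetical-chal-key", b"hypothetical-coe-key"
    servers = {j: [(j + 1) * s % P for s in range(1, 65)] for j in range(3)}
    idx, coef = challenge(k_chal, k_coe, round_id=0, r=8, n_sectors=64)
    expected = {j: token(blk, idx, coef) for j, blk in servers.items()}
    servers[1] = [(s + 1) % P for s in servers[1]]  # server 1 corrupts its block
    responses = {j: token(blk, idx, coef) for j, blk in servers.items()}
    print(locate_errors(expected, responses))
```

The client keeps only the short tokens; during an audit it reveals the indices and coefficients, each server recomputes the same linear combination over its block, and a mismatch points to the misbehaving server.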

Network Coding-based Remote Data Auditing
Network coding can reduce the communication overhead of the repair process. When a block failure is detected, a new data block is created during the repair process as a linear combination of the data blocks stored across the intact servers. Each coded block is itself generated as a linear combination of the data blocks of the original file, as shown in Figure 7.
Anh and Markopoulou [36] built an RDA method that relies on a homomorphic Message Authentication Code scheme and a customized encryption scheme. The homomorphic Message Authentication Code cryptosystem, in which the client and server share a secret key, protects the scheme against pollution attacks. The cryptosystem consists of three parts: sign, combine, and verify. The sign algorithm generates a tag for each block, and the homomorphic property is then used to generate a linear combination of the tags of the blocks. These characteristics reduce the communication and computation overhead to some extent. To reduce the communication and computation burden on the user, the verification tag of a new data block is computed by combining the verification tags of the intact blocks, so the client does not need to generate a verification tag for the new data block. Figure 8 shows the scheme, which includes setup, challenge, proof, and verify steps.
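The sign, combine, and verify steps described above can be sketched with a toy linearly homomorphic MAC. This is a minimal illustration, not the construction of [36]: the tag is a plain inner product with a secret key over an assumed prime field, without the per-block pseudo-random term that a real homomorphic MAC adds, and all names are hypothetical.

```python
import secrets

P = 2**61 - 1  # assumed prime field for blocks, coefficients and tags


def keygen(block_len: int) -> list[int]:
    """Secret key: one non-zero field element per block position."""
    return [secrets.randbelow(P - 1) + 1 for _ in range(block_len)]


def sign(key: list[int], block: list[int]) -> int:
    """Tag of a block: inner product with the key, which is linear in the block."""
    return sum(k * x for k, x in zip(key, block)) % P


def combine_blocks(blocks: list[list[int]], coeffs: list[int]) -> list[int]:
    """New coded block as a linear combination of intact blocks."""
    return [
        sum(c * b[i] for c, b in zip(coeffs, blocks)) % P
        for i in range(len(blocks[0]))
    ]


def combine_tags(tags: list[int], coeffs: list[int]) -> int:
    """Tag of the combined block, computed without the secret key."""
    return sum(c * t for c, t in zip(coeffs, tags)) % P


def verify(key: list[int], block: list[int], tag: int) -> bool:
    return sign(key, block) == tag


if __name__ == "__main__":
    key = keygen(4)
    intact = [[1, 2, 3, 4], [5, 6, 7, 8]]
    tags = [sign(key, b) for b in intact]
    coeffs = [3, 11]  # repair coefficients chosen by the servers
    new_block = combine_blocks(intact, coeffs)
    new_tag = combine_tags(tags, coeffs)
    polluted = [new_block[0] + 1] + new_block[1:]
    print(verify(key, new_block, new_tag), verify(key, polluted, new_tag))
```

During repair the servers combine both blocks and tags with the same coefficients, so the client never has to tag the new block itself, while a block altered in a pollution attack no longer matches its combined tag.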

Existing issues
This section highlights some of the most important issues and challenges in applying remote data storage auditing approaches, as directions for future research.

Dynamic Data Update
In static data auditing methods, the client must download the whole outsourced data from the cloud and upload it again after performing an update operation such as modify, delete, insert, or append. Supporting dynamic data update is an important characteristic of RDA methods for both single and distributed cloud servers, since many common applications, such as online word processing, intrinsically deal with dynamic data or involve dynamic log files. If the auditing approach supports dynamic data update, the client only needs to download the blocks that are to be updated. Such a feature therefore reduces the computation and communication cost of updating data on the client and the servers.
Although Cash et al. [37] put forward a method to address the dynamic data update issue in cloud computing, its lack of public verification makes the method impractical. In addition, current dynamic data update methods impose high storage and computation overhead on the data owner. As a result, enabling users to update their outsourced data efficiently and dynamically requires further research and development.

Batch Auditing
Batch auditing lets the TPA process multiple auditing tasks received from different users at the same time rather than performing the tasks one by one. In other words, a batch auditing mechanism uses the linear sum of the random blocks to shorten the proof message and thus reduce the corresponding communication overhead. Because of the characteristics of current RDA methods, addressing batch auditing is more difficult in distributed storage systems, and only a few current RDA methods focus on it (refer to Agrawal and Boneh [38]). A simple way to achieve this goal is to use a bilinear aggregate signature to combine the proof messages into a single signature that can be verified by the auditor.

Privacy preserving
When data owners outsource data to the remote cloud or delegate the auditing task to a third party, most schemes assume that the TPA is a trustworthy agent. This assumption is not realistic and increases the possibility of data leaks. Data owners do not want the TPA to learn the details of their data content (see Mandagere et al. [39] and Meyer and Bolosky [40]). Solving this problem effectively is very important for the development of RDA technology.

Large-Scale Data Auditing
In the era of big data, billions of files are stored in the cloud. Global data volume was about 4.4 ZB in 2013, around 6.2 ZB in 2014, and around 8.6 ZB in 2015; it is projected to be around 12 ZB in 2016 and to reach 40 ZB in 2020. As data volumes grow exponentially, existing RDA techniques face great challenges in communication, storage and computation cost on both the auditor side and the provider side. More and more big data applications, such as Facebook, use the cloud to store massive amounts of data. Although each Facebook user produces only a small amount of text and image data every day, user behavior records are constantly updated, and user behavior is very important in the era of big data (see Sakr et al. [41] and Naone [42]). When a large number of data owners each modify even a single bit of their outsourced data, such dynamic data makes the auditing problem worse.

Conclusion
In recent years, auditing outsourced data in cloud computing has gained more attention. Existing RDA approaches accomplish the data checking process in diverse ways and focus on different aspects: several approaches audit the integrity of outsourced data, a number of approaches focus on error recovery, and the rest also check data ownership. The final goal of RDA is to guarantee the integrity and privacy of outsourced data in single and distributed clouds.
In this paper, we explained the concept of cloud computing and discussed the different techniques used to guarantee data integrity and privacy in cloud computing. We also discussed the issues in current RDA approaches and analyzed the methods in detail. Cloud computing is a rapidly developing, emerging industry with broad prospects, but security remains a key factor obstructing its development. Solving these problems requires more attention and more contributions from researchers.

Fig. 1 The architecture of cloud data storage service.
The four cloud deployment models are: (a) Private cloud. The cloud infrastructure is operated for a private organization. It may be managed by the organization or a third party, and may exist on premise or off premise. (b) Community cloud. The cloud infrastructure is shared by several organizations and supports a specific community that has communal concerns (e.g., mission, security requirements, policy, and compliance considerations). It may be managed by the organizations or a third party, and may exist on premise or off premise. (c) Public cloud. The cloud infrastructure is made available to the general public or a large industry group and is owned by an organization selling cloud services. (d) Hybrid cloud. The cloud infrastructure is a composition of two or more clouds (private, community, or public) that remain unique entities but are bound together by standardized or proprietary technology, which enables data and application portability (e.g., cloud bursting for load-balancing between clouds).

Fig. 2 Different service delivery models in cloud computing.
The DMR-PDP scheme verifies the integrity of multiple copies. It uses Paillier encryption to generate distinct copies of the original file F = b[1], b[2], ..., b[m]; the process is shown in Figure 4.

Fig. 4 Generating a Unique Replication in DMR-PDP Scheme.

Fig. 6 Encoding of the Original File and Outsourcing to the Servers in HAIL Scheme.
For example, given a file of m blocks (F = b[1], b[2], ..., b[m]) and a coding coefficient vector of m random values (v_1, v_2, ..., v_m), a coded block is computed as the linear combination $c = \sum_{i=1}^{m} v_i \, b[i]$.
Fig. 7 A network coding-based distributed storage system for an original file of three blocks (b[1], b[2], b[3]).
Network coding-based RDA methods must address the following four issues (Juels et al. [12], Zhang et al. [34] and Oliveira et al. [35]): (i) error localization, (ii) loss of a fixed layout of the file, (iii) replay attack, and (iv) pollution attack.

Fig. 8 Network Code Remote Data Auditing Scheme.