Abstract
Cloud data provenance is metadata that records the history of the creation and operations performed on a cloud data object.
Secure data provenance is crucial for data accountability, forensics and privacy.
In this paper, we propose a decentralized and trusted cloud data provenance architecture using blockchain technology.
Blockchain-based data provenance can provide tamper-proof records, enable the transparency of data accountability in the cloud, and help to enhance the privacy and availability of the provenance data.
We make use of the cloud storage scenario and choose the cloud file as a data unit to detect user operations for collecting provenance data.
We design and implement ProvChain, an architecture to collect and verify cloud data provenance, by embedding the provenance data into blockchain transactions.
ProvChain operates mainly in three phases: (1) provenance data collection, (2) provenance data storage, and (3) provenance data validation.
Results from performance evaluation demonstrate that ProvChain provides security features including tamper-proof provenance, user privacy and reliability with low overhead for the cloud storage applications.
Keywords-Data provenance, Blockchain, Cloud Computing, Privacy, Reliability, Blockchain Cloud.
I. INTRODUCTION
Cloud computing is widely adopted in commercial and military environments to support data storage, on-demand computing and dynamic provisioning. Cloud computing environments are distributed and heterogeneous, with a diversity of software and hardware components provided by different vendors, which may introduce vulnerabilities and incompatibilities. The security assurance of intra-cloud and inter-cloud data management and transfer therefore arises as a key issue. Cloud auditing can only be effective if all operations on the data can be tracked reliably. Provenance is a process that determines the history of a data product, starting from its original sources [1]. Assured provenance data can help detect access violations within the cloud computing infrastructure. However, developing assured data provenance remains a critical issue for cloud storage applications. Besides, provenance data may contain sensitive information about the original data and the data owners. Hence, there is a need not only to secure the cloud data but also to ensure the integrity and trustworthiness of provenance data. State-of-the-art cloud-based provenance services are vulnerable to accidental corruption or malicious forgery of provenance data [2].
Blockchain technology has attracted interest because it offers a shared, distributed and fault-tolerant database in which every participant in the network can take part. It nullifies adversaries by harnessing the computational capabilities of the honest nodes, and the information exchanged is resilient to manipulation. A blockchain network is a distributed public ledger where every transaction is witnessed and verified by network nodes. Blockchain’s decentralized architecture can be leveraged to develop an assured data provenance capability for cloud computing environments. In a decentralized architecture, every node participates in the network to provide services, thereby providing better efficiency. Availability is also ensured because of blockchain’s distributed characteristics. Since a centralized authority is frequently used in cloud services, there is a need to safeguard personal data while maintaining privacy. With a blockchain-based cloud data provenance service, all data operations are transparently and permanently recorded. Thus, trust between users and cloud service providers can easily be established. Furthermore, maintaining provenance can help improve the trust of cloud users toward cyber-threat information sharing [3] [4] to enable proactive cyber defense at a reduced security investment [5] [6].
In this paper, we present ProvChain, a blockchain-based data provenance architecture that provides assurance of data operations in a cloud storage application while enhancing privacy and availability at the same time. ProvChain records the operation history as provenance data, which is hashed into Merkle tree nodes [7]. A list of hashes of provenance data constitutes a Merkle tree, and the tree root node is anchored to a blockchain transaction. A list of blockchain transactions forms a block, and the block needs to be confirmed by a set of nodes in order to be included in the blockchain. An attempt to modify a provenance data record requires an adversary to locate the transaction and the block. Blockchain’s underlying cryptographic theory allows a block record to be modified only if the adversary can present a longer chain of blocks than the chain held by the rest of the miners, which is quite difficult to achieve. By leveraging the global-scale computing power of the blockchain network, blockchain-based data provenance can provide integrity and trustworthiness. In our architecture, we keep only the hashed identity of users in order to protect their privacy from the rest of the nodes in the blockchain network. The rest of the paper is organized as follows. Section II provides an overview of state-of-the-art data provenance efforts and blockchain technology. Section III describes the design of ProvChain, our blockchain-based data provenance architecture. The detailed implementation is given in Section IV. Performance evaluation of ProvChain is presented in Section V. Finally, we conclude in Section VI.
II. BACKGROUND AND RELATED WORK
A. Data provenance
Data provenance is critical for cloud computing system administrators who need to debug break-ins to the system or network. Cloud computing environments are typically characterized by data transfers between diverse system and network components. These data exchanges could take place within a data center or across federated data centers. The data does not usually follow the same path, due to multiple copies of the data and the diversity of paths taken to ensure resilience. This design makes it more difficult for administrators to accurately identify the origin of an attack, the software and/or hardware components that caused the attack, and the impact of the attack. Security violations need to be identified at a fine granularity, and provenance can assist. Current state-of-the-art provenance systems in the cloud support the above tasks through logging and auditing technologies. These technologies are not effective in cloud computing systems, which are complex in nature, due to several layers of interoperating software and hardware components spread across geographical and organizational boundaries. Identifying the origin, cause and impact of security violations in cloud infrastructures requires the collection of forensics and logs from diverse and disparate sources, which is an insurmountable task. At the same time, logs only provide a sequential history of actions related to every application. Provenance data provides the history of the origins of all changes to a data object, the list of components that have either forwarded or processed the object, and the users who have viewed and/or modified the object. It therefore has enhanced requirements for assurance.
Researchers have presented several data provenance related efforts. PASS is the first scheme to address the collection and maintenance of provenance data at the operating system level [8]. A file provenance system [9] was proposed to collect provenance data by intercepting file system calls below the virtual file system, which requires changes to operating systems. For cloud data provenance, S2Logger [10] was developed as an end-to-end data tracking tool which provides both file-level and block-level provenance in kernel space. In addition to data provenance techniques and tools, the security of provenance data and user privacy has also been explored. Asghar et al. [11] proposed a secure data provenance solution in the cloud, which adopts a two-fold encryption method to enhance privacy, albeit at a higher computation cost. SPROVE [12] protects provenance data confidentiality and integrity using encryption and digital signatures. However, SPROVE does not possess provenance data querying capability. Progger [13] is a kernel-level logging tool which can provide log tamper-evidence at the expense of user privacy. There are also efforts that use provenance data for managing cloud environments, such as discovery of usage patterns for cloud resources, popularized resource reuse and fault management [14].
B. Blockchain
Blockchain technology has attracted tremendous interest from a wide range of stakeholders, including finance, healthcare, utilities, real estate and government agencies. Blockchains are shared, distributed and fault-tolerant databases that every participant in the network can share, but no entity can control. The technology is designed to operate in a highly contested environment against adversaries who are determined to compromise it. Blockchains assume the presence of adversaries in the network and nullify adversarial strategies by harnessing the computational capabilities of the honest nodes, so that the information exchanged is resilient to manipulation and destruction. The reconciliation process between entities is sped up due to the absence of a trusted central authority or intermediary. Tampering with blockchains is extremely challenging due to the use of a cryptographic data structure and no reliance on secrets. Blockchain networks are fault tolerant, which allows nodes to eliminate compromised nodes. Despite this, several vulnerabilities exist [15] that could potentially disrupt the integrity of a blockchain. However, exploiting them requires the malicious node to have enormous computational power, which may not even be cost-effective.
The decentralization and security characteristics of blockchain have attracted researchers to develop various applications such as smart contracts, distributed DNS, and identity management. Besides Bitcoin, Ethereum [16] is also designed on top of a public blockchain, for simple and quick development of decentralized applications with a per-address transaction model. Multichain [17] provides an open-source permissioned blockchain network, where developers can host their blockchain on a private cloud architecture. Multichain uses a per-output transaction model and can handle high throughput [18]. Tierion [19] provides a platform for uploading and publishing data records into the blockchain network. With public APIs available, Tierion is convenient for integrating applications that need a blockchain. Developers can post metadata to the Tierion data store using HTTP requests and fetch record information. Each data record has a record ID which can be used to retrieve the blockchain receipt generated from the blockchain transactions. The blockchain receipt contains the transaction ID, which is used to locate a transaction and the block that hosts the transaction. In this way, the data record posted on the blockchain cannot be tampered with and its integrity is assured.
The Blockstack Labs from Princeton University proposed a decentralized PKI service on top of Namecoin and a blockchain based naming and storage system [20]. Blockchain application in information-centric network for name based security of content distribution has also been proposed [21]. Enigma is a decentralized computation platform with guaranteed privacy which uses blockchain network to control the network, manage access control and identity, and create tamper-proof log of events [22]. Guardtime provides industrial-scale blockchain services using Keyless Signature Infrastructure (KSI) and secure one-way hash function, which is quantum-immune in contrast to RSA [23]. Guardtime also proposed a blockchain standard for digital identity and a protocol for authentication and digital signature which provides a simplified mechanism for revocation management and long-term validity [24].
III. PROVCHAIN ARCHITECTURE
ProvChain is a data provenance architecture built on a blockchain which will provide the ability to audit data operations for cloud storage. ProvChain achieves the following four objectives.
• Real-time Cloud Data Provenance:
User operations are monitored in real time to collect provenance data, which will further support access control policy enforcement [25] and intrusion detection.
• Tamper-proof Environment:
A data provenance record is collected and then published to the blockchain network, which protects the provenance data. All data on the blockchain is shared among the nodes. ProvChain builds a public time-stamped log of all user operations on cloud data without the presence of a trusted third party. Every provenance entry is assigned a blockchain receipt for future validation.
• Enhanced Privacy Preservation:
A data provenance record is associated with a hashed user ID to preserve privacy, so that blockchain network nodes cannot correlate data records with a specific user. The provenance auditor can access provenance data owned by the user but can never identify the true owner. Only the service provider can link each record with the owner of the record data.
• Provenance Data Validation:
Data provenance record is published globally on blockchain network, where a number of blockchain nodes provide confirmation for every block. ProvChain uses blockchain receipt to validate every provenance data entry.
To achieve the above objectives, we adopt the below methods to design ProvChain’s architecture.
• Monitor user activities in real time using hooks and listeners so that every user operation on files will be collected and recorded for generating provenance data.
• Store all hashed data operations in form of blocks in the blockchain. Every node on the blockchain can verify the operations by mining the block so that data provenance is authentic and tamper-proof.
• Hash the user ID while publishing provenance data so that the blockchain network and the provenance auditor cannot determine user identity and the data operations.
• Provenance auditor validates provenance data by retrieving transactions from the blockchain network by using blockchain receipt which contains block and transaction information.
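The user ID hashing step can be sketched as follows. This is a minimal illustration, not ProvChain's actual implementation: the text only states that the user ID is hashed and that the service provider can map hashes back to users, so the keyed hash (HMAC-SHA-256), the `PROVIDER_SECRET` value, and the `UserDirectory` class are all assumptions introduced here.

```python
import hashlib
import hmac
from typing import Dict, Optional

# Hypothetical secret held only by the cloud service provider (an assumption;
# the paper does not specify how the hash is keyed or salted).
PROVIDER_SECRET = b"example-secret-held-by-the-csp"


def hash_user_id(user_id: str, secret: bytes = PROVIDER_SECRET) -> str:
    """Return a keyed SHA-256 digest that stands in for the real user ID."""
    return hmac.new(secret, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


class UserDirectory:
    """Provider-side lookup table from hashed ID back to the registered user."""

    def __init__(self) -> None:
        self._by_hash: Dict[str, str] = {}

    def register(self, user_id: str) -> str:
        """Store the mapping and return the pseudonym that gets published."""
        hashed = hash_user_id(user_id)
        self._by_hash[hashed] = user_id
        return hashed

    def resolve(self, hashed: str) -> Optional[str]:
        """Only the provider, which holds this table, can recover the user."""
        return self._by_hash.get(hashed)


directory = UserDirectory()
published_id = directory.register("test")  # value that would appear on-chain
```

Under these assumptions, the blockchain network and the provenance auditor only ever see `published_id`, while the provider can call `directory.resolve(published_id)` to link a record to its owner.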
A. Architecture Overview
An overview of ProvChain architecture is illustrated in Figure 1. Following are the critical components of ProvChain.
• Cloud User. A user, who owns its data and has sharing relationship on other users’ data, may opt for the provenance service, where the provenance data is stored on the public blockchain. Data changes made by the user can be monitored and validated by blockchain nodes, but the nodes may not know about details of other users’ activities. The provenance data will not expose real user identity.
• Cloud Service Provider (CSP). The cloud service provider offers a cloud storage service and is responsible for user registration. A CSP can benefit from our system in the following aspects. First, they can audit the data changes all the time, and they can learn a lot about data operations performed by all the users to better improve their service. Through provenance data, they can also detect intrusion from anomalous behaviours. Besides, they can protect their own daily records just like normal users. As far as business aspects, they can gain brand reputation from using blockchain provenance services since they provide trustworthiness.
• Provenance Database. The provenance database records all provenance data on the blockchain network, which is used for detecting malicious behaviors. All data records are anonymized.
• Provenance Auditor (PA). PA can retrieve all the provenance data from the blockchain into the provenance database and validate the blockchain receipt. The PA maintains the provenance database but cannot correlate the provenance entry to the data owner.
• Blockchain Network. The blockchain network consists of globally participating nodes. All the provenance data will be recorded in form of blocks and verified by blockchain nodes.
B. Preliminaries and Concepts
ProvChain uses cloud file as data unit and monitors file operations to provide data provenance service. After each file operation is detected, a provenance entry will be generated. The cloud service provider will upload the provenance entry onto the blockchain network. In this section, we describe the details on file provenance use case and block structure.
File Provenance Use Case. For each file, we can record activities such as file creation, file modification, file copy, file share and file delete. Examples are shown in Table I. A file X can be created by user A, which marks the origin of file X. Then user A copies file X to another location, probably for backup or other reasons. The read and write operations of user A on file X can also be recorded. If user B asks user A to share file X, there will be a record for both user A and user B. User A shares file X at a pre-defined location and user B creates a new file Y from the shared file X. User B can then operate on file Y just as user A does on file X, for example with read and write operations. If user B deletes the file, there will be a record of the deletion. At some point in time, user A may decide to make file X public, so that the file access is changed. Anyone who accesses it will also create a new file at their own location. The history of files (different versions of a file) can be backed up for future use.
Block Structure.
ProvChain uses blockchain network to provide data record verification and resist against tampering. The block structure is composed of two parts, block header and a list of transactions. The main attributes in the header are block hash, height, confirmations, nonce and Merkle root. Block hash is computed using the previous block hash and a nonce. The height represents the block index in the global blockchain network. The confirmation number of the block indicates the number of nodes that have mined this block and the nonce is used by blockchain nodes to check the integrity of the block. The Merkle root is the root of binary hash tree created out of all the transactions in a block. Transaction lists come after the block header. Each transaction has a hash, with inputs and outputs. In ProvChain, each data record is hashed into a Merkle tree node. The Merkle tree root node will be anchored to one transaction in a certain block.
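To illustrate why linking block headers by hash makes tampering detectable, the following toy sketch builds a two-block chain and checks its consistency. It is a simplification under stated assumptions: the header fields `prev_hash`, `height`, `merkle_root` and `nonce` mirror the attributes named above, but the JSON serialization, the single SHA-256 pass, and the function names are choices made here for illustration and do not reproduce any real blockchain's header format or mining procedure.

```python
import hashlib
import json
from typing import Any, Dict, List


def block_hash(header: Dict[str, Any]) -> str:
    """Hash a header deterministically (serialization format is an assumption)."""
    payload = json.dumps(header, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def make_block(prev_hash: str, height: int, merkle_root: str, nonce: int = 0) -> Dict[str, Any]:
    """Create a block whose hash commits to the previous block and the Merkle root."""
    header = {
        "prev_hash": prev_hash,
        "height": height,
        "merkle_root": merkle_root,
        "nonce": nonce,
    }
    return {"header": header, "hash": block_hash(header)}


def chain_is_consistent(chain: List[Dict[str, Any]]) -> bool:
    """Check every stored hash and every link to the preceding block."""
    for i, block in enumerate(chain):
        if block_hash(block["header"]) != block["hash"]:
            return False  # header was altered after the hash was computed
        if i > 0 and block["header"]["prev_hash"] != chain[i - 1]["hash"]:
            return False  # link to the previous block is broken
    return True


genesis = make_block("0" * 64, 0, hashlib.sha256(b"root-0").hexdigest())
second = make_block(genesis["hash"], 1, hashlib.sha256(b"root-1").hexdigest())
chain = [genesis, second]
```

Changing the Merkle root of `genesis` invalidates its stored hash, and recomputing that hash breaks the `prev_hash` link held by `second`, so an adversary would have to rebuild every later block as well.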
C. Threat Model
Here, we analyze the potential vulnerabilities in ProvChain. The cloud service provider offers data provenance service as well as cloud storage service, which allow user to store data on the cloud platform and provide the option to enable the data provenance service. The cloud service provider cannot guarantee that data records will remain unchanged due to known vulnerabilities in hypervisors and cloud operating systems. Once the data provenance service is enabled, the user will be able to trace the data and the provenance auditor is allowed to access all the provenance data. However, the provenance auditor cannot be completely trusted. The adversary can potentially access or modify user data and/or user provenance data. Since ProvChain’s main objective is to protect provenance data, we assume that user data stored on the cloud is encrypted and is not accessible to anyone without the decryption key.
D. Key Establishment
To use ProvChain, users are required to register for the service and create their credentials. For cloud storage applications, users generate data encryption key pairs to encrypt their cloud data for confidentiality. If a user wants to share a file, a data sharing key will be provided. For provenance data, the cloud service provider generates key pairs to encrypt provenance data for privacy considerations, because provenance data will be further uploaded and published to the blockchain network. We describe each key as follows.
• User Registration Key (KUR). In ProvChain, a user needs to register for the cloud storage service to store data on the cloud. We denote the registration key as KUR. Every time the user wants to operate on cloud data, the registration key is needed.
• Data Encryption Key (KDE). After registration, the user generates an encryption key KDE for encrypting all the data stored in the cloud. When a file is created, the user has the option to encrypt the file, which limits file access to key holders only.
• Data Sharing Public/Private Key Pair (PKDS, PRDS). For data sharing, a public/private key pair is generated, denoted as (PKDS, PRDS). In the common case, the private key is used to generate a signature from the owner, while the public key is used by others to verify the data ownership. When users share the data with others, they share the private key for data ownership changes.
• Provenance Verification Key (KPV). Each block on the blockchain holds several provenance data entries, and a provenance data entry is produced upon detection of a file operation. Every data operation triggers the cloud service provider to generate a key KPV to encrypt the provenance data. The key is shared with the PA if the user assigns a provenance auditor to audit the provenance data.
IV. PROVCHAIN IMPLEMENTATION
ProvChain is implemented using a three-layer architecture comprising a data storage layer, a blockchain network layer, and a provenance database layer, as shown in Figure 2. The functions of each layer are described as follows.
• Data Storage Layer.
ProvChain is implemented to support cloud storage applications. Here we use one cloud service provider but our architecture can be scaled to multiple providers.
• Blockchain Network Layer.
We use blockchain network to record each provenance data entry. Each block can record multiple data operations. Here we use file as a data unit, so we record each file operation with username and file name. File access operations include Create, Share, Change and Delete.
• Provenance Database Layer. We build an extended database locally for recording the file operation as well as querying. In ProvChain, the service provider can assign a provenance auditor to verify the data from the blockchain network. The response is a blockchain receipt that gets validated and appended in the database.
There are three phases in the life cycle of data provenance for ProvChain, namely, provenance data collection, provenance data storage, and provenance data validation.
A. Provenance Data Collection and Storage
Once a user performs actions on the data files stored in the cloud, the corresponding operations are recorded. Each operation can be represented as metadata comprising the attributes listed in Table I. Note that in this phase only the RecordID, Date and Time, Username, Filename, AffectedUser, and Action attributes are recorded. The transaction hash, block hash and validation fields are collected after the provenance auditor queries the blockchain network. The AffectedUser attribute is considered in two cases. One is data modification, in which the same user operates on the data using the data encryption key, and there are no affected users other than the user itself. The other case is data sharing, where the user shares a file with someone else. In the second case, the AffectedUser attribute in the file operation metadata includes all the users in the sharing group.
Figure 3: Provenance Data Collection and Storage.
ProvChain is built on top of an open source application called ownCloud [26] to collect the provenance data. OwnCloud provides both web-based cloud storage services and desktop client, similar to Dropbox and Google Drive, which provides user control of personal data and universal file access to all of the data seamlessly. Besides, ownCloud is flexible and developers can utilize their functions to develop various applications on top of it. In order to collect provenance data, we use hooks to listen to file operations in ownCloud web interface. After an operation is monitored, the record is generated, which will be uploaded to the blockchain network and stored in the provenance database. Figure 3 shows the architecture of our provenance data collection and storage.
For provenance data storage, we use Tierion API [19] to publish data records to blockchain network. We take file change operation as an example to demonstrate the original provenance data in JSON format as follows.
{
  "app": "files",
  "type": "file_changed",
  "affecteduser": "test",
  "user": "test",
  "timestamp": "1475679929",
  "subject": "changed_self",
  "message": "",
  "messageparams": "[]",
  "priority": "30",
  "object_type": "files",
  "object_id": "142",
  "object_name": "66.txt",
  "link": "/apps/files/"
}
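The activity entry above can be mapped onto the Table I attributes roughly as follows. This is a sketch under assumptions: the `ProvenanceRecord` class, its field names, and the `record_from_activity` helper are hypothetical and chosen here to mirror the attribute names in the text; the three fields that are filled in only after validation start out empty.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass
class ProvenanceRecord:
    # Attributes recorded during the collection phase.
    record_id: str
    timestamp: datetime
    username: str
    filename: str
    affected_users: List[str]
    action: str
    # Attributes filled in later, after the auditor queries the blockchain.
    tx_hash: Optional[str] = None
    block_hash: Optional[str] = None
    validated: Optional[bool] = None


def record_from_activity(record_id: str, activity: Dict[str, str]) -> ProvenanceRecord:
    """Build a provenance record from one ownCloud-style activity entry."""
    return ProvenanceRecord(
        record_id=record_id,
        timestamp=datetime.fromtimestamp(int(activity["timestamp"]), tz=timezone.utc),
        username=activity["user"],
        filename=activity["object_name"],
        affected_users=[activity["affecteduser"]],
        action=activity["type"],
    )


activity = {
    "type": "file_changed",
    "affecteduser": "test",
    "user": "test",
    "timestamp": "1475679929",
    "object_name": "66.txt",
}
record = record_from_activity("rec-1", activity)  # "rec-1" is a made-up ID
```

In a sharing operation, `affected_users` would instead list every user in the sharing group, as described for the AffectedUser attribute.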
For privacy consideration, ProvChain hashes user name. In that case, the provenance auditor cannot know which user each provenance data belongs to. Only the service provider can relate each user with the hashed user name since the provider keeps a list of user names. ProvChain also keeps the provenance data in a local provenance database for further update and validation. For publishing data records to blockchain network, we adopt Chainpoint standard [27]. Chainpoint proposes a scalable protocol for publishing data records on the blockchain and generating blockchain receipts. According to Chainpoint 2.0, data records are hashed so that each Merkle tree can host a number of records, as is shown in Figure 4. The target hash of the specific record and the path to the Merkle root constitute the Merkle proof of the provenance data. The Merkle root for each Merkle tree is related to one transaction in the blockchain network.
B. Provenance Data Validation
To validate the data records published in the blockchain network, the provenance auditor requests the blockchain receipt via the Tierion API. The blockchain receipt contains information about the blockchain transaction and the Merkle proof used to validate the transaction. Figure 4 is a sample blockchain receipt. We reconstruct the Merkle tree from the blockchain receipt. Each provenance record is stored together with other records in the blockchain network as a transaction, which is accessible in a blockchain block explorer [28]. Since the transaction attribute height represents the block index, we can locate the exact block information as well. Both the transaction and the block information are shown in Figure 5. The provenance auditor uses Algorithm 1 to validate the blockchain receipt. The inputs of the algorithm are the proof, merkleRoot and targetHash from the blockchain receipt, and the output is a validation result. If true is returned, the data record is validated, on the basis that the transaction and the block are authentic. If false is returned, the block has been tampered with and the data record is forged. Note that all hashes are handled in binary format. The anchors field in the receipt indicates how the data record is anchored.
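The receipt check can be illustrated with a short Python sketch. It takes the same three inputs named in the text (proof, merkleRoot, targetHash) and folds the target hash through the proof path, working on binary hashes. This is not a reproduction of Algorithm 1; it assumes SHA-256 and a proof encoded as a list of `{"left": hex}` or `{"right": hex}` sibling entries, and the function name `validate_receipt` is hypothetical. The sample receipt is built locally from two made-up leaves, not fetched from any API.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def validate_receipt(proof, merkle_root, target_hash):
    """Return True if target_hash, folded through proof, equals merkle_root.

    All three inputs are hex strings (or lists of hex-valued dicts) as they
    would appear in a receipt; hashing is done on the decoded binary values.
    """
    current = bytes.fromhex(target_hash)
    for step in proof:
        if "left" in step:
            # Sibling sits to the left of the running hash.
            current = sha256(bytes.fromhex(step["left"]) + current)
        elif "right" in step:
            # Sibling sits to the right of the running hash.
            current = sha256(current + bytes.fromhex(step["right"]))
        else:
            return False  # malformed proof entry
    return current == bytes.fromhex(merkle_root)


# Build a two-leaf tree locally so the example is self-contained.
leaf_a = sha256(b"record A")
leaf_b = sha256(b"record B")
root = sha256(leaf_a + leaf_b)

receipt = {
    "targetHash": leaf_a.hex(),
    "merkleRoot": root.hex(),
    "proof": [{"right": leaf_b.hex()}],
}

is_valid = validate_receipt(
    receipt["proof"], receipt["merkleRoot"], receipt["targetHash"]
)
is_valid_after_tampering = validate_receipt(
    receipt["proof"], receipt["merkleRoot"], sha256(b"record A, altered").hex()
)
```

A full validation would additionally confirm that the Merkle root appears in the anchored transaction and that the transaction is contained in the block at the stated height; this sketch covers only the Merkle proof step.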
After validating the blockchain receipt, the provenance auditor updates the data record in the provenance database by filling in the remaining attributes: the transaction hash, the block hash and the validation result. If the validation result is true, the provenance auditor can be sure that the provenance data is authentic. If the result is false, the provenance auditor reports to the service provider that tampering has occurred.
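A minimal sketch of this update step is shown below, using Python's built-in `sqlite3` module as a stand-in for the provenance database. The table layout, column names, and the function `finalize_record` are assumptions made for illustration; the paper does not specify its database schema, and the hash values here are placeholders.

```python
import sqlite3


def open_provenance_db():
    """Create an in-memory database with a hypothetical provenance table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE provenance (
            record_id  TEXT PRIMARY KEY,
            payload    TEXT NOT NULL,
            tx_hash    TEXT,      -- filled in after validation
            block_hash TEXT,      -- filled in after validation
            valid      INTEGER    -- 1 = authentic, 0 = tampering detected
        )
        """
    )
    return conn


def finalize_record(conn, record_id, tx_hash, block_hash, valid):
    """Fill in the remaining attributes; return True if a row was updated."""
    cur = conn.execute(
        "UPDATE provenance SET tx_hash = ?, block_hash = ?, valid = ? "
        "WHERE record_id = ?",
        (tx_hash, block_hash, int(valid), record_id),
    )
    conn.commit()
    return cur.rowcount == 1


conn = open_provenance_db()
conn.execute(
    "INSERT INTO provenance (record_id, payload) VALUES (?, ?)",
    ("rec-1", '{"type": "file_changed"}'),
)
updated = finalize_record(
    conn, "rec-1", tx_hash="aa" * 32, block_hash="bb" * 32, valid=True
)
row = conn.execute(
    "SELECT tx_hash, block_hash, valid FROM provenance WHERE record_id = ?",
    ("rec-1",),
).fetchone()
```

Records whose `valid` column ends up as 0 are the ones the auditor would report to the service provider.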
V. EVALUATION
A. Summary of ProvChain’s capabilities
Before presenting the performance evaluation of ProvChain, we summarize its capabilities.
• ProvChain provides real-time auditing of all data access in the cloud storage application. We use the file as the data unit, and all operations on cloud data objects are audited and recorded using the blockchain. In this way, evidence of all cloud data access events can be collected and monitored.
• For each access record, we transform the provenance data and upload the record to the blockchain network. By doing so, we create an unalterable fingerprint of file operations, with secure and permanent record keeping as well as a tamper-proof timestamp. Any change to the blockchain will be detected when the blockchain receipt is validated. Once a data record is published, no one can maliciously rewrite or alter it without being exposed.
• By using a blockchain network, we reduce the need for trust. There is no need to trust the owners of the remote computers that make up the blockchain network, which removes the requirement for a trusted third party. Even the cloud service provider does not need to be trusted to keep the provenance data records. With decentralization, data records are confirmed and validated by continual cross-checking among computing nodes. The decentralized approach also ensures the integrity of data records: every node in the blockchain network holds a copy of each data record, which helps the system resist DDoS attacks. For the same reason, there is no single point of failure, since the data records do not depend on any single machine.
• Users can subscribe to the data provenance service while preserving their privacy. User access records are anonymized in the blockchain network, so the provenance auditor cannot learn user activities. Anonymity is preserved in two respects. On the one hand, user identity will not be linked to provenance data entries, since the user ID is hashed. On the other hand, unlinkability between users is also achieved, especially for the provenance of shared data.
B. Performance and Overhead
For provenance collection, we use Apache JMeter [29] to assess the performance of the provenance-enabled ownCloud application. We use the file create operation as the use case for our performance evaluation; the evaluation of other file operations follows the same procedure. We perform file creation with random file names and file contents for 500 repetitions in JMeter [30]. File sizes range from 1 KB to 2 MB. Figure 6 shows the average response time of both provenance-enabled ownCloud and ownCloud without provenance. The provenance service adds an average overhead of 6.49% to the response time of the original ownCloud application, which is acceptable considering the security features it provides. Moreover, the relative overhead generally decreases as the file size increases: the larger the file, the more time is spent transmitting the file itself and the smaller the share of time spent on the provenance service.
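The overhead figure can be read as a relative increase in mean response time. The paper does not spell out its formula, so the sketch below assumes overhead = (provenance time - baseline time) / baseline time, averaged over file sizes. The function names are hypothetical, and the timings are made-up illustrative values, not measurements from the paper.

```python
def overhead_percent(baseline_ms, provenance_ms):
    """Relative response-time overhead of the provenance-enabled system."""
    if baseline_ms <= 0:
        raise ValueError("baseline response time must be positive")
    return (provenance_ms - baseline_ms) / baseline_ms * 100.0


def average_overhead(pairs):
    """Mean of the per-file-size overheads, in percent."""
    values = [overhead_percent(base, prov) for base, prov in pairs]
    return sum(values) / len(values)


# Illustrative only: (baseline_ms, provenance_ms) for growing file sizes,
# with a fixed 10 ms provenance cost added to each request.
example_timings = [(100.0, 110.0), (200.0, 210.0), (400.0, 410.0)]

per_size = [overhead_percent(b, p) for b, p in example_timings]
mean_overhead = average_overhead(example_timings)
```

With a roughly constant provenance cost per request, the per-size overhead shrinks as the baseline transfer time grows, which is the trend the paragraph above describes.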
Figure 7 shows the throughput of both the original ownCloud (Figure 7(a)) and the provenance-enabled ownCloud (Figure 7(b)). We choose 64 KB as the file size and assess performance in a setup where a single server is responsible for the provenance service, rather than in a production environment, which would comprise a web server and services for load balancing and network flow optimization. The results show that both systems receive the same amount of traffic, but differ in the amount of traffic sent. The provenance-enabled ownCloud has a comparable transaction rate, as depicted in Figure 8. Overall, the transaction time distribution is considered acceptable, as shown in Figure 9. Further evaluations can be conducted with varying file types, operations and file sharing status. For provenance retrieval, we focus on the efficiency of requesting the blockchain receipt for each provenance data entry. In our experiment, we query 10 records at a time with a total size of 1.004 KB, which takes 221 ms on average. For each retrieval of a blockchain receipt, we record the retrieval time for different file operations. The performance test for provenance data storage is conducted in a similar way. Table II lists the provenance retrieval overhead, from which we conclude that our retrieval method imposes a low overhead on the cloud storage system.
VI. CONCLUSIONS AND FUTURE WORK
In this paper, we present the design and implementation of ProvChain, a blockchain-based data provenance system for cloud auditing that preserves user privacy and increases availability. Using blockchain technology, we give each record an unalterable timestamp and generate a blockchain receipt for each data record for validation. Building on the current work, the system can be extended to various use cases where globally verifiable proof is needed. Instead of using the file as the data unit, we can also use other granularities, such as data chunks in cloud storage. Our evaluation shows that the provenance-enabled ownCloud incurs a low overhead. As for the rewards of blockchain miners, cloud users may have to pay a fee for the cloud service provider to enable data provenance services. The service provider can then pay the blockchain network. In this way, miners are paid for continuously mining blocks and validating block authenticity. The fee can be set according to each user's level of data usage. Currently we collect provenance data within one cloud service provider and one cloud application. For future work, we plan to develop ProvChain for federated cloud providers. Cloud storage applications on federated cloud providers will need to address interoperability and cross-provider data sharing and management. We will collect data provenance across different cloud providers and different cloud storage applications to provide better provenance services and enhance data security. For provenance validation, we currently use the Tierion API to validate the blockchain receipt. In future work, we will implement the validation on top of an open-source architecture, which will improve not only overall performance but also security and flexibility. We will also use the collected provenance data to check for access control violations [31], which will in turn provide better protection for the cloud storage application.